SPIKE-RL: Video-LLMs meet Bayesian Surprise

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: Video LLMs, Video reasoning, Bayesian Surprise, Belief tracking
Abstract:

Real-world videos often show routine activities punctuated by memorable, surprising events. However, most Video-LLMs process videos by sampling frames uniformly, likely missing the critical moments that define a video's narrative. We introduce SPIKE, an inference-time framework that quantifies Bayesian Surprise as the belief update triggered by new visual evidence in the video stream, identifying moments where that evidence conflicts with prior beliefs. SPIKE effectively localizes surprise in videos, correlating with human judgments on positive (FunQA) and negative (Oops!) surprise benchmarks. SPIKE-RL further improves on SPIKE's ability to detect surprise, leveraging GRPO to refine its belief hypotheses based on a reward signal derived from the video caption. SPIKE and SPIKE-RL guide query-agnostic, surprise-weighted frame sampling, which allocates more frames to the interesting moments in a video. With this strategy, we achieve consistent performance gains on five downstream benchmarks. By enabling Video-LLMs to track beliefs and register surprise, our work paves the way for more robust models that can revise their understanding in response to new information.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces SPIKE, a framework that quantifies Bayesian surprise in video streams to identify moments where new visual evidence conflicts with prior beliefs, and SPIKE-RL, which refines belief hypotheses using reinforcement learning. Within the taxonomy, this work resides in the 'Surprise-Based Event and Saliency Detection' leaf under 'Video Anomaly and Event Detection', alongside seven sibling papers. This leaf represents a moderately populated research direction focused on using explicit surprise metrics for event identification, distinguishing it from broader anomaly detection approaches that rely on topic models or Gaussian processes without surprise-based formulations.

The taxonomy reveals that neighboring leaves include 'Bayesian Nonparametric and Topic Models for Anomaly Detection' (five papers using Dirichlet processes) and 'Deep Generative Models for Anomaly Detection' (three papers combining VAEs with Bayesian methods). The paper's approach diverges from these by emphasizing inference-time surprise computation rather than offline model training, and by integrating reinforcement learning for belief optimization. Its connection to 'Core Surprise Models and Attention Mechanisms' (three foundational papers) suggests it builds on established surprise theory while extending it to modern Video-LLM architectures, bridging classical Bayesian frameworks with contemporary deep learning systems.

Among twenty candidates examined across three contributions, the SPIKE framework shows one refutable candidate out of ten examined, indicating some prior work on Bayesian surprise quantification exists within the limited search scope. SPIKE-RL, however, encountered zero refutable candidates among ten examined, suggesting its reinforcement learning approach for belief refinement may represent a less-explored direction. The surprise-weighted frame sampling strategy was not evaluated against prior work (zero candidates examined), leaving its novelty assessment incomplete. These statistics reflect a focused semantic search rather than exhaustive coverage, so additional related work may exist beyond the top-twenty matches.

Based on the limited search scope of twenty semantically similar papers, the work appears to occupy a moderately novel position. The core surprise detection mechanism has some precedent, but the integration with reinforcement learning and application to Video-LLM frame sampling shows less overlap with examined candidates. The taxonomy structure indicates this is an active but not overcrowded research area, with the sibling leaf containing eight papers total. A more comprehensive literature review would be needed to assess whether the specific combination of Bayesian surprise, RL-based belief optimization, and LLM-guided sampling has been explored elsewhere.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 20
Refutable Paper: 1

Research Landscape Overview

Core task: Bayesian surprise detection in videos. The field centers on identifying unexpected or anomalous events in video streams by quantifying deviations from learned probabilistic models of normal behavior. The taxonomy reveals five main branches that reflect different emphases:

- Bayesian Surprise Theory and Computational Frameworks develops the mathematical foundations and inference algorithms (e.g., Hierarchical Gaussian Process[1], Variational Bayesian Inference[11]);
- Video Anomaly and Event Detection applies these principles to surveillance and event recognition (e.g., Contextual Video Surveillance[6], Unusual Events Detection[42]);
- Multimodal and Spatiotemporal Bayesian Modeling extends surprise measures across sensory modalities and temporal scales (e.g., Audio-Visual Attention Analysis[18], Kalman Variational Autoencoder[15]);
- Robotics and Autonomous Systems Applications leverages surprise for navigation and decision-making (e.g., Autonomous Vehicle Surprise[28], Landmark Bayesian Surprise[41]);
- Specialized Application Domains targets niche settings such as healthcare monitoring (Fall Detection Hospitals[44]) and cognitive modeling (Infant Visual Attention[19]).

Together, these branches illustrate a progression from theoretical constructs to diverse real-world deployments. A particularly active line of work explores how surprise-based saliency and attention mechanisms can guide both bottom-up perceptual processing (Bottom-Up Visual Surprise[49]) and top-down event segmentation (Bayesian Topic Events[39]). Trade-offs emerge between computational efficiency, which favors lightweight neuromorphic implementations (Neuromorphic Bayesian Surprise[26]), and representational richness in deep generative models (Multilevel Variational Autoencoders[3]). SPIKE-RL[0] sits within the Surprise-Based Event and Saliency Detection cluster, emphasizing reinforcement learning integration for dynamic video analysis. Its approach contrasts with purely unsupervised anomaly detectors like Simultaneous Localization Anomaly[5], which focus on spatial consistency, and with classical information-theoretic methods such as Information Divergence Saliency[24], which lack adaptive learning. By combining Bayesian surprise with RL, SPIKE-RL[0] bridges perceptual novelty detection and goal-directed behavior, positioning itself at the intersection of event detection and autonomous decision-making.

Claimed Contributions

SPIKE inference-time framework for Bayesian Surprise quantification

SPIKE is a framework that represents a Video-LLM's beliefs as explicit probability distributions over textual hypotheses and measures surprise as the KL divergence between prior and posterior beliefs when new frames are observed. This enables the model to identify moments where visual evidence conflicts with expectations.

Retrieved papers: 10 (verdict: Can Refute)
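SPIKE's surprise score can be illustrated with a small numerical sketch. This is not the paper's implementation (SPIKE elicits hypothesis probabilities from a Video-LLM); the belief values below are made up purely to show how the KL divergence between posterior and prior beliefs spikes when new evidence overturns the leading hypothesis:

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same hypotheses."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

# Hypothetical belief state: probabilities a Video-LLM might assign to three
# textual hypotheses about where the video is heading.
prior     = [0.70, 0.25, 0.05]  # before the new frame arrives
posterior = [0.10, 0.15, 0.75]  # after evidence contradicts hypothesis 1

# Bayesian surprise: how far the posterior moved away from the prior
surprise = kl_divergence(posterior, prior)
print(round(surprise, 3))
```

A frame that merely confirms the leading hypothesis leaves the distribution nearly unchanged and yields a score near zero, so ranking frames by this quantity localizes the unexpected moments.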
SPIKE-RL reinforcement learning method for belief optimization

SPIKE-RL uses Group Relative Policy Optimization (GRPO) to train the hypothesis generator by propagating rewards from final caption quality back to intermediate belief hypotheses. This improves both the diversity of generated beliefs and the accuracy of surprise localization beyond the inference-time scorer alone.

Retrieved papers: 10
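GRPO's core mechanic, scoring each sampled rollout against the rest of its group rather than against a learned critic, can be sketched as follows. The function name and reward values are illustrative, not taken from the paper; here the rewards stand in for the caption-quality scores of several belief-hypothesis rollouts for the same video:

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: normalize each rollout's reward by the
    mean and standard deviation of its sampling group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical caption-quality rewards for 4 hypothesis rollouts on one video.
rewards = [0.82, 0.41, 0.67, 0.30]
advs = group_relative_advantages(rewards)
# Rollouts scoring above the group mean get positive advantage (reinforced);
# below-average rollouts get negative advantage (suppressed).
```

Because the advantages are centered within each group, the policy update pushes the hypothesis generator toward belief sets that lead to better final captions without needing a separate value model.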
Surprise-weighted frame sampling strategy for Video-LLMs

The authors propose replacing uniform frame sampling in Video-LLMs with a surprise-weighted sampling strategy that allocates the frame budget proportionally to computed surprise scores. This query-agnostic approach consistently improves performance on downstream video understanding tasks.

Retrieved papers: 0
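The allocation step behind surprise-weighted sampling can be sketched in a few lines. This is a minimal illustration under assumed inputs (per-segment surprise scores and a total frame budget), not the paper's exact procedure; it uses largest-remainder rounding so the allocated counts sum exactly to the budget:

```python
def surprise_weighted_sample(surprise_scores, budget):
    """Split a frame budget across video segments in proportion to their
    surprise scores, falling back to uniform sampling if all scores are 0."""
    n = len(surprise_scores)
    total = sum(surprise_scores)
    if total == 0:
        counts = [budget // n] * n
        for i in range(budget % n):
            counts[i] += 1
        return counts
    quotas = [s / total * budget for s in surprise_scores]
    counts = [int(q) for q in quotas]  # floor of each proportional share
    # hand out the leftover frames by largest fractional remainder
    order = sorted(range(n), key=lambda i: quotas[i] - counts[i], reverse=True)
    for i in order[: budget - sum(counts)]:
        counts[i] += 1
    return counts

# e.g. a 16-frame budget over 4 segments: the high-surprise segment
# receives most of the frames, low-surprise segments receive few or none.
print(surprise_weighted_sample([0.1, 2.0, 0.4, 0.5], 16))
```

Compared with uniform sampling, the same frame budget is concentrated on the segments where beliefs changed most, which is what drives the reported downstream gains.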

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

SPIKE inference-time framework for Bayesian Surprise quantification

Contribution

SPIKE-RL reinforcement learning method for belief optimization

Contribution

Surprise-weighted frame sampling strategy for Video-LLMs