SPIKE-RL: Video-LLMs meet Bayesian Surprise

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: Video LLMs, Video reasoning, Bayesian Surprise, Belief tracking
Abstract:

Real-world videos often show routine activities punctuated by memorable, surprising events. However, most Video-LLMs process videos by sampling frames uniformly, likely missing the critical moments that define a video's narrative. We introduce SPIKE, an inference-time framework that quantifies Bayesian Surprise as the belief update triggered by new visual evidence in the video stream, identifying moments where that evidence conflicts with prior beliefs. SPIKE effectively localizes surprise in videos, correlating with human judgments on positive (FunQA) and negative (Oops!) surprise benchmarks. SPIKE-RL further improves on SPIKE's ability to detect surprise, leveraging GRPO to refine its belief hypotheses based on a reward signal derived from the video caption. SPIKE and SPIKE-RL guide query-agnostic, surprise-weighted frame sampling, which allocates more frames to the interesting moments in a video. With this strategy, we achieve consistent performance gains on five downstream benchmarks. By enabling Video-LLMs to track beliefs and register surprise, our work paves the way for more robust models that can revise their understanding in response to new information.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces SPIKE, a framework that quantifies Bayesian surprise in video streams to identify moments where new visual evidence conflicts with prior beliefs, and SPIKE-RL, which refines belief hypotheses using reinforcement learning. Within the taxonomy, this work resides in the 'Surprise-Based Event and Saliency Detection' leaf under 'Video Anomaly and Event Detection', alongside seven sibling papers. This leaf represents a moderately populated research direction focused on using explicit surprise metrics for event identification, distinguishing it from broader anomaly detection approaches that rely on topic models or Gaussian processes without surprise-based formulations.

The taxonomy reveals that neighboring leaves include 'Bayesian Nonparametric and Topic Models for Anomaly Detection' (five papers using Dirichlet processes) and 'Deep Generative Models for Anomaly Detection' (three papers combining VAEs with Bayesian methods). The paper's approach diverges from these by emphasizing inference-time surprise computation rather than offline model training, and by integrating reinforcement learning for belief optimization. Its connection to 'Core Surprise Models and Attention Mechanisms' (three foundational papers) suggests it builds on established surprise theory while extending it to modern Video-LLM architectures, bridging classical Bayesian frameworks with contemporary deep learning systems.

Among twenty candidates examined across three contributions, the SPIKE framework shows one refutable candidate out of ten examined, indicating some prior work on Bayesian surprise quantification exists within the limited search scope. SPIKE-RL, however, encountered zero refutable candidates among ten examined, suggesting its reinforcement learning approach for belief refinement may represent a less-explored direction. The surprise-weighted frame sampling strategy was not evaluated against prior work (zero candidates examined), leaving its novelty assessment incomplete. These statistics reflect a focused semantic search rather than exhaustive coverage, so additional related work may exist beyond the top-twenty matches.

Based on the limited search scope of twenty semantically similar papers, the work appears to occupy a moderately novel position. The core surprise detection mechanism has some precedent, but the integration with reinforcement learning and application to Video-LLM frame sampling shows less overlap with examined candidates. The taxonomy structure indicates this is an active but not overcrowded research area, with the sibling leaf containing eight papers total. A more comprehensive literature review would be needed to assess whether the specific combination of Bayesian surprise, RL-based belief optimization, and LLM-guided sampling has been explored elsewhere.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 20
Refutable Paper: 1

Research Landscape Overview

Core task: Bayesian surprise detection in videos. The field centers on identifying unexpected or anomalous events in video streams by quantifying deviations from learned probabilistic models of normal behavior. The taxonomy reveals five main branches that reflect different emphases:

- Bayesian Surprise Theory and Computational Frameworks develops the mathematical foundations and inference algorithms (e.g., Hierarchical Gaussian Process[1], Variational Bayesian Inference[11]);
- Video Anomaly and Event Detection applies these principles to surveillance and event recognition (e.g., Contextual Video Surveillance[6], Unusual Events Detection[42]);
- Multimodal and Spatiotemporal Bayesian Modeling extends surprise measures across sensory modalities and temporal scales (e.g., Audio-Visual Attention Analysis[18], Kalman Variational Autoencoder[15]);
- Robotics and Autonomous Systems Applications leverages surprise for navigation and decision-making (e.g., Autonomous Vehicle Surprise[28], Landmark Bayesian Surprise[41]);
- Specialized Application Domains targets niche settings such as healthcare monitoring (Fall Detection Hospitals[44]) and cognitive modeling (Infant Visual Attention[19]).

Together, these branches illustrate a progression from theoretical constructs to diverse real-world deployments. A particularly active line of work explores how surprise-based saliency and attention mechanisms can guide both bottom-up perceptual processing (Bottom-Up Visual Surprise[49]) and top-down event segmentation (Bayesian Topic Events[39]). Trade-offs emerge between computational efficiency, which favors lightweight neuromorphic implementations (Neuromorphic Bayesian Surprise[26]), and representational richness in deep generative models (Multilevel Variational Autoencoders[3]). SPIKE-RL[0] sits within the Surprise-Based Event and Saliency Detection cluster, emphasizing reinforcement learning integration for dynamic video analysis. Its approach contrasts with purely unsupervised anomaly detectors like Simultaneous Localization Anomaly[5], which focus on spatial consistency, and with classical information-theoretic methods such as Information Divergence Saliency[24], which lack adaptive learning. By combining Bayesian surprise with RL, SPIKE-RL[0] bridges perceptual novelty detection and goal-directed behavior, positioning itself at the intersection of event detection and autonomous decision-making.

Claimed Contributions

SPIKE inference-time framework for Bayesian Surprise quantification

SPIKE is a framework that represents a Video-LLM's beliefs as explicit probability distributions over textual hypotheses and measures surprise as the KL divergence between prior and posterior beliefs when new frames are observed. This enables the model to identify moments where visual evidence conflicts with expectations.

Retrieved papers: 10 (verdict: Can Refute)
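SPIKE's surprise score can be illustrated with a small numerical sketch. This is not the paper's implementation (SPIKE elicits hypothesis probabilities from a Video-LLM); the belief values below are made up purely to show how the KL divergence between posterior and prior beliefs spikes when new evidence overturns the leading hypothesis:

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same hypotheses."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

# Hypothetical belief state: probabilities a Video-LLM might assign to three
# textual hypotheses about where the video is heading.
prior     = [0.70, 0.25, 0.05]  # before the new frame arrives
posterior = [0.10, 0.15, 0.75]  # after evidence contradicts hypothesis 1

# Bayesian surprise: how far the posterior moved away from the prior
surprise = kl_divergence(posterior, prior)
print(round(surprise, 3))
```

A frame that merely confirms the leading hypothesis leaves the distribution nearly unchanged and yields a score near zero, so ranking frames by this quantity localizes the unexpected moments.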
SPIKE-RL reinforcement learning method for belief optimization

SPIKE-RL uses Group Relative Policy Optimization (GRPO) to train the hypothesis generator by propagating rewards from final caption quality back to intermediate belief hypotheses. This improves both the diversity of generated beliefs and the accuracy of surprise localization beyond the inference-time scorer alone.

Retrieved papers: 10
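GRPO's core mechanic, scoring each sampled rollout against the rest of its group rather than against a learned critic, can be sketched as follows. The function name and reward values are illustrative, not taken from the paper; here the rewards stand in for the caption-quality scores of several belief-hypothesis rollouts for the same video:

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: normalize each rollout's reward by the
    mean and standard deviation of its sampling group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical caption-quality rewards for 4 hypothesis rollouts on one video.
rewards = [0.82, 0.41, 0.67, 0.30]
advs = group_relative_advantages(rewards)
# Rollouts scoring above the group mean get positive advantage (reinforced);
# below-average rollouts get negative advantage (suppressed).
```

Because the advantages are centered within each group, the policy update pushes the hypothesis generator toward belief sets that lead to better final captions without needing a separate value model.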
Surprise-weighted frame sampling strategy for Video-LLMs

The authors propose replacing uniform frame sampling in Video-LLMs with a surprise-weighted sampling strategy that allocates the frame budget proportionally to computed surprise scores. This query-agnostic approach consistently improves performance on downstream video understanding tasks.

Retrieved papers: 0
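The allocation step behind surprise-weighted sampling can be sketched in a few lines. This is a minimal illustration under assumed inputs (per-segment surprise scores and a total frame budget), not the paper's exact procedure; it uses largest-remainder rounding so the allocated counts sum exactly to the budget:

```python
def surprise_weighted_sample(surprise_scores, budget):
    """Split a frame budget across video segments in proportion to their
    surprise scores, falling back to uniform sampling if all scores are 0."""
    n = len(surprise_scores)
    total = sum(surprise_scores)
    if total == 0:
        counts = [budget // n] * n
        for i in range(budget % n):
            counts[i] += 1
        return counts
    quotas = [s / total * budget for s in surprise_scores]
    counts = [int(q) for q in quotas]  # floor of each proportional share
    # hand out the leftover frames by largest fractional remainder
    order = sorted(range(n), key=lambda i: quotas[i] - counts[i], reverse=True)
    for i in order[: budget - sum(counts)]:
        counts[i] += 1
    return counts

# e.g. a 16-frame budget over 4 segments: the high-surprise segment
# receives most of the frames, low-surprise segments receive few or none.
print(surprise_weighted_sample([0.1, 2.0, 0.4, 0.5], 16))
```

Compared with uniform sampling, the same frame budget is concentrated on the segments where beliefs changed most, which is what drives the reported downstream gains.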

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

SPIKE inference-time framework for Bayesian Surprise quantification

Contribution

SPIKE-RL reinforcement learning method for belief optimization

Contribution

Surprise-weighted frame sampling strategy for Video-LLMs