Segment Any Events with Language

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: event sensor, event-based scene understanding, open-vocabulary
Abstract:

Scene understanding with free-form language has been widely explored across diverse modalities such as images, point clouds, and LiDAR. However, related studies on event sensors are scarce or narrowly centered on semantic-level understanding. We introduce SEAL, the first Semantic-aware Segment Any Events framework, which addresses Open-Vocabulary Event Instance Segmentation (OV-EIS). Given a visual prompt, SEAL provides a unified framework that supports both event segmentation and open-vocabulary mask classification at multiple levels of granularity, including instance-level and part-level. To enable thorough evaluation of OV-EIS, we curate four benchmarks that span label granularity from coarse to fine class configurations and semantic granularity from instance-level to part-level understanding. Extensive experiments show that SEAL substantially outperforms the proposed baselines in both accuracy and inference speed with a parameter-efficient architecture. In the Appendix, we further present a simple variant of SEAL that achieves generic spatiotemporal OV-EIS without requiring any visual prompts from users at inference time. The code will be publicly available.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes a paper's claimed tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces SEAL, a framework for open-vocabulary event instance segmentation that supports both instance-level and part-level mask classification using visual prompts and language guidance. Within the taxonomy, it resides in the 'Event Stream Open-Vocabulary Instance Segmentation' leaf under 'Event-Based Open-Vocabulary Segmentation'. This leaf currently contains only the original paper itself, with no sibling papers identified. The broader event-based branch includes four leaves covering semantic segmentation, object detection, and multimodal understanding, suggesting that instance-level segmentation on event streams is a relatively sparse and emerging research direction compared to more established areas like image or video segmentation.

The taxonomy reveals that neighboring work primarily focuses on event-based semantic segmentation (e.g., cross-modal knowledge transfer from images and text) and adaptive event stream slicing for detection tasks. These directions emphasize semantic-level understanding or object detection rather than instance-level segmentation with open vocabularies. The broader field context shows dense activity in open-vocabulary video instance segmentation (six papers across three leaves) and 3D scene segmentation (six papers across four leaves), indicating that the event-based modality remains less explored. The scope notes clarify that event-based methods are excluded from image, video, and 3D categories, reinforcing the distinct nature of event camera processing.

Among the three contributions analyzed, the literature search examined nineteen candidates total. The SEAL framework itself was compared against five candidates with zero refutable overlaps. The Multimodal Hierarchical Semantic Guidance module was examined against ten candidates, again with no clear refutations. The four benchmarks for evaluation were compared against four candidates, also yielding no refutable prior work. These statistics suggest that within the limited search scope of top-K semantic matches and citation expansion, no prior work directly overlaps with the proposed contributions. However, the small candidate pool (nineteen papers) and the absence of sibling papers in the taxonomy leaf indicate that the search may not have encountered closely related event-based instance segmentation efforts.

Given the limited search scope and the sparse taxonomy leaf, the work appears to occupy a novel position within event-based open-vocabulary segmentation. The absence of refutable candidates across all contributions and the lack of sibling papers suggest that instance-level segmentation on event streams with language guidance is relatively unexplored. However, the small candidate pool and the emerging nature of event-based research mean that the analysis may not capture all relevant prior work, particularly in adjacent areas like video instance segmentation or event-based detection that could share methodological overlap.

Taxonomy

Core-task Taxonomy Papers: 34
Claimed Contributions: 3
Contribution Candidate Papers Compared: 19
Refutable Papers: 0

Research Landscape Overview

Core task: open-vocabulary event instance segmentation with language. This field addresses the challenge of segmenting and recognizing object instances in visual data using flexible, language-driven queries rather than fixed category sets. The taxonomy reveals a diverse landscape organized around input modalities and problem settings. Major branches include Open-Vocabulary Image Instance Segmentation, which tackles static scenes with methods that align vision and language embeddings; Open-Vocabulary Video Instance Segmentation (e.g., OpenVIS[3], CLIP-VIS[8]), which extends these ideas to temporal data; and Open-Vocabulary Instance Segmentation in 3D Scenes (e.g., OpenMask3D[1], OpenIns3D[10]), where spatial reasoning and point-cloud processing become central. Specialized branches cover Audio-Visual Event Localization with Open Vocabularies (e.g., OV-DAVEL[20]) and Language-Guided Video Segmentation, while Domain Adaptation and Specialized Applications address niche scenarios like remote sensing (SegEarth-R2[22]) or camouflaged objects. A small cluster focuses on Event-Based Open-Vocabulary Segmentation, leveraging event cameras for high-temporal-resolution data, and another on Surveys and Training-Free Approaches that explore zero-shot or minimal-supervision regimes. Recent work highlights trade-offs between generalization and computational efficiency, with many studies exploring how to best leverage pretrained vision-language models without extensive retraining.

Within the event-based branch, Segment Any Events[0] stands out by directly processing event streams for instance segmentation, contrasting with video-based methods like OpenVIS[3] or CLIP-VIS[8] that operate on conventional frame sequences. While video approaches often rely on dense temporal sampling and optical flow, event-based methods exploit asynchronous pixel-level changes, offering potential advantages in dynamic or low-latency scenarios. Segment Any Events[0] thus occupies a distinct niche, bridging the gap between traditional open-vocabulary segmentation and the unique characteristics of event cameras, and raising questions about how language grounding can be effectively adapted to sparse, high-speed sensory inputs.
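To make concrete what "asynchronous pixel-level changes" means as an input, the sketch below bins a raw event stream of (x, y, timestamp, polarity) tuples into a dense voxel grid, a common frame-like encoding for downstream networks. The function name, array layout, and normalization here are illustrative assumptions, not the encoding used by SEAL or any specific method discussed above.

```python
import numpy as np

def events_to_voxel_grid(events, num_bins, height, width):
    """Accumulate an asynchronous event stream into a dense voxel grid.

    `events` is an (N, 4) array of [x, y, t, p] rows with polarity
    p in {-1, +1}. This layout is an illustrative assumption.
    """
    grid = np.zeros((num_bins, height, width), dtype=np.float32)
    x = events[:, 0].astype(int)
    y = events[:, 1].astype(int)
    t = events[:, 2]
    p = events[:, 3]
    # Normalize timestamps into [0, num_bins - 1] temporal slices.
    t_norm = (t - t.min()) / max(t.max() - t.min(), 1e-9) * (num_bins - 1)
    b = np.clip(t_norm.astype(int), 0, num_bins - 1)
    # Signed, unbuffered accumulation: each event adds its polarity
    # to the cell at (time bin, row, column).
    np.add.at(grid, (b, y, x), p)
    return grid
```

Unlike dense frames, the event array carries information only where brightness changed, which is why event-based pipelines can react at microsecond latency but must cope with very sparse spatial coverage.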

Claimed Contributions

SEAL framework for Open-Vocabulary Event Instance Segmentation

The authors propose SEAL, a unified framework that performs both event segmentation and open-vocabulary mask classification at multiple granularity levels (instance-level and part-level) using free-form language queries including noun-level and sentence-level expressions.

5 retrieved papers

Multimodal Hierarchical Semantic Guidance module

The authors introduce MHSG, a multimodal learning framework that leverages vision-language foundation models to learn semantic-rich event representations across multiple levels of granularity (part-level, instance-level, semantic-level) without requiring predefined class candidates.

10 retrieved papers

Four benchmarks for evaluating Open-Vocabulary Event Instance Segmentation

The authors curate four evaluation benchmarks (DDD17-Ins, DSEC11-Ins, DSEC19-Ins, DSEC-Part) that cover diverse settings of label granularity (coarse to fine-grained classes) and semantic granularity (instance-level to part-level segmentation) for thorough evaluation of open-vocabulary event instance segmentation.

4 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the retrieved top-K core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, which is a partial signal of novelty, though one constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

SEAL framework for Open-Vocabulary Event Instance Segmentation

The authors propose SEAL, a unified framework that performs both event segmentation and open-vocabulary mask classification at multiple granularity levels (instance-level and part-level) using free-form language queries including noun-level and sentence-level expressions.

Contribution

Multimodal Hierarchical Semantic Guidance module

The authors introduce MHSG, a multimodal learning framework that leverages vision-language foundation models to learn semantic-rich event representations across multiple levels of granularity (part-level, instance-level, semantic-level) without requiring predefined class candidates.

Contribution

Four benchmarks for evaluating Open-Vocabulary Event Instance Segmentation

The authors curate four evaluation benchmarks (DDD17-Ins, DSEC11-Ins, DSEC19-Ins, DSEC-Part) that cover diverse settings of label granularity (coarse to fine-grained classes) and semantic granularity (instance-level to part-level segmentation) for thorough evaluation of open-vocabulary event instance segmentation.