Segment Any Events with Language

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: event sensor, event-based scene understanding, open-vocabulary
Abstract:

Scene understanding with free-form language has been widely explored across diverse modalities such as images, point clouds, and LiDAR. However, related studies on event sensors are scarce or narrowly centered on semantic-level understanding. We introduce SEAL, the first Semantic-aware Segment Any Events framework, which addresses Open-Vocabulary Event Instance Segmentation (OV-EIS). Given a visual prompt, SEAL provides a unified framework that supports both event segmentation and open-vocabulary mask classification at multiple levels of granularity, including instance-level and part-level. To enable thorough evaluation of OV-EIS, we curate four benchmarks that span label granularity from coarse to fine class configurations and semantic granularity from instance-level to part-level understanding. Extensive experiments show that SEAL substantially outperforms the proposed baselines in both accuracy and inference speed with a parameter-efficient architecture. In the Appendix, we further present a simple variant of SEAL that achieves generic spatiotemporal OV-EIS without requiring any visual prompts from users at inference time. The code will be publicly available.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes a paper's claimed tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces SEAL, a framework for open-vocabulary event instance segmentation that supports both instance-level and part-level mask classification using visual prompts and language guidance. Within the taxonomy, it resides in the 'Event Stream Open-Vocabulary Instance Segmentation' leaf under 'Event-Based Open-Vocabulary Segmentation'. This leaf currently contains only the original paper itself, with no sibling papers identified. The broader event-based branch includes four leaves covering semantic segmentation, object detection, and multimodal understanding, suggesting that instance-level segmentation on event streams is a relatively sparse and emerging research direction compared to more established areas like image or video segmentation.

The taxonomy reveals that neighboring work primarily focuses on event-based semantic segmentation (e.g., cross-modal knowledge transfer from images and text) and adaptive event stream slicing for detection tasks. These directions emphasize semantic-level understanding or object detection rather than instance-level segmentation with open vocabularies. The broader field context shows dense activity in open-vocabulary video instance segmentation (six papers across three leaves) and 3D scene segmentation (six papers across four leaves), indicating that the event-based modality remains less explored. The scope notes clarify that event-based methods are excluded from image, video, and 3D categories, reinforcing the distinct nature of event camera processing.

Among the three contributions analyzed, the literature search examined nineteen candidates total. The SEAL framework itself was compared against five candidates with zero refutable overlaps. The Multimodal Hierarchical Semantic Guidance module was examined against ten candidates, again with no clear refutations. The four benchmarks for evaluation were compared against four candidates, also yielding no refutable prior work. These statistics suggest that within the limited search scope of top-K semantic matches and citation expansion, no prior work directly overlaps with the proposed contributions. However, the small candidate pool (nineteen papers) and the absence of sibling papers in the taxonomy leaf indicate that the search may not have encountered closely related event-based instance segmentation efforts.

Given the limited search scope and the sparse taxonomy leaf, the work appears to occupy a novel position within event-based open-vocabulary segmentation. The absence of refutable candidates across all contributions and the lack of sibling papers suggest that instance-level segmentation on event streams with language guidance is relatively unexplored. However, the small candidate pool and the emerging nature of event-based research mean that the analysis may not capture all relevant prior work, particularly in adjacent areas like video instance segmentation or event-based detection that could share methodological overlap.

Taxonomy

Core-task Taxonomy Papers: 34
Claimed Contributions: 3
Contribution Candidate Papers Compared: 19
Refutable Papers: 0

Research Landscape Overview

Core task: open-vocabulary event instance segmentation with language. This field addresses the challenge of segmenting and recognizing object instances in visual data using flexible, language-driven queries rather than fixed category sets. The taxonomy reveals a diverse landscape organized around input modalities and problem settings. Major branches include Open-Vocabulary Image Instance Segmentation, which tackles static scenes with methods that align vision and language embeddings; Open-Vocabulary Video Instance Segmentation (e.g., OpenVIS[3], CLIP-VIS[8]), which extends these ideas to temporal data; and Open-Vocabulary Instance Segmentation in 3D Scenes (e.g., OpenMask3D[1], OpenIns3D[10]), where spatial reasoning and point-cloud processing become central. Specialized branches cover Audio-Visual Event Localization with Open Vocabularies (e.g., OV-DAVEL[20]) and Language-Guided Video Segmentation, while Domain Adaptation and Specialized Applications address niche scenarios like remote sensing (SegEarth-R2[22]) or camouflaged objects. A small cluster focuses on Event-Based Open-Vocabulary Segmentation, leveraging event cameras for high-temporal-resolution data, and another on Surveys and Training-Free Approaches that explore zero-shot or minimal-supervision regimes. Recent work highlights trade-offs between generalization and computational efficiency, with many studies exploring how to best leverage pretrained vision-language models without extensive retraining.

Within the event-based branch, Segment Any Events[0] stands out by directly processing event streams for instance segmentation, contrasting with video-based methods like OpenVIS[3] or CLIP-VIS[8] that operate on conventional frame sequences. While video approaches often rely on dense temporal sampling and optical flow, event-based methods exploit asynchronous pixel-level changes, offering potential advantages in dynamic or low-latency scenarios. Segment Any Events[0] thus occupies a distinct niche, bridging the gap between traditional open-vocabulary segmentation and the unique characteristics of event cameras, and raising questions about how language grounding can be effectively adapted to sparse, high-speed sensory inputs.
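To make concrete what "asynchronous pixel-level changes" means as an input, the sketch below bins a raw event stream of (x, y, timestamp, polarity) tuples into a dense voxel grid, a common frame-like encoding for downstream networks. The function name, array layout, and normalization here are illustrative assumptions, not the encoding used by SEAL or any specific method discussed above.

```python
import numpy as np

def events_to_voxel_grid(events, num_bins, height, width):
    """Accumulate an asynchronous event stream into a dense voxel grid.

    `events` is an (N, 4) array of [x, y, t, p] rows with polarity
    p in {-1, +1}. This layout is an illustrative assumption.
    """
    grid = np.zeros((num_bins, height, width), dtype=np.float32)
    x = events[:, 0].astype(int)
    y = events[:, 1].astype(int)
    t = events[:, 2]
    p = events[:, 3]
    # Normalize timestamps into [0, num_bins - 1] temporal slices.
    t_norm = (t - t.min()) / max(t.max() - t.min(), 1e-9) * (num_bins - 1)
    b = np.clip(t_norm.astype(int), 0, num_bins - 1)
    # Signed, unbuffered accumulation: each event adds its polarity
    # to the cell at (time bin, row, column).
    np.add.at(grid, (b, y, x), p)
    return grid
```

Unlike dense frames, the event array carries information only where brightness changed, which is why event-based pipelines can react at microsecond latency but must cope with very sparse spatial coverage.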

Claimed Contributions

SEAL framework for Open-Vocabulary Event Instance Segmentation

The authors propose SEAL, a unified framework that performs both event segmentation and open-vocabulary mask classification at multiple granularity levels (instance-level and part-level) using free-form language queries including noun-level and sentence-level expressions.

5 retrieved papers

Multimodal Hierarchical Semantic Guidance module

The authors introduce MHSG, a multimodal learning framework that leverages vision-language foundation models to learn semantic-rich event representations across multiple levels of granularity (part-level, instance-level, semantic-level) without requiring predefined class candidates.

10 retrieved papers

Four benchmarks for evaluating Open-Vocabulary Event Instance Segmentation

The authors curate four evaluation benchmarks (DDD17-Ins, DSEC11-Ins, DSEC19-Ins, DSEC-Part) that cover diverse settings of label granularity (coarse to fine-grained classes) and semantic granularity (instance-level to part-level segmentation) for thorough evaluation of open-vocabulary event instance segmentation.

4 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the retrieved top-K core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, which is a partial signal of novelty, though one constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

SEAL framework for Open-Vocabulary Event Instance Segmentation

The authors propose SEAL, a unified framework that performs both event segmentation and open-vocabulary mask classification at multiple granularity levels (instance-level and part-level) using free-form language queries including noun-level and sentence-level expressions.

Contribution

Multimodal Hierarchical Semantic Guidance module

The authors introduce MHSG, a multimodal learning framework that leverages vision-language foundation models to learn semantic-rich event representations across multiple levels of granularity (part-level, instance-level, semantic-level) without requiring predefined class candidates.

Contribution

Four benchmarks for evaluating Open-Vocabulary Event Instance Segmentation

The authors curate four evaluation benchmarks (DDD17-Ins, DSEC11-Ins, DSEC19-Ins, DSEC-Part) that cover diverse settings of label granularity (coarse to fine-grained classes) and semantic granularity (instance-level to part-level segmentation) for thorough evaluation of open-vocabulary event instance segmentation.