When to Ensemble: Identifying Token-Level Points for Stable and Fast LLM Ensembling

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: LLM ensemble, probability-level ensemble, speculative decoding
Abstract:

Ensembling Large Language Models (LLMs) has gained attention as a promising approach to surpass the performance of individual models by leveraging their complementary strengths. In particular, aggregating models’ next-token probability distributions to select the next token has been shown to be effective in various tasks. However, while successful for short-form answers, its application to long-form generation remains underexplored. In this paper, we show that using existing ensemble methods in long-form generation requires a careful choice of ensembling positions, since the standard practice of ensembling at every token often degrades performance. We identify two key factors for determining the ensembling positions: tokenization mismatch across models and consensus in their next-token probability distributions. Based on this, we propose SAFE (Stable And Fast LLM Ensembling), a framework that selectively ensembles by jointly considering these factors. To further improve stability, we apply a probability sharpening strategy when the ensemble distribution becomes overly smooth, enabling the selection of more confident tokens during ensembling. Our experiments on diverse benchmarks, including MATH500 and BBH, demonstrate that SAFE outperforms existing methods in both accuracy and efficiency, with gains achieved even when ensembling fewer than 1% of tokens.
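As background for the probability-level ensembling the abstract refers to, the following is a minimal sketch of fusing next-token distributions by weighted averaging. It is illustrative only, not the paper's implementation, and it assumes the models' vocabularies have already been aligned to a shared index space:

```python
import numpy as np

def ensemble_next_token(distributions, weights=None):
    """Average per-model next-token distributions (aligned to a shared
    vocabulary) and pick the most probable token under the average."""
    probs = np.asarray(distributions, dtype=float)  # shape: (n_models, vocab)
    if weights is None:
        weights = np.full(len(probs), 1.0 / len(probs))
    avg = np.average(probs, axis=0, weights=weights)
    avg /= avg.sum()  # renormalize for numerical safety
    return int(np.argmax(avg)), avg

# Two toy 4-token distributions: model A strongly favors token 1, model B
# narrowly favors token 2, but both place substantial mass on token 1,
# so the averaged distribution selects it.
tok, avg = ensemble_next_token([[0.1, 0.6, 0.2, 0.1],
                                [0.1, 0.4, 0.45, 0.05]])
```

As the paper argues, applying this fusion at every position of a long generation can hurt quality, which motivates choosing *where* to ensemble rather than fusing uniformly.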

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes SAFE, a framework for selective token-level ensembling in long-form LLM generation, addressing when and where to combine multiple models' probability distributions. It resides in the 'Selective and Adaptive Ensembling' leaf under 'Probability Distribution Fusion', which contains only three papers total including this work. This represents a relatively sparse research direction within the broader taxonomy of fifty papers, suggesting the specific problem of position-aware ensembling for long-form generation has received limited prior attention compared to other token-level techniques.

The taxonomy reveals that neighboring leaves address related but distinct challenges: 'Vocabulary Alignment for Heterogeneous Models' tackles tokenization discrepancies across models, while 'Full-Sequence Ensembling' applies uniform aggregation without selective positioning. The parent branch 'Token-Level Ensemble Mechanisms' sits alongside 'Multi-Token Prediction and Speculative Decoding' and 'Token-Level Generation Optimization', indicating the field has explored acceleration and control strategies separately from ensemble fusion. SAFE's focus on consensus-based selection and tokenization mismatch bridges these areas by incorporating distribution characteristics into ensemble decisions.

Among the twenty-eight candidates examined, the contribution identifying two key factors for ensembling positions has one refutable candidate among its eight retrieved papers, while the SAFE framework itself and the probability sharpening strategy were each compared against ten candidates with zero refutations. Because the search scope is limited, these statistics reflect top-K semantic matches rather than exhaustive coverage. The framework and sharpening contributions appear more novel within this bounded search, whereas the factor identification has at least one overlapping prior work among the candidates reviewed.

Based on the limited literature search of twenty-eight candidates, the work appears to occupy a sparsely populated niche within token-level ensembling. The taxonomy structure confirms that selective ensembling for long-form generation has fewer dedicated papers than adjacent areas like multi-token prediction or decoding strategies. However, the analysis cannot rule out relevant work outside the top-K semantic neighborhood or in related branches not captured by the search methodology.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 28
Refutable Papers: 1

Research Landscape Overview

Core task: token-level ensembling for large language model generation.

The field encompasses diverse strategies for combining or refining token-level outputs from large language models, organized into several major branches. Token-Level Ensemble Mechanisms and Architectures explores how to fuse probability distributions or adaptively select among multiple models, including methods like Top-k Union[4] and M-Ped[19] that merge predictions at each generation step. Multi-Token Prediction and Speculative Decoding focuses on predicting multiple tokens simultaneously or using draft models to accelerate inference, as seen in SpecInfer[15] and Multi-token Prediction[11]. Token-Level Generation Optimization and Control addresses fine-grained steering of outputs through feedback mechanisms such as Token Feedback RL[8] or gradient-based adjustments like Gradient Recomposition[10]. Meanwhile, Token-Level Analysis and Evaluation investigates uncertainty quantification and hallucination detection at the token level, exemplified by Token Hallucination Benchmark[7] and Reasoning Uncertainty[28]. Token Representation and Processing examines alternative tokenization schemes and cross-vocabulary techniques, while Specialized Applications and Domains applies token-level methods to areas like sequential recommendation or dialogue systems.

Within the ensemble mechanisms branch, a particularly active line of work centers on selective and adaptive ensembling, where systems dynamically choose which models or distributions to combine based on context or confidence. Token Ensemble[0] falls squarely into this cluster, emphasizing adaptive fusion strategies that adjust ensemble weights or selection criteria at each token position. This contrasts with simpler uniform averaging approaches and aligns closely with methods like EnsemW2S[5], which also adapts ensemble behavior during generation.
Compared to Top-k Union[4], which merges candidate sets from multiple models, Token Ensemble[0] appears to focus more on probability-level fusion with learned or heuristic selection rules. A key open question across these works is how to balance computational overhead against the quality gains from adaptive ensembling, especially when scaling to many models or long sequences. The interplay between token-level uncertainty estimation and ensemble selection remains an area of ongoing exploration, as researchers seek principled ways to determine when and how to combine diverse model outputs.
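For concreteness, candidate-set merging in the spirit of Top-k Union[4] might look like the sketch below. This is an assumption about the general idea (pool each model's top-k token ids, then rescore the union), not the cited method's actual algorithm:

```python
def topk_union(dist_a, dist_b, k=2):
    """Take the union of each model's top-k token ids, then rescore the
    pooled candidates by the sum of the two models' probabilities."""
    top = lambda d: sorted(range(len(d)), key=lambda i: d[i], reverse=True)[:k]
    candidates = set(top(dist_a)) | set(top(dist_b))
    return max(candidates, key=lambda i: dist_a[i] + dist_b[i])

# Model A's top-2 is {0, 1} and model B's top-2 is {1, 2}; rescoring the
# union selects token 1, which both models rank highly.
a = [0.40, 0.35, 0.15, 0.10]
b = [0.05, 0.40, 0.45, 0.10]
choice = topk_union(a, b, k=2)
```

Restricting the ensemble to a pooled candidate set keeps the per-step cost bounded even as the vocabulary grows, which speaks to the overhead-versus-quality trade-off noted above.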

Claimed Contributions

SAFE framework for selective token-level ensembling

The authors introduce SAFE, a generate-verify-ensemble framework that determines optimal token positions for ensembling by considering tokenization mismatch and consensus in next-token probability distributions. This selective approach improves both stability and efficiency in long-form generation.

10 retrieved papers
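One decoding step of such a generate-verify-ensemble loop could be sketched as follows. This is a hypothetical illustration using top-1 agreement as a stand-in consensus rule; SAFE's actual criteria, including tokenization-mismatch detection, are defined in the paper:

```python
import numpy as np

def selective_ensemble_step(dists):
    """One decoding step of a generate-verify-ensemble loop: fuse the
    models' distributions only when they agree on the top-1 token;
    otherwise keep the primary model's own prediction."""
    top1 = [int(np.argmax(d)) for d in dists]
    if len(set(top1)) == 1:                 # consensus -> ensemble here
        return int(np.argmax(np.mean(dists, axis=0))), True
    return top1[0], False                   # disagreement -> skip ensembling

# Toy next-token distributions over a 3-token vocabulary.
agree = [np.array([0.7, 0.2, 0.1]), np.array([0.6, 0.3, 0.1])]
disagree = [np.array([0.7, 0.2, 0.1]), np.array([0.1, 0.8, 0.1])]
```

Skipping the fusion at disagreement points is what allows the reported efficiency gains: most tokens are produced by a single model, and the ensemble is consulted only at the selected positions.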
Probability sharpening strategy for ensemble distributions

The authors propose a probability sharpening method that consolidates diffused probability mass when the ensemble distribution becomes overly smooth due to different tokenization schemes. This enables more confident token selection during ensembling.

10 retrieved papers
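A common way to realize such sharpening is temperature exponentiation; the sketch below is an assumption about the general mechanism (raise probabilities to a power greater than one and renormalize), not the paper's exact rule:

```python
import numpy as np

def sharpen(probs, tau=0.5):
    """Concentrate a diffuse distribution by raising it to the power 1/tau
    (tau < 1) and renormalizing, so the mode gains probability mass."""
    p = np.asarray(probs, dtype=float) ** (1.0 / tau)
    return p / p.sum()

flat = np.array([0.30, 0.25, 0.25, 0.20])   # overly smooth ensemble output
sharp = sharpen(flat, tau=0.5)               # mass shifts toward the mode
```

The mode's identity is unchanged; sharpening only increases its margin, which makes the subsequent token selection more confident when the fused distribution has been flattened by vocabulary differences.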
Identification of two key factors for ensembling positions

The authors identify and formalize two critical factors that determine when ensembling should occur: whether tokenization mismatch introduces OOV-like tokens and whether models exhibit sufficient consensus in their next-token distributions. These factors guide the selective ensembling strategy.

8 retrieved papers
Can Refute
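The consensus factor could be operationalized, for illustration, with top-1 agreement plus a Jensen-Shannon divergence threshold. The names `sufficient_consensus` and `max_jsd` below are hypothetical, not from the paper:

```python
import numpy as np

def jensen_shannon(p, q):
    """Jensen-Shannon divergence between two next-token distributions."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = 0.5 * (p + q)
    kl = lambda a, b: float(np.sum(a[a > 0] * np.log(a[a > 0] / b[a > 0])))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def sufficient_consensus(p, q, max_jsd=0.2):
    """Treat a position as ensemble-worthy when the models share a top-1
    token and their full distributions are close."""
    return int(np.argmax(p)) == int(np.argmax(q)) and jensen_shannon(p, q) <= max_jsd

close = sufficient_consensus([0.6, 0.3, 0.1], [0.5, 0.35, 0.15])
far = sufficient_consensus([0.6, 0.3, 0.1], [0.1, 0.8, 0.1])
```

A distribution-level divergence catches cases where the models agree on the top token but disagree about everything else, which a bare top-1 check would miss.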

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution: SAFE framework for selective token-level ensembling

Contribution: Probability sharpening strategy for ensemble distributions

Contribution: Identification of two key factors for ensembling positions