Gogo: Group-wise granularity-ordered codec for stable and efficient speech generation

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: speech codec, speech language model, speech generation, text-to-speech synthesis
Abstract:

Current speech language models require their core component, the speech codec, to discretize continuous speech signals into tokens that not only capture high-level cues for autoregressive modeling but also preserve sufficient acoustic details for perceptual quality. To address this need, we propose Gogo, a group-wise granularity-ordered codec that quantizes each group of frames into tokens arranged from coarse to fine, where coarse tokens encode high-level abstractions and fine tokens progressively recover low-level details. Building on the granularity-ordering property of Gogo, we introduce GogoSpeech, a two-stage speech language model that performs speech generation by first constructing a coarse speech backbone at an extremely low token rate and then enriching the backbone with fine-grained acoustic details. Considering the inherently non-uniform information distribution in speech signals, we further design a Group Relative Policy Optimization (GRPO)-trained token allocator that adaptively allocates token budgets to groups based on group-wise complexity. Experimental results demonstrate that Gogo delivers state-of-the-art reconstruction performance across most metrics at a token rate of 47. Moreover, evaluations on zero-shot text-to-speech tasks show that GogoSpeech enables efficient generation by adaptively reducing the average token rate, and attains state-of-the-art results in long-form speech generation.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholar search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces Gogo, a group-wise granularity-ordered codec that quantizes speech frames from coarse to fine levels, and GogoSpeech, a two-stage generation model exploiting this hierarchy. According to the taxonomy tree, this work occupies the sole position in the 'Group-wise Coarse-to-Fine Codec with Adaptive Allocation' leaf, with no sibling papers identified. This suggests the specific combination of group-wise tokenization and adaptive budget allocation represents a relatively sparse research direction within the broader granularity-ordered tokenization landscape.

The taxonomy reveals two main branches: granularity-ordered multi-stage tokenization and linguistically-motivated tokenization. Gogo sits firmly in the former, emphasizing progressive acoustic refinement rather than linguistic alignment. Neighboring leaves include syllabic-level tokenization and text-aligned embedding approaches, which incorporate phonetic or transcription structure. The taxonomy narrative mentions related works like Scaling Spoken Language Models and TASTE that explore multi-level hierarchies, indicating the broader theme of hierarchical speech representation is active, though Gogo's group-wise adaptive allocation appears to differentiate it from fixed-hierarchy schemes.

Among the twenty-five candidate papers examined, one of the ten compared against the Gogo codec contribution is refutable, two of the ten compared against GogoSpeech are refutable, and none of the five compared against the GRPO-trained token allocator is. These statistics suggest that, within the limited search scope, the adaptive allocation mechanism appears more distinctive than the core codec or generation-model components. The presence of some overlapping prior work for the codec and generation stages indicates these areas have received more attention, though the search scale is modest and not exhaustive.

Based on the limited literature search of twenty-five candidates, the work appears to occupy a relatively sparse niche combining group-wise tokenization with adaptive allocation. The adaptive allocation component shows fewer overlaps, suggesting potential novelty in this specific mechanism. However, the modest search scope and presence of some refutable candidates for core contributions indicate that a more comprehensive literature review would be necessary to fully assess originality across all claimed contributions.

Taxonomy

- Core-task Taxonomy Papers: 2
- Claimed Contributions: 3
- Contribution Candidate Papers Compared: 25
- Refutable Papers: 3

Research Landscape Overview

Core task: speech generation using group-wise granularity-ordered tokenization. The field centers on designing discrete representations for speech that balance reconstruction quality with modeling efficiency. The taxonomy reveals two main branches: Granularity-Ordered Multi-Stage Speech Tokenization, which organizes tokens hierarchically from coarse to fine levels of detail, and Linguistically-Motivated Speech Tokenization, which incorporates phonetic or linguistic structure into the tokenization process. The former branch emphasizes progressive refinement strategies where initial tokens capture broad acoustic patterns and subsequent stages add finer details, enabling models to generate speech in a staged manner. Representative works like Scaling Spoken Language Models[1] and TASTE[2] explore how multi-level token hierarchies can improve both generation quality and computational tractability.

Within the granularity-ordered approaches, a key theme is how to allocate representational capacity across different levels of detail. Some methods use fixed hierarchies, while others adapt the number of tokens or their granularity based on content complexity. Gogo[0] sits within the Group-wise Coarse-to-Fine Codec with Adaptive Allocation cluster, emphasizing dynamic assignment of tokens to match varying acoustic demands. This contrasts with more uniform multi-stage schemes that apply the same refinement structure across all speech segments. The adaptive allocation strategy addresses a central trade-off: achieving high fidelity without excessive token counts, particularly for diverse or challenging acoustic conditions. By situating itself in this adaptive branch, Gogo[0] engages with ongoing questions about how best to balance efficiency and expressiveness in hierarchical speech representations.

Claimed Contributions

Gogo: group-wise granularity-ordered codec

The authors introduce a novel speech codec called Gogo that processes contiguous frames as groups and generates tokens in a coarse-to-fine order, where coarse tokens capture high-level abstractions and fine tokens progressively recover low-level acoustic details.

10 retrieved papers
Can Refute
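To make the coarse-to-fine claim concrete, the following is a minimal sketch of group-wise granularity-ordered quantization in the style of residual vector quantization. It is NOT the authors' Gogo implementation: the group pooling, the single token per level, and the codebook shapes are all simplifying assumptions for illustration.

```python
import numpy as np

def quantize_group(frames, codebooks):
    """Quantize one group of contiguous frames coarse-to-fine.

    frames:    (T, D) array of frame embeddings for one group.
    codebooks: list of (K, D) arrays, ordered coarse -> fine.

    Simplified sketch: the group is pooled into a single vector and
    each granularity level emits one token; finer levels quantize the
    residual left by the coarser levels, so early tokens carry the
    high-level abstraction and later tokens recover detail.
    """
    residual = frames.mean(axis=0)        # pool the group (assumption)
    tokens = []
    for cb in codebooks:
        # nearest codeword at this granularity level
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        tokens.append(idx)
        residual = residual - cb[idx]     # pass the remainder downward
    return tokens
```

Because each level only sees what the previous levels failed to encode, truncating the token list after any prefix still yields a usable (coarser) reconstruction, which is the property the two-stage generation model exploits.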
GogoSpeech: two-stage speech language model

The authors develop GogoSpeech, a two-stage speech generation framework that first predicts a high-level speech backbone at approximately 14 Hz and then incrementally recovers fine-grained details conditioned on this backbone.

10 retrieved papers
Can Refute
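The two-stage scheme described above can be sketched as a pair of decoding loops. This is a hypothetical skeleton, not the authors' code: `backbone_lm` and `detail_lm` are stand-in callables for the stage-1 and stage-2 models, and the stopping convention (returning `None`) is an assumption.

```python
def two_stage_generate(text, backbone_lm, detail_lm, max_coarse=64):
    """Hypothetical two-stage decoding loop.

    Stage 1: autoregressively emit one coarse backbone token per group
             (a low token rate, ~14 Hz in the paper's setting).
    Stage 2: for each backbone position, emit the fine tokens that
             refine that group, conditioned on the full backbone.

    backbone_lm(text, coarse)   -> next coarse token id, or None to stop.
    detail_lm(text, coarse, i)  -> list of fine token ids for group i.
    """
    coarse = []
    while len(coarse) < max_coarse:
        nxt = backbone_lm(text, coarse)
        if nxt is None:                 # end of backbone
            break
        coarse.append(nxt)

    speech_tokens = []
    for i, c in enumerate(coarse):
        fines = detail_lm(text, coarse, i)
        speech_tokens.append([c, *fines])
    return speech_tokens
```

The design point this illustrates: the expensive autoregressive loop runs only over the short coarse sequence, while fine tokens are filled in per group afterward, which is what makes long-form generation tractable.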
GRPO-trained token allocator for adaptive token allocation

The authors propose a token allocator trained using Group Relative Policy Optimization that dynamically assigns token budgets to speech groups according to their complexity, thereby improving generation efficiency while maintaining quality.

5 retrieved papers
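The GRPO mechanism behind the allocator can be illustrated with its core computation: group-relative advantages, where each sampled action's reward is normalized against the other samples in the same group instead of a learned value baseline. The reward shape below (quality minus a token-cost penalty with weight `lam`) is an assumption for illustration, not the paper's actual reward.

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages as in GRPO: normalize each sampled
    action's reward by the mean and std of its sampling group, so no
    separate value network is needed."""
    mu = statistics.fmean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0   # guard degenerate groups
    return [(r - mu) / sigma for r in rewards]

def budget_reward(quality, n_tokens, lam=0.05):
    """Hypothetical reward for one sampled token budget: reconstruction
    quality minus a cost penalty, so easy groups are pushed toward
    small budgets (lam is an assumed trade-off weight)."""
    return quality - lam * n_tokens
```

Under this scheme, the policy samples several candidate budgets per speech group, scores each with the reward, and reinforces the budgets whose advantage within the group is positive.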

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current TopK core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, which is one partial signal of novelty, though that signal is constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Gogo: group-wise granularity-ordered codec

The authors introduce a novel speech codec called Gogo that processes contiguous frames as groups and generates tokens in a coarse-to-fine order, where coarse tokens capture high-level abstractions and fine tokens progressively recover low-level acoustic details.

Contribution

GogoSpeech: two-stage speech language model

The authors develop GogoSpeech, a two-stage speech generation framework that first predicts a high-level speech backbone at approximately 14 Hz and then incrementally recovers fine-grained details conditioned on this backbone.

Contribution

GRPO-trained token allocator for adaptive token allocation

The authors propose a token allocator trained using Group Relative Policy Optimization that dynamically assigns token budgets to speech groups according to their complexity, thereby improving generation efficiency while maintaining quality.