Gogo: Group-wise granularity-ordered codec for stable and efficient speech generation
Overview
Overall Novelty Assessment
The paper introduces Gogo, a group-wise granularity-ordered codec that quantizes speech frames from coarse to fine, and GogoSpeech, a two-stage generation model that exploits this hierarchy. In the taxonomy tree, the work is the only paper in the 'Group-wise Coarse-to-Fine Codec with Adaptive Allocation' leaf, suggesting that the specific combination of group-wise tokenization and adaptive budget allocation is a relatively sparse research direction within the broader granularity-ordered tokenization landscape.
The taxonomy reveals two main branches: granularity-ordered multi-stage tokenization and linguistically-motivated tokenization. Gogo sits firmly in the former, emphasizing progressive acoustic refinement rather than linguistic alignment. Neighboring leaves include syllabic-level tokenization and text-aligned embedding approaches, which incorporate phonetic or transcription structure. The taxonomy narrative mentions related works like Scaling Spoken Language Models and TASTE that explore multi-level hierarchies, indicating the broader theme of hierarchical speech representation is active, though Gogo's group-wise adaptive allocation appears to differentiate it from fixed-hierarchy schemes.
Of the twenty-five candidates examined, the Gogo codec contribution has one refutable candidate out of ten, GogoSpeech has two out of ten, and the GRPO-trained token allocator has none out of five. These statistics suggest that, within this limited search scope, the adaptive allocation mechanism is more distinctive than the core codec or the generation model. The overlapping prior work found for the codec and generation stages indicates those areas have received more attention, though the search scale is modest and not exhaustive.
Based on this limited search of twenty-five candidates, the work appears to occupy a relatively sparse niche combining group-wise tokenization with adaptive allocation. The adaptive allocation component shows the fewest overlaps, suggesting potential novelty in that specific mechanism. However, the modest search scope and the refutable candidates found for the core contributions indicate that a more comprehensive literature review would be necessary to fully assess originality across all claimed contributions.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors introduce a novel speech codec called Gogo that processes contiguous frames as groups and generates tokens in a coarse-to-fine order, where coarse tokens capture high-level abstractions and fine tokens progressively recover low-level acoustic details.
The authors develop GogoSpeech, a two-stage speech generation framework that first predicts a high-level speech backbone at approximately 14 Hz and then incrementally recovers fine-grained details conditioned on this backbone.
The authors propose a token allocator trained using Group Relative Policy Optimization that dynamically assigns token budgets to speech groups according to their complexity, thereby improving generation efficiency while maintaining quality.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
Contribution Analysis
Detailed comparisons for each claimed contribution
Gogo: group-wise granularity-ordered codec
The authors introduce a novel speech codec called Gogo that processes contiguous frames as groups and generates tokens in a coarse-to-fine order, where coarse tokens capture high-level abstractions and fine tokens progressively recover low-level acoustic details.
[12] Speaking from coarse to fine: Improving neural codec language model via multi-scale speech coding and generation
[18] Moshi: a speech-text foundation model for real-time dialogue
[19] ESC: Efficient Speech Coding with Cross-Scale Residual Vector Quantized Transformers
[20] SNAC: Multi-Scale Neural Audio Codec
[21] Generative Pre-trained Speech Language Model with Efficient Hierarchical Transformer
[22] Vec-tok-vc+: Residual-enhanced robust zero-shot voice conversion with progressive constraints in a dual-mode training strategy
[23] LMCodec: A Low Bitrate Speech Codec with Causal Transformer Models
[24] HALL-E: Hierarchical Neural Codec Language Model for Minute-Long Zero-Shot Text-to-Speech Synthesis
[25] SECodec: Structural Entropy-based Compressive Speech Representation Codec for Speech Language Models
[26] Memory-Efficient Fixed-Length Representation of Synchronous Event Frames for Very-Low-Power Chip Integration
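The claimed codec design shares a common pattern with the multi-scale codecs listed above: pool contiguous frames into a group, quantize a coarse representation first, then quantize the residual left by each preceding level. The following is a minimal NumPy sketch of that general group-wise coarse-to-fine pattern; the function names, group size, and codebook shapes are illustrative assumptions, not the Gogo implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize(x, codebook):
    """Nearest-neighbor quantization of vectors x against a codebook."""
    idx = np.argmin(((x[:, None, :] - codebook[None, :, :]) ** 2).sum(-1), axis=1)
    return codebook[idx], idx

def group_coarse_to_fine(frames, codebooks, group_size=4):
    """Pool contiguous frames into groups, then refine residuals level by level.

    Level 0 quantizes the group mean (coarse); each later level quantizes
    the residual left by the previous levels (fine). Hypothetical sketch,
    not the authors' implementation.
    """
    n, d = frames.shape
    groups = frames[: n - n % group_size].reshape(-1, group_size, d)
    target = groups.mean(axis=1)          # one coarse vector per group
    recon = np.zeros_like(target)
    tokens = []
    for cb in codebooks:                  # coarse -> fine
        q, idx = quantize(target - recon, cb)
        recon = recon + q
        tokens.append(idx)
    return tokens, recon

frames = rng.normal(size=(32, 8))         # 32 frames, 8-dim features
codebooks = [rng.normal(size=(16, 8)) for _ in range(3)]
tokens, recon = group_coarse_to_fine(frames, codebooks)
```

Grouping is what distinguishes this family from plain per-frame residual VQ: each token level indexes a whole group of frames, so coarse levels run at a much lower rate than the frame rate.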
GogoSpeech: two-stage speech language model
The authors develop GogoSpeech, a two-stage speech generation framework that first predicts a high-level speech backbone at approximately 14 Hz and then incrementally recovers fine-grained details conditioned on this backbone.
[8] Make-a-voice: Unified voice synthesis with discrete representation
[12] Speaking from coarse to fine: Improving neural codec language model via multi-scale speech coding and generation
[3] Cafe-talk: Generating 3D talking face animation with multimodal coarse- and fine-grained control
[4] FELLE: Autoregressive Speech Synthesis with Token-Wise Coarse-to-Fine Flow Matching
[5] What to talk about and how? Selective generation using LSTMs with coarse-to-fine alignment
[6] Controllable Accented Text-to-Speech Synthesis With Fine and Coarse-Grained Intensity Rendering
[7] THLNet: two-stage heterogeneous lightweight network for monaural speech enhancement
[9] Shallow Flow Matching for Coarse-to-Fine Text-to-Speech Synthesis
[10] Dual-Branch Attention-In-Attention Transformer for Single-Channel Speech Enhancement
[11] Coarse-to-Fine Text-to-Music Latent Diffusion
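The two-stage structure claimed for GogoSpeech, and shared in spirit by several papers above, can be summarized as an interface: a backbone model first emits low-rate coarse tokens from text, and a refiner then fills in progressively finer token levels conditioned on everything coarser. The sketch below uses toy stand-in callables; the function names, the ~14 tokens/s backbone rate, and the per-level expansion factors are illustrative assumptions.

```python
def two_stage_generate(text_tokens, backbone_lm, refiner, num_levels=3):
    """Stage 1: predict the low-rate coarse backbone from text.
    Stage 2: recover finer token levels conditioned on the backbone.
    Hypothetical interface, not the GogoSpeech implementation."""
    backbone = backbone_lm(text_tokens)
    levels = [backbone]
    for lvl in range(1, num_levels):
        levels.append(refiner(levels, lvl))
    return levels

# Toy stand-ins: the backbone runs at a low rate (~14 tokens for one
# second of speech); each finer level emits more tokens than the last.
toy_backbone = lambda text: [0] * 14
toy_refiner = lambda lvls, k: [k] * (len(lvls[0]) * (k + 1))

out = two_stage_generate("hello", toy_backbone, toy_refiner)
```

The design point the claim emphasizes is that the expensive autoregressive stage operates only at the low backbone rate, while detail recovery is conditioned, and hence cheaper per token.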
GRPO-trained token allocator for adaptive token allocation
The authors propose a token allocator trained using Group Relative Policy Optimization that dynamically assigns token budgets to speech groups according to their complexity, thereby improving generation efficiency while maintaining quality.
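Group Relative Policy Optimization itself is well defined: for each prompt (here, each speech group), several allocations are sampled, and each rollout's advantage is its reward normalized by the mean and standard deviation of its own sampling group, with no learned value function. The sketch below shows that group-relative normalization together with a hypothetical reward that trades quality against tokens spent; the reward shape, weight, and the sample numbers are illustrative assumptions, not the authors' reward design.

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages as in GRPO: normalize each rollout's
    reward by the mean and std of its own sampling group."""
    mu = statistics.fmean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0
    return [(r - mu) / sigma for r in rewards]

def allocation_reward(quality, tokens_used, budget_weight=0.01):
    """Hypothetical reward: keep quality high while spending few tokens."""
    return quality - budget_weight * tokens_used

# Four sampled (quality, token-budget) allocations for the same speech
# group; the numbers are made up for illustration.
rollouts = [(0.90, 40), (0.89, 24), (0.70, 16), (0.91, 60)]
advs = grpo_advantages([allocation_reward(q, t) for q, t in rollouts])
```

Under this reward, the frugal second rollout (near-top quality at 24 tokens) receives the largest advantage, which is exactly the pressure an allocator needs to spend tokens only on complex groups.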