FlexiCodec: A Dynamic Neural Audio Codec for Low Frame Rates

ICLR 2026 Conference Submission

Anonymous Authors

OpenReview Score: 5.7

Keywords: audio coding, neural audio codecs, speech language model

Abstract:

Neural audio codecs are foundational to speech language models. They are expected to have a low frame rate and decoupled semantic and acoustic information. A lower frame rate codec can reduce the computational cost of speech language models by shortening the sequence length. Recent studies have developed 12.5Hz low-frame-rate audio codecs, but even lower frame rate codecs remain underexplored. We find that pushing existing audio codecs to very low frame rates loses much semantic information. We suggest that the limitations of low-frame-rate codecs lie in both insufficient semantic decoupling and insufficient time resolution for capturing transient phonetic details. This paper introduces FlexiCodec to address these limitations. FlexiCodec improves semantic preservation with a dynamic frame rate approach and introduces a novel architecture featuring ASR feature-assisted dual-stream encoding and Transformer bottlenecks. With dynamic frame rates, it uses fewer frames in information-sparse regions by adaptively merging semantically similar frames. A dynamic frame rate also allows FlexiCodec to support inference-time controllable frame rates between 3Hz and 12.5Hz. Experiments at 6.25Hz, 8.3Hz and 12.5Hz average frame rates confirm that FlexiCodec outperforms baseline systems in semantic information preservation and delivers high audio reconstruction quality. We also validate the effectiveness of FlexiCodec in language model-based TTS. Demos are available at: https://flexicodec.github.io.

Disclaimer

This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.

NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.

If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

FlexiCodec proposes a dynamic frame rate neural audio codec targeting very low frame rates (3Hz to 12.5Hz) for speech coding. The paper resides in the 'Semantic-Driven Adaptive Frame Merging' leaf, which contains only two papers in total, including the sibling work CodecSlime. This is a sparse research direction within the broader taxonomy, which encompasses just three papers across two main branches. The limited population suggests that this specific approach (merging semantically similar frames to reduce redundancy) is relatively underexplored compared to the broader field of neural audio codecs.

The taxonomy reveals two distinct branches: Dynamic Frame Rate Mechanisms (where FlexiCodec sits) and Statistical Optimization for Non-Uniform Sampling. The neighboring 'Tunable Variable Frame Rate Encoding' leaf contains one paper focused on continuous frame rate adjustment, while the statistical branch addresses optimal sampling via formal optimization. FlexiCodec's semantic-driven approach contrasts with the statistical methods that rely on mathematical frameworks rather than learned content-aware merging. The taxonomy's scope notes clarify that FlexiCodec's semantic similarity assessment distinguishes it from both fixed-rate codecs and purely statistical sampling strategies.

Among the 22 candidates examined, the contribution-level analysis shows mixed novelty signals. For the core FlexiCodec architecture, 4 candidates were examined and 1 was a refutable match, suggesting that some prior work on dynamic frame rate codecs exists. For the ASR feature-guided allocation, 8 candidates were examined and none was refutable, indicating that this aspect may be more novel within the limited search scope. For the controllable inference-time frame rate, 10 candidates were examined and 3 were refutable matches, pointing to more substantial prior work in this area. These statistics reflect a targeted semantic search, not an exhaustive survey of all neural codec literature.
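The per-contribution counts quoted above can be tallied in a few lines. This is only an illustrative sketch: the dictionary keys are shorthand labels introduced here, not names used by the report, and the numbers are the ones stated in the paragraph.

```python
# Candidate counts per claimed contribution, as quoted in the assessment.
# Keys are shorthand labels chosen for this sketch, not names from the report.
examined = {
    "core_architecture": 4,
    "asr_guided_allocation": 8,
    "controllable_frame_rate": 10,
}
refutable = {
    "core_architecture": 1,
    "asr_guided_allocation": 0,
    "controllable_frame_rate": 3,
}

total_examined = sum(examined.values())
total_refutable = sum(refutable.values())

# Share of examined candidates flagged as refutable, per contribution.
refutable_share = {name: refutable[name] / examined[name] for name in examined}

print(f"examined={total_examined}, refutable={total_refutable}")
for name, share in refutable_share.items():
    print(f"{name}: {share:.0%}")
```

The totals agree with the 22 candidates mentioned in the text, and the per-contribution shares make the contrast between the three contributions easier to see at a glance.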

The analysis suggests FlexiCodec operates in a sparsely populated research direction with limited direct competitors in the examined candidate pool. The ASR-guided allocation appears most distinctive among the three contributions, while the controllable frame rate mechanism shows more overlap with existing work. However, the small search scope (22 papers) and narrow taxonomy (3 total papers) mean these observations are preliminary and would benefit from broader literature coverage to fully assess positioning.

Taxonomy

[Taxonomy figure not reproduced. Summary panel labels: Core-task Taxonomy Papers; Claimed Contributions; Contribution Candidate Papers Compared; Refutable Paper.]

Research Landscape Overview

Core task: dynamic frame rate neural audio codec for low bitrate speech coding.

The field addresses the challenge of efficiently compressing speech by adapting the temporal resolution of neural codecs to the varying information density in audio signals. The taxonomy reveals two main branches: Dynamic Frame Rate Mechanisms, which focus on adaptive strategies that adjust frame rates based on content characteristics, and Statistical Optimization for Non-Uniform Sampling, which employs mathematical frameworks to determine optimal sampling patterns. The first branch encompasses methods that merge or skip frames semantically, while the second leverages statistical principles to guide non-uniform temporal allocation. Representative works such as CodecSlime[1] and Unlocking Temporal Flexibility[2] illustrate semantic-driven approaches, whereas Optimal Nonuniform Sampling[3] exemplifies the statistical optimization perspective.

Recent activity centers on balancing compression efficiency with perceptual quality through adaptive frame merging and learned temporal structures. A key trade-off emerges between heuristic, content-aware mechanisms that respond to local speech features and principled statistical methods that optimize global sampling distributions. FlexiCodec[0] sits within the semantic-driven adaptive frame merging cluster, closely aligned with CodecSlime[1] in its emphasis on dynamically adjusting frame rates based on speech content. However, FlexiCodec[0] appears to push further toward flexible, learned merging strategies compared to the more structured approaches in Optimal Nonuniform Sampling[3]. Open questions remain around how best to integrate semantic cues with rate-distortion objectives and whether hybrid methods can unify the strengths of both branches for robust low-bitrate speech coding.

Claimed Contributions

FlexiCodec: a dynamic frame rate neural audio codec for very low frame rates

Can Refute

4 retrieved papers

The authors propose FlexiCodec, a neural audio codec that operates at very low frame rates (3-12.5Hz) using a dynamic frame rate mechanism. The codec adaptively merges semantically similar frames based on ASR features and employs a dual-stream architecture with Transformer bottlenecks to preserve semantic information while maintaining high audio quality.

Dynamic frame rate allocation guided by ASR features

8 retrieved papers

The authors introduce a method for dynamically allocating frame rates in the 3-12.5Hz range by using pre-trained ASR features to guide the adaptive merging of semantically similar frames. This allows the codec to use fewer frames in information-sparse regions while preserving transient phonetic details in complex segments.
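To make the idea of merging semantically similar frames concrete, here is a minimal sketch. It is not the authors' implementation: the report only states that pre-trained ASR features guide the merging, so the cosine similarity measure, the greedy adjacent-frame policy, the `max_merge` cap, the mean pooling, and all function names are assumptions made for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def merge_similar_frames(features, threshold=0.9, max_merge=4):
    """Greedily group adjacent frames (hypothetical policy).

    A frame joins the current group when it is similar enough to the
    previous frame and the group holds fewer than max_merge frames.
    Returns (start, end) index spans with end exclusive.
    """
    if not features:
        return []
    spans, start = [], 0
    for i in range(1, len(features)):
        similar = cosine(features[i - 1], features[i]) >= threshold
        if not similar or i - start >= max_merge:
            spans.append((start, i))
            start = i
    spans.append((start, len(features)))
    return spans

def pool(features, spans):
    """Average the feature vectors inside each span (assumed pooling)."""
    pooled = []
    for s, e in spans:
        dim = len(features[s])
        pooled.append(
            [sum(features[k][d] for k in range(s, e)) / (e - s) for d in range(dim)]
        )
    return pooled

# Toy 2-D "ASR features": two stable regions followed by a transient frame.
frames = [[1, 0], [1, 0.05], [0, 1], [0, 1], [0.02, 1], [1, 1]]
spans = merge_similar_frames(frames)
print(spans)  # [(0, 2), (2, 5), (5, 6)]
print(len(pool(frames, spans)), "merged frames from", len(frames))
```

In this toy input the two stable regions collapse into one merged frame each, while the final frame, which differs from its neighbor, keeps a frame of its own. That mirrors the stated goal of spending fewer frames on information-sparse regions while preserving transient detail.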

Controllable frame rate at inference time

Can Refute

10 retrieved papers

The authors develop a codec that enables users to control the frame rate continuously between 3Hz and 12.5Hz at inference time by adjusting a merging threshold parameter. This allows flexible trade-offs between performance and efficiency for downstream applications without retraining the model.
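The relationship between a merging threshold and the resulting average frame rate can be sketched as follows. This is a hedged illustration, not the paper's procedure: it assumes a 12.5Hz base rate (the upper end of the quoted range), assumes a new merged frame starts whenever the similarity between adjacent frames drops below the threshold, and ignores any cap on merge length. The function name and the similarity values are made up for the example.

```python
def average_frame_rate(adjacent_similarities, threshold, base_rate_hz=12.5):
    """Estimate the average frame rate after merging runs of similar frames.

    adjacent_similarities[i] is the similarity between frame i and frame i + 1.
    A new merged frame starts wherever the similarity falls below threshold.
    """
    n_frames = len(adjacent_similarities) + 1
    n_merged = 1 + sum(1 for s in adjacent_similarities if s < threshold)
    return base_rate_hz * n_merged / n_frames

# Hypothetical similarities between 8 consecutive base-rate frames.
sims = [0.95, 0.40, 0.99, 0.97, 0.60, 0.92, 0.85]

for threshold in (1.01, 0.9, 0.5):
    rate = average_frame_rate(sims, threshold)
    print(f"threshold={threshold}: {rate} Hz")
```

Under these assumptions, a threshold above the maximum similarity disables merging and keeps the base rate, while lowering the threshold merges more frames and reduces the average rate. The same trained model could therefore be run at different operating points by changing one parameter.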

Core Task Comparisons

Comparisons with papers in the same taxonomy category

[1] CodecSlime: Temporal Redundancy Compression of Neural Speech Codec via Dynamic Frame Rate

Hankun Wang, Yiwei Guo, Chongtian Shao, Bohan Li, Xie Chen, Kai Yu (2025)

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

FlexiCodec: a dynamic frame rate neural audio codec for very low frame rates

[23] QuarkAudio Technical Report

Can Refute

[20] Uniaudio 1.5: Large language model-driven audio codec is a few-shot audio task learner

Cannot Refute

[21] Acoustic Teleportation via Disentangled Neural Audio Codec Representations

Cannot Refute

[22] STCTS: Generative Semantic Compression for Ultra-Low Bitrate Speech via Explicit Text-Prosody-Timbre Decomposition

Cannot Refute

Contribution

Dynamic frame rate allocation guided by ASR features

[12] SyncVSR: Data-efficient visual speech recognition with end-to-end crossmodal audio token synchronization

Cannot Refute

[13] Baichuan-audio: A unified framework for end-to-end speech interaction

Cannot Refute

[14] Variable Frame Rate Acoustic Models Using Minimum Error Reinforcement Learning

Cannot Refute

[15] Regarding Topology and Variant Frame Rates for Differentiable WFST-based End-to-End ASR

Cannot Refute

[16] Conv-Transformer Transducer: Low Latency, Low Frame Rate, Streamable End-to-End Speech Recognition

Cannot Refute

[17] Entropy-based variable frame rate analysis of speech signals and its application to ASR

Cannot Refute

[18] Impact of Frame Rates on Speech Tokenizer: A Case Study on Mandarin and English

Cannot Refute

[19] A Novel Front-end Based on Variable Frame Rate Analysis and Mel-filterbank Output Compensation for Robust ASR

Cannot Refute

Contribution

Controllable frame rate at inference time

[1] CodecSlime: Temporal Redundancy Compression of Neural Speech Codec via Dynamic Frame Rate

Can Refute

[2] Unlocking Temporal Flexibility: Neural Speech Codec with Variable Frame Rate

Can Refute

[5] Say More with Less: Variable-Frame-Rate Speech Tokenization via Adaptive Clustering and Implicit Duration Coding

Can Refute

[4] Low frame-rate speech codec: a codec designed for fast high-quality speech LLM training and inference

Cannot Refute

[6] Towards Codec-LM Co-design for Neural Codec Language Models

Cannot Refute

[7] U-Codec: Ultra Low Frame-rate Neural Speech Codec for Fast High-fidelity Speech Generation

Cannot Refute

[8] SNAC: Multi-Scale Neural Audio Codec

Cannot Refute

[9] NanoCodec: Towards high-quality ultra fast speech LLM inference

Cannot Refute

[10] LongCat-Audio-Codec: An Audio Tokenizer and Detokenizer Solution Designed for Speech Large Language Models

Cannot Refute

[11] HALL-E: hierarchical neural codec language model for minute-long zero-shot text-to-speech synthesis

Cannot Refute
