FlexiCodec: A Dynamic Neural Audio Codec for Low Frame Rates

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Audio coding, neural audio codecs, speech language model
Abstract:

Neural audio codecs are foundational to speech language models. They are expected to have a low frame rate and decoupled semantic and acoustic information. A lower-frame-rate codec can reduce the computational cost of speech language models by shortening the sequence length. Recent studies have developed 12.5Hz low-frame-rate audio codecs, but codecs with even lower frame rates remain underexplored. We find that pushing existing audio codecs to very low frame rates loses much semantic information. We suggest that the limitations of low-frame-rate codecs lie in both insufficient semantic decoupling and insufficient time resolution for capturing transient phonetic details. This paper introduces FlexiCodec to address these limitations. FlexiCodec improves semantic preservation with a dynamic frame rate approach and introduces a novel architecture featuring ASR feature-assisted dual-stream encoding and Transformer bottlenecks. With dynamic frame rates, it uses fewer frames in information-sparse regions by adaptively merging semantically similar frames. A dynamic frame rate also allows FlexiCodec to support inference-time controllable frame rates between 3Hz and 12.5Hz. Experiments at 6.25Hz, 8.3Hz and 12.5Hz average frame rates confirm that FlexiCodec outperforms baseline systems in semantic information preservation and delivers high audio reconstruction quality. We also validate the effectiveness of FlexiCodec in language model-based TTS. Demos are available at: https://flexicodec.github.io.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

FlexiCodec proposes a dynamic frame rate neural audio codec targeting very low frame rates (3Hz to 12.5Hz) for speech coding. The paper resides in the 'Semantic-Driven Adaptive Frame Merging' leaf, which contains only two papers in total: FlexiCodec itself and the sibling work CodecSlime. This is a sparse research direction within a taxonomy that itself encompasses just three prior papers across two main branches. The limited population suggests that this specific approach, merging semantically similar frames to reduce redundancy, is relatively underexplored compared to the broader field of neural audio codecs.

The taxonomy reveals two distinct branches: Dynamic Frame Rate Mechanisms (where FlexiCodec sits) and Statistical Optimization for Non-Uniform Sampling. The neighboring 'Tunable Variable Frame Rate Encoding' leaf contains one paper focused on continuous frame rate adjustment, while the statistical branch addresses optimal sampling via formal optimization. FlexiCodec's semantic-driven approach contrasts with the statistical methods that rely on mathematical frameworks rather than learned content-aware merging. The taxonomy's scope notes clarify that FlexiCodec's semantic similarity assessment distinguishes it from both fixed-rate codecs and purely statistical sampling strategies.

Among the 22 candidates examined, the contribution-level analysis shows mixed novelty signals. For the core FlexiCodec architecture, 4 candidates were examined and 1 was judged a refutable match, suggesting that some prior work on dynamic frame rate codecs exists. For the ASR feature-guided allocation, 8 candidates were examined and none was refutable, indicating that this aspect may be more novel within the limited search scope. For the controllable inference-time frame rate, 10 candidates were examined and 3 were refutable matches, pointing to more substantial prior work in this area. These statistics reflect a targeted semantic search, not an exhaustive survey of all neural codec literature.
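
As a quick cross-check of the figures above, the per-contribution counts can be tallied against the reported totals. This is only an illustrative sketch: the dictionary keys and variable names are hypothetical labels chosen here, and the numbers are copied from the paragraph above.

```python
# Per-contribution counts as stated in the assessment above.
# The keys are informal labels chosen for this sketch, not official names.
candidates = {
    "core FlexiCodec architecture": {"examined": 4, "refutable": 1},
    "ASR feature-guided allocation": {"examined": 8, "refutable": 0},
    "controllable inference-time frame rate": {"examined": 10, "refutable": 3},
}

total_examined = sum(c["examined"] for c in candidates.values())
total_refutable = sum(c["refutable"] for c in candidates.values())

print(f"candidates examined: {total_examined}")   # 4 + 8 + 10
print(f"refutable matches:   {total_refutable}")  # 1 + 0 + 3
```

The totals agree with the 22 candidates mentioned in the text and with the sum of the refutable matches listed for each contribution.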

The analysis suggests FlexiCodec operates in a sparsely populated research direction with limited direct competitors in the examined candidate pool. The ASR-guided allocation appears most distinctive among the three contributions, while the controllable frame rate mechanism shows more overlap with existing work. However, the small search scope (22 papers) and narrow taxonomy (3 total papers) mean these observations are preliminary and would benefit from broader literature coverage to fully assess positioning.

Taxonomy

Core-task Taxonomy Papers: 3
Claimed Contributions: 3
Contribution Candidate Papers Compared: 22
Refutable Papers: 4

Research Landscape Overview

Core task: dynamic frame rate neural audio codec for low-bitrate speech coding. The field addresses the challenge of efficiently compressing speech by adapting the temporal resolution of neural codecs to the varying information density in audio signals. The taxonomy reveals two main branches: Dynamic Frame Rate Mechanisms, which focus on adaptive strategies that adjust frame rates based on content characteristics, and Statistical Optimization for Non-Uniform Sampling, which employs mathematical frameworks to determine optimal sampling patterns. The first branch encompasses methods that merge or skip frames semantically, while the second leverages statistical principles to guide non-uniform temporal allocation. Representative works such as CodecSlime[1] and Unlocking Temporal Flexibility[2] illustrate semantic-driven approaches, whereas Optimal Nonuniform Sampling[3] exemplifies the statistical optimization perspective.

Recent activity centers on balancing compression efficiency with perceptual quality through adaptive frame merging and learned temporal structures. A key trade-off emerges between heuristic, content-aware mechanisms that respond to local speech features and principled statistical methods that optimize global sampling distributions.

FlexiCodec[0] sits within the semantic-driven adaptive frame merging cluster, closely aligned with CodecSlime[1] in its emphasis on dynamically adjusting frame rates based on speech content. However, FlexiCodec[0] appears to push further toward flexible, learned merging strategies compared to the more structured approaches in Optimal Nonuniform Sampling[3]. Open questions remain around how best to integrate semantic cues with rate-distortion objectives and whether hybrid methods can unify the strengths of both branches for robust low-bitrate speech coding.

Claimed Contributions

FlexiCodec: a dynamic frame rate neural audio codec for very low frame rates

The authors propose FlexiCodec, a neural audio codec that operates at very low frame rates (3-12.5Hz) using a dynamic frame rate mechanism. The codec adaptively merges semantically similar frames based on ASR features and employs a dual-stream architecture with Transformer bottlenecks to preserve semantic information while maintaining high audio quality.

4 retrieved papers (Can Refute)

Dynamic frame rate allocation guided by ASR features

The authors introduce a method for dynamically allocating frame rates in the 3-12.5Hz range by using pre-trained ASR features to guide the adaptive merging of semantically similar frames. This allows the codec to use fewer frames in information-sparse regions while preserving transient phonetic details in complex segments.
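
To make the idea of merging semantically similar frames concrete, here is a minimal sketch. It is not the authors' implementation: the report does not state the similarity measure or the merging rule, so the cosine similarity, the greedy left-to-right grouping, and the mean pooling below are all assumptions, and the names `cosine_similarity` and `merge_similar_frames` are hypothetical. The toy 2-dimensional vectors stand in for per-frame ASR features.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def merge_similar_frames(features, threshold):
    """Greedily group consecutive frames (assumed rule, for illustration only).

    A frame joins the current group when its cosine similarity to the
    previous frame is at least `threshold`; otherwise it starts a new group.
    Each group is averaged into one merged frame.
    Returns (merged_frames, group_lengths).
    """
    if not features:
        return [], []
    groups = [[features[0]]]
    for prev, cur in zip(features, features[1:]):
        if cosine_similarity(prev, cur) >= threshold:
            groups[-1].append(cur)
        else:
            groups.append([cur])
    merged = [[sum(col) / len(group) for col in zip(*group)] for group in groups]
    lengths = [len(group) for group in groups]
    return merged, lengths


# Toy stand-in for per-frame ASR features: two near-identical frames,
# two identical frames, then a change.
toy_features = [[1.0, 0.0], [1.0, 0.05], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
merged, lengths = merge_similar_frames(toy_features, threshold=0.9)
print(lengths)  # [2, 2, 1]
```

In this toy case five input frames collapse to three merged frames: stable regions share one frame while the points where the content changes keep their own. The group lengths would need to be kept alongside the merged frames so that a decoder could restore the original time axis.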

8 retrieved papers

Controllable frame rate at inference time

The authors develop a codec that enables users to control the frame rate continuously between 3Hz and 12.5Hz at inference time by adjusting a merging threshold parameter. This allows flexible trade-offs between performance and efficiency for downstream applications without retraining the model.
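
The link between a merging threshold and the resulting average frame rate can be sketched as follows. This is a hedged illustration, not the paper's method: it assumes frames start at a 12.5Hz base rate, that adjacent frames are merged when their cosine similarity reaches the threshold, and it uses a seeded random walk as a stand-in for real ASR features. The function names are hypothetical, and the sketch does not enforce the 3Hz lower bound mentioned above.

```python
import math
import random


def adjacent_similarities(features):
    """Cosine similarity between each frame and the frame before it."""
    sims = []
    for prev, cur in zip(features, features[1:]):
        dot = sum(x * y for x, y in zip(prev, cur))
        norm = math.sqrt(sum(x * x for x in prev)) * math.sqrt(sum(y * y for y in cur))
        sims.append(dot / norm if norm > 0.0 else 0.0)
    return sims


def average_frame_rate(features, threshold, base_rate_hz=12.5):
    """Average frame rate after merging adjacent frames with similarity >= threshold.

    `base_rate_hz` is the assumed rate before merging. Every adjacent pair
    whose similarity falls below the threshold starts a new merged frame.
    """
    sims = adjacent_similarities(features)
    num_segments = 1 + sum(1 for s in sims if s < threshold)
    return base_rate_hz * num_segments / len(features)


# Seeded random walk as a toy stand-in for slowly varying ASR features
# (250 frames is 20 seconds at the assumed 12.5Hz base rate).
rng = random.Random(0)
frame = [0.0] * 8
toy_features = []
for _ in range(250):
    frame = [x + rng.gauss(0.0, 1.0) for x in frame]
    toy_features.append(frame)

for threshold in (0.80, 0.90, 0.95, 0.99):
    rate = average_frame_rate(toy_features, threshold)
    print(f"threshold={threshold:.2f} -> average frame rate {rate:.2f}Hz")
```

Under this assumed rule, raising the threshold can only reduce the number of merges, so the average frame rate never decreases as the threshold goes up. That monotonic relationship is what would make a single threshold usable as an inference-time control without retraining.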

10 retrieved papers (Can Refute)
