Let Features Decide Their Own Solvers: Hybrid Feature Caching for Diffusion Transformers

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Efficient ML, Diffusion Transformer Acceleration, Feature Caching
Abstract:

Diffusion Transformers (DiTs) offer state-of-the-art fidelity in image and video synthesis, but their iterative sampling process remains a major bottleneck due to the high cost of transformer forward passes at each timestep. To mitigate this, feature caching has emerged as a training-free acceleration technique that reuses or forecasts hidden representations. However, existing methods often apply a uniform caching strategy across all feature dimensions, ignoring their heterogeneous dynamic behaviors. Therefore, we adopt a new perspective by modeling hidden feature evolution as a mixture of ODEs across dimensions, and introduce HyCa, a hybrid ODE-solver-inspired caching framework that applies dimension-wise caching strategies. HyCa achieves near-lossless acceleration across diverse domains and models without retraining, including a 5.56× speedup on FLUX and HunyuanVideo and a 6.24× speedup on Qwen-Image and Qwen-Image-Edit. Our code is included in the supplementary material and will be released on GitHub.
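To make the baseline concrete, the following is a minimal sketch of the uniform caching the abstract argues against: every feature dimension is refreshed or reused on the same fixed schedule. The function names, the toy Euler-style update, and the fixed interval are illustrative assumptions, not the paper's actual sampler.

```python
import numpy as np

def sample_with_uniform_cache(denoise, x, timesteps, cache_interval=2):
    """Toy sampler: run the expensive network only every `cache_interval`
    steps and reuse the cached feature on all other steps (uniform caching,
    applied identically to every feature dimension)."""
    cached = None
    for i, t in enumerate(timesteps):
        if i % cache_interval == 0:
            cached = denoise(x, t)   # full transformer forward pass
        # on skipped steps, `cached` is reused as-is
        x = x - 0.1 * cached         # toy Euler-style update rule
    return x
```

With `cache_interval=2`, half of the forward passes are skipped; the cost is that fast-moving feature dimensions are held constant just as long as slow-moving ones, which is exactly the uniformity HyCa removes.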

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes HyCa, a hybrid caching framework that models hidden feature evolution as a mixture of ODEs and applies dimension-wise caching strategies. It resides in the 'Caching with ODE Solvers and Sampling Optimization' leaf, which contains only three papers total, including this work and two siblings (AB-Cache and LazyDiT). This represents a relatively sparse research direction within the broader taxonomy of 50 papers across 23 leaf nodes, suggesting the integration of ODE-inspired solvers with feature caching remains an emerging area rather than a saturated one.

The taxonomy reveals that most caching research clusters around core mechanisms (uniform temporal, token-level selective, hierarchical block-level) and adaptive strategies (runtime-adaptive, frequency-aware, magnitude-based). HyCa's parent branch, 'Hybrid and Multi-Paradigm Acceleration,' also includes leaves for caching with parallelization and caching with pruning, indicating the field is exploring synergies between caching and complementary acceleration techniques. The scope note for HyCa's leaf explicitly excludes 'pure caching without solver integration,' positioning this work at the intersection of numerical methods and feature reuse—a boundary less explored than standalone caching or standalone solver optimization.

Among 30 candidates examined, the contribution-level analysis shows mixed novelty signals. 'Heterogeneous Feature Dynamics' (10 candidates, 0 refutable) and 'State-of-the-Art Acceleration Performance' (10 candidates, 0 refutable) show no clear overlap with prior work within the limited search scope. However, 'HyCa: Hybrid Feature Caching Framework' (10 candidates, 1 refutable) matches at least one candidate presenting overlapping prior work, suggesting the core framework design may share conceptual or technical elements with existing methods. The scale of this search (30 papers in total) means these findings reflect the top semantic matches rather than exhaustive coverage.

Given the sparse population of the ODE-solver-caching leaf and the absence of refutation for two of three contributions, the work appears to occupy a relatively novel niche within the examined scope. The single refutable candidate for the framework contribution indicates some prior overlap exists, but the limited search scale and the emerging nature of this hybrid paradigm suggest the paper may still offer substantive advances. A broader literature review would be needed to confirm whether the dimension-wise ODE mixture modeling and the specific solver integration represent genuine departures from existing hybrid acceleration methods.

Taxonomy

- Core-task Taxonomy Papers: 50
- Claimed Contributions: 3
- Contribution Candidate Papers Compared: 30
- Refutable Papers: 1

Research Landscape Overview

Core task: accelerating diffusion transformer inference through feature caching. The field has organized itself around several complementary strategies for reducing computational overhead in diffusion models. Core Feature Caching Mechanisms establish foundational techniques such as token-level reuse (Token Caching[4], KV Caching Diffusion[2]) and dual-stream approaches (Dual Feature Caching[5]), while Adaptive Caching Strategies introduce runtime flexibility through methods like Runtime-Adaptive Caching[16] and cluster-driven selection (Cluster-Driven Caching[22]). Predictive and Forecasting-Based Caching leverages Taylor expansions and confidence gating (TaylorSeers[26], Confidence-Gated Taylor[28]) to anticipate future features, whereas Learning-Based Caching Optimization trains policies or networks to decide what and when to cache (Learning-to-Cache[8]).

Architectural and Structural Enhancements modify model designs directly (Long-Skip-Connections[14], Decoupled Diffusion Transformer[41]), and Redundancy Analysis and Profiling systematically identify reusable computations (Unveiling Redundancy[17], Profiling-Based Reuse[36]). Domain-Specific Caching Applications tailor strategies to video generation (Adaptive Caching Video[23]) or text-to-speech (Text-to-Speech Caching[18]), while Hybrid and Multi-Paradigm Acceleration combines caching with ODE solvers or sampling optimizations, and Universal and Cross-Architecture Caching aims for broad applicability across model families (OmniCache[35]).

Recent work has explored trade-offs between caching granularity, error accumulation, and computational savings. Fine-grained token-wise methods (Token-wise Feature Caching[9], Rethinking Token-wise Caching[40]) offer precise control but may introduce overhead, whereas block-level or layer-skipping approaches (BlockDance[27], Skip Branches[38]) achieve coarser speedups with simpler logic.
Hybrid Feature Caching[0] sits within the Hybrid and Multi-Paradigm Acceleration branch, combining caching with ODE solver refinements to balance quality and speed—a direction also pursued by LazyDiT[44] and AB-Cache[46], which similarly integrate sampling optimizations. Compared to purely adaptive schemes like Runtime-Adaptive Caching[16] or purely predictive methods like TaylorSeers[26], Hybrid Feature Caching[0] emphasizes synergy between multiple acceleration paradigms, aiming to mitigate the exposure bias and error drift that can arise when caching decisions are made in isolation from the underlying numerical solver.

Claimed Contributions

Heterogeneous Feature Dynamics in Diffusion Transformers

The authors demonstrate that hidden feature dimensions in Diffusion Transformers evolve according to distinct temporal patterns rather than a single unified process. Through clustering analysis, they reveal that these dynamics are consistent across prompts, timesteps, and resolutions, motivating the need for dimension-specific solvers.
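A clustering analysis of this kind could be sketched as follows: record each feature dimension's trajectory over the timesteps, normalize it, and group dimensions whose trajectories have similar shapes. The tiny k-means, the normalization, and the synthetic trajectory shapes below are illustrative assumptions; the paper's actual clustering procedure is not reproduced here.

```python
import numpy as np

def cluster_feature_dims(traj, k=2, iters=20):
    """Group feature dimensions by the shape of their temporal trajectory.

    traj: array of shape (T, D), one hidden-feature snapshot per timestep.
    Each dimension's trajectory is normalized, then clustered with a tiny
    k-means so dimensions that evolve similarly share a cluster id.
    """
    X = traj.T  # (D, T): one row per feature dimension
    X = (X - X.mean(1, keepdims=True)) / (X.std(1, keepdims=True) + 1e-8)
    # farthest-point initialization keeps this toy k-means deterministic
    centers = [X[0]]
    for _ in range(k - 1):
        dists = np.min([((X - c) ** 2).sum(1) for c in centers], axis=0)
        centers.append(X[np.argmax(dists)])
    centers = np.stack(centers)
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(0)
    return labels
```

On a toy trajectory matrix where two dimensions ramp linearly and two oscillate, the ramping dimensions land in one cluster and the oscillating ones in the other, which is the kind of structure that would motivate dimension-specific solvers.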

10 retrieved papers
HyCa: Hybrid Feature Caching Framework

HyCa is a training-free acceleration framework that models hidden feature evolution as a mixture of ODEs. It clusters feature dimensions by their temporal behaviors and assigns the optimal ODE solver to each cluster through a one-time offline optimization, enabling efficient and adaptive feature prediction during inference.
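The per-cluster solver assignment described above can be sketched as follows. The two-entry solver menu (feature reuse as a zero-order hold, linear extrapolation as a first-order step), the squared-error selection objective, and all function names are assumptions for illustration; the paper's actual candidate ODE solvers and offline optimization are richer than this.

```python
import numpy as np

# Candidate "solvers" for forecasting a feature at step t+1 from its history.
# This two-entry menu stands in for the paper's richer ODE-solver candidates.
SOLVERS = {
    "reuse":  lambda prev, prev2: prev,              # zero-order hold
    "linear": lambda prev, prev2: 2 * prev - prev2,  # first-order extrapolation
}

def assign_solvers(traj, labels):
    """One-time offline step: per cluster, pick the candidate solver with the
    lowest forecasting error on a calibration trajectory of shape (T, D)."""
    assignment = {}
    for c in np.unique(labels):
        dims = labels == c
        errs = {
            name: np.mean((f(traj[1:-1, dims], traj[:-2, dims])
                           - traj[2:, dims]) ** 2)
            for name, f in SOLVERS.items()
        }
        assignment[int(c)] = min(errs, key=errs.get)
    return assignment

def predict_next(prev, prev2, labels, assignment):
    """Inference-time step: forecast each dimension with its cluster's solver
    instead of running a full transformer forward pass."""
    out = np.empty_like(prev)
    for c, name in assignment.items():
        dims = labels == c
        out[dims] = SOLVERS[name](prev[dims], prev2[dims])
    return out
```

On a calibration trajectory where one dimension oscillates and another grows linearly, the offline step assigns reuse to the oscillating cluster and extrapolation to the linear one, matching the intuition that each group of dimensions gets the solver that best tracks its dynamics.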

10 retrieved papers
Can Refute
State-of-the-Art Acceleration Performance Across Diverse Tasks

The authors demonstrate that HyCa achieves near-lossless acceleration across multiple domains and models, including 5.56× speedup on FLUX and HunyuanVideo, and 6.24× speedup on Qwen-Image and Qwen-Image-Edit, without requiring retraining. The method is also compatible with distillation techniques, reaching up to 24.4× speedup.

10 retrieved papers

