The Markovian Thinker

ICLR 2026 Conference Submission (Anonymous Authors)
Keywords: LLM Reasoning, RL for LLMs, Reasoning Models, Scalable Reasoning, Test-Time Scaling
Abstract:

Reasoning LLMs suffer from quadratic compute growth as their context length increases, making reinforcement learning with verifiable rewards (RLVR) and test-time scaling prohibitively expensive. Prior work has tried to lighten the computational burden by shortening reasoning traces through pruning, summarization, or multi-stage training, but these methods remain bound to quadratic costs. We introduce Delethink, a thinking algorithm that realizes the Markovian Thinking Paradigm. Instead of producing one long monolithic reasoning trace, Delethink thinks in a sequence of chunks, the Delethink trace. Each chunk continues reasoning by referring only to a fixed number of prior tokens, which function as a Markovian state sufficient for progressing reasoning, while deleting the rest. This preserves continuity without carrying the quadratic baggage. As a result, compute scales linearly and peak memory remains constant. In experiments, we show that Delethink can be applied directly to off-the-shelf reasoning models ranging from 1.5B to 30B parameters, with no loss in performance. Extended reasoning becomes possible under fixed memory and linear compute, while enabling efficient RL training on new tasks. On the DeepScaleR dataset, Delethink trains R1-Distill-Qwen-1.5B to the same benchmark performance as a standard long chain-of-thought (LongCoT) approach, with both models generating up to 24k thinking tokens. The difference is efficiency: Delethink reasons 40% faster with a 70% smaller memory footprint. By decoupling reasoning length from context length, the Markovian Thinking paradigm opens the door to next-generation reasoning LLMs that can scale to millions of tokens with linear compute and constant memory.
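The chunked procedure the abstract describes can be sketched as a short generation loop. This is a minimal illustration, not the authors' implementation: `generate` stands in for any autoregressive decoder, and the chunk size, carryover length, and stop token are hypothetical parameters chosen for the sketch.

```python
def delethink_generate(generate, query, chunk_size=8192, carryover=512,
                       max_chunks=16, stop_token="<answer>"):
    """Sketch of Markovian chunked reasoning: each chunk sees only the
    query plus the last `carryover` tokens of the previous chunk, so the
    prompt length (and hence peak memory) stays bounded no matter how
    long the overall reasoning trace grows."""
    state = []   # tokens carried between chunks (the Markovian state)
    trace = []   # full reasoning trace, kept here only for logging
    for _ in range(max_chunks):
        prompt = query + state                       # bounded-length prompt
        chunk = generate(prompt, max_new_tokens=chunk_size)
        trace.extend(chunk)
        if stop_token in chunk:                      # model signaled it is done
            break
        state = chunk[-carryover:]                   # delete everything else
    return trace
```

Because the prompt never exceeds the query plus `carryover` tokens, per-chunk attention cost and KV-cache size are constant in the number of chunks, which is what makes total compute linear in trace length.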

Disclaimer
This report is AI-GENERATED using large language models and WisPaper (a scholar search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces Delethink, a thinking algorithm implementing the Markovian Thinking Paradigm for reasoning LLMs. It decomposes reasoning into sequential chunks where each chunk references only a fixed number of prior tokens as Markovian state, deleting the rest to achieve linear compute and constant memory. Within the taxonomy, this work resides in the 'Markovian and Chunked Reasoning Paradigms' leaf, which contains only two papers total. This represents a sparse, emerging research direction focused specifically on structured reasoning under strict memory constraints, contrasting with the more populated branches addressing attention mechanisms or state-space models.

The taxonomy reveals that neighboring approaches pursue efficiency through different mechanisms. The 'State-Space and Recurrent Architectures' branch (containing models like Mamba and xLSTM across five subcategories) achieves linear complexity through sequential state updates but maintains implicit memory in parameters. The 'Memory-Augmented Architectures' branch (five subcategories including Infinite Memory Transformer and Memformer) uses external storage to extend context dynamically. Delethink diverges by imposing explicit Markovian constraints on reasoning traces rather than approximating full attention or managing external memory, positioning it as a fundamentally different paradigm for bounded-state reasoning.

Among the thirty candidates examined, the contribution-level analysis shows varied novelty. The core Markovian Thinking Paradigm and zero-shot Delethink inference were each compared against ten candidates with zero refutations, suggesting these contributions occupy relatively unexplored territory within the limited search scope. The comparison for Delethink reinforcement learning training, however, found one refutable overlap among its ten candidates, indicating more substantial prior work in efficient RL training methods. This pattern suggests the paradigm itself is novel, while its application to RL training connects to existing efficiency techniques in that domain.

Based on the limited search of thirty semantically similar papers, Delethink appears to introduce a distinctive approach to reasoning efficiency. The sparse population of its taxonomy leaf and low refutation rates for core contributions suggest novelty, though the analysis cannot claim exhaustiveness. The single refutation in RL training highlights that while the Markovian paradigm is fresh, its integration with established training methods naturally encounters existing work in that intersection.

Taxonomy

- Core-task Taxonomy Papers: 50
- Claimed Contributions: 3
- Contribution Candidate Papers Compared: 30
- Refutable Papers: 1

Research Landscape Overview

Core task: efficient reasoning with linear compute and constant memory. The field addresses the fundamental challenge of scaling neural architectures without incurring quadratic attention costs or unbounded memory growth. The taxonomy reveals several complementary strategies: Linear-Complexity Attention Mechanisms explore approximations and sparse patterns to reduce computational overhead (e.g., Linformer[1], Hierarchical Sparse Attention[43]); State-Space and Recurrent Architectures revive sequential models with modern parameterizations (RWKV-X[2], xLSTM[49], LongMamba[45]); Memory-Augmented Architectures introduce external storage to decouple context length from parameter count (Infinite Memory Transformer[6], Memformer[29]); and Markovian and Chunked Reasoning Paradigms impose structured constraints on information flow to maintain bounded state. Additional branches cover algorithmic foundations, hardware optimization, and specialized domains, reflecting the breadth of efficiency concerns across theory and practice.

A particularly active tension emerges between architectures that preserve full attention expressiveness through clever approximations and those that embrace explicit memory constraints or Markovian assumptions. Works like Sequential Parallel Duality[5] and JEDI Linear[7] seek to retain transformer-like flexibility while achieving linear scaling, whereas approaches in the Markovian branch accept bounded context windows in exchange for provable constant-memory guarantees. Markovian Thinker[0] sits squarely within this latter paradigm, emphasizing structured reasoning under strict memory limits, a philosophy it shares with Working Memory Dialogue[12], which similarly explores bounded-state dialogue systems.

Compared to memory-augmented methods like Infinite Memory Transformer[6] that dynamically manage external storage, Markovian Thinker[0] opts for a more constrained, predictable footprint, trading off potential long-range recall for deterministic efficiency. This positioning highlights an ongoing debate: whether future scalability lies in smarter approximations of full attention or in fundamentally rethinking what information must be retained.

Claimed Contributions

Markovian Thinking Paradigm and Delethink Algorithm

The authors propose a new reasoning paradigm where models think in fixed-size chunks, retaining only a minimal Markovian state from prior reasoning. This enables linear compute scaling and constant memory usage during both training and inference, in contrast to the quadratic costs of standard long chain-of-thought approaches.

10 retrieved papers
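The linear-compute claim in this contribution can be checked with back-of-the-envelope attention arithmetic. The sketch below counts query-key token pairs; the chunk and carryover sizes are illustrative assumptions, not numbers taken from the paper.

```python
def pairs_longcot(n):
    """Causal attention over one monolithic n-token trace:
    token i attends to i+1 positions, so the cost is Theta(n^2)."""
    return n * (n + 1) // 2

def pairs_markovian(n, chunk=8192, carryover=512):
    """The same n reasoning tokens split into fixed windows of at most
    chunk + carryover positions each: the cost is Theta(n)."""
    per_chunk = pairs_longcot(chunk + carryover)
    num_chunks = -(-n // chunk)  # ceiling division
    return num_chunks * per_chunk
```

Under these assumed sizes, at 96k thinking tokens the monolithic trace attends over roughly 4.6 billion token pairs versus about 0.45 billion for the chunked trace, and doubling the thinking budget doubles the Markovian cost while quadrupling the monolithic one.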
Delethink Inference for Zero-Shot Markovian Thinking

The authors present an inference method that can be applied directly to existing reasoning models without additional training or prompting, allowing them to function as Markovian thinkers by reasoning in chunks while maintaining fixed context size.

10 retrieved papers
Delethink Training for Efficient Reinforcement Learning

The authors develop a reinforcement learning training procedure that explicitly trains models to reason in the Markovian manner. This approach achieves comparable performance to standard long chain-of-thought training while requiring significantly less compute and memory resources.

10 retrieved papers (1 can refute)
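The constant-memory claim behind this contribution can likewise be made concrete by sizing the KV cache, which dominates peak memory during rollouts. The layer count, head sizes, and window sizes below are illustrative assumptions, not the configuration of any particular model.

```python
def kv_cache_bytes(seq_len, layers=28, kv_heads=2, head_dim=128, dtype_bytes=2):
    """fp16 KV cache: one K and one V tensor per layer, per cached position."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes * seq_len

def peak_kv_longcot(n_think, prompt=1024):
    # The cache must hold the entire trace, so memory grows with n_think.
    return kv_cache_bytes(prompt + n_think)

def peak_kv_markovian(n_think, prompt=1024, chunk=8192, carryover=512):
    # The cache never holds more than one window, regardless of n_think.
    return kv_cache_bytes(prompt + carryover + chunk)
```

Under these assumptions the Markovian peak is pinned at one window whether the model thinks for 24k or a million tokens, while the LongCoT peak grows in lockstep with the trace, which is what allows RL rollouts of arbitrary length at fixed memory.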

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
