The Markovian Thinker

ICLR 2026 Conference Submission (Anonymous Authors)
Keywords: LLM Reasoning, RL for LLMs, Reasoning Models, Scalable Reasoning, Test-Time Scaling
Abstract:

Reasoning LLMs suffer from quadratic compute growth as their context length increases, making reinforcement learning with verifiable rewards (RLVR) and test-time scaling prohibitively expensive. Prior work has tried to lighten the computational burden by shortening reasoning traces through pruning, summarization, or multi-stage training, but these methods remain bound to quadratic costs. We introduce Delethink, a thinking algorithm that realizes the Markovian Thinking Paradigm. Instead of producing one long monolithic reasoning trace, Delethink thinks in a sequence of chunks, the Delethink trace. Each chunk continues reasoning by referring only to a fixed number of prior tokens, which function as a Markovian state sufficient for progressing reasoning, while deleting the rest. This preserves continuity without carrying the quadratic baggage. As a result, compute scales linearly and peak memory remains constant. In experiments, we show that Delethink can be applied directly to off-the-shelf reasoning models ranging from 1.5B to 30B parameters, with no loss in performance. Extended reasoning becomes possible under fixed memory and linear compute, while enabling efficient RL training on new tasks. On the DeepScaleR dataset, Delethink trains R1-Distill-Qwen-1.5B to the same benchmark performance as a standard long chain-of-thought (LongCoT) approach, with both models generating up to 24k thinking tokens. The difference is efficiency: Delethink reasons 40% faster with a 70% smaller memory footprint. By decoupling reasoning length from context length, the Markovian Thinking paradigm opens the door to next-generation reasoning LLMs that can scale to millions of tokens with linear compute and constant memory.
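The chunked procedure the abstract describes can be sketched as a short generation loop. This is a minimal illustration, not the authors' implementation: `generate` stands in for any autoregressive decoder, and the chunk size, carryover length, and stop token are hypothetical parameters chosen for the sketch.

```python
def delethink_generate(generate, query, chunk_size=8192, carryover=512,
                       max_chunks=16, stop_token="<answer>"):
    """Sketch of Markovian chunked reasoning: each chunk sees only the
    query plus the last `carryover` tokens of the previous chunk, so the
    prompt length (and hence peak memory) stays bounded no matter how
    long the overall reasoning trace grows."""
    state = []   # tokens carried between chunks (the Markovian state)
    trace = []   # full reasoning trace, kept here only for logging
    for _ in range(max_chunks):
        prompt = query + state                       # bounded-length prompt
        chunk = generate(prompt, max_new_tokens=chunk_size)
        trace.extend(chunk)
        if stop_token in chunk:                      # model signaled it is done
            break
        state = chunk[-carryover:]                   # delete everything else
    return trace
```

Because the prompt never exceeds the query plus `carryover` tokens, per-chunk attention cost and KV-cache size are constant in the number of chunks, which is what makes total compute linear in trace length.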

Disclaimer
This report is AI-GENERATED using large language models and WisPaper (a scholar search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces Delethink, a thinking algorithm implementing the Markovian Thinking Paradigm for reasoning LLMs. It decomposes reasoning into sequential chunks where each chunk references only a fixed number of prior tokens as Markovian state, deleting the rest to achieve linear compute and constant memory. Within the taxonomy, this work resides in the 'Markovian and Chunked Reasoning Paradigms' leaf, which contains only two papers total. This represents a sparse, emerging research direction focused specifically on structured reasoning under strict memory constraints, contrasting with the more populated branches addressing attention mechanisms or state-space models.

The taxonomy reveals that neighboring approaches pursue efficiency through different mechanisms. The 'State-Space and Recurrent Architectures' branch (containing models like Mamba and xLSTM across five subcategories) achieves linear complexity through sequential state updates but maintains implicit memory in parameters. The 'Memory-Augmented Architectures' branch (five subcategories including Infinite Memory Transformer and Memformer) uses external storage to extend context dynamically. Delethink diverges by imposing explicit Markovian constraints on reasoning traces rather than approximating full attention or managing external memory, positioning it as a fundamentally different paradigm for bounded-state reasoning.

Among the thirty candidates examined, the contribution-level analysis shows varied novelty. The core Markovian Thinking Paradigm and zero-shot Delethink inference were each compared against ten candidates with zero refutations, suggesting these contributions occupy relatively unexplored territory within the limited search scope. The comparison for Delethink reinforcement learning training, however, found one refutable overlap among its ten candidates, indicating more substantial prior work in efficient RL training methods. This pattern suggests the paradigm itself is novel, while its application to RL training connects to existing efficiency techniques in that domain.

Based on the limited search of thirty semantically similar papers, Delethink appears to introduce a distinctive approach to reasoning efficiency. The sparse population of its taxonomy leaf and low refutation rates for core contributions suggest novelty, though the analysis cannot claim exhaustiveness. The single refutation in RL training highlights that while the Markovian paradigm is fresh, its integration with established training methods naturally encounters existing work in that intersection.

Taxonomy

- Core-task Taxonomy Papers: 50
- Claimed Contributions: 3
- Contribution Candidate Papers Compared: 30
- Refutable Papers: 1

Research Landscape Overview

Core task: efficient reasoning with linear compute and constant memory. The field addresses the fundamental challenge of scaling neural architectures without incurring quadratic attention costs or unbounded memory growth. The taxonomy reveals several complementary strategies: Linear-Complexity Attention Mechanisms explore approximations and sparse patterns to reduce computational overhead (e.g., Linformer[1], Hierarchical Sparse Attention[43]); State-Space and Recurrent Architectures revive sequential models with modern parameterizations (RWKV-X[2], xLSTM[49], LongMamba[45]); Memory-Augmented Architectures introduce external storage to decouple context length from parameter count (Infinite Memory Transformer[6], Memformer[29]); and Markovian and Chunked Reasoning Paradigms impose structured constraints on information flow to maintain bounded state. Additional branches cover algorithmic foundations, hardware optimization, and specialized domains, reflecting the breadth of efficiency concerns across theory and practice.

A particularly active tension emerges between architectures that preserve full attention expressiveness through clever approximations and those that embrace explicit memory constraints or Markovian assumptions. Works like Sequential Parallel Duality[5] and JEDI Linear[7] seek to retain transformer-like flexibility while achieving linear scaling, whereas approaches in the Markovian branch accept bounded context windows in exchange for provable constant-memory guarantees. Markovian Thinker[0] sits squarely within this latter paradigm, emphasizing structured reasoning under strict memory limits, a philosophy it shares with Working Memory Dialogue[12], which similarly explores bounded-state dialogue systems.

Compared to memory-augmented methods like Infinite Memory Transformer[6] that dynamically manage external storage, Markovian Thinker[0] opts for a more constrained, predictable footprint, trading off potential long-range recall for deterministic efficiency. This positioning highlights an ongoing debate: whether future scalability lies in smarter approximations of full attention or in fundamentally rethinking what information must be retained.

Claimed Contributions

Markovian Thinking Paradigm and Delethink Algorithm

The authors propose a new reasoning paradigm where models think in fixed-size chunks, retaining only a minimal Markovian state from prior reasoning. This enables linear compute scaling and constant memory usage during both training and inference, in contrast to the quadratic costs of standard long chain-of-thought approaches.

10 retrieved papers
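The linear-compute claim in this contribution can be checked with back-of-the-envelope attention arithmetic. The sketch below counts query-key token pairs; the chunk and carryover sizes are illustrative assumptions, not numbers taken from the paper.

```python
def pairs_longcot(n):
    """Causal attention over one monolithic n-token trace:
    token i attends to i+1 positions, so the cost is Theta(n^2)."""
    return n * (n + 1) // 2

def pairs_markovian(n, chunk=8192, carryover=512):
    """The same n reasoning tokens split into fixed windows of at most
    chunk + carryover positions each: the cost is Theta(n)."""
    per_chunk = pairs_longcot(chunk + carryover)
    num_chunks = -(-n // chunk)  # ceiling division
    return num_chunks * per_chunk
```

Under these assumed sizes, at 96k thinking tokens the monolithic trace attends over roughly 4.6 billion token pairs versus about 0.45 billion for the chunked trace, and doubling the thinking budget doubles the Markovian cost while quadrupling the monolithic one.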
Delethink Inference for Zero-Shot Markovian Thinking

The authors present an inference method that can be applied directly to existing reasoning models without additional training or prompting, allowing them to function as Markovian thinkers by reasoning in chunks while maintaining fixed context size.

10 retrieved papers
Delethink Training for Efficient Reinforcement Learning

The authors develop a reinforcement learning training procedure that explicitly trains models to reason in the Markovian manner. This approach achieves comparable performance to standard long chain-of-thought training while requiring significantly less compute and memory resources.

10 retrieved papers (1 can refute)
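The constant-memory claim behind this contribution can likewise be made concrete by sizing the KV cache, which dominates peak memory during rollouts. The layer count, head sizes, and window sizes below are illustrative assumptions, not the configuration of any particular model.

```python
def kv_cache_bytes(seq_len, layers=28, kv_heads=2, head_dim=128, dtype_bytes=2):
    """fp16 KV cache: one K and one V tensor per layer, per cached position."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes * seq_len

def peak_kv_longcot(n_think, prompt=1024):
    # The cache must hold the entire trace, so memory grows with n_think.
    return kv_cache_bytes(prompt + n_think)

def peak_kv_markovian(n_think, prompt=1024, chunk=8192, carryover=512):
    # The cache never holds more than one window, regardless of n_think.
    return kv_cache_bytes(prompt + carryover + chunk)
```

Under these assumptions the Markovian peak is pinned at one window whether the model thinks for 24k or a million tokens, while the LongCoT peak grows in lockstep with the trace, which is what allows RL rollouts of arbitrary length at fixed memory.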

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
