StreamingThinker: Large Language Models Can Think While Reading

ICLR 2026 Conference Submission. Anonymous Authors.
Keywords: LLMs, Reasoning, Streaming
Abstract:

Large language models (LLMs) have demonstrated remarkable capabilities in chain-of-thought (CoT) reasoning. However, the current LLM reasoning paradigm initiates thinking only after the entire input is available, which introduces unnecessary latency and weakens attention to earlier information in dynamic scenarios. Inspired by the human ability to think while reading, we first design a streaming thinking paradigm for LLMs, in which reasoning unfolds in the order of the input and its depth is further adjusted once reading is complete. We instantiate this paradigm with StreamingThinker, a framework that enables LLMs to think while reading by integrating streaming CoT generation, streaming-constraint training, and streaming parallel inference. Specifically, StreamingThinker employs streaming reasoning units with quality control for CoT generation, enforces order-preserving reasoning through streaming attention masks and position encoding, and leverages parallel KV caches that decouple input encoding from reasoning generation, thereby ensuring alignment and enabling true concurrency. We evaluate StreamingThinker on the Qwen3 model family across math reasoning, logical reasoning, and context-based QA reasoning tasks. Experimental results show that StreamingThinker preserves performance comparable to batch thinking while reducing token waiting before the onset of reasoning by 80% and time-level latency to the final answer by more than 60%, demonstrating the effectiveness of the streaming paradigm for LLM reasoning.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces a streaming thinking paradigm enabling LLMs to reason concurrently with input reception, instantiated through the StreamingThinker framework. It resides in the 'Concurrent Input-Output Streaming for LLMs' leaf, which contains only three papers including this one. This leaf sits within the broader 'Streaming Language Model Reasoning' branch, indicating a relatively sparse but emerging research direction. The taxonomy reveals that while streaming inference architectures are well-studied across hardware and distributed systems, concurrent reasoning during input arrival for LLMs remains underexplored compared to adjacent areas like multimodal streaming or simultaneous translation.

The taxonomy positions this work at the intersection of streaming inference and reasoning-specific challenges. Neighboring leaves address streaming multimodal understanding and simultaneous translation, which share the concurrent processing goal but target different modalities or tasks. The parent branch excludes batch-based reasoning and inference-time scaling without streaming, clarifying that StreamingThinker's novelty lies in its order-preserving reasoning during input arrival rather than post-hoc computation scaling. Sibling papers in the same leaf explore duplex communication and privacy-preserving scenarios, suggesting the field is fragmenting into specialized concurrent processing contexts rather than converging on unified frameworks.

Across the three claimed contributions, the analysis examined 29 candidate papers and identified zero refutable pairs: 9 candidates for the streaming thinking paradigm, 10 for the StreamingThinker framework, and 10 for the streaming CoT generation pipeline, with none refutable in any case. This suggests that among the top-30 semantically similar works retrieved, none provide directly overlapping prior art for the specific combination of streaming reasoning, order-preserving CoT generation, and parallel KV cache mechanisms. Because the search scope is limited, exhaustive coverage cannot be claimed, but within the examined set the contributions appear distinct from existing concurrent inference and reasoning methods.

Given the sparse taxonomy leaf and absence of refutable candidates in the limited search, the work appears to occupy a relatively novel position within streaming LLM reasoning. However, the analysis covers only 29 candidates from semantic search, leaving open the possibility of relevant work in adjacent communities or under different terminology. The taxonomy's structure suggests the field is still coalescing around core abstractions for concurrent reasoning, and StreamingThinker's integration of streaming constraints with reasoning depth adjustment may represent an early exploration of this design space rather than an incremental refinement of established methods.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 29
Refutable Papers: 0

Research Landscape Overview

Core task: streaming reasoning with concurrent input processing. This field addresses the challenge of performing inference or reasoning over data that arrives continuously, often requiring systems to produce outputs while new inputs are still being received. The taxonomy reflects a diverse landscape spanning multiple communities. Streaming Inference Architectures and Parallelism focuses on hardware-aware designs and model partitioning strategies that enable efficient concurrent execution, as seen in works like Edge Multiple-Model Scheduling[2] and CNN Model Parallelism[5]. Streaming Reasoning and Semantic Processing emphasizes knowledge representation and logic-based methods for integrating temporal or event-driven data, while Streaming Language Model Reasoning targets large language models that must handle incremental or overlapping input-output flows. Online Learning and Adaptive Models for Streams and Data Stream Analytics and Pattern Mining address scenarios where models must update continuously or detect evolving patterns in high-velocity data. Application Domains and Real-Time Systems showcase deployments in robotics, IoT, and multimedia, and Optimization and Control with Streaming Inference explores feedback-driven decision-making under streaming constraints.

Within this landscape, a particularly active line of work centers on concurrent input-output streaming for large language models, where the goal is to interleave token generation with ongoing input reception. StreamingThinker[0] sits squarely in this cluster, proposing mechanisms that allow reasoning to proceed even as new context arrives, a capability also explored by Duplex Speech Modeling[27] in conversational settings and GhostShell[36] in privacy-preserving scenarios.
These efforts contrast with more traditional pipeline approaches that separate encoding and decoding phases, and they differ from hardware-centric parallelism studies like Parallel CPU-GPU Inference[50] by emphasizing algorithmic strategies for overlapping computation. Meanwhile, works such as Streaming Video Memory[3] tackle related challenges in vision domains, highlighting that concurrent processing spans modalities. StreamingThinker[0] distinguishes itself by focusing on reasoning tasks that require maintaining coherent intermediate states across dynamically arriving inputs, a setting that remains less explored than pure generation or classification streams.

Claimed Contributions

Streaming thinking paradigm for LLMs

The authors introduce a new reasoning paradigm where LLMs perform reasoning concurrently with input reception rather than waiting for the complete input. This paradigm mirrors how humans think while reading and allows the reasoning depth to be adjusted adaptively once the input is complete.

9 retrieved papers
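The paradigm described above can be sketched as a simple loop: a reasoning unit is emitted for each input chunk as it arrives, and a final deepening pass runs after reading completes. This is a minimal illustration, not the paper's implementation; `generate_unit` is a hypothetical stand-in for the model call.

```python
# Minimal sketch of the streaming-thinking loop: reasoning starts as soon
# as the first input chunk arrives, instead of waiting for the full prompt.

def generate_unit(chunks, thoughts):
    """Placeholder for one model step: emit a reasoning unit for the
    input received so far, conditioned on earlier thoughts."""
    return f"thought about chunk {len(chunks) - 1}"

def streaming_think(input_stream):
    chunks, thoughts = [], []
    for chunk in input_stream:        # input arrives incrementally
        chunks.append(chunk)
        thoughts.append(generate_unit(chunks, thoughts))  # think while reading
    # after reading completes, the paradigm allows adjusting reasoning depth;
    # here that is reduced to a single final pass
    thoughts.append("final reflection over all chunks")
    return thoughts

units = streaming_think(["premise A", "premise B", "question"])
# one reasoning unit per chunk, plus a final post-reading reflection
```

The key property is that the first reasoning unit is produced before later chunks exist, which is what removes the token-waiting latency the paper measures.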
StreamingThinker framework

The authors develop a complete framework implementing the streaming thinking paradigm. It integrates three components: a generation pipeline for streaming chain-of-thought traces, training mechanisms with streaming attention masks and position encoding, and parallel KV cache inference that decouples input encoding from reasoning generation.

10 retrieved papers
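The order-preserving constraint can be pictured as an attention mask over an interleaved sequence of input chunks and reasoning units. The construction below is a simplified sketch under stated assumptions (one token per chunk/unit, a particular layout), not the paper's exact mask: reasoning unit Ri may attend to input chunks I0..Ii and to earlier reasoning units, while input tokens never attend to reasoning tokens.

```python
import numpy as np

def streaming_mask(n_chunks):
    # layout: [I0, ..., I_{n-1}, R0, ..., R_{n-1}], one token per chunk/unit
    # for brevity; real chunks and units span multiple tokens each.
    size = 2 * n_chunks
    mask = np.zeros((size, size), dtype=bool)
    for i in range(n_chunks):
        mask[i, : i + 1] = True            # input tokens: causal over inputs only
        r = n_chunks + i
        mask[r, : i + 1] = True            # Ri sees chunks I0..Ii received so far
        mask[r, n_chunks : r + 1] = True   # Ri sees R0..Ri (causal over thoughts)
    return mask

m = streaming_mask(3)
# R1 (row 4) attends to I0, I1, R0, R1 but not to the not-yet-arrived I2
assert m[4].tolist() == [True, True, False, True, True, False]
```

Because each reasoning unit is blind to future chunks, training with such a mask forces the model to produce thoughts that remain valid under streaming inference, where those chunks genuinely have not arrived yet.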
Streaming CoT generation pipeline with quality control

The authors design a data generation method that produces streaming-compatible reasoning traces. It employs boundary tokens to define reasoning units, uses teacher model reconstruction for alignment, and includes quality metrics (granularity and sequential consistency scores) with depth-controlled reasoning variants.

10 retrieved papers
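The boundary-token idea can be illustrated with a short sketch. The token `<unit>` and the scoring rule below are illustrative stand-ins, not the paper's actual tokens or metric definitions: a trace is split into reasoning units at boundary markers, and a toy granularity score rewards units whose average length is near a target.

```python
# Sketch: split a streaming CoT trace into reasoning units at a boundary
# token, then score granularity. "<unit>" and the scoring rule are
# hypothetical stand-ins for the paper's actual design.

def split_units(trace, boundary="<unit>"):
    return [u.strip() for u in trace.split(boundary) if u.strip()]

def granularity_score(units, target_len=12):
    # reward unit sizes close to a target average word count (toy metric)
    avg = sum(len(u.split()) for u in units) / len(units)
    return 1.0 / (1.0 + abs(avg - target_len) / target_len)

trace = "read the premise <unit> derive x = 3 from it <unit> check the answer"
units = split_units(trace)
score = granularity_score(units)   # a value in (0, 1]
```

A sequential consistency score would additionally check that unit i only references content from input chunks 0..i, which requires an alignment between units and chunks and is omitted here.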

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
