All Code, No Thought: Language Models Struggle to Reason in Ciphered Language
Overview
Overall Novelty Assessment
The paper evaluates whether language models can perform mathematical reasoning when inputs and intermediate steps are expressed in various ciphers, testing 28 cipher types across multiple models. It resides in the 'Specialized Applications of Encrypted Reasoning' leaf, which contains only two papers total. This leaf sits at the periphery of the taxonomy, distinct from the more populated branches on privacy-preserving inference (18 papers across homomorphic encryption and secure computation) and adversarial exploitation (5 papers on jailbreak attacks). The sparse population suggests this particular angle—assessing reasoning capability rather than cryptographic security or adversarial robustness—is relatively underexplored in the literature.
The taxonomy reveals several neighboring research directions. 'Cryptanalysis and Cipher Decryption Using AI' (6 papers) focuses on breaking encryption, while 'Adversarial Exploitation of Encrypted Reasoning' (5 papers) examines jailbreak attacks via cipher encoding. 'Privacy-Preserving Inference on Encrypted Data' (18 papers) emphasizes computational security guarantees through homomorphic encryption or secure multi-party computation. The original paper diverges by treating ciphers as a lens for understanding model reasoning limitations rather than as cryptographic primitives to attack or defend. Its scope excludes both adversarial safety bypasses and formal privacy protocols, instead probing the cognitive boundaries of language models when linguistic structure is obscured.
Among the 30 candidates examined, the contribution on cipher prevalence correlating with reasoning performance encountered one prior work that could potentially refute it, while the other two contributions (evaluating ciphered reasoning capability and identifying scaling laws) faced no clear refutation across 10 candidates each. These statistics reflect top-K semantic matches plus citation expansion rather than exhaustive coverage, so the search scope is limited. The evaluation framework and scaling-law analyses appear less contested in the examined literature, whereas the data-prevalence hypothesis may overlap with existing observations about pretraining distribution effects. The sibling paper in the same leaf addresses network traffic analysis and offers minimal direct overlap with mathematical reasoning in ciphers.
Given the sparse taxonomy leaf and the modest search scale, the work appears to occupy a relatively novel position within the examined literature. The absence of extensive prior work on reasoning-in-cipher evaluation suggests the framing is distinctive, though the single potentially refuting candidate for the prevalence-correlation claim indicates some conceptual overlap. Because the analysis does not cover the full breadth of cipher-related AI research, these conclusions remain provisional pending a broader literature review.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors systematically test whether language models can perform mathematical reasoning when their chain-of-thought is expressed in various encrypted, translated, or compressed forms. They find models can translate ciphered text accurately but experience significant accuracy drops when reasoning in ciphered language, even for frontier models on lesser-known ciphers.
The authors demonstrate that ciphered reasoning capability correlates with how frequently a cipher appears in pretraining corpora. They estimate cipher prevalence using token n-gram frequencies and show strong correlation (R² = 0.906 for structure-preserving ciphers) between pretraining prevalence and math accuracy.
The authors characterize how ciphered reasoning capability scales with fine-tuning data and model parameters. They show that even a simple cipher requires more than 3.7 billion tokens of ciphered fine-tuning data to approach plain-text reasoning accuracy, suggesting substantial data requirements for developing this capability.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[16] PacketCLIP: Multi-Modal Embedding of Network Traffic and Language for Cybersecurity Reasoning
Contribution Analysis
Detailed comparisons for each claimed contribution
Evaluation of ciphered reasoning capability across models and ciphers
The authors systematically test whether language models can perform mathematical reasoning when their chain-of-thought is expressed in various encrypted, translated, or compressed forms. They find models can translate ciphered text accurately but experience significant accuracy drops when reasoning in ciphered language, even for frontier models on lesser-known ciphers.
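To make the evaluation setup concrete, the sketch below enciphers a math prompt with ROT13, a simple structure-preserving substitution cipher of the kind the paper tests, and grades a ciphered answer by exact match on the final token after decoding. This is a minimal illustration: the function names, prompt wording, and grading rule are assumptions for this sketch, not the authors' actual harness.

```python
import codecs

def rot13(text: str) -> str:
    """Structure-preserving substitution cipher; its own inverse."""
    return codecs.encode(text, "rot13")

def encipher_prompt(question: str) -> str:
    """Wrap a math question so the model must reason *in* the cipher,
    not merely translate it back to plain text first."""
    return ("Answer in ROT13, and express every reasoning step in ROT13.\n"
            + rot13(question))

def score_ciphered_answer(model_output: str, gold_answer: str) -> bool:
    """Decode the model's ciphered output and exact-match the final
    token against the gold answer (digits pass through ROT13 unchanged)."""
    decoded = rot13(model_output)
    return decoded.strip().split()[-1] == gold_answer
```

Because translation and reasoning are scored separately under this setup, a model can ace the round-trip decode while still failing the in-cipher chain-of-thought, which is the gap the paper reports.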
[71] Improve vision language model chain-of-thought reasoning
[72] Demystifying long chain-of-thought reasoning in LLMs
[73] Multimodal chain-of-thought reasoning in language models
[74] Exploring formal defeasible reasoning of large language models: A chain-of-thought approach
[75] Monitoring reasoning models for misbehavior and the risks of promoting obfuscation
[76] CoT-Drive: Efficient motion forecasting for autonomous driving with LLMs and chain-of-thought prompting
[77] Making reasoning matter: Measuring and improving faithfulness of chain-of-thought reasoning
[78] Critic-CoT: Boosting the reasoning abilities of large language model via chain-of-thought critic
[79] Applying large language models and chain-of-thought for automatic scoring
[80] What makes a good reasoning chain? Uncovering structural patterns in long chain-of-thought reasoning
Correlation between cipher prevalence in pretraining data and reasoning performance
The authors demonstrate that ciphered reasoning capability correlates with how frequently a cipher appears in pretraining corpora. They estimate cipher prevalence using token n-gram frequencies and show strong correlation (R² = 0.906 for structure-preserving ciphers) between pretraining prevalence and math accuracy.
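The prevalence-correlation analysis can be sketched in two pieces: an n-gram frequency estimate of how common ciphered text is in a corpus, and a coefficient of determination relating that estimate to accuracy (the quantity behind the reported R² = 0.906). The sketch below uses character n-grams as a crude stand-in for the paper's token n-grams, and both function names are assumptions for illustration.

```python
from collections import Counter

def ngram_frequency(corpus: str, sample: str, n: int = 3) -> float:
    """Average relative frequency of the sample's character n-grams in the
    corpus -- a rough proxy for a cipher's prevalence in pretraining data."""
    counts = Counter(corpus[i:i + n] for i in range(len(corpus) - n + 1))
    total = sum(counts.values())
    grams = [sample[i:i + n] for i in range(len(sample) - n + 1)]
    return sum(counts[g] for g in grams) / (total * len(grams))

def r_squared(xs, ys):
    """Coefficient of determination for the least-squares line y = a*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    b = my - a * mx
    ss_res = sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1 - ss_res / ss_tot
```

Feeding per-cipher prevalence estimates as `xs` and per-cipher math accuracies as `ys` would yield an R² in the spirit of the paper's analysis, though the actual study estimates prevalence over pretraining-scale corpora rather than a toy string.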
[61] Impact of pretraining term frequencies on few-shot reasoning
[62] Sources of hallucination by large language models on inference tasks
[63] On linear representations and pretraining data frequency in language models
[64] DoReMi: Optimizing data mixtures speeds up language model pretraining
[65] When do you need billions of words of pretraining data?
[66] Calibrate before use: Improving few-shot performance of language models
[67] LMD3: Language model data density dependence
[68] Generalization vs. memorization: Tracing language models' capabilities back to pretraining data
[69] Snoopy: An online interface for exploring the effect of pretraining term frequencies on few-shot LM performance
[70] No "zero-shot" without exponential data: Pretraining concept frequency determines multimodal model performance
Data- and parameter-scaling laws for ciphered reasoning
The authors characterize how ciphered reasoning capability scales with fine-tuning data and model parameters. They show that even a simple cipher requires more than 3.7 billion tokens of ciphered fine-tuning data to approach plain-text reasoning accuracy, suggesting substantial data requirements for developing this capability.
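A data-scaling claim of this shape is typically backed by fitting a power law to the gap between ciphered and plain-text accuracy as fine-tuning tokens grow, then inverting the fit to ask how much data closes the gap. The sketch below shows that generic recipe; the power-law form, the helper names, and the numbers in the usage note are assumptions for illustration, not the paper's fitted coefficients.

```python
import math

def fit_power_law(tokens, acc_gap):
    """Fit gap = c * tokens**(-alpha) by least squares in log-log space,
    a common functional form for data-scaling curves."""
    xs = [math.log(t) for t in tokens]
    ys = [math.log(g) for g in acc_gap]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return math.exp(my - slope * mx), -slope  # (c, alpha)

def tokens_needed(c, alpha, target_gap):
    """Invert the fit: fine-tuning tokens required to shrink the
    plain-text accuracy gap below target_gap."""
    return (c / target_gap) ** (1 / alpha)
```

Under such a fit, a shallow exponent alpha is what drives estimates like "more than 3.7 billion tokens even for a simple cipher": halving the remaining gap costs multiplicatively more data at each step.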