Should We Still Pretrain Encoders with Masked Language Modeling?

ICLR 2026 Conference Submission. Anonymous Authors.
Keywords: Encoder Pretraining, Masked Language Modeling, Causal Language Modeling, Text Representations, Representation Learning
Abstract:

Learning high-quality text representations is fundamental to a wide range of NLP tasks. While encoder pretraining has traditionally relied on Masked Language Modeling (MLM), recent evidence suggests that decoder models pretrained with Causal Language Modeling (CLM) can be effectively repurposed as encoders, often surpassing traditional encoders on text representation benchmarks. However, it remains unclear whether these gains reflect an inherent advantage of the CLM approach or arise from confounding factors such as model and data scale. In this paper, we address this question through a series of large-scale, carefully controlled pretraining ablations, training a total of 38 models ranging from 210 million to 1 billion parameters, and conducting over 15,000 fine-tuning and evaluation runs. We find that while training with MLM generally yields better performance across text representation tasks, CLM-trained models are more data-efficient and demonstrate improved fine-tuning stability. Building on these findings, we experimentally show that a biphasic training strategy that sequentially applies CLM and then MLM achieves optimal performance under a fixed computational training budget. Moreover, we demonstrate that this strategy becomes more appealing when initializing from readily available pretrained CLM models, reducing the computational burden needed to train best-in-class encoder models. We release all project artifacts at \url{https://huggingface.co/XXX} to foster further research.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

This paper contributes a large-scale controlled ablation study comparing masked language modeling (MLM) and causal language modeling (CLM) for encoder pretraining, training 38 models from 210M to 1B parameters with over 15,000 evaluation runs. It resides in the 'Direct Comparisons of MLM and CLM Pretraining' leaf, which contains five papers total including this one. This leaf sits within a moderately populated branch on 'Pretraining Objectives and Architectures,' suggesting the paper addresses a well-established but not overcrowded research direction focused on foundational design choices for language model pretraining.

The taxonomy reveals closely related work in adjacent leaves: 'Hybrid and Sequential Pretraining Strategies' (five papers) explores combined objectives, while 'Bidirectional Context Modeling in Autoregressive Frameworks' (four papers) examines bidirectionality within decoder architectures. The paper's biphasic CLM-then-MLM strategy bridges these areas, connecting direct objective comparisons with sequential training methods. Neighboring branches on autoregressive generation and application domains indicate the field's broader interest in how pretraining choices propagate to downstream tasks, though this work focuses specifically on encoder-level representation quality rather than generation or task-specific deployment.

Among 30 candidates examined, the large-scale ablation study (Contribution 1) shows no clear refutation across 10 candidates, suggesting this systematic scale and control may be distinctive. However, the biphasic CLM-then-MLM strategy (Contribution 2) and the demonstration of CLM-to-MLM superiority (Contribution 3) each found one refutable candidate among 10 examined, indicating prior work on sequential or hybrid training exists within this limited search scope. The statistics suggest the ablation methodology may be more novel than the biphasic training concept itself, though the search examined only top-30 semantic matches rather than exhaustive coverage.

Based on the limited 30-candidate search, the work appears to offer methodological rigor (large-scale ablations) in a moderately explored area, while the biphasic training strategy shows some overlap with existing hybrid approaches. The taxonomy structure confirms this sits in an active but not saturated research direction, with clear boundaries separating direct comparisons from hybrid methods and application studies. A broader literature search might reveal additional sequential training precedents not captured in this top-K semantic scope.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 2

Research Landscape Overview

Core task: Comparing masked language modeling and causal language modeling for encoder pretraining. The field structure reflects a broad investigation into how different pretraining objectives shape language model capabilities. The taxonomy organizes work into several main branches: one focused on pretraining objectives and architectures, examining foundational design choices such as masked versus causal modeling; another on autoregressive and bidirectional modeling for sequence generation, exploring how directionality affects downstream performance; a third on application domains where pretrained models are deployed; and additional branches covering surveys and non-NLP sequence modeling. Within the pretraining objectives branch, many studies directly compare MLM and CLM strategies, assessing trade-offs in representation quality, computational efficiency, and task-specific performance.

Representative works like Zero Shot Generalization[8] and Zero Shot Architecture[12] illustrate how architectural decisions influence generalization, while Encoder Low Resource[16] highlights the importance of pretraining choices in constrained settings. A particularly active line of work examines the role of bidirectionality in pretraining, with studies such as Bidirectionality Pretraining Role[20] and Bidirectional Awareness Induction[33] investigating how bidirectional context improves understanding tasks.

Pretrain Encoders MLM[0] sits squarely within this cluster, directly comparing MLM and CLM for encoder pretraining and emphasizing the empirical advantages of masked objectives for certain encoder architectures. This contrasts with works like GPT or BERT[24] and BERT GPT Impact[25], which take a broader view of the historical and practical distinctions between these paradigms. Meanwhile, papers such as CMLM Sentence Embeddings[27] explore hybrid or conditional masked approaches, suggesting that the boundary between pure MLM and CLM is not always sharp.
The central question remains how to balance the rich bidirectional context of MLM with the simplicity and scalability of causal modeling, especially as models are adapted to diverse downstream tasks.

Claimed Contributions

Large-scale controlled ablation study comparing MLM and CLM for encoder pretraining

The authors conduct extensive controlled experiments training 38 models from 210M to 1B parameters with both MLM and CLM objectives, performing over 15,000 fine-tuning runs to isolate the effects of pretraining paradigm from confounding factors like model scale and data size.

10 retrieved papers
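To make the two objectives being compared concrete, here is a minimal, illustrative sketch (not the paper's code) of how MLM and CLM derive (input, target) pairs from the same token sequence. The `MASK_ID` value, the `-100` ignore label, and the 15% default mask rate are assumptions borrowed from common BERT-style setups.

```python
import random

MASK_ID = 103   # hypothetical [MASK] token id
IGNORE = -100   # conventional "exclude from loss" label


def clm_pairs(tokens):
    """CLM: predict each token from its left context only (shift by one)."""
    return tokens[:-1], tokens[1:]


def mlm_pairs(tokens, mask_rate=0.15, seed=0):
    """MLM: mask a random subset of tokens; loss only on masked positions."""
    rng = random.Random(seed)
    inputs, targets = [], []
    for tok in tokens:
        if rng.random() < mask_rate:
            inputs.append(MASK_ID)   # model sees [MASK] here
            targets.append(tok)      # and must recover the original token
        else:
            inputs.append(tok)
            targets.append(IGNORE)   # position excluded from the loss
    return inputs, targets
```

The key asymmetry the paper's ablations probe is visible here: CLM supervises every position but only with left context, while MLM supervises a sparse subset of positions with full bidirectional context.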
Biphasic CLM-then-MLM pretraining strategy

The authors propose and validate a two-stage pretraining approach that first applies Causal Language Modeling followed by Masked Language Modeling, demonstrating that this sequential strategy outperforms MLM-only training under fixed compute budgets.

10 retrieved papers
Can Refute
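The biphasic strategy described above can be sketched as a simple objective schedule over a fixed step budget. The 75/25 split and the `step_fn` callback are illustrative assumptions, not the paper's reported configuration.

```python
def biphasic_schedule(total_steps, clm_fraction=0.75):
    """Return the training objective to use at each step:
    CLM for the first fraction of the budget, then MLM."""
    clm_steps = int(total_steps * clm_fraction)
    return ["CLM"] * clm_steps + ["MLM"] * (total_steps - clm_steps)


def train(total_steps, step_fn):
    """Run one objective-switching loop; step_fn(objective, step)
    stands in for a real optimizer update under that objective."""
    for step, objective in enumerate(biphasic_schedule(total_steps)):
        step_fn(objective, step)
```

Because the total step count is fixed, this schedule keeps the compute budget constant while trading early CLM data-efficiency for late MLM bidirectional representation quality, which is the trade-off the contribution claims to optimize.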
Demonstration that CLM-to-MLM continued pretraining outperforms MLM-only training

The authors show that applying MLM continued pretraining to decoder models initially trained with CLM yields better text representations than continuing to train models that were pretrained with MLM, suggesting that leveraging existing pretrained decoders is the most effective path to strong encoders.

10 retrieved papers
Can Refute
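Repurposing a CLM-pretrained decoder for MLM continued pretraining implies, at minimum, replacing the causal attention mask with a bidirectional one. The sketch below shows that mask change only; the layout convention (1 = position may be attended to) is an assumption based on common transformer implementations, not the paper's codebase.

```python
def causal_mask(seq_len):
    """Lower-triangular mask used during CLM: position i may
    attend only to positions j <= i (its left context)."""
    return [[1 if j <= i else 0 for j in range(seq_len)]
            for i in range(seq_len)]


def bidirectional_mask(seq_len):
    """All-ones mask used once MLM training resumes: every
    position may attend to the full sequence."""
    return [[1] * seq_len for _ in range(seq_len)]
```

Everything else (weights, tokenizer, optimizer state handling) carries over from the CLM checkpoint, which is why initializing from readily available pretrained decoders reduces the compute needed to reach strong encoder performance.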

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Large-scale controlled ablation study comparing MLM and CLM for encoder pretraining

The authors conduct extensive controlled experiments training 38 models from 210M to 1B parameters with both MLM and CLM objectives, performing over 15,000 fine-tuning runs to isolate the effects of pretraining paradigm from confounding factors like model scale and data size.

Contribution

Biphasic CLM-then-MLM pretraining strategy

The authors propose and validate a two-stage pretraining approach that first applies Causal Language Modeling followed by Masked Language Modeling, demonstrating that this sequential strategy outperforms MLM-only training under fixed compute budgets.

Contribution

Demonstration that CLM-to-MLM continued pretraining outperforms MLM-only training

The authors show that applying MLM continued pretraining to decoder models initially trained with CLM yields better text representations than continuing to train models that were pretrained with MLM, suggesting that leveraging existing pretrained decoders is the most effective path to strong encoders.