Should We Still Pretrain Encoders with Masked Language Modeling?

ICLR 2026 Conference Submission. Anonymous Authors.
Keywords: Encoder Pretraining, Masked Language Modeling, Causal Language Modeling, Text Representations, Representation Learning
Abstract:

Learning high-quality text representations is fundamental to a wide range of NLP tasks. While encoder pretraining has traditionally relied on Masked Language Modeling (MLM), recent evidence suggests that decoder models pretrained with Causal Language Modeling (CLM) can be effectively repurposed as encoders, often surpassing traditional encoders on text representation benchmarks. However, it remains unclear whether these gains reflect an inherent advantage of the CLM approach or arise from confounding factors such as model and data scale. In this paper, we address this question through a series of large-scale, carefully controlled pretraining ablations, training a total of 38 models ranging from 210 million to 1 billion parameters, and conducting over 15,000 fine-tuning and evaluation runs. We find that while training with MLM generally yields better performance across text representation tasks, CLM-trained models are more data-efficient and demonstrate improved fine-tuning stability. Building on these findings, we experimentally show that a biphasic training strategy that sequentially applies CLM and then MLM achieves optimal performance under a fixed computational training budget. Moreover, we demonstrate that this strategy becomes more appealing when initializing from readily available pretrained CLM models, reducing the computational burden needed to train best-in-class encoder models. We release all project artifacts at \url{https://huggingface.co/XXX} to foster further research.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

This paper contributes a large-scale controlled ablation study comparing masked language modeling (MLM) and causal language modeling (CLM) for encoder pretraining, training 38 models from 210M to 1B parameters with over 15,000 evaluation runs. It resides in the 'Direct Comparisons of MLM and CLM Pretraining' leaf, which contains five papers total including this one. This leaf sits within a moderately populated branch on 'Pretraining Objectives and Architectures,' suggesting the paper addresses a well-established but not overcrowded research direction focused on foundational design choices for language model pretraining.

The taxonomy reveals closely related work in adjacent leaves: 'Hybrid and Sequential Pretraining Strategies' (five papers) explores combined objectives, while 'Bidirectional Context Modeling in Autoregressive Frameworks' (four papers) examines bidirectionality within decoder architectures. The paper's biphasic CLM-then-MLM strategy bridges these areas, connecting direct objective comparisons with sequential training methods. Neighboring branches on autoregressive generation and application domains indicate the field's broader interest in how pretraining choices propagate to downstream tasks, though this work focuses specifically on encoder-level representation quality rather than generation or task-specific deployment.

Among 30 candidates examined, the large-scale ablation study (Contribution 1) shows no clear refutation across 10 candidates, suggesting this systematic scale and control may be distinctive. However, the biphasic CLM-then-MLM strategy (Contribution 2) and the demonstration of CLM-to-MLM superiority (Contribution 3) each found one refutable candidate among 10 examined, indicating prior work on sequential or hybrid training exists within this limited search scope. The statistics suggest the ablation methodology may be more novel than the biphasic training concept itself, though the search examined only top-30 semantic matches rather than exhaustive coverage.

Based on the limited 30-candidate search, the work appears to offer methodological rigor (large-scale ablations) in a moderately explored area, while the biphasic training strategy shows some overlap with existing hybrid approaches. The taxonomy structure confirms this sits in an active but not saturated research direction, with clear boundaries separating direct comparisons from hybrid methods and application studies. A broader literature search might reveal additional sequential training precedents not captured in this top-K semantic scope.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 2

Research Landscape Overview

Core task: Comparing masked language modeling and causal language modeling for encoder pretraining. The field structure reflects a broad investigation into how different pretraining objectives shape language model capabilities. The taxonomy organizes work into several main branches: one focused on pretraining objectives and architectures, examining foundational design choices such as masked versus causal modeling; another on autoregressive and bidirectional modeling for sequence generation, exploring how directionality affects downstream performance; a third on application domains where pretrained models are deployed; and additional branches covering surveys and non-NLP sequence modeling. Within the pretraining objectives branch, many studies directly compare MLM and CLM strategies, assessing trade-offs in representation quality, computational efficiency, and task-specific performance.

Representative works like Zero Shot Generalization[8] and Zero Shot Architecture[12] illustrate how architectural decisions influence generalization, while Encoder Low Resource[16] highlights the importance of pretraining choices in constrained settings. A particularly active line of work examines the role of bidirectionality in pretraining, with studies such as Bidirectionality Pretraining Role[20] and Bidirectional Awareness Induction[33] investigating how bidirectional context improves understanding tasks.

Pretrain Encoders MLM[0] sits squarely within this cluster, directly comparing MLM and CLM for encoder pretraining and emphasizing the empirical advantages of masked objectives for certain encoder architectures. This contrasts with works like GPT or BERT[24] and BERT GPT Impact[25], which take a broader view of the historical and practical distinctions between these paradigms. Meanwhile, papers such as CMLM Sentence Embeddings[27] explore hybrid or conditional masked approaches, suggesting that the boundary between pure MLM and CLM is not always sharp.
The central question remains how to balance the rich bidirectional context of MLM with the simplicity and scalability of causal modeling, especially as models are adapted to diverse downstream tasks.

Claimed Contributions

Large-scale controlled ablation study comparing MLM and CLM for encoder pretraining

The authors conduct extensive controlled experiments training 38 models from 210M to 1B parameters with both MLM and CLM objectives, performing over 15,000 fine-tuning runs to isolate the effects of pretraining paradigm from confounding factors like model scale and data size.

10 retrieved papers
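To make the two objectives being compared concrete, here is a minimal, illustrative sketch (not the paper's code) of how MLM and CLM derive (input, target) pairs from the same token sequence. The `MASK_ID` value, the `-100` ignore label, and the 15% default mask rate are assumptions borrowed from common BERT-style setups.

```python
import random

MASK_ID = 103   # hypothetical [MASK] token id
IGNORE = -100   # conventional "exclude from loss" label


def clm_pairs(tokens):
    """CLM: predict each token from its left context only (shift by one)."""
    return tokens[:-1], tokens[1:]


def mlm_pairs(tokens, mask_rate=0.15, seed=0):
    """MLM: mask a random subset of tokens; loss only on masked positions."""
    rng = random.Random(seed)
    inputs, targets = [], []
    for tok in tokens:
        if rng.random() < mask_rate:
            inputs.append(MASK_ID)   # model sees [MASK] here
            targets.append(tok)      # and must recover the original token
        else:
            inputs.append(tok)
            targets.append(IGNORE)   # position excluded from the loss
    return inputs, targets
```

The key asymmetry the paper's ablations probe is visible here: CLM supervises every position but only with left context, while MLM supervises a sparse subset of positions with full bidirectional context.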
Biphasic CLM-then-MLM pretraining strategy

The authors propose and validate a two-stage pretraining approach that first applies Causal Language Modeling followed by Masked Language Modeling, demonstrating that this sequential strategy outperforms MLM-only training under fixed compute budgets.

10 retrieved papers
Can Refute
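The biphasic strategy described above can be sketched as a simple objective schedule over a fixed step budget. The 75/25 split and the `step_fn` callback are illustrative assumptions, not the paper's reported configuration.

```python
def biphasic_schedule(total_steps, clm_fraction=0.75):
    """Return the training objective to use at each step:
    CLM for the first fraction of the budget, then MLM."""
    clm_steps = int(total_steps * clm_fraction)
    return ["CLM"] * clm_steps + ["MLM"] * (total_steps - clm_steps)


def train(total_steps, step_fn):
    """Run one objective-switching loop; step_fn(objective, step)
    stands in for a real optimizer update under that objective."""
    for step, objective in enumerate(biphasic_schedule(total_steps)):
        step_fn(objective, step)
```

Because the total step count is fixed, this schedule keeps the compute budget constant while trading early CLM data-efficiency for late MLM bidirectional representation quality, which is the trade-off the contribution claims to optimize.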
Demonstration that CLM-to-MLM continued pretraining outperforms MLM-only training

The authors show that applying MLM continued pretraining to decoder models initially trained with CLM yields better text representations than continuing to train models that were pretrained with MLM, suggesting that leveraging existing pretrained decoders is the most effective path to strong encoders.

10 retrieved papers
Can Refute
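Repurposing a CLM-pretrained decoder for MLM continued pretraining implies, at minimum, replacing the causal attention mask with a bidirectional one. The sketch below shows that mask change only; the layout convention (1 = position may be attended to) is an assumption based on common transformer implementations, not the paper's codebase.

```python
def causal_mask(seq_len):
    """Lower-triangular mask used during CLM: position i may
    attend only to positions j <= i (its left context)."""
    return [[1 if j <= i else 0 for j in range(seq_len)]
            for i in range(seq_len)]


def bidirectional_mask(seq_len):
    """All-ones mask used once MLM training resumes: every
    position may attend to the full sequence."""
    return [[1] * seq_len for _ in range(seq_len)]
```

Everything else (weights, tokenizer, optimizer state handling) carries over from the CLM checkpoint, which is why initializing from readily available pretrained decoders reduces the compute needed to reach strong encoder performance.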

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Large-scale controlled ablation study comparing MLM and CLM for encoder pretraining

The authors conduct extensive controlled experiments training 38 models from 210M to 1B parameters with both MLM and CLM objectives, performing over 15,000 fine-tuning runs to isolate the effects of pretraining paradigm from confounding factors like model scale and data size.

Contribution

Biphasic CLM-then-MLM pretraining strategy

The authors propose and validate a two-stage pretraining approach that first applies Causal Language Modeling followed by Masked Language Modeling, demonstrating that this sequential strategy outperforms MLM-only training under fixed compute budgets.

Contribution

Demonstration that CLM-to-MLM continued pretraining outperforms MLM-only training

The authors show that applying MLM continued pretraining to decoder models initially trained with CLM yields better text representations than continuing to train models that were pretrained with MLM, suggesting that leveraging existing pretrained decoders is the most effective path to strong encoders.