Should We Still Pretrain Encoders with Masked Language Modeling?
Overview
Overall Novelty Assessment
This paper contributes a large-scale controlled ablation study comparing masked language modeling (MLM) and causal language modeling (CLM) for encoder pretraining, training 38 models ranging from 210M to 1B parameters and running over 15,000 evaluations. It resides in the 'Direct Comparisons of MLM and CLM Pretraining' leaf, which contains five papers in total, including this one. This leaf sits within a moderately populated branch on 'Pretraining Objectives and Architectures,' suggesting the paper addresses a well-established but not overcrowded research direction focused on foundational design choices for language model pretraining.
The taxonomy reveals closely related work in adjacent leaves: 'Hybrid and Sequential Pretraining Strategies' (five papers) explores combined objectives, while 'Bidirectional Context Modeling in Autoregressive Frameworks' (four papers) examines bidirectionality within decoder architectures. The paper's biphasic CLM-then-MLM strategy bridges these areas, connecting direct objective comparisons with sequential training methods. Neighboring branches on autoregressive generation and application domains indicate the field's broader interest in how pretraining choices propagate to downstream tasks, though this work focuses specifically on encoder-level representation quality rather than generation or task-specific deployment.
Among the 30 candidates examined, the large-scale ablation study (Contribution 1) shows no clear refutation across its 10 candidates, suggesting that its systematic scale and control may be distinctive. However, the biphasic CLM-then-MLM strategy (Contribution 2) and the demonstration of CLM-to-MLM superiority (Contribution 3) each surfaced one potentially refuting candidate among their 10, indicating that prior work on sequential or hybrid training exists even within this limited search scope. These statistics suggest the ablation methodology may be more novel than the biphasic training concept itself, though the search covered only the top 30 semantic matches rather than the literature exhaustively.
Based on the limited 30-candidate search, the work appears to offer methodological rigor (large-scale ablations) in a moderately explored area, while the biphasic training strategy shows some overlap with existing hybrid approaches. The taxonomy structure confirms this sits in an active but not saturated research direction, with clear boundaries separating direct comparisons from hybrid methods and application studies. A broader literature search might reveal additional sequential training precedents not captured in this top-K semantic scope.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors conduct extensive controlled experiments, training 38 models ranging from 210M to 1B parameters with both MLM and CLM objectives and performing over 15,000 fine-tuning runs, to isolate the effect of the pretraining paradigm from confounding factors such as model scale and data size.
The authors propose and validate a two-stage pretraining approach that applies Causal Language Modeling first and Masked Language Modeling second, demonstrating that this sequential strategy outperforms MLM-only training under a fixed compute budget.
The authors show that applying MLM continued pretraining to decoder models initially trained with CLM yields better text representations than continuing to train models pretrained with MLM from the start, suggesting that leveraging existing pretrained decoders is the most effective path to strong encoders.
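To make the two objectives being compared concrete, here is a minimal, self-contained sketch (not the paper's code) of how each one turns a token sequence into training inputs and targets. The `MASK_ID` value and the 15% mask rate are illustrative assumptions (the rate mirrors BERT's default; real MLM recipes also mix in random and kept tokens):

```python
import random

MASK_ID = 0      # assumed id for a [MASK] special token (illustrative)
IGNORE = -100    # label value for positions excluded from the loss

def clm_targets(tokens):
    """Causal LM: each position predicts the next token from left context."""
    return tokens[:-1], tokens[1:]

def mlm_targets(tokens, mask_rate=0.15, rng=None):
    """Masked LM: corrupt a fraction of positions; loss only on those."""
    rng = rng or random.Random()
    inputs, targets = list(tokens), [IGNORE] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            targets[i] = tok      # predict the original token...
            inputs[i] = MASK_ID   # ...from a corrupted input
    return inputs, targets

seq = [11, 12, 13, 14, 15]
print(clm_targets(seq))
print(mlm_targets(seq, rng=random.Random(1)))
```

The key asymmetry the paper's ablations probe: CLM supervises every position but only with left context, while MLM sees bidirectional context but receives a loss signal on only a small fraction of positions.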
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[8] What language model architecture and pretraining objective works best for zero-shot generalization?
[12] What Language Model Architecture and Pretraining Objective Work Best for Zero-Shot Generalization?
[16] Scaling Behavior of Encoder Language Models in Low-Resource Settings
[20] On the role of bidirectionality in language model pre-training
Contribution Analysis
Detailed comparisons for each claimed contribution
Large-scale controlled ablation study comparing MLM and CLM for encoder pretraining
The authors conduct extensive controlled experiments, training 38 models ranging from 210M to 1B parameters with both MLM and CLM objectives and performing over 15,000 fine-tuning runs, to isolate the effect of the pretraining paradigm from confounding factors such as model scale and data size.
[4] Diverse image inpainting with bidirectional and autoregressive transformers
[8] What language model architecture and pretraining objective works best for zero-shot generalization?
[12] What Language Model Architecture and Pretraining Objective Work Best for Zero-Shot Generalization?
[51] Look ahead or look around? a theoretical comparison between autoregressive and masked pretraining
[52] Enabling autoregressive models to fill in masked tokens
[53] Unilmv2: Pseudo-masked language models for unified language model pre-training
[54] AntLM: Bridging Causal and Masked Language Models
[55] Relative position prediction as pre-training for text encoders
[56] Mask-Enhanced Autoregressive Prediction: Pay Less Attention to Learn More
[57] Fusing Sentence Embeddings Into LSTM-based Autoregressive Language Models
Biphasic CLM-then-MLM pretraining strategy
The authors propose and validate a two-stage pretraining approach that applies Causal Language Modeling first and Masked Language Modeling second, demonstrating that this sequential strategy outperforms MLM-only training under a fixed compute budget.
[54] AntLM: Bridging Causal and Masked Language Models
[8] What language model architecture and pretraining objective works best for zero-shot generalization?
[12] What Language Model Architecture and Pretraining Objective Work Best for Zero-Shot Generalization?
[40] Causal and Masked Language Modeling of Javanese Language using Transformer-based Architectures
[62] Generative Audio Language Modeling with Continuous-valued Tokens and Masked Next-Token Prediction
[63] NormFormer: Improved Transformer Pretraining with Extra Normalization
[64] Mask more and mask later: Efficient pre-training of masked language models by disentangling the [MASK] token
[65] Unified Multimodal Pre-training and Prompt-based Tuning for Vision-Language Understanding and Generation
[66] Heptapod: Language Modeling on Visual Signals
[67] EmbedTurk: Leveraging Large Language Models as Text Encoders for Turkish Language
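As a rough illustration of how a biphasic recipe like the one above might allocate a fixed compute budget, the sketch below splits a total step count between a CLM phase and a subsequent MLM phase run on the same weights. The 75/25 split is an assumption for illustration only, not a ratio reported by the paper:

```python
def biphasic_schedule(total_steps, clm_fraction=0.75):
    """Split a fixed step budget into a CLM phase followed by an MLM phase.

    The default 75/25 split is an illustrative assumption, not the
    paper's reported optimum.
    """
    clm_steps = int(total_steps * clm_fraction)
    mlm_steps = total_steps - clm_steps          # remainder goes to MLM
    # Phase 1 trains with the causal objective; phase 2 continues the
    # same parameters with the masked objective.
    return [("clm", clm_steps), ("mlm", mlm_steps)]

for objective, steps in biphasic_schedule(100_000):
    print(f"{objective}: {steps} steps")
```

Keeping the total step count fixed is what makes the comparison against MLM-only training a fixed-compute one, rather than a comparison that grants the sequential recipe extra training.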
Demonstration that CLM-to-MLM continued pretraining outperforms MLM-only training
The authors show that applying MLM continued pretraining to decoder models initially trained with CLM yields better text representations than continuing to train models pretrained with MLM from the start, suggesting that leveraging existing pretrained decoders is the most effective path to strong encoders.