Common Corpus: The Largest Collection of Ethical Data for LLM Pre-Training
Overview
Overall Novelty Assessment
The paper introduces Common Corpus, a two-trillion-token dataset assembled exclusively from uncopyrighted or permissively licensed sources for LLM pretraining. It resides in the 'Open Licensing and Copyright Compliance' leaf alongside two sibling papers, indicating a relatively sparse but strategically important research direction. This leaf sits within the broader 'Ethical Compliance and Licensing Frameworks' branch, which addresses legal and governance challenges distinct from technical curation pipelines or post-hoc safety interventions covered in neighboring branches.
The taxonomy reveals that Common Corpus occupies a niche at the intersection of legal compliance and large-scale corpus assembly. Neighboring leaves address FAIR principles, code-specific licensing challenges, and transparency practices, while sibling branches cover web-scale filtering methodologies and domain-specific curation. The paper's emphasis on proactive copyright clearance distinguishes it from works in 'Dataset Construction and Curation Methodologies' that prioritize technical quality over legal provenance, and from 'Safety and Ethical Content Interventions' that focus on toxicity filtering rather than licensing.
Across the thirty candidates examined in total (ten per contribution), the Common Corpus dataset contribution drew one refuting candidate, suggesting some prior work on open-licensed pretraining corpora. The custom curation tools and the pretrained models each drew zero refutations from their ten candidates, indicating these aspects may be more novel or less directly comparable to the existing literature. Given the limited search scope, these statistics reflect overlap with the top semantic matches rather than exhaustive coverage of the field, and the single refutation for the dataset contribution likely points to incremental differences in scale or scope rather than fundamental conceptual overlap.
Given the constrained search of thirty candidates, the work appears to advance open-licensed corpus assembly through scale and multilingual breadth, though at least one prior effort addresses similar licensing concerns. The taxonomy structure confirms this is an emerging rather than saturated direction, with only three papers in the leaf. The analysis captures semantic proximity but cannot rule out additional relevant work outside the top-thirty matches or in adjacent research communities.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors present Common Corpus, a dataset of approximately two trillion tokens composed entirely of uncopyrighted or permissively licensed data. This dataset is designed for compliant LLM pre-training under strict AI regulations and includes multilingual content from diverse sources such as government documents, cultural heritage texts, scientific papers, code, and web data.
The authors developed and will release several specialized tools for dataset curation, including Segmentext for text segmentation, OCRonos for OCR correction, Celadon for multilingual toxicity detection, and custom pipelines for PII removal. These tools address challenges specific to processing multilingual, historical, and digitized documents.
The authors train two Llama-based models (350M and 1.2B parameters) on Common Corpus and demonstrate that these models achieve comparable performance to existing multilingual models on benchmarks such as MultiBLiMP, XStoryCloze, and XCOPA, validating the dataset's suitability for multilingual pre-training.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[4] Towards Best Practices for Open Datasets for LLM Training
[16] Toxicity of the commons: Curating open-source pre-training data
Contribution Analysis
Detailed comparisons for each claimed contribution
Common Corpus dataset
The authors present Common Corpus, a dataset of approximately two trillion tokens composed entirely of uncopyrighted or permissively licensed data. This dataset is designed for compliant LLM pre-training under strict AI regulations and includes multilingual content from diverse sources such as government documents, cultural heritage texts, scientific papers, code, and web data. A minimal sketch of the license-based filtering such a corpus implies follows the comparison list below.
[54] The KL3M Data Project: Copyright-Clean Training Resources for Large Language Models
[1] RedPajama: An open dataset for training large language models
[4] Towards Best Practices for Open Datasets for LLM Training
[49] The Stack: 3 TB of permissively licensed source code
[50] GPT-NeoX-20B: An Open-Source Autoregressive Language Model
[51] DeepSeek-Coder: When the Large Language Model Meets Programming - The Rise of Code Intelligence
[52] Multilingual language model pretraining using machine-translated data
[53] Zyda: A 1.3T dataset for open language modeling
[55] Meltemi: The first open Large Language Model for Greek
[56] Latxa: An Open Language Model and Evaluation Suite for Basque
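To ground the licensing claim, the sketch below shows whitelist-based license filtering, the basic mechanism that assembling a permissively licensed corpus implies. It is a minimal illustration, not the authors' pipeline: the metadata schema (the "license" and "text" fields) and the whitelist contents are assumptions.

```python
# Minimal license-whitelist filter. The schema ("license", "text") and the
# whitelist contents are illustrative assumptions, not the paper's pipeline.
PERMISSIVE_LICENSES = {
    "public-domain", "cc0-1.0", "cc-by-4.0", "cc-by-sa-4.0",
    "mit", "apache-2.0", "bsd-3-clause",
}

def keep_document(doc: dict) -> bool:
    """Keep a document only if its license tag is on the permissive whitelist."""
    tag = (doc.get("license") or "").strip().lower()
    return tag in PERMISSIVE_LICENSES

corpus = [
    {"text": "A government report ...", "license": "CC0-1.0"},
    {"text": "A scraped news page ...", "license": "unknown"},
]

filtered = [doc for doc in corpus if keep_document(doc)]
print(len(filtered))  # -> 1: only the CC0 document survives
```

In practice the hard part sits upstream of this filter: establishing reliable license metadata for each source, which is exactly what proactive copyright clearance contributes.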
Custom tools for data curation
The authors developed and will release several specialized tools for dataset curation, including Segmentext for text segmentation, OCRonos for OCR correction, Celadon for multilingual toxicity detection, and custom pipelines for PII removal. These tools address challenges specific to processing multilingual, historical, and digitized documents. An illustrative sketch of one such curation stage follows the comparison list below.
[57] MultiOCR-QA: Dataset for Evaluating Robustness of LLMs in Question Answering on Multilingual OCR Texts
[58] MELHISSA: A multilingual entity linking architecture for historical press articles
[59] Analytic study of the preprocessing methods impact on historical document analysis and classification
[60] Cross-lingual search in pre-processed archival facsimile documents
[61] The learnable typewriter: A generative approach to text analysis
[62] The adaptability of a transformer-based OCR model for historical documents
[63] Novel perspectives for the management of multilingual and multialphabetic heritages through automatic knowledge extraction: The DigitalMaktaba approach
[64] Multilingual character segmentation and recognition schemes for Indian document images
[65] Robust and Multilingual Analysis of Historical Documents
[66] Multilingual research projects: Non-Latin script challenges for making use of standards, authority files, and character recognition
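As a concrete illustration of one stage such a pipeline covers, the sketch below is a crude regex-based PII-scrubbing pass. It stands in for the paper's custom PII-removal pipeline, whose actual implementation is not described here; both patterns are deliberately simple assumptions.

```python
import re

# Crude, illustrative PII scrubber; a stand-in for the paper's custom
# PII-removal pipeline, whose real implementation is not reproduced here.
EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")
PHONE = re.compile(r"\+?\d[\d .()-]{7,}\d")  # permissive phone-like digit runs

def scrub_pii(text: str) -> str:
    """Replace e-mail addresses and phone-like digit runs with placeholders."""
    text = EMAIL.sub("<EMAIL>", text)
    text = PHONE.sub("<PHONE>", text)
    return text

print(scrub_pii("Contact jane.doe@example.org or +1 555-123-4567."))
# -> Contact <EMAIL> or <PHONE>.
```

Production-grade removal on multilingual, OCR-noisy text is far harder than this, which is why dedicated tools of the kind listed above exist.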
Pre-trained language models on Common Corpus
The authors train two Llama-based models (350M and 1.2B parameters) on Common Corpus and demonstrate that these models achieve comparable performance to existing multilingual models on benchmarks such as MultiBLiMP, XStoryCloze, and XCOPA, validating the dataset's suitability for multilingual pre-training.
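The core mechanic behind MultiBLiMP-style benchmarks is a minimal-pair log-likelihood comparison: the model should assign higher total log-probability to the grammatical sentence of a pair. A minimal sketch with Hugging Face transformers, using gpt2 only as a placeholder since the released checkpoints' hub identifiers are not listed here:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# "gpt2" is a placeholder; one would load the 350M or 1.2B Common Corpus
# checkpoints instead.
model_id = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).eval()

def total_logprob(text: str) -> float:
    """Sum of per-token log-probabilities of `text` under the model."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean NLL over predicted tokens
    return -loss.item() * (ids.shape[1] - 1)

grammatical = "The keys to the cabinet are on the table."
ungrammatical = "The keys to the cabinet is on the table."
print(total_logprob(grammatical) > total_logprob(ungrammatical))  # expect True
```

XStoryCloze and XCOPA follow the same likelihood-scoring pattern, ranking candidate story endings or causes rather than minimal pairs.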