Common Corpus: The Largest Collection of Ethical Data for LLM Pre-Training

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: dataset, pre-training, large language models, open data, open science, multilingual
Abstract:

Large Language Models (LLMs) are pre-trained on vast data drawn from many sources and domains. These corpora typically contain trillions of tokens, large portions of which are copyrighted or proprietary content, which hinders the use of such models under AI legislation. This raises the need for truly open pre-training data that complies with data security regulations. In this paper, we introduce Common Corpus, the largest open dataset for LLM pre-training. The data assembled in Common Corpus are either uncopyrighted or under permissive licenses and amount to about two trillion tokens. The dataset covers a wide variety of languages, from high-resource European languages to low-resource languages rarely represented in pre-training datasets, and it includes a large portion of code data. The diversity of data sources in terms of domains and time periods opens paths for both research and entrepreneurial needs across diverse areas of knowledge. We detail the provenance of the assembled data and describe the dataset filtering and curation process. We train two small language models on Common Corpus and find that they perform comparably to other models of their size, indicating that the dataset is suitable for multilingual pre-training. Common Corpus represents a key contribution to the ecosystem for open-science research on large language models.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. The results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work, and the current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces Common Corpus, a two-trillion-token dataset assembled exclusively from uncopyrighted or permissively licensed sources for LLM pretraining. It resides in the 'Open Licensing and Copyright Compliance' leaf alongside two sibling papers, indicating a relatively sparse but strategically important research direction. This leaf sits within the broader 'Ethical Compliance and Licensing Frameworks' branch, which addresses legal and governance challenges distinct from technical curation pipelines or post-hoc safety interventions covered in neighboring branches.

The taxonomy reveals that Common Corpus occupies a niche at the intersection of legal compliance and large-scale corpus assembly. Neighboring leaves address FAIR principles, code-specific licensing challenges, and transparency practices, while sibling branches cover web-scale filtering methodologies and domain-specific curation. The paper's emphasis on proactive copyright clearance distinguishes it from works in 'Dataset Construction and Curation Methodologies' that prioritize technical quality over legal provenance, and from 'Safety and Ethical Content Interventions' that focus on toxicity filtering rather than licensing.

Among the thirty candidates examined, the Common Corpus dataset contribution has one refutable candidate among its ten retrieved matches, suggesting some prior work on open-licensed pretraining corpora. The custom curation tools and pretrained-model contributions were each compared against ten candidates with zero refutations, indicating these aspects may be more novel or less directly comparable to the existing literature. Because the search scope is limited, these statistics reflect overlap with the top semantic matches rather than exhaustive field coverage, and the single refutation for the dataset contribution likely points to incremental differences in scale or scope rather than fundamental conceptual overlap.

Given the constrained search of thirty candidates, the work appears to advance open-licensed corpus assembly through scale and multilingual breadth, though at least one prior effort addresses similar licensing concerns. The taxonomy structure confirms this is an emerging rather than saturated direction, with only three papers in the leaf. The analysis captures semantic proximity but cannot rule out additional relevant work outside the top-thirty matches or in adjacent research communities.

Taxonomy

Core-task Taxonomy Papers: 48
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Paper: 1

Research Landscape Overview

Core task: Creating open and ethically compliant datasets for large language model pretraining. The field has organized itself around five main branches that reflect both technical and normative concerns. Dataset Construction and Curation Methodologies addresses the practical challenges of assembling, filtering, and documenting large-scale corpora, with works like Redpajama[1] and LLM360[17] exemplifying transparent data pipelines. Ethical Compliance and Licensing Frameworks focuses on legal and governance issues: ensuring that datasets respect copyright, adhere to open licensing standards, and meet emerging regulatory requirements such as those discussed in Open Washing EU[6] and FAIR Compliant Dataset[2]. Safety and Ethical Content Interventions deals with filtering harmful material and mitigating toxicity during pretraining, as seen in Safer Pretraining Filtering[8] and Toxicity Commons[16]. Ethical Alignment and Value-Driven Development explores how datasets can encode normative principles, drawing on participatory methods like Collective Constitutional AI[9] and Ethical Rules Generation[10]. Finally, Applications and Emerging Use Cases examines domain-specific deployments in healthcare, legal, and other specialized contexts.

A particularly active tension runs between the ambition for fully open, reproducible datasets and the practical difficulties of copyright clearance and content moderation at web scale, as highlighted by Web Mined Corpora Challenges[5] and Open Dataset Best Practices[4].

Common Corpus[0] sits squarely within the Ethical Compliance and Licensing Frameworks branch, emphasizing open licensing and copyright compliance as foundational to responsible pretraining. It shares this branch with works like Open Dataset Best Practices[4], which articulates community norms for transparency and legal rigor, and Toxicity Commons[16], which addresses content safety within an open-data paradigm. Compared to earlier efforts such as OPT[3] or Bigscience[15], which pioneered large-scale openness but faced subsequent scrutiny over licensing gaps, Common Corpus[0] represents a more deliberate effort to preemptively address legal and ethical constraints, reflecting the field's maturation toward proactive compliance rather than retroactive remediation.

Claimed Contributions

Common Corpus dataset

The authors present Common Corpus, a dataset of approximately two trillion tokens composed entirely of uncopyrighted or permissively licensed data. This dataset is designed for compliant LLM pre-training under strict AI regulations and includes multilingual content from diverse sources such as government documents, cultural heritage texts, scientific papers, code, and web data.

10 retrieved papers (one candidate can refute this contribution)
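As a rough illustration of how a corpus of this kind might be consumed for pre-training, the sketch below streams documents with the Hugging Face `datasets` library. The repository identifier and the `text` field name are assumptions made for this example, not details confirmed by the paper; substitute whatever identifiers the actual release uses.

```python
# Minimal sketch: stream a permissively licensed corpus from the Hugging Face Hub.
# The repository id and the "text" field name are assumptions for illustration only.
from datasets import load_dataset

dataset = load_dataset(
    "PleIAs/common_corpus",  # assumed hub id, not confirmed by the paper
    split="train",
    streaming=True,          # avoid downloading ~2T tokens up front
)

# Peek at a handful of documents without materializing the full dataset.
for i, record in enumerate(dataset):
    text = record.get("text", "")  # assumed field name
    print(f"{len(text.split()):>6} whitespace tokens | {text[:80]!r}")
    if i >= 4:
        break
```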
Custom tools for data curation

The authors developed and will release several specialized tools for dataset curation, including Segmentext for text segmentation, OCRonos for OCR correction, Celadon for multilingual toxicity detection, and custom pipelines for PII removal. These tools address challenges specific to processing multilingual, historical, and digitized documents.

10 retrieved papers (no refutations)
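The released tools themselves are not reproduced here; as a generic stand-in, the sketch below shows the shape of a document-level curation pass that masks obvious PII with regular expressions and filters on a pluggable toxicity score. It is a simplified illustration, not the authors' Segmentext, OCRonos, or Celadon implementations.

```python
# Generic sketch of a document-level curation pass: mask obvious PII with regexes
# and drop documents whose toxicity score exceeds a threshold. This is a simplified
# stand-in, NOT the authors' Segmentext/OCRonos/Celadon or PII pipelines.
import re
from typing import Callable

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def mask_pii(text: str) -> str:
    """Replace e-mail addresses and phone-like numbers with placeholder tokens."""
    text = EMAIL_RE.sub("<EMAIL>", text)
    text = PHONE_RE.sub("<PHONE>", text)
    return text

def curate(docs, toxicity_score: Callable[[str], float], threshold: float = 0.8):
    """Yield PII-masked documents whose toxicity score stays below the threshold."""
    for doc in docs:
        cleaned = mask_pii(doc)
        if toxicity_score(cleaned) < threshold:
            yield cleaned

# Example with a dummy scorer; a real pipeline would plug in a trained classifier
# (the paper's Celadon model is multilingual, which this placeholder is not).
sample = ["Contact me at jane.doe@example.org or +1 555 123 4567."]
print(list(curate(sample, toxicity_score=lambda t: 0.0)))
```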
Pre-trained language models on Common Corpus

The authors train two Llama-based models (350M and 1.2B parameters) on Common Corpus and demonstrate that these models achieve comparable performance to existing multilingual models on benchmarks such as MultiBLiMP, XStoryCloze, and XCOPA, validating the dataset's suitability for multilingual pre-training.

10 retrieved papers (no refutations)
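The paper does not name its evaluation tooling; assuming something like EleutherAI's lm-evaluation-harness is used, a zero-shot run over XCOPA and XStoryCloze could look like the sketch below. The model identifier is a placeholder, and MultiBLiMP is omitted because it may require a separate harness.

```python
# Sketch of a zero-shot multilingual evaluation with EleutherAI's lm-evaluation-harness.
# The tooling and the model id are assumptions for illustration; the paper reports
# MultiBLiMP, XStoryCloze, and XCOPA results but does not name its evaluation stack.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=your-org/common-corpus-1.2b,dtype=bfloat16",  # placeholder checkpoint
    tasks=["xcopa", "xstorycloze"],  # MultiBLiMP not bundled here, so omitted
    num_fewshot=0,
    batch_size=8,
)

# Print per-task metrics (accuracy and related scores) for a quick comparison table.
for task, metrics in results["results"].items():
    print(task, metrics)
```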
