Common Corpus: The Largest Collection of Ethical Data for LLM Pre-Training
Overview
Overall Novelty Assessment
The paper introduces Common Corpus, a two-trillion-token dataset assembled exclusively from uncopyrighted or permissively licensed sources for LLM pretraining. It resides in the 'Open Licensing and Copyright Compliance' leaf alongside two sibling papers, indicating a relatively sparse but strategically important research direction. This leaf sits within the broader 'Ethical Compliance and Licensing Frameworks' branch, which addresses legal and governance challenges distinct from technical curation pipelines or post-hoc safety interventions covered in neighboring branches.
The taxonomy reveals that Common Corpus occupies a niche at the intersection of legal compliance and large-scale corpus assembly. Neighboring leaves address FAIR principles, code-specific licensing challenges, and transparency practices, while sibling branches cover web-scale filtering methodologies and domain-specific curation. The paper's emphasis on proactive copyright clearance distinguishes it from works in 'Dataset Construction and Curation Methodologies' that prioritize technical quality over legal provenance, and from 'Safety and Ethical Content Interventions' that focus on toxicity filtering rather than licensing.
Across the thirty candidates examined in total (ten per contribution), the Common Corpus dataset contribution drew one refuting candidate, suggesting some prior work on open-licensed pretraining corpora. The custom curation tools and the pretrained models each drew zero refutations from their ten candidates, indicating these aspects may be more novel or less directly comparable to the existing literature. Given the limited search scope, these statistics reflect overlap with the top semantic matches rather than exhaustive coverage of the field, and the single refutation for the dataset contribution likely points to incremental differences in scale or scope rather than fundamental conceptual overlap.
Given the constrained search of thirty candidates, the work appears to advance open-licensed corpus assembly through scale and multilingual breadth, though at least one prior effort addresses similar licensing concerns. The taxonomy structure confirms this is an emerging rather than saturated direction, with only three papers in the leaf. The analysis captures semantic proximity but cannot rule out additional relevant work outside the top-thirty matches or in adjacent research communities.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors present Common Corpus, a dataset of approximately two trillion tokens composed entirely of uncopyrighted or permissively licensed data. This dataset is designed for compliant LLM pre-training under strict AI regulations and includes multilingual content from diverse sources such as government documents, cultural heritage texts, scientific papers, code, and web data.
The authors developed and will release several specialized tools for dataset curation, including Segmentext for text segmentation, OCRonos for OCR correction, Celadon for multilingual toxicity detection, and custom pipelines for PII removal. These tools address challenges specific to processing multilingual, historical, and digitized documents.
The authors train two Llama-based models (350M and 1.2B parameters) on Common Corpus and demonstrate that these models achieve comparable performance to existing multilingual models on benchmarks such as MultiBLiMP, XStoryCloze, and XCOPA, validating the dataset's suitability for multilingual pre-training.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[4] Towards Best Practices for Open Datasets for LLM Training
[16] Toxicity of the commons: Curating open-source pre-training data
Contribution Analysis
Detailed comparisons for each claimed contribution
Common Corpus dataset
The authors present Common Corpus, a dataset of approximately two trillion tokens composed entirely of uncopyrighted or permissively licensed data. This dataset is designed for compliant LLM pre-training under strict AI regulations and includes multilingual content from diverse sources such as government documents, cultural heritage texts, scientific papers, code, and web data. A minimal sketch of the license-based filtering such a corpus implies follows the comparison list below.
[54] The KL3M Data Project: Copyright-Clean Training Resources for Large Language Models
[1] RedPajama: An open dataset for training large language models
[4] Towards Best Practices for Open Datasets for LLM Training
[49] The Stack: 3 TB of permissively licensed source code
[50] GPT-NeoX-20B: An Open-Source Autoregressive Language Model
[51] DeepSeek-Coder: When the Large Language Model Meets Programming - The Rise of Code Intelligence
[52] Multilingual language model pretraining using machine-translated data
[53] Zyda: A 1.3T dataset for open language modeling
[55] Meltemi: The first open Large Language Model for Greek
[56] Latxa: An Open Language Model and Evaluation Suite for Basque
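To ground the licensing claim, the sketch below shows whitelist-based license filtering, the basic mechanism that assembling a permissively licensed corpus implies. It is a minimal illustration, not the authors' pipeline: the metadata schema (the "license" and "text" fields) and the whitelist contents are assumptions.

```python
# Minimal license-whitelist filter. The schema ("license", "text") and the
# whitelist contents are illustrative assumptions, not the paper's pipeline.
PERMISSIVE_LICENSES = {
    "public-domain", "cc0-1.0", "cc-by-4.0", "cc-by-sa-4.0",
    "mit", "apache-2.0", "bsd-3-clause",
}

def keep_document(doc: dict) -> bool:
    """Keep a document only if its license tag is on the permissive whitelist."""
    tag = (doc.get("license") or "").strip().lower()
    return tag in PERMISSIVE_LICENSES

corpus = [
    {"text": "A government report ...", "license": "CC0-1.0"},
    {"text": "A scraped news page ...", "license": "unknown"},
]

filtered = [doc for doc in corpus if keep_document(doc)]
print(len(filtered))  # -> 1: only the CC0 document survives
```

In practice the hard part sits upstream of this filter: establishing reliable license metadata for each source, which is exactly what proactive copyright clearance contributes.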
Custom tools for data curation
The authors developed and will release several specialized tools for dataset curation, including Segmentext for text segmentation, OCRonos for OCR correction, Celadon for multilingual toxicity detection, and custom pipelines for PII removal. These tools address challenges specific to processing multilingual, historical, and digitized documents. An illustrative sketch of one such curation stage follows the comparison list below.
[57] MultiOCR-QA: Dataset for Evaluating Robustness of LLMs in Question Answering on Multilingual OCR Texts
[58] MELHISSA: A multilingual entity linking architecture for historical press articles
[59] Analytic study of the preprocessing methods impact on historical document analysis and classification
[60] Cross-lingual search in pre-processed archival facsimile documents
[61] The learnable typewriter: A generative approach to text analysis
[62] The adaptability of a transformer-based OCR model for historical documents
[63] Novel perspectives for the management of multilingual and multialphabetic heritages through automatic knowledge extraction: The DigitalMaktaba approach
[64] Multilingual character segmentation and recognition schemes for Indian document images
[65] Robust and Multilingual Analysis of Historical Documents
[66] Multilingual research projects: Non-Latin script challenges for making use of standards, authority files, and character recognition
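As a concrete illustration of one stage such a pipeline covers, the sketch below is a crude regex-based PII-scrubbing pass. It stands in for the paper's custom PII-removal pipeline, whose actual implementation is not described here; both patterns are deliberately simple assumptions.

```python
import re

# Crude, illustrative PII scrubber; a stand-in for the paper's custom
# PII-removal pipeline, whose real implementation is not reproduced here.
EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")
PHONE = re.compile(r"\+?\d[\d .()-]{7,}\d")  # permissive phone-like digit runs

def scrub_pii(text: str) -> str:
    """Replace e-mail addresses and phone-like digit runs with placeholders."""
    text = EMAIL.sub("<EMAIL>", text)
    text = PHONE.sub("<PHONE>", text)
    return text

print(scrub_pii("Contact jane.doe@example.org or +1 555-123-4567."))
# -> Contact <EMAIL> or <PHONE>.
```

Production-grade removal on multilingual, OCR-noisy text is far harder than this, which is why dedicated tools of the kind listed above exist.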
Pre-trained language models on Common Corpus
The authors train two Llama-based models (350M and 1.2B parameters) on Common Corpus and demonstrate that these models achieve comparable performance to existing multilingual models on benchmarks such as MultiBLiMP, XStoryCloze, and XCOPA, validating the dataset's suitability for multilingual pre-training.
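The core mechanic behind MultiBLiMP-style benchmarks is a minimal-pair log-likelihood comparison: the model should assign higher total log-probability to the grammatical sentence of a pair. A minimal sketch with Hugging Face transformers, using gpt2 only as a placeholder since the released checkpoints' hub identifiers are not listed here:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# "gpt2" is a placeholder; one would load the 350M or 1.2B Common Corpus
# checkpoints instead.
model_id = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).eval()

def total_logprob(text: str) -> float:
    """Sum of per-token log-probabilities of `text` under the model."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean NLL over predicted tokens
    return -loss.item() * (ids.shape[1] - 1)

grammatical = "The keys to the cabinet are on the table."
ungrammatical = "The keys to the cabinet is on the table."
print(total_logprob(grammatical) > total_logprob(ungrammatical))  # expect True
```

XStoryCloze and XCOPA follow the same likelihood-scoring pattern, ranking candidate story endings or causes rather than minimal pairs.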