CDTP: A Large-Scale Chinese Data-Text Pair Dataset for Comprehensive Evaluation of Chinese LLMs
Overview
Overall Novelty Assessment
The paper introduces CDTP, a dataset of over 7 million Chinese data-text pairs aligned with 15 million triples across four domains, and CB-ECLLM, a benchmark for evaluating Chinese LLMs on structured knowledge tasks. Within the taxonomy, it occupies the 'Data-Text Alignment' leaf under 'Structured Data Understanding', where it is currently the sole representative among 50 surveyed papers. This positioning places the work in a relatively sparse research direction focused specifically on the correspondence between structured representations and natural language, distinct from the more populated domain-specific evaluation branches (e.g., 7 papers in Traditional Chinese Medicine, 5 in Financial Domain).
The taxonomy reveals that most neighboring work concentrates on domain-specific benchmarks (Medical, Legal, Financial) or general multi-subject assessments (C-Eval, M3KE), which evaluate end-to-end task performance rather than data-text alignment primitives. The 'Knowledge Engineering' branch addresses knowledge graph construction and querying but focuses on extraction and retrieval rather than paired alignment datasets. The 'Constrained Text Generation' sibling leaf examines format adherence but excludes the bidirectional data-text correspondence that CDTP emphasizes. This structural isolation suggests the paper targets a methodological gap between broad evaluation suites and specialized domain benchmarks.
Among 29 candidates examined across three contributions, no clearly refuting prior work was identified. The CDTP dataset contribution examined 10 candidates with zero refutable matches, as did the CB-ECLLM benchmark (10 candidates, zero refutable). The multi-dimensional evaluation framework analyzed 9 candidates, also yielding no refutations. This absence of overlapping prior work within the limited search scope suggests that large-scale Chinese data-text pair datasets with explicit triple alignment remain underexplored in the surveyed literature, though the search scale (29 papers) leaves open the possibility of relevant work beyond top-K semantic matches.
Based on the limited literature search, the work appears to address a relatively unoccupied niche within Chinese LLM evaluation, focusing on structured alignment rather than domain mastery or general reasoning. The taxonomy structure confirms that data-text correspondence as a standalone evaluation dimension has received less attention than domain-specific or multidisciplinary benchmarks. However, the analysis covers only the top-ranked semantic candidates (29 papers in total) and does not constitute an exhaustive survey of all Chinese NLP datasets or alignment methods, so conclusions about absolute novelty remain provisional.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors construct a large-scale dataset of over 7 million aligned Chinese data-text pairs, each consisting of an unstructured text passage and its corresponding structured triples (15 million triples in total). The dataset spans four domains: History and Politics, Humanities and Society, Technology and Economics, and Nature and Environment, addressing the scarcity of structured annotations in Chinese corpora.
The authors present CB-ECLLM, a benchmark designed to evaluate Chinese LLMs on tasks that capture linguistic challenges unique to Chinese. The benchmark comprises three tasks—Knowledge Graph Completion, Triple-to-Text Generation, and Question Answering—and is built on the CDTP dataset to provide systematic, knowledge-driven evaluation.
The authors conduct comprehensive experiments to evaluate Chinese LLMs across three dimensions: effectiveness on various tasks and datasets, the impact of Supervised Fine-Tuning (SFT) on model performance, and robustness when tested on out-of-distribution data. This multi-faceted evaluation provides insights into model generalization and stability.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
Contribution Analysis
Detailed comparisons for each claimed contribution
Chinese Data-Text Pair (CDTP) Dataset
The authors construct a large-scale dataset of over 7 million aligned Chinese data-text pairs, each consisting of an unstructured text passage and its corresponding structured triples (15 million triples in total). The dataset spans four domains: History and Politics, Humanities and Society, Technology and Economics, and Nature and Environment, addressing the scarcity of structured annotations in Chinese corpora.
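The pair structure described above can be sketched as a minimal schema. The class and field names below, the validation helper, and the example record are illustrative assumptions, not the released CDTP format.

```python
from dataclasses import dataclass

# Hypothetical sketch of a CDTP-style record: the paper describes pairs of
# unstructured Chinese text aligned with structured (head, relation, tail)
# triples. Field names here are assumed for illustration.

@dataclass
class Triple:
    head: str
    relation: str
    tail: str

@dataclass
class DataTextPair:
    text: str          # unstructured Chinese passage
    triples: list      # aligned Triple objects (at least one per pair)
    domain: str        # one of the four CDTP domains

# The four domains named in the paper.
DOMAINS = {
    "History and Politics",
    "Humanities and Society",
    "Technology and Economics",
    "Nature and Environment",
}

def validate(pair: DataTextPair) -> bool:
    """Minimal sanity check: non-empty text, at least one aligned triple,
    and a domain drawn from the four listed in the paper."""
    return bool(pair.text) and len(pair.triples) >= 1 and pair.domain in DOMAINS

example = DataTextPair(
    text="长江是亚洲最长的河流。",  # "The Yangtze is the longest river in Asia."
    triples=[Triple("长江", "位于", "亚洲")],
    domain="Nature and Environment",
)
```

A record failing any of the three checks (empty text, no triples, unknown domain) would be rejected by `validate`.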
[69] Integrating knowledge graphs with ancient Chinese medicine classics: challenges and future prospects of multi-agent system convergence
[70] Nanjing Yunjin intelligent question-answering system based on knowledge graphs and retrieval augmented generation technology
[71] A Systematic Literature Review on RDF Triple Generation From Natural Language Texts
[72] From Discrimination to Generation: Knowledge Graph Completion with Generative Transformer
[73] DuIE: A large-scale Chinese dataset for information extraction
[74] Joint semantic embedding with structural knowledge and entity description for knowledge representation learning
[75] KdConv: A Chinese multi-domain dialogue dataset towards multi-turn knowledge-driven conversation
[76] TCM-KLLaMA: Intelligent generation model for Traditional Chinese Medicine prescriptions based on knowledge graph and large language model
[77] Triple-to-text generation with an anchor-to-prototype framework
[78] CATS: A Pragmatic Chinese Answer-to-Sequence Dataset with Large Scale and High Quality
Comprehensive Benchmark for Evaluating Chinese LLMs (CB-ECLLM)
The authors present CB-ECLLM, a benchmark designed to evaluate Chinese LLMs on tasks that capture linguistic challenges unique to Chinese. The benchmark comprises three tasks—Knowledge Graph Completion, Triple-to-Text Generation, and Question Answering—and is built on the CDTP dataset to provide systematic, knowledge-driven evaluation.
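The three tasks lend themselves to simple prompt templates. The wording and function names below are a hypothetical sketch of how such prompts might be assembled from CDTP pairs, not the benchmark's actual prompts.

```python
# Illustrative prompt builders for the three CB-ECLLM task types described
# above: Knowledge Graph Completion (KGC), Triple-to-Text, and Question
# Answering. The Chinese instruction wording is assumed for illustration.

def kgc_prompt(head: str, relation: str) -> str:
    """Knowledge Graph Completion: ask the model for the missing tail entity."""
    return f"补全三元组：({head}, {relation}, ?)。请给出尾实体。"

def triple_to_text_prompt(triples: list) -> str:
    """Triple-to-Text: ask the model to verbalize all triples fluently."""
    listed = "；".join(f"({h}, {r}, {t})" for h, r, t in triples)
    return f"请将以下三元组改写为通顺的中文句子：{listed}"

def qa_prompt(context: str, question: str) -> str:
    """Question Answering grounded in the paired unstructured text."""
    return f"阅读以下内容并回答问题。\n内容：{context}\n问题：{question}"
```

In this sketch, the same aligned pair can seed all three tasks: its triples feed `kgc_prompt` and `triple_to_text_prompt`, while its text serves as the `qa_prompt` context.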
[3] LAiW: A Chinese legal large language models benchmark
[29] C²RBench: A Chinese Complex Reasoning Benchmark for Large Language Models
[61] MulCogBench: a multi-modal cognitive benchmark dataset for evaluating Chinese and English computational language models
[62] AlignBench: Benchmarking Chinese alignment of large language models
[63] CHAmbi: A New Benchmark on Chinese Ambiguity Challenges for Large Language Models
[64] MulCogBench: a multi-modal cognitive benchmark dataset for evaluating Chinese and English computational language models
[65] Paradox of poetic intent in back-translation: evaluating the quality of large language models in Chinese translation
[66] Comparative evaluation of commercial large language models on PromptBench: an English and Chinese perspective
[67] WenMind: A comprehensive benchmark for evaluating large language models in Chinese classical literature and language arts
[68] AC-EVAL: Evaluating ancient Chinese language understanding in large language models
Multi-Dimensional Evaluation Framework
The authors conduct comprehensive experiments to evaluate Chinese LLMs across three dimensions: effectiveness on various tasks and datasets, the impact of Supervised Fine-Tuning (SFT) on model performance, and robustness when tested on out-of-distribution data. This multi-faceted evaluation provides insights into model generalization and stability.
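The three evaluation dimensions can be expressed as a toy harness. The function names, the exact-match metric, and the report fields below are simplifying assumptions for illustration, not the authors' actual evaluation protocol.

```python
# Minimal sketch of a three-dimensional evaluation: task effectiveness of a
# base model, the gain from Supervised Fine-Tuning (SFT), and the SFT model's
# drop on out-of-distribution (OOD) data. `model_fn` stands in for any
# Chinese LLM inference callable mapping a prompt to an answer string; the
# metric is simplified to exact match.

def exact_match(pred: str, gold: str) -> float:
    return 1.0 if pred.strip() == gold.strip() else 0.0

def evaluate(model_fn, examples):
    """Average exact-match score of `model_fn` over (prompt, answer) pairs."""
    if not examples:
        return 0.0
    return sum(exact_match(model_fn(p), a) for p, a in examples) / len(examples)

def three_way_report(base_fn, sft_fn, in_dist, out_dist):
    """Compare a base model and its SFT variant on in-distribution data,
    and measure the SFT model's degradation on out-of-distribution data."""
    base = evaluate(base_fn, in_dist)
    sft = evaluate(sft_fn, in_dist)
    ood = evaluate(sft_fn, out_dist)
    return {
        "effectiveness": base,            # dimension 1: raw task performance
        "sft_gain": sft - base,           # dimension 2: impact of SFT
        "ood_robustness_gap": sft - ood,  # dimension 3: larger gap = less robust
    }
```

With real models, `base_fn` and `sft_fn` would wrap inference calls and `in_dist`/`out_dist` would hold held-in and held-out CDTP examples; the report then summarizes all three dimensions in one pass.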