CDTP: A Large-Scale Chinese Data-Text Pair Dataset for Comprehensive Evaluation of Chinese LLMs

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: Chinese Data-Text Pair Dataset, Large Language Model, Chinese Evaluation
Abstract:

Large Language Models (LLMs) have achieved remarkable success across a wide range of natural language processing tasks. However, Chinese LLMs face unique challenges, primarily due to the dominance of unstructured free text and the lack of structured representations in Chinese corpora. Existing benchmarks partially assess Chinese LLMs, but they remain predominantly English-centric, fail to address the unique linguistic characteristics of Chinese, and lack the structured datasets essential for robust evaluation. To address these challenges, we present a Comprehensive Benchmark for Evaluating Chinese Large Language Models (CB-ECLLM) built on the newly constructed Chinese Data-Text Pair (CDTP) dataset. CDTP comprises over 7 million aligned text pairs, each coupling unstructured text with one or more corresponding triples, for a total of 15 million triples spanning four critical domains. The core contributions of CDTP are threefold: (i) enriching Chinese corpora with high-quality structured information; (ii) enabling fine-grained evaluation tailored to knowledge-driven tasks; and (iii) supporting multi-task fine-tuning to assess generalization and robustness across tasks, including Knowledge Graph Completion, Triple-to-Text generation, and Question Answering. Furthermore, we conduct rigorous evaluations through extensive experiments and ablation studies to assess the benchmark's effectiveness, the impact of Supervised Fine-Tuning (SFT), and its robustness. To support reproducible research, we release an open-source codebase and outline directions for future work based on our insights.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces CDTP, a dataset of over 7 million Chinese data-text pairs aligned with 15 million triples across four domains, and CB-ECLLM, a benchmark for evaluating Chinese LLMs on structured knowledge tasks. Within the taxonomy, it occupies the 'Data-Text Alignment' leaf under 'Structured Data Understanding', where it is currently the sole representative among 50 surveyed papers. This positioning places the work in a relatively sparse research direction focused specifically on the correspondence between structured representations and natural language, distinct from the more populated domain-specific evaluation branches (e.g., 7 papers in Traditional Chinese Medicine, 5 in Financial Domain).

The taxonomy reveals that most neighboring work concentrates on domain-specific benchmarks (Medical, Legal, Financial) or general multi-subject assessments (C-Eval, M3KE), which evaluate end-to-end task performance rather than data-text alignment primitives. The 'Knowledge Engineering' branch addresses knowledge graph construction and querying but focuses on extraction and retrieval rather than paired alignment datasets. The 'Constrained Text Generation' sibling leaf examines format adherence but excludes the bidirectional data-text correspondence that CDTP emphasizes. This structural isolation suggests the paper targets a methodological gap between broad evaluation suites and specialized domain benchmarks.

Among 29 candidates examined across three contributions, no clearly refuting prior work was identified. The CDTP dataset contribution examined 10 candidates with zero refutable matches, as did the CB-ECLLM benchmark (10 candidates, zero refutable). The multi-dimensional evaluation framework analyzed 9 candidates, also yielding no refutations. This absence of overlapping prior work within the limited search scope suggests that large-scale Chinese data-text pair datasets with explicit triple alignment remain underexplored in the surveyed literature, though the search scale (29 papers) leaves open the possibility of relevant work beyond top-K semantic matches.

Based on the limited literature search, the work appears to address a relatively unoccupied niche within Chinese LLM evaluation, focusing on structured alignment rather than domain mastery or general reasoning. The taxonomy structure confirms that data-text correspondence as a standalone evaluation dimension has received less attention than domain-specific or multidisciplinary benchmarks. However, the analysis covers only the top-K semantic candidates (29 papers across the three contributions) and does not constitute an exhaustive survey of all Chinese NLP datasets or alignment methods, so conclusions about absolute novelty remain provisional.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 29
Refutable Papers: 0

Research Landscape Overview

Core task: evaluating Chinese large language models on structured knowledge tasks. The field organizes itself around several complementary perspectives:

- Domain-Specific Knowledge Evaluation encompasses specialized benchmarks for medicine (TCM-3CEval[6], MTCMB TCM[18]), law (LawBench[7], LAiW Legal Benchmark[3]), finance (FinEval[9], CFinBench[22]), and other professional areas, each probing how well models handle domain terminology and reasoning.
- General Knowledge Assessment focuses on broad-coverage benchmarks like C-Eval[5] and M3KE[25] that test multidisciplinary understanding.
- Knowledge Engineering explores how models interact with structured representations such as knowledge graphs (TechGPT Knowledge Graph[4], LLM4EduKG[20]) and policy documents (DocPolicyKG[17]).
- Structured Data Understanding examines alignment between data formats and natural language, including data-text tasks and schema-aware query generation (SPARQL Schema Selection[28]).
- Domain Adaptation Methods investigate techniques like knowledge tuning (Knowledge Tuning Medical[2], Medical Knowledge Tuning[15]) and retrieval-augmented generation (OpenTCM GraphRAG[8]) to improve performance on specialized tasks.

Across these branches, a central tension emerges between depth and breadth: domain-specific benchmarks achieve high fidelity within narrow contexts, while general assessments sacrifice granularity for coverage. Many studies also grapple with the challenge of grounding model outputs in verifiable structured knowledge versus relying on parametric memory alone. CDTP[0] sits within the Structured Data Understanding branch, specifically addressing data-text alignment, a relatively focused area compared to the broader domain evaluation efforts.
While works like C-Eval[5] and M3KE[25] cast wide nets over general knowledge, CDTP[0] emphasizes the precise correspondence between structured representations and their linguistic expressions, a concern shared with schema-oriented approaches such as SPARQL Schema Selection[28]. This positioning reflects a methodological choice: rather than evaluating end-to-end task performance across diverse domains, the work targets the foundational capability of interpreting and generating text that faithfully reflects structured data, a prerequisite for robust knowledge-intensive applications in Chinese NLP.

Claimed Contributions

Chinese Data-Text Pair (CDTP) Dataset

The authors construct a large-scale dataset of over 7 million aligned Chinese text pairs, where each pair consists of unstructured text and corresponding structured triples (totaling 15 million triples). The dataset spans four domains: History and Politics, Humanities and Society, Technology and Economics, and Nature and Environment, addressing the scarcity of structured annotations in Chinese corpora.
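The pair structure described above can be sketched as a minimal data model. Note this is an illustrative assumption only: the field names, triple layout, and example content below are hypothetical and do not reflect CDTP's actual schema or data.

```python
# Hypothetical sketch of one data-text pair: a passage of unstructured
# Chinese text aligned with one or more (head, relation, tail) triples.
from dataclasses import dataclass, field
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (head entity, relation, tail entity)

@dataclass
class DataTextPair:
    text: str                                      # unstructured Chinese text
    triples: List[Triple] = field(default_factory=list)
    domain: str = ""                               # e.g. one of the four domains

# A toy example (content invented purely for illustration):
pair = DataTextPair(
    text="长江是亚洲最长的河流。",          # "The Yangtze is the longest river in Asia."
    triples=[("长江", "位于", "亚洲")],      # ("Yangtze", "located in", "Asia")
    domain="Nature and Environment",
)

# Each pair aligns a text span with at least one triple.
assert len(pair.triples) >= 1
```

The key property this sketch captures is the one-to-many alignment: a single text unit may ground several triples, which is what enables bidirectional tasks such as Knowledge Graph Completion and Triple-to-Text generation over the same records.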

10 retrieved papers

Comprehensive Benchmark for Evaluating Chinese LLMs (CB-ECLLM)

The authors present CB-ECLLM, a benchmark specifically designed to evaluate Chinese LLMs on tasks that capture unique Chinese linguistic challenges. The benchmark includes three tasks—Knowledge Graph Completion, Triple-to-Text generation, and Question Answering—and is built on the CDTP dataset to provide systematic, knowledge-driven evaluation.

10 retrieved papers

Multi-Dimensional Evaluation Framework

The authors conduct comprehensive experiments to evaluate Chinese LLMs across three dimensions: effectiveness on various tasks and datasets, the impact of Supervised Fine-Tuning (SFT) on model performance, and robustness when tested on out-of-distribution data. This multi-faceted evaluation provides insights into model generalization and stability.

9 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current top-K core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape, it appears structurally isolated, which is one partial signal of novelty, but one still constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution: Chinese Data-Text Pair (CDTP) Dataset

10 candidate papers were compared against this contribution; none was found to refute it.

Contribution: Comprehensive Benchmark for Evaluating Chinese LLMs (CB-ECLLM)

10 candidate papers were compared against this contribution; none was found to refute it.

Contribution: Multi-Dimensional Evaluation Framework

9 candidate papers were compared against this contribution; none was found to refute it.