CDTP: A Large-Scale Chinese Data-Text Pair Dataset for Comprehensive Evaluation of Chinese LLMs

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: Chinese Data-Text Pair Dataset, Large Language Model, Chinese Evaluation
Abstract:

Large Language Models (LLMs) have achieved remarkable success across a wide range of natural language processing tasks. However, Chinese LLMs face unique challenges, primarily due to the dominance of unstructured free text and the lack of structured representations in Chinese corpora. Existing benchmarks partially assess Chinese LLMs, but they remain predominantly English-centric, fail to address the unique linguistic characteristics of Chinese, and lack the structured datasets essential for robust evaluation. To address these challenges, we present a Comprehensive Benchmark for Evaluating Chinese Large Language Models (CB-ECLLM) built on the newly constructed Chinese Data-Text Pair (CDTP) dataset. CDTP comprises over 7 million aligned text pairs, each coupling unstructured text with one or more corresponding triples, for a total of 15 million triples spanning four critical domains. The core contributions of CDTP are threefold: (i) enriching Chinese corpora with high-quality structured information; (ii) enabling fine-grained evaluation tailored to knowledge-driven tasks; and (iii) supporting multi-task fine-tuning to assess generalization and robustness across tasks, including Knowledge Graph Completion, Triple-to-Text generation, and Question Answering. Furthermore, we conduct rigorous evaluations through extensive experiments and ablation studies to assess the benchmark's effectiveness, the impact of Supervised Fine-Tuning (SFT), and its robustness. To support reproducible research, we release an open-source codebase and outline directions for future work based on our insights.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces CDTP, a dataset of over 7 million Chinese data-text pairs aligned with 15 million triples across four domains, and CB-ECLLM, a benchmark for evaluating Chinese LLMs on structured knowledge tasks. Within the taxonomy, it occupies the 'Data-Text Alignment' leaf under 'Structured Data Understanding', where it is currently the sole representative among 50 surveyed papers. This positioning places the work in a relatively sparse research direction focused specifically on the correspondence between structured representations and natural language, distinct from the more populated domain-specific evaluation branches (e.g., 7 papers in Traditional Chinese Medicine, 5 in Financial Domain).

The taxonomy reveals that most neighboring work concentrates on domain-specific benchmarks (Medical, Legal, Financial) or general multi-subject assessments (C-Eval, M3KE), which evaluate end-to-end task performance rather than data-text alignment primitives. The 'Knowledge Engineering' branch addresses knowledge graph construction and querying but focuses on extraction and retrieval rather than paired alignment datasets. The 'Constrained Text Generation' sibling leaf examines format adherence but excludes the bidirectional data-text correspondence that CDTP emphasizes. This structural isolation suggests the paper targets a methodological gap between broad evaluation suites and specialized domain benchmarks.

Among 29 candidates examined across three contributions, no clearly refuting prior work was identified. The CDTP dataset contribution examined 10 candidates with zero refutable matches, as did the CB-ECLLM benchmark (10 candidates, zero refutable). The multi-dimensional evaluation framework analyzed 9 candidates, also yielding no refutations. This absence of overlapping prior work within the limited search scope suggests that large-scale Chinese data-text pair datasets with explicit triple alignment remain underexplored in the surveyed literature, though the search scale (29 papers) leaves open the possibility of relevant work beyond top-K semantic matches.

Based on the limited literature search, the work appears to address a relatively unoccupied niche within Chinese LLM evaluation, focusing on structured alignment rather than domain mastery or general reasoning. The taxonomy structure confirms that data-text correspondence as a standalone evaluation dimension has received less attention than domain-specific or multidisciplinary benchmarks. However, the analysis covers only the top-K semantic candidates (29 papers across the three contributions) and does not constitute an exhaustive survey of all Chinese NLP datasets or alignment methods, so conclusions about absolute novelty remain provisional.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 29
Refutable Papers: 0

Research Landscape Overview

Core task: evaluating Chinese large language models on structured knowledge tasks. The field organizes itself around several complementary perspectives:

- Domain-Specific Knowledge Evaluation encompasses specialized benchmarks for medicine (TCM-3CEval[6], MTCMB TCM[18]), law (LawBench[7], LAiW Legal Benchmark[3]), finance (FinEval[9], CFinBench[22]), and other professional areas, each probing how well models handle domain terminology and reasoning.
- General Knowledge Assessment focuses on broad-coverage benchmarks like C-Eval[5] and M3KE[25] that test multidisciplinary understanding.
- Knowledge Engineering explores how models interact with structured representations such as knowledge graphs (TechGPT Knowledge Graph[4], LLM4EduKG[20]) and policy documents (DocPolicyKG[17]).
- Structured Data Understanding examines alignment between data formats and natural language, including data-text tasks and schema-aware query generation (SPARQL Schema Selection[28]).
- Domain Adaptation Methods investigate techniques like knowledge tuning (Knowledge Tuning Medical[2], Medical Knowledge Tuning[15]) and retrieval-augmented generation (OpenTCM GraphRAG[8]) to improve performance on specialized tasks.

Across these branches, a central tension emerges between depth and breadth: domain-specific benchmarks achieve high fidelity within narrow contexts, while general assessments sacrifice granularity for coverage. Many studies also grapple with the challenge of grounding model outputs in verifiable structured knowledge versus relying on parametric memory alone. CDTP[0] sits within the Structured Data Understanding branch, specifically addressing data-text alignment, a relatively focused area compared to the broader domain evaluation efforts.
While works like C-Eval[5] and M3KE[25] cast wide nets over general knowledge, CDTP[0] emphasizes the precise correspondence between structured representations and their linguistic expressions, a concern shared with schema-oriented approaches such as SPARQL Schema Selection[28]. This positioning reflects a methodological choice: rather than evaluating end-to-end task performance across diverse domains, the work targets the foundational capability of interpreting and generating text that faithfully reflects structured data, a prerequisite for robust knowledge-intensive applications in Chinese NLP.

Claimed Contributions

Chinese Data-Text Pair (CDTP) Dataset

The authors construct a large-scale dataset of over 7 million aligned Chinese text pairs, where each pair consists of unstructured text and corresponding structured triples (totaling 15 million triples). The dataset spans four domains: History and Politics, Humanities and Society, Technology and Economics, and Nature and Environment, addressing the scarcity of structured annotations in Chinese corpora.
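The pair structure described above can be sketched as a minimal data model. Note this is an illustrative assumption only: the field names, triple layout, and example content below are hypothetical and do not reflect CDTP's actual schema or data.

```python
# Hypothetical sketch of one data-text pair: a passage of unstructured
# Chinese text aligned with one or more (head, relation, tail) triples.
from dataclasses import dataclass, field
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (head entity, relation, tail entity)

@dataclass
class DataTextPair:
    text: str                                      # unstructured Chinese text
    triples: List[Triple] = field(default_factory=list)
    domain: str = ""                               # e.g. one of the four domains

# A toy example (content invented purely for illustration):
pair = DataTextPair(
    text="长江是亚洲最长的河流。",          # "The Yangtze is the longest river in Asia."
    triples=[("长江", "位于", "亚洲")],      # ("Yangtze", "located in", "Asia")
    domain="Nature and Environment",
)

# Each pair aligns a text span with at least one triple.
assert len(pair.triples) >= 1
```

The key property this sketch captures is the one-to-many alignment: a single text unit may ground several triples, which is what enables bidirectional tasks such as Knowledge Graph Completion and Triple-to-Text generation over the same records.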

10 retrieved papers

Comprehensive Benchmark for Evaluating Chinese LLMs (CB-ECLLM)

The authors present CB-ECLLM, a benchmark specifically designed to evaluate Chinese LLMs on tasks that capture unique Chinese linguistic challenges. The benchmark includes three tasks—Knowledge Graph Completion, Triple-to-Text generation, and Question Answering—and is built on the CDTP dataset to provide systematic, knowledge-driven evaluation.

10 retrieved papers

Multi-Dimensional Evaluation Framework

The authors conduct comprehensive experiments to evaluate Chinese LLMs across three dimensions: effectiveness on various tasks and datasets, the impact of Supervised Fine-Tuning (SFT) on model performance, and robustness when tested on out-of-distribution data. This multi-faceted evaluation provides insights into model generalization and stability.

9 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current top-K core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape, it appears structurally isolated, which is one partial signal of novelty, but one still constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution: Chinese Data-Text Pair (CDTP) Dataset

10 candidate papers were compared against this contribution; none was found to refute it.

Contribution: Comprehensive Benchmark for Evaluating Chinese LLMs (CB-ECLLM)

10 candidate papers were compared against this contribution; none was found to refute it.

Contribution: Multi-Dimensional Evaluation Framework

9 candidate papers were compared against this contribution; none was found to refute it.