Abstract:

Verifiers provide important reward signals for reinforcement learning of large language models (LLMs). However, developing reliable verifiers is challenging, especially for code generation tasks: a well-disguised wrong solution may only be caught by carefully crafted, human-written edge cases that are difficult to synthesize automatically. To address this issue, we propose HardTestGen, an approach for synthesizing high-quality test cases for algorithmic coding problems. Using it, we curate HardTests, a comprehensive algorithmic programming dataset with 26.6k problems and high-quality synthetic tests. Compared with existing tests, HardTestGen tests verify LLM-generated code significantly more accurately (+11.22 percentage points in precision, i.e., the fraction of actually correct code among the programs the verifier accepts). We also show that downstream post-training with the HardTests verifier, including rejection sampling and reinforcement learning (RL), improves LLM code generation.
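The precision figure above treats the test suite as a binary classifier over candidate programs. A minimal sketch of that computation follows; the labels and verdicts below are illustrative booleans, not data from the paper.

```python
# Sketch: precision/recall of a test-based verifier, in the sense used in the
# abstract. ground_truth[i] says whether program i is actually correct;
# verifier_verdicts[i] says whether the test suite accepted program i.

def verifier_metrics(ground_truth, verifier_verdicts):
    tp = sum(g and v for g, v in zip(ground_truth, verifier_verdicts))
    predicted_correct = sum(verifier_verdicts)
    actually_correct = sum(ground_truth)
    precision = tp / predicted_correct if predicted_correct else 0.0
    recall = tp / actually_correct if actually_correct else 0.0
    return precision, recall

# Toy example: 4 programs; the verifier accepts 3, of which 2 are truly correct.
gt = [True, True, False, False]
verdicts = [True, True, True, False]
p, r = verifier_metrics(gt, verdicts)  # precision = 2/3, recall = 1.0
```

Weak tests inflate the "accepted" set with wrong programs, which is exactly how precision degrades; harder tests shrink that set back toward the truly correct programs.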

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces HardTestGen, a pipeline for synthesizing challenging test cases for algorithmic coding problems, and curates the accompanying HardTests dataset of 26.6k problems. It resides in the 'Competitive Programming Test Generation' leaf under LLM-Based Test Synthesis, which contains seven papers in total. This leaf represents a moderately active research direction within the broader 50-paper taxonomy. The work shares this space with six sibling papers that similarly leverage large language models to generate test cases for competitive programming, indicating a concentrated but not overcrowded research area focused on stress-testing code under rigorous algorithmic constraints.

The taxonomy reveals that HardTestsGen sits within a well-defined niche. Its parent branch, LLM-Based Test Synthesis, also includes General Software Test Generation (four papers on unit testing), while sibling branches explore Formal and Structured Generation (three papers using symbolic methods) and Evolutionary Test Optimization (two papers on genetic algorithms). Neighboring leaves under Code Generation with Test Integration examine how tests guide synthesis (Test-Guided Code Synthesis, RL-Based Code Improvement), and the Datasets and Benchmarks branch provides complementary resources like competitive programming benchmarks. The scope note for the leaf explicitly excludes general software testing, clarifying that this work targets competitive programming scenarios rather than broader unit test generation.

Among 21 candidates examined across three contributions, the analysis found limited prior work overlap. The test synthesis pipeline and dataset contributions each examined 10 candidates with zero refutable matches, suggesting these components occupy relatively novel ground within the limited search scope. The empirical analysis of test quality impact on post-training examined only one candidate and found one refutable match, indicating more substantial prior work in this area. These statistics reflect a top-K semantic search plus citation expansion, not an exhaustive literature review. The pipeline and dataset appear more distinctive than the downstream training analysis, though the small candidate pool (21 total) limits definitive conclusions about field-wide novelty.

Based on the limited search scope of 21 candidates, the work appears to offer meaningful contributions in test synthesis methodology and dataset curation, while the post-training analysis aligns more closely with existing research directions. The taxonomy context suggests the paper occupies a moderately active but not saturated research area, with clear boundaries separating it from general software testing and formal verification approaches. The analysis does not cover exhaustive prior work beyond top semantic matches and immediate citations.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 21
Refutable Papers: 1

Research Landscape Overview

Core task: Synthesizing high-quality test cases for algorithmic coding problems. The field organizes around several complementary branches. Test Case Generation Methods explores diverse synthesis strategies, ranging from LLM-based approaches that leverage large language models for competitive programming and domain-specific test creation, to traditional search-based and mutation-driven techniques. Test Case Evaluation and Quality Assessment focuses on metrics and frameworks for measuring test effectiveness, coverage, and difficulty. Code Generation with Test Integration examines how test synthesis interacts with automated code generation pipelines, often using tests to guide or verify generated solutions. Application Domains and Specialized Tasks addresses context-specific challenges such as unit testing, security testing, and educational assessment. Datasets and Benchmarks provides standardized resources like APPS[7] and newer collections for reproducible evaluation. Finally, Auxiliary Techniques and Theoretical Foundations covers supporting methods including formal verification, symbolic execution, and theoretical models of test adequacy.

Within the LLM-based synthesis branch, recent work has concentrated on generating challenging test cases for competitive programming scenarios. HardTestGen[0] exemplifies this trend by targeting difficult edge cases that expose subtle algorithmic errors, positioning itself alongside HardTests[1] and CodeContests Plus[2], which similarly emphasize stress-testing code under rigorous constraints. A key trade-off in this cluster involves balancing test diversity against computational cost: while AutoCode[3] and Reliable Test Generators[6] pursue scalable generation pipelines, works like TestCase Eval[4] and Rigorous Evaluation[5] highlight the need for careful quality assessment to ensure that generated tests are both valid and discriminative. HardTestGen[0] sits squarely in this active subarea, sharing with its neighbors a focus on leveraging LLMs to produce non-trivial test inputs, yet it appears to place particular emphasis on hardness and corner-case discovery rather than sheer volume or baseline correctness checks.

Claimed Contributions

HARDTESTGEN test synthesis pipeline

The authors introduce HARDTESTGEN, an LLM-based pipeline that synthesizes test cases through four techniques (LLMGen, RPGen, SPGen, and HackGen) for generating inputs, plus validation and consensus filtering for outputs. This approach aims to create more reliable verifiers for code generation tasks.
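The consensus-filtering step can be pictured as majority voting over reference solutions: for each synthesized input, keep it only if enough solutions agree on the output, which then serves as the expected answer. The sketch below illustrates this idea only; the callable "solutions", the `min_agree` threshold, and the in-process runner are stand-ins, not the paper's actual sandboxed setup.

```python
# Hedged sketch of consensus filtering over synthesized test inputs.
from collections import Counter

def consensus_filter(inputs, solutions, min_agree=0.6):
    """Keep (input, output) pairs where >= min_agree of solutions agree."""
    kept = []
    for x in inputs:
        outputs = [sol(x) for sol in solutions]
        (top_out, votes), = Counter(outputs).most_common(1)
        if votes / len(solutions) >= min_agree:
            kept.append((x, top_out))
    return kept

# Toy demo with arithmetic "solutions": two agree, one is off by one.
sols = [lambda x: x * 2, lambda x: x + x, lambda x: x * 2 + 1]
tests = consensus_filter([1, 2, 3], sols)  # [(1, 2), (2, 4), (3, 6)]
```

The threshold trades coverage for label reliability: a stricter majority discards more inputs but makes the surviving expected outputs harder to corrupt with a single buggy solution.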

10 retrieved papers
HARDTESTS dataset with 26.6k problems

The authors curate HARDTESTS, a large-scale dataset comprising 26.6k algorithmic coding problems from 13 platforms, each equipped with high-quality test cases generated by HARDTESTGEN. The dataset demonstrates significantly higher precision and recall compared to existing test sets.

10 retrieved papers
Empirical analysis of test quality impact on post-training

The authors provide empirical evidence demonstrating that higher-quality test cases significantly impact downstream LLM post-training methods, including rejection sampling and reinforcement learning, showing improved model performance when using HARDTESTS verifiers.
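Verifier-based rejection sampling, one of the post-training methods named above, reduces to a simple filter loop: draw several candidate programs per problem and keep only those the verifier accepts. In this sketch, `sample_fn` and `verifier` are hypothetical stand-ins for a model's sampler and a HardTests-style test harness.

```python
# Hedged sketch of rejection sampling against a test-based verifier.

def rejection_sample(problems, sample_fn, verifier, k=8):
    """Collect (problem, program) pairs that pass the verifier's tests."""
    accepted = []
    for prob in problems:
        for _ in range(k):
            program = sample_fn(prob)
            if verifier(prob, program):
                accepted.append((prob, program))
    return accepted

# Toy demo: "programs" are integers drawn from a fixed stream; the verifier
# accepts even ones, so 8 and 6 survive while 3 and 5 are rejected.
samples = iter([3, 8, 5, 6])
data = rejection_sample(["p1"], lambda p: next(samples),
                        lambda p, c: c % 2 == 0, k=4)  # [("p1", 8), ("p1", 6)]
```

The quality claim in this contribution corresponds to the verifier's precision: a weak verifier lets wrong programs into `accepted`, polluting the fine-tuning or RL reward signal.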

1 retrieved paper
Can Refute

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

HARDTESTGEN test synthesis pipeline

The authors introduce HARDTESTGEN, an LLM-based pipeline that synthesizes test cases through four techniques (LLMGen, RPGen, SPGen, and HackGen) for generating inputs, plus validation and consensus filtering for outputs. This approach aims to create more reliable verifiers for code generation tasks.

Contribution

HARDTESTS dataset with 26.6k problems

The authors curate HARDTESTS, a large-scale dataset comprising 26.6k algorithmic coding problems from 13 platforms, each equipped with high-quality test cases generated by HARDTESTGEN. The dataset demonstrates significantly higher precision and recall compared to existing test sets.

Contribution

Empirical analysis of test quality impact on post-training

The authors provide empirical evidence demonstrating that higher-quality test cases significantly impact downstream LLM post-training methods, including rejection sampling and reinforcement learning, showing improved model performance when using HARDTESTS verifiers.