From Assistant to Independent Developer — Are GPTs Ready for Software Development?

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: software development, app development, coding agent, LLM, code model
Abstract:

Large language models (LLMs) have demonstrated remarkable capability on function-level code generation tasks. Unlike isolated functions, real-world applications demand reasoning over the entire software system: developers must orchestrate how different components interact, maintain consistent state over time, and ensure the application behaves correctly within lifecycle and framework constraints. Yet no existing benchmark adequately evaluates whether LLMs can bridge this gap and construct entire software systems from scratch.

To address this gap, we propose APPFORGE, a benchmark consisting of 101 software development problems drawn from real-world Android apps. Given a natural language specification detailing the app functionality, a language model is tasked with implementing that functionality as an Android app from scratch. Developing an Android app from scratch requires understanding and coordinating app states, lifecycle management, and asynchronous operations, calling for LLMs to generate context-aware, robust, and maintainable code. To construct APPFORGE, we design a multi-agent system that automatically summarizes the main functionalities from app documentation and navigates the app to synthesize test cases validating the functional correctness of an implementation. Following rigorous manual verification by Android development experts, APPFORGE incorporates the test cases into an automated evaluation framework that enables reproducible assessment without human intervention, making it easily adoptable for future research. Our evaluation of 12 flagship LLMs shows that all evaluated models achieve low effectiveness, with the best-performing model (GPT-5) developing functionally correct applications for only 18.8% of tasks, highlighting fundamental limitations in current models' ability to handle complex, multi-component software engineering challenges.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces APPFORGE, a benchmark of 101 real-world Android development problems requiring end-to-end application construction from natural language specifications. It resides in the 'End-to-End Application Development Evaluation' leaf of the taxonomy, which contains only two papers total: this work and one sibling (Text2App). This represents a notably sparse research direction within the broader field, suggesting that comprehensive benchmarks for full-system Android generation remain underexplored compared to component-level or intent-specific evaluation resources.

The taxonomy reveals that most related work clusters in adjacent branches: 'Natural Language to Code Generation Frameworks' contains seven papers spanning direct generation, abstraction-based methods, and iterative refinement systems, while 'Intent Invocation Datasets' focuses narrowly on API-level correctness. APPFORGE bridges these areas by evaluating whether models can orchestrate components, manage state, and handle lifecycle constraints—capabilities that sit between low-level intent invocation and high-level iterative development frameworks. The benchmark's emphasis on real-world app functionality distinguishes it from domain-specific applications in the 'Specialized Application Domains' branch, which target parental control, IoT interfaces, or voice commerce rather than general-purpose development.

Among 24 candidates examined, none clearly refute the three core contributions. The benchmark construction contribution examined four candidates with zero refutations; the multi-agent construction system examined ten candidates with zero refutations; the evaluation revealing LLM limitations examined ten candidates with zero refutations. This limited search scope—focused on top-K semantic matches—suggests that within the immediate neighborhood of Android development benchmarks and multi-agent code generation, no prior work directly overlaps with APPFORGE's combination of real-world app complexity, automated construction pipeline, and comprehensive evaluation protocol.

Given the sparse taxonomy leaf and absence of refutations across 24 examined candidates, the work appears to occupy a relatively novel position in end-to-end Android evaluation. However, the limited search scale means this assessment reflects only the most semantically proximate prior work, not an exhaustive survey of all code generation benchmarks or multi-agent development systems. The novelty claim rests primarily on the integration of real-world app complexity with automated benchmark construction, rather than on entirely unprecedented technical components.

Taxonomy

Core-task Taxonomy Papers: 11
Claimed Contributions: 3
Contribution Candidate Papers Compared: 24
Refutable Papers: 0

Research Landscape Overview

Core task: end-to-end Android application development from natural language specifications. The field organizes around three main branches. Natural Language to Code Generation Frameworks explore techniques for translating user intent into executable code, ranging from model-driven approaches that leverage intermediate representations to direct synthesis methods. Evaluation Resources and Benchmarks provide datasets and testbeds—such as DroidCall Dataset[1] and Text2App[2]—that enable systematic assessment of generation quality, functional correctness, and usability. Specialized Application Domains target particular use cases, including parental control tools like WhatsApp Parental Control[4], IoT integration as in Natural Language IoT[5], and domain-specific platforms such as Aptly Platform[7]. Together, these branches reflect a progression from foundational generation methods through rigorous evaluation to real-world deployment scenarios.

Recent work highlights contrasts between fully automated pipelines and iterative refinement strategies. Some studies, including Model-driven Android Generation[3] and AppForge[8], emphasize structured intermediate models to ensure consistency and maintainability, while others such as EvoDev Framework[10] and LLM UI Refinement[9] explore evolutionary or feedback-driven loops that incrementally improve generated artifacts.

GPTs Software Development[0] sits within the Evaluation Resources and Benchmarks branch, specifically under End-to-End Application Development Evaluation, positioning it alongside AppForge[8] as a resource for holistic assessment. Whereas AppForge[8] focuses on benchmark construction and multi-stage validation, GPTs Software Development[0] appears to emphasize evaluating the complete development lifecycle when large language models drive the process.
This distinction underscores an ongoing question: whether static benchmarks or dynamic, iterative evaluation better capture the complexities of transforming natural language into production-ready Android applications.

Claimed Contributions

APPFORGE benchmark for end-to-end Android app development evaluation

The authors introduce APPFORGE, a new benchmark designed to evaluate whether LLMs can develop complete Android applications from scratch given natural language specifications. The benchmark includes 101 real-world tasks with automated evaluation covering compilation, functional testing, and stability analysis.
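To make the three-stage evaluation concrete, here is a minimal sketch of how such gating might work: a task counts as solved only if it compiles, passes every functional test, and survives the stability check. All names and interfaces below are invented for illustration; they are not the benchmark's actual API.

```python
# Hypothetical sketch of a compile / functional-test / stability gate.
from dataclasses import dataclass


@dataclass
class TaskResult:
    compiled: bool       # did the generated app build?
    tests_passed: int    # functional test cases that passed
    tests_total: int     # functional test cases executed
    stable: bool         # did the app survive the stability check?

    @property
    def solved(self) -> bool:
        # All three gates must hold for a task to count as solved.
        return (self.compiled
                and self.tests_total > 0
                and self.tests_passed == self.tests_total
                and self.stable)


def pass_rate(results: list[TaskResult]) -> float:
    """Fraction of benchmark tasks that are fully solved."""
    return sum(r.solved for r in results) / len(results) if results else 0.0


# A task that compiles but fails one functional test is not solved.
r = TaskResult(compiled=True, tests_passed=4, tests_total=5, stable=True)
print(r.solved)  # False
```

The key design point is that the gates are conjunctive: partial credit for compiling or passing some tests does not count toward functional correctness.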

4 retrieved papers
Multi-agent system for automated benchmark construction

The authors develop an automated pipeline using a multi-agent system that extracts functionality specifications from app documentation, navigates apps to capture runtime behavior, and synthesizes test cases. This approach is combined with expert validation to ensure scalability and rigor in benchmark construction.

10 retrieved papers
Comprehensive evaluation revealing fundamental limitations of current LLMs

The authors conduct systematic evaluation of 12 state-of-the-art LLMs on APPFORGE, revealing that even the best models achieve less than 20% functional correctness. The evaluation uncovers specific failure patterns including compilation error evasion strategies and highlights fundamental gaps in current models' software engineering capabilities.
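As a quick arithmetic sanity check (ours, not the paper's), the reported 18.8% on a 101-task benchmark corresponds to 19 fully correct applications, consistent with the "less than 20%" claim:

```python
# Cross-check the headline figure: 18.8% functional correctness
# on 101 tasks implies 19 fully solved applications.
total_tasks = 101
reported_pct = 18.8

solved = round(total_tasks * reported_pct / 100)   # implied solved count
print(solved)                                      # 19
print(round(solved / total_tasks * 100, 1))        # 18.8
print(solved / total_tasks < 0.20)                 # True: below the 20% mark
```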

10 retrieved papers
