From Assistant to Independent Developer — Are GPTs Ready for Software Development?

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: software development, app development, coding agent, LLM, code model
Abstract:

Large language models (LLMs) have demonstrated remarkable capability on function-level code generation tasks. Unlike isolated functions, real-world applications demand reasoning over the entire software system: developers must orchestrate how different components interact, maintain consistent state over time, and ensure the application behaves correctly within lifecycle and framework constraints. Yet no existing benchmark adequately evaluates whether LLMs can bridge this gap and construct entire software systems from scratch.

To address this gap, we propose APPFORGE, a benchmark consisting of 101 software development problems drawn from real-world Android apps. Given a natural language specification detailing the app functionality, a language model is tasked with implementing that functionality as an Android app from scratch. Developing an Android app from scratch requires understanding and coordinating app states, lifecycle management, and asynchronous operations, calling for LLMs to generate context-aware, robust, and maintainable code. To construct APPFORGE, we design a multi-agent system that automatically summarizes the main functionalities from app documentation and navigates the app to synthesize test cases validating the functional correctness of an implementation. Following rigorous manual verification by Android development experts, APPFORGE incorporates the test cases into an automated evaluation framework that enables reproducible assessment without human intervention, making it easily adoptable for future research. Our evaluation of 12 flagship LLMs shows that all evaluated models achieve low effectiveness, with the best-performing model (GPT-5) developing functionally correct applications for only 18.8% of tasks, highlighting fundamental limitations in current models' ability to handle complex, multi-component software engineering challenges.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces APPFORGE, a benchmark of 101 real-world Android development problems requiring end-to-end application construction from natural language specifications. It resides in the 'End-to-End Application Development Evaluation' leaf of the taxonomy, which contains only two papers total: this work and one sibling (Text2App). This represents a notably sparse research direction within the broader field, suggesting that comprehensive benchmarks for full-system Android generation remain underexplored compared to component-level or intent-specific evaluation resources.

The taxonomy reveals that most related work clusters in adjacent branches: 'Natural Language to Code Generation Frameworks' contains seven papers spanning direct generation, abstraction-based methods, and iterative refinement systems, while 'Intent Invocation Datasets' focuses narrowly on API-level correctness. APPFORGE bridges these areas by evaluating whether models can orchestrate components, manage state, and handle lifecycle constraints—capabilities that sit between low-level intent invocation and high-level iterative development frameworks. The benchmark's emphasis on real-world app functionality distinguishes it from domain-specific applications in the 'Specialized Application Domains' branch, which target parental control, IoT interfaces, or voice commerce rather than general-purpose development.

Among 24 candidates examined, none clearly refute the three core contributions. The benchmark construction contribution examined four candidates with zero refutations; the multi-agent construction system examined ten candidates with zero refutations; the evaluation revealing LLM limitations examined ten candidates with zero refutations. This limited search scope—focused on top-K semantic matches—suggests that within the immediate neighborhood of Android development benchmarks and multi-agent code generation, no prior work directly overlaps with APPFORGE's combination of real-world app complexity, automated construction pipeline, and comprehensive evaluation protocol.

Given the sparse taxonomy leaf and absence of refutations across 24 examined candidates, the work appears to occupy a relatively novel position in end-to-end Android evaluation. However, the limited search scale means this assessment reflects only the most semantically proximate prior work, not an exhaustive survey of all code generation benchmarks or multi-agent development systems. The novelty claim rests primarily on the integration of real-world app complexity with automated benchmark construction, rather than on entirely unprecedented technical components.

Taxonomy

Core-task Taxonomy Papers: 11
Claimed Contributions: 3
Contribution Candidate Papers Compared: 24
Refutable Papers: 0

Research Landscape Overview

Core task: end-to-end Android application development from natural language specifications. The field organizes around three main branches. Natural Language to Code Generation Frameworks explore techniques for translating user intent into executable code, ranging from model-driven approaches that leverage intermediate representations to direct synthesis methods. Evaluation Resources and Benchmarks provide datasets and testbeds—such as DroidCall Dataset[1] and Text2App[2]—that enable systematic assessment of generation quality, functional correctness, and usability. Specialized Application Domains target particular use cases, including parental control tools like WhatsApp Parental Control[4], IoT integration as in Natural Language IoT[5], and domain-specific platforms such as Aptly Platform[7]. Together, these branches reflect a progression from foundational generation methods through rigorous evaluation to real-world deployment scenarios.

Recent work highlights contrasts between fully automated pipelines and iterative refinement strategies. Some studies, including Model-driven Android Generation[3] and AppForge[8], emphasize structured intermediate models to ensure consistency and maintainability, while others such as EvoDev Framework[10] and LLM UI Refinement[9] explore evolutionary or feedback-driven loops that incrementally improve generated artifacts.

GPTs Software Development[0] sits within the Evaluation Resources and Benchmarks branch, specifically under End-to-End Application Development Evaluation, positioning it alongside AppForge[8] as a resource for holistic assessment. Whereas AppForge[8] focuses on benchmark construction and multi-stage validation, GPTs Software Development[0] appears to emphasize evaluating the complete development lifecycle when large language models drive the process.
This distinction underscores an ongoing question: whether static benchmarks or dynamic, iterative evaluation better capture the complexities of transforming natural language into production-ready Android applications.

Claimed Contributions

APPFORGE benchmark for end-to-end Android app development evaluation

The authors introduce APPFORGE, a new benchmark designed to evaluate whether LLMs can develop complete Android applications from scratch given natural language specifications. The benchmark includes 101 real-world tasks with automated evaluation covering compilation, functional testing, and stability analysis.
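To make the three-stage evaluation concrete, here is a minimal sketch of how such gating might work: a task counts as solved only if it compiles, passes every functional test, and survives the stability check. All names and interfaces below are invented for illustration; they are not the benchmark's actual API.

```python
# Hypothetical sketch of a compile / functional-test / stability gate.
from dataclasses import dataclass


@dataclass
class TaskResult:
    compiled: bool       # did the generated app build?
    tests_passed: int    # functional test cases that passed
    tests_total: int     # functional test cases executed
    stable: bool         # did the app survive the stability check?

    @property
    def solved(self) -> bool:
        # All three gates must hold for a task to count as solved.
        return (self.compiled
                and self.tests_total > 0
                and self.tests_passed == self.tests_total
                and self.stable)


def pass_rate(results: list[TaskResult]) -> float:
    """Fraction of benchmark tasks that are fully solved."""
    return sum(r.solved for r in results) / len(results) if results else 0.0


# A task that compiles but fails one functional test is not solved.
r = TaskResult(compiled=True, tests_passed=4, tests_total=5, stable=True)
print(r.solved)  # False
```

The key design point is that the gates are conjunctive: partial credit for compiling or passing some tests does not count toward functional correctness.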

4 retrieved papers
Multi-agent system for automated benchmark construction

The authors develop an automated pipeline using a multi-agent system that extracts functionality specifications from app documentation, navigates apps to capture runtime behavior, and synthesizes test cases. This approach is combined with expert validation to ensure scalability and rigor in benchmark construction.

10 retrieved papers
Comprehensive evaluation revealing fundamental limitations of current LLMs

The authors conduct systematic evaluation of 12 state-of-the-art LLMs on APPFORGE, revealing that even the best models achieve less than 20% functional correctness. The evaluation uncovers specific failure patterns including compilation error evasion strategies and highlights fundamental gaps in current models' software engineering capabilities.
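As a quick arithmetic sanity check (ours, not the paper's), the reported 18.8% on a 101-task benchmark corresponds to 19 fully correct applications, consistent with the "less than 20%" claim:

```python
# Cross-check the headline figure: 18.8% functional correctness
# on 101 tasks implies 19 fully solved applications.
total_tasks = 101
reported_pct = 18.8

solved = round(total_tasks * reported_pct / 100)   # implied solved count
print(solved)                                      # 19
print(round(solved / total_tasks * 100, 1))        # 18.8
print(solved / total_tasks < 0.20)                 # True: below the 20% mark
```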

10 retrieved papers
