SimuHome: A Temporal- and Environment-Aware Benchmark for Smart Home LLM Agents

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: smart home, simulator, language model, language agent, benchmark
Abstract:

Large Language Model (LLM) agents excel at multi-step, tool-augmented tasks. However, smart homes introduce distinct challenges, requiring agents to handle latent user intents, temporal dependencies, device constraints, scheduling, and more. The main bottlenecks for developing smart home agents with such capabilities include the lack of a realistic simulation environment where agents can interact with devices and observe the results, as well as a challenging benchmark to evaluate them. To address this, we introduce SimuHome, a time-accelerated home environment that simulates smart devices, supports API calls, and reflects changes in environmental variables. By building the simulator on the Matter protocol, the global industry standard for smart home communication, SimuHome provides a high-fidelity environment, and agents validated in SimuHome can be deployed on real Matter-compliant devices with minimal adaptation. We provide a challenging benchmark of 600 episodes across twelve user query types that require the aforementioned capabilities. Our evaluation of 16 agents under a unified ReAct framework reveals distinct capabilities and limitations across models. Models under 7B parameters exhibited negligible performance across all query types. Even GPT-4.1, the best-performing standard model, struggled with implicit intent inference, state verification, and particularly temporal scheduling. While reasoning models such as GPT-5.1 consistently outperformed standard models on every query type, they required over three times the average inference time, which can be prohibitive for real-time smart home applications. This highlights a critical trade-off between task performance and real-world practicality. We will release our code and dataset upon publication of the paper.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

SimuHome introduces a time-accelerated smart home simulator built on the Matter protocol, paired with a 600-episode benchmark spanning twelve query types that test latent intent understanding, temporal dependencies, and device constraints. The paper occupies the 'LLM Agent Benchmarks and Simulators' leaf within the Benchmarking and Simulation Environments branch. Notably, this leaf contains only one paper (the original submission itself), indicating a sparse research direction. The taxonomy reveals that while the broader field includes 34 papers across activity recognition, automation control, and formal verification, dedicated simulation platforms for LLM-based smart home agents remain underexplored.

The taxonomy tree shows that neighboring branches focus on complementary concerns: Activity Recognition and Prediction (7 papers) emphasizes sensor-driven inference and occupancy forecasting, while Automation Control and Scheduling (6 papers) addresses rule generation and energy optimization. Multi-Agent Systems (5 papers) explores distributed reasoning frameworks, and Formal Methods (6 papers) applies verification techniques to ensure correctness. SimuHome diverges from these directions by providing a testbed specifically for evaluating LLM agents' temporal reasoning and control capabilities, rather than proposing new recognition algorithms or formal specifications. The scope_note for its leaf explicitly excludes general smart home simulation without LLM agent focus, clarifying its distinct positioning.

Among 24 candidates examined across three contributions, no refutable prior work was identified. The simulator contribution examined 7 candidates with 0 refutations, the benchmark examined 7 candidates with 0 refutations, and the dual evaluation methodology examined 10 candidates with 0 refutations. This suggests that within the limited search scope, no prior work directly overlaps with SimuHome's combination of Matter-based simulation, time acceleration, and LLM-specific benchmarking. The benchmark contribution appears particularly novel, as existing work in Automation Control (e.g., End-User Programming papers) focuses on user interfaces rather than agent evaluation datasets. However, the search examined only top-24 semantic matches, not an exhaustive literature review.

Based on the limited search scope of 24 candidates, SimuHome appears to occupy a relatively unexplored niche at the intersection of LLM agent evaluation and smart home simulation. The taxonomy structure confirms that while related work exists in activity recognition, automation, and formal methods, dedicated benchmarks for LLM agents in time-sensitive smart home scenarios remain sparse. The analysis does not cover broader agent benchmarking literature outside the smart home domain, nor does it exhaustively examine all simulation platforms in IoT research.

Taxonomy

Core-task Taxonomy Papers: 34
Claimed Contributions: 3
Contribution Candidate Papers Compared: 24
Refutable Papers: 0

Research Landscape Overview

Core task: smart home agent control and temporal reasoning. The field encompasses diverse approaches to enabling intelligent systems that manage residential environments over time. At the highest level, the taxonomy reveals seven main branches:

- Activity Recognition and Prediction: inferring and forecasting occupant behaviors from sensor data and temporal patterns (e.g., Temporal Activity Recognition[10], Personalized Activity Prediction[4]).
- Automation Control and Scheduling: rule-based and adaptive control strategies for devices and energy management (e.g., Residential Automation Trends[6], Renewable Energy Forecast[22]).
- Multi-Agent Systems and Reasoning Architectures: coordination among distributed agents and knowledge-based frameworks (e.g., Probabilistic Multi-Agent[8], Knowledge-Based Collaboration[9]).
- Formal Methods and Verification: rigorous techniques to ensure correctness and safety (e.g., Uppaal IoT Verification[16], Formal Method Automation[12]).
- Security, Safety, and Testing: vulnerability detection and robustness (e.g., IoTFuzz[7]).
- Assistive Living and Ambient Intelligence: health monitoring and elderly care (e.g., Elderly Assistance Synergy[15], Ambient AI Assistance[18]).
- Benchmarking and Simulation Environments: testbeds for evaluating agent performance.

Recent work highlights a growing emphasis on simulation platforms that enable reproducible evaluation of LLM-based agents in realistic smart home scenarios. SimuHome[0] sits squarely within the Benchmarking and Simulation Environments branch, specifically under LLM Agent Benchmarks and Simulators, offering a controlled environment for testing temporal reasoning and control policies.
This contrasts with earlier efforts in Activity Recognition (e.g., Temporal Activity Prediction[29]) that primarily focused on sensor-driven inference rather than agent decision-making, and with automation studies like Bridging Automation Gap[3] that emphasize user programming paradigms over agent autonomy. A key open question across branches is how to integrate formal verification guarantees with the flexibility of learning-based agents, and how benchmarks can capture the full complexity of multi-occupant, long-horizon temporal dependencies that real deployments demand.

Claimed Contributions

SimuHome: A time-accelerated smart home simulator

The authors develop a high-fidelity smart home simulator built on the Matter protocol that models device operations, environmental variables (temperature, illuminance, humidity, air quality), and temporal dynamics. The simulator enables agents to interact with devices through APIs and observe realistic state changes, supporting reproducible experiments and potential transfer to real Matter-compliant devices.

7 retrieved papers
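To make the simulator contribution concrete, the following is a minimal sketch of a time-accelerated home loop in the spirit described above: device actuation mutates environmental variables, and simulated minutes elapse instantly. All class and field names here (`Device`, `HomeSim`, `effect_per_min`) are hypothetical illustrations, not the authors' implementation or the Matter data model.

```python
from dataclasses import dataclass, field

@dataclass
class Device:
    """A hypothetical on/off device that influences one environmental variable."""
    name: str
    on: bool = False
    effect_per_min: float = 0.0  # change applied per simulated minute while on
    variable: str = "temperature"

@dataclass
class HomeSim:
    """Minimal time-accelerated loop: one call advances many simulated minutes."""
    env: dict = field(default_factory=lambda: {"temperature": 21.0, "humidity": 45.0})
    devices: list = field(default_factory=list)
    clock_min: int = 0

    def set_device(self, name: str, on: bool) -> None:
        """Stand-in for an agent-facing actuation API call."""
        for d in self.devices:
            if d.name == name:
                d.on = on

    def advance(self, sim_minutes: int) -> dict:
        """Advance simulated time instantly, accumulating device effects."""
        for _ in range(sim_minutes):
            for d in self.devices:
                if d.on:
                    self.env[d.variable] += d.effect_per_min
        self.clock_min += sim_minutes
        return dict(self.env)  # snapshot the agent can observe

# An agent turns on a heater, then 30 simulated minutes pass in one call.
sim = HomeSim(devices=[Device("heater", effect_per_min=0.1)])
sim.set_device("heater", on=True)
state = sim.advance(30)
```

The point of the sketch is the `advance` step: because simulated time is decoupled from wall-clock time, episodes with hour-scale temporal dependencies can be evaluated in milliseconds.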
A benchmark of 600 episodes across twelve query types

The authors create a manually validated benchmark containing 600 episodes spanning twelve query types, each with feasible and infeasible variants. Episodes test capabilities including latent intent inference, temporal scheduling, device constraints, and state verification, with each episode packaged with initial home state, verifiable goals, natural-language queries, and required actions.

7 retrieved papers
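The episode packaging described above (initial home state, verifiable goal, natural-language query, required actions) can be illustrated with a single record. This is a hypothetical schema sketch: the field names and the actual SimuHome episode format may differ.

```python
import json

# Hypothetical episode record illustrating the fields described above.
episode = {
    "query_type": "temporal_scheduling",
    "feasible": True,
    "query": "Turn on the air purifier in the living room an hour from now.",
    "initial_state": {
        "living_room.air_purifier": {"power": "off"},
        "sim_time": "2025-01-01T09:00",
    },
    "goal": {  # verifiable target state, checked against the simulator
        "living_room.air_purifier": {"power": "on"},
        "sim_time": "2025-01-01T10:00",
    },
    "required_actions": [
        {"api": "schedule", "device": "living_room.air_purifier",
         "command": "power_on", "at": "2025-01-01T10:00"},
    ],
}
print(json.dumps(episode, indent=2))
```

An infeasible variant of the same episode would keep the query but alter the initial state (e.g., remove the device) so that the correct agent behavior is to decline and explain why.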
Dual evaluation methodology combining simulator-based and LLM-judge-based assessment

The authors establish a comprehensive evaluation approach that scores feasible tasks through direct simulator state comparisons and assesses infeasible tasks using validated LLM judges. This dual methodology enables objective, automated evaluation of agent performance across different query types while maintaining high agreement with human judgment.

10 retrieved papers
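The dual methodology above splits scoring by feasibility: feasible tasks are checked by direct state comparison against the simulator, while infeasible tasks are graded by an LLM judge. A minimal sketch of that routing logic follows; `judge` is a placeholder callable standing in for a validated LLM judge, and all function names are illustrative assumptions rather than the authors' code.

```python
def score_feasible(goal_state: dict, final_state: dict) -> bool:
    """Feasible task: every goal field must match the simulator's final state."""
    return all(final_state.get(k) == v for k, v in goal_state.items())

def score_infeasible(agent_reply: str, judge) -> bool:
    """Infeasible task: an LLM judge decides whether the agent correctly refused.

    `judge` is a stand-in for a validated LLM judge (prompt -> text verdict).
    """
    verdict = judge(
        "Did the agent correctly identify the request as infeasible "
        f"and explain why? Answer Yes or No.\nAgent reply: {agent_reply}"
    )
    return verdict.strip().lower().startswith("yes")

def evaluate(episode: dict, final_state: dict, agent_reply: str, judge) -> bool:
    """Route each episode to the appropriate scorer."""
    if episode["feasible"]:
        return score_feasible(episode["goal"], final_state)
    return score_infeasible(agent_reply, judge)

# Feasible episode: goal fields match the final simulator state.
feasible_ep = {"feasible": True, "goal": {"lamp.power": "on"}}
ok = evaluate(feasible_ep, {"lamp.power": "on", "fan.power": "off"}, "", judge=None)
```

The simulator-based branch is fully objective; only the refusal-quality branch needs an LLM judge, which is the part the paper reports validating against human judgment.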

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current top-K core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, which is one partial signal of novelty, though that signal remains constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

SimuHome: A time-accelerated smart home simulator

The authors develop a high-fidelity smart home simulator built on the Matter protocol that models device operations, environmental variables (temperature, illuminance, humidity, air quality), and temporal dynamics. The simulator enables agents to interact with devices through APIs and observe realistic state changes, supporting reproducible experiments and potential transfer to real Matter-compliant devices.

Contribution

A benchmark of 600 episodes across twelve query types

The authors create a manually validated benchmark containing 600 episodes spanning twelve query types, each with feasible and infeasible variants. Episodes test capabilities including latent intent inference, temporal scheduling, device constraints, and state verification, with each episode packaged with initial home state, verifiable goals, natural-language queries, and required actions.

Contribution

Dual evaluation methodology combining simulator-based and LLM-judge-based assessment

The authors establish a comprehensive evaluation approach that scores feasible tasks through direct simulator state comparisons and assesses infeasible tasks using validated LLM judges. This dual methodology enables objective, automated evaluation of agent performance across different query types while maintaining high agreement with human judgment.