SimuHome: A Temporal- and Environment-Aware Benchmark for Smart Home LLM Agents

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: smart home, simulator, language model, language agent, benchmark
Abstract:

Large Language Model (LLM) agents excel at multi-step, tool-augmented tasks. However, smart homes introduce distinct challenges, requiring agents to handle latent user intents, temporal dependencies, device constraints, scheduling, and more. The main bottlenecks for developing smart home agents with such capabilities include the lack of a realistic simulation environment where agents can interact with devices and observe the results, as well as a challenging benchmark to evaluate them. To address this, we introduce SimuHome, a time-accelerated home environment that simulates smart devices, supports API calls, and reflects changes in environmental variables. By building the simulator on the Matter protocol, the global industry standard for smart home communication, SimuHome provides a high-fidelity environment, and agents validated in SimuHome can be deployed on real Matter-compliant devices with minimal adaptation. We provide a challenging benchmark of 600 episodes across twelve user query types that require the aforementioned capabilities. Our evaluation of 16 agents under a unified ReAct framework reveals distinct capabilities and limitations across models. Models under 7B parameters exhibited negligible performance across all query types. Even GPT-4.1, the best-performing standard model, struggled with implicit intent inference, state verification, and particularly temporal scheduling. While reasoning models such as GPT-5.1 consistently outperformed standard models on every query type, they required over three times the average inference time, which can be prohibitive for real-time smart home applications. This highlights a critical trade-off between task performance and real-world practicality. We will release our code and dataset upon publication of the paper.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

SimuHome introduces a time-accelerated smart home simulator built on the Matter protocol, paired with a 600-episode benchmark spanning twelve query types that test latent intent understanding, temporal dependencies, and device constraints. The paper occupies the 'LLM Agent Benchmarks and Simulators' leaf within the Benchmarking and Simulation Environments branch. Notably, this leaf contains only one paper (the original submission itself), indicating a sparse research direction. The taxonomy reveals that while the broader field includes 34 papers across activity recognition, automation control, and formal verification, dedicated simulation platforms for LLM-based smart home agents remain underexplored.

The taxonomy tree shows that neighboring branches focus on complementary concerns: Activity Recognition and Prediction (7 papers) emphasizes sensor-driven inference and occupancy forecasting, while Automation Control and Scheduling (6 papers) addresses rule generation and energy optimization. Multi-Agent Systems (5 papers) explores distributed reasoning frameworks, and Formal Methods (6 papers) applies verification techniques to ensure correctness. SimuHome diverges from these directions by providing a testbed specifically for evaluating LLM agents' temporal reasoning and control capabilities, rather than proposing new recognition algorithms or formal specifications. The scope_note for its leaf explicitly excludes general smart home simulation without LLM agent focus, clarifying its distinct positioning.

Among 24 candidates examined across three contributions, no refutable prior work was identified. The simulator contribution examined 7 candidates with 0 refutations, the benchmark examined 7 candidates with 0 refutations, and the dual evaluation methodology examined 10 candidates with 0 refutations. This suggests that within the limited search scope, no prior work directly overlaps with SimuHome's combination of Matter-based simulation, time acceleration, and LLM-specific benchmarking. The benchmark contribution appears particularly novel, as existing work in Automation Control (e.g., End-User Programming papers) focuses on user interfaces rather than agent evaluation datasets. However, the search examined only top-24 semantic matches, not an exhaustive literature review.

Based on the limited search scope of 24 candidates, SimuHome appears to occupy a relatively unexplored niche at the intersection of LLM agent evaluation and smart home simulation. The taxonomy structure confirms that while related work exists in activity recognition, automation, and formal methods, dedicated benchmarks for LLM agents in time-sensitive smart home scenarios remain sparse. The analysis does not cover broader agent benchmarking literature outside the smart home domain, nor does it exhaustively examine all simulation platforms in IoT research.

Taxonomy

Core-task Taxonomy Papers: 34
Claimed Contributions: 3
Contribution Candidate Papers Compared: 24
Refutable Papers: 0

Research Landscape Overview

Core task: smart home agent control and temporal reasoning. The field encompasses diverse approaches to enabling intelligent systems that manage residential environments over time. At the highest level, the taxonomy reveals seven main branches:

- Activity Recognition and Prediction: inferring and forecasting occupant behaviors from sensor data and temporal patterns (e.g., Temporal Activity Recognition[10], Personalized Activity Prediction[4]).
- Automation Control and Scheduling: rule-based and adaptive control strategies for devices and energy management (e.g., Residential Automation Trends[6], Renewable Energy Forecast[22]).
- Multi-Agent Systems and Reasoning Architectures: coordination among distributed agents and knowledge-based frameworks (e.g., Probabilistic Multi-Agent[8], Knowledge-Based Collaboration[9]).
- Formal Methods and Verification: rigorous techniques to ensure correctness and safety (e.g., Uppaal IoT Verification[16], Formal Method Automation[12]).
- Security, Safety, and Testing: vulnerability detection and robustness (e.g., IoTFuzz[7]).
- Assistive Living and Ambient Intelligence: health monitoring and elderly care (e.g., Elderly Assistance Synergy[15], Ambient AI Assistance[18]).
- Benchmarking and Simulation Environments: testbeds for evaluating agent performance.

Recent work highlights a growing emphasis on simulation platforms that enable reproducible evaluation of LLM-based agents in realistic smart home scenarios. SimuHome[0] sits squarely within the Benchmarking and Simulation Environments branch, specifically under LLM Agent Benchmarks and Simulators, offering a controlled environment for testing temporal reasoning and control policies.
This contrasts with earlier efforts in Activity Recognition (e.g., Temporal Activity Prediction[29]) that primarily focused on sensor-driven inference rather than agent decision-making, and with automation studies like Bridging Automation Gap[3] that emphasize user programming paradigms over agent autonomy. A key open question across branches is how to integrate formal verification guarantees with the flexibility of learning-based agents, and how benchmarks can capture the full complexity of multi-occupant, long-horizon temporal dependencies that real deployments demand.

Claimed Contributions

SimuHome: A time-accelerated smart home simulator

The authors develop a high-fidelity smart home simulator built on the Matter protocol that models device operations, environmental variables (temperature, illuminance, humidity, air quality), and temporal dynamics. The simulator enables agents to interact with devices through APIs and observe realistic state changes, supporting reproducible experiments and potential transfer to real Matter-compliant devices.

7 retrieved papers
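To make the simulator contribution concrete, the following is a minimal sketch of a time-accelerated home loop in the spirit described above: device actuation mutates environmental variables, and simulated minutes elapse instantly. All class and field names here (`Device`, `HomeSim`, `effect_per_min`) are hypothetical illustrations, not the authors' implementation or the Matter data model.

```python
from dataclasses import dataclass, field

@dataclass
class Device:
    """A hypothetical on/off device that influences one environmental variable."""
    name: str
    on: bool = False
    effect_per_min: float = 0.0  # change applied per simulated minute while on
    variable: str = "temperature"

@dataclass
class HomeSim:
    """Minimal time-accelerated loop: one call advances many simulated minutes."""
    env: dict = field(default_factory=lambda: {"temperature": 21.0, "humidity": 45.0})
    devices: list = field(default_factory=list)
    clock_min: int = 0

    def set_device(self, name: str, on: bool) -> None:
        """Stand-in for an agent-facing actuation API call."""
        for d in self.devices:
            if d.name == name:
                d.on = on

    def advance(self, sim_minutes: int) -> dict:
        """Advance simulated time instantly, accumulating device effects."""
        for _ in range(sim_minutes):
            for d in self.devices:
                if d.on:
                    self.env[d.variable] += d.effect_per_min
        self.clock_min += sim_minutes
        return dict(self.env)  # snapshot the agent can observe

# An agent turns on a heater, then 30 simulated minutes pass in one call.
sim = HomeSim(devices=[Device("heater", effect_per_min=0.1)])
sim.set_device("heater", on=True)
state = sim.advance(30)
```

The point of the sketch is the `advance` step: because simulated time is decoupled from wall-clock time, episodes with hour-scale temporal dependencies can be evaluated in milliseconds.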
A benchmark of 600 episodes across twelve query types

The authors create a manually validated benchmark containing 600 episodes spanning twelve query types, each with feasible and infeasible variants. Episodes test capabilities including latent intent inference, temporal scheduling, device constraints, and state verification, with each episode packaged with initial home state, verifiable goals, natural-language queries, and required actions.

7 retrieved papers
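The episode packaging described above (initial home state, verifiable goal, natural-language query, required actions) can be illustrated with a single record. This is a hypothetical schema sketch: the field names and the actual SimuHome episode format may differ.

```python
import json

# Hypothetical episode record illustrating the fields described above.
episode = {
    "query_type": "temporal_scheduling",
    "feasible": True,
    "query": "Turn on the air purifier in the living room an hour from now.",
    "initial_state": {
        "living_room.air_purifier": {"power": "off"},
        "sim_time": "2025-01-01T09:00",
    },
    "goal": {  # verifiable target state, checked against the simulator
        "living_room.air_purifier": {"power": "on"},
        "sim_time": "2025-01-01T10:00",
    },
    "required_actions": [
        {"api": "schedule", "device": "living_room.air_purifier",
         "command": "power_on", "at": "2025-01-01T10:00"},
    ],
}
print(json.dumps(episode, indent=2))
```

An infeasible variant of the same episode would keep the query but alter the initial state (e.g., remove the device) so that the correct agent behavior is to decline and explain why.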
Dual evaluation methodology combining simulator-based and LLM-judge-based assessment

The authors establish a comprehensive evaluation approach that scores feasible tasks through direct simulator state comparisons and assesses infeasible tasks using validated LLM judges. This dual methodology enables objective, automated evaluation of agent performance across different query types while maintaining high agreement with human judgment.

10 retrieved papers
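The dual methodology above splits scoring by feasibility: feasible tasks are checked by direct state comparison against the simulator, while infeasible tasks are graded by an LLM judge. A minimal sketch of that routing logic follows; `judge` is a placeholder callable standing in for a validated LLM judge, and all function names are illustrative assumptions rather than the authors' code.

```python
def score_feasible(goal_state: dict, final_state: dict) -> bool:
    """Feasible task: every goal field must match the simulator's final state."""
    return all(final_state.get(k) == v for k, v in goal_state.items())

def score_infeasible(agent_reply: str, judge) -> bool:
    """Infeasible task: an LLM judge decides whether the agent correctly refused.

    `judge` is a stand-in for a validated LLM judge (prompt -> text verdict).
    """
    verdict = judge(
        "Did the agent correctly identify the request as infeasible "
        f"and explain why? Answer Yes or No.\nAgent reply: {agent_reply}"
    )
    return verdict.strip().lower().startswith("yes")

def evaluate(episode: dict, final_state: dict, agent_reply: str, judge) -> bool:
    """Route each episode to the appropriate scorer."""
    if episode["feasible"]:
        return score_feasible(episode["goal"], final_state)
    return score_infeasible(agent_reply, judge)

# Feasible episode: goal fields match the final simulator state.
feasible_ep = {"feasible": True, "goal": {"lamp.power": "on"}}
ok = evaluate(feasible_ep, {"lamp.power": "on", "fan.power": "off"}, "", judge=None)
```

The simulator-based branch is fully objective; only the refusal-quality branch needs an LLM judge, which is the part the paper reports validating against human judgment.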

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current top-K core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, which is one partial signal of novelty, though that signal remains constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

SimuHome: A time-accelerated smart home simulator

The authors develop a high-fidelity smart home simulator built on the Matter protocol that models device operations, environmental variables (temperature, illuminance, humidity, air quality), and temporal dynamics. The simulator enables agents to interact with devices through APIs and observe realistic state changes, supporting reproducible experiments and potential transfer to real Matter-compliant devices.

Contribution

A benchmark of 600 episodes across twelve query types

The authors create a manually validated benchmark containing 600 episodes spanning twelve query types, each with feasible and infeasible variants. Episodes test capabilities including latent intent inference, temporal scheduling, device constraints, and state verification, with each episode packaged with initial home state, verifiable goals, natural-language queries, and required actions.

Contribution

Dual evaluation methodology combining simulator-based and LLM-judge-based assessment

The authors establish a comprehensive evaluation approach that scores feasible tasks through direct simulator state comparisons and assesses infeasible tasks using validated LLM judges. This dual methodology enables objective, automated evaluation of agent performance across different query types while maintaining high agreement with human judgment.