A Framework for Studying AI Agent Behavior: Evidence from Consumer Choice Experiments

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: LLM Agents, Agentic AI, Behavior, Choices, Alignment, Safety, Benchmark
Abstract:

Environments built for people are increasingly operated by a new class of economic actors: LLM-powered software agents making decisions on our behalf. These decisions range from our purchases to travel plans to medical treatment selection. Current evaluations of these agents largely focus on task competence, but we argue for a deeper assessment: how these agents choose when faced with realistic decisions. We introduce ABxLab, a framework for systematically probing agentic choice through controlled manipulations of option attributes and persuasive cues. We apply this to a realistic web-based shopping environment, where we vary prices, ratings, and psychological nudges, all of which are factors long known to shape human choice. We find that agent decisions shift predictably and substantially in response, revealing that agents are strongly biased choosers even without being subject to the cognitive constraints that shape human biases. This susceptibility reveals both risk and opportunity: risk, because agentic consumers may inherit and amplify human biases; opportunity, because consumer choice provides a powerful testbed for a behavioral science of AI agents, just as it has for the study of human behavior. We release our framework as an open benchmark for rigorous, scalable evaluation of agent decision-making.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces ABxLab, a framework for systematically probing how LLM-powered agents make consumer choices under controlled manipulations of prices, ratings, and psychological nudges. Within the taxonomy, it occupies the 'Experimental Frameworks and Methodological Approaches' leaf, which currently contains only this paper—making it a sparse, methodologically focused niche. While the broader taxonomy encompasses fifty papers across diverse topics like anthropomorphism, trust, and personalization, this leaf stands alone, suggesting the paper addresses a methodological gap in how researchers study agentic decision-making behavior.

The taxonomy reveals dense neighboring branches examining consumer responses to AI design features (anthropomorphism, communication style) and behavioral outcomes (trust, autonomy, purchase decisions). The original paper diverges from these empirical applications by offering a controlled experimental testbed rather than studying consumer perceptions or adoption. Its closest conceptual neighbors—such as work on recommendation nudging and behavioral economics—focus on human responses to AI, whereas this framework evaluates the agents themselves. The taxonomy's scope notes clarify that this leaf excludes theoretical reviews and empirical applications, positioning the paper as a methodological contribution distinct from the field's dominant empirical and theoretical streams.

Among twenty-six candidates examined, the contribution-level analysis shows varied novelty. The ABxLab framework itself (ten candidates examined, zero refutations) appears methodologically distinct within the limited search scope. The scalable benchmark contribution (six candidates, zero refutations) similarly lacks direct prior work among examined papers. However, the empirical finding of systematic biases in LLM agents (ten candidates, one refutation) encounters at least one overlapping study, suggesting this observation may not be entirely new. The analysis explicitly notes this is a top-K semantic search, not an exhaustive review, so these statistics reflect a bounded literature sample rather than definitive field coverage.

Given the limited search scope and the paper's placement in a singleton taxonomy leaf, the framework contribution appears methodologically novel within the examined literature. The empirical bias findings, while supported by controlled experiments, show some overlap with prior work. The taxonomy structure suggests the paper occupies a sparse methodological niche, though the small candidate pool (twenty-six papers) means this assessment is provisional and would benefit from broader literature coverage to confirm the framework's distinctiveness relative to adjacent experimental and evaluation methodologies.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 26
Refutable Papers: 1

Research Landscape Overview

Core task: Studying AI agent decision-making behavior in consumer choice environments. This field examines how artificial intelligence systems influence, support, or replace human decision-making in purchasing and consumption contexts. The taxonomy reveals a multifaceted landscape organized around several major themes: the design characteristics of AI agents themselves (including anthropomorphism and conversational capabilities), consumer psychological responses (trust, satisfaction, autonomy concerns), the mechanics of personalization and recommendation systems, ethical dimensions of AI-consumer interactions, and domain-specific applications ranging from retail to tourism. Works such as AI Anthropomorphism Literature Review[11] and Conversational AI Review[33] illustrate how researchers systematically map agent design features, while studies like AI Consumer Decision Making[15] and Behavioral Economics AI[13] explore the cognitive and behavioral shifts induced by AI intermediaries.

Methodologically, the field spans experimental investigations, theoretical frameworks, and applied evaluations across voice assistants, chatbots, and autonomous agentic systems.

Recent lines of inquiry highlight tensions between automation benefits and consumer autonomy. A dense branch examines delegation and control trade-offs, with papers like AI Decision Delegation Trust[36] and Algorithmic Decision Autonomy[40] probing when and why consumers cede choices to algorithms. Another active area focuses on trust formation and relational outcomes, exploring how agent characteristics (gender, communication style, transparency) shape acceptance, as seen in AI Agent Gender Trust[45] and Chatbot Communication Style[41].

The original paper, AI Agent Consumer Choice[0], sits squarely within the Experimental Frameworks and Methodological Approaches branch, emphasizing rigorous empirical methods to understand agent behavior in choice settings.
Compared to broader reviews like AI Firms Consumer Survey[3] or applied studies such as AI Online Purchase UTAUT[5], AI Agent Consumer Choice[0] appears to prioritize controlled experimental designs that isolate specific decision mechanisms, contributing methodological rigor to a field increasingly concerned with both theoretical depth and practical consumer welfare.

Claimed Contributions

ABxLab framework for studying AI agent behavior

The authors present ABxLab, an open-source man-in-the-middle framework that intercepts and modifies real-world web content in real time, transforming arbitrary websites into controllable behavioral testbeds for studying AI agent decision-making under controlled experimental conditions.

10 retrieved papers
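The interception-and-rewrite idea behind this contribution can be sketched in miniature. Nothing below comes from the paper's actual implementation: the markup convention (`<h2 data-product="...">`), the nudge copy, and the function name `inject_nudge` are all illustrative assumptions about how an intercepted product page might be modified before an agent sees it.

```python
import re

# Hypothetical nudge snippets; the real framework's interventions and
# markup are not specified at this level of detail.
NUDGE_SNIPPETS = {
    "social_proof": '<span class="nudge">Bought by 2,000+ customers this week</span>',
    "scarcity": '<span class="nudge">Only 3 left in stock</span>',
}

def inject_nudge(page_html: str, product_id: str, nudge: str) -> str:
    """Insert a nudge snippet right after the title of one targeted product.

    Assumes product titles are marked up as <h2 data-product="...">...</h2>;
    a man-in-the-middle proxy would apply the same rewrite to the
    intercepted HTTP response body before forwarding it to the agent.
    """
    snippet = NUDGE_SNIPPETS[nudge]
    pattern = rf'(<h2 data-product="{re.escape(product_id)}">.*?</h2>)'
    return re.sub(pattern, rf"\1{snippet}", page_html, count=1)
```

Because the rewrite targets a single product and leaves the rest of the page byte-identical, choice shifts can be attributed to the injected cue alone.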
Scalable benchmark for evaluating agent decision-making

The authors contribute a comprehensive benchmark consisting of over 80,000 experiments across 17 models, systematically testing agent responses to various interventions including authority cues, social proof, scarcity, negative framing, and incentives in realistic web-based shopping environments.

6 retrieved papers
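A factorial design like the one described (80,000+ runs across 17 models and several intervention types) is typically enumerated as a cross-product of experimental dimensions. The dimension names, levels, and repetition count below are assumptions for illustration, not the benchmark's actual configuration; the toy grid here has only 1,020 cells.

```python
from itertools import product

# Assumed dimensions of a nudge benchmark; only the model count (17) and the
# intervention categories are taken from the contribution description.
MODELS = [f"model_{i}" for i in range(17)]
INTERVENTIONS = ["authority", "social_proof", "scarcity",
                 "negative_framing", "incentive", "control"]
TARGET_SLOTS = ["option_a", "option_b"]  # which product receives the cue
REPETITIONS = range(5)                   # repeated trials per condition

def experiment_grid():
    """Yield one run spec per cell of the model x intervention x slot x rep grid."""
    for model, nudge, slot, rep in product(MODELS, INTERVENTIONS,
                                           TARGET_SLOTS, REPETITIONS):
        yield {"model": model, "intervention": nudge,
               "target": slot, "repetition": rep}

runs = list(experiment_grid())  # 17 * 6 * 2 * 5 = 1020 cells in this toy grid
```

Varying which option carries the cue (and repeating each cell) is what lets the benchmark separate the intervention's effect from position and sampling noise.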
Empirical evidence of systematic biases in LLM agents

The authors provide empirical evidence demonstrating that LLM agents exhibit strong, systematic biases in response to ratings, prices, order effects, and nudges, with effect sizes substantially larger than human baselines, revealing that agents are hypersensitive to choice architecture manipulations.

10 retrieved papers (one can refute)
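A bias claim of this kind reduces to comparing how often agents pick a target option with and without a manipulation. The paper's exact effect-size metric is not given here, so the sketch below uses two standard summaries for binary choices, a percentage-point shift and a log-odds ratio; the function name and example rates are illustrative.

```python
from math import log

def choice_shift(p_treated: float, p_control: float) -> dict:
    """Summarize how much an intervention shifts choice of a target option.

    p_treated / p_control are the fractions of runs in which the agent
    picked the target option with and without the nudge. The log-odds
    ratio is a conventional effect size for binary outcomes.
    """
    def odds(p):
        return p / (1.0 - p)
    return {
        "pp_shift": p_treated - p_control,  # raw percentage-point shift
        "log_odds_ratio": log(odds(p_treated) / odds(p_control)),
    }

# e.g. a nudge raising the target-choice rate from 50% to 80%
effect = choice_shift(0.80, 0.50)
```

Comparing such effect sizes against published human baselines for the same nudges is what supports the claim that agents are hypersensitive choosers.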

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current top-K retrieved core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, a partial signal of novelty, though one constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1: ABxLab framework for studying AI agent behavior (10 candidate papers examined, zero refutations).

Contribution 2: Scalable benchmark for evaluating agent decision-making (6 candidate papers examined, zero refutations).

Contribution 3: Empirical evidence of systematic biases in LLM agents (10 candidate papers examined, one refutation).

Full descriptions of each contribution appear under Claimed Contributions above.