A Framework for Studying AI Agent Behavior: Evidence from Consumer Choice Experiments

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: LLM Agents, Agentic AI, Behavior, Choices, Alignment, Safety, Benchmark
Abstract:

Environments built for people are increasingly operated by a new class of economic actors: LLM-powered software agents making decisions on our behalf. These decisions range from our purchases to travel plans to medical treatment selection. Current evaluations of these agents largely focus on task competence, but we argue for a deeper assessment: how these agents choose when faced with realistic decisions. We introduce ABxLab, a framework for systematically probing agentic choice through controlled manipulations of option attributes and persuasive cues. We apply this to a realistic web-based shopping environment, where we vary prices, ratings, and psychological nudges, all of which are factors long known to shape human choice. We find that agent decisions shift predictably and substantially in response, revealing that agents are strongly biased choosers even without being subject to the cognitive constraints that shape human biases. This susceptibility reveals both risk and opportunity: risk, because agentic consumers may inherit and amplify human biases; opportunity, because consumer choice provides a powerful testbed for a behavioral science of AI agents, just as it has for the study of human behavior. We release our framework as an open benchmark for rigorous, scalable evaluation of agent decision-making.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces ABxLab, a framework for systematically probing how LLM-powered agents make consumer choices under controlled manipulations of prices, ratings, and psychological nudges. Within the taxonomy, it occupies the 'Experimental Frameworks and Methodological Approaches' leaf, which currently contains only this paper—making it a sparse, methodologically focused niche. While the broader taxonomy encompasses fifty papers across diverse topics like anthropomorphism, trust, and personalization, this leaf stands alone, suggesting the paper addresses a methodological gap in how researchers study agentic decision-making behavior.

The taxonomy reveals dense neighboring branches examining consumer responses to AI design features (anthropomorphism, communication style) and behavioral outcomes (trust, autonomy, purchase decisions). The original paper diverges from these empirical applications by offering a controlled experimental testbed rather than studying consumer perceptions or adoption. Its closest conceptual neighbors—such as work on recommendation nudging and behavioral economics—focus on human responses to AI, whereas this framework evaluates the agents themselves. The taxonomy's scope notes clarify that this leaf excludes theoretical reviews and empirical applications, positioning the paper as a methodological contribution distinct from the field's dominant empirical and theoretical streams.

Among twenty-six candidates examined, the contribution-level analysis shows varied novelty. The ABxLab framework itself (ten candidates examined, zero refutations) appears methodologically distinct within the limited search scope. The scalable benchmark contribution (six candidates, zero refutations) similarly lacks direct prior work among examined papers. However, the empirical finding of systematic biases in LLM agents (ten candidates, one refutation) encounters at least one overlapping study, suggesting this observation may not be entirely new. The analysis explicitly notes this is a top-K semantic search, not an exhaustive review, so these statistics reflect a bounded literature sample rather than definitive field coverage.

Given the limited search scope and the paper's placement in a singleton taxonomy leaf, the framework contribution appears methodologically novel within the examined literature. The empirical bias findings, while supported by controlled experiments, show some overlap with prior work. The taxonomy structure suggests the paper occupies a sparse methodological niche, though the small candidate pool (twenty-six papers) means this assessment is provisional and would benefit from broader literature coverage to confirm the framework's distinctiveness relative to adjacent experimental and evaluation methodologies.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 26
Refutable Papers: 1

Research Landscape Overview

Core task: Studying AI agent decision-making behavior in consumer choice environments. This field examines how artificial intelligence systems influence, support, or replace human decision-making in purchasing and consumption contexts. The taxonomy reveals a multifaceted landscape organized around several major themes: the design characteristics of AI agents themselves (including anthropomorphism and conversational capabilities), consumer psychological responses (trust, satisfaction, autonomy concerns), the mechanics of personalization and recommendation systems, ethical dimensions of AI-consumer interactions, and domain-specific applications ranging from retail to tourism. Works such as AI Anthropomorphism Literature Review[11] and Conversational AI Review[33] illustrate how researchers systematically map agent design features, while studies like AI Consumer Decision Making[15] and Behavioral Economics AI[13] explore the cognitive and behavioral shifts induced by AI intermediaries.

Methodologically, the field spans experimental investigations, theoretical frameworks, and applied evaluations across voice assistants, chatbots, and autonomous agentic systems.

Recent lines of inquiry highlight tensions between automation benefits and consumer autonomy. A dense branch examines delegation and control trade-offs, with papers like AI Decision Delegation Trust[36] and Algorithmic Decision Autonomy[40] probing when and why consumers cede choices to algorithms. Another active area focuses on trust formation and relational outcomes, exploring how agent characteristics (gender, communication style, transparency) shape acceptance, as seen in AI Agent Gender Trust[45] and Chatbot Communication Style[41].

The original paper, AI Agent Consumer Choice[0], sits squarely within the Experimental Frameworks and Methodological Approaches branch, emphasizing rigorous empirical methods to understand agent behavior in choice settings.
Compared to broader reviews like AI Firms Consumer Survey[3] or applied studies such as AI Online Purchase UTAUT[5], AI Agent Consumer Choice[0] appears to prioritize controlled experimental designs that isolate specific decision mechanisms, contributing methodological rigor to a field increasingly concerned with both theoretical depth and practical consumer welfare.

Claimed Contributions

ABxLab framework for studying AI agent behavior

The authors present ABxLab, an open-source man-in-the-middle framework that intercepts and modifies real-world web content in real time, transforming arbitrary websites into controllable behavioral testbeds for studying AI agent decision-making under controlled experimental conditions.

10 retrieved papers
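The interception-and-rewrite idea behind this contribution can be sketched in miniature. Nothing below comes from the paper's actual implementation: the markup convention (`<h2 data-product="...">`), the nudge copy, and the function name `inject_nudge` are all illustrative assumptions about how an intercepted product page might be modified before an agent sees it.

```python
import re

# Hypothetical nudge snippets; the real framework's interventions and
# markup are not specified at this level of detail.
NUDGE_SNIPPETS = {
    "social_proof": '<span class="nudge">Bought by 2,000+ customers this week</span>',
    "scarcity": '<span class="nudge">Only 3 left in stock</span>',
}

def inject_nudge(page_html: str, product_id: str, nudge: str) -> str:
    """Insert a nudge snippet right after the title of one targeted product.

    Assumes product titles are marked up as <h2 data-product="...">...</h2>;
    a man-in-the-middle proxy would apply the same rewrite to the
    intercepted HTTP response body before forwarding it to the agent.
    """
    snippet = NUDGE_SNIPPETS[nudge]
    pattern = rf'(<h2 data-product="{re.escape(product_id)}">.*?</h2>)'
    return re.sub(pattern, rf"\1{snippet}", page_html, count=1)
```

Because the rewrite targets a single product and leaves the rest of the page byte-identical, choice shifts can be attributed to the injected cue alone.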
Scalable benchmark for evaluating agent decision-making

The authors contribute a comprehensive benchmark consisting of over 80,000 experiments across 17 models, systematically testing agent responses to various interventions including authority cues, social proof, scarcity, negative framing, and incentives in realistic web-based shopping environments.

6 retrieved papers
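A factorial design like the one described (80,000+ runs across 17 models and several intervention types) is typically enumerated as a cross-product of experimental dimensions. The dimension names, levels, and repetition count below are assumptions for illustration, not the benchmark's actual configuration; the toy grid here has only 1,020 cells.

```python
from itertools import product

# Assumed dimensions of a nudge benchmark; only the model count (17) and the
# intervention categories are taken from the contribution description.
MODELS = [f"model_{i}" for i in range(17)]
INTERVENTIONS = ["authority", "social_proof", "scarcity",
                 "negative_framing", "incentive", "control"]
TARGET_SLOTS = ["option_a", "option_b"]  # which product receives the cue
REPETITIONS = range(5)                   # repeated trials per condition

def experiment_grid():
    """Yield one run spec per cell of the model x intervention x slot x rep grid."""
    for model, nudge, slot, rep in product(MODELS, INTERVENTIONS,
                                           TARGET_SLOTS, REPETITIONS):
        yield {"model": model, "intervention": nudge,
               "target": slot, "repetition": rep}

runs = list(experiment_grid())  # 17 * 6 * 2 * 5 = 1020 cells in this toy grid
```

Varying which option carries the cue (and repeating each cell) is what lets the benchmark separate the intervention's effect from position and sampling noise.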
Empirical evidence of systematic biases in LLM agents

The authors provide empirical evidence demonstrating that LLM agents exhibit strong, systematic biases in response to ratings, prices, order effects, and nudges, with effect sizes substantially larger than human baselines, revealing that agents are hypersensitive to choice architecture manipulations.

10 retrieved papers (one can refute)
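A bias claim of this kind reduces to comparing how often agents pick a target option with and without a manipulation. The paper's exact effect-size metric is not given here, so the sketch below uses two standard summaries for binary choices, a percentage-point shift and a log-odds ratio; the function name and example rates are illustrative.

```python
from math import log

def choice_shift(p_treated: float, p_control: float) -> dict:
    """Summarize how much an intervention shifts choice of a target option.

    p_treated / p_control are the fractions of runs in which the agent
    picked the target option with and without the nudge. The log-odds
    ratio is a conventional effect size for binary outcomes.
    """
    def odds(p):
        return p / (1.0 - p)
    return {
        "pp_shift": p_treated - p_control,  # raw percentage-point shift
        "log_odds_ratio": log(odds(p_treated) / odds(p_control)),
    }

# e.g. a nudge raising the target-choice rate from 50% to 80%
effect = choice_shift(0.80, 0.50)
```

Comparing such effect sizes against published human baselines for the same nudges is what supports the claim that agents are hypersensitive choosers.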

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current top-K retrieved core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, a partial signal of novelty, though one constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1: ABxLab framework for studying AI agent behavior (10 candidate papers examined, zero refutations).

Contribution 2: Scalable benchmark for evaluating agent decision-making (6 candidate papers examined, zero refutations).

Contribution 3: Empirical evidence of systematic biases in LLM agents (10 candidate papers examined, one refutation).

Full descriptions of each contribution appear under Claimed Contributions above.