Abstract:

Agents are predominantly evaluated and optimized via task success metrics, which are coarse, rely on manual design from experts, and fail to reward intermediate emergent behaviors. We propose AutoLibra, a framework for agent evaluation that transforms open-ended human feedback, e.g., “If you find that the button is disabled, don’t click it again” or “This agent has too much autonomy to decide what to do on its own”, into metrics for evaluating fine-grained behaviors in agent trajectories. AutoLibra accomplishes this by grounding feedback in an agent’s behavior, clustering similar positive and negative behaviors, and creating concrete metrics with clear definitions and concrete examples, which can be used to prompt LLM-as-a-Judge evaluators. We further propose two meta-metrics, “coverage” and “redundancy”, to evaluate how well a set of induced metrics aligns with open-ended feedback. By optimizing these meta-metrics, we experimentally demonstrate that AutoLibra induces more concrete agent evaluation metrics than those proposed in previous agent evaluation benchmarks and discovers new metrics for analyzing agents. We also present two applications of AutoLibra in agent improvement: first, we show that AutoLibra helps human prompt engineers diagnose agent failures and improve prompts iteratively; second, we find that AutoLibra can induce metrics for automatic agent optimization, enabling agents to improve through self-regulation. Our results suggest that AutoLibra is a powerful task-agnostic tool for evaluating and improving language agents.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

AutoLibra proposes a framework for transforming open-ended human feedback into concrete evaluation metrics for agent behaviors, addressing the limitation that task success metrics fail to reward intermediate emergent behaviors. The paper resides in the 'Open-Ended Feedback to Metric Transformation' leaf, which contains only two papers including this one. This represents a notably sparse research direction within the broader taxonomy of 35 papers across the field, suggesting the specific problem of automated metric induction from unstructured feedback remains relatively underexplored compared to adjacent areas like structured feedback-based reward modeling or reinforcement learning from human feedback.

The taxonomy reveals that AutoLibra sits at the intersection of metric induction and evaluation frameworks, with neighboring branches focusing on structured feedback approaches (preference rankings, comparisons) and agent optimization methods (RLHF, linguistic refinement). The closest sibling work explores similar transformation pipelines, while related directions like 'Multi-Dimensional Human-Centric Evaluation' and 'Task-Specific Automated Evaluation' address complementary aspects of assessment without the automated induction component. The taxonomy's scope notes clarify that AutoLibra's focus on unstructured free-text feedback distinguishes it from methods using predefined metrics or structured formats, positioning it in a boundary area between human-AI interaction and evaluation methodology.

Among 30 candidates examined through limited semantic search, none clearly refuted any of AutoLibra's three core contributions: the framework itself, the coverage/redundancy meta-metrics, and the two-step thematic analysis-inspired induction process. Each contribution was evaluated against 10 candidates with zero refutable overlaps identified. This suggests that within the examined scope, the specific combination of automated metric induction, meta-metric optimization, and grounding through behavior clustering appears relatively novel. However, this assessment is constrained by the search methodology—top-K semantic matching may not capture all relevant prior work in adjacent evaluation or feedback processing domains.

Based on the limited literature search covering 30 candidates, AutoLibra appears to occupy a sparsely populated research niche, particularly in its automated approach to deriving concrete metrics from unstructured feedback. The absence of refutable prior work among examined candidates, combined with the small leaf size in the taxonomy, suggests meaningful novelty within the analyzed scope. However, the analysis does not cover exhaustive manual surveys of evaluation methodology or human feedback processing literature, leaving open the possibility of relevant work outside the semantic search radius.

Taxonomy

Core-task Taxonomy Papers: 35
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: inducing evaluation metrics from open-ended human feedback for agents. The field addresses how to transform rich, unstructured human commentary into actionable signals that guide agent behavior and assessment. The taxonomy reveals several complementary directions: one branch focuses on metric induction and reward learning, developing techniques to distill preferences and critiques into scalar rewards or structured evaluations; another emphasizes agent learning and optimization, exploring how feedback loops refine policies through reinforcement or iterative improvement; a third concentrates on evaluation frameworks and benchmarking, establishing standardized ways to measure human alignment; while additional branches examine interaction mechanisms, domain-specific deployments, and foundational theory.

Works such as AlpacaFarm[2] and Learning to Summarize[1] illustrate early efforts to systematize preference data, whereas newer methods like Reflexion[3] and GUIDE[6] demonstrate how agents can leverage textual feedback for self-improvement. A particularly active line of inquiry centers on converting open-ended critiques into usable metrics without heavy manual annotation.

AutoLibra[0] sits squarely within this cluster, proposing automated ways to parse free-form human comments into evaluation dimensions that can score agent outputs. This contrasts with approaches like Polos[4] or Multi-Agent Judge[9], which rely more on structured comparisons or ensemble scoring, and differs from reinforcement-heavy methods such as PPO Human Feedback[13] that optimize policies directly from preference rankings. Nearby work like AutoLibra Feedback[20] explores similar transformation pipelines, while studies such as Human-Centric Evaluation[10] and Evaluating Agentic AI[12] emphasize broader frameworks for aligning metrics with human values.
The central tension across these efforts involves balancing the expressiveness of natural language feedback against the need for consistent, scalable evaluation signals—a challenge that AutoLibra[0] addresses by automating the induction process while preserving the nuance of open-ended input.

Claimed Contributions

AutoLibra framework for agent evaluation from open-ended human feedback

The authors introduce AutoLibra, a novel framework that converts open-ended human feedback (such as natural language comments about agent behavior) into concrete, interpretable evaluation metrics. The framework grounds feedback to specific agent behaviors, clusters similar behaviors, and creates metrics with clear definitions and examples that can be used with LLM-as-a-Judge evaluators.
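To make the final stage of this pipeline concrete, the sketch below shows how an induced metric (name, definition, example behaviors) might be assembled into an LLM-as-a-Judge prompt. The template and field names are illustrative assumptions, not AutoLibra's actual prompt format.

```python
def build_judge_prompt(metric_name, definition, examples, trajectory):
    """Assemble a hypothetical LLM-as-a-Judge prompt from an induced metric.

    `examples` are concrete behaviors attached to the metric; `trajectory`
    is the agent trajectory to be judged, rendered as text.
    """
    example_lines = "\n".join(f"- {e}" for e in examples)
    return (
        f"You are evaluating an agent trajectory on the metric '{metric_name}'.\n"
        f"Definition: {definition}\n"
        f"Example behaviors:\n{example_lines}\n\n"
        f"Agent trajectory:\n{trajectory}\n\n"
        "Does the trajectory exhibit this behavior? Answer yes or no, "
        "with a brief justification."
    )
```

The returned string would then be sent to the judge model; the yes/no framing is one possible scoring scheme among several the paper's setup could use.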

10 retrieved papers
Meta-metrics for evaluating induced metrics: coverage and redundancy

The authors propose two meta-metrics to assess the quality of induced metrics: coverage (what proportion of feedback aspects are matched with agent traits) and redundancy (what proportion of detected traits are not mentioned by humans). These meta-metrics enable optimization of the metric induction process.
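The two meta-metrics reduce to simple proportions once a matching judgment is available. The following is a minimal sketch, assuming a hypothetical `matches()` predicate that decides whether a feedback aspect and an induced trait refer to the same behavior; in the real pipeline that judgment would come from an LLM, while here it is stubbed as keyword overlap.

```python
def matches(aspect: str, trait: str) -> bool:
    """Hypothetical matcher: any shared word counts as a match."""
    return bool(set(aspect.lower().split()) & set(trait.lower().split()))

def coverage(aspects: list[str], traits: list[str]) -> float:
    """Proportion of feedback aspects matched by at least one induced trait."""
    if not aspects:
        return 0.0
    hit = sum(any(matches(a, t) for t in traits) for a in aspects)
    return hit / len(aspects)

def redundancy(aspects: list[str], traits: list[str]) -> float:
    """Proportion of detected traits not matched by any feedback aspect."""
    if not traits:
        return 0.0
    unmatched = sum(not any(matches(a, t) for a in aspects) for t in traits)
    return unmatched / len(traits)
```

Optimizing the induction process then means pushing coverage up while pushing redundancy down.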

10 retrieved papers
Two-step metric induction process inspired by thematic analysis

The authors design a two-step induction process drawing from thematic analysis methodology: feedback grounding (where human feedback is grounded to specific behaviors in agent trajectories) and behavior clustering (where similar behaviors are grouped into metrics with definitions and examples).
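The two steps can be sketched as a toy pipeline. Both LLM-driven operations are replaced with hypothetical stand-ins here: grounding assumes feedback items arrive pre-annotated with trajectory steps, and clustering groups behaviors by polarity and leading word rather than semantic similarity.

```python
from collections import defaultdict
from dataclasses import dataclass, field

@dataclass
class Behavior:
    step: int          # index into the agent trajectory
    description: str   # concrete behavior the feedback was grounded to
    polarity: str      # "positive" or "negative"

@dataclass
class Metric:
    name: str
    definition: str
    examples: list = field(default_factory=list)

def ground_feedback(annotated_feedback):
    """Step 1 (toy): map feedback items to trajectory behaviors.
    The real pipeline uses an LLM to locate the relevant step."""
    return [Behavior(step, desc, pol) for desc, step, pol in annotated_feedback]

def cluster_behaviors(behaviors):
    """Step 2 (toy): group behaviors sharing polarity and leading word.
    The real pipeline clusters semantically similar behaviors with an LLM."""
    groups = defaultdict(list)
    for b in behaviors:
        groups[(b.polarity, b.description.split()[0])].append(b)
    return [
        Metric(name=f"{pol}:{word}",
               definition=f"Agent shows {pol} '{word}'-type behavior",
               examples=members)
        for (pol, word), members in groups.items()
    ]
```

Each resulting `Metric` carries a definition plus its member behaviors as concrete examples, mirroring the structure the paper feeds to LLM-as-a-Judge evaluators.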

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

AutoLibra framework for agent evaluation from open-ended human feedback

The authors introduce AutoLibra, a novel framework that converts open-ended human feedback (such as natural language comments about agent behavior) into concrete, interpretable evaluation metrics. The framework grounds feedback to specific agent behaviors, clusters similar behaviors, and creates metrics with clear definitions and examples that can be used with LLM-as-a-Judge evaluators.

Contribution

Meta-metrics for evaluating induced metrics: coverage and redundancy

The authors propose two meta-metrics to assess the quality of induced metrics: coverage (what proportion of feedback aspects are matched with agent traits) and redundancy (what proportion of detected traits are not mentioned by humans). These meta-metrics enable optimization of the metric induction process.

Contribution

Two-step metric induction process inspired by thematic analysis

The authors design a two-step induction process drawing from thematic analysis methodology: feedback grounding (where human feedback is grounded to specific behaviors in agent trajectories) and behavior clustering (where similar behaviors are grouped into metrics with definitions and examples).

AutoLibra: Agent Metric Induction from Open-Ended Human Feedback | Novelty Validation