MENLO: From Preferences to Proficiency – Evaluating and Modeling Native-like Quality Across 47 Languages

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: multilingual, reward modeling, RL, LLM-as-judge, human evaluation
Abstract:

Ensuring native-like quality of large language model (LLM) responses across many languages is challenging. To address this, we introduce MENLO, a framework that operationalizes the evaluation of native-like response quality based on audience design-inspired mechanisms. Using MENLO, we create a dataset of 6,423 human-annotated prompt–response preference pairs covering four quality dimensions with high inter-annotator agreement in 47 language varieties. Our evaluation reveals that zero-shot LLM judges benefit significantly from pairwise evaluation and our structured annotation rubrics, yet they still underperform human annotators on our dataset. We demonstrate substantial improvements through fine-tuning with reinforcement learning, reward shaping, and multi-task learning approaches. Additionally, we show that RL-trained judges can serve as generative reward models to enhance LLMs' multilingual proficiency, though discrepancies with human judgment remain. Our findings suggest promising directions for scalable multilingual evaluation and preference alignment. We release our dataset and evaluation framework to support further research in multilingual LLM evaluation.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces MENLO, a framework for evaluating native-like response quality in multilingual LLMs through audience design-inspired mechanisms, accompanied by a dataset of 6,423 human-annotated preference pairs across 47 language varieties. Within the taxonomy, this work resides in the 'Native-Like Quality and Naturalness Evaluation' leaf, which contains only three papers total. This represents a relatively sparse research direction compared to neighboring leaves like 'General Multilingual Generation and Understanding Benchmarks' (six papers) or 'LLM-Based Automated Evaluation' (four papers), suggesting the specific focus on native-like quality assessment remains an emerging area.

The taxonomy structure reveals that MENLO's leaf sits within the broader 'Multilingual Evaluation Frameworks and Benchmarks' branch, which also encompasses general benchmarks emphasizing task coverage, machine-generated text detection, and domain-specific evaluation. The sibling papers in this leaf explore related naturalness dimensions—one examining English accent variations in LLMs and another investigating human-like text preferences—but the scope note explicitly distinguishes native-like quality frameworks from general benchmarks lacking explicit naturalness metrics. Neighboring branches address LLM-based automated evaluation and translation quality metrics, indicating that MENLO bridges evaluation methodology with multilingual framework development.

Among the 30 candidates examined through semantic search, none were identified as clearly refuting any of MENLO's three core contributions: the evaluation framework itself, the annotated preference dataset, and the RL-trained judges as reward models. Each contribution was assessed against 10 candidate papers, with zero refutable overlaps found. The framework contribution appears most distinctive given the sparse population of its taxonomy leaf, while the dataset and RL-based reward modeling contributions show no substantial prior work within the limited search scope. This suggests relative novelty across all three dimensions, though the analysis acknowledges its bounded coverage.

Based on the limited literature search of 30 semantically similar papers, MENLO appears to occupy a relatively underexplored niche at the intersection of native-like quality evaluation and multilingual preference alignment. The sparse taxonomy leaf and absence of refuting prior work within the examined candidates suggest meaningful novelty, though the analysis cannot claim exhaustiveness. The framework's emphasis on audience design mechanisms and structured rubrics distinguishes it from broader multilingual benchmarks, while the RL-based reward modeling approach extends beyond existing naturalness evaluation methods within the surveyed scope.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: Evaluating and improving native-like response quality in multilingual language models.

The field has organized itself around several complementary branches. Multilingual Evaluation Frameworks and Benchmarks establish standardized testbeds spanning diverse languages and tasks, often emphasizing coverage of low-resource settings and culturally grounded phenomena. Evaluation Metrics and Methodologies develop both automatic and human-centered measures to capture fluency, adequacy, and naturalness, with recent work exploring LLM-based evaluators and cross-lingual consistency checks. Model Architectures and Training Approaches investigate how pretraining strategies, alignment techniques, and curriculum learning can enhance multilingual generation quality, while Task-Specific Applications and Specialized Domains adapt these methods to translation, summarization, dialogue, and domain-specific contexts such as medical or legal text. Machine-Generated Text Detection and Analysis examines whether outputs can be distinguished from human writing, and Comparative and Empirical Studies benchmark systems across languages to reveal performance gaps and guide future improvements.

A particularly active line of work focuses on native-like quality and naturalness evaluation, where the challenge is to move beyond surface-level metrics toward assessments that capture idiomatic expression, cultural appropriateness, and human-like fluency. MENLO[0] situates itself squarely in this space, proposing methods to evaluate whether multilingual outputs feel genuinely native rather than merely correct. This emphasis contrasts with broader benchmarks like IndicGenBench[5], which prioritizes task coverage across many Indic languages, and with works such as Human Translation Strategy[1] or Translation All You Need[2], which concentrate on translation fidelity and adequacy. Nearby efforts like English Accent LLMs[13] and Human-Like Text Preference[43] explore related dimensions of naturalness and human-likeness, highlighting ongoing debates about what constitutes truly native quality and how best to measure it across diverse linguistic and cultural contexts.

Claimed Contributions

MENLO framework for evaluating native-like response quality

The authors develop a framework that breaks down native-like response quality into four key dimensions (language quality and coherence, alignment with cultural and linguistic nuances, factual correctness and grounding in local context, and overall writing style and helpfulness) using principles from audience design, with tailored prompts and structured annotation rubrics (see the sketch below).

10 retrieved papers
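As a concrete illustration of the pairwise, rubric-based judging pattern described above, the following Python sketch asks a judge model to compare two responses along the four named dimensions. The prompt wording, the `judge_pairwise` function, and the injected `generate` callable are illustrative assumptions, not MENLO's released prompts or rubric.

```python
# Minimal sketch of pairwise, rubric-based judging along the four quality
# dimensions named above. Prompt wording and function names are assumptions
# for illustration; they do not reproduce MENLO's actual rubric.
from typing import Callable

DIMENSIONS = [
    "language quality and coherence",
    "alignment with cultural and linguistic nuances",
    "factual correctness and grounding in local context",
    "overall writing style and helpfulness",
]

def build_pairwise_prompt(prompt: str, resp_a: str, resp_b: str, language: str) -> str:
    """Assemble a rubric-style instruction asking a judge to compare two responses."""
    rubric = "\n".join(f"- {d}" for d in DIMENSIONS)
    return (
        f"You are evaluating two {language} responses to the same user prompt.\n"
        f"Judge which response reads as more native-like along these dimensions:\n"
        f"{rubric}\n\n"
        f"User prompt:\n{prompt}\n\nResponse A:\n{resp_a}\n\nResponse B:\n{resp_b}\n\n"
        "Answer with exactly one token: A, B, or TIE."
    )

def judge_pairwise(prompt: str, resp_a: str, resp_b: str, language: str,
                   generate: Callable[[str], str]) -> str:
    """Return 'A', 'B', or 'TIE' using any text-generation backend passed as `generate`."""
    verdict = generate(build_pairwise_prompt(prompt, resp_a, resp_b, language)).strip().upper()
    return verdict if verdict in {"A", "B", "TIE"} else "TIE"  # fall back conservatively

if __name__ == "__main__":
    # Stub backend so the sketch runs without any model; swap in a real LLM call.
    print(judge_pairwise("Describe a typical local breakfast.",
                         "candidate response A", "candidate response B",
                         "Spanish", lambda p: "A"))
```

The judge backend is deliberately passed in as a plain callable so the sketch stays independent of any particular model API.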
MENLO dataset with human-annotated preference pairs

The authors construct a multilingual dataset consisting of 6,423 annotated prompt–response preference pairs across 47 language varieties, achieving high inter-annotator agreement (Krippendorff's alpha = 0.84) through carefully designed annotation guidelines and rubrics (a toy agreement computation is sketched below).

10 retrieved papers
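To make the reported agreement statistic concrete, the snippet below computes Krippendorff's alpha for nominal preference labels using the third-party `krippendorff` package. The ratings matrix is made-up toy data for illustration; it does not reproduce the paper's annotations or its reported 0.84 value.

```python
# Toy illustration of the agreement statistic (Krippendorff's alpha) reported
# for the dataset. Requires the third-party package: pip install krippendorff.
# The ratings below are invented for illustration only.
import numpy as np
import krippendorff

# Rows = annotators, columns = preference-pair items.
# Labels: 0 = prefer response A, 1 = prefer response B, 2 = tie;
# np.nan marks items an annotator did not rate.
ratings = np.array([
    [0, 1, 1, 2, 0, np.nan],
    [0, 1, 1, 2, 1, 0],
    [0, 1, 2, 2, 0, 0],
], dtype=float)

alpha = krippendorff.alpha(reliability_data=ratings, level_of_measurement="nominal")
print(f"Krippendorff's alpha (nominal): {alpha:.2f}")
```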
RL-trained judges as generative reward models

The authors demonstrate that judges trained with reinforcement learning, reward shaping, and multi-task learning can serve as generative reward models that directly improve a policy model's language proficiency, though they note that discrepancies between LLM and human evaluations remain (see the sketch below).

10 retrieved papers
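The sketch below shows one generic way a trained judge can act as a generative reward model: scoring candidate responses and keeping the highest-scoring one (best-of-N reranking). The 1-5 score scale, prompt wording, and callables are assumptions for illustration; the paper's RL training, reward shaping, and multi-task setup are not reproduced here.

```python
# Sketch of using a trained judge as a generative reward model via best-of-N
# selection. Score scale, prompt wording, and function names are illustrative
# assumptions, not the paper's actual reward-shaping pipeline.
from typing import Callable, List, Tuple
import re

def score_response(prompt: str, response: str, language: str,
                   judge: Callable[[str], str]) -> float:
    """Ask a judge model for a 1-5 native-likeness score and parse the first digit found."""
    instruction = (
        f"Rate how native-like this {language} response to the prompt is, "
        f"on a 1-5 scale. Reply with the number only.\n\n"
        f"Prompt:\n{prompt}\n\nResponse:\n{response}"
    )
    match = re.search(r"[1-5]", judge(instruction))
    return float(match.group()) if match else 1.0  # pessimistic default if unparseable

def best_of_n(prompt: str, candidates: List[str], language: str,
              judge: Callable[[str], str]) -> Tuple[str, float]:
    """Rerank candidate responses with the judge-derived reward and return the best one."""
    scored = [(c, score_response(prompt, c, language, judge)) for c in candidates]
    return max(scored, key=lambda pair: pair[1])

if __name__ == "__main__":
    # Stub judge so the sketch runs standalone; swap in the RL-trained judge model.
    best, reward = best_of_n("Explain a local holiday.", ["draft 1", "draft 2"],
                             "Hindi", lambda _: "4")
    print(best, reward)
```

The same per-response score could instead be fed back as a scalar reward during policy optimization; best-of-N is shown only because it keeps the example self-contained.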

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Each claimed contribution (described in full under Claimed Contributions above) was compared against 10 retrieved candidate papers, and no refutable overlap was identified for any of them:

Contribution: MENLO framework for evaluating native-like response quality

Contribution: MENLO dataset with human-annotated preference pairs

Contribution: RL-trained judges as generative reward models