MENLO: From Preferences to Proficiency – Evaluating and Modeling Native-like Quality Across 47 Languages

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: multilingual, reward modeling, RL, LLM-as-judge, human evaluation
Abstract:

Ensuring native-like quality of large language model (LLM) responses across many languages is challenging. To address this, we introduce MENLO, a framework that operationalizes the evaluation of native-like response quality based on audience design-inspired mechanisms. Using MENLO, we create a dataset of 6,423 human-annotated prompt–response preference pairs covering four quality dimensions with high inter-annotator agreement in 47 language varieties. Our evaluation reveals that zero-shot LLM judges benefit significantly from pairwise evaluation and our structured annotation rubrics, yet they still underperform human annotators on our dataset. We demonstrate substantial improvements through fine-tuning with reinforcement learning, reward shaping, and multi-task learning approaches. Additionally, we show that RL-trained judges can serve as generative reward models to enhance LLMs' multilingual proficiency, though discrepancies with human judgment remain. Our findings suggest promising directions for scalable multilingual evaluation and preference alignment. We release our dataset and evaluation framework to support further research in multilingual LLM evaluation.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces MENLO, a framework for evaluating native-like response quality in multilingual LLMs through audience design-inspired mechanisms, accompanied by a dataset of 6,423 human-annotated preference pairs across 47 language varieties. Within the taxonomy, this work resides in the 'Native-Like Quality and Naturalness Evaluation' leaf, which contains only three papers total. This represents a relatively sparse research direction compared to neighboring leaves like 'General Multilingual Generation and Understanding Benchmarks' (six papers) or 'LLM-Based Automated Evaluation' (four papers), suggesting the specific focus on native-like quality assessment remains an emerging area.

The taxonomy structure reveals that MENLO's leaf sits within the broader 'Multilingual Evaluation Frameworks and Benchmarks' branch, which also encompasses general benchmarks emphasizing task coverage, machine-generated text detection, and domain-specific evaluation. The sibling papers in this leaf explore related naturalness dimensions—one examining English accent variations in LLMs and another investigating human-like text preferences—but the scope note explicitly distinguishes native-like quality frameworks from general benchmarks lacking explicit naturalness metrics. Neighboring branches address LLM-based automated evaluation and translation quality metrics, indicating that MENLO bridges evaluation methodology with multilingual framework development.

Among the 30 candidates examined through semantic search, none were identified as clearly refuting any of MENLO's three core contributions: the evaluation framework itself, the annotated preference dataset, and the RL-trained judges as reward models. Each contribution was assessed against 10 candidate papers, with zero refutable overlaps found. The framework contribution appears most distinctive given the sparse population of its taxonomy leaf, while the dataset and RL-based reward modeling contributions show no substantial prior work within the limited search scope. This suggests relative novelty across all three dimensions, though the analysis acknowledges its bounded coverage.

Based on the limited literature search of 30 semantically similar papers, MENLO appears to occupy a relatively underexplored niche at the intersection of native-like quality evaluation and multilingual preference alignment. The sparse taxonomy leaf and absence of refuting prior work within the examined candidates suggest meaningful novelty, though the analysis cannot claim exhaustiveness. The framework's emphasis on audience design mechanisms and structured rubrics distinguishes it from broader multilingual benchmarks, while the RL-based reward modeling approach extends beyond existing naturalness evaluation methods within the surveyed scope.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: Evaluating and improving native-like response quality in multilingual language models.

The field has organized itself around several complementary branches. Multilingual Evaluation Frameworks and Benchmarks establish standardized testbeds spanning diverse languages and tasks, often emphasizing coverage of low-resource settings and culturally grounded phenomena. Evaluation Metrics and Methodologies develop both automatic and human-centered measures to capture fluency, adequacy, and naturalness, with recent work exploring LLM-based evaluators and cross-lingual consistency checks. Model Architectures and Training Approaches investigate how pretraining strategies, alignment techniques, and curriculum learning can enhance multilingual generation quality, while Task-Specific Applications and Specialized Domains adapt these methods to translation, summarization, dialogue, and domain-specific contexts such as medical or legal text. Machine-Generated Text Detection and Analysis examines whether outputs can be distinguished from human writing, and Comparative and Empirical Studies benchmark systems across languages to reveal performance gaps and guide future improvements.

A particularly active line of work focuses on native-like quality and naturalness evaluation, where the challenge is to move beyond surface-level metrics toward assessments that capture idiomatic expression, cultural appropriateness, and human-like fluency. MENLO[0] situates itself squarely in this space, proposing methods to evaluate whether multilingual outputs feel genuinely native rather than merely correct. This emphasis contrasts with broader benchmarks like IndicGenBench[5], which prioritizes task coverage across many Indic languages, and with works such as Human Translation Strategy[1] or Translation All You Need[2], which concentrate on translation fidelity and adequacy. Nearby efforts like English Accent LLMs[13] and Human-Like Text Preference[43] explore related dimensions of naturalness and human-likeness, highlighting ongoing debates about what constitutes truly native quality and how best to measure it across diverse linguistic and cultural contexts.

Claimed Contributions

MENLO framework for evaluating native-like response quality

The authors develop a framework that breaks down native-like response quality into four key dimensions (language quality and coherence, alignment with cultural and linguistic nuances, factual correctness and grounding in local context, and overall writing style and helpfulness) using principles from audience design, with tailored prompts and structured annotation rubrics (see the sketch below).

10 retrieved papers
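As a concrete illustration of the pairwise, rubric-based judging pattern described above, the following Python sketch asks a judge model to compare two responses along the four named dimensions. The prompt wording, the `judge_pairwise` function, and the injected `generate` callable are illustrative assumptions, not MENLO's released prompts or rubric.

```python
# Minimal sketch of pairwise, rubric-based judging along the four quality
# dimensions named above. Prompt wording and function names are assumptions
# for illustration; they do not reproduce MENLO's actual rubric.
from typing import Callable

DIMENSIONS = [
    "language quality and coherence",
    "alignment with cultural and linguistic nuances",
    "factual correctness and grounding in local context",
    "overall writing style and helpfulness",
]

def build_pairwise_prompt(prompt: str, resp_a: str, resp_b: str, language: str) -> str:
    """Assemble a rubric-style instruction asking a judge to compare two responses."""
    rubric = "\n".join(f"- {d}" for d in DIMENSIONS)
    return (
        f"You are evaluating two {language} responses to the same user prompt.\n"
        f"Judge which response reads as more native-like along these dimensions:\n"
        f"{rubric}\n\n"
        f"User prompt:\n{prompt}\n\nResponse A:\n{resp_a}\n\nResponse B:\n{resp_b}\n\n"
        "Answer with exactly one token: A, B, or TIE."
    )

def judge_pairwise(prompt: str, resp_a: str, resp_b: str, language: str,
                   generate: Callable[[str], str]) -> str:
    """Return 'A', 'B', or 'TIE' using any text-generation backend passed as `generate`."""
    verdict = generate(build_pairwise_prompt(prompt, resp_a, resp_b, language)).strip().upper()
    return verdict if verdict in {"A", "B", "TIE"} else "TIE"  # fall back conservatively

if __name__ == "__main__":
    # Stub backend so the sketch runs without any model; swap in a real LLM call.
    print(judge_pairwise("Describe a typical local breakfast.",
                         "candidate response A", "candidate response B",
                         "Spanish", lambda p: "A"))
```

The judge backend is deliberately passed in as a plain callable so the sketch stays independent of any particular model API.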
MENLO dataset with human-annotated preference pairs

The authors construct a multilingual dataset consisting of 6,423 annotated prompt–response preference pairs across 47 language varieties, achieving high inter-annotator agreement (Krippendorff's alpha = 0.84) through carefully designed annotation guidelines and rubrics (a toy agreement computation is sketched below).

10 retrieved papers
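To make the reported agreement statistic concrete, the snippet below computes Krippendorff's alpha for nominal preference labels using the third-party `krippendorff` package. The ratings matrix is made-up toy data for illustration; it does not reproduce the paper's annotations or its reported 0.84 value.

```python
# Toy illustration of the agreement statistic (Krippendorff's alpha) reported
# for the dataset. Requires the third-party package: pip install krippendorff.
# The ratings below are invented for illustration only.
import numpy as np
import krippendorff

# Rows = annotators, columns = preference-pair items.
# Labels: 0 = prefer response A, 1 = prefer response B, 2 = tie;
# np.nan marks items an annotator did not rate.
ratings = np.array([
    [0, 1, 1, 2, 0, np.nan],
    [0, 1, 1, 2, 1, 0],
    [0, 1, 2, 2, 0, 0],
], dtype=float)

alpha = krippendorff.alpha(reliability_data=ratings, level_of_measurement="nominal")
print(f"Krippendorff's alpha (nominal): {alpha:.2f}")
```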
RL-trained judges as generative reward models

The authors demonstrate that judges trained with reinforcement learning, reward shaping, and multi-task learning can serve as generative reward models that directly improve a policy model's language proficiency, though they note that discrepancies between LLM and human evaluations remain (see the sketch below).

10 retrieved papers
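The sketch below shows one generic way a trained judge can act as a generative reward model: scoring candidate responses and keeping the highest-scoring one (best-of-N reranking). The 1-5 score scale, prompt wording, and callables are assumptions for illustration; the paper's RL training, reward shaping, and multi-task setup are not reproduced here.

```python
# Sketch of using a trained judge as a generative reward model via best-of-N
# selection. Score scale, prompt wording, and function names are illustrative
# assumptions, not the paper's actual reward-shaping pipeline.
from typing import Callable, List, Tuple
import re

def score_response(prompt: str, response: str, language: str,
                   judge: Callable[[str], str]) -> float:
    """Ask a judge model for a 1-5 native-likeness score and parse the first digit found."""
    instruction = (
        f"Rate how native-like this {language} response to the prompt is, "
        f"on a 1-5 scale. Reply with the number only.\n\n"
        f"Prompt:\n{prompt}\n\nResponse:\n{response}"
    )
    match = re.search(r"[1-5]", judge(instruction))
    return float(match.group()) if match else 1.0  # pessimistic default if unparseable

def best_of_n(prompt: str, candidates: List[str], language: str,
              judge: Callable[[str], str]) -> Tuple[str, float]:
    """Rerank candidate responses with the judge-derived reward and return the best one."""
    scored = [(c, score_response(prompt, c, language, judge)) for c in candidates]
    return max(scored, key=lambda pair: pair[1])

if __name__ == "__main__":
    # Stub judge so the sketch runs standalone; swap in the RL-trained judge model.
    best, reward = best_of_n("Explain a local holiday.", ["draft 1", "draft 2"],
                             "Hindi", lambda _: "4")
    print(best, reward)
```

The same per-response score could instead be fed back as a scalar reward during policy optimization; best-of-N is shown only because it keeps the example self-contained.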

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Each claimed contribution (described in full under Claimed Contributions above) was compared against 10 retrieved candidate papers, and no refutable overlap was identified for any of them:

Contribution: MENLO framework for evaluating native-like response quality

Contribution: MENLO dataset with human-annotated preference pairs

Contribution: RL-trained judges as generative reward models