Translation Heads: Unveiling Attention's Role in LLM Multilingual Translation

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: LLM, Multilinguistic, Interpretability
Abstract:

Recently, large language models (LLMs) have made remarkable progress, with multilingual capability emerging as a core foundational strength. However, the internal mechanisms by which these models perform translation remain incompletely understood. In this paper, we elucidate the relationship between the attention mechanism in LLMs and their translation abilities. We find that certain attention heads, which we term token alignment heads, are specifically responsible for mapping tokens from the source language to the target language during inference. Through a systematic investigation across various models, we confirm that these token alignment heads exhibit several key characteristics: (1) Universality: They are present in all LLMs we studied. (2) Sparsity: They constitute only a small fraction of all attention heads. (3) Consistency: The set of token alignment heads activated by the model shows strong consistency across different language pairs. (4) Causality: Interventionally removing these heads leads to a sharp decline in the model's translation performance, while randomly removing non-token-alignment heads has little impact on translation ability. (5) Functional Specificity: Ablating token alignment heads disproportionately harms translation but has a varied impact on other multilingual tasks. We also traced the formation of token alignment heads during pre-training, revealing an evolutionary path of rapid proliferation, stabilization, and eventual pruning. Furthermore, we leverage these token alignment heads to filter multilingual training data, and our experiments show that the filtered data can enhance the translation capabilities of the models.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper identifies and characterizes 'token alignment heads'—specialized attention heads responsible for cross-lingual token mapping during translation in large language models. According to the taxonomy tree, this work sits in the 'Token Alignment Head Discovery' leaf under 'Attention Head Analysis and Interpretability'. Notably, this leaf contains only the original paper itself (no sibling papers), indicating a relatively sparse research direction within the broader field of attention mechanism interpretability for multilingual translation.

The taxonomy reveals that the broader 'Attention Head Analysis and Interpretability' branch contains two neighboring leaves: 'Language-Specific Attention Head Identification' (focusing on general attention head importance across languages) and 'Interpretability Evaluation in Low-Resource Settings'. The original paper's focus on token-level alignment distinguishes it from these adjacent directions, which address broader head importance scoring or low-resource evaluation contexts. The taxonomy's scope note explicitly excludes 'general attention importance scoring without token-level alignment focus' from the Token Alignment Head Discovery category, clarifying that this work targets a more specific mechanistic phenomenon than neighboring interpretability studies.

Among thirty candidates examined through semantic search, none were found to clearly refute any of the three main contributions: (1) identification and characterization of token alignment heads (10 candidates examined, 0 refutable), (2) translation score metric and detection algorithm (10 candidates examined, 0 refutable), and (3) TRater data filtering algorithm (10 candidates examined, 0 refutable). This suggests that within the limited search scope, the specific combination of discovering token alignment heads, proposing a detection metric, and developing a filtering algorithm appears relatively novel. However, the analysis is constrained to top-30 semantic matches and does not constitute an exhaustive literature review.

Based on the limited search scope, the work appears to occupy a distinct position within attention mechanism interpretability for multilingual translation. The absence of sibling papers in the same taxonomy leaf and the lack of clearly refuting prior work among examined candidates suggest potential novelty, though this assessment is bounded by the thirty-candidate search window. A more comprehensive literature review would be needed to confirm whether related token alignment phenomena have been studied under different terminology or in adjacent research communities.

Taxonomy

Core-task Taxonomy Papers: 30
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: attention mechanism in multilingual translation of large language models. The field organizes around several major branches that reflect both technical and application-oriented concerns. Attention Head Analysis and Interpretability investigates how individual attention heads specialize—for instance, discovering heads that align tokens across languages or capture syntactic dependencies. Attention Architecture Design and Enhancement explores novel attention patterns and modifications to improve translation quality and efficiency. Training Methodologies and Parameter Efficiency addresses how to train multilingual models effectively, often through techniques like adapters or cross-attention pretraining. Low-Resource and Zero-Resource Translation tackles scenarios where parallel data is scarce, a persistent challenge for many language pairs. Domain-Specific and Task-Specific Applications extend attention-based translation to specialized contexts such as speech, images, or technical domains. Translation Quality and Robustness examines evaluation metrics and model reliability, while Surveys and Comprehensive Reviews provide overarching perspectives on the evolving landscape, as seen in works like Transformer MT Survey[3] and MT LLM Survey[5].

Within the interpretability branch, a particularly active line of work focuses on identifying specialized attention heads that perform token alignment or capture cross-lingual correspondences. Translation Heads[0] exemplifies this direction by discovering heads that align source and target tokens, offering insights into how large multilingual models internally represent translation mappings. This contrasts with broader architectural studies like Dynamic Multihead Attention[13] or Cross Attention Pretraining[4], which modify attention mechanisms to enhance overall performance rather than dissecting existing heads.
Meanwhile, works such as Source Context Attention[1] and Language Attention Heads[25] explore how attention patterns encode linguistic structure and context, revealing complementary aspects of model behavior. The interpretability research remains crucial for understanding whether large models genuinely learn meaningful cross-lingual alignments or rely on spurious correlations, a question that bridges technical analysis and practical deployment in low-resource settings like those studied in Attention Low Resource[2] and Low Resource Interpretability[23].

Claimed Contributions

Identification and characterization of token alignment heads

The authors identify a specialized class of attention heads called token alignment heads that perform cross-lingual token mapping during translation. They characterize these heads as universal across models, sparse (constituting only a small fraction of all heads), consistent across language pairs, causally important for translation, and functionally specific to translation tasks.

10 retrieved papers
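The causality characteristic above rests on an interventional ablation: zeroing out a head's contribution to the attention output and measuring the effect on translation. The paper's exact procedure and models are not reproduced in this report; the sketch below is a minimal toy illustration of head ablation in multi-head self-attention, with all matrix names and shapes chosen purely for demonstration.

```python
import numpy as np

def multi_head_attention(x, Wq, Wk, Wv, Wo, n_heads, ablated_heads=()):
    """Toy multi-head self-attention over a sequence x of shape (T, d_model).

    Zeroing a head's output (ablated_heads) emulates the kind of
    interventional head removal used to test causal importance.
    """
    d_model = x.shape[-1]
    d_head = d_model // n_heads
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    outs = []
    for h in range(n_heads):
        s = slice(h * d_head, (h + 1) * d_head)
        scores = q[:, s] @ k[:, s].T / np.sqrt(d_head)
        # numerically stable row-wise softmax
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        head_out = weights @ v[:, s]
        if h in ablated_heads:
            head_out = np.zeros_like(head_out)  # intervention: remove this head
        outs.append(head_out)
    return np.concatenate(outs, axis=-1) @ Wo
```

In the paper's setting the same comparison is run on full LLMs: ablating the detected token alignment heads sharply degrades translation, while ablating randomly chosen other heads does not.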
Translation score metric and detection algorithm

The authors introduce a translation score metric that quantifies how frequently an attention head performs valid cross-lingual token alignments. This metric enables systematic detection of token alignment heads by measuring alignment frequency during greedy decoding on translation tasks.

10 retrieved papers
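The description above (alignment frequency during greedy decoding) suggests a simple counting metric. The paper's exact formula is not reproduced in this report; the following is a hypothetical reconstruction in which a head's score is the fraction of decoding steps at which its most-attended source token is a valid translation of the token emitted at that step. The function name and the dictionary-based notion of "valid alignment" are illustrative assumptions.

```python
def translation_score(attn, generated, source_tokens, align):
    """Hypothetical per-head translation score.

    attn[t]       : this head's attention weights over source_tokens at step t
    generated[t]  : target token emitted at decoding step t
    source_tokens : the source-side tokens of the prompt
    align         : dict mapping a source token to its set of valid target tokens
    Returns the fraction of steps with a valid cross-lingual alignment.
    """
    hits = 0
    for t, weights in enumerate(attn):
        # source token receiving the head's maximum attention at this step
        top_src = source_tokens[max(range(len(weights)), key=weights.__getitem__)]
        if generated[t] in align.get(top_src, set()):
            hits += 1
    return hits / len(attn) if attn else 0.0
```

Heads whose score exceeds a threshold would then be labeled token alignment heads; the thresholding rule is part of the paper's detection algorithm and is not specified here.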
TRater data filtering algorithm

The authors develop TRater, a data filtering algorithm that uses token alignment heads to identify and score multilingual training data critical for translation capability. Experiments demonstrate that a small fraction of data selected by TRater significantly enhances model translation performance.

10 retrieved papers
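Framed abstractly, TRater is a ranking-and-selection procedure: score each multilingual training example by how strongly the model's token alignment heads respond to it, then keep a small top fraction. The sketch below captures only that skeleton; the function name `trater_select`, the `keep_fraction` parameter, and the idea of averaging head responses are assumptions, since the paper's exact scoring rule is not given in this report.

```python
def trater_select(examples, head_score_fn, keep_fraction=0.1):
    """Hypothetical sketch of TRater-style data filtering.

    examples      : list of multilingual training examples
    head_score_fn : callable mapping an example to an alignment score,
                    e.g. the mean translation score of the detected
                    token alignment heads on that example
    keep_fraction : fraction of the highest-scoring examples to retain
    """
    scored = sorted(examples, key=head_score_fn, reverse=True)
    n_keep = max(1, round(len(scored) * keep_fraction))
    return scored[:n_keep]
```

The distinctive design choice is the scoring signal: examples are ranked by an interpretability-derived quantity (alignment head activity) rather than by loss or perplexity heuristics, which is what ties the filtering algorithm back to the mechanistic findings.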

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current top-K core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape, it appears structurally isolated, which is one partial signal of novelty, but still constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Identification and characterization of token alignment heads


Contribution

Translation score metric and detection algorithm


Contribution

TRater data filtering algorithm

