Highlights:

  • Google submits MetricX-25 and GemSpanEval to the WMT25 Translation Evaluation Shared Task.
  • MetricX-25 improves translation quality prediction using a refined architecture based on Gemma 3.
  • GemSpanEval introduces a generative approach to error span detection with contextual outputs.
  • Both systems improve on their predecessors, with GemSpanEval matching strong encoder-based baselines such as xCOMET.

TLDR:

Google’s research team unveiled two new models, MetricX-25 and GemSpanEval, for the WMT25 Translation Evaluation Shared Task. Building on the multilingual Gemma 3 model, MetricX-25 enhances translation quality scoring, while GemSpanEval introduces a generative model that improves error detection accuracy and transparency.

Google Research has once again made waves in the field of computational linguistics with its newly introduced models—MetricX-25 and GemSpanEval—submitted to the WMT25 (Conference on Machine Translation 2025) Evaluation Shared Task. Led by Juraj Juraska, Tobias Domhan, Mara Finkelstein, Tetsuji Nakagawa, Geza Kovacs, Daniel Deutsch, Pidong Wang, and Markus Freitag, the team focused on pushing the boundaries of translation quality estimation and automated error detection. The two systems target separate subtasks of the competition: MetricX-25 for quality score prediction and GemSpanEval for error span detection.

MetricX-25 represents the next generation of automatic translation quality evaluation metrics. Built upon the state-of-the-art multilingual open-weights model Gemma 3, the researchers adapted it into an encoder-only architecture equipped with a regression head to directly predict MQM (Multidimensional Quality Metrics) and ESA (Error Span Annotation) scores. This design enables MetricX-25 to provide fine-grained quality evaluations, showing a marked improvement over its predecessor in both prediction accuracy and robustness. The enhanced input format and refined training protocols further allow the model to process multilingual data more consistently, making it a valuable tool for assessing and benchmarking translation systems globally.
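The regression-head idea can be sketched in a few lines. This is a minimal illustration, not the actual MetricX-25 code: it assumes mean-pooling over encoder token states and a single linear projection to a scalar score; the hidden size, pooling strategy, and weights here are toy stand-ins.

```python
import numpy as np

HIDDEN = 8  # toy hidden size; Gemma 3 uses a much larger one

rng = np.random.default_rng(0)
W = rng.standard_normal((HIDDEN, 1)) * 0.1  # regression head weights (illustrative)
b = np.zeros(1)                             # regression head bias

def predict_score(token_states: np.ndarray) -> float:
    """Mean-pool token representations, then project to a scalar quality score."""
    pooled = token_states.mean(axis=0)  # (HIDDEN,) pooled segment representation
    return float(pooled @ W + b)        # scalar MQM/ESA-style score

# Toy "encoder output" for a 5-token translation segment:
states = rng.standard_normal((5, HIDDEN))
score = predict_score(states)
print(score)
```

In a real system the encoder states would come from the fine-tuned Gemma 3 backbone, and the head would be trained on human MQM/ESA annotations rather than initialized randomly.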

GemSpanEval addresses a different but complementary challenge: identifying and explaining translation errors. It uses a decoder-only approach based on Gemma 3, reformulating error span detection as a generative task. Instead of merely tagging error positions, GemSpanEval generates the specific spans, along with their severity and category, and includes contextual information for each detected error. Because each generated span is accompanied by its surrounding context, an error can be located unambiguously in the translation even when the same text occurs more than once. The research team demonstrated that GemSpanEval achieves competitive performance compared to xCOMET, one of the strongest encoder-based baselines, while offering richer interpretability through its generative design.
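To make the generative formulation concrete, here is a hypothetical sketch of consuming such output. The serialization format, field names (span, severity, category, context), and example strings below are assumptions for illustration; the paper's actual output format may differ.

```python
import re
from dataclasses import dataclass

@dataclass
class ErrorSpan:
    span: str      # the erroneous text generated by the model
    severity: str  # e.g. "major" or "minor"
    category: str  # e.g. "mistranslation", "punctuation"
    context: str   # surrounding text that disambiguates the span's location

def parse_output(text: str) -> list[ErrorSpan]:
    """Parse lines like: span="..." severity=major category=mistranslation context="..." """
    pattern = re.compile(
        r'span="([^"]*)"\s+severity=(\w+)\s+category=([\w/-]+)\s+context="([^"]*)"'
    )
    return [ErrorSpan(*m.groups()) for m in pattern.finditer(text)]

# Toy model output with two annotated errors:
generated = (
    'span="der Hund" severity=major category=mistranslation '
    'context="sah der Hund die Katze"\n'
    'span="," severity=minor category=punctuation context="ja, bitte"'
)

errors = parse_output(generated)
for e in errors:
    print(e.severity, e.category, "->", repr(e.span))
```

The context field is what makes the span resolvable: a bare span like "," could match many positions, but paired with its context it points to exactly one occurrence.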

The development of these models underscores Google’s commitment to advancing machine translation evaluation by leveraging open-weight large language models. By unifying quality scoring and contextual error detection techniques, the combination of MetricX-25 and GemSpanEval strengthens the analytical framework for translation system improvement. As these tools are refined and adopted, researchers and developers alike can expect more transparent, accurate, and context-aware translation assessments across languages.

Source:

arXiv:2510.24707 [cs.CL] — Juraj Juraska et al., ‘MetricX-25 and GemSpanEval: Google Translate Submissions to the WMT25 Evaluation Shared Task’ (https://arxiv.org/abs/2510.24707)
