Highlights:

  • Google researchers introduce MetricX‑25 and GemSpanEval for the WMT25 Translation Evaluation Shared Task.
  • MetricX‑25 uses an encoder‑only architecture to predict MQM and ESA quality scores with high accuracy.
  • GemSpanEval formulates error span detection as a generative task, enhancing interpretability and precision.
  • Both systems are built on the multilingual open‑weights model Gemma 3 and fine‑tuned on WMT data.

TLDR:

Google Translate’s latest contributions to WMT25 — MetricX‑25 and GemSpanEval — represent major strides in machine translation evaluation. Leveraging the Gemma 3 model, the tools deliver more accurate quality prediction and detailed error detection, enhancing both human and automatic assessment capabilities.

The WMT25 Translation Evaluation Shared Task has seen a landmark submission from Google Translate researchers led by Juraj Juraska, Tobias Domhan, Mara Finkelstein, Tetsuji Nakagawa, Geza Kovacs, Daniel Deutsch, Pidong Wang, and Markus Freitag. Their paper, **“MetricX‑25 and GemSpanEval: Google Translate Submissions to the WMT25 Evaluation Shared Task,”** introduces two next‑generation systems that push the boundaries of automatic translation quality evaluation.

At the heart of the Quality Score Prediction subtask lies **MetricX‑25**, a refined successor to the MetricX series. Built upon the **Gemma 3 multilingual open‑weights model**, MetricX‑25 adopts an **encoder‑only architecture with a regression head**, trained to predict both **MQM (Multidimensional Quality Metrics)** and **ESA (Error Span Annotation)** scores. Improvements to the input format and training protocol make its scoring more robust and consistent than that of its predecessors. According to the authors, the model correlates more strongly with human judgment and sets a new bar on translation quality prediction benchmarks.
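The core idea of a regression head on top of an encoder can be sketched in a few lines. This is a toy illustration only, not the paper's implementation: the pooling strategy, dimensions, and the random stand-in for Gemma 3 encoder states are all assumptions for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_pool(states, mask):
    # Average encoder states over non-padded positions.
    # states: (seq_len, hidden); mask: (seq_len,) of 0/1
    m = mask[:, None].astype(float)
    return (states * m).sum(axis=0) / max(m.sum(), 1.0)

def regression_head(pooled, w, b):
    # Linear projection of the pooled representation to a
    # single scalar quality score (e.g. an MQM- or ESA-style score).
    return float(pooled @ w + b)

hidden = 8
states = rng.standard_normal((5, hidden))  # stand-in for encoder outputs
mask = np.array([1, 1, 1, 1, 0])           # last position is padding
w = rng.standard_normal(hidden)            # untrained toy weights
b = 0.0

score = regression_head(mean_pool(states, mask), w, b)
```

In practice the head would be trained with a regression loss against human-assigned scores; here the weights are random, so the score is meaningless but shows the data flow from encoder states to a single per-segment number.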

For the Error Span Detection subtask, the team presents **GemSpanEval**, a **decoder‑only model** that reframes error detection as a **generative task**. Unlike sequence‑tagging approaches such as the strong xCOMET baseline, GemSpanEval not only identifies error spans but also outputs each span's **surrounding context, severity, and category**, making its predictions unambiguous. This innovation marks a significant step toward explainable AI in translation evaluation, giving linguists and developers more actionable insight into model behavior and translation flaws.
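To make the generative formulation concrete, here is a minimal sketch of parsing one generated annotation into structured fields. The serialization format below (an XML-like tag with `|||`-delimited context) is entirely hypothetical; the actual output format used by GemSpanEval may differ.

```python
import re

# Hypothetical example of a model-generated error annotation:
# the span is emitted together with its left/right context,
# plus a severity and category attribute.
generated = (
    '<error severity="major" category="mistranslation">'
    'sat on ||| the bank of the river ||| and waited'
    '</error>'
)

PATTERN = re.compile(
    r'<error severity="(?P<severity>\w+)" category="(?P<category>[\w-]+)">'
    r'(?P<left>.*?) \|\|\| (?P<span>.*?) \|\|\| (?P<right>.*?)</error>'
)

def parse_annotation(text):
    """Extract severity, category, span, and context from one annotation."""
    m = PATTERN.search(text)
    return m.groupdict() if m else None

ann = parse_annotation(generated)
# ann["span"] is the flagged error span; ann["left"]/ann["right"] are the
# surrounding context that disambiguates repeated substrings in the source.
```

Emitting context alongside the span is what makes a generative prediction unambiguous: if the same phrase occurs twice in a translation, the context pins down which occurrence is the error.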

Both MetricX‑25 and GemSpanEval demonstrate Google’s ongoing commitment to open research and reproducibility. They are fine‑tuned on publicly available **WMT datasets**, aligning with community standards for transparency. By integrating state‑of‑the‑art architectures with novel training methodologies, these tools aim to improve end‑user translation quality in real‑world applications, further strengthening the symbiosis between human judgment and AI‑driven assessment in multilingual communication.

Source:

Juraj Juraska et al., “MetricX‑25 and GemSpanEval: Google Translate Submissions to the WMT25 Evaluation Shared Task,” arXiv:2510.24707 [cs.CL], 28 Oct 2025. https://doi.org/10.48550/arXiv.2510.24707
