Highlights:

  • Introduces TimeLens, a systematic rethinking of video temporal grounding (VTG) using multimodal large language models (MLLMs).
  • Launches TimeLens-Bench, a high-quality benchmark featuring re-annotated datasets for fairer evaluation.
  • Presents TimeLens-100K, a large, clean training dataset generated via automated re-annotation.
  • Proposes a thinking-free reinforcement learning with verifiable rewards (RLVR) training recipe for efficient temporal reasoning.

TLDR:

TimeLens, developed by Jun Zhang and colleagues, redefines the baseline for video temporal grounding by addressing data quality and algorithmic gaps in multimodal large language models. Through high-fidelity datasets and innovative training techniques, it sets new open-source performance records in temporal video understanding.

The newly published study titled *TimeLens: Rethinking Video Temporal Grounding with Multimodal LLMs* represents a significant milestone in the field of computer vision and video understanding. Rather than offering a radical new method, the authors — Jun Zhang, Teng Wang, Yuying Ge, Yixiao Ge, Xinhao Li, Ying Shan, and Limin Wang — systematically rebuild the foundation of video temporal grounding (VTG) research with a focus on model reliability and data integrity. Despite recent progress in multimodal large language models (MLLMs), the way these systems learn to locate specific temporal segments within a video has remained inconsistent and often benchmark-dependent. TimeLens seeks to change this by creating cleaner datasets and refining model training recipes to bring consistency and scientific rigor back into VTG evaluation.
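To make the task concrete, the sketch below illustrates the VTG problem setup in Python: given a video and a natural-language query, a model must return the start and end times of the matching segment. The class and field names are illustrative assumptions for exposition, not an interface from the paper.

```python
# Illustrative sketch of the video temporal grounding (VTG) task interface.
# Names and structure are assumptions for exposition, not the paper's API.
from dataclasses import dataclass


@dataclass
class VTGQuery:
    video_path: str   # path to the source video
    query: str        # natural-language description of the target moment


@dataclass
class VTGPrediction:
    start_sec: float  # predicted segment start, in seconds
    end_sec: float    # predicted segment end, in seconds


# Example: ask a grounding model where an event occurs in the video.
example = VTGQuery(video_path="kitchen.mp4",
                   query="the person pours coffee into a mug")
# A model would return something like:
prediction = VTGPrediction(start_sec=12.4, end_sec=17.9)
```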

At the heart of the TimeLens project lies an ambitious re-examination of data quality in existing benchmarks. The team introduces *TimeLens-Bench*, a suite of re-annotated datasets derived from three widely used VTG benchmarks, cleaned and annotated under strict quality criteria to minimize label noise. A re-ranking analysis on the cleaned annotations revealed that several high-profile models perform inconsistently across the legacy datasets, exposing flaws in prior evaluation standards. To complement this, the authors developed *TimeLens-100K*, a large training dataset produced by an automated re-annotation pipeline that generates high-fidelity temporal labels at scale. With over 100,000 entries, it provides a new gold standard for training and validating models on temporal grounding tasks.
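Benchmarks of this kind are typically scored with temporal intersection-over-union (IoU) between predicted and annotated segments, reported as recall at thresholds such as 0.5 or 0.7. The Python sketch below shows that standard metric; the exact TimeLens-Bench protocol and thresholds may differ.

```python
# Minimal sketch of the standard temporal-IoU metric used across VTG
# benchmarks; the precise TimeLens-Bench protocol may differ in detail.

def temporal_iou(pred, gt):
    """Intersection-over-union of two (start, end) intervals in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def recall_at_iou(preds, gts, threshold=0.5):
    """Fraction of queries whose top-1 prediction exceeds the IoU threshold
    (commonly reported as R@1, IoU >= 0.5 / 0.7)."""
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(gts)

# Example: two queries, one localized well and one missed entirely.
preds = [(12.0, 18.0), (40.0, 44.0)]
gts   = [(12.4, 17.9), (30.0, 36.0)]
print(recall_at_iou(preds, gts, threshold=0.5))  # 0.5
```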

Building upon this solid data foundation, the researchers turned to refining algorithmic design. They introduced practices such as interleaved textual encoding for better temporal representation, together with a *thinking-free* reinforcement learning with verifiable rewards (RLVR) framework. Unlike reinforcement learning guided by learned reward models, RLVR derives its training signal from rewards that can be checked directly against the annotations, which improves training stability; the approach enhances efficiency while keeping the model's temporal reasoning interpretable. Combined with carefully optimized RLVR training recipes, TimeLens sets a fresh benchmark for MLLM performance on VTG. In comprehensive evaluations, TimeLens not only became the leading open-source solution but also surpassed advanced proprietary systems, including GPT-5 and Gemini-2.5-Flash. With a strong commitment to transparency, the team announced that all data, models, and code will be released, enabling researchers worldwide to build upon this milestone and drive the next generation of intelligent video understanding systems.
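As a rough illustration of what makes a reward "verifiable" in this setting, the sketch below scores a model's textual answer by parsing a predicted segment and comparing it to the ground-truth annotation via temporal IoU. The answer format, parsing rule, and reward shape are assumptions for exposition; the paper's actual reward design may differ.

```python
# Hedged sketch of a verifiable reward for RLVR-style training on temporal
# grounding: the reward is computed deterministically from the annotated
# segment, so it can be checked. The actual TimeLens reward may differ.
import re

def temporal_iou(pred, gt):
    """Intersection-over-union of two (start, end) intervals in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def parse_segment(response: str):
    """Extract a '[start, end]' answer (seconds) from the model's response.
    This answer format is an illustrative assumption."""
    m = re.search(r"\[\s*([\d.]+)\s*,\s*([\d.]+)\s*\]", response)
    if not m:
        return None
    start, end = float(m.group(1)), float(m.group(2))
    return (start, end) if end > start else None

def verifiable_reward(response: str, gt_segment):
    """0 for malformed answers, otherwise the temporal IoU with the
    ground-truth segment, so the reward rises with localization quality."""
    pred = parse_segment(response)
    return 0.0 if pred is None else temporal_iou(pred, gt_segment)

print(verifiable_reward("The moment occurs at [12.4, 17.9].", (12.0, 18.0)))
```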

Source:

Original research: Zhang, J., Wang, T., Ge, Y., Ge, Y., Li, X., Shan, Y., & Wang, L. (2025). *TimeLens: Rethinking Video Temporal Grounding with Multimodal LLMs.* arXiv:2512.14698 [cs.CV]. Available at https://arxiv.org/abs/2512.14698
