Highlights:

  • Introduces an autoregressive generation-based framework for image segmentation
  • Unifies multimodal understanding and pixel-level perception within one model
  • Replaces traditional segmentation heads with pixel-accurate image generation
  • Implements a next-scale-prediction strategy for faster inference

TLDR:

Researchers present ARGenSeg, an autoregressive image segmentation framework that integrates multimodal understanding and pixel-level image generation in a single model, improving both segmentation accuracy and inference speed across standard benchmarks.

A research team led by Xiaolong Wang (https://arxiv.org/search/cs?searchtype=author&query=Wang,+X) and colleagues Lixiang Ru (https://arxiv.org/search/cs?searchtype=author&query=Ru,+L), Ziyuan Huang (https://arxiv.org/search/cs?searchtype=author&query=Huang,+Z), Kaixiang Ji (https://arxiv.org/search/cs?searchtype=author&query=Ji,+K), Dandan Zheng (https://arxiv.org/search/cs?searchtype=author&query=Zheng,+D), Jingdong Chen (https://arxiv.org/search/cs?searchtype=author&query=Chen,+J), and Jun Zhou (https://arxiv.org/search/cs?searchtype=author&query=Zhou,+J) has unveiled **ARGenSeg: Image Segmentation with Autoregressive Image Generation Model**, a cutting-edge approach that redefines how image segmentation operates within multimodal large language models (MLLMs). This work, accepted to NeurIPS 2025, moves beyond conventional segmentation architectures by introducing an autoregressive generation-based framework capable of both high-level understanding and low-level pixel precision in a unified pipeline.

Traditional MLLM-based segmentation systems have depended heavily on boundary point representations or task-specific segmentation heads. While effective, these methods often rely on manual prompt engineering and coarse features, making it hard for models to grasp intricate object details or perform fine-grained mask generation. ARGenSeg tackles these challenges by transforming segmentation into a **visual generation task**—enabling the model to synthesize dense visual masks directly from multimodal cues. Instead of predicting discrete segmentation maps, the framework leverages the MLLM to output visual tokens, which are then transformed into pixel-dense masks via a **universal VQ-VAE (Vector Quantized Variational Autoencoder)**. This shift ensures the segmentation is driven by the model’s intrinsic pixel-level comprehension rather than external heuristics.
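To make the token-to-mask step concrete, the sketch below mimics how discrete visual tokens emitted by an MLLM could be turned into a dense mask through a VQ-VAE-style codebook lookup and decoder. All names, shapes, and the toy "decoder" (codebook lookup, channel pooling, nearest-neighbor upsampling, thresholding) are illustrative assumptions, not the paper's actual architecture, in which the decoder is a learned network.

```python
import numpy as np

# Hypothetical sketch: decoding a grid of discrete visual-token indices
# into a pixel-dense binary mask via a VQ-VAE codebook.
CODEBOOK_SIZE = 8192   # number of discrete codes (assumed)
CODE_DIM = 16          # embedding dimension per code (assumed)
GRID = 16              # token grid is GRID x GRID (assumed)
UPSCALE = 16           # pixels per token cell along each axis (assumed)

rng = np.random.default_rng(0)
codebook = rng.standard_normal((CODEBOOK_SIZE, CODE_DIM))

def decode_tokens_to_mask(token_ids: np.ndarray) -> np.ndarray:
    """Map a GRID x GRID grid of code indices to a dense binary mask.

    A real VQ-VAE decoder is a learned CNN; here we stand in for it with
    codebook lookup, channel pooling, upsampling, and thresholding.
    """
    assert token_ids.shape == (GRID, GRID)
    # Look up the continuous embedding for each discrete token.
    feats = codebook[token_ids]                 # (GRID, GRID, CODE_DIM)
    # Collapse channels to one "mask logit" per cell (decoder placeholder).
    logits = feats.mean(axis=-1)                # (GRID, GRID)
    # Nearest-neighbor upsample each cell to an UPSCALE x UPSCALE patch.
    pixels = np.kron(logits, np.ones((UPSCALE, UPSCALE)))  # (256, 256)
    return (pixels > 0).astype(np.uint8)

token_ids = rng.integers(0, CODEBOOK_SIZE, size=(GRID, GRID))
mask = decode_tokens_to_mask(token_ids)
print(mask.shape)  # (256, 256)
```

The key point the sketch preserves is that the mask is produced by image generation (decoding tokens into pixels) rather than by a task-specific segmentation head.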

One of the key technical innovations of ARGenSeg is the **next-scale-prediction strategy**, which accelerates inference by generating all visual tokens within a scale in parallel, rather than one token at a time. This design substantially reduces latency—a major limitation of prior autoregressive methods—while preserving the fidelity of the predicted masks. Extensive experiments show that ARGenSeg surpasses previous state-of-the-art segmentation systems on standard datasets while achieving sizable improvements in inference speed. By combining autoregressive generation with segmentation precision, the approach could open new pathways for autonomous systems, medical imaging, and AI-driven visual reasoning, making ARGenSeg a notable step forward in computer vision research.
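The control flow of next-scale prediction can be sketched as follows: scales are generated sequentially from coarse to fine, but every token within a scale is emitted in a single forward pass. The scale schedule, codebook size, and the stand-in `predict_scale` function are assumptions for illustration, not the paper's actual configuration.

```python
import numpy as np

# Illustrative sketch of next-scale prediction: the model emits an entire
# token map per scale, conditioning each scale on the coarser maps.
SCALES = [1, 2, 4, 8, 16]  # side lengths of successive token grids (assumed)

rng = np.random.default_rng(0)

def predict_scale(prev_maps, side):
    """Stand-in for one forward pass emitting side*side tokens at once.

    In the real model this would be a transformer conditioned on
    prev_maps; here we just sample token ids to show the control flow.
    """
    return rng.integers(0, 8192, size=(side, side))

def generate_token_pyramid():
    maps = []
    for side in SCALES:                          # scales run sequentially...
        maps.append(predict_scale(maps, side))   # ...tokens within a scale in parallel
    return maps

pyramid = generate_token_pyramid()
# Only len(SCALES) = 5 sequential steps, versus 1 + 4 + 16 + 64 + 256 = 341
# sequential steps if the same token pyramid were decoded token by token.
print(len(pyramid), sum(s * s for s in SCALES))
```

This is where the latency savings come from: the number of sequential decoding steps grows with the number of scales, not with the total number of tokens.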

Source:

ARGenSeg: Image Segmentation with Autoregressive Image Generation Model, arXiv:2510.20803 [cs.CV], https://doi.org/10.48550/arXiv.2510.20803
