Highlights:
- Introduces ARGenSeg, a new autoregressive generation-based framework for image segmentation.
- Unifies multimodal understanding and pixel-level perception within a single model.
- Eliminates task-specific segmentation heads through generative modeling of image masks.
- Leverages a universal VQ-VAE for visual token reconstruction.
TLDR:
ARGenSeg is a novel autoregressive image segmentation framework that integrates multimodal understanding and pixel-level perception in a single model, producing dense segmentation masks with faster inference and advancing the capabilities of multimodal large language models (MLLMs).
A groundbreaking development in computer vision research has emerged with the introduction of ARGenSeg, a novel autoregressive generation-based approach to image segmentation. The study, authored by Xiaolong Wang, Lixiang Ru, Ziyuan Huang, Kaixiang Ji, Dandan Zheng, Jingdong Chen, and Jun Zhou, was submitted to arXiv and accepted to NeurIPS 2025. The proposed ARGenSeg framework bridges the gap between generative modeling and image segmentation by enabling multimodal large language models (MLLMs) to comprehend visual content at the pixel level, a capability not fully realized in previous segmentation paradigms.
Traditional segmentation approaches built on MLLMs have typically relied on boundary point estimation or specialized segmentation heads driven by task-specific semantic prompts. While functional, these methods struggle to capture the fine-grained visual details essential for high-accuracy segmentation. ARGenSeg addresses this by adopting an image generation-based approach: it frames segmentation as visual token generation rather than discrete boundary classification. The generated tokens are then decoded into dense segmentation masks that follow directly from the MLLM's pixel-level understanding, yielding a more natural and tightly integrated vision-language model.
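To make that data flow concrete, here is a minimal illustrative sketch in PyTorch, not the authors' implementation: a toy stand-in for the MLLM autoregressively emits discrete visual token ids describing the target mask, and a toy VQ-VAE-style decoder detokenizes them into a dense mask. All module names, grid sizes, and codebook dimensions below are assumptions made for the example.

```python
# Hedged sketch (not the ARGenSeg code): segmentation framed as visual token
# generation. A toy "MLLM" emits token ids one step at a time; a toy VQ-VAE
# decoder maps the finished token grid back to a dense mask.
import torch
import torch.nn as nn

VOCAB, GRID, DIM = 1024, 16, 256  # assumed codebook size, token-grid side, width

class ToyVisualTokenHead(nn.Module):
    """Stand-in for the MLLM: predicts the next visual token id at each step."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB + 1, DIM)      # +1 for a BOS token
        self.rnn = nn.GRU(DIM, DIM, batch_first=True)  # placeholder for the LLM backbone
        self.logits = nn.Linear(DIM, VOCAB)

    def forward(self, token_ids):                      # token_ids: (B, T) long
        h, _ = self.rnn(self.embed(token_ids))
        return self.logits(h[:, -1])                   # logits for the next token

class ToyMaskDetokenizer(nn.Module):
    """Stand-in for the VQ-VAE decoder: token ids -> dense mask logits."""
    def __init__(self):
        super().__init__()
        self.codebook = nn.Embedding(VOCAB, DIM)
        self.decode = nn.Sequential(
            nn.ConvTranspose2d(DIM, 64, 4, stride=4), nn.ReLU(),
            nn.ConvTranspose2d(64, 1, 4, stride=4),    # 16x16 tokens -> 256x256 mask
        )

    def forward(self, token_ids):                              # (B, GRID*GRID)
        z = self.codebook(token_ids)                           # (B, GRID*GRID, DIM)
        z = z.transpose(1, 2).reshape(-1, DIM, GRID, GRID)     # back to a spatial grid
        return self.decode(z)                                  # (B, 1, 256, 256) logits

@torch.no_grad()
def generate_mask(head, detok, batch=1):
    tokens = torch.full((batch, 1), VOCAB, dtype=torch.long)   # BOS
    for _ in range(GRID * GRID):                               # plain next-token decoding
        next_id = head(tokens).argmax(-1, keepdim=True)
        tokens = torch.cat([tokens, next_id], dim=1)
    return detok(tokens[:, 1:]).sigmoid()                      # dense mask probabilities

mask = generate_mask(ToyVisualTokenHead(), ToyMaskDetokenizer())
print(mask.shape)  # torch.Size([1, 1, 256, 256])
```

The point of the sketch is the pipeline shape: the mask never passes through a task-specific segmentation head; it is recovered purely by detokenizing generated visual tokens.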
At the core of ARGenSeg lies a universal Vector Quantized Variational Autoencoder (VQ-VAE), which detokenizes the MLLM's output visual tokens back into dense segmentation masks. To improve computational efficiency, the researchers also introduce a next-scale-prediction strategy that generates the tokens of each scale in parallel, dramatically reducing inference latency. Extensive evaluations on benchmark segmentation datasets show that ARGenSeg not only outperforms current state-of-the-art methods in accuracy but also delivers a considerable boost in inference speed. This positions ARGenSeg as a key advancement for future multimodal AI systems capable of simultaneously understanding and generating complex visual scenes, a critical step toward fully integrated perception and reasoning models in computer vision research.
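The following is a hedged sketch of how next-scale prediction can cut the number of sequential decoding steps. The scale schedule, module names, and conditioning scheme are illustrative assumptions rather than details from the paper: each scale's token grid is predicted in one parallel step, conditioned on an upsampled summary of the coarser scales.

```python
# Hedged sketch of next-scale prediction (names and schedule are assumptions):
# all tokens of a given scale are decoded in a single forward pass, and finer
# scales are conditioned on the coarser ones.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, DIM = 1024, 256
SCALES = [1, 2, 4, 8, 16]   # assumed token-grid side lengths, coarse to fine

class ToyScalePredictor(nn.Module):
    """Stand-in head: maps a conditioning feature map to per-position token logits."""
    def __init__(self):
        super().__init__()
        self.codebook = nn.Embedding(VOCAB, DIM)
        self.head = nn.Conv2d(DIM, VOCAB, 1)   # per-position classification

    def forward(self, cond):                    # cond: (B, DIM, s, s)
        return self.head(cond)                  # (B, VOCAB, s, s)

@torch.no_grad()
def generate_multiscale(model, batch=1):
    cond = torch.zeros(batch, DIM, SCALES[0], SCALES[0])   # start-of-generation state
    tokens_per_scale = []
    for s in SCALES:
        cond = F.interpolate(cond, size=(s, s), mode="nearest")  # grow to this scale
        logits = model(cond)                                     # all s*s tokens at once
        ids = logits.argmax(1)                                   # (B, s, s), parallel decode
        tokens_per_scale.append(ids)
        # fold this scale's tokens back into the conditioning for the next scale
        cond = cond + model.codebook(ids).permute(0, 3, 1, 2)
    return tokens_per_scale   # the finest grid would feed the VQ-VAE detokenizer

scales = generate_multiscale(ToyScalePredictor())
print([tuple(t.shape) for t in scales])  # 5 parallel steps for the whole pyramid
```

Under this assumed schedule, five forward passes replace the hundreds of sequential next-token steps a flat 16x16 grid would require, which illustrates, in spirit, where the reported latency reduction comes from.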
Source:
arXiv:2510.20803v1 [cs.CV] — ‘ARGenSeg: Image Segmentation with Autoregressive Image Generation Model’ by Xiaolong Wang, Lixiang Ru, Ziyuan Huang, Kaixiang Ji, Dandan Zheng, Jingdong Chen, and Jun Zhou. DOI: https://doi.org/10.48550/arXiv.2510.20803

