Highlights:
- ARGenSeg introduces a new autoregressive image generation paradigm for precise image segmentation.
- Developed by Xiaolong Wang, Lixiang Ru, Ziyuan Huang, Kaixiang Ji, Dandan Zheng, Jingdong Chen, and Jun Zhou.
- Achieves pixel-level accuracy by leveraging multimodal large language models (MLLMs).
- Outperforms previous state-of-the-art segmentation techniques in both accuracy and inference speed.
TLDR:
ARGenSeg is an image segmentation framework that casts segmentation as autoregressive image generation, achieving strong pixel-level understanding and fast inference by unifying multimodal comprehension and fine-grained visual perception in a single model.
A team of researchers—Xiaolong Wang (https://arxiv.org/search/cs?searchtype=author&query=Wang,+X), Lixiang Ru (https://arxiv.org/search/cs?searchtype=author&query=Ru,+L), Ziyuan Huang (https://arxiv.org/search/cs?searchtype=author&query=Huang,+Z), Kaixiang Ji (https://arxiv.org/search/cs?searchtype=author&query=Ji,+K), Dandan Zheng (https://arxiv.org/search/cs?searchtype=author&query=Zheng,+D), Jingdong Chen (https://arxiv.org/search/cs?searchtype=author&query=Chen,+J), and Jun Zhou (https://arxiv.org/search/cs?searchtype=author&query=Zhou,+J)—has introduced a new approach to image segmentation named ARGenSeg. Presented in their 2025 NeurIPS paper, this framework unites multimodal understanding and pixel-level perception through an autoregressive image generation process, marking a notable advance in computer vision and multimodal large language models (MLLMs).
Traditional segmentation methods for MLLMs have relied on boundary point representations or task-specific segmentation heads, which often constrain the model’s ability to capture detailed visual features. ARGenSeg overcomes these limitations by modeling segmentation as an image generation task. The system employs an MLLM that outputs visual tokens, which are then reconstructed into images using a universal Vector Quantized Variational Autoencoder (VQ-VAE). This process allows ARGenSeg to generate dense segmentation masks directly from the model’s pixel-level understanding without relying on hand-crafted decoding pipelines.
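The paper's actual decoder is a trained VQ-VAE, but the token-to-mask idea can be sketched in miniature. In this hedged, illustrative example (the codebook size, grid size, and the toy thresholding "decoder" are all assumptions, not the authors' implementation), the MLLM is assumed to emit a grid of discrete visual-token indices, which a codebook lookup and decoder turn into a dense binary mask:

```python
import numpy as np

# Toy sketch of ARGenSeg's decoding idea (not the authors' code):
# the MLLM emits a grid of discrete visual-token indices, and a
# VQ-VAE-style decoder reconstructs a dense segmentation mask.
rng = np.random.default_rng(0)
CODEBOOK = rng.normal(size=(256, 8))  # 256 codes, 8-dim embeddings (illustrative)

def decode_mask(token_ids: np.ndarray, upscale: int = 4) -> np.ndarray:
    """Map an HxW grid of token indices to a (H*upscale)x(W*upscale) binary mask."""
    embeds = CODEBOOK[token_ids]          # (H, W, 8): codebook lookup per cell
    logits = embeds.mean(axis=-1)         # stand-in "decoder": one scalar per cell
    mask = (logits > 0).astype(np.uint8)  # threshold into foreground/background
    # nearest-neighbor upsample to pixel resolution
    return np.kron(mask, np.ones((upscale, upscale), dtype=np.uint8))

tokens = rng.integers(0, 256, size=(16, 16))  # tokens the MLLM would emit
mask = decode_mask(tokens)
print(mask.shape)  # (64, 64)
```

The point of the sketch is the pipeline shape: segmentation becomes "predict tokens, then decode," with no task-specific segmentation head.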
A particularly innovative component of ARGenSeg is its next-scale-prediction strategy, designed to improve inference efficiency. By enabling parallel generation of visual tokens, the model significantly reduces latency while maintaining high segmentation accuracy. Extensive testing across multiple datasets has shown that ARGenSeg achieves a notable boost in both inference speed and segmentation quality compared to previous approaches. Accepted at NeurIPS 2025, this work highlights how autoregressive generative modeling can fundamentally transform the way segmentation and multimodal reasoning are integrated within AI systems.
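The efficiency argument behind next-scale prediction can be illustrated with a toy coarse-to-fine loop. In this hedged sketch (the scale schedule is an assumption, and `predict_scale` is a stand-in for the real model), every token in a scale's map is produced in a single parallel step conditioned on the upsampled previous scale, so a 16×16 map costs a handful of steps rather than 256 sequential ones:

```python
import numpy as np

rng = np.random.default_rng(0)

def upsample(grid: np.ndarray, size: int) -> np.ndarray:
    """Nearest-neighbor upsample of an integer token grid to size x size."""
    reps = size // grid.shape[0]
    return np.kron(grid, np.ones((reps, reps), dtype=grid.dtype))

def predict_scale(context: np.ndarray, size: int) -> np.ndarray:
    """Stand-in for the model: emit ALL size*size tokens in one parallel step."""
    return (upsample(context, size) + rng.integers(0, 4, (size, size))) % 256

scales = [1, 2, 4, 8, 16]                   # token-map resolutions (illustrative)
tokens = rng.integers(0, 256, size=(1, 1))  # coarsest scale
steps = 0
for s in scales[1:]:
    tokens = predict_scale(tokens, s)       # one decoding step per scale
    steps += 1

print(tokens.shape, steps)  # (16, 16) in 4 steps vs 256 token-by-token steps
```

This mirrors the latency win the paper reports: decoding steps scale with the number of scales, not the number of tokens.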
Source:
Original research paper: ‘ARGenSeg: Image Segmentation with Autoregressive Image Generation Model’ by Xiaolong Wang et al., arXiv:2510.20803v1 [cs.CV], DOI: https://doi.org/10.48550/arXiv.2510.20803
