Highlights:
- Researchers compare text-based and image-based retrieval methods for multimodal RAG systems.
- Direct multimodal embedding retrieval improves retrieval accuracy by 13% absolute (32% relative) in mAP@5 and 11% absolute (20% relative) in nDCG@5.
- The study highlights information loss from LLM-generated image summaries during preprocessing.
- Evaluation conducted across 6 LLMs and 2 multimodal embedding models using a new financial earnings call benchmark.
TLDR:
A team of researchers has demonstrated that directly embedding images in retrieval-augmented generation (RAG) systems significantly improves retrieval quality and large language model performance, yielding more accurate and context-aware results than the conventional approach of summarizing images into text before embedding.
A new research study titled “Comparison of Text-Based and Image-Based Retrieval in Multimodal Retrieval Augmented Generation Large Language Model Systems,” authored by Elias Lumer, Alex Cardenas, Matt Melich, Myles Mason, Sara Dieter, Vamse Kumar Subbiah, Pradeep Honaganahalli Basavaraju, and Roberto Hernandez, presents a systematic analysis of how image-based retrieval strategies can enhance the accuracy and contextual richness of Retrieval-Augmented Generation (RAG) systems. The work, published on arXiv, examines the trade-offs between conventional text summarization of images and direct multimodal embeddings that preserve both textual and visual features.
In conventional multimodal RAG pipelines, Large Language Models (LLMs) rely on preprocessing steps in which images are summarized into textual descriptions before being embedded into vector databases. While efficient, this approach often discards visual context (charts, diagrams, and tabular structures) that is essential for tasks such as financial document analysis. The new study challenges this practice by proposing direct multimodal embedding retrieval: instead of converting images into text summaries, this technique embeds the native image data directly into the vector space, preserving its original visual attributes for more meaningful retrieval and inference.
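To make the contrast concrete, here is a minimal sketch of the two indexing pipelines in Python. The function names (summarize_image, embed_text, embed_image) are hypothetical placeholders for whatever captioning LLM and embedding models a given system uses; this is not the authors' implementation, only an illustration of where the lossy summarization step sits.

```python
import numpy as np

# --- Placeholder model calls: swap in a real captioning LLM and embedding models ---
def summarize_image(image_bytes: bytes) -> str:
    """Hypothetical LLM-generated text summary of an image (the lossy preprocessing step)."""
    return "Bar chart of quarterly revenue by business segment."  # stub output

def embed_text(text: str) -> np.ndarray:
    """Hypothetical text embedding model (stubbed with a deterministic random vector)."""
    rng = np.random.default_rng(abs(hash(text)) % 2**32)
    return rng.standard_normal(512)

def embed_image(image_bytes: bytes) -> np.ndarray:
    """Hypothetical multimodal embedding model that accepts raw image bytes (stubbed)."""
    rng = np.random.default_rng(len(image_bytes))
    return rng.standard_normal(512)

# Pipeline A: text-based retrieval -- image -> LLM summary -> text embedding.
def index_image_as_text(image_bytes: bytes) -> np.ndarray:
    summary = summarize_image(image_bytes)  # visual detail can be lost at this step
    return embed_text(summary)

# Pipeline B: direct multimodal embedding -- image -> image embedding, no summary.
def index_image_directly(image_bytes: bytes) -> np.ndarray:
    return embed_image(image_bytes)

# At query time, both pipelines retrieve by vector similarity in the same way.
def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

The only structural difference is whether an LLM summary sits between the image and the vector index; the study's finding is that removing that intermediate step preserves retrieval-relevant visual information.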
Technically, the team evaluated both retrieval strategies across six large language models and two multimodal embedding frameworks, using a custom-built benchmark derived from financial earnings call documents. The benchmark includes 40 question-answer pairs, each associated with paired text and image documents. The results are striking: multimodal embedding retrieval achieved an absolute improvement of 13% in mean average precision (mAP@5) and 11% in normalized discounted cumulative gain (nDCG@5), corresponding to relative gains of 32% and 20%, respectively. These findings indicate that image-based retrieval does not merely complement textual data; it improves the factual accuracy and consistency of model-generated answers. The authors also demonstrated that LLM summarization leads to measurable information loss, while direct embeddings maintain a richer representation of multimodal data.
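For readers unfamiliar with the metrics, the sketch below shows one standard way to compute AP@5 and nDCG@5 under binary relevance judgments, with mAP@5 obtained by averaging AP@5 over all 40 benchmark questions. It is illustrative only and does not reproduce the authors' evaluation code.

```python
import math

def average_precision_at_k(retrieved: list[str], relevant: set[str], k: int = 5) -> float:
    """AP@k for one query with binary relevance; mAP@k is the mean over all queries."""
    hits, score = 0, 0.0
    for rank, doc_id in enumerate(retrieved[:k], start=1):
        if doc_id in relevant:
            hits += 1
            score += hits / rank  # precision at this rank, counted only at relevant hits
    return score / min(len(relevant), k) if relevant else 0.0

def ndcg_at_k(retrieved: list[str], relevant: set[str], k: int = 5) -> float:
    """nDCG@k with binary gains: discounted gain of the ranking divided by the ideal ranking."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, doc_id in enumerate(retrieved[:k], start=1)
              if doc_id in relevant)
    ideal = sum(1.0 / math.log2(rank + 1)
                for rank in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0

# Example: the relevant document ranked first versus third among five retrieved.
print(average_precision_at_k(["d1", "d7", "d9"], {"d1"}))  # 1.0
print(ndcg_at_k(["d7", "d9", "d1"], {"d1"}))               # 0.5
```

Both metrics reward placing the correct text or image document near the top of the retrieved list, which is why gains in mAP@5 and nDCG@5 translate directly into better grounding for the downstream LLM answer.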
This innovation could redefine how AI systems process complex multimodal datasets in domains such as finance, medicine, and scientific research. By retaining the visual intricacies lost in text translation, future RAG systems can deliver significantly more reliable, context-aware, and factually consistent outputs. The study not only advances the theoretical understanding of multimodal retrieval but also provides a new foundation for building next-generation large language models capable of integrating and reasoning across multiple data modalities.
Source:
arXiv:2511.16654 [cs.CL], DOI: https://doi.org/10.48550/arXiv.2511.16654, Authors: Elias Lumer, Alex Cardenas, Matt Melich, Myles Mason, Sara Dieter, Vamse Kumar Subbiah, Pradeep Honaganahalli Basavaraju, Roberto Hernandez
