
Needle In A Multimodal Haystack

Weiyun Wang1,2, Shuibo Zhang2, Yiming Ren3,2, Yuchen Duan4,2, Tiantong Li3,2,
Shuo Liu2, Mengkang Hu7,2, Zhe Chen5,2, Kaipeng Zhang2, Lewei Lu6, Xizhou Zhu3,2,6,
Ping Luo7,2, Yu Qiao2, Jifeng Dai3,2, Wenqi Shao2, Wenhai Wang4,2

1Fudan University, 2OpenGVLab, Shanghai AI Laboratory,
3Tsinghua University, 4The Chinese University of Hong Kong,
5Nanjing University, 6SenseTime Research, 7The University of Hong Kong


Overview of the MM-NIAH benchmark. Our benchmark consists of three tasks and two types of needles, yielding six types of evaluation data in total. The Retrieval-Image-Needle and Reasoning-Image-Needle tasks are formulated as single-choice questions.

Experimental results show that the performance of Gemini-1.5 on tasks with image needles is no better than random guessing.

Introduction

Needle In A Multimodal Haystack (MM-NIAH) is a comprehensive benchmark designed to systematically evaluate the capability of existing MLLMs to comprehend long multimodal documents. The benchmark requires the model to answer specific questions according to key information scattered throughout a multimodal document. The evaluation data in MM-NIAH covers three tasks: retrieval, counting, and reasoning. The needles are inserted into either the text or the images of the documents; those inserted into text are termed text needles, whereas those within images are referred to as image needles. Evaluating the leading MLLMs on MM-NIAH, we observe that existing models still have significant room for improvement on these tasks, especially on the vision-centric evaluation. We hope this work can provide a platform for further research on long multimodal document comprehension and contribute to the advancement of MLLMs.

Leaderboard on Validation

Overall performance on the validation subset (3,114 examples) of MM-NIAH.

Leaderboard on Test

Overall performance on the test subset (17,787 examples with private ground truth) of MM-NIAH.

The overall performance is obtained by averaging the performance across all context ranges.
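Concretely, the overall score is a plain mean over the per-range scores. A minimal sketch in Python (the context-length buckets and scores below are illustrative placeholders, not real leaderboard numbers):

# Overall score as the unweighted mean over context-length ranges.
# The bucket names and scores are illustrative placeholders.
per_range_scores = {
    "1K-4K": 0.72,
    "4K-16K": 0.65,
    "16K-64K": 0.58,
}
overall = sum(per_range_scores.values()) / len(per_range_scores)
print(f"Overall: {overall:.3f}")  # 0.650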

🚨 To submit your results to the leaderboard, please send your result JSON files to this email.

🚨 For more submission details, please refer to this link.

MM-NIAH

Overview

We introduce Needle In A Multimodal Haystack (MM-NIAH), a benchmark designed to systematically evaluate the ability of MLLMs to comprehend long multimodal documents. The benchmark requires the model to answer specific questions according to key information scattered throughout a multimodal document. To generate the evaluation data, we first concatenate interleaved image-text sequences from OBELICS to establish the background documents, termed multimodal haystacks. We then generate three data types based on these documents: retrieval, counting, and reasoning. For each task, we insert either text needles or image needles into the documents. Those inserted into text are termed text needles, whereas those within images are referred to as image needles.
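As a rough illustration of this construction, the sketch below inserts a text needle at a given relative depth into the concatenated text of a haystack document. The function and variable names are our own; this is not the official MM-NIAH generation code.

# Hypothetical sketch of text-needle insertion; not the official generation code.
def insert_text_needle(text_segments, needle, depth):
    """Insert `needle` as a standalone segment at a relative `depth` in [0, 1].

    `text_segments` is the list of text pieces from a concatenated OBELICS
    interleaved sequence; images are handled separately and omitted here.
    """
    position = round(depth * len(text_segments))
    return text_segments[:position] + [needle] + text_segments[position:]

haystack = ["First OBELICS text segment.", "Second OBELICS text segment."]
needle = "The little penguin counted 3 needles."
print(insert_text_needle(haystack, needle, depth=0.5))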

  • Retrieval: The text needle in the retrieval task is a random fact or statement inserted at a certain depth of the document. The corresponding question asks the model to retrieve this statement. The image needle is a random cartoon-style image generated by DALL-E 3 and inserted into a certain image within the document. The corresponding question is a single-choice question that asks the model to select, from four image options, the one that appears in the document.
  • Counting: The text needle in the counting task comprises a series of statements, each of which claims that the little penguin counted a certain number of needles. For the image needles, a certain number of cartoon-style images are inserted into each image within the document, serving as the needles to be counted. Inspired by the Counting Stars benchmark (Song et al., 2024), we require the model to list the number of needles in each statement or image instead of directly outputting the total number of needles (see the scoring sketch after this list). The motivation behind this design is to ensure that the model accurately retrieves and comprehends every text and image needle inserted into the multimodal document.
  • Reasoning: A series of statements is inserted at different positions of the given document to serve as the text needles. The model must retrieve all of these statements and reason over them to answer the question correctly. In addition, for each evaluation sample, images sampled from the Jigsaw and Multi-view Reasoning splits of the BLINK benchmark are inserted into the document to serve as the image needles, and the model is required to answer a question related to these images.
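The counting metric can be understood as a position-wise comparison between the predicted list of counts and the ground-truth list. The snippet below is our own illustrative sketch of such a soft-accuracy score; consult the official evaluation scripts for the exact metric used in MM-NIAH.

# Illustrative soft-accuracy scoring for the counting task (our own sketch,
# not the official MM-NIAH evaluation code).
def counting_score(predicted, ground_truth):
    """Fraction of positions where the predicted count matches the ground truth."""
    if not ground_truth:
        return 0.0
    correct = sum(p == g for p, g in zip(predicted, ground_truth))
    return correct / len(ground_truth)

# The model should output one count per statement/image, e.g. [2, 1, 3],
# rather than the total (6).
print(counting_score([2, 1, 3], [2, 1, 3]))  # 1.0
print(counting_score([2, 2, 3], [2, 1, 3]))  # ~0.667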

All data examples are divided into two subsets: validation and test.

  • validation: 3,114 examples intended for model development and validation, or for those with limited computing resources.
  • test: 17,787 examples for standard evaluation. Notably, the answer labels for test will NOT be publicly released.
You can download the dataset from Hugging Face Datasets.
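A minimal download sketch using the Hugging Face Hub client, assuming the benchmark is hosted under the OpenGVLab/MM-NIAH dataset repository (check the Hugging Face page linked above for the exact repository id and file layout):

# Fetch the benchmark files from the Hugging Face Hub.
# The repository id below is an assumption; verify it on the dataset page.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="OpenGVLab/MM-NIAH",
    repo_type="dataset",  # download from the dataset hub, not the model hub
)
print("Benchmark files downloaded to:", local_dir)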

Comparisons with Existing Benchmarks

Existing benchmarks for multi-image comprehension, such as SEED-Bench-2 and BLINK, consist of short contexts and therefore fail to evaluate the capability for long-context document comprehension. Additionally, benchmarks for video question answering, such as MVBench, concentrate on vision-dominant video understanding rather than text-dominant multimodal document understanding.


Comparison of MM-NIAH with other multi-image benchmarks. Our benchmark focuses on the evaluation of long multimodal document comprehension.

Statistics

Notable statistics of MM-NIAH

Task        Needle Type   Answer Type    #Samples (val)   #Samples (test)   #Needles Per Sample
Retrieval   Text          Open-Ended     519              3072              1
Retrieval   Image         Multi-Choice   520              3005              1
Counting    Text          Open-Ended     517              3060              1~3
Counting    Image         Open-Ended     518              2713              1~5
Reasoning   Text          Open-Ended     520              3004              3
Reasoning   Image         Multi-Choice   520              2933              1~2

Experiment Results

Main Findings

Based on MM-NIAH, we conducted a series of experiments. The main findings are summarized as follows:

  • The most advanced MLLMs (e.g. Gemini-1.5) still struggle to comprehend multimodal documents.
  • All MLLMs exhibit poor performance on image needles.
  • MLLMs fail to recognize the exact number of images in the document.
  • Models pre-trained on image-text interleaved data do not exhibit superior performance.
  • Training on background documents does not boost performance on MM-NIAH.
  • The "Lost in the Middle" problem also exists in MLLMs.
  • Long context capability of LLMs is NOT retained in MLLMs.
  • RAG boosts Text Needle Retrieval but not Image Needle Retrieval.
  • Placing questions before context does NOT improve model performance.
  • Humans achieve near-perfect performance on MM-NIAH.

Please see our paper for more detailed analyses.

Results on each task

We present the evaluation results in both heatmap format and table format. In the heatmaps, green slots indicate higher performance, while red slots indicate lower performance. In the tables, we provide the average performance across depths for each context length range.
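For reference, a heatmap of this kind can be rendered with a few lines of matplotlib; the sketch below uses random placeholder scores and illustrative depth/context buckets rather than real results.

# Render a needle-depth vs. context-length heatmap in a red-to-green style.
# Scores are random placeholders; depths and context buckets are illustrative.
import numpy as np
import matplotlib.pyplot as plt

depths = [0.0, 0.25, 0.5, 0.75, 1.0]                # needle depth in the document
context_ranges = ["1K", "4K", "16K", "32K", "64K"]  # illustrative context buckets
scores = np.random.rand(len(depths), len(context_ranges))

fig, ax = plt.subplots()
im = ax.imshow(scores, cmap="RdYlGn", vmin=0.0, vmax=1.0)  # red = low, green = high
ax.set_xticks(range(len(context_ranges)), labels=context_ranges)
ax.set_yticks(range(len(depths)), labels=[f"{d:.0%}" for d in depths])
ax.set_xlabel("Context length")
ax.set_ylabel("Needle depth")
fig.colorbar(im, ax=ax, label="Accuracy")
plt.show()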


Results on MM-NIAH. Green slots indicate higher performance, while red slots indicate lower performance. We evaluate GPT-4V only on text-needle data because its API supports at most 10 images.

Leaderboard on Retrieval-Text-Needle of Test split

Leaderboard on Retrieval-Image-Needle of Test split

Leaderboard on Counting-Text-Needle of Test split

Leaderboard on Counting-Image-Needle of Test split

Leaderboard on Reasoning-Text-Needle of Test split

Leaderboard on Reasoning-Image-Needle of Test split

BibTeX


@article{wang2024needle,
  title={Needle In A Multimodal Haystack}, 
  author={Wang, Weiyun and Zhang, Shuibo and Ren, Yiming and Duan, Yuchen and Li, Tiantong and Liu, Shuo and Hu, Mengkang and Chen, Zhe and Zhang, Kaipeng and Lu, Lewei and Zhu, Xizhou and Luo, Ping and Qiao, Yu and Dai, Jifeng and Shao, Wenqi and Wang, Wenhai},
  journal={arXiv preprint arXiv:2406.07230},
  year={2024}
}