Needle In A Multimodal Haystack (MM-NIAH) is a comprehensive benchmark designed to systematically evaluate the capability of existing MLLMs to comprehend long multimodal documents. The benchmark requires a model to answer specific questions based on key information scattered throughout a multimodal document. The evaluation data in MM-NIAH covers three tasks: retrieval, counting, and reasoning. The needles are inserted into either the text or the images of the documents: those inserted into text are termed text needles, whereas those within images are referred to as image needles. Evaluating leading MLLMs on MM-NIAH, we observe that existing models still have significant room for improvement on these tasks, especially in vision-centric evaluation. We hope this work can provide a platform for further research on long multimodal document comprehension and contribute to the advancement of MLLMs.
Overall performance on the validation subset (3,114 examples) of MM-NIAH.
Overall performance on the test subset (17,787 examples with private ground truth) of MM-NIAH.
The overall performance is obtained by averaging the scores across all context length ranges.
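For concreteness, here is a minimal sketch of such a two-stage average: per-depth scores are first averaged within each context length range, and the range-level scores are then averaged into the overall number. The context ranges, depths, and accuracy values below are placeholders, not real results.

```python
from statistics import mean

# acc[(context_range, depth)] -> accuracy for one heatmap cell.
# Both the keys and the values here are illustrative placeholders.
acc = {
    ("1k-2k", 0.0): 0.92, ("1k-2k", 0.5): 0.88, ("1k-2k", 1.0): 0.90,
    ("8k-16k", 0.0): 0.71, ("8k-16k", 0.5): 0.55, ("8k-16k", 1.0): 0.63,
}

# Average across needle depths within each context length range ...
ranges = sorted({ctx for ctx, _ in acc})
per_range = {ctx: mean(v for (c, _), v in acc.items() if c == ctx) for ctx in ranges}

# ... then average the range-level scores to obtain the overall performance.
overall = mean(per_range.values())
print(per_range, overall)
```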
🚨 To submit your results to the leaderboard, please send your result JSON files to this email.
🚨 For more submission details, please refer to this link.
We introduce Needle In A Multimodal Haystack (MM-NIAH), a benchmark designed to systematically evaluate the ability of MLLMs to comprehend long multimodal documents. The benchmark requires a model to answer specific questions based on key information scattered throughout a multimodal document. To generate the evaluation data, we first concatenate interleaved image-text sequences from OBELICS to form the background documents, termed multimodal haystacks. We then generate three types of data based on these documents: retrieval, counting, and reasoning. For each task, we insert either text needles or image needles into the documents: those inserted into text are termed text needles, whereas those within images are referred to as image needles.
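To illustrate the needle-insertion idea, here is a simplified sketch. The document representation, the `insert_text_needle` helper, and the depth handling are illustrative assumptions, not the actual MM-NIAH generation pipeline, which controls context length and needle depth more precisely.

```python
# A multimodal haystack is represented here as an interleaved list of
# text and image segments (a simplified stand-in for an OBELICS document).
haystack = [
    {"type": "text", "content": "Paragraph one of the background document."},
    {"type": "image", "path": "obelics_0001.jpg"},
    {"type": "text", "content": "Paragraph two of the background document."},
    {"type": "image", "path": "obelics_0002.jpg"},
]

def insert_text_needle(doc, needle, depth):
    """Insert a text needle at a relative depth in [0, 1] of the text stream."""
    text_idx = [i for i, seg in enumerate(doc) if seg["type"] == "text"]
    target = text_idx[min(int(depth * len(text_idx)), len(text_idx) - 1)]
    words = doc[target]["content"].split()
    cut = int(depth * len(words))
    doc[target]["content"] = " ".join(words[:cut] + [needle] + words[cut:])
    return doc

# Place one text needle roughly halfway through the document's text.
doc = insert_text_needle(haystack, "The special magic number is 42.", depth=0.5)
```

An image needle would instead modify one of the image segments, with the question then asking about the inserted visual content.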
All the data examples were divided into two subsets: validation and test.
Existing benchmarks for multi-image comprehension, such as SEED-Bench-2 and BLINK, consist of short contexts and therefore fail to evaluate long-context document comprehension. Additionally, benchmarks for video question answering, like MVBench, concentrate on vision-dominant video understanding rather than text-dominant multimodal document understanding.
Comparison of MM-NIAH with other multi-image benchmarks. Our benchmark focuses on the evaluation of long multimodal document comprehension.
Notable statistics of MM-NIAH.
| Task | Needle Type | Answer Type | #Samples (val) | #Samples (test) | #Needles Per Sample |
|---|---|---|---|---|---|
| Retrieval | Text | Open-Ended | 519 | 3072 | 1 |
| Retrieval | Image | Multi-Choice | 520 | 3005 | 1 |
| Counting | Text | Open-Ended | 517 | 3060 | 1~3 |
| Counting | Image | Open-Ended | 518 | 2713 | 1~5 |
| Reasoning | Text | Open-Ended | 520 | 3004 | 3 |
| Reasoning | Image | Multi-Choice | 520 | 2933 | 1~2 |
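The "Answer Type" column matters for scoring: multi-choice answers can be compared by option letter, while open-ended answers need string matching. The snippet below is a simplified scoring sketch under those assumptions; the official evaluation scripts define the exact matching rules (counting answers, for example, may be scored less strictly).

```python
import re

def score(answer_type, prediction, ground_truth):
    """Simplified scoring sketch keyed on the answer types in the table above."""
    if answer_type == "Multi-Choice":
        # Compare the first standalone option letter found in each string.
        pred = re.search(r"\b([A-D])\b", prediction.upper())
        gold = re.search(r"\b([A-D])\b", ground_truth.upper())
        return float(pred is not None and gold is not None
                     and pred.group(1) == gold.group(1))
    # Open-Ended: exact match after light normalization.
    norm = lambda s: re.sub(r"\s+", " ", s.strip().lower())
    return float(norm(prediction) == norm(ground_truth))

print(score("Multi-Choice", "The answer is (B).", "B"))  # 1.0
print(score("Open-Ended", " 42 ", "42"))                 # 1.0
```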
Based on MM-NIAH, we conducted a series of experiments. Please see our paper for the main findings and more detailed analyses.
We present the evaluation results in both heatmap and table formats. In the heatmaps, green cells indicate higher performance, while red cells indicate lower performance. In the tables, we report the average performance across needle depths for each context length range.
Results on MM-NIAH. Green cells indicate higher performance, while red cells indicate lower performance. We evaluate GPT-4V only on our text-needle data because the GPT-4V API supports at most 10 images.
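As a presentation aid, here is a minimal matplotlib sketch of such a green-to-red heatmap over needle depth and context length. The depth values, context length labels, and accuracies are placeholders rather than real results.

```python
import numpy as np
import matplotlib.pyplot as plt

# Rows are needle depths, columns are context length ranges; values are
# accuracies in [0, 1]. Random placeholders stand in for model results.
depths = [0.0, 0.25, 0.5, 0.75, 1.0]
contexts = ["1k", "2k", "4k", "8k", "16k", "32k", "64k"]
acc = np.random.rand(len(depths), len(contexts))

fig, ax = plt.subplots(figsize=(6, 3))
im = ax.imshow(acc, cmap="RdYlGn", vmin=0.0, vmax=1.0, aspect="auto")
ax.set_xticks(range(len(contexts)))
ax.set_xticklabels(contexts)
ax.set_yticks(range(len(depths)))
ax.set_yticklabels([str(d) for d in depths])
ax.set_xlabel("Context length")
ax.set_ylabel("Needle depth")
fig.colorbar(im, ax=ax, label="Accuracy")
fig.tight_layout()
fig.savefig("heatmap.png")
```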
@article{wang2024needle,
title={Needle In A Multimodal Haystack},
author={Wang, Weiyun and Zhang, Shuibo and Ren, Yiming and Duan, Yuchen and Li, Tiantong and Liu, Shuo and Hu, Mengkang and Chen, Zhe and Zhang, Kaipeng and Lu, Lewei and Zhu, Xizhou and Luo, Ping and Qiao, Yu and Dai, Jifeng and Shao, Wenqi and Wang, Wenhai},
journal={arXiv preprint arXiv:2406.07230},
year={2024}
}