Figure: Illustration of the MRAMG task (top), with example scenarios (bottom) showing how integrating text and images enhances clarity and understanding.

Abstract

Recent advances in Retrieval-Augmented Generation (RAG) have significantly improved response accuracy and relevance by incorporating external knowledge into Large Language Models (LLMs). However, existing RAG methods primarily focus on generating text-only answers, even in Multimodal Retrieval-Augmented Generation (MRAG) scenarios, where multimodal elements are retrieved to assist in generating text answers. To address this, we introduce the Multimodal Retrieval-Augmented Multimodal Generation (MRAMG) task, which aims to generate multimodal answers that combine both text and images, fully leveraging the multimodal data within a corpus. Despite growing attention to this challenging task, a comprehensive benchmark for effectively evaluating it is still lacking. To bridge this gap, we provide MRAMG-Bench, a meticulously curated, human-annotated benchmark comprising 4,346 documents, 14,190 images, and 4,800 QA pairs, distributed across six distinct datasets and spanning three domains: Web, Academia, and Lifestyle. The datasets incorporate diverse difficulty levels and complex multi-image scenarios, providing a robust foundation for evaluating the MRAMG task. To facilitate rigorous evaluation, MRAMG-Bench incorporates a comprehensive suite of both statistical and LLM-based metrics, enabling a thorough analysis of the performance of generative models on the MRAMG task. Additionally, we propose an efficient and flexible multimodal answer generation framework that leverages LLMs/MLLMs to generate multimodal responses.

Evaluation Metrics

In this section, we provide a detailed introduction to the evaluation metrics.

Retrieval Evaluation

To evaluate the retrieval performance, we consider the following metrics:

Generation Evaluation

To evaluate the quality of multimodal answers, we consider the following metrics, which fall into two categories: statistical metrics (the first six) and LLM-based metrics (the last four).

We use the following statistical metrics:

We use the following LLM-based metrics:
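
To make the statistics-based image metrics concrete, a minimal sketch of how image precision, recall, F1, and one possible ordering score could be computed from the predicted and ground-truth image ID lists is given below. The function names and the particular ordering formulation (pairwise-order agreement) are illustrative assumptions, not the benchmark's reference implementation; ROUGE-L and BERTScore are computed with standard text-evaluation tooling.

# Illustrative sketch (not the official MRAMG-Bench implementation).
def image_prf(predicted_ids, gold_ids):
    """Precision, recall, and F1 between predicted and gold image ID lists."""
    pred, gold = set(predicted_ids), set(gold_ids)
    hits = len(pred & gold)
    precision = hits / len(pred) if pred else 0.0
    recall = hits / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall > 0 else 0.0
    return precision, recall, f1

def ordering_score(predicted_ids, gold_ids):
    """Fraction of gold image pairs whose relative order is preserved in the
    prediction; pairs with a missing image count as violations (one plausible
    definition, assumed here for illustration)."""
    pairs = [(a, b) for i, a in enumerate(gold_ids) for b in gold_ids[i + 1:]]
    if not pairs:
        return 1.0 if set(gold_ids) <= set(predicted_ids) else 0.0
    pos = {img: i for i, img in enumerate(predicted_ids)}
    kept = sum(1 for a, b in pairs if a in pos and b in pos and pos[a] < pos[b])
    return kept / len(pairs)

# Example: the answer cites img2, img1, img3 while the gold answer uses img1, img2.
print(image_prf(["img2", "img1", "img3"], ["img1", "img2"]))     # approx. (0.67, 1.0, 0.8)
print(ordering_score(["img2", "img1", "img3"], ["img1", "img2"]))  # 0.0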

Prompts

Generation Prompts

Answer Generation Prompt for LLM-Based Methods

# Input
Query: {}    
Context: {}    
Image Caption: {}

# Task
Imagine you are an expert in handling multimodal input queries and producing coherent text-image responses. You will receive:
1. Query: The user query to be answered.
2. Contexts containing multiple images represented as placeholders <img>.
- The input context follows the format:
[context_1] <img1>, [context_2] <img2>, ...
- Each [context_x] represents a pure text passage, while each <img> serves as a placeholder for an image.
3. A set of image captions.
- Each caption is sequentially aligned in a one-to-one correspondence with its respective image placeholder <img>.
Your task is to answer the query based solely on the content of the context and the input image information. First, select appropriate images from the provided context (if none are suitable, you may choose not to include any). Then generate a mixed text-image response to the query, combining text and the selected images.

# Requirements 
Ensure that your answer does not include any additional information outside the context.    
Image Insert: When inserting image placeholders, place them at the most appropriate point within the answer. Image placeholders should be embedded naturally in the answer to support and enhance understanding, such as when describing specific locations, historical events, or notable buildings.

# Output Format
Please output your answer in an interwoven text-image format, where you select images from the context and include them in the corresponding placeholder format. 

# Output Example
Doing household chores is a daily task that helps maintain a clean home. In the kitchen, dishes are neatly washed and placed in the drying rack, ready to be put away once they dry.<img10> Similarly, in the living room, the sofa cushions are fluffed and arranged properly, creating a comfortable space for relaxation.<img11>
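
As an illustration of how this template might be wired into a generation pipeline, the sketch below fills the Query/Context/Image Caption slots and extracts the <imgN> placeholders from a generated answer so they can later be replaced by the actual images. The helper names are assumptions for exposition and not part of the benchmark code; the prompt body is abridged.

import re

PROMPT_TEMPLATE = """# Input
Query: {query}
Context: {context}
Image Caption: {captions}

# Task
... (full instructions as shown in the template above) ...
"""

def build_prompt(query, text_chunks, image_captions):
    # Interleave retrieved passages with placeholders: [context_1] <img1>, [context_2] <img2>, ...
    context = ", ".join(f"[{chunk}] <img{i + 1}>" for i, chunk in enumerate(text_chunks))
    captions = " ".join(f"<img{i + 1}>: {cap}" for i, cap in enumerate(image_captions))
    return PROMPT_TEMPLATE.format(query=query, context=context, captions=captions)

def extract_image_ids(answer):
    # Collect image placeholders in the order they appear in the generated answer.
    return [int(m) for m in re.findall(r"<img(\d+)>", answer)]

answer = "The dishes are in the drying rack.<img1> The sofa is arranged neatly.<img2>"
print(extract_image_ids(answer))  # [1, 2]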

Answer Generation Prompt for MLLM-Based Methods

# Input   
Query: {} 
Context: {}    
Image Caption: {}

# Task
Imagine you are an expert in handling multimodal input queries and producing coherent text-image responses.
You will receive:    
1. Query: The user query to be answered.
2. Contexts.    
3. A set of images.
4. A set of image captions.   
- Each caption is sequentially aligned in a one-to-one correspondence with its respective input image.

Your task is to answer the query based solely on the content of the context and the input image information. First, use the given images and image captions to understand each image both visually and textually, and select appropriate images from the input (if none are suitable, you may choose not to include any). Next, based on the provided contexts and query, generate a multimodal answer combining text and the selected images.

# Requirements
Ensure that your answer does not include any additional information outside the context. Please note that your answer should be presented in an interwoven text-image format, where you select images from the context and output them in the corresponding placeholder format. Please provide only the answer, without including any analysis.
Image Insert: When inserting image placeholders, place them at the most appropriate point within the answer. Image placeholders should be embedded naturally in the answer to support and enhance understanding, such as when describing specific locations, historical events, or notable buildings.

# Output Format    
Please output the answer in an interwoven text-image format, where you select images from the context provided and output them in the corresponding placeholder format.

# Output Example    
Doing household chores is a daily task that helps maintain a clean home. In the kitchen, dishes are neatly washed and placed in the drying rack, ready to be put away once they dry.<img10> Similarly, in the living room, the sofa cushions are fluffed and arranged properly, creating a comfortable space for relaxation.<img11>

Answer Generation Prompt for Rule-Based Methods

# Task
Imagine you are a text QA expert, skilled in delivering contextually relevant answers. You will receive:   
1. Query.  
2. Contexts. 

Your task is to answer the query based solely on the content of the context. 

# Requirements    
Ensure that your answer does not include any additional information outside the context. Please note that your answer should be in pure text format.

# Output Format    
Provide the answer in pure text format. Do not include any information beyond what is contained in the context.
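
For the rule-based framework, the pure-text answer produced by this prompt still needs images inserted by a separate rule. The sketch below shows one simple illustrative rule, appending each image after the answer sentence whose wording overlaps most with the image caption; this is an assumption made for exposition, not necessarily the exact insertion rule used by the framework.

import re

def insert_images_by_caption_overlap(text_answer, captions):
    """captions: dict mapping image id -> caption text (illustrative rule only)."""
    sentences = re.split(r"(?<=[.!?])\s+", text_answer.strip())
    placements = {i: [] for i in range(len(sentences))}
    for img_id, caption in captions.items():
        cap_words = set(caption.lower().split())
        # Pick the sentence with the largest word overlap with the caption.
        best = max(range(len(sentences)),
                   key=lambda i: len(cap_words & set(sentences[i].lower().split())))
        placements[best].append(img_id)
    return " ".join(s + "".join(f"<img{j}>" for j in placements[i])
                    for i, s in enumerate(sentences))

answer = "The dishes are washed and placed in the drying rack. The sofa cushions are fluffed."
captions = {10: "dishes in a drying rack", 11: "sofa cushions in the living room"}
print(insert_images_by_caption_overlap(answer, captions))
# The dishes are washed and placed in the drying rack.<img10> The sofa cushions are fluffed.<img11>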

Evaluation Prompts

Answer Evaluation Prompt for Image Relevance

# Input  
Query: {}    
Answer: {}    
Image Context: {}
Image Caption: {}

# Task    
Imagine you are a multimodal QA evaluation expert. Your task is to evaluate the relevance of selected images within an answer to the given query. Specifically, the answer contains both text and images. You need to assess whether the selected images are relevant to the QA pair in terms of content. The evaluation results should be output in the form of reasons and scores.

# Answer Input Format   
[text_1] <img_1> [text_2] <img_2>...  
Explanation:   
Each [text_x] is a piece of pure text context, and each <img> represents an image. The images will be provided in the same order as the placeholders <img>.

# Image Context Input Format    
[context_above] <img> [context_bottom]    
Explanation:     
This format represents the contextual information surrounding the image within its original document. It provides supplementary information to assist in evaluating the image.

# Scoring Criteria of Relevance (Each Image)    
When scoring, strictly adhere to the following standards, with a range of 1 to 5:    
- 1 point: Completely unrelated: The image has no connection to the main content of the query and answer, and is irrelevant.   
- 2 points: Weakly related: The image has a very tenuous connection to the main content of the query and answer.    
- 3 points: Partially related: The image is somewhat connected to part of the content of the query and answer.    
- 4 points: Mostly related: The image has a fairly clear connection to the main content of the query and answer.    
- 5 points: Highly related: The image is highly relevant to the content of the query and answer. 
Provide a brief reason for the evaluation along with a score from 1 to 5. Ensure you do not use any evaluation criteria beyond the query and answer.

# Output Format
Please output two lines for the results: the first line is your reasoning for the score, and the second line is the score. Strictly follow this format without any additional content.

# Output Example
Partially related, the image depicts the general structure of the gate but does not clearly show the number of pillars, making it only somewhat relevant to the QA.  
<relevance_score>3</relevance_score>  
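
A minimal sketch of parsing the evaluator's two-line output into a reason and a numeric score is shown below; the tag name follows the output example above, and the parsing logic is an illustrative assumption rather than the official evaluation code. The same pattern applies to the effectiveness and overall-quality prompts below by swapping the tag name.

import re

def parse_score(evaluation_text, tag="relevance_score"):
    # Line 1 is the reasoning; line 2 carries <relevance_score>N</relevance_score>.
    lines = evaluation_text.strip().splitlines()
    reason = lines[0] if lines else ""
    match = re.search(rf"<{tag}>\s*([1-5])\s*</{tag}>", evaluation_text)
    return reason, (int(match.group(1)) if match else None)

reason, score = parse_score(
    "Partially related, the image depicts the general structure of the gate...\n"
    "<relevance_score>3</relevance_score>"
)
print(score)  # 3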

Answer Evaluation Prompt for Image Effectiveness

# Input    
Query: {}    
Answer: {}   
Image Context: {}    
Image Caption: {}

# Task
Imagine you are a multimodal QA evaluation expert. Your task is to evaluate the effectiveness of the selected images within an answer to the given query. Specifically, the answer contains both text and images. You need to assess whether the selected images effectively support the QA pair in terms of content. The evaluation results should be output in the form of reasons and scores.

# Answer Input Format
[text_1] <img_1> [text_2] <img_2>...
Explanation:
Each [text_x] is a piece of pure text context, and each <img> represents an image. The images will be provided in the same order as the placeholders <img>.

# Image Context Input Format   
[context_above] <img> [context_bottom]   
Explanation:    
This format represents the contextual information surrounding the image within its original document. It provides supplementary information to assist in evaluating the image.

# Scoring Criteria of Effectiveness (Each Image)  
When scoring, strictly adhere to the following standards, with a range of 1 to 5:   
- 1 point, Harmful: The images in the answer are harmful to answering the query, such as causing serious misunderstanding for the reader.    
- 2 points, Irrelevant: The images in the answer are mostly unrelated to the query and the answer, with little to no connection overall.    
- 3 points, Partially Effective: The images in the answer are somewhat effective in helping the reader understand the answer to the query.   
- 4 points, Mostly Effective: The images in the answer are largely consistent with the answer to the query and effectively help the reader better understand the answer.    
- 5 points, Highly Effective: The images in the answer provide crucial details for answering the query. They not only align with the answer but also offer highly effective supplementary information that aids in understanding the query-answer pair from a multimodal perspective.   
Provide a brief reason for the evaluation along with a score from 1 to 5. Ensure you do not use any evaluation criteria beyond the query and answer.

# Output Format    
Please output two lines for the results: the first line is your reasoning for the score, and the second line is the score. Strictly follow this format without any additional content.

# Output Example    
Highly effective: The images in the answer, depicting the front entrance with three pillars, are highly effective in helping readers understand the query about how many pillars there are. They strongly support the response that states there are three pillars. All images provide crucial details that aid in the reader's comprehension.
<effective_score>5</effective_score>    

Answer Evaluation Prompt for Comprehensive Answer Quality Evaluation

# Input  
Query: {}
Answer: {}   
Image Context: {}    
Image Caption: {}

# Task  
Imagine you are a multimodal QA evaluation expert. Your task is to evaluate the overall quality of the answer. Specifically, the answer contains both text and images. The evaluation results should be output in the form of reasons and scores.

# Answer Input Format    
[text_1] <img_1> [text_2] <img_2>...
Explanation:   
Each [text_x] is a piece of pure text context, and each <img> represents an image. The images will be provided in the same order as the placeholders <img>.

# Image Context Input Format    
[context_above] <img> [context_bottom]

Explanation:    
This format represents the contextual information surrounding the image within its original document. It provides supplementary information to assist in evaluating the image.

# Evaluation Criteria of Overall Quality    
Strictly follow the scoring criteria below to assign a score between 1 and 5:    
- 1 point, Poor Quality: The answer fails to address the question, the structure is confusing or missing, and the images are irrelevant or not helpful.
- 2 points, Fair Quality: The answer partially addresses the question but lacks completeness. The structure is weak, and the text-image integration is weak or only partially helpful.
- 3 points, Average Quality: The answer addresses the question but lacks depth. The structure is clear but could be improved. The images are somewhat helpful but don’t fully enhance understanding.
- 4 points, Good Quality: The answer is clear and fairly comprehensive. The structure is logical and well-organized, and the images enhance the understanding of the text.
- 5 points, Excellent Quality: The answer is detailed and insightful. The structure is strong and cohesive, and the images complement the text perfectly, significantly enhancing comprehension. 
Provide a brief reason for the evaluation along with a score from 1 to 5. Ensure you do not use any evaluation criteria beyond the query and answer.

# Output Format    
Please output two lines for the results: the first line is your reasoning for the score, and the second line is the score. Strictly follow this format without any additional content.

# Output Example  
The answer provides a complete and coherent description of the Irish bouzouki, and the images in the answer help reinforce the explanation of its appearance. The structure is logical and easy to follow, with all images appropriately enhancing the reader's understanding of the instrument.
<overall_quality_score>5</overall_quality_score>

Answer Evaluation Prompt for Image Position

# Input  
Query: {}    
Answer: {}    
Image Context: {}   
Image Caption: {}

# Task    
Imagine you are a multimodal problem-solving expert tasked with evaluating whether the position of each selected image within an answer to the given query is appropriate.

# Answer Input Format   
[text_1] <img_1> [text_2] <img_2>...   
Explanation:   
Each [text_x] is a segment of pure text context, and each <img> represents an image. The images will be presented in the same order as the placeholders <img>. 

# Image Context Input Format    
[context_above] <img> [context_bottom] 

Explanation:    
This format represents the contextual information surrounding the image within its original document. It provides supplementary information to assist in evaluating the image.


# Evaluation Criteria
Strictly follow the criteria below to assign a score of 0 or 1:
- 0 points, Inappropriate Position: The image is irrelevant to both the preceding and following context, or the position of the image does not enhance content understanding or visual appeal. The insertion of the image does not align with the logical progression of the text and fails to improve the reading experience or information transmission.
- 1 point, Appropriate Position: The image is contextually relevant to at least one of the surrounding contexts (preceding or following), and it enhances content understanding or visual effect. The position of the image aligns with the logical flow of the text and is inserted appropriately, improving the overall information delivery. If the description of the image is detailed, it further clarifies the connection between the image and the text, enhancing the overall expressive effect.

# Output Format   
Provide a brief justification for the evaluation and a score of either 0 or 1. Ensure no evaluation criteria beyond the provided query and answer are used.
Please output two lines for each image: the first line is your reasoning for the score, and the second line is the score. Strictly follow this format without any additional content.

# Output Example
<img_1> displays a distant aerial view of the site, but the surrounding context focuses on intricate design details of the main entrance. The image placement does not align with the described content and does not improve comprehension.  
<img_1_score>0</img_1_score>      
<img_2> shows a close-up of one of the pillars, which is directly referenced in the following context about the structure's details. The image placement aligns with the description, enhancing understanding.     
<img_2_score>1</img_2_score>
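
Because the position evaluator emits one binary tag per image, its output can be aggregated into a single position score by averaging the per-image values. The sketch below is an illustrative parser based on the tag format in the output example above; it is not the benchmark's official aggregation code.

import re

def parse_position_scores(evaluation_text):
    # Each image yields <img_k_score>0|1</img_k_score>; average them into a percentage.
    values = re.findall(r"<img_\d+_score>\s*([01])\s*</img_\d+_score>", evaluation_text)
    return sum(int(v) for v in values) / len(values) * 100 if values else None

output = (
    "<img_1> is misplaced relative to the surrounding description.\n"
    "<img_1_score>0</img_1_score>\n"
    "<img_2> matches the following context about the structure's details.\n"
    "<img_2_score>1</img_2_score>\n"
)
print(parse_position_scores(output))  # 50.0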

Results

In this section, we present the full experimental results, wherein the metrics Prec., Rec., F1, R.L., B.S., Rel., Eff., Comp., Pos., and Avg. denote image precision, image recall, image F1 score, ROUGE-L, BERTScore, image relevance, image effectiveness, comprehensive score, image position score, and average score, respectively. The metric Ord. denotes the image ordering score.
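
For reference, the Avg. column in the tables below is consistent with the arithmetic mean of the preceding metric columns of the same row; for example, for the Rule-Based GPT-4o row on MRAMG-Wit, (49.50 + 49.67 + 49.56 + 56.23 + 92.27 + 43.67 + 39.50 + 77.00 + 50.08) / 9 = 56.39.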

Comprehensive performance results on MRAMG-Wit (Web Dataset).

Framework  Model  Prec.  Rec.  F1  R.L.  B.S.  Rel.  Eff.  Comp.  Pos.  Avg.
Rule-Based GPT-4o 49.50 49.67 49.56 56.23 92.27 43.67 39.50 77.00 50.08 56.39
GPT-4o-mini 42.83 42.83 42.83 48.55 89.52 38.30 34.83 76.90 43.33 51.10
Claude-3.5-Sonnet 50.08 50.50 50.22 53.37 92.53 44.03 39.93 79.20 50.58 56.72
Gemini-1.5-Pro 28.83 29.00 28.89 39.47 84.96 25.20 22.83 75.50 29.08 40.42
DeepSeek-V3 57.67 58.00 57.78 58.71 93.65 51.00 46.13 79.37 58.17 62.28
Qwen2-VL-7B-Instruct 51.67 51.83 51.72 53.23 91.14 45.97 41.53 74.97 52.25 57.15
Qwen2-VL-72B-Instruct 40.83 41.00 40.89 46.80 88.20 36.17 32.73 73.73 41.58 49.10
InternVL-2.5-8B 37.25 37.33 37.28 42.09 86.57 32.43 29.20 72.10 37.42 45.74
InternVL-2.5-78B 43.25 43.50 43.33 47.52 88.58 37.53 34.20 76.20 43.42 50.84
Llama-3.1-8B-Instruct 24.07 25.50 24.46 26.50 80.51 21.97 20.47 59.40 25.92 34.31
Llama-3.3-70B-Instruct 53.58 53.83 53.67 56.50 92.42 46.97 42.43 78.47 54.25 59.12
MLLM-Based GPT-4o 83.50 84.00 83.67 54.84 93.32 74.67 68.13 81.50 84.33 78.66
GPT-4o-mini 64.61 86.83 71.27 47.62 92.48 74.60 69.60 74.27 67.75 72.11
Claude-3.5-Sonnet 93.83 96.17 94.61 40.00 91.73 86.07 79.03 82.20 95.67 84.37
Gemini-1.5-Pro 94.11 96.17 94.78 50.84 91.56 83.67 75.40 78.80 95.14 84.50
Qwen2-VL-7B-Instruct 22.92 34.67 25.90 35.14 83.90 29.07 26.90 57.40 27.36 38.14
Qwen2-VL-72B-Instruct 60.92 65.17 62.19 49.95 92.34 57.53 53.20 78.37 62.62 64.70
InternVL-2.5-8B 44.71 68.17 51.33 41.24 89.07 59.07 55.53 67.10 56.34 59.17
InternVL-2.5-78B 77.15 82.17 78.75 44.01 91.63 72.87 66.67 80.13 80.71 74.90
LLM-Based GPT-4o 73.75 73.83 73.78 52.80 93.02 66.13 60.03 82.70 74.42 72.27
GPT-4o-mini 61.39 91.33 70.54 42.85 91.80 78.90 72.63 76.80 63.03 72.14
Claude-3.5-Sonnet 91.53 94.83 92.61 44.24 92.58 84.60 77.57 82.37 92.11 83.60
Gemini-1.5-Pro 96.08 96.67 96.28 53.93 92.45 84.73 77.40 80.20 96.42 86.02
DeepSeek-V3 93.81 96.83 94.78 43.64 92.48 86.43 79.23 82.10 94.75 84.89
Llama-3.1-8B-Instruct 32.75 40.50 34.87 32.51 82.06 37.87 35.70 54.77 36.14 43.02
Llama-3.3-70B-Instruct 86.58 96.00 89.09 44.83 92.87 81.93 75.33 78.90 88.15 81.52

Comprehensive performance results on MRAMG-Wiki (Web Dataset).

Framework  Model  Prec.  Rec.  F1  R.L.  B.S.  Rel.  Eff.  Comp.  Pos.  Avg.
Rule-Based GPT-4o 53.00 53.00 53.00 54.62 95.15 46.60 42.56 82.24 53.00 59.24
GPT-4o-mini 49.60 49.60 49.60 53.39 94.87 42.52 39.12 82.04 49.60 56.70
Claude-3.5-Sonnet 37.80 37.80 37.80 49.32 94.06 32.60 30.00 82.88 37.80 48.90
Gemini-1.5-Pro 41.20 41.20 41.20 47.18 92.46 35.76 32.64 80.44 41.20 50.36
DeepSeek-V3 56.20 56.40 56.27 53.33 95.28 49.36 44.80 83.00 56.20 61.20
Qwen2-VL-7B-Instruct 53.50 53.60 53.53 48.15 93.12 46.04 41.60 76.08 53.50 57.68
Qwen2-VL-72B-Instruct 51.50 51.60 51.53 48.08 92.81 44.76 40.76 77.72 51.50 56.70
InternVL-2.5-8B 50.00 50.20 50.07 48.06 93.32 43.64 40.08 78.20 50.20 55.97
InternVL-2.5-78B 54.00 54.20 54.07 51.42 94.61 46.40 42.60 81.44 54.10 59.20
Llama-3.1-8B-Instruct 21.60 21.80 21.67 27.74 84.65 18.76 17.28 59.96 22.20 32.85
Llama-3.3-70B-Instruct 53.70 53.80 53.73 53.02 94.91 46.76 43.00 80.80 53.70 59.27
MLLM-Based GPT-4o 71.30 71.60 71.40 53.34 95.70 63.32 58.28 83.32 71.40 71.07
GPT-4o-mini 49.83 81.40 58.56 49.99 95.51 70.36 64.60 74.00 51.32 66.17
Claude-3.5-Sonnet 91.90 94.20 92.67 44.42 94.41 83.68 76.00 82.36 92.50 83.57
Gemini-1.5-Pro 92.10 93.80 92.67 50.05 94.34 82.08 74.60 79.76 92.20 83.51
Qwen2-VL-7B-Instruct 24.22 31.60 26.28 32.02 87.45 26.76 25.20 56.24 26.63 37.38
Qwen2-VL-72B-Instruct 53.64 59.60 55.29 47.93 94.63 54.28 49.32 79.92 54.19 60.98
InternVL-2.5-8B 46.92 72.40 53.97 44.69 93.12 61.76 57.64 71.08 53.79 61.71
InternVL-2.5-78B 67.12 72.40 68.66 45.43 94.85 64.32 58.96 81.24 69.33 69.15
LLM-Based GPT-4o 81.40 81.60 81.47 51.53 95.66 72.28 65.76 83.72 81.40 77.20
GPT-4o-mini 44.47 86.80 56.05 47.98 95.20 72.40 67.04 73.68 45.02 65.40
Claude-3.5-Sonnet 93.70 94.80 94.07 45.78 94.63 83.68 76.60 82.48 93.80 84.39
Gemini-1.5-Pro 95.90 96.00 95.93 50.92 94.84 81.76 75.20 79.72 96.10 85.15
DeepSeek-V3 90.03 95.80 91.90 45.71 95.18 84.40 77.00 82.16 90.13 83.59
Llama-3.1-8B-Instruct 23.50 28.00 24.79 35.66 85.16 23.04 21.68 51.16 23.90 35.21
Llama-3.3-70B-Instruct 70.61 94.40 76.35 47.86 95.47 78.16 71.84 76.96 71.46 75.90

Comprehensive performance results on MRAMG-Web (Web Dataset).

Framework  Model  Prec.  Rec.  F1  R.L.  B.S.  Rel.  Eff.  Comp.  Pos.  Avg.
Rule-Based GPT-4o 32.47 16.93 22.11 39.17 90.56 29.47 27.81 73.87 32.80 40.58
GPT-4o-mini 26.89 14.27 18.46 34.88 89.84 24.53 23.44 72.72 27.40 36.94
Claude-3.5-Sonnet 52.27 29.20 36.89 49.74 93.69 47.12 44.75 80.27 53.07 54.11
Gemini-1.5-Pro 26.60 15.00 18.87 28.75 85.91 24.80 23.55 70.56 27.60 35.74
DeepSeek-V3 53.27 31.00 38.40 50.29 93.71 48.83 46.13 78.96 54.53 55.01
Qwen2-VL-7B-Instruct 16.69 8.67 11.33 33.36 90.12 15.04 14.13 64.43 16.96 30.08
Qwen2-VL-72B-Instruct 18.87 10.47 13.27 29.15 86.36 17.20 16.56 66.53 19.07 30.83
InternVL-2.5-8B 12.80 6.67 8.71 23.42 84.23 12.03 11.47 62.56 13.20 26.12
InternVL-2.5-78B 25.09 14.13 17.77 36.30 90.46 22.99 21.49 69.31 25.56 35.90
Llama-3.1-8B-Instruct 25.20 15.73 18.61 24.90 83.01 25.41 23.95 56.56 28.97 33.59
Llama-3.3-70B-Instruct 41.80 24.00 29.93 44.60 91.86 38.13 36.11 74.77 43.13 47.15
MLLM-Based GPT-4o 89.78 83.80 85.47 52.09 95.14 94.27 90.08 91.25 93.74 86.18
GPT-4o-mini 87.71 88.60 87.82 53.13 95.66 93.49 89.44 90.03 91.49 86.37
Claude-3.5-Sonnet 88.50 91.33 89.45 50.48 94.89 95.68 92.88 93.20 92.96 87.71
Gemini-1.5-Pro 83.51 83.73 82.91 37.06 91.10 94.05 90.05 90.43 87.01 82.21
Qwen2-VL-7B-Instruct 30.85 31.73 29.53 37.55 90.45 36.83 34.56 67.01 34.95 43.72
Qwen2-VL-72B-Instruct 62.64 57.60 58.82 42.56 91.67 67.25 64.11 82.59 65.44 65.85
InternVL-2.5-8B 62.98 59.67 59.98 46.92 93.31 70.59 67.12 78.45 69.95 67.66
InternVL-2.5-78B 79.78 74.13 75.77 52.47 95.29 81.65 78.48 89.20 83.28 78.89
LLM-Based GPT-4o 86.18 78.73 81.15 54.87 95.96 86.21 82.37 89.52 87.02 82.45
GPT-4o-mini 92.86 93.40 92.95 53.50 95.82 93.20 89.28 89.95 94.59 88.39
Claude-3.5-Sonnet 92.40 92.47 92.16 54.27 95.51 94.48 91.07 91.68 94.23 88.70
Gemini-1.5-Pro 90.16 90.13 89.82 45.64 93.38 94.13 90.16 90.75 91.38 86.17
DeepSeek-V3 94.52 94.27 94.20 56.25 96.10 94.27 90.11 90.80 95.93 89.61
Llama-3.1-8B-Instruct 29.34 26.27 26.31 33.70 81.16 32.08 30.48 51.81 32.38 38.17
Llama-3.3-70B-Instruct 66.83 95.80 75.47 47.98 94.79 92.03 88.03 88.93 69.34 79.91

Comprehensive performance results on MRAMG-Arxiv (Academic Dataset).

Framework  Model  Prec.  Rec.  F1  R.L.  B.S.  Rel.  Eff.  Comp.  Pos.  Avg.
Rule-Based GPT-4o 55.42 63.04 57.70 44.96 94.67 69.10 67.30 84.20 75.75 68.02
GPT-4o-mini 51.71 59.29 53.80 44.21 94.36 67.50 64.80 85.20 73.75 66.07
Claude-3.5-Sonnet 55.17 62.79 57.37 42.78 94.09 69.10 66.10 84.20 75.75 67.48
Gemini-1.5-Pro 52.43 56.29 53.10 42.18 93.85 64.20 61.80 83.20 70.28 64.15
DeepSeek-V3 56.12 67.29 59.34 45.74 94.90 74.00 70.30 84.30 78.46 70.05
Qwen2-VL-7B-Instruct 49.17 52.17 49.57 39.32 92.09 60.70 58.90 78.90 67.08 60.88
Qwen2-VL-72B-Instruct 45.42 48.71 45.68 39.86 92.39 60.30 58.10 79.60 65.42 59.50
InternVL-2.5-8B 39.20 48.29 41.51 40.36 91.42 61.60 59.20 76.70 65.17 58.16
InternVL-2.5-78B 52.21 62.00 55.28 43.66 94.51 71.00 68.70 85.40 75.38 67.57
Llama-3.1-8B-Instruct 21.50 23.08 21.90 26.61 85.92 26.20 25.00 58.70 29.00 35.32
Llama-3.3-70B-Instruct 53.00 58.17 53.97 44.14 94.39 65.80 63.50 83.60 73.42 65.55
MLLM-Based GPT-4o 60.39 74.29 64.23 44.25 95.15 89.40 86.20 87.50 90.39 76.87
GPT-4o-mini 36.17 74.79 46.78 42.48 95.08 83.60 80.80 83.20 74.66 68.62
Claude-3.5-Sonnet 47.12 83.50 57.68 40.60 94.65 89.30 86.70 87.60 86.38 74.84
Gemini-1.5-Pro 58.13 80.25 64.74 41.84 94.30 85.10 82.40 85.90 83.61 75.14
Qwen2-VL-7B-Instruct 1.63 4.00 2.18 33.01 84.62 5.20 5.10 49.80 4.46 21.11
Qwen2-VL-72B-Instruct 31.99 44.87 35.22 40.54 93.53 57.90 56.60 84.20 55.16 55.56
InternVL-2.5-8B 12.22 27.87 15.78 31.99 83.72 30.40 29.50 58.10 28.49 35.34
InternVL-2.5-78B 36.62 55.00 41.77 37.99 94.47 68.10 66.20 84.80 64.11 61.01
LLM-Based GPT-4o 65.28 76.54 68.54 44.13 95.23 86.00 82.70 88.90 84.84 76.91
GPT-4o-mini 37.69 83.33 49.90 41.23 95.01 85.90 82.60 84.50 69.07 69.91
Claude-3.5-Sonnet 62.17 88.00 70.16 41.04 94.37 90.90 88.60 89.60 88.17 79.22
Gemini-1.5-Pro 59.85 78.63 65.22 42.41 94.32 84.60 82.20 87.60 80.15 75.00
DeepSeek-V3 46.57 81.13 56.69 39.48 94.70 90.30 86.40 87.50 70.01 72.53
Llama-3.1-8B-Instruct 1.50 2.00 1.67 25.78 80.61 3.30 3.00 43.40 4.00 18.36
Llama-3.3-70B-Instruct 38.78 84.88 48.56 37.83 95.01 85.50 81.80 83.40 64.59 68.93

Comprehensive performance results on MRAMG-Recipe (Lifestyle Dataset).

Framework  Model  Prec.  Rec.  F1  R.L.  B.S.  Ord.  Rel.  Eff.  Comp.  Pos.  Avg.
Rule-Based GPT-4o 48.79 66.11 52.76 51.80 92.10 45.30 77.80 74.64 79.19 78.04 66.65
GPT-4o-mini 51.10 63.88 53.43 49.49 91.14 45.58 75.42 72.42 79.40 76.91 65.88
Claude-3.5-Sonnet 52.15 62.95 53.48 47.13 92.08 44.94 75.36 72.53 79.84 77.21 65.77
Gemini-1.5-Pro 50.61 51.46 47.23 40.71 87.97 39.61 71.09 68.31 78.40 73.08 60.85
DeepSeek-V3 26.13 59.00 33.36 50.51 92.48 22.96 74.58 71.92 73.36 64.49 56.88
Qwen2-VL-7B-Instruct 45.55 63.81 48.46 50.79 91.85 41.36 77.99 74.92 78.06 78.36 65.11
Qwen2-VL-72B-Instruct 31.20 50.10 34.40 46.33 89.91 24.99 73.41 70.50 72.61 71.18 56.46
InternVL-2.5-8B 29.37 52.39 32.92 42.87 90.19 23.14 73.58 70.92 72.53 71.38 55.93
InternVL-2.5-78B 20.90 70.86 29.26 51.20 92.37 17.82 75.20 72.77 74.43 54.05 55.89
Llama-3.1-8B-Instruct 27.59 37.70 25.17 25.89 81.02 18.83 64.42 61.52 65.64 61.73 46.95
Llama-3.3-70B-Instruct 29.56 51.38 34.55 51.56 93.19 24.57 74.31 71.50 72.64 69.57 57.28
MLLM-Based GPT-4o 45.20 46.49 42.25 45.74 92.72 33.70 77.31 74.64 81.65 78.01 61.77
GPT-4o-mini 30.31 50.26 33.86 40.16 91.81 22.67 77.97 75.49 77.52 71.23 57.13
Claude-3.5-Sonnet 30.04 54.21 35.01 34.54 90.90 22.18 80.56 78.18 79.75 74.75 58.01
Gemini-1.5-Pro 39.01 59.50 43.50 43.43 89.89 32.49 81.94 79.22 81.64 70.42 62.10
Qwen2-VL-7B-Instruct 9.06 15.17 9.48 34.47 84.65 4.44 18.81 18.08 55.62 17.17 26.69
Qwen2-VL-72B-Instruct 19.19 26.47 19.70 43.26 91.35 12.27 43.25 41.57 74.52 39.73 41.13
InternVL-2.5-8B 23.01 39.81 23.89 33.22 89.42 15.34 67.19 64.96 74.45 63.44 49.47
InternVL-2.5-78B 21.72 30.07 21.22 36.60 91.13 13.87 56.60 54.66 75.79 52.99 45.46
LLM-Based GPT-4o 49.70 65.03 51.91 44.75 92.42 43.59 82.58 79.38 81.02 81.88 67.23
GPT-4o-mini 45.59 39.32 39.61 47.56 92.91 32.04 51.78 49.82 83.47 54.86 53.70
Claude-3.5-Sonnet 62.24 67.73 61.48 38.65 91.49 53.23 81.15 78.30 84.96 83.87 70.31
Gemini-1.5-Pro 64.87 71.43 64.43 47.01 90.70 56.89 82.39 79.16 83.55 80.69 72.11
DeepSeek-V3 47.53 70.82 51.92 39.83 91.84 40.90 84.38 81.46 82.97 77.92 66.96
Llama-3.1-8B-Instruct 11.56 12.69 10.89 24.61 75.21 6.70 17.71 17.04 41.86 18.32 23.66
Llama-3.3-70B-Instruct 36.87 72.52 44.31 38.38 91.99 31.00 81.84 79.19 80.84 71.99 62.89

Comprehensive performance results on MRAMG-Manual (Lifestyle Dataset).

Framework  Model  Prec.  Rec.  F1  R.L.  B.S.  Ord.  Rel.  Eff.  Comp.  Pos.  Avg.
Rule-Based GPT-4o 36.45 47.97 38.32 50.82 91.51 32.10 75.79 73.44 79.08 71.66 59.71
GPT-4o-mini 37.29 47.18 38.22 50.40 91.05 32.83 73.28 71.79 78.87 69.70 59.06
Claude-3.5-Sonnet 39.27 50.38 41.13 48.17 91.69 32.82 73.18 71.23 77.44 71.43 59.67
Gemini-1.5-Pro 40.54 45.17 39.84 46.69 90.40 33.01 73.13 70.67 76.56 73.51 58.95
DeepSeek-V3 32.75 48.83 36.28 51.57 92.05 31.29 76.92 75.08 79.23 69.42 59.34
Qwen2-VL-7B-Instruct 33.18 43.48 34.32 46.68 89.32 27.61 71.79 69.90 75.54 71.14 56.30
Qwen2-VL-72B-Instruct 35.58 44.42 35.38 46.30 89.73 28.86 70.72 68.31 74.82 69.46 56.36
InternVL-2.5-8B 29.53 45.93 32.06 42.30 89.64 24.17 72.10 69.23 74.15 70.72 54.98
InternVL-2.5-78B 32.96 48.63 36.00 48.26 91.10 29.72 75.74 73.44 78.10 71.66 58.56
Llama-3.1-8B-Instruct 32.07 27.50 26.58 30.90 82.93 15.10 50.87 49.44 62.00 55.84 43.32
Llama-3.3-70B-Instruct 34.53 44.35 35.60 49.50 91.22 30.26 73.13 71.03 75.74 69.26 57.46
MLLM-Based GPT-4o 35.07 33.78 32.44 44.68 91.16 24.50 75.49 73.28 79.59 73.38 56.34
GPT-4o-mini 23.43 32.24 25.16 43.60 91.05 17.33 72.92 71.13 75.23 62.22 51.43
Claude-3.5-Sonnet 25.17 39.24 28.47 40.32 91.02 19.94 80.51 78.10 80.41 75.12 55.83
Gemini-1.5-Pro 36.01 44.68 37.14 48.87 90.99 28.76 76.62 74.62 79.79 66.32 58.38
Qwen2-VL-7B-Instruct 13.32 15.05 13.48 41.07 86.02 3.09 13.38 12.82 57.74 10.46 26.65
Qwen2-VL-72B-Instruct 22.13 24.92 21.62 44.36 90.34 12.95 49.08 47.13 73.44 41.23 42.72
InternVL-2.5-8B 17.23 26.63 18.65 39.71 89.33 9.34 47.38 46.26 71.23 39.90 40.57
InternVL-2.5-78B 19.70 23.19 19.37 42.90 91.01 11.36 55.95 55.18 73.28 45.90 43.78
LLM-Based GPT-4o 34.02 46.48 36.78 45.99 91.46 35.80 77.59 75.64 78.05 71.65 59.35
GPT-4o-mini 36.94 31.87 32.64 45.77 91.35 25.46 55.33 54.05 81.79 55.56 51.08
Claude-3.5-Sonnet 45.21 44.59 43.20 42.68 91.64 40.39 75.08 72.67 82.62 74.73 61.28
Gemini-1.5-Pro 46.23 49.69 45.43 50.21 91.58 39.87 76.62 74.36 80.36 73.40 62.77
DeepSeek-V3 34.71 47.89 37.82 43.80 91.38 36.81 81.08 78.77 80.67 71.65 60.46
Llama-3.1-8B-Instruct 12.65 13.12 12.38 22.27 76.31 3.03 10.56 10.46 35.59 10.06 20.64
Llama-3.3-70B-Instruct 25.74 50.15 31.26 39.80 91.31 28.03 76.72 74.36 75.95 62.56 55.59

Comprehensive performance results on MRAMG-Bench.

Framework  Model | Web Data: Prec. Rec. F1 R.L. B.S. Rel. Eff. Comp. Pos. Avg. | Academic Data: Prec. Rec. F1 R.L. B.S. Rel. Eff. Comp. Pos. Avg. | Lifestyle Data: Prec. Rec. F1 R.L. B.S. Ord. Rel. Eff. Comp. Pos. Avg.
Rule-Based GPT-4o 43.54 37.30 39.36 48.88 92.35 38.70 35.59 77.15 43.86 50.75 55.42 63.04 57.70 44.96 94.67 69.10 67.30 84.20 75.75 68.02 47.04 63.54 50.71 51.66 92.01 43.54 77.51 74.47 79.17 77.13 65.68
GPT-4o-mini 38.20 33.08 34.78 44.32 91.10 33.86 31.37 76.59 38.57 46.87 51.71 59.29 53.80 44.21 94.36 67.50 64.80 85.20 73.75 66.07 49.14 61.52 51.27 49.62 91.13 43.87 75.11 72.33 79.32 75.89 64.92
Claude-3.5-Sonnet 47.65 38.43 41.46 50.81 93.42 42.19 39.20 80.63 48.14 53.55 55.17 62.79 57.37 42.78 94.09 69.10 66.10 84.20 75.75 67.48 50.32 61.17 51.73 47.27 92.02 43.32 75.05 72.34 79.50 76.39 64.91
Gemini-1.5-Pro 31.27 26.62 28.15 37.21 87.37 27.89 25.77 74.83 31.76 41.21 52.43 56.29 53.10 42.18 93.85 64.20 61.80 83.20 70.28 64.15 49.18 50.57 46.18 41.56 88.32 38.73 71.38 68.64 78.14 73.14 60.58
DeepSeek-V3 55.49 46.62 49.51 53.84 94.12 49.68 45.77 80.18 56.16 59.04 56.12 67.29 59.34 45.74 94.90 74.00 70.30 84.30 78.46 70.05 27.07 57.56 33.77 50.66 92.42 24.07 74.92 72.36 74.19 65.19 57.22
Qwen2-VL-7B-Instruct 37.98 34.81 35.84 43.80 91.26 33.45 30.44 70.99 38.28 46.32 49.17 52.17 49.57 39.32 92.09 60.70 58.90 78.90 67.08 60.88 43.79 60.93 46.46 50.21 91.49 39.52 77.11 74.20 77.70 77.33 63.87
Qwen2-VL-72B-Instruct 34.81 31.49 32.57 39.99 88.70 30.80 28.35 71.89 35.14 43.75 45.42 48.71 45.68 39.86 92.39 60.30 58.10 79.60 65.42 59.50 31.82 49.30 34.54 46.33 89.88 25.51 73.03 70.19 72.92 70.94 56.45
InternVL-2.5-8B 30.78 28.38 29.15 36.13 87.45 27.19 24.95 69.88 31.05 40.55 39.20 48.29 41.51 40.36 91.42 61.60 59.20 76.70 65.17 58.16 29.39 51.47 32.80 42.79 90.11 23.28 73.37 70.68 72.76 71.29 55.79
InternVL-2.5-78B 38.79 34.49 35.87 44.03 90.97 34.03 31.32 74.82 39.06 47.04 52.21 62.00 55.28 43.66 94.51 71.00 68.70 85.40 75.38 67.57 22.61 67.71 30.21 50.79 92.19 19.41 75.28 72.87 74.95 56.55 56.26
Llama-3.1-8B-Instruct 23.86 20.54 21.34 26.19 82.64 22.50 21.02 58.40 26.15 33.63 21.50 23.08 21.90 26.61 85.92 26.20 25.00 58.70 29.00 35.32 28.23 36.25 25.37 26.60 81.29 18.33 62.49 59.80 65.12 60.90 46.44
Llama-3.3-70B-Instruct 48.84 41.73 44.06 50.73 92.87 43.33 40.02 77.60 49.59 54.31 53.00 58.17 53.97 44.14 94.39 65.80 63.50 83.60 73.42 65.55 30.27 50.38 34.70 51.27 92.91 25.33 74.14 71.43 73.08 69.53 57.30
MLLM-Based GPT-4o 82.75 80.57 81.08 53.32 94.70 79.55 74.37 85.95 84.65 79.66 60.39 74.29 64.23 44.25 95.15 89.40 86.20 87.50 90.39 76.87 43.77 44.68 40.86 45.59 92.50 32.47 77.05 74.45 81.36 77.35 61.01
GPT-4o-mini 69.98 86.08 74.55 50.49 94.59 81.11 76.29 80.58 72.93 76.29 36.17 74.79 46.78 42.48 95.08 83.60 80.80 83.20 74.66 68.62 29.34 47.71 32.62 40.65 91.71 21.95 77.26 74.87 77.19 69.95 56.32
Claude-3.5-Sonnet 91.15 93.68 91.99 45.44 93.73 89.32 83.83 86.70 93.71 85.51 47.12 83.50 57.68 40.60 94.65 89.30 86.70 87.60 86.38 74.84 29.35 52.09 34.08 35.36 90.92 21.88 80.55 78.17 79.85 74.80 57.71
Gemini-1.5-Pro 89.27 90.49 89.39 45.04 92.13 87.45 81.12 83.77 91.05 83.30 58.13 80.25 64.74 41.84 94.30 85.10 82.40 85.90 83.61 75.14 38.59 57.40 42.60 44.20 90.05 31.99 81.19 78.57 81.37 69.84 61.58
Qwen2-VL-7B-Instruct 26.48 32.65 27.47 35.27 87.51 31.59 29.55 60.98 30.24 40.19 1.63 4.00 2.18 33.01 84.62 5.20 5.10 49.80 4.46 21.11 9.66 15.15 10.05 35.41 84.84 4.26 18.04 17.34 55.92 16.22 26.69
Qwen2-VL-72B-Instruct 59.65 60.59 58.96 46.41 92.69 60.59 56.57 80.50 61.49 64.16 31.99 44.87 35.22 40.54 93.53 57.90 56.60 84.20 55.16 55.56 19.61 26.25 19.98 43.42 91.21 12.36 44.07 42.36 74.36 39.94 41.36
InternVL-2.5-8B 52.71 65.86 55.55 44.48 91.89 64.46 60.80 72.78 61.17 63.30 12.22 27.87 15.78 31.99 83.72 30.40 29.50 58.10 28.49 35.34 22.19 37.94 23.14 34.14 89.41 14.54 64.39 62.31 73.99 60.11 48.22
InternVL-2.5-78B 75.50 76.27 74.82 47.82 93.98 74.12 69.37 84.11 78.68 74.96 36.62 55.00 41.77 37.99 94.47 68.10 66.20 84.80 64.11 61.01 21.43 29.09 20.96 37.49 91.11 13.53 56.51 54.73 75.43 51.99 45.23
LLM-Based GPT-4o 80.86 77.92 78.85 53.30 94.92 75.94 70.64 85.74 81.41 77.73 65.28 76.54 68.54 44.13 95.23 86.00 82.70 88.90 84.84 76.91 47.48 62.40 49.76 44.92 92.29 42.55 81.88 78.85 80.60 80.43 66.11
GPT-4o-mini 69.58 90.95 75.71 48.56 94.35 82.94 77.87 81.29 70.96 76.91 37.69 83.33 49.90 41.23 95.01 85.90 82.60 84.50 69.07 69.91 44.36 38.27 38.62 47.30 92.69 31.16 52.28 50.42 83.24 54.96 53.33
Claude-3.5-Sonnet 92.47 93.86 92.82 48.72 94.32 88.36 82.78 86.17 93.43 85.88 62.17 88.00 70.16 41.04 94.37 90.90 88.60 89.60 88.17 79.22 59.83 64.45 58.89 39.22 91.51 51.51 80.29 77.50 84.63 82.58 69.04
Gemini-1.5-Pro 93.63 93.84 93.57 49.75 93.48 87.74 81.98 84.35 94.29 85.85 59.85 78.63 65.22 42.41 94.32 84.60 82.20 87.60 80.15 75.00 62.23 68.34 61.74 47.47 90.83 54.62 81.57 78.48 83.10 79.66 70.80
DeepSeek-V3 93.08 95.51 93.76 49.31 94.68 89.06 83.04 85.64 93.98 86.45 46.57 81.13 56.69 39.48 94.70 90.30 86.40 87.50 70.01 72.53 45.71 67.56 49.92 40.39 91.78 40.35 83.91 81.08 82.64 77.03 66.04
Llama-3.1-8B-Instruct 28.87 31.35 28.68 33.84 82.53 31.51 29.79 52.59 31.31 38.94 1.50 2.00 1.67 25.78 80.61 3.30 3.00 43.40 4.00 18.36 11.71 12.75 11.10 24.28 75.36 6.21 16.70 16.11 40.97 17.15 23.23
Llama-3.3-70B-Instruct 74.26 95.49 80.12 46.93 94.35 85.01 79.54 82.44 76.01 79.35 38.78 84.88 48.56 37.83 95.01 85.50 81.80 83.40 64.59 68.93 35.29 69.34 42.46 38.58 91.89 30.60 81.11 78.51 80.15 70.66 61.86