MMU-RAG


NeurIPS 2025 Competition

Welcome to the official website of MMU-RAG: the Massive Multi-Modal User-Centric Retrieval-Augmented Generation Benchmark. This competition invites researchers and developers to build RAG systems that perform well under real-world conditions.


2025 MMU-RAGent Competition: Official Winners Announcement

We are excited to announce the results of the 2025 MMU-RAGent Competition, which brought together teams from around the world to tackle challenging problems in multimodal Retrieval-Augmented Generation (RAG). This year's competition featured two tracks: (1) Text-to-Text and (2) Text-to-Video. Both tracks were evaluated through a combination of automatic metrics, LLM-as-a-judge, human annotation, and our real-time RAG-Arena live evaluation.

Across both tracks, participants demonstrated creative system designs, robust retrieval pipelines, and thoughtful approaches to grounding generative models in multimodal evidence.


Participation Overview

This year's competition received:

  • 8 full-system submissions to the Text-to-Text track
  • 2 additional validation-only submissions
  • 1 full-system submission to the Text-to-Video track

To support development, we released development, validation, and held-out test sets totalling nearly 1,000 queries. Human evaluation played a central role in our assessment: across both tracks, we collected 2,315 annotations from 1,197 annotators, ensuring broad and reliable feedback on relevance, factuality, and utility.


Text-to-Text Track Winners

Final rankings were determined using a robustness-aware aggregation of normalized automatic metrics and human Likert evaluations, with LLM-as-a-judge analysis informing, but not directly contributing to, the final scores.
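
To make the aggregation step concrete, the sketch below shows one plausible way to combine normalized automatic metrics and mean human Likert ratings into a single ranking. The metric names, the min-max normalization, and the unweighted average are illustrative assumptions, not the competition's exact scoring recipe.

# Hypothetical aggregation sketch: min-max normalize each score across
# systems, then rank by the unweighted mean of the normalized components.
# Metric names and weights are placeholders, not the official procedure.
from statistics import mean

systems = {
    "system_a": {"auto": {"rouge_l": 0.41, "bert_score": 0.88}, "likert": 4.1},
    "system_b": {"auto": {"rouge_l": 0.37, "bert_score": 0.91}, "likert": 4.4},
}

def min_max_normalize(values):
    """Rescale raw scores to [0, 1]; a constant column maps to 0.5."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.5] * len(values)
    return [(v - lo) / (hi - lo) for v in values]

def aggregate(systems):
    names = list(systems)
    metric_keys = sorted({k for s in systems.values() for k in s["auto"]})
    normalized = {name: [] for name in names}
    # Normalize each automatic metric across systems, then the Likert ratings.
    for key in metric_keys:
        col = min_max_normalize([systems[name]["auto"][key] for name in names])
        for name, v in zip(names, col):
            normalized[name].append(v)
    likert_col = min_max_normalize([systems[name]["likert"] for name in names])
    for name, v in zip(names, likert_col):
        normalized[name].append(v)
    # Final score: unweighted mean of the normalized components (an assumption).
    return sorted(((mean(vals), name) for name, vals in normalized.items()), reverse=True)

for score, name in aggregate(systems):
    print(f"{name}: {score:.3f}")

The "robustness-aware" component of the official aggregation is deliberately omitted here; this toy example only illustrates the normalization and averaging idea.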

Winners were recognized in two evaluation modes:

  • Static Evaluation: Teams distinguished themselves through strong semantic alignment, factual grounding, and robustness across automatic and human Likert evaluation modalities.
  • Dynamic Evaluation (RAG-Arena): In real-time interactive comparisons, these winners were preferred most frequently by users, highlighting the importance of evaluating not just correctness, but also clarity, usefulness, and overall preference.

πŸ† Open Source πŸ† Closed Source

πŸ₯‡ Best Static Evaluation

Efficient-Deep-Research


πŸ₯‡ Best Dynamic Evaluation

RMIT-ADMS IR

πŸ₯‡ Best Static Evaluation

Cattalyya


πŸ₯‡ Best Dynamic Evaluation

Nightfeats


Text-to-Video Track Winner

The Text-to-Video track received one full submission, DeepVideoResearcher. We evaluated the system against a strong baseline (Nova-Reel) using both VBench automatic metrics and human utility assessments.

Although the baseline scored higher on visual-quality metrics, human evaluators preferred DeepVideoResearcher for relevance, precision, and overall utility to the query. This highlights the gap between traditional visual metrics and user-oriented evaluation of RAG-generated videos.


πŸ† Best Human Likert Rating

Deepvideoresearcher

Outperformed the baseline text-to-video model in Human Likert Rating


Key Insights From This Year's Evaluation

  • Human and LLM-as-a-judge ratings align strongly (correlations ≈ 0.93), validating the use of LLMs for diagnostic evaluation while reinforcing that human ratings should remain the final authority.
  • Live evaluation matters: Arena preferences revealed qualitative distinctions not captured by static metrics.
  • Multimodal video evaluation remains challenging: Existing automatic metrics emphasize visual fidelity, while human evaluators prioritize task relevance and procedural clarity.

We extend our warmest congratulations to the winning teams, and our sincere appreciation to every participant who contributed to this year's competition. Your work pushes the boundaries of retrieval-augmented generation and helps shape the future of multimodal reasoning systems.


Organizers

  • Luo Qi Chan, DSO National Laboratories / Carnegie Mellon University
  • Tevin Wang, Carnegie Mellon University
  • Shuting Wang, Renmin University of China / Carnegie Mellon University
  • Zhihan Zhang, Carnegie Mellon University
  • Alfredo Gomez, Carnegie Mellon University
  • Prahaladh Chandrahasan, Carnegie Mellon University
  • Lan Yan, Carnegie Mellon University
  • Andy Tang, Carnegie Mellon University
  • Zimeng (Chris) Qiu, Amazon AGI
  • Morteza Ziyadi, Amazon AGI
  • Sherry Wu, Carnegie Mellon University
  • Mona Diab, Carnegie Mellon University
  • Akari Asai, University of Washington
  • Chenyan Xiong, Carnegie Mellon University