Welcome to the MMU-RAG competition!
Registration
To participate in any track, you must register your team by filling out this short form: Register Your Team
-
There’s no limit to the number of team members.
-
Every team must designate one team leader to serve as the main point of contact with the organizers.
MMU-RAG features two exciting tracks:
Track A: Text-to-Text (T2T)
Build a retrieval-augmented system that answers complex, open-ended user queries using textual sources. ⟶ Click here to view T2T track details
Track B: Text-to-Video (T2V)
Develop a system that can ground open-ended queries in retrieved video content, generating coherent video-based responses. ⟶ Click here to view T2V track details
Both tracks offer:
-
A validation set for local testing.
-
A starter codebase to speed up development.
-
API access to ClueWeb-22
-
Support for static and dynamic submission modes.
ClueWeb-22 Search API Access
Base URL:https://clueweb22.us/search
Authentication
All requests must include an API key:
x-api-key: <YOUR_RETRIEVER_API_KEY>
Your API key will be emailed to you after your team registers.
HTTP Request
GET https://clueweb22.us/search
Query Parameters:
Name | Type | Required | Description |
---|---|---|---|
query | string | yes | The search query string |
k | integer | yes | Number of documents to return |
cw22_a | boolean | no | Use ClueWeb22-A instead of default ClueWeb22-B |
Note: The ClueWeb search API uses ClueWeb22-B by default, but it also supports ClueWeb22-A. To use ClueWeb22-A, add the parameter
cw22_a=True
to your query.Examples:
- ClueWeb22-B (default):
https://clueweb22.us/search?query=cmu&k=1
- ClueWeb22-A:
https://clueweb22.us/search?query=cmu&k=1&cw22_a=True
Response Format:
{
"results": [Base64-encoded JSON documents]
}
Each decoded document will contain:
Field | Type | Description |
---|---|---|
text | string | Full text of the document |
url | string | Source URL of the document |
Example Code (Python)
import base64
import json
import requests
RETRIEVER_URL = "https://clueweb22.us/search"
RETRIEVER_API_KEY = "YOUR_API_KEY_HERE"
def query_clueweb(query: str, k: int, use_cw22_a: bool = False):
"""
Query the ClueWeb Search API and return a list of documents.
Each document is a dict with 'text' and 'url' keys.
Args:
query: The search query string
k: Number of documents to return
use_cw22_a: If True, use ClueWeb22-A instead of default ClueWeb22-B
"""
headers = {
'x-api-key': RETRIEVER_API_KEY
}
params = {
"query": query,
"k": k
}
if use_cw22_a:
params["cw22_a"] = True
response = requests.get(RETRIEVER_URL, params=params, headers=headers)
if response.status_code != 200:
raise Exception(f"Error querying ClueWeb: {response.status_code}")
json_data = response.json()
raw_results = json_data.get("results", [])
documents = []
for encoded_doc in raw_results:
# decode the Base64-encoded JSON string
decoded_json = base64.b64decode(encoded_doc).decode("utf-8")
doc = json.loads(decoded_json)
documents.append({
"text": doc.get("text", ""),
"url": doc.get("url", "")
})
return documents
# Usage example
if __name__ == "__main__":
# Search using ClueWeb22-B (default)
docs = query_clueweb("open source search engines", 5)
print("ClueWeb22-B Results:")
for i, d in enumerate(docs, 1):
print(f"Document {i} URL: {d['url']}")
print(f"Excerpt: {d['text'][:200]}…\n")
# Search using ClueWeb22-A
docs_a = query_clueweb("open source search engines", 5, use_cw22_a=True)
print("\nClueWeb22-A Results:")
for i, d in enumerate(docs_a, 1):
print(f"Document {i} URL: {d['url']}")
print(f"Excerpt: {d['text'][:200]}…\n")
Validation Sets
We provide a validation set for each track to help you test your pipeline and debug your submissions.
- The validation data includes input queries and gold reference answers.
- You can use it to verify your system’s outputs and test compatibility with the evaluation format.
- Learn more and download the validation sets here