MLRC-Bench: Can Language Agents Solve Machine Learning Research Challenges?
We introduce MLRC-Bench, a benchmark designed to quantify how effectively language agents can tackle challenging Machine Learning (ML) research competitions, with a focus on open research problems that demand novel methodologies. Unlike prior work such as AI Scientist, which evaluates the end-to-end agentic pipeline using an LLM-as-a-judge, MLRC-Bench measures the key steps of proposing and implementing novel research methods and evaluates them under a rigorous protocol with objective metrics. Our curated suite of 7 competition tasks reveals significant challenges for LLM agents. Even the best-performing tested agent (gemini-exp-1206 under MLAB) closes only 9.3% of the gap between baseline and top human participant scores. Furthermore, our analysis reveals a misalignment between LLM-judged innovation and actual performance on cutting-edge ML research problems. MLRC-Bench is a dynamic benchmark, designed to grow with new ML competitions and to encourage rigorous, objective evaluation of AI research capabilities. Our leaderboard and code are available at: https://huggingface.co/spaces/launch/MLRC_Bench
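As a rough reading of the 9.3% figure above, the abstract's "gap closed" can be understood as a normalized improvement over the competition baseline; the paper defines the exact scoring protocol, so the formula below is only a sketch of that interpretation, with s_agent, s_baseline, and s_human as assumed symbol names:

\text{gap closed} = \frac{s_{\text{agent}} - s_{\text{baseline}}}{s_{\text{human}} - s_{\text{baseline}}} \times 100\%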
Yunxiang Zhang, Muhammad Khalifa, Shitanshu Bhushan, Grant D Murphy, Lajanugen Logeswaran, Jaekyeom Kim, Moontae Lee, Honglak Lee, Lu Wang
Subject: Computing Technology; Computer Technology
Yunxiang Zhang, Muhammad Khalifa, Shitanshu Bhushan, Grant D Murphy, Lajanugen Logeswaran, Jaekyeom Kim, Moontae Lee, Honglak Lee, Lu Wang. MLRC-Bench: Can Language Agents Solve Machine Learning Research Challenges? [EB/OL]. (2025-04-13) [2025-06-22]. https://arxiv.org/abs/2504.09702