MedHELM and the Next Phase of Open Medical AI Evaluation

Why most medical AI benchmarks fail to reflect real clinical performance, and what comes next.

This keynote introduces MedHELM, an open framework designed to evaluate healthcare AI systems on real-world clinical tasks, not just exam-style benchmarks. In this session, Suhana Bedi (Stanford University) and Miguel Fuentes (Stanford Medicine) explain how MedHELM evolved from an academic research project into a community-driven infrastructure for evaluating large language models in healthcare. The talk highlights a critical gap: most medical AI systems are still evaluated on simplified benchmarks such as USMLE-style questions, while only a small fraction of studies use real patient data.

⏱️ Key moments:
00:00 Why current medical AI evaluation is failing
01:40 Why USMLE-style benchmarks are not enough
03:30 Introducing MedHELM and its core idea
05:30 Task taxonomy: mapping real clinical workflows
07:30 Benchmarks and leaderboard insights
09:30 No single model wins across all tasks
11:00 From research to community infrastructure (Pacific AI, CHAI)
13:00 Scaling evaluation: continuous benchmarking lifecycle
14:30 How to use MedHELM in real healthcare systems

📌 Keynote | Day 1 | Applied Healthcare AI Summit 2026

About the Summit: Applied Healthcare AI Summit brings together practitioners from leading healthcare and technology organisations to share what is working in real-world AI deployment.

Watch more sessions: https://appliedaisummit.org/

#HealthcareAI #MedicalAI #AIevaluation #LLM #ClinicalAI