MedHELM and the Next Phase of Open Medical AI Evaluation

Why most medical AI benchmarks fail to reflect real clinical performance, and what comes next. This keynote introduces MedHELM, an open framework designed to evaluate healthcare AI systems on real-world clinical tasks, not just exam-style benchmarks.

In this session, Suhana Bedi (Stanford University) and Miguel Fuentes (Stanford Medicine) explain how MedHELM evolved from an academic research project into community-driven infrastructure for evaluating large language models in healthcare.

The talk highlights a critical gap: most medical AI systems are still evaluated on simplified benchmarks such as USMLE-style questions, while only a small fraction of studies use real patient data. 


⏱️ Key moments:

00:00 Why current medical AI evaluation is failing
01:40 Why USMLE-style benchmarks are not enough
03:30 Introducing MedHELM and its core idea
05:30 Task taxonomy: mapping real clinical workflows
07:30 Benchmarks and leaderboard insights
09:30 No single model wins across all tasks
11:00 From research to community infrastructure (Pacific AI, CHAI)
13:00 Scaling evaluation: continuous benchmarking lifecycle
14:30 How to use MedHELM in real healthcare systems


📌 Keynote | Day 1 | Applied Healthcare AI Summit 2026


About the Summit:
The Applied Healthcare AI Summit brings together practitioners from leading healthcare and technology organisations to share what is working in real-world AI deployment.

Watch more sessions:
https://appliedaisummit.org/


#HealthcareAI #MedicalAI #AIevaluation #LLM #ClinicalAI