A Standardized Validation Framework for Clinically Actionable Healthcare Machine Learning with Knee Osteoarthritis Grading as a Case Study

Daniel Nasef, Demarcus Nasef, Marc E. Sher et al.

Algorithms, 2025, 18(6), p. 343

Published: June 5, 2025

Background: High in-domain accuracy in healthcare machine learning (ML) models does not guarantee reliable clinical performance, especially when training and validation protocols are insufficiently robust. This paper presents a standardized framework for validating ML models intended for classifying medical conditions, emphasizing the need for clinically relevant evaluation metrics and external validation. Methods: We apply this framework to a case study of knee osteoarthritis grading, demonstrating how overfitting, data leakage, and inadequate validation can lead to deceptively high performance that fails to translate into clinical reliability. In addition to conventional metrics, we introduce composite measures that better capture real-world utility. Results: Our findings show that models with strong in-domain performance may underperform on external datasets, and that the composite measures provide a more nuanced assessment of clinical applicability. Conclusions: Standardized validation protocols, together with clinically oriented evaluation, are essential for developing models that are both statistically robust and clinically actionable across a range of medical classification tasks.
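The abstract does not specify how its composite measures are defined, but the general idea of blending class-conditional accuracies into a single clinically weighted score can be sketched as follows. This is a minimal illustration only, assuming a hypothetical weighted combination of sensitivity and specificity; the weights and the `composite_utility` function are illustrative, not the authors' actual metric.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = disease)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def composite_utility(y_true, y_pred, w_sens=0.6, w_spec=0.4):
    """Hypothetical composite score: weight sensitivity above specificity,
    reflecting the higher clinical cost of missing a positive case."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return w_sens * sensitivity + w_spec * specificity

# In-domain vs. external evaluation: the same predictions scored on
# a held-out internal set and on an external dataset may diverge.
internal = composite_utility([1, 1, 1, 0, 0], [1, 1, 1, 0, 0])  # perfect: 1.0
external = composite_utility([1, 1, 1, 0, 0], [1, 1, 0, 0, 1])  # degraded
print(internal, external)
```

Reporting such a score alongside conventional accuracy makes the in-domain/external gap visible in a single clinically interpretable number.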

Language: English


Citations: 0