
PLoS Computational Biology, Journal Year: 2025, Volume and Issue: 21(2), P. e1012803 - e1012803
Published: Feb. 13, 2025
Accurately labeling large datasets is important for biomedical machine learning yet challenging, while modern data augmentation methods may generate noise in the training data, which can deteriorate model performance. Existing approaches addressing noisy labels typically rely on strict modeling assumptions, specific classification models, and well-curated datasets. To address these issues, we propose a novel reliability-based training-data-cleaning method employing inductive conformal prediction (ICP). This method uses a small set of accurately labeled data and leverages ICP-calculated reliability metrics to selectively correct mislabeled samples and outliers within vast quantities of noisy data. The efficacy is validated across three classification tasks with distinct data modalities: filtering drug-induced-liver-injury (DILI) literature from free-text titles and abstracts, predicting ICU admission of COVID-19 patients through CT radiomics and electronic health records, and subtyping breast cancer using RNA-sequencing data. Varying levels of noisy labels were introduced via label permutation. Our method significantly enhanced downstream performance (paired t-tests, p ≤ 0.05 among 30 random train/test partitions): significant accuracy enhancement in 86 out of 96 DILI experiments (up to an 11.4% increase, from 0.812 to 0.905), AUROC and AUPRC enhancements in all 48 COVID-19 experiments (up to a 23.8% increase from 0.597 to 0.739 in AUROC and a 69.8% increase from 0.183 to 0.311 in AUPRC), and accuracy and macro-average F1-score improvements in 47 RNA-sequencing experiments (up to a 74.6% increase from 0.351 to 0.613 in accuracy and an 89.0% increase from 0.267 to 0.505 in F1-score). The performance improvement can be both statistically and clinically significant for information retrieval, disease diagnosis, and prognosis. Our method offers the potential to substantially boost performance without necessitating an excessive volume of accurately labeled data or the strong distribution assumptions of existing semi-supervised methods.
Language: English
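
The abstract describes using ICP-derived reliability metrics, computed against a small trusted calibration set, to flag and correct mislabeled samples in a larger noisy pool. The Python sketch below is only a rough illustration of that general idea, not the authors' implementation: the choice of RandomForestClassifier, the probability-based nonconformity score, the 30% synthetic noise level, and the 0.05 reliability threshold are all assumptions made for the example.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def icp_p_values(model, X_cal, y_cal, X_query):
    """Return an (n_query, n_classes) array of inductive conformal p-values.

    Nonconformity score: 1 - predicted probability of the candidate class.
    The p-value of class c for a query sample is the smoothed fraction of
    calibration scores at least as nonconforming as the sample's score
    when it is tentatively assigned class c.
    """
    cal_proba = model.predict_proba(X_cal)
    cal_scores = 1.0 - cal_proba[np.arange(len(y_cal)), y_cal]
    query_scores = 1.0 - model.predict_proba(X_query)   # one score per candidate class
    p = (1 + (cal_scores[None, None, :] >= query_scores[:, :, None]).sum(-1)) / (len(cal_scores) + 1)
    return p

# Toy data: a small trusted set and a large pool whose labels are partly permuted.
X, y = make_classification(n_samples=2000, n_informative=8, n_classes=3, random_state=0)
X_trusted, X_pool, y_trusted, y_pool_true = train_test_split(X, y, train_size=400, random_state=0)
y_pool = y_pool_true.copy()
flip = np.random.default_rng(0).random(len(y_pool)) < 0.3          # inject label noise into ~30% of the pool
y_pool[flip] = np.random.default_rng(1).integers(0, 3, flip.sum())

# Split the trusted set into a proper training set and an ICP calibration set.
X_tr, X_cal, y_tr, y_cal = train_test_split(X_trusted, y_trusted, test_size=0.5, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

# Reliability of each observed label = p-value of its assigned class.
p = icp_p_values(model, X_cal, y_cal, X_pool)
reliability = p[np.arange(len(y_pool)), y_pool]
suspect = reliability < 0.05                                        # illustrative threshold
y_cleaned = y_pool.copy()
y_cleaned[suspect] = p[suspect].argmax(axis=1)                      # relabel suspects to the most-conforming class
print(f"flagged {suspect.sum()} samples; corrected labels match the truth for "
      f"{(y_cleaned[flip] == y_pool_true[flip]).mean():.0%} of the noisy samples")

In this sketch the cleaned labels would then be used to retrain the downstream classifier; the paper's experiments evaluate that downstream effect across the three tasks listed in the abstract.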