Improving classification on imbalanced genomic data via KDE–based synthetic sampling

Edoardo Taccaliti,

Jesús S. Aguilar–Ruiz

Research Square (Research Square), Journal Year: 2025, Volume and Issue: unknown

Published: May 8, 2025

Class imbalance poses a serious challenge in biomedical machine learning, particularly in genomics, where datasets are characterized by extremely high dimensionality and very limited sample sizes. In such settings, standard classifiers tend to favor the majority class, leading to biased predictions, an especially problematic issue in clinical diagnostics, where rare conditions must not be overlooked. In this study, we introduce a Kernel Density Estimation (KDE)-based oversampling approach to rebalance imbalanced genomic datasets by generating synthetic minority class samples. Unlike conventional methods such as SMOTE, KDE estimates the global probability distribution of the minority class and resamples accordingly, avoiding local interpolation pitfalls. We evaluate our method on 15 real-world genomic datasets using three classifiers (Naïve Bayes, Decision Trees, and Random Forests) and compare it to SMOTE and to baseline training. Experimental results demonstrate that KDE oversampling consistently improves classification performance, especially in metrics robust to imbalance, such as the AUC of the IMCP curve. Notably, KDE achieves superior performance in tree-based models while dramatically simplifying the sampling process. This approach offers a statistically grounded and effective solution for balancing genomic datasets, with strong potential for improving fairness and accuracy in high-stakes medical decision-making.
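
The resampling idea described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the kernel or the bandwidth rule, so the Gaussian kernel, the per-feature normal-reference bandwidth (h = 1.06 · σ · n^(-1/5)), and the function names `normal_reference_bandwidth` and `kde_oversample` are all assumptions made for illustration. Drawing from a Gaussian KDE amounts to picking a stored minority sample uniformly at random and adding kernel noise to it.

```python
import math
import random


def normal_reference_bandwidth(column):
    """Assumed bandwidth rule: h = 1.06 * std * n^(-1/5) for one feature."""
    n = len(column)
    if n < 2:
        raise ValueError("need at least two samples to estimate a bandwidth")
    mean = sum(column) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in column) / (n - 1))
    return 1.06 * std * n ** (-1 / 5)


def kde_oversample(minority, n_new, rng):
    """Draw n_new synthetic rows from a Gaussian KDE fitted to `minority`.

    Assumption: a product Gaussian kernel with one bandwidth per feature.
    Sampling = choose a real minority row, then perturb each feature
    with Gaussian noise whose standard deviation is that bandwidth.
    """
    columns = list(zip(*minority))
    bandwidths = [normal_reference_bandwidth(col) for col in columns]
    synthetic = []
    for _ in range(n_new):
        center = rng.choice(minority)
        synthetic.append([rng.gauss(x, h) for x, h in zip(center, bandwidths)])
    return synthetic


# Toy data (made up): 4 minority rows, 3 features, 10 majority rows assumed.
minority = [
    [0.1, 1.2, 3.3],
    [0.3, 0.9, 3.1],
    [0.2, 1.1, 2.9],
    [0.4, 1.0, 3.0],
]
n_majority = 10
synthetic = kde_oversample(minority, n_majority - len(minority), random.Random(0))
balanced_minority = minority + synthetic  # now as large as the majority class
```

Unlike SMOTE, which interpolates between a sample and one of its nearest neighbours, this draws each synthetic row from a density estimated over the whole minority class. With thousands of features and few samples, as in genomic data, a per-feature bandwidth is only one of several possible choices.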

Language: English

Unsupervised translation of vascular masks to NIR-II fluorescence images using Attention-Guided generative adversarial networks

Creative Commons

Fang Lü,

Huaixuan Sheng,

Huizhu Li

et al.

Scientific Reports, Journal Year: 2025, Volume and Issue: 15(1)

Published: Feb. 25, 2025

Second near-infrared window (NIR-II) fluorescence imaging is a crucial technology for investigating the structure and functionality of blood vessels. However, challenges arise from privacy concerns and the significant effort needed for data annotation, complicating the acquisition of vascular datasets. To tackle these issues, methods based on deep learning data synthesis have demonstrated promise in generating high-quality synthetic images. In this paper, we propose an unsupervised generative adversarial network (GAN) approach for translating vascular masks into realistic NIR-II fluorescence images. Leveraging an attention mechanism integrated into the loss function, our model focuses on essential features during the generation process, resulting in NIR-II images without the need for supervision. Our method significantly outperforms eight baseline techniques in both visual quality and quantitative metrics, demonstrating its potential to address the challenge of limited datasets in medical imaging. This work not only enhances NIR-II imaging applications but also facilitates downstream tasks by providing abundant, high-fidelity synthetic data.

Language: English

Citations

0
