Interpretable Deep Learning for Diabetic Retinopathy: A Comparative Study of CNN, ViT, and Hybrid Architectures
Weijie Zhang, Veronika Belcheva, Tatiana Ermakova et al.

Computers, 2025, 14(5), p. 187

Published: May 12, 2025

Diabetic retinopathy (DR) is a leading cause of vision impairment worldwide, requiring early detection for effective treatment. Deep learning models have been widely used for automated DR classification, with Convolutional Neural Networks (CNNs) being the most established approach. Recently, Vision Transformers (ViTs) have shown promise, but a direct comparison of their performance and interpretability remains limited. Additionally, hybrid models that combine CNN and transformer-based architectures have not been extensively studied. This work systematically evaluates CNNs (ResNet-50), ViTs (Vision Transformer, SwinV2-Tiny), and hybrid models (Convolutional vision Transformer CvT-13, LeViT-256) on DR classification using publicly available retinal image datasets. The models are assessed based on accuracy and interpretability, applying Grad-CAM and Attention Rollout to analyze decision-making patterns. Results indicate that hybrid models outperform both standalone CNNs and ViTs, achieving a better balance between local feature extraction and global context awareness. The best-performing model (CvT-13) achieved a Quadratic Weighted Kappa (QWK) score of 0.84 and an AUC of 0.93 on the test set. Interpretability analysis shows that CNN-based models focus on fine-grained lesion details, while ViTs exhibit broader, less localized attention. These findings provide valuable insights into optimizing deep learning models in medical imaging, supporting the development of clinically viable AI-driven screening systems.

Language: English
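
The interpretability comparison described in the abstract relies on Grad-CAM for the convolutional models. The sketch below is a minimal, generic Grad-CAM implementation, not the authors' code: it assumes a PyTorch ResNet-50 re-headed for 5 DR severity grades, hooks the last convolutional stage (model.layer4 is an illustrative choice), and the grad_cam helper and the dummy input are hypothetical stand-ins for the paper's unspecified preprocessing pipeline.

```python
import torch
import torch.nn.functional as F
from torchvision import models

# Assumed setup: an untrained ResNet-50 with a 5-class DR grading head.
model = models.resnet50(weights=None)
model.fc = torch.nn.Linear(model.fc.in_features, 5)
model.eval()

activations, gradients = {}, {}

def save_activation(module, inputs, output):
    activations["feat"] = output            # feature maps from the hooked stage

def save_gradient(module, grad_input, grad_output):
    gradients["feat"] = grad_output[0]      # gradients w.r.t. those feature maps

# Hook the last convolutional stage (illustrative choice for ResNet-50).
model.layer4.register_forward_hook(save_activation)
model.layer4.register_full_backward_hook(save_gradient)

def grad_cam(image):
    """image: (1, 3, H, W) preprocessed fundus image -> (H, W) heatmap in [0, 1]."""
    logits = model(image)
    pred = logits.argmax(dim=1).item()
    model.zero_grad()
    logits[0, pred].backward()              # backprop the predicted grade's score
    # Grad-CAM: channel weights are the spatially averaged gradients.
    weights = gradients["feat"].mean(dim=(2, 3), keepdim=True)
    cam = F.relu((weights * activations["feat"]).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=image.shape[-2:], mode="bilinear", align_corners=False)
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
    return cam[0, 0]

# Example with a random tensor standing in for a real fundus image.
heatmap = grad_cam(torch.randn(1, 3, 224, 224))
```

Overlaying such a heatmap on the input image is what allows the lesion-level comparison the abstract describes; for the transformer models the analogous map would come from Attention Rollout rather than gradients.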

Citations: 0