The Effectiveness of Local Fine-Tuned LLMs: Assessment of the Japanese National Examination for Pharmacists
Hiroto Asano, Daisuke Takaya, Asuka Hatabu

et al.

Research Square (Research Square), Journal Year: 2025, Volume and Issue: unknown

Published: April 15, 2025

Abstract: Large Language Models (LLMs) offer great potential for applications in healthcare and pharmaceutical fields. While cloud-based implementations are commonly used, they present challenges related to privacy and cost. This study examined the performance of locally executable LLMs on the Japanese National Examination for Pharmacists (JNEP). Additionally, we explored the feasibility of creating specialized pharmacy models through fine-tuning with Low-Rank Adaptation (LoRA). Text-based questions from the 97th to 109th JNEP were utilized, comprising 2,421 training questions and 165 test questions. Four distinct models were evaluated, including Microsoft phi-4 and the DeepSeek R1 Distill Qwen series. Baseline performance was initially assessed, followed by fine-tuning with LoRA on the training dataset. Models were evaluated based on the accuracy scores achieved on the test questions. In the baseline evaluation against the JNEP, accuracy ranged from 55.15% to 76.36%. Notably, the CyberAgent 32B model exceeded the passing threshold (approximately 61%). Following fine-tuning, the models exhibited an accuracy increase, reaching 60.61–66.06%. The results showed that locally executable LLMs are capable of handling knowledge tasks comparable to those in the national pharmacist examination. Moreover, we found that techniques like LoRA can significantly enhance model performance, demonstrating that robust AI can be designed specifically for pharmacological applications. These findings contribute to understanding how to implement secure, high-performing LLM solutions tailored to pharmaceutical use.
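
As a rough illustration of the fine-tuning approach described above, the following is a minimal LoRA sketch using the Hugging Face transformers, peft, and datasets libraries. The base model path, target modules, hyperparameters, and the jnep_train.jsonl data file are placeholders chosen for illustration; they are not taken from the paper.

```python
# Minimal LoRA fine-tuning sketch for exam-style Q&A text, assuming a
# Hugging Face causal LM. All names and hyperparameters are illustrative
# placeholders, not the paper's actual configuration.
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments, DataCollatorForLanguageModeling)
from peft import LoraConfig, get_peft_model
from datasets import load_dataset

base_model = "path/to/local-base-model"  # e.g. one of the locally executable models
tokenizer = AutoTokenizer.from_pretrained(base_model)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # ensure padding is defined
model = AutoModelForCausalLM.from_pretrained(base_model)

# LoRA adds small trainable low-rank matrices to selected projection layers,
# so only a tiny fraction of parameters is updated during fine-tuning.
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # module names depend on the architecture
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# Assume a JSONL file with {"text": "<question + choices + answer>"} records.
dataset = load_dataset("json", data_files="jnep_train.jsonl", split="train")
dataset = dataset.map(lambda b: tokenizer(b["text"], truncation=True, max_length=512),
                      batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="lora-out", per_device_train_batch_size=2,
                           num_train_epochs=3, learning_rate=2e-4),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```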

Language: English

Research on Intelligent Grading of Physics Problems Based on Large Language Models

Yanan Wei, Rui Zhang, Jianwei Zhang

et al.

Education Sciences, Journal Year: 2025, Volume and Issue: 15(2), P. 116 - 116

Published: Jan. 21, 2025

The automation of educational and instructional assessment plays a crucial role in enhancing the quality of teaching and management. In physics education, calculation problems with intricate problem-solving ideas pose challenges to the intelligent grading of tests. This study explores automatic grading through the combination of large language models and prompt engineering, comparing the performance of four prompting strategies (one-shot, few-shot, chain of thought, and tree of thought) within two model frameworks, namely ERNIEBot-4-turbo and GPT-4o. The study finds that thought-based prompting strategies can better assess complex problems (N = 100, ACC ≥ 0.9, kappa > 0.8) and reduce the gap between different models. This research provides valuable insights for intelligent assessments in physics education.
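
To make the comparison concrete, a generic way to score prompting strategies against human grades is sketched below. This is not the authors' code; the prompt templates, model name, and parsing logic are hypothetical, and Cohen's kappa is used here simply as one common agreement measure.

```python
# Generic sketch: compare prompting strategies for automatic grading by
# measuring agreement with human scores. Templates, model and parsing are
# illustrative placeholders, not the study's materials.
from openai import OpenAI
from sklearn.metrics import accuracy_score, cohen_kappa_score

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

STRATEGIES = {
    "one_shot": "Here is one graded example...\nNow grade this answer from 0 to 5: {answer}",
    "chain_of_thought": "Grade step by step, then give a final score from 0 to 5: {answer}",
}

def grade(answer: str, template: str) -> int:
    """Ask the model for an integer score for one student answer."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": template.format(answer=answer)}],
    )
    # Naive parsing: take the last integer token in the reply as the score.
    digits = [int(t) for t in resp.choices[0].message.content.split() if t.isdigit()]
    return digits[-1] if digits else 0

def evaluate(answers, human_scores, template):
    """Return (accuracy, Cohen's kappa) of model scores vs. human scores."""
    model_scores = [grade(a, template) for a in answers]
    return (accuracy_score(human_scores, model_scores),
            cohen_kappa_score(human_scores, model_scores))

# Usage: run each strategy on a small labelled sample and compare.
# for name, tpl in STRATEGIES.items():
#     print(name, evaluate(sample_answers, sample_human_scores, tpl))
```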

Language: English

Citations

1

A Comparative Study on the Question-Answering Proficiency of Artificial Intelligence Models in Bladder-Related Conditions: An Evaluation of Gemini and ChatGPT 4.o
Mustafa Azizoğlu, Sergey Klyuev

Medical Records, Journal Year: 2025, Volume and Issue: 7(1), P. 201 - 205

Published: Jan. 10, 2025

Aim: The rapid evolution of artificial intelligence (AI) has revolutionized medicine, with tools like ChatGPT and Google Gemini enhancing clinical decision-making. ChatGPT's advancements, particularly GPT-4, show promise in diagnostics and education. However, variability in accuracy and limitations in complex scenarios emphasize the need for further evaluation of these models in medical applications. This study aimed to assess the agreement between ChatGPT 4.o and Gemini AI in identifying bladder-related conditions, including neurogenic bladder, vesicoureteral reflux (VUR), and posterior urethral valve (PUV). Material and Method: The study, conducted in October 2024, compared the two AIs' responses on 51 questions about neurogenic bladder, VUR, and PUV. The questions, randomly selected from pediatric surgery and urology materials, were evaluated using accuracy metrics and statistical analysis, highlighting the models' performance and agreement. Results: The models demonstrated similar accuracy across neurogenic bladder, VUR, and PUV questions, with true response rates of 66.7% and 68.6%, respectively, and no statistically significant differences (p>0.05). Combined accuracy across all topics was 67.6%. Strong inter-rater reliability (κ=0.87) highlights their consistency. Conclusion: Gemini demonstrated performance comparable to ChatGPT-4.o on key performance metrics.

Language: English

Citations

0

Enhancing ophthalmology students’ awareness of retinitis pigmentosa: assessing the efficacy of ChatGPT in AI-assisted teaching of rare diseases—a quasi-experimental study
Junwen Zeng, Kexin Sun, Peng Qin

et al.

Frontiers in Medicine, Journal Year: 2025, Volume and Issue: 12

Published: March 18, 2025

Retinitis pigmentosa (RP) is a rare retinal dystrophy often underrepresented in ophthalmology education. Despite advancements in diagnostics and treatments like gene therapy, RP knowledge gaps persist. This study assesses the efficacy of AI-assisted teaching using ChatGPT compared to traditional methods in educating students about RP. A quasi-experimental study was conducted with 142 medical students randomly assigned to a ChatGPT-assisted group and a control group (traditional review materials). Both groups attended a lecture on RP and completed pre- and post-tests. Statistical analyses compared learning outcomes, learning times, and response accuracy. Both groups significantly improved their post-test scores (p < 0.001), but the ChatGPT group required less time (24.29 ± 12.62 vs. 42.54 ± 20.43 min, p < 0.0001). The ChatGPT group also performed better on complex questions regarding advanced treatments, demonstrating AI's potential to deliver accurate and current information efficiently. AI-assisted teaching enhances efficiency and comprehension of rare diseases, and a hybrid educational model combining AI with traditional methods can address knowledge gaps, offering a promising approach for modern ophthalmology education.

Language: English

Citations

0

Is artificial intelligence successful in the Turkish neurology board exam?
Ayse Betul Acar, Ece Yanık, Emine Altin

et al.

Neurological Research, Journal Year: 2025, Volume and Issue: unknown, P. 1 - 4

Published: March 20, 2025

Objectives: OpenAI declared that GPT-4 performed better in academic settings and certain specialty areas. Medical licensing exams assess the clinical competence of doctors. We aimed to investigate, for the first time, how ChatGPT would perform on the Turkish Neurology Proficiency Exam.

Language: English

Citations

0

Comparative Evaluation of Advanced AI Reasoning Models in Korean National Licensing Examination: OpenAI vs DeepSeek (Preprint)
Jin-Gyu Lee, Gyeong Hoon Kim, Jei Keon Chae

et al.

Published: March 27, 2025

UNSTRUCTURED: Artificial intelligence (AI) has advanced in natural language processing and reasoning, with large language models (LLMs) increasingly assessed for medical education and licensing exams. Given the growing use of AI in examinations, evaluating their performance on non-Western, region-specific tests like the Korean Medical Licensing Examination (KMLE) is crucial for assessing real-world applicability. This study compared five LLMs—GPT-4o, o1, o3-mini (OpenAI), DeepSeek-V3, and DeepSeek-R1 (DeepSeek)—on the KMLE. A total of 150 multiple-choice questions from the 2024 KMLE were extracted and categorized into three domains: Local Health & Laws, Preventive Medicine, and Clinical Medicine. Graph-based questions were excluded. Each model completed independent runs via API, and accuracy was scored against the official answers. Statistical differences were analyzed using ANOVA, and consistency was measured with Fleiss' kappa coefficient. o1 achieved the highest overall accuracy (94.3%), excelling in Medicine (97.5%) and Law (81.0%), while the best score in the remaining domain was 92.6%. Despite domain-specific variations, all models surpassed the passing criteria. For consistency, the leading models scored 97.1% and 97.5% (DeepSeek-V3). Performance declined in the Law domain, likely due to legal complexities and limited Korean-language training data. This is the first study to compare OpenAI and DeepSeek models on this exam; the models demonstrated strong performance, ranking within the top 10% of human candidates. While the OpenAI models were the most accurate, the DeepSeek models provided a cost-effective alternative. Future research should optimize LLMs for non-English exams and develop Korea-specific models to improve performance across domains.
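
For reference, the evaluation loop described (repeated runs scored against official answers, with Fleiss' kappa for run-to-run consistency) can be sketched as follows. The answer format and toy data are assumptions for illustration only, not the study's actual pipeline.

```python
# Sketch: score repeated multiple-choice runs and measure their consistency
# with Fleiss' kappa. Answer letters and toy data are placeholders.
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

CHOICES = ["A", "B", "C", "D", "E"]

def accuracy(predictions, official):
    """Fraction of questions answered correctly in one run."""
    return sum(p == o for p, o in zip(predictions, official)) / len(official)

def run_consistency(runs):
    """Fleiss' kappa across repeated runs (rows = questions, raters = runs)."""
    idx = np.array([[CHOICES.index(ans) for ans in run] for run in runs]).T
    table, _ = aggregate_raters(idx, n_cat=len(CHOICES))
    return fleiss_kappa(table)

# Toy usage: 3 runs over 4 questions.
official = ["A", "C", "B", "D"]
runs = [["A", "C", "B", "D"], ["A", "C", "B", "E"], ["A", "C", "B", "D"]]
print([round(accuracy(r, official), 3) for r in runs])  # per-run accuracy
print(round(run_consistency(runs), 3))                  # run-to-run agreement
```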

Language: English

Citations

0

Comparative analysis of a standard (GPT-4o) and reasoning-enhanced (o1 pro) large language model on complex clinical questions from the Japanese orthopaedic board examination
Joe Hasei, Ryuichi Nakahara, Koichi Takeuchi

et al.

Journal of Orthopaedic Science, Journal Year: 2025, Volume and Issue: unknown

Published: April 1, 2025

Language: English

Citations

0

A Brief Review on Benchmarking for Large Language Models Evaluation in Healthcare
Leona Cilar, Hongyu Chen, Aokun Chen

et al.

Wiley Interdisciplinary Reviews Data Mining and Knowledge Discovery, Journal Year: 2025, Volume and Issue: 15(2)

Published: April 9, 2025

ABSTRACT: This paper reviews benchmarking methods for evaluating large language models (LLMs) in healthcare settings. It highlights the importance of rigorous evaluation to ensure LLMs' safety, accuracy, and effectiveness in clinical applications. The review also discusses the challenges of developing standardized benchmarks and metrics tailored to healthcare-specific tasks such as medical text generation, disease diagnosis, and patient management. Ethical considerations, including privacy, data security, and bias, are addressed, underscoring the need for multidisciplinary collaboration to establish robust frameworks that facilitate the reliable and ethical use of LLMs in healthcare. Evaluation of LLMs remains challenging due to the lack of comprehensive datasets. Key concerns include the need for better model explainability, all of which impact overall trustworthiness.

Language: English

Citations

0

Overview of the Lymphoma Information Extraction and Automatic Coding Evaluation Task in CHIP 2024
Hui Zong, Liang Tao, Zuofeng Li

et al.

Communications in computer and information science, Journal Year: 2025, Volume and Issue: unknown, P. 75 - 84

Published: Jan. 1, 2025

Language: English

Citations

0
