Evaluating the Performance of ChatGPT in the Prescribing Safety Assessment: Implications for Artificial Intelligence-Assisted Prescribing DOI Open Access

D.R. Bull,

Dide Okaygoun

Cureus, Journal Year: 2024, Volume and Issue: unknown

Published: Nov. 4, 2024

Objective With the rapid advancement of artificial intelligence (AI) technologies, models like Chat Generative Pre-Trained Transformer (ChatGPT) are increasingly being evaluated for their potential applications in healthcare. The Prescribing Safety Assessment (PSA) is a standardised test taken by junior physicians in the UK to evaluate prescribing competence. This study aims to assess ChatGPT's ability to pass the PSA and its performance across different exam sections. Methodology ChatGPT (version GPT-4) was tested on four official practice papers, each containing 30 questions, with three independent trials per paper, and answers were scored using the official mark schemes. Performance was measured by calculating overall percentage scores and comparing them with the pass marks provided for each paper. Subsection performance was also analysed to identify strengths and weaknesses. Results ChatGPT achieved mean scores of 257/300 (85.67%), 236/300 (78.67%), 199/300 (66.33%), and 233/300 (77.67%) across the four papers, consistently surpassing the pass marks where available. It performed well in sections requiring factual recall, such as "Adverse Drug Reactions", scoring 63/72 (87.50%), and "Communicating Information" (88.89%). However, it struggled with "Data Interpretation", scoring 32/72 (44.44%), showing variability and indicating limitations in handling more complex clinical reasoning tasks. Conclusion While ChatGPT demonstrated strong performance, passing the practice papers and excelling in factual knowledge, its weaknesses in data interpretation highlight current gaps in AI's ability to fully replicate human clinical judgement. ChatGPT shows promise in supporting safe prescribing, particularly in areas prone to error, such as drug interactions and communicating correct information. However, due to its limitations in complex reasoning tasks, it is not yet ready to replace human prescribers and should instead serve as a supplemental tool in practice.
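As a rough illustration of the scoring arithmetic described above (marks out of 300 averaged over three trials and expressed as a percentage), the following Python sketch is a minimal, hypothetical example; the trial-level marks and the pass-mark threshold are placeholders, not values taken from the study.

```python
# Hypothetical sketch of the PSA-style scoring described above: average marks
# across three trials per paper, convert to a percentage of 300, and compare
# with an assumed pass mark. All numbers here are illustrative placeholders.

MAX_MARK = 300
ASSUMED_PASS_MARK_PCT = 63.0  # placeholder; real PSA pass marks vary by paper

def paper_summary(trial_marks: list[int]) -> dict:
    """Return mean mark, percentage score, and pass/fail for one practice paper."""
    mean_mark = sum(trial_marks) / len(trial_marks)
    pct = 100 * mean_mark / MAX_MARK
    return {
        "mean_mark": round(mean_mark, 1),
        "percentage": round(pct, 2),
        "passed": pct >= ASSUMED_PASS_MARK_PCT,
    }

if __name__ == "__main__":
    # Three hypothetical trial scores for one paper (out of 300).
    print(paper_summary([255, 258, 258]))
    # e.g. {'mean_mark': 257.0, 'percentage': 85.67, 'passed': True}
```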

Language: English

From GPT-3.5 to GPT-4.o: A Leap in AI’s Medical Exam Performance DOI Creative Commons
Markus Kipp

Information, Journal Year: 2024, Volume and Issue: 15(9), P. 543 - 543

Published: Sept. 5, 2024

ChatGPT is a large language model trained on increasingly large datasets to perform diverse language-based tasks. It is capable of answering multiple-choice questions, such as those posed by medical examinations, and has been generating considerable attention in both academic and non-academic domains in recent months. In this study, we aimed to assess GPT's performance on anatomical questions retrieved from medical licensing examinations in Germany. Two different versions were compared. GPT-3.5 demonstrated moderate accuracy, correctly answering 60–64% of the questions from the autumn 2022 and spring 2021 exams. In contrast, GPT-4.o showed significant improvement, achieving 93% accuracy on one exam and 100% on the other. When tested on 30 unique questions not available online, it maintained a 96% accuracy rate. Furthermore, GPT-4.o consistently outperformed students across six state exams, with a statistically significant mean score of 95.54% compared with the students' 72.15%. The study demonstrates that GPT-4.o outperforms both its predecessor, GPT-3.5, and a cohort of students, indicating its potential as a powerful tool in medical education and assessment. This improvement highlights the rapid evolution of LLMs and suggests that AI could play an important role in supporting and enhancing medical training, potentially offering supplementary resources for professionals. However, further research is needed to explore the limitations and practical applications of such systems in real-world practice.
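To make the accuracy figures above concrete, here is a minimal Python sketch, assuming hypothetical per-exam scores and a simple one-sample t-test against the reported student mean; this is an illustration, not the statistical method used in the paper.

```python
# Illustrative sketch only (not the authors' analysis): derive accuracy from
# correct/total counts and test whether hypothetical per-exam GPT-4.o scores
# differ from the reported student mean using a one-sample t-test.
from scipy import stats

def accuracy_pct(correct: int, total: int) -> float:
    """Percentage of correctly answered questions."""
    return 100 * correct / total

# Hypothetical per-exam scores (%) for the model across six state exams.
model_exam_scores = [93.0, 100.0, 96.0, 95.0, 97.0, 92.0]
student_mean = 72.15  # mean student score reported in the abstract (%)

t_stat, p_value = stats.ttest_1samp(model_exam_scores, popmean=student_mean)
print(f"example accuracy: {accuracy_pct(29, 30):.1f}%")   # hypothetical counts
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```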

Language: English

Citations

4

Which current chatbot is more competent in urological theoretical knowledge? A comparative analysis by the European board of urology in-service assessment DOI Creative Commons
Mehmet Fatih Şahin, Çağrı Doğan, Erdem Can Topkaç

et al.

World Journal of Urology, Journal Year: 2025, Volume and Issue: 43(1)

Published: Feb. 11, 2025

Abstract Introduction The European Board of Urology (EBU) In-Service Assessment (ISA) test evaluates urologists' theoretical knowledge and data interpretation skills. Artificial Intelligence (AI) chatbots are being used widely by physicians for theoretical information. This research compares the performance of five existing chatbots on the ISA questions. Materials and methods GPT-4o, Copilot Pro, Gemini Advanced, Claude 3.5, and Sonar Huge solved 596 questions from 6 exams held between 2017 and 2022. The questions were divided into two categories, those that measure theoretical knowledge and those that require data analysis, and the chatbots' performance in each category and in each exam was compared. Results Overall, all chatbots except Claude 3.5 passed the examinations with an overall score above the 60% threshold. Copilot Pro scored best and Claude 3.5 worst, and the difference between their scores was significant (71.6% vs. 56.2%, p = 0.001). When the 444 knowledge questions and the 152 data-analysis questions were compared, Copilot Pro offered the greatest amount of correct theoretical information, whereas Claude 3.5 provided the least (72.1% vs. 57.4%); the same was also true for analytical skills (70.4% vs. 52.6%, p = 0.019). Conclusions Four out of the five chatbots passed the exams, achieving scores exceeding 60%, while only one did not pass the EBU examination. Copilot Pro performed best in the ISA examinations and Claude 3.5 worst. Chatbots performed worse on data-analysis questions than on knowledge questions. Thus, although they are successful in terms of theoretical knowledge, their competence in analyzing data is questionable.
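As an illustration of how two chatbots' overall percentages on the same 596 questions might be compared, the sketch below uses a chi-square test on hypothetical correct/incorrect counts chosen only to mirror the reported 71.6% and 56.2% figures; it does not reproduce the authors' actual analysis.

```python
# Hypothetical sketch: compare two chatbots' correct/incorrect counts on the
# same 596 questions with a chi-square test of independence. The counts below
# are placeholders chosen only to approximate the reported overall scores.
from scipy.stats import chi2_contingency

TOTAL_QUESTIONS = 596
copilot_correct = 427   # ~71.6% of 596 (placeholder)
claude_correct = 335    # ~56.2% of 596 (placeholder)

table = [
    [copilot_correct, TOTAL_QUESTIONS - copilot_correct],
    [claude_correct, TOTAL_QUESTIONS - claude_correct],
]
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, p = {p:.4f}")
```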

Language: English

Citations

0

Evaluating ChatGPT’s role in urological counseling and clinical decision support DOI
Kamil Malshy, Jathin Bandari, V.V. Kucherov

et al.

World Journal of Urology, Journal Year: 2025, Volume and Issue: 43(1)

Published: Feb. 13, 2025

Language: English

Citations

0

Comparative analysis of the effectiveness of Microsoft Copilot artificial intelligence chatbot and Google Search in answering patient inquiries about infertility: evaluating readability, understandability, and actionability DOI Creative Commons
Tuncer Bahçeci, Erman Ceyhan, Burak Elmaağaç

et al.

International Journal of Impotence Research, Journal Year: 2025, Volume and Issue: unknown

Published: April 22, 2025

Abstract Failure to achieve spontaneous pregnancy within 12 months despite unprotected intercourse is called infertility. The rapid development of digital health data has led more people to search for healthcare-related topics on the Internet, and many infertile individuals and couples use the Internet as their primary source of information on infertility diagnosis and treatment. It is therefore important to assess the readability, understandability, and actionability of the information these sources provide to patients, and there is a gap in the literature addressing this aspect. This study aims to compare the responses generated by Microsoft Copilot (MC), an AI chatbot, and Google Search (GS), an internet search engine, to infertility-related queries. A Google Trends analysis was conducted prospectively to identify the top 20 infertility-related queries in February 2024. These queries were then entered into GS and MC in May 2024, and the answers from both platforms were recorded for further analysis. Outputs were assessed using automated readability tools and readability scores were calculated, while the understandability and actionability of the answers were evaluated with the Patient Education Materials Assessment Tool for Printable Materials (PEMAT-P). A significant difference between the platforms was found for the Automated Readability Index (ARI) and Flesch-Kincaid Grade Level (FKGL) (p = 0.044), while no significant differences were observed for the Flesch Reading Ease, Gunning Fog Index, Simplified Measure of Gobbledygook (SMOG), and Coleman-Liau scores. Both platforms' outputs were above the 8th-grade level, indicating advanced reading levels. According to PEMAT-P, MC outperformed GS in terms of understandability (68.65 ± 11.99 vs. 54.50 ± 15.09, p = 0.001) and actionability (p < 0.001). MC provides understandable and actionable answers to infertility-related queries and might have great potential in patient education.
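For context, the two metrics on which a significant difference was reported, ARI and FKGL, are simple formulas over character, word, syllable, and sentence counts. The sketch below applies their standard coefficients with a rough vowel-group syllable heuristic, so the resulting scores are approximations.

```python
# Standard ARI and FKGL formulas applied to a text sample. The syllable
# counter is a rough vowel-group heuristic, so scores are approximate.
import re

def counts(text: str):
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    chars = sum(len(w) for w in words)
    syllables = sum(max(1, len(re.findall(r"[aeiouy]+", w.lower()))) for w in words)
    return chars, len(words), syllables, sentences

def ari(text: str) -> float:
    chars, words, _, sentences = counts(text)
    return 4.71 * chars / words + 0.5 * words / sentences - 21.43

def fkgl(text: str) -> float:
    _, words, syllables, sentences = counts(text)
    return 0.39 * words / sentences + 11.8 * syllables / words - 15.59

sample = "Infertility is the failure to achieve pregnancy after twelve months of regular unprotected intercourse."
print(f"ARI  = {ari(sample):.1f}")
print(f"FKGL = {fkgl(sample):.1f}")
```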

Language: English

Citations

0

Letter to the editor for the article “Performance of ChatGPT-3.5 and ChatGPT-4 on the European Board of Urology (EBU) exams: a comparative analysis” DOI

Yuxuan Song,

Tao Xu

World Journal of Urology, Journal Year: 2024, Volume and Issue: 42(1)

Published: Oct. 3, 2024

Language: English

Citations

0

Editorial Comment on Can artificial intelligence pass the Japanese urology board examinations? DOI Open Access
Atsushi Okada

International Journal of Urology, Journal Year: 2024, Volume and Issue: unknown

Published: Oct. 15, 2024

The study titled "Can artificial intelligence pass the Japanese urology board examinations?" by Okada et al. provides an insightful and timely exploration of the potential for large language models (LLMs) such as GPT-4 and Claude3 to succeed in highly specialized medical examinations.1 As artificial intelligence (AI) continues to advance, its applications in medical education and certification processes are expanding, making this work particularly relevant. The research demonstrates that the top-performing model among the tested LLMs achieved passing scores under three of the four prompt conditions, and it effectively highlights the strengths of LLMs in handling complex, domain-specific questions within the context of the Japanese Urology Board Examinations. The ability to surpass a 60% threshold under multiple scenarios indicates that LLMs are nearing a level of proficiency that could complement medical professionals in educational and evaluative settings. For instance, Nakao et al. evaluated GPT-4V's performance on the National Medical Licensing Examination, highlighting its ability to interpret complex visual data, a crucial component of diagnostics.2 Despite these promising results, the literature also underscores the limitations of LLMs. Hager et al. noted that while LLMs perform well in examination settings, they often struggle with clinical decision-making and adherence to guidelines, which are essential in real-world practice.3 Agerri et al. further identified issues of outdated knowledge and hallucinations in AI-generated content, posing risks when relying on AI in clinical contexts.4 Moreover, Schoch et al. conducted a comparative analysis of ChatGPT-3.5 on European board examinations, revealing inconsistencies across different test settings.5 Okada et al.'s study1 is a valuable contribution to the ongoing discourse on the integration of AI in medical education and certification. By demonstrating what current LLMs can achieve, it paves the way for the development of tools that can support and enhance the expertise of medical professionals.

Language: English

Citations

0
