
Published: Oct. 22, 2024
Language: English
Cureus, Journal Year: 2023, Volume and Issue: unknown
Published: Dec. 12, 2023
Purpose: This study aims to evaluate the performance of three large language models (LLMs), Generative Pre-trained Transformer (GPT)-3.5, GPT-4, and Google Bard, on the 2023 Japanese National Dentist Examination (JNDE) and to assess their potential clinical applications in Japan. Methods: A total of 185 questions from the JNDE were used and categorized by question type and category. McNemar's test compared correct response rates between two LLMs, while Fisher's exact test compared the LLMs within each question category. Results: The overall correct response rates were 73.5% for GPT-4, 66.5% for Bard, and 51.9% for GPT-3.5; GPT-4 showed a significantly higher rate than Bard. In the category of essential questions, GPT-4 achieved 80.5%, surpassing the passing criterion of 80%. In contrast, both Bard and GPT-3.5 fell short of this benchmark, with Bard attaining 77.6% and GPT-3.5 only 52.5%; the score of GPT-4 was significantly higher than those of the other models (p<0.01). For general questions, the rates were 71.2% for GPT-4, 58.5% for Bard, and 52.5% for GPT-3.5, with GPT-4 outperforming the other models. For professional dental questions, the rates were 51.6%, 45.3%, and 35.9%, respectively, and the differences among the models were not statistically significant. All models demonstrated lower accuracy on professional dentistry questions than on the other question types. Conclusions: GPT-4 achieved the highest score on the JNDE, followed by Bard and GPT-3.5; however, only GPT-4 surpassed the passing criterion for essential questions. To further understand the application of LLMs worldwide, more research on national examinations across different languages is required.
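The paired and per-category comparisons described in this abstract can be reproduced with standard tests. Below is a minimal sketch, assuming two boolean vectors of per-question correctness for two models; the question counts, accuracies, and category split are illustrative placeholders, not data from the paper.

```python
# Sketch of the statistical comparison outlined in the abstract (illustrative data only).
import numpy as np
from scipy.stats import fisher_exact
from statsmodels.stats.contingency_tables import mcnemar

rng = np.random.default_rng(0)
n_questions = 185                           # JNDE question count reported in the abstract
model_a = rng.random(n_questions) < 0.74    # hypothetical per-question correctness, model A
model_b = rng.random(n_questions) < 0.66    # hypothetical per-question correctness, model B

# McNemar's test uses the paired 2x2 agreement/disagreement table for the two models.
table = np.array([
    [np.sum(model_a & model_b),  np.sum(model_a & ~model_b)],
    [np.sum(~model_a & model_b), np.sum(~model_a & ~model_b)],
])
print("McNemar p-value:", mcnemar(table, exact=True).pvalue)

# Fisher's exact test compares correct/incorrect counts within one question category.
category = np.zeros(n_questions, dtype=bool)
category[:100] = True                       # pretend the first 100 questions form one category
counts = np.array([
    [np.sum(model_a & category), np.sum(~model_a & category)],
    [np.sum(model_b & category), np.sum(~model_b & category)],
])
oddsratio, p = fisher_exact(counts)
print("Fisher exact p-value (one category):", p)
```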
Language: English
Citations: 21
Surgical Endoscopy, Journal Year: 2024, Volume and Issue: 38(5), P. 2320 - 2330
Published: April 17, 2024
Language: English
Citations: 8
Artificial Intelligence Review, Journal Year: 2024, Volume and Issue: 57(9)
Published: Aug. 6, 2024
Abstract: Large language models (LLMs) have the intrinsic potential to acquire medical knowledge. Several studies assessing LLMs on medical examinations have been published. However, there is no reported evidence on tests related to robot-assisted surgery. The aims of this study were to perform the first systematic review on the topic and to establish whether ChatGPT, GPT-4, and Bard can pass the Fundamentals of Robotic Surgery (FRS) didactic test. A literature search was performed in PubMed, Web of Science, Scopus, and arXiv following the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) approach. A total of 45 studies were analyzed. GPT-4 passed several national qualifying examinations with questions in English, Chinese, and Japanese using zero-shot and few-shot learning. Med-PaLM 2 obtained similar scores on the United States Medical Licensing Examination with more refined prompt engineering techniques. Five different 2023 releases of ChatGPT, one of GPT-4, and one of Bard were tested on the FRS, with seven attempts for each release. The passing score is 79.5%. ChatGPT achieved mean scores of 64.6%, 65.6%, 75.0%, 78.9%, and 72.7%, respectively, from the first to the fifth release on the FRS, vs 91.5% for GPT-4 and 79.5% for Bard. GPT-4 outperformed all corresponding ChatGPT releases with a statistically significant difference (p < 0.001), but not Bard (p = 0.002). Our findings agree with the other studies included in this review. We highlighted the challenges and opportunities for LLMs to transform the education of healthcare professionals at the different stages of learning, by assisting teachers in the preparation of teaching contents and trainees in the acquisition of knowledge, up to becoming an assessment framework for learners.
Language: English
Citations: 8
Surgery, Journal Year: 2024, Volume and Issue: 176(2), P. 241 - 245
Published: May 19, 2024
Language: English
Citations: 7
In Vivo, Journal Year: 2024, Volume and Issue: 38(3), P. 1009 - 1015
Published: Jan. 1, 2024
Background/Aim: The integration of AI and natural language processing technologies, such as ChatGPT, into surgical practice has shown promising potential in enhancing various aspects of abdominopelvic procedures. This systematic review aims to comprehensively evaluate the current state of research on the applications and impact of artificial intelligence (AI) and ChatGPT in surgery, summarizing the existing literature to provide a comprehensive overview of the diverse applications, effectiveness, challenges, and future directions of these innovative technologies. Materials and Methods: A search of major electronic databases, including PubMed, Google Scholar, the Cochrane Library, and Web of Science, was conducted from October to November 2023 to identify relevant studies. Inclusion criteria encompassed studies that investigated the utilization of ChatGPT in surgical settings, including, but not limited to, preoperative planning, intraoperative decision-making, postoperative care, and patient communication. Results: Fourteen studies met the inclusion criteria and were included in this review. The majority analysed ChatGPT's data output and decision making, while two reported the general perception among residents of the tool as applied to clinical practice. Most reported high accuracy in the decision-making process, however with a non-negligible number of errors. Conclusion: This review contributes to the understanding of the role of AI and ChatGPT in surgery and provides insight into their applications; the synthesis of the available evidence will inform future directions, guidelines, and the development of technologies to optimize the benefits for patient care within the surgical domain.
Language: English
Citations: 5
Diagnostics, Journal Year: 2023, Volume and Issue: 14(1), P. 90 - 90
Published: Dec. 30, 2023
This study evaluates the diagnostic accuracy and clinical utility of two artificial intelligence (AI) techniques: Kakao Brain Artificial Neural Network for Chest X-ray Reading (KARA-CXR), an assistive technology developed using large-scale AI and large language models (LLMs), and ChatGPT, a well-known LLM. The study was conducted to validate the performance of these technologies in chest X-ray reading and to explore their potential applications in the medical imaging diagnosis domain. The methodology consisted of randomly selecting 2,000 images from a single institution's patient database; radiologists then evaluated the readings provided by KARA-CXR and ChatGPT. Five qualitative factors were used to evaluate the readings generated by each model: accuracy, false findings, location inaccuracies, count inaccuracies, and hallucinations. Statistical analysis showed that KARA-CXR achieved significantly higher accuracy compared with ChatGPT. In the 'Acceptable' accuracy category, KARA-CXR was rated at 70.50% and 68.00% by the two observers, while ChatGPT was rated at 40.50% and 47.00%. Interobserver agreement was moderate for both systems, with kappa values of 0.74 for KARA-CXR and 0.73 for GPT-4. For 'False Findings', KARA-CXR scored 68.50% and ChatGPT 37.00%, with high interobserver agreements of 0.96 for KARA-CXR and 0.97 for GPT-4. In the 'Location Inaccuracy' and 'Hallucinations' categories, KARA-CXR outperformed ChatGPT by significant margins. KARA-CXR demonstrated a non-hallucination rate of 75%, which is higher than ChatGPT's 38%, with interobserver agreement of 0.91 for KARA-CXR and 0.85 for ChatGPT in the hallucination category. In conclusion, this study demonstrates the potential of AI and large language model technologies in medical imaging diagnostics. It also shows that, in the chest X-ray reading domain, ChatGPT has relatively lower accuracy than KARA-CXR.
Language: English
Citations: 13
Diagnostics, Journal Year: 2024, Volume and Issue: 14(2), P. 171 - 171
Published: Jan. 12, 2024
Our study aimed to assess the accuracy and limitations of ChatGPT in the domain of MRI, focusing on evaluating ChatGPT's performance in answering simple knowledge questions and specialized multiple-choice questions related to MRI. A two-step approach was used to evaluate ChatGPT. In the first step, 50 simple MRI-related questions were asked, and the answers were categorized as correct, partially correct, or incorrect by independent researchers. In the second step, 75 multiple-choice questions covering various MRI topics were posed and similarly categorized. The study utilized Cohen's kappa coefficient for assessing interobserver agreement. ChatGPT demonstrated high accuracy on straightforward questions, with over 85% classified as correct. However, its performance varied significantly across the multiple-choice questions, with accuracy rates ranging from 40% to 66.7%, depending on the topic. This indicated a notable gap in its ability to handle more complex questions requiring deeper understanding and context. In conclusion, this study critically evaluates ChatGPT in addressing questions related to Magnetic Resonance Imaging (MRI), highlighting its potential in the healthcare sector, particularly in radiology. The findings demonstrate that ChatGPT, while proficient in responding to straightforward questions, exhibits variability in its ability to accurately answer complex questions that require more profound understanding. This discrepancy underscores the nuanced role AI can play in medical education and decision-making, necessitating a balanced application.
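The interobserver-agreement step mentioned in this abstract is a standard Cohen's kappa computation. A minimal sketch follows, assuming two researchers' categorizations of the same answers are available as lists; the labels and ratings below are invented for illustration.

```python
# Sketch of the Cohen's kappa interobserver agreement check (illustrative ratings only).
from sklearn.metrics import cohen_kappa_score

LABELS = ["correct", "partially correct", "incorrect"]

rater_1 = ["correct", "correct", "partially correct", "incorrect", "correct", "incorrect"]
rater_2 = ["correct", "partially correct", "partially correct", "incorrect", "correct", "correct"]

# Cohen's kappa corrects raw percent agreement for the agreement expected by chance.
kappa = cohen_kappa_score(rater_1, rater_2, labels=LABELS)
print(f"Cohen's kappa: {kappa:.2f}")
```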
Language: English
Citations: 4
Canadian Journal of Ophthalmology, Journal Year: 2025, Volume and Issue: unknown
Published: Feb. 1, 2025
To evaluate the performance of large language models (LLMs), specifically Microsoft Copilot, GPT-4 (GPT-4o and GPT-4o mini), and Google Gemini (Gemini and Gemini Advanced), in answering ophthalmological questions, and to assess the impact of prompting techniques on their accuracy. Prospective qualitative study. A total of 300 questions from StatPearls were tested, covering a range of subspecialties and image-based tasks. Each question was evaluated using 2 prompting techniques: zero-shot forced prompting (prompt 1) and combined role-based and plan-and-solve+ prompting (prompt 2). With prompt 1, GPT-4o demonstrated significantly superior overall performance, correctly answering 72.3% of questions and outperforming all other models, including Copilot (53.7%), GPT-4o mini (62.0%), Gemini (54.3%), and Gemini Advanced (62.0%) (p < 0.0001). Notable improvements were observed with prompt 2 over prompt 1, elevating Copilot's accuracy from the lowest (53.7%) to the second highest (72.3%) among the LLMs. While newer iterations of LLMs, such as Gemini Advanced, outperformed their less advanced counterparts (Gemini), this study emphasizes the need for caution in the clinical application of these models. The choice of prompting technique influences accuracy, highlighting the necessity of further research to refine LLM capabilities, particularly in visual data interpretation, to ensure safe integration into medical practice.
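The two prompting conditions named in this abstract can be illustrated with plain prompt templates. The sketch below assumes that "zero-shot forced" means requiring a single answer letter with no examples, and that the role-based plan-and-solve+ prompt adds a persona plus an explicit plan-then-solve instruction; the wording and example question are illustrative, not the study's materials.

```python
# Illustrative prompt templates for the two prompting conditions (not the study's exact prompts).

def zero_shot_forced_prompt(question: str, options: list[str]) -> str:
    """Prompt 1: no examples; the model is forced to commit to one option letter."""
    choices = "\n".join(f"{chr(65 + i)}. {opt}" for i, opt in enumerate(options))
    return (
        f"{question}\n{choices}\n\n"
        "Answer with the single letter of the best option and nothing else."
    )

def role_based_plan_and_solve_prompt(question: str, options: list[str]) -> str:
    """Prompt 2: role-based framing combined with a plan-and-solve style instruction."""
    choices = "\n".join(f"{chr(65 + i)}. {opt}" for i, opt in enumerate(options))
    return (
        "You are an experienced ophthalmologist answering a board-style question.\n"
        "First devise a short plan for how to approach the question, then carry out "
        "the plan step by step, and finally state the single best option letter.\n\n"
        f"{question}\n{choices}"
    )

if __name__ == "__main__":
    q = "Which structure is primarily affected in open-angle glaucoma?"  # illustrative
    opts = ["Trabecular meshwork", "Lens capsule", "Corneal endothelium", "Retinal pigment epithelium"]
    print(zero_shot_forced_prompt(q, opts))
    print()
    print(role_based_plan_and_solve_prompt(q, opts))
```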
Language: English
Citations: 0
Journal of surgical education, Journal Year: 2025, Volume and Issue: 82(4), P. 103442 - 103442
Published: Feb. 9, 2025
Recent studies investigated the potential of large language models (LLMs) for clinical decision making and for answering exam questions based on text input. Recent developments have extended these LLMs with vision capabilities. These image-processing LLMs are called vision-language models (VLMs). However, there has been limited investigation of the applicability of VLMs and of their capabilities on image-based exam content. Therefore, the aim of this study was to examine the performance of publicly accessible models on 2 different surgical question sets consisting of text-based and image-based questions. Original questions from subsets of the German Medical Licensing Examination (GMLE) and the United States Medical Licensing Examination (USMLE) were collected and answered by publicly available models (GPT-4, Claude-3 Sonnet, Gemini-1.5). LLM outputs were benchmarked for accuracy. Additionally, the LLMs' performance was compared with students' average historical performance (AHP) on these exams. Moreover, performance variations were analyzed in relation to question difficulty and the respective question type. Overall, all LLMs achieved scores equivalent to passing grades (≥60%) across both question datasets. On image-based questions, only GPT-4 exceeded the score required to pass, significantly outperforming Claude-3 and Gemini-1.5 (GPT-4: 78% vs. Claude-3: 58% and Gemini-1.5: 57.3%; p < 0.001). GPT-4 also outperformed students' average historical performance on both exams (GPT-4: 83.7% vs. AHP: 67.8%; p < 0.001) and exceeded the students' AHP of 67.4% on the other question set. The LLMs demonstrated substantial capability in answering surgical exam questions, suggesting that this technology holds considerable potential for use in the education of trainee surgeons.
Language: English
Citations: 0
Frontiers in Oral Health, Journal Year: 2025, Volume and Issue: 6
Published: Feb. 12, 2025
Objectives: Oral and maxillofacial diseases affect approximately 3.5 billion people worldwide. With the continuous advancement of Artificial Intelligence technologies, particularly the application of generative pre-trained transformers like ChatGPT-4, there is potential to enhance public awareness of the prevention and early detection of these diseases. This study evaluated the performance of ChatGPT-4 in addressing oral and maxillofacial disease questions using standard approaches and the Chain of Thought (CoT) method, aiming to gain a deeper understanding of its capabilities, potential, and limitations. Materials and methods: Three experts, drawing from their extensive experience with the conditions most common in clinical settings, selected 130 open-ended questions and 1,805 multiple-choice questions from the national dental licensing examination. These questions encompass 12 areas of dentistry and oral and maxillofacial surgery, including Prosthodontics, Pediatric Dentistry, Maxillofacial Tumors, Salivary Gland Diseases, and Infections. Results: Using the CoT approach, ChatGPT-4 exhibited marked enhancements in accuracy, structure, completeness, professionalism, and overall impression for open-ended questions, revealing statistically significant differences compared with its performance on general inquiries. In the realm of multiple-choice questions, the CoT method boosted ChatGPT-4's accuracy across all major subjects, achieving an increase of 3.1%. Conclusions: When employing ChatGPT-4 to address oral and maxillofacial disease questions, incorporating CoT as a querying method can help improve its answers to such issues. However, it is not advisable to consider it a substitute for doctors.
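As a rough illustration of the difference between a standard query and the CoT-style querying this abstract describes, the sketch below contrasts the two prompt forms; the wording and example question are invented, not taken from the study's materials.

```python
# Illustrative contrast between a standard query and a Chain-of-Thought (CoT) query
# (example question and phrasing are invented for illustration).

QUESTION = (
    "A patient presents with a painless, slowly enlarging parotid mass. "
    "What is the most likely diagnosis?"
)

# Standard approach: ask the question directly.
standard_prompt = QUESTION

# CoT approach: ask the model to reason through intermediate steps before answering.
cot_prompt = (
    f"{QUESTION}\n"
    "Let's think step by step: first list the relevant differential diagnoses, "
    "then weigh the clinical features against each, and finally state the single "
    "most likely diagnosis."
)

print(standard_prompt)
print("---")
print(cot_prompt)
```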
Language: English
Citations: 0