Using ChatGPT to generate multiple-choice questions in medical education may have potential adverse effects on medical educators and medical students
Hongnan Ye

Postgraduate Medical Journal, Journal Year: 2024, Volume and Issue: unknown

Published: July 17, 2024

Language: English

Cognitive Domain Assessment of Artificial Intelligence Chatbots: A Comparative Study Between ChatGPT and Gemini’s Understanding of Anatomy Education
Arthi Ganapathy, Parul Kaushal

Medical Science Educator, Journal Year: 2025, Volume and Issue: unknown

Published: Feb. 15, 2025

Language: English

Citations: 0

Potential of Large Language Models in Generating Multiple-Choice Questions for the Japanese National Licensure Examination for Physical Therapists
Shogo Sawamura, Kengo Kohiyama, Takahiro Takenaka et al.

Cureus, Journal Year: 2025, Volume and Issue: unknown

Published: Feb. 17, 2025

Introduction: This study explored the potential of using large language models (LLMs) to generate multiple-choice questions (MCQs) for the Japanese National Licensure Examination for Physical Therapists. Specifically, it evaluated the performance of a customized ChatGPT (OpenAI, San Francisco, CA, USA) model named "Physio Exam Generative Pre-trained Transformers (GPT)" in generating high-quality MCQs in non-English contexts. Materials and methods: Based on data extracted from the 57th and 58th national examinations for physical therapists, 340 MCQs, including correct answers, explanations, and associated topics, were incorporated into the knowledge base of the custom GPT. The prompts and outputs were in Japanese. The generated questions covered major topics in general subjects (anatomy, physiology, kinesiology) and practical subjects (musculoskeletal disorders, central nervous system disorders, and internal organ disorders). The quality of the questions and their explanations was rated by two independent reviewers on a 10-point Likert scale across five criteria: clarity, relevance to clinical practice, suitability of difficulty, quality of distractors, and adequacy of rationale. Results: The generated items achieved 100% accuracy in both question categories. Average scores across the evaluation criteria ranged from 7.0 to 9.8 in one question category, with a minimum of 6.7 in the other. Although some areas exhibited lower scores, the overall results were favorable. Conclusions: This study demonstrates that LLMs can efficiently generate MCQs even in non-English environments such as Japanese. These findings suggest that LLMs can adapt to diverse linguistic settings, reduce educators' workload, and improve educational resources. They lay a foundation for expanding the application of LLMs to educational settings in non-English-speaking regions.
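As a minimal illustration of the review step described above, the sketch below averages two hypothetical reviewers' 10-point Likert ratings across the five criteria; all names and numbers are illustrative assumptions, not the study's data.

```python
# Sketch: two independent reviewers rate each generated MCQ on a 10-point
# Likert scale across five criteria, and per-criterion averages are reported.
from statistics import mean

CRITERIA = [
    "clarity",
    "relevance_to_clinical_practice",
    "suitability_of_difficulty",
    "quality_of_distractors",
    "adequacy_of_rationale",
]

# ratings[reviewer][question_id][criterion] -> score from 1 to 10 (made up)
ratings = {
    "reviewer_1": {"q1": {c: 8 for c in CRITERIA}, "q2": {c: 7 for c in CRITERIA}},
    "reviewer_2": {"q1": {c: 9 for c in CRITERIA}, "q2": {c: 6 for c in CRITERIA}},
}

def per_criterion_average(ratings: dict) -> dict:
    """Average each criterion over all reviewers and all questions."""
    return {
        c: mean(
            reviewer[qid][c]
            for reviewer in ratings.values()
            for qid in reviewer
        )
        for c in CRITERIA
    }

print(per_criterion_average(ratings))
```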

Language: English

Citations: 0

Application of Artificial Intelligence Generated Content in Medical Examinations
Rui Li, Tong Wu

Advances in Medical Education and Practice, Journal Year: 2025, Volume and Issue: Volume 16, P. 331 - 339

Published: Feb. 1, 2025

With the rapid development of large language models, artificial intelligence-generated content (AIGC) presents novel opportunities for constructing medical examination questions. However, it remains unclear how to utilize AIGC effectively in question design, even though it is characterized by rapid response capabilities and high efficiency, as well as good performance in mimicking clinical realities. In this study, we reviewed the limitations inherent in paper-based examinations and provided streamlined instructions for generating questions using AIGC, with a particular focus on multiple-choice questions, case study questions, and video-based questions. Manual review remains necessary to ensure the accuracy and quality of the generated content. Future work will benefit from technologies such as retrieval-augmented generation, multi-agent systems, and video generation technology. As AIGC continues to evolve, it is anticipated to bring transformative changes to medical examinations, enhancing examination preparation and contributing to the effective cultivation of medical students.
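The abstract does not reproduce its streamlined instructions, but a hedged sketch of what an AIGC prompt template for single-best-answer MCQs might look like is shown below; the wording and the build_prompt helper are assumptions, not the authors' actual instructions.

```python
# Illustrative prompt template for asking an LLM to draft a clinical MCQ.
MCQ_PROMPT = """You are writing exam items for medical students.
Topic: {topic}
Write one single-best-answer multiple-choice question based on a short
clinical vignette. Provide:
1. The stem (vignette plus lead-in question)
2. Five answer options labelled A-E, alphabetized, with one correct answer
3. The correct answer letter
4. A brief explanation, including why each distractor is wrong
Avoid "all of the above" and "none of the above" options."""

def build_prompt(topic: str) -> str:
    """Fill the template for a given exam topic (hypothetical helper)."""
    return MCQ_PROMPT.format(topic=topic)

if __name__ == "__main__":
    print(build_prompt("community-acquired pneumonia"))
```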

Language: English

Citations: 0

Quality assurance and validity of AI-generated single best answer questions
Ayla Ahmed, Ellen Kerr, Andrew O’Malley et al.

BMC Medical Education, Journal Year: 2025, Volume and Issue: 25(1)

Published: Feb. 25, 2025

Abstract Background: Recent advancements in generative artificial intelligence (AI) have opened new avenues for educational methodologies, particularly in medical education. This study seeks to assess whether AI might be useful in addressing the depletion of assessment question banks, a challenge intensified during the Covid era due to the prevalence of open-book examinations, and in augmenting the pool of formative opportunities available to students. While many recent publications have sought to ascertain whether AI can achieve a passing standard on existing examinations, this study investigates its potential to generate the exam itself. Summary of work: The research utilized a commercially available large language model (LLM), OpenAI GPT-4, to generate 220 single best answer (SBA) questions, adhering to Medical Schools Council Assessment Alliance guidelines and a selection of Learning Outcomes (LOs) from the Scottish Graduate-Entry Medicine (ScotGEM) program. All questions were assessed by an expert panel for accuracy and quality. A total of 50 AI-generated and 50 human-authored questions were used to create two 50-item SBA examinations for Year 1 and Year 2 of the ScotGEM program. Each exam, delivered via the Speedwell eSystem, comprised 25 AI-generated and 25 human-authored questions presented in random order. Students completed the online, closed-book exams on personal devices under conditions that reflected summative examinations. The performance of both question types was evaluated, focusing on facility and discrimination index as key metrics. The screening process revealed that 69% of AI-generated SBAs were fit for inclusion with little or no modification required. Modifications, when necessary, were predominantly for reasons such as "all of the above" options, usage of American English spellings, and non-alphabetized answer choices. The remaining 31% were rejected for factual inaccuracies or non-alignment with students’ learning. When included in the examinations, post hoc statistical analysis indicated no significant difference between AI- and human-authored questions in terms of facility or discrimination index. Discussion and conclusion: The outcomes suggest that LLMs can generate questions in line with best practice and aligned to specific LOs. However, robust quality assurance is necessary to ensure erroneous questions are identified and rejected. The insights gained from this study provide a foundation for further investigation into refining prompts, aiming for more reliable generation of curriculum-aligned questions. The results show that supplementing traditional question-writing methods with this approach offers a viable solution to rapidly replenish and diversify assessment resources across curricula, marking a step forward at the intersection of AI and medical education.
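For readers unfamiliar with the two item metrics named here, the sketch below shows one common way to compute facility (proportion of students answering an item correctly) and a classical upper/lower-group discrimination index; the study's exact computation is not specified in the abstract, and the 27% grouping and toy data are assumptions.

```python
# Sketch of item facility and a classical discrimination index.
from typing import List

def facility(item_scores: List[int]) -> float:
    """item_scores: 1 if the student answered the item correctly, else 0."""
    return sum(item_scores) / len(item_scores)

def discrimination_index(item_scores: List[int], total_scores: List[float],
                         group_fraction: float = 0.27) -> float:
    """Difference in item facility between top and bottom scorers overall."""
    order = sorted(range(len(total_scores)), key=lambda i: total_scores[i])
    k = max(1, int(len(order) * group_fraction))
    lower, upper = order[:k], order[-k:]
    p_upper = sum(item_scores[i] for i in upper) / k
    p_lower = sum(item_scores[i] for i in lower) / k
    return p_upper - p_lower

# Toy example: 10 students, one item (illustrative data only).
item = [1, 1, 0, 1, 0, 1, 1, 0, 0, 1]
totals = [42, 38, 20, 45, 22, 40, 36, 18, 25, 44]
print(facility(item), discrimination_index(item, totals))
```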

Language: English

Citations: 0

Artificial Intelligence-Based Chatbots’ Ability to Interpret Mammography Images: A Comparison of Chat-GPT 4o and Claude 3.5
Betül Nalan Karahan, Emre Emekli, Mahmut Altuğ Altın et al.

European Journal of Therapeutics, Journal Year: 2025, Volume and Issue: 31(1), P. 28 - 34

Published: Feb. 28, 2025

Objectives: The aim of this study is to compare the ability of artificial intelligence-based chatbots, ChatGPT-4o and Claude 3.5, to interpret mammography images. It focuses on evaluating their accuracy and consistency in BI-RADS classification and breast parenchymal type assessment. It also aims to explore the potential of these technologies to reduce radiologists’ workload and to identify their limitations in medical image analysis. Methods: A total of 53 mammography images obtained between January and July 2024 were analyzed; the same anonymized images were provided to both chatbots under identical prompts. Results: Accuracy rates for BI-RADS classification ranged from 18.87% to 26.42% for ChatGPT-4o and were 18.7% for Claude 3.5. When the categories were grouped into a benign group (BI-RADS 1, 2) and a malignant group (BI-RADS 4, 5), the combined accuracy of ChatGPT-4o was 57.5% (initial evaluation) and 55% (second evaluation), compared with 47.5% for Claude 3.5. For breast parenchymal type assessment, accuracy rates of 30.19% and 22.64% were observed for ChatGPT-4o. Conclusions: The findings indicate that both chatbots demonstrate limited reliability in interpreting mammography images. These results highlight the need for further optimization, larger datasets, and advanced training processes to improve their performance.
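The benign/malignant grouping can be made concrete with the short sketch below, which pools BI-RADS 1-2 and 4-5 and scores agreement on the grouped labels; the function names and example data are illustrative only, not the study's dataset.

```python
# Sketch: pool BI-RADS 1-2 as "benign" and 4-5 as "malignant", then score
# accuracy on the grouped labels rather than the exact category.
def to_group(birads: int) -> str:
    if birads in (1, 2):
        return "benign"
    if birads in (4, 5):
        return "malignant"
    return "other"  # e.g. BI-RADS 0 or 3, excluded from the binary grouping

def grouped_accuracy(reference: list, predicted: list) -> float:
    pairs = [
        (to_group(r), to_group(p))
        for r, p in zip(reference, predicted)
        if to_group(r) != "other"
    ]
    return sum(r == p for r, p in pairs) / len(pairs)

reference = [1, 2, 4, 5, 2, 4]   # radiologist BI-RADS categories (made up)
predicted = [2, 1, 5, 2, 2, 4]   # chatbot BI-RADS categories (made up)
# Within-group category swaps (e.g. 1 vs 2) still count as correct here.
print(grouped_accuracy(reference, predicted))
```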

Language: English

Citations: 0

UsmleGPT: An AI application for developing MCQs via multi-agent system

Zhehan Jiang, S. H. Feng

Software Impacts, Journal Year: 2025, Volume and Issue: 23, P. 100742 - 100742

Published: March 1, 2025

Language: English

Citations: 0

Automatic item generation for educational assessments: a systematic literature review
Yishen Song, Junlei Du, Qinhua Zheng et al.

Interactive Learning Environments, Journal Year: 2025, Volume and Issue: unknown, P. 1 - 20

Published: March 24, 2025

Language: English

Citations: 0

Can ChatGPT Generate Acceptable Case‐Based Multiple‐Choice Questions for Medical School Anatomy Exams? A Pilot Study on Item Difficulty and Discrimination
Yavuz Selim Kıyak, Ayşe Soylu, Özlem Çoşkun et al.

Clinical Anatomy, Journal Year: 2025, Volume and Issue: unknown

Published: March 24, 2025

ABSTRACT Developing high-quality multiple-choice questions (MCQs) for medical school exams is effortful and time-consuming. In this study, we investigated the ability of ChatGPT to generate case-based anatomy MCQs with acceptable levels of item difficulty and discrimination for medical school exams. We used an endocrine and urogenital system exam based on a framework for artificial intelligence (AI)-assisted item generation. The questions were evaluated by experts, approved by the department, and administered to 502 second-year students (372 Turkish-language, 130 English-language). The items were analyzed to determine difficulty and discrimination indices. Discrimination indices ranged from 0.29 to 0.54, indicating differentiation between high- and low-performing students. All items in Turkish (six out of six) and five of six in English met the higher threshold (≥ 0.30) required for large-scale standardized tests. Difficulty indices ranged from 0.41 to 0.89, with most items falling within the moderate range (0.20–0.80). Therefore, it was concluded that ChatGPT can generate case-based anatomy MCQs with acceptable psychometric properties, offering a promising tool for educators. However, human expertise remains crucial for reviewing and refining AI-generated assessment items. Future research should explore generation across various anatomy topics and investigate different AI models for question writing.
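The thresholds quoted here (discrimination index ≥ 0.30; moderate difficulty between 0.20 and 0.80) translate directly into a simple screening check, sketched below with illustrative item values rather than the study's six exam items.

```python
# Sketch: flag items against the difficulty and discrimination thresholds
# quoted in the abstract.
def review_item(difficulty: float, discrimination: float) -> dict:
    return {
        "moderate_difficulty": 0.20 <= difficulty <= 0.80,
        "acceptable_discrimination": discrimination >= 0.30,
    }

items = {
    "item_1": (0.41, 0.54),
    "item_2": (0.89, 0.29),  # easy item with borderline discrimination
}
for name, (p, d) in items.items():
    print(name, review_item(p, d))
```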

Language: English

Citations: 0

Assessing large language models as assistive tools in medical consultations for Kawasaki disease
Chunyi Yan, Zexi Li, Yunqiang Liang et al.

Frontiers in Artificial Intelligence, Journal Year: 2025, Volume and Issue: 8

Published: March 31, 2025

Kawasaki disease (KD) presents complex clinical challenges in diagnosis, treatment, and long-term management, requiring a comprehensive understanding by both parents and healthcare providers. With advancements in artificial intelligence (AI), large language models (LLMs) have shown promise in supporting medical practice. This study aims to evaluate and compare the appropriateness and comprehensibility of different LLMs in answering clinically relevant questions about KD and to assess the impact of prompting strategies. Twenty-five questions were formulated, incorporating three prompting strategies: No prompting (NO), Parent-friendly prompting (PF), and Doctor-level prompting (DL). These were input into three LLMs: ChatGPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro. Responses were evaluated based on appropriateness, educational quality, comprehensibility, cautionary statements, references, and potential misinformation, using the Information Quality Grade, Global Quality Scale (GQS), Flesch Reading Ease (FRE) score, and word count. Significant differences were found among the models in terms of response accuracy and quality (p < 0.001). Claude 3.5 Sonnet provided the highest proportion of completely correct responses (51.1%) and achieved the highest median GQS score (5.0), significantly outperforming GPT-4o (4.0) and Gemini 1.5 Pro (3.0). Its FRE score was 31.5, and 80.4% of its responses were assessed as comprehensible. Prompting strategies significantly affected LLM responses: Claude 3.5 Sonnet with DL prompting had the highest rate of correct responses (81.3%), while PF prompting yielded the most acceptable responses (97.3%). Gemini 1.5 Pro showed minimal variation across prompts but reached 98.7% acceptability under one prompting condition. This study indicates that LLMs hold great promise for providing information about KD, but their use requires caution due to quality inconsistencies and misinformation risks. Although discrepancies existed, Claude 3.5 Sonnet offered the best comprehensibility and is recommended for those seeking KD-related information. As AI evolves, expanding research and refining prompting strategies will be crucial to ensure reliable, high-quality medical information.
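The Flesch Reading Ease score used in this evaluation follows the standard formula FRE = 206.835 − 1.015 × (words/sentences) − 84.6 × (syllables/words); the sketch below implements it with a rough vowel-group syllable heuristic, which is an assumption rather than the authors' tool, and a made-up sample sentence.

```python
# Sketch of the Flesch Reading Ease (FRE) score.
import re

def count_syllables(word: str) -> int:
    """Crude heuristic: count groups of consecutive vowels."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def flesch_reading_ease(text: str) -> float:
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (syllables / len(words)))

sample = ("Kawasaki disease causes fever and inflamed blood vessels in young "
          "children. Early treatment lowers the risk of heart problems.")
print(round(flesch_reading_ease(sample), 1))
```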

Language: English

Citations: 0

AI in radiography education: Evaluating multiple-choice questions difficulty and discrimination
Emre Emekli, Betül Nalan Karahan

Journal of Medical Imaging and Radiation Sciences, Journal Year: 2025, Volume and Issue: 56(4), P. 101896 - 101896

Published: April 1, 2025

Language: English

Citations: 0