Point‐counterpoint: What is the best strategy for developing generative AI for hospital medicine? DOI

Hannah Kerman, Andre Kumar, Byron Crowe et al.

Journal of Hospital Medicine, Journal Year: 2025, Volume and Issue: unknown

Published: May 4, 2025

Abstract Generative Artificial Intelligence (GenAI) shows significant promise as a technology that could improve healthcare delivery, but its implementation will be influenced by the spheres in which it is studied and by the limited resources of hospitals. The Point authors argue that we should focus on the cognitive abilities of GenAI or risk being left out of a technological leap that will change the way doctors practice. The Counterpoint authors argue for using GenAI to ease system burdens and address workflow issues, contending that focusing our efforts on fixing these problems would improve doctors' quality of life and increase time spent with patients.

Language: English

Performance of o1 pro and GPT-4 in self-assessment questions for nephrology board renewal DOI Creative Commons
Ryunosuke Noda, Chiaki Yuasa, Fumiya Kitano et al.

medRxiv (Cold Spring Harbor Laboratory), Journal Year: 2025, Volume and Issue: unknown

Published: Jan. 15, 2025

ABSTRACT Background: Large language models (LLMs) are increasingly evaluated in medical education and clinical decision support, but their performance in highly specialized fields, such as nephrology, is not well established. We compared two advanced LLMs, GPT-4 and the newly released o1 pro, on comprehensive nephrology board renewal examinations. Methods: We administered 209 Japanese Self-Assessment Questions for Nephrology Board Renewal from 2014–2023 to both models using ChatGPT pro. Each question, including those with images, was presented in separate chat sessions to prevent contextual carryover. Questions were classified by taxonomy (recall/interpretation/problem-solving), question type (general/clinical), image inclusion, and subspecialty. We calculated the proportion of correct answers and compared performances using chi-square or Fisher's exact tests. Results: Overall, o1 pro scored 81.3% (170/209), significantly higher than GPT-4's 51.2% (107/209; p<0.001). o1 pro exceeded the 60% passing criterion in every year, while GPT-4 achieved this in only some of the ten years. Across taxonomy levels, question types, and image presence, o1 pro consistently outperformed GPT-4 (p<0.05 across multiple comparisons). Performance differences were also significant in several subspecialties, including chronic kidney disease, confirming o1 pro's broad superiority. Conclusion: o1 pro substantially outperformed GPT-4 on a nephrology board renewal examination, demonstrating stronger reasoning and integration of specialized knowledge. These findings highlight the potential of next-generation LLMs as valuable tools in specialty education and possibly in decision support, warranting further careful validation.
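
The headline comparison in this abstract is a two-proportion contrast evaluated with a chi-square test. A minimal sketch, using the counts reported above and assuming scipy is available, reproduces the reported p<0.001:

```python
# Chi-square comparison of o1 pro (170/209 correct) vs. GPT-4 (107/209),
# using the counts reported in the abstract. scipy is assumed available.
from scipy.stats import chi2_contingency

table = [
    [170, 209 - 170],  # o1 pro: correct, incorrect (81.3%)
    [107, 209 - 107],  # GPT-4: correct, incorrect (51.2%)
]
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.1f}, p = {p:.2e}")  # p falls well below 0.001
```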

Language: English

Citations: 0

Integrating AI into clinical education: evaluating general practice trainees’ proficiency in distinguishing AI-generated hallucinations and impacting factors DOI Creative Commons
Jiacheng Zhou, Jintao Zhang, Rongrong Wan et al.

BMC Medical Education, Journal Year: 2025, Volume and Issue: 25(1)

Published: March 19, 2025

To assess the ability of General Practice (GP) Trainees to detect AI-generated hallucinations in simulated clinical practice, ChatGPT-4o was utilized. The hallucinations were categorized into three types based on the accuracy of answers and explanations: (1) correct answers with incorrect or flawed explanations, (2) correct answers with explanations that contradict factual evidence, and (3) incorrect answers with incorrect explanations. This multi-center, cross-sectional survey study involved 142 GP Trainees, all of whom were undergoing Specialist Training and volunteered to participate. The study evaluated the consistency of ChatGPT-4o, as well as the Trainees' response time, accuracy, sensitivity (d'), and response tendencies (β). Binary regression analysis was used to explore factors affecting the ability to identify errors generated by ChatGPT-4o. A total of 137 participants were included, with a mean age of 25.93 years. Half were unfamiliar with AI, and 35.0% had never used it. ChatGPT-4o's overall accuracy was 80.8%, which slightly decreased to 80.1% after human verification. However, accuracy for professional practice (Subject 4) was only 57.0%, and after human verification it dropped further to 44.2%. A total of 87 errors were identified, primarily occurring at the application and evaluation levels. The Trainees' accuracy in detecting these errors was 55.0%, with a sensitivity (d') of 0.39. Regression analysis revealed that shorter response times (OR = 0.92, P = 0.02), higher self-assessed AI understanding (OR = 0.16, P = 0.04), and more frequent AI use (OR = 10.43, P = 0.01) were associated with stricter error detection criteria. The study concluded that GP trainees faced challenges in identifying AI-generated errors, particularly in professional practice scenarios. This highlights the importance of improving AI literacy and critical thinking skills to ensure the effective integration of AI into medical education.
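
The sensitivity (d') and response tendency (β) reported above are standard signal detection theory measures. A minimal sketch of how they are computed from hit and false-alarm rates follows; the input rates are illustrative placeholders, not the study's raw data, and scipy is assumed to be available:

```python
import math
from scipy.stats import norm

def d_prime_and_beta(hit_rate: float, fa_rate: float) -> tuple[float, float]:
    # d' = z(H) - z(FA); criterion c = -(z(H) + z(FA)) / 2; beta = exp(c * d')
    z_h, z_fa = norm.ppf(hit_rate), norm.ppf(fa_rate)
    d_prime = z_h - z_fa
    beta = math.exp(-(z_h + z_fa) / 2 * d_prime)
    return d_prime, beta

# Illustrative rates only: flagging 62% of true errors while false-alarming
# on 47% of correct responses gives a d' close to the 0.39 reported above.
print(d_prime_and_beta(0.62, 0.47))
```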

Language: English

Citations: 0

“The Machine Will See You Now”: A Clinician's Perspective on Artificial “Intelligence” In Clinical Care DOI Creative Commons
Abhimanyu Mahajan, Andrew J. Lees

Movement Disorders Clinical Practice, Journal Year: 2025, Volume and Issue: unknown

Published: March 20, 2025

Intelligence is the ability to think logically, to conceptualize and to abstract from reality.1 Its companion, wisdom, is the capacity to grasp human nature, which is paradoxical, contradictory and subject to continual change.1 Both of these constructs are key to the practice of medicine and often improve with age and clinical experience. Artificial "intelligence" (AI) refers to machines that recognize and "learn" patterns in complex data, predict outcomes and help in decision-making.2 AI has been heralded as a new era in medicine, one that will take over medical diagnosis and management. With this background, we draw attention to its shortcomings and potential dangers in our own specialty of movement disorders.

The term was coined in 1956, but in the last few years AI has made considerable progress and is no longer science fiction.3 A quick Google search of even reputable and generally trustworthy sources reveals frequent promotional slogans such as "AI is revolutionizing healthcare as we know it" and "2023: the year of groundbreaking advances in AI computing". Meta launched "Galactica", a large language model (LLM) based on a training dataset of 48 million examples from scientific articles, textbooks, websites, lecture notes and encyclopedias. The purpose behind Galactica was to have a single AI-based tool that could summarize academic articles, write code and annotate molecules. It lasted a total of 3 days online after it was found to be unable to distinguish truth from fiction and to "hallucinate" data.4 Rapid adoption has led to increased concerns about the absence of published negative results, with some top researchers concerned that companies are adopting "shiny products" over safety.5, 6 Leading voices have expressed alarm at the low regulatory bar for transparency.7 The skill of human writers is questioned at a time when emphasis is placed on hiring personnel with an aptitude for AI, a requirement that has seen a 142X increase on LinkedIn.8

Diagnostic accuracy carries great importance, as the consequence of error is harm to patients. Research on "artificial intelligence" was funded by the NIH to the tune of ~$1.1 billion in 2023. Correspondingly, the number of Medline publications on the topic has greatly increased. A recent study showed that only 20% of studies using AI on neuroimaging in Parkinson's disease (PD) passed minimal quality criteria, and only 8% used external test sets, where performance was lower.9 A systematic review of fifty-five relevant studies on the use of AI in PD found that only three were validated on external data and only five had a low risk of bias.10

The field of movement disorders relies on a detailed history and a focused neurological examination to arrive at a diagnosis. The gold standard diagnostic criteria for the most common conditions, such as essential tremor, depend on clinical acumen, in part on what we observe in clinic, and on intuition and tacit knowledge. Key documented knowledge includes case reports, case series and videos with small sample sizes, given the rarity of many diagnoses. As such, these may not serve as appropriately comprehensive datasets for a model. Different forms of chance have also played a major role in medicine, including in the story of the introduction of L-DOPA: Dr. Langston encountered MPTP-induced parkinsonism while he was enjoying his coffee, annoyed at being interrupted by residents,11 and amantadine was originally introduced and utilized as an antiviral medication.12

While entertaining the thought of an algorithm replacing the neurologist, we must remember that diagnosis is the first and easiest aspect of care; telling the patient, however, requires nuance, grace and empathy. Perhaps the question that matters most with regard to machine learning is: would you trust it with diagnosing your family members? The answer is highlighted by a study that compared AI, physicians, and AI plus physician: comprehensibility was similar across groups, but empathy, reliability and willingness to follow advice were significantly better for the physician.13 A survey of 1400 US adults revealed that 69% of them were uncomfortable with AI.14 We need to continue investing in and supporting meticulous, caring physicians, not just state-of-the-art technology.

The demise of expertise in fields like pathology and radiology at the hands of AI has long been predicted. However, certain pitfalls make this unlikely. Missing data can lead to bias. Rare conditions may be missed or overcompensated for by a model, leading to overdiagnosis or misdiagnosis.
LLMs may also reinforce outdated practices.15, 16 Unlike physicians, they are insensitive to the impact of their decisions and, therefore, do not demonstrate safety behavior or awareness of their own limitations. There is also the question of accountability should an AI-led calculation be in error. Perhaps the most voiced concern is the mismatch between the real world and the model, with both known and unknown errors.15 Such a discrepancy was noted in a study conducted at Stanford in gastroenterology:17 preliminary data indicated that AI could detect polyps during colonoscopy, but the subsequent trial was negative. The authors acknowledged the difference: adding more real-world data can add noise and alter the efficacy of these tools.17 Concerns about disappointment and opacity have been voiced by computational scientists at the University of Toronto, who note that the trend of excitement around AI feels like an "advertisement for cool technology" instead of having a basis in science.16

There has been a substantial push towards the incorporation of AI into electronic health records to alleviate documentation burden.18 This is not without risks. In addition to hallucination, AI may misinterpret recorded text; in one example, issues with the hands, feet and mouth were scribed as hand, foot and mouth disease. Chart bloat may require additional time spent screening for errors. Overall, when used for discussion, AI shows greater promise in summarizing information and synthesizing data.19 As we look at options to expand care, we must ensure that tools are not biased or inadequately tested, thereby introducing inequity between resource-rich and resource-poor nations. Recognizing these issues, the World Health Organization has called for careful evaluation of these tools, especially for biases, before they are adopted in low-resource settings with the intent of reducing inequity. The FDA has called for "nimble regulation" to avoid being "swept up in something we hardly understand"; its commissioner recognizes that models will likely "evolve" after implantation and require "continuous adjustment to remain accurate".20, 21

Artificial "intelligence" is a poor surrogate for intelligence and human interaction, and it performs poorly on cognitive tests.22 It is clear that it will never replace the patient-doctor relationship, though it may paradoxically improve it. For years, an environment of high administrative burden has stifled innovation and driven burnout and job dissatisfaction,23-25 and several reports identify it as a primary contributor to the physician exodus.26 "Boring AI", as it has been termed, offers hope in lowering these burdens and making practice sustainable and pleasurable.27 Quality measures cost hospitals dollars and precious manpower annually; it took less than an hour to draft abstractions with >90% concordance using AI.28 History taking along with video filming offers an interesting future model of data collection and synthesis, reviewed then by the neurologist.29 This approach would not seek to supplant history taking and the physical examination but to complement good patient care and streamline workflows, cutting costs 17-fold while preserving reliability.30 Once ready and deployed with caution, smart systems could potentially assist with scheduling, risk stratification and resource allotment. Many approved medications require prior authorization for coverage; AI could draft letters of necessity and create reading material for patients,31, 32 letting physicians spend more time with the patient.

The patient should always be at the center of any decision related to AI, and it is important to acknowledge the patient's narrative and subjective experience. Despite its potential, AI is currently neither consistent nor accountable enough for independent diagnostics. We must understand that generative AI is, at best, predictive, and its errors require expert judgment.33 Its accuracy drops substantially when diagnoses must be reached through simulated conversation, highlighting its limitations as an independent entity.34 The demise of neurologists was once predicted with the advent of brain imaging;35 instead, they embraced its use and generated the evidence for the multiple modalities they now incorporate. It is imperative that developers partner with clinicians and perform exhaustive testing rather than rushing premature and dangerous deployment directly to patients.

Language: English

Citations: 0

When AI Eats the Healthcare World - Is Trusting AI Fed, or Earned? (Preprint) DOI Creative Commons
Syaheerah Lebai Lutfi, Adhari Al Zaabi, Geshina Ayu Mat Saat et al.

Published: March 23, 2025

BACKGROUND Perception-based studies are susceptible to bias introduced through the design of the instruments used. We demonstrate the need to shift from perception-based to usage-based trust evaluation, emphasizing that trust must be earned through demonstrated reliability rather than assumed from pre-adoption surveys. Our findings suggest that successful AI implementation requires a proactive approach that addresses the complex interplay of human, technical, and organizational factors, grounded in real-world usage data rather than theoretical, perception-driven acceptance measures. OBJECTIVE To examine the disconnect between pre-implementation expectations and post-implementation realities in healthcare AI systems. METHODS We assessed key perception-driven models, namely the Unified Theory of Acceptance and Use of Technology (UTAUT), the Technology Acceptance Model (TAM), and Diffusion of Innovation (DOI), with regard to healthcare. We then matched predictions made using these models against real post-usage evidence. RESULTS Through empirical and anecdotal evidence, this paper demonstrates the gap between technology adoption frameworks and actual usage, focusing on the human factors that influence adoption and on the shortcomings of current perception-focused research. CONCLUSIONS Real-world results fall short of the hype, which underlies the reluctance or resistance of healthcare providers to fully adopt AI.

Language: English

Citations: 0

When It Comes to Benchmarks, Humans Are the Only Way DOI
Adam Rodman, Laura Zwaan, Andrew Olson et al.

NEJM AI, Journal Year: 2025, Volume and Issue: 2(4)

Published: March 25, 2025

Improved performance of large language models (LLMs) on traditional reasoning assessments has led to benchmark saturation. This has spurred efforts to develop new benchmarks, including synthetic computational simulations of clinical practice involving multiple AI agents. We argue that it is crucial to ground such benchmarks in extensive human validation. We conclude by providing four recommendations to help researchers better evaluate LLMs for clinical practice.

Language: English

Citations: 0

Coordinated AI agents for advancing healthcare DOI
Michael Moritz, Eric J. Topol, Pranav Rajpurkar et al.

Nature Biomedical Engineering, Journal Year: 2025, Volume and Issue: unknown

Published: April 1, 2025

Language: English

Citations: 0

Evaluating large language models and agents in healthcare: key challenges in clinical applications DOI Creative Commons
Xiaolan Chen, Jie Xiang, Shanfu Lu et al.

Intelligent Medicine, Journal Year: 2025, Volume and Issue: unknown

Published: March 1, 2025

Language: English

Citations: 0

Transforming breast cancer diagnosis and treatment with large language Models: A comprehensive survey DOI
Mohsen Ghorbian, Mostafa Ghobaei-Arani, Saeid Ghorbian et al.

Methods, Journal Year: 2025, Volume and Issue: unknown

Published: April 1, 2025

Language: English

Citations: 0

Grounding Large Language Model in Clinical Diagnostics DOI

Jian Li, Xi Chen, Hanyu Zhou et al.

Research Square (Research Square), Journal Year: 2025, Volume and Issue: unknown

Published: April 15, 2025

Abstract Large language models (LLMs) possess extensive medical knowledge and demonstrate impressive performance in answering diagnostic questions. However, responding to such questions differs significantly from actual clinical procedures. Real-world diagnostics involve a dynamic, iterative process that includes hypothesis refinement and targeted data collection. This complex task is both challenging and time-consuming, demanding a significant portion of clinician workload and healthcare resources. Therefore, evaluating and enhancing LLMs on real-world diagnostic procedures is crucial for their deployment. In this study, a framework was developed to assess LLMs' capability to complete diagnostic encounters, including history taking, physical examination, tests, and diagnosis. A benchmark dataset of 4,421 cases was curated, covering rare and common diseases across 32 specialties. Clinical evaluation methods were used to comprehensively assess GPT-4o-mini, GPT-4o, Claude-3-Haiku, Qwen2.5-72b, Qwen2.5-34b, and Qwen2.5-14b. Although these models performed well on diagnostic questions, they consistently underperformed in diagnostic procedures and exhibited a number of errors. To address these challenges, ClinDiag-GPT was trained on over 8,000 cases. It emulates physicians' diagnostic reasoning, collects information in line with clinical practice, and recommends key tests and definitive diagnoses. ClinDiag-GPT outperformed the other LLMs in diagnostic accuracy and procedural performance. We further compared ClinDiag-GPT alone, ClinDiag-GPT in collaboration with physicians, and physicians alone. Collaboration between ClinDiag-GPT and physicians enhanced diagnostic efficiency, demonstrating ClinDiag-GPT's potential as a valuable diagnostic assistant.
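
The encounter-completion evaluation the abstract describes is inherently multi-turn: the model must request findings before committing to a diagnosis. A minimal sketch of such a loop follows, assuming a simple REQUEST/DIAGNOSE action format; the `query_llm` callable and the case schema are hypothetical illustrations, not the paper's actual framework:

```python
from dataclasses import dataclass

@dataclass
class Case:
    chief_complaint: str
    findings: dict[str, str]      # e.g. {"cbc": "...", "cardiac exam": "..."}
    reference_diagnosis: str

def run_encounter(case: Case, query_llm, max_turns: int = 10) -> bool:
    """Return True if the model reaches the reference diagnosis in time."""
    transcript = [f"Chief complaint: {case.chief_complaint}"]
    for _ in range(max_turns):
        action = query_llm(transcript)  # "REQUEST <finding>" or "DIAGNOSE <dx>"
        if action.startswith("DIAGNOSE"):
            diagnosis = action.removeprefix("DIAGNOSE").strip()
            return diagnosis.lower() == case.reference_diagnosis.lower()
        key = action.removeprefix("REQUEST").strip().lower()
        transcript.append(f"{key}: {case.findings.get(key, 'not available')}")
    return False  # never committed to a diagnosis within the turn budget
```

Scoring a model on a benchmark of such cases then amounts to averaging `run_encounter` over the dataset, alongside procedural measures such as how many relevant findings were requested.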

Language: English

Citations: 0

Red teaming ChatGPT in medicine to yield real-world insights on model behavior DOI Creative Commons
Crystal Chang, Hodan Farah, Haiwen Gui et al.

npj Digital Medicine, Journal Year: 2025, Volume and Issue: 8(1)

Published: March 7, 2025

Red teaming, the practice of adversarially exposing unexpected or undesired model behaviors, is critical to improving the equity and accuracy of large language models, but red teaming by those unaffiliated with model creators is scant in healthcare. We convened teams of clinicians, medical and engineering students, and technical professionals (80 participants in total) to stress-test models with real-world clinical cases and categorize inappropriate responses along the axes of safety, privacy, hallucinations/accuracy, and bias. Six medically trained reviewers re-analyzed the prompt-response pairs and added qualitative annotations. Of 376 unique prompts (1504 responses), 20.1% of responses were inappropriate (GPT-3.5: 25.8%; GPT-4.0: 16%; GPT-4.0 with Internet: 17.8%). Subsequently, we show the utility of our benchmark by testing GPT-4o, a model released after the event (20.4% inappropriate). 21.5% of appropriate GPT-3.5 responses were inappropriate with updated models. We share insights for constructing red-teaming prompts and present a framework for iterative assessments.
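
Turning the annotated prompt-response pairs described above into per-model inappropriateness rates is a simple tally. A minimal sketch under an assumed record layout (the field names are illustrative, not the study's schema):

```python
from collections import Counter

# One record per (prompt, response) pair; the layout is an assumption.
pairs = [
    {"model": "gpt-3.5", "inappropriate": True, "axis": "hallucination"},
    {"model": "gpt-4.0", "inappropriate": False, "axis": None},
    # ... remaining annotated pairs
]

totals, flagged = Counter(), Counter()
for p in pairs:
    totals[p["model"]] += 1
    flagged[p["model"]] += int(p["inappropriate"])

for model, n in totals.items():
    print(f"{model}: {100 * flagged[model] / n:.1f}% inappropriate")
```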

Language: English

Citations: 0