BMC Medical Informatics and Decision Making, Journal year: 2024, Issue: 24(1). Published: Nov. 29, 2024
Abstract
Background
Owing to the rapid growth in popularity of Large Language Models (LLMs), various performance evaluation studies have been conducted to confirm their applicability in the medical field. However, there is still no clear framework for evaluating LLMs.
Objective
This study reviews studies on LLM evaluations in the medical field and analyzes the research methods used in these studies. It aims to provide a reference for future researchers designing LLM evaluation studies.
Methods & materials
We conducted a scoping review of three databases (PubMed, Embase, and MEDLINE) to identify LLM-related articles published between January 1, 2023, and September 30, 2023. We analyzed the types of evaluation methods, the number of questions (queries), the evaluators, repeat measurements, additional analyses, the use of prompt engineering, and metrics other than accuracy.
Results
A total of 142 articles met the inclusion criteria. LLM evaluation was primarily categorized as either providing test examinations (n = 53, 37.3%) or being evaluated by a medical professional (n = 80, 56.3%), with some hybrid cases (n = 5, 3.5%) and combinations of the two (n = 4, 2.8%). Among the test examination studies, most had 100 or fewer questions (n = 18, 29.0%), 15 (24.2%) performed repeated measurements, 18 (29.0%) performed additional analyses, and 8 (12.9%) used prompt engineering. For assessment by medical professionals, most studies used 50 or fewer queries (n = 54, 64.3%), most had two or more evaluators (n = 43, 48.3%), and 14 (14.7%) used prompt engineering.
Conclusions
More research is required regarding the application of LLMs in healthcare. Although previous studies have focused on evaluating performance, future studies will likely focus on improving performance. A well-structured methodology is required for such studies to be conducted systematically.
Future Internet, Journal year: 2023, Issue: 15(12), P. 375 - 375. Published: Nov. 23, 2023
Large language models (LLMs) excel in providing natural responses that sound authoritative, reflect knowledge of the context area, and can present from a range of varied perspectives. Agent-based simulations consist of simulated agents that interact within a simulated environment to explore societal, social, and ethical, among other, problems. Simulated agents generate large volumes of data, and discerning useful and relevant content is an onerous task. LLMs can help in communicating agents' perspectives on key life events by providing natural language narratives. However, these narratives should be factual, transparent, and reproducible. Therefore, we present a structured narrative prompt for sending queries to LLMs, we experiment with the narrative generation process using OpenAI's ChatGPT, and we assess statistically significant differences across 11 Positive and Negative Affect Schedule (PANAS) sentiment levels between the generated narratives and real tweets using chi-squared tests and Fisher's exact tests. The narrative prompt structure effectively yields narratives with the desired components from ChatGPT. In four out of forty-four categories, ChatGPT-generated narratives had sentiment scores that were not discernibly different, in terms of statistical significance (alpha level α = 0.05), from the sentiment expressed in real tweets. Three outcomes are provided: (1) a list of benefits and challenges for LLMs in narrative generation; (2) a structured prompt for requesting narratives from an LLM chatbot based on simulated agents' information; and (3) an assessment of statistical significance in the sentiment prevalence of the generated narratives compared to real tweets. This indicates promise in the utilization of LLMs for helping to connect a simulated agent's experiences with real people.
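To make the abstract's statistical comparison concrete, here is a minimal sketch of a chi-squared test and Fisher's exact test on a 2x2 contingency table of sentiment prevalence; the counts are invented for illustration and are not the paper's data.

```python
# Illustrative comparison of sentiment prevalence between ChatGPT-generated
# narratives and real tweets, in the spirit of the abstract's chi-squared and
# Fisher's exact tests. Counts below are invented for demonstration only.
from scipy.stats import chi2_contingency, fisher_exact

# Rows: generated narratives, real tweets.
# Columns: messages expressing a given PANAS sentiment, messages not expressing it.
table = [[37, 63],   # hypothetical counts for generated narratives
         [52, 148]]  # hypothetical counts for real tweets

chi2, p_chi2, dof, _ = chi2_contingency(table)
odds_ratio, p_fisher = fisher_exact(table)  # exact test for small expected counts

alpha = 0.05
print(f"chi-squared p = {p_chi2:.4f}, Fisher exact p = {p_fisher:.4f}")
print("discernibly different" if min(p_chi2, p_fisher) < alpha
      else "not discernibly different")
```

Fisher's exact test is the usual fallback when expected cell counts are too small for the chi-squared approximation, which is presumably why the paper reports both.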
medRxiv (Cold Spring Harbor Laboratory), Journal year: 2024, Issue: unknown. Published: July 22, 2024
Large language models (LLMs) show promise in supporting differential diagnosis, but their performance is challenging to evaluate due to the unstructured nature of their responses. To assess the current capabilities of LLMs to diagnose genetic diseases, we benchmarked these models on 5,213 case reports using the Phenopacket Schema, the Human Phenotype Ontology, and the Mondo disease ontology. Prompts generated from each phenopacket were sent to three generative pretrained transformer (GPT) models. The same phenopackets were used as input to a widely used diagnostic tool, Exomiser, in phenotype-only mode. The best LLM ranked the correct diagnosis first in 23.6% of cases, whereas Exomiser did so in 35.5% of cases. While LLM performance for differential diagnosis has been improving, it has not reached the level commonly seen with traditional bioinformatics tools. Future research is needed to determine the best approach for incorporating LLMs into diagnostic pipelines.
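As a rough illustration of the benchmarking setup described above, the sketch below turns a phenopacket's HPO-coded features into a diagnostic prompt and scores rank-1 (top-k) accuracy over a set of cases. The field names follow the GA4GH Phenopacket Schema JSON layout, but the prompt wording and helper functions are hypothetical, not the authors' code.

```python
# Minimal sketch (not the authors' pipeline) of turning a phenopacket's HPO
# terms into a diagnostic prompt and scoring whether the correct diagnosis
# is ranked first. Field names follow the GA4GH Phenopacket Schema JSON.
import json

def phenopacket_to_prompt(path: str) -> str:
    with open(path) as f:
        pp = json.load(f)
    # phenotypicFeatures holds HPO-coded observations, e.g. HP:0001250 "Seizure".
    labels = [feat["type"]["label"] for feat in pp.get("phenotypicFeatures", [])
              if not feat.get("excluded", False)]
    return ("A patient presents with the following clinical features: "
            + "; ".join(labels)
            + ". List the most likely genetic diagnoses, ranked from most "
              "to least likely.")

def top_k_accuracy(ranked_ids: list[list[str]], truths: list[str], k: int = 1) -> float:
    # ranked_ids[i] is the model's ranked list of disease IDs for case i.
    hits = sum(truth in ranked[:k] for ranked, truth in zip(ranked_ids, truths))
    return hits / len(truths)
```

Under this scheme, the abstract's headline numbers correspond to top_k_accuracy(..., k=1): 0.236 for the best LLM versus 0.355 for Exomiser.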
medRxiv (Cold Spring Harbor Laboratory), Journal year: 2025, Issue: unknown. Published: Feb. 28, 2025
Large language models (LLMs) are increasingly used in the medical field for diverse applications including differential diagnostic support. The estimated training data used to create LLMs such as the Generative Pretrained Transformer (GPT) models predominantly consist of English-language texts, but the models could be used across the globe to support diagnostics if language barriers could be overcome. Initial pilot studies on the utility of LLMs for differential diagnosis in languages other than English have shown promise, but a large-scale assessment of the relative performance of these models in a variety of European and non-European languages on a comprehensive corpus of challenging rare-disease cases is lacking. We created 4967 clinical vignettes using structured data captured with Human Phenotype Ontology (HPO) terms in the Global Alliance for Genomics and Health (GA4GH) Phenopacket Schema. These vignettes span a total of 378 distinct genetic diseases and 2618 associated phenotypic features. We used translations of the HPO together with language-specific templates to generate prompts in English, Chinese, Czech, Dutch, German, Italian, Japanese, Spanish, and Turkish. We applied GPT-4o, version gpt-4o-2024-08-06, to the task of delivering a ranked differential diagnosis in response to a zero-shot prompt. An ontology-based approach using the Mondo disease ontology was applied to map synonyms and disease subtypes to the correct diagnoses in order to automate the evaluation of the LLM responses. For English, GPT-4o placed the correct diagnosis at the first rank in 19·8% of cases and within the top-3 ranks 27·0% of the time. In comparison, for the eight non-English languages tested here, the correct diagnosis was placed at rank 1 in between 16·9% and 20·5% of cases and within the top-3 ranks in between 25·3% and 27·7% of cases. Performance was consistent across the nine languages tested. This suggests that GPT-4o may be useful for diagnostic support in multilingual settings. This work was supported by NHGRI grants 5U24HG011449 and 5RM1HG010860. P.N.R. was supported by a Professorship of the Alexander von Humboldt Foundation; P.L. was supported by a National Grant (PMP21/00063 ONTOPREC-ISCIII, Fondos FEDER).
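The ontology-based evaluation step can be pictured as follows: a free-text answer counts as correct if it matches the ground-truth Mondo term, one of its synonyms, or a descendant (subtype). The sketch below uses a hand-made toy ontology for illustration; the real pipeline would query the full Mondo ontology.

```python
# Illustrative sketch (not the authors' pipeline) of ontology-based answer
# matching: an LLM's free-text diagnosis counts as correct if it matches the
# ground-truth Mondo term, a synonym, or a descendant (subtype).
# The toy ontology below is hand-made for demonstration.
TOY_MONDO = {
    "MONDO:0007739": {
        "label": "Huntington disease",
        "synonyms": {"huntington's disease", "huntington chorea"},
        "children": {"MONDO:0024237"},  # hypothetical subtype link
    },
    "MONDO:0024237": {
        "label": "juvenile Huntington disease",
        "synonyms": {"juvenile-onset huntington disease"},
        "children": set(),
    },
}

def descendants(mondo_id: str) -> set[str]:
    out, stack = set(), [mondo_id]
    while stack:
        cur = stack.pop()
        for child in TOY_MONDO[cur]["children"]:
            if child not in out:
                out.add(child)
                stack.append(child)
    return out

def matches_truth(llm_answer: str, truth_id: str) -> bool:
    answer = llm_answer.strip().lower()
    for node in {truth_id} | descendants(truth_id):
        entry = TOY_MONDO[node]
        if answer == entry["label"].lower() or answer in entry["synonyms"]:
            return True
    return False

assert matches_truth("Juvenile Huntington disease", "MONDO:0007739")
```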
npj Digital Medicine, Journal year: 2025, Issue: 8(1). Published: March 22, 2025
Abstract
While generative artificial intelligence (AI) has shown potential in medical diagnostics, comprehensive evaluation of its diagnostic performance and comparison with physicians has not been extensively explored. We conducted a systematic review and meta-analysis of studies validating generative AI models for diagnostic tasks published between June 2018 and June 2024. Analysis of 83 studies revealed an overall diagnostic accuracy of 52.1%. No significant performance difference was found between generative AI and physicians overall (p = 0.10) or between generative AI and non-expert physicians (p = 0.93). However, generative AI performed significantly worse than expert physicians (p = 0.007). Several models demonstrated slightly higher accuracy compared to non-expert physicians, although the differences were not significant. Generative AI demonstrates promising diagnostic capabilities, with accuracy varying by model. Although it has not yet achieved expert-level reliability, these findings suggest potential for enhancing healthcare delivery and medical education when implemented with an appropriate understanding of its limitations.
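For readers unfamiliar with how a single overall accuracy such as 52.1% is pooled across dozens of heterogeneous studies, below is a hedged sketch of a DerSimonian-Laird random-effects model on logit-transformed proportions, a common choice for meta-analysis of proportions; the study counts are invented, and the paper's exact method may differ.

```python
# Hedged sketch of pooling per-study accuracies with a DerSimonian-Laird
# random-effects model on logit-transformed proportions. The study data
# below are invented; this is not the paper's code.
import math

studies = [(41, 80), (55, 100), (30, 65)]  # (correct, total) per hypothetical study

logits, variances = [], []
for correct, total in studies:
    p = correct / total
    logits.append(math.log(p / (1 - p)))
    variances.append(1 / correct + 1 / (total - correct))  # variance of the logit

# Fixed-effect weights, then DerSimonian-Laird between-study variance tau^2.
w = [1 / v for v in variances]
mean_fe = sum(wi * yi for wi, yi in zip(w, logits)) / sum(w)
q = sum(wi * (yi - mean_fe) ** 2 for wi, yi in zip(w, logits))
c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
tau2 = max(0.0, (q - (len(studies) - 1)) / c)

# Random-effects pooled logit, back-transformed to a proportion.
w_re = [1 / (v + tau2) for v in variances]
pooled_logit = sum(wi * yi for wi, yi in zip(w_re, logits)) / sum(w_re)
pooled_accuracy = 1 / (1 + math.exp(-pooled_logit))
print(f"pooled accuracy = {pooled_accuracy:.1%}")
```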
Research Square (Research Square), Journal year: 2025, Issue: unknown. Published: April 15, 2025
Abstract
Large language models (LLMs) possess extensive medical knowledge and demonstrate impressive performance in answering diagnostic questions. However, responding to such questions differs significantly from actual clinical procedures. Real-world diagnostics involve a dynamic, iterative process that includes hypothesis refinement and targeted data collection. This complex task is both challenging and time-consuming, demanding a significant portion of the clinical workload and healthcare resources. Therefore, evaluating and enhancing LLM performance in real-world diagnostic procedures is crucial for their deployment. In this study, a framework was developed to assess LLMs' capability to complete clinical encounters, including taking a history, recommending physical examinations and tests, and reaching a diagnosis. A benchmark dataset of 4,421 cases was curated, covering rare and common diseases across 32 specialties. Clinical evaluation methods were used to comprehensively evaluate the performance of GPT-4o-mini, GPT-4o, Claude-3-Haiku, Qwen2.5-72b, Qwen2.5-34b, and Qwen2.5-14b. Although these models performed well on diagnostic questions, they consistently underperformed in complete clinical encounters and exhibited a number of errors. To address these challenges, ClinDiag-GPT was trained on over 8,000 clinical cases. It emulates physicians' diagnostic reasoning, collects information in line with clinical practice, and recommends key tests and definitive diagnoses. ClinDiag-GPT outperformed the other LLMs in both diagnostic accuracy and procedural performance. We further compared ClinDiag-GPT alone, ClinDiag-GPT in collaboration with physicians, and physicians alone. Collaboration between ClinDiag-GPT and physicians enhanced diagnostic accuracy and efficiency, demonstrating ClinDiag-GPT's potential as a valuable diagnostic assistant.
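A framework like the one described evaluates multi-turn behavior rather than single-shot answers. The sketch below shows the general shape of such an encounter loop, assuming an OpenAI-compatible chat API; the system prompt, case record, and stopping rule are invented for illustration and are not the study's protocol.

```python
# Minimal sketch of a multi-turn diagnostic encounter loop of the kind such a
# framework evaluates: the model iteratively requests information until it
# commits to a diagnosis. Assumes an OpenAI-compatible chat API; the prompts,
# case record, and stopping rule are invented.
from openai import OpenAI

client = OpenAI()
case = {"chief complaint": "progressive dyspnea over 3 months",
        "history": "smoker, 40 pack-years", "spirometry": "FEV1/FVC 0.55"}

messages = [{"role": "system", "content":
             "You are a diagnostician. Ask for ONE item of history, exam, or "
             "testing per turn, or reply 'FINAL: <diagnosis>' when confident."}]
messages.append({"role": "user",
                 "content": f"Chief complaint: {case['chief complaint']}"})

for _ in range(10):  # cap the encounter length
    reply = client.chat.completions.create(
        model="gpt-4o-mini", messages=messages).choices[0].message.content
    messages.append({"role": "assistant", "content": reply})
    if reply.startswith("FINAL:"):
        print("Diagnosis:", reply.removeprefix("FINAL:").strip())
        break
    # Simulated patient/examiner: answer from the case record if available.
    answer = next((v for k, v in case.items() if k in reply.lower()),
                  "Not available.")
    messages.append({"role": "user", "content": answer})
```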
JAMA Network Open, Journal year: 2025, Issue: 8(4), P. e256359 - e256359. Published: April 22, 2025
Importance
Large language models (LLMs) are being implemented in health care. Enhanced accuracy and methods to maintain accuracy over time are needed to maximize LLM benefits.
Objective
To evaluate whether LLM performance on the US Medical Licensing Examination (USMLE) can be improved by including formally represented semantic clinical knowledge.
Design, Setting, and Participants
This comparative effectiveness research study was conducted between June 2024 and February 2025 at the Department of Biomedical Informatics, Jacobs School of Medicine and Biomedical Sciences, University at Buffalo, New York, using sample questions from USMLE Steps 1, 2, and 3.
Intervention
Semantic clinical artificial intelligence (SCAI) was developed to insert formally represented semantic clinical knowledge into LLMs using retrieval augmented generation (RAG).
Main Outcomes and Measures
The SCAI method was evaluated by comparing 3 Llama LLMs (13B, 70B, and 405B; Meta) with and without RAG-supplied text-based semantic clinical knowledge for answering the sample questions. Accuracy was determined by comparing LLM output with the answer key.
Results
The models were tested on 87 Step 1, 103 Step 2, and 123 Step 3 sample questions. The 13B model enhanced with RAG was associated with significantly improved performance on Step 1, but it met the 60% passing threshold only on Step 3 (74 correct answers [60.2%]). The 70B and 405B models enhanced with RAG passed all steps. The 70B model answered 80 Step 1 questions (92.0%), 82 Step 2 questions (79.6%), and 112 Step 3 questions (91.1%) correctly; the 405B model answered 79 Step 1 questions (90.8%), 87 Step 2 questions (84.5%), and 117 Step 3 questions (95.1%) correctly. Significant improvements with RAG were also found on Step 3, where the larger parameter models performed better than the 13B model, although the 405B model did not perform better than the 70B model.
Conclusions and Relevance
In this comparative effectiveness study, LLM scores on the USMLE improved with RAG-supplied semantic clinical knowledge, performing as well as or better than the models without augmentation. Formal, semantic forms of reasoning in LLMs, like human reasoning, have the potential to improve performance on important medical questions. Improving the accuracy of LLMs in health care with targeted, up-to-date clinical knowledge is an important step toward implementation and acceptance.
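As a hedged sketch of the retrieval augmented generation idea behind SCAI, the code below retrieves the most relevant knowledge statements for a question and prepends them to the prompt. TF-IDF retrieval stands in for the paper's formally represented semantic knowledge base, and the facts and question are invented.

```python
# Minimal retrieval-augmented generation (RAG) sketch: retrieve the most
# relevant clinical knowledge statements for a question and prepend them to
# the prompt. TF-IDF retrieval is a stand-in for the paper's semantic
# knowledge base; the statements and question are invented.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

knowledge = [
    "Beta-blockers reduce mortality after myocardial infarction.",
    "Metformin is first-line pharmacotherapy for type 2 diabetes.",
    "ACE inhibitors can cause hyperkalemia in renal impairment.",
]
question = "Which drug class improves survival following myocardial infarction?"

vectorizer = TfidfVectorizer().fit(knowledge + [question])
scores = cosine_similarity(vectorizer.transform([question]),
                           vectorizer.transform(knowledge))[0]
top_facts = [knowledge[i] for i in scores.argsort()[::-1][:2]]  # top-2 passages

prompt = ("Use the following facts when answering.\n"
          + "\n".join(f"- {fact}" for fact in top_facts)
          + f"\nQuestion: {question}\nAnswer:")
print(prompt)  # this augmented prompt would then be sent to the LLM
```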
Hepatology Communications, Journal year: 2025, Issue: 9(5). Published: April 30, 2025
Large language models like ChatGPT have demonstrated potential in medical image interpretation, but their efficacy in liver histopathological analysis remains largely unexplored. This study aims to assess ChatGPT-4-vision's diagnostic accuracy, compared with pathologists' performance, in evaluating fibrosis (stage) in metabolic dysfunction-associated steatohepatitis. Digitized Sirius Red-stained images for 59 steatohepatitis tissue biopsy specimens were evaluated by ChatGPT-4 and 4 pathologists using the NASH-CRN staging system. Fields of view at increasing magnification levels, either extracted by a senior pathologist or randomly selected, were shown to ChatGPT-4, asking it to assign the fibrosis staging. The accuracy of ChatGPT-4 was compared with the pathologists' evaluations and correlated with the collagen proportionate area for additional insights. All cases were further analyzed with an in-context learning approach, where the model learns from exemplary cases provided during prompting. ChatGPT-4's accuracy was 81% when fields of view were selected by the pathologist, while it decreased to 54% with randomly cropped fields of view. By employing in-context learning, accuracy increased to 88% and 77% for pathologist-selected and random fields of view, respectively. This method enabled ChatGPT-4 to fully correctly identify structures characteristic of F4 stages that were previously misclassified. The analysis also highlighted a moderate to strong correlation between ChatGPT-4's evaluations and the collagen proportionate area. ChatGPT-4 showed remarkable results, overlapping those of expert pathologists. In-context learning analysis, applied here for the first time to collagen deposition in histological samples, was crucial for accurately identifying key features of cirrhotic cases, critical for early therapeutic decision-making. These findings suggest promise for integrating large language models as supportive tools in pathology.
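The in-context learning setup can be sketched as a few-shot vision prompt: exemplar fields of view with known stages precede the query image. The code below assumes an OpenAI-compatible vision chat API; the file names, exemplar stages, and instruction text are invented, and this is not the authors' protocol.

```python
# Hedged sketch of few-shot in-context prompting for image staging: exemplar
# images with known fibrosis stages are included in the prompt before the
# query image. File names and labels are hypothetical.
import base64
from openai import OpenAI

def image_part(path: str) -> dict:
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    return {"type": "image_url",
            "image_url": {"url": f"data:image/png;base64,{b64}"}}

content = [{"type": "text", "text":
            "Stage liver fibrosis on Sirius Red-stained fields of view using "
            "the NASH-CRN system (F0-F4). Examples follow, then the query."}]
for path, stage in [("example_f1.png", "F1"), ("example_f4.png", "F4")]:
    content.append(image_part(path))                      # exemplar image
    content.append({"type": "text", "text": f"Stage: {stage}"})
content.append(image_part("query_case.png"))              # case to stage
content.append({"type": "text", "text": "Stage:"})

client = OpenAI()
reply = client.chat.completions.create(
    model="gpt-4o", messages=[{"role": "user", "content": content}])
print(reply.choices[0].message.content)
```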