Integration of Generative AI System to IoT Based Healthcare Systems 5.0
Janjhyam Venkata Naga Ramesh, Veera Talukdar, Ardhariksa Zukhruf Kurniullah et al.

Studies in Systems, Decision and Control, Journal Year: 2024, Volume and Issue: unknown, P. 199 - 217

Published: Jan. 1, 2024

Language: English

A Preliminary Checklist (METRICS) to Standardize the Design and Reporting of Studies on Generative Artificial Intelligence–Based Models in Health Care Education and Practice: Development Study Involving a Literature Review
Malik Sallam, Muna Barakat, Mohammed Sallam et al.

Interactive Journal of Medical Research, Journal Year: 2024, Volume and Issue: 13, P. e54704

Published: Jan. 26, 2024

Background: Adherence to evidence-based practice is indispensable in health care. Recently, the utility of generative artificial intelligence (AI) models in health care has been evaluated extensively. However, the lack of consensus guidelines on the design and reporting of findings of these studies poses a challenge for the interpretation and synthesis of evidence. Objective: This study aimed to develop a preliminary checklist to standardize the design and reporting of generative AI-based studies in health care education and practice. Methods: A literature review was conducted in Scopus, PubMed, and Google Scholar. Published records with "ChatGPT," "Bing," or "Bard" in the title were retrieved. Careful examination of the methodologies employed in the included records was conducted to identify common pertinent themes and possible gaps in reporting. A panel discussion was held to establish a unified and thorough checklist for the reporting of generative AI studies in health care. The finalized checklist was used to evaluate the included records by 2 independent raters. Cohen κ was used as the method to assess interrater reliability. Results: The final data set that formed the basis for theme identification and analysis comprised a total of 34 records. The 9 identified themes were collectively referred to as METRICS (Model, Evaluation, Timing, Range/Randomization, Individual factors, Count, and Specificity of prompts and language). Their details are as follows: (1) Model used and its exact settings; (2) Evaluation approach for the generated content; (3) Timing of testing the model; (4) Transparency of the data source; (5) Range of tested topics; (6) Randomization of selecting the queries; (7) Individual factors in selecting the queries and the interrater reliability; (8) Count of queries executed to test the model; and (9) Specificity of the prompts and language used. The overall mean METRICS score was 3.0 (SD 0.58). The interrater reliability was acceptable, with a range of 0.558 to 0.962 (P<.001 for all items). With classification per item, the highest average score was recorded for the "Model" item, followed by "Specificity," while the lowest scores were recorded for the "Randomization" item (classified as suboptimal) and the "Individual factors" item (classified as satisfactory). Conclusions: The METRICS checklist can facilitate guiding researchers toward best practices in reporting their results. The findings highlight the need for standardized approaches to generative AI studies in health care; considering the variability observed across methodologies, the proposed checklist could be a helpful base for a universally accepted standard in what is a swiftly evolving research topic.

Language: English

Citations: 24

A framework for human evaluation of large language models in healthcare derived from literature review

Thomas Yu Chow Tam, Sonish Sivarajkumar, Sumit Kapoor et al.

npj Digital Medicine, Journal Year: 2024, Volume and Issue: 7(1)

Published: Sept. 28, 2024

Abstract: With generative artificial intelligence (GenAI), particularly large language models (LLMs), continuing to make inroads in healthcare, assessing LLMs with human evaluations is essential to assuring safety and effectiveness. This study reviews the existing literature on human evaluation methodologies for LLMs in healthcare across various medical specialties and addresses factors such as evaluation dimensions, sample types and sizes, selection and recruitment of evaluators, frameworks and metrics, the evaluation process, and the type of statistical analysis. Our review of 142 studies shows gaps in the reliability, generalizability, and applicability of current human evaluation practices. To overcome these significant obstacles to LLM developments and deployments, we propose QUEST, a comprehensive and practical framework covering three phases of workflow: Planning, Implementation and Adjudication, and Scoring and Review. QUEST is designed around five proposed evaluation principles: Quality of Information, Understanding and Reasoning, Expression Style and Persona, Safety and Harm, and Trust and Confidence.

Language: English

Citations: 17

Bibliometric top ten healthcare-related ChatGPT publications in the first ChatGPT anniversary
Malik Sallam

Narra J, Journal Year: 2024, Volume and Issue: 4(2), P. e917

Published: Aug. 5, 2024

Since its public release on November 30, 2022, ChatGPT has shown promising potential in diverse healthcare applications despite ethical challenges, privacy issues, and possible biases. The aim of this study was to identify and assess the most influential publications in the field of ChatGPT utility in healthcare using bibliometric analysis. The study employed an advanced search of three databases, Scopus, Web of Science, and Google Scholar, to identify ChatGPT-related records in healthcare education, research, and practice between 27 … 2023. The ranking was based on the retrieved citation count in each database. The additional alternative metrics that were evaluated included (1) Semantic Scholar highly influential citations, (2) PlumX captures, (3) PlumX mentions, (4) PlumX social media, and (5) Altmetric Attention Scores (AASs). A total of 22 unique records, published in 17 different scientific journals from 14 publishers, were identified across the three databases. Only two records were listed in the top 10 across all three databases. Variable publication types were identified, with the most common being editorial/commentary (n=8/22, 36.4%). Nine records had corresponding authors affiliated with institutions in the United States (40.9%). The citation count range varied per database, with the highest in Google Scholar (1019-121), followed by Scopus (242-88) and Web of Science (171-23). The citation counts correlated significantly with the following metrics (Spearman's correlation coefficient ρ=0.840, …

Language: English

Citations: 9

The Role of Artificial Intelligence in the Primary Prevention of Common Musculoskeletal Diseases

Selkin Yilmaz Muluk, Nazli Olcucu

Cureus, Journal Year: 2024, Volume and Issue: unknown

Published: July 25, 2024

Musculoskeletal disorders (MSDs) are a leading cause of disability worldwide, with a growing burden across all demographics. With advancements in technology, conversational artificial intelligence (AI) platforms such as ChatGPT (OpenAI, San Francisco, CA) have become instrumental in disseminating health information. This study evaluated the effectiveness of ChatGPT versions 3.5 and 4 in delivering primary prevention information for common MSDs, emphasizing that the focus is on prevention, not diagnosis.

Language: English

Citations: 8

Human versus Artificial Intelligence: ChatGPT-4 Outperforming Bing, Bard, ChatGPT-3.5 and Humans in Clinical Chemistry Multiple-Choice Questions
Malik Sallam, Khaled Al‐Salahat, Huda Eid et al.

Advances in Medical Education and Practice, Journal Year: 2024, Volume and Issue: 15, P. 857 - 871

Published: Sept. 1, 2024

Artificial intelligence (AI) chatbots excel in language understanding and generation. These models can transform healthcare education and practice. However, it is important to assess the performance of such AI models in various topics to highlight their strengths and possible limitations. This study aimed to evaluate the performance of ChatGPT (GPT-3.5 and GPT-4), Bing, and Bard compared with human students at a postgraduate master's level in Medical Laboratory Sciences.

Language: English

Citations: 7

Comparative Analysis of Artificial Intelligence Platforms: ChatGPT-3.5 and GoogleBard in Identifying Red Flags of Low Back Pain

Selkin Yilmaz Muluk, Nazli Olcucu

Cureus, Journal Year: 2024, Volume and Issue: unknown

Published: July 1, 2024

Background: Low back pain (LBP) is a prevalent healthcare concern that is frequently responsive to conservative treatment. However, it can also stem from severe conditions, marked by 'red flags' (RF) such as malignancy, cauda equina syndrome, fractures, infections, spondyloarthropathies, and aneurysm rupture, which physicians should be vigilant about. Given the increasing reliance on online health information, this study assessed ChatGPT-3.5's (OpenAI, San Francisco, CA, USA) and GoogleBard's (Google, Mountain View, CA, USA) accuracy in responding to RF-related LBP questions and their capacity to discriminate the severity of the condition. Methods: We created 70 questions on RF symptoms and diseases following guidelines. Among them, 58 addressed a single symptom (SS) and 12 addressed multiple symptoms (MS) of LBP. The questions were posed to ChatGPT and GoogleBard, and the responses were evaluated by two authors for accuracy, completeness, and relevance (ACR) using 5-point rubric criteria. Results: Cohen's kappa values (0.60-0.81) indicated significant agreement between the authors. The average scores ranged from 3.47 to 3.85 for ChatGPT-3.5 and from 3.36 to 3.76 for GoogleBard on SS questions, and from 4.04 to 4.29 for ChatGPT-3.5 and from 3.50 to 3.71 for GoogleBard on MS questions. The ratings for these responses ranged from 'good' to 'excellent'. Most responses effectively conveyed the severity of the situation (93.1% for ChatGPT-3.5, 94.8% for GoogleBard), though not all did so. No statistically significant differences were found between the models (p>0.05). Conclusions: In an era characterized by widespread online information seeking, artificial intelligence (AI) systems can play a vital role in delivering precise medical information. These technologies may hold promise in this field if they continue to improve.

Language: English

Citations: 5

Language discrepancies in the performance of generative artificial intelligence models: an examination of infectious disease queries in English and Arabic
Malik Sallam, Kholoud Al-Mahzoum, Omaima Alshuaib et al.

BMC Infectious Diseases, Journal Year: 2024, Volume and Issue: 24(1)

Published: Aug. 8, 2024

Assessment of artificial intelligence (AI)-based models across languages is crucial to ensure equitable access and accuracy of information in multilingual contexts. This study aimed to compare AI model efficiency in English and Arabic for infectious disease queries.

Language: English

Citations: 4

The performance of OpenAI ChatGPT-4 and Google Gemini in virology multiple-choice questions: a comparative analysis of English and Arabic responses
Malik Sallam, Kholoud Al-Mahzoum, Rawan Ahmad Almutawaa et al.

BMC Research Notes, Journal Year: 2024, Volume and Issue: 17(1)

Published: Sept. 3, 2024

Language: English

Citations: 4

Qualitative metrics from the biomedical literature for evaluating large language models in clinical decision-making: a narrative review

Cindy Ho, Tiffany Tian, Alessandra T. Ayers et al.

BMC Medical Informatics and Decision Making, Journal Year: 2024, Volume and Issue: 24(1)

Published: Nov. 26, 2024

Large language models (LLMs), most notably ChatGPT, released since November 30, 2022, have prompted shifting attention to their use in medicine, particularly for supporting clinical decision-making. However, there is little consensus in the medical community on how LLM performance in clinical contexts should be evaluated. We performed a literature review of PubMed to identify publications between December 1, 2022, and April 2024 that discussed assessments of LLM-generated diagnoses or treatment plans. We selected 108 relevant articles for the analysis. The most frequently used LLMs were GPT-3.5, GPT-4, Bard, LLaMa/Alpaca-based models, and Bing Chat. The five most frequently used criteria for scoring LLM outputs were "accuracy," "completeness," "appropriateness," "insight," and "consistency." Criteria defining high-quality responses have not been applied consistently by researchers over the past 1.5 years. We identified a high degree of variation in how studies reported their findings and assessed LLM performance. Standardized reporting of the qualitative evaluation metrics used to assess output quality can be developed to facilitate research on LLMs in healthcare.

Language: English

Citations: 4

A Comparative Analysis of Artificial Intelligence Platforms: ChatGPT-4o and Google Gemini in Answering Questions About Birth Control Methods

Erhan Muluk

Cureus, Journal Year: 2025, Volume and Issue: unknown

Published: Jan. 1, 2025

Background: Birth control methods (BCMs) are often underutilized or misunderstood, especially among young individuals entering their reproductive years. With the growing reliance on artificial intelligence (AI) platforms for health-related information, this study evaluates the performance of ChatGPT-4o and Google Gemini in addressing commonly asked questions about BCMs. Methods: Thirty questions, derived from the American College of Obstetricians and Gynecologists (ACOG) website, were posed to both AI platforms. The questions spanned four categories: general contraception, specific contraceptive types, emergency contraception, and other topics. Responses were evaluated using a five-point rubric assessing Relevance, Completeness, and Lack of False Information (RCL). Overall scores were calculated by averaging the scores. Statistical analysis, including the Wilcoxon Signed-Rank, Friedman, and Kruskal-Wallis tests, was performed to compare the metrics. Results: Both platforms provided high-quality responses to birth control-related queries, with overall scores of 4.38 ± 0.58 and 4.37 ± 0.52, respectively, categorized as "very good" to "excellent." One platform demonstrated a higher lack-of-false-information score based on descriptive statistics (4.70 ± 0.60 vs. 4.47 ± 0.73), while the other outperformed on relevance, with a statistically significant difference (4.53 ± 0.57 vs. 4.30 ± 0.70, p = 0.035, large effect size). Completeness was comparable (p = 0.655). Category-based analyses revealed no significant differences (p = 0.548), though a potential trend toward stronger performance in the "Other Topics" category was observed. Within-model variability showed more pronounced differences across metrics for one platform (moderate effect size, Kendall's W = 0.357) and smaller differences for the other (Kendall's W = 0.165). These findings suggest that both platforms offer reliable and complementary tools for addressing knowledge gaps, with nuanced strengths that warrant further exploration. Conclusions: ChatGPT-4o and Google Gemini provided accurate responses to BCM-related questions, with slight differences in their strengths. The findings underscore the potential of AI tools to address public health information needs, particularly among individuals seeking guidance on contraception. Further studies with larger datasets may elucidate differences between the platforms.

Language: English

Citations: 0