Cited by Prompt Framework for Extracting Scale-Related Knowledge Entities from Chinese Medical Literature: Development and Evaluation Study (Preprint)

Natural language processing for chest X‐ray reports in the transformer era: BERT‐like encoders for comprehension and GPT‐like decoders for generation DOI

Han Yuan

iRadiology, Journal Year: 2025, Volume and Issue: unknown

Published: Jan. 6, 2025

We conducted a comprehensive literature search in PubMed to illustrate the current landscape of transformer-based tools from perspective transformer's two integral components: encoder exemplified by BERT and decoder characterized GPT. Also, we discussed adoption barriers potential solutions terms computational burdens, interpretability concerns, ethical issues, hallucination problems, malpractice, legal liabilities. hope that this commentary will serve as foundational introduction for radiologists seeking explore evolving technical chest X-ray report analysis transformer era. Natural language processing (NLP) has gained widespread use computer-assisted (CXR) analysis, particularly since renaissance deep learning (DL) 2012 ImageNet challenge. While early endeavors predominantly employed recurrent neural networks (RNN) convolutional (CNN) [1], revolution is brought [2] its success can be attributed three key factors [3]. First, self-attention mechanism enables simultaneous multiple parts an input sequence, offering significantly greater efficiency compared earlier models such RNN [4]. Second, architecture exhibits exceptional scalability, supporting with over 100 billion parameters capture intricate linguistic relationships human [5]. Third, availability vast internet-based corpus advances power have made pre-training fine-tuning large-scale feasible [6]. The development resolution previously intractable problems achieves expert-level performance across broad range CXR analytical tasks, name entity recognition, question answering, extractive summarization [7]. In commentary, (Figure 1) landscape, barriers, handling comprehension managing generation. As our primary focus NLP, classification criteria or was based on text modules excluded research purely focusing vision transformers (ViT). Literature pipeline identify relevant articles published June 12, 2017, when model first introduced, October 4, 2024. followed previous systematic reviews [3, 8, 9] design groups keywords: (1) "transformer"; (2) "clinical notes", reports", narratives", text", "medical text"; (3) "natural processing", "text mining", "information extraction"; (4) "radiography", "chest film", radiograph", "radiograph", "X-rays". means communication between referring physicians, reports contain high-density information patients' conditions [10]. Much like physicians interpreting reports, step NLP understanding content important application explicitly converting it into format suitable subsequent tasks. One notable [11], which stands bidirectional representations transformers. contrast predecessors rely large amounts expert annotations supervised [12], undergoes self-supervised training unlabeled datasets understand patterns subsequently fine-tuned small set target task [12, 13], yielding superior [14], recognition [15], [16], semantics optimization [17]. context healthcare, Olthof et al. [18] built evaluate varying complexities, disease prevalence, sample sizes, demonstrating statistically outperformed conventional DL CNN, area under curve F1-score, t-test p-values less than 0.05. Beyond models, adapting domain-specific further enhance effectiveness various Yan [19] adapted four BERT-like encoders using millions radiology tackle tasks: identifying sentences describe abnormal findings, assigning diagnostic codes, extracting summarize reports. Their results demonstrated adaptation yielded significant improvements accuracy, ROUGE metrics all Most BERT-relevant studies sentence-, paragraph-, report-level predictions, while are also well-suited word-level pattern recognition. Chambon [20] leveraged [21], biomedical-specific BERT, probability individual tokens containing protected health information, replaced identified sensitive synthetic surrogates ensure privacy preservation. Similarly, Weng [22] developed system utilizing ALBERT [23], lite reduced parameters, keywords unrelated thereby reducing false-positive alarms outperforming regular expression-, syntactic grammar-, DL-based baselines. BERT-derived labels applied develop targeting other modalities 13]. Nowak [24] systematically explored utility BERT-generated silver linked them corresponding radiographs image classifiers. Compared trained exclusively radiologist-annotated gold labels, integrating led improved discriminability. macro-averaged synchronous proved effective settings limited whereas silver, better cases abundant labels. Zhang [25] introduced novel approach more generalizable classifiers, rather relying predefined categories: first, they used extract entities relationships; second, constructed knowledge graph these extractions; third, refined their domain expertise. Unlike traditional multiclass established not only categorized each but revealed interpretable categories, those linking anatomical regions signs. addition deriving advanced capabilities unprecedented innovation: direct supervision pixel-level segmentation medical [26]. Li [26] proposed text-augmented lesion paradigm integrated BERT-based textual compensate deficiency radiograph quality refine pseudo semi-supervision. These highlight strength comprehending healthcare-related annotation systems multi-modality beyond text. Meanwhile, researchers failures complex clinical Sushil [27] implementations inference achieved test accuracy 0.778. adaptations textbooks 0.833, still fell short experts. Potential limitations lie relatively modest parameter size, although larger reliance inadequate corpora, books, Wikipedia, selected databases [28]. Consequently, ability learn remains constrained. shortcomings being alleviated GPT-like decoders, incorporate hundreds billions internet-scale corpora [29]. Following advent encoders, generative pre-trained (GPT) [30], next groundbreaking leap, breaks enabling non-experts perform tasks through freely conversational without any coding. CvT2DistilGPT2 [31], prominent generator era, utilizes ViT GPT-2 decoder. experiments indicated CNN GPT surpassed encoder–decoder architectures specific generation applications, state-of-the-art methods integrate decoders. TranSQ [32] framework. emulates reasoning process generating reports: formulating hypothesis embeddings represent implicit intentions, querying visual features extracted synthesizing semantic cross-modality fusion, transforming candidate DistilGPT [33]. Finally, attained BLEU-4 score 0.205 0.409. comparison, best-performing baseline among 17 retrieval 0.188 0.383, highlighting capability unified multi-modality. Though decoders dominated general domain, family long short-term memory (LSTM) [34] good partially because highly templated characteristics [32]. Kaur Mittal [35] classical architectures, feature extraction, LSTM token They modules, generate numerical inputs prior shortlist disease-relevant afterward. Results presented solution 0.767 0.897, suggesting approaches remain viable backbone scenarios. quantitative comparing outputs ground truth model-generated should supplemented evaluation Boag [36] study automated generation, divergence accuracy. A discrepancy readability been reported [37]. Accordingly, emphasize involvement rating correctness readability. sections, reviewed applications Although remarkable well-established, face problems. Some integration specialized expertise [31, 38], others necessitate resolution. demands era substantial. For example, version contains 334 million GPT-3 175 billion. contrast, support vector machines [39] random forests [40], require few hundred thousand parameters. result, many healthcare providers cannot afford costs tailoring scratch. To address this, offer several recommendations. development, suggest leverage open-access building fine-tuning, considering scales, recommend parameter-efficient technique updates subset model's leaving majority weights unchanged [41]. An exemplificative Taylor [42] empirically validated techniques within domain. advocate prompt engineering techniques, retrieval-augmented crafting informative instructive guide decoders' output changing [43]. Ranjit [44] method retrieve most contextual prompts concise accurate retaining critical entities. Last least, obtaining approval ethics committees share anonymous data facilitate collaboration external partners, helping alleviate resource burdens. including both where decisions directly impact lives. often regarded black-box simple render explainable modern layers neurons dissected visualized, providing insights functionality [45-48]. behavior challenge due complexity associated exponential scaling neuron numbers [49]. though internal activations challenging interpret, preliminary analyzing influence high degree alignment assessments [50, 51]. lies flexibility align instructions. This allows users obtain expected request explanations outputs, fostering enhanced usability [52, 53]. readers overview detailed insights, surveys [54-56]. considerations paramount transformers, given powerful nuanced datasets. concerns pressing private representative population. patient privacy, anonymizing during deployment stages neither learned [57] nor inadvertently disclosed certain [58]. Dataset representativeness issue, underrepresentation minority exacerbate disparities perpetuate inequities [59]. mitigate risk, developers prioritize inclusivity collection, maintainers continuously monitor equitable outcomes [60]. Fourth, coherent responses diverse user solving wide [61], predictive internet instead radiological well-defined logic [62]. Therefore, continue suffer hallucinations, phenomenon appears plausible factually incorrect, nonsensical, users' [63]. Current efforts broadly post-training stages. During training, strategies include in-house reinforcement guided radiologists' feedback 64]. Post-training encompass detection, knowledge, multi-agent collaboration, radiologist-in-the-loop frameworks [62, 65]. Due space constraints, encourage refer 66-68] strategies. Lastly, even after refinements, may present risks potentially leading errors liabilities [69]. Errors arise sources, inaccurate clinician nonadherence correct recommendations, poor workflows [70]. determining responsibility adverse issue stakeholders, software developers, maintenance teams, departments, [71]. European Commission focuses safety liability implications artificial intelligence, applies device laws demonstrates generally falls civil product Civil typically pertains developers. However, stops strict definitive framework inherent ambiguity algorithms questions surrounding likely addressed courts case law. Under existing frameworks, follow standard care, supplementary confirmatory substitutes practice beneficial stakeholders Additionally, departments implement tools, involve radiologists, throughout entire cycle [72], prepare in-depth programs familiarize differ routine statistical tests black boxes resist full interpretation [73]. Moreover, expectations important: unrealistic optimism, seen replacement expertise, undue pessimism, perceived no utility, avoided [74-77]. Han Yuan: Conceptualization; curation; formal analysis; investigation; project administration; validation; visualization; writing—original draft; writing—review editing. None. author declares he conflicts interest. exempt review committee does participants, animal subjects, collection. Not applicable. Data sharing apply were generated analyzed.

Language: Английский

Citations

Evaluation of LLMs accuracy and consistency in the registered dietitian exam through prompt engineering and knowledge retrieval DOI

Iman Azimi, Meng Qi, Wang Li

et al.

Scientific Reports, Journal Year: 2025, Volume and Issue: 15(1)

Published: Jan. 9, 2025

Large language models (LLMs) are fundamentally transforming human-facing applications in the health and well-being domains: boosting patient engagement, accelerating clinical decision-making, facilitating medical education. Although state-of-the-art LLMs have shown superior performance several conversational applications, evaluations within nutrition diet still insufficient. In this paper, we propose to employ Registered Dietitian (RD) exam conduct a standard comprehensive evaluation of LLMs, GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro, assessing both accuracy consistency queries. Our includes 1050 RD questions encompassing topics proficiency levels. addition, for first time, examine impact Zero-Shot (ZS), Chain Thought (CoT), with Self Consistency (CoT-SC), Retrieval Augmented Prompting (RAP) on responses. findings revealed that while these obtained acceptable overall performance, their results varied considerably different prompts question domains. GPT-4o CoT-SC prompting outperformed other approaches, whereas Pro ZS recorded highest consistency. For 3.5, CoT improved accuracy, RAP was particularly effective answer Expert level questions. Consequently, choosing appropriate LLM technique, tailored specific domain, can mitigate errors potential risks chatbots.

Language: Английский

Citations

Leveraging Generative AI To Improve Motivation and Retrieval in Higher Education Learners (Preprint) DOI

Noahlana Monzon, Franklin A. Hays

JMIR Medical Education, Journal Year: 2025, Volume and Issue: 11, P. e59210 - e59210

Published: Jan. 2, 2025

Abstract Generative artificial intelligence (GenAI) presents novel approaches to enhance motivation, curriculum structure and development, learning retrieval processes for both learners instructors. Though a focus this emerging technology is academic misconduct, we sought leverage GenAI in facilitate educational outcomes. For instructors, offers new opportunities course design management while reducing time requirements evaluate outcomes personalizing learner feedback. These include innovative instructional designs such as flipped classrooms gamification, enriching teaching methodologies with focused interactive approaches, team-based exercise development among others. learners, unprecedented self-directed opportunities, improved cognitive engagement, effective practices, leading enhanced autonomy, knowledge retention. empowering, evolving landscape has integration challenges ethical considerations, including accuracy, technological evolution, loss of learner’s voice, socioeconomic disparities. Our experience demonstrates that the responsible application GenAI’s settings will revolutionize making education more accessible tailored, producing positive motivational Thus, argue leveraging improve implications extending from primary through higher continuing paradigms.

Language: Английский

Citations

Intuitive Human–Artificial Intelligence Theranostic Complementarity DOI

J. Harvey Turner

Cancer Biotherapy and Radiopharmaceuticals, Journal Year: 2025, Volume and Issue: unknown

Published: Feb. 20, 2025

Deep learning artificial intelligence (AI) algorithms are poised to subsume diagnostic imaging specialists in radiology and nuclear medicine, where radiomics can consistently outperform human analysis reporting capability, do it faster, with greater accuracy reliability. However, claims made for generative AI respect of decision-making the clinical practice theranostic medicine highly contentious. Statistical computer cannot emulate emotion, reason, instinct, intuition, or empathy. simulates without possessing it. has no understanding meaning its outputs. The unique statistical probability attributes large language models must be complemented by innate intuitive capabilities physicians who accept responsibility accountability direct care each individual patient referred management specified cancers. Complementarity envisions synergistic engagement radiomics, genomics, radiobiology, dosimetry, data collation from multidimensional sources, including electronic medical record, enable physician spend informed face time their patient. Together discernment, application technical insights will facilitate optimal formulation a personalized precision strategy empathic, efficacious, targeted treatment cancer accordance wishes.

Language: Английский

Citations

How to Write Effective Prompts for Screening Biomedical Literature Using Large Language Models DOI

Maria Teresa Colangelo, Stefano Guizzardi,

Marco Meleti

et al.

BioMedInformatics, Journal Year: 2025, Volume and Issue: 5(1), P. 15 - 15

Published: March 11, 2025

Large language models (LLMs) have emerged as powerful tools for (semi-)automating the initial screening of abstracts in systematic reviews, offering potential to significantly reduce manual burden on research teams. This paper provides a broad overview prompt engineering principles and highlights how traditional PICO (Population, Intervention, Comparison, Outcome) criteria can be converted into actionable instructions LLMs. We analyze trade-offs between “soft” prompts, which maximize recall by accepting articles unless they explicitly fail an inclusion requirement, “strict” demand explicit evidence every criterion. Using periodontics case study, we illustrate design affects recall, precision, overall efficiency discuss metrics (accuracy, F1 score) evaluate performance. also examine common pitfalls, such overly lengthy prompts or ambiguous instructions, underscore continuing need expert oversight mitigate hallucinations biases inherent LLM outputs. Finally, explore emerging trends, including multi-stage pipelines fine-tuning, while noting ethical considerations related data privacy transparency. By applying rigorous evaluation, researchers optimize LLM-based processes, allowing faster more comprehensive synthesis across biomedical disciplines.

Language: Английский

Citations

Prompt Framework for Extracting Scale-Related Knowledge Entities from Chinese Medical Literature: Development and Evaluation Study DOI

Jie Hao,

Zhenli Chen,

Qin Peng

et al.

Journal of Medical Internet Research, Journal Year: 2025, Volume and Issue: 27, P. e67033 - e67033

Published: March 18, 2025

Background Measurement-based care improves patient outcomes by using standardized scales, but its widespread adoption is hindered the lack of accessible and structured knowledge, particularly in unstructured Chinese medical literature. Extracting scale-related knowledge entities from these texts challenging due to limited annotated data. While large language models (LLMs) show promise named entity recognition (NER), specialized prompting strategies are needed accurately recognize entities, especially low-resource settings. Objective This study aims develop evaluate MedScaleNER, a task-oriented prompt framework designed optimize LLM performance recognizing Methods MedScaleNER incorporates demonstration retrieval within in-context learning, chain-of-thought prompting, self-verification improve performance. The dynamically retrieves optimal examples k-nearest neighbors approach decomposes NER task into two subtasks: type identification labeling. Self-verification ensures reliability final output. A dataset manually journal papers was constructed, focusing on three key types: scale names, measurement concepts, items. Experiments were conducted varying number proportion training data Additionally, MedScaleNER’s compared with locally fine-tuned models. Results CMedS-NER (Chinese Medical Scale Corpus for Named Entity Recognition) dataset, containing 720 27,499 used evaluation. Initial experiments identified GLM-4-0520 as best-performing among six tested When applied GLM-4-0520, significantly improved achieving macro F1-score 59.64% an exact string match full dataset. highest achieved 20-shot demonstrations. Under scenarios (eg, 1% data), outperformed all Ablation studies highlighted importance improving model reliability. Error analysis revealed four main types mistakes: errors, boundary missing indicating areas further improvement. Conclusions advances application LLMs prompts engineering tasks By addressing challenges data, adaptability various biomedical contexts supports more efficient reliable extraction, contributing broader measurement-based implementation clinical research outcomes.

Language: Английский

Citations

Domain knowledge-guided geological named entities recognition of rock minerals based on prompt engineering with error feedback mechanism DOI

Qinjun Qiu, Yun Ma, Peng Han

et al.

Computers & Geosciences, Journal Year: 2025, Volume and Issue: unknown, P. 105944 - 105944

Published: April 1, 2025

Language: Английский

Citations

Large Language Models in Systematic Review Screening: Opportunities, Challenges, and Methodological Considerations DOI

Carlo Galli,

Anna Viktorovna Gavrilova,

Elena Calciolari

et al.

Information, Journal Year: 2025, Volume and Issue: 16(5), P. 378 - 378

Published: May 1, 2025

Systematic reviews require labor-intensive screening processes—an approach prone to bottlenecks, delays, and scalability constraints in large-scale reviews. Large Language Models (LLMs) have recently emerged as a powerful alternative, capable of operating zero-shot or few-shot modes classify abstracts according predefined criteria without requiring continuous human intervention like semi-automated platforms. This review focuses on the central challenges that users biomedical field encounter when integrating LLMs—such GPT-4—into evidence-based research. It examines critical requirements for software data preprocessing, discusses various prompt strategies, underscores continued need oversight maintain rigorous quality control. By drawing current practices cost management, reproducibility, refinement, this article highlights how teams can substantially reduce workloads compromising comprehensiveness inquiry. The findings presented aim balance strengths LLM-driven automation with structured checks, ensuring systematic retain their methodological integrity while leveraging efficiency gains made possible by recent advances artificial intelligence.

Language: Английский

Citations

Prompt Engineering in Clinical Practice: A Comprehensive Guide for Clinicians (Preprint) DOI

Jialin Liu, Changyu Wang, Siru Liu

et al.

Published: Feb. 14, 2025

UNSTRUCTURED Large language models (LLMs) present a promising avenue for improving healthcare by enhancing clinical decision making. However, their effectiveness heavily relies on the accurate prompt engineering. This review focuses understanding and optimizing engineering techniques to guide LLMs in producing clinically relevant responses. The aim is provide clinicians with tools fully utilize practice, ensuring patient-centered care addressing ethical operational challenges. Key principles of engineering, such as specificity, contextual relevance iterative refinement, are essential effective use LLMs. Techniques zero-shot, few-shot chain-of-thought prompting analyzed detail practical insights into how these approaches influence LLM outcomes. also introduces classification system prompts - manual versus automatic discrete continuous help apply more effectively different scenarios. Despite advances, challenges remain data privacy, maintaining accuracy handling multimodal data. Effective can significantly improve performance practice input design accurate, contextually patient-specific outputs. These improvements enable efficient Clinicians need consider ensure integrate adaptive, personalized real-time workflows. By refining practices, take full advantage capabilities, ultimately patient outcomes supporting integration healthcare.

Language: Английский

Citations

Prompting Introspection DOI

Jacqueline Baras Shreibati

Circulation, Journal Year: 2025, Volume and Issue: 151(19), P. 1375 - 1377

Published: May 12, 2025

Language: Английский

Citations