Introduction

Large language models (LLMs) have the potential to improve healthcare through their capability to parse complex concepts and generate appropriate responses. LLMs have demonstrated proficiency in tasks across the spectrum of clinical activity, such as answering medical inquiries, powering dialogue systems, and synthesizing and completing clinical reports1,2,3,4,5. One potential high-value application is promoting evidence-based practice by providing clinical decision support systems (CDSSs) grounded in current medical guidelines, which distill expert opinion and evidence from clinical trials and are used to drive improvements in patient outcomes through best practices6,7.

LLMs have enjoyed wide public uptake, especially OpenAI’s ChatGPT (https://openai.com/blog/chatgpt), which enrolled over 100 million users within two months of its release8,9. The widespread adoption of ChatGPT made generative artificial intelligence simple and user-friendly for real-life scenarios and academic research. However, a primary concern for the application of LLMs in healthcare is the risk of inaccurate responses (e.g., “hallucinations”) that may lead to patient harm10. In clinical applications, a proposed framework for utilizing LLMs is based on adherence to the three principles of Honesty, Helpfulness, and Harmlessness (the HHH principle)11. To align LLMs with the HHH principle, strategies must be undertaken to bind their responses to a defined body of domain knowledge, such as retrieval augmented generation (RAG)12 or supervised fine-tuning (SFT) followed by Reinforcement Learning from Human Feedback (RLHF)13. Both RAG and SFT guide output generation according to a domain-specific dataset that, for clinical applications, could be represented by medical guidelines. However, the format of clinical guidelines varies broadly (e.g., general structure, location of recommendations, table format, and flowcharts), which can affect the proper interpretation or retrieval of relevant information.

While the integration of LLMs in healthcare shows promise, the challenge of ensuring accurate interpretation of clinical guidelines becomes particularly relevant in the context of managing widespread chronic diseases such as Hepatitis C Virus (HCV) infection. New antiviral therapies can successfully cure the infection, with multiple regimens demonstrating >90% efficacy and effectiveness14. HCV management has been codified in multiple guidelines that distill the results of available randomized controlled trials to recommend best practices for chronic HCV diagnosis and treatment. However, adherence to guidelines ranges from 36–54% for screening and managing chronic HCV infection15,16. There is a need for scalable and reliable solutions to provide guideline-recommended care and bridge the gap in adherence, especially considering the World Health Organization’s goal to eliminate Hepatitis C by 203017.

We present a novel LLM framework integrating clinical guidelines with RAG, prompt engineering, and text reformatting strategies for augmented text interpretation that significantly outperforms the baseline LLM in producing accurate guideline-specific recommendations, with the primary outcome of accuracy graded qualitatively by manual expert review. We also apply quantitative text-similarity methods18,19,20,21 to compare LLM outputs to expert-generated responses.

Results

Output accuracy analysis

The customized LLM framework achieved 99.0% overall accuracy, significantly better than GPT-4 Turbo alone (99.0% vs. 43.0%; p < 0.001). Incorporating in-context guidelines improved accuracy (67.0% vs. 43.0%; p = 0.001). When the in-context guidelines were cleaned and tables were converted from images to .csv files, accuracy improved to 78.0% (vs. 43.0%; p < 0.001); after the guidelines were formatted with a consistent structure and tables were re-formatted into text-based lists, accuracy further improved to 90.0% (vs. 43.0%; p < 0.001). Finally, the addition of custom prompt engineering improved accuracy to 99.0% (vs. 43.0%; p < 0.001), with no further improvement despite few-shot learning with 54 question-answer pairs (Table 1, Fig. 1).

Table 1 Qualitative evaluation of accuracy based on human expert grading of each answer across all experimental settings
Fig. 1: Qualitative evaluation of accuracy among all experiments from baseline.

a Accuracy for all questions. b Accuracy only for text-based questions. c Accuracy for table-based questions. d Accuracy for clinical scenario-based questions. Statistical testing is based on pairwise comparison (Chi-Squared Test) between each experimental setting and the baseline.

For text-based questions, the customized framework achieved 100% overall accuracy, better than GPT-4 Turbo alone (100% vs. 62.0%; p < 0.001). Incorporating in-context guidelines improved accuracy (86.0% vs. 62.0%; p = 0.01); cleaning the text and converting tables from images to .csv yielded a further improvement, with no additional gain after formatting the text into a consistent structure and converting tables into text-based lists (90.0% vs. 62.0%; p = 0.002). Adding custom prompt engineering resulted in 100% accuracy (100% vs. 62.0%; p < 0.001), with equivalent performance after few-shot learning with 54 question-answer pairs (100% vs. 62.0%; p < 0.001).

For table-based questions, the customized framework achieved 96.0% overall accuracy, better than GPT-4 Turbo alone (96.0% vs. 28.0%; p < 0.001). Incorporating in-context guidelines numerically improved accuracy (44.0% vs. 28.0%; p = 0.38); after cleaning the text and converting tables from images to .csv, accuracy reached 60.0% (vs. 28.0%; p = 0.046), with a substantial improvement after converting tables into text-based lists and formatting the text into a consistent structure (96.0% vs. 28.0%; p < 0.001), and similar performance in Experiments 4 and 5 as reported in Table 1.

For clinical scenarios, the customized framework achieved 100% overall accuracy, better than GPT-4 Turbo alone (100% vs. 20.0%; p < 0.001). Incorporating in-context guidelines improved accuracy (52.0% vs. 20.0%; p = 0.039); after cleaning the text and converting tables from images to .csv, accuracy reached 72.0% (vs. 20.0%; p < 0.001), with a substantial improvement after converting tables into lists and formatting the text into a consistent structure (84.0% vs. 20.0%; p < 0.001). Finally, the addition of custom prompt engineering achieved 100% accuracy (vs. 20.0%; p < 0.001), with no further improvement despite few-shot learning with 54 question–answer pairs.

When inaccurate outputs were reviewed for hallucinations, we found 112 (90.3%) fact-conflicting hallucinations (FCH) and 12 (9.7%) input-conflicting hallucinations (ICH) across all experiments. Hallucination types and their distribution across each experiment are reported in Table 2. We did not find context-conflicting hallucinations (CCH) in any of our experiments.

Table 2 Hallucination types and distribution across all experiments

Text-similarity analysis

For the secondary outcomes, we found differences between the customized LLM framework and the baseline across similarity scores (BLEU score, ROUGE-LCS F1, METEOR F1, and our Custom OpenAI Score) for all questions (Table 3). Average score values for text-based questions, table-based questions, and clinical scenarios, together with graphical distributions of each score, are reported in Supplementary Table 2 and Supplementary Fig. 1, respectively.

Table 3 Evaluation of text-to-text-similarity between LLM-generated outputs and human expert-provided answers used as the gold standard across all questions

Discussion

Integrating LLMs into CDSSs may revolutionize healthcare delivery by leveraging natural language processing to interpret clinical documentation, aligning LLM-generated recommendations with current medical research and best practices22,23 (Fig. 2). For instance, a locally hosted LLM might be granted access to patient-specific data, which can be integrated into a tailored prompt designed to identify the most appropriate treatment plan for a specific patient. The LLM, with concurrent access to the guidelines, can then provide a treatment recommendation grounded in guideline knowledge. However, before LLM-aided CDSSs can be deployed, it is necessary to define the guideline format that maximizes output accuracy.

Fig. 2: Example of a clinical decision support system integrated with large language models.

When a patient is being evaluated for HCV treatment, the doctor orders several tests (laboratory and imaging), whose results are stored in the institutional EHR system. The locally hosted LLM receives a standardized clinical scenario prompt populated with laboratory and imaging values extracted directly from the EHR. The standardized prompt is then submitted to the LLM, which has access to the relevant guidelines to recommend the most appropriate treatment. HCV Hepatitis C virus, EHR electronic health record, RAG retrieval augmented generation, LLM large language model.
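As a minimal illustration of this workflow, a clinical scenario prompt could be assembled from structured EHR values before being passed to the guideline-grounded LLM; the field names and template below are hypothetical and not part of the study.

```python
# Hypothetical template for a clinical scenario prompt built from EHR-extracted values.
def build_scenario_prompt(patient: dict) -> str:
    return (
        "A patient is being evaluated for HCV treatment.\n"
        f"HCV genotype: {patient['genotype']}\n"
        f"HCV RNA: {patient['hcv_rna_iu_ml']} IU/mL\n"
        f"Liver stiffness (elastography): {patient['stiffness_kpa']} kPa\n"
        f"Current medications: {', '.join(patient['medications'])}\n"
        "Based on the guidelines provided in context, recommend the most appropriate treatment."
    )
```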

We demonstrate the performance of our proposed framework on a subset of the potential questions that could be asked by physicians managing patients with chronic HCV. We identified an optimal framework for LLM-friendly clinical guidelines that achieves near-perfect accuracy and outperforms GPT-4 Turbo alone in answering questions about the management of HCV infection. The baseline GPT-4 Turbo showed an overall accuracy of 43.0%, consistent with other studies querying LLMs for management questions related to gastroenterology and hepatology, in which accuracy ranged from 25% to 90%24,25,26,27,28,29,30,31,32,33,34,35,36,37. This suggests that the model’s base knowledge was imperfect despite having access to information up to April 202338.

Our findings also highlight the difficulty LLMs have in parsing tables, with a clear improvement in performance after tables were converted to text-based lists, suggesting that information cannot be retrieved accurately from non-text sources. This difficulty is a known limitation39 and a critical technical issue to address, since the medical literature often contains tables with important information for clinicians.

Modern LLMs such as GPT-4, owing to their multimodal capabilities and context sensitivity, can interpret inputs from both images and textual elements40. OpenAI reported that GPT-4 was tested on benchmarks covering textual, graphical, and visual elements (ChartQA41, AI2D42, DocVQA43, Infographic VQA44), achieving accuracies from 75.1% to 88.4% and surpassing the previously best models on these benchmarks, which ranged from 61.2% to 88.4%40. Despite GPT-4 setting the state of the art in graphical-context interpretation, we demonstrate that it cannot reliably interpret the non-text sources reported in the HCV guidelines, showing 16.0% overall accuracy in extracting pertinent information (as described in detail in Supplementary Note 1). Inaccuracies in interpreting graphical elements can result in the loss of critical information and context when converting non-text sources into a readable format for LLMs, which likely affected GPT-4 Turbo’s ability to accurately interpret and reason with the information contained in non-text sources. This factor, coupled with the challenge of context retention across segmented data in non-text sources, could have contributed to the lower performance in “reasoning and interpretation” tasks. These results imply that information present in the guidelines should be provided as text (i.e., in an LLM-friendly format) to be efficiently and accurately retrieved and interpreted by LLMs.

We found that the similarity scores (BLEU45, ROUGE-L46, METEOR47, and a custom OpenAI score) calculated between the outputs generated by GPT-4 Turbo and the free-text expert answers do not necessarily reflect differences in expert-graded qualitative accuracy. We found statistically significant differences across all similarity metrics when the outputs of the in-context guideline experiments were compared to the baseline outputs, as reported in Table 3. Importantly, across the in-context guideline experiments we found no clear correlation between changes in similarity metrics and expert-graded qualitative accuracy. This has also been reported in other studies18,19,20,21,48 and may be explained by the fact that these scores were developed to measure word overlap, sentence structure similarity, and semantic coherence rather than factual correctness. For clinical questions, factual correctness is the most important feature. This is an important challenge to address, since a response could appear lexically comparable to a reference answer yet fail to capture the factual information necessary to guide clinical care. This can result in high scores for responses that are factually incorrect (false positives) or low scores for accurate responses that are phrased differently than the reference (false negatives). While useful for certain aspects of evaluation, these metrics fail to capture the nuances of medical relevance, completeness, and contextual correctness in the answers provided by the LLM. This limitation underscores the persistent need for expert physician oversight in the evaluation process (i.e., human-in-the-loop), with automated grading of LLM-generated responses remaining an unresolved challenge.

We also found that few-shot learning did not improve performance above and beyond in-context learning, text formatting, table conversion, and prompt engineering (Fig. 1). This suggests that the model’s zero-shot querying capabilities were already robust without requiring few-shot strategies, consistent with previous reports comparing one-shot and few-shot performance49,50.

Our work is limited by several factors. Firstly, we only investigate the application of LLMs to the screening, diagnosis, and management of a single disease within the spectrum of hepatology. However, our questions were representative of every section in the guideline, covering each major area of clinical management. Secondly, we ran each question for a limited number of iterations, with consistently excellent performance across multiple experiments. Because of limited resources, we did not vary the temperature setting across questions or stages of the ablation study, and we acknowledge that performance may differ with changes in the temperature parameter. Finally, we did not evaluate the performance of our framework with other LLMs, such as LLaMA51 or PaLM52. While these models are used for many tasks, most studies using LLMs in gastroenterology and hepatology have employed GPT-3.5 and GPT-4.024,25,26,27,28,29,30,31,32,33,34,35,36,37.

In a recent study, Jin et al. developed LiVersa53, a liver disease-specific LLM using RAG and guidelines from the American Association for the Study of Liver Diseases (AASLD), which showed notable limitations in providing completely accurate answers, especially in complex clinical scenarios. Moreover, their methodology does not clarify how the guidelines were converted into text or what chunking strategy was used (we do not employ chunking in our framework), and their accuracy assessments lack data on output accuracy rates. Therefore, despite the similar aims, we cannot directly compare our findings with theirs.

We present a novel LLM framework that generates answers to complex clinical questions with high accuracy, drawing from established guidelines for HCV management. We highlight the current limitations of LLMs in interpreting non-text sources and the benefit of providing in-context, structured, reformatted guidelines together with prompt engineering that guides understanding of the underlying text structure.

In conclusion, our results suggest that LLMs like GPT-4 Turbo are suitable for parsing clinical guidelines, and that their effectiveness can be enhanced by structured formatting strategies, prompt engineering, and text conversion of non-text sources. Moreover, our findings suggest that with appropriate reformatting, few-shot learning may not increase overall accuracy. We highlight the need for further research to enhance LLMs’ ability to parse non-text sources and to validate new metrics that evaluate not only similarity but also accuracy for clinical LLM applications.

Methods

Guidelines selection

We analyzed the current HCV guidelines from the prominent North American and European liver associations. Among these, we selected the European Association for the Study of the Liver (EASL) guideline on the Hepatitis C Virus, entitled “EASL recommendations on treatment of hepatitis C: Final update of the series”, published in 202054, to explore our framework. The selected guideline comprised the most complex corpus of text, containing broad recommendations on screening and management. In addition, the document contained in-depth information on drug-drug interactions, which was not reported in the North American55,56 guidelines. We also tested our framework on specific questions that were not addressed in the European guidelines using the most up-to-date North American HCV guidelines (as reported in Supplementary Note 3, Supplementary Table 3, Supplementary Table 4, Supplementary Fig. 3, Supplementary Fig. 4).

Standardized prompts creation

Two expert hepatologists (M.G. and L.S.C.) drafted 20 representative questions (Table 4). Fifteen questions addressed screening and management recommendations from each of the major sections, covering the guideline main text (10 questions) and graphical tables (5 questions). Tables are a standard feature of clinical guidelines and summarize recommendations in specific ways that may not be reflected in the text. In addition, the two experts drafted five comprehensive clinical cases, each reflecting different HCV-related management strategies, including selection of the best treatment, drug–drug interactions, and management of severe treatment-related adverse reactions. All questions were structured to test reasoning and comprehension from both the main text and tables.

Table 4 List of questions

Ablation study: customized LLM framework

We created a customized framework applied to the GPT-4 Turbo model (released by OpenAI in November 2023, with knowledge updated through April 202338) by combining RAG over the EASL HCV guidelines with different experimental settings of increasing complexity in guideline reformatting, prompt architecture, and few-shot learning. The OpenAI Application Programming Interface (API) v1.17 cannot directly retrieve information from .pdf files; therefore, the original pdf guideline document was converted to a .txt file with UTF-8 encoding using the Python (v. 3.11) library PyPDF2 v3.0.
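A minimal sketch of this conversion step, assuming PyPDF2 v3.0 and hypothetical file names:

```python
from PyPDF2 import PdfReader

# Extract the guideline text page by page and save it as UTF-8 plain text.
reader = PdfReader("easl_hcv_guidelines.pdf")  # hypothetical input file name
text = "\n".join(page.extract_text() or "" for page in reader.pages)
with open("easl_hcv_guidelines.txt", "w", encoding="utf-8") as f:
    f.write(text)
```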

We carried out an ablation study from the baseline (Experiments 1 through 5) to investigate how different settings of guideline reformatting, prompt architecture, and few-shot learning impact the accuracy and robustness of LLM outputs (Fig. 3). It is still unknown how non-text sources (e.g., graphical tables and flowcharts) are processed by LLMs and whether the extracted information is accurate. Therefore, we performed preliminary experiments to test the accuracy of the GPT-4 image conversion process (Supplementary Note 1) and found very low accuracy (16.0%) in extracting pertinent table information, ranging from 0% (graphical tables) to 48.0% (text-only tables). In light of these findings, we converted tables (non-text sources) into text-based lists and tested their impact on accuracy in Experiments 3, 4, and 5.

Fig. 3: Ablation study experimental settings.

Depiction of Ablation Study experimental settings (Experiment 1 through Experiment 5) to investigate how guideline reformatting, prompt architecture, and few-shot learning impact the accuracy and robustness of LLM outputs.

Baseline

Use of the foundational GPT-4 Turbo without any context. For this experiment, we provided only the questions, with no further instructions.

Experiment 1

Use of the foundational GPT-4 Turbo with guidelines uploaded in context after pdf-to-text conversion in UTF-8 encoding without any additional text cleaning processes.

Experiment 2

Use of the foundational GPT-4 Turbo with guidelines uploaded in context after being manually cleaned, with removal of non-informative data (e.g., page headers and bibliography). Tables presented as images in the original text were manually converted into .csv files and then provided as context.

Experiment 3

Use of the foundational GPT-4 Turbo with guidelines uploaded as context that were cleaned and formatted to provide a consistent structure throughout the whole document. In addition, we converted all tables from .csv files into text-based lists and included them in the main text. Each paragraph title was preceded by “Paragraph Title”. All the paragraph recommendations were collected and organized into a list preceded by “Paragraph Recommendations”. Evidence reported in the main text was organized and preceded by “Paragraph Text”.
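As an illustrative sketch of this reformatting step (the function names and exact list layout are our assumptions, not the published pipeline):

```python
import csv

def table_to_list(csv_path: str, table_name: str) -> str:
    """Convert a .csv table into a plain-text list, one line per row keyed by the column headers."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    lines = [f"Table: {table_name}"]
    for row in rows:
        lines.append("- " + "; ".join(f"{col}: {val}" for col, val in row.items() if val))
    return "\n".join(lines)

def format_section(title: str, recommendations: list[str], body_text: str) -> str:
    """Wrap one guideline section in the consistent structure used for the in-context guidelines."""
    block = [f"Paragraph Title: {title}", "Paragraph Recommendations:"]
    block += [f"- {rec}" for rec in recommendations]
    block += ["Paragraph Text:", body_text]
    return "\n".join(block)
```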

Experiment 4

Use of the foundational GPT-4 Turbo with guidelines uploaded as context that were cleaned and formatted, with tables converted into text-based lists. We also provided a series of prompts (i.e., prompt engineering) that instructed the model on how to interpret the structured guidelines (Supplementary Table 1).

Experiment 5

Use of the foundational GPT-4 Turbo with guidelines uploaded as context that were cleaned and formatted, with tables converted into text-based lists. We included the same series of prompts (i.e., prompt engineering) and added a series of 54 question-answer pairs (i.e., few-shot learning) (Supplementary Table 1).
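A hedged sketch of how the in-context guidelines, prompt-engineering instructions, and few-shot question-answer pairs could be assembled into a chat message list (the helper name and message layout are assumptions; the actual instructions and pairs are in Supplementary Table 1):

```python
def build_messages(instructions: str, guideline_text: str,
                   few_shot_pairs: list[tuple[str, str]], question: str) -> list[dict]:
    """Assemble the chat messages: system instructions plus in-context guidelines, few-shot pairs, then the query."""
    messages = [{"role": "system", "content": instructions + "\n\nGUIDELINES:\n" + guideline_text}]
    for q, a in few_shot_pairs:  # e.g., the 54 question-answer pairs used for few-shot learning
        messages.append({"role": "user", "content": q})
        messages.append({"role": "assistant", "content": a})
    messages.append({"role": "user", "content": question})
    return messages
```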

The experiments are summarized in Fig. 3 and were conducted in a local Python environment with OpenAI API access. Instructions, when provided, are summarized in Supplementary Table 1. We used the foundational model’s default parameters, with a temperature of 0.9 and a maximum output length of 800 tokens.
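Continuing the sketch above, the query itself could be issued as follows (the model identifier is an assumption; only the temperature and output-token limit are taken from the text):

```python
from openai import OpenAI

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set

def query_model(messages: list[dict]) -> str:
    """Send the assembled messages to GPT-4 Turbo with the parameters reported above."""
    response = client.chat.completions.create(
        model="gpt-4-turbo-preview",  # assumed identifier for the GPT-4 Turbo release used
        messages=messages,
        temperature=0.9,
        max_tokens=800,
    )
    return response.choices[0].message.content
```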

Primary outcome

Our primary outcome was the qualitative rate of accuracy according to expert grading based on the information reported in the EASL guidelines54. We repeated the query 5 times for each of the 20 questions in each experimental setting and reported the proportion of accurate responses. Each answer was graded with a score of 1 if the text contained completely accurate information and 0 otherwise. Two expert hepatologists (M.G., with four years of experience in treating HCV patients, and L.S.C., with thirty years of experience in treating HCV patients) manually graded each response. The two graders were blinded to each other’s grades and to the experimental setting when labeling answers. Disagreements in grading occurred for 5.0% of outputs and were resolved by consensus between the two graders.

When outputs were graded as inaccurate, the inaccuracy was attributed to hallucinations (i.e., the production of plausible-sounding but potentially unverified or incorrect information)57,58. Following the recent definitions of Zhang et al., we distinguished three types of hallucinations: FCH, ICH, and CCH59.

Secondary outcome

Our secondary outcome was the similarity of LLM-generated responses to human expert-provided answers used as the gold standard. In particular, an expert hepatologist (M.G.) provided a single answer for each of the 20 questions, which was reviewed and approved by the second expert hepatologist (L.S.C.) and then used as the gold standard expert response against which LLM responses were compared using Recall-Oriented Understudy for Gisting Evaluation (ROUGE)46, Bilingual Evaluation Understudy (BLEU)45, Metric for Evaluation of Translation with Explicit Ordering (METEOR)47, and a Custom OpenAI score (for an in-depth explanation see Supplementary Note 2). The Custom OpenAI score is based on cosine similarity, while the other scores are based on word overlap and semantic coherence between two text sources. Similarity was evaluated by comparing each LLM-generated answer to the corresponding expert answer. All scores are expressed on a scale from 0 to 1, where 1 denotes perfect alignment between the two compared texts. The mean and standard deviation of the similarities were estimated after repeating the query 5 times for each of the 20 questions.
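A minimal sketch of how these four scores could be computed for one answer pair, assuming the nltk, rouge_score, and openai packages (the cosine-similarity implementation of the Custom OpenAI score and the embedding model are assumptions; the study’s exact definition is in Supplementary Note 2):

```python
import numpy as np
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk.translate.meteor_score import meteor_score  # requires nltk's wordnet data
from rouge_score import rouge_scorer
from openai import OpenAI

client = OpenAI()

def similarity_scores(reference: str, candidate: str) -> dict:
    """Score one LLM answer against the expert gold-standard answer (all metrics range 0-1)."""
    ref_tok, cand_tok = reference.split(), candidate.split()
    bleu = sentence_bleu([ref_tok], cand_tok, smoothing_function=SmoothingFunction().method1)
    rouge_l = rouge_scorer.RougeScorer(["rougeL"]).score(reference, candidate)["rougeL"].fmeasure
    meteor = meteor_score([ref_tok], cand_tok)
    # Assumed Custom OpenAI score: cosine similarity between embeddings of the two texts.
    emb = client.embeddings.create(model="text-embedding-3-small",  # hypothetical embedding model
                                   input=[reference, candidate]).data
    a, b = np.array(emb[0].embedding), np.array(emb[1].embedding)
    custom_openai = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    return {"BLEU": bleu, "ROUGE-L F1": rouge_l, "METEOR": meteor, "Custom OpenAI": custom_openai}
```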

Statistical analysis

We employed the Chi-Square Test to compare qualitative accuracy among experiments. We employed the Mann-Whitney U Test to compare differences in the continuous scores used for automatic evaluation of answers. A two-tailed p-value < 0.05 was considered statistically significant. Analyses were conducted with Python v 3.11 and SciPy v 1.11.
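As an illustrative sketch of these tests with SciPy (the counts and scores below are placeholders chosen only to mirror the reported accuracy proportions, not study data):

```python
from scipy.stats import chi2_contingency, mannwhitneyu

# Pairwise accuracy comparison between one experimental setting and the baseline.
# Rows: [accurate, inaccurate] out of 100 graded outputs (20 questions x 5 repetitions).
contingency = [[99, 1],    # customized framework (placeholder counts consistent with 99.0%)
               [43, 57]]   # baseline GPT-4 Turbo (placeholder counts consistent with 43.0%)
chi2, p_accuracy, dof, expected = chi2_contingency(contingency)

# Comparison of continuous similarity scores between two settings (placeholder values).
baseline_scores = [0.21, 0.18, 0.25, 0.19, 0.22]
framework_scores = [0.41, 0.38, 0.45, 0.40, 0.39]
stat, p_similarity = mannwhitneyu(baseline_scores, framework_scores, alternative="two-sided")
```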