Introduction

Fundus fluorescein angiography (FFA) is a valuable diagnostic imaging technique that uses the administration of a fluorescent dye to evaluate the retinal circulation and visualize blood flow. Compared with other imaging modalities such as fundus photography, FFA has significant advantages in evaluating vascular ocular diseases such as diabetic retinopathy (DR), central serous chorioretinopathy (CSC), and retinal vein occlusion (RVO), as it provides more detailed and dynamic vascular images1. However, this increased level of detail also makes interpretation more challenging. Interpreting FFA images requires extensive professional expertise and comprehensive ophthalmology training, which may lead to a shortage of qualified FFA reporting in regions lacking retinal specialists2. Additionally, in conventional clinical practice, ophthalmologists dedicate a substantial amount of time and effort to writing medical reports. There is therefore an urgent need for intelligent tools to address these challenges.

Image-to-text conversion technology holds the potential to bridge this gap and has made significant strides in the realm of natural images3. Previous research has explored the application of these techniques to generating reports from FFA image data4,5. However, these studies lack comprehensive evaluation by ophthalmologists. Additionally, they have focused solely on report generation and overlooked the importance of providing further explanation after the FFA reports are generated, a critical step in enhancing patient comprehension of clinical information and saving valuable time for doctors. Because of the specialized nature of these reports, patients often struggle to understand the content fully, leading to difficulties in addressing consultation questions and necessitating additional face-to-face consultations. In such cases, a model that can handle general consultation questions would be beneficial, as it would save doctors valuable time and enable more patients to receive professional medical services.

Recent remarkable advances in large language models (LLMs) have brought new possibilities for addressing these challenges6. Previous studies have evaluated the performance of different LLMs in ophthalmic question-answering (QA) tasks. For instance, Cai et al.7 assessed the performance of ChatGPT in ophthalmology using board-style questions and found that ChatGPT-3.5 provided correct answers to 58.8% of the questions, while ChatGPT-4.0 achieved a higher accuracy of 71.6%. Another study showed that the GPT-4V(ision) model falls short when addressing QA for ophthalmic images, with its best performance, on slit-lamp images, reaching an accuracy of only 42.0%8. Llama 29, the latest development in open-source LLMs, is renowned for its flexibility and scalability. It holds promise in various medical applications10,11,12, including ophthalmology, where it can offer valuable medical advice and assistance for ophthalmic consultations13. However, the performance of LLM-generated responses to queries about ophthalmology reports has not yet been evaluated. Currently, there is also no artificial intelligence (AI) system capable of effectively integrating FFA image-text information and providing QA interactions.

To maximize the value of AI-assisted interpretation of FFA images, our goal is to develop a two-stage FFA-GPT system. By combining the capabilities of image-text conversion models and LLMs, we aim to achieve FFA report generation and subsequent interactive QA, thereby reducing reliance on retinal specialists.

Results

The study flow chart is shown in Fig. 1. The final dataset consisted of 654,343 FFA images paired with 9392 reports. Of these images, 421,916 (64.5%, with 6312 reports) were used for training, 76,900 (11.8%, with 1052 reports) for validation, and 155,527 (23.8%, with 2028 reports) for testing. The participants had a median age of 51 years (interquartile range, 36-62 years), and 5190 (55.3%) were male. More detailed characteristics of the dataset can be found in Table 1.

Fig. 1: Schematic diagram of this study.

FFA fundus fluorescein angiography, GPT generative pre-trained transformer, BLIP bootstrapping language-image pre-training.

Table 1 Fundus fluorescein angiography dataset characteristics

The eye conditions extracted from the original FFA reports encompassed a wide range of conditions commonly encountered in clinical practice. The top five most prevalent conditions were media opacity (11.5%), proliferative diabetic retinopathy (8.5%), macular edema (7.8%), DR (7.7%), and cystoid macular edema (7.4%). Additionally, the dataset included several rare diseases. A comprehensive overview of the main conditions can be found in Table 2.

Table 2 The main eye conditions extracted from the fundus fluorescein angiography reports (total N = 9392)

Automatic evaluation

The FFA-GPT model demonstrated satisfactory performance in generating FFA reports on the testing set, as evaluated using both language-based and classification-based metrics (see Table 3). For the language-based metrics, the model achieved Bilingual Evaluation Understudy (BLEU) scores of BLEU1 = 0.48, BLEU2 = 0.42, BLEU3 = 0.38, and BLEU4 = 0.34; Consensus-based Image Description Evaluation (CIDEr) = 0.33; Recall-Oriented Understudy for Gisting Evaluation (ROUGE) = 0.36; Semantic Propositional Image Caption Evaluation (SPICE) = 0.18; and Bidirectional Encoder Representations from Transformers Score (BERTScore) = 0.70. These metrics assess the quality and similarity of the generated reports relative to the reference reports. For the classification-based metrics, we extracted conditions from the generated reports using a keyword-matching approach and categorized them individually. We then calculated the classification-based metrics for each condition based on the keyword-matching results and ranked all conditions accordingly. The five conditions with the highest F1 scores, namely microaneurysm, DR, arteriosclerosis, laser spots, and scar, were selected to demonstrate the model's ability to classify multiple conditions. These conditions exhibited high specificities, all above 0.94, and accuracies ranging from 0.88 to 0.93. Their F1 scores were 0.82, 0.80, 0.73, 0.66, and 0.64, respectively, indicating strong overall performance in disease classification. Based on previous studies2,14,15,16,17,18,19, we also explored the classification performance for nine conditions commonly diagnosed using FFA (see Supplementary Table 1). The model exhibited acceptable classification performance for DR and DR-related lesions, such as microaneurysm and laser scar.

Table 3 Model performance in the test set (155,527 images with 2028 reports)

Furthermore, we investigated how changing the number of input images affects the performance of the model, varying the number of input images from 1 to 12. Most metrics gradually improved as the number of input images increased, while BERTScore remained stable at around 0.70. However, the performance gains plateaued once at least two images were provided, particularly for the BLEU1, ROUGE, and SPICE metrics. The model's performance showed greater volatility when measured by the CIDEr metric; notably, the optimal value of 0.36 was achieved only when the number of input images exceeded 11 (see Supplementary Fig. 1).

Manual evaluation

During the analysis of the clinical applicability of FFA-GPT by ophthalmologists, we observed promising quality in both the model-generated reports and answers. Figure 2 showcases an example of the integrated interface of our model.

Fig. 2: Demonstration of the integrated interface of our model.

FFA fundus fluorescein angiography, GPT generative pre-trained transformer, OD right eye.

The quality of the reports generated by the FFA-GPT model was comparable to that of the ground-truth reports. Fleiss' kappa values indicated substantial agreement among the three graders in assessing report quality, with kappa values of 0.746 for completeness and 0.739 for accuracy (see Supplementary Table 2). Figure 3A summarizes each rater's scoring of the generated reports along the two dimensions. Averaged across the three raters, 68.3% of the generated reports contained no incorrect content, indicating accuracy in describing the anatomical location of lesions and fluorescence imaging characteristics. Reports with minor errors accounted for 25.7%, and reports with significant errors for 6.0%. These errors were mainly attributable to inaccuracies in describing fluorescence characteristics and to misidentification of complex lesions, such as choroidal masses. In terms of completeness, approximately 62.3% of the generated reports had no missing content. In 31.0% of the reports, there were minor instances of missing content, such as extremely small lesions, while 6.7% of reports had a significant amount of missing content, primarily because unclear imaging led to false negatives. Examples of generated reports at different grades are shown in Supplementary Fig. 2.

Fig. 3: Human evaluation.

A Report generation. B Question answering. O Ophthalmologist.

After the first-stage evaluation of report generation, we identified 20 of the 31 conditions (approximately 64.5%) that received positive feedback in the professional assessment. These cases met the criteria to pass the guard mechanism and were forwarded to the Generative Pre-trained Transformer (GPT) module. The negative examples, their versions manually adjusted through the guard mechanism, and subsequent QA examples based on these reports are provided in Supplementary Fig. 3. The unadjusted report missed important lesions, and QA based on the erroneous report could mislead patients. This demonstrates that automatic report generation serves only as an assistive tool; reports distributed to patients cannot be fully automated without the supervision of ophthalmologists. From the 20-item question list, we randomly selected 5 questions and combined them with the 20 approved cases to create 100 QA pairs representing 20 distinct diseases. This dataset served as the evaluation set for QA.

The evaluation of the generated answers showed that, on average, 70.7% of the responses were virtually error-free, 24.7% had minor errors, and 4.6% had major errors in terms of accuracy. In terms of completeness, an average of 84.0% of the answers responded fully to the questions related to the reports, 13.7% had minor omissions, and only 2.3% had significant omissions. Regarding potential harm, the majority of answers (over 93.7%) were deemed safe. However, two answers were identified as potentially causing severe harm: one suggested incorrect hormone therapy for CSC patients, and another suggested unreliable at-home light therapy for patients with RVO. In terms of satisfaction, the model achieved a satisfaction rate of 65.3%. The Fleiss' kappa values for the three raters ranged from 0.762 to 0.834, indicating substantial to almost perfect agreement and highlighting the reliability of the scoring process. The scoring details of each rater for the generated answers and the consistency results among them can be found in Fig. 3B and Supplementary Table 3, respectively. Examples of generated answers rated as having great clinical significance for accuracy, completeness, and harmfulness, as well as unsatisfactory answers, are provided in Supplementary Table 4.

Discussion

In this study, we developed a dual-task model for ophthalmic image analysis. First, we utilized a multimodal transformer to convert FFA images into medical reports. We then implemented an LLM to facilitate interactive QA. Through comprehensive automated and manual evaluations, the system demonstrated reliable and satisfactory performance. This research demonstrates the potential of large language models as assistants for generating medical reports and enabling QA tasks to enhance FFA image interpretation.

Previous studies have developed AI models based on FFA image data. Some have focused on detecting multiple conditions in FFA images, such as DR15,16 and CSC14. Others have aimed to automatically generate FFA reports to address gaps in ophthalmic image-to-text conversion. Li et al.5 introduced a cross-modal clinical graph model for generating ophthalmic reports using 330,461 images. Lin et al.4 applied contrastive pre-training to medical report generation on multi-modality datasets, including FFA. Compared with these studies, FFA-GPT offers several advantages. First, our model combines disease classification and report generation capabilities, accurately identifying retinal conditions, particularly DR, and generating coherent and comprehensible free-text reports. However, further improvements are required to enhance the model's performance in classifying conditions other than DR. Second, to evaluate the expertise of our model, we collected clinically annotated reports as references and involved ophthalmic experts to assess the model's professional performance in terms of accuracy and completeness. Over 60% of the reports met clinical requirements, whereas Li et al.'s study achieved only 44.7%. Additionally, our model can accept an arbitrary number of input images, which is more feasible in real-world applications. Lastly, in actual clinical settings, patients often have many questions even after receiving the report, which require ophthalmologists' explanations. FFA-GPT not only generates reports but also provides interactive explanations in the second stage, making it a comprehensive solution. Future research should focus on enhancing the quality and accuracy of the generated reports, thereby further minimizing the time required for manual intervention. This can be achieved by incorporating more diverse data and by investigating and comparing different architectural choices to identify the optimal framework5,20. Moreover, prospective clinical trials could be considered to validate the model's effectiveness in improving reporting efficiency within real-world clinical settings.

Doctor-patient interaction is a crucial aspect of healthcare21. Patients often have limited knowledge of medical concepts and terminology, and it is the doctor's responsibility to communicate complex medical information understandably while addressing the patient's needs and concerns. Conversational large language models such as ChatGPT and Llama 2 offer an alternative approach to effective AI-human interaction and collaboration. Integrating LLMs into the image-to-text conversion framework offers several advantages. First, the LLM functions as an intelligent scheduler, activating the image-reading module based on the context. This allows the text reports, which contain crucial details of patients' FFA imaging results, to be combined with the world knowledge of the GPT module in the LLM, effectively mitigating LLM hallucinations and producing better answers. Second, through comprehensive evaluation, we demonstrated that the LLM can effectively handle general inquiries related to ophthalmic image reports. The scarcity of high-quality medical QA data poses a real-world challenge. By incorporating the LLM into the report generation model, we introduced a solution for ophthalmic QA that requires no additional QA training, avoiding the issue of insufficient datasets and optimizing the use of computational resources. Third, our two-stage framework offers greater interpretability than an end-to-end one. The separate generation process allows for increased transparency, and the embedded guard mechanism can assess the quality of the generated reports before their release. This ensures that only reports that have been rigorously reviewed and refined by medical professionals are made available for patients' use in QA, allowing our system to interact with end users while maintaining social responsibility. Continued efforts to empower LLMs with ophthalmic knowledge, such as incorporating retrieval-augmented generation techniques to include more resources for ophthalmic tasks, will further enhance the model's applicability. Improving the interpretability of the model would also be beneficial.

Previous studies have demonstrated that AI assistance can reduce diagnostic time and improve radiologists' performance22,23. In daily clinical care, our proposed FFA-GPT offers an automated pipeline for interpreting FFA images, covering both the pre- and post-explanation stages. In the stage of generating specialized reports from FFA images, our model shows potential for producing preliminary reports that assist ophthalmologists in interpreting ophthalmic images. It may reduce diagnostic errors caused by difficulty and fatigue, thereby improving medical efficiency for ophthalmologists. Furthermore, in the stage of communicating with patients after the specialized reports are generated, our system is expected to explain the medical concepts and procedures in FFA reports to patients in simplified terms, potentially making them easier to understand. It could be embedded in applications and accessed through a mobile interface in the future, enabling the provision of additional information on common eye health issues and decision support during consultations. However, the application of this model in a clinical environment requires careful consideration of ethical issues. First, although our model safeguards patient data privacy through techniques such as de-identification and local model deployment, it is imperative to ensure that patients are fully informed and consent to the use of their data in AI model analysis during future prospective validation. Second, to enhance the interpretability of the model, future research may consider incorporating techniques such as Concept Relevance Propagation24 or visualization methods, thereby helping clinicians understand how the model analyzes images and reaches assessments. It is also important to emphasize that such models are currently at an early stage and are intended to serve as auxiliary tools that assist ophthalmologists in improving their work efficiency and accuracy, rather than replacing their judgment and decision-making. Therefore, the guard mechanism is crucial for promptly detecting and correcting model errors, ensuring clinical supervision and review of model outputs, and minimizing the risk of misdiagnosis, thus ensuring feasibility and safety.

This research has several limitations. First, there are potential risks associated with generative medical models. Although our research has undergone comprehensive evaluations from multiple perspectives, validating the accuracy and correctness of the generated content still poses challenges, and hallucination stemming from the LLM still exists in our model. Therefore, equipping the LLM with ophthalmic knowledge and improved guard mechanisms is an important direction for future work. Second, multi-modal medical imaging has become one of the important trends in medicine, providing broader space and opportunities for the development of AI. Our model is currently limited to FFA images, and future research should focus on developing multi-modal imaging models. Lastly, our model lacks a fully external dataset for validation and instead relies on temporally split data. Further evaluation and optimization of the model's generalizability are needed on data from other sites and on entirely new data.

In conclusion, this study introduces an innovative approach that combines a multi-modal transformer and an LLM to establish a connection between ophthalmic imaging and automated report generation, as well as facilitate interactive QA. The FFA-GPT system shows promising potential to enhance the interpretation and reporting of ophthalmic images, offering an important reference for the development of other image-based AI systems.

Methods

Dataset

Data for this study were retrospectively collected from a tertiary hospital in China between November 2016 and December 2019. To ensure patient privacy and adhere to the ethical principles outlined in the Declaration of Helsinki, all patient data were anonymized. The study was approved by the Institutional Review Board of the Hong Kong Polytechnic University, and individual consent for retrospective analysis was waived. The dataset encompassed a broad spectrum of eye conditions, including DR, RVO, and CSC. The FFA images were captured using Zeiss FF450 Plus and Heidelberg Spectralis (Heidelberg Engineering, Heidelberg, Germany) cameras, with a resolution of 768 × 768 pixels. Low-quality FFA images were filtered out by vessel extraction25: images with a detectable vessel area ratio below 0.005 were excluded. To facilitate external validation, the dataset was divided into training, validation, and testing subsets based on a temporal split: images captured before June 2019 were allocated for training, while images captured after June 2019 were used for validation and testing. This temporal-split strategy simulates a study in which a model is developed on past data and subsequently validated and tested on future cases; it serves as a form of external validation, contributing to the robustness of the findings26.
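As a minimal sketch of the quality-control step described above, the following Python snippet illustrates the vessel-area-ratio filter; the segment_vessels callable stands in for the vessel-extraction method of reference 25 and is a placeholder, not the study's actual implementation.

```python
import numpy as np

VESSEL_AREA_THRESHOLD = 0.005  # images below this detectable vessel area ratio are excluded


def vessel_area_ratio(vessel_mask: np.ndarray) -> float:
    """Fraction of pixels labelled as vessel in a binary segmentation mask."""
    return float(vessel_mask.astype(bool).mean())


def is_acceptable_quality(image: np.ndarray, segment_vessels) -> bool:
    """Keep an FFA frame only if enough vasculature is detectable."""
    mask = segment_vessels(image)  # placeholder for the cited vessel-extraction model
    return vessel_area_ratio(mask) >= VESSEL_AREA_THRESHOLD
```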

Development of FFA-GPT

We utilized the Bootstrapping Language-Image Pre-training (BLIP) framework as the image-text aligning module. BLIP is an architecture pre-trained on a vast dataset of natural images, which facilitated the acquisition of effective feature representations for both images and text27. Notably, the BLIP model demonstrates exceptional proficiency in encoding long-range dependencies, making it well suited to handling intricate ophthalmic images and longer sequences of ophthalmic reports. Furthermore, the BLIP model can filter out noisy data during training, which substantially enhances the quality of large-scale training datasets. The framework comprises two key components: a vision transformer28 serving as the image encoder, and Bidirectional Encoder Representations from Transformers (BERT)29 serving as the language encoder and decoder.
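For orientation, the snippet below shows how an off-the-shelf BLIP captioning model can be loaded and run on a single image via the Hugging Face transformers port. This is an illustrative sketch only: the study fine-tuned BLIP on FFA data, and the checkpoint name and file path here are placeholders.

```python
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# Publicly available natural-image checkpoint, used here only for illustration.
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

image = Image.open("ffa_frame.png").convert("RGB")  # placeholder FFA frame
inputs = processor(images=image, return_tensors="pt")
caption_ids = model.generate(**inputs, max_new_tokens=64)
print(processor.decode(caption_ids[0], skip_special_tokens=True))
```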

To fine-tune the pre-trained BLIP model, we employed a dataset consisting of FFA images and their corresponding Chinese reports. During training, we randomly selected 1-9 images from each case as model input, ensuring a balanced representation of the arterial, venous, and late phases of FFA. All images were resized to 320 × 320 pixels. Fine-tuning was performed on two NVIDIA Tesla V100 GPUs. The initial learning rate was set to 0.00002, with a weight decay of 0.05, and we used the AdamW30 optimizer with a cosine learning rate schedule. Fine-tuning ran for 50 epochs, and the whole training took 3 days. The model with the highest BLEU1 score on the validation set was selected for testing.
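The preprocessing, optimizer, and schedule described above can be sketched in generic PyTorch as follows; blip_model and steps_per_epoch are placeholders, and the authors' actual training loop may differ in detail.

```python
import torchvision.transforms as T
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

EPOCHS = 50
# Input images resized to 320 x 320 pixels before being fed to the model.
resize = T.Compose([T.Resize((320, 320)), T.ToTensor()])


def build_optimizer_and_scheduler(blip_model, steps_per_epoch):
    # Initial learning rate 2e-5 and weight decay 0.05, as reported above.
    optimizer = AdamW(blip_model.parameters(), lr=2e-5, weight_decay=0.05)
    # Cosine decay of the learning rate over the whole fine-tuning run.
    scheduler = CosineAnnealingLR(optimizer, T_max=EPOCHS * steps_per_epoch)
    return optimizer, scheduler
```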

It’s important to note that currently, automatic report generation is intended to assist doctors, rather than to produce reports that can be directly released to patients. Therefore, we included a guard mechanism for quality control of generated reports to reduce the risk of disseminating harmful information. This mechanism necessitates human evaluation and confirmation. Only reports that pass this confirmation or have been corrected by doctors can proceed to the next step. This guard mechanism is open to be optimized and updated to enhance efficiency.

Although the BLIP model could be used for further QA tasks, doing so would require a large amount of reliable QA training data and significant computational resources, making training and continuous optimization challenging (see Supplementary Table 5). To enable QA without fine-tuning and to facilitate interactive interpretation of the generated reports, we incorporated the advanced open-source large language model Llama 29 as the GPT module in the second stage. Specifically, this was achieved by giving the instruction 'Please answer the following question based on the given FFA report'. We formulated a series of typical questions related to FFA reports, drawing on our clinical experience and following methods similar to those used by Momenaei et al.31. The question list we created contains 20 items covering a range of topics, including a summary of the report, disease definitions, etiology, visual impact, prevention measures, further examinations, treatment options, prognosis, complications, and FFA testing information such as timing, specific phases, and post-test instructions (see Supplementary Table 6). Given that Llama 2 primarily excels at handling English data, we incorporated a translation plugin to connect the two modules. These questions, along with the English translations of the generated reports, were then fed into the Llama 2 model for testing.
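A hedged sketch of this second-stage prompting step is given below, using the Hugging Face transformers text-generation pipeline as an example backend. The checkpoint name, prompt layout, and generation settings are illustrative assumptions rather than the authors' exact configuration, and the translation step is omitted.

```python
from transformers import pipeline

# Assumed chat checkpoint; access to Llama 2 weights on Hugging Face is gated.
generator = pipeline("text-generation", model="meta-llama/Llama-2-7b-chat-hf")

INSTRUCTION = "Please answer the following question based on the given FFA report."


def answer_question(report_en: str, question: str) -> str:
    prompt = f"{INSTRUCTION}\n\nFFA report:\n{report_en}\n\nQuestion: {question}\nAnswer:"
    output = generator(prompt, max_new_tokens=256, do_sample=False)
    # The pipeline returns the prompt plus the completion; keep only the completion.
    return output[0]["generated_text"][len(prompt):].strip()
```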

Automatic evaluation of FFA-GPT

We conducted the automatic evaluation of the generated reports using language-based and classification-based metrics32. The language-based metrics used in this study were BLEU, CIDEr, ROUGE, SPICE, and BERTScore. BLEU33 is a widely used automatic metric for evaluating the quality of machine translation; it scores text by calculating the overlap of n-grams between the generated text and reference texts, where an n-gram is a contiguous sequence of n words from a given sample of text. Given that many ophthalmic terms are compound expressions often composed of up to four words, we calculated BLEU1, BLEU2, BLEU3, and BLEU4 as evaluation metrics. CIDEr34 was developed for image captioning evaluation. It evaluates not just the presence of relevant words but also how frequently those words are used in similar contexts, giving higher scores to words that are both pertinent and rare. In our context, it helps ensure that the terminology used is not only appropriate but also indicative of the distinctive features in the images, which is crucial for medical reporting. ROUGE35 focuses on the longest common subsequence between the generated text and reference texts and is suitable for evaluating model performance at the sentence or paragraph level. Ophthalmic reports often require coherent descriptions of complex clinical information, and ROUGE helps assess how well the model maintains coherence throughout the information presented. SPICE36 differs from BLEU by focusing on semantic accuracy in image caption generation rather than mere word or phrase matching, providing a more comprehensive assessment of the quality of generated descriptions. BERTScore37 utilizes the pre-trained BERT model to compute the semantic similarity between generated and reference text; it excels at assessing semantic match within the overall context, providing a nuanced and semantically oriented evaluation of model performance.
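For concreteness, the snippet below computes a subset of these metrics (BLEU1-4, ROUGE-L, and BERTScore) with common open-source implementations (nltk, rouge-score, bert-score) on a toy English sentence pair; the study's own evaluation scripts, tokenization, and Chinese-language handling may differ, so scores are not directly comparable.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer
from bert_score import score as bert_score

reference = "late leakage of dye in the macula of the right eye".split()  # toy example
candidate = "late dye leakage in the right macula".split()

# BLEU1-4: n-gram overlap with uniform weights up to order n.
smooth = SmoothingFunction().method1
bleu = {
    f"BLEU{n}": sentence_bleu([reference], candidate,
                              weights=tuple([1 / n] * n),
                              smoothing_function=smooth)
    for n in range(1, 5)
}

# ROUGE-L: longest common subsequence F-measure.
rouge_l = rouge_scorer.RougeScorer(["rougeL"]).score(
    " ".join(reference), " ".join(candidate))["rougeL"].fmeasure

# BERTScore: semantic similarity from contextual BERT embeddings.
_, _, f1 = bert_score([" ".join(candidate)], [" ".join(reference)], lang="en")
print(bleu, rouge_l, float(f1[0]))
```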

The classification metrics were employed to supplement the language-based metrics, providing a more comprehensive evaluation of the model's ability to accurately identify diseases. We first manually constructed a keyword dictionary for ophthalmic conditions, which included the standard medical terms, root words, prefixes, and synonyms associated with each condition (a sample is shown in Supplementary Fig. 4). We then utilized this dictionary to automatically identify, extract, and standardize the disease terminology present in the FFA reports. During keyword matching we also accounted for negations: regular expressions were used to check whether a term was preceded or followed by negation words such as 'not observed', 'no apparent', 'without', and 'not supported', in which case the related disease term was not extracted. We employed an exact matching strategy that considered negation terms, and the mapping was case-insensitive. The same standardized terms were used to evaluate the accuracy of disease identification in the generated reports. Based on the results of this keyword-matching process, we calculated classification metrics, including accuracy, sensitivity, specificity, precision, and F1 score. Accuracy is the most commonly used metric in classification tasks, representing the proportion of correct predictions out of all predictions38. Sensitivity measures the proportion of actual positives correctly identified by the model; high sensitivity is crucial in medical diagnostics to minimize false negatives and ensure timely management of disease. Specificity is the proportion of actual negatives correctly identified; high specificity helps ensure that only patients who truly have the disease receive further diagnosis and treatment. Precision indicates the proportion of positive identifications made by the model that are actually correct. The F1 score is the harmonic mean of sensitivity and precision and provides a more comprehensive assessment of model performance, especially under class imbalance14,15,18,19.
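The keyword-matching step and per-condition metric computation can be sketched as follows; the dictionary entries, negation cues, and 30-character context window are illustrative assumptions, not the study's full dictionary or matching rules.

```python
import re
from sklearn.metrics import confusion_matrix, f1_score

# Illustrative sample entries; the study's dictionary is larger and in Chinese.
KEYWORDS = {
    "microaneurysm": ["microaneurysm", "microaneurysms"],
    "diabetic retinopathy": ["diabetic retinopathy"],
}
NEGATION = re.compile(r"\b(not observed|no apparent|without|not supported)\b")


def mentions_condition(report: str, condition: str, window: int = 30) -> bool:
    """Exact, case-insensitive keyword match that skips negated mentions."""
    text = report.lower()
    for kw in KEYWORDS[condition]:
        for m in re.finditer(rf"\b{re.escape(kw)}\b", text):
            context = text[max(0, m.start() - window):m.end() + window]
            if not NEGATION.search(context):
                return True
    return False


def condition_metrics(gt_reports, gen_reports, condition):
    """Per-condition accuracy, sensitivity, specificity, precision, and F1."""
    y_true = [mentions_condition(r, condition) for r in gt_reports]
    y_pred = [mentions_condition(r, condition) for r in gen_reports]
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[False, True]).ravel()
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
        "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
        "precision": tp / (tp + fp) if (tp + fp) else float("nan"),
        "f1": f1_score(y_true, y_pred, zero_division=0),
    }
```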

Manual evaluation of FFA-GPT

To ensure the integrity and precision of our assessment, we conducted a human evaluation of the reports generated by our model. For this evaluation, we randomly selected a subset of 100 images from the test set, and their corresponding generated reports were assessed by three experienced ophthalmologists (X.C., P.X., and Z.Z.) with an average of over five years of clinical experience. We followed an evaluation approach similar to that of Singhal et al.39, focusing on two aspects: incorrect content and missing content. The incorrect-content aspect evaluates the model's accuracy in report generation by determining whether the generated text contains any content that it should not. The missing-content aspect evaluates the model's completeness by determining whether the generated text omits any information that it should include.

The ophthalmologists independently rated the quality of the generated reports against the ground-truth reports for accuracy and completeness using a 3-level scale: 'No,' 'Yes, little clinical significance,' and 'Yes, great clinical significance.' To quantify variability among the raters and ensure reliable results, we calculated Fleiss' kappa40, a measure of inter-rater agreement used to assess the consistency among multiple raters in a categorical classification task. It takes into account both the observed agreement and the agreement expected by chance, providing a statistical measure of agreement beyond random chance. Its values are interpreted as follows: 0-0.2 indicates slight agreement, 0.2-0.4 fair agreement, 0.4-0.6 moderate agreement, 0.6-0.8 substantial agreement, and 0.8-1.0 almost perfect agreement.
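A minimal sketch of the Fleiss' kappa computation, using statsmodels, is shown below; the ratings matrix is toy data (rows are reports, columns are the three raters, categories coded 0-2 for the 3-level scale).

```python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Toy ratings: shape (n_subjects, n_raters), categories 0, 1, 2.
ratings = np.array([
    [0, 0, 0],
    [1, 1, 0],
    [2, 2, 2],
    [0, 1, 0],
])

table, _ = aggregate_raters(ratings)        # counts per category for each subject
kappa = fleiss_kappa(table, method="fleiss")
print(f"Fleiss' kappa = {kappa:.3f}")
```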

The answers to the report-related questions were also assessed by the three ophthalmologists using the same method as in the manual evaluation of report generation. As there is no ground truth for the QA pairs, the evaluation criteria in this part were based on current scientific consensus and the ophthalmologists' clinical experience. In addition to accuracy and completeness, we added two dimensions, harmfulness and satisfaction, to the evaluation criteria39. Specifically, accuracy considered the correct identification of medical terms in the answers and consistency with relevant medical recommendations and consensus. Completeness took into account not only direct answers to the inquiries but also necessary additional details relevant to clinical practice. Harmfulness was evaluated because the QA process may involve clinical recommendations that prompt patients to take real-world actions, and this stage lacks human supervision. The raters considered the potential physical or mental health-related harm that could result from actions prompted by a generated answer. The severity of harm was graded with reference to the Agency for Healthcare Research and Quality (AHRQ) common formats, with three options: 'No harm,' 'Moderate or mild harm,' and 'Death or severe harm,' based on the anticipated impact on vision and physical health41. The severity of harm was distinguished according to its direct or indirect impact on the patient's future vision and physical health. It is important to note that although our rating was based on the AHRQ scale, it should be considered a subjective assessment. Satisfaction was evaluated because relying solely on technical metrics, such as accuracy, may not fully capture users' genuine needs and expectations; the satisfaction assessment offered three options: 'Satisfied,' 'Fair,' and 'Unsatisfied.' Inter-rater agreement was again calculated using Fleiss' kappa, the same metric used for evaluating the generated reports.