How reliable and consistent are large language models over time in radiology?
In an attempt to answer this question, researchers recently compared the capabilities of the large language models (LLMs) GPT-4, GPT-3.5, Claude, and Google Bard in answering multiple-choice practice questions from the American College of Radiology (ACR) Diagnostic In-Training Exam. Recently published in the European Journal of Radiology, the study evaluated the LLMs over a three-month period from November 2023 to January 2024.
The researchers found that GPT-4 demonstrated the highest overall accuracy rate at 78 percent, compared to 73 percent for Google Bard, 71 percent for Claude and 63 percent for GPT-3.5.
However, the study authors pointed out fluctuations in accuracy rates over the course of the study. While GPT-4 had an 82 percent accuracy rate in November 2023, it declined to 74 percent accuracy in January 2024. The researchers also noted a 13 percent decline in accuracy between November 2023 and December 2023 for GPT-3.5 (71 percent to 58 percent) and a six percent decrease during that time period for Google Bard (76 percent to 70 percent).
“ … The LLMs evaluated here achieved performance near or exceeding 70% on in-training radiology exams. This suggests a level of competence potentially comparable to radiology trainees. However, performance fluctuated over time and across models, highlighting the limited reliability of LLMs on radiology specific questions,” wrote lead study author Mitul Gupta, M.S., who is affiliated with the Department of Diagnostic Medicine at the Dell Medical School at the University of Texas at Austin, and colleagues.
The study authors also maintained that the LLMs demonstrated higher accuracy for broader topics such as disease progression and pathogenesis but struggled with topics such as BI-RADS scores that required more detailed knowledge of radiology.
“This suggests that while LLMs are proficient in general medical understanding, they require further refinement for detailed, specialized knowledge,” maintained Gupta and colleagues.
However, the researchers noted that intra-model discordance in responses decreased overall for all the reviewed models from the November 2023-December 2023 comparison to the December 2023-January 2024 comparison.
Three Key Takeaways
- Performance and accuracy. GPT-4 demonstrated the highest overall accuracy (78 percent) in answering radiology exam questions compared to other LLMs like Google Bard (73 percent) and GPT-3.5 (63 percent). However, all models showed performance fluctuations over time, indicating limited reliability for radiology-specific knowledge.
- Strengths and limitations. LLMs performed better on broader topics like disease progression but struggled with more radiology-specific topics like BI-RADS scores, suggesting their capability for general medical competence requires refinement for use in specialty areas of health care such as radiology.
- Improving consistency. Intra-model discordance rates decreased over the study period for all LLMs, indicating potential for improved consistency with continued development. For example, GPT-4's discordance rate dropped from 24 percent at November 2023-December 2023 to 15 percent at December 2023-January 2024.
While GPT-3.5 and Bard had more modest intra-model discordance rate decreases of 4 percent and 3 percent, respectively, the study authors cited a 17 percent decrease for Claude (from 31 percent to 14 percent) and a 9 percent decrease for GPT-4 (from 24 percent to 15 percent).
“The decreasing intra-model discordance rates observed over time hint at the potential for these models to achieve greater consistency with continued development,” posited Gupta and colleagues.
(Editor’s note: For related content, see “New Literature Review Finds ChatGPT Effective in Radiology in 84 Percent of Studies,” “Can GPT-4 Improve Accuracy in Radiology Reports?” and “Can ChatGPT and Bard Bolster Decision-Making for Cancer Screening in Radiology?”)
With regard to study limitations, the authors conceded that the evaluated LLMs are subject to misinterpretation and overgeneralization that may lead to inaccurate answers. They also noted that the analysis focused on text-based questions since the LLMs could not process image inputs at the time of the study. While prompting techniques weren't assessed in this study, they may impact the effectiveness of these models, according to the study authors.