
New Study Examines Short-Term Consistency of Large Language Models in Radiology


While GPT-4 demonstrated higher overall accuracy than other large language models in answering ACR Diagnostic in Training Exam multiple-choice questions, researchers noted an eight-percentage-point decrease in GPT-4’s accuracy rate from the first month to the third month of the study.

How reliable and consistent are large language models over time in radiology?

In an attempt to answer this question, researchers recently compared the capabilities of the large language models (LLMs) GPT-4, GPT-3.5, Claude, and Google Bard in answering multiple-choice practice questions from the American College of Radiology (ACR) Diagnostic in Training Exam. Recently published in the European Journal of Radiology, the study evaluated the LLMs over a three-month period from November 2023 to January 2024.
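
The paper does not publish its evaluation harness, but the basic protocol it describes (posing the same multiple-choice questions to each model and scoring the selected answers, repeated monthly) can be sketched in a few lines of Python. The `query_llm` function below is a hypothetical stand-in for whichever vendor-specific API client each model requires; only the scoring logic is shown.

```python
def query_llm(model: str, question: str, choices: dict[str, str]) -> str:
    """Hypothetical wrapper around a model-specific API client.

    Returns the single answer letter (e.g., "A") the model selects. A real
    harness would call the OpenAI, Anthropic, or Google client here and
    parse the letter out of the model's free-text response.
    """
    raise NotImplementedError("Plug in the vendor-specific client here.")


def score_exam(model: str, exam: list[dict]) -> float:
    """Score one model on a list of multiple-choice questions.

    Each item in `exam` looks like:
        {"question": "...", "choices": {"A": "...", ...}, "answer": "B"}
    """
    correct = sum(
        query_llm(model, item["question"], item["choices"]) == item["answer"]
        for item in exam
    )
    return correct / len(exam)


# Running score_exam once a month for each model yields the accuracy
# trajectories reported in the study (e.g., GPT-4: 0.82 in November 2023,
# 0.74 in January 2024).
```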

The researchers found that GPT-4 demonstrated the highest overall accuracy rate at 78 percent in comparison to 73 percent for Google Bard, 71 percent for Claude and 63 percent for GPT-3.5.

However, the study authors pointed out fluctuations in the accuracy rates over the course of the study. While GPT-4 had an 82 percent accuracy rate in November 2023, it declined to 74 percent accuracy in January 2024. The researchers also noted a 13-percentage-point decline in accuracy between November 2023 and December 2023 for GPT-3.5 (71 percent to 58 percent) and a six-point decrease during that time period for Google Bard (76 percent to 70 percent).
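
For precision, the declines cited here are absolute percentage-point changes rather than relative percentage changes; a quick calculation with the study's reported GPT-4 figures makes the distinction concrete:

```python
# GPT-4's reported accuracy at the start and end of the study window.
nov_2023, jan_2024 = 0.82, 0.74

point_drop = (nov_2023 - jan_2024) * 100            # 8.0 percentage points
relative_drop = (nov_2023 - jan_2024) / nov_2023    # ~9.8 percent relative

print(f"{point_drop:.1f}-point drop ({relative_drop:.1%} relative)")
```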

“ … The LLMs evaluated here achieved performance near or exceeding 70% on in-training radiology exams. This suggests a level of competence potentially comparable to radiology trainees. However, performance fluctuated over time and across models, highlighting the limited reliability of LLMs on radiology specific questions,” wrote lead study author Mitul Gupta, M.S., who is affiliated with the Department of Diagnostic Medicine at the Dell Medical School at the University of Texas at Austin, and colleagues.

The study authors also maintained that the LLMs demonstrated higher accuracy on broader topics, such as disease progression and pathogenesis, but struggled with topics requiring more detailed radiology knowledge, such as BI-RADS scores.

“This suggests that while LLMs are proficient in general medical understanding, they require further refinement for detailed, specialized knowledge,” maintained Gupta and colleagues.

However, the researchers noted that intra-model discordance (instances in which a model gave different answers to the same question in successive evaluations) decreased overall for all of the reviewed models from the November 2023-December 2023 comparison to the December 2023-January 2024 comparison.

Three Key Takeaways

  1. Performance and accuracy. GPT-4 demonstrated the highest overall accuracy (78 percent) in answering radiology exam questions compared with other LLMs such as Google Bard (73 percent) and GPT-3.5 (63 percent). However, all models showed performance fluctuations over time, indicating limited reliability for radiology-specific knowledge.
  2. Strengths and limitations. LLMs performed better on broader topics such as disease progression but struggled with more radiology-specific topics such as BI-RADS scores, suggesting that their general medical competence requires further refinement for specialty areas of health care such as radiology.
  3. Improving consistency. Intra-model discordance rates decreased over the study period for all LLMs, indicating potential for improved consistency with continued development. For example, GPT-4's discordance rate dropped from 24 percent in the November 2023-December 2023 comparison to 15 percent in the December 2023-January 2024 comparison.


While GPT-3.5 and Bard had more modest decreases in intra-model discordance rates of four and three percentage points, respectively, the study authors cited a 17-percentage-point decrease for Claude (from 31 percent to 14 percent) and a nine-point decrease for GPT-4 (from 24 percent to 15 percent).
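
The study does not include code, but the discordance rate is simple to compute: it is the fraction of questions for which a model's selected answer changed between two evaluation rounds. A minimal sketch, assuming each round's answers are stored as a list of answer letters in question order:

```python
def discordance_rate(round_a: list[str], round_b: list[str]) -> float:
    """Fraction of questions answered differently across two rounds.

    round_a and round_b hold one selected answer letter per question,
    in the same question order (e.g., ["A", "C", "B", ...]).
    """
    if len(round_a) != len(round_b):
        raise ValueError("Rounds must cover the same question set.")
    changed = sum(a != b for a, b in zip(round_a, round_b))
    return changed / len(round_a)


# Toy example: answers to five questions in consecutive months.
november = ["A", "C", "B", "D", "A"]
december = ["A", "C", "D", "D", "B"]
print(f"{discordance_rate(november, december):.0%} discordance")  # 40%
```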

“The decreasing intra-model discordance rates observed over time hint at the potential for these models to achieve greater consistency with continued development,” posited Gupta and colleagues.

(Editor’s note: For related content, see “New Literature Review Finds ChatGPT Effective in Radiology in 84 Percent of Studies,” “Can GPT-4 Improve Accuracy in Radiology Reports?” and “Can ChatGPT and Bard Bolster Decision-Making for Cancer Screening in Radiology?”)

In regard to study limitations, the authors conceded that the evaluated LLMs are subject to misinterpretation and generalization errors that may lead to inaccurate answers. They also noted that the analysis focused on text-based questions, as the evaluated LLMs could not process image inputs at the time of the study. While prompting techniques were not assessed in this study, they may affect the performance of these models, according to the study authors.
