Insight

Study highlights limitations of large language models in clinical diagnosis, interpretation of laboratory results and integration into existing workflows

A study published this month in the journal Nature Medicine evaluates the limitations of large language models (LLMs) in clinical decision-making, with the research team highlighting challenges around diagnostic accuracy, interpretation of laboratory results and safe integration into existing clinical workflows.

For the purposes of the study, a leading open-access LLM was examined along with its derivatives, including generalist versions and medical-domain aligned models. The researchers note that four of the derivatives explored “have been shown to match and even exceed ChatGPT performance on medical licensing exams and biomedical question answering tests”.

The researchers curated a dataset of 2,400 real patient cases covering four common abdominal pathologies (appendicitis, pancreatitis, cholecystitis and diverticulitis), along with an evaluation framework designed to allow the LLMs to “autonomously engage in every step of the clinical decision-making process” by simulating a realistic clinical setting. The LLMs were provided with the patient’s history of illness and asked to gather and synthesise additional information such as laboratory results and imaging reports until a diagnosis and treatment plan could be reached. The diagnostic accuracy of the LLMs was then compared with that of clinicians, and the research team also evaluated a number of additional characteristics, including adherence to diagnostic and treatment guidelines and correct interpretation of laboratory test results.
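For illustration, the kind of evaluation loop described above can be sketched in a few lines of Python. This is a minimal sketch rather than the authors’ framework: the PatientCase fields, the run_case harness, the prompt wording and the ask_model callable are hypothetical stand-ins, intended only to show the idea of a model requesting information step by step before committing to a diagnosis.

```python
# Minimal sketch of an autonomous clinical decision-making harness.
# All names and prompt wording here are illustrative assumptions,
# not the study's actual evaluation framework.

from dataclasses import dataclass
from typing import Callable


@dataclass
class PatientCase:
    history: str                 # history of illness provided up front
    information: dict[str, str]  # e.g. {"lab results": "...", "imaging report": "..."}
    diagnosis: str               # ground-truth label used for scoring


def run_case(case: PatientCase, ask_model: Callable[[str], str], max_steps: int = 5) -> bool:
    """Let the model request information until it states a final diagnosis."""
    context = f"Patient history: {case.history}\n"
    options = ", ".join(case.information)
    for _ in range(max_steps):
        reply = ask_model(
            f"{context}Request one of: {options}, "
            "or answer 'Final diagnosis: <condition>'."
        ).strip()
        if reply.lower().startswith("final diagnosis:"):
            prediction = reply.split(":", 1)[1].strip().lower()
            return prediction == case.diagnosis.lower()
        # Otherwise treat the reply as an information request and add it to the context.
        context += f"{reply}: {case.information.get(reply.lower(), 'not available')}\n"
    return False  # the model never committed to a diagnosis


def diagnostic_accuracy(cases: list[PatientCase], ask_model: Callable[[str], str]) -> float:
    """Fraction of cases in which the model's final diagnosis matched the ground truth."""
    return sum(run_case(c, ask_model) for c in cases) / len(cases)
```

Passing the model in as a callable keeps the harness independent of any particular LLM or API, which is what allows the same loop to be run against several models and compared with clinician performance.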

Firstly, the study explored the performance of the models when presented with all available information about a patient, and found that “current LLMs perform significantly worse than clinicians on aggregate across all diseases”.

Whilst most of the models were able to match clinician performance on the simplest diagnosis (appendicitis), the study indicates that reaching an accurate diagnosis for the other conditions proved a challenge for the LLMs. According to the data, diagnostic accuracy for cholecystitis ranged from 13 percent to a high of 68 percent across the LLMs, versus 84 percent for clinicians.

A similar gap was noted for diverticulitis, with LLMs achieving 35 to 59 percent diagnostic accuracy versus the clinicians’ score of 86 percent.

The research team also highlighted that the two specialist medical LLMs did not display a “significantly better” performance, commenting: “As the medical LLMs are not instruction tuned (that is, trained to understand and undertake new tasks), they are unable to complete the full clinical decision-making task where they must first gather information and then come to a diagnosis.” As such, they were excluded from further analysis, with the research team focusing on the non-specialist LLMs for the remainder of the study.

The study then explored the diagnostic accuracy of LLMs in an autonomous clinical decision-making scenario, in which the models were required to specify all the information they wished to gather in order to make their diagnosis. A “general decrease in performance” was seen as a result. One remaining LLM’s diagnostic accuracy fell from 58.8 percent to 45.5 percent; another from 67.8 percent to 54.9 percent; and the third from 65.1 percent to 53.9 percent.

The researchers observed that the LLMs often failed to order the exams required by diagnostic guidelines; only one of the LLMs in the study consistently asked for physical examination results so that they could be considered in the eventual diagnosis. Highlighting a “lack of consistency” when it came to LLMs ordering the tests required for diagnosis under current guidelines, the research team shared their view that the models possess a “tendency to diagnose before understanding or considering all the facts of the patient’s case”.

All LLMs were deemed to perform “very poorly” in interpretation of laboratory tests, which the researchers tested by providing each test result alongside its reference range and asking the models to classify the result against that range. LLM performance was said to be particularly poor when it came to classifying high or low test results, which the researchers stated is “critical”.
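The laboratory-interpretation check lends itself to a similarly small sketch, assuming each result is to be labelled low, normal or high relative to its reference range; the LabResult fields and function names below are illustrative and not taken from the paper.

```python
# Sketch of scoring a model's lab-result classifications against reference ranges.
# Names and the low/normal/high labelling scheme are assumptions for illustration.

from dataclasses import dataclass


@dataclass
class LabResult:
    name: str
    value: float
    ref_low: float
    ref_high: float


def classify_result(result: LabResult) -> str:
    """Ground-truth label derived directly from the reference range."""
    if result.value < result.ref_low:
        return "low"
    if result.value > result.ref_high:
        return "high"
    return "normal"


def interpretation_accuracy(results: list[LabResult], model_labels: list[str]) -> float:
    """Fraction of results the model labelled consistently with the reference range."""
    correct = sum(
        label.lower() == classify_result(result)
        for result, label in zip(results, model_labels)
    )
    return correct / len(results)


# Example: a white cell count above the reference range should be labelled "high".
wcc = LabResult(name="White cell count", value=14.2, ref_low=4.0, ref_high=11.0)
assert classify_result(wcc) == "high"
```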

The models were also found to “generally fail to adhere to treatment guidelines”, with LLMs “consistently” failing to recommend “appropriate or sufficient treatment, especially for patients with more severe forms of the pathologies”.

The researchers summarise their findings by stating: “LLMs do not reach the diagnostic accuracy of clinicians across all pathologies when functioning as second readers, and degrade further in performance when they must gather all information themselves.” As such, “without extensive physician supervision, they would reduce the quality of care that patients receive and are currently unfit for the task of autonomous clinical decision-making.”

Hager, P., Jungmann, F., Holland, R. et al. Evaluation and mitigation of the limitations of large language models in clinical decision-making. Nat Med (2024). https://doi.org/10.1038/s41591-024-03097-1 License here, adaptations made.

Research spotlight

Last week, we highlighted a study exploring the impact of remote monitoring on patients who had recently experienced a heart attack, which reportedly found that “telemedicine patients were 76 percent less likely to be readmitted to hospital within six months and 41 percent less likely to attend A&E, compared to those who followed normal care pathways”.

In June, we reported on a study launching in Birmingham and Solihull which aims to improve the care of people living with psychosis and multimorbidities, particularly for marginalised populations, by co-designing resources based on patient experiences and using digital means to encourage people from a range of backgrounds to take part.