A study published in the Journal of Medical Internet Research (and indexed by the National Library of Medicine) has assessed the reliability of medical information provided by ChatGPT. The authors found that the AI software scored well on describing conditions, problems, symptoms and treatments, but poorly on the transparency and clarity of its information, leaving it overall comparable to static internet information.
The authors make a number of recommendations: AI software should be limited to peer-reviewed published data; a bibliography should be implemented to support transparency; and a visible rating or score should be included to enable patients and healthcare professionals to “transparently and intuitively understand the degree of quality that a chatbot can provide.”
For the purposes of the study, five conditions were identified for analysis: gallstone disease, pancreatitis, liver cirrhosis, pancreatic cancer and hepatocellular carcinoma.
Medical information provided by ChatGPT on each of the five conditions was measured with the EQIP (Ensuring Quality Information for Patients) tool. The tool generates a score by examining topics such as whether the subject is clearly defined, the description of the medical problem or procedure, the purpose of interventions, potential symptoms, the use of everyday language, and the names of persons or entities producing supporting documents.
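The checklist logic behind this kind of scoring can be sketched as a simple sum of yes/no judgements. The item names below are illustrative paraphrases of topics mentioned above, not the actual EQIP instrument, and the real tool may weight or grade items differently:

```python
# Illustrative sketch of checklist-style scoring in the spirit of the EQIP
# tool: each item is judged yes/no and the "yes" answers are summed.
# Item names are paraphrases for illustration, not the instrument's wording.

EQIP_ITEMS = [
    "subject clearly defined",
    "medical problem or procedure described",
    "purpose of interventions defined",
    "potential symptoms covered",
    "everyday language used",
    "producing persons or entities named",
    # ... the full instrument covers 36 items in total
]

def eqip_score(answers: dict) -> int:
    """Sum the checklist items judged 'yes' for a piece of patient information."""
    return sum(1 for item in EQIP_ITEMS if answers.get(item, False))

answers = {item: False for item in EQIP_ITEMS}
answers["subject clearly defined"] = True
answers["medical problem or procedure described"] = True
print(eqip_score(answers))  # → 2
```

A document satisfying every item would reach the maximum score, which is why the study reports totals out of 36.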
For further analysis, the researchers identified guidelines for the five conditions from official healthcare bodies. For each condition, recommendations were taken from these guidelines and fed into ChatGPT as paraphrased questions. Agreement between the guidelines and the AI's answers was then measured independently by two of the authors.
The median EQIP score across the five conditions was 16 out of a possible 36. The researchers found that results varied between the analysed conditions: pancreatitis received the highest total (19) and hepatocellular carcinoma the lowest (14).
ChatGPT received a low score in a number of areas: it did not provide any information on the origin of its answers, and some information was noted to be “ambiguous”. The researchers said that the AI software did not present information in a logical order, as suggested treatment options “frequently did not follow clinical reasoning or guideline recommendations.” It also received no points in the analysis of language, because the chatbot “regularly uses complex medical terms without explanation of definition”, and its sentences were deemed long and “complex”.
However, despite concerns around logical structuring, ChatGPT received “positive evaluation” for the layout of answers. Researchers commented that the answer structure, which typically “consisted of a short introduction followed by a list of items and a short conclusion”, was “sufficiently well designed”.
With regard to the additional comparison between ChatGPT's answers and guideline recommendations, the two authors, rating independently, measured the mean agreement between the sources at 60 percent.
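This kind of figure can be sketched as the fraction of per-question judgements where the chatbot's answer matched the guideline. The judgements below are hypothetical, not the study's data:

```python
# Illustrative sketch (not the study's code): mean agreement between
# guideline recommendations and chatbot answers, judged per question.

def mean_agreement(judgements: list) -> float:
    """Fraction of questions where the answer agreed with the guideline."""
    return sum(judgements) / len(judgements)

# Hypothetical per-question agree/disagree judgements for one condition.
judgements = [True, True, False, True, False]
print(f"{mean_agreement(judgements):.0%}")  # → 60%
```

In the study itself, each rater's judgements would be collected separately and the results averaged across raters and conditions.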
Overall, the researchers reported that the AI provided “low-to-moderate quality information”; however, they also noted that medical information available online is reported to be of low quality for a number of conditions, and as such the “similarity in results can partially be explained by the AI mirroring available knowledge.”
They specified that ChatGPT scored higher in the content domain – around questions such as definition of subject and description of problems, symptoms or treatments – but lower in the issue and structure domains, which focus on factors such as whether names or entities are provided to back up information, and whether the information is accessible and clear.
The researchers highlighted one key observation for further discussion: ChatGPT “does not inform its user which medical information is controversial, which information is clearly evidence based and backed by high-quality studies, and even which information represents a standard of care.” This, they noted, is “a reflection of the mechanism behind ChatGPT, which resembles a refined search tool and data crawler more than an actual intelligence.”
Recommendations for the future
The study proposed “potential areas of improvement” for AI-based chatbots used for medical purposes. First, the authors suggested that the medical information used by AI software should be limited to peer-reviewed published data. In addition, a bibliography should be implemented to support transparency around where information is gathered.
Another suggestion is to add a visible rating or score to enable patients and healthcare professionals to “transparently and intuitively understand the degree of quality that a chatbot can provide.”
The researchers also recommend that the areas of improvement highlighted by the EQIP tool be heeded; for example, shortening sentences to improve accessibility and clarity.
“Lastly, awareness of the relevance of AI chatbots and their potential significance must be raised within the health care community,” the authors wrote, acknowledging that AI has the power to “transform how health care professionals search for medical information. In the future, chatbots might even replace guidelines, as clinicians will be able to rapidly obtain information and guidance, eliminating the need to find, download, and read large documents. AI chatbots could facilitate distribution of up-to-date knowledge, which would ultimately benefit patients.”
Citation: Walker HL, Ghani S, Kuemmerli C, Nebiker CA, Müller BP, Raptis DA, Staubli SM. Reliability of Medical Information Provided by ChatGPT: Assessment Against Clinical Guidelines and Patient Information Quality Instrument. J Med Internet Res 2023;25:e47479. doi: 10.2196/47479. PMID: 37389908.