Researchers from Google Research and DeepMind have published an evaluation of an artificial intelligence application used to understand and generate language in a clinical context. As part of the evaluation, the team proposes a benchmark framework for assessing the performance of language applications across various parameters.
The researchers share the aims of the work, noting that “medicine is a humane endeavour where language enables key interactions for and between clinicians, researchers, and patients. Yet, today’s AI models for applications in medicine and healthcare have largely failed to fully utilise language.”
They add that “attempts to assess models’ clinical knowledge typically rely on automated evaluations on limited benchmarks” and “there is no standard to evaluate model predictions and reasoning across a breadth of tasks”.
The researchers therefore propose a benchmark framework for human evaluation of model answers along axes including factuality, precision, possible harm, and bias. In this study they pilot the framework with physicians and lay users, assessing responses for agreement with clinical consensus, the likelihood and possible extent of harm, reading comprehension, recall of relevant clinical knowledge, manipulation of knowledge via valid reasoning, completeness, potential for bias, relevance, and helpfulness.
They state: “Our human evaluations reveal important limitations of today’s models, reinforcing the importance of both evaluation frameworks and method development in creating safe, helpful large language models (LLM) for clinical applications.”
The researchers note areas of potential application for language models could include knowledge retrieval, clinical decision support, and summarisation of key findings.
In their evaluation of PaLM, a 540-billion-parameter LLM, human assessment revealed key gaps in the model's responses, which led the team to apply “instruction prompt tuning (a simple, data and parameter-efficient technique for aligning LLMs to the safety-critical medical domain)” to the application.
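The core idea of prompt tuning is that the base model's weights stay frozen and only a small set of continuous “soft prompt” vectors, prepended to the embedded input, is learned. The following is a minimal toy sketch of that idea, not the actual PaLM or Med-PaLM setup: the frozen “model” here is just a fixed linear readout over pooled embeddings, and all dimensions, data, and the training target are hypothetical stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8            # embedding dimension (hypothetical)
n_prompt = 4     # number of soft prompt tokens to learn

W = rng.normal(size=(d, 1))          # frozen "model": a fixed linear readout
x = rng.normal(size=(6, d))          # embedded input tokens (also frozen)
target = 1.0                         # desired scalar output for this toy example

prompt = rng.normal(size=(n_prompt, d)) * 0.1  # the only trainable parameters

def forward(prompt):
    seq = np.vstack([prompt, x])     # prepend soft prompt to input embeddings
    return seq.mean(axis=0) @ W      # frozen pooling + readout

lr = 0.1
seq_len = n_prompt + x.shape[0]
for step in range(200):
    err = forward(prompt)[0] - target
    # gradient of the squared error with respect to the prompt vectors only;
    # the frozen parameters W and x receive no updates
    grad = np.tile((err * W.T) / seq_len, (n_prompt, 1))
    prompt -= lr * grad
```

After training, `forward(prompt)` lands close to the target even though the model weights never changed, which is what makes the technique data- and parameter-efficient: only `n_prompt × d` values are fitted, typically on a small domain-specific dataset.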
The resulting model, Med-PaLM, is said to perform encouragingly: “for example, a panel of clinicians judged 92.6 percent of long-form answers to be aligned with scientific consensus, on par with clinician-generated answers (92.9 percent).”
The researchers add: “We show that comprehension, recall of knowledge, and medical reasoning improve with model scale and instruction prompt tuning, suggesting the potential utility of LLMs in medicine.
“However, the safety-critical nature of the domain necessitates thoughtful development of evaluation frameworks, enabling researchers to meaningfully measure progress and capture and mitigate potential harms. This is especially important for LLMs, since these models may produce generations misaligned with clinical and societal values. They may, for instance, hallucinate convincing medical misinformation or incorporate biases that could exacerbate health disparities.”
Summarising, the researchers state: “While these results are promising, the medical domain is complex. Further evaluations are necessary, particularly along the dimensions of fairness, equity, and bias. Our work demonstrates that many limitations must be overcome before such models become viable for use in clinical applications. We outline some key limitations and directions of future research in our study.”