For a recent HTN interview, we caught up with Michael Wornow, a computer science PhD student at Stanford University, to discuss some of his most recent projects, including his involvement with research on Advancing Responsible Healthcare AI with Longitudinal EHR Datasets.
Michael began by introducing himself as a PhD candidate currently in his 5th year, working on developing and operationalising AI models in healthcare under Dr. Nigam H. Shah (professor of medicine) and Dr. Chris Ré (professor of computer science).
He first highlighted key areas of their research, stating, “we’ve been very focused on not just making advancements on the methods side of things, but also thinking about practical deployment considerations, developing more rigorous evaluation frameworks, and publishing our research to make it more accessible”. Diving deeper into his own personal work, Michael shared how he has focused on “developing machine learning systems to improve how hospitals operate,” by using electronic health record (EHR) data.
He noted the two main threads underlying his research were: (1) how to improve the individual point decisions that clinicians make within larger workflows, and (2) how to understand and automate the larger end-to-end workflows that these decisions are a part of. When outlining the overall motivation, he narrowed it down to one core question: “How do we use AI to improve care delivery?”
The foundation model approach
To explain the background of his research, Michael told us a bit about the “foundation model” approach to machine learning, where instead of training many task-specific models that are each specifically designed for one narrow task, you train “one big model on tons of unlabelled data” and then use that one foundational model to complete a variety of “downstream tasks”. Giving an example of this, he noted ChatGPT as “one of the most successful foundation models”.
When relating the foundation model approach to his research, Michael went on to say, “when our lab started this work several years ago, we hadn’t quite seen that translation to the EHR domain. So, one of the first questions we asked ourselves was, how well does this approach transfer to the clinical setting? And to better understand that, we needed more rigorous benchmarks.” Michael continued that, “Unfortunately, there’s only a handful of public datasets available to researchers. So, if you’re not attached to a large hospital like Stanford, it’s very hard to curate the scale of data necessary for training and evaluation.”
Even for those like Michael who are fortunate enough to rely on the data from a hospital, there are still issues to contend with, as he noted, “the data can be very messy and is not typically standardised across papers. Not only does that make it impossible to do science in the strict sense of being able to replicate findings, but it also makes it very difficult for people to build on each other’s work, which slows down progress.”
Michael compared the accessibility of pretrained models in healthcare to more general-purpose models like BERT or Llama, noting how they “will have millions and millions of downloads, since virtually anyone can work off these models and fine-tune them for their own use cases.” However, Michael noted that the same thing didn’t exist for healthcare “even though we know that foundation models are great and we’ve seen strong initial successes with them in the healthcare domain.”
Given these potential benefits, the question facing the team boiled down to, “Can we enable more open and reproducible science around foundation models trained on EHR data by publicly releasing better datasets, benchmarks, and models?”
Advancing responsible healthcare AI with longitudinal EHR datasets
To answer this question, Dr. Shah’s team released three different benchmarks and datasets over the past year: EHRSHOT, INSPECT and MedAlign.
EHRSHOT explored how to model the “structured information within the EHRs of roughly 7,000 deidentified patient records,” Michael said. The dataset covers information such as procedure codes, diagnoses, lab orders, and more. Michael explained that “the most unique aspect of EHRSHOT is its longitudinal data, i.e. it covers the full health history of a patient (potentially over decades) rather than just being restricted to an ICU or ED visit like other public datasets such as MIMIC and eICU.” This was important because “some of the tasks we were looking at only made sense in a longitudinal setting”.
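As a rough illustration of what “longitudinal” structured EHR data means here, a patient record can be thought of as a chronologically ordered sequence of coded events spanning years, rather than a single ICU or ED stay. The sketch below is illustrative only; the event codes and helper names are hypothetical and not drawn from EHRSHOT itself:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Event:
    """A single coded clinical event (codes are hypothetical examples)."""
    when: date
    code: str   # e.g. an ICD/CPT/LOINC-style code
    kind: str   # "diagnosis", "procedure", "lab", ...

# A toy longitudinal record: events spread over nearly two decades,
# not restricted to one hospital visit.
record = [
    Event(date(2003, 5, 1), "E11.9", "diagnosis"),   # type 2 diabetes
    Event(date(2010, 2, 14), "4548-4", "lab"),       # HbA1c result
    Event(date(2021, 9, 30), "99213", "procedure"),  # office visit
]

def to_sequence(events):
    """Sort a patient's events chronologically and emit the code
    sequence that a sequence model over EHR data would consume."""
    return [e.code for e in sorted(events, key=lambda e: e.when)]

print(to_sequence(record))  # ['E11.9', '4548-4', '99213']
```

The point of the longitudinal framing is that the model sees the full timeline, which is what makes tasks with long horizons (e.g. predicting events years after a diagnosis) expressible at all.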
INSPECT was led by other students in Dr. Shah’s lab – namely, Zepeng Huo, Shih-Cheng Huang, and Ethan Steinberg – and focused on the multimodal nature of healthcare data. “EHRSHOT is focused on structured information like billing and procedure codes. However, INSPECT also contains images and text linked to the same patient”, Michael said. The third dataset was MedAlign, which was led by Scott Fleming and focused on text-based clinical tasks.
In addition to the data contained within these benchmarks, Michael highlighted how the team has also trained their own foundation models from scratch on “roughly two and a half million deidentified patient records at Stanford Hospital.” The team has publicly released roughly 20 EHR foundation models on Hugging Face, a community platform for AI researchers, “making it available for approved researchers to download and fine-tune our model for their own projects.”
Key findings and successes from the research study
When asked about key findings from the research, Michael said that creating the foundation models and “putting them on Hugging Face” was one of the most important aspects of the work, as it will “hopefully encourage more sharing of trained models amongst healthcare researchers.” He added that “getting more raw data out there was also important” because almost everyone currently works off the same one or two public EHR datasets. He noted that this “makes it difficult for the field to learn generalisable, reproducible lessons, as the vast majority of research over the past decade is essentially based on one dataset of roughly 40,000 ICU patients from a single hospital in Boston”.
Michael emphasised that developing new datasets is not easy, noting how “it took about a year and a half” and a “heroic amount of work by Nigam and Jason Fries, a research scientist in our lab, who both put a ton of effort into pulling together all of the papers, codebases, and stakeholders” for the team to publish their datasets. Ultimately, however, it was “worth it from the positive feedback we’ve gotten from a ton of people interested in this sort of data”. He added that “because of Jason’s and Nigam’s efforts, it will hopefully be easier for future dataset releases at Stanford as well.”
Looking ahead: the future of research in this area
When asked where this type of research will be in the next 5-10 years, Michael predicted that “AI models will be so good that it will be irresponsible for doctors not to use them.”
He went on to describe how these technologies could also help solve issues in healthcare inequality and accessibility, outlining how “many people don’t have access to a doctor, and even among those that do, there can be huge variation in outcomes. Infinitely scalable AI models trained on huge corpuses of medical knowledge can help level that playing field and give everyone access to state-of-the-art care.”
Lastly, he mentioned a recent effort towards fostering more cross-institutional collaborations to “create better standards for the deep learning for healthcare community” across evaluations, models, methods and frameworks. Michael highlighted the working group he’s been involved with called MEDS (Medical Event Data Standard), led by Matthew McDermott, a professor at Columbia University, which encourages collaboration from across the globe to facilitate a standardised approach to machine learning for healthcare.
Other key areas of research
Finally, Michael spoke to us about other research areas and projects that he’s been working on, including the use of large language models to accelerate the process of finding eligible patients for clinical trials. He said this could help reduce the need for clinical research co-ordinators to “scroll through every patient record one-by-one and manually check them against a list of forty to fifty eligibility criteria.” Instead, using an LLM, “we can do that in seconds at high accuracy.”
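The screening workflow Michael describes can be sketched at a high level: combine a patient’s record text with the trial’s eligibility criteria into a prompt, and ask a language model for a per-criterion judgement. The code below is an assumption about the general shape of such a pipeline, not the team’s actual implementation; `query_llm` and `toy_llm` are hypothetical stand-ins for a real model call:

```python
def build_prompt(patient_notes: str, criteria: list[str]) -> str:
    """Combine a patient's record text with the trial's eligibility
    criteria into a single instruction for the model."""
    numbered = "\n".join(f"{i + 1}. {c}" for i, c in enumerate(criteria))
    return (
        "Patient record:\n" + patient_notes + "\n\n"
        "For each criterion below, answer MET or NOT MET:\n" + numbered
    )

def screen_patient(patient_notes, criteria, query_llm):
    """Return True only if the model judges every criterion MET.
    `query_llm` is a hypothetical callable: prompt -> list of answers."""
    answers = query_llm(build_prompt(patient_notes, criteria))
    return all(a == "MET" for a in answers)

# Toy stand-in for an LLM: marks a criterion MET only if its phrase
# appears verbatim in the notes (a real model reasons far more flexibly).
def toy_llm(prompt):
    notes, _, crit_block = prompt.partition("For each criterion")
    criteria = [line.split(". ", 1)[1]
                for line in crit_block.splitlines()[1:] if ". " in line]
    return ["MET" if c.lower() in notes.lower() else "NOT MET"
            for c in criteria]

notes = "58-year-old with type 2 diabetes, no history of heart failure."
print(screen_patient(notes, ["type 2 diabetes"], toy_llm))                 # True
print(screen_patient(notes, ["type 2 diabetes", "on dialysis"], toy_llm))  # False
```

The appeal of this shape is that the per-patient loop a research coordinator would do manually becomes a batch job over the candidate pool, with the criteria list reused unchanged across every record.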
Michael has also been looking at automating administrative workflows within the hospital. He highlighted some work on automating basic workflows in Epic and outlined a vision for reducing the manual burden on clinicians “so that instead of the nurses running back and forth between the patient bed and their desk to place an order, they could just click a button and the computer would be able to automatically place the order for them”.
When closing our discussion, Michael shared some insights on what excites him most about the future of this research area. “When you sit in the computer science building here at Stanford, you can literally see the future being invented around you,” he said. “The hospital sits right across the street. Despite being so close geographically, however, there remains a large gap when it comes to technology. Bridging this gap is what really excites me.”
Reflecting on their dataset releases, Michael added, “I’ve been fortunate to collaborate with some of the most talented people in the space during my PhD. However, there’s still a lot of work to be done. I hope that these dataset and model releases encourage more smart folks to work in the space, and that these resources help to foster the development of a larger community around open and reproducible deep learning for healthcare.”
We’d like to thank Michael for taking the time to talk to us. Find out more about his research here.