What does human-centred design have to do with medical AI?

The majority of medical AI solutions crash and burn in the real world. It turns out that accuracy isn’t enough. A human-centred design approach is critical to the success of AI in healthcare, as Google also found out.

Rachel Dulberg
UX Collective

--

Image by macrovector — www.freepik.com

“What healthcare needs is human-centered AI: artificial intelligence that is not merely driven by what is technically feasible, but first and foremost by what is humanly desirable.” - Sean Carney, Chief Experience Design Officer, Philips

HOW MEDICAL AI FAILS IN THE REAL WORLD

MIT Technology Review recently reported that while hundreds of AI tools were developed in the past year to help overwhelmed doctors around the world diagnose and triage patients during the covid-19 crisis, none of them lived up to the task.

More than 230 AI prediction tools designed to help medical decision-making during the pandemic were examined in a study published in the British Medical Journal. The study concluded that all but two of the AI models demonstrated poor performance and could inadvertently cause more harm than good if they were to guide clinical decisions.

Some health industry skeptics have said that medical AI may end up not living up to the hype. Systems developed in one hospital often flop when deployed in a different facility. Software used in the care of millions of Americans has been shown to discriminate against minorities. And AI systems can learn to make predictions based on the brand of MRI machine used, the time a blood test is taken, or whether a patient was visited by a chaplain, rather than on factors related to the disease. In one well-known case, AI software incorrectly concluded that people with pneumonia were less likely to die if they had asthma, an error that could have led doctors to deprive asthma patients of the extra care they need.

The fact that AI models crash and burn in production is a well-known phenomenon. But in healthcare the stakes are high, and the need for reliable AI solutions more acute. Getting things wrong (missed diagnosis, wrong treatment, poor healthcare delivery) could have life-and-death consequences.

It’s clear that the way we build medical AI models needs to change.

We need to ensure that medical AI lives up to its promise and doesn’t just get stuck in the realm of “opportunity”.

The question is: why is it so hard to build useful medical AI? And how can you tell if your medical AI model is any good before wide implementation proves otherwise?

Medical AI design path. Who said this would be easy? Source: https://www.emjreviews.com/innovations/article/a-path-for-translation-of-machine-learning-products-into-healthcare-delivery/

WHY IS GOOD MEDICAL AI SO HARD TO BUILD?

Many of the issues surrounding medical AI implementation stem from fragmented healthcare systems, a lack of data infrastructure, interoperability issues and poor-quality data that renders models inaccurate and potentially biased.

The Alan Turing Institute, the UK’s national institute for data science and artificial intelligence, says that a lack of robust and timely data seriously hindered AI researchers during the covid-19 pandemic.

The pandemic has apparently created an extreme case of “Frankenstein datasets”: hastily produced, imperfect datasets that were often mislabeled or cobbled together from unknown or multiple sources, contained duplicates, and were skewed or biased. For example, one dataset contained a mix of x-ray images taken of patients standing up and lying down. Because the more seriously ill patients were imaged lying down, the model learned to wrongly predict serious covid illness from the person’s position.

“Errors like these seem obvious in hindsight. They can also be fixed by adjusting the models, if researchers are aware of them. It is possible to acknowledge the shortcomings and release a less accurate, but less misleading model. But many tools were developed either by AI researchers who lacked the medical expertise to spot flaws in the data or by medical researchers who lacked the mathematical skills to compensate for those flaws.” - MIT Technology Review
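One practical way to catch this kind of shortcut early is to test whether incidental metadata alone can predict the label; if it can, an image model trained on the same data may be learning the confounder rather than the disease. Below is a minimal sketch of such a check, where the file name and column names (patient position, scanner brand, outcome label) are hypothetical and not taken from the studies above.

```python
# Illustrative sketch only: check whether incidental metadata alone predicts the label.
# The file and columns (xray_metadata.csv, patient_position, scanner_brand, severe_covid)
# are assumptions for the sake of the example.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

df = pd.read_csv("xray_metadata.csv")

# Encode a couple of "nuisance" variables that should have nothing to do with disease.
X = pd.get_dummies(df[["patient_position", "scanner_brand"]])
y = df["severe_covid"]  # assumed binary label (0/1)

# If these variables alone separate the classes, an image model trained on the same
# data may be learning the confounder (e.g. lying down) rather than the disease itself.
auc = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5, scoring="roc_auc").mean()
print(f"metadata-only AUC: {auc:.2f}  (values well above 0.5 are a red flag)")
```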

But when trying to integrate an AI product into a complex healthcare environment, there are other factors at play. These factors go beyond the technology and are often overlooked by medical AI developers.

ACCURACY DOESN’T TELL THE FULL STORY

I’ve recently completed the Stanford University AI in Healthcare specialisation. Among other things, I’ve learned an interesting fact that’s often ignored when medical AI models are built.

AI solutions are commonly engineered and refined to optimise for certain well-known performance metrics, such as accuracy.

The trouble is that accuracy metrics don’t tell you anything about how effective your model will be in a healthcare setting.

Many medical AI models today are developed without considering the clinical utility, feasibility or the overall impact that the model will have, once implemented, on the healthcare system (be it from patients’, clinicians’, hospitals’ or other stakeholders’ perspective).

Google discovered that while medical AI can be super accurate in the lab, it’s a different story in the wild. It developed a deep learning algorithm that looks at images of the eye and searches for evidence of diabetic retinopathy, a leading cause of vision loss. The tool’s promise was to bypass the need to wait weeks for an ophthalmologist to review retinal images. But despite high accuracy (the tool demonstrated >90% sensitivity and specificity), it proved impractical in real-world testing, frustrated both patients and nurses with inconsistent results and did not work well with on-the-ground practices.

What went wrong?

HOW DO YOU KNOW IF YOUR MEDICAL AI MODEL IS ANY GOOD?

It’s well known that the journey from initial research to useful medical AI product can take years. As Google found, one important part of that journey is conducting user-centered research. This means studying how care is delivered and how it benefits patients, so we can better understand how algorithms could help, or even inadvertently hinder, assessment and diagnosis.

Taking a user-centred approach means following certain steps in the development, design and deployment of a medical AI tool that will help you evaluate how good your model really is.

MEDICAL AI’S DEPLOYMENT PATHWAY

But first, let’s zoom out for a second and review the 4 key components of the medical AI “deployment pathway”:

  1. Design and development — identify the right problem to solve and confirm that it can be solved with the available data and your AI model; source and manage the data; train and tune your model.
  2. Evaluation and validation — evaluate model performance and clinical utility; establish that your model works in its intended environment (e.g. a clinical care setting), including the human-machine interaction, clinical integration and interoperability; obtain pre-market regulatory approvals.
  3. Diffusion and scaling — deploy the product at scale, including data ingestion and integration into the EHR system; establish provider partnerships and obtain funding.
  4. Monitoring and maintenance — continuously monitor safety and performance; undertake regular architecture updates; add new training data and new features or functionality; adapt the model to account for changes in source data, clinical practice, patient populations, exposure rates and outcomes.

HUMAN-CENTRED DESIGN QUESTIONS TO ASK IN THE DEV & DESIGN PHASE

There are a few practical questions you need to ask as part of the initial design and development phase that will help you evaluate your idea from a human-centred POV:

1. What’s your research question?

Start with the problem you are trying to solve and ask: is that a problem worth solving? Can the problem be solved with the data and AI that is available? To answer these questions, an outcome-action analysis (outlined below) will help flesh out the practical real-world clinical problem your model would be solving.

2. Who are your stakeholders and beneficiaries?

Define who your stakeholders and beneficiaries are and plan how to involve them in the process.

  • Which stakeholders should be involved in the design, development, evaluation, validation and implementation of your model? These could be subject-matter experts, decision-makers and end users.
  • Who is your beneficiary — who is the AI solution made for? Who will use it in the real world (provider, patient, payer)?
  • Have all the stakeholders been involved in evaluating both the input data and the output data?

3. What are your data sources?

  • What data will you need for training your model and how will you get access to it?
  • How recent is the data and what is the quality?
  • How often will the data need to be updated to make a prediction (hourly? weekly? nightly?).
  • What is the population distribution for the question you’re trying to solve and does the data represent this distribution? Is the outcome (say positive diagnosis of a disease) well represented in the data?
  • Is the data potentially biased? (A few of these checks can be scripted; see the sketch after this list.)
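As a minimal sketch of how some of these data-source questions can be checked in practice, the snippet below uses pandas; the file name and column names (event_time, age_group, sex, diagnosis) are assumptions for illustration, not a real schema.

```python
# Quick data-source checks: recency, completeness, outcome prevalence, population mix.
# File and column names are hypothetical.
import pandas as pd

df = pd.read_csv("training_extract.csv", parse_dates=["event_time"])

# How recent is the data, and how complete is it?
print("date range:", df["event_time"].min(), "to", df["event_time"].max())
print("missing values per column:")
print(df.isna().mean().round(3))

# Is the outcome of interest well represented, or heavily imbalanced?
print("outcome prevalence:")
print(df["diagnosis"].value_counts(normalize=True))

# Does the training population resemble the population you will deploy to?
# (Compare these against known figures for the target clinical setting.)
for col in ["age_group", "sex"]:
    print(f"{col} distribution:")
    print(df[col].value_counts(normalize=True))
```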

4. Is your model useful and feasible? Is clinical uptake likely?

  • How will the model be integrated into the clinical workflow? e.g. how will you ingest data and surface the model’s output in real time?
  • Will your output integrate into the EHR system or be displayed separately?
  • Who will be your early adopters (i.e. users that will significantly buy into the solution and champion it)?
  • Will you need industry partners to help you validate and scale the solution?

HOW DO YOU EVALUATE WHETHER YOUR AI MODEL IS CLINICALLY USEFUL?

In the evaluation and validation phase you’ll need to carefully consider whether your model is accurate. But you also need to evaluate whether your model is going to be useful in a healthcare setting (aka the utility analysis).

A handy tool that will help you evaluate model utility is the Outcome-Action Pairing framework (or “OAP”).

  • The “outcome” is the purpose and output of the AI model — e.g. does it diagnose disease, classify patient risk or predict an event?
  • The “action” is the step to be taken based on the outcome that will improve clinical care. The key question to ask here is: is there a mitigating action that can be taken to change the outcome?

OAP will help you answer the following questions:

  • Do you have a model output that is actionable?
  • Will your model output positively impact patient outcomes or healthcare delivery, or have a population-level impact?
  • How will the solution be implemented at the point of care?

Then, you’ll be able to:

  • get a rough understanding of the minimum acceptable performance for your model
  • define how your model output would lead to action in various scenarios
  • tell what the overall utility of that model would be
  • answer the basic “research question” mentioned above — is the problem worth solving in the first place using machine learning?

While technical model evaluation typically focuses on metrics such as positive predictive value, sensitivity (recall), specificity and calibration, discovering the constraints on the action triggered by the model can often have a much larger impact on determining model utility.

This is because there are many factors that can affect the model’s clinical utility:

  • Lead time offered by the prediction
  • The existence of a mitigating action
  • The costs of intervening
  • The cost of false positives and false negatives
  • The logistics of the intervention — e.g. does the system have the capacity and resources to respond to the prediction?
  • Incentives for the individual and the healthcare system
  • How the model’s utility compares to current baseline performance: does it actually lead to improved patient outcomes or healthcare delivery compared to current metrics? (A rough sketch of how a few of these factors combine with the usual metrics follows this list.)
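To see how these factors interact with the usual metrics, here is a back-of-envelope sketch: it computes sensitivity, specificity and PPV from a hypothetical confusion matrix, then prices in the cost of false positives, false negatives and intervention logistics. Every number below is made up for illustration.

```python
# Technical metrics alongside a crude utility estimate. All figures are hypothetical.
tp, fp, fn, tn = 80, 40, 20, 860  # assumed validation results per 1,000 patients

sensitivity = tp / (tp + fn)   # recall
specificity = tn / (tn + fp)
ppv = tp / (tp + fp)           # positive predictive value
print(f"sensitivity={sensitivity:.2f}, specificity={specificity:.2f}, PPV={ppv:.2f}")

# Utility depends on what the triggered action costs and saves, not just on accuracy.
benefit_per_tp = 5_000    # assumed value of a timely, effective intervention
cost_per_fp = 800         # assumed cost of an unnecessary intervention
cost_per_fn = 3_000       # assumed cost of a missed case
cost_per_trigger = 200    # assumed logistics cost each time the model triggers action

net_utility = (
    tp * benefit_per_tp
    - fp * cost_per_fp
    - fn * cost_per_fn
    - (tp + fp) * cost_per_trigger
)
print(f"net utility per 1,000 patients: {net_utility}")
# The same arithmetic run on the current baseline pathway gives the bar the model must clear.
```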

HOW TO RUN AN OAP ANALYSIS ON A MEDICAL AI MODEL

When you run an OAP analysis on your medical AI model, ask the following questions:

  • How will the prediction be used? — e.g. does a mitigating action exist for the outcome? Will the prediction be used to alert high risk patients of an imminent heart attack, inform clinicians of a diagnosis, trigger a certain treatment or help make other clinical decisions? Will it be used to help with disease management or inform operational decisions?
  • Will the prediction reveal something clinicians don’t already know? If the prediction doesn’t enhance existing clinical practice or knowledge, it may not lead to valuable action.
  • Who will use the prediction? (patients, clinicians, ICU staff, payers?)
  • When will the prediction be used? What is the lead time offered by the prediction?
  • What’s the “window of observation” — when is data collected and aggregated?
  • What’s the “action window” — when would action based on your prediction need to be taken? Is it acute (i.e. immediate) or long-term? E.g. does the action based on your output have to happen immediately, or over the next month, year or five years?
  • In general, longer lead times (earlier warnings) lead to better outcomes. How quickly can your output be provided so it can be acted on?
  • What are the logistics and cost of the intervention?
  • What are the incentives for acting on the output (for patients/providers/payers)?

The analysis should boil down to 2 key elements:

1. What type of action will the solution lead to:
  • Operational?
  • Medical?
  • Population-level?

2. What is the action’s lead time — acute or long-term?

Evaluation framework for outcome-action pairing analysis for medical AI products. Source: Stanford University, Evaluations of AI in Healthcare course, Coursera
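One way to make the OAP analysis concrete is to capture each outcome-action pair as a structured record that can be reviewed with clinicians and other stakeholders. The sketch below is illustrative only; the fields simply restate the questions above, and the example values are hypothetical.

```python
# A minimal, illustrative record for an outcome-action pair. Fields and values are
# assumptions meant to mirror the OAP questions above, not a prescribed schema.
from dataclasses import dataclass
from typing import Literal

@dataclass
class OutcomeActionPair:
    outcome: str                # what the model predicts or classifies
    action: str                 # the mitigating action the output should trigger
    action_type: Literal["operational", "medical", "population-level"]
    lead_time: Literal["acute", "long-term"]
    actor: str                  # who acts on the prediction
    action_window: str          # when the action must happen
    adds_new_information: bool  # does it tell clinicians something they don't already know?
    response_capacity: bool     # does the system have the resources to respond?

example = OutcomeActionPair(
    outcome="predicted risk of deterioration within 24 hours",
    action="escalate to rapid-response team review",
    action_type="medical",
    lead_time="acute",
    actor="ward nurses and on-call physician",
    action_window="within hours of the alert",
    adds_new_information=True,
    response_capacity=True,
)
```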

HOW TO PUT PEOPLE AT THE CENTRE

In the Google example, the AI model designed to screen patients for diabetic retinopathy was road-tested through a partnership Google established with the Ministry of Public Health in Thailand (described in this research paper). They agreed to conduct field research over 8 months at 11 clinics across the provinces of Pathum Thani and Chiang Mai, in order to examine how nurses used a deep learning system in patient care.

At each clinic, Google observed how diabetes nurses handled eye screenings, and interviewed them to understand how to refine the technology. This field research was done alongside a study to evaluate the feasibility and performance of the model in the clinic, with patients who agreed to be carefully observed and medically supervised during the study.

“Developing new products with a user-centered design process requires involving the people who would interact with the technology early in development. This means getting a deep understanding of people’s needs, expectations, values and preferences, and testing ideas and prototypes with them throughout the entire process. When it comes to AI systems in healthcare, we pay special attention to the healthcare environment, current workflows, system transparency, and trust.” — Healthcare AI systems that put people at the center

Google’s research had several interesting findings:

  • Factor in environmental differences between the lab and healthcare settings — Google discovered that lighting, which varies among clinics, can affect the quality of images fed into the model and degrade its performance in real-life scenarios. Just as an experienced clinician might know how to account for such variables in order to assess the output, AI systems need to be trained to handle these situations.
  • Adapt the model’s protocol to incorporate real-time observations — Google amended the research protocol to have eye specialists review any images which the model classified as “ungradable” due to image quality, alongside the patient’s medical records, instead of automatically referring patients with ungradable images to an ophthalmologist. This helped ensure a referral was necessary, and reduced unnecessary travel, missed work, and anxiety about receiving a possible false-positive result.
  • Account for the human impacts of integrating an AI system into patient care in addition to evaluating the performance, reliability, and clinical safety of an AI solution — Google found that the AI system empowered nurses to confidently and immediately identify a positive screening, resulting in quicker referrals to an ophthalmologist.

WHAT DOES ALL THIS MEAN FOR MEDICAL AI DEVELOPERS?

“Deploying an AI system by considering a diverse set of perspectives in the design and development process is just one part of introducing new health technology that requires human interaction. It’s important to also study and incorporate real-life evaluations in the clinic, and engage meaningfully with clinicians and patients, before the technology is widely deployed. That’s how we can best inform improvements to the technology, and how it is integrated into care, to meet the needs of clinicians and patients.” - Google, Healthcare AI systems that put people at the center

Introducing a new health technology that requires human interaction means putting people at the centre of the design and development process and evaluating the model’s utility in real-life clinical scenarios from a diverse set of perspectives.

To help AI live up to its promise, improve health outcomes for everyone and drive adoption and trust, it’s vital that we create meaningful engagements with patients, clinicians, and other stakeholders and ask ourselves some hard practical questions, well before the solution is deployed at scale.

--


Privacy, data + product nerd. Former tech lawyer + founder. I write about issues at the convergence of innovation, technology, product & privacy.