Unpacking Averages: Understanding the Potential for Bias in a Sepsis Prediction Algorithm, a Case Study
Would it surprise you if I told you that a popular and well-respected machine learning algorithm developed to predict the onset of sepsis has shown some evidence of racial bias? How can that be, you might ask, for an algorithm that is simply grounded in biology and medical data? I’ll tell you, but I’m not going to focus on one particular algorithm. Instead, I will use this opportunity to talk about the dozens and dozens of sepsis algorithms out there. And frankly, because the design of these algorithms mimics many other clinical algorithms, these comments will be applicable to clinical algorithms generally.
This may sound like an intimidating technical topic, but I’m going to keep it simple. I went to public schools, so I’m going to leave the complicated math to MIT graduates.
Before I dive into the topic, I just want to prepare you for a potential aha moment. To have that moment, there are basically only a few things you need to understand.
- Averages. If I calculated the average age at an AARP meeting and the average age in a high school classroom, they would likely be significantly different, even though we’re all Americans. Okay, now you know all the math you need to know.
- We are all different, even biologically or perhaps especially biologically. Everyone talks about personalized medicine because our genetic makeup makes us individuals, but I’m also talking about the fact that there are subgroups in America that are different from other subgroups. There are, for example, important medical differences between treating kids and senior citizens. But there are also important differences in things like our average vital signs depending on the social determinants of health. If one group struggles with getting adequate, healthy, low-sodium food and suffers from higher stress due to job or other financial insecurity, that will show up in their vital signs.
- It is reasonably well established, for example, that there are differences in average vital signs between Blacks and Whites. This might blow your mind. Using only six common vital signs, and using only standard machine learning techniques (nothing too fancy), researchers were able to predict a person’s race with a high level of accuracy. Think about that. The patterns in vital signs are so significant that we can predict a person’s race just from 6 vital signs.
- Combine the ideas in 1 and 2, and you may see an important insight. Look at this chart:
If you take the average blood pressure at the AARP meeting, it may well be 135/70 mm Hg, whatever that means. If you are a doctor treating an 18-year-old woman and you use that value as the “normal,” it might affect how you treat her, to her detriment.
- As a result, when I say bias, please don’t automatically assume I am talking about the bias that is the product of prejudice in the dark recesses of human thoughts. Sometimes bias, meaning unwanted bias, comes simply from the fact that a minority is, well, a statistical minority. They may be statistically underrepresented in some calculation that is then used for clinical decision-making, and that can mean they get inferior care if the number is not a good benchmark for them. The 18-year-old woman in my example may get the wrong care, as if her blood pressure were too low, because the benchmark calculation reflected a group that didn’t include many people like her.
A nonmedical example of this underrepresentation principle is the research that was done on facial recognition algorithms that did a pretty good job of identifying white men, but a much less accurate job of identifying black women, simply because the algorithms hadn’t been trained on very many black women. Underrepresentation in the data from which the algorithm learns about the world is a mathematical problem that impacts the performance of the resulting algorithm on that underrepresented group. No human prejudice is required.
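For readers who like to see the arithmetic, here is a minimal sketch of the averages point. Every number in it is made up for illustration (the readings, the group sizes, and the idea of using a pooled mean as the "normal" are all assumptions, not clinical data). It simply shows that when one group dominates the sample, the pooled average sits close to that group and far from everyone else.

```python
from statistics import mean

# Hypothetical systolic blood pressure readings in mm Hg (invented numbers).
majority = [138, 132, 136, 134, 135, 140, 130, 135]  # e.g. the AARP meeting
minority = [110, 114]                                # e.g. two 18-year-olds

# A single "normal" computed over everyone is pulled toward the larger group.
pooled_benchmark = mean(majority + minority)

def gap(readings, benchmark):
    """Average distance between a group's readings and the benchmark."""
    return mean([abs(r - benchmark) for r in readings])

majority_gap = gap(majority, pooled_benchmark)
minority_gap = gap(minority, pooled_benchmark)

print(f"pooled benchmark: {pooled_benchmark:.1f}")
print(f"typical distance from benchmark, majority: {majority_gap:.1f}")
print(f"typical distance from benchmark, minority: {minority_gap:.1f}")
```

In this toy data the benchmark is a reasonable yardstick for the larger group and a poor one for the smaller group, and nobody had to intend that result.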
That’s all the math you need to know to make sense of the risk of bias in a sepsis prediction algorithm.
I wanted to label this section Background on Math and Clinical Practice, but I was afraid you wouldn’t read it.
Introduction – For Real This Time
To begin the topic of algorithms that predict sepsis, let’s start with its clinical use and importance. A few contextual facts to keep in mind.
- Sepsis, a combination of infection, inflammation and shock, is a common and deadly disease. About 1.7 million adults develop sepsis every year in the United States and more than 250,000 of them die.
- Sepsis, similar to conditions like a stroke, requires immediate treatment to increase the chance of a successful outcome. Literally minutes matter.
- The onset of sepsis is not obvious. It presents subtly at first, and is notoriously difficult to recognize. Indeed, despite its lethal outcomes, the medical establishment has had difficulty even defining precisely what sepsis is. It appears very amorphous.
- Therein lies the clinical problem, in that many people die of sepsis because healthcare providers didn’t or couldn’t see it soon enough to treat it effectively.
- And that’s the clinical opportunity that the sepsis algorithms are designed to address, namely using an algorithm fed with data to recognize the potential for sepsis earlier such that treatment can begin earlier and be more effective. This has led some experts to observe that these algorithms could save thousands of lives.
Hospitals all around the world are developing sepsis algorithms designed to flag patients who may be developing sepsis earlier than their human caregivers might otherwise see. There are scores of articles describing these algorithms in the scientific literature, and of course many uses of novel experimental products do not get reported in the scientific literature. Each algorithm tends to be slightly or maybe even significantly different. Here, at a high level, is how they work based on a review of the literature:
- The developer of the algorithm selects different types of data to feed into the algorithm. As already noted, they tend to be different for each algorithm, but here are the general categories of data that are fed into some of these algorithms:
- The most popular data are vital signs, which include such things as blood pressure, heart rate, respiration rate, temperature, oxygen saturation and so forth. Often these data are fed into the EHR automatically from devices attached to the patient, such as an SpO2 monitor.
- Clinical laboratory values from tests on, for example, blood, which are entered into the EHR. There are literally dozens of different analytes tracked in these laboratory tests whose values get fed into the EHR. Examples include total white cell count, culture results, lactate, high-sensitivity C-reactive protein, arterial blood gas and something called procalcitonin. I have no idea what that is.
- Clinical data. This is a catchall for other so-called structured data (for example using drop-down menus) that clinicians might enter into the EHR either from physical observation or therapies the doctor selects such as the use of vasopressors or antibiotics.
- Medical images. Some hospitals are experimenting with taking information from radiological or pathology images and including them in the training data.
- Physician notes. A clever group in Singapore noted that there is a tremendous amount of information maintained in clinician notes in EHRs and that such information, or at least the topics addressed in those notes, should be considered by these algorithms to give a broader perspective on the patient. This includes a wide variety of information, such as diagnostic observations that on their face are unrelated to sepsis, other drugs or medications that the patient is on and so forth. The study done by the Singapore group observes that there is some benefit to the accuracy of the algorithm in predictions during the first four hours, but there’s a very considerable benefit in the predictive value of the algorithm in identifying potential sepsis during the 4-hour to 48-hour time period.
- The hospital then chooses how to implement the algorithm, and it might, for example, run the algorithm on all the data in the EHRs of all of the patients in the ICU at the hospital every hour to see whether any patients appear to be trending toward sepsis.
- If the algorithm spots a patient who appears to be trending toward sepsis, an electronic alert is sent to the human caregivers that they should take a closer look at the patient and consider beginning treatment.
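The workflow just described (collect data, score every patient on a schedule, alert the caregivers) can be sketched in a few lines. This is only an illustration under loud assumptions: `Snapshot`, `screen`, and `toy_risk_model` are hypothetical names, the cutoffs inside the toy model are placeholders rather than clinical criteria, and a real product would use a trained model fed from the EHR.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Snapshot:
    """One patient's latest data as it might be pulled from the EHR (hypothetical shape)."""
    patient_id: str
    features: Dict[str, float]

def screen(snapshots: List[Snapshot],
           risk_model: Callable[[Dict[str, float]], float],
           threshold: float) -> List[str]:
    """Return the IDs of patients whose risk score meets the alert threshold."""
    return [s.patient_id for s in snapshots if risk_model(s.features) >= threshold]

def toy_risk_model(features: Dict[str, float]) -> float:
    # Placeholder scoring: the fraction of three vital signs outside made-up limits.
    # These limits are assumptions for illustration, not clinical guidance.
    flags = [
        features["heart_rate"] > 100,
        features["resp_rate"] > 22,
        features["temp_c"] > 38.0,
    ]
    return sum(flags) / len(flags)

# One pass of the screening run over a tiny, invented ICU census.
icu = [
    Snapshot("bed-1", {"heart_rate": 82, "resp_rate": 16, "temp_c": 36.9}),
    Snapshot("bed-2", {"heart_rate": 118, "resp_rate": 26, "temp_c": 38.6}),
]
alerts = screen(icu, toy_risk_model, threshold=0.6)
print(alerts)
```

A hospital would run something like `screen` on a schedule, for example hourly, and route the resulting IDs to the care team. Notice that the model and the threshold are both design choices, and each is a place where a benchmark built mostly on one group can creep in.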
Many of these algorithms developed by individual hospitals or medical centers appear, based on preliminary research, to be a significant improvement over human observation alone. They are catching sepsis earlier, and saving lives.
But the question is, are they biased? Are they working better for some subpopulations than others? Calculating an average improvement across all people does not give us any insight into how well these algorithms work on specific subpopulations.
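To see why an average can hide the answer, consider this small sketch. The records are invented, the group labels "A" and "B" are placeholders, and sensitivity (the share of true sepsis cases the algorithm flagged) is just one of several measures an evaluator might choose.

```python
from collections import defaultdict

# Each record: (group, patient_had_sepsis, algorithm_flagged_patient). Invented data.
records = (
    [("A", True, True)] * 8 + [("A", True, False)] * 2 +
    [("B", True, True)] * 1 + [("B", True, False)] * 1
)

def sensitivity_by_group(rows):
    """Share of true sepsis cases that were flagged, per group and overall."""
    caught = defaultdict(int)
    total = defaultdict(int)
    for group, had_sepsis, was_flagged in rows:
        if not had_sepsis:
            continue  # sensitivity only looks at patients who truly had sepsis
        for key in (group, "overall"):
            total[key] += 1
            if was_flagged:
                caught[key] += 1
    return {key: caught[key] / total[key] for key in total}

rates = sensitivity_by_group(records)
for key, rate in rates.items():
    print(f"{key}: {rate:.2f}")
```

With these made-up numbers the overall figure looks respectable because group A dominates the count, while group B's cases are caught only half the time. The overall number alone would never reveal that.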
How We Start the Process of Evaluating a Healthcare Algorithm for Bias
In employment law, many of the statutes and regulations specify categories of “protected classes,” groups of people who have been found by lawmakers to be vulnerable because of historic discrimination and who need to be protected from further discrimination. Healthcare lawyers, on the other hand, must start by figuring out who the categories of people are that need to be statistically evaluated. The law doesn’t designate any protected categories or classes of people.
Instead, healthcare law uses an admittedly vague standard of “reasonable” evaluation and testing taken from tort law. We first must think about who might be disadvantaged by the algorithm based on historical facts. In other words, we must come up with our own categories based on historical data. In healthcare, we ask where are the health disparities?
Healthy People 2030 defines a health disparity as “a particular type of health difference that is closely linked with social, economic, and/or environmental disadvantage. Health disparities adversely affect groups of people who have systematically experienced greater obstacles to health based on their racial or ethnic group; religion; socioeconomic status; gender; age; mental health; cognitive, sensory, or physical disability; sexual orientation or gender identity; geographic location; or other characteristics historically linked to discrimination or exclusion.” That’s quite a list, and not very tightly defined.
For purposes of the rest of this blog post, I’m going to make it easy again. I’m going to pick just one category – race – and then show how we think about that one category for potential bias. But just remember, to do this right, we must come up with all the categories and do the same for each.
Thinking About the Potential for Racial Bias in a Sepsis Algorithm
To do this phase of the analysis right requires much more than this short blog post can address, but we need to begin by thinking broadly about all the different ways through which race can enter this seemingly scientific process of analyzing data from an EHR to try to predict the potential onset of sepsis. We must consider things like how:
- the features were selected for use in the algorithm, and whether there were other features that should have been included that would have provided a more balanced insight into all the races,
- each of the data points used by the algorithm is created and then filtered before it is fed into the algorithm,
- the algorithm itself works from a technical standpoint, and whether the math behind it is somehow leading to bias, and
- the output is used, to spot whether it might somehow lead to more accurate results for one race over another.
The starting point is to make a list of vulnerable populations that could be hurt, given the general nature of the data and the algorithm. To do this, we develop a broad list of the different places where demographics and the social determinants of health affect the data. For example, gender bias might be present when the training data are the product of clinical trials. This is because clinical trials heavily skew male, due to the typical exclusions for pregnant women, women in menopause, and women using birth control. Much of the research in biomarkers, for example, comes out of clinical trials, and so the use of such biomarkers as features in an algorithm may well interject gender bias. Algorithms that recommend a drug may do the same, for the same reason. The point is we must start with a general understanding of where different groups have been historically disadvantaged, or where the data are skewed, when looking at a new algorithm.
For purposes of this blog post, I’m simply going to illustrate a few obvious examples of how race might ultimately impact the effectiveness of these sepsis algorithms. Consider three examples.
- The use of vital signs that are racially skewed. From a statistical standpoint, one of the most important types of features considered by these algorithms is vital signs, which as already pointed out are highly correlated with race. If we are using vital signs to predict the onset of sepsis, and if vital signs are correlated with race, and if the majority of the training data set used to create an algorithm comes from the majority race, then it’s quite possible that the algorithm won’t be as effective when used for racial minorities. I won’t repeat that, but you might want to read that sentence again because it contains a lot. The simple point is that if we are using vital signs that represent the racial majority in the United States, it’s quite possible that an algorithm that relies on those vital signs will perform less well when used with racial minorities. I said that vaguely because, for all I know, the algorithm may produce more false positives or more false negatives. Remember, false positives are the algorithm predicting that a person might be trending toward sepsis when in fact she isn’t. That produces its own harm in that the patient might get treated with expensive and risky antibiotics when she shouldn’t be, or healthcare professionals might start to realize that the algorithm doesn’t work very well with minorities and stop using it. At the very least there is an opportunity cost, in the sense that we had the opportunity to improve the health care of minorities and we didn’t do it. The downsides of a false negative, saying that someone is healthy when she isn’t, are more obvious and include the potential for death.
- The use of hardware that performs differently for people of different races. I indicated above that one of the data points collected is SpO2, which as you may know is collected via an electronic pulse oximeter. Those are the little gadgets they attach to the tip of your finger that read the oxygenation rate in your blood by shining a light. It turns out that those gadgets don’t work as well on people with dark skin. The FDA has been pursuing a solution. Here there was no prejudicial intent, but nonetheless the technology simply didn’t work as well for all races. Given that pulse oximeters are a data source for some of these sepsis algorithms, the bias from that hardware could be expected to infect the performance of the algorithms.
- The use of physician text notes. As I indicated earlier, some researchers in Singapore realized that adding information collected from EHR physician notes could improve the performance of sepsis algorithms especially over the longer haul, meaning from 4 hours to 48 hours. While the improved accuracy is wonderful, adding physician notes does necessarily raise issues of potential bias given its source in human judgment and communication. On the one hand, human judgments, which include physician judgments, may be racially biased. It’s well-established, just as an example, that physicians interpret information from Black patients about pain differently than they do White patients. Physician implicit bias has been associated with false beliefs that Black patients have greater pain tolerance, thicker skin, and feel less pain than White patients. At any rate, bias in the physician/patient interaction is hardly a newsflash. But it goes well beyond mental bias. People who are uninsured, which may be a higher number of Black Americans over White Americans, will simply have less information in their EHRs because they encounter the healthcare system less. According to the Census Bureau, the U.S. uninsured rate in 2021 across race and Hispanic origin groups ranged from 5.7% for White, non-Hispanic people to 18.8% for those identifying as American Indian and Alaska Native, non-Hispanic. Hispanic or Latino people had among the highest uninsured rate in the nation at 17.7%. Indeed, there are simply disparities in the number of encounters that people of different races have with primary care physicians, whether it’s related to insurance or not. The point is there are lots of reasons to believe that there may be racial differences in the meaning and value of physician notes that would end up then impacting the effectiveness of the sepsis algorithm.
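The pulse oximeter example in the list above lends itself to a tiny illustration of how a measurement error travels downstream. Everything here is an assumption made for the sketch: the 92% cutoff, the 3-point over-read, and the function name `low_oxygen_flag` are hypothetical and are not taken from any device specification or from any actual sepsis algorithm.

```python
def low_oxygen_flag(measured_spo2: float, cutoff: float = 92.0) -> bool:
    """Toy rule: flag the patient when the measured SpO2 falls below the cutoff.

    The cutoff is a placeholder for illustration, not clinical guidance.
    """
    return measured_spo2 < cutoff

true_spo2 = 90.0              # the patient's actual oxygen saturation (invented)
hypothetical_overread = 3.0   # assumed device over-read; not a measured error

accurate_reading = true_spo2
biased_reading = true_spo2 + hypothetical_overread

flag_accurate = low_oxygen_flag(accurate_reading)
flag_biased = low_oxygen_flag(biased_reading)

print(f"flag with accurate reading: {flag_accurate}")
print(f"flag with over-reading device: {flag_biased}")
```

The same patient is flagged with an accurate reading and missed with the over-reading device. The rule itself treats everyone identically, so the unequal result comes entirely from the input.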
These are just a few of the ways that race can impact a sepsis algorithm. A true analysis would need to explore other avenues as well.
Solutions To Bias Found
I’d like to point out that each of these challenges arguably prompts a different solution. For example, the fact that vital signs are correlated with race may mean that race needs to be explicitly considered by the algorithm. That way, the algorithm can differentiate between Black patients and White patients, for example, and give each a better prediction.
The hardware problem involving pulse oximetry obviously would benefit from a hardware solution. In the absence of a hardware solution, predictions for people with darker skin need to be taken with a grain of salt and the users alerted to the potential bias.
The hardest to solve is any bias found in the physician notes. We can’t make information out of thin air, so if the information is simply not included for many of the users, there is no solution to that other than a more fundamental improvement in our healthcare system. So again, perhaps the best we can do is make sure that the physician users are sensitive to the bias.
What’s the Practical Effect of All This?
What I provided above is theory built on facts from research. I started my analysis with research into what is understood clinically about such things as vital signs, and research on how race manifests itself in healthcare data. But theory built on facts is still theory. So it needs to be tested. More to the point, these clinical algorithms generally need to be tested clinically to determine whether they are safe and effective for use, and in particular tested for differential impact on vulnerable categories of people.
As already noted in the preamble to this post, a developer did the responsible thing in analyzing its sepsis algorithm for its impact on sex and race. What they found is that while their algorithm had a confirmation rate generally of 36% for all patients, for Black patients the rate was only 33% and for Asian patients the rate was 42%. In other words, the algorithm performed less well on average for Black patients, and better than average for Asian patients. The developer also clinically validated its algorithm. If you want to read about the performance of the algorithm in a large clinical trial, the developer published its research.
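To put those three percentages side by side, here is a short sketch of the arithmetic. It uses only the rates quoted above, and the variable names are mine. How the developer defined "confirmation rate" is not restated here, so treat this as a comparison of the reported figures and not as a reconstruction of the developer's analysis.

```python
# Confirmation rates as quoted in the text above.
reported_rates = {
    "all patients": 0.36,
    "Black patients": 0.33,
    "Asian patients": 0.42,
}

baseline = reported_rates["all patients"]

# Difference from the all-patient rate, in percentage points.
gaps = {group: round((rate - baseline) * 100, 1)
        for group, rate in reported_rates.items()}

# Each group's rate relative to the all-patient rate.
ratios = {group: round(rate / baseline, 2)
          for group, rate in reported_rates.items()}

for group in reported_rates:
    print(f"{group}: {gaps[group]:+.1f} points, {ratios[group]:.2f}x the overall rate")
```

A gap of a few percentage points may or may not be meaningful, depending on how many patients are in each group, which is exactly why the testing described in this section matters.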
But beyond that, for the rather large number of sepsis algorithms out there in use, I haven’t seen any other systematic evaluations of these algorithms for racial bias or for that matter other demographic features. I have seen, however, some broad research doing a “Comparison between machine learning methods for mortality prediction for sepsis patients with different social determinants.” It supports the idea that testing and evaluation of these machine learning algorithms for demographic factors requires more attention.
The Legality Question
I am not a philosopher or expert in ethics. People don’t come to me with interesting philosophical questions. They come to ask me whether their algorithm follows the law.
As a result, you might ask me, are the algorithms described above lawful? First, I would ask you back, can you clarify the question? Are you asking me:
- Do the algorithms comply with FDA’s law on clinical decision support software, which would often require preapproval? The answer, in part, depends on whether the particular algorithm developed by a particular hospital is fully embedded in the practice of medicine and thus in effect subject to regulation by the state boards of medicine, or in reality is being commercialized outside the professional practice of medicine. The answer also depends on how transparent and explainable the algorithm is to its users, as well as on the specific features used to train the algorithm and on the nature of the output. According to FDA guidance, the agency regulates clinical decision support software whose output (a) suggests “that a specific patient ‘may exhibit signs’ of a disease or condition” or (b) “identifies a risk probability or risk score for a specific disease or condition.” Sounds kinda like certain sepsis algorithms. If FDA has jurisdiction over an algorithm, the question then becomes whether an algorithm with potential bias is safe and effective enough for all anticipated patients to earn FDA approval. FDA sets the bar high, but it’s also possible to manage which patients are “anticipated” through the labeling of the algorithm.
- Do the algorithms give rise to liability under state tort law, including malpractice and product liability law? Let’s say someone dies of sepsis at a hospital that used its own software for detection, which given the nature of the disease is not unusual. The question would be whether the developer has done an adequate job of evaluating and testing the software, and if necessary forewarning the users of the potential inaccuracies for specific populations.
- Do the algorithms give rise to potential liability under the HHS proposed rule under the Affordable Care Act that algorithms not discriminate? The answer depends on what the provider institution has done to evaluate the software, but at the same time that rule has not yet been finalized.
- Do the algorithms comply with the proposed ONC rules for transparency designed to allow users to understand whether they might have discriminatory effect? Obviously that requires a lot more information on what was disclosed, and that rule is still in proposed form.
- Do the algorithms comply with the Federal Trade Commission guidelines on ensuring that algorithms don’t discriminate? That’s a complicated and lengthy topic, but I suspect the defense would be an argument that the hospital’s use of its own algorithm is outside of the FTC’s jurisdiction for consumer protection of unfair business practices. However, the FTC seems to want to take the lead on software that falls outside of FDA’s jurisdiction, so the answer in practice will be influenced by the answer to number one above. If FDA regulates the products, then FTC would be unlikely to bother with such a stretch. But given the very direct patient impact, FTC might pursue it if other regulators don’t. FTC certainly has asserted its right to regulate software used by healthcare professionals when the issue is patient privacy, even if the software is simply used in practice management. Further, FTC can assert that the software is certainly used on patients to impact their care and uses patient data, even if the results are only communicated to the physician on behalf of the patient. Sometimes FTC takes an expansive view of its jurisdiction, even concluding that small businesses constitute “consumers” if they feel otherwise there’s a gap in the regulatory scheme. Consider the primary example FTC uses in suggesting that companies be vigilant against unintended bias in algorithms. That example from 2021 follows: “COVID-19 prediction models can help health systems combat the virus through efficient allocation of ICU beds, ventilators, and other resources. But as a recent study in the Journal of the American Medical Informatics Association suggests, if those models use data that reflect existing racial bias in healthcare delivery, AI that was meant to benefit all patients may worsen healthcare disparities for people of color.” That example is eerily similar to the sepsis example. The FTC seems to suggest it has jurisdiction over such algorithms. 
Right below the example FTC argues: “Section 5 of the FTC Act… prohibits unfair or deceptive practices. That would include the sale or use of – for example – racially biased algorithms.” Notice the word “use.”
- Do the algorithms give rise to liability under general civil rights laws? I have no idea; you’ll have to ask one of my partners.
Apart from such legal issues, obviously the developer should evaluate the impact on its reputation should its algorithm be determined to discriminate against vulnerable groups of people.
The purpose of this post at least nominally is to explain how race can manifest itself in what might seem to be a purely clinical algorithm. More generally, hopefully this post gave some insight into bias audit methodology, starting with research into what has gone before in this area and ultimately finishing with testing to verify theory. These algorithms have incredible potential to improve healthcare. But it will help everyone if the implementation is done responsibly such that users can truly trust these algorithms.
 One of my partners in the FDA law practice went to MIT for engineering and I’m always amazed at how smart he is.
Health Equity and Health Disparities Environmental Scan March 2022, HHS Office of Disease Prevention and Health Promotion
https://www.forbes.com/sites/carmenniethammer/2020/03/02/ai-bias-could-put-womens-lives-at-riska-challenge-for-regulators and https://www.marshmclennan.com/insights/publications/2020/apr/how-will-ai-affect-gender-gaps-in-health-care-.html