AI, COVID, and Wearables - Real-World Value or Unrealistic Expectations?
COVID-19 has made certain limitations in our capacity to detect and monitor disease more apparent. One such limitation is the lack of validated biometric markers, easily accessible to patients in their day-to-day lives, that can reliably detect the presence of disease before symptoms manifest. Despite all of the data patients can now collect on themselves and track 24/7, we know very little about how to leverage that data to generate actionable insights that improve health outcomes. With SARS-CoV-2, this becomes even more important given the need to identify asymptomatic cases.
Not surprisingly, many researchers and companies are turning to wearables, such as smartwatches, to try to solve this problem. The basic premise is that by monitoring someone’s physiological parameters continuously, wearable devices may be able to pick up on small, potentially imperceptible changes from a person’s baseline that could signal infection.
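To make that premise concrete, here is a minimal sketch in Python of how a baseline-deviation check might work. Everything here is an illustrative assumption: the 14-night baseline window, the z-score cutoff, and the function itself are hypothetical and do not reflect any vendor’s actual method.

```python
# Illustrative sketch: flag a nightly metric (e.g., respiratory rate) that
# deviates from an individual's own recent baseline. Window length and
# threshold are hypothetical choices for demonstration only.
import numpy as np

def flag_deviation(nightly_values, baseline_nights=14, z_threshold=2.5):
    """Return (is_flagged, z_score) for the most recent night versus the
    trailing baseline. nightly_values is ordered oldest to newest."""
    baseline = np.asarray(nightly_values[-(baseline_nights + 1):-1], dtype=float)
    latest = float(nightly_values[-1])
    mu, sigma = baseline.mean(), baseline.std(ddof=1)
    if sigma == 0:  # guard against a perfectly flat baseline
        return False, 0.0
    z = (latest - mu) / sigma
    return abs(z) > z_threshold, z

# Example: two weeks near 14 breaths/min, then a jump to 17 on the last night.
rr = [14.1, 13.9, 14.0, 14.2, 13.8, 14.1, 14.0,
      14.3, 13.9, 14.1, 14.0, 14.2, 13.9, 14.1, 17.0]
print(flag_deviation(rr))  # -> (True, large positive z)
```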
The key to this research is to understand the strategy of using algorithms to create new biomarkers, and then aggregating those biomarkers into a single indicator or classifier (i.e., classifying whether someone has a disease or not). As we explore in several publications, a key question is whether we can leverage a specific vital sign (e.g., respiratory rate) or simultaneously look at a combination of multiple vital signs to inform COVID-19 status in an individual. Let’s see how these strategies apply to COVID-19 and wearables.
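The aggregation idea can also be sketched in a few lines. The example below uses synthetic data and scikit-learn to combine several vital-sign changes into one probability; the features, effect sizes, and choice of logistic regression are assumptions made for illustration, not details from any study discussed here.

```python
# Minimal sketch of "aggregate several biomarkers into one classifier,"
# on entirely synthetic data. Not any company's actual model.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 1000
labels = rng.integers(0, 2, n)  # 1 = "infected" in this toy example

# Each feature is a per-person change from baseline; simulated infected cases
# get a shift in respiratory rate and temperature, and a drop in HRV.
delta_rr   = rng.normal(0, 1, n) + 1.5 * labels    # breaths/min above baseline
delta_temp = rng.normal(0, 0.3, n) + 0.4 * labels  # degrees C above baseline
delta_hrv  = rng.normal(0, 10, n) - 8.0 * labels   # ms change in HRV
X = np.column_stack([delta_rr, delta_temp, delta_hrv])

# The fitted model maps the combined vital-sign changes to a single
# probability: one aggregate indicator built from several biomarkers.
model = LogisticRegression().fit(X, labels)
print(model.predict_proba([[2.0, 0.5, -10.0]])[0, 1])
```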
WHOOP is a company that makes a wrist-worn heart rate monitor that can measure heart rate (HR), heart rate variability (HRV), sleep, strain, recovery, and respiratory rate. While many researchers have been studying the clinical applications of common biometrics available in most wearables, such as resting heart rate (RHR), activity, and HRV, WHOOP noted early on that respiratory rate increased among users who reported COVID-19, and in some users, this increase occurred even before they experienced symptoms. This was particularly interesting to WHOOP because, in its user data, respiratory rate did not appear to change significantly from day to day, unlike HRV, RHR, and other biometrics; a change in respiratory rate may therefore be more informative, or meaningful, because fewer factors appear to affect it. To investigate this hypothesis, the WHOOP team evaluated the existing literature but did not find any longitudinal studies exploring variability in night-time respiratory rate among healthy adults. So WHOOP analyzed the respiratory rate data of over 25,000 healthy WHOOP members across 30 consecutive nights and found that, indeed, intraindividual variability over a one-month time frame is fairly minimal, with a lower coefficient of variation than either RHR or HRV. With this knowledge, WHOOP began developing an algorithm to predict COVID-19 infection based on data collected from its device.
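The coefficient of variation here is simply the standard deviation divided by the mean, computed within one person across the month. The sketch below uses invented nightly values to show the comparison; the specific numbers and spreads are hypothetical and only illustrate why a metric with a tighter CV may carry more signal when it departs from baseline.

```python
# Coefficient of variation (CV = standard deviation / mean) of three nightly
# metrics within one person over ~30 nights. All values are made up.
import numpy as np

def coefficient_of_variation(values):
    values = np.asarray(values, dtype=float)
    return values.std(ddof=1) / values.mean()

nights = 30
rng = np.random.default_rng(1)
resp_rate  = rng.normal(14, 0.4, nights)  # breaths/min: tight night to night
resting_hr = rng.normal(58, 4, nights)    # bpm: noisier
hrv        = rng.normal(65, 12, nights)   # ms: noisiest of the three

for name, series in [("respiratory rate", resp_rate),
                     ("resting HR", resting_hr),
                     ("HRV", hrv)]:
    print(f"{name}: CV = {coefficient_of_variation(series):.3f}")
# A lower CV for respiratory rate would support the idea that a departure
# from its baseline is more meaningful than one in a noisier metric.
```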
Let’s dig a little deeper into the study WHOOP published on the development and early results of its algorithm, which explores a change in respiratory rate as an indicator of SARS-CoV-2 infection. To develop its algorithm, WHOOP analyzed the data of 271 WHOOP members (70% male, 30% female; mean age 37 years) who reported symptoms consistent with COVID-19. To train an algorithm from scratch, they needed datasets for both training and validation. From the 271 individuals, they extracted 2,672 samples (i.e., days) and divided that data into a training dataset of individuals who tested positive for COVID-19, a validation dataset of individuals who tested positive for COVID-19, and another validation dataset of individuals who tested negative for COVID-19. For clarity’s sake, the validation dataset is not used to train the algorithm; rather, it is set aside during the training process and subsequently used to evaluate (or validate) the algorithm. In other words, it is used to measure the algorithm’s accuracy. If you recall from the AI Roadmap, the intended output of this algorithm is to classify someone as having COVID-19 or not, based on changes in respiratory rate (measured during sleep). To pull in another key concept, the training conducted in this case would be considered supervised learning, since the algorithm was trained on data with known outcomes. What this study ultimately revealed was that the algorithm correctly identified 20% of COVID-19 positive individuals prior to the onset of symptoms and 80% by the third day of symptoms. This may suggest that there is a window of opportunity for identifying some individuals who would need to self-isolate and/or obtain a COVID-19 test, which is especially pertinent in situations where screening tests are not widely available.
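The train/validation discipline described above is easy to show schematically. In the sketch below, the data, features, and model are synthetic stand-ins (only the 2,672 sample count mirrors the study), and a generic random hold-out split is used for simplicity; the study’s actual partitioning, with separate positive and negative validation sets, differed.

```python
# Schematic of supervised learning with a held-out validation set: labeled
# samples are split so validation data never touches training. Synthetic data.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(42)
n_samples = 2672  # mirrors the sample (day) count reported in the study
X = rng.normal(0, 1, (n_samples, 3))  # stand-in respiratory-rate features
y = (X[:, 0] + rng.normal(0, 1, n_samples) > 0.5).astype(int)  # known outcomes

# Hold out a validation set the algorithm never sees during training.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

model = LogisticRegression().fit(X_train, y_train)
print("validation accuracy:", accuracy_score(y_val, model.predict(X_val)))
```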
To tie it all together from the AI Roadmap, it is important to look at an algorithm’s raw output and understand what types of rules are applied to it. In this case, the raw output of the WHOOP algorithm is not the final classification of “infected” or “not infected”; rather, it is a probability expressing how confident the algorithm is that a particular input should be classified as “infected.” This probability ranges from 0 to 1, and the authors of the study decide what threshold should map to which label in the final output. For example, the authors could set a threshold of 0.7, meaning that if the algorithm predicts a 70% probability that a particular individual is infected, then it classifies that individual as “infected”; if, for a different individual, it predicts a probability of 60%, then it classifies that individual as “not infected.” Because the authors judged false negatives (labeling an individual as negative for infection when they are actually positive) to be more harmful than false positives (labeling an individual as positive for infection when they are actually negative), they set the threshold of their classifier relatively low: any output with a predicted probability of 0.3 or greater for COVID-19 infection was labeled COVID-19 positive. Based on the distribution of the model’s output, this can increase the chance of correctly labeling a true positive as “infected,” but it does so at the expense of increasing the risk of false positives. As clinicians, it is important to assess whether any classifier threshold set by developers makes sense from a clinical perspective.
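Here is how a threshold turns raw probabilities into labels, and why lowering it trades false positives for fewer false negatives. The 0.3 threshold matches the study’s choice as described above; the example probabilities themselves are invented.

```python
# Applying a decision threshold to a model's raw probability output.
import numpy as np

def classify(probabilities, threshold):
    """Label 'infected' when the predicted probability meets the threshold."""
    return ["infected" if p >= threshold else "not infected"
            for p in probabilities]

raw_outputs = np.array([0.25, 0.35, 0.60, 0.70, 0.10])  # invented outputs

print(classify(raw_outputs, threshold=0.7))  # stricter: fewer positives
print(classify(raw_outputs, threshold=0.3))  # the study's choice: more
                                             # positives, so fewer missed
                                             # infections but more false alarms
```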
The research has been promising, and these efforts appear to empower individuals by giving them access to, and greater insight into, their own health data; but it is also important to recognize the limitations of this approach from a clinical utility standpoint. While WHOOP is using respiratory rate as the main signal to detect COVID-19, the purpose of its research is to develop a proprietary algorithm. This means that if WHOOP is able to establish and quantify a meaningful relationship between changes in respiratory rate and the probability of SARS-CoV-2 infection, that knowledge will exist only in its devices. In the United States alone, an estimated 16% of people, or 52.8 million, own a smartwatch, but the choice of smartwatch varies among users; therefore, the knowledge coming from WHOOP devices would have limited applicability to the general public and to those wearing other devices. Every maker of a device that measures respiratory rate would have to develop and validate its own algorithm to provide the same service and value to its users. Alternatively, if a payor or provider wants to leverage the information provided by WHOOP’s algorithm to improve care delivery or health outcomes, they would have to provide each patient with a WHOOP device. While a seemingly simple solution, this is actually quite challenging to implement: it requires justifying the cost of the devices, securing patients’ agreement to wear them, educating patients on how to use the device and its app, and encouraging consistent use.
While this article focused primarily on WHOOP’s respiratory rate-based algorithm, another consideration is whether respiratory rate alone is sufficient to accurately predict COVID-19 infection. Researchers have commented on the improved predictive value of integrating multiple physiological measures; in other words, combining respiratory rate with other metrics, such as core body temperature and arterial oxygen saturation (SpO2), may be a better predictor than any single value on its own. A number of studies have been initiated to evaluate whether changes in a combination of physiological measures captured by wearables correlate with COVID-19 infection and diagnosis: the DETECT study by the Scripps Research Institute looks at HR, activity, and sleep data, and another study by the Fitbit research team looks at predicting infection on any given day based on respiratory rate, HR, and HRV. As you can see, there is large interstudy variability in methodology, which, while valuable as we begin to understand whether wearables can play a clinical role, may lead to more questions than answers. As we move forward in the algorithmic age, it will be more important than ever to be able to assess the clinical validity and utility of emerging algorithms and to evaluate their role and limitations in patient care.
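As a closing illustration of the multi-signal point above, here is a toy comparison of a single-signal model versus a combined model, run entirely on synthetic data. The effect sizes and logistic model are invented for demonstration and do not reflect the methods or results of the DETECT or Fitbit studies.

```python
# Toy comparison: one vital sign versus several, on synthetic data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)
n = 2000
y = rng.integers(0, 2, n)                # 1 = "infected" in this toy setup
resp = rng.normal(0, 1, n) + 0.8 * y     # respiratory-rate change
temp = rng.normal(0, 1, n) + 0.8 * y     # core temperature change
spo2 = rng.normal(0, 1, n) - 0.8 * y     # SpO2 change (drops with infection)

for name, X in [("respiratory rate only", resp.reshape(-1, 1)),
                ("combined signals", np.column_stack([resp, temp, spo2]))]:
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    model = LogisticRegression().fit(X_tr, y_tr)
    auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    print(f"{name}: AUC = {auc:.2f}")
# With these invented effects, the combined model scores a higher AUC,
# mirroring the intuition that integrating measures can improve prediction.
```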