Solid Machine Learning Model Predicts New HIV Cases Using EHR Data

Incredible new research from Harvard and Kaiser contributed two substantial leaps forward this summer: first in the useful application of machine learning predictions using electronic health records and second in the accurate targeting of people who can benefit the most from HIV pre-exposure prophylaxis.

This paper in Lancet HIV by Marcus and colleagues is so good that I recap highlights and offer my assessment here. For a different perspective, read Gina Kolata’s article in The NY Times. Why should you care about my opinion? I research HIV and cancer, daily transform EHR data into evidence, build math models for prediction, don’t know these authors, and have no conflicts of interest. What the researchers did in this study is much harder in real life than it sounds. I’m a big fan.

There are many reasons why we should care about person-centered treatment effects. In fact, informing the decision for Kaiser to cover a new medicine in a health plan was one of the specific examples used in that post last year. Individualized risk predictions, individualized drug coverage targeting, and individualized treatment effect estimates are possible and possibly efficient. But people will not benefit equally from these advances. Read to the end of this post to learn about how this machine learning model was not able to accurately predict HIV incidence for one important group of people.

I love this model because it’s useful

Investigators used EHR records from 3.7 million patients in the Kaiser Permanente Northern California system and found the documented demographic characteristics, labs, drugs, and diagnoses that were most predictive of a patient becoming HIV infected over the next 3 years. The authors generously report results in a format so that other people can reconstruct the fitted regression model, insert values for their own populations, multiply by their coefficients, and predict risk levels for their own patients in a completely different health system.

Here is a selection of some of the top predictors of upcoming HIV incidence that were identified in patient charts (from Table 2).

Predictor	Odds Ratio
Positive urine test for methadone	13.9
Number of penicillin G benzathine injections with syphilis test within 90 days in previous 2 years	4.3
Number of positive tests for urethral gonorrhoea in previous 2 years	3.5
Black race	2.6
Number of positive tests for rectal gonorrhoea or chlamydia in previous 2 years	2.0
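
To make the "multiply by their coefficients" step concrete, here is a minimal sketch of how a prediction could be rebuilt from published odds ratios, assuming a plain logistic form. The odds ratios are the five from the table above. The intercept, the variable names, and the example patient are hypothetical placeholders I made up for illustration, and the real model has many more terms than these five.

```python
import math

# Odds ratios from the table above (Table 2 of the paper).
ODDS_RATIOS = {
    "methadone_positive": 13.9,
    "penicillin_g_injections": 4.3,
    "urethral_gonorrhoea_positives": 3.5,
    "black_race": 2.6,
    "rectal_gc_ct_positives": 2.0,
}

# HYPOTHETICAL intercept (baseline log-odds); not taken from the paper.
ASSUMED_INTERCEPT = -9.0


def predicted_risk(patient, odds_ratios=ODDS_RATIOS, intercept=ASSUMED_INTERCEPT):
    """Return a probability from a logistic model with coefficient = ln(odds ratio)."""
    log_odds = intercept
    for name, odds_ratio in odds_ratios.items():
        # Predictors missing from the patient record are treated as 0.
        log_odds += math.log(odds_ratio) * patient.get(name, 0)
    return 1.0 / (1.0 + math.exp(-log_odds))


# Made-up patient: one positive urethral gonorrhoea test, no other listed predictors.
example_patient = {"urethral_gonorrhoea_positives": 1}
baseline = predicted_risk({})
risk = predicted_risk(example_patient)
print(f"baseline risk: {baseline:.6f}, example risk: {risk:.6f}")
```

Because the intercept is invented, the absolute risks printed here mean nothing. What the sketch does show is the mechanics: each predictor multiplies the odds by its odds ratio, so the example patient's odds are 3.5 times the baseline odds.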

I love math models that produce actionable results. Here we have an extremely valuable equation, drawing on real-world data, that can make a prediction for an individual person to reduce uncertainty about their level of risk for HIV infection in the future. This can inform a shared physician-patient decision: Should you be using HIV pre-exposure prophylaxis (PrEP) drugs costing $2k/month?

I’ve shown you the increasing trend in PrEP interest before and debated its cost-effectiveness using a static Markov model and then again with a dynamic transmission model. Better targeting of individuals who could benefit from PrEP is necessary to realize its full value. This machine learning model is timely and useful.

Here are 4 healthy questions I ask when evaluating the quality of evidence from an observational study using real world EHR data.

1. Data fit for use

Is the data source appropriate and complete enough to address this research question?

Kaiser, similar to US Veterans Affairs, has EHR that is more comprehensive than other sources I’ve seen. They had 3.7 million patients to include in the modeling study. That is huge. The paper reports that the institutional review board at Kaiser Permanente Northern California (KPNC) approved this study with a waiver of written informed consent.

The investigators had access to the zip codes of where patients resided and where they received care. They used these zip codes to create three variables by linking with other data:

Neighbourhood deprivation index
Received care in one of three cities with high HIV incidence
Resided in one of eight urban ZIP codes with high HIV incidence

Overall, I agree that Kaiser EHR data was fit for use here. It had the depth of information and length of follow-up time necessary to capture the outcome of HIV incidence within three years. Read to the end for discussion points about why the findings based on those zip-code variables surprised me.

2. Causal Inference

Is the scientific approach and methodology sufficient to make conclusions with causal inference?

This study used a least absolute shrinkage and selection operator (LASSO) regression model to identify important features of the structured EHR data and predict new HIV diagnoses within three years. The authors had a well planned exercise to validate model predictions.

"To assess how the full and simpler models might perform prospectively, we validated them with data from an independent set of patients who entered the cohort in 2015–17 (validation dataset), which was not used during any part of the model development process. We again computed the AUC to assess model discrimination, generating 95% CIs by use of bootstrapping with 1000 resamples of the data."

I want to highlight their selection of an ‘independent set of patients who entered the cohort in 2015-2017’ for the prospective validation. This is really important. If the researchers had instead randomly selected a set of patients from all the patients in the 2015-2017 records, with the model being trained on patients from 2007-2014, then the validation would not have been completely honest or accurate. It’s possible that patients in the training data set could have continued to live into the 2015-2017 period and be included in both the training and validation datasets (a different part of their life segmented into each). They avoid this dilemma by ensuring that patients in the prospective validation are new since 2015.
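
The split described above can be sketched in a few lines. This is an illustration under my own assumptions, not the authors' code: the patient IDs and entry dates are hypothetical, and I am assuming the rule is simply "entered the cohort on or after 1 January 2015".

```python
from datetime import date

# HYPOTHETICAL records: (patient_id, date the patient entered the cohort).
cohort_entries = [
    ("p1", date(2008, 3, 1)),
    ("p2", date(2012, 7, 15)),
    ("p3", date(2015, 1, 10)),
    ("p4", date(2016, 11, 30)),
]

VALIDATION_START = date(2015, 1, 1)  # assumed first day of the 2015-17 entry window


def split_by_cohort_entry(entries, validation_start=VALIDATION_START):
    """Assign each patient to exactly one set based on cohort entry date."""
    development, validation = set(), set()
    for patient_id, entry_date in entries:
        if entry_date < validation_start:
            development.add(patient_id)
        else:
            validation.add(patient_id)
    return development, validation


development_ids, validation_ids = split_by_cohort_entry(cohort_entries)
# Each patient has a single entry date, so no one can land on both sides.
assert development_ids.isdisjoint(validation_ids)
print(sorted(development_ids), sorted(validation_ids))
```

Splitting on the patient's entry date, rather than on the date of each record, is what keeps one person's earlier and later years from ending up in both datasets.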

In the prospective validation, the area under the curve (imagine shading below purple line in the figure) is 0·84 (95% CI 0·80–0·89).

**Figure.** Receiver operating characteristic curves and AUCs for full and simpler HIV risk prediction models in the validation dataset. AUC=area under the receiver operating characteristic curve. MSM=men who have sex with men. STI=sexually transmitted infection. *Source: Marcus et al, Lancet HIV 2019.*
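
For readers who want to see what goes into an AUC with a bootstrapped 95% CI, here is a minimal sketch. The labels and scores are toy values I made up, and the percentile method for the interval is my assumption; the paper only says it used bootstrapping with 1000 resamples.

```python
import random


def auc(labels, scores):
    """AUC as the chance that a random case outscores a random non-case (ties count half)."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def bootstrap_auc_ci(labels, scores, resamples=1000, seed=0):
    """Approximate percentile 95% CI from resampling patients with replacement."""
    rng = random.Random(seed)
    n = len(labels)
    estimates = []
    while len(estimates) < resamples:
        idx = [rng.randrange(n) for _ in range(n)]
        sample_labels = [labels[i] for i in idx]
        if len(set(sample_labels)) < 2:
            continue  # AUC is undefined without both outcomes; draw again
        estimates.append(auc(sample_labels, [scores[i] for i in idx]))
    estimates.sort()
    return estimates[int(0.025 * resamples)], estimates[int(0.975 * resamples) - 1]


# Toy data (made up): 1 = new HIV diagnosis, 0 = no diagnosis.
toy_labels = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
toy_scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.5, 0.8, 0.9, 0.95]
point_estimate = auc(toy_labels, toy_scores)
ci_low, ci_high = bootstrap_auc_ci(toy_labels, toy_scores)
print(f"AUC {point_estimate:.2f} (95% CI {ci_low:.2f}-{ci_high:.2f})")
```

The pairwise loop is fine for a toy example but far too slow for millions of patients, where a rank-based calculation would be the practical choice.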

The part I have a difficult time understanding about the prospective validation is how the researchers compared the prediction of HIV incidence to the outcome of true HIV infection within 3 years for each patient when the validation dataset includes only 2 years of follow-up. Please add a comment below if you can help me with the answer to this one.

In this situation, causal inference about the relationships between individual model features and HIV incidence is less important than the accuracy of future predictions in the population where it will be used to inform decision-making.
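
One reason to be cautious about reading individual coefficients causally is built into the LASSO itself: the penalty shrinks every coefficient toward zero and drops weak ones entirely. The sketch below shows that behavior on made-up toy data. It uses squared-error loss and cyclic coordinate descent for brevity, which is my simplification and not the authors' fitting procedure.

```python
def soft_threshold(value, penalty):
    """Shrink value toward zero by penalty; return exactly 0 if |value| <= penalty."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0


def lasso_coordinate_descent(rows, targets, penalty, sweeps=100):
    """Minimise (1/2n)*sum((y - Xb)^2) + penalty*sum(|b|) by cyclic coordinate descent."""
    n, p = len(rows), len(rows[0])
    coefs = [0.0] * p
    for _ in range(sweeps):
        for j in range(p):
            rho = 0.0    # association of feature j with the residual that ignores j
            scale = 0.0  # sum of squares of feature j (assumed non-zero)
            for x, y in zip(rows, targets):
                prediction_without_j = sum(x[k] * coefs[k] for k in range(p) if k != j)
                rho += x[j] * (y - prediction_without_j)
                scale += x[j] ** 2
            coefs[j] = soft_threshold(rho / n, penalty) / (scale / n)
    return coefs


# Toy data (made up): y = 2.0 * feature0 + 0.1 * feature1.
X = [[1, 1], [-1, 1], [1, -1], [-1, -1]]
y = [2.1, -1.9, 1.9, -2.1]
coefs = lasso_coordinate_descent(X, y, penalty=0.5)
print(coefs)
```

In this toy case the weak feature is dropped and the strong feature's coefficient comes out near 1.5 rather than the 2.0 used to generate the data. That trade is acceptable when the goal is prediction, and it is a reminder that the fitted coefficients are not effect estimates.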

3. Reproducibility

Is there sufficient detail provided so that others could reproduce these findings or replicate using patient records from another population?

I was pleased in general about the transparent reporting of methods in this paper. One challenge to replicating these results is the lack of sufficient detail defining the variables. For me to recode and deploy this regression equation in a different EHR, I would need a lot more detail than has been provided in Appendix Table 1.

Here are examples of questions that run through my head when reading through the table of model parameters and finding ‘NA’ listed as the variable definition:

Number of positive tests for gonorrhea or chlamydia in previous 2 years: If the test was repeated more than once in the same day or multiple times in the same week, do you count each one or are they bundled? Is there a max value, for instance >10?
Number of urine toxicology tests: Is it the number requested, number run, or number with results returned to the patient? What codes are used to identify these tests?
Sexually active: Is there a structured data field for this? What time period does it cover? How recently does someone need to have sex to be considered active? Is this self-report? How often is it collected?
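
To show why these definitions matter, here is a minimal sketch of the first question: the same chart produces different counts depending on the bundling rule. The test dates, the `bundle_days` and `cap` parameters, and the function itself are all hypothetical; none of this comes from the paper.

```python
from datetime import date

# HYPOTHETICAL positive test dates for one patient.
positive_test_dates = [
    date(2016, 3, 1),
    date(2016, 3, 1),   # repeat on the same day
    date(2016, 3, 4),   # repeat within the same week
    date(2016, 9, 20),
]


def count_positive_tests(test_dates, bundle_days=None, cap=None):
    """Count positives; bundle_days=None counts every result, otherwise tests
    within bundle_days of the last counted test are merged into it."""
    if bundle_days is None:
        count = len(test_dates)
    else:
        count = 0
        last_counted = None
        for test_date in sorted(test_dates):
            if last_counted is None or (test_date - last_counted).days > bundle_days:
                count += 1
                last_counted = test_date
    return count if cap is None else min(count, cap)


every_result = count_positive_tests(positive_test_dates)
same_day_bundled = count_positive_tests(positive_test_dates, bundle_days=0)
same_week_bundled = count_positive_tests(positive_test_dates, bundle_days=7)
print(every_result, same_day_bundled, same_week_bundled)
```

With these made-up dates the count is 4, 3, or 2 depending on the rule. A coefficient fitted to one definition would be multiplied by a different number under another, which is why the missing definitions are a real barrier to replication.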

The second blocker is transparency of analysis code. It would have been courageous for these authors to make this open source and post the code on GitHub. Are all the features included as linear terms or exponential?

4. Generalizability

Can the lessons from this analysis apply to other populations?

The model works great for men with health insurance who live in Northern California. How well would it work for people in California who are uninsured or under-insured? What about insured patients in California that are not part of a health maintenance organization (HMO) like Kaiser? How accurate is this machine learning model in predicting HIV infection for patients who live elsewhere on the west coast? What about Europe? I would love to see someone take this fitted model and apply it to a different population as an external validation.

A large number of the variables in this machine learning model are based on the utilization of specific healthcare services. It relies on the assumption that all those important services are being captured by the EHR data source. I expect HMO EHR data to have more complete coverage of healthcare visits compared to an average academic medical center because the HMO insurance scheme strongly incentivizes patients to only seek healthcare from within their system. The model will likely not be as accurate for patients with other types of health insurance who may be receiving STI testing at clinics separate from their primary care provider.

Service areas of Kaiser Permanente Northern California, *from Kaiser*

My guess is that if you compare the out-of-sample model performance among insured patients in other parts of the country vs uninsured patients in Northern California, then it will work better for the insured patients. We learned from the RAND Health Insurance Experiment and Oregon Experiment that people use more healthcare services if they have insurance. With so many of the model features relying on counts of specific lab tests run, I would say this study's findings generalize mainly to people with good health insurance.

‘Our models did not identify cases among women, whose HIV risk might be largely dependent on the risk factors of their partners.’

Sad. Looks like the advances in this study did not help women at all.

Discuss

Something surprising

I’m a big fan of trying to account for socioeconomic status as being related to a lot of health outcomes. It’s an interesting finding here that quintiles of the neighborhood deprivation index linked to patient zip code were not related to HIV incidence. I wonder if the authors think this is really true, that socioeconomic status is not a strong predictor, or if the aggregation to the zip-code level (high and low income people squashed in one San Francisco neighborhood) is not specific enough to tease it apart.

What is not surprising

The strongest model features confirmed that young Black men who have sex with men are a vulnerable population deserving more attention. Also, a positive urine test for methadone should be a clear signal for intervention.