In this tutorial, you’ll gain access to the R code, dataset, and motivation to replicate data visualizations in my latest paper and apply the concepts to your next one.
I was sitting on an airplane last year, flipping through The Economist, when I froze at the sight of a tiny chart littered with a mess of arrows. The arrows were of different lengths and pointing in many directions, mostly toward the upper right-hand side of the chart.
Further inspection revealed the rounded ends of each arrow tail were scattered on the axes of two features measuring success (of what, I don’t remember). The figure legend told me the tail ends were the cross-sectional value for each group/arrow/country in 2010. The head of each arrow, the pointy end, was the cross-sectional value of each group in 2014. Like below:
This clever design allowed me to interpret the length of each arrow as the total amount of progress achieved over four years. How far the slope of each arrow departed from 45 degrees showed whether progress came from equal gains in both features or mainly from one more than the other. Vectors are magical because they represent both a quantity and a direction at the same time.
A light bulb went on. I frantically circled the magazine chart, scrawled the note “make this kind of figure to show change in % HIV on treatment and % virally suppressed changing over time!” and drew big stars around it like a middle-school girl discovering an acne solution in Seventeen magazine.
The review and synthesis
After writing about the challenges in understanding how the Patient Protection and Affordable Care Act (ACA) and associated Medicaid expansion made things better or worse for people living with HIV, I was invited to lead a review article about the evidence available so far. I am grateful Current HIV/AIDS Reports allowed me this opportunity to learn and synthesize information into this review.
Supporting the story with evidence synthesis
I analyzed the data and made the charts in R. You can replicate the findings and figures with the code and dataset posted on my GitHub.
As the data visualizations in The Economist delightfully reminded me, vector plots can collapse three dimensions (two features plus time) into a single figure to help understand the direction of change over time. When multiple vectors are plotted together, the length of each vector shows how much that group changed relative to the other groups. In this figure, the groups (separate arrows) are states, and the color tells us whether or not the state adopted Medicaid expansion with ACA implementation.
Try generating a simple version of this figure on your own by downloading the CDC surveillance dataset here and running the code below in RStudio.
<pre class="wp-block-syntaxhighlighter-code"># Basic vector diagram
# Assumes d4 is the data frame from the GitHub repo, with one row per state and
# 2010/2014 columns for engagement in care (pEn_*) and viral suppression (pVS_*)
library(ggplot2)

ggplot(data = d4, aes(x = pEn_2010, y = pVS_2010, group = state)) +
  # Arrow tails: each state's 2010 values
  geom_point(size = 3) +
  # Arrows running from the 2010 values to the 2014 values
  geom_segment(aes(xend = pEn_2014, yend = pVS_2014), size = 1.2,
               arrow = arrow(length = unit(0.30, "cm"), type = "closed")) +
  # Hand-drawn legend arrow; annotate() draws it once rather than once per row
  annotate("segment", x = 65, y = 10, xend = 75, yend = 10, size = 1.2,
           arrow = arrow(length = unit(0.30, "cm"), type = "closed")) +
  annotate("point", x = 65, y = 10, size = 3) +
  annotate("text", x = 65, y = 10, label = "2010", hjust = -0.2, vjust = -0.5, angle = 45, size = 5) +
  annotate("text", x = 75, y = 10, label = "2014", hjust = -0.2, vjust = -0.5, angle = 45, size = 5) +
  xlab("Engaged in Care (%)") +
  ylab("Viral Suppression (%)") +
  ggtitle("Progress toward 90:90:90 goal") +
  # Dotted target lines: 100 * 0.9^3 = 72.9 (y axis) and 100 * 0.9^2 = 81 (x axis)
  geom_hline(yintercept = 100 * 0.9 * 0.9 * 0.9, color = "orange", linetype = "dotted", size = 1.5) +
  geom_vline(xintercept = 100 * 0.9 * 0.9, color = "orange", linetype = "dotted", size = 1.5) +
  theme_classic() +
  # xlim(10, 85) +
  ylim(10, 85) +
  theme(text = element_text(size = 20))
</pre>
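To put numbers on what the arrows show, here is a minimal base R sketch that computes each arrow's length and angle. The column names follow the plotting code above, but the three states and all of their values are made up for illustration; they are not from the CDC dataset.

```r
# Hypothetical toy data: three made-up states, values invented for illustration
toy <- data.frame(
  state    = c("A", "B", "C"),
  pEn_2010 = c(50, 60, 55),  # % engaged in care, 2010
  pVS_2010 = c(30, 40, 35),  # % virally suppressed, 2010
  pEn_2014 = c(53, 64, 55),
  pVS_2014 = c(34, 43, 45)
)

# Change in each feature between 2010 and 2014 (percentage points)
dx <- toy$pEn_2014 - toy$pEn_2010
dy <- toy$pVS_2014 - toy$pVS_2010

# Arrow length: total movement across both features
toy$length <- sqrt(dx^2 + dy^2)          # 5, 5, 10

# Arrow angle in degrees: 45 means equal gains in both features,
# above 45 means viral suppression gained more than engagement in care
toy$angle <- atan2(dy, dx) * 180 / pi    # about 53, 37, 90

toy[, c("state", "length", "angle")]
```

States A and B moved the same distance, but A gained more in viral suppression and B more in engagement. The angle is only comparable to 45 degrees because both axes are in percentage points; on the rendered plot, the arrows will only look that way if the axes share a scale (for example with `coord_fixed()`).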
Bubble plot
In this one, I was a little ambitious and tried to make an all-in-one figure with four dimensions (prevalence, ART use, viral suppression, and adoption of Medicaid expansion) for each state. Learning from the wisdom of data visualization experts at The Economist, a better redesign of this figure would make the bubbles semi-transparent and emphasize only a few states with darker color and labels.
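<pre class="wp-block-syntaxhighlighter-code"># Bubble plot
# Assumes d is the data frame from the GitHub repo
library(ggplot2)

ggplot(data = d[d$Year == 2015, ], aes(x = pEn_2014, y = pVS_2014, group = state)) +
  # Colour = Medicaid expansion status, bubble size = HIV prevalence
  geom_point(aes(colour = exp, size = case), position = "jitter", alpha = 0.8) +
  xlab("Using ART (%)") +
  ylab("Viral Suppression (%)") +
  geom_rug() +
  # Dotted target lines: 100 * 0.9^2 = 81 (y axis) and 90 (x axis)
  geom_hline(yintercept = 100 * 0.9 * 0.9, color = "orange", linetype = "dotted", size = 1) +
  geom_vline(xintercept = 90, color = "orange", linetype = "dotted", size = 1) +
  # State labels are not jittered, so they can sit slightly off their bubbles
  geom_text(aes(label = Abbreviation), hjust = 0.3, vjust = 0.5, size = 1.5) +
  theme_classic() +
  labs(size = "HIV Prevalence") +
  labs(colour = "Medicaid Expansion") +
  xlim(30, 92) +
  ylim(30, 83) +
  # No effect as written: the points map colour, not fill.
  # To relabel the legend, use scale_colour_discrete() after checking the level order of exp.
  scale_fill_discrete(name = "Medicaid Expansion", labels = c("Yes", "No")) +
  theme(text = element_text(size = 15))
</pre>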
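The redesign suggested above (semi-transparent bubbles, with only a few states emphasized) might look something like the sketch below. It is a rough illustration, not the figure from the paper: the data frame, the state abbreviations, the numbers, and the choice of which states to highlight are all hypothetical.

```r
library(ggplot2)

# Hypothetical toy data: made-up states and values, for illustration only
toy <- data.frame(
  abbr      = c("AA", "BB", "CC", "DD", "EE", "FF"),
  art       = c(45, 52, 60, 66, 71, 58),      # % using ART
  vs        = c(38, 47, 55, 61, 68, 50),      # % virally suppressed
  prev      = c(120, 340, 90, 510, 260, 75),  # HIV prevalence
  expansion = c("Yes", "No", "Yes", "No", "Yes", "No")
)

# Flag the handful of states the reader should notice (an arbitrary choice here)
highlight <- c("BB", "EE")
toy$emphasis <- toy$abbr %in% highlight

p <- ggplot(toy, aes(x = art, y = vs)) +
  # Faded bubbles for most states, near-opaque bubbles for the highlighted ones
  geom_point(aes(colour = expansion, size = prev, alpha = emphasis)) +
  scale_alpha_manual(values = c(`FALSE` = 0.25, `TRUE` = 0.9), guide = "none") +
  # Label only the highlighted states
  geom_text(data = toy[toy$emphasis, ], aes(label = abbr), vjust = -1.2, size = 4) +
  labs(x = "Using ART (%)", y = "Viral Suppression (%)",
       size = "HIV Prevalence", colour = "Medicaid Expansion") +
  theme_classic()

p
```

Mapping `alpha` to a logical flag keeps the emphasis decision in the data rather than in the plotting code, so changing which states stand out only means editing `highlight`.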
Making maps
AIDSVu maps are quick and awesome. But when you need more flexibility and want to design your own custom map, here is template R code to do it.
The choropleth map of states below shows HIV prevalence. I also find this type of figure useful to show differences in data revealing things such as racial disparities or changes in a metric over time. However, I find results from diff-in-diff or triple-diffs do not communicate clearly in this format.
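<pre class="wp-block-syntaxhighlighter-code"># Choropleth map of prevalence in states
library(choroplethr)  # provides the StateChoropleth class

# choroplethr expects a data frame with a `region` column (lowercase state name)
# and a `value` column (the number to map)
df <- d[d$Year == 2014, ]
df$region <- tolower(as.character(df$state))
df$value <- as.numeric(df$Cases)

# Option 1: a small wrapper function (note that this masks choroplethr's own state_choropleth())
state_choropleth <- function(df, title = "", legend = "", num_colors = 7, zoom = NULL) {
  choro <- StateChoropleth$new(df)
  choro$title <- title
  choro$legend <- legend
  choro$set_num_colors(num_colors)
  choro$set_zoom(zoom)
  choro$render()
}
state_choropleth(df = df, legend = "HIV Prevalence")

# Option 2: build the map step by step to change more settings, such as hiding state labels
choro <- StateChoropleth$new(df)
choro$legend <- "HIV Prevalence"
choro$set_num_colors(7)
choro$set_zoom(NULL)
choro$show_labels <- FALSE
fig1 <- choro$render()
fig1
</pre>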
If you want to dive deeper into building gorgeous maps, check out the tutorials from Flowing Data. Some are free, but many good ones are locked behind a paid subscription. In the past, I’ve saved up my advanced data visualization problems until a peak of desperation, paid for only one month, sucked up all the learning I needed, and then cancelled the subscription. The book Visualize This is beautiful and inspiring too.
Challenges
Data recency
When leveraging government surveillance data, a big challenge is data recency. When I did this analysis in 2018, data was only available through 2015. By the time of publication in 2019, the results were already stale by four years. Often, more years of data in the post-policy implementation period are needed for robust analysis. Real-time policy evaluation may be easier in the future as cloud-based storage solutions and more efficient data processing gain traction.
Causal inference
We chatted about how to evaluate ACA impact on HIV back in 2017, after this comprehensive review article framed a research agenda calling for action.
Methods including difference-in-differences, instrumental variables, and propensity scores are recommended to minimize bias from unmeasured confounders and make causal inference about non-random Medicaid expansion among states.
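As a rough illustration of the difference-in-differences idea, here is a minimal base R sketch on made-up numbers. It is not the paper's data or analysis: the four states, the `expanded` and `post` indicators, and the `pVS` values are all hypothetical, and a real evaluation would also need to check the parallel-trends assumption and handle standard errors carefully.

```r
# Hypothetical toy panel: 4 made-up states observed before and after a policy
toy <- data.frame(
  state    = rep(c("A", "B", "C", "D"), each = 2),
  expanded = rep(c(1, 1, 0, 0), each = 2),     # 1 = state adopted the policy
  post     = rep(c(0, 1), times = 4),          # 1 = post-policy period
  pVS      = c(40, 52, 44, 54, 41, 47, 45, 49) # invented outcome values (%)
)

# By hand: (change in expansion states) minus (change in non-expansion states)
mean_of <- function(e, p) mean(toy$pVS[toy$expanded == e & toy$post == p])
did_by_hand <- (mean_of(1, 1) - mean_of(1, 0)) - (mean_of(0, 1) - mean_of(0, 0))

# Same estimate from a regression: the interaction term is the DiD estimate
fit <- lm(pVS ~ expanded * post, data = toy)
did_from_lm <- unname(coef(fit)["expanded:post"])

did_by_hand  # 6
did_from_lm  # 6, up to floating point error
```

In this toy example, expansion states improved by 11 points and non-expansion states by 5, so the difference-in-differences estimate is 6 points. The regression form is the one that extends naturally to covariates and more time periods.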
An excellent new resource is a Causal Inference Book by Harvard Professor Miguel Hernan. Stay tuned for more detail about the Causal Inference Book Club I am hosting.
Not quite right: early attempts on the way to finding the right data visualization
Great data visualizations are rarely created on the first attempt. I’ve shown you the development of some early attempts in this area before and gave you the R code to play with. Here are some other ways I have tried to show the key HIV metrics improving over time.
Below you will see the progression and maturation of visualizations from the start as a series of histograms, to an overlay of density plots, to a 2-D cross-section, and finally a 2-D longitudinal vector.
Thanks CDC, I think you are awesome
These visualizations are possible because of the wonderful gift of publicly available government data from the Centers for Disease Control HIV Surveillance Reports and NCHHSTP Atlas Plus.
Data Sources and Citations
- Adamson B, Lipira L, Katz A. The Impact of ACA and Medicaid Expansion on Progress Toward UNAIDS 90-90-90 Goals. Current HIV/AIDS Reports. Feb 2019, Volume 16, Issue 1, pp 105-112. DOI: 10.1007/s11904-019-00429-6. https://link.springer.com/article/10.1007/s11904-019-00429-6
- Lipira L, Williams EC, Hutcheson R, Katz AB. Evaluating the Impact of the Affordable Care Act on HIV Care, Outcomes, Prevention, and Disparities: A Critical Research Agenda. J Health Care Poor Underserved. 2017;28:1254–75
- Centers for Disease Control and Prevention. NCHHSTP AtlasPlus; 2017.
- Centers for Disease Control and Prevention. HIV surveillance report, 2016. 28:table 24; 2017.
- Centers for Disease Control and Prevention. Monitoring selected national HIV prevention and care objectives by using HIV surveillance data—United States and 6 dependent areas, 2014. Atlanta. 2015.
- Centers for Disease Control and Prevention. Monitoring selected national HIV prevention and care objectives by using HIV surveillance data-United States and 6 dependent areas-2013. HIV Surveillance Supplemental Report. 2015;20:1–70.
- The Henry J. Kaiser Family Foundation. Medicaid enrollment and spending on HIV/AIDS (FY07-FY11). In: State Health Facts. 2016. http://kff.org/health-reform/state-indicator-enrollment-spending-on-hiv. Accessed 21 Mar 2018.
- AIDSVu. https://map.aidsvu.org/map