Visualizing Two Variables of Interest

The Prompt

For my first #tidytuesday, I chose an old prompt: USDA milk consumption data. The accompanying NPR article considers the oversupply of cheese in the US. Andrew Novakovic, an agricultural economist at Cornell University, explained that milk production has risen while consumption has fallen, so suppliers turn the milk into cheese, which is less perishable. But Americans are turning to less processed/more expensive cheeses, and consuming cheese overall.

Since I was late to this prompt, I was able to check out what others had already done with it. @Alex_Danvers compared milk production to Google search trends for “lactose,” which I thought was interesting. I decided to do the same, but for milk consumption, which I thought was more relevant.

The Data

The USDA dairy consumption data reports yearly domestic consumption of milk, yogurt, butter, etc., in lbs per person, from 1975 to 2018. I’m just going to focus on milk.

## # A tibble: 6 x 8
##    year  milk yogurt butter american_cheese other_cheese cottage_cheese
##   <dbl> <int>  <dbl>  <dbl>           <dbl>        <dbl>          <dbl>
## 1  1975   247    2      4.7             8.1          6.1            4.6
## 2  1976   247    2.1    4.3             8.9          6.6            4.6
## 3  1977   244    2.3    4.3             9.2          6.8            4.6
## 4  1978   241    2.4    4.4             9.5          7.3            4.6
## 5  1979   238    2.4    4.5             9.6          7.6            4.4
## 6  1980   234    2.5    4.5             9.6          7.9            4.4
## # ... with 1 more variable: ice_cream <dbl>

The Google Trends data is monthly, and unfortunately only goes back to 2004. Values represent search interest relative to the highest point on the chart for U.S. searches between 2004 and 2018. A value of 100 is the peak popularity for the term. A value of 50 means that the term is half as popular as when it peaked. In this data, the term peaked in July of 2018.

## # A tibble: 6 x 2
##   Month   `lactose: (United States)`
##   <chr>                        <int>
## 1 2004-01                         42
## 2 2004-02                         41
## 3 2004-03                         51
## 4 2004-04                         43
## 5 2004-05                         42
## 6 2004-06                         41

Since the USDA data is yearly, I find the yearly average of lactose trends and merge the datasets on year.
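A minimal base R sketch of this step might look like the following. It is not the code behind the post: the data frame names, the shortened `lactose` column name, and the milk value for 2004 are assumptions made for illustration. Only the six monthly search values are taken from the output printed above.

```r
# Monthly Google Trends values for the first half of 2004, as printed above.
# The real column is `lactose: (United States)`; it is renamed here (assumption).
trends <- data.frame(
  Month   = c("2004-01", "2004-02", "2004-03", "2004-04", "2004-05", "2004-06"),
  lactose = c(42, 41, 51, 43, 42, 41)
)

# Placeholder yearly milk row: 180 is NOT the USDA figure, just a stand-in.
milk <- data.frame(year = 2004L, milk = 180)

# Pull the year out of the "YYYY-MM" string.
trends$year <- as.integer(substr(trends$Month, 1, 4))

# Average the monthly search interest within each year.
yearly_trends <- aggregate(lactose ~ year, data = trends, FUN = mean)

# Keep only the years present in both datasets.
merged <- merge(milk, yearly_trends, by = "year")
print(merged)
```

Because `merge()` keeps only matching keys by default, any years before 2004 in the USDA data would drop out of the merged result.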

The Plot

Dual Axis Plot

First, I plotted both of the variables I was interested in (milk consumption and Google search trends) over time. Here’s the graph using raw data. We can see that there’s a spike in search popularity from 2010 to 2012 and again from 2014 to 2018, and a steep decrease in milk consumption from 2010 to 2014. These two variables definitely appear to be negatively correlated, and indeed their correlation coefficient is -0.98, but is this the best way to demonstrate their relationship?

Side-by-Side Plot

My partner Andrew convinced me that dual axis plots are generally bad practice. Lisa Charlotte Rost wrote a great post about why that is. Basically, they can be misleading about relationships.

One of Lisa’s suggestions was to use side-by-side plots. This doesn’t really offer any more information, but it does keep the reader from making those subconscious false assumptions due to dual axes.

Labeled Scatterplot

Another of Lisa’s suggestions was to create a labeled scatterplot instead. This is nice because it shows us the relationship between our variables of interest, without excluding the year, which itself contains a lot of implicit info.

Below, I see that there is a negative relationship between lactose search trends and milk consumption. We also see that lactose searches increase over time, and milk consumption decreases over time. However, I liked having year on the x-axis. I think it’s more intuitive, and I like seeing each variable’s slope over time.

Indexed Plot (Standardized)

One last suggestion of Lisa’s was to make an indexed plot. That is, adjust the scales of our two data series and compare them on one common scale.

I index first by standardizing. This is what @Alex_Danvers did in the first place. Here I’m essentially just rescaling the vertical axes into relative terms. Notice all that space between these two lines in the original graph? Well, now I can zoom in on the action. Yes, I lose the information that the absolute values tell us, but I’m more interested in the relative changes anyway.

The y-axis here will be z-score. In case you forgot Stat 101, the z-score indicates how many standard deviations a given value lies above or below the mean. For example, I can see that in 2006, milk consumption was 1 standard deviation above the mean consumption for 2004-2018.
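For anyone who wants to see the arithmetic, here is a small sketch of standardizing a series in base R. The `z_score` helper is a name made up for this example, and the input is the first six milk values from the table printed earlier (1975-1980), not the 2004-2018 window used in the plot.

```r
# z-score: subtract the mean, then divide by the standard deviation.
z_score <- function(x) (x - mean(x)) / sd(x)

# First six milk values from the table above (1975-1980), used only as an example.
milk <- c(247, 247, 244, 241, 238, 234)

milk_z <- z_score(milk)
print(round(milk_z, 2))

# By construction, a standardized series has mean 0 and standard deviation 1.
print(c(mean = mean(milk_z), sd = sd(milk_z)))
```

Base R's `scale()` does the same centering and scaling and could be used instead of a hand-written helper.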

Below, I can see the same trends that I saw in my first graph, but more clearly. There’s that spike in search popularity from 2010 to 2012 and again from 2014 to 2018, and that big fall in milk consumption from 2010 to 2014. As it should be, the correlation coefficient of milk consumption and lactose search popularity is the same as before, -0.98.
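The reason the coefficient doesn't change is that Pearson correlation is unaffected by shifting or rescaling either series, and standardizing is exactly that. A quick way to check this, sketched in base R: since the yearly lactose averages aren't printed in this post, the example uses the milk and yogurt columns from the table above as stand-ins.

```r
# Stand-in series: milk and yogurt for 1975-1980, copied from the table above.
# (The post's actual comparison is milk vs. lactose searches for 2004-2018.)
milk   <- c(247, 247, 244, 241, 238, 234)
yogurt <- c(2, 2.1, 2.3, 2.4, 2.4, 2.5)

# Correlation on the raw values.
r_raw <- cor(milk, yogurt)

# Correlation after standardizing both series with scale().
r_z <- cor(as.numeric(scale(milk)), as.numeric(scale(yogurt)))

# The two coefficients agree up to floating point error.
print(c(raw = r_raw, standardized = r_z))
print(isTRUE(all.equal(r_raw, r_z)))
```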

Note that when @Alex_Danvers made this graph using milk production instead of consumption, he found a weaker relationship with lactose search popularity. This makes sense to me.

To like this graph, you have to understand z-scores, which are a bit esoteric. But I do like the way it allows me to effectively compare the two series on a common scale.

Indexed Plot (% Change)

Next, I index using percent changes instead. That works well here because both of my variables of interest have similar rates of change, so I can easily see what’s going on with each.
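As a rough sketch of what indexing by percent change can look like, here is a year-over-year version in base R. The `pct_change` helper is a made-up name, and whether the plot used exactly this definition is an assumption. The input is again the first six milk values from the table above (1975-1980).

```r
# Year-over-year percent change: (this year - last year) / last year * 100.
# The first year has no previous value, so it gets NA.
pct_change <- function(x) c(NA, diff(x) / head(x, -1) * 100)

# First six milk values from the table above (1975-1980), used only as an example.
milk <- c(247, 247, 244, 241, 238, 234)

milk_pct <- pct_change(milk)
print(round(milk_pct, 2))
```

Each value is measured against the previous year rather than against a fixed base year, so the series shows how fast each variable is moving instead of where it sits.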

Below, I can see that milk consumption declined in each year in our time period, and lactose searches increased for all but one year. When Google searches for lactose first experienced a big jump, milk consumption fell hard. However, when “lactose” searches experienced another jump in 2017, milk consumption didn’t fall nearly as hard. Thus, it doesn’t seem like the changes in lactose searches and the changes in milk consumption are very related. As it turns out, the correlation coefficient this time is -0.13.
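Two series that trend in opposite directions can be strongly correlated in levels even when their year-to-year changes have little to do with each other. Here is a small base R sketch of that comparison. The yearly lactose averages aren't printed in this post, so it uses the milk and yogurt columns from the table above as stand-ins, and `pct_change` is a hypothetical helper.

```r
# Year-over-year percent change, dropping the first year (no previous value).
pct_change <- function(x) diff(x) / head(x, -1) * 100

# Stand-in series: milk and yogurt for 1975-1980, copied from the table above.
milk   <- c(247, 247, 244, 241, 238, 234)
yogurt <- c(2, 2.1, 2.3, 2.4, 2.4, 2.5)

# Correlation of the levels: driven mostly by the opposing trends.
r_levels <- cor(milk, yogurt)

# Correlation of the changes: asks whether the two move together year to year.
r_changes <- cor(pct_change(milk), pct_change(yogurt))

print(round(c(levels = r_levels, changes = r_changes), 2))
```

With these stand-in columns the pattern is the same kind as in the post: the correlation in levels is strongly negative, while the correlation in changes is much closer to zero.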

I like this last plot the best, because it holds the most interesting story. While all of the other graphs showed that lactose searches go up during the same time that milk consumption goes down, this is the only graph that suggests the spikes aren’t necessarily related.

Tidy Tuesday: Avoiding Dual Axis Plots

2019-11-05