Data Whisperer: Meaningful yet pretty graphs

This post will cover the different stages from data preprocessing to plotting. It will not go into details of the analysis of different types of data, as that depends on the question and type of data you have. I will focus on biological types of data and experimental designs, as this is my own area of expertise. Nevertheless, many of the features are directly applicable to alternative fields of research.

On a sidenote, my attempt is certainly not the first. Others have come before me to show their interpretation of clean and good scientific graphs, e.g. Constance Biegel and Prashant Kamat made a collection of good and bad graphs;
Please check out their site for more inspiration and alternative ways to plot graphs.

Design principles

What can be considered a good data visualization is debatable. A key figure in the development of modern data visualization was Edward Tufte. A lot of his ideas of style influenced current practices. Particularly his ideas of data-ink and chart junk were very influential. They actually go hand in hand: increase the data-to-ink ratio (i.e. show your data) and do not add distracting visual effects (chart junk).

With these principles in mind let’s make some beautiful scientific graphs. :)

Data preprocessing

I assume you have a data set that you want to plot that has been analysed based on the question you have. The preprocessing required for plotting often includes changing the table layout into a long format, often creating one table for all the data you want to show in one graph. Furthermore, it is also useful to change the order of your data, to optimally show it in the graph, subset data, etc.

I recently started to use the data.table package, which really saves a lot of lines of code and time. Data.tables are more enhanced data.frames that speed up and simplify data processing. A cheat sheet with the most important functions can be found here.

I will go through different examples of plots. Each might require different types of preprocessing, so you will see how your data should look like to plot it

Types of plots

Once the data is roughly in a good format, we can start to plot it. It is important to remember that each plot and visualization you choose show your data in a different way, and therefore tells a different story. They highlight different aspects of your data. So it is important to figure out what you want to convey with your graphic. At the same time you might be constrained by the type of data that you have, and which types of plots might be useful with it. Especially in academia, certain plots are used abundantly and very innovative ideas might be viewed skeptically or could be considered too artistic. So you often have to find a balance.

Important for choosing a plot type is whether your data is continuous, categorical, a time series or something in between.

Inspirations which graphics to choose:

Data viz project
Xenographics
general packages that are useful to make (academic) plots: ggplot2 (fundamental), cowplot, ggpubr
more specific packages: ggsurvplot (for survival curves), heatmap2 (for advanced heatmaps), upsetR (a Venn diagram alternative), Venn (to create Venn diagrams)
types of plots that are frequently used: box plot, bar graph, scatterplot, histogram, heatmap
types of error bar: sd, sem, ci
color scales: Susie Lu & Elijah Meeks’ Viz pallete, Rcolor brewer, I want hue
Color theory: Semantically related colors

In this initial blog I will cover the graphs I’ve most often observed: bar graphs vs box plots, and line graphs. More types might be added in later posts.

Example 1: Bar graph vs. Box plot

Let’s get started with loading the packages that we’ll use.


library(boot) # library with many datasets
library(tidyverse) # most important packages required for your daily data 
# wrangling and plotting. Includes ggplot2 among others.
library(data.table) # to make easier data manipulations
library(cowplot) 
library(ggpubr) # to add significant levels to your plot and make plot publication ready
library(eurostat) # access eurostat database http://ec.europa.eu/eurostat/

Next let’s load our first small dataset to play with. I chose the poisons data set, because it has a lot of attributes common in academic designs:

multiple biological replicates (but not too many, in this case 4)
two types of interventions
- three poisons
- four treatments

The data describes how long an animal survived after ingesting the poison and obtaining one of the five possible treatments.

This dataset will give us the opportunity to look at all the data and not be overwhelmed, yet explore ideas like error bars, and showing the mean versus individual data.


dt <- data.table(poisons)

# lets look at this to get a feeling
summary(dt)


      time        poison treat 
 Min.   :0.1800   1:16   A:12  
 1st Qu.:0.3000   2:16   B:12  
 Median :0.4000   3:16   C:12  
 Mean   :0.4794          D:12  
 3rd Qu.:0.6225                
 Max.   :1.2400

We see that we have three columns. The time shows a mean survival of 0.47 (it’s in 10 hour units, so about 4hrs 47 mins.) for all animals. It also shows the number of animals in each poison group and treatment group. Since the data is already in a long format and nicely cleaned up we can immidiately start plotting.

First, lets see the distribution in a box plot.


plt.ex1 <- ggplot(dt, aes(x=poison, y=time, color= treat))+
    geom_boxplot()+
    theme_gray()
plt.ex1

A box plot is very applicable in this case, as it shows the distribution of the data, and is good to compare categorical variables (the poisons and treatment options).

This plot is still very plain. To make it look a lot more sophisticated we will make a few changes:

change the color of boxes
change axis label
change axis weight
add the individual points as jitter (optional)

Particularly, line thickness and font size should often be thicker and slightly bigger than might look good to you on the screen. Still, many articles are printed by the scientists and thin lines might become faint and hard to read. Many journals also require the minimum font size in the final figure to be 6 or higher. If you check, you understand which size is still easily legible.

Pretty box plot

I will show you two alternative methods. The first relies on the default ggplot functions and you adjust it manually to your wishes. The second uses the ggpubr package. The ggpubr documentation is very comprehensive, so I won’t use it much for my examples. Most of all I prefer the flexibility I have with the standard ggplot2.

You should try both and choose whichever suits your needs better. :)


# default ggplot
plt.ex1v2 <- ggplot(dt, aes(x=poison, y=time, color= treat))+
    geom_boxplot(size=1)+ # make the lines of the box plot thicker
    labs(title="Treatment extends survival time after poisoning",
         x= "poison", # is already pretty descriptive
         y= "survival time (in 10 hours)") +
    scale_color_brewer(type="qual", palette = "Paired", name="Treatment") +
    scale_y_continuous(expand=c(0,0), limits=c(0,NA))+ # to start y axis at 0
    theme(axis.line.x = element_line(size=1), # make the axis line thicker
          axis.line.y = element_line(size=1))
plt.ex1v2


# version with individual data
plt.ex1v3 <- ggplot(dt, aes(x=poison, y=time, color= treat))+
    geom_boxplot(size=1)+
    geom_point(size=2, shape=21, stroke=1.1, position = position_jitterdodge(jitter.width = 0.1)) +
    scale_y_continuous(expand=c(0,0), limits=c(0,NA))+ # to start y axis at 0
    labs(title="Treatment extends survival time after poisoning",
         x= "poison", # is already pretty descriptive
         y= "survival time (in 10 hours)") +
    scale_color_brewer(type="qual", palette = "Paired", name="Treatment") +
    theme(axis.line.x = element_line(size=1),
          axis.line.y = element_line(size=1))
plt.ex1v3

GgpubR


plt.ex1v4 <- ggboxplot(dt, x="poison", y="time", 
                       color= "treat", palette = "Paired",
                       add= "jitter", ylab = "survival time (in 10 hours)", 
                       title = "Poison survival times GgpubR", size=1)
ggpar(plt.ex1v4, legend = "right", legend.title = "Treatment", ylim= c(0,1.25))

You can calculate the p-value between different conditions easily with the compare means function of the ggpubr package. By default it uses the Wilcoxon test to calculate the significance between two groups. You can change that via the method statement.

In this multiple variable layout, I couldn’t figure out yet how to show the significance values correctly on the plot.So I let you try and figure it out yourself.


tmp <- compare_means(time~poison, data=dt, group.by = "treat", method= "kruskal.test")
tmp


# A tibble: 4 x 7
  treat .y.         p p.adj p.format p.signif method        
  <fct> <chr>   <dbl> <dbl> <chr>    <chr>    <chr>         
1 A     time  0.0145  0.044 0.0145   *        Kruskal-Wallis
2 B     time  0.0244  0.049 0.0244   *        Kruskal-Wallis
3 C     time  0.00971 0.039 0.0097   **       Kruskal-Wallis
4 D     time  0.0245  0.049 0.0245   *        Kruskal-Wallis

I consider these examples good ways to visualize your data. However, I’ve come across many articles were the same type of data would have been shown in a bar graph with some kind of error bar (too often the one that seems the smallest…). For a box plot we don’t have to make any changes to the data, but for a bar graph we need to calculate the mean and standard variation ourselves.

Just as a reminder:

The standard variation (SD) estimates the variation in your sample. That means how much spread is observed between your biological replicates.
The standard error of the mean (SEM) estimates the uncertainty of your mean. That means how likely is the mean you found the true mean of the underlying distribution.
A confidence interval (CI) estimates the reliability of your measurement.


# Helper functions to calculate the standard error of the mean and the 0.95% confidence interval 
se <- function(x) sqrt(var(x,na.rm=TRUE)/length(na.omit(x)))
ci <- function(x) qnorm(0.975) * sqrt(var(x,na.rm=TRUE)/length(na.omit(x)))

# calculate the mean and standard variations
dtMean <- dt[, .(avg = mean(time), std = sd(time), ci1 = ci(time), stdE = se(time)), 
             by = c("poison", "treat")]

### plot standard deviation
plt.ex1Bar <- ggplot(dtMean, aes(poison, avg, fill=treat))+
    geom_bar(stat="identity", position=position_dodge())+
    geom_errorbar(aes(ymin=avg-std, ymax= avg+std), width=0.25, position=position_dodge(.9))+
    scale_fill_brewer(type="qual", palette = "Paired", name="Treatment")+
    scale_y_continuous(expand=c(0,0), limits=c(0,NA))+ # to start y axis at 0
    theme(axis.line.x = element_line(size=1),
          axis.line.y = element_line(size=1))+
    labs(title="Poison survival times",
     x= "poison", 
     y= "survival time (in 10 hours)")
plt.ex1Bar

I kept the remaining layout changes the same as in the box plot. If you compare the box plot with the bar graph, we run into Tuftes design principles, both chart junk and data-ink ratio. The bar graph requires a lot of ink to plot with little information value, i.e. you require an entire bar to represent the mean! Furthermore, it is misleading. If we look at our data, non of the measured variables start at 0.


### compare different error bars
dtMeanError <- rbind(dtMean[poison==1],dtMean[poison==1],dtMean[poison==1])
# we need to put all types of errors into one column for easy plotting. 
dtMeanError <- melt(dtMeanError, id.vars = c("poison", "treat", "avg"))

plt.ex1Error <- ggplot(dtMeanError, aes(poison, avg, fill=treat))+
    geom_bar(stat="identity", position=position_dodge())+
    geom_errorbar(aes(ymin=avg-value, ymax= avg+value), width=0.25, position=position_dodge(.9))+
    facet_wrap(~variable)+
    scale_fill_brewer(type="qual", palette = "Paired", name="Treatment")+
    scale_y_continuous(expand=c(0,0), limits=c(0,NA))+ # to start y axis at 0
    theme(axis.line.x = element_line(size=1),
          axis.line.y = element_line(size=1),
          plot.caption = element_text(hjust = 0, size=10))+
    labs(title="Different error bars change perception of results",
     x= "poison", 
     y= "survival time (in 10 hours)",
     caption= "Figure. Different error bars change perception of results. std= standard deviation, 
     \n ci1= 95% confidence interval, stdE= standard error or the mean. ")
plt.ex1Error

The example of the error bar also shows interesting results. As expected the SEM is smaller than the SD and the confidence interval. Also the the CI looks about equally spread as the SD, so it is really important to annotate your figure, so that people know what they are looking at. Important to remember: people will first see the errorbar and already make an unconscious judgement about the significance by the size of the bar, only then will they read the legend and see which type of error bar it is.

Example 2: Line plot

Next to bar graphs, line plots are one of the most abundant types of plots I see in academic articles. One of the main use cases of line graph is to show time series data. Of course any continoues variable can be plotted in a line graph, but since time series are very intuitive I’ll chose one for the next graph. Overall, I have no real complaints about line graphs in current scientific literature. Sometimes line graphs are used to show continouety between measured values. This is not necessarily wrong, but might be misleading. The following example shows how to make line graphs with ggplot2.

As test dataset I used a dataset about deaths in the European Union between 1994-2015. It can be found on the eurostat website.


# load only once (first for 2011-2015)
#death <- get_eurostat(id="tps00152", time_format = "date")
#death <- label_eurostat(death, lang="en")
# then for 1994 - 2010 to get more dates
# d2 <- get_eurostat(id="hlth_cd_asdr", time_format = "date")
# d2 <- label_eurostat(d2)
# merge both datasets
# death <- rbind(death, d2[,c(1,4,3,2,5,6,7)])
# save(death, file="D:/Rscripts/eurostat_deaths.Rdata")

# load death in european countries between 2011-2015. The data gives the rate 
# of deaths for male, female and total per 100,000 residents per EU country. 
load(file="D:/Rscripts/eurostat_deaths.Rdata")
death <- data.table(death)

# rename Germany
death$geo <- as.character(death$geo)
death$geo[death$geo=="Germany (until 1990 former territory of the FRG)"] <- "Germany"
# select a few countries, e.g. Germany, Netherlands, Italy and Portugal
countries <- c("Germany", "Portugal", "Italy", "Netherlands")
dfPlot <- death[geo %in% countries & icd10=="All causes of death (A00-Y89) excluding S00-T98" ]

# plot the death of all people per country
plt.Ex2 <- ggplot(dfPlot[sex=="Total" & age=="Total"], aes(time, values, color=geo))+
    geom_point()+
    geom_line()+
    theme_gray()
plt.Ex2

Interestingly, it shows a decline in overall death rate in the last 20 years. Since 2010 it starts to stabilize around a new equilibrium per country. Furthermore, Italy shows a lower death rate per 100,000 inhabitants compared with the other three countries over the whole range. Besides these observations, how can we make this plot look good for a scientific journal?


plt.Ex2v2 <- ggplot(dfPlot[sex=="Total" & age=="Total"], aes(time, values, color=geo))+
    geom_point()+
    geom_line()+
    theme_cowplot()+
    scale_color_manual(values = c("black", "green4" ,"orange", "red3"), name="Country")+
    scale_y_continuous(limits=c(0,1600), expand=c(0,0))+
    labs(title="Death rate declines in Europe 1994-2015",
         x= "year",
         y = "rate per 100,000 inhabitants")+
    theme(axis.line.x = element_line(size=1),
          axis.line.y = element_line(size=1))
plt.Ex2v2

If you still want to go for black and white, here is an alternatives using shapes or different line types. For a quick reference which shapes and lines are available check here.


plt.Ex2v3 <- ggplot(dfPlot[sex=="Total" & age=="Total"], aes(time, values, shape=geo, linetype=geo))+
    geom_point()+
    geom_line()+
    theme_cowplot()+
    scale_shape_manual(values = c(16, 21,15,22), name="Country")+
    scale_linetype_manual(values = c("solid", "dotted","dashed", "dotdash"),name="Country")+
    scale_y_continuous(limits=c(0,1600), expand=c(0,0))+
    labs(title="Death rate declines in Europe 1994-2015",
         x= "year",
         y = "rate per 100,000 inhabitants")+
    theme(axis.line.x = element_line(size=1),
          axis.line.y = element_line(size=1))
plt.Ex2v3

Similar to the box plot you can add error bars to lines. You can either add individual error bars to each data point, or if you want to highlight the range, a ‘ribbon’ indicating the upper and lower limits. To illustrate, let’s look at the average death rate in Europe.


# calculate mean and standard deviation over all EU countries. 
euDeath <- death[icd10=="All causes of death (A00-Y89) excluding S00-T98" & 
                   sex=="Total"& 
                   age== "Total" & 
                   geo != "European Union (current composition)", 
                 .(avg= mean(values), std= sd(values)), 
                 by = "time"]

plt.Ex2v4 <- ggplot(euDeath, aes(time, avg)) +
    geom_ribbon(aes(ymin=avg-std, ymax=avg+std),alpha=0.2, fill="dodgerblue")+
    geom_line(size=1, color="blue4")+
    theme_cowplot()+
    scale_y_continuous(limits=c(0,1800), expand=c(0,0))+
    labs(title="Death rate decreases in Europe 1994-2015",
         x= "year",
         y = "rate per 100,000 inhabitants")+
    theme(axis.line.x = element_line(size=1),
          axis.line.y = element_line(size=1))
plt.Ex2v4