This notebook is a collection of useful tips and tricks I collected to produce figures in R for my academic articles.
This post will cover the different stages from data preprocessing to plotting. It will not go into details of the analysis of different types of data, as that depends on the question and type of data you have. I will focus on biological types of data and experimental designs, as this is my own area of expertise. Nevertheless, many of the features are directly applicable to alternative fields of research.
On a sidenote, my attempt is certainly not the first. Others have come before me to show their interpretation of clean and good scientific graphs, e.g. Constance Biegel and Prashant Kamat made a collection of good and bad graphs;
Please check out their site for more inspiration and alternative ways to plot graphs.
What can be considered a good data visualization is debatable. A key figure in the development of modern data visualization was Edward Tufte. A lot of his ideas of style influenced current practices. Particularly his ideas of data-ink and chart junk were very influential. They actually go hand in hand: increase the data-to-ink ratio (i.e. show your data) and do not add distracting visual effects (chart junk).
With these principles in mind let’s make some beautiful scientific graphs. :)
I assume you have a data set that you want to plot that has been analysed based on the question you have. The preprocessing required for plotting often includes changing the table layout into a long format, often creating one table for all the data you want to show in one graph. Furthermore, it is also useful to change the order of your data, to optimally show it in the graph, subset data, etc.
I recently started to use the data.table package, which really saves a lot of lines of code and time. Data.tables are more enhanced data.frames that speed up and simplify data processing. A cheat sheet with the most important functions can be found here.
I will go through different examples of plots. Each might require different types of preprocessing, so you will see how your data should look like to plot it
Once the data is roughly in a good format, we can start to plot it. It is important to remember that each plot and visualization you choose show your data in a different way, and therefore tells a different story. They highlight different aspects of your data. So it is important to figure out what you want to convey with your graphic. At the same time you might be constrained by the type of data that you have, and which types of plots might be useful with it. Especially in academia, certain plots are used abundantly and very innovative ideas might be viewed skeptically or could be considered too artistic. So you often have to find a balance.
Important for choosing a plot type is whether your data is continuous, categorical, a time series or something in between.
Inspirations which graphics to choose:
Color theory: Semantically related colors
In this initial blog I will cover the graphs I’ve most often observed: bar graphs vs box plots, and line graphs. More types might be added in later posts.
Let’s get started with loading the packages that we’ll use.
library(boot) # library with many datasets
library(tidyverse) # most important packages required for your daily data
# wrangling and plotting. Includes ggplot2 among others.
library(data.table) # to make easier data manipulations
library(cowplot)
library(ggpubr) # to add significant levels to your plot and make plot publication ready
library(eurostat) # access eurostat database http://ec.europa.eu/eurostat/
Next let’s load our first small dataset to play with. I chose the poisons data set, because it has a lot of attributes common in academic designs:
The data describes how long an animal survived after ingesting the poison and obtaining one of the five possible treatments.
This dataset will give us the opportunity to look at all the data and not be overwhelmed, yet explore ideas like error bars, and showing the mean versus individual data.
dt <- data.table(poisons)
# lets look at this to get a feeling
summary(dt)
time poison treat
Min. :0.1800 1:16 A:12
1st Qu.:0.3000 2:16 B:12
Median :0.4000 3:16 C:12
Mean :0.4794 D:12
3rd Qu.:0.6225
Max. :1.2400
We see that we have three columns. The time shows a mean survival of 0.47 (it’s in 10 hour units, so about 4hrs 47 mins.) for all animals. It also shows the number of animals in each poison group and treatment group. Since the data is already in a long format and nicely cleaned up we can immidiately start plotting.
First, lets see the distribution in a box plot.
plt.ex1 <- ggplot(dt, aes(x=poison, y=time, color= treat))+
geom_boxplot()+
theme_gray()
plt.ex1
A box plot is very applicable in this case, as it shows the distribution of the data, and is good to compare categorical variables (the poisons and treatment options).
This plot is still very plain. To make it look a lot more sophisticated we will make a few changes:
Particularly, line thickness and font size should often be thicker and slightly bigger than might look good to you on the screen. Still, many articles are printed by the scientists and thin lines might become faint and hard to read. Many journals also require the minimum font size in the final figure to be 6 or higher. If you check, you understand which size is still easily legible.
I will show you two alternative methods. The first relies on the default ggplot functions and you adjust it manually to your wishes. The second uses the ggpubr package. The ggpubr documentation is very comprehensive, so I won’t use it much for my examples. Most of all I prefer the flexibility I have with the standard ggplot2.
You should try both and choose whichever suits your needs better. :)
# default ggplot
plt.ex1v2 <- ggplot(dt, aes(x=poison, y=time, color= treat))+
geom_boxplot(size=1)+ # make the lines of the box plot thicker
labs(title="Treatment extends survival time after poisoning",
x= "poison", # is already pretty descriptive
y= "survival time (in 10 hours)") +
scale_color_brewer(type="qual", palette = "Paired", name="Treatment") +
scale_y_continuous(expand=c(0,0), limits=c(0,NA))+ # to start y axis at 0
theme(axis.line.x = element_line(size=1), # make the axis line thicker
axis.line.y = element_line(size=1))
plt.ex1v2
# version with individual data
plt.ex1v3 <- ggplot(dt, aes(x=poison, y=time, color= treat))+
geom_boxplot(size=1)+
geom_point(size=2, shape=21, stroke=1.1, position = position_jitterdodge(jitter.width = 0.1)) +
scale_y_continuous(expand=c(0,0), limits=c(0,NA))+ # to start y axis at 0
labs(title="Treatment extends survival time after poisoning",
x= "poison", # is already pretty descriptive
y= "survival time (in 10 hours)") +
scale_color_brewer(type="qual", palette = "Paired", name="Treatment") +
theme(axis.line.x = element_line(size=1),
axis.line.y = element_line(size=1))
plt.ex1v3
plt.ex1v4 <- ggboxplot(dt, x="poison", y="time",
color= "treat", palette = "Paired",
add= "jitter", ylab = "survival time (in 10 hours)",
title = "Poison survival times GgpubR", size=1)
ggpar(plt.ex1v4, legend = "right", legend.title = "Treatment", ylim= c(0,1.25))
You can calculate the p-value between different conditions easily with the compare means function of the ggpubr package. By default it uses the Wilcoxon test to calculate the significance between two groups. You can change that via the method statement.
In this multiple variable layout, I couldn’t figure out yet how to show the significance values correctly on the plot.So I let you try and figure it out yourself.
tmp <- compare_means(time~poison, data=dt, group.by = "treat", method= "kruskal.test")
tmp
# A tibble: 4 x 7
treat .y. p p.adj p.format p.signif method
<fct> <chr> <dbl> <dbl> <chr> <chr> <chr>
1 A time 0.0145 0.044 0.0145 * Kruskal-Wallis
2 B time 0.0244 0.049 0.0244 * Kruskal-Wallis
3 C time 0.00971 0.039 0.0097 ** Kruskal-Wallis
4 D time 0.0245 0.049 0.0245 * Kruskal-Wallis
I consider these examples good ways to visualize your data. However, I’ve come across many articles were the same type of data would have been shown in a bar graph with some kind of error bar (too often the one that seems the smallest…). For a box plot we don’t have to make any changes to the data, but for a bar graph we need to calculate the mean and standard variation ourselves.
Just as a reminder:
# Helper functions to calculate the standard error of the mean and the 0.95% confidence interval
se <- function(x) sqrt(var(x,na.rm=TRUE)/length(na.omit(x)))
ci <- function(x) qnorm(0.975) * sqrt(var(x,na.rm=TRUE)/length(na.omit(x)))
# calculate the mean and standard variations
dtMean <- dt[, .(avg = mean(time), std = sd(time), ci1 = ci(time), stdE = se(time)),
by = c("poison", "treat")]
### plot standard deviation
plt.ex1Bar <- ggplot(dtMean, aes(poison, avg, fill=treat))+
geom_bar(stat="identity", position=position_dodge())+
geom_errorbar(aes(ymin=avg-std, ymax= avg+std), width=0.25, position=position_dodge(.9))+
scale_fill_brewer(type="qual", palette = "Paired", name="Treatment")+
scale_y_continuous(expand=c(0,0), limits=c(0,NA))+ # to start y axis at 0
theme(axis.line.x = element_line(size=1),
axis.line.y = element_line(size=1))+
labs(title="Poison survival times",
x= "poison",
y= "survival time (in 10 hours)")
plt.ex1Bar
I kept the remaining layout changes the same as in the box plot. If you compare the box plot with the bar graph, we run into Tuftes design principles, both chart junk and data-ink ratio. The bar graph requires a lot of ink to plot with little information value, i.e. you require an entire bar to represent the mean! Furthermore, it is misleading. If we look at our data, non of the measured variables start at 0.
### compare different error bars
dtMeanError <- rbind(dtMean[poison==1],dtMean[poison==1],dtMean[poison==1])
# we need to put all types of errors into one column for easy plotting.
dtMeanError <- melt(dtMeanError, id.vars = c("poison", "treat", "avg"))
plt.ex1Error <- ggplot(dtMeanError, aes(poison, avg, fill=treat))+
geom_bar(stat="identity", position=position_dodge())+
geom_errorbar(aes(ymin=avg-value, ymax= avg+value), width=0.25, position=position_dodge(.9))+
facet_wrap(~variable)+
scale_fill_brewer(type="qual", palette = "Paired", name="Treatment")+
scale_y_continuous(expand=c(0,0), limits=c(0,NA))+ # to start y axis at 0
theme(axis.line.x = element_line(size=1),
axis.line.y = element_line(size=1),
plot.caption = element_text(hjust = 0, size=10))+
labs(title="Different error bars change perception of results",
x= "poison",
y= "survival time (in 10 hours)",
caption= "Figure. Different error bars change perception of results. std= standard deviation,
\n ci1= 95% confidence interval, stdE= standard error or the mean. ")
plt.ex1Error
The example of the error bar also shows interesting results. As expected the SEM is smaller than the SD and the confidence interval. Also the the CI looks about equally spread as the SD, so it is really important to annotate your figure, so that people know what they are looking at. Important to remember: people will first see the errorbar and already make an unconscious judgement about the significance by the size of the bar, only then will they read the legend and see which type of error bar it is.
Next to bar graphs, line plots are one of the most abundant types of plots I see in academic articles. One of the main use cases of line graph is to show time series data. Of course any continoues variable can be plotted in a line graph, but since time series are very intuitive I’ll chose one for the next graph. Overall, I have no real complaints about line graphs in current scientific literature. Sometimes line graphs are used to show continouety between measured values. This is not necessarily wrong, but might be misleading. The following example shows how to make line graphs with ggplot2.
As test dataset I used a dataset about deaths in the European Union between 1994-2015. It can be found on the eurostat website.
# load only once (first for 2011-2015)
#death <- get_eurostat(id="tps00152", time_format = "date")
#death <- label_eurostat(death, lang="en")
# then for 1994 - 2010 to get more dates
# d2 <- get_eurostat(id="hlth_cd_asdr", time_format = "date")
# d2 <- label_eurostat(d2)
# merge both datasets
# death <- rbind(death, d2[,c(1,4,3,2,5,6,7)])
# save(death, file="D:/Rscripts/eurostat_deaths.Rdata")
# load death in european countries between 2011-2015. The data gives the rate
# of deaths for male, female and total per 100,000 residents per EU country.
load(file="D:/Rscripts/eurostat_deaths.Rdata")
death <- data.table(death)
# rename Germany
death$geo <- as.character(death$geo)
death$geo[death$geo=="Germany (until 1990 former territory of the FRG)"] <- "Germany"
# select a few countries, e.g. Germany, Netherlands, Italy and Portugal
countries <- c("Germany", "Portugal", "Italy", "Netherlands")
dfPlot <- death[geo %in% countries & icd10=="All causes of death (A00-Y89) excluding S00-T98" ]
# plot the death of all people per country
plt.Ex2 <- ggplot(dfPlot[sex=="Total" & age=="Total"], aes(time, values, color=geo))+
geom_point()+
geom_line()+
theme_gray()
plt.Ex2
Interestingly, it shows a decline in overall death rate in the last 20 years. Since 2010 it starts to stabilize around a new equilibrium per country. Furthermore, Italy shows a lower death rate per 100,000 inhabitants compared with the other three countries over the whole range. Besides these observations, how can we make this plot look good for a scientific journal?
plt.Ex2v2 <- ggplot(dfPlot[sex=="Total" & age=="Total"], aes(time, values, color=geo))+
geom_point()+
geom_line()+
theme_cowplot()+
scale_color_manual(values = c("black", "green4" ,"orange", "red3"), name="Country")+
scale_y_continuous(limits=c(0,1600), expand=c(0,0))+
labs(title="Death rate declines in Europe 1994-2015",
x= "year",
y = "rate per 100,000 inhabitants")+
theme(axis.line.x = element_line(size=1),
axis.line.y = element_line(size=1))
plt.Ex2v2
If you still want to go for black and white, here is an alternatives using shapes or different line types. For a quick reference which shapes and lines are available check here.
plt.Ex2v3 <- ggplot(dfPlot[sex=="Total" & age=="Total"], aes(time, values, shape=geo, linetype=geo))+
geom_point()+
geom_line()+
theme_cowplot()+
scale_shape_manual(values = c(16, 21,15,22), name="Country")+
scale_linetype_manual(values = c("solid", "dotted","dashed", "dotdash"),name="Country")+
scale_y_continuous(limits=c(0,1600), expand=c(0,0))+
labs(title="Death rate declines in Europe 1994-2015",
x= "year",
y = "rate per 100,000 inhabitants")+
theme(axis.line.x = element_line(size=1),
axis.line.y = element_line(size=1))
plt.Ex2v3
Similar to the box plot you can add error bars to lines. You can either add individual error bars to each data point, or if you want to highlight the range, a ‘ribbon’ indicating the upper and lower limits. To illustrate, let’s look at the average death rate in Europe.
# calculate mean and standard deviation over all EU countries.
euDeath <- death[icd10=="All causes of death (A00-Y89) excluding S00-T98" &
sex=="Total"&
age== "Total" &
geo != "European Union (current composition)",
.(avg= mean(values), std= sd(values)),
by = "time"]
plt.Ex2v4 <- ggplot(euDeath, aes(time, avg)) +
geom_ribbon(aes(ymin=avg-std, ymax=avg+std),alpha=0.2, fill="dodgerblue")+
geom_line(size=1, color="blue4")+
theme_cowplot()+
scale_y_continuous(limits=c(0,1800), expand=c(0,0))+
labs(title="Death rate decreases in Europe 1994-2015",
x= "year",
y = "rate per 100,000 inhabitants")+
theme(axis.line.x = element_line(size=1),
axis.line.y = element_line(size=1))
plt.Ex2v4