Box plot with ggplot2

Box Plot

A box plot shows how the distribution of a continuous variable varies across the levels of a categorical variable. It summarizes the shape of the distribution with five summary statistics (the median, two hinges and two whiskers) and plots outlying data points individually. It is also known as a box-and-whiskers plot.

It is also used to spot patterns of outliers within the categories of a factor variable.

In this tutorial, we will use the mpg dataset from the ggplot2 package, as shown below.

library(ggplot2)
knitr::kable(head(mpg))
manufacturer  model  displ  year  cyl  trans       drv  cty  hwy  fl  class
audi          a4     1.8    1999  4    auto(l5)    f    18   29   p   compact
audi          a4     1.8    1999  4    manual(m5)  f    21   29   p   compact
audi          a4     2.0    2008  4    manual(m6)  f    20   31   p   compact
audi          a4     2.0    2008  4    auto(av)    f    21   30   p   compact
audi          a4     2.8    1999  6    auto(l5)    f    16   26   p   compact
audi          a4     2.8    1999  6    manual(m5)  f    18   26   p   compact

This dataset contains fuel economy data from 1999 and 2008 for 38 popular car models. It has 234 observations of 11 variables. If we are interested in how highway fuel economy varies with the drivetrain, a box plot is a good choice. The hwy variable is the highway mileage in miles per gallon, and the drv variable is the type of drivetrain (f = front-wheel drive, r = rear-wheel drive, 4 = four-wheel drive).

A box plot can be constructed by adding geom_boxplot() to a ggplot() call from the ggplot2 package as:

ggplot(mpg, aes(x = drv, y = hwy)) + geom_boxplot()

In the figure above, the box represents the interquartile range (IQR). The line inside the box corresponds to the median of the distribution (the 50th percentile). The lower and upper hinges, which are the edges of the box, correspond to the first and third quartiles (the 25th and 75th percentiles).

The lower whisker extends from the lower hinge down to the smallest observation that lies within 1.5 times the IQR of the hinge, that is, \(\min\{x_i : x_i \ge Q_{1} - 1.5 \times IQR\}\). The upper whisker extends from the upper hinge up to the largest such observation, that is, \(\max\{x_i : x_i \le Q_{3} + 1.5 \times IQR\}\).

The values that lie beyond the ends of the lower or upper whiskers are plotted individually as outliers.
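As a rough check of the whisker rule, the sketch below recomputes the quartiles, whisker ends and outliers by hand for the front-wheel-drive cars. It assumes that quantile() with its default settings gives the same hinges that geom_boxplot() draws; the variable names are illustrative only.

```r
library(ggplot2)  # only needed here for the mpg dataset

# Highway mileage of the front-wheel-drive cars
x <- mpg$hwy[mpg$drv == "f"]

# Hinges and interquartile range (default quantile type is an assumption)
q1  <- unname(quantile(x, 0.25))
q3  <- unname(quantile(x, 0.75))
iqr <- q3 - q1

# Fences: the furthest a whisker is allowed to reach
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Whiskers stop at the most extreme observations inside the fences
lower_whisker <- min(x[x >= lower_fence])
upper_whisker <- max(x[x <= upper_fence])

# Everything beyond the fences is drawn as an individual point
outliers <- x[x < lower_fence | x > upper_fence]

print(c(lower_whisker = lower_whisker, q1 = q1, median = median(x),
        q3 = q3, upper_whisker = upper_whisker))
print(sort(outliers))
```

Changing "f" to "4" or "r" repeats the calculation for the other two boxes.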

To change the width of the boxes we have to specify the width argument to geom_boxplot() as:

ggplot(mpg, aes(x = drv, y = hwy)) + geom_boxplot(width = 0.3)

Similarly, to change the outline color of the box plot, specify the color argument to geom_boxplot() as:

ggplot(mpg, aes(x = drv, y = hwy)) + geom_boxplot(width = 0.3, color = "blue")

To change the transparency of the outliers, we can pass the outlier.alpha argument to geom_boxplot() as:

ggplot(mpg, aes(x = drv, y = hwy)) + geom_boxplot(width = 0.3, color = "blue", outlier.alpha = 0.3)

The mean of the continuous variable within each level of the factor variable can also be added with the stat_summary() function. Since ggplot2 3.3.0 the summary function is passed through the fun argument, which replaces the deprecated fun.y:

ggplot(mpg, aes(x = drv, y = hwy)) + geom_boxplot(width = 0.3, color = "blue") + stat_summary(fun = mean, geom = "point", color = "red", shape = 8, size = 2)
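To see which values the red stars mark, the group means can be computed directly. This is a minimal sketch using base R's tapply(); the name group_means is illustrative only.

```r
library(ggplot2)  # provides the mpg dataset

# Mean highway mileage for each drivetrain (4, f, r)
group_means <- tapply(mpg$hwy, mpg$drv, mean)
print(round(group_means, 2))
```

Comparing these values with the median lines shows how skewed each group is.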

The fill color of the boxes can be changed by specifying the fill argument to geom_boxplot() as:

ggplot(mpg, aes(x = drv, y = hwy)) + geom_boxplot(width = 0.3, color = "blue", fill = "pink")

The transparency of the boxes can also be changed by specifying the alpha argument to geom_boxplot() as:

ggplot(mpg, aes(x = drv, y = hwy)) + geom_boxplot(width = 0.3, color = "blue", fill = "pink", alpha = 0.5)

alpha takes values in [0, 1], where 0 is fully transparent and 1 is fully opaque.

By default, the outliers have the same color as the box plot outline. To change the color of the outliers, specify the outlier.color argument to geom_boxplot() as:

ggplot(mpg, aes(x = drv, y = hwy)) + geom_boxplot(width = 0.3, color = "blue", outlier.color = "red")

Similarly, the shape of the outliers can be changed with the outlier.shape argument as:

ggplot(mpg, aes(x = drv, y = hwy)) + geom_boxplot(width = 0.3, color = "blue", outlier.color = "red", outlier.shape = 1)

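The outliers can be hidden by setting the outlier.shape argument of geom_boxplot() to NA. This only suppresses the points in the plot; the observations stay in the data and still contribute to the box statistics: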

ggplot(mpg, aes(x = drv, y = hwy)) + geom_boxplot(width = 0.3, color = "blue", outlier.shape = NA)

The original data points can also be plotted over the box plot with the help of the geom_jitter() function. Hiding the outliers here avoids drawing those points twice:

ggplot(mpg, aes(x = drv, y = hwy)) + geom_boxplot(width = 0.3, color = "blue", outlier.shape = NA) + geom_jitter(width = 0.2, color = "red")

Further, if we want a different color for each box according to the drv variable, we have to map drv to the color aesthetic inside aes() in geom_boxplot(), e.g.,

ggplot(mpg, aes(x = drv, y = hwy)) + geom_boxplot(aes(color = drv), width = 0.3)

These vertical box plots can also be drawn horizontally by adding coord_flip() after geom_boxplot() as:

ggplot(mpg, aes(x = drv, y = hwy)) + geom_boxplot(aes(color = drv), width = 0.3) + coord_flip()

We can also apply a different theme, such as theme_light(), to the box plot.

ggplot(mpg, aes(x = drv, y = hwy)) + geom_boxplot(aes(color = drv), width = 0.3) + theme_light()

The ggplot2 package provides many other built-in themes.

Grouped box plot

Sometimes we want to split each group further into subgroups. In the examples above, the boxes were grouped by drivetrain. Here, we will check the distribution of highway miles per gallon by type of car, given by the class variable of the mpg dataset, within each drivetrain. A grouped box plot can be drawn by mapping class to the fill aesthetic inside aes() as:

ggplot(mpg, aes(x = drv, y = hwy, fill = class)) + geom_boxplot(width = 0.3, color = "blue")

Notched box plot

Notches are used to compare groups; if the notches of two boxes do not overlap, this suggests that the medians are significantly different. A notched box plot can be drawn by setting the notch argument of geom_boxplot() to TRUE as:

ggplot(mpg, aes(x = drv, y = hwy)) + geom_boxplot(aes(color = drv), width = 0.3, notch = TRUE)
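The notch limits can also be approximated by hand. According to the ggplot2 documentation, notches extend 1.58 * IQR / sqrt(n) on either side of the median, which gives a roughly 95% interval for comparing medians. The sketch below applies that formula to each drivetrain; the helper notch_limits() is hypothetical and not part of ggplot2.

```r
library(ggplot2)  # provides the mpg dataset

# Hypothetical helper: notch limits around the median of a numeric vector
notch_limits <- function(v) {
  half_width <- 1.58 * IQR(v) / sqrt(length(v))
  c(lower = median(v) - half_width,
    median = median(v),
    upper = median(v) + half_width)
}

# One row per drivetrain, columns lower / median / upper
notches <- t(sapply(split(mpg$hwy, mpg$drv), notch_limits))
print(round(notches, 2))
```

Two groups whose [lower, upper] intervals do not overlap correspond to boxes whose notches do not overlap in the plot.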

Box plot with continuous variable

A box plot can also be drawn against a continuous x variable with the help of the cut_width() function. The displ variable is continuous and gives the engine displacement in litres. First, we draw the box plot with this variable without cut_width(). Because no grouping is specified, ggplot2 draws a single box for all observations and warns about the continuous x aesthetic.

ggplot(mpg, aes(displ, hwy)) + geom_boxplot(color = "blue")

Now, we will use cut_width() to bin displ into intervals of width 0.25 and draw one box per bin as:

ggplot(mpg, aes(displ, hwy)) + geom_boxplot(aes(group = cut_width(displ, 0.25)), color = "blue")

To reveal how the distribution changes along the continuous variable, we can try different bin widths in the cut_width() function.
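To get a feel for what a given width does before plotting, the bins produced by cut_width() can be tabulated. This is a small sketch; the widths 1 and 0.25 and the variable names are chosen only for illustration.

```r
library(ggplot2)  # provides cut_width() and the mpg dataset

# Number of cars falling into each displacement bin
bins_wide   <- table(cut_width(mpg$displ, 1))     # few, wide bins
bins_narrow <- table(cut_width(mpg$displ, 0.25))  # many, narrow bins

print(bins_wide)
print(length(bins_narrow))
```

Narrow bins show more detail but leave fewer observations per box, so the individual boxes become less reliable.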

Box plot with computation

Sometimes we need to check the distribution of computed quantities, for example in Monte Carlo or MCMC studies. For this, we will first simulate some data and compute the summary statistics ourselves, and then draw the box plot from those precomputed values.

set.seed(34765)
y <- rnorm(1000)
library(tibble)
data <- tibble(
    x = 0.5,
    y0 = min(y),
    y25 = quantile(y, 0.25),
    y50 = median(y),
    y75 = quantile(y, 0.75),
    y100 = max(y)
)
print(data)
## # A tibble: 1 x 6
##       x    y0    y25      y50   y75  y100
##   <dbl> <dbl>  <dbl>    <dbl> <dbl> <dbl>
## 1   0.5 -2.89 -0.718 -0.00407 0.670  3.40

Now, the box plot can be drawn from these precomputed statistics by passing stat = "identity" to the geom_boxplot() function as:

ggplot(data, aes(x = x)) + geom_boxplot(aes(ymin = y0, lower = y25, middle = y50, upper = y75, ymax = y100), stat = "identity", color = "blue")
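The same idea extends to several groups: compute one row of statistics per group and map the grouping variable to x. The sketch below does this for hwy by drv in the mpg dataset using base R; the data frame summ and its column names are illustrative, and, as in the example above, the whiskers run to the minimum and maximum, so no outliers are drawn.

```r
library(ggplot2)

# Split highway mileage by drivetrain
groups <- split(mpg$hwy, mpg$drv)

# One row of precomputed box plot statistics per drivetrain
summ <- data.frame(
  drv  = names(groups),
  y0   = vapply(groups, min, numeric(1)),
  y25  = vapply(groups, quantile, numeric(1), probs = 0.25, names = FALSE),
  y50  = vapply(groups, median, numeric(1)),
  y75  = vapply(groups, quantile, numeric(1), probs = 0.75, names = FALSE),
  y100 = vapply(groups, max, numeric(1))
)
print(summ)

# Draw the boxes directly from the precomputed values
p <- ggplot(summ, aes(x = drv)) +
  geom_boxplot(
    aes(ymin = y0, lower = y25, middle = y50, upper = y75, ymax = y100),
    stat = "identity", color = "blue"
  )
print(p)
```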

In this way, we can make different types of box plots with the help of the ggplot2 package in R.
