# Box plot with ggplot2

## Box Plot

A box plot shows how the distribution of a *continuous variable* varies across the levels of a categorical variable. It summarizes the shape of the distribution with five summary statistics (the median, two hinges and two whiskers) and draws outlying data points individually. It is also known as a `box and whiskers plot`.

It also helps to reveal patterns of outliers within the categories of a factor variable.

In this tutorial, we will use the `mpg` dataset from the `ggplot2` package, as shown below.

```
library(ggplot2)
knitr::kable(head(mpg))
```

| manufacturer | model | displ | year | cyl | trans | drv | cty | hwy | fl | class |
|---|---|---|---|---|---|---|---|---|---|---|
| audi | a4 | 1.8 | 1999 | 4 | auto(l5) | f | 18 | 29 | p | compact |
| audi | a4 | 1.8 | 1999 | 4 | manual(m5) | f | 21 | 29 | p | compact |
| audi | a4 | 2.0 | 2008 | 4 | manual(m6) | f | 20 | 31 | p | compact |
| audi | a4 | 2.0 | 2008 | 4 | auto(av) | f | 21 | 30 | p | compact |
| audi | a4 | 2.8 | 1999 | 6 | auto(l5) | f | 16 | 26 | p | compact |
| audi | a4 | 2.8 | 1999 | 6 | manual(m5) | f | 18 | 26 | p | compact |

This dataset contains fuel economy data from 1999 and 2008 for 38 popular car models. It has 234 observations and 11 variables. If we want to know how highway fuel economy varies with the type of drivetrain, a box plot is a natural choice. The **hwy** variable is the *highway miles per gallon* and the **drv** variable is the *drivetrain* (`f` for front-wheel drive, `r` for rear-wheel drive and `4` for four-wheel drive).

A box plot can be constructed by adding `geom_boxplot()` to a `ggplot()` call from the `ggplot2` package:

`ggplot(mpg, aes(x = drv, y = hwy)) + geom_boxplot()`

In the resulting figure, the box spans the interquartile range (IQR). The line inside the box is the median of the distribution (the 50th percentile). The lower and upper hinges, which form the edges of the box, correspond to the first and third quartiles (the 25th and 75th percentiles).

The lower whisker extends from the lower hinge down to the smallest observation that is no more than \(1.5 \times IQR\) below it, so its end is never below \(\max(\min(data), Q_{1} - 1.5 \times IQR)\). The upper whisker extends from the upper hinge up to the largest observation that is no more than \(1.5 \times IQR\) above it, so its end is never above \(\min(\max(data), Q_{3} + 1.5 \times IQR)\).

Values that lie beyond the ends of the whiskers are plotted individually as outliers.
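The statistics behind a single box can be reproduced by hand. The following is a minimal sketch for the front-wheel-drive cars only, assuming the hinges are taken as the default `quantile()` quartiles and the whiskers stop at the most extreme observations within 1.5 times the IQR of the hinges; the variable names are chosen here for illustration.

```r
library(ggplot2)  # only needed for the mpg dataset

# Highway mileage of the front-wheel-drive cars
hwy_f <- mpg$hwy[mpg$drv == "f"]

# Hinges and median
q <- unname(quantile(hwy_f, c(0.25, 0.5, 0.75)))
iqr <- q[3] - q[1]

# Fences: the furthest a whisker is allowed to reach
lower_fence <- q[1] - 1.5 * iqr
upper_fence <- q[3] + 1.5 * iqr

# Whiskers end at the most extreme observations inside the fences
lower_whisker <- min(hwy_f[hwy_f >= lower_fence])
upper_whisker <- max(hwy_f[hwy_f <= upper_fence])

# Everything beyond the fences is drawn as an individual outlier
outliers <- hwy_f[hwy_f < lower_fence | hwy_f > upper_fence]

print(c(ymin = lower_whisker, lower = q[1], middle = q[2],
        upper = q[3], ymax = upper_whisker))
print(outliers)
```

Repeating the calculation for `drv == "r"` and `drv == "4"` gives the values for the other two boxes.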

To change the width of the boxes, specify the `width` argument to `geom_boxplot()`:

`ggplot(mpg, aes(x = drv, y = hwy)) + geom_boxplot(width = 0.3)`

Similarly, to change the outline color of the box plot, specify the `color` argument to `geom_boxplot()`:

`ggplot(mpg, aes(x = drv, y = hwy)) + geom_boxplot(width = 0.3, color = "blue")`

To change the transparency of the outliers, use the `outlier.alpha` argument of `geom_boxplot()`:

`ggplot(mpg, aes(x = drv, y = hwy)) + geom_boxplot(width = 0.3, color = "blue", outlier.alpha = 0.3)`

The mean of the continuous variable within each level of the factor variable can also be added with the `stat_summary()` function. Since ggplot2 3.3.0 the summary function is passed through the `fun` argument; the older `fun.y` argument is deprecated.

`ggplot(mpg, aes(x = drv, y = hwy)) + geom_boxplot(width = 0.3, color = "blue") + stat_summary(fun = mean, geom = "point", color = "red", shape = 8, size = 2)`
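To see the numbers behind the mean markers, the group means can also be computed directly. This is a small sketch using base R's `aggregate()`; the names `group_means` and `p_means` are illustrative, and plotting the precomputed means with `geom_point()` is shown only as one possible alternative to a summary layer.

```r
library(ggplot2)

# Mean highway mileage within each drivetrain
group_means <- aggregate(hwy ~ drv, data = mpg, FUN = mean)
print(group_means)

# The precomputed means can be overlaid as an ordinary point layer;
# the x and y mappings are inherited from ggplot()
p_means <- ggplot(mpg, aes(x = drv, y = hwy)) +
  geom_boxplot(width = 0.3, color = "blue") +
  geom_point(data = group_means, color = "red", shape = 8, size = 2)
print(p_means)
```

Comparing each mean with the median line of its box gives a quick indication of skewness within that group.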

The inside color of the boxes can be changed by specifying the `fill` argument to `geom_boxplot()`:

`ggplot(mpg, aes(x = drv, y = hwy)) + geom_boxplot(width = 0.3, color = "blue", fill = "pink")`

The transparency of the boxes can also be changed by specifying the `alpha` argument to `geom_boxplot()`:

`ggplot(mpg, aes(x = drv, y = hwy)) + geom_boxplot(width = 0.3, color = "blue", fill = "pink", alpha = 0.5)`

`alpha` takes values in [0, 1], where 0 is fully transparent and 1 is fully opaque.

By default, the outliers have the same color as the box outline. To change the color of the outliers, specify the `outlier.color` argument to `geom_boxplot()`:

`ggplot(mpg, aes(x = drv, y = hwy)) + geom_boxplot(width = 0.3, color = "blue", outlier.color = "red")`

Similarly, the shape of the outliers can be changed with the `outlier.shape` argument:

`ggplot(mpg, aes(x = drv, y = hwy)) + geom_boxplot(width = 0.3, color = "blue", outlier.color = "red", outlier.shape = 1)`

The outliers can be hidden by setting `outlier.shape = NA` in `geom_boxplot()`. This only hides the points; they are still part of the data, so the range of the y-axis does not change.

`ggplot(mpg, aes(x = drv, y = hwy)) + geom_boxplot(width = 0.3, color = "blue", outlier.shape = NA)`

The original data points can also be plotted over the box plot with the `geom_jitter()` function:

`ggplot(mpg, aes(x = drv, y = hwy)) + geom_boxplot(width = 0.3, color = "blue", outlier.shape = NA) + geom_jitter(width = 0.2, color = "red")`
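Jittering is random, so the points land in slightly different places each time the plot is drawn, and by default they are also nudged vertically. The sketch below shows one way to make the overlay reproducible and to keep the `hwy` values exact, using `position_jitter()` with its `height` and `seed` arguments; the seed value and the name `p_jitter` are arbitrary choices for illustration.

```r
library(ggplot2)

p_jitter <- ggplot(mpg, aes(x = drv, y = hwy)) +
  geom_boxplot(width = 0.3, color = "blue", outlier.shape = NA) +
  geom_point(
    color = "red",
    # Spread points horizontally only (height = 0 leaves hwy untouched)
    # and fix the seed so the layout is the same on every render
    position = position_jitter(width = 0.2, height = 0, seed = 123)
  )
print(p_jitter)
```

Hiding the box plot outliers with `outlier.shape = NA` avoids drawing those observations twice, since the point layer already shows every data point.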

Further, to give each box its own color according to the **drv** variable, map `color` to `drv` inside the `aes()` of `geom_boxplot()`, e.g.,

`ggplot(mpg, aes(x = drv, y = hwy)) + geom_boxplot(aes(color = drv), width = 0.3)`

These vertical box plots can also be drawn horizontally by adding `coord_flip()` after `geom_boxplot()`:

`ggplot(mpg, aes(x = drv, y = hwy)) + geom_boxplot(aes(color = drv), width = 0.3) + coord_flip()`
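As an alternative sketch, assuming ggplot2 3.3.0 or later (where geoms infer their orientation from the mapped aesthetics), a horizontal box plot can also be obtained by swapping the `x` and `y` mappings instead of flipping the coordinates. The name `p_horizontal` is illustrative.

```r
library(ggplot2)

# Continuous variable on x, categorical variable on y:
# geom_boxplot() then draws horizontal boxes without coord_flip()
p_horizontal <- ggplot(mpg, aes(x = hwy, y = drv)) +
  geom_boxplot(aes(color = drv), width = 0.3)
print(p_horizontal)
```

With this approach the axis names in later layers and in `theme()` refer to the axes as they appear on screen, which can be easier to reason about than a flipped coordinate system.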

We can also apply a different theme, such as `theme_light()`, to the box plot:

`ggplot(mpg, aes(x = drv, y = hwy)) + geom_boxplot(aes(color = drv), width = 0.3) + theme_light()`

The `ggplot2` package provides several other built-in themes, such as `theme_bw()`, `theme_minimal()` and `theme_classic()`.

## Grouped box plot

Sometimes we want to split each group further into subgroups. In the examples above, the boxes were grouped by *drivetrain*. Here, we look at the distribution of *highway miles per gallon* within each drivetrain by *type of car*, given by the **class** variable of the `mpg` dataset. A grouped box plot can be drawn by mapping `fill` to the subgroup variable inside `aes()` in the `ggplot()` call:

`ggplot(mpg, aes(x = drv, y = hwy, fill = class)) + geom_boxplot(width = 0.3, color = "blue")`
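Not every drivetrain occurs with every class of car, so some groups contain fewer boxes than others and the remaining boxes are stretched to fill the space. The sketch below first tabulates the combinations and then, as one possible remedy, uses `position_dodge2(preserve = "single")` to keep all boxes the same width; the names `counts` and `p_grouped` are illustrative.

```r
library(ggplot2)

# Number of cars in each drivetrain/class combination;
# combinations with a count of zero produce no box
counts <- table(drv = mpg$drv, class = mpg$class)
print(counts)

# Keep a constant box width even when some subgroups are missing
p_grouped <- ggplot(mpg, aes(x = drv, y = hwy, fill = class)) +
  geom_boxplot(position = position_dodge2(preserve = "single"))
print(p_grouped)
```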

## Notched box plot

Notches are used to compare groups; if the notches of two boxes do not overlap, this suggests that the medians are significantly different. A notched box plot can be drawn by setting the `notch` argument of the `geom_boxplot()` function:

`ggplot(mpg, aes(x = drv, y = hwy)) + geom_boxplot(aes(color = drv), width = 0.3, notch = TRUE)`
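The notch limits can also be computed by hand to check whether two groups overlap. The following sketch assumes the rule given in the ggplot2 documentation, in which a notch extends 1.58 times the IQR divided by the square root of the group size on either side of the median; the helper names are illustrative.

```r
library(ggplot2)  # only needed for the mpg dataset

# Highway mileage split by drivetrain
hwy_by_drv <- split(mpg$hwy, mpg$drv)

# Lower and upper notch limits around the median of each group
notches <- t(sapply(hwy_by_drv, function(v) {
  half_width <- 1.58 * IQR(v) / sqrt(length(v))
  c(lower = median(v) - half_width,
    median = median(v),
    upper = median(v) + half_width)
}))
print(notches)
```

Two groups whose `lower` to `upper` intervals do not overlap are the ones whose notches appear separated in the plot.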

## Box plot with continuous variable

A box plot can also be drawn against a continuous variable by binning it with the `cut_width()` function. The **displ** variable is continuous and gives the *engine displacement, in litres*. First, we draw the box plot against this variable without `cut_width()`; because no grouping is given, all observations are summarized in a single box.

`ggplot(mpg, aes(displ, hwy)) + geom_boxplot(color = "blue")`

Now, we will use `cut_width()` to define the groups in `geom_boxplot()`:

`ggplot(mpg, aes(displ, hwy)) + geom_boxplot(aes(group = cut_width(displ, 0.25)), color = "blue")`

To reveal the pattern of the distribution along the continuous variable at a coarser or finer scale, we can use different bin widths in the `cut_width()` function.
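It can help to look at the bins that `cut_width()` creates before plotting. The sketch below tabulates the bins for two widths and then draws the box plot with the wider bins; the width of 1 litre and the variable names are arbitrary choices for illustration.

```r
library(ggplot2)

# cut_width() returns a factor with one level per bin
narrow_bins <- cut_width(mpg$displ, width = 0.25)
wide_bins <- cut_width(mpg$displ, width = 1)

# Number of cars per bin; sparse bins give unreliable boxes
print(table(narrow_bins))
print(table(wide_bins))

# Fewer, wider bins give fewer boxes, each based on more observations
p_wide <- ggplot(mpg, aes(displ, hwy)) +
  geom_boxplot(aes(group = cut_width(displ, 1)), color = "blue")
print(p_wide)
```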

## Box plot with computation

Sometimes the summary statistics are computed beforehand, for example in Monte Carlo or MCMC studies, and we want to draw the box plot directly from them. To illustrate this, we simulate some data, compute the five statistics ourselves, and then draw the box plot from these values.

```
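library(tibble)

# Simulate 1000 draws from a standard normal distribution
set.seed(34765)
y <- rnorm(1000)

# Precompute the five statistics that define the box
data <- tibble(
  x = 0.5,
  y0 = min(y),
  y25 = quantile(y, 0.25),
  y50 = median(y),
  y75 = quantile(y, 0.75),
  y100 = max(y)
)
print(data)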
```

```
## # A tibble: 1 x 6
## x y0 y25 y50 y75 y100
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.5 -2.89 -0.718 -0.00407 0.670 3.40
```

Now, the box plot can be drawn from these values by passing `stat = "identity"` to the `geom_boxplot()` function:

`ggplot(data, aes(x = x)) + geom_boxplot(aes(ymin = y0, lower = y25, middle = y50, upper = y75, ymax = y100), stat = "identity", color = "blue")`
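The same idea extends to several groups. As a sketch, the five statistics can be precomputed for each drivetrain of the `mpg` dataset and passed to `geom_boxplot()` with `stat = "identity"`. The column names mirror those used above, and the whiskers here run to the minimum and maximum, so no outliers are drawn.

```r
library(ggplot2)

# Highway mileage split by drivetrain
groups <- split(mpg$hwy, mpg$drv)

# One row of precomputed statistics per drivetrain
stats <- do.call(rbind, lapply(names(groups), function(g) {
  v <- groups[[g]]
  data.frame(
    drv = g,
    y0 = min(v),
    y25 = unname(quantile(v, 0.25)),
    y50 = median(v),
    y75 = unname(quantile(v, 0.75)),
    y100 = max(v)
  )
}))
print(stats)

# Draw the boxes directly from the precomputed values
p_identity <- ggplot(stats, aes(x = drv)) +
  geom_boxplot(
    aes(ymin = y0, lower = y25, middle = y50, upper = y75, ymax = y100),
    stat = "identity", color = "blue"
  )
print(p_identity)
```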

In this way, we can make different types of box plots with the `ggplot2` package in R.