Histogram and frequency polygon with ggplot2

Introduction

Histograms and frequency polygons are useful graphics for getting an idea of the distribution of a continuous variable. Both divide the \(x\)-axis into bins by splitting the entire range of values into non-overlapping intervals, and then count the number of observations in each bin.
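The binning step described above can be sketched by hand in base R. This is only an illustration: the break points (4 to 8 in steps of 0.5) are an arbitrary choice made for this example, not the ones ggplot2 would pick.

```r
# Bin Sepal.Length by hand: split the range into equal-width,
# non-overlapping intervals and count the observations in each.
x <- iris$Sepal.Length

# Assumed break points for illustration; they cover the full range 4.3 to 7.9.
breaks <- seq(4, 8, by = 0.5)

# cut() assigns every observation to exactly one interval.
bins <- cut(x, breaks = breaks, include.lowest = TRUE)

# table() counts the observations per interval; these are the bar heights.
counts <- table(bins)
print(counts)
```

Every observation falls into exactly one interval, so the counts add up to the number of rows in the data.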

Histogram

To draw histograms, we will use the iris dataset from the datasets package, which ships with base R. Its first six rows are shown below:

knitr::kable(head(iris))
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
5.1 3.5 1.4 0.2 setosa
4.9 3.0 1.4 0.2 setosa
4.7 3.2 1.3 0.2 setosa
4.6 3.1 1.5 0.2 setosa
5.0 3.6 1.4 0.2 setosa
5.4 3.9 1.7 0.4 setosa

This dataset gives the measurements in centimeters of sepal length, sepal width, petal length and petal width for 50 flowers from each of 3 species of iris. It contains 5 variables and 150 observations. The Sepal.Length variable is continuous, and its primary descriptive statistics are:

summary(iris$Sepal.Length, digits = 3)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.30    5.10    5.80    5.84    6.40    7.90

A histogram can be plotted by adding geom_histogram() to a ggplot() call from the ggplot2 package. geom_histogram() uses stat_bin() by default for continuous data. The histogram of the Sepal.Length variable from the iris dataset can be plotted as:

library(ggplot2)
ggplot(iris, aes(x = Sepal.Length)) + geom_histogram()

For a histogram, we map a single continuous variable in the ggplot() call.

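By default, stat_bin() uses 30 bins, and ggplot2 prints a message suggesting that a better value be picked. The number of bins controls how much detail the histogram shows: too few bins hide features of the distribution, while too many make the plot noisy. Let us experiment with bins = 15 in the previous graph, adding the fill, color and size arguments for better presentation: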

ggplot(iris, aes(x = Sepal.Length)) + geom_histogram(bins = 15, fill = "blue", color = "white", size = 0.2)

Now, let us plot histogram with bins = 60:

ggplot(iris, aes(x = Sepal.Length)) + geom_histogram(bins = 60, fill = "blue", color = "white", size = 0.2)

We can also set the width of each bin directly with binwidth = # instead of fixing the number of bins with bins = #:

ggplot(iris, aes(x = Sepal.Length)) + geom_histogram(binwidth = 0.1, fill = "blue", color = "white", size = 0.2)

In this way, we can vary either bins = # or binwidth = # until the shape of the distribution is clearly visible.
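To see what a given bins or binwidth setting actually produces, the computed bins can be pulled out of a plot. The sketch below uses layer_data() from ggplot2; the object names p15 and bins15 are made up for this example.

```r
library(ggplot2)

# Build the histogram without drawing it.
p15 <- ggplot(iris, aes(x = Sepal.Length)) + geom_histogram(bins = 15)

# layer_data() returns the data frame computed by stat_bin() for the first layer.
bins15 <- layer_data(p15)

# Bin edges and the number of observations in each bin.
print(bins15[, c("xmin", "xmax", "count")])

# Width of the bins implied by bins = 15.
print(unique(round(bins15$xmax - bins15$xmin, 4)))

# The counts always add up to the number of observations.
print(sum(bins15$count))
```

Repeating this with binwidth = 0.1 in place of bins = 15 shows how the two arguments describe the same binning from different sides.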

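A histogram on the density scale can be plotted by mapping y to the computed variable density, written after_stat(density) in ggplot2 3.3.0 and later (older code uses the equivalent ..density.. notation):

ggplot(iris, aes(x = Sepal.Length, y = after_stat(density))) + geom_histogram(bins = 15, fill = "blue", color = "white", size = 0.2)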
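On the density scale, each bar height is the bin count divided by the number of observations times the bin width, so the bar areas add up to 1. A minimal check of this in base R, using hist() with assumed break points rather than ggplot2's own binning:

```r
x <- iris$Sepal.Length

# Compute (but do not draw) a histogram with assumed, equal-width breaks.
h <- hist(x, breaks = seq(4, 8, by = 0.5), plot = FALSE)

# Density by hand: count / (n * bin width).
dens_manual <- h$counts / (length(x) * diff(h$breaks))

# Matches the density reported by hist().
print(all.equal(dens_manual, h$density))

# Total area of the bars: density times width, summed over bins.
total_area <- sum(h$density * diff(h$breaks))
print(total_area)
```

Because the total area is 1, a density-scale histogram can be compared directly with a probability density curve.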

Histogram with normal density curve

A histogram with a normal density curve can be created by adding the normal curve as another layer on top of the density-scale histogram with geom_line():

library(tibble)
x <- seq(4.3, 7.9, length.out=150)
data <- with(iris, tibble(x = x, y = dnorm(x, mean(Sepal.Length), sd(Sepal.Length))))
ggplot(iris, aes(x = Sepal.Length, y = after_stat(density))) + geom_histogram(bins = 15, fill = "blue", color = "white", size = 0.2) + geom_line(data = data, aes(x = x, y = y), color = "red", size = 1)

The plot above shows that the Sepal.Length variable of the iris dataset is slightly right-skewed.
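The visual impression of skewness can be cross-checked numerically. The sketch below computes the moment-based sample skewness in base R; this particular formula is one common choice made for this example, and other definitions differ slightly by a sample-size correction.

```r
x <- iris$Sepal.Length
m <- mean(x)

# Moment-based skewness: third central moment divided by the
# second central moment raised to the power 3/2.
skew <- mean((x - m)^3) / mean((x - m)^2)^1.5
print(skew)

# A mean above the median is another rough sign of right skew.
print(mean(x) > median(x))
```

A positive value indicates a longer right tail, and a value close to zero indicates that the skew is mild.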

Frequency polygon

The frequency polygon joins points with a line, with the bin midpoints on the \(x\)-axis and the corresponding frequencies on the \(y\)-axis. It is sometimes preferred over the histogram, particularly when several distributions are compared in one plot.
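The construction can be sketched in base R: take the bin midpoints and counts from a histogram and join them with a line. The break points and the data frame name poly are assumptions for this example, and the extra zero-count point at each end is a common convention that lets the line return to the axis.

```r
x <- iris$Sepal.Length

# Bin the data with assumed, equal-width breaks (no plot yet).
h <- hist(x, breaks = seq(4, 8, by = 0.5), plot = FALSE)

# One point per bin: midpoint on the x-axis, frequency on the y-axis.
poly <- data.frame(mid = h$mids, count = h$counts)

# Pad with an empty bin on each side so the polygon starts and ends at zero.
width <- 0.5
poly <- rbind(
  data.frame(mid = min(h$mids) - width, count = 0),
  poly,
  data.frame(mid = max(h$mids) + width, count = 0)
)

# Join the points with straight line segments.
plot(poly$mid, poly$count, type = "l",
     xlab = "Sepal.Length", ylab = "count")
```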

A frequency polygon can be plotted by adding geom_freqpoly() to a ggplot() call from the ggplot2 package. geom_freqpoly() also uses stat_bin() by default for continuous data. The frequency polygon of the Sepal.Length variable from the iris dataset can be plotted as:

ggplot(iris, aes(x = Sepal.Length)) + geom_freqpoly(color = "blue")

For a frequency polygon, we also map a single continuous variable in the ggplot() call.

As with the histogram, we can experiment with bins = 15 in the previous graph:

ggplot(iris, aes(x = Sepal.Length)) + geom_freqpoly(bins = 15, color = "blue", size = 0.8)

Here too, instead of bins = #, we can use binwidth = # to set the bin width directly:

ggplot(iris, aes(x = Sepal.Length)) + geom_freqpoly(binwidth = 0.1, color = "blue", size = 0.8)

In this way, we can vary bins = # or binwidth = # until the shape of the distribution is clearly visible.

Comparison of distribution

If we want to compare the distribution of a continuous variable across the levels of a factor variable, we can map that factor to the color aesthetic inside aes(), as shown below:

ggplot(iris, aes(x = Sepal.Length, color = Species)) + geom_freqpoly(binwidth = 1.2, size = 0.8)
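The numbers behind such a comparison can be tabulated in base R as counts per bin and species. This is a rough sketch: the break points below are assumed for illustration and will not coincide with the bins ggplot2 places for binwidth = 1.2.

```r
x <- iris$Sepal.Length

# Assumed break points covering the full range of Sepal.Length.
bins <- cut(x, breaks = seq(4, 8, by = 1), include.lowest = TRUE)

# Cross-tabulate bins against species: one column per polygon.
tab <- table(bins, iris$Species)
print(tab)

# Each species contributes the same number of flowers.
print(colSums(tab))
```

Each column of the table corresponds to one colored line in the plot, so differences between species show up as columns whose counts peak in different rows.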

In this way, we can make different types of histograms and frequency polygons with the help of the ggplot2 package in R.
