Introduction
This is the sixth post in the series Data Visualization With R. In the previous post, we learnt how to build bar plots. In this post, we will learn to
- create univariate/multivariate box plots
- interpret box plots
- create horizontal box plots
- detect outliers
- modify box color
- use formula to compare distributions of different variables
- use notches to compare medians
Libraries, Code & Data
All the data sets used in this post can be found here and code can be downloaded from here.
Box Plot
The box plot is a standardized way of displaying the distribution of data based on the five number summary: minimum, first quartile, median, third quartile, and maximum. Box plots are useful for detecting outliers and for comparing distributions. It shows the shape, central tendancy and variability of the data.
Structure
A boxplot splits the data set into quartiles. The body of the boxplot consists of a “box” (hence, the name), which goes from the first quartile (Q1) to the third quartile (Q3). Within the box, a vertical line is drawn at the Q2, the median of the data set. Two horizontal lines, called whiskers, extend from the front and back of the box. The front whisker goes from Q1 to the smallest non-outlier in the data set, and the back whisker goes from Q3 to the largest non-outlier. If the data set includes one or more outliers, they are plotted separately as points on the chart.
Univariate Box Plot
Basic Plot
Let us begin by creating a basic box plot. We will use the boxplot()
function and specify the data.
boxplot(mtcars$mpg)
Horizontal Box Plot
Use the horizontal
argument in the boxplot()
function to create a horizontal box plot.
boxplot(mtcars$mpg, horizontal = TRUE)
Color
Let us add some color to the boxplot. Use the col
argument to specify a color for the plot.
boxplot(mtcars$mpg, col = 'blue')
Border Color
We can specify a separate color for the border of the box in the boxplot. To modify the border color, use the border
argument.
boxplot(mtcars$mpg, border = 'red')
Range
The range
argument determines how far the plot whiskers extend out from the box. If range is positive, the whiskers extend to the most extreme data point which is no more than range times the interquartile range from the box. A value of zero causes the whiskers to extend to the data extremes.
Let us set the value of range
to 0 and observe the plot.
boxplot(mtcars$mpg, range = 0)
In the below plot, we set the value of range
to 1.
boxplot(mtcars$mpg, range = 1)
Let us observe how the plot appears as we change the value of range
from 0 to 1.
Outline
The outliers in the plot are not drawn if the outline
argument is set to FALSE
. The default value is TRUE
.
boxplot(mtcars$mpg, range = 1, outline = FALSE)
The below plot displays how the plot changes with the values set for outline
:
Varwidth
If the varwidth
argument is set to TRUE
, the boxes are drawn with widths proportional to the square-roots of the number of observations in the groups.
Bivariate/Multivariate Box Plot
As we said in the introduction, box plots can be used to compare distributions of several variables. Let us use the mtcars
data set and compare the distribution of Miles Per Gallon (mpg
) for automobiles with different number of cylinders (cyl
). We will do this by specifying a formula as shown in the below example.
boxplot(mtcars$mpg ~ mtcars$cyl)
We use the formula when we are comparing the distribution of a continuous variable across different levels of a categorical variable. If we want to compare the distributions without using a categorical variable, we need to specify the variable separately in the boxplot()
function. Below is an illustration of this method. We will split the mpg
data using the split()
function and plot them separately. The split()
function splits a continuous variable based on the levels of a categorical variable.
mpg_split <- split(mtcars$mpg, mtcars$cyl)
mpg_split
## $`4`
## [1] 22.8 24.4 22.8 32.4 30.4 33.9 21.5 27.3 26.0 30.4 21.4
##
## $`6`
## [1] 21.0 21.0 21.4 18.1 19.2 17.8 19.7
##
## $`8`
## [1] 18.7 14.3 16.4 17.3 15.2 10.4 10.4 14.7 15.5 15.2 13.3 19.2 15.8 15.0
mpg_4 <- mpg_split$`4`
mpg_6 <- mpg_split$`6`
mpg_8 <- mpg_split$`8`
boxplot(mpg_4, mpg_6, mpg_8)
The same plot can be created in two ways. If you are comparing the distribution of a continuous variable for the different levels of a categorical variable, use the formula. If you are comparing distribution of independent variables, specify all the variablels in the boxplot()
function.
Color
Let us add some color to the plot. We can specify as many colors as the boxes or we can specify the same color for all of them. Below are two examples where we specify the same color in the first one and different colors in the second one.
Single Color
boxplot(mtcars$mpg ~ mtcars$cyl, col = 'blue')
Different Colors
boxplot(mtcars$mpg ~ mtcars$cyl,
col = c('red', 'blue', 'yellow'))
Compare Medians
If we want to test whether the medians of the different groups differ, we can use the notch
argument and set it to TRUE
. A notch is drawn in each side of the boxes and if the notches of the plots do not overlap, it is strong evidence that the medians differ.
We will use a different data set for this example. Download the hsb2
data from UCLA website and compare the distribution of reading score (read
) for males and females (female
).
hsb <- read.table('https://stats.idre.ucla.edu/wp-content/uploads/2016/02/hsb2-2.csv', header=T, sep=",")
boxplot(hsb$read ~ hsb$female, notch = TRUE,
col = c('red', 'blue'))
Since the notches overlap, there is no strong evidence that the medians differ.
Putting it all together
We will end this post by adding a title and axis labels to the box plot.
boxplot(mtcars$mpg ~ mtcars$cyl, range = 1, outline = TRUE,
horizontal = TRUE, col = c('red', 'blue', 'yellow'),
main = 'Miles Per Gallon by Cylinders',
ylab = 'Number of Cylinders', xlab = 'Miles Per Gallon',
names = c('Four', 'Six', 'Eight'))
Summary
In this post, we learnt how to create box plots. Specifically, we learnt to:
- create univariate/multivariate box plots
- interpret box plots
- create horizontal box plots
- detect outliers
- modify box color
- use formula to compare distributions of different variables
- use notches to compare medians