ggplot2: Histogram

Introduction

This is the eleventh post in the series Elegant Data Visualization with ggplot2. In the previous post, we learnt to build box plots. In this post, we will learn to

  • build a histogram
  • specify bins
  • modify
    • color
    • fill
    • alpha
    • bin width
    • line type
    • line size
  • map aesthetics to variables

A histogram is a plot that can be used to examine the shape and spread of continuous data. It looks very similar to a bar graph and can be used to detect outliers and skewness in data. The histogram graphically shows the following:

  • center (location) of the data
  • spread (dispersion) of the data
  • skewness
  • outliers
  • presence of multiple modes

To construct a histogram, the data is split into intervals called bins. The intervals may or may not be of equal size. For each bin, the number of data points that fall into it is counted (the frequency). The Y axis of the histogram represents the frequency and the X axis represents the variable.
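
The binning and counting described above can be sketched in base R, without any plotting. This is only an illustration: the values and the bin edges below are made up and are not taken from the data used in this post.

```r
# Hypothetical sample: number of visits for 12 visitors (made-up values).
n_visit <- c(0, 1, 1, 2, 3, 3, 3, 5, 6, 8, 9, 10)

# Bin edges: equal-width intervals of width 2 covering 0 to 10.
breaks <- seq(0, 10, by = 2)

# cut() assigns each value to an interval; table() counts the values per interval.
# include.lowest = TRUE keeps the value 0 in the first bin, [0, 2].
bins <- cut(n_visit, breaks = breaks, include.lowest = TRUE)
freq <- table(bins)
print(freq)
```

Each entry of freq is the height of one bar in the corresponding histogram, and the counts add up to the number of observations.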


Libraries, Code & Data

We will use the following libraries in this post:
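
Going by the functions called in the rest of this post (ggplot(), geom_histogram() and readr::read_csv()), loading the two packages below should be enough to run the examples. Treat this as an assumption and load anything else your own session needs.

```r
# Assumed from the functions used in this post.
library(ggplot2)  # ggplot(), geom_histogram(), aes()
library(readr)    # read_csv()
```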

All the data sets used in this post can be found here and code can be downloaded from here.


Data

ecom <- readr::read_csv('https://raw.githubusercontent.com/rsquaredacademy/datasets/master/web.csv')
ecom
## # A tibble: 1,000 x 11
##       id referrer device bouncers n_visit n_pages duration country purchase
##    <dbl> <chr>    <chr>  <lgl>      <dbl>   <dbl>    <dbl> <chr>   <lgl>   
##  1     1 google   laptop TRUE          10       1      693 Czech ~ FALSE   
##  2     2 yahoo    tablet TRUE           9       1      459 Yemen   FALSE   
##  3     3 direct   laptop TRUE           0       1      996 Brazil  FALSE   
##  4     4 bing     tablet FALSE          3      18      468 China   TRUE    
##  5     5 yahoo    mobile TRUE           9       1      955 Poland  FALSE   
##  6     6 yahoo    laptop FALSE          5       5      135 South ~ FALSE   
##  7     7 yahoo    mobile TRUE          10       1       75 Bangla~ FALSE   
##  8     8 direct   mobile TRUE          10       1      908 Indone~ FALSE   
##  9     9 bing     mobile FALSE          3      19      209 Nether~ FALSE   
## 10    10 google   mobile TRUE           6       1      208 Czech ~ FALSE   
## # ... with 990 more rows, and 2 more variables: order_items <dbl>,
## #   order_value <dbl>


Data Dictionary

  • id: row id
  • referrer: referrer website/search engine
  • device: device used to visit the website
  • bouncers: whether the visit bounced off the website
  • n_visit: number of visits
  • n_pages: number of pages visited
  • duration: time spent on the website (in seconds)
  • country: country of origin
  • purchase: whether visitor purchased
  • order_items: number of items ordered
  • order_value: order value of visitor (in dollars)


Histogram

To create a histogram, we will use geom_histogram() and specify the variable name within aes(). In the below example, we create a histogram of the variable n_visit.

ggplot(ecom) +
  geom_histogram(aes(n_visit))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.


Specify Bins

The default number of bins in ggplot2 is 30. You can modify the number of bins using the bins argument. In the below example, we create a histogram with 7 bins.

ggplot(ecom) +
  geom_histogram(aes(n_visit), bins = 7)
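
If you want to check what the bins argument produced, one option is to inspect the data that ggplot2 computes for the layer. The sketch below uses a small made-up data frame named visits (a hypothetical stand-in, not the ecom data) so that it runs without downloading anything.

```r
library(ggplot2)

# Hypothetical stand-in for the n_visit column (made-up values).
visits <- data.frame(n_visit = c(0, 1, 1, 2, 3, 3, 4, 5, 5, 6, 7, 8, 9, 10, 10))

p <- ggplot(visits) +
  geom_histogram(aes(n_visit), bins = 7)

# layer_data() returns the values computed by stat_bin(): one row per bin,
# with the bin limits (xmin, xmax) and the frequency (count).
bin_table <- layer_data(p)
bin_table[, c("xmin", "xmax", "count")]
```

The table has one row per bin, so its number of rows should match the value passed to bins.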


Aesthetics

Now that we know how to create a histogram, let us learn to modify its appearance. We will begin with the color of the bars. Use the fill argument to modify the fill (background) color of the bars. In the below example, we change the fill color to ‘blue’.

ggplot(ecom) +
  geom_histogram(aes(n_visit), bins = 7, fill = 'blue')


As we have learnt before, the transparency of the fill color can be modified using the alpha argument. It takes a value between 0 (fully transparent) and 1 (fully opaque).

ggplot(ecom) +
  geom_histogram(aes(n_visit), bins = 7, fill = 'blue', alpha = 0.3)


The color of the histogram border can be modified using the color argument. The color can be specified either using its name or the associated hex code.

ggplot(ecom) +
  geom_histogram(aes(n_visit), bins = 7, fill = 'white', color = 'blue')
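
The same plot can be written with hex codes instead of color names. The sketch below uses '#FFFFFF' for white and '#0000FF' for blue, again on a made-up data frame (visits is a hypothetical stand-in for the ecom data).

```r
library(ggplot2)

# Hypothetical stand-in for the n_visit column (made-up values).
visits <- data.frame(n_visit = c(0, 1, 1, 2, 3, 3, 4, 5, 5, 6, 7, 8, 9, 10, 10))

# rgb() builds a hex code from red, green and blue components.
blue_hex <- rgb(0, 0, 255, maxColorValue = 255)

p_hex <- ggplot(visits) +
  geom_histogram(aes(n_visit), bins = 7, fill = '#FFFFFF', color = blue_hex)
p_hex
```

Hex codes are handy when you need a shade that has no name in R.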


Putting it all together…

Let us modify the bins, the background and border color of the histogram in the below example.

ggplot(ecom) +
  geom_histogram(aes(n_visit), bins = 7, fill = 'blue', color = 'white')


Bin Width

Another way to control the number of bins in a histogram is the binwidth argument, which sets the width of each bin instead of the number of bins. As you can see, in the below example, we do not use the bins argument when using the binwidth argument. Supply one or the other: if both are given, binwidth overrides bins.

ggplot(ecom) +
  geom_histogram(aes(n_visit), binwidth = 2, fill = 'blue', color = 'black')
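
With binwidth, the number of bins is no longer set directly. It follows from the range of the data divided by the bin width, give or take one bin depending on where the bin edges fall. A quick way to see this is to look at the computed bin limits; the data frame below is made up for illustration and is not the ecom data.

```r
library(ggplot2)

# Hypothetical stand-in for the n_visit column (made-up values).
visits <- data.frame(n_visit = c(0, 1, 1, 2, 3, 3, 4, 5, 5, 6, 7, 8, 9, 10, 10))

p <- ggplot(visits) +
  geom_histogram(aes(n_visit), binwidth = 2, fill = 'blue', color = 'black')

# Every bin should span exactly binwidth units on the X axis.
bin_table <- layer_data(p)
widths <- bin_table$xmax - bin_table$xmin
widths
nrow(bin_table)  # number of bins implied by binwidth = 2
```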


Line Type

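The line type of the histogram border can be modified using the linetype argument. It can take an integer value between 0 and 6 (0 = blank, 1 = solid, 2 = dashed, 3 = dotted, 4 = dotdash, 5 = longdash, 6 = twodash) or the corresponding name.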

ggplot(ecom) +
  geom_histogram(aes(n_visit), bins = 5, fill = 'white', 
    color = 'blue', linetype = 3)
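
Line types can also be given by name, which is often easier to read than the integer code. In the sketch below, linetype = 'dotted' is the named equivalent of linetype = 3; the visits data frame is a made-up stand-in for the ecom data.

```r
library(ggplot2)

# Hypothetical stand-in for the n_visit column (made-up values).
visits <- data.frame(n_visit = c(0, 1, 1, 2, 3, 3, 4, 5, 5, 6, 7, 8, 9, 10, 10))

# Integer code, as in the example above.
p_num <- ggplot(visits) +
  geom_histogram(aes(n_visit), bins = 5, fill = 'white',
    color = 'blue', linetype = 3)

# Same border style, written with the line type name.
p_name <- ggplot(visits) +
  geom_histogram(aes(n_visit), bins = 5, fill = 'white',
    color = 'blue', linetype = 'dotted')
p_name
```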


Line Size

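Use the size argument to modify the width of the border of the histogram bins. It can take any value greater than 0. From ggplot2 3.4.0 onwards, the same setting is called linewidth, and using size for lines gives a deprecation warning.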

ggplot(ecom) +
  geom_histogram(aes(n_visit), bins = 5, fill = 'white', 
    color = 'blue', size = 1.25)


Map Variables

You can map the aesthetics to variables as well. In the below example, we map fill to the device variable. You can try mapping color, linetype and size to variables as well.

ggplot(ecom) +
  geom_histogram(aes(n_visit, fill = device), bins = 7)
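
As a sketch of mapping more than one aesthetic, the example below maps both fill and linetype to device. The data frame is made up for illustration (visits and its values are hypothetical, not the ecom data). By default, geom_histogram() stacks the bars of the different groups on top of each other.

```r
library(ggplot2)

# Hypothetical stand-in for the ecom data: visits by device (made-up values).
visits <- data.frame(
  n_visit = c(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 2, 5, 8, 10),
  device  = rep(c("laptop", "mobile", "tablet"), times = 5)
)

# fill and linetype are inside aes(), so they vary with device;
# color is outside aes(), so every border is black.
p <- ggplot(visits) +
  geom_histogram(aes(n_visit, fill = device, linetype = device),
    bins = 7, color = 'black')
p

# One group per device in the computed layer data.
length(unique(layer_data(p)$group))
```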


Summary

In this post, we learnt to:

  • build a histogram
  • specify bins
  • modify
    • color
    • fill
    • alpha
    • bin width
    • line type
    • line size
  • map aesthetics to variables


Up Next

In the next post, we will learn to modify the axes of a plot.