Introduction
This is the third post in the series Elegant Data Visualization with ggplot2.
In the previous post, we learnt how to create plots using the qplot() function.
In this post, we will create some of the most routinely used plots to explore
data using the geom_* functions.
Libraries, Code & Data
We will use the following libraries in this post:
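The calls used later in this post (ggplot(), readr::read_csv() and tibble::tibble()) suggest the setup sketched below. Treat it as an assumption based on those calls rather than the post's exact list.

```r
# ggplot2 is attached because ggplot() and the geom_*() functions are
# called without a namespace prefix throughout the post.
library(ggplot2)

# readr and tibble are only ever called as readr::read_csv() and
# tibble::tibble(), so they need to be installed but not attached.
```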
All the data sets used in this post can be found here and code can be downloaded from here.
Data
ecom <- readr::read_csv('https://raw.githubusercontent.com/rsquaredacademy/datasets/master/web.csv')
ecom
## # A tibble: 1,000 x 11
## id referrer device bouncers n_visit n_pages duration country purchase
## <dbl> <chr> <chr> <lgl> <dbl> <dbl> <dbl> <chr> <lgl>
## 1 1 google laptop TRUE 10 1 693 Czech ~ FALSE
## 2 2 yahoo tablet TRUE 9 1 459 Yemen FALSE
## 3 3 direct laptop TRUE 0 1 996 Brazil FALSE
## 4 4 bing tablet FALSE 3 18 468 China TRUE
## 5 5 yahoo mobile TRUE 9 1 955 Poland FALSE
## 6 6 yahoo laptop FALSE 5 5 135 South ~ FALSE
## 7 7 yahoo mobile TRUE 10 1 75 Bangla~ FALSE
## 8 8 direct mobile TRUE 10 1 908 Indone~ FALSE
## 9 9 bing mobile FALSE 3 19 209 Nether~ FALSE
## 10 10 google mobile TRUE 6 1 208 Czech ~ FALSE
## # ... with 990 more rows, and 2 more variables: order_items <dbl>,
## # order_value <dbl>
Data Dictionary
- id: row id
- referrer: referrer website/search engine
- device: device used to visit the website
- bouncers: whether the visit bounced (exited from the landing page)
- n_visit: number of visits
- n_pages: number of pages visited
- duration: time spent on the website (in seconds)
- country: country of origin
- purchase: whether visitor purchased
- order_items: number of items ordered
- order_value: order value of visitor (in dollars)
Scatter Plot
A scatter plot displays the relationship between two continuous variables. In
ggplot2, we can build a scatter plot using geom_point(). Scatter plots can
show you visually
- the strength of the relationship between the variables
- the direction of the relationship between the variables
- whether outliers exist
Point
The variables representing the X and Y axes can be specified either in ggplot()
or in geom_point(). We will learn to modify the appearance of the points in a
different post.
ggplot(ecom, aes(x = n_pages, y = duration)) +
  geom_point()
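As a minimal sketch of the two places where the mapping can be declared, the example below builds the same scatter plot both ways. It uses the built-in mtcars data (an assumption made so the code runs without downloading ecom), and the names p_global and p_local are illustrative.

```r
library(ggplot2)

# Mapping declared once in ggplot(): inherited by every layer added later.
p_global <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# Mapping declared in geom_point(): applies to that layer only.
p_local <- ggplot(mtcars) +
  geom_point(aes(x = wt, y = mpg))

# Compare the point coordinates computed for each plot.
all.equal(layer_data(p_global)$x, layer_data(p_local)$x)
all.equal(layer_data(p_global)$y, layer_data(p_local)$y)
```

Declaring the mapping in ggplot() is usually more convenient when several layers share the same variables.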
Regression Line
A regression line can be added using either:
- geom_abline()
- geom_smooth()
Using geom_abline()
If you are using geom_abline(), you need to specify the intercept and slope,
as shown in the example below:
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_abline(intercept = 37.285, slope = -5.344)
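The intercept and slope in the example above appear to match an ordinary least squares fit of mpg on wt. That is an inference from the numbers, not something the post states. The sketch below computes the coefficients with lm() and passes them to geom_abline() instead of typing them by hand; fit, coefs and p_abline are illustrative names.

```r
library(ggplot2)

# Fit a simple linear regression of mpg on wt.
fit <- lm(mpg ~ wt, data = mtcars)
coefs <- coef(fit)  # named vector with elements "(Intercept)" and "wt"

# Draw the fitted line using the estimated coefficients.
p_abline <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_abline(intercept = coefs[["(Intercept)"]], slope = coefs[["wt"]])
p_abline
```

Computing the coefficients keeps the line in sync with the data if the data set changes.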
Using geom_smooth()
If you are using geom_smooth(), you can specify the method of fitting the line,
such as lm or loess, through the method argument. If it is omitted, ggplot2
picks a method automatically. The se argument controls whether the confidence
interval is displayed.
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_smooth(method = 'lm', se = TRUE)
## `geom_smooth()` using formula 'y ~ x'
Loess Method
Here we use the 'loess' method to fit a smooth (local regression) curve and
turn off the confidence interval with se = FALSE.
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_smooth(method = 'loess', se = FALSE)
## `geom_smooth()` using formula 'y ~ x'
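To compare the two methods, one option is to draw both fits over the raw points. The sketch below does that; the colors and the name p_fits are arbitrary choices, and passing formula = y ~ x explicitly is only there to silence the message shown above.

```r
library(ggplot2)

p_fits <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  # straight line from a linear model
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, color = "blue") +
  # locally weighted smooth curve
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE, color = "red")
p_fits
```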
Horizontal/Vertical Lines
Add horizontal or vertical lines using:
- geom_hline()
- geom_vline()
Horizontal Line
To add a horizontal line, the Y axis intercept must be supplied using the
yintercept argument.
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_hline(yintercept = 30)
Vertical Line
For the vertical line, the X axis intercept must be supplied using the
xintercept argument.
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_vline(xintercept = 5)
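The intercepts do not have to be hard-coded. As a small sketch, the example below places reference lines at the mean of each variable; the dashed line type and the names mean_mpg, mean_wt and p_ref are illustrative choices.

```r
library(ggplot2)

# Compute the reference values from the data.
mean_mpg <- mean(mtcars$mpg)
mean_wt <- mean(mtcars$wt)

p_ref <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_hline(yintercept = mean_mpg, linetype = "dashed") +
  geom_vline(xintercept = mean_wt, linetype = "dashed")
p_ref
```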
Bar Plot
Bar plots present grouped data with rectangular bars. The bars may represent the frequency of the groups or values. Bar plots can be:
- horizontal
- vertical
- grouped
- stacked
- proportional
Frequency
ggplot(ecom, aes(x = factor(device))) +
  geom_bar()
Weight
If the bars should represent a continuous variable, use the weight aesthetic
within aes(). In the example below, the bars do not represent the count of
devices; instead, they represent the total order value for each device type.
ggplot(ecom, aes(x = factor(device))) +
  geom_bar(aes(weight = order_value))
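To see what the weight aesthetic does, the sketch below uses a small hypothetical data frame (orders, standing in for ecom) and compares the bar heights ggplot2 computes with the totals computed directly.

```r
library(ggplot2)

# Hypothetical orders, standing in for the ecom data.
orders <- data.frame(
  device = c("laptop", "laptop", "mobile", "mobile", "tablet"),
  order_value = c(100, 250, 40, 60, 500)
)

p_weight <- ggplot(orders, aes(x = factor(device))) +
  geom_bar(aes(weight = order_value))

# Bar heights computed by ggplot2, in the order laptop, mobile, tablet.
layer_data(p_weight)$count

# The same totals computed directly.
tapply(orders$order_value, orders$device, sum)
```

Each bar height is the sum of order_value within the device group: 350 for laptop, 100 for mobile and 500 for tablet.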
Stacked Bar Plot
To create a stacked bar plot, map the fill aesthetic to a categorical variable.
ggplot(ecom, aes(x = factor(device))) +
  geom_bar(aes(fill = purchase))
Horizontal Bar Plot
A horizontal bar plot can be created by flipping the coordinate axes using the
coord_flip() function.
ggplot(ecom, aes(x = factor(device))) +
  geom_bar(aes(fill = purchase)) +
  coord_flip()
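The list at the start of this section also mentions grouped and proportional bar plots. One way to get them, sketched below, is the position argument of geom_bar(). The data frame sessions is hypothetical and stands in for ecom.

```r
library(ggplot2)

# Hypothetical visits, standing in for the ecom data.
sessions <- data.frame(
  device = c("laptop", "laptop", "laptop", "mobile", "mobile", "tablet"),
  purchase = c(TRUE, FALSE, FALSE, TRUE, FALSE, FALSE)
)

# Grouped: bars for each purchase value are placed side by side.
p_grouped <- ggplot(sessions, aes(x = factor(device))) +
  geom_bar(aes(fill = purchase), position = "dodge")

# Proportional: each stack is rescaled so that it sums to 1.
p_prop <- ggplot(sessions, aes(x = factor(device))) +
  geom_bar(aes(fill = purchase), position = "fill")

p_grouped
p_prop
```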
Columns
If the data has already been summarized, you can use geom_col() instead of
geom_bar(). In the example below, we have the total visits for each device
type. Because the data has already been summarized, geom_bar() with its default
count statistic would only count one row per device, so we use geom_col().
device <- c('laptop', 'mobile', 'tablet')
visits <- c(30000, 12000, 5000)
traffic <- tibble::tibble(device, visits)
ggplot(traffic, aes(x = device, y = visits)) +
  geom_col(fill = 'blue')
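For comparison, geom_bar() can draw pre-summarized values too if its statistic is switched to "identity". The sketch below rebuilds the traffic data as a plain data frame and checks that both approaches give the same bar heights; p_col and p_identity are illustrative names.

```r
library(ggplot2)

traffic <- data.frame(
  device = c("laptop", "mobile", "tablet"),
  visits = c(30000, 12000, 5000)
)

# geom_col() uses the y values as bar heights.
p_col <- ggplot(traffic, aes(x = device, y = visits)) +
  geom_col()

# geom_bar() does the same once the counting statistic is replaced.
p_identity <- ggplot(traffic, aes(x = device, y = visits)) +
  geom_bar(stat = "identity")

all.equal(layer_data(p_col)$y, layer_data(p_identity)$y)
```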
Boxplot
The box plot is a standardized way of displaying the distribution of data
based on the five number summary: minimum, first quartile, median, third
quartile, and maximum. Box plots are useful for detecting outliers and for
comparing distributions. They show the shape, central tendency and variability
of the data. Use geom_boxplot() to create a box plot.
ggplot(ecom, aes(x = factor(device), y = n_pages)) +
  geom_boxplot()
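The statistics behind each box can be inspected with layer_data(). The sketch below uses mtcars (an assumption made so the code runs without downloading ecom) and compares the box edges with quartiles computed directly. Note that in geom_boxplot() the whiskers extend only to the most extreme points within 1.5 times the interquartile range of the box, so they coincide with the minimum and maximum only when there are no outliers.

```r
library(ggplot2)

p_box <- ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_boxplot()

# Statistics computed for each box (one row per cyl group: 4, 6, 8).
box_stats <- layer_data(p_box)
box_stats[, c("ymin", "lower", "middle", "upper", "ymax")]

# Quartiles for the 4-cylinder cars, computed directly.
q4 <- quantile(mtcars$mpg[mtcars$cyl == 4], c(0.25, 0.5, 0.75))
q4
```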
Histogram
A histogram is a plot that can be used to examine the shape and spread of
continuous data. It looks very similar to a bar graph and can be used to detect
outliers and skewness in data. Use geom_histogram() to create a histogram.
ggplot(ecom, aes(x = duration)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
You can control the number of bins using the bins argument.
ggplot(ecom, aes(x = duration)) +
  geom_histogram(bins = 5)
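The message printed earlier suggests binwidth as an alternative to bins. As a minimal sketch, the example below sets the width of each bin in the units of the x variable, using mtcars so that it runs without downloading ecom; the width of 5 is an arbitrary choice.

```r
library(ggplot2)

# binwidth is expressed in the units of x (here, miles per gallon).
p_hist <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(binwidth = 5)

# Inspect the bin boundaries and counts computed by ggplot2.
hist_bins <- layer_data(p_hist)
hist_bins[, c("xmin", "xmax", "count")]
```

Setting binwidth is often easier to reason about than bins when the variable has natural units, such as seconds or dollars.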
Line
Line charts are used to examine trends over time. We will use a different data set for exploring line plots.
Data
gdp <- readr::read_csv('https://raw.githubusercontent.com/rsquaredacademy/datasets/master/gdp.csv')
## Warning: Missing column names filled in: 'X1' [1]
gdp
## # A tibble: 6 x 6
## X1 X year growth india china
## <dbl> <dbl> <date> <dbl> <dbl> <dbl>
## 1 1 1 2000-01-01 6 5 8
## 2 2 2 2001-01-01 9 9 5
## 3 3 3 2002-01-01 8 8 6
## 4 4 4 2003-01-01 9 8 8
## 5 5 5 2004-01-01 9 5 9
## 6 6 6 2005-01-01 8 7 8
Use geom_line() to create a line chart. In the below plot, we chart the GDP
of India, the fastest growing economy in emerging markets, across years.
ggplot(gdp, aes(year, india)) +
  geom_line()
The color and line type can be modified using the color and linetype
arguments. We will explore the different line types in an upcoming post.
ggplot(gdp, aes(year, india)) +
  geom_line(color = 'blue', linetype = 'dashed')
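To plot both countries in one chart, one common approach is to reshape the data to long form and map color to the country. The sketch below rebuilds the india and china columns from the tibble printed above as a plain data frame; gdp_long, country and value are illustrative names.

```r
library(ggplot2)

# Values copied from the india and china columns of the gdp tibble above.
years <- as.Date(paste0(2000:2005, "-01-01"))
gdp_long <- data.frame(
  year = rep(years, times = 2),
  country = rep(c("india", "china"), each = 6),
  value = c(5, 9, 8, 8, 5, 7, 8, 5, 6, 8, 9, 8)
)

# One line per country, distinguished by color.
p_lines <- ggplot(gdp_long, aes(x = year, y = value, color = country)) +
  geom_line()
p_lines
```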
Label
You can label the points using geom_label().
ggplot(mtcars, aes(disp, mpg, label = rownames(mtcars))) +
  geom_label()
Text
geom_text() offers another way to add text to the plots. We will learn to
modify the appearance and location of the text in another post.
ggplot(mtcars, aes(disp, mpg, label = rownames(mtcars))) +
  geom_text(check_overlap = TRUE, size = 2)
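Labelling every point can clutter the plot. One option, sketched below, is to draw all the points and give the text layer its own, smaller data set. The 30 mpg cutoff, the vjust and size values, and the names cars_df and model are illustrative choices.

```r
library(ggplot2)

cars_df <- mtcars
cars_df$model <- rownames(mtcars)  # move row names into a regular column

p_labels <- ggplot(cars_df, aes(x = disp, y = mpg)) +
  geom_point() +
  # label only the cars above 30 mpg by giving this layer its own data
  geom_text(data = subset(cars_df, mpg > 30), aes(label = model),
            vjust = -0.8, size = 3)
p_labels
```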
Summary
In this post, we learnt about different geoms such as:
- geom_point()
- geom_line()
- geom_histogram()
- geom_bar()
- geom_boxplot()
- geom_abline()
- geom_text()
Up Next
In the next post, we will learn about aesthetics.