This is the second post in the series Elegant Data Visualization with
ggplot2. In the previous post, we understood the concept of grammar of
graphics and even built a bar plot step by step while exploring the different
components of a plot/chart. In this post, we will learn to quickly build a set
of plots that are routinely used to explore data using
qplot(). It can be
used to quickly create plots but also has certain limitations. Nevertheless, if
you want to quickly explore data using a single function,
qplot() is your friend.
Libraries, Code & Data
We will use the following libraries in this post:
All the data sets used in this post can be found here and code can be downloaded from here.
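As a minimal setup sketch, we assume here that ggplot2 is the only package the examples in this post need: qplot() and the economics data set come from ggplot2, while mtcars ships with base R.

```r
# Assumption: ggplot2 is already installed, e.g. via install.packages("ggplot2")
library(ggplot2)

# Peek at the columns used in the examples of this post
head(mtcars[, c("mpg", "cyl", "disp", "qsec", "am")])
head(economics[, c("date", "unemploy")])
```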
Scatter plots are used to examine the relationship between two continuous variables. The relationship can be examined across the levels of a categorical variable as well. Let us begin by creating scatter plots. The first two inputs are the variables/columns representing the X and Y axis. The next input is the name of the data set.
qplot(disp, mpg, data = mtcars)
If you want the relationship between the two variables to be represented by
both points and a line, use the
geom argument and supply it the values as a character vector.
qplot(disp, mpg, data = mtcars, geom = c('point', 'line'))
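If you are curious what the geom vector does, one way to check is to store the plot and inspect its layers. The names p and layer_geoms are ours, and peeking at p$layers relies on how ggplot2 stores plots internally, so treat this as an exploration aid rather than a stable interface.

```r
library(ggplot2)

p <- qplot(disp, mpg, data = mtcars, geom = c("point", "line"))

# Each value passed to geom becomes one layer of the plot;
# here we collect the class name of the geom used by each layer
layer_geoms <- unname(sapply(p$layers, function(layer) class(layer$geom)[1]))
print(layer_geoms)
```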
The color of the points can be mapped to a categorical variable, in our case
cyl, using the color argument. Ensure that the variable is categorical using factor().
qplot(disp, mpg, data = mtcars, color = factor(cyl))
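To see why factor() is needed, here is a small sketch that checks how cyl is stored in mtcars. The name cyl_factor is ours and is used only for illustration.

```r
# cyl is stored as a number, so it would be treated as continuous by default
class(mtcars$cyl)

# factor() turns it into a categorical variable with one level per distinct value
cyl_factor <- factor(mtcars$cyl)
levels(cyl_factor)
```

Each level of the factor gets its own color in the plot.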
The shape and size of the points can also be mapped to variables using the
shape and size arguments as shown in the examples below.
qplot(disp, mpg, data = mtcars, shape = factor(cyl))
Ensure that size is mapped to a continuous variable.
qplot(disp, mpg, data = mtcars, size = qsec)
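These arguments can also be combined in a single call. The sketch below is our own combination rather than one of the original examples: it maps color and shape to cyl and size to qsec at the same time. The name p_combined is hypothetical.

```r
library(ggplot2)

p_combined <- qplot(
  disp, mpg,
  data  = mtcars,
  color = factor(cyl),  # categorical: one color per level
  shape = factor(cyl),  # categorical: one shape per level
  size  = qsec          # continuous: point size scales with qsec
)
```

Typing p_combined at the console displays the plot.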
A bar plot represents data in rectangular bars. The length of each bar is proportional to the value it represents. Bar plots can be either horizontal or vertical. The X axis of the plot represents the levels or categories of the variable and the Y axis represents their frequency/count.
To create a bar plot, the first input must be a categorical variable. You can
convert a variable to type
factor (R equivalent of categorical) using the
factor() function. The next input is the name of the data set and the final
input is the
geom argument, which is supplied the value 'bar'.
qplot(factor(cyl), data = mtcars, geom = c('bar'))
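The height of each bar is simply the number of rows at that level. As a rough check, the sketch below compares counts from table() with the counts computed by the bar layer, which we read using layer_data() from ggplot2. The names cyl_counts, p_bar and bar_heights are ours.

```r
library(ggplot2)

# Frequency of each cyl level, computed directly from the data
cyl_counts <- table(factor(mtcars$cyl))
print(cyl_counts)

# The same counts as computed by the bar layer of the plot
p_bar <- qplot(factor(cyl), data = mtcars, geom = "bar")
bar_heights <- layer_data(p_bar)$count
print(bar_heights)
```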
You can create a stacked bar plot using the
fill argument and mapping it to
another categorical variable.
qplot(factor(cyl), data = mtcars, geom = c('bar'), fill = factor(am))
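The segments of each stacked bar correspond to a cross tabulation of the two categorical variables. A small sketch of that table, using a name of our own (stack_counts):

```r
# Rows are cyl levels (one bar each), columns are am levels (one segment each)
stack_counts <- table(cyl = mtcars$cyl, am = mtcars$am)
print(stack_counts)

# Row totals give the full height of each bar
rowSums(stack_counts)
```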
The box plot is a standardized way of displaying the distribution of data based on the five number summary: minimum, first quartile, median, third quartile, and maximum. Box plots are useful for detecting outliers and for comparing distributions. They show the shape, central tendency and variability of the data.
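To make the five number summary concrete, here is a minimal sketch that computes it for mpg using quantile() from base R. The name five_num is ours. Functions differ slightly in how they compute quartiles, so the values from quantile() and fivenum() may not match exactly.

```r
# Minimum, first quartile, median, third quartile and maximum of mpg
five_num <- quantile(mtcars$mpg, probs = c(0, 0.25, 0.5, 0.75, 1))
print(five_num)

# Interquartile range: the height of the box in a box plot
unname(five_num["75%"] - five_num["25%"])
```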
Box plots can be created by supplying the value
'boxplot' to the
geom argument. The first input must be a categorical variable and the second must be
a continuous variable.
qplot(factor(cyl), mpg, data = mtcars, geom = c('boxplot'))
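The thick line inside each box is the median of mpg for that group. As a quick numeric companion to the plot, the sketch below computes the group medians with tapply(). The name mpg_medians is ours.

```r
# Median mpg for each level of cyl, matching the middle line of each box
mpg_medians <- tapply(mtcars$mpg, factor(mtcars$cyl), median)
print(mpg_medians)
```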
Unlike plot(), we cannot create box plots using a single variable. If you are
not comparing the distribution of a variable across the levels of a categorical
variable, you must supply a constant such as
factor(1) as the first input as shown below.
qplot(factor(1), mpg, data = mtcars, geom = c('boxplot'))
Line charts are used to examine trends across time. To create a line chart,
supply the value
'line' to the
geom argument. The first two inputs should
be names of the columns/variables representing the X and Y axis, and the third
input must be the name of the data set.
qplot(x = date, y = unemploy, data = economics, geom = c('line'))
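Before drawing a line chart, it can help to confirm that the X variable really is a date and that the rows are in time order. A small sketch of such a check on the economics data:

```r
library(ggplot2)

# date should be of class Date and unemploy should be numeric
class(economics$date)
class(economics$unemploy)

# First and last observation, and whether the dates are already sorted
range(economics$date)
!is.unsorted(economics$date)
```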
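The appearance of the line can be modified using the
color argument. Wrap a literal color in I() so that qplot() sets it directly instead of mapping it to a legend, as shown below.
qplot(x = date, y = unemploy, data = economics, geom = c('line'), color = I('red'))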
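One detail worth knowing is that qplot() treats aesthetic arguments as mappings by default, so a bare 'red' is handled like a one-level variable (with a legend and a default palette color), while I('red') sets the color itself. The sketch below compares the two. The helper has_colour_mapping() and the plot names are ours, and inspecting p$mapping relies on how ggplot2 stores plots internally.

```r
library(ggplot2)

# color = "red" is treated as a mapping: "red" becomes a one-level variable
p_mapped <- qplot(date, unemploy, data = economics, geom = "line", color = "red")

# color = I("red") is treated as a constant setting for the line
p_set <- qplot(date, unemploy, data = economics, geom = "line", color = I("red"))

# Hypothetical helper: does the plot carry a colour mapping (and so a legend)?
has_colour_mapping <- function(p) {
  any(c("colour", "color") %in% names(p$mapping))
}

has_colour_mapping(p_mapped)
has_colour_mapping(p_set)
```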
A histogram is a plot that can be used to examine the shape and spread of
continuous data. It looks very similar to a bar graph and can be used to detect
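outliers and skewness in data. qplot() draws a histogram by default when it is given a single continuous variable, and the number of bins is set using the
bins argument as shown below. The first input is the name of the continuous variable and the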
second is the name of the data set.
qplot(mpg, data = mtcars, bins = 5)
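To see what the bins argument produces, the sketch below reads the computed bins back from the plot using layer_data() from ggplot2. The names p_hist and bin_counts are ours, and the exact bin boundaries may vary between ggplot2 versions.

```r
library(ggplot2)

p_hist <- qplot(mpg, data = mtcars, bins = 5)

# Lower edge, upper edge and number of observations for each bin
bin_counts <- layer_data(p_hist)[, c("xmin", "xmax", "count")]
print(bin_counts)

# Every observation falls into exactly one bin
sum(bin_counts$count)
```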
In this post, we learnt to quickly create plots using the
qplot() function. While useful, it has limitations and can be used only to quickly visualize data.
In the next post, we will create the same set of plots but using geoms.