4 min read

Data Visualization With R - Scatter Plots

Introduction

This is the third post in the series Data Visualization With R. In the previous post, we learned how to add title, subtitle and axis labels. We also learned how to modify the range of the axis. In this post, we will learn how to create scatter plots.

  • adding color to the points
  • modify shape of the points
  • modify size of the points

Libraries, Code & Data

All the data sets used in this post can be found here and code can be downloaded from here.

Basic Plot

Let us recreate the plot that we had created in the first post by using the mtcars data set. We will use the disp (displacement) and mpg (miles per gallon) variables. disp will be on the X axis and mpg will be on the Y axis.

plot(mtcars$disp, mtcars$mpg)

We have created a very basic plot and any one looking at it for the first time will get confused with the axis labels mtcars$disp and mtcars$mpg. Let us put into practice what we learnt in the second post, and add a title to the plot, and make the axis labels more meaningful.

plot(mtcars$disp, mtcars$mpg,
     main = 'Displacement vs Miles Per Gallon',
     xlab = 'Displacement', ylab = 'Miles Per Gallon')

Now the plot clearly communicates that it represents the relationship between the displacement and mileage of cars. Now the color of the points in the plot is black by default. Some of us may agree that black is beautiful but not all of us will like it. As a first step in enhancing the way our plot looks, let us change the shape of the points.

Shape

The shape of the point can be specified using the pch argument. It will take values between 0 and 25. Below is an example:

# point shape
plot(mtcars$disp, mtcars$mpg, pch = 6)

Let us check out a few of the other shapes:

We can specify the shape based on a third (categorical variable as well). In the below plot, the shape is based on the levels of the categorical variable cyl (number of cylinders) from the mtcars data set:

# shape  based on number of levels of a third variable
plot(mtcars$disp, mtcars$mpg, pch = nlevels(factor(mtcars$cyl)))

# shape based on a third categorical variable
plot(mtcars$disp, mtcars$mpg, pch = unclass(mtcars$cyl))

Size

The size of the points in the scatter plot can be specified using the cex argument in the plot() function. The default value for cex is 1.

# point size
plot(mtcars$disp, mtcars$mpg, cex = 1.5)

The below plots show the size of the points for values relative to 1.

Color

We can specify a border color for the points using the col argument and a background color using the bg argument. The background color can be specified only for points whose pch argument takes values between 21 and 25. Let us look at some examples to understand this distinction between border and background color.

# shape between 0 and 21
plot(mtcars$disp, mtcars$mpg, pch = 5, col = 'blue', bg = 'red')

You can observe that although we have specified a background color using the bg argument, we do not see the red background color as the value specified for the pch (shape) argument is not between 21 and 25. In the next example, we will use a value between 21 and 25 so that the pch argument is effective.

# shape between 22 and 25
plot(mtcars$disp, mtcars$mpg, pch = 24, col = 'red', bg = 'blue')

The color of the points can be specified using (levels) of a categorical variable as well. In the next example, we will use the cyl variable to specify the color of the points.

# color based on a third variable
plot(mtcars$disp, mtcars$mpg, pch = 5, col = factor(mtcars$cyl))

Since cyl is a categorical variable with 3 levels, we can see that the points now have 3 different colors. The above method is useful when you want to segregate the points in a scatter plot based on a third variable.

Summary

In this post, we learned how to

  • create scatter plots
  • add color to the points
  • modify shape of the points
  • modify size of the points