SQL for Data Science - Part 1

Introduction

This is the third post in the series R & Databases. You can find the links to the other two posts of this series below:

In this post, we will learn to:

Libraries, Code & Data

We will use the following libraries in this post:

All the data sets used in this post can be found here and code can be downloaded from here.

Set Up

ecom <- readr::read_csv('https://raw.githubusercontent.com/rsquaredacademy/datasets/master/web.csv')
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
copy_to(con, ecom)

Select Columns

The SQL SELECT statement is used to fetch the data from a database table.

Syntax

Below is the basic syntax of the SELECT statement.

SELECT column1, column2, columnN FROM table_name;

Here, column1, column2… are the fields of a table whose values you want to fetch. If you want to fetch all the fields, use the following syntax.

SELECT * FROM table_name;

Select Single Column

Let us begin by selecting the device field from the ecom table.

dbGetQuery(con, "SELECT device FROM ecom")
##      device
## 1    laptop
## 2    tablet
## 3    laptop
## 4    tablet
## 5    mobile
## 6    laptop
## 7    mobile
## 8    mobile
## 9    mobile
## 10   mobile
## 11   laptop
## 12   tablet
## 13   mobile
## 14   tablet
## 15   mobile
## 16   laptop
## 17   tablet
## 18   tablet
## 19   tablet
## 20   tablet
## 21   laptop
## 22   mobile
## 23   mobile
## 24   laptop
## 25   laptop
## 26   laptop
## 27   tablet
## 28   laptop
## 29   mobile
## 30   mobile
## 31   tablet
## 32   mobile
## 33   laptop
## 34   tablet
## 35   mobile
## 36   mobile
## 37   laptop
## 38   mobile
## 39   mobile
## 40   mobile
## 41   mobile
## 42   mobile
## 43   laptop
## 44   tablet
## 45   laptop
## 46   tablet
## 47   mobile
## 48   laptop
## 49   mobile
## 50   tablet
##  [ reached getOption("max.print") -- omitted 950 rows ]

Select Multiple Columns

Select the following fields from the ecom table:

  • referrer
  • device
  • purchase
dbGetQuery(con, "SELECT referrer, device, purchase  FROM ecom")
##      referrer device purchase
## 1      google laptop    false
## 2       yahoo tablet    false
## 3      direct laptop    false
## 4        bing tablet     true
## 5       yahoo mobile    false
## 6       yahoo laptop    false
## 7       yahoo mobile    false
## 8      direct mobile    false
## 9        bing mobile    false
## 10     google mobile    false
## 11     direct laptop    false
## 12     direct tablet    false
## 13     direct mobile     true
## 14      yahoo tablet    false
## 15      yahoo mobile    false
## 16       bing laptop    false
##  [ reached getOption("max.print") -- omitted 984 rows ]

Select All Columns

Select all the fields from the ecom table.

dbGetQuery(con, "SELECT * FROM ecom")
##        id referrer device bouncers n_visit n_pages duration
## 1       1   google laptop     true      10       1      693
## 2       2    yahoo tablet     true       9       1      459
## 3       3   direct laptop     true       0       1      996
## 4       4     bing tablet    false       3      18      468
##                               country purchase order_items order_value
## 1                      Czech Republic    false           0           0
## 2                               Yemen    false           0           0
## 3                              Brazil    false           0           0
## 4                               China     true           6         434
##  [ reached getOption("max.print") -- omitted 996 rows ]

Limit

If you have a large table with thousands of rows, returning all the records will take time. Use LIMIT to specify the number of records to return.

dbGetQuery(con, "SELECT * FROM ecom limit 10")
##    id referrer device bouncers n_visit n_pages duration        country
## 1   1   google laptop     true      10       1      693 Czech Republic
## 2   2    yahoo tablet     true       9       1      459          Yemen
## 3   3   direct laptop     true       0       1      996         Brazil
## 4   4     bing tablet    false       3      18      468          China
##    purchase order_items order_value
## 1     false           0           0
## 2     false           0           0
## 3     false           0           0
## 4      true           6         434
##  [ reached getOption("max.print") -- omitted 6 rows ]

Distinct

A column in a table may often contain many duplicate values; and we might be interested only in the distinct/unique values. In such cases, we can use the SELECT DISTINCT statement to return only distinct values.

dbGetQuery(con, "SELECT distinct referrer FROM ecom")
##   referrer
## 1   google
## 2    yahoo
## 3   direct
## 4     bing
## 5   social

Filter

Now that we know how to select columns, let us focus on filtering data. In SQL, the WHERE keyword is used to extract only those records that fulfill a specified condition. Data filter based on both text and numeric values in a table. Below are a few comparison operators we can use:

  • = equal
  • <> not equal
  • < less than
  • > greater than
  • <= less than or equal to
  • >= greater than or equal to

The following SQL statement filters all rows from the ecom table where the duration field is greater than 300.

dbGetQuery(con, "SELECT * 
                 FROM ecom 
                 WHERE duration > 300")
##      id referrer device bouncers n_visit n_pages duration
## 1     1   google laptop     true      10       1      693
## 2     2    yahoo tablet     true       9       1      459
## 3     3   direct laptop     true       0       1      996
## 4     4     bing tablet    false       3      18      468
##                              country purchase order_items order_value
## 1                     Czech Republic    false           0           0
## 2                              Yemen    false           0           0
## 3                             Brazil    false           0           0
## 4                              China     true           6         434
##  [ reached getOption("max.print") -- omitted 472 rows ]

Let us filter data based on a text value. In the following example, we filter all rows from the ecom table where the device used is mobile.

dbGetQuery(con, "SELECT * 
                 FROM ecom 
                 WHERE device == 'mobile'")
##       id referrer device bouncers n_visit n_pages duration
## 1      5    yahoo mobile     true       9       1      955
## 2      7    yahoo mobile     true      10       1       75
## 3      8   direct mobile     true      10       1      908
## 4      9     bing mobile    false       3      19      209
##                   country purchase order_items order_value
## 1                  Poland    false           0           0
## 2              Bangladesh    false           0           0
## 3               Indonesia    false           0           0
## 4             Netherlands    false           0           0
##  [ reached getOption("max.print") -- omitted 340 rows ]
And, Or & Not

The WHERE clause can be combined with other operators such as

  • AND - displays a record if all the conditions separated by AND is TRUE
  • OR - displays a record if any of the conditions separated by OR is TRUE
  • NOT - displays a record if the condition(s) is NOT TRUE

to filter data based on more than one condition or to create more complex conditions.

In the following example, we filter all the rows from the ecom table where n_visit (visit count) is greater than 3 and duration (time spent on the site) is greater than 100. We use AND to create multiple conditions.

dbGetQuery(con, "SELECT * 
                 FROM ecom 
                 WHERE n_visit > 3 AND duration > 100")
##       id referrer device bouncers n_visit n_pages duration
## 1      1   google laptop     true      10       1      693
## 2      2    yahoo tablet     true       9       1      459
## 3      5    yahoo mobile     true       9       1      955
## 4      6    yahoo laptop    false       5       5      135
##                              country purchase order_items order_value
## 1                     Czech Republic    false           0           0
## 2                              Yemen    false           0           0
## 3                             Poland    false           0           0
## 4                       South Africa    false           0           0
##  [ reached getOption("max.print") -- omitted 509 rows ]

In the next example, we will use both AND & OR. Our goal is to filter all rows from the ecom table that follow the below conditions:

  • n_visit (visit count) is either equal to 3 or 5
  • device used to visit the website is either mobile or tablet
dbGetQuery(con, "SELECT * 
                 FROM ecom WHERE (n_visit == 5 OR n_visit == 3)  
                 AND (device = 'mobile' OR device = 'tablet')")
##      id referrer device bouncers n_visit n_pages duration
## 1     4     bing tablet    false       3      18      468
## 2     9     bing mobile    false       3      19      209
## 3    14    yahoo tablet    false       5       8       80
## 4    17     bing tablet    false       5      16      368
##                   country purchase order_items order_value
## 1                   China     true           6         434
## 2             Netherlands    false           0           0
## 3             Philippines    false           2         362
## 4                    Peru     true           6        1049
##  [ reached getOption("max.print") -- omitted 130 rows ]
BETWEEN

The BETWEEN operator selects values within a given range and is inclusive: begin and end values are included. The values can be numbers, text, or dates. In the following example, we filter rows from the ecom table where the visit count is between 1 and 3, and the device used to visit the website is mobile.

dbGetQuery(con, "SELECT * 
                 FROM ecom
                 WHERE n_visit BETWEEN 1 AND 3 AND device = 'mobile'")
##     id referrer device bouncers n_visit n_pages duration
## 1    9     bing mobile    false       3      19      209
## 2   32   direct mobile     true       2       1      501
## 3   36     bing mobile    false       1       1       25
## 4   38    yahoo mobile     true       3       1      700
##                  country purchase order_items order_value
## 1            Netherlands    false           0           0
## 2            El Salvador    false           0           0
## 3                Ireland    false          10        1885
## 4                 Canada    false           0           0
##  [ reached getOption("max.print") -- omitted 86 rows ]
IN

The IN operator allows us to specify multiple values in a WHERE clause. It is a shorthand for multiple OR conditions. In the below example, we filter rows from the ecom table where the visit count is either 2 or 4 or 6 or 8 or 10. Instead of using multiple OR conditions, we use the IN operator.

dbGetQuery(con, "SELECT * 
                 FROM ecom 
                 WHERE n_visit IN (2, 4, 6, 8, 10)")
##      id referrer device bouncers n_visit n_pages duration
## 1     1   google laptop     true      10       1      693
## 2     7    yahoo mobile     true      10       1       75
## 3     8   direct mobile     true      10       1      908
## 4    10   google mobile     true       6       1      208
##                              country purchase order_items order_value
## 1                     Czech Republic    false           0           0
## 2                         Bangladesh    false           0           0
## 3                          Indonesia    false           0           0
## 4                     Czech Republic    false           0           0
##  [ reached getOption("max.print") -- omitted 443 rows ]
IS NULL

A field with a NULL value is a field with no value. If a field in a table is optional, it is possible to insert a new record or update a record without adding a value to this field. Then, the field will be saved with a NULL value. In the next example, we filter all rows from the ecom table where the device column has NULL values.

dbGetQuery(con, "SELECT * 
                 FROM ecom 
                 WHERE device IS NULL")
##  [1] id          referrer    device      bouncers    n_visit    
##  [6] n_pages     duration    country     purchase    order_items
## [11] order_value
## <0 rows> (or 0-length row.names)
LIKE

The LIKE operator is used to search for a specific pattern in a column. There are two wildcards used in conjunction with the LIKE operator:

  • % : represents zero, one, or multiple characters
  • _ : represents a single character

In the following example, we filter all rows from the ecom table where the name of the country starts with P. We use % after P to indicate that it can be followed by any number or type of characters.

dbGetQuery(con, "SELECT * 
                 FROM ecom 
                 WHERE country LIKE 'P%'")
##      id referrer device bouncers n_visit n_pages duration
## 1     5    yahoo mobile     true       9       1      955
## 2    14    yahoo tablet    false       5       8       80
## 3    17     bing tablet    false       5      16      368
## 4    43     bing laptop     true       0       1      456
##                   country purchase order_items order_value
## 1                  Poland    false           0           0
## 2             Philippines    false           2         362
## 3                    Peru     true           6        1049
## 4                Portugal    false           0           0
##  [ reached getOption("max.print") -- omitted 135 rows ]

Let us look at another example where we filter all rows from the ecom table where the name of the country should follow the below conditions:

  • name can start with any character
  • the second character must be o
  • it can have any type or number of characters after the second character
dbGetQuery(con, "SELECT * 
                 FROM ecom 
                 WHERE country LIKE '_o%'")
##      id referrer device bouncers n_visit n_pages duration
## 1     5    yahoo mobile     true       9       1      955
## 2     6    yahoo laptop    false       5       5      135
## 3    19   social tablet    false       7      10      290
## 4    30    yahoo mobile    false       8       9      225
##                    country purchase order_items order_value
## 1                   Poland    false           0           0
## 2             South Africa    false           0           0
## 3                 Colombia     true           9        1304
## 4                 Colombia    false           0           0
##  [ reached getOption("max.print") -- omitted 117 rows ]

Summary

In this post we learnt to

  • select
    • single column
    • multiple columns
    • distinct values in a column
  • limit the number of records returned
  • handle NULL values
  • and filter columns using the following operators
    • WHERE
    • AND, or & NOT
    • BETWEEN
    • IN
    • LIKE

Up Next..

In the next post, we will learn advanced SQL commands.