4 min read

SQL for Data Science - Part 2

Introduction

This is the fourth post in the series R & Databases. You can find the links to the other two posts of this series below:

In this post, we will learn to

  • aggregate data
  • order data
  • and group data

Libraries, Code & Data

We will use the following libraries in this post:

All the data sets used in this post can be found here and code can be downloaded from here.

Set Up

ecom <- readr::read_csv('https://raw.githubusercontent.com/rsquaredacademy/datasets/master/web.csv')
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
copy_to(con, ecom)

Aggregate

Let us combine the aggregate statements with WHERE statement to filter data.

  • SUM() : returns the total sum of a numeric column
dbGetQuery(con, "SELECT SUM(n_visit) FROM ecom")
##   SUM(n_visit)
## 1         4972
dbGetQuery(con, "SELECT SUM(n_visit) 
                 FROM ecom 
                 WHERE n_visit > 5")
##   SUM(n_visit)
## 1         3574
  • AVG() : returns the average value of a numeric column
dbGetQuery(con, "SELECT AVG(n_visit) FROM ecom")
##   AVG(n_visit)
## 1        4.972
dbGetQuery(con, "SELECT AVG(n_visit) 
                 FROM ecom 
                 WHERE country LIKE 'P%'")
##   AVG(n_visit)
## 1     5.079137
  • MAX() : returns the largest value of the selected column
dbGetQuery(con, "SELECT MAX(n_visit) FROM ecom")
##   MAX(n_visit)
## 1           10
dbGetQuery(con, "SELECT MAX(n_visit) 
                 FROM ecom 
                 WHERE device == 'tablet'")
##   MAX(n_visit)
## 1           10
  • MIN() : returns the smallest value of the selected column
dbGetQuery(con, "SELECT MIN(n_visit) FROM ecom")
##   MIN(n_visit)
## 1            0
dbGetQuery(con, "SELECT MIN(n_visit) 
                 FROM ecom 
                 WHERE duration BETWEEN 600 AND 900")
##   MIN(n_visit)
## 1            0

Alias

SQL aliases are used to give a table, or a column in a table, a temporary name. They are often used to make column names more readable. An alias only exists for the duration of the query. Below are a few examples:

dbGetQuery(con, "SELECT AVG(n_visit) AS avg_mobile 
                 FROM ecom 
                 WHERE device == 'mobile'")
##   avg_mobile
## 1   5.479651
dbGetQuery(con, "SELECT MAX(n_visit) AS max_visit 
                 FROM ecom")
##   max_visit
## 1        10
dbGetQuery(con, "SELECT MIN(duration) AS min_duration 
                 FROM ecom")
##   min_duration
## 1           10

Order By

The ORDER BY keyword is used to sort the records in ascending or descending order. By default, the records are sorted in ascending order. Use the DESC keyword if you want to sort the records in descending order,

dbGetQuery(con, "SELECT * 
                 FROM ecom 
                 ORDER BY country")
##    id referrer device bouncers n_visit n_pages duration     country purchase
## 1 232   social laptop        0       8       2       60 Afghanistan        0
## 2 299    yahoo laptop        0      10      18      180 Afghanistan        0
## 3 570   social laptop        1       2       1      274 Afghanistan        0
## 4 677   direct tablet        1      10       1      682 Afghanistan        0
##   order_items order_value
## 1           0           0
## 2           0           0
## 3           0           0
## 4           0           0
##  [ reached 'max' / getOption("max.print") -- omitted 996 rows ]
dbGetQuery(con, "SELECT *  
                 FROM ecom 
                 ORDER BY duration")
##    id referrer device bouncers n_visit n_pages duration   country purchase
## 1 236    yahoo tablet        1       5       1       10    Poland        0
## 2 615   social laptop        1       1       1       10   Finland        0
## 3 392    yahoo laptop        0       0       1       12 Indonesia        0
## 4 688   social mobile        1       2       1       12  Botswana        0
##   order_items order_value
## 1           0           0
## 2           0           0
## 3           0           0
## 4           0           0
##  [ reached 'max' / getOption("max.print") -- omitted 996 rows ]
dbGetQuery(con, "SELECT * 
                 FROM ecom 
                 ORDER BY n_visit DESC")
##   id referrer device bouncers n_visit n_pages duration        country purchase
## 1  1   google laptop        1      10       1      693 Czech Republic        0
## 2  7    yahoo mobile        1      10       1       75     Bangladesh        0
## 3  8   direct mobile        1      10       1      908      Indonesia        0
## 4 29   google mobile        1      10       1      338         Russia        0
##   order_items order_value
## 1           0           0
## 2           0           0
## 3           0           0
## 4           0           0
##  [ reached 'max' / getOption("max.print") -- omitted 996 rows ]

Group By

The GROUP BY statement is used with aggregate functions (COUNT, MAX, MIN, SUM, AVG) to group the result by one or more columns.

dbGetQuery(con, "SELECT device, 
                 count(*) AS visits 
                 FROM ecom 
                 GROUP BY device 
                 ORDER by visits DESC")
##   device visits
## 1 mobile    344
## 2 tablet    331
## 3 laptop    325
dbGetQuery(con, "SELECT device, MAX(duration) AS max_duration 
                 FROM ecom 
                 GROUP BY device 
                 ORDER by max_duration DESC")
##   device max_duration
## 1 tablet          999
## 2 laptop          997
## 3 mobile          994

Summary

In this post, we learnt to

  • aggregate data
  • order data
  • and group data