# SQL for Data Science - Part 2

#### Introduction

This is the fourth post in the series R & Databases. You can find the links to the other two posts of this series below:

In this post, we will learn to

• aggregate data
• order data
• and group data

### Libraries, Code & Data

We will use the following libraries in this post:

All the data sets used in this post can be found here and code can be downloaded from here.

#### Set Up

``````ecom <- readr::read_csv('https://raw.githubusercontent.com/rsquaredacademy/datasets/master/web.csv')
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
copy_to(con, ecom)``````

#### Aggregate

Let us combine the aggregate statements with WHERE statement to filter data.

• SUM() : returns the total sum of a numeric column
``dbGetQuery(con, "SELECT SUM(n_visit) FROM ecom")``
``````##   SUM(n_visit)
## 1         4972``````
``````dbGetQuery(con, "SELECT SUM(n_visit)
FROM ecom
WHERE n_visit > 5")``````
``````##   SUM(n_visit)
## 1         3574``````
• AVG() : returns the average value of a numeric column
``dbGetQuery(con, "SELECT AVG(n_visit) FROM ecom")``
``````##   AVG(n_visit)
## 1        4.972``````
``````dbGetQuery(con, "SELECT AVG(n_visit)
FROM ecom
WHERE country LIKE 'P%'")``````
``````##   AVG(n_visit)
## 1     5.079137``````
• MAX() : returns the largest value of the selected column
``dbGetQuery(con, "SELECT MAX(n_visit) FROM ecom")``
``````##   MAX(n_visit)
## 1           10``````
``````dbGetQuery(con, "SELECT MAX(n_visit)
FROM ecom
WHERE device == 'tablet'")``````
``````##   MAX(n_visit)
## 1           10``````
• MIN() : returns the smallest value of the selected column
``dbGetQuery(con, "SELECT MIN(n_visit) FROM ecom")``
``````##   MIN(n_visit)
## 1            0``````
``````dbGetQuery(con, "SELECT MIN(n_visit)
FROM ecom
WHERE duration BETWEEN 600 AND 900")``````
``````##   MIN(n_visit)
## 1            0``````

#### Alias

SQL aliases are used to give a table, or a column in a table, a temporary name. They are often used to make column names more readable. An alias only exists for the duration of the query. Below are a few examples:

``````dbGetQuery(con, "SELECT AVG(n_visit) AS avg_mobile
FROM ecom
WHERE device == 'mobile'")``````
``````##   avg_mobile
## 1   5.479651``````
``````dbGetQuery(con, "SELECT MAX(n_visit) AS max_visit
FROM ecom")``````
``````##   max_visit
## 1        10``````
``````dbGetQuery(con, "SELECT MIN(duration) AS min_duration
FROM ecom")``````
``````##   min_duration
## 1           10``````

#### Order By

The ORDER BY keyword is used to sort the records in ascending or descending order. By default, the records are sorted in ascending order. Use the DESC keyword if you want to sort the records in descending order,

``````dbGetQuery(con, "SELECT *
FROM ecom
ORDER BY country")``````
``````##        id referrer device bouncers n_visit n_pages duration
## 1     232   social laptop    false       8       2       60
## 2     299    yahoo laptop    false      10      18      180
## 3     570   social laptop     true       2       1      274
## 4     677   direct tablet     true      10       1      682
##                               country purchase order_items order_value
## 1                         Afghanistan    false           0           0
## 2                         Afghanistan    false           0           0
## 3                         Afghanistan    false           0           0
## 4                         Afghanistan    false           0           0
##  [ reached getOption("max.print") -- omitted 996 rows ]``````
``````dbGetQuery(con, "SELECT *
FROM ecom
ORDER BY duration")``````
``````##        id referrer device bouncers n_visit n_pages duration
## 1     236    yahoo tablet     true       5       1       10
## 2     615   social laptop     true       1       1       10
## 3     392    yahoo laptop    false       0       1       12
## 4     688   social mobile     true       2       1       12
##                               country purchase order_items order_value
## 1                              Poland    false           0           0
## 2                             Finland    false           0           0
## 3                           Indonesia    false           0           0
## 4                            Botswana    false           0           0
##  [ reached getOption("max.print") -- omitted 996 rows ]``````
``````dbGetQuery(con, "SELECT *
FROM ecom
ORDER BY n_visit DESC")``````
``````##        id referrer device bouncers n_visit n_pages duration
## 1       1   google laptop     true      10       1      693
## 2       7    yahoo mobile     true      10       1       75
## 3       8   direct mobile     true      10       1      908
## 4      29   google mobile     true      10       1      338
##                               country purchase order_items order_value
## 1                      Czech Republic    false           0           0
## 2                          Bangladesh    false           0           0
## 3                           Indonesia    false           0           0
## 4                              Russia    false           0           0
##  [ reached getOption("max.print") -- omitted 996 rows ]``````

#### Group By

The GROUP BY statement is used with aggregate functions (COUNT, MAX, MIN, SUM, AVG) to group the result by one or more columns.

``````dbGetQuery(con, "SELECT device,
count(*) AS visits
FROM ecom
GROUP BY device
ORDER by visits DESC")``````
``````##   device visits
## 1 mobile    344
## 2 tablet    331
## 3 laptop    325``````
``````dbGetQuery(con, "SELECT device, MAX(duration) AS max_duration
FROM ecom
GROUP BY device
ORDER by max_duration DESC")``````
``````##   device max_duration
## 1 tablet          999
## 2 laptop          997
## 3 mobile          994``````

### Summary

In this post, we learnt to

• aggregate data
• order data
• and group data