Data Wrangling with dplyr - Part 2

Introduction

In the previous post, we learnt about dplyr verbs and used them to compute average order value for an online retail company data. In this post, we will learn to combine tables using different *_join functions provided in dplyr.

Libraries, Code & Data

We will use the following packages:

The data sets can be downloaded from here and the codes from here.

library(dplyr)
library(readr)
options(tibble.width = Inf)

Case Study

For our case study, we will use two data sets. The first one, order, contains details of orders placed by different customers. The second data set, customer contains details of each customer. The below table displays the details stored in each data set.

Let us import both the data sets using read_csv.

Data: Orders

order <- readr::read_csv('https://raw.githubusercontent.com/rsquaredacademy/datasets/master/order.csv')
order
## # A tibble: 300 x 5
##       id order_data order_number customer_id amount
##    <int> <chr>             <int>       <int>  <dbl>
##  1   368 7/2/2016         543346         352    365
##  2   286 11/2/2016        543390         305   2064
##  3    28 2/22/2017        543404          78    432
##  4   309 3/5/2017         543389         447    480
##  5     2 12/28/2016       543233         374    235
##  6    31 12/30/2016       543475          48   2745
##  7   179 12/21/2016       543155         280   2358
##  8   484 11/24/2016       543485         445   1031
##  9   115 9/9/2016         543489         278   1218
## 10   340 5/6/2017         543246         131   1184
## # ... with 290 more rows

Data: Customers

customer <- readr::read_csv('https://raw.githubusercontent.com/rsquaredacademy/datasets/master/customer.csv')
customer
## # A tibble: 91 x 5
##       id first_name last_name             city       phone          
##    <int> <chr>      <chr>                 <chr>      <chr>          
##  1     1 Elbertine  Robertucci            California 1-(805)951-9083
##  2     2 Marcella   Kilshall              Colorado   1-(303)642-9320
##  3     3 Daria      de la Valette Parisot Florida    1-(352)629-9125
##  4     4 Sherilyn   Widd                  Distric... 1-(202)453-3897
##  5     5 Ketty      Cridlin               Texas      1-(214)455-2169
##  6     6 Jethro     Lubomirski            California 1-(510)278-9668
##  7     7 Jeremiah   Carrabot              California 1-(559)713-9864
##  8     8 Constancia Eastmead              Texas      1-(806)550-1701
##  9     9 Muire      Prestwich             Idaho      1-(208)811-7437
## 10    10 Abigail    Craigs                Texas      1-(469)653-1340
## # ... with 81 more rows

We will explore the following in the case study:

  • details of customers who have placed orders and their order details
  • details of customers and their orders irrespective of whether a customer has placed orders or not
  • customer details for each order
  • details of customers who have placed orders
  • details of customers who have not placed orders
  • details of all customers and all orders

Example Data

We will use another data set to illustrate how the different joins work. You can view the example data sets below.

Inner Join

Inner join return all rows from Age where there are matching values in Height, and all columns from Age and Height. If there are multiple matches between Age and Height, all combination of the matches are returned.

Case Study

Details of customers who have placed orders and their order details

To get data for all those customers who have placed orders in the past let us join the order data with the customer data using inner_join.

customer %>%
  inner_join(order)
## Joining, by = "id"
## # A tibble: 55 x 9
##       id first_name last_name  city       phone           order_data
##    <int> <chr>      <chr>      <chr>      <chr>           <chr>     
##  1     2 Marcella   Kilshall   Colorado   1-(303)642-9320 12/28/2016
##  2     2 Marcella   Kilshall   Colorado   1-(303)642-9320 8/31/2016 
##  3     5 Ketty      Cridlin    Texas      1-(214)455-2169 1/17/2017 
##  4     6 Jethro     Lubomirski California 1-(510)278-9668 1/27/2017 
##  5     7 Jeremiah   Carrabot   California 1-(559)713-9864 6/21/2016 
##  6     7 Jeremiah   Carrabot   California 1-(559)713-9864 2/13/2017 
##  7     7 Jeremiah   Carrabot   California 1-(559)713-9864 7/8/2016  
##  8     8 Constancia Eastmead   Texas      1-(806)550-1701 11/5/2016 
##  9     8 Constancia Eastmead   Texas      1-(806)550-1701 5/19/2017 
## 10     9 Muire      Prestwich  Idaho      1-(208)811-7437 12/28/2016
##    order_number customer_id amount
##           <int>       <int>  <dbl>
##  1       543233         374    235
##  2       543129         334   1150
##  3       543453         475    346
##  4       543274         218   2317
##  5       543423         282    136
##  6       543379         346   1407
##  7       543477         483   1914
##  8       543157         341   2461
##  9       543117         364   2714
## 10       543201         207    187
## # ... with 45 more rows

Left Join

Left join return all rows from Age, and all columns from Age and Height. Rows in Age with no match in Height will have NA values in the new columns. If there are multiple matches between Age and Height, all combinations of the matches are returned.

Case Study

Details of customers and their orders irrespective of whether a customer has

placed orders or not.

To get data for all those customers and their orders irrespective of whether a customer has placed orders or not let us join the order data with the customer data using left_join.

customer %>%
  left_join(order)
## Joining, by = "id"
## # A tibble: 104 x 9
##       id first_name last_name             city       phone          
##    <int> <chr>      <chr>                 <chr>      <chr>          
##  1     1 Elbertine  Robertucci            California 1-(805)951-9083
##  2     2 Marcella   Kilshall              Colorado   1-(303)642-9320
##  3     2 Marcella   Kilshall              Colorado   1-(303)642-9320
##  4     3 Daria      de la Valette Parisot Florida    1-(352)629-9125
##  5     4 Sherilyn   Widd                  Distric... 1-(202)453-3897
##  6     5 Ketty      Cridlin               Texas      1-(214)455-2169
##  7     6 Jethro     Lubomirski            California 1-(510)278-9668
##  8     7 Jeremiah   Carrabot              California 1-(559)713-9864
##  9     7 Jeremiah   Carrabot              California 1-(559)713-9864
## 10     7 Jeremiah   Carrabot              California 1-(559)713-9864
##    order_data order_number customer_id amount
##    <chr>             <int>       <int>  <dbl>
##  1 <NA>                 NA          NA     NA
##  2 12/28/2016       543233         374    235
##  3 8/31/2016        543129         334   1150
##  4 <NA>                 NA          NA     NA
##  5 <NA>                 NA          NA     NA
##  6 1/17/2017        543453         475    346
##  7 1/27/2017        543274         218   2317
##  8 6/21/2016        543423         282    136
##  9 2/13/2017        543379         346   1407
## 10 7/8/2016         543477         483   1914
## # ... with 94 more rows

Right Join

Right join return all rows from Height, and all columns from Age and Height. Rows in Height with no match in Age will have NA values in the new columns. If there are multiple matches between Age and Height, all combinations of the matches are returned.

Case Study

Customer details for each order

To get customer data for all orders, let us join the order data with the customer data using right_join.

customer %>%
  right_join(order)
## Joining, by = "id"
## # A tibble: 300 x 9
##       id first_name last_name city      phone           order_data
##    <int> <chr>      <chr>     <chr>     <chr>           <chr>     
##  1   368 <NA>       <NA>      <NA>      <NA>            7/2/2016  
##  2   286 <NA>       <NA>      <NA>      <NA>            11/2/2016 
##  3    28 Avrit      Lerven    Texas     1-(817)651-5322 2/22/2017 
##  4   309 <NA>       <NA>      <NA>      <NA>            3/5/2017  
##  5     2 Marcella   Kilshall  Colorado  1-(303)642-9320 12/28/2016
##  6    31 Clemmie    McKeady   Tennessee 1-(865)334-8263 12/30/2016
##  7   179 <NA>       <NA>      <NA>      <NA>            12/21/2016
##  8   484 <NA>       <NA>      <NA>      <NA>            11/24/2016
##  9   115 <NA>       <NA>      <NA>      <NA>            9/9/2016  
## 10   340 <NA>       <NA>      <NA>      <NA>            5/6/2017  
##    order_number customer_id amount
##           <int>       <int>  <dbl>
##  1       543346         352    365
##  2       543390         305   2064
##  3       543404          78    432
##  4       543389         447    480
##  5       543233         374    235
##  6       543475          48   2745
##  7       543155         280   2358
##  8       543485         445   1031
##  9       543489         278   1218
## 10       543246         131   1184
## # ... with 290 more rows

Semi Join

Semi join return all rows from Age where there are matching values in Height, keeping just columns from Age. A semi join differs from an inner join because an inner join will return one row of Age for each matching row of Height, where a semi join will never duplicate rows of Age.

Case Study

Details of customers who have placed orders

To get customer data for all orders, let us join the order data with the customer data using semi_join.

customer %>%
  semi_join(order)
## Joining, by = "id"
## # A tibble: 42 x 5
##       id first_name last_name  city       phone          
##    <int> <chr>      <chr>      <chr>      <chr>          
##  1     2 Marcella   Kilshall   Colorado   1-(303)642-9320
##  2     5 Ketty      Cridlin    Texas      1-(214)455-2169
##  3     6 Jethro     Lubomirski California 1-(510)278-9668
##  4     7 Jeremiah   Carrabot   California 1-(559)713-9864
##  5     8 Constancia Eastmead   Texas      1-(806)550-1701
##  6     9 Muire      Prestwich  Idaho      1-(208)811-7437
##  7    15 Valentijn  Reddel     California 1-(562)760-7094
##  8    16 Monique    Van Zon    Missouri   1-(314)444-6148
##  9    20 Colette    Doddridge  Texas      1-(214)267-1433
## 10    28 Avrit      Lerven     Texas      1-(817)651-5322
## # ... with 32 more rows

Anti Join

Anti join return all rows from Age where there are not matching values in Height, keeping just columns from Age.

Case Study

Details of customers who have not placed orders

To get details of customers who have not placed orders, let us join the order data with the customer data using anti_join.

customer %>%
  anti_join(order)
## Joining, by = "id"
## # A tibble: 49 x 5
##       id first_name last_name             city       phone          
##    <int> <chr>      <chr>                 <chr>      <chr>          
##  1     1 Elbertine  Robertucci            California 1-(805)951-9083
##  2     3 Daria      de la Valette Parisot Florida    1-(352)629-9125
##  3     4 Sherilyn   Widd                  Distric... 1-(202)453-3897
##  4    10 Abigail    Craigs                Texas      1-(469)653-1340
##  5    11 Wynne      Roote                 Georgia    1-(404)911-0264
##  6    12 Pietra     Hynson                Minnesota  1-(651)506-8628
##  7    13 Bram       Valde                 Iowa       1-(515)637-1353
##  8    14 Rees       MacAllester           New York   1-(716)187-2661
##  9    17 Orazio     Boddis                Louisiana  1-(318)313-4592
## 10    18 Mason      Letterick             Texas      1-(806)351-0113
## # ... with 39 more rows

Full Join

Full join return all rows and all columns from both Age and Height. Where there are not matching values, returns NA for the one missing.

Case Study

Details of all customers and all orders

To get details of all customers and all orders, let us join the order data with the customer data using full_join.

customer %>%
  full_join(order)
## Joining, by = "id"
## # A tibble: 349 x 9
##       id first_name last_name             city       phone          
##    <int> <chr>      <chr>                 <chr>      <chr>          
##  1     1 Elbertine  Robertucci            California 1-(805)951-9083
##  2     2 Marcella   Kilshall              Colorado   1-(303)642-9320
##  3     2 Marcella   Kilshall              Colorado   1-(303)642-9320
##  4     3 Daria      de la Valette Parisot Florida    1-(352)629-9125
##  5     4 Sherilyn   Widd                  Distric... 1-(202)453-3897
##  6     5 Ketty      Cridlin               Texas      1-(214)455-2169
##  7     6 Jethro     Lubomirski            California 1-(510)278-9668
##  8     7 Jeremiah   Carrabot              California 1-(559)713-9864
##  9     7 Jeremiah   Carrabot              California 1-(559)713-9864
## 10     7 Jeremiah   Carrabot              California 1-(559)713-9864
##    order_data order_number customer_id amount
##    <chr>             <int>       <int>  <dbl>
##  1 <NA>                 NA          NA     NA
##  2 12/28/2016       543233         374    235
##  3 8/31/2016        543129         334   1150
##  4 <NA>                 NA          NA     NA
##  5 <NA>                 NA          NA     NA
##  6 1/17/2017        543453         475    346
##  7 1/27/2017        543274         218   2317
##  8 6/21/2016        543423         282    136
##  9 2/13/2017        543379         346   1407
## 10 7/8/2016         543477         483   1914
## # ... with 339 more rows