Data Wrangling with dplyr - Part 3

Introduction

In the previous post, we learnt to combine tables using dplyr. In this post, we will explore a set of helper functions in order to:

  • extract unique rows
  • rename columns
  • sample data
  • extract columns
  • slice rows
  • arrange rows
  • compare tables
  • extract/mutate data using predicate functions
  • count observations for different levels of a variable

Libraries, Code & Data

We will use the following packages:

library(dplyr)
library(readr)

The data sets can be downloaded from here and the code from here.

Case Study

Let us look at a case study (e-commerce data) and see how we can use dplyr helper functions to answer questions about the data and to modify/transform the underlying data set.

Data

ecom <- readr::read_csv('https://raw.githubusercontent.com/rsquaredacademy/datasets/master/web.csv')
ecom
## # A tibble: 1,000 x 11
##       id referrer device bouncers n_visit n_pages duration country       
##    <int> <chr>    <chr>  <chr>      <int>   <dbl>    <dbl> <chr>         
##  1     1 google   laptop true          10    1.00    693   Czech Republic
##  2     2 yahoo    tablet true           9    1.00    459   Yemen         
##  3     3 direct   laptop true           0    1.00    996   Brazil        
##  4     4 bing     tablet false          3   18.0     468   China         
##  5     5 yahoo    mobile true           9    1.00    955   Poland        
##  6     6 yahoo    laptop false          5    5.00    135   South Africa  
##  7     7 yahoo    mobile true          10    1.00     75.0 Bangladesh    
##  8     8 direct   mobile true          10    1.00    908   Indonesia     
##  9     9 bing     mobile false          3   19.0     209   Netherlands   
## 10    10 google   mobile true           6    1.00    208   Czech Republic
## # ... with 990 more rows, and 3 more variables: purchase <chr>,
## #   order_items <dbl>, order_value <dbl>

Data Dictionary

  • id: row id
  • referrer: referrer website/search engine
  • device: device used to visit the website
  • bouncers: whether the visit was a bounce (the visitor left after viewing a single page)
  • n_visit: number of previous visits
  • n_pages: number of pages visited
  • duration: time spent on the website (in seconds)
  • country: country of origin
  • purchase: whether the visitor made a purchase
  • order_items: number of items ordered
  • order_value: order value of the visitor (in dollars)

Data Sanitization

Let us ensure that the data is sanitized by checking the sources of traffic and the devices used to visit the site. We will use distinct() to examine the values in the referrer column

ecom %>%
  distinct(referrer)
## # A tibble: 5 x 1
##   referrer
##   <chr>   
## 1 google  
## 2 yahoo   
## 3 direct  
## 4 bing    
## 5 social

and the device column as well.

ecom %>%
  distinct(device)
## # A tibble: 3 x 1
##   device
##   <chr> 
## 1 laptop
## 2 tablet
## 3 mobile
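
distinct() also accepts multiple columns, in which case it returns the unique combinations of values. For instance, the below sketch (output not shown) lists the combinations of traffic source and device.

ecom %>%
  distinct(referrer, device)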

Data Tabulation

Let us now look at the number of visits driven by the different sources of traffic.

ecom %>%
  group_by(referrer) %>%
  tally()
## # A tibble: 5 x 2
##   referrer     n
##   <chr>    <int>
## 1 bing       194
## 2 direct     191
## 3 google     208
## 4 social     200
## 5 yahoo      207
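
If we want the share of visits rather than the raw counts, we can convert the counts into proportions using mutate(). A minimal sketch (prop is just our choice of column name):

ecom %>%
  count(referrer) %>%
  mutate(prop = n / sum(n))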

We would also like to know the number of bouncers driven by the different sources of traffic.

ecom %>%
  group_by(referrer, bouncers) %>%
  tally()
## # A tibble: 10 x 3
## # Groups:   referrer [?]
##    referrer bouncers     n
##    <chr>    <chr>    <int>
##  1 bing     false      104
##  2 bing     true        90
##  3 direct   false       98
##  4 direct   true        93
##  5 google   false      101
##  6 google   true       107
##  7 social   false       93
##  8 social   true       107
##  9 yahoo    false      110
## 10 yahoo    true        97

Let us look at how many conversions happen across different devices.

ecom %>%
  group_by(device, purchase) %>%
  tally() %>%
  filter(purchase == 'true')
## # A tibble: 3 x 3
## # Groups:   device [3]
##   device purchase     n
##   <chr>  <chr>    <int>
## 1 laptop true        31
## 2 mobile true        36
## 3 tablet true        36

Another way to extract such counts is to use count(). Below, we look at conversions across the different sources of traffic.

ecom %>%
  count(referrer, purchase) %>%
  filter(purchase == "true")
## # A tibble: 5 x 3
##   referrer purchase     n
##   <chr>    <chr>    <int>
## 1 bing     true        17
## 2 direct   true        25
## 3 google   true        19
## 4 social   true        20
## 5 yahoo    true        22
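
count() also takes a sort argument which arranges the output in descending order of frequency, making it easy to spot the traffic source driving the most conversions. For instance:

ecom %>%
  count(referrer, purchase, sort = TRUE) %>%
  filter(purchase == "true")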

Sampling Data

dplyr offers sampling functions which allow us to specify either the number or the fraction of observations to sample. sample_n() samples a specific number of observations.

ecom %>%
  sample_n(700)
## # A tibble: 700 x 11
##       id referrer device bouncers n_visit n_pages duration country      
##    <int> <chr>    <chr>  <chr>      <int>   <dbl>    <dbl> <chr>        
##  1    86 yahoo    tablet true           0    1.00    824   Austria      
##  2   679 yahoo    laptop true           3    1.00    541   Thailand     
##  3   220 social   tablet false          6    6.00    108   Portugal     
##  4   187 bing     laptop false          1    2.00     22.0 Brazil       
##  5    16 bing     laptop true           1    1.00    995   United States
##  6   612 social   mobile false          1   16.0     336   Philippines  
##  7   464 google   tablet true           8    1.00    414   Yemen        
##  8   242 bing     mobile false         10   13.0     221   China        
##  9   625 social   mobile false          7    8.00    216   Tanzania     
## 10   370 social   tablet false          2    8.00    208   Azerbaijan   
## # ... with 690 more rows, and 3 more variables: purchase <chr>,
## #   order_items <dbl>, order_value <dbl>

We can combine the sampling functions with other dplyr functions, as shown below, where we sample observations after grouping them by the source of traffic.

ecom %>%
  group_by(referrer) %>%
  sample_n(100)
## # A tibble: 500 x 11
## # Groups:   referrer [5]
##       id referrer device bouncers n_visit n_pages duration country        
##    <int> <chr>    <chr>  <chr>      <int>   <dbl>    <dbl> <chr>          
##  1   540 bing     tablet true           7    1.00      718 Thailand       
##  2   749 bing     laptop false          0   17.0       272 Indonesia      
##  3    47 bing     mobile true           6    1.00      843 Sweden         
##  4    69 bing     tablet true           1    1.00      406 Turkmenistan   
##  5   181 bing     mobile true           9    1.00      650 Poland         
##  6   293 bing     mobile true          10    1.00      480 Panama         
##  7   329 bing     mobile true           3    1.00      490 Sweden         
##  8    97 bing     tablet true           2    1.00      292 Democratic Rep~
##  9   584 bing     mobile true           0    1.00      686 Uzbekistan     
## 10   888 bing     tablet false         10   17.0       442 China          
## # ... with 490 more rows, and 3 more variables: purchase <chr>,
## #   order_items <dbl>, order_value <dbl>

sample_frac() samples a specific fraction (proportion) of observations.

ecom %>%
  sample_frac(size = 0.7)
## # A tibble: 700 x 11
##       id referrer device bouncers n_visit n_pages duration country    
##    <int> <chr>    <chr>  <chr>      <int>   <dbl>    <dbl> <chr>      
##  1   863 social   tablet true           5    1.00      387 China      
##  2   445 social   tablet false          2   12.0       300 Philippines
##  3   973 social   mobile true           4    1.00      305 Peru       
##  4   955 google   mobile true           6    1.00      965 Netherlands
##  5   803 yahoo    laptop true           8    1.00      630 Israel     
##  6   238 yahoo    mobile true           1    1.00      257 Brazil     
##  7    25 yahoo    laptop true           1    1.00      779 Brazil     
##  8   295 bing     laptop false         10   16.0       304 Poland     
##  9   454 direct   tablet true          10    1.00      970 Syria      
## 10   369 social   laptop true           2    1.00      341 Indonesia  
## # ... with 690 more rows, and 3 more variables: purchase <chr>,
## #   order_items <dbl>, order_value <dbl>
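
Since sampling is random, the rows returned will differ each time the code is run. If we need reproducible samples, we can set the random seed before sampling (the seed value below is arbitrary):

set.seed(123)
ecom %>%
  sample_frac(size = 0.7)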

Data Extraction

In the first post, we observed that dplyr verbs always return a tibble. What if you want to extract a specific column or a set of rows, but not as a tibble?

Use pull() to extract columns either by name or by position. It returns a vector. In the below example, we extract the device column as a vector. We also use head() to limit the output printed.

ecom %>%
  pull(device) %>%
  head
## [1] "laptop" "tablet" "laptop" "tablet" "mobile" "laptop"

Let us extract the first column from ecom using column position instead of name.

ecom %>%
  pull(1) %>%
  head
## [1] 1 2 3 4 5 6

You can use a minus sign (-) before the column position to count from the end. The below example extracts data from the last column.

ecom %>%
  pull(-1) %>%
  head
## [1]   0   0   0 434   0   0
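
If you are on dplyr 1.0.0 or later, pull() also accepts a name argument so that the returned vector is named using another column. A minimal sketch, naming the durations by visitor id:

ecom %>%
  pull(duration, name = id) %>%
  head()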

Let us now look at extracting rows using slice(). In the below example, we extract data starting from the 5th row and up to the 15th row.

ecom %>%
  slice(5:15)
## # A tibble: 11 x 11
##       id referrer device bouncers n_visit n_pages duration country       
##    <int> <chr>    <chr>  <chr>      <int>   <dbl>    <dbl> <chr>         
##  1     5 yahoo    mobile true           9    1.00    955   Poland        
##  2     6 yahoo    laptop false          5    5.00    135   South Africa  
##  3     7 yahoo    mobile true          10    1.00     75.0 Bangladesh    
##  4     8 direct   mobile true          10    1.00    908   Indonesia     
##  5     9 bing     mobile false          3   19.0     209   Netherlands   
##  6    10 google   mobile true           6    1.00    208   Czech Republic
##  7    11 direct   laptop true           9    1.00    738   Jamaica       
##  8    12 direct   tablet false          6   12.0     132   Estonia       
##  9    13 direct   mobile false          9   14.0     406   Ireland       
## 10    14 yahoo    tablet false          5    8.00     80.0 Philippines   
## 11    15 yahoo    mobile false          7    1.00     19.0 France        
## # ... with 3 more variables: purchase <chr>, order_items <dbl>,
## #   order_value <dbl>

Use n() inside slice() to extract the last row.

ecom %>%
  slice(n())
## # A tibble: 1 x 11
##      id referrer device bouncers n_visit n_pages duration country purchase
##   <int> <chr>    <chr>  <chr>      <int>   <dbl>    <dbl> <chr>   <chr>   
## 1  1000 google   mobile true           9    1.00      269 China   false   
## # ... with 2 more variables: order_items <dbl>, order_value <dbl>
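
In newer versions of dplyr (1.0.0 and above), slice_head() and slice_tail() offer the same functionality with a friendlier interface. A minimal sketch that extracts the last row:

ecom %>%
  slice_tail(n = 1)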

Between

between() allows us to test if the values in a column lie between two specific values. In the below example, we draw a random sample of 30 observations and check which of these visits browsed between 5 and 15 pages.

ecom_sample <-
  ecom %>%
  sample_n(30)
  
ecom_sample %>%
  pull(n_pages) %>%
  between(5, 15) 
##  [1] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE
## [12] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE  TRUE FALSE
## [23]  TRUE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE
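
between() is most often used inside filter() to subset rows. The below sketch keeps only the visits where the number of pages browsed lies between 5 and 15:

ecom %>%
  filter(between(n_pages, 5, 15))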

Case When

case_when() is a vectorised alternative to nested if-else statements. It allows us to lay out the conditions clearly and makes the code more readable. In the below example, we create a new column, repeat_visit, from n_visit (the number of previous visits).

ecom %>%
  mutate(
    repeat_visit = case_when(
      n_visit > 0 ~ TRUE,
      TRUE ~ FALSE
    )
  ) %>%
  select(n_visit, repeat_visit) 
## # A tibble: 1,000 x 2
##    n_visit repeat_visit
##      <int> <lgl>       
##  1      10 T           
##  2       9 T           
##  3       0 F           
##  4       3 T           
##  5       9 T           
##  6       5 T           
##  7      10 T           
##  8      10 T           
##  9       3 T           
## 10       6 T           
## # ... with 990 more rows
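
case_when() really shines when there are more than two branches. A minimal sketch that bins duration into categories (the visit_length column and the cut-offs are made up for illustration):

ecom %>%
  mutate(
    visit_length = case_when(
      duration < 120 ~ "short",
      duration < 600 ~ "medium",
      TRUE           ~ "long"
    )
  ) %>%
  count(visit_length)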