Working with Categorical Data using forcats

Introduction

In this post, we will learn to work with categorical (qualitative) data in R using forcats. Let us begin by installing and loading forcats and the other packages we will be using.

Libraries & Code

We will use the following packages:

library(forcats)
library(tibble)
library(magrittr)
library(purrr)
library(dplyr)
library(ggplot2)
library(readr)

Case Study

We will use a case study to explore the various features of the forcats package. You can download the data for the case study from here or directly import the data using the readr package. We will do the following in this case study:

  • compute the frequency of different referrers
  • plot average number of pages browsed for different referrers
  • collapse referrers with low sample size into a single group
  • club traffic from social media websites into a new category
  • group referrers with traffic below a threshold into a single category

Data

ecom <- read_csv('https://raw.githubusercontent.com/rsquaredacademy/datasets/master/web.csv',
                 col_types = list(col_integer(), 
                                  col_factor(levels = c("bing", "direct", "social", "yahoo", "google")), 
                                  col_factor(levels = c("laptop", "mobile", "tablet")), 
                                  col_logical(), col_number(),
                                  col_number(), col_number(), col_character(), 
                                  col_logical(), col_number(), col_number()))

ecom
## # A tibble: 1,000 x 11
##       id referrer device bouncers n_visit n_pages duration country       
##    <int> <fct>    <fct>  <lgl>      <dbl>   <dbl>    <dbl> <chr>         
##  1     1 google   laptop T          10.0     1.00    693   Czech Republic
##  2     2 yahoo    tablet T           9.00    1.00    459   Yemen         
##  3     3 direct   laptop T           0       1.00    996   Brazil        
##  4     4 bing     tablet F           3.00   18.0     468   China         
##  5     5 yahoo    mobile T           9.00    1.00    955   Poland        
##  6     6 yahoo    laptop F           5.00    5.00    135   South Africa  
##  7     7 yahoo    mobile T          10.0     1.00     75.0 Bangladesh    
##  8     8 direct   mobile T          10.0     1.00    908   Indonesia     
##  9     9 bing     mobile F           3.00   19.0     209   Netherlands   
## 10    10 google   mobile T           6.00    1.00    208   Czech Republic
## # ... with 990 more rows, and 3 more variables: purchase <lgl>,
## #   order_items <dbl>, order_value <dbl>
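The col_types list above is matched to the columns by position, so it depends on the column order of the file. As a minimal sketch of an alternative, the column types can be given by name with cols(). The three-row CSV and the names csv_path and toy are made up for illustration; only the column names and factor levels come from the data above.

```r
library(readr)

# Write a tiny CSV to a temporary file so the example needs no network access.
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c("id,referrer,device",
    "1,google,laptop",
    "2,yahoo,tablet",
    "3,bing,mobile"),
  csv_path
)

# Name each column explicitly instead of relying on its position.
toy <- read_csv(
  csv_path,
  col_types = cols(
    id       = col_integer(),
    referrer = col_factor(levels = c("bing", "direct", "social", "yahoo", "google")),
    device   = col_factor(levels = c("laptop", "mobile", "tablet"))
  )
)

# The declared levels are kept even if some of them do not occur in the file.
levels(toy$referrer)
```

Declaring the levels up front fixes their order, which is the order forcats functions such as fct_count() report by default.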

Data Dictionary

  • id: row id
  • referrer: referrer website/search engine
  • device: device used to visit the website
  • bouncers: whether the visit bounced
  • n_visit: number of visits
  • n_pages: number of pages visited
  • duration: time spent on the website (in seconds)
  • country: country of origin
  • purchase: whether visitor purchased
  • order_items: number of items ordered
  • order_value: order value of visitor (in dollars)

Tabulate Referrers

Let us look at the traffic driven by different referrers.

ecom %>%
  pull(referrer) %>%
  fct_count
## # A tibble: 5 x 2
##   f          n
##   <fct>  <int>
## 1 bing     194
## 2 direct   191
## 3 social   200
## 4 yahoo    207
## 5 google   208

To sort the output in descending order of frequency, set the sort argument to TRUE.

ecom %>%
  pull(referrer) %>%
  fct_count(sort = TRUE)
## # A tibble: 5 x 2
##   f          n
##   <fct>  <int>
## 1 google   208
## 2 yahoo    207
## 3 social   200
## 4 bing     194
## 5 direct   191

Reorder Referrers

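We want to examine the average number of pages browsed by visitors from each referrer type. The summary below also records the average time spent on the site (tos) and the number of visits (n) for each referrer.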

refer_summary <- 
  ecom %>%
  group_by(referrer) %>%
  summarise(
    page = mean(n_pages),
    tos = mean(duration),
    n = n()
  )

refer_summary
## # A tibble: 5 x 4
##   referrer  page   tos     n
##   <fct>    <dbl> <dbl> <int>
## 1 bing      6.13   368   194
## 2 direct    6.38   358   191
## 3 social    5.42   355   200
## 4 yahoo     5.99   336   207
## 5 google    5.73   360   208

Let us plot the average number of pages visited by each referrer type.

refer_summary %>%
  ggplot() +
  geom_point(aes(page, referrer))

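The plot above lists the referrers in the order of their factor levels, which makes the averages hard to compare. Use fct_reorder() to reorder the referrer levels by the average number of pages visited.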

refer_summary %>%
  ggplot() +
  geom_point(aes(page, fct_reorder(referrer, page)))
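To see what fct_reorder() does to the levels without drawing the plot, here is a small sketch. The tibble is rebuilt by hand from the referrer and page values printed in the summary above, and the names reordered and reordered_desc are illustrative.

```r
library(forcats)
library(tibble)

# Hand-built copy of the referrer and page columns from the summary above.
refer_summary <- tibble(
  referrer = factor(
    c("bing", "direct", "social", "yahoo", "google"),
    levels = c("bing", "direct", "social", "yahoo", "google")
  ),
  page = c(6.13, 6.38, 5.42, 5.99, 5.73)
)

# Levels sorted by page, smallest first (the default).
reordered <- fct_reorder(refer_summary$referrer, refer_summary$page)
levels(reordered)

# Levels sorted by page, largest first.
reordered_desc <- fct_reorder(refer_summary$referrer, refer_summary$page, .desc = TRUE)
levels(reordered_desc)
```

Only the level order changes; the values of the factor stay the same, which is why the reordering can be done inside aes() without touching the data.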

Plot Referrer Frequency (Descending Order)

Let us look at the distribution of the referrers.

ecom %>%
  pull(referrer) %>%
  fct_count(sort = TRUE)
## # A tibble: 5 x 2
##   f          n
##   <fct>  <int>
## 1 google   208
## 2 yahoo    207
## 3 social   200
## 4 bing     194
## 5 direct   191

Use fct_unique to view the categories or levels of the referrer variable.

ecom %>%
  pull(referrer) %>%
  fct_unique
## [1] bing   direct social yahoo  google
## Levels: bing direct social yahoo google

Since we want to plot the referrers in descending order of frequency, we will use fct_infreq() to reorder by frequency.

ecom %>%
  pull(referrer) %>%
  fct_infreq %>%
  fct_unique
## [1] google yahoo  social bing   direct
## Levels: google yahoo social bing direct

Now that we have reordered the referrers by frequency, let us plot them.

ecom %>%
  mutate(
    ref = referrer %>% 
      fct_infreq
  ) %>%
  ggplot(aes(ref)) +
  geom_bar()

Plot Referrer Frequency (Ascending Order)

Let us look at the categories of the referrer variable.

ecom %>%
  pull(referrer) %>%
  fct_unique
## [1] bing   direct social yahoo  google
## Levels: bing direct social yahoo google

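To plot the referrers in ascending order of frequency, we will use fct_rev(), which reverses the order of the levels. Applied to the original factor, it simply reverses the levels as they were declared; it does not sort them by frequency.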

ecom %>%
  pull(referrer) %>%
  fct_rev %>%
  fct_unique
## [1] google yahoo  social direct bing  
## Levels: google yahoo social direct bing

Let us reorder the referrers by frequency first and then reverse the order before plotting their frequencies.

ecom %>%
  mutate(
    ref = referrer %>% 
      fct_infreq %>% 
      fct_rev
  ) %>%
  ggplot(aes(ref)) +
  geom_bar()
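The difference between fct_rev() on its own and fct_infreq() followed by fct_rev() can be checked on a small reconstruction of the referrer variable. This is a sketch that rebuilds the factor from the counts tabulated earlier instead of downloading the data; the names counts, rev_only, by_freq and ascending are illustrative.

```r
library(forcats)

# Rebuild the referrer factor from the frequencies shown earlier.
counts <- c(bing = 194, direct = 191, social = 200, yahoo = 207, google = 208)
referrer <- factor(rep(names(counts), times = counts), levels = names(counts))

# Reverses the declared level order only.
rev_only <- levels(fct_rev(referrer))

# Most frequent level first.
by_freq <- levels(fct_infreq(referrer))

# Least frequent level first: sort by frequency, then reverse.
ascending <- levels(fct_rev(fct_infreq(referrer)))

rev_only
by_freq
ascending
```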

Case Study 2

Data

traffic <- readr::read_csv('https://raw.githubusercontent.com/rsquaredacademy/datasets/master/web_traffic.csv',
  col_types = list(col_factor(levels = c("affiliates", "bing", "direct", "facebook", "yahoo", "google",
    "instagram", "twitter", "unknown"))))

traffic
## # A tibble: 48,232 x 1
##    traffics
##    <fct>   
##  1 google  
##  2 google  
##  3 google  
##  4 google  
##  5 google  
##  6 google  
##  7 google  
##  8 google  
##  9 google  
## 10 google  
## # ... with 48,222 more rows

Tabulate Referrer

Let us compute the traffic driven by different referrers using fct_count.

traffic %>%
  pull(traffics) %>% 
  fct_count
## # A tibble: 9 x 2
##   f              n
##   <fct>      <int>
## 1 affiliates  7641
## 2 bing        5893
## 3 direct      1350
## 4 facebook    8135
## 5 yahoo       4899
## 6 google      9229
## 7 instagram   3907
## 8 twitter     4521
## 9 unknown     2657

Collapse Referrer Categories

We want to group the referrers into 2 categories:

  • social
  • search

Use fct_collapse() to group categories. Levels that are not named in the call (affiliates, direct and unknown) are left unchanged.

traffic %>%
  pull(traffics) %>%
  fct_collapse(
    social = c("facebook", "twitter", "instagram"),
    search = c("google", "bing", "yahoo")
  ) %>% 
  fct_count
## # A tibble: 5 x 2
##   f              n
##   <fct>      <int>
## 1 affiliates  7641
## 2 search     20021
## 3 direct      1350
## 4 social     16563
## 5 unknown     2657
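If the collapsed factor is needed as a column rather than as a standalone vector, the same call can go inside mutate(). The sketch below assumes a traffic tibble rebuilt from the counts tabulated above (not the downloaded file), and the column name channel is made up for illustration.

```r
library(forcats)
library(dplyr)
library(tibble)

# Rebuild the traffics column from the frequencies shown above.
counts <- c(affiliates = 7641, bing = 5893, direct = 1350, facebook = 8135,
            yahoo = 4899, google = 9229, instagram = 3907, twitter = 4521,
            unknown = 2657)
traffic <- tibble(
  traffics = factor(rep(names(counts), times = counts), levels = names(counts))
)

# Keep the original column and add the collapsed one next to it.
grouped <- traffic %>%
  mutate(
    channel = fct_collapse(
      traffics,
      social = c("facebook", "twitter", "instagram"),
      search = c("google", "bing", "yahoo")
    )
  ) %>%
  count(channel)

grouped
```

Keeping both columns makes it easy to check afterwards which original referrers ended up in each group.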

Lump Infrequent Referrer Types

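Let us group together referrer types that drive low traffic to the website. Use fct_lump() to lump together categories. Called without any other arguments, it lumps the least frequent levels into Other for as long as Other remains the smallest level.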

traffic %>%
  pull(traffics) %>% 
  fct_count
## # A tibble: 9 x 2
##   f              n
##   <fct>      <int>
## 1 affiliates  7641
## 2 bing        5893
## 3 direct      1350
## 4 facebook    8135
## 5 yahoo       4899
## 6 google      9229
## 7 instagram   3907
## 8 twitter     4521
## 9 unknown     2657

traffic %>%
  pull(traffics) %>% 
  fct_lump %>% 
  table
## .
## affiliates       bing   facebook      yahoo     google  instagram 
##       7641       5893       8135       4899       9229       3907 
##    twitter    unknown      Other 
##       4521       2657       1350
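To see why only direct ends up in Other, compare what happens when one or two of the smallest levels are lumped. This is a sketch on a factor rebuilt from the counts above; the names counts, lumped and sorted are illustrative.

```r
library(forcats)

# Rebuild the traffics factor from the frequencies shown above.
counts <- c(affiliates = 7641, bing = 5893, direct = 1350, facebook = 8135,
            yahoo = 4899, google = 9229, instagram = 3907, twitter = 4521,
            unknown = 2657)
traffics <- factor(rep(names(counts), times = counts), levels = names(counts))

# Default behaviour: no n and no prop.
lumped <- fct_lump(traffics)
table(lumped)

# Counts from smallest to largest.
sorted <- sort(counts)

# direct alone is smaller than every remaining level.
sorted[1]

# direct + unknown would be larger than instagram, the next smallest level,
# so Other would no longer be the smallest group.
sum(sorted[1:2])
sorted[3]
```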

Retain top 3 referrers

We want to retain the top 3 referrers and combine the rest of them into a single category. Sorted by traffic, the referrers are:

## # A tibble: 9 x 2
##   f              n
##   <fct>      <int>
## 1 google      9229
## 2 facebook    8135
## 3 affiliates  7641
## 4 bing        5893
## 5 yahoo       4899
## 6 twitter     4521
## 7 instagram   3907
## 8 unknown     2657
## 9 direct      1350

Use fct_lump() and set the n argument to 3 to retain the top 3 categories and combine the rest.

traffic %>% 
  pull(traffics) %>% 
  fct_lump(n = 3) %>% 
  table
## .
## affiliates   facebook     google      Other 
##       7641       8135       9229      23227
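Recent releases of forcats also provide fct_lump_n(), which covers the n case on its own. The sketch below assumes forcats 0.5.0 or later and a factor rebuilt from the counts above; it checks that both spellings give the same result here.

```r
library(forcats)

# Rebuild the traffics factor from the frequencies shown above.
counts <- c(affiliates = 7641, bing = 5893, direct = 1350, facebook = 8135,
            yahoo = 4899, google = 9229, instagram = 3907, twitter = 4521,
            unknown = 2657)
traffics <- factor(rep(names(counts), times = counts), levels = names(counts))

# The call used in this post.
top3_old <- fct_lump(traffics, n = 3)

# The more specific function (assumes forcats >= 0.5.0).
top3_new <- fct_lump_n(traffics, n = 3)

identical(top3_old, top3_new)
table(top3_new)
```

The more specific function states the intent in its name, which can help when n and prop calls are mixed in the same script.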

Lump Referrer Types with less than 10% traffic

Let us combine referrers that drive less than 10% of the traffic to the website. The share of each referrer is shown below.

## # A tibble: 9 x 3
##   f              n percent
##   <fct>      <int>   <dbl>
## 1 affiliates  7641   15.8 
## 2 bing        5893   12.2 
## 3 direct      1350    2.80
## 4 facebook    8135   16.9 
## 5 yahoo       4899   10.2 
## 6 google      9229   19.1 
## 7 instagram   3907    8.10
## 8 twitter     4521    9.37
## 9 unknown     2657    5.51

Since we are working with the proportion of traffic rather than the raw counts, we use the prop argument and set it to 0.1 to retain only the categories that account for more than 10% of the traffic and combine the rest.

traffic %>%
  pull(traffics) %>% 
  fct_lump(prop = 0.1) %>% 
  table
## .
## affiliates       bing   facebook      yahoo     google      Other 
##       7641       5893       8135       4899       9229      12435
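One way to double-check the result is to compute the shares by hand and compare them with what fct_lump() keeps. This is a sketch on a factor rebuilt from the counts above; the percent column is derived with mutate() as an assumption about how such a table could be produced, and the names shares, below_10 and kept are illustrative.

```r
library(forcats)
library(dplyr)

# Rebuild the traffics factor from the frequencies shown above.
counts <- c(affiliates = 7641, bing = 5893, direct = 1350, facebook = 8135,
            yahoo = 4899, google = 9229, instagram = 3907, twitter = 4521,
            unknown = 2657)
traffics <- factor(rep(names(counts), times = counts), levels = names(counts))

# Share of total traffic per referrer, in percent.
shares <- fct_count(traffics) %>%
  mutate(percent = round(n / sum(n) * 100, 2))
shares

# Referrers that fall below the 10% threshold.
below_10 <- shares %>%
  filter(percent < 10) %>%
  pull(f) %>%
  as.character()

# Levels that survive the lumping.
lumped <- fct_lump(traffics, prop = 0.1)
kept <- setdiff(levels(lumped), "Other")

below_10
kept
```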

Retain 3 Referrer Types with lowest traffic

What if we want to retain the 3 referrers that drive the lowest traffic to the website and combine the rest?

## # A tibble: 9 x 2
##   f              n
##   <fct>      <int>
## 1 direct      1350
## 2 unknown     2657
## 3 instagram   3907
## 4 twitter     4521
## 5 yahoo       4899
## 6 bing        5893
## 7 affiliates  7641
## 8 facebook    8135
## 9 google      9229

We will still use the n argument, but instead of 3 we now specify -3. A negative value tells fct_lump() to keep the least frequent levels and lump the most frequent ones.

traffic %>% 
  pull(traffics) %>% 
  fct_lump(n = -3) %>% 
  table
## .
##    direct instagram   unknown     Other 
##      1350      3907      2657     40318
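With a negative n, the lumped group holds the busiest referrers, so the default label Other can be misleading. As a sketch, the other_level argument of fct_lump() can give the group a more descriptive name. The label "high traffic" is made up for illustration, and the factor is rebuilt from the counts above.

```r
library(forcats)

# Rebuild the traffics factor from the frequencies shown above.
counts <- c(affiliates = 7641, bing = 5893, direct = 1350, facebook = 8135,
            yahoo = 4899, google = 9229, instagram = 3907, twitter = 4521,
            unknown = 2657)
traffics <- factor(rep(names(counts), times = counts), levels = names(counts))

# Keep the 3 least frequent referrers and label the rest explicitly.
lowest3 <- fct_lump(traffics, n = -3, other_level = "high traffic")
table(lowest3)
```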

Retain Referrer Types with less than 10% traffic

Let us see how to retain referrers that drive less than 10% of the traffic to the website and combine the rest into a single group.

## # A tibble: 9 x 3
##   f              n percent
##   <fct>      <int>   <dbl>
## 1 affiliates  7641   15.8 
## 2 bing        5893   12.2 
## 3 direct      1350    2.80
## 4 facebook    8135   16.9 
## 5 yahoo       4899   10.2 
## 6 google      9229   19.1 
## 7 instagram   3907    8.10
## 8 twitter     4521    9.37
## 9 unknown     2657    5.51

Instead of setting prop to 0.1, we will set it to -0.1. A negative value keeps the levels whose share is below the threshold and lumps the others.

traffic %>% 
  pull(traffics) %>% 
  fct_lump(prop = -0.1) %>% 
  table
## .
##    direct instagram   twitter   unknown     Other 
##      1350      3907      4521      2657     35797
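For this data, prop = 0.1 and prop = -0.1 split the referrers into two complementary sets. The sketch below checks this on a factor rebuilt from the counts above; the names keep_large and keep_small are illustrative.

```r
library(forcats)

# Rebuild the traffics factor from the frequencies shown above.
counts <- c(affiliates = 7641, bing = 5893, direct = 1350, facebook = 8135,
            yahoo = 4899, google = 9229, instagram = 3907, twitter = 4521,
            unknown = 2657)
traffics <- factor(rep(names(counts), times = counts), levels = names(counts))

# Levels kept when lumping the small referrers.
keep_large <- setdiff(levels(fct_lump(traffics, prop = 0.1)), "Other")

# Levels kept when lumping the large referrers.
keep_small <- setdiff(levels(fct_lump(traffics, prop = -0.1)), "Other")

# No referrer is kept by both calls.
intersect(keep_large, keep_small)

# Together they cover every original level.
setequal(union(keep_large, keep_small), levels(traffics))
```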