Chapter 5 Welcome to the Tidyverse

“Data is like garbage. You’d better know what you are going to do with it before you collect it.”

- Mark Twain, 1835 - 1910

5.1 The Tidyverse

The tidyverse is currently a collection of eight core packages based largely on the programming vision of Hadley Wickham (Fig 5.1). The packages are:

  • dplyr Grammar and functions for data manipulation. Importantly, dplyr functions are analogous to those from the language SQL.
  • forcats Tools for solving common problems with factors.
  • ggplot2 A system for “declaratively creating graphics”, based on the book The Grammar of Graphics (Wilkinson 2012).
  • purrr An enhancement of R’s functional programming (FP) toolkit.
  • readr Methods for reading rectangular data.
  • stringr Functions to facilitate working with strings.
  • tibble A “modern re-imagining of the data frame.”
  • tidyr A set of functions for “tidying data.”

Figure 5.1: The main packages of the tidyverse.

The tidyverse also contains several useful ancillary packages, including lubridate, reshape2, hms, blob, magrittr, and glue. Installing tidyverse installs both the main and ancillary packages, but loading it with library(tidyverse) attaches only the eight main tidyverse packages.

This chapter is not meant to be an authoritative summary of the tidyverse. Coverage is mostly limited to the magrittr, tibble and dplyr packages. The ggplot2 package is dealt with in greater detail in Ch 7. Wickham, Çetinkaya-Rundel, and Grolemund (2023) provide a succinct but thorough introduction to the tidyverse in the open source book R for Data Science.

The tidyverse packages can be installed using:

install.packages("tidyverse")

5.2 Pipes

5.2.1 Basic Pipe

An important component of the tidyverse is the widespread use of the forward pipe operator |>. In programming, a pipe is a set of commands connected in series, where the output of one command is the input of the next [1]. In many cases, pipes allow a clearer representation of a sequence of operations [2]. Incidentally, the |> pipe, from the base package, was motivated by an older forward pipe operator from the tidyverse package magrittr, %>%. Since R 4.1, |> has been available as R's native pipe operator and can be used throughout the tidyverse (although %>% will still work if magrittr is loaded) [3]. Although |> is more syntactically streamlined than %>%, several features of %>% do not exist for |>, including the flexible dot placeholder operator [4]. The native pipe gained its own placeholder, _, in R 4.2, but it is more restricted: it must be supplied to a named argument.

Consider the operation \(\log_e(\exp(1))\), which equals one. We could write this as,

1 |> exp() |> log()
[1] 1

Here the number \(1\) is piped into the function exp(), giving \(\exp(1) = e^1 = e\), and this result is piped into the function log(), giving \(\log_e e = 1\). Because exp() and log() each take numeric data as their first argument, and that argument is supplied by the previous pipe segment, we do not need to specify x explicitly. Thus, when a function requires only the result of the previous pipe segment as an argument, x |> f() is equivalent to \(f(x)\). When additional arguments must be specified, x |> f(y) is equivalent to \(f(x, y)\), and x |> f(y) |> g(z) is equivalent to \(g(f(x, y), z)\). For instance,

10 |> log(base = 2)
[1] 3.321928
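The equivalences above can be checked directly. The sketch below is illustrative only: scale_by() and add_to() are hypothetical helper functions that do not appear elsewhere in this chapter, and the last line shows one possible way (an anonymous function, available from R 4.1) to pipe a value into an argument other than the first.

```r
# Hypothetical two-argument helpers, defined only for this illustration
scale_by <- function(x, y) x * y
add_to   <- function(x, z) x + z

# x |> f(y) |> g(z) ...
piped <- 3 |> scale_by(2) |> add_to(1)

# ... should match g(f(x, y), z)
nested <- add_to(scale_by(3, 2), 1)

identical(piped, nested)

# Piping into a non-first argument with an anonymous function:
# here 2 is supplied as the base, so this computes log(10, base = 2)
log_base2_of_10 <- 2 |> (\(b) log(10, base = b))()
log_base2_of_10
```

Wrapping the anonymous function in parentheses and calling it with () is needed because the right-hand side of |> must be a function call.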

Recall that the forward pipe works sequentially: each segment operates on the result of the previous pipe segment.

head(Loblolly)
   height age Seed
1    4.51   3  301
15  10.89   5  301
29  28.72  10  301
43  41.74  15  301
57  52.70  20  301
71  60.92  25  301
Loblolly |>
  head() |>
  tail(2)
   height age Seed
57  52.70  20  301
71  60.92  25  301
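As a quick check, the pipeline can be compared with its nested equivalent. This is a minimal sketch using only base R and the built-in Loblolly dataset; the object names piped and nested are chosen here for illustration.

```r
# Pipeline form: read left to right, "take Loblolly, and then head, and then tail"
piped <- Loblolly |>
  head() |>
  tail(2)

# Nested form: the same operations, read from the inside out
nested <- tail(head(Loblolly), 2)

identical(piped, nested)
```

Both forms return the last two of the first six rows; the piped version lists the steps in the order they are carried out.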

We can assign the result of a pipeline to a global variable. For instance, consider the script below (Fig 5.2).

x <- seq(1, 10, length.out = 100)
y <- x |> sin()
plot(x, y, type = "l", ylab = "sin(x)", xlab = "x (radians)")

Figure 5.2: Creating a global variable (object) resulting from a pipeline.

5.3 tibble

The tidyverse package tibble provides an alternative to the data.frame format of data storage, called a tibble. Tibbles have the classes tbl_df, tbl, and data.frame, allowing them to possess additional characteristics, including enhanced printing. Tibble printing conveys more information than dataframe printing (see below). Additional distinguishing characteristics of tibbles include: 1) character vectors are never automatically converted to factors (this is also the data.frame() default as of R 4.0), 2) recycling only happens for length-1 inputs, and 3) there is no partial matching when $ is used to index tibble columns by name. The function tibble() generates tibbles. The function as_tibble() coerces a dataframe to a tibble.

data <- data.frame(numbers = 1:3, letters = c("a","b","c"),
                   date = as.Date(c("2021-12-1", "2021-12-2",
                                    "2021-12-2"),
                           format = "%Y-%m-%d"))
data
  numbers letters       date
1       1       a 2021-12-01
2       2       b 2021-12-02
3       3       c 2021-12-02
library(tidyverse)
datat <- as_tibble(data)
datat
# A tibble: 3 x 3
  numbers letters date      
    <int> <chr>   <date>    
1       1 a       2021-12-01
2       2 b       2021-12-02
3       3 c       2021-12-02
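The distinguishing characteristics listed above can be illustrated with a short script. This is a sketch, assuming the tibble package is installed; the object names (df, tb, partial_df, and so on) are chosen here for illustration only.

```r
library(tibble)

df <- data.frame(numbers = 1:3, letters = c("a", "b", "c"))
tb <- as_tibble(df)

# A tibble is still a data frame, with additional classes
class(tb)

# Partial matching with $: allowed for data frames, not for tibbles
partial_df <- df$num                      # matches the column `numbers`
partial_tb <- suppressWarnings(tb$num)    # NULL (a warning is normally issued)

# Recycling: a length-1 input is recycled to the full column length ...
recycled <- tibble(a = 1:4, b = 0)

# ... but a length-2 input is rejected rather than silently recycled
recycle_error <- tryCatch(
  tibble(a = 1:4, b = 1:2),
  error = function(e) conditionMessage(e)
)
recycle_error
```

By contrast, data.frame(a = 1:4, b = 1:2) would recycle b without complaint, which can hide mistakes in column lengths.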

The package dplyr is a core tidyverse package for data manipulation [5]. Table ?? lists some useful dplyr functions.

References

Ritchie, Dennis M. 1984. “The UNIX System: The Evolution of the UNIX Time-Sharing System.” AT&T Bell Laboratories Technical Journal 63 (8): 1577–93.
Wickham, Hadley, Mine Çetinkaya-Rundel, and Garrett Grolemund. 2023. R for Data Science. O’Reilly Media, Inc.
Wilkinson, Leland. 2012. The Grammar of Graphics. Springer.

  1. Historically, pipe programming dates back to early developments in Unix operating systems (Ritchie 1984), wherein pipes are codified as vertical bars "|". Pipes are widely used in the languages F#, Julia, and JavaScript, among others.

  2. In particular, when you see |> it is helpful to think “and then.”

  3. The RStudio shortcut for %>% is Ctrl+Shift+M. To make RStudio insert |> when using Ctrl+Shift+M (or some other keyboard shortcut), one can modify the appropriate settings in Tools > Global Options > Code.

  4. In general, the dot placeholder operator, ., allows the piped value to be passed to an argument other than the first, so that \(f(y, x)\) can be specified as x %>% f(y, .). For example: 2 %>% log(10, base = .). In this script the number 2 will be piped into the base argument.

  5. dplyr has largely replaced the now retired plyr package.