Chapter 4 Basic Data Management

“I think, therefore I R.”

- William B. King, Psychologist and R enthusiast

An important characteristic of R is its capacity to efficiently manage and analyze large, complex datasets. In this chapter I list a few functions and approaches useful for data management in base R. Data management considerations for the tidyverse are given in Ch 5.

4.1 Operations on Arrays, Lists and Vectors

Operations can be applied individually to every row or column of an array, or to every component of a list or atomic vector, using a number of time-saving methods.

4.1.1 The apply Family of Functions

4.1.1.1 apply()

Operations can be performed quickly on the rows and columns of two-dimensional arrays with the function apply(). The function requires three arguments.

  • The first argument, X, specifies an array to be analyzed.
  • The second argument, MARGIN, connotes whether rows or columns are to be analyzed. MARGIN = 1 indicates rows, MARGIN = 2 indicates columns, while MARGIN = c(1, 2) indicates rows and columns.
  • The third argument, FUN, defines a function to be applied to the margins of the object in the first argument.

Example 4.1
Consider the asbio::bats dataset, which contains forearm length data, in millimeters, for northern myotis bats (Myotis septentrionalis), along with corresponding bat ages in days.

library(asbio)
data(bats)
head(bats)
  days forearm.length
1    1           10.5
2    1           11.0
3    1           12.3
4    1           13.7
5    1           14.2
6    1           14.8

Here we obtain minimum values for the days and forearm.length columns.

apply(bats, 2, min)
          days forearm.length 
           1.0           10.5 

It is straightforward to change the third argument in apply() to obtain different summaries, like the mean.

apply(bats, 2, mean)
          days forearm.length 
      13.57895       23.60263 

or the standard deviation

apply(bats, 2, sd)
          days forearm.length 
     12.461035       8.434725 

Several summary statistical functions exist for numerical arrays that can be used, in some instances, in place of apply(). These include rowMeans() and colMeans(), which give the sample means of rows and columns, respectively, and rowSums() and colSums(), which give the sums of rows and columns, respectively. For instance:

colMeans(bats)
          days forearm.length 
      13.57895       23.60263 
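These shortcut functions are not only convenient but typically faster than the equivalent apply() calls. A minimal sketch with a small hypothetical matrix, showing the row-wise equivalence:

```r
m <- matrix(1:6, nrow = 2)   # a 2 x 3 matrix with rows (1, 3, 5) and (2, 4, 6)
rowSums(m)                   # row totals: 9 12
apply(m, 1, sum)             # the equivalent apply() call: 9 12
```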

\(\blacksquare\)

4.1.1.2 lapply()

The function lapply() allows one to sweep functions through list components. It has two main arguments:

  • The first argument X specifies a list to be analyzed.
  • The second argument, FUN, is used to define a function to be applied to each element in X.

Example 4.2

Consider the following simple list, whose three components have different lengths.

x <- list(a = 1:8, norm.obs = rnorm(10), logic = c(TRUE, TRUE, FALSE, FALSE))
x
$a
[1] 1 2 3 4 5 6 7 8

$norm.obs
 [1]  0.2891653  0.8200166  0.9415476 -0.3100035 -1.0827273  0.2774772
 [7] -0.7331662  0.6197893  1.5590174 -0.9123872

$logic
[1]  TRUE  TRUE FALSE FALSE

Here we sweep the mean function through the list:

lapply(x, mean)
$a
[1] 4.5

$norm.obs
[1] 0.1468729

$logic
[1] 0.5

Note the Boolean outcomes in logic have been coerced to numeric outcomes. Specifically, TRUE = 1 and FALSE = 0. Here are the 1st, 2nd (median), and 3rd quartiles:

lapply(x, quantile, probs = 1:3/4)
$a
 25%  50%  75% 
2.75 4.50 6.25 

$norm.obs
       25%        50%        75% 
-0.6273755  0.2833213  0.7699598 

$logic
25% 50% 75% 
0.0 0.5 1.0 

\(\blacksquare\)

4.1.1.3 sapply()

The function sapply() is a user-friendly wrapper for lapply() that, where possible, returns a vector or array instead of an lapply() list.

sapply(x, quantile, probs = 1:3/4)
       a   norm.obs logic
25% 2.75 -0.6273755   0.0
50% 4.50  0.2833213   0.5
75% 6.25  0.7699598   1.0
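When the shape of the result matters, sapply()'s automatic simplification can be a liability, since it silently falls back to a list when component results differ. The base R function vapply() makes the expected result type explicit; a brief sketch:

```r
x <- list(a = 1:8, logic = c(TRUE, TRUE, FALSE, FALSE))
# FUN.VALUE = numeric(1) declares that mean() must return a single number;
# vapply() errors if any component violates this, rather than changing shape
vapply(x, mean, FUN.VALUE = numeric(1))
#    a logic
#  4.5   0.5
```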

4.1.1.4 tapply()

The tapply() function allows summarization of a one-dimensional array (e.g., a column or row from a matrix) with respect to levels of a categorical variable. The function requires three arguments.

  • The first argument, X, defines a one dimensional array to be analyzed.
  • The second argument, INDEX, should provide a list of one or more factors (see example below), each with the same length as X.
  • The third argument, FUN, is used to specify a function to be applied to X for each level in INDEX.

Example 4.3
Consider the dataset asbio::heart, which documents pulse rates for twenty-four subjects at four time periods following administration of an experimental treatment. The treatments comprised two active heart medications and a control. Here are average heart rates for the treatments.

data(heart)
with(heart, tapply(rate, drug, mean))
    AX23     BWW9     Ctrl 
76.28125 81.03125 71.90625 

Here are the mean heart rates for treatments, for each time frame. Note that the second argument is defined as a list with two components, each of which can be coerced to be a factor.

with(heart, tapply(rate, list(drug = drug, time = time), mean))
      time
drug      t1     t2     t3     t4
  AX23 70.50 80.500 81.000 73.125
  BWW9 81.75 84.000 78.625 79.750
  Ctrl 72.75 72.375 71.500 71.000

\(\blacksquare\)

The function aggregate() can be considered a more sophisticated extension of tapply(). It allows objects under consideration to be expressed as functions of explanatory factors, and contains additional arguments for data specification and time series analyses.

Example 4.4
Here we use aggregate() to get identical (but reformatted) results to the prior example.

aggregate(rate ~ drug + time, mean, data = heart)
   drug time   rate
1  AX23   t1 70.500
2  BWW9   t1 81.750
3  Ctrl   t1 72.750
4  AX23   t2 80.500
5  BWW9   t2 84.000
6  Ctrl   t2 72.375
7  AX23   t3 81.000
8  BWW9   t3 78.625
9  Ctrl   t3 71.500
10 AX23   t4 73.125
11 BWW9   t4 79.750
12 Ctrl   t4 71.000

Importantly, the first argument, rate ~ drug + time is in the form of a formula:

f.rate <- with(heart, rate ~ drug + time)
class(f.rate)
[1] "formula"

Here the tilde operator, ~, allows expression of the formulaic framework: y ~ model, where y is a response variable and model specifies a system of (generally) one or more predictor variables.

\(\blacksquare\)

4.1.2 outer()

Another important function for matrix operations is outer(). The function allows creation of an array that contains all possible combinations of two atomic vectors or arrays with respect to a particular function. The outer() function has three required arguments.

  • The first two arguments, X and Y define arrays or atomic vectors. X and Y can be identical if one wishes to examine pairwise operations of the array elements (see example below).
  • The third argument, FUN, specifies a function to be used in operations.

Example 4.5
Suppose I wish to find the means of all possible pairs of observations from an atomic vector. I could use the following commands:

x <- c(1, 2, 3, 5, 4)
outer(x, x, "+")/2
     [,1] [,2] [,3] [,4] [,5]
[1,]  1.0  1.5  2.0  3.0  2.5
[2,]  1.5  2.0  2.5  3.5  3.0
[3,]  2.0  2.5  3.0  4.0  3.5
[4,]  3.0  3.5  4.0  5.0  4.5
[5,]  2.5  3.0  3.5  4.5  4.0

The argument FUN = "+" indicates that we wish to add elements to each other. We divide these sums by two to obtain means. Note that the diagonal of the output matrix contains the original elements of x, because the mean of a number and itself is the original number. The upper and lower triangles are identical because the mean of elements a and b will be the same as the mean of the elements b and a. Note that the result outer(x, x, "*") can also be obtained using x %o% x because %o% is the matrix algebra outer product operator in R.

outer(x, x, "*")
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    2    3    5    4
[2,]    2    4    6   10    8
[3,]    3    6    9   15   12
[4,]    5   10   15   25   20
[5,]    4    8   12   20   16
x %o% x
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    2    3    5    4
[2,]    2    4    6   10    8
[3,]    3    6    9   15   12
[4,]    5   10   15   25   20
[5,]    4    8   12   20   16

\(\blacksquare\)

4.1.3 stack(), unstack() and reshape()

When manipulating lists and dataframes it is often useful to move between so-called “long” and “wide” data table formats. These operations can be handled with the functions stack() and unstack(). Specifically, stack() concatenates multiple vectors into a single vector, along with a factor indicating where each observation originated, whereas unstack() reverses this process.

Example 4.6
Consider the 4 x 4 dataframe below.

dataf <- data.frame(matrix(nrow = 4, data = rnorm(16)))
names(dataf) <- c("col1", "col2", "col3", "col4")
dataf
        col1      col2       col3       col4
1  0.8803934 0.6073154 -0.4781750  0.4045848
2 -0.5656263 0.6232219 -0.3142224  0.3654526
3 -0.6916814 0.5770851 -0.3288035 -1.6507825
4 -1.6405416 1.3754006 -0.7131739  0.1169549

Here I stack dataf into a long table format.

sdataf <- stack(dataf)
sdataf
       values  ind
1   0.8803934 col1
2  -0.5656263 col1
3  -0.6916814 col1
4  -1.6405416 col1
5   0.6073154 col2
6   0.6232219 col2
7   0.5770851 col2
8   1.3754006 col2
9  -0.4781750 col3
10 -0.3142224 col3
11 -0.3288035 col3
12 -0.7131739 col3
13  0.4045848 col4
14  0.3654526 col4
15 -1.6507825 col4
16  0.1169549 col4

Here I unstack sdataf.

unstack(sdataf)
        col1      col2       col3       col4
1  0.8803934 0.6073154 -0.4781750  0.4045848
2 -0.5656263 0.6232219 -0.3142224  0.3654526
3 -0.6916814 0.5770851 -0.3288035 -1.6507825
4 -1.6405416 1.3754006 -0.7131739  0.1169549

The function reshape() can handle both stacking and unstacking operations. Here I stack dataf. The arguments timevar, idvar, and v.names are used to provide recognizable identifiers for the columns in the wide table format, observations within those columns, and responses for those combinations.

reshape(dataf, direction = "long",
        varying = list(names(dataf)),
        timevar = "Column",
        idvar = "Column obs.",
        v.names = "Response")
    Column   Response Column obs.
1.1      1  0.8803934           1
2.1      1 -0.5656263           2
3.1      1 -0.6916814           3
4.1      1 -1.6405416           4
1.2      2  0.6073154           1
2.2      2  0.6232219           2
3.2      2  0.5770851           3
4.2      2  1.3754006           4
1.3      3 -0.4781750           1
2.3      3 -0.3142224           2
3.3      3 -0.3288035           3
4.3      3 -0.7131739           4
1.4      4  0.4045848           1
2.4      4  0.3654526           2
3.4      4 -1.6507825           3
4.4      4  0.1169549           4

\(\blacksquare\)

4.2 Other Simple Data Management Functions

4.2.1 replace()

One can use the function replace() to replace elements in an atomic vector based, potentially, on Boolean logic. The function requires three arguments.

  • The first argument, x, specifies the vector to be analyzed.
  • The second argument, list, connotes which elements need to be replaced. A logical argument can be used here as a replacement index.
  • The third argument, values, defines the replacement value(s).

Example 4.7
For instance:

Age <- c(21, 19, 25, 26, 18, 19)
replace(Age, Age < 25, "R is Cool")
[1] "R is Cool" "R is Cool" "25"        "26"        "R is Cool" "R is Cool"

Of course, one can also use square brackets for this operation. Note that in both cases the character replacement value coerces the entire vector to character, hence the quoted ages in the output.

Age[Age < 25] <- "R is Cool"
Age
[1] "R is Cool" "R is Cool" "25"        "26"        "R is Cool" "R is Cool"

\(\blacksquare\)

4.2.2 which()

The function which() can be used with logical expressions to obtain the indices of elements in a data storage object that meet some criterion.

Example 4.8
For instance:

Age <- c(21, 19, 25, 26, 18, 19)
w <- which(Age <= 21)
w
[1] 1 2 5 6

Elements one, two, five, and six meet this criterion. We can now subset based on the index w.

Age[w]
[1] 21 19 18 19

To find which element in Age is closest to 24 I could do something like:

which(abs(Age - 24) == min(abs(Age - 24)))
[1] 3
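The same result can be obtained more directly with which.min(), which returns the index of the (first) smallest element:

```r
Age <- c(21, 19, 25, 26, 18, 19)
which.min(abs(Age - 24))   # index of the element closest to 24
# [1] 3
```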

\(\blacksquare\)

4.2.3 sort()

By default, the function sort() sorts the data in an atomic vector into ascending alphanumeric order.

sort(Age)
[1] 18 19 19 21 25 26

Data can be sorted in a descending order by specifying decreasing = TRUE.

sort(Age, decreasing = T)
[1] 26 25 21 19 19 18

4.2.4 rank()

The function rank() gives the ascending alphanumeric ranks of elements in a vector. Ties are given the average of their ranks. This operation is important in rank-based permutation analyses.

rank(Age)
[1] 4.0 2.5 5.0 6.0 1.0 2.5

The second and last observations are tied as the second-smallest values in Age, occupying ranks 2 and 3. Thus, their average rank is (2 + 3)/2 = 2.5.

4.2.5 order()

The function order() is similar to which() in that it returns element indices. Specifically, it gives the indices that arrange a vector into alphanumeric order, allowing one to sort a vector, matrix or dataframe into ascending or descending order, based on one or several ordering vectors.

Example 4.9
Consider the dataframe below which lists plant percent cover data for four plant species at three sites. In accordance with the field.data example from Ch 3, plant species are identified with four letter codes, corresponding to the first two letters of the Linnaean genus and species names.

field.data <- data.frame(code = c("ACMI", "ELSC", "CAEL", "TACE"),
                         site1 = c(12, 13, 14, 11),
                         site2 = c(0, 20, 4, 5),
                         site3 = c(20, 10, 30, 0))
field.data
  code site1 site2 site3
1 ACMI    12     0    20
2 ELSC    13    20    10
3 CAEL    14     4    30
4 TACE    11     5     0

Assume that we wish to sort the data with respect to an alphanumeric ordering of species codes. Here we obtain the ordering of the codes

o <- order(field.data$code)
o
[1] 1 3 2 4

Now we can sort the rows of field.data based on this ordering.

field.data[o,]
  code site1 site2 site3
1 ACMI    12     0    20
3 CAEL    14     4    30
2 ELSC    13    20    10
4 TACE    11     5     0
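Since order() accepts several vectors, ties in the first vector can be broken by the second. A small sketch with hypothetical data:

```r
df <- data.frame(grp = c("b", "a", "b", "a"), val = c(2, 1, 1, 2))
df[order(df$grp, df$val), ]   # sort by grp, then by val within each grp
#   grp val
# 2   a   1
# 4   a   2
# 3   b   1
# 1   b   2
```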

\(\blacksquare\)

4.2.6 unique()

To identify the unique values in a dataset we can use the function unique().

Example 4.10
Below is an atomic vector listing species from a bird survey on islands in southeast Alaska. Species ciphers follow the same coding method used in Example 4.9. Note that there are a large number of repeats.

AK.bird <- c("GLGU", "MEGU", "DOCO", "PAJA", "COLO", "BUFF", "COGO", "WHSC", "TUSW",
"GRSC", "GRTE", "REME", "BLOY", "REPH", "SEPL", "LESA", "ROSA", "WESA", "WISN",
"BAEA", "SHOW", "GLGU", "MEGU", "PAJA", "DOCO", "GRSC", "GRTE", "BUFF", "MADU",
"TUSW", "REME", "SEPL", "REPH","ROSA", "LESA", "COSN", "BAEA", "ROHA")

length(AK.bird)
[1] 38

Applying unique() we obtain a listing of the 24 unique bird species.

unique(AK.bird)
 [1] "GLGU" "MEGU" "DOCO" "PAJA" "COLO" "BUFF" "COGO" "WHSC" "TUSW" "GRSC"
[11] "GRTE" "REME" "BLOY" "REPH" "SEPL" "LESA" "ROSA" "WESA" "WISN" "BAEA"
[21] "SHOW" "MADU" "COSN" "ROHA"
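The companion function duplicated() flags repeats directly; indeed, unique(x) is equivalent to x[!duplicated(x)]. A sketch using a small subset of the bird codes:

```r
AK.sub <- c("GLGU", "MEGU", "GLGU")   # a small subset for illustration
duplicated(AK.sub)                    # FALSE FALSE TRUE
AK.sub[!duplicated(AK.sub)]           # "GLGU" "MEGU", same as unique(AK.sub)
```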

\(\blacksquare\)

4.2.7 match()

Given two vectors, the function match() indexes where objects in the second vector appear in the elements of the first vector. For instance:

x <- c(6, 5, 4, 3, 2, 7)
y <- c(2, 1, 4, 3, 5, 6)
m <- match(y, x)
m
[1]  5 NA  3  4  2  1

The number 2 (the 1st element in y) is the 5th element of x, thus the number 5 is placed 1st in the vector m created by match(). The number 1 (the 2nd element of y) does not occur in x, so the 2nd element of m is NA. The number 4 is the 3rd element of both y and x; thus, the number 3 is given as the third element of m, and so on.

Example 4.11
The usefulness of match() may seem unclear at first, but consider a scenario in which I want to convert species code identifiers in field data into actual species names. The following dataframe is a species list that matches four letter species codes to scientific names. Note that the list contains more species than the field.data dataset used to demonstrate the function order().

species.list <- data.frame(code = c("ACMI", "ASFO", "ELSC", "ERRY", "CAEL",
"CAPA", "TACE"), names = c("Achillea millefolium", "Aster foliaceus",
                           "Elymus scribneri", "Erigeron rydbergii",
                           "Carex elynoides", "Carex paysonis",
                           "Taraxacum ceratophorum"))

species.list
  code                  names
1 ACMI   Achillea millefolium
2 ASFO        Aster foliaceus
3 ELSC       Elymus scribneri
4 ERRY     Erigeron rydbergii
5 CAEL        Carex elynoides
6 CAPA         Carex paysonis
7 TACE Taraxacum ceratophorum

Here I add a column in the field.data of the actual species names using match().

m <- match(field.data$code, species.list$code)
field.data.new <- field.data # make a copy of field data
field.data.new$species.name <- species.list$names[m]
field.data.new
  code site1 site2 site3           species.name
1 ACMI    12     0    20   Achillea millefolium
2 ELSC    13    20    10       Elymus scribneri
3 CAEL    14     4    30        Carex elynoides
4 TACE    11     5     0 Taraxacum ceratophorum

\(\blacksquare\)

4.2.8 which() and %in%

We can use the %in% operator in conjunction with the function which() to achieve the same results as match().

m <- which(species.list$code %in% field.data$code)
field.data.new$species.name <- species.list$names[m]
field.data.new
  code site1 site2 site3           species.name
1 ACMI    12     0    20   Achillea millefolium
2 ELSC    13    20    10       Elymus scribneri
3 CAEL    14     4    30        Carex elynoides
4 TACE    11     5     0 Taraxacum ceratophorum

Note that the arrangement of arguments is reversed in match() and which(). In the former we have match(field.data$code, species.list$code); in the latter we have which(species.list$code %in% field.data$code). The two approaches also differ subtly: match() returns indices in the order of its first argument, whereas which() always returns ascending indices. They agree here because the matched codes happen to occur in the same relative order in both dataframes.
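Note that match() returns positions in the order of its first argument, while which() always returns ascending positions, so the two can disagree when the query order differs. A minimal sketch with hypothetical vectors:

```r
x <- c(6, 5, 4, 3, 2, 7)
y <- c(2, 4)
match(y, x)       # 5 3 : positions follow the order of y
which(x %in% y)   # 3 5 : positions are always ascending
```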

4.3 Matching, Querying and Substituting in Strings

R contains a number of useful methods for handling character string data.

4.3.1 strtrim()

The function strtrim() is useful for extracting leading characters from the strings in a character vector.

Example 4.12
For the taxonomic codes in the character vector below, the first capital letter indicates whether a species is a flowering plant (anthophyte) or a moss (bryophyte), while the last four letters give the species code (see Example 4.9).

plant <- c("A_CAAT", "B_CASP", "A_SARI")

Assume that I want to distinguish anthophytes from bryophytes by extracting the first letter. This can be done by specifying 1 in the second strtrim argument, width.

phylum <- strtrim(plant, 1)
phylum
[1] "A" "B" "A"
plant[phylum == "A"]
[1] "A_CAAT" "A_SARI"
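Because strtrim() only retains leading characters, the trailing four-letter species codes require a different tool. One option is substr(), which extracts characters by position; a sketch:

```r
plant <- c("A_CAAT", "B_CASP", "A_SARI")
substr(plant, 3, 6)   # characters 3 through 6: the species codes
# [1] "CAAT" "CASP" "SARI"
```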

\(\blacksquare\)

4.3.2 strsplit()

The function strsplit() splits a character string into substrings based on user defined criteria. It contains two important arguments.

  • The first argument, x, specifies the character string to be analyzed.
  • The second argument, split, is a character criterion that is used for splitting.

Example 4.13
Below I split the character string ACMI in two, based on the space between the words Achillea and millefolium.

ACMI <- "Achillea millefolium"
strsplit(ACMI, " ")
[[1]]
[1] "Achillea"    "millefolium"

Note that the result is a list. To get back to a vector (now with two components), I can use the function unlist().

unlist(strsplit(ACMI, " "))
[1] "Achillea"    "millefolium"

Here I split based on the letter "l".

strsplit(ACMI, "l")
[[1]]
[1] "Achi"  ""      "ea mi" ""      "efo"   "ium"  

Interestingly, letting the split criterion equal NULL results in the string being split into its individual characters.

strsplit(ACMI, NULL)
[[1]]
 [1] "A" "c" "h" "i" "l" "l" "e" "a" " " "m" "i" "l" "l" "e" "f" "o" "l"
[18] "i" "u" "m"

We can use this outcome to reverse the order of characters in a string.

sapply(lapply(strsplit(ACMI, NULL), rev), paste, collapse = "")
[1] "muilofellim aellihcA"

The function rev() returns a reversed version of its first argument, in this case the result from strsplit(). The function paste() can be used to paste character strings together; with collapse = "", the reversed characters are joined back into a single string.

\(\blacksquare\)

Criteria for querying strings can include multiple characters in a particular order; note also that matching is case sensitive:

x <- "R is free software and comes with ABSOLUTELY NO WARRANTY"
strsplit(x, "so")
[[1]]
[1] "R is free "                                  
[2] "ftware and comes with ABSOLUTELY NO WARRANTY"

Note that the "SO" in "ABSOLUTELY" is ignored because it is upper case.

4.3.3 grep() and grepl()

The functions grep() and grepl() can be used to identify which elements in a character vector have a specified pattern. They have the same first two arguments.

  • The first argument, pattern, specifies a pattern to be matched. This can be a character string, an object coercible to a character string, or a regular expression (see below).
  • The second argument, x, is a character vector where matches are sought.

Example 4.14
The function grep() returns indices identifying which entries in a vector contain a queried pattern. In the character vector below, we see that entries five and six have the same genus, Carex.

names = c("Achillea millefolium", "Aster foliaceus",
                           "Elymus scribneri", "Erigeron rydbergii",
                           "Carex elynoides", "Carex paysonis",
                           "Taraxacum ceratophorum")

grep("Carex", names)
[1] 5 6

The function grepl() does the same thing with Boolean outcomes.

grepl("Carex", names)
[1] FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE

Of course, we could use this information to subset names.

names[grep("Carex", names)]
[1] "Carex elynoides" "Carex paysonis" 

We can also get grep to return the values directly by specifying value = TRUE.

grep("Carex", names, value = TRUE)
[1] "Carex elynoides" "Carex paysonis" 

\(\blacksquare\)

4.3.4 gsub()

The function gsub() can be used to substitute text that matches a specified pattern. Several of its arguments are identical to those of grep() and grepl():

  • As before, the first argument, pattern, specifies a pattern to be matched.
  • The second argument, replacement, specifies a replacement for the matched pattern.
  • The third argument, x, is a character vector wherein matches are sought and substitutions are made.

Example 4.15
Assume that we wish to substitute "C." for occurrences of "Carex" in names.

gsub("Carex", "C.", names)
[1] "Achillea millefolium"   "Aster foliaceus"       
[3] "Elymus scribneri"       "Erigeron rydbergii"    
[5] "C. elynoides"           "C. paysonis"           
[7] "Taraxacum ceratophorum"
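The related base R function sub() replaces only the first match in each string, whereas gsub() replaces all matches; a quick sketch:

```r
x <- "banana"
sub("an", "AN", x)    # first match only
# [1] "bANana"
gsub("an", "AN", x)   # all matches
# [1] "bANANa"
```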

\(\blacksquare\)

4.3.5 gregexpr()

The function gregexpr() identifies the start and end of matching sections in a character vector.

Example 4.16
Here we examine the first two entries in names, looking for the genus Aster.

gregexpr("Aster", names[c(1:2)])
[[1]]
[1] -1
attr(,"match.length")
[1] -1
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE

[[2]]
[1] 1
attr(,"match.length")
[1] 5
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE

The output list is cryptic and requires some explanation. For each component of the list, the first element gives the character position at which the match begins, and the attribute match.length gives the number of characters matched. For the first component, both values are -1 because "Achillea millefolium" does not contain the pattern "Aster". For the second component, the values are 1 and 5 because "Aster" makes up the first five letters of "Aster foliaceus".

\(\blacksquare\)

4.3.6 Regular Expressions

A number of R functions for managing character strings, including grep(), grepl(), gregexpr(), gsub(), and strsplit(), can incorporate regular expressions. In computer programming, a regular expression (often abbreviated as regex) is a sequence of characters that allows pattern matching in text. Regular expressions have been developed within a number of programming frameworks, including the POSIX standard (the Portable Operating System Interface standard), developed by the IEEE, and Perl. Regular expressions in R include extended regular expressions (the default for most pattern matching and replacement R functions) and Perl-like regular expressions.

4.3.6.1 Extended Regular Expressions

Default extended regular expressions in R use a POSIX framework for commands, including the use of particular metacharacters. The metacharacters in extended regular expressions are: \, |, ( ), [ ], ^, $, ., { }, *, +, and ?. These metacharacters vary in meaning depending on whether they occur outside of square brackets, [ and ], or inside of square brackets, where they form part of a character class (see below). In their simplest usage (outside of square brackets), the metacharacters below have the following applications (see https://www.pcre.org/original/pcre.txt):

  • ^ start of string or line.
  • $ end of string or line.
  • . match any character except newline.
  • [ ] start and end character class definition.
  • | start of alternative branch.
  • ( ) start and end subpattern.
  • { } start and end min/max repetition specification.

Several regular expression metacharacters can be placed at the end of a regular expression to specify repetition. For instance, "*" indicates the preceding pattern should be matched zero or more times, "+" indicates the preceding pattern should be matched one or more times, "{n}" indicates the preceding pattern should be matched exactly n times, and "{n,}" indicates the preceding pattern should be matched n or more times.

Example 4.17
We will use the function regmatches(), which extracts or replaces matched substrings from gregexpr() summaries, to illustrate.

string <- "%aaabaaab"
ID <- gregexpr("a{1}", string)
regmatches(string, ID)
[[1]]
[1] "a" "a" "a" "a" "a" "a"
ID <- gregexpr("a{2}", string)
regmatches(string, ID)
[[1]]
[1] "aa" "aa"
ID <- gregexpr("a{2,}", string)
regmatches(string, ID)
[[1]]
[1] "aaa" "aaa"

\(\blacksquare\)

Example 4.18
Metacharacters can be used together. For instance, the code below demonstrates how one might get rid of one or more extra spaces at the end of character strings.

string <- c("###Now is the time      ",
            "# for all  ",
            "#",
            " good men",
            "### to come to the aid of their country.       ")

out <- gsub(" +$", "", string) # drop extra space(s) at end of strings
out <- gsub("^#*","", out) # drop pound sign(s)

paste(out, collapse = "")
[1] "Now is the time for all good men to come to the aid of their country."

\(\blacksquare\)

Example 4.19
As another example, RMarkdown delimits monospace font using grave accent (backtick) metacharacters, ` `, while LaTeX applies this font between the expressions \texttt{ and }. Below I convert an RMarkdown-style character vector containing some monospace strings to a LaTeX-style character vector.

char.vec <- c("`+`", "addition", "$2 + 2$", "`2 + 2`")
gsub("(`)(.*)(`)", "\\\\texttt{\\2}", char.vec)
[1] "\\texttt{+}"     "addition"        "$2 + 2$"         "\\texttt{2 + 2}"

With the code

"(`)(.*)(`)"

I subset RMarkdown strings in char.vec into three potential components: 1) the ` metacharacter beginning the string, 2) the text content between ` metacharacters, and 3) the closing ` metacharacter itself. I insert the content in item 2 (referenced as \\2) between \texttt{ and } using:

"\\\\texttt{\\2}"

Note the four backslashes: a literal backslash must be written as \\ in an R string, and the backslash is itself special in replacement text, so four are needed in the source to produce one in the output.

\(\blacksquare\)

A regular expression character class is a collection of characters, specifying some query or pattern, situated between the square bracket metacharacters [ and ]. A match will be evaluated for any single character in the specified text with respect to the specified pattern. The reverse occurs if the first character of the pattern is the regular expression caret metacharacter, ^. For example, the expression "[0-9]" matches any single numeric character in a string (the regular expression metacharacter - can be used to specify a range), and "[^abc]" matches anything except the characters "a", "b" or "c".

Example 4.20

string <- "a1c&m2%b"
strsplit(string, "[0-9]")
[[1]]
[1] "a"   "c&m" "%b" 
strsplit(string, "[^abc]")
[[1]]
[1] "a" "c" ""  ""  ""  "b"

\(\blacksquare\)

If a queried character is a regular expression metacharacter, I would have to escape it using double backslashes.

Example 4.21
Here I ask for a string split based on the appearance of ? (which is a regex metacharacter) and % (which is not).

string <- "m?2%b"
strsplit(string, "[\\?%]")
[[1]]
[1] "m" "2" "b"

\(\blacksquare\)

Example 4.22
This regular expression will match most email addresses:

pattern <- "[-a-z0-9_.%]+\\@[-a-z0-9_.%]+\\.[a-z]+"

The expression literally reads: “1) find one or more occurrences of characters among a-z, 0-9, dashes, underscores, percent signs, or periods, followed by 2) a literal at sign, @, followed by 3) one or more occurrences of the same class of characters, followed by 4) a literal period, followed by 5) one or more occurrences of the letters a-z.” Specifying ignore.case = TRUE below allows upper case letters to match as well. Here is a string we wish to query:

string <- c("abc_noboby@isu.edu",
            "text with no email",
            "me@mything.com",
            "also",
            "you@yourspace.com",
            "@you"
            )

We confirm that elements 1, 3, and 5 from string are email addresses.

grep(pattern, string, ignore.case = TRUE, value = TRUE)
[1] "abc_noboby@isu.edu" "me@mything.com"     "you@yourspace.com" 

\(\blacksquare\)

Certain character classes are predefined. These classes have names bounded by square brackets and colons, and include "[[:lower:]]" and "[[:upper:]]", which identify lower and upper case letters, "[[:punct:]]", which identifies punctuation, "[[:alnum:]]", which identifies alphanumeric characters, and "[[:space:]]", which identifies space characters, e.g., tab and newline.

string <- c("M2Ab", "def", "?", "%", "\n")
grepl("[[:lower:]]", string)
[1]  TRUE  TRUE FALSE FALSE FALSE
grepl("[[:upper:]]", string)
[1]  TRUE FALSE FALSE FALSE FALSE
grepl("[[:punct:]]", string)
[1] FALSE FALSE  TRUE  TRUE FALSE
grepl("[[:space:]]", string) # item five is a newline request
[1] FALSE FALSE FALSE FALSE  TRUE

Here I ask R to return elements from string that contain a run of at least three alphanumeric characters.

grep("[[:alnum:]]{3}", string, value = TRUE)
[1] "M2Ab" "def" 

For some pattern matching and replacement jobs it may be best to turn off the default extended regular expressions and use exact matching by specifying fixed = TRUE. For example, R may place periods in the place of spaces in the character strings and column names of dataframes and arrays.

Example 4.23
Consider the following example:

countries <- c("United.States", "United.Arab.Emirates", "China", "Germany")
gsub(".", " ", countries)
[1] "             "        "                    " "     "               
[4] "       "             

Note that using gsub(".", " ", countries) results in the replacement of all text with spaces because of the meaning of the period metacharacter. To get the desired result we could use:

gsub(".", " ", countries, fixed = TRUE)
[1] "United States"        "United Arab Emirates" "China"               
[4] "Germany"             

Of course we could also double escape the period.

gsub("\\.", " ", countries)
[1] "United States"        "United Arab Emirates" "China"               
[4] "Germany"             

\(\blacksquare\)

4.3.6.2 Perl-like Regular Expressions

The R character string functions grep(), grepl(), regexpr(), gregexpr(), sub(), gsub(), and strsplit() allow Perl-like regular expression pattern matching. This is done by specifying perl = TRUE, which switches regular expression handling to the PCRE (Perl Compatible Regular Expressions) library. Perl-like expressions allow handling of the POSIX predefined character classes, e.g., "[[:lower:]]", along with a wide variety of other calls, which are generally implemented using metacharacters and double backslash commands. Here are some examples.

  • \\d any decimal digit.
  • \\D any character that is not a decimal digit.
  • \\h any horizontal white space character (e.g., tab, space).
  • \\H any character that is not a horizontal white space character.
  • \\s any white space character.
  • \\S any character that is not a white space character.
  • \\v any vertical white space character (e.g., newline).
  • \\V any character that is not a vertical white space character.
  • \\w any word character, i.e., a letter, digit, or underscore.
  • \\W any non-word character.
  • \\b a word boundary.
  • \\U convert subsequent replacement text to upper case (used in the replacement argument of sub() and gsub()).
  • \\L convert subsequent replacement text to lower case (used in the replacement argument of sub() and gsub()).

Note that reversals in meaning occur for capitalized and uncapitalized commands.
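As a brief illustration of several of these commands (a sketch with made-up strings, not from the text):

```r
x <- c("beta2", "alpha", "42")
grepl("\\d", x, perl = TRUE)             # contains a digit: TRUE FALSE TRUE
grepl("\\D", x, perl = TRUE)             # contains a non-digit: TRUE TRUE FALSE
gsub("\\s", "_", "a b\tc", perl = TRUE)  # replace any white space: "a_b_c"
```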

Example 4.24 \(\text{}\)
As a slightly extended example we will count the number of words in the text of the GNU General Public License distributed with R (obtained via RShowDoc("COPYING")). Ideas here largely follow from the function DescTools::StrCountW() (Signorell 2023).

Text can be read from a connection using the function readLines().

GNU <- readLines(RShowDoc("COPYING"))
head(GNU)
[1] "\t\t    GNU GENERAL PUBLIC LICENSE"                                               
[2] "\t\t       Version 2, June 1991"                                                  
[3] ""                                                                               
[4] " Copyright (C) 1989, 1991 Free Software Foundation, Inc."                       
[5] "                       51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA"
[6] " Everyone is permitted to copy and distribute verbatim copies"                  

Note that the characters \t represent the ASCII (American Standard Code for Information Interchange) control character for a tab. Other useful escaped control characters include \n, indicating a new line, and \r, indicating a carriage return.
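The effect of these control characters can be demonstrated with cat(), which interprets them, whereas print() displays them in escaped form. A minimal sketch:

```r
s <- "col1\tcol2\nval1\tval2"
print(s)      # escapes shown literally: "col1\tcol2\nval1\tval2"
cat(s, "\n")  # tab and newline are rendered
```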

To search for words, we will actually identify string components that are not words, using the Perl regex command \\W and word boundaries, i.e., \\b. We can combine these as: \\b\\W+\\b. The call \\W+ indicates a non-word match occurring one or more times. Here we apply this regular expression to the first element of GNU.

GNU[1]
[1] "\t\t    GNU GENERAL PUBLIC LICENSE"
gregexpr("\\b\\W+\\b", GNU[1], perl = TRUE)
[[1]]
[1] 10 18 25
attr(,"match.length")
[1] 1 1 1
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE

Matches occur at three locations, 10, 18, and 25, which separate the four words GNU GENERAL PUBLIC LICENSE. Thus, to analyze the entire document we could use:

sum(sapply(gregexpr("\\b\\W+\\b", GNU, perl = TRUE),
           function(x) sum(x > 0)) + 1)
[1] 3048

There are 3048 total words in the license description.
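To see why 1 is added to each line's separator count, the same logic can be applied to a single short string (a toy example, not from the license text):

```r
s <- "GNU GENERAL PUBLIC LICENSE"
m <- gregexpr("\\b\\W+\\b", s, perl = TRUE)[[1]]
sum(m > 0) + 1  # 3 separators imply 4 words
```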

\(\blacksquare\)

One can identify substrings by number using Perl.

Example 4.25 \(\text{}\)
In this example, I subdivide a string into two components, the first character, i.e., "(\\w)", and the remaining zero or more characters: "(\\w*)". These are referred to in the replacement argument of gsub() as items \\1 and \\2, respectively. Capitalization for these substrings is handled in different ways below.

string <- "achillea"
gsub("(\\w)(\\w*)", "\\U\\1\\U\\2", string, perl=TRUE) # all caps
[1] "ACHILLEA"
gsub("(\\w)(\\w*)", "\\L\\1\\U\\2", string, perl=TRUE) # low, then upper case
[1] "aCHILLEA"
gsub("(\\w)(\\w*)", "\\U\\1\\L\\2", string, perl=TRUE) # up, then lower case
[1] "Achillea"

\(\blacksquare\)

4.4 Date-Time Classes

There are two basic R date-time classes, POSIXlt and POSIXct4. Class POSIXct represents the (signed) number of seconds since the beginning of 1970 (in the UTC time zone) as a numeric vector. An object of class POSIXlt is comprised of a list of vectors with the names sec, min, hour, mday (day of month), mon (month), year, wday (day of week), and yday (day of year).
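These internal representations can be examined directly (a sketch; the particular date is arbitrary):

```r
x <- as.POSIXct("1970-01-01 00:01:00", tz = "UTC")
as.numeric(x)        # seconds since the start of 1970: 60
lt <- as.POSIXlt(x)
lt$min               # minute component: 1
lt$year              # note: years since 1900, i.e., 70
```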

POSIX naming conventions include:

  • %m = Month as a decimal number (01–12).
  • %d = Day of the month as a decimal number (01–31).
  • %Y = Year with century. Designations in 0:9999 are accepted.
  • %H = Hour as a decimal number (00–23).
  • %M = Minute as a decimal number (00–59).
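These conventions are used in the format argument of strptime() to parse text, and in format() to render a date-time object back as text. A brief sketch:

```r
d <- strptime("25/12/2021 14:30", format = "%d/%m/%Y %H:%M")
format(d, "%Y-%m-%d %H:%M")  # "2021-12-25 14:30"
```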

Example 4.26 \(\text{}\)
As an example, below are twenty dates and corresponding binary water presence measures (0 = water absent, 1 = water present) recorded at 2.5 hour intervals for an intermittent stream site in southwest Idaho (Aho et al. 2023).

dates <- c("08/13/2019 04:00", "08/13/2019 06:30", "08/13/2019 09:00",
           "08/13/2019 11:30", "08/13/2019 14:00", "08/13/2019 16:30",
           "08/13/2019 19:00", "08/13/2019 21:30", "08/14/2019 00:00",
           "08/14/2019 02:30", "08/14/2019 05:00", "08/14/2019 07:30",
           "08/14/2019 10:00", "08/14/2019 12:30", "08/14/2019 15:00",
           "08/14/2019 17:30", "08/14/2019 20:00", "08/14/2019 22:30",
           "08/15/2019 01:00", "08/15/2019 03:30")

pres.abs <- c(1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1)

To convert the character string dates to a date-time object we can use the function strptime(). We have:

dates.ts <- strptime(dates, format = "%m/%d/%Y %H:%M")
class(dates.ts)
[1] "POSIXlt" "POSIXt" 

Note that the dates can now be evaluated numerically.

dates.df <- data.frame(dates = dates.ts, pres.abs = pres.abs)
summary(dates.df)
     dates                        pres.abs   
 Min.   :2019-08-13 04:00:00   Min.   :0.00  
 1st Qu.:2019-08-13 15:52:30   1st Qu.:0.75  
 Median :2019-08-14 03:45:00   Median :1.00  
 Mean   :2019-08-14 03:45:00   Mean   :0.75  
 3rd Qu.:2019-08-14 15:37:30   3rd Qu.:1.00  
 Max.   :2019-08-15 03:30:00   Max.   :1.00  

I can also easily extract time series components.

dates.ts$mday # day of month
 [1] 13 13 13 13 13 13 13 13 14 14 14 14 14 14 14 14 14 14 15 15
dates.ts$wday # day of week
 [1] 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 4 4
dates.ts$hour # hour
 [1]  4  6  9 11 14 16 19 21  0  2  5  7 10 12 15 17 20 22  1  3

\(\blacksquare\)

Exercises

  1. Using the plant dataset from question 5 in Chapter 3, perform the following operations.

    a. Attempt to simultaneously calculate the column means for plant height and soil % N using FUN = mean in apply(). Was there an issue? Why?
    b. Eliminate missing rows in plant using na.omit() and repeat a). Did this change the mean for plant height? Why?
    c. Modify the FUN argument in apply() to be: FUN = function(x) mean(x, na.rm = TRUE). This will eliminate NAs on a column by column basis.
    d. Compare the results in a), b), and c). Which is the best approach? Why?
    e. Find the mean and variance of plant heights for each Management Type in plant using tapply(). Use the best practice approach for FUN, as deduced in d).
  2. For the questions below, use the list list.data below.

    a. Use sapply(list.data, FUN = length) to get the number of components in each element of list.data.
    b. Repeat a) using lapply(). How is the output in b) different from a)?

list.data <- list(a = 1:9, height = rnorm(50),
                  greet = c("hello", "goodbye", "hello"))

  3. A frequently used statistical application is the calculation of all possible mean differences. Assume that we have the means given in the object means below.

    a. Calculate all possible mean differences using means as the first two arguments in outer(), and letting FUN = "-".
    b. Extract meaningful and non-redundant differences by using upper.tri() or lower.tri() (Chapter 3). There should be \({5 \choose 2} = 10\) meaningful (not simply a mean subtracted from itself) and non-redundant differences.

means <- c(trt1 = 20.5, trt2 = 15.3, trt3 = 22.1, trt4 = 30.4, trt5 = 28)

  4. Using the plant dataset from question 5 in Chapter 3, perform the following operations.

    a. Use the function replace() to label samples with soil N less than 13.5% as "Npoor".
    b. Use the function which() to identify which plant heights are greater than or equal to 33.2 dm.
    c. Sort plant heights using the function sort().
    d. Sort the plant dataset with respect to ascending values of plant height using the function order().
  5. Using match() or which() and %in%, replace the code column names in the dataset cliff.sp from the package asbio with the correct scientific names (genus and specific epithet) from the dataframe sp.list below.

sp.list <- data.frame(code = c("L_ASCA","L_CLCI","L_COSPP","L_COUN",
"L_DEIN","L_LCAT", "L_LCST","L_LEDI","M_POSP","L_STDR","L_THSP","L_TOCA",
"L_XAEL","M_AMSE", "M_CRFI","M_DISP","M_WECO","P_MIGU","P_POAR",
"P_SAOD"), sci.name = c("Aspicilia caesiocineria","Caloplaca citrina",
"Collema spp.", "Collema undulatum", "Dermatocarpon intestiniforme",
"Lecidea atrobrunnea", "Lecidella stigmatea", "Lecanora dispersa",
"Pohlia sp.", "Staurothele drummondii", "Thelidium species",
"Toninia candida", "Xanthoria elegans", "Amblystegium serpens",
"Cratoneuron filicinum", "Dicranella species", "Weissia controversa",
"Mimulus guttatus", "Poa pattersonii", "Saxifraga odontoloma"))

  6. Using the sp.list dataframe from the previous question, perform the following operations:

    a. Apply strsplit() to the column sp.list$sci.name to create a two column dataframe with genus and corresponding species names.
    b. A two character prefix in the column sp.list$code indicates whether a taxon is a lichen (prefix = "L_"), a marchantiophyte (prefix = "M_"), or a vascular plant (prefix = "P_"). Use grep() to identify marchantiophytes.
  7. For the questions below, use the string vector string below:

    a. Use regular expressions in the pattern argument of gsub() to get rid of extra spaces at the start of string elements while preserving spaces between words.
    b. Use the predefined character class [[:alnum:]] and an accompanying quantifier in the pattern argument of grep() to count the number of words whose length is greater than or equal to four characters.

string <- c("   Statistics is ", "      a ", " great topic.")

  8. Consider the character vector times below, which has the format: day-month-year hour:minute:second.

    a. Convert times into an object of class POSIXlt called time.pos using the function strptime().
    b. Extract the day of the week from time.pos.
    c. Sort time.pos using sort() to verify that time.pos is quantitative.

times <- c("12-12-2023 12:12:20",
           "12-01-2021 01:12:40",
           "15-10-2021 23:10:15",
           "25-07-2022 13:09:45")

4.5 The Tidyverse

4.6 Pipes

4.6.1 Basic Pipe

4.6.2 Other Pipes

References

Aho, Ken, Cathy Kriloff, Sarah E Godsey, Rob Ramos, Chris Wheeler, Yaqi You, Sara Warix, et al. 2023. “Non-Perennial Stream Networks as Directed Acyclic Graphs: The R-Package streamDAG.” Environmental Modelling & Software 167: 105775.
Signorell, Andri. 2023. DescTools: Tools for Descriptive Statistics. https://CRAN.R-project.org/package=DescTools.
Wikipedia. 2023. “Perl.” https://en.wikipedia.org/wiki/Perl.
———. 2024. “String (Computer Science).” https://en.wikipedia.org/wiki/String_(computer_science).

  1. In computer programming, a string is generally a (non-numeric) sequence of characters (Wikipedia 2024). R frequently uses character vectors, i.e., vec <- c("a", "b", "c"). Each entry in vec would be conventionally considered to be a character string.↩︎

  2. The Perl programming language was introduced by Larry Wall in 1987 as a Unix scripting tool to facilitate report processing (Wikipedia 2023). Despite criticisms as an awkward language, Perl remains widely used for its regular expression framework and string parsing capabilities.↩︎

  3. Specifically, they use a version of the POSIX 1003.2 standard.↩︎

  4. Again, the POSIX prefix refers to the IEEE standard Portable Operating System Interface.↩︎