Chapter 4 Basic Data Management

“I think, therefore I R.”

- William B. King, Psychologist and R enthusiast

An important characteristic of R is its capacity to efficiently manage and analyze large, complex datasets. In this chapter I list a few functions and approaches useful for data management in base R. Data management considerations for the tidyverse are given in Chapter 5.

4.1 Operations on Arrays, Lists and Vectors

Operations can be applied to every row or column of an array, or to every component of a list or atomic vector, using a number of time-saving methods.

4.1.1 The apply Family of Functions

4.1.1.1 apply()

Operations can be performed quickly on rows and columns of two dimensional arrays with the function apply(). The function requires three arguments.

  • The first argument, X, specifies an array to be analyzed.
  • The second argument, MARGIN, connotes whether rows or columns are to be analyzed. MARGIN = 1 indicates rows, MARGIN = 2 indicates columns, whereas MARGIN = c(1, 2) indicates rows and columns.
  • The third argument, FUN, defines a function to be applied to the margins of the object in the first argument.
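The three MARGIN options can be illustrated with a small hypothetical matrix (not from any dataset in this chapter):

```r
m <- matrix(1:6, nrow = 2)          # a 2 x 3 matrix
apply(m, 1, sum)                    # one sum per row: 9 12
apply(m, 2, sum)                    # one sum per column: 3 7 11
apply(m, c(1, 2), function(x) x^2)  # function applied to each element
```

Note that MARGIN = c(1, 2) applies FUN elementwise, returning an array with the same dimensions as X.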

Example 4.1
Consider the asbio::bats dataset, which contains forearm length data, in millimeters, for northern myotis bats (Myotis septentrionalis), along with corresponding bat ages in days.

library(asbio)
data(bats)
head(bats)
  days forearm.length
1    1           10.5
2    1           11.0
3    1           12.3
4    1           13.7
5    1           14.2
6    1           14.8

Here we obtain minimum values for the days and forearm.length columns.

apply(bats, 2, min)
          days forearm.length 
           1.0           10.5 

It is straightforward to change the third argument in apply() to obtain different summaries, like the mean.

apply(bats, 2, mean)
          days forearm.length 
      13.57895       23.60263 

or the standard deviation:

apply(bats, 2, sd)
          days forearm.length 
     12.461035       8.434725 

Several summary statistical functions exist for numerical arrays that can be used in some instances in the place of apply(). These include rowMeans() and colMeans() which give the sample means of specified rows and columns, respectively, and rowSums() and colSums() which give the sums of specified rows and columns, respectively. For instance:

colMeans(bats)
          days forearm.length 
      13.57895       23.60263 

\(\blacksquare\)

4.1.1.2 lapply()

The function lapply() allows one to sweep functions through list components. It has two main arguments:

  • The first argument, X, specifies a list to be analyzed.
  • The second argument, FUN, defines a function to be applied to each element in X.

Example 4.2

Consider the following simple list, whose three components have different lengths.

x <- list(a = 1:8, norm.obs = rnorm(10), logic = c(TRUE, TRUE, FALSE, FALSE))
x
$a
[1] 1 2 3 4 5 6 7 8

$norm.obs
 [1]  1.22730347  0.10854551  1.33289436 -0.10706836  2.39835128
 [6] -0.05349229  0.26075305  1.18173824  0.59081373  1.52014568

$logic
[1]  TRUE  TRUE FALSE FALSE

Here we sweep the function mean() through the list:

lapply(x, mean)
$a
[1] 4.5

$norm.obs
[1] 0.8459985

$logic
[1] 0.5

Note the Boolean outcomes in logic have been coerced to numeric outcomes. Specifically, TRUE = 1 and FALSE = 0. Here are the 1st, 2nd (median), and 3rd quartiles of x:

lapply(x, quantile, probs = 1:3/4)
$a
 25%  50%  75% 
2.75 4.50 6.25 

$norm.obs
      25%       50%       75% 
0.1465974 0.8862760 1.3064966 

$logic
25% 50% 75% 
0.0 0.5 1.0 

\(\blacksquare\)

4.1.1.3 sapply()

The function sapply() is a user friendly wrapper for lapply() that can return a vector or array instead of a list.

sapply(x, quantile, probs = 1:3/4)
       a  norm.obs logic
25% 2.75 0.1465974   0.0
50% 4.50 0.8862760   0.5
75% 6.25 1.3064966   1.0
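When FUN returns a single value per component, sapply() simplifies the result to a named vector. A minimal sketch with a fresh toy list (rather than x, whose norm.obs component contains random draws):

```r
y <- list(a = 1:8, logic = c(TRUE, TRUE, FALSE, FALSE))
lapply(y, mean)  # returns a list
sapply(y, mean)  # returns a named numeric vector
#   a logic
# 4.5   0.5
```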

4.1.1.4 tapply()

The tapply() function allows summarization of a one dimensional array (e.g., a column or row from a matrix) with respect to levels in a categorical variable. The function requires three arguments.

  • The first argument, X, defines a one dimensional array to be analyzed.
  • The second argument, INDEX, should provide a list of one or more factors (see example below) with the same length as X.
  • The third argument, FUN, is used to specify a function to be applied to X for each level in INDEX.

Example 4.3
Consider the dataset asbio::heart, which documents pulse rates for twenty-four subjects at four time periods following administration of one of three experimental treatments: two active heart medications and a control. Here are average heart rates for the treatments.

data(heart)
with(heart, tapply(rate, drug, mean))
    AX23     BWW9     Ctrl 
76.28125 81.03125 71.90625 

Here are the mean heart rates for treatments, for each time frame. Note that the second argument is defined as a list with two components, each of which can be coerced to be a factor.

with(heart, tapply(rate, list(drug = drug, time = time), mean))
      time
drug      t1     t2     t3     t4
  AX23 70.50 80.500 81.000 73.125
  BWW9 81.75 84.000 78.625 79.750
  Ctrl 72.75 72.375 71.500 71.000

\(\blacksquare\)

The function aggregate() can be considered a more sophisticated extension of tapply(). It allows objects under consideration to be expressed as functions of explanatory factors, and contains additional arguments for data specification and time series analyses.

Example 4.4
Here we use aggregate() to get identical (but reformatted) results to the prior example.

aggregate(rate ~ drug + time, mean, data = heart)
   drug time   rate
1  AX23   t1 70.500
2  BWW9   t1 81.750
3  Ctrl   t1 72.750
4  AX23   t2 80.500
5  BWW9   t2 84.000
6  Ctrl   t2 72.375
7  AX23   t3 81.000
8  BWW9   t3 78.625
9  Ctrl   t3 71.500
10 AX23   t4 73.125
11 BWW9   t4 79.750
12 Ctrl   t4 71.000

Importantly, the first argument, rate ~ drug + time, is in the form of a formula:

f.rate <- with(heart, rate ~ drug + time)
class(f.rate)
[1] "formula"

Here the tilde operator, ~, allows expression of the formulaic framework: y ~ model, where y is a response variable and model specifies a system of (generally) one or more predictor variables.

\(\blacksquare\)
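Because formulas are ordinary R objects, they can be stored and reused. A minimal sketch with a small hypothetical data frame (not the heart data):

```r
d <- data.frame(y = c(1, 2, 3, 4), g = c("a", "a", "b", "b"))
f <- y ~ g           # a stored formula object
class(f)             # "formula"
aggregate(f, data = d, FUN = mean)
#   g   y
# 1 a 1.5
# 2 b 3.5
```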

4.1.2 outer()

Another important function for matrix operations is outer(). This algorithm allows creation of an array that contains all possible combinations of the elements of two atomic vectors or arrays with respect to a user-specified function. The outer() function has three main arguments.

  • The first two arguments, X and Y, define arrays or atomic vectors. X and Y can be identical if one wishes to examine pairwise operations of the array elements (see example below).
  • The third argument, FUN, specifies a function to be used in operations.

Example 4.5
Suppose I wish to find the means of all possible pairs of observations from an atomic vector. I could use the following commands:

x <- c(1, 2, 3, 5, 4)
outer(x, x, "+")/2
     [,1] [,2] [,3] [,4] [,5]
[1,]  1.0  1.5  2.0  3.0  2.5
[2,]  1.5  2.0  2.5  3.5  3.0
[3,]  2.0  2.5  3.0  4.0  3.5
[4,]  3.0  3.5  4.0  5.0  4.5
[5,]  2.5  3.0  3.5  4.5  4.0

The argument FUN = "+" indicates that we wish to add elements to each other. We divide these sums by two to obtain means. Note that the diagonal of the output matrix contains the original elements of x, because the mean of a number and itself is the original number. The upper and lower triangles are identical because the mean of elements a and b will be the same as the mean of the elements b and a. Note that the result outer(x, x, "*") can also be obtained using x %o% x because %o% is the matrix algebra outer product operator in R.

outer(x, x, "*")
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    2    3    5    4
[2,]    2    4    6   10    8
[3,]    3    6    9   15   12
[4,]    5   10   15   25   20
[5,]    4    8   12   20   16
x %o% x
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    2    3    5    4
[2,]    2    4    6   10    8
[3,]    3    6    9   15   12
[4,]    5   10   15   25   20
[5,]    4    8   12   20   16

\(\blacksquare\)

4.1.3 stack(), unstack() and reshape()

When manipulating lists and dataframes it is often useful to move between so-called “long” and “wide” data table formats. These operations can be handled with the functions stack() and unstack(). Specifically, stack() concatenates multiple vectors into a single vector along with a factor indicating where each observation originated, whereas unstack() reverses this process.

Example 4.6
Consider the 4 x 4 dataframe below.

dataf <- data.frame(matrix(nrow = 4, data = rnorm(16)))
names(dataf) <- c("col1", "col2", "col3", "col4")
dataf
       col1        col2        col3       col4
1 0.9441796 -0.04453284  0.06968586  0.4873315
2 1.9103583 -0.60254997 -0.14386612 -0.9688990
3 1.2850721  0.59731920  1.37414563  0.4051866
4 0.9670279 -1.32108678  0.14105541  0.5456421

Here I stack dataf into a long table format.

sdataf <- stack(dataf)
sdataf
        values  ind
1   0.94417965 col1
2   1.91035833 col1
3   1.28507212 col1
4   0.96702792 col1
5  -0.04453284 col2
6  -0.60254997 col2
7   0.59731920 col2
8  -1.32108678 col2
9   0.06968586 col3
10 -0.14386612 col3
11  1.37414563 col3
12  0.14105541 col3
13  0.48733155 col4
14 -0.96889895 col4
15  0.40518657 col4
16  0.54564209 col4

Here I unstack sdataf.

unstack(sdataf)
       col1        col2        col3       col4
1 0.9441796 -0.04453284  0.06968586  0.4873315
2 1.9103583 -0.60254997 -0.14386612 -0.9688990
3 1.2850721  0.59731920  1.37414563  0.4051866
4 0.9670279 -1.32108678  0.14105541  0.5456421

The function reshape() can handle both stacking and unstacking operations. Here I stack dataf. The arguments timevar, idvar, and v.names are used to provide recognizable identifiers for the columns in the wide table format, observations within those columns, and responses for those combinations.

reshape(dataf, direction = "long",
        varying = list(names(dataf)),
        timevar = "Column",
        idvar = "Column obs.",
        v.names = "Response")
    Column    Response Column obs.
1.1      1  0.94417965           1
2.1      1  1.91035833           2
3.1      1  1.28507212           3
4.1      1  0.96702792           4
1.2      2 -0.04453284           1
2.2      2 -0.60254997           2
3.2      2  0.59731920           3
4.2      2 -1.32108678           4
1.3      3  0.06968586           1
2.3      3 -0.14386612           2
3.3      3  1.37414563           3
4.3      3  0.14105541           4
1.4      4  0.48733155           1
2.4      4 -0.96889895           2
3.4      4  0.40518657           3
4.4      4  0.54564209           4

\(\blacksquare\)
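The reverse (long to wide) conversion uses direction = "wide". A minimal sketch with deterministic values for illustration (the dataframe above was built from random draws); note that reshape() records its settings as attributes of the long result, so simply calling reshape() on that result recovers the wide format:

```r
# a small wide-format data frame (deterministic values for illustration)
dataf <- data.frame(col1 = 1:4, col2 = 5:8, col3 = 9:12, col4 = 13:16)
long.dataf <- reshape(dataf, direction = "long",
                      varying = list(names(dataf)),
                      timevar = "Column",
                      idvar = "Column obs.",
                      v.names = "Response")
# reshape() stores its settings as attributes of long.dataf,
# so calling it again reverses the stacking operation
wide.dataf <- reshape(long.dataf)
```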

4.2 Other Simple Data Management Functions

4.2.1 replace()

One can use the function replace() to replace elements in an atomic vector, potentially based on Boolean logic. The function requires three arguments.

  • The first argument, x, specifies the vector to be analyzed.
  • The second argument, list, connotes which elements need to be replaced. A logical argument can be used here as a replacement index.
  • The third argument, values, defines the replacement value(s).

Example 4.7
For instance:

Age <- c(21, 19, 25, 26, 18, 19)
replace(Age, Age < 25, "R is Cool")
[1] "R is Cool" "R is Cool" "25"        "26"        "R is Cool" "R is Cool"

Of course, one can also use square brackets for this operation. Note that in both cases the result is a character vector: the remaining numeric elements are coerced to strings because atomic vectors can hold only one data type.

Age[Age < 25] <- "R is Cool"
Age
[1] "R is Cool" "R is Cool" "25"        "26"        "R is Cool" "R is Cool"

\(\blacksquare\)

4.2.2 which()

The function which() can be used with logical expressions to obtain the address indices of elements in a data storage object.

Example 4.8
For instance:

Age <- c(21, 19, 25, 26, 18, 19)
w <- which(Age <= 21)
w
[1] 1 2 5 6

Elements one, two, five, and six meet this criterion. We can now subset based on the index w.

Age[w]
[1] 21 19 18 19

To find which element in Age is closest to 24 I could do something like:

which(abs(Age - 24) == min(abs(Age - 24)))
[1] 3

\(\blacksquare\)
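Two related conveniences, not used in the example above, are worth noting: which.min() and which.max() return the single index of the smallest or largest element, and for two dimensional arrays the argument arr.ind = TRUE makes which() return row and column addresses. A sketch with a hypothetical matrix:

```r
Age <- c(21, 19, 25, 26, 18, 19)
which.min(abs(Age - 24))      # index of the element closest to 24: 3

m <- matrix(c(5, 2, 9, 1), nrow = 2)
which(m > 4)                  # column-major positions: 1 3
which(m > 4, arr.ind = TRUE)  # row and column addresses
```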

4.2.3 sort()

By default, the function sort() sorts data from an atomic vector into ascending alphanumeric order.

sort(Age)
[1] 18 19 19 21 25 26

Data can be sorted in a descending order by specifying decreasing = TRUE.

sort(Age, decreasing = T)
[1] 26 25 21 19 19 18

4.2.4 rank()

The function rank() gives the ascending alphanumeric ranks of elements in a vector. Ties are given the average of their ranks. This operation is important to rank-based permutation analyses.

rank(Age)
[1] 4.0 2.5 5.0 6.0 1.0 2.5

The second and last observations tie for the second smallest value in Age. Thus, they share the average of ranks 2 and 3, that is, 2.5.
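Other tie-handling rules are available through the ties.method argument. For instance:

```r
Age <- c(21, 19, 25, 26, 18, 19)
rank(Age, ties.method = "min")    # ties share their minimum rank: 4 2 5 6 1 2
rank(Age, ties.method = "first")  # ties broken by position:      4 2 5 6 1 3
```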

4.2.5 order()

The function order() is similar to which() in that it provides element indices that accord with an alphanumeric ordering. This allows one to sort a vector, matrix or dataframe into an ascending or descending order, based on one or several ordered vectors.

Example 4.9
Consider the dataframe below which lists plant percent cover data for four plant species at three sites. In accordance with the field.data example from Ch 3, plant species are identified with four letter codes, corresponding to the first two letters of the Linnaean genus and species names.

field.data <- data.frame(code = c("ACMI", "ELSC", "CAEL", "TACE"),
                         site1 = c(12, 13, 14, 11),
                         site2 = c(0, 20, 4, 5),
                         site3 = c(20, 10, 30, 0))
field.data
  code site1 site2 site3
1 ACMI    12     0    20
2 ELSC    13    20    10
3 CAEL    14     4    30
4 TACE    11     5     0

Assume that we wish to sort the data with respect to an alphanumeric ordering of species codes. Here we obtain the ordering of the codes

o <- order(field.data$code)
o
[1] 1 3 2 4

Now we can sort the rows of field.data based on this ordering.

field.data[o,]
  code site1 site2 site3
1 ACMI    12     0    20
3 CAEL    14     4    30
2 ELSC    13    20    10
4 TACE    11     5     0

\(\blacksquare\)
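order() also accepts several key vectors, with later vectors breaking ties in earlier ones, along with a decreasing argument. A sketch with two short hypothetical vectors:

```r
site  <- c("B", "A", "B", "A")
cover <- c(3, 7, 1, 2)
order(site, cover)                     # "A"s first, ties broken by cover: 4 2 3 1
order(site, cover, decreasing = TRUE)  # both keys reversed
```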

4.2.6 unique()

To identify unique values in a dataset we can use the function unique().

Example 4.10
Below is an atomic vector listing species from a bird survey on islands in southeast Alaska. Species ciphers follow the same coding method used in Example 4.9. Note that there are a large number of repeats.

AK.bird <- c("GLGU", "MEGU", "DOCO", "PAJA", "COLO", "BUFF", "COGO", 
             "WHSC", "TUSW", "GRSC", "GRTE", "REME", "BLOY", "REPH", 
             "SEPL", "LESA", "ROSA", "WESA", "WISN", "BAEA", "SHOW", 
             "GLGU", "MEGU", "PAJA", "DOCO", "GRSC", "GRTE", "BUFF", 
             "MADU", "TUSW", "REME", "SEPL", "REPH", "ROSA", "LESA", 
             "COSN", "BAEA", "ROHA")

length(AK.bird)
[1] 38

Applying unique() we obtain a listing of the 24 unique bird species.

unique(AK.bird)
 [1] "GLGU" "MEGU" "DOCO" "PAJA" "COLO" "BUFF" "COGO" "WHSC" "TUSW" "GRSC"
[11] "GRTE" "REME" "BLOY" "REPH" "SEPL" "LESA" "ROSA" "WESA" "WISN" "BAEA"
[21] "SHOW" "MADU" "COSN" "ROHA"

\(\blacksquare\)

4.2.7 match()

Given two vectors, the function match() indexes where elements of its first argument appear in its second argument. For instance:

x <- c(6, 5, 4, 3, 2, 7)
y <- c(2, 1, 4, 3, 5, 6)
m <- match(y, x)
m
[1]  5 NA  3  4  2  1

The number 2 (the 1st element in y) is the 5th element of x, thus the number 5 is put 1st in the vector m created by match. The number 1 (the 2nd element of y) does not occur in x (it is NA). The number 4 is the 3rd element of y and x. Thus, the number 3 is given as the third element of m, and so on.

Example 4.11
The usefulness of match() may seem unclear at first, but consider a scenario in which I want to convert species code identifiers in field data into actual species names. The following dataframe is a species list that matches four letter species codes to scientific names. Note that the list contains more species than the field.data dataset used in Example 4.9.

species.list <- data.frame(code = c("ACMI", "ASFO", "ELSC", "ERRY", "CAEL",
"CAPA", "TACE"), names = c("Achillea millefolium", "Aster foliaceus",
                           "Elymus scribneri", "Erigeron rydbergii",
                           "Carex elynoides", "Carex paysonis",
                           "Taraxacum ceratophorum"))

species.list
  code                  names
1 ACMI   Achillea millefolium
2 ASFO        Aster foliaceus
3 ELSC       Elymus scribneri
4 ERRY     Erigeron rydbergii
5 CAEL        Carex elynoides
6 CAPA         Carex paysonis
7 TACE Taraxacum ceratophorum

Here I use match() to add a column of actual species names to field.data.

m <- match(field.data$code, species.list$code)
field.data.new <- field.data # make a copy of field data
field.data.new$species.name <- species.list$names[m]
field.data.new
  code site1 site2 site3           species.name
1 ACMI    12     0    20   Achillea millefolium
2 ELSC    13    20    10       Elymus scribneri
3 CAEL    14     4    30        Carex elynoides
4 TACE    11     5     0 Taraxacum ceratophorum

\(\blacksquare\)

4.2.8 which() and %in%

We can use the operator %in% in conjunction with the function which() to achieve the same results as match().

m <- which(species.list$code %in% field.data$code)
field.data.new$species.name <- species.list$names[m]
field.data.new
  code site1 site2 site3           species.name
1 ACMI    12     0    20   Achillea millefolium
2 ELSC    13    20    10       Elymus scribneri
3 CAEL    14     4    30        Carex elynoides
4 TACE    11     5     0 Taraxacum ceratophorum

Note that the arrangement of arguments is reversed in match() and which(). In the former we have: match(field.data$code, species.list$code). In the latter we have: which(species.list$code %in% field.data$code). Note also that the which() approach returns indices in the order they occur in species.list; it reproduces the match() result here only because the shared codes appear in the same relative order in both vectors.
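On its own, %in% returns a logical vector indicating, for each element of its left operand, whether it appears anywhere in the right operand. A sketch with the species codes used above:

```r
codes       <- c("ACMI", "ASFO", "ELSC", "ERRY", "CAEL", "CAPA", "TACE")
field.codes <- c("ACMI", "ELSC", "CAEL", "TACE")
codes %in% field.codes
# [1]  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE
```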

4.3 Matching, Querying and Substituting in Strings

R contains a number of useful methods for handling character string data.

4.3.1 strtrim()

The function strtrim() trims character strings to a specified width, making it useful for extracting characters from the start of strings.

Example 4.12
For the taxonomic codes in the character vector below, the first capital letter indicates whether a species is a flowering plant (anthophyte) or moss (bryophyte) while the last four letters give the species codes (see Example 4.9).

plant <- c("A_CAAT", "B_CASP", "A_SARI")

Assume that I want to distinguish anthophytes from bryophytes by extracting the first letter. This can be done by specifying 1 in the second strtrim argument, width.

phylum <- strtrim(plant, 1)
phylum
[1] "A" "B" "A"
plant[phylum == "A"]
[1] "A_CAAT" "A_SARI"

\(\blacksquare\)
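strtrim() only extracts from the start of strings. To pull out the four letter species codes at the end of these identifiers (characters three through six), the related base function substr() can be used:

```r
plant <- c("A_CAAT", "B_CASP", "A_SARI")
substr(plant, 3, 6)  # extract characters 3 through 6
# [1] "CAAT" "CASP" "SARI"
```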

4.3.2 strsplit()

The function strsplit() splits a character string into substrings based on user defined criteria. It contains two important arguments.

  • The first argument, x, specifies the character string to be analyzed.
  • The second argument, split, is a character criterion that is used for splitting.

Example 4.13
Below I split the character string ACMI in two, based on the space between the words Achillea and millefolium.

ACMI <- "Achillea millefolium"
strsplit(ACMI, " ")
[[1]]
[1] "Achillea"    "millefolium"

Note that the result is a list. To get back to a vector (now with two components), I can use the function unlist().

unlist(strsplit(ACMI, " "))
[1] "Achillea"    "millefolium"

Here I split based on the letter "l".

strsplit(ACMI, "l")
[[1]]
[1] "Achi"  ""      "ea mi" ""      "efo"   "ium"  

Interestingly, letting the split criterion equal NULL results in the string being split into its individual characters.

strsplit(ACMI, NULL)
[[1]]
 [1] "A" "c" "h" "i" "l" "l" "e" "a" " " "m" "i" "l" "l" "e" "f" "o" "l"
[18] "i" "u" "m"

We can use this outcome to reverse the order of characters in a string.

sapply(lapply(strsplit(ACMI, NULL), rev), paste, collapse = "")
[1] "muilofellim aellihcA"

The function rev() provides a reversed version of its first argument, in this case the result from strsplit(). The function paste() can be used to paste together character strings; here collapse = "" joins the reversed characters into a single string with no separator.

\(\blacksquare\)

Split criteria can include multiple characters in a particular order; note also that matching is case sensitive:

x <- "R is free software and comes with ABSOLUTELY NO WARRANTY"
strsplit(x, "so")
[[1]]
[1] "R is free "                                  
[2] "ftware and comes with ABSOLUTELY NO WARRANTY"

Note that the "SO" in "ABSOLUTELY" is ignored because it is upper case.

4.3.3 grep() and grepl()

The functions grep() and grepl() can be used to identify which elements in a character vector have a specified pattern. They have the same first two arguments.

  • The first argument, pattern, specifies a pattern to be matched. This can be a character string, an object coercible to a character string, or a regular expression (see below).
  • The second argument, x, is a character vector where matches are sought.

Example 4.14 \(\text{}\)
The function grep() returns indices identifying which entries in a vector contain a queried pattern. In the character vector below, we see that entries five and six have the same genus, Carex.

names = c("Achillea millefolium", "Aster foliaceus",
                           "Elymus scribneri", "Erigeron rydbergii",
                           "Carex elynoides", "Carex paysonis",
                           "Taraxacum ceratophorum")

grep("Carex", names)
[1] 5 6

The function grepl() does the same thing with Boolean outcomes.

grepl("Carex", names)
[1] FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE

Of course, we could use this information to subset names.

names[grep("Carex", names)]
[1] "Carex elynoides" "Carex paysonis" 

We can also get grep to return the values directly by specifying value = TRUE.

grep("Carex", names, value = TRUE)
[1] "Carex elynoides" "Carex paysonis" 

\(\blacksquare\)

4.3.4 gsub()

The function gsub() can be used to substitute text that has a specified pattern. Several of its arguments are identical to grep() and grepl():

  • As before, the first argument, pattern, specifies a pattern to be matched.
  • The second argument, replacement, specifies a replacement for the matched pattern.
  • The third argument, x, is a character vector wherein matches are sought and substitutions are made.

Example 4.15
Here we substitute "C." for occurrences of "Carex" in names.

gsub("Carex", "C.", names)
[1] "Achillea millefolium"   "Aster foliaceus"       
[3] "Elymus scribneri"       "Erigeron rydbergii"    
[5] "C. elynoides"           "C. paysonis"           
[7] "Taraxacum ceratophorum"

\(\blacksquare\)
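The related base function sub() replaces only the first match in each string, whereas gsub() replaces all matches:

```r
nm <- "Achillea millefolium"
sub("l", "L", nm)   # first match only: "AchiLlea millefolium"
gsub("l", "L", nm)  # every match:      "AchiLLea miLLefoLium"
```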

4.3.5 gregexpr()

The function gregexpr() identifies the locations (starting positions and lengths) of pattern matches in a character vector.

Example 4.16
Here we examine the first two entries in names, looking for the genus Aster.

gregexpr("Aster", names[c(1:2)])
[[1]]
[1] -1
attr(,"match.length")
[1] -1
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE

[[2]]
[1] 1
attr(,"match.length")
[1] 5
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE

The output list is cryptic at best and requires some explanation. For each list component, the first element gives the starting character position of the match, while the match.length attribute gives the number of characters matched. For the first list component these values are both -1 because "Achillea millefolium" does not contain the pattern "Aster". For the second list component they are 1 and 5 because "Aster" begins at the first character of "Aster foliaceus" and is five characters long.

\(\blacksquare\)

4.3.6 Regular Expressions

A number of R functions for managing character strings, including grep(), grepl(), gregexpr(), gsub(), and strsplit(), can incorporate regular expressions. In computer programming, a regular expression (often abbreviated as regex) is a sequence of characters that allows pattern matching in text. Regular expressions have developed within a number of programming frameworks, including the POSIX standard (the Portable Operating System Interface standard), developed by the IEEE, and particularly the language Perl. Regular expressions in R include extended regular expressions (the default for most pattern matching and replacement R functions) and Perl-like regular expressions.

4.3.6.1 Extended Regular Expressions

Default extended regular expressions in R use a POSIX framework for commands, which includes the use of particular metacharacters. These are: \, |, ( ), [ ], ^, $, ., { }, *, +, and ?. The metacharacters vary in meaning depending on whether they occur outside or inside of square brackets, [ and ]. The latter usage means that they are part of a character class (see below). In non-bracketed usage, the metacharacters in the subset below have the following applications (see https://www.pcre.org/original/pcre.txt):

  • ^ start of string or line.
  • $ end of string or line.
  • . match any character except newline.
  • | start of alternative branch.
  • ( ) start and end subpattern.
  • { } start and end min/max repetition specification.

Several regular expression metacharacters can be placed at the end of a pattern to specify repetition. For instance, "*" indicates the preceding pattern should be matched zero or more times, "+" indicates the preceding pattern should be matched one or more times, "{n}" indicates the preceding pattern should be matched exactly n times, and "{n,}" indicates the preceding pattern should be matched n or more times.

Example 4.17
We will use the function regmatches(), which extracts or replaces matched substrings from gregexpr() summaries, to illustrate.

string <- "%aaabaaab"
ID <- gregexpr("a{1}", string)
regmatches(string, ID)
[[1]]
[1] "a" "a" "a" "a" "a" "a"
ID <- gregexpr("a{2}", string)
regmatches(string, ID)
[[1]]
[1] "aa" "aa"
ID <- gregexpr("a{2,}", string)
regmatches(string, ID)
[[1]]
[1] "aaa" "aaa"

\(\blacksquare\)

Example 4.18
Metacharacters can be used together. For instance, the code below demonstrates how one might get rid of one or more extra spaces at the end of character strings.

string <- c("###Now is the time      ",
            "# for all  ",
            "#",
            " good men",
            "### to come to the aid of their country.       ")

out <- gsub(" +$", "", string) # drop extra space(s) at end of strings
out <- gsub("^#*","", out) # drop pound sign(s)

paste(out, collapse = "")
[1] "Now is the time for all good men to come to the aid of their country."

\(\blacksquare\)

Example 4.19
As a biological example, microbial “taxa” identifiers can include cryptic Amplicon Sequence Variant (ASV) codes, followed by a general taxonomic assignment. For example, here is an ASV identifier for a bacterium within the family Comamonadaceae.

asv <- "6abc517aa40e9e7b9c652902fe04bb1a:f__Comamonadaceae"

We can delete the ASV code, which ends in a colon, with:

gsub(".*:", "", asv)
[1] "f__Comamonadaceae"

The regex script in the first argument means: “match any run of characters, occurring zero or more times, up to and including a colon”.

\(\blacksquare\)

Example 4.20
As another example, R Markdown delimits monospace font using accent grave metacharacters, ` `, while LaTeX applies this font between the expressions \texttt{ and }. Below I convert an R Markdown-style character vector containing some monospace strings to a LaTeX-style character vector.

char.vec <- c("`+`", "addition", "$2 + 2$", "`2 + 2`")
gsub("(`)(.*)(`)","\\\texttt{\\2}", char.vec)
[1] "\texttt{+}"     "addition"      "$2 + 2$"       "\texttt{2 + 2}"

With the code

"(`)(.*)(`)"

I subset R Markdown strings in char.vec into three potential components: 1) the ` metacharacter beginning the string, 2) the text content between ` metacharacters, and 3) the closing ` metacharacter itself. I insert the content in item 2 (indicated as \\2) between \texttt{ and } using:

"\\\texttt{\\2}"

\(\blacksquare\)

Importantly, Example 4.20 illustrates the procedure to use when a queried character is itself a regular expression metacharacter, for instance the backslash in \texttt. In this case, the metacharacter must be escaped using single or double backslashes. That is, \texttt must be specified as \\\texttt in gsub().

Example 4.21
Here I ask for a string split based on the appearance of ? (which is a regex metacharacter) and % (which is not).

string <- "m?2%b"
strsplit(string, "[\\?%]")
[[1]]
[1] "m" "2" "b"

\(\blacksquare\)

Character class

A regular expression character class consists of a collection of characters, specifying some query or pattern, situated between quotes (single or double) and the square bracket metacharacters [ and ]. Thus, the code "[\\?%]" in the previous example defines a character class. Character class pattern matches will be evaluated for any single character in the specified text. The reverse will occur if the first character of the pattern is the regular expression caret metacharacter, ^. For example, the expression "[0-9]" matches any single numeric character in a string (the regular expression metacharacter - can be used to specify a range), and "[^abc]" matches anything except the characters "a", "b", or "c".

Example 4.22
Consider the following examples:

string <- "a1c&m2%b"
strsplit(string, "[0-9]") 
[[1]]
[1] "a"   "c&m" "%b" 
strsplit(string, "[^abc]") 
[[1]]
[1] "a" "c" ""  ""  ""  "b"

\(\blacksquare\)

Example 4.23
This regular expression will match most email addresses:

pattern <- "[-a-z0-9_.%]+\\@[-a-z0-9_.%]+\\.[a-z]+"

The expression literally reads: “1) find one or more occurrences of characters in a-z, 0-9, dashes, underscores, periods, or percent signs, followed by 2) the at symbol, @, followed by 3) one or more occurrences of characters in a-z, 0-9, dashes, underscores, periods, or percent signs, followed by 4) a literal period, followed by 5) one or more occurrences of the letters a-z.” Upper case letters are also matched because ignore.case = TRUE is specified in the call below. Here is a string we wish to query:

string <- c("abc_noboby@isu.edu",
            "text with no email",
            "me@mything.com",
            "also",
            "you@yourspace.com",
            "@you"
            )

We confirm that elements 1, 3, and 5 from string are email addresses.

grep(pattern, string, ignore.case = TRUE, value = TRUE)
[1] "abc_noboby@isu.edu" "me@mything.com"     "you@yourspace.com" 

\(\blacksquare\)

Certain character classes are predefined. These classes have names that are bounded by two square brackets and colons, and include "[[:lower:]]" and "[[:upper:]]", which identify lower and upper case letters, "[[:punct:]]", which identifies punctuation, "[[:alnum:]]", which identifies all alphanumeric characters, and "[[:space:]]", which identifies space characters, e.g., tab and newline.

string <- c("M2Ab", "def", "?", "%", "\n")
grepl("[[:lower:]]", string)
[1]  TRUE  TRUE FALSE FALSE FALSE
grepl("[[:upper:]]", string)
[1]  TRUE FALSE FALSE FALSE FALSE
grepl("[[:punct:]]", string)
[1] FALSE FALSE  TRUE  TRUE FALSE
grepl("[[:space:]]", string) # item five is a newline request
[1] FALSE FALSE FALSE FALSE  TRUE

Here I ask R to return elements from string that contain runs of three or more consecutive alphanumeric characters.

grep("[[:alnum:]]{3}", string, value = TRUE)
[1] "M2Ab" "def" 

For some pattern matching and replacement jobs it may be best to turn off the default extended regular expressions and use exact matching by specifying fixed = TRUE. For example, R may place periods in place of spaces in character strings and in the column names of dataframes and arrays.

Example 4.24
Consider the following example:

countries <- c("United.States", "United.Arab.Emirates", "China", "Germany")
gsub(".", " ", countries)
[1] "             "        "                    " "     "               
[4] "       "             

Note that gsub(".", " ", countries) replaces every character with a space, because the period metacharacter matches any character. To get the desired result we could use:

gsub(".", " ", countries, fixed = TRUE)
[1] "United States"        "United Arab Emirates" "China"               
[4] "Germany"             

Of course we could also double escape the period.

gsub("\\.", " ", countries)
[1] "United States"        "United Arab Emirates" "China"               
[4] "Germany"             

\(\blacksquare\)

4.3.6.2 Perl-like Regular Expressions

The R character string functions grep(), grepl(), regexpr(), gregexpr(), sub(), gsub(), and strsplit() allow Perl-like regular expression pattern matching. This is done by specifying perl = TRUE, which switches regular expression handling to the PCRE (Perl Compatible Regular Expressions) library. Perl allows handling of the POSIX predefined character classes, e.g., "[[:lower:]]", along with a wide variety of other calls, which are generally implemented using metacharacters and double backslash commands. Here are some examples.

  • \\d any decimal digit.
  • \\D any character that is not a decimal digit.
  • \\h any horizontal white space character (e.g., tab, space).
  • \\H any character that is not a horizontal white space character.
  • \\s any white space character.
  • \\S any character that is not a white space character.
  • \\v any vertical white space character (e.g., newline).
  • \\V any character that is not a vertical white space character.
  • \\w any word character, i.e., a letter, digit, or underscore.
  • \\W any character that is not a word character.
  • \\b a word boundary.
  • \\U convert subsequent characters to upper case (used in replacement text).
  • \\L convert subsequent characters to lower case (used in replacement text).

Note that capitalizing a command reverses its meaning.
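
As a small sketch of the word boundary command \\b (using a made-up sentence), boundaries restrict a match to whole words:

```r
s <- "the theater has the best seats"
# replace the whole word "the" only; "the" inside "theater" is untouched
gsub("\\bthe\\b", "a", s, perl = TRUE)  # "a theater has a best seats"
```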

Example 4.25 \(\text{}\)

Here we identify string entries containing numbers.

string <- c("Acidobacteria", "Actinobacteria", "TM7.1", "Gitt-GS-136", 
            "Chloroflexia", "Bacili")

grep("\\d", string, perl = TRUE)
[1] 3 4

And those containing non-numeric characters (i.e., all of the entries).

grep("\\D", string, perl = TRUE)
[1] 1 2 3 4 5 6

To subset non-numeric entries, one could do something like:

string[-grep("\\d", string, perl = TRUE)]
[1] "Acidobacteria"  "Actinobacteria" "Chloroflexia"   "Bacili"        
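
An equivalent approach, avoiding negative indexing, uses the invert argument of grep():

```r
string <- c("Acidobacteria", "Actinobacteria", "TM7.1", "Gitt-GS-136",
            "Chloroflexia", "Bacili")
# return elements that do NOT contain a decimal digit
grep("\\d", string, perl = TRUE, invert = TRUE, value = TRUE)
```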

\(\blacksquare\)

Example 4.26 \(\text{}\)
As a slightly extended example we will count the number of words in the GNU General Public License shipped with R (obtained via RShowDoc("COPYING")). Ideas here largely follow the function DescTools::StrCountW() (Signorell 2023).

Text can be read from a connection using the function readLines().

GNU <- readLines(RShowDoc("COPYING"))
head(GNU)
[1] "\t\t    GNU GENERAL PUBLIC LICENSE"                                               
[2] "\t\t       Version 2, June 1991"                                                  
[3] ""                                                                               
[4] " Copyright (C) 1989, 1991 Free Software Foundation, Inc."                       
[5] "                       51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA"
[6] " Everyone is permitted to copy and distribute verbatim copies"                  

Note that the escape sequence \t represents the ASCII (American Standard Code for Information Interchange) control character for tab. Other useful escape sequences include \n (newline) and \r (carriage return).
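
As a quick sketch, each escape sequence counts as a single character, and cat() renders it as whitespace:

```r
s <- "a\tb\nc"
nchar(s)  # 5: each escape sequence is one character
cat(s)    # prints the tab and newline as whitespace
```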

To search for words, we will actually identify string components that are not words, identified with the Perl regex command \\W, bounded by word boundaries, i.e., \\b. We can combine these as: \\b\\W+\\b. The call \\W+ indicates a non-word match occurring one or more times. Here we apply this regular expression to the first element of GNU.

GNU[1]
[1] "\t\t    GNU GENERAL PUBLIC LICENSE"
gregexpr("\\b\\W+\\b", GNU[1], perl = TRUE)
[[1]]
[1] 10 18 25
attr(,"match.length")
[1] 1 1 1
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE

Matches occur at three locations, 10, 18, and 25, which separate the four words GNU GENERAL PUBLIC LICENSE. Thus, to analyze the entire document we could use:

sum(sapply(gregexpr("\\b\\W+\\b", GNU, perl = TRUE),
           function(x) sum(x > 0)) + 1)
[1] 3048

There are 3048 total words in the license description.
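
A sketch of an alternative cross-check: split each line on runs of non-word characters with strsplit() and count the non-empty pieces. (The helper function word.count() below is made up for illustration.)

```r
# count words by splitting on runs of non-word characters
word.count <- function(x) {
  pieces <- strsplit(x, "\\W+", perl = TRUE)
  sum(sapply(pieces, function(p) sum(nzchar(p))))
}
word.count(c("\t\t    GNU GENERAL PUBLIC LICENSE",
             "\t\t       Version 2, June 1991"))  # 8
```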

\(\blacksquare\)

One can identify substrings by number using Perl.

Example 4.27 \(\text{}\)
In this example, I subdivide a string into two components, the first character, i.e., "(\\w)", and the remaining zero or more characters: "(\\w*)". These are referred to in the replacement argument of gsub() as items \\1 and \\2, respectively. Capitalization for these substrings is handled in different ways below.

string <- "achillea"
gsub("(\\w)(\\w*)", "\\U\\1\\U\\2", string, perl=TRUE) # all caps
[1] "ACHILLEA"
gsub("(\\w)(\\w*)", "\\L\\1\\U\\2", string, perl=TRUE) # low, then upper case
[1] "aCHILLEA"
gsub("(\\w)(\\w*)", "\\U\\1\\L\\2", string, perl=TRUE) # up, then lower case
[1] "Achillea"

The functions tolower() and toupper() provide simpler approaches to convert letters to lower and upper case, respectively.

toupper(string)
[1] "ACHILLEA"

\(\blacksquare\)

4.4 Date-Time Classes

There are two basic R date-time classes, POSIXlt and POSIXct4. Class POSIXct represents the (signed) number of seconds since the beginning of 1970 (in the UTC time zone) as a numeric vector. An object of class POSIXlt will be comprised of a list of vectors with the names sec, min, hour, mday (day of month), mon (month), year, wday (day of week), and yday (day of year).
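
A small sketch of these internal representations:

```r
x <- as.POSIXct("1970-01-01 00:01:00", tz = "UTC")
as.numeric(x)       # 60 seconds since the start of 1970
lt <- as.POSIXlt(x)
lt$min              # the minute component: 1
lt$year             # note: years are counted from 1900, so 70
```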

POSIX naming conventions include:

  • %m = Month as a decimal number (01–12).
  • %d = Day of the month as a decimal number (01–31).
  • %Y = Year. Designations in 0:9999 are accepted.
  • %H = Hour as a decimal number (00–23).
  • %M = Minute as a decimal number (00–59).
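
These codes work in both directions: strptime() uses them to parse character strings, while format() uses them to write date-times back out. A brief sketch, with a made-up date:

```r
d <- strptime("07/04/2021 09:30", format = "%m/%d/%Y %H:%M")
format(d, "%Y-%m-%d %H:%M")  # "2021-07-04 09:30"
```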

Example 4.28 \(\text{}\)
As an example, below are twenty dates and corresponding binary water presence measures (0 = water absent, 1 = water present) recorded at 2.5 hour intervals for an intermittent stream site in southwest Idaho (Aho et al. 2023).

dates <- c("08/13/2019 04:00", "08/13/2019 06:30", "08/13/2019 09:00",
           "08/13/2019 11:30", "08/13/2019 14:00", "08/13/2019 16:30",
           "08/13/2019 19:00", "08/13/2019 21:30", "08/14/2019 00:00",
           "08/14/2019 02:30", "08/14/2019 05:00", "08/14/2019 07:30",
           "08/14/2019 10:00", "08/14/2019 12:30", "08/14/2019 15:00",
           "08/14/2019 17:30", "08/14/2019 20:00", "08/14/2019 22:30",
           "08/15/2019 01:00", "08/15/2019 03:30")

pres.abs <- c(1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1)

To convert the character strings in dates to a date-time object we can use the function strptime(). We have:

dates.ts <- strptime(dates, format = "%m/%d/%Y %H:%M")
class(dates.ts)
[1] "POSIXlt" "POSIXt" 
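
Note that if the format argument does not match the structure of the input string, strptime() returns NA rather than an error:

```r
# mismatched format: the data use %m/%d/%Y, not %Y-%m-%d
strptime("08/13/2019 04:00", format = "%Y-%m-%d")  # NA
```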

Note that the dates can now be evaluated numerically.

dates.df <- data.frame(dates = dates.ts, pres.abs = pres.abs)
summary(dates.df)
     dates                        pres.abs   
 Min.   :2019-08-13 04:00:00   Min.   :0.00  
 1st Qu.:2019-08-13 15:52:30   1st Qu.:0.75  
 Median :2019-08-14 03:45:00   Median :1.00  
 Mean   :2019-08-14 03:45:00   Mean   :0.75  
 3rd Qu.:2019-08-14 15:37:30   3rd Qu.:1.00  
 Max.   :2019-08-15 03:30:00   Max.   :1.00  

I can also easily extract time series components.

dates.ts$mday # day of month
 [1] 13 13 13 13 13 13 13 13 14 14 14 14 14 14 14 14 14 14 15 15
dates.ts$wday # day of week
 [1] 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 4 4
dates.ts$hour # hour
 [1]  4  6  9 11 14 16 19 21  0  2  5  7 10 12 15 17 20 22  1  3
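
Because POSIX date-times are numeric underneath, arithmetic on them is straightforward. As a sketch, diff() recovers the 2.5 hour sampling interval (here using the first three dates from above):

```r
d3 <- strptime(c("08/13/2019 04:00", "08/13/2019 06:30", "08/13/2019 09:00"),
               format = "%m/%d/%Y %H:%M")
diff(d3)  # time differences of 2.5 hours
```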

\(\blacksquare\)

Exercises

  1. Using the plant dataset from Question 5 in the Exercises at the end of Chapter 3, perform the following operations.

    1. Attempt to simultaneously calculate the column means for plant height and soil % N using FUN = mean in apply(). Was there an issue? Why?
    2. Eliminate missing rows in plant using na.omit() and repeat (a). Did this change the mean for plant height? Why?
    3. Modify the FUN argument in apply() to be: FUN = function(x) mean(x, na.rm = TRUE). This will eliminate NAs on a column by column basis.
    4. Compare the results in (a), (b), (c). Which is the best approach? Why?
    5. Find the mean and variance of plant heights for each Management Type in plant using tapply(). Use the best practice approach for FUN, as deduced in (d).
  2. For the questions below, use the list list.data below.

    1. Use sapply(list.data, FUN = length) to get the number of components in each element of list.data.

    2. Repeat (a) using lapply(). How is the output in (b) different from (a)?

      list.data <- list(a = 1:9, height = rnorm(50),
                        greet = c("hello", "goodbye", "hello"))
  3. A frequently used statistical application is the calculation of all possible mean differences. Assume that we have the means given in the object means below.

    1. Calculate all possible mean differences using means as the first two arguments in outer(), and letting FUN = "-".

    2. Extract meaningful and non-redundant differences by using upper.tri() or lower.tri() (Section 3.4.4). There should be \({5 \choose 2} = 10\) meaningful (not simply a mean subtracted from itself) and non-redundant differences.

      means <- c(trt1 = 20.5, trt2 = 15.3, trt3 = 22.1, trt4 = 30.4, 
                 trt5 = 28)
  4. Using the plant dataset from Question 5 in the Exercises for Chapter 3, perform the following operations.

    1. Use the function replace() to relabel samples with soil N less than 13.5% as "Npoor".
    2. Use the function which() to identify which plant heights are greater than or equal to 33.2 dm.
    3. Sort plant heights using the function sort().
    4. Sort the plant dataset with respect to ascending values of plant height using the function order().
  5. Using match() or which() and %in%, replace the code column names in the dataset cliff.sp from the package asbio with the correct scientific names (genus and specific epithet) from the dataframe sp.list below.

    sp.list <- data.frame(code = c("L_ASCA","L_CLCI","L_COSPP","L_COUN",
    "L_DEIN","L_LCAT", "L_LCST","L_LEDI","M_POSP","L_STDR","L_THSP","L_TOCA",
    "L_XAEL","M_AMSE", "M_CRFI","M_DISP","M_WECO","P_MIGU","P_POAR",
    "P_SAOD"), sci.name = c("Aspicilia caesiocineria","Caloplaca citrina",
    "Collema spp.", "Collema undulatum", "Dermatocarpon intestiniforme",
    "Lecidea atrobrunnea", "Lecidella stigmatea", "Lecanora dispersa",
    "Pohlia sp.", "Staurothele drummondii", "Thelidium species",
    "Toninia candida", "Xanthoria elegans", "Amblystegium serpens",
    "Cratoneuron filicinum", "Dicranella species", "Weissia controversa",
    "Mimulus guttatus", "Poa pattersonii", "Saxifraga odontoloma"))
  6. Using the sp.list dataframe from the previous question, perform the following operations:

    1. Apply strsplit() to the column sp.list$sci.name to create a two column dataframe with genus and corresponding species names.
    2. A two character prefix in the column sp.list$code indicates whether a taxon is a lichen (prefix = "L_"), a marchantiophyte (prefix = "M_"), or a vascular plant (prefix = "P_"). Use grep() to identify marchantiophytes.
  7. Use the string vector string below to answer the following questions:

    1. Use regular expressions in the pattern argument of gsub() to get rid of extra spaces at the start of string elements while preserving spaces between words.

    2. Use the predefined character class [[:alnum:]] and an accompanying quantifier in the pattern argument from grep() to count the number of words whose length is greater than or equal to four characters.

      string <- c("   Statistics is ", "      a ", " great topic.")
  8. Remove the numbers from the character vector below using gsub() and an appropriate Perl-like regular expression.

    x <- c("enzyme1","enzyme12","enzyme3","tRNA1","tRNA205",
           "mRNA6","mRNA17","mRNA8","mRNA100") 
  9. Consider the character vector times below, which has the format: day-month-year hour:minute:second.

    1. Convert times into an object of class POSIXlt called time.pos using the function strptime().

    2. Extract the day of the week from time.pos.

    3. Sort time.pos using sort() to verify that time.pos is quantitative.

      times <- c("12-12-2023 12:12:20",
                 "12-01-2021 01:12:40",
                 "15-10-2021 23:10:15",
                 "25-07-2022 13:09:45")

References

Aho, Ken, Dewayne Derryberry, Sarah E Godsey, Rob Ramos, Sara R Warix, and Samuel Zipper. 2023. “Communication Distance and Bayesian Inference in Non-Perennial Streams.” Water Resources Research 59 (11): e2023WR034513.
Signorell, Andri. 2023. DescTools: Tools for Descriptive Statistics. https://CRAN.R-project.org/package=DescTools.
Wikipedia. 2023. “Perl.” https://en.wikipedia.org/wiki/Perl.
———. 2024. “String (Computer Science).” https://en.wikipedia.org/wiki/String_(computer_science).

  1. In computer programming, a string is generally a (non-numeric) sequence of characters (Wikipedia 2024). R frequently uses character vectors, i.e., vec <- c("a", "b", "c"). Each entry in vec would be conventionally considered to be a character string.↩︎

  2. The Perl programming language was introduced by Larry Wall in 1987 as a Unix scripting tool to facilitate report processing (Wikipedia 2023). Despite criticisms as an awkward language, Perl remains widely used for its regular expression framework and string parsing capabilities.↩︎

  3. Specifically, they use a version of the POSIX 1003.2 standard.↩︎

  4. Again, the POSIX prefix refers to the IEEE standard Portable Operating System Interface.↩︎