Chapter 4 Basic Data Management
“I think, therefore I R.”
- William B. King, Psychologist and R enthusiast
An important characteristic of R is its capacity to efficiently manage and analyze large, complex datasets. In this chapter I list a few functions and approaches useful for data management in base R. Data management considerations for the tidyverse are given in Chapter 5.
4.1 Operations on Arrays, Lists and Vectors
Functions can be applied to every row or column of an array, or to every component of a list or atomic vector, using a number of time-saving methods.
4.1.1 The apply Family of Functions
4.1.1.1 apply()
Operations can be performed quickly on the rows and columns of two-dimensional arrays with the function apply(). The function requires three arguments.
- The first argument, X, specifies an array to be analyzed.
- The second argument, MARGIN, indicates whether rows or columns are to be analyzed. MARGIN = 1 indicates rows, MARGIN = 2 indicates columns, whereas MARGIN = c(1, 2) indicates rows and columns.
- The third argument, FUN, defines a function to be applied to the margins of the object in the first argument.
Example 4.1
Consider the asbio::bats dataset, which contains forearm length data, in millimeters, for northern myotis bats (Myotis septentrionalis), along with corresponding bat ages in days.
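For instance, assuming the asbio package is installed, the first six rows can be viewed with:
library(asbio)
data(bats)
head(bats)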
days forearm.length
1 1 10.5
2 1 11.0
3 1 12.3
4 1 13.7
5 1 14.2
6 1 14.8
Here we obtain minimum values for the days and forearm.length columns.
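Using MARGIN = 2 (columns) and FUN = min:
apply(bats, 2, min)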
days forearm.length
1.0 10.5
It is straightforward to change the third argument in apply() to obtain different summaries, like the mean.
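For example:
apply(bats, 2, mean)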
days forearm.length
13.57895 23.60263
or the standard deviation
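For instance:
apply(bats, 2, sd)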
days forearm.length
12.461035 8.434725
Several summary statistical functions exist for numerical arrays that can be used in some instances in place of apply(). These include rowMeans() and colMeans(), which give the sample means of specified rows and columns, respectively, and rowSums() and colSums(), which give the sums of specified rows and columns, respectively. For instance:
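Here colMeans() reproduces the column means obtained above:
colMeans(bats)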
days forearm.length
13.57895 23.60263
\(\blacksquare\)
4.1.1.2 lapply()
The function lapply() allows one to sweep functions through list components. It has two main arguments:
- The first argument, X, specifies a list to be analyzed.
- The second argument, FUN, defines a function to be applied to each element in X.
Example 4.2
Consider the following simple list, whose three components have different lengths.
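A list with this structure can be built as follows (the norm.obs values shown were generated with rnorm(), so regenerated numbers will differ):
x <- list(a = 1:8,
          norm.obs = rnorm(10),
          logic = c(TRUE, TRUE, FALSE, FALSE))
x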
$a
[1] 1 2 3 4 5 6 7 8
$norm.obs
[1] 1.22730347 0.10854551 1.33289436 -0.10706836 2.39835128
[6] -0.05349229 0.26075305 1.18173824 0.59081373 1.52014568
$logic
[1] TRUE TRUE FALSE FALSE
Here we sweep the function mean() through the list:
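For example:
lapply(x, mean)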
$a
[1] 4.5
$norm.obs
[1] 0.8459985
$logic
[1] 0.5
Note the Boolean outcomes in logic have been coerced to numeric outcomes. Specifically, TRUE = 1 and FALSE = 0. Here are the 1st, 2nd (median), and 3rd quartiles of x:
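Additional arguments to FUN, here the quantile probabilities, are passed through the ellipsis:
lapply(x, quantile, probs = c(0.25, 0.5, 0.75))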
$a
25% 50% 75%
2.75 4.50 6.25
$norm.obs
25% 50% 75%
0.1465974 0.8862760 1.3064966
$logic
25% 50% 75%
0.0 0.5 1.0
\(\blacksquare\)
4.1.1.3 sapply()
The function sapply() is a user-friendly wrapper for lapply() that can return a vector or array instead of a list.
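For instance, repeating the quartile summary with sapply() returns a matrix:
sapply(x, quantile, probs = c(0.25, 0.5, 0.75))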
a norm.obs logic
25% 2.75 0.1465974 0.0
50% 4.50 0.8862760 0.5
75% 6.25 1.3064966 1.0
4.1.1.4 tapply()
The tapply() function allows summarization of a one-dimensional array (e.g., a column or row from a matrix) with respect to levels in a categorical variable. The function requires three arguments.
- The first argument, X, defines a one-dimensional array to be analyzed.
- The second argument, INDEX, should provide a list of one or more factors (see example below) with the same length as X.
- The third argument, FUN, is used to specify a function to be applied to X for each level in INDEX.
Example 4.3
Consider the dataset asbio::heart, which documents pulse rates for twenty-four subjects at four time periods following administration of an experimental treatment. These were two active heart medications and a control. Here are average heart rates for the treatments.
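For instance, using the rate and drug columns:
data(heart)
with(heart, tapply(rate, drug, mean))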
AX23 BWW9 Ctrl
76.28125 81.03125 71.90625
Here are the mean heart rates for treatments, for each time frame. Note that the second argument is defined as a list with two components, each of which can be coerced to be a factor.
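Something like:
with(heart, tapply(rate, list(drug = drug, time = time), mean))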
time
drug t1 t2 t3 t4
AX23 70.50 80.500 81.000 73.125
BWW9 81.75 84.000 78.625 79.750
Ctrl 72.75 72.375 71.500 71.000
\(\blacksquare\)
The function aggregate() can be considered a more sophisticated extension of tapply(). It allows objects under consideration to be expressed as functions of explanatory factors, and contains additional arguments for data specification and time series analyses.
Example 4.4
Here we use aggregate() to get identical (but reformatted) results to the prior example.
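For instance:
aggregate(rate ~ drug + time, data = heart, FUN = mean)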
drug time rate
1 AX23 t1 70.500
2 BWW9 t1 81.750
3 Ctrl t1 72.750
4 AX23 t2 80.500
5 BWW9 t2 84.000
6 Ctrl t2 72.375
7 AX23 t3 81.000
8 BWW9 t3 78.625
9 Ctrl t3 71.500
10 AX23 t4 73.125
11 BWW9 t4 79.750
12 Ctrl t4 71.000
Importantly, the first argument, rate ~ drug + time, is in the form of a formula:
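We can verify this with class():
class(rate ~ drug + time)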
[1] "formula"
Here the tilde operator, ~, allows expression of the formulaic framework y ~ model, where y is a response variable and model specifies a system of (generally) one or more predictor variables.
\(\blacksquare\)
4.1.2 outer()
Another important function for matrix operations is outer(). This algorithm allows creation of an array that contains all possible combinations of two atomic vectors or arrays with respect to a user-specified function. The outer() function has three required arguments.
- The first two arguments, X and Y, define arrays or atomic vectors. X and Y can be identical if one wishes to examine pairwise operations of the array elements (see example below).
- The third argument, FUN, specifies a function to be used in operations.
Example 4.5
Suppose I wish to find the means of all possible pairs of observations from an atomic vector. I could use the following commands:
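For instance, with x taken to be the vector c(1, 2, 3, 5, 4):
x <- c(1, 2, 3, 5, 4)
outer(x, x, "+")/2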
[,1] [,2] [,3] [,4] [,5]
[1,] 1.0 1.5 2.0 3.0 2.5
[2,] 1.5 2.0 2.5 3.5 3.0
[3,] 2.0 2.5 3.0 4.0 3.5
[4,] 3.0 3.5 4.0 5.0 4.5
[5,] 2.5 3.0 3.5 4.5 4.0
The argument FUN = "+" indicates that we wish to add elements to each other. We divide these sums by two to obtain means. Note that the diagonal of the output matrix contains the original elements of x, because the mean of a number and itself is the original number. The upper and lower triangles are identical because the mean of elements a and b will be the same as the mean of the elements b and a. Note that the result outer(x, x, "*") can also be obtained using x %o% x because %o% is the matrix algebra outer product operator in R.
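For example:
outer(x, x, "*")
x %o% x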
[,1] [,2] [,3] [,4] [,5]
[1,] 1 2 3 5 4
[2,] 2 4 6 10 8
[3,] 3 6 9 15 12
[4,] 5 10 15 25 20
[5,] 4 8 12 20 16
[,1] [,2] [,3] [,4] [,5]
[1,] 1 2 3 5 4
[2,] 2 4 6 10 8
[3,] 3 6 9 15 12
[4,] 5 10 15 25 20
[5,] 4 8 12 20 16
\(\blacksquare\)
4.1.3 stack(), unstack() and reshape()
When manipulating lists and dataframes it is often useful to move between so-called “long” and “wide” data table formats. These operations can be handled with the functions stack() and unstack(). Specifically, stack() concatenates multiple vectors into a single vector, along with a factor indicating where each observation originated, whereas unstack() reverses this process.
Example 4.6
Consider the 4 x 4 dataframe below.
dataf <- data.frame(matrix(nrow = 4, data = rnorm(16)))
names(dataf) <- c("col1", "col2", "col3", "col4")
dataf
col1 col2 col3 col4
1 0.9441796 -0.04453284 0.06968586 0.4873315
2 1.9103583 -0.60254997 -0.14386612 -0.9688990
3 1.2850721 0.59731920 1.37414563 0.4051866
4 0.9670279 -1.32108678 0.14105541 0.5456421
Here I stack dataf into a long table format.
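For instance (the result is stored as sdataf for later use):
sdataf <- stack(dataf)
sdataf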
values ind
1 0.94417965 col1
2 1.91035833 col1
3 1.28507212 col1
4 0.96702792 col1
5 -0.04453284 col2
6 -0.60254997 col2
7 0.59731920 col2
8 -1.32108678 col2
9 0.06968586 col3
10 -0.14386612 col3
11 1.37414563 col3
12 0.14105541 col3
13 0.48733155 col4
14 -0.96889895 col4
15 0.40518657 col4
16 0.54564209 col4
Here I unstack sdataf.
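For example:
unstack(sdataf)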
col1 col2 col3 col4
1 0.9441796 -0.04453284 0.06968586 0.4873315
2 1.9103583 -0.60254997 -0.14386612 -0.9688990
3 1.2850721 0.59731920 1.37414563 0.4051866
4 0.9670279 -1.32108678 0.14105541 0.5456421
The function reshape() can handle both stacking and unstacking operations. Here I stack dataf. The arguments timevar, idvar, and v.names are used to provide recognizable identifiers for the columns in the wide table format, observations within those columns, and responses for those combinations.
reshape(dataf, direction = "long",
varying = list(names(dataf)),
timevar = "Column",
idvar = "Column obs.",
v.names = "Response")
Column Response Column obs.
1.1 1 0.94417965 1
2.1 1 1.91035833 2
3.1 1 1.28507212 3
4.1 1 0.96702792 4
1.2 2 -0.04453284 1
2.2 2 -0.60254997 2
3.2 2 0.59731920 3
4.2 2 -1.32108678 4
1.3 3 0.06968586 1
2.3 3 -0.14386612 2
3.3 3 1.37414563 3
4.3 3 0.14105541 4
1.4 4 0.48733155 1
2.4 4 -0.96889895 2
3.4 4 0.40518657 3
4.4 4 0.54564209 4
\(\blacksquare\)
4.2 Other Simple Data Management Functions
4.2.1 replace()
One can use the function replace() to replace elements in an atomic vector based, potentially, on Boolean logic. The function requires three arguments.
- The first argument, x, specifies the vector to be analyzed.
- The second argument, list, indicates which elements are to be replaced. A logical expression can be used here as a replacement index.
- The third argument, values, defines the replacement value(s).
Example 4.7
For instance:
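Consider a vector of ages (the values here are chosen for illustration), with all entries less than 25 replaced:
Age <- c(21, 19, 25, 26, 18, 19)
replace(Age, Age < 25, "R is Cool")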
[1] "R is Cool" "R is Cool" "25" "26" "R is Cool" "R is Cool"
Of course, one can also use square brackets for this operation.
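For example, using a copy so that Age itself stays numeric for the examples that follow:
Age2 <- Age
Age2[Age2 < 25] <- "R is Cool"
Age2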
[1] "R is Cool" "R is Cool" "25" "26" "R is Cool" "R is Cool"
\(\blacksquare\)
4.2.2 which()
The function which() can be used with logical expressions to obtain the indices of elements in a data storage object that meet a criterion.
Example 4.8
For instance:
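Using the Age vector from Example 4.7:
w <- which(Age < 25)
w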
[1] 1 2 5 6
Elements one, two, five, and six meet this criterion. We can now subset based on the index w.
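That is:
Age[w]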
[1] 21 19 18 19
To find which element in Age is closest to 24, I could do something like:
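One possibility:
which.min(abs(Age - 24))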
[1] 3
\(\blacksquare\)
4.2.3 sort()
By default, the function sort() arranges the elements of an atomic vector in ascending alphanumeric order.
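For instance, with the Age vector from Example 4.7:
sort(Age)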
[1] 18 19 19 21 25 26
Data can be sorted in descending order by specifying decreasing = TRUE.
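For example:
sort(Age, decreasing = TRUE)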
[1] 26 25 21 19 19 18
4.2.4 rank()
The function rank() gives the ascending alphanumeric ranks of the elements in a vector. Ties are given the average of their ranks. This operation is important to rank-based permutation analyses.
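For instance:
rank(Age)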
[1] 4.0 2.5 5.0 6.0 1.0 2.5
The second and last observations (both 19) are tied for second smallest in Age. Thus, each receives the average of ranks 2 and 3, that is, 2.5.
4.2.5 order()
The function order() is similar to which() in that it provides element indices, here indices that accord with an alphanumeric ordering. This allows one to sort a vector, matrix, or dataframe into an ascending or descending order, based on one or several ordering vectors.
Example 4.9
Consider the dataframe below, which lists plant percent cover data for four plant species at three sites. In accordance with the field.data example from Chapter 3, plant species are identified with four-letter codes, corresponding to the first two letters of the Linnaean genus and species names.
field.data <- data.frame(code = c("ACMI", "ELSC", "CAEL", "TACE"),
site1 = c(12, 13, 14, 11),
site2 = c(0, 20, 4, 5),
site3 = c(20, 10, 30, 0))
field.data
code site1 site2 site3
1 ACMI 12 0 20
2 ELSC 13 20 10
3 CAEL 14 4 30
4 TACE 11 5 0
Assume that we wish to sort the data with respect to an alphanumeric ordering of species codes. Here we obtain the ordering of the codes:
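For instance (the index name o is arbitrary):
o <- order(field.data$code)
o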
[1] 1 3 2 4
Now we can sort the rows of field.data based on this ordering.
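For example:
field.data[o, ]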
code site1 site2 site3
1 ACMI 12 0 20
3 CAEL 14 4 30
2 ELSC 13 20 10
4 TACE 11 5 0
\(\blacksquare\)
4.2.6 unique()
To identify unique values in a dataset we can use the function unique().
Example 4.10
Below is an atomic vector listing species from a bird survey on islands in southeast Alaska. Species ciphers follow the same coding method used in Example 4.9. Note that there are a large number of repeats.
AK.bird <- c("GLGU", "MEGU", "DOCO", "PAJA", "COLO", "BUFF", "COGO",
"WHSC", "TUSW", "GRSC", "GRTE", "REME", "BLOY", "REPH",
"SEPL", "LESA", "ROSA", "WESA", "WISN", "BAEA", "SHOW",
"GLGU", "MEGU", "PAJA", "DOCO", "GRSC", "GRTE", "BUFF",
"MADU", "TUSW", "REME", "SEPL", "REPH", "ROSA", "LESA",
"COSN", "BAEA", "ROHA")
length(AK.bird)
[1] 38
Applying unique() we obtain a listing of the 24 unique bird species.
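For instance:
unique(AK.bird)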
[1] "GLGU" "MEGU" "DOCO" "PAJA" "COLO" "BUFF" "COGO" "WHSC" "TUSW" "GRSC"
[11] "GRTE" "REME" "BLOY" "REPH" "SEPL" "LESA" "ROSA" "WESA" "WISN" "BAEA"
[21] "SHOW" "MADU" "COSN" "ROHA"
\(\blacksquare\)
4.2.7 match()
Given two vectors, the function match() indexes where the elements of its first argument appear in its second argument. For instance:
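Consider two vectors chosen here to reproduce the output below (the values are assumed for illustration):
x <- c(6, 5, 4, 3, 2)
y <- c(2, 1, 4, 3, 5, 6)
m <- match(y, x)
m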
[1] 5 NA 3 4 2 1
The number 2 (the 1st element in y) is the 5th element of x; thus, the number 5 is put 1st in the vector m created by match(). The number 1 (the 2nd element of y) does not occur in x, so the 2nd element of m is NA. The number 4 is the 3rd element of both y and x. Thus, the number 3 is given as the third element of m, and so on.
Example 4.11
The usefulness of match() may seem unclear at first, but consider a scenario in which I want to convert species code identifiers in field data into actual species names. The following dataframe is a species list that matches four-letter species codes to scientific names. Note that the list contains more species than the field.data dataset used in Example 4.9.
species.list <- data.frame(code = c("ACMI", "ASFO", "ELSC", "ERRY", "CAEL",
"CAPA", "TACE"), names = c("Achillea millefolium", "Aster foliaceus",
"Elymus scribneri", "Erigeron rydbergii",
"Carex elynoides", "Carex paysonis",
"Taraxacum ceratophorum"))
species.list
code names
1 ACMI Achillea millefolium
2 ASFO Aster foliaceus
3 ELSC Elymus scribneri
4 ERRY Erigeron rydbergii
5 CAEL Carex elynoides
6 CAPA Carex paysonis
7 TACE Taraxacum ceratophorum
Here I add a column of the actual species names to field.data using match().
m <- match(field.data$code, species.list$code)
field.data.new <- field.data # make a copy of field data
field.data.new$species.name <- species.list$names[m]
field.data.new
code site1 site2 site3 species.name
1 ACMI 12 0 20 Achillea millefolium
2 ELSC 13 20 10 Elymus scribneri
3 CAEL 14 4 30 Carex elynoides
4 TACE 11 5 0 Taraxacum ceratophorum
\(\blacksquare\)
4.2.8 which() and %in%
We can use the %in% operator in conjunction with the function which() to achieve the same results as match().
m <- which(species.list$code %in% field.data$code)
field.data.new$species.name <- species.list$names[m]
field.data.new
code site1 site2 site3 species.name
1 ACMI 12 0 20 Achillea millefolium
2 ELSC 13 20 10 Elymus scribneri
3 CAEL 14 4 30 Carex elynoides
4 TACE 11 5 0 Taraxacum ceratophorum
Note that the arrangement of arguments is reversed in match() and which(). In the former we have match(field.data$code, species.list$code); in the latter we have which(species.list$code %in% field.data$code).
4.3 Matching, Querying and Substituting in Strings
R contains a number of useful methods for handling character string1 data.
4.3.1 strtrim()
The function strtrim() trims character strings to specified widths, which makes it useful for extracting leading characters from the elements of character vectors.
Example 4.12
For the taxonomic codes in the character vector below, the first capital letter indicates whether a species is a flowering plant (anthophyte) or a moss (bryophyte), while the last four letters give the species codes (see Example 4.9).
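For illustration, assume a vector like the following (the bryophyte code in the middle is hypothetical):
taxa <- c("A_CAAT", "B_POJU", "A_SARI")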
Assume that I want to distinguish anthophytes from bryophytes by extracting the first letter. This can be done by specifying 1 in the second strtrim() argument, width.
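For example:
strtrim(taxa, width = 1)
taxa[strtrim(taxa, width = 1) == "A"]   # retain only the anthophytes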
[1] "A" "B" "A"
[1] "A_CAAT" "A_SARI"
\(\blacksquare\)
4.3.2 strsplit()
The function strsplit() splits a character string into substrings based on user-defined criteria. It contains two important arguments.
- The first argument, x, specifies the character string to be analyzed.
- The second argument, split, is a character criterion that is used for splitting.
Example 4.13
Below I split the scientific name for ACMI, "Achillea millefolium", in two, based on the space between the words Achillea and millefolium.
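For instance:
strsplit("Achillea millefolium", split = " ")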
[[1]]
[1] "Achillea" "millefolium"
Note that the result is a list. To get back to a vector (now with two components), I can use the function unlist().
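For example:
unlist(strsplit("Achillea millefolium", split = " "))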
[1] "Achillea" "millefolium"
Here I split based on the letter "l".
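For instance:
strsplit("Achillea millefolium", split = "l")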
[[1]]
[1] "Achi" "" "ea mi" "" "efo" "ium"
Interestingly, letting the split criterion equal NULL results in the string being split into its individual characters.
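For example:
strsplit("Achillea millefolium", split = NULL)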
[[1]]
[1] "A" "c" "h" "i" "l" "l" "e" "a" " " "m" "i" "l" "l" "e" "f" "o" "l"
[18] "i" "u" "m"
We can use this outcome to reverse the order of characters in a string.
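For instance, combining strsplit(), unlist(), rev(), and paste():
paste(rev(unlist(strsplit("Achillea millefolium", NULL))), collapse = "")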
[1] "muilofellim aellihcA"
The function rev() provides a reversed version of its first argument, in this case a result from strsplit(). The function paste() can be used to paste together character strings.
\(\blacksquare\)
Criteria for querying strings can include multiple characters in a particular order, and a particular case:
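For instance, splitting part of the R start-up message on the lower-case pattern "so":
msg <- "R is free software and comes with ABSOLUTELY NO WARRANTY"
strsplit(msg, split = "so")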
[[1]]
[1] "R is free "
[2] "ftware and comes with ABSOLUTELY NO WARRANTY"
Note that the "SO"
in "ABSOLUTELY"
is ignored because it is upper case.
4.3.3 grep() and grepl()
The functions grep() and grepl() can be used to identify which elements in a character vector have a specified pattern. They have the same first two arguments.
- The first argument, pattern, specifies a pattern to be matched. This can be a character string, an object coercible to a character string, or a regular expression (see below).
- The second argument, x, is a character vector where matches are sought.
Example 4.14
The function grep() returns indices identifying which entries in a vector contain a queried pattern. In the character vector below, we see that entries five and six have the same genus, Carex.
names = c("Achillea millefolium", "Aster foliaceus",
"Elymus scribneri", "Erigeron rydbergii",
"Carex elynoides", "Carex paysonis",
"Taraxacum ceratophorum")
grep("Carex", names)
[1] 5 6
The function grepl() does the same thing with Boolean outcomes.
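For instance:
grepl("Carex", names)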
[1] FALSE FALSE FALSE FALSE TRUE TRUE FALSE
Of course, we could use this information to subset names.
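For example:
names[grepl("Carex", names)]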
[1] "Carex elynoides" "Carex paysonis"
We can also get grep() to return the values directly by specifying value = TRUE.
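For instance:
grep("Carex", names, value = TRUE)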
[1] "Carex elynoides" "Carex paysonis"
\(\blacksquare\)
4.3.4 gsub()
The function gsub() can be used to substitute new text for text that matches a specified pattern. Several of its arguments are identical to those of grep() and grepl():
- As before, the first argument, pattern, specifies a pattern to be matched.
- The second argument, replacement, specifies a replacement for the matched pattern.
- The third argument, x, is a character vector wherein matches are sought and substitutions are made.
Example 4.15
Here we substitute "C." for occurrences of "Carex" in names.
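For instance:
gsub("Carex", "C.", names)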
[1] "Achillea millefolium" "Aster foliaceus"
[3] "Elymus scribneri" "Erigeron rydbergii"
[5] "C. elynoides" "C. paysonis"
[7] "Taraxacum ceratophorum"
\(\blacksquare\)
4.3.5 gregexpr()
The function gregexpr() identifies the starting positions and lengths of matched substrings in a character vector.
Example 4.16
Here we examine the first two entries in names, looking for the genus Aster.
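For example:
gregexpr("Aster", names[1:2])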
[[1]]
[1] -1
attr(,"match.length")
[1] -1
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE
[[2]]
[1] 1
attr(,"match.length")
[1] 5
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE
The output list is cryptic at best and requires some explanation. For each of the two list components, the returned value gives the starting character position of the match, and the match.length attribute gives the number of characters matched. For the first list component, these values are -1 because "Achillea millefolium" does not contain the pattern "Aster". For the second list component, they are 1 and 5 because "Aster" makes up the first five letters of "Aster foliaceus".
\(\blacksquare\)
4.3.6 Regular Expressions
A number of R functions for managing character strings, including grep(), grepl(), gregexpr(), gsub(), and strsplit(), can incorporate regular expressions. In computer programming, a regular expression (often abbreviated as regex) is a sequence of characters that allows pattern matching in text. Regular expressions have developed within a number of programming frameworks including the POSIX standard (the Portable Operating System Interface standard), developed by the IEEE, and particularly the language Perl2. Regular expressions in R include extended regular expressions (this is the default for most pattern matching and replacement R functions), and Perl-like regular expressions.
4.3.6.1 Extended Regular Expressions
Default extended regular expressions in R use a POSIX framework for commands3, which includes the use of particular metacharacters. These are: \, |, ( ), [ ], ^, $, ., { }, *, +, and ?. The metacharacters vary in meaning depending on whether they occur outside of square brackets, [ and ], or inside of square brackets; the latter usage means that they are part of a character class (see below). In non-bracketed usage, the metacharacters in the subset below have the following applications (see https://www.pcre.org/original/pcre.txt):
- ^ start of string or line.
- $ end of string or line.
- . match any character except newline.
- | start of alternative branch.
- ( ) start and end of subpattern.
- { } start and end of min/max repetition specification.
Several regular expression metacharacters can be placed at the end of a regular expression to specify repetition. For instance, "*" indicates the preceding pattern should be matched zero or more times, "+" indicates the preceding pattern should be matched one or more times, "{n}" indicates the preceding pattern should be matched exactly n times, and "{n,}" indicates the preceding pattern should be matched n or more times.
Example 4.17
We will use the function regmatches(), which extracts or replaces matched substrings from gregexpr() summaries, to illustrate.
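Calls along these lines (the example string is assumed) produce the output below:
x <- "aaa bbb aaa"
regmatches(x, gregexpr("a", x))      # single a's
regmatches(x, gregexpr("a{2}", x))   # runs of exactly two a's
regmatches(x, gregexpr("a{3}", x))   # runs of exactly three a's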
[[1]]
[1] "a" "a" "a" "a" "a" "a"
[[1]]
[1] "aa" "aa"
[[1]]
[1] "aaa" "aaa"
\(\blacksquare\)
Example 4.18
Metacharacters can be used together. For instance, the code below demonstrates how one might get rid of one or more extra spaces at the end of character strings, along with leading pound signs.
string <- c("###Now is the time ",
"# for all ",
"#",
" good men",
"### to come to the aid of their country. ")
out <- gsub(" +$", "", string) # drop extra space(s) at end of strings
out <- gsub("^#*","", out) # drop pound sign(s)
paste(out, collapse = "")
[1] "Now is the time for all good men to come to the aid of their country."
\(\blacksquare\)
Example 4.19
As a biological example, microbial “taxa” identifiers can include cryptic Amplicon Sequence Variant (ASV) codes, followed by a general taxonomic assignment. For example, here is an ASV identifier for a bacterium within the family Comamonadaceae.
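For illustration, assume an identifier like the following (the ASV code itself is hypothetical):
asv <- "ASV0213:f__Comamonadaceae"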
We can delete the ASV code, which ends in a colon, with:
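Using the object defined above:
gsub(".*:", "", asv)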
[1] "f__Comamonadaceae"
The regex script in the first argument means: "match any character string occurring zero or more times that ends in :".
\(\blacksquare\)
Example 4.20
As another example, R Markdown delimits monospace font using accent grave (backtick) metacharacters, ` `, while LaTeX applies this font to text between the expressions \texttt{ and }. Below I convert an R Markdown-style character vector containing some monospace strings to a LaTeX-style character vector.
char.vec <- c("`+`", "addition", "$2 + 2$", "`2 + 2`")
gsub("(`)(.*)(`)", "\\\\texttt{\\2}", char.vec)
[1] "\\texttt{+}"     "addition"        "$2 + 2$"         "\\texttt{2 + 2}"
With the code "(`)(.*)(`)" I subset R Markdown strings in char.vec into three potential components: 1) the ` metacharacter beginning the string, 2) the text content between ` metacharacters, and 3) the closing ` metacharacter itself. I insert the content in item 2 (indicated as \\2) between \texttt{ and } using the replacement "\\\\texttt{\\2}".
\(\blacksquare\)
Importantly, Example 4.20 illustrates the procedure to use if a queried character is itself a regular expression metacharacter, for instance the backslash in \texttt. In this case, the metacharacter must be escaped using additional backslashes. That is, to obtain the leading backslash of \texttt, we specify "\\\\texttt" in gsub().
Example 4.21
Here I ask for a string split based on the appearance of ? (which is a regex metacharacter) and % (which is not).
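One call consistent with the output below (the string itself is assumed) is:
strsplit("m?2%b", split = "[\\?%]")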
[[1]]
[1] "m" "2" "b"
\(\blacksquare\)
Character class
A regular expression character class is comprised of a collection of characters, specifying some query or pattern, situated between quotes (single or double) and square bracket metacharacters, e.g., "[" and "]". Thus, the code "[\\?%]" in the previous example defines a character class. Character class pattern matches will be evaluated for any single character in the specified text. The reverse will occur if the first character inside the brackets is the regular expression caret metacharacter, ^. For example, the expression "[0-9]" matches any single numeric character in a string (the regular expression metacharacter - can be used to specify a range), and "[^abc]" matches anything except the characters "a", "b", or "c".
Example 4.22
Consider the following examples:
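Calls along these lines (the example string is assumed) are consistent with the output shown:
strsplit("a?c&m?%b", split = "[?]")      # split on ? only
strsplit("a?c&m?%b", split = "[?&m%]")   # split on any of ?, &, m, or %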
[[1]]
[1] "a" "c&m" "%b"
[[1]]
[1] "a" "c" "" "" "" "b"
\(\blacksquare\)
Example 4.23
This regular expression will match most email addresses:
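One version of such an expression (a sketch; the character classes could be extended) is:
email.rx <- "[a-zA-Z0-9.-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]+"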
The expression literally reads: "1) find one or more occurrences of characters in a-z or A-Z or 0-9 or dashes or periods, followed by 2) the at symbol, @ (literally), followed by 3) one or more occurrences of characters in a-z or A-Z or 0-9 or dashes or periods, followed by 4) a literal period, followed by 5) one or more occurrences of the letters a-z or A-Z." Here is a string we wish to query:
string <- c("abc_noboby@isu.edu",
"text with no email",
"me@mything.com",
"also",
"you@yourspace.com",
"@you"
)
We confirm that elements 1, 3, and 5 from string are email addresses.
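Using the regular expression defined above (stored here as email.rx):
grep(email.rx, string, value = TRUE)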
[1] "abc_noboby@isu.edu" "me@mything.com" "you@yourspace.com"
\(\blacksquare\)
Certain character classes are predefined. These classes have names that are bounded by two square brackets and colons, and include "[[:lower:]]" and "[[:upper:]]", which identify lower and upper case letters; "[[:punct:]]", which identifies punctuation; "[[:alnum:]]", which identifies all alphanumeric characters; and "[[:space:]]", which identifies space characters, e.g., tab and newline.
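For example, with a small test vector (the elements are assumed for illustration):
string <- c("M2Ab", "def", "!", "?", " ")
grepl("[[:lower:]]", string)
grepl("[[:upper:]]", string)
grepl("[[:punct:]]", string)
grepl("[[:space:]]", string)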
[1] TRUE TRUE FALSE FALSE FALSE
[1] TRUE FALSE FALSE FALSE FALSE
[1] FALSE FALSE TRUE TRUE FALSE
[1] FALSE FALSE FALSE FALSE TRUE
Here I ask R to return elements from string that are three or more characters long.
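Using the quantifier {3,} with the alphanumeric class:
grep("[[:alnum:]]{3,}", string, value = TRUE)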
[1] "M2Ab" "def"
For some pattern matching and replacement jobs it may be best to turn off the default extended regular expressions and use exact matching by specifying fixed = TRUE. For example, R may place periods in the place of spaces in character strings and in the column names of dataframes and arrays.
Example 4.24
Consider the following example:
countries <- c("United.States", "United.Arab.Emirates", "China", "Germany")
gsub(".", " ", countries)
[1] " " " " " "
[4] " "
Note that using gsub(".", " ", countries) results in the replacement of all text with spaces because of the meaning of the period metacharacter. To get the desired result we could use:
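For instance:
gsub(".", " ", countries, fixed = TRUE)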
[1] "United States" "United Arab Emirates" "China"
[4] "Germany"
Of course we could also double escape the period.
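For example:
gsub("\\.", " ", countries)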
[1] "United States" "United Arab Emirates" "China"
[4] "Germany"
\(\blacksquare\)
4.3.6.2 Perl-like Regular Expressions
The R character string functions grep(), grepl(), regexpr(), gregexpr(), sub(), gsub(), and strsplit() allow Perl-like regular expression pattern matching. This is done by specifying perl = TRUE, which switches regular expression handling to the PCRE (Perl Compatible Regular Expressions) library. Perl-like matching allows handling of the POSIX predefined character classes, e.g., "[[:lower:]]", along with a wide variety of other calls which are generally implemented using metacharacters and double-backslash commands. Here are some examples.
- \\d any decimal digit.
- \\D any character that is not a decimal digit.
- \\h any horizontal white space character (e.g., tab, space).
- \\H any character that is not a horizontal white space character.
- \\s any white space character.
- \\S any character that is not a white space character.
- \\v any vertical white space character (e.g., newline).
- \\V any character that is not a vertical white space character.
- \\w any word character, i.e., a letter, digit, or underscore.
- \\W any non-word character.
- \\b a word boundary.
- \\U upper case conversion (dependent on context).
- \\L lower case conversion (dependent on context).
Note that reversals in meaning occur for capitalized and uncapitalized commands.
Example 4.25
Here we identify string entries containing numbers.
string <- c("Acidobacteria", "Actinobacteria", "TM7.1", "Gitt-GS-136",
"Chloroflexia", "Bacili")
grep("\\d", string, perl = TRUE)
[1] 3 4
And those containing non-numeric characters (i.e., all of the entries).
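Using the non-digit class \\D:
grep("\\D", string, perl = TRUE)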
[1] 1 2 3 4 5 6
To subset non-numeric entries, one could do something like:
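One approach is to negate a logical digit match:
string[!grepl("\\d", string, perl = TRUE)]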
[1] "Acidobacteria" "Actinobacteria" "Chloroflexia" "Bacili"
\(\blacksquare\)
Example 4.26
As a slightly extended example, we will count the number of words in the description of the GNU General Public License distributed with R (obtained via RShowDoc("COPYING")). Ideas here largely follow from the function DescTools::StrCountW() (Signorell 2023).
Text can be read from a connection using the function readLines().
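For example (the location of the license file can vary by installation):
GNU <- readLines(file.path(R.home("doc"), "COPYING"))
head(GNU)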
[1] "\t\t GNU GENERAL PUBLIC LICENSE"
[2] "\t\t Version 2, June 1991"
[3] ""
[4] " Copyright (C) 1989, 1991 Free Software Foundation, Inc."
[5] " 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA"
[6] " Everyone is permitted to copy and distribute verbatim copies"
Note that the escaped command \t represents the ASCII (American Standard Code for Information Interchange) control character for a horizontal tab. Other useful escaped control characters include \n, indicating a new line, and \r, indicating a carriage return.
To search for words, we will actually identify string components that are not words, identified with the Perl regex command \\W, and word boundaries, i.e., \\b. We can combine these summarily as: \\b\\W+\\b. The call \\W+ indicates a non-word match occurring one or more times. Here we apply this regular expression to the first element of GNU.
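For instance:
GNU[1]
gregexpr("\\b\\W+\\b", GNU[1], perl = TRUE)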
[1] "\t\t GNU GENERAL PUBLIC LICENSE"
[[1]]
[1] 10 18 25
attr(,"match.length")
[1] 1 1 1
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE
Matches occur at three locations, 10, 18, and 25, which separate the four words GNU GENERAL PUBLIC LICENSE. Thus, to analyze the entire document we could use:
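One way to tally the total (a sketch: word separators are counted per line, and one is added for every line containing at least one word character; the details of the tally could be handled differently):
seps <- sapply(gregexpr("\\b\\W+\\b", GNU, perl = TRUE), function(m) sum(m > 0))
sum(seps + grepl("\\w", GNU, perl = TRUE))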
[1] 3048
There are 3048 total words in the license description.
\(\blacksquare\)
Using Perl-like regular expressions, one can also refer to matched substrings by number in replacement text.
Example 4.27
In this example, I subdivide a string into two components: the first character, i.e., "(\\w)", and the remaining zero or more characters, "(\\w*)". These are referred to in the replacement argument of gsub() as items \\1 and \\2, respectively. Capitalization for these substrings is handled in different ways below.
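Calls along the following lines (the starting string "achillea" is assumed) give the three results below:
gsub("(\\w)(\\w*)", "\\U\\1\\2", "achillea", perl = TRUE)    # all upper case
gsub("(\\w)(\\w*)", "\\L\\1\\U\\2", "achillea", perl = TRUE) # first lower, rest upper
gsub("(\\w)(\\w*)", "\\U\\1\\L\\2", "achillea", perl = TRUE) # first upper, rest lower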
[1] "ACHILLEA"
[1] "aCHILLEA"
[1] "Achillea"
The functions tolower() and toupper() provide simpler approaches to convert letters to lower and upper case, respectively.
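For instance:
toupper("achillea")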
[1] "ACHILLEA"
\(\blacksquare\)
4.4 Date-Time Classes
There are two basic R date-time classes, POSIXlt and POSIXct4. Class POSIXct represents the (signed) number of seconds since the beginning of 1970 (in the UTC time zone) as a numeric vector. An object of class POSIXlt will be comprised of a list of vectors with the names sec, min, hour, mday (day of month), mon (month), year, wday (day of week), and yday (day of year).
POSIX naming conventions include:
- %m = Month as a decimal number (01–12).
- %d = Day of the month as a decimal number (01–31).
- %Y = Year. Designations in 0:9999 are accepted.
- %H = Hour as a decimal number (00–23).
- %M = Minute as a decimal number (00–59).
Example 4.28
As an example, below are twenty dates and corresponding binary water presence measures (0 = water absent, 1 = water present) recorded at 2.5 hour intervals for an intermittent stream site in southwest Idaho (Aho et al. 2023).
dates <- c("08/13/2019 04:00", "08/13/2019 06:30", "08/13/2019 09:00",
"08/13/2019 11:30", "08/13/2019 14:00", "08/13/2019 16:30",
"08/13/2019 19:00", "08/13/2019 21:30", "08/14/2019 00:00",
"08/14/2019 02:30", "08/14/2019 05:00", "08/14/2019 07:30",
"08/14/2019 10:00", "08/14/2019 12:30", "08/14/2019 15:00",
"08/14/2019 17:30", "08/14/2019 20:00", "08/14/2019 22:30",
"08/15/2019 01:00", "08/15/2019 03:30")
pres.abs <- c(1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1)
To convert the character string dates to a date-time object we can use the function strptime(). We have:
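For instance (the object name dates.p is arbitrary):
dates.p <- strptime(dates, format = "%m/%d/%Y %H:%M")
class(dates.p)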
[1] "POSIXlt" "POSIXt"
Note that the dates can now be evaluated numerically.
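For example, placing the date-times and the water presence data in a dataframe (the name stream is assumed) allows a numerical summary:
stream <- data.frame(dates = dates.p, pres.abs = pres.abs)
summary(stream)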
dates pres.abs
Min. :2019-08-13 04:00:00 Min. :0.00
1st Qu.:2019-08-13 15:52:30 1st Qu.:0.75
Median :2019-08-14 03:45:00 Median :1.00
Mean :2019-08-14 03:45:00 Mean :0.75
3rd Qu.:2019-08-14 15:37:30 3rd Qu.:1.00
Max. :2019-08-15 03:30:00 Max. :1.00
I can also easily extract time series components.
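For instance, the day of month, day of week (0 = Sunday), and hour components:
dates.p$mday
dates.p$wday
dates.p$hour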
[1] 13 13 13 13 13 13 13 13 14 14 14 14 14 14 14 14 14 14 15 15
[1] 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 4 4
[1] 4 6 9 11 14 16 19 21 0 2 5 7 10 12 15 17 20 22 1 3
\(\blacksquare\)
Exercises
1. Using the plant dataset from Question 5 in the Exercises at the end of Chapter 3, perform the following operations.
    a. Attempt to simultaneously calculate the column means for plant height and soil % N using FUN = mean in apply(). Was there an issue? Why?
    b. Eliminate missing rows in plant using na.omit() and repeat (a). Did this change the mean for plant height? Why?
    c. Modify the FUN argument in apply() to be FUN = function(x) mean(x, na.rm = TRUE). This will eliminate NAs on a column-by-column basis.
    d. Compare the results in (a), (b), and (c). Which is the best approach? Why?
    e. Find the mean and variance of plant heights for each Management Type in plant using tapply(). Use the best-practice approach for FUN, as deduced in (d).
2. For the questions below, use the list list.data below.
    a. Use sapply(list.data, FUN = length) to get the number of components in each element of list.data.
    b. Repeat (a) using lapply(). How is the output in (b) different from (a)?
3. A frequently used statistical application is the calculation of all possible mean differences. Assume that we have the means given in the object means below.
    a. Calculate all possible mean differences using means as the first two arguments in outer(), and letting FUN = "-".
    b. Extract meaningful and non-redundant differences by using upper.tri() or lower.tri() (Section 3.4.4). There should be \({5 \choose 2} = 10\) meaningful (not simply a mean subtracted from itself) and non-redundant differences.
4. Using the plant dataset from Question 5 in the Exercises for Chapter 3, perform the following operations.
    a. Use the function replace() to identify samples with soil N less than 13.5% by labeling them "Npoor".
    b. Use the function which() to identify which plant heights are greater than or equal to 33.2 dm.
    c. Sort plant heights using the function sort().
    d. Sort the plant dataset with respect to ascending values of plant height using the function order().
5. Using match(), or which() and %in%, replace the code column names in the dataset cliff.sp from the package asbio with the correct scientific names (genus and specific epithet) from the dataframe sp.list below.
sp.list <- data.frame(
  code = c("L_ASCA", "L_CLCI", "L_COSPP", "L_COUN", "L_DEIN", "L_LCAT",
           "L_LCST", "L_LEDI", "M_POSP", "L_STDR", "L_THSP", "L_TOCA",
           "L_XAEL", "M_AMSE", "M_CRFI", "M_DISP", "M_WECO", "P_MIGU",
           "P_POAR", "P_SAOD"),
  sci.name = c("Aspicilia caesiocineria", "Caloplaca citrina", "Collema spp.",
               "Collema undulatum", "Dermatocarpon intestiniforme",
               "Lecidea atrobrunnea", "Lecidella stigmatea", "Lecanora dispersa",
               "Pohlia sp.", "Staurothele drummondii", "Thelidium species",
               "Toninia candida", "Xanthoria elegans", "Amblystegium serpens",
               "Cratoneuron filicinum", "Dicranella species", "Weissia controversa",
               "Mimulus guttatus", "Poa pattersonii", "Saxifraga odontoloma"))
6. Using the sp.list dataframe from the previous question, perform the following operations:
    a. Apply strsplit() to the column sp.list$sci.name to create a two-column dataframe with genus and corresponding species names.
    b. A two-character prefix in the column sp.list$code indicates whether a taxon is a lichen (prefix = "L_"), a marchantiophyte (prefix = "M_"), or a vascular plant (prefix = "P_"). Use grep() to identify marchantiophytes.
7. Use the string vector string below to answer the following questions:
    a. Use regular expressions in the pattern argument of gsub() to get rid of extra spaces at the start of string elements while preserving spaces between words.
    b. Use the predefined character class [[:alnum:]] and an accompanying quantifier in the pattern argument of grep() to count the number of words whose length is greater than or equal to four characters.
8. Remove the numbers from the character vector below using gsub() and an appropriate Perl-like regular expression.
9. Consider the character vector times below, which has the format day-month-year hour:minute:second.
    a. Convert times into an object of class POSIXlt called time.pos using the function strptime().
    b. Extract the day of the week from time.pos.
    c. Sort time.pos using sort() to verify that time.pos is quantitative.
References
In computer programming, a string is generally a (non-numeric) sequence of characters (Wikipedia 2024). R frequently uses character vectors, e.g., vec <- c("a", "b", "c"). Each entry in vec would conventionally be considered a character string.↩︎
The Perl programming language was introduced by Larry Wall in 1987 as a Unix scripting tool to facilitate report processing (Wikipedia 2023). Despite criticisms as an awkward language, Perl remains widely used for its regular expression framework and string parsing capabilities.↩︎
Specifically, they use a version of the POSIX 1003.2 standard.↩︎
Again, the POSIX prefix refers to the IEEE standard Portable Operating System Interface.↩︎