Chapter 1 Welcome to R

"I believe that R currently represents the best medium for quality software in support of data analysis."

- John Chambers, Developer of S

"R is a real demonstration of the power of collaboration, and I don’t think you could construct something like this any other way."

- Ross Ihaka, original co-developer of R

1.1 What is R?

R is a computer language and an open source setting for statistics, data management, computation, and graphics. The outward mien of the R-environment is minimalistic, with few menu-driven interactive facilities (no menus exist for some implementations of R). This is in contrast to conventional statistical software consisting of black box, menu-dominated, often inflexible tools. The simplicity of R allows one to easily evaluate, edit, and build procedures for data analysis.

1.2 R and Biology

I am a statistical ecologist, so this book was written with natural scientists, particularly biologists, in mind. R is useful to biologists for three major reasons. First, it provides access to a large number of cutting-edge statistical, graphical, and organizational procedures, many of which have been designed specifically for biological research. Second, biological datasets, including those from genetic and spatiotemporal research, can be extensive and complex. R can readily manage and analyze such data. Third, analysis of biological data often requires analytical and computational flexibility. R allows one to “get under the hood”, look at the code, and check to see what algorithms are doing. If, after examining an R-algorithm, we are unsatisfied, we can generally modify its code or create new code to meet our specific needs.
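As a small illustration of getting “under the hood”, the sketch below prints the source of the base function sd and then defines a modified version. The name pop_sd and the choice of a population (divide-by-n) standard deviation are assumptions made only for this example; they are not part of R itself.

```r
# Typing a function's name without parentheses prints its source code.
sd

# body() returns just the expressions the function evaluates.
sd_body <- body(sd)

# A hypothetical modification: a population standard deviation that
# divides by n, rather than by n - 1 as sd() does.
pop_sd <- function(x, na.rm = FALSE) {
  if (na.rm) x <- x[!is.na(x)]   # optionally drop missing values
  sqrt(mean((x - mean(x))^2))
}

obs <- c(2, 4, 4, 4, 5, 5, 7, 9)
pop_result <- pop_sd(obs)   # 2
sample_result <- sd(obs)    # slightly larger, because sd() divides by n - 1
```

The same approach (inspect, copy, adjust) applies to most functions written in R code, although functions implemented in compiled code show only a thin R wrapper when printed.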

1.3 Popularity of R

Because of its freeware status, the overall number of people using R is difficult to determine. Nonetheless, the R-consortium website estimates that there are currently more than two million active R users. The r4stats website houses up-to-date surveys concerning the popularity of analytical software. These surveys (accessed 9/1/2023) indicate that R is the preferred language among data scientists for big data projects and data mining. R is also one of the most frequently cited statistical environments in scholarly articles, one of the most frequently used languages on the GitHub repository, and one of the most frequently discussed languages on Stack Overflow. In 2022 the R language was ranked 7th in the world by the Institute of Electrical and Electronics Engineers (IEEE); the top six, in order, were Python, C, C++, C#, Java, SQL, and JavaScript. Further, in a 2017 survey based on Stack Overflow queries, R was the “least disliked” programming language.

The growth and popularity of R can be partially tied to its relatively straightforward extensibility via user-generated functions and packages. This characteristic prompts a strong sense of community among R-users, along with a practical need for the perpetuation and upkeep of the R environment. Although R trails Python in this respect, there are currently over 20,000 formally contributed R-packages at the Comprehensive R Archive Network (CRAN).

1.4 A Brief History

R was created in the early 1990s by computational statisticians Ross Ihaka and Robert Gentleman (Fig 1.1) at the University of Auckland, New Zealand, to address scope[1] and memory use deficiencies in its primary progenitor language, S (Ihaka and Gentleman 1996).

Figure 1.1: Ross Ihaka (1954 - ) (L) and Robert Gentleman (1959 - ) (R), the co-creators of R.

At the insistence of Swiss statistician Martin Maechler (Fig 1.2l), Ihaka and Gentleman made the R source code available via the internet in 1995. Because of its relatively easy-to-learn language, R was quickly extended with code and packages developed by its users. The rapid growth of R gave rise to the need for a group to guide its progress. This led, in 1997, to the establishment of the R-development core team, an international panel that modifies, troubleshoots, and manages source code (Fig 1.2).

Figure 1.2: A recent version of the R-core development team.

1.4.1 Development of the R Language

The R language is based on older languages, particularly S, developed at Bell Laboratories (R. A. Becker and Chambers 1978; RA Becker and Chambers 1981; Richard Becker 2018), Lisp[2] (McCarthy 1978), developed at MIT in the late 1950s, and Scheme, a dialect of Lisp (Sussman and Steele Jr 1998; Steele 1978), developed at the MIT artificial intelligence laboratory in the 1970s (Fig 1.3). In the appendix to his book Software for Data Analysis, John M. Chambers (Fig 1.2b), a primary developer of S, recounts the unique evolution and goals of S from its inception in 1976 (Chambers 2008). Chambers notes that S was originally intended to be an analysis toolbox solely for the statistics research group at Bell Labs, consisting of roughly 20 people at the time. It was decided that S (initially known as “the system”) would have fundamental extensibility[3], reflecting the Bell Labs philosophy that “collaborations could actually enhance research” (Chambers 2008). S was largely founded on a collection of Fortran-based[4] routines and subroutines.

Figure 1.3: John McCarthy (1927-2011), creator of the Lisp language, who coined the term “artificial intelligence”, working at the MIT AI laboratory. Lisp was the first language that allowed information to be stored as distinct objects, rather than simply collections of numbers.

S evolved alongside the Unix operating system (also developed at Bell Labs), which currently underlies Macintosh and Linux (free-Unix) operating systems[5]. An early version of S was written for Unix, allowing S to be portable to any machine using Unix. Both S and Unix were quickly licensed commercially by AT&T to universities and third-party retailers. The academic licensing and distribution of S attracted a large number of user groups in the 1980s. However, the lack of a clear open source strategy caused many early users to switch from S to R in the 1990s.

Ihaka and Gentleman (1996) used the name R both to acknowledge the influence of S (because r and s are adjacent in the alphabet), and to celebrate their own personal efforts (because R was the first letter of their first names). S was purchased by Insightful\(^{\circledR}\) software in 2004 to run the commercial software S-Plus\(^{\circledR}\). In 2021 S-Plus\(^{\circledR}\) morphed to include TIBCO connected intelligence software, with some R open source applications. The R and S languages remain very similar, and code written in S can generally be run unaltered in R. The method of function implementation in R, however, remains more similar to Scheme.

The environments of R and S differed in two important ways (Ihaka and Gentleman 1996)[6]. First, the R-environment is given a fixed amount of memory at startup. This is in contrast to S-engines, which adjusted available memory to session needs. Among other things, this difference meant more available pre-reserved computer memory, and fewer virtual pagination[7] problems in R (Ihaka and Gentleman 1996). It also made R faster than S-Plus for many applications (Hornik and R Core Team 2023). Second, R variable locations are lexically scoped. In computer science, variables are storage areas with identifiers, and scope defines the context in which a variable name is recognized. So-called global variables are accessible in every scope (for instance, both inside and outside functions). In contrast, local variables exist only within the scope of a function. Formal parameters defined in R functions, including arguments, are (generally) local variables, whereas variables created outside of functions are global variables. In contrast to S, lexical scoping gives functions in R access to the variables that were in effect when the function was defined in a session. The characteristics of R functions and details concerning lexical scoping are further addressed in Ch 8.
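The distinction between global variables, local variables, and lexical scoping can be sketched in a few lines of R. All object names below (x, outer_fn, inner_fn, make_counter) are hypothetical and chosen only for illustration.

```r
x <- "global"                      # a global variable

outer_fn <- function() {
  x <- "local to outer_fn"         # a local variable that masks the global x
  inner_fn <- function() x         # inner_fn is DEFINED here, inside outer_fn
  inner_fn                         # return the function itself
}

inner <- outer_fn()

# Under lexical scoping, inner() looks up x where inner_fn was defined
# (inside outer_fn), not where it is called (the global workspace).
scoped_value <- inner()            # "local to outer_fn"

# The global x was never altered by the assignment inside outer_fn.
global_value <- x                  # "global"

# A practical consequence: functions can carry private state with them.
make_counter <- function() {
  count <- 0                       # local to this call of make_counter()
  function() {
    count <<- count + 1            # update count in the defining environment
    count
  }
}

counter <- make_counter()
counter()                          # first call
second_call <- counter()           # 2
```

Each call to make_counter() creates a fresh count, so separate counters do not interfere with one another.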

1.4.2 Recent Developments

According to Thieme (2018), a growing component of the R culture includes individuals who are “Less interested in the mechanics of R than in what R allowed them to do.” This group, which includes individuals with little interest in becoming computer scientists, has been championed by Hadley Wickham (Fig 1.4), creator of the ggplot2 and dplyr R packages, and author of many useful books on R programming. A larger collection of packages supported by Wickham is referred to as the tidyverse.

Figure 1.4: Hadley Wickham (1979 - ), chief scientist at RStudio.

1.4.3 The Future of R

As we have seen, R can be clearly tied (particularly via linkages with Fortran and Lisp) to the earliest foundations of software engineering. The future of R will be determined by its formal and informal community of users who have donated years of their lives to its development without monetary compensation. The continued growth of R will require adaptation to the changing demography of R-users. Like most software endeavors, R has been male dominated (Fig 1.2). However, this has been changing since 2012 with the founding of the R-Ladies group by Gabriela de Queiroz, a chief data scientist at IBM (Fig 1.5). Today (9/1/2023) there are chapters in 219 cities and more than 28,000 members worldwide.

Figure 1.5: Gabriela de Queiroz.

1.5 Copyrights and Licenses

R is intentionally open-source and free. Thus, there are no warranties on its environment or packages. As its copyright framework, R uses the GNU General Public License (GPL); GNU is a recursive acronym for “GNU’s Not Unix”. This license allows users to share and change R and its functions. The associated legalese can be read after typing RShowDoc("COPYING") in the R-console. Because its functions can be legally (and easily) recycled and altered, we should always give credit to developers, package maintainers, or whoever wrote the R functions or code we are using.
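A minimal sketch of viewing the license and giving credit from the R-console follows. It uses RShowDoc() as described above, together with the base functions citation() and toBibtex(); the choice of the stats package as an example is an assumption made only for illustration.

```r
# Display the GPL text distributed with R. This opens a viewer, so it is
# only run in an interactive session.
if (interactive()) RShowDoc("COPYING")

# Recommended citation for R itself
r_cite <- citation()
print(r_cite)

# Recommended citation for a specific package (here, the stats package)
stats_cite <- citation("stats")

# The same information as a BibTeX entry for a reference manager
r_bibtex <- toBibtex(r_cite)
```

Contributed packages can be cited in the same way by supplying the package name to citation().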

1.6 R and Reliability

The lack of an R warranty has frightened away some scientists. But be assured, with few exceptions, R works as well as or better than “top of the line” commercial analytical software. Indeed, statistical software giants SAS\(^{\circledR}\) and SPSS\(^{\circledR}\) have made R applications accessible from within their products (Fox 2009), and R processes and files can be shared directly with Microsoft Excel\(^{\circledR}\) and other proprietary software. For specialized or advanced statistical techniques, R often outperforms the alternatives because of its diverse array of continually updated applications.

The computing engines and packages that come with a conventional R download (see Ch 3) meet or exceed U.S. federal analytical standards for clinical trial research (Schwartz et al. 2008). In addition, core algorithms used in R are based on seminal and well-trusted functions. For instance, R random number generators include some of the most venerated and thoroughly tested functions in computer history (Chambers 2008). Similarly, the Linear Algebra PACKage (LAPACK) algorithms (Anderson et al. 1999), used by R, are among the world’s most stable and best-tested software.
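The two components mentioned above can be examined directly from the console. The sketch below is only an illustration, not a validation procedure: it checks which random number generator is in use, shows that a fixed seed reproduces the same random values, and solves a small linear system with solve(), which relies on LAPACK routines for numeric matrices. The matrix A and vector b are made-up values.

```r
# Which random number generators are in use? The first element is the
# uniform generator (the Mersenne Twister by default in current R).
rng_info <- RNGkind()

# Setting the seed makes pseudo-random results reproducible.
set.seed(42)
draw1 <- runif(5)
set.seed(42)
draw2 <- runif(5)
same_draws <- identical(draw1, draw2)   # TRUE

# Version of the LAPACK library used by this R installation
lapack_version <- La_version()

# Solve the linear system A %*% sol = b
A <- matrix(c(2, 1,
              1, 3), nrow = 2, byrow = TRUE)
b <- c(1, 2)
sol <- solve(A, b)

# Check the solution by substituting it back into the system
residual <- as.vector(A %*% sol) - b
```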

1.7 Installation

To install R, first go to the website (http://www.r-project.org/). On this page, specify which platform you are using (Fig 1.6, step 1). R can currently be used on Unix/Linux, Windows, and Mac operating systems. Once an operating system has been selected, click on the “base” link to download the precompiled base binaries if R currently exists on your computer. If R has not been previously installed on your computer, click on “Install R for the first time” (Fig 1.6, step 2). You will be taken to a window containing a link to download the most recent version of R. Click on the “Download” link (Fig 1.6, step 3).

Figure 1.6: Method for installing R for Windows for the first time.
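Once the installer has finished, one quick way to confirm that the installation works is to open R and ask it to describe itself. The commands below are a minimal sketch using base R only; the object names are arbitrary.

```r
# Version of R that is currently running
r_version <- R.version.string
print(r_version)

# Operating system family ("unix" or "windows") and operating system name
os_type <- .Platform$OS.type
os_name <- Sys.info()[["sysname"]]

# Directory in which R is installed
r_home <- R.home()
```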

Exercises

  1. The following questions concern the history and general characteristics of R.

    1. Who were the creators of R?
    2. What are some major developmental events in the history of R?
    3. What languages is R derived from and/or most similar to?
    4. What features distinguish R from other languages and statistical software?
    5. What are the three operating systems R works with?
  2. Briefly consider R in the context of major historical events in computer software and artificial intelligence.

References

Anderson, Edward, Zhaojun Bai, Christian Bischof, L Susan Blackford, James Demmel, Jack Dongarra, Jeremy Du Croz, et al. 1999. LAPACK Users’ Guide. SIAM.
Becker, RA, and JM Chambers. 1981. “S: A Language and System for Data Analysis, Bell Laboratories Computer Information Service.” Murray Hill, New Jersey.
Becker, Richard. 2018. The New S Language. CRC Press.
Becker, Richard A, and John M Chambers. 1978. “Design and Implementation of the ’S’ System for Interactive Data Analysis.” In The IEEE Computer Society’s Second International Computer Software and Applications Conference, 1978. COMPSAC’78., 626–29. IEEE.
Chambers, John M. 2008. Software for Data Analysis: Programming with R. Vol. 2. Springer.
Corbató, Fernando J, and Victor A Vyssotsky. 1965. “Introduction and Overview of the Multics System.” In Proceedings of the November 30–December 1, 1965, Fall Joint Computer Conference, Part I, 185–96.
Fox, John. 2009. “Aspects of the Social Organization and Trajectory of the R Project.” R J. 1 (2): 5.
Hornik, Kurt, and the R Core Team. 2023. “R FAQ.” https://CRAN.R-project.org/doc/FAQ/R-FAQ.html.
Ihaka, Ross, and Robert Gentleman. 1996. “R: A Language for Data Analysis and Graphics.” Journal of Computational and Graphical Statistics 5 (3): 299–314.
Kernighan, Brian W, and Dennis M Ritchie. 2002. The C Programming Language. Pearson Education Asia.
McCarthy, John. 1978. “History of LISP.” In History of Programming Languages, 173–85. Stanford University.
Schwartz, M, F Harrell Jr, A Rossini, and I Francis. 2008. “R: Regulatory Compliance and Validation Issues: A Guidance Document for the Use of R in Regulated Clinical Trial Environments.” The R Foundation for Statistical Computing, c/o Department of Statistics and Mathematics, Wirtschaftsuniversität Wien, Augasse, 2–6.
Steele, Jr, Guy Lewis. 1978. “RABBIT: A Compiler for SCHEME.” Master’s thesis, Massachusetts Institute of Technology.
Sussman, Gerald Jay, and Guy L Steele Jr. 1998. “Scheme: A Interpreter for Extended Lambda Calculus.” Higher-Order and Symbolic Computation 11 (4): 405–39.
Thieme, Nick. 2018. “R Generation.” Significance 15 (4): 14–19.
Thompson, Ken. 1972. Users’ Reference to B. https://web.archive.org/web/20150611114427/https://www.bell-labs.com/usr/dmr/www/kbman.pdf.

  1. In computer science, scope refers to the degree of binding between an identifier of an entity (e.g., an object name) and the entity itself (e.g., an object).

  2. Lisp, an abbreviation of “LISt Processor”, is the second-oldest (after Fortran) high-level programming language still in common use.

  3. In software engineering, extensibility is a design principle that provides for future growth. This allows developers to easily expand the software capabilities.

  4. Fortran is a computer language developed by IBM in the 1950s for science and engineering applications. Remarkably, it remains useful for many applications, including speeding up slow looping routines in interpreted languages like R.

  5. Unix itself was originally written in assembly language (a low-level programming language with a very strong correspondence between language instructions and machine/operating system instructions). Unix was later re-written in C, a portable, general purpose language, initially developed by Dennis Ritchie (Kernighan and Ritchie 2002). C, in turn, evolved from the language B, created by Ken Thompson (Thompson 1972), which, in turn, was inspired by work on an early operating system called Multics (Corbató and Vyssotsky 1965).

  6. For explicit demonstrations of the technical differences between R and S, see Hornik and R Core Team (2023).

  7. Virtual pagination is a memory management scheme that allows a computer to store and retrieve data from secondary storage for use in main memory.