1 Introduction

This short note aims to give some practical tips for R as compared with some common languages that I know. Please follow a more structured guide if you want a reference on R. I have the following recommendations:

2 Basic

2.1 The Survival ?

We can inspect the documentation of everything in R. For example,

?mean
?var

We can even inspect more elementary commands (though we will need to add quotation marks for those):

?"+"
?"for"
?"?"

Looking at the documentation is helpful when we are not sure about how to use some functions (e.g., from a new package). The help manual typically tells us how to use the function, its expected output and provides some examples that we can learn from.

2.2 Long Statements

(If you just want a quick fix, you can toggle the soft-warp. For R-studio, go to Tools > Global Options > Code and check “Soft-warp R source files”.)

Sometimes we may have some long statements that exceed the width of editor. This can be inconvenient when we read the same code later. As R does not require end statement like “;” in other languages, whenever we break down a long statement we have to make sure that it is not ended yet. Here is an example:

# Correct multiline statement
x = 1 +2 +3 +4 +5 +
  6 +7 +8 +9
x # 45
## [1] 45
# Incorrect multiline statement
y = 1 +2 +3 +4 +5
  +6 +7 +8 +9
## [1] 30
y # not 45
## [1] 15

There is no error for the second case because “+6 +7 +8 +9” is also a valid statement. However, the code is not working as we want.

2.3 Subsetting

In general, we subset vector, matrix or vector-like data structure with “[”. For matrix, the order of indexes should be row and then column (same as the tradition in Math). Note that the index in R starts from 1 unlike some other languages. Here are some examples:

# Vector
x = 1:9
x[5] # extract single element
## [1] 5
x[5:7] # extract multiple elements
## [1] 5 6 7
x[-1] # drop single element
## [1] 2 3 4 5 6 7 8 9
x[-c(3,5)] # drop multiple elements
## [1] 1 2 4 6 7 8 9
# Matrix
y = matrix(1:9, nrow=3, ncol=3)
y
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9
y[1:2, 2:3] # subset a matrix
##      [,1] [,2]
## [1,]    4    7
## [2,]    5    8

It is also possible to use string as the index provided that the data structure has names:

# Name the previous matrix
colnames(y) = c("S","T","A")
y[1:2, c("S","A")] # subset a matrix with string
##      S A
## [1,] 1 7
## [2,] 2 8
y[1:2, c(1,3)] # same as above
##      S A
## [1,] 1 7
## [2,] 2 8

For list-like data structure, we extract an element with “[[” (or “$” provided that the element is named). For example:

# List
z = list(code=3005, subject="STAT", school="CUHK")
z$code
## [1] 3005
z[["subject"]]
## [1] "STAT"
z["school"] # this returns a list
## $school
## [1] "CUHK"

List can be used to store heterogeneous data so it is less often used as a data structure in our course.

2.4 Looping

There are several common ways to do iterations in R:

  1. Traditional control flow such as for, while and repeat
  2. Functional programming such as apply and lapply
  3. Vectorization

For beginners, you may stick to traditional control flow until you are more familiar with R. We will discuss vectorization if we have time.

2.5 Order of Arguments

When we call a function, we do not need to follow the order of arguments if we name the arguments. For example,

# cov(x, y = NULL, use = "everything",
#    method = c("pearson", "kendall", "spearman"))
a = c(1:10)
b = c(11:20)
cov(method="kendall", use="everything", x=a, y=b)
## [1] 90
cov(a, b, "everything", "kendall") # same as above
## [1] 90

In practice, we can use this fact to skip specifying some arguments which have default value that we are fine with. For example,

cov(a, b, method="kendall")
## [1] 90

3 Intermediate

3.1 Ternary If Operator

In some common programming languages, ?: is provided as a ternary if operator. For instance, the statement

a ? b : c

means “if a then b otherwise c”. However, ? is the help operator, which is shown in The Survival ?, so this is not true in R. Commonly, people think that ifelse is the ternary if operator but this is not accurate. The correct ternary if is \if`whileifelse` is its vectorized version. We try to compare their speed with the microbenchmark library:

library(microbenchmark)
U = runif(1000)
f_if = function(U, n=length(U)) { # use `if`
  out = vector(length=n)
  for (i in 1:n) # can skip {} if only 1 line
    out[i] = `if`(U[i]<0.5, 0, 1)
  out
}
f_eif = function(U, n=length(U)) { # use ifelse
  out = vector(length=n)
  for (i in 1:n) # can skip {} if only 1 line
    out[i] = ifelse(U[i]<0.5, 0, 1)
  out
}
f_eif_vec = function(U, n=length(U)) { # use ifelse in a vectorized way
  out = ifelse(U<0.5, 0, 1)
  out
}
all.equal(f_if(U), f_eif(U), f_eif_vec(U)) # sanity check
## [1] TRUE
# evaluate execution time systematically
print(microbenchmark(f_if(U), f_eif(U), f_eif_vec(U), times=50))
## Unit: microseconds
##          expr    min     lq     mean  median     uq    max neval
##       f_if(U)   61.9   65.9   76.362   75.00   81.1  119.6    50
##      f_eif(U) 1191.4 1242.7 1552.084 1584.85 1742.5 2968.2    50
##  f_eif_vec(U)   23.9   33.6   76.480   39.90   48.2 1824.7    50

There are two messages here. First, \if`` is the correct ternary if operator since it is much faster when a single element is concerned. Second, vectorization can speed up the implementation in R. The general idea is to use vectors instead of single elements as inputs into a vectorized function.

3.2 Pre-allocation

In simulations, we frequently pre-allocate the storage out in advance of the main loops. The reason is that dynamic memory allocation is usually costly in R. To see this, we try a simple example:

n = 1000
f_pre = function(n) { # pre-allocate out
  out = vector(length=n)
  for (i in 1:n)
    out[i] = i
  out
}
f_dyn = function(n) { # combine elements dynamically
  out = 1
  for (i in 2:n)
    out = c(out, i)
  out
}
all.equal(f_pre(n), f_dyn(n))
## [1] TRUE
print(microbenchmark(f_pre(n), f_dyn(n), times=50))
## Unit: microseconds
##      expr   min    lq     mean  median     uq    max neval
##  f_pre(n)  33.7  37.9   44.810   41.40   49.6   79.9    50
##  f_dyn(n) 876.3 937.6 1349.594 1024.75 1179.0 4904.3    50

By adjusting the value of n, we can see that dynamic memory allocation is not of constant time in this way. Therefore, we should try our best to pre-allocate the necessary storage to avoid performance issues.

3.3 Looping with Functionals

In Looping, I have recommended stick to traditional control flow. The rationale is that functional loops can usually be rewritten as traditional loops with similar performance. Consider the following:

x = matrix(runif(5000), nrow=1000, ncol=5)
f_for = function(x, n=nrow(x)) { # row sum with for loop
  out = vector(length=n)
  for (i in 1:n)
    out[i] = sum(x[i,])
  out
}
f_apply = function(x) { # row sum with apply
  out = apply(x, 1, sum)
  out
}
f_vec = function(x) { # vectorized row sum
  out = rowSums(x)
  out
}
all.equal(f_for(x), f_apply(x), f_vec(x))
## [1] TRUE
print(microbenchmark(f_for(x), f_apply(x), f_vec(x), times=50))
## Unit: microseconds
##        expr   min     lq     mean  median     uq    max neval
##    f_for(x) 599.7  641.4  702.358  664.70  709.1 2081.3    50
##  f_apply(x) 993.1 1066.2 1243.108 1146.45 1242.2 2975.1    50
##    f_vec(x)  12.6   14.9   48.612   17.90   22.4 1475.1    50

In this simple case, for is marginally faster than apply while the vectorized rowSums is much faster than both. When should we use functional loops in practice? There are several cases:

  1. Readability: functional loops result in neater code in general
  2. Design pattern: functional style of programming may be preferred

3.4 Sourcing

The concept of package in R is similar to the concept of library in the other programming languages. However, creating packages to manage our code can be time consuming. A quicker way to save and reuse common R code is through the function source. For example,

x = "stat"
source("example.R") # the code inside covered last line
x
## [1] 3005

We can see that the value of x changes because it was “covered” by the code in example.R, which consists of

# Content of example.R
x # this will not be printed
x = 3005

There are several things to note:

  1. Intuitively, source “import” R code by running every line. If global variables/operations are used in the to-be-imported file, they might affect the result unexpectedly. Put only functions in the to-be-imported file to avoid this issue
  2. By the above example, we can see that some operations might not have effect. For example, calling x (the second line in example.R) did not result in auto-printing the value of x, which was “stat”. Read the documentation for details about the behavior of source
  3. It is possible to import global code chunks separately if we use R markdown. See this. However, we still need to name the chunks so that we can pinpoint which one to be called

3.5 Command Line

While R studio or other IDE is more user friendly, running R code in them is usually slower than running via command line. One of the reasons is the dynamic interaction with the environments. In R studio, we can inspect the variables usually by clicking their names in the environment panel. However, this convenience comes at cost of tracing and showing the memories associated with the names. Therefore, if execution speed is of concern after the development stage, we should run our R code via terminal directly (e.g., using source)

4 Advance

4.1 Vectorization

Vectorization, or array programming, refers to applying operations to the whole array instead of its individual elements one by one. It is an important concept in R because suitable vectorization can speed up R program a lot in practice. We have presented two simple examples in Ternary If Operator and Looping with Functionals. Here we try to give clearer picture via another example.

In some common programming languages, a sum of vector elements function is usually implemented as follows:

sum_loop = function(v) {
  s = 0
  for (i in 1:length(v))
    s = s +v[i]
  s
}

We compare this with the built-in sum function, which is vectorized:

x = 1:1000
all.equal(sum_loop(x), sum(x))
## [1] TRUE
print(microbenchmark(sum_loop(x), sum(x), times=50))
## Unit: nanoseconds
##         expr   min    lq  mean median    uq   max neval
##  sum_loop(x) 39500 41200 45844  42100 44600 87000    50
##       sum(x)   100   100   334    200   300  2800    50

Obviously, the built-in sum is much faster. Nevertheless, if we think about this and the previous two examples carefully, we will realize that we cannot write vectorized R functions ourselves in some sense. The ifelse, rowSums and sum functions are vectorized because they are built-in with highly optimized compiled code (probably in C). It would be difficult to write a vectorized R function without using them because we would not deal with those (probably C) code in general.

In other words, the building blocks of vectorization in R are usually the functions provided in system library (take a look at the package panel in R studio). As long as we operate with arrays directly with suitable built-in functions, we are doing vectorization in R in a narrow sense. To illustrate different possibilities, we try to give a more complex example. Consider a data series with missing data:

set.seed(3005) # to ensure A will not change
n = 10000
X = round(runif(n), 3)
A = sort(sample(1:n, n/2, F))
head(X, 10) # full data series
##  [1] 0.310 0.241 0.379 0.097 0.896 0.699 0.941 0.799 0.062 0.032
head(X[A], 10) # observed data series
##  [1] 0.310 0.097 0.896 0.699 0.941 0.799 0.430 0.415 0.407 0.326
head(A, 10) # index of observed data
##  [1]  1  4  5  6  7  8 13 14 15 16

We are interested in constructing an index series of last observed data, i.e., \[ 1, 1, 1, 4, 5, 6, 7, 8, 8, 8,\ldots \] A naive implementation is

f_naive = function(A, n) {
  out = rep(0, n)
  j = 1
  for (i in A[1]:n) {
    if (j == n/2) next
    if (i == A[j+1]) j = j +1
    out[i] = A[j]
  }
  out
}

A marginally better implementation is

f_naive2 = function(A, n) {
  out = rep(0, n)
    out[A] = A
  for (i in A[1]:n)
    out[i] = `if`(out[i]==0, out[i-1], out[i])
  out
}

A vectorized implementation that uses cumsum is

f_vec = function(A, n) {
  I = rep(0, n)
  I[A] = 1 # indicator of observed data
  I = cumsum(I) # cumsum is vectorized
  out = A[I]
  out
}

Finally, we compare their speed below:

all.equal(f_naive(A,n), f_naive2(A,n), f_vec(A,n))
## [1] TRUE
print(microbenchmark(f_naive(A,n), f_naive2(A,n), f_vec(A,n), times=50))
## Unit: microseconds
##            expr    min     lq     mean  median     uq    max neval
##   f_naive(A, n) 1299.3 1339.9 1494.896 1462.00 1576.7 2153.2    50
##  f_naive2(A, n)  814.4  847.4  933.956  915.85  987.6 1157.4    50
##     f_vec(A, n)   82.3   84.4  154.484   93.00  103.4 2772.8    50

4.2 Environments

Environment is a special data structure in R. For ordinary users, it is not necessary to learn it as list provides similar storage for heterogeneous data as well. However, if you want to implement some advance data structures (e.g., queue, stack, tree), you probably need to use environment in R.

To begin with, the operations associated with environment is similar to that with a named list; see Subsetting. For example,

e = new.env() #create an environment
e$code = 3005 #add variables to the environment
e[["subject"]] = "STAT"
c(e$subject, e[["code"]]) #similar subsetting syntax
## [1] "STAT" "3005"

We can even convert an environment to a list, vice versa.

z = as.list(e, all.names=TRUE)
z
## $code
## [1] 3005
## 
## $subject
## [1] "STAT"

However, there is one key difference which makes the role of environment special in developing data structure. Consider the following function which changes the element named code in a list or environment e:

f_change = function(e, v) {
  e$code = v
}

If we input a list, then

f_change(z, 4010) #z is a list replicated from e in the above
z$code #remains unchanged
## [1] 3005

The element named code is not changed to the input 4010 because changes in a function are local if we do not return the list. Informally, the variable z is passed into f_change by value and a local copy of z is created within the function. However, if we input an environment instead, then

f_change(e, 4010)
e$code #it changes from 3005 to 4010
## [1] 4010

Informally, the variable e is passed into f_change by reference and changes to its content are permanent even in the function. This can be a desirable behavior especially for implementing advance data structures since there is no pointer in R. Alternatively, we may use R6 or a similar OOP system in order to have such reference semantic but environment is available in base R without additional complications.

Before ending this section, let me share my implementation of queue with environment in R. There is a package doing it but I need more flexibility for my research, which motivates me to code it myself. Please feel free to let me know if there is any bug.

queue_init = function() {
  queue = new.env()
  queue
}

queue_push = function(queue, value) {
  E = new.env()
  E$value = value
  E$nextE = NULL
  # If there is no elements in queue, insert E as both head and tail
  if (is.null(queue$head)) {
    E$lastE = NULL
    queue$tail = queue$head = E
  } else {
    E$lastE = queue$tail
    queue$tail$nextE = E
    queue$tail = E
  }
}

queue_pop = function(queue) {
  if (is.null(queue$head)) {
    NULL
  } else {
    value = queue$head$value
    queue$head = queue$head$nextE
    queue$head$lastE = NULL
  }
}

4.3 Rcpp

Recall in Vectorization that array operation with suitable built-in functions is equivalent to vectorization in R in a narrow sense, we now turn our focus to improving performance in a general sense. While we may not be able to deal with the underlying (probably C) code in R, we can write our own C++ code via the Rcpp package.

To motivate the performance improvement, consider bootstrapping the variance of sample mean. By slightly modifying Algorithm 7.1, we have

library(Rcpp)
bootVar_r = function(x, B=10000) {
  n = length(x)
  out = rep(NA, B)
  for (b in 1:B) {
    xb = x[ceiling(runif(n, 0, n))]
    out[b] = mean(xb)
  }
  var(out)*(B-1)/B
}

For the Rcpp implementation, we may use

#include <Rcpp.h>
using namespace Rcpp;

//[[Rcpp::export]]
double bootVar_cpp(NumericVector x, int B=10000) {
  int n = x.size();
  NumericVector out(B);
  for (int i=0; i<B; i++) {
    NumericVector xb = x[floor(runif(n, 0, n))];
    out[i] = mean(xb);
  }
  return(var(out)*(B-1)/B);
}

We compare them as follows:

x = rnorm(100, sd=31.4)
set.seed(3005)
bootVar_r(x)
## [1] 10.02304
set.seed(3005)
bootVar_cpp(x) #same as above
## [1] 10.02304
print(microbenchmark(bootVar_r(x), bootVar_cpp(x), times=50))
## Unit: milliseconds
##            expr     min      lq     mean   median      uq      max neval
##    bootVar_r(x) 83.9917 90.3299 91.99913 91.44540 93.0162 102.4744    50
##  bootVar_cpp(x) 17.0776 17.6039 20.13427 18.79405 23.0862  27.8413    50

Clearly the Rcpp implementation is faster. However, there are several remarks:

  1. We can see that we need (sort of) new terms if we want to use Rcpp. For example, we declare x as Rcpp::NumericVector instead of std::vector as in ordinary C++ code. To this end, I recommend this as a reference
  2. Calling Rcpp functions may incur some overheads. Therefore, it is not always the case that Rcpp implementation is faster. It usually improves performance when there are long loops
  3. If Rcpp functions are simply included in the global environment, they may reside in random addresses. Consequently, we may not call these functions in a parallel computing setting. From my personal experience (and Q&As on the Internet), one of the solutions is to create a personal package with these functions so that they can be called “statically” from the package