This short note aims to give some practical tips for R as compared with some common languages that I know. Please follow a more structured guide if you want a reference on R. I have the following recommendations:
?
We can inspect the documentation of everything in R. For example,
?mean
?var
We can even inspect more elementary commands (though we will need to add quotation marks for those):
?"+"
?"for"
?"?"
Looking at the documentation is helpful when we are not sure about how to use some functions (e.g., from a new package). The help manual typically tells us how to use the function, its expected output and provides some examples that we can learn from.
(If you just want a quick fix, you can toggle the soft-warp. For R-studio, go to Tools > Global Options > Code and check “Soft-warp R source files”.)
Sometimes we may have some long statements that exceed the width of editor. This can be inconvenient when we read the same code later. As R does not require end statement like “;” in other languages, whenever we break down a long statement we have to make sure that it is not ended yet. Here is an example:
# Correct multiline statement
x = 1 +2 +3 +4 +5 +
6 +7 +8 +9
x # 45
## [1] 45
# Incorrect multiline statement
y = 1 +2 +3 +4 +5
+6 +7 +8 +9
## [1] 30
y # not 45
## [1] 15
There is no error for the second case because “+6 +7 +8 +9” is also a valid statement. However, the code is not working as we want.
In general, we subset vector, matrix or vector-like data structure with “[”. For matrix, the order of indexes should be row and then column (same as the tradition in Math). Note that the index in R starts from 1 unlike some other languages. Here are some examples:
# Vector
x = 1:9
x[5] # extract single element
## [1] 5
x[5:7] # extract multiple elements
## [1] 5 6 7
x[-1] # drop single element
## [1] 2 3 4 5 6 7 8 9
x[-c(3,5)] # drop multiple elements
## [1] 1 2 4 6 7 8 9
# Matrix
y = matrix(1:9, nrow=3, ncol=3)
y
## [,1] [,2] [,3]
## [1,] 1 4 7
## [2,] 2 5 8
## [3,] 3 6 9
y[1:2, 2:3] # subset a matrix
## [,1] [,2]
## [1,] 4 7
## [2,] 5 8
It is also possible to use string as the index provided that the data structure has names:
# Name the previous matrix
colnames(y) = c("S","T","A")
y[1:2, c("S","A")] # subset a matrix with string
## S A
## [1,] 1 7
## [2,] 2 8
y[1:2, c(1,3)] # same as above
## S A
## [1,] 1 7
## [2,] 2 8
For list-like data structure, we extract an element with “[[” (or “$” provided that the element is named). For example:
# List
z = list(code=3005, subject="STAT", school="CUHK")
z$code
## [1] 3005
z[["subject"]]
## [1] "STAT"
z["school"] # this returns a list
## $school
## [1] "CUHK"
List can be used to store heterogeneous data so it is less often used as a data structure in our course.
There are several common ways to do iterations in R:
for
, while
and repeat
apply
and lapply
For beginners, you may stick to traditional control flow until you are more familiar with R. We will discuss vectorization if we have time.
When we call a function, we do not need to follow the order of arguments if we name the arguments. For example,
# cov(x, y = NULL, use = "everything",
# method = c("pearson", "kendall", "spearman"))
a = c(1:10)
b = c(11:20)
cov(method="kendall", use="everything", x=a, y=b)
## [1] 90
cov(a, b, "everything", "kendall") # same as above
## [1] 90
In practice, we can use this fact to skip specifying some arguments which have default value that we are fine with. For example,
cov(a, b, method="kendall")
## [1] 90
In some common programming languages, ?:
is provided as a ternary if operator. For instance, the statement
a ? b : c
means “if a then b otherwise c”. However, ?
is the help operator, which is shown in The Survival ?
, so this is not true in R. Commonly, people think that ifelse
is the ternary if operator but this is not accurate. The correct ternary if is \
if`while
ifelse` is its vectorized version. We try to compare their speed with the microbenchmark library:
library(microbenchmark)
U = runif(1000)
f_if = function(U, n=length(U)) { # use `if`
out = vector(length=n)
for (i in 1:n) # can skip {} if only 1 line
out[i] = `if`(U[i]<0.5, 0, 1)
out
}
f_eif = function(U, n=length(U)) { # use ifelse
out = vector(length=n)
for (i in 1:n) # can skip {} if only 1 line
out[i] = ifelse(U[i]<0.5, 0, 1)
out
}
f_eif_vec = function(U, n=length(U)) { # use ifelse in a vectorized way
out = ifelse(U<0.5, 0, 1)
out
}
all.equal(f_if(U), f_eif(U), f_eif_vec(U)) # sanity check
## [1] TRUE
# evaluate execution time systematically
print(microbenchmark(f_if(U), f_eif(U), f_eif_vec(U), times=50))
## Unit: microseconds
## expr min lq mean median uq max neval
## f_if(U) 61.9 65.9 76.362 75.00 81.1 119.6 50
## f_eif(U) 1191.4 1242.7 1552.084 1584.85 1742.5 2968.2 50
## f_eif_vec(U) 23.9 33.6 76.480 39.90 48.2 1824.7 50
There are two messages here. First, \
if`` is the correct ternary if operator since it is much faster when a single element is concerned. Second, vectorization can speed up the implementation in R. The general idea is to use vectors instead of single elements as inputs into a vectorized function.
In simulations, we frequently pre-allocate the storage out
in advance of the main loops. The reason is that dynamic memory allocation is usually costly in R. To see this, we try a simple example:
n = 1000
f_pre = function(n) { # pre-allocate out
out = vector(length=n)
for (i in 1:n)
out[i] = i
out
}
f_dyn = function(n) { # combine elements dynamically
out = 1
for (i in 2:n)
out = c(out, i)
out
}
all.equal(f_pre(n), f_dyn(n))
## [1] TRUE
print(microbenchmark(f_pre(n), f_dyn(n), times=50))
## Unit: microseconds
## expr min lq mean median uq max neval
## f_pre(n) 33.7 37.9 44.810 41.40 49.6 79.9 50
## f_dyn(n) 876.3 937.6 1349.594 1024.75 1179.0 4904.3 50
By adjusting the value of n
, we can see that dynamic memory allocation is not of constant time in this way. Therefore, we should try our best to pre-allocate the necessary storage to avoid performance issues.
In Looping, I have recommended stick to traditional control flow. The rationale is that functional loops can usually be rewritten as traditional loops with similar performance. Consider the following:
x = matrix(runif(5000), nrow=1000, ncol=5)
f_for = function(x, n=nrow(x)) { # row sum with for loop
out = vector(length=n)
for (i in 1:n)
out[i] = sum(x[i,])
out
}
f_apply = function(x) { # row sum with apply
out = apply(x, 1, sum)
out
}
f_vec = function(x) { # vectorized row sum
out = rowSums(x)
out
}
all.equal(f_for(x), f_apply(x), f_vec(x))
## [1] TRUE
print(microbenchmark(f_for(x), f_apply(x), f_vec(x), times=50))
## Unit: microseconds
## expr min lq mean median uq max neval
## f_for(x) 599.7 641.4 702.358 664.70 709.1 2081.3 50
## f_apply(x) 993.1 1066.2 1243.108 1146.45 1242.2 2975.1 50
## f_vec(x) 12.6 14.9 48.612 17.90 22.4 1475.1 50
In this simple case, for
is marginally faster than apply
while the vectorized rowSums
is much faster than both. When should we use functional loops in practice? There are several cases:
The concept of package in R is similar to the concept of library in the other programming languages. However, creating packages to manage our code can be time consuming. A quicker way to save and reuse common R code is through the function source
. For example,
x = "stat"
source("example.R") # the code inside covered last line
x
## [1] 3005
We can see that the value of x
changes because it was “covered” by the code in example.R, which consists of
# Content of example.R
x # this will not be printed
x = 3005
There are several things to note:
source
“import” R code by running every line. If global variables/operations are used in the to-be-imported file, they might affect the result unexpectedly. Put only functions in the to-be-imported file to avoid this issuex
(the second line in example.R) did not result in auto-printing the value of x
, which was “stat”. Read the documentation for details about the behavior of source
While R studio or other IDE is more user friendly, running R code in them is usually slower than running via command line. One of the reasons is the dynamic interaction with the environments. In R studio, we can inspect the variables usually by clicking their names in the environment panel. However, this convenience comes at cost of tracing and showing the memories associated with the names. Therefore, if execution speed is of concern after the development stage, we should run our R code via terminal directly (e.g., using source
)
Vectorization, or array programming, refers to applying operations to the whole array instead of its individual elements one by one. It is an important concept in R because suitable vectorization can speed up R program a lot in practice. We have presented two simple examples in Ternary If Operator and Looping with Functionals. Here we try to give clearer picture via another example.
In some common programming languages, a sum of vector elements function is usually implemented as follows:
sum_loop = function(v) {
s = 0
for (i in 1:length(v))
s = s +v[i]
s
}
We compare this with the built-in sum
function, which is vectorized:
x = 1:1000
all.equal(sum_loop(x), sum(x))
## [1] TRUE
print(microbenchmark(sum_loop(x), sum(x), times=50))
## Unit: nanoseconds
## expr min lq mean median uq max neval
## sum_loop(x) 39500 41200 45844 42100 44600 87000 50
## sum(x) 100 100 334 200 300 2800 50
Obviously, the built-in sum
is much faster. Nevertheless, if we think about this and the previous two examples carefully, we will realize that we cannot write vectorized R functions ourselves in some sense. The ifelse
, rowSums
and sum
functions are vectorized because they are built-in with highly optimized compiled code (probably in C). It would be difficult to write a vectorized R function without using them because we would not deal with those (probably C) code in general.
In other words, the building blocks of vectorization in R are usually the functions provided in system library (take a look at the package panel in R studio). As long as we operate with arrays directly with suitable built-in functions, we are doing vectorization in R in a narrow sense. To illustrate different possibilities, we try to give a more complex example. Consider a data series with missing data:
set.seed(3005) # to ensure A will not change
n = 10000
X = round(runif(n), 3)
A = sort(sample(1:n, n/2, F))
head(X, 10) # full data series
## [1] 0.310 0.241 0.379 0.097 0.896 0.699 0.941 0.799 0.062 0.032
head(X[A], 10) # observed data series
## [1] 0.310 0.097 0.896 0.699 0.941 0.799 0.430 0.415 0.407 0.326
head(A, 10) # index of observed data
## [1] 1 4 5 6 7 8 13 14 15 16
We are interested in constructing an index series of last observed data, i.e., \[ 1, 1, 1, 4, 5, 6, 7, 8, 8, 8,\ldots \] A naive implementation is
f_naive = function(A, n) {
out = rep(0, n)
j = 1
for (i in A[1]:n) {
if (j == n/2) next
if (i == A[j+1]) j = j +1
out[i] = A[j]
}
out
}
A marginally better implementation is
f_naive2 = function(A, n) {
out = rep(0, n)
out[A] = A
for (i in A[1]:n)
out[i] = `if`(out[i]==0, out[i-1], out[i])
out
}
A vectorized implementation that uses cumsum
is
f_vec = function(A, n) {
I = rep(0, n)
I[A] = 1 # indicator of observed data
I = cumsum(I) # cumsum is vectorized
out = A[I]
out
}
Finally, we compare their speed below:
all.equal(f_naive(A,n), f_naive2(A,n), f_vec(A,n))
## [1] TRUE
print(microbenchmark(f_naive(A,n), f_naive2(A,n), f_vec(A,n), times=50))
## Unit: microseconds
## expr min lq mean median uq max neval
## f_naive(A, n) 1299.3 1339.9 1494.896 1462.00 1576.7 2153.2 50
## f_naive2(A, n) 814.4 847.4 933.956 915.85 987.6 1157.4 50
## f_vec(A, n) 82.3 84.4 154.484 93.00 103.4 2772.8 50
Environment is a special data structure in R. For ordinary users, it is not necessary to learn it as list provides similar storage for heterogeneous data as well. However, if you want to implement some advance data structures (e.g., queue, stack, tree), you probably need to use environment in R.
To begin with, the operations associated with environment is similar to that with a named list; see Subsetting. For example,
e = new.env() #create an environment
e$code = 3005 #add variables to the environment
e[["subject"]] = "STAT"
c(e$subject, e[["code"]]) #similar subsetting syntax
## [1] "STAT" "3005"
We can even convert an environment to a list, vice versa.
z = as.list(e, all.names=TRUE)
z
## $code
## [1] 3005
##
## $subject
## [1] "STAT"
However, there is one key difference which makes the role of environment special in developing data structure. Consider the following function which changes the element named code
in a list or environment e
:
f_change = function(e, v) {
e$code = v
}
If we input a list, then
f_change(z, 4010) #z is a list replicated from e in the above
z$code #remains unchanged
## [1] 3005
The element named code
is not changed to the input 4010 because changes in a function are local if we do not return the list. Informally, the variable z
is passed into f_change
by value and a local copy of z
is created within the function. However, if we input an environment instead, then
f_change(e, 4010)
e$code #it changes from 3005 to 4010
## [1] 4010
Informally, the variable e
is passed into f_change
by reference and changes to its content are permanent even in the function. This can be a desirable behavior especially for implementing advance data structures since there is no pointer in R. Alternatively, we may use R6 or a similar OOP system in order to have such reference semantic but environment is available in base R without additional complications.
Before ending this section, let me share my implementation of queue with environment in R. There is a package doing it but I need more flexibility for my research, which motivates me to code it myself. Please feel free to let me know if there is any bug.
queue_init = function() {
queue = new.env()
queue
}
queue_push = function(queue, value) {
E = new.env()
E$value = value
E$nextE = NULL
# If there is no elements in queue, insert E as both head and tail
if (is.null(queue$head)) {
E$lastE = NULL
queue$tail = queue$head = E
} else {
E$lastE = queue$tail
queue$tail$nextE = E
queue$tail = E
}
}
queue_pop = function(queue) {
if (is.null(queue$head)) {
NULL
} else {
value = queue$head$value
queue$head = queue$head$nextE
queue$head$lastE = NULL
}
}
Recall in Vectorization that array operation with suitable built-in functions is equivalent to vectorization in R in a narrow sense, we now turn our focus to improving performance in a general sense. While we may not be able to deal with the underlying (probably C) code in R, we can write our own C++ code via the Rcpp package.
To motivate the performance improvement, consider bootstrapping the variance of sample mean. By slightly modifying Algorithm 7.1, we have
library(Rcpp)
bootVar_r = function(x, B=10000) {
n = length(x)
out = rep(NA, B)
for (b in 1:B) {
xb = x[ceiling(runif(n, 0, n))]
out[b] = mean(xb)
}
var(out)*(B-1)/B
}
For the Rcpp implementation, we may use
#include <Rcpp.h>
using namespace Rcpp;
//[[Rcpp::export]]
double bootVar_cpp(NumericVector x, int B=10000) {
int n = x.size();
NumericVector out(B);
for (int i=0; i<B; i++) {
NumericVector xb = x[floor(runif(n, 0, n))];
out[i] = mean(xb);
}
return(var(out)*(B-1)/B);
}
We compare them as follows:
x = rnorm(100, sd=31.4)
set.seed(3005)
bootVar_r(x)
## [1] 10.02304
set.seed(3005)
bootVar_cpp(x) #same as above
## [1] 10.02304
print(microbenchmark(bootVar_r(x), bootVar_cpp(x), times=50))
## Unit: milliseconds
## expr min lq mean median uq max neval
## bootVar_r(x) 83.9917 90.3299 91.99913 91.44540 93.0162 102.4744 50
## bootVar_cpp(x) 17.0776 17.6039 20.13427 18.79405 23.0862 27.8413 50
Clearly the Rcpp implementation is faster. However, there are several remarks:
Rcpp::NumericVector
instead of std::vector
as in ordinary C++ code. To this end, I recommend this as a reference