Chapter 3 R Basics

3.1 Scalar Arithmetic

R has the usual arithmetic operators:

+ for addition
- for subtraction
* for multiplication
/ for division
^ for raising to a power
%/% for integer division
%% for remainder from integer division
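
As a brief illustration of the last two operators, the sketch below divides 17 by 5. The values and variable names are chosen only for this example. For a negative dividend, R rounds the quotient down (toward minus infinity), so the remainder takes the sign of the divisor.

```r
# Integer division and remainder (illustrative values only)
q = 17 %/% 5        # quotient: 3
r = 17 %% 5         # remainder: 2
q * 5 + r           # reconstructs 17

# With a negative dividend the quotient is rounded down
qn = (-17) %/% 5    # -4
rn = (-17) %% 5     # 3
qn * 5 + rn         # reconstructs -17
```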

There is also a unary minus operator, “-”, which applies to a single operand to make a negative value (“+” can likewise be used as unary plus). Expressions are evaluated according to the order of operations: ^ first, * and / second, and + and - last, with operators of the same order evaluated from left to right (except ^, which is evaluated from right to left). For example, the following expression yields the answer shown here:

1 + 2 - 3 * 4 / 5 ^ 6

## [1] 2.999232

The number in brackets (e.g., [1]) is the index of the first element shown on that line of the result. Because the result here is a scalar, only one value follows the bracketed number. This labeling becomes handy when we work with vectors instead of scalars. The order of operations can be changed with parentheses. For example:

(1 + 2 - 3) * 4 / 5 ^ 6

## [1] 0

1 + 2 - (3 * 4 / 5) ^ 6

## [1] -188.103

1 + (2 - 3) * 4 / 5 ^ 6

## [1] 0.999744

1 + (2 - 3) * (4 / 5) ^ 6

## [1] 0.737856

Anything from the pound symbol, #, to the end of the line is treated as a comment and is ignored when the expression is run. For example,

6 + 5 - 4 * 3 / 2 ^ 1  # the answer is [Enter]

## [1] 5

Comments can be placed at the end of any line of R code or on a line of their own, and R ignores them. They are very useful for explaining what an expression is meant to do.

By default, R prints numbers with 7 significant digits. This can be changed with the digits argument of the options() function; according to the R documentation, the valid values are \(1\) to \(22\). It should be noted that the trailing digits are not reliable when a very large number of digits, say \(22\), is requested, because numbers are stored in binary floating point. For example, \(1/3\) yields \(0.3333333\) in the default setting, but the following representation error appears when a higher precision is requested:

options(digits = 22)
1/3

## [1] 0.3333333333333333148296
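
One practical consequence is that comparing computed decimal values with == can fail. The following is a minimal sketch of a common workaround, the base R function all.equal(), which compares within a small tolerance. The variable names are placeholders for this example.

```r
# Doubles are stored in binary, so some decimal values are inexact
s = 0.1 + 0.2
exact = (s == 0.3)                  # FALSE: the two doubles differ slightly
close = isTRUE(all.equal(s, 0.3))   # TRUE: equal within a small tolerance

# print() has its own digits argument, which leaves options() untouched
print(1/3, digits = 3)
```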

With the same options() setting in effect, the mathematical constant \(\pi\) can be obtained in R both directly with the built-in constant pi and indirectly with the arc tangent function:

pi

## [1] 3.141592653589793115998

4 * atan(1)

## [1] 3.141592653589793115998

Other mathematical constants can be obtained using R functions. In fact, nearly all of the common mathematical functions are available in R, with their arguments given in parentheses. For example, mathematical functions include:

abs() - absolute value
exp() - exponential (e to a power)
gamma() - gamma function
lgamma() - log of gamma function
log() - logarithm
log10() - logarithm of base 10
sign() - signum function
sqrt() - square root
floor() - largest integer less than or equal to the argument
ceiling() - smallest integer greater than or equal to the argument
trunc() - truncation toward zero (fractional part dropped)
factorial() - factorial
lfactorial() - log of factorial
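
The differences among the rounding functions show up for negative arguments. The short sketch below uses an arbitrary value, -2.7, and also checks the relationship between factorial() and gamma(). The variable names are chosen only for this illustration.

```r
# Rounding functions differ for negative arguments
v = -2.7
floor(v)      # -3
ceiling(v)    # -2
trunc(v)      # -2 (fractional part dropped, i.e., toward zero)

# factorial(n) equals gamma(n + 1); the log version helps avoid overflow
f1 = factorial(5)          # 120, up to floating-point error
f2 = gamma(6)              # 120, up to floating-point error
f3 = exp(lfactorial(5))    # approximately 120
```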

A full range of comparison and logical operators can be used in R:

> - greater than
< - less than
>= - greater than or equal to
<= - less than or equal to
== - equality
!= - non-equality
& - elementwise AND
| - elementwise OR
&& - AND for single values (used in control flow)
|| - OR for single values (used in control flow)
! - unary not

In the trigonometric functions, the arguments are in radians instead of degrees. For example:

sin(pi / 6)

## [1] 0.4999999999999999444888

pi / 6

## [1] 0.5235987755982988156589

sin(0.5235988)

## [1] 0.500000021132492977749

sin((30 / 180) * pi)

## [1] 0.4999999999999999444888

A scalar value or the result of an arithmetic expression can be saved as a variable with the assignment operator <- or =. The value can be displayed by typing the variable name:

a = sin(pi / 6)

b = cos(pi / 6)

c = sqrt(a ^ 2 + b ^ 2)

c

## [1] 1

Multiple expressions can be combined on one line by separating them with semicolons. Spaces are mostly optional in R commands, but proper spacing enhances readability. For example,

a = sin(pi/6); b = cos(pi/6); c = sqrt(a ^ 2 + b ^ 2); c

## [1] 1

R can also handle complex numbers, which have real and imaginary parts, although they are rarely needed in applied statistical procedures:

x = 4 + 2i
Re(x)

## [1] 4

Im(x)

## [1] 2

y = 4 - 2i
x + y

## [1] 8+0i

x * y

## [1] 20+0i

3.2 Vector Arithmetic

Here, a vector is an ordered collection of values of the same type stored under one variable name. In fact, even a single value is technically a vector of length one, but when we say vector we usually mean a structure with multiple elements. One can have numeric vectors (i.e., a series of numbers), character vectors (i.e., a series of strings), or logical vectors (TRUE or FALSE values). The key rule is that all elements in a vector must be of the same class or type.

If one mixes types, R will automatically coerce them to a common type so that the whole vector is uniform. For example, combining numbers and strings in one vector will turn all values into strings under the hood.
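
The coercion rule can be checked directly with class(). The sketch below uses made-up values, and the names m1 and m2 are placeholders for this example.

```r
# Mixing types in c() coerces everything to the most general type present
m1 = c(1, "a", TRUE)    # all elements become character
m2 = c(1.5, TRUE)       # logical TRUE becomes numeric 1
class(m1)               # "character"
class(m2)               # "numeric"
```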

To define the vector, we use the concatenation function c() and list all the values:

x = c(1, 2, 3)

After defining the vector (i.e., a variable in a statistical sense), the elements of the vector can be listed by typing the name of the vector:

x

## [1] 1 2 3

A character vector of names can be defined in the same way:

names = c('Alice', 'Thabo', 'Zola')
names

## [1] "Alice" "Thabo" "Zola"

One can perform operations on vectors easily. Functions for simple statistics for a vector are available in R:

min() - smallest value
max() - largest value
range() - minimum and maximum
mean() - arithmetic average
var() - variance
sd() - standard deviation
sum() - arithmetic sum
prod() - product of elements
length() - number of elements
median() - 50th percentile
quantile() - quantiles
cumsum() - cumulative sum
diff() - first difference
table() - frequency table or cross tabulation
summary() - summary statistics (minimum, quartiles, mean, maximum) or frequencies
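
A few of these functions are applied below to a small made-up vector, named s only for this illustration, as a quick sketch of what they return.

```r
s = c(1, 2, 3, 2)
mean(s)      # 2
median(s)    # 2
range(s)     # 1 3
cumsum(s)    # 1 3 6 8
diff(s)      # 1 1 -1
table(s)     # counts: one 1, two 2s, one 3

# sd() is the square root of var(); both use the n - 1 denominator
sd(s) ^ 2 - var(s)   # zero up to floating-point error
```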

In addition, after defining two vectors, the following statistical functions are available in R:

cor() - correlation
cov() - covariance

For example:

x = c(1, 2, 3, 2)
y = c(1, 3, 2, 2) 

# compute correlation between the two numeric vectors 
cor(x, y)

## [1] 0.5

# compute covariance between the two numeric vectors 
cov(x, y)

## [1] 0.3333333333333333148296

A vector can be sorted in ascending (increasing) or descending (decreasing) order with the sort() function, for example:

x = c(1, 2, 3, 2)
sort(x)

## [1] 1 2 2 3

sort(x, decreasing = TRUE)

## [1] 3 2 2 1

A subset of a vector can be extracted by placing index subscripts, or expressions involving them, in brackets, for example:

x = c(1, 2, 3, 2)


x[1]

## [1] 1

x[2:4]

## [1] 2 3 2

x[-3]

## [1] 1 2 2

x[x < 3]

## [1] 1 2 2

x[x > 2]

## [1] 3

Note that the vector can be overwritten with the assignment operator, for example:

x = x[-3]; x

## [1] 1 2 2

Vectors can be generated and converted to different types using functions in R:

numeric() - a vector of zeroes with the length of the argument
character() - a vector of empty strings with the length of the argument
logical() - a vector of FALSE of argument length
seq() - sequence from argument 1 to argument 2 in increments of argument 3
1:4 - numbers equivalent to seq(1, 4, 1)
rep() - replicate argument 1 as many times as argument 2
as.numeric() - conversion to numeric
as.character() - conversion to string-type
as.logical() - conversion to logical
factor() - creating factor from vector

For example, the following are very useful ways to construct a sequence of nicely patterned elements:

x = 1 : 4; x

## [1] 1 2 3 4

x = seq(1, 4, 1); x

## [1] 1 2 3 4

x = seq(1, 2, 0.2); x

## [1] 1.000000000000000000000 1.199999999999999955591 1.399999999999999911182 1.600000000000000088818 1.800000000000000044409
## [6] 2.000000000000000000000

x = rep(1, 4); x

## [1] 1 1 1 1

x = c(rep(1, 4), rep(2, 2)); x

## [1] 1 1 1 1 2 2
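
The conversion functions in the list can be sketched in the same spirit. The values and variable names below are made up for this illustration.

```r
n = as.numeric("3.5")          # string to number: 3.5
ch = as.character(1:3)         # "1" "2" "3"
lg = as.logical(c(0, 1, 2))    # FALSE TRUE TRUE (nonzero counts as TRUE)

g = factor(c("low", "high", "low"))
levels(g)                      # "high" "low" (sorted alphabetically by default)

numeric(3)                     # 0 0 0
```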

3.3 Matrices and Matrix Functions

An array is a collection of data which can be indexed by one or more subscripts. The vectors discussed above can be seen as one-dimensional arrays. Each element in a vector can be referred to as the name with the subscript enclosed in brackets (e.g., x[1]). Two-dimensional arrays are generally referred to as matrices. The matrix() function is used to create a matrix. For example, a matrix with ones in the first column and four observations in the second column can be defined and listed subsequently by:

X  = matrix(c(1, 1, 1, 1, 1, 2, 3, 2), nrow = 4); X

##      [,1] [,2]
## [1,]    1    1
## [2,]    1    2
## [3,]    1    3
## [4,]    1    2

R commands as well as the names of objects and variables are case-sensitive. The objects X and x, for example, are not the same unless they are defined to be equivalent. The above matrix definition is equivalent to each of the following:

X = matrix(c(1, 1, 1, 1, 1, 2, 3, 2), ncol = 2)
X = matrix(c(1, 1, 1, 1, 1, 2, 3, 2), nrow = 4, ncol = 2)
X = matrix(c(1, 1, 1, 2, 1, 3, 1, 2), nrow = 4, byrow = TRUE)
X = matrix(c(1, 1, 1, 2, 1, 3, 1, 2), ncol = 2, byrow = TRUE)
X = matrix(c(1, 1, 1, 2, 1, 3, 1, 2), nrow = 4, ncol = 2, byrow = TRUE)

Elements in a matrix can be referred to as the name with the row and column subscripts enclosed in brackets. For example, with the same matrix defined earlier:

X[2,2]

## [1] 2

X[, 2]

## [1] 1 2 3 2

X[2, ]

## [1] 1 2

X[1:2, ]

##      [,1] [,2]
## [1,]    1    1
## [2,]    1    2

After defining two or more vectors of the same length (i.e., the same number of elements), a matrix can be constructed by the cbind() function:

u = c(1, 1, 1, 1)
x = c(1, 2, 3, 2)

X  = cbind(u, x); X

##      u x
## [1,] 1 1
## [2,] 1 2
## [3,] 1 3
## [4,] 1 2

Notice that the default column names in the listing of the matrix are replaced with the names of the vectors. The equivalent call to the matrix() function is:

X = matrix(c(1, 1, 1, 1, 1, 2, 3, 2), ncol = 2, dimnames = list(c(), c("u", "x"))); X

##      u x
## [1,] 1 1
## [2,] 1 2
## [3,] 1 3
## [4,] 1 2

The row and column names can also be specified with the rownames() and colnames() functions, respectively:

X  = matrix(c(1, 1, 1, 1, 1, 2, 3, 2), ncol = 2) 
colnames(X) = c("u", "x")
rownames(X) = c()

X

##      u x
## [1,] 1 1
## [2,] 1 2
## [3,] 1 3
## [4,] 1 2

After defining vectors of the same length to serve as rows, a matrix can be constructed with the rbind() function:

r1 = c(1, 1); r2 = c(1, 2); r3 = c(1, 3); r4 = c(1, 2)
X = rbind(r1, r2, r3, r4); X

##    [,1] [,2]
## r1    1    1
## r2    1    2
## r3    1    3
## r4    1    2

Note that the row names can be reset to the default labels by assigning an empty value, c() (i.e., NULL), with the rownames() function:

rownames(X) = c(); X

##      [,1] [,2]
## [1,]    1    1
## [2,]    1    2
## [3,]    1    3
## [4,]    1    2

A matrix can also be constructed with the array() function, although an array is not limited to two dimensions. For example,

X  = array(c(1, 1, 1, 1, 1, 2, 3, 2), dim = c(4,2)); X

##      [,1] [,2]
## [1,]    1    1
## [2,]    1    2
## [3,]    1    3
## [4,]    1    2

Once a matrix is defined, the dimension, the number of rows, and the number of columns of the matrix can be obtained with the following functions:

dim(X)

## [1] 4 2

nrow(X)

## [1] 4

ncol(X)

## [1] 2

The following are some matrix functions:

chol() - Cholesky decomposition
crossprod() - matrix crossproduct
det() - determinant
diag() - to create or extract diagonal values
eigen() - eigenvalues and eigenvectors
outer() - outer product of two vectors
scale() - to scale the columns of a matrix
solve() - inversion or to solve system of linear equations
svd() - singular value decomposition
qr() - QR decomposition
t() - to transpose

Subject to the usual conformability conditions for scalars and matrices, element-wise addition, subtraction, multiplication, and division can be performed with the ordinary arithmetic operators. Matrix multiplication is done with the operator:

%*% matrix multiplication
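
The difference between * and %*% is easy to overlook, so here is a minimal sketch with a made-up 2 x 2 matrix P. The names E and MP are placeholders.

```r
P = matrix(1:4, nrow = 2)   # columns are (1, 2) and (3, 4)
E = P * P                   # element-wise product: each entry squared
MP = P %*% P                # matrix product: rows times columns
E
MP
```

Here E has columns (1, 4) and (9, 16), whereas MP has columns (7, 10) and (15, 22).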

The following is an example to obtain the estimates of an intercept and a slope from a simple regression model using the matrix functions and operators:

X  = array(c(1, 1, 1, 1, 1, 2, 3, 2), dim = c(4, 2))
colnames(X) = c("u", "x")

y = c(1, 3, 2, 2)

solve(t(X) %*% X) %*% t(X) %*% y

##   [,1]
## u  1.0
## x  0.5

betahat = solve(crossprod(X, X)) %*% t(X) %*% y
rownames(betahat) = c("a", "b"); betahat

##   [,1]
## a  1.0
## b  0.5

ypredict = X %*% betahat; ypredict

##      [,1]
## [1,]  1.5
## [2,]  2.0
## [3,]  2.5
## [4,]  2.0

yhat = ypredict[, 1]; yhat

## [1] 1.5 2.0 2.5 2.0

residual = y - yhat; residual

## [1] -0.5  1.0 -0.5  0.0

In practice, regression results are generally obtained not from matrix or vector operations but from R's functions for statistical modeling, such as lm(). The above code is therefore for demonstration and instructional purposes only.
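
For comparison, the same estimates can be obtained with lm(), the base R function for fitting linear models. The sketch below reuses the data from the matrix example under new names (reg_x, reg_y, and fit are placeholders), so that it runs on its own.

```r
reg_x = c(1, 2, 3, 2)
reg_y = c(1, 3, 2, 2)

fit = lm(reg_y ~ reg_x)   # least-squares fit with an intercept
coef(fit)                 # intercept 1.0 and slope 0.5, as in the matrix solution
fitted(fit)               # predicted values: 1.5 2.0 2.5 2.0
resid(fit)                # residuals: -0.5 1.0 -0.5 0.0
```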

3.4 Data Frame

A data frame is a two-dimensional structure with observations in rows and variables in columns; unlike a matrix, its columns can be of different types. Functions such as dim(), dimnames(), nrow(), and ncol() work on data frames. The attach() function allows the variables contained in a data frame to be accessed directly by their names. Data frames can be constructed with the data.frame() function:

heights = c(180, 165, 170)
ids   = c("Alice", "Bob", "Chitra")

people_df = data.frame(id = ids, height = heights); people_df

##       id height
## 1  Alice    180
## 2    Bob    165
## 3 Chitra    170

Variables can be extracted from the data frame by column position or referred to directly by giving the data frame name and the variable name separated by a dollar sign. For example, the id variable can be listed with either of the following commands, assuming that the data frame has been defined as in the earlier expressions:

people_df[, 1]

## [1] "Alice"  "Bob"    "Chitra"

people_df$id

## [1] "Alice"  "Bob"    "Chitra"

The names() function displays the variable names in the data frame:

names(people_df)

## [1] "id"     "height"

A new variable can be appended to an existing data frame by assigning a vector, here created with the c() function, to a new variable name after the dollar sign:

people_df$age = c(21, 17, 45); people_df

##       id height age
## 1  Alice    180  21
## 2    Bob    165  17
## 3 Chitra    170  45

A variable can be removed, or a subset of the variables can be selected, as in the following expressions for the previous data frame people_df with its three variables:

people_df = people_df[,-3]

people_df = people_df[,1:2]

These yield the same data frame people_df with only id and height.
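
Selecting by position can break when columns are reordered, so selecting or removing by name is a common alternative. The following is a small sketch on a copy of the data named demo_df, a hypothetical name used so that people_df is left untouched.

```r
demo_df = data.frame(id = c("Alice", "Bob", "Chitra"),
                     height = c(180, 165, 170),
                     age = c(21, 17, 45))

by_name = demo_df[, c("id", "height")]   # keep columns by name
demo_df$age = NULL                       # or delete one column in place

names(by_name)   # "id" "height"
names(demo_df)   # "id" "height"
```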

The edit() function opens a spreadsheet-like data editor window in which the data frame can be modified. Both the values of the variables and the variable names can be changed. The changes are kept when the editor window is closed with the close button in the top right corner of its title bar, provided that the result of edit() is assigned to an object.

It is also possible to construct a data frame by opening up a blank data frame using the edit() function and then entering the necessary values and variable names:

X = edit(data.frame())

A data frame can be saved as a text file that can be opened with other programs, such as a text editor:

write.table(X, file= "X.txt", sep = " ")
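
A file written this way can be read back with read.table(). The sketch below is a round trip under stated assumptions: it writes to a temporary file created by tempfile() rather than to "X.txt", uses a made-up data frame demo_df, and sets row.names = FALSE so that row labels are not written.

```r
demo_df = data.frame(id = c("Alice", "Bob", "Chitra"),
                     height = c(180, 165, 170),
                     stringsAsFactors = FALSE)

tmp = tempfile(fileext = ".txt")   # temporary file, so nothing is overwritten
write.table(demo_df, file = tmp, sep = " ", row.names = FALSE)

# header = TRUE tells read.table() that the first line holds variable names
back = read.table(tmp, header = TRUE, sep = " ", stringsAsFactors = FALSE)
back
unlink(tmp)                        # delete the temporary file
```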

The current working directory where the data frame file is to be stored can be found with this:

getwd()

## [1] "C:/Users/cash/OneDrive - University of Cape Town/2026/DataFirst/Introduction-To-R"

and the directory can be changed to the usual root directory with either of the following two commands:

setwd("C:\\")

setwd("C:/")

After loading the file, the variables contained in a data frame can be accessed directly by name once the attach() function has been called:

attach(people_df)

A data frame can be removed from the search path with the detach() function, for example:

detach(people_df)

If an object with the same name as a variable in the attached data frame already exists in the workspace, R finds the workspace object first, because the workspace is searched before attached data frames. In that case the name does not refer to the variable in the data frame, so care should be exercised when the attach() function is employed.
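
One way to avoid this ambiguity is the base R function with(), which evaluates an expression using the columns of a data frame without attaching it. In the sketch below, the data frame demo_df and the workspace vector height are made up to create a name clash on purpose.

```r
demo_df = data.frame(id = c("Alice", "Bob", "Chitra"),
                     height = c(180, 165, 170))
height = c(0, 0, 0)                    # a workspace object with the same name

m_with = with(demo_df, mean(height))   # uses the column of demo_df
m_ws = mean(height)                    # uses the workspace object: 0
```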

A list of currently available objects can be found by the ls() function:

ls()

##  [1] "a"         "b"         "betahat"   "c"         "heights"   "ids"       "names"     "people_df" "r1"        "r2"        "r3"       
## [12] "r4"        "residual"  "u"         "x"         "X"         "y"         "yhat"      "ypredict"

The objects can be removed by the rm() function, for example:

x = c(1, 2, 3, 2)

rm(x)

The entire workspace will be cleared by:

rm(list = ls())

3.5 Missing Values

In R, not available (i.e., NA) is used as a missing value. The following lines show how the missing values are treated in R:

x = c(1, NA, 3, 2)
x

## [1]  1 NA  3  2

is.na(x)

## [1] FALSE  TRUE FALSE FALSE

sum(!is.na(x))

## [1] 3

newx = x[!is.na(x)]
newx

## [1] 1 3 2

x[2] = sum(newx)/sum(!is.na(x)) 
x

## [1] 1 2 3 2
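
Many summary functions accept an na.rm argument that drops missing values before computing, which avoids building newx by hand. A minimal sketch, with x_na as a placeholder name:

```r
x_na = c(1, NA, 3, 2)
mean(x_na)                     # NA: the missing value propagates
m = mean(x_na, na.rm = TRUE)   # 2: computed from the three observed values
s = sum(x_na, na.rm = TRUE)    # 6
```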

Note also that NaN (i.e., not a number) is treated as missing by is.na(), whereas Inf (i.e., infinity) is not:

x1 = 0/0
x2 = Inf 
x3 = Inf - Inf 
x = c(x1, x2, x3, 2)

x

## [1] NaN Inf NaN   2

is.na(x)

## [1]  TRUE FALSE  TRUE FALSE
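
To tell these special values apart, base R also provides is.nan() and is.finite(). The vector z below is made up for this sketch.

```r
z = c(0/0, Inf, -Inf, NA, 2)
is.na(z)       # TRUE FALSE FALSE TRUE FALSE  (NaN and NA)
is.nan(z)      # TRUE FALSE FALSE FALSE FALSE (NaN only)
is.finite(z)   # FALSE FALSE FALSE FALSE TRUE (ordinary numbers only)
```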

The best way to deal with missing values is to prevent them from occurring during data collection. No strategy for handling missing values, however sophisticated and complicated, is better than obtaining complete data. There is no royal road for missing data.

3.6 Control Flow

3.6.1 Logical Operators

We can create logical vectors that indicate whether each element of another vector satisfies some conditions.

Logical operator	Example
equal to	x == 1
not equal to	x != 1
greater than	x > 7
less than	x < 7
greater than or equal to	x >= 8
less than or equal to	x <= 8
and	x > 1 & x < 4
or	x > 8 | x == 2

Let us create an atomic vector and see some examples of logical operators. A comparison is applied to each element of the vector in turn. For instance, x == 10 will return a logical vector indicating whether each element of x is equal to 10.

x = 1:10
x

##  [1]  1  2  3  4  5  6  7  8  9 10

x == 10

##  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE

x != 10

##  [1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE

x >= 8

##  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE

x > 1 & x < 4

##  [1] FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

x == 2 | x > 8

##  [1] FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE

See what happens if we use logical operators between two vectors of the same length.

a = c(-5, -3, -1, 0, 2, 4, 6)
b = c(-5, -3, -1, 0, 1, 3, 5)

a == b

## [1]  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE

a > b

## [1] FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE

Missing values are contagious in logical operators. That is, logical operators will return NA if one of the elements being compared is NA. For example,

d = c(1, NA, 3)
e = c(1, 2, 3)

d == e

## [1] TRUE   NA TRUE

e > NA

## [1] NA NA NA

The which() function returns the indices of the elements of a vector that satisfy a condition. For example, to find which elements of a are greater than 3:

which(a > 3)

## [1] 6 7

We can refer to the elements of a vector satisfying some conditions by using logical operators with brackets [] . The following code will return the values of the vector a that are greater than 3.

a[a > 3]

## [1] 4 6

# or 

a[which(a > 3)]

## [1] 4 6
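
The two forms differ when the vector contains missing values: a logical index of NA produces an NA element, whereas which() keeps only the positions that are TRUE. A short sketch with a made-up vector w:

```r
w = c(1, NA, 5)
by_logical = w[w > 3]        # NA 5: the NA comparison yields an NA element
by_which = w[which(w > 3)]   # 5: which() drops the NA position
```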

3.6.2 `if` statement

An if statement allows us to conditionally execute code. Here is an example of how to use an if statement. The following code will print "a has length 7" if the length of a is 7.

if (length(a) == 7) {
  print("a has length 7")
} else {
  print("a does not have length 7.")
}

## [1] "a has length 7"

The statement to be tested goes into (), and the consequence for “statement is true” goes into the first braces {} . The consequence for “statement is false” goes into the second {} after else .

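In an if statement, the condition must be a single TRUE or FALSE, so we should not use & or |, which are vectorized operators that return one value per element. Instead, use && or ||, which work on single values and evaluate the right-hand side only when it is needed.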

if(length(x) == 10 && x[1] > 0){
  print("x has length 10, and the first element of x is greater than zero.")
}

## [1] "x has length 10, and the first element of x is greater than zero."
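
Because && stops as soon as the result is known, the left-hand condition can guard the right-hand one. The sketch below uses a hypothetical empty vector named empty; the second condition is never evaluated because the first is FALSE.

```r
empty = numeric(0)   # a vector with no elements

# && stops at the first FALSE, so empty[1] > 0 is never evaluated
if (length(empty) > 0 && empty[1] > 0) {
  msg = "first element is positive"
} else {
  msg = "vector is empty or first element is not positive"
}
msg
```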

3.6.3 `for` loop

Imagine we have a \(10 \times 10\) matrix that contains the integers from 1 to 100:

A = matrix(c(1:100), nrow = 10, byrow = TRUE)
A

##       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
##  [1,]    1    2    3    4    5    6    7    8    9    10
##  [2,]   11   12   13   14   15   16   17   18   19    20
##  [3,]   21   22   23   24   25   26   27   28   29    30
##  [4,]   31   32   33   34   35   36   37   38   39    40
##  [5,]   41   42   43   44   45   46   47   48   49    50
##  [6,]   51   52   53   54   55   56   57   58   59    60
##  [7,]   61   62   63   64   65   66   67   68   69    70
##  [8,]   71   72   73   74   75   76   77   78   79    80
##  [9,]   81   82   83   84   85   86   87   88   89    90
## [10,]   91   92   93   94   95   96   97   98   99   100

We want to compute the median of each row. We can copy and paste:

median(A[1, ])

## [1] 5.5

median(A[2, ])

## [1] 15.5

median(A[3, ])

## [1] 25.5

#... 
median(A[10, ])

## [1] 95.5

However, copying and pasting is not very efficient if we are dealing with a large number of rows, say \(50\) rows. Instead, we could use a for loop:

med = rep(NA, 10)             # 1. output
for (i in 1:10) {             # 2. sequence
  med[i] = median(A[i, ])     # 3. body
}
med

##  [1]  5.5 15.5 25.5 35.5 45.5 55.5 65.5 75.5 85.5 95.5

A for loop consists of three components:

Output: Before starting the loop, we created a vector med of length 10, filled with NA, using rep(). At each iteration, the median of the ith row is assigned to the ith element of this output vector.
Sequence: This part determines what to loop over. Each iteration of the for loop assigns a different value from 1:10 to i.
Body: The body is the code that does the work. At each iteration, the code inside the braces {} is run with a different value for i. For example, the first iteration runs med[1] = median(A[1, ]).
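
The three components can be written slightly more generally, and the same result can be obtained without an explicit loop. In the sketch below, seq_len(nrow(B)) replaces the hard-coded 1:10 so that the loop follows the number of rows, and the base R function apply() with its second argument set to 1 calls median() on each row. The names B, med_loop, and med_apply are placeholders for this illustration.

```r
B = matrix(1:100, nrow = 10, byrow = TRUE)

# Loop version: the sequence adapts to the number of rows
med_loop = rep(NA, nrow(B))
for (i in seq_len(nrow(B))) {
  med_loop[i] = median(B[i, ])
}

# apply() version: the 1 means "apply over rows"
med_apply = apply(B, 1, median)
```

Both vectors hold the same ten row medians. Which form to use is largely a matter of readability.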

3.6.4 `for` loop with an `if` statement

Here, we will see how to use an if statement inside a for loop. We want to create a vector of length 10 such that the ith element is 1 if the ith element of x is even and 2 if the ith element of x is odd.

v = numeric(10)    # output: create a zero vector of length 10
for (i in 1:10) {     # sequence
  if (x[i] %% 2 == 0) {   # if statement
    v[i] = 1          # body
  } else {
    v[i] = 2
  }
}
v

##  [1] 2 1 2 1 2 1 2 1 2 1
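
For a simple element-by-element choice like this one, the base R function ifelse() gives the same result without a loop. A minimal sketch, using k in place of x and v_vec in place of v so that it runs on its own:

```r
k = 1:10
v_vec = ifelse(k %% 2 == 0, 1, 2)   # 1 for even elements, 2 for odd
v_vec                               # 2 1 2 1 2 1 2 1 2 1
```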