Chapter 3 R Basics
3.1 Scalar Arithmetic
R has the usual arithmetic operators:
+ for addition
- for subtraction
* for multiplication
/ for division
^ for raising to a power
%/% for integer division
%% for remainder from integer division
There is also a unary minus operator, “-”, which applies to a single operand to make a value negative (“+” can likewise be used as unary plus). For example, the following expression yields, based on the order of operations (i.e., ^ first, then unary minus, then %/% and %%, then * and /, and + and - last, from left to right when the precedence is the same), the answer shown here:
## [1] 2.999232
The number in brackets (e.g., [1]) gives the index of the first element shown on that line of output. Because the result here is a scalar, only one value follows the bracketed number. This labeling becomes handy when we work with vectors instead of scalars. The order of operations can be changed with parentheses. For example:
## [1] 0
## [1] -188.103
## [1] 0.999744
## [1] 0.737856
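As a minimal sketch of these precedence rules, the following uses illustrative values. They are assumptions chosen for easy mental arithmetic and are not the expressions that produced the output above.

```r
# Illustrative values only; not the expressions behind the output above
p1 <- 2 + 3 * 4^2     # ^ first, then *, then +: 2 + 48 = 50
p2 <- (2 + 3) * 4^2   # parentheses force the addition first: 5 * 16 = 80
p3 <- -2^2            # ^ binds tighter than unary minus: -(2^2) = -4
p4 <- (-2)^2          # parentheses make the base negative: 4
q  <- 7 %/% 2         # integer division: 3
r  <- 7 %% 2          # remainder from integer division: 1
c(p1, p2, p3, p4, q, r)
```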
Everything from the pound symbol, #, to the end of the line is treated as a comment and is ignored when the expression is run. For example,
## [1] 5
Comments can be placed anywhere in R code, either on their own line or at the end of an expression. They are useful for explaining what an expression is meant to do.
By default, R prints 7 significant digits. This can be changed with the digits argument of the options() function, for which the valid values are \(0\) to \(22\). It should be noted that rounding errors may become visible when a very large number of digits, say \(22\), is requested, because numbers are stored in binary floating point. For example, \(1/3\) will yield \(0.3333333\) in the default setting, but the following rounding error can occur when a higher precision is requested:
## [1] 0.3333333333333333148296
With the same options() setting in effect, the mathematical constant \(\pi\) can be obtained using R both directly with the built-in constant pi and indirectly with the arc tangent function:
## [1] 3.141592653589793115998
## [1] 3.141592653589793115998
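A minimal sketch of changing the printed precision and of the two routes to \(\pi\) might look as follows. The variable names and the choice of 15 digits are assumptions made for illustration; the previous settings are restored at the end so that the change does not leak into later code.

```r
third <- 1 / 3
format(third, digits = 7)     # "0.3333333", the default precision

old <- options(digits = 15)   # options() returns the previous settings
print(third)                  # printed with up to 15 significant digits
options(old)                  # restore the previous settings

pi_direct <- pi               # built-in constant
pi_atan   <- 4 * atan(1)      # tan(pi / 4) = 1, so atan(1) = pi / 4
```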
Other mathematical constants can be obtained using R functions. In fact, nearly all of the common mathematical functions are available in R, with arguments given in parentheses. For example, mathematical functions include:
abs() - absolute value
exp() - exponential (e to a power)
gamma() - gamma function
lgamma() - log of gamma function
log() - logarithm
log10() - logarithm of base 10
sign() - signum function
sqrt() - square root
floor() - largest integer less than or equal to the argument
ceiling() - smallest integer greater than or equal to the argument
trunc() - truncation toward zero
factorial() - factorial
lfactorial() - log of factorial
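The difference between the three rounding-type functions is easiest to see with a negative argument. The sketch below uses an arbitrary illustrative value of -2.7 and a few other functions from the list; the variable names are assumptions.

```r
v  <- -2.7            # illustrative value
f  <- floor(v)        # -3: largest integer not greater than v
ce <- ceiling(v)      # -2: smallest integer not less than v
tr <- trunc(v)        # -2: the fractional part is dropped
s  <- sqrt(16)        # 4
l  <- log10(1000)     # 3
fa <- factorial(5)    # 120
ab <- abs(-3)         # 3
```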
A full range of logical operators can be used in R:
> - greater than
< - less than
>= - greater than or equal to
<= - less than or equal to
== - equality
!= - non-equality
& - elementwise AND
| - elementwise OR
&& - control AND
|| - control OR
! - unary not
In the trigonometric functions, the arguments are in radians instead of degrees. For example:
## [1] 0.4999999999999999444888
## [1] 0.5235987755982988156589
## [1] 0.500000021132492977749
## [1] 0.4999999999999999444888
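Because the trigonometric functions expect radians, an angle in degrees has to be converted first. A minimal sketch, assuming an illustrative angle of 30 degrees:

```r
deg  <- 30                     # illustrative angle in degrees
rad  <- deg * pi / 180         # degrees to radians
s    <- sin(rad)               # approximately 0.5
back <- asin(0.5) * 180 / pi   # radians back to degrees, approximately 30
```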
A scalar value or the result of an arithmetic expression can be saved as a variable with the assignment operator <- or =. The value can be displayed by typing the variable name:
## [1] 1
Multiple expressions can be placed on one line by separating them with semicolons. Spaces are mostly optional in R commands, but proper spacing improves readability. For example,
## [1] 1
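A short sketch of both assignment forms and of two expressions sharing one line; the variable names are illustrative assumptions:

```r
a <- 1                  # assignment with <-
b = 2                   # assignment with =
total <- a + b; total   # two expressions on one line, separated by a semicolon
```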
R can also handle complex numbers, which have real and imaginary parts, although these are rarely needed in applied statistical procedures:
## [1] 4
## [1] 2
## [1] 8+0i
## [1] 20+0i
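As a minimal sketch of complex arithmetic, the following uses the illustrative value \(3 + 4i\). The value is an assumption and is not necessarily the one behind the output above.

```r
z  <- complex(real = 3, imaginary = 4)   # the same value as 3+4i
re <- Re(z)                              # real part: 3
im <- Im(z)                              # imaginary part: 4
md <- Mod(z)                             # modulus: sqrt(3^2 + 4^2) = 5
zz <- z * Conj(z)                        # product with the conjugate: 25+0i
```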
3.2 Vector Arithmetic
Here, a vector is an ordered collection of values of the same type stored under one variable name. In fact, even a single value is technically a vector of length one. Usually, when we say vector we mean a structure with multiple elements. One can have numeric vectors (i.e., a series of numbers), character vectors (i.e., a series of strings), or logical vectors (TRUE or FALSE values). The key rule is that all elements in a vector must be of the same class or type.
If one mixes types, R will automatically coerce them to a common type so that the whole vector is uniform. For example, combining numbers and strings in one vector will turn all values into strings under the hood.
To define the vector, we use the concatenation function c() and list all the values:
After defining the vector (i.e., a variable in a statistical sense), the elements of the vector can be listed by typing in the name of the vector:
## [1] 1 2 3
A character vector of names will be
## [1] "Alice" "Thabo" "Zola"
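The coercion rule can be sketched as follows. The vector names are assumptions made for illustration.

```r
nums  <- c(1, 2, 3)            # numeric vector
mixed <- c(1, "a", TRUE)       # mixed types are coerced to character
flags <- c(1, TRUE, FALSE)     # logical values are coerced to numeric

mixed          # "1" "a" "TRUE"
flags          # 1 1 0
class(mixed)   # "character"
class(flags)   # "numeric"
```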
One can perform operations on vectors easily. Functions for simple statistics for a vector are available in R:
min() - smallest value
max() - largest value
range() - minimum and maximum
mean() - arithmetic average
var() - variance
sd() - standard deviation
sum() - arithmetic sum
prod() - product of elements
length() - number of elements
median() - 50th percentile
quantile() - quantiles
cumsum() - cumulative sum
diff() - first difference
table() - frequency table or cross tabulation
summary() - summary statistics (minimum, quartiles, mean, and maximum) or frequencies
In addition, after defining two vectors, the following statistical functions are available in R:
cor() - correlation
cov() - covariance
For example:
## [1] 0.5
## [1] 0.3333333333333333148296
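A minimal sketch of these functions, assuming the vectors x = (1, 2, 3, 2) and y = (1, 3, 2, 2) that appear elsewhere in this chapter:

```r
x <- c(1, 2, 3, 2)   # assumed values, taken from the matrix examples
y <- c(1, 3, 2, 2)   # assumed values, taken from the regression example

m  <- mean(x)        # 2
v  <- var(x)         # 2/3, computed with the divisor n - 1
r  <- cor(x, y)      # 0.5
cv <- cov(x, y)      # 1/3
```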
A vector can be sorted into ascending (increasing) or descending (decreasing) order with the sort() function, for example:
## [1] 1 2 2 3
## [1] 3 2 2 1
A subset of a vector can be extracted by placing index subscripts, or expressions built from them, in brackets, for example:
## [1] 1
## [1] 2 3 2
## [1] 1 2 2
## [1] 1 2 2
## [1] 3
Note that a vector can be overwritten by assigning the result back to the same name, for example:
## [1] 1 2 2
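Sorting, subsetting, and overwriting can be sketched together as follows, assuming the vector x = (1, 2, 3, 2). The intermediate variable names are illustrative.

```r
x <- c(1, 2, 3, 2)                    # assumed values

asc   <- sort(x)                      # 1 2 2 3
desc  <- sort(x, decreasing = TRUE)   # 3 2 2 1
first <- x[1]                         # first element: 1
tail3 <- x[2:4]                       # elements 2 to 4: 2 3 2
pick  <- x[c(1, 2, 4)]                # elements 1, 2, and 4: 1 2 2
drop3 <- x[-3]                        # everything except element 3: 1 2 2
big   <- x[x > 2]                     # elements greater than 2: 3

x <- x[-3]                            # overwrite x with the shortened vector
```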
Vectors can be generated and converted to different types using functions in R:
numeric() - a vector of zeroes with the length of the argument
character() - a vector of blank characters of argument length
logical() - a vector of FALSE of argument length
seq() - from argument 1 to argument 2 with the increment of argument 3
1:4 - numbers equivalent to seq(1, 4, 1)
rep() - replicate argument 1 as many times as argument 2
as.numeric() - conversion to numeric
as.character() - conversion to string-type
as.logical() - conversion to logical
factor() - creating factor from vector
For example, the following are very useful ways to construct a sequence of nicely patterned elements:
## [1] 1 2 3 4
## [1] 1 2 3 4
## [1] 1.000000000000000000000 1.199999999999999955591 1.399999999999999911182 1.600000000000000088818 1.800000000000000044409
## [6] 2.000000000000000000000
## [1] 1 1 1 1
## [1] 1 1 1 1 2 2
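A minimal sketch of patterned sequences; the argument values are assumptions chosen to resemble the output above, and the variable names are illustrative.

```r
s1 <- 1:4                              # 1 2 3 4
s2 <- seq(1, 4, by = 1)                # the same values via seq()
s3 <- seq(1, 2, by = 0.2)              # six values from 1 to 2
rep_a <- rep(1, times = 4)             # 1 1 1 1
rep_b <- rep(c(1, 2), times = c(4, 2)) # 1 1 1 1 2 2
```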
3.3 Matrices and Matrix Functions
An array is a collection of data which can be indexed by one or more subscripts. The vectors discussed above can be seen as one-dimensional arrays. Each element in a vector can be referred to by the vector name followed by the subscript enclosed in brackets (e.g., x[1]). Two-dimensional arrays are generally referred to as matrices. The matrix() function is used to create a matrix. For example, a matrix with ones in the first column and four observations in the second column can be defined and listed subsequently by:
## [,1] [,2]
## [1,] 1 1
## [2,] 1 2
## [3,] 1 3
## [4,] 1 2
R commands and the names of objects and variables are case-sensitive. The objects X and x, for example, are different objects unless they are explicitly assigned the same value. The matrix above can be defined equivalently by any of the following:
X = matrix(c(1, 1, 1, 1, 1, 2, 3, 2), ncol=2)
X = matrix(c(1, 1, 1, 1, 1, 2, 3, 2), nrow=4, ncol=2)
X = matrix(c(1, 1, 1, 2, 1, 3, 1, 2), nrow=4, byrow=TRUE)
X = matrix(c(1, 1, 1, 2, 1, 3, 1, 2), ncol=2, byrow=TRUE)
X = matrix(c(1,1,1,2,1,3,1,2), nrow=4, ncol=2, byrow=TRUE)
Elements in a matrix can be referred to by the matrix name followed by the row and column subscripts enclosed in brackets. For example, with the same matrix defined earlier:
## [1] 2
## [1] 1 2 3 2
## [1] 1 2
## [,1] [,2]
## [1,] 1 1
## [2,] 1 2
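Matrix subscripts can be sketched as follows, using the same matrix X as above. Leaving a subscript empty selects the whole row or column; the result names are illustrative.

```r
X <- matrix(c(1, 1, 1, 1, 1, 2, 3, 2), ncol = 2)

e22  <- X[2, 2]    # element in row 2, column 2: 2
col2 <- X[, 2]     # entire second column: 1 2 3 2
row2 <- X[2, ]     # entire second row: 1 2
top  <- X[1:2, ]   # first two rows, returned as a 2 x 2 matrix
```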
After defining two or more vectors of the same length (i.e., the same number of elements), a matrix can be constructed by the cbind() function:
## u x
## [1,] 1 1
## [2,] 1 2
## [3,] 1 3
## [4,] 1 2
Notice that the default column names in the listing of the matrix are replaced with the names of the vectors. The same matrix can be produced with the matrix() function:
## u x
## [1,] 1 1
## [2,] 1 2
## [3,] 1 3
## [4,] 1 2
The row and column names can also be specified with the rownames() and colnames() functions, respectively:
## u x
## [1,] 1 1
## [2,] 1 2
## [3,] 1 3
## [4,] 1 2
When vectors of the same length represent rows instead, a matrix can be constructed with the rbind() function:
## [,1] [,2]
## r1 1 1
## r2 1 2
## r3 1 3
## r4 1 2
Note that the row names can be reset to the default names with the rownames() function:
## [,1] [,2]
## [1,] 1 1
## [2,] 1 2
## [3,] 1 3
## [4,] 1 2
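The column-wise and row-wise constructions can be sketched side by side. The object names Xc and Xr are assumptions for illustration, and assigning NULL to rownames() is one way to restore the default row labels.

```r
u <- c(1, 1, 1, 1)
x <- c(1, 2, 3, 2)

Xc <- cbind(u, x)            # columns take the names of the vectors

Xr <- rbind(r1 = c(1, 1), r2 = c(1, 2), r3 = c(1, 3), r4 = c(1, 2))
rownames(Xr) <- NULL         # drop the row names, back to [1,], [2,], ...
colnames(Xr) <- c("u", "x")  # set the column names explicitly
```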
A matrix can also be constructed with the array() function, although an array is not limited to two dimensions. For example,
## [,1] [,2]
## [1,] 1 1
## [2,] 1 2
## [3,] 1 3
## [4,] 1 2
Once a matrix is defined, the dimension, the number of rows, and the number of columns of the matrix can be obtained with the following functions:
## [1] 4 2
## [1] 4
## [1] 2
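A minimal sketch of the array() construction together with the dimension queries; the result names are illustrative.

```r
X  <- array(c(1, 1, 1, 1, 1, 2, 3, 2), dim = c(4, 2))
d  <- dim(X)    # 4 2
nr <- nrow(X)   # 4
nc <- ncol(X)   # 2
```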
The following are some matrix functions:
chol() - Cholesky decomposition
crossprod() - matrix crossproduct
det() - determinant
diag() - to create or extract diagonal values
eigen() - eigenvalues and eigenvectors
outer() - outer product of two vectors
scale() - to scale the columns of a matrix
solve() - inversion or to solve system of linear equations
svd() - singular value decomposition
qr() - QR decomposition
t() - to transpose
Based on the usual conformability conditions for scalars and matrices, elementwise addition, subtraction, multiplication, and division can be performed. Matrix multiplication is done with the operator:
%*% for matrix multiplication
The following is an example to obtain the estimates of an intercept and a slope from a simple regression model using the matrix functions and operators:
X = array(c(1, 1, 1, 1, 1, 2, 3, 2), dim = c(4, 2))
colnames(X) = c("u", "x")
y = c(1, 3, 2, 2)
solve(t(X) %*% X) %*% t(X) %*% y
## [,1]
## u 1.0
## x 0.5
## [,1]
## a 1.0
## b 0.5
## [,1]
## [1,] 1.5
## [2,] 2.0
## [3,] 2.5
## [4,] 2.0
## [1] 1.5 2.0 2.5 2.0
## [1] -0.5 1.0 -0.5 0.0
In general, regression results will be obtained not from matrix or vector operations but from the R functions for statistical modeling. Hence, the code above is for demonstration and instructional purposes only.
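To connect the two approaches, the sketch below computes the least squares estimates \((X'X)^{-1}X'y\), the fitted values, and the residuals by matrix operations, and then compares them with lm(), the base R function for linear models. The object names betahat, yhat, residual, and fit are assumptions for illustration.

```r
x <- c(1, 2, 3, 2)
y <- c(1, 3, 2, 2)
X <- cbind(u = 1, x = x)                      # design matrix with a column of ones

betahat  <- solve(t(X) %*% X) %*% t(X) %*% y  # intercept 1.0, slope 0.5
yhat     <- X %*% betahat                     # fitted values
residual <- y - yhat                          # observed minus fitted

fit <- lm(y ~ x)                              # same model via the modeling function
coef(fit)
```

Both routes give an intercept of 1.0 and a slope of 0.5 for these data.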
3.4 Data Frame
A data frame is a two-dimensional array of observations in rows and variables in columns. Functions such as dim(), dimnames(), nrow(), and ncol() work on data frames. The attach() function allows the variables contained in a data frame to be accessed directly by their names. Data frames can be constructed with the data.frame() function:
heights = c(180, 165, 170)
ids = c("Alice", "Bob", "Chitra")
people_df = data.frame(id = ids, height = heights); people_df
## id height
## 1 Alice 180
## 2 Bob 165
## 3 Chitra 170
Variables can be extracted from the data frame, or referred to directly, by giving the data frame name and the variable name separated by a dollar sign. For example, the id variable can be listed with the following commands, assuming that the data frame has been declared as in the earlier expressions:
## [1] "Alice" "Bob" "Chitra"
## [1] "Alice" "Bob" "Chitra"
The names() function displays the variable names in the data frame:
## [1] "id" "height"
A new variable can be appended to an existing data frame by assigning a vector created with the c() function to a new variable name after the dollar sign:
## id height age
## 1 Alice 180 21
## 2 Bob 165 17
## 3 Chitra 170 45
A variable can be removed, or a subset of the variables selected, as in the following expressions for the previous data frame people_df with three variables:
These yield the same data frame people_df with only id and height.
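The data frame operations described above can be sketched in one place. The intermediate names (ids_a, ids_b, no_age_1, no_age_2) are assumptions for illustration, and the three ways of dropping age shown here are common options rather than necessarily the original expressions.

```r
people_df <- data.frame(id = c("Alice", "Bob", "Chitra"),
                        height = c(180, 165, 170))

ids_a <- people_df$id          # extract a variable with the dollar sign
ids_b <- people_df[["id"]]     # equivalent extraction by name
names(people_df)               # "id" "height"

people_df$age <- c(21, 17, 45) # append a new variable

no_age_1 <- people_df[, c("id", "height")]     # select the columns to keep
no_age_2 <- subset(people_df, select = -age)   # or drop age by name
people_df$age <- NULL                          # or remove age in place
```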
The edit() function opens a spreadsheet-like data editor window in which the data frame can be edited. Both the values of the variables and the variable names can be modified. The data frame is saved by clicking the close window icon, that is, the exit button in the top right corner of the data editor window’s title bar.
It is also possible to construct a data frame by opening up a blank data frame using the edit() function and then entering the necessary values and variable names:
A data frame can be saved as a file that can be opened with other editor-type programs as:
The current working directory where the data frame file is to be stored can be found with this:
## [1] "C:/Users/cash/OneDrive - University of Cape Town/2026/DataFirst/Introduction-To-R"
and the directory can be changed to the usual root directory with either of the following two commands:
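A minimal sketch of saving a data frame and working with directories is given below. The file name people_df.csv and the use of the session's temporary directory are assumptions made so that the example does not write into a real project folder; write.csv() is one of several base R functions that produce a plain text file.

```r
people_df <- data.frame(id = c("Alice", "Bob", "Chitra"),
                        height = c(180, 165, 170))

out_file <- file.path(tempdir(), "people_df.csv")  # hypothetical file name
write.csv(people_df, out_file, row.names = FALSE)  # save as comma-separated text
read_back <- read.csv(out_file)                    # read the file back in

start_dir <- getwd()    # current working directory
setwd(tempdir())        # change the working directory
setwd(start_dir)        # and change it back
```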
After loading the file, the variables contained in a data frame can be directly accessed by calling the attach() function:
A data frame can be removed from the search path of the current session with the detach() function, for example:
If an object with the same name as a variable in the attached data frame already exists in the workspace, then, because R searches the workspace before attached objects, the name will refer to that object rather than to the variable in the data frame. Care should be exercised when the attach() function is employed.
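The masking problem can be sketched as follows. The workspace object height is a deliberately conflicting, hypothetical object, and with() is shown as one base R alternative that evaluates an expression inside the data frame without attaching it.

```r
people_df <- data.frame(id = c("Alice", "Bob", "Chitra"),
                        height = c(180, 165, 170))
height <- c(0, 0, 0)          # hypothetical object sharing a name with a column

attach(people_df)             # R may report that height is masked
mean_masked <- mean(height)   # uses the workspace object, giving 0
detach(people_df)             # remove the data frame from the search path

mean_with <- with(people_df, mean(height))   # uses the column, about 171.67
```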
A list of currently available objects can be found by the ls() function:
## [1] "a" "b" "betahat" "c" "heights" "ids" "names" "people_df" "r1" "r2" "r3"
## [12] "r4" "residual" "u" "x" "X" "y" "yhat" "ypredict"
The objects can be removed by the rm() function, for example:
The entire workspace will be cleared by:
3.5 Missing Values
In R, not available (i.e., NA) is used as a missing value. The following lines show how the missing values are treated in R:
## [1] 1 NA 3 2
## [1] FALSE TRUE FALSE FALSE
## [1] 3
## [1] 1 3 2
## [1] 1 2 3 2
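A minimal sketch of common ways to work with NA, assuming a vector like the one above. The replacement value of 2 in the last line is purely illustrative and is not a recommended imputation.

```r
x <- c(1, NA, 3, 2)

miss  <- is.na(x)                # FALSE TRUE FALSE FALSE
n_obs <- sum(!is.na(x))          # number of non-missing values: 3
m_na  <- mean(x)                 # NA, because one value is missing
m_ok  <- mean(x, na.rm = TRUE)   # 2, missing value removed first
x_complete <- x[!is.na(x)]       # 1 3 2

x[is.na(x)] <- 2                 # replace the missing value (illustrative)
```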
Note also that NaN (i.e., not a number) is treated as a missing case by is.na(), whereas Inf (i.e., infinity) is not:
## [1] NaN Inf NaN 2
## [1] TRUE FALSE TRUE FALSE
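The three related checks can be compared on one vector. The values below are assumptions chosen to produce NaN and Inf through ordinary arithmetic.

```r
z <- c(0 / 0, 1 / 0, NaN, 2)   # NaN Inf NaN 2

na_flag     <- is.na(z)        # TRUE FALSE TRUE FALSE
nan_flag    <- is.nan(z)       # TRUE FALSE TRUE FALSE
finite_flag <- is.finite(z)    # FALSE FALSE FALSE TRUE
```

Using is.finite() is one way to screen out NA, NaN, and infinite values at the same time.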
The best way to solve the problem of missing values is to prevent them from occurring during data collection. No strategy for handling missing values, however sophisticated or complicated, is better than obtaining complete data. There is no royal road for missing data.
3.6 Control Flow
3.6.1 Logical Operators
We can create logical vectors that indicate whether each element of another vector satisfies some conditions.
| Logical operator | Example |
|---|---|
| equal to | x == 1 |
| not equal to | x != 1 |
| greater than | x > 7 |
| less than | x < 7 |
| greater than or equal to | x >= 8 |
| less than or equal to | x <= 8 |
| and | x > 1 & x < 4 |
| or | x > 8 \| x == 2 |
Let us create an atomic vector and see some examples of logical operators. Logical operators will specify all the elements that satisfy some conditions. For instance, \(x == 1\) will return a logical vector indicating whether each element of \(x\) is equal to \(1\).
## [1] 1 2 3 4 5 6 7 8 9 10
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE
## [1] FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1] FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE
See what happens if we use logical operators between two vectors of the same length.
## [1] TRUE TRUE TRUE TRUE FALSE FALSE FALSE
## [1] FALSE FALSE FALSE FALSE TRUE TRUE TRUE
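A minimal sketch of logical vectors, assuming x holds the integers 1 to 10. The vectors p and q are hypothetical and only illustrate an element-by-element comparison of two vectors of the same length.

```r
x <- 1:10

eq      <- x == 10           # TRUE only in position 10
between <- x > 1 & x < 4     # TRUE in positions 2 and 3
either  <- x > 8 | x == 2    # TRUE in positions 2, 9, and 10

p <- c(1, 2, 3)              # hypothetical vectors of equal length
q <- c(3, 2, 1)
p_less <- p < q              # compared element by element: TRUE FALSE FALSE
```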
Missing values are contagious in logical operators. That is, logical operators will return NA if one of the elements being compared is NA. For example,
## [1] TRUE NA TRUE
## [1] NA NA NA
The which() function returns the indices of the elements of a vector that satisfy a given condition. For example, to find which elements of a are greater than 3:
## [1] 6 7
We can refer to the elements of a vector satisfying some conditions by using logical operators with brackets [] . The following code will return the values of the vector a that are greater than 3.
## [1] 4 6
## [1] 4 6
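The difference between indices and values can be sketched as follows. The vector a is hypothetical: it was chosen to have seven elements with its sixth and seventh values above 3, and it is not necessarily the vector used above.

```r
a <- c(1, 3, 2, 0, 1, 4, 6)   # hypothetical values

idx   <- which(a > 3)         # positions of the matching elements: 6 7
vals  <- a[a > 3]             # the matching values themselves: 4 6
vals2 <- a[which(a > 3)]      # the same values, selected by position

b <- c(1, NA, 5)              # a comparison involving NA yields NA
b_pos <- b > 0                # TRUE NA TRUE
```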
3.6.2 if statement
An if statement allows us to conditionally execute code. Here is an example of how to use an if statement. The following code will print "a has length 7" if the length of a is 7.
## [1] "a has length 7"
The statement to be tested goes into (), and the consequence for “statement is true” goes into the first braces {} . The consequence for “statement is false” goes into the second {} after else .
In an if statement, we should not use & or | because these are vectorized operations that apply to multiple values. Instead, you can use && or ||.
if (length(x) == 10 && x[1] > 0) {
  print("x has length 10, and the first element of x is greater than zero.")
}
## [1] "x has length 10, and the first element of x is greater than zero."
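A minimal sketch of a complete if and else pair, followed by the short-circuit behaviour of &&. The vector a and the names msg, empty, and ok are assumptions for illustration.

```r
a <- c(1, 3, 2, 0, 1, 4, 6)   # hypothetical vector of length 7

if (length(a) == 7) {
  msg <- "a has length 7"                 # runs when the condition is TRUE
} else {
  msg <- "a does not have length 7"       # runs when the condition is FALSE
}
msg

# && stops as soon as the answer is known, so the second
# condition is not evaluated when the first one is FALSE
empty <- numeric(0)
ok <- length(empty) > 0 && empty[1] > 0   # FALSE
```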
3.6.3 for loop
Imagine we have a \(10 \times 10\) matrix that contains the integers from 1 to 100:
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 1 2 3 4 5 6 7 8 9 10
## [2,] 11 12 13 14 15 16 17 18 19 20
## [3,] 21 22 23 24 25 26 27 28 29 30
## [4,] 31 32 33 34 35 36 37 38 39 40
## [5,] 41 42 43 44 45 46 47 48 49 50
## [6,] 51 52 53 54 55 56 57 58 59 60
## [7,] 61 62 63 64 65 66 67 68 69 70
## [8,] 71 72 73 74 75 76 77 78 79 80
## [9,] 81 82 83 84 85 86 87 88 89 90
## [10,] 91 92 93 94 95 96 97 98 99 100
We want to compute the median of each row. We can copy and paste:
## [1] 5.5
## [1] 15.5
## [1] 25.5
## [1] 95.5
However, copying and pasting is not efficient if we are dealing with a large number of rows, say \(50\) rows. Instead, we could use a for loop:
med = rep(NA, 10)           # 1. output
for (i in 1:10) {           # 2. sequence
  med[i] = median(A[i, ])   # 3. body
}
med
## [1] 5.5 15.5 25.5 35.5 45.5 55.5 65.5 75.5 85.5 95.5
A for loop consists of three components:
- Output: Before starting the loop, we created an empty atomic vector med of length 10 using rep(). At each iteration, the median of the ith row is assigned as the ith element of our output vector.
- Sequence: This part shows what to loop over. Each iteration of the for loop assigns i to a different value from 1:10.
- Body: The body is the code that does the work. At each iteration, the code inside the braces {} is run with a different value for i. For example, the first iteration will run med[1] = median(A[1, ]).
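For a self-contained version of this loop, the sketch below first builds the matrix, assuming it was filled row by row with the integers 1 to 100, and then repeats the loop using the number of rows instead of a hard-coded 10. The last line shows apply(), a base R function that gives the same row medians without an explicit loop.

```r
A <- matrix(1:100, nrow = 10, byrow = TRUE)   # assumed construction of A

med <- rep(NA, nrow(A))                       # output
for (i in seq_len(nrow(A))) {                 # sequence: one pass per row
  med[i] <- median(A[i, ])                    # body: median of row i
}

med_apply <- apply(A, 1, median)              # 1 means "apply over rows"
```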
3.6.4 for loop with an if statement
Here, we will see how to use an if statement inside a for loop. We want to create a vector of length 10 such that the ith element is 1 if the ith element of x is even, and 2 if the ith element of x is odd.
v = numeric(10)             # output: create a zero vector of length 10
for (i in 1:10) {           # sequence
  if (x[i] %% 2 == 0) {     # if statement
    v[i] = 1                # body
  } else {
    v[i] = 2
  }
}
v
## [1] 2 1 2 1 2 1 2 1 2 1
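The same result can also be obtained without a loop. The sketch below assumes x holds the integers 1 to 10 and uses ifelse(), the vectorized counterpart of if and else in base R, next to the loop for comparison.

```r
x <- 1:10                            # assumed values of x

v_loop <- numeric(10)
for (i in seq_along(x)) {
  if (x[i] %% 2 == 0) {
    v_loop[i] <- 1                   # even element
  } else {
    v_loop[i] <- 2                   # odd element
  }
}

v_vec <- ifelse(x %% 2 == 0, 1, 2)   # tests every element at once
```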