R essentials

Author: Philipp Sterzinger (p.sterzinger@lse.ac.uk)
Affiliation: London School of Economics and Political Science, ST227 - Winter term 2024

Published: 17 January 2025

Introduction

This note provides a pragmatic introduction to the programming language R. Its purpose is to provide you with all the tools that you need to tackle the R mini project and to serve as a reference for the R coursework. It is aimed at people who are not too familiar with R or need a quick refresher on key concepts. Even if you are familiar with R, do have a look at the note, and in particular at the last section, “Project submission”, which provides some guidance on the submission of your R coursework.

Overview

In this note you will learn:

  • What is R, and why bother
  • How to install R and RStudio
  • Basic data structures in R
  • R functions
  • Conditional statements and loops
  • How to load, manipulate, and visualise data
  • How to create an RMarkdown file for your R mini project

Approach

This note is designed to be a “hands-on” self-study tool. It introduces you to some concepts and gives example code snippets like:

print("This program is brought to you by caffeine and StackOverflow.")

You can copy these code snippets by clicking on the icon that appears on the right end of the code block when you hover over it with your cursor. Simply paste this into your R console to reproduce whatever is shown in this note. Sometimes, code snippets are hidden, like below:

code
print("Peek-a-boo!")

You can expand them by clicking on the icon. You are invited to come up with a solution yourself before looking at the example code. Sometimes, code examples will go beyond the basic concepts that we are introducing in the text. This is done solely for your viewing pleasure and to hopefully get you excited about R. Generally, this note does not constitute examinable material, but it should serve as a toolkit and reference to help bring you up to speed, and support you with the R workshop exercises and the R mini project.

Hopefully, the interactive nature of this note will pique your interest in the introduced concepts and motivate you to explore a bit further on your own. In my opinion, this is the best way to learn a new programming language.

The presentation is deliberately kept light and short and is thus by no means a comprehensive summary of R. Several excellent, far more detailed resources exist. If you are interested, have a look at the free ebook Advanced R by Hadley Wickham, or the official Introduction to R document, or just google; the internet is a vast and knowledgeable place.

What is R?

R is a free, open-source programming language primarily used for statistical computing, data analysis and data visualisation. It is available for a wide range of operating systems such as Linux, MacOS and Windows. Aside from the base functionality that comes with the standard R distribution, R provides a plethora of packages (\(> 20'000\) at the time of writing), most of which are hosted on the Comprehensive R Archive Network (CRAN). R packages are collections of functions, along with documentation and sometimes datasets, that expand R and make it a versatile and powerful tool for statistical computing and data analysis.

Why use R?

R is an excellent choice for data analysis and statistics due to its flexibility, its mature package ecosystem, and its prowess in data manipulation and data visualisation. Packages such as ggplot2 for visualisations and dplyr for data manipulation make R a powerful tool for statistical applications in academia and industry alike. Additionally, being open-source with a strong community (have a look at this slightly outdated post on R queries on StackExchange) means that R is continuously evolving with new tools and techniques, making it a go-to for reproducible research and cutting-edge data science.

Installing R and RStudio

Installing R

You can download the latest version of R on CRAN – make sure it is compatible with your operating system (OS) before installing. If you are using an older computer, you might have to download an older version of R from the archive. Just find your operating system in the archive and choose the R version that is right for you – if in doubt, google.

Running R from the command line interface

Once you have R installed, you can use R’s command line interface to interactively run code. This is the most minimalistic approach that may sometimes come in handy when you just want to try something quickly.

On MacOS and Linux you can do this by typing R into your terminal, or into your command prompt on Windows. This launches an R session, and you can now type your R commands into the window.

A screenshot of my terminal running R on Mac

If typing R in your terminal or command prompt does not launch R, it is likely that the system has not added R to the PATH. PATH is an environment variable that basically tells your OS where to look for the R executable. If your R installation is not added to the PATH, your computer cannot find it and the command fails.

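On MacOS and Linux, you can check which R executable your terminal finds by typing which R into your terminal. On my Mac, for example, this command yields /usr/local/bin/R, so the directory that has to be on the PATH is /usr/local/bin/. If which R prints nothing, R is not on your PATH, and you have to locate the installation yourself and add its directory to the PATH variable.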

On Windows, the R executable is typically located at C:\Program Files\R\R-x.x.x\bin. You can locate your R installation through Windows Explorer. Then, right-click on This PC (or My Computer) and select Properties. Click Advanced system settings > Environment Variables. Under System Variables, find the PATH variable, select it, and click Edit. Click New and add the path to the R bin folder (e.g., C:\Program Files\R\R-x.x.x\bin).

Warning

Before messing with your PATH variable, make sure that this is indeed the reason why you cannot run R. Google is your friend.
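If R itself starts (for example through its desktop shortcut) but your terminal cannot find it, you can ask the running session where it lives. The lines below are a small sketch using the base functions R.home(), Sys.which() and Sys.getenv(); the names r_bin_dir and path_dirs are just illustrative, and the printed values depend entirely on your machine, so no output is shown.

```r
# directory that contains the R executable of the running session
r_bin_dir <- R.home("bin")
r_bin_dir

# full path of the R executable that is found via PATH ("" if none is found)
Sys.which("R")

# the PATH variable, split into its individual directories
path_dirs <- strsplit(Sys.getenv("PATH"),
                      split = .Platform$path.sep,
                      fixed = TRUE)[[1]]
head(path_dirs)
```

If r_bin_dir does not appear among path_dirs, that is a hint that the PATH is indeed the culprit.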

Installing RStudio

While we can now run R from our command line interface, we can only do so line by line. This may quickly become tedious or confusing if we are working on larger or complex tasks. For this reason, we want to add an integrated development environment (IDE), which is essentially a software application that sits on top of R, to improve our workflow.

In this course we use RStudio, which is also free and is arguably standard in the R community. To install RStudio, visit their website and install the RStudio version that is right for your OS. Again, if you are using an older computer, you might have to install an older version. You will have to google which version is compatible with your version of R and OS.

The RStudio window

A screenshot of RStudio on Mac

Once you have launched RStudio, you will see that the RStudio window has three distinct panes, but we will add a fourth, namely a pane for our R scripts. To do so, click File > New File > R Script, or click on the icon, which will open a new R script in the top left of your RStudio window.

We now have the following four panes:

  • Top left: The source pane; In this pane we have an R script, in which we can write a series of R commands to accomplish a certain task.
  • Bottom left: The console pane; This is where we can interactively execute code.
  • Top right: The environment pane; In the environment tab, we can see our current R working environment with all saved R objects.
  • Bottom right: The output pane; This pane has different tabs such as Files, which we can use to navigate to and load R scripts or data, the Help tab, which provides some helpful resources and documentation, and the Plots tab, which displays plots from our current session.

Installing R packages

R packages are collections of functions and sometimes datasets, along with documentation. You can install any R package from CRAN by using the command install.packages("packagename"), where "packagename" should be replaced with the name of the package you want to install. For example, to install the ggplot2 package, we would type:

# installing an R package
install.packages("ggplot2")

In RStudio, you can alternatively install packages by navigating to the Packages tab in the output pane and clicking on the icon.

Exercise: Running your first R script

1. Open RStudio

Find the software RStudio on your computer and launch it.

2. Create a course folder

In the bottom right pane, click on the Files tab, navigate to a location of your choice, such as Desktop, and create a folder for this course, e.g. “ST227”, which will hold all the scripts and data that you create for this course. You can create the folder by clicking the icon.

3. Open an R script

Open a new script either by clicking File > New File > R Script, or clicking the icon and choosing R Script. Save it under an informative name, e.g. ST227_exercise_1.R, by clicking File > Save As… or by clicking the icon.

4. Run commands

We are now in a position to add our first R commands to our script. For example, we can produce a plot of normal random variables like so:

# histogram of normal random variables 
set.seed(1)
normal_vars <- rnorm(500) 
x <- seq(from = -6, to = 6, by = 0.01)
normal_density <- dnorm(x) 
hist(normal_vars, freq = FALSE, xlim = c(-4,4))
lines(x, normal_density, col = "red", lwd = 2)

Once you have this code in your script, you can run it either by pasting it into your console or by running the script directly. To do the latter, select the code that you wish to run and either press the icon or type Cmd-Enter (MacOS) or Ctrl-Enter (Windows). This should produce the plot below in your Plots tab in the bottom right pane.

If you are curious about what is happening in this example, feel free to investigate by typing ? followed by the command you are interested in into the console, e.g. 

# accessing documentation of a function  
? rnorm 

This will open the documentation of the function rnorm in the Help tab of your bottom right pane. You can access documentation like this for any R function, which is helpful if you already know its name. If you do not know which R function to use, you can type ?? followed by a search term, and R will scan the documentation of all installed packages for it. For example, you could type:

# scanning documentations for an expression 
?? histogram  

to scan the documentation of all your packages for the word histogram.
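Two further base R helpers are handy alongside ? and ??. The short sketch below uses args() to show the arguments of a function and apropos() to list objects by name; the variable name norm_functions is just for illustration.

```r
# `? rnorm` is shorthand for help("rnorm"),
# `?? histogram` is shorthand for help.search("histogram")

# show the arguments of a function without opening its help page
args(rnorm)

# list all objects on the search path whose name contains "norm"
norm_functions <- apropos("norm")
head(norm_functions)
```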

5. Quit the R session

Before quitting your session, make sure you have saved your work. You can quit your R session by typing:

q()  

into your console. This will prompt the following response:

Save workspace image? [y/n/c]:

If you answer y, all objects that you defined this session are available upon relaunch. Usually, it is best to answer n to get a clean workspace the next time. c allows you to return to your session.

Basic data structures in R

In this section, we will learn about the basic data structures that we are most likely to encounter in this course. If you want to dig deeper, have a look at this chapter from the book “Advanced R” by Hadley Wickham.

Names and values

Before we jump into data structures and how to manipulate them, we first need to understand how to create and call objects in R. The symbol for variable assignment in R is <- and the syntax is:

x <- c(1, 2, 3)

Under the hood, when we run the above command, R creates an object, in this case a vector containing the values \(1,2,3\), and binds this object to a name, x. Now, when we call x by typing it in the console, R looks for the object that is bound to the name and returns it:

x
[1] 1 2 3

You can also use the more conventional = symbol for variable assignment, however it is common practice to use the <- symbol for variable assignment and the = for named function arguments (more on this in the Functions & Control Flow section), and we too shall adhere to this.
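To make the convention concrete, here is a small sketch; the names heights and avg_height are made up for illustration.

```r
# `<-` binds the vector to the name `heights` in the workspace
heights <- c(1.62, 1.75, 1.80)

# inside a function call, `=` matches values to named arguments:
# `x` and `na.rm` are arguments of mean(), not new objects
avg_height <- mean(x = heights, na.rm = TRUE)

avg_height
```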

Vectors

The basic data structures in R are vectors. You can think of vectors as one-dimensional containers of data. Even when you think you work with single elements, such as a real number, you are working with a vector of length one. In R, there are two different types of vectors: atomic vectors and lists. Atomic vectors require all of their elements to be of the same type, while lists can contain elements of any type, even lists. Vectors have two common properties that are relevant to us:

  • typeof(v): This tells us what type the vector is
  • length(v): This tells us how many elements are stored in the vector

We will look at three common kinds of atomic vectors, namely numeric, character and logical vectors as well as lists.
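As a quick illustration of these two properties, the lines below apply typeof() and length() to a few small vectors; the comments give the value that R returns.

```r
# the type of a vector
typeof(c(1, 2, 3))        # "double"
typeof(1:3)               # "integer"
typeof(c("a", "b"))       # "character"
typeof(c(TRUE, FALSE))    # "logical"
typeof(list(1, "a"))      # "list"

# the length of a vector
length(1:3)               # 3
length(3.14)              # 1: a single number is a vector of length one
length(list(1:10, "a"))   # 2: the list has two elements
```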

Numeric Vectors

Numeric vectors are vectors that contain numbers, either integers or real numbers (called double in R).

The simplest way to create a numeric, or any other, vector is using the c() function, which combines individual values into a vector. Other ways to generate a vector that you may encounter are the seq() function, which creates a sequence of numbers, and the rep() function, which repeats values.

# creating numeric vectors 
x <- c(1, 2, 3) 
y <- seq(from = -10, to = 10, by = 2) 
z <- rep(x, times = 2, each = 3) 
x 
[1] 1 2 3
y 
 [1] -10  -8  -6  -4  -2   0   2   4   6   8  10
z 
 [1] 1 1 1 2 2 2 3 3 3 1 1 1 2 2 2 3 3 3

Vector arithmetic

Simple arithmetic in R works as we would expect:

# a simple calculator
1 + 2 
[1] 3
(3 * 4) - 7
[1] 5
2 * cos(2 * pi) - exp(1)^log(2)
[1] 2.220446e-16

You can check out ? Arithmetic or the R language documentation (Section 2) for more information on available operators. The last expression is mathematically zero, but R returns \(2.220446 \times 10^{-16}\), a very small number that is equal to R’s machine precision

# R's machine precision 
.Machine$double.eps
[1] 2.220446e-16

which is the distance between one and the next larger number that can be represented in R. Since results are rounded to the nearest representable number, this means that if \[0 \leq \delta < \texttt{.Machine\$double.eps} / 2\,,\] then \(1+\delta = 1\) in R:

# computer maths 
delta <- 0.999 * (.Machine$double.eps / 2) 
1 + delta == 1 
[1] TRUE

For our purposes, this means that results of the order of .Machine$double.eps, such as the one above, should be treated as zero. If you wish to understand this in more depth, have a look at the wiki page.
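A practical consequence is that == is not a reliable way to compare decimal numbers. The sketch below shows the classic example and two common workarounds using base R; the tolerance 1e-8 and the name close_enough are arbitrary choices for illustration.

```r
# 0.1, 0.2 and 0.3 cannot be represented exactly as doubles
0.1 + 0.2 == 0.3                     # FALSE

# compare up to a small tolerance instead
abs((0.1 + 0.2) - 0.3) < 1e-8        # TRUE

# all.equal() does this for us; wrap it in isTRUE() to get a clean TRUE/FALSE
close_enough <- isTRUE(all.equal(0.1 + 0.2, 0.3))
close_enough                         # TRUE
```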

On vectors, operations act elementwise, which is called vectorisation in computer science. For example:

# vectorised operations 
x <- c(1, 2, 3) 
x^x 
[1]  1  4 27

A peculiar behaviour of R is recycling. That is, if two vectors involved in an operation are not of the same length, the shorter vector will be repeated to match the length of the longer one.

# R's recycling behaviour 
(1:3) + (2:9) 
Warning in (1:3) + (2:9): longer object length is not a multiple of shorter
object length
[1]  3  5  7  6  8 10  9 11

If the length of the longer vector is a multiple of the length of the shorter, this is done without warning.

# recycling without warning 
(0:1)^(0:5) 
[1] 1 1 0 1 0 1
(0:5)^(0:2)
[1]  1  1  4  1  4 25

Warning

Generally, use of recycling is error-prone and is, in my opinion, best avoided. An exception is when the shorter element has length one.
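The harmless length-one case, and one way to guard against accidental recycling, could look as follows. The helper add_same_length is hypothetical and only meant as a sketch.

```r
# recycling a vector of length one is safe and very common
x <- c(1, 2, 3)
x * 2              # 2 4 6
x - mean(x)        # -1 0 1

# a hypothetical helper that refuses to recycle:
# stopifnot() throws an error if the lengths differ
add_same_length <- function(a, b) {
  stopifnot(length(a) == length(b))
  a + b
}

add_same_length(1:3, 4:6)    # 5 7 9
```

Calling add_same_length(1:3, 2:9) stops with an error instead of silently recycling.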

We conclude with a collection of useful functions for numeric vectors that you might come across. These are:

sum(x)
mean(x)   
var(x) 
range(x) 
length(x) 

and you can probably guess what they do, but as always you can check by typing ? sum etc. in your console.
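For a small vector, the results are easy to verify by hand:

```r
# summary functions on a small numeric vector
x <- c(2, 4, 6, 8)

sum(x)      # 20
mean(x)     # 5
var(x)      # 6.666667 (sample variance, i.e. divides by n - 1)
range(x)    # 2 8 (minimum and maximum)
length(x)   # 4
```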

Character Vectors

We said in the previous section that we can think of numeric vectors as containers that hold numeric elements. This poses the question whether we can populate vectors with other types of data, such as text – and the answer is yes!

The basic datatype in R that captures text is called character. We can define a character using " " or ' ' such as:

my_name <- "Peter Pan" 

my_name 
[1] "Peter Pan"

The construction of character vectors is very similar to numeric vectors.

my_family <- c("Peter Pan", "Captain Hook", "Tinkerbell") 

parrot <- rep("Patience, Iago, patience.", times = 3)

We can combine elements of character vectors using the paste function, which may be useful when we need to define variable names:

numbered_vars <- paste("x", 1:4, sep = "")

numbered_vars 
[1] "x1" "x2" "x3" "x4"
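A few further variations on paste that you may come across are sketched below: paste0() is a shortcut for sep = "", and the collapse argument glues all elements into a single string. The variable names are illustrative.

```r
# paste0() is paste() with sep = ""
short_names <- paste0("x", 1:4)
short_names     # "x1" "x2" "x3" "x4"

# the default separator of paste() is a single space
full_name <- paste("Peter", "Pan")
full_name       # "Peter Pan"

# collapse combines all elements into one string
one_string <- paste(c("a", "b", "c"), collapse = ", ")
one_string      # "a, b, c"
```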

Logical Vectors

We can also construct vectors containing a special class of datatypes that are either true or false, which are called logicals in R. These often arise from operations where we essentially ask whether a certain statement is true or false. For example:

# logical datatypes in the wild
x <- 1 
y <- 2 
x < y 
[1] TRUE
is.character(x) 
[1] FALSE

The symbols that we use to construct such true-or-false statements from comparisons are called comparison operators. These are:

  • ==: Check for equality
  • !=: Check if two elements are not equal
  • >, >=: Check whether one element is strictly greater than, or greater than or equal to, the other
  • <, <=: Check whether one element is strictly smaller than, or smaller than or equal to, the other

We can also chain together multiple such statements using so-called logical operators:

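  • |: Logical OR operator, acts elementwise on vectors
  • ||: Logical OR operator for single (length-one) values, evaluates expressions in order and stops if true
  • &: Logical AND operator, acts elementwise on vectors
  • &&: Logical AND operator for single (length-one) values, evaluates expressions in order and stops if false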

The functions any and all are useful to check whether any or all elements of a logical vector are true.

# chaining together comparisons using logical operators 
set.seed(1)
nums <- runif(10) 
extremes <- nums < 0.1 | nums > 0.9

any(extremes)
[1] TRUE
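To see these operators on numbers that are easy to check by eye, here is a small sketch with a hand-picked vector; the names v and is_extreme are illustrative.

```r
# elementwise comparisons and logical operators
v <- c(0.05, 0.5, 0.95)
is_extreme <- v < 0.1 | v > 0.9

is_extreme            # TRUE FALSE TRUE
!is_extreme           # FALSE TRUE FALSE (negation)
any(is_extreme)       # TRUE
all(is_extreme)       # FALSE
sum(is_extreme)       # 2, since TRUE counts as 1 and FALSE as 0
which(is_extreme)     # 1 3, the positions of the TRUE elements

# && works on single values and stops as soon as the result is known
length(v) > 0 && v[1] < 0.1    # TRUE
```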

Lists

Sometimes we may want to group data of different types and lengths into a single object. In R, we can do this with a list, which we can construct with the list function.

# do not try this at home 
a_bad_recipe <- list(
  dish_name = "Chocolate Lava Cake",             
  prep_time_minutes = 30,                       
  ingredients = c("dark chocolate", "butter", "flour", "sugar", "eggs"), 
  ingredient_quantities = c(200, 175, 30, 125, 3),
  calories_per_serving = 450.75,                 
  is_vegan = FALSE,                              
  steps = list(                                  
    step_1 = "Melt chocolate and butter",
    step_2 = "Mix in whisked egg + sugar, and flour",
    baking_time_minutes_min_max = c(12, 20),
    baking_temp_C = 180
  )
)

a_bad_recipe
$dish_name
[1] "Chocolate Lava Cake"

$prep_time_minutes
[1] 30

$ingredients
[1] "dark chocolate" "butter"         "flour"          "sugar"
[5] "eggs"

$ingredient_quantities
[1] 200 175  30 125   3

$calories_per_serving
[1] 450.75

$is_vegan
[1] FALSE

$steps
$steps$step_1
[1] "Melt chocolate and butter"

$steps$step_2
[1] "Mix in whisked egg + sugar, and flour"

$steps$baking_time_minutes_min_max
[1] 12 20

$steps$baking_temp_C
[1] 180

We can also construct lists without explicitly naming their components in the construction. If we wish, we can assign and change names using the names() function:

# Creating unnamed list and assigning names using `names()` 
another_list <- list(rnorm(10), letters, cars[1:10,])

names(another_list) <- c("Random numbers", "Letters", "Cars Data") 

another_list
$`Random numbers`
 [1] -0.8204684  0.4874291  0.7383247  0.5757814 -0.3053884  1.5117812
 [7]  0.3898432 -0.6212406 -2.2146999  1.1249309

$Letters
 [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
[20] "t" "u" "v" "w" "x" "y" "z"

$`Cars Data`
   speed dist
1      4    2
2      4   10
3      7    4
4      7   22
5      8   16
6      9   10
7     10   18
8     10   26
9     10   34
10    11   17

Matrices & Arrays

So far, we have only been concerned with one-dimensional data structures. We can extend these data structures to two or more dimensions in R using matrices and arrays. We construct matrices as follows:

# construction of a matrix 
a_matrix <- matrix(data = 1:12, nrow = 3, ncol = 4)

a_matrix 
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12

data here is the data that we want to populate our matrix with, whereas nrow and ncol specify the number of rows and columns, respectively. Note that R fills matrices in column-major order, meaning they are filled column-wise from left to right by default. If our data is too short to populate the whole matrix, R again uses recycling – caution is advised.

# recycling -- it's a dangerous world out there 
another_matrix <- matrix(data = 1:4, nrow = 3, ncol = 4) 

another_matrix
     [,1] [,2] [,3] [,4]
[1,]    1    4    3    2
[2,]    2    1    4    3
[3,]    3    2    1    4

You can also construct, or append, matrices by adding vectors or matrices column-wise or row-wise with cbind or rbind respectively.

# using cbind and rbind to construct and append matrices 
patched_matrix <- cbind(1:3, rep(10, 3), 
                        seq(from = .1, to = 10, length.out = 3), 
                        runif(3)) 

large_patched_matrix <- rbind(another_matrix, patched_matrix) 

large_patched_matrix
     [,1] [,2]  [,3]      [,4]
[1,]    1    4  3.00 2.0000000
[2,]    2    1  4.00 3.0000000
[3,]    3    2  1.00 4.0000000
[4,]    1   10  0.10 0.4820801
[5,]    2   10  5.05 0.5995658
[6,]    3   10 10.00 0.4935413

Matrices come with a lot of useful operators and functions, such as matrix multiplication %*%, the transpose t(), and solve, crossprod, qr, eigen and svd (see e.g. ? solve).
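As a brief sketch of three of these, the example below transposes a small matrix, multiplies it with itself and solves a linear system. The matrix A and vector b are made up for illustration.

```r
# a small 2 x 2 matrix (filled column-wise) and a right-hand side
A <- matrix(c(2, 1, 0, 3), nrow = 2, ncol = 2)
b <- c(1, 2)

t(A)        # transpose: rows and columns are swapped
A %*% A     # matrix product (A * A would multiply elementwise instead)

# solve the linear system A x = b
x <- solve(A, b)
x           # 0.5 0.5

# check the solution up to numerical precision
all.equal(as.vector(A %*% x), b)   # TRUE
```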

Arrays are extensions of matrices to arbitrary dimensions. We construct arrays as follows:

# constructing an array 
an_array <- array(data = 1:27, dim = c(3,3,3))  

where the dim argument specifies the number of elements in each dimension.
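To get a feel for how such an array is laid out, we can inspect its dimensions and pick out single elements with one index per dimension (indexing is covered in more detail in the section on accessing elements). Like matrices, arrays are filled with the first index running fastest.

```r
# inspecting a 3 x 3 x 3 array
an_array <- array(data = 1:27, dim = c(3, 3, 3))

dim(an_array)         # 3 3 3
an_array[2, 3, 1]     # 8: second row, third column, first slice
an_array[1, 1, 2]     # 10: first element of the second slice
an_array[, , 3]       # the third slice, a 3 x 3 matrix holding 19 to 27
```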

Warning

Matrices and arrays, unlike lists or dataframes, can only hold a single data type. R will try to convert datatypes to meet this requirement and if it succeeds, it will not error or warn you.

# changing data types of a matrix 
another_matrix[1,1] <- FALSE
another_matrix 
     [,1] [,2] [,3] [,4]
[1,]    0    4    3    2
[2,]    2    1    4    3
[3,]    3    2    1    4
another_matrix[3,4] <- my_name
another_matrix  
     [,1] [,2] [,3] [,4]       
[1,] "0"  "4"  "3"  "2"        
[2,] "2"  "1"  "4"  "3"        
[3,] "3"  "2"  "1"  "Peter Pan"
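The same silent conversion happens when we combine different types with c(). R converts to the most flexible type involved, in the order logical, integer, double, character, as this short sketch shows:

```r
# implicit type conversion when mixing types in an atomic vector
typeof(c(TRUE, 2L))       # "integer": TRUE becomes 1L
typeof(c(2L, 3.5))        # "double"
typeof(c(3.5, "a"))       # "character"

mixed <- c(TRUE, 2, "a")
mixed                     # "TRUE" "2" "a"
```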

Dataframes

A dataframe in R is a matrix-like object in which all elements of a column must have the same data type, while data types are allowed to vary between columns. Dataframes are arguably the most common data structure for data analysis in R. You can think of a dataframe as a list of equal-length vectors, which is what they secretly are. For example, consider the iris dataset that comes with R. It is a dataframe, but when we check its type, R returns:

# dataframes are lists 
typeof(iris)
[1] "list"

If we want to make sure that we are indeed dealing with a dataframe, we can use the attributes function. Attributes are just a list of metadata that any R object can have. For example:

# attributes of a dataframe 
attributes(iris)
$names
[1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"     

$class
[1] "data.frame"

$row.names
  [1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18
 [19]  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36
 [37]  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54
 [55]  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72
 [73]  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90
 [91]  91  92  93  94  95  96  97  98  99 100 101 102 103 104 105 106 107 108
[109] 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126
[127] 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144
[145] 145 146 147 148 149 150

We see that dataframes also have a names attribute. We can inspect, or change, the column names using names or colnames and the row names using rownames.
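So far we have only looked at a dataframe that ships with R. A minimal sketch of building one ourselves with data.frame() and renaming a column is given below; the dataframe people_df and its contents are made up for illustration.

```r
# constructing a small dataframe from equal-length vectors
people_df <- data.frame(
  name = c("Ada", "Ben", "Cleo"),
  age = c(36, 41, 29),
  is_student = c(FALSE, FALSE, TRUE),
  stringsAsFactors = FALSE   # keep text columns as character
)

class(people_df)             # "data.frame"
typeof(people_df)            # "list"
dim(people_df)               # 3 3 (rows, columns)
sapply(people_df, typeof)    # one type per column: character, double, logical

# renaming the second column
colnames(people_df)[2] <- "age_years"
names(people_df)             # "name" "age_years" "is_student"
```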

Accessing elements

We have now seen how we can construct and manipulate different data structures. Oftentimes, we want to access particular elements of our data. For this we have two different techniques, accessing by name and accessing by index.

Accessing by name

For objects whose components have been named, we can use these names to access objects.

Lists

In named lists, we access named elements using the syntax list_name$element_name. For example:

# accessing list elements by name 
a_bad_recipe$ingredients
[1] "dark chocolate" "butter"         "flour"          "sugar"         
[5] "eggs"          

This is equivalent to list_name[["element_name"]]:

# accessing list elements by name 
all(a_bad_recipe$ingredients == a_bad_recipe[["ingredients"]])
[1] TRUE

Note

list_name[["element_name"]] returns the element with name element_name, while list_name["element_name"] returns a list containing the element that binds to element_name.
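The difference between the two kinds of brackets is easiest to see on a small example; the list a_list is made up for illustration.

```r
# [[ ]] extracts an element, [ ] returns a (sub)list
a_list <- list(numbers = 1:3, word = "hello")

a_list[["numbers"]]             # 1 2 3
typeof(a_list[["numbers"]])     # "integer": the element itself
typeof(a_list["numbers"])       # "list": a list of length one
length(a_list["numbers"])       # 1

# [ ] can select several elements at once
names(a_list[c("word", "numbers")])   # "word" "numbers"
```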

Matrices

If matrices have either row or column names specified, we can access elements of such a matrix by name. The syntax is matrix_name["row_name", "column_name"]. If we want a whole row or column, we can access it by matrix_name["row_name",] and matrix_name[,"column_name"], respectively.

# accessing elements of a named matrix 
payoffs <- c("Both get 1yr", 
            "B gets 3yrs, A free", 
            "A gets 3yrs, B free", 
            "Both get 2 yrs") 
prisoners_dilemma <- matrix(payoffs, nrow = 2, ncol = 2)
colnames(prisoners_dilemma) <- c("B stays silent", "B testifies")
rownames(prisoners_dilemma) <- c("A stays silent", "A testifies")
prisoners_dilemma["A testifies", "B testifies"]
[1] "Both get 2 yrs"

Dataframes

Since dataframes have both list-like and matrix-like behaviour, we can use both the dataframe_name$column_name and the dataframe_name[, "column_name"] syntax to access their columns. If a dataframe has named rows, we can also use dataframe_name["row_name", "column_name"], just like for matrices.

# Access elements by row and column name 
pets <- c("cat", "dog", "brachiosaurus", "cactus") 
people <- c("Jana", "Jane", "June")
set.seed(1) 
pet_ratings <- as.data.frame(matrix(sample(1:10, 
                            size = length(pets) * length(people), 
                            replace = TRUE), 
                            nrow = length(people)))
rownames(pet_ratings) <- people 
colnames(pet_ratings) <- pets 

pet_ratings$dog
[1] 1 2 7
pet_ratings["June", c("cat", "cactus")]
     cat cactus
June   7     10

Accessing by index

Very often, our data objects are not named or we may want to access a selection of particular elements across different dimensions. We can do this by using indices.

For vectors, matrices, arrays and dataframes alike, we can do this with []. Let’s revisit the iris dataframe:

# select elements by numeric index 
i_1 <- c(4,1,3) 
i_2 <- 2:4 
iris[i_1, i_2] 
  Sepal.Width Petal.Length Petal.Width
4         3.1          1.5         0.2
1         3.5          1.4         0.2
3         3.2          1.3         0.2

The comma in [ , ] separates the indices for the different dimensions (? dim): for a matrix or dataframe, row indices go before the comma and column indices after it.

We can also select all elements except a selection using the - operator. If we want all elements across a particular dimension, we just leave it empty.

# select only the last 5 observations 
number_obs <- nrow(iris) 
all_but_last_five <- 1:(number_obs-5)
iris[-all_but_last_five,]  
    Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
146          6.7         3.0          5.2         2.3 virginica
147          6.3         2.5          5.0         1.9 virginica
148          6.5         3.0          5.2         2.0 virginica
149          6.2         3.4          5.4         2.3 virginica
150          5.9         3.0          5.1         1.8 virginica

Finally, we can also access elements using logical vectors as indices.

# Access elements using logical vectors 
janas_favourites <- pet_ratings["Jana",] >= 5

janas_favourites 
      cat   dog brachiosaurus cactus
Jana TRUE FALSE         FALSE   TRUE
pets[janas_favourites] 
[1] "cat"    "cactus"
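Logical indexing is most often used to filter the rows of a dataframe by a condition on one of its columns. A small sketch, using a made-up dataframe scores:

```r
# filtering rows with a logical vector
scores <- data.frame(
  student = c("A", "B", "C", "D"),
  mark = c(55, 72, 48, 81),
  stringsAsFactors = FALSE
)

passed <- scores$mark >= 50
passed                        # TRUE TRUE FALSE TRUE

# keep only rows where `passed` is TRUE, and all columns
scores[passed, ]

# conditions can be combined with & and |
scores[scores$mark >= 50 & scores$mark < 80, "student"]   # "A" "B"
```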

Loading data

There exist several ways to load different file formats. Here we will only look at two common ones: .csv files and .xlsx files. You can almost certainly load any other data format that you encounter; just google, StackOverflow is a great resource.

First, let us download a dataset from the internet. This is a fictional dataset from Andy Field’s Discovering Statistics Using R and RStudio, which investigates the effect of alcohol on perceived attractiveness.

# Loading fictional beer goggles dataset from internet
url <- "https://www.discovr.rocks/csv/goggles.csv"
file_name <- "goggles.csv"
download.file(url = url, destfile = file_name, quiet = TRUE)

This downloads the file into the current working directory, which you can check via getwd(). It is good practice to set the appropriate working directory at the beginning of an R script.

Next, we can load the file using the read.table function, where we specify the symbol that separates data, in our case “,”, since we are loading a csv (Comma Separated Values) file.

# Load dataset as dataframe
beer_goggles <- read.table("goggles.csv", sep = ",", header = TRUE) 

head(beer_goggles) 
      id   facetype alcohol attractiveness
1 vfnoxj Attractive Placebo              6
2 hqfxap Attractive Placebo              7
3 obicov Attractive Placebo              6
4 oobiyc Attractive Placebo              7
5 snafxn Attractive Placebo              6
6 vihqnn Attractive Placebo              5
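For csv files, base R also offers read.csv, a convenience wrapper around read.table with sep = "," and header = TRUE as defaults. The sketch below does a round trip without needing an internet connection: it writes a made-up dataframe to a temporary csv file and reads it back. The names toy_data, toy_file and toy_loaded are illustrative.

```r
# write a small dataframe to a temporary csv file and read it back
toy_data <- data.frame(
  id = c("a1", "b2", "c3"),
  score = c(6, 7, 5),
  stringsAsFactors = FALSE
)

toy_file <- tempfile(fileext = ".csv")
write.csv(toy_data, file = toy_file, row.names = FALSE)

toy_loaded <- read.csv(toy_file)
str(toy_loaded)       # structure of the loaded dataframe

unlink(toy_file)      # remove the temporary file again
```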

The code below investigates the influence of alcohol consumption on perceived attractiveness in this dataset.

# Compute mean and 25% and 75% quantile of attractiveness
library("dplyr") 

beer_goggles <- beer_goggles %>%
  group_by(facetype, alcohol) %>%
  summarise(
    attractive_mean = mean(attractiveness),           
    attractive_lq = quantile(attractiveness, 0.25),   
    attractive_uq = quantile(attractiveness, 0.75)   
  ) 

# Plot mean perceived attractiveness and 25% and 75% quantiles
library("ggplot2")

ggplot(beer_goggles, aes(x = alcohol, 
                        y = attractive_mean, 
                        color = facetype, 
                        group = facetype)) +
  geom_point(size = 3) +  
  geom_line() +          
  geom_errorbar(aes(ymin = attractive_lq, 
                    ymax = attractive_uq), 
                width = 0.5) +  
  labs(x = "Alcohol consumption", 
      y = "Perceived attractiveness") + 
  theme_minimal() 

To read Excel files, we shall use the readxl package, which we can install with install.packages("readxl"). This package also contains some example datasets, which we can use for our demonstration.

# Load .xlsx file using `readxl` 
library("readxl") 

data_location <- readxl_example("datasets.xlsx")

example_data <- as.data.frame(read_excel(data_location, sheet = 3))

head(example_data)
  weight      feed
1    179 horsebean
2    160 horsebean
3    136 horsebean
4    227 horsebean
5    217 horsebean
6    168 horsebean

We can now visualise the influence of different feed types on weight using boxplots by running the code below.

# plotting influence of feed type on weight 
ggplot(example_data, aes(x = feed, y = weight)) +
      geom_boxplot(fill = "lightblue", color = "black") +  
      labs(x = "Feed Category", 
          y = "Weight", 
          title = "Weight per Feed Category") +
      theme_minimal() 

Note

In the examples above, we have used the R packages dplyr and ggplot2. This is solely for illustration purposes and you are not expected to know how to work with these.

Exercise: Loading and manipulating a dataset

We start by opening RStudio and creating a new file, which we call exercise_2.R. In the script, we use the # symbol to add comments to our code. Before starting to work on our data, it is good practice to set the working directory and load all packages that we wish to work with in this script. You can check your current directory by typing getwd() and set it using the setwd() function. For example:

##########################################
## ST227 R Exercise 2: Data manipulation ##
##########################################

# set working directory 
my_path <- "" # replace with path to your ST227 folder here 
setwd(my_path) 

# load packages 
library("readxl")

# Problem 1: loading a dataset ...

1. Download a dataset

Go to whatdotheyknow and save the dataset FoI 5067 ii.xlsx in your ST227 folder. This dataset contains

predicted A-level, IB, and AP grades breakdown of students offered admissions from China (Mainland) by course.

Have a look at the data in Excel. How is it organised, are there column headers?

2. Load dataset

We want to load the second sheet of our dataset using the read_xlsx function from the readxl package and convert the loaded data into a dataframe.

code
# loading the dataset 
admission_data <- read_xlsx("FoI 5067 ii.xlsx", sheet = 2)

# convert to dataframe 
admission_data <- as.data.frame(admission_data)

# inspect first 5 rows 
head(admission_data, n = 5) 

In RStudio, you can also browse through the data by clicking on admission_data in your top right pane.

3. Manipulate data

Let us simplify the data a little. First, let us check which qualification types occur in this dataset and how often. We can use the unique and table functions for this.

code
# looking at unique qualification types 
types <- admission_data$"Qualification Type"

unique_types <- unique(types)

unique_types_perc <- table(types) / length(types)
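
As a minimal sketch of how unique and table behave, consider the toy vector below. The vector types_toy and its values are made up for illustration and are not taken from the admissions data.

```r
# toy data standing in for a character column such as "Qualification Type"
types_toy <- c("IB", "A-level", "IB", "AP", "IB", "A-level")

# distinct values, in order of first appearance
unique(types_toy)

# absolute counts per distinct value
counts <- table(types_toy)
counts

# relative frequencies; the same as counts / length(types_toy) here
shares <- prop.table(counts)
shares

# name of the most frequent value
most_common <- names(counts)[which.max(counts)]
most_common
```

Note that table drops missing values by default, so dividing by length(types) and using prop.table only agree when the column contains no NA entries.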

Let us focus on the “International Baccalaureate Diploma”, which makes up the majority of the dataset. That is, we only want rows for which “Qualification Type” is “International Baccalaureate Diploma”.

code
# subsetting data 
# (this relies on the first unique type being the International Baccalaureate Diploma)
indices <- types == unique(types)[1]

admission_subdata <- admission_data[indices, c(1,4)]

Note that the “Predicted Grade” column has type character, so we need to convert it to type double.

code
# converting a column of type `character` to type `double`. 
grades <- admission_subdata$"Predicted Grade"
typeof(grades[1])

admission_subdata$"Predicted Grade" <- as.double(grades) 
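
One thing worth checking after such a conversion is whether any entries could not be read as numbers, since these silently become NA (R only issues a warning). The sketch below illustrates this with a made-up vector grades_chr; the "n/a" entry is hypothetical and not taken from the admissions data.

```r
# a character vector of grades with one entry that is not a number
grades_chr <- c("45", "40", "n/a", "38.5")
typeof(grades_chr)

# entries that cannot be converted become NA; we silence the warning here
grades_num <- suppressWarnings(as.double(grades_chr))
typeof(grades_num)

# which entries failed to convert?
is.na(grades_num)

# many summary functions need na.rm = TRUE to skip missing values
mean_grade <- mean(grades_num, na.rm = TRUE)
mean_grade
```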

Now we can check the average predicted grade per programme. We can either do this one by one, subsetting the data and using the mean function, or more efficiently, using aggregate.

code
# computing mean predicted grade per programme 
aggregate(`Predicted Grade` ~ Programme, 
          data = admission_subdata, 
          FUN = mean)
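
To see that aggregate does the same as subsetting by hand, here is a small sketch on a made-up dataframe. The names toy, Programme and Grade are chosen for illustration only.

```r
# a toy dataframe with two groups
toy <- data.frame(
  Programme = c("A", "A", "B", "B", "B"),
  Grade = c(40, 44, 38, 42, 43)
)

# one by one: subset the rows of a group, then take the mean
mean_a_by_hand <- mean(toy$Grade[toy$Programme == "A"])
mean_a_by_hand

# all groups at once: variable of interest ~ grouping variable
group_means <- aggregate(Grade ~ Programme, data = toy, FUN = mean)
group_means
```

The result of aggregate is itself a dataframe with one row per group, so it can be subset or plotted like any other dataframe.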

That is a lot of programmes. Let us further simplify the data by only selecting programmes that have a substantial maths component. We can do this by specifying a collection of words that occur in maths-oriented programmes, and then searching the dataframe for these words using grepl.

code
# selecting only programmes with maths component 
maths_words <- c("Mathematics", "Finance", "Actuarial", "Economics")
pattern <- paste(maths_words, collapse = "|") 
maths_inds <- grepl(pattern, admission_subdata$Programme)

admission_maths <- admission_subdata[maths_inds, ]

aggregate(`Predicted Grade` ~ Programme, 
          data = admission_maths, 
          FUN = mean)
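
The pattern built with paste is a regular expression in which | means "or". A small sketch on made-up programme names (programmes and key_words are illustrative, not taken from the dataset) shows what grepl returns:

```r
# made-up programme names
programmes <- c("BSc in Economics",
                "BSc in History",
                "BSc in Actuarial Science",
                "LLB in Laws")

# collapse the key words into a single "or" pattern
key_words <- c("Economics", "Actuarial")
pattern <- paste(key_words, collapse = "|")
pattern

# grepl returns one logical value per element: TRUE if the pattern occurs
matches <- grepl(pattern, programmes)
matches

# logical vectors can be used directly for subsetting
programmes[matches]

# matching is case sensitive unless we ask otherwise
lower_case_match <- grepl("economics", programmes, ignore.case = TRUE)
```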

Finally, let us visualise the distribution of different grades for subjects that are maths oriented.

code
# Boxplot of predicted grades per mathsy programme 
par(mar = c(10, 4, 4, 1), cex.axis = 0.5)
boxplot(`Predicted Grade` ~ Programme, data = admission_maths,
        main = "Boxplot of Predicted Grade per Programme",
        xlab = "",
        ylab = "Predicted Grade",
        col = "lightblue", 
        border = "black", 
        las = 2)

4. Save the data and script

Let us now save our cleaned up data. A convenient way to store several R objects in a single file is the .RData format, which we write using the save function.

code
# save data as .RData
save(admission_data, 
    admission_subdata, 
    admission_maths, 
    file = "admission.RData") 

We can check whether this has worked by first deleting all objects in our workspace using rm and then re-loading the data using load.

code
# Removing all objects from workspace 
rm(list = ls()) 

# loading admission datasets 
load(file="admission.RData")    

# check data is loaded 
ls() 
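
If you would like to try the save and load round trip without touching your ST227 folder, the sketch below writes to a temporary file instead. The objects scores and tags are made up for illustration.

```r
# two small objects to store
scores <- c(1, 2, 3)
tags <- c("a", "b")

# a temporary file path, so that nothing in the working directory is changed
tmp_file <- tempfile(fileext = ".RData")
save(scores, tags, file = tmp_file)

# remove only the two objects (rm(list = ls()) would clear everything)
rm(scores, tags)

# load restores the objects under their original names
# and returns those names invisibly
restored_names <- load(tmp_file)
restored_names
scores
```

Because load recreates objects under the names they were saved with, it overwrites any object of the same name that already exists in the workspace.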

Functions & Control Flow

Functions

In R, functions are objects that take a set of inputs, and give some outputs based on a series of statements. We have already come across a number of functions, for example + or mean.

When we are working on a complex task, functions are a great way to simplify our program and to automate repetitive tasks. Compared to writing one large file with a series of consecutive commands, this makes our workflow more efficient, more readable and less error-prone.

In R, we can define a function as follows:

# defining a function 
function_name <- function(arguments){
  # function body 
}

We see that a function is comprised of the following components:

  • A function_name, which we use to call the function
  • An enumeration of function arguments, that we supply to the function
  • The function body, where we transform our inputs in some way

For example, we can define

# function to compute the square of a number 
square <- function(x){
  x_squared <- x * x 
  return(x_squared)
}

square(2) 
[1] 4

The return statement is optional in R. A function returns the value of the last expression evaluated in its body, so we could also simply place the object that we want to return at the end of the function body, e.g.

# function without explicit return statement 
square <- function(x){
  x * x 
}

We can construct functions with as many arguments as we want, all separated by a “,”. It is also possible to supply default values to some or all of our arguments. For example:

# define a function with three arguments 
chat <- function(my_name, your_name, controversial = FALSE){
  our_chat <- paste("Hello ",
                    your_name,
                    "! ", 
                    "I am ", 
                    my_name,
                    ". ",
                    sep = "")
  
  if(controversial){
    our_chat <- paste(our_chat, 
                      "I like pineapple on pizza.", 
                      sep = "")
  }else{
    our_chat <- paste(our_chat, 
                      "Would you like to see pictures of my cat?",
                      sep = "")
  }
  
  return(our_chat)
}

chat("Romeo", "Juliet") 
[1] "Hello Juliet! I am Romeo. Would you like to see pictures of my cat?"

Since the argument controversial has a default value, we do not need to supply it. If we wish to supply another value, we simply call:

chat("Romeo", "Juliet", TRUE) 
[1] "Hello Juliet! I am Romeo. I like pineapple on pizza."

We can also supply arguments by name, in which case the order in which they appear does not matter:

chat(controversial = FALSE, your_name = "Juliet", my_name = "Romeo")
[1] "Hello Juliet! I am Romeo. Would you like to see pictures of my cat?"
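
The same ideas carry over to numerical functions. The sketch below defines a hypothetical helper raise_to (not part of base R) with one default argument, and calls it by position and by name.

```r
# raise a number to a power; exponent defaults to 2
# the value of the last expression is returned, no return() needed
raise_to <- function(base, exponent = 2){
  base ^ exponent
}

# default value used for exponent
squared <- raise_to(3)

# arguments matched by position
cubed <- raise_to(3, 3)

# arguments matched by name, in any order
two_cubed <- raise_to(exponent = 3, base = 2)

c(squared, cubed, two_cubed)
```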

Control Flow Structures

Control Flow is a term from computer science that describes in which order the statements of a program are executed. We will look at two control flow structures, namely conditional statements and iterations.

Conditional statements, or if-else statements, take the form

# a basic if-else statement 
if(condition){
  # do something if condition == TRUE 
}else{
  # do something if condition == FALSE 
}

In the code above, R first evaluates the condition of the if statement. This condition can either be an explicitly defined logical variable or an expression like x > 2, but it must evaluate to a single logical value (TRUE or FALSE). Depending on whether the condition is true or not, R then evaluates the if(){} or the else{} block.
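
A common stumbling block is a condition that compares a whole vector, which yields one logical value per element rather than a single one. In recent versions of R (4.2.0 onwards, to the best of my knowledge) this is an error. A minimal sketch of how any and all reduce such a comparison to a single value, using a made-up vector x:

```r
x <- c(1, 5, 3)

# element-wise comparison: one logical value per element
x > 2

# any() is TRUE if at least one element satisfies the condition
if(any(x > 2)){
  some_large <- "at least one value exceeds 2"
}else{
  some_large <- "no value exceeds 2"
}

# all() is TRUE only if every element satisfies the condition
if(all(x > 2)){
  all_large <- "all values exceed 2"
}else{
  all_large <- "not all values exceed 2"
}

some_large
all_large
```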

We can also chain multiple if-else statements using else if(){} as follows:

# multiple conditional statements 

if(x > 0){
  # do something if x > 0
}else if(x == 0){
  # do something if x == 0
}else{
  # do something if neither x > 0 nor x == 0, i.e. x < 0 
}

or just have the if(){} statement without an else{}.

# A conditional statement without else-clause 
if(happy_and_know_it){
  print("Clap your hands")
}

We can also nest conditional statements like so:

# A nested conditional statement 
if(happy_and_know_it){
  if(really_want_to_show_it){
    print("Clap your hands")
  }
}
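
The templates above use placeholder variables, so they do not run on their own. As a runnable sketch, the hypothetical function describe_sign below chains the three cases for a single number.

```r
# classify a single number by its sign
describe_sign <- function(x){
  if(x > 0){
    "positive"
  }else if(x == 0){
    "zero"
  }else{
    "negative"
  }
}

describe_sign(3)

# if() handles one value at a time, so we apply the function
# to each element of a vector with sapply
signs <- sapply(c(-2, 0, 3), describe_sign)
signs
```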

The second control flow structure we consider is iteration, or looping. Iterations are programming structures that repeat a set of code until a specific condition is met. The two most common iterations are for-loops and while-loops.

A for-loop is used to iterate over values in a vector. Its basic structure is:

# basic structure of a for-loop 
for(item in vector){
  # do something 
}

For example, we can use it to compute the Fibonacci sequence, which is given by

\[ F_0 = 0, \quad F_1 = 1, \quad F_n = F_{n-1} + F_{n-2} \quad (n \geq 2) \]

as follows:

# compute first n values of fibonacci sequence 
fibonacci <- function(n){
  sequence <- rep(NaN, n) 

  for(i in seq_len(n)){ # seq_len(n) is empty for n = 0, unlike 1:n
    if(i == 1){
      sequence[i] <- 0
    }else if(i == 2){
      sequence[i] <- 1
    }else{
      sequence[i] <- sequence[i - 1] + sequence[i - 2] 
    }
  }
  return(sequence)
}

fibonacci(5) 
[1] 0 1 1 2 3
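
When looping over the positions of an existing vector, seq_along is a convenient way to generate the indices. The sketch below, on a made-up vector values, computes a running total and also shows why 1:n can be risky when n is zero.

```r
values <- c(3, 1, 4, 1, 5)

# pre-allocate the result vector, then fill it position by position
running_total <- numeric(length(values))
total <- 0
for(i in seq_along(values)){
  total <- total + values[i]
  running_total[i] <- total
}
running_total

# pitfall: 1:0 counts down and has two elements, seq_len(0) has none
length(1:0)
length(seq_len(0))
```

For this particular task the built-in cumsum function gives the same result without a loop; the loop is shown purely to illustrate the mechanics.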

In contrast to the for-loop, which iterates over a finite collection of elements and terminates thereafter, a while-loop keeps iterating for as long as a given condition holds.

It has the basic structure:

# basic structure of a while-loop 
while(condition){
  ## do something as long as condition == TRUE 
}
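
As a small runnable sketch, the loop below counts how often a number has to be halved until it drops below 1. The starting value of 100 is arbitrary.

```r
value <- 100
n_halvings <- 0

# the condition is checked before every pass through the loop body
while(value >= 1){
  value <- value / 2
  n_halvings <- n_halvings + 1
}

n_halvings
value
```

Here we do not know in advance how many iterations are needed, which is the typical situation in which a while-loop is more natural than a for-loop.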

R does not check whether a while-loop might run for eternity, so be careful to make sure that your loops terminate. The loop below, for example, would run forever.

# this stunt is performed by trained professionals, do not try this at home 
while(TRUE){
  print("I should have known better")
}
Warning

If you run the code above, you will have to forcibly stop execution. You can do this by pressing Esc in the RStudio console (or Ctrl + C if you run R in a terminal) or by clicking the icon in your output pane. If RStudio is frozen, you will have to force quit RStudio, which might cause you to lose unsaved work.

A more sensible illustration of a while-loop is provided by the function below, which approximates the golden ratio, the limit of consecutive quotients from the Fibonacci sequence:

\[\varphi = \lim_{n \to \infty} \frac{F_n}{F_{n-1}}\]

# compute the golden ratio 
golden_ratio <- function(error_tol = 1e-9, max_iter = 1000){
  iter_count <- 0
  ratios <- rep(0,2)
  fibonacci_last_two <- c(0,1) 

  error <- Inf 
  while(error > error_tol && iter_count < max_iter){
    new_fibonacci <- sum(fibonacci_last_two)
    new_ratio <- new_fibonacci / fibonacci_last_two[1]

    ratios[2] <- ratios[1] 
    ratios[1] <- new_ratio 

    fibonacci_last_two[2] <- fibonacci_last_two[1] 
    fibonacci_last_two[1] <- new_fibonacci 

    error <- abs(diff(ratios))
    iter_count <- iter_count + 1     
  }
  return(c(ratios[1], iter_count)) 
}

golden_ratio() 
[1]  1.618034 25.000000
Note

Note that we have added an iteration counter to ensure that the while-loop terminates eventually. Even though we know that consecutive quotients from the Fibonacci sequence converge, including an iteration counter is good practice, as there are no guarantees that our implementation is correct.
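
One possible refinement, sketched below, is to also report whether the loop stopped because the tolerance was reached or because the iteration cap was hit. The function approx_phi is a hypothetical variant written for this illustration, not the implementation above; it is checked against the known closed form \((1 + \sqrt{5})/2\) of the golden ratio.

```r
# approximate the golden ratio and report why the loop stopped
approx_phi <- function(error_tol = 1e-9, max_iter = 1000){
  previous <- 1              # F_1
  current <- 1               # F_2
  ratio <- current / previous
  error <- Inf
  iter_count <- 0

  while(error > error_tol && iter_count < max_iter){
    next_value <- previous + current
    new_ratio <- next_value / current

    error <- abs(new_ratio - ratio)
    ratio <- new_ratio

    previous <- current
    current <- next_value
    iter_count <- iter_count + 1
  }

  list(value = ratio,
       iterations = iter_count,
       converged = error <= error_tol)
}

result <- approx_phi()
result$converged
abs(result$value - (1 + sqrt(5)) / 2)

# with a very small cap, the loop stops before reaching the tolerance
capped <- approx_phi(max_iter = 3)
capped$converged
```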

Data Visualisation

In this section we will look at some basic ways to visualise data using R. One of the appeals of R is its prowess in producing high-quality plots with very simple code. One could go as far as to say that plots made with R’s ggplot2 package have become the gold standard for statistical analysis and research. If you intend to continue working with R in your academic or professional future, I highly encourage you to familiarise yourself with this package.

In this note, we content ourselves with the plotting functionalities of base R. The plotting functionality of base R is still highly versatile and a lot simpler than ggplot2.

Line and scatter plots

Lines and scatter plots are produced using R’s plot function. The basic inputs are:

  • The \(x\) and \(y\) coordinates of the points that we wish to plot. If we only supply one vector, then R plots these values as \(y\)-coordinates against the vector indices.
  • An optional type argument which specifies the type of the plot that we want to create. “p” (points, default option) creates a scatter plot, “l” creates a line plot and “o” overlays the points onto the lines. Other types are available, but these are the most useful.
  • An optional col argument, which specifies the colouring of the plot
  • Optional xlab, ylab, main arguments which specify labels for the \(x\)-axis, \(y\)-axis, and the main title respectively.

You are encouraged to have a look at the plot documentation to check out other features. If there is a particular plot that you want to achieve, just check the documentation or search online; most likely it is possible and has already been done before.

Let us look at some plots now. We start with a simple scatter plot.

# A simple scatter plot 
set.seed(1)
xs <- rnorm(100)

plot(xs)

Here, we have only supplied one set of values, so R plots the values against the indices of the supplied vector.

We can also plot two vectors against each other, for example in a QQ-plot.

# A simple QQ-plot 
n <- 200
set.seed(1)
xs <- rnorm(n)

sorted_xs <- sort(xs) 
quantile_positions <- (1:n) / (n + 1) 
theoretical_quantiles <- qnorm(quantile_positions)

plot(theoretical_quantiles, sorted_xs, 
    xlab = "Theoretical quantiles",
    ylab = "Sample quantiles",
    main = "Normal QQ-plot")
lines(theoretical_quantiles, theoretical_quantiles, 
      col = "red", 
      lty = 2, 
      lwd = 2)

Here we added a \(45\)-degree line to an existing plot using the lines function. You can similarly add points using the points function.

Let us now make a line plot. The basic syntax is plot(x, y, type = "l"). To show you what is possible, and for your later reference, I have added a custom axis and a legend to the plot.

# A line plot 

xs <- seq(from = 0, to = 2 * pi, length.out = 1000) 
cosins <- cos(xs) 
sins <- sin(xs) 

xticks <- seq(from = 0, to = 2 * pi, by = pi/2) 
xlabs <- c(0, 
          expression(pi/2), 
          expression(pi), 
          expression(3*pi/2), 
          expression(2*pi))

plot(xs, cosins, 
    type = "l", 
    xlab = "x",
    ylab = "y",
    main = "Plot of the sine and cosine function", 
    xaxt = "n")
lines(xs, sins, col = 3, lty = 2)
axis(1, at = xticks, labels = xlabs)
legend("right", 
      legend = c("cos(x)", "sin(x)"), 
      col = c(1,3), 
      lty = c(1, 2), 
      cex = .5)

Histograms and density plots

We can plot a histogram with the hist function. The basic syntax is hist(x). The two most useful optional arguments are breaks and freq: breaks sets the number of bins and, if desired, their positions, while freq specifies whether the histogram depicts counts (TRUE) or densities (FALSE).

Rather than plotting a histogram to depict the empirical distribution of a random variable, we may sometimes want to plot a smooth estimate of its density. We can do this using the density function in R, which provides a kernel density estimate.

As before, we can overlay other line plots and even histograms over an existing histogram plot. Below is an example.

# A histogram 
set.seed(1) 
xs <- rnorm(100) 
smoothed_xs <- density(xs) 
x_grid <- seq(from = -4, to = 4, length.out = 1000)
densities <- dnorm(x_grid) 

hist(xs, 
    col = "lightblue", 
    freq = FALSE, 
    xlab = "x", 
    main = "Histogram with smoothed and theoretical density")
lines(smoothed_xs, col = "black", lwd = 2)
lines(x_grid, densities, col = "magenta", lwd = 2)
legend("topright", 
      legend = c("Histogram", 
                "Smoothed density", 
                "Theoretical density"), 
      col = c("lightblue", "black", "magenta"), 
      pch = c(15, NA, NA),
      lty = c(NA, 1, 1),
      lwd = c(NA, 2, 2),
      cex = .5)
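
To see what breaks and freq = FALSE do numerically, we can ask hist to return its computations without drawing anything by setting plot = FALSE. The sketch below uses explicit bin edges from -4 to 4 in steps of 0.5, an arbitrary choice for this illustration.

```r
set.seed(1)
xs <- rnorm(100)

# explicit bin edges; plot = FALSE returns the histogram object only
h <- hist(xs, breaks = seq(from = -4, to = 4, by = 0.5), plot = FALSE)

# counts per bin add up to the number of observations
total_count <- sum(h$counts)
total_count

# on the density scale, the areas of the bars add up to one
total_area <- sum(h$density * diff(h$breaks))
total_area

# the kernel density estimate also integrates to (approximately) one
kde <- density(xs)
grid_step <- diff(kde$x)[1]
kde_area <- sum(kde$y) * grid_step
kde_area
```

This is why the histogram has to be drawn with freq = FALSE before overlaying a density curve: only then are the bars and the curve on the same scale.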

Box plots

Another option to depict the distribution of a random variable is via a boxplot. In R the basic syntax for a boxplot of a single variable is boxplot(x).

What is really useful in R’s boxplot function is that we can simultaneously produce boxplots across groups from a dataframe. We have already come across a boxplot in our exercise on data manipulation. Recall the dataset admission_maths:

load(file="admission.RData")   
head(admission_maths, n = 5) 
                  Programme Predicted Grade
13 BSc in Actuarial Science              40
15         BSc in Economics              45
16         BSc in Economics              45
17         BSc in Economics              44
18         BSc in Economics              45

We can now produce a boxplot of the grade distribution per Programme using a so-called formula, which consists of two variable names from our dataframe, separated by a ~ symbol. In our case this is `Predicted Grade` ~ Programme. The first component specifies the variable of interest, the second the grouping variable. We had to enclose the variable Predicted Grade in backticks (``) as it contains a space character. Below is the full boxplot:

# A grouped boxplot 
par(mar = c(10, 4, 4, 1), cex.axis = 0.5)
boxplot(`Predicted Grade` ~ Programme, data = admission_maths,
        main = "Boxplot of Predicted Grade per Programme",
        xlab = "",
        ylab = "Predicted Grade",
        col = "lightblue", 
        border = "black", 
        las = 2)

The par function above allows one to set graphical parameters, such as the margins of a plot, which we had to change to accommodate the large labels of our Programme variable.
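
Settings changed with par stay in place for all subsequent plots on the same graphics device. A common idiom, sketched below, is to store the old values that par returns and restore them afterwards. The call pdf(NULL) merely opens a device that writes no file, so that the sketch runs anywhere; in RStudio you would not need it.

```r
# open a graphics device that produces no file (illustration only)
pdf(NULL)

default_mar <- par("mar")

# par() returns the previous values of the parameters it changes
old_par <- par(mar = c(10, 4, 4, 1), cex.axis = 0.5)
mar_changed <- par("mar")

# ... plotting commands with the wide bottom margin would go here ...

# restore the previous settings
par(old_par)
mar_restored <- par("mar")

dev.off()
```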

Exporting graphics from R

To export graphics in R, we can navigate to the output pane, select the Plots tab and then click on the icon.

Alternatively, we can run the following series of commands.

Before the plot, we run:

# exporting plots: before plotting 
pdf(file = my_path,  
    width = 4, 
    height = 4) 

where my_path should point to the directory where you want to save the plot, and include the name of the plot and the file type suffix. For example, this could be “/Users/LSE/ST227/a_plot.pdf”. The width and height arguments are optional and specify the width and height in inches, respectively. To export jpg or png files instead, use the jpeg or png functions in place of pdf; note that their width and height are measured in pixels by default.

Next we run our plot, and any other low level plotting commands, such as modifying the axes, adding a legend or a line etc.

# exporting plots: run your plots 
hist(rnorm(100), freq = FALSE)
lines(seq(-4,4,0.01), dnorm(seq(-4,4,0.01)))

Finally, after plotting we run

# exporting plots: after plotting 
dev.off()
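
Put together, the three steps might look like the sketch below. It uses pdf and writes to R's temporary directory so that it can be run anywhere; the file name a_plot.pdf is just an example, and in practice you would point plot_file to your ST227 folder.

```r
# step 1: open the file device (width and height in inches for pdf)
plot_file <- file.path(tempdir(), "a_plot.pdf")
pdf(file = plot_file, width = 4, height = 4)

# step 2: run the plot and any low level plotting commands
set.seed(1)
hist(rnorm(100), freq = FALSE)
lines(seq(-4, 4, 0.01), dnorm(seq(-4, 4, 0.01)))

# step 3: close the device; only now is the file written completely
dev.off()

# check that the file exists
file.exists(plot_file)
```

If dev.off() is forgotten, subsequent plots keep being sent to the file rather than to the Plots tab, which is a frequent source of confusion.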

Project submission

For the R mini project, you will be asked to produce a pdf file of your work. This file will contain:

  • code that you used to compute your answers
  • plots
  • plain text explanations

At first sight, the simplest way to do this is to write an R script with everything that you need to solve the problems and generate plots, and then copy and paste code and pictures into a Word file, which you save as pdf. While there is nothing inherently wrong with this approach, it is somewhat sloppy and time-consuming.

A very simple way to combine code, output and plain text, is via RMarkdown. RMarkdown is a file format that allows one to easily combine plain text with embedded R code and the resulting output.

It is written in markdown, an easy-to-write plain text format, but don’t worry. To the extent that we will be using RMarkdown it will feel just like writing normal text.

Creating a RMarkdown file

To create a new RMarkdown file in RStudio, we simply click File > New File > RMarkdown or click the icon and then select RMarkdown. This will open the window in Figure 1.

Figure 1: Creating a new RMarkdown file in RStudio

Here, we can name our file, add an Author name, and set the date.

Important

For the mini project, submissions are anonymised, so do not write your name in the author field. The file name should be your candidate number followed by “_ST227_Rproject.Rmd”.

We then select pdf as our output format, which should open a file like this:

RStudio with a new RMarkdown file

To render RMarkdown documents, we need to install some packages. Simply paste

install.packages("rmarkdown")
install.packages("knitr")
install.packages("pandoc")

into your Console to do so.

Note

You only need to install these packages once. The install commands need not be included in your markdown document.

Next, before rendering, we need to save our document. To render, we then click on the icon or click File > Knit Document. This should create a pdf file with some default output, as illustrated below:

Figure 2: A default RMarkdown pdf

Now that we know how to render a markdown document let’s have a closer look at the actual RMarkdown file. At the top of the file you will see what is called a YAML header, that looks like this

---
title: "R mini project"
author: "your candidate number"
output: pdf_document
date: "2024-09-24"
---

This header contains metadata that describes the document, such as the title, author, output format, and other settings. For our purposes, we simply choose a suitable title, add the candidate number as our name (or delete author altogether), and set output to pdf_document.

Immediately below, you will find the code chunk

```{r setup, include=FALSE}
  knitr::opts_chunk$set(echo = TRUE)
``` 

which is an initial setup chunk used to configure global options for the document and ensures that the code is displayed in the output (i.e., printed alongside the output of the code). We leave this unchanged.

Markdown basics

We can now populate our RMarkdown document with text, such as explanations or interpretations of our code, as well as code chunks. Let’s have a look at these in turn.

Text

We can simply write plain text in our RMarkdown document and it will be rendered in the pdf.

Headers

To format our output we might want to add headers to our document. Headers consist of one to six # symbols followed by the header text. For example,

  # This is a level one header 

  ## This is a level two header 

renders:

This is a level one header

This is a level two header

Italic text is achieved by enclosing the text to be emphasised in *, like *this is italic* and bold text is achieved by enclosing text in **, like **this is a bold statement**.

Lists

You can generate numbered and unnumbered lists like

A numbered list: 

1. First item 
2. Second item 


An unnumbered list: 

- Some item 
- Another item 

are displayed as:

A numbered list:

  1. First item
  2. Second item

An unnumbered list:

  • Some item
  • Another item

LaTeX

Finally, if you are familiar with LaTeX, you can easily include this in your markdown document too. Anything between two $ characters is understood as TeX math and anything between two $$ is understood as display math. For example:

Numbers that are greater than $10'000$ in absolute value scare me. Hence, I avoid looking at the set:

$$S = \left\{x \in \Re: \sqrt{|x|}> 100 \right\}$$

close to bedtime. 

would render:

Numbers that are greater than \(10'000\) in absolute value scare me. Hence, I avoid looking at the set: \[S = \left\{x \in \Re: \sqrt{|x|}> 100 \right\}\] close to bedtime.

And that is all the markdown syntax that we will need. If you want to know more, have a look at Pandoc’s markdown documentation or Quarto’s markdown basics.

Code chunks

The whole point of RMarkdown is that we can easily embed code and code output into our document. This is done via so-called code chunks, which look like this:

```{r chunk-label, chunk-options} 
  #some R code 
``` 

Code chunks have the following components:

  • Each chunk of code is enclosed in ```{r} ``` which specifies the beginning and end of the code block and the programming language of said code
  • The chunk-label is optional, but helps with error tracking and other useful things that are beyond our requirements
  • The chunk-options allow us to specify how each code chunk behaves in terms of execution, output and appearance.

The most relevant chunk-options for us are:

  • eval = TRUE/FALSE: Determines whether the code chunk should be evaluated (run). If FALSE, the code is shown but not executed.
  • echo = TRUE/FALSE: Controls whether the code itself is displayed in the document. If FALSE, the code is hidden, but the results are shown.
  • include = TRUE/FALSE: If FALSE, both the code and its output are hidden, but the code is still run.
  • warning = TRUE/FALSE: If FALSE, any warnings generated by the chunk are suppressed.
  • message = TRUE/FALSE: If FALSE, messages produced by the code are not displayed.
  • error = TRUE/FALSE: If FALSE, an error in the chunk stops the document from knitting. If TRUE, errors are shown in the output but do not stop the document from knitting.
  • fig.width / fig.height: Controls the width and height (in inches) of plots generated by the chunk.

You can check out other chunk options in the knitr documentation.

Exercise: Creating a mock submission

On November 27th, you will be asked to submit a mock R mini project on Moodle. This submission will not be graded; its purpose is solely to help you familiarise yourself with the submission process ahead of your actual mini project.

1. Download Data

Go to the moodle page for ST227 and click on Introduction to R and R Mock Submission. Therein you should find a dataset called ST227 R mock submission data and a pdf file called ST227 R Mock Submission, which contains the same questions as this exercise.

Download the data into your designated ST227 folder.

2. Creating a RMarkdown document

Although you are free to choose whichever way you want to combine code, text, plots, and other outputs in your R mini project submission, we will do it via an RMarkdown document here. This is the most elegant and, in my opinion, simplest way to do so.

In RStudio, we click File > New File > RMarkdown and, in the pop-up window of Figure 1, set ST227 R mock submission as title, your candidate number as author, and choose pdf as output format.

Delete everything below the code chunk

```{r setup, include=FALSE}
  knitr::opts_chunk$set(echo = TRUE)
``` 

Save the document in your ST227 folder. The file name should be your candidate number followed by “_ST227_R_mock_submission.Rmd”. Make sure your document renders as expected by clicking File > Knit Document.

3. Loading and inspecting data

We want to load the data using the readxl package. For this, we create a new code chunk

code
```{r loading-data}
# load package 
library("readxl") 

# load data 
unknown_data <- read_xlsx(path = "ST227_R_mock_submission_data.xlsx")

# convert to dataframe 
unknown_data <- as.data.frame(unknown_data)

# print first five observations 
head(unknown_data, n = 5)
```

4. Data analysis and plotting

Here, we are dealing with a dataset that has information on the number of awarded science degrees and the relative search volume for avocado toast per year. You can find more info on this dataset here if you are interested.

Let us test for correlation between science_degrees and avocado_toast_searches then:

code
```{r testing-correlation}
  # Pearson's test for correlation 
  with(unknown_data, 
      cor.test(science_degrees, avocado_toast_searches,
      method = "pearson")
  )
```
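
The object returned by cor.test can also be stored, which makes it easy to report individual quantities in the text. The sketch below runs on simulated data rather than the mock submission dataset; the variables x and y are made up, and y is constructed to depend on x.

```r
# simulated data with a built-in linear relationship
set.seed(1)
x <- rnorm(50)
y <- 2 * x + rnorm(50)

# store the test result instead of only printing it
test_result <- cor.test(x, y, method = "pearson")

# individual components of the result
test_result$estimate   # sample correlation coefficient
test_result$p.value    # p-value of the test of zero correlation
test_result$conf.int   # confidence interval for the correlation
```

Keep in mind that a significant correlation between two variables does not by itself tell us that one causes the other.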

Let us plot the data. This plot is a bit more involved and solely for your viewing pleasure; you are not expected to produce such plots in ST227.

code
```{r plotting-data, fig.height=6, fig.width=8}
  # plotting science degrees and relative avocado toast searches 
  with(unknown_data, {
    par(mar = c(5, 4, 4, 4) + 0.3) 
    plot(year, science_degrees, 
        type = "o", 
        ylab = "Awarded science degrees", 
        xlab = "Year", 
        main = "Awarded science degrees and Avocado toast searches", 
        lwd = 2, 
        bty = "n", 
        xaxt = "n") 
    par(new = TRUE)
    plot(year, avocado_toast_searches, 
        type = "o", 
        axes = FALSE, 
        bty = "n", 
        xlab = "", 
        ylab = "", 
        col = "red", 
        lty = 2, 
        lwd = 2)
    axis(side=4, 
        at = pretty(range(avocado_toast_searches)), 
        col.axis = "red")
    mtext("Rel. search volume", 
          side = 4, 
          line = 3, 
          col = "red")
    axis(1, at = year, labels = year)
    mtext("Year", side = 1, line = 3)
    }
  )
```

5. Submission

Feel free to add other analyses or plots to this document. Once you are happy with your submission, render the file. You should now have a pdf file named <your_candidate_number>_ST227_R_mock_submission.pdf. Go to Moodle, navigate to the R mock submission, and submit your file in the designated submission window. Well done! I hope you see the advantages RMarkdown can offer compared to copy pasting outputs into a Word document.