Basics of R Programming Language

R!

R is a powerful programming language and open-source software widely used for statistical computing and data analysis. It was developed by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand. R has gained popularity among statisticians, data scientists, researchers, and analysts for its flexibility, extensibility, and robust statistical capabilities.

Why learn R?

Here are several compelling reasons to consider learning R:

  • Statistical Analysis
  • Data Visualization
  • Open Source
  • Community Support
  • Extensibility
  • Integration with Other Languages
  • Data Science and Machine Learning
  • Widely Used in Academia and Industry
  • Continuous Development

R vs Python

  • R and Python are the two most popular programming languages used by data analysts and data scientists. Both are free and open source.
  • Python is a general-purpose programming language, while R is a statistical programming language.

Getting Started with R

To begin working with R, users typically install an Integrated Development Environment (IDE) such as RStudio, which provides a user-friendly interface for coding, debugging, and visualizing results. R scripts are written in the R language and can be executed interactively or saved for later use.

A look around RStudio

Open RStudio. You will see four windows (aka panes). Each window has a different function. The screenshot below shows an analogy linking the different RStudio windows to cooking.

Console Pane

On the left-hand side, you’ll find the console. This is where you can input commands (code that R can interpret), and the responses to your commands, known as output, are displayed here. While the console is handy for experimenting with code, it doesn’t save any of your entered commands. Therefore, relying exclusively on the console is not recommended.

History Pane

The history pane (located in the top right window) maintains a record of the commands that you have executed in the R console during your current R session. This includes both correct and incorrect commands.

You can navigate through your command history using the up and down arrow keys in the console. This allows you to quickly recall and re-run previous commands without retyping them.

Environment Pane

The environment pane (located in the top right window) provides an overview of the objects (variables, data frames, etc.) that currently exist in your R session. It displays the names, types, dimensions, and some content of these objects. This allows you to monitor the state of your workspace in real-time.

Plotting Pane

The plotting pane (located in the bottom right window) is where graphical output, such as plots and charts, is displayed when you create visualizations in R. The Plotting pane often includes tools for zooming, panning, and exporting plots, providing additional functionality for exploring and customizing your visualizations.

Help Pane

The help pane (located in the bottom right window) is a valuable resource for accessing documentation and information about R functions, packages, and commands. When you type a function or command in the console and press the F1 key (Mac: fn + F1) the Help pane displays relevant documentation. Additionally, you can type a keyword in the text box at the top right corner of the Help Pane.

Files Pane

The files pane provides a file browser and file management interface within RStudio. It allows you to navigate through your project directories, view files, and manage your file system.

Packages Pane

This pane provides a user-friendly interface for managing R packages. It lists installed packages and allows you to load, unload, update, and install packages.

Viewer Pane

It is used to display dynamic content generated by R, such as HTML, Shiny applications, or interactive visualizations.

Working directory

Opening an RStudio session launches it from a specific location. This is the working directory. By default, R looks in the working directory when reading in data and saving files. You can find out what the working directory is by using the command getwd(). This shows you the path to your working directory in the console. On a Mac this has the format /path/to/working/directory, and on Windows C:/path/to/working/directory (R displays Windows paths with forward slashes). It is often useful to keep your data and R scripts in the same directory and set this as your working directory. We will do this now.

Make a folder for this course somewhere on your computer that you will be able to easily find. Name the folder for example, Intro_R_course. Then, to set this folder as your working directory:

In RStudio click on the Files tab and then click on the three dots, as shown below.

In the window that appears, find the folder you created (e.g. Intro_R_course), click on it, then click Open. The files tab will now show the contents of your new folder. Click on More → Set As Working Directory, as shown below.

Note: You can use an RStudio project as described here to automatically keep track of and set the working directory.
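The same steps can also be done with code. The following is a minimal sketch, assuming a demonstration folder created under R's temporary directory; the names course_dir and original_wd are made up for illustration, and in practice you would point setwd() at your own course folder.

```r
# Remember where the session started so we can return there afterwards
original_wd <- getwd()

# Hypothetical course folder, created in a temporary location for this demo
course_dir <- file.path(tempdir(), "Intro_R_course")
dir.create(course_dir, showWarnings = FALSE)

# Set the folder as the working directory and confirm the change
setwd(course_dir)
getwd()

# Go back to the original working directory
setwd(original_wd)
```

Using setwd() inside a script ties the script to one computer's folder layout, which is one reason RStudio projects are often preferred.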

R Scripts

In RStudio, the Script pane (located at the top left window) serves as a dedicated space for writing, editing, and executing R scripts. It is where you compose and organize your R code, making it an essential area for creating reproducible and well-documented analyses.

RStudio provides syntax highlighting in the Script pane, making it easier to identify different components of your code. You can execute individual lines or selections of code from the Script pane. This helps in testing and debugging code without running the entire script.

Open a New R Script

Navigate to File → New File → R Script. A new pane will appear in the top-left corner. Save this blank text file as ‘Week_1_tidyverse.R’ in your current working directory (e.g. Intro_R_course).

Comments

In R, any text following the hash symbol # is termed a comment. R disregards this text, considering it non-executable. Comments serve the purpose of documenting your code, aiding your future understanding of specific lines, and highlighting the intentions or challenges encountered.

RStudio makes it easy to comment or uncomment code: select the lines you want to comment (to comment a set of lines) or place the cursor anywhere on a line (to comment a single line), then press Cmd + Shift + C (Mac) or Ctrl + Shift + C (Windows/Linux).

Extensive use of comments is encouraged throughout this course.

# This is a comment. Ignored by R. But useful for me!

Executing Commands

Executing commands or running code is the process of submitting a command to your computer, which does some computation and returns an answer. In RStudio, there are several ways to execute commands:

  • Select the line(s) of code using the mouse, and then click Run at the top right corner of the R text file.
  • Select Run Lines from the Code menu.
  • Click anywhere on the line of code and click Run.
  • Select the line(s) you want to run. Press Cmd + Return (Mac) or Ctrl + Enter (Windows/Linux) to run the selected code.

We suggest the last option (the keyboard shortcut), which is fastest. This link provides a list of useful RStudio keyboard shortcuts that can be beneficial when coding and navigating the RStudio IDE.

When you type in, and then run the commands shown in the grey boxes below, you should see the result in the Console pane at bottom left.

Simple Maths in R

We can use R as a calculator to do simple maths.

3 + 5
[1] 8
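Beyond addition, R supports the usual arithmetic operators. A short sketch with made-up numbers (the values are illustrative only):

```r
7 - 2        # subtraction: 5
4 * 6        # multiplication: 24
20 / 8       # division: 2.5
2 ^ 10       # exponentiation: 1024
17 %% 5      # remainder after division (modulo): 2
17 %/% 5     # integer division: 3
(3 + 5) * 2  # parentheses control the order of operations: 16
```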

More complex calculator functions are built into R, which is one reason it is popular among mathematicians and statisticians. To use them, we need to call them by name, as described in the Calling Functions section.

Variables

A variable is a bit of a tricky concept, but it is very important for understanding R. Essentially, a variable is a name that we use in place of a value. Usually that value is a larger or longer piece of data. We can tell R to store a lot of data, for example, in a variable named x. When we execute the command x, R returns all of the data that we stored there.

For now however we’ll just use a tiny data set: the number 5. To store some data in a variable, we need to use a special symbol <-, which in our case tells R to assign the value 5 to the variable x. This is called the assignment operator. To insert the assignment operator press Option + - (Mac) or Alt + - (Windows/Linux).

Let’s see how this works.

Create a variable called x, that will contain the number 5.

x <- 5

R won’t return anything in the console, but note that you now have a new entry in the environment pane. The variable name is at the left (x) and the value that is stored in that variable, is displayed on the right (5).

We can now use x in place of 5:

x + 10
[1] 15
x * 3
[1] 15

Variables are sometimes referred to as objects. R has several conventions for naming variables. Most importantly, variable names:

  • cannot begin with a number
  • should begin with a letter
  • are case sensitive
  • can be almost anything, but it’s best to choose a name that makes sense to you and to others who may read your code.

It is wise to adopt a consistent convention for separating words in variable names.

For example:

# i_use_snake_case
# other.people.use.periods
# evenOthersUseCamelCase
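The naming rules above can be checked directly. A small sketch, assuming two made-up variables (sample_size and Sample_Size) and using the base R function make.names() to show how R would repair an invalid name:

```r
# Names are case sensitive: these are two different variables
sample_size <- 30
Sample_Size <- 300
sample_size == Sample_Size   # FALSE

# make.names() turns a string into a syntactically valid name
make.names("2nd_sample")     # "X2nd_sample" (a name cannot start with a number)
make.names("second_sample")  # already valid, so returned unchanged
```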

Calling Functions

R has a large collection of built-in functions that are called like this:

function_name(argument1 = value1, argument2 = value2, ...)

Let’s explore using seq() function to create a series of numbers.

Start by typing se and then press Tab. RStudio will suggest possible completions. Specify seq() by typing more or use the up/down arrows to select it. You’ll see a helpful tooltip pop up, reminding you of the function’s arguments. If you need more assistance, press F1 (Windows/Linux) or fn + F1 (Mac) to access the full documentation in the Help tab at the lower right.

Now, type the arguments 1, 10 and press Enter.

seq(1, 10)
 [1]  1  2  3  4  5  6  7  8  9 10

You can explicitly specify arguments using the name = value format. However, if you don’t, R will try to resolve them based on their position.

seq(from = 1, to = 10)
 [1]  1  2  3  4  5  6  7  8  9 10

In this example, it assumes that we want a sequence starting from 1 and ending at 10. Since we didn’t mention the step size, it defaults to the value defined in the function, which is 1 in this case.

seq(from = 1, to = 10, by = 2)
[1] 1 3 5 7 9

If you use the name = value format, the order of the arguments does not matter.

seq(to = 10, by = 2, from = 1)
[1] 1 3 5 7 9

For frequently used functions, I might rely on positional resolution for the first one or two arguments. However, beyond that, I prefer to use the name = value format for clarity and precision.

To take the log of 100:

log(x = 100, base = 10)
[1] 2

To take the square root of 100:

sqrt(100) # this is the short-hand of sqrt(x = 100)
[1] 10

Notice that the square root function is abbreviated to sqrt(). This makes writing R code faster; the drawback is that some function names are hard to remember or to interpret.
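To make the difference between positional and named arguments concrete, here is a short sketch that reuses seq(). The variable names are made up for illustration; length.out is a documented seq() argument that asks for a given number of evenly spaced values instead of a step size.

```r
# The third positional argument of seq() is `by`
by_position <- seq(1, 10, 2)
by_name <- seq(from = 1, to = 10, by = 2)
identical(by_position, by_name)   # TRUE

# Named arguments let you skip `by` and ask for a number of values instead
evenly_spaced <- seq(from = 0, to = 1, length.out = 5)
evenly_spaced                     # 0.00 0.25 0.50 0.75 1.00

# When `from` is larger than `to`, seq() counts down
countdown <- seq(from = 10, to = 1)
countdown
```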

Getting Help

In R, the ? and ?? operators are used for accessing help documentation, but they behave slightly differently.

  • The ? operator is used to access help documentation for a specific function or topic. When you type ? followed by the name of a function, you get detailed information about that function. For example try:
?mean
mean R Documentation

Arithmetic Mean

Description

Generic function for the (trimmed) arithmetic mean.

Usage

mean(x, ...)

## Default S3 method:
mean(x, trim = 0, na.rm = FALSE, ...)

Arguments

x

An R object. Currently there are methods for numeric/logical vectors and date, date-time and time interval objects. Complex vectors are allowed for trim = 0, only.

trim

the fraction (0 to 0.5) of observations to be trimmed from each end of x before the mean is computed. Values of trim outside that range are taken as the nearest endpoint.

na.rm

a logical evaluating to TRUE or FALSE indicating whether NA values should be stripped before the computation proceeds.

...

further arguments passed to or from other methods.

Value

If trim is zero (the default), the arithmetic mean of the values in x is computed, as a numeric or complex vector of length one. If x is not logical (coerced to numeric), numeric (including integer) or complex, NA_real_ is returned, with a warning.

If trim is non-zero, a symmetrically trimmed mean is computed with a fraction of trim observations deleted from each end before the mean is computed.

References

Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) The New S Language. Wadsworth & Brooks/Cole.

See Also

weighted.mean, mean.POSIXct, colMeans for row and column means.

Examples

x <- c(0:10, 50)
xm <- mean(x)
c(xm, mean(x, trim = 0.10))

The above command displays the help documentation for the mean function, providing information about its usage, arguments, and examples.

  • The ?? operator is used for a broader search across help documentation. It performs a search for the specified term or keyword in the documentation.
??regression

This will search for the term “regression” in the help documentation and return relevant results. It’s useful when you want to find functions, packages, or topics related to a specific term.
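Reading a help page pays off when you try the documented arguments yourself. The sketch below exercises the trim and na.rm arguments described in the mean help page; the vectors values and scores are made-up examples.

```r
# One large value (50) pulls the ordinary mean upwards
values <- c(0:10, 50)
plain_mean <- mean(values)                  # 8.75
trimmed_mean <- mean(values, trim = 0.10)   # 5.5: drops one value from each end

# Missing values make the mean NA unless na.rm = TRUE
scores <- c(4, NA, 8)
mean(scores)                                # NA
clean_mean <- mean(scores, na.rm = TRUE)    # 6
```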

Tip

Tab completion is a very useful feature: start typing and press Tab to autocomplete code, for example a function name.

R Packages

Many developers have built thousands of functions and shared them with the R user community to help make everyone’s work easier and more efficient. These functions (short programs) are generally bundled together in (wait for it) packages. For example, the tidyverse package brings together many different functions, all of which help with data transformation and visualization. Packages can also contain data, which is often included to help new users learn the available functions.

Installing Packages

Packages are hosted on repositories, with CRAN (Comprehensive R Archive Network) being the primary repository. To install packages from CRAN, you use the install.packages() function. For example:

install.packages("tidyverse")

This will spit out a lot of text into the console as the package is being installed. Once complete you should have a message:

The downloaded binary packages are in... followed by a long directory name.

To remove an installed package:

remove.packages("tidyverse")

Loading Packages

After installation, you need to load a package into your R session using the library() function. For example:

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.0     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

This makes the functions and datasets from the ‘tidyverse’ package available for use in your current session.

Tip

You only need to install a package once. Once installed, you don’t need to reinstall it in subsequent sessions. However, you do need to load the package at the beginning of each R session using the library() function before you can utilize its functions and features. This ensures that the package is actively available for use in your current session.
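One way to avoid reinstalling a package unnecessarily is to check for it first. This is a minimal sketch rather than a required workflow: requireNamespace() reports whether a package is installed without attaching it, and the package::function form calls a function without library(). The variable name has_tidyverse is made up for illustration.

```r
# TRUE if tidyverse is installed, FALSE otherwise (nothing is attached)
has_tidyverse <- requireNamespace("tidyverse", quietly = TRUE)

if (!has_tidyverse) {
  message("tidyverse is not installed; run install.packages(\"tidyverse\") once")
}

# A single function can be used without library() via package::function
middle_value <- stats::median(c(1, 5, 9))
middle_value   # 5
```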

To view packages currently loaded into memory:

(.packages())
 [1] "lubridate" "forcats"   "stringr"   "dplyr"     "purrr"     "readr"    
 [7] "tidyr"     "tibble"    "ggplot2"   "tidyverse" "stats"     "graphics" 
[13] "grDevices" "datasets"  "utils"     "methods"   "base"     
search()
 [1] ".GlobalEnv"        "package:lubridate" "package:forcats"  
 [4] "package:stringr"   "package:dplyr"     "package:purrr"    
 [7] "package:readr"     "package:tidyr"     "package:tibble"   
[10] "package:ggplot2"   "package:tidyverse" "package:stats"    
[13] "package:graphics"  "package:grDevices" "package:datasets" 
[16] "renv:shims"        "package:utils"     "package:methods"  
[19] "Autoloads"         "package:base"     

Package Documentation

Each package comes with documentation that explains how to use its functions. You can access this information using the help() function or by using ? before the function name:

help(tidyverse)
tidyverse-package R Documentation

tidyverse: Easily Install and Load the ‘Tidyverse’

Description

The ‘tidyverse’ is a set of packages that work in harmony because they share common data representations and ‘API’ design. This package is designed to make it easy to install and load multiple ‘tidyverse’ packages in a single step. Learn more about the ‘tidyverse’ at https://www.tidyverse.org.

Author(s)

Maintainer: Hadley Wickham hadley@rstudio.com

Other contributors:

  • RStudio [copyright holder, funder]

See Also

Useful links:

or by using vignette (if the documentation is in the form of vignettes):

vignette(package="tidyverse")

The Pipe Operator (|>)

The pipe operator (|>) is widely used in tidyverse-style code. The native pipe |> is built into base R (version 4.1.0 and later) and is modelled on the %>% pipe, which was originally defined in the (cleverly named) magrittr package and is also made available by dplyr and the tidyverse. The |> symbol can seem confusing and intimidating at first. However, once you understand the basic idea, it can become addictive!

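We suggest you use a shortcut: Cmd + Shift + M (Mac) or Ctrl + Shift + M (Windows/Linux). If the shortcut inserts %>% instead of |>, enable the native pipe option under Tools → Global Options → Code in RStudio.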

The |> symbol is placed between a value on the left and a function on the right. The |> simply takes the value to the left and passes it to the function on the right as the first argument. It acts as a “pipe”. That’s it!

Suppose we have a variable, x.

x <- 7

The following two commands give exactly the same result.

sqrt(x)
[1] 2.645751
x |> sqrt()
[1] 2.645751

We’ll continue to use |> throughout this tutorial to show how useful it can be for chaining various data manipulation steps during an analysis.
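The pipe becomes more useful when several steps are chained. A small sketch, assuming R 4.1.0 or later and a made-up vector called squares, compares a nested call with the equivalent pipeline:

```r
squares <- c(1, 4, 9, 16)

# Nested calls are read from the inside out
nested_result <- sum(sqrt(squares))

# The pipeline is read left to right: take squares, then sqrt(), then sum()
piped_result <- squares |> sqrt() |> sum()
piped_result                      # 10

# Extra arguments go after the piped value, which fills the first argument
rounded_root <- 7 |> sqrt() |> round(digits = 2)
rounded_root                      # 2.65
```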

R Data Types

Figure 1: Image source: https://www.javatpoint.com/r-data-types

To gain a clearer understanding of the remaining content, it is essential to delve into the concept of data types. At this point, we should focus on three fundamental data types:

  1. Numeric data, which involves numbers.
var1 <- 10
var2 <- 1L
var3 <- 5.5
var4 <- 22/7
  2. Character data, which pertains to words. You can create a string using either single quotes (') or double quotes (").
str1 <- "This is a string!!"
str2 <- 'A'
  3. Logical data, encapsulating TRUE/FALSE values.
bool1 <- T
bool2 <- FALSE

You can check the data type of a variable by using the class() function. For example:

class(var3) 
class(bool1)

Apply the class() function to the remaining variables defined earlier.
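For comparison, here is what class() reports for a few literal values. These values are separate from the variables defined above, so the exercise is left for you; is.numeric() and is.character() are base R helpers that test for a type directly.

```r
class(2.5)        # "numeric"
class(7L)         # "integer" (the L suffix marks a whole number)
class("seven")    # "character"
class(7 > 2)      # "logical"

is.numeric(7L)    # TRUE: integers count as numeric
is.character(7)   # FALSE
```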

Comparison Operators and Expressions

Let’s take a moment to discuss comparison operators, which let us ask questions about the values of our objects. Each comparison returns TRUE or FALSE.

  • == – ‘equal to’
  • != – ‘not equal to’
  • < – ‘less than’
  • > – ‘greater than’
  • <= – ‘less than or equal to’
  • >= – ‘greater than or equal to’
x <- 20
x == 2
[1] FALSE
x <= 50
[1] TRUE
x != 20
[1] FALSE

Logical Operators and Expressions

Logical operators are used to combine or compare logical statements. They allow us to create complex conditions by combining simpler conditions.

  • & (AND): Returns TRUE only if both the conditions on the left and right are TRUE.
  • | (OR): Returns TRUE if at least one of the conditions on the left or right is TRUE.
  • ! (NOT): Negates the logical value of the condition; if the condition is TRUE, ! makes it FALSE, and vice versa.
a <- TRUE
a & a
[1] TRUE
a & !a
[1] FALSE
!a | a
[1] TRUE
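Comparison and logical operators are usually combined. A brief sketch, using a made-up variable called temperature, shows a range check built from two comparisons:

```r
temperature <- 22

# TRUE only if both comparisons are TRUE
is_mild <- temperature > 15 & temperature < 30
is_mild          # TRUE

# TRUE if at least one comparison is TRUE
is_extreme <- temperature < 0 | temperature > 35
is_extreme       # FALSE

# ! flips the result
!is_extreme      # TRUE
```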

R Data Structures

R uses several data structures to organize and manipulate data.

Vectors

Vectors are one-dimensional arrays that can hold elements of the same data type. To create a vector, pass the values, separated by commas, to the c() function, which combines (‘concatenates’) them.

numeric_vector <- c(3, 6, 9, 12)
character_vector <- c('R', 'Python', 'Java', 'C')
logical_vector <- c(TRUE, TRUE, FALSE, FALSE)
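Once a vector exists, you can measure it, pick out elements by position, and apply an operation to every element at once. A short sketch (the vector mixed is a made-up example showing what happens when types are combined):

```r
numeric_vector <- c(3, 6, 9, 12)

length(numeric_vector)    # 4
numeric_vector[2]         # 6 (positions start at 1)
numeric_vector * 2        # 6 12 18 24
numeric_vector > 5        # FALSE TRUE TRUE TRUE

# Mixing types forces every element to a common type, here character
mixed <- c(1, "a", TRUE)
class(mixed)              # "character"
```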

Lists

Lists are one-dimensional or nested structures that can contain elements of different data types.

mixed_list <- list(1, "apple", TRUE, 3.14)
nested_list <- list(c(1, 2, 3), "hello", list(4, 5, 6))
nested_list
Output
[[1]]
[1] 1 2 3

[[2]]
[1] "hello"

[[3]]
[[3]][[1]]
[1] 4

[[3]][[2]]
[1] 5

[[3]][[3]]
[1] 6
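Elements of a list are extracted with double square brackets, and list elements can also be given names. A minimal sketch; the person list and its contents are invented for illustration:

```r
nested_list <- list(c(1, 2, 3), "hello", list(4, 5, 6))

# [[ ]] returns the element itself
first_element <- nested_list[[1]]
first_element                 # 1 2 3

# Chain [[ ]] to reach inside a nested list
inner_value <- nested_list[[3]][[2]]
inner_value                   # 5

# Named elements can be accessed with $
person <- list(name = "Ada", age = 36)
person$name                   # "Ada"
```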

Matrices

Matrices are two-dimensional arrays with rows and columns containing elements of the same data type.

numeric_matrix <- matrix(1:6, nrow = 2, ncol = 3)
numeric_matrix
Output
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
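Matrix elements are addressed as [row, column], and the output above shows that matrix() fills values column by column. The sketch below illustrates both points; by_row is a made-up name for a matrix filled row by row using the byrow argument.

```r
numeric_matrix <- matrix(1:6, nrow = 2, ncol = 3)

dim(numeric_matrix)       # 2 3 (rows, columns)
numeric_matrix[2, 3]      # 6: row 2, column 3
numeric_matrix[1, ]       # 1 3 5: the whole first row
numeric_matrix[, 2]       # 3 4: the whole second column

# byrow = TRUE fills the matrix across rows instead of down columns
by_row <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)
by_row[1, ]               # 1 2 3
```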

Arrays

Arrays are multi-dimensional structures that can hold elements of the same data type.

numeric_array <- array(1:12, dim = c(2, 3, 2))
numeric_array
Output
, , 1

     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6

, , 2

     [,1] [,2] [,3]
[1,]    7    9   11
[2,]    8   10   12
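Indexing an array works like indexing a matrix, with one index per dimension. A brief sketch using the same array:

```r
numeric_array <- array(1:12, dim = c(2, 3, 2))

dim(numeric_array)        # 2 3 2 (rows, columns, slices)
numeric_array[2, 3, 1]    # 6: row 2, column 3 of the first slice
numeric_array[1, 2, 2]    # 9: row 1, column 2 of the second slice

# Leaving an index empty keeps everything along that dimension
second_slice <- numeric_array[, , 2]
second_slice
```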

Data Frames

Data frames are two-dimensional structures similar to matrices, but they can store different data types in each column. Data frames generally have column names, which we can treat in the same way as a variable.

For example, let’s combine our three vectors into a data frame, using the data.frame() function:

#combine vectors of the same length into a data frame
new_df <- data.frame(numeric_vector, character_vector, logical_vector)
new_df
Output
numeric_vector  character_vector  logical_vector
             3  R                 TRUE
             6  Python            TRUE
             9  Java              FALSE
            12  C                 FALSE

Importantly, each column in a data frame must have the same number of values (i.e., the same number of rows). This will be a familiar data structure for those who use Microsoft Excel, and is very popular in data science.
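A few base R functions are handy for inspecting a data frame and pulling out columns or rows. This sketch rebuilds the same data frame so that it runs on its own; flagged_rows is a made-up name.

```r
new_df <- data.frame(
  numeric_vector = c(3, 6, 9, 12),
  character_vector = c("R", "Python", "Java", "C"),
  logical_vector = c(TRUE, TRUE, FALSE, FALSE)
)

nrow(new_df)                 # 4 rows
ncol(new_df)                 # 3 columns
str(new_df)                  # compact summary of each column and its type

# $ extracts a single column as a vector
new_df$numeric_vector

# A logical vector inside [rows, columns] keeps only the TRUE rows
flagged_rows <- new_df[new_df$logical_vector, ]
flagged_rows
```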

Tibbles

A tibble is a modern and enhanced version of a data frame, introduced by the tidyverse collection of packages. They are data frames but they tweak some behaviors to make coding a little bit easier.

new_tbl <- tibble(
  x = c(0, 2, 4, 6),
  y = c('great', 'fabulous', 'yeay', 'amazing'),
  z = x^2 + 3
)
new_tbl
Output
x  y         z
0  great     3
2  fabulous  7
4  yeay      19
6  amazing   39

In this workshop, we will use the terms tibble and data frame interchangeably. We will extensively use data frames (or tibbles) in the remainder of this workshop.

Factors

Factors are used to represent categorical data with distinct levels or categories.

Imagine that you have a variable that records months:

x1 <- c("Dec", "Apr", "Jan", "Mar")

Using a character vector to record this variable has two problems:

  • There are only twelve possible months, and there’s nothing saving you from typos:
x2 <- c("Dec", "Apr", "Jam", "Mar")
  • It doesn’t sort in a useful way:
sort(x1)
[1] "Apr" "Dec" "Jan" "Mar"

You can fix both of these problems with a factor. To create a factor you must start by creating a vector of the valid levels:

month_levels <- c(
  "Jan", "Feb", "Mar", "Apr", "May", "Jun", 
  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
)

Now you can create a factor:

y1 <- factor(x1, levels = month_levels)
y1
[1] Dec Apr Jan Mar
Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec

Sort it using the sort() function:

sort(y1)
[1] Jan Mar Apr Dec
Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec

Any values not in the levels will be silently converted to NA (Not Available):

y2 <- factor(x2, levels = month_levels)
y2
[1] Dec  Apr  <NA> Mar 
Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec

Factors have predefined levels, which are the distinct categories that the data can belong to. They can be ordered or unordered. Internally, factors are represented as integers, with each level mapped to a specific integer value. This integer representation is useful for efficient storage and certain statistical analyses (including sorting).
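The integer representation can be inspected directly. A short sketch that rebuilds the month factor so it runs on its own; month_codes is a made-up name.

```r
month_levels <- c(
  "Jan", "Feb", "Mar", "Apr", "May", "Jun",
  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
)
y1 <- factor(c("Dec", "Apr", "Jan", "Mar"), levels = month_levels)

levels(y1)                      # all twelve levels, in calendar order
month_codes <- as.integer(y1)   # position of each value within the levels
month_codes                     # 12 4 1 3

# table() counts every level, including months that never occur
table(y1)
```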

Clearing the Environment

Take a look at the objects you have created in your workspace, which have accumulated in the environment pane in the upper right corner of RStudio.

You can obtain a list of objects in your workspace using a couple of different R commands:

objects()
Output
 [1] "a"                "bool1"            "bool2"            "character_vector"
 [5] "logical_vector"   "mixed_list"       "month_levels"     "nested_list"     
 [9] "new_df"           "new_tbl"          "numeric_array"    "numeric_matrix"  
[13] "numeric_vector"   "pandoc_dir"       "quarto_bin_path"  "str1"            
[17] "str2"             "var1"             "var2"             "var3"            
[21] "var4"             "x"                "x1"               "x2"              
[25] "y1"               "y2"              
ls()
Output
 [1] "a"                "bool1"            "bool2"            "character_vector"
 [5] "logical_vector"   "mixed_list"       "month_levels"     "nested_list"     
 [9] "new_df"           "new_tbl"          "numeric_array"    "numeric_matrix"  
[13] "numeric_vector"   "pandoc_dir"       "quarto_bin_path"  "str1"            
[17] "str2"             "var1"             "var2"             "var3"            
[21] "var4"             "x"                "x1"               "x2"              
[25] "y1"               "y2"              

If you wish to remove a specific object, let’s say x1, you can use the following command:

rm(x1)

To remove all objects:

rm(list = ls())

Alternatively, you can click the broom icon in RStudio’s Environment pane to clear everything.
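If you only want to clear some objects, ls() accepts a pattern argument. The sketch below uses two throwaway objects whose names (demo_first, demo_second) are invented for illustration, and removes only the objects whose names start with demo_.

```r
demo_first <- 1
demo_second <- 2

# List only the objects whose names start with "demo_"
ls(pattern = "^demo_")

# Remove one object and confirm it is gone
rm(demo_first)
exists("demo_first")               # FALSE

# Remove every remaining object that matches the pattern
rm(list = ls(pattern = "^demo_"))
exists("demo_second")              # FALSE
```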

For the sake of reproducibility, it’s crucial to regularly delete your objects and restart your R session. This ensures that your analysis can be replicated next week or even after upgrading your operating system. Restarting your R session helps identify and address any dependencies or configurations needed for your analysis to run successfully.