R is a powerful programming language and open-source software widely used for statistical computing and data analysis. It was developed by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand. R has gained popularity among statisticians, data scientists, researchers, and analysts for its flexibility, extensibility, and robust statistical capabilities.
Here are several compelling reasons to consider learning R:
To begin working with R, users typically install an Integrated Development Environment (IDE) such as RStudio, which provides a user-friendly interface for coding, debugging, and visualizing results. R scripts are written in the R language and can be executed interactively or saved for later use.
Open RStudio. You will see four windows (aka panes). Each window has a different function. The screenshot below shows an analogy linking the different RStudio windows to cooking.
On the left-hand side, you’ll find the console. This is where you can input commands (code that R can interpret), and the responses to your commands, known as output, are displayed here. While the console is handy for experimenting with code, it doesn’t save any of your entered commands. Therefore, relying exclusively on the console is not recommended.
The history pane (located in the top right window) maintains a record of the commands that you have executed in the R console during your current R session. This includes both correct and incorrect commands.
You can navigate through your command history using the up and down arrow keys in the console. This allows you to quickly recall and re-run previous commands without retyping them.
The environment pane (located in the top right window) provides an overview of the objects (variables, data frames, etc.) that currently exist in your R session. It displays the names, types, dimensions, and some content of these objects. This allows you to monitor the state of your workspace in real time.
The plotting pane (located in the bottom right window) is where graphical output, such as plots and charts, is displayed when you create visualizations in R. The plotting pane often includes tools for zooming, panning, and exporting plots, providing additional functionality for exploring and customizing your visualizations.
The help pane (located in the bottom right window) is a valuable resource for accessing documentation and information about R functions, packages, and commands. When you type a function or command in the console and press the F1 key (Mac: fn + F1), the help pane displays the relevant documentation. Additionally, you can type a keyword in the text box at the top right corner of the help pane.
The files pane provides a file browser and file management interface within RStudio. It allows you to navigate through your project directories, view files, and manage your file system.
The packages pane provides a user-friendly interface for managing R packages. It lists installed packages and allows you to load, unload, update, and install packages.
The viewer pane is used to display dynamic content generated by R, such as HTML, Shiny applications, or interactive visualizations.
Opening an RStudio session launches it from a specific location. This is the working directory. R looks in the working directory by default to read in data and save files. You can find out what the working directory is by using the command getwd(). This shows you the path to your working directory in the console. On a Mac this is in the format /path/to/working/directory and on Windows C:/path/to/working/directory (R prints Windows paths with forward slashes). It is often useful to have your data and R scripts in the same directory and set this as your working directory. We will do this now.
Make a folder for this course somewhere on your computer that you will be able to easily find. Name the folder, for example, Intro_R_course. Then, to set this folder as your working directory:
In RStudio click on the Files tab and then click on the three dots, as shown below.
In the window that appears, find the folder you created (e.g. Intro_R_course), click on it, then click Open. The Files tab will now show the contents of your new folder. Click on More → Set As Working Directory, as shown below.
Note: You can use an RStudio project as described here to automatically keep track of and set the working directory.
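The same steps can also be done from code. The sketch below is only an illustration: it assumes a course folder named Intro_R_course created inside R's temporary directory (a hypothetical location, chosen so the example runs anywhere), and it uses setwd(), the base R counterpart of More → Set As Working Directory.

```r
old_wd <- getwd()      # remember the current working directory
old_wd

# Hypothetical course folder, created inside R's temporary directory
course_dir <- file.path(tempdir(), "Intro_R_course")
dir.create(course_dir, showWarnings = FALSE)

setwd(course_dir)      # code equivalent of More -> Set As Working Directory
new_wd <- getwd()      # confirm that the working directory has changed
new_wd

setwd(old_wd)          # switch back so nothing else is affected
```

In your own session you would replace course_dir with the path to the folder you actually created.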
In RStudio, the Script pane (located at the top left window) serves as a dedicated space for writing, editing, and executing R scripts. It is where you compose and organize your R code, making it an essential area for creating reproducible and well-documented analyses.
RStudio provides syntax highlighting in the Script pane, making it easier to identify different components of your code. You can execute individual lines or selections of code from the Script pane. This helps in testing and debugging code without running the entire script.
Navigate to File → New File → R Script, and a new pane will appear in the top-left corner. Save this blank text file as ‘Week_1_tidyverse.R’ in your current working directory (e.g. Intro_R_course).
Executing commands or running code is the process of submitting a command to your computer, which does some computation and returns an answer. In RStudio, there are several ways to execute commands:
We suggest the third option, which is fastest. This link provides a list of useful RStudio keyboard shortcuts that can be beneficial when coding and navigating the RStudio IDE.
When you type in, and then run the commands shown in the grey boxes below, you should see the result in the Console pane at bottom left.
We can use R as a calculator to do simple maths.
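As a minimal sketch of calculator-style use, the lines below can be typed into a script and run one at a time. The particular numbers are arbitrary examples.

```r
3 + 5         # addition
12 - 4        # subtraction
6 * 7         # multiplication
20 / 8        # division
2^10          # exponentiation
(3 + 5) * 2   # brackets control the order of operations
```

Each result is printed in the console with a leading [1], which simply marks the first element of the output.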
More complex calculator functions are built into R, which is one reason it is popular among mathematicians and statisticians. To use these functions, we need to call them.
A variable is a bit of a tricky concept, but very important for understanding R. Essentially, a variable is a symbol that we use in place of another value. Usually the other value is a larger/longer form of data. We can tell R to store a lot of data, for example, in a variable named x. When we execute the command x, R returns all of the data that we stored there.
For now, however, we’ll just use a tiny data set: the number 5. To store some data in a variable, we need to use a special symbol, <-, which in our case tells R to assign the value 5 to the variable x. This is called the assignment operator. To insert the assignment operator, press Option + - (Mac) or Alt + - (Windows/Linux).
Let’s see how this works.
Create a variable called x that will contain the number 5.
R won’t return anything in the console, but note that you now have a new entry in the environment pane. The variable name is on the left (x) and the value stored in that variable is displayed on the right (5).
We can now use x in place of 5:
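The assignment steps described above can be sketched as follows. The second variable, y, is a hypothetical addition used only to show that a stored value can feed into a new one.

```r
x <- 5        # assign the value 5 to the variable x (nothing is printed)
x             # typing the name prints the stored value
x + 2         # x stands in for 5, so this is 5 + 2
y <- x * 10   # hypothetical second variable built from x
y
```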
Variables are sometimes referred to as objects. In R there are different conventions about how to name variables, but most importantly they:
It is wise to adopt a consistent convention for separating words in variable names.
For example:
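A minimal sketch of three commonly seen styles is given below. The variable names are made up for illustration and may differ from the examples originally shown here.

```r
patient_age <- 42   # snake_case: words separated by underscores
patientAge  <- 42   # camelCase: each new word starts with a capital letter
patient.age <- 42   # period.separated: words separated by dots

# R is case-sensitive, so patientAge and patientage would be different variables
```

Whichever style you choose, using it consistently is what matters. The tidyverse style guide recommends snake_case.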
R has a large collection of built-in functions that are called like this:
Let’s explore using the seq() function to create a series of numbers.
Start by typing se and then press Tab. RStudio will suggest possible completions. Specify seq() by typing more or use the up/down arrows to select it. You’ll see a helpful tooltip pop up, reminding you of the function’s arguments. If you need more assistance, press F1 (Windows/Linux) or fn + F1 (Mac) to access the full documentation in the help tab at the lower right.
Now, type the arguments 1, 10 and press Enter.
You can explicitly specify arguments using the name = value format. However, if you don’t, R will try to resolve them based on their position.
In this example, it assumes that we want a sequence starting from 1 and ending at 10. Since we didn’t mention the step size, it defaults to the value defined in the function, which is 1 in this case.
If you are using the name = value format, the order of the arguments does not matter.
For frequently used functions, I might rely on positional resolution for the first one or two arguments. However, beyond that, I prefer to use the name = value format for clarity and precision.
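The difference between positional and named arguments can be sketched with seq(). The argument names from, to and by come from the function's documentation; the step size of 2 in the last call is an arbitrary choice.

```r
seq(1, 10)               # positional: from = 1, to = 10, step defaults to 1
seq(from = 1, to = 10)   # the same call with named arguments
seq(to = 10, from = 1)   # named arguments can be given in any order
seq(1, 10, by = 2)       # first two arguments by position, step size by name
```

The first three calls return the numbers 1 to 10, and the last returns 1 3 5 7 9.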
To take the log of 100:
To take the square root of 100:
Notice that the square root function is abbreviated to sqrt(). This makes writing R code faster; however, the drawback is that some function names are hard to remember or to interpret.
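A minimal sketch of the two calls mentioned above. Note that log() computes the natural logarithm by default; the base-10 variants are included here only for comparison.

```r
log(100)              # natural logarithm (base e), roughly 4.605
log10(100)            # base-10 logarithm
log(100, base = 10)   # base-10 logarithm via the base argument
sqrt(100)             # square root
```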
In R, the ? and ?? operators are used for accessing help documentation, but they behave slightly differently.
The ? operator is used to access help documentation for a specific function or topic. When you type ? followed by the name of a function, you get detailed information about that function. For example, try ?mean:
mean | R Documentation |
Generic function for the (trimmed) arithmetic mean.
mean(x, ...)
## Default S3 method:
mean(x, trim = 0, na.rm = FALSE, ...)
Argument | Description |
---|---|
x | An R object. Currently there are methods for numeric/logical vectors and date, date-time and time interval objects. Complex vectors are allowed for trim = 0, only. |
trim | the fraction (0 to 0.5) of observations to be trimmed from each end of x before the mean is computed. Values of trim outside that range are taken as the nearest endpoint. |
na.rm | a logical evaluating to TRUE or FALSE indicating whether NA values should be stripped before the computation proceeds. |
… | further arguments passed to or from other methods. |
If trim is zero (the default), the arithmetic mean of the values in x is computed, as a numeric or complex vector of length one. If x is not logical (coerced to numeric), numeric (including integer) or complex, NA_real_ is returned, with a warning.
If trim is non-zero, a symmetrically trimmed mean is computed with a fraction of trim observations deleted from each end before the mean is computed.
Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) The New S Language. Wadsworth & Brooks/Cole.
weighted.mean, mean.POSIXct, colMeans for row and column means.
x <- c(0:10, 50)
xm <- mean(x)
c(xm, mean(x, trim = 0.10))
The above command displays the help documentation for the mean function, providing information about its usage, arguments, and examples.
The ?? operator is used for a broader search across help documentation. It searches the documentation for the specified term or keyword. For example, ??regression searches for the term “regression” in the help documentation and returns relevant results. It’s useful when you want to find functions, packages, or topics related to a specific term.
Tab completion
A very useful feature is Tab completion. You can start typing and use Tab to autocomplete code, for example, a function name.
Many developers have built 1000s of functions and shared them with the R user community to help make everyone’s work easier and more efficient. These functions (short programs) are generally packaged up together in (wait for it) Packages. For example, the tidyverse package is a compilation of many different functions, all of which help with data transformation and visualization. Packages also contain data, which is often included to assist new users with learning the available functions.
Packages are hosted on repositories, with CRAN (Comprehensive R Archive Network) being the primary repository. To install packages from CRAN, you use the install.packages() function. For example:
This will spit out a lot of text into the console as the package is being installed. Once complete, you should see a message: The downloaded binary packages are in... followed by a long directory name.
To remove an installed package, use the remove.packages() function.
After installation, you need to load a package into your R session using the library() function. For example:
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.0 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
This makes the functions and datasets from the ‘tidyverse’ package available for use in your current session.
You only need to install a package once. Once installed, you don’t need to reinstall it in subsequent sessions. However, you do need to load the package at the beginning of each R session using the library() function before you can utilize its functions and features. This ensures that the package is actively available for use in your current session.
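The "install once, load every session" pattern can be sketched with a small helper. The function name load_or_install is hypothetical (it is not part of R or the tidyverse) and is only one possible way to write this. It relies on requireNamespace(), install.packages() and library(), which are all standard R functions.

```r
# Hypothetical helper: install a package only if it is missing, then load it
load_or_install <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)   # needs internet access and a CRAN mirror
  }
  library(pkg, character.only = TRUE)   # attach the package for this session
}

# 'stats' ships with R, so nothing is downloaded in this demonstration
load_or_install("stats")

# In the course you would run, for example:
# load_or_install("tidyverse")
```

In everyday use, most people simply run install.packages("tidyverse") once in the console and put library(tidyverse) at the top of each script.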
To view packages currently loaded into memory:
[1] "lubridate" "forcats" "stringr" "dplyr" "purrr" "readr"
[7] "tidyr" "tibble" "ggplot2" "tidyverse" "stats" "graphics"
[13] "grDevices" "datasets" "utils" "methods" "base"
[1] ".GlobalEnv" "package:lubridate" "package:forcats"
[4] "package:stringr" "package:dplyr" "package:purrr"
[7] "package:readr" "package:tidyr" "package:tibble"
[10] "package:ggplot2" "package:tidyverse" "package:stats"
[13] "package:graphics" "package:grDevices" "package:datasets"
[16] "renv:shims" "package:utils" "package:methods"
[19] "Autoloads" "package:base"
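The two listings above look like the output of (.packages()) and search(). That is an inference from their shape, since the original commands are not shown. A minimal sketch:

```r
(.packages())   # names of the packages currently attached to the session
search()        # the full search path, which also includes .GlobalEnv
```

Your own output will differ depending on which packages you have loaded.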
Each package comes with documentation that explains how to use its functions. You can access this information using the help() function or by using ? before the function name:
tidyverse-package | R Documentation |
The ‘tidyverse’ is a set of packages that work in harmony because they share common data representations and ‘API’ design. This package is designed to make it easy to install and load multiple ‘tidyverse’ packages in a single step. Learn more about the ‘tidyverse’ at https://www.tidyverse.org.
Maintainer: Hadley Wickham hadley@rstudio.com
Other contributors:
RStudio [copyright holder, funder]
Useful links:
or by using vignette() (if the documentation is in the form of vignettes):
The pipe operator (|>) is commonly used in tidyverse code. It is built into R itself (from version 4.1.0) and is modelled on the %>% pipe, which was originally defined in the (cleverly named) magrittr package and is also made available by dplyr and the other tidyverse packages. The |> symbol can seem confusing and intimidating at first. However, once you understand the basic idea, it can become addictive!
We suggest you use a shortcut: ⌘ + Shift + M (Mac) or Ctrl + Shift + M (Windows/Linux).
The |> symbol is placed between a value on the left and a function on the right. The |> simply takes the value to the left and passes it to the function on the right as the first argument. It acts as a “pipe”. That’s it!
Suppose we have a variable, x.
The following are exactly the same.
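A minimal sketch of the equivalence, assuming R 4.1.0 or later (where the native pipe is available). The value 9 and the use of sqrt() and sum() are arbitrary choices for illustration.

```r
x <- 9

sqrt(x)       # ordinary function call
x |> sqrt()   # x is piped in as the first argument of sqrt()

# Pipes can be chained, reading left to right:
# take the vector, then take square roots, then add them up
c(1, 4, 9) |> sqrt() |> sum()
```

Both of the first two commands return 3, and the chained example returns 6.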
We’ll continue to use |> throughout this tutorial to show how useful it can be for chaining various data manipulation steps during an analysis.
To gain a clearer understanding of the remaining content, it is essential to delve into the concept of data types. At this point, we should focus on three fundamental data types: numeric (numbers), character (text enclosed in single quotes (') or double quotes (")), and logical (TRUE or FALSE).
You can check the data type of a variable by using the class() function. For example:
Apply the class() function to the remaining variables defined earlier.
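As a sketch of how class() reports each type: the variable names below echo those that appear later in the workspace listing, but the values assigned to them are assumptions made for this example.

```r
var1  <- 3.14      # a number
str1  <- "hello"   # text in double quotes
str2  <- 'world'   # text in single quotes works the same way
bool1 <- TRUE      # a logical value

class(var1)    # "numeric"
class(str1)    # "character"
class(str2)    # "character"
class(bool1)   # "logical"
```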
Let’s take a moment to discuss logical operators and expressions for questioning the attributes of our objects.
== – ‘equal to’
!= – ‘not equal to’
< – ‘less than’
> – ‘greater than’
<= – ‘less than or equal to’
>= – ‘greater than or equal to’
Logical operators are used to combine or compare logical statements. They allow us to create complex conditions by combining simpler conditions.
& (AND): Returns TRUE only if both the conditions on the left and right are TRUE.
| (OR): Returns TRUE if at least one of the conditions on the left or right is TRUE.
! (NOT): Negates the logical value of the condition; if the condition is TRUE, ! makes it FALSE, and vice versa.
R uses several data structures to organize and manipulate data.
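Returning briefly to the comparison and logical operators listed above, here is a minimal sketch of how they behave. The variable a and its value are assumptions chosen for illustration.

```r
a <- 5

a == 5    # TRUE:  a is equal to 5
a != 5    # FALSE: a is not different from 5
a < 10    # TRUE
a >= 6    # FALSE

(a > 1) & (a < 10)   # TRUE:  both conditions hold
(a < 1) | (a > 3)    # TRUE:  the second condition holds
!(a == 5)            # FALSE: negation of TRUE
```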
Vectors are one-dimensional arrays that can hold elements of the same data type. To create one, we enclose the values in brackets, separated by commas, and ‘concatenate’ them using a function called c().
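A short sketch of building vectors with c(). The three vectors use the values that appear in the data frame shown later in this section. The indexing and length() calls are extra illustrations.

```r
numeric_vector   <- c(3, 6, 9, 12)
character_vector <- c("R", "Python", "Java", "C")
logical_vector   <- c(TRUE, TRUE, FALSE, FALSE)

length(numeric_vector)    # number of elements: 4
numeric_vector[2]         # second element: 6
class(character_vector)   # every element shares one type: "character"
```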
Lists are one-dimensional or nested structures that can contain elements of different data types.
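The printout that follows in the text is consistent with a list built as sketched below. The name nested_list is taken from the workspace listing later on, but the exact call is an assumption.

```r
# A list holding a numeric vector, a string, and another list
nested_list <- list(c(1, 2, 3), "hello", list(4, 5, 6))
nested_list

nested_list[[2]]        # double brackets extract one element: "hello"
nested_list[[3]][[1]]   # first element of the inner list: 4
```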
[[1]]
[1] 1 2 3
[[2]]
[1] "hello"
[[3]]
[[3]][[1]]
[1] 4
[[3]][[2]]
[1] 5
[[3]][[3]]
[1] 6
Matrices are two-dimensional arrays with rows and columns containing elements of the same data type.
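A sketch of one call that would produce the 2 × 3 matrix printed next. The name numeric_matrix appears in the workspace listing later on; the arguments are inferred from the output.

```r
# matrix() fills column by column by default
numeric_matrix <- matrix(1:6, nrow = 2, ncol = 3)
numeric_matrix

dim(numeric_matrix)    # 2 rows, 3 columns
numeric_matrix[1, 2]   # row 1, column 2: 3
```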
[,1] [,2] [,3]
[1,] 1 3 5
[2,] 2 4 6
Arrays are multi-dimensional structures that can hold elements of the same data type.
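Similarly, the three-dimensional array printed next could be created as sketched here. The dimensions are inferred from the output, and the name numeric_array is taken from the workspace listing.

```r
# Two 'slices', each with 2 rows and 3 columns
numeric_array <- array(1:12, dim = c(2, 3, 2))
numeric_array

numeric_array[1, 1, 2]   # row 1, column 1 of the second slice: 7
```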
, , 1
[,1] [,2] [,3]
[1,] 1 3 5
[2,] 2 4 6
, , 2
[,1] [,2] [,3]
[1,] 7 9 11
[2,] 8 10 12
Data frames are two-dimensional structures similar to matrices, but they can store different data types in each column. Data frames generally have column names, which we can treat in the same way as a variable.
For example, let’s combine our three vectors into a data frame, using the data.frame() function:
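A self-contained sketch of this step. The vectors are re-created here with the values shown in the table that follows, and the name new_df is taken from the workspace listing later on.

```r
numeric_vector   <- c(3, 6, 9, 12)
character_vector <- c("R", "Python", "Java", "C")
logical_vector   <- c(TRUE, TRUE, FALSE, FALSE)

# Each vector becomes a column named after the vector
new_df <- data.frame(numeric_vector, character_vector, logical_vector)
new_df

new_df$character_vector   # a column can be used like a variable
```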
numeric_vector | character_vector | logical_vector |
---|---|---|
3 | R | TRUE |
6 | Python | TRUE |
9 | Java | FALSE |
12 | C | FALSE |
Importantly, each column in a data frame must have the same number of values (i.e., the same number of rows). This will be a familiar data structure for those who use Microsoft Excel, and is very popular in data science.
A tibble is a modern and enhanced version of a data frame, introduced by the tidyverse collection of packages. Tibbles are data frames, but they tweak some behaviors to make coding a little bit easier.
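A sketch of how the tibble shown next might be built. The values come from the table below, but the formula for z is a guess that happens to reproduce them. The code assumes the tibble package (part of the tidyverse) is installed and does nothing otherwise.

```r
if (requireNamespace("tibble", quietly = TRUE)) {
  new_tbl <- tibble::tibble(
    x = c(0, 2, 4, 6),
    y = c("great", "fabulous", "yeay", "amazing"),
    z = x^2 + 3   # later columns may refer to earlier ones
  )
  print(new_tbl)
}
```

Being able to define z from x inside the same call is one of the small conveniences tibbles offer over data.frame().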
x | y | z |
---|---|---|
0 | great | 3 |
2 | fabulous | 7 |
4 | yeay | 19 |
6 | amazing | 39 |
In this workshop, we will use the terms tibble and data frame interchangeably. We will extensively use data frames (or tibbles) in the remainder of this workshop.
Factors are used to represent categorical data with distinct levels or categories.
Imagine that you have a variable that records months:
Using a vector of strings to record this variable has two problems:
You can fix both of these problems with a factor. To create a factor you must start by creating a vector of the valid levels:
Now you can create a factor:
[1] Dec Apr Jan Mar
Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
Sort it using the sort() function:
Any values not in the levels will be silently converted to NA (Not Available):
[1] Dec Apr <NA> Mar
Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
Factors have predefined levels, which are the distinct categories that the data can belong to. They can be ordered or unordered. Internally, factors are represented as integers, with each level mapped to a specific integer value. This integer representation is useful for efficient storage and certain statistical analyses (including sorting).
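The factor walkthrough above can be sketched end to end as follows. The object names match the workspace listing later on. The misspelled value "Jam" is an assumption, chosen to reproduce the <NA> shown in the output.

```r
x1 <- c("Dec", "Apr", "Jan", "Mar")

month_levels <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun",
                  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")

y1 <- factor(x1, levels = month_levels)
y1
sort(y1)         # sorted in calendar order: Jan Mar Apr Dec
as.integer(y1)   # the integer codes behind the levels: 12 4 1 3

x2 <- c("Dec", "Apr", "Jam", "Mar")   # "Jam" is a deliberate typo
y2 <- factor(x2, levels = month_levels)
y2               # the unrecognised value becomes NA
```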
Take a look at the objects you have created in your workspace, which have accumulated in the environment pane in the upper right corner of RStudio.
You can obtain a list of objects in your workspace using a couple of different R commands:
[1] "a" "bool1" "bool2" "character_vector"
[5] "logical_vector" "mixed_list" "month_levels" "nested_list"
[9] "new_df" "new_tbl" "numeric_array" "numeric_matrix"
[13] "numeric_vector" "pandoc_dir" "quarto_bin_path" "str1"
[17] "str2" "var1" "var2" "var3"
[21] "var4" "x" "x1" "x2"
[25] "y1" "y2"
[1] "a" "bool1" "bool2" "character_vector"
[5] "logical_vector" "mixed_list" "month_levels" "nested_list"
[9] "new_df" "new_tbl" "numeric_array" "numeric_matrix"
[13] "numeric_vector" "pandoc_dir" "quarto_bin_path" "str1"
[17] "str2" "var1" "var2" "var3"
[21] "var4" "x" "x1" "x2"
[25] "y1" "y2"
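The two identical listings above are consistent with the commands ls() and objects(), which return the same result. That is an inference, since the original commands are not shown. The variable demo_value below is hypothetical and only ensures there is something to list.

```r
demo_value <- 1   # hypothetical object so the workspace is not empty

ls()              # names of the objects in the current workspace
objects()         # an alias for ls()
```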
If you wish to remove a specific object, let’s say x1, you can use the following command:
To remove all objects:
Alternatively, you can click the broom icon in RStudio’s Environment pane to clear everything.
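Both kinds of removal can be sketched with rm(). The two objects are created here only so there is something to delete. Be aware that the last command really does clear every object in your workspace.

```r
x1 <- 1
x2 <- 2

rm(x1)            # remove a single object
"x1" %in% ls()    # FALSE: x1 is gone, x2 is still there

rm(list = ls())   # remove everything in the workspace
ls()              # character(0): nothing left
```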
For the sake of reproducibility, it’s crucial to regularly delete your objects and restart your R session. This ensures that your analysis can be replicated next week or even after upgrading your operating system. Restarting your R session helps identify and address any dependencies or configurations needed for your analysis to run successfully.
Comments
In R, any text following the hash symbol # is termed a comment. R disregards this text, considering it non-executable. Comments serve the purpose of documenting your code, aiding your future understanding of specific lines, and highlighting the intentions or challenges encountered.
RStudio makes it easy to comment or uncomment lines: select the lines you want to comment (to comment a set of lines) or place the cursor anywhere on a line (to comment a single line), then press ⌘ + Shift + C (Mac) or Ctrl + Shift + C (Windows/Linux).
Extensive use of comments is encouraged throughout this course.
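A small sketch of commented code. The temperature example and its variable names are made up purely to show the different places a comment can go.

```r
# Convert a body temperature from Fahrenheit to Celsius
temp_f <- 98.6                    # hypothetical measurement in Fahrenheit
temp_c <- (temp_f - 32) * 5 / 9   # standard conversion formula
temp_c

# print(temp_f)   # a commented-out line: R skips it entirely
```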