February 22, 2019

Introduction to R

Date: 15th March 2018, 6 - 7pm
Series: Wolfson College Skills for Academic Success
Location: Roger Needham Room, Wolfson college, University of Cambridge, UK
Trainer: Sergio Martínez Cuesta
Register here

This course provides a short beginner's introduction to the R programming language and software environment for statistical computing and graphics. Sergio will demonstrate basic examples of how to input, explore, plot and output data in R. Everybody is welcome. If you would like to follow along on your laptop, please download and install R and RStudio before the session.

Outline

Motivation
Installing R and RStudio
How can I find help?
Getting started
Variables and functions
- Exercise 1
Vectors
Import and explore data
- Subsetting
- Exercise 2
Sort tables and export results
Basic plotting
- Export graphics
- Exercise 3

This short course is based on the R crash course developed by Mark Dunning and Laurent Gatto.

Motivation

R is one of the most widely used programming languages for data analysis, statistics and visualisation in academia and industry.
It is open-source and available on all major platforms (Mac, Linux and Windows).
It is supported by a broad community of software developers and researchers who contribute R packages and libraries to many fields of research.
It facilitates reproducibility in research and lets you integrate all your analyses in individual scripts.
It makes it easy to write documentation and code together using a free environment like RStudio.

E.g. The New Zealand Tourism Dashboard uses R extensively to report statistics.

Installing R and RStudio

Latest version of R
- To check if you have R installed on a Mac, go to Finder -> Applications -> Utilities -> Terminal, type R and press Enter. If you see information about the R version and other details followed by a ">" prompt, then you have R installed.
RStudio Desktop Open Source License
- To check if you have RStudio installed on a Mac, try opening it: go to Finder -> Applications and click on RStudio

How can I find help?

Stack Overflow
The Comprehensive R Archive Network (CRAN)
CRAN Task Views
R-bloggers
Quick-R
Local R groups, e.g. R-ladies Cambridge
Type ? followed by the name of the function that you'd like to use, e.g. ?mean

If you are interested in bioinformatics and computational biology, the following links might also be of interest:

Getting started

Open RStudio - see above. Explore the different panels

To download today's workshop:

Go to your web browser and type: https://tinyurl.com/2018-IntroR-Wolfson
Click on IntroR.zip, then press Download and save the file in your preferred folder, e.g. your Desktop
Go to the folder where you saved IntroR.zip and uncompress it, e.g. on a Mac just double-click on IntroR.zip. The folder IntroR will then appear.
The folder IntroR contains two files:
- IntroR.Rmd - the code for today's session
- patient-data-cleaned.csv - the dataset that we will be exploring

Now, go back to RStudio:

Click on File -> Open File and select IntroR.Rmd
You are all set to go now :)

Also:

We will be using the RStudio console (bottom-left panel) to interact with R during the workshop
The blocks of code shown in IntroR.Rmd - see below - are written in the R Markdown format, which allows plain text and R code to be mixed within the same document
Each line of R code inside a block can be executed by clicking on the line and pressing CMD + ENTER (Mac) or CTRL + ENTER (Windows and Linux), e.g.:

print("R is fun!")

Alternatively, to execute the entire block, click on the green arrow tip on the right-hand side of the block.

3 + 1

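You can add a new block of code by selecting R in the Insert menu, or by typing the chunk delimiters yourself: an opening line of three backticks followed by {r}, then your code, then a closing line of three backticks: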

# R code goes in here

Variables and functions

You can use R as a calculator using the symbols +, -, * and /, or more advanced features such as statistical operations, logarithms, trigonometry ...

2 + 1
7 - 1
3 * 2
10 / 5

mean(1:5)
log(1)
pi
sin(pi/2)

To store your results for later, use variables. To create them, use the assignment operator <-:

x <- 25
x
y <- 16
y

You can perform multiple operations using variables:

sqrt(x)
x + y
x <- 36
x <- y
x <- x + 8
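To make the effect of the last three assignments visible, the sketch below repeats them and prints x after each step. It only uses the values of x and y defined above; the comments show what each line should print.

```r
x <- 25
y <- 16

x <- 36      # x is overwritten: the old value 25 is gone
x            # 36
x <- y       # x takes a copy of the current value of y
x            # 16
x <- x + 8   # the right-hand side is evaluated first, then stored back in x
x            # 24
y            # y is unchanged: 16
```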

Functions in R take one or more arguments as input, which are passed inside parentheses. Arguments can be named explicitly; otherwise they are matched in the order given in the function definition. E.g. seq is a function for generating a numeric sequence from one number to another. Type ?seq to get the help page for this function.

?seq
seq(from = 1, to = 10, by = 2)
seq(1, 10, 2)
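Because named arguments are matched by name rather than by position, they can be given in any order, and named and positional arguments can be mixed. A small sketch using seq() as above (the variable names a, b and c1 are only for illustration):

```r
a  <- seq(from = 1, to = 10, by = 2)
b  <- seq(by = 2, to = 10, from = 1)   # same arguments, different order
c1 <- seq(1, 10, by = 2)               # first two by position, third by name

identical(a, b)    # TRUE
identical(a, c1)   # TRUE
a                  # 1 3 5 7 9
```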

Some functions have default values in some arguments:

seq(1, 10, 1)
seq(1, 10)

The default value for the by argument in the seq() function is 1.

An alternative way to obtain a sequence of numbers in steps of one is the : symbol:

z <- 1:5
z

Exercise 1

Work in pairs: meet the person sitting next to you and try the following together (3 min):

Create a sequence of numbers from 10 to 30 in steps of three
How about decreasing sequences? Now try from 30 to 10 in steps of three (hint: check ?seq)
Round the number pi to 1 decimal place (hint: check ?round)

Vectors

The output we get from R functions such as seq() is called a vector, which is a collection of numbers or characters
To create vectors, use the function c() (short for combine)
Use square brackets [ ] to indicate the position within the vector (the index) and extract elements

x <- c(5,6,7,8,9,10)
x
x[3]
x[1]
x[3:5]
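A few more indexing patterns that follow from the same square-bracket rule are sketched below. They are additions for illustration rather than part of the session material: c() inside the brackets picks non-adjacent positions, a negative index drops an element, and length() gives the number of elements. Note that R counts positions from 1, not 0.

```r
x <- c(5, 6, 7, 8, 9, 10)

length(x)       # 6
x[c(1, 4)]      # 5 8
x[-1]           # 6 7 8 9 10  (everything except the first element)
x[length(x)]    # 10          (the last element)
```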

Arithmetic operations in vectors occur element by element:

x <- c(2, 4, 5, 6, 7)
y <- x*2
y
x + y
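For reference, here is the same example with the expected results written as comments. The last line is an extra illustration: when a vector is combined with a single number, that number is reused for every element.

```r
x <- c(2, 4, 5, 6, 7)
y <- x * 2

y        # 4 8 10 12 14
x + y    # 6 12 15 18 21
x + 1    # 3 5 6 7 8
```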

A vector can also contain text. Unlike numbers, text values need to be enclosed in quotation marks " ":

x <- c("a", "b", "b", "c", "c", "d")
x

x <- c(a, b, b, c, c, d) # without quotes, R looks for objects named a, b, ... and stops with an "object not found" error

To create subsets of our vectors, we can use comparison operators:

== equal
> greater than
< less than
!= not equal

x <- c("a", "b", "b", "c", "c", "d")
x == "b" # this is known as a logical or boolean vector, composed of TRUE or FALSE values only
x != "b"
x[x != "b"]

x <- c(2, 4, 5, 6, 7)
x > 4
x[x > 4]
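Comparisons can also be combined before subsetting. The sketch below uses the same numeric vector with & (and), | (or) and %in% (is the value in this set?); sum() on a logical vector counts the TRUE values. The thresholds are arbitrary choices for illustration.

```r
x <- c(2, 4, 5, 6, 7)

x > 4 & x < 7            # FALSE FALSE TRUE TRUE FALSE
x[x > 4 & x < 7]         # 5 6
x[x < 4 | x > 6]         # 2 7
x[x %in% c(4, 7, 100)]   # 4 7
sum(x > 4)               # 3 elements are greater than 4
```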

Import and explore data

We will use a small made-up dataset which is often used for training purposes. It contains information about 100 lung cancer patients aged 42-44 from different states in the US. We have saved these data as a .csv file to demonstrate how to import and explore data using R.

You will first need to find the path to the file patient-data-cleaned.csv, which was downloaded together with the course materials - see folder IntroR. Use the function file.choose() to open a dialogue box and browse through the directories to reach the file. The path will then be displayed in R:

file.choose()

For example, on my computer the path is /Users/martin03/Desktop/IntroR/patient-data-cleaned.csv. The file patient-data-cleaned.csv is a comma-separated values (CSV) file, which can easily be opened using software like Excel. In R, use the read.csv() function and the path obtained above to create a data frame object:

patient_data <- read.csv("/Users/martin03/Desktop/IntroR/patient-data-cleaned.csv") # copy here the path obtained when running file.choose()
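If you would like to try read.csv() without the course files, the sketch below writes a tiny made-up table to a temporary file and reads it back. The column names and values are invented for illustration and are not taken from patient-data-cleaned.csv.

```r
# Write three lines of made-up CSV text to a temporary file
tmp <- tempfile(fileext = ".csv")
writeLines(c("ID,Height,Smokes",
             "P1,182.9,Non-Smoker",
             "P2,163.5,Smoker"), tmp)

# The first line is used as the column names
toy <- read.csv(tmp)
toy
class(toy)   # "data.frame"
dim(toy)     # 2 3  (2 rows, 3 columns)
```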

Exploring rows and columns in the patient_data data frame:

# Dimensions
dim(patient_data)
ncol(patient_data)
nrow(patient_data)

# Viewing contents
head(patient_data)
View(patient_data)

# Names of columns
colnames(patient_data)

# Accessing data using column names
patient_data$Smokes
patient_data$Height
patient_data$State

# Summary of all data frame contents
summary(patient_data)

In a data frame, all the values in a column must be of the same type (i.e. all numbers or all characters/text).
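One way to check the type of each column is class() or str(). The sketch below uses a small made-up data frame (the names toy and mixed and all the values are assumptions for illustration) and also shows what happens when numbers and text are mixed in one vector: everything is converted to text.

```r
toy <- data.frame(
  Name   = c("Ann", "Bob", "Cat"),
  Height = c(182.9, 163.5, 171.2),
  Smokes = c("Non-Smoker", "Smoker", "Non-Smoker"),
  stringsAsFactors = FALSE   # keep text columns as plain character vectors
)

str(toy)             # one line per column with its type
class(toy$Height)    # "numeric"
class(toy$Name)      # "character"

mixed <- c(1, 2, "three")
class(mixed)         # "character"
mixed                # "1" "2" "three"
```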

You can apply functions to the columns of the data frame to ask various questions:

# What is the maximum height?
max(patient_data$Height)
# What is the minimum weight?
min(patient_data$Weight)
# What is the mean body mass index (BMI), rounded to one decimal place?
round(mean(patient_data$BMI), 1)

Subsetting

Just like when subsetting vectors, a selection of a data frame can be made using square brackets [ , ]. However, data frames are two-dimensional objects, so you'll need both row and column indexes:

patient_data[1 , 2]
patient_data[2 , 1]
patient_data[c(1,2,3) , 1]
patient_data[c(1,2,3) , c(1,2)]

If you'd like to see all the rows, or all the columns, you can omit the row or the column index respectively. But ... remember to keep the comma ;)

patient_data[2, ]
patient_data[, 2]
patient_data[, 1:4]

Rather than selecting rows based on indexes, you can also use comparison operators to give either a TRUE or FALSE result. When applied to subsetting, only rows with a TRUE result get returned.

# The vector of TRUE or FALSE results applied to subsetting data
patient_data$Height > 183

# Which patients are taller than 183cm?
patient_data[patient_data$Height > 183,]

# Which patients are smokers?
patient_data[patient_data$Smokes == "Smoker",]

# Which patients are taller than 183cm AND are smokers too?
patient_data$Height > 183 & patient_data$Smokes == "Smoker"
patient_data[patient_data$Height > 183 & patient_data$Smokes == "Smoker",]

# You can also select only specific columns using the column name, e.g. give me only the ID, Name, State and Disease Grade
patient_data[patient_data$Height > 183 & patient_data$Smokes == "Smoker", c("ID", "Name", "State", "Grade")]

The useful subsetting operators to bear in mind here are & (and), | (or) and %in% (in).
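The & operator is used above; the sketch below shows | and %in% on a small made-up data frame. The data frame toy, its values and the state names other than California are invented for illustration, and the ! (not) operator in the last line is an addition that is not covered in the text above.

```r
toy <- data.frame(
  ID     = c("P1", "P2", "P3", "P4"),
  Height = c(190, 165, 185, 170),
  Smokes = c("Smoker", "Non-Smoker", "Non-Smoker", "Smoker"),
  State  = c("California", "Georgia", "Indiana", "California"),
  stringsAsFactors = FALSE
)

# or: taller than 183cm OR a smoker (rows P1, P3 and P4)
tall_or_smoker <- toy[toy$Height > 183 | toy$Smokes == "Smoker", ]
tall_or_smoker$ID                                    # "P1" "P3" "P4"

# in: State is one of the listed values
toy[toy$State %in% c("Georgia", "Indiana"), "ID"]    # "P2" "P3"

# not: reverse a logical vector with !
toy[!(toy$State == "California"), "ID"]              # "P2" "P3"
```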

Exercise 2

Work in pairs if possible (3 min):

Select patients who have a BMI greater than 30 or a weight greater than 90kg. Calculate their average height.
Select female patients from California who are not overweight

Sort tables and export results

The function order() gives sorted indices, which can then be used to sort your data set:

# Sort patients by Disease Grade
order(patient_data$Grade)
patient_data[order(patient_data$Grade),] # from benign (1) to harmful (3)
patient_data[order(patient_data$Grade, decreasing = TRUE),] # from harmful (3) to benign (1)

# Sort patients by more than one condition: first Disease Grade, second Weight
patient_data[order(patient_data$Grade, patient_data$Weight, decreasing = TRUE),]
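Note that decreasing = TRUE applies to every sorting column at once. If you need one column ascending and another descending, one common approach for numeric columns is to negate the column that should be descending. A sketch on made-up data (the data frame toy and its values are assumptions for illustration):

```r
toy <- data.frame(
  ID     = c("P1", "P2", "P3", "P4"),
  Grade  = c(2, 1, 2, 3),
  Weight = c(70, 85, 90, 60),
  stringsAsFactors = FALSE
)

order(toy$Grade)   # 2 1 3 4  (row positions, lowest Grade first)

# Grade ascending; within the same Grade, heaviest patient first
sorted <- toy[order(toy$Grade, -toy$Weight), ]
sorted$ID          # "P2" "P3" "P1" "P4"
```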

Once data processing is completed, you can export results out of R as follows:

# Which patients from California are non-smokers?
patient_data_california <- patient_data[patient_data$State == "California" & patient_data$Smokes == "Non-Smoker",]

# Export
write.csv(patient_data_california, file = "/Users/martin03/Desktop/IntroR/patient-data-cleaned-california.csv")
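By default write.csv() also writes the row names as an extra first column, which shows up as a column called X when the file is read back. If you do not want that, you can pass row.names = FALSE. A sketch using a temporary file and a made-up data frame (toy, out and back are illustrative names):

```r
toy <- data.frame(
  ID    = c("P1", "P2"),
  State = c("California", "California"),
  stringsAsFactors = FALSE
)

out <- tempfile(fileext = ".csv")
write.csv(toy, file = out, row.names = FALSE)   # no row-name column

back <- read.csv(out)
colnames(back)   # "ID" "State"
nrow(back)       # 2
```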

Basic plotting

Simple plotting functions are available in the base R distribution (histograms, barplots, boxplots, scatterplots ...). All that is required as input are vectors of data, e.g. columns in your data frame.

Histograms are often used to get an overview of the distribution of continuous data:

hist(patient_data$BMI)
hist(patient_data$Weight)

Barplots are useful when you have counts of categorical data:

barplot(table(patient_data$Race))
barplot(table(patient_data$Sex))
barplot(table(patient_data$Smokes))
barplot(table(patient_data$State), las=2, cex.names=0.7) # 'las=2' draws the axis labels perpendicular to the axis (vertical on the x-axis) and 'cex.names=0.7' shrinks the label text
barplot(table(patient_data$Grade))
barplot(table(patient_data$Overweight))
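It can help to look at what table() returns before passing it to barplot(): a named set of counts, one per distinct value, with the names sorted alphabetically. A sketch on a made-up vector (smokes and counts are illustrative names):

```r
smokes <- c("Smoker", "Non-Smoker", "Non-Smoker", "Smoker", "Non-Smoker")

counts <- table(smokes)
counts             # Non-Smoker: 3, Smoker: 2
names(counts)      # "Non-Smoker" "Smoker"
counts["Smoker"]   # 2
```

These counts are the bar heights that barplot(counts) would draw.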

Boxplots are good for comparing distributions. Here the ~ symbol sets up a formula, the effect of which is to put the categorical variable on the x-axis and the continuous variable on the y-axis -> boxplot(y ~ x)

boxplot(patient_data$BMI ~ patient_data$Grade)

boxplot(patient_data$BMI ~ patient_data$Overweight)
boxplot(patient_data$Weight ~ patient_data$Overweight)

Scatter plots are useful for showing the relationship between two continuous variables. Here -> plot(x, y):

plot(patient_data$Weight, patient_data$BMI)

To enhance the appearance of your plots, many different ways of customisation are possible:

Colours: col argument. To get a full list of possible colours type colours(), or check this online reference.
Point type: pch
Axis labels: xlab and ylab
Plot title: main
... and many others: see ?plot and ?par for more options

plot(patient_data$Weight, patient_data$BMI, col="red", pch=16, xlab="Weight (kg)", ylab="BMI", main="US patient data")
abline(lm(patient_data$BMI ~ patient_data$Weight), col="blue")
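The blue line comes from lm(), which fits a straight line (a linear model) of BMI against Weight; abline() then draws it using the fitted intercept and slope. The sketch below shows those two numbers for a made-up example in which the points lie exactly on a line, so the result is easy to check by hand. The vectors weight and bmi are invented and are not the course data.

```r
weight <- c(60, 70, 80, 90)
bmi    <- c(21, 24, 27, 30)   # made up so that bmi = 3 + 0.3 * weight

fit <- lm(bmi ~ weight)       # formula: y ~ x
coef(fit)                     # intercept 3, slope 0.3
```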

Related arguments can be used for other plotting functions:

boxplot(patient_data$BMI ~ patient_data$Overweight, col=c("red", "green"), xlab="Overweight patient?", ylab="BMI", main="US patient data")

To explore other types of plots, have a look here. There are dedicated R packages, e.g. ggplot2, for more sophisticated plotting. We will be exploring these in future workshops.

Export graphics

When running commands directly in the interactive console (bottom-left panel), plots can be exported using the Plots tab in RStudio (bottom-right panel). Click on Export -> Save as PDF ....

You can also save plots to a file by calling the pdf() or png() functions before executing the code to create the plot:

pdf("/Users/martin03/Desktop/IntroR/BMIvsWeight.pdf")
plot(patient_data$Weight, patient_data$BMI, col="red", pch=16, xlab="Weight (kg)", ylab="BMI", main="US patient data")
abline(lm(patient_data$BMI ~ patient_data$Weight), col="blue")
dev.off()

The dev.off() line is important: it closes the file. Without it the file is not completed and you will not be able to open the plot you have created.
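The same open, plot, close pattern is sketched below with a temporary file, so it can be run without changing any paths. The width and height arguments of pdf() set the page size in inches; the data points are made up for illustration.

```r
out <- tempfile(fileext = ".pdf")

pdf(out, width = 6, height = 4)   # open the file; size in inches
plot(c(60, 70, 80), c(21, 24, 27), xlab = "Weight (kg)", ylab = "BMI")
dev.off()                         # close the file so it can be opened

file.exists(out)                  # TRUE
```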

Exercise 3

The final one:

Are there any differences in Weight or BMI between Smokers and Non-Smokers? (hint: try boxplot)
Visualise the relationship between the Height and Weight of the patients

That's it! Enjoy R!

Questions?

For feedback or questions about the course, please email Sergio (sermarcue@gmail.com).

References and additional materials

Blogs:

Books:

Courses:

CRUK-CI R crash course
R for Reproducible Scientific Analysis
Karl Broman's mini tutorials
Basic statistics and data handling with R
Scripting for data analysis (with R)
An Introduction to Solving Biological Problems with R
Data Analysis and Visualisation using R: including dplyr and ggplot2
Babraham institute basic/advanced R and ggplot2 courses
R object-oriented programming and package development, link1 and link2
R course content for the CODATA-RDA Research Data Science Summer School
Data carpentry course for biologists by Ethan White
Cambridge's Data carpentry using R
The Bioconductor 2018 Workshop Compilation

Perspectives:

Acknowledgements

Sergio is a University of Cambridge Data Champion funded by a Jisc research data fellowship to develop research data training activities for researchers. He does research in bioinformatics and computational biology within the Balasubramanian laboratories funded by the Wellcome Trust at the University of Cambridge.

License

This work is distributed under a Creative Commons CC0 license. No rights reserved.
