February 22, 2019
Introduction to R
<img align="right" src=images/R_logo.png width="150">
- Date: 15th March 2018, 6 - 7pm
- Series: Wolfson College Skills for Academic Success
- Location: Roger Needham Room, Wolfson college, University of Cambridge, UK
- Trainer: Sergio Martínez Cuesta
- Register here
Overview
This course provides a short beginner's introduction to the R programming language and software environment for statistical computing and graphics. Sergio will demonstrate basic examples of how to input, explore, plot and output data in R. Everybody is welcome. If you would like to follow along on your laptop, please download and install R and RStudio before the session.
Outline
- Motivation
- Installing R and RStudio
- How can I find help?
- Getting started
- Variables and functions
- Exercise 1
- Vectors
- Import and explore data
- Subsetting
- Exercise 2
- Sort tables and export results
- Basic plotting
- Export graphics
- Exercise 3
This short course is based on the R crash course developed by Mark Dunning and Laurent Gatto.
Motivation
- R is one of the most widely-used programming languages for data analysis, statistics and visualisation in academia and industry.
- It is open-source and available on all major platforms (Mac, Linux and Windows)
- Supported by a broad community of software developers and researchers who contribute R packages and libraries to many fields of research
- It facilitates reproducibility in research and integration of all your analyses in individual scripts
- Easy to write documentation and code together using a free environment like RStudio
E.g. The New Zealand Tourism Dashboard uses R extensively to report statistics.
Installing R and RStudio
- Latest version of R
  - To check if you have R installed on a Mac, go to `Finder->Applications->Utilities->Terminal`, type `R` and press Enter. If you see information about the R version and other details appearing followed by a ">" prompt, then you have R installed.
- RStudio Desktop Open Source License
  - To check if you have RStudio installed on a Mac, try opening it: go to `Finder->Applications` and click on `RStudio`
How can I find help?
- Stack Overflow
- The Comprehensive R Archive Network (CRAN)
- CRAN Task Views
- R-bloggers
- Quick-R
- Local R groups, e.g. R-ladies Cambridge
- Type `?` followed by the name of the function that you'd like to use, e.g. `?mean`
If you are interested in bioinformatics and computational biology, the following links might also be of interest:
Getting started
- Open RStudio - see above. Explore the different panels
To download today's workshop:
- Go to your web browser and type: https://tinyurl.com/2018-IntroR-Wolfson
- Click on `IntroR.zip`, then press `Download` and save the file in your preferred folder, e.g. your Desktop
- Go to the folder where you saved `IntroR.zip` and uncompress it, e.g. on a Mac just double-click on `IntroR.zip`. Only then will the folder `IntroR` appear.
- The folder `IntroR` contains two files:
  - `IntroR.Rmd` - the code for today's session
  - `patient-data-cleaned.csv` - the dataset that we will be exploring
Now, go back to RStudio:
- Click on `File->Open File` and select `IntroR.Rmd`
- You are all set to go now :)
Also:
- We will be using the RStudio console (bottom-left panel) to interact with R during the workshop
- The blocks of code shown in `IntroR.Rmd` - see below - are written in R Markdown, a format which allows mixing plain text and R code within the same document
- Each line of R code inside a block can be executed by clicking on the line and pressing CMD + ENTER (Mac) or CTRL + ENTER (Windows and Linux), e.g.:
print("R is fun!")
Alternatively, to execute the entire block, click on the green arrow tip on the right-hand side of the block.
3 + 1
- You can add a new block of code by selecting `R` in the `Insert` menu or by typing the following syntax directly:
# R code goes in here
Variables and functions
You can use R as a calculator using the symbols +, -, * and /, or more advanced features such as statistical operations, logarithms, trigonometry ...
2 + 1
7 - 1
3 * 2
10 / 5
mean(1:5)
log(1)
pi
sin(pi/2)
To store your results for later, use variables. To create them, use the assignment operator <-:
x <- 25
x
y <- 16
y
You can perform multiple operations using variables:
sqrt(x)
x + y
x <- 36
x <- y
x <- x + 8
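To make the reassignments above concrete, here is a minimal sketch that follows the value of x after each step. It only uses the variables already introduced in the text; the comments show the values I would expect.

```r
x <- 25
y <- 16

sqrt(x)       # 5
x + y         # 41

x <- 36       # x is overwritten: now 36
x <- y        # x takes the current value of y: now 16
x <- x + 8    # the right-hand side is evaluated first (16 + 8), so x is now 24
x
y             # y has not been touched: still 16
```

Note that assigning a new value to x silently replaces the old one, so it is worth printing a variable whenever you are unsure what it currently holds.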
Functions in R take one or more arguments as input, which are captured using parentheses. Arguments can be named explicitly, otherwise they are meant to be used in the same order as described in the function definition. E.g. seq is a function for generating a numeric sequence from and to particular numbers. Type ?seq to get the help page for this function.
?seq
seq(from = 1, to = 10, by = 2)
seq(1, 10, 2)
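As a small illustration of named versus positional arguments, the sketch below calls seq() in a few equivalent ways. The variable names (s_named, s_shuffled, s_position, s_length) are made up for this example; length.out is another argument listed in ?seq.

```r
# Named arguments can be given in any order
s_named    <- seq(from = 1, to = 10, by = 2)
s_shuffled <- seq(by = 2, to = 10, from = 1)
s_named
identical(s_named, s_shuffled)    # TRUE

# Unnamed arguments are matched by position: from, to, by
s_position <- seq(1, 10, 2)
identical(s_named, s_position)    # TRUE

# Both styles can be mixed: positional from and to, named length.out
s_length <- seq(1, 10, length.out = 4)
s_length                          # 1 4 7 10
```

Naming arguments is a little more typing, but it usually makes the code easier to read later on.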
Some functions have default values in some arguments:
seq(1, 10, 1)
seq(1, 10)
The default value for the by argument in the seq() function is 1.
An alternative method to obtain sequences of numbers spaced by one value is the : symbol:
z <- 1:5
z
Exercise 1
Work in pairs, meet the person sitting next to you and try the following together (3 min):
- Create a sequence of numbers from 10 to 30 spaced by three values
- How about decreasing sequences? Now try from 30 to 10 spaced by three values (hint: check `?seq`)
- Round the number `pi` to 1 decimal place (hint: check `?round`)
Vectors
- The output of R functions such as `seq()` is called a vector, which is a collection of numbers or characters
- To create vectors use the function `c()` (a.k.a. combine)
- Use square brackets `[ ]` to indicate the position within the vector (the index) and extract elements
x <- c(5,6,7,8,9,10)
x
x[3]
x[1]
x[3:5]
Arithmetic operations in vectors occur element by element:
x <- c(2, 4, 5, 6, 7)
y <- x*2
y
x + y
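The sketch below spells out what "element by element" means for the same vector, and adds a few summary functions (length(), sum() and mean()) that are not used in the text above but belong to base R. The names doubled, shifted and total are made up for the example.

```r
x <- c(2, 4, 5, 6, 7)

doubled <- x * 2         # each element is doubled: 4 8 10 12 14
shifted <- x + 10        # the single value 10 is added to every element: 12 14 15 16 17
total   <- x + doubled   # elements are paired up by position: 6 12 15 18 21
total

# Many functions summarise a whole vector into a single value
length(x)   # 5
sum(x)      # 24
mean(x)     # 4.8
```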
A vector can also contain text. However, unlike numbers, text needs to be enclosed in quotation marks " ":
x <- c("a", "b", "b", "c", "c", "d")
x
x <- c(a, b, b, c, c, d) # this fails: without quotation marks R looks for objects named a, b, ...
To create subsets of our vectors, we can use comparison operators:
- `==` equal
- `>` greater than
- `<` less than
- `!=` not equal
x <- c("a", "b", "b", "c", "c", "d")
x == "b" # this is known as a logical or boolean vector, composed of TRUE or FALSE values only
x != "b"
x[x != "b"]
x <- c(2, 4, 5, 6, 7)
x > 4
x[x > 4]
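A logical vector can be stored and reused like any other vector. The following sketch, using the made-up name is_big, shows two base R helpers that are handy alongside subsetting: sum() counts the TRUE values and which() reports their positions.

```r
x <- c(2, 4, 5, 6, 7)

is_big <- x > 4    # FALSE FALSE TRUE TRUE TRUE
is_big

sum(is_big)        # TRUE counts as 1, so this is the number of matches: 3
which(is_big)      # positions of the TRUE values: 3 4 5
x[is_big]          # the matching elements themselves: 5 6 7
```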
Import and explore data
We will use a small made-up dataset which is often used for training purposes. It contains information about 100 lung cancer patients aged 42-44 from different states in the US. We have saved these data as a .csv file to demonstrate how to import and explore data using R.
You will first need to find the path to the file patient-data-cleaned.csv, which was downloaded together with the course materials - see folder IntroR. Use the function file.choose() to open a dialogue box and browse through the directories to reach the file. The path will then be displayed in R:
file.choose()
E.g. for me the path is /Users/martin03/Desktop/IntroR/patient-data-cleaned.csv. The file patient-data-cleaned.csv is a comma-separated values (CSV) file, which can easily be opened using software like Excel. In R, use the read.csv() function and the path obtained above to create a data frame object:
patient_data <- read.csv("/Users/martin03/Desktop/IntroR/patient-data-cleaned.csv") # copy here the path obtained when running file.choose()
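If you would like to try read.csv() without the course files, the sketch below writes a tiny made-up CSV to a temporary location and reads it back. The file contents and the names toy_path and toy_data are invented for illustration and are not part of the workshop dataset.

```r
# Write a tiny made-up CSV file to a temporary location (illustration only)
toy_path <- tempfile(fileext = ".csv")
writeLines(c("ID,Height,Smokes",
             "P1,182.5,Smoker",
             "P2,165.0,Non-Smoker",
             "P3,190.1,Non-Smoker"),
           toy_path)

# file.exists() is a quick check that a path points to a real file
file.exists(toy_path)

# read.csv() uses the first line as column names by default (header = TRUE)
toy_data <- read.csv(toy_path)
toy_data
dim(toy_data)    # 3 rows and 3 columns
```

If read.csv() complains that it cannot open the file, checking the path with file.exists() is usually a good first step.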
Exploring rows and columns in the patient_data data frame:
# Dimensions
dim(patient_data)
ncol(patient_data)
nrow(patient_data)
# Viewing contents
head(patient_data)
View(patient_data)
# Names of columns
colnames(patient_data)
# Accessing data using column names
patient_data$Smokes
patient_data$Height
patient_data$State
# Summary of all data frame contents
summary(patient_data)
In a data frame, all the values within a column must be of the same type (i.e. all numbers or all characters/text).
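To see column types in practice, here is a minimal sketch with a made-up data frame (toy_data and its values are invented for this example). class() and str() are base R functions that report the type of an object.

```r
toy_data <- data.frame(
  Name   = c("Ann", "Bob", "Cy"),
  Height = c(182.5, 165.0, 190.1),
  stringsAsFactors = FALSE    # keep text as plain character strings
)

class(toy_data$Name)      # "character"
class(toy_data$Height)    # "numeric"
str(toy_data)             # compact overview of every column and its type

# Mixing numbers and text in one vector converts everything to text
mixed <- c(1, "two", 3)
class(mixed)              # "character"
mixed
```

This is why a single stray text entry in a numeric column of a spreadsheet can turn the whole column into text once imported.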
You can apply functions to the columns of the data frame to ask various questions:
# What is the maximum height?
max(patient_data$Height)
# What is the minimum weight?
min(patient_data$Weight)
# What is the mean body mass index (BMI), rounded to one decimal place?
round(mean(patient_data$BMI), 1)
Subsetting
Just like when subsetting vectors, a selection of a data frame can be made using square brackets [ , ]. However, data frames are two-dimensional objects, so you'll need both a row index and a column index:
patient_data[1 , 2]
patient_data[2 , 1]
patient_data[c(1,2,3) , 1]
patient_data[c(1,2,3) , c(1,2)]
If you'd like to see all the columns, or all the rows, you can omit the column or the row index respectively. But ... remember to keep the comma ;)
patient_data[2, ]
patient_data[, 2]
patient_data[, 1:4]
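The sketch below repeats these indexing patterns on a small made-up data frame (toy_data and its values are invented for this example) so that the results are easy to check by eye. It also shows that a single column comes back as a plain vector, whereas several columns come back as a data frame.

```r
toy_data <- data.frame(
  ID     = c("P1", "P2", "P3"),
  Height = c(182.5, 165.0, 190.1),
  Weight = c(80, 62, 95),
  stringsAsFactors = FALSE
)

toy_data[2, 3]         # one value: row 2, column 3 -> 62
toy_data[2, ]          # all columns of row 2 (still a data frame)
toy_data[, 2]          # a single column comes back as a plain vector
toy_data[, c(1, 3)]    # several columns come back as a data frame

# The same column can be reached by position or by name
identical(toy_data[, 2], toy_data$Height)    # TRUE
```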
Rather than selecting rows based on indexes, you can also use comparison operators to give either a TRUE or FALSE result. When applied to subsetting, only rows with a TRUE result get returned.
# The vector of TRUE or FALSE results applied to subsetting data
patient_data$Height > 183
# Which patients are taller than 183cm?
patient_data[patient_data$Height > 183,]
# Which patients are smokers?
patient_data[patient_data$Smokes == "Smoker",]
# Which patients are taller than 183cm AND are smokers too?
patient_data$Height > 183 & patient_data$Smokes == "Smoker"
patient_data[patient_data$Height > 183 & patient_data$Smokes == "Smoker",]
# You can also select only specific columns using the column name, e.g. give me only the ID, Name, State and Disease Grade
patient_data[patient_data$Height > 183 & patient_data$Smokes == "Smoker", c("ID", "Name", "State", "Grade")]
The useful subsetting operators to bear in mind here are `&` (and), `|` (or) and `%in%` (in).
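Here is a minimal sketch of the three operators side by side. The data frame toy_data, its values and the result names are made up for this example; the real dataset may contain different states and labels.

```r
toy_data <- data.frame(
  ID     = c("P1", "P2", "P3", "P4"),
  State  = c("California", "Georgia", "Indiana", "California"),
  Height = c(182.5, 165.0, 190.1, 171.3),
  Smokes = c("Smoker", "Non-Smoker", "Non-Smoker", "Smoker"),
  stringsAsFactors = FALSE
)

# and (&): both conditions must be TRUE -> P1
tall_smokers <- toy_data[toy_data$Height > 180 & toy_data$Smokes == "Smoker", ]

# or (|): at least one condition must be TRUE -> P1, P3, P4
tall_or_smoker <- toy_data[toy_data$Height > 180 | toy_data$Smokes == "Smoker", ]

# in (%in%): the value matches any element of a set -> P1, P2, P4
selected_states <- toy_data[toy_data$State %in% c("California", "Georgia"), ]

tall_smokers
tall_or_smoker
selected_states
```

Using %in% is often tidier than chaining several == comparisons with |.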
Exercise 2
Work in pairs if possible (3 min):
- Select patients that have a BMI greater than 30 or whose weight is greater than 90kg. Calculate their average height.
- Select female patients from California who are not overweight
Sort tables and export results
The function order() gives sorted indices, which can then be used to sort your data set:
# Sort patients by Disease Grade
order(patient_data$Grade)
patient_data[order(patient_data$Grade),] # from benign (1) to harmful (3)
patient_data[order(patient_data$Grade, decreasing = TRUE),] # from harmful (3) to benign (1)
# Sort patients by more than one condition: first Disease Grade, second Weight
patient_data[order(patient_data$Grade, patient_data$Weight, decreasing = TRUE),]
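To see what order() actually returns, the sketch below uses two short made-up vectors (grade and weight are invented for this example). In the command above, decreasing = TRUE applies to every sorting key; putting a minus sign in front of a numeric key is one common way to reverse only that key.

```r
grade  <- c(2, 3, 1, 3)
weight <- c(70, 85, 90, 60)

order(grade)             # positions that sort grade: 3 1 2 4
grade[order(grade)]      # 1 2 3 3

# Ties in grade (the two 3s) are broken by the second argument
order(grade, weight)     # 3 1 4 2

# A minus sign reverses one numeric key only:
# grade ascending, weight descending within each grade
order(grade, -weight)    # 3 1 2 4
```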
Once data processing is completed, you can export results out of R as follows:
# Which patients from California are non-smokers?
patient_data_california <- patient_data[patient_data$State == "California" & patient_data$Smokes == "Non-Smoker",]
# Export
write.csv(patient_data_california, file = "/Users/martin03/Desktop/IntroR/patient-data-cleaned-california.csv")
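By default write.csv() also writes the row names as an extra first column. If you would rather leave them out, the row.names argument can be set to FALSE. The sketch below tries this on a made-up data frame written to a temporary file (toy_data, out_path and check are names invented for this example).

```r
toy_data <- data.frame(
  ID    = c("P1", "P2"),
  State = c("California", "California"),
  stringsAsFactors = FALSE
)

out_path <- tempfile(fileext = ".csv")    # temporary file, for illustration only

# row.names = FALSE leaves out the extra first column of row numbers
write.csv(toy_data, file = out_path, row.names = FALSE)

readLines(out_path)          # the raw text that was written
check <- read.csv(out_path)  # read the file back in to confirm its contents
colnames(check)              # "ID" "State"
```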
Basic plotting
Simple plotting functions are available in the base R distribution (histograms, barplots, boxplots, scatterplots ...). All that is required as input are vectors of data, e.g. columns in your data frame.
Histograms are often used to get an overview of the distribution of continuous data:
hist(patient_data$BMI)
hist(patient_data$Weight)
Barplots are useful when you have counts of categorical data:
barplot(table(patient_data$Race))
barplot(table(patient_data$Sex))
barplot(table(patient_data$Smokes))
barplot(table(patient_data$State), las=2, cex.names=0.7) # 'las=2' draws the axis labels perpendicular to the axis (vertical on the x-axis) and 'cex.names=0.7' shrinks the label text
barplot(table(patient_data$Grade))
barplot(table(patient_data$Overweight))
Boxplots are good for comparing distributions. Here the ~ symbol sets up a formula, the effect of which is to put the categorical variable on the x-axis and the continuous variable on the y-axis -> boxplot(y ~ x)
boxplot(patient_data$BMI ~ patient_data$Grade)
boxplot(patient_data$BMI ~ patient_data$Overweight)
boxplot(patient_data$Weight ~ patient_data$Overweight)
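The formula can also be written with column names only, by passing the data frame through the data argument of boxplot(). The sketch below shows both forms on a made-up data frame (toy_data and its values are invented for this example) and stores the summary that boxplot() returns invisibly.

```r
toy_data <- data.frame(
  BMI        = c(22.1, 24.3, 23.0, 31.2, 29.8, 33.5),
  Overweight = c("No", "No", "No", "Yes", "Yes", "Yes"),
  stringsAsFactors = FALSE
)

# y ~ x: continuous variable on the left, grouping variable on the right
boxplot(toy_data$BMI ~ toy_data$Overweight)

# The data argument lets the formula use the column names directly
groups <- boxplot(BMI ~ Overweight, data = toy_data)

groups$names    # one box per category: "No" "Yes"
groups$n        # number of observations in each box: 3 3
```

Both calls draw the same plot; the second form mainly saves typing and gives tidier default axis labels.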
Scatter plots are useful when representing two continuous variables. Here -> plot(x, y):
plot(patient_data$Weight, patient_data$BMI)
To enhance the appearance of your plots, many different ways of customisation are possible:
- Colours: `col` argument. To get a full list of possible colours type `colours()`, or check this online reference.
- Point type: `pch`
- Axis labels: `xlab` and `ylab`
- Plot title: `main`
- ... and many others: see `?plot` and `?par` for more options
plot(patient_data$Weight, patient_data$BMI, col="red", pch=16, xlab="Weight (kg)", ylab="BMI", main="US patient data")
abline(lm(patient_data$BMI ~ patient_data$Weight), col="blue")
Related arguments can be used for other plotting functions:
boxplot(patient_data$BMI ~ patient_data$Overweight, col=c("red", "green"), xlab="Overweight patient?", ylab="BMI", main="US patient data")
To explore other types of plots, have a look here. There are dedicated R libraries e.g. ggplot2 to do more sophisticated plotting. We will be exploring these in future workshops.
Export graphics
When running commands directly in the interactive console (bottom-left panel), plots can be exported using the Plots tab in RStudio (bottom-right panel). Click on Export -> Save as PDF ....
You can also save plots to a file by calling the pdf() or png() functions before executing the code that creates the plot:
pdf("/Users/martin03/Desktop/IntroR/BMIvsWeight.pdf")
plot(patient_data$Weight, patient_data$BMI, col="red", pch=16, xlab="Weight (kg)", ylab="BMI", main="US patient data")
abline(lm(patient_data$BMI ~ patient_data$Weight), col="blue")
dev.off()
The dev.off() line is important: it closes the graphics device and finishes writing the file. Without it you will not be able to view the plot you have created.
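The same open, plot, close pattern can be tried without the patient data. This minimal sketch saves a simple plot to a temporary PDF and then checks that the file was written; plot_path and the plotted numbers are made up for illustration.

```r
plot_path <- tempfile(fileext = ".pdf")    # temporary file, for illustration only

pdf(plot_path)                  # open the file as the active graphics device
plot(1:10, (1:10)^2, pch = 16, xlab = "x", ylab = "x squared")
dev.off()                       # close the device so that the file is complete

file.exists(plot_path)          # TRUE once the plot has been saved
file.size(plot_path) > 0        # the file is not empty
```

Replace tempfile() with a path of your choice, e.g. inside your IntroR folder, to keep the plot.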
Exercise 3
The final one:
- Any differences in Weight or BMI between Smokers and Non-Smokers? (hint: try `boxplot`)
- Visualise the relationship between the Height and Weight of the patients
That's it! Enjoy R!
Questions?
For feedback or questions about the course, please email Sergio (sermarcue@gmail.com).
References and additional materials
Blogs:
- End-to-end visualization using ggplot2
- ggplot2 - Easy way to mix multiple graphs on the same page
- Getting started with data visualization in R using ggplot2
- Rookie mistakes and how to fix them when making plots of data
Books:
- R for Data Science
- Data Visualization for Social Science. A practical introduction with R and ggplot2
- Modern Statistics for Modern Biology (https://www.huber.embl.de/msmb/index.html)
Courses:
- CRUK-CI R crash course
- R for Reproducible Scientific Analysis
- Karl Broman's mini tutorials
- Basic statistics and data handling with R
- Scripting for data analysis (with R)
- An Introduction to Solving Biological Problems with R
- Data Analysis and Visualisation using R: including dplyr and ggplot2
- Babraham institute basic/advanced R and ggplot2 courses
- R object-oriented programming and package development, link1 and link2
- R course content for the CODATA-RDA Research Data Science Summer School
- Data carpentry course for biologists by Ethan White
- Cambridge's Data carpentry using R
- The Bioconductor 2018 Workshop Compilation
Perspectives:
Acknowledgements
Sergio is a University of Cambridge Data Champion funded by a Jisc research data fellowship to develop research data training activities for researchers. He does research in bioinformatics and computational biology within the Balasubramanian laboratories funded by the Wellcome Trust at the University of Cambridge.
License
This work is distributed under a Creative Commons CC0 license. No rights reserved.
Our sponsors: