
An Introduction to R:
Basics and First Steps
Dr. Henry Löffler-Wirth, Dr. Markus Kreuz, Dr. Helene Kretzmer
Institut für medizinische Informatik, Statistik und Epidemiologie (IMISE)
Interdisziplinäres Zentrum für Bioinformatik (IZBI)
Universität Leipzig
03. November 2016
H. Löffler-Wirth, M. Kreuz, H. Kretzmer
Basics I
03. November 2016
What is R?
Programming & scripting language
Comprehensive statistical environment
Strengths: bioinformatic / statistical data analysis; graph algorithms & visualization
Why use R?
It's free
Runs on diverse platforms (Windows, Unix, macOS)
Huge collection of packages in public repositories (CRAN, Bioconductor)
Automated workflows
Big data sets
Advanced statistical routines
State-of-the-art graphics capabilities
How to obtain and install R?
R can be downloaded from the Comprehensive R Archive Network (CRAN):
http://cran.r-project.org/
Installation instructions depend on your operating system and should be accessible from
the R download page for your operating system.
For our course, we recommend using RStudio as the programming environment:
https://www.rstudio.com/
RStudio
[Screenshot of the RStudio window with its four panes labelled: code, shell, environment & variables, plots / help page]
RStudio
Using script files allows for longer code and makes development and debugging easier
Finding help
R mailing lists:
https://stat.ethz.ch/mailman/listinfo/
Manuals and FAQs:
http://www.r-project.org/
Ask Google, it knows loads!
Let's get it on: First steps in R
# is the comment mark
> # we can now talk to R
> 3 + 6
> 2 * ( 7 + 3 )
> sqrt( 36 ) + log10( 100 )
> a = 2
> a
> a = 3
+, -, *, /, ^ are operators.
A list of all operators can be found here:
http://cran.r-project.org/doc/manuals/R-lang.html#Operators
sqrt(…) and log10(…) are functions.
They always look like
functionname( parameter1, parameter2, … )
There are =, ==, <- and <<-
= and <- both assign a value to a variable
== compares two values
<<- is advanced (scope and visibility, covered later)
> b <- 3
> a == b
Let's get it on: First steps in R
There are different types of variables and values
> mode( a == b )
> mode( a )
> mode( "emu" )
> a >= b
> a != b
> c <- a != b
> c
> c + 1
Some types can be easily converted,
others cannot…
> "emu" + 1
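To make the point about conversion concrete, here is a minimal sketch. It uses as.numeric(), tryCatch() and suppressWarnings(), which are not introduced on the slides, and all variable names are made up for this illustration.

```r
# logical values are converted to numbers automatically (TRUE = 1, FALSE = 0)
flag <- 2 != 3
flag.plus.one <- flag + 1                        # 2

# character strings are not converted automatically: "emu" + 1 is an error
result <- tryCatch( "emu" + 1, error = function( e ) "error" )

# explicit conversion works only if the text looks like a number
ok  <- as.numeric( "36" ) + 1                    # 37
bad <- suppressWarnings( as.numeric( "emu" ) )   # NA (normally with a warning)

print( c( mode( flag ), mode( flag.plus.one ), mode( "emu" ) ) )
```

The last line prints the three modes "logical", "numeric" and "character".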
Vectors, lists and key functions
> 1:9
Vectors store multiple values and can be assigned using
start:end or c( value1, value2 )
> c( 1:9, 27 )
> c( 1:9, 27 ) + 1
> c( "rat", "emu" )
> c( 1:9, "emu" )
conversion to character mode
> c( 1:9, "emu" ) + 1
> list( 1:9, "emu" )
Lists can contain diverse data types as elements, e.g.
numerical vectors, characters, matrices or even other lists
> list( nums=1:9, char="emu" )
List and vector elements can be named
str(variable) gives detailed information
> c( nr=25, nr2=36, nr3=12345 )
> str( c( nr=25, nr2=36, nr3=12345 ) )
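The difference between vectors and lists can be sketched as follows. The access operators $ and [[ ]] are not shown on the slides and are used here as an assumption about what comes next; the variable names are invented for the example.

```r
# a vector holds one type only: mixing numbers and text converts everything to character
mixed.vec <- c( 1:3, "emu" )
mode( mixed.vec )                    # "character"

# a list keeps each element's own type
mixed.list <- list( nums = 1:3, char = "emu" )
mode( mixed.list$nums )              # "numeric"
mode( mixed.list$char )              # "character"

# named elements can be accessed by name
sizes <- c( nr = 25, nr2 = 36, nr3 = 12345 )
sizes[ "nr2" ]                       # 36, printed together with its name
mixed.list[[ "nums" ]] + 1           # 2 3 4
```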
Vectors, lists and key functions
Types of brackets:
()  contain parameters for functions
[]  contain indices for vectors/matrices
{}  define code blocks (later)
> num = c( 3, 6, 9, 12 )
> num[ 1 ]
> num[ a ]
!! Indexing in R starts with 1 for first element !!
> num[ c( a, 4 ) ]
Common functions to build vectors:
c(…)  concatenation
seq(start,end,by)  creates a series of numbers
rep(what,times)  repeat something multiple times
> c( 1:10 )
> seq( 1, 10 )
> rep( 1, 10 )
> ?rep
Functions usually accept various parameter combinations.
To get help, type ?functionname or press F1 in RStudio
> rep( c( "a", "happy", "bunny" ), each = 2 )
> rep( c( "a", "happy", "bunny" ), times = 2 )
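A compact sketch of indexing together with seq and rep; the variable names on the left-hand side are made up for this example.

```r
num <- c( 3, 6, 9, 12 )
first  <- num[ 1 ]                               # 3: indexing starts at 1
picked <- num[ c( 3, 4 ) ]                       # 9 12: several indices at once

by.three    <- seq( 3, 12, by = 3 )              # same values as num
each.twice  <- rep( c( "a", "b" ), each = 2 )    # "a" "a" "b" "b"
whole.twice <- rep( c( "a", "b" ), times = 2 )   # "a" "b" "a" "b"
```

each repeats every element before moving on to the next one, whereas times repeats the vector as a whole.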
Vectors, lists and key functions
> num <- c( -1, num, 1 )
Adding elements using the c function
> length( num )
> num[2]
Positive and negative selection
> num[-2]
> which( num > 1 )
> num[ which( num > 1 ) ]
which returns the indices of the elements that meet a condition
and can be used to select or deselect subgroups of a
vector
> num[ -which( num > 1 ) ]
Vectors, lists and key functions
> num <- c( 1:3 )
> num + 1
> num * c( 1, 1, 2 )
Simple mathematical operations with a scalar
Operations of two equally long vectors result in vectors
Basic vector functions
> sum( num )
> mean( num )
> min( num )
> max( num )
> which.max( num )
!! Variable/function names may contain . or _ !!
Vectors, lists and key functions
> any( num != 3 )
Check if any or all vector elements meet a condition
> all( num > 1 )
Pruning of console output
> head( 1:1000000 )
> tail( 1:1000000 )
> ls()
ls() lists all variables defined in the workspace
rm(variable) removes a variable
> rm( num )
Programming tasks I
• Write some code to generate the following sequences:
a) 2, 4, 6, … , 100 (even numbers up to 100)
b) 1, 2, 3, 4, 5, 4, 3, 2, 1
c) 10, 20, 30, 40, 10, 20, 30, 40, 10, 20, 30, 40
d) 1, 4, 9, 16, 25, … , 10000 (square numbers up to 10,000)
• Which element of d) is 6889? Replace it with 6988.
• Replace the first 20 elements of d) with 1, 2, 3, … , 20.
• Replace the entries 7396 to 8281 in d) with 1, 2, 3, …, 6 (they are successive).
Solutions
a) 2, 4, 6, … , 100 (even numbers up to 100)
> a <- c(1:50) * 2
> a <- seq( 2, 100, by=2 )
b) 1, 2, 3, 4, 5, 4, 3, 2, 1
> b <- c( 1:5, 4:1 )
c) 10, 20, 30, 40, 10, 20, 30, 40, 10, 20, 30, 40
> dont.use.c <- rep( 1:4, 3 ) * 10 # avoid naming a variable c: that is the name of the concatenation function
d) 1, 4, 9, 16, 25, … , 10000 (square numbers up to 10,000)
> d <- c(1:100) ^ 2
Solutions
Which element of d) is 6889? Replace it with 6988.
> which( d == 6889 )
> d[ which( d == 6889 ) ] = 6988
Replace the first 20 elements of d) with 1, 2, 3, … , 20.
> d[ c(1:20) ] <- c(1:20)
Replace the values 7396 to 8281 in d) with 1, 2, 3, …, 6. (they are successive)
> d[ c( which(d==7396) : which(d==8281) ) ] <- c(1:6)
> d
Matrices
Build a matrix using matrix(…)
> num <- c( 1:12 )
> mat <- matrix( num, nrow=3 )
Check modes of the variables
> is.vector( num )
> is.matrix( mat )
> dim( mat )
Dimensions and structure of the matrix.
Convention: first rows, then columns
> c( nrow( mat ), ncol( mat ) )
> str( mat )
For further examples, we load a random number matrix:
> load( "Matrix1.RData" )
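For readers without the file Matrix1.RData, a stand-in can be built as sketched below. This is only an assumption about what such a matrix might look like: the dimensions, the row names and all column names other than "Expression1" are invented, and the values are random. Assigning the result to mat would allow the following commands to be tried out.

```r
set.seed( 1 )                                   # reproducible random numbers
toy <- matrix( runif( 20 ), nrow = 5 )          # 5 rows x 4 columns, filled column by column
rownames( toy ) <- paste0( "Gene", 1:5 )        # hypothetical gene names
colnames( toy ) <- paste0( "Expression", 1:4 )  # hypothetical sample names

dim( toy )                                      # 5 4: rows first, then columns
str( toy )
```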
Matrices
Accessing matrix elements using
mat[ row indices, col indices ]
> mat[ , 1 ]
> mat[ c( 1:2 ), ]
> mat[ , "Expression1" ]
> mat[ , "Expression1", drop=FALSE ]
drop=FALSE prevents conversion to vector
Very common: select specific genes
> on.thresh <- 0.5
> mat[ which( mat[ , "Expression1" ] >= on.thresh ) , ]
> mat[ which( mat[ , "Expression1" ] < on.thresh ) , ] <- NA
NA: built-in constant for missing values ("not available")
> mat <- mat[ which( mat[ , "Expression1" ] >= on.thresh ) , ]
> mat <- mat[ -which( mat[ , "Expression1" ] < on.thresh ) , ]
Never do that: if no row meets the condition, which returns an empty vector and all rows are dropped!!
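Why negative selection with which is risky can be seen in a small sketch. The matrix m and its values are made up; only the column name "Expression1" and the threshold on.thresh are taken from the slide.

```r
m <- matrix( c( 0.9, 0.8, 0.7, 0.1, 0.2, 0.3 ), nrow = 3,
             dimnames = list( NULL, c( "Expression1", "Expression2" ) ) )
on.thresh <- 0.5

# no row is below the threshold, so which() returns an empty index vector
below <- which( m[ , "Expression1" ] < on.thresh )
length( below )                                   # 0

# negative selection with an empty index removes ALL rows instead of none
lost <- m[ -below, , drop = FALSE ]
nrow( lost )                                      # 0

# positive selection of the rows to keep is safe
kept <- m[ which( m[ , "Expression1" ] >= on.thresh ), , drop = FALSE ]
nrow( kept )                                      # 3
```

R does not stop with an error here; the empty result only causes trouble later, which makes the mistake hard to find.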
Loops
Fill a vector with first 100 square numbers:
> s <- c( 1^2, 2^2, 3^2, 4^2 ………………..
> s <- c()
> for( i in 1:100 ) s[ i ] <- i^2
> for( i in 1:length(s) )
{
  s[ i ] <- sqrt( s[ i ] )
  print( i )
}
Simple loop in R:
for( variable in vector ) command/codeblock
!! i has a different value in each iteration !!
Use script files for multi-line commands.
> s
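A variant of the for loop, sketched with two functions that are not on the slides: numeric( n ) creates a vector of n zeros in advance, and seq_along( s ) yields the indices 1 to length( s ).

```r
s <- numeric( 100 )                  # reserve space for 100 numbers (all 0 at first)
for( i in seq_along( s ) )           # i takes the values 1, 2, ..., 100
{
  s[ i ] <- i^2
}
head( s )                            # 1 4 9 16 25 36

# the loop variable can run over any vector, not only over 1:n
for( animal in c( "rat", "emu" ) ) print( animal )
```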
Conditional execution
> i <- 5
> i < 50
> if( i < 50 ) print( "ok" )
Simple condition in R:
if( condition ) command/codeblock
Code blocks can be nested within other code blocks
Two-way condition in R:
if( condition ) command/block else command/block
> for( i in 1:100 )
{
  if( i < 50 )
  {
    s[ i ] <- i^2
  } else
  {
    s[ i ] <- i
  }
}
Conditional execution and loop control
> s <- c()
> for( i in 1:100 )
{
  if( i < 40 ) next
  if( i == 50 ) break
  s[ i ] <- i^2
}
next: proceed to the next iteration
break: leave the loop and skip all remaining iterations
> length( s )
> s
gaps in s will be filled with NA
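How the NA gaps arise, and how they might be handled, is sketched below. is.na() and the na.rm argument are not covered on the slides and are shown as one possible way to deal with such gaps.

```r
s <- c()
s[ 3 ] <- 9                          # positions 1 and 2 were never assigned
s                                    # NA NA 9
length( s )                          # 3

is.na( s )                           # TRUE TRUE FALSE
filled <- s[ !is.na( s ) ]           # keep only the assigned values: 9
sum( s, na.rm = TRUE )               # 9; without na.rm the result would be NA
```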
Complex conditions
&&: AND operator
||: OR operator
!: NOT operator
> for( i in 1:100 )
{
  if( i < 10 || i > 90 ) print( i )
}
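The operators && and || combine single TRUE/FALSE values. Their element-wise counterparts & and | are not mentioned on the slide; the sketch below shows how they could replace the loop when working with whole vectors.

```r
i <- 95
# && and || combine single TRUE/FALSE values, as needed inside if( )
if( i < 10 || i > 90 ) print( "near the border" )
inside <- !( i < 10 || i > 90 )                 # FALSE

# & and | work element by element on whole vectors
v <- 1:100
border <- v[ v < 10 | v > 90 ]                  # 1 to 9 and 91 to 100
length( border )                                # 19
middle <- v[ v >= 10 & v <= 90 ]
length( middle )                                # 81
```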
Programming tasks II
• Generate a vector containing the square numbers up to 100,000
• How many square numbers can be found in the interval [1,000, 100,000]?
• 1,801,800 is the sum of how many consecutive square numbers (starting from 1^2)?
• Generate the Fibonacci sequence (first 20 numbers): 1, 1, 2, 3, 5, 8, 13, 21, … -> f(n) = f(n-1) + f(n-2)
How many square numbers can be found in the interval [1,000, 100,000]?
> s <- c( 1:sqrt( 100000 ) )^2
> count <- 0
> for( i in 1:length( s ) )
{
  if( s[ i ] < 1000 ) next
  if( s[ i ] > 100000 ) break
  count <- count + 1
}
> count
alternative:
> count <- 0
> for( i in 1:length( s ) )
{
  if( s[ i ] >= 1000 && s[ i ] <= 100000 ) count <- count + 1
}
> count
1,801,800 is the sum of how many consecutive square numbers (starting from 1^2)?
> for( i in 1:sqrt(1801800) )
{
  if( sum( c(1:i)^2 ) == 1801800 )
  {
    print(i)
    break
  }
}
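The same question can also be answered without a loop. The sketch below uses cumsum(), which does not appear on the slides, and cross-checks the result with the closed formula for the sum of the first n squares. The upper limit of 2000 is an arbitrary choice that is large enough here.

```r
squares <- ( 1:2000 )^2
running <- cumsum( squares )                   # running[ n ] = 1^2 + 2^2 + ... + n^2
n <- which( running == 1801800 )
n                                              # 175

# cross-check with the closed formula n (n + 1) (2n + 1) / 6
n * ( n + 1 ) * ( 2 * n + 1 ) / 6              # 1801800
```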
Generate the Fibonacci sequence (first 20 numbers): 1, 1, 2, 3, 5, 8, 13, 21, … -> f(n) = f(n-1) + f(n-2)
> fib <- c(1,1)
> for( i in 3:20 ) fib[i] <- fib[i-1] + fib[i-2]
> fib
Definition of own functions
> increment <- function( x )
{
  inc <- x + 1
  return( inc )
}
functions are objects like other variables in R
return() provides the value of inc outside the
function; inc is a local variable
> increment( x=7 )
> increment( 7 )
our function is called like any other function;
parameters can be addressed by their name
> myAdd <- function( x, y=1 )
{
  return( x + y )
}
function definitions also accept multiple parameters
and default values
> myAdd( 7, 10 )
> myAdd( 7 )
> myAdd <- function( x, y=1 ) return( x + y )
a code block can be defined for single-line commands,
but it is not necessary
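Default values and named parameters can be sketched with a slightly larger function. The function rescale and its parameters lower and upper are hypothetical and serve only as an illustration.

```r
# hypothetical helper: rescale a vector to the interval [lower, upper]
rescale <- function( x, lower = 0, upper = 1 )
{
  scaled <- ( x - min( x ) ) / ( max( x ) - min( x ) )   # local variable, values in [0, 1]
  return( lower + scaled * ( upper - lower ) )
}

rescale( c( 2, 4, 6 ) )                         # 0.0 0.5 1.0 (default values used)
rescale( c( 2, 4, 6 ), upper = 10 )             # 0 5 10 (parameter addressed by name)
```

Because upper is addressed by name, lower can be skipped and keeps its default value.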
apply-family functions
apply, lapply, sapply, tapply are very handy functions.
We start with apply: apply a function to each row/column of a matrix
> apply( mat, 1, mean )
> apply( mat, 2, sum )
> rowMeans( mat )
> colSums( mat )
> high.expressed <- function(x)
{
  return( all( x > 0.2 ) )
}
> apply( mat, 1, high.expressed)
form: apply( matrix, margin, function, … )
margin: 1 = by row, 2 = by column
some very fast built-in functions:
colMeans, rowMeans, colSums, rowSums
we can write our own function, even a temporary one directly within apply
caution: x is a vector!! (one row or one column)
> apply( mat, 1, function(x) all( x > 0.2 ) )
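A sketch of apply on a small matrix whose values can be checked by hand. The matrix toy, its values and its row and column names are made up for this example.

```r
# small made-up matrix: 3 genes (rows) x 2 samples (columns)
toy <- matrix( c( 0.5, 0.1, 0.9,
                  0.3, 0.4, 0.8 ), nrow = 3,
               dimnames = list( c( "g1", "g2", "g3" ), c( "s1", "s2" ) ) )

row.means <- apply( toy, 1, mean )              # one value per row (gene)
col.sums  <- apply( toy, 2, sum )               # one value per column (sample)

# the built-in shortcuts give the same numbers
all.equal( row.means, rowMeans( toy ) )         # TRUE
all.equal( col.sums, colSums( toy ) )           # TRUE

# x is one row of toy at a time
apply( toy, 1, function( x ) all( x > 0.2 ) )   # g1 TRUE, g2 FALSE, g3 TRUE
```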
apply-family functions
apply can return a vector, a matrix, or a list, depending on what the function returns:
> apply( mat, 2, function(x) x[1] )
> apply( mat, 2, function(x) x[1:3] )
> apply( mat, 2, function(x) list( x[1:3] ) )
> apply( mat, 2, head )
> apply( mat, 2, head, 3 )
return(single value) -> apply provides a vector
return(vector) -> apply provides a matrix,
if the vector length is constant
return(otherwise) -> apply provides a list,
e.g. if return() gives vectors of changing length or a list
form: apply( matrix, margin, function, … )
… -> optional arguments for function
e.g. head( x, n ) -> x is provided automatically, n can be given
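The three kinds of return value can be sketched on a made-up 4 x 3 matrix. The variable names are invented for the example.

```r
toy <- matrix( 1:12, nrow = 4 )                 # made-up 4 x 3 matrix

one <- apply( toy, 2, function( x ) x[ 1 ] )    # single value per column
is.vector( one )                                # TRUE, length 3

two <- apply( toy, 2, function( x ) x[ 1:2 ] )  # constant length 2 per column
dim( two )                                      # 2 3

# results are always arranged column-wise, also when working by row
by.row <- apply( toy, 1, function( x ) x[ 1:2 ] )
dim( by.row )                                   # 2 4 (one column per row of toy)

# varying length per column (4, 2 and 0 values): apply falls back to a list
varying <- apply( toy, 2, function( x ) x[ x < 7 ] )
is.list( varying )                              # TRUE
```

The by.row case is a frequent source of confusion: the result is transposed relative to the input matrix.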
apply-family functions
Other apply-functions:
sapply: do some function for each element of a list or vector, returns a vector, matrix or list
> sapply( c(1:100), function(x) x^2 )
form: sapply( vector, function, … )
this avoids a for loop
!! (s)apply returns values directly,
a for loop stores them into predefined variables (e.g. vectors) !!
lapply: do some function for each element of a list or vector, always returns a list
> lapply( c(1:100), function(x) x^2 )
tapply: do some function for each group of elements of a vector, returns a vector (one value per group) or, if the function returns more than one value, a list
> tapply( c(1:100), rep(c(1:4),25), mean )
vector:   1  2  3  4  5  6  7  8  9  10  …
indices:  1  2  3  4  1  2  3  4  1  2   …
form: tapply( vector, indices, function, … )
indices do not have to be ordered
returns one value for each unique index
by: do some function for each group of rows of a matrix, returns a list (->'matrix-tapply')
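The differences between sapply, lapply and tapply can be sketched with small inputs. The text labels in the last call are made up.

```r
squares.vec  <- sapply( 1:5, function( x ) x^2 )    # vector: 1 4 9 16 25
squares.list <- lapply( 1:5, function( x ) x^2 )    # list with 5 elements
is.list( squares.list )                             # TRUE

# tapply: group 1 holds 1, 5, 9, ..., 97; group 2 holds 2, 6, ..., 98; and so on
group.means <- tapply( 1:100, rep( 1:4, 25 ), mean )
group.means                                         # 49 50 51 52, named "1" to "4"

# groups can also be labelled by text
group.sums <- tapply( c( 1, 2, 3, 4 ), c( "ctrl", "case", "ctrl", "case" ), sum )
group.sums                                          # case 6, ctrl 4
```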
Programming tasks III
• Reload the toy expression matrix (variable name is mat).
• What is the maximum expression value of each gene in mat?
• What is the expression range of each sample in mat? (range: maximum - minimum)
• Are there samples with all genes' expression exceeding 0.1?
You can use for loops, but better try apply functions!
What is the maximum expression value of each gene in mat?
> load("Matrix1.RData")
> max.expression <- c()
> for( i in 1:nrow( mat ) ) max.expression[ i ] <- max( mat[ i, ] )
> max.expression
apply-version:
> apply( mat, 1, max ) # this solution provides named vector
remember:
apply avoids the for loop
!! (s)apply returns values directly,
a for loop stores them into predefined variables (e.g. vectors) !!
What is the expression range of each sample in mat? (range: maximum - minimum)
> expression.range <- c()
> for( i in 1:ncol( mat ) ) expression.range[ i ] <- max( mat[ , i ] ) - min( mat[ , i ] )
> expression.range
apply-version:
> apply( mat, 2, max ) - apply( mat, 2, min )
another apply-version:
> apply( mat, 2, function(x) max(x) - min(x) )
Are there samples with all genes’ expression exceeding 0.1?
> n.samples <- 0
> for( i in 1:ncol( mat ) )
{
  if( all( mat[ , i ] > 0.1 ) ) n.samples <- n.samples + 1
}
> n.samples
apply-version:
> any( apply( mat, 2, function(x) all( x > 0.1 ) ) )
another apply-version:
> any( apply( mat, 2, min ) > 0.1 )
Scope and variable visibility
R is flexible regarding types (type definition, type changes) and visibility (global variables)
A little theory… there are three types of variables in R:
• Formal parameters: parameters of functions, locally visible
• Local variables: created within a function, locally visible
• Free variables: all others
In the example: x is a formal parameter, y is a local variable, z is a free variable
> z <- 3
> f <- function( x )
{
  y <- 2 * x
  print( z )
}
Scope and variable visibility
> z <- 3
> f <- function( x )
{
  z <- 5
  print( z )
}
> z      # gives 3
> f( 1 ) # prints 5
> z      # gives 3
'free z' is hidden behind 'local z', a completely different
variable which just happens to have the same name
we can access hidden variables using <<-
> z <- 3
> f <- function( x )
{
  z <<- 5
  print( z )
}
> z      # gives 3
> f( 1 ) # prints 5
> z      # gives 5
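The difference between <- and <<- inside a function can also be sketched with a counter. The names counter, count.local and count.global are made up for this example.

```r
counter <- 0                         # free variable, defined outside the functions

count.local <- function()
{
  counter <- counter + 1             # reads the free variable, writes a new local one
  return( counter )
}

count.global <- function()
{
  counter <<- counter + 1            # <<- changes the variable outside the function
  return( counter )
}

count.local()                        # 1
counter                              # still 0
count.global()                       # 1
counter                              # now 1
```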
Loops II
What is the sum of the first 100 square numbers?
> sum( c( 1:100 )^2 ) # I know, but…
> s <- 0
> i <- 1
> while( i <= 100 )
{
  s <- s + i^2
  i <- i + 1
}
> s
Conditional loop in R:
while( condition ) command/codeblock
Usually, a counter (i) or other marker is initialized and
modified with each iteration
1,801,800 is the sum of how many successive square numbers?
> s <- 0
> counter <- 0
> while( s < 1801800 )
{
  counter <- counter + 1
  s <- s + counter^2
}
> counter
Loops II
> for( i in 1:100 ) { … }
> i <- 1
> while( i <= 100 )
{
  …
  i <- i + 1
}
for loop: the number of iterations must be known in advance
while loop: generalized loop
-> for loops can be rewritten as while loops
Programming task IV
Let's imagine you have 10 bacteria cells. You put your bacteria into medium and let them grow. Every
time unit, each bacterium produces an amount of 6 metabolites and divides once. You let the bacteria
grow until the amount of metabolites is more than 1,000,000. How many bacteria do you have when
this happens? How many time units have passed?
In a first version, use simple variables to store the numbers. In a second one, use vectors (the code of
version 1 can be easily modified).
Hint: Use a while loop in which you double the number of bacteria and calculate the
amount of metabolites in each iteration. We need a time variable.
Programming task IV: simple version
> n.bact <- 10
> n.meta <- 0
> t <- 1
> while( n.meta <= 1000000 )
{
  n.meta <- n.meta + n.bact * 6
  n.bact <- n.bact * 2
  t <- t + 1
}
> n.bact
> t - 1 # elapsed time units, because t started at 1
Programming task IV: vector version
> n.bact <- c( 10 )
> n.meta <- c( 0 )
> t <- 1
> while( n.meta[ t ] <= 1000000 )
{
  n.meta[ t + 1 ] <- n.meta[ t ] + n.bact[ t ] * 6
  n.bact[ t + 1 ] <- n.bact[ t ] * 2
  t <- t + 1
}
> n.bact
> barplot( n.bact )
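As a cross-check of the loop results, the numbers can also be written down directly. This sketch follows the reading used in the solution above (metabolites are produced first, then the bacteria divide), under which there are 10 * 2^k bacteria and 60 * ( 2^k - 1 ) metabolites after k time units. The variable names k and first and the limit of 20 time units are chosen for this illustration only.

```r
k <- 0:20                                        # time units
n.bact <- 10 * 2^k                               # bacteria after k time units
n.meta <- 60 * ( 2^k - 1 )                       # metabolites after k time units

first <- min( which( n.meta > 1000000 ) )        # first position exceeding the limit
k[ first ]                                       # 15 time units
n.bact[ first ]                                  # 327680 bacteria
n.meta[ first ]                                  # 1966020 metabolites
```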