Exploratory Data Analysis

Regression and
Programming in R
Anja Bråthen Kristoffersen
Biomedical Research Group
http://cran.r-project.org/doc/contrib/Short-refcard.pdf
R Reference Card
Simple linear regression
• Describes the relationship between two
variables x and y
y    x  
• The numbers α and β are called parameters,
and ϵ is the error term.
2014.01.15
3
Example data
• Waiting time between eruptions and the duration
of the eruption for the Old Faithful geyser in
Yellowstone National Park, Wyoming, USA.
head(faithful)
eruptions waiting
1 3.600 79
2 1.800 54
3 3.333 74
4 2.283 62
5 4.533 85
6 2.883 55
2014.01.15
4
Estimated
parameters
P-value
2014.01.15
Coefficient of Determination
5
Coefficient of determination
• the quotient of the variances of the fitted
values and observed values of the dependent
variable.
r
2014.01.15
2

  yˆ
 y
i
 y
2
i
 y
2
6
Prediction
• develop a 95% confidence interval of the
mean eruption duration for the waiting time
of 80 minutes
newdata <- data.frame(waiting=80)
predict(eruption.lm, newdata, interval="confidence")
fit
lwr
upr
1 4.1762 4.1048 4.2476
2014.01.15
7
Residuals
• The difference between the observed data of the
dependent variable y and the fitted values ŷ.
Residual
 y  yˆ
eruption.res <- resid(eruption.lm)
plot(faithful$waiting, eruption.res, ylab="Residuals",
xlab="Waiting Time", main="Old Faithful Eruptions")
abline(0,0,
col = "red", lwd = 2)
2014.01.15
8
2014.01.15
9
Standardized residual
• the residual divided by its standard deviation
Standardiz
ed residual
i
residual

standard
deviation
i
of residual
i
eruption.stdres = rstandard(eruption.lm)
plot(faithful$waiting, eruption.stdres,
ylab="Standardized Residuals", xlab="Waiting Time",
main="Old Faithful Eruptions")
abline(0, 0, col ="red", lwd = 2)
2014.01.15
10
2014.01.15
11
Normal Probability Plot of
Residuals
qqnorm(eruption.stdres,
ylab="Standardized Residuals",
xlab="Normal Scores",
main="Old Faithful Eruptions")
qqline(eruption.stdres, col = "red", lwd=3)
2014.01.15
12
Generalized additive model
• Can replace the linear relationship between
response and variable, as in linear regression,
with a non-linear relationship. A spline.
2014.01.15
13
GAM
install.packages("mgcv")
library(mgcv)
This is mgcv 1.5-5 . For overview type
`help("mgcv-package")'.
eruption.gam <- gam(eruptions ~ 1+s(waiting),
data=faithful)
plot(eruption.gam)
2014.01.15
14
plot(eruption.gam)
2014.01.15
15
p-value
2014.01.15
Coefficient of
16
Determination
eruption.res.gam <- resid(eruption.gam)
plot(faithful$waiting, eruption.res.gam, ylab="Residuals",
xlab="Waiting Time", main="Old Faithful Eruptions")
abline(0,0, col="red", lwd=2)
2014.01.15
17
eruption.res.gam <- resid(eruption.gam)
qqnorm(eruption.res.gam)
qqline(eruption.res.gam, col="red", lwd=3)
2014.01.15
18
Multiple regresion
• describes a dependent variable y (response)
by independent variables x1, x2, ..., xp (p > 1) is
expressed by the equation
y  

 k xk  
k
where the numbers α and βk (k = 1, 2, ..., p)
are the parameters, and ϵ is the error term.
2014.01.15
19
Multiple regresion
Explore your data
• Explore your data before starting the analysis
– Are the responses correlated?
– Use plot (trellis graphics, boxplot)
– Use correlation (pearson, spearman)
2014.01.15
20
DLBCL patient data
• Response: Germinal.cnter.B.cell.signature
• Explanatory variables
–
–
–
–
2014.01.15
2014.01.15
Lymph.node.signature
Proliferation.signature
BMP6
MHC.class.II.signature
21
21
pairs(dat[,8:11])
2014.01.15
22
cor(dat[,8:11])
2014.01.15
23
p-vaules
Not significant
p values
2014.01.15
24
Adjusted Coefficient of Determination
Adjusted Coefficient of
Determination
R adj  1  (1  R )
2
2
n 1
n  p 1
Hvor n er antall observasjoner og
p er antall parametere brukt i modellen
2014.01.15
25
Comparing models
• As long as analysis are done on the same responses
you can compare your models by information
criteria:
• AIC (Akaike's An Information Criterion)
-2*log-likelihood + 2*npar
• BIC (Schwarz's Bayesian criterion)
-2*log-likelihood + log(n)*npar
• Goal: as small AIC or BIC as possible, i.e. explain most
with less parameters
2014.01.15
26
All p values are significant
The AIC value is less, chose this model
Adjusted coefficient of
determination hardly
changed
27
fit2.res <- resid(fit2)
fit2.fitted <- fitted(fit2)
pairs(fit2.fitted, fit2.res, col = "darkgreen", pch="*")
2014.01.15
28
Logistic regression
• We use the logistic regression equation to
predict the probability of a dependent
variable that takes on only the values 0 and 1.
Suppose x1, x2, ..., xp (p > 1) are the
independent variables, α, βk (k = 1, 2, ..., p) are
the parameters, and E(y) is the expected value
of the dependent variable y
1
E ( y) 
2014.01.15
1 e

 



k

 k xk 


29
DLBCL patient data
• Response: alive or dead
• Explanatory variables
–
–
–
–
–
–
–
2014.01.15
Subgroup
IPI.Group
Germinal.center.B.cell.signature
Lymph.node.signature
Proliferation.signature
BMP6
MHC.class.II.signature
30
DLBCL.glm <- glm(Status.at.follow.up ~ Subgroup + IPI.Group +
Germinal.center.B.cell.signature + Lymph.node.signature +
Proliferation.signature + BMP6 + MHC.class.II.signature,
family= "binomial", data = dat)
Here
logistic
regression
is chosen
2014.01.15
31
DLBCL2.glm <- glm(Status.at.follow.up ~ IPI.Group,
family= "binomial", data = dat)
summary(DLBCL2.glm)
High Low Medium missing NA's
32
82
108
1
17
The survival of
those containing
the IPI.group Low
is significantly
different from
those that are in
the group High
2014.01.15
The estimate
is negative,
hence pations
in group Low
have less
probability to
die then those
in group High
32
Backward an forward inclusion of
response variables
• Forward
– Start with all response variables, exclude the one
that is least significant, compare AIC values.
• Backward
– Start with one regression for each response
variable
– Include more and more response variables that
are significant in the model, compare AIC values
2014.01.15
33
Programming in R
2014.01.15
34
Function
panel.hist <- function(x){
usrx <- par("usr")
on.exit(par(usrx))
par(usr = c(usrx[1:2], 0, 1.5) ) # indicates the position in the plot
hi <- hist(x, plot = FALSE)
# calculate histogram without plotting it
Breaks <- hi$breaks
# define breaks used in h
nB <- length(Breaks)
# count the number of breaks
y <- hi$counts
# find the counts in each interval
y <- y/max(y)
# scale y for plotting
rect(Breaks[-nB], 0, Breaks[-1], y) #plots rectangulars in existing plot
}
pairs(dat[,7:11], col = dat[,5], cex = 0.5, pch = 24, diag.panel = panel.hist,
upper.panel=NULL)
2014.01.15
36
2014.01.15
37
2014.01.15
38
apply()
• Use apply when you want the same thing
done on every row or column of a matrix.
• apply(d, 1, functionA) will use functionA on
every row of dataset d.
• apply(d, 2, functionA) will use functionA on
every column of dataset d.
2014.01.15
39
Example: apply()
d <- matrix(runif(90), ncol = 10, nrow = 9)
head(d)
par(mfrow = c(3,3))
apply(d, 1, hist)
2014.01.15
40
for – loop
• for (name in expr_1) expr_2
– name on variable: i
– expr_1 should be 1:antall
– expr_2 should be
sum(dbinom(1:6, 8, p[i])), but you have to save it
somewhere:
beta8[i] <- sum(dbinom(1:6, 8, p[i])) then you
have to create beta8 before the loop.
for – loop
p <- seq(0.6, 1, 0.01)
antall <- length(p)
beta8 <- rep(NA, antall) #allocate space for beta8
for(i in 1:antall){
beta8[i] <- sum(dbinom(1:6, 8, p[i]))
}
power8 <- 1 - beta8
plot(p, power8, type = "l")
2014.01.15
43
if statment
• if (expr_1) expr_2 else expr_3
• expr_1 is a statement that is either true or
false
• expr_2 is preformed if expr_1 is true
• expr_3 is optional, but preformed if expr_1 is
false and else statement is included
Example: if()
x <- runif(1,0,1)
y <- rnorm(1,0,1)
if(x > y){
print(paste("x =", round(x,3), "is greater
than y =", round(y,3), sep = " "))
} else {
print(paste("x =", round(x,3), "is less than
y =", round(y,3), sep = " "))
}
[1] "x = 0.783 is greater than y 0.536"
2014.01.15
45
When finished
• You could save your workspace in R, you could then save the workspace in
different project folders. I do not do that!
• You could just save your script and run it once again if you want to work
further on it.
• You could write your partial results (matrix) that you want to work further
on to a .txt file, use write.table(objectname, ”path”, sep = ”\t”)
– objectname is the name of your object in R
– ”path” is your path to where you will save the object including the file
name of the object.
– sep = ”\t” ensure that your txt file is tabulate separated and makes it
easier to open in excel.

Download Report

Exploratory Data Analysis

Paperzz.com

Your Paperzz