Lada Adamic
SI 544
March 5th, 2008
Simple linear regression and correlation
1 Dealing with missing values
Now that we are processing data to make inferences and predictions, our R tools may start to complain about the missing values, the 'NA's that are hiding out in our data. R will want to know what to do with them: should it exclude all rows with NAs, or just keep complaining about them? What do we want it to do? Well, if we go about our business and try to ignore the problem, the following happens. Let's load our libraries data yet again; we know that there are a lot of missing values in there:
libraries =
read.table("http://www-personal.umich.edu/~ladamic/courses/si544w08/data/libraries.dat",
sep="\t",quote="\"",head=T)
attach(libraries)
By the way, if you get messages like
> attach(libraries)
The following object(s) are masked from libraries ( position 5 ) :
ADDRESS ATTEND AUDIO BENEFIT BKMOB BKVOL BRANLIB C_LEGBASE C_RELATN
it is probably because you have attached libraries previously; just keep typing detach(libraries) until R says it has never heard of libraries, and then you can attach it again.
Suppose I just want to find out the correlation between the total expenditures on the library collection
and other operating expenditures excluding salaries and the collection.
> cor(TOTEXPCOL,OTHOPEXP)
Error in cor(TOTEXPCOL, OTHOPEXP) : missing observations in cov/cor
Well, that is because OTHOPEXP is chock full of missing values:
> OTHOPEXP[1:20]
 [1]      NA 2609487      NA    8590      NA      NA      NA      NA      NA
[10]   13389      NA      NA      NA      NA      NA      NA      NA      NA
[19]  275126      NA
I have a few choices. I could compare only those rows where there are no missing values.
> nonmissingothopexp = (!is.na(OTHOPEXP))
> cor(TOTEXPCOL[nonmissingothopexp],OTHOPEXP[nonmissingothopexp])
[1] 0.8726715
I’ve used the is.na() function, which returns TRUE if the value is undefined and FALSE otherwise.
In this case I wanted to keep just the defined values, and hence the !is.na (meaning NOT missing). This
can get a bit tiresome though if I want to exclude all rows where any column entry is undefined (e.g.
!is.na(TOTEXPCOL) & !is.na(POPU) & !is.na(BKMOB) ...).
If I simply want to exclude all rows that contain any missing values, I can use the complete.cases
function to get an index of all complete rows:
> cc = complete.cases(libraries)
> librariescc = libraries[cc,]
> sum(cc)
[1] 5080
In this case we’ve unfortunately lost about 3,000 data points. In general, this is a quick way to omit
observations with missing data.
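If you only need completeness in a handful of columns, a less tiresome alternative to chaining !is.na() is to hand just those columns to complete.cases(); a small sketch (the column choice here is just for illustration):
# keep rows that are complete in just these three columns
keep = complete.cases(libraries[, c("TOTEXPCOL", "POPU", "BKMOB")])
sum(keep)   # how many rows survive this narrower check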
Most simply, we can tell R to ignore the missing values as it is computing the correlation coefficient using
the use option.
cor(TOTEXPCOL,OTHOPEXP,use="complete.obs")
In general, if you are unsure of how a function deals with missing values, bring up the help page. For
example, if you are taking the mean of a column, you can set na.rm = T.
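For instance, a one-line sketch:
mean(OTHOPEXP, na.rm=T)   # ignore the NAs when averaging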
Finally, we can set an option that will tell R to ignore missing values for functions like linear regression
(lm()) using options().
> options(na.action = na.exclude)
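One detail worth knowing (our note, not in the original handout): na.exclude, unlike na.omit, pads the results of fitted() and resid() with NAs so they still line up with the rows of the original data:
# with na.action = na.exclude (set above), NA rows are dropped for fitting,
# but fitted() and resid() are padded with NAs to match the original rows
fit = lm(OTHOPEXP ~ TOTEXPCOL)
length(fitted(fit)) == nrow(libraries)   # TRUE under na.exclude (FALSE under na.omit)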
2 A preview of simple linear regression
> lm(OTHOPEXP ~ TOTEXPCOL)
Call:
lm(formula = OTHOPEXP ~ TOTEXPCOL)
Coefficients:
(Intercept)    TOTEXPCOL
  10946.548        1.284
> summary(lm(OTHOPEXP ~ TOTEXPCOL))
Call:
lm(formula = OTHOPEXP ~ TOTEXPCOL)
Residuals:
     Min       1Q   Median       3Q      Max
-7401788   -33716   -13330     7266 11564481

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.095e+04  5.747e+03   1.905   0.0569 .
TOTEXPCOL   1.284e+00  1.008e-02 127.364   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 392900 on 5079 degrees of freedom
Multiple R-Squared: 0.7616,     Adjusted R-squared: 0.7615
F-statistic: 1.622e+04 on 1 and 5079 DF,  p-value: < 2.2e-16
# plot on a log-log scale
plot(TOTEXPCOL,OTHOPEXP,log="xy")
yfitted = fitted(lm(OTHOPEXP[nonmissingothopexp] ~ TOTEXPCOL[nonmissingothopexp]))
# add the linear fit
points(TOTEXPCOL[nonmissingothopexp],yfitted,col="green",cex=0.5)
So we've got a nifty little formula: for each dollar a library spends on its collection, it spends around $1.28 on other expenses, excluding salaries.
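To put the formula to work, here is an illustrative sketch (the $100,000 collection budget is a made-up input):
co = coef(lm(OTHOPEXP ~ TOTEXPCOL))
unname(co[1] + co[2] * 100000)   # predicted other expenses for a $100,000 collection budget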
[Figure: log-log scatterplot of OTHOPEXP against TOTEXPCOL, with the fitted values overlaid in green.]
Above is a log-log plot showing the fit and the approximately linear relationship between the two expense variables.
Let's now return to the correlations. By default, cor() computes Pearson's correlation coefficient, which, as we learned last time, corresponds to the signed square root of the coefficient of determination. It assumes that the two variables are normally distributed and that the observations within each variable are independent. If |r| = 1, then the relationship between the two variables is perfectly linear.
r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \sum_i (y_i - \bar{y})^2}}          (1)
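As a quick sanity check (our sketch), the r that cor() reports should match the signed square root of the R-squared that lm() reported above for the same pair of variables:
r = cor(TOTEXPCOL, OTHOPEXP, use="complete.obs")
r2 = summary(lm(OTHOPEXP ~ TOTEXPCOL))$r.squared
r^2 - r2   # essentially zero: r is the signed square root of R^2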
We can get just the correlation coefficient using the cor() function, but we should also usually check the significance of that coefficient, especially if we have a small sample. Significance simply means that the probability is small that you drew a sample with the given correlation when the correlation in the underlying population is actually 0. For example, we can use the cor.test() function, which, in addition to giving us the t statistic and the corresponding p-value, doesn't mind missing entries in the data as long as we have the proper global na.action option set:
> cor.test(TOTEXPCOL,OTHOPEXP)

        Pearson's product-moment correlation

data:  TOTEXPCOL and OTHOPEXP
t = 127.3639, df = 5079, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.8659536 0.8790744
sample estimates:
      cor
0.8726715
So we know that not only is our correlation high, but we also have a narrow confidence interval for its
value.
Now let’s have some fun. Can you find the two most correlated columns in the library data?
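One possible approach, sketched here under the assumption that we restrict ourselves to the numeric columns:
num = libraries[, sapply(libraries, is.numeric)]
cmat = cor(num, use="pairwise.complete.obs")
diag(cmat) = NA   # ignore the trivial self-correlations
which(abs(cmat) == max(abs(cmat), na.rm=T), arr.ind=T)   # the winning pair appears twice, by symmetry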
3 Nonparametric correlation tests
Sometimes the assumption of normality in the data just does not hold. In this case, one may want to use
nonparametric methods.
3.1 Spearman's ρ
Spearman's ρ works basically like Pearson's r, except that you take each observation's rank rather than the actual value, and then plug it into the equation above. Sometimes this will give worse results (that is, a lower correlation) if your data is actually close to normally distributed, because you lose accuracy with the substitution. However, if your data is not normally distributed and has fairly high variance, it can actually give you a higher correlation. A case in point are variables that we've looked at before, such as the total income of libraries per population served.
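To see the rank substitution concretely, here is a small check (our sketch): Pearson's r computed on the ranks reproduces Spearman's ρ.
x = TOTINCM/POPU; y = ATTEND/POPU
ok = complete.cases(x, y)
cor(rank(x[ok]), rank(y[ok]))            # Pearson's r on the ranks ...
cor(x[ok], y[ok], method="spearman")     # ... equals Spearman's rho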
> cor.test(TOTINCM/POPU,ATTEND/POPU)

        Pearson's product-moment correlation

data:  TOTINCM/POPU and ATTEND/POPU
t = 61.4503, df = 8926, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.5304939 0.5596512
sample estimates:
      cor
0.5452374
> cor.test(TOTINCM/POPU,ATTEND/POPU,method="spearman")

        Spearman's rank correlation rho

data:  TOTINCM/POPU and ATTEND/POPU
S = 43762711641, p-value < 2.2e-16
alternative hypothesis: true rho is not equal to 0
sample estimates:
      rho
0.6310284

Warning message:
Cannot compute exact p-values with ties in:
cor.test.default(TOTINCM/POPU, ATTEND/POPU, method = "spearman")
The above is a case in point: comparing the total income per population with the attendance per population, the Pearson correlation coefficient gives a substantially lower correlation than Spearman's rank correlation. Other cases where this may be useful include comparing the indegree and PageRank of a webpage; those distributions are heavy-tailed (meaning not normally distributed).
3.2 Kendall's τ
Kendall's τ counts the number of concordant and discordant pairs. Let's say we are looking at attendance and population. If both attendance and population are higher for library A than for library B, then this is a concordant pair. If, however, library A has higher attendance but serves a smaller population than B, then this is a discordant pair. If the two variables are uncorrelated (not a likely outcome in this scenario), then you would expect about the same number of concordant and discordant pairs. Kendall's τ does this comparison for all pairs of libraries, which would be pretty computationally intensive, and also unnecessary; it is a measure better suited to small data sets with a limited range of discrete outcomes. Do you remember how Kendall's τ was used in the handedness study?
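To make the pair counting concrete, here is a toy sketch of τ (the version without tie corrections, so it matches cor() only when there are no ties):
x = c(1, 2, 3, 4); y = c(1, 3, 2, 4)
p = combn(length(x), 2)                               # all pairs of observations
s = sign(x[p[1,]] - x[p[2,]]) * sign(y[p[1,]] - y[p[2,]])
c(concordant=sum(s > 0), discordant=sum(s < 0))       # 5 concordant, 1 discordant
(sum(s > 0) - sum(s < 0)) / ncol(p)                   # matches cor(x, y, method="kendall")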
Where it is often used in SI-type applications is inter-rater agreement. You have 'experts' (usually students looking to earn a buck) rate something (questions, documents, etc.). You want to know that these ratings are accurate, and this is the case if the raters can independently agree.
Here we will be looking at a data set collected by (now graduated) PhD student Jun Zhang. He had two students rate the expertise of members of a Java programming forum based on the posts of those members. The scale was 1–5, ranging from most to least expert.
> javair = read.table("javaforuminterrater.txt",head=T)
> cor.test(javair$rater1,javair$rater2,method="kendall")

        Kendall's rank correlation tau

data:  javair$rater1 and javair$rater2
z = 9.0708, p-value < 2.2e-16
alternative hypothesis: true tau is not equal to 0
sample estimates:
      tau
0.6428882

> cor.test(javair$rater1,javair$rater2,method="pearson")

        Pearson's product-moment correlation

data:  javair$rater1 and javair$rater2
t = 11.9907, df = 132, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.6295492 0.7943661
sample estimates:
      cor
0.7220486

> cor.test(javair$rater1,javair$rater2,method="spearman")

        Spearman's rank correlation rho

data:  javair$rater1 and javair$rater2
S = 105341.6, p-value < 2.2e-16
alternative hypothesis: true rho is not equal to 0
sample estimates:
      rho
0.7372995
In this case, the Spearman correlation coefficient does better than either Kendall’s tau or Pearson’s r.
The correlation is pretty high, but not incredibly so. We’ll revisit inter-rater agreement when we discuss
tabular data.
4 Simple linear regression (again, this time in greater detail)
Remember that simple linear regression means that we're assuming a linear relationship, we're drawing a line through it, and then we can predict values for yet-unobserved data points. We'll use the data file poliblog.txt. For 40 political blogs, it has the number of citations they received from the posts of a large set of blogs in fall of 2004 ("citations"), the number of times they appeared on blogrolls of left-leaning blogs ("leftlinks"), and the number of times they appeared on blogrolls of right-leaning blogs ("rightlinks"). You are trying to determine how the number of citations depends on the number of blogroll links.
We'll sum the blogroll links from left- and right-leaning blogs to get the total number of blogrolls the blog appeared on. Plot the scatterplot of citations vs. total links. We'll add a regression line, as well as prediction and confidence intervals, to the plot. We also want to know whether the slope of the line is significantly different from 0.
> # ------------ POLITICAL BLOGS -------------
> # first read in the political blog data set
> poliblog = read.table("poliblog.txt",head=T)
> # get the total number of blogroll links
> # both from the left and right
> totlinks = poliblog$leftlinks + poliblog$rightlinks
> citations = poliblog$citations
> # regress
> lm.poliblogs = lm(citations ~ totlinks)
> summary(lm.poliblogs)
Call:
lm(formula = citations ~ totlinks)

Residuals:
    Min      1Q  Median      3Q     Max
-2477.2  -560.9  -214.9   554.2  3139.0

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -245.617    663.214  -0.370    0.715
totlinks      25.089      4.656   5.388 4.04e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1275 on 18 degrees of freedom
Multiple R-Squared: 0.6173,     Adjusted R-squared: 0.596
F-statistic: 29.03 on 1 and 18 DF,  p-value: 4.041e-05
The slope is positive and significant. It looks like on average each additional blogroll link from one of the
1,500 political blogs corresponds to 25 additional citations from the larger blogosphere.
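If we want an interval rather than a point estimate for that slope, confint() provides one; a quick sketch:
confint(lm.poliblogs)   # 95% confidence intervals for the intercept and slope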
> # create a new data frame, just for plotting out the
> # prediction intervals
> # it seems we do this just to have an ordered set
> # of data points, in this case from 30 to 300
> pred.frame = data.frame(totlinks = 30:300)
> # get the upper and lower prediction intervals
> pp = predict(lm.poliblogs, int="p", newdata = pred.frame)
> # get the upper and lower confidence intervals
> pc = predict(lm.poliblogs, int="c", newdata = pred.frame)
> # make the scatter plot
> plot(totlinks,citations,ylim=range(citations, pp),
+      xlab="number of incoming blogroll links",
+      ylab="number of incoming post citations",
+      cex = 1.5, cex.axis=1.5, cex.lab=1.5)
>
> pred.totlinks = pred.frame$totlinks
>
> # add the prediction and confidence intervals using matlines()
> matlines(pred.totlinks, pc, lty=c(1,2,2), col="black")
> matlines(pred.totlinks, pp, lty=c(1,3,3), col="black")
Suppose I somehow left out a blog that had 250 blogroll links pointing to it. Let’s give an estimate and
the prediction interval for the number of citations that blog would have received.
> pred.frame = data.frame(totlinks = c(250))
> predict(lm.poliblogs, int="p", newdata=pred.frame)
          fit      lwr     upr
[1,] 6026.578 3036.027 9017.13
[Figure: scatterplot of the number of incoming post citations against the number of incoming blogroll links, with the regression line, confidence bands (dashed), and prediction bands (dotted) overlaid.]
The prediction interval is fairly broad: with 95% certainty we can expect anything between 3,036 and 9,017 citations.
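For contrast (a sketch of ours), the confidence interval for the mean response at the same point is much narrower than the prediction interval for an individual blog:
predict(lm.poliblogs, int="c", newdata=data.frame(totlinks=250))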
Just one last thing, we’ll check that the residuals are normal:
> qqnorm(resid(lm.poliblogs))
Yup, doesn’t look too far from normal.
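If an eyeball check isn't enough, a formal test is available as well; a sketch (with so few residuals its power is limited):
shapiro.test(resid(lm.poliblogs))   # null hypothesis: the residuals are normal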
[Figure: normal Q-Q plot of the residuals, theoretical quantiles against sample quantiles.]