R PRACTICAL 2 8 December, 2015 (1)

R PRACTICAL 2
DAVID STEINSALTZ
8 December, 2015
(1) Make a plot of the graphs of the functions $y = x^p$ for $p = \tfrac14, \tfrac12, 1, 2, 3$, and
$x \in [0, 2]$, all on the same set of axes, with different colours and line types.
For an extra challenge use the function legend to add a legend that explains
the plot.
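One possible way to draw the plot in (1) is sketched below. It is not the official solution: the grid of 201 x-values, the colour choices and the use of matplot are all assumptions made for illustration.

```r
# Sketch: y = x^p on [0, 2] for several powers, all on one set of axes.
x <- seq(0, 2, length.out = 201)          # grid size is an arbitrary choice
p <- c(1/4, 1/2, 1, 2, 3)
cols <- c("black", "red", "blue", "darkgreen", "purple")  # illustrative colours

# One column of y per power p
y <- sapply(p, function(pw) x^pw)

# matplot draws every column of y against x with its own colour and line type
matplot(x, y, type = "l", col = cols, lty = 1:5, lwd = 2,
        xlab = "x", ylab = "y")

# Legend matching the colours and line types used above
legend("topleft", legend = paste("p =", c("1/4", "1/2", "1", "2", "3")),
       col = cols, lty = 1:5, lwd = 2)
```

The same picture could be built with plot followed by repeated calls to lines; matplot just saves the loop.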
(2) (a) Using R compute the following:
(i) P{X = 112} where X is binomial with n = 200, p = 0.6.
$P\{X = 112\} = \binom{200}{112}\, 0.6^{112}\, 0.4^{88} = 0.0293.$
> dbinom(112,200,.6)
[1] 0.02933229
(ii) P{X ≥ 4} where X is Poisson with parameter 8.
$P\{X \ge 4\} = 1 - e^{-8} \sum_{k=0}^{3} \frac{8^k}{k!} = 0.958.$
> 1-ppois(3,8)
[1] 0.9576199
(iii) P{1 < X < 2} where X is Exponential with parameter 2. The
density is $2e^{-2x}$ for $x \ge 0$. So
$P\{1 < X < 2\} = 2\int_1^2 e^{-2y}\,dy = e^{-2} - e^{-4} = 0.117.$
> pexp(2,2)-pexp(1,2)
[1] 0.1170196
(iv) P{X < 2} where X is normal with mean 3 and variance 7.
$Z = (X - 3)/\sqrt{7}$ is standard normal. So $P\{X < 2\} = P\{Z < (2 - 3)/\sqrt{7}\} = P\{Z < -0.378\}$.
> pnorm(2,3,sqrt(7))
[1] 0.3527285
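As a cross-check, the four formulas in part (a) can be evaluated directly and compared with the built-in distribution functions. This is a minimal sketch; the variable names are made up for illustration.

```r
# Evaluate each closed-form expression from part (a) by hand
p_binom <- choose(200, 112) * 0.6^112 * 0.4^88           # (i)   binomial point mass
p_pois  <- 1 - exp(-8) * sum(8^(0:3) / factorial(0:3))   # (ii)  Poisson upper tail
p_exp   <- exp(-2) - exp(-4)                             # (iii) exponential interval
p_norm  <- pnorm((2 - 3) / sqrt(7))                      # (iv)  standardised normal

formula_vals <- c(p_binom, p_pois, p_exp, p_norm)

# The same probabilities from R's d/p functions, as used in the solutions
builtin_vals <- c(dbinom(112, 200, 0.6),
                  1 - ppois(3, 8),
                  pexp(2, 2) - pexp(1, 2),
                  pnorm(2, 3, sqrt(7)))

print(rbind(formula_vals, builtin_vals))
```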
(b) Make 1000 simulations of each distribution in the previous part. Plot
a histogram of each set of simulations, and estimate the desired probability from the simulated outcomes.
> x1=rbinom(1000,200,.6)
> x2=rpois(1000,8)
> x3=rexp(1000,2)
> x4=rnorm(1000,3,sqrt(7))
> mean(x1==112)
[1] 0.03
> mean(x2>=4)
[1] 0.961
> mean(x3<2&x3>1)
[1] 0.115
> mean(x4<2)
[1] 0.361
> # Repeating the simulation gives slightly different estimates
> x1=rbinom(1000,200,.6)
> x2=rpois(1000,8)
> x3=rexp(1000,2)
> x4=rnorm(1000,3,sqrt(7))
> mean(x1==112)
[1] 0.021
> mean(x2>=4)
[1] 0.949
> mean(x3<2&x3>1)
[1] 0.117
> mean(x4<2)
[1] 0.35
> par(mfrow=c(2,2)) # Divide the plot window into 2x2 sections
> hist(x1)
> hist(x2)
> hist(x3)
> hist(x4)
[Figure: 2x2 grid of histograms titled Histogram of x1, Histogram of x2, Histogram of x3 and Histogram of x4; each panel plots Frequency against the simulated values.]
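The simulation estimates can be set side by side with the exact values from part (a). The sketch below does this reproducibly; the seed value and the object names are assumptions, so the numbers will differ from the runs printed above.

```r
set.seed(2015)   # arbitrary seed, chosen only so the run can be repeated
n_sim <- 1000

x1 <- rbinom(n_sim, 200, 0.6)
x2 <- rpois(n_sim, 8)
x3 <- rexp(n_sim, 2)
x4 <- rnorm(n_sim, 3, sqrt(7))

# Proportion of simulated values falling in each event
estimate <- c(binomial    = mean(x1 == 112),
              poisson     = mean(x2 >= 4),
              exponential = mean(x3 > 1 & x3 < 2),
              normal      = mean(x4 < 2))

# Exact probabilities for the same events
exact <- c(binomial    = dbinom(112, 200, 0.6),
           poisson     = 1 - ppois(3, 8),
           exponential = pexp(2, 2) - pexp(1, 2),
           normal      = pnorm(2, 3, sqrt(7)))

print(round(cbind(estimate, exact, error = estimate - exact), 4))
```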
(c) [optional] How accurate might we expect these simulations to be? One
way to estimate accuracy is to do multiple repetitions, and see how
much the answers vary. Try this. (This is a version of the method
called bootstrap.) Example: We do this for the normal example.
> # Make 1000 samples, each one the mean of the estimates of P(X<2) from 10
> y=sapply(1:1000,function(i) mean(rnorm(1000,3,sqrt(7))<2))
> sd(y)
[1] 0.01474859
We typically assume that an estimate is likely to be within 2 SDs of
the true value. (This depends on the normal approximation, which
you will learn about. This is the theory of confidence intervals, which
you also will learn about.)
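The simulated SD can also be compared with the usual formula for the standard error of a proportion, sqrt(p(1-p)/n). The following is a sketch under that assumption; the seed and the names p_true, se_theory and coverage are illustrative only.

```r
# Exact probability and the binomial standard error for n = 1000 draws
p_true    <- pnorm(2, mean = 3, sd = sqrt(7))
se_theory <- sqrt(p_true * (1 - p_true) / 1000)

# Repeat the 1000-draw estimate 1000 times, as in the example above
set.seed(1)   # arbitrary seed for reproducibility
y <- replicate(1000, mean(rnorm(1000, 3, sqrt(7)) < 2))

print(c(theory = se_theory, simulated = sd(y)))

# Fraction of the repeated estimates that land within 2 SDs of the true value
coverage <- mean(abs(y - p_true) < 2 * sd(y))
print(coverage)
```

The formula gives about 0.015, in line with the simulated value 0.0147 reported above.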
(3) Load the package MASS with the command require(MASS). The data frame
hills gives record times in 1984 for 35 Scottish hill races.
(a) Look at the help file for this data set.
(b) Try applying commands like head, summary, dim, attributes.
(c) What happens when you plot hills?
(d) Make a histogram of the time variable.
hist(hills$time)
[Figure: Histogram of hills$time; x-axis hills$time (0 to 200), y-axis Frequency.]
(e) Compute the mean and SD of the times for those races where the climb
was above the median, and those where it was below the median.
> median(hills$climb)
[1] 1000
> t1=hills$time[hills$climb>1000]
> t2=hills$time[hills$climb<1000]
> mean(t1)
[1] 85.56965
> sd(t1)
[1] 59.12813
> mean(t2)
[1] 32.56171
> sd(t2)
[1] 15.06571
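Because the median climb is attained by at least one race, the strict comparisons above leave out any race whose climb equals the median exactly. A sketch that makes the three groups explicit is given below; the grouping labels and the object names are assumptions, and the MASS package is assumed to be installed.

```r
library(MASS)   # provides the hills data frame

med <- median(hills$climb)

# Label each race by how its climb compares with the median
grp <- ifelse(hills$climb > med, "above",
       ifelse(hills$climb < med, "below", "equal"))

# Count, mean and SD of the record times within each group
# (sd is NA for a group that contains a single race)
stats <- sapply(split(hills$time, grp),
                function(t) c(n = length(t), mean = mean(t), sd = sd(t)))
print(round(stats, 3))
```

Whether ties belong with the upper or the lower group is a modelling choice; the listing above simply excludes them.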
(4) Count the number of Adenines, Guanines, Cytosines and Thymines (As, Gs,
Cs and Ts). Then there is the dinucleotide composition, that is, the number
of AAs, AGs etc. (along one strand). There further is the trinucleotide
composition, that is, the number of AAAs, AAGs etc.
> table(ecp)
ecp
      A       C       G       T
1142742 1180091 1177437 1141382
> table(rbind(ecp[-n],ecp[-1]))
> table(ecp[-n],ecp[-1])

         A      C      G      T
  A 338006 256773 238013 309950
  C 325327 271821 346793 236149
  G 267384 384102 270252 255699
  T 212024 267395 322379 339584
> table(ecp[-c(n-1,n)],ecp[-c(1,n)],ecp[-c(1,2)])
, , = A

         A      C      G      T
  A 108964  58664  56659  63721
  C  76654  86491  70971  26770
  G  83530  96071  56222  52688
  T  68858  84101  83532  68845

, , = C

         A      C      G      T
  A  82616  74935  80909  86523
  C  66782  47807 115734  42746
  G  54764  93028  92189  54247
  T  52611  56051  95270  83879

, , = G

         A      C      G      T
  A  63405  73288  50653  76282
  C 104850  87076  86904 102957
  G  42503 114670  47515  66142
  T  27254  71759  85180  76998

, , = T

         A      C      G      T
  A  83021  49886  49792  83424
  C  77041  50447  73184  63676
  G  86587  80333  74326  82622
  T  63301  55483  58397 109862
(a) If you had to assign a probability to observing an A at a stated position
on the E. coli genome, what figure would you use and why? Under
what assumptions does this seem appropriate?
Assuming A’s are evenly spread through the genome, 1142742/4641652 =
0.246193.
(b) You are told there is an A at a position along the E. coli genome. What
probability would you assign to it being followed by a G, and why? The
fraction of A's that are followed by G's is 238013/1142742 = 0.2082824.
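The arithmetic in (a) and (b) can be reproduced from the printed counts without loading the genome. The sketch below types the counts in by hand; the object names are illustrative assumptions.

```r
# Single-base counts, copied from the table(ecp) output
base_counts <- c(A = 1142742, C = 1180091, G = 1177437, T = 1141382)

# Dinucleotide counts: rows are the first base, columns the following base
di_counts <- matrix(c(338006, 256773, 238013, 309950,
                      325327, 271821, 346793, 236149,
                      267384, 384102, 270252, 255699,
                      212024, 267395, 322379, 339584),
                    nrow = 4, byrow = TRUE,
                    dimnames = list(first  = c("A", "C", "G", "T"),
                                    second = c("A", "C", "G", "T")))

# (a) Proportion of positions that are A
p_A <- unname(base_counts["A"] / sum(base_counts))

# (b) Proportion of A's that are followed by G
p_G_given_A <- unname(di_counts["A", "G"] / base_counts["A"])

# All 16 conditional proportions at once: each row sums to 1
transition <- prop.table(di_counts, margin = 1)

print(c(p_A = p_A, p_G_given_A = p_G_given_A))
print(round(transition, 4))
```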
(c) Does the E. coli composition data suggest that the event we observe a
G at one site is independent (in some suitable sense) of the previous
two bases? Explain fully, illustrating with appropriate data.
> load(url('http://steinsaltz.me.uk/DTC/ecoli.rda'))
> n=length(ecp)
> table(ecp)
ecp
      A       C       G       T
1142742 1180091 1177437 1141382
> table(ecp[-c(n-1,n)],ecp[-c(1,n)],ecp[-c(1,2)])
, , = A

         A      C      G      T
  A 108964  58664  56659  63721
  C  76654  86491  70971  26770
  G  83530  96071  56222  52688
  T  68858  84101  83532  68845

, , = C

         A      C      G      T
  A  82616  74935  80909  86523
  C  66782  47807 115734  42746
  G  54764  93028  92189  54247
  T  52611  56051  95270  83879

, , = G

         A      C      G      T
  A  63405  73288  50653  76282
  C 104850  87076  86904 102957
  G  42503 114670  47515  66142
  T  27254  71759  85180  76998

, , = T

         A      C      G      T
  A  83021  49886  49792  83424
  C  77041  50447  73184  63676
  G  86587  80333  74326  82622
  T  63301  55483  58397 109862
> ft=table(ecp[-c(n-1,n)],ecp[-c(1,n)],ecp[-c(1,2)])
> dim(ft)
[1] 4 4 4
> for (i in 1:4){
+   ft[,,i]=ft[,,i]/sum(ft[,,i])
+ }
> ft
, , = A

             A          C          G          T
  A 0.09535319 0.05133622 0.04958166 0.05576154
  C 0.06707907 0.07568732 0.06210594 0.02342613
  G 0.07309618 0.08407067 0.04919925 0.04610669
  T 0.06025687 0.07359585 0.07309793 0.06024550

, , = C

             A          C          G          T
  A 0.07000816 0.06349934 0.06856166 0.07331892
  C 0.05659055 0.04051128 0.09807210 0.03622263
  G 0.04640659 0.07883121 0.07812025 0.04596849
  T 0.04458216 0.04749718 0.08073106 0.07107842

, , = G

             A          C          G          T
  A 0.05385006 0.06224372 0.04301975 0.06478654
  C 0.08904943 0.07395391 0.07380783 0.08744170
  G 0.03609793 0.09738958 0.04035463 0.05617460
  T 0.02314691 0.06094514 0.07234363 0.06539464

, , = T

             A          C          G          T
  A 0.07273726 0.04370666 0.04362431 0.07309034
  C 0.06749800 0.04419817 0.06411876 0.05578851
  G 0.07586154 0.07038222 0.06511930 0.07238768
  T 0.05545996 0.04861037 0.05116341 0.09625349

> t2=table(ecp[-n],ecp[-1])
> t2/sum(t2)

             A          C          G          T
  A 0.07282021 0.05531932 0.05127766 0.06677581
  C 0.07008864 0.05856127 0.07471329 0.05087608
  G 0.05760536 0.08275116 0.05822325 0.05508794
  T 0.04567857 0.05760773 0.06945352 0.07316018
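At least to the naked eye, the proportions of the 16 possible pairs
preceding a G (the slice labelled = G in the first array) look broadly
similar to the lower array, which has the overall proportions of those pairs.
The agreement is not exact, however: the pair TA, for example, makes up
0.023 of the pairs preceding a G but 0.046 of all pairs.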
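The comparison by eye can be backed up with a direct subtraction of the two arrays. The sketch below copies the printed proportions by hand (rows are the first base of the pair, columns the second); the object names are illustrative assumptions.

```r
bases <- c("A", "C", "G", "T")

# Proportions of each pair among positions followed by a G
before_G <- matrix(c(0.05385006, 0.06224372, 0.04301975, 0.06478654,
                     0.08904943, 0.07395391, 0.07380783, 0.08744170,
                     0.03609793, 0.09738958, 0.04035463, 0.05617460,
                     0.02314691, 0.06094514, 0.07234363, 0.06539464),
                   nrow = 4, byrow = TRUE, dimnames = list(bases, bases))

# Overall proportions of each pair
overall <- matrix(c(0.07282021, 0.05531932, 0.05127766, 0.06677581,
                    0.07008864, 0.05856127, 0.07471329, 0.05087608,
                    0.05760536, 0.08275116, 0.05822325, 0.05508794,
                    0.04567857, 0.05760773, 0.06945352, 0.07316018),
                  nrow = 4, byrow = TRUE, dimnames = list(bases, bases))

# Positive entries: pairs over-represented before a G
diff_G <- before_G - overall
print(round(diff_G, 4))

# Ratio view: values far from 1 indicate dependence on the previous two bases
print(round(before_G / overall, 2))

# Location of the largest absolute discrepancy
worst <- which(abs(diff_G) == max(abs(diff_G)), arr.ind = TRUE)
print(worst)
```

With more than a million observations behind each array, even modest differences here are unlikely to be sampling noise, so a formal test (for instance a chi-squared test on the counts) would be the natural next step.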
(d) Purine counts.
Divide up the data into about 46,400 blocks of 100 base pairs (bp).
Count the number of purines (i.e. A or G). Do the same for the 4,640
blocks of 1,000 bp, and 464 blocks of 10,000 bp.
(i) For each set of counts, calculate the mean and standard deviation
of the number of purines per block, and draw histograms of these
numbers.
> k=100
> n=length(ecp)%/%k-1
> n.collect=NULL
> for (i in 0:n){
+   a=ecp[i*k+(1:100)]
+   n.collect=c(n.collect,sum(a=='A'|a=='G'))
+ }
> sd(n.collect)
[1] 5.678099
> mean(n.collect)
[1] 49.986
> hist(n.collect)
[Figure: Histogram of n.collect; x-axis n.collect (about 30 to 70), y-axis Frequency.]
(ii) Compare the results of (i) across the different block sizes and
comment.
> k=500
> n=length(ecp)%/%k-1
> n.collect2=NULL
> for (i in 0:n){
+   a=ecp[i*k+(1:100)] # only the first 100 bases of each block of 500; use 1:k to count whole blocks
+   n.collect2=c(n.collect2,sum(a=='A'|a=='G'))
+ }
>
> sd(n.collect2)
[1] 5.673332
> mean(n.collect2)
[1] 50.01077
Note that, as written, this listing still counts purines in windows of 100 bases
(one window from each block of 500), which is why the mean and SD are almost
identical to those for k=100.
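A block count that uses the whole of each block, for any block size, might look like the sketch below. It runs on a synthetic random sequence rather than on ecp, so the numbers are not the E. coli results; the function name purines_per_block and the simulated sequence are assumptions made for illustration.

```r
set.seed(42)   # arbitrary seed
# Synthetic stand-in for ecp: independent, equally likely bases
sim_seq <- sample(c("A", "C", "G", "T"), 200000, replace = TRUE)

# Number of purines (A or G) in each complete block of k bases
purines_per_block <- function(s, k) {
  n_blocks  <- length(s) %/% k                   # drop any incomplete final block
  is_purine <- s[seq_len(n_blocks * k)] %in% c("A", "G")
  colSums(matrix(is_purine, nrow = k))           # one column per block
}

for (k in c(100, 1000, 10000)) {
  cnt <- purines_per_block(sim_seq, k)
  cat(sprintf("k = %5d: %4d blocks, mean = %.2f, sd = %.2f, binomial sd = %.2f\n",
              k, length(cnt), mean(cnt), sd(cnt), sqrt(k * 0.25)))
}
```

For independent bases with purine probability 1/2 the count in a block of k is binomial, with SD sqrt(k/4), which is 5 for k = 100. The SD of 5.68 observed in the genome at k = 100 is somewhat larger than that.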
(iii) For each block size, calculate the fraction of the counts within 1,
2 and 3 standard deviations of the mean.
> sapply(1:3,function(i) sum(abs(n.collect-mean(n.collect))/sd(n.collect)<i))/length(n.collect)
[1] 0.6647492 0.9588719 0.9978456
> #Block size 500
> sapply(1:3,function(i) sum(abs(n.collect2-mean(n.collect2))/sd(n.collect2)<i))/length(n.collect2)
[1] 0.6656253 0.9599267 0.9974146
(iv) Repeat (i), (ii) and (iii) for proportions (rather than counts) of
purines in each block.
Just change sum to mean in defining n.collect and n.collect2.
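A natural benchmark for the fractions in (iii) is the normal distribution, for which the corresponding probabilities are about 0.683, 0.954 and 0.997. The sketch below computes that reference and wraps the calculation from the listing in a small helper; the helper name within_sd is hypothetical, and the observed values are copied from the output above.

```r
# Normal reference: P(|Z| < i) for i = 1, 2, 3
normal_ref <- pnorm(1:3) - pnorm(-(1:3))

# Fractions reported above for the first set of block counts
observed <- c(0.6647492, 0.9588719, 0.9978456)

print(round(rbind(observed, normal_ref), 4))

# Helper: fraction of values within 1, 2 and 3 SDs of their mean
within_sd <- function(v) {
  z <- abs(v - mean(v)) / sd(v)
  sapply(1:3, function(i) mean(z < i))
}

# Example on simulated normal data
set.seed(3)   # arbitrary seed
print(within_sd(rnorm(10000)))
```

Dividing every count by the block size, as in (iv), rescales the mean and SD by the same factor and so leaves these fractions unchanged.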
(e) Compute counts of TATAAT in blocks of 5,000 bp. Assuming that
these counts follow a Poisson distribution, estimate the parameter
of this distribution and obtain an estimate of the standard error of
your parameter estimate. This can be done by either a formula or by
simulation.
> count.pattern=function(source=ecp,target=c('T','A','T','A','A','T')){
+   k=length(target)
+   n=length(source)-k+1
+   sum(sapply(1:n,function(i) prod(source[i:(i+k-1)]==target)))
+ }
> # n (the block length, 5000) and u (the number of blocks, 928) are not defined in this listing
> g=sapply(1:u,function(i) count.pattern(ecp[((i-1)*n+1):(i*n)]))
> table(g)
g
  0   1   2   3   4   5   6   7
611 213  56  25  15   5   2   1
> mean(g) # is the estimate of the Poisson parameter
[1] 0.5431034
> # This is the proportion of each count number among the blocks
> round(table(g)/length(g),3)
g
    0     1     2     3     4     5     6     7
0.658 0.230 0.060 0.027 0.016 0.005 0.002 0.001
> round(dpois(0:7,.5431034),3) # This is what the proportions of each count would be if Poisson
[1] 0.581 0.316 0.086 0.016 0.002 0.000 0.000 0.000
If the occurrences of TATAAT were Poisson, that is, if they were
scattered among the 5000-base blocks like independent trials with a small
probability of success at every point, the Poisson parameter would be the
mean number of “successes” per block, which is 0.5431. In fact, over the
928 blocks the observed distribution of counts is very different from
Poisson: there are more blocks with no occurrences (0.658 against 0.581) and
more with three or more occurrences than the Poisson model predicts.
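The question also asks for a standard error of the parameter estimate. A sketch of both routes, by formula and by simulation, is given below. It rebuilds the counts from the printed table instead of rescanning the genome; the object names, the seed and the 2000 replications are assumptions.

```r
# Rebuild the 928 block counts from the table(g) output
tab <- c(611, 213, 56, 25, 15, 5, 2, 1)
g   <- rep(0:7, times = tab)

lambda_hat <- mean(g)                       # Poisson parameter estimate

# Formula route: under a Poisson model Var(g) = lambda, so SE = sqrt(lambda / n)
se_poisson <- sqrt(lambda_hat / length(g))

# Model-free alternative: sample SD divided by sqrt(n)
se_empirical <- sd(g) / sqrt(length(g))

# Simulation route: repeatedly draw 928 Poisson counts and re-estimate the mean
set.seed(1)   # arbitrary seed
se_sim <- sd(replicate(2000, mean(rpois(length(g), lambda_hat))))

# Variance-to-mean ratio: close to 1 for Poisson data
dispersion <- var(g) / lambda_hat

print(round(c(lambda_hat = lambda_hat, se_poisson = se_poisson,
              se_sim = se_sim, se_empirical = se_empirical,
              dispersion = dispersion), 4))
```

The formula and simulation routes agree with each other by construction, since both assume the Poisson model. The empirical standard error is larger, and the variance-to-mean ratio is well above 1, which is consistent with the conclusion that the counts are not Poisson.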
