Statistical hypothesis

CORRELATION AND REGRESSION.
Introduction to Correlation Analysis
The statistical methods discussed so far are primarily intended to describe a single
variable i.e., univariate populations. In this chapter the techniques that are useful in studying the
relationships that exist when the data on two or more variables is available, are discussed.
If on the same individual, data on two variables say X and Y are listed, it is called a bivariate
population. In this bivariate population, for every value of X, there is a corresponding value of Y.
By treating these variables X and Y separately, measures of central tendency, dispersion etc., can
be worked out. In addition to these measures it may be of interest to study the strength of
relationship existing between the variables and the nature of their relationship. The study of the
former aspect is referred to as ‘correlation’ and the latter as regression analysis.
Measure of Simple Correlation
It is a statistical tool to study the degree of association or relationship existing between
two variables, when the relationship is linear or approximately linear. The degree of relationship
is quantified by a coefficient called ‘Karl Pearson’s product moment correlation coefficient’
or simply the ‘correlation coefficient’. It is denoted by r. The working formula for r is given by
$$ r = \frac{n\sum XY - (\sum X)(\sum Y)}{\sqrt{\left[n\sum X^2 - (\sum X)^2\right]\left[n\sum Y^2 - (\sum Y)^2\right]}} $$

In the above expression, X and Y denote the measurements on variables X and Y and n is the
number of pairs of observations i.e. the sample size.
Properties of the correlation coefficient:
(i) It is a pure number without units or dimensions.
(ii) It lies between -1 and 1 i.e., -1 ≤ r ≤ 1.
(iii) The correlation coefficient is independent of the origin and the scale of measurement of the
variables.
The variables are said to be positively correlated if ‘r’ is positive and negatively correlated if ‘r’
is negative. Positive correlation indicates that two variables are moving in the same direction,
i.e., as one increases the other increases or if one decreases the other decreases. Negative
correlation indicates that the two variables are moving in opposite directions, i.e., as one increases
the other decreases.
Let r be the observed correlation coefficient in a sample of n pairs of observations from a
bivariate normal population. To test the hypothesis Ho: ρ = 0, i.e., that the population correlation
coefficient is zero, the following test procedure is used:
(i) Hypotheses
Ho: ρ = 0; H1: ρ ≠ 0
(ii) Test statistic
Compute :
$$ t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} $$
which is distributed as t with (n-2) df.
(iii) Statistical decision
If the absolute value of the calculated t is greater than the table value of t with n-2 degrees of freedom
at the desired level of significance, the correlation between the variables is significant.
However, it is to be noted that the significance of r is not an indication of the strength of the
relationship. It is simply a test to see whether ρ is equal to zero or not. The degree of the
relationship between two variables can be measured by the square of the correlation coefficient, r²
(which is called the coefficient of determination). Unless r² is very high, one variable should not be
used to predict the other.
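The working formula for r and the t statistic for testing ρ = 0 can be sketched in a few lines of Python. The data values and the function names `pearson_r` and `t_for_r` are illustrative assumptions, not part of these notes; the calculated t still has to be compared with a t-table value for n-2 degrees of freedom.

```python
import math

def pearson_r(x, y):
    """Correlation coefficient from the working (raw sums) formula."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

def t_for_r(r, n):
    """t statistic with n-2 df for testing Ho: rho = 0."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

# Hypothetical paired observations, for illustration only
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r = pearson_r(x, y)
t = t_for_r(r, len(x))
print("r =", round(r, 4), " r^2 =", round(r * r, 4), " t =", round(t, 4))
```

For these made-up values r is about 0.77 and r² is 0.6, which illustrates the point made above: r² says how much of the variation is accounted for, while t only addresses whether ρ differs from zero.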
Introduction to Simple Linear Regression
If two variables are found to be highly correlated then a more useful approach would be
to study the nature of their relationship. Regression analysis achieves this by formulating
statistical models which can best describe these relationships. These models enable prediction of
the value of one variable, called the dependent variable from the known values of the other
variable(s). It differs from correlation in that regression estimates the nature of the relationship
whereas the correlation coefficient estimates the degree or intensity of the relationship. Further, it is
necessary to designate one of the variables as dependent and the other as independent in the case
of regression analysis, which is not necessary in correlation analysis. Simple linear regression
deals with the study of linear relationships involving two variables, whereas the relationships
among more than two variables are studied by multiple regression techniques.
Estimation of parameters of regression equation
Scatter diagram gives some idea of the nature of relationship existing between the
variables. If it indicates that the relationship is linear in nature, next step would be to develop a
statistical model and proceed to estimate the underlying relationship. A linear
relationship of the following form is assumed:
$$ Y = a + bX + e \qquad (1) $$
In expression (1):
‘e’ is a random variable (random error factor) assumed to be normally distributed with mean
zero and variance σ², i.e. e ~ N(0, σ²).
‘a’ and ‘b’ are constants (parameters).
In this model it is assumed that each Yi is normally distributed with mean a+bXi and constant
variance σ2. In the classical regression model it is further assumed that values of the independent
variable X are fixed or are pre-selected by the researcher and the variable X is measured without
error.
Linear regression of Y on X
Fitting linear relationship of the form (1) is equivalent to estimating the constants a and b
for the observed data. The method generally used for the estimation of ‘a’ and ‘b’ is the method of
‘least squares’. Informally, a line is found for which the total of the squared vertical distances of
the observed points from the line is a minimum, i.e. the sum of e² is a minimum. In other words, the
method finds the values of a and b which minimize
$$ \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (Y_i - a - bX_i)^2 \qquad (2) $$
In the above expression n stands for the number of pairs of observations.
Estimates of parameters a and b which minimize (2) are obtained by the following formula:
$$ b = \frac{n\sum XY - (\sum X)(\sum Y)}{n\sum X^2 - (\sum X)^2}, \qquad a = \bar{Y} - b\bar{X} $$
Estimated values of these constants are substituted in the equation Y = a+ b X to get the
regression equation. From this equation the value of Y can be estimated for a given value of X.
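As a minimal sketch of the least squares formulas for a and b, the following Python fragment fits the line and predicts Y for a new X. The data values and the name `fit_line` are hypothetical and serve only to illustrate the arithmetic.

```python
def fit_line(x, y):
    """Least squares estimates (a, b) for the regression of Y on X."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi * xi for xi in x)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = sum_y / n - b * sum_x / n  # a = Ybar - b * Xbar
    return a, b

# Hypothetical observations, for illustration only
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

a, b = fit_line(x, y)
predicted = a + b * 6  # estimated Y for a given X = 6
print("a =", round(a, 4), " b =", round(b, 4), " Y at X=6:", round(predicted, 4))
```

The same sums (ΣX, ΣY, ΣXY, ΣX²) appear in the formula for r, so in hand computation one table of sums serves both analyses.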
Different Regression Lines
Prior to this topic it was mentioned that independent variable X is fixed or is not a
random variable. If both X and Y are random variables and are open to choice as to which affects
which then the following regression lines may be conceived :
(i) Regression equation of Y on X
If Y is considered as dependent variable, then the regression equation of Y on X is given by
Y = a+b X
The regression coefficient b is called the regression coefficient of Y on X and is usually
denoted by byx. In this equation a and b are so estimated as to minimize the residual variation
(deviations from regression) of Y, i.e. ∑(Yi − a − bXi)² is minimized.
(ii) Regression equation of X on Y
If X is considered as dependent variable then the regression equation is given by
X = a1 + b1 Y
The regression coefficient b1 is called the regression coefficient of X on Y and is usually denoted
by bxy. In this equation a1 and b1 are so estimated as to minimize the residual variation of X, i.e.
∑(Xi − a1 − b1Yi)² is minimized. The values of the constants obtained in (i) and (ii) will usually be
different.
RANK ORDER CORRELATION
The version of correlation seen earlier applies to those cases where the values of X and of Y are
both measured on an equal-interval scale. It is also possible to apply the apparatus of linear
correlation to cases where X and Y are measured on a merely ordinal scale. When applied to
ordinal data, the measure of correlation is spoken of as the Spearman rank-order correlation
coefficient, typically symbolized as rs.
The Simple Formula for rs, for Rankings without Ties
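$$ r_s = 1 - \frac{6\sum D^2}{N(N^2-1)} $$
Here D is the difference between the two ranks of a pair and N is the number of pairs.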
If this formula seems a bit odd to you, you are in good company. Generations of statistics
students have been presented with it, and generations have puzzled over such mind-bending
questions as: why do you start out with "1" and subtract something from it?; where does
that N(N²−1) in the denominator come from?; and, above all, how does that peculiar "6" get
into the numerator?
Here are the answers to these age-old questions in a nutshell.


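For any set of N paired bivariate ranks, the minimum possible value of ∑D² occurs in the
case of perfect positive correlation. In this case, rank 1 for X is paired with rank 1 for Y,
rank 2 for X with rank 2 for Y, and so on. Each value of D will accordingly be equal to
zero, and so too will be the sum of the squared values of D.
Conversely, the maximum possible value of ∑D² occurs in the case of perfect negative
correlation. This maximum possible value is in every instance equal to
$$ \max \sum D^2 = \frac{N(N^2-1)}{3} $$
The quantity 6∑D²/[N(N²−1)] is therefore the observed ∑D² divided by half of its maximum, so it
runs from 0 (perfect positive correlation) to 2 (perfect negative correlation), and subtracting it
from 1 gives a coefficient that runs from +1 to −1.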
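A short Python sketch of the simple formula for rs (rankings without ties) makes the two extreme cases easy to check. The rankings and the function name `spearman_rs` are assumptions made for illustration.

```python
def spearman_rs(rank_x, rank_y):
    """Spearman rank-order correlation for rankings without ties."""
    n = len(rank_x)
    sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

# Hypothetical rankings of N = 5 items
ranks = [1, 2, 3, 4, 5]
same_order = spearman_rs(ranks, [1, 2, 3, 4, 5])      # perfect positive correlation
reverse_order = spearman_rs(ranks, [5, 4, 3, 2, 1])   # perfect negative correlation
mixed_order = spearman_rs(ranks, [2, 1, 4, 3, 5])

print(same_order, reverse_order, mixed_order)
```

With N = 5 the reversed ranking gives ∑D² = 40, which equals N(N²−1)/3, and the coefficient comes out as −1.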
STATISTICAL INFERENCE
Introduction
Statistical inference is that branch of statistics which deals with the theory and techniques of
making decisions regarding the statistical nature of the population using samples drawn from the
population.
Statistical inference has two branches. They are:
(i) Theory of estimation, and
(ii) Testing of hypothesis.
Basic Terminology
1. Population: It is the totality of individuals in a statistical investigation. For example, in a
study of the mean fish catch per boat at a landing centre, the fish catches of all boats operating
at the centre constitute the statistical population. A study of the fish catches of all boats is called
a census survey.
2. Sample: A small representative selection of individuals from a population. For example, in a
study of the mean fish catch per boat at a landing centre, a few boats selected randomly from those
operating at the centre form a sample. A study of the fish catches of the selected boats is called a
sample survey.
3. Parameter: Population mean, population standard deviation, population size etc., which are
quantitative characteristics of the population, are called parameters.
4. Statistic: Sample mean, sample standard deviation etc., which are functions of the sample values,
are called statistics.
5. Parameter space: The set of all admissible values of the population parameters is called the
parameter space.
6. Sampling distribution: The distribution of the values of a statistic over all different samples of the
same size is called the sampling distribution of the statistic.
7. Standard Error: The standard deviation of the sampling distribution of a statistic.
8. Estimator: A statistic which is used to estimate a population parameter.
9. Estimate: A value or range of values used to estimate a population parameter.
10. Point estimate: A single value used to estimate a population parameter.
11. Interval estimate: A range of values used to estimate a population parameter at a given level
of confidence; it is called a confidence interval.
12. Confidence coefficient: The probability that a confidence interval contains the parameter is
called the confidence coefficient.
13. Confidence limits: They are the limits of the confidence interval.
14. Estimation: It is the process of estimating a population parameter.
The table of Standard Error of some statistics:
Statistic                                          Mean        Standard Error
Sample mean (x̄)                                   µ           σ/√n
Difference of means (x̄1 − x̄2)                    µ1 − µ2     √(σ1²/n1 + σ2²/n2)
Sample proportion (p)                              P           √(PQ/n), where Q = 1 − P
Difference of proportions (p1 − p2)                P1 − P2     √(P1Q1/n1 + P2Q2/n2)
Difference of proportions (p1 − p2, when P1 = P2)  0           √[PQ(1/n1 + 1/n2)]
Point Estimate:
Sample mean is the point estimator of population mean.
Sample proportion is the point estimator of population proportion.
Utilities of Standard Error: Standard Error (S.E.) is a measure of the variability of the statistic. It
is useful in estimation and testing of hypothesis.
1. Standard Error is used to decide the efficiency and consistency of the statistic as an
estimator.
2. In interval estimation, Standard Error is used to write down the confidence intervals.
3. In testing of hypothesis, the standard error of the test statistic is used to standardize the
distribution of the test statistic.
Unbiased Estimator:
An estimator is said to be unbiased if the average of all values taken by the estimator is equal to
the population parameter.
For example: The sample mean is an unbiased estimator of the population mean because the average of all
means of samples of the same size taken from a population is equal to the population mean.
Similarly, the sample proportion is an unbiased estimator of the population proportion and the sample
variance is an unbiased estimator of the population variance.
Note: The sample variance is given by
$$ s^2 = \frac{\sum (x_i - \bar{x})^2}{n-1} $$
Interval Estimators:
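CASE 1: The 100(1 − α)% interval estimate for the population mean when σ is known is given by
$$ \bar{X} \pm Z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}} = \left( \bar{X} - Z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}},\ \bar{X} + Z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}} \right) $$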
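NOTE: If σ is unknown, it is to be replaced by the sample standard deviation ‘s’.
The maximum error (E) is the half-width of the interval and is given by
$$ E = Z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}} $$
Note: The larger the sample size (n), the lower the error (E).
The sample size can be determined for a known confidence level and a specified error E by using the formula:
$$ n = \left( \frac{Z_{\alpha/2}\,\sigma}{E} \right)^2 $$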
Example
A random sample of sixteen subjects taken from a locality gave the
following fasting sugar level measurements in mg/dl:
95, 108, 97, 112, 99, 106, 105, 100, 99, 98, 104, 110, 107, 111, 103, 110.
Assume that the sugar level follows a normal distribution with a variance of 25 (mg/dl)² and
unknown mean.
1. What is the mean?
$$ \bar{X} = 104 $$
2. Determine the 95% confidence interval for the mean sugar level.
The confidence interval is given by
$$ \left( \bar{X} - Z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}},\ \bar{X} + Z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}} \right) $$
For 95%, Z_{α/2} = 1.96, so the interval is
$$ \left( 104 - 1.96\cdot\frac{5}{\sqrt{16}},\ 104 + 1.96\cdot\frac{5}{\sqrt{16}} \right) = (101.55,\ 106.45) $$
TESTING OF HYPOTHESES
Introduction
Many situations call for verification of statements on the basis of available information.
For example, a researcher may be interested in verifying the statement ‘Women are more prone to
cancer than men’. Verifying a statement concerning a population by examining a sample from that
population is called testing of hypothesis.
Estimation and testing of hypothesis are not as different as they appear. For example,
confidence intervals may be used to arrive at the same conclusions that are reached by using the
testing of hypothesis procedure.
Terminology
Before carrying out any statistical test, it is required to understand following terminology
involved in testing of hypothesis.
Statistical hypothesis
Statistical hypothesis is a statement about the population under study. It is usually a
statement about one or more parameters of the population. Such statement may or may not be
true.
Examples of hypothesis are:
1) Mean weight of newly born babies is 3 kg.
2) Drug A and B are equally effective in decreasing the sugar level
Null hypothesis
The hypothesis to be tested is commonly designated as the ‘Null hypothesis’ and is usually denoted
by Ho.
Alternative hypothesis
Any admissible hypothesis that differs from a null Hypothesis is called an alternative
hypothesis and is denoted by H1.
Test statistic
It is a function of sample values. It extracts the information about the population
parameter contained in the sample. The observed value of the test statistic serves as a guide in
rejecting or not rejecting the null hypothesis.
For example, in testing the null hypothesis that the value of the population mean is µ0, i.e. Ho:
µ = µ0, the statistic used is
$$ Z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}} $$
Level of significance
In testing a given hypothesis, the maximum probability with which we would be willing
to risk type I error is called the level of significance of the tests, denoted by α.
In other words, it is a way of quantifying the amount of risk one wants to take in rejecting
a true hypothesis.
Usually 5% or 1% levels of significance are chosen. The choice of level, however, depends on
how costly a wrong decision would be.
To illustrate suppose 5% level of significance is chosen in designing a test of hypothesis,
then there are about 5 chances in 100 that the hypothesis is rejected when it should be accepted,
i.e. one is 95% confident about the right decision. The decision as to which values go into the
rejection region and acceptance region is made on the basis of desired level of significance ‘α’.
Tests of hypothesis are sometimes called tests of significance because of the term level of
significance and the computed value of the test statistic that falls in the rejection region is said to
be significant.
One tailed and two-tailed tests
If the null hypothesis H0 : µ = µ0 is tested against H1 : µ≠µ0 (which implies µ < µ0 or µ
> µ0), then the interest is on extreme values of Z on both tails of the distribution. In such cases
the critical region is split between two sides or tails of the distribution of the test statistic as
shown in Figs. 1 and 2 above. Tests applied for such situations are called ‘two-tailed’ tests.
If the null hypothesis H0 : µ = µ0 is tested against H1 : µ > µ0, then the interest is in the
extreme values to one side of the mean. Tests applied in such situations are called ‘one-tailed’ tests.
Test for mean of single sample (Z Test)
Let x1, x2, ..., xn be the values of a variable X in a large random sample of size n
from a population with mean µ and variance σ². On the basis of this sample, the hypothesis
regarding the value of µ is tested. Testing of hypothesis consists of the following steps.
(i) Hypotheses
Ho: µ = µo; H1: µ ≠ µo
where µo is a specified value.
(ii) Test statistic
The following test statistic is computed:
$$ Z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}} = \frac{(\bar{x} - \mu_0)\sqrt{n}}{\sigma} $$
where x̄ is the sample mean.
(iii) Statistical decision
If |Z| > 1.96, reject Ho at 5% level of significance, otherwise do not reject it.
If |Z| > 2.58, reject Ho at 1% level of significance, otherwise do not reject it.
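The three steps of the Z test can be sketched as follows. The summary figures (sample mean 52, µo = 50, σ = 10, n = 100) are hypothetical values chosen only to illustrate the decision rule, and `z_statistic` is an illustrative name.

```python
import math

def z_statistic(sample_mean, mu0, sigma, n):
    """Z = (xbar - mu0) / (sigma / sqrt(n))."""
    return (sample_mean - mu0) / (sigma / math.sqrt(n))

# Hypothetical large-sample summary values
z = z_statistic(sample_mean=52.0, mu0=50.0, sigma=10.0, n=100)

reject_at_5_percent = abs(z) > 1.96   # two-tailed test, 5% level
reject_at_1_percent = abs(z) > 2.58   # two-tailed test, 1% level
print("Z =", z, reject_at_5_percent, reject_at_1_percent)
```

In this made-up case Ho would be rejected at the 5% level but not at the 1% level, which shows why the level of significance must be fixed before the test is carried out.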
Test for equality of two population means
Testing single population proportion
Testing hypothesis about population proportions is carried out in the same way as for
means when the conditions necessary for using normal distribution are met with. When a sample
is sufficiently large and when H0 : P = Po is true then test statistic is given by:
$$ Z = \frac{p - P_0}{\sqrt{P_0 Q_0 / n}} = \frac{(p - P_0)\sqrt{n}}{\sqrt{P_0 Q_0}} $$
which is approximately distributed as a standard normal variate.
Test of significant difference between two proportions
Let x1 and x2 be the numbers of individuals possessing the given attribute say A in
random samples of size n1 and n2 from the two populations respectively, then sample proportions
are given by
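$$ p_1 = \frac{x_1}{n_1} \quad \text{and} \quad p_2 = \frac{x_2}{n_2} $$
If P1 and P2 are the proportions in populations 1 and 2 respectively, then the variance of (p1 − p2) is given
by
$$ V(p_1 - p_2) = \frac{P_1 Q_1}{n_1} + \frac{P_2 Q_2}{n_2} $$
Under the null hypothesis P1 = P2 = P (say) and Q1 = Q2 = Q, this reduces to
$$ V(p_1 - p_2) = PQ\left( \frac{1}{n_1} + \frac{1}{n_2} \right) $$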
Steps in testing of hypothesis of equality of population proportions are as follows:
(i) Hypotheses
Ho: P1 = P2 = P (say); H1: P1 ≠ P2
(ii) Test statistic
The following test statistic is used:
$$ Z = \frac{p_1 - p_2}{\sqrt{\hat{P}\hat{Q}\left( \frac{1}{n_1} + \frac{1}{n_2} \right)}} $$
Given:
x1 = 18, x2 = 25, n1 = 110, n2 = 140
p1 = 18/110 = 0.1636, p2 = 25/140 = 0.1786
$$ \hat{P} = \frac{n_1 p_1 + n_2 p_2}{n_1 + n_2} = 0.172 \quad \text{and} \quad \hat{Q} = 1 - \hat{P} = 0.828 $$
Hence,
$$ Z = \frac{0.1636 - 0.1786}{\sqrt{(0.172)(0.828)\left( \frac{1}{110} + \frac{1}{140} \right)}} = \frac{-0.015}{\sqrt{(0.1424)(0.0162)}} = \frac{-0.015}{0.04803} = -0.3123 $$
(iii) Statistical decision
As |Z| < 1.96, the hypothesis is not rejected at 5% level of significance.
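The same calculation can be reproduced with a small Python sketch using the counts given in the worked example (x1 = 18, n1 = 110, x2 = 25, n2 = 140). The function name `two_proportion_z` is an illustrative choice.

```python
import math

def two_proportion_z(x1, n1, x2, n2):
    """Z statistic for Ho: P1 = P2, using the pooled proportion."""
    p1, p2 = x1 / n1, x2 / n2
    p_pooled = (x1 + x2) / (n1 + n2)   # equals (n1*p1 + n2*p2) / (n1 + n2)
    q_pooled = 1 - p_pooled
    standard_error = math.sqrt(p_pooled * q_pooled * (1 / n1 + 1 / n2))
    return (p1 - p2) / standard_error

z = two_proportion_z(18, 110, 25, 140)
reject_at_5_percent = abs(z) > 1.96
print("Z =", round(z, 4), " reject Ho at 5%:", reject_at_5_percent)
```

Carried out without rounding the intermediate values, Z comes to about −0.31, marginally different from the hand-computed −0.3123; the decision is the same.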
Small samples Tests (n <30)
The Student- t – distribution
When the size of the sample is small, the distributions of various statistics are far from normal
and hence tests of hypothesis based on the normal variate cannot be applied. In such cases tests of
hypothesis based on the exact sampling distributions of ‘t’ and ‘F’ are applied. When applying these
tests it is assumed that the population from which the sample is drawn is normal.
The t-distribution, which is popularly known as Student’s t distribution, is a sampling
distribution derived from the parent normal distribution. This distribution is symmetrical about
the mean but is slightly flatter than the normal distribution. Unlike the normal distribution, it
will be different for different sample sizes ‘n’, that is, for different degrees of freedom (n-1). When the
size of the sample is very small (< 30), the t-distribution differs markedly from the normal
distribution, but as n increases the t-distribution resembles the normal distribution more and more
(Fig. 1). The t distribution has mean zero and variance ν/(ν−2) for ν > 2, where ν denotes the
degrees of freedom. The variable t ranges theoretically from −∞ to +∞. The values of ‘t’ have been
tabulated for different degrees of freedom at different levels of significance (Fisher and Yates, 1963).
CASE 1:Test for Single Mean
Let x1, x2, ..., xn be a random sample of size n drawn from a normal population with mean µ. Let
x̄ and s² denote the mean and variance of the sample.
To test the hypothesis Ho: µ = µo, the following test procedure is used:
(i) Hypotheses
Ho: µ = µo; H1: µ ≠ µo
(ii) Test statistic
$$ t = \frac{(\bar{x} - \mu_0)\sqrt{n}}{s}, \quad \text{where} \quad s = \sqrt{\frac{1}{n-1}\left[ \sum x^2 - \frac{(\sum x)^2}{n} \right]} $$
This test statistic follows the t-distribution with (n-1) degrees of freedom.
(iii) Statistical decision
If |t| > the table value of t at 5% level of significance, then reject Ho at 5% level of significance;
otherwise do not reject Ho.
If |t| > the table value of t at 1% level of significance, then reject Ho at 1% level of significance;
otherwise do not reject Ho.
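A minimal Python sketch of the single-mean t statistic is given below. It reuses the sixteen fasting sugar readings from the confidence interval example, but now treats σ as unknown; the hypothesized value µo = 100 is an assumption made purely for illustration. The calculated t must still be compared with a table value for n-1 degrees of freedom.

```python
import math

def one_sample_t(values, mu0):
    """Return (t, df) for Ho: mu = mu0, using s with divisor n-1."""
    n = len(values)
    sum_x = sum(values)
    sum_x2 = sum(v * v for v in values)
    s = math.sqrt((sum_x2 - sum_x ** 2 / n) / (n - 1))
    mean = sum_x / n
    t = (mean - mu0) * math.sqrt(n) / s
    return t, n - 1

sugar = [95, 108, 97, 112, 99, 106, 105, 100, 99, 98, 104, 110, 107, 111, 103, 110]
t, df = one_sample_t(sugar, mu0=100)   # mu0 = 100 is a hypothetical value
print("t =", round(t, 4), " df =", df)
```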
CASE 2: Testing of difference between two means (population variances
assumed equal)
Let x̄ and s1 be the mean and standard deviation of a sample of size n1 from a normal
population with mean µ1, and let ȳ and s2 be the mean and standard deviation of another sample
of size n2 from a normal population with mean µ2. To test whether the population means differ
significantly, the following null hypothesis is set up:
(i) Hypotheses
Ho: µ1 = µ2; H1: µ1 ≠ µ2
(ii) Test statistic
$$ t = \frac{\bar{x} - \bar{y}}{s\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} $$
which is distributed as t with n1 + n2 - 2 degrees of freedom.
The ‘s’ in the above expression is computed using the formula:
$$ s = \sqrt{\frac{\sum (x - \bar{x})^2 + \sum (y - \bar{y})^2}{n_1 + n_2 - 2}} = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}} $$
(iii) Statistical decision
If |t| > the table value of t at the specified level of significance, reject the null hypothesis at that
level; otherwise do not reject it.
CASE 3: Test of difference between two means of paired observations
(paired t – test)
When the two samples of equal size are drawn from two normal populations and these
samples are not independent, then the paired t-test is used. Dependent samples arise, for
instance, in experiments when an individual is tested first under one condition and then under
another condition, so that there will be two observations for the same individual. Let n be the
size of each of the two samples and d1, d2, ..., dn the differences between the
corresponding members of the samples. Let d̄ denote the mean of the differences and s the standard
deviation of these differences.
(i) Hypotheses
Ho: µD = 0; H1: µD ≠ 0
(ii) Test statistic
To test this hypothesis, compute
$$ t = \frac{\bar{d}}{s/\sqrt{n}}, \quad \text{where} \quad \bar{d} = \frac{\sum d}{n} \quad \text{and} \quad s = \sqrt{\frac{1}{n-1}\left[ \sum d^2 - \frac{(\sum d)^2}{n} \right]} $$
It is distributed as t with (n-1) degrees of freedom.
(iii) Statistical decision
If |t| > the table value at the specified level of significance, reject Ho at that level.
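The paired t statistic can be sketched as below. The before and after readings are hypothetical numbers invented for the illustration, as is the function name `paired_t`.

```python
import math

def paired_t(first, second):
    """Return (t, df) for Ho: mean difference = 0 on paired observations."""
    d = [b - a for a, b in zip(first, second)]
    n = len(d)
    sum_d = sum(d)
    sum_d2 = sum(v * v for v in d)
    d_bar = sum_d / n
    s = math.sqrt((sum_d2 - sum_d ** 2 / n) / (n - 1))
    return d_bar / (s / math.sqrt(n)), n - 1

# Hypothetical measurements on the same five individuals under two conditions
before = [10, 12, 9, 11, 13]
after = [12, 14, 10, 13, 13]

t, df = paired_t(before, after)
print("t =", round(t, 4), " df =", df)
```

Only the differences enter the calculation, which is why the test needs n-1 rather than 2n-2 degrees of freedom.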
Exercise for practice:
I. Samples of eleven and fifteen animals were fed on different diets A and B respectively.
The gain in weight for the individual animals for the same period was as follows:
Diet A (in gm): 20 25 30 23 29 19 30 12 19 27 26
Diet B (in gm): 39 29 19 17 41 28 23 27 31 28 31 18 16 24 26
II. Length measurements of sampled mackerel made on the same day in two landing centres
are given below:
Landing centre A: 24 21 20 19 17 18 15 13 20
Landing centre B: 21 22 14 18 16 19 20 22 21
Do the data support the hypothesis that the mean length of mackerel is the same in both landing centres?
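One possible way to set up exercise II is the two-sample t-test with pooled variance (CASE 2), assuming the two landing centres have equal population variances. The sketch below computes only the test statistic and its degrees of freedom; the function name `pooled_t` is illustrative, and the table lookup is left to the reader.

```python
import math

def pooled_t(x, y):
    """Return (t, df) for Ho: mu1 = mu2, assuming equal population variances."""
    n1, n2 = len(x), len(y)
    mean_x, mean_y = sum(x) / n1, sum(y) / n2
    ss_x = sum((v - mean_x) ** 2 for v in x)
    ss_y = sum((v - mean_y) ** 2 for v in y)
    df = n1 + n2 - 2
    s = math.sqrt((ss_x + ss_y) / df)   # pooled standard deviation
    t = (mean_x - mean_y) / (s * math.sqrt(1 / n1 + 1 / n2))
    return t, df

centre_a = [24, 21, 20, 19, 17, 18, 15, 13, 20]
centre_b = [21, 22, 14, 18, 16, 19, 20, 22, 21]

t, df = pooled_t(centre_a, centre_b)
print("t =", round(t, 4), " df =", df)
```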
Chi-square Test
Introduction to Chi-square (χ²) distribution
Theoretically, the Chi-square (i.e. χ²) distribution can be defined as the distribution of the sum of
squares of independent standard normal variates. If X1, X2, ..., Xn are n independent standard normal
variates, then the sum of squares of these variates, X1² + X2² + ... + Xn², follows
the χ² distribution with n degrees of freedom. Alternatively, if a sample of size n is drawn from
a normal population with variance σ², the quantity (n-1)s²/σ² follows the χ² distribution with (n-1)
degrees of freedom, where s² is the sample variance.
The shape of the χ² distribution depends on the degrees of freedom, which is also its mean
(Fig. 9.5). When n is small, the χ² distribution is markedly different from the normal distribution,
but as n increases the shape of the curve becomes more and more symmetrical, and for n > 30 it
can be approximated by a normal distribution. The values of χ² have been tabulated for different
degrees of freedom at different levels of probability (Fisher and Yates, 1963). χ² is always
greater than or equal to zero, i.e. χ² ≥ 0.
Test for fixed-ratio hypothesis
Many investigations are carried out to verify empirically some biological phenomena that
are expected to occur under some given assumptions. In common carp, normal pigmentation is
due to a dominant gene B, and its recessive allele b produces blue pigmentation in the homozygous
condition (bb). The F1 generation will have all individuals (Bb) with normal pigmentation. When F1’s
are crossed to obtain the F2 generation, there will be common carp with normal and blue pigmentation
in the ratio of 3:1. Whether this hypothesis of a 3:1 ratio is substantiated by the actual observed data
can be ascertained by the χ² test. This χ² test can be applied to test any fixed-ratio hypothesis provided
the expected ratio is specified before the investigation commences.
If Oi refers to the observed frequency and Ei refers to the expected frequency based on the expected
ratio hypothesis, then χ² is computed as follows:
$$ \chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i} = \sum_{i=1}^{k} \frac{O_i^2}{E_i} - n \qquad (1) $$
where n is the total number of observations and k is the number of classes. The χ² in (1) has k-1
degrees of freedom. In this test the expected frequency of each class should be more than 5. If
any such frequency is small, adjacent classes may be grouped so that the expected frequency is
more than 5.
If the calculated value of χ² is greater than the table value of χ² with (k-1) df at the specified level
of significance, the null hypothesis of the specified ratio is rejected.
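The fixed-ratio test can be sketched in Python as follows. The F2 counts (290 normal, 110 blue) are hypothetical numbers made up for the illustration; the 3:1 ratio is the one discussed above, and `chi_square_fixed_ratio` is an illustrative name.

```python
def chi_square_fixed_ratio(observed, ratio):
    """Return (chi2, df, expected) for a fixed-ratio hypothesis."""
    n = sum(observed)
    expected = [n * part / sum(ratio) for part in ratio]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return chi2, len(observed) - 1, expected

# Hypothetical F2 counts: normal pigmentation, blue pigmentation
observed = [290, 110]
chi2, df, expected = chi_square_fixed_ratio(observed, ratio=[3, 1])

# The short-cut form, sum(O^2 / E) - n, should give the same value
shortcut = sum(o ** 2 / e for o, e in zip(observed, expected)) - sum(observed)
print("expected =", expected, " chi2 =", round(chi2, 4), " df =", df)
```

Both expected frequencies (300 and 100) are well above 5, so no grouping of classes is needed here.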
χ² test for independence of attributes in 2 x 2 contingency table
Suppose that an attribute data of size n is classified according to two attributes, say, A
and B and the attribute A is further subdivided into two classes A1 and A2 and the attribute B
into B1, and B2. Such attribute data can be presented in the form of a table called 2 x 2
contingency table as shown below:
Table 2: 2x2 contingency table
B\A      A1      A2      Total
B1       a       b       a+b
B2       c       d       c+d
Total    a+c     b+d     N = a+b+c+d
It may be of interest to test the independence of attributes A and B.
H0: The two attributes A and B are independent.
H1: The two attributes A and B are dependent.
It is tested by the χ² test. A simple formula for computing χ² for a 2 x 2 contingency
table is given by
$$ \chi^2 = \frac{N(ad - bc)^2}{(a+b)(c+d)(a+c)(b+d)} $$
where a, b, c and d are the cell frequencies of the 2 x 2 contingency table and N is the total
frequency. This χ² has 1 degree of freedom.
Yates’ correction for continuity
If the expected cell frequencies are large, the discrete distribution of the cell frequencies is well
approximated by the normal distribution. This approximation holds fairly well when the degrees of
freedom are more than 1 and the expected cell frequencies in the various classes are not small. As
the χ² statistic of a 2 x 2 contingency table has only 1 degree of freedom, the χ² approximation in
this case will not be satisfactory and leads to overestimation of significance. This is corrected by the
method suggested by Yates, which is known as ‘Yates correction’. The correction consists of adding ½
to the observed minimum frequency, adjusting the other cell frequencies so that the observed
marginal totals are unchanged, and then computing χ².
The formula for χ² using the Yates correction in a 2 x 2 contingency table is given by
$$ \chi^2 = \frac{N\left( |ad - bc| - \frac{N}{2} \right)^2}{(a+b)(c+d)(a+c)(b+d)} $$
This correction is suitable when the expected frequency of classes is less than 5, but
estimation with correction can do no harm even when the frequencies are large. Hence it is
always better to use the correction as a matter of routine.
Example: In a series of experiments to test whether advanced stages of a particular
infection are cured by ‘Lime treatment’, the following observations were made:
                      Not cured   Cured   Total
Lime treated          86          14      100
Untreated (control)   88          12      100
Total                 174         26      200
Test whether lime has any effect in curing the infection.
Answer:
(i) Hypotheses
Ho: There is no association between lime treatment and curing of infection.
H1: There is association between lime treatment and curing of infection.
(ii) Test statistic
$$ \chi^2 = \frac{N(ad - bc)^2}{(a+b)(c+d)(a+c)(b+d)} $$
where a = 86, b = 14, c = 88, d = 12.
$$ \chi^2 = \frac{200(86 \times 12 - 14 \times 88)^2}{100 \times 100 \times 174 \times 26} = \frac{200(-200)^2}{10000 \times 174 \times 26} = 0.1768 $$
(iii) Statistical decision
Since the calculated χ² (with and without Yates correction) is less than the table value (3.84
at 5%, 6.64 at 1%), Ho is not rejected, i.e. curing of the infection is not dependent on lime treatment.
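The lime treatment example can be verified, with and without the Yates correction, using a short Python sketch of the two 2 x 2 formulas. The function name `chi_square_2x2` and its `yates` flag are illustrative choices.

```python
def chi_square_2x2(a, b, c, d, yates=False):
    """Chi-square (1 df) for a 2 x 2 table with cell frequencies a, b, c, d."""
    n = a + b + c + d
    difference = abs(a * d - b * c)
    if yates:
        difference -= n / 2          # Yates correction for continuity
    return n * difference ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# Cell frequencies from the lime treatment example
plain = chi_square_2x2(86, 14, 88, 12)
corrected = chi_square_2x2(86, 14, 88, 12, yates=True)

table_value_5_percent = 3.84         # 1 df, as quoted in the example
print(round(plain, 4), round(corrected, 4), plain < table_value_5_percent)
```

The corrected value is smaller than the uncorrected one, in line with the remark that the uncorrected statistic tends to overstate significance.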
Computation of χ² in (r x c) contingency table
The r x c contingency table is an extension of the 2 x 2 contingency table in which the data
are classified into ‘r’ rows and ‘c’ columns (Table 3). In this table the frequencies which occupy
the cells of the table are called ‘cell frequencies’ whereas the row and column totals are called the
‘marginal frequencies’.
Table 3: (r x c) contingency table
A\B      B1      B2      ...     Bj      ...     Bc      Total
A1       O11     O12     ...     O1j     ...     O1c     (A1)
A2       O21     O22     ...     O2j     ...     O2c     (A2)
...      ...     ...     ...     ...     ...     ...     ...
Ai       Oi1     Oi2     ...     Oij     ...     Oic     (Ai)
...      ...     ...     ...     ...     ...     ...     ...
Ar       Or1     Or2     ...     Orj     ...     Orc     (Ar)
Total    (B1)    (B2)    ...     (Bj)    ...     (Bc)    N
As the table consists of ‘r’ rows and ‘c’ columns, there will be (r x c) observed
frequencies, one in each cell. Corresponding to each observed frequency, there is expected
frequency, computed based on certain hypothesis. Under the null hypothesis of no relationship or
of independence between the attributes, expected frequency of each cell is computed by
multiplying totals of the row and column to which the cell belongs divided by the total number
of observations. For instance, the expected frequency of the cell in the 1st row and 2nd column is
obtained by multiplying the 1st row total (A1) by the 2nd column total (B2) and then dividing
by the total number of observations, N. After calculating the expected frequencies for each cell,
χ² is computed using the formula
$$ \chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{O_{ij}^2}{E_{ij}} - N $$
which has (r-1)(c-1) degrees of freedom.
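The computation of expected frequencies and χ² for a general r x c table can be sketched as below. As a check, the function is applied to the 2 x 2 lime treatment table, where it should reproduce the value 0.1768 obtained earlier; the function name `chi_square_rxc` is an illustrative choice.

```python
def chi_square_rxc(table):
    """Return (chi2, df) for an r x c table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    column_totals = [sum(column) for column in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # expected frequency = row total * column total / grand total
            expected = row_totals[i] * column_totals[j] / n
            chi2 += observed ** 2 / expected
    chi2 -= n
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df

# Rows: lime treated, untreated; columns: not cured, cured
chi2, df = chi_square_rxc([[86, 14], [88, 12]])
print("chi2 =", round(chi2, 4), " df =", df)
```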
F-Distribution
Introduction
In the t-tests on paired and unpaired samples we had to assume that the two samples came from
the same Normal Distribution. We were testing that the means were the same but still had to
assume that the standard deviations were the same. We can test whether this assumption is correct by
using the F-test, which is due to Snedecor.
If the standard deviations are the same, then we would expect the sample standard deviations to be
similar, i.e. s1 ≈ s2, or more precisely, that the population variances are equal, σ1² = σ2².
This can be written as σ1²/σ2² = 1.
That is, we can test for the closeness of this ratio to 1.
The Variance ratio or F-Test formalizes these ideas:
The ratio of variances (s1²/s2²) of independent samples taken from a normal population follows a
distribution called the F-distribution with two degrees of freedom, one for the numerator and the other
for the denominator.
F-distribution: If s1² is the variance of a sample of size n1 and s2² is the variance of another
independent sample of size n2 taken from the same population, then the ratio of the variances of these
samples, i.e. s1²/s2², follows a distribution called the F-distribution with (n1-1) and (n2-1) degrees of freedom.
Note:
1. F has two sets of degrees of freedom, one for the numerator and another for the denominator.
2. F is a positively skewed distribution.
3. The shape of the F distribution depends upon the two degrees of freedom.
4. The degrees of freedom are the parameters of the F distribution.
Testing for the equality of variances
1. H0: $\sigma_1^2 = \sigma_2^2$ (no significant difference between variances)
2. H1: $\sigma_1^2 \neq \sigma_2^2$ (significant difference between variances)
3. Fix α = 0.05, say.
4. Test statistic: $F = s_1^2 / s_2^2$, where $s_1^2$ is the larger sample variance.
Decision Rule: Reject H0 (i.e. conclude that the variances differ significantly) if the computed test
statistic value is more than the table F value for (n1-1), (n2-1) degrees of freedom; otherwise do not
reject H0.
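The test statistic in step 4 can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `variance_ratio` and both samples are hypothetical, and the critical value still has to be looked up in an F table for the returned degrees of freedom.

```python
from statistics import variance


def variance_ratio(sample_a, sample_b):
    """Return (F, df_numerator, df_denominator), larger variance on top."""
    var_a = variance(sample_a)  # sample variance with divisor n - 1
    var_b = variance(sample_b)
    if var_a >= var_b:
        return var_a / var_b, len(sample_a) - 1, len(sample_b) - 1
    return var_b / var_a, len(sample_b) - 1, len(sample_a) - 1


# Hypothetical samples (illustration only)
sample_x = [2, 4, 4, 4, 5, 5, 7, 9]
sample_y = [1, 2, 3, 4, 5]
f_value, df_num, df_den = variance_ratio(sample_x, sample_y)
print(round(f_value, 4), df_num, df_den)
```

The degrees of freedom are returned in the same order as the variances in the ratio, so the table lookup always uses the numerator sample first.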
Exercise for practice:
I. Can the following two samples be considered to have the same underlying variance?
Sample I  : 18 29 24 16 17 21 22 19 20
Sample II : 17 19 20 24 21 22 18 19 23 18 20
II. Test for the equality of the standard deviations of the weights of both sexes from the
following sampled data.
Males   : 56 59 48 54 39 57 62 69
Females : 45 49 57 53 51 59 61 71 57
Basic Ideas
The Purpose of Analysis of Variance
In general, the purpose of analysis of variance (ANOVA) is to test for significant differences
between means. Elementary Concepts provides a brief introduction to the basics of statistical
significance testing. If we are only comparing two means, ANOVA will produce the same results
as the t test for independent samples (if we are comparing two different groups of cases or
observations) or the t test for dependent samples (if we are comparing two variables in one set of
cases or observations). If you are not familiar with these tests, you may want to read Basic
Statistics and Tables.
Why the name analysis of variance? It may seem odd that a procedure that compares means is
called analysis of variance. However, this name is derived from the fact that in order to test for
statistical significance between means, we are actually comparing (i.e., analyzing) variances.
Summary of the basic logic of ANOVA. To summarize the discussion up to this point, the
purpose of analysis of variance is to test differences in means (for groups or variables) for
statistical significance. This is accomplished by analyzing the variance, that is, by partitioning
the total variance into the component that is due to true random error (i.e., within-group SS) and
the components that are due to differences between means. These latter variance components are
then tested for statistical significance, and, if significant, we reject the null hypothesis of no
differences between means and accept the alternative hypothesis that the means (in the
population) are different from each other.
Dependent and independent variables. The variables that are measured (e.g., a test score) are
called dependent variables. The variables that are manipulated or controlled (e.g., a teaching
method or some other criterion used to divide observations into groups that are compared) are
called factors or independent variables. For more information on this important distinction, refer
to Elementary Concepts.
A one-way layout consists of a single factor with several levels and multiple observations at each
level. With this kind of layout we can calculate the mean of the observations within each level of
our factor. The residuals will tell us about the variation within each level. We can also average
the means of each level to obtain a grand mean. We can then look at the deviation of the mean of
each level from the grand mean to understand something about the level effects. Finally, we can
compare the variation within levels to the variation across levels. Hence the name analysis of
variance.
Model
$$y_{ij} = \mu + \alpha_i + \varepsilon_{ij}$$
The equation indicates that the j th data value, from level i, is the sum of three components: the
common value (grand mean, $\mu$), the level effect (the deviation of each level mean from the grand
mean, $\alpha_i$), and the residual (what's left over, $\varepsilon_{ij}$).
Important Assumptions in ANOVA
The ANOVA method has some very important assumptions. First, as with the t tests, there is an
assumption of normality--the groups must be normally distributed on the "dependent" variable.
The ANOVA method is relatively robust to violations of this assumption, provided the violations
are not too severe. The best way to assess deviation from normality is to simply examine a
histogram. The issue of normality has been covered well in previous courses.
A more vexing problem in practice is the assumption of equal variances. There is often
evidence that this assumption is violated (remember the standard deviation squared is the
variance, so large differences in standard deviations among groups are evidence of a problem). In
fact, the major statistical software packages provide tests for this assumption. The most common
are the Bartlett-Box test, Hartley's test, and Levene's test. Calculating these statistics is beyond
the scope of the course, but interpretation is relatively straight-forward. They are available in the
major statistical packages. For these tests, the null hypothesis is that variances of the individual
cells are equal. When the null is rejected, the variances are unequal, and the assumption is not
met for ANOVA. However, these tests tend to be very powerful, and hence it is probably not
cause for real concern until the null is rejected at less than the .001 level (sig or p < .001),
especially when sample size is large.
When variances among groups are unequal, they are referred to as heterogeneous, and it is
said that the homogeneity of variance assumption is violated. When this is the case, there are
ways of correcting the analysis. We will now consider two approaches for testing the omnibus
null when the homogeneity of variance assumption is untenable. In both of these tests,
modifications of the calculation of MSbg and/or MSwg are made, and, most importantly, the
critical values of F are adjusted by adjusting the df for MSwg.
These methods are mathematically tedious, but not incomprehensible (make sure you
understand the difference- "tedious" would take a whole day to do and would make you bored or
angry, "incomprehensible" is where you couldn't do it even if you had the time). Make sure you
know what you would have to do to perform these methods, even though you do not plan to
(there are ways of testing to see if you have done this!). There is nothing in the formulas below
you have not seen before. It is just n, N, s (standard deviation), a (number of groups), etc. The
first method is the Brown-Forsythe method (Brown & Forsythe, 1974). In this method, the
MSwg is modified to yield a special F statistic (F*). The F* value is then evaluated at a special
denominator df value (df*).
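The following is a hedged sketch of the Brown-Forsythe calculation, written from the commonly cited form of the method rather than from formulas in these notes: F* divides the between-groups sum of squares by the sum of (1 - n_i/N) s_i², and df* comes from a Satterthwaite-type approximation. The function name `brown_forsythe` is hypothetical, and the data are the test scores from the worked example in this section.

```python
from statistics import mean, variance


def brown_forsythe(groups):
    """Return (F*, numerator df, df*) for a list of groups of scores."""
    sizes = [len(g) for g in groups]
    big_n = sum(sizes)
    grand_mean = sum(sum(g) for g in groups) / big_n
    # Between-groups sum of squares, as in ordinary ANOVA
    ss_between = sum(n * (mean(g) - grand_mean) ** 2
                     for n, g in zip(sizes, groups))
    # Each group variance is weighted by (1 - n_i / N)
    weighted = [(1 - n / big_n) * variance(g) for n, g in zip(sizes, groups)]
    f_star = ss_between / sum(weighted)
    # Satterthwaite-type denominator degrees of freedom
    shares = [w / sum(weighted) for w in weighted]
    df_star = 1 / sum(c ** 2 / (n - 1) for c, n in zip(shares, sizes))
    return f_star, len(groups) - 1, df_star


scores = [
    [7, 4, 6, 8, 6, 6, 2, 9],  # constant sound
    [5, 5, 3, 4, 4, 7, 2, 2],  # random sound
    [2, 4, 7, 1, 2, 1, 5, 5],  # no sound
]
f_star, df_num, df_star = brown_forsythe(scores)
print(round(f_star, 2), df_num, round(df_star, 1))
```

Because the three groups have equal sizes, F* coincides with the ordinary F ratio here; only the denominator degrees of freedom change slightly.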
Example
Problem: Susan Sound predicts that students will learn most effectively with a
constant background sound, as opposed to an unpredictable sound or no sound at all.
She randomly divides twenty-four students into three groups of eight. All students
study a passage of text for 30 minutes. Those in group 1 study with background sound
at a constant volume in the background. Those in group 2 study with noise that
changes volume periodically. Those in group 3 study with no sound at all. After
studying, all students take a 10 point multiple choice test over the material. Their
scores follow:
group                  test scores
1) constant sound      7  4  6  8  6  6  2  9
2) random sound        5  5  3  4  4  7  2  2
3) no sound            2  4  7  1  2  1  5  5

  x1    x1²      x2    x2²      x3    x3²
   7     49       5     25       2      4
   4     16       5     25       4     16
   6     36       3      9       7     49
   8     64       4     16       1      1
   6     36       4     16       2      4
   6     36       7     49       1      1
   2      4       2      4       5     25
   9     81       2      4       5     25
Σx1 = 48    Σx1² = 322    Σx2 = 32    Σx2² = 148    Σx3 = 27    Σx3² = 125
(Σx1)² = 2304             (Σx2)² = 1024             (Σx3)² = 729
M1 = 6                    M2 = 4                    M3 = 3.375

SStotal  = 595 - 477.04 = 117.96
SSamong  = 507.13 - 477.04 = 30.08
SSwithin = 117.96 - 30.08 = 87.88

Source    SS       df    MS       F
Among     30.08     2    15.04    3.59*
Within    87.88    21     4.18
*(According to the F probability table with df = (2, 21), F must be at least 3.4668 to reach p <
.05, so the F score is statistically significant.)
Interpretation: Susan can conclude that her hypothesis may be supported. The means are as she
predicted, in that the constant sound group has the highest score. However, the significant F only
indicates that at least two means are significantly different from one another; she can't know
which specific mean pairs differ significantly until she conducts a post-hoc analysis
(e.g., Tukey's HSD).
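The arithmetic of the example can be reproduced with a short script. This is a minimal sketch using the raw-score formulas from the example; the function name `one_way_anova` and the dictionary keys are illustrative choices, not part of the original notes.

```python
def one_way_anova(groups):
    """One-way ANOVA from raw scores using sums and sums of squares."""
    all_scores = [x for g in groups for x in g]
    big_n = len(all_scores)
    correction = sum(all_scores) ** 2 / big_n        # (sum of x)^2 / N
    ss_total = sum(x * x for x in all_scores) - correction
    ss_among = sum(sum(g) ** 2 / len(g) for g in groups) - correction
    ss_within = ss_total - ss_among
    df_among = len(groups) - 1
    df_within = big_n - len(groups)
    ms_among = ss_among / df_among
    ms_within = ss_within / df_within
    return {
        "ss_among": ss_among, "ss_within": ss_within, "ss_total": ss_total,
        "df_among": df_among, "df_within": df_within,
        "ms_among": ms_among, "ms_within": ms_within,
        "F": ms_among / ms_within,
    }


scores = [
    [7, 4, 6, 8, 6, 6, 2, 9],  # constant sound
    [5, 5, 3, 4, 4, 7, 2, 2],  # random sound
    [2, 4, 7, 1, 2, 1, 5, 5],  # no sound
]
result = one_way_anova(scores)
for key, value in result.items():
    print(key, round(value, 2))
```

Small differences in the last digit relative to the hand calculation come from rounding intermediate values in the table.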
Nonparametric tests
Occasionally, the assumptions of the t-tests are seriously violated, in particular when the
data are ordinal in nature rather than at least interval. On such occasions an alternative
approach is to use nonparametric tests.
Nonparametric tests are also referred to as distribution-free tests. These tests have the obvious
advantage of not requiring the assumption of normality or the assumption of homogeneity of
variance. They compare medians rather than means and, as a result, if the data have one or two
outliers, their influence is negated.
Parametric tests are preferred because, in general, for the same number of observations, they are
more likely to lead to the rejection of a false null hypothesis. That is, parametric tests have more
power. This greater power stems from the fact that if the data have been collected at an interval
or ratio level, information is lost in the conversion to ranked data (i.e., merely ordering the data
from the lowest to the highest value).
A range of nonparametric alternatives to the t-tests and related parametric procedures is given in
the table below.
Parametric test                  Non-parametric
Paired sample t-test             Wilcoxon T Test
Independent samples t-test       Mann-Whitney U Test
Pearson's correlation            Spearman's correlation
One-way ANOVA                    Kruskal–Wallis one-way ANOVA
The Wilcoxon signed-rank test is a non-parametric statistical hypothesis test used when
comparing two related samples, matched samples, or repeated measurements on a single sample
to assess whether their population mean ranks differ (i.e. it's a paired difference test). It can be
used as an alternative to the paired Student's t-test, t-test for matched pairs, or the t-test for
dependent samples when the population cannot be assumed to be normally distributed. The
Wilcoxon signed ranks test can be performed on small samples and large samples. When
performing these tests, there are eight steps that should be used: 1) State the null and research
hypotheses. 2) Set the level of risk (or the level of significance) associated with the null
hypothesis. 3) Choose the appropriate test statistic. 4) Compute the test statistic. 5) Determine
the value needed for rejection of the null hypothesis using the appropriate table of critical values
for the particular statistic. 6) Compare the obtained value to the critical value. 7) Interpret the
results. 8) Report the results.
It is now suggested that the confidence interval should be reported when testing research data. A
confidence interval is an inference to a population in terms of an estimation of sampling error,
providing a range with a level of confidence of 100(1-alpha)%.
Assumptions
1. Data is paired and comes from the same population.
2. Each pair is chosen randomly and independently.
3. The data is measured on an interval scale (ordinal is not sufficient because we take
differences), but need not be normal.
Test procedure
Let N be the sample size, the number of pairs. Thus, there are a total of 2N data points. For i = 1,
..., N, let $x_{1,i}$ and $x_{2,i}$ denote the measurements.
1. For i = 1, ..., N, calculate $|x_{2,i} - x_{1,i}|$ and $\operatorname{sgn}(x_{2,i} - x_{1,i})$, where sgn is the sign
function.
2. Exclude pairs with $|x_{2,i} - x_{1,i}| = 0$. Let $N_r$ be the reduced sample size.
3. Order the remaining $N_r$ pairs from smallest absolute difference to largest absolute
difference, $|x_{2,i} - x_{1,i}|$.
4. Rank the pairs, starting with the smallest as 1. Ties receive a rank equal to the average of
the ranks they span. Let $R_i$ denote the rank.
5. Calculate the test statistic W,
$$W = \left| \sum_{i=1}^{N_r} \operatorname{sgn}(x_{2,i} - x_{1,i}) \, R_i \right|,$$
the absolute value of the sum of the signed ranks.
6. As $N_r$ increases, the sampling distribution of W converges to a normal distribution.
Thus, for $N_r \ge 10$, a z-score can be calculated as
$$z = \frac{W - 0.5}{\sigma_W}, \qquad \sigma_W = \sqrt{\frac{N_r (N_r + 1)(2 N_r + 1)}{6}}.$$
If z > z critical, reject H0.
For $N_r < 10$, W is compared to a critical value from a reference table[1].
If $W \ge W_{critical, N_r}$, reject H0. Alternatively, a p-value can be calculated from
enumeration of all possible combinations of W given $N_r$.
Example
Original data:
  i    x2,i   x1,i   sgn   abs
  1    125    110     1    15
  2    115    122    –1     7
  3    130    125     1     5
  4    140    120     1    20
  5    140    140           0
  6    115    124    –1     9
  7    140    123     1    17
  8    125    137    –1    12
  9    140    135     1     5
 10    135    145    –1    10
Ordered by absolute difference:
  i    x2,i   x1,i   sgn   abs    Ri    sgn·Ri
  5    140    140           0
  3    130    125     1     5    1.5     1.5
  9    140    135     1     5    1.5     1.5
  2    115    122    –1     7    3      –3
  6    115    124    –1     9    4      –4
 10    135    145    –1    10    5      –5
  8    125    137    –1    12    6      –6
  1    125    110     1    15    7       7
  7    140    123     1    17    8       8
  4    140    120     1    20    9       9
sgn is the sign function, abs is the absolute value, and Ri is the rank. Notice that pairs 3 and
9 are tied in absolute value. They would be ranked 1 and 2, so each gets the average of those
ranks, 1.5.
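The ranking steps can be traced in code using the ten pairs of the example. This is a minimal sketch; the function name `signed_rank_statistic` is an illustrative choice, and the comparison with a critical value or z-score is left out.

```python
def signed_rank_statistic(x2, x1):
    """Return (W, N_r) for paired measurements, following steps 1 to 5."""
    # Steps 1-2: differences, dropping pairs with zero difference
    diffs = [a - b for a, b in zip(x2, x1) if a != b]
    # Step 3: order by absolute difference
    ordered = sorted(diffs, key=abs)
    # Step 4: rank, giving tied absolute differences the average rank
    ranks = [0.0] * len(ordered)
    i = 0
    while i < len(ordered):
        j = i
        while j + 1 < len(ordered) and abs(ordered[j + 1]) == abs(ordered[i]):
            j += 1
        for k in range(i, j + 1):
            ranks[k] = (i + j) / 2 + 1
        i = j + 1
    # Step 5: absolute value of the sum of the signed ranks
    signed_sum = sum(r if d > 0 else -r for d, r in zip(ordered, ranks))
    return abs(signed_sum), len(ordered)


x2 = [125, 115, 130, 140, 140, 115, 140, 125, 140, 135]
x1 = [110, 122, 125, 120, 140, 124, 123, 137, 135, 145]
w, n_r = signed_rank_statistic(x2, x1)
print(w, n_r)
```

For these data the signed ranks are 1.5, 1.5, -3, -4, -5, -6, 7, 8 and 9, so W = 9 with a reduced sample size of 9.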
Mann–Whitney U
In statistics, the Mann–Whitney U test (also called the Mann–Whitney–Wilcoxon (MWW)
or Wilcoxon rank-sum test) is a non-parametric statistical hypothesis test for assessing whether
one of two samples of independent observations tends to have larger values than the other. It is
one of the most well-known non-parametric significance tests. It was proposed initially[1] by the
German Gustav Deuchler in 1914 (with a missing term in the variance) and later independently
by Frank Wilcoxon in 1945,[2] for equal sample sizes, and extended to arbitrary sample sizes and
in other ways by Henry Mann and his student Donald Ransom Whitney in 1947.[3]
Assumptions and formal statement of hypotheses
Although Mann and Whitney developed the MWW test under the assumption
of continuous responses with the alternative hypothesis being that one distribution
is stochastically greater than the other, there are many other ways to formulate the null and
alternative hypotheses such that the MWW test will give a valid test.
A very general formulation is to assume that:
1. All the observations from both groups are independent of each other,
2. The responses are ordinal (i.e. one can at least say, of any two observations, which is the
greater),
3. Under the null hypothesis the distributions of both groups are equal, so that the
probability of an observation from one population (X) exceeding an observation from the
second population (Y) equals the probability of an observation from Y exceeding an
observation from X, that is, there is a symmetry between populations with respect to
probability of random drawing of a larger observation.
4. Under the alternative hypothesis the probability of an observation from one population
(X) exceeding an observation from the second population (Y) (after exclusion of ties) is
not equal to 0.5. The alternative may also be stated in terms of a one-sided test, for
example: P(X > Y) + 0.5 P(X = Y) > 0.5.
Under more strict assumptions than those above, e.g., if the responses are assumed to be
continuous and the alternative is restricted to a shift in location (i.e. F1(x) = F2(x + δ)), we can
interpret a significant MWW test as showing a difference in medians. Under this location shift
assumption, we can also interpret the MWW as assessing whether the Hodges–Lehmann
estimate of the difference in central tendency between the two populations differs from zero.
The Hodges–Lehmann estimate for this two-sample problem is the median of all possible
differences between an observation in the first sample and an observation in the second sample.
Calculations
The test involves the calculation of a statistic, usually called U, whose distribution under the null
hypothesis is known. In the case of small samples, the distribution is tabulated, but for sample
sizes above ~20 there is a good approximation using the normal distribution. Some books
tabulate statistics equivalent to U, such as the sum of ranks in one of the samples, rather
than U itself.
The U test is included in most modern statistical packages. It is also easily calculated by hand,
especially for small samples. There are two ways of doing this.
First, arrange all the observations into a single ranked series. That is, rank all the observations
without regard to which sample they are in.
Method one:
For small samples a direct method is recommended. It is very quick, and gives an insight into the
meaning of the U statistic.
1. Choose the sample for which the ranks seem to be smaller (The only reason to do this is
to make computation easier). Call this "sample 1," and call the other sample "sample 2."
2. Taking each observation in sample 1, count the number of observations in sample 2 that
have a smaller rank (count a half for any that are equal to it). The sum of these counts
is U.
Method two:
For larger samples, a formula can be used:
1. Add up the ranks for the observations which came from sample 1. The sum of ranks in
sample 2 follows by calculation, since the sum of all the ranks equals N (N + 1)/2
where N is the total number of observations.
2. U is then given by:
$$U_1 = R_1 - \frac{n_1 (n_1 + 1)}{2}$$
where n1 is the sample size for sample 1, and R1 is the sum of the ranks in sample 1.
Note that there is no specification as to which sample is considered sample 1. An equally
valid formula for U is
$$U_2 = R_2 - \frac{n_2 (n_2 + 1)}{2}$$
The smaller value of U1 and U2 is the one used when consulting significance tables. The
sum of the two values is given by
$$U_1 + U_2 = R_1 - \frac{n_1 (n_1 + 1)}{2} + R_2 - \frac{n_2 (n_2 + 1)}{2}$$
Knowing that R1 + R2 = N (N + 1)/2 and N = n1 + n2 , and doing some algebra, we find
that the sum is
$$U_1 + U_2 = n_1 n_2$$
Properties
The maximum value of U is the product of the sample sizes for the two samples. In such a
case, the "other" U would be 0.
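For large samples,
$$z = \frac{U - m_U}{\sigma_U}$$
where mU and σU are the mean and standard deviation of U, is approximately a standard
normal deviate whose significance can be checked in tables of the normal
distribution. mU and σU are given by
$$m_U = \frac{n_1 n_2}{2}, \qquad \sigma_U = \sqrt{\frac{n_1 n_2 (n_1 + n_2 + 1)}{12}}$$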
The formula for the standard deviation is more complicated in the presence of tied
ranks; the full formula is given in the text books referenced below. However, if the
number of ties is small (and especially if there are no large tie bands) ties can be
ignored when doing calculations by hand. The computer statistical packages will use
the correctly adjusted formula as a matter of routine.
Note that since U1 + U2 = n1 n2, the mean n1 n2/2 used in the normal approximation is
the mean of the two values of U. Therefore, the absolute value of the z statistic
calculated will be the same whichever value of U is used.
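Both calculation methods and the normal approximation can be compared on a small data set. The sketch below assumes two hypothetical samples without ties; the function names are illustrative, and the standard deviation used is the version without the tie correction.

```python
from math import sqrt


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def u_from_rank_sums(sample1, sample2):
    """Method two: U1 and U2 from the rank sums R1 and R2."""
    n1, n2 = len(sample1), len(sample2)
    ranks = average_ranks(list(sample1) + list(sample2))
    r1, r2 = sum(ranks[:n1]), sum(ranks[n1:])
    return r1 - n1 * (n1 + 1) / 2, r2 - n2 * (n2 + 1) / 2


def u_by_counting(sample1, sample2):
    """Method one: count sample-2 values below each sample-1 value."""
    return sum((b < a) + 0.5 * (b == a) for a in sample1 for b in sample2)


def z_score(u, n1, n2):
    """Normal approximation, ignoring any correction for ties."""
    m_u = n1 * n2 / 2
    sigma_u = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return (u - m_u) / sigma_u


# Hypothetical samples (illustration only)
sample1 = [11, 23, 38, 50]
sample2 = [29, 44, 61, 72, 85]
u1, u2 = u_from_rank_sums(sample1, sample2)
u_direct = u_by_counting(sample1, sample2)
z = z_score(min(u1, u2), len(sample1), len(sample2))
print(u1, u2, u_direct, round(z, 3))
```

The counting method and the rank-sum formula give the same U for sample 1, and the two U values add up to n1 n2 = 20 as stated above. The samples are far too small for the normal approximation to be trusted; z is computed only to show the formula.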
Kruskal–Wallis one-way analysis of variance
In statistics, the Kruskal–Wallis one-way analysis of variance by ranks (named after William
Kruskal and W. Allen Wallis) is a non-parametric method for testing whether samples originate
from the same distribution. It is used for comparing more than two samples that are independent,
or not related. The parametric equivalent of the Kruskal-Wallis test is the one-way analysis of
variance (ANOVA). The null hypothesis is that the populations from which the samples
originate have the same median. When the Kruskal-Wallis test leads to significant results, then at
least one of the samples is different from the other samples. The test does not identify where the
differences occur or how many differences actually occur. It is an extension of the
Mann–Whitney U test to 3 or more groups. The Mann-Whitney test would help analyze the specific sample
pairs for significant differences.
Since it is a non-parametric method, the Kruskal–Wallis test does not assume
a normal distribution, unlike the analogous one-way analysis of variance. However, the test does
assume an identically shaped and scaled distribution for each group, except for any difference
in medians.
Method
1. Rank all data from all groups together; i.e., rank the data from 1 to N ignoring group
membership. Assign any tied values the average of the ranks they would have received
had they not been tied.
2. The test statistic is given by:
$$K = (N - 1) \frac{\sum_{i=1}^{g} n_i (\bar{r}_{i\cdot} - \bar{r})^2}{\sum_{i=1}^{g} \sum_{j=1}^{n_i} (r_{ij} - \bar{r})^2}$$
where:
• $n_i$ is the number of observations in group i
• $r_{ij}$ is the rank (among all observations) of observation j from group i
• N is the total number of observations across all groups
• $\bar{r}_{i\cdot} = \frac{1}{n_i} \sum_{j=1}^{n_i} r_{ij}$,
• $\bar{r} = (N + 1)/2$ is the average of all the $r_{ij}$.
3. Notice that the denominator of the expression for K is
exactly $(N - 1) N (N + 1)/12$ and $\bar{r} = (N + 1)/2$. Thus
$$K = \frac{12}{N (N + 1)} \sum_{i=1}^{g} n_i \left( \bar{r}_{i\cdot} - \frac{N + 1}{2} \right)^2 = \frac{12}{N (N + 1)} \sum_{i=1}^{g} n_i \bar{r}_{i\cdot}^2 - 3 (N + 1)$$
Notice that the last formula only contains the squares of the average ranks.
4. A correction for ties can be made by dividing K
by $1 - \frac{\sum_{i=1}^{G} (t_i^3 - t_i)}{N^3 - N}$, where G is
the number of groupings of different tied ranks, and ti is the number of tied values within
group i that are tied at a particular value. This correction usually makes little difference
in the value of K unless there are a large number of ties.
5. Finally, the p-value is approximated by $\Pr(\chi^2_{g-1} \ge K)$. If some $n_i$ values are small
(i.e., less than 5) the probability distribution of K can be quite different from this
chi-squared distribution. If a table of the chi-squared probability distribution is available, the
critical value of chi-squared, $\chi^2_{\alpha: g-1}$, can be found by entering the table at g − 1 degrees
of freedom and looking under the desired significance or alpha level. The null
hypothesis of equal population medians would then be rejected if $K \ge \chi^2_{\alpha: g-1}$.
Appropriate multiple comparisons would then be performed on the group medians.
6. If the statistic is not significant, then there is no evidence of differences between the samples.
However, if the test is significant then a difference exists between at least two of the samples.
Therefore, a researcher might use sample contrasts between individual sample pairs, or
post hoc tests, to determine which of the sample pairs are significantly different. When
performing multiple sample contrasts, the Type I error rate tends to become inflated.
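The statistic can be computed from pooled ranks in a few lines. The sketch below assumes three hypothetical groups with no tied values, so that the general expression for K and the shortcut based on squared average ranks agree; the function name `kruskal_wallis` is illustrative, and no tie correction or p-value is computed.

```python
def kruskal_wallis(groups):
    """Return K from the general formula and from the shortcut formula."""
    pooled = sorted(x for g in groups for x in g)
    big_n = len(pooled)
    # Step 1: rank the pooled data, averaging ranks of tied values
    rank_of = {}
    i = 0
    while i < big_n:
        j = i
        while j + 1 < big_n and pooled[j + 1] == pooled[i]:
            j += 1
        rank_of[pooled[i]] = (i + j) / 2 + 1
        i = j + 1
    r_bar = (big_n + 1) / 2
    mean_ranks = [sum(rank_of[x] for x in g) / len(g) for g in groups]
    # Step 2: general formula
    numerator = (big_n - 1) * sum(
        len(g) * (m - r_bar) ** 2 for g, m in zip(groups, mean_ranks))
    denominator = sum((rank_of[x] - r_bar) ** 2 for g in groups for x in g)
    k_general = numerator / denominator
    # Step 3: shortcut, valid as written when there are no ties
    k_shortcut = (12 / (big_n * (big_n + 1))
                  * sum(len(g) * m ** 2 for g, m in zip(groups, mean_ranks))
                  - 3 * (big_n + 1))
    return k_general, k_shortcut


# Hypothetical groups without ties (illustration only)
groups = [[12, 15, 19, 25], [14, 22, 30], [28, 33, 35, 40]]
k_general, k_shortcut = kruskal_wallis(groups)
print(round(k_general, 4), round(k_shortcut, 4))
```

With g = 3 groups the value would be compared with the chi-squared critical value at 2 degrees of freedom, keeping in mind the caution above about small group sizes.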
Cohen's kappa (Kappa Statistics)
Cohen's kappa coefficient is a statistical measure of inter-rater agreement or inter-annotator
agreement for qualitative (categorical) items. It is generally thought to be a more robust measure
than simple percent agreement calculation since κ takes into account the agreement occurring by
chance. Some researchers have expressed concern over κ's tendency to take the observed
categories' frequencies as givens, which can have the effect of underestimating agreement for a
category that is also commonly used; for this reason, κ is considered an overly conservative
measure of agreement.
Others contest the assertion that kappa "takes into account" chance agreement. To do this
effectively would require an explicit model of how chance affects rater decisions. The so-called
chance adjustment of kappa statistics supposes that, when not completely certain, raters simply
guess—a very unrealistic scenario.
Calculation
Cohen's kappa measures the agreement between two raters who each classify N items
into C mutually exclusive categories. The first mention of a kappa-like statistic is attributed to
Galton (1892), see Smeeton (1985). The equation for κ is:
$$\kappa = \frac{\Pr(a) - \Pr(e)}{1 - \Pr(e)}$$
where Pr(a) is the relative observed agreement among raters, and Pr(e) is the hypothetical
probability of chance agreement, using the observed data to calculate the probabilities of
each observer randomly saying each category. If the raters are in complete agreement then κ
= 1. If there is no agreement among the raters other than what would be expected by chance
(as defined by Pr (e)), κ = 0.
Example
Suppose that you were analyzing data related to people applying for a grant. Each grant
proposal was read by two people and each reader either said "Yes" or "No" to the proposal.
Suppose the data were as follows, where rows are reader A and columns are reader B:
              B: Yes    B: No
A: Yes          20         5
A: No           10        15
Note that there were 20 proposals that were granted by both reader A and reader B and 15
proposals that were rejected by both readers. Thus, the observed percentage agreement is Pr
(a) = (20 + 15) / 50 = 0.70
To calculate Pr (e) (the probability of random agreement) we note that:
• Reader A said "Yes" to 25 applicants and "No" to 25 applicants. Thus reader A said
"Yes" 50% of the time.
• Reader B said "Yes" to 30 applicants and "No" to 20 applicants. Thus reader B said
"Yes" 60% of the time.
Therefore the probability that both of them would say "Yes" randomly is 0.50 * 0.60 =
0.30 and the probability that both of them would say "No" is 0.50 * 0.40 = 0.20. Thus the
overall probability of random agreement is Pr (e) = 0.3 + 0.2 = 0.5.
So now applying our formula for Cohen's Kappa we get:
$$\kappa = \frac{\Pr(a) - \Pr(e)}{1 - \Pr(e)} = \frac{0.70 - 0.50}{1 - 0.50} = 0.40$$
Same percentages but different numbers
A case sometimes considered to be a problem with Cohen's Kappa occurs when
comparing the Kappa calculated for two pairs of raters with the two raters in each pair
having the same percentage agreement but one pair give a similar number of ratings
while the other pair give a very different number of ratings.[6] For instance, in the
following two cases there is equal agreement between A and B (60 out of 100 in both
cases) so we would expect the relative values of Cohen's Kappa to reflect this. However,
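calculating Cohen's Kappa for each:
Case I
                 Rater II: Yes    Rater II: No
Rater I: Yes          45               15
Rater I: No           25               15
κ = (0.60 − 0.54) / (1 − 0.54) = 0.1304
Case II
                 Rater II: Yes    Rater II: No
Rater I: Yes          25               35
Rater I: No            5               35
κ = (0.60 − 0.46) / (1 − 0.46) = 0.2593
we find that it shows greater similarity between A and B in the second case, compared to
the first.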
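The kappa calculation for the grant example and for Cases I and II can be sketched as follows. The function name `cohens_kappa` is an illustrative choice; the tables are entered with one rater in the rows and the other in the columns, and chance agreement is computed from the marginal proportions as described above.

```python
def cohens_kappa(table):
    """Cohen's kappa for a square agreement table (rows: rater 1, cols: rater 2)."""
    n = sum(sum(row) for row in table)
    # Observed agreement: proportion of items on the diagonal
    pr_a = sum(table[i][i] for i in range(len(table))) / n
    # Chance agreement from the two raters' marginal proportions
    row_props = [sum(row) / n for row in table]
    col_props = [sum(col) / n for col in zip(*table)]
    pr_e = sum(r * c for r, c in zip(row_props, col_props))
    return (pr_a - pr_e) / (1 - pr_e)


grant_example = [[20, 5], [10, 15]]
case_1 = [[45, 15], [25, 15]]
case_2 = [[25, 35], [5, 35]]
kappa_grant = cohens_kappa(grant_example)
kappa_1 = cohens_kappa(case_1)
kappa_2 = cohens_kappa(case_2)
print(round(kappa_grant, 4), round(kappa_1, 4), round(kappa_2, 4))
```

Both cases have 60% observed agreement, yet the chance agreement differs (0.54 against 0.46), which is why the two kappa values differ.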
Significance and magnitude
Statistical significance makes no claim on how important is the magnitude in a given
application or what is considered as high or low agreement.
Statistical significance for kappa is rarely reported, probably because even relatively low
values of kappa can nonetheless be significantly different from zero but not of sufficient
magnitude to satisfy investigators. Still, its standard error has been described and is
computed by various computer programs. If statistical significance is not a useful guide,
what magnitude of kappa reflects adequate agreement? Guidelines would be helpful, but
factors other than agreement can influence its magnitude, which makes interpretation of
a given magnitude problematic. As Sim and Wright noted, two important factors are
prevalence (are the codes equiprobable or do their probabilities vary) and bias (are the
marginal probabilities for the two observers similar or different). Other things being
equal, kappas are higher when codes are equiprobable and distributed similarly by the
two observers.
Another factor is the number of codes. As number of codes increases, kappas become
higher. Based on a simulation study, Bakeman and colleagues concluded that for fallible
observers, values for kappa were lower when codes were fewer. And, in agreement with
Sim & Wright's statement concerning prevalence, kappas were higher when codes were
roughly equiprobable. Thus Bakeman et al. concluded that "no one value of kappa can be
regarded as universally acceptable." They also provide a computer program that lets
users compute values for kappa specifying number of codes, their probability, and
observer accuracy. For example, given equiprobable codes and observers who are 85%
accurate, the values of kappa are .49, .60, .66, and .69 when the number of codes is 2, 3, 5, and
10, respectively.
Nonetheless, magnitude guidelines have appeared in the literature. Perhaps the first was
Landis and Koch, who characterized values < 0 as indicating no agreement and 0–.20 as
slight, .21–.40 as fair, .41–.60 as moderate, .61–.80 as substantial, and .81–1 as almost
perfect agreement. This set of guidelines is however by no means universally accepted;
Landis and Koch supplied no evidence to support it, basing it instead on personal
opinion. It has been noted that these guidelines may be more harmful than helpful.
Fleiss's equally arbitrary guidelines characterize kappas over .75 as excellent, .40 to .75
as fair to good, and below .40 as poor.
Intraclass correlation Coefficient (ICC)
Intraclass Correlation Coefficient (ICC) is a general measurement of agreement or consensus,
where the measurements used are assumed to be parametric (continuous and normally
distributed). The coefficient represents agreement between two or more raters or evaluation
methods on the same set of subjects.
ICC has advantages over correlation coefficient, in that it is adjusted for the effects of the scale
of measurements, and that it will represent agreements from more than two raters.
In statistics, the intraclass correlation (or the intraclass correlation coefficient, abbreviated
ICC) is a descriptive statistic that can be used when quantitative measurements are made on units
that are organized into groups. It describes how strongly units in the same group resemble each
other. While it is viewed as a type of correlation, unlike most other correlation measures it
operates on data structured as groups, rather than data structured as paired observations.
The intraclass correlation is commonly used to quantify the degree to which individuals with a
fixed degree of relatedness (e.g. full siblings) resemble each other in terms of a quantitative trait
(see heritability). Another prominent application is the assessment of consistency or
reproducibility of quantitative measurements made by different observers measuring the same
quantity.
Interpretation of results
Initially, a two way analysis of variance is carried out, and if the between column mean sum
squares (between methods of measurements or raters) is not significantly greater than the
residual error, then a conclusion that concordance exists can be made. The intraclass correlation
coefficient then provides a scalar measure of agreement or concordance between all the
methods. The value 1 represents perfect agreement, and 0 represents no agreement at all.
In consideration of the ICC results in terms of models:
• Model 1 assumes that each method of measurement or rater is different, being subsets of
a larger set of methods or raters, randomly chosen. This would be the case of having
different research assistants measuring the height of children, with different sets of
people doing the measurements at different sites, then combining the data together.
• Model 2 assumes the same methods or raters perform the evaluations in all cases,
although these methods or raters may be a subset of a larger set of methods or raters. This
is the model most frequently encountered in clinical research, when the same researchers
carried out measurements on all the subjects.
• Model 3 makes no assumptions about the methods or raters.
In consideration of single or mean:
• If each value (score or measurement) is obtained from an individual, then the single or
individual form is used. This is the most common situation encountered in clinical research.
• If each value (score or measurement) is the mean of multiple measurements (for example,
each measurement is the mean of repeated measurements of a sample or a rating score is
the mean of that obtained by a team of raters), then the mean form should be used.
In most cases, therefore, the Model 2 individual form is used.
ICC can be interpreted as follows:
0-0.2 indicates poor agreement;
0.3-0.4 indicates fair agreement;
0.5-0.6 indicates moderate agreement;
0.7-0.8 indicates strong agreement;
0.8-1.0 indicates almost perfect agreement.
The Intraclass Correlation (ICC) assesses rating reliability by comparing the variability of
different ratings of the same subject to the total variation across all ratings and all subjects.
The theoretical formula for the ICC is:
$$\mathrm{ICC} = \frac{\sigma^2(b)}{\sigma^2(b) + \sigma^2(w)} \qquad [1]$$
where $\sigma^2(w)$ is the pooled variance within subjects, and $\sigma^2(b)$ is the variance of the trait between
subjects.
It is easily shown that $\sigma^2(b) + \sigma^2(w)$ = the total variance of ratings, i.e., the variance for all
ratings, regardless of whether they are for the same subject or not. Hence the interpretation of the
ICC as the proportion of total variance accounted for by between-subject variation.
Equation [1] above would apply if we knew the true values, $\sigma^2(w)$ and $\sigma^2(b)$. But we rarely do,
and must instead estimate them from sample data.
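One common way to obtain such estimates is a one-way analysis of variance with subjects as the grouping factor. The sketch below is an assumption about how the estimation could be done, not the method prescribed by these notes: it takes the within-subject mean square as the estimate of σ²(w) and (MS between − MS within)/k as the estimate of σ²(b), where k is the number of ratings per subject. The function name `icc_one_way` and the ratings are hypothetical.

```python
def icc_one_way(ratings):
    """One-way ICC estimate; each row holds the k ratings of one subject."""
    n_subjects = len(ratings)
    k = len(ratings[0])
    grand_mean = sum(sum(row) for row in ratings) / (n_subjects * k)
    subject_means = [sum(row) / k for row in ratings]
    # Mean square between subjects
    ms_between = (k * sum((m - grand_mean) ** 2 for m in subject_means)
                  / (n_subjects - 1))
    # Pooled mean square within subjects
    ms_within = (sum((x - m) ** 2
                     for row, m in zip(ratings, subject_means) for x in row)
                 / (n_subjects * (k - 1)))
    var_within = ms_within                        # estimate of sigma^2(w)
    var_between = (ms_between - ms_within) / k    # estimate of sigma^2(b)
    return var_between / (var_between + var_within)


# Hypothetical ratings: 4 subjects, each rated twice (illustration only)
ratings = [[4, 5], [6, 6], [8, 7], [2, 3]]
icc = icc_one_way(ratings)
print(round(icc, 4))
```

The ratings of each subject are close together while the subjects differ widely, so the estimate is high (about 0.92). This one-way form corresponds most closely to the Model 1 situation described earlier; other models use a two-way analysis of variance.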