Chapter 1 Notes

Chapter 1: Introduction
What is statistics?
Statistics is the science of data. It involves collecting, classifying, summarizing, organizing,
analyzing, and interpreting numerical information.
Statistics has two branches:
Descriptive Statistics: mean, median, variance, standard deviation, histograms, pie charts.
Inferential Statistics: hypothesis testing (drawing conclusions and making decisions).
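As a quick illustration (a sketch with made-up numbers, not part of the original notes), the following Python snippet computes several of the descriptive statistics listed above:

```python
import statistics

# Hypothetical data: heights (inches) of seven students.
heights = [68, 71, 75, 72, 69, 74, 70]

print("mean:", statistics.mean(heights))          # arithmetic average
print("median:", statistics.median(heights))      # middle value when sorted
print("variance:", statistics.variance(heights))  # sample variance (n - 1 denominator)
print("std dev:", statistics.stdev(heights))      # sample standard deviation
```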
1.1 Some Important Terminology
Population: the set of all measurements of interest to the researcher.
Could be finite or infinite.
Could be real or hypothetical.
Sample: a subset of measurements selected from the population of interest.
Random Sample: every possible sample of size n that can be selected from the
population has the same probability of being selected (see the sketch at the end of this section).
Sample of Convenience: a sample used simply because it is available and convenient.
Statistic: a measure computed from sample data.
EX: the sample mean x̄ and the sample variance s².
Parameter: a constant that determines the specific form of a density function. Parameters
are usually unknown, so we need to estimate them.
EX: the population mean µ and the population variance σ².
Random Variable: the numerical data on which we perform statistical analyses are the
outcomes of a random sampling procedure. A set of such outcomes is called a random
variable.
Continuous Variable: A random variable is continuous if the values that it can assume
consist of all real numbers in some interval.
Discrete Variable: A random variable is discrete if it can assume only a finite or countably
infinite number of values.
Population vs. Sample: We try to describe or predict the behavior of the population on the
basis of information obtained from a representative sample from the population.
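As a small illustration of random sampling (hypothetical population; this sketch is an addition, not part of the original notes), Python's random.sample draws a simple random sample in which every subset of size n is equally likely to be chosen:

```python
import random

# Hypothetical population: ID numbers of 500 students.
population = list(range(1, 501))

# Simple random sample of size n = 10: every subset of 10 IDs
# has the same probability of being selected.
sample = random.sample(population, k=10)
print(sorted(sample))
```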
1.2 Hypothesis Testing
Statistical inference:
Hypothesis testing: testing a statement about one or more populations.
Interval estimation: covered in the next section.
A statistical test of hypothesis consists of:
1. The null hypothesis, denoted by H0.
2. The alternative hypothesis, denoted by H1.
3. The test statistic and its p-value.
4. The decision rule.
5. The conclusion.
Alternative hypothesis H1: generally the hypothesis that the researcher wishes to support.
Null hypothesis H0: a contradiction of the alternative hypothesis.
A statistical test always begins by assuming H0 is true, and then uses the sample data
to decide whether that assumption should be rejected.
Statement 1: The average height for CMU students is different from 75 inches.
H0: µ = 75
H1: µ ≠ 75
Statement 2: The average height for CMU students is greater than 75 inches.
H0: µ = 75
H1: µ > 75
Statement 3: The average height for CMU students is less than 75 inches.
H0: µ = 75
H1: µ < 75
Alternative hypothesis:
One-tailed: H1: µ < µ0 or H1: µ > µ0
Two-tailed: H1: µ ≠ µ0
Test statistic: a single number calculated from the sample data.
p-value: the probability, computed assuming H0 is true, of observing a value of the test
statistic at least as extreme as the one actually observed.
Either or both of these measures serve as the basis for the researcher's decision on
whether to reject or accept H0.
Rejection region: the set of values of the test statistic that support H1 and lead to rejection of H0.
Critical values: separate the acceptance and rejection regions.
Level of Significance α: the risk you are willing to take of making an incorrect decision,
i.e., of rejecting H0 when it is true.
Decision Rule: Reject H0 if the p-value ≤ α.
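A minimal sketch of these ideas, assuming a hypothetical one-sample z-test of H0: µ = 75 against H1: µ ≠ 75 with a known population standard deviation (data and σ are made up for illustration; this is not part of the original notes):

```python
import math
from statistics import NormalDist

# Hypothetical sample of CMU student heights (inches); sigma assumed known.
data = [74.2, 76.1, 75.8, 73.9, 77.3, 74.8, 76.5, 75.2]
mu0 = 75.0      # value of the mean under H0
sigma = 2.0     # assumed known population standard deviation
alpha = 0.05    # level of significance

n = len(data)
xbar = sum(data) / n

# Test statistic: z = (xbar - mu0) / (sigma / sqrt(n))
z = (xbar - mu0) / (sigma / math.sqrt(n))

# Two-tailed p-value: P(|Z| >= |z|) computed under H0
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

# Decision rule: reject H0 if the p-value <= alpha
decision = "reject H0" if p_value <= alpha else "fail to reject H0"
print(f"z = {z:.3f}, p-value = {p_value:.4f} -> {decision}")
```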
In a courtroom:

                          Evidence against suspect
True state of nature      Innocent              Guilty
Innocent                  Correct decision      Mistake
Guilty                    Mistake               Correct decision
Risk of a test of hypothesis:

                          Statistical conclusion
True state of nature      Fail to reject H0                Reject H0
H0 is true                Correct decision                 Type I error (probability α)
H0 is false               Type II error (probability β)    Correct decision
Note that: α = P(Type I error) = P(Reject H0 | H0 is true) = P(rejection region | H0 is true)
β = P(Type II error) = P(Fail to reject H0 | H0 is false)
Power of a test = 1 − β = P(Reject H0 | H0 is false)
High power is always a desirable characteristic of a test.
Power can be increased in several ways (see the simulation sketch below):
Increasing the sample size increases power.
Increasing the significance level increases power: a choice of α = 0.05 will result in a more
powerful test than a choice of α = 0.01.
Power also increases with the size of the effect to be detected.
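A small Monte Carlo sketch illustrating all three effects (the z-test setup, effect sizes, and sample sizes are hypothetical assumptions, not taken from the notes):

```python
import math
import random
from statistics import NormalDist

def estimated_power(n, alpha, effect, sigma=1.0, reps=5_000):
    """Monte Carlo estimate of power for a two-tailed one-sample z-test
    of H0: mu = 0; effect is the true mean, i.e., the departure from H0."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # critical value
    rejections = 0
    for _ in range(reps):
        data = [random.gauss(effect, sigma) for _ in range(n)]
        z = (sum(data) / n) / (sigma / math.sqrt(n))
        if abs(z) >= z_crit:
            rejections += 1
    return rejections / reps

# Power increases with n, with alpha, and with effect size:
print(estimated_power(n=20, alpha=0.05, effect=0.5))  # baseline
print(estimated_power(n=80, alpha=0.05, effect=0.5))  # larger n -> higher power
print(estimated_power(n=20, alpha=0.01, effect=0.5))  # smaller alpha -> lower power
print(estimated_power(n=20, alpha=0.05, effect=1.0))  # bigger effect -> higher power
```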
Efficiency of hypothesis tests:
The relative efficiency of test A to test B (for the same H0, H1, α, and β) is the ratio
nB / nA, where nA and nB are the sample sizes, respectively, of tests A and B.
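For example (a hypothetical illustration): if test A attains a given α and β with nA = 50
observations while test B requires nB = 100, the relative efficiency of A to B is
100/50 = 2, i.e., test A needs only half as much data.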
1.3 Estimation
Point estimation: a single value, called the point estimate, computed from sample data.
Interval estimation: two values of the parameter being estimated, a lower value and an
upper value. It is a more desirable and useful estimate, called a confidence interval.
The probability that a confidence interval will contain the estimated parameter is called
the confidence coefficient, designated by (1 − α ) .
The probabilistic interpretation is based on the fact that, in repeated sampling,
100(1 − α)% of the intervals constructed in the same manner (and with the same sample
sizes) contain the parameter being estimated.
The practical interpretation: we say we are 100(1 − α)% "confident" that the single
interval constructed contains the parameter we are estimating.
Confidence interval for the parameter θ in probability terms: P(L0 < θ < U0) = 1 − α,
where L0 and U0 are random variables.
Confidence interval for the parameter θ in practical terms: C(L < θ < U) = 100(1 − α)%,
where L and U are specified values.
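A minimal sketch of an interval estimate, assuming hypothetical data and a known population standard deviation σ (a simplifying assumption), for a normal-based 100(1 − α)% confidence interval for the mean µ:

```python
import math
from statistics import NormalDist

data = [74.2, 76.1, 75.8, 73.9, 77.3, 74.8, 76.5, 75.2]  # hypothetical heights
sigma = 2.0   # assumed known population standard deviation
alpha = 0.05  # confidence coefficient: 1 - alpha = 0.95

n = len(data)
xbar = sum(data) / n
z = NormalDist().inv_cdf(1 - alpha / 2)  # z_{alpha/2}, about 1.96 here
margin = z * sigma / math.sqrt(n)

L, U = xbar - margin, xbar + margin      # (L, U): the 95% CI for mu
print(f"95% CI for mu: ({L:.2f}, {U:.2f})")
```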
Hypothesis testing and confidence intervals are related: the values of the parameter θ
contained in the 100(1 − α)% confidence interval L < θ < U are exactly those values that
would not be rejected by the corresponding two-tailed hypothesis test at level α.
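For instance, with the hypothetical data in the sketch above, the 95% interval comes out
to roughly (74.1, 76.9), so H0: µ = 75 would not be rejected at α = 0.05 (75 lies inside
the interval), while H0: µ = 78 would be rejected.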
1.4 Measurement Scales
Nominal scale (Count Data, Categorical Data): distinguishes one object or event from
another on the basis of a name.
EX: sex, blood type, eye color.
Ordinal scale (Categorical Data): distinguishes objects or events from one another on the
basis of the relative amount of some characteristic they possess.
EX: letter grades (A, B, C, ...), rankings such as small/medium/large.
Interval scale: When objects or events can be distinguished one from another and ranked,
and when the differences between measurements also have meaning.
EX: temperature in degrees Celsius or Fahrenheit (the zero point is arbitrary).
Ratio scale: When measurements have the properties of the first three scales and the
additional property that their ratios are meaningful, the scale of measurement is the ratio
scale. A property of the ratio scale is a true zero, indicating a complete absence of the
trait being measured.
EX: height, weight, elapsed time.
1.5 Nonparametric Statistics
Why nonparametric statistics?
Nonparametric procedures:
a) Truly nonparametric procedures.
b) Distribution-free procedures.
Advantages of nonparametric statistics:
a) They depend on a minimum of assumptions.
b) For some nonparametric procedures, the computations can be quickly and easily
performed.
c) They require minimal preparation in mathematics and statistics.
d) They can be used when the data are measured on a weak measurement scale, as when
only count data or rank data are available.
Disadvantages of nonparametric statistics:
a) Nonparametric procedures are sometimes used when parametric procedures would be
more appropriate; such a practice wastes information.
b) The arithmetic is in many instances tedious when the sample size is large.
When to use a nonparametric procedure:
a) The hypothesis to be tested does not involve a population parameter.
b) The data have been measured on a scale weaker than that required for the parametric
procedure.
c) The assumptions necessary for the valid use of a parametric procedure are not met.
d) Results are needed in a hurry, a computer is not readily available, and calculations
must be done by hand.
1.6 Availability and Use of Computer Programs in Nonparametric Statistical Analysis