
Applied & Comp Math 625.403
“Statistical Methods and Data Analysis”
Dr. Jackie Telford
General Information
Bring the textbook to every class.
– Try to read or skim chapter(s) before lecture.
Homework is collected each class period (except on exam days),
graded, and returned the next class period.
See Syllabus for details of schedule.
For a copy of the viewgraphs (2 per page) or lecture notes, go to the
“Specific Course Information” link on the www.apl.jhu.edu website
– can bring a copy to class for easier note-taking
– can print in B&W (color usually not necessary)
– can print duplex to save paper
Scientific Method (cyclic process)
1. Collection of data through observations and
experiments
2. Recognition of patterns in those data
3. Formulation of hypotheses and theories to
systematize those patterns, often using
mathematical logic and equations
4. Predictions that can be tested by more
observations and experiments
from: “Great Principles of Science”, Robert Hazen, The Teaching Company
Predicting the Locations of Planets
1. Ptolemy (c. 100-170 AD): Earth-centered solar system,
used circular orbits and smaller epicycles (to explain
“retrograde” motion)
2. Copernicus (1473-1543): Sun-centered solar system, still
assumed circular orbits
3. Tycho Brahe (1546-1601): improved instruments increased the
precision and accuracy of observations; he discovered errors in
the predictions of both the Ptolemaic and Copernican systems
4. Kepler (1571-1630): based on Tycho Brahe’s data,
derived three laws of planetary motion, including that the
planets orbit in ellipses with the Sun at one focus
Scientific Dilemma
Increasingly precise and accurate observations
(decreasing the measurement error) can lead to
improved prediction capability.
Works well for deterministic systems governed by
Newtonian physics (universal law of gravitation).
What if there is inherent unpredictability, such that the
measurement error is small compared to the
randomness in the phenomenon being studied?
(manufacturing variation, signal embedded in noise,
genetics, public opinion polls, …)
How do you get reliable information?
Answer: Statistical Methods
Chapter 1: Introduction
Statistics is the science of collecting and analyzing data for
the purpose of drawing conclusions and making decisions.
Statistical tasks:
1. Collecting data
2. Summarizing and exploring data
3. Drawing conclusions and making decisions,
using the data as the basis for taking action
Statistical Terms:
Descriptive statistics and exploratory data analysis
Inferential statistics (involves fitting models to data)
Statistical Terms (Cont.)
A population is a collection of all units of interest.
A sample is a subset of a population that is actually observed.
A measurable property or attribute associated with each unit of a population
is called a variable.
A parameter is a numerical characteristic of a population.
A statistic is a numerical characteristic of a sample.
Statistics are used to infer the values of parameters.
A random sample gives every unit of the population an equal,
pre-assigned chance of entering the sample.
In probability, we assume that the population and its parameters are known
and compute the probability of drawing a particular sample.
In statistics, we assume that the population and its parameters are unknown
and the sample is used to infer the values of the parameters.
Different samples give different estimates of population parameters (called
sampling variability).
Sampling variability leads to “sampling error”.
Probability is deductive.
Statistics is inductive.
Difference between Statistics and Probability
Probability: Given the information in the box, what is in your hand?
Statistics: Given the information in your hand, what is in the box?
from: Statistics, Norma Gilbert, W.B. Saunders Co., 1976
Statistical Software
SAS
SPSS
Minitab
Statistica
S-plus
BMDP
Excel add-ons (not Microsoft “add-ins”)
JMP - used in class and in computer lab
Chapter 2: Review of Probability
Approaches to probability
– Classical approach (equally likely events, counting formulas)
– Frequentist (“Monte Carlo” the events)
– Personal or subjective approach (Bayesian)
– Axiomatic approach (mathematical)
Basic ideas of axiomatic approach
– Sample space (set of all possible things that could happen)
– Events (particular things that could happen)
– Union (e.g. in cards, either a 10 or a diamond)
– Intersection (e.g. in cards, a 10 of diamonds)
– Complement (e.g. in cards, not a 10 or a diamond)
– Disjoint or mutually exclusive events (e.g. in cards, either a
10 or a 5)
– Inclusion (subset; e.g. in cards, “a 5” is included in “less than a 10”)
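As a minimal illustration of the frequentist (“Monte Carlo”) approach applied to the card events above, here is a Python sketch estimating P(10 or diamond); the exact answer is 4/52 + 13/52 − 1/52 = 16/52 ≈ 0.3077 by counting:

```python
import random

# Frequentist estimate of P(10 or diamond) for one card drawn at random.
ranks = ["2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K", "A"]
suits = ["clubs", "diamonds", "hearts", "spades"]
deck = [(r, s) for r in ranks for s in suits]

trials = 100_000
hits = 0
for _ in range(trials):
    rank, suit = random.choice(deck)
    if rank == "10" or suit == "diamonds":  # the union of the two events
        hits += 1
print(hits / trials)  # ≈ 0.3077
```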
Axioms of Probability
Axioms:
1. P(A) ≥ 0
2. P(S) = 1 where S is the sample space
3. P(A ∪ B) = P(A) + P(B) if A and B are mutually exclusive
Theorems about probability can be proved using these axioms and
these theorems can be used in probability calculations.
P(A) = 1 − P(Aᶜ) (see “birthday problem” on p. 13)
Conditional probability and Joint probability:
P(A | B) = P(A ∩ B) / P(B) and P(A ∩ B) = P(A | B) × P(B)
Events A and B are independent if P(A | B) = P(A),
which implies P(A ∩ B) = P(A) × P(B)
• Example 2.11 on p. 17 (private or public school vs. Adv. Placement)
• Example 2.14 (“Let’s make a deal” problem) on p. 20
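The complement rule above is exactly what makes the “birthday problem” tractable; a minimal Python sketch (assuming 365 equally likely birthdays):

```python
def birthday_collision_prob(n: int) -> float:
    """P(at least two of n people share a birthday), by the complement rule
    P(A) = 1 - P(Ac), where Ac is 'all n birthdays are distinct'."""
    p_all_distinct = 1.0
    for k in range(n):
        p_all_distinct *= (365 - k) / 365
    return 1.0 - p_all_distinct

print(birthday_collision_prob(23))  # ≈ 0.507: better than even odds with 23 people
```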
Sensor Problem
Assume that there are two chemical hazard sensors, A and B.
Let P(A falsely detects a hazardous chemical) = 0.05, and the
same for B.
What is the probability of both sensors falsely detecting a
hazardous chemical?
P(A ∩ B) = P(A | B) × P(B) = P(A) × P(B) = 0.05 × 0.05 = 0.0025
– only if A and B are independent (use different detection methods).
If A and B are both “fooled” by the same chemical substance,
then P(A ∩ B) = P(A | B) × P(B) = 1 × 0.05 = 0.05,
– which is 20 times the rate of false alarms (same type of sensor).
DON’T assume independence without good reason!
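A small simulation sketch of the two cases (the fully shared trigger for dependent sensors is an illustrative assumption, not a model from the text):

```python
import random

trials = 200_000
p = 0.05

# Independent sensors: each has its own 5% chance of a false detection.
both_indep = sum((random.random() < p) and (random.random() < p)
                 for _ in range(trials))

# Fully dependent sensors (same method): one shared event fools both.
both_dep = sum(random.random() < p for _ in range(trials))

print(both_indep / trials)  # ≈ 0.0025 = 0.05 × 0.05
print(both_dep / trials)    # ≈ 0.05, i.e., 20 times higher
```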
AIDS Testing Example
Made-up data
                    AIDS   Not AIDS   Total
Test positive (+)     95        495     590
Test negative (−)      5       9405    9410
Total                100       9900   10000
Want these to be low:
– P(+ | not AIDS) = 495/9900 = .05 (false positives)
– P(− | AIDS) = 5/100 = .05 (false negatives)
Want these to be high:
– P(not AIDS | −) = 9405/9410 = .999 (negative predictive value)
– P(AIDS | +) = 95/590 = .16 (positive predictive value)
Also:
– P(AIDS) = 100/10000 = .01 (prevalence)
– P(not AIDS | +) = 495/590 = .84 (incorrect predictability)
This is one reason why we don’t have mass AIDS testing.
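The positive predictive value can also be recovered from the rates alone via Bayes’ rule; a minimal Python sketch with the made-up numbers above:

```python
# Bayes' rule with the made-up numbers above.
prevalence = 0.01           # P(AIDS) = 100/10000
sensitivity = 0.95          # P(+ | AIDS) = 95/100
false_positive = 0.05       # P(+ | not AIDS) = 495/9900

p_positive = sensitivity * prevalence + false_positive * (1 - prevalence)
p_aids_given_pos = sensitivity * prevalence / p_positive
print(p_aids_given_pos)  # ≈ 0.16: most positives are false when prevalence is low
```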
Counting Formulas
Operation O1 has n1 outcomes, operation O2 has n2
outcomes, …, and operation Ok has nk outcomes.
Multiplication Rule: n1× n2 × … × nk is the total number of
possible outcomes of O1, O2, …, Ok in sequence
Permutations (ordered arrangements of distinct items):
1. n distinct items: n(n − 1) ··· (2)(1) = n!
2. r out of n distinct items: n(n − 1) ··· (n − r + 1) = n!/(n − r)!
(positions 1, 2, …, r are filled in n, n − 1, …, n − r + 1 ways, respectively)
Combinations (unordered arrangements of items):
3. r out of n items: n!/[(n − r)! r!] (dividing by r! unorders them)
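These formulas map directly onto Python’s standard math module; a quick check (math.perm and math.comb assume Python 3.8+):

```python
import math

n, r = 10, 3
print(math.factorial(n))  # 1. all n distinct items: n! = 3628800
print(math.perm(n, r))    # 2. ordered, r out of n: n!/(n - r)! = 720
print(math.comb(n, r))    # 3. unordered, r out of n: n!/[(n - r)! r!] = 120
print(math.perm(n, r) // math.factorial(r))  # dividing by r! "unorders": 120
```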
Suggestions for Solving Probability Problems
Draw a picture
– Venn diagram
– Tree or event diagram (Probabilistic Risk Assessment)
– Sketch
Write out all possible combinations if feasible
Do a smaller scale problem first
– Figure out the algorithm for the solution
– Increment the size of the problem by one and check the
algorithm for correctness
– Generalize algorithm (mathematical induction)
Random Variables
A random variable (r.v.) associates a unique numerical value with
each outcome in the sample space
Example:
X = 1 if the coin toss results in a head, 0 if it results in a tail
Discrete random variables: number of possible values is finite
or countably infinite: x1, x2, x3, x4, x5, x6, …
Probability mass function (p.m.f.)
– f(x) = P(X = x) (Sum over all possible values =1 always)
Cumulative distribution function (c.d.f.)
– F(x) = P(X ≤ x) = Σ_{k ≤ x} f(k)
• See Table 2.1 on p. 21 (p.m.f. and c.d.f. for two dice)
• See Figure 2.5 on p. 22 (p.m.f. and c.d.f. graphs for two dice)
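The two-dice p.m.f. and c.d.f. of Table 2.1 can be rebuilt in a few lines; a Python sketch (not the textbook’s code):

```python
from fractions import Fraction
from itertools import product

# p.m.f. of X = sum of two fair dice: f(x) = P(X = x)
pmf = {}
for d1, d2 in product(range(1, 7), repeat=2):
    pmf[d1 + d2] = pmf.get(d1 + d2, Fraction(0)) + Fraction(1, 36)

# c.d.f.: F(x) = P(X <= x), the running sum of f(k) over k <= x
cdf, running = {}, Fraction(0)
for x in sorted(pmf):
    running += pmf[x]
    cdf[x] = running

print(pmf[7])             # 1/6, the most likely sum
print(sum(pmf.values()))  # 1: the p.m.f. always sums to 1
print(cdf[12])            # 1
```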
Continuous Random Variables
An r.v. is continuous if it can assume any value from
one or more intervals of real numbers
Probability density function (p.d.f.) f(x):
f(x) ≥ 0
∫_{−∞}^{∞} f(x) dx = 1 (Area under the curve = 1 always)
P(a ≤ X ≤ b) = ∫_a^b f(x) dx for any a ≤ b
• See Figure 2.6 on p. 23 (area under curve between two
values a and b)
Cumulative Distribution Function
The cumulative distribution function (c.d.f.), denoted
F(x), for a continuous random variable is given by:
F(x) = P(X ≤ x) = ∫_{−∞}^{x} f(y) dy
It follows that
f(x) = dF(x)/dx
• See Example 2.16 on p. 24 (exponential distribution)
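A quick numerical check of f(x) = dF(x)/dx for the exponential distribution of Example 2.16 (λ = 1 is an arbitrary illustrative choice):

```python
import math

lam = 1.0  # rate parameter, arbitrary choice for illustration

def pdf(x):
    """f(x) = lam * exp(-lam * x) for x >= 0."""
    return lam * math.exp(-lam * x)

def cdf(x):
    """F(x) = 1 - exp(-lam * x) for x >= 0."""
    return 1.0 - math.exp(-lam * x)

# f(x) should match the numerical derivative of F(x).
x, h = 0.7, 1e-6
print(pdf(x))                               # ≈ 0.4966
print((cdf(x + h) - cdf(x - h)) / (2 * h))  # same value, to ~6 digits
```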
Expected Value
The expected value or mean of a discrete r.v. X, denoted
by E(X), µ_X, or simply µ, is defined as:
E(X) = µ = Σ_x x f(x) = x1 f(x1) + x2 f(x2) + …
The expected value of a continuous r.v. X is defined as:
E(X) = µ = ∫_{−∞}^{∞} x f(x) dx
• See Figure 2.7 on p. 25 (mean as Center of Gravity or balance point)
Variance and Standard Deviation
The variance of an r.v. X, denoted by Var(X), σ_X², or
simply σ², is defined as:
Var(X) = σ² = E[(X − µ)²]
So Var(X) = E[(X − µ)²] = E(X² − 2µX + µ²)
= E(X²) − 2µE(X) + E(µ²)
= E(X²) − 2µ·µ + µ²
= E(X²) − µ² = E(X²) − [E(X)]²
The standard deviation (SD) is the square root of the
variance. Note that the variance is in the square of
the original units, while the SD is in the original units.
• See Example 2.17 on p. 26 (mean and variance of
two dice)
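A sketch reproducing the spirit of Example 2.17 (mean and variance of the sum of two dice), using the Var(X) = E(X²) − µ² identity above:

```python
from fractions import Fraction
from itertools import product

# p.m.f. of the sum of two fair dice
pmf = {}
for d1, d2 in product(range(1, 7), repeat=2):
    pmf[d1 + d2] = pmf.get(d1 + d2, Fraction(0)) + Fraction(1, 36)

mean = sum(x * p for x, p in pmf.items())     # E(X)
ex2 = sum(x * x * p for x, p in pmf.items())  # E(X^2)
var = ex2 - mean ** 2                         # Var(X) = E(X^2) - [E(X)]^2
print(mean, var)  # 7 and 35/6 ≈ 5.83
```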
Quantiles and Percentiles
For 0 ≤ p ≤ 1 the pth quantile (or the 100pth percentile),
denoted by θ_p, of a continuous r.v. X is defined by the
following equation:
P(X ≤ θ_p) = F(θ_p) = p
θ_0.5 is called the median
• See Example 2.20 on p. 30 (exponential distribution)
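For the exponential distribution in Example 2.20, F(θ_p) = p can be solved in closed form, θ_p = −ln(1 − p)/λ; a sketch with an arbitrary λ = 1:

```python
import math

lam = 1.0  # arbitrary rate for illustration

def quantile(p):
    """theta_p solving F(theta_p) = 1 - exp(-lam * theta_p) = p."""
    return -math.log(1.0 - p) / lam

median = quantile(0.5)
print(median)                         # ln 2 ≈ 0.693
print(1.0 - math.exp(-lam * median))  # 0.5: F(theta_p) = p, as required
```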
Jointly distributed random variables and independent
random variables
See pp. 30-33
Covariance and Correlation
Cov(X,Y) = σ_XY = E[(X − µ_X)(Y − µ_Y)] = E(XY) − E(X)E(Y)
= E(XY) − µ_X µ_Y
If X and Y are independent, then E(XY) = E(X)E(Y) so the
covariance is zero. The other direction is not true.
Note that:
E(XY) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} x y f(x, y) dx dy
ρ_XY = corr(X, Y) = Cov(X, Y) / √(Var(X) Var(Y)) = σ_XY / (σ_X σ_Y)
• See Examples 2.26 and 2.27 on pp. 37-38 (prob vs. stat grades)
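A minimal sketch computing covariance and correlation from a small joint p.m.f.; the joint distribution here is made up for illustration (it is not Example 2.26’s data):

```python
import math

# Made-up joint p.m.f. f(x, y), for illustration only
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

ex = sum(x * p for (x, y), p in joint.items())
ey = sum(y * p for (x, y), p in joint.items())
exy = sum(x * y * p for (x, y), p in joint.items())
vx = sum((x - ex) ** 2 * p for (x, y), p in joint.items())
vy = sum((y - ey) ** 2 * p for (x, y), p in joint.items())

cov = exy - ex * ey             # Cov(X,Y) = E(XY) - E(X)E(Y)
rho = cov / math.sqrt(vx * vy)  # corr(X,Y)
print(cov, rho)  # 0.15 and 0.6: positively associated
```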
Two Famous Theorems
Chebyshev’s Inequality: Let c > 0 be a constant. Then,
irrespective of the distribution of X,
P(|X − µ| ≥ c) ≤ σ²/c²
• See Example 2.29 on p. 41 (exact vs. Cheb. for two dice)
Weak Law of Large Numbers: Let X̄ be the sample mean
of n i.i.d. observations from a population with finite mean µ
and variance σ². Then, for any fixed c > 0,
P(|X̄ − µ| ≥ c) → 0 as n → ∞
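A sketch in the spirit of Example 2.29, comparing the Chebyshev bound with the exact probability for the sum of two dice (µ = 7, σ² = 35/6; c = 4 is an arbitrary choice):

```python
from fractions import Fraction
from itertools import product

pmf = {}
for d1, d2 in product(range(1, 7), repeat=2):
    pmf[d1 + d2] = pmf.get(d1 + d2, Fraction(0)) + Fraction(1, 36)

mu, var = 7, Fraction(35, 6)
c = 4  # how often is the sum at least 4 away from its mean of 7?

exact = sum(p for x, p in pmf.items() if abs(x - mu) >= c)
bound = var / c ** 2
print(float(exact))  # 6/36 ≈ 0.167 (sums 2, 3, 11, and 12)
print(float(bound))  # ≈ 0.365: the bound holds but is not tight
```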
Selected Discrete Distributions
Bernoulli trials: (single coin flip)
f(x) = P(X = x) = p if x = 1 (success), 1 − p if x = 0 (failure)
E(X) = p and Var(X) = p(1 − p)
Binomial distribution: (multiple coin flips)
X successes out of n trials:
f(x) = P(X = x) = (n choose x) p^x (1 − p)^(n−x) for x = 0, 1, …, n
E(X) = np and Var(X) = np(1 − p)
• See Example 2.30 on p. 43 (teeth)
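The binomial p.m.f. is a one-liner with math.comb; a sketch verifying that it sums to 1 and that the mean equals np (n = 10, p = 0.3 are arbitrary):

```python
import math

def binom_pmf(x, n, p):
    """f(x) = C(n, x) p^x (1 - p)^(n - x)."""
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)

n, p = 10, 0.3
print(sum(binom_pmf(x, n, p) for x in range(n + 1)))      # 1.0: sums to 1
print(sum(x * binom_pmf(x, n, p) for x in range(n + 1)))  # ≈ 3.0 = np
```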
Selected Discrete Distributions (cont)
Hypergeometric: drawing balls from the box without
replacing the balls (as in the hand with the question
mark)
Poisson: number of occurrences of a rare event
Geometric: number of failures before the first success
Multinomial: more than two outcomes
Negative Binomial: number of failures before the first r
successes
Uniform: N equally likely events (outcomes 1, 2, 3, …, N)
• See Table 2.5, p. 59 for properties of these distributions
Selected Continuous Distributions
Uniform: equally likely over an interval
Exponential: lifetimes of devices with no
wear-out (“memoryless”), interarrival times
when the arrivals are at random
Gamma: kth arrival time when the
arrivals are at random
Lognormal: lifetimes (similar shape to
Gamma but with longer tail)
Beta: not equally likely over an interval
• See Table 2.5, p. 59 for properties of these distributions
Normal Distribution (“Bell-curve”, Gaussian)
A continuous r.v. X has a normal distribution with parameters µ and
σ² if its probability density function is given by:
f(x) = (1 / (σ√(2π))) exp[−(x − µ)²/(2σ²)] for −∞ < x < ∞
E(X) = µ and Var(X) = σ² (see Figure 2.12, p. 53)
Standard normal distribution:
Z = (X − µ)/σ ~ N(0, 1)
• See Table A.3 on p. 673 for Φ(z) = P(Z ≤ z)
P(X ≤ x) = P((X − µ)/σ ≤ (x − µ)/σ) = Φ((x − µ)/σ)
• See Examples 2.37 and 2.38 on pp. 54-55 (computations)
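Φ(z) can also be computed without tables via the standard identity Φ(z) = (1 + erf(z/√2))/2; a minimal Python sketch:

```python
import math

def phi(z):
    """Standard normal c.d.f.: Phi(z) = P(Z <= z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# P(X <= x) for X ~ N(mu, sigma^2), by standardizing
mu, sigma, x = 500, 100, 600
print(phi((x - mu) / sigma))  # Phi(1) ≈ 0.8413
```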
Carl Friedrich Gauss (1777 - 1855)
Karl Pearson (1857 - 1936)
“Many years ago I called the
Laplace-Gauss curve the NORMAL
curve, which name, while it avoids
an international question of priority,
has the disadvantage of leading
people to believe that all other
distributions of frequency are in one
sense or another ABNORMAL.
That belief is, of course, not
justifiable.”
Karl Pearson, 1920
Percentiles of the Normal Distribution
Suppose that the scores on a standardized test are normally
distributed with mean 500 and standard deviation of 100. What
is the 75th percentile score of this test?
P(X ≤ x) = P((X − 500)/100 ≤ (x − 500)/100) = Φ((x − 500)/100) = 0.75
From Table A.3, Φ(0.675) = 0.75
(x − 500)/100 = 0.675 ⇒ x = 500 + (0.675)(100) = 567.5
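The same percentile in code: the standard library has no normal inverse, so this sketch inverts Φ by bisection (scipy.stats.norm.ppf(0.75, 500, 100) gives the same answer in one call):

```python
import math

def phi(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def phi_inv(p, lo=-10.0, hi=10.0):
    """Invert the standard normal c.d.f. by bisection."""
    for _ in range(80):
        mid = (lo + hi) / 2
        if phi(mid) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

mu, sigma = 500, 100
print(mu + phi_inv(0.75) * sigma)  # ≈ 567.4 (the table value 0.675 gives 567.5)
```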
Factoids:
68% of the normal distribution is within ±1σ of µ
95% of the normal distribution is within ±2σ of µ
99.7% of the normal distribution is within ±3σ of µ
Linear Combinations of r.v.s
Xi ~ N(µi, σi²) for i = 1, …, n and Cov(Xi, Xj) = σij for i ≠ j
Let X = a1X1 + a2X2 + … + anXn where the ai are constants.
Then X has a normal distribution with mean and variance:
E(X) = E(a1X1 + a2X2 + … + anXn) = a1µ1 + a2µ2 + … + anµn = Σi aiµi
Var(X) = Var(a1X1 + a2X2 + … + anXn) = Σi ai²σi² + 2 ΣΣ_{i≠j} aiajσij
For the sample mean, X̄ = (X1 + X2 + … + Xn)/n, so ai = 1/n.
Therefore, X̄ from n i.i.d. N(µ, σ²) observations ~ N(µ, σ²/n),
since the covariances (σij) are zero (by independence).
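A simulation sketch of the last fact, checking that sample means of n i.i.d. normal draws have standard deviation σ/√n (the parameter values are arbitrary):

```python
import random
import statistics

mu, sigma, n, reps = 10.0, 2.0, 25, 10_000

means = [statistics.fmean(random.gauss(mu, sigma) for _ in range(n))
         for _ in range(reps)]

print(statistics.fmean(means))  # ≈ 10.0 = mu
print(statistics.stdev(means))  # ≈ 0.4 = sigma / sqrt(n)
```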
Homework #1
Read Chapters 1 and 2, omitting Sections 2.4.4 & 2.10
Exercises: (answers to odd problems in the back of the book)
(show work – no credit if calculations or printout not shown)
2.33
2.34 (a) discrete, (b) 0.4, (c) 0.6
2.36 (a) only: mean=2.57, var=3.545
2.43
2.80 (a) 0.4207, (b) 0.3674, (c) 15.46
2.83
Read or skim Chapters 3 (omitting Section 3.4) and Chapter 4.