
Data Analysis II
Anthony E. Butterfield
CH EN 4903-1
"There is a theory which states that if ever anybody
discovers exactly what the Universe is for and why it is
here, it will instantly disappear and be replaced by
something even more bizarre and inexplicable. There is
another theory which states that this has already
happened.”
~ Douglas Adams, Hitchhiker's Guide to the Galaxy
Data Analysis II
• Review of Data Analysis I.
• Hypothesis testing.
– Types of errors.
– Types of tests.
– Student’s T-Test
• Fit lines and curves to data.
Quick Review of PDFs and CDFs
• What is the probability of measuring a value between -0.5 and 1.5, or between -2 and -1, with μ = 0 and σ = 1?
Hypothesis Testing
• How do we know if one hypothesis is more likely to be true than the alternatives?
• Null Hypothesis (H0) – The hypothesis to be tested to
determine if it is true (often that the data observed are
the result of random chance).
• Alternative Hypothesis (H1) – A hypothesis that may be found to be the more probable source of the observations if the null hypothesis is not (often that the observations are the result of more than chance, i.e., a real effect).
Possible Types of Error in Tests
• Type I Error:
– Rejecting a true null hypothesis; its probability is α, the significance level.
• Type II Error:
– Accepting a false null hypothesis; its probability is β, where the test's power is 1 − β.
• There is a tradeoff between α and β.
Testing Alternatives, Tail Tests
• One-Tail (One-Sided) Test.
– H0: μ = μ0. "Our new drug is no better than the old drug."
  H1: μ > μ0. "Our new drug works better than the old one."
– H0: μ = μ0. "The catalytic converter is just as effective as it was when new."
  H1: μ < μ0. "The catalytic converter has fouled."
• Two-Tail (Two-Sided) Test.
– H0: μ = μ0. "Our liquid is a Newtonian fluid."
  H1: μ ≠ μ0. "Our liquid is a non-Newtonian fluid."
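A minimal MATLAB sketch (the test statistic and degrees of freedom are hypothetical) of how one- and two-tailed probabilities are computed once a test statistic is in hand:

t = 1.9;  nu = 12;                    % assumed test statistic and degrees of freedom
p_one = 1 - tcdf(t, nu);              % one-tailed: P(T > t)
p_two = 2*(1 - tcdf(abs(t), nu));     % two-tailed: P(|T| > |t|)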
Student’s T-Test
• The t-distribution is used for small data sets, where the population standard deviation is unknown.
• As the degrees of freedom, ν, go to ∞, the t-distribution becomes the normal distribution.
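A minimal MATLAB sketch of that limit, plotting t-distributions for a few illustrative values of ν against the standard normal:

t = linspace(-4, 4, 200);
plot(t, tpdf(t, 2), t, tpdf(t, 10), t, tpdf(t, 50), t, normpdf(t), 'k--')
legend('\nu = 2', '\nu = 10', '\nu = 50', 'normal')
xlabel('t'), ylabel('probability density')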
Student’s T-Test
• Can be used to determine the likelihood that two means are the same.
\sigma_{ab} = \sqrt{\frac{\sigma_a^2}{n_a} + \frac{\sigma_b^2}{n_b}}

t = \frac{\bar{x}_a - \bar{x}_b}{\sigma_{ab}}

\nu = \frac{\left(\dfrac{\sigma_a^2}{n_a} + \dfrac{\sigma_b^2}{n_b}\right)^2}{\dfrac{\left(\sigma_a^2 / n_a\right)^2}{n_a - 1} + \dfrac{\left(\sigma_b^2 / n_b\right)^2}{n_b - 1}}
T Statistics Example
• The test statistic puts the data in question on a scale where we can use the T-distribution.
• Is μa = μb, or μa ≠ μb, or μa > μb, or μa < μb?
T Statistics Example
t = \frac{\bar{x}_a - \bar{x}_b}{\sigma_{ab}}, \qquad \sigma_{ab} = \sqrt{\frac{\sigma_a^2}{n_a} + \frac{\sigma_b^2}{n_b}}
• Example result: ν = 38, σab = 0.324, t = -1.53.
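A minimal MATLAB sketch of turning this example's test statistic into a probability (ν = 38 and t = -1.53 are the values above):

t = -1.53;  nu = 38;                  % values from this example
p = 2*tcdf(-abs(t), nu);              % two-tailed p-value under H0: mu_a = mu_b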
Student’s T-Test Example
• Two sets of data, 10
measurements
each, with different
variances and with
means separated by
an increasing value.
• Note the error.
• What if we take 100
measurements?
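A minimal MATLAB sketch of this comparison with made-up populations; the means, standard deviations, and sample sizes are illustrative, not the slide's data:

mu_a = 5.0;  mu_b = 5.2;  s_a = 1.0;  s_b = 1.5;     % assumed populations
for n = [10 100]
    a = mu_a + s_a*randn(n,1);                       % n measurements of each
    b = mu_b + s_b*randn(n,1);
    sab = sqrt(var(a)/n + var(b)/n);
    t   = (mean(a) - mean(b))/sab;
    nu  = (var(a)/n + var(b)/n)^2/((var(a)/n)^2/(n-1) + (var(b)/n)^2/(n-1));
    fprintf('n = %3d:  t = %6.2f,  two-tailed p = %.3f\n', n, t, 2*tcdf(-abs(t), nu))
end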
Student's T-Test for Our π Data
\bar{x}_a = 3.1382, \quad \mu_b = \pi
\sigma_a = 0.0521, \quad \sigma_b = 0
\nu = 16 - 1 = 15
t = \frac{\bar{x}_a - \pi}{\sqrt{\dfrac{\sigma_a^2}{16} + \dfrac{\sigma_b^2}{1}}} = -0.2518
• Use the t statistic and the CDF to find the probability.
• Two-tailed test (P × 2).
• Would need t = 0.064 for 95% confidence.
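A minimal MATLAB sketch that repeats this calculation from the summary statistics on this slide (n = 16, mean 3.1382, s = 0.0521):

n    = 16;                              % number of pi measurements
xbar = 3.1382;                          % sample mean (from the slide)
s    = 0.0521;                          % sample standard deviation (from the slide)
nu   = n - 1;                           % degrees of freedom
t    = (xbar - pi)/(s/sqrt(n));         % t statistic against the true value pi
p    = 2*tcdf(-abs(t), nu);             % two-tailed probability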
Linear Fitting
• How do we best fit a straight line, y = b + mx, to data?

S = \sum_{i=1}^{n} \left( y_i - (b + m x_i) \right)^2

\frac{dS}{db} = 0 = \sum_{i=1}^{n} -2\left( y_i - (b + m x_i) \right)

\frac{dS}{dm} = 0 = \sum_{i=1}^{n} -2 x_i \left( y_i - (b + m x_i) \right)

n b + \left( \sum_{i=1}^{n} x_i \right) m = \sum_{i=1}^{n} y_i

\left( \sum_{i=1}^{n} x_i \right) b + \left( \sum_{i=1}^{n} x_i^2 \right) m = \sum_{i=1}^{n} x_i y_i
Linear Fit Quality
• Coefficient of Determination (R2):
SS_{Total} = \sum_{i=1}^{n} \left( y_i - \bar{y} \right)^2

SS_{Error} = \sum_{i=1}^{n} \left( y_i - m x_i - b \right)^2

R^2 = 1 - \frac{SS_{Error}}{SS_{Total}}
• The closer R2 is to 1, the better the fit.
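A minimal MATLAB sketch (hypothetical x, y data) of solving the normal equations above for b and m and then computing R2:

x = (0:0.5:5)';
y = 1.2 + 0.8*x + 0.1*randn(size(x));      % assumed noisy straight-line data
n = numel(x);
A   = [n, sum(x); sum(x), sum(x.^2)];      % coefficient matrix of the normal equations
rhs = [sum(y); sum(x.*y)];
bm  = A\rhs;                               % bm(1) = intercept b, bm(2) = slope m
yfit    = bm(1) + bm(2)*x;
SStotal = sum((y - mean(y)).^2);
SSerror = sum((y - yfit).^2);
R2      = 1 - SSerror/SStotal;             % closer to 1 means a better fit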
Nonlinear Fits
• Linearized fits.
– Prone to problems.
• Nonlinear fits.
– Best for nonlinear equations.
– End up with n nonlinear equations and n unknowns.
Linearized fit:
f(Y) = b + m X

n b + \left( \sum_{i=1}^{n} x_i \right) m = \sum_{i=1}^{n} f(y_i)

\left( \sum_{i=1}^{n} x_i \right) b + \left( \sum_{i=1}^{n} x_i^2 \right) m = \sum_{i=1}^{n} x_i f(y_i)

Nonlinear fit:
Y = f(X, c_1, c_2, \ldots, c_n)

S(c_1, c_2, \ldots, c_n) = \sum_{i=1}^{n} \left[ y_i - Y(x_i, c_1, c_2, \ldots, c_n) \right]^2

\frac{\partial S}{\partial c_1} = 0, \quad \frac{\partial S}{\partial c_2} = 0, \; \ldots, \; \frac{\partial S}{\partial c_n} = 0
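A minimal MATLAB sketch of the nonlinear approach: minimize S(c1, ..., cn) numerically (the model and data here are made up for illustration):

model = @(c, x) c(1)*exp(c(2)*x);                    % assumed model Y = f(X, c1, c2)
xd = linspace(0, 2, 20)';
yd = 2*exp(-3*xd) + 0.05*randn(size(xd));            % hypothetical noisy data
S  = @(c) sum((yd - model(c, xd)).^2);               % S(c1, c2) from the slide
c_fit = fminsearch(S, [1, -1]);                      % should land near [2, -3]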
Fitting Example
• Equation: y = a exp(bx) + εy; here, y = 2 exp(-3x) + εy.
• Linearized fit puts inordinate emphasis on
data taken at larger values of x, in this case.
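A minimal MATLAB sketch contrasting the linearized fit (a straight-line fit to ln y) with a direct nonlinear fit for this equation; the noise level and x values are made up:

xd = linspace(0, 2, 20)';
yd = 2*exp(-3*xd) + 0.05*randn(size(xd));            % hypothetical data, y = 2 exp(-3x) + noise
yd = max(yd, eps);                                   % keep ln(y) defined for the linearized fit
p  = polyfit(xd, log(yd), 1);                        % linearized: ln y = ln a + b x
a_lin = exp(p(2));  b_lin = p(1);                    % noisy small-y points (large x) dominate
S  = @(c) sum((yd - c(1)*exp(c(2)*xd)).^2);          % direct nonlinear least squares
c_nl = fminsearch(S, [1, -1]);                       % typically much closer to [2, -3]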
C.I. For Fitted Constants
• The method uses Student's T-Test, the residuals, and the Jacobian (the matrix of partial derivatives with respect to the parameters at each data point).
• You may use a statistics program.
• For example, in Matlab:
– nlinfit – get the fit parameters, residuals, and Jacobian.
– nlparci – find the CI for the parameters.
– nlpredci – find the CI for predicted values.
• Open the functions, though, to see how they work (">> open nlparci" and ">> help nlparci").
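A minimal MATLAB sketch of how these three functions fit together (the data and starting guess are hypothetical; this is not the course's nlinfitex2 script):

model = @(b, x) b(1) + b(2)*exp(-b(3)*x);            % y = b1 + b2*exp(-b3*x)
xd = linspace(0, 2, 10)';
yd = model([1 2 3], xd) + 0.1*randn(size(xd));       % assumed noisy data
b0 = [0.5 1 1];                                      % initial parameter guess
[b, resid, J] = nlinfit(xd, yd, model, b0);          % fit parameters, residuals, Jacobian
ci = nlparci(b, resid, 'Jacobian', J);               % 95% CI for the parameters
[ypred, delta] = nlpredci(model, xd, b, resid, 'Jacobian', J);  % CI for predicted values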
C.I. For Fitted Constants, Example
• Put code for this example online, here.
>> nlinfitex2
Fit to equation:
y = b1 + b2 * exp(-b3 * x)
x data y data
0.000 3.022
0.222 2.002
0.444 1.644
0.667 1.241
0.889 0.888
1.111 1.052
1.333 1.043
1.556 1.104
1.778 1.055
2.000 0.800
b1 was 1.0, and is estimated to be: 0.949577 ± 0.158716 (95% CL)
b2 was 2.0, and is estimated to be: 2.073648 ± 0.317758 (95% CL)
b3 was 3.0, and is estimated to be: 2.903019 ± 1.056934 (95% CL)
Data Analysis Conclusions
• Data analysis is necessary for nearly any objective use of measurements.
• You must have a basic grasp of statistics.
• All data and calculated values should come with
some confidence interval at some probability.
• You can reject data under some circumstances, but avoid doing so when possible.
• Use Student’s T-Test and fitting techniques to
judge if your data match theory.