CHAPTER 16 THE FURTHER DATA ANALYSIS

16.1 Introduction
16.2 FURTHER DATA ANALYSIS: (MEASURED V ATTRIBUTE)

FDA is a procedure that enables a decision to be made, based on the sample evidence, between two outcomes:
• There is no relationship
• There is a relationship

These statistical procedures are called hypothesis tests.

Hypothesis
• A statement about a population, developed for the purpose of testing.

Hypothesis tests
• A procedure based on sample evidence and probability theory to determine whether the hypothesis is a reasonable statement.

Four stages of hypothesis tests
• Stage 1: Specifying the hypotheses.
• Stage 2: Defining the test parameters and the decision rule.
• Stage 3: Examining the sample evidence.
• Stage 4: The conclusions.

FDA for Measured v Attribute requires two different hypothesis tests:
• Two levels of the attribute explanatory variable
• Three or more levels of the attribute explanatory variable

16.3 HYPOTHESIS TEST 1
Measured Response v Attribute Explanatory Variable with exactly two levels

Illustrative Example
• Response Variable: AMOUNT spent on clothes per month
• Attribute Explanatory Variable: GENDER (Male/Female)
• If Males and Females have the same 'spending on clothes' characteristics, then the average amounts spent monthly by Males and by Females should be the same.
• If Males and Females have different 'spending on clothes' characteristics, then the average amounts spent monthly by Males and by Females would be different.

The total population can be split into two or more sub-populations according to the level of the attribute; here, a population of Males and a population of Females.
POPULATION MEANS THE SAME

Stage 1: Specifying the hypotheses.
• NULL HYPOTHESIS: $H_0 : \mu_0 = \mu_1$
• ALTERNATIVE HYPOTHESIS: $H_1 : \mu_0 \neq \mu_1$

Stage 2: The Decision Rule
Results of IDA for Illustrative Example

Outcome 1
• Male Mean = £45 (Stand Dev = £20)
• Female Mean = £55 (Stand Dev = £20)
• Not enough evidence to form a clear judgement.
• FDA is required.

Outcome 2
• Male Mean = £45 (Stand Dev = £10)
• Female Mean = £55 (Stand Dev = £10)
• The widths of the boxes would lead to the decision from the IDA that there is definitely a link.

Outcome 3
• Male Mean = £45 (Stand Dev = £40)
• Female Mean = £55 (Stand Dev = £40)
• FDA is required, and the Stand Dev is bigger.

• A measure of the relative separation of the boxplots is needed, considering not only the MEANS but also the STANDARD DEVIATIONS of the two samples.
• Find a "threshold value":
  If Measure of Relative Separation > Threshold value, there is a connection.
  If Measure of Relative Separation < Threshold value, there is no connection.

Student's t Ratio (a measure of the relative separation of the boxplots)
• Sample data follow a Normal distribution
• Student's t-test
• tcalc is the value of the t-ratio:

$$t_{calc} = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$$

• Bigger |tcalc| → larger separation
• Separation: Outcome 2 > Outcome 1 > Outcome 3
• Set up the decision rule
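
The t-ratio can be checked numerically for the three outcomes. The sketch below is illustrative only: the function name `t_ratio` is made up for this example, and the sample size of 30 per group is an assumption, since the slides give only the means and standard deviations.

```python
from math import sqrt


def t_ratio(mean1, sd1, n1, mean2, sd2, n2):
    """Difference in sample means divided by its standard error."""
    standard_error = sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    return (mean1 - mean2) / standard_error


# ASSUMPTION: 30 observations per group (not stated in the source).
N = 30

# Standard deviation (in pounds) used for both groups in each outcome.
outcomes = {"Outcome 1": 20.0, "Outcome 2": 10.0, "Outcome 3": 40.0}

# Male mean = 45, Female mean = 55 in every outcome.
t_values = {
    name: t_ratio(45.0, sd, N, 55.0, sd, N) for name, sd in outcomes.items()
}

for name, t in t_values.items():
    print(f"{name}: tcalc = {t:.2f}")
```

Under this assumed sample size the smallest standard deviation (Outcome 2) gives the largest |tcalc| and the largest standard deviation (Outcome 3) gives the smallest, matching the ordering of the separations.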
Decision Rule
• If the tcalc value lies within the range -tcrit to +tcrit, the decision rule flags H0, supporting the viewpoint that there is no relationship.
• If the tcalc value lies outside the range -tcrit to +tcrit, the decision rule flags H1, supporting the viewpoint that there is a relationship.
Value of tcrit
• Depends on the sample size, through a measure called Degrees of Freedom (DF).
• Can be looked up in statistical tables.

The hypothesis test described above is called Student's t-test and is a two-tailed test using the 5% level of significance.

Formally, the level of significance may be defined as the chance the tester is prepared to take of coming to the wrong conclusion about H0, that is, of rejecting H0 when it is in fact true.
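
The two-tailed decision rule can be written as a small function. This is a sketch under stated assumptions: `critical_value` and `two_tailed_decision` are hypothetical helper names, and the critical value is taken from the Normal distribution, which only approximates the t tables when the degrees of freedom are large. For small samples the value should be looked up in the t tables instead.

```python
from statistics import NormalDist


def critical_value(significance=0.05):
    """Large-sample approximation to tcrit for a two-tailed test.

    ASSUMPTION: uses the standard Normal quantile, which is only
    close to the t-table value when the degrees of freedom are large.
    """
    return NormalDist().inv_cdf(1 - significance / 2)


def two_tailed_decision(t_calc, t_crit):
    """Return 'H0' if t_calc lies within [-t_crit, +t_crit], else 'H1'."""
    return "H0" if -t_crit <= t_calc <= t_crit else "H1"


t_crit = critical_value(0.05)
print(f"tcrit is approximately {t_crit:.2f}")

# Hypothetical tcalc values, chosen only to show both branches of the rule.
print(two_tailed_decision(-1.2, t_crit))
print(two_tailed_decision(2.5, t_crit))
```

Passing the critical value in as a parameter keeps the rule itself independent of where the threshold comes from, whether tables or software.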

Stage 3: Doing the calculations
• If the tcalc value lies within the range -tTable to +tTable, the decision rule flags H0: there is no relationship.
• If the tcalc value lies outside the range -tTable to +tTable, the decision rule flags H1: there is a relationship.

Stage 4: The conclusions
• State the conclusions in terms of the original business problem specification.
• For example: on the basis of the sample evidence, there is evidence to suggest a link between the amount spent on clothes and gender. Males spend on average about £45 per month and Females about £55.

Worked Example CREDIT
IDA

FDA
Stage 1: Define the hypotheses:
$H_0 : \mu_0 = \mu_1$
$H_1 : \mu_0 \neq \mu_1$
• $\mu_0$: true average amount borrowed on credit for house owners
• $\mu_1$: true average amount borrowed on credit for non house owners
Stage 2: Defining the test parameters and the decision rule
• Student's t-test

Stage 3: Examining the sample evidence
• MINITAB is used to do the calculations on the sample data.
• tTable = 1.96
• tcalc = -4.51 lies outside the range -1.96 to +1.96, so reject H0 and accept H1.

Stage 4: The conclusions.
Based on the sample evidence there is a connection between Amount Borrowed on Credit and House-ownership. On average house owners borrow £869.50 and non house owners borrow £1009.00.

16.4 HYPOTHESIS TEST 2
Measured Response v Attribute Explanatory Variable with three or more levels
For example
• Response variable: amount spent in a supermarket
• Explanatory variable: the customer's marital status, with four categories: Single, Married, Divorced or Widowed

The common data analysis methodology applies and has the following three stages:
• Initial Data Analysis
• Further Data Analysis
• Describing the Relationship

Example 1:
No evidence of a connection.

Example 2:
Some degree of separation
Measure of relative separation

Hypothesis Test: Four stages
Stage 1: Specifying the hypotheses.
Stage 2: Defining the test parameters and the decision rule.
Stage 3: Examining the sample evidence.
Stage 4: The conclusions.

Stage 1: Specifying the hypotheses.
By definition, if there is no connection then all the population means are equal, whilst if there is a connection at least one of the means must be different.
Null hypothesis
$H_0 : \mu_1 = \mu_2 = \mu_3 = \mu_4$
Alternative hypothesis
$H_1$ : at least one mean is different

Stage 2: Defining the test parameters and the decision rule.
• Decision rule: based on the F-Ratio.
• Test procedure: One-way Analysis of Variance
• ANalysis Of VAriance: ANOVA
• Fcrit is the particular value of F that splits the area under the distribution in the proportions 95%/5%.
Decision rule
• If the value of Fcalc is between 0 and Fcrit, conclude that there is no link.
• If the value of Fcalc is greater than Fcrit, conclude that on the basis of the sample evidence there is a link.

Stage 3: Examining the sample evidence
Example 1
• Fcalc would be small.
• The F-Ratio is defined in such a way that if the null hypothesis is true, i.e. all the means are equal, then Fcalc is expected to be close to 1.
Example 2
• Fcalc measures the relative separation.
• The wider the separation, the larger the Fcalc value.
To find the threshold value Fcrit
For the F-Ratio:
• There are two degrees of freedom (they depend on the number of levels and the sample size).
• Look up the statistical tables: Ftable
• Suppose Fcalc = 8.91 and the degrees of freedom are (3, 80); then Ftable = 2.72.

Stage 4: The conclusions.
Since the value of Fcalc (8.91) is larger than the value of Ftable (2.72), the conclusion is that, on the basis of the sample evidence, there is enough evidence to suggest that there is a link between the amount spent by customers in a supermarket and the customer's marital status. The remaining issue is to describe the connection.

Worked Example
• CREDIT data scenario
• Question: Does the explanatory variable 'REGION' influence the response variable 'CREDIT'? That is, is the amount borrowed on credit dependent upon the region of the country where the customer lives?

IDA

FDA
Stage 1: Specifying the hypotheses.
$H_0 : \mu_1 = \mu_2 = \mu_3 = \mu_4 = \mu_5$
$H_1$ : at least one mean is different
Stage 2: Defining the test parameters and the decision rule.
Stage 3: Examining the sample evidence
• MINITAB: ANOVA, ONE WAY

Analysis of Variance for CREDIT
Source     DF          SS        MS      F      P
REGION      4     3445125    861281   5.10    0.0
Error     649   109631953    168924
Total     653   113077078

• Ftable = 2.39
• Since Fcalc = 5.10 > Ftable = 2.39, the sample evidence indicates a link between "Amount borrowed on credit" and "The region the customer lives in".
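
The MS and F columns of the ANOVA table follow directly from the SS and DF columns, so the output can be checked by hand. The short sketch below uses only the figures printed in the table; the variable names are chosen for this illustration.

```python
# Figures copied from the MINITAB Analysis of Variance table for CREDIT.
ss_region, df_region = 3445125, 4
ss_error, df_error = 109631953, 649

# Each mean square is the sum of squares divided by its degrees of freedom.
ms_region = ss_region / df_region
ms_error = ss_error / df_error

# The F-ratio is the REGION mean square over the Error mean square.
f_calc = ms_region / ms_error

F_TABLE = 2.39  # threshold quoted in the text

print(f"MS(REGION) = {ms_region:.0f}")
print(f"MS(Error)  = {ms_error:.0f}")
print(f"Fcalc      = {f_calc:.2f}")
print("Link indicated" if f_calc > F_TABLE else "No link indicated")
```

The Total row is also consistent: the REGION and Error sums of squares add to 113077078 and their degrees of freedom add to 653.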
Stage 4: The conclusions
REGION        AMOUNT
SOUTH-WEST    £977.10
SOUTH-EAST    £958.40
LONDON        £1061.80
MIDLANDS      £898.10
NORTH         £864.30
• Examination of the average values shows London to be the region with the highest amount on credit, then the South-West and South-East with similar average credits, with the North having the lowest amount on credit.
• Examine the diagram displaying the 95% confidence intervals for each level of the attribute variable.
Interpretation:
• The decision rule is that if the confidence limits don't overlap then there is a real difference in the means for the two levels of the attribute.
• For example, Region 3 (London) has an average amount on credit that is statistically significantly larger than the average amount on credit for Region 4 (the Midlands), because the two confidence intervals don't overlap.
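
The overlap rule can be sketched in a few lines. The region means below come from the table of averages, but the standard deviation (400) and the group size (130) are purely hypothetical placeholders, because the slides do not report them. The helper names and the Normal-based interval are likewise assumptions made for illustration.

```python
from math import sqrt
from statistics import NormalDist


def confidence_interval(mean, sd, n, level=0.95):
    """Approximate confidence interval for a mean (Normal-based)."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * sd / sqrt(n)
    return (mean - half_width, mean + half_width)


def intervals_overlap(a, b):
    """True if the two (low, high) intervals share any point."""
    return a[0] <= b[1] and b[0] <= a[1]


# HYPOTHETICAL spread and group size; only the means are from the text.
SD, N = 400.0, 130

london = confidence_interval(1061.80, SD, N)
midlands = confidence_interval(898.10, SD, N)
south_west = confidence_interval(977.10, SD, N)
south_east = confidence_interval(958.40, SD, N)

for label, a, b in [
    ("London v Midlands", london, midlands),
    ("South-West v South-East", south_west, south_east),
]:
    verdict = "No Difference" if intervals_overlap(a, b) else "Difference"
    print(f"{label}: {verdict}")
```

With real data the intervals would come from the statistical package output rather than from assumed values.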

           level 1          level 2          level 3          level 4
level 2    No Difference
level 3    No Difference    No Difference
level 4    No Difference    No Difference    Difference
level 5    No Difference    No Difference    Difference       No Difference

The final description of the link can be summarised as follows: the amount borrowed on credit in London is significantly higher than in the Midlands and the North.