INTERMEDIATE STATISTICS A Modern Approach

James P. Stevens
University of Cincinnati
LAWRENCE ERLBAUM ASSOCIATES, PUBLISHERS
Hillsdale, New Jersey
Hove and London
Copyright © 1990 by Lawrence Erlbaum Associates, Inc.
All rights reserved. No part of this book may be reproduced in
any form, by photostat, microform, retrieval system, or by any other
means, without the prior written permission of the publisher.
Lawrence Erlbaum Associates, Inc., Publishers
365 Broadway
Hillsdale, New Jersey 07642
Library of Congress Cataloging-in-Publication Data
Stevens, James P.
Intermediate statistics : a modern approach / James P. Stevens.
p. cm.
Includes bibliographical references.
ISBN 0-8058-0491-9. - ISBN 0-8058-0492-7 (pbk.)
1. Statistics. I. Title.
QA276.S828 1989
519.5-dc20
89-39774
CIP
PRINTED IN THE UNITED STATES OF AMERICA
10 9 8 7 6 5 4
Detecting Outliers
If the variable is approximately normally distributed, then z scores around 3 in
absolute value should be considered as potential outliers. Why? Because in an
approximately normal distribution about 99.7% of the scores should lie within three
standard deviations of the mean. Therefore, any |z| > 3 indicates a value
very unlikely to occur. Of course, if n is large (say > 100), then simply by
chance we might expect a few subjects to have |z| > 3, and this should be
kept in mind. Moreover, the above rule is reasonable for any type of
distribution, although we might consider extending the rule to |z| > 4. It was
shown many years ago that, regardless of how the data are distributed, the percentage of observations contained within k standard deviations of the mean
must be at least $(1 - 1/k^2) \times 100\%$. This bound is informative only for k > 1 and yields the
following percentages for k = 2 through 5:
Number of standard deviations    Percentage of observations
2                                at least 75%
3                                at least 88.89%
4                                at least 93.75%
5                                at least 96%
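The bound above is commonly known as Chebyshev's inequality, and the tabled percentages can be reproduced with a few lines of Python. This is only a minimal sketch; the function name chebyshev_lower_bound is ours and does not come from the text.

```python
def chebyshev_lower_bound(k: float) -> float:
    """Minimum percentage of observations lying within k standard
    deviations of the mean, for any distribution: (1 - 1/k^2) * 100."""
    if k <= 1:
        # For k <= 1 the formula gives 0% or less, which says nothing useful.
        raise ValueError("the bound is informative only for k > 1")
    return (1.0 - 1.0 / k**2) * 100.0


# Reproduce the percentages for k = 2 through 5.
for k in range(2, 6):
    print(f"k = {k}: at least {chebyshev_lower_bound(k):.2f}%")
```

For comparison, an approximately normal variable has about 99.7% of its scores within three standard deviations, far above the distribution-free guarantee of 88.89%.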
Schiffler (1988) has shown that the largest possible z value in a data set
of size n is bounded by $(n - 1)/\sqrt{n}$. This means that for n = 10 the largest possible z
is 2.846 and for n = 11 the largest possible z is 3.015. Thus, for small sample
sizes any data point with a z around 2.5 should be seriously considered as a
possible outlier.
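The bound on the largest z score can be checked numerically. The sketch below assumes that the standard deviation is computed with the usual n - 1 divisor, which is what Python's statistics.stdev does. The function names and the data set are hypothetical and were chosen only to show the most extreme case: nine identical scores and one distant score.

```python
import math
import statistics


def max_possible_z(n: int) -> float:
    """Upper bound (n - 1) / sqrt(n) on any |z| in a sample of size n."""
    if n < 2:
        raise ValueError("need at least two observations")
    return (n - 1) / math.sqrt(n)


def largest_abs_z(data):
    """Largest absolute z score in the data, using the n - 1 divisor."""
    mean = statistics.mean(data)
    sd = statistics.stdev(data)
    return max(abs(x - mean) / sd for x in data)


# Hypothetical extreme sample: the bound is reached when all scores but one coincide.
extreme = [0.0] * 9 + [1000.0]

print(round(max_possible_z(10), 3))     # bound for n = 10
print(round(max_possible_z(11), 3))     # bound for n = 11
print(round(largest_abs_z(extreme), 3)) # the extreme sample attains the n = 10 bound
```

However far the tenth score is moved from the other nine, its z score cannot exceed the bound, which is why a cutoff of 3 is unattainable when n = 10.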
When comparing group differences, as with the t test for independent samples, we want the z scores computed separately for each group. The BMDPAM
program is very useful for detecting outliers in general (for one variable or for
several variables, the multivariate case). We show in Appendix 2 at the end of
this chapter the BMDPAM control lines for a two-group problem. After the
outliers are identified, what should be done with them? They should not
be dropped from the analysis automatically. If one finds after
further investigation of the outlying points that an outlier was due to a recording
or entry error, then of course one would correct the data value and redo the
analysis. Or, if it is found that the errant data value is due to an instrumentation
error or that the process that generated the data for that subject was different,
then it is legitimate to drop the outlier. If, however, none of these appear to be
the case, then one should not drop the outlier, but perhaps report two analyses
(one including the outlier and the other excluding it). Outliers should not necessarily be regarded as "bad." As a matter of fact, it has been argued that outliers
can provide some of the most interesting cases for further study.
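The earlier recommendation to compute z scores separately for each group can be sketched in Python as follows. This is an illustration only, not the BMDPAM procedure: the function name, the group labels, and the scores are all hypothetical, and the cutoff of 2.5 follows the suggestion for small samples.

```python
import statistics


def flag_outliers_by_group(scores_by_group, cutoff=3.0):
    """Return, for each group, the (score, z) pairs whose |z| >= cutoff,
    with z computed from that group's own mean and standard deviation."""
    flagged = {}
    for group, scores in scores_by_group.items():
        mean = statistics.mean(scores)
        sd = statistics.stdev(scores)  # n - 1 divisor
        if sd == 0:
            # All scores identical: no z scores can be formed, nothing to flag.
            flagged[group] = []
            continue
        flagged[group] = [
            (x, (x - mean) / sd) for x in scores if abs(x - mean) / sd >= cutoff
        ]
    return flagged


# Hypothetical two-group data with n = 10 per group.
groups = {
    "treatment": [12, 14, 13, 15, 14, 13, 12, 14, 13, 40],
    "control": [10, 11, 9, 10, 12, 11, 10, 9, 11, 10],
}

result = flag_outliers_by_group(groups, cutoff=2.5)
for group, hits in result.items():
    print(group, [(x, round(z, 2)) for x, z in hits])
```

Flagged scores are only candidates for further investigation. Whether a score is corrected, dropped, or retained (with analyses reported both with and without it) depends on what that investigation shows.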