INTERMEDIATE STATISTICS
A Modern Approach

James P. Stevens
University of Cincinnati

LAWRENCE ERLBAUM ASSOCIATES, PUBLISHERS
Hillsdale, New Jersey    Hove and London

Copyright © 1990 by Lawrence Erlbaum Associates, Inc. All rights reserved. No part of this book may be reproduced in any form, by photostat, microform, retrieval system, or by any other means, without the prior written permission of the publisher.

Lawrence Erlbaum Associates, Inc., Publishers
365 Broadway
Hillsdale, New Jersey 07642

Library of Congress Cataloging-in-Publication Data

Stevens, James P.
Intermediate statistics : a modern approach / James P. Stevens.
p. cm.
Includes bibliographical references.
ISBN 0-8058-0491-9. - ISBN 0-8058-0492-7 (pbk.)
1. Statistics. I. Title.
QA276.S828 1989    519.5-dc20    89-39774 CIP

PRINTED IN THE UNITED STATES OF AMERICA
10 9 8 7 6 5 4

Detecting Outliers

If the variable is approximately normally distributed, then z scores around 3 in absolute value should be considered potential outliers. Why? Because in an approximately normal distribution about 99.7% of the scores lie within three standard deviations of the mean. Therefore, any z value greater than 3 in absolute value indicates a value very unlikely to occur. Of course, if n is large (say > 100), then simply by chance we might expect a few subjects to have z scores > 3, and this should be kept in mind.

The above rule is reasonable for any type of distribution, although we might consider extending it to z > 4. It was shown many years ago that, regardless of how the data are distributed, the percentage of observations contained within k standard deviations of the mean must be at least (1 - 1/k²) × 100%.
This bound holds only for k > 1 and yields the following percentages for k = 2 through 5:

Number of standard deviations    Percentage of observations
2                                at least 75%
3                                at least 88.89%
4                                at least 93.75%
5                                at least 96%

Schiffler (1988) has shown that the largest possible z value in a data set of size n is bounded by (n - 1)/√n. This means that for n = 10 the largest possible z is 2.846 and for n = 11 the largest possible z is 3.015. Thus, for small sample sizes any data point with a z around 2.5 should be seriously considered as a possible outlier.

When comparing group differences, as with the t test for independent samples, we want the z scores computed separately for each group. The BMDPAM program is very useful for detecting outliers in general (for one variable or for several variables, that is, the multivariate case). We show in Appendix 2 at the end of this chapter the BMDPAM control lines for a two group problem.

After the outliers are identified, what should be done with them? They should not be dropped from the analysis automatically. If further investigation of the outlying points shows that an outlier was due to a recording or entry error, then of course one would correct the data value and redo the analysis. Or if it is found that the errant data value is due to an instrumentation error, or that the process that generated the data for that subject was different, then it is legitimate to drop the outlier. If, however, none of these appears to be the case, then one should not drop the outlier, but perhaps report two analyses (one including the outlier and the other excluding it).

Outliers should not necessarily be regarded as "bad." As a matter of fact, it has been argued that outliers can provide some of the most interesting cases for further study.
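
The rules of thumb in this section can be sketched in a few lines of Python. This sketch is an illustration and not part of the original text. It assumes z scores are based on the sample standard deviation (n - 1 divisor), and the function names (`z_scores`, `max_possible_z`, `chebyshev_lower_bound`, `flag_outliers`) and the two-group data are hypothetical. The distribution-free bound (1 - 1/k²) × 100% is the result commonly known as Chebyshev's inequality.

```python
import math
import statistics


def z_scores(values):
    """Return z scores based on the sample standard deviation (n - 1 divisor)."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


def max_possible_z(n):
    """Largest attainable |z| in a data set of size n: (n - 1) / sqrt(n)."""
    return (n - 1) / math.sqrt(n)


def chebyshev_lower_bound(k):
    """Minimum percentage of observations within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("the bound holds only for k > 1")
    return (1 - 1 / k**2) * 100


def flag_outliers(groups, cutoff=3.0):
    """Flag values whose |z|, computed separately within each group, reaches the cutoff."""
    flagged = {}
    for name, values in groups.items():
        zs = z_scores(values)
        flagged[name] = [
            (v, round(z, 3)) for v, z in zip(values, zs) if abs(z) >= cutoff
        ]
    return flagged


if __name__ == "__main__":
    # Hypothetical two-group data; each group has n = 10.
    groups = {
        "treatment": [12, 14, 13, 15, 14, 13, 12, 14, 13, 40],
        "control": [11, 12, 13, 12, 11, 13, 12, 12, 11, 13],
    }
    for k in (2, 3, 4, 5):
        print(k, round(chebyshev_lower_bound(k), 2))
    print(round(max_possible_z(10), 3), round(max_possible_z(11), 3))
    # With n = 10 no z score can reach 3, so a lower cutoff such as 2.5 is used.
    print(flag_outliers(groups, cutoff=2.5))
```

Because z scores are computed within each group, an extreme value in one group does not inflate the standard deviation used for the other. The choice of cutoff remains a judgment call: with n = 10 the default of 3 can never be reached, which is why the demonstration passes 2.5.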