Using SPSS to Screen Data
Download the file Screen2210.sav from my SPSS data page at
http://core.ecu.edu/psyc/wuenschk/SPSS/SPSS-Data.htm and bring it into SPSS. The 'subjects' in
this data file are automobiles. Among the variables on which you have data are:
ID -- the identification number assigned to this subject.
MPG -- the vehicle's mileage (gasoline consumption, miles per gallon).
REPAIR -- the cost of repairs done on the automobile during the last year.
SPEED -- the speed at which a vibration detector first crossed a threshold value (indicating
that the vehicle's ride was becoming uncomfortable) when tested on the track.
LIKERT4 -- the owner's response on a 4-option Likert-type question. The stem was "Overall, I
am satisfied with this automobile." The response options were: (1) Strongly disagree, (2)
disagree, (3) agree, and (4) strongly agree.
GENDER -- of the owner, (1) for the one sex, (2) for the other.
We want to screen these data for outliers and out-of-range values. Since we intend to analyze
the continuous variables with techniques that involve a normality assumption, we also want to
determine whether any of them are distinctly non-normal in their distribution, and, if so,
we want to try to find a transformation that will make them more nearly normal.
Let us first get some descriptive statistics on every variable except the ID number. Click
Analyze, Descriptive Statistics, Descriptives. Scoot the five variables into the Variables box. Click
Options and select Mean, Std. deviation, Minimum, Maximum, Kurtosis, and Skewness. Click
Continue, OK.
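If you prefer to work from a syntax window, a minimal sketch that should produce the same output (assuming the variable names listed above) is:

   DESCRIPTIVES VARIABLES=mpg repair speed likert4 gender
     /STATISTICS=MEAN STDDEV MIN MAX KURTOSIS SKEWNESS.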
Look at the output. The variable Likert4 has values that range from 1 to 5, but there should be
no values greater than 4. We shall need to determine which subjects have bad data on this variable.
Do you see any evidence of bad data on another variable?
Now look at the Skewness and Kurtosis statistics for the variables MPG, Repair, and Speed.
For MPG, the skewness and kurtosis values are close enough to 0 that I would not be uncomfortable
using that variable in an analysis that assumes the data came from a normally distributed population.
Repair and Speed are troublesome, however. I start to get uncomfortable when the absolute value of a
variable's skewness exceeds about .7 or .8, and I generally worry when it exceeds 1. High values of
kurtosis also get my attention, since they often indicate that there are outliers in the distribution.
Let us now find the subjects who have
bad data on the Likert4 variable. Go to the
Data View and click Data, Select Cases.
Select “If condition is satisfied” and “Filtered”
for Unselected Cases, and then click on the
“If” button. In the resulting “Select cases if”
box, enter “likert4 > 4,” like this:

Click Continue. The Select Cases window should now look like this:
Click OK. Now look back at the data. You
will see that there is a new variable, filter_$, with
values of 1 for those cases where Likert4 > 4 and
0 for other cases. You will also see a slash
through the case number of each case that has
been filtered out.
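Equivalent syntax for this filtering step might look something like this (filter_$ is the same indicator variable the dialog creates):

   COMPUTE filter_$ = (likert4 > 4).
   FILTER BY filter_$.
   EXECUTE.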
Now let us get a listing of all the cases that
have been selected and what value(s) they have
on the Likert4 variable. Click Analyze, Reports,
Case Summaries. Scoot ID and Likert4 into the
variables box. Check only “Display cases” and
“Show only valid cases.” The window should look
like this:
Click OK. Look at the output. You will see listed there the ID numbers for five subjects who
have out-of-range values for the Likert4 variable. We should now go find the original data sheets for
these subjects and see what their actual responses were. If their actual responses have been
incorrectly entered into the data file, we need to correct them. If their responses have been correctly
entered, then we need to decide what to do with a response that is out of range. In some cases we
might decide to recode them to a valid value -- for example, suppose that the survey had mostly
questions with five response options, so that our subject got used to coding the ‘E’ or ‘5’ response
when their choice was the last option, but that for this item there were only four response options.
Maybe those people who selected 5 really intended to select 4. Maybe we should recode all the
scores of 5 to 4 on this variable.
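The case listing requested above can also be obtained with syntax; a rough sketch, run while the filter is still in effect, is:

   SUMMARIZE
     /TABLES=id likert4
     /FORMAT=VALIDLIST NOCASENUM
     /CELLS=NONE.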
Click Data, Select Cases, select “All cases,” and click OK. On the Data View, click on Filter_$
and hit the delete key. Click Transform, Recode, Into Same Variables. Scoot Likert4 into the
variables box. Click “Old and New Values.” Under “Old Value” select “Value” and enter the number
5. Under “New Value” select “System-missing.” Click “Add” and the window should look like this:
If you had decided to change the scores of 5 to scores of 4 instead of to missing values, you
would select, under “New Value,” “Value,” enter the number 4, and then click “Add.” Go ahead and
click Continue, OK to finish the recoding. If you look at the data you will now find that all of the scores
of 5 have been set to a missing value.
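In syntax, that recode is a single command; use (5=4) instead of (5=SYSMIS) if you decide to recode the 5s to 4s rather than to missing:

   RECODE likert4 (5=SYSMIS).
   EXECUTE.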
Can you find the ID numbers of subjects who have out-of-range values on other variables in
this data set?
Now let us use box and whisker plots to
see if there are any outliers that deserve
investigation. Click Analyze, Descriptive
Statistics, Explore. Scoot MPG, Repair, and
Speed, into the “Dependent List” and ID into
the “Label cases by” box. Under “Display”
select “Plots.” Click OK. Look at the output.
The box plot for Repair shows one outlier, ID
number 46. If you go back and check the data
file you will find that car 46 had $1,061 in
repairs last year. While that is not an
unbelievable value, you probably should
investigate it just to be sure it is correct. The
box plot for Speed shows six outliers, one of
which is an extreme outlier (plotted with a star).
That extreme outlier is ID number 33, an
automobile that started vibrating at only 12
miles per hour, according to our data file.
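A syntax sketch for these box plots (labeling cases by ID so that outliers are identified on the plot) might be:

   EXAMINE VARIABLES=mpg repair speed
     /ID=id
     /PLOT=BOXPLOT
     /STATISTICS=NONE
     /NOTOTAL.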
If you cannot read the ID numbers for some of the outliers, you can always just use the Select
Cases and Case Summaries procedures to get a list of ID numbers of cases with outliers. Do you
remember from your statistics course how to find the “fences” that serve as the boundaries between
outliers and adjacent values? If not, you should read my document Exploratory Data Analysis (EDA).
Finally, let us attend to the two variables which were unacceptably skewed. First, let us try to
find a transformation which will reduce the skewness in the Repair variable. Click Transform,
Compute. Type “rep_sqr” in the “Target Variable” box and enter “SQRT(repair)” in the “Numeric
Expression” box. The window should now look like this:
Click OK. If you look back at the data, you will see that the Rep_Sqr transformed variable has
been added. The square root transformation is often useful for reducing positive skewness. If the
original variable has any negative values, you must remember first to add a constant to all scores to
avoid trying to take the square root of a negative number.
Let us also try an even stronger transformation for positive skewness, a logarithmic
transformation. Click Transform, Compute. Type “rep_log” in the “Target Variable” box and enter
“LG10(repair)” in the “Numeric Expression” box and click OK. Also try a super powerful skewness-reducing
transformation, the negative reciprocal. Click Transform, Compute. Type “rep_nr” in the
“Target Variable” box and enter “-1000/repair” in the “Numeric Expression” box and click OK.
Now we are ready to see what effect these transformations had on skewness and kurtosis.
Compute skewness and kurtosis on the three transformed variables. You will find that the square
root transformation reduced skewness nicely but that the other two transformations resulted in
distributions that are unacceptably skewed in the negative direction.
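Pulled together, the three transformations and the check on their shape could be run with syntax along these lines (a sketch, using the variable names from the text):

   COMPUTE rep_sqr = SQRT(repair).
   COMPUTE rep_log = LG10(repair).
   COMPUTE rep_nr = -1000/repair.
   EXECUTE.
   DESCRIPTIVES VARIABLES=rep_sqr rep_log rep_nr
     /STATISTICS=MEAN STDDEV SKEWNESS KURTOSIS.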
We should try transforming the speed variable too. Recall that it is negatively skewed. We
shall first reflect the variable by subtracting every score from a constant that is one greater than the
highest score. Click Transform, Compute. Type “sp_ref” in the “Target Variable” box and enter “91 - speed” in the “Numeric Expression” box and click OK. Compute the skewness of Sp_Ref and you will
find that it has exactly the same amount of skewness as did Speed but in a positive rather than a
negative direction. Now try a square root and a log transformation on the Sp_Ref variable. You will
find that the log transformation does a good job of reducing the skewness. You could now use the
log transformed reflected speed scores in an analysis that assumes normal distributions. When
interpreting the results of that analysis you would have to remember that on your reflected speed
variable low scores now represent high speeds and high scores represent low speeds. That can be
confusing. You could re-reflect the transformed variable to prevent such confusion.
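A syntax sketch for the reflection and the transformations tried here (the names sp_ref_sqr and sp_ref_log are just illustrative; 91 is one more than the highest observed speed of 90):

   COMPUTE sp_ref = 91 - speed.
   COMPUTE sp_ref_sqr = SQRT(sp_ref).
   COMPUTE sp_ref_log = LG10(sp_ref).
   EXECUTE.
   DESCRIPTIVES VARIABLES=sp_ref sp_ref_sqr sp_ref_log
     /STATISTICS=MEAN STDDEV SKEWNESS KURTOSIS.
   * To re-reflect the log scores you could compute, for example, 2 - sp_ref_log.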
I decided to exclude from subsequent analysis all cases with bad data. I sorted the data by value
of Likert4 and then deleted the five cases with invalid data. I also deleted the one case with an
out-of-range score on gender.
Multivariate Outliers
Imagine a scatterplot in hyperspace, where each axis represents one of your variables. Each
case is represented as one point in that space, with its location determined by its scores on the
variables. The centroid of that space is the point where the value on each variable is the mean of that
variable. Multivariate outliers are those cases that are far from that centroid.
The Mahalanobis distance (MD) is a measure of the geometric distance between the point representing
any one of the cases and this centroid. Leverage, a regression diagnostic statistic, is closely related to the
Mahalanobis distance. Leverage = MD/(N − 1) + 1/N. Likewise, MD = (N − 1)(Leverage − 1/N).
The SAS manual cites Belsley, Kuh, and Welsch’s
(1980) Regression Diagnostics text, suggesting that one
investigate observations with leverage greater than
2p/n, “where n is the number of observations used to fit
the model, and p is the number of parameters in the
model.” Some use a different rule of thumb – “a point
with leverage greater than (2k+2)/n should be carefully
examined. Here k is the number of predictors and n is the
number of observations.”
To get the leverage statistic, we conduct a multiple
regression predicting ID from MPG, repair, speed,
Likert4, and gender. We shall ignore most of the output
from that regression, which we run only to get the values
of leverage.
Using the rule of thumb suggested by Belsley et al., we investigate cases with leverage greater than
2p/n = 2(6)/94 = .128 (p = 6 here, counting the intercept along with the five predictors, and n = 94).
Sort the cases by value of leverage and then see which cases are multivariate outliers.
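A syntax sketch for that regression and the sort; the /SAVE LEVER subcommand writes the leverage values to a new variable (named LEV_1 by default):

   REGRESSION
     /DEPENDENT id
     /METHOD=ENTER mpg repair speed likert4 gender
     /SAVE LEVER.
   SORT CASES BY LEV_1 (D).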
There is only one such case, case
number 86. This car gets only 5 mpg, far from
the mean of 18.74 mpg (z = (5-18.74)/5.15 =
-2.67). This car also had high repair costs, z =
(925-337.59)/262.4 = 2.24, and it started
vibrating badly at only 48 mph, z =
(48.5-77)/13.8 = -2.07. No wonder the owner
gave this clunker the lowest possible satisfaction
score.
Descriptive Statistics

                         N    Minimum    Maximum       Mean    Std. Deviation
mpg                     94          5         30      18.74             5.149
repair                  94          4       1061     337.59           262.435
repair_Sqrt             94       2.00      32.57    17.0526           6.87733
speed                   94         12         90      76.98            13.781
Log_Speed_Refl          94        .00       1.90      .9457            .44688
likert4                 94          1          4       2.69             1.107
gender                  94          1          2       1.41              .495
Valid N (listwise)      94
Missingness
It is troublesome to have variables with a lot of missing data, but psychologists often need to
deal with such trouble. It can be useful to investigate the correlates of missingness. When a subset
of subjects are missing data on a variable, such missingness may tell you something about those
subjects.
See IntroQ Questionnaire for a description of the survey used to generate the data used for
this example.
I wish to create a variable coding missingness on variable SATM. I select “Transform,”
“Recode into Different Variables.” I scoot SATM into the middle pane. In the rightmost pane I type
the name of the new variable, Miss_SATM. I click “Old and New Values.”
I select Old Value = system missing, New Value = 1, and click Add. I select Old Value = All
Other Values, New Value = 0, and click Add. I click Continue.
Back on the Recode into Different Variables window, I click Change and then OK. I
look back at the data set and see that the recoding has been done properly. Each case with no score
on SATM now has Miss_SATM = 1, and all other cases have Miss_SATM = 0.
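Equivalent syntax, assuming the variable is named SATM:

   RECODE SATM (SYSMIS=1) (ELSE=0) INTO Miss_SATM.
   EXECUTE.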
Is missingness on SATM related to the scores on the other variables? As you can see below, it
is, for statophobia and for year. Those who failed to provide their SAT-math score were more fearful
of my statistics course than were the others. Also, failure to provide the SAT-math score became more
common across the years during which the course was taught. In this case, missingness is actually
information that could be used to help predict statophobia or year.
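A syntax sketch for the correlations shown below (assuming the variable names Gender, Ideal, Statoph, Nucoph, and Year):

   CORRELATIONS
     /VARIABLES=Miss_SATM WITH Gender Ideal Statoph Nucoph Year
     /PRINT=TWOTAIL NOSIG
     /MISSING=PAIRWISE.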
Correlations with Miss_SATM

           Pearson Correlation   Sig. (2-tailed)      N
Gender                   -.057              .131    694
Ideal                    -.017              .653    689
Statoph                   .084*             .028    685
Nucoph                    .007              .846    692
Year                      .082*             .031    694

*. Correlation is significant at the 0.05 level (2-tailed).
Links

UCLA Lesson on Regression Diagnostics

Data Screening with SAS

Copyright 2016, Karl L. Wuensch - All rights reserved.