NESUG 2012
Posters
Simple Statistical Programming: Preventing Errors When Creating Output
Data Sets Containing Statistical Test Results for McNemar’s Test
Stephen W. Bosch
Howard M. Proskin & Associates, Inc., Rochester, NY
ABSTRACT
SAS® programmers come from a variety of backgrounds in terms of both work experience and education. Some SAS
Programmers may have to produce statistical output in spite of having an educational background that does not include
advanced statistics.
McNemar’s Test, performed via the FREQ procedure, is an example of a statistical procedure that may require some basic
statistical knowledge in order to produce viable results. This can be accomplished with slight manipulation of the source data
set that is used to produce the test, along with the use of the ZEROS option of the WEIGHT statement in the PROC FREQ code. The
purpose of this paper is to educate SAS programmers, especially those who may not have strong backgrounds in Statistics, on
the accurate, error-free and warning-free production of an output data set that contains results for the McNemar’s test.
MCNEMAR’S TEST – BACKGROUND, IN SIMPLE TERMS
This paper is not a dissertation in statistics, but rather a resource outlining how to perform a statistical procedure in SAS,
directed especially to those who may not have an advanced education in statistics. Regardless, a basic background on the
McNemar’s statistic helps to set the stage for the implementation of the test via the SAS programming language.
McNemar’s test is a statistical test that is based on a 2x2 classification table, and is used in situations where subjects serve as
their own control. In essence, each subject goes through two evaluations, and the test asks whether the proportion of positive
outcomes differs between the first evaluation and the second. The null hypothesis asserts that the marginal probabilities for each
outcome are the same. Referring to figure 1.1, this is equivalent to saying that the two discordant cells, A (positive on test 1,
negative on test 2) and D (negative on test 1, positive on test 2), are equally likely, that is, P(A) = P(D). The alternative
hypothesis states that they are not equal.
Figure 1.1:
                    Test 2 = negative   Test 2 = positive   Row Total
Test 1 = positive   A                   B                   A+B
Test 1 = negative   C                   D                   C+D
Column Total        A+C                 B+D                 N
The actual McNemar’s test statistic is shown in figure 1.2. Observe that A and D are the components of the denominator. This
will be discussed in more detail subsequently, but based on the nature of the calculation, SAS cannot perform the test unless
the denominator is a nonzero number. Any denominator (A+D) that is missing or equal to 0 prevents the statistic from being
calculated. Therefore, if either A or D is missing, the statistic cannot be calculated.
Figure 1.2:
$$\chi^2 = \frac{(D - A)^2}{D + A}$$
The example below, which will be referenced throughout this paper, is based on an example provided in a
Medical Statistics text by Bland (2000). The example has been modified from the original so that the source data set
could be displayed within the confines of this paper. In the example portrayed in figure 2.1, 96 schoolchildren were
asked whether they had a severe cold at age 12, and again at age 14.
EXAMPLE (I):
Figure 2.1:
                          Severe Colds at age 14
Severe Colds at age 12    Yes     No      Total
Yes                       12      6       18
No                        24      54      78
Total                     36      60      96
Observe in the table above (figure 2.1), that at age 12, 18.8% (18/96) of the schoolchildren had reported severe colds. At age
14, 37.5% (36/96) of the schoolchildren had reported severe colds.
The data used to create the table in figure 2.1, data set SEVERECOLDS1, is shown below in figure 2.2. Variable SubjectID
is a unique identification variable for a child. Cold_12_YN is equal to 1 if the child reported a cold during age 12, and 0 if he or
she did not report a cold during age 12. Likewise, variable Cold_14_YN reports the same information, but for the given child at
age 14. Examine the data in figure 2.2, and note how it corresponds to the table above. (Note that figure 2.2 represents one
data set, although it is displayed in 3 columns.)
Figure 2.2 – data set SEVERECOLDS1:
SubjectID  Cold_12_YN  Cold_14_YN    SubjectID  Cold_12_YN  Cold_14_YN    SubjectID  Cold_12_YN  Cold_14_YN
A001       1           1             A0033      0           1             A0065      0           0
A002       1           1             A0034      0           1             A0066      0           0
A003       1           1             A0035      0           1             A0067      0           0
A004       1           1             A0036      0           1             A0068      0           0
A005       1           1             A0037      0           1             A0069      0           0
A006       1           1             A0038      0           1             A0070      0           0
A007       1           1             A0039      0           1             A0071      0           0
A008       1           1             A0040      0           1             A0072      0           0
A009       1           1             A0041      0           1             A0073      0           0
A0010      1           1             A0042      0           1             A0074      0           0
A0011      1           1             A0043      0           0             A0075      0           0
A0012      1           1             A0044      0           0             A0076      0           0
A0013      1           0             A0045      0           0             A0077      0           0
A0014      1           0             A0046      0           0             A0078      0           0
A0015      1           0             A0047      0           0             A0079      0           0
A0016      1           0             A0048      0           0             A0080      0           0
A0017      1           0             A0049      0           0             A0081      0           0
A0018      1           0             A0050      0           0             A0082      0           0
A0019      0           1             A0051      0           0             A0083      0           0
A0020      0           1             A0052      0           0             A0084      0           0
A0021      0           1             A0053      0           0             A0085      0           0
A0022      0           1             A0054      0           0             A0086      0           0
A0023      0           1             A0055      0           0             A0087      0           0
A0024      0           1             A0056      0           0             A0088      0           0
A0025      0           1             A0057      0           0             A0089      0           0
A0026      0           1             A0058      0           0             A0090      0           0
A0027      0           1             A0059      0           0             A0091      0           0
A0028      0           1             A0060      0           0             A0092      0           0
A0029      0           1             A0061      0           0             A0093      0           0
A0030      0           1             A0062      0           0             A0094      0           0
A0031      0           1             A0063      0           0             A0095      0           0
A0032      0           1             A0064      0           0             A0096      0           0
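For readers who want to reproduce example (I) without typing all 96 rows, the following is a minimal sketch of one way to build SEVERECOLDS1. It assumes the row order displayed in figure 2.2 (subjects 1 to 12 with colds at both ages, 13 to 18 at age 12 only, 19 to 42 at age 14 only, and the remainder at neither age). The loop counter i is introduced here for illustration only and is not part of the original data set.

```sas
/* Sketch: rebuild SEVERECOLDS1 from the pattern visible in figure 2.2 */
data SEVERECOLDS1;
   length SubjectID $ 6;
   do i = 1 to 96;
      SubjectID  = cats('A00', i);                 /* A001 ... A0096, as in figure 2.2 */
      Cold_12_YN = (i <= 18);                      /* 1 for subjects 1-18, else 0      */
      Cold_14_YN = (i <= 12) or (19 <= i <= 42);   /* 1 for subjects 1-12 and 19-42    */
      output;
   end;
   drop i;                                         /* helper counter (assumption)      */
run;
```

A quick PROC FREQ cross-tabulation of the two variables should then show the cell counts of figure 2.1 (12, 6, 24 and 54).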
McNemar’s test is applied in this case in order to answer the following question:
Was there a significant change in the prevalence of severe colds among schoolchildren from age 12 to age 14?
APPLICATION
The PROC FREQ code below (figure 2.3) was created by a Statistician to produce a data set containing statistical information
produced from McNemar’s Test, called “ODSDATA1.”
Figure 2.3:
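proc freq data = SEVERECOLDS1;
/* save the McNemar's Test table as data set ODSDATA1 */
ods output McNemarsTest=ODSDATA1;
/* the AGREE option requests McNemar's test for the 2x2 table */
tables Cold_12_YN*Cold_14_YN / agree;
run;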
When the code is run using the data from figure 2.2, the following data set is created (figure 2.4).
Figure 2.4:
ODSDATA1
Table                           Name1      Label1          cValue1   nValue1
Table Cold_12_YN * Cold_14_YN   _MCNEM_    Statistic (S)   10.8      10.8
Table Cold_12_YN * Cold_14_YN   DF_MCNEM   DF              1         1
Table Cold_12_YN * Cold_14_YN   P_MCNEM    Pr > S          .0010     .001015
The p-value to use is located in variable nValue1 on the observation where Name1 = ‘P_MCNEM’, as per the Statistician’s
instruction. Hence, the p-value is .001015. This is a relatively small p-value, which indicates that there is likely sufficient
evidence to reject the null hypothesis, depending on the level of significance that is employed. Again, the null hypothesis
states that there is no difference between the proportions of subjects with severe colds at age 12 and at age 14. Rejecting the
null hypothesis is analogous to stating that there is a significant difference between the two proportions.
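The values in ODSDATA1 can be checked by hand against the formula in figure 1.2. The sketch below is only a cross-check, not part of the Statistician's code. It assumes the discordant counts from figure 2.1 (6 children with a cold at age 12 only, 24 with a cold at age 14 only) and uses the SDF function to obtain the upper-tail chi-square probability with 1 degree of freedom. The variable names S and p are chosen here for illustration.

```sas
/* Hand check of the example (I) result using the figure 1.2 formula */
data _null_;
   A = 6;                        /* cold at age 12 only (figure 2.1)         */
   D = 24;                       /* cold at age 14 only (figure 2.1)         */
   S = (D - A)**2 / (D + A);     /* McNemar statistic: 324 / 30              */
   p = sdf('CHISQ', S, 1);       /* upper-tail chi-square probability, 1 DF  */
   put S= p=;                    /* write both values to the log             */
run;
```

The statistic works out to 324/30 = 10.8, and the p-value written to the log should agree with the .001015 reported in nValue1.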
WHERE DIFFICULTY MAY ARISE
Suppose the analysis is based on the data and table shown below in figures 3.1 and 3.2, a quasi-utopian case where nobody
got colds at age 14:
EXAMPLE (II):
Figure 3.1:
                          Severe Colds at age 14
Severe Colds at age 12    Yes     No      Total
Yes                       .       30      30
No                        .       24      24
Total                     .       54      54
Figure 3.2 – data set SEVERECOLDS2:
SubjectID  Cold_12_YN  Cold_14_YN    SubjectID  Cold_12_YN  Cold_14_YN
B001       1           0             B0028      1           0
B002       1           0             B0029      1           0
B003       1           0             B0030      1           0
B004       1           0             B0031      0           0
B005       1           0             B0032      0           0
B006       1           0             B0033      0           0
B007       1           0             B0034      0           0
B008       1           0             B0035      0           0
B009       1           0             B0036      0           0
B0010      1           0             B0037      0           0
B0011      1           0             B0038      0           0
B0012      1           0             B0039      0           0
B0013      1           0             B0040      0           0
B0014      1           0             B0041      0           0
B0015      1           0             B0042      0           0
B0016      1           0             B0043      0           0
B0017      1           0             B0044      0           0
B0018      1           0             B0045      0           0
B0019      1           0             B0046      0           0
B0020      1           0             B0047      0           0
B0021      1           0             B0048      0           0
B0022      1           0             B0049      0           0
B0023      1           0             B0050      0           0
B0024      1           0             B0051      0           0
B0025      1           0             B0052      0           0
B0026      1           0             B0053      0           0
B0027      1           0             B0054      0           0
The code in figure 3.3 runs, but no statistics are computed and the output data set is not created, as shown in the log (figure 3.4). Why?
Figure 3.3:
proc freq data= SEVERECOLDS2;
ods output McNemarsTest=ODSDATA2 ;
tables Cold_12_YN* Cold_14_YN / agree;
run;
Figure 3.4:
449  proc freq data= SEVERECOLDS2;
450  ods output McNemarsTest=ODSDATA2;
451  tables Cold_12_YN* Cold_14_YN / agree;
452  run;
NOTE: No statistics are computed for Cold_12_YN * Cold_14_YN since Cold_14_YN has less than
2 nonmissing levels.
WARNING: Output 'McNemarsTest' was not created. Make sure that the output object name,
label, or path is spelled correctly. Also, verify that the appropriate procedure options
are used to produce the requested output object. For example, verify that the NOPRINT
option is not used.
No statistics were computed, and data set ODSDATA2 was not produced. As the NOTE in the log indicates, Cold_14_YN has only
one nonmissing level, so PROC FREQ does not have a complete 2x2 table to work with. In order for the McNemar’s test to be
executed, the two squares in the contingency table that contain ‘yes’ for one outcome and ‘no’ for the other outcome must
contain non-missing data. The denominator in the McNemar’s Test Statistic is the sum of the two yes/no squares (the values
used are 24 and 6 for example (I), and <missing> and 30 for example (II)). If one of these values is missing, the statistic
cannot be computed by SAS, since the denominator in the statistic will be <missing>.
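One way to spot this situation before running the test is to count the nonmissing levels of each variable. The following is a minimal sketch under that assumption, not part of the original routine. The macro variable names levels_12 and levels_14 are hypothetical and chosen here for illustration.

```sas
/* Sketch: count the distinct nonmissing levels of each yes/no variable */
proc sql noprint;
   select count(distinct Cold_12_YN),
          count(distinct Cold_14_YN)
      into :levels_12, :levels_14      /* hypothetical macro variable names */
      from SEVERECOLDS2;
quit;

/* Write the two counts to the log */
%put Cold_12_YN levels: &levels_12 / Cold_14_YN levels: &levels_14;
```

For SEVERECOLDS2 this should report 2 levels for Cold_12_YN and 1 level for Cold_14_YN, which matches the NOTE in figure 3.4. Any count below 2 signals that the shell technique described in the next section is needed.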
As a SAS programmer, clearly you want to run error-free code that produces a data set with accurate results. Luckily, there is
a way to circumvent the failure outlined in the above example.
SOLUTION – CREATE A DATA SET WITH ALL POSSIBLE VARIABLE COMBINATIONS THEN
USE WEIGHT/ZEROS OPTION IN PROC FREQ
In order to solve the problem, apply the following routine outlined in figures 3.5 – 3.9. The figures below include all of the
code, and the resulting data sets. If you want to copy and paste the code directly into the SAS editor, please refer to Appendix
1.
The first step is to create a data set that will include all combinations of yes/no variables that actually exist in the data. This
data set will be called OUTSUMMARY, as shown in Figure 3.5.
Figure 3.5:
proc means data = SEVERECOLDS2 N nway;
class Cold_12_YN Cold_14_YN;
var Cold_12_YN Cold_14_YN;
output out = OUTSUMMARY(drop = _:) N=N;
run;
OUTSUMMARY
Cold_12_YN   Cold_14_YN   N
0            0            24
1            0            30
Then, create a SHELL data set. This data set will contain all possible combinations of yes/no variables that could theoretically
exist. Set N equal to 0 as the default for all combinations. Ultimately, N will represent the number of subjects with each
combination.
Figure 3.6:
DATA SHELL;
do Cold_12_YN = 0,1;
do Cold_14_YN = 0,1;
N = 0;
output;
end;
end;
run;
SHELL
Cold_12_YN   Cold_14_YN   N
0            0            0
0            1            0
1            0            0
1            1            0
Update the SHELL data set with the counts from OUTSUMMARY, which summarizes SEVERECOLDS2. Any combinations of variables
that did not have any information in SEVERECOLDS2 will end up having N equal to 0. The UPDATE statement, shown in figure 3.7,
is similar to a match-merge.
Figure 3.7:
DATA SEVERECOLDSXXX;
update SHELL OUTSUMMARY;
by Cold_12_YN Cold_14_YN;
run;
SEVERECOLDSXXX
Cold_12_YN   Cold_14_YN   N
0            0            24
0            1            0
1            0            30
1            1            0
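Before moving on, it can be worth confirming that the updated data set has the expected shape. The sketch below is an optional sanity check rather than part of the original routine. The column aliases n_cells and n_subjects are made up for this illustration.

```sas
/* Sketch: confirm the shell has all 4 cells and that no subjects were lost */
proc sql;
   select count(*) as n_cells,       /* one row per yes/no combination */
          sum(N)   as n_subjects     /* total of the cell counts       */
      from SEVERECOLDSXXX;
quit;
```

Given the values displayed for SEVERECOLDSXXX, the query should report 4 cells and 54 subjects, the same total as in figure 3.1.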
The final step is essentially the same PROC FREQ code snippet that was developed by the Statistician, as shown already, in
figure 3.3. However, the magic fix is an additional line of code in the PROC FREQ code snippet below, in figure 3.8.
Figure 3.8:
weight N /zeros;
This WEIGHT statement establishes that a given numeric variable will provide a weight for each observation during an
analysis. If a WEIGHT statement is not specified, PROC FREQ will assign a weight of 1 to every observation. The sum of the
weight variable values represents the total number of observations.
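To see the WEIGHT statement at work on its own, the sketch below feeds PROC FREQ the four cell counts of example (I) from figure 2.1 instead of the 96 subject-level rows. The data set name COLDS1_COUNTS is hypothetical and introduced only for this illustration.

```sas
/* Sketch: example (I) entered as one row per cell, with N as the cell count */
data COLDS1_COUNTS;
   input Cold_12_YN Cold_14_YN N;
   datalines;
0 0 54
0 1 24
1 0 6
1 1 12
;
run;

proc freq data = COLDS1_COUNTS;
   tables Cold_12_YN*Cold_14_YN / agree;
   weight N;          /* each row stands for N subjects */
run;
```

Because each row is weighted by the number of subjects it represents, the McNemar results should match those obtained from SEVERECOLDS1 in figure 2.4. No cell is empty here, so the ZEROS option is not needed in this case.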
By default, PROC FREQ will ignore an observation in an analysis if its weight variable is either missing or 0. In example (II),
the weight variable is N. In order for the McNemar’s test used in the example to work, the observations with a weight of 0 need
to be included in the analysis. This can be accomplished by means of the ZEROS option. This option is a necessity, since PROC
FREQ would otherwise ignore those observations by default. Figure 3.9 displays the PROC FREQ code and the end product, a data
set containing the McNemar’s test results.
Figure 3.9:
PROC FREQ data = SEVERECOLDSXXX;
ods output McNemarsTest = ODSDATAXXX;
tables Cold_12_YN*Cold_14_YN/agree;
weight N /zeros;
run;
ODSDATAXXX
Table                           Name1      Label1          cValue1   nValue1
Table Cold_12_YN * Cold_14_YN   _MCNEM_    Statistic (S)   30        30
Table Cold_12_YN * Cold_14_YN   DF_MCNEM   DF              1         1
Table Cold_12_YN * Cold_14_YN   P_MCNEM    Pr > S          <.0001    4.32E-08
The code described in figures 3.5 – 3.9 results in the ODSDATAXXX data set, which contains the appropriate McNemar’s test
results, and a very small p-value.
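Once the output data set exists, the p-value is often needed elsewhere in a program, for example in a title or a summary table. The following is a minimal sketch of one way to pull it out, using the same rule given earlier for ODSDATA1 (nValue1 on the observation where Name1 is P_MCNEM). The macro variable name mcnemar_p is hypothetical.

```sas
/* Sketch: copy the McNemar p-value from ODSDATAXXX into a macro variable */
data _null_;
   set ODSDATAXXX;
   where Name1 = 'P_MCNEM';               /* observation holding the p-value */
   call symputx('mcnemar_p', nValue1);    /* hypothetical macro variable     */
run;

/* Write the stored value to the log */
%put McNemar p-value: &mcnemar_p;
```

The statistic itself can be confirmed from figure 1.2: with discordant counts of 30 and 0, it is (30 - 0)**2 / (30 + 0) = 30, as shown in ODSDATAXXX.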
CONCLUSION
SAS possesses a vast abundance of statistical procedures, and at times these procedures may be tricky to work with and
debug if you are not a statistical expert. It can be very beneficial to understand basic concepts related to the creation of data
sets before executing a statistical procedure. Having this knowledge can result in programs that are efficient, accurate, easily
reusable, and error free.
REFERENCES
Bland M (2000) An introduction to medical statistics, 3rd ed. Oxford: Oxford University Press, via www.medcalc.org.
Delwiche, Lora D. and Slaughter, Susan J. (1998) The Little SAS Book. 2nd Edition. SAS Institute Inc., Cary, NC. Pp 174-175.
McNemar, Quinn (1969) Psychological Statistics, 4th ed. John Wiley & Sons: New York. Pp. 260 – 263.
McNemar, Quinn (1947) “Note on the Sampling Error of the Difference Between Correlated Proportions or Percentages.”
Psychometrika, Vol. 12, No. 2, June 1947. Pp. 153-157.
Medcalc.org, McNemar test on paired proportions. Downloaded 18 April 2012.
http://www.medcalc.org/manual/mcnemartest2.php
SAS Help and Documentation, SAS 9.3. FREQ Procedure, Weight/Zeros. Retrieved 4/21/2012.
ACKNOWLEDGMENTS
Special thanks to my colleagues Howard Proskin, Dan Hatch and Bill Murphy.
SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in
the USA and other countries. ® indicates USA registration.
Other brand and product names are registered trademarks or trademarks of their respective companies.
CONTACT INFORMATION
Your comments and questions are valued and encouraged. Contact the author at:
Stephen W. Bosch
Howard M. Proskin & Associates
300 Red Creek Dr., Suite 330
Rochester, NY 14623
Phone: 585-359-2420
Fax: 585-359-0465
Email: [email protected]
Web: www.hmproskin.com
APPENDIX 1
data SEVERECOLDS2;
input SubjectID $ Cold_12_YN Cold_14_YN;
datalines;
B001 1 0
B002 1 0
B003 1 0
B004 1 0
B005 1 0
B006 1 0
B007 1 0
B008 1 0
B009 1 0
B0010 1 0
B0011 1 0
B0012 1 0
B0013 1 0
B0014 1 0
B0015 1 0
B0016 1 0
B0017 1 0
B0018 1 0
B0019 1 0
B0020 1 0
B0021 1 0
B0022 1 0
B0023 1 0
B0024 1 0
B0025 1 0
B0026 1 0
B0027 1 0
B0028 1 0
B0029 1 0
B0030 1 0
B0031 0 0
B0032 0 0
B0033 0 0
B0034 0 0
B0035 0 0
B0036 0 0
B0037 0 0
B0038 0 0
B0039 0 0
B0040 0 0
B0041 0 0
B0042 0 0
B0043 0 0
B0044 0 0
B0045 0 0
B0046 0 0
B0047 0 0
B0048 0 0
B0049 0 0
B0050 0 0
B0051 0 0
B0052 0 0
B0053 0 0
B0054 0 0
;
proc freq data= SEVERECOLDS2;
ods output McNemarsTest=ODSDATA2 ;
tables Cold_12_YN* Cold_14_YN / agree;
run;
proc means data = SEVERECOLDS2 N nway;
class Cold_12_YN Cold_14_YN;
var Cold_12_YN Cold_14_YN;
output out = OUTSUMMARY(drop = _:) N=N;
run;
DATA SHELL;
do Cold_12_YN = 0,1;
do Cold_14_YN = 0,1;
N = 0;
output;
end;
end;
run;
DATA SEVERECOLDSXXX;
update SHELL OUTSUMMARY;
by Cold_12_YN Cold_14_YN;
run;
PROC FREQ data = SEVERECOLDSXXX;
ods output McNemarsTest = ODSDATAXXX;
tables Cold_12_YN*Cold_14_YN/agree;
weight N /zeros;
run;