Paper TT14
Consistency Check: QC Across Outputs for Inconsistencies
John Morrill, Quintiles, Inc., Kansas City, MO
David J. Austin, PRA International, Charlottesville, VA
ABSTRACT
Executing a SAS program produces a log containing many notes, warnings, and error messages regarding the
program. There is no such log regarding the fitness of the output itself. For example, no SAS warning is issued
when a set of outputs contains variant usages of the same word. This paper describes a naïve but effective approach
to detecting such inconsistencies across outputs. All words (actually, any set of characters separated by a space) in
a document are reported together, sorted, with duplicates eliminated. This report brings out inconsistencies by
showing similar elements adjacent to each other. The costs and benefits of reducing the volume of this report with
several methods are discussed. Example input and output, along with all SAS code, are provided. The simplicity of
this approach makes this paper appropriate for beginners, while its usefulness makes it worthy of consideration for all
levels of SAS® users.
INTRODUCTION
This paper is organized into five sections. The first section describes the problem and introduces an example
pharmaceutical input along with our approach to detecting inconsistencies, a SAS program we call Con_check (for
consistency check). The second section describes a Con_check report on the example input. The third presents
three methods of improving the usefulness of this report, discussing the benefits and costs of each. The fourth
discusses limitations of Con_check and how to get around some of them. The fifth wraps up several miscellaneous
issues. The appendices provide all example input, output, and code discussed in this paper.
1. THE PROBLEM AND THE CON_CHECK APPROACH
Consider a set of tables and listings summarizing a clinical study. A traditional spell-checking program could be used
to identify aberrant spelling and a visual check is often used to identify other problems. There are at least two
shortcomings to these approaches. First, since both the spell-checker and the visual check deal with words in their
context, overall consistency is not an objective. Second, both methods check the same word multiple times, and a
traditional spell-checker checks one word at a time, flagging anything not found in its dictionary, so the output
may be unnecessarily voluminous.
Con_check attempts to mitigate these shortcomings. Con_check removes words from their original context and
compares them with similar occurrences, increasing the chance of finding inconsistent usages. Also, Con_check
presents only one occurrence of a given word, eliminating time-consuming redundancy and minimizing flagged output
volume.
Here is how Con_check works. First, note the example pharmaceutical output included in Appendix 1: this is a
compilation of tables designed to fit on a single page, contrived with dozens of errors and inconsistencies for
illustration purposes. This SAS output becomes input to Con_check; it is converted to a text file and read into a data
set consisting of a single observation for each “word” (used here to mean a collection of characters not containing the
space character). Duplicate observations are removed and the resulting Con_check report is found in Appendix 2.
This report is sorted so that words starting with the same characters (both alphabetic and non-alphabetic) are
grouped together. Visually scanning this output provides the opportunity to detect inconsistent usages of words in the
original pharmaceutical input.
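For readers who want to see the mechanics before reaching the appendices, here is a minimal sketch of the
word-splitting and de-duplication steps, assuming the converted text file is named output.txt (a placeholder; the
complete program, with its flags and options, is in Appendix 3):

* Read each line and output one observation per "word". *;
data words(keep=w);
infile 'output.txt' missover length=reclen;
length line w $800.;
input line $varying800. reclen;
cnt=1;
do until(w eq ' ');
w=scan(line,cnt,' ');
if w ne ' ' then output;
cnt+1;
end;
run;

* Sort and keep one observation per distinct word. *;
proc sort data=words nodupkey;
by w;
run;

Appendix 3 counts the duplicates (the C column of the report) before dropping them, rather than using NODUPKEY.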
2. USING THE CON_CHECK REPORT TO DETECT INCONSISTENCIES
Appendix 2 is the Con_check output using all words from Appendix 1 as input. Appendix 2 is divided into page 1 (the
shorter words) and page 2 (the longer words). Page 1 has six panels, created with the PROC REPORT PANELS=
option. Each panel contains three variables. The variable Z is a code described at length in the third section below;
an understanding of Z is not necessary for Con_check output to be useful. The variable W contains “words.” Words
as used in this paper include regular words but also any collection of characters not containing a space. The variable
C is a frequency count of each word. Page 2 of Appendix 2 contains the same variables as page 1 as well as W_L,
the length of the word.
The table below demonstrates how to use the Con_check report to detect inconsistencies.
Selected potential inconsistencies in the first panel.

  &dose                This may be an unresolved macro variable, missed in the SAS log.
  13 (                 Probably not cause for concern.
   2 (%)
  (%)by                Likely a missing space.
  (0.0$)               Probably the $ should be replaced with a %.
  (0.0%)
  (Beats/minutes)      Should this be (Beats/minute) [without the plural minutes]?
  (C02)                Incorrect characters: zero in place of the capital letter O.
  (CO2)
  (cp/ug)              Incorrect characters: u in place of the Greek letter µ.
  (cp/µg)
  (IU/mL)              Again, incorrect characters: lower case l in place of the capital
  (IU/ml)              letter L. Also, should a space precede the hyphen?
  (IU/mL)-
  (mm                  Spacing issues.
  (mmHg)
  (N=                  Again, spacing issues.
  (N=38)
  )                    Familiarity with the output or looking at a few of these instances
  ,                    in context will reveal if these are intended.
  .
  /
  .)                   May be worth investigating as irregular.

Selected inconsistencies involving numbers.

  0.0.                 Perhaps this properly ends a sentence but maybe it is a
                       programming error.
  02                   As the only number like this with a preceding zero, this bears
                       investigating.
  02NOV2005            Are all dates to be on the same day?
  03NOV2005
  2004                 Do we expect to see dates from only the current year?
  88mg                 Spacing issues.
  1 14.8.1             A case where the frequency is important in identifying a potential
  2 14.8.3             problem can be seen here. These are the table numbers appearing at
                       the top of each output in Appendix 1. Since we have two instances
                       of 14.8.3, we may want to investigate whether this is due to
                       properly having multiple pages of the same output or due to having
                       mistakenly placed 14.8.3 where 14.8.2 is intended. Here, the
                       latter is the case. This illustrates that Con_check is valuable
                       for what is not shown (essentially a frequency of zero for the
                       missing 14.8.2) as well as what is shown.
  1 0.1178             Frequency is also important here. Upon investigation, all three
  1 0.4148             cases of 0.8894 turn out to be p-values. This makes one suspicious
  3 0.8894             that the models producing these p-values may improperly be
                       specified as identical.
In general, Con_check’s output should be considered merely as candidates for further action; checking these
candidates in context will reveal whether further action is merited. Clearly, not all words will be cause for concern. In
fact, a completely consistent output will have no words that are cause for concern. The high incidence of
questionable words in this paper is due to the contrived, error-laden Appendix 1. (Though this example is contrived
for illustration purposes, most of the errors placed in Appendix 1 were detected in actual use of early versions of
Con_check.) In fact, the sparse rate of errors in a typical output is motivation for finding ways to eliminate words that
are not a problem. These ways are explored in Section 3 below. Also, while a feature of Con_check is to remove
words from their context in order to identify inconsistent usages, it is important to check the context to determine
which usage is preferred. Here are some more cases.
Selected inconsistencies in the second and third panels.

  <.0001               The first is produced with the SAS format pvalue6.4. Should these
  <0.0001              two be made consistent?
  8 AUC                Most instances have a space following AUC. One, however, has a
  1 AUC-               hyphen immediately following and another seemingly has a space
  1 AUCas              missing.

Selected inconsistencies in the last three panels.

  3 IL-10              Should the one instance with the equal sign be hyphenated instead?
  1 IL=10
  1 MANCOVA            Should one of these be changed to the other or are the two
  1 MANOVA.            different models actually used in this study?
  AcmeAce              Is a space missing?
  1 C-Peptide          Should the P following the hyphen be capitalized?
  1 C-peptide
  1 DAFT               DAFT is a valid word but is probably not what is intended.
  2 DRAFT
  doesd                The first should probably be “dosed.”
  dosed
  ears                 This seemed odd. Upon investigation it should instead be “years.”
  Enrolled             Stray quote mark?
  Enrolled"
  2 gamma,             The one case with the “n” should have two “m’s.”
  1 gamna,
  MMDDDDYYYY           3 or 4 Ds? Actually, should this not be DDMMMYYYY?
  MMDDDYYYY
  N                    Are both okay or should one style be selected?
  n
  1 Plabeco            Transposed letters in one instance.
  4 Placebo
  1 TFN                One of these is likely wrong.
  2 TNF
  Trail                This is a good word but seemed odd. Upon investigation it should
                       have been Trial as in “Trial Medication.”
  1 ta.sas             The third case does not follow the pattern of two-letter SAS
  1 tb.sas             program names.
  1 tcc.sas
Page 2 shows lines of different lengths; perhaps this is intended, but it merits investigation. Other
inconsistencies, along with ways of making the Con_check report more usable, are covered in Section 3.
3. MAKING THE CON_CHECK REPORT MORE USABLE
As was discussed above, the Appendix 1 output is contrived to contain many errors and inconsistencies and thus
makes the Con_check report found in Appendix 2 clearly worth reviewing. In reality, the Con_check report is typically
(hopefully) very sparse of actual problems. A low number of errors per page in part reproduces the original problem:
it dissuades careful consideration of the report and allows errors to remain undetected. In this section we
cover three methods of increasing the rate of real problems per page by eliminating words that are of no concern,
improving the usefulness of Con_check. These three methods correspond to levels of the variable Z on the Con_check
report and include (Z=1) stripping selected characters so that fewer lines are output to the Con_check report, (Z=2)
eliminating numbers, and (Z=4 and Z=8) eliminating properly spelled words. Eliminating all words to which Con_check
assigns a Z value above 1 would clearly generate the smallest possible report, but this comes at a cost. For
illustration purposes in this paper, no such flagged words were eliminated.
The first method of eliminating words that do not merit an individual line in the report is to strip off characters which
unnecessarily differentiate the base word. Note that there are seven versions of the word Study in Appendix 2.
Aside from capitalization, six versions differ only in the punctuation [the right parenthesis, comma, period, colon, and
semi-colon] at the end of the word. Words containing these characters at the end are marked by adding 1 to the Z
variable. By stripping these characters, six lines in this report could be reduced to a single line containing Study.
Similarly, stripping the period from the end of Placebi in the fifth panel makes it a word that can be spell-checked
(see the spelling section below). The downside of naively stripping characters is that one loses the ability to
mark words that should be flagged. For example, by eliminating the period and the right parenthesis, neither the .)
nor the 0.0. in the first panel is marked as a potential problem. And adding a hyphen to the set of Z=1 characters
would further reduce the lines used but would also reduce the number of potential problems flagged. Eliminating
characters from the beginning of words would also have both benefits and costs.
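A simplified sketch of this stripping step follows; Appendix 3 accomplishes the same thing by reversing the word and
peeling characters off the front. Variable names here follow the appendix, where wt is the stripped word and t feeds
the Z code; the length(wt)>1 guard (an addition in this sketch) keeps a bare punctuation mark from stripping to blank:

data stripped;
set words;
length wt $800.;
wt=w;
do while(length(wt)>1 and substr(wt,length(wt),1) in (':' ',' '.' ')' ';'));
wt=substr(wt,1,length(wt)-1); * Drop one trailing character. *;
t=1; * Contributes 1 to Z. *;
end;
run;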
The second method of reducing Con_check output is to eliminate numbers. Words flagged as numbers are marked
by adding 2 to the Z variable. Note that Z is cumulative, so Z=3 indicates a word that was stripped of selected
characters as described above (contributing 1 to Z) and, in turn, resulted in a number (adding 2 to Z to make 3). The
part of the table describing numbers in section 2 above gives some of the benefits of leaving numbers in the report.
This must be balanced against the tremendous volume of numbers in, for example, a set of listings.
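The number test itself is a single INPUT call; the ?? modifier suppresses the notes and error messages that
non-numeric words would otherwise write to the log. A minimal sketch, here testing against missing so that a bare 0
also counts as a number:

data numbers;
set stripped;
if input(wt,?? best32.) ne . then n=2; * Contributes 2 to Z. *;
run;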
The third method of reducing Con_check output is to eliminate properly spelled words. A standard word list found on
the internet was used to assign Z=4, and a specialized pharmaceutical word list was used to assign Z=8. Again, Z is
cumulative, so a word with a Z=5 value, such as Treatment, (with a comma on the end), gets a value of 1 from
stripping off the comma and a value of 4 from appearing in the standard word list. The word limit:,.); (clearly a
contrived case) scores Z=5 for the same reason: the selected characters are stripped, leaving a properly spelled word.
Many properly spelled words could be eliminated by using this method, but this would be at the cost of eliminating
valid yet unintended words such as colleted (the Appendix 1 context reveals that this word should be collected),
Chang (appears in our standard word list but should be Change), DAFT, and continuos (should be continuous).
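A sketch of the dictionary lookup, assuming the word list has been read into a data set named wordlist (a
hypothetical name) with one lower-case word per observation in a variable named spell; the Z=8 lookup in Appendix 3
works the same way apart from the data set and flag names:

proc sort data=wordlist;
by spell;
run;
data tocheck;
set stripped;
ws=lowcase(wt); * Compare in lower case; display the original form. *;
run;
proc sort data=tocheck;
by ws;
run;
data spelled;
merge tocheck(in=a) wordlist(in=b rename=(spell=ws));
by ws;
if a; * Keep every word... *;
if b then s=4; * ...and flag dictionary matches; contributes 4 to Z. *;
run;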
4. CON_CHECK LIMITATIONS AND APPROACHES TO OVERCOME THESE LIMITATIONS
Con_check is intended to supplement the quality control process, not replace it. As a naïve approach, it cannot be
depended on to catch grammatical errors. In addition, big-picture consistency or context-sensitive issues are not
flagged. For example, in Appendix 1 Con_check does not notice that Note: is missing from the third output’s footnote
nor that the line separating the title from the header is missing from the first output. And on the final output’s footnote,
the word “also” improperly appears both at the end of the first line and at the beginning of the second line. And by
removing words from their context, Con_check does not highlight alignment issues.
Con_check relies heavily on consistency among initial letters of similar words. An inconsistency at the beginning of a
word will keep the inconsistent word from showing up near the intended word, decreasing the chance that the pair will
be noticed as inconsistency candidates. The word zubjects is an example of this. Some progress in overcoming this
may be made by sorting based on reverse(word) while still printing the un-reversed word. This will gather words of
similar endings together, potentially revealing inconsistencies which otherwise would go unnoticed. Singular and
plural forms of words, along with other suffixes, throw a wrench into this method. Perhaps a scoring scheme could be
created to assign similarity between words for comparison purposes.
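A sketch of this reverse-sort idea; the TRIM call matters because reversing a padded character value would otherwise
move the trailing blanks to the front:

data rev;
set words;
w_rev=reverse(trim(lowcase(w)));
run;
proc sort data=rev;
by w_rev; * Sort on the reversed word... *;
run;
proc print data=rev noobs;
var w; * ...but display the original. *;
run;

With this ordering, zubjects and Subjects land next to each other.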
In its present form, the Con_check unit is a single word. In reality, it would be valuable to check consistency among
units longer than a single word. For example, perhaps a given analyte is expected to be followed by only one form of
a unit rather than both the original and the standard unit. And some words are frequently found together – trial
medication is an example from Appendix 1. Also, hyphenated words extending beyond a single line are presently
considered separate words (due to converting the original document to a text file prior to processing). By extending
the concept of unit to selected multiple word phrases, another level of inconsistency could be detected. Extend the
unit to be a complete sentence and grammatical checking from sophisticated word processing programs can be
utilized. Extend the unit to be an entire footnote and another level of consistency can be checked. For hints on
how to process multiple word units, see Section 4.2 of Morrill, 2005, the source of the idea for Con_check.
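A sketch of one step in that direction, pairing each word with its successor to form two-word units; it assumes a
data set words0 (a hypothetical name) holding one observation per word in original document order, i.e., captured
before sorting. The resulting bigrams can then be sorted and de-duplicated exactly like single words:

data pairs(keep=phrase);
set words0;
length prev phrase $800.;
retain prev;
if _n_>1 then do;
phrase=catx(' ',prev,w); * Two-word unit. *;
output;
end;
prev=w;
run;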
5. MISCELLANEOUS
Note that any file is eligible material: a Statistical Analysis Plan, a set of mock tables or shells, a title and footnote file,
an entire clinical study report, journal articles, books, even the Gettysburg Address. Con_check has been
successfully used on parts of these (except the latter which cannot be improved!). An innovative suggestion was
made by Emily Peterson: run Con_check on mock tables to produce a set of words that should appear in the actual
tables and listings, then use this set as the specialized dictionary to flag unexpected words in the pharmaceutical
outputs programmed from these mock tables. In general, a “before and after” approach can be used to detect
changes in a variety of circumstances: run Con_check, save results, make desired changes to Con_check input, run
Con_check again, compare results.
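A sketch of the final comparison, assuming the deduplicated word/frequency data sets from the two runs were saved
under the hypothetical names before and after:

proc sort data=before; by w; run;
proc sort data=after; by w; run;

* Report words added, dropped, or changed in frequency between runs. *;
proc compare base=before compare=after listall;
id w;
run;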
In practice, multiple SAS outputs are gathered into a single document for processing. We use a Perl script (written by
Doug Moore) to gather MS Word documents into a single file, then convert this to a text file. To avoid an
overwhelming volume of numbers in data listing output, one could take only the first page of each listing and still
obtain a representative sample. One could narrow the focus to footnote or title sections and avoid the body with all
its numbers altogether. And one may wish to avoid, or run separately, AE and Medical History listings, since verbatim
terms (think of all the misspellings, such as toxicity vs. tocixity) may cloud the results.
Regarding the spell check, note that the standard word list is in lower case letters and the Z values of Con_check are
assigned based on having all letters converted to lower case letters, though the original form is displayed. The
specialized pharmaceutical word list used in Con_check consists of just a few words for illustration purposes. In
practice it could be generated over time by adding appropriate words or could be generated from a medical
dictionary. A correct set of units would be valuable to include in the specialized word list. Note that the specialized
list saves at most only one line of Con_check output per word in the specialized list.
A couple of things regarding Con_check’s code: First, Con_check converts tabs to spaces. This is important since tabs
are not displayed in Con_check’s output. Leaving a tab adjacent to a word would result in words showing up as
separate cases in the Con_check output for no visible reason. Second, due to the limit on the number of characters
actually compared using PROC FREQ in some versions of SAS, a PROC SORT along with a data step is used to
accurately summarize the units in Con_check.
REFERENCES
Morrill, JM, “Programming Tips and Examples for Your Toolkit, II,” Proceedings of the 2005 Annual Conference of the
Pharmaceutical Industry SAS Users Group (PharmaSUG), May 2005. Available at
http://www.pharmasug.org/2005/bestpapers.htm or http://www.lexjansen.com/pharmasug.
ACKNOWLEDGMENTS
The authors wish to thank Indra Fernando, Doug Moore, John Gorden, and Emily Peterson for valuable review and
suggestions. SAS is a registered trademark or trademark of SAS Institute Inc. in the USA and other countries. ®
indicates USA registration. Other brand and product names are registered trademarks or trademarks of their
respective companies.
CONTACT INFORMATION
Your comments and questions are valued and encouraged.
John Morrill
Quintiles, Inc.,
P.O. Box 9708
Kansas City, MO 64134
816-767-6000
[email protected]
David J. Austin
PRA International, 4105 Lewis & Clark Dr.
Charlottesville, VA 22911-5801
434-951-3000
[email protected]
BIO INFORMATION
John is an Analyst in Statistical Programming at Quintiles, Inc. in Kansas City. He first used SAS in 1985 and has
used it full-time in the pharmaceutical industry since 1998. David is an Analysis Programmer for PRA International in
Charlottesville, Virginia and has worked in the pharmaceutical industry since 1997.
KEYWORDS
title, footnote, text file, inconsistent, aberrant, spelling, quality control, consistency, consistent
Appendix 1

Table 14.8.1                            Acme Ace-PROT1A                          ta.sas    DRAFT - 03NOV2005 15:26
                                                                                           Page 1 of 1

       Multivariate Analysis of ABC Level AUC per day (cp/ug) and Maximum Total of TNF alpha, IFN gamma,
            IL-6, and IL-2 Circulating Cytokines During Study, Per-Protocol Population [ Enrolled ]
                                                   Adjusted            95% Confidence Interval of
Treatment Group                          n      Geometric Mean          Adjusted Geometric Mean           p-Value
______________________________________________________________________________________________________________________________________
Simultaneous Estimates
  ABC level AUC
    AcmeAce                             35            65              (     19 ,     228 )
    Plabeco                             12            21              (      0 ,       6 )
  Total TFN alpha, IFN gamna, IL-6, and IL-2
    AcmeAce                             35           591.3            (  517.4,  5425.4)
    Placebo                             12            76.6            (   35.8,   163.9)
______________________________________________________________________________________________________________________________________
Simultaneous Treatment Comparisons:
  AcmeAce / Placebo
    ABC Level AUC                                     65              (     12 ,     346 )                <.0001
    Total TNF alpha, IFN gamma, IL-6, and IL-2        33.8            (   16.5,    69.5)                  <.0001
______________________________________________________________________________________________________________________________________
                          Partial Correlation Coefficient from
                                  Error SSCP Matrix                                 p-value
______________________________________________________________________________________________________________________________________
                                       -0.1239                                      0.8894
______________________________________________________________________________________________________________________________________
Note: MANOVA. fixed-effect. Treatment-Emergent Treatmemt-emergent. viait visit. AUC AUC- (0.0%) (0.0$) (IU/mL) (IU/ml) colleted GADA
GADA- Lympcyte Lymphocyte Lymphocytes Lymphocytess Mddication Chang MedDra Subgrops Summries Comcomitant (Beats/mininute) ears
Table 14.8.3                            Acme Ace-PROT1A                          tb.sas    DAFT - 03NOV2005 14:46
                                                                                           Page 1 of 1

     Analysis of Maximum Lymphopenia Degree (%) After the First Treatment and ABC Level AUC per Day (cp/µg)
            During the Study After the First Treatment, Per-protocol Population [ Intent-to-Treat. ]
______________________________________________________________________________________________________________________________________
Treatment     Variable                            N      Mean    St. Dev.    Meadian    Minimum    Maximum    p=value
____________________________________________________________________________________________________________________________________
AcmeAce       Maximum lymphopenia degree         35        16        3           16         10         22
              ABC level AUC (C02)                35       491      665          152          0       3090     0.1178
Placebo       Maximum lymphopenia degree         12         0        1            0          0          3
              ABC level AUC (CO2)                12         3       10            0          0         34     0.4148
______________________________________________________________________________________________________________________________________
Note: Lymphoneia degree defined as number of units (%) below lower normal limit:,.); Seven zubjects dosed with 88 mg of study drug.
MANCOVA fixd-effect bseline C-peptide AUCas a continuos. ABC level AUC (Beats/minutes) (mm Hg) (mmHg) Enrolled" 02 2004 Treatment.
Table 14.8.3                            Acme Ace-PROT1A                          tcc.sas   DRAFT - 02NOV2005 16:38
                                                                                           Page 1 of 1

     Analysis of IL-10 Circulating Cytokine at Day 2, 4 Hours Post Infusion Subset by Occurrence of Rash
          Adverse Event Data through Month 18 for 78 Randomized Subjects, Intent-to-Treat Population
______________________________________________________________________________________________________________________________
                                              AcmeAce (N= 40)                     Placebo (N=38)
                                             IL-10 at Endpoint                   IL-10 at Endpoint
                                           0 n(%)         >0 n(%)             0 n(%)         >0 n(%)
______________________________________________________________________________________________________________________________
Rash During Study
  Yes                                    21 (100.0)      9 ( 44.3)          10 ( 54.4)      3 ( 50.0)
  No                                      . (    .)      5 ( 55.7)          22 ( 45.6)      3 ( 50.0)
  P-value                                      <0.0001                           0.8894
  P-value Controlling for Treatment                          0.8894
______________________________________________________________________________________________________________________________________
IL=10 zero and non-zero values and precense of rash were analyzed within each treatment group using Fisher's exact test and also
also analyzed overall using Mantel-Haenszel test(controlling for treatment group). &dose subjects doesd with 88mg of study drug.
MMDDDYYYY MMDDDDYYYY Trail MEdication. Porbably. Exan Normal? Placebi. Subejct. Urianalysis. 0.0. Contuing. Study. Study)
Life-threatning. Sequale. Ongoign. Checst X-Ray. Inclusiom Criteria. Exclusin. Completd. C-Peptide (%)by (IU/mL)- Study; Study:
Appendix 2, Page 1, Shorter lengths. Input=2006 TT14 Con_check_app1, z=code (strip=1, number=2, spell=4, special=8), c=frequency, w=word
Z C W
. 1 &dose
. 13 (
1 2 (%)
. 1 (%)by
1 1 (0.0$)
1 1 (0.0%)
1 1 (100.0)
1 1 (Beats/minutes)
1 1 (C02)
1 1 (CO2)
1 1 (cp/ug)
1 1 (cp/µg)
1 1 (IU/mL)
1 1 (IU/ml)
. 1 (IU/mL)-
. 1 (mm
1 1 (mmHg)
. 1 (N=
1 1 (N=38)
1 3 )
1 3 ,
. 3 -
2 1 -0.1239
1 1 .
1 1 .)
. 1 /
. 9 0
1 1 0.0.
2 1 0.1178
2 1 0.4148
2 3 0.8894
2 1 02
. 1 02NOV2005
. 2 03NOV2005
2 8 1
2 3 10
2 5 12
. 1 14.8.1
. 2 14.8.3
. 1 14:46
2 1 152
. 1 15:26
2 2 16
3 1 16.5,
3 1 163.9)
. 1 16:38
2 1 18
2 1 19
Z C W
Z C W
Z C W
3 1 2,
4 2 analyzed
4 1 Estimates
2 1 2004
4 8 and
4 1 Event
2 2 21
4 1 Appendix
4 1 exact
2 2 22
4 1 as
. 1 Exan
2 1 228
4 3 at
1 1 Exclusin.
2 5 3
. 8 AUC
4 2 First
2 1 3090
. 1 AUC-
. 1 Fisher's
2 1 33.8
. 1 AUCas
. 1 fixd-effect
2 1 34
4 1 below
9 1 fixed-effect.
2 1 346
. 1 bseline
4 3 for
2 4 35
4 1 by
4 1 from
3 1 35.8,
. 1 C-Peptide
. 1 GADA
2 1 4
. 1 C-peptide
. 1 GADA-
3 1 40)
4 1 Chang
5 2 gamma,
3 1 44.3)
. 1 Checst
1 1 gamna,
3 1 45.6)
4 2 Circulating
4 2 Geometric
2 1 491
4 1 Coefficient
4 1 Group
2 1 5
4 1 colleted
4 1 group
3 2 50.0)
. 1 Comcomitant
5 1 group).
3 1 517.4,
5 1 Comparisons:
1 1 Hg)
3 1 54.4)
1 1 Completd.
4 1 Hours
3 1 5425.4)
4 1 Confidence
. 3 IFN
3 1 55.7)
5 1 continuos.
. 3 IL-10
2 1 591.3
4 1 Controlling
. 3 IL-2
2 1 6
1 1 Contuing.
1 3 IL-6,
2 2 65
4 1 Correlation
. 1 IL=10
2 1 665
5 1 Criteria.
. 1 Inclusiom
3 1 69.5)
4 1 Cytokine
4 1 Infusion
2 1 76.6
4 1 Cytokines
8 1 Intent-to-Treat
2 1 78
4 1 DAFT
4 1 Interval
2 1 88
4 1 Data
4 3 Level
. 1 88mg
4 2 Day
4 4 level
2 1 9
4 1 day
5 1 limit:,.);
. 1 95%
4 1 defined
4 1 lower
. 2 <.0001
4 1 Degree
. 1 Lympcyte
. 1 <0.0001
4 3 degree
12 1 Lymphocyte
. 2 >0
5 1 Dev.
4 1 Lymphocytes
. 1 a
. 1 doesd
. 1 Lymphocytess
. 7 ABC
4 1 dosed
. 1 Lymphoneia
. 3 Ace-PROT1A
4 2 DRAFT
8 1 Lymphopenia
4 3 Acme
5 2 drug.
8 2 lymphopenia
. 5 AcmeAce
4 3 During
. 1 MANCOVA
4 2 Adjusted
4 1 each
1 1 MANOVA.
4 1 Adverse
4 1 ears
. 1 Mantel-Haenszel
4 2 After
4 2 Endpoint
4 1 Matrix
5 3 alpha,
4 1 Enrolled
4 5 Maximum
4 2 also
. 1 Enrolled"
. 1 Mddication
4 3 Analysis
4 1 Error
. 1 Meadian
Z C W
Z C W
4 3 Mean
5 1 Study)
. 1 MedDra
5 1 Study,
5 1 MEdication.
5 1 Study.
. 1 mg
5 1 Study:
4 1 Minimum
5 1 Study;
. 1 MMDDDDYYYY
1 1 Subejct.
. 1 MMDDDYYYY
. 1 Subgrops
4 1 Month
4 1 subjects
4 1 Multivariate
5 1 Subjects,
. 1 N
4 1 Subset
. 1 n
. 1 Summries
1 4 n(%)
. 1 ta.sas
4 1 No
4 3 Table
. 1 non-zero
. 1 tb.sas
4 1 normal
. 1 tcc.sas
. 1 Normal?
4 1 test
5 2 Note:
. 1 TFN
4 1 number
4 3 the
4 1 Occurrence
4 1 through
4 13 of
. 2 TNF
1 1 Ongoign.
4 3 Total
4 1 overall
4 1 Trail
8 2 P-value
4 5 Treatment
8 1 p-Value
4 2 treatment
8 1 p-value
5 1 Treatment,
. 1 p=value
5 1 Treatment.
4 3 Page
4 1 units
4 1 Partial
1 1 Urianalysis.
4 2 per
4 2 using
. 1 Per-Protocol
4 1 values
. 1 Per-protocol
4 1 Variable
. 1 Plabeco
. 1 viait
1 1 Placebi.
5 1 visit.
4 4 Placebo
4 1 were
4 3 Population
4 2 with
1 1 Porbably.
4 1 within
4 1 Post
1 1 X-Ray.
. 1 precense
4 1 Yes
4 1 Randomized
4 1 zero
4 2 Rash
. 1 zubjects
4 1 rash
. 2 [
1 1 Sequale.
. 2 ]
4 1 Seven
4 2 Simultaneous
. 1 SSCP
1 1 St.
4 2 Study
4 2 study
Appendix 2, Page 2, Longer lengths. Input=2006 TT14 Con_check_app1, z=code (strip=1, number=2, spell=4, special=8), c=frequency, w_l=length, w=word
Z C W_L W
1 1  16 (Beats/mininute)
9 1  16 Intent-to-Treat.
1 1  16 Life-threatning.
. 1  16 test(controlling
1 1  19 Treatmemt-emergent.
. 1  18 Treatment-Emergent
. 2 126 ______________________________________________________________________________________________________________________________
. 1 132 ____________________________________________________________________________________________________________________________________
. 8 134 ______________________________________________________________________________________________________________________________________
* Appendix 3, PharmaSUG TT14 (2006), John Morrill & David Austin, Con_check.sas *;
options nocenter mprint mlogic macrogen symbolgen source2 pagesize=51 ls=154
nodate nonumber validvarname=upcase msglevel=I;
* Difficulty with cards statement within a macro. *;
* Could populate with assignment statements or an external data set. *;
* The WHERE= option excludes the numeric test tokens from the final word list. *;
data special(where=(wp NOT IN ('1' '2' '3')));
length wp $800.;
input wp;
cards;
intent-to-treat
p-value
analytes
fixed-effect
lymphopenia
lymphocyte
1
2
3
;
run;
* Testing tab replacement *;
data special;
set special;
wp_1=tranwrd(wp,'09'x,'x'); * Change tab to x (the original used a literal tab in quotes). *;
wp_2=tranwrd(wp,'09'x,'y'); * Change tab to y, using hex notation for the tab. *;
run cancel;
proc print;
run cancel;
%macro loop(ids=,strip=Y,numbers=Y,spell=Y,special=Y);
filename book "C:\Documents and Settings\nm11923\My Documents\PharmaSUG_2006\&ids..txt";
filename spell "C:\Documents and Settings\nm11923\My Documents\PharmaSUG_2006\word.lst";
libname psug "C:\Documents and Settings\nm11923\My Documents\PharmaSUG_2006\";
data any(keep=w);*(drop=w: i:);
infile book missover length=reclen;* obs=55400;
length line w $800.;
input line $varying800. reclen;
*line=tranwrd(line,'09'x,' '); * Change tab to space (literal-tab version, kept commented out). *;
line=tranwrd(line,'09'x,' '); * Change tab to space. *;
line=tranwrd(line,'--',' ');
cnt=1;
do until(w eq ' ');
w=scan(line,cnt,' ');
if w ne ' ' then output;
cnt+1;
end;
run;
* Chose retain in data step due to length issues in PROC FREQ *;
proc sort data=any;
by w;
run;
data any;
set any;
retain c;
by w;
if first.w then c=1;
else c+1;
w_l=length(w);
if last.w then output;
run;
***************************************************************;
** Strip selected non-alphabetic characters from the ends of words (wt) - BEGIN. **;
%if %substr(%upcase(&strip),1,1)=Y %then %do;
data anyt(keep=w t wt);
set any;*(where=(substr(upcase(w),1,1) IN ('L' '0' '1' '2' '3' '4' '5' '6')));
wr=reverse(w);
do until (substr(left(wr),1,1) NOT IN (':' ',' '.' ')' ';'));
if substr(left(wr),1,1) IN (':' ',' '.' ')' ';') then do;
wr=substr(left(wr),2);
t=1;
end;
end;
wt=left(reverse(wr));
run;
proc print width=min data=anyt(where=(t));
format w wt $30.;
run cancel;
%end;
** Strip selected non-alphabetic characters from the ends of words (wt) - END. **;
** Identify numbers - BEGIN. **; * Note that t is used as input here!!!!! *;
%if %substr(%upcase(&numbers),1,1)=Y %then %do;
data anyn(keep=w n wn t);
*set any;
*(where=(substr(upcase(w),1,1) IN ('L' '0' '1' '2' '3' '4' '5' '6' '7' '8' '9')));
set anyt;
if input(wt,?? best.) then do; *Checks if number: avoid message- thanks Doug!*;
n=2;
wn=input(wt,best.);
end;
run;
proc print width=min data=anyn(where=(n));
format w $30.;
run cancel;
%end;
** Identify numbers - END. **;
** Spell Check - BEGIN. **;
%if %substr(%upcase(&spell),1,1)=Y %then %do;
* Read in word list to create SAS data set. *;
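* RUN CANCEL below compiles the step without executing it; *;
* presumably the permanent psug.spell was built in an earlier run. *;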
data psug.spell;
infile spell missover length=reclen obs=200200;
length spell $800.;
input spell $varying800. reclen;
run cancel;
proc print width=min data=psug.spell(obs=54);
run cancel;
data any_s;
set anyt; * Use anyt here instead of any *;
ws=lowcase(wt);
run;
proc sort data=any_s;*(where=(substr(upcase(w),1,1) IN ('L' 'M' 'T' 'U' '0')));
by ws;
run;
data anys(keep=w s ws);
merge any_s(in=a) psug.spell(in=b rename=(spell=ws));
by ws;
if a AND b;
s=4;
run;
proc sort data=anys;
by w;
run;
proc print width=min data=anys(where=(s));
format w $30.;
run cancel;
%end;
** Spell Check - END. **;
** Special Spell Check - BEGIN. **;
%if %substr(%upcase(&special),1,1)=Y %then %do;
proc sort data=special;
by wp;
run;
data any_p;
set anyt; * Use anyt rather than any. *;
wp=lowcase(wt);
run;
proc sort data=any_p;
by wp;
run;
data anyp(keep=w p wp);
merge any_p(in=a) special(in=b);
by wp;
if a AND b;
p=8;
run;
proc sort data=anyp;
by w;
run;
proc print width=min data=anyp(where=(p));
format w: $30.;
run cancel;
%end;
** Special Spell Check - END. **;
data any_final;
merge any(in=a) anyt(drop=t) anyn anys anyp;
by w;
if a;
z=sum(of t n s p);
w_sort=upcase(w);
run;
proc sort data=any_final;
by w_sort;
run;
proc print width=min;
format w wt $30.;
run cancel;
title "Appendix 2.1, Shorter lengths. Input=2006 TT14 &ids, z=code (strip=1,
number=2, spell=4, special=8), c=frequency, w=word";
proc report data=any_final(where=(length(w)<=15)) panels=6 split='*****';
columns z c w;
define z / width=2 spacing=1;
define c / width=2 spacing=1;
define w / width=15 flow spacing=1;
run;
title "Appendix 2.2, Longer lengths. Input=2006 TT14 &ids, z=code (strip=1, number=2,
spell=4, special=8), c=frequency, w_l=length, w=word";
proc print data=any_final(where=(length(w)>15)) noobs;
var z c w_l w;
run cancel;
proc report data=any_final(where=(length(w)>15)) split='*****';
columns z c w_l w;
define z / width=2 spacing=1;
define c / width=2 spacing=1;
define w_l / width=5 spacing=1;
define w / width=135 flow spacing=1;
run;
%mend loop;
%loop(ids=Con_check_app1);
* Could parameterize 15 (width of column) - changes panels *;