NESUG 18 Ins & Outs

From five hundred spreadsheets to one SAS® data set in 3 easy steps: Driving IMPORT with data-driven macro variables

Christianna S. Williams, University of North Carolina at Chapel Hill

ABSTRACT

In a clinical trial, we had sleep data on 90 study participants over 30 one-week study periods. These data were originally stored in Excel® spreadsheets, with a separate workbook for each study period and a separate worksheet within each workbook for each participant. Complicating matters further, not all subjects participated in all study periods, so the number and the "identities" of the worksheets varied among study periods. Thoughtful and systematic naming of the workbook files and the worksheets they contained, along with construction of a "control" SAS® data set indicating which people took part in which study periods, facilitated the development of a few fairly simple SAS macros to automate the IMPORTation of more than 500 spreadsheets and their concatenation into a single file for statistical analysis. This paper presents these macros and demonstrates how the "control" data set was used to create macro variables that were passed as parameters to the macro that read the workbook files. The methods should be broadly applicable to the many situations in research or business where data need to be combined from many files (be they external files or SAS data sets), streamlining the process and making it reproducible and data-driven.

BACKGROUND

Many people with Alzheimer's Disease and other dementias suffer from depression and have disturbed sleep patterns, with frequent nighttime waking and excessive daytime drowsiness. There is evidence to suggest that increasing daytime exposure to light can improve mood and sleep patterns in persons with dementia. However, in many nursing homes light levels are kept quite dim, making it difficult for nursing home residents to gain exposure to beneficial daytime light.
One technique for increasing exposure to therapeutic light is for individuals to sit in front of a light box. Nursing home residents with dementia, however, may not comply with sitting in front of a light box long enough to get a therapeutic "dose" of high-intensity light. Thus, we conducted a clinical trial in which computer-controlled, high-intensity, low-glare lights were installed in the public areas of two long-term care facilities, one in North Carolina and one in Oregon. The study had a crossover design in which four treatment conditions (high light in the morning, high light in the afternoon, high light all day, and standard lighting) alternated at three-week intervals, for 22 intervals (or periods) at the North Carolina site and 8 periods at the Oregon facility. During the third week of each study period, study residents wore wrist actigraphs, which are watch-like devices containing an accelerometer that records arm movements in one-minute intervals. Analysis of these counts with specialized software provides estimates of nighttime sleep for each of the seven nights that the actigraph was worn. This scoring is done manually for each resident for each study period, and the scored results are output to an Excel spreadsheet; thus, a separate spreadsheet is constructed for each resident for each study period. Over the entire course of the study we had 90 resident participants, but because of new admissions, discharges, and other reasons, new residents were enrolled and other residents left the study throughout its duration. As a result there was a different set of residents for each study period – and thus the sleep data in the Excel files were for a different set of residents for each period. In order to conduct statistical analyses of these data (e.g.
to test whether nighttime sleep improves with increased daytime light exposure), I had to read the data from all of these spreadsheets and assemble them into a single SAS data set, which could then be merged with the treatment codes and other resident- or period-specific information. In all there were about 500 spreadsheets that needed to be IMPORTed into SAS, while keeping track of which spreadsheet was for which resident for which study period! The actual SAS code to read each spreadsheet is, of course, quite straightforward. What I wanted to avoid was having a SAS program that contained 500 PROC IMPORT steps! And – even more importantly – I wanted to ensure that I had sleep data for each study participant for each of the study periods in which he/she was enrolled in the study. This issue was particularly critical for quality control and data validation because of the complicated study design and the fact that each actigraph file had to be scored and assigned its identifiers individually. This paper details how the required analysis data set was constructed, in the following three steps:

(1) Write a simple program that will IMPORT a single, specified spreadsheet.
(2) Incorporate that program into a macro, with parameters that identify the worksheet(s) to read.
(3) Use a "control" data set to provide values for the parameters needed to generate the calls to this macro, so that all the necessary worksheets can be read.

THE STARTING POINT

The spreadsheets with the scored sleep data were organized so that there was a separate workbook (*.xls file) for each study period, and each of these workbooks contained a worksheet for each resident. Each worksheet had only seven rows of data – one for each night of sleep data for that resident for that study period. The directory structure for the project dictated that the workbook files were stored in a different directory for each study period and site.
The study periods were assigned sequential letters of the alphabet (A through V in North Carolina [site 1] and A through H in Oregon [site 2]), and the directories and workbook files were named accordingly; hence, the sub-directory for period A at the NC facility was named "PERIOD 1A" and the workbook file was named SLEEP_1A, while the file for period C in Oregon was SLEEP_2C and was located in the sub-directory "period 2C". Individual worksheets were named with the study ID (called RESID in this study) of the resident (e.g. 11501, 11507, etc.). An example of one of the files is shown in Figure 1.

Figure 1. Portion of the Excel workbook (SLEEP_1A.xls) containing scored sleep data for period A at site 1. The worksheet shown is for RESID 11512. All seven rows of the worksheet are shown, but only a portion of the columns. Tabs at the bottom indicate the other worksheets for the other participants in this study period. Column labels are identical for all worksheets.

CODE TO READ ONE WORKSHEET

The following is the SAS code to read a single worksheet (Code 1).

PROC IMPORT OUT = WORK.A1_11512
            DATAFILE = "S:\lighting\period 1A\sleep_1A.xls"
            DBMS = EXCEL2000 REPLACE;
   RANGE = "11512$";
   GETNAMES = YES;
RUN;

DATA A1_11512_r ;
   SET A1_11512 (WHERE = (resid NE . AND date NE .));
   ATTRIB site   LENGTH=3  LABEL='Study Site'
          period LENGTH=$1 LABEL='Study Period'
          night  LENGTH=3  LABEL='Counter for Night' ;
   night = _N_ ;
   site = 1 ;
   period = 'A' ;
RUN;

Code 1. SAS code to read a single worksheet in a single spreadsheet, corresponding to the sleep data for RESID 11512, site 1, study period A.

This example uses the sleep data for resident ID 11512 in period A at site 1. The PROC IMPORT reads the specified range of the specified workbook file and writes a temporary SAS data set named A1_11512, which derives its variable names from the first row of the worksheet (because GETNAMES = YES). The resulting data set has seven observations.
The subsequent DATA step does a little clean-up (in case of blank rows, which sometimes occur after the data rows) and adds the site and period identifiers to the file, as well as a counter for the night of data collection. Now, I could copy and paste this chunk of code 500 times, modifying the identifying parameters (resident ID, study site, and study period) each time, but in addition to making a very unwieldy SAS program, I'd be sure to make a lot of errors.

MACRO TO READ ONE WORKSHEET

So, the first improvement I made was to incorporate this code into a macro to which I pass three parameters – RESID, SITE and PERIOD – which jointly identify the spreadsheet to be read. The result of one call to the %IMPRT macro, shown below (Code 2), is a data set identical to the one resulting from Code 1.

%MACRO imprt(resid,site=,period=);
   * Read the Excel file.  Note the double period in the file name:
     the first period terminates the macro variable &period ;
   PROC IMPORT OUT = WORK.&period&site._&resid
               DATAFILE = "C:\CSW\NESUG\NESUG05\MyPapers\IMPORT\sleep_&site&period..xls"
               DBMS = EXCEL2000 REPLACE;
      RANGE = "'&resid.$'";
      GETNAMES = YES;
   RUN;

   * Delete empty rows, add identifiers ;
   DATA &period&site._&resid._r ;
      SET &period&site._&resid (WHERE = (resid NE . AND date NE .));
      ATTRIB site   LENGTH=3  LABEL='Study Site'
             period LENGTH=$1 LABEL='Study Period'
             night  LENGTH=3  LABEL='Counter for Night' ;
      site = %eval(&site) ;
      period = "&period" ;
      night = _N_ ;
   RUN;

   * Append to file of sleep data for this period and site ;
   PROC APPEND BASE = sleep_&site&period DATA = &period&site._&resid._r ;
   RUN;
%MEND imprt ;

%imprt(11512,site=1,period=A) ;
%imprt(11515,site=1,period=A) ;

Code 2. SAS macro code that reads a single worksheet in a single spreadsheet for each call to the macro and appends the resulting SAS data set to the data set of all sleep data. This code shows two calls to the macro: one for RESID 11512, site 1, period A, and one for RESID 11515, site 1, period A.
Within each call to the macro, we APPEND the most recently read sleep data to a data set containing all previously read sleep data. So, this program is a little better… now I just have to issue 500 calls to the macro, and I would end up with a data set containing all the scored sleep data. While the program would be substantially more readable and easier to update than the Code 1 approach, it would still be quite prone to error – I have to know exactly which macro calls to make.

THE KEY TO SIMPLIFICATION: A "CONTROL" DATA SET

At this point, I knew there had to be a better way… a way to automate the specification of the parameters for the macro calls. And then I had an epiphany: the set of parameters (i.e. all combinations of RESID, SITE and PERIOD) for which there are sleep data – and corresponding worksheets – is itself data! So, I should get that information into a data set and use that data set to direct, or control, the macro calls. Luckily, our project manager, who was of course keeping track of which residents participated in which study periods (and, among those, which had provided valid sleep data), could provide me with this information. This participation data was incorporated into a SAS data set, the first 35 observations of which are shown in Figure 2. Because not all enrolled participants were willing to wear the actigraph, or due to occasional actigraph malfunction, some participants do not have valid sleep data. These are indicated by SLEEPDATA = 0.

SLEEP_CTRL

period   resid   site   sleepdata
  A      11501     1        1
  A      11507     1        1
  A      11508     1        1
  A      11509     1        1
  A      11510     1        1
  A      11512     1        1
  A      11515     1        1
  A      11524     1        1
  A      11525     1        1
  A      11526     1        0
  A      11527     1        1
  A      11528     1        1
  A      11529     1        1
  A      11530     1        1
  A      11531     1        1
  A      11532     1        1
  B      11501     1        1
  B      11508     1        1
  B      11509     1        1
  B      11510     1        1
  B      11512     1        1
  B      11515     1        1
  B      11525     1        1
  B      11527     1        1
  B      11528     1        1
  B      11529     1        1
  B      11530     1        1
  B      11531     1        1
  B      11534     1        0
  B      11535     1        1
  B      11536     1        1
  B      11537     1        1
  C      11501     1        1
  C      11508     1        1
  C      11509     1        1

Figure 2.
PRINT of the first 35 observations of the "control" data set, which indicates which RESIDs contributed sleep data in which study periods.

So, how do we turn DATA step variable values into macro variable values so that they can be passed as parameters to the %IMPRT macro? The SYMPUT call routine is perfectly suited to this task! I decided that I needed to count the number of residents in each study period for each site and use a macro %DO loop to call the macro this many times. I first created a macro, titled %DOPER (for "do period"), that, when fully developed, would direct the calls of the %IMPRT macro. The first step, a DATA _NULL_ step (shown in Code 3), contains two invocations of the SYMPUT routine. The first invocation, denoted by (1), creates a macro variable for each iteration of the DATA step (i.e. for each observation read by the SET statement). The name of the macro variable will be the concatenation of period ('A','B','C'…), site ('1' or '2' – I converted the numeric variable SITE to the character variable SITEC to avoid problems with leading blanks), and the automatic SAS DATA step variable _N_. The value assigned to this macro variable will be the _N_th resident ID that meets the criteria specified by the WHERE clause on the SET statement (i.e. corresponding to the desired site and period and having sleep data). When the DATA step reaches the last qualifying observation in the control data set, another macro variable is created with CALL SYMPUT, denoted by (2) in Code 3. The name of this macro variable will be the concatenation of "NUM_" with site and period.

%MACRO doper(per,loc) ;
   DATA _NULL_ ;
      SET in.sleep_ctrl (WHERE = (sleepdata=1 AND site=&loc AND period="&per"))
          END = lastobs ;
      * create macro variable for each Resident ID ;
      sitec = PUT(site,1.) ;
(1)   CALL SYMPUT(TRIM(period)||sitec||"_"||LEFT(TRIM(_N_)),resid) ;
      * get number of observations in each block for each site ;
(2)   IF lastobs THEN
         CALL SYMPUT("NUM_"||sitec||"_"||TRIM(period),LEFT(TRIM(_N_))) ;
   RUN;
(3) %PUT _USER_ ;
%MEND doper ;

%doper(A,1) ;

Code 3. The beginnings of the macro %DOPER, which will eventually be used to specify calls to the %IMPRT macro. This piece consists of a DATA _NULL_ step that reads the "control" data set for a given site and study period and stores the RESIDs with sleep data for that period as macro variables. The code also creates a macro variable holding the total number of study participants with sleep data for that site and period.

If I include the statement "%PUT _USER_;" in the code above, denoted by (3), and call the %DOPER macro, a listing of the macro variables and values created by calling the macro for period A, site 1 is written to the SAS log. This list is shown in Figure 3.

DOPER PER A
DOPER LOC 1
DOPER A1_1 11501
DOPER A1_2 11507
DOPER A1_3 11508
DOPER A1_4 11509
DOPER A1_5 11510
DOPER A1_6 11512
DOPER A1_7 11515
DOPER A1_8 11524
DOPER A1_9 11525
DOPER A1_10 11527
DOPER A1_11 11528
DOPER A1_12 11529
DOPER A1_13 11530
DOPER A1_14 11531
DOPER A1_15 11532
DOPER NUM_1_A 15

Figure 3. Portion of the SAS log, showing the macro variables and their values created by the call to %DOPER shown in Code 3.

All of the macro variables are preceded by "DOPER" in the log because they are all local to the %DOPER macro. The first two macro variables shown are the parameters that were passed to the macro when it was invoked (&PER and &LOC). The next 15 macro variables are those created by the first invocation of CALL SYMPUT in Code 3, and their values correspond to the 1st through 15th RESID that had sleep data in period A for site 1. Inspection of the listing of the SLEEP_CTRL data set (Figure 2) shows that this is the case.
Finally, the last macro variable shown in Figure 3 (NUM_1_A) has the value 15, which is again the total number of RESIDs who should have Excel worksheets to read for this study period. This information will be put to use in the next enhancement of the %DOPER macro. So, now we have all the information we need to direct the IMPORTing of the sleep data for a specified period and site. Adding just a few lines of code to the %DOPER macro will use this information to generate the appropriate calls to the %IMPRT macro. The complete macro is shown in Code 4.

%MACRO doper(per,loc) ;
   DATA _NULL_ ;
      SET in.sleep_ctrl (WHERE = (sleepdata=1 AND site=&loc AND period="&per"))
          END = lastobs ;
      * create macro variable for each Resident ID ;
      sitec = PUT(site,1.) ;
      CALL SYMPUT(TRIM(period)||sitec||"_"||LEFT(TRIM(_N_)),resid) ;
      * get number of observations in each block for each site ;
      IF lastobs THEN
         CALL SYMPUT("NUM_"||sitec||"_"||TRIM(period),LEFT(TRIM(_N_))) ;
   RUN;

   * this code will import the data for all residents in this period by
     calling the %imprt macro once for each RESID in the period ;
(1) %DO x = 1 %TO &&num_&loc._&per ;
      %imprt(&&&per&loc._&x,site=&loc,period=&per) ;
    %END;

(2) PROC APPEND BASE = AllSleep DATA = sleep_&loc&per ;
    RUN;
%MEND doper ;

(3) %doper(A,1) ;
    %doper(B,1) ;
    %doper(C,1) ;
    %doper(D,1) ;
    %doper(E,1) ;
    . . .
    %doper(V,1) ;
    %doper(A,2) ;
    %doper(B,2) ;
    . . .
    %doper(H,2) ;

Code 4. The updated %DOPER macro, which uses the macro variables created by the CALL SYMPUT routine to specify the calls to the %IMPRT macro. Once the data for all residents in one period of the study have been read, the resulting data set is APPENDed to the data set holding all the sleep data for the study. The %DOPER macro is called once for each study period and site.

The most significant change to the macro is the addition of the %DO loop, denoted by (1) in Code 4. These three statements contain a rather intimidating number of ampersands – so let's take them apart.
The %DO statement creates a macro variable &X, which is incremented by 1 each time the program passes through the loop. The ugly macro expression "&&num_&loc._&per" tells the loop when to stop… so what does it resolve to? On the first pass through &&num_&loc._&per (for the first call to the %DOPER macro, where we have specified site 1 and period A), && resolves to &, &loc resolves to 1, and &per resolves to A; so we have &num_1_A for the next pass. The macro variable &num_1_A was created and assigned a value by the second invocation of CALL SYMPUT in Code 3. It is the number of RESIDs who have sleep data for period A at site 1, and, as shown in Figure 3, it has a value of 15. So, when the macro processor gets through with it, the %DO statement setting up the loop reads %DO X = 1 %TO 15. Of course, the next statement, which calls the %IMPRT macro, specifies what will happen those 15 times. Again, let's take apart this macro call to figure out what parameters we're passing to the %IMPRT macro. The last two parameters are easy: &loc resolves to 1, and &per resolves to A. We know we want the first parameter to resolve to a RESID that has data for this period, but how is that feat accomplished? Starting with &&&per&loc._&x – on the first pass, && resolves to &, &per resolves to A, &loc resolves to 1, and &x (the index for the %DO loop) resolves to 1 on the first iteration of the %DO loop, 2 on the second, and so on. So, after the macro facility does the first round of resolution of &&&per&loc._&x, it has been translated to &A1_1. Again, &A1_1 is one of the macro variables created by CALL SYMPUT (see Code 3, (1)): it has the value of the first RESID in period A for site 1, and thus corresponds to 11501 (see Figures 2 and 3). Hence, on the first trip through the %DO loop, the following call is issued to the %IMPRT macro: %imprt(11501,site=1,period=A). And the appropriate worksheet is read from the appropriate workbook file.
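The two rounds of resolution described above can be checked in isolation with %LET and %PUT. The following is a minimal sketch; the macro variables are assigned by hand here, standing in for the ones that CALL SYMPUT creates in Code 3:

```sas
%LET loc = 1 ;
%LET per = A ;
%LET x   = 1 ;

* stand-ins for the macro variables created by CALL SYMPUT ;
%LET num_1_A = 15 ;
%LET A1_1    = 11501 ;

* pass 1: && -> &, &loc -> 1, &per -> A, yielding &num_1_A ;
* pass 2: &num_1_A -> 15 ;
%PUT &&num_&loc._&per ;   * writes 15 to the log ;

* pass 1: && -> &, &per -> A, &loc -> 1, &x -> 1, yielding &A1_1 ;
* pass 2: &A1_1 -> 11501 ;
%PUT &&&per&loc._&x ;     * writes 11501 to the log ;
```

Fragments like this are a handy way to convince yourself (and the log) that a multi-ampersand expression resolves the way you intend before burying it inside a loop.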
On the second pass through the loop, the first macro parameter resolves to &A1_2, which corresponds to the second RESID in period A for site 1 – which we can see from the output in Figure 3 is 11507. So this is the worksheet that gets IMPORTed and APPENDed to the data set for this site and period. The looping continues until &X reaches 15, at which point the worksheet for the 15th RESID gets IMPORTed and APPENDed. Then, when &X is incremented to 16, the %DO loop does not iterate again. The PROC APPEND step denoted by (2) in Code 4 simply takes the data set constructed by the concatenation of all the sleep data for period A at site 1 and APPENDs it to a data set, ALLSLEEP, that will eventually include all the sleep data for all study periods at both sites. The code denoted by (3) in Code 4 simply shows that the %DOPER macro would be called once for each period and site. After the final call has been executed, the ALLSLEEP data set will contain the data from all 502 spreadsheets. Mission accomplished!

VALIDATION

OK, but what if the information in the control data set (SLEEP_CTRL) doesn't match up with the spreadsheets? It's always a good idea to think about all the things that could go wrong. For example, what if there is a record in the SLEEP_CTRL data set indicating that there should be a worksheet for a given site, period and RESID combination, but the workbook file doesn't contain a worksheet for that RESID? Say the following call to the %IMPRT macro is generated:

%imprt(11504,site=1,period=A) ;

This would happen if there were an observation with RESID=11504, SITE=1, PERIOD='A' and SLEEPDATA=1 in the SLEEP_CTRL data set. We see in Figure 1 that there is no worksheet for 11504 in the file SLEEP_1A.xls. Such a macro call does generate several error messages in the log, as shown in Figure 4.
As it turns out, this error does not cause the program to crash – zero observations get APPENDed to the SLEEP_1A data set, and processing continues unscathed with the next legitimate call to the %IMPRT macro. There is one important exception, however: when the faulty macro call is the first one within a site-period combination. This poses a problem because the SAS data set for that site-period doesn't yet exist, so the first APPEND (where the BASE data set doesn't yet exist) creates a data set (e.g. SLEEP_1A) with 0 observations and 3 variables (SITE, PERIOD and NIGHT). This then causes the program to bomb when the next %IMPRT call is executed, because APPEND¹ only works when the data set being APPENDed has the same variables as the existing BASE data set; in this scenario, the faulty BASE data set does not contain all the sleep variables that are in the data set SAS is attempting to APPEND. Now, there are workarounds (e.g. setting up an empty data set with all the right variables before processing begins for each site-period). However, I am a staunch believer in READING the SAS log, paying attention to all WARNINGs and ERRORs, and generally doing what one can to avoid them. In this case, a mismatch between the SLEEP_CTRL data set and the spreadsheets would indicate either that the SLEEP_CTRL data set had an observation that shouldn't be there or that the Excel file for that site-period was missing a sheet. From a project management point of view, it should be determined where the discrepancy lies – and the discrepancy fixed – rather than trying to write a SAS program that will run no matter what. The same would be true if an entire Excel file were missing or misnamed.

ERROR: Describe error: The Microsoft Jet database engine could not find the object ''11504$''. Make sure the object exists and that you spell its name and the path name correctly.
ERROR: Import unsuccessful. See SAS Log for details.
NOTE: The SAS System stopped processing this step because of errors.
ERROR: File WORK.A1_11504.DATA does not exist.
NOTE: The SAS System stopped processing this step because of errors.
WARNING: The data set WORK.A1_11504_R may be incomplete. When this step was stopped there were 0 observations and 3 variables.
NOTE: Appending WORK.A1_11504_R to WORK.SLEEP_1A.
WARNING: Variable RESID was not found on DATA file.
WARNING: Variable Date was not found on DATA file.
WARNING: Variable Bed_time was not found on DATA file.
WARNING: Variable Get_up_time was not found on DATA file.
WARNING: Variable Time_in_bed was not found on DATA file.
<SNIP>
NOTE: There were 0 observations read from the data set WORK.A1_11504_R.
NOTE: 0 observations added.
NOTE: The data set WORK.SLEEP_1A has 14 observations and 35 variables.

Figure 4. Excerpts from the SAS log showing the errors that occur when a macro call is generated for a non-existent worksheet within an existing workbook file. As long as this is not the first call to the %IMPRT macro within a site-period combination, the program will proceed without ill effects.

CONCLUSIONS

I had several purposes in mind with this paper – and with the application upon which it is based. The most obvious is to demonstrate and explain the method I used to greatly simplify and automate the construction of a single SAS data set from a very large number of Excel spreadsheets. While this made my program a lot prettier and less error-prone – and I've used an identical method to read and concatenate external files containing other types of data for other projects – I also believe that several aspects of the method are broadly applicable to other fields. The key innovation is the use of a "control" data set that contains the information needed to direct the required processing.
This processing could be the reading of a large number of external files (as in my application), or it could be a particular type of analysis that needs to be run against an existing data set for many sets of parameters, where those sets of parameters can be enumerated in the "control" data set. One of the appealing aspects of this strategy, of course, is that – provided certain structural elements of the "control" data set remain consistent – the actual content can change over time, yet the program doing the processing won't need to change. In this way, we see that the line between "code" and "data" can be crossed. Another point I wanted to make in this paper was to show a bit of how a program evolves: starting with the very simple program that is the guts of the processing task (the PROC IMPORT code, in this particular application); then, given that this task had to be repeated a mind-numbing number of times with changes to only a few parameters, folding that simple processing task into a macro; and finally, upon recognizing that this would still require way too many macro calls, determining that I could use data to generate those macro calls. This is the approach I often take with a complex programming task – I visualize it as peeling an onion in reverse. One starts with the core of what one needs to accomplish and builds outward from that, testing that it works properly (still tastes like an onion?) at each step. In fact, I could probably have added another layer to this particular onion – by using the data to specify not only the RESIDs for each site and period but the names of the sites and periods themselves. But that seemed like overkill, since for this project that aspect was not going to change.
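Had that extra layer been needed, one way to build it (this sketch is not part of the original program) would be a DATA _NULL_ step that reads the distinct site-period combinations from the control data set and uses CALL EXECUTE to generate the %DOPER calls themselves; the data set name SITE_PERIODS is invented here for illustration:

```sas
* Hypothetical extension: drive the %DOPER calls from the control data set. ;
* Assumes IN.SLEEP_CTRL and the %DOPER macro exist as described above.     ;
PROC SORT DATA = in.sleep_ctrl (WHERE = (sleepdata = 1))
          OUT = site_periods (KEEP = site period) NODUPKEY ;
   BY site period ;
RUN;

DATA _NULL_ ;
   SET site_periods ;
   * builds and queues one call per site-period, e.g. %doper(A,1) ;
   CALL EXECUTE('%doper(' || TRIM(period) || ',' || PUT(site,1.) || ') ;') ;
RUN;
```

The generated macro calls execute after the DATA _NULL_ step finishes, so the effect is the same as typing the list of %DOPER calls by hand – but the list can never get out of sync with the control data set.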
Finally, I'd like to convey the message that automating a task in this way – building a good-looking and functional (dare I say elegant) program – is not only good programming (readable, reproducible, maintainable) practice. It also allows one to keep building one's skills (by not always taking the "brute force" approach), and it's such a blast when it works!

¹ I am ignoring the FORCE option that is available in PROC APPEND (which would allow concatenation of data sets with different variables or variable attributes) because, in this application, having different variables on the data sets to be APPENDed indicates that there is a problem with the input data, and I want it to cause an ERROR.

REFERENCES

For more on SYMPUT, see any of the following:
1. Carpenter, Art. 1998. Carpenter's Complete Guide to the SAS® Macro Language. Cary, NC: SAS Institute Inc. 242 pp.
2. Burlew, Michele M. 1998. SAS® Macro Programming Made Easy. Cary, NC: SAS Institute Inc. 280 pp.
3. SAS Institute, Inc. 2002. SAS Macro Language: Reference. http://v9doc.sas.com/sasdoc/

ACKNOWLEDGMENTS

I am supremely grateful to my colleague Lauren Cohen for her careful reading of, and constructive comments on, an earlier version of this paper.

SAS is a registered trademark of SAS Institute, Inc. of Cary, North Carolina. Excel is a registered trademark of the Microsoft Corporation. ® indicates US registration. Other brand and product names are registered trademarks or trademarks of their respective companies.

CONTACT INFORMATION

Feel free to contact the author with questions or comments:

Christianna S. Williams, PhD
Cecil G. Sheps Center for Health Services Research
University of North Carolina at Chapel Hill
725 Martin Luther King Blvd., Campus Box # 7590
Chapel Hill, North Carolina 27599
Email: [email protected]