NESUG 2012 Coders' Corner

A Macro to Squeeze SAS Datasets and Create a SAS Transport File
Hany Aboutaleb, Biogen Idec, Cambridge, MA

Abstract

At Biogen Idec we have developed a SAS macro that creates transport files and resizes every text column to fit the longest value it contains, for all datasets in a library. The macro puts those huge text columns on a diet by squeezing the unused space out of them without losing any of the character variable content. Using this macro may significantly reduce the size of SAS data. In one case, SAS data that required 113.4 GB was reduced to 56.22 GB, a reduction in disk space of about 50%.

Introduction

If your data is stored in an external database, you know how costly it can be to pull a large data set into SAS. SAS datasets must now be submitted to the FDA as transport files, and one of the new requirements is a file size of less than 1 GB for all datasets within a study submission.

As programmers, we constantly find trailing spaces in character variables when we create and manipulate them. For example, suppose you are given 100 lines of data consisting of names (first, middle, last). In SAS you store the data in a character variable with a length attribute (for example, length $200), which creates a variable with a length of 200. SAS defines length as the amount of storage allocated in the dataset to hold the character variable's values; it can be 1 through 32,767. Every character in a character variable requires one byte (in a single-byte encoding). By this definition, storing the variable takes 100 (lines of data) times 200 (variable length), or 20,000 bytes. If the longest value is actually 70 characters, a length of 70 needs only 7,000 bytes, so a length of 200 wastes 13,000 bytes.

The %makexpt macro removes this excess space from the SAS datasets in a SAS library and then creates the transport files.
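The storage arithmetic above can be checked on any dataset by comparing the declared length with the longest stored value. The following is a minimal sketch, not part of %makexpt; the dataset NAMES, the variable FULLNAME, and the three sample values are hypothetical.

```sas
/* Hypothetical data: three names stored in an oversized $200 variable */
data names;
   length fullname $200;
   fullname = 'John Quincy Adams';      output;
   fullname = 'Martin Van Buren';       output;
   fullname = 'William Henry Harrison'; output;
run;

/* LENGTH() ignores trailing blanks, so MAX(LENGTH()) is the smallest */
/* length that still holds every value without truncation             */
proc sql noprint;
   select max(length(fullname)), count(*)
      into :maxlen, :nobs
      from names;
quit;

/* Re-assign to strip the leading blanks left by SELECT INTO */
%let maxlen = &maxlen;
%let nobs   = &nobs;

%put NOTE: declared length=200, longest value=&maxlen, rows=&nobs;
%put NOTE: bytes allocated=%eval(200 * &nobs), bytes needed=%eval(&maxlen * &nobs);
```

The difference between the two byte counts is the space that squeezing would recover for this one variable.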
This macro automatically processes all SAS datasets in a SAS library, or a subset of them as indicated by the user. The macro computes the minimum length of each character variable in such a way as not to lose any characters contained in the variable. A format statement is also created and associated with the variable so that the formatted length corresponds to the computed length.

%Makexpt Macro Process

1. The macro validates the macro parameters; if any required parameter is empty or missing, it issues an error message and terminates.
2. If no parameter is passed, the macro uses the default library (CRT) as input and creates XPT files for all SAS datasets in that library.
3. If &SQZLIBIN is set to Y, the macro creates a backup sub-folder and copies the input data (&LIBIN) into it. The macro then squeezes the SAS datasets in &LIBIN as well and creates the transport files (default N).
4. If &SQZXPTDIR is set to Y, the macro squeezes the data written to the &XPTDIR folder (default Y).
5. The macro processes the &EXCLUDE and &INCLUDE lists and ensures that they are mutually exclusive. If they are not, it issues an error message and terminates the program.
6. The macro creates a list of all datasets in &LIBIN and applies the list of dataset names to be excluded or included to that set of datasets.
7. The macro checks that each dataset name is no longer than 8 characters, as required for V5 transport files. If a name is longer than 8 characters, it issues an error message and terminates the program.
8. The macro finds the minimum number of bytes required to store each character variable without dropping any characters off the right end of a string (the contents of a character variable are left-justified).
9. The macro creates the transport files in a new transport folder for all datasets in &LIBIN.
10. If &SQZRPT is set to Y, the macro produces a report showing the location of the data requested for resizing and the resulting directory location of the transport files. The macro also computes performance statistics for each dataset and summary performance statistics for all squeezed datasets in a summary report, and adds the COMPARE procedure results for the data before and after squeezing to a report file.

Macro Input Parameters

Parameter     Description
DM            Name of the data set to be processed
LIBIN=CRT     Name of the SAS libname that holds the data sets to be processed
XPTDIR=XPT    Name of the output transport folder
EXCLUDE       Names of SAS datasets in the library to exclude from the squeeze process [optional]
INCLUDE       Names of SAS datasets in the library to include in the squeeze process [optional]
SQZXPTDIR=Y   Flag for squeezing the XPTDIR data (default=Y)
SQZLIBIN=N    Flag for squeezing the LIBIN data (default=N)
SQZRPT=Y      Flag for listing the squeeze report (default=Y)
Debug=NO      Debug the macro (YES/NO) (default=NO)
Version=1     Version control of the macro, for future use in case of a new release (default=1)

Sample call

%makexpt;
%makexpt(ae)
%makexpt(cm)
libname crtdir '/biostats/xyz/data';
%makexpt(libin=crtdir)
%let zyl='/bioststs/study/xptdir';
%makexpt(libin=crtdir,xptdir=&zyl,include=dm cm vs)
%makexpt(libin=crtdir,xptdir=&zyl,exclude=ae cm vs)

Sample Run Report Output

[Figure: report titled "Results of Squeezing Library /biostats/xxx/xxx/xxx/"]

Compare procedure report result

[Figure: PROC COMPARE output ("The EXACT method tests for exact equality"), annotated with the variable length before the squeeze and the variable length after the squeeze]

CONCLUSIONS

Spaces can be an annoyance to the SAS programmer, and keeping them to a minimum is an absolute necessity for FDA data submission. By using the %makexpt macro you can reduce your data size substantially. The macro provides a simple solution for squeezing the data and creating the transport files, making more space available.
Finally, this macro satisfies the FDA policy for submission datasets, which is now part of the published Study Data Specifications for data standards.

ACKNOWLEDGMENTS

SAS® and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are registered trademarks or trademarks of their respective companies.

The author would like to thank Mr. Vincent Da Forno for his review and valuable comments, and manager Matthew Wien for his valuable comments on this paper.

References

[1] SAS® 9.2 Macro Language: Reference. Cary, NC: SAS Institute Inc.
[2] SAS® 9.2 Language Reference: Concepts. Cary, NC: SAS Institute Inc.
[3] SAS® 9.2 Language Reference: Dictionary. Cary, NC: SAS Institute Inc.
[4] "Sample 24804: %SQUEEZE-ing Before Compressing Data, Redux," http://support.sas.com/kb/24/804.html
[5] "FDA published Study Data Specifications," http://www.fda.gov/downloads/ForIndustry/DataStandards/StudyDataStandards/UCM312964.pdf

Contact Information

Your comments and questions are valued and encouraged. Contact the author at:

Hany Aboutaleb
Biogen Idec
14 Cambridge Center
Cambridge MA 02142
Work Phone: (617) 914-7125
Fax: (617) 679-3280
Email: [email protected]
LinkedIn: Hany Aboutaleb

Sample Code Used in the makexpt Macro

%*----------------------------------------------------------------------
Program ID:   MAKEXPT.SAS
Source pgm:   None
Description:  Macro to create .XPT dataset from input SAS dataset under
              /data/xpt folder. The macro will re-assign variable lengths
              for all character variables based on the maximum variable
              lengths and no change for all numeric variables.
Date:         Jul 06, 2006
Developer:    Hany Aboutaleb (Cambridge, 617-914-(4)7125)
Programmer:   Hany Aboutaleb (Cambridge, 617-914-(4)7125)
Input parameters:
   DM:          Name of data set to be processed
   LIBIN:       Name of the libname that has the data set to be processed
                (default=crtdir), your study crt folder
   XPTDIR:      Name of the xpt folder from setup macro or user defined
                location (default=&chrdatadir), your study xpt folder
   EXCLUDE:     [optional] Names of SAS datasets in library to exclude
                from squeeze process
   INCLUDE:     [optional] Names of SAS datasets in library to include
                in squeeze process
   SQZRPT=Y     Flag for listing of squeeze report (default=Y)
   SQZLIBIN=N   Flag for squeezing libin data (default=N)
   SQZXPTDIR=Y  Flag for squeezing xptdir data (default=Y)
   Debug:       To debug the macro (YES/NO) (default: NO)
   Version:     Version control of the macro for future use in case of
                a new release of the macro (default=1)
Sample call:  %makexpt(ae)
              %makexpt(cm)
              %makexpt(libin=crtdir)
              %makexpt(libin=xyz,xptdir=zyl,include=dm cm vs)
              %makexpt(libin=xyz,xptdir=zyl,exclude=ae cm vs)
Note:         Please make sure that the data set name is not too long
              (a member name should be at most 8 characters), that the
              setup macro is included, and that your data are in the
              crtdir libname.
              Please contact any MRT member with suggestions, comments,
              or brief specified requirements (limitations, inputs) for
              this macro.
Modification log:
----------------
Date          Reason          Person
------------------------------------------------------------------------;

%macro makexpt(dm,
               libin=crtdir,
               xptdir=&chrdatadir,
               EXCLUDE=,
               INCLUDE=,
               SQZRPT=Y,
               SQZXPTDIR=Y,
               SQZLIBIN=N,
               debug=N,
               version=1);

   %put /========================================================================\;
   %put |                                                                        |;
   %put | Biogenidec Standard Macro Library:                                     |;
   %put | MAKEXPT: Macro to create .XPT dataset from input SAS dataset under     |;
   %put |          /data/xpt folder. The macro will minimize the length of each  |;
   %put |          character variable based on its maximum value length,         |;
   %put |          without dropping any characters.                              |;
   %put |                                                                        |;
   %put \========================================================================/;
   %put %str(Libname to be processed = &libin);
   %put %str(XPT Libname to be processed = &xptdir);
   %put %str(Names of SAS datasets in library to exclude from squeeze process[optional] = &EXCLUDE);
   %put %str(Names of SAS datasets in library to include in squeeze process[optional] = &INCLUDE);
   %put %str(debug the macro (YES/NO) = &debug);
   %put %str(Version control to the macro = &version);

   %if %upcase(%substr(&debug,1,1)) = Y %then %do;
      options symbolgen mlogic mprint;
   %end;

   %local er ror war ning I J gt8;
   %let parmerr=0;
   %let er   = ER ;
   %let ror  = ROR ;
   %let war  = WAR ;
   %let ning = NING ;

   %* Convert a blank-delimited list of names into a quoted, comma-separated list in &ex;
   %macro exinc(exinc);
      %global ex;
      %if ^%sysevalf(%superq(exinc)=,boolean) %then %do;
         %let i = 1;
         %do %until (%scan(&exinc,&i)= );
            %if &i=1 %then %let ex = "%upcase(%scan(&exinc,&i))";
            %else %let ex = &ex %str(,) "%upcase(%scan(&exinc,&i))";
            %let i = %eval(&i + 1);
         %end;
      %end;
   %mend exinc;

   %*----------------------------------------------------------------------
      Validate parameters
    -----------------------------------------------------------------------;
   ** Break if parameters libin or dm empty ;
   %if ^%sysevalf(%superq(dm)=,boolean) %then %do;
      %if %sysfunc(exist(&libin..&dm)) eq 0 %then %do;
         %put %sysfunc(sysmsg());
         %put &er&ror: Dataset &libin..&dm does not exist.;
         %let parmerr=1;
      %end;
   %end;
   %else %if ^(%qcmpres(&libin) eq ) %then %do;
      %if ( %sysfunc(libref( &libin )) ne 0 ) %then %do;
         %put %sysfunc(sysmsg());
         %put &er&ror: Input library %str(&libin) is not correctly defined.;
         %let parmerr=1;
      %end;
   %end;

   %* Create the backup sub-folder when the input library itself is to be squeezed;
   %if %upcase(%substr(&SQZLIBIN,1,1)) = Y %then %do;
      libname archive "%sysfunc(pathname(&libin))/backup" ;
      %if (%sysfunc(libref(archive)) ne 0 ) %then %do;
         data _null_;
            length cmd $200;
            cmd="mkdir %sysfunc(pathname(&libin))/backup";
            call system(trim(cmd));
         run;
         libname archive "%sysfunc(pathname(&libin))/backup" ;
      %end;
   %end;

   %exinc(&INCLUDE);
   %exinc(&EXCLUDE);
   %put exin=&ex;

   %if ^(%qcmpres(&dm) eq ) %then %do;
      proc contents data=&libin..&dm out=views(keep=memname memlabel) noprint;
      run;
      %let DSNAME1=&dm;
      %let NUMMM=1;
      data _null_;
         set views end=eof;
         if eof then call symputx('dslabel1', memlabel);
      run;
      %if %upcase(%substr(&debug,1,1)) = Y %then %do;
         %if %length(&dm)>8 %then %do;
            %put %str(&war&ning: Data name: &dm Exceed 8 Character lengths, and it will fail the xpt v5 transport engine.);
         %end;
      %end;
   %end;
   %else %if ^(%qcmpres(&libin) eq ) %then %do;
      %* The name filter is added only when an EXCLUDE or INCLUDE list was passed;
      proc sql noprint;
         select left(put(count(*), 4.)) into :nummm
            from dictionary.tables
            where libname=%upcase("&libin") and length(memname) le 8
            %if ^%sysevalf(%superq(EXCLUDE)=,boolean) %then %do; and upcase(memname) ^in(&EX) %end;
            %else %if ^%sysevalf(%superq(INCLUDE)=,boolean) %then %do; and upcase(memname) in(&EX) %end;
            ;
         select lowcase(memname) into :dsname1 - :dsname&nummm
            from dictionary.tables
            where libname=%upcase("&libin") and length(memname) le 8
            %if ^%sysevalf(%superq(EXCLUDE)=,boolean) %then %do; and upcase(memname) ^in(&EX) %end;
            %else %if ^%sysevalf(%superq(INCLUDE)=,boolean) %then %do; and upcase(memname) in(&EX) %end;
            ;
         select memlabel into :dslabel1 - :dslabel&nummm
            from dictionary.tables
            where libname=%upcase("&libin") and length(memname) le 8
            %if ^%sysevalf(%superq(EXCLUDE)=,boolean) %then %do; and upcase(memname) ^in(&EX) %end;
            %else %if ^%sysevalf(%superq(INCLUDE)=,boolean) %then %do; and upcase(memname) in(&EX) %end;
            ;
      quit;
      %if %upcase(%substr(&debug,1,1)) = Y %then %do;
         %let gt8=;
         proc sql noprint;
            select memname into: gt8 separated by ', '
               from dictionary.tables
               where libname=%upcase("&libin") and length(memname) gt 8
               %if ^%sysevalf(%superq(EXCLUDE)=,boolean) %then %do; and upcase(memname) ^in(&EX) %end;
               %else %if ^%sysevalf(%superq(INCLUDE)=,boolean) %then %do; and upcase(memname) in(&EX) %end;
               ;
         quit;
         %put %str(&war&ning: Data names: &gt8 Exceed 8 Character lengths, and it will fail the xpt v5 transport engine.);
      %end;
   %end;

   %macro copydm (DSCIN    /* name of input SAS dataset (libref.member) */
                 , DSCOUT  /* libref of the destination library         */
                 ) ;
      %let dsin=;
      %if %index(&DSCIN,.)>0 %then %do;
         %let dsin=%scan(&DSCIN,1,'.');
         %let fsin=%scan(&DSCIN,2,'.');
      %end;
      %put dsout=&dscout;
      data _null_;
         length cmd $200;
         cmd="cp -p %sysfunc(pathname(&dsin))/&fsin..* %sysfunc(pathname(&dscout))/";
         call system(trim(cmd));
      run;
   %mend copydm;

   %macro dir_report(DSNIN    /* name of input SAS dataset  */
                    ,DSNOUT   /* name of output SAS dataset */
                    ,EXCLUDE=
                    ,INCLUDE=);
      %*-------------------------------------------------------------------------
         Produce a summary report for the squeeze data and show the percent
         reduction
       ----------------------------------------------------------------------------;
      %let dsin=;%let dsout=;%let fsin=;%let fsout=;
      %if %index(&DSNIN,.)>0 %then %do;
         %let dsin=%scan(&DSNIN,1,'.');
         %let fsin=%scan(&DSNIN,2,'.');
      %end;
      %else %do;
         %let dsin=&DSNIN;
      %end;
      %if %index(&DSNOUT,.)>0 %then %do;
         %let dsout=%scan(&DSNOUT,1,'.');
         %let fsout=%scan(&DSNOUT,2,'.');
      %end;
      %else %do;
         %let dsout=&DSNOUT;
      %end;

      filename dir_1 pipe "ls -l %sysfunc(pathname(&DSIN))" new;
      filename dir_2 pipe "ls -l %sysfunc(pathname(&DSOUT))" new;

      %exinc(&INCLUDE);
      %exinc(&EXCLUDE);
      %put exin=&ex;

      %* Read the directory listing before (i=1) and after (i=2) squeezing;
      %do i=1 %to 2;
         data dir_&i;
            length phyfile1 $200 file filename $50 size&i $27 owner&i date&i $20 dir&i $100;
            infile dir_&i length=len end=last;
            input phyfile1 $varying200. len;
            *if index (phyfile1,"&fsin")>0 then do;
            filename&i=scan(phyfile1,9,' ');
            file=scan(filename&i,1,'.');
            presize&i=input(scan(phyfile1,5,' '),12.)/1024;
            size&i=strip(put((input(scan(phyfile1,5,' '),best.)/1024),comma19.)||'(KB)');
            date&i = scan( phyfile1, 6,' ')||' '||scan( phyfile1, 7,' ')||','||strip(year(today()));
            owner&i =scan(phyfile1,3,' ');
            %if &i=1 %then dir&i="%sysfunc(pathname(&DSIN))";
            %else %if &i=2 %then dir&i="%sysfunc(pathname(&DSOUT))";;
            if ^missing(filename&i) then output;
            *end;
            keep size&i filename file date&i owner&i presize&i dir&i;
         proc sort;
            by file ;
         run;
      %end;

      data dir_3;
         merge dir_1(in=ok) dir_2(in=ok1);
         by file;
         label pct_red='Percent Reduction';
         if ok & ok1;
         %* The name filter is added only when an EXCLUDE or INCLUDE list was passed;
         %if ^%sysevalf(%superq(EXCLUDE)=,boolean) %then %do;
            where upcase(file) ^in(&EX);
         %end;
         %else %if ^%sysevalf(%superq(INCLUDE)=,boolean) %then %do;
            where upcase(file) in(&EX);
         %end;
         pct_red=abs(round(((presize2-presize1)/presize1)*100,0.01));
      run;

      proc report data=dir_3 nowd headline headskip missing split='|' spacing=1;
         column file size1 size2 pct_red;
         Title "Results of Squeezing Library %sysfunc(pathname(&DSIN))";
         define file    / order width=16 'Dataset Name';
         define size1   / order width=27 center 'Size Before Squeezing';
         define size2   / order width=27 center 'Size After Squeezing';
         define pct_red / order width=27 center 'Percent Reduction in Size';
         break after file / skip;
      run;

      filename dir_1 clear;
      filename dir_2 clear;
      proc datasets nolist ;
         delete dir_1 dir_2 dir_3;
      run ;
   %mend dir_report;

   %macro SQUEEZE( DSNIN       /* name of input SAS dataset  */
                 , DSNOUT      /* name of output SAS dataset */
                 , LABELOUT=   /* Data set label text        */
                 ) ;
      %*----------------------------------------------------------------------
         create dataset of variable names whose lengths are to be minimized
         exclude from the process those exclude data set names or include
         those include optional data set names
       -----------------------------------------------------------------------;
      %let SQZ_CHAR_FMT =;
      proc contents data=&DSNIN memtype=data noprint
                    out=_cntnts_(keep= name type format formatl) ;
      run ;

      data fmm;
         length fname $20;
         set _cntnts_;
         where format='TIME' or format='MMDDYY' or format='DATETIME' or format='HHMM';
         fname=trim(name)||' '||trim(left(format))||trim(left(formatl))||'.';
      run;

      proc sql noprint;
         select trim(left(fname)) into: SQZ_CHAR_FMT separated by ' ' from fmm;
      quit;

      %let N_CHAR = 0 ;
      data _null_ ;
         set _cntnts_ end=lastobs nobs=nobs ;
         if nobs = 0 then stop ;
         n_char + ( type = 2 ) ;
         /* create macro vars containing final # of char variables */
         if lastobs then do ;
            call symput( 'N_CHAR', left( put( n_char, 5. ))) ;
         end ;
      run ;

      %*----------------------------------------------------------------------
         put global macro names into global symbol table for later retrieval
       -----------------------------------------------------------------------;
      %do I = 1 %to &N_CHAR ;
         %global CHAR&I CHARLEN&I ;
      %end ;

      %*----------------------------------------------------------------------
         create macro vars containing variable names
       -----------------------------------------------------------------------;
      proc sql noprint ;
         %if &N_CHAR > 0 %then %str( select name into :CHAR1 - :CHAR&N_CHAR from _cntnts_ where type = 2 ; ) ;
      quit ;

      %*----------------------------------------------------------------------
         compute min # bytes to keep rightmost character for char vars
       -----------------------------------------------------------------------;
      data _null_ ;
         set &DSNIN end=lastobs ;
         %if &N_CHAR > 0 %then %str( array _char_len_ ( &N_CHAR ) _temporary_ ; ) ;
         if _n_ = 1 then do ;
            %if &N_CHAR > 0 %then %str( do i = 1 to &N_CHAR ; _char_len_( i ) = 0 ; end ; ) ;
         end ;
         %if &N_CHAR > 0 %then %do ;
            %do I = 1 %to &N_CHAR ;
               _char_len_( &I ) = max( _char_len_( &I ), length( &&CHAR&I )) ;
            %end ;
         %end ;
         if lastobs then do ;
            %if &N_CHAR > 0 %then %do ;
               %do I = 1 %to &N_CHAR ;
                  call symput( "CHARLEN&I", put( _char_len_( &I ), 5. )) ;
               %end ;
            %end ;
         end ;
      run ;

      proc datasets nolist ;
         delete _cntnts_ fmm;
      run ;

      %*----------------------------------------------------------------------
         initialize SQZ_CHAR global macro vars
       -----------------------------------------------------------------------;
      %let SQZ_CHAR = LENGTH ;
      %if &N_CHAR > 0 %then %do I = 1 %to &N_CHAR ;
         %let SQZ_CHAR = &SQZ_CHAR %qtrim( &&CHAR&I ) $%left( &&CHARLEN&I ) ;
      %end ;

      %*----------------------------------------------------------------------
         build macro var containing order of all variables
       -----------------------------------------------------------------------;
      data _null_ ;
         length retain $32767 ;
         retain retain 'retain ' ;
         dsid = open( "&DSNIN", 'I' ) ; /* open dataset for read access only */
         do _i_ = 1 to attrn( dsid, 'nvars' ) ;
            retain = trim( retain ) || ' ' || varname( dsid, _i_ ) ;
         end ;
         call symput( 'RETAIN', retain ) ;
      run ;

      %*----------------------------------------------------------------------
         apply SQZ_* to incoming data, create output dataset
       -----------------------------------------------------------------------;
      data &DSNOUT %if (^%sysevalf(%superq(labelout)=,boolean)) %then (label="&labelout");;
         &RETAIN ;
         %if &N_CHAR > 0 %then %str( &SQZ_CHAR ; ) ;                                            /* optimize char var lengths      */
         %if ^(%qcmpres(&SQZ_CHAR_FMT) eq ) %then %str(format &SQZ_CHAR_FMT; ) ;   /* adjust char var format lengths */
         set &DSNIN ;
      run ;

      %*----------------------------------------------------------------------
         delete macro variables from the global symbol table in SAS
       -----------------------------------------------------------------------;
      %do I = 1 %to &N_CHAR ;
         %symdel CHAR&I CHARLEN&I ;
      %end ;
   %mend SQUEEZE ;

   %*------------------------------------------------------------------
      Quit the macro program if you have any wrong input parameters
    ---------------------------------------------------------------------;
   %if (&parmerr) %then %return;

   %if &version=1 %then %do;
      %* Temporary work sub-folder that receives the squeezed datasets;
      data _null_;
         length cmd $200;
         cmd="mkdir %sysfunc(pathname(work))/tempx";
         call system(trim(cmd));
      run;
      libname bbbb "%sysfunc(pathname(work))/tempx/" access=temp ;

      %do j=1 %to &nummm;
         %put dsname=&&dsname&j;
         %put dslabel=&&dslabel&j;
         options varlenchk=nowarn;
         %if %upcase(%substr(&SQZLIBIN,1,1)) = Y %then %do;
            %copydm(&libin..&&dsname&j,archive);
         %end;
         %if %upcase(%substr(&SQZXPTDIR,1,1)) = Y %then %do;
            %SQUEEZE(&libin..&&dsname&j,bbbb.&&dsname&j,labelout=&&dslabel&j);
            %if %upcase(%substr(&SQZLIBIN,1,1)) = Y %then %do;
               %copydm(bbbb.&&dsname&j,&libin);
            %end;
         %end;
      %end;
      options varlenchk=warn;

      %* Write one V5 transport file per squeezed dataset;
      %if (%sysfunc(pathname(&xptdir)) ne 0 ) %then %do;
         %do j=1 %to &nummm;
            libname xptfile xport "&xptdir/&&dsname&j...xpt";
            proc copy in=bbbb out=xptfile memtype=data;
               select &&dsname&j;
            run;
         %end;
      %end;
      %else %if ( %sysfunc(pathname(&xptdir)) eq 0 ) %then %do;
         %put %sysfunc(sysmsg());
         %put &er&ror: Output library %nrstr(&xptdir) is not correctly defined.;
         %let parmerr=1;
      %end;

      %if %upcase(%substr(&SQZRPT,1,1)) = Y %then %do;
         %if ^(%qcmpres(&dm) eq ) %then %do;
            %dir_report(&libin..&dm,bbbb.&dm);
         %end;
         %else %do;
            %dir_report(&libin.,bbbb.,exclude=&exclude.,include=&include.);
         %end;
         %do j=1 %to &nummm;
            Title1 "Directory Location of Requested data to be Re-size: %sysfunc(pathname(&libin)) Dataset name: &&dsname&j";
            Title2 "Directory Result Location for Re-size XPT Files: %sysfunc(pathname(&xptdir)) File: &&dsname&j";
            %if &j^=1 %then title3 j=c "Data &j of &nummm";;
            proc compare base=&libin..&&dsname&j compare=bbbb.&&dsname&j;
            run;
            %let rc=&sysinfo;
            data _null_;
               if &rc='1......'b then put 'Observations in Base but not in Comparison Data Set';
            run;
         %end;
      %end;

      libname bbbb clear;
   %end;
%mend makexpt;
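
The core technique in the listing above (measure the longest value, re-declare the length, verify with PROC COMPARE, then write a V5 transport file) can be tried on a single dataset without the full macro. The following is a minimal sketch under stated assumptions, not the %makexpt implementation: the datasets DEMO and DEMO_SQZ, the variables USUBJID and AGE, and the path /path/to/xpt are all hypothetical placeholders.

```sas
/* Hypothetical input: one oversized character variable, one numeric */
data demo;
   length usubjid $200 age 8;
   usubjid = 'XYZ-001';  age = 34; output;
   usubjid = 'XYZ-0002'; age = 51; output;
run;

/* 1. Longest stored value; LENGTH() ignores trailing blanks */
proc sql noprint;
   select max(length(usubjid)) into :maxlen from demo;
quit;
%let maxlen = &maxlen;   /* strip leading blanks */

/* 2. Declare the shorter length BEFORE the SET statement so it wins. */
/*    VARLENCHK=NOWARN suppresses the warning about shortening.      */
options varlenchk=nowarn;
data demo_sqz;
   length usubjid $&maxlen;
   set demo;
run;
options varlenchk=warn;

/* 3. Values should compare equal; only the length attribute differs */
proc compare base=demo compare=demo_sqz;
run;

/* 4. Write a V5 transport file (member name must be 8 characters or fewer). */
/*    The path below is a placeholder and must exist on your system.         */
libname xptout xport "/path/to/xpt/demo.xpt";
proc copy in=work out=xptout memtype=data;
   select demo_sqz;
run;
libname xptout clear;
```

Because USUBJID is already the first variable, the LENGTH statement does not change the variable order here; for other datasets the macro's RETAIN step is what preserves the original order.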