Another Factorial File Compression Experiment Using SAS® and UNIX Compression Algorithms
Adeline J. Wilcox, US Census Bureau, Washington, DC

1 ABSTRACT

Continuing the experimental work on SAS data set compression that I presented at NESUG in 2004, I designed another two-factor factorial experiment. My first factor compares seven file compression treatments: the three kinds of data set compression offered by SAS on UNIX (the SAS data set options COMPRESS=CHAR and COMPRESS=BINARY, and SAS sequential format files created with the V9TAPE engine), three UNIX file compression algorithms (compress, gzip, and bzip2), and a control without any file compression. My second factor compares four kinds of SAS data sets: all character variables, all numeric variables, half character and half numeric, and all numeric in which LENGTHs shorter than 8 were used for smaller values. bzip2 minimized compressed file size for all four control SAS data sets. Only the three SAS file compression methods can be used to give other SAS users read access to compressed SAS data sets without giving them write permission to these files. SAS COMPRESS=BINARY produced smaller compressed files than SAS COMPRESS=CHAR on all four variable type treatments tested, including SAS data sets containing only character variables.

2 INTRODUCTION

Experimentation is generally an iterative process (Montgomery, 1997). Using what I learned from the results of the experiments I presented at NESUG in 2004 (Wilcox, 2004), and reconsidering other information, I designed another factorial file compression experiment. This experiment's design also reflects the fact that I now work in a different computing environment, one in which disk space is more precious than in the environment where I conducted my earlier experiments. This experiment aims for a more comprehensive comparison of compression algorithms and variable types.

Testing different file compression algorithms on files composed of different variable types is one of the two primary objectives of this experiment. My first experiment did not control for variable composition in any way. Testing SAS COMPRESS=CHAR on SAS data sets consisting solely of character variables should determine whether this file compression algorithm can be dropped from further testing. In my first experiment, SAS COMPRESS=CHAR was the slowest compression algorithm and the second worst at reducing file size. In that experiment, SAS COMPRESS=BINARY actually increased compressed file size because observation lengths were not long enough to properly test that compression algorithm. In this experiment, I created test data sets with sufficient record length. The other primary objective is a more comprehensive comparison of compression algorithms, including gzip and bzip2. No measurements of SAS CPU time or total CPU time are reported here.

3 DESIGN OF MY EXPERIMENT

In my 7 x 4 factorial experiment, the first factor was one of six file compression treatments or a control without compression. The second factor compared SAS data sets of different composition: all character variables, all numeric variables, half character and half numeric, and all numeric in which LENGTHs shorter than 8 were used for smaller values.
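For reference, the sketch below shows how each of the three SAS treatments in the first factor is requested. It is illustrative only; the data set and path names are hypothetical and the code is not taken from the experiment.

   /* SAS data set compression is requested one data set at a time with the   */
   /* COMPRESS= data set option.                                              */
   data work.out_char(compress=char);      /* run-length (RLE) compression,    */
      set work.original;                   /* aimed at character data          */
   run;

   data work.out_binary(compress=binary);  /* RDC compression, aimed at long   */
      set work.original;                   /* observations                     */
   run;

   /* A sequential format file is written through a libref assigned to the    */
   /* V9TAPE engine.                                                           */
   libname seqout v9tape '/some/path/original_seq';
   data seqout.a;
      set work.original;
   run;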
The fixed model for my factorial experiment is

   y_{ijk} = \mu + \tau_i + \beta_j + (\tau\beta)_{ij} + \epsilon_{ijk}

where the response y_{ijk} is disk space used, \mu is the overall mean, \tau_i represents the file compression treatment, \beta_j represents the SAS variable type and length treatment, (\tau\beta)_{ij} is the interaction between the file compression and variable type treatments, and \epsilon_{ijk} is the random error (Montgomery).

Table 3.1 shows the design of my experiment, with 10 replicates within each of the 28 treatment combinations. The order of treatment of units within each block was not random.

Table 3.1  Assignment of Treatments (number of replicates)

                                                 Variable Type
File Compression Treatment    Character   Numeric   Short Numeric   Character and Numeric
None                              10         10           10                  10
Sequential Format                 10         10           10                  10
COMPRESS=CHAR                     10         10           10                  10
COMPRESS=BINARY                   10         10           10                  10
UNIX compress                     10         10           10                  10
UNIX gzip                         10         10           10                  10
UNIX bzip2                        10         10           10                  10

3.1 Choice Of Sample Size

Before I ran this experiment, I decided that I needed a reduction in file size of at least 30 percent to make file compression worthwhile. Having used ten replicates in my first file compression experiment, I again used ten replicates, creating 10 data set subsets of only character variables. A 30 percent reduction would cut file size by 1,241,088 bytes, to no more than 2,895,872 bytes. Referring to the method Montgomery gives for sample size computation for two-factor factorial designs, it appears that ten replicates for each treatment may be considerably more than needed. However, it was convenient for me to continue working with the same number of replicates that I used in my first experiment.

4 METHODS

Because this experiment is designed to be a comprehensive test of the file compression algorithms available in my computing environment, I ran Tukey's Studentized Range (HSD) test to make all pairwise comparisons of the compression factor and of the interaction of the compression factor with the variable type factor. I also tested for differences from the control data sets.

4.1 Creating Test Data Sets

All test data sets were generated from decennial census data. The original data were stored in 52 files, one for each of the 50 US states and one each for the District of Columbia and the territory of Puerto Rico. From these 52 files, ten were randomly selected. The second 10,000 observations were read from each of these ten files. Within each control treatment, all ten subsets of the original files were identical in file size and observation length. Table 4.1 shows the size of each of the four sets of control files. Because my first data set, consisting only of character variables, contained solely numeric data stored as character variables, it was possible for me to make all four test data sets identical in data content. I created my second, third and fourth data sets by converting character variables to numeric variables.

Table 4.1  The Four Variable Type Treatments

Variable Type(s)          Number of    Number of    File Size      Observation
                          Variables    Files        (bytes)        Length (bytes)
Character                    140          10         4,136,960          407
Numeric                      140          10        11,345,920         1120
Short Numeric                140          10         5,046,272          496
Character and Numeric        140          10         7,675,904          760

In the original files, all variables were character. From the original data sets, I chose 140 variables that could be converted to numeric variables. All ten of the character-only subset files were 4,136,960 bytes in size. In an effort to control metadata size, I gave the numeric versions of the variables names of the same length as the original character variable names.
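The conversion DATA step itself is not listed in the paper. The sketch below shows, under assumed variable names (char1, num1, shortnum1), the kind of step that could produce a numeric version and a short-numeric version of one character variable; it is an illustration, not the code used in the experiment.

   /* Hypothetical sketch: convert one character variable to a full-length    */
   /* numeric variable and to a shorter-LENGTH numeric variable.              */
   data work.numeric_version;
      set work.character_version;
      length num1 8 shortnum1 3;         /* default 8-byte numeric vs. 3 bytes */
      num1      = input(char1, 12.);     /* character-to-numeric conversion    */
      shortnum1 = input(char1, 12.);     /* same value, stored in fewer bytes  */
      drop char1;
   run;

A numeric LENGTH below 8 stores a truncated floating-point value, so it saves space safely only for variables whose values fit exactly in the shorter length; this is why the Short Numeric treatment applied shorter LENGTHs only to smaller values.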
All work was done with bash shell scripts and 32-bit SAS 9.1.3 on an AMD Opteron™ processor running Linux.

4.2 File Compression Algorithms

I compared six file compression treatments to controls. I experimented with three SAS compression treatments: the data set options COMPRESS=CHAR and COMPRESS=BINARY, and the SAS sequential format written through a named pipe. I also experimented with the three file compression algorithms installed in my Linux computing environment. These are gzip, compress, and bzip2. In my earlier file compression experiments, I did not use bzip2 because my initial experience with it on a very large file was not successful. I tried bzip2 again, this time without getting a non-zero exit status. In this paper, all measures of file size were obtained from Linux.

In one of my bash shell scripts, I used the command export oneoften to export an environment variable named oneoften that identifies the replicate. Subsequently, I created a named pipe with the command mknod pipech$oneoften p, as shown in a SAS Tech Support Sample (SAS Institute Inc., 2002). In this SAS log excerpt, the macro variable named state resolves to 01.

13   %let state=%sysget(oneoften);
14   libname mine '/adelines/portland/amd/';
NOTE: Libref MINE was successfully assigned as follows:
      Engine:        V9
      Physical Name: /adelines/portland/amd
15   libname fargo "pipech&state";
NOTE: Libref FARGO was successfully assigned as follows:
      Engine:        V9TAPE
      Physical Name: /adelines/portland/amd/pipech01
16   filename nwrpipe pipe "compress < pipech&state > char2&state..Z &";
17   data _null_;
18      infile nwrpipe;
19   run;

NOTE: The infile NWRPIPE is:
      Pipe command="compress < pipech01 > char201.Z &"
NOTE: 0 records were read from the infile NWRPIPE.
20
21   data fargo.a; set mine.char2&state; run;

NOTE: There were 10000 observations read from the data set MINE.CHAR201.
NOTE: The data set FARGO.A has 10000 observations and 140 variables.
NOTE: SAS Institute Inc., SAS Campus Drive, Cary, NC USA 27513-2414
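The paper does not list the commands used for the gzip and bzip2 treatments. As one illustration of how the same named-pipe pattern could be pointed at a different compressor, here is a hypothetical bzip2 variant of the log above; the pipe and file names follow the Tech Support sample, and this is not code that was run for this experiment.

   /* Hypothetical bzip2 variant of the Tech Support sample shown above.      */
   /* The named pipe pipech01 is assumed to have been created with mknod.     */
   libname mine '/adelines/portland/amd/';
   libname fargo 'pipech01';                 /* SAS assigns the V9TAPE engine  */

   filename nwrpipe pipe "bzip2 < pipech01 > char201.bz2 &";

   data _null_;                              /* launches the background bzip2  */
      infile nwrpipe;
   run;

   data fargo.a;                             /* writing to FARGO feeds the pipe */
      set mine.char201;
   run;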
5 RESULTS

In the subdirectory where I wrote these 280 files, I ran the commands

   ls -l *.Z *.sas7bdat *.gz *.bz2 > two80a.txt
   cut -c31-42,57-63 two80a.txt > two80b.txt

giving me a list of all 280 files with their file sizes in bytes. File names were designed to identify the treatment(s) applied to the SAS data sets contained in the files. Consequently, this information was captured in the file named two80b.txt. I used two80b.txt as the input file to my SAS program named two80b.sas, in which I analyzed the effects of the data type composition and file compression treatment factors on file size.

5.1 Compressed File Size

Table 5.1 shows treatment means without adjustment for the other factor or for the interaction between the factors. This table also shows 95 percent confidence intervals. Means and confidence limits shown are rounded to the nearest byte.

Table 5.1  Means of the File Compression Treatments

Variable Type           Compression Treatment   Number of     Mean       Lower       Upper
                                                Replicates    bytes      95% CL      95% CL
Character               Control                     10         4136960         .           .
                        Sequential Format           10          545433    523951      566914
                        COMPRESS=CHAR               10         2768077   2709445     2826708
                        COMPRESS=BINARY             10         2428928   2381139     2476717
                        UNIX compress               10          548270    526593      569948
                        UNIX gzip                   10          258361    250311      266411
                        UNIX bzip2                  10          172062    163778      180346
Numeric                 Control                     10        11345920         .           .
                        Sequential Format           10          874503    849524      899481
                        COMPRESS=CHAR               10         5365760   5342319     5389201
                        COMPRESS=BINARY             10         3783066   3726128     3840003
                        UNIX compress               10          882196    856271      908120
                        UNIX gzip                   10          421452    409755      433149
                        UNIX bzip2                  10          196340    188380      204301
Short Numeric           Control                     10         5046272         .           .
                        Sequential Format           10          652381    627127      677634
                        COMPRESS=CHAR               10         4048486   4036194     4060779
                        COMPRESS=BINARY             10         3279258   3246682     3311833
                        UNIX compress               10          661797    635528      688066
                        UNIX gzip                   10          326971    314961      338981
                        UNIX bzip2                  10          195753    185839      205667
Character and Numeric   Control                     10         7675904         .           .
                        Sequential Format           10          768866    746583      791150
                        COMPRESS=CHAR               10         3963290   3913810     4012769
                        COMPRESS=BINARY             10         3402957   3362810     3443104
                        UNIX compress               10          780778    758149      803406
                        UNIX gzip                   10          382229    372331      392128
                        UNIX bzip2                  10          204152    195921      212382

The results shown in Table 5.1 indicate that bzip2 minimizes compressed file size when run on SAS data sets comprised only of character type variables. Because the upper 95 percent confidence limit for this combination of factors does not overlap any of the lower confidence limits from the other treatments, running bzip2 on SAS data sets containing only character type variables appears, at first glance, to be the treatment combination of choice.

5.2 Analysis of Variance

I analyzed my data using PROC GLM with file size in bytes as the response variable. I've listed my code below.

   libname mine '~/portland/';
   filename two80 '~/portland/amd/two80b.txt';

   proc format;
      value $compresn '1'='Control'
                      '2'='Sequential Format'
                      '3'='COMPRESS=CHAR'
                      '4'='COMPRESS=BINARY'
                      '5'='UNIX compress'
                      '6'='gzip'
                      '7'='bzip2';
   run;

   title 'Two-Factor Factorial File Compression Experiment';

   data read280;
      infile two80 truncover lrecl=19;
      length datatype $ 4 comptrmt $ 1;
      input bytes 1-12 datatype 13-16 comptrmt 17;
   run;

   proc glm data=read280;
      class datatype comptrmt;
      model bytes=datatype|comptrmt;
      lsmeans comptrmt/pdiff=control('Control') cl adjust=dunnett;
      lsmeans comptrmt datatype*comptrmt/adjust=tukey cl pdiff;
      means datatype|comptrmt/tukey lines cldiff;
      output out=anal280 residual=resbytes;
      format comptrmt $compresn.;
   run;

   data mine.two80c;
      set anal280;
      label resbytes='Residuals from the Two-Factor Factorial Model';
   run;

   proc means data=read280 mean clm maxdec=0;
      class datatype comptrmt;
      var bytes;
      ways 2;
      format comptrmt $compresn.;
   run;
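A percent-reduction summary like the one quoted in the next section (SAS COMPRESS=CHAR reduced file size by 43 percent on average) could be computed from read280 with a step along these lines. This is an illustrative sketch I am adding here, not part of two80b.sas, and the table name pctred is hypothetical.

   /* Hypothetical sketch: mean file size by treatment and its percent        */
   /* reduction from the matching uncompressed control.                       */
   proc sql;
      create table pctred as
      select a.datatype,
             a.comptrmt,
             mean(a.bytes)                             as meanbytes,
             100 * (1 - mean(a.bytes) / mean(b.bytes)) as pctreduction
      from read280 as a,
           read280(where=(comptrmt='1')) as b          /* '1' formats to 'Control' */
      where a.datatype = b.datatype
      group by a.datatype, a.comptrmt;
   quit;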
5.3 Statistical Analysis of the Model

Table 5.3.1 shows the analysis-of-variance table, with the breakdown of the total sum of squares into model and error followed by the Type I sums of squares for the individual terms in the model and the interaction term, as part of the PROC GLM output.

Table 5.3.1  ANOVA for File Size in Bytes

Source              DF    Sum of Squares    Mean Square     F Value    Pr > F
Model               27    1.9474229E15      7.2126776E13    51440.5    <.0001
Error              252    353339154783      1402139503.1
Corrected Total    279    1.9477763E15

R-Square    Coeff Var    Root MSE    bytes Mean
0.999819    1.610138     37445.15    2325586

Source                DF    Type I SS       Mean Square     F Value    Pr > F
datatype               3    1.1128992E14    3.7096641E13    26457.2    <.0001
comptrmt               6    1.5889437E15    2.6482395E14    188871     <.0001
datatype*comptrmt     18    2.4718935E14    1.3732742E13    9794.13    <.0001

Not only was the model statistically significant at the previously chosen critical value of 0.05, but so was every term in it, including the interaction. To compare the treatment means with the four control treatments, consisting of the data sets comprised of different variable types or combinations of variable types, I used Dunnett's modification of the t-test. All treatments were statistically significantly different from their controls at the critical value of 0.05. On average, SAS COMPRESS=CHAR reduced file size by 43 percent.

5.4 Multiple Comparisons

I performed Tukey's studentized range test (HSD) on the compression treatment means and on the interaction of compression treatment with variable type. Results are shown in Table 5.4.1. For the compression treatment main effect, all pairwise comparisons were statistically significant except the comparison between UNIX compress and the SAS Sequential Format. Since I programmed my Sequential Format treatment following the SAS sample code that uses the UNIX compress command, it is not surprising that these two compression treatments do not differ significantly. Indeed, the only surprise is the differences between the sizes of the individual replicate files and the treatment mean file sizes.

Table 5.4.1  Least Squares Means
(Adjustment for Multiple Comparisons: Dunnett)

Compression Treatment    bytes LSMEAN    H0:LSMean=Control    Lower 95% CL    Upper 95% CL
                                         Pr > |t|
COMPRESS=BINARY           3223552.00         <.0001              3211892         3235212
COMPRESS=CHAR             4036403.20         <.0001              4024743         4048063
Control                   7051264.00                             7039604         7062924
Sequential Format          710295.60         <.0001               698635          721956
UNIX compress              718260.07         <.0001               706600          729920
UNIX bzip2                 192076.77         <.0001               180417          203737
UNIX gzip                  347253.50         <.0001               335593          358914

Most multiple comparisons of the interaction term gave statistically significant results. The interactions that were not significant are listed in Table 5.4.2. Two combinations of interaction terms were of borderline statistical significance, between the 0.05 and 0.10 levels. Both involved gzip and bzip2.
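As a rough guide to reading Table 5.4.2, the margin behind Tukey's test for the 28 interaction means can be sketched from the quantities in Table 5.3.1; this back-of-the-envelope calculation is mine and is not part of the PROC GLM output:

   \mathrm{HSD} = q_{0.05}(28, 252)\,\sqrt{\mathrm{MSE}/n}
                = q_{0.05}(28, 252)\,\sqrt{1402139503.1/10}
                \approx 11841 \times q_{0.05}(28, 252) \text{ bytes}

With the studentized-range quantile being roughly 5, two interaction means must differ by something on the order of 60,000 bytes before Tukey's test flags the difference at the 0.05 level, which is consistent with the closely matched pairs listed in Table 5.4.2.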
Table 5.4.2  Multiple Comparisons Between Interaction Terms

Not Statistically Significant
Character             * Sequential Format    vs    Character              * UNIX compress
Character             * UNIX bzip2           vs    Numeric                * UNIX bzip2
Character             * UNIX bzip2           vs    Short Numeric          * UNIX bzip2
Character             * UNIX bzip2           vs    Character and Numeric  * UNIX bzip2
Character             * UNIX gzip            vs    Character and Numeric  * UNIX bzip2
Character and Numeric * UNIX gzip            vs    Numeric                * UNIX gzip
Character and Numeric * UNIX gzip            vs    Short Numeric          * UNIX gzip
Character and Numeric * UNIX bzip2           vs    Short Numeric          * UNIX bzip2
Character and Numeric * Sequential Format    vs    Character and Numeric  * UNIX compress
Numeric               * UNIX bzip2           vs    Short Numeric          * UNIX bzip2
Numeric               * UNIX bzip2           vs    Character and Numeric  * UNIX bzip2
Numeric               * Sequential Format    vs    Numeric                * UNIX compress
Short Numeric         * Sequential Format    vs    Short Numeric          * UNIX compress

Borderline Statistical Significance
Character             * UNIX gzip            vs    Numeric                * UNIX bzip2
Character             * UNIX gzip            vs    Short Numeric          * UNIX bzip2

As noted above, it appears that, because I used the UNIX compress algorithm with the SAS Sequential Format, there is no real difference between the SAS Sequential Format treatment and the UNIX compress treatment. From these results, it also appears that SAS programmers who use either the gzip or bzip2 file compression algorithms need not be concerned with altering variable type or variable length for the purpose of saving disk space.

5.5 Residuals

For reference, I overlaid a fitted normal density curve, based on the sample mean and sample standard deviation, on the histogram of residuals shown in Figure 1. The normal probability plot, shown in Figure 2, gives a reference line based on the sample mean and sample standard deviation. These plots show me that the residuals are distributed normally enough for me to use analysis of variance.

Figure 1: Histogram of Residuals
Figure 2: Normal Probability Plot
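The plotting code is not listed in the paper; one way to produce comparable diagnostics from the saved residual data set mine.two80c, created in Section 5.2, is sketched below.

   /* Hypothetical sketch: histogram with fitted normal curve and a normal    */
   /* probability plot for the saved residuals.                               */
   proc univariate data=mine.two80c noprint;
      var resbytes;
      histogram resbytes / normal(mu=est sigma=est);   /* fitted normal overlay */
      probplot  resbytes / normal(mu=est sigma=est);   /* reference line        */
   run;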
6 DISCUSSION

The three SAS file compression methods can be used to give other SAS users read access to compressed SAS data sets without giving them write permission to these files. So far, this appears to be their only advantage. All three UNIX file compression algorithms reduced file size more than either SAS COMPRESS=BINARY or SAS COMPRESS=CHAR. Compared with gzip and bzip2, UNIX compress appears less effective for compression of SAS data sets. Variable type is of little concern to bzip2 users. Compared with other file compression algorithms, SAS COMPRESS=CHAR did not work well on decennial census data in a Linux environment.

Linux users have been advised to compress only small to medium size files with bzip2 (Wallen, 2004). For large files, Wallen recommends gzip. According to the man page for bzip2,

   Larger block sizes give rapidly diminishing marginal returns. Most of the
   compression comes from the first two or three hundred k of block size, a
   fact worth bearing in mind when using bzip2 on small machines.

For gzip, the Linux man page gives a workaround for file sizes of 4 gigabytes or more. Apparently, gzip readily compresses files ranging from 2 gigabytes up to 4 gigabytes. I need to run another experiment using SAS data sets of varying size to determine the approximate file size at which bzip2 should be abandoned for gzip.

7 REFERENCES

bzip2. Block-sorting file compressor, v1.0.2. Linux man pages.

gzip. Linux man pages.

Montgomery, D. C. Design and Analysis of Experiments. Fifth Edition. Hoboken, NJ: John Wiley & Sons, 2001.

SAS Institute Inc. 2002. Sample 720: compress – Write a Unix compressed SAS data set directly. http://ftp.sas.com/techsup/download/sample/unix/dstep/compress.html Visited 23Jun2005.

SAS Institute Inc. 2004. SAS 9.1.3 Language Reference: Dictionary, Volumes 1, 2, 3, and 4. Cary, NC: SAS Institute Inc.

SAS Institute Inc. 2004. SAS 9.1 Companion for UNIX Environments. Cary, NC: SAS Institute Inc.

SAS Institute Inc. SAS/STAT User's Guide, Version 8. Cary, NC: SAS Institute Inc., 1999. 3884 pp.

Wallen, J. Linux users: Know thy compression utilities. ZDNet Australia, June 24, 2002. http://www.zdnet.com.au/insight/0,39023731,20266170,00.htm Visited 23Jun2005.

Wilcox, A. J. Disk Space Management Topics for SAS® in the Solaris™ Operating Environment. NorthEast SAS Users Group, Inc. Seventeenth Annual Conference Proceedings. 2004.

8 ACKNOWLEDGEMENTS

SAS® and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. AMD, the AMD Arrow logo, AMD Opteron, and combinations thereof, are trademarks of Advanced Micro Devices, Inc.

I am grateful to Bob Sands, US Census Bureau, Decennial Statistical Studies Division, for identifying SAS data sets suitable for reading and subsetting for this experiment. Mark Keintz of The Wharton School, University of Pennsylvania, advised me to try the SAS sequential engine for writing SAS data sets in sequential format on disk. This paper reports the results of research and analysis undertaken by Census Bureau staff. It has undergone a Census Bureau review more limited in scope than that given to official Census Bureau publications. This report is released to inform interested parties of research and to encourage discussion. Yves Thibaudeau, US Census Bureau, Statistical Research Division, kindly agreed to perform the technical review of this paper required for publication. Finally, thanks to Charlie Zender and other LaTeX users who publish their solutions on the Web.

9 CONTACT INFORMATION

Your comments and questions are valued and encouraged. Contact the author at:

Adeline J. Wilcox
US Census Bureau
Decennial Management Division
4700 Silver Hill Road, Stop 7100
Washington, DC 20233-7100
Phone: (301) 763-9410
Email: [email protected]