Another Factorial File Compression Experiment
Using SAS and UNIX Compression Algorithms
Adeline J. Wilcox, US Census Bureau, Washington, DC
1 ABSTRACT
Continuing experimental work on SAS data set compression presented at NESUG in 2004, I designed another two-factor factorial experiment. My first factor compares the three kinds of data set compression offered by SAS on UNIX (the SAS data set options COMPRESS=CHAR and COMPRESS=BINARY, and SAS sequential format files created with the V9TAPE engine), three UNIX file compression algorithms (compress, gzip, and bzip2), and a control without any file compression. My second factor compares four kinds of SAS data sets: all character variables, all numeric variables, half character and half numeric, and all numeric in which LENGTHs shorter than 8 were used for smaller values. bzip2 minimized compressed file size for all four control SAS data sets. Only the three SAS file compression methods can be used to give other SAS users read access to compressed SAS data sets without giving them write permission to these files. SAS COMPRESS=BINARY reduced compressed file size more than SAS COMPRESS=CHAR on all four variable type treatments tested, including SAS data sets containing only character variables.
2 INTRODUCTION
Experimentation is generally an iterative process (Montgomery, 1997). Using what I learned from the results of
the experiments I presented at NESUG in 2004 (Wilcox, 2004) and reconsidering other information, I designed
another factorial file experiment. This experiment’s design also reflects the fact that I now work in a different
computing environment in which disk space is more precious than the one in which I conducted my earlier experiments. This experiment aims for a more comprehensive comparison of compression algorithms and variable types.
Testing different file compression algorithms on files composed of different variable types is one of two primary
objectives of this experiment. My first experiment did not control for variable composition in any way. Testing
SAS COMPRESS=CHAR on SAS data sets consisting solely of character variables should determine whether this
file compression algorithm can be dropped from further testing. In my first experiment, SAS COMPRESS=CHAR
was the slowest and second worst compression algorithm for file size reduction. In that experiment, SAS COMPRESS=BINARY actually increased compressed file size because observation lengths were not long enough to
properly test that compression algorithm. In this experiment, I created test data sets with sufficient record length.
The other primary objective is a more comprehensive comparison of compression algorithms, including gzip and bzip2. No measurements of SAS CPU time or total CPU time are reported here.
3 DESIGN OF MY EXPERIMENT
In my 7 x 4 factorial experiment, the first factor had seven levels: six file compression treatments and a control without compression. The second factor compared SAS data sets of different composition: all character variables, all numeric variables, half character and half numeric, and all numeric in which LENGTHs shorter than 8 were used for smaller values. The fixed model for my factorial experiment is
y_ijk = µ + τ_i + β_j + (τβ)_ij + ε_ijk
where the response y_ijk is the disk space used, µ is the overall mean, τ_i represents the file compression treatment, β_j represents the SAS variable type and length treatment, (τβ)_ij is the interaction between the file compression and SAS variable type composition treatments, and ε_ijk is the random error (Montgomery).
Table 3.1 shows the design of my experiment, with 10 replicates within each of the 28 treatment combinations. The order of treatment of the units within each block was not random.
Table 3.1 Assignment of Treatments

                                       Variable Type
File Compression
Treatment             Character   Numeric   Short Numeric   Character and Numeric
None                      10         10           10                  10
Sequential Format         10         10           10                  10
COMPRESS=CHAR             10         10           10                  10
COMPRESS=BINARY           10         10           10                  10
UNIX compress             10         10           10                  10
UNIX gzip                 10         10           10                  10
UNIX bzip2                10         10           10                  10
3.1 Choice Of Sample Size
Before I ran this experiment, I decided that I needed a reduction in file size of at least 30 percent to make file compression worthwhile. Having used ten replicates in my first file compression experiment, I again used ten replicates, creating 10 data set subsets containing only character variables. Because each of these control files is 4,136,960 bytes, a 30 percent reduction would reduce file size by 1,241,088 bytes, to no more than 2,895,872 bytes. Referring to the method Montgomery gives for sample size computation for two-factor factorial designs, it appears that ten replicates for each treatment may be considerably more than needed. However, it was convenient for me to continue working with the same number of replicates that I used in my first experiment.
4 METHODS
Because this experiment is designed to be a comprehensive test of file compression algorithms available in my
computing environment, I ran Tukey’s Studentized Range (HSD) test to make all pairwise comparisons of the
compression factor and the interaction of the compression factor with the variable type factor. I also tested for
differences from the control data sets.
4.1 Creating Test Data Sets
All test data sets were generated from decennial census data. The original data were stored in 52 files, one for each
of the 50 US states and one each for the District of Columbia and the territory of Puerto Rico. From these 52
files, ten were randomly selected. The second 10,000 observations were read from each of these ten files. Within
each control treatment, all ten subsets of the original files were identical in file size and observation length. Table
4.1 shows the size of each of the four sets of control files. Because my first data set, consisting only of character
variables, contained solely numeric data stored as character variables, it was possible for me to make all four test
data sets identical in data content. I created my second, third and fourth data sets by converting character variables
to numeric variables.
Table 4.1 The Four Variable Type Treatments

                        Number of   Number     File Size    Observation
Variable Type(s)        Variables   of Files   (bytes)      Length (bytes)
Character               140         10          4,136,960    407
Numeric                 140         10         11,345,920   1120
Short Numeric           140         10          5,046,272    496
Character and Numeric   140         10          7,675,904    760
In the original files, all variables were character. From the original data sets, I chose 140 variables that could be converted to numeric variables. All ten of these character-only files were 4,136,960 bytes in size. In an effort to control metadata size, I gave the numeric versions of the variables names of the same length as those of the original character variables. All work was done with bash shell scripts and 32-bit SAS 9.1.3 on an AMD Opteron™ processor running Linux.
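The conversion code itself is not shown in this paper. A minimal sketch, using hypothetical variable and data set names rather than the census variables, illustrates the idea: the INPUT function converts each character variable to numeric, and a LENGTH statement stores the shorter numerics used in the short numeric treatment.

/* hypothetical sketch of the character-to-numeric conversion;  */
/* variable and data set names are illustrative only            */
data numvers;
length num1 8 num2 4;       /* full-length and shortened numeric variables */
set charvers;
num1=input(char1, 12.);     /* convert a character value to numeric        */
num2=input(char2, 4.);      /* smaller values stored in 4 bytes            */
drop char1 char2;
run;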
4.2 File Compression Algorithms
I compared six file compression treatments to controls. I experimented with three SAS compression treatments: the data set options COMPRESS=CHAR and COMPRESS=BINARY, and the SAS Sequential format with a named pipe. I also experimented with the three file compression algorithms installed in my Linux computing environment: gzip, compress, and bzip2. In my earlier file compression experiments, I did not use bzip2 because my initial experience with it on a very large file wasn't successful. I tried bzip2 again, this time without getting a non-zero exit status. In this paper, all measures of file size were obtained from Linux. In one of my bash shell scripts, I used the command export oneoften to export an environment variable named oneoften that identifies the replicate. Subsequently, I created a named pipe with the command mknod pipech$oneoften p, as shown in a SAS Tech Support Sample (SAS Institute Inc., 2002). In this SAS log excerpt, the macro variable named state resolves to 01.
13   %let state=%sysget(oneoften);
14   libname mine '/adelines/portland/amd/';
NOTE: Libref MINE was successfully assigned as follows:
      Engine:        V9
      Physical Name: /adelines/portland/amd
15   libname fargo "pipech&state";
NOTE: Libref FARGO was successfully assigned as follows:
      Engine:        V9TAPE
      Physical Name: /adelines/portland/amd/pipech01
16   filename nwrpipe pipe "compress < pipech&state > char2&state..Z &";
17   data _null_;
18   infile nwrpipe;
19   run;
NOTE: The infile NWRPIPE is:
      Pipe command="compress < pipech01 > char201.Z &"
NOTE: 0 records were read from the infile NWRPIPE.
20   data fargo.a; set mine.char2&state;
21   run;
NOTE: There were 10000 observations read from the data set MINE.CHAR201.
NOTE: The data set FARGO.A has 10000 observations and 140 variables.
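The data set options for the two SAS compression treatments do not appear in the log excerpt above. A minimal sketch, with output data set names that are illustrative rather than taken from the experiment, looks like this.

/* hypothetical sketch of the COMPRESS=CHAR and COMPRESS=BINARY treatments;    */
/* only mine.char2&state appears in this paper; the output names are invented  */
data mine.char2c&state (compress=char);
set mine.char2&state;
run;
data mine.char2b&state (compress=binary);
set mine.char2&state;
run;

When a COMPRESS= data set option actually shrinks a data set, SAS writes a note to the log reporting the percentage by which its size was decreased.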
5 RESULTS
In the subdirectory where I wrote these 280 files, I ran the commands
ls -l *.Z *.sas7bdat *.gz *.bz2 > two80a.txt
cut -c31-42,57-63 two80a.txt > two80b.txt
giving me a list of all 280 files with their file sizes in bytes. File names were designed to identify the treatment(s) applied to the SAS data sets contained in the files. Consequently, this information was captured in the file named two80b.txt. I used two80b.txt as the input file to my SAS program two80b.sas, in which I analyzed the effects of the data type composition and file compression treatment factors on file size.
5.1 Compressed File Size
Table 5.1 shows treatment means without adjustment for the other factor or for the interaction between the factors. This table also shows 95 percent confidence intervals. Means and confidence limits shown are rounded to the nearest byte.
Table 5.1 Means of the File Compression Treatments

Variable Type           Compression Treatment   Number of    Mean        Lower      Upper
                                                Replicates   bytes       95 % CL    95 % CL
Character               Control                 10            4136960    .          .
Character               Sequential Format       10             545433     523951     566914
Character               COMPRESS=CHAR           10            2768077    2709445    2826708
Character               COMPRESS=BINARY         10            2428928    2381139    2476717
Character               UNIX compress           10             548270     526593     569948
Character               UNIX gzip               10             258361     250311     266411
Character               UNIX bzip2              10             172062     163778     180346
Numeric                 Control                 10           11345920    .          .
Numeric                 Sequential Format       10             874503     849524     899481
Numeric                 COMPRESS=CHAR           10            5365760    5342319    5389201
Numeric                 COMPRESS=BINARY         10            3783066    3726128    3840003
Numeric                 UNIX compress           10             882196     856271     908120
Numeric                 UNIX gzip               10             421452     409755     433149
Numeric                 UNIX bzip2              10             196340     188380     204301
Short Numeric           Control                 10            5046272    .          .
Short Numeric           Sequential Format       10             652381     627127     677634
Short Numeric           COMPRESS=CHAR           10            4048486    4036194    4060779
Short Numeric           COMPRESS=BINARY         10            3279258    3246682    3311833
Short Numeric           UNIX compress           10             661797     635528     688066
Short Numeric           UNIX gzip               10             326971     314961     338981
Short Numeric           UNIX bzip2              10             195753     185839     205667
Character and Numeric   Control                 10            7675904    .          .
Character and Numeric   Sequential Format       10             768866     746583     791150
Character and Numeric   COMPRESS=CHAR           10            3963290    3913810    4012769
Character and Numeric   COMPRESS=BINARY         10            3402957    3362810    3443104
Character and Numeric   UNIX compress           10             780778     758149     803406
Character and Numeric   UNIX gzip               10             382229     372331     392128
Character and Numeric   UNIX bzip2              10             204152     195921     212382
The results shown in Table 5.1 indicate that bzip2 minimizes compressed file size when run on SAS data sets
comprised only of character type variables. Because the upper 95 percent confidence limit for this combination of
factors does not overlap any of the lower confidence limits from the other treatments, running bzip2 on SAS data
sets containing only character type variables appears, at first glance, to be the treatment combination of choice.
5.2 Analysis of Variance
I analyzed my data using PROC GLM with file size in bytes as the response variable. I’ve listed my code below.
libname mine ’~/portland/’;
filename two80 ’~/portland/amd/two80b.txt’;
proc format;
value $compresn
’1’=’Control’
’2’=’Sequential Format’
’3’=’COMPRESS=CHAR’
’4’=’COMPRESS=BINARY’
’5’=’UNIX compress’
’6’=’gzip’
’7’=’bzip2’;
run;
title ’Two-Factor Factorial File Compression Experiment’;
data read280; infile two80 truncover lrecl=19;
length datatype $ 4 comptrmt $ 1;
input bytes 1-12 datatype 13-16 comptrmt 17;
run;
proc glm data=read280;
class datatype comptrmt;
model bytes=datatype|comptrmt;
lsmeans comptrmt/pdiff=controll(’Control’) cl adjust=dunnett;
lsmeans comptrmt datatype*comptrmt/adjust=tukey cl pdiff;
means datatype|comptrmt/tukey lines cldiff;
output out=anal280 residual=resbytes;
format comptrmt $compresn.;
run;
data mine.two80c; set anal280;
label resbytes=’Residuals from the Two-Factor Factorial Model’;
run;
proc means data=read280 mean clm maxdec=0;
class datatype comptrmt;
var bytes;
ways 2;
format comptrmt $compresn.;
run;
5.3 Statistical Analysis of the Model
Table 5.3.1 shows the analysis-of-variance table, part of the PROC GLM output, with the breakdown of the Total Sum of Squares for the model and error, followed by the Type I Sum of Squares for the individual terms in the model and the interaction term.
Table 5.3.1 ANOVA for File Size in Bytes

Source              DF    Sum of Squares     Mean Square     F Value    Pr > F
Model                27   1.9474229x10^15    7.2126776E13    51440.5    <.0001
Error               252   353339154783       1402139503.1
Corrected Total     279   1.9477763x10^15

R-Square    Coeff Var    Root MSE    bytes Mean
0.999819    1.610138     37445.15    2325586

Source               DF    Type I SS          Mean Square        F Value    Pr > F
datatype              3    1.1128992x10^14    3.7096641x10^13    26457.2    <.0001
comptrmt              6    1.5889437x10^15    2.6482395x10^14    188871     <.0001
datatype*comptrmt    18    2.4718935x10^14    1.3732742x10^13    9794.13    <.0001
Not only was the model statistically significant at the previously chosen significance level of 0.05, but so were all of the terms in it. To compare the treatment means with the four control treatments, consisting of the data sets composed of different variable types or combinations of variable types, I used Dunnett's modification of the t-test. All treatments were statistically significantly different from their controls at the 0.05 level. On average, SAS COMPRESS=CHAR reduced file size by 43 percent.
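The 43 percent figure is consistent with the least squares means in Table 5.4.1: (7051264 - 4036403) / 7051264 = 0.43, so the COMPRESS=CHAR files averaged about 43 percent smaller than the uncompressed controls.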
5.4 Multiple Comparisons
I performed Tukey's studentized range (HSD) test on the compression treatment means and on the interaction of compression treatment with variable type. Results are shown in Table 5.4.1. For the compression treatment main effect, all pairwise comparisons were statistically significant except the one between UNIX compress and the SAS Sequential Format. Since I programmed my Sequential Format treatment following the SAS sample code that uses the UNIX compress command, it is not surprising that these two compression treatments do not differ significantly. Indeed, the only surprise is the differences between the sizes of the individual replicate files and the treatment mean file sizes.
Table 5.4.1 Least Squares Means
Adjustment for Multiple Comparisons: Dunnett

                                                                   H0:LSMean=Control
Compression Treatment   bytes LSMEAN   95% Confidence Limits       Pr < t
COMPRESS=BINARY         3223552.00     3211892       3235212       <.0001
COMPRESS=CHAR           4036403.20     4024743       4048063       <.0001
Control                 7051264.00     7039604       7062924
Sequential Format        710295.60      698635        721956       <.0001
UNIX compress            718260.07      706600        729920       <.0001
UNIX bzip2               192076.77      180417        203737       <.0001
UNIX gzip                347253.50      335593        358914       <.0001
Most multiple comparisons of the interaction term gave statistically significant results. Interactions that were not
significant are listed in Table 5.4.2. Two combinations of interaction terms were of borderline statistical significance,
between the 0.05 and 0.10 levels. Both involved gzip and bzip2.
Table 5.4.2 Multiple Comparisons Between Interaction Terms

Not Statistically Significant
Character * Sequential Format                 Character * UNIX compress
Character * UNIX bzip2                        Character and Numeric * UNIX gzip
Character * UNIX bzip2                        Numeric * UNIX gzip
Character * UNIX bzip2                        Short Numeric * UNIX gzip
Character and Numeric * UNIX gzip             Numeric * UNIX gzip
Character and Numeric * UNIX bzip2            Short Numeric * UNIX bzip2
Numeric * UNIX bzip2                          Short Numeric * UNIX bzip2
Character and Numeric * Sequential Format     Character and Numeric * UNIX compress
Numeric * UNIX bzip2                          Character and Numeric * UNIX bzip2
Numeric * UNIX bzip2                          Character and Numeric * UNIX gzip
Character * UNIX gzip                         Short Numeric * UNIX gzip
Numeric * Sequential Format                   Numeric * UNIX compress
Short Numeric * Sequential Format             Short Numeric * UNIX compress

Borderline Statistical Significance
Character * UNIX gzip                         Numeric * UNIX bzip2
Character * UNIX gzip                         Short Numeric * UNIX bzip2
As noted above, it appears that, because I used the UNIX compress algorithm with the SAS Sequential Format,
there is no real difference between the SAS Sequential Format treatment and the UNIX compress treatment. From
these results, it appears that SAS programmers who use either the gzip or bzip2 file compression algorithms need
not be concerned with altering variable type or variable length for the purpose of saving disk space.
5.5 Residuals
Figure 1: Histogram of Residuals
For reference, I overlaid a fitted normal density curve, based on the sample mean and sample standard deviation, on
the histogram, shown in Figure 1. The normal probability plot, shown in Figure 2, gives a reference line based on
the sample mean and sample standard deviation. These plots show me that the residuals are distributed normally
enough for me to use analysis of variance.
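The SAS statements that produced Figures 1 and 2 are not shown in this paper. A minimal sketch, assuming the residuals saved in mine.two80c, could use PROC UNIVARIATE.

/* hypothetical sketch of the residual plots; not the code used for the paper */
proc univariate data=mine.two80c noprint;
var resbytes;
histogram resbytes / normal;                  /* histogram with fitted normal curve          */
probplot resbytes / normal(mu=est sigma=est); /* normal probability plot with reference line */
run;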
Figure 2: Normal Probability Plot
6 DISCUSSION
The three SAS file compression methods can be used to give other SAS users read access to compressed SAS data
sets without giving them write permission to these files. So far, this appears to be their only advantage. All
three UNIX file compression algorithms reduced file size more than either SAS COMPRESS=BINARY or SAS
COMPRESS=CHAR. Compared with gzip and bzip2, UNIX compress appears less effective for compression of
SAS data sets. Variable type is of little concern to bzip2 users. Compared with other file compression algorithms,
SAS COMPRESS=CHAR did not work well on decennial census data in a Linux environment.
Linux users have been advised to compress only small to medium-sized files with bzip2 (Wallen, 2002). For large files, Wallen recommends gzip. According to the man page for bzip2:
Larger block sizes give rapidly diminishing marginal returns. Most of the compression comes from the
first two or three hundred k of block size, a fact worth bearing in mind when using bzip2 on small
machines.
For gzip, the Linux man page gives a workaround for file sizes of 4 gigabytes or more. Apparently, gzip readily compresses files ranging from 2 gigabytes up to 4 gigabytes. I need to run another experiment using SAS data sets
of varying size to determine the approximate file size at which bzip2 should be abandoned for gzip.
7 REFERENCES
bzip2, a block-sorting file compressor, v1.0.2. Linux man page.
gzip. Linux man page.
Montgomery, D. C. Design and Analysis of Experiments, Fifth Edition. Hoboken, NJ: John Wiley & Sons, 2001.
SAS Institute Inc. 2002. Sample 720: compress – Write a Unix compressed SAS data set directly.
http://ftp.sas.com/techsup/download/sample/unix/dstep/compress.html Visited 23Jun2005.
SAS Institute Inc. 2004. SAS 9.1.3 Language Reference Dictionary, Volumes 1, 2, 3, and 4. Cary, NC: SAS Institute Inc.
SAS Institute Inc. 2004. SAS 9.1 Companion for UNIX Environments. Cary, NC: SAS Institute Inc.
SAS Institute Inc. SAS/STAT User's Guide, Version 8. Cary, NC: SAS Institute Inc., 1999. 3884 pp.
Wallen, J. Linux users: Know thy compression utilities. ZDNet Australia, June 24, 2002.
http://www.zdnet.com.au/insight/0,39023731,20266170,00.htm Visited 23Jun2005.
Wilcox, A. J. Disk Space Management Topics for SAS® in the Solaris™ Operating Environment. NorthEast SAS Users Group, Inc. Seventeenth Annual Conference Proceedings, 2004.
8 ACKNOWLEDGEMENTS
SAS® and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. AMD, the AMD Arrow logo, AMD Opteron, and combinations thereof, are trademarks of Advanced Micro Devices, Inc.
I am grateful to Bob Sands, US Census Bureau, Decennial Statistical Studies Division, for identifying SAS data
sets suitable for reading and subsetting for this experiment. Mark Keintz of The Wharton School, University of
Pennsylvania, advised me to try the SAS sequential engine for writing SAS data sets in sequential format on disk.
This paper reports the results of research and analysis undertaken by Census Bureau staff. It has undergone a
Census Bureau review more limited in scope than that given to official Census Bureau publications. This report is
released to inform interested parties of research and to encourage discussion. Yves Thibaudeau, US Census Bureau,
Statistical Research Division, kindly agreed to perform the technical review of this paper required for publication.
Finally, thanks to Charlie Zender and other LaTeX users who publish their solutions on the Web.
9 CONTACT INFORMATION
Your comments and questions are valued and encouraged. Contact the author at:
Adeline J. Wilcox
US Census Bureau
Decennial Management Division
4700 Silver Hill Road, Stop 7100
Washington, DC 20233-7100
Phone: (301) 763-9410
Email: [email protected]