Adapting the SPEC 2000 Benchmark Suite for Simulation-Based Computer Architecture Research

AJ KleinOsowski*, John Flynn, Nancy Meares, David J. Lilja
Department of Electrical and Computer Engineering
Minnesota Supercomputing Institute
University of Minnesota
200 Union Street SE, Minneapolis, MN 55455
*Corresponding Author: ajko@ece.umn.edu

Abstract

The large input datasets in the SPEC 2000 benchmark suite result in unreasonably long simulation times when detailed execution-driven simulators are used to evaluate future computer architecture ideas. To address this problem, we have an ongoing project to reduce the execution times of the SPEC 2000 benchmarks in a quantitatively defensible way. Upon completion of this work, we will have smaller input datasets for several SPEC 2000 benchmarks (the reduced datasets from this work will be available at http://www.arctic.umn.edu). The programs using our reduced input datasets will produce execution profiles that accurately reflect the program behavior of the full reference dataset, as measured using standard statistical tests. In the process of reducing and verifying the SPEC 2000 benchmark datasets, we also obtain instruction mix, memory behavior, and instructions per cycle characterization information for each benchmark program.

Index Terms: SPEC 2000, computer benchmarks, computer architecture, performance evaluation, SimpleScalar simulator

1 Introduction

Historically, computer architects have used benchmark programs representative of real programs to test their computer designs. As time progressed, a need arose for an effective and fair way to test and rate computer architecture features and performance. The benchmark suite from the Standard Performance Evaluation Corporation, commonly known as SPEC [1], is one example of a collection of programs used by the research community to test and rate current and future computer architecture designs [7, 8, 9].

As with most benchmark programs, the SPEC 1995 benchmark suite was developed with the then-current and next-generation computer systems in mind. Now, five years later, computer technology has advanced to the point where the 1995 benchmarks are no longer suitable. On current state-of-the-art computer systems, several of the SPEC 1995 benchmark programs execute in less than one minute [1]. In an effort to keep up with the rapid progress of computer systems, SPEC chose to dramatically increase the runtimes of the new SPEC 2000 benchmark programs [2, 3] compared to the runtimes of the SPEC 1995 programs.

These long runtimes are beneficial when testing performance on native hardware. However, when evaluating new computer architectures using detailed execution-driven simulators, the long runtimes of the SPEC 2000 benchmarks result in unreasonably long simulation times. Reasonable execution times for simulation-based computer architecture research come in a few flavors:

a) We want a short simulation time (on the order of minutes) to help debug the simulator and run quick tests.
b) We want an intermediate-length simulation time (a few hours) for more detailed testing of the simulator and to obtain preliminary performance results.
c) We want a complete simulation (of no more than a few days) using a large, realistic input set to obtain true performance statistics for the architecture design under test.

Items (a) and (b) do not have to match the execution profile of the original full input set that closely, although we would prefer that (b) be reasonably close.
For accurate architectural research simulations, however, we need (c) to match the profile of the original full input set to within an acceptable level as measured using an appropriate statistical test.

A typical approach to reducing runtime is to alter the input dataset. However, blindly reducing a dataset is bad practice, since the new input data may cause the execution profile to be completely different from the execution profile obtained with the original full input dataset. The SPEC committee chooses programs for the SPEC 2000 suite because of the way the programs stress hardware function units, caches, and memory systems. When the execution profile is altered, the benchmark program no longer tests the architecture characteristics it was designed to test.

In this work, we gather function-level execution profiling information for the SPEC 2000 benchmarks using the gprof [10] tool from the GNU suite of Unix tools [5]. This tool generates a function-level execution profile using a sampling technique while the program is running. The profiles generated by gprof give us a good indication of where the programs spend the majority of their execution time. We then experiment with various techniques to reduce the datasets. Each time we reduce a dataset, we re-run the execution profiles and recalculate our statistical test to check how close the execution profile is to the reference dataset execution profile. In the process of reducing and verifying our datasets, we also gather instruction mix, memory behavior, and instructions per cycle (IPC) characterization information.

2 Background and Motivation

SimpleScalar [4] is an execution-driven simulator package commonly used by the computer architecture research community. SimpleScalar includes several simulators, each of which provides a different level of detail and statistical information about the simulated hardware. For most intricate computer architecture studies, researchers use sim-outorder, the most detailed simulator in the SimpleScalar suite. Sim-outorder supports out-of-order instruction issue and execution, as well as full simulation of the cache system, branch predictor, and other function units. Each simulated machine instruction traverses a five-stage pipeline before being retired. Sim-outorder handles register renaming and result forwarding, and branch prediction is also implemented. At each stage of the pipeline, several dozen statistics and performance counters are updated for each simulated instruction.

This level of simulation detail requires over 40,000 host machine cycles to simulate each target machine cycle. On a 300 MHz Sun UltraSparc system, the simulated machine runs at 7 KIPS (kilo-instructions per second). At that speed, the 197.parser benchmark, with its full reference dataset of 301 billion instructions, would take almost seventeen months to simulate! An uninterrupted multiple-month simulation is unrealistic for most computer architecture research. Considering the combinatorial explosion that occurs when examining numerous benchmark programs with a variety of hardware configurations, the multiple-month simulation time needed for each run can result in years of CPU time to gather enough information for one simple study. Clearly, we need to find a way to reduce the simulation time.

Intricate computer architecture research typically requires all of the statistics generated by a simulator such as sim-outorder, so using a less detailed simulator is not an option. The best option is to find a quantitatively defensible way to reduce the input datasets, and, consequently, the runtimes, of the SPEC 2000 benchmarks.
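To make the scale of the problem concrete, the seventeen-month figure quoted above follows directly from the simulation rate and the instruction count. The short Python sketch below reproduces the arithmetic; the numbers come from the text, while the script itself is ours and purely illustrative.

```python
# Back-of-the-envelope check of the sim-outorder simulation time quoted above
# for 197.parser on a 300 MHz Sun UltraSparc.
SIM_RATE_IPS = 7_000                 # 7 KIPS simulated-instruction throughput
PARSER_REF_INSTRUCTIONS = 301e9      # full reference dataset: 301 billion instructions

seconds = PARSER_REF_INSTRUCTIONS / SIM_RATE_IPS
months = seconds / (3600 * 24 * 30)  # assuming 30-day months
print(f"about {months:.1f} months of continuous simulation")  # roughly 16.6 months
```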
3 Methodology

SPEC provides three standard datasets with its benchmark programs. These datasets are similar in spirit to our desired datasets, except on a much larger scale: the test dataset gives a quick test of the benchmark on the desired architecture, the train dataset gives an intermediate-length run, and the reference dataset gives the complete evaluation of the host computer system's performance.

We began our analysis by compiling the benchmarks with the SimpleScalar version of gcc (version 2.6.3) on a 300 MHz Sun UltraSparc running Solaris 2.7. This modified version of gcc builds binaries for the simulator architecture. We compiled at four different optimization levels, O0 through O3. Once we had binaries compiled for the SimpleScalar architecture, we ran each SPEC dataset (test, train, ref) with sim-fast, a simple simulator for determining instruction counts. (All of the SimpleScalar simulators we used are from the SimpleScalar version 3.0 suite.) The output from sim-fast gave us an idea of the size (in terms of instruction count) of each benchmark.

The sim-profile simulator, run with the SimpleScalar architecture binaries, gave us an instruction mix profile for the reference dataset. The -iclass flag gave us a breakdown of instruction totals by class (i.e., loads, stores, unconditional branches, conditional branches, integer computation, floating-point computation, and traps), and the -iprof flag gave us a count of each individual instruction executed. To characterize the memory behavior of the SPEC 2000 benchmarks, we used the sim-cache simulator, run with the SimpleScalar architecture binaries, to obtain the miss rates of the level 1 data cache. We ran simulations with six different level 1 data cache sizes: 16k, 32k, 64k, 128k, 256k, and 512k. In all of these simulations, we used a 32-byte block size and 4-way set associativity with an LRU replacement policy.

We then gathered a function-level execution profile of the reference dataset. We recompiled the benchmarks with the Sun UltraSparc version of gcc, using compiler flags to insert profile counting routines. Once again, we compiled at four optimization levels, O0 through O3. We ran each benchmark natively, then ran gprof to obtain an execution profile showing the fraction of total execution time spent in each function.

With base profiles in hand to document the program behavior of the SPEC reference dataset, we began reducing the datasets for each benchmark. Our method of reducing the dataset varied widely from benchmark to benchmark. For some benchmarks, we were able to directly truncate the input files. For benchmarks without input files, or with fixed problem parameters, we examined the benchmark source code to see if there was some way to alter the number of loop iterations or some other iteration factor to reduce the runtime. For still other benchmarks, we resorted to contacting the benchmark author and requesting smaller input files.

After each reduction attempt, we verified our function-level execution profile against the reference dataset profile by calculating a goodness-of-fit test using the chi-squared statistic [6]. For this calculation, we took the function-level execution profile for the reference dataset, removed all functions with a fraction of total execution time less than 0.01 percent, and compared the remaining functions' fractions of total execution time to their fractions of total execution time in our reduced dataset. The fraction of total execution time for each function becomes a term in the overall chi-squared statistic.
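As a rough illustration of this test, the sketch below computes the statistic from two profiles expressed as percent of total execution time per function (for example, as reported by gprof). The helper name, the example function names, and the example percentages are ours and purely hypothetical, and the degrees-of-freedom choice is our assumption; the paper takes its critical values from the tables in [6].

```python
# A minimal sketch of the goodness-of-fit check described above, assuming the
# profiles are dictionaries mapping function name -> percent of total
# execution time. All names and numbers below are hypothetical.
from scipy.stats import chi2

def profile_chi_squared(ref_profile, reduced_profile, cutoff=0.01):
    # Keep only reference functions above the 0.01 percent cutoff.
    kept = {f: pct for f, pct in ref_profile.items() if pct >= cutoff}
    # Each retained function contributes one (observed - expected)^2 / expected term.
    stat = sum((reduced_profile.get(f, 0.0) - e) ** 2 / e for f, e in kept.items())
    # Degrees of freedom: number of compared functions minus one (our assumption).
    critical = chi2.ppf(0.90, df=len(kept) - 1)
    return stat, critical

ref = {"try_place": 62.1, "comp_delta_bb": 21.4, "get_bb_cost": 9.7}     # hypothetical
lgred = {"try_place": 60.8, "comp_delta_bb": 22.9, "get_bb_cost": 10.1}  # hypothetical
stat, crit = profile_chi_squared(ref, lgred)
print(f"chi-squared = {stat:.2f}, 90% critical value = {crit:.2f}")
```

A reduced dataset passes when the computed statistic falls below the tabulated critical value at the 90 percent confidence level, which is exactly the criterion applied to the results in Section 5.1.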
Our goal is to have three sets of inputs: one input with a simulation time of a few minutes, one with a simulation time of a few hours, and one with a simulation time of no more than a few days. The small input set might not closely follow the distribution of function-level execution times in the reference execution; that is, its chi-squared statistic might be quite large. We would like the medium input set to be reasonably close to the reference input set execution profile, meaning its chi-squared statistic should not be too large. However, the large input set execution profile must be appropriately close to the reference input set execution profile as measured by the chi-squared test. Specifically, our goal is that the calculated chi-squared statistic should be smaller than the tabulated critical value of this statistic at the 90 percent confidence level.

4 The SPEC 2000 Benchmarks

The original SPEC 89 benchmark suite consisted of four programs written in C and six programs written in Fortran. In 1995, this set of programs was replaced by the SPEC 95 benchmark suite. This updated collection consisted of eight programs intended to be representative of typical integer-style computations and ten floating-point intensive programs.

The SPEC 2000 benchmarks are also divided into floating-point and integer categories. The twelve integer benchmark programs (CINT 2000), summarized in Table 1, are all written in C, with the exception of 252.eon, which is written in C++. Of the fourteen floating-point benchmark programs (CFP 2000), shown in Table 2, six are written in Fortran 77 (F77), four are written in Fortran 90 (F90), and the remaining four are written in C. A few of the programs from SPEC 95 have been carried over into the new SPEC 2000 suite, specifically gcc, perl, and vortex from the integer programs, and swim, mgrid, and applu from the floating-point programs. However, these programs have been modified for the SPEC 2000 suite. Consequently, the results produced by a program with the same name are not comparable across the two benchmark sets. More detailed information about the SPEC 2000 benchmark suite can be obtained from [1].

Table 1. The CINT 2000 Suite

  Benchmark     Lang  Category
  164.gzip      C     Compression
  175.vpr       C     FPGA Circuit Placement and Routing
  176.gcc       C     C Programming Language Compiler
  181.mcf       C     Combinatorial Optimization
  186.crafty    C     Chess Game Playing
  197.parser    C     Word Processing
  252.eon       C++   Computer Visualization
  253.perlbmk   C     PERL Programming Language
  254.gap       C     Group Theory, Interpreter
  255.vortex    C     Object Oriented Database
  256.bzip2     C     Compression
  300.twolf     C     Place and Route Simulator
Table 2. The CFP 2000 Suite

  Benchmark      Lang  Category
  168.wupwise    F77   Physics/Quantum Chromodynamics
  171.swim       F77   Shallow Water Modeling
  172.mgrid      F77   Multi-grid Solver: 3D Potential Field
  173.applu      F77   Parabolic/Elliptic Partial Differential Equations
  177.mesa       C     3D Graphics Library
  178.galgel     F90   Computational Fluid Dynamics
  179.art        C     Image Recognition/Neural Networks
  183.equake     C     Seismic Wave Propagation Simulation
  187.facerec    F90   Image Processing: Face Recognition
  188.ammp       C     Computational Chemistry
  189.lucas      F90   Number Theory/Primality Testing
  191.fma3d      F90   Finite-element Crash Simulation
  200.sixtrack   F77   High Energy Nuclear Physics Accelerator Design
  301.apsi       F77   Meteorology: Pollutant Distribution

5 Results

Our results document the program behavior of a subset of the SPEC 2000 benchmark suite. The reference run of several SPEC 2000 benchmarks actually consists of multiple runs of the same executable with different command line arguments or different input files. Therefore, we treat each of these different command lines as a separate sub-benchmark. For example, the first sub-benchmark of the 175.vpr benchmark reads a circuit from the input file, creates a placement of the nodes in the circuit, then saves this placement to a file. The second sub-benchmark of 175.vpr reads the placement file created by the first sub-benchmark, creates a routing among the nodes in the placement, then saves the routing to a file.

Our results document the 175.vpr benchmark, with its place and route sub-benchmarks; the 164.gzip benchmark, with its graphic, program, source, random, and log sub-benchmarks (same command line, different types of input files); and the 197.parser benchmark, which has only one command line. All three of these benchmarks are from the SPEC CINT 2000 suite. We chose 175.vpr, 164.gzip, and 197.parser to begin our analysis because we felt they were a good sampling of the programs in the CINT 2000 suite. We will continue our analysis with the rest of the programs in the CINT 2000 suite and the programs in the CFP 2000 suite as time permits.

We use the function-level execution profile to determine the goodness-of-fit of our large reduced dataset compared to the full reference dataset of SPEC 2000. For documentation and characterization, we include instruction mix profiles and level 1 cache behavior of both the full reference dataset and our large reduced dataset, as well as instructions per cycle (IPC) values for the large reduced dataset.

5.1 Function-level execution profile results

Table 3 shows the goodness-of-fit values (as calculated with the chi-squared statistic and the function-level execution profile) for our eight sub-benchmarks and their large (LgRed), medium (MdRed), and small (SmRed) reduced input datasets at optimization levels O0 through O3. Our results show that, with the exception of parser, we achieved our goal of having reasonably sized large input datasets whose execution profiles closely mimic the execution profile of the full reference dataset. In the case of the large reduced dataset for the Place, Program, and Random sub-benchmarks at optimization level O3, we obtained a single-digit chi-squared value. Such a small chi-squared value shows a particularly good correlation between our large reduced dataset and the full reference dataset.
Thus, for seven out of eight sub-benchmarks, we can say with 90 percent confidence that the differences in the execution profiles of the large reduced dataset and the full reference dataset are no larger than what would be expected due to random fluctuations. At optimization level O3, our medium reduced dataset had an acceptable goodness-of-fit in five out of the eight cases, and our small reduced dataset had an acceptable goodness-of-fit in four out of the eight cases. As we stated in Section 1, we would like our medium and small reduced datasets to at least roughly mimic our full reference dataset. However, the goal of the medium and small reduced datasets is to provide mid-length and short simulation times, not to mimic the full reference dataset. (Simulation time results are discussed in Section 5.4.)

We gathered function-level execution profiles at four different optimization levels to see if we could discern any patterns across optimization levels. Our results show similar goodness-of-fit values for each reduced dataset across optimization levels. We conclude that each reduced dataset, at each optimization level, behaves in a consistent manner relative to the full reference dataset.

5.2 Instruction mix profile results

Figure 1 shows the instruction mix breakdown for the reference (Ref) dataset and our large reduced (LgRed) dataset at optimization levels O0 through O3. In the interest of brevity, we did not include the instruction mix results for the medium and small reduced datasets.

Our results show a surprisingly large discrepancy in the instruction mix between the full reference and our large reduced datasets. Since the goodness-of-fit values for the function-level execution profiles (discussed in Section 5.1) were acceptable, we expected the instruction mix of the large reduced dataset to scale down proportionately from the full reference instruction mix. Instead, when we applied the same goodness-of-fit calculation to the instruction mix histograms, we obtained chi-squared values several times larger than the critical value at the 90 percent confidence level.

The large reduced datasets show roughly consistent instruction mix distributions across optimization levels. That is, the percentage breakdown of loads, stores, unconditional branches, conditional branches, integer computation, floating-point computation, and traps is roughly the same across optimization levels. The reference dataset instruction mix distribution, on the other hand, varied widely across optimization levels. For the reference dataset of each benchmark, three out of the four optimization levels have similar instruction mix values. In the Place, Route, Graphic, Source, Log, and Parser sub-benchmarks, the instruction mix distribution at O0 differed significantly from the other optimization levels. In the Program and Random sub-benchmarks, the instruction mix distribution at O1 differed significantly from the other optimization levels. We were not surprised to see differences in instruction mix distributions across optimization levels for the reference dataset. However, we are unclear why these differences did not carry through to the large reduced dataset. We welcome other researchers to investigate and explain the instruction mix discrepancies between the full reference and our large reduced datasets.
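For concreteness, the goodness-of-fit calculation applied here is the same one sketched in Section 3, only with the seven instruction classes in place of function names. The percentages below are invented for illustration and do not reproduce the values plotted in Figure 1.

```python
# Reusing the profile_chi_squared helper sketched in Section 3 on the seven
# instruction-class percentages. These numbers are made up for illustration.
ref_mix = {"load": 24.0, "store": 11.0, "uncond branch": 2.0,
           "cond branch": 13.0, "int computation": 49.5,
           "fp computation": 0.2, "trap": 0.3}
lgred_mix = {"load": 20.0, "store": 9.0, "uncond branch": 3.0,
             "cond branch": 16.0, "int computation": 51.5,
             "fp computation": 0.2, "trap": 0.3}
stat, crit = profile_chi_squared(ref_mix, lgred_mix)
print(stat > crit)  # a statistic above the critical value signals a mix mismatch
```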
5.3 Cache behavior results

Figure 2 shows the level 1 data cache miss rates for the reference (Ref) dataset and our large reduced (LgRed) dataset at optimization level O3. Once again, in the interest of brevity, we did not include our results for the small and medium reduced datasets. Since we saw similar behavior across optimization levels in the execution profiles, we limited our cache behavior simulations to optimization level O3.

The Place and Route sub-benchmarks show the behavior we expected: the miss rate curve for the large reduced dataset is similar to the miss rate curve for the full reference dataset, except that the large reduced dataset curve is shifted to the left. We speculate that the smaller input dataset has a smaller footprint in memory. Therefore, as the data cache gets larger, the entire memory image fits into cache and the miss rate approaches zero. That is, for the Place and Route sub-benchmarks, the large reduced dataset produces a scaled-down version of the full reference dataset cache behavior.

The Graphic, Program, Source, Random, Log, and Parser sub-benchmarks show surprisingly similar miss rates between the full reference and large reduced datasets. In fact, for the Graphic and Program sub-benchmarks, the full reference dataset miss rate is actually lower than the large reduced dataset miss rate. At present, we do not have an explanation for this unexpected result. We speculate that the reduced input dataset reduces the number of memory operations performed while maintaining a similar memory footprint.

5.4 Impact on simulation time

Table 4 shows the instruction counts and estimated simulation times (using sim-outorder) for our large (LgRed), medium (MdRed), and small (SmRed) reduced datasets compared to the full reference (Ref) dataset. The estimated simulation times were calculated by running several small simulations on a 300 MHz UltraSparc, noting the progressive decrease in execution rate as the dataset got larger, and then extrapolating the execution rate to very large datasets. In the end, we used two execution rate factors to calculate the estimated simulation times:

1) For the small, medium, and large reduced datasets, we use an execution rate of 60 million instructions per hour.
2) For the full reference dataset, we use an execution rate of 25 million instructions per hour.

The simulation times in Table 4 show that for the Place, Route, and Parser sub-benchmarks, we accomplished our goal of having a small dataset with a simulation time on the order of minutes, a medium dataset with a simulation time on the order of hours, and a large dataset with a simulation time on the order of a few days. In the case of the Graphic, Program, Source, Random, and Log sub-benchmarks, we have an appropriate length run for the large reduced dataset, but our medium and short simulation times are much longer than we would like.

Gzip performs compression on a buffer of data in memory. The size of this buffer, in megabytes, is determined by an integer command line argument. This integer parameter restriction means the smallest buffer we can use is one megabyte. If we specify an input file smaller than one megabyte, gzip duplicates the input file in memory until it obtains one megabyte of data. This data duplication results in large instruction counts and long simulation times. The only way to obtain a very short simulation of gzip would be to modify the source code and remove the data duplication function call.
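To make the relationship between the instruction counts and the estimated hours in Table 4 explicit, the small helper below applies the two rate factors described above. The threshold and rates come from the text and the Table 4 caption; the function itself is ours and purely illustrative.

```python
# Estimated sim-outorder simulation time from an instruction count, using the
# two execution-rate factors stated above: 25 million instructions/hour for
# runs over 10 billion instructions, 60 million instructions/hour otherwise.
def estimated_sim_hours(instructions_millions):
    rate = 25.0 if instructions_millions > 10_000 else 60.0  # million instructions/hour
    return instructions_millions / rate

print(round(estimated_sim_hours(2_717), 1))  # 197.parser LgRed: ~45.3 hours
print(round(estimated_sim_hours(301_361)))   # 197.parser reference: ~12,054 hours
```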
5.5 IPC Results

Figure 3 shows the instructions per cycle (IPC) for the large reduced datasets. As stated in Section 1, most simulators which generate IPC counts and other timing information run several orders of magnitude slower than smaller, less detailed simulators. Given our current tools and the time available, we were not able to obtain IPC counts for the reference input sets. The IPC counts for the large reduced datasets are included for documentation, although they cannot yet be compared to the IPC counts of the full reference input sets.

We determined the IPC values by using sim-outorder with the default configuration. This configuration uses a 4-instruction-per-cycle issue width, 4 integer ALUs, 1 integer multiplier, 4 floating-point ALUs, and 1 floating-point multiplier. With this configuration, seven out of the eight sub-benchmarks sustain an execution rate of greater than one instruction per cycle. The one sub-benchmark which fell short of one instruction per cycle, Place, still performed fairly well with an IPC of 0.9341.

Table 3. Goodness-of-fit chi-squared statistic values for the 175.vpr, 164.gzip, and 197.parser large (LgRed), medium (MdRed), and small (SmRed) reduced dataset function-level execution profiles compared to the full SPEC reference datasets. *90% Conf = critical value of the chi-squared statistic at the 90 percent confidence level as tabulated in [6].

(a) Place (Vpr) Execution Profile
  Opt Level   LgRed    MdRed    SmRed    90% Conf*
  O0           5.44    24.59        -        24.77
  O1           8.51    36.15        -        32.01
  O2          11.33    20.97        -        29.62
  O3           9.53    32.98        -        29.62

(b) Route (Vpr) Execution Profile
  Opt Level   LgRed    MdRed      SmRed    90% Conf*
  O0          27.35   391.68     6367.8        78.85
  O1          64.02   907.96    20870.9        86.63
  O2          55.25   749.83   111190.9        82.19
  O3          56.01  1143.62    62637.0        85.53

(c) Graphic (Gzip) Execution Profile
  Opt Level   LgRed    MdRed    SmRed    90% Conf*
  O0          11.12    27.38   434.11        34.38
  O1          16.85    37.97   410.91        43.72
  O2          13.14    37.73   397.17        41.41
  O3          18.75    45.30   431.39        44.88

(d) Program (Gzip) Execution Profile
  Opt Level   LgRed    MdRed    SmRed    90% Conf*
  O0           3.74     6.61    10.52        37.92
  O1           3.36    11.81     9.04        40.26
  O2           2.43     6.26     8.75        41.41
  O3           3.57     7.31    11.44        37.92

(e) Source (Gzip) Execution Profile
  Opt Level   LgRed    MdRed    SmRed    90% Conf*
  O0          15.58     5.40     8.65        40.26
  O1          24.56     6.46     3.75        39.09
  O2          16.34     7.39     6.23        37.92
  O3          13.21    12.48    12.58        36.74

(f) Random (Gzip) Execution Profile
  Opt Level   LgRed    MdRed    SmRed    90% Conf*
  O0           4.82     2.90     2.75        35.56
  O1           5.84     5.10     5.89        34.38
  O2           6.48    12.91     5.99        39.09
  O3           5.77    15.23    12.15        36.74

(g) Log (Gzip) Execution Profile
  Opt Level   LgRed    MdRed    SmRed    90% Conf*
  O0          15.56    16.85    10.12        36.74
  O1          33.69    30.61    72.03        36.74
  O2          14.56     8.29    23.96        41.41
  O3          29.67    12.07    14.26        37.92

(h) Parser Execution Profile
  Opt Level   LgRed    MdRed     SmRed    90% Conf*
  O0         283.53   366.30    6692.6       182.26
  O1         212.80   203.68    1742.2       169.35
  O2         310.30   389.64    7072.2       173.66
  O3         536.40   422.74   14454.2       122.85
[Figure 1 omitted: stacked-bar charts of the instruction mix (loads, stores, unconditional branches, conditional branches, integer computation, floating-point computation, traps) for each sub-benchmark, reference versus large reduced dataset.]

Figure 1. Instruction mix breakdown for the reference (Ref) and large reduced (LgRed) datasets for the 175.vpr, 164.gzip, and 197.parser SPEC 2000 benchmarks at optimization levels O0 through O3.

[Figure 2 omitted: level 1 data cache miss rate versus cache size (16k through 512k) for each sub-benchmark, reference versus large reduced dataset.]

Figure 2. Level 1 data cache miss rates for the reference (Ref) and the large reduced (LgRed) datasets for the 175.vpr, 164.gzip, and 197.parser SPEC 2000 benchmarks compiled at optimization level O3.

                      Reference Dataset     LgRed Dataset        MdRed Dataset        SmRed Dataset
  Benchmark           Inst      Sim Time    Inst     Sim Time    Inst     Sim Time    Inst     Sim Time
                      (millions) (hours)    (millions) (hours)   (millions) (hours)   (millions) (hours)
  175.vpr, place       109,752    4,390      1,521     25.3        217      3.6          18      0.3
  175.vpr, route        97,273    3,891        881     14.7         94      1.6           6      0.1
  164.gzip, graphic     81,270    3,251      1,370     22.8        964     16.1       3,221     53.7
  164.gzip, program    116,010    4,640      1,958     32.6      1,812     30.2       2,606     43.4
  164.gzip, source      63,172    2,526      1,181     19.7      1,149     19.2       1,112     18.5
  164.gzip, random      64,372    2,575      1,065     17.8      1,066     17.8       1,066     17.8
  164.gzip, log         33,952    1,358        531      8.9        531      8.9         526      8.8
  197.parser           301,361   12,054      2,717     45.3        227      3.8          41      0.7

Table 4. Instruction counts and estimated simulation times (using sim-outorder) for the SPEC 2000 benchmarks (compiled at optimization level O3) on a moderately loaded 300 MHz Sun UltraSparc. Estimated simulation times are calculated using a 25 million instructions per hour factor for simulations over 10 billion (10,000 million) instructions and a 60 million instructions per hour factor for simulations under 10 billion instructions.
[Figure 3 omitted: bar chart of IPC values for the eight sub-benchmarks, ranging from 0.9341 (Place) to 1.7138.]

Figure 3. Instructions per cycle (IPC) as measured using the SimpleScalar sim-outorder simulator for the 175.vpr, 164.gzip, and 197.parser SPEC 2000 benchmarks when compiled with optimization level O3 and run with the large reduced (LgRed) input datasets.

6 Conclusions and Future Work

In this work, we compare the behavior of the SPEC 2000 benchmark programs when executed with different input datasets by looking at the fraction of total execution time spent in each function as measured by the gprof profiling tool, the instruction mix as measured with the sim-profile simulator from the SimpleScalar suite of simulators, and the level 1 cache miss rates as measured with the sim-cache simulator from SimpleScalar. These metrics give us a rough indication of what is going on in the simulated hardware while the program executes. More detailed study is needed to see if our datasets correctly depict the second-order effects of the reference dataset [11]. In particular, further study is needed to examine the instruction ordering effects in the pipeline when the programs are executed with each dataset.

Another, perhaps higher priority, course of action is to better reduce the data for the 197.parser benchmark. If a close-to-exact representation of the full reference dataset is needed for 197.parser, our dataset is not a good choice. However, we speculate that with a different reduction technique, we can obtain a dataset which more closely mimics the full reference dataset. In the interest of shortened simulation time, we also need to use a different reduction technique on 164.gzip. We have a good gzip dataset for detailed simulations; however, we were unable to obtain a dataset for quick tests of our simulator.

Our work to date encompasses only a fraction of the programs in the SPEC 2000 benchmark suite. In the weeks and months to come, we plan to apply the same methodology stated in Section 3 to the rest of the C programs in the CINT 2000 and CFP 2000 suites. (Our tools do not work with C++ or Fortran programs, so we will limit our present analysis to the programs written in C.)

Our execution profile and cache miss results show that it is possible to obtain small datasets that reasonably mimic the behavior of the full reference datasets of the SPEC 2000 benchmarks. These reduced datasets are not perfect, though. We discovered that one tradeoff for smaller simulations is that the instruction mix produced with our reduced datasets is vastly different from the instruction mix of the full reference dataset. Our future work will determine if there are other major differences between our large reduced dataset and the full reference dataset.

We anticipate using our reduced datasets with detailed execution-driven simulators to evaluate hardware tradeoffs in future computer architecture studies. Our datasets give us reasonable-length simulations, thereby allowing us to run many permutations of hardware configurations and complete computer architecture research studies in a reasonable amount of time. While our reduced input datasets do not produce execution profiles that perfectly match the execution profiles produced with the original reference input datasets, they do provide reasonably similar profiles.
Furthermore, the program characterizations presented in this paper highlight the important areas in which the program behavior with our reduced datasets and the program behavior with the original reference dataset differ. Understanding these differences is important in helping researchers understand the performance results they obtain when using these benchmark programs to evaluate new architectural ideas.

Acknowledgements

This work was supported in part by National Science Foundation grants EIA-9971666 and CCR-9900605, NSF Research Experiences for Undergraduates grants EEC-9619750 and CCR-9610379, and by the Minnesota Supercomputing Institute.

References

[1] SPEC Benchmark Suite. Information available at http://www.spec.org
[2] Henning, J.L., "SPEC CPU2000: Measuring CPU Performance in the New Millennium," IEEE Computer, Vol. 33, No. 7, July 2000, pp. 28-35.
[3] SPEC CPU2000 Press Release FAQ, available at http://www.spec.org/osg/cpu2000/press/faq.html
[4] Burger, D. and T. Austin, "The SimpleScalar Tool Set, Version 3.0," University of Wisconsin-Madison, Computer Sciences Department. Distribution web page located at http://www.cs.wisc.edu/~mscalar/simplescalar.html
[5] GNU Unix Toolset. Information and binaries available at http://www.gnu.org
[6] Lilja, D.J., Measuring Computer Performance: A Practitioner's Guide, Cambridge University Press, New York, NY, 2000.
[7] Giladi, R. and N. Ahituv, "SPEC as a Performance Evaluation Measure," IEEE Computer, Vol. 28, No. 8, August 1995, pp. 33-42.
[8] Mirghafori, N., M. Jacoby, and D. Patterson, "Truth in SPEC Benchmarks," ACM Computer Architecture News, Vol. 23, No. 5, December 1995, pp. 34-42.
[9] Dujmovic, J. and I. Dujmovic, "Evolution and Evaluation of SPEC Benchmarks," Performance Evaluation Review, Vol. 26, No. 3, December 1998, pp. 2-9.
[10] Graham, S.L., P.B. Kessler, and M.K. McKusick, "gprof: A Call Graph Execution Profiler," Proceedings of the SIGPLAN '82 Symposium on Compiler Construction, SIGPLAN Notices, Vol. 17, No. 6, June 1982, pp. 120-126.
[11] Iyengar, V., L. Trevillyan, and P. Bose, "Representative Traces for Processor Models with Infinite Cache," Proceedings of the Second International Symposium on High Performance Computer Architecture (HPCA), 1996, pp. 62-72.