Adapting the SPEC 2000 Benchmark Suite for Simulation-Based
Computer Architecture Research
AJ KleinOsowski*, John Flynn, Nancy Meares, David J. Lilja
Department of Electrical and Computer Engineering
Minnesota Supercomputing Institute
University of Minnesota
200 Union Street SE
Minneapolis, MN 55455
*Corresponding Author: ajko@ece.umn.edu
Abstract
The large input datasets in the SPEC 2000 benchmark
suite result in unreasonably long simulation times when
using detailed execution-driven simulators for evaluating
future computer architecture ideas. To address this
problem, we have an ongoing project to reduce the
execution times of the SPEC 2000 benchmarks in a
quantitatively defensible way. Upon completion of this
work, we will have smaller input datasets for several
SPEC 2000 benchmarks (the reduced datasets will be made
available at http://www.arctic.umn.edu). The programs using our
reduced input datasets will produce execution profiles
that accurately reflect the program behavior of the full
reference dataset, as measured using standard statistical
tests. In the process of reducing and verifying the SPEC
2000 benchmark datasets, we also obtain instruction mix,
memory behavior, and instructions per cycle
characterization information about each benchmark
program.
Index Terms — SPEC 2000, computer benchmarks, computer architecture, performance evaluation, SimpleScalar simulator
1 Introduction
Historically, computer architects used benchmark
programs representative of real programs to test their
computer designs. As time progressed, a need arose for
an effective and fair way to test and rate computer
architecture features and performance. The benchmark
suite from the Standard Performance Evaluation
Corporation, commonly known as SPEC [1], is one
example of a collection of programs used by the research
community to test and rate current and future computer
architecture designs [7, 8, 9].
As with most benchmark programs, the SPEC 1995
benchmark suite was developed with the then-current and
next-generation computer systems in mind. Now, five
years later, computer technology has advanced to the
point where the 1995 benchmarks are no longer suitable.
On current state-of-the-art computer systems, several of
the SPEC 1995 benchmark programs execute in less than
one minute [1]. In an effort to keep up with the rapid
progress of computer systems, SPEC chose to
dramatically increase the runtimes of the new SPEC 2000
benchmark programs [2, 3], as compared to the runtimes
of the SPEC 1995 benchmark programs. These long
runtimes are beneficial when testing performance on
native hardware.
However, when evaluating new
computer architectures using detailed execution-driven
simulators, the long runtimes of the SPEC 2000
benchmarks result in unreasonably long simulation times.
Reasonable execution times for simulation-based
computer architecture research come in a few flavors:
a) We want a short simulation time (on the order of
minutes) to help debug the simulator and do quick
tests.
b) We want an intermediate-length simulation time (of a
few hours) for more detailed testing of the
simulator and to obtain preliminary performance
results.
c) We want a complete simulation (of no more than
a few days) using a large, realistic input set to
obtain true performance statistics for the
architecture design under test.
Items (a) and (b) do not have to match the execution
profile of the original full input set that closely, although
we would prefer that (b) be reasonably close. For accurate
architectural research simulations, however, we need (c)
to match the profile of the original full input set to within
an acceptable level as measured using an appropriate
statistical test.
A typical approach to reducing runtime is to alter the
input dataset. However, blindly reducing a dataset is bad
practice since the new input data may cause the execution
profile to be completely different from the execution
profile obtained with the original full input dataset. The
SPEC committee chose programs for the SPEC 2000
suite based on how they stress hardware
function units, caches, and memory systems. When the
execution profile is altered, the benchmark program no
longer tests the architecture characteristics it was
designed to test.
In this work, we gather function-level execution
profiling information for the SPEC 2000 benchmarks
using the gprof [10] tool from the GNU suite of Unix
tools [5]. This tool generates a function-level execution
profile using a sampling technique while the program is
running. The profiles generated by gprof give us a good
indication of where the programs spend the majority of
their execution time. We then experiment with various
techniques to reduce the datasets. Each time we reduce
the dataset, we re-run the execution profiles and
recalculate our statistical test to check how close the
execution profile is to the reference dataset execution
profile. In the process of reducing and verifying our
datasets, we also gather instruction mix, memory
behavior, and instructions per cycle (IPC) characterization
information.
2 Background and Motivation
SimpleScalar [4] is an execution-driven simulator
package commonly used by the computer architecture
research community.
SimpleScalar includes several
simulators, each of which provides a different level of
detail and statistical information about the simulated
hardware.
For most intricate computer architecture
studies, researchers use sim-outorder, the most detailed
simulator in the SimpleScalar suite.
Sim-outorder supports out-of-order instruction issue
and execution, as well as full simulation of the cache
system, branch predictor, and other function units. Each
simulated machine instruction traverses a five stage
pipeline before being retired.
Sim-outorder handles
register renaming and result forwarding.
Branch
prediction is also implemented in sim-outorder.
At each stage of the pipeline several dozen statistics
and performance counters are updated for each simulated
instruction. This level of simulation detail requires over
40,000 host machine cycles to simulate each target
machine cycle. On a 300 MHz Sun UltraSparc system,
the simulated machine runs at 7 KIPS (kilo-instructions
per second). At that speed, the 197.parser benchmark,
with the full reference dataset of 301 billion instructions,
would take almost seventeen months to simulate!
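As a back-of-the-envelope check of that estimate: at roughly 40,000 host cycles per simulated cycle, a 300 MHz host sustains on the order of 7,500 simulated cycles per second, or about 7 KIPS assuming roughly one instruction per simulated cycle, so

\[
\frac{301 \times 10^{9}\ \mathrm{instructions}}{7{,}000\ \mathrm{instructions/second}}
\;\approx\; 4.3 \times 10^{7}\ \mathrm{seconds}
\;\approx\; 500\ \mathrm{days}
\;\approx\; 16.5\ \mathrm{months}.
\]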
An uninterrupted multiple-month simulation is
unrealistic for most computer architecture research.
Considering the combinatorial explosion that occurs when
examining numerous benchmark programs with a variety
of hardware configurations, the multiple-month
simulation time needed for each simulation can result in
years of CPU time required to gather enough information
for one simple study. Clearly, we need to find a way to
reduce the simulation time.
Intricate computer architecture research typically
requires all of the statistics generated by a simulator such
as sim-outorder. With this need in mind, using a less
detailed simulator is not an option. The best option is to
find a quantitatively defensible way to reduce the input
datasets, and, consequently, the runtimes, of the SPEC
2000 benchmarks.
3 Methodology
SPEC provides three standard datasets with their
benchmark programs. These datasets are similar to our
desired datasets except on a much larger scale. Namely,
the test dataset gives a quick test of the benchmark on the
desired architecture, the train dataset gives an
intermediate length run, and the reference dataset gives
the complete evaluation of the host computer system’s
performance.
We began our analysis by compiling the benchmarks
with the SimpleScalar version of gcc (version 2.6.3) on a
300 MHz Sun UltraSparc running Solaris 2.7.
This
modified version of gcc builds binaries for the simulator
architecture. We compiled at four different optimization
levels, O0 through O3. Once we had binaries compiled
for the SimpleScalar architecture, we ran each SPEC
dataset (test, train, ref) with sim-fast, a simple simulator
for determining instruction counts.
(All of the
SimpleScalar simulators we used are from the
SimpleScalar version 3.0 suite.) The output from sim-fast
gave us an idea of the size (in terms of instruction count)
of each benchmark.
The sim-profile simulator, run with the SimpleScalar
architecture binaries, gave us an instruction mix profile
for the reference dataset. The -iclass flag gave us a
breakdown of instruction totals by class (i.e., loads, stores,
unconditional branches, conditional branches, integer
computation, floating-point computation, and traps) and
the -iprof flag gave us a count of each individual
instruction executed.
In an effort to characterize the memory behavior of the
SPEC 2000 benchmarks, we used the sim-cache
simulator, run with the SimpleScalar architecture binaries,
to obtain the miss rates of the level 1 data cache. We ran
simulations with six different level 1 data cache sizes,
16k, 32k, 64k, 128k, 256k, and 512k. In all of these
simulations, we used a 32 byte block size and 4-way set
associativity with an LRU replacement policy.
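For reference, the number of sets in each of these configurations follows directly from the parameters above (assuming the sizes denote total capacity); the smallest cache, for example, works out to

\[
\mathrm{sets} \;=\; \frac{\mathrm{capacity}}{\mathrm{block\ size} \times \mathrm{associativity}}
\;=\; \frac{16 \times 1024\ \mathrm{bytes}}{32\ \mathrm{bytes} \times 4} \;=\; 128.
\]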
We proceeded to gather a function-level execution
profile of the reference dataset. We recompiled the
benchmarks with the Sun UltraSparc version of gcc using
compiler flags to insert profile counting routines. Once
again, we compiled at four optimization levels, O0
through O3. We ran the benchmark natively, then ran
gprof to obtain an execution profile showing the fraction
of total execution time spent in each function.
Now that we had base profiles to document the
program behavior of the SPEC reference dataset, we
began reducing the datasets for each benchmark. Our
method of reducing the dataset varied widely from
benchmark to benchmark. For some benchmarks, we
were able to directly truncate the input files. For
benchmarks without input files, or with fixed problem
parameters, we examined the benchmark source code to
see if there was some way to alter the number of loop
iterations or some other iteration factor to reduce the
runtime. For still other benchmarks, we resorted to
contacting the benchmark author and requesting smaller
input files.
After each reduction attempt, we verified our function-level
execution profile against the reference dataset
profile by calculating a goodness-of-fit test using the
chi-squared statistic [6]. For this calculation, we looked at
the function-level execution profile for the reference
dataset, removed all functions with a fraction of total
execution time less than 0.01 percent, and compared the
remaining functions’ fraction of total execution time to
their fraction of total execution time in our reduced
dataset. The fraction of total execution time for each
function becomes a term in the overall chi-squared
statistic.
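For concreteness, the statistic takes the standard goodness-of-fit form (our reading of the procedure described above, with the reference fractions serving as the expected values):

\[
\chi^{2} \;=\; \sum_{i=1}^{k} \frac{(O_{i} - E_{i})^{2}}{E_{i}},
\]

where E_i is the fraction of total execution time spent in function i with the full reference dataset, O_i is the corresponding fraction with the reduced dataset, and k is the number of functions remaining after the 0.01 percent cutoff.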
Our goal is to have three sets of inputs, one input with
a simulation time of a few minutes, one input with a
simulation time of a few hours, and one input with a
simulation time of no more than a few days. The small
input set might not closely follow the distribution of
function-level execution times in the reference execution.
That is, its chi-squared statistic might be quite large. We
would like the medium input set to be reasonably close to
the reference input set execution profile. This means the
medium input set chi-squared statistic should not be too
large. However, the large input set execution profile
must be appropriately close to the reference input set
execution profile as measured by the chi-squared test.
Specifically, our goal is that the calculated chi-squared
statistic should be smaller than the tabulated critical value
of this statistic at the 90 percent confidence level.
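To make the acceptance test concrete, the short C sketch below computes the statistic from two such profiles and compares it against a caller-supplied critical value. It is not taken from our measurement scripts; the function name and the hard-coded example profile are purely illustrative.

    #include <stdio.h>

    /* Chi-squared goodness-of-fit over per-function fractions of execution time.
     * ref[i] = fraction of total execution time spent in function i with the
     *          full reference dataset (functions below 0.01 percent removed,
     *          as described above);
     * red[i] = fraction of total execution time spent in the same function
     *          with the reduced dataset. */
    double chi_squared(const double *ref, const double *red, int nfuncs)
    {
        double chi2 = 0.0;
        for (int i = 0; i < nfuncs; i++) {
            double diff = red[i] - ref[i];
            chi2 += (diff * diff) / ref[i];   /* one term per function */
        }
        return chi2;
    }

    int main(void)
    {
        /* Illustrative three-function profile; these are not real SPEC data. */
        double ref[]  = { 0.60, 0.30, 0.10 };
        double red[]  = { 0.55, 0.33, 0.12 };
        double crit90 = 4.61;   /* chi-squared critical value, 90% level, 2 d.o.f. */

        double chi2 = chi_squared(ref, red, 3);
        printf("chi-squared = %.4f -> %s at the 90%% confidence level\n",
               chi2, chi2 < crit90 ? "acceptable" : "not acceptable");
        return 0;
    }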
4 The SPEC 2000 Benchmarks
The original SPEC 89 benchmark programs consisted
of four programs written in C and six programs written in
Fortran. In 1995, this set of programs was replaced by the
SPEC 95 benchmark suite. This updated collection
consisted of eight programs that were intended to be
representative of typical integer-style computations and
ten floating-point intensive programs. The SPEC 2000
benchmarks also are divided into floating-point and
integer categories.
The twelve integer benchmark
programs (CINT 2000), summarized in Table 1, are all
written in C, with the exception of 252.eon, which is
written in C++. Of the fourteen floating-point benchmark
programs (CFP 2000), shown in Table 2, six are written in
Fortran 77 (F77), four are written in Fortran 90 (F90), and
the remaining four are written in C.
A few of the programs from SPEC 95 have been
carried over into the new SPEC 2000 suite, specifically,
gcc, perl, and vortex from the integer programs, and
swim, mgrid, and applu from the floating-point programs.
However, these programs have been modified for the
SPEC 2000 suite. Consequently, the results produced by
a program with the same name are not comparable across
the two benchmark sets. More detailed information about
the SPEC 2000 benchmark suite can be obtained from [1].
Table 1. The CINT 2000 Suite

Benchmark    Lang  Category
164.gzip     C     Compression
175.vpr      C     FPGA Circuit Placement and Routing
176.gcc      C     C Programming Language Compiler
181.mcf      C     Combinatorial Optimization
186.crafty   C     Chess Game Playing
197.parser   C     Word Processing
252.eon      C++   Computer Visualization
253.perlbmk  C     PERL Programming Language
254.gap      C     Group Theory, Interpreter
255.vortex   C     Object Oriented Database
256.bzip2    C     Compression
300.twolf    C     Place and Route Simulator
Table 2. The CFP 2000 Suite

Benchmark     Lang  Category
168.wupwise   F77   Physics/Quantum Chromodynamics
171.swim      F77   Shallow Water Modeling
172.mgrid     F77   Multi-grid Solver: 3D Potential Field
173.applu     F77   Parabolic/Elliptic Partial Differential Equations
177.mesa      C     3D Graphics Library
178.galgel    F90   Computational Fluid Dynamics
179.art       C     Image Recognition/Neural Networks
183.equake    C     Seismic Wave Propagation Simulation
187.facerec   F90   Image Processing: Face Recognition
188.ammp      C     Computational Chemistry
189.lucas     F90   Number Theory/Primality Testing
191.fma3d     F90   Finite-element Crash Simulation
200.sixtrack  F77   High Energy Nuclear Physics Accelerator Design
301.apsi      F77   Meteorology: Pollutant Distribution
5 Results
Our results document the program behavior of a subset
of the SPEC 2000 benchmark suite. The reference run of
several SPEC 2000 benchmarks actually uses multiple
runs of the same executable with different command line
arguments or different input files. Therefore, we treat
each of these different command lines as a separate
sub-benchmark. For example, the first sub-benchmark of the
175.vpr benchmark reads a circuit from the input file,
creates a placement of the nodes in the circuit, then saves
this placement to a file. The second sub-benchmark of
the 175.vpr benchmark reads the placement file created
by the first sub-benchmark, creates a routing among the
nodes in the placement, then saves the routing to a file.
Our results document the 175.vpr benchmark, with its
place and route sub-benchmarks; the 164.gzip benchmark,
with its graphic, program, source, random, and log
sub-benchmarks (same command line, different types of input
files); and the 197.parser benchmark. (Parser has only
one command line.) All three of these benchmarks are
from the SPEC CINT 2000 suite.
We chose 175.vpr, 164.gzip, and 197.parser to begin
our analysis because we felt they were a good sampling of
the programs in the CINT 2000 suite. We will continue
our analysis with the rest of the programs in the CINT
2000 suite and the programs in the CFP 2000 suite as time
permits.
We use the function-level execution profile to
determine the goodness-of-fit of our large reduced dataset
compared to the full reference dataset of SPEC 2000. For
documentation and characterization, we include
instruction mix profiles and level 1 cache behavior of
both the full reference dataset and our large reduced
dataset, as well as instructions per cycle (IPC) values for
the large reduced dataset.
5.1 Function-level execution profile results
Table 3 shows the goodness-of-fit values (as
calculated with the chi-squared statistic and the
function-level execution profile) for our eight sub-benchmarks and
their large (LgRed), medium (MdRed), and small
(SmRed) reduced input datasets at optimization levels O0
through O3. Our results show that, with the exception of
parser, we achieved our goal of having reasonably sized
large input datasets that result in an execution profile
that closely mimics the execution profile of the full
reference dataset. In the case of the large reduced dataset
for the Place, Program, and Random sub-benchmarks at
optimization level O3, we had a single-digit chi-squared
value. This small chi-squared value shows a particularly
good correlation between our large reduced dataset and
the full reference dataset. Thus, for seven out of eight
sub-benchmarks, we can say with 90 percent confidence
that the differences in the execution profiles of the large
reduced dataset and the full reference dataset are no larger
than what would be expected due to random fluctuations.
At optimization level O3, our medium reduced dataset
had an acceptable goodness-of-fit in five out of the eight
cases and our small reduced dataset had an acceptable
goodness-of-fit in four out of the eight cases. As we
stated in Section 1, we would like our medium and small
reduced datasets to at least vaguely mimic our full
reference dataset. However, the goal of our medium and
small reduced datasets is to have mid-length and short
simulation times, not to mimic the full reference dataset.
(Simulation time results are discussed in Section 5.4.)
We gathered function-level execution profiles at four
different optimization levels to see if we could discern
any patterns across optimization levels. Our results show
similar goodness-of-fit values for each reduced dataset
across optimization levels. We conclude that for each
reduced dataset, across each optimization level, the
program is behaving in a consistent manner in comparison
to the full reference dataset.
5.2 Instruction mix profile results
Figure 1 shows the instruction mix breakdown for the
reference (Ref) dataset and our large reduced (LgRed)
dataset at optimization levels O0 through O3. In the
interest of brevity, we did not include the instruction mix
results for the medium and small reduced datasets.
Our results show a surprisingly large discrepancy in
the instruction mix between the full reference and our
large reduced datasets. Since the goodness-of-fit values
for the function-level execution profiles (discussed in
Section 5.1) were acceptable, we expected the instruction
mix of the large reduced dataset to scale down
proportionately from the full reference instruction mix.
We found that when we applied the goodness-of-fit
calculations to the instruction mix histograms, we
obtained chi-squared values several times larger than the
critical value at the 90 percent confidence level.
The large reduced datasets show vaguely consistent
instruction mix distributions across optimization levels.
That is, the percentage breakdown of loads, stores,
unconditional branches, conditional branches, integer
computation, floating point computation, and traps is
roughly the same across optimization levels.
The
reference dataset instruction mix distribution, on the other
hand, varied widely across optimization levels. For the
reference dataset of each benchmark, three out of the four
optimization levels have similar instruction mix values.
In the Place, Route, Graphic, Source, Log, and Parser
sub-benchmarks, the instruction mix distribution at O0
differed significantly from the other optimization levels.
In the Program and Random sub-benchmarks, the
instruction mix distribution at O1 differed significantly
from the other optimization levels.
We were not surprised to see the differences in
instruction mix distributions across optimization levels for
the reference dataset. However, we are unclear why these
differences did not carry through to the large reduced
dataset. We welcome other researchers to investigate and
explain the instruction mix discrepancies between the full
reference and our large reduced datasets.
5.3 Cache behavior results
Figure 2 shows the level 1 data cache miss rates for
the reference (Ref) dataset and our large reduced (LgRed)
dataset at optimization level O3.
Once again, in the
interest of brevity, we did not include our results for the
small and medium reduced datasets. Since we saw
similar behavior across optimization levels in the
execution profiles, we limited our cache behavior
simulations to optimization level O3.
The Place and Route sub-benchmarks show the
behavior we expected, that is, the miss rate curve for the
large reduced dataset is similar to the miss rate curve for
the full reference dataset, except the large reduced dataset
curve is shifted to the left. We speculate that the smaller
input dataset has a smaller footprint in memory.
Therefore, as the data cache gets larger, the entire
memory image fits into cache and the miss rate
approaches zero. That is, for the Place and Route
sub-benchmarks, the large reduced dataset produces a
scaled-down version of the full reference dataset cache behavior.
The Graphic, Program, Source, Random, Log, and
Parser sub-benchmarks show surprisingly similar miss
rates between the full reference and large reduced
datasets. In fact, for the Graphic and Program
sub-benchmarks, the full reference dataset miss rate is actually
lower than the large reduced dataset miss rate. At present,
we do not have an explanation for this unexpected result.
We speculate that the reduced input dataset reduces the
number of memory operations performed while
maintaining a similar memory footprint.
5.4 Impact on simulation time
Table 4 shows the instruction counts and estimated
simulation times (using sim-outorder) for our large
(LgRed), medium (MdRed), and small (SmRed) reduced
datasets compared to the full reference (Ref) dataset.
The estimated simulation times were calculated by
running several small simulations on a 300 MHz
UltraSparc, noting the progressive decrease in execution
rate as the dataset got larger, and then extrapolating the
execution rate to very large datasets. In the end, we used
two execution rate factors to calculate the estimated
simulation times (a worked example follows the list):
1) For the small, medium, and large reduced dataset,
we use an execution rate of 60 million instructions
per hour.
2) For the full reference dataset, we use an execution
rate of 25 million instructions per hour.
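As an example of how the entries in Table 4 follow from these two rates, the 197.parser instruction counts give

\[
\frac{2{,}717 \times 10^{6}\ \mathrm{instructions}}{60 \times 10^{6}\ \mathrm{instructions/hour}} \;\approx\; 45.3\ \mathrm{hours\ (LgRed)},
\qquad
\frac{301{,}361 \times 10^{6}\ \mathrm{instructions}}{25 \times 10^{6}\ \mathrm{instructions/hour}} \;\approx\; 12{,}054\ \mathrm{hours\ (Ref)}.
\]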
The simulation times in Table 4 show that for the
Place, Route, and Parser sub-benchmarks, we
accomplished our goal of having a small dataset with a
simulation time on the order of minutes, a medium dataset
with a simulation time on the order of hours, and a large
dataset with a simulation time on the order of a few days.
In the case of the Graphic, Program, Source, Random,
and Log sub-benchmarks, we have an appropriate length
run for the large reduced dataset, but our medium and
short simulation times are much longer than we would
like. Gzip performs compression on a buffer of data in
memory. The size of this buffer, in megabytes, is
determined by an integer command line argument. This
integer parameter restriction means the smallest size
buffer we can use is one megabyte. If we specify an input
file smaller than one megabyte, gzip duplicates the input
file in memory until it obtains one megabyte of data. This
data duplication results in large instruction counts and
long simulation times. The only way to obtain a very
short simulation of gzip would be to modify the source
code and remove the data duplication function call.
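The following C sketch illustrates the duplication behavior described above; it is our own illustration, not the actual 164.gzip or SPEC harness code, and the function name and buffer handling are ours.

    #include <stddef.h>
    #include <string.h>

    /* Sketch of the duplication behavior described above: a file smaller than
     * the requested buffer is copied repeatedly until the buffer is full, so
     * gzip always compresses at least one megabyte of data. */
    void fill_buffer(char *buf, size_t buf_size,
                     const char *input, size_t input_size)
    {
        size_t filled = 0;
        if (input_size == 0)
            return;                               /* nothing to duplicate */
        while (filled < buf_size) {
            size_t chunk = buf_size - filled;
            if (chunk > input_size)
                chunk = input_size;               /* whole copy of the input */
            memcpy(buf + filled, input, chunk);   /* duplicate the input file */
            filled += chunk;
        }
    }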
5.5 IPC Results
Figure 3 shows the instructions per cycle (IPC) for the
large reduced dataset. As discussed in Section 2, most
simulators that generate IPC counts and other timing
information run several orders of magnitude slower than
simpler, less detailed simulators. Given our current tools
and the time available, we were not able to obtain IPC
counts for the reference input sets. The IPC counts for
the large reduced datasets are included for documentation,
although they cannot yet be compared to the IPC counts
of the full reference input sets.
We determined the IPC values by using sim-outorder
with the default configuration. This configuration uses a
4 instruction per cycle issue width, 4 integer ALUs, 1
integer multiplier, 4 floating point ALUs, and 1 floating
point multiplier. With this configuration, we see that
seven out of the eight sub-benchmarks sustain an
execution rate of greater than one instruction per cycle.
The one sub-benchmark which fell short of one
instruction per cycle, Place, still performed fairly well
with an IPC of 0.9341.
Table 3. Goodness-of-fit chi-squared statistic values for 175.vpr, 164.gzip, and 197.parser for the large
(LgRed), medium (MdRed), and small (SmRed) reduced datasets' function-level execution profiles compared
to the full SPEC reference datasets. *90% Conf = critical value of the chi-squared statistic at the 90 percent
confidence level as tabulated in [6].
(a) Place (Vpr) Execution Profile
Opt Level   LgRed    MdRed     SmRed    90% Conf*
O0           5.44    24.59    156.27      24.77
O1           8.51    36.15    156.68      32.01
O2          11.33    20.97    113.21      29.62
O3           9.53    32.98    102.93      29.62

(b) Route (Vpr) Execution Profile
Opt Level   LgRed     MdRed       SmRed    90% Conf*
O0          27.35    391.68      6367.8      78.85
O1          64.02    907.96     20870.9      86.63
O2          55.25    749.83    111190.9      82.19
O3          56.01   1143.62     62637.0      85.53

(c) Graphic (Gzip) Execution Profile
Opt Level   LgRed    MdRed     SmRed    90% Conf*
O0          11.12    27.38    434.11      34.38
O1          16.85    37.97    410.91      43.72
O2          13.14    37.73    397.17      41.41
O3          18.75    45.30    431.39      44.88

(d) Program (Gzip) Execution Profile
Opt Level   LgRed    MdRed    SmRed    90% Conf*
O0           3.74     6.61    10.52      37.92
O1           3.36    11.81     9.04      40.26
O2           2.43     6.26     8.75      41.41
O3           3.57     7.31    11.44      37.92

(e) Source (Gzip) Execution Profile
Opt Level   LgRed    MdRed    SmRed    90% Conf*
O0          15.58     5.40     8.65      40.26
O1          24.56     6.46     3.75      39.09
O2          16.34     7.39     6.23      37.92
O3          13.21    12.48    12.58      36.74

(f) Random (Gzip) Execution Profile
Opt Level   LgRed    MdRed    SmRed    90% Conf*
O0           4.82     2.90     2.75      35.56
O1           5.84     5.10     5.89      34.38
O2           6.48    12.91     5.99      39.09
O3           5.77    15.23    12.15      36.74

(g) Log (Gzip) Execution Profile
Opt Level   LgRed    MdRed    SmRed    90% Conf*
O0          15.56    16.85    10.12      36.74
O1          33.69    30.61    72.03      36.74
O2          14.56     8.29    23.96      41.41
O3          29.67    12.07    14.26      37.92

(h) Parser Execution Profile
Opt Level   LgRed     MdRed      SmRed    90% Conf*
O0         283.53    366.30     6692.6     182.26
O1         212.80    203.68     1742.2     169.35
O2         310.30    389.64     7072.2     173.66
O3         536.40    422.74    14454.2     122.85
[Figure 1: eight stacked-bar panels, (a) Place (Vpr), (b) Route (Vpr), (c) Graphic (Gzip), (d) Program (Gzip), (e) Source (Gzip), (f) Random (Gzip), (g) Log (Gzip), and (h) Parser, each showing the percentage of loads, stores, unconditional branches, conditional branches, integer computation, floating-point computation, and traps for the Ref and LgRed datasets at optimization levels O0 through O3.]
Figure 1. Instruction mix breakdown for the reference (Ref) and large reduced (LgRed) datasets
for the 175.vpr, 164.gzip, and 197.parser SPEC 2000 benchmarks at optimization levels O0
through O3.
[Figure 2: eight panels, (a) Place (Vpr), (b) Route (Vpr), (c) Graphic (Gzip), (d) Program (Gzip), (e) Source (Gzip), (f) Random (Gzip), (g) Log (Gzip), and (h) Parser, each plotting level 1 data cache miss rate versus cache size (16k through 512k) for the lgred and ref datasets.]
Figure 2. Level 1 data cache miss rates for the reference (Ref) and the large reduced (LgRed) datasets
for the 175.vpr, 164.gzip, and 197.parser SPEC 2000 benchmarks compiled at optimization level O3.
Benchmark            Reference Dataset         LgRed Dataset           MdRed Dataset           SmRed Dataset
                     Inst Count   Sim Time     Inst Count   Sim Time   Inst Count   Sim Time   Inst Count   Sim Time
                     (millions)   (hours)      (millions)   (hours)    (millions)   (hours)    (millions)   (hours)
175.vpr, place          109,752      4390           1521       25.3          217        3.6           18        0.3
175.vpr, route           97,273      3891            881       14.7           94        1.6            6        0.1
164.gzip, graphic        81,270      3251           1370       22.8          964       16.1         3221       53.7
164.gzip, program       116,010      4640           1958       32.6         1812       30.2         2606       43.4
164.gzip, source         63,172      2526           1181       19.7         1149       19.2         1112       18.5
164.gzip, random         64,372      2575           1065       17.8         1066       17.8         1066       17.8
164.gzip, log            33,952      1358            531        8.9          531        8.9          526        8.8
197.parser              301,361    12,054           2717       45.3          227        3.8           41        0.7
Table 4. Instruction counts and estimated simulation times (using sim-outorder) for the SPEC 2000
benchmarks (compiled at optimization level O3) on a moderately loaded 300 MHz Sun UltraSparc.
Estimated simulation times were calculated using a rate of 25 million instructions per hour for simulations
of more than 10 billion (10,000 million) instructions and 60 million instructions per hour for simulations
of fewer than 10 billion instructions.
[Figure 3: bar chart of IPC for the Place (vpr), Route (vpr), Graphic (gzip), Program (gzip), Source (gzip), Random (gzip), Log (gzip), and Parser sub-benchmarks with the LgRed datasets; all bars exceed 1.0 except Place at 0.9341, with the largest value at 1.7138.]
Figure 3. Instructions Per Cycle (IPC) as measured using the SimpleScalar sim-outorder
simulator for the 175.vpr, 164.gzip, and 197.parser SPEC 2000 benchmarks when compiled with
optimization level O3 and run with the large reduced (LgRed) input datasets.
6 Conclusions and Future Work
In this work, we compare the behavior of the SPEC
2000 benchmark programs when executed with different
input datasets by looking at the fraction of total execution
time spent in functions as measured by the gprof profiling
tool, the instruction mix as measured with the sim-profile
simulator from the SimpleScalar suite of simulators, and
the level 1 cache miss rates as measured with the
sim-cache simulator from SimpleScalar. These metrics give
us a rough indication of what is going on in the simulated
hardware while the program executes. More detailed
study is needed to see if our datasets correctly depict the
second-order effects of the reference dataset [11]. In
particular, further study is needed to examine the
instruction ordering effects in the pipeline when the
programs are executed with each dataset.
Another, perhaps higher priority, course of action is to
better reduce the data for the 197.parser benchmark. If a
close-to-exact representation of the full reference dataset
is needed for 197.parser, our dataset is not a good choice.
However, we speculate that with a different reduction
technique, we can obtain a dataset which more closely
mimics the full reference dataset.
In the interest of shortened simulation time, we also
need to use a different reduction technique on 164.gzip.
We have a good gzip dataset for detailed simulations;
however, we were unable to obtain a dataset for quick
tests of our simulator.
Our work to date encompasses only a fraction of the
programs in the SPEC 2000 benchmark suite. In the
weeks and months to come, we plan to apply the same
methodology stated in Section 3 to the rest of the C
programs in the CINT 2000 and CFP 2000 suites. (Our
tools do not work with C++ or Fortran programs, so we
will limit our present analysis to the programs written in
C.)
Our execution profile and cache miss results show that
it is possible to obtain small datasets that reasonably
mimic the behavior of the full reference datasets of the
SPEC 2000 benchmarks. These reduced datasets are not
perfect, though. We discovered that one tradeoff of smaller
simulations is that the instruction mix produced with our
reduced datasets is vastly different from the instruction
mix of the full reference dataset. Our future work will
determine if there are other major differences between our
large reduced dataset and the full reference dataset.
We anticipate using our reduced datasets with detailed
execution-driven simulators to evaluate hardware tradeoffs in future computer architecture studies. Our datasets
give us reasonable length simulations, thereby allowing us
to run many permutations of hardware configurations and
complete computer architecture research studies in a
reasonable amount of time.
While our reduced input datasets do not produce
execution profiles that perfectly match the execution
profiles produced with the original reference input
datasets, they do provide reasonably similar profiles.
Furthermore, the program characterizations presented in
this paper highlight the important areas in which the
program behavior with our reduced datasets and the
program behavior with the original reference dataset
differ. Understanding these differences is important in
helping researchers understand the performance results
they obtain when using these benchmark programs to
evaluate new architectural ideas.
Acknowledgements
This work was supported in part by National Science
Foundation grants EIA-9971666 and CCR-9900605, NSF
Research Experiences for Undergraduates grants
EEC-9619750 and CCR-9610379, and by the Minnesota
Supercomputing Institute.
References
[1] SPEC Benchmark Suite.
Information available at
http://www.spec.org
[2] Henning, John L., “SPEC CPU2000: Measuring CPU
Performance in the New Millennium,” IEEE Computer,
Vol. 33, No. 7, July 2000, pp. 28-35.
[3] SPEC CPU2000 Press Release FAQ, available at
http://www.spec.org/osg/cpu2000/press/faq.html
[4] Burger, D. and T. Austin, "The SimpleScalar Tool Set,
Version 3.0," University of Wisconsin Madison Computer
Sciences Department. Distribution web page located at:
http://www.cs.wisc.edu/~mscalar/simplescalar.html
[5] GNU Unix Toolset. Information and binaries available at
http://www.gnu.org
[6] Lilja, David J., Measuring Computer Performance: A
Practitioner's Guide, Cambridge University Press, New
York, NY, 2000.
[7] Giladi, R. and N. Ahituv, “SPEC as a Performance
Evaluation Measure,” IEEE Computer, Vol. 28, No. 8,
August 1995, pp. 33-42.
[8] Mirghafori, N., M. Jacoby, and D. Patterson, “Truth in
SPEC Benchmarks,” ACM Computer Architecture News,
Vol. 23, No. 5, December 1995, pp. 34-42.
[9] Dujmovic, J. and I. Dujmovic, “Evolution and Evaluation
of SPEC Benchmarks,” Performance Evaluation Review,
Vol. 26, No. 3, December 1998, pp. 2-9.
[10] Graham, S.L., P.B. Kessler, M.K. McKusick, “gprof: A
Call Graph Execution Profiler”, Proceedings of the
SIGPLAN '82 Symposium on Compiler Construction,
SIGPLAN Notices, Vol. 17, No. 6, pp. 120-126, June
1982.
[11] Iyengar, V., L. Trevillyan, and P. Bose, “Representative
Traces For Processor Models with Infinite Cache,”
Proceedings of the Second International Symposium on
High Performance Computer Architecture (HPCA), 1996,
pp. 62-72.