Computer performance variability

by THOMAS E. BELL
The Rand Corporation
Santa Monica, California
MOTIVATION FOR PERFORMANCE COMPARISONS

During a period when computing on a network is free, users can be very informal about choosing the computer for running their jobs. Certain issues usually dominate this informal evaluation: the convenience of entering jobs, the availability of attractive services, the reliability of the system, and the individual user's familiarity with the system's conventions. These informal evaluations are usually qualitative, but one additional, quantitative characteristic is often included: response time.

Response time, in the context of a computer network, may be defined as the elapsed time for responding to a batch-job run request as well as the more common definition of the elapsed time for responding to interactive requests. On some systems, of course, these two instances are blended together. On other systems the two are distinct, and the values obtained from measuring them are quite different. On two specific computers a user might find that one provides superior batch response time while the other provides superior interactive response time. If everything else is equal and he has no problems in transferring data from one machine to the other, the user would choose the first for batch executions (e.g., running statistical evaluation programs) and the second for doing interactive work (e.g., editing a report).

If installations charge real money for their computing services, another element (money) must be included in an evaluation of alternative computers. Evaluations rapidly lose their informal nature when network users find that their choice of computer determines the amount of money that will remain in budgets for paying their salaries. Personnel time lost due to poor response time, bad conventions, or inadequate services must now be traded off against the costs of avoiding these conditions. Computers that might have been unacceptable prior to charging may become optimal after economics has become a factor in decision-making.

Response time is a single metric, but computer charges are computed from a number of different performance metrics. For example, a bill might be computed from the amount of processor time consumed, the number of cards read, the number of lines printed, the number of tape I/Os and disk I/Os performed, and the amount of core occupied.1 When each type of resource is separately charged for, one computer may be much cheaper for one type of job but very expensive for another type. If the charge for I/O is relatively low and the charge for CPU time relatively high, the user would be tempted to run I/O-bound jobs on this computer but to submit his "number-crunchers" to another computer. The user without access to a network may be precluded from distributing his work among the available computers in the most economical manner, but the network user has more options to choose from and more decisions to make. These economic decisions must usually consider each of a number of performance metrics when the rate for each may be different on each of the available computers.

In addition to potentially different rates for each resource, computer centers may employ different functional forms for their billing equations. One of the reasons for such differences is the variety of objectives that they may adopt. Published objectives usually include cost recovery and equitability (or reasonableness) in addition to repeatability.2,3,4 Other objectives may include limiting load growth to avoid the need for procuring a new machine,5 biasing users to employ resources available in excess rather than those in short supply,6 and being able to separate those users with immediate needs (and who can afford to satisfy them) from those who desire cheaper, slower service.7 The differences in objectives ensure that, as computer centers are increasingly drawn together by networks, users will face more and more different kinds of economic situations to evaluate, and they will find that superficial evaluations will not be adequate for choosing the most appropriate node. Since computational speeds often differ between the machines, rates themselves cannot be compared; the user must compare the total costs of computing through execution of sample jobs.

Either real jobs or synthetic jobs (which use resources in known ways but do no useful computations) can be used in response time and resource charge comparisons. Typically, a user submits a standard job to each of several computers to measure the response time (either interactive or batch) and determine the charges. After finding the values from each candidate node, he picks the one offering him the best ratio of service (response time as well as other services) to cost.
This exercise is therefore critical to the user, since it determines the costs he will incur, and to the node, since it determines the amount of load the center will experience.
REPEATABILITY AND VARIABILITY
One of the objectives for charging systems is repeatability,
the characteristic of producing the same charge from each
of a number of runs of the same job. This same objective
of zero variability between runs is often implicitly assumed
to have been met by users performing comparisons. They
run a single job once on each machine and assume that the
resulting performance values accurately represent the
machine's performance. This is equivalent to assuming that within-sample variability is zero, and therefore that the variation within samples (e.g., several runs on the same computer) can be disregarded in comparison with the variation between samples (e.g., runs on computer A compared with runs on computer B). If the standard deviation (a measure of variability) of run times on computers A and B were always far smaller than the difference in run times on the two machines, the within-sample variability would clearly not be significant.
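The check described in the last sentence can be made concrete in a few lines of code. The sketch below uses modern Fortran (the paper's own synthetic job was written in the FORTRAN of the day); the run times are hypothetical and serve only to show the within-sample standard deviations being set against the between-machine difference in means.

    ! A minimal sketch, with hypothetical run times: compare the
    ! within-sample standard deviations of repeated runs on computers
    ! A and B with the difference between the two sample means.
    program compare_variability
      implicit none
      real :: a(5) = [29.1, 29.3, 29.0, 29.2, 29.4]  ! run times on computer A (seconds, assumed)
      real :: b(5) = [35.2, 35.5, 35.1, 35.6, 35.3]  ! run times on computer B (seconds, assumed)
      real :: mean_a, mean_b, sd_a, sd_b

      mean_a = sum(a) / size(a)
      mean_b = sum(b) / size(b)
      sd_a = sqrt(sum((a - mean_a)**2) / (size(a) - 1))
      sd_b = sqrt(sum((b - mean_b)**2) / (size(b) - 1))

      print *, 'within-sample standard deviations:  ', sd_a, sd_b
      print *, 'between-machine difference in means:', mean_b - mean_a
      ! If the difference in means dwarfs both standard deviations, the
      ! within-sample variability can safely be disregarded.
    end program compare_variability
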
Good repeatability would aid users in budgeting their funds for computing as well as help them in making comparisons. Gabrielle and John Wiorkowski4 suggest that "a variance of no greater than 1 percent is thought to be acceptable." Probably, they mean that the standard deviation of charges should not exceed 1 percent of the mean. This amount of variability is certainly so small that it would interfere very little in realistic comparisons. Perhaps performance variability is unworthy of consideration; some indication of the problem's actual magnitude is necessary to evaluate its importance.
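Read this way, the criterion is a bound on the coefficient of variation of the charges. A minimal sketch of the check a user might apply, with hypothetical charges:

    ! Hypothetical charges, in dollars, for repeated runs of one job.
    program one_percent_check
      implicit none
      real :: charge(6) = [41.20, 41.55, 40.90, 41.30, 41.05, 41.40]  ! assumed values
      real :: mean, sd

      mean = sum(charge) / size(charge)
      sd   = sqrt(sum((charge - mean)**2) / (size(charge) - 1))
      print *, 'coefficient of variation =', 100.0 * sd / mean, 'percent'
      if (sd <= 0.01 * mean) then
        print *, 'within the 1 percent criterion'
      else
        print *, 'exceeds the 1 percent criterion'
      end if
    end program one_percent_check
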
DETERMINING THE MAGNITUDE OF THE PROBLEM
Synthetic benchmarks are being used extensively in
performance investigations. For example, Buchholz' synthetic test job8 has been used by Wood and Forman9 for comparative performance investigations on batch systems, and Vote10 has employed his synthetic program in evaluating a time-sharing system. With their increasing use and
their documented advantages for certain types of investigations, synthetic jobs are a natural vehicle for determining
the magnitude of variability.
Synthetic job
We have used a modification of the Buchholz synthetic test job to determine the magnitude of variability in strictly controlled test situations on IBM, Honeywell, and other manufacturers' equipment. The job (as modified) is written in FORTRAN so that it can be executed on a variety of computers and is structured as follows (a sketch of this structure appears after the list):

1. Obtain the time that the job was given control and keep the time in memory.
2. Set up for the job's execution.
3. Set up for running a set of identical passes with an I/O-CPU mix as specified on a parameter card.
4. Execute the set of identical passes and record (in memory) the time of each pass's start and finish.
5. Compute some simple statistics from the resultant execution times and print both the times and the values of the statistics.
6. If requested, return to step 3 to repeat the operations for a new I/O-CPU mix.
7. Determine the current time; print out this time and the initiation time.
8. Terminate the job.
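A minimal sketch of the timed-pass portion of that structure (steps 3 through 5) follows. The original job was written in the FORTRAN of the day and interrogated each system's own clock routine; the SYSTEM_CLOCK call, the pass count, and the stand-in workload here are illustrative assumptions rather than the original code.

    ! Sketch of the timed-pass structure (steps 3 through 5 above).
    ! The work inside each pass and the number of passes are placeholders;
    ! the original job mixed CPU and I/O activity according to a parameter card.
    program synthetic_job
      implicit none
      integer, parameter :: npass = 20
      integer(8) :: count0, count1, rate
      real :: pass_time(npass), mean, sd, x
      integer :: i, j

      do i = 1, npass
        call system_clock(count0, rate)
        x = 0.0
        do j = 1, 1000000                    ! stand-in for one pass of CPU/I/O work
          x = x + sqrt(real(j))
        end do
        call system_clock(count1)
        pass_time(i) = real(count1 - count0) / real(rate)
      end do

      mean = sum(pass_time) / npass
      sd   = sqrt(sum((pass_time - mean)**2) / (npass - 1))
      print *, 'pass times (seconds):', pass_time
      print *, 'mean =', mean, '  standard deviation =', sd
      print *, 'workload check value =', x   ! keeps the stand-in loop from being optimized away
    end program synthetic_job
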
As indicated, the job has embedded data collection in the
form of interrogations of the system's hardware clock to
record the elapsed time between certain major points in the
program's execution. When appropriate, the job also determines the_ acc_umulatedresource usage at each of the major
points. The elapsed time within the job can be compared
with the time recorded by the accounting system to determine whether initiation/termination time is large enough
to require distinguishing between these two measures of
elapsed time. In all cases we have observed, the initiation/
termination time has been so large that disregarding it
would invalidate many conclusions from performance
investigations.
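As a small illustration of that comparison, with entirely hypothetical timestamps, the initiation and termination times fall out of bracketing the job's first and last internal clock readings with the accounting system's recorded start and end times:

    ! Hypothetical timestamps, in seconds from an arbitrary origin.
    program init_term_overhead
      implicit none
      real :: acct_start = 0.0,  job_start = 8.3   ! accounting initiation vs. first in-job clock reading
      real :: job_end   = 43.3,  acct_end  = 47.1  ! last in-job clock reading vs. accounting termination

      print *, 'initiation time  =', job_start - acct_start, 'seconds'
      print *, 'termination time =', acct_end - job_end, 'seconds'
      print *, 'overhead share   =', 100.0 * ((job_start - acct_start) + (acct_end - job_end)) &
                                     / (acct_end - acct_start), 'percent'
    end program init_term_overhead
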
The design of the job enables the user to identify the
source of certain kinds of variability. Since repeated passes
are individually timed, variability that arises from within
the period of job execution can be identified. By running
the job in a number of different situations, other sources of
variability can be identified. This job can be used directly
to evaluate variability in batch systems, and can be submitted remotely to evaluate variability in remote job
environments such as the time-sharing system we investigated.
Interactive responsiveness
While a synthetic job can be used to evaluate the system's
response to user-programmed activity, it is not adequate to
investigate highly interactive activity like text-editing. The
latter type of system is usually investigated by using scripts;
the user first makes up a list of commands and then determines the time required for him to complete a series of
interactions based on the list of commands. Unfortunately,
this approach precludes identifying the source of variability
if it arises from only a subset of the commands employed.
In addition, the variability of human response is mixed
with the variability in computer response. A better approach
is to time the response of the computer to each individual
command.
The analyst can time these responses with a stop watch
as done by Lockett and White,11 but this technique fails
when the computer's response time becomes small. In such
situations the human's response time in operating the
watch may exceed the computer's response time, and human
variability dominates computer variability. In addition,
the human often becomes sloppy when large amounts of
data are needed because the job becomes tedious. To avoid
these problems, we designed and implemented a hardware
device to time responses to a resolution of one millisecond.
Analysis approach
Performance can be made to vary by orders of magnitude
if a programmer puts his mind to it. By using a computer
with a low resolution clock and doing careful programming,
a programmer could execute a job that was almost never
in control at the time the computer's clock advanced. Thus
his job would be charged for using almost no resources.
On the other hand, the programmer could leave out the
special timing controls and be charged for using hundreds
of times as much processing resource. An exercise performed
in this way would prove little since any relationship to a
normal job stream would be tenuous.
A more useful indication of variability's magnitude would
be the lower limit to be expected in more normal situations.
Our experiments were directed to this objective and therefore involve simple situations vvhere low variability is to be
expected.
Ultimately, all variability could probably be explained.
Variations in processing rate could be caused by fluctuations
in power line frequency; differences in elapsed time for
processing I/O-bound jobs might be explained as variations
in the number of I/O-retries; variability in initiation time
could sometimes be explained by slight differences in the
length of a job control file. However, these circumstances
are transitory and usually unknowable to the user. Our
tests were designed to represent the best possible situation
that a user could realistically expect.
The simplest situation we investigated is one in which
no jobs other than the test job are active; the system is
initialized at the beginning of the test period to establish
that no other jobs will interfere with the experiment. In this
simple test the parameterized test job is set to run in either
of two modes-as a totally CPU-bound job or as a totally
I/O-bound job.
The tests then increased in complexity through controlled
multiprogramming situations to uncontrolled, normal operations. Results of the tests were used in simple analyses to
indicate the degree of variability that should be expected,
but in this initial investigation we made no attempt to
employ sophisticated statistical models.
In all cases we used computers that were isolated from
any networks, and we usually allowed only controlled on-line
activity so that unexpected loading would not occur. In
some cases we found that physically disconnecting transmission lines was necessary to achieve an environment that
was strictly controlled. Whenever we relaxed our controls, loading became random and variability increased.

TABLE I - Stand-Alone Runs - Elapsed Times
(All times in milliseconds)

  Type of Job    Number of Passes    Mean Elapsed Time    Standard Deviation
  CPU                  20                  29135                  73
  CPU                  20                  29135                  73
  I/O                  20                  35165                 135
  I/O                  20                  35300                 186
EMPIRICAL RESULTS
The most elementary measurement of performance is
probably elapsed time, and it is often the one of most interest. Results of elapsed time investigations are therefore
presented first, with results involving I/O and CPU metrics
following.
Batch elapsed time
The phenomenon of variability in elapsed times under
multiprogramming conditions is well-known. When a user's
job is run with different mixtures of other jobs, differing
amounts of resources are denied it each time it is run; thus
the job may execute slowly one time and rapidly the next.
One way to decrease this variability (and also decrease the
average time) is to ensure that the job of interest is run
with the highest priority. The interested network user
might therefore occasionally pay to run at a very high
priority in order to determine the performance under "best
possible" conditions. However, an even better situation is
to run stand-alone on the computer. We therefore executed
the synthetic job in its simple CPU-bound and I/O-bound
versions on an otherwise idle system.
The elapsed time to execute the CPU-bound portion of
this job on an IBM 360/65 operating under OS/MVT at The Rand Corporation evidenced no variability within the job that was above the measurement's resolution of 32 milliseconds. (See Table I.) The I/O-bound portion's elapsed time, however, typically varied enough to result in standard deviations (for 20 identical executions) of about 0.5 percent of the mean (a mean of about 35 seconds; for example, 186/35300 in Table I). This
small value indicates that, at least within a stand-alone
job, variability is not impressive.
The situation becomes less encouraging when the comparisons are between separately initiated jobs. The statistics
of interest now include initiation time, termination time,
and average execution time of a pass through the timed
loop for each separately-initiated job. The elapsed time
for initiating the jobs (the time between initiation as recorded by the accounting system and the time control has
passed to the job's code) averaged 8.25 seconds with a
standard deviation of 6.67 percent of the mean. The elapsed time for termination (the time from when the user code completes executing until the accounting system records termination) averaged 3.775 seconds with a standard deviation of 16.7 percent of the mean.

TABLE II - Stand-Alone Runs - Channel Times
(All times in milliseconds)

Disk
  Run Number    File 1    File 2    File 3    File 4
       1        34432     34274     27351     39809
       2        34456     34283     27298     39823
       3        34429     34257     27334     39816
    Mean        34439     34271     27328     39816

Tape
  Run Number    File 1    File 2    File 3    File 4
       4        22356     22533     23328     23087
       5        22404     22621     23350     23093
       6        22350     22677     23340     23103
    Mean        22370     22610     23339     23094
Although our experiments provided multiple samples of
initiation and termination, the sample size of identically
run executions was too small for computation of meaningful
measures of variability. In two specific instances, however,
the average elapsed times to execute the totally CPUbound portion differed by less than the resolution of measurement. The average execution time for the I/O-bound
portion, in two instances of samples of two, changed by
0.4 percent and 1.2 percent. (Allocation of files was done
identically in each instance.) Although the within-sample
variability appears small for internal portions of a job, the
initiation and termination times vary significantly. Therefore, comparisons involving elapsed times should be designed
recognizing that the variability of initiation and termination
may obscure some results.
On-line elapsed time

An on-line system operating with low priority in a computer would be expected to have variable response, but relatively constant response is usually expected when the on-line system is given a priority only a little below the operating system itself. We ran a series of carefully designed tests to provide an indication of this assumption's validity on our WYLBUR12 test editor in Rand's normal environment. A heavily I/O-bound activity (listing a file on a video terminal) experienced response times with a standard deviation of 23.2 percent of the mean. A heavily CPU-bound activity (automatically changing selected characters) had reduced variability: 15.6 percent of its mean. A variety of other editing functions experienced similar variability under a pair of different configurations. The standard deviations as percentages of means ranged from a low of 5.7 percent, to a typical 12 percent, to a high of 30.8 percent. These values were obtained with different levels of user activity on WYLBUR by real users (as opposed to our artificial load for testing), but did not include the variability often introduced by time-sharing systems since the 360/65 was not time-shared.

We executed our standard synthetic job on a Honeywell 615 computer under its time-sharing system to determine variability in this on-line system. The elapsed time to execute each pass through the CPU-bound portion was recorded in order to provide an indication of the variability of response time during normal operation on that system. Each pass could be executed in about 9 seconds in a stand-alone system. However, during normal operation the average was occasionally as large as 597 seconds, with the standard deviation exceeding the average. (It was 598 seconds.) Variability of this magnitude can produce highly anti-social behavior by some network users.

I/O activity

Elapsed time is variable in all but the simplest cases, but perhaps the network comparison shopper can depend on charges from a billing system to be within the 1 percent suggested by the Wiorkowskis. We ran tests on two systems to test this.

TABLE III - Multiprogrammed Runs - Channel Times
(All times in milliseconds)

Disk
  Run    File 1   Increase   File 2   Increase   File 3   Increase   File 4   Increase
                  (percent)           (percent)           (percent)           (percent)
   1     28543     -17.1     37227       8.6     49739      82.0     44329      11.3
   1     37678       9.4     48834      42.5     46655      70.7     30638     -23.0
   2     27726     -19.5     37366       9.0     47920      75.4     43163       8.4
   2     38578      12.0     49977      45.8     45346      65.9     28613     -28.1
   3     30896     -10.3     38983      13.7     24687      -9.7     24162     -39.3
   4     30200     -12.3     47346      38.2     24730      -9.5     24195     -39.2

Tape
  Run    File 1   Increase   File 2   Increase   File 3   Increase   File 4   Increase
                  (percent)           (percent)           (percent)           (percent)
   3     22698       1.4     22908       1.3     23775       1.9     23570       2.1
   4     22873       2.2     23032       1.9     23849       2.2     23675       2.5

Notes:
1. Three jobs were multiprogrammed in each case. The first job was CPU-bound; its results are not represented here. The other two jobs were I/O-bound and were paired as follows:

     Run    Second Job Type    Third Job Type
      1         Disk               Disk
      2         Disk               Disk
      3         Disk               Tape
      4         Disk               Tape

2. All increases represent changes from the mean times indicated for individual files in Table II.
Repeated runs of the same stand-alone, I/O-bound job
resulted in recorded channel times on a Honeywell Information Systems (HIS) 6050 with ranges considerably less than
1 percent of the means. (See Table II.) When run with
one CPU-bound job and only one other I/O-bound job, the
channel times changed from the stand-alone values by an
average deviation of 2.0 percent for tape files, but 29.2 percent for disk files. (As indicated in Table III, both positive
and negative deviations were observed.) In one case, an 82
percent deviation was observed between the charges for the
stand-alone run and a three-job multiprogramming case.
All these instances involved simple sequential files. Using
this type of file several times with identical jobs usually
results in nearly constant performance when considering
the metric of I/O requests. I/O counts (Execute Channel
Program requests, EXCPs) are reported in IBM's System Management Facilities (SMF) accounting system and seldom vary. However, SMF only reports I/O requests
rather than actual I/O accesses. The number of actual I/O
accesses normally exceeds the number of requests since
channels can execute a number of accesses as a result of a
single EXCP. The difference usually becomes dramatic in
the Indexed Sequential Access Method (ISAM) where a
single request may result in extensive examination of overflow areas.
Processor activity
Processor time appears to be a straightforward metric,
but even its definition is open to dispute; major parts of a
job's CPU activity may not be logically associated with
that job alone. For instance, the rate of executing instructions may be reduced as a result of activity on a channel
or another processor in the system. In addition to the definitional problem, accounting systems are often implemented
in ways that users feel are illogical. These problems are
seldom of importance when a job runs alone in the system,
but may be critical during multiprogramming or multiprocessing. Repeated runs of our test job on an IBM 360/65
did not indicate any meaningful variability when the CPU-bound job was run stand-alone, but the I/O-bound job, in
two cases, provided CPU times of 29.4 seconds and 28.7
seconds (a difference of 2.4 percent of the mean). The I/O
variability, however, does not appear critical because this
job ran, stand-alone, for an elapsed time of approximately
780 seconds; the difference of 0.7 seconds is therefore less
than 0.1 percent of the elapsed time. Under multiprogramming, results vary more.
Although the CPU charges should not vary when the
job is run multiprogrammed rather than stand-alone, our
results indicate that the reported charges contained both a
biasing element and a random element. The charges for the
CPU-bound job went up from 583 seconds (stand-alone) to
612 seconds (when run with the I/O-bound job) to 637
seconds (when run with a job causing timer interrupts
every 16.7 milliseconds) to 673 seconds (when multiprogrammed with both the other jobs). The changes in CPU
charges are clearly dependent on the number of interrupts
the system handles for other jobs on the system. The largest
CPU charge observed in this series of tests with the CPU-bound job was 16 percent over the stand-alone charge, but larger biases can be obtained by running more interrupt-causing jobs simultaneously. In one particularly annoying
case, the author observed a production job (as opposed to a
synthetic one) whose CPU charges differed between two
runs by an amount equal to the smaller of the two
charges: a 100 percent variation!
I/O-bound jobs often experience CPU charge variability
of equal relative magnitude, and users with I/O-bound jobs
have come to expect 30 percent variations in their charges.
These problems are not unique to IBM equipment. We found the same sorts of variability when running on a Honeywell 6050 processor. With only a single processor active, we observed processor times that (with two other jobs active) increased up to 7.2 percent over the stand-alone average charge. More jobs and more processors increase the
variability.
INTERPRETATION
The results of performance tests indicate conclusively
that something is varying. Some of the reported variability
is undoubtedly due to the reporting mechanisms themselves.
For example, the CPU time reported by IBM's SMF in an MVT system attributes the processing time for I/O interrupts to the job in control at the time of the interrupt. Since that job may be any one running on the system (the job that caused the I/O to be executed or an unrelated one), CPU charges are dependent on which jobs are concurrently executing as well as the micro-level scheduling of these jobs. This scheme had the advantage of accounting for most of the CPU activity at reasonably low cost, but it often reported a number whose meaning was obscure. (IBM's new virtual system reduces the reported variability
by not reporting the I/O interrupt-handling time.)
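The effect of this attribution scheme can be caricatured with a small model (an illustrative sketch, not IBM's actual algorithm); the 16.7-millisecond interrupt interval comes from the test described above, while the per-interrupt handling cost and the in-control fraction are assumed values.

    ! Hypothetical model: a CPU-bound job that holds the CPU most of the
    ! time absorbs most of the cost of handling interrupts generated by
    ! the jobs multiprogrammed with it.
    program interrupt_attribution
      implicit none
      real, parameter :: run_seconds        = 600.0   ! wall-clock span of the run (assumed)
      real, parameter :: interrupt_interval = 0.0167  ! one interrupt every 16.7 ms (from the test above)
      real, parameter :: handling_cost      = 0.0005  ! CPU seconds charged per interrupt (assumed)
      real, parameter :: in_control_share   = 0.9     ! fraction of time the CPU-bound job is in control (assumed)
      real :: interrupts, extra_charge

      interrupts   = run_seconds / interrupt_interval
      extra_charge = interrupts * in_control_share * handling_cost
      print *, 'interrupts during the run        =', interrupts
      print *, 'extra CPU seconds charged to job =', extra_charge
    end program interrupt_attribution
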
The entire amount of variability cannot be attributed to
reporting mechanisms; the basic processes are clearly subject
to variability-causing phenomena. Some of these can be
identified subsequent to a test and employed to design
better-controlled tests in the future. For example, in some
of our tests we could have reduced variability by using only
old files. New files are allocated by the operating system
in a manner that depends on previous file allocations; using
only old files would ensure that physical file positions would
be identical for each test run and result in reduced variability. However, employing this strategy would preclude
determining how well the operating system performed file
allocation. In general, the more realistic the desired test,
the more variability the analyst must accept.
Even if a test is run with good performance reporting in
a very tightly controlled environment, performance must
be considered a random variable. If files are allocated prior
to testing, random I/O errors still may occur. Rotating
devices do not possess ideally constant, known rotational
velocities. In addition, micro-level sequencing in a processor
is often dependent on the precise start time of user jobs
and the entry time of system support modules. The ability
to explain these effects after the test does not imply that
they are knowable before the test.
Comparisons of elapsed times or charges between computers on a network cannot depend on variability being
within the desired few percent. Even in the simplified situations reported in this paper the variability was often large
enough to preclude single-sample evaluations from being
dependable. If a user intends to do significant computing
after choosing a node, he should ensure that his evaluation
reflects this reality. Further, he should occasionally check
the environment on the network to see whether charges or
response time have changed enough to justify a change in
his workload allocation scheme.
REFERENCES
1. Hall, Gayle, "Development of an Adequate Accounting System," New York, Share, Inc., Computer Measurement and Evaluation - Selected Papers from the SHARE Project, Vol. 1, 1973, pp. 301-305.
2. Kreitzberg, C. B. and J. H. Webb, "An Approach to Job Pricing in a Multi-programming Environment," Proceedings Fall Joint Computer Conference, 1972, pp. 115-122.
3. Young, J. W., "Work Using Simulation and Benchmarks," New York, Share, Inc., Computer Measurement and Evaluation - Selected Papers from the SHARE Project, Vol. 1, 1973, pp. 286-292.
4. Wiorkowski, G. K. and J. J. Wiorkowski, "A Cost Allocation Model," Datamation, Vol. 19, No. 8, August 1973, pp. 60-65.
5. Bell, T. E., B. W. Boehm, and R. A. Watson, "Computer Performance Analysis: Framework and Initial Phases for a Performance Improvement Effort," The Rand Corporation, R-549-PR, November 1972. (Also Proceedings Fall Joint Computer Conference, 1972, pp. 1141-1154.)
6. Watson, R. A., Computer Performance Analysis: Applications of Accounting Data, The Rand Corporation, R-573-PR, May 1971.
7. Nielsen, N. R., "Flexible Pricing: An Approach to the Allocation of Computer Resources," Proceedings Fall Joint Computer Conference, 1968, pp. 521-531.
8. Buchholz, W., "A Synthetic Job for Measuring System Performance," IBM Systems Journal, Vol. 8, No. 4, 1969, pp. 309-318.
9. Wood, D. C., and E. H. Forman, "Throughput Measurement Using a Synthetic Job Stream," Proceedings Fall Joint Computer Conference, 1971, pp. 51-55.
10. Vote, F. W., Multiprogramming Systems Evaluated Through Synthetic Programs, Lincoln Laboratories (MIT), ESD-TR-73-338, December 1973.
11. Lockett, J. A. and A. R. White, Controlled Tests for Performance Evaluation, The Rand Corporation, P-5028, June 1973.
12. Fajman, R., and J. Borgelt, "WYLBUR: An Interactive Text Editing and Remote Job Entry System," Communications of the ACM, Vol. 16, No. 5, May 1973, pp. 314-322.