Fetch Directed Prefetching – A Study
Gokul Nadathur
Nitin Bahadur
Sambavi Muthukrishnan
{gokul, bnitin, sambavi}@cs.wisc.edu
Dec 8, 2000
University of Wisconsin-Madison
Abstract
This report presents our work on the fetch directed prefetching architecture
proposed by Reinman et al. [Reinman 99]. The fetch directed architecture decouples
the branch predictor from the main pipeline so that the branch predictor can run
ahead of the pipeline on pipeline stalls and i-cache misses. Other components of
the architecture keep the next instruction blocks ready for the fetch unit. We also
explored prefetching using stream buffers over the base architecture, and
implemented both on SimpleScalar 2.0. Our simulations on a subset of the Spec95
integer benchmarks show that fetch directed prefetching improves performance,
giving a greater IPC improvement than stream buffers added to the base
SimpleScalar architecture. We observed a peak improvement of 8.8% for fetch
directed prefetching and 6.29% for stream buffers among the benchmarks we ran.
Our statistics show that the modest gains can be improved further by multi-porting
and/or pipelining the cache hierarchy. Thus the fetch directed architecture with
intelligent prefetching hides i-cache stall latencies, improving fetch bandwidth
and performance.
1. Introduction
As computer architecture advances, most modern processors aim at achieving better
performance through improved instruction level parallelism. This places a large demand on the
instruction fetch mechanism and the memory/cache hierarchy. Also, with the advent of
superscalar processors that can execute multiple instructions per cycle, there is a need for
better branch prediction (perhaps multiple branch predictions per cycle). The rate at which the
BTB and branch predictor can be cycled thus affects the number of instructions that may
actually be executed, since it limits the fetch bandwidth. Unless the instruction fetch and
memory hierarchy respond to the needs of the execution engine, the true benefits of the
advances in ILP cannot be realized.
This argument motivates our study of instruction prefetching techniques. When
instructions are prefetched, memory latencies due to cache misses can potentially be hidden.
We study two techniques: stream buffers [Jouppi 90] and fetch-directed prefetching
[Reinman 99].
While stream buffers approach the problem of improving fetch bandwidth simply by
prefetching cache lines sequentially after a cache miss, the fetch directed architecture
improves on this by modifying the baseline architecture to provide a better idea of what
should be prefetched. The fetch directed architecture works by letting the branch predictor
operate independently of the fetch unit. This enables a prediction each cycle (dependent, of
course, on the cycle time of the branch predictor and the BTB). The branch predictor is thus
never stalled by instruction cache misses (which stall the fetch unit), and similarly the fetch
unit is not stalled by any latency in branch prediction. The predictions produced each cycle
may be used to determine what to prefetch (with some suitable filtering mechanism).
The fetch directed architecture also proposes some modifications to the Branch Target
Buffer that enable better space utilization: all addresses which fall through (even branches
predicted as fall through) are grouped together, so that BTB entries are not wasted on such
fall-through branches.
The main goals of our project were to study the fetch directed architecture proposed in
[Reinman 99], to study prefetching in this context, to modify the SimpleScalar tool set for this
architecture, and to revalidate the authors' conclusions from our simulation results. We also
compare the performance of the fetch directed prefetching scheme with the stream buffer
technique.
This paper is organized as follows. Section 2 describes the fetch directed architecture in
detail. Section 3 describes the stream buffer and fetch directed prefetching schemes that we
implemented. The results of our simulation studies are presented in Section 4, and we
conclude in Section 5.
2. Fetch Directed Prefetching Architecture
The fetch directed architecture uses a decoupled branch predictor and instruction
cache. The decoupled branch predictor runs ahead of the execution PC making predictions.
The other components of the architecture use those predictions to supply the fetch engine
with the instruction addresses to be executed next. Any prefetching mechanism can be added
to the fetch directed architecture to exploit instruction addresses that are available before
they are used. Since stream buffers [Jouppi 90] show consistently good performance, we add
stream prefetching to the base design to yield an architecture that combines future
predictions and prefetching to supply the fetch engine with the correct instructions while
hiding L1 i-cache miss latencies. The main components of the Fetch Directed Prefetching
Architecture are:
1) The decoupled branch predictor
2) Fetch Target Buffer (FTB)
3) Fetch Target Queue (FTQ)
4) Prefetch Filtration Mechanism
5) Prefetch Instruction Queue (PIQ)
6) Prefetch Buffers and Prefetch Mechanism
[Figure 1: Fetch directed prefetching architecture. The branch predictor and FTB fill the FTQ; the prefetch filtration mechanism selects FTQ entries into the PIQ, from which the prefetch unit brings blocks from the L2 cache into the prefetch buffer ahead of instruction fetch.]
The fetch directed prefetching architecture is shown in Figure 1. The fetch target buffer
(FTB) is similar to the branch target buffer (BTB). The fetch target queue (FTQ) contains a list
of instruction address blocks that the fetch stage should use next. The prefetch instruction
queue (PIQ) contains a list of i-cache blocks that the prefetch unit should prefetch. Each
component of the architecture is described in detail in the following sections.
2.1 Decoupled branch predictor
Motivation for a decoupled predictor
The decoupled branch predictor is the focal point of the fetch directed architecture. The
predictor is decoupled from the main pipeline, which enables it to run ahead of the pipeline.
The advantage is that the branch predictor can keep working even when the fetch unit is
stalled, producing a useful prediction every cycle that may be consumed in later cycles by the
fetch unit. This scheme, however, has the problem that the history seen by the branch
predictor may not be up to date, since it makes predictions well in advance of the fetch unit
using any of them.
Changes to SimpleScalar
Incorporating a decoupled branch predictor into the SimpleScalar tool set required
extracting the branch predictor out of the main pipeline. The original main pipeline called the
branch predictor for every fetch address. Instead, the fetch addresses were themselves
specified by the Fetch Target Queue (Section 2.3). An entry was popped off the queue to
specify a range of addresses to access. When this range was completely fetched and put into
the dispatch queue, the next range was read off the FTQ. Every range in an FTQ entry
constitutes a fetch target block, starting at a taken branch target (or the starting address of
the program) and ending at a taken branch or at the maximum size of an FTQ entry.
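To make the change concrete, here is a minimal C sketch of a fetch stage that drains FTQ ranges instead of consulting the branch predictor directly. The names (fetch_range, ftq_pop, icache_fetch, dispatch_enqueue) are hypothetical stand-ins for illustration, not our actual SimpleScalar code.

    /* A fetch target: one contiguous address range handed to the fetch unit. */
    typedef struct { unsigned int start, end; } fetch_range;

    extern int ftq_pop(fetch_range *r);                /* dequeue next range; 0 if FTQ empty */
    extern unsigned int icache_fetch(unsigned int pc); /* may stall on an i-cache miss */
    extern void dispatch_enqueue(unsigned int inst);

    /* Simplified fetch stage: rather than calling the branch predictor for
       every fetch PC, drain the address ranges supplied by the FTQ. */
    void fetch_stage(void)
    {
        fetch_range cur;
        if (!ftq_pop(&cur))
            return;              /* FTQ empty: fetch idles while the predictor runs ahead */
        for (unsigned int pc = cur.start; pc < cur.end; pc += 4)
            dispatch_enqueue(icache_fetch(pc));
    }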
The other changes were to the branch mispredict and branch misfetch handling. In
either case, all fetch directed architecture state along the mispredicted path needs to be
flushed, and the PCs used by the branch predictor and the main pipeline need to be reset to
the correct target PC.
There was a need to carry the branch predictor update information from the branch
prediction stage to the branch predictor update stage (when the correct outcome of the branch
is known). In the fetch directed architecture the branch predictor runs ahead of the main
pipeline, so this information cannot be stored in the fetch data queue (as the baseline
architecture does). We therefore added an extra queue between the branch predictor and the
fetch stage; it holds the direction update information for each branch until it can be put into
the fetch data queue in the fetch stage.
2.2 Fetch Target Buffer (FTB)
The fetch target buffer stores the fall-through and target addresses of the branch
ending a basic block. If the branch ending the basic block is predicted taken, the target
address is used for the next cycle's FTB lookup; otherwise the fall-through address is used.
Each entry in the FTB contains a tag, the taken target address and the fall-through address of
the block. The FTB is similar to the basic block target buffer (BBTB) [Yeh-Pat 92], differing
from it in three aspects:
a) Basic blocks ending in branches that are not taken are not stored in the FTB.
b) The full fall-through address of the basic block is not stored; only the size of the
basic block is. This reduces the storage requirement for the fall-through address
(it has been shown [Reinman 99] that the typical distance between the current fetch
address and the BBTB's fall-through address is not large).
c) The FTB is not coupled with the instruction cache, so it can continue operating even
when the i-cache is stalled on a miss.
Each cycle the FTB is accessed by the branch predictor with an address and a prediction
[Reinman2 99]. If a matching entry is found in the FTB, an entry is inserted into the FTQ
starting at the current address and ending at the last instruction of the basic block. Based on
the prediction, the target or fall-through address is returned to the branch predictor lookup
routine, which uses that address to look up the FTB in the next cycle. Thus the branch
predictor and FTB together predict the next set of blocks that will be used by the fetch
engine. By predicting the blocks before the fetch engine uses them, the prefetch mechanism
can prefetch the instructions in time for use, avoiding i-cache miss latencies.
An advantage of the FTB is that it can contain fetch blocks which span multiple cache blocks.
So on an FTB hit, a single- or multi-block entry from the FTB is inserted into the FTQ (in one or
more entries). On an FTB miss, a sequential block starting at the next cache-aligned address
and spanning up to 2 cache blocks is inserted into the FTQ. Since not-taken branches do not
have entries in the FTB, their lookups will fail and sequential blocks will be inserted into the
FTQ. This not only gives correct operation but also reduces the space requirement of the FTB
by not storing not-taken branches. On repeated FTB lookup failures, sequential blocks are
inserted into the FTQ until a misprediction occurs, at which point the FTB is updated with the
correct information and the lookup failures subside. Since the FTB is of fixed size, its entries
are replaced using an LRU policy. A two-level FTB would reduce the access time of the first
level while allowing a larger second level; we have currently implemented only a single-level
FTB.
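As a concrete illustration, here is a minimal C sketch of an FTB entry and the per-cycle lookup described above, assuming 32-byte cache lines and a 256-entry fully associative FTB. The field names and widths are our own illustration, not the exact structures of [Reinman2 99] or of our simulator.

    #define FTB_ENTRIES 256

    /* Hypothetical FTB entry layout: the fall-through address is kept as a
       small size field rather than a full address, per point (b) above. */
    typedef struct {
        unsigned int tag;       /* fetch block starting address */
        unsigned int target;    /* target of the taken branch ending the block */
        unsigned char size;     /* fall-through distance from the block start */
        unsigned char valid;
    } ftb_entry;

    /* One lookup per cycle: on a hit, return the predicted next fetch address
       (target if predicted taken, fall-through otherwise) and report the end
       of the fetch block so it can be enqueued on the FTQ.  On a miss, fall
       back to a sequential region of up to two 32-byte cache blocks. */
    unsigned int ftb_lookup(ftb_entry ftb[], unsigned int pc,
                            int pred_taken, unsigned int *block_end)
    {
        for (int i = 0; i < FTB_ENTRIES; i++) {     /* fully associative probe */
            if (ftb[i].valid && ftb[i].tag == pc) {
                *block_end = pc + ftb[i].size;
                return pred_taken ? ftb[i].target : *block_end;
            }
        }
        *block_end = (pc & ~31u) + 2 * 32;          /* FTB miss: sequential blocks */
        return *block_end;
    }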
Changes to SimpleScalar
The FTB is used by the branch predictor, and it fills entries into the FTQ. The
decoupled branch predictor was therefore made to use the FTB and to update it on all branch
lookups and branch updates. The blocks produced by the FTB are then inserted into the FTQ.
2.3 Fetch Target Queue (FTQ)
The FTQ is a queue of instruction address blocks. It bridges the gap between the
branch predictor and the instruction cache [Reinman 99]. The fetch engine dequeues the FTQ
to obtain the instruction addresses to fetch. The FTQ provides the buffering necessary to
allow the branch predictor and i-cache to operate autonomously; it allows the branch
predictor to work ahead of the i-cache when the latter is stalled on an i-cache miss.
Each FTQ entry contains up to 3 fetch block entries and the target address of the
branch ending the basic block. Each fetch block entry represents an L1 i-cache line and
contains a valid bit, an enqueued-prefetch bit (whether the entry has already been enqueued
for prefetch) and the cache block address.
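One possible C layout for such an entry, with hypothetical field names, is:

    /* Hypothetical layout of one FTQ entry, following the description above:
       up to three fetch block entries (one L1 i-cache line each) plus the
       target of the branch ending the block. */
    typedef struct {
        struct {
            unsigned int cache_block;    /* cache-aligned block address */
            unsigned char valid;         /* valid bit */
            unsigned char enq_prefetch;  /* already enqueued on the PIQ? */
        } blocks[3];
        unsigned int branch_target;      /* target address of the ending branch */
    } ftq_entry;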
Changes to SimpleScalar
The FTQ is filled by the FTB and provides instruction address blocks to the fetch unit,
so the fetch unit had to be modified to consume the instruction blocks provided by the FTQ.
The FTQ also provides interfaces used by the prefetch filtering mechanism.
2.4 Prefetch Instruction Queue (PIQ) and Prefetch Filter
The Prefetch Instruction Queue is a queue of addresses waiting to be prefetched by the
prefetch engine. Each PIQ entry identifies a cache line to be prefetched. Entries are inserted
into the PIQ every cycle by the Prefetch Filtration Mechanism. The earlier a prefetch can be
initiated before the fetch block reaches the i-cache, the greater the potential to hide the miss
latency. At the same time, the farther a fetch block is down the FTQ, the more likely it is to lie
on a mispredicted path. So the heuristic [Reinman 99] chooses candidates starting from the
second FTQ entry and going up to the tenth. The first entry in the FTQ is skipped because
there would be little benefit in starting to prefetch it when it is so close to being fetched by
the i-cache. By searching only 9 entries, we reduce the time required to find matching entries
to insert into the PIQ. Depending on exact timings, multiple entries can be inserted into the
PIQ in one cycle; we currently enqueue up to 3 entries at a time. Associated with each PIQ
entry is a prefetch buffer. We implemented the PIQ and prefetch stream buffers with the same
number of entries, but fewer prefetch buffers could be used depending on the memory
hierarchy and latencies. The prefetch mechanism dequeues entries from the PIQ and
schedules them for prefetch. Thus the filtration mechanism fills the PIQ while the prefetch
mechanism empties it.
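A minimal C sketch of this filter, reusing the hypothetical ftq_entry layout from Section 2.3 and assuming a 16-entry circular FTQ and a hypothetical piq_enqueue helper:

    #define FTQ_SIZE 16                    /* assumed queue depth */

    extern ftq_entry ftq[FTQ_SIZE];        /* the FTQ sketched in Section 2.3 */
    extern void piq_enqueue(unsigned int cache_block);   /* hypothetical */

    /* Each cycle, scan FTQ entries 2 through 10 (the head is skipped) and
       enqueue up to 3 not-yet-enqueued cache blocks on the PIQ. */
    void prefetch_filter(int head, int count)
    {
        int enqueued = 0;
        for (int i = 1; i < 10 && i < count && enqueued < 3; i++) {
            ftq_entry *e = &ftq[(head + i) % FTQ_SIZE];
            for (int b = 0; b < 3 && enqueued < 3; b++) {
                if (e->blocks[b].valid && !e->blocks[b].enq_prefetch) {
                    piq_enqueue(e->blocks[b].cache_block);
                    e->blocks[b].enq_prefetch = 1;   /* never enqueue twice */
                    enqueued++;
                }
            }
        }
    }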
Changes to SimpleScalar
A new mechanism was added which checked the FTQ for any entries which could be
enqueued for prefetch and were not already enqueued. The filtration scheme looked at FTQ
entries, filtered them based on the above heuristic and enqueued them into the PIQ for
prefetch. Whenever the L2 bus was free and a stream buffer was available, the prefetch
mechanism dequeued an entry from the PIQ and started prefetching it.
2.5 Recovery on mispredictions
On a misprediction, the FTQ, PIQ and prefetch buffer entries are invalidated and the
FTB is updated with the correct branch information. The sizes of the FTQ and PIQ can be set
according to their observed utilizations.
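A minimal C sketch of this recovery path, with hypothetical flush and update helpers standing in for our simulator routines:

    extern void ftq_flush(void);                 /* hypothetical flush helpers */
    extern void piq_flush(void);
    extern void prefetch_buffers_flush(void);
    extern void ftb_update(unsigned int branch_pc, unsigned int target, int taken);
    extern unsigned int predictor_pc, fetch_pc;

    /* Misprediction recovery: squash all fetch-direction state along the
       wrong path, correct the FTB, and restart both the decoupled predictor
       and the pipeline at the resolved PC. */
    void mispredict_recover(unsigned int branch_pc, unsigned int correct_pc, int taken)
    {
        ftq_flush();
        piq_flush();
        prefetch_buffers_flush();
        ftb_update(branch_pc, correct_pc, taken);
        predictor_pc = correct_pc;
        fetch_pc = correct_pc;
    }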
3. Stream Buffers
3.1 Introduction
[Figure 2: Stream buffer placed between the L1 and L2 i-caches.]
Stream buffers are used to reduce the penalty of compulsory and capacity misses. A
stream buffer is a small cache that is one cycle away from the L1 cache; a miss in L1 serviced
by the stream buffer therefore incurs a penalty of 1 cycle. The buffers are probed in parallel
with an i-cache access, and if the i-cache access misses, the cache block is fetched into the
i-cache from the stream buffer in the next cycle. The buffers can be probed in different ways,
and we explored two policies. In the first, the buffer is a FIFO queue where only the head entry
has a comparator, so any probe is compared against the head of the queue. If the entry is
present it is serviced; otherwise the entire queue is flushed and prefetching resumes from the
next sequential block after the miss (a sketch of this probe follows below). This scheme has
the advantage of simplicity, since a comparator is required only at the head. The second
policy is LRU. It uses a fully associative prefetch buffer with an LRU replacement strategy,
adding a replaceable bit to each prefetch buffer entry. A prefetch is initiated only if there is an
entry marked replaceable in the prefetch buffer, choosing the least recently used replaceable
entry. On a miss, all entries are marked replaceable; entries are also marked replaceable when
their matching cache blocks are fetched during the instruction cache fetch.
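A minimal C sketch of the FIFO probe under the assumptions above (8 buffer entries, 32-byte cache lines); the names are illustrative only:

    #define SB_ENTRIES 8
    #define LINE_SIZE 32

    typedef struct { unsigned int addr; int valid; } sb_entry;
    static sb_entry sbuf[SB_ENTRIES];
    static int sb_head, sb_count;

    /* FIFO policy: only the head entry has a comparator.  A probe that hits
       the head is serviced with a one-cycle penalty; a probe that misses
       flushes the whole queue and restarts sequential prefetching past the
       missing block. */
    int sb_probe_fifo(unsigned int block_addr, unsigned int *next_prefetch)
    {
        if (sb_count > 0 && sbuf[sb_head].valid &&
            sbuf[sb_head].addr == block_addr) {
            sb_head = (sb_head + 1) % SB_ENTRIES;   /* consume the head entry */
            sb_count--;
            return 1;                               /* hit: fill L1 next cycle */
        }
        sb_count = 0;                               /* miss: flush the queue */
        *next_prefetch = block_addr + LINE_SIZE;    /* resume sequentially */
        return 0;
    }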
Changes to SimpleScalar
The stream buffer logic and policies were implemented separately. Changes were made
to the fetch unit of SimpleScalar, the L1 i-cache miss handler and the main pipeline loop.
Additional configuration parameters were added to configure the stream buffer for different
runs. The fetch unit was modified to probe the stream buffer in parallel with the i-cache
lookup. The prefetch was initiated after the fetch in the pipeline to give priority to instruction
fetches over prefetches. The L1 miss handler was modified so that a miss in L1 with a
corresponding hit in the stream buffer is serviced by the stream buffer. If a miss occurs in
both L1 and the stream buffer, an L2 cache access is made; during such an access, the stream
buffers are flushed appropriately and the next prefetch address is set. Cases where an
instruction fetch is blocked by an ongoing prefetch were also accounted for.
3.2 Prefetch Buffers in the Fetch Directed Architecture
Prefetch buffers are similar to stream buffers. The main difference is that the addresses
are computed by the PIQ, and a prefetch is initiated using the address at the head of the
queue. Priority is given to outstanding L1 cache fetches. Only FIFO policies were examined in
this case. The prefetch buffers were flushed on a branch mispredict to keep them in sync with
the FTQ.
Implementation
The implementation of stream buffers was extended to accommodate prefetch buffers.
The main change was to the way the prefetch addresses are computed: the prefetch buffer
queries the PIQ logic to obtain the address for the next prefetch. The flushing mechanism is
now triggered by a branch mispredict.
4. Simulation Results
4.1 Simulation Architecture
The base architecture that we explored was a 4-issue out-of-order processor. The L2
cache was single ported with a transfer bandwidth of 8 bytes/cycle. We used a 4 KB, 4-way
set associative L2 cache with 64-byte lines and a 256-byte, 2-way set associative L1 i-cache
with 32-byte lines. The L2 access latency was set to 12 cycles and the memory latency to
30 cycles. These, along with the statistics related to the fetch directed architecture and
prefetching, are presented in the following sections.
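For reference, the cache portion of this configuration corresponds roughly to the following sim-outorder invocation, assuming the stock SimpleScalar 2.0 cache flag syntax (name:nsets:bsize:assoc:repl); the parameters we added for the stream buffers and the FTQ/PIQ are not shown, and other options are left at their defaults.

    # L1 i-cache: 256 B = 4 sets x 32 B lines x 2 ways;
    # unified L2: 4 KB = 16 sets x 64 B lines x 4 ways, 12-cycle latency.
    sim-outorder -issue:width 4 \
                 -cache:il1 il1:4:32:2:l \
                 -cache:dl2 ul2:16:64:4:l -cache:il2 dl2 \
                 -cache:dl2lat 12 -mem:lat 30 2 benchmark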
4.2 Stream Buffer Results
The implementation of stream buffers was tested on several Spec95 integer
benchmarks: compress, li, perl and vortex. We collected statistics such as IPC, the number of
useful prefetches, the number of harmful prefetches, cycles lost due to harmful prefetches and
cycles gained due to useful prefetches.
IPC of various policies
We see a consistent improvement in IPC for FIFO prefetching over the baseline
architecture, and the LRU strategy performs better than FIFO. The percentage speedups vary
across benchmarks: for FIFO from under 1% to just under 6%, and for LRU from under 1% to
just under 7%. We attribute the low percentage increase to two factors. First, the L2 cache is
single ported and the prefetching strategy gives priority to i-cache and d-cache misses, so a
prefetch is possible only when the L2 is not servicing either L1 cache. Second, some
prefetches hog the bus when the fetch unit needs it; since the fetch unit has initiated a fetch
from L2, the ongoing prefetch is useless and is detrimental by occupying the bus. We studied
the actual percentage of such occurrences and the number of cycles lost to such harmful
prefetches (Graph 2). These statistics explain why the IPC speedup is very low for certain
benchmarks. For example, for the li benchmark (Graph 2), useful prefetches gained 5.17% of
cycles, but harmful prefetches cost a 2% cycle loss, so the net gain is reduced to around
3.17% of the total number of cycles. Graph 1 shows the possible improvement in IPC if these
harmful prefetches were avoided (the Ideal IPC column). For li, we see a 9% improvement in
the Ideal IPC over the IPC using a single-ported, non-pipelined L2 cache. This means that
adding ports to the L2 cache or pipelining it will definitely improve IPC in such cases.
[Graph 1: Stream buffer prefetching performance. IPC of compress, perl, li and vortex under no prefetching, stream buffers with FIFO, stream buffers with LRU, and ideal IPC with FIFO stream buffers. Parameters: processor issue width = 4; L1 cache 256 bytes, 32-byte lines, 2-way set associative; L2 cache 4 KB, 64-byte lines, 4-way set associative; L2 latency = 12 cycles; prefetch buffer size = 8; single-ported L2 cache with 8 bytes/cycle.]
[Graph 2: Study of cycles lost due to prefetches blocking i-cache fetch. Percentage of total cycles lost to harmful prefetches and gained from prefetch hits, per benchmark (compress, li, vortex, perl). The prefetch buffer parameters are the same as in Graph 1.]
Variation of IPC with size of prefetch buffers
[Graph 3: Variation of IPC with the number of entries in the stream buffers for the FIFO policy. Percentage IPC speedup over no stream buffer for perl, li, vortex and compress with 2, 4, 8 and 16 stream buffers.]
[Graph 4: Variation of IPC with the number of stream buffers for the LRU replacement policy, with the same benchmarks and buffer counts as Graph 3.]
We examined the amount of speedup that could be obtained as we vary the size of the
prefetch buffers. The main motivation for this study is the LRU policy, since it requires full
associativity, which adds to hardware cost. We found that adding buffers to the stream buffer
does increase IPC, but beyond a certain number of buffers the IPC reaches an asymptote
(Graphs 3 and 4). This behavior is consistent across both FIFO and LRU. The optimal number
of buffers is such that the total size of the stream buffer equals the average size of a basic
block. The average basic block size also places an upper bound on the number of blocks that
can usefully be prefetched, assuming the entire basic block is serviced by the prefetch buffer.
So increasing the size of the prefetch buffer beyond this limit produces no change, since the
additional blocks are never used.
4.3 Fetch Directed Architecture Results
This section presents our results from simulations of the fetch directed architecture.
Graph 5 compares the performance of a Branch Target Buffer versus a Fetch Target Buffer in
the decoupled architecture; we plugged both into the decoupled architecture to determine the
speedup provided by the Fetch Target Buffer. Our simulations used a fully associative BTB
and FTB of 256 entries. We find that the FTB consistently outperforms the BTB. This is
because the FTB can keep the execution engine busy: its predictions span a larger number of
instructions (the fetch target block), while the BTB can only predict the target address
(equivalent to a fetch target block that contains only one address). Also, since the FTB stores
entries only for taken branches, it has a larger reach than a BTB of the same size.
[Graph 5: Comparison of IPC obtained using a BTB and an FTB, for li, perl and vortex.]
Graph 6 presents results of our simulation studies of the IPC improvement when we
introduce prefetching into the decoupled architecture. The speedup is calculated over the
base fetch directed architecture with no prefetching. We see a varying range of speedups
across benchmarks because prefetching is useful only if the decoupled branch predictor
makes the right predictions; thus, the nature of the benchmark affects the gain derived from
prefetching. We observed a peak performance improvement of 8.8% (li) due to prefetching
among the benchmarks we ran on the simulator.
[Graph 6: Improvement in IPC using stream buffer prefetching over the base fetch directed architecture, for compress, li, perl and vortex.]
[Graph 7: Comparison of IPC improvement when using stream buffers over base SimpleScalar versus fetch directed prefetching over the base fetch directed architecture, for compress, li, perl and vortex.]
Graph 7 compares the stream buffer performance with that of the fetch directed
prefetching architecture. We find that fetch directed prefetching gives a substantial
performance improvement for li and vortex. For compress we see a slight performance drop,
which again is due to the nature of the benchmark: if many of the branch predictions are
wrong, the prefetching is mostly useless. For example, if many branches we predict taken are
actually not taken, stream buffers will perform better than fetch directed prefetching.
5. Conclusions
The fetch directed architecture decouples the branch predictor from the main pipeline
so that it can run ahead and make predictions even when the i-cache is stalled. We simulated
the fetch directed architecture on SimpleScalar and found that the FTB scheme gives better
performance than a BTB. We also coupled the architecture with stream buffers and showed
that the architecture enables intelligent prefetching, resulting in a greater improvement in IPC
than prefetching over the base SimpleScalar architecture. For better results, we would need to
change the memory hierarchy model to support multi-ported and/or pipelined caches.
References
[Reinman 99] Glenn Reinman, Brad Calder, Todd Austin. Fetch Directed Instruction
Prefetching. In Proceedings of the 32nd Annual International Symposium on Microarchitecture
(MICRO-32), Nov. 1999.
[Reinman2 99] Glenn Reinman, Todd Austin, Brad Calder. A Scalable Front-End Architecture
for Fast Instruction Delivery. In Proceedings of the International Symposium on Computer
Architecture, May 1999.
[Reinman 00] Glenn Reinman, Brad Calder, Todd Austin. Optimizations Enabled by a
Decoupled Front-End Architecture. To appear in IEEE Transactions on Computers.
[Jouppi 90] Norman P. Jouppi. Improving Direct-Mapped Cache Performance by the Addition
of a Small Fully-Associative Cache and Prefetch Buffers. In Proceedings of the International
Symposium on Computer Architecture, June 1990.
[Simplescalar 97] D. C. Burger, Todd Austin. The SimpleScalar Tool Set, Version 2.0.
Technical Report CS-TR-97-1342, University of Wisconsin-Madison, July 1997.
[Yeh-Pat 92] T. Yeh and Y. Patt. A Comprehensive Instruction Fetch Mechanism for a
Processor Supporting Speculative Execution. In Proceedings of the 25th Annual International
Symposium on Microarchitecture, Dec. 1992.