
Fast Architecture Evaluation of
Heterogeneous MPSoCs
by Host-Compiled Simulation
黃 翔
Dept. of Electrical Engineering
National Cheng Kung University
Tainan, Taiwan, R.O.C
2012.09.10
Abstract
 For evaluating important architectural decisions such as tile
structure and core selection within each tile for future 100–
1000 core designs, fast and flexible simulation approaches are
mandatory.
 We evaluate heterogeneous tiled MPSoCs using a timing-approximate
simulation approach.
 In order to verify performance goals of the heterogeneous
MPSoC apart from functional correctness, we propose a
timing-approximate simulation approach and a time-warping
mechanism.
 It allows the investigation of phases of thread (re-)distribution and
resource-awareness with an appropriate accuracy.
 For selected case studies, it is shown how architectural parameters may be varied very quickly, enabling the exploration of different designs for cost, performance, and other design objectives.
Introduction
 Processor architectures are becoming not only more and more
parallel but also increasingly heterogeneous for energy efficiency
reasons.
 This trend toward many-core system designs implementing hundreds to thousands of cores as well as hardware accelerators on a single chip leads to many different challenges,
 such as overheating, reliability, and security issues, as well as resource contention.
 As a remedy, resource-aware programming concepts such as
invasive computing have been recently proposed that exploit
self-adaptiveness and self-organization of thread generation and
distribution.
 Heterogeneity poses a major challenge for evaluating architectural design choices early,
 such as the number of tiles, the internal tile structure, and the selection of cores within each tile.
Background on Resource-Aware
Programming
The three distinct programming constructs of invasive computing can be summarized as follows:
 Invasion:
 The first step of an invasive program is to explore its neighborhood and claim resources in a phase called invasion.
 This is done by issuing a library call to invade.
 The run-time system then builds and returns a claim object, which denotes a set of compute resources that is now allocated to the calling application.
 Infection:
 The second step is to copy the code and the data to the invaded compute resources and
execute this code in parallel in an execution phase called infection.
 This is done by issuing a library call to infect on the claim object.
 Retreat:
 After the parallel execution terminates, the programmer finally has to free the
occupied resources by calling the library function retreat on the claim object.
Architecture Evaluation (1/3)
 We want to simulate the performance of multiple resource-aware
applications running on heterogeneous tiled architectures.
 In order to simulate the functional behavior of the applications as well as important timing information, we employ a host-compiled simulation approach.
 This approach is based on a time measurement on the host processor
and a time warping mechanism for scaling the measured execution time
to a modeled target processor.
 We decided to use this approach because it allows us to
 easily create heterogeneous tiled architectures,
 change the parameters of the contained processing, memory, and communication resources, and
 evaluate the performance of the architecture and the functional correctness of the applications in a very short time.
Architecture Evaluation (2/3)
 In Figure 1, an overview of our proposed MPSoC architecture evaluation is
depicted.
Figure 1:
Overall flow of architecture evaluation.
Architecture Evaluation (3/3)
 The total hardware cost is simply computed as the sum over all underlying computational resources, i.e., only the cost for cores, PEs of a tightly coupled processor array (TCPA), and network routers is considered.
 The application model allows the definition of execution scenarios that include multiple competing applications with different degrees of parallelism.
MPSoC Architecture Model (1/2)
 Figure 2 displays a typical example of such an MPSoC architecture as considered throughout this paper.
Figure 2:
A generic invasive tiled architecture with different processor types,
accelerator tiles such as TCPAs, and memory tiles.
MPSoC Architecture Model (2/2)
 We characterize a tiled architecture according to Figure 2 by the following structural parameters:
 a set T = {T1, . . . , Tm} of tiles, with a total of m tiles
 to each tile Ti ∈ T there is associated a size s(Ti) (the number of cores per tile)
 a processor type r(Ti) ∈ {RISC, i-Core, TCPA} (assuming each tile is composed only of cores of the same type)
 For example, say the upper left tile is T1 in Figure 2. Then, s(T1) = 2 and r(T1) = RISC. Similarly, the upper TCPA tile T2 is characterized by s(T2) = 20 and r(T2) = TCPA.
Evidently, the cost of an MPSoC
will depend not only on tile number
and tile size, but also on the type of
processor chosen within each tile.
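Under this structural model and the cost computation described earlier (the sum over all cores, TCPA PEs, and network routers), an architecture variant can be priced in a few lines. The area units below are illustrative placeholders, not the values of Table 1:

```python
# Sketch of the structural architecture model: a tiled MPSoC is a list of
# tiles, each with a size s(Ti) and a processor type r(Ti).  The area cost
# units are hypothetical placeholders, not the values of Table 1.
from dataclasses import dataclass

AREA_UNITS = {"RISC": 4.0, "i-Core": 6.0, "TCPA_PE": 1.0}  # hypothetical
ROUTER_AREA = 1.0                                          # hypothetical


@dataclass
class Tile:
    size: int        # s(Ti): number of cores (or PEs) in the tile
    proc_type: str   # r(Ti) in {"RISC", "i-Core", "TCPA"}


def total_cost(tiles):
    """Sum the cost of all cores/PEs plus one network router per tile."""
    cost = 0.0
    for t in tiles:
        unit = AREA_UNITS["TCPA_PE" if t.proc_type == "TCPA" else t.proc_type]
        cost += t.size * unit + ROUTER_AREA
    return cost


# Example: the two tiles discussed above, T1 (RISC) and T2 (TCPA).
arch = [Tile(2, "RISC"), Tile(20, "TCPA")]
print(total_cost(arch))  # 2*4 + 1 + 20*1 + 1 = 30.0
```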
Application and Programming Model (1/3)
 The applications express their temporal resource requirements only by using the invasive programming constructs.
 Applications may request additional compute resources for their parallel execution or release compute resources in order to make them available for other applications running on the architecture.
 An example of an invasive resource-aware program as well as the execution of this program on a tile containing four RISC cores (like the upper left tile in Figure 2) is shown in Figure 3.
Application and Programming Model (2/3)
Figure 3:
Example program using resource-aware programming constructs for parallel execution, requesting two additional CPUs at runtime.
Application and Programming Model (3/3)
 Here, the program initially starts at RISC1.
 Then, the program wants to allocate two additional RISC CPUs for parallel execution.
 4: c.add(new TypeConstraint(PEType.RISC));
 5: c.add(new PEQuantity(2));
 If the invasion of two more cores is successful, a claim containing two free RISC cores is returned.
 6: val claim = homeClaim + Claim.invade(c);
 This claim and the so-called homeClaim, which denotes the initially assigned core for the application, are merged into a single claim.
 Now, the parallel execution can be started by issuing an infect command with the appropriate i-let.
 10: claim.infect(ilet);
 The resources RISC1, RISC2, and RISC3 are used in parallel until the initial program finally issues a retreat on RISC2 and RISC3 after all child i-lets have terminated.
 11: claim.retreat();
Host-Compiled Simulation (1/6)
 Basically, this simulation approach is composed of two parts:
 a) A time measurement on the host processor and a time warping mechanism for scaling the measured execution time to a modeled target processor.
 b) A synchronization mechanism for simulating parallel applications.

As the measurement parameter, we use the number of executed instructions on the host processor.

After counting the number of executed instructions, we map
this value to an execution time value on the target processor by
applying a set of analytical equations.
 We call this time mapping time warping.
 The equations are parameterized by the computational properties of the target processor and by some general properties of the application.
Host-Compiled Simulation (2/6)
 For an estimation of the execution time t on the target processor, we apply the following equation:

t = I · (pC · CPIC + pM · CPIM · (1 + b)) / (f · N)

 I : the number of executed instructions
 CPIC : the average number of cycles per instruction for instructions that can be computed without a memory access
 CPIM : the average number of cycles per instruction for instructions that access the main memory
 pM, pC : the fractions of memory instructions and compute instructions, provided by the properties of the application, where pM, pC ∈ [0, 1] ⊂ R and pM + pC = 1
 f : the clock frequency of the target processor
 b : a slowdown factor, which is the ratio of the required bandwidth B of all applications on one tile to the maximum available memory bandwidth BM on this tile: b = B / BM
 N : the number of parallel execution units (e.g., in case of a CPU, we set N = 1)
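As a sketch, the time-warping step might be implemented as follows, assuming the listed parameters combine as target cycles = I · (pC · CPIC + pM · CPIM · (1 + b)) and t = cycles / (f · N); the simulator's exact equation is not reproduced in these slides, so treat this form as an assumption:

```python
# Hypothetical sketch of time warping: scale a measured host instruction
# count I to an estimated execution time on the modeled target processor.
# Assumed form: memory cycles are inflated by the bandwidth slowdown b,
# and the total cycle count is divided by f * N.
def time_warp(I, p_m, cpi_m, cpi_c, f, b, N=1):
    assert 0.0 <= p_m <= 1.0, "p_M must be a fraction in [0, 1]"
    p_c = 1.0 - p_m                           # p_M + p_C = 1
    cycles = I * (p_c * cpi_c + p_m * cpi_m * (1.0 + b))
    return cycles / (f * N)                   # estimated seconds on target


# 1e9 instructions, 30% memory instructions, CPI_C = 1, CPI_M = 10,
# a 1 GHz target clock, no bandwidth contention (b = 0), one CPU (N = 1):
t = time_warp(I=1e9, p_m=0.3, cpi_m=10, cpi_c=1, f=1e9, b=0.0)
print(t)  # ≈ 3.7 seconds
```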
Host-Compiled Simulation (3/6)
 In order to obtain correct simulation results, it has to be guaranteed that each simulation thread has reached at least that point in simulation time at which the modification of the global state occurs.

Therefore, a time-based and event-driven synchronization mechanism
is provided that guarantees the causal and time-aware execution
sequence of the threads according to specifically defined
synchronization points.
 We chose each call to the resource-aware programming library as a synchronization point, because at such a point, the shared status information of the underlying architecture model is read and modified.
 Two thread types:
 1. Simulation thread
 maintains its local simulation time
 generates events that contain the current local simulation time of the thread and puts them into the global event list
 2. Synchronization thread
 maintains the global simulation time; only this thread advances it
 reads and analyzes the events
Host-Compiled Simulation (4/6)
 Both thread types follow a certain procedure for the synchronization, which is depicted in Figure 4.
Figure 4:
Flowchart of the synchronization mechanism between the
simulation of multiple applications and parallel threads.
Host-Compiled Simulation (5/6)
 The execution of the thread begins by starting the time
measurement function.
 The simulation starts immediately and continues until a
synchronization point is reached.
 The time measurement is then stopped.
 The time warping mechanism is applied to the measured number of instructions.
 After updating the local simulation time value by adding the
determined time value to it, an event containing the local
simulation time is created and put into the global event list.

Now, a barrier synchronization takes place and blocks the thread
until all existing simulation threads as well as the synchronization
thread have reached that barrier.
Host-Compiled Simulation (6/6)
 After the first barrier, only the synchronization thread operates.
 It determines the event with the lowest time value from the global event list, sets the global simulation time to that value, and removes this event from the list.
 Again, a barrier synchronization for all threads takes place.
 After synchronization, the simulation threads check whether
the global simulation time value equals their local simulation
time value.
 If this is true, the simulation thread continues its simulation; otherwise, it runs into the first barrier again.
 The synchronization thread directly runs into the first barrier again and waits for the other threads to synchronize.
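The two-barrier protocol of the last three slides can be modeled in plain Python threads. The following is an illustrative sketch only: `run_simulation`, the thread names, and the event times are invented for the example, and each simulation phase is reduced to a precomputed local-time value rather than measured and time-warped code:

```python
# Illustrative model of the barrier-based synchronization: simulation
# threads post their local times as events; between the two barriers, only
# the synchronization thread advances the global time to the smallest
# event, and exactly the matching thread continues its simulation.
import heapq
import threading

event_list = []          # global event list (a min-heap of local times)
global_time = [0.0]      # advanced only by the synchronization thread
trace = []               # order in which simulation threads proceed
lock = threading.Lock()


def run_simulation(thread_times):
    """thread_times: {name: [distinct local-time values at sync points]}."""
    rounds = sum(len(v) for v in thread_times.values())
    b1 = threading.Barrier(len(thread_times) + 1)  # sim threads + sync
    b2 = threading.Barrier(len(thread_times) + 1)

    def sim_thread(name, times):
        pending = list(times)
        with lock:                        # post the first sync-point event
            heapq.heappush(event_list, pending[0])
        for _ in range(rounds):
            b1.wait()                     # barrier 1: everyone arrives
            b2.wait()                     # barrier 2: global time advanced
            if pending and global_time[0] == pending[0]:
                with lock:                # our turn: simulate next phase
                    trace.append((pending.pop(0), name))
                    if pending:
                        heapq.heappush(event_list, pending[0])

    def sync_thread():
        for _ in range(rounds):
            b1.wait()
            with lock:                    # only this thread advances time
                global_time[0] = heapq.heappop(event_list)
            b2.wait()

    workers = [threading.Thread(target=sim_thread, args=(n, t))
               for n, t in thread_times.items()]
    workers.append(threading.Thread(target=sync_thread))
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return trace


# Two applications with distinct synchronization-point times:
order = run_simulation({"A": [1.0, 4.0], "B": [2.0, 3.0]})
print(order)  # [(1.0, 'A'), (2.0, 'B'), (3.0, 'B'), (4.0, 'A')]
```

With distinct event times, exactly one event is consumed per round, so the threads interleave deterministically in causal, time-aware order.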
Experimental Results (1/12)
 The simulation runs were executed on an Intel Core i7 quad-core CPU with eight virtual cores at 2.93 GHz.
 We specify the costs and the parameters of the hardware model
for the different types of resources on a heterogeneous
architecture according to Table 1.
Table 1:
Area cost units and hardware model parameters of the different types of resources used in these experiments.
Experimental Results (2/12)
 Here, we considered a homogeneous architecture consisting of
64 processing resources of the type RISC.
 Figure 5 shows the different layout variants we evaluated in our experiments.
 This resource-aware application only utilizes homogeneous architectures and does not contain any part that could exploit heterogeneity.
Figure 5:
Selected tile layouts of a homogeneous architecture. Here, only a grid-based
topology is considered. Each tile consists of one or more equal processing
resources. The architecture on the left consists of only one tile that contains
64 processing resources.
Experimental Results (3/12)
 We simulated only one resource-aware application.
 We increased the degree of parallelism of the application by
spawning a given number of threads, which are executed in
parallel on the architecture.
 The results are shown in Figure 6.
Figure 6:
Simulation of a resource-aware
application on a homogeneous
architecture with different tile
layouts. The computation is done in
parallel by spawning several threads.
Experimental Results (4/12)
 Here, in general, one can see that the application gains a larger speedup when spawning a low number of threads and saturates when spawning a large number of threads.
 The simulation shows that the tile layout 1×64 results in a higher latency than the other tile layouts when more threads are used, because the bandwidth limitation of a shared-memory system slows down the execution of the threads.
 Among the four different considered configurations, the best
layout for this application is 16×4.
 This is a mixture of a shared-memory and a distributed-memory system.
 Here, the bandwidth limit on one tile only affects a few threads.
Experimental Results (5/12)
 Our second experiment simulates a given number of instances of the same application on the architectures.
 We estimated the latency for the execution of all applications.
 We fixed the degree of parallelism at four threads per application, and each application is started at the same time.
 The results for the different tile layouts are shown in Figure 7.
Figure 7:
Simulation of several resource-aware applications in parallel on a homogeneous architecture with different tile layouts.
Experimental Results (6/12)
 One can see that tile layout 1×64 results in the highest
latencies, because of the bandwidth limit.
 The other layouts result in lower latencies, because more tiles
are used and the bandwidth limit is distributed over these tiles.
 Again, tile layout 16×4 results in the best latency.
 For the evaluation of the architecture variants against costs and performance, we used the sum of the area units of all contained resources within the architecture as the cost value, and the number of applications per second the architecture is able to execute as the performance value.
 The results of the evaluation are shown in Figure 8.
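This evaluation boils down to comparing (cost, performance) design points and keeping the non-dominated ones. The following sketch illustrates the idea with invented numbers, not the values behind Figure 8:

```python
# Illustrative evaluation of design points as (cost, performance) pairs.
# A point dominates another if it is no worse in both objectives and
# strictly better in at least one (cost: lower is better; performance,
# in applications per second: higher is better).  All numbers are invented.
def dominates(a, b):
    cost_a, perf_a = a
    cost_b, perf_b = b
    return (cost_a <= cost_b and perf_a >= perf_b
            and (cost_a < cost_b or perf_a > perf_b))


def pareto_front(points):
    """Keep the design points not dominated by any other point."""
    return {name: p for name, p in points.items()
            if not any(dominates(q, p) for other, q in points.items()
                       if other != name)}


layouts = {           # hypothetical (area units, applications per second)
    "1x64": (70.0, 2.0),
    "4x16": (90.0, 5.0),
    "16x4": (110.0, 9.0),
    "64x1": (160.0, 8.0),
}
print(sorted(pareto_front(layouts)))  # ['16x4', '1x64', '4x16']
```

With these invented numbers, 64x1 is dominated (16x4 is both cheaper and faster), while the remaining layouts form the cost/performance trade-off front.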
Experimental Results (7/12)
Figure 8:
Evaluation of different tile layouts
against costs and performance.
 In case of the 64×1 layout, the costs are maximized for a
homogeneous architecture.
 The 1×64 layout has the lowest costs, however also the lowest
performance.
 The 16×4 layout has a high performance and moderate costs.
Experimental Results (8/12)
 In our second series of experiments, we evaluated costs and
performance of heterogeneous architectures.
 Here, we studied five different configurations of tiled
architectures with a 16×4 layout as shown in Figure 9.
Figure 9:
Selected heterogeneous architectures. Each tile may consist of processing resources of type RISC (4), i-Core (1), or TCPA (4×4 or 8×8).
Experimental Results (9/12)
 As application scenario, we considered a compute-bound resource-aware application, which consists of three sub-algorithms with the following characteristics:
a) The first part is a task parallel execution.
b) The second part is computationally intensive and thus suited for data-parallel accelerators like the TCPAs.
c) The third part may benefit from custom instructions of ASIPs, such as
the i-Core.
 The results of the simulation of several of these applications
on the considered architectures are shown in Figure 10.
 They show the average latency of the simulated applications.
Experimental Results (10/12)
Figure 10:
Simulation of several resource-aware applications in parallel on a
heterogeneous architecture with different types of contained processing
resources.
Experimental Results (11/12)
 Here, one can see that the homogeneous architecture (Arch1)
does not fit the requirements of the application, thus no
speedup is observable in this case.
 The latency is constant over the number of applications, because the
applications are mainly compute-bound and do not slow down from
bandwidth limitations.
 Among the considered architectures, Arch4, a mixture of small TCPA tiles and i-Core tiles, provides the best performance for the applications.
 Those variants with a big TCPA tile (Arch3 and Arch5) result
in higher latencies than the variants with smaller TCPA tiles.
Experimental Results (12/12)
 Figure 11 shows the evaluation results of our considered
heterogeneous architectures.
 In the figure, it can be seen that Arch2 and Arch4 dominate Arch1, Arch3, and Arch5 in both costs and performance.
Figure 11:
Evaluation of
heterogeneous
architectures against costs
and performance.
Conclusion
 We demonstrated the fast evaluation of the architectural design
space of tile number, tile organization, and processor selection
within each tile.