Design Flow of a Dedicated Computer Cluster
Customized for a Distributed Genetic Algorithm
Application
Alexandra Aguiar, Márcio Kreutz, Rafael Santos and Tatiana Santos
Universidade de Santa Cruz do Sul
Santa Cruz do Sul, RS, Brazil
[email protected], {kreutz, rsantos, tatianas}@unisc.br
Abstract— In the past few years, computer grids and clusters of computers have been widely used to keep up with the high computational performance required by high-end applications. They are especially attractive due to their good performance at relatively low cost when compared to powerful servers and supercomputers. A similar scenario is found in the embedded world, where highly specialized tasks are usually partitioned among dedicated processors composing distributed systems. This work focuses on the architectural specialization of cluster machines by analyzing application behavior and optimizing instruction-set architectures. The motivation for this work relies on the observation that tasks found in embedded software present a behavior that can normally be implemented by a subset of an instruction-set architecture. This opens opportunities for optimization by removing unneeded instructions. As a consequence, processors become specialized, since their resources better fit the performance and power-consumption constraints of the applications.
This work proposes a design flow to adapt cluster machines to the constraints of embedded applications, where high flexibility and performance are achieved by hardware customization and its further distribution. Moreover, a case study is presented with a description of the entire design flow as well as the synthesis results.
I. INTRODUCTION
Typically, the performance of applications can be improved
by either distributing their tasks among processing elements
or by specializing them to dedicated hardware.
Both approaches have pros and cons. Some applications are more naturally distributed in software, while others are more likely to fit a hardware implementation. This clearly depends on the nature of the application. In recent years, however, the use of computer clusters and grids has become an attractive alternative for the execution of applications that demand high computational power, because of their price-performance ratio. These systems often respond to the processing demands of most distributed applications.
The main goal of this work is to find a compromise between both approaches: processors are connected by a network fabric in a cluster fashion, while their instruction sets are targeted to dedicated tasks of embedded applications.
The memory model used in cluster systems requires that
applications exchange messages through a networked pool of
nodes [1]. Even though current Gigabit Ethernet cards are found at relatively low prices and offer significantly higher bandwidth, some applications do not map well to this model.
On top of that, despite the fact that network cards are
constantly improving, some applications also need resources
at the processor level that are not found in commodity microprocessors. This occurs especially in applications in which
a hardware implementation would be the best approach to
follow. Nevertheless, even in such cases the distribution would
certainly help to increase performance.
Thus, dedicated processors coupled with high-speed networks may be an interesting alternative for those applications that need specific processor resources and high-speed communication capabilities.
In this work, the design flow of an application-specific computer cluster machine is presented, where processors are customized according to the constraints of embedded applications and integrated with an Ethernet-based communication stack. The proposed design flow targets applications that allow a smooth and efficient distribution, having tasks that can be easily fitted into small, optimized processors.
Finally, this research provides the hardware infrastructure
to allow the connection among the nodes as well as a design
flow to build the application and integrate the system. In
other words, this work proposes a development flow and infrastructure which allow optimized processors to execute a given distributed and/or parallel application.
The remainder of this paper is organized as follows: Section 2 discusses related work, while Section 3 presents in detail the concept of a cluster targeted to application-specific constraints. Section 4 describes a case study with a distributed genetic algorithm. Section 5 describes how synthesis and validation were performed in this research, while Section 6 shows the results achieved. Finally, Section 7 presents the conclusions, final remarks and future work.
II. RELATED WORK
Previous works have studied the use of application-specific devices in cluster systems, as these devices have become more popular over the years.
Yeh et al. [2] proposed the use of FPGAs to build a switch fabric. Jones et al. [3] included a reconfigurable off-the-shelf computing card in each node of a cluster.
Other implementations, such as Dandalis et al. [4], proposed the use of reconfigurable network cards to implement specific protocols such as IPsec. Sass et al. [5] proposed the use of an intelligent network card (INIC) capable of processing messages and injecting them into the network, alleviating the pressure on the processor in order to enable full exploitation of the bandwidth and latency of modern networks. Underwood et al. [6] then presented a cost analysis of an Adaptable Computing Cluster based on the INIC project.
More recently, Jacob et al. [7] proposed the CARMA framework for reconfigurable clusters as a tool for managing different configuration schemes, and Willians et al. [8] presented a reconfigurable cluster-on-chip architecture and supporting libraries for developing multi-core reconfigurable systems-on-chip using the MPI (Message Passing Interface) standard.
III. DESIGNING A DEDICATED CLUSTER NODE
One of the main goals of this work is to provide a design flow for building optimized cluster machines by tuning their instruction sets according to the constraints of embedded software tasks. Figure 1 presents the design flow proposed in this work.
The first step is the generation of the FemtoJava core through the SASHIMI tool [9]. This core is then integrated with the communication infrastructure provided by the LwIP library (TCP/IP stack) and the integration layer, as described in steps (2) and (3).
The dedicated cluster architecture proposed in this research differs from the other projects in two main aspects. First, our approach aims at customizing the processor itself as a function of application constraints. This is done through an automated flow which generates an optimized microprocessor starting from a Java application. The microprocessor is tailored for the application, enabling the optimizations necessary for that application. Furthermore, it synthesizes only the hardware resources needed by the application.
Second, the generated microprocessor is integrated with the communication module (a TCP/IP stack), which implements transmit/receive buffers that can be accessed at the same frequency as the processor. The latter allows full exploitation of high-speed network communication.
The proposed design flow allows the programmer to concentrate on the development of the application, i.e., the algorithm. Thus, the use of an automated flow enables fast development at a higher abstraction level not found in other projects.
The result is a tightly coupled device which integrates the processor and communication into one single FPGA. It is important to mention that the goal is neither to propose a device that will replace conventional cluster nodes (PCs) nor a reconfigurable node that is complementary to a PC host (as opposed to the work discussed earlier [3][4][6][5]).
On the other hand, the goal is to enable fast development
of distributed applications that require customized processors
in order to achieve small area, low power and high performance through processor instruction-set and network latency
optimization. It is also important to highlight that once the
application is partitioned, each cluster node may comprise a
highly optimized processor, according to the task previously
allocated to it. The allocation of the partitioned tasks is assumed to be done at an earlier design stage and is not within the scope of this research.
It is also important to consider that the focus of this work, at its current development status, is a proof of concept related to the development of a design flow devoted to generating optimized cluster architectures, not the development of a complete cluster machine composed of an arbitrary number of nodes.
Fig. 1. Design flow for a dedicated cluster development
The following Sections discuss each step of this flow in detail.
A. Creating a FemtoJava Core
The first step of the dedicated cluster implementation is the application definition and its further distribution. The target application must be implemented in the Java language, according to the SASHIMI tool constraints.
The SASHIMI tool is an environment which synthesizes applications described in the Java language into application-specific VHDL microcontrollers. Thus, the main advantage of using the SASHIMI tool is the automatic generation of a microcontroller adapted to a given application described in such a high-level language.
The tool automatically identifies which instructions used in the Java description are essential to the hardware implementation and then generates the customized FemtoJava microcontroller. The system was initially proposed in [9], but it was later improved and the newest version can generate cores with pipeline and VLIW support. More details about the tool and the constraints for writing synthesizable Java code may also be found in [9].
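As a purely illustrative example (not taken from this work), the hypothetical Java fragment below shows the style of code such a synthesis flow might accept. The class name and the assumed restrictions (static methods, integer arithmetic, statically allocated fixed-size arrays) are assumptions for this sketch only; the actual SASHIMI restrictions are those listed in [9].

// Hypothetical example only: a small, synthesis-friendly Java class.
// Assumed restrictions: static methods, integer arithmetic and
// statically allocated fixed-size arrays (see [9] for the real rules).
public class DoseEvaluation {
    // Fixed-size working buffer allocated at class-load time.
    private static final int[] spectrum = new int[64];

    // Simple integer-only computation over the buffer.
    public static int sumSpectrum() {
        int acc = 0;
        for (int i = 0; i < spectrum.length; i++) {
            acc += spectrum[i];
        }
        return acc;
    }
}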
The idea of using an automatic flow to implement the embedded core is mainly due to the fast development cycle it provides. Also, this alternative was chosen to make the implementation of the dedicated cluster feasible even for researchers with little knowledge of hardware design, as the application may be described directly in the Java language.
The integration layer was also developed in order to easily integrate the TCP/IP stack and the FemtoJava core. Any variation in the application affects only this step of the flow: for each application, a new core must be generated using the SASHIMI tool, but the remaining modules do not change, since they provide a proper interface to connect with any FemtoJava core created by the SASHIMI tool.
So, this first step provides the customized core, which is
later integrated with the remaining blocks (TCP/IP stack and
integration layer). The following sections discuss both the
integration layer and the TCP/IP stack.
B. The Integration Layer
In order to complete the communication infrastructure, some additional logic is necessary to provide synchronization between the TCP/IP stack and the FemtoJava core. This logic, developed in VHDL, is implemented mainly through buffers and Finite State Machines (FSMs).
Thus, when the FemtoJava core needs to communicate, it
sends a request to the FSM responsible for the communication
between the FemtoJava and the TCP/IP stack. This FSM
handles the request and places the data to be sent in a FIFO
buffer (send FIFO). A second FSM, responsible for sending,
reads the new data placed in the buffer and sends it to the
stack. The stack packs the data and sends it over the network.
The receiving process is similar, but it starts when new data
is unpacked by the TCP/IP stack. It then sends a request to
the FSM responsible for receiving the data. This FSM handles
the request and places the data in a FIFO buffer (receive
FIFO). The FSM responsible for the communication between
the FemtoJava and the TCP/IP stack reads the received data
and sends it to the FemtoJava core.
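To make this handshake concrete, the following minimal Java behavioral sketch models the send and receive paths just described. It is not the actual VHDL: the FIFO depth is arbitrary and the names IntegrationLayerModel and TcpIpStack are hypothetical, introduced only for this illustration.

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Behavioral sketch of the integration layer: bounded queues stand in for
// the send/receive FIFOs and each method plays the role of one FSM.
public class IntegrationLayerModel {
    private final BlockingQueue<Integer> sendFifo = new ArrayBlockingQueue<>(64);
    private final BlockingQueue<Integer> recvFifo = new ArrayBlockingQueue<>(64);

    // FSM between the core and the stack: accepts a send request from the
    // FemtoJava core and places the word into the send FIFO.
    public void coreSendRequest(int word) throws InterruptedException {
        sendFifo.put(word);              // blocks if the FIFO is full
    }

    // Sender FSM: drains the send FIFO and forwards each word to the stack,
    // which packs it and sends it over the network.
    public void senderFsm(TcpIpStack stack) throws InterruptedException {
        while (true) {
            int word = sendFifo.take();  // blocks until data is available
            stack.transmit(word);
        }
    }

    // Receiver FSM: called when new data has been unpacked by the stack;
    // places the word into the receive FIFO.
    public void stackReceived(int word) throws InterruptedException {
        recvFifo.put(word);
    }

    // FSM towards the core: delivers received words back to the FemtoJava core.
    public int coreReceive() throws InterruptedException {
        return recvFifo.take();
    }

    // Minimal stack interface assumed only for this sketch.
    public interface TcpIpStack {
        void transmit(int word);
    }
}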
Figure 2 shows the structures of the integration layer between the FemtoJava core and the TCP/IP stack, which is detailed in the next Section.
C. The TCP/IP stack
The TCP/IP stack is implemented using the LwIP (Lightweight IP) communication library described in [10]. This library, written in the C language, runs on one of the PowerPC processors hardwired in the FPGA used as the development platform in this work.
The stack provided by the LwIP library supports transfer rates of 10/100 Mb/s and operates in full-duplex mode, since the send and receive entities are implemented to work independently. This research uses the LwIP library through its socket API functions, which allow thread-based programming to send and receive data while the FemtoJava core executes in parallel.
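The Java fragment below is only an illustrative analogue of this threaded send/receive pattern; the actual node uses LwIP's C socket API on the embedded PowerPC, and the peer address, port and payload used here are placeholders.

import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.net.Socket;

// Illustrative analogue of the threaded send/receive pattern: one thread
// transmits while another receives, so both directions run in parallel.
public class NodeLink {
    public static void main(String[] args) throws Exception {
        Socket socket = new Socket("192.168.0.2", 5000); // hypothetical peer node
        DataOutputStream out = new DataOutputStream(socket.getOutputStream());
        DataInputStream in = new DataInputStream(socket.getInputStream());

        // Transmit thread: in the node, this data would come from the send FIFO.
        Thread sender = new Thread(() -> {
            try {
                for (int word = 0; word < 10; word++) {
                    out.writeInt(word);
                }
                out.flush();
            } catch (Exception e) { e.printStackTrace(); }
        });

        // Receive thread: in the node, this data would fill the receive FIFO.
        Thread receiver = new Thread(() -> {
            try {
                for (int i = 0; i < 10; i++) {
                    int word = in.readInt();
                    System.out.println("received " + word);
                }
            } catch (Exception e) { e.printStackTrace(); }
        });

        sender.start();
        receiver.start();
        sender.join();
        receiver.join();
        socket.close();
    }
}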
It is important to notice that the physical layer is implemented by an ASIC [11] available on the development board. The network packets are sent and received by this ASIC, which then passes them to LwIP. LwIP provides the data link, network and transport layers, passing the resulting TCP packets to the application layer, which is implemented directly in hardware by the FemtoJava core.
D. Integrating the Node
The infrastructure, i.e., the communication stack and the integration layer, is ready to be connected to the embedded application previously developed in Java and synthesized into a VHDL FemtoJava microcontroller. As discussed before, this infrastructure does not vary when the embedded application changes.
Figure 2 shows how the system integrates the entire node,
i.e., how all blocks are connected in order to implement the
processing node. Basically, the FemtoJava core and the TCP/IP
stack are both connected with the integration layer, which
is responsible for synchronizing the communication between
these two modules.
Fig. 2. System integration
Each node may be responsible for a given part of the application, according to the distribution performed earlier at the software level.
IV. CASE STUDY: A DISTRIBUTED EVOLUTIONARY ALGORITHM
A real scientific application was used to apply the flow and
design a dedicated cluster machine.
This application, widely used in the pharmaceutical industry, uses spectrographic techniques in order to dose and further characterize anti-hypertensives [12]. These techniques, however, result in a large number of variables and, as a consequence, the dosage process, i.e., the determination of the necessary chemical elements, is very slow. This occurs because a combinatorial analysis is required to choose the best combination of variables among all possibilities.
An alternative to speed up this process is to employ Evolutionary Algorithms (EAs), since they have been recognized as a powerful approach to solve optimization problems [13]. Evolutionary Algorithms are inspired by simple models of biological evolution. They are known as robust optimization algorithms based on the collective learning process within a population of individuals.
Each individual represents a search point in the representation space R. Through the iterative processing of an evolution loop, consisting of selection, mutation and recombination, a randomly initialized population evolves toward better regions of the search space. The fitness function f delivers the quality information necessary for the selection process to favor individuals with higher fitness for reproduction. The reproduction process consists of the recombination mechanism, responsible for the mixing of parental information, and of mutation, which introduces undirected innovation into the population.
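As a minimal sketch of this evolution loop (and not of the actual EA used for the spectrographic application), the Java code below uses a placeholder one-max fitness function, a binary encoding and arbitrary population, genome and mutation parameters.

import java.util.Arrays;
import java.util.Random;

// Minimal sketch of the evolution loop described above: selection,
// recombination and mutation over a randomly initialized population.
// Fitness, encoding and parameters are placeholders only.
public class SimpleEa {
    static final int POP = 32, GENES = 16, GENERATIONS = 100;
    static final Random rnd = new Random();

    // Placeholder fitness: number of ones in the genome.
    static int fitness(int[] ind) {
        return Arrays.stream(ind).sum();
    }

    // Selection: binary tournament favoring the individual with higher fitness.
    static int[] tournament(int[][] pop) {
        int[] a = pop[rnd.nextInt(POP)], b = pop[rnd.nextInt(POP)];
        return fitness(a) >= fitness(b) ? a : b;
    }

    public static void main(String[] args) {
        // Randomly initialized population.
        int[][] pop = new int[POP][GENES];
        for (int[] ind : pop)
            for (int g = 0; g < GENES; g++)
                ind[g] = rnd.nextInt(2);

        for (int gen = 0; gen < GENERATIONS; gen++) {
            int[][] next = new int[POP][GENES];
            for (int i = 0; i < POP; i++) {
                int[] p1 = tournament(pop), p2 = tournament(pop);
                // Recombination: one-point crossover mixes parental information.
                int cut = rnd.nextInt(GENES);
                for (int g = 0; g < GENES; g++)
                    next[i][g] = (g < cut) ? p1[g] : p2[g];
                // Mutation: undirected innovation with a small probability.
                for (int g = 0; g < GENES; g++)
                    if (rnd.nextDouble() < 0.01)
                        next[i][g] ^= 1;
            }
            pop = next;
        }
        System.out.println("best fitness: "
            + Arrays.stream(pop).mapToInt(SimpleEa::fitness).max().getAsInt());
    }
}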
The EA applied to the spectrographic technique used in this work was first developed in the Java language and was manually distributed according to the island model briefly discussed below.
In the island model, fully described in [14], each node has
its own subpopulation and performs all typical tasks required
by an EA (analysis/selection, mutation and recombination).
This means that the total population of a sequential algorithm
might be divided into smaller subpopulations distributed for
execution in the different nodes available.
Figure 3 shows the island model distribution.
Fig. 3. EA distribution according to the island model
Thus, the EA implemented according to the island model performs the following tasks (a sketch of the migration step is given after this list):
• Analysis – in this step, all population individuals are evaluated according to a given mathematical function;
• Crossover – in this step, the recombination of individuals is performed;
• Mutation – this part of the algorithm applies mutation to the current population, generating the new population;
• Migration – during this step, each node sends to and receives from the remaining nodes a given number of individuals. After this procedure, these new individuals are integrated into the node's own population.
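The migration step can be sketched in Java as follows. This is an assumption-laden illustration: both islands live in the same process, the migration size and selection policy are arbitrary, and the placeholder one-max fitness of the previous sketch is reused; in the dedicated cluster, the exchanged individuals travel through the TCP/IP stack instead.

import java.util.Arrays;
import java.util.Comparator;

// Sketch of the island-model migration step: each island sends copies of its
// best individuals to the other and replaces its worst individuals with the
// immigrants it receives.
public class IslandMigration {
    static final int MIGRANTS = 2;

    // Placeholder fitness: number of ones in the genome.
    static int fitness(int[] ind) {
        return Arrays.stream(ind).sum();
    }

    // Copies of the 'count' fittest individuals of a population.
    static int[][] emigrants(int[][] pop, int count) {
        int[][] sorted = pop.clone();
        Arrays.sort(sorted, Comparator.comparingInt(IslandMigration::fitness));
        int[][] best = new int[count][];
        for (int i = 0; i < count; i++)
            best[i] = sorted[sorted.length - 1 - i].clone();
        return best;
    }

    // Integrates immigrants by overwriting the least fit individuals.
    static void immigrate(int[][] pop, int[][] immigrants) {
        Arrays.sort(pop, Comparator.comparingInt(IslandMigration::fitness));
        for (int i = 0; i < immigrants.length; i++)
            pop[i] = immigrants[i];
    }

    public static void main(String[] args) {
        int[][] islandA = {{1, 1, 0, 0}, {0, 0, 0, 1}, {1, 0, 1, 1}};
        int[][] islandB = {{0, 0, 0, 0}, {1, 1, 1, 0}, {0, 1, 0, 0}};

        // Each island selects its emigrants before any replacement happens.
        int[][] fromA = emigrants(islandA, MIGRANTS);
        int[][] fromB = emigrants(islandB, MIGRANTS);
        immigrate(islandA, fromB);
        immigrate(islandB, fromA);

        System.out.println("island A after migration: " + Arrays.deepToString(islandA));
    }
}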
Although both models were completely developed in software, previous works have shown that the island model is much more effective for this application [15]. Thus, only this model was developed for the cluster.
It is possible to observe that each island may be implemented in a different node. Thus, the dedicated cluster for this case is composed of the replication of several single nodes, implemented according to the methodology described here. After this process, the prototyped nodes must be connected through a traditional network and the embedded application is ready to run in the cluster.
As discussed before, after developing the distributed algorithm in Java language, the SASHIMI tool [9] was used in
order to generate the FemtoJava cores used in each node of
the cluster. Those cores were later connected with the TCP/IP
stack and integration layer in order to complete each node.
All synthesis and validation strategies are described in the
next Section.
V. SIMULATION, SYNTHESIS AND VALIDATION
After using the SASHIMI tool [9] to generate the FemtoJava core for the distributed EA, simulations were performed using Mentor Graphics ModelSim in order to validate the core. The reference data for the validation was extracted from the same algorithm modeled in Java which was used as the SASHIMI tool input.
The TCP/IP stack as well as the integration layer were also simulated in Mentor Graphics ModelSim. The reference data, in this case, was extracted from a program, written in C, implemented to generate network packets with known payload data. The core and the stack were then integrated and new simulations were performed using Mentor Graphics ModelSim.
The prototyping was performed targeting the XUP-V2P board [16], which has an XC2VP30 Virtex-II Pro FPGA and is designed by Digilent, Inc. Besides the powerful FPGA (with two hardwired PowerPC 405 processors), this board has SDRAM as well as several other useful resources and interfaces, such as RS-232, Ethernet and XSVGA output.
After the simulation, the synthesis and debugging/verification processes were started using the Xilinx
ISE Foundation Software [17], as the target device is a Xilinx
Virtex II Pro FPGA family device. Synthesis results are
shown in the next Section.
In order to validate the prototype, the Xilinx EDK Platform Studio tool was used to create the entire programming environment to support verification [18]. This includes an interface to connect the prototyped node to the RS-232 interface, which allowed the data produced by each node to be verified against the data produced earlier in simulation.
The bitstream produced by the Xilinx ISE Foundation software was downloaded to the FPGAs through a USB 2.0 programming interface (JTAG). This download included the node as well as the verification interface generated by the Xilinx EDK Platform Studio. Also, the remaining PowerPC processor available was used to verify the prototyped nodes. The results generated by a given node were received by the PowerPC, which sent these data through the RS-232 interface to a host PC. Then, these results were compared with the results obtained by simulation.
VI. RESULTS
The following Sections present a summary of the synthesis
results as well as the performance achieved by the designed
system.
A. Synthesis
The entire node, i.e., the FemtoJava core, the TCP/IP network stack and the integration layer, was completely synthesized using the Xilinx ISE Foundation software for the XC2VP30 Virtex-II Pro FPGA. Furthermore, a non-distributed version of the application was also developed and synthesized to the FPGA. This second embedded system does not have any support for parallel execution, i.e., the application executes a sequential version of the algorithm and it was not integrated with the communication stack.
The post place and route synthesis results for both systems
are summarized in Table I.
TABLE I
SYNTHESIS RESULTS

Synthesis for XC2VP30    Distributed    Sequential
Maximum Frequency        100.59 MHz     102.06 MHz
Power Consumption        1.1 W          1.1 W
Area (# of LUTs)         8,354 (30%)    5,394 (19%)
It is possible to observe that the entire node of the dedicated cluster occupies around 30% of the available LUTs, meaning that there is still enough room to optimize the embedded application, if needed. These optimizations, however, are not within the scope of this work. Also, the sequential version is significantly smaller. Nevertheless, this version does not use the communication infrastructure, which occupies a large part of the LUTs available in the FPGA, around 19%.
The maximum frequency achieved in both cases is more than 100 MHz. This is a significant result, considering the FPGA technology used in this research. Moreover, regarding the distributed version, the embedded application is synchronized with the integration layer as well as with the communication stack, and this frequency is more than enough to keep the process effective. It is also interesting to observe that the node communication infrastructure is not limiting the frequency, as both versions achieved similar results.
Additionally, it is important to notice the low power consumption in both cases (sequential and distributed), which was around 1.1 W. This result points out how power consumption may be reduced when using dedicated hardware, as a state-of-the-art GPP may consume over 100 W [19]. This is especially attractive in comparison with regular clusters, since one of the main problems of the latter approach is its high power consumption and heat dissipation.
B. Performance
In order to validate and compare the results achieved by the embedded version, the distributed application was also run in a regular cluster. However, it is not the intention of this work to broadly compare dedicated and regular clusters using different configurations and numbers of nodes. The main goal of this work was to develop and validate an effective flow to generate a dedicated cluster for a given application, and the comparison between this and regular clusters only illustrates the approach. Thus, both the regular and the dedicated clusters used in the experiments shown in this Section have two nodes.
The conventional cluster used to perform the experiments is based on two Intel Pentium 4 nodes, each one running at 2.8 GHz with 512 MBytes of memory. The interconnection used between them is a direct Fast Ethernet link, similar to the FPGA cluster architecture. The application run in this cluster was the same Java application which was earlier synthesized through the SASHIMI tool in order to create the customized FemtoJava core. As discussed before, the nodes are similar and each one implements an island of the EA.
On the other hand, the dedicated cluster used is based on two nodes developed according to the methodology previously described in this work. It is important to observe that each node has the same functionality as the ones run in the conventional cluster, as they were generated from that same Java application.
Table II shows results relative to the evolutionary algorithm execution time in both the conventional and the application-specific clusters. These results consider the average of 20 executions of the algorithm in each cluster. In order to obtain the execution time results in the conventional cluster case, system calls were used. For the dedicated cluster, the number of clock cycles necessary to complete an execution was accumulated and later multiplied by the clock period. Thus, the table shows the average execution time as well as the best performance achieved by the experiments in each case (dedicated and traditional), besides the relative difference between those results.
TABLE II
DEDICATED VS. TRADITIONAL CLUSTERS

Comparison between dedicated and traditional clusters
                          Dedicated    Traditional    Difference (%)
Average exec. time (s)    14.16        22.53          37.15
Lowest exec. time (s)     5.36         19.5           72.51
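As a consistency check using only the figures reported above, the percentage differences in Table II follow directly from the measured times: (22.53 - 14.16) / 22.53 ≈ 37.15% and (19.5 - 5.36) / 19.5 ≈ 72.51%. Likewise, at the 100.59 MHz clock reported in Table I, the dedicated cluster's average time of 14.16 s corresponds to roughly 14.16 s × 100.59 MHz ≈ 1.42 × 10^9 accumulated clock cycles.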
It is possible to observe that the dedicated cluster achieved remarkable results, considering a cluster comprising two nodes. On average, the traditional cluster takes more than 22 seconds to find the EA fitness, while the dedicated cluster takes less than 15 seconds. This means that, on average, the gain of the dedicated cluster over the traditional one is greater than 35%. Moreover, the dedicated cluster performance may be more than 70% better when comparing the lowest execution times of each case. This points out that, in fact, the latencies produced by unnecessary software and hardware layers can significantly harm a system's performance. Besides that, the results show that the dedicated cluster, which runs at 100 MHz, provides a speed-up of 1.5 times over a Pentium-based cluster running at 2.8 GHz, which means that the dedicated nodes are 42 times more efficient than a superscalar processor running at the same frequency (1.5 × 2.8 GHz / 100 MHz = 42). This happens because the dedicated cluster runs on FPGAs, which are devices that exploit parallel, or spatial, execution instead of the temporal, or sequential, approach used in the Pentium-based cluster. Moreover, the concurrent execution provided by FPGA devices represents an advantage for the algorithm's operations, which exploit it by increasing the parallelism of the execution.
VII. CONCLUSION AND FUTURE WORK
This work proposed a design flow to efficiently distribute
embedded systems through dedicated cluster machines.
Although there are several studies in the fields of distributed software, computer clusters and embedded systems, none of them has proposed an approach joining these technologies.
According to the proposed flow, the nodes are developed in the Java language and the synthesizable VHDL is generated automatically through the SASHIMI design flow. After that, the application core must be integrated with the TCP/IP stack as well as with an integration layer.
The application implemented as the first case study was an Evolutionary Algorithm (EA) applied to spectrographic analysis, which is a widely known problem in the pharmaceutical industry. After following the design flow, the nodes were entirely implemented and integrated. The verification process is complete and the performance results were remarkable for a two-node cluster.
The dedicated node using the XUP-V2P board achieves around 100 MHz and consumes only 30% of the target device, a Xilinx XC2VP30 Virtex-II Pro. Moreover, the frequency achieved is considered good for the device used. There is still room for new optimizations in the architecture, as it was fully placed using just 30% of the FPGA LUTs. As an automatic flow was used to generate the FemtoJava core, some manual optimizations may be easily performed. However, which optimizations to implement, as well as their impact on the node performance, are going to be addressed in future work.
The general performance of the experiments executed in this work is also very impressive. On average, using the dedicated cluster is more than 35% better than running the same algorithm in a conventional cluster. The best case, however, achieves more than 70% improvement over the conventional cluster. More experiments with different cluster configurations and numbers of nodes still need to be performed, but these results validate the idea of an embedded distributed application running in an application-specific cluster.
Also as future work, communication alternatives other than TCP/IP are going to be studied and analyzed. Moreover, new applications are going to be developed to verify the effectiveness of this approach.
ACKNOWLEDGMENT
The authors gratefully acknowledge the UNISC and
FAPERGS support in the form of scholarships and grants.
REFERENCES
[1] A. S. Tanenbaum, Distributed Systems: Principles and Paradigms, 2002.
[2] C.-C. Yeh, C.-H. Wu, and J.-Y. Juang, “Design and implementation of
a multicomputer interconnection network using FPGAs,” in IEEE Symposium on Field-Programmable Custom Computing Machines. Napa
Valley, California: IEEE, 1995, pp. 30–39.
[3] M. Jones, L. Scharf, J. Scott, C. Twaddle, M. Yaconis, K. Yao, and
P. Athanas, “Implementing an API for distributed adaptive computing
systems,” in IEEE Symposium on Field-Programmable Custom Computing Machines. Napa Valley, California: IEEE, 2000, pp. 222–230.
[4] A. Dandalis, V. K. Prasanna, and J. D. P. Rolim, “An adaptive cryptographic engine for IPsec architectures,” in IEEE Symposium on Field-Programmable Custom Computing Machines. Napa Valley, California:
IEEE, 2000, pp. 132–141.
[5] R. Sass, K. Underwood, and W. Ligon, Design of Adaptable Computing
Cluster. The Military and Aerospace Programmable Logic Device
(MAPLD) International Conferences, 2001.
[6] K. Underwood, R. Sass, and W. Ligon, Cost Effectiveness of an
Adaptable Computing Cluster. ACM/IEEE Supercomputing, 2001.
[7] A. M. Jacob, I. A. Troxel, and A. D. George, Distributed Configuration
Management for Reconfigurable Cluster Computing. HCS Research
Lab, University of Florida, 2004.
[8] J. Willians, I. Syed, J. Wu, and N. Bergmann, “A reconfigurable cluster-on-chip architecture with MPI communication layer,” in 14th Annual
IEEE Symposium on Field-Programmable Custom Computing Machines.
Los Alamitos, California: IEEE, 2006, pp. 351–352.
[9] S. Ito, L. Carro, and R. Jacobi, “Sashimi and femtojava: making
java work for microcontroller applications,” IEEE Design & Test of
Computers, pp. 100–110, 2001.
[10] A. Dunkels, “Design and implementation of the lwIP TCP/IP stack,” Swedish Institute of Computer Science, Tech. Rep., February 2001.
[11] Intel, “Intel LXT972A single-port 10/100 Mbps PHY transceiver,” available at: <http://www.intel.com/design/network/products/lan/datashts/24918603.pdf>. Accessed in May 2007.
[12] J. Coates, “Vibrational spectroscopy: Instrumentations for infrared and
raman spectroscopy,” Applied Spectroscopy Reviews, vol. 33, 1998.
[13] L. A. N. Lorena and J. C. Furtado, “Constructive genetic algorithm for
clustering problems,” Evolutionary Computation, vol. 9, no. 3, pp. 309–
327, 2001.
[14] E. Alba and J. M. Troya, A useful review on coarse grain Parallel Genetic Algorithms, Universidad de Málaga, Campus de Teatinos (2.2.A.6),
29071-Málaga (ESPAÑA), 1997.
[15] A. Aguiar, C. Both, M. Kreutz, R. dos Santos, and T. dos Santos, “Implementação de algoritmos genéticos paralelos aplicados a fármacos,” in XXVI Congresso da Sociedade Brasileira de Computação - WPerformance. Campo Grande - MS: Sociedade Brasileira de Computação, July 2006.
[16] Xilinx University Program Virtex-II Pro Development System - Hardware
Reference Manual, Xilinx Inc., 2006.
[17] Xilinx ISE 8.2i - Software Manual, Xilinx Inc., 2006.
[18] Xilinx Embedded Development Kit (EDK) 8.2 - Software Manual, Xilinx Inc., 2006.
[19] “Intel Pentium 4 processor - thermal management,” available at: <http://www.intel.com/support/processors/pentium4/sb/CS-007999.htm>. Accessed in December 2006.