Design Flow of a Dedicated Computer Cluster Customized for a Distributed Genetic Algorithm Application

Alexandra Aguiar, Márcio Kreutz, Rafael Santos and Tatiana Santos
Universidade de Santa Cruz do Sul
Santa Cruz do Sul, RS, Brazil
[email protected], {kreutz, rsantos, tatianas}@unisc.br

Abstract— In the past few years, computer grids and clusters of computers have been widely used to keep up with the high computational performance required by high-end applications. They are especially attractive because of their good performance at relatively low cost when compared to powerful servers and supercomputers. The same scenario appears in the embedded world, where highly specialized tasks are usually partitioned among dedicated processors composing distributed systems. This work focuses on the architectural specialization of cluster machines by analyzing application behavior and optimizing instruction-set architectures. The motivation relies on the observation that tasks found in embedded software usually exhibit a behavior that can be implemented by a subset of an instruction-set architecture. This opens opportunities for optimization by removing the unneeded instructions. As a consequence, processors become specialized, since their resources better fit the performance and power consumption constraints of the application. This work proposes a design flow to adapt cluster machines to the constraints of embedded applications, where high flexibility and performance are achieved by hardware customization and its further distribution. Moreover, a case study is presented, describing the entire design flow as well as the synthesis results.

I. INTRODUCTION

Typically, the performance of applications can be improved either by distributing their tasks among processing elements or by specializing them into dedicated hardware. Both approaches have pros and cons. Some applications are more naturally distributed by software, while others are more likely to fit a hardware implementation; this clearly depends on the nature of the application. In recent years, however, the use of computer clusters and grids has become an attractive alternative for the execution of applications that demand high computational power because of their price-performance ratio. These systems often respond well to the processing demands of most distributed applications.

The main goal of this work is to find a compromise between both approaches: processors are connected by a network fabric in a cluster fashion, while their instruction sets are targeted to dedicated tasks of embedded applications.

The memory model used in cluster systems requires that applications exchange messages through a networked pool of nodes [1]. Even though current gigabit Ethernet cards are found at relatively low prices and offer significantly higher bandwidth, some applications do not map well to this model. On top of that, despite the fact that network cards are constantly improving, some applications also need resources at the processor level that are not found in commodity microprocessors. This occurs especially in applications for which a hardware implementation would be the best approach to follow. Nevertheless, even in such cases distribution would certainly help to increase performance.
Therefore, a combination of dedicated processors coupled with high-speed networks may be an interesting alternative for applications that need specific processor resources and high-speed communication capabilities.

This work presents the design flow of an application-specific computer cluster machine where processors are customized according to the constraints of embedded applications and integrated with an Ethernet-based communication stack. The proposed design flow targets applications that allow a smooth and efficient distribution, with tasks that can easily fit in small, optimized processors. Finally, this research provides the hardware infrastructure that allows the connection among the nodes, as well as a design flow to build the application and integrate the system. In other words, this work proposes a development flow and infrastructure that allow optimized processors to execute a given distributed and/or parallel application.

The remainder of this paper is organized as follows: Section 2 discusses related work, while Section 3 presents in detail the concept of a cluster targeted to application-specific constraints. Section 4 describes a case study with a distributed genetic algorithm. Section 5 describes how synthesis and validation were performed in this research, while Section 6 shows the results achieved. Finally, Section 7 presents the conclusions, remarks and future work.

II. RELATED WORK

Previous works have studied the use of application-specific devices in cluster systems as these devices have become more popular over the years. Yeh et al. [2] proposed the use of FPGAs to build a switch fabric. Jones et al. [3] included a reconfigurable off-the-shelf computing card in each node of a cluster. Other implementations, such as Dandalis [4], proposed the use of reconfigurable network cards to implement specific protocols such as IPsec. Sass [5] proposed the use of an intelligent network card (INIC) capable of processing messages and injecting them into the network, alleviating the pressure on the processor in order to enable full exploitation of the bandwidth and latencies of modern networks. Underwood [6] then presented a cost analysis of an Adaptable Computing Cluster based on the INIC project. More recently, Jacob [7] proposed the CARMA framework for reconfigurable clusters as a tool for managing different configuration schemes, and Willians [8] presented a reconfigurable cluster-on-chip architecture and supporting libraries for developing multi-core reconfigurable systems-on-chip using the MPI (Message Passing Interface) standard.

III. DESIGNING A DEDICATED CLUSTER NODE

One of the main goals of this work is to provide a design flow aimed at building optimized cluster machines by tuning their instruction sets according to the constraints of embedded software tasks. Figure 1 presents the design flow proposed in this work. The first step concerns the FemtoJava core generation through the SASHIMI tool [9]. This core is further integrated with the communication infrastructure provided by the LwIP library (TCP/IP stack) and the integration layer, as described in steps (2) and (3).

The dedicated cluster architecture proposed in this research differs from the other projects mainly in two aspects. First, our approach customizes the processor itself as a function of the application constraints. This is done through an automated flow which generates an optimized microprocessor starting from a Java application.
The microprocessor is tailored for the application, enabling the optimizations necessary for that application. Furthermore, only the hardware resources needed by the application are synthesized. Second, the generated microprocessor is integrated with the communication module (a TCP/IP stack), which implements transmit/receive buffers that can be accessed at the same frequency as the processor. The latter allows full exploitation of high-speed network communication.

The proposed design flow allows the programmer to concentrate on the development of the application, i.e., the algorithm. Thus, the use of an automated flow enables fast development at a higher abstraction level not found in other projects. The result is a tightly coupled device which integrates the processor and communication into one single FPGA. It is important to mention that the goal is not to propose a device that will replace conventional cluster nodes (PCs), nor a reconfigurable node that is complementary to a PC host (as opposed to the work discussed earlier [3][4][6][5]). Instead, the goal is to enable fast development of distributed applications that require customized processors in order to achieve small area, low power and high performance through processor instruction-set and network latency optimization.

It is also important to highlight that, once the application is partitioned, each cluster node may comprise a highly optimized processor, according to the task previously allocated to it. The allocation of the partitioned tasks is supposed to be done at an earlier design stage and is not within the scope of this research. It should also be noted that the focus of this work, at its current development status, is a proof of concept of a design flow devoted to generating optimized cluster architectures, not the development of a complete cluster machine comprising an arbitrary number of nodes.

Fig. 1. Design flow for a dedicated cluster development

The following sections discuss each step of this flow in detail.

A. Creating a FemtoJava Core

The first step of the dedicated cluster implementation is the application definition and its further distribution. The target application must be implemented in the Java language, according to the SASHIMI tool constraints. The SASHIMI tool is an environment which synthesizes applications described in the Java language into specific VHDL microcontrollers. Thus, the main advantage of using the SASHIMI tool is the automatic generation of a microcontroller adapted to a given application described in such a high-level language. The tool automatically verifies which instructions used in the Java description are essential to the hardware implementation and then generates the customized FemtoJava microcontroller. The system was initially proposed in [9], but was later improved and the newest version can generate cores with pipeline and VLIW support. More details about the tool and the constraints for writing synthesizable Java code may also be found in [9].

The idea of using an automatic flow to implement the embedded core is mainly motivated by the fast development cycle it provides. Also, this alternative makes the implementation of the dedicated cluster feasible even for researchers with little knowledge of hardware design, as the application may be described directly in the Java language. The integration layer was also developed in order to easily integrate the TCP/IP stack and the FemtoJava core.
Any variation in the application affects only this step of the flow. For each application, a new core must be generated using the SASHIMI tool, but the remaining modules do not change, since they present a proper interface to connect with any FemtoJava core created by the SASHIMI tool. So, this first step provides the customized core, which is later integrated with the remaining blocks (TCP/IP stack and integration layer). The following sections discuss both the integration layer and the TCP/IP stack.

B. The Integration Layer

In order to complete the communication infrastructure, some additional logic is necessary to provide synchronization between the TCP/IP stack and the FemtoJava core. This logic, developed in VHDL, is implemented mainly through buffers and Finite State Machines (FSMs). When the FemtoJava core needs to communicate, it sends a request to the FSM responsible for the communication between the FemtoJava and the TCP/IP stack. This FSM handles the request and places the data to be sent in a FIFO buffer (send FIFO). A second FSM, responsible for sending, reads the new data placed in the buffer and sends it to the stack. The stack packs the data and sends it over the network.

The receiving process is similar, but it starts when new data is unpacked by the TCP/IP stack. The stack then sends a request to the FSM responsible for receiving the data. This FSM handles the request and places the data in a FIFO buffer (receive FIFO). The FSM responsible for the communication between the FemtoJava and the TCP/IP stack reads the received data and sends it to the FemtoJava core. Figure 2 shows the structures of the integration layer between the FemtoJava core and the TCP/IP stack, which is detailed in the next section.

C. The TCP/IP Stack

The TCP/IP stack is implemented using the LwIP (Lightweight IP) communication library described in [10]. This library, written in the C language, runs in one of the PowerPCs hardwired in the FPGA used as the development platform in this work. The stack provided by the LwIP library supports transfer rates of 10/100 Mb/s and operates in full-duplex mode, since the send and receive entities are implemented to work independently. This research uses the LwIP library through its sockets API, which allows multithreaded programming in order to send and receive data while the FemtoJava core executes in parallel. It is important to notice that the physical layer is implemented by an ASIC [11] available on the development board. The network packets are sent/received by this ASIC, which then passes them to LwIP. LwIP provides the data link, network and transport layers, passing the resulting TCP data to the application layer, implemented directly in hardware by the FemtoJava core.

D. Integrating the Node

The infrastructure, i.e., the communication stack and integration layer, is ready to be connected to the embedded application previously developed in Java and synthesized into a VHDL FemtoJava microcontroller. As discussed before, this infrastructure should not vary if the embedded application changes. Figure 2 shows how the entire node is integrated, i.e., how all blocks are connected in order to implement the processing node. Basically, the FemtoJava core and the TCP/IP stack are both connected to the integration layer, which is responsible for synchronizing the communication between these two modules.

Fig. 2. System integration

Each node may be responsible for a given part of the application, according to the distribution performed earlier at the software level.
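The integration layer itself is written in VHDL, but its handshake can be summarized behaviorally. The sketch below is a simplified Java model of that behavior, not the actual RTL; the class name, method names and FIFO depth are illustrative assumptions.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Behavioral model of the integration layer: two FIFOs decouple the
// FemtoJava core from the TCP/IP stack, mirroring the send/receive FSMs.
// Names and sizes are assumptions, not the VHDL design described in the paper.
public class IntegrationLayerModel {
    private final BlockingQueue<Integer> sendFifo = new ArrayBlockingQueue<>(64);
    private final BlockingQueue<Integer> receiveFifo = new ArrayBlockingQueue<>(64);

    // Core-side FSM: a send request from the core places one word in the send FIFO.
    public void coreSendRequest(int word) throws InterruptedException {
        sendFifo.put(word);          // blocks if the FIFO is full (back-pressure)
    }

    // Sender FSM: drains the send FIFO and forwards each word to the stack.
    public int nextWordForStack() throws InterruptedException {
        return sendFifo.take();      // blocks until the core has produced data
    }

    // Stack-side FSM: data unpacked by the TCP/IP stack goes into the receive FIFO.
    public void stackReceived(int word) throws InterruptedException {
        receiveFifo.put(word);
    }

    // Core-side FSM: the core consumes received words in arrival order.
    public int coreReceive() throws InterruptedException {
        return receiveFifo.take();
    }
}
```

The essential point captured by the model is that the FIFOs let the core and the stack run at their own pace, with the FSMs only synchronizing the transfer of individual words.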
IV. CASE STUDY: A DISTRIBUTED EVOLUTIONARY ALGORITHM

A real scientific application was used to apply the flow and design a dedicated cluster machine. This application, widely used in the pharmaceutical industry, uses spectrographic techniques in order to dose and further characterize anti-hypertensives [12]. These techniques, however, result in a large number of variables and, as a consequence, the dosage process, i.e., the determination of the necessary chemical elements, is very slow. This occurs because a combinatorial analysis is required to choose the best combination of variables among all. An alternative to speed up this process is to employ Evolutionary Algorithms (EAs), since they have been recognized as a powerful approach to solve optimization problems [13].

Evolutionary Algorithms are inspired by simple models of biological evolution. They are known as robust optimization algorithms based on the collective learning process within a population of individuals. Each individual represents a search point in the representation space R. By the iterative processing of an evolution loop, consisting of selection, mutation and recombination, a randomly initialized population evolves toward better regions of the search space. The fitness function f delivers the quality information necessary for the selection process to favor individuals with higher fitness for reproduction. The reproduction process consists of the recombination mechanism, responsible for mixing parental information, and mutation, which introduces undirected innovation into the population.

The EA applied to the spectrographic technique used in this work was first developed in the Java language and was manually distributed according to the island model briefly discussed below. In the island model, fully described in [14], each node has its own subpopulation and performs all typical tasks required by an EA (analysis/selection, mutation and recombination). This means that the total population of a sequential algorithm is divided into smaller subpopulations distributed for execution among the different nodes available. Figure 3 shows the island model distribution.

Fig. 3. EA distribution according to the island model

Thus, the EA implemented according to the island model performs the following tasks:

• Analysis – all population individuals are evaluated according to a given mathematical function;
• Crossover – the recombination of individuals is performed;
• Mutation – mutation is applied to the current population, generating the new population;
• Migration – each node sends and receives a given number of individuals from the remaining nodes. After this procedure, these new individuals are integrated into the node's own population.

Although both models were completely developed in software, previous works have shown that the island model is much more effective for this application [15]. Thus, only this model was developed for the cluster. Note that each island may be implemented in a different node. So, the dedicated cluster for this case is composed of the replication of several single nodes, implemented according to the methodology described here. After this process, the prototyped nodes must be connected through a traditional network and the embedded application is ready to run in the cluster.
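To make the per-island structure concrete, the sketch below outlines one island's evolution loop in Java, with migration over sockets. It is a minimal illustration under assumptions of our own (names such as `Island`, `evaluate`, the migration interval and the socket port are hypothetical); it is not the authors' implementation, which must additionally respect the SASHIMI synthesis constraints.

```java
import java.io.*;
import java.net.*;
import java.util.*;

// Simplified island-model EA loop: analysis, crossover, mutation and migration.
// All names, sizes and the socket-based migration scheme are illustrative assumptions.
public class Island {
    static final int POP_SIZE = 32, GENERATIONS = 100, MIGRANTS = 4;
    static final Random rnd = new Random();

    public static void main(String[] args) throws IOException {
        String neighborHost = args[0];              // node that receives our emigrants
        int[][] population = randomPopulation();

        for (int gen = 0; gen < GENERATIONS; gen++) {
            double[] fitness = new double[POP_SIZE];
            for (int i = 0; i < POP_SIZE; i++)       // Analysis: evaluate every individual
                fitness[i] = evaluate(population[i]);
            population = crossover(population, fitness); // Crossover: recombine parents
            mutate(population);                          // Mutation: undirected innovation
            if (gen % 10 == 0)                           // Migration: exchange individuals
                migrate(population, neighborHost);
        }
    }

    static double evaluate(int[] ind) {              // placeholder fitness function
        double s = 0; for (int g : ind) s += g; return s;
    }

    static int[][] crossover(int[][] pop, double[] fit) {
        int[][] next = new int[POP_SIZE][];
        for (int i = 0; i < POP_SIZE; i++) {
            int[] a = tournament(pop, fit), b = tournament(pop, fit);
            int cut = rnd.nextInt(a.length);
            next[i] = new int[a.length];
            for (int j = 0; j < a.length; j++) next[i][j] = j < cut ? a[j] : b[j];
        }
        return next;
    }

    static int[] tournament(int[][] pop, double[] fit) {
        int i = rnd.nextInt(POP_SIZE), j = rnd.nextInt(POP_SIZE);
        return fit[i] >= fit[j] ? pop[i] : pop[j];   // selection favors higher fitness
    }

    static void mutate(int[][] pop) {
        for (int[] ind : pop)
            if (rnd.nextDouble() < 0.05) ind[rnd.nextInt(ind.length)] ^= 1;
    }

    static void migrate(int[][] pop, String neighbor) throws IOException {
        try (Socket s = new Socket(neighbor, 5000);            // port is an assumption
             ObjectOutputStream out = new ObjectOutputStream(s.getOutputStream())) {
            out.writeObject(Arrays.copyOf(pop, MIGRANTS));     // send MIGRANTS individuals
        }
        // Receiving immigrants and merging them into the population is symmetric
        // (a ServerSocket accepting the neighbor's individuals) and omitted for brevity.
    }

    static int[][] randomPopulation() {
        int[][] pop = new int[POP_SIZE][16];
        for (int[] ind : pop) for (int j = 0; j < ind.length; j++) ind[j] = rnd.nextInt(2);
        return pop;
    }
}
```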
As discussed before, after the distributed algorithm was developed in the Java language, the SASHIMI tool [9] was used to generate the FemtoJava cores used in each node of the cluster. Those cores were later connected with the TCP/IP stack and the integration layer in order to complete each node. All synthesis and validation strategies are described in the next section.

V. SIMULATION, SYNTHESIS AND VALIDATION

After using the SASHIMI tool [9] to generate the FemtoJava core for the distributed EA, simulations were performed using Mentor Graphics ModelSim in order to validate the core. The reference data for the validation was extracted from the same Java algorithm used as the SASHIMI tool input. The TCP/IP stack as well as the integration layer were also simulated in ModelSim. The reference data, in this case, was extracted from a program written in C and implemented to generate network packets with known transfer data. The core and the stack were then integrated and new simulations were performed in ModelSim.

The prototyping targeted the XUP-V2P board [16], designed by Digilent, Inc., which contains a Xilinx XC2VP30 Virtex-II Pro FPGA. Besides the powerful FPGA (with two hardwired PowerPC 405 processors), this board has SDRAM as well as several other useful resources and interfaces such as RS-232, Ethernet and XSVGA output. After the simulation, the synthesis and debugging/verification processes were carried out using the Xilinx ISE Foundation software [17], since the target device belongs to the Xilinx Virtex-II Pro FPGA family. Synthesis results are shown in the next section.

In order to validate the prototype, the Xilinx EDK Platform Studio tool was used to create the entire programming environment supporting verification [18]. This includes an interface that connects the prototyped node to the RS-232 interface, which allowed the data produced by each node to be verified against the data produced in simulation. The bitstream produced by the Xilinx ISE Foundation software was downloaded to the FPGAs through a USB2 programming interface (JTAG). This download included the node as well as the verification interface generated by the Xilinx EDK Platform Studio. Also, the remaining PowerPC processor available was used to verify the prototyped nodes: the results generated by a given node were received by the PowerPC, which sent these data through RS-232 to a host PC. Then, these results were compared with the results obtained in simulation.

VI. RESULTS

The following sections present a summary of the synthesis results as well as the performance achieved by the designed system.

A. Synthesis

The entire node, i.e., FemtoJava core, TCP/IP network stack and integration layer, was completely synthesized using the Xilinx ISE Foundation software for the XC2VP30 Virtex-II Pro FPGA. Furthermore, a non-distributed version of the application was also developed and synthesized to the FPGA. This second embedded system does not have any support for parallel execution, i.e., it executes a sequential version of the algorithm and was not integrated with the communication stack. The post place-and-route synthesis results for both systems are summarized in Table I.
TABLE I
SYNTHESIS RESULTS

Synthesis for XC2VP30        Distributed      Sequential
Maximum frequency            100.59 MHz       102.06 MHz
Power consumption            1.1 W            1.1 W
Area (# of LUTs)             8,354 (30%)      5,394 (19%)

It is possible to observe that the entire node of the dedicated cluster occupies around 30% of the available LUTs, meaning that there is still enough room to optimize the embedded application, if needed. These optimizations, however, are not within the scope of this work. Also, the sequential version is significantly smaller. Nevertheless, this version does not use the communication infrastructure, which occupies a large part of the LUTs available in the FPGA, around 19%.

The maximum frequency achieved in both cases is above 100 MHz. This is a significant result, considering the FPGA technology used in this research. Moreover, regarding the distributed version, the embedded application is synchronized with the integration layer as well as the communication stack, and this frequency is more than enough to keep the process effective. It is also interesting to observe that the node communication infrastructure does not limit the frequency, as both versions achieved similar results. Additionally, it is important to notice the low power consumption in both cases (sequential and distributed), around 1.1 W. This result points out how power consumption may be reduced when using dedicated hardware, as a state-of-the-art GPP may consume over 100 W [19]. This is especially attractive in comparison with regular clusters, since one of the main problems of the latter approach is its high power consumption and heat dissipation.

B. Performance

In order to validate and compare the results achieved by the embedded version, the distributed application was also run in a regular cluster. However, it is not the intention of this work to broadly compare dedicated and regular clusters using different configurations and numbers of nodes. The main goal of this work was to develop and validate an effective flow to generate a dedicated cluster for a given application, and the comparison with regular clusters is presented only to illustrate the approach. Thus, both the regular and the dedicated clusters used in the experiments shown in this section have two nodes.

The conventional cluster used to perform the experiments is based on two Intel Pentium 4 nodes, each one running at 2.8 GHz with 512 MBytes of memory. The interconnection between them is a direct fast-Ethernet link, similar to the FPGA cluster architecture. The application run in this cluster was the same Java application which was earlier synthesized through the SASHIMI tool to create the customized FemtoJava core. As discussed before, the nodes are similar and each one implements an island of the EA. On the other hand, the dedicated cluster is based on two nodes developed according to the methodology previously described in this work. It is important to observe that each node has the same functionality as the ones run in the conventional cluster, as they were generated from that same Java application.

Table II shows the evolutionary algorithm execution times in both the conventional and the application-specific cluster. These results consider the average of 20 executions of the algorithm in each cluster. In order to obtain the execution time results in the conventional cluster case, system calls were used.
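The paper only states that system calls were used for the conventional-cluster timing; as an illustration (an assumption about the measurement, not the authors' exact method), a Java-level equivalent wrapping each complete run could look like the sketch below.

```java
// Hypothetical wall-clock measurement around one complete EA run on a
// conventional cluster node, averaged over 20 runs as described in the text.
public class TimingHarness {
    public static void main(String[] args) {
        final int RUNS = 20;
        long totalNanos = 0;
        for (int r = 0; r < RUNS; r++) {
            long start = System.nanoTime();
            runIslandEA();                          // one full EA execution (see Island sketch)
            totalNanos += System.nanoTime() - start;
        }
        System.out.printf("Average execution time: %.2f s%n", totalNanos / (RUNS * 1e9));
    }

    private static void runIslandEA() {
        // Placeholder for the island-model EA run; in the real experiments this is
        // the distributed Java application itself, not a stub.
    }
}
```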
For the dedicated cluster, the number of clock pulses necessary to complete an execution was accumulated and later multiplied by the clock period. Thus, the table shows the average execution time as well as the best performance achieved by the experiments in each case (dedicated and traditional), together with the relation between those results.

TABLE II
DEDICATED VS. TRADITIONAL CLUSTERS

Comparison results                Dedicated    Traditional    Difference (%)
Average exec. time (s)            14.16        22.53          37.15
Lowest exec. time (s)             5.36         19.5           72.51

It is possible to observe that the dedicated cluster achieved remarkable results for a cluster comprising two nodes. On average, the traditional cluster takes more than 22 seconds to find the EA fitness, while the dedicated cluster takes less than 15 seconds. This means that, on average, the gain of the dedicated cluster over the traditional one is greater than 35%. Moreover, the dedicated cluster performance may be more than 70% better when comparing the lowest execution times of each case. This points out that, in fact, the latencies produced by unnecessary software and hardware layers can significantly harm a system's performance.

Besides that, the results show that the dedicated cluster, which runs at 100 MHz, provides a speed-up of about 1.5 times over a Pentium-based cluster running at 2.8 GHz. Since the Pentium clock is 28 times faster (2.8 GHz / 100 MHz) and the dedicated cluster is still about 1.5 times faster in execution time, the dedicated nodes are roughly 28 × 1.5 ≈ 42 times more efficient per clock cycle than a superscalar processor running at the same frequency. This happens because the dedicated cluster runs on FPGAs, which are devices that exploit parallel (spatial) execution instead of the temporal, sequential approach used in the Pentium-based cluster. Moreover, the concurrent execution provided by FPGA devices is an advantage for the algorithm's operations, which exploit it by increasing the parallelism of the execution.

VII. CONCLUSION AND FUTURE WORK

This work proposed a design flow to efficiently distribute embedded systems over dedicated cluster machines. Even with several studies in the fields of distributed software, computer clusters and embedded systems, none of them proposed an approach joining these technologies. According to the proposed flow, the nodes are developed in the Java language and the synthesizable VHDL is generated automatically through the SASHIMI design flow. After that, the application core must be integrated with the TCP/IP stack as well as an integration layer.

The application implemented as a first case study was an Evolutionary Algorithm (EA) applied to spectrographic analysis, which is a well-known problem in the pharmaceutical industry. After following the design flow, the nodes were entirely implemented and integrated. The verification process is complete and the performance achieved remarkable results for a 2-node cluster. The dedicated node using the XUP-V2P board runs at around 100 MHz and consumes only 30% of the target device, a Xilinx XC2VP30 Virtex-II Pro. Moreover, the frequency achieved is considered good for the device used. There is still room for new optimizations in the architecture, as it was fully placed using just 30% of the FPGA LUTs. As an automatic flow was used to generate the FemtoJava core, some manual optimizations may be easily performed. However, which optimizations to implement, as well as their impact on node performance, will be addressed in future work. The general performance of the experiments executed in this work is also very impressive.
On average, using the dedicated cluster is more than 35% better than running the same algorithm in a conventional cluster. The best case achieves more than a 70% improvement over the conventional cluster. More experiments with different cluster configurations and numbers of nodes still need to be performed, but these results validate the idea of an embedded distributed application running in an application-specific cluster. Also as future work, alternatives to TCP/IP communication will be studied and analyzed. Moreover, new applications will be developed to verify the effectiveness of this approach.

ACKNOWLEDGMENT

The authors gratefully acknowledge the support of UNISC and FAPERGS in the form of scholarships and grants.

REFERENCES

[1] A. S. Tanenbaum and M. van Steen, Distributed Systems: Principles and Paradigms. Prentice Hall, 2002.
[2] C.-C. Yeh, C.-H. Wu, and J.-Y. Juang, "Design and implementation of a multicomputer interconnection network using FPGAs," in IEEE Symposium on Field-Programmable Custom Computing Machines. Napa Valley, California: IEEE, 1995, pp. 30–39.
[3] M. Jones, L. Scharf, J. Scott, C. Twaddle, M. Yaconis, K. Yao, and P. Athanas, "Implementing an API for distributed adaptive computing systems," in IEEE Symposium on Field-Programmable Custom Computing Machines. Napa Valley, California: IEEE, 2000, pp. 222–230.
[4] A. Dandalis, V. K. Prasanna, and J. D. P. Rolim, "An adaptive cryptographic engine for IPsec architectures," in IEEE Symposium on Field-Programmable Custom Computing Machines. Napa Valley, California: IEEE, 2000, pp. 132–141.
[5] R. Sass, K. Underwood, and W. Ligon, "Design of an Adaptable Computing Cluster," in The Military and Aerospace Programmable Logic Device (MAPLD) International Conference, 2001.
[6] K. Underwood, R. Sass, and W. Ligon, "Cost effectiveness of an Adaptable Computing Cluster," in ACM/IEEE Supercomputing, 2001.
[7] A. M. Jacob, I. A. Troxel, and A. D. George, "Distributed configuration management for reconfigurable cluster computing," HCS Research Lab, University of Florida, 2004.
[8] J. Willians, I. Syed, J. Wu, and N. Bergmann, "A reconfigurable cluster-on-chip architecture with MPI communication layer," in 14th Annual IEEE Symposium on Field-Programmable Custom Computing Machines. Los Alamitos, California: IEEE, 2006, pp. 351–352.
[9] S. Ito, L. Carro, and R. Jacobi, "SASHIMI and FemtoJava: making Java work for microcontroller applications," IEEE Design & Test of Computers, pp. 100–110, 2001.
[10] A. Dunkels, "Design and implementation of the lwIP TCP/IP stack," Swedish Institute of Computer Science, Tech. Rep., February 2001.
[11] Intel, "Intel LXT972A single-port 10/100 Mbps PHY transceiver," available at <http://www.intel.com/design/network/products/lan/datashts/24918603.pdf>. Accessed in May 2007.
[12] J. Coates, "Vibrational spectroscopy: instrumentation for infrared and Raman spectroscopy," Applied Spectroscopy Reviews, vol. 33, 1998.
[13] L. A. N. Lorena and J. C. Furtado, "Constructive genetic algorithm for clustering problems," Evolutionary Computation, vol. 9, no. 3, pp. 309–327, 2001.
[14] E. Alba and J. M. Troya, A Useful Review on Coarse Grain Parallel Genetic Algorithms. Universidad de Málaga, Campus de Teatinos (2.2.A.6), 29071 Málaga (España), 1997.
[15] A. Aguiar, C. Both, M. Kreutz, R. dos Santos, and T. dos Santos, "Implementação de algoritmos genéticos paralelos aplicados a fármacos," in XXVI Congresso da Sociedade Brasileira de Computação - WPerformance. Campo Grande, MS: Sociedade Brasileira de Computação, July 2006.
[16] Xilinx University Program Virtex-II Pro Development System - Hardware Reference Manual, Xilinx Inc., 2006.
[17] Xilinx ISE 8.2i - Software Manual, Xilinx Inc., 2006.
[18] Xilinx Embedded Development Kit (EDK) 8.2 - Software Manual, Xilinx Inc., 2006.
[19] Intel, "Intel Pentium 4 processor - thermal management," available at <http://www.intel.com/support/processors/pentium4/sb/CS007999.htm>. Accessed in December 2006.