IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 34, NO. 12, DECEMBER 1999

A 10-Gb/s (1.25 Gb/s × 8) 4 × 2 0.25-µm CMOS/SIMOX ATM Switch Based on Scalable Distributed Arbitration

Eiji Oki, Member, IEEE, Naoaki Yamanaka, Senior Member, IEEE, Yusuke Ohtomo, Member, IEEE, Kazuhiko Okazaki, and Ryusuke Kawano, Member, IEEE

Abstract—This paper presents the design and implementation of a scalable asynchronous transfer mode (ATM) switch. We fabricated a 10-Gb/s 4 × 2 switch large-scale integration (LSI) that uses a new distributed contention control technique that allows the switch LSI to be expanded. The developed contention control is executed in a distributed manner at each switch LSI, and the contention control time does not depend on the number of connected switch LSI's. To increase the LSI throughput and reduce the power consumption, we used 0.25-µm CMOS/SIMOX (separation by implanted oxygen) technology, which enabled us to make 221 pseudo-emitter-coupled-logic I/O pins with 1.25-Gb/s throughput. In addition, power consumption of 7 W is achieved by operating the CMOS/SIMOX gates at −2.0 V. This consumption is 36% less than that of bulk CMOS gates (11 W) at the same speed at −2.5 V. Using these switch LSI's, an 8 × 8 switching multichip module with 80-Gb/s throughput was fabricated in a compact size.

I. INTRODUCTION

ASYNCHRONOUS transfer mode (ATM) is expected to lead to multimedia communication networks. The demand for multimedia services, such as high-speed data communications and high-definition television broadcasting, will increase, and ATM switching systems with over 1-Tb/s throughput must be created [1], [2]. Several switch architectures for achieving a high-performance ATM switching system have been presented [3]–[5]. To prevent two or more cells that come from different input ports from colliding at the same destined output port, queuing buffers must be arranged in a switch.
Switch architectures are mainly categorized by the position of the buffers into an output-buffer type, an input/output-buffer type, an input-buffer type, and a crosspoint-buffer type. The output-buffer-type switch provides good statistical performance. In this architecture, however, the writing speed at the output buffers must be as fast as the sum of all the input line speeds. Therefore, the Knockout switch was proposed; it uses an N-to-L concentrator to relax the required writing speed at the output buffers [6]. This architecture can grow modularly toward a larger switch. However, cells may be discarded when the number of cells arriving simultaneously at the output buffer is larger than L. In the input/output-buffer-type switch, a cell waits in an input buffer to avoid the internal conflict caused by simultaneous cell arrival, even though the switch allows multiple cells, up to m, to be written to an output buffer during the same cell time [7]–[9]. These output-buffer-type and input/output-buffer-type switch architectures require the internal line capacity to be expanded according to the input/output line speed. This makes it difficult to implement a large switching system consisting of many very large scale integrations (VLSI's), because a large number of interconnection links and/or high-speed links are required to connect the VLSI's in order to achieve the required switch throughput [10]. This raises the interconnection cost and leads to a pin bottleneck in the VLSI's. Another approach is the input-buffer-type switch architecture. It does not require the internal line speed to be increased.

Manuscript received April 9, 1999; revised July 4, 1999. E. Oki, N. Yamanaka, K. Okazaki, and R. Kawano are with NTT Network Service Systems Laboratories, Tokyo 180-8585 Japan. Y. Ohtomo is with NTT Telecommunications Energy Laboratories, Kanagawa 243-01 Japan. Publisher Item Identifier S 0018-9200(99)08964-7.
It is well known that head-of-line (HOL) blocking limits the maximum throughput to 58% [11]. To improve the throughput performance, several novel scheduling algorithms have been proposed [12], [13]. These approaches require a centralized scheduler that considers requests from all of the input buffers and determines a new configuration for the crossbar within one ATM cell time. If the input/output line speed and the switch size increase, the centralized scheduler may become a bottleneck in terms of scalability. Advanced CMOS technologies may enable us to make many gates and large memories; in that case, a crosspoint-buffer-type switch architecture is an appropriate choice. This architecture does not require any increase in the internal line speed, and it eliminates the HOL blocking that occurs in the input-buffer-type switch, at the cost of a large amount of crosspoint-buffer memory at each crosspoint.

To achieve the required switch throughput, many switch large-scale integrations (LSI's) must be connected. Let us consider switch LSI's arranged in a matrix. These switch LSI's have output buffers. Each switch LSI provides crosspoint-buffer functions using its output buffers, as well as a switching function. It is not necessary for each switch LSI to have a crosspoint buffer at each crosspoint inside it. To achieve the switching, we have several choices when we implement the switch LSI, such as the input/output-buffer type, output-buffer type, or shared-memory type [14]. In this paper, we use the input/output-buffer-type approach inside a switch LSI, considering the implementation. From the viewpoint of the switching system, this type of switch architecture is called a crosspoint-LSI-type switch architecture in this paper.

However, some problems occur. First, the crosspoint-LSI-type switch experiences a problem when the number of row LSI's increases and the output lines are fast.
As the output-line speed increases, the ATM cell time decreases. In a switch having a large number of row LSI's, ring arbitration among the LSI's belonging to the same output port cannot be completed within the short ATM cell time. Therefore, in conventional switches based on ring arbitration, the arbitration time limits the output-line speed according to the number of row switch LSI's, because ring arbitration must be completed within the cell time. To reduce the time required for ring arbitration, a bidirectional arbitration method was proposed [15]. It uses a bidirectional token bus to replace the ring arbiter. This bidirectional arbiter enables arbitration up to twice as fast as simple ring arbitration, but it requires twice as many control signals. To obtain even faster arbitration than is possible with bidirectional arbitration, hierarchical arbitration, in other words, tree arbitration, may be employed: row switch LSI's are divided into groups, ring arbitration is executed among the row switch LSI's within a group, and it is also executed hierarchically among the different groups. However, these fast arbitration schemes increase the number of control signals and the hardware complexity. In addition, executing such fast arbitration within a short cell time requires a strict timing design as the switch size increases. We consider that these kinds of centralized contention control schemes are not scalable as the size of the switching system increases. Second, to achieve the highest possible throughput of a switching system in a cost-effective manner, the switch LSI throughput should be large as well. The number of I/O pins may become a bottleneck in terms of the throughput of the switch LSI. Therefore, the first requirement for a switch LSI is a distributed contention control technique that solves the problem of conventional centralized contention control.
The second requirement is to use I/O pins with an interface of at least 1 Gb/s to avoid the pin bottleneck. The final requirement is that the power consumption of the switch LSI be less than 10 W, considering practical deployment in a system.

This paper presents the design and implementation of a scalable crosspoint-LSI-type switch that employs a new distributed contention control technique, called scalable distributed arbitration (SDA), that allows the switch LSI to be expanded [16]. SDA is executed in a distributed manner at each switch LSI, and the arbitration time does not depend on the number of connected switch LSI's [17]. For higher LSI throughput and lower power consumption, 0.25-µm CMOS/SIMOX (separation by implanted oxygen) technology is used. This technology enables us to achieve 221 pseudo-emitter-coupled-logic (ECL) I/O pins at 1.25 Gb/s. In addition, power consumption of only 7 W is achieved by operating the CMOS/SIMOX gates at −2.0 V. Using these switch LSI's, an 8 × 8 switching multichip module (MCM) with 80-Gb/s throughput was fabricated in a compact size.

The remainder of this paper is organized as follows. Section II explains the problems of the conventional switch architecture based on ring arbitration. Section III presents the switch architecture based on SDA and its performance. Section IV describes our developed switch LSI and presents some results. Section V describes the LSI testing for MCM assembly. Section VI describes the 80-Gb/s-throughput switching module. Last, Section VII summarizes the key points.

II. CONVENTIONAL SWITCH ARCHITECTURE

In the crosspoint-LSI-type switch architecture (shown in Fig. 1), an ATM cell from an input line is dropped, through an input buffer and a switching function, into an output buffer attached to the destined output line. A large-scale switch consists of many such switch LSI's, and it has correspondingly many input ports and output ports. Here, we assume that the switch LSI's employ an input/output-buffer-type switch architecture. An output line is a bus that is accessed by all output buffers of the row switch LSI's belonging to it. The dropped ATM cell is stored in the output buffer until it is injected into the output line. A conventional crosspoint-LSI-type switch architecture uses ring arbitration among the switch LSI's to avoid output-bus-access contention, as shown in Fig. 1 [15]. As described in Section I, other centralized arbitration schemes are possible, but, to simplify our discussion of the problem of centralized arbitration, we describe simple ring arbitration here. The switch LSI's have output buffers, which function as crosspoint buffers in the switching system. These switch LSI's are arranged in a matrix. Contention occurs when ATM cells from different switch LSI's request transmission to the same output line at the same cell time. In the conventional switch, the ring arbiter searches, from some starting point, for an output buffer that has made a request to transfer a cell to the output line. The starting point is just below the output buffer from which a cell was sent to the output line at the previous cell time. If the ring arbiter finds such a request, the cell at the head of that output buffer is selected for transmission. At the next cell time, the starting point is reset to just below the selected output buffer. Thus, in the worst case, the control signal for ring arbitration must pass through all switch LSI's belonging to the same output line within the ATM cell time. For that reason, the maximum output-line speed of the conventional switch is limited by the number of row switch LSI's and by the transmission delay of the control signals in each switch LSI.
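A minimal sketch of the round-robin ring search described above, assuming one request flag per row switch LSI on the shared output line (the function name is ours, and this is a behavioral illustration, not the authors' hardware):

```python
def ring_arbitrate(requests, start):
    """Round-robin (ring) search: scan the output buffers of one column,
    beginning just below the previously served buffer, and grant the first
    one that requests the shared output line.

    requests: list of bools, one per row switch LSI.
    start: index where the scan begins (just below the last winner).
    Returns the index of the winning buffer, or None if no requests."""
    n = len(requests)
    for k in range(n):
        i = (start + k) % n
        if requests[i]:
            return i
    return None

# One cell time: buffers 1 and 3 request. The last winner was buffer 0,
# so the scan starts at index 1 and buffer 1 wins; at the next cell time
# the scan would start at index 2.
winner = ring_arbitrate([False, True, False, True], start=1)
```

Note that in the worst case the scan visits every row LSI, which is exactly why the control signal must traverse the whole column within one cell time.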
The maximum output-line speed V [b/s] is given by the following equation:

V = aL / (N · t_d)     (1)

where N is the number of row switch LSI's, t_d is the transmission delay of the control signals in a switch LSI, and L is the length of an ATM cell in bits. t_d depends on the performance of the devices and the distance between switch LSI's. a is a factor that depends on how the centralized arbitration is implemented. Here, a is set to one, since simple ring arbitration is assumed; if bidirectional arbitration were used, a would be set to two.

Fig. 1. Conventional arbitration among switch LSI's.

Fig. 2 shows the relationship between V and N for different t_d values in the conventional switch. L is set to 53 × 8 bits. As N increases, V decreases; for example, at the values of t_d and N indicated in Fig. 2, V is 8.8 Gb/s. Thus, since the conventional switch uses ring arbitration, the arbitration time limits the output-line speed according to the number of row switch LSI's, to ensure that ring arbitration can be completed within the ATM cell time. As a result, unless t_d is made small by using ultra-high-speed devices, the conventional switch cannot achieve large throughput. Note that when the centralized contention controller for all row switch LSI's is located in a different place and pipelined control is executed, N may not affect the required arbitration time very much [18]. However, as N increases, the centralized contention controller needs to be expanded. In addition, for this bus-access transmission system, a strict timing design for transmitting cells and control signals will be needed, and this will be difficult in a large-scale switching system. Considering the scalability of the switching system, we think it is better to use a distributed control approach rather than a centralized one.

III. SCALABLE-DISTRIBUTED-ARBITRATION (SDA) SWITCH ARCHITECTURE

A. Structure

This section describes a high-speed crosspoint-LSI-type switch based on distributed contention control, called the SDA switch. Fig. 3 illustrates its structure.
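Before moving on, the tradeoff captured by (1) can be checked numerically; the 3-ns per-LSI control delay below is illustrative rather than taken from Fig. 2:

```python
def max_output_line_speed(n_rows, t_delay_s, cell_bits=53 * 8, a=1.0):
    """Eq. (1): ring arbitration must traverse all n_rows LSI's within one
    cell time, so the maximum output-line speed is V = a * L / (N * t_d).
    a = 1 for simple ring arbitration, a = 2 for the bidirectional scheme."""
    return a * cell_bits / (n_rows * t_delay_s)

# Illustrative numbers: with a 3-ns per-LSI control delay, doubling the
# number of rows halves the attainable output-line speed.
v16 = max_output_line_speed(16, 3e-9)   # ~8.8 Gb/s
v32 = max_output_line_speed(32, 3e-9)   # ~4.4 Gb/s
```

This makes the scaling problem concrete: V falls as 1/N, so large N forces either slower output lines or ultra-high-speed devices.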
There is an ATM output buffer, an ATM transit buffer, an arbitration-control part (CNTL), and a selector at every output port in the switch LSI, but for simplicity, only one output port per switch LSI is shown. Note that the ATM output buffer of the switch LSI can be regarded as a crosspoint buffer in the switching system.

Fig. 2. Maximum output-link speed in ring arbitration.

The SDA mechanism is as follows.

1) An ATM output buffer sends a request (REQ) to CNTL if there is at least one cell stored in the output buffer. An ATM transit buffer stores several cells that are sent from either the output buffer of the upper LSI or the transit buffer of the upper LSI. The transit buffer size may be one or a few cells. Like the output buffer, the transit buffer sends REQ to CNTL if there is at least one cell stored in it.

2) If the transit buffer is about to become full, it sends a not-acknowledgment (NACK) to the upper CNTL.

3) If there are any REQ's, and CNTL does not receive NACK from the next lower transit buffer, then CNTL selects a cell within one cell time. CNTL determines which cell should be sent according to the following cell selection rule. The selected cell is sent through a selector to the next lower transit buffer or the output line.

4) The cell selection rule is as follows. If either the output buffer or the transit buffer makes a request for cell release, the cell in the requesting buffer is selected. If both the output buffer and the transit buffer request cell release, the cell with the larger delay time is selected. The delay time is defined as the time elapsed since the cell entered the output buffer. To compare the delay times of competing cells, we use a synchronous counter of c bits, and we also use the same number of overhead bits in each cell.
The synchronous counter is incremented by one every T cell times, where T is a parameter representing the granularity for measuring the delay time. When T = 1, the delay time is measured with the greatest accuracy. When a cell enters an output buffer, the value of the synchronous counter is written in the overhead of the cell. When both an output buffer and a transit buffer issue requests for cell release, the values of their counters are compared. If the difference in values is less than 2^(c−1), the cell with the smaller value is selected. Conversely, if the difference is equal to or more than 2^(c−1), the cell with the larger value is selected. This delay-time comparison works when the maximum delay time is less than 2^(c−1) · T; we thus set the values of c and T to satisfy this relationship.

Fig. 3. SDA technique among switch LSI's.

5) When the delay time of the cell in the output buffer equals that in the transit buffer, CNTL determines which cell should be sent using a second cell selection rule. When T is large, the probability that the second cell selection rule is used is large. Let us number the switch LSI's and their transit buffers starting at the top. The second rule is that the i-th output buffer is selected with a probability of 1/i, while the i-th transit buffer is selected with a probability of (i − 1)/i. For example, the third output buffer and transit buffer are selected with probabilities of 1/3 and 2/3, respectively. According to the second cell selection rule, the cell that enters the i-th output buffer goes to the output line with a total probability given by

(1/i) × (i/(i + 1)) × ⋯ × ((N − 1)/N) = 1/N.     (2)

Here, the first term on the left side of (2) is the probability that the cell in the output buffer of the i-th LSI is selected, the second term is the probability that the cell in the transit buffer of the (i + 1)-th LSI is selected, and the final term is the probability that the cell in the transit buffer of the N-th LSI is selected.
The total probability that a cell from any output buffer is selected for delivery to an output line is therefore a constant value, 1/N. The fairness of the selection probability is thus maintained by the second selection rule, even when the delay time of the cell in the output buffer equals that in the transit buffer. In the implementation of the second cell selection rule, to avoid random-variable generation, we employed the following simple cell selection mechanism. For each output port, the i-th switch LSI has a counter that counts up cyclically from zero to i − 1 whenever contention occurs between the output buffer and the transit buffer at that output port. When the counter value is zero, the output buffer is selected; otherwise, the transit buffer is selected. This mechanism achieves cell selection with the specified probability using simple hardware, although it is not a completely randomly weighted cell-selection mechanism.

Thus, the SDA switch achieves distributed arbitration at each switch LSI. The longest control-signal transmission distance for arbitration within one cell time is obviously the distance between two adjacent switch LSI's, whereas in the conventional switch, the control signal for ring arbitration must pass through all LSI's belonging to the same output line. For that reason, the arbitration time of the SDA switch does not depend on the number of switch LSI's.

Here, we compare the SDA switch with the Knockout switch [6] and the Abacus switch [7]. To increase the switch throughput, the Knockout switch can grow modularly. It uses N-to-L concentrators to connect switch modules arranged on a matrix plane. Since cells may be discarded at the concentrators in each row module, the cell loss probability for a cell that comes from the top module may be higher than that for a bottom cell, because the former cell has to transit many concentrators.
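Stepping back from the architecture comparison for a moment, the counter-based implementation of the second cell selection rule can be sketched as follows; the class name is ours, and this is a behavioral model rather than the authors' hardware:

```python
class SecondRuleArbiter:
    """Tie-break rule at the i-th switch LSI (counted from the top).
    A counter cycles 0, 1, ..., i-1; on each tie, the local output buffer
    wins only when the counter reads 0, i.e., with frequency 1/i, and the
    transit buffer wins with frequency (i-1)/i."""
    def __init__(self, i):
        self.i = i          # position of this LSI, counted from the top
        self.count = 0      # cyclic contention counter, 0 .. i-1
    def tie_break(self):
        pick_output = (self.count == 0)
        self.count = (self.count + 1) % self.i
        return 'output' if pick_output else 'transit'

# Over any i consecutive ties at LSI i, the local output buffer is chosen
# exactly once, giving the 1/i rate required by (2): a cell entering the
# i-th output buffer then survives to the line with probability
# (1/i) * i/(i+1) * ... * (N-1)/N = 1/N.
arb3 = SecondRuleArbiter(3)
wins = [arb3.tie_break() for _ in range(6)]
```

The deterministic counter trades true randomness for trivial hardware while preserving the long-run selection frequencies.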
In the SDA switch, cell loss never occurs on the way to an output line once a cell is transmitted from an output buffer in a switch LSI to the lower switch LSI, because of the use of the NACK signal. In the SDA switch, the cell loss probability is set by the design of the output buffer size. In addition, the Knockout switch needs more interconnection links between modules than the SDA switch does. From the viewpoint of the system, the SDA switch does not require the internal link capacity to be expanded. This is because the Knockout switch arranges buffers only in the bottom module, while the SDA switch arranges them in every switch LSI.

The Abacus switch is an input/output-buffer-type switch, and it requires the internal link capacity to be expanded to eliminate HOL blocking. The Abacus switch allows distributed contention control of each small switch fabric by grouping output ports, and it supports multicasting of traffic. It has buffers only at the input and output ports. Selection is performed in a distributed manner using its switch fabric and all input modules to avoid the speed constraint. This distributed contention technique is very useful, but the timing requirements of routing cells and resolving contention should be carefully considered, as described in [7], when a large-scale switching system is developed. On the other hand, in our SDA switch, the arbitration is executed in a distributed manner, the control signals are transmitted only between adjacent switch LSI's, and the switch does not use a bus line as an output line. Therefore, the timing limitations on the transmission of cells and control signals are relaxed even when the switching system is expanded. This is an advantage of the SDA.

Fig. 4. Probability of delay time.

B. Performance of SDA

SDA performance was evaluated in terms of delay time and output buffer size by computer simulation.
For simplicity, we assume that the input traffic to the output buffer at each LSI is random, the input load is 0.95, and cells are distributed uniformly to all output buffers belonging to the same input line. Note that, in this subsection, in order to clarify the SDA performance, we focus on an output buffer, a transit buffer, and the SDA control part in a switch LSI. Our switch LSI employs an input/output-buffer-memory-type switch architecture inside the switching function. The switching performance before a cell enters an output buffer will be described in the next section.

Fig. 5. Dependence of 99.99% delay time on number of row switch LSI's N.

Fig. 6. Dependence of average delay time on number of row switch LSI's N.

The SDA switch ensures SDA-switch delay-time fairness. Fig. 4 shows the probability of the SDA-switch delay time's being larger than a certain time. The probability is shown for each output buffer that cells enter. The SDA-switch delay time is defined as the time from when a cell enters the output buffer until it reaches the output line. This delay definition is used as a measure of the SDA performance, and it is different from the delay time used in the first selection rule described in Section III-A; that is why we call it the SDA-switch delay time. However, when we describe the SDA performance, we simply write "delay time" instead of "SDA-switch delay time" in the following. In the SDA switch, when the delay time is more than about ten [cell times], all delay times have basically the same probability, so delay-time fairness is achieved. (Fairness is not maintained at shorter delays because it takes at least N [cell times] for the cell in the top output buffer to enter the output line.) In addition, beyond a certain time, the probability of the SDA-switch delay time's being larger than that time is smaller than in the conventional switch, as shown in Fig.
4. This is because, in the SDA switch, the cell with the largest delay time is selected.

This delay-reduction effect becomes clearer as N increases, until N reaches a certain value. Fig. 5 shows that the 99.99% delay time of the SDA does not change very much as N increases while N is less than about 64, whereas that of the conventional switch increases rapidly. However, when N is larger than 64, the 99.99% delay time of the SDA increases linearly. The reason is as follows. A cell has to wait for at least one cell time at each transit buffer. When N becomes large, this one-cell-time waiting effect of the transit buffers appears. Then, when N is larger than about 210, the delay of the SDA becomes larger than that of the conventional switch. This value of N is the crossover point in the 99.99% delay time; there, the throughput of the switching system is 840 Gb/s if we use our 4 × 2 switch LSI with 10-Gb/s line speed.

Here, we show an example of setting the values of c and T. When the maximum delay time, i.e., the worst-case delay time, is smaller than 2^(c−1) [cell times], a synchronous counter of just c bits suffices even if it is incremented every cell time; in other words, T = 1. Consider two competing cells where, for example, one cell's synchronous counter value (with c = 8) is between 129 and 255 and the other's is zero. In this case, the difference between the two values is larger than 2^7, so the cell with the larger value (i.e., the former cell) is selected. If a larger T is used, then c is reduced. We note that the values of c and T should be determined so that the maximum delay time is not larger than 2^(c−1) · T. If this condition is satisfied, the SDA mechanism works well, as explained in Section III-A.

Fig. 6 shows that the average delay time of the SDA for all input traffic increases linearly with N, while that of the conventional switch does not increase. This is mainly due to the one-cell-time waiting effect of the transit buffers, as mentioned in connection with Fig. 5.
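As an aside, the wraparound timestamp comparison used by the first selection rule (here with c = 8 and T = 1, as in the example above) can be sketched as follows; `older_cell` is a hypothetical helper name:

```python
C_BITS = 8                    # counter width c, as in the text's example
HALF = 1 << (C_BITS - 1)      # 2**(c-1) = 128, the comparison threshold

def older_cell(a, b):
    """Pick the cell with the larger delay from two c-bit arrival
    timestamps a and b (a smaller timestamp normally means an older cell).
    If the raw difference is >= 2**(c-1), the counter has wrapped around,
    so the *larger* value actually belongs to the older cell.  Valid only
    while the maximum delay stays below 2**(c-1) counter ticks.
    Returns the winning timestamp, or None on a tie (second rule applies)."""
    if a == b:
        return None
    lo, hi = min(a, b), max(a, b)
    return lo if hi - lo < HALF else hi

# The text's example: timestamps between 129 and 255 versus 0 differ by
# at least 2**7, so the counter wrapped and the larger value is older.
pick = older_cell(200, 0)
```

This is the standard modular (serial-number style) comparison: it stays correct across counter wraparound as long as the worst-case delay is bounded as stated.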
Even though this effect increases the average delay of the SDA, SDA offers the advantage of distributed contention control.

The required output buffer size of the SDA switch is smaller than that of the conventional switch, as shown in Fig. 7. The required buffer sizes were estimated so as to guarantee the target cell loss ratio. Implementing the SDA mechanism requires additional hardware compared with the conventional switch, such as a transit buffer, a synchronous counter, and control parts; the amount of hardware required is discussed in Section IV. In the SDA switch, since the required buffer sizes depend on the positions of the switch LSI's, Fig. 7 shows the smallest (the output buffer of the top switch LSI) and the largest (the output buffer of the bottom switch LSI) sizes. The sizes of the output buffers of intermediate switch LSI's lie between these two values. Because the SDA switch has a shorter delay time, as explained earlier, the queue length of the output buffer is also reduced. This is why the output buffer size of the SDA switch is less than that of the conventional switch.

Fig. 7. Required output buffer size of switch LSI's.

IV. SWITCH LSI DESIGN

Fig. 8(a) shows a block diagram of the switch LSI. The switch LSI has a 4 × 2 switching function with input and output ATM link speeds of 10 Gb/s. It also has additional output ports (expanded output ports) to reduce input-signal fan-out. All input signals are transmitted via the switch LSI's in a pipelined manner. A 10-Gb/s ATM link is achieved by eight single-ended I/O's at 1.25 Gb/s together with an f/2 clock and a frame signal, which are differential I/O's. The 10-Gb/s ATM cells stream along the arrows. The 1.25-Gb/s pseudo-ECL interface in the chip is constructed with CMOS low-voltage-swing active-pullup I/O circuits, six 8 : 64 demultiplexers (DEMUX's), and six 64 : 8 multiplexers (MUX's). The two-edge-trigger MUX/DEMUX circuit uses an f/2 input clock [19]–[21].
Differential pseudo-ECL interface circuits, as shown in Fig. 9, are used to generate the f/2 clock to increase the noise margin of the two-edge-trigger DEMUX. The basic idea of the active-pullup circuits is to pull the output up so as to shorten the rise time of the open-drain-type output circuits to less than a few hundred picoseconds. The 10-Gb/s ATM cell stream is widened to 64 bits at the internal clock speed of 156 MHz.

The input/output-buffer-memory-type switch architecture is used in the switch LSI. BUFI has a 16-cell ATM input buffer at each input port, and BUFO has a 128-cell ATM output buffer and a 16-cell transit buffer at each output port. The ATM cell size is 64 bytes. These buffer memories are sized so as to hold the cell loss ratio to the target value under an offered load of 0.95. SDA is implemented in BUFO, as shown in Fig. 8(b). SDA relaxes the operating speed for arbitration compared to ring arbitration. The SDA operation is executed only between two adjacent LSI's within the ATM cell time of 51.2 ns (equal to eight internal-clock cycles) in a pipelined manner. Since SDA uses cell selection control based on ATM cell arrival time, the required ATM output buffer amount is reduced by approximately 25% compared with a switch LSI using conventional ring arbitration. This is another feature of SDA. The reduction in the amount of memory was 44 kb for the two output buffers in a switch LSI. The extra hardware required to implement the SDA mechanism in the switch LSI was 15 kgates and a 16-kb two-port RAM, which includes the transit buffers, synchronous counters, control parts, and so on. Although this extra hardware is necessary, we believe that it does not have much impact on the implementation of SDA using our advanced CMOS devices, considering the reduction in output buffer size compared with the conventional switch architecture.
A 4 × 2 switching function is achieved using the tandem-crosspoint (TDXP) switching technique to effectively eliminate ATM cell blocking in the switch LSI, as shown in Fig. 10. Since the TDXP switching mechanism and its performance were described in [22], we only briefly summarize them here. The switch consists of multiple crossbar switch planes, which are connected in tandem at every crosspoint. Even if a cell cannot be transmitted to an output port on the first plane, it has a chance of being transmitted on the next plane. Cell transmission is executed on each switch plane in a pipelined manner. Therefore, more than one cell can be transmitted to the same output port within one cell time slot, and the internal line speed of each switch plane is equal to the input/output line speed. The arbitration control on each switch plane is executed independently of that of the other planes. We employed ring arbitration on each switch plane in the switch LSI.

This switch architecture has several advantages in implementation. First, the switch LSI uses the TDXP switch architecture, which allows the input and output lines to operate at the same speed. A conventional input/output-buffer-memory-type switch, by contrast, requires the internal speed to be increased [15]. It is very difficult to make internal lines that are sufficiently fast; we would have to use ultrahigh-speed devices such as GaAs or Si-bipolar VLSI's. Second, the total speed for writing an output buffer memory is 20 Gb/s, which is expanded to 128 bits in parallel at 156 MHz, while a shared-memory-type switch would require 40 Gb/s. TDXP switching relaxes the memory-writing speed while avoiding ATM cell blocking. Third, since the TDXP switch employs a simple cell-reading algorithm at the input buffer in order to retain the cell sequence, TDXP does not require the cell sequences to be rebuilt at the output buffers, unlike a parallel switch. These advantages make it easy to implement the high-speed ATM switch [22].
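The tandem-crosspoint idea can be sketched behaviorally, under the simplifying assumption that each plane accepts at most one cell per output per cell time (the function name and cell representation are ours, not from [22]):

```python
def tdxp_route(cells, n_planes=2):
    """Tandem-crosspoint sketch: each of n_planes crossbar planes can
    deliver at most one cell per output port per cell time; a cell blocked
    on plane k is offered to plane k + 1 via the tandem crosspoint.

    cells: list of (input_port, output_port) requests for one cell time.
    Returns (delivered, blocked), where delivered entries are
    (input_port, output_port, plane_used) and blocked cells would stay
    in their input buffers for the next cell time."""
    busy = set()                       # (plane, output) slots already taken
    delivered, blocked = [], []
    for inp, out in cells:
        for plane in range(n_planes):
            if (plane, out) not in busy:
                busy.add((plane, out))
                delivered.append((inp, out, plane))
                break
        else:
            blocked.append((inp, out))
    return delivered, blocked

# Three cells destined to output 0 in the same cell time: with two planes,
# two are delivered (one per plane) and one remains blocked.
delivered, blocked = tdxp_route([(0, 0), (1, 0), (2, 0)])
```

This shows why the internal line speed can equal the line speed: extra same-destination cells ride the additional planes instead of a faster internal bus.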
The tandem crosspoints are implemented in the TDXPSW function block. Each TDXPSW has a one-cell ATM buffer. Fig. 11 shows the average delay for the 4 × 2 switch LSI. The average delay for TDXP switching is defined as the time from when a cell enters an input buffer until it enters an output buffer. Note that we focus here on the performance of the TDXP switching mechanism and do not consider the queuing delay time at the output buffer, which is related to the SDA mechanism. To expand the switching system throughput, these switch LSI's are arranged in a matrix plane, which gives the switching system a correspondingly larger switching function. As the number of column switch LSI's M increases, in other words, as the switching system size increases, the average delay time approaches two [cell times]. We assume that cells are distributed uniformly to all destinations from the same input line and that the total input traffic load is 0.95; the input traffic load to each switch LSI is thus divided by M. Our switching system employs a crosspoint-buffer-type switch architecture from the system-level viewpoint, but an input/output-buffer-type switch architecture from the switch-LSI-level viewpoint. In addition, we assume that a cell has to wait at least one cell time in both an input buffer and a tandem crosspoint. Therefore, the average delay time is close to two. If output-buffer-type crossbar switching were employed in a switch LSI, the average switching delay (as we defined it here) would be zero. We can observe that the difference in the average delay between TDXP switching and output-buffer-type switching is only a few cell times. Therefore, the average delay of TDXP switching in the switch LSI is very small.

Fig. 8. Functional block diagram of switch LSI.
We note that, since the input traffic load that enters the switch LSI decreases when the switching-system size increases, and the internal transmission capacity in the switch LSI is twice the outside transmission capacity of the switch LSI, the HOL blocking probability at an input buffer in the switch LSI is much lower than that in an input-buffer-type switching system [11], [12]. The performance of the TDXP switching mechanism and that of the input-buffer-type switch [11] were compared in [22].
Fig. 9. Differential pseudo-ECL interface.
Fig. 10. Tandem-crosspoint switching mechanism in switch LSI.
Fig. 11. Average delay time for TDXP switching in switch LSI.
Other features of the switch LSI are support for two priority classes and multicasting, both of which are needed for multimedia services. The ATM output buffer in the BUFO block handles high-priority cells for guaranteed services and low-priority cells for best-effort service. The head-of-line priority discipline and a push-out mechanism are used. A low-priority cell is transmitted only when there are no high-priority cells in the ATM output buffer. In addition, when the ATM output buffer is full and a high-priority cell enters, the stored low-priority cells that entered the output buffer last are pushed out of the buffer. The 10-Gb/s ATM cell stream is handled within one ATM cell time of 51.2 ns at the PUSH_CONT block, as shown in Fig. 8(b). To support multicasting, 64 routing bits are used. The switch LSI has two modes: a unicasting mode and a multicasting mode. For the unicasting mode, binary notation is used in the routing bits. For the multicasting mode, a multicast pattern is a bit map of all of the output ports in a large switching system. Each bit indicates whether the cell is to be sent to the destined output port.
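The two-priority push-out discipline described above can be modeled in a few lines. This is a behavioral sketch under our own naming, not the PUSH_CONT circuit: high-priority cells are always served first, and a high-priority arrival at a full buffer evicts the most recently stored low-priority cell.

```python
# Behavioral model of a two-priority output buffer with push-out.
# Illustrative only; class and constant names are ours, not the LSI's.

HIGH, LOW = 0, 1

class PushOutBuffer:
    def __init__(self, capacity):
        self.capacity = capacity
        self.cells = []  # (priority, payload) pairs in arrival order

    def enqueue(self, priority, payload):
        """Store a cell; on overflow, a high-priority cell pushes out
        the low-priority cell that entered last. Returns True if stored."""
        if len(self.cells) < self.capacity:
            self.cells.append((priority, payload))
            return True
        if priority == HIGH:
            for i in range(len(self.cells) - 1, -1, -1):
                if self.cells[i][0] == LOW:
                    del self.cells[i]            # push out newest low cell
                    self.cells.append((HIGH, payload))
                    return True
        return False  # buffer full of high-priority cells, or low arrival

    def dequeue(self):
        """Head-of-line priority: oldest high-priority cell first,
        then oldest low-priority cell; None when empty."""
        for i, (prio, payload) in enumerate(self.cells):
            if prio == HIGH:
                del self.cells[i]
                return payload
        if self.cells:
            return self.cells.pop(0)[1]
        return None
```

For example, with capacity 2, after two low-priority arrivals a high-priority arrival evicts the second low-priority cell, and reads then return the high-priority cell before the surviving low-priority one.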
For instance, if the i-th bit in the multicast routing bits is set to "1," then the cell should be sent to the i-th output port in a large switching system.
TABLE I. Switch LSI specifications.
Fig. 12. Chip micrograph.
In each switch LSI, two bit positions are assigned, since a switch LSI has two output ports. A cell that comes from an input line enters an input buffer, through an address filter, in every switch LSI for which at least one of the two routing bits assigned to that LSI is "1." When a cell enters two or more different switch LSI's, copies of it are independently transmitted to each destined output port. Therefore, HOL blocking due to multicasting increases only when a cell is destined to two output ports in the same switch LSI. The multicasting mechanism using TDXP switching that we implemented and its performance were described in [22]. Note that, since the input traffic load to each switch LSI in our switching system is divided by the number of column switch LSI's, HOL blocking due to multicasting is reduced. The LSI is fabricated using 0.25-μm CMOS/SIMOX technology [23]. An internal-thermal-oxidation SIMOX wafer provides a fully depleted MOSFET. The fully depleted CMOS/SIMOX device achieves a low threshold voltage and low source/drain capacitance. As a result, the unity-gain bandwidth is increased with low power consumption. Fig. 12 is a die micrograph showing the 288-kgate logic and the 209-kb two-port RAM. For 40-Gb/s switch throughput, two chips are needed. Note that the MEMORY under each BUFO block is used in that BUFO block as both an output buffer and a transit buffer, and the MEMORY on the left of the BUFI blocks is used in the BUFI blocks as input buffers. Although the memories are included in the BUFO and BUFI blocks in Fig. 8, they are shown separately in Fig. 12 in order to indicate how much area they occupy.
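The address-filter decision on the multicast bitmap is easy to sketch. The following is an illustrative model under assumed names (the bit numbering and the port-to-LSI assignment are ours): each switch LSI owns two output-port bit positions in the routing bitmap and accepts a cell if either of its bits is set.

```python
# Sketch of the multicast address-filter decision.
# Each LSI handles two output ports of the large switching system and
# copies a cell into its input buffer if either assigned routing bit
# is "1". Bitmap width and port assignment are illustrative assumptions.

NUM_ROUTING_BITS = 64

def accepts(routing_bits, lsi_ports):
    """routing_bits: int bitmap over all system output ports.
    lsi_ports: the two output-port indices assigned to this switch LSI."""
    return any((routing_bits >> p) & 1 for p in lsi_ports)

def copies(routing_bits, all_lsis):
    """Return the LSI's that will independently forward a copy."""
    return [lsi for lsi in all_lsis if accepts(routing_bits, lsi)]

# A cell destined to output ports 1 and 4 of an 8-port system built
# from four 2-output LSI's enters LSI (0,1) and LSI (4,5); each copy
# is then switched to its destination independently.
lsis = [(0, 1), (2, 3), (4, 5), (6, 7)]
cell = (1 << 1) | (1 << 4)
```

Because the two copies land in different LSI's, they contend for outputs independently; only a cell whose two destinations share one LSI adds HOL-blocking pressure, as the text notes.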
The supply voltage is −2 V, and the reference voltage is −1.3 V. The die size is 16.55 mm × 16.55 mm. The power consumption of the switch LSI is only 7 W. This consumption is 36% less than that of bulk CMOS gates (11 W) at the same speed at −2.5 V [23]. The specifications of the switch LSI are summarized in Table I.

V. LSI TESTING FOR MCM ASSEMBLY

The switch LSI's were tested for bare-chip mounting on an MCM substrate. The LSI-testing and MCM-assembly flow is shown in Fig. 13. Switch-LSI functions at 200 MHz were confirmed using an LSI tester. The LSI's were then measured on-wafer at the full required speed of 1.25 Gb/s, as shown in Fig. 14, using a high-speed probe card to provide known good dies (KGD's) for MCM assembly. Because the 1.25-Gb/s transmission between LSI's makes the wiring design of the MCM substrate sensitive, detailed ac performance, such as the input/output timing-interface conditions of the LSI, was measured using an LSI evaluation board at 1.25 Gb/s. The fabricated evaluation board was a built-up laminated printed-wiring board using flip-chip mounting [24]. The output eye diagram measured using the evaluation board is shown in Fig. 15. Based on the results of the evaluation-board tests, the wiring design for the MCM substrate was done. KG substrates were provided after open/short testing. Then the KGD's were assembled on a KG substrate, and the MCM's were tested.

VI. 80-Gb/s ATM SWITCHING MULTICHIP MODULE

An 8 × 8 ATM switching module with 80-Gb/s throughput was fabricated using several switch LSI's. Eight switch LSI's and 32 MCM-interface LSI's were mounted by wire bonding on a 40-layer ceramic substrate, as shown in Fig. 16 [25]. This module is 114 mm × 160 mm × 6.55 mm. The MCM-interface LSI's were fabricated as Si-bipolar devices using the super-self-aligned technology [26], and have serial/parallel and parallel/serial functions to convert the line speeds of the 16 physical lines (625 Mb/s) to those of the eight physical lines (1.25 Gb/s), as shown in Fig. 17.
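The 2:1 rate conversion performed by the MCM-interface LSI's can be illustrated with a toy bit-level model. This is our own sketch, not the circuit design: two 625-Mb/s streams are bit-interleaved onto one 1.25-Gb/s stream, and the inverse split recovers them at the far end.

```python
# Toy model of the 2:1 serialize/deserialize performed by the
# MCM-interface LSI's (16 lines at 625 Mb/s <-> 8 lines at 1.25 Gb/s).
# Bit-interleaving order and function names are illustrative assumptions.

def serialize_2to1(lane_a, lane_b):
    """Interleave two equal-length bit lists into one double-rate list."""
    assert len(lane_a) == len(lane_b)
    out = []
    for a, b in zip(lane_a, lane_b):
        out.extend((a, b))
    return out

def deserialize_1to2(stream):
    """Inverse: split a double-rate list back into two half-rate lanes."""
    return stream[0::2], stream[1::2]

# Round trip: two slow lanes -> one fast line -> two slow lanes.
a, b = [1, 0, 1, 1], [0, 0, 1, 0]
fast = serialize_2to1(a, b)
assert deserialize_1to2(fast) == (a, b)
```

The fast stream carries twice the bits per unit time, which is exactly the 625 Mb/s to 1.25 Gb/s conversion the interface LSI's provide at the module boundary.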
Fig. 13. LSI testing and MCM assembly flow.
Fig. 14. On-wafer measured output waveform at 1.25 Gb/s.
Fig. 15. A 1.25-Gb/s eye diagram obtained using the evaluation board.
Fig. 16. Photograph of the 80-Gb/s ATM switching module.
Fig. 17. Cross section of the MCM.
On the MCM substrate, 1.25-Gb/s signal transmission was achieved using the pseudo-ECL interface circuits described in Fig. 9. Switching modules were interconnected by high-density flexible printed-circuit cables to expand the switching capacity [27].

VII. CONCLUSIONS

This paper presented the design and implementation of a scalable switch that uses a newly developed distributed contention-control technique, called scalable distributed arbitration (SDA), that allows the switch LSI to be expanded. SDA is executed in a distributed manner at each switch LSI, and the arbitration time does not depend on the number of connected switch LSI's. To increase the switch-LSI throughput and reduce the power consumption, we used 0.25-μm CMOS/SIMOX technology. This technology enables us to offer 221 I/O pins with 1.25-Gb/s throughput. In addition, a power consumption of 7 W is achieved by operating the CMOS/SIMOX gates at −2.0 V. Using these switch LSI's, an 8 × 8 switching module with 80-Gb/s throughput based on MCM technology was fabricated with a size of 114 mm × 160 mm × 6.55 mm. The scalability offered by the switch LSI and the module is the key to multimedia Tb/s-class ATM switching systems.

REFERENCES

[1] N. Yamanaka, S. Yasukawa, E. Oki, T. Kurimoto, T. Kawamura, and T. Matsumura, "OPTIMA: Tb/s ATM switching system architecture based on highly statistical optical WDM interconnection," in Proc. ISS'97, 1997, p. IS-02.8.
[2] E. Munter, J. Parker, and P. Kirkby, "A high-capacity ATM switch based on advanced electronic and optical techniques," IEEE Commun. Mag., pp. 64–71, 1995.
[3] H. Ahmadi and W. E. Denzel, "A survey of modern high-performance switching techniques," IEEE J. Select. Areas Commun., vol. 7, no. 7, pp. 1091–1103, 1989.
[4] M. G. Hluchyj and M. J. Karol, "Queueing in high-performance packet switching," IEEE J. Select. Areas Commun., vol. 6, no. 9, pp. 1587–1597, 1988.
[5] Y. Oie, T. Suda, M. Murata, D. Kolson, and H. Miyahara, "Survey of switching techniques in high-speed networks and their performance," in Proc. IEEE INFOCOM'90, pp. 1242–1251.
[6] Y. S. Yeh, M. G. Hluchyj, and A. S. Acampora, "The knockout switch: A simple, modular architecture for high-performance packet switching," in Proc. ISS'87, 1987, vol. B10.2.1, pp. 801–808.
[7] H. J. Chao, B.-S. Choe, J.-S. Park, and N. Uzun, "Design and implementation of Abacus switch: A scalable multicast ATM switch," IEEE J. Select. Areas Commun., vol. 15, no. 5, pp. 830–843, 1997.
[8] I. Iliadis and W. E. Denzel, "Analysis of packet switches with input and output queueing," IEEE Trans. Commun., vol. 41, no. 5, pp. 731–740, 1993.
[9] Y. Oie, M. Murata, K. Kubota, and H. Miyahara, "Effect of speedup in nonblocking packet switch," in Proc. ICC'89, 1989, p. 410.
[10] N. Yamanaka, K. Endo, K. Genda, H. Fukuda, T. Kishimoto, and S. Sasaki, "320 Gb/s high-speed ATM switching system hardware technologies based on copper-polyimide MCM," IEEE Trans. Comp., Packag., Manufact. Technol., vol. 18, pp. 83–91, 1995.
[11] M. J. Karol, M. G. Hluchyj, and S. P. Morgan, "Input versus output queueing on a space-division packet switch," IEEE Trans. Commun., vol. COM-35, pp. 1347–1356, 1987.
[12] M. J. Karol, K. Y. Eng, and H. Obara, "Improving the performance of input-queued ATM packet switches," in Proc. IEEE INFOCOM'92, 1992, pp. 110–115.
[13] N. McKeown, "Scheduling algorithms for input-queued cell switches," Ph.D. dissertation, University of California at Berkeley, 1995.
[14] H. Tomonaga, N. Matsuoka, Y. Kato, and Y.
Watanabe, "High-speed switching module for a large capacity ATM switching system," in Proc. IEEE GLOBECOM'92, 1992, pp. 123–127.
[15] K. Genda, Y. Doi, K. Endo, T. Kawamura, and S. Sasaki, "A 160-Gb/s ATM switching system using an internal speed-up crossbar switch," in Proc. IEEE GLOBECOM'94, 1994, pp. 123–133.
[16] E. Oki, N. Yamanaka, and Y. Ohtomo, "A 10 Gb/s (1.25 Gb/s × 8) 4 × 2 CMOS/SIMOX ATM switch," in Proc. IEEE ISSCC'99, Feb. 1999, pp. 172–173.
[17] E. Oki and N. Yamanaka, "Scalable crosspoint buffering ATM switch architecture using distributed arbitration scheme," in Proc. IEEE ATM'97 Workshop, 1997, pp. 28–35.
[18] A. Cisneros and C. A. Brackett, "A large ATM switch based on memory switches and optical star couplers," IEEE J. Select. Areas Commun., vol. 9, no. 8, pp. 1348–1360, 1991.
[19] Y. Ohtomo, M. Nogawa, and M. Ino, "A 2.6-Gbps/pin SIMOX-CMOS low-voltage-swing interface circuit," IEICE Trans. Electron., vol. E79-C, no. 4, pp. 524–529, 1996.
[20] S. Yasuda, Y. Ohtomo, M. Ino, Y. Kado, and T. Tsuchiya, "3-Gb/s CMOS 1:4 MUX and DEMUX IC's," IEICE Trans. Electron., vol. E78-C, no. 12, pp. 1746–1753, 1995.
[21] Y. Ohtomo, S. Yasuda, M. Nogawa, J. Inoue, K. Yamakoshi, H. Sawada, M. Ino, S. Hino, Y. Sato, Y. Takei, T. Watanabe, and K. Takeya, "A 40-Gb/s 8 × 8 ATM switch LSI using 0.25-μm CMOS/SIMOX," in Proc. IEEE ISSCC'97, 1997, pp. 154–155.
[22] E. Oki and N. Yamanaka, "High-speed tandem-crosspoint ATM switch architecture with input and output buffers," IEICE Trans. Commun., vol. E81-B, no. 2, pp. 215–223, 1998.
[23] M. Ino, "Low-power and high-speed LSI's using 0.25-μm CMOS/SIMOX," IEICE Trans. Electron., vol. E80-C, no. 12, 1997.
[24] Y. Tsukada, "Bare chip packaging technology," in Proc. Advances in Electronic Packaging Conf., 1997, vol. 1, pp. 285–290.
[25] K. Okazaki, N. Sugiura, A. Harada, N. Yamanaka, and E. Oki, "80-Gbit/s MCM-C technologies for high-speed ATM switching systems," in Proc. IMAPS Int. Conf. Exhibition High Density Packaging & MCMs, Apr.
1999, pp. 284–288.
[26] T. Sakai, S. Konaka, Y. Kobayashi, M. Suzuki, and Y. Kawai, "Gigabit logic bipolar technology: Advanced super self-aligned process technology," Electron. Lett., vol. 19, pp. 283–284, 1983.
[27] S. Sasaki, T. Kishimoto, K. Genda, K. Endo, and K. Kaizu, "Multichip module technologies for high-speed ATM switching systems," in Proc. '94 MCM Conf., 1994, pp. 130–135.

Eiji Oki (M'95) received the B.E. and M.E. degrees in instrumentation engineering and the Ph.D. degree in electrical engineering from Keio University, Yokohama, Japan, in 1991, 1993, and 1999, respectively. In 1993, he joined Nippon Telegraph and Telephone Corporation's (NTT's) Communication Switching Laboratories, Tokyo, Japan. He has been researching multimedia-communication network architectures based on ATM techniques and traffic-control methods for ATM networks. He is currently developing high-speed ATM switching systems in NTT Network Service Systems Laboratories as a Research Engineer. Dr. Oki is a member of the IEICE of Japan. He received the Switching System Research Award and the Excellent Paper Award from the IEICE in 1998 and 1999, respectively.

Naoaki Yamanaka (M'85–SM'96) was born in Sendai City, Miyagi Prefecture, Japan, in 1958. He received the B.E., M.E., and Ph.D. degrees in engineering from Keio University, Tokyo, Japan, in 1981, 1983, and 1991, respectively. In 1983, he joined Nippon Telegraph and Telephone Corporation's (NTT's) Communication Switching Laboratories, Tokyo, where he researched and developed high-speed switching systems and high-speed switching technologies, such as ultrahigh-speed switching LSI chips/devices, packaging techniques, and interconnection techniques, for broad-band ISDN services. Since 1989, he has been developing broad-band ISDN items based on ATM techniques. He is now researching ATM-based broad-band ISDN architectures and is engaged in traffic management and performance analysis of ATM networks.
He is currently a Senior Research Engineer, Supervisor, and Research Group Leader in the Broadband Network System Laboratory at NTT. Dr. Yamanaka received the Best of Conference Award at the 40th, 44th, and 48th IEEE Electronic Components and Technology Conferences (1990, 1994, and 1999), the TELECOM System Technology Prize from the Telecommunications Advancement Foundation (1994), the IEEE CPMT Transactions Part B Best Transactions Paper Award (1996), and the Excellent Paper Award from the IEICE (1999). He is the Broadband Network Area Editor of IEEE Communications Surveys, an Editor of the IEICE Transactions, and the IEICE Communication Society International Affairs Director, as well as Secretary of the Asia Pacific Board of the IEEE Communications Society.

Yusuke Ohtomo (M'92) received the B.E., M.E., and Ph.D. degrees in electrical engineering from Keio University, Kanagawa, Japan, in 1983, 1985, and 1998, respectively. Since he joined Nippon Telegraph and Telephone (NTT) Corp., Tokyo, Japan, in 1985, he has been engaged in the research and development of high-speed CMOS/BiCMOS circuit technology, particularly with application to telecommunications LSI's. His current research interests include low-power multigigahertz LSI design using SOI devices. He is now a Senior Research Engineer at the NTT Telecommunications Energy Laboratories, Kanagawa. Dr. Ohtomo is a member of the IEICE of Japan.

Kazuhiko Okazaki was born in Tokyo, Japan, in 1969. He received the B.E. and M.E. degrees in applied physics and chemistry from The University of Electro-Communications, Tokyo, in 1993 and 1995, respectively. In 1995, he joined Nippon Telegraph and Telephone Corporation's Network Service Systems Laboratories, Tokyo. He has been researching high-speed electrical interconnections in a rack system and high-density MCM packaging technologies for high-speed ATM switching systems. Mr. Okazaki is a member of the IEICE of Japan and the IMAPS.
Ryusuke Kawano (M'98) was born in Oita, Japan, on April 2, 1964. He received the B.E. and M.E. degrees from the University of Osaka Prefecture, Osaka, Japan, in 1987 and 1989, respectively. In 1989, he joined Nippon Telegraph and Telephone Corp. (NTT) and began researching and developing process technology for high-speed Si-bipolar devices. Since 1992, he has been engaged in researching and developing high-speed integrated circuits using Si-bipolar transistors and GaAs MESFET's at the NTT LSI Laboratories, Kanagawa, Japan. Since moving to NTT Network Service Systems Laboratories, Tokyo, Japan, his research interests have included very-large-capacity ATM switching hardware such as high-speed logic, optical interconnection, and cooling. Mr. Kawano is a member of the IEICE of Japan.