IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 34, NO. 12, DECEMBER 1999

A 10-Gb/s (1.25 Gb/s × 8) 4 × 2 0.25-μm CMOS/SIMOX ATM Switch
Based on Scalable Distributed Arbitration
Eiji Oki, Member, IEEE, Naoaki Yamanaka, Senior Member, IEEE, Yusuke Ohtomo, Member, IEEE,
Kazuhiko Okazaki, and Ryusuke Kawano, Member, IEEE
Abstract—This paper presents the design and implementation
of a scalable asynchronous transfer mode (ATM) switch. We fabricated a
10-Gb/s 4 × 2 switch large-scale integration (LSI) that uses a new
distributed contention control technique that allows the switch
LSI to be expanded. The developed contention control is executed
in a distributed manner at each switch LSI, and the contention
control time does not depend on the number of connected switch
LSI's. To increase the LSI throughput and reduce the power
consumption, we used 0.25-μm CMOS/SIMOX (separation by
implanted oxygen) technology, which enables us to make 221
pseudo-emitter-coupled-logic I/O pins with 1.25-Gb/s throughput.
In addition, power consumption of 7 W is achieved by operating
the CMOS/SIMOX gates at −2.0 V. This consumption is 36% less
than that of bulk CMOS gates (11 W) at the same speed at −2.5
V. Using these switch LSI's, an 8 × 8 switching multichip module
with 80-Gb/s throughput was fabricated in a compact size.
I. INTRODUCTION

ASYNCHRONOUS transfer mode (ATM) is expected to
lead to multimedia communication networks. The demand
for multimedia services, such as high-speed data communications
and high-definition television broadcasting, will
increase, and ATM switching systems having over 1-Tb/s
throughput must be created [1], [2].
Several switch architectures for achieving a high-performance
ATM switching system have been presented
[3]–[5]. When two or more cells from different input ports
arrive destined for the same output port, queuing
buffers must be arranged in the switch. Switch architectures
are mainly categorized by the position of the buffers into an
output-buffer type, an input/output-buffer type, an input-buffer
type, and a crosspoint-buffer type.
The output-buffer-type switch can provide good statistical
performance. In this architecture, the writing speed at the output
buffers must be as fast as the sum of all the input line speeds.
Therefore, the Knockout switch was presented; it uses an
N-to-L concentrator in order to relax the required writing
speed at the output buffers [6]. This architecture can grow
modularly toward a larger switch. However, cells may be
discarded when the number of cells arriving simultaneously at
the output buffer is larger than L. In the input/output-buffer-
Manuscript received April 9, 1999; revised July 4, 1999.
E. Oki, N. Yamanaka, K. Okazaki, and R. Kawano are with NTT Network
Service Systems Laboratories, Tokyo 180-8585 Japan.
Y. Ohtomo is with NTT Telecommunications Energy Laboratories, Kanagawa 243-01 Japan.
Publisher Item Identifier S 0018-9200(99)08964-7.
type switch, a cell waits in an input buffer to avoid the internal
conflict caused by simultaneous cell arrival, even though the
switch allows up to L cells to be written to an output
buffer during the same cell time [7]–[9]. These output-buffer-type
and input/output-buffer-type switch architectures require
the internal line capacity to be expanded according to the
input/output line speed. This makes it difficult to implement a
large switching system consisting of many very large scale
integrations (VLSI's), because a large number of interconnection
links and/or high-speed links are required to connect the VLSI's
in order to achieve the required switch throughput [10]. This
raises the interconnection cost and leads to a pin bottleneck
in the VLSI's.
Another approach is the input-buffer-type switch architecture. It does not require the internal line speed to be increased.
It is well known that head-of-line (HOL) blocking limits the
maximum throughput to 58% [11]. To improve the throughput
performance, several novel scheduling algorithms have been
proposed [12], [13]. These approaches require a centralized
scheduler that considers requests from all of the input buffers
and determines a new configuration for the crossbar within
one ATM cell time. If the input/output line speed and the
switch size increase, the centralized scheduler may become a
bottleneck in terms of scalability.
Advanced CMOS technologies may enable us to make many
gates and large memories; in that case,
a crosspoint-buffer-type switch architecture is an appropriate
choice. This architecture does not require any increase in the
internal line speed, and it eliminates the HOL blocking that
occurs in the input-buffer-type switch, at the cost of having a
large amount of crosspoint-buffer memory at each crosspoint.
To achieve the required switch throughput, many switch
large-scale integrations (LSI's) must be connected. Let us
consider switch LSI's arranged in a matrix. These switch
LSI's have output buffers. Each switch LSI has crosspoint-buffer
functions using its output buffers as well as a
switching function. It is not necessary for each switch LSI
to have a crosspoint buffer at each crosspoint inside it. To
achieve the switching, we have several choices, such as the
input/output-buffer type, the output-buffer type, or the shared-memory
type, when we implement the switch LSI [14]. In this paper,
we use the input/output-buffer-type approach inside a switch
LSI, considering the implementation. From the viewpoint of
the switching system, this type of switch architecture is called
a crosspoint-LSI-type switch architecture in this paper.

0018–9200/99$10.00 © 1999 IEEE
However, some problems occur. First, the crosspoint-LSI-type
switch experiences a problem when the number of row
LSI’s increases and output lines are fast. As the output-line
speed increases, the ATM cell time decreases. In a switch
having a large number of row LSI’s, ring arbitration among the
LSI’s belonging to the same output port cannot be completed
within the short ATM cell time. Therefore, in conventional
switches based on ring arbitration, the arbitration time limits
the output-line speed according to the number of row switch
LSI's; ring arbitration must be completed within the cell
time. To reduce the time required for the ring arbitration, a
bidirectional arbitration method was proposed [15]. It uses
a bidirectional token bus to replace the ring arbiter. This
bidirectional arbiter enables the speed of ring arbitration
to be up to twice as fast as simple ring arbitration, but
it requires twice as many control signals as simple ring
arbitration. To obtain even faster arbitration than is possible
with bidirectional arbitration, hierarchical arbitration, in other
words, tree arbitration, may be employed. Row switch LSI’s
are divided into some groups. Ring arbitration is executed
among row switch LSI’s within a group, and it is also executed
among the different groups hierarchically. However, these fast
arbitration schemes increase the number of control signals
and hardware complexity. In addition, executing such fast
arbitration within a short cell time requires a strict timing
design when the switch size increases. We consider that
these kinds of centralized contention control schemes are not
scalable as the size of the switching system increases.
Second, to achieve the highest possible throughput of a
switching system in a cost-effective manner, the switch LSI
throughput should be large as well. The number of I/O pins
may become a bottleneck in terms of throughput of the switch
LSI.
Therefore, the first requirement for a switch LSI is a
distributed contention control technique to solve the problem
of conventional centralized contention control. The second
requirement is to use I/O pins with an interface of at least
1 Gb/s to avoid the pin bottleneck. The final requirement is
that power consumption of the switch LSI should be less than
10 W, considering practical deployment in a system.
This paper presents the design and implementation of a scalable crosspoint-LSI-type switch that employs a new distributed
contention control technique, called scalable distributed arbitration (SDA), that allows the switch LSI to be expanded [16].
SDA is executed in a distributed manner at each switch LSI,
and the arbitration time does not depend on the number of
connected switch LSI’s [17]. For higher LSI throughput and
lower power consumption, 0.25- m CMOS/SIMOX (separation by implanted oxygen) technology is used. This technology
enables us to achieve 1.25-Gb/s pseudo-emitter-coupled-logic
(ECL) I/O 221 pins. In addition, power consumption of only 7
W is achieved by operating the CMOS/SIMOX gates at 2.0
8 switching multichip
V. Using these switch LSI’s, an 8
module (MCM) with 80-Gb/s throughput was fabricated with
a compact size.
The remainder of this paper is organized as follows.
Section II explains the problems of the conventional switch
architecture based on ring arbitration. Section III presents
the switch architecture based on SDA and its performance.
Section IV describes our developed switch LSI and presents
some results. Section V describes the LSI testing for MCM
assembly. Section VI describes the 80-Gb/s-throughput
switching module. Last, Section VII summarizes the key
points.
II. CONVENTIONAL SWITCH ARCHITECTURE
In the crosspoint-LSI-type switch architecture (shown in
Fig. 1), an ATM cell from an input line is dropped into an
output buffer attached to the destined output line through an
input buffer and a switching function. A large-scale
switch consists of many switch LSI's, and it has multiple input
ports and output ports. Here, we assume that the switch
LSI's employ an input/output-buffer-type switch architecture.
An output line is a bus that is accessed by all output buffers
of the same row switch LSI’s. The dropped ATM cell is
stored in the output buffer until it is injected into the output
line. A conventional crosspoint-LSI-type switch architecture
uses ring arbitration among switch LSI’s to avoid output-busaccess contention, as shown in Fig. 1 [15]. As described in
Section I, other centralized arbitration schemes are possible,
but, to simplify our discussion of the problem of the centralized
arbitration, we describe the simple ring arbitration here.
switch LSI’s have
output buffers, which
The
function as crosspoint buffers in the switching system. These
switch LSI’s are arranged in a matrix. Contention occurs when
ATM cells from different switch LSI’s request transmission to
the same output line at the same cell time. In the conventional
switch, the ring arbiter searches, from some starting point, for
an output buffer that has made a request to transfer a cell to the
output line. The starting point is just below the output buffer
from which a cell was sent to the output line at the previous
cell time. If the ring arbiter finds such a request, the cell at
the head of the output buffer is selected for transmission. At
the next cell time, the starting point is reset to just below the
selected output buffer. Thus, in the worst case, the control
signal for ring arbitration must pass through all switch LSI’s
belonging to the same output line within the ATM cell time.
For that reason, the maximum output-line speed of the
conventional switch is limited by the number of row switch
LSI's and by the transmission delay of the control signals in
each switch LSI.
The maximum output-line speed C_max [b/s] is given by the
following equation:

    C_max = a × L_cell / (N × T_d)    (1)

where N is the number of row switch LSI's, T_d is the transmission
delay of the control signals in a switch LSI, and L_cell is the length
of an ATM cell in bits. T_d depends on the performance of the
devices and the distance between switch LSI's. a is a factor
that depends on how the centralized arbitration is implemented.
Here, a is set to one since simple ring arbitration is assumed.
If bidirectional arbitration were used, a would be set to two.
Fig. 2 shows the relationship between C_max and N for
different T_d values in the conventional switch.
Fig. 1. Conventional arbitration among switch LSI’s.
L_cell is set to 53 × 8 bits. As N increases, C_max
decreases. For example, for one of the plotted combinations of
T_d and N, C_max is 8.8 Gb/s. Thus, since the
conventional switch uses ring arbitration, the arbitration time
limits the output-line speed according to the number of row
switch LSI's, to ensure that ring arbitration can be completed
within the ATM cell time. As a result, unless T_d is made small
by using ultra-high-speed devices, the conventional switch
cannot achieve large throughput.
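The constraint in (1) can be checked numerically. The sketch below is ours, not the authors'; the symbol names and the example values of N and T_d are illustrative assumptions, chosen only so that the result lands near the 8.8-Gb/s figure quoted above.

```python
def max_line_speed_bps(l_cell_bits, n_row, t_d_sec, a=1):
    """Eq. (1): the arbitration signal must visit all n_row LSI's
    (sped up by factor a) within one cell time l_cell / C, so
    C_max = a * l_cell / (n_row * t_d)."""
    return a * l_cell_bits / (n_row * t_d_sec)

l_cell = 53 * 8  # one ATM cell in bits, as in Fig. 2

# Illustrative values (not taken from the paper): 16 row LSI's and a
# 3-ns control-signal delay per LSI give roughly 8.8 Gb/s.
c_ring = max_line_speed_bps(l_cell, 16, 3e-9)        # simple ring, a = 1
c_bidir = max_line_speed_bps(l_cell, 16, 3e-9, a=2)  # bidirectional, a = 2
```

Doubling a, as bidirectional arbitration does, doubles the attainable line speed but also doubles the control signals, which is exactly the trade-off noted in Section I.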
Note that when the centralized contention controller for all
row switch LSI's is located in a different place and pipelined
control is executed, N may not affect the required arbitration
time very much [18]. However, as N increases, the centralized
contention controller needs to be expanded. In addition, for
this bus-access transmission system, strict timing design for
transmitting cells and control signals will be needed, and this will
be difficult in a large-scale switching system. Considering the
scalability of the switching system, we think that it is better to
use a distributed control approach rather than a centralized one.
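For reference, the round-robin search performed by the ring arbiter described above can be sketched as follows; the function name and the list representation of requests are ours.

```python
def ring_arbitrate(requests, start):
    """Scan the output buffers cyclically, beginning at `start` (the
    position just below the previous winner), and grant the first one
    that requests the output line. Returns None if nobody requests."""
    n = len(requests)
    for offset in range(n):
        i = (start + offset) % n
        if requests[i]:
            return i
    return None

# Buffers 2 and 5 request; the previous winner was buffer 2, so the
# search starts just below it, at index 3, and buffer 5 wins.
winner = ring_arbitrate([False, False, True, False, False, True], start=3)
next_start = (winner + 1) % 6  # starting point for the next cell time
```

In hardware this search is a token passing through every row LSI, which is why, in the worst case, the control signal traverses all N LSI's within one cell time.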
III. SCALABLE-DISTRIBUTED-ARBITRATION
(SDA) SWITCH ARCHITECTURE
A. Structure
This section describes a high-speed crosspoint-LSI-type
switch based on distributed contention control, called the
SDA switch. Fig. 3 illustrates its structure. There is an ATM
output buffer, an ATM transit buffer, an arbitration-control
part (CNTL), and a selector at every output port in the switch
LSI, but for simplicity, only one output port per switch LSI is
Fig. 2. Maximum output-link speed in ring arbitration.
shown. Note that the ATM output buffer of the switch LSI can
be regarded as a crosspoint buffer in the switching system.
The SDA mechanism is as follows.
1) An ATM output buffer sends a request (REQ) to CNTL
if there is at least one cell stored in the output buffer.
An ATM transit buffer stores several cells that are sent
from either the output buffer of the upper LSI or the
transit buffer of the upper LSI. The transit buffer size
may be one or a few cells. Like the output buffer, the
transit buffer sends REQ to CNTL if there is at least
one cell stored in it.
2) If the transit buffer is about to become full, it sends
not-acknowledgment (NACK) to the upper CNTL.
3) If there are any REQ’s, and CNTL does not receive
NACK from the next lower transit buffer, then CNTL
selects a cell within one cell time. CNTL determines
which cell should be sent according to the following
cell selection rule. The selected cell is sent through a
selector to the next lower transit buffer or the output
line.
4) The cell selection rule is as follows. If either the output
buffer or the transit buffer makes a request for cell
release, the cell in the requesting buffer is selected. If
both the output buffer and the transit buffer request cell
release, the cell with the larger delay time is selected.
The delay time is defined as the time elapsed since the
cell entered the output buffer.
To compare the delay times of competing cells, we
use a synchronous counter of m bits, and
we also use the same number of overhead bits in each
cell. The synchronous counter is incremented by one
every g cell times, where g is a parameter representing
the granularity for measuring the delay time. When g = 1,
the delay time is measured with the greatest
accuracy. When a cell enters an output buffer, the value
of the synchronous counter is written in the overhead
of the cell. When both an output buffer and a transit
buffer issue requests for cell release, the values of their
counters are compared. If the difference in values is less
Fig. 3. SDA technique among switch LSI’s.
than 2^(m-1), the cell with the smaller value is selected.
Conversely, if the difference is equal to or more than
2^(m-1), the cell with the larger value is selected. This
delay-time comparison works when the maximum delay
time is less than 2^(m-1) × g [cell times]. We thus set the values of m and g
to satisfy this relationship.
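The wraparound comparison in this first selection rule can be sketched as follows. The sketch is ours; the counter width m = 8 is an assumption, consistent with the 129-to-255 example given in Section III-B.

```python
M = 8                # counter width in bits (assumed for this sketch)
HALF = 1 << (M - 1)  # 2**(M-1) = 128, the comparison threshold

def select_older(v_out, v_transit):
    """First cell selection rule: pick the cell with the larger delay,
    i.e. the older timestamp under modulo-2**M counting.
    Returns 'output', 'transit', or 'tie' (second rule takes over)."""
    if v_out == v_transit:
        return 'tie'
    lo, hi = sorted((v_out, v_transit))
    # Difference < 2**(M-1): no wraparound, so the smaller value is older.
    # Difference >= 2**(M-1): the counter wrapped, so the larger is older.
    older = lo if hi - lo < HALF else hi
    return 'output' if older == v_out else 'transit'
```

With timestamps 0 and 200, for example, the difference (200) is at least 128, so the larger value is judged older, reproducing the wraparound case described above.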
5) When the delay time of the cell in the output buffer
equals that of the cell in the transit buffer, CNTL determines which
cell should be sent using a second cell selection rule.
When g is large, the probability that the second cell
selection rule is used is large. Let us consider the i-th
switch LSI and transit buffer, counting from the top. The
second rule is that the i-th output buffer is selected with
probability 1/i, while the i-th transit buffer is selected
with probability (i-1)/i. For example, the third
output buffer and the third transit buffer are selected with
probabilities of 1/3 and 2/3, respectively.
According to the second cell selection rule, the cell
that enters the i-th output buffer goes to the output line
with a total probability given by

    (1/i) × (i/(i+1)) × ((i+1)/(i+2)) × ... × ((N-1)/N) = 1/N.    (2)

Here, the first term on the left side of (2) is the
probability that the cell in the output buffer of the i-th
LSI is selected, the second term is the probability that the
cell in the transit buffer of the (i+1)-th LSI is selected, and
the final term is the probability that the cell in the transit
buffer of the N-th LSI is selected. The total probability
that a cell from any output buffer is selected for delivery
to an output line is thus a constant value, 1/N. Therefore, the
fairness of the selection probability is kept by using the
second selection rule, even when the delay time of the
cell in the output buffer equals that in the transit buffer.
In the implementation of the second cell selection
rule, to avoid random-variable generation, we employed
the following simple cell selection mechanism. For each
output port, the i-th switch LSI has a counter that counts
up cyclically from zero to i-1 whenever contention
occurs between the output buffer and the transit buffer
at that output port. When the counter value is zero, the
output buffer is selected; otherwise, the transit buffer is
selected. This mechanism achieves cell selection with
the specified probability using simple hardware, but it
is not a completely randomly weighted cell-selection
mechanism.
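The fairness claim in (2) can be verified directly: the selection probabilities telescope to the same constant for every position. The sketch below is ours; exact fractions are used to avoid floating-point noise.

```python
from fractions import Fraction

def delivery_probability(i, n):
    """Probability that a cell entering the i-th output buffer (counted
    from the top of n row LSI's) reaches the output line: it is chosen
    locally with probability 1/i, then at each lower LSI j the transit
    buffer carrying it is chosen with probability (j-1)/j."""
    p = Fraction(1, i)
    for j in range(i + 1, n + 1):
        p *= Fraction(j - 1, j)
    return p

n = 8
probs = [delivery_probability(i, n) for i in range(1, n + 1)]
# The product telescopes, so every position gets exactly 1/n.
```

The cyclic counter described above realizes the local 1/i weight without random numbers: it advances by one per contention and selects the output buffer only at zero, i.e. on one contention in every i.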
Thus the SDA switch achieves distributed arbitration at each
switch LSI. The longest control signal transmission distance
for arbitration within one cell time is obviously the distance
between two adjacent switch LSI’s. In the conventional switch,
the control signal for ring arbitration must pass through all
LSI’s belonging to the same output line. For that reason, the
arbitration time of the SDA switch does not depend on the
number of switch LSI’s.
Here, we compare the SDA switch with the Knockout
switch [6] and the Abacus switch [7]. To increase the switch
throughput, the Knockout switch can grow modularly. It uses
N-to-L concentrators to connect switch modules on a
matrix plane. Since cells may be discarded at the N-to-L
concentrators in each row module, the cell loss probability for
a cell that comes from the top module may be higher than that
of a bottom cell, because the former cell has to transit many
concentrators. In the SDA switch, cell loss never occurs on the
way to an output line once a cell is transmitted from an
output buffer in a switch LSI to the lower switch LSI, because
of the use of the NACK signal. In the SDA switch, the cell
loss probability is determined by the output buffer size.
In addition, the Knockout switch needs more interconnection
links between modules than the SDA switch does. From the
viewpoint of the system, the SDA switch does not require
the internal link capacity to be expanded. This is because the
Knockout switch arranges buffers only in the bottom module
while the SDA switch arranges them in every switch LSI.
The Abacus switch is an input/output-buffer-type switch,
and it also requires the internal link capacity to be expanded
to eliminate HOL blocking. The Abacus switch allows
distributed contention control of each small switch fabric by
Fig. 4. Probability of delay time.
grouping output ports, and it supports multicasting of traffic.
It has buffers only at the input and output ports. Cell
selection is performed in a distributed manner using its switch
fabric and all input modules to avoid the speed constraint.
This distributed contention technique is very useful, but the
timing requirement of routing cells and resolving contention
should be carefully considered, as is described in [7], when
a large-scale switching system is developed. On the other
hand, in our SDA switch, the arbitration is executed in a
distributed manner, the control signals are transmitted only
between adjacent switch LSI’s, and it does not use a bus line
as an output line. Therefore, the timing limitation in terms
of transmission of cells and control signals is relaxed even
if the switching system is expanded. This is an advantage of
the SDA.
B. Performance of SDA
SDA performance was evaluated in terms of delay time
and output buffer size by computer simulation. For simplicity,
we assume that input traffic to the output buffer at each LSI
is random, the input load is 0.95, and cells are distributed
uniformly to all output buffers belonging to the same input
line.
Note that, in this subsection, in order to clarify the SDA
performance, we focus on an output buffer, a transit buffer, and
the SDA control part in a switch LSI. Our switch LSI employs
an input/output-buffer memory-type switch architecture inside
the switching function. The switching performance before a
cell enters an output buffer will be described in the next
section.
The SDA switch ensures SDA-switch delay time fairness.
Fig. 4 shows the probability of the SDA-switch delay time's
being larger than a certain time. The probability is
shown for each output buffer that cells enter. The SDA-switch
delay time is defined as the time from when the cell enters
the output buffer until it reaches the output line. This delay
Fig. 5. Dependence of 99.99% delay time on number of row switch LSI's N.
Fig. 6. Dependence of average delay time on number of row switch LSI's N.
definition is used as a measure of the SDA performance, and
it is different from the delay time used in the first selection
rule described in Section III-A. That is why we call this delay
time the SDA-switch delay time. However, when we describe
the SDA performance, we use "delay time" instead of "SDA-switch
delay time" to simplify the description in the following.
In the SDA switch, when the delay time is more than about ten [cell times],
all delay times have basically the same probability, so delay-time
fairness is achieved. (Fairness is not maintained at shorter
values because it takes at least N [cell times] for the cell
in the top output buffer to reach the output line.)
In addition, when the delay time is larger than a certain time, the
probability of the SDA-switch delay time's being larger than that time
is smaller than for the conventional switch, as shown in
Fig. 4. This is because, in the SDA switch, the cell with the
largest delay time is selected.
This delay reduction effect becomes clearer as N increases,
until N reaches a certain value. Fig. 5 shows that the 99.99%
delay time of the SDA does not change very much even when
N increases, as long as N is less than about 64, while that of the
conventional switch increases rapidly. However, when N is
larger than 64, the 99.99% delay time of the SDA increases
linearly. The reason is as follows. A cell has to wait for at least
one cell time at each transit buffer. When N becomes large,
this one-cell-time waiting effect of the transit buffers appears.
Then, when N is larger than about 210, the delay of the
SDA becomes larger than that of the conventional switch. The
crossover point on the 99.99% delay time is the
point where the throughput of the switching system is 840
Gb/s if we use our 4 × 2 switch LSI with 10-Gb/s line speed.
Here, we show an example of setting the values of m and g.
When the maximum delay time, i.e., the worst delay time, is
smaller than 2^(m-1) = 128 [cell times], the synchronous counter
has a size of just m = 8 bits even if it is incremented every cell
time; in other words, g = 1. Consider two competing cells,
where, for example, one cell's synchronous counter value is between
129 and 255 and the other's is zero. In this case, the difference
between the two values is larger than 2^(m-1), so the cell with the
larger value (i.e., the former cell) is selected. If g > 1 is
used, then m is reduced. We note that the values of m and g
should be determined so that the maximum delay time is
not larger than 2^(m-1) × g. If this condition is satisfied, the
SDA mechanism works well, as explained in Section III-A.
Fig. 6 shows that the average delay time of the SDA for
all input traffic increases linearly with N, while that of the
conventional switch does not increase. This is mainly due to
the one-cell-time waiting effect of the transit buffers, as mentioned
in connection with Fig. 5. Even though this effect increases
the average delay of the SDA, the SDA offers the advantage of
distributed contention control.
The required output buffer size of the SDA switch is smaller
than that of the conventional switch, as shown in Fig. 7. The
required buffer sizes were estimated so as to guarantee the
required cell loss ratio. Implementing the SDA mechanism
requires additional hardware compared with the conventional
switch, such as a transit buffer, a synchronous counter, and
control parts. The amount of hardware required is discussed
in Section IV. In the SDA switch, since the required buffer
sizes depend on the position of the switch LSI's, Fig. 7 shows
the smallest (the output buffer of the top switch LSI) and
the largest (the output buffer of the bottom switch LSI) sizes.
The sizes of the output buffers of intermediate switch LSI's
lie between these two values. Because the SDA switch has a
shorter delay time, as explained earlier, the queue length of the
output buffer is also reduced. This is why the output buffer
size of the SDA switch is less than that of the conventional
switch.
IV. SWITCH LSI DESIGN
Fig. 8(a) shows a block diagram of the switch LSI. The
switch LSI has a 4 × 2 switching function with input and output
ATM link speeds of 10 Gb/s. It also has additional output ports
(expanded output ports) to reduce input signal fan-out. All
Fig. 7. Required output buffer size of switch LSI’s.
input signals are transmitted via the switch LSI’s in a pipelined
manner. A 10-Gb/s ATM link is achieved by eight single-ended
I/O’s at 1.25 Gb/s with an f/2 clock and frame signal, which
are differential I/O’s. The 10-Gb/s ATM cells stream along
the arrows. The 1.25-Gb/s pseudo-ECL interface in the chip is
constructed with CMOS low-voltage-swing active-pullup I/O
circuits, six 8 : 64 demultiplexers (DEMUX’s), and six 64 : 8
multiplexers (MUX's). The two-edge-trigger MUX/DEMUX
circuit uses an f/2 input clock [19]–[21]. Differential pseudo-ECL
interface circuits, as shown in Fig. 9, are used to generate
the f/2 clock to increase the noise margin of the two-edge-trigger
DEMUX. The basic idea of the active-pullup circuits
is to pull the output up within a few hundred picoseconds
and thus shorten the rise time of the open-drain-type output
circuits. The 10-Gb/s ATM cell stream is widened to 64 bits
at the internal clock speed of 156 MHz.
The input/output-buffer-memory-type switch architecture is
used in the switch LSI. BUFI has a 16-cell ATM input buffer at
each input port, and BUFO has a 128-cell ATM output buffer
and a 16-cell transit buffer at each output port. The ATM
cell size is 64 bytes. These buffer memories are sized
so as to hold the cell loss ratio to the required level under an offered
load of 0.95. SDA is implemented in BUFO, as shown in
Fig. 8(b). SDA relaxes the operating speed for arbitration,
compared to ring arbitration. The SDA operation is executed
only between two adjacent LSI’s within the ATM cell time
of 51.2 ns (equal to eight internal-clock cycles) in a pipelined
manner. Since SDA uses cell selection control based on ATM
cell arrival time, the required ATM output buffer amount is
reduced by approximately 25% compared with a switch LSI
using conventional ring arbitration. This is another feature of
SDA. The reduction in amount of memory was 44 kb for
two output buffers in a switch LSI. The amount of extra
hardware required for implementing the SDA mechanism in
the switch LSI was 15 kgates and a 16-kb 2-port RAM,
which includes transit buffers, synchronous counters, control
parts, and so on. Although this amount of extra hardware is
necessary, we believe that it does not have much impact on the
implementation of SDA using our advanced CMOS devices,
considering the reduction in output buffer size, compared with
the conventional switch architecture.
A 4 × 2 switching function is achieved using the tandem-crosspoint (TDXP) switching technique to effectively eliminate ATM cell blocking in the switch LSI, as shown in Fig. 10.
Since the TDXP switching mechanism and its performance
were described in [22], we only briefly summarize them here.
The switch consists of multiple crossbar switch planes, which
are connected in tandem at every crosspoint. Even if a cell
cannot be transmitted to an output port on the first plane,
it has a chance of being transmitted on the next plane. Cell
transmission is executed on each switch plane in a pipelined
manner. Therefore, more than one cell can be transmitted to
the same output port within one cell time slot, and the internal
line speed of each switch is equal to the input/output line
speed. The arbitration control on each switch plane is executed
independently of that of the other planes. We employed ring
arbitration on each switch plane in the switch LSI.
This switch architecture has several advantages in implementation. First, the switch LSI uses the TDXP switch
architecture, which allows input and output lines to operate at the same speed. On the other hand, a conventional
input/output-buffer-memory-type switch requires the internal
speed to be increased [15]. It is very difficult to make internal
lines that are sufficiently fast; we would have to use ultra-high-speed
devices such as GaAs or Si bipolar VLSI's. Second, the
total speed for writing an output buffer memory is 20 Gb/s,
which is expanded to 128 bits in parallel at 156 MHz, while a
shared-memory-type switch requires 40 Gb/s. TDXP switching
relaxes the memory-writing speed while avoiding ATM cell
blocking. Third, since the TDXP switch employs a simple cell
reading algorithm at the input buffer in order to retain the
cell sequence, TDXP does not require the cell sequences to be
rebuilt at the output buffers, unlike the parallel switch. These
advantages make it easy to implement the high-speed ATM
switch [22]. The tandem crosspoints are implemented in the
TDXPSW function block. Each TDXPSW has a 1-cell ATM
buffer.
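As a rough functional sketch of this behavior (ours, not the authors' implementation; the real LSI pipelines the planes and uses the 1-cell TDXPSW buffers described above), consider one cell time with two tandem planes:

```python
def tdxp_slot(cells, n_planes=2):
    """One cell time of tandem-crosspoint switching, simplified.
    `cells` is a list of (input_port, output_port) requests. Each
    crossbar plane delivers at most one cell per output port; a cell
    blocked on one plane is handed to the next plane in tandem."""
    delivered, pending = [], list(cells)
    for _ in range(n_planes):
        taken, blocked = set(), []
        for cell in pending:
            out = cell[1]
            if out in taken:
                blocked.append(cell)   # retry on the next plane
            else:
                taken.add(out)
                delivered.append(cell)
        pending = blocked
    return delivered, pending

# Two cells contend for output 0: the second is blocked on plane 1
# but gets through on plane 2, so both arrive in the same cell time.
done, left = tdxp_slot([(0, 0), (1, 0)])
```

This illustrates why more than one cell can reach the same output port within one cell time slot without raising the internal line speed.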
Fig. 11 shows the average delay for the 4 × 2 switch LSI.
The average delay for TDXP switching is defined as the time
from when a cell enters an input buffer until it enters an output
buffer. Note that we focus on the performance
of the TDXP switching mechanism and do not consider the
queuing delay time at the output buffer, which is related to the
SDA mechanism. To expand the switching system throughput,
these switch LSI's are arranged in a matrix plane, and the
switching system has a correspondingly larger switching function.
As the number of column switch LSI's increases, in
other words, as the switching system size increases, the average
delay time approaches two [cell times]. We assume that cells are distributed
uniformly to all destinations from the same input line and
that the total input traffic load is 0.95. The input traffic load
to each switch LSI is thus divided by the number of column
switch LSI's. Our switching system
employs a crosspoint-buffer-type switch architecture from the
system-level viewpoint, but an input/output-buffer-type switch
architecture from the switch-LSI-level viewpoint. In addition,
(a)
(b)
Fig. 8. Functional block diagram of switch LSI.
we assume that a cell has to wait at least one cell time in both
an input buffer and a tandem crosspoint. Therefore, the average
delay time is close to two. If output-buffer-type crossbar
switching is employed in a switch LSI, the average switching
delay (as we defined it here) is zero. We can observe that
the difference in the average delay between TDXP switching
and output-buffer-type switching is only a few cell times.
Therefore, the average delay of TDXP switching in the
switch LSI is very small.
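The delay bound argued above can be restated as a tiny numeric sketch (an illustration of the reasoning only; the function names and the example column count are our assumptions, not part of the LSI design):

```python
def per_lsi_load(total_load: float, n_columns: int) -> float:
    """Input traffic load seen by one switch LSI when the total load
    is spread uniformly over n_columns column switch LSI's."""
    return total_load / n_columns

def min_tdxp_delay_cells() -> int:
    """Lower bound on the TDXP delay inside one switch LSI: at least
    one cell time in the input buffer plus one in the tandem
    crosspoint, hence two cell times in total."""
    return 1 + 1

# Example: a total load of 0.95 spread over eight columns leaves each
# LSI lightly loaded, so contention waits vanish and the average delay
# approaches the two-cell-time floor.
print(per_lsi_load(0.95, 8))   # 0.11875
print(min_tdxp_delay_cells())  # 2
```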
We note that, since the input traffic load that enters the
switch LSI decreases when the switching system size increases
and the internal transmission capacity in the switch LSI is
twice the outside transmission capacity of the switch LSI, the
HOL blocking probability at an input buffer in the switch
LSI is much lower than that in an input-buffer-type switching
system [11], [12]. The performance of the TDXP switching
mechanism and that of the input-buffer-type switch [11] were
compared in [22].
Fig. 9. Differential pseudo-ECL interface.
Fig. 10. Tandem-crosspoint switching mechanism in switch LSI.
Fig. 11. Average delay time for TDXP switching in switch LSI.
Other features of the switch LSI are support for two
priority classes and multicasting, both of which are needed
for multimedia services. The ATM output buffer in the BUFO
block handles high-priority cells for guaranteed services and
low-priority cells for best-effort service. The head-of-line
priority discipline and push-out mechanism are used. A low-priority cell is transmitted only when there are no high-priority
cells in the ATM output buffer. In addition, when the ATM
output buffer is full and a high-priority cell enters, the stored
low-priority cells that entered the output buffer last are pushed
out of the buffer. The 10-Gb/s ATM cell stream is handled
within one ATM cell time of 51.2 ns at the PUSH_CONT
block, as shown in Fig. 8(b).
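The two-priority discipline and push-out mechanism just described can be sketched in software (a minimal behavioral model, assuming a small illustrative capacity; it is not the PUSH_CONT hardware logic):

```python
from collections import deque

class PushOutBuffer:
    """Two-priority output buffer with head-of-line priority and
    push-out: high-priority cells are always served first, and a
    high-priority arrival at a full buffer pushes out the most
    recently stored low-priority cell."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.high = deque()  # FIFO of high-priority cells
        self.low = deque()   # FIFO of low-priority cells

    def enqueue(self, cell, high_priority: bool) -> bool:
        if len(self.high) + len(self.low) < self.capacity:
            (self.high if high_priority else self.low).append(cell)
            return True
        # Buffer full: a high-priority arrival pushes out the
        # low-priority cell that entered the buffer last.
        if high_priority and self.low:
            self.low.pop()
            self.high.append(cell)
            return True
        return False  # arriving cell is lost

    def dequeue(self):
        # A low-priority cell is sent only when no high-priority
        # cells are waiting (head-of-line priority discipline).
        if self.high:
            return self.high.popleft()
        if self.low:
            return self.low.popleft()
        return None
```

With a capacity-two buffer, enqueueing two low-priority cells and then one high-priority cell pushes out the newest low-priority cell, and dequeueing returns the high-priority cell first.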
To support multicasting, 64 routing bits are used. The switch
LSI has two modes: a unicasting mode and a multicasting
mode. For the unicasting mode, a binary notation is used in the
routing bit. For the multicasting mode, a multicast pattern is a
bit map of all of the output ports in a large switching system.
Each bit indicates if the cell is to be sent to the destined output
port. For instance, if the ith bit in the multicast routing bits is set to “1,” then the cell should be sent to the ith output port in
TABLE I
SWITCH LSI SPECIFICATIONS
Fig. 12. Chip micrograph.
a large switching system. In each switch LSI, two specified bit
positions are set since a switch LSI has two output ports. A cell
that comes from an input line enters an input buffer in some
switch LSI’s through an address filter, if at least one of the two
routing bits that are specified in each switch LSI is “1.” When
a cell enters two or more different switch LSI’s, copies of
it are independently transmitted to each destined output port.
Therefore, HOL blocking due to multicasting increases only
when a cell is destined to two output ports in the same switch
LSI. The multicasting mechanism using the TDXP switching
that we implemented and its performance were described in
[22]. Note that since the input traffic load to each switch LSI is divided by the number of column switch LSI's in our switching system, HOL blocking due to multicasting is reduced.
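The routing-bit filtering in the multicasting mode can be illustrated with a short sketch (a behavioral model only; the function name and the integer bit-map representation are our assumptions):

```python
def accepts_cell(multicast_bits: int, port_a: int, port_b: int) -> list:
    """Address filter of one switch LSI in multicasting mode.

    multicast_bits: 64-bit map of all output ports in the large
    switching system; bit i is "1" if the cell is destined to
    output port i.
    port_a, port_b: the two output-port bit positions assigned to
    this switch LSI (each switch LSI has two output ports).
    Returns the subset of this LSI's ports the cell is copied to;
    an empty list means the cell does not enter this LSI.
    """
    return [port for port in (port_a, port_b)
            if (multicast_bits >> port) & 1]
```

A pattern that selects output ports belonging to two different switch LSI's produces independent copies; HOL blocking from multicasting arises only when both selected ports fall in the same LSI, i.e., when the filter returns both of its bit positions.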
The LSI is fabricated using 0.25-μm CMOS/SIMOX
technology [23]. An internal thermal oxidation SIMOX
wafer provides a fully depleted MOSFET. The fully depleted
CMOS/SIMOX device achieves a low threshold voltage and
low source/drain capacitance. As a result, the unity-gain
bandwidth is increased with low power consumption.
Fig. 12 is a die micrograph showing the 288 kgates of logic and the 209-kb two-port RAM. For 40-Gb/s switch throughput, two chips
are needed. Note that each MEMORY under the BUFO block
is used in each BUFO block as both an output buffer and
a transit buffer, and the MEMORY on the left of the BUFI
blocks is used in BUFI blocks as input buffers. Although
the memories are included in the BUFO and BUFI blocks in
Fig. 8, they are shown separately in Fig. 12 in order to indicate
how much area they occupy. The supply voltage is −2 V, and the reference voltage is −1.3 V. The die size is 16.55 × 16.55 mm². The power consumption of the switch LSI is only 7 W. This consumption is 36% less than that of bulk CMOS gates (11 W) at the same speed at −2.5 V [23]. The specifications of the switch LSI are summarized in Table I.
V. LSI TESTING FOR MCM ASSEMBLY
The switch LSI’s were tested for bare-chip mounting in an
MCM substrate. The LSI testing and MCM assembly flow
are shown in Fig. 13. Switch LSI functions at 200 MHz were
confirmed using an LSI tester. The LSI’s were measured on
wafer at the full required speed of 1.25 Gb/s as shown in
Fig. 14 using a high-speed probe card to provide known good
dies (KGD’s) for MCM assembly.
Because the 1.25-Gb/s transmission between LSI's makes the wiring design for the MCM substrate sensitive, detailed ac performance, such as the input/output timing interface conditions of the LSI, was measured using an LSI evaluation board at the speed of 1.25 Gb/s. The fabricated evaluation board was a built-up laminated printed-wiring board using flip-chip mounting [24].
The output eye diagram measured using the evaluation board is
shown in Fig. 15. Based on the results of the evaluation-board
tests, the wiring design for the MCM substrate was done. KG
substrates were provided after open/short testing. Then KGD’s
were assembled on a KG substrate, and MCM’s were tested.
VI. 80-Gb/s ATM SWITCHING MULTICHIP MODULE
An 8 × 8 ATM switching module with 80-Gb/s throughput
was fabricated using several switch LSI’s. Eight switch LSI’s
and 32 MCM-interface LSI’s were mounted by wirebonding
on a 40-layer ceramic substrate as shown in Fig. 16 [25].
This module is 114 mm × 160 mm × 6.55 mm. The MCM-interface LSI's were fabricated as Si-bipolar devices using the
super-self-aligned technology [26], and have serial/parallel and
parallel/serial functions to convert the line speeds of the 16
physical lines (625 Mb/s) to those of the eight physical lines
(1.25 Gb/s), as shown in Fig. 17.
Fig. 13. LSI testing and MCM assembly flow.
Fig. 14. On-wafer measured output waveform at 1.25 Gb/s.
Fig. 15. A 1.25-Gb/s eye diagram obtained using evaluation board.
Fig. 16. Photograph of 80-Gb/s ATM switching module.
Fig. 17. Cross section of MCM.
On the MCM substrate, 1.25-Gb/s signal transmission was achieved using the pseudo-ECL
interface circuits described in Fig. 9. Switching modules were
interconnected by high-density flexible printed circuit cables
to expand the switching capacity [27].
VII. CONCLUSIONS
This paper presented the design and implementation of
a scalable switch that uses a newly developed distributed
contention control technique, called scalable distributed arbitration (SDA), that allows the switch LSI to be expanded. SDA is executed in a distributed manner at each switch LSI, and the arbitration time does not depend on the number of connected switch LSI's. To increase the switch LSI throughput and reduce the power consumption, we used 0.25-μm
CMOS/SIMOX technology. This technology enables us to
offer 221 I/O pins with 1.25-Gb/s throughput. In addition,
power consumption of 7 W is achieved by operating the
CMOS/SIMOX gates at −2.0 V. Using these switch LSI's, an 8 × 8 switching module with 80-Gb/s throughput based on MCM technology was fabricated with a size of 114 mm × 160 mm × 6.55 mm. The scalability offered by the switch
LSI and the module is the key to multimedia Tb/s-class ATM
switching systems.
REFERENCES
[1] N. Yamanaka, S. Yasukawa, E. Oki, T. Kurimoto, T. Kawamura, and T.
Matsumura, “OPTIMA: Tb/s ATM switching system architecture based
on highly statistical optical WDM interconnection,” in Proc. ISS’97,
1997, p. IS-02.8.
[2] E. Munter, J. Parker, and P. Kirkby, “A high-capacity ATM switch based
on advanced electronic and optical techniques,” IEEE Commun. Mag.,
pp. 64–71, 1995.
[3] H. Ahmadi and W. E. Denzel, “A survey of modern high-performance
switching techniques,” IEEE J. Select. Areas Commun., vol. 7, no. 7,
pp. 1091–1103, 1989.
[4] M. G. Hluchyj and M. J. Karol, “Queueing in high-performance packet
switching,” IEEE J. Select. Areas Commun., vol. 6, no. 9, pp. 1587–1597,
1988.
[5] Y. Oie, T. Suda, M. Murata, D. Kolson, and H. Miyahara, “Survey of
switching techniques in high-speed networks and their performance,” in
Proc. IEEE Infocom’90, 1990, pp. 1242–1251.
[6] Y. S. Yeh, M. G. Hluchyj, and A. S. Acampora, “The knockout switch: A simple, modular architecture for high-performance packet switching,”
in Proc. ISS’87, 1987, vol. B10.2.1, pp. 801–808.
[7] H. J. Chao, B.-S. Choe, J.-S. Park, and N. Uzun, “Design and implementation of Abacus switch: A scalable multicast ATM switch,” IEEE
J. Select. Areas Commun., vol. 15, no. 5, pp. 830–843, 1997.
[8] I. Iliadis and W. E. Denzel, “Analysis of packet switches with input and
output queueing,” IEEE Trans. Commun., vol. 41, no. 5, pp. 731–740,
1993.
[9] Y. Oie, M. Murata, K. Kubota, and H. Miyahara, “Effect of speedup in
nonblocking packet switch,” in Proc. ICC89, 1989, p. 410.
[10] N. Yamanaka, K. Endo, K. Genda, H. Fukuda, T. Kishimoto, and
S. Sasaki, “320 Gb/s high-speed ATM switching system hardware
technologies based on copper-polyimide MCM,” IEEE Trans. Comp.,
Packag., Manufact. Technol., vol. 18, pp. 83–91, 1995.
[11] M. J. Karol, M. G. Hluchyj, and S. P. Morgan, “Input versus output
queueing on a space-division packet switch,” IEEE Trans. Commun.,
vol. COM-35, pp. 1347–1356, 1987.
[12] M. J. Karol, K. Y. Eng, and H. Obara, “Improving the performance of
input-queued ATM packet switches,” in Proc. IEEE Infocom’92, 1992,
pp. 110–115.
[13] N. McKeown, “Scheduling algorithms for input-queued cell switches,”
Ph.D. dissertation, University of California at Berkeley, 1995.
[14] H. Tomonaga, N. Matsuoka, Y. Kato, and Y. Watanabe, “High-speed
switching module for a large capacity ATM switching system,” in Proc.
IEEE GLOBECOM’92, 1992, pp. 123–127.
[15] K. Genda, Y. Doi, K. Endo, T. Kawamura, and S. Sasaki, “A 160-Gb/s
ATM switching system using an internal speed-up crossbar switch,” in
Proc. IEEE GLOBECOM’94, 1994, pp. 123–133.
[16] E. Oki, N. Yamanaka, and Y. Ohtomo, “A 10 Gb/s (1.25 Gb/s × 8) 4 × 2 CMOS/SIMOX ATM switch,” in Proc. IEEE ISSCC99, Feb. 1999, pp. 172–173.
[17] E. Oki and N. Yamanaka, “Scalable crosspoint buffering ATM switch architecture using distributed arbitration scheme,” in Proc. IEEE ATM’97
Workshop, 1997, pp. 28–35.
[18] A. Cisneros and C. A. Brackett, “A large ATM switch based on memory
switches and optical star couplers,” IEEE J. Select. Areas Commun., vol.
9, no. 8, pp. 1348–1360, 1991.
[19] Y. Ohtomo, M. Nogawa, and M. Ino, “A 2.6-Gbps/pin SIMOX-CMOS low-voltage-swing interface circuit,” IEICE Trans. Electron., vol. E79-C, no. 4, pp. 524–529, 1996.
[20] S. Yasuda, Y. Ohtomo, M. Ino, Y. Kado, and T. Tsuchiya, “3-Gb/s
CMOS 1:4 MUX and DEMUX IC’s,” IEICE Trans. Electron., vol.
E78-C, no. 12, pp. 1746–1753, 1995.
[21] Y. Ohtomo, S. Yasuda, M. Nogawa, J. Inoue, K. Yamakoshi, H. Sawada,
M. Ino, S. Hino, Y. Sato, Y. Takei, T. Watanabe, and K. Takeya, “A 40
Gb/s 8 × 8 ATM switch LSI using 0.25-μm CMOS/SIMOX,” in Proc.
IEEE ISSCC97, 1997, pp. 154–155.
[22] E. Oki and N. Yamanaka, “High-speed tandem-crosspoint ATM switch
architecture with input and output buffers,” IEICE Trans. Commun., vol.
E81-B, no. 2, pp. 215–223, 1998.
[23] M. Ino, “Low-power and high-speed LSI’s using 0.25-μm
CMOS/SIMOX,” IEICE Trans. Electron., vol. E80-C, no. 12,
1997.
[24] Y. Tsukada, “Bare chip packaging technology,” in Proc. Advances in
Electronic Packaging Conf., 1997, vol. 1, pp. 285–290.
[25] K. Okazaki, N. Sugiura, A. Harada, N. Yamanaka, and E. Oki, “80-Gbit/s
MCM-C technologies for high-speed ATM switching systems,” in Proc.
IMAPS Int. Conf. Exhibition High Density Packaging & MCMs, Apr.
1999, pp. 284–288.
[26] T. Sakai, S. Konaka, Y. Kobayashi, M. Suzuki, and Y. Kawai, “Gigabit logic bipolar technology: Advanced super self-aligned process
technology,” Electron. Lett., vol. 19, pp. 283–284, 1983.
[27] S. Sasaki, T. Kishimoto, K. Genda, K. Endo, and K. Kaizu, “Multichip
module technologies for high-speed ATM switching systems,” in Proc.
’94 MCM Conf., 1994, pp. 130–135.
Eiji Oki (M’95) received the B.E. and M.E.
degrees in instrumentation engineering and the
Ph.D. degree in electrical engineering from Keio
University, Yokohama, Japan, in 1991, 1993, and
1999, respectively.
In 1993, he joined Nippon Telegraph and
Telephone Corporation’s (NTT’s) Communication
Switching Laboratories, Tokyo, Japan. He has been
researching multimedia-communication network
architectures based on ATM techniques and traffic-control methods for ATM networks. He is currently
developing high-speed ATM switching systems in NTT Network Service
Systems Laboratories as a Research Engineer.
Dr. Oki is a member of the IEICE of Japan. He received the Switching
System Research Award and the Excellent Paper Award from the IEICE in
1998 and 1999, respectively.
Naoaki Yamanaka (M’85–SM’96) was born in
Sendai City, Miyagi Prefecture, Japan, in 1958.
He received the B.E., M.E., and Ph.D. degrees in
engineering from Keio University, Tokyo, Japan, in
1981, 1983, and 1991, respectively.
In 1983, he joined Nippon Telegraph and
Telephone Corporation’s (NTT’s) Communication
Switching Laboratories, Tokyo, where he researched
and developed high-speed switching systems and
high-speed switching technologies, such as ultrahigh-speed switching LSI chips/devices, packaging
techniques, and interconnection techniques, for broad-band ISDN services.
Since 1989, he has been developing broad-band ISDN items based on ATM
techniques. He is now researching ATM-based broad-band ISDN architectures
and is engaged in traffic management and performance analysis of ATM
networks. He is currently a Senior Research Engineer, Supervisor, Research
Group Leader in the Broadband Network System Laboratory at NTT.
Dr. Yamanaka received the Best of Conference Award at the 40th,
44th, and 48th IEEE Electronic Components and Technology Conferences,
the TELECOM System Technology Prize from the Telecommunications
Advancement Foundation, the IEEE CPMT Transactions Part B: Best
Transactions Paper Award, and the Excellent Paper Award from the IEICE in
1990, 1994, 1999, 1994, 1996, and 1999, respectively. He is the Broadband
Network Area Editor of the IEEE COMMUNICATION SURVEYS, Editor of the
IEICE Transactions, and the IEICE Communication Society International
Affairs Director, as well as Secretary of the Asia Pacific Board of the IEEE
Communications Society.
Yusuke Ohtomo (M’92) received the B.E., M.E.,
and Ph.D. degrees in electric engineering from Keio
University, Kanagawa, Japan, in 1983, 1985, and
1998, respectively.
Since he joined Nippon Telegraph and Telephone
(NTT) Corp., Tokyo, Japan, in 1985, he has been
engaged in the research and development of high-speed CMOS/BiCMOS circuit technology, particularly with application to telecommunications LSI’s.
His current research interests include low-power
multigigahertz LSI design using SOI devices. He
is now a Senior Research Engineer at the NTT Telecommunications Energy
Laboratories, Kanagawa.
Dr. Ohtomo is a member of the IEICE of Japan.
Kazuhiko Okazaki was born in Tokyo, Japan, in
1969. He received the B.E. and M.E. degrees in
applied physics and chemistry from The University
of Electro-Communications, Tokyo, in 1993 and
1995, respectively.
In 1995, he joined Nippon Telegraph and Telephone Corporation’s Network Service Systems Laboratories, Tokyo. He has been researching high-speed electrical interconnections in a rack system
and high-density MCM packaging technologies for
high-speed ATM switching systems.
Mr. Okazaki is a member of the IEICE of Japan and the IMAPS.
Ryusuke Kawano (M’98) was born in Oita, Japan,
on April 2, 1964. He received the B.E. and M.E.
degrees from the University of Osaka Prefecture,
Osaka, Japan, in 1987 and 1989, respectively.
In 1989, he joined Nippon Telegraph and Telephone Corp. (NTT) and began researching and
developing process technology for high-speed Si-bipolar devices. Since 1992, he has been engaged
in researching and developing high-speed integrated
circuits using Si bipolar transistors and GaAs MESFET’s at the NTT LSI Laboratories, Kanagawa,
Japan. Since moving to NTT Network Service Systems Laboratories, Tokyo,
Japan, his research interests have included very large capacity ATM switching
hardware such as high-speed logic, optical interconnection, and cooling.
Mr. Kawano is a member of the IEICE of Japan.