Asynchronous Logic: A Computer Systems Perspective
Rajit Manohar
Computer Systems Lab, Cornell
http://vlsi.cornell.edu/

Approaches to computation

                    Discrete Time                 Continuous Time
Discrete Value      Digital, synchronous logic    Digital, asynchronous logic
Continuous Value    Switched-capacitor analog     General analog computation

Asynchronous logic
[Figure: in a synchronous link, the receiver samples the sender's data signal within a clock-defined sampling window; in an asynchronous link, sender and receiver synchronize explicitly with request/acknowledge handshakes, the value being carried on data-0/data-1 wires plus an acknowledge.]

Explicit versus implicit synchronization
[Figure: timelines for operations A, B, C. In synchronous computation the iterations A1, B1, C1, A2, B2, C2, … advance in lock-step with the clock; in asynchronous computation each iteration completes in its own time.]

Differences at a glance
• Asynchronous: continuous-time computation. Synchronous: discrete-time computation.
• Asynchronous: observed delay is the average over possible circuit paths. Synchronous: observed delay is the maximum over possible circuit paths.
• Asynchronous: local switching from data-driven computation, which can lead to low-power operation. Synchronous: global activity from clock-driven computation; low power requires careful clock gating.
• Asynchronous: throughput determined by device speed (margins needed for some circuit families). Synchronous: throughput determined by device speed and additional operating margins.
• Asynchronous: data must be encoded, requiring additional wires per signal. Synchronous: unencoded data is acceptable because of the global clock network.

A little bit of (biased) history…
[Timeline, 1940s to 2010s: binary addition (von Neumann); flow tables (Huffman); switching theory (Miller); Illiac II (UIUC/Muller); Atlas and MU5 (Manchester); macro modular systems (Clarke, Molnar, Stucki, …); trace theory (Snepscheut); syntactic compilation (Martin, Rem); System Timing (Seitz, in Mead/Conway); micropipelines (Sutherland); asynchronous CPU (Martin); general hazard-free synthesis; AER (Mead lab); sensory systems (retina, cochlea, etc.); commercial 80C51 (Philips); ARM with caches (Furber); pipelined FPGAs (Manohar); 17FO4 MIPS CPU (Martin); UltraSPARC IIIi memory controller (Sun); 10G Ethernet switch (Fulcrum); event-driven CPU (Manohar); FACETS (Heidelberg); HiAER (Cauwenberghs); Neurogrid (Boahen); SpiNNaker (Furber); TrueNorth (IBM/Cornell).]

I. Exploiting the gap between average and max delay
[Figure: a worked example of binary ripple-carry addition.]
• Slow part: computing carries
  ❖ … but only in the worst case!

Theorem [von Neumann, 1946]. The average-case latency of a ripple-carry binary adder is O(log N) for i.i.d. inputs.

A.W. Burks, H.H. Goldstine, J. von Neumann. Preliminary discussion of the logical design of an electronic computing instrument (1946).

I. Exploiting the gap between average and max delay
• Standard adder tree
  ❖ O(log N) latency
• Hybrid adder tree
  ❖ Tree adder + ripple adder

Theorem [Winograd, 1965]. The worst-case latency for binary addition is Ω(log N).
Theorem. The average-case latency of the hybrid asynchronous adder is O(log log N) for i.i.d. inputs.
Theorem. The average-case latency of the hybrid asynchronous adder is optimal for any input distribution.

S. Winograd. On the time required to perform addition. JACM (1965).
R. Manohar and J. Tierno. Asynchronous parallel prefix computation. IEEE Transactions on Computers (1998).
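The von Neumann bound can be checked numerically: the completion time of an asynchronous ripple-carry adder for a given input is set by the longest carry-propagation chain in that input, and for random operands this chain grows only logarithmically in the word width. The following is a minimal simulation sketch added here for illustration (function names and trial counts are my own, not from the talk):

```python
import random

def longest_carry_chain(a_bits, b_bits):
    """Length of the longest carry-propagation chain for one addition.
    A chain starts where a carry is generated (a_i = b_i = 1) and grows
    while later positions propagate it (a_i != b_i); a 'kill' position
    (a_i = b_i = 0) stops it.  The ripple-carry completion time for this
    input is proportional to the value returned here."""
    longest = current = 0
    for a, b in zip(a_bits, b_bits):          # least significant bit first
        if a & b:                             # generate: a new carry starts
            current = 1
        elif a ^ b:                           # propagate: extend an active carry
            current = current + 1 if current else 0
        else:                                 # kill: the carry dies here
            current = 0
        longest = max(longest, current)
    return longest

def average_chain_length(n_bits, trials=20000):
    """Monte Carlo estimate of the average longest chain for i.i.d. random bits."""
    total = 0
    for _ in range(trials):
        a = [random.getrandbits(1) for _ in range(n_bits)]
        b = [random.getrandbits(1) for _ in range(n_bits)]
        total += longest_carry_chain(a, b)
    return total / trials

if __name__ == "__main__":
    for n in (16, 64, 256, 1024):
        print(f"N = {n:5d}   average longest carry chain ~ {average_chain_length(n):.2f}")
```

For 16-, 64-, 256-, and 1024-bit operands the measured averages grow roughly like log2 N, while the worst-case chain is still proportional to N; this is exactly the average-versus-maximum gap that an asynchronous adder can exploit.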
II. Exploiting data-driven power management
• Low-power FIR filter
  ❖ External, synchronous I/O
  ❖ Monitor occupancy of the FIFOs
  ❖ Speed up/slow down using an adaptive power supply (a DC/DC converter sets Vdd)
• Break-point add/multiply
  ❖ Detect when the top half of an operand is required
  ❖ Dynamically switch between full-width and half-width arithmetic

L. Nielsen and J. Sparsø. A Low-power Asynchronous Datapath for a FIR filter bank. Proc. ASYNC (1996).

II. Exploiting data-driven power management
• Width-adaptive numbers
  ❖ Compress leading 0's and 1's, e.g. 000···0001 + 000···0001 = 000···0010 needs only its low-order bits
[Figure: cumulative distribution of operand widths for SPEC benchmarks 124.m88ksim, 129.compress, 126.gcc, and 132.ijpeg on a 128-bit datapath with 8 bits per VDW block; with no, simple, and full compaction the average operand width is roughly 8 to 11 bits (an example operand value: 1.4 × 10^14 ≈ 2^47).]

R. Manohar. Width-Adaptive Data Word Architectures. Proc. ARVLSI (2001).

III. Exploiting elasticity of pipelines
[Figure: a pipeline of stages f, g, h shown with different amounts of circuit-level pipelining.]
Asynchronous computation is* robust to changes in circuit-level pipelining!

*R. Manohar, A.J. Martin. Slack Elasticity in Concurrent Computing. Proc. Mathematics of Program Construction (1998).

III. Exploiting elasticity of pipelines
• Asynchronous FIR filter
• Data arrives at variable rates
• Data rates are predictable
• Fixed amount of time for the asynchronous computation
• Variable number of clock cycles within the fixed time budget

M. Singh, J.A. Tierno, A. Rylyakov, S. Rylov, S.M. Nowick. An adaptively pipelined mixed synchronous-asynchronous digital FIR filter chip operating at 1.3 GHz. Proc. ASYNC (2002).

III. Exploiting elasticity of pipelines
• FPGA throughput is low due to the overhead of the programmable interconnect
• ~3x improvement in peak throughput from transparent interconnect pipelining
[Figure: FPGA tile array of logic blocks (LB), connection blocks (CB), and switch blocks (SB).]

J. Teifel and R. Manohar. Highly pipelined asynchronous FPGAs. Proc. FPGA (2004).

IV. Exploiting timing robustness
[Figure: asynchronous FPGA test data: throughput (MHz) versus supply voltage (0.5 V to 2.5 V) measured at 12 K, 77 K, 294 K, and 400 K, compared against a Xilinx Virtex; process TSMC 0.18 um, nominal Vdd 1.8 V.]

D. Fang, J. Teifel, and R. Manohar. A High-Performance Asynchronous FPGA: Test Results. Proc. FCCM (2005).

IV. Exploiting timing robustness
• "GasP" logic family
• Self-reset
• No margins on control signals
• Standard datapath
• Latest 40 nm chip operates at >6 GHz
[Figure: a GasP FIFO control stage: a single-track control wire between stages RA(i) and RA(i+1), track set and track reset drivers, "wait for L set, R reset" logic, and the datapath driver; roughly 6 to 10 FO4 delays per cycle.]

I. Sutherland, S. Fairbanks. GasP: a minimal FIFO control. Proc. ASYNC (2001).

V. Exploiting low noise and low EMI
[Figure: time-domain and frequency-domain (0 to 400 MHz) emission measurements of the synchronous 80C51 versus the asynchronous 80C51 (Philips).]

J. Kessels, T. Kramer, G. den Besten, A. Peeters, V. Timm. Applying Asynchronous Circuits in Contactless Smart Cards. Proc. ASYNC (2000).

VI. Exploiting continuous-time information processing
• Continuous-time digital signal processing
• Automatic adaptation to input bandwidth
• No aliasing from the sampling process
[Figure: Nyquist sampling versus level-crossing sampling of the same waveform.]

B. Schell, Y. Tsividis. A clockless ADC/DSP/DAC system with activity-dependent power dissipation and no aliasing. Proc. ISSCC (2008).
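To make the level-crossing idea concrete, here is a small illustrative sketch added here (the toy signal, names, and parameters are hypothetical and this is not the Schell/Tsividis design): an event is produced only when the input crosses one of a set of uniformly spaced levels, so a quiet input generates no events and activity, rather than a fixed clock, determines the sample rate.

```python
import math

def level_crossing_sample(signal, duration, t_step, delta):
    """Record (time, level) events whenever `signal` crosses a quantization
    level spaced `delta` apart.  A quiet input produces no events at all.
    The fine time grid `t_step` is only a simulation device; continuous-time
    hardware would use comparators instead of a sampling clock."""
    events = []
    last_level = math.floor(signal(0.0) / delta)
    t = t_step
    while t <= duration:
        level = math.floor(signal(t) / delta)
        if level != last_level:
            events.append((t, level * delta))
            last_level = level
        t += t_step
    return events

# Toy input: a 25 Hz tone for the first 0.3 s, then silence.
def toy_signal(t):
    return math.sin(2 * math.pi * 25 * t) if t < 0.3 else 0.0

if __name__ == "__main__":
    events = level_crossing_sample(toy_signal, duration=1.0, t_step=1e-4, delta=0.1)
    print(f"level-crossing events: {len(events)}; last event at t = {events[-1][0]:.3f} s")
    # A clocked converter running at the 50 Hz Nyquist rate keeps taking samples
    # during the silent 0.7 s; the event-driven version simply goes idle.
    print("uniform 50 Hz sampling over the same second: 50 samples")
```

Running the sketch shows all events confined to the active burst, which is the data-driven power behavior the slide is pointing at; the no-aliasing property comes from the absence of a periodic sampling process.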
VI. Exploiting continuous-time information processing
• Address-event representation in neuromorphic systems
  ❖ "Time represents itself"
  ❖ Spike timing and coincidence for information processing
• Asynchronous logic
  ❖ Temporal resolution is decoupled from throughput
  ❖ Automatic power management
• Projects
  ❖ Silicon retinas, cochleas, …
  ❖ More recent: FACETS, HiAER, Neurogrid, neuroP, SpiNNaker, TrueNorth

A note on oscillations in asynchronous logic
• Well-known result in asynchronous circuit theory
  ❖ Signal transitions: look at the i-th occurrence of transition t
  ❖ time(t, i) = the actual time of this event
  ❖ time⁰(t, i) = f(t) + p* × i
• Then: |time(t, i) − time⁰(t, i)| < B
  ❖ Bounded, independent of i and t
  ❖ If the circuit is strongly connected, this result holds
• More recently: stronger periodicity under more general conditions

J. Magott. Performance evaluation of concurrent systems using Petri nets. IPL (1984).
S.M. Burns, A.J. Martin. Performance Analysis and Optimization of Asynchronous Circuits. Proc. ARVLSI (1990).
W. Hua and R. Manohar. Exact timing analysis for concurrent systems. To appear, work-in-progress session, DAC (2016).

Asynchronous logic in TrueNorth
• Spikes trigger activity…
  ❖ … so they should be asynchronous
• Neurons leak periodically…
  ❖ … so add a periodic event to the system (the clock)
• Spikes travel through the asynchronous network
• The receiver buffers spikes until it is time to deliver them
• At the external timing "tick":
  ❖ Update each neuron, delivering spikes plus any local state update
  ❖ If the neuron spikes, send an output spike to its destinations, with the specified delay
[Figure: TrueNorth core block diagram: token control, scheduler, SRAM, router, and neuron blocks, implemented in digital, asynchronous logic.]

Asynchronous logic in TrueNorth
• Spike communication network
  ❖ Links are a shared resource
  ❖ Links might be contended
• Spike delivery times can differ based on traffic
  ❖ Different runs of the same network might produce different spike patterns
• Instead, we use a spike scheduler
  ❖ Guarantees deterministic computation
  ❖ … and hence, repeatability

Summary
• Asynchronous logic has a long history
• Some key properties exploited in projects
  ❖ Gap between average and maximum delay for a computation
  ❖ Automatic data-driven power management
  ❖ Elastic pipelining
  ❖ Robustness to timing uncertainty/variation
  ❖ Reduced EMI and noise
  ❖ Continuous-time information representation

Acknowledgments
• The "async" community
• Students:
  ❖ Ned Bingham, Wenmian Hua, Sean Ogden, Praful Purohit, Tayyar Rzayev, Nitish Srivastava
  ❖ John Teifel, Clinton Kelly, Virantha Ekanayake, Song Peng, David Biermann, David Fang, Chris LaFrieda, Filipp Akopyan, Basit Riaz Sheikh, Benjamin Tang, Nabil Imam, Sandra Jackson, Carlos Otero, Benjamin Hill, Stephen Longfield, Jonathan Tse, Robert Karmazin
• Collaborators:
  ❖ General: David Albonesi, Sunil Bhave, Francois Guimbretiere, Amit Lal, Yoram Moses, Ivan Sutherland, Yannis Tsividis
  ❖ Neuromorphic: John Arthur, Kwabena Boahen, Tobi Delbruck, Chris Eliasmith, Giacomo Indiveri, Paul Merolla, Dharmendra Modha
• Support: AFRL, DARPA, IARPA, NSF, ONR, IBM, Lockheed, Qualcomm, Samsung, SRC