Error Correcting Codes for Serial Links: an update
Sergio Cavaliere
Department of Physics, University of Napoli “Federico II”, Italy, and INFN Sezione di Napoli, Italy
e-mail: [email protected]
XV SuperB Workshop – Caltech - Dec, 2010

Overview
• Recall the problems with serial link failures and errors due to the rad hard environment
• Recall the relevant parameters for the performance of the error correcting code
• Define bit error rate and bit error frequency, and the related Poisson statistics
• Probability analysis of Bit Error Rate reduction in Hamming codes
• Probability analysis of Bit Error Rate reduction in Reed Solomon codes
• Analysis of some proposed coding structures
• Conclusion and future work

Problems with serial link failures and errors
Two main problems regarding errors due to the rad hard environment:
Loss of Lock (LoL): due to failures on fixed bits in the SERDES; analyzed at the last Frascati meeting.
• Conclusion: a direct fast link between transmitter and receiver is needed, in order to signal promptly the occurrence of LoL
Bit errors due to the rad hard environment: they affect data integrity and data quality
• Solution: an Error Correcting Code (ECC) is needed
• evaluate the required performance of the code
• start from a presumed bit error rate of 10^-10 (LHC?)
[compared with the extreme technology limits of 10^-15]
• arrive at a desired time between errors

Relevant parameters for the performance of the error correcting code
• In the usual communication approach, the relevant parameter for serial link improvement is the coding gain: the increase in channel noise which can be balanced by error correction codes
• This allows reducing costs at the same performance, or increasing speed at the same cost: it relaxes the SNR requirements
• In our case, what is important is the bit error rate reduction obtained by the ECC
• From the BER we may compute an overall failure rate for each serial link and for the whole apparatus at a fixed data rate

Bit Error Rate and time between errors
[Figure: in unit time there are 1/T transmitted bits, λ error events and λ faulty bits; T = transmit clock period]
• BER = no. of errored bits / no. of transmitted bits
• f = transmission frequency
• λ = BEF (bit error frequency) = BER × f
• Average time between errors μ = 1/BEF
e.g. f = 1.1 GHz, BER = 10^-10: λ = 0.11 errors/s, μ = average time between errors ≈ 9 s

Bit errors: Poisson statistics
Errors on bits caused by events which take place in a rad hard environment follow the usual statistics, with these features:
• events take place one after the other and independently of each other
• the average number of events in unit time is constant, equal to λ.
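A minimal Python sketch of the relations above, assuming the example values f = 1.1 GHz and BER = 10^-10. It computes λ = BER × f and μ = 1/λ, then draws exponentially distributed waiting times (independent events at constant rate λ) and compares their sample mean with μ. The function names, the seed and the sample size are illustrative choices, not part of the original analysis.

```python
import random


def bit_error_frequency(ber, f_hz):
    """lambda = BER * f: average number of bit errors per second."""
    return ber * f_hz


def mean_time_between_errors(ber, f_hz):
    """mu = 1 / lambda: average time between errors, in seconds."""
    return 1.0 / bit_error_frequency(ber, f_hz)


def simulate_waiting_times(lam, n_events, seed=0):
    """Draw n_events waiting times between independent events of rate lam."""
    rng = random.Random(seed)  # fixed seed (assumption) for repeatable runs
    return [rng.expovariate(lam) for _ in range(n_events)]


lam = bit_error_frequency(1e-10, 1.1e9)
mu = mean_time_between_errors(1e-10, 1.1e9)
samples = simulate_waiting_times(lam, 10_000)
sample_mean = sum(samples) / len(samples)

print(f"lambda = {lam:.3f} errors/s, mu = {mu:.2f} s")
print(f"sample mean over {len(samples)} events = {sample_mean:.2f} s")
```

With 10000 events the sample mean is expected to fall within a few percent of μ, since its statistical uncertainty is about μ/100.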
[Figure: exponential probability pdf, mu = 9.09 (10000 events); curves: measured pdf, theoretical pdf; x axis: time (s); y axis: probability]
• λ is the average number of events in unit time (frequency or rate)
• μ = 1/λ is the average time distance from one event to the next
• probability density of the time t between events: $p(t) = \frac{1}{\mu} e^{-t/\mu}$
• cumulative distribution: $c(t) = 1 - e^{-t/\mu}$

Bit Error Rate and time between errors
The diagram shows how a value for BER translates into the average time between errors (in case of continuous data exchange) at a fixed operating frequency.
[Figure: average time between bits in error at 1.1 GHz; x axis: bit error rate (BER %), from 10^-18 to 10^-10; y axis: time (s), with reference levels from 1 s to 10 years]
e.g. BER = 10^-10: average time between errors = 9 s
Use error correction coding to achieve BER = 10^-16: average time between errors ≈ 9 × 10^6 s, about 3.5 months

How to evaluate how much ECC power is needed?
• Assume a command length of 100 bits (actual figures will be 72-90-108 bits)
• Assume a reference BER = 10^-10 for each link
For a single link:
• correction of 0 bits per frame will deliver BER = 10^-10 -> time between errors: 9 s
• correction of 1 bit per frame will deliver BER = 5 × 10^-17 -> time between errors: years
• correction of 2 bits per frame will deliver BER = 2 × 10^-25 -> time between errors: many years
Binomial formula: probability of having n errors in a frame of m bits with bit error probability p:
$p(m,n) = \binom{m}{n} p^n (1-p)^{m-n}$
$p(100,2) \approx 5 \times 10^{-17}$, $p(100,3) \approx 2 \times 10^{-25}$
• We may argue that a moderate complexity ECC may be adopted
• Observation: low probability values would involve very long simulations

Bit Error Rate reduction in Hamming codes
Probability of a word error for block codes (n = word length, t = no. of corrected bits):
$p_{ew} = \sum_{i=t+1}^{n} P(n,i) = \sum_{i=t+1}^{n} \binom{n}{i} p^i (1-p)^{n-i} \approx P(n,t+1) = \binom{n}{t+1} p^{t+1} (1-p)^{n-t-1}$
Probability of a bit error for block codes (n = word length, t = no. of corrected bits):
$p_{eb} \approx \binom{n-1}{t} p^{t+1}$
Probability of a bit error for Hamming codes (t = 1, n = 2^m - 1):
$p_{eb} \approx (n-1) p^2$
On a log-log scale this is a straight line with slope 2, offset by log10(n-1).
[Figure: Hamming: bit error probability encoded vs BER, m = 3...8, t = 1; x axis: BER, from 10^-12 to 10^-9; y axis: bit error probability encoded, from 10^-22 to 10^-16]

Bit Error Rate reduction in shortened Hamming codes
Due to the 18-bit constraint in the serdes, we must use shortened Hamming codes.
[Figure: block diagram of H(31,26) shortened to 18 transmitted bits carrying 13 message bits]
[Figure: Hamming: simulated and computed ber after coding; curves: computed, simulated; x axis from 10^-5 to 10^-2]
For a Hamming code H(n,k) shortened to H(ns,ks), the bit error probability p_short is obtained from p_Hamm through the ratio of the word lengths n and ns. In the above example n = 31, ns = 18.

Bit Error Rate reduction in multi-Hamming codes
Our codes will be made of a combination of shortened Hamming codes.
[Figure: block diagram of an 18-bit multi-Hamming code combining a shortened H(15,11) branch and a shortened H(7,4) branch]
$p_{multi} = \frac{p_1 k_{1red} + p_2 k_{2red}}{k_{1red} + k_{2red}}$
• p_multi = bit error probability for the overall code
• p1 = probability of branch no. 1 (shortened Hamming); k1red = no. of bits of the message of branch no. 1
• p2 = probability of branch no. 2 (shortened Hamming); k2red = no. of bits of the message of branch no. 2
[Figure: Multi-Hamming: simulated and computed ber after coding; curves: computed, simulated; x axis from 10^-5 to 10^-2]

Bit Error Rate reduction in Reed Solomon codes
[Figure: RS code [linear scale]: m = 3, n = 7, ber (parameter t); x axis: ber; y axis: coded ber]
$p_s = 1 - (1-p)^m$
• p_s = probability that a symbol is in error; p = probability that a bit is in error; m = symbol length
$p_{ew} \approx \binom{n}{t+1} p_s^{t+1} (1-p_s)^{n-t-1}$
• p_ew = probability that a word made of n symbols is in error
• p_ib = probability that a bit of the message is in error after ECC coding
[Equation for p_ib not fully recoverable from the source: two sums of binomial terms in p_s, over i = t+1...d and i = d+1...n, with prefactor q/(2(q-1))]
Same work as for Hamming to obtain the features of shortened and combined codes.

Hamming code: features of a selected test code
• transmitted: a frame of 2 x 18 bit = 36 bit
• no polarity control
• codes: 2 x H(15,11) + Hs(6,3) {from H(7,4)}
• n = 2
• Ecc = 12 %
• Overhead = 44 %
• serial link BER: 10^-10 -> 10^-19
[Figure: block diagram. Transmit side: data to transmit, 25 bit (11 + 11 + 3) -> buffer & encoder, H(15,11) + H(15,11) + Hs(6,3) = 15 + 15 + 6 = 36 bit -> scrambler -> serdes, 2 x 18 bit -> serial link. Receive side: serdes, 2 x 18 bit -> descrambler -> decoder & buffer -> data to distribute, 25 bit]

Hamming code: features of a selected test code
[Figure: Multi-Hamming: frequency of errors after coding (clock at 1.1 Gbit/s); x axis: time between errors before ECC (s); y axis: time between errors after ECC (s), with reference levels from 8 days to 1000 years]
e.g. f = 1.1 GHz
• uncorrected BER = 10^-10: average time between failures 9 s
• after coding, corrected BER = 10^-19: average time between failures 244 years
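As a rough cross-check of the order of magnitude quoted for the test code, the first-order formulas can be sketched in Python. Several points here are assumptions, not taken from the slides: each branch, including the shortened Hs(6,3), is treated as a length-n block code with t = 1 using p_eb ≈ C(n-1, t) p^(t+1); the two-branch weighted average is extended to three branches; all function names are hypothetical.

```python
from math import comb

SECONDS_PER_YEAR = 365.25 * 24 * 3600


def block_bit_error(p, n, t=1):
    """First-order bit error probability after decoding: C(n-1, t) * p**(t+1)."""
    return comb(n - 1, t) * p ** (t + 1)


def combined_bit_error(branches):
    """Average of branch probabilities weighted by message bits: sum(p*k) / sum(k)."""
    total_bits = sum(k for _, k in branches)
    return sum(p * k for p, k in branches) / total_bits


def years_between_errors(ber, f_hz):
    """Average time between errors, 1 / (BER * f), expressed in years."""
    return 1.0 / (ber * f_hz) / SECONDS_PER_YEAR


p = 1e-10  # reference BER of the uncoded link
branches = [
    (block_bit_error(p, 15), 11),  # H(15,11)
    (block_bit_error(p, 15), 11),  # H(15,11)
    (block_bit_error(p, 6), 3),    # Hs(6,3): assumption, treated as a length-6 code
]
p_multi = combined_bit_error(branches)
years = years_between_errors(p_multi, 1.1e9)

print(f"p_multi ~ {p_multi:.2e}")
print(f"time between errors ~ {years:.0f} years")
```

Under these assumptions the result lands at the 10^-19 level and a few hundred years between failures, the same order of magnitude as the figures on the slide; it is not expected to reproduce them exactly.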
Reed Solomon codes
Similar examples may be made for Reed Solomon codes. We do not show one here, also because the greater hardware complexity of both encoding and decoding may lead to the Hamming solution, which is:
• simple as regards hardware complexity
• faster as regards the delays involved

Conclusions
• We have developed a thorough statistical analysis of the bit error probability after ECC coding for complex, shortened and mixed codes, both Hamming and Reed Solomon, with some simulation
• We must point out that the above considerations on error rates apply to a single link. The 500-1000 link multiplicity will obviously raise the bit error frequency in the apparatus by that multiplying factor.
• Even taking this circumstance into account, we might argue that a moderate correction capability is enough to reduce the error rate to a suitable value. This will be assessed as soon as we have precise figures on the error rate in our rad hard environment
• We will therefore revert to very simple ECC structures, fully compatible with a proper hardware implementation in terms of both available hardware resources and processing time

To be done
• obtain precise figures on the bit error rate in our rad hard environment
• define and analyze Hamming (Reed Solomon) coding structures with the purpose of reducing both silicon area and processing time for the implementation
• analyze thoroughly the impact of error rates on the performance of the overall apparatus and the related data quality
• evaluate practical implementations
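The single-link caveat in the Conclusions can be made concrete with a short Python sketch. It assumes identical, independent links, so that the bit error frequency of the apparatus is the single-link value BER × f multiplied by the number of links; the function names and the choice of printed cases are illustrative.

```python
SECONDS_PER_DAY = 86400.0


def apparatus_error_frequency(ber, f_hz, n_links):
    """Errors per second summed over n_links identical, independent links."""
    return ber * f_hz * n_links


def days_between_errors(ber, f_hz, n_links):
    """Average time between errors anywhere in the apparatus, in days."""
    return 1.0 / apparatus_error_frequency(ber, f_hz, n_links) / SECONDS_PER_DAY


f_hz = 1.1e9
for ber in (1e-10, 1e-19):
    for n_links in (1, 500, 1000):
        days = days_between_errors(ber, f_hz, n_links)
        print(f"BER={ber:.0e}, links={n_links:4d}: {days:.3g} days between errors")
```

The time between errors scales as 1/N with the number of links N, which is why the target BER per link has to be set with the full 500-1000 multiplicity in mind.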