
Error Correcting Codes for Serial Links: an update
Sergio Cavaliere
 Department of Physics, University of Napoli “Federico II”, Italy
and
 INFN Sezione di Napoli, Italy
XV SuperB Workshop – Caltech - Dec, 2010
Overview

Recall the problems with serial link failures and errors due to the rad
hard environment

Recall the relevant parameters for the performance of
the error correcting code

Define bit error rate and bit error frequency and the related
Poisson statistics

Probability analysis regarding Bit Error Rate reduction in
Hamming codes.

Probability analysis regarding Bit Error Rate reduction in Reed
Solomon codes

Analysis of some proposed coding structures

Conclusion and future work
Problems with serial link failures and
errors
Two main problems regarding errors due to the rad hard environment:
• Loss of Lock (LoL): due to failures on fixed bits in the SERDES; analyzed at the last Frascati meeting.
  • Conclusion: a direct fast link between transmitter and receiver is needed in order to signal promptly the occurrence of LoL
• Bit errors due to the rad hard environment: they affect data integrity and data quality
  • Solution: an Error Correcting Code (ECC) is needed
  • evaluate the required performance of the code
  • start from a presumed bit error rate of 10^-10 (LHC?) [compared with the extreme technology limit of 10^-15]
  • arrive at a desired time between errors
Relevant parameters for the performance of the error correcting code
• In the usual communication approach, the relevant parameter for serial link improvement is the coding gain:
  the increase in channel noise which can be balanced by error correction codes.
  By relaxing the SNR requirements, this allows
  • reducing costs with the same performance, or
  • increasing speed at the same cost
• In our case what is important is the bit error rate (BER) reduction obtained by ECC
From the BER we may compute an overall failure rate for each serial
link and for the whole apparatus at a fixed data rate
Bit Error Rate and time between
errors
In unit time: λ error events (λ faulty bits) out of 1/T transmitted bits
T = transmit clock period
BER = no. of bits in error / no. of transmitted bits
f = transmission frequency
λ = BEF (bit error frequency) = BER*f
μ = average time between errors = 1/BEF
e.g. f = 1.1 GHz, BER = 10^-10
λ = 0.11 errors/s
μ = average time between errors = 9 s
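The arithmetic of this slide can be checked with a minimal Python sketch. The function names are illustrative and not part of the original slides; the input values (f = 1.1 GHz, BER = 10^-10) are the ones quoted above.

```python
def bit_error_frequency(ber, f_hz):
    """BEF (lambda): average number of bit errors per second."""
    return ber * f_hz


def mean_time_between_errors(ber, f_hz):
    """mu = 1 / BEF, in seconds."""
    return 1.0 / bit_error_frequency(ber, f_hz)


# Values quoted on the slide.
lam = bit_error_frequency(1e-10, 1.1e9)
mu = mean_time_between_errors(1e-10, 1.1e9)
print(f"lambda = {lam:.2f} errors/s, mu = {mu:.2f} s")
```

The same two functions can be reused for any other BER or clock frequency of interest.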
Bit errors: Poisson statistics
Bit errors caused by events taking place in a rad hard
environment follow the usual statistics, with these features:
• events take place one after the other and independently of each other
• the average number of events in unit time is constant, equal to λ.
• λ is the average number of events in unit time (frequency or rate)
• μ = 1/λ is the average time distance from one event to the next
Probability density of the time t between two events:
$$p(t) = \frac{1}{\mu} e^{-t/\mu}$$
Cumulative probability:
$$c(t) = 1 - e^{-t/\mu}$$
[Figure: exponential probability pdf, mu = 9.09 (10000 events). Curves: measured pdf, theoretical pdf; x-axis: time (s), from 0 to 100; y-axis: probability, from 0 to 0.12]
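A plot of this kind can be reproduced, under stated assumptions, with a short simulation. The sketch below draws 10000 exponentially distributed intervals with mean 9.09 s (the values in the plot title) and compares the sample with the theoretical curves. The seed and the function names are arbitrary choices, not taken from the slides.

```python
import math
import random

MU = 9.09         # mean time between errors (s), from the plot title
N_EVENTS = 10000  # number of simulated events, from the plot title


def simulate_intervals(mu, n, seed=1):
    """Draw n exponentially distributed time intervals with mean mu."""
    rng = random.Random(seed)  # fixed seed: arbitrary, for repeatability
    return [rng.expovariate(1.0 / mu) for _ in range(n)]


def pdf(t, mu):
    """Theoretical density p(t) = exp(-t/mu) / mu."""
    return math.exp(-t / mu) / mu


def cdf(t, mu):
    """Theoretical cumulative probability c(t) = 1 - exp(-t/mu)."""
    return 1.0 - math.exp(-t / mu)


intervals = simulate_intervals(MU, N_EVENTS)
sample_mean = sum(intervals) / len(intervals)
measured_cdf_at_mu = sum(1 for t in intervals if t <= MU) / len(intervals)

print(f"sample mean = {sample_mean:.2f} s (expected {MU} s)")
print(f"measured c(mu) = {measured_cdf_at_mu:.3f}, theoretical = {cdf(MU, MU):.3f}")
```

About 63% of the intervals are expected to be shorter than μ, since c(μ) = 1 - 1/e.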
Bit Error Rate and time between
errors
The diagram shows how a value for BER translates into the average time between
errors (in case of continuous data exchange) at a fixed operating frequency
[Figure: average time between bits in error at 1.1 GHz. x-axis: bit error rate (BER), from 10^-18 to 10^-10; y-axis: time (s), from 10^0 to 10^10, with reference marks at 1 s, 10 s, 1 minute, 1 hour, 1 day, 1 month, 1 year, 10 years]
e.g. BER = 10^-10: average time between errors = 9 s
Use error correction coding to achieve BER = 10^-16: average time between errors = about 4 months
How to evaluate how much ECC
power is needed?
Assume a command length of 100 bits (actual figures will be 72-90-108 bits)
Assume a reference BER = 10^-10 for each link
For a single link:
correction of 0 bits per frame will deliver BER = 10^-10 -> time between errors: 9 s
correction of 1 bit per frame will deliver BER = 5*10^-17 -> time between errors: years
correction of 2 bits per frame will deliver BER = 2*10^-25 -> time between errors: many years
Binomial formula: probability of having n errors in a frame of m bits, with bit error
probability p
$$p(m,n) = \binom{m}{n} p^n (1-p)^{m-n}$$
$$p(100,2) \approx 5 \cdot 10^{-17} \qquad p(100,3) \approx 2 \cdot 10^{-25}$$
We may argue that an ECC of moderate complexity may be adopted
Observation: such low probability values would require very long simulations
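The two figures quoted from the binomial formula can be reproduced with the sketch below. The function name is illustrative; the frame length (100 bits) and bit error probability (10^-10) are the ones assumed on this slide.

```python
from math import comb


def p_errors(m, n, p):
    """Probability of exactly n bit errors in a frame of m bits."""
    return comb(m, n) * p**n * (1.0 - p) ** (m - n)


M_BITS = 100  # assumed command length
P_BIT = 1e-10  # assumed reference BER

for n_err in (1, 2, 3):
    print(f"p({M_BITS},{n_err}) = {p_errors(M_BITS, n_err, P_BIT):.2e}")
```

A code correcting t bits per frame fails, to leading order, with probability p(m, t+1), which is why p(100,2) and p(100,3) are the relevant terms for t = 1 and t = 2.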
Bit Error Rate reduction in Hamming codes

Probability of a word error for block codes (n = word length, t = no. of corrected bits):
$$p_{ew} = \sum_{i=t+1}^{n} P(n,i) = \sum_{i=t+1}^{n} \binom{n}{i} p^i (1-p)^{n-i} \approx P(n,t+1) = \binom{n}{t+1} p^{t+1} (1-p)^{n-t-1}$$
Probability of a bit error for block codes (n = word length, t = no. of corrected bits):
$$p_{eb} \approx \binom{n-1}{t} p^{t+1}$$
Probability of a bit error for Hamming codes (t = 1, n = 2^m - 1):
$$p_{eb} \approx (n-1) p^2$$
In log scale it is a straight line with slope 2 and intercept log(n-1)
[Figure: Hamming: bit error probability encoded vs BER, m = 3 ... 8, t = 1. x-axis: BER; y-axis: bit error probability encoded; one curve for each value of m]
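The formulas of this slide can be evaluated with a minimal Python sketch such as the following. Function names are illustrative; the loop over m = 3 ... 8 and the input BER of 10^-10 follow the values used in the talk.

```python
from math import comb


def word_error_probability(n, t, p):
    """Exact sum: probability of more than t bit errors in an n-bit word."""
    return sum(comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(t + 1, n + 1))


def bit_error_probability(n, t, p):
    """Leading-order approximation p_eb ~ C(n-1, t) * p**(t+1)."""
    return comb(n - 1, t) * p ** (t + 1)


P_BIT = 1e-10  # reference BER used in the talk

for m in range(3, 9):
    n = 2**m - 1  # Hamming word length
    print(f"m={m} n={n:3d} p_ew={word_error_probability(n, 1, P_BIT):.2e} "
          f"p_eb={bit_error_probability(n, 1, P_BIT):.2e}")
```

For t = 1 the approximation reduces to (n-1) p^2, so each curve differs from the others only by the constant factor n-1.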
Bit Error Rate reduction in shortened
Hamming codes
Due to the 18-bit constraint of the SERDES we must use shortened Hamming codes
[Figure: block diagram of H(31,26) shortened to H(18,13): 13 message bits, 18 coded bits]
For a Hamming code H(n,k) shortened to H(ns,ks), the bit error probability p_short of the shortened code is obtained from p_Hamm, that of the parent code, through the ratio of n and ns
In the above example n=31 ns=18
[Figure: Hamming: simulated and computed ber after coding. Curves: computed, simulated; x-axis from 10^-5 to 10^-2]
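The parameters of a shortened code can be derived as in the sketch below. It assumes, as an illustration, that the parent code is the smallest Hamming code H(2^m - 1, 2^m - 1 - m) whose length is not less than the target length; this choice agrees with the examples in the talk but is not stated on the slides. The function name is hypothetical.

```python
def shortened_hamming(ns):
    """Return (n, k, ns, ks) for the smallest Hamming code that can be
    shortened to length ns. Assumption: smallest suitable parent code."""
    m = 2  # number of parity bits
    while 2**m - 1 < ns:
        m += 1
    n = 2**m - 1       # parent word length
    k = n - m          # parent message length
    removed = n - ns   # message bits dropped by shortening
    return n, k, ns, k - removed


for target in (18, 12, 6):
    n, k, ns, ks = shortened_hamming(target)
    print(f"H({n},{k}) shortened to H({ns},{ks})")
```

Shortening removes message bits only, so the number of parity bits (5 for the 18-bit case) is the same as in the parent code.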
Bit Error Rate reduction in multi-Hamming codes
[Figure: block diagram of an 18-bit word made of two branches: H(15,11) shortened to H(12,8) and H(7,4) shortened to H(6,3)]
Our codes will be made of a combination of shortened Hamming codes
$$p_{multi} = \frac{p_1 \cdot k_{1red} + p_2 \cdot k_{2red}}{k_{1red} + k_{2red}}$$
p_multi = bit error probability for the overall code
p1 = bit error probability of branch no. 1 (shortened Hamming)
k1red = no. of message bits of branch no. 1
p2 = bit error probability of branch no. 2 (shortened Hamming)
k2red = no. of message bits of branch no. 2
[Figure: Multi-Hamming: simulated and computed ber after coding. Curves: computed, simulated; x-axis from 10^-5 to 10^-2]
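The combination rule is a weighted average of the branch probabilities, with the message bits of each branch as weights. A minimal sketch follows; the function name is illustrative and the numerical branch probabilities in the example are hypothetical placeholders, not results from the talk.

```python
def multi_code_ber(branches):
    """Bit error probability of a combined code.

    branches: list of (p_i, k_i) pairs, where p_i is the bit error
    probability of branch i and k_i its number of message bits.
    """
    total_bits = sum(k for _, k in branches)
    return sum(p * k for p, k in branches) / total_bits


# Hypothetical branch probabilities, with 8 and 3 message bits
# as in a H(12,8) + H(6,3) word.
example = [(1e-6, 8), (2e-6, 3)]
print(f"p_multi = {multi_code_ber(example):.3e}")
```

Written this way, the function also covers combinations of more than two branches, which is a straightforward extension of the two-branch formula.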
Bit Error Rate reduction in Reed
Solomon codes
$$p_s = 1 - (1-p)^m$$
p_s = probability that a symbol is in error
p = probability that a bit is in error
m = symbol length
$$p_{ew} \approx \binom{n}{t+1} p_s^{t+1} (1-p_s)^{n-t-1}$$
p_ew = probability that a word made of n symbols is in error
p_ib = probability that a bit of the message is in error after ECC coding
[Equation for p_ib: garbled in extraction. Recoverable parts: a prefactor q/(2(q-1)) and two sums, starting at i = t+1 and i = d+1, of terms in P_s^i (1-P_s)^(n-i)]
[Figure: RS code [linear scale]: m=3 n=7 ber (parameter t). x-axis: ber; y-axis: coded ber]
The same work as for Hamming codes was done to obtain the features of shortened and combined codes
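The symbol and word error probabilities can be evaluated with the sketch below, for the m = 3, n = 7 case of the plot. It assumes that the word error formula is applied to the symbol error probability p_s, which is an interpretation of the slide rather than something it states explicitly; the input BER of 10^-3 is an arbitrary example inside the plotted range.

```python
import math
from math import comb


def symbol_error_probability(p, m):
    """p_s = 1 - (1 - p)**m, computed in a numerically stable way."""
    return -math.expm1(m * math.log1p(-p))


def rs_word_error_probability(n, t, ps):
    """Leading term: t+1 symbols in error in a word of n symbols.
    Assumption: the per-symbol probability ps is used here."""
    return comb(n, t + 1) * ps ** (t + 1) * (1.0 - ps) ** (n - t - 1)


M_SYMBOL = 3  # bits per symbol
N_WORD = 7    # symbols per word
P_BIT = 1e-3  # example input BER (arbitrary)

ps = symbol_error_probability(P_BIT, M_SYMBOL)
for t in (1, 2, 3):
    print(f"t={t} p_ew={rs_word_error_probability(N_WORD, t, ps):.2e}")
```

The use of `expm1` and `log1p` avoids the loss of precision that the direct formula would suffer at very low bit error probabilities such as 10^-10.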
Hamming code: features of a selected test code
transmitted frame of 2x18 bit = 36 bit
no polarity control
codes: 2 x H(15,11) + Hs(6,3) {from H(7,4)}
[Figure: transmit and receive chains.
Transmit: data to transmit (25 bit) -> buffer & encoder (11 bit -> H(15,11) -> 15 bit; 11 bit -> H(15,11) -> 15 bit; 3 bit -> Hs(6,3) -> 6 bit) -> 36 bit -> scrambler -> serdes (2*18 bit) -> serial link.
Receive: serial link -> serdes (2*18 bit) -> descrambler -> 36 bit -> buffer & decoder (H(15,11), H(15,11), Hs(6,3)) -> data to distribute (25 bit)]
n = 2
Ecc = 12 %
Overhead = 44 %
BER 10^-10 -> 10^-19
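The frame accounting of the selected test code can be checked with a few lines of Python. The variable names are illustrative. The last lines apply the approximation p_eb ≈ (n-1) p^2 for Hamming codes to the H(15,11) branches only, as an order-of-magnitude check of the quoted 10^-19; the contribution of the shortened branch is not modeled here.

```python
# (name, coded bits, message bits) for each branch of the frame
BRANCHES = [("H(15,11)", 15, 11), ("H(15,11)", 15, 11), ("Hs(6,3)", 6, 3)]

coded_bits = sum(n for _, n, _ in BRANCHES)
data_bits = sum(k for _, _, k in BRANCHES)
parity_bits = coded_bits - data_bits
overhead = parity_bits / data_bits  # parity bits relative to data bits

print(f"frame = {coded_bits} bit, data = {data_bits} bit, "
      f"overhead = {overhead:.0%}")

# Order-of-magnitude check for the H(15,11) branches, p_eb ~ (n-1) * p**2
p_bit = 1e-10
h15_ber = (15 - 1) * p_bit**2
print(f"H(15,11) estimated BER after coding = {h15_ber:.1e}")
```

The 36 coded bits fill exactly two 18-bit SERDES words, and the 11 parity bits over 25 data bits give the 44 % overhead quoted on the slide.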
Hamming code: features of a selected test
code
[Figure: Multi-Hamming: frequency of errors after coding (clock at 1.1 Gbit/s). x-axis: time between errors before ECC (s), from 10^-1 to 10^1; y-axis: time between errors after ECC (s), from 10^6 to 10^11, with reference marks at 8 days, 1 month, 1 year, 10 years, 100 years, 1000 years]
e.g. f = 1.1 GHz
uncorrected BER = 10^-10: average time between failures 9 s
after coding, corrected BER = 10^-19: average time between failures 244 years
Reed Solomon codes
Similar examples may be made for Reed Solomon codes
We do not show an example here, also because the greater
hardware complexity of both encoding and decoding may lead us to
the Hamming solution, which is:
• simpler in terms of hardware complexity and
• faster in terms of the delays involved
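The simplicity of the Hamming solution can be illustrated with a software sketch of the H(7,4) code, the parent of Hs(6,3). This is a generic textbook construction (parity bits at positions 1, 2 and 4, syndrome equal to the position of the bit in error) and not necessarily the bit ordering used in the actual hardware.

```python
DATA_POSITIONS = (3, 5, 6, 7)   # 1-based positions of the message bits
PARITY_POSITIONS = (1, 2, 4)    # 1-based positions of the parity bits


def hamming74_encode(data):
    """Encode 4 message bits into a 7-bit word (list of 0/1)."""
    code = [0] * 8  # index 0 unused, positions 1..7
    for pos, bit in zip(DATA_POSITIONS, data):
        code[pos] = bit
    for p in PARITY_POSITIONS:
        # Parity over all positions whose index has bit p set.
        code[p] = sum(code[i] for i in range(1, 8) if i & p and i != p) % 2
    return code[1:]


def hamming74_decode(word):
    """Correct at most one bit error and return the 4 message bits."""
    code = [0] + list(word)
    syndrome = 0
    for i in range(1, 8):
        if code[i]:
            syndrome ^= i  # XOR of the positions holding a 1
    if syndrome:
        code[syndrome] ^= 1  # the syndrome is the position in error
    return [code[i] for i in DATA_POSITIONS]


message = [1, 0, 1, 1]
received = hamming74_encode(message)
received[4] ^= 1  # inject a single bit error
print(hamming74_decode(received) == message)
```

Encoding and decoding use only XOR operations over fixed bit positions, which is why the hardware cost and delay are low.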
Conclusions
• We have developed a thorough statistical analysis of the bit error probability
after ECC coding for complex, shortened and mixed codes, both Hamming and
Reed Solomon, with some simulation
• We must point out that the above considerations on error rates apply to a
single link. The 500-1000 link multiplicity will raise the bit error frequency
in the apparatus by that factor.
• Even taking this into account, we might argue that only a moderate
correction capability is needed in order to reduce the error rate to a suitable value.
• This will be assessed as soon as we have precise figures on the error
rate in our rad hard environment
• We will therefore revert to very simple ECC structures, fully compatible with
a proper hardware implementation in terms of both available hardware
resources and processing time
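The effect of the link multiplicity can be estimated with the sketch below. It assumes identical and independent links, so that the error frequencies add up and the average time between errors is divided by the number of links. The single-link figure of 244 years is the one quoted for the selected test code; the function name is illustrative.

```python
SINGLE_LINK_YEARS = 244.0  # single-link time between failures after coding


def apparatus_time_between_errors(single_link_time, n_links):
    """Average time between errors for n_links identical, independent links."""
    return single_link_time / n_links


for n_links in (500, 1000):
    years = apparatus_time_between_errors(SINGLE_LINK_YEARS, n_links)
    print(f"{n_links} links: {years:.3f} years ({years * 365.25:.0f} days)")
```

Under these assumptions the apparatus as a whole would see one residual error every few months, which is the kind of figure to compare with the data quality requirements.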
To be done
• obtain precise figures on the bit error rate in our rad hard environment
• define and analyze Hamming (Reed Solomon) coding structures with the
purpose of reducing both silicon area and processing time of the
implementation
• analyze thoroughly the impact of error rates on the performance of the overall
apparatus and the related data quality
• evaluate practical implementations