lectures.alex.balgavy.eu

Lecture notes from university.
git clone git://git.alex.balgavy.eu/lectures.alex.balgavy.eu.git
Log | Files | Refs | Submodules

fault-tolerance.md (4787B)


      1 +++
      2 title = 'Fault tolerance'
      3 +++
      4 # Fault tolerance
      5 A process may crash -- unexpectedly stop executing events.
      6 
      7 Assume network is complete -- there's a bidirectional channel between any two processes.
      8 So process crashes never make remaining network disconnected.
      9 Assume crashing of processes can't be observed.
     10 
     11 ## Consensus
     12 Binary consensus
     13 - initially all processes randomly select 0 or 1
     14 - eventually all alive processes uniformly decide 0 or 1
     15 
     16 Assumptions:
     17 - validity: if all processes randomly select same initial value b, then all alive processes decide b
     18 - k-crash consensus: at most k processes may crash
     19 
     20 No algorithm for 1-crash consensus always terminates.
     21 
     22 b-potent set S of processes: if by only executing events at processes in S, some process in S can decide b
     23 
     24 No Las Vegas algorithm for k ≥ N/2 consensus
     25 
     26 ### Bracha-Toueg crash consensus algorithm
     27 Let k < N/2.
     28 - initially each alive process randomly selects 0 or 1, with weight 1
     29 - in round n, at each alive undecided process p:
     30     - p sends [n, value(p), weight(p)] to all processes including itself
     31     - p waits until N-k messages [n, b, w] have arrived (dismisses/stores messages from earlier/future rounds)
     32         - if w > N/2, then value(p) = b
     33         - else, value(p) = 0 if most messages voted 0, value(p) = 1 otherwise
     34         - weight(p) = number of incoming votes for value(p) in round n
     35     - if w > N/2 for more than k incoming messages, then p decides b
     36 - if p decides b, it broadcasts [n+1, b, N-k] and [n+2, b, n-k] and terminates
     37 
     38 Let k < N/2.
     39 This is a Las Vegas algorithm that terminates with probability 1.
     40 
     41 ## Failure detection
     42 Failure detector at process tracks which process have (or may have) crashed.
     43 Given an upper bound on network latency and heartbeat messages, one can implement a failure detector.
     44 For this setting, terminating crash consensus algorithms exist.
     45 
     46 Assume time domain with total order.
     47 - F(t) is set of crashed processes at time t ("failure pattern")
     48 - t1 < t2 ⇒ F(t1) ⊆ F(t2)
     49 - Assume processes can't observe F(t).
     50 
     51 H(p, t) is set of processes that p suspects to be crashed at time t ("failure detector history")
     52 
     53 Require that failure detectors are complete: from some time onward, each crashed process is suspected by each alive process.
     54 
     55 "strongly accurate" failure detector: if only crashed processes are ever suspected
     56 - Assumptions:
     57     - each alive process broadcasts `alive` every v time units
     58     - dmax is a known upper bound on network latency
     59 
     60 - Process from which no message is received for v + dmax time units has crashed
     61 
     62 "weakly accurate": if some process is never suspected by any process
     63 - Processes numbered p0..p(N-1)
     64 - Initially, each process randomly selects 0 or 1.
     65 - In round n:
     66     - pn (if alive) broadcasts its value
     67     - each process waits
     68         - for incoming message from pn, in which case it adopts value of pn
     69         - or until it suspects that pn has crashed
     70 - After round N-1, each alive process decides for its value.
     71 
     72 "eventually strongly accurate": if from some time onward, only crashed processes are suspected
     73 - assumptions:
     74     - each alive process broadcasts `alive` every v time units
     75     - there is unknown upper bound on network latency
     76 - each process q initially guesses as network latency dq = 1
     77 - if q receives no message from p for v+dq time units, then q suspects that p has crashed
     78 - if q receives a message from a suspected process p, then p is no longer suspected, and dq = dq+1
     79 - let k ≥ N/2, there is no Las Vegas algorithm for k-crash consensus based on eventually strongly accurate failure detector
     80 
     81 "eventually weakly accurate": if from some time onward, some process is never suspected
     82 - let k < N/2, eventually weakly accurate failure detector used for k-crash consensus
     83 - Chandra-Toueg k-crash consensus algorithm
     84     - each process q records last round lu(q) in which it updated value(q)
     85     - initially value(q) ∈ {0,1} and lu(q) = -1
     86     - processes numbered p0..p(N-1)
     87     - round n coordinated by pc with c = n mod N
     88     - in round n, each alive process q sends [vote, n, value(q), lu(q)] to pc
     89     - pc (if alive) waits until N-k such messages arrived, and selects one [vote, n, b, l] with l as large as possible
     90         - value(pc) = b, lu(pc) = n
     91         - pc broadcasts [value, n, b]
     92     - each alive process q waits
     93         - either until [value, n, b] arrives
     94             - then value(q) = b, lu(q) = n
     95             - and q sends [ack, n] to pc
     96         - or until it suspects pc crashed
     97             - then q sends [nack, n] to pc
     98     - if pc receives more than k messages [ack, n], then pc decides b and broadcasts [decide, b]
     99     - an undecided process that receives [decide, b] decides b
    100 - let k < N/2, the Chandra-Toueg algorithm is an always terminating k-crash consensus algorithm