lectures.alex.balgavy.eu

Lecture notes from university.
git clone git://git.alex.balgavy.eu/lectures.alex.balgavy.eu.git

data-center-transport.md (2021B)


+++
title = 'Data center transport'
+++

## Data center transport
TCP incast problem:
- a datacenter application runs on multiple servers
- it uses a scatter-gather work pattern (a client requests data from many servers, and all servers respond at roughly the same time)
- commodity switches usually have shallow buffers → the queue at the switch port facing the client overflows when the responses come back
- the overflow leads to packet loss, which the servers only recognize after a timeout, at which point they all retransmit at the same time

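
The buffer overflow described above can be illustrated with a toy model. This is a minimal sketch, not a faithful switch simulation: the function name `incast_drops`, the tick-based timing, and all parameter values are assumptions made for illustration.

```python
def incast_drops(num_servers: int, packets_per_server: int,
                 buffer_packets: int, drain_per_tick: int = 1) -> int:
    """Count packets dropped at one switch output port (toy model).

    Assumption: in every tick each server sends one packet towards the
    same client-facing port; the port can queue at most `buffer_packets`
    and forwards `drain_per_tick` packets per tick.
    """
    queue = 0
    dropped = 0
    for _ in range(packets_per_server):
        arrivals = num_servers
        space = buffer_packets - queue
        accepted = min(arrivals, space)
        dropped += arrivals - accepted      # shallow buffer overflows
        queue += accepted
        queue -= min(queue, drain_per_tick)  # port forwards to the client
    return dropped


# One server never overruns the buffer; eight synchronized servers do.
print(incast_drops(num_servers=1, packets_per_server=10, buffer_packets=4))
print(incast_drops(num_servers=8, packets_per_server=10, buffer_packets=4))
```
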
Ethernet flow control: PAUSE frame
- an overwhelmed Ethernet receiver can send a "PAUSE" frame to the sender
- upon receiving a PAUSE frame, the sender stops transmitting for the amount of time specified in the frame
- but it was not designed for switches, and it blocks all transmission at port level

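
To get a feel for how long a PAUSE can last: the pause time in the frame is a 16-bit count of quanta, where one quantum is 512 bit times on the link. The helper below is a small sketch of that arithmetic; the function name and the 10 Gb/s link speed are assumptions for illustration.

```python
PAUSE_QUANTUM_BITS = 512  # one pause quantum = 512 bit times


def pause_duration_seconds(quanta: int, link_bps: float) -> float:
    """Convert a PAUSE frame's pause time (in quanta) to seconds."""
    if not 0 <= quanta <= 0xFFFF:
        raise ValueError("pause time is a 16-bit field")
    return quanta * PAUSE_QUANTUM_BITS / link_bps


# Longest possible pause on an (assumed) 10 Gb/s link, in milliseconds.
print(pause_duration_seconds(0xFFFF, 10e9) * 1e3)
# A pause time of 0 tells the sender it may resume immediately.
print(pause_duration_seconds(0, 10e9))
```
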
Priority-based flow control (PFC)
- 8 virtual traffic lanes (priorities), each of which can be paused selectively
- the pause duration is configurable
- but: only 8 lanes, unfairness, and deadlocks in large networks

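
The difference from a plain PAUSE frame is that only one lane stops. A minimal sketch of that bookkeeping, assuming a hypothetical `PfcPort` class that tracks a pause deadline per priority (times are in arbitrary units):

```python
class PfcPort:
    """Toy sender-side port state: one pause deadline per priority lane."""

    NUM_PRIORITIES = 8

    def __init__(self) -> None:
        self.paused_until = [0.0] * self.NUM_PRIORITIES

    def pause(self, priority: int, now: float, duration: float) -> None:
        # Only the named lane is stopped; the other seven are unaffected.
        self.paused_until[priority] = now + duration

    def can_send(self, priority: int, now: float) -> bool:
        return now >= self.paused_until[priority]


port = PfcPort()
port.pause(priority=3, now=0.0, duration=1.0)
print(port.can_send(3, now=0.5))  # lane 3 is paused
print(port.can_send(2, now=0.5))  # lane 2 keeps flowing
```
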
### DCTCP
- pass information about switch queue buildup to the senders
- senders react by slowing down transmission

Explicit congestion notification (ECN)
- standardized way of signalling the presence of congestion
- part of the IP packet header, supported by most commodity switches
- for a queue of size N: when queue occupancy goes beyond a threshold K, mark the passing packet's ECN field as "congestion experienced"

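
The threshold rule can be sketched as a single enqueue decision. The two-bit ECN codepoints are the standard ones; the function name `enqueue` and its arguments are assumptions for illustration, and the handling of packets that are not ECN-capable is deliberately simplified.

```python
# Two-bit ECN field in the IP header.
NOT_ECT, ECT_1, ECT_0, CE = 0b00, 0b01, 0b10, 0b11


def enqueue(queue_len: int, capacity_n: int, threshold_k: int,
            ecn_field: int) -> tuple[bool, int]:
    """Return (accepted, ecn_field_after) for a packet arriving at the queue."""
    if queue_len >= capacity_n:
        return False, ecn_field          # queue full: packet is dropped
    if queue_len > threshold_k and ecn_field in (ECT_0, ECT_1):
        return True, CE                  # beyond K: mark instead of dropping
    # Simplification: packets that are not ECN-capable are left unmarked here.
    return True, ecn_field


print(enqueue(queue_len=5, capacity_n=100, threshold_k=20, ecn_field=ECT_0))
print(enqueue(queue_len=21, capacity_n=100, threshold_k=20, ecn_field=ECT_0))
```
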
DCTCP main idea
- switch: marks packets with ECN once queue occupancy exceeds the threshold K
- standard ECN receiver: sets the ECE (ECN-Echo) flag on ACKs until the sender acknowledges with the CWR (Congestion Window Reduced) flag
- DCTCP receiver: sets ECE only on ACKs that correspond to ECN-marked packets
- sender: estimates the fraction of packets marked with ECN over a running window and slows down in proportion to it

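
The sender side can be sketched as follows. The update rule used here is the commonly published DCTCP one: a moving average `alpha` of the marked fraction, and a window cut of `alpha / 2`. The class name, the per-window callback, the gain `g = 1/16`, and the simple additive increase are assumptions for illustration.

```python
class DctcpSender:
    """Toy DCTCP sender: reacts to the *fraction* of ECN-marked packets."""

    def __init__(self, cwnd: float = 10.0, g: float = 1 / 16) -> None:
        self.cwnd = cwnd    # congestion window, in packets
        self.g = g          # weight of the newest observation
        self.alpha = 0.0    # running estimate of the marked fraction

    def on_window_acked(self, acked: int, marked: int) -> None:
        """Call once per window of data with the ACK counts seen."""
        fraction = marked / acked if acked else 0.0
        self.alpha = (1 - self.g) * self.alpha + self.g * fraction
        if marked > 0:
            # Mild congestion gives a small cut; alpha = 1 halves the window.
            self.cwnd = max(1.0, self.cwnd * (1 - self.alpha / 2))
        else:
            self.cwnd += 1.0  # assumed additive increase per window


sender = DctcpSender(cwnd=100.0)
sender.on_window_acked(acked=100, marked=0)
sender.on_window_acked(acked=100, marked=100)
print(sender.alpha, sender.cwnd)
```
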
### TIMELY
Use round-trip time (RTT) as the indication of congestion
- RTT is a multi-bit signal, and no explicit switch support is required for marking
- assumes that: the TX NIC can generate completion timestamps, the RX NIC can generate ACKs in hardware, and ACKs go through a high-priority queue at switches

Key concept:
- use the gradient of RTTs
- positive → rising RTT → queue buildup
- negative → decreasing RTT → queue depletion
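
A gradient-driven rate update could look like the sketch below. It is a simplification and covers only the gradient part described above. The class name, the smoothing weight, the decrease factor `beta`, the additive step, and normalizing the gradient by a minimum RTT are all assumptions for illustration.

```python
class GradientRateControl:
    """Toy RTT-gradient rate control in the spirit of TIMELY."""

    def __init__(self, rate_mbps: float, min_rtt_us: float,
                 ewma_weight: float = 0.5, beta: float = 0.8,
                 step_mbps: float = 10.0) -> None:
        self.rate = rate_mbps
        self.min_rtt = min_rtt_us
        self.w = ewma_weight
        self.beta = beta
        self.step = step_mbps
        self.prev_rtt = None
        self.rtt_diff = 0.0   # smoothed difference between RTT samples

    def on_rtt_sample(self, rtt_us: float) -> float:
        if self.prev_rtt is None:       # first sample: no gradient yet
            self.prev_rtt = rtt_us
            return self.rate
        new_diff = rtt_us - self.prev_rtt
        self.prev_rtt = rtt_us
        self.rtt_diff = (1 - self.w) * self.rtt_diff + self.w * new_diff
        gradient = self.rtt_diff / self.min_rtt
        if gradient <= 0:
            self.rate += self.step      # RTT flat or falling: queue draining
        else:
            # RTT rising: queue building, back off in proportion (clamped).
            self.rate *= 1 - self.beta * min(gradient, 1.0)
        return self.rate


ctl = GradientRateControl(rate_mbps=1000.0, min_rtt_us=20.0, ewma_weight=1.0)
for rtt in (50.0, 60.0, 55.0):
    print(ctl.on_rtt_sample(rtt))
```
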