+++
title = 'Data center transport'
+++

## Data center transport
TCP incast problem:
- datacenter application runs on multiple servers
- uses a scatter-gather work pattern (client requests data from a bunch of servers, all servers respond)
- commodity switches usually have shallow buffers → queue capacity is overrun at the switch when the data comes back to the client
- the overflow leads to packet loss, which servers detect only after a timeout, at which point all servers retransmit at the same time

Ethernet flow control: pause frame
- overwhelmed Ethernet receiver can send a "PAUSE" frame to the sender
- upon receiving a PAUSE frame, the sender stops transmission for some amount of time
- but, not designed for switches, and blocks all transmission at port level

Priority-based flow control
- 8 virtual traffic lanes, each of which can be selectively stopped
- timeout is configurable
- but, only 8 lanes, unfairness, and deadlocks in large networks

### DCTCP
- pass information about switch queue buildup to senders
- at sender, react by slowing down transmission

Explicit congestion notification
- standardized way of signaling the presence of congestion
- part of IP packet header, supported by most commodity switches
- for queue size of N: when queue occupancy goes beyond K, mark passing packet's ECN bit as "yes"

DCTCP main idea
- switch: marks packets with ECN once queue occupancy exceeds the threshold K
- ECN receiver: marks ACKs with ECE (ECN echo) flag, until sender ACKs back using CWR (congestion window reduced) flag
- DCTCP receiver: marks only the ACKs corresponding to ECN-marked packets
- sender: estimates the fraction of packets marked with ECN in a running window, and reduces its window in proportion to that fraction

### TIMELY
Use round trip time (RTT) as indication of congestion
- RTT is a multi-bit signal (unlike the single ECN bit), and no explicit switch support is required to do marking
- assumes that: TX NIC can generate completion timestamps, RX NIC can generate ACKs in hardware, at switches ACKs go through high-priority queue

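To illustrate the timestamp assumption for TIMELY, here is a minimal sketch of how a sender might turn NIC send and completion timestamps into one RTT sample. It assumes the segment's serialization time at the NIC line rate is subtracted, so that large segments do not inflate the sample. The function name `rtt_sample_us` and its parameters are hypothetical and not taken from these notes.

```python
def rtt_sample_us(t_send_us: float, t_completion_us: float,
                  segment_bytes: int, line_rate_gbps: float) -> float:
    """Return one RTT sample in microseconds (hypothetical helper).

    t_send_us:        NIC timestamp when the segment was handed to the wire
    t_completion_us:  NIC timestamp when the ACK for the segment arrived
    """
    # 1 Gbit/s equals 1000 bits per microsecond.
    bits_per_us = line_rate_gbps * 1000.0
    # Time needed just to put the segment on the wire.
    serialization_us = (segment_bytes * 8) / bits_per_us
    # What remains is propagation plus queuing delay.
    return (t_completion_us - t_send_us) - serialization_us
```

For example, a 12500-byte segment on an assumed 10 Gbit/s link takes 10 µs to serialize, so a 100 µs send-to-completion interval yields a 90 µs sample.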

Key concept:
- use gradient of RTTs
- positive gradient → rising RTT → queue buildup
- negative gradient → decreasing RTT → queue depletion
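The RTT gradient idea used by TIMELY can be sketched as follows. This is an illustrative sketch only: it tracks a smoothed difference between consecutive RTT samples and maps its sign to the two cases above. The class name `RttGradient`, the `classify` helper, and the smoothing weight are assumptions made for the example; the sketch leaves out any normalization and the actual rate adjustment.

```python
class RttGradient:
    """Tracks a smoothed gradient of RTT samples (illustrative sketch)."""

    def __init__(self, ewma_weight: float = 0.5):
        self.ewma_weight = ewma_weight  # assumed smoothing weight
        self.prev_rtt = None            # last RTT sample seen
        self.smoothed_diff = 0.0        # smoothed RTT difference

    def update(self, rtt: float) -> float:
        """Feed one RTT sample and return the current smoothed gradient."""
        if self.prev_rtt is None:
            # First sample: no difference can be computed yet.
            self.prev_rtt = rtt
            return 0.0
        diff = rtt - self.prev_rtt
        self.prev_rtt = rtt
        w = self.ewma_weight
        # Exponentially weighted moving average of the RTT differences.
        self.smoothed_diff = (1 - w) * self.smoothed_diff + w * diff
        return self.smoothed_diff


def classify(gradient: float) -> str:
    """Map the sign of the gradient to the queue state it suggests."""
    if gradient > 0:
        return "queue buildup"
    if gradient < 0:
        return "queue depletion"
    return "stable"
```

A sender built on this would lower its rate while `classify` reports queue buildup and raise it again once the gradient turns negative.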
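For comparison with TIMELY, the DCTCP sender's reaction to ECN marks can be sketched in the same style. The sketch assumes the commonly described DCTCP rule: once per window, the sender updates a running estimate `alpha` of the marked fraction and, if any packet was marked, scales its congestion window by `1 - alpha / 2`. The class name `DctcpSender`, the method name, and the default gain of 1/16 are illustrative assumptions.

```python
class DctcpSender:
    """Sender-side DCTCP window adjustment (illustrative sketch)."""

    def __init__(self, cwnd: float, gain: float = 1.0 / 16):
        self.cwnd = cwnd    # congestion window, in packets
        self.gain = gain    # assumed weight given to the newest window
        self.alpha = 0.0    # running estimate of the marked fraction

    def on_window_acked(self, acked: int, marked: int) -> None:
        """Call once per window with ACK counts for that window.

        acked:  packets acknowledged in the window
        marked: how many of those ACKs echoed an ECN mark
        """
        if acked <= 0:
            return
        fraction = marked / acked
        # Running-window estimate of how many packets were marked.
        self.alpha = (1 - self.gain) * self.alpha + self.gain * fraction
        if marked > 0:
            # Slow down in proportion to the extent of congestion.
            self.cwnd = max(1.0, self.cwnd * (1 - self.alpha / 2))
```

With few marks `alpha` stays small and the window barely shrinks; if every packet is marked for many windows, `alpha` approaches 1 and the cut approaches a halving of the window.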