+++
title = 'Datacenter networking'
+++

# Datacenter networking
Why not a single giant switch? Limited port density, broadcast storms, and a lack of isolation.

Tree-based data center network:

![Diagram of tree-based network](tree-based-datacenter-network.png)

The bottleneck is in the top two layers.

Performance metrics:
- bisection width: minimum number of links that must be cut to divide the network into two halves
- bisection bandwidth: minimum total bandwidth of the links that divide the network into two halves
- full bisection bandwidth: the nodes in one half can all communicate with the nodes in the other half at the same time, at full link rate

Oversubscription: ratio -- (worst-case required aggregate bandwidth among end-hosts) : (total bisection bandwidth of network topology)
- 1:1 -- all hosts can use full uplink capacity
- 5:1 -- only 20% of host bandwidth may be available

## Fat-tree
Fat-tree topology: emulate a single huge switch with many smaller switches

![Fat-tree topology diagram](fat-tree-topology.png)

Needs to be backward compatible with IP/Ethernet, where routing algorithms naively choose the shortest path, leading to bottlenecks. The wiring also gets complex.

Addressing:
- 10.0.0.0/8 private address block
- pod switches: 10.pod.switch.1
- core switches: 10.k.j.i, with j and i the switch's coordinates among the (k/2)² core switches
- hosts: 10.pod.switch.id

Forwarding with two-level lookup table:
- prefixes for forwarding intra-pod traffic
- suffixes for forwarding inter-pod traffic

Routing:
- prefixes in the two-level lookup table prevent intra-pod traffic from leaving the pod
- each host-to-host communication has a single static path

Flow collisions can lead to bottlenecks:
- use equal-cost multi-path (ECMP): static path for each flow
- or flow scheduling: have a centralised scheduler assign flows to paths

To solve the cabling issue, organize switches into pod racks.
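The addressing scheme and the two-level lookup from this section can be sketched in a few lines of Python. This is only an illustration, not a reference implementation: the function names are made up, and the index ranges (edge switches numbered 0 to k/2-1, aggregation switches k/2 to k-1, host IDs starting at 2, core coordinates starting at 1) and the uplink-spreading formula are assumptions following the convention of the original fat-tree proposal.

```python
def fat_tree_addresses(k):
    """Enumerate switch and host addresses for a k-ary fat-tree (k even)."""
    if k < 2 or k % 2 != 0:
        raise ValueError("k must be a positive even number")
    half = k // 2
    # Assumed numbering: in each pod, switches 0..k/2-1 are edge switches
    # and switches k/2..k-1 are aggregation switches.
    pod_switches = [f"10.{pod}.{sw}.1" for pod in range(k) for sw in range(k)]
    # (k/2)^2 core switches, addressed by their grid coordinates j and i.
    core_switches = [
        f"10.{k}.{j}.{i}" for j in range(1, half + 1) for i in range(1, half + 1)
    ]
    # Each edge switch serves k/2 hosts; host IDs assumed to start at 2.
    hosts = [
        f"10.{pod}.{sw}.{host_id}"
        for pod in range(k)
        for sw in range(half)
        for host_id in range(2, half + 2)
    ]
    return pod_switches, core_switches, hosts


def aggregation_tables(k, pod, switch):
    """Build the two-level table of one aggregation switch (hypothetical layout).

    Ports 0..k/2-1 face the edge switches of the pod, ports k/2..k-1 face core.
    """
    half = k // 2
    # First level: one /24 prefix per edge switch keeps intra-pod traffic local.
    prefixes = {f"10.{pod}.{edge}": edge for edge in range(half)}
    # Second level: the host-ID suffix spreads inter-pod traffic over uplinks.
    suffixes = {
        host_id: (host_id - 2 + switch) % half + half
        for host_id in range(2, half + 2)
    }
    return prefixes, suffixes


def lookup(dst, prefixes, suffixes):
    """Return the output port for dst: prefix match first, then suffix match."""
    octets = dst.split(".")
    subnet = ".".join(octets[:3])
    if subnet in prefixes:
        return prefixes[subnet]
    return suffixes[int(octets[3])]


if __name__ == "__main__":
    pods, cores, hosts = fat_tree_addresses(4)
    print(len(pods), len(cores), len(hosts))  # 16 4 16
    prefixes, suffixes = aggregation_tables(4, pod=0, switch=2)
    print(lookup("10.0.1.2", prefixes, suffixes))  # 1 (stays inside pod 0)
    print(lookup("10.2.0.3", prefixes, suffixes))  # 3 (uplink towards core)
```

Because the suffix table keys on the destination host ID rather than on the source, every host pair keeps a single static path, while destinations with different IDs are spread over different uplinks.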

Unaddressed issues:
- no support for seamless VM migration, because IPs are location-dependent
- plug-and-play not possible: IPs are pre-assigned to switches and hosts

## PortLand: layer 2 system
Intuition: separate node location from node identifier.
- IP is the node identifier
- Pseudo MAC (PMAC) is the node location

The fabric manager maintains the IP → PMAC mapping and facilitates fault tolerance.


Switches self-discover their location by exchanging Location Discovery Messages (LDMs):
- tree level/role: based on neighbor identity
- pod number: obtained from the fabric manager
- position number: aggregation switches help top-of-rack switches choose a unique number

![PortLand workflow](portland.png)
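To make the identifier/location split concrete, here is a minimal sketch of a PMAC encoding and an IP → PMAC table. The 48-bit layout used here (16-bit pod, 8-bit position, 8-bit port, 16-bit VM ID) follows the format described in the PortLand paper but is not stated in these notes, so treat it as an assumption; the `FabricManager` class and its method names are hypothetical.

```python
def encode_pmac(pod, position, port, vmid):
    """Pack location fields into a PMAC string (assumed layout: 16/8/8/16 bits)."""
    if not (0 <= pod < 2**16 and 0 <= position < 2**8
            and 0 <= port < 2**8 and 0 <= vmid < 2**16):
        raise ValueError("field out of range")
    value = (pod << 32) | (position << 24) | (port << 16) | vmid
    return ":".join(f"{byte:02x}" for byte in value.to_bytes(6, "big"))


def decode_pmac(pmac):
    """Unpack a PMAC string into (pod, position, port, vmid)."""
    value = int(pmac.replace(":", ""), 16)
    return (
        (value >> 32) & 0xFFFF,
        (value >> 24) & 0xFF,
        (value >> 16) & 0xFF,
        value & 0xFFFF,
    )


class FabricManager:
    """Hypothetical in-memory IP -> PMAC table, as kept by the fabric manager."""

    def __init__(self):
        self._ip_to_pmac = {}

    def register(self, ip, pmac):
        # Called when a host appears or moves: the IP stays, the PMAC changes.
        self._ip_to_pmac[ip] = pmac

    def resolve(self, ip):
        # Returns None when the mapping is unknown.
        return self._ip_to_pmac.get(ip)


if __name__ == "__main__":
    manager = FabricManager()
    manager.register("10.2.1.2", encode_pmac(pod=2, position=1, port=0, vmid=1))
    print(manager.resolve("10.2.1.2"))  # 00:02:01:00:00:01

    # The VM migrates to pod 3: same IP (identifier), new PMAC (location).
    manager.register("10.2.1.2", encode_pmac(pod=3, position=0, port=1, vmid=1))
    print(manager.resolve("10.2.1.2"))  # 00:03:00:01:00:01
```

Since the location lives entirely in the PMAC, a migrated VM keeps its IP address and only the table entry has to change, which is what the fat-tree addressing above could not offer.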