lectures.alex.balgavy.eu

Lecture notes from university.
git clone git://git.alex.balgavy.eu/lectures.alex.balgavy.eu.git

commit 6e4792dc7b5144f5e486a6ffd8b38d376580a059
parent 39535b597dbad843cd9578d06ace717723ef81a0
Author: Alex Balgavy <alex@balgavy.eu>
Date:   Sun, 19 Dec 2021 23:23:02 +0100

ACN notes

Diffstat:
Mconfig.toml | 4++--
Mcontent/acn-notes/_index.md | 24+++++++++++++++---------
Acontent/acn-notes/data-center-networking/fat-tree-4-ports.png | 0
Acontent/acn-notes/data-center-networking/index.md | 56++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Acontent/acn-notes/data-center-networking/portland-flow-diagram.png | 0
Acontent/acn-notes/data-center-transport.md | 45+++++++++++++++++++++++++++++++++++++++++++++
Dcontent/acn-notes/data-structures-and-algorithms-used-in-network-designs.md | 47-----------------------------------------------
Dcontent/acn-notes/datacenter-networking/fat-tree-topology.png | 0
Dcontent/acn-notes/datacenter-networking/index.md | 67-------------------------------------------------------------------
Dcontent/acn-notes/datacenter-networking/portland.png | 0
Dcontent/acn-notes/datacenter-networking/tree-based-datacenter-network.png | 0
Dcontent/acn-notes/datacenter-transport.md | 38--------------------------------------
Acontent/acn-notes/dns.md | 19+++++++++++++++++++
Acontent/acn-notes/in-network-computing.md | 42++++++++++++++++++++++++++++++++++++++++++
Mcontent/acn-notes/intro.md | 33++++++++++-----------------------
Acontent/acn-notes/ip-routing.md | 33+++++++++++++++++++++++++++++++++
Acontent/acn-notes/local-network-ethernet-arp.md | 31+++++++++++++++++++++++++++++++
Acontent/acn-notes/machine-learning-for-networking.md | 39+++++++++++++++++++++++++++++++++++++++
Mcontent/acn-notes/network-automation.md | 63+++++++++++++++++++++------------------------------------------
Acontent/acn-notes/network-function-virtualization.md | 27+++++++++++++++++++++++++++
Acontent/acn-notes/network-monitoring.md | 28++++++++++++++++++++++++++++
Mcontent/acn-notes/network-transport.md | 104++++++++++++++++++++++++++++++++++++++++---------------------------------------
Dcontent/acn-notes/networking-basics/berkeley-sockets.png | 0
Dcontent/acn-notes/networking-basics/index.md | 97-------------------------------------------------------------------------------
Dcontent/acn-notes/networking-basics/tcp.png | 0
Acontent/acn-notes/networking-data-structures-and-algorithms.md | 21+++++++++++++++++++++
Acontent/acn-notes/programmable-data-plane-p4.md | 28++++++++++++++++++++++++++++
Acontent/acn-notes/software-defined-networking/abstractions-in-sdn-diagram.png | 0
Dcontent/acn-notes/software-defined-networking/covisor-design-overview.png | 0
Mcontent/acn-notes/software-defined-networking/index.md | 138+++++++++++++++++++++++++++-----------------------------------------------------
Dcontent/acn-notes/software-defined-networking/network-planes-diagram.png | 0
Dcontent/acn-notes/software-defined-networking/openflow-diagram.png | 0
Dcontent/acn-notes/software-defined-networking/openflow-example.png | 0
Dcontent/acn-notes/software-defined-networking/sdn-architecture.png | 0
Dcontent/acn-notes/software-defined-networking/slicing-layer.png | 0
Dcontent/acn-notes/software-defined-networking/traditional-vs-sdn.png | 0
36 files changed, 516 insertions(+), 468 deletions(-)

diff --git a/config.toml b/config.toml @@ -1,5 +1,5 @@ # The URL the site will be built for -base_url = "https://lectures.alex.balgavy.eu" +base_url = "lectures.alex.balgavy.eu" title = "Lecture Notes" description = "My lecture notes. Primarily for myself, maybe useful for others." default_language = "en" @@ -14,7 +14,7 @@ build_search_index = true # Whether to do syntax highlighting # Theme can be customised by setting the `highlight_theme` variable to a theme supported by Zola highlight_code = true -extra_syntaxes = ["syntaxes"] +extra_syntaxes_and_themes = ["syntaxes"] smart_punctuation = true diff --git a/content/acn-notes/_index.md b/content/acn-notes/_index.md @@ -1,13 +1,19 @@ +++ title = 'Advanced Computer Networks' +++ - # Advanced Computer Networks -1. [Intro](intro) -2. [Networking basics](networking-basics) -3. [Data structures and algorithms used in network designs](data-structures-and-algorithms-used-in-network-designs) -4. [Network transport](network-transport) -5. [Datacenter networking](datacenter-networking) -6. [Datacenter transport](datacenter-transport) -7. [Software-defined networking](software-defined-networking) -8. 
[Network automation](network-automation) +- [Intro](intro) +- [DNS](dns) +- [IP routing](ip-routing) +- [Local network: Ethernet, ARP](local-network-ethernet-arp) +- [Networking data structures and algorithms](networking-data-structures-and-algorithms) +- [Network transport](network-transport) +- [Data center networking](data-center-networking) +- [Data center transport](data-center-transport) +- [Software defined networking](software-defined-networking) +- [Network automation](network-automation) +- [Network function virtualization](network-function-virtualization) +- [Programmable data plane (P4)](programmable-data-plane-p4) +- [Network monitoring](network-monitoring) +- [In-network computing](in-network-computing) +- [Machine learning for networking](machine-learning-for-networking) diff --git a/content/acn-notes/data-center-networking/fat-tree-4-ports.png b/content/acn-notes/data-center-networking/fat-tree-4-ports.png Binary files differ. diff --git a/content/acn-notes/data-center-networking/index.md b/content/acn-notes/data-center-networking/index.md @@ -0,0 +1,56 @@ ++++ +title = 'Data center networking' ++++ + +## Data center networking +In data centers, servers are organized in interconnected racks. + +Performance metrics: +- bisection width: minimum number of links cut to divide network in half +- bisection bandwidth: minimum bandwidth of links that divide network in half +- full bisection bandwidth: one half of nodes can communicate at the same time with other half of nodes + +Oversubscription: ratio (worst-case required aggregate bandwidth among end-hosts)/(total bisection bandwidth of topology) + +### Fat-tree topology +Example of 4-port fat-tree topology: + +![Fat-tree topology diagram](./fat-tree-4-ports.png) + +Allows full bisection bandwidth between core and aggregation switches. 
+ +For k-port switches: +- need (5k²/4) switches +- can have k³/4 servers + +Addressing: +- pod switches: 10.pod.switch.1 (pod, switch ∈ [0, k-1]) +- core switches: 10.k.j.i (i and j denote core positions) +- hosts: 10.pod.switch.id + +Forwarding: two-level lookup table +- prefixes for forwarding traffic in pod +- suffixes for forwarding traffic between pods + +Each host-to-host communication has single static path. + +Flow collision happens when multiple flows use same path. +Solutions: +- equal-cost multi-path (ECMP): static path for each flow +- flow scheduling: centralized scheduler assigns flows to paths + + +Challenges & issues +- must be backward compatible with IP/Ethernet +- complex wiring +- no support for seamless VM migration (would break TCP connection) +- plug-and-play not possible, IPs have to be preassigned + +### PortLand +Separate node location (Pseudo MAC) from node identifier (Host IP). + +Fabric manager maintains IP → PMAC mapping. + +Switches self-discover location by exchanging Location Discovery Messages. + +![Portland flow diagram](portland-flow-diagram.png) diff --git a/content/acn-notes/data-center-networking/portland-flow-diagram.png b/content/acn-notes/data-center-networking/portland-flow-diagram.png Binary files differ. 
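The k-port fat-tree sizing and addressing rules in the added notes (5k²/4 switches, k³/4 servers, 10.pod.switch.1 pod-switch addresses) can be sanity-checked with a short sketch; the helper names below are illustrative, not from the notes:

```python
def fat_tree_params(k):
    """Switch and host counts for a fat-tree built from k-port switches."""
    assert k % 2 == 0, "fat-tree requires an even port count"
    core = (k // 2) ** 2        # (k/2)^2 core switches
    agg = k * (k // 2)          # k pods, k/2 aggregation switches each
    edge = k * (k // 2)         # k pods, k/2 edge (top-of-rack) switches each
    return {
        "core": core,
        "aggregation": agg,
        "edge": edge,
        "switches": core + agg + edge,  # totals 5k^2/4
        "hosts": k ** 3 // 4,           # (k/2)^2 hosts per pod, k pods
    }

def pod_switch_addr(pod, switch):
    """Pod switch address in the 10.pod.switch.1 scheme."""
    return f"10.{pod}.{switch}.1"
```

For the 4-port example pictured in the notes, this gives 20 switches (5·16/4) and 16 servers.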
diff --git a/content/acn-notes/data-center-transport.md b/content/acn-notes/data-center-transport.md @@ -0,0 +1,45 @@ ++++ +title = 'Data center transport' ++++ + +## Data center transport +TCP incast problem: +- datacenter application runs on multiple servers +- use a scatter-gather work pattern (client requests data from a bunch of servers, all servers respond) +- commodity switches usually have shallow buffers → queue capacity overrun at switch when data comes back to client +- collision leads to packet loss, which is recognized by servers after a timeout, at which point all servers start again at the same time + +Ethernet flow control: pause frame +- overwhelmed ethernet receiver can send "PAUSE" frame to sender +- upon receiving PAUSE frame, sender stops transmission for some amount of time +- but, not designed for switches, and blocks all transmission at port-level + +Priority-based flow control +- 8 virtual traffic lanes, one can be selectively stopped +- timeout is configurable +- but, only 8 lanes, unfairness, and deadlocks in large networks + +### DCTCP +- pass information about switch queue buildup to senders +- at sender, react by slowing down transmission + +Explicit congestion notification +- standardized way of passing presence of congestion +- part of IP packet header, supported by most commodity switches +- for queue size of N: when queue occupancy goes beyond K, mark passing packet's ECN bit as "yes" + +DCTCP main idea +- switch: marks with ECN after the threshold K +- ECN receiver: marks ACKs with ECE (ECN echo) flag, until sender ACKs back using CWR (congestion window reduce) flag +- DCTCP receiver: marks ACKs corresponding to ECN packet +- sender: estimate packets that are marked with ECN in a running window + +### TIMELY +Use round trip time (RTT) as indication of congestion +- RTT is multi-bit, no explicit switch support required to do marking +- assumes that: TX NIC can generate completion timestamps, RX NIC can generate ACKs in hardware, at 
switches ACKs go through high-priority queue + +Key concept: +- use gradient of RTTs +- positive → rising RTT → queue buildup +- negative → decreasing RTT → queue depletion diff --git a/content/acn-notes/data-structures-and-algorithms-used-in-network-designs.md b/content/acn-notes/data-structures-and-algorithms-used-in-network-designs.md @@ -1,47 +0,0 @@ -+++ -title = 'Data structures and algorithms used in network designs' -+++ -# Data structures and algorithms used in network designs -## How to quickly decide if element exists in a large set -Used in tasks involving membership detection: -- access control list (decides if IP address is in blocklist) -- IP multicast (decides if router port should replicate packet) -- load balancer (decides if source IP has been assigned to server) - -Trivial solutions: -- unordered list, linear search. O(n) time. -- ordered list, binary search. O(log n) time. - -Hashing: -- map data to fixed-size values with function -- hash table indexed by hash values, 1-to-1 mapping -- constant O(1) time, just computing a function -- problems: hash collision -- multiple data entries mapped to same hash value -- choosing good hash functions: deterministic (always same value for same key), efficiently computable, minimize collisions - - modulo hashing: $h(x) = x % table_size$, with $table_size$ a prime number - - with non-uniformly distributed data, data sharing factors with $table_size$ will be hashed to same values → use prime numbers, which have least factors - - other hash functions, e.g. MD5, SHA256 -- handling hash collisions: - - data structure that resolves hash collisions - - separate chaining: create list of keys that map to same hash value - - open addressing: shift values following a given strategy when a collision happens. when trying to find(x), hash and keep looking for x using the strategy until an empty slot is found. 
- - cuckoo hashing: use two hash functions to generate two possible slots for each key, use two hash tables - - bloom filter - -## How to efficiently count occurrences for large set of elements -Counting bloom filter: -- increment counters corresponding to hash values on insert -- on read, look up counters corresponding to hash values with minimum count - -Count-min sketch: -- three hash functions, each mapped to array of counters (hash tables) - -## How to quickly find match on prefix of a string -Trie: a binary tree where nodes have a part of the prefix - -PATRICIA trie: radix trie with radix 2 - -Direct trie: combine segments in entries - -TCAM: hardware implementation -- supports matching on set of records in constant time diff --git a/content/acn-notes/datacenter-networking/fat-tree-topology.png b/content/acn-notes/datacenter-networking/fat-tree-topology.png Binary files differ. diff --git a/content/acn-notes/datacenter-networking/index.md b/content/acn-notes/datacenter-networking/index.md @@ -1,67 +0,0 @@ -+++ -title = 'Datacenter networking' -+++ - -# Datacenter networking -Why not a single giant switch? Limited port density, broadcast storms, isolation. - -Tree-based data center network: - -![Diagram of tree-based network](tree-based-datacenter-network.png) - -Bottleneck is in the top 2 layers. 
- -Performance metrics: -- bisection width: minimum number of links cut to divide network into two halves -- bisection bandwidth: minimum bandwidth of links that divide network into two halves -- full bisection bandwidth: one half of nodes can communicate at the same time with other half of nodes - -Oversubscription: ratio -- (worst-case required aggregate bandwidth among end-hosts) : (total bisection bandwidth of network topology) -- 1:1 -- all hosts can use full uplink capacity -- 5:1 -- only 20% of host bandwidth may be available - -## Fat-tree -Fat-tree topology: emulate single huge switch with many smaller switches - -![Fat-tree topology diagram](fat-tree-topology.png) - -Needs to be backward compatible with IP/Ethernet, so routing algorithms naively choose shortest path, leading to bottleneck. And you get complex wiring. - -Addressing: -- 10.0.0.0/8 private address block -- pod switches: 10.pod.switch.1 -- core switches: 10.k.j.i, with i and j core positions in (k/2)² core switches -- hosts: 10.pod.switch.id - -Forwarding with two-level lookup table: -- prefix used for forwarding intra-pod traffic -- suffixes for forwarding inter-pod traffic - -Routing: -- prefixes in two-level lookup table prevent intra-pod traffic from leaving pod -- each host-to-host communication has single static path - -Flow collision can lead to bottleneck: -- use equal-cost multi-path (ECMP): static path for each flow -- or flow scheduling: have centralised scheduler to assign flows to paths - -To solve cabling issue, organize switches into pod racks. - -Unaddressed issues: -- no support for seamless VM migration, because IPs location-dependent -- plug-and-play not possible: IPs pre-assigned to switches and hosts - -## PortLand: layer 2 system -Intuition: separate node location from node identifier. -- IP is node identifier -- Pseudo MAC is node location - -Fabric manager maintains IP → PMAC mapping, and facilitates fault tolerance. 
- - -Switches self-discover location by exchanging Location Discover Messages (LDMs): -- tree level/role: based on neighbor identify -- pod number: get from Fabric manager -- position number: aggregation switches help top-of-rack switches choose unique number - -![PortLand workflow](portland.png) diff --git a/content/acn-notes/datacenter-networking/portland.png b/content/acn-notes/datacenter-networking/portland.png Binary files differ. diff --git a/content/acn-notes/datacenter-networking/tree-based-datacenter-network.png b/content/acn-notes/datacenter-networking/tree-based-datacenter-network.png Binary files differ. diff --git a/content/acn-notes/datacenter-transport.md b/content/acn-notes/datacenter-transport.md @@ -1,38 +0,0 @@ -+++ -title = 'Datacenter transport' -+++ -# Datacenter transport -In datacenters, network is extremely high speed and low latency. - -In TCP: -- reliable, in-order delivery with seq numbers and acks -- don't overrun receiver (receiving window `rwnd`) and network (congestion window `cwnd`) - - what can be sent is `min(rwnd, cwnd)` - -TCP incast problem: client-facing query might have to collect data from many servers → packet drops because capacity overrun at shared commodity switches - -Ethernet flow control: -- overwhelmed ethernet receiver can send "PAUSE" frame to sender, upon which sender stops transmission for certain amount of time -- designed for end-host overruns, not switches -- blocks all transmission at Ethernet level (port level) - -Priority-based flow control: -- enhancement over PAUSE frames -- 8 virtual traffic lanes, one can be selectively stoppe. timeout is configurable. 
-- but only 8 lanes, might lead to deadlocks in large networks, and unfairness - -Datacenter TCP (DCTCP): -- pass information about switch queue buildup to senders -- at sender, react by slowing down transmission -- Explicit Congestion Notification: standard way of passing "presence of congestion" - - part of IP packet header - - for queue size of N, when queue occupancy goes beyond K, mark passing packet's ECN bit as "yes" -- after threshold K, start marking packets with ECN - - typical ECN receiver marks acks with ECE flag, until sender acks back with CWR flag bit - - DCTCP receiver: only mark acks corresponding to ECN packet - -TIMELY -- use round trip time (RTT) as indication of congestion signal -- multi-bit, indicating end-to-end congestion through network → no explicit switch support needed to do marking -- assumes ack-based protocol (TCP) -- absolute RTTs not used, only gradient. positive means rising RTT, so queue buildup. negative means opposite. diff --git a/content/acn-notes/dns.md b/content/acn-notes/dns.md @@ -0,0 +1,19 @@ ++++ +title = 'DNS' ++++ + +## DNS +resolves hostnames to IP addresses. + +DNS query: +- query local DNS server +- if not found, go to root nameserver +- root says to contact TLD nameserver (e.g. "com") +- TLD nameserver says to contact domain nameserver +- domain nameserver gives IP address + +DNS types: +- resolution: AAAA (IPv6), A (IPv4) +- nameserver: NS +- alias (canonical hostname): CNAME +- mail server: MX diff --git a/content/acn-notes/in-network-computing.md b/content/acn-notes/in-network-computing.md @@ -0,0 +1,42 @@ ++++ +title = 'In-network computing' ++++ + +## In-network computing +### Implementing in-network caching service +Key-value storage that meets aggressive latency and throughput objectives efficiently. 
+ +Target workloads: small objects, read intensive, highly skewed and dynamic key popularity + +Use PISA for key-value store: +- programmable parser: parse custom key-value fields in packet header +- programmable match-action pipeline + - read and update key-value data + - provide query statistics for cache updates + +How to identify app-level packet fields: +- NetCache packet has some data - operation, sequence, key, value +- only top-of-rack switch needs to parse those fields + +How to store and serve variable-length data on switches: +- use register array, fetch using index +- for variable, either use action data to hold indices, or multiple register arrays with same index +- two-level lookup: bitmap indicates arrays that store key's value, index shows slots in arrays to get value + +Efficiently keep cache updated: +- cache hottest O(NlogN) items with limited insertion rate +- cached key: per-key counter array +- uncached key: count-min sketch (report new hot keys), bloom filter (remove duplicate hot key reports) + +### Implementing in-network coordination service +In-network coordination is communication-heavy, not computation-heavy + +Use set of coordination switches to run consensus protocol + +NetChain design goals: high throughput, low latency, consistency, fault tolerance + +Chain replication: consistency & fault tolerance +- storage nodes organized in chain structure +- handle operations: read from tail, write from head to tail + +Because of replication, tolerates f-1 failures for f nodes. 
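The count-min sketch that NetCache uses above to report new hot keys (paired with a bloom filter to de-duplicate reports) can be illustrated in a few lines; rows, width, and the hashing scheme below are my own choices for the sketch, not NetCache's actual parameters:

```python
import hashlib

class CountMinSketch:
    """Count-min sketch: rows of counters, one hash function per row.
    Estimates never undercount; taking min() across rows bounds the
    overestimation caused by hash collisions."""

    def __init__(self, rows=3, width=1024):
        self.rows, self.width = rows, width
        self.tables = [[0] * width for _ in range(rows)]

    def _slot(self, row, key):
        # Salt the hash with the row index to get independent-ish functions
        digest = hashlib.sha256(f"{row}:{key}".encode()).hexdigest()
        return int(digest, 16) % self.width

    def add(self, key, count=1):
        for r in range(self.rows):
            self.tables[r][self._slot(r, key)] += count

    def estimate(self, key):
        return min(self.tables[r][self._slot(r, key)] for r in range(self.rows))
```

A switch-side implementation would use register arrays and simple hardware hash units instead of SHA-256, but the update/read logic is the same.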
diff --git a/content/acn-notes/intro.md b/content/acn-notes/intro.md @@ -2,29 +2,16 @@ title = 'Intro' +++ -# Intro +## Intro +Circuit switching: physical channel carrying stream of data, no routing -Circuit switching: -- physical channel with stream of data from source to destination -- three phases: set up, data transfer, tear down -- data transfer involves no routing -- have to reserve circuits +Packet switching: message split into short packets, each handled separately, packets stored/queued in routers and forwarded to appropriate neighbor -Packet switching: -- message split into short packets, each handled separately -- one operation - send packet -- packets queued in each router, forwarded to appropriate neighbor +Network performance metrics: +- bandwidth: max rate of data transfer across given network path +- latency: amount of time it takes to deliver data from source to destination -Bandwidth: max rate of data transfer across given network path - -Latency: amount of time taken to deliver data from source to destination - -Performance metric - flow completion time -- how long does it take to complete traffic flow? -- how long does it take to complete a set of correlated flows? - -Network reliability: -- end-to-end argument: "if a function can only be correctly implemented end-to-end, it must be implemented in the end systems." -- fate-sharing principle: to deal with potential failures, store critical system state at nodes which rely on that state. Only way to lose the state is if the node that relies on it fails, at which point it doesn't matter. - - if you store on network devices, need state to be cleaned up or recreated -- packet vs circuit switching: packet switching is better than circuit switching in terms of *resilience* and *efficiency*. But circuit switching performance is more predictable. +Network reliability considerations: +- end-to-end argument: if a function can only be correctly implemented end-to-end, it must be implemented in the end systems. 
+- fate-sharing principle: store critical system state at nodes which rely on that state, so the only way to lose that state is if the node that relies on it fails, in which case it doesn't matter +- packet vs. circuit switching: packet is better than circuit in terms of resilience and efficiency, circuit is better than packet in terms of predictable performance diff --git a/content/acn-notes/ip-routing.md b/content/acn-notes/ip-routing.md @@ -0,0 +1,33 @@ ++++ +title = 'IP routing' ++++ + +## IP routing +IP addresses have prefix and identifier: +- e.g. 10.0.0.1/24: 24 bits network identifier, rest host identifier +- also subnet mask notation: 255.255.255.0 + +Routers have tables like this for forwarding: + +| Match | Action | +|----------------|--------| +| 122.38.42.0/24 | port-2 | +| 116.16.0.0/16 | port-1 | +| 139.70.8.0/24 | drop | + +IPv4 packets have fields including: +- TOS: type of service -- two bits used for explicit congestion notification +- total length +- TTL: decreased by one when passing router, packet dropped when TTL is 0 +- protocol: transport layer protocol (TCP, UDP) + +Generating forwarding tables: +- in same domain: OSPF (link state) + - routers exchange link-state messages to learn topology + - each router runs Dijkstra's to compute shortest paths to other routers + - each router generates forwarding table based on shortest paths +- between domains: BGP (distance vector) + + +Traffic engineering: network engineering for performance evaluation and optimization +- multiprotocol label switching: add extra header with labels to let you manipulate forwarding diff --git a/content/acn-notes/local-network-ethernet-arp.md b/content/acn-notes/local-network-ethernet-arp.md @@ -0,0 +1,31 @@ ++++ +title = 'Local network: Ethernet, ARP' ++++ +## Local network: Ethernet, ARP +Switched ethernet: ethernet segments connected with switches, which work on link layer + +Switch creates Ethernet segments and forwards frames between segments based on MAC address + 
+MAC address is 6 bytes, unique among all network adapters, managed by IEEE + +Ethernet frame contains destination and source MAC addresses. + +Link layer switches forward/broadcast/drop frames based on switch table, operate transparently to hosts. + +Generating the table: +- learn new interface mappings from incoming frames +- if destination unknown, broadcast on all interfaces except the source + +Packet transmission: +- store-and-forward: packets received in full, only then forwarded +- cut-through: packet receiving and sending done at the same time + +Avoid loops in network by computing a logical spanning tree, and rebuilding the tree on failure. + +Traffic isolation with VLANs -- subsets of ports. + +ARP: obtaining destination MAC address from a host +- query: whoever has this IP address, respond with your MAC address +- reply comes only from that host and contains MAC address + +NAT: maps an IP address space into another diff --git a/content/acn-notes/machine-learning-for-networking.md b/content/acn-notes/machine-learning-for-networking.md @@ -0,0 +1,39 @@ ++++ +title = 'Machine learning for networking' ++++ + +## Machine learning for networking +### Video streaming +Dynamic adaptive streaming over HTTP (DASH). + +Reinforcement learning: reward measures how good an action is, goal is to maximize cumulative reward. + +Pensieve: learning-based ABR algorithm +- ABR: adaptive bit rate +- Pensieve learns it through experience +- RL reward is `bitrate - rebuffering - smoothness` +- use large corpus of network traces to train the model + +### Network packet classification +Decides action to take based on matched rule. +Solutions: +- hardware-based (e.g. TCAM): fast, but expensive, energy-consuming, hard to scale +- software-based (e.g. 
decision-tree): scalable, but slow and requires large memory + +Existing techniques: +- node cutting: cut space into smaller areas, each area corresponds to leaf in decision tree, match by traveling through decision tree and select rule in matched leaf with highest priority +- rule partitioning: partition space into two parts, build separate decision (sub)tree for each part, match by traveling through decision tree and select rule in matched leaf with highest priority + +End-to-end learning: replace decision tree with RL model +- may not need data structure at all +- but cannot guarantee classification correctness, packet inference takes too long, needs specialised hardware + +NeuroCuts: use deep reinforcement learning to tackle problem of building decision trees +- action: either cutting node, or partitioning set of rules +- reward: classification time, or memory footprint, or both +- reward delayed and only given when whole tree is built + +Naive MDP formulation: +- sequential Markov decision process + - assumes DFS order of building the tree node-by-node + - action: cut or partition current node diff --git a/content/acn-notes/network-automation.md b/content/acn-notes/network-automation.md @@ -1,57 +1,36 @@ +++ title = 'Network automation' +++ -# Network automation -## Fibbing: centralised control over distributed routing -Fibbing combines benefits of distributed protocols (scalable by design & robust) with those of a centralized controller (manageable, flexible). 
-You get centralised control over distributed routing by lying to th routing protocols: -- get requirements from network operator (routing, load balancing, failover) - - provides high-level language: +## Network automation +### Centralized control over distributed routing: fibbing +Fibbing: lying to routing protocols +- high-level language to specify forwarding requirements ``` ((E, C, D1) and (E, G, D1); // traffic between E and D1 load balanced on two paths ((A, *, B, * D2) or (A, *, C, *, D2)); // traffic between A and D2 should cross B or C (F, G, *, D3) as backupof ((F, H));) // traffic between F and D3 should be rerouted via G if link (F,H) fails ``` +- fibbing controller computes paths and creates fake topology by sending link-state messages -- compute expected paths centrally -- fibbing controller sends fake control messages to routers -- distributed protocols generate the expected paths with augmented network based on fake messages +### Synthesizing network configurations - Propane +Compile network-wide routing objectives into low-level configurations. -Any set of forwarding directed acyclic graphs can be enforced by fibbing. - -## Synthesizing network configurations: Propane -- Language for expressing network-wide objectives: - - regular expressions with extensions -- Compiler for purely distributed implementation +Main goals of Propane: +- language for expressing network-wide objectives (intra- and inter-domain routing) +- compiler for purely distributed implementation - generate BGP configs for each router - - compiler guarantees policy compliance - - use graph-based intermediate representation - -Goal: compile network-wide routing objectives into low-level configurations of devices running distributed protocols. 
- -Example: - -``` -define Ownership = -{PG1 => end(A) // path of traffic with prefix PG1 must end at A - PG2 => end(B) // path of traffic with prefix PG2 must end at B - true => end(out)} // traffic with any other prefixes can leave the datacenter - -define Locality = {PL1 | PL2 => only(in)} // traffic with prefixes PL* is only allowed in datacenter -``` + - compiler guarantees policy compliance for all failures -When compiling, BGP control graph safety analysis is done with the topology. +### Autocompleting partial network configurations +Problems: +- produced configs may widely differ from human-generated ones, lowering confidence in them +- can produce widely different configurations given slightly different requirements +- can't flexibly adapt to operational requirements -## Autocompleting partial network configurations: Net Complete -Problems with existing synthesizers: -- do not provide operators with fine-grained control over synthesized configurations -- can produce configurations very different from human-generated, lowering confidence of using it -- can produce very different configurations given slightly different requirements -- cannot flexibly adapt to operational requirements +Existing synthesizers don't provide operators with fine-grained control over synthesized configurations. -NetComplete allows operators to flexibly express intents through configuration sketches with "holes" -- holes can identify specific attributes (IP addresses, link costs, BGP local preferences) or entire pieces of configuration -- reduce autocompletion to constraint satisfaction problem: - - encode protocol semantics + high-level reqs + partial configuraions as logical formula in SMT - - use a solver to find an assignment to configuration variables +NetComplete: network operators can specify their intents through configuration sketches with "holes" +- holes can identify specific attributes (e.g. 
IP addresses), link costs, BGP local preferences, or entire pieces of configuration +- encodes requirements as logical formula in SMT +- use solver (Z3) to find assignment that satisfies constraints diff --git a/content/acn-notes/network-function-virtualization.md b/content/acn-notes/network-function-virtualization.md @@ -0,0 +1,27 @@ ++++ +title = 'Network function virtualization' ++++ +## Network function virtualization +Virtualise network functions on top of commodity servers. + +Benefits: +- hardware sharing across multiple functions +- more flexible management using software +- cost saving via reduced hardware +- safe to try new features on an operational network/platform + +Click: modular architecture for programming network functions +- element: class, input port, output ports, configuration string +- connection and configuration + - connection: three types of ports (push, pull, agnostic) + - push/pull connection constraints: + - connection must have same type on the two ends + - port must only have one connection + - element must not have different types on its input/output ports if the element processes packets immediately + +Achieving high performance +- NUMA architecture: non-uniform memory access (processor can access its own local memory faster than non-local memory) + - each network port is accessed by single core + - each packet is handled by a single core +- multi-queue NICs, allocate one core per queue +- batching: poll multiple packets from NIC instead of one at a time diff --git a/content/acn-notes/network-monitoring.md b/content/acn-notes/network-monitoring.md @@ -0,0 +1,28 @@ ++++ +title = 'Network monitoring' ++++ + +## Network monitoring +Tasks: +- which path did my packet take? +- which rules on the switch did the packet follow? +- how long did the packet queue at each switch? +- who did the packet share the queue with? 
+ +In-band network telemetry: leverage programmability of switches to insert monitoring info into packet header along network path + +Heavy hitters: network flows that are larger (in number of packets, or bytes) than some fraction *t* of total packets seen on the link or some top *k* flows by size + +Space-saving algorithm: counter-based algorithm that uses O(k) counters to track k heavy flows +- properties: + - no flow counter in the table is ever underestimated + - min value in the table (`val_r`) is an upper bound on the overestimation error of any counter + - any flow with true count higher than average table count will always be present in the table +- if flow has appeared in table: hash to flow key, increment corresponding counter +- if flow not contained in table: find the min counter in the table, replace the key with the current flow key, increment the counter + +Universal streaming: +- there exists universal approach to estimate G-sum when function `g(*)` is non-decreasing such that g(0) = 0 and g(x) does not monotonically grow faster than x² +- universal sketch construction can be used to estimate G-sum with high probability using poly-logarithmic memory + +Item i is a g-heavy hitter if changing its frequency `f_i` significantly affects its G-sum diff --git a/content/acn-notes/network-transport.md b/content/acn-notes/network-transport.md @@ -1,66 +1,68 @@ +++ title = 'Network transport' +++ -# Network transport -## TCP congestion control -Congestion control: determine rate to send data on connection, such that network is efficiently utilized, and sender doesn't overrun network capability -In TCP, end hosts control sending rate, based on ACKs. 
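The space-saving update described in the monitoring notes above can be sketched in a few lines of Python (a hedged illustration; the function and variable names are mine, not from the lecture):

```python
# Minimal sketch of the space-saving algorithm: at most k counters,
# evict the minimum counter on a miss and inherit its value (val_r)
# as the new entry's overestimation error. Counts are never underestimated.
def space_saving(stream, k):
    table = {}  # flow key -> (count, overestimation error)
    for flow in stream:
        if flow in table:
            cnt, err = table[flow]
            table[flow] = (cnt + 1, err)
        elif len(table) < k:
            table[flow] = (1, 0)
        else:
            # table full: replace the minimum counter with the current flow
            victim = min(table, key=lambda f: table[f][0])
            val_r = table.pop(victim)[0]
            table[flow] = (val_r + 1, val_r)
    return table
```

Note how the three properties from the notes fall out of the update: a counter only ever grows, and the evicted minimum `val_r` bounds how much the new entry may be overestimated.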
+## Network transport
+### Congestion control
+determine rate to send data, such that sender doesn't overrun network capability, and network is efficiently used
+- switches have packet queue, packets dropped if not enough space in queue
+- TCP sends data across network as fast as possible, but not faster
+  - end hosts control sending rate based on ACKs
+  - packet conservation principle: for connection in equilibrium, new packet shouldn't be put into network until an old packet leaves

-Packet conservation: for connection to stay in equilibrium, a new packet should not be put into network until an old packet leaves
+self-clocking: new packets sent when packet leaves, at bottleneck rate.

-Reaching equilibrium quickly: TCP slow-start
+quickly reaching equilibrium: TCP slow start
 - upon receiving ACK, increase congestion window by 1
+- timeout or 3 duplicate ACKs assumed to be packet drop

-Packet loss is not good indicator of congestion. Instead, provide a model, estimate parameters for the model based on probing, and decide sending rate using the model.
+adapt to available space: additive increase multiplicative decrease (AIMD)
+- congestion window increased by 1, but decreased by half of window size

-## Multi-path transport
-Multiple paths, e.g. cellular and wifi simultaneously. Higher throughput, failover from one path to another, and seamless mobility.
+BBR: congestion-based congestion control
+- models network, estimates maxBW (bandwidth) and minRTT (round trip time) on every ACK
+- control sending rate based on model
+- uses a state machine

-To not modify apps, present same socket API and expectations.
+### Multi-path transport
+Benefits: higher throughput, failover, seamless mobility

-During TCP 3-way handshake, set option `MP_CAPABLE` in TCP header for multipath.
+For unmodified apps, present same socket API and expectations, establish TCP connections in the normal way. Single flow by default, add sub-flows if possible.
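The slow-start and AIMD rules from the congestion control notes above can be summarized as a toy window-update function (illustrative only; a real TCP stack tracks far more state, e.g. byte-based windows and RTT timers):

```python
# Toy model of TCP slow start + AIMD window updates.
def update_cwnd(cwnd, ssthresh, event):
    if event == "ack":
        if cwnd < ssthresh:
            return cwnd + 1, ssthresh          # slow start: +1 per ACK (doubles per RTT)
        return cwnd + 1.0 / cwnd, ssthresh     # congestion avoidance: ~+1 per RTT
    if event == "3dup_ack":                    # loss inferred from duplicate ACKs
        return cwnd / 2.0, cwnd / 2.0          # multiplicative decrease
    if event == "timeout":                     # severe congestion: restart slow start
        return 1, cwnd / 2.0
    return cwnd, ssthresh
```

The additive-increase/multiplicative-decrease asymmetry is what lets competing flows converge toward a fair share of the bottleneck.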
-To add sub-flows, use token hashed from exchanged key, and use HMAC (code) for auth based on exchanged keys.
-Associate sub-flows with existing connection.
+MPTCP
+- to add new sub-flow, use token hashed from key exchanged in handshake, HMAC for authentication.
+- start sending packets with new IP/port pair, associate sub-flows with existing connection

-Middleboxes: network equipment that apply special operations on path of network packets.
-- examples: firewall, NAT
-- some inspect TCP traffic and check sequence numbers
-- so, need to have per-flow sequence numbers
+Middleboxes: network equipment that apply special operations on path of network packets (e.g. firewall, NAT)
+- some check TCP sequence numbers
+- solution: use per-flow sequence number

-All sub-flows share same receive buffer and use same receive window.
-
-MPTCP congestion control goals:
-- be fair to TCP at bottleneck links: take as much capacity as TCP at bottleneck link
-- use efficient paths: each connection should send all traffic on least-congested paths, but keep some traffic on alternative paths as probe
+MPTCP congestion control:
+- be fair to TCP at bottleneck links: take as much capacity as TCP, no matter how many paths it's using
+- use efficient paths: each connection should send all traffic on least-congested paths, but keep some traffic on alternative paths as probe
+- perform as well as TCP
-
-Congestion control mechanism:
-- congestion window for each sub-flow
-- increase window for sub-flow for each ACK on that path
-- halve window for each drop on that path
-
-## HTTP
-### HTTP/1
-1 round trip to set up TCP, 2 for TLS1.2.
-
-After setup, only one request/response possible at a time, so Head-of-Line (HoL) blocking.
-
-### HTTP/1.1
-Avoids HoL blocking using multiple TCP connections, allowing concurrent requests/responses.
-
-### HTTP/2
-Multiple streams (each for object) multiplexed on same TCP connection.
-
-Even multiple domains can share same TCP connection.
- -Supports priority of streams, using dependency tree. - -But still has HoL blocking on _TCP_ connection -- packet retransmission for one object delays transmissions of others. - -## QUIC -Protocol to make streaming faster. - -No round trip to known server, or if crypto keys not new. And connections survive change of IP address. -Uses multiple streams over UDP. +- don't oscillate +- so, maintains a congestion window for each sub-flow, using a formula that I hope we don't need to know + +### HTTP +HTTP/1 +- 1 round trip to set up TCP connection +- 2 round trips to set up TLS 1.2 connection +- after setup, requests/responses flow over connection, one at a time + - head-of-line (HoL) blocking on HTTP connection + +HTTP/1.1: avoid HoL blocking +- multiple TCP connections allow multiple objects to be fetched through concurrent HTTP requests/responses + +HTTP/2: stream multiplexing +- multiple streams (each for an object) multiplexed on same TCP connection +- even multiple domains can share same TCP connection +- supports stream priority set by client + +### QUIC +new streaming protocol to make streaming faster + +HTTP/3 over QUIC: +- 0 round trip to known server, or if crypto keys not new +- connections survive IP address change +- after setup, HTTP requests/responses flow over connection via QUIC streams diff --git a/content/acn-notes/networking-basics/berkeley-sockets.png b/content/acn-notes/networking-basics/berkeley-sockets.png Binary files differ. diff --git a/content/acn-notes/networking-basics/index.md b/content/acn-notes/networking-basics/index.md @@ -1,97 +0,0 @@ -+++ -title = 'Networking basics' -+++ - -# Networking basics -## Domain name system (DNS) -People can't remember IPs, need names that are easy to remember. - -Before DNS, you just had a hosts file that was periodically updated via FTP. 
- -DNS: -- distributed, so no centralization and good scalability -- simple client/server architecture via UDP port 53 -- hierarchical namespace: - - root name server by ICANN - - responsible for root zone file -- lists TLDs and who owns them - - 13 root servers, globally replicated - - contacted when names can't be resolved locally - - top-level domains managed by Verisign and others - -Resolving a name via recursive DNS query, e.g. "www.google.com" -- query local DNS server (e.g. dns.vu.nl) -- no entry found, go to root -- root says to contact "com" nameserver -- query "com" NS for "www.google.com" -- "com" NS says to contact "ns1.google.com" -- query "ns1.google.com" for "www.google.com" -- "ns1.google.com" returns IP address - -DNS types: -- A (IPv4), AAAA (IPv6): DNS resolution -- CNAME: look for alias -- NS: query for DNS responsible for partial name -- MX: look for mail server - -## Socket and TCP -Berkeley sockets: - -![Berkeley sockets flowchar -t](berkeley-sockets.png) - -Transmission control protocol (TCP): -- uses a three-way handshake - -![TCP flow](tcp.png) - -TCP functionality: -- reliable delivery: integrity check (header checksum), packet retransmission when lost (sequence number), packet reordering -- flow control: receiver not overrun by sender -- congestion control: network not overrun by sender - -## IP routing -IP addresses made up of 32 bits, in groups of 8. -Contain network identifier (IP prefix), subnet identifier, host identifier. - -CIDR notation: 10.0.0.1/24 -- first 24 bits for network identifier -- rest for host identifier -- alternative subnet mask notation: 255.255.255.0 - -Generating forwarding tables: -- control plane: routers use distributed protocol to exchange messages and compute shortest paths to other routers - - OSPF: within domain. routers exchange link-state messages to learn topology, each router uses Dijkstra's to get shortest path, and generates forwarding table entries - - BGP: between autonomous systems. 
- -MPLS: multiprotocol label switching -- uses a label field, which routers use to forward traffic -- useful for traffic engineering (optimization, performance improvement, etc.) - -## Ethernet and ARP -Switched Ethernet: -- switching creates Ethernet segments and forwards frames between them based on MAC address -- Ethernet MAC address: 6 bytes, unique among all network adapters, managed by IEEE -- switches forward/broadcast/drop frames based on switch table -- switches don't need MAC address - they operate transparently to hosts -- generating table: - - learn new MAC interface mappings through incoming frames - - if destination MAC unknown, broadcast frame on all interfaces except the one where the frame originated -- store-and-forward: packets received in full, buffered, then forwarded onto output link -- cut-through: when lookup is done, can receive and send packet at the same time (reduces latency, but can't do integrity check) -- redundancy without loops: use logical spanning tree (STP), automatically rebuild on failure - - with loops, you'd get packets bouncing around constantly -- traffic isolation: VLAN - - network manager partitions ports into subsets, assigns to VLANs - - ports in same VLAN form broadcast domain, ports on different VLANs routed through internal router in switch - - switches connected on trunk ports belonging to all VLANs - -ARP: obtaining destination MAC address -- ARP query: ask host with IP to respond with MAC address -- ARP reply: MAC address response sent -- ARP table is cached locally - -## Network address translation (NAT) -NAT: way to map IP address space into another, used to mask network changes and prevent running out of IPv4 addresses - -From the outside, your IP is the address of your router. -When your router gets traffic, it sends it to the appropriate host on the local network. diff --git a/content/acn-notes/networking-basics/tcp.png b/content/acn-notes/networking-basics/tcp.png Binary files differ. 
diff --git a/content/acn-notes/networking-data-structures-and-algorithms.md b/content/acn-notes/networking-data-structures-and-algorithms.md @@ -0,0 +1,21 @@ ++++ +title = 'Networking data structures and algorithms' ++++ + +## Networking data structures and algorithms +Deciding if an element exists in a large set +- hashing: mapping data to fixed size values with a function +- hash function properties: deterministic, efficiently computable, minimize collisions +- modulo hashing: `h(x) = x % table_size`, with `table_size` a prime number +- handle collisions with separate chaining (linked list on each hash value), or open addressing (look for next open spot following a given strategy if collisions detected) +- bloom filter: use multiple hash functions to lower collision rate + +Count occurrences for large set of elements: +- counting bloom filter: on insert, increment counters corresponding to hash values. on read, look up counters and take the minimum count (does not ensure correctness, only upper bound). +- count-min sketch: three hash functions, each mapped to array of counters. on insert, increment. on read, obtain minimum of all counters. + +Find match on prefix of a string: +- Trie: binary tree whose nodes have part of the prefix +- PATRICIA trie: like radix trie, with radix 2 +- direct trie: combine segments in entries +- hardware implementation in TCAM: hardware device that lets you match on set of records in constant time diff --git a/content/acn-notes/programmable-data-plane-p4.md b/content/acn-notes/programmable-data-plane-p4.md @@ -0,0 +1,28 @@ ++++ +title = 'Programmable data plane (P4)' ++++ + +## Programmable data plane (P4) +All network features are centered around the capabilities of the ASIC, and it takes years for a new one to be developed and deployed. +So, make the ASIC programmable, and let your features tell the ASIC what to support. 
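Returning to the data-structures notes above: the count-min sketch update and query can be sketched in a few lines (the width, depth, and hashing scheme here are illustrative choices, not the lecture's reference implementation):

```python
import random

# Minimal count-min sketch: depth hash rows of width counters.
# Increment one counter per row on insert; the minimum over the rows
# is an upper bound on the true count (never an underestimate).
class CountMinSketch:
    def __init__(self, width, depth=3, seed=0):
        rnd = random.Random(seed)
        self.width = width
        self.seeds = [rnd.randrange(1 << 30) for _ in range(depth)]
        self.rows = [[0] * width for _ in range(depth)]

    def _index(self, item, seed):
        return hash((seed, item)) % self.width

    def add(self, item):
        for row, s in zip(self.rows, self.seeds):
            row[self._index(item, s)] += 1

    def count(self, item):
        # min over rows limits the damage from any single-row collision
        return min(row[self._index(item, s)]
                   for row, s in zip(self.rows, self.seeds))
```

As with the counting bloom filter, the answer is only an upper bound: collisions can inflate a counter, but taking the minimum across independent rows makes large overestimates unlikely.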
+ +PISA: protocol independent switch architecture + +P4: domain-specific language for programming protocol-independent packet processors + +Architecture specific constructs +- each architecture defines "externs" - blackbox functions whose interface is known +- P4 aims to be target-independent + +P4 is C-like, statically typed. + +Parser: +- uses state machine to map packets into headers and metadata +- `transition` statements move between states, with final accept/reject states + +Control: +- tables: match a key and return an action +- actions: like C functions +- control flow is similar to C, but no loops + +Deparser assembles the packet from data structures. diff --git a/content/acn-notes/software-defined-networking/abstractions-in-sdn-diagram.png b/content/acn-notes/software-defined-networking/abstractions-in-sdn-diagram.png Binary files differ. diff --git a/content/acn-notes/software-defined-networking/covisor-design-overview.png b/content/acn-notes/software-defined-networking/covisor-design-overview.png Binary files differ. diff --git a/content/acn-notes/software-defined-networking/index.md b/content/acn-notes/software-defined-networking/index.md @@ -1,96 +1,50 @@ +++ -title = 'Software-defined networking' +title = 'Software defined networking' +++ -# Software-defined networking -Network planes: - -![Diagram of network planes](network-planes-diagram.png) - -Control plane has complex goals: connectivity, inter-domain policy, isolation, access control... - -We need abstractions to extract simplicity; ultimately we should be able to program the network as we do computers. 
- -First attempt: demultiplexing packets to software programs: -- in-band: packet has small piece of code that can execute on router -- out-band: user injects the code to be executed beforehand - -Software-defined network: -- control plane physically separate from data plane -- single control plane controls several forwarding devices - -![Traditional network vs SDN](traditional-vs-sdn.png) - -## Abstractions in SDN -![SDN architecture diagram](sdn-architecture.png) - -### Forwarding - OpenFlow -express intent independent of implementation. - -OpenFlow is API for controlling packet forwarding. -Configuration in terms of flow entries `<header, action>`. -No hardware changes needed, only firmware update. - -![OpenFlow diagram](openflow-diagram.png) - -Flow tables -- contain rows of flows -- contain: rule (exact/wildcard) + action + statistics + priority -- match+action - - match: on arbitrary fields in headers, or new header → allows any flow granularity - - action: forward to port, or drop, or send to controller, or overwrite header, or forward at specific bit-rate - - no support for deep packet inspection (payload-related) - -![OpenFlow protocol example](openflow-example.png) - -### Network state -Global network view is annotated graph, provided through API. -Control program is a function applied on the view (graph). -Implementation is "network operating systems", runs on servers in network. - -Info goes both ways: -- info from routers/switches to create view -- configuration to routers/switches to control forwarding - -### Specification abstraction -Control mechanism expresses desired behavior, and should not be responsible for implementing it on physical network infrastructure. - -Abstract view of network, only enough detail to specify goals. - -## SDN for network slicing -It's hard to realistically evaluate new network services. 
-
-So, network slicing: divide network into logical slices
-- each slice controls its own packet forwarding
-- users pick which slice controls their traffic, as opt-in
-- existing production services run in their own slice
-- enforce strong isolation: actions in one slice do not affect others
-- allow this testbed to mirror production network - real hardware, performance, etc.
-
-![Slicing layer diagram](slicing-layer.png)
-
-Slicing policies specify resource limit for each slice (bandwidth, max number of forwarding rules on switches, topology, fraction of switch/router CPU).
-
-For OpenFlow, FlowVisor allows network slicing
-- intercepts control messages from OpenFlow to enforce the slicing.
-- checks who controls packet/flow
-- check if rules allowed or not
-
-## Composing network control programs in SDN
-CoVisor is compositional hypervisor:
-- clean interface for composing multiple controllers on same network (e.g. POX + Ryu + Floodlight...)
-
-Operators:
-- `+`: parallel packet processing
-- `>>`: sequential packet processing
-- `▷`: overrie, one controller acts or defers processing to another controller
+## Software defined networking
+Software defined network:
+- control plane physically separate from data plane
+- single (logically centralized) control plane controls several forwarding devices
+
+Abstractions in SDN:
+
+![Diagram of abstractions in SDN](abstractions-in-sdn-diagram.png)
+
+Forwarding abstraction: OpenFlow
+- intent, independent of implementation
+- standardized interface to switch
+- configuration using flow entries: `<header, action>`
+  - match on any header, or new header
+  - action: forward to port(s), drop, send to controller, change header, forward at specific bit-rate
+  - but no support for payload-related functions
+
+Network state abstraction: "Network Operating Systems"
+- annotated network graph provided through API
+- runs on servers in network
+- information flows from router/switches to form view
+- configurations flow to
routers/switches to control forwarding
+
+Specification abstraction
+- control mechanism expresses desired behavior
+- not responsible for implementing that behavior on physical network infrastructure
+- proposed: abstract view of the network
+
+### Network testing (slicing)
+Hard to realistically test new network services.
+
+So, slice the network:
+- divide production network into logical slices
+- users pick which slice controls their traffic (testing is opt-in)
+- enforce strong isolation between slices
+
+Slicing policy specifies resource limit for each slice.
+- FlowVisor can be used to enforce network slicing, by checking policies
+
+### Composing network control programs
+CoVisor is compositional hypervisor for SDN:
+- clean interface to compose multiple controllers on same network
+- provides operators: parallel (`+`), sequential (`>>`), override (`▷`)
 - constraints on individual controllers:
-  - network visibility - virtual topology to each controller (so controllers can't operate on whole network)
-    - many-to-one primitive: one virtual node to many physical
-    - one-to-many primitive: one node in physical, several virtual (e.g. MAC learner + gateway + IP router)
-  - packet handling capability - fine-grained access control to each controller
-    - constraint on pattern: header fields, match type
-    - constrain on action: actions to take on matched packets
-
-Design overview:
-
-![CoVisor design overview](covisor-design-overview.png)
+  - visibility (virtual topology for each controller)
+  - capability (fine-grained access control)
diff --git a/content/acn-notes/software-defined-networking/network-planes-diagram.png b/content/acn-notes/software-defined-networking/network-planes-diagram.png
Binary files differ.
diff --git a/content/acn-notes/software-defined-networking/openflow-diagram.png b/content/acn-notes/software-defined-networking/openflow-diagram.png
Binary files differ.
diff --git a/content/acn-notes/software-defined-networking/openflow-example.png b/content/acn-notes/software-defined-networking/openflow-example.png Binary files differ. diff --git a/content/acn-notes/software-defined-networking/sdn-architecture.png b/content/acn-notes/software-defined-networking/sdn-architecture.png Binary files differ. diff --git a/content/acn-notes/software-defined-networking/slicing-layer.png b/content/acn-notes/software-defined-networking/slicing-layer.png Binary files differ. diff --git a/content/acn-notes/software-defined-networking/traditional-vs-sdn.png b/content/acn-notes/software-defined-networking/traditional-vs-sdn.png Binary files differ.
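As a closing illustration, the CoVisor composition operators from the SDN notes (parallel `+`, sequential `>>`, override `▷`) can be given toy semantics in a few lines. Everything here is invented for illustration: a "policy" maps a packet (a dict of header fields) to a set of output ports, with `None` meaning "no opinion", and the example controllers are made up:

```python
# Toy semantics for CoVisor-style policy composition (illustrative only).
def parallel(p1, p2):
    """'+': both controllers process the packet; union their actions."""
    return lambda pkt: (p1(pkt) or set()) | (p2(pkt) or set())

def override(p1, p2):
    """'▷': the first controller acts, or defers to the second."""
    return lambda pkt: p2(pkt) if p1(pkt) is None else p1(pkt)

def sequential(predicate, policy):
    """'>>': a filtering stage feeds a forwarding stage (e.g. firewall >> router)."""
    return lambda pkt: policy(pkt) if predicate(pkt) else set()

# hypothetical controllers
monitor = lambda pkt: {9} if pkt.get("tcp_dst") == 80 else set()   # mirror web traffic
router = lambda pkt: {1} if pkt.get("ip_dst", "").startswith("10.") else {2}
firewall = lambda pkt: pkt.get("tcp_dst") != 22                    # block SSH
static = lambda pkt: {7} if pkt.get("ip_dst") == "10.0.0.7" else None  # None = defer

policy = sequential(firewall, parallel(monitor, router))
fallback = override(static, router)
```

The point of the operators is exactly this compositionality: each controller stays simple, and the hypervisor combines their opinions per packet.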