Read "Streaming Systems" 1&2, Streaming 101 Read "F1, a distributed SQL database that scales" Read "Zanzibar, Google’s Consistent, Global Authorization System" Read "Spanner, Google's Globally-Distributed Database" Read "Designing Data-intensive applications" 12, The Future of Data Systems IOS development with Swift Read "Designing Data-intensive applications" 10&11, Batch and Stream Processing Read "Designing Data-intensive applications" 9, Consistency and Consensus Read "Designing Data-intensive applications" 8, Distributed System Troubles Read "Designing Data-intensive applications" 7, Transactions Read "Designing Data-intensive applications" 6, Partitioning Read "Designing Data-intensive applications" 5, Replication Read "Designing Data-intensive applications" 3&4, Storage, Retrieval, Encoding Read "Designing Data-intensive applications" 1&2, Foundation of Data Systems Three cases of binary search TAMU Operating System 2 Memory Management TAMU Operating System 1 Introduction Overview in cloud computing 2 TAMU Operating System 7 Virtualization TAMU Operating System 6 File System TAMU Operating System 5 I/O and Disk Management TAMU Operating System 4 Synchronization TAMU Operating System 3 Concurrency and Threading TAMU Computer Networks 5 Data Link Layer TAMU Computer Networks 4 Network Layer TAMU Computer Networks 3 Transport Layer TAMU Computer Networks 2 Application Layer TAMU Computer Networks 1 Introduction Overview in distributed systems and cloud computing 1 A well-optimized Union-Find implementation, in Java A heap implementation supporting deletion TAMU Advanced Algorithms 3, Maximum Bandwidth Path (Dijkstra, MST, Linear) TAMU Advanced Algorithms 2, B+ tree and Segment Intersection TAMU Advanced Algorithms 1, BST, 2-3 Tree and Heap TAMU AI, Searching problems Factorization Machine and Field-aware Factorization Machine for CTR prediction TAMU Neural Network 10 Information-Theoretic Models TAMU Neural Network 9 Principal Component Analysis TAMU Neural Network 8 Neurodynamics TAMU Neural Network 7 Self-Organizing Maps TAMU Neural Network 6 Deep Learning Overview TAMU Neural Network 5 Radial-Basis Function Networks TAMU Neural Network 4 Multi-Layer Perceptrons TAMU Neural Network 3 Single-Layer Perceptrons Princeton Algorithms P1W6 Hash Tables & Symbol Table Applications Stanford ML 11 Application Example Photo OCR Stanford ML 10 Large Scale Machine Learning Stanford ML 9 Anomaly Detection and Recommender Systems Stanford ML 8 Clustering & Principal Component Analysis Princeton Algorithms P1W5 Balanced Search Trees TAMU Neural Network 2 Learning Processes TAMU Neural Network 1 Introduction Stanford ML 7 Support Vector Machine Stanford ML 6 Evaluate Algorithms Princeton Algorithms P1W4 Priority Queues and Symbol Tables Stanford ML 5 Neural Networks Learning Princeton Algorithms P1W3 Mergesort and Quicksort Stanford ML 4 Neural Networks Basics Princeton Algorithms P1W2 Stack and Queue, Basic Sorts Stanford ML 3 Classification Problems Stanford ML 2 Multivariate Regression and Normal Equation Princeton Algorithms P1W1 Union and Find Stanford ML 1 Introduction and Parameter Learning

TAMU Computer Networks 4 Network Layer

2018-04-25

Introduction

Network layer protocols: IP, ICMP, IGMP.

  • IP
    • Datagram format
    • Addressing conventions
  • ICMP
    • Error reporting
    • Ping, traceroute
  • IGMP
    • Multicast

Network Layer = IP Layer

  • Transports segments from sending to receiving host
  • On the sending side, encapsulates segments into datagrams
  • On the receiving side, delivers segments to transport layer
  • Network layer protocols in every host and router
  • Router examines header fields in all IP datagrams passing through it

Key Network-Layer Functions

  1. Routing: determine the path taken by packets from source to dest
    • Build a minimum-cost table at each router
    • Table has next-hop neighbor for each possible destination
    • send packet along the least-expensive path
  2. Forwarding: move packets from a router’s input port to appropriate router output port (physical interface (NIC) inside router, not TCP/UDP port)
    • Table lookup
    • Port-to-port transfer
    • Goal: efficiency
  3. Connection setup in certain network architectures
    • e.g., ATM (Asynchronous Transfer Mode)
    • Before datagrams flow in such networks, two hosts and intermediate routers establish virtual circuit (VC)
      • Routers get involved to set up a path

Virtual circuit and datagram networks

Internet is preferred compared with ATM.

  • Driven by data exchange among computers
    • “Elastic” service, no strict timing requirements
  • Smart end systems, complexity at edges
  • Simple network core

ATM:

  • Like Telephony
  • Strict timing, bandwidth requirements
  • Need for guaranteed service
  • Dumb end system
  • Complexity in network core

Virtual Circuits

  • VCs may create a path that behaves much like a telephone circuit (no congestion, low delay, no loss)
  • Call setup for each connection before data can flow
    • Similar to TCP’s handshake, but involves routers
  • Each packet carries a VC tag instead of the 5-tuple
    • 5-tuple: <src addr, dest addr, src port, dest port, proto>
  • Every router on source-dest path maintains “state” for each passing connection
    • Mapping from tags to next-hop router
  • Fraction of router resources (bandwidth, buffers) are allocated to each VC

Datagram networks

  • No call setup at network layer
  • Routers: no state about end-to-end connections
    • No network-level concept of “connection”
  • Packets forwarded using destination host address
    • Packets between the same source-dest pair may take different paths (multi-path routing)

Inside a router

A router has several ports (interface that can send and receive data) and a switching hardware connecting them.

Key router functions:

  • Run routing algorithms
    • RIP, OSPF, BGP
  • Forward datagrams from incoming to outgoing link
    • Through switch fabric hardware
    • Given datagram destination, look up output port using forwarding table in input port memory
    • Queueing may happen at both input and output ports
      • Customer traffic: single FIFO drop-tail queue
      • ISP traffic: multiple queues with WRR or priority queuing

Switching methods:

  • Switching Via Memory
  • Switching Via a Bus
    • A shared bus
  • Switching Via An Interconnection Network
    • Overcomes bus bandwidth limitations
    • Crossbar

IP, Internet Protocol

IP datagram

Overhead of IP is 20 bytes, same as TCP.

IP datagram format (some):

  • IP protocol version
  • TTL: max number remaining hops (decremented at each router)
  • source and destination IP addresses
  • 16-bit ID, flags, fragment offset (for IP fragmentation)
  • upper layer protocol to deliver (TCP, ICMP, etc)
  • QoS (priority of packet)
  • checksum (header only)

IP fragmentation & reassembly

  • Network links have varying MTUs (maximum transmission units) – largest possible link-level frames
    • Different link types, different MTUs (most common 1500)
  • Large IP datagram divided (“fragmented”) within network
    • One datagram becomes several datagrams
    • Reassembled only at final destination

For example, a large datagram can be divided into 3 smaller one.

  • length includes the 20 bytes header
  • offset in in 8-byte words
  • Last packet is identified by flag 0 and nonzero offset
// large
length 4000, id x, flag 0, offset 0

//small 1
length 1500, id x, flag 1, offset 0

//small 2
length 1500, id x, flag 1, offset 185

//small 3
length 1040, id x, flag 0, offset 370

IP Addressing

  • IP address: 32-bit identifier for host or router interface
  • Interface: connection between host/router and physical link
    • Also called a port
    • Routers typically have multiple interfaces
    • Multihoming: Hosts have multiple interfaces

Subnet (LAN) is a network composed of devices with the same subnet prefix of IP address.

  • Subnet prefix: k bits
  • Host suffix: 32-k remaining bits
  • Can physically reach each other without intervening router
  • To determine the subnets, detach each interface from its host or router, creating islands of isolated networks
  • Subnet mask: 255.255.255.0 or /24

CIDR stands for Classless InterDomain Routing. It has the address format of 200.23.16.0/23.

A host can get an IP address by:

  • hard-coded by system admin in a file
    • Windows: Control-panel–network–configuration–tcp/ip–properties
    • Linux: /etc/rc.config
  • Or dynamically assigned by DHCP
    • Plug-and-play

Hierarchical Addressing: allows efficient advertisement of routing information. ISPs can ask for different CIDR address range, like “Send me anything in 199.31.0.0/16 or 200.23.18.0/23”.

ISP get a block of addresses from ICANN (Internet Corporation for Assigned Names and Numbers). These registries process ISP and user requests for subnet space. Also manage DNS and resolve disputes.

  • ARIN (North/South America)
  • RIPE (Europe)
  • APNIC (Asia-Pacific)
  • AfriNIC (Africa)

NAT, Network Address Translation

Local network uses just one IP address as far as the outside world is concerned.

  • No need to be allocated a range of addresses from ISP
  • Just one IP address is used for all devices
  • Can change addresses of devices in local network without notifying outside world
  • Can change ISP without changing addresses of devices in local network
  • Devices inside local net not explicitly addressable or visible to outside world (a security plus)

NAT router has a translation table that translates between WAN side addr (ip, port) and LAN side addr. It will change datagram source addr while sending out and dest addr while receiving in.

  • With 16-bit port, a NAT allows 64K simultaneous connections with a single LAN-side address.
  • NAT router touches Transport layer (Router should process up to Network layer), which is controversial.
  • Makes inbound connections difficult
    • Inbound connections needed in P2P and other applications
    • May be overcome by UPnP or manually configuring NAT to route incoming connections to a particular host
  • Some believe that address shortage should instead be solved by IPv6

ICMP, Internet Control Message Protocol

  • Communicates network-level debug information
    • Error reporting: unreachable host, network, port, protocol
    • Echo request/reply (ping)
  • Network-layer above IP
    • ICMP msgs carried in IP datagrams (“layer 3.5”)
  • ICMP error message
    • Payload contains first 28 bytes of IP pkt causing error

Datagram format:

  • Type: 8 bits
  • Code: 8 bits
  • Checksum: 16 bits
  • ID: 16 bits
  • Sequence: 16 bits
Type Code description
0 0 echo reply(ping)
3 3 dest port unreachable
8 0 echo request (ping)
11 0 TTL expired

Traceroute

  • Source sends series of UDP segments to dest
    • First with TTL = 1
    • Second with TTL = 2
    • Unlikely port number
  • When the n-th datagram arrives to the n-th router
    • Router discards datagram
    • Sends to source a TTL Expired (type 11, code 0)
    • Message includes IP hdr from router & first 28 bytes of original packet
  • When ICMP message arrives, source calculates RTT
    • Traceroute does this 3 times per hop
  • UDP segment eventually arrives at destination host
    • Destination returns ICMP “port unreachable” packet (type 3, code 3)
    • When source gets this ICMP, it stops

IPv6

  • Address length is 16 bytes.
  • Simpler header format helps speed up forwarding
  • Header changes to facilitate QoS (priority of packet) and extensions
  • Data format
    • Fixed-length 40 byte header
    • No fragmentation allowed
    • No checksum
    • Allow options, by “next header”
  • Tunneling: IPv6 can be carried as payload in IPv4 datagrams among IPv4 routers.

Routing algorithms

  • Global
    • Routers have complete topology, link cost info
    • Link state algorithms
  • Local (decentralized)
    • Router knows physically-connected neighbors, link costs to neighbors
    • Iterative process of computation, exchange of info with neighbors
    • Distance vector algorithms

Dijkstra’s algorithm:

  • Entire network topology and link costs known
    • Accomplished via “link state broadcast”
    • Eventually, all nodes have same info
  • Iterative: after k iterations, know least-cost path to k closest destinations
  • Heap-based implementation:
  • Oscillations possible, but only for traffic-dependent cost

Distance vector

  • Two metrics known to each node
    • Estimate of least cost from to
    • Link cost to reach ’s immediate neighbors
  • Each node maintains a distance vector
    • to neighbors :
  • Node periodically asks its neighbors for their distance vectors
    • Thus, has access to the following for each neighbor
    • for neighbor ,

Bellman-Ford algorithm

When a node receives new DV estimate from neighbor , it updates its own DV using the Bellman-Ford equation:

  • Centralized Bellman Ford requires time
    • Convergence of decentralized version depends on topology, link weights, update delays, and timing of events
  • Allow negative weights
  • Iterative, asynchronous. Each iteration caused by:
    • Local link cost change
    • DV update message from neighbor
  • Distributed:
    • Each node notifies neighbors only when its DV changes

Steps:

  • Node detects local link cost change
  • Recalculates distance vector, updates routing info if needed
  • If DV changes, notifies neighbors

Performance:

  • Good news travels fast
  • Bad news travels slow, “count to infinity” problem!
    • Use poisoned reverse to solve it
      • If z routes through y to get to x:
      • z tells y that z’s distance to x is infinite.
      • So y won’t route to x via z.
      • May not work for complex topology.

Comparison of LS and DV

Message complexity:

  • LS with nodes, links
    • Sent messages
  • DV exchange messages between neighbors only

Convergence time:

  • LS: plus time sending messages
    • Oscillations (cost = congestion)
  • DV: convergence time varies
    • May have routing loops
    • Count-to-infinity problem

Robustness:

  • LS
    • Node can advertise incorrect link cost
    • Affects only a small portion of the graph
  • DS
    • Node can advertise incorrect path cost
    • Each node’s table used by others
    • Error propagate thru network

Hierarchical routing

Limitations: Memory, CPU time, Message overhead with billions links. ISPs are not willing to share their topology with others.

Internet is network of networks. Network admins control routing in their own networks, export reachable subnets to outside world.

So the idea is:

  • Aggregate routers into regions called AS (Autonomous Systems)
  • Routers in the same AS run the same algorithm
    • Accomplished via intra-AS routing protocols
  • ISPs gain flexibility
    • Routers in different ASes can run different intra-AS protocols

Gateway routers:

  • Direct links to routers in other ASes
  • Exchange routing view of each AS using an inter-AS protocol

Interconnected ASes:

  • Intra-AS sets entries for all internal destinations
  • Inter-AS accepts external destinations from neighbor ASes
    • E.g., a gateway in AS1 learns 128.194/16 is reachable via AS2
  • Inter-AS broadcasts pairs (subnet, exit router)
    • E.g., a gateway in AS1 notifies all routers in AS1 that it can reach 128.194/16
  • Inter-AS protocol is also responsible to determine which path to take while multiple path.
    • E.g., hot potato routing: use the closest exit point

Routing in the Internet

  • Intra-AS routing
    • RIP: Routing Information Protocol (DV)
      • Distance metric: # of hops (max = 15)
      • DV exchanged among neighbors every 30 sec using advertisement messages
        • UDP port 520
      • If no advertisement heard after 180 sec –> neighbor/link declared dead
      • QoS only applies to specialized packets generated by ISP
      • RIP routing tables managed by an application-level process called routed (daemon)
    • OSPF: Open Shortest Path First (LS)
      • Flood ads messages to entire AS
      • use IP datagram with protocol number 89, authenticated
      • Layer 3.5
      • Hierarchical OSPF in large domains
    • IGRP: Interior Gateway Routing Protocol (Cisco, DV); EIGRP (Extended IGRP)
    • IS-IS (Intermediate System to Intermediate System, LS, independent of the network layer)
  • Inter-AS, only one option
    • BGP (Border Gateway Protocol)
    • All ISPs must support it

BGP, Border Gateway Protocol

BGP allows a subnet to advertise its existence to the rest of the Internet: “I am here”. It also provides each AS a means to:

  • Obtain subnet reachability information from neighboring ASes
  • Propagate the reachability information to all routers internal to the AS
  • Determine “good” routes to subnets based on reachability information and policy

Basics:

  • Pairs of routers (BGP peers) exchange routing info over TCP (port 179) connections: BGP sessions.
    • Note that BGP sessions do not correspond to physical links.
  • When AS2 advertises a prefix 128.194/16 to AS1, AS2 is promising it will forward any datagrams destined to that prefix towards the prefix
    • AS2 can aggregate prefixes in its advertisement
  • eBGP session is between ASes
  • iBGP will distribute info to all routers inside an AS
  • Internal AS routers combine intra-AS data with iBGP broadcasts to set up actual forwarding tables

Path Attributes & BGP Routes:

  • When advertising an IP prefix (i.e., subnet), message includes BGP attributes
    • Route: subnet + attributes
  • Two important attributes:
    • AS-PATH: contains ASes through which the advert for the prefix passed (latest AS first)
    • NEXT-HOP: indicates the router that should receive traffic (usually border router of the AS that advertised prefix; multiple values possible)

Route selection:

  • When gateway router receives route advert, it uses an import policy to accept/decline
    • Filters and rules decide allowed/prohibited routes
  • Decide which route is better
    • Multi-exit discriminator (MED) attribute: policy of foreign AS that assigns different weight to different incoming points
    • Shortest AS-PATH
    • Closest NEXT-HOP router: hot potato routing
    • Local preference attribute: policy decision of accepting AS that assigns different weight to various exit points (only used in iBGP)
  • ISPs want to route mainly to/from their customers!

Broadcast and multicast routing

  • Broadcast: send a packet to all hosts in the network
  • Multicast: send to a certain subset of nodes
  • Unicast: one sender, one receiver

Broadcast

  • Controlled flooding: routers re-broadcast each received packet only once
    • Must keep a table of all previously received pkts to avoid re-sending of the same data (not scalable)
  • Reverse-path forwarding (RPF)
    • routers re-broadcast only packets received on the interface leading towards the source along their own shortest path
    • Drawbacks of RPF: redundant packets are still transmitted and routing must be symmetric
  • Minimum Spanning Tree
    • a tree subgraph of G that spans all network nodes and has the minimum cost of all such trees
    • Kruskal’s and Prim’s algorithms build MST in time
    • often impractical due to lack of global knowledge
  • Center-Based Spanning Tree: a “center” node is selected first (various methods exist)
    • All other nodes asynchronously send join requests using unicast routing towards the center until intersection with tree

Multicast

Multicast involves a subset of routers. Goal is to create a tree between routers to which multicast group members are attached. (NP-complete)

Approach:

  • Source-based mcast forwarding tree
    • tree of shortest path routes from source S to all receivers
    • Dijkstra’s algorithm when S knows entire topology from LS algorithm like MOSPF
  • Source-specific RPF (default opt-in)
    • Initially flood every router, not ideal
    • Remove unrelated routers from the tree. Prune messages sent upstream by router with no downstream group members
  • Steiner Tree: minimum cost tree connecting all routers with attached group members
    • NP-complete
    • Even though heuristics exists, not used in practice
  • Center-Based Tree (CBT) (default opt-out)
    • Single delivery tree shared by all
    • One router identified as “center” of tree
    • Join messages sent towards center until existing tree is met

DVMRP

  • DVMRP: Distance Vector Multicast Routing Protocol, RFC 1075 (1988)
  • Flood and prune (default opt-in): reverse path forwarding (RPF), tree rooted at source
  • IGMP broadcasts proceed between neighbor routers
  • Multicast IP addresses are in 224.0.0.0/4
    • To join a particular group, use setsockopt with IP_ADD_MEMBERSHIP

Tunneling: Mcast datagram encapsulated inside “normal” (non-multicast-addressed) datagram.

  • Unicast IP datagram sent thru “tunnel” via regular IP unicast to receiving mcast router
  • Receiving mcast router decapsulates mcast datagrams

PIM: Protocol Independent Multicast

  • Dense (default opt-in)
  • Sparse (default opt-out)

Multicast future

  • Wide-area multicast deployment has been traditionally slow, now practically dead
    • E.g., Mbone
  • One issue is scalability
    • Flooding all Internet receivers is just insane
    • Opens loopholes for DDoS attacks
  • ISP unwillingness to accept multicast traffic
    • Who pays for a single packet being replicated 1M times?
  • Finally, multicast congestion control is hard

Reference

This is my class notes while taking CSCE 612 at TAMU. Credit to the instructor Dr. Loguinov.


Creative Commons License
Melon blog is created by melonskin. This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
© 2016-2019. All rights reserved by melonskin. Powered by Jekyll.