SerialReads

Computer Networking Overview

May 06, 2025

Computer Networking Overview

This comprehensive overview covers core computer networking models, protocols, components, architectures, and AWS cloud networking services. It is tailored for a senior software engineer preparing for system design interviews and working with AWS. Each section below is modular and self-contained, providing high-level concepts, practical examples, and best practices relevant to real-world systems design and cloud infrastructure.

Fundamental Networking Models

Networking models provide a layered framework for understanding how data travels across networks. The two primary models are the OSI seven-layer model and the TCP/IP model. These models break down networking functions into layers, each building on the layer below. In practice, the models guide design and troubleshooting by isolating issues to specific layers.

OSI Model (7 Layers)

Figure: The 7 Layers of the OSI Model, from Physical (Layer 1) to Application (Layer 7). Each layer has specific responsibilities and interfaces with the layers above and below.

The Open Systems Interconnection (OSI) model defines seven abstraction layers that describe how data moves from an application to the physical network medium and back. The layers, from 1 (lowest) to 7 (highest), are: Physical, Data Link, Network, Transport, Session, Presentation, and Application. Each layer encapsulates a set of functions (for example, the Network layer handles routing of packets, while the Transport layer ensures reliable delivery). This layered approach provides a “universal language” for different systems to communicate. In practice, the OSI model is mostly used as a teaching and troubleshooting tool – real-world protocols often span multiple layers or skip some. For example, an issue with an unreachable web service might be diagnosed by checking connectivity at Layer 3 (IP routing) and Layer 4 (TCP handshake) before looking at an application-layer (Layer 7) problem. The OSI model is rarely implemented exactly in modern networks, but it remains valuable for understanding and explaining network behavior.

Layers and their roles: At Layer 1 (Physical), raw bits are transmitted over a medium (cables, Wi-Fi, etc.). Layer 2 (Data Link) handles framing and local node-to-node delivery (e.g., Ethernet MAC addressing). Layer 3 (Network) manages addressing and routing between networks – the Internet Protocol (IP) operates here. Layer 4 (Transport) provides end-to-end communication and reliability; e.g. TCP (connection-oriented, reliable) and UDP (connectionless, best-effort). Layer 5 (Session) governs the establishment and teardown of sessions (e.g., managing multiple logical connections, as with an RPC session). Layer 6 (Presentation) deals with data format and syntax, ensuring that data from the application layer can be understood by the receiving system (examples: encryption/decryption, serialization like JSON or XML). Layer 7 (Application) is the closest to the end-user; it includes protocols for specific networking applications – e.g. HTTP for web, SMTP for email. These layers interact in a stack: each layer on the sender adds its header (encapsulation) and the corresponding layer on the receiver reads and strips it (decapsulation). Understanding the OSI layers is helpful in interviews for explaining terms like “L4 load balancer” or debugging (e.g., identifying if a problem is at the network layer vs. the application layer). Not every system uses all layers distinctly (some layers might be empty or combined), but the model’s separation of concerns aids in clarity.

TCP/IP Model

The TCP/IP model is the pragmatic model on which the modern Internet is built. It condenses the OSI layers into four or five layers: typically Link (Network Interface), Internet, Transport, and Application (some versions separate Link into Physical and Data Link, totaling five layers). While the OSI model is a theoretical reference, the TCP/IP model maps more directly to real protocols in use. For example, in the TCP/IP model, the Internet layer corresponds to IP (IPv4/IPv6) for routing, the Transport layer includes TCP and UDP, and the Application layer encompasses everything from HTTP to DNS. The TCP/IP model’s simplicity reflects the design of the Internet, where protocols are defined in these four layers and interoperability is key. In practice, when designing systems we often refer to TCP/IP layers; e.g., designing a solution “at the transport layer” likely implies working with TCP/UDP rather than inventing a new layer. The OSI model remains useful for conceptual understanding, but the TCP/IP model is more commonly used in practice today, especially when discussing real-world networking (for instance, engineers often speak of “layer 4 vs layer 7 load balancing” in terms of TCP/IP and OSI equivalently). A senior engineer should understand both models: OSI for its vocabulary and thoroughness, and TCP/IP for its direct mapping to actual protocols and the Internet’s architecture.

Essential Networking Protocols

Modern networks rely on a suite of core protocols that operate at different layers to enable communication. This section covers key protocols and concepts: HTTP/HTTPS at the application layer, TCP vs UDP at the transport layer, IP (v4 & v6) at the network layer (with addressing/subnetting), and DNS for name resolution. Understanding these is crucial for system design (e.g., choosing TCP or UDP for a service, or designing a domain name scheme for microservices).

HTTP and HTTPS (Web Protocols)

HTTP (HyperText Transfer Protocol) is the fundamental application-layer protocol of the World Wide Web. It defines how clients (typically web browsers) request resources from servers and how servers respond. HTTP is a stateless, request-response protocol, meaning each request from a client is independent – the server does not retain session information between requests by default. A client (like a browser) sends an HTTP request (e.g., a GET request for a webpage), and the server returns an HTTP response (e.g., an HTML page). Common HTTP methods include GET (retrieve data), POST (submit data to be processed), PUT, DELETE, etc., often corresponding to CRUD operations in RESTful APIs. HTTP responses come with status codes indicating the result of the request: for example, 200 OK (success), 404 Not Found, 500 Internal Server Error, etc. These codes are grouped into classes – 1xx informational, 2xx success, 3xx redirection, 4xx client error, 5xx server error. For instance, a 200 means success, 404 means the requested resource was not found, 503 means the server is unavailable. Understanding status code classes is useful in debugging and designing REST APIs (e.g., returning 404 vs 400 for different error conditions).
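
To make the request/response cycle concrete, here is a minimal sketch using only Python's standard library; the URL is a placeholder, and any HTTP(S) endpoint would behave the same way:

```python
# Minimal sketch of an HTTP request/response cycle with the standard library.
from urllib import request, error

req = request.Request("https://example.com/", method="GET")
try:
    with request.urlopen(req, timeout=5) as resp:
        print(resp.status)                   # e.g. 200 (2xx = success)
        print(resp.headers["Content-Type"])  # response metadata
        body = resp.read()                   # the response body (HTML, JSON, ...)
except error.HTTPError as e:
    # 4xx/5xx responses are raised as HTTPError; e.code is the status code
    print("request failed with status", e.code)
```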

HTTPS is the secure version of HTTP. It stands for HTTP Secure and essentially means HTTP over TLS/SSL encryption. When using HTTPS (e.g., https:// URLs), the client and server perform a TLS handshake to establish an encrypted connection before exchanging HTTP data, thereby providing confidentiality and integrity. TLS (Transport Layer Security) is the modern version of SSL and is a security protocol that encrypts communication between a client and server. A primary use case of TLS is securing web traffic (HTTPS) to prevent eavesdropping or tampering. In practice, when a browser connects to an HTTPS site, it verifies the server’s identity via an X.509 certificate and then negotiates encryption keys (this is the TLS handshake). After that, HTTP requests and responses are encrypted. TLS provides assurances that the client is talking to the genuine server (authentication via certificates) and that no one can read or alter the data in transit. Modern best practices require HTTPS for virtually all web traffic (e.g., browsers flag non-HTTPS sites as insecure). In system design, one should note that HTTPS adds a bit of overhead (CPU for encryption, and latency for the handshake), but it’s necessary for security. Also, load balancers or proxies often terminate TLS (performing decryption) and forward traffic internally over HTTP – this is a common architecture for handling HTTPS at scale. In summary, HTTP/HTTPS knowledge includes understanding the stateless nature of HTTP (and the need for mechanisms like cookies or tokens to maintain sessions), knowing the common status codes, and recognizing how TLS secures communications.
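
A rough sketch of what sits underneath an https:// URL, using Python's standard ssl module (the hostname is a placeholder): the TCP connection is wrapped in TLS, the server's certificate is verified against the system's trusted CAs, and only then is HTTP spoken over the encrypted channel:

```python
# Sketch of HTTPS from the client side: TCP, then TLS handshake, then HTTP.
import socket, ssl

hostname = "example.com"                    # placeholder host
ctx = ssl.create_default_context()          # verifies certs against system CAs

with socket.create_connection((hostname, 443), timeout=5) as tcp_sock:
    with ctx.wrap_socket(tcp_sock, server_hostname=hostname) as tls_sock:
        print(tls_sock.version())                  # negotiated protocol, e.g. TLSv1.3
        print(tls_sock.getpeercert()["subject"])   # identity proven by the certificate
        # From here on, HTTP requests/responses travel encrypted:
        tls_sock.sendall(b"GET / HTTP/1.1\r\nHost: example.com\r\nConnection: close\r\n\r\n")
        print(tls_sock.recv(200))                  # first bytes of the HTTP response
```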

TCP vs UDP (Transport Layer Protocols)

At the transport layer, TCP (Transmission Control Protocol) and UDP (User Datagram Protocol) are the two fundamental protocols. Both use IP underneath but differ significantly in behavior and use cases:

In summary, the TCP vs UDP trade-off comes down to reliability vs. latency. TCP gives you heavy guarantees (ordering, no duplicates, reliable delivery) at the cost of extra network chatter and complexity, whereas UDP is essentially just “send and forget,” suitable for cases where the application can tolerate or handle some loss. As a senior engineer, it’s important to know that TCP can struggle with high-latency links or multicast (UDP supports multicast, TCP doesn’t), and that UDP requires you to consider packet loss at the application level. Many protocols build on UDP for speed and add their own reliability only if needed (e.g., QUIC is a modern protocol used in HTTP/3 that runs over UDP to get the benefits of UDP with built-in congestion control and reliability at the application layer). In AWS, most services (being HTTP-based) use TCP; even telemetry paths such as CloudWatch Logs and metrics are ingested over HTTPS (and therefore TCP) for reliability. If designing a custom service, say a telemetry ingestion service that can lose the occasional packet but needs low latency, you might design it over UDP.

Reliability and ordering: TCP ensures in-order delivery or the connection breaks – it’s “near-lossless”. UDP is “lossy”; the application might receive packets out of order or not at all, and it must deal with that. For example, a video call app might see a missing UDP packet and decide to not display a few pixels for a moment rather than pause the whole video. Knowing which transport to use is crucial: e.g., a payment system should not use UDP (you don’t want lost data in transactions), whereas a live metrics dashboard could use UDP for real-time updates where occasional drops are acceptable.
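
A minimal sketch of the API-level difference, using loopback sockets from Python's standard library (ports and payloads are arbitrary): UDP simply sends a datagram with no handshake or delivery guarantee, while TCP performs a handshake and then delivers bytes reliably and in order:

```python
import socket

# --- UDP: connectionless "send and forget" ---------------------------------
receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
receiver.bind(("127.0.0.1", 0))              # OS picks a free port
addr = receiver.getsockname()

sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sender.sendto(b"metric=42", addr)            # no handshake, no delivery guarantee
print(receiver.recvfrom(1024))               # datagram arrives (on loopback)

# --- TCP: connection-oriented, ordered, reliable ----------------------------
server = socket.create_server(("127.0.0.1", 0))            # listening socket
client = socket.create_connection(server.getsockname())    # 3-way handshake here
conn, _ = server.accept()
client.sendall(b"payment:100.00")            # bytes arrive in order, or the
print(conn.recv(1024))                       # connection errors out

for s in (sender, receiver, client, conn, server):
    s.close()
```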

IP Protocol (IPv4, IPv6 and Addressing/Subnetting)

IP (Internet Protocol) operates at the network layer (Layer 3) and is the core protocol that delivers packets from source to destination across networks. It provides addressing (each device has an IP address) and routing (forwarding through intermediate routers). There are two versions in use:

Addressing and Subnetting: An IP address has two parts: network prefix and host identifier. Subnetting is the practice of dividing a network into smaller networks (subnets) by extending the network prefix. Each subnet is identified by a network address and a subnet mask (or prefix length, e.g., /24). For example, 192.168.1.0/24 represents a subnet where the first 24 bits are network (192.168.1.0) and the remaining 8 bits are for hosts (256 addresses, of which a few are reserved). Why subnet? It improves routing efficiency, security, and management. A subnet (subnetwork) is a network inside a larger network – it allows localizing traffic so it doesn’t have to traverse the entire network. By subnetting, network traffic can travel shorter distances without passing through unnecessary routers. For instance, in a data center, you might subnet by rack or by department so that most traffic stays local. From a design perspective, subnetting helps segment networks (e.g., separating a database subnet from a web subnet for security). In cloud environments like AWS, you must subnet your VPC IP range into subnets (often at least one public and one private subnet per availability zone). Each subnet has a CIDR block (range of IPs) and belongs to one availability zone.

A concept related to subnetting is the subnet mask (e.g., 255.255.255.0 for /24) which delineates network vs host bits. Another concept is CIDR (Classless Inter-Domain Routing) notation (like /16, /24) which replaced classful networks to allow more flexible allocation. For an interview, one should know how to calculate the number of addresses in a subnet (e.g., a /24 has 2^(32-24)=256 addresses) and understand broadcast vs network addresses, etc., though cloud platforms abstract some of these details. Key takeaway: IP addressing enables global unique identification of endpoints, and subnetting is used to logically partition networks to suit organizational or topological needs.

In practice, when designing an AWS VPC, you might choose an IPv4 CIDR like 10.0.0.0/16 (gives 65k addresses) and then subnet it into /24s for different zones or services. IPv6 can also be enabled (AWS gives a /56 or /64 block) to allow IPv6 addressing. Many system design considerations (like load balancer addressing, NAT gateways, etc.) tie back to IP addresses and subnets.
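
The subnet arithmetic above is easy to sanity-check with Python's standard ipaddress module; the CIDR blocks below are just the illustrative values from this section:

```python
# Subnet math with the standard library: how many addresses a prefix holds,
# and carving a VPC-sized /16 into /24 subnets (e.g., one per AZ or tier).
import ipaddress

subnet = ipaddress.ip_network("192.168.1.0/24")
print(subnet.num_addresses)        # 256 (2**(32-24)); usable hosts are fewer
print(subnet.netmask)              # 255.255.255.0
print(subnet.broadcast_address)    # 192.168.1.255

vpc = ipaddress.ip_network("10.0.0.0/16")           # e.g. an AWS VPC CIDR
first_four = list(vpc.subnets(new_prefix=24))[:4]   # 10.0.0.0/24 ... 10.0.3.0/24
print(first_four)
```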

DNS (Domain Name System)

DNS (Domain Name System) is the phonebook of the Internet, translating human-friendly domain names (like example.com) into IP addresses that computers use to route traffic. When you enter a URL or use an API endpoint, a DNS resolution occurs to find the server’s address. DNS is an application-layer protocol, but it underpins practically all internet communications.

How DNS resolution works: DNS is a distributed, hierarchical system. There are different types of DNS servers that work together in a lookup:

This process is iterative: the recursive resolver goes step by step, and it caches responses along the way to speed up future lookups. Caching is critical to DNS’s performance – once a name is resolved, the resolver will remember it for the TTL (time-to-live) specified in the DNS record, so subsequent requests are answered quickly from cache rather than hitting the root/TLD servers again.

DNS records: DNS records map names to data. Common record types include A (IPv4 address for a host), AAAA (IPv6 address), CNAME (alias one name to another), MX (mail exchange servers for email), TXT (arbitrary text, often for verification or SPF/DKIM email security), NS (delegates to a name server), and SOA (start of authority, domain metadata). For instance, an A record for www.example.com might point to 93.184.216.34. A CNAME for photos.example.com might alias to images.example.com (so that the latter’s A record is used). In system design, you use DNS to provide friendly endpoints for your services and to do things like load distribution (by returning multiple A records or using weighted DNS). DNS caching can sometimes cause stale data issues (e.g., if you change an IP but the old one is cached), which is why TTLs should be chosen carefully – short TTLs for highly dynamic services, longer for stable mappings.
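
As a small illustration, this is roughly how an application resolves a name through the OS (and thus the configured recursive resolver, with its caching). Note that getaddrinfo does not expose TTLs or raw record types; a dedicated DNS library would be needed for that:

```python
# Resolve a name the way most applications do: ask the OS stub resolver.
import socket

infos = socket.getaddrinfo("example.com", 443, proto=socket.IPPROTO_TCP)
for family, _, _, _, sockaddr in infos:
    # A records show up as AF_INET (IPv4), AAAA records as AF_INET6 (IPv6)
    print(family.name, sockaddr[0])
```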

Server roles: A single machine can be configured as different DNS roles:

From a security standpoint, DNS has weaknesses (like spoofing), which led to DNSSEC (DNS Security Extensions) where responses are signed to ensure authenticity. For performance, many large services use CDNs and Anycast DNS to make sure DNS queries are answered by a nearby server.

For a senior engineer, understanding DNS is key: e.g., how a CDN like CloudFront uses CNAMEs, how to design domain naming for microservices (perhaps using subdomains), how to handle DNS-based load balancing or failover (Route 53 can do health-check based failover). Also, knowing that DNS resolution adds latency to the first request (DNS lookup time) – typically a few tens of milliseconds – and that clients cache results (browsers, OS cache) which is why a bad DNS record can linger. In summary, DNS translates names to IPs and is structured in a hierarchy of servers (root → TLD → authoritative), with caching at resolvers to improve performance. It’s a critical piece often brought up in system design (for example, how do services discover each other? Possibly via DNS names).

Common Networking Components

Networking involves hardware and software components each playing distinct roles. Key components include switches, routers, and gateways (for moving data through networks), load balancers and proxies (for distributing and intermediating traffic), and firewalls, VPNs, and security groups (for securing network boundaries). Understanding these is important both for on-premise architecture and cloud (where many of these exist virtually).

Routers, Switches, and Gateways

In a nutshell: switches connect devices within a network (LAN), routers connect different networks (LAN to LAN or LAN to WAN), and gateways connect fundamentally different networks or protocols.

In cloud design, you don’t manually handle switches and routers – AWS handles those, but they expose abstractions: Subnets and route tables (router behavior) and Gateways (internet gateway, NAT gateway, etc.). On-premise, one designs which subnets go to which routers, uses switches to connect servers, etc. A strong understanding ensures you can reason about things like why two instances in different subnets can’t communicate – maybe a route is missing (router issue) or a NACL blocking (firewall issue) rather than a switch issue, etc.

Load Balancers & Reverse Proxies

Load balancers and reverse proxies are mechanisms to distribute and manage incoming traffic to servers. They often overlap in functionality (and a single software or device can be both). Both typically sit between clients and back-end servers, but there are subtle differences and use cases:

In cloud environments, the distinction blurs: for instance, AWS’s Application Load Balancer is effectively a reverse proxy that does HTTP(S) load balancing (it looks at headers, can do path-based routing, etc.) – it operates at Layer 7. AWS’s Network Load Balancer is purely packet-based (Layer 4), not modifying traffic, just forwarding at network level with ultra-high throughput. In system design, mention Layer 4 vs Layer 7 load balancing. Layer 7 (application load balancing) allows smarter routing (like sending all /api/users/* requests to a specific set of servers, or implementing blue/green deployments by routing traffic based on cookie or host), but it introduces a bit more latency due to deeper packet inspection. Layer 4 load balancing is useful for non-HTTP protocols or when you need extreme performance (millions of connections) and just want to distribute by IP/port (it’s also typically what is used for UDP-based services, since Layer 7 LB for arbitrary UDP is not feasible).

A real-world example: Suppose you have a microservices architecture with many services. You might use a reverse proxy (like an API Gateway or Nginx) to route requests to different services based on the URL. That reverse proxy might also handle authentication, caching, or compressing responses. Meanwhile, each service might have multiple instances, so there’s load balancing happening for each service cluster as well. In AWS, you could achieve this with an Application Load Balancer that has multiple target groups (one for each microservice set) and listener rules to route to them by path or host. Alternatively, you could chain a Network Load Balancer in front of an ALB (AWS allows NLB -> ALB) to combine benefits (though that’s an edge case).
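
A hedged sketch of that ALB listener-rule idea using boto3; the ARNs are placeholders and the path pattern is just the example from above:

```python
# Hypothetical sketch (boto3, placeholder ARNs): path-based routing on an ALB,
# sending /api/users/* to one target group while other rules handle the rest.
import boto3

elbv2 = boto3.client("elbv2")

elbv2.create_rule(
    ListenerArn="arn:aws:elasticloadbalancing:...:listener/app/my-alb/...",   # placeholder
    Priority=10,
    Conditions=[{"Field": "path-pattern", "Values": ["/api/users/*"]}],
    Actions=[{
        "Type": "forward",
        "TargetGroupArn": "arn:aws:elasticloadbalancing:...:targetgroup/users-svc/...",  # placeholder
    }],
)
```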

In summary, load balancers focus on distributing traffic to ensure no single server is overwhelmed and to provide redundancy, and they come in L4 and L7 flavors. Reverse proxies can do load balancing but also provide a single point to implement cross-cutting concerns (logging, caching, TLS, etc.) at the application protocol level. Most modern load balancers (like ALB, Nginx, F5 BIG-IP LTM) are effectively reverse proxies working at L7. Knowing how to use them is crucial: e.g., in a system design, a load balancer can allow horizontal scaling and zero-downtime deployments (by draining connections on one server while others take traffic). Also, from a security view, a reverse proxy can shield the backend servers (clients only ever see the proxy’s IP/name).

Firewalls, VPNs, and Security Groups

These components are all about security and controlled access in networking:

To sum up this section: Firewalls and security groups protect resources by permitting or blocking connections. VPNs securely connect networks or clients over untrusted networks. In a layered security model, you might have a network firewall at your VPC edge (AWS offers the Network Firewall service, or you use NACLs), plus security groups on each instance for a second layer, plus maybe host-based firewalls on the OS. A best practice is the principle of least privilege – only open the ports and sources needed. For instance, open the database port only to app servers, not to the world. AWS Security Groups make this easier by letting you express who can connect in terms of other security groups or IP ranges. And finally, always consider encryption (VPN or TLS) when sending sensitive data across networks you don’t fully control.
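
As a sketch of least privilege in practice, the following boto3 call (with placeholder group IDs) opens the MySQL port on a database security group only to members of the app-tier security group, rather than to an IP range:

```python
# Hypothetical sketch (boto3, placeholder IDs): least-privilege security groups.
import boto3

ec2 = boto3.client("ec2")

ec2.authorize_security_group_ingress(
    GroupId="sg-0db11111111111111",           # placeholder: database tier SG
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 3306,
        "ToPort": 3306,
        # Reference the app tier's SG instead of an IP/CIDR range:
        "UserIdGroupPairs": [{"GroupId": "sg-0app2222222222222"}],   # placeholder
    }],
)
```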

Networking Architectures & Patterns

Beyond individual protocols and components, how we organize and pattern our network interactions is critical in system design. Two fundamental paradigms are client-server and peer-to-peer. Additionally, content delivery via CDNs (Content Delivery Networks) has become a staple pattern for improving performance for global users. We examine each and their use cases, pros/cons.

Client-Server Architecture

Client-server is the classic architecture where a server provides a service or resources, and clients consume them over a network. The server is often a centralized (or a distributed cluster acting as central) authority that hosts data or functionality, and clients (which could be end-user applications, browsers, mobile apps, or other services) initiate requests to the server and receive responses. This pattern underlies the majority of web and enterprise systems. For example, when you use a web application, your browser (client) sends an HTTP request to a web server (perhaps running on AWS EC2 behind a load balancer) which then responds with the webpage or data.

Characteristics:

Pros:

Cons:

Example patterns:

In interviews, when asked to design a system (like Twitter, etc.), the default assumption is a client-server model (users use the service via a client that calls the centralized service). One might mention using load balancers to handle more clients, caching on server side to reduce load, etc., but not converting it to a peer-to-peer system (unless it’s something explicitly P2P like a file sharing service).

Peer-to-Peer Networks

Peer-to-peer (P2P) architecture decentralizes the client-server model by making each node both a client and a server. In a pure P2P network, there is no central coordinator; every node (peer) can initiate or service requests. Peers directly exchange data with each other. This model gained fame with file-sharing systems (Napster, Kazaa, BitTorrent) and is also used in blockchain networks and some communication apps.

Characteristics:

Pros:

Cons:

Use cases and examples:

In system design, P2P might come up if discussing, say, a content distribution system or how to avoid central bottlenecks. Often, though, interview designs lean on client-server due to simplicity and control. But mentioning P2P could show awareness of alternatives. For instance, for a video streaming platform, one might mention a peer-assisted delivery (like clients also upload popular chunks to nearby clients) to reduce bandwidth on origin – some commercial systems (and WebRTC based CDNs) do that.

One should also understand hybrid models: Many systems use a central index but P2P for data. Napster had a central server for search, but file transfers were P2P between users. BitTorrent uses trackers (or DHT) to coordinate but then peers directly exchange. This often yields a balance: a bit of centralization for efficiency where acceptable, combined with decentralization for scale.

CDNs (Content Delivery Networks) and Caching Strategies

A Content Delivery Network (CDN) is a distributed network of servers strategically placed around the globe to cache and deliver content to users with high availability and performance. Instead of every user hitting your origin server (which might be in one region), users fetch content (especially static content like images, scripts, videos) from a CDN node (edge server) that is geographically closer to them. This reduces latency and offloads work from the origin. AWS CloudFront is an example of a CDN service.

How CDNs work: When using a CDN, your DNS for, say, assets.myapp.com is configured to point to the CDN’s network. A user in Europe might resolve that to an edge server in Frankfurt, while a user in Asia gets an edge in Singapore. The CDN edge will check if it has the requested content cached; if yes, it serves it immediately. If not, it will fetch it from the origin (your server), cache it, and then serve the user. Subsequent requests from nearby users get the cached copy. CDNs thus operate as reverse proxies with distributed caching. CDNs like CloudFront also handle HTTPS, can compress content, and even edge compute (like AWS Lambda@Edge).

Benefits:

AWS CloudFront specifics: It’s a managed CDN where you define “distributions” that pull from an origin (could be an S3 bucket, an HTTP server, etc.). CloudFront has edge locations across the world. It supports dynamic content as well, including WebSockets and can even proxy through to a custom origin for API calls (though dynamic responses may have Cache-Control: no-cache so it just forwards them). You can configure behaviors for different URL path patterns (cache some things longer, some not at all, etc.). CloudFront also allows invalidation – if you deploy a new version of a file, you can tell the CDN to purge the old version from cache.
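
For instance, an invalidation after a deploy might look roughly like this with boto3 (the distribution ID and path are placeholders):

```python
# Hypothetical sketch (boto3): purge an updated object from CloudFront edge caches.
import time
import boto3

cloudfront = boto3.client("cloudfront")

cloudfront.create_invalidation(
    DistributionId="E1EXAMPLE12345",          # placeholder distribution ID
    InvalidationBatch={
        "Paths": {"Quantity": 1, "Items": ["/assets/app.js"]},
        "CallerReference": str(time.time()),  # must be unique per request
    },
)
```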

Caching strategies:

CDN in system design interviews: If users are global, you should almost always mention a CDN for static content. E.g., “we will serve images and videos via a CDN to reduce latency for users and offload our servers.” If designing something like Netflix or YouTube, CDNs are absolutely critical (they heavily use CDNs to stream video bits from locations close to users). Even for API responses, a CDN can cache certain GET requests if they are public and heavy (e.g., a public leaderboard data that’s same for everyone). Additionally, mention using browser caching (setting headers so that the user’s browser caches files) – CDNs and browser caches together form a caching strategy.

AWS CloudFront example: “Amazon CloudFront is a CDN service built for high performance, security, and developer convenience”. In practice, you’d set up CloudFront distributions for your static assets (perhaps using an S3 bucket as origin) and maybe one for your API (with shorter TTLs or no caching for dynamic API calls, but still using edge locations for TLS termination). CloudFront can also do things like serve a custom error page if origin is down (improving user experience during failures).

Overall, CDNs and caching are about moving data closer to users and reducing redundant work. They are a form of scaling out read throughput – many more people can be served by caches than by the origin at once. The trade-off is cache consistency (you have to manage updates carefully). But for primarily static or globally replicated content, CDNs massively improve performance and are almost default in modern architectures.

Cloud Networking Fundamentals (AWS)

Amazon Web Services provides virtual networking capabilities that mirror many of the concepts from traditional networking, but in a software-defined, easy-to-manage way. Key topics include VPCs (Virtual Private Clouds), subnets and routing (including Internet Gateways and NAT), Security Groups vs NACLs for filtering, Route 53 for DNS, and Elastic Load Balancing options (ALB, NLB, CLB). Mastering these concepts is essential for designing secure and scalable AWS architectures.

VPC (Virtual Private Cloud)

An Amazon VPC is essentially a private network within AWS for your resources. When you create a VPC, you are carving out an isolated IP network (with a range you choose, e.g., 10.0.0.0/16) in which you can launch AWS resources (EC2 instances, RDS databases, Lambda functions (if in a VPC), etc.). By default, no one else can see or access your VPC – it’s like having your own logically separated section of the AWS cloud. As AWS says, “Amazon VPC is a logically isolated section of the AWS Cloud where you can launch AWS resources in a virtual network that you define.” You have full control over that network’s configuration: IP address ranges, subnets, route tables, gateways, and security settings.

Key aspects of VPCs:

Design-wise, a VPC is analogous to having a virtual data center. You’d typically plan subnets for different tiers (public-facing vs internal). For example, a common pattern: a Public subnet for load balancers or bastion hosts (with internet access), and a Private subnet for application servers and databases (no direct internet ingress).
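
A minimal boto3 sketch of that pattern, with illustrative CIDRs and a placeholder Availability Zone (tagging, error handling, and the multi-AZ layout are omitted):

```python
# Hypothetical sketch (boto3): a VPC with one public and one private subnet in one AZ.
import boto3

ec2 = boto3.client("ec2")

vpc = ec2.create_vpc(CidrBlock="10.0.0.0/16")["Vpc"]

public = ec2.create_subnet(
    VpcId=vpc["VpcId"], CidrBlock="10.0.1.0/24", AvailabilityZone="us-east-1a"
)["Subnet"]
private = ec2.create_subnet(
    VpcId=vpc["VpcId"], CidrBlock="10.0.2.0/24", AvailabilityZone="us-east-1a"
)["Subnet"]

# Auto-assign public IPv4 addresses to instances launched in the public subnet
ec2.modify_subnet_attribute(
    SubnetId=public["SubnetId"], MapPublicIpOnLaunch={"Value": True}
)
```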

It’s worth noting that within a VPC, AWS provides an implicit router that handles routing between subnets (you don’t see it, but if subnets have routes to each other, traffic flows). You can imagine this as AWS’s internal network fabric connecting the subnets and implementing the route tables.

Subnets, Internet Gateways, and NAT (Public vs Private Subnets)

Subnets divide a VPC’s IP space and each lies in a single Availability Zone. They are the container where you actually place resources (when you launch an EC2 instance, you pick a subnet for its NIC). There are two main flavors:

Internet Gateway (IGW): As mentioned, an IGW is attached to a VPC to enable internet access. It is horizontally scaled and highly available by AWS (no need to manage it). An IGW serves two purposes: it allows outbound traffic to the internet and inbound traffic from the internet for public IPs. For IPv4, the IGW also performs a one-to-one NAT: your instance’s public IPv4 is actually mapped to the instance’s private IP on the way in and out. For IPv6, NAT is not needed (since IPv6 addresses are globally unique), so the IGW is more of a router. In a route table, you make a route for 0.0.0.0/0 (and ::/0 for v6) pointing to the IGW to signify internet-bound traffic goes there. Only subnets with such a route are “public.” Instances in public subnets must have public IPs to be reachable from the internet (without a public IP, the IGW has nothing to translate, so the instance effectively has no internet path). Also, security groups/NACLs must allow the traffic. When designing, you may mention an IGW as essentially providing the VPC internet connectivity – without it, the VPC is isolated (which can be desired for strict private networks).

NAT Gateway: AWS’s managed NAT service. You place a NAT Gateway in a public subnet and give it an Elastic IP (public IPv4). Then private subnets route their internet traffic to this NAT. The NAT gateway then allows instances in private subnets to connect out to the internet, but prevents the internet from initiating connections with those instances. That is NAT’s essence: one-way access. The NAT device maps internal addresses to its own address for outgoing requests (port mapping). For IPv6, AWS offers an Egress-Only Internet Gateway, which is a similar concept (allows v6 out, not in). NAT Gateways are highly available within an AZ (and you typically deploy one per AZ for resilience). They’re fully managed (scale automatically but do incur hourly and data processing costs). Historically, one could also use a NAT Instance (an EC2 running Linux NAT software), which is cheaper but not HA by default and requires more maintenance. Nowadays the NAT Gateway is recommended for ease.
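
Wiring the two gateways into route tables might look roughly like this with boto3 (the VPC and subnet IDs are placeholders, and each route table would still need to be associated with its subnets):

```python
# Hypothetical sketch (boto3): IGW route for the public route table,
# NAT Gateway route for the private one.
import boto3

ec2 = boto3.client("ec2")
vpc_id, public_subnet_id = "vpc-0aaa...", "subnet-0pub..."   # placeholders

# Internet Gateway: two-way door to the internet for the public subnet
igw = ec2.create_internet_gateway()["InternetGateway"]
ec2.attach_internet_gateway(InternetGatewayId=igw["InternetGatewayId"], VpcId=vpc_id)

public_rt = ec2.create_route_table(VpcId=vpc_id)["RouteTable"]
ec2.create_route(RouteTableId=public_rt["RouteTableId"],
                 DestinationCidrBlock="0.0.0.0/0",
                 GatewayId=igw["InternetGatewayId"])

# NAT Gateway: outbound-only door for private subnets (lives in a public subnet)
eip = ec2.allocate_address(Domain="vpc")
nat = ec2.create_nat_gateway(SubnetId=public_subnet_id,
                             AllocationId=eip["AllocationId"])["NatGateway"]

private_rt = ec2.create_route_table(VpcId=vpc_id)["RouteTable"]
ec2.create_route(RouteTableId=private_rt["RouteTableId"],
                 DestinationCidrBlock="0.0.0.0/0",
                 NatGatewayId=nat["NatGatewayId"])

# (ec2.associate_route_table(...) would then bind each table to its subnet.)
```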

Putting it together: A common AWS network setup:

From a design perspective, public subnets are for your “edge” resources (like ALBs, bastion hosts for SSH, etc.), private subnets for internal servers and databases. This minimizes exposure – e.g., your database only accessible from app servers – and is a common interview point for securing cloud architectures.

One nuance: if an instance in a public subnet has no public IP, it’s effectively isolated (like a private instance) despite subnet being public, because IGW won’t have anything to do (it doesn’t NAT for an instance with only private IP). Conversely, if you somehow give an instance a public IP but put it in a subnet with no IGW route, that public IP is essentially useless (no route).

AWS Route Tables and summarizing IGW/NAT: Typically we have one main route table or custom tables. The internet route is only in the public ones. Also, AWS automatically adds routes for VPC’s own IP ranges so subnets talk to each other internally. You can also have routes to Virtual Private Gateways (for VPN to on-prem) or VPC Peering connections or Transit Gateway attachments, but that’s beyond scope here.

In summary, Internet Gateway = door to the internet (both directions) for the VPC, NAT Gateway = one-way door for private subnet instances to go out to internet (outbound only). Designing a secure app, you’d likely: “Place the web tier in public subnets (behind an ALB perhaps) and the business logic and database in private subnets. Use a NAT Gateway so that private instances can download updates. The public subnets have routes to IGW, private subnets have routes to NAT.” This ensures only the web tier is exposed and everything else isn’t directly reachable from outside.

Security Groups vs. NACLs (Network ACLs)

AWS provides two layers of network security within a VPC: Security Groups (SG) and Network Access Control Lists (NACLs). It’s important to understand the differences and how they complement each other.

Security Groups: As discussed earlier, they act as stateful firewalls at the instance level. Key properties:

Network ACLs: These act as a stateless firewall at the subnet level. They control traffic in and out of entire subnets. Key properties:

Differences & Best Practices:

To illustrate, let’s say we have a VPC with a public and private subnet:

Stateful vs Stateless example: Imagine an HTTP request from a user to a web server:

- NACL inbound on the public subnet: must allow TCP 80 from the user’s IP.
- SG on the web server: must allow TCP 80 from the user’s IP (or 0.0.0.0/0 if open to all).
- Response flows out:
  - The SG will allow it because SGs are stateful (inbound was allowed, so the outbound reply is auto-allowed).
  - The NACL outbound on the public subnet must explicitly allow ephemeral port traffic to the user's IP. If the outbound NACL has a blanket allow-all, fine. If not, you need a rule (e.g., allow 1024-65535).
- If any of those checks fails, traffic is blocked.
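
Expressed as configuration, the stateless half of that example is the part people forget – the outbound ephemeral-port rule. A hedged boto3 sketch with a placeholder NACL ID:

```python
# Hypothetical sketch (boto3): because NACLs are stateless, the return traffic
# needs its own rule in the opposite direction.
import boto3

ec2 = boto3.client("ec2")

# Inbound: allow HTTP to the web tier
ec2.create_network_acl_entry(
    NetworkAclId="acl-0aaa1111111111111",    # placeholder
    RuleNumber=100, Protocol="6",            # protocol 6 = TCP
    RuleAction="allow", Egress=False,
    CidrBlock="0.0.0.0/0", PortRange={"From": 80, "To": 80},
)

# Outbound: allow the replies (client-side ephemeral ports)
ec2.create_network_acl_entry(
    NetworkAclId="acl-0aaa1111111111111",    # placeholder
    RuleNumber=100, Protocol="6",
    RuleAction="allow", Egress=True,
    CidrBlock="0.0.0.0/0", PortRange={"From": 1024, "To": 65535},
)
```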

Summary of differences:

In practice, many designs simply use SGs to restrict traffic (which is often enough). But knowing NACLs exist and how to use them adds defense in depth. For example, to protect against the unlikely scenario of a compromised instance whose SG gets modified (e.g., via compromised credentials), a NACL could still block certain traffic. Or, to absolutely prevent any internet access at the subnet level, you could use a NACL to deny 0.0.0.0/0 outbound. Also, in some cases, like restricting certain CIDR ranges environment-wide, a NACL is easier than updating many SGs.

Route 53 (DNS Management)

Amazon Route 53 is AWS’s cloud DNS service. It can manage both public DNS names for your domains and private DNS within your VPCs. The name Route 53 comes from TCP/UDP port 53, which is used for DNS.

Key features:

Route 53 and system design: If your service has a custom domain, you’ll use Route 53 (or another DNS) to host that domain’s records pointing to your load balancer or CloudFront distribution, etc. With Route 53, you can also do clever things like blue-green deployments using weighted routing (send a small % to the new environment). Or region load balancing by latency. Or failover: e.g., if us-east-1 goes down, fail DNS over to us-west-2 where a backup setup is running. The propagation of DNS changes isn’t instantaneous for users (cache TTLs matter), but Route 53 supports very low TTLs (and alias records automatically use the TTL of their target, e.g. 60 seconds for an ALB alias), so health-check based failover can take effect quickly.
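
For example, a weighted blue/green record set might be created roughly like this with boto3 (the hosted zone ID and DNS names are placeholders):

```python
# Hypothetical sketch (boto3): weighted routing for a blue/green rollout,
# sending roughly 10% of resolutions to the green stack.
import boto3

route53 = boto3.client("route53")

route53.change_resource_record_sets(
    HostedZoneId="Z0PLACEHOLDER",            # placeholder hosted zone
    ChangeBatch={"Changes": [
        {"Action": "UPSERT", "ResourceRecordSet": {
            "Name": "api.example.com", "Type": "CNAME", "TTL": 60,
            "SetIdentifier": "blue", "Weight": 90,
            "ResourceRecords": [{"Value": "blue-alb.example.com"}]}},
        {"Action": "UPSERT", "ResourceRecordSet": {
            "Name": "api.example.com", "Type": "CNAME", "TTL": 60,
            "SetIdentifier": "green", "Weight": 10,
            "ResourceRecords": [{"Value": "green-alb.example.com"}]}},
    ]},
)
```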

One should mention that Route 53 is highly reliable – it’s designed with SLA of 100%. DNS in general is crucial infrastructure. Also mention the concept of resolver (clients, like stub resolvers, will cache results per TTL – something to keep in mind if doing fast failovers).

In interviews, typical mentions:

Elastic Load Balancing (ALB, NLB, CLB)

AWS offers Elastic Load Balancing (ELB) as a service with different types of load balancers. The main ones are:

When to use which:

Scalability and elasticity: All these load balancers are managed by AWS – they automatically scale their capacity (adding more compute at the back end invisibly) as traffic grows. That’s why they’re called “Elastic”. You don’t manually provision the capacity (though with NLB you have less visibility; ALB you get some metrics). They are designed to be highly available across AZs if you subnet them appropriately.

Integration with ECS/EKS: ALB can directly integrate with container orchestrators (like as an ingress for Kubernetes or with ECS as a target group that automatically registers tasks). NLB too with certain capabilities.

Costs: ALB is priced per hour + LCU (capacity unit based on new connections, active connections, bandwidth, etc.). NLB is per hour + data. NLB often cheaper for pure data-heavy scenarios (because ALB’s LCU might charge more for lots of requests). CLB is per hour + data. A note: if you needed both layer 4 and 7 features, you might chain an NLB in front of ALB – but that’s advanced scenario.

In system design:

Comparison quick recap: ALB operates at Layer 7 (HTTP) with smart routing and is ideal for web applications, NLB at Layer 4 for high performance and non-HTTP, and CLB is legacy providing basic load balancing functionality. In an AWS architecture, load balancers are the entry point for traffic (sitting in a public subnet if internet-facing), and they direct traffic to target instances in usually private subnets. They decouple clients from servers (clients hit LB DNS which stays same, even if servers change). Also, ALB can offload a lot of HTTP concerns (it can do HTTPS termination, including SNI for multiple certs, it can handle HTTP to HTTPS redirection, and you can attach WAF to it for filtering malicious requests). These features can simplify application code.


By understanding these networking principles, protocols, patterns, and AWS networking specifics, a senior engineer can design systems that are robust, scalable, and secure. In a system design interview, you would draw on these concepts to justify choices (like using TCP vs UDP, adding a CDN, segmenting networks with subnets and security groups, etc.). In practice, knowing how to configure and optimize these (like properly setting up VPC with least privilege, or using ALB vs NLB appropriately, or tuning DNS TTLs) is vital for running reliable services in the cloud.

Network Troubleshooting & Performance

Even with a solid design, networks can encounter issues. Troubleshooting those and optimizing performance are key skills. Common issues include high latency, packet loss, or DNS resolution failures. We also rely on tools like ping, traceroute, and DNS lookup utilities to diagnose problems. Finally, performance can often be improved by techniques like caching, using CDNs, reducing round trips, and tuning configurations.

Common Network Issues: Latency, Packet Loss, DNS Problems

Other issues:

When users report “the network is slow,” the cause is usually latency, packet loss, or DNS (or the server itself is slow, but the network gets blamed). So one systematically checks each.

Diagnostic Tools: ping, traceroute, and DNS lookup

In an interview, if asked how to debug “service unreachable”, you might say: ping the instance (to see if it’s network or host), traceroute to see where it stops, check security group/NACL, check instance OS firewall, etc. If “site is slow”, you might use ping to see latency, or do a dig to ensure DNS isn’t slow, etc.
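
A quick way to separate “DNS is slow” from “the network path is slow” is to time the two phases independently; a rough sketch with Python's standard library (the target host is a placeholder):

```python
# Time DNS resolution and the TCP handshake separately to localize slowness.
import socket, time

host, port = "example.com", 443            # placeholder target

t0 = time.monotonic()
ip = socket.gethostbyname(host)                        # DNS resolution time
t1 = time.monotonic()
with socket.create_connection((ip, port), timeout=5):  # TCP handshake time
    t2 = time.monotonic()

print(f"DNS lookup:  {(t1 - t0) * 1000:.1f} ms -> {ip}")
print(f"TCP connect: {(t2 - t1) * 1000:.1f} ms")
```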

Performance Tuning Tips: Reducing Latency and Improving Throughput

To improve network performance (speed and reliability), consider:
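
One concrete example of reducing round trips is connection reuse (HTTP keep-alive): pay the TCP and TLS handshake cost once and send several requests over the same connection. A minimal sketch with Python's standard library (the host and paths are placeholders):

```python
# Reuse one TCP/TLS connection for several HTTP requests instead of paying
# the handshake cost per request.
import http.client

conn = http.client.HTTPSConnection("example.com", timeout=5)
for path in ("/", "/about", "/contact"):     # three requests, one connection
    conn.request("GET", path)
    resp = conn.getresponse()
    resp.read()                              # drain the body so the socket can be reused
    print(path, resp.status)
conn.close()
```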

This comprehensive overview covers core computer networking models, protocols, components, architectures, and AWS cloud networking services. It is tailored for a senior software engineer preparing for system design interviews and working with AWS. Each section below is modular and self-contained, providing high-level concepts, practical examples, and best practices relevant to real-world systems design and cloud infrastructure.

Fundamental Networking Models

Networking models provide a layered framework for understanding how data travels across networks. The two primary models are the OSI seven-layer model and the TCP/IP model. These models break down networking functions into layers, each building on the layer below. In practice, the models guide design and troubleshooting by isolating issues to specific layers.

OSI Model (7 Layers)

Figure: The 7 Layers of the OSI Model, from Physical (Layer 1) to Application (Layer 7). Each layer has specific responsibilities and interfaces with the layers above and below.

The Open Systems Interconnection (OSI) model defines seven abstraction layers that describe how data moves from an application to the physical network medium and back. The layers, from 1 (lowest) to 7 (highest), are: Physical, Data Link, Network, Transport, Session, Presentation, and Application. Each layer encapsulates a set of functions (for example, the Network layer handles routing of packets, while the Transport layer ensures reliable delivery). This layered approach provides a “universal language” for different systems to communicate. In practice, the OSI model is mostly used as a teaching and troubleshooting tool – real-world protocols often span multiple layers or skip some. For example, an issue with an unreachable web service might be diagnosed by checking connectivity at Layer 3 (IP routing) and Layer 4 (TCP handshake) before looking at an application-layer (Layer 7) problem. The OSI model is rarely implemented exactly in modern networks, but it remains valuable for understanding and explaining network behavior.

Layers and their roles: At Layer 1 (Physical), raw bits are transmitted over a medium (cables, Wi-Fi, etc.). Layer 2 (Data Link) handles framing and local node-to-node delivery (e.g., Ethernet MAC addressing). Layer 3 (Network) manages addressing and routing between networks – the Internet Protocol (IP) operates here. Layer 4 (Transport) provides end-to-end communication and reliability; e.g. TCP (connection-oriented, reliable) and UDP (connectionless, best-effort). Layer 5 (Session) governs the establishment and teardown of sessions (e.g., managing multiple logical connections, as with an RPC session). Layer 6 (Presentation) deals with data format and syntax, ensuring that data from the application layer can be understood by the receiving system (examples: encryption/decryption, serialization like JSON or XML). Layer 7 (Application) is the closest to the end-user; it includes protocols for specific networking applications – e.g. HTTP for web, SMTP for email. These layers interact in a stack: each layer on the sender adds its header (encapsulation) and the corresponding layer on the receiver reads and strips it (decapsulation). Understanding the OSI layers is helpful in interviews for explaining terms like “L4 load balancer” or debugging (e.g., identifying if a problem is at the network layer vs. the application layer). Not every system uses all layers distinctly (some layers might be empty or combined), but the model’s separation of concerns aids in clarity.

TCP/IP Model

The TCP/IP model is the pragmatic model on which the modern Internet is built. It condenses the OSI layers into four or five layers: typically Link (Network Interface), Internet, Transport, and Application (some versions separate Link into Physical and Data Link, totaling five layers). While the OSI model is a theoretical reference, the TCP/IP model maps more directly to real protocols in use. For example, in the TCP/IP model, the Internet layer corresponds to IP (IPv4/IPv6) for routing, the Transport layer includes TCP and UDP, and the Application layer encompasses everything from HTTP to DNS. The TCP/IP model’s simplicity reflects the design of the Internet, where protocols are defined in these four layers and interoperability is key. In practice, when designing systems we often refer to TCP/IP layers; e.g., designing a solution “at the transport layer” likely implies working with TCP/UDP rather than inventing a new layer. The OSI model remains useful for conceptual understanding, but the TCP/IP model is now more commonly used in practice today, especially when discussing real-world networking (for instance, engineers often speak of “layer 4 vs layer 7 load balancing” in terms of TCP/IP and OSI equivalently). A senior engineer should understand both models: OSI for its vocabulary and thoroughness, and TCP/IP for its direct mapping to actual protocols and the Internet’s architecture.

Essential Networking Protocols

Modern networks rely on a suite of core protocols that operate at different layers to enable communication. This section covers key protocols and concepts: HTTP/HTTPS at the application layer, TCP vs UDP at the transport layer, IP (v4 & v6) at the network layer (with addressing/subnetting), and DNS for name resolution. Understanding these is crucial for system design (e.g., choosing TCP or UDP for a service, or designing a domain name scheme for microservices).

HTTP and HTTPS (Web Protocols)

HTTP (HyperText Transfer Protocol) is the fundamental application-layer protocol of the World Wide Web. It defines how clients (typically web browsers) request resources from servers and how servers respond. HTTP is a stateless, request-response protocol, meaning each request from a client is independent – the server does not retain session information between requests by default. A client (like a browser) sends an HTTP request (e.g., a GET request for a webpage), and the server returns an HTTP response (e.g., an HTML page). Common HTTP methods include GET (retrieve data), POST (submit data to be processed), PUT, DELETE, etc., often corresponding to CRUD operations in RESTful APIs. HTTP responses come with status codes indicating the result of the request: for example, 200 OK (success), 404 Not Found, 500 Internal Server Error, etc. These codes are grouped into classes – 1xx informational, 2xx success, 3xx redirection, 4xx client error, 5xx server error. For instance, a 200 means success, 404 means the requested resource was not found, 503 means the server is unavailable. Understanding status code classes is useful in debugging and designing REST APIs (e.g., returning 404 vs 400 for different error conditions).

HTTPS is the secure version of HTTP. It stands for HTTP Secure and essentially means HTTP over TLS/SSL encryption. When using HTTPS (e.g., https:// URLs), the client and server perform a TLS handshake to establish an encrypted connection before exchanging HTTP data, thereby providing confidentiality and integrity. TLS (Transport Layer Security) is the modern version of SSL and is a security protocol that encrypts communication between a client and server. A primary use case of TLS is securing web traffic (HTTPS) to prevent eavesdropping or tampering. In practice, when a browser connects to an HTTPS site, it verifies the server’s identity via an X.509 certificate and then negotiates encryption keys (this is the TLS handshake). After that, HTTP requests and responses are encrypted. TLS provides assurances that the client is talking to the genuine server (authentication via certificates) and that no one can read or alter the data in transit. Modern best practices require HTTPS for virtually all web traffic (e.g., browsers flag non-HTTPS sites as insecure). In system design, one should note that HTTPS adds a bit of overhead (CPU for encryption, and latency for the handshake), but it’s necessary for security. Also, load balancers or proxies often terminate TLS (performing decryption) and forward traffic internally over HTTP – this is a common architecture for handling HTTPS at scale. In summary, HTTP/HTTPS knowledge includes understanding the stateless nature of HTTP (and the need for mechanisms like cookies or tokens to maintain sessions), knowing the common status codes, and recognizing how TLS secures communications.

TCP vs UDP (Transport Layer Protocols)

At the transport layer, TCP (Transmission Control Protocol) and UDP (User Datagram Protocol) are the two fundamental protocols. Both use IP underneath but differ significantly in behavior and use cases:

In summary, TCP vs UDP trade-off comes down to reliability vs. latency. TCP gives you heavy guarantees (ordering, no duplicates, reliable delivery) at the cost of extra network chatter and complexity, whereas UDP is essentially just “send and forget,” suitable for cases where the application can tolerate or handle some loss. As a senior engineer, it’s important to know that TCP can struggle with high-latency links or multicast (UDP supports multicast, TCP doesn’t), and that UDP requires you to consider packet loss at the application level. Many protocols build on UDP for speed and add their own reliability only if needed (e.g., QUIC is a modern protocol used in HTTP/3 that runs over UDP to get the benefits of UDP with built-in congestion control and reliability at the application layer). In AWS, most services (like HTTP-based services) use TCP, but things like AWS CloudWatch Logs or metrics might use HTTP (thus TCP) for reliability. If designing a custom service, say a telemetry ingestion service that can lose the occasional packet but needs low latency, you might design it over UDP.

Reliability and ordering: TCP ensures in-order delivery or the connection breaks – it’s “near-lossless”. UDP is “lossy”; the application might receive packets out of order or not at all, and it must deal with that. For example, a video call app might see a missing UDP packet and decide to not display a few pixels for a moment rather than pause the whole video. Knowing which transport to use is crucial: e.g., a payment system should not use UDP (you don’t want lost data in transactions), whereas a live metrics dashboard could use UDP for real-time updates where occasional drops are acceptable.

IP Protocol (IPv4, IPv6 and Addressing/Subnetting)

IP (Internet Protocol) operates at the network layer (Layer 3) and is the core protocol that delivers packets from source to destination across networks. It provides addressing (each device has an IP address) and routing (forwarding through intermediate routers). There are two versions in use:

Addressing and Subnetting: An IP address has two parts: network prefix and host identifier. Subnetting is the practice of dividing a network into smaller networks (subnets) by extending the network prefix. Each subnet is identified by a network address and a subnet mask (or prefix length, e.g., /24). For example, 192.168.1.0/24 represents a subnet where the first 24 bits are network (192.168.1.0) and the remaining 8 bits are for hosts (256 addresses, of which a few are reserved). Why subnet? It improves routing efficiency, security, and management. A subnet (subnetwork) is a smaller network inside a larger network – it allows localizing traffic so it doesn’t have to traverse the entire network. By subnetting, network traffic can travel shorter distances without passing through unnecessary routers. For instance, in a data center, you might subnet by rack or by department so that most traffic stays local. From a design perspective, subnetting helps segment networks (e.g., separating a database subnet from a web subnet for security). In cloud environments like AWS, you must subnet your VPC IP range into subnets (often at least one public and one private subnet per availability zone). Each subnet has a CIDR block (range of IPs) and belongs to one availability zone.

A concept related to subnetting is the subnet mask (e.g., 255.255.255.0 for /24) which delineates network vs host bits. Another concept is CIDR (Classless Inter-Domain Routing) notation (like /16, /24) which replaced classful networks to allow more flexible allocation. For an interview, one should know how to calculate the number of addresses in a subnet (e.g., a /24 has 2^(32-24)=256 addresses) and understand broadcast vs network addresses, etc., though cloud platforms abstract some of these details. Key takeaway: IP addressing enables global unique identification of endpoints, and subnetting is used to logically partition networks to suit organizational or topological needs.

In practice, when designing an AWS VPC, you might choose an IPv4 CIDR like 10.0.0.0/16 (gives 65k addresses) and then subnet it into /24s for different zones or services. IPv6 can also be enabled (AWS gives a /56 or /64 block) to allow IPv6 addressing. Many system design considerations (like load balancer addressing, NAT not needed for IPv6, etc.) tie back to IP addresses and subnets.

DNS (Domain Name System)

DNS (Domain Name System) is the phonebook of the Internet, translating human-friendly domain names (like example.com) into IP addresses that computers use to route traffic. When you enter a URL or use an API endpoint, a DNS resolution occurs to find the server’s address. DNS is an application-layer protocol, but it underpins practically all internet communications.

How DNS resolution works: DNS is a distributed, hierarchical system. There are different types of DNS servers that work together in a lookup:

This process is iterative: the recursive resolver goes step by step, and it caches responses along the way to speed up future lookups. Caching is critical to DNS’s performance – once a name is resolved, the resolver will remember it for the TTL (time-to-live) specified in the DNS record, so subsequent requests are answered quickly from cache rather than hitting the root/TLD servers again.

DNS records: DNS records map names to data. Common record types include A (IPv4 address for a host), AAAA (IPv6 address), CNAME (alias one name to another), MX (mail exchange servers for email), TXT (arbitrary text, often for email security like SPF/DKIM), NS (delegates to a name server), and SOA (start of authority, domain metadata). For instance, an A record for www.example.com might point to 93.184.216.34. A CNAME for photos.example.com might alias to images.example.com (so that the latter’s A record is used). In system design, you use DNS to provide friendly endpoints for your services and to do things like load distribution (by returning multiple A records or using weighted DNS). DNS caching can sometimes cause stale data issues (e.g., if you change an IP but the old one is cached), which is why TTLs should be chosen carefully – short TTLs for highly dynamic services, longer for stable mappings.

Server roles: A single machine can be configured as different DNS roles:

From a security standpoint, DNS has weaknesses (like spoofing), which led to DNSSEC (DNS Security Extensions) where responses are signed to ensure authenticity. For performance, many large services use CDNs and Anycast DNS to make sure DNS queries are answered by a nearby server.

For a senior engineer, understanding DNS is key: e.g., how a CDN like CloudFront uses CNAMEs, how to design domain naming for microservices (perhaps using subdomains), how to handle DNS-based load balancing or failover (Route 53 can do health-check based failover). Also, knowing that DNS resolution adds latency to the first request (DNS lookup time) – typically a few tens of milliseconds – and that clients cache results (browsers, OS cache) which is why a bad DNS record can linger. In summary, DNS translates names to IPs and is structured in a hierarchy of servers (root → TLD → authoritative), with caching at resolvers to improve performance. It’s a critical piece often brought up in system design (for example, how do services discover each other? Possibly via DNS names).

Common Networking Components

Networking involves hardware and software components each playing distinct roles. Key components include switches, routers, and gateways (for moving data through networks), load balancers and proxies (for distributing and intermediating traffic), and firewalls, VPNs, and security groups (for securing network boundaries). Understanding these is important both for on-premise architecture and cloud (where many of these exist virtually).

Routers, Switches, and Gateways

In a nutshell: switches connect devices within a network (LAN), routers connect different networks (LAN to LAN or LAN to WAN), and gateways connect fundamentally different networks or protocols.

In cloud design, you don't manually handle switches and routers – AWS manages those – but it exposes abstractions: subnets and route tables (router behavior) and gateways (internet gateway, NAT gateway, etc.). On-premises, you design which subnets attach to which routers, use switches to connect servers, and so on. A strong understanding lets you reason about why, say, two instances in different subnets can't communicate – perhaps a route is missing (a routing issue) or a NACL is blocking traffic (a firewall issue) rather than anything at the switch level.

Load Balancers & Reverse Proxies

Load balancers and reverse proxies are mechanisms to distribute and manage incoming traffic to servers. They often overlap in functionality (a single piece of software or a device can be both). Both typically sit between clients and back-end servers, but there are subtle differences and use cases:

- A load balancer's primary job is to spread incoming requests across a pool of equivalent servers, so no single server is overwhelmed and failures are masked (unhealthy servers are taken out of rotation).
- A reverse proxy's primary job is to act as the single entry point in front of one or more servers, where it can terminate TLS, cache responses, compress, authenticate, and route requests – load balancing is just one of the things it may do.

In cloud environments, the distinction blurs: for instance, AWS’s Application Load Balancer is effectively a reverse proxy that does HTTP(S) load balancing (it looks at headers, can do path-based routing, etc.) – it operates at Layer 7. AWS’s Network Load Balancer is purely packet-based (Layer 4), not modifying traffic, just forwarding at network level with ultra-high throughput. In system design, mention Layer 4 vs Layer 7 load balancing. Layer 7 (application load balancing) allows smarter routing (like sending all /api/users/* requests to a specific set of servers, or implementing blue/green deployments by routing traffic based on cookie or host), but it introduces a bit more latency due to deeper packet inspection. Layer 4 load balancing is useful for non-HTTP protocols or when you need extreme performance (millions of connections) and just want to distribute by IP/port (it’s also typically what is used for UDP-based services, since Layer 7 LB for arbitrary UDP is not feasible).

A real-world example: Suppose you have a microservices architecture with many services. You might use a reverse proxy (like an API Gateway or Nginx) to route requests to different services based on the URL. That reverse proxy might also handle authentication, caching, or compressing responses. Meanwhile, each service might have multiple instances, so there’s load balancing happening for each service cluster as well. In AWS, you could achieve this with an Application Load Balancer that has multiple target groups (one for each microservice set) and listener rules to route to them by path or host. Alternatively, you could chain a Network Load Balancer in front of an ALB (AWS allows NLB -> ALB) to combine benefits (though that’s an edge case).
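To make the ALB-based routing concrete, here is a hedged boto3 sketch of a path-based listener rule like the one described above. The listener and target group ARNs are placeholders for resources you would already have created:

```python
import boto3

elbv2 = boto3.client("elbv2")

# Placeholder ARNs for an existing ALB listener and a per-service target group
LISTENER_ARN = "arn:aws:elasticloadbalancing:us-east-1:123456789012:listener/app/my-alb/..."
USERS_TG_ARN = "arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/users-svc/..."

# Route /api/users/* requests to the users-service target group
elbv2.create_rule(
    ListenerArn=LISTENER_ARN,
    Priority=10,  # lower numbers are evaluated first
    Conditions=[{"Field": "path-pattern", "Values": ["/api/users/*"]}],
    Actions=[{"Type": "forward", "TargetGroupArn": USERS_TG_ARN}],
)
```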

In summary, load balancers focus on distributing traffic to ensure no single server is overwhelmed and to provide redundancy, and they come in L4 and L7 flavors. Reverse proxies can do load balancing but also provide a single point to implement cross-cutting concerns (logging, caching, TLS, etc.) at the application protocol level. Most modern load balancers (like ALB, Nginx, F5 BIG-IP LTM) are effectively reverse proxies working at L7. Knowing how to use them is crucial: e.g., in a system design, a load balancer can allow horizontal scaling and zero-downtime deployments (by draining connections on one server while others take traffic). Also, from a security view, a reverse proxy can shield the backend servers (clients only ever see the proxy’s IP/name).

Firewalls, VPNs, and Security Groups

These components are all about security and controlled access in networking:

- Firewalls filter traffic based on rules (allow or deny by IP, port, and protocol), either at the network perimeter or on individual hosts.
- VPNs (Virtual Private Networks) create encrypted tunnels over untrusted networks so that remote sites or users can communicate as if they were on the same private network.
- Security Groups are AWS's virtual, instance-level firewalls: stateful rule sets attached to resources that control which traffic may reach (and leave) them.

To sum up this section: firewalls and security groups protect resources by permitting or blocking connections. VPNs securely connect networks or clients over untrusted networks. In a layered security model, you might have a network firewall at your VPC edge (AWS offers the Network Firewall service, or you can use NACLs), security groups on each instance as a second layer, and perhaps host-based firewalls on the OS. A best practice is the principle of least privilege – only open the ports and sources that are needed. For instance, open the database port only to app servers, not to the world. AWS Security Groups make this easy by letting you express who may connect in terms of other security groups or IP ranges. And finally, always consider encryption (VPN or TLS) when sending sensitive data across networks you don't fully control.
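Here is a hedged boto3 sketch of the "database port only from app servers" rule just mentioned; both security group IDs are placeholders:

```python
import boto3

ec2 = boto3.client("ec2")

DB_SG = "sg-0db000000000000aa"   # placeholder: security group on the database tier
APP_SG = "sg-0app00000000000bb"  # placeholder: security group on the app servers

# Least privilege: allow PostgreSQL (5432) into the DB group only from the app group,
# rather than from 0.0.0.0/0
ec2.authorize_security_group_ingress(
    GroupId=DB_SG,
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 5432,
        "ToPort": 5432,
        "UserIdGroupPairs": [{"GroupId": APP_SG, "Description": "app tier only"}],
    }],
)
```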

Networking Architectures & Patterns

Beyond individual protocols and components, how we organize and pattern our network interactions is critical in system design. Two fundamental paradigms are client-server and peer-to-peer. Additionally, content delivery via CDNs (Content Delivery Networks) has become a staple pattern for improving performance for global users. We examine each and their use cases, pros/cons.

Client-Server Architecture

Client-server is the classic architecture where a server provides a service or resources, and clients consume them over a network. The server is often a centralized (or a distributed cluster acting as central) authority that hosts data or functionality, and clients (which could be end-user applications, browsers, mobile apps, or other services) initiate requests to the server and receive responses. This pattern underlies the majority of web and enterprise systems. For example, when you use a web application, your browser (client) sends an HTTP request to a web server (perhaps running on AWS EC2 behind a load balancer) which then responds with the webpage or data.

Characteristics:

- Asymmetric roles: clients initiate requests; the server listens (usually on a well-known port) and responds.
- The server is the authority for shared state and business logic; clients hold little or none of it.
- Many clients to few servers: servers are scaled (vertically, or horizontally behind load balancers) to handle aggregate client load.

Pros:

- Centralized control: data, security, and updates are managed in one place, which simplifies consistency and auditing.
- Clients can stay thin and simple; the heavy lifting happens on the server.
- Well-understood scaling patterns (load balancing, caching, replication) and mature tooling.

Cons:

- The server side is a potential single point of failure and bottleneck; it must be made redundant and scaled deliberately.
- Server infrastructure carries ongoing cost and operational burden.
- All traffic converges on the server, so capacity and latency to the server matter for every client.

Example patterns:

- Web applications: browsers as clients talking HTTP(S) to web/app servers.
- Mobile apps calling a backend API (often via an API gateway and load balancer).
- Service-to-service calls in a microservices system: each service acts as a server to its callers and as a client of its dependencies.

In interviews, when asked to design a system (like Twitter, etc.), the default assumption is a client-server model (users use the service via a client that calls the centralized service). One might mention using load balancers to handle more clients, caching on server side to reduce load, etc., but not converting it to a peer-to-peer system (unless it’s something explicitly P2P like a file sharing service).

Peer-to-Peer Networks

Peer-to-peer (P2P) architecture decentralizes the client-server model by making each node both a client and a server. In a pure P2P network, there is no central coordinator; every node (peer) can initiate or service requests. Peers directly exchange data with each other. This model gained fame with file-sharing systems (Napster, Kazaa, BitTorrent) and is also used in blockchain networks and some communication apps.

Characteristics:

- No central server (or only minimal coordination infrastructure); every peer can both request and serve data.
- Peers discover each other via mechanisms like trackers, distributed hash tables (DHTs), or gossip.
- Capacity grows with participation: each new peer adds both demand and supply.

Pros:

- Scales naturally: more peers means more aggregate bandwidth and storage.
- No single point of failure and lower central infrastructure cost.
- Resilient to the loss of individual nodes.

Cons:

- Harder to secure and govern: there is no central authority to enforce policy or trust.
- Peers behind NATs and firewalls complicate direct connectivity (hole punching, relays).
- Availability and performance are inconsistent – content is only available while peers that hold it stay online.

Use cases and examples:

- File sharing and distribution (BitTorrent).
- Blockchain and cryptocurrency networks, where peers replicate a shared ledger.
- Real-time communication (WebRTC can connect browsers directly for voice and video).

In system design, P2P might come up if discussing, say, a content distribution system or how to avoid central bottlenecks. Often, though, interview designs lean on client-server due to simplicity and control. But mentioning P2P could show awareness of alternatives. For instance, for a video streaming platform, one might mention a peer-assisted delivery (like clients also upload popular chunks to nearby clients) to reduce bandwidth on origin – some commercial systems (and WebRTC based CDNs) do that.

One should also understand hybrid models: Many systems use a central index but P2P for data. Napster had a central server for search, but file transfers were P2P between users. BitTorrent uses trackers (or DHT) to coordinate but then peers directly exchange. This often yields a balance: a bit of centralization for efficiency where acceptable, combined with decentralization for scale.

CDNs (Content Delivery Networks) and Caching Strategies

A Content Delivery Network (CDN) is a distributed network of servers strategically placed around the globe to cache and deliver content to users with high availability and performance. Instead of every user hitting your origin server (which might be in one region), users fetch content (especially static content like images, scripts, videos) from a CDN node (edge server) that is geographically closer to them. This reduces latency and offloads work from the origin. AWS CloudFront is an example of a CDN service.

How CDNs work: When using a CDN, your DNS for, say, assets.myapp.com is configured to point to the CDN’s network. A user in Europe might resolve that to an edge server in Frankfurt, while a user in Asia gets an edge in Singapore. The CDN edge will check if it has the requested content cached; if yes, it serves it immediately. If not, it will fetch it from the origin (your server), cache it, and then serve the user. Subsequent requests from nearby users get the cached copy. CDNs thus operate as reverse proxies with distributed caching. CDNs like CloudFront also handle HTTPS, can compress content, and even edge compute (like AWS Lambda@Edge).

Benefits:

- Lower latency: content is served from an edge location near the user instead of a distant origin.
- Origin offload: the CDN absorbs most read traffic, reducing load and bandwidth costs at the origin.
- Better availability and burst absorption: edges keep serving cached content during origin problems or traffic spikes, and CDNs help absorb DDoS traffic.

AWS CloudFront specifics: It’s a managed CDN where you define “distributions” that pull from an origin (could be an S3 bucket, an HTTP server, etc.). CloudFront has edge locations across the world. It supports dynamic content as well, including WebSockets and can even proxy through to a custom origin for API calls (though dynamic responses may have Cache-Control: no-cache so it just forwards them). You can configure behaviors for different URL path patterns (cache some things longer, some not at all, etc.). CloudFront also allows invalidation – if you deploy a new version of a file, you can tell the CDN to purge the old version from cache. CloudFront can serve content over HTTP/2 and HTTP/3 to clients, optimizing connection reuse and header compression.
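As a concrete illustration of the invalidation step mentioned above, here is a hedged boto3 sketch; the distribution ID is a placeholder:

```python
import time
import boto3

cloudfront = boto3.client("cloudfront")

# Purge specific objects from edge caches after a deployment
cloudfront.create_invalidation(
    DistributionId="E1234567890ABC",  # placeholder distribution ID
    InvalidationBatch={
        "Paths": {"Quantity": 2, "Items": ["/index.html", "/app.js"]},
        "CallerReference": f"deploy-{int(time.time())}",  # must be unique per request
    },
)
```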

Caching strategies:

- Control freshness with TTLs and Cache-Control headers: long TTLs for immutable static assets, short or zero TTLs for dynamic content.
- Use cache busting (versioned or fingerprinted file names) so new deployments don't have to wait out old TTLs, and use invalidation for urgent purges.
- Tune the cache key (which query strings, headers, or cookies vary the cached object) so the hit rate stays high without serving the wrong content.
- Layer caches: the browser cache sits in front of the CDN edge, with possible application caches behind the origin.

CDN in system design interviews: If users are global, you should almost always mention a CDN for static content. E.g., “we will serve images and videos via a CDN to reduce latency for users and offload our servers.” If designing something like Netflix or YouTube, CDNs are absolutely critical (they heavily use CDNs to stream video bits from locations close to users). Even for API responses, a CDN can cache certain GET requests if they are public and heavy (e.g., a public leaderboard data that’s same for everyone). Additionally, mention using browser caching (setting headers so that the user’s browser caches files) – CDNs and browser caches together form a caching strategy.
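One common way to implement the browser/CDN caching just described is to bake Cache-Control into objects at upload time. A hedged sketch, assuming assets are fingerprinted by filename; the bucket, key, and local file path are placeholders:

```python
import boto3

s3 = boto3.client("s3")

# Upload a fingerprinted asset with a long browser/CDN TTL.
# Safe to cache for a year because the filename changes on every release.
with open("build/app.js", "rb") as body:
    s3.put_object(
        Bucket="my-static-assets",
        Key="js/app.3f9c2d.js",
        Body=body,
        ContentType="application/javascript",
        CacheControl="public, max-age=31536000, immutable",
    )
```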

AWS CloudFront example: “Amazon CloudFront is a CDN service built for high performance, security, and developer convenience”. In practice, you’d set up CloudFront distributions for your static assets (perhaps using an S3 bucket as origin) and maybe one for your API (with shorter TTLs or no caching for dynamic API calls, but still using edge locations for TLS termination). CloudFront can also speed up sites which use TLS by optimizing connection reuse and using features like TLS False Start for handshakes. Overall, CDNs and caching are about moving data closer to users and reducing redundant work. They are a form of scaling out read throughput – many more people can be served by caches than by the origin at once. The trade-off is cache consistency (you have to manage updates carefully), but for primarily static or globally replicated content, CDNs massively improve performance and are almost default in modern architectures.

Cloud Networking Fundamentals (AWS)

Amazon Web Services provides virtual networking capabilities that mirror many of the concepts from traditional networking, but in a software-defined, easy-to-manage way. Key topics include VPCs (Virtual Private Clouds), subnets and routing (including Internet Gateways and NAT), Security Groups vs NACLs for filtering, Route 53 for DNS, and Elastic Load Balancing options (ALB, NLB, CLB). Mastering these concepts is essential for designing secure and scalable AWS architectures.

VPC (Virtual Private Cloud)

An Amazon VPC is essentially a private network within AWS for your resources. When you create a VPC, you carve out an isolated IP network (with a range you choose, e.g., 10.0.0.0/16) in which you can launch AWS resources (EC2 instances, RDS databases, VPC-attached Lambda functions, etc.). By default, no one else can see or access your VPC – it's like having your own logically separated section of the AWS cloud. As AWS puts it, "Amazon VPC is a logically isolated section of the AWS Cloud where you can launch AWS resources in a virtual network that you define." You have complete control to customize and design this software-defined network based on your requirements.

Key aspects of VPCs:

- CIDR range: you choose the private IPv4 range (and optionally an IPv6 block); it is hard to change later, so size it with growth in mind.
- Subnets: the VPC is divided into subnets, each tied to one Availability Zone.
- Routing: route tables attached to subnets decide where traffic goes (locally, to an internet gateway, a NAT gateway, a peering connection, etc.).
- Gateways and endpoints: internet gateways, NAT gateways, VPN/Direct Connect attachments, and VPC endpoints control how traffic enters and leaves.
- Security: security groups and NACLs filter traffic inside the VPC.

Think of a VPC as your virtual data center. You have to plan IP ranges, subnets, and how to connect out or in. AWS provides defaults (e.g., default VPC) to get started, but custom VPCs are recommended for sophisticated setups. With a VPC, you have flexibility: you can peer with another VPC (even in different AWS accounts or regions, with some limitations), connect to on-prem networks via VPN or Direct Connect, attach endpoints for AWS services (so that calls to, say, S3 or DynamoDB don’t leave the AWS backbone), etc.

Design-wise, you might have one VPC per environment (production, staging, dev) or per application grouping. Within a VPC, using subnets to separate public-facing and internal resources is a common pattern described next.

Subnets, Internet Gateways, and NAT (Public vs Private Subnets)

Figure: Public vs. Private subnets in a VPC. In this AWS diagram, the subnet in AZ A is a public subnet because its route table directs internet-bound traffic (0.0.0.0/0) to an Internet Gateway, whereas the subnet in AZ B is a private subnet with no such route. An instance in the public subnet (AZ A) that has a public IP can communicate with the internet via the IGW, but an instance in the private subnet (AZ B) cannot reach the internet directly. Even if a private-subnet instance had a public IP, it wouldn’t be reachable, because the subnet’s route table lacks an IGW route. Typically, private subnets instead use a NAT Gateway in a public subnet for outbound internet access.

In AWS, subnets divide a VPC's IP space and each lies in a single Availability Zone. There are two main flavors of subnets in a VPC:

- Public subnets: the subnet's route table has a 0.0.0.0/0 route to an Internet Gateway, so resources with public IPs can reach and be reached from the internet. Load balancers, bastion hosts, and NAT gateways typically live here.
- Private subnets: no route to an Internet Gateway. Resources (app servers, databases) are not directly reachable from the internet; for outbound access they go through a NAT Gateway in a public subnet.

To use a NAT Gateway, you create it in a public subnet (typically one per AZ for high availability) and update the private subnets' route tables: 0.0.0.0/0 -> nat-gateway-id. NAT Gateways are managed by AWS (highly available within their AZ and automatically scaled) and perform the IP/port translation. Remember also that instances' Security Groups govern outbound flows (by default an SG allows all outbound), and that the NAT Gateway itself uses an Elastic IP – make sure the IGW and the public subnet's route to it are in place.

Internet Gateway (IGW) is the VPC’s doorway to the internet. It’s attached at the VPC level (not per subnet). For a subnet to be public, it needs a route to the IGW. The IGW handles traffic to/from public IPs of instances: it translates between the instance’s private IP and its public IP (for IPv4). The IGW is horizontally scaled by AWS, and you don’t need to worry about capacity. It simply needs to be attached and routes set. Each VPC comes with a default route table; by default AWS puts the 0.0.0.0/0 -> IGW route in the main route table for default VPCs, but in a custom VPC you add it yourself in the public subnet’s route table. An IGW also supports IPv6 (where no NAT is needed – IPv6 addresses are public by nature, so the IGW is more of a router for v6).

Summary: In an AWS environment you typically design with at least one public subnet (with an IGW route) for outward-facing components and private subnets (no IGW route) for internal components. Private subnets use NAT gateways for outbound connectivity. Public subnets have internet connectivity potential, which you secure with Security Groups. This setup (often called a "2-tier VPC architecture": public web tier, private app tier) is a common interview talking point for deploying a web service securely. For completeness: if something truly needs to be isolated (no internet access even outbound), put it in a private subnet without a NAT route, or even use NACLs to restrict all traffic except VPC-internal flows.
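To make the public/private wiring concrete, here is a hedged boto3 sketch of the minimal setup described above (single AZ for brevity; production would typically use one NAT Gateway per AZ). The CIDR blocks are examples:

```python
import boto3

ec2 = boto3.client("ec2")

# Carve out the VPC and one public + one private subnet (single AZ for brevity)
vpc_id = ec2.create_vpc(CidrBlock="10.0.0.0/16")["Vpc"]["VpcId"]
public = ec2.create_subnet(VpcId=vpc_id, CidrBlock="10.0.1.0/24")["Subnet"]["SubnetId"]
private = ec2.create_subnet(VpcId=vpc_id, CidrBlock="10.0.2.0/24")["Subnet"]["SubnetId"]

# Internet Gateway + public route table: 0.0.0.0/0 -> IGW is what makes a subnet "public"
igw = ec2.create_internet_gateway()["InternetGateway"]["InternetGatewayId"]
ec2.attach_internet_gateway(InternetGatewayId=igw, VpcId=vpc_id)
pub_rt = ec2.create_route_table(VpcId=vpc_id)["RouteTable"]["RouteTableId"]
ec2.create_route(RouteTableId=pub_rt, DestinationCidrBlock="0.0.0.0/0", GatewayId=igw)
ec2.associate_route_table(RouteTableId=pub_rt, SubnetId=public)

# NAT Gateway in the public subnet gives the private subnet outbound-only access
eip = ec2.allocate_address(Domain="vpc")["AllocationId"]
nat = ec2.create_nat_gateway(SubnetId=public, AllocationId=eip)["NatGateway"]["NatGatewayId"]
ec2.get_waiter("nat_gateway_available").wait(NatGatewayIds=[nat])  # wait until usable
priv_rt = ec2.create_route_table(VpcId=vpc_id)["RouteTable"]["RouteTableId"]
ec2.create_route(RouteTableId=priv_rt, DestinationCidrBlock="0.0.0.0/0", NatGatewayId=nat)
ec2.associate_route_table(RouteTableId=priv_rt, SubnetId=private)
```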

Security Groups vs. NACLs (Network ACLs)

AWS provides two layers of network security within a VPC: Security Groups (SG) and Network Access Control Lists (NACLs). It’s important to understand the differences and how they complement each other.

Differences and best practice:

- Scope: Security Groups attach to individual resources (instances/ENIs); NACLs attach to subnets and apply to everything in them.
- Statefulness: SGs are stateful (return traffic for an allowed connection is automatically allowed); NACLs are stateless (both directions must be allowed explicitly, including ephemeral ports for return traffic).
- Rule types: SGs have only allow rules; NACLs have both allow and deny rules, evaluated in rule-number order (first match wins).
- Defaults: a new SG denies all inbound and allows all outbound; the default NACL allows everything until you change it.

Security Groups are stateful and easier to manage per service, making them the go-to for most traffic control in AWS. NACLs, being stateless and subnet-scoped, are more of a coarse filter or safety net. A common strategy:

- Express fine-grained allow rules with Security Groups (e.g., the app tier's SG may reach the DB tier's SG on the database port).
- Leave NACLs at (or near) their defaults, adding explicit subnet-level denies only when needed – for example, blocking a known-bad IP range or a port that should never be used in that subnet.

An example difference: suppose an instance in a private subnet is compromised and tries to initiate traffic on an unusual port to exfiltrate data. If you had an outbound deny in the NACL for that port or destination IP range, it would be blocked at the subnet level regardless of the SG. Conversely, SGs allow all outbound by default, so the traffic would go out unless you had locked down the SG's outbound rules (many teams don't – they keep the default allow-all outbound and rely on inbound rules plus statefulness). So a NACL can provide that additional egress control if needed.

Another example: You want to make absolutely sure no one can RDP into any server except through a specific bastion. You might set the NACL for private subnets to deny inbound TCP 3389 from all sources except the bastion’s subnet. Even if an SG was misconfigured, the NACL would block it. It’s belt-and-suspenders.
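A hedged boto3 sketch of that belt-and-suspenders rule; the NACL ID and bastion CIDR are placeholders, and the rule numbers assume the subnet's NACL still has its default allow-all entries at rule 100:

```python
import boto3

ec2 = boto3.client("ec2")

NACL_ID = "acl-0123456789abcdef0"   # placeholder: NACL on the private subnets
BASTION_SUBNET = "10.0.10.0/24"     # placeholder: the bastion host's subnet

# NACL rules are evaluated in ascending rule-number order; the first match wins.
# Rule 90: allow RDP only from the bastion subnet.
ec2.create_network_acl_entry(
    NetworkAclId=NACL_ID, RuleNumber=90, Egress=False,
    Protocol="6", RuleAction="allow", CidrBlock=BASTION_SUBNET,
    PortRange={"From": 3389, "To": 3389},
)
# Rule 95: explicitly deny RDP from everywhere else, even if a security group slips.
ec2.create_network_acl_entry(
    NetworkAclId=NACL_ID, RuleNumber=95, Egress=False,
    Protocol="6", RuleAction="deny", CidrBlock="0.0.0.0/0",
    PortRange={"From": 3389, "To": 3389},
)
```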

Security Group vs NACL summary: "Security Groups work at the resource level and are stateful, while NACLs operate at the subnet level and are stateless." You can think of SGs as host-based firewalls and NACLs as network-edge ACLs on a router. Most AWS architectures lean on SGs for their flexibility and ease, using NACLs sparingly. In an interview, it's good to mention both and note that Security Groups are usually sufficient for most needs. If pressed: Security Groups for instance-level whitelisting, NACLs for subnet-level blacklisting (if needed).

Route 53 (DNS Management)

Amazon Route 53 is AWS’s managed DNS service. It provides domain registration, DNS record hosting, and health-checking/routing features. It is highly available and scalable. The name “53” comes from UDP/TCP port 53, the DNS port.

Key capabilities:

- Domain registration and hosted zones for managing DNS records (A, AAAA, CNAME, MX, TXT, etc.).
- Alias records that point directly at AWS resources (CloudFront distributions, ELBs, S3 static sites) without exposing an IP.
- Routing policies: simple, weighted, latency-based, geolocation, and failover routing.
- Health checks that can drive automatic DNS failover.
- Private hosted zones for names that resolve only inside your VPCs.

Use in architecture:

- Point your public domain at an ALB or CloudFront distribution via alias records.
- Use latency-based or geolocation routing to send users to the nearest regional deployment.
- Use weighted records for gradual rollouts (canary or blue/green) and failover records with health checks for disaster recovery.
- Use private hosted zones for internal service names within VPCs.

Route 53 operates on a global scale (you don’t “pick a region” for Route 53 – it’s AWS global). It’s highly reliable – it uses a global anycast network of DNS servers.

For interviews: mention Route 53 as the solution for scaling DNS and doing clever routing strategies. For instance, for a global service, “we’ll use Route 53 latency-based routing to direct users to the closest deployment region, improving performance.” Or “we use Route 53 health checks and failover routing to increase reliability by automatically shifting traffic if the primary cluster goes down.” It’s also one of those things that differentiates a design beyond just “use DNS” because of these advanced features. And of course, it integrates with AWS names (alias to an S3 or CloudFront can simplify serving a static site at root domain, etc.).
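As a concrete illustration of the failover routing just described, here is a hedged boto3 sketch; the hosted zone ID, health check ID, and IP addresses are placeholders (the IPs come from documentation ranges):

```python
import boto3

route53 = boto3.client("route53")

HOSTED_ZONE_ID = "Z0123456789ABCDEFGHIJ"                   # placeholder hosted zone
HEALTH_CHECK_ID = "11111111-2222-3333-4444-555555555555"   # placeholder health check

# PRIMARY answers while its health check passes; SECONDARY takes over on failure
route53.change_resource_record_sets(
    HostedZoneId=HOSTED_ZONE_ID,
    ChangeBatch={"Changes": [
        {"Action": "UPSERT", "ResourceRecordSet": {
            "Name": "api.example.com", "Type": "A", "TTL": 60,
            "SetIdentifier": "primary", "Failover": "PRIMARY",
            "HealthCheckId": HEALTH_CHECK_ID,
            "ResourceRecords": [{"Value": "203.0.113.10"}],
        }},
        {"Action": "UPSERT", "ResourceRecordSet": {
            "Name": "api.example.com", "Type": "A", "TTL": 60,
            "SetIdentifier": "secondary", "Failover": "SECONDARY",
            "ResourceRecords": [{"Value": "198.51.100.20"}],
        }},
    ]},
)
```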

Elastic Load Balancing (ALB, NLB, CLB)

AWS offers multiple types of Elastic Load Balancers as managed services. The main ones are:

- Application Load Balancer (ALB): Layer 7, for HTTP/HTTPS. Supports path- and host-based routing and WebSockets, and integrates with target groups for instances, containers, and Lambda.
- Network Load Balancer (NLB): Layer 4, for TCP/UDP/TLS. Extremely high throughput and low latency, supports static/Elastic IPs, and preserves the client's source IP.
- Classic Load Balancer (CLB): the previous generation with basic L4/L7 features; generally used only for legacy setups.

Comparison and design:

- Choose an ALB when you need HTTP-aware features: routing by path or host, TLS termination, redirects, authentication, or routing to microservices and containers.
- Choose an NLB for non-HTTP protocols, very high or spiky connection volumes, when clients need a fixed IP, or when backends should see the original client IP at the TCP level.
- Avoid the CLB for new designs; migrate to ALB/NLB where possible.

High availability: ELBs themselves are multi-AZ by default (you specify subnets in at least two AZs). AWS creates load balancer nodes in each AZ; if one AZ or node fails, the others continue serving, so you get HA at the infrastructure level.

Scalability: It's elastic – ALB and NLB capacity scales automatically, and you don't manage it directly. Note that NLB has a very high baseline capacity and handles sudden bursts well, whereas an ALB scales but a very sudden, massive burst can see a bit of latency while it scales out (AWS mitigates this, and you can request pre-warming from support if you expect a huge jump).

Integration: For example, AWS ECS (containers) can register tasks with ALB target groups (with dynamic port mapping) making service discovery easy. AWS EKS (Kubernetes) typically uses an ALB ingress controller to route traffic to services.

In an interview, when talking about AWS architecture:

- Put an ALB in front of stateless web/app servers (e.g., an Auto Scaling group or ECS service) spread across multiple AZs, with health checks so bad instances drop out of rotation.
- Use an NLB for TCP/UDP workloads, extreme throughput, or when clients need a static IP to whitelist.
- Terminate TLS at the load balancer to offload crypto from backends, and use connection draining/deregistration delay for zero-downtime deployments.

Cost considerations: ALB charges per hour and per LCU (Load Balancer Capacity Unit – based on new connections, active connections, and bandwidth). NLB charges per hour and per GB. Generally, for small traffic an ALB can be slightly more expensive, while for extremely high throughput an NLB may be more cost-effective if ALB LCU costs pile up. But cost is rarely the deciding factor between ALB and NLB – functionality is.

To summarize:

- ALB: Layer 7, content-aware routing; the default choice for HTTP(S) applications and microservices.
- NLB: Layer 4, raw performance, static IPs, source-IP preservation, and non-HTTP protocols.
- CLB: legacy; avoid for new designs.


By understanding these networking principles, protocols, patterns, and AWS networking specifics, a senior engineer can design systems that are robust, scalable, and secure. In a system design interview, you would draw on these concepts to justify choices (like using TCP vs UDP, adding a CDN, segmenting networks with subnets and security groups, etc.). In practice, knowing how to configure and optimize these (like properly setting up a VPC with least privilege, or using ALB vs NLB appropriately, or tuning DNS TTLs) is vital for running reliable services in the cloud.

Network Troubleshooting & Performance

Even with a solid design, networks can encounter issues. Troubleshooting those and optimizing performance are key skills. Common issues include high latency, packet loss, or DNS resolution failures. We also rely on tools like ping, traceroute, and DNS lookup utilities to diagnose problems. Finally, performance can often be improved by techniques like caching, using CDNs, reducing round trips, and tuning configurations.

Common Network Issues: Latency, Packet Loss, DNS Problems

- Latency: the time it takes packets to travel. It grows with physical distance, number of hops, and congestion/queuing. High latency slows every round trip (TCP handshakes, TLS, request/response), so chatty protocols suffer the most.
- Packet loss: packets dropped due to congestion, faulty hardware, or poor links (especially wireless). TCP retransmits and backs off, so even a small percentage of loss sharply reduces throughput; for UDP-based apps it shows up as glitches or gaps.
- DNS problems: misconfigured or stale records, slow or failing resolvers, and long TTLs delaying propagation of changes. Symptoms are "cannot resolve host" errors or clients hitting an old IP after a change.

Other issues not explicitly in the heading:

- Firewall/security group misconfigurations that silently drop traffic (connections time out rather than being refused).
- MTU mismatches causing fragmentation or blackholed large packets.
- Port exhaustion or connection limits on NAT devices and load balancers under heavy load.

A systematic approach: if a service is unreachable, check DNS first (does the name resolve correctly?), then basic connectivity (ping/traceroute to see if it's a network-path problem – keeping in mind an SG may block ICMP), then check the service port (telnet/nc to the port: a timeout usually means a firewall is silently dropping packets, while "connection refused" means the host is reachable but nothing is listening on that port), and so on.
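The first steps of that checklist can even be scripted. A minimal Python sketch (standard library only; example.com is a placeholder target) that distinguishes DNS failures, refused connections, and silent timeouts:

```python
import socket

def check(host: str, port: int, timeout: float = 3.0) -> None:
    # Step 1: DNS - does the name resolve, and to what address?
    try:
        addr = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)[0][4][0]
        print(f"{host} resolves to {addr}")
    except socket.gaierror as exc:
        print(f"DNS failure: {exc}")
        return
    # Step 2: TCP connect - distinguish "refused" (host reachable, nothing listening)
    # from "timed out" (often a firewall / security group silently dropping packets)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            print(f"TCP connect to {host}:{port} succeeded")
    except ConnectionRefusedError:
        print("Connection refused: host reachable but no listener on that port")
    except socket.timeout:
        print("Connection timed out: likely a firewall/SG drop or routing issue")

check("example.com", 443)
```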

Diagnostic Tools: ping, traceroute, and DNS lookup

In practice, a combination is used. For instance, if a user can’t reach a service, you might:

  1. Ping the server (to see basic connectivity).
  2. Traceroute to see where it stops (maybe your traffic isn’t even reaching AWS, indicating a network issue in between).
  3. If ping works, try telnet server 443 to see if the port is open. If not, security group or server issue.
  4. Use nslookup to ensure you’re hitting correct IP.
  5. Check instance’s SG/NACL.
  6. Use curl on the instance itself to test local service, etc.

These tools are fundamental for diagnosing the issues above: latency (ping and traceroute to locate where delay is added), packet loss (ping over time, or mtr to see where loss occurs), and DNS (nslookup/dig). For throughput and monitoring there are also tools like iperf (to measure throughput between two points) and specialized services (e.g., AWS CloudWatch for network metrics).

Performance Tuning Tips: Reducing Latency and Improving Throughput

Designing a network and system for performance involves multiple techniques:

- Put content and compute close to users: CDNs for static assets, multi-region deployments or edge compute for dynamic work.
- Reduce round trips: reuse connections (keep-alive), use HTTP/2 or HTTP/3 multiplexing, batch or coalesce API calls, and avoid chatty protocols.
- Reduce bytes on the wire: compression (gzip/Brotli), efficient serialization, and caching at every layer (browser, CDN, application).
- Tune the transport: adequate TCP buffer sizes/window scaling for high-latency links, TLS session resumption, and modern protocol versions.
- Scale out and balance: load balancers, horizontal scaling, and placing dependent services in the same region/AZ to keep internal latency low.

In summary, to improve network performance:

- Cache aggressively and serve from as close to the user as possible.
- Minimize the number of round trips and the size of each one.
- Keep dependent components near each other, and measure (latency, loss, throughput) so tuning is driven by data.

A practical example: say a web page loads in 3 seconds. After analysis, we find 100ms of DNS, a 500ms TLS handshake, and then 50 assets loaded sequentially. To tune:

- Cut DNS cost: longer TTLs for stable records, DNS prefetching, or serving assets from an already-resolved domain.
- Cut handshake cost: TLS 1.3 and session resumption reduce round trips; keep connections alive so the handshake isn't repeated per asset.
- Stop loading assets sequentially: HTTP/2 multiplexing (or at least parallel connections), bundling small assets, and compressing them.
- Put the static assets on a CDN so each of these steps happens close to the user.

For a back-end throughput scenario: transferring a 1GB file from region A to region B over a 100ms-latency path took 2 hours. That is roughly 1.1 Mbps, far below typical link capacity – a classic sign of a single, window-limited TCP stream (throughput ≈ window size / RTT; a 64KB window over 100ms caps out around 5 Mbps). To optimize:

- Enable TCP window scaling / larger socket buffers so the window covers the bandwidth-delay product.
- Use parallel streams or multipart transfers (e.g., S3 multipart uploads, or tools that open several connections).
- Compress the data before sending if it's compressible, and consider services built for this (such as S3 Transfer Acceleration) that route over optimized paths.

All these show that network performance is multi-faceted – from architecture decisions (where to host, how to distribute) to configuration (protocols, caching, load balancing) to code (making fewer calls, etc.). A senior engineer should consider these in design: e.g., mention use of CDNs, note you will compress data, design APIs to minimize chattiness, and ensure to profile/monitor to continually improve.

system-design networking