Distributed Systems Cheat Sheet
System Design Refresher
Core Concepts
Distributed System
A distributed system is a network of independent computers that work together and appear to the end-user as a single system.
Example: Google Search or Netflix—hundreds of servers work together, but you see one unified service.
Scalability
The ability of a system to handle increasing workloads by adding resources.
Horizontal Scaling: Add more machines (nodes).
Example: Adding more servers to a web application cluster.
Vertical Scaling: Upgrade existing machines (CPU, RAM, storage).
Example: Moving from 8GB RAM to 32GB RAM on the same server.
Reliability
The system continues to function correctly despite failures (hardware crash, network outage).
Example: If one database server fails, requests are rerouted to a backup.
Availability
The percentage of time a system is operational and accessible.
Example: A system with 99.99% availability can be down for only ~52 minutes per year.
Consistency
All nodes in the system see the same data at the same time.
Example: If you update your Facebook profile picture, all users should see the new photo immediately.
Partition Tolerance
The system continues to work even if communication between nodes is broken (network split).
Example: During a network outage, Amazon’s shopping cart still lets you add items, syncing later.
CAP Theorem
States that a distributed system can guarantee at most two of the following three properties at once (in practice, partitions do happen, so the real trade-off is between consistency and availability):
Consistency (all nodes have the same data)
Availability (system always responds)
Partition Tolerance (works despite network failures)
Example:
CP system: Prioritizes consistency + partition tolerance (HBase, MongoDB).
AP system: Prioritizes availability + partition tolerance (Cassandra, DynamoDB).
State Management
How the system handles data storage and synchronization across nodes.
Stateless: Each request is independent (e.g., REST APIs).
Stateful: System remembers previous interactions (e.g., video streaming session).
Data Partitioning
Splitting data across multiple servers for scalability and performance.
Types:
Horizontal: Split rows (users A–M on server 1, N–Z on server 2).
Vertical: Split by columns (profile data on one server, payment data on another).
Hash-based: Use a hash function to decide where to store data.
Range-based: Divide data by ranges (IDs 1–1000 on server A, 1001–2000 on server B).
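As a rough sketch (names here are illustrative), hash-based partitioning can be as simple as hashing the key and taking it modulo the number of servers:

```python
import hashlib

def partition_for(key: str, num_servers: int) -> int:
    """Map a key to a server index using a stable hash.

    hashlib is used instead of the built-in hash(), which is
    randomized per process and would move keys between runs.
    """
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % num_servers

# The same key always lands on the same server.
server = partition_for("user:alice", 4)
```

One design caveat: with plain modulo, changing `num_servers` remaps most keys, which is why production systems often prefer schemes that minimize data movement when nodes are added or removed.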
Replication Strategies
Copying data across nodes for fault tolerance.
Synchronous: All copies updated together → strong consistency but slower.
Asynchronous: Primary updates, replicas lag slightly → faster, eventual consistency.
Quorum-based: Majority of replicas must agree before confirming update.
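The quorum rule above boils down to a simple overlap condition: with N replicas, a write quorum of W and a read quorum of R, any read is guaranteed to see the latest write when R + W > N. A minimal check:

```python
def quorum_ok(n: int, w: int, r: int) -> bool:
    """Classic quorum overlap condition: every read quorum must
    intersect every write quorum, i.e. R + W > N."""
    return r + w > n

# N=3 with W=2, R=2 overlaps: reads always hit at least one
# replica that holds the latest write.
assert quorum_ok(3, 2, 2)

# W=1, R=1 on N=3 does not overlap: a read may miss the newest data.
assert not quorum_ok(3, 1, 1)
```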
Consistency Models
Define how up-to-date data appears to users.
Strong: Always see the latest data.
Eventual: Updates propagate asynchronously, but all nodes eventually converge to the same value (e.g., DNS).
Causal: Preserves cause-effect relationships (if post A causes comment B, all see A before B).
Time and Order
Ensuring events are processed in the correct sequence across distributed systems.
Lamport Timestamps: Logical clocks to order events.
Vector Clocks: Capture causality between events (who influenced whom).
Example: Messaging apps ensure messages appear in the correct order even across servers.
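A minimal sketch of a Lamport clock (illustrative, not production code): every local event ticks the clock, and receiving a message fast-forwards it past the sender's timestamp, so causally related events get increasing timestamps.

```python
class LamportClock:
    """Minimal Lamport logical clock."""

    def __init__(self):
        self.time = 0

    def tick(self) -> int:
        """Local event: advance the clock."""
        self.time += 1
        return self.time

    def send(self) -> int:
        """Attach a timestamp to an outgoing message."""
        return self.tick()

    def receive(self, msg_time: int) -> int:
        """Merge an incoming timestamp: jump past it, then tick."""
        self.time = max(self.time, msg_time) + 1
        return self.time

a, b = LamportClock(), LamportClock()
t = a.send()        # a's clock becomes 1
b.receive(t)        # b's clock becomes max(0, 1) + 1 = 2
```

Note the limitation that motivates vector clocks: Lamport timestamps give a total order consistent with causality, but equal or unrelated timestamps cannot tell you whether two events were actually concurrent.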
Fault Tolerance
The ability to continue operating after a failure.
Strategies:
Failover: Switching to a backup system automatically.
Redundancy: Extra copies of data or services.
Example: Google Cloud keeps multiple replicas of files across regions.
Concurrency Control
Managing multiple simultaneous operations on shared data.
Optimistic: Assume no conflicts → check before committing.
Pessimistic: Lock data before changes → prevents conflicts but slower.
Example: Online banking uses concurrency control to prevent double withdrawal.
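Optimistic concurrency is often implemented as a compare-and-set on a version number. A toy in-memory sketch (a real database performs the check-and-increment atomically):

```python
class VersionedRecord:
    """Optimistic concurrency via compare-and-set on a version number."""

    def __init__(self, value):
        self.value = value
        self.version = 0

    def update(self, new_value, expected_version: int) -> bool:
        """Commit only if nobody else committed since we read."""
        if self.version != expected_version:
            return False          # conflict: caller re-reads and retries
        self.value = new_value
        self.version += 1
        return True

acct = VersionedRecord(100)
v = acct.version
assert acct.update(80, v)       # first writer wins
assert not acct.update(60, v)   # stale version is rejected, no lost update
```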
Communication
Remote Procedure Call (RPC)
Lets one program call a function on another machine as if it were local.
Handles details like network communication, data transfer, and return values.
Example: A client calls getUserInfo(), which actually runs on a remote server.
Message Passing
Communication method where nodes exchange messages asynchronously.
Useful for decoupling services (they don’t need to be online at the same time).
Types:
Message Queues (Point-to-Point):
Messages go from one sender → one receiver.
Example: RabbitMQ, Amazon SQS for job/task queues.
Message Brokers (Publish–Subscribe):
One sender publishes → multiple subscribers can consume.
Example: Kafka, Apache Pulsar for event-driven systems.
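The point-to-point pattern can be sketched in a single process with Python's standard `queue` module as a local stand-in for RabbitMQ/SQS: each message is delivered to exactly one consumer.

```python
import queue

# Local, single-process stand-in for a job queue. In production the
# queue lives in a broker so producer and consumer can be separate
# services that don't need to be online at the same time.
jobs = queue.Queue()

def producer():
    for i in range(3):
        jobs.put(f"job-{i}")

def consumer():
    """Drain the queue; each job is consumed exactly once."""
    handled = []
    while not jobs.empty():
        handled.append(jobs.get())
        jobs.task_done()
    return handled

producer()
handled = consumer()   # ["job-0", "job-1", "job-2"]
```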
REST (Representational State Transfer)
A popular API communication style over HTTP.
Uses standard operations (GET, POST, PUT, DELETE).
Example: A weather app calling api.weather.com/tokyo?day=tomorrow.
GraphQL
A query language for APIs that allows clients to request exactly the data they need.
More efficient than REST in some cases (no over-fetching/under-fetching).
Typically served over HTTPS (TLS) for security, like REST, usually through a single endpoint.
Example: Instead of multiple REST calls, a single GraphQL query fetches user info + orders together.
gRPC (Google Remote Procedure Call)
A high-performance RPC framework built on Protocol Buffers (Protobuf).
Faster and more efficient than REST for microservices communication.
Example: Streaming real-time data (chat, video calls) between services.
Webhooks
Event-driven callbacks: When something happens, the system sends an HTTP request to a predefined URL.
Example: Stripe (payment gateway) sends a webhook to your app when a payment succeeds.
Coordination & Consensus
Consensus Algorithms (Paxos, Raft)
Consensus ensures that all nodes agree on a single value or decision, even with failures.
Needed for consistency in distributed systems (e.g., agreeing on which transaction to apply next).
Paxos: Complex but widely used in production systems (e.g., Google Chubby).
Raft: Easier to understand; used in etcd, Consul for leader election and log replication.
Example: In a replicated database, consensus ensures all replicas agree on the order of transactions.
Distributed Locks (Zookeeper, etcd)
Locks help coordinate access to shared resources across distributed nodes.
Prevents two nodes from modifying the same data at the same time.
Zookeeper and etcd provide APIs for distributed locking.
Example: In Hadoop, Zookeeper is used to ensure only one NameNode is active at a time.
Distributed Transactions (Two-Phase Commit – 2PC)
Ensures a transaction is either committed everywhere or aborted everywhere across multiple systems.
Two phases:
Prepare Phase – Coordinator asks all participants if they can commit.
Commit Phase – If all agree, commit; otherwise rollback.
Example: A bank transfer where money is deducted from one account and credited to another across different databases.
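The two phases can be sketched with a toy coordinator (illustrative only; `can_commit` stands in for each participant's local checks, and real 2PC must also persist decisions to survive crashes):

```python
class Participant:
    """Toy 2PC participant."""

    def __init__(self, name: str, can_commit: bool = True):
        self.name, self.can_commit, self.state = name, can_commit, "init"

    def prepare(self) -> bool:
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def rollback(self):
        self.state = "aborted"

def two_phase_commit(participants) -> bool:
    # Phase 1: coordinator asks every participant to prepare.
    if all(p.prepare() for p in participants):
        # Phase 2: all voted yes -> commit everywhere.
        for p in participants:
            p.commit()
        return True
    # Any "no" vote -> roll everyone back.
    for p in participants:
        p.rollback()
    return False
```

A known weakness worth remembering: if the coordinator crashes between the phases, prepared participants are blocked until it recovers.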
Leader Election
In distributed systems, one node is often chosen as the leader/coordinator to manage tasks.
The leader handles coordination, while followers replicate state.
Example: In Raft, one node is elected as the leader to manage log replication.
Gossip Protocol
A decentralized communication method where nodes periodically share state info with random peers.
Over time, info spreads like a “gossip/rumor” until all nodes know it.
Used for membership management, failure detection, and state dissemination.
Example: Cassandra and DynamoDB use gossip protocols for node state sharing.
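A tiny simulation shows why gossip spreads fast: each round, every informed node tells one random peer, so the informed set roughly doubles until everyone knows. (Seeded randomness keeps the sketch deterministic; real protocols also exchange digests and failure-detector state.)

```python
import random

def gossip_rounds(num_nodes: int, seed: int = 42) -> int:
    """Simulate rumor spreading: each round, every informed node
    gossips to one random peer. Returns rounds until all know."""
    rng = random.Random(seed)
    informed = {0}                  # node 0 starts with the rumor
    rounds = 0
    while len(informed) < num_nodes:
        rounds += 1
        for _ in range(len(informed)):
            informed.add(rng.randrange(num_nodes))
    return rounds

rounds = gossip_rounds(20)   # typically O(log n) rounds for 20 nodes
```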
Design Principles
Loose Coupling
Components should be independent of each other as much as possible.
Benefit: Changes in one component don’t break the whole system.
Example: Microservices architecture, where services communicate via APIs rather than sharing code.
High Cohesion
Keep related functionality together in one module/service.
Benefit: Easier to maintain, test, and extend.
Example: A “Payments Service” handles all payment logic instead of spreading it across multiple services.
Abstraction
Hide unnecessary details, exposing only what’s essential.
Benefit: Simplifies usage and reduces complexity.
Example: AWS S3 abstracts file storage behind simple API calls (putObject, getObject).
Single Responsibility
Each component should do one thing well.
Benefit: Improves clarity, reduces bugs.
Example: A logging service only records logs, not authentication or caching.
Separation of Concerns
Divide the system into distinct layers or modules, each handling different concerns.
Example: In a web app → UI Layer, Business Logic Layer, and Database Layer are separated.
Statelessness
Where possible, services should not store client session data.
Benefit: Improves scalability and fault tolerance.
Example: REST APIs are stateless → each request is independent, with no stored state on the server.
Idempotence
An operation can be repeated safely without changing the outcome.
Benefit: Reliable retries and recovery from failures.
Example: Submitting a payment request twice should only charge once.
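The payment example is commonly implemented with an idempotency key: the server remembers each key's result, so a retried request returns the original outcome instead of charging twice. A minimal in-memory sketch (production would use a durable store):

```python
processed = {}   # idempotency key -> recorded result

def charge(idempotency_key: str, amount: int) -> int:
    """Process a payment at most once per key; a retry with the
    same key returns the original result, not a second charge."""
    if idempotency_key in processed:
        return processed[idempotency_key]
    result = amount              # stand-in for the real charge
    processed[idempotency_key] = result
    return result

charge("order-123", 50)
charge("order-123", 50)   # retry: still only one recorded charge
```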
Eventual Consistency
Allow temporary inconsistency for better availability and performance.
Benefit: High scalability in distributed systems.
Example: DynamoDB or Cassandra → writes propagate asynchronously, but eventually all replicas agree.
Fail-Fast
Detect and report errors quickly instead of continuing with bad state.
Benefit: Easier debugging, prevents cascading failures.
Example: Microservice immediately returns an error if it cannot connect to a database.
Circuit Breaker
A pattern that isolates failing components to prevent system-wide failure.
Benefit: Protects the system from overload.
Example: Netflix Hystrix → stops calling a failing service after repeated failures.
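A minimal circuit-breaker sketch (thresholds and timing are illustrative): after a run of consecutive failures the breaker "opens" and fails fast, then allows a trial call once a cooldown has passed.

```python
import time

class CircuitBreaker:
    """After `threshold` consecutive failures, fail fast
    until `reset_after` seconds have elapsed."""

    def __init__(self, threshold: int = 3, reset_after: float = 30.0):
        self.threshold, self.reset_after = threshold, reset_after
        self.failures, self.opened_at = 0, None

    def call(self, fn, *args):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            # Cooldown over: half-open, let one trial call through.
            self.opened_at, self.failures = None, 0
        try:
            result = fn(*args)
            self.failures = 0          # success resets the count
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()   # trip the breaker
            raise
```

While open, callers get an immediate error instead of piling up slow requests against a struggling dependency, which is what prevents the cascade.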
Retry Mechanisms
Automatically retry failed operations, often with backoff strategies.
Example: An API client retries after a timeout, waiting longer with each attempt.
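Exponential backoff can be sketched in a few lines (delays here are tiny for illustration; real clients also add random jitter so retries don't synchronize):

```python
import time

def retry(fn, attempts: int = 4, base_delay: float = 0.01):
    """Call fn, retrying on failure with exponentially growing
    waits (base_delay * 2**i); re-raise after the last attempt."""
    for i in range(attempts):
        try:
            return fn()
        except Exception:
            if i == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** i))
```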
Asynchronicity
Design systems for non-blocking communication.
Benefit: Improves performance and responsiveness.
Example: Kafka message queue → producers don’t wait for consumers to process.
Timeouts / Deadlines
Set time limits for requests to prevent indefinite waiting.
Benefit: Improves reliability and resource management.
Example: HTTP requests typically timeout after a set period (e.g., 30 seconds).
Data Management in Distributed Systems
Distributed Databases
Databases spread across multiple nodes for scalability, fault tolerance, and performance.
Types:
NoSQL (schema-flexible, horizontally scalable):
MongoDB → document-based.
Cassandra → column-family store, great for high write throughput.
DynamoDB → AWS-managed, scalable key-value store.
NewSQL (SQL + scalability of NoSQL):
CockroachDB, YugabyteDB → Distributed SQL databases offering strong consistency + horizontal scaling.
Distributed File Systems
Store files across multiple servers but present them as a single logical system.
HDFS (Hadoop Distributed File System): Designed for big data batch processing (MapReduce, Spark).
Ceph: Provides block, object, and file storage in a unified distributed system.
Caching
Stores frequently accessed data in-memory to reduce latency.
Redis: In-memory key-value store with persistence and pub/sub features.
Memcached: Simple key-value store, optimized for caching.
Example: Storing user session tokens in Redis for fast authentication.
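The session-token example can be sketched as a tiny in-process cache with per-entry expiry, a local stand-in for what Redis/Memcached do over the network:

```python
import time

class TTLCache:
    """Minimal cache with per-entry time-to-live."""

    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self.store = {}   # key -> (value, expires_at)

    def set(self, key, value):
        self.store[key] = (value, time.monotonic() + self.ttl)

    def get(self, key, default=None):
        entry = self.store.get(key)
        if entry is None:
            return default
        value, expires_at = entry
        if time.monotonic() >= expires_at:
            del self.store[key]       # lazy eviction on read
            return default
        return value

sessions = TTLCache(ttl_seconds=3600)
sessions.set("session:42", "alice")
user = sessions.get("session:42")   # fast lookup, no database hit
```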
ACID Properties (Relational Databases)
Guarantees reliability of transactions:
Atomicity: All or nothing (transfer money → deduct + credit happen together).
Consistency: Database moves from one valid state to another.
Isolation: Transactions don’t interfere with each other.
Durability: Once committed, data persists even after crashes.
BASE Properties (NoSQL Databases)
Trade-off for scalability and availability:
Basically Available → system is almost always available.
Soft State → state may change over time (due to eventual sync).
Eventually Consistent → replicas converge to the same value eventually.
Example: Amazon Dynamo or Cassandra use BASE over strict ACID.
Data Partitioning
Splitting large datasets across nodes for scalability:
Sharding: Divide data into independent chunks.
Hash Partitioning: Hash function decides data location.
Range Partitioning: Data ranges (IDs 1–1000 → Node A, 1001–2000 → Node B).
Replication Strategies
Copying data to improve fault tolerance and availability:
Synchronous Replication: Data written to all replicas at once (strong consistency, slower).
Asynchronous Replication: Primary updates, replicas catch up later (faster, eventual consistency).
Quorum-Based: Majority of replicas must acknowledge (balance consistency + availability).
Key Technology Areas in Distributed Systems
Container Orchestration
Automates deployment, scaling, and management of containers.
Kubernetes: Most widely used for large-scale orchestration.
Docker Swarm: Simpler, built into Docker.
Example: Auto-scaling microservices based on traffic.
Stream Processing
Real-time data processing (vs batch processing).
Apache Flink: Low-latency, high-throughput stream processing.
Apache Spark Streaming: Mini-batch streaming engine.
Example: Fraud detection on financial transactions in real time.
Service Discovery
Enables services to find each other dynamically without hardcoding addresses.
Consul, etcd: Register and discover services.
Example: A microservice looking up another’s IP via Consul.
API Gateways
Manage, secure, and monitor APIs.
Kong, Tyk, Apigee, AWS API Gateway.
Features: Rate limiting, authentication, load balancing.
Example: AWS API Gateway in front of Lambda functions.
Workflow Orchestration
Automates complex workflows and pipelines.
Apache Airflow: DAG-based workflow automation.
Argo Workflows: Kubernetes-native workflow engine.
Example: ETL jobs that move data from raw → processed → analytics DB.
Load Balancers
Distribute incoming traffic across servers.
HAProxy, NGINX, Traefik, AWS ELB.
Improves performance, fault tolerance, and scalability.
Example: ELB distributing traffic to EC2 instances.
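The simplest balancing strategy, round-robin, just cycles through backends in order (addresses below are illustrative):

```python
import itertools

# Round-robin: hand each incoming request to the next backend in turn.
backends = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]
_next_backend = itertools.cycle(backends).__next__

def route(request) -> str:
    """Pick the backend that should serve this request."""
    return _next_backend()
```

Real load balancers layer health checks and smarter policies (least-connections, weighted) on top of this idea so traffic skips unhealthy or overloaded servers.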
Job Schedulers
Automate recurring tasks.
Quartz, Apache Oozie, Kubernetes CronJobs.
Example: Run daily data backup job at midnight.
Monitoring & Observability
Track performance, logs, and metrics.
Prometheus: Metrics collection & alerting.
Grafana: Visualization & dashboards.
Example: Monitoring latency and CPU usage in microservices.
Distributed Tracing
Traces requests across microservices for debugging.
Zipkin, Jaeger.
Example: Debugging slow API response by tracing through multiple services.
Service Mesh
Manages service-to-service communication (routing, security, observability).
Istio, Linkerd.
Example: Automatically encrypt traffic between microservices with mTLS.
Cloud Providers
Infrastructure platforms offering compute, storage, and managed services.
AWS, Azure, Google Cloud Platform (GCP).
Example: Deploying a data pipeline on AWS with S3, Lambda, and Redshift.
Infrastructure as Code (IaC)
Manage infrastructure through code instead of manual setup.
Terraform, Pulumi, AWS CloudFormation.
Example: Defining cloud resources (VMs, networks) in version-controlled scripts.
Serverless Platforms
Run code without managing servers.
AWS Lambda, Azure Functions, GCP Functions.
Pay only for execution time, auto-scales.
Example: Trigger Lambda function when a file is uploaded to S3.
Configuration Management
Automate software configuration and system setup.
Ansible, Chef, Puppet.
Example: Ansible playbook to configure web servers across a cluster.
Distributed Caches
In-memory stores to reduce latency.
Redis, Memcached.
Example: Store frequently accessed user session data in Redis for quick retrieval.
✨ If you found this article helpful, please consider liking and sharing it so others in your network can benefit too. Your support helps me reach more learners and continue creating content on distributed systems and data science. 🚀