Deep Dive into Data Partitioning & Sharding

Deep dive into Sharding Architecture Patterns

Feb 08, 2025

Distributed System Learning Roadmap

What is Sharding?

Sharding is a database architecture pattern for horizontal partitioning: the rows of a dataset are split across multiple machines or database instances. Each shard functions as a separate database, and together the shards make up a single logical database. Data is assigned to a shard according to a specific key, such as customer ID or geographic location, with the goal of reducing the load on each database and thereby improving performance.

For example, an e-commerce platform experiencing heavy transaction volumes could shard customer data by geographic location (Americas, Europe, Asia, etc.). Requests are then spread across several databases instead of one, and each user can be served from a shard close to them, which reduces latency. Note that geographic sharding balances load only to the extent that customers are spread evenly across regions.

-- Example SQL snippets for creating shards based on geographic location
-- (in a real deployment, each statement would run on a different database server)
-- Create a new database shard for European customers
CREATE DATABASE eur_customers;
-- Create a new database shard for American customers
CREATE DATABASE americas_customers;
-- Create a new database shard for Asian customers
CREATE DATABASE asia_customers;
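Once the shards exist, the application needs a rule for choosing one. The sketch below is a minimal illustration of that rule, not a production router: the database names come from the snippet above, while the region codes and the function name `shard_for_region` are assumptions made for this example.

```python
# Hypothetical mapping from a customer's region to a shard database.
# The database names match the CREATE DATABASE statements above;
# the region codes are assumptions made for this sketch.
REGION_TO_SHARD = {
    "EU": "eur_customers",
    "AMERICAS": "americas_customers",
    "ASIA": "asia_customers",
}


def shard_for_region(region: str) -> str:
    """Return the name of the database that stores customers from `region`."""
    try:
        return REGION_TO_SHARD[region.upper()]
    except KeyError:
        raise ValueError(f"no shard configured for region {region!r}") from None


print(shard_for_region("eu"))    # eur_customers
print(shard_for_region("Asia"))  # asia_customers
```

Failing loudly on an unknown region is a deliberate choice here: silently falling back to a default shard would hide routing mistakes.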

What is Partitioning?

Partitioning is the process of dividing a database into distinct sections, or partitions, that can be stored and managed separately. It can be horizontal (splitting a table by rows) or vertical (splitting it by columns). The division occurs within a single database system, so the data does not have to be distributed across multiple servers. Partitioning is often used to improve the manageability, performance, and availability of large databases: queries that touch only one partition scan less data, and old partitions can be archived or dropped as a unit.

Consider a large database that stores sales records for a multinational corporation. By partitioning the sales records by region or year, the management of this extensive dataset becomes more feasible.

-- Example SQL snippet (MySQL syntax) for partitioning sales records by year
CREATE TABLE sales (
    sale_id INT NOT NULL,
    product_name VARCHAR(255) NOT NULL,
    sale_date DATE NOT NULL,
    amount DECIMAL(10,2) NOT NULL
) PARTITION BY RANGE (YEAR(sale_date)) (
    PARTITION p2018 VALUES LESS THAN (2019),
    PARTITION p2019 VALUES LESS THAN (2020),
    PARTITION p2020 VALUES LESS THAN (2021),
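    PARTITION p2021 VALUES LESS THAN (2022),
    -- Catch-all partition so that rows dated 2022 or later are not rejected
    PARTITION pmax VALUES LESS THAN MAXVALUE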
);

Database Sharding vs Partitioning

Sharding and partitioning are both about breaking up a large data set into smaller subsets. The difference is that sharding implies the data is spread across multiple computers, while partitioning does not: partitioning groups subsets of data within a single database instance. In practice the two terms are often used synonymously, especially when preceded by "horizontal" or "vertical", so horizontal sharding and horizontal partitioning can mean the same thing.

Types of Sharding

There are different criteria you can use to separate your data into shards. The right choice depends on your application, the structure of your data, your system architecture, geography, and your scalability goals. Here are four major types of sharding:

Key Range Based Sharding

Range-based sharding, or dynamic sharding, splits database rows based on a range of values. The database designer then assigns a shard key to each range. For example, the designer might partition the data by ranges of IDs, as in the mapping below:

 1 - 30 → Shard 1
 31 - 60 → Shard 2

When writing a customer record to the database, the application determines the correct shard key by checking the customer's ID. Then the application matches the key to its physical node and stores the row on that machine. Similarly, the application performs a reverse match when searching for a particular record.
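The lookup described above can be sketched with a sorted list of range boundaries and a binary search. This is a minimal illustration under stated assumptions: ID 30 is taken to belong to Shard 1 and Shard 2 is taken to start at 31, and the names `RANGE_UPPER_BOUNDS`, `SHARDS`, and `shard_for_id` are hypothetical.

```python
import bisect

# Inclusive upper bound of each ID range, in ascending order, and the shard
# that owns it. Assumption: IDs 1-30 live on shard_1 and 31-60 on shard_2.
RANGE_UPPER_BOUNDS = [30, 60]
SHARDS = ["shard_1", "shard_2"]


def shard_for_id(customer_id: int) -> str:
    """Return the shard whose ID range contains `customer_id`."""
    if customer_id < 1:
        raise ValueError(f"customer ID must be positive, got {customer_id}")
    # bisect_left finds the first range whose upper bound is >= customer_id.
    index = bisect.bisect_left(RANGE_UPPER_BOUNDS, customer_id)
    if index == len(SHARDS):
        raise ValueError(f"no shard covers customer ID {customer_id}")
    return SHARDS[index]


print(shard_for_id(30))  # shard_1
print(shard_for_id(31))  # shard_2
```

The same function serves both writes and reads, which is the "reverse match" mentioned above: the application always recomputes the shard from the ID rather than storing it.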

Advantages of range-based sharding:

Efficient for range queries because data is distributed in an orderly manner.
Facilitates data archiving and purging by dropping entire shards.
Suitable for time-series data and historical records.

Challenges of range-based sharding:

Imbalanced shard sizes if data distribution is uneven.
Challenges in handling skewed data distribution.
Limited flexibility when dealing with non-uniform data access patterns.

Hash Based Sharding

Hashed sharding assigns the shard key to each row of the database by using a mathematical formula called a hash function. The hash function takes the information from the row and produces a hash value. The application uses the hash value as a shard key and stores the information in the corresponding physical shard.

Software developers use hashed sharding to distribute information evenly among multiple shards. For example, the software can split customer records between two shards by reducing each record's hash to one of two values, 1 or 2, and storing the record on the shard with that number.
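A minimal sketch of this idea follows, assuming two shards numbered 1 and 2 and SHA-256 as the hash function; the function name `shard_for_key` and the key format are hypothetical. A stable hash from `hashlib` is used because Python's built-in `hash()` for strings is randomized per process by default, so it would route the same key differently after a restart.

```python
import hashlib
from collections import Counter

NUM_SHARDS = 2  # assumption: two shards, numbered 1 and 2


def shard_for_key(key: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a record key to a shard number in the range 1..num_shards."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Interpret the first 8 bytes of the digest as an integer, then reduce it.
    return int.from_bytes(digest[:8], "big") % num_shards + 1


# Rough check of how evenly 10,000 hypothetical customer keys are spread.
counts = Counter(shard_for_key(f"customer-{n}") for n in range(10_000))
print(sorted(counts.items()))
```

Because the result depends on `num_shards`, changing the number of shards changes the target of most keys, which is the reshuffling cost listed among the challenges of this approach.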

Advantages of hash-based sharding:

Evenly distributes data, preventing hotspots or imbalanced loads.
Suitable for situations where the order of data is not important.
Scalable and easy to implement.

Challenges of hash-based sharding:

Retrieving a specific range of data can be complex.
Shard rebalancing can be challenging as data volume grows.
Adding or removing shards may require reshuffling data.

Directory Based Sharding

Directory Sharding uses a lookup table to match database information to the corresponding physical shard. A lookup table is like a table on a spreadsheet that links a database column to a shard key. For example, the following diagram shows a lookup table for clothing IDs.
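In code, the lookup table can be as simple as a dictionary from key to shard. The sketch below is illustrative only: the clothing IDs, shard names, and function names are all hypothetical, and a real system would keep the directory in a shared, durable store rather than in process memory.

```python
# Hypothetical directory: every clothing ID is mapped explicitly to a shard.
DIRECTORY = {
    "shirt-001": "shard_a",
    "shirt-002": "shard_a",
    "jeans-001": "shard_b",
    "coat-001": "shard_c",
}


def shard_for_item(clothing_id: str) -> str:
    """Look up the shard that stores the given clothing ID."""
    try:
        return DIRECTORY[clothing_id]
    except KeyError:
        raise LookupError(f"{clothing_id!r} is not in the directory") from None


def move_item(clothing_id: str, new_shard: str) -> None:
    """Reassign an item to another shard by updating only its directory entry."""
    if clothing_id not in DIRECTORY:
        raise LookupError(f"{clothing_id!r} is not in the directory")
    DIRECTORY[clothing_id] = new_shard


print(shard_for_item("jeans-001"))  # shard_b
move_item("jeans-001", "shard_c")
print(shard_for_item("jeans-001"))  # shard_c
```

The `move_item` helper shows why this scheme is flexible: placement is data, not a formula, so a single entry can be changed without touching any other key. The actual row still has to be copied to the new shard, which this sketch leaves out.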

Advantages of directory-based sharding:

Flexible and adaptable to complex distribution needs.
Eases the process of shard management and rebalancing.
Supports dynamic changes to data distribution rules.

Challenges of directory-based sharding:

Adds complexity with the need for a separate metadata service.
Performance overhead due to metadata lookups.
Potential single point of failure in the metadata service.

Consistent Hashing
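Consistent hashing is one common way to limit the reshuffling cost noted for hash-based sharding. Keys and shards are placed on the same hash ring, and each key belongs to the first shard found at or after its position, so adding a shard only moves the keys that fall into the new shard's arcs. The sketch below is a minimal illustration under assumptions of my own: SHA-256 as the hash, 100 virtual nodes per shard, and hypothetical names such as `HashRing` and `ring_position`.

```python
import bisect
import hashlib


def ring_position(value: str) -> int:
    """Map a string to a position on the ring (a 64-bit integer)."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class HashRing:
    def __init__(self, shards, vnodes=100):
        self._vnodes = vnodes   # virtual nodes per shard, to smooth the spread
        self._positions = []    # sorted ring positions of all virtual nodes
        self._owner = {}        # ring position -> shard name
        for shard in shards:
            self.add_shard(shard)

    def add_shard(self, shard):
        """Place the shard's virtual nodes on the ring."""
        for i in range(self._vnodes):
            position = ring_position(f"{shard}#{i}")
            bisect.insort(self._positions, position)
            self._owner[position] = shard

    def shard_for(self, key):
        """Return the shard owning the first virtual node at or after the key."""
        if not self._positions:
            raise LookupError("the ring has no shards")
        index = bisect.bisect_left(self._positions, ring_position(key))
        # Wrap around to the start of the ring when the key is past the last node.
        position = self._positions[index % len(self._positions)]
        return self._owner[position]


ring = HashRing(["shard_1", "shard_2", "shard_3"])
keys = [f"customer-{n}" for n in range(10_000)]
before = {key: ring.shard_for(key) for key in keys}

ring.add_shard("shard_4")
after = {key: ring.shard_for(key) for key in keys}
moved = [key for key in keys if before[key] != after[key]]

print(f"{len(moved)} of {len(keys)} keys moved")
# True: keys only ever move to the newly added shard.
print(all(after[key] == "shard_4" for key in moved))
```

With plain modulo hashing, going from three to four shards would be expected to remap most keys; here only the keys claimed by the new shard move, and the remaining keys stay where they were.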

Use of Service Discovery in Database Sharding

When implementing database sharding, service discovery plays a crucial role in ensuring efficient and dynamic routing of queries to the appropriate shard. Here’s how service discovery helps in a sharded database system:

1. Dynamic Shard Lookup

Instead of hardcoding database connection details, service discovery helps dynamically determine the location of the correct shard for a given request. This is particularly useful when shards are distributed across multiple servers or regions.

2. Load Balancing Across Shards

Service discovery can integrate with load balancers to distribute traffic evenly across shards, preventing bottlenecks and optimizing query performance.

3. Automatic Failover & High Availability

If a shard becomes unavailable due to hardware failure or maintenance, service discovery helps redirect requests to a replica or backup shard, ensuring high availability.

4. Scaling and Elasticity

When new shards are added or existing ones are relocated, service discovery updates the registry dynamically, allowing applications to find the new shard locations without manual reconfiguration.

5. Reduced Configuration Overhead

With service discovery, applications don’t need to maintain a static mapping of data partitions to database servers. Instead, they query the discovery service to fetch the correct connection details at runtime.
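The points above can be illustrated with a small in-memory registry. This is a sketch only: `ShardRegistry` is a hypothetical class that stands in for a real discovery service, the host names are invented, and it does not reproduce the API of any actual tool. It also glosses over the fact that failing writes over to a replica requires promoting that replica first.

```python
class ShardRegistry:
    """In-memory stand-in for a discovery service that tracks shard locations."""

    def __init__(self):
        self._endpoints = {}     # shard name -> [primary, replica, ...]
        self._unhealthy = set()  # addresses currently failing health checks

    def register(self, shard, primary, replicas=()):
        """Add a shard, or update its location after it is moved."""
        self._endpoints[shard] = [primary, *replicas]

    def mark_unhealthy(self, address):
        self._unhealthy.add(address)

    def mark_healthy(self, address):
        self._unhealthy.discard(address)

    def resolve(self, shard):
        """Return the first healthy address for the shard, primary first."""
        if shard not in self._endpoints:
            raise LookupError(f"unknown shard {shard!r}")
        for address in self._endpoints[shard]:
            if address not in self._unhealthy:
                return address
        raise LookupError(f"no healthy endpoint for shard {shard!r}")


registry = ShardRegistry()
# Hypothetical addresses, invented for this example.
registry.register("eur_customers", "db-eu-1.internal:3306", ["db-eu-2.internal:3306"])

print(registry.resolve("eur_customers"))  # db-eu-1.internal:3306

# Failover: the primary fails a health check, so lookups return the replica.
registry.mark_unhealthy("db-eu-1.internal:3306")
print(registry.resolve("eur_customers"))  # db-eu-2.internal:3306

# Relocation: the shard moves, and clients pick up the new address at runtime.
registry.register("eur_customers", "db-eu-3.internal:3306")
print(registry.resolve("eur_customers"))  # db-eu-3.internal:3306
```

The application only ever asks for a shard by name, so relocation and failover require no change to application configuration.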

Example in Practice

Consul, ZooKeeper, or etcd can store and manage shard locations dynamically.
DNS-based service discovery can resolve shard locations based on predefined rules.
Proxy-based solutions (e.g., Envoy, HAProxy, or ProxySQL) can use service discovery to route requests efficiently.


Happy learning, and may your systems be ever reliable! 🚀✨

Better Engineers
