Databases: What to Know about Partitioning
Data is everything, everything is data, and in most cases, the data in databases keeps growing and growing, but the user experience must remain unaffected. How can that be? One strategy that all other strategies are built on top of is to divide and conquer, aka partitioning.
Distributed database provider Cockroach Labs recently published a helpful blog post — written by Senior Technical Content Marketer Charles Custer — that covers the basics of partitioning.
There are two types of partitioning. Vertical partitioning splits a table by machine capability. Horizontal scaling splits by row, bringing in the ability to add more servers, which is called sharding.
Sharding can greatly improve the performance of an application, but it requires quite a bit of engineering work to perform operations that would have been done more simply with a single database.
What Is Database Partitioning?
Conceptually very simple, it means breaking a database into separate pieces, aka partitions. Separate partitions allow separate access, storage and management for each. The upside is a more scalable, performant application, but the challenge is that it can be tough engineering, expensive and time-consuming.
Types of Partitions
Though there are more than two, horizontal and vertical are the most common. This example includes a single table, but real-world examples are not as simple.
Here is the Before Table:
Vertical Partition is a table split by columns with different columns stored on different partitions.
A use case for vertical partitioning is when different columns are used differently. If this example was from a real bank, then Partition 1 (username, city) would be relatively static and need fewer performant machines, but Partition 2 (balance) would require high-throughput, high-performance machines since balances are updated more regularly.
Horizontal Partition is a table split by rows with different ranges on different partitions
A use case example for horizontal partitioning is to partition tables to improve the performance of specific, frequently used queries against that data.
Sharding is the spreading of horizontal partitions across multiple servers.
Sharding is a common practice at companies with relational databases. Without sharding, the database is limited to vertical scaling alone, which is beneficial but limited. To introduce horizontal scaling, the database is split into horizontal partitions, now called shards, and those shards are split across multiple servers.
Numerous sharding strategies and architectures exist. Amazon Web Services has a great introductory resource to sharding.
Advantages of Partitioning
Scalability: Vertical scaling is great to a point, but eventually one server can only do so much. Partitioning introduces the ability to add servers and removes limitations on scaling.
Ability: One server means there’s a single point of failure, but partitioning solves that.
Performance: Performance improvements are completely customizable, but some examples are reduced latency by sharding based on user location, and reduced contention by dividing request load across multiple servers.
Sounds Great, What Are These Challenges You Speak Of?
Take a single-instance Postgres database linked to its production application (and some backups serving as safeties if the production database goes down). The engineering team foresees the coming performance limitations of their current configuration and draws up a plan to split the database into three active shards, each with two passive replica shards for backup. Phew.
But wait, there’s more… it’s not just a shard-and-done situation because now processes that might have been simple before need more work and consideration. Here are a few tasks:
- Split the data and workload evenly among the shards; otherwise, there could be a disproportionate amount of queries overloading a shard.
- Make sure application code routing queries from the application goes to the correct shard.
- Support code for other systems interacting with the database (i.e., data pipelines or change feeds that send updates to analytics databases) will likely change once shards are created.
- Once-simple procedures will likely now be complex with multiple shards involved (i.e., updating data that spans more than one shard).
Here’s an example of a sharded architecture:
The thing about sharding is that when new shards are added, all of the bullets listed above will need updating again. This may result in a better user experience, but maybe not a better engineering experience, depending on how you define a better engineering experience. The cost will go up, but storing data is expensive.
Sharding and partitioning are both about breaking up a larger database into essentially smaller databases to increase performance based on specific needs. What affects the data also affects the application at large. Distributed databases are great, but it’s no replacement for understanding what goes on under the hood, Custer said.