Modal Title
Cloud Services / Data

Why We Built Cloud Native Object Storage for PostgreSQL with Amazon S3

Timescale built what it saw its users would want to build themselves: A bottomless, consumption-based object store based on Amazon S3 for its cloud native PostgreSQL databases. 
Nov 29th, 2022 10:00am by
Featued image for: Why We Built Cloud Native Object Storage for PostgreSQL with Amazon S3
Image via Pixabay.

It’s no secret that at Timescale, we’re fans of PostgreSQL — we offer high-performance PostgreSQL in the cloud. We decided to build our time series database on top of PostgreSQL, given our conviction that it is the best database foundation for applications and that we could further develop embedded capabilities to scale PostgreSQL to the most demanding data-intensive workloads.

But on top of scalability and performance, there’s another key concern for developers managing data at scale: cost efficiency. And current cloud database solutions don’t give time series developers the tools they need to address this problem. 

The Time Series Problem

Time series data is often collected at high frequency or across long time horizons. This scale is often a fundamental part of applications: it could be storing metrics about all IoT devices in a fleet, identifying all the events in a gaming application, or collating data about many financial instruments. All this data adds up over time, often leading to difficult trade-offs about which data to store and for how long as teams try to balance performance, costs, and analytics needs. 

This challenge is tough to address in the database alone, forcing developers to manually set up their own custom systems for data warehousing — which they use alongside their production databases — or to operate complex data stacks in which data flows through different data stores via custom pipelines. This all adds extra costs and a management burden to the developer. 

When we set out to build a solution to help developers address this problem, we went back to our guiding principles of expanding PostgreSQL, the robust and trusted database, with the features that developers need to manage time series at scale. We realized that the best solution for this problem would be to leverage cloud native architecture and bring PostgreSQL closer to the cloud native world — combining the advantages of modern cloud native storage systems with the rock-solid reliability and performance of PostgreSQL. 

So to solve this problem of cost-efficient storage of time-series data, we decided to build what our users would want to build themselves: we added a bottomless, consumption-based object store based on Amazon S3 to our cloud native PostgreSQL databases. 

Expanding the Boundaries of PostgreSQL

Timescale Cloud is built as a cloud native database platform on AWS. This cloud native architecture allows us to extend the boundaries of traditional PostgreSQL, incorporating elements typically attributed to data warehouses or data lakes, without compromising the PostgreSQL experience (and oftentimes enhancing it). 

Access to object storage on Amazon S3 from their PostgreSQL database gives developers the tools they need to build applications that scale efficiently while reducing costs. The object store is consumption-based, so they don’t have to pre-allocate — and pre-pay for — an upper bound of their storage. Plus, its baseline price is 10x cheaper than traditional storage. This allows developers to scale their time series data infinitely and cost-effectively — keeping some data in the database itself and tiering data to object storage to save on costs. 

But the best part is that adding Amazon S3 to PostgreSQL allows developers to access this affordable object storage layer without giving up the PostgreSQL experience. By running a simple SQL command in Timescale Cloud, they can automatically tier data based on its age, as suited to their data management needs:

After tiering the data to the object store, developers can continue to query it from within the database via standard SQL. Predicates, filters, JOINs, CTEs, windowing — all the powerful PostgreSQL features work. Instead of operating multiple tools and ETL pipelines, developers can now transparently stretch their relational tables across multiple storage layers as the object store is now an integral part of our cloud database. 

Here’s an example of how that works, using the EXPLAIN plan for a single SQL query that fetches data from disk and object storage (notice the Foreign Scan):

How We Did It

To enable this new functionality in PostgreSQL, we built new database internal capabilities and external subsystems. Data chunks (segments of data related by time) that comprise a tiered hypertable now stretch across standard storage and object storage. We also optimized our data format for each layer: block storage starts in uncompressed row-based format and can be converted to Timescale’s native compressed columnar format. On top of that, all object storage is in a compressed columnar format well-suited for Amazon S3 (more specifically, Apache Parquet). This allows developers more options to take advantage of the best data storage type during different stages of their data life cycle.

Once a data tiering policy is enabled, chunks stored in our native internal database format are asynchronously migrated into Parquet format and stored in S3 based on their age (although they remain fully accessible throughout the tiering process). A single SQL query will pull data from the disk storage, object storage, or both as needed, but we implemented various query optimizations to limit what needs to be read from S3 to resolve the query. 

We perform “chunk exclusion” to avoid processing chunks falling outside the query’s time window. Further, the database doesn’t need to read the entire object from S3, even for selected chunks, as it stores various metadata to build a “map” of row groups and columnar offsets within the object. The result? It minimizes the amount of data to be processed, even within a single S3 object that has to be fetched to answer queries properly.

Test It Out

The ability to tier data to Amazon S3 within Timescale Cloud saves you the manual work of building and integrating a custom system or operating a separate data store (e.g., Snowflake) for your archival of historical data. Instead of setting up, maintaining, and operating a separate system alongside your production database (and a separate ETL process), you can simply work with a Timescale hypertable that serves your entire data lifecycle, where data is distributed across different storage layers. You can query regular and tiered data seamlessly from this table and also JOIN it to the rest of your tables, avoiding silos without adding more complexity to your data stack.

This not only simplifies operations but also billing: unlike regular data warehousing systems (which typically charge per query, making it very difficult to forecast the final cost), you’ll pay only for what you store, keeping your pricing transparent at all times. 

You can test this novel feature today by requesting access to our private beta. Sign up for Timescale Cloud. (the first 30 days are completely free, no credit card is required) and then navigate to Operations > Data Tiering to request access to the private beta. Bottomless object storage from within your PostgreSQL database.

Group Created with Sketch.
THE NEW STACK UPDATE A newsletter digest of the week’s most important stories & analyses.