Twenty Years Old, PostgreSQL Maintains its Vigor in the Big Data Age
In one of its biggest shows of support, Amazon added PostgreSQL support to the AWS Aurora database late last year, making it easier to move an Oracle database to the AWS cloud.
“Commercial grade databases are proprietary, expensive and require vendor lock-in. Builders don’t like this and it’s why enterprises have been moving as fast as they can to open source [options] like MariaDB, MySQL and PostgreSQL,” AWS CEO Andy Jassy said at the time.
Live streaming video platform Twitch’s senior data engineer Aaron Brashears wrote about how the social media platform uses PostgreSQL for the majority of its online transaction processing (OLTP) workloads.
“Twitch has approximately 125 database hosts serving OLTP workloads in production, usually as part of a cluster. Approximately 4 [percent] run MySQL, 2 [percent] run Amazon Aurora, and the rest run PostgreSQL,” Brashears wrote. “We manage the provisioning, system image, replication and backup of several of the databases though most new clusters are RDS for PostgreSQL.”
And recently Berlin freelance developer Rene Dudfield wrote about the pros and cons of using the technology for uses outside of relational data including real-time change feeds, time series, object store, document store, full-text search and more.
He wrote: “I’m pretty sure PostgreSQL is some combination of a minimal tool choice and the Swiss army knife tool choice in the shape of a big blue solid elephant.”
PostgreSQL, which evolved from the Ingres project at the University of California, Berkeley, celebrated the 20th anniversary of its first open source release last year.
“There may be situations where Hadoop or Cassandra, for instance, are the best places to store certain data, IOT data, depending on the volume, or for historical reasons you have data stored somewhere else. Now you can do a MapReduce job to expose that through a foreign data wrapper into Postgres. Inside Postgres, you can do queries, you can write reports. So that really enables a polyglot data model where you can have data in multiple different places and handle it in an organized way,” said Marc Linster, senior vice president of product development at EnterpriseDB, which sponsors the open source project and offers a commercial distribution of PostgreSQL focused on large-scale uses.
Though it’s been around for a long time, PostgreSQL not out of touch, according to Linster, pointing to features such as JSON-B for document stores and Post-GIS capabilities for geographic information systems.
“You can use JSON-B all the way from the browser to the database, all the way through the stack. Other databases require you to deconstruct documents for the mobile browser, then do transactions from the bits and pieces. In Postgres, you can use the whole document,” Linster said.
He cites its general-purpose nature as one of its strengths.
“In Postgres, you can start out unstructured, but you can use structure as you uncover it. Think how powerful that is: I can start with tables and relationships, but I can also start in areas where I don’t know yet what my structure’s going to be like. I can have a contact field, which is a JSON field, that can note multiple phone numbers and multiple email addresses within that field. As the application matures, I might pull some of that information out or start managing it differently, but I can start unstructured. And I can decide that some of my data will remain unstructured and some of it will be strictly structured,” he said.
In PostgreSQL 9.5, the database project moved closer to combining data analysis with a traditional database, functions previously thought to require separate systems. But “the real story is that [Hybrid transaction/analytical processing or HTAP] is going mainstream,” according to Gartner analyst Massimo Pezzini, pointing to reduced hardware costs and business imperatives for real-time analytics among the drivers.
PostgreSQL Version 9.6 added parallel processing capabilities that use multiple CPU cores to accelerate response times for queries that touch a lot of data. And enhancements to the freeze map eliminate multiple scans of data on certain data blocks, reducing I/O overhead. These enhancements further boost its ability to scale up.
In conjunction with the release of its enterprise PostgreSQL platform last week, EnterpriseDB released an adapter, called a Foreign Data Wrapper (FDW), for Hadoop with Apache Spark compatibility. The new version, HDFS_FDW, can be downloaded from the EDB PostgresData Adapters web page or the EnterpriseDB PostgreSQL GitHub page.
Data wrappers allow you to connect from within PostgreSQL (Also known as Postgres) to a remote system, then read and write data from other databases and use it as if it were inside the PostgreSQL database, Citus Data’s Craig Kerstiens explain in a post on FDWs.
They allow PostgreSQL queries to include structured or unstructured data, from multiple sources, such as NoSQL databases and the Hadoop Distributed File System (HDFS), as if they were in a single database.
The new version gives organizations the ability to combine analytic workloads based on the HDFS with operational data in PostgreSQL, using an Apache Spark interface.
In 9.6 and looking forward to version 10, due out in the third quarter of 2017, the community is working to maximize use of hardware resources for PostgreSQL, according to Robert Haas, EnterpriseDB vice president and chief database architect.
“We want to work to make sure you can use all the hardware resources that you have for Postgres. That used to be a bottleneck. There will be improvements in the way we do locking internally that results in better concurrency so you can put heavier and heavier loads on the system,” he said.
And the project needs to have core capabilities in place so that anyone can create foreign data wrappers that talk to other data sources such as Spark or Hadoop, he said
“One of the important things to do is to push as much work as possible to the remote server. If somebody issues a complex query against what they see as a table, you don’t want to fetch the table back from the remote side, then do the processing locally. You want to push as much of the computation as you can to the remote side because it will be more efficient,” he explained.
“What you’re seeing in 9.6, and there will be even more in 10, is increased ability for authors of foreign data wrappers to push more and more calculations over to the remote server to kind of turn PostgreSQL into a data hub. In 9.6, we now have support for pushing down joins and aggregates to remote servers if the data wrapper supports that.”
Logical replication will be coming in version 10. Physical replication has been available for several releases now, which promotes high availability, but logical replication means replicating rows between different types and versions of database systems, he said. It allows you to replicate individual tables rather than entire databases.
And a lot of work on parallelism will be part of the next version, allowing infrastructure and capabilities will be broadened out to new use cases.
“We’re doing more work on reducing locking bottlenecks and once we get past the locking bottlenecks, I think there’s going to be more work on raw performance increases, where some of those performances increases require significant refactoring of the underlying code,” he said.