It’s been a big year for open source relational database PostgreSQL. Stack Overflow developers named it the Most Wanted Database in 2021 and DB-Engines named it the Database Management System of the Year for the third time. So when the maintainers released Version 14 last month, it had better leave a mark. The new release comes with more than 220 patches — a 30% increase over the usual.
The New Stack spoke with Umair Shahid, the head of PostgreSQL at enterprise database performance software provider Percona, as well as chair of the PostgreSQL code of conduct committee, lead of the user groups in Islamabad and Dubai, and part of the user group committee for the U.S. PostgreSQL Association (PgUS). He highlighted 14’s updates across performance, horizontal scaling, observability, developer experience and security, which we bring to you now.
PostgreSQL v14.0: Performance Enhancement
Shahid lumped the performance enhancements into two groups — those that “just work” behind the scene without user intervention, and those that require configuration changes in order to work. Both will increase both storage and resource efficiency, which of course reduces cost.
Bloat reduction is one of those behind-the-scenes improvements. Frequently updated indexes tend to have dead tuples — any record that’s been deleted but still takes up space — which cause index bloat. Typically, these tuples are removed only when a vacuum is run. Between vacuums, as the page gets filled up, an update or insert will cause a page split — something that is not reversible, Shahid explained. This split would occur even though dead tuples inside the existing page could have been removed, making room for additional tuples. In PostgreSQL 14, dead tuples are automatically detected and removed even between vacuums, allowing for a reduced number of page splits, which in turn reduces index bloat.
Similarly, when data is deleted and a vacuum runs, before 14, PostgreSQL marked that data as to be deleted and then waits for the next round of vacuuming to actually free up the space. A new algorithm allows for the eager deletion of that data, hence freeing up space more efficiently.
The vacuum is now able to identify and free up the space in one go.” Shahid called this “a more efficient way in reducing space utilization and increasing efficiency.”
Still, developers and database admins will have the usual buffer of a rollback and point-in-time recovery.
This release also brings more features to parallel query execution, in which PostgreSQL can devise query plans that can leverage multiple CPUs to answer queries faster. Now your database can execute queries in parallel for RETURN QUERY and REFRESH MATERIALIZED VIEW.
More prominent updates include pipeline mode for LibPQ, which is the interface that developers use to connect their application to the database. With PostgreSQL, they now have the ability to use a pipeline mode. LibPQ used to be single-threaded, where it would wait for one query to complete execution before sending the next one to the database. Now devs can feed multi-transactions into the pipeline and LibPQ will execute them turn by turn to feedback results into the application. The application no longer has to wait for the first transaction to complete to execute the next one.
This was one of the updates in which Shahid commented, “Why did we not think about this earlier? This is such a no-brainer! But that’s how technology progresses.”
Another potential no-brainer-in-hindsight is an upgrade to TOAST, which now allows for LZ4 compression. TOAST is a system that allows the storage of much larger data. Before version 14, it used PGLZ, which is a very fast compression algorithm but the actual compression ratio was small. LZ4 is again very fast — Shahid said it’s lossless, meaning you don’t lose any data, as you would with a compressed photograph that appears pixelated — and it has a high compression ratio as well.
He said it varies by data but the algorithm intelligently utilizes the space that it needs for the data. Continuing the photo example, if you had one of a clear blue sky, it would compress it with a much higher compression ratio than a very colorful pic of an autumnal forest.
PostgreSQL v14.0: Horizontal Scalability and Distributed Workloads
Next up in this release’s additions is replicating in-progress transactions. A typical cluster set-up will almost never have a single node, because that would become the single point of failure. It’s typically distributed with a primary node and secondary nodes that all talk to each other to make sure they sync up. Before, large transactions were written to disk until the transaction was completed, and only then were they replicated to the standby nodes. With PostgreSQL 14, in-progress transactions can be replicated to standby nodes, without the need to wait for the transaction to complete.
The other updates to horizontal scalability involve the Foreign Data Wrapper. With a FDW, you have the ability to shard or slice up data in a particular way that you can distribute that data across different servers for horizontal scalability for distributed workloads. Now you can perform bulk inserts instead of row-by-row in not only local but also foreign tables.
Also as a query is being executed, it does the scan or reading of the table to understand where to fetch the data from. Local queries could run in parallel since version 12, but now you can also run parallel scans of foreign tables.
PostgreSQL v14.0: Observability
There have also been many enhancements to database observability for database administrators, offering deeper insights and statistics.
PostgreSQL’s core engine exposes an API that allows users to write custom extensions to enhance functionality without making any change to the core engine. One of the most popular extensions is pg_stat_statement, which monitors queries in the database and tracks statistics. It uses query hashing to assign unique IDs to queries that it can then track through the core server.
“In PostgreSQL version 14, we’ve taken that hashing technique from the extension into the core server, so you can use the same hash and work with any other extension for the same query.” Shahid continued that, before, it was only the core server and
pg_stat_statement that knew that ID — now any extension can use the same ID, allowing for much richer observability across the PostgreSQL system and logging functions.
PostgreSQL v14.0: Developer Convenience
“If someone is developing an application, you can’t expect them to be a database expert,” Shahid said.
Among other developer experience improvements, Shahid said they’ve made JSON enhancements that further refine the usability of unstructured data, so developers are able to use it in the languages they like to use, including Python and Java.
“It’s not just about getting the feature in a native data type but it’s about making sure the devs use it in the way they are used to,” he said.
They’ve also added multirange types. Before this release, developers could only have one range type, with specific starting and endpoints — like an office being open from 9 a.m. to 5 p.m. Now, you can specify a non-contiguous range in an array of ranges, like this meeting room is occupied from 9 to 9:30 a.m., 11 to noon, and so on. You can now hold the entire period during the day in the same variable.
The typical stored procedures were added in PostgreSQL 11, giving developers transactional control in a block of code. PostgreSQL 14 implements the OUT parameter, allowing developers to return data using multiple parameters in their stored procedures. This feature will be familiar to Oracle developers and welcome addition for folks trying to migrate from Oracle to PostgreSQL.
PostgreSQL v14.0: Security
Perhaps the most interesting feature for many users wasn’t spoken about until the end of our interview with Shahid — security.
One of the potential downsides to PostgreSQL before this release was that the default authentication method had been MD5, which, among other things, doesn’t comply anymore with the PCI DSS payment card standard. MD5 leverages an encrypted key that both parties hold. “As you send your data, you need to have a mechanism to try to make sure the other party knows how to decrypt.”
However, Shahid explained, if someone is sniffing in the middle, there’s a slight chance that MD5 can be decrypted along the way.
The default authentication method for PostgreSQL is now SCRAM, the Salted Challenge Response Authentication Mechanism, which is meeting the industry standard with a higher level of compliance. The salted challenge sends data without telling the recipient how to decrypt it — both sides already know how. That eliminates the power of that sniffer.
They’ve also made updates to predefined roles, including
pg_read_all_data allows for assigning of privilege to a user that has the ability to read all the data. The common use case is either analytics or someone tasked with setting up a replication platform where new tables are being added.
They also added the
pg_write_all_data. Shahid gave the example of a table that sets the characteristics for tables.
“pg_write_all_data grants superuser styled privileges to the user, and hence should be used sparingly. But in the cases that it is needed, it provides for a lot of convenience. Applications evolve, databases evolve, schemas evolve. A blanket privilege to write all data can ensure that the user continues to have the required permissions, that explicitly assigning such permissions isn’t missed with the updates” Shahid said.
He emphasized that this is a convenience to be used only sparingly.
What Makes PostgreSQL So Popular?
“The license has a lot to do with it,” argues Shahid. “Not because people actually read the license to try to understand it. It’s just, in open source there are varying degrees of openness.”
He continued that the PostgreSQL open source license is about as liberal as it gets, consisting of only 143 words which he summarized as:
- Do whatever you want with it.
- Don’t blame the University of California if something goes wrong.
Shahid explained this openness is why all major cloud vendors can provide PostgreSQL solutions and that there are so many forks and rebrands of the database.
Also, just like it serves distributed systems, it is the product of a highly distributed open source community.
“It has a unique community behind the project in that it’s not run by a single commercial entity. PostgreSQL is a true multi-user, multi-region, multi-ethnicity company. The only aspect of decision making is ‘What’s good for the project’ not ‘What’s good for business’ because there is no business,” he said, reminding us that open source is a development model, not a business model.
He said that when you look at absolute numbers, there’s no doubt that MySQL and Oracle would have more users, but that growth has been rather flat while open source solutions like PostgreSQL continue an upward growth.