One of our favorite databases for maintaining data at scale is Postgres. Being a data-centric business, we have come to rely heavily on the gentle giant and use it extensively across our entire platform.

Currently, we run a majority of our Postgres instances on Amazon RDS. After offloading worries about the maintenance, availability and backup to RDS, I inevitably assumed the role of the accidental DBA for one of our larger Postgres clusters.

In this post, I would like to describe one of our adventures in managing this large cluster and catalog the story as it happened.

I have to disclose upfront that I would not consider myself an expert on large Postgres clusters, and that some of the issues described here might have been anticipated far earlier by more experienced DBAs. Given this, it is also noteworthy that Postgres stood up fairly well against all the use (abuse) that we threw against it.

So let’s begin our journey there and back again.

Prelude

In order to support concurrent operations, Postgres implements Multi Version Concurrency Control (MVCC), which elegantly maintains data consistency while allowing for parallel read/writes. In minimal words, MVCC maintains a transaction ID (XID) counter and whenever a transaction is started, this counter is incremented to track the data visible to that transaction. For this post, it is sufficient to think of each XID as representing a particular state of data in a relation. Of-course, it is more complicated than that and a better can be explanation is available over at Heroku.

Now as with many tales, enter Sauron — transaction ID wraparound.

A problem arises because these XIDs are represented with only 32 bits in Postgres. It is just easier to quote the excellent Postgres documentation here

PostgreSQL’s MVCC … depend on being able to compare transaction ID (XID) numbers: a row version with an insertion XID greater than the current transaction’s XID is “in the future” and should not be visible to the current transaction. But since transaction IDs have limited size (32 bits) a cluster that runs for a long time (more than 4 billion transactions) would suffer transaction ID wraparound: the XID counter wraps around to zero, and all of a sudden transactions that were in the past appear to be in the future — which means their output become invisible. In short, catastrophic data loss.

To avoid this, it is necessary to make sure that VACUUM operations are run periodically to advance the oldest XID forward. Usually the autovacuum army takes care of the fight and prevents transaction XID wraparound from wreaking havoc.

Waking up to a warning

One of our instances on RDS (an m4.xlarge with about 750 GB) had been showing signs of slowing down over the last few weeks. It used to frequently hit the maximum disk IOPS rate and stayed there, suffering large costs to read and write latencies. While, we were working on diagnosing this issue, I received an unexpected email from AWS support, pointing out that the age of the oldest table in our database was very large (> 1 billion).

If left untended, it would cause transaction ID wraparound to happen — which would be a pain to fix and possibly mandate a lot of downtime.

Estimating time left before Mount Doom erupts

The first step was to estimate the time left till wraparound might actually happen. By using txid_current(), it is possible to find the burn rate for XIDs. By running this query multiple times, separated by a few hours, the differences were used to estimate a burn rate of about 600,000 per hour. Given that the age of our oldest XID was nearly 1.3 billion, and wanting to be conservative, we estimated that we had about 40 safe days before hitting 2 billion (if other factors remained constant).

To increase our time-to-live, the first step was reducing our burn rate. There were a few low-hanging fruit with some of our indexing scripts — by grouping them into manageable transactions, we were able to shave off quite a bit. Together with some other clumping of heavy writes, we were quickly able to reduce the burn to about 400,000 per hour. This gave us another 20 days, moving grace time back to about 60 days.

Having detected the slow march towards database shutdown, but with a notice of 60 days, we counted our lucky stars and started searching for ways to prevent disaster.

The search for the One Ring

The first stage in the battle was making the autovacuum daemon run more aggressively. However, due to the large number of tables (> 200,000) in our database, even the newly fortified autovacuum army turned out to be losing. I'll describe these settings in the next post, but to summarize it shortly, even with more autovacuum_max_workers, higher maintenance_work_mem and reduced autovacuum_naptime, it seemed like a losing battle.

So the decision was made to search for the relations containing the oldest XIDs and then force a freeze on them. The two primary aides in this quest were the pg_class and pg_database system tables.

Whenever a VACUUM is run on a relation, the cutoff XID (no longer visible to any transaction) is updated in the relfrozenxid column of the pg_class table. The smallest value of relfrozenxid across all relations is saved as datfrozenxid in pg_database

By running the query below, the oldest entry was identified as a TOAST table called pg_toast_165938048, relfrozenxid for this relation agreed with the value of datfrozenxid across the entire database.

SELECT ns.nspname, relname,
       relfrozenxid, age(relfrozenxid)
FROM pg_class c
INNER JOIN pg_namespace ns ON c.relnamespace = ns.oid
WHERE ns.nspname NOT IN ('pg_catalog','information_schema')
ORDER BY age(relfrozenxid) DESC LIMIT 10;

nspname   |         relname          | relfrozenxid |    age
----------+--------------------------+--------------+------------
 pg_toast | pg_toast_165938048       |   3842253562 | 1232089527
 pg_toast | pg_toast_189701233       |   4442253562 |  632089527
...

SELECT datname, datfrozenxid, age(datfrozenxid)
FROM pg_database
ORDER BY age(datfrozenxid) DESC LIMIT 5;

datname        |   datfrozenxid   |    age
---------------+------------------+--------------
 mydb          |   3842253562     |   1232089527
...

Confident that we found the oldest relation, we proceeded with a manual freeze

VACUUM (verbose, freeze) pg_toast.pg_toast_165938048;
ERROR:  permission denied for schema pg_toast

Ok, so why didn’t that work?

Turns out, the rds_superuser role from Amazon RDS does not have the required permissions on pg_toast (and there is no way to grant it either).

Hmm, so lets look at the underlying relation for this pg_toast table. Using reltoastrelid from pg_class, it was traced to a large materialized view that we no longer needed. So the decision was made to simply drop this view and move on.

DROP MATERIALIZED VIEW unused_matview;

A surprise from the ashes

Fairly confident that the oldest XIDs should now be removed from the database, I began checking the old friends at pg_class and pg_database to make sure that the ages were now reduced.

-- from pg_class
 nspname  |         relname          | relfrozenxid |    age
----------+--------------------------+--------------+------------
 pg_toast | pg_toast_189701233       |   4442253562 |  637089527
...

-- from pg_database
   datname     |   datfrozenxid   |    age
---------------+------------------+--------------
 mydb          |   3842253562     |   1237089527

Wait, what?

The oldest relfrozenxid in pg_class seems to have been fixed with its age much lower than what was seen earlier, but datfrozenxid in pg_database was still the wrong value. What gives?

Going back to the Postgres documentation, I found something that I had missed earlier.

A manual VACUUM should fix the problem … but note that the VACUUM must be performed by a superuser, else it will fail to process system catalogs and thus not be able to advance the database’s datfrozenxid

Turns out, it was due to our not-a-real-superuser rds_superuser again. None of the roles provided by RDS have the required privileges for pg_database and so it still keeps the old XID.

All this code spelunking took a while and going back to my doomsday clock, I painfully re-calculated the time to shutdown: 50 days

It seemed like we might now be worse off, as I was stuck with this unchanged datfrozenxid referring to a relation which does not exist any more. And this XID value being inconsistent with the very definition of datfrozenxid being the oldest value from relfrozenxid.

Intermission

I had not expected this post to get so long. So I’ll take a break here and continue the tale of how we finally won the battle against transaction ID wraparound in a future post (edit: published here).

If you are interested in more stories about transaction ID wraparound and the importance of vacuuming (more, not less!), I’ll leave a few links here.