Snapshot Restore versus pg_restore – Time Comparison

March 13, 2019 4 minutes

Summary

I’ll just cut to the chase here and give detailed analysis immediately after.

The time difference between pg_restoring an 11GB sized PostgreSQL database with 159MB pg_dump file, versus restoring snapshot of its t2.medium RDS Instance:

alt="image"

Analysis

Important note! The time values are subject to change, because it depends on network, other running processes, intermediary EC2 Instance type etc.

pg_restore

I used a t2.medium SSM Bastion Instance to execute PostgreSQL commands. As I’ve mentioned before, provisioning a more powerful EC2 Instance might increase the performance drastically. You can find information about SSM Bastions in this page.

Imagine you have a PostgreSQL database named “XXX”; with 11 GB size. Don’t get distracted with the name, I don’t know why I named it “XXX”; at the first hand, but here we go anyway 🙂

alt="image"

I take dump of it by this pg_dump command (check the documentation here)

PGPASSWORD='${PGPASSWORD}' pg_dump -h "restore-test.xxxxxxxxxxxx.eu-central-1.rds.amazonaws.com" -p 54321 XXX XXX --format=c > xxx.sql

And to see how big the file is, I execute;

alt="image"

You can see that the xxx.sql dump file is 159 MB.

The reason why our dump file is smaller than our XXX database is because we take logical backup of the database, therefore, all of those empty pages, old versions of rows, indexes aren’t going to be included in the dump file. It is also compressed as default, because we used -format=c option. All that greatly reduced the size.

Now, let’s restore our database with this dump file by this pg_restore command (check documentation here)

PGPASSWORD='${PGPASSWORD}' pg_restore -v -h "restore-test.xxxxxxxxxxxx.eu-central-1.rds.amazonaws.com" -p 5432 XXX -d POSTGRESQL -j 4 -O -C "xxx.sql"

As you can see, I used time command with the actual command, so that we can see how long it took;

alt="image"

“real” value here is what we must be looking for. Details over here. Not bad, huh? However, it is still 14 minutes of downtime, which I don’t recommend for your mental, and probably physical health 🙂 Let’s note the value and proceed with Snapshot Restore operation.

Snapshot Restore

I took a snapshot of our PostgreSQL RDS Instance, which has our prey, XXX database;

alt="image"

At this point, if you’ve decided to use snapshot restore operation instead of pg_restore, you’re going to need to change your corrupted RDS Instance’s identifier. Because AWS doesn’t allow duplicate RDS identifiers, obviously. I mean, they’re identifiers, that’s what they do, identify. You can change identifier to, let’s say, “original-old”;. So the time it takes to change identifier of a Multi-AZ RDS Instance is this;

alt="image"

Quite long. After changing our corrupted RDS Instance’s identifier, we can now proceed to restore our snapshot and name it as our original RDS identifier, so our applications can automatically connect to it. Let’s see how long it takes to restore a Multi-AZ PostgreSQL RDS Instance;

alt="image"

WOW! This is not good at all. 20 minutes is enough time to make you looking for a new job! Add that identifier change time(2:21) to this and you get 22:04 minutes! You might as well take a damn nap while this completes.

Conclusion

Obviously, the first thing to do is, enabling Multi-AZ feature of your RDS Instances. Because in case of a failure in an availability zone, AWS will automatically change RDS Instance’s DNS to a different RDS Instance in a different availability zone. That way, your RDS Instances are fault tolerant.

After enabling Multi-AZ feature, I highly recommend taking pg_dumps of your PostgreSQL Databases daily and uploading them to S3. Even if you don’t use PostgreSQL, you can adapt this solution to MySQL etc.

I also highly recommend configuring read-replicas(slaves) of the master RDS Instance, if you are swimming in gold and don’t feel lazy at all. Because in that case, you can make one of the slaves your master DB and keep your job in some specific disaster scenarios. In addition being a recovery solution for disaster cases, if designed well, it also increases R/W performance drastically.