
Tarsnap and backup strategies

After a rather traumatic experience last November with a customer’s service running on one of the virtual servers I manage, I made sure to have a very thorough backup for all my systems. Unfortunately, it turned out to be a bit too thorough, so let me walk you through what was going on.

First of all, the software I use to run the backups is tarsnap — you may or may not have heard of it, but it’s basically a very smart service: an open-source client based upon libarchive, paired with a server side that stores the content de-duplicated, compressed and encrypted with a very flexible key system. The author is a FreeBSD developer, and he charges an insanely small amount of money.

But the most important thing to know when you use tarsnap is that you always just create a new archive: it doesn’t really matter what changed, you throw everything in, and the content that didn’t change is de-duplicated automatically, so why bother tracking it? My first, dumb backup method, which is still running as I write this, is simply to dump a copy of the databases every two hours (one server runs PostgreSQL, the other MySQL — I no longer run MongoDB, though honestly I’m starting to wonder about it), and then use tarsnap to generate an archive of the whole of /etc, /var and a few more places where the important stuff lives. Each archive is named after the date and time of the snapshot. And on most servers I haven’t deleted a single snapshot since I started.
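
To make it concrete, here’s a rough sketch in Python of what that two-hourly job amounts to; the dump command, paths and naming scheme are illustrative rather than the actual script, though creating a named archive with tarsnap -c -f is exactly how the real thing works. Note that nothing in it ever deletes an archive.

    #!/usr/bin/env python3
    # Rough sketch of the two-hourly backup job described above.
    # Paths, the dump command and the archive name format are illustrative.
    import subprocess
    from datetime import datetime, timezone

    DUMP_FILE = "/var/backups/db-dump.sql"    # hypothetical dump location
    PATHS = ["/etc", "/var", "/var/backups"]  # places where important stuff is

    # Dump the database first (pg_dumpall here; mysqldump on the MySQL host).
    with open(DUMP_FILE, "w") as out:
        subprocess.run(["pg_dumpall"], stdout=out, check=True)

    # Always create a brand new archive, named after the date and time;
    # tarsnap de-duplicates whatever didn't change since the last run.
    archive = datetime.now(timezone.utc).strftime("backup-%Y%m%d-%H%M")
    subprocess.run(["tarsnap", "-c", "-f", archive, *PATHS], check=True)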

It was a mistake.

When the moment came to recover the data out of earhart (the host that still hosts this blog, a customer’s app, and a couple more sites, like the assets for the blog and even Autotools Mythbuster — though all the static content, being managed by git, is now also mirrored and served active-active from another server called pasteur), the time it took to extract the backup was unsustainable. The reason was obvious once I thought about it: since it had been de-duplicating for almost a year, tarsnap had to scan hundreds if not thousands of archives to gather all the small bits and pieces.

I still haven’t replaced this backup system, which is quite bad for me, especially since deleting the older archives takes a long time even after extracting them. On the other hand it’s also very much a matter of trading off expenses: going through all the older archives to remove the old crap drained my tarsnap credits quickly. Since the data is de-duplicated and encrypted, the archives’ data needs to be downloaded and decrypted before it can be deleted.
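
For reference, the cleanup itself is nothing more than listing the archives and deleting them one at a time; a rough sketch, assuming the timestamped names from the sketch above:

    #!/usr/bin/env python3
    # Rough sketch of bulk pruning: delete every archive older than a cutoff.
    # Assumes archives are named backup-YYYYMMDD-HHMM as in the sketch above.
    import subprocess
    from datetime import datetime, timedelta, timezone

    CUTOFF = datetime.now(timezone.utc).replace(tzinfo=None) - timedelta(days=60)

    listing = subprocess.run(["tarsnap", "--list-archives"],
                             capture_output=True, text=True, check=True)

    for name in listing.stdout.splitlines():
        try:
            stamp = datetime.strptime(name, "backup-%Y%m%d-%H%M")
        except ValueError:
            continue  # leave archives that don't match the naming scheme alone
        if stamp < CUTOFF:
            # Each deletion takes time and bandwidth, which is where the
            # credits went.
            subprocess.run(["tarsnap", "-d", "-f", name], check=True)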

My next step is going to be to set things up so that the script keeps archives in different tiers: 24 archives across 48 hours (every two hours), 14 across 14 days (daily), and 8 across two months (weekly). The tricky part is doing the rotation properly in a script, but I’ll probably publish a Puppet module to take care of it, since that’s the easiest way for me to make sure it executes as intended.
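
A rough sketch of that rotation logic, again assuming the timestamped archive names from the earlier sketches (the real thing will probably end up in the Puppet module): keep the newest archive of each day or week within the respective window, and delete everything else.

    #!/usr/bin/env python3
    # Rough sketch of the rotation described above: keep every archive from the
    # last 48 hours, one per day for 14 days, one per week for two months, and
    # delete the rest. Archive names as in the earlier sketches.
    import subprocess
    from datetime import datetime, timedelta, timezone

    now = datetime.now(timezone.utc).replace(tzinfo=None)

    listing = subprocess.run(["tarsnap", "--list-archives"],
                             capture_output=True, text=True, check=True)
    archives = {}
    for name in listing.stdout.splitlines():
        try:
            archives[name] = datetime.strptime(name, "backup-%Y%m%d-%H%M")
        except ValueError:
            pass  # leave unrecognised archives alone

    keep, seen_days, seen_weeks = set(), set(), set()
    for name, stamp in sorted(archives.items(), key=lambda kv: kv[1], reverse=True):
        age = now - stamp
        if age <= timedelta(hours=48):
            keep.add(name)                        # every two-hourly run
        elif age <= timedelta(days=14):
            if stamp.date() not in seen_days:     # newest archive of each day
                seen_days.add(stamp.date())
                keep.add(name)
        elif age <= timedelta(days=60):
            week = stamp.isocalendar()[:2]        # newest archive of each week
            if week not in seen_weeks:
                seen_weeks.add(week)
                keep.add(name)

    for name in archives:
        if name not in keep:
            subprocess.run(["tarsnap", "-d", "-f", name], check=True)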

The essence of this post is basically to warn you all that, no matter how cheap it is to keep around the whole set of backups since the start of time, it’s still a good idea to rotate them, especially for content that does not change that often! Think about it whenever you set up any kind of backup strategy…

Comments 7
  1. These have been my ideas on the matter: have you tried rsnapshot? If you only back up files, de-duplicated rsync over ssh is child’s play. (Furthermore, this thing is elegant…) Encryption probably belongs at the block level, and compression might not make sense at all: compressible files are usually of negligible size anyway, while big files are usually already compressed. YMMV.

     Databases usually need more love, since copying their files while hot will make a mess. Other than just plain dumping them all the time, which grows expensive in server load as the database grows, you usually have two options:

     a) Shipping binlogs from a live system. A regular full base plus incremental binary logs adds up to a real backup one way or another. In MySQL I pretty much sync the binlogs, ship them, parse them and concatenate them onto a copy of the nightly fulldump.sql; then tgz them the next day and delete the hourlies after a few months. Postgres seems to have a more low-level approach, which I have yet to fully parse for myself, but it still revolves around shipping binlogs. I have a nice script for MySQL, but since I now also have to take care of Postgres, I’m thinking of moving both onto b)…

     b) Copying them over from a temporary LVM snapshot. This is based on the idea that since LVM2 freezes and syncs your filesystem while creating the snapshot, the filesystem and the database on disk will be consistent. You can then mount the temporary snapshot and gather up the database files, which at worst will have to have their logs replayed on restoration.

     Any ideas?

  2. I’m the maintainer of rsnapshot — it works very well in some situations, but not really that much in this one. The main problem is that I need a place to store the backups. The idea behind tarsnap is that Amazon S3 is used to store them (Glacier might be an option as well, but that’s a different story), which is fine by me. And that’s why it has to be encrypted. (Now, the problem is that you didn’t take into account what de-duplication does here…) As for the load on the database when dumping it… the amount of data in there is laughably small, so worrying about the dump time is stupid…

  3. Did you test duplicity? It allows backing up to a lot of backends (IMAP, Google Docs, ssh, ftp, etc.) and it does the encryption locally with GPG keys, so you don’t have to trust the remote end. It also supports incremental backups. And about tarsnap, honestly, 0.30 per GB looks really expensive to me.

  4. I haven’t tested duplicity, to be honest. Tarsnap works and $0.22/day is not a bad deal in my opinion, and that’s with the stupid backup method.

  5. Tarsnap simply doesn’t scale. Even with a small number of archives it takes a long time to restore backups.

  6. I did backups for the database of my website and had the exact opposite problem: at one point the sysadmins wrote me an email saying I was massively over my quota. Well, it turns out I had activated 3-hourly backups, with 300 MiB per backup and no limit… Now I’m on a two-week rotation schedule with daily backups, and that works pretty well. (But I haven’t needed it yet, so I don’t know if it will work well in the end…) Ideal might be a log-scale backup schedule: 2-hourly (limit: one day), daily (limit: one week), weekly (limit: a month), monthly (limit: a year), yearly (limit: a decade). That’s 12+7+4+12+10 = 45 backup files with meaningful times (so I can actually understand which file I need when something goes wrong and I only notice it after some time).

  7. There’s a shell script here: http://www.bishnet.net/tim/… that you might find interesting. It does a daily / weekly / monthly rotation, but I’m sure you could add hourly increments. I’m running it as a cron job on two machines and it hasn’t failed me yet.
