Of course I have a backup! - Making backups with rdiff-backup

Theory

As the name suggests, rdiff-backup makes reverse incremental backups. What's that? Let's look at **incremental backups** first. Incremental backups work by taking a snapshot of a file on the first run, and storing only the differences (diffs) on subsequent runs. Unlike a traditional copy, you only have one copy of the whole file, and have many diffs that describe how it looked at each run. This makes subsequent backups run quicker, take much less space, and still keep the ability to restore any version of the file.
A reverse incremental backup is almost the same. It still stores only one full copy of the file, but that copy is always the latest version, and you keep the diffs that allow you to go back to an earlier revision. This is tailored to the use case where you accidentally overwrite or delete a file on the production server. Instead of taking the original file from the backup and applying N diffs to it, you can simply copy the file back, because the latest version is stored as a whole.
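
To make the idea concrete, here is a toy model of a reverse incremental backup built from plain `diff` and `patch`. It is only a sketch of the concept: rdiff-backup uses its own binary delta format, and the directory layout and file names below are made up for the illustration.

```bash
#!/usr/bin/env bash
# Toy model of a reverse incremental backup. NOT rdiff-backup's real storage format.
set -eu

work=$(mktemp -d)
mkdir "$work/site" "$work/mirror" "$work/increments"

# First run: the mirror receives a full copy of the file.
printf 'Hello world\n' > "$work/site/index.html"
cp "$work/site/index.html" "$work/mirror/index.html"

# The site changes between two backup runs.
printf 'Click here for the forum\n' >> "$work/site/index.html"

# Second run: save a diff that leads from the NEW version back to the OLD one,
# then overwrite the mirror with the new version.
# diff exits with status 1 when the files differ, hence the "|| true".
diff -u "$work/site/index.html" "$work/mirror/index.html" \
    > "$work/increments/index.html.1.diff" || true
cp "$work/site/index.html" "$work/mirror/index.html"

# Restoring the latest version is a plain copy.
cp "$work/mirror/index.html" "$work/latest.html"

# Restoring the previous version: copy the mirror, then apply the reverse diff.
cp "$work/mirror/index.html" "$work/previous.html"
patch "$work/previous.html" < "$work/increments/index.html.1.diff" > /dev/null

cat "$work/previous.html"
```

The final `cat` prints the first version of the file (`Hello world`), while `latest.html` needed no patching at all. That asymmetry is the whole point of the reverse scheme.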

Usage

Search for rdiff-backup in your package manager (aptitude, pacman, yum, etc.), and install it.
Let's create a project where we can try it out. I'm going to create two folders: "website", which houses the project that we want to back up, and "backups", which stores the backup(s).

```bash
mkdir -p ~/backups ~/website/htdocs
cd ~/website/htdocs
echo "Hello world" > index.html
```

Taking a backup is as easy as specifying the source, and the destination folder:

```bash
rdiff-backup ~/website ~/backups
```

That's all, you now have a backup of your whole site in ~/backups/. Let's add an imaginary forum to our site, link to it from the front page (index.html), and then take another backup.

```bash
cd ~/website/htdocs/
echo "Here is the forum" > forum.html
echo 'Click <a href="/forum.html">here</a> for the forum' >> index.html
rdiff-backup ~/website ~/backups
```

If you take a look at ~/backups/htdocs you can see that the file structure nicely mirrors the website. Now let's delete the forum.html file, and take another backup:

```bash
rm forum.html
rdiff-backup ~/website ~/backups
```

The forum.html file also vanished from the ~/backups/ folder. That's because the mirror only keeps the latest version. Restoring any older version takes a little more work.

A few use cases

We have now made a few changes: we created a new file, modified an existing one, and deleted one. Let's look at a few scenarios in which you might have to restore something.

Restoring a file, or a directory

If you need to restore the latest state of a file, e.g. after an accidental delete, you can just copy the file back from the backup folder to the production folder. Use mc, cp, or anything else you prefer. There is nothing else to do here.

Restoring an older version of a file

If you need an older version, you first need to figure out the date when the file was in the state that you need. You can list all the increments (the times when rdiff-backup saw a change to a file) with the -l switch, and restore a specified version with -r:

```bash
cd ~/backups/htdocs/
rdiff-backup -l index.html
# Found 1 increments:
#     index.html.2011-10-20T18:34:44+02:00.diff.gz   Thu Oct 20 18:34:44 2011
# Current mirror: Thu Oct 20 18:39:18 2011
rdiff-backup -r '2B' index.html /tmp/index.html
cat /tmp/index.html
# Hello world
```

You can restore both files and directories like this. 2B restores the file as it looked 1 change ago, 3B means 2 changes ago, and so on: nB means n-1 changes ago. Take note of the TIME FORMATS section in the manual, which describes more ways to specify the version to restore (there are 6). Also, the manual is wrong on the B notation:

A backup session specification which is a non-negative integer followed by 'B'. For instance, '0B' specifies the time of the current mirror, and '3B' specifies the time of the 3rd newest increment.

Both 0B and 1B restore the file to its current state, although the manual leads you to believe that 1B would restore the first increment.

List changes that have happened since a given time

You can use --list-changed-since [time] to list all the changes that have happened after a certain time. You can also use this to restore deleted files that are no longer present in the backups directory because another backup has taken place since. Find the filename with --list-changed-since, find the date of its last increment with -l, and restore it with -r. It works as expected, even though the file seems to be missing.
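
The three steps can be wrapped in small helper functions. This is only a sketch, assuming the classic command-line syntax used throughout this article. The function names and the BACKUP_DIR variable are made up for the example.

```bash
# Sketch: recover a file that is already gone from the mirror.
# BACKUP_DIR and all function names are illustrative, not part of rdiff-backup.
BACKUP_DIR="${BACKUP_DIR:-$HOME/backups}"

# Step 1: list everything that changed in the last N days ("30" becomes "30D").
changed_since_days() {
    rdiff-backup --list-changed-since "${1}D" "$BACKUP_DIR"
}

# Step 2: list the increments of one path, given relative to the backup root.
increments_of() {
    rdiff-backup -l "$BACKUP_DIR/$1"
}

# Step 3: restore that path as it was at the given time into a new location.
# $1 = time spec, $2 = path relative to the backup root, $3 = destination
restore_as_of() {
    rdiff-backup -r "$1" "$BACKUP_DIR/$2" "$3"
}

# Example session for the deleted forum page:
#   changed_since_days 30
#   increments_of htdocs/forum.html
#   restore_as_of 2011-10-16T21:10:01+02:00 htdocs/forum.html /tmp/forum.html
```

Restoring into a scratch location such as /tmp first lets you inspect the file before copying it back to the production folder.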

Comparing the production folder to the backup

You have made some changes to the site, but forgot which files you changed, or you just need to make sure that the backup is fresh. You can use the --compare option for this:

```bash
echo "A new change" >> ~/website/htdocs/index.html
rdiff-backup --compare ~/website/ ~/backups/
# changed: htdocs
# changed: htdocs/index.html
```

Keep in mind that this comparison also takes mtime into account, so if the files are perfectly identical, but have different mtimes, rdiff-backup will report it as changed. You can also use --compare-at-time to specify the time you wish to compare against.

Recovering a file, without knowing the file name

You deleted a file three months ago, and now it turns out that you need it. Of course, you no longer remember the name of the file. If you can at least guess part of the filename, you can use find to try to match an increment file:

```bash
cd ~/backups
find . -iname '*forum*' -type f
# ./rdiff-backup-data/increments/htdocs/forum.html.2011-10-16T21:10:01+02:00.snapshot.gz
# ./rdiff-backup-data/increments/htdocs/forum.html.2011-10-20T18:34:44+02:00.missing
# ./rdiff-backup-data/increments/htdocs/forum.html.2011-10-16T21:05:59+02:00.missing
```

Another option, if you know some of the contents of the file, is to grep the increments folder. Even though the increment files are gzipped, a plain grep can often still find the string:

```bash
grep -ir 'forum' .
# Binary file ./rdiff-backup-data/increments/htdocs/forum.html.2011-10-16T21:10:01+02:00.snapshot.gz matches
```
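
A plain grep on compressed data is hit and miss. A more dependable variant, sketched here on the assumption that zgrep is available (it usually ships with gzip), searches the decompressed contents instead. The function name and the INCREMENTS variable are made up for the example. Snapshot increments hold whole files, so they are the best candidates. Diff increments are binary deltas, so matches there are less reliable.

```bash
# Sketch: search the decompressed contents of gzipped increment files.
# INCREMENTS and search_increments are illustrative names.
INCREMENTS="${INCREMENTS:-$HOME/backups/rdiff-backup-data/increments}"

search_increments() {
    # $1 = text to look for (case-insensitive); prints the names of matching files
    find "$INCREMENTS" -type f -name '*.gz' -exec zgrep -i -l -e "$1" {} +
}

# Example:
#   search_increments 'here is the forum'
```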

If you don't remember anything, you need to look through the increments folder, under rdiff-backup-data. It contains all of the files that rdiff-backup has encountered so far, even the deleted ones, so you can look up the filename there. For example, I have deleted forum.html, but it still exists in the increments folder:

```bash
$ tree ~/backups/rdiff-backup-data/increments
~/backups/rdiff-backup-data/increments
├── htdocs
│   ├── forum.html.2011-10-16T21:05:59+02:00.missing
│   ├── forum.html.2011-10-16T21:10:01+02:00.snapshot.gz
│   ├── forum.html.2011-10-20T18:34:44+02:00.missing
│   ├── index.html.2011-10-16T21:05:59+02:00.diff.gz
│   ├── index.html.2011-10-16T21:10:01+02:00.diff.gz
│   └── index.html.2011-10-20T18:34:44+02:00.diff.gz
├── htdocs.2011-10-16T21:05:59+02:00.dir
├── htdocs.2011-10-16T21:10:01+02:00.dir
└── htdocs.2011-10-20T18:34:44+02:00.dir
```
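
Reading a large increments tree by eye gets tedious. The sketch below lists every path that has a snapshot increment but no longer exists in the mirror, which is roughly "files that were backed up once and deleted later". It relies only on the file naming visible in the listing above. The function name and BACKUP_DIR are made up, and the assumption that uncompressed snapshots end in plain `.snapshot` should be checked against your own backup.

```bash
# Sketch: list files that have a snapshot increment but are gone from the mirror.
# BACKUP_DIR and list_deleted_files are illustrative names.
BACKUP_DIR="${BACKUP_DIR:-$HOME/backups}"

list_deleted_files() {
    inc="$BACKUP_DIR/rdiff-backup-data/increments"
    find "$inc" -type f \( -name '*.snapshot.gz' -o -name '*.snapshot' \) |
    while IFS= read -r f; do
        rel=${f#"$inc"/}
        # Strip the ".<timestamp>.snapshot[.gz]" suffix to get the original path.
        rel=$(printf '%s\n' "$rel" |
            sed -E 's/\.[0-9]{4}-[0-9]{2}-[0-9]{2}T[^/]*\.snapshot(\.gz)?$//')
        # Only report paths that are missing from the current mirror.
        [ -e "$BACKUP_DIR/$rel" ] || printf '%s\n' "$rel"
    done | sort -u
}

# Example:
#   list_deleted_files
```

With the tree shown above, this would report htdocs/forum.html, since index.html only has diff increments and still exists in the mirror.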

Taking it further

You can automate this. Drop the command into cron.daily or your own crontab, or create a user just for handling backups, which is, in my opinion, the best solution.

Start thinking about offsite backups now. If your site gets hacked or the HDD breaks down, chances are you will not be able to access backups stored on the same machine, so always have them copied to a different place. You can mount the remote site with sshfs, transfer the backups with scp, or tell rdiff-backup to connect through ssh. The last option needs rdiff-backup installed on the remote side as well, where it is started in server mode.

You will also have to monitor the disk usage. Since we are only storing the diffs, we are very space efficient, but if your site has any user generated content, like uploaded files or forum avatars, those can get out of hand quickly. You can remove old versions with --remove-older-than [time]. Skim through the manual at least once, to know all the options that are available.
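
Putting the last two points together, a nightly job could take a backup and then prune old increments. The following is a minimal sketch rather than a finished script: the function name, the variables, the eight-week retention and the example host are all assumptions to adapt. The `user@host::/path` form is rdiff-backup's syntax for a destination reached over ssh, and --force is included because rdiff-backup otherwise refuses to remove more than one old session at a time.

```bash
# Sketch of a nightly backup job. All names and defaults are illustrative.
SOURCE="${SOURCE:-$HOME/website}"
# For an offsite copy over ssh, use something like:
#   DEST=backup@backup.example.com::/srv/backups/website   (hypothetical host)
DEST="${DEST:-$HOME/backups}"
KEEP="${KEEP:-8W}"    # keep eight weeks of increments

nightly_backup() {
    # Prune only when the backup itself succeeded.
    rdiff-backup "$SOURCE" "$DEST" &&
    rdiff-backup --force --remove-older-than "$KEEP" "$DEST"
}

# To run this from cron, save it as a script, call nightly_backup on the last
# line, and add a crontab entry such as (the script path is hypothetical):
#   30 3 * * * /home/backup/bin/nightly-backup.sh
```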