July 17, 2007

Incremental backup from shell scripts

Category: Sysadmin.

In the past few days, I have been working on an incremental backup solution for my company's dedicated servers. These servers host gigabytes of data spread across websites, document repositories and tools.

For each of our machines, a complete backup consists of two actions:

  1. a local copy, made with appropriate privileges, of everything that cannot be safely hot-copied to a remote location (such as database clusters, which have to be dumped to SQL files, and system files that need root privileges to be read);
  2. a copy of the directories to be saved, transferred from the host to a backup server.

The first part is performed by Backupninja, a very flexible tool that relies on many other programs to do its job. I particularly like it because it follows the UNIX philosophy: do one thing and do it well. This first part was basically trivial to set up.
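Conceptually, this step boils down to something like the sketch below. It is not our real Backupninja configuration; the staging path, the database choice and the list of system files are only placeholders:

# Run as root (Backupninja's handlers do the equivalent of this).
STAGING=/var/backups/staging   # local area later pulled by the backup server
mkdir -p "$STAGING" || exit 1

# Database clusters cannot be hot-copied safely, so dump them to SQL files.
mysqldump --all-databases > "$STAGING/all-databases.sql" || exit 1
gzip -f "$STAGING/all-databases.sql" || exit 1

# Copy system files that need root privileges to be read.
mkdir -p "$STAGING/etc" || exit 1
cp -p /etc/shadow /etc/gshadow "$STAGING/etc/" || exit 1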

The second part is a bit trickier. The machines have no way to contact the backup server, so they cannot push data when ready (using Backupninja, for example). Instead, the backup server has to pull data from the host to save. The good point for us is that it is easier to limit overhead by pulling data from one server at a time. The bad point is that we want efficient transfers and incremental backups: rdiff-backup does this, but it is not suitable for a heterogeneous network like ours since it requires the same version on each host. Moreover, it depends on Python (which we do not use on all of our machines and do not intend to install only for this purpose), and we prefer to have direct access to hot copies (the directory tree as it was when the backup was performed).

The answer was to write a shell script that, for each server, performs the following steps (the first two are sketched right after the list):

  1. copies the last saved files (when applicable) to the new backup location;
  2. uses rsync(1) to synchronize the local copy with the remote data;
  3. replaces unchanged files with hard-links to the previous backup files.
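
A simplified sketch of the first two steps is shown below; $DST, $BACKUP_DATE and $LAST_BACKUP are the variables used by the hard-linking code further down, while $SRV, $BACKUP_ROOT and the remote staging path are placeholders, not the names used in our actual script:

# $SRV: host to save; $BACKUP_ROOT: local backup area (both placeholders).
DST="$BACKUP_ROOT/$SRV/$BACKUP_DATE"

# 1. Start from a plain copy of the previous backup (when there is one),
#    so that rsync only has to transfer what changed on the host.
if [ -d "$BACKUP_ROOT/$SRV/$LAST_BACKUP" ]; then
  cp -a "$BACKUP_ROOT/$SRV/$LAST_BACKUP" "$DST" || exit 1
else
  mkdir -p "$DST" || exit 1
fi

# 2. Pull the remote data over SSH; --delete removes files that no longer
#    exist on the host.
rsync -az --delete -e ssh "root@$SRV:/var/backups/staging/" "$DST/" || exit 1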

Hard-linking from a shell script

The third step was implemented as shown below:

echo "---> Hard linking unchanged files ($(date))"

find "$DST" -type f -exec ls -l '{}' ';' | { 
  while read MODE LINKS USER GROUP SIZE DATE TIME FILENAME; do

    SFILENAME=$(echo $FILENAME | sed "s/$BACKUP_DATE/$LAST_BACKUP/")
    if [ -f "$SFILENAME" ]; then
      ls -l "$SFILENAME" | { 
        read SMODE SLINKS SUSER SGROUP SSIZE SDATE STIME SFILENAME

        if [ '(' "$SIZE" = "$SSIZE" ')' \
          -a '(' "$DATE" = "$SDATE" ')' \
          -a '(' "$TIME" = "$STIME" ')' ]; then
          ln -f "$SFILENAME" "$FILENAME" || exit 1 
        fi
      } 
    fi

  done
} || exit 1

The process is simple, but for each file it launches many programs. In the worst case: ls(1), sh(1), echo(1), sed(1), [(1), ls(1), [(1) and ln(1). The result is that this loop takes a very long time to run and increases the machine load. For one of our hosts, it took more than seven hours to finish hard-linking circa 14 GB of files!

---> Hard linking unchanged files (Mon Jul 16 12:17:58 CEST 2007)
---> Removing deprecated backups (Mon Jul 16 19:42:31 CEST 2007)

Since each of these tools only performs a simple action (basically a system call), most of the overhead comes from disk I/O and program invocation. Changing our disks for faster ones is not a priority... but we can consider writing a single program that recursively replaces all unchanged files with hard links in a directory tree.

Doing everything in a single program invocation

This program has been written in C and is called lntree. It dramatically reduces the time required to hard-link existing files: it takes only half an hour to do the same work:

---> Hard linking unchanged files (Tue Jul 17 14:31:49 CEST 2007)
---> Removing deprecated backups (Tue Jul 17 15:05:21 CEST 2007)
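
A quick sanity check, not part of the script itself, is to count how many files in the new backup ended up with more than one link, i.e. were hard-linked to the previous backup ($DST is the new backup directory, as above):

# Regular files whose inode has more than one name are hard links.
find "$DST" -type f -links +1 | wc -l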

The program is licensed under the terms of a BSD-like free software license and its source code is available from a Subversion repository. It can be fetched by running the following command:

svn checkout http://svn.healthgrid.org/svn/labs/trunk/lntree

Note that the Makefile shipped with the source is a BSD makefile. On some systems, you will have to install and use pmake(1) to compile this program.
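
For instance, on a Debian-like system, building it should look roughly like this (the package providing a BSD make may be named differently on your distribution):

apt-get install pmake      # or any other BSD make implementation
cd lntree && pmake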
