In the context of Obnam, a backup is an independent copy of data that
can be restored if the primary copy can no longer be used.
Some definitions:
- the primary copy of your data is the one you work with or access directly
- a backup is a separate copy, made at some given time, that won’t change or be lost even if the primary one is
Important aspects of a backup:
- once a backup has been made, it won’t change even if the primary copy changes; it is, of course, possible to make a new backup after the changes have happened, and then have two independent copies
- backups are made of files, or other data on persistent storage, onto other persistent storage
- storage media is always ephemeral and corruptible; backups are meant to guard against that
- the primary copy and the backup need to be stored on different media; if they’re on the same media, the backup doesn’t guard against media failures
2 Table stakes for backups
For Obnam, I make the following assumptions:
- the primary copy of the data is on some device, usually as files on a file system, but it can also be data on a raw disk drive or partition
- backups are stored on local storage or on a remote server
- the user of the backup client should not have to trust the server any more than necessary; I am assuming they can trust the server not to intentionally delete the data stored on it
- backups are encrypted and authenticated only on the client; the server stores opaque blobs; the user does not need to trust the server not to read or modify the data it stores: the client uses cryptography to prevent reading and to detect modification
- a server can be used by multiple people, who are not mutually hostile
- in a disaster situation, the backup system should require as few things as possible for a user to restore their data
3 Very high level architectural assumptions
Backed up data is split into chunks of a suitable size. This makes
de-duplication simple: by splitting files into chunks in just the right
way, identical data that occurs in multiple places can be stored only
once in the backup. The simplest example of this is when a file is
renamed but not otherwise modified. A sensible backup system will
notice the rename and store only the new name, not all the data in the
file all over again.
De-duplication can be done at quite a fine granularity or a coarse
one, and there are a number of approaches here. At this high level of
architectural thinking, we don’t need to care how the splitting into
chunks happens. We do need to take care that the size of chunks can vary
and that the backup storage doesn’t need to care about the specifics of
chunk splitting.
There are ways to do “content-sensitive” chunk splitting so that the
same bit of data is recognized as a chunk even if it’s preceded by
other data. This is exciting, but I don’t know of any research into how
much duplicate data this actually finds in real data sets. A flexible
backup system might need to support many ways to split data into chunks,
so that the optimal method is used for each subset of the precious data
being backed up.
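To sketch the idea (this is not any particular tool’s algorithm): a content-defined splitter cuts a chunk wherever a rolling value over the last few dozen bytes hits a boundary condition, so the cut points depend only on nearby content and re-synchronize even when data is shifted. A real implementation would use a proper rolling hash such as Buzhash or Rabin fingerprints; the toy below just sums a sliding window.

```python
WINDOW = 48            # how many trailing bytes the rolling value looks at
MASK = (1 << 12) - 1   # boundary condition: aim for roughly 4 KiB chunks

def split_content_defined(data: bytes, min_size: int = 1024, max_size: int = 1 << 16):
    """Toy content-defined chunking: cut where a rolling window sum hits the mask."""
    chunks = []
    start = 0
    rolling = 0
    for i, byte in enumerate(data):
        rolling += byte
        if i - start >= WINDOW:
            rolling -= data[i - WINDOW]        # keep the sum over the last WINDOW bytes
        length = i - start + 1
        if length >= max_size or (length >= min_size and (rolling & MASK) == 0):
            chunks.append(data[start:i + 1])
            start = i + 1
            rolling = 0
    if start < len(data):
        chunks.append(data[start:])            # whatever is left becomes the final chunk
    return chunks
```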
I note that the finest possible granularity here is the bit, but it
would be ridiculous to go that far. However the backup system is
implemented, each chunk is going to incur some overhead, and if the
chunks are too small, even the slightest overhead is going to be too
much. A backup system needs to strike a suitable balance here.
To achieve de-duplication, the backup system needs a way to detect
that two chunks are identical. A popular way to do this is to use a
cryptographically secure checksum, or hash, such as SHA3. An important
feature of such hashes is that if two chunks have the same hash, they are
almost certainly identical in content (if the hashes are different, the
chunks are absolutely certain to be different). It can be much more
efficient to compute and compare hashes than to retrieve and compare
chunk data. This is probably good enough for most people most of the time.
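A sketch of this, assuming SHA3-256 as the hash: keep an index from digest to already-stored chunk, and only store a chunk whose digest has not been seen before.

```python
import hashlib

def deduplicate(chunks):
    """Return (index, stored): digest -> position in stored, plus the unique chunks."""
    index = {}
    stored = []
    for chunk in chunks:
        digest = hashlib.sha3_256(chunk).digest()
        if digest not in index:         # first time we see this content
            index[digest] = len(stored)
            stored.append(chunk)
    return index, stored
```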
However, for people who do research into hash function collisions,
it’s not good enough. It makes for a sad researcher who spends a
century of CPU time creating a hash collision, makes a backup of
the generated data, and when restoring finds out that the
backup system decided that the two files with the same checksum were in
fact identical.
A backup system could make this configurable, possibly on a
per-directory basis. A hash collision researcher can mark the directory
where they store hash collisions as “compare to de-duplicate”.
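Here’s a sketch of what that option could look like in the client, with a caller-supplied fetch_chunk function standing in for whatever retrieves stored chunks (a placeholder, not a real API): for directories marked this way, a digest match alone is not enough, and the candidate is compared byte for byte.

```python
def find_duplicate(chunk, digest, index, fetch_chunk, verify=False):
    """Return the id of an existing identical chunk, or None if it must be stored."""
    chunk_id = index.get(digest)
    if chunk_id is None:
        return None                       # nothing with this digest yet
    if verify and fetch_chunk(chunk_id) != chunk:
        return None                       # same digest, different bytes: store separately
    return chunk_id
```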
I admit this is a very rare use case, but it preys on my mind. At
this level of software architectural thinking, the crucial point is
whether to make the backup system use content hashes as the only chunk
identifiers, or if chunk identifiers should be independent of the
content.
Note that the checksum algorithm should be strong for security, not
just collision detection. We want to be able to safely and securely back
up arbitrary data, including data provided by an attacker, and not
suffer from attacks from that source.
I really like the SSH protocol and
its SFTP
sub-system for data transfer. I don’t particularly like it for accessing
a backup server. The needs of a backup system are sufficiently different
from the needs of generic remote file transfer and file system access
that I don’t recommend using SFTP for this. For example, it’s tricky to
set up an SFTP server that allows a backup client to make a new backup,
or to restore an existing backup, but does not allow it to delete a
backup. It makes more sense to me to build a custom HTTP API for the
backup server.
It seems important to me that one can authorize one’s various devices
to make new backups automatically, but not allow them to delete old
backups. This mitigates the situation where a device is compromised: a
compromised client can’t destroy data that has already been backed up,
even if it can make new backups with nonsense or corrupt data.
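As a hedged sketch of what such per-device authorization could look like on the server side (the operation names and capability model are made up for illustration, not an actual Obnam API): each access token carries a set of capabilities, and the server refuses operations the token does not cover.

```python
# Which capability each (hypothetical) server operation requires.
REQUIRED_CAPABILITY = {
    "store-chunk": "backup",
    "fetch-chunk": "restore",
    "delete-chunk": "delete",
}

def authorize(operation: str, capabilities: set) -> bool:
    """Allow the operation only if the token's capabilities cover it."""
    needed = REQUIRED_CAPABILITY.get(operation)
    return needed is not None and needed in capabilities

# A laptop's token might carry {"backup", "restore"} but not "delete",
# so even a compromised laptop cannot destroy existing backups.
assert authorize("store-chunk", {"backup", "restore"})
assert not authorize("delete-chunk", {"backup", "restore"})
```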
4 Encrypting backups
I want my backups to be encrypted at rest so that if someone
gains access to the backup storage they can’t see my data. I also want
my backups to be signed so that I can trust the data I restore is the
data I backed up. This is also called confidentiality and authentication
of backed up data.
“At rest” means as stored on disk. I also want transfers to and from
a backup server to be encrypted, but that’s easy to achieve with TLS or
SSH.
4.1 AEAD: authenticated encryption with associated data
Doing encryption and signing separately has turned out to be easy to
get wrong. Since about the year 2000 there have been ways to achieve
both with one operation, using authenticated
encryption, or its variant with associated data (AEAD). This is easier
to get right. In short, with authenticated encryption, if you can
decrypt some data, you can be sure that the decrypted data is what was
encrypted.
Conceptually, the two AEAD operations are:

encrypt(plain text, key, ad) → cipher text, authentication tag
decrypt(cipher text, authentication tag, key, ad) → plain text, or error

- the cipher text, authentication tag, and associated data (“ad”) are stored in backup storage
- at least some AEAD implementations make the cipher text and authentication tag part of the same output string, but that’s an implementation detail; they’re conceptually separate
In other words, you keep the associated data with the cipher text, as
you’ll need it to decrypt. If the decryption works, you know the
associated data is also good (in addition to the encrypted data). You do
need to be careful not to trust the associated data until it’s been
authenticated.
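Here is a minimal sketch of these operations using ChaCha20-Poly1305 from the Python cryptography package (one AEAD among several; an illustration, not a statement of what Obnam uses). This particular library returns the authentication tag appended to the cipher text, which is the implementation detail mentioned above.

```python
import os
from cryptography.hazmat.primitives.ciphers.aead import ChaCha20Poly1305
from cryptography.exceptions import InvalidTag

key = ChaCha20Poly1305.generate_key()
aead = ChaCha20Poly1305(key)
nonce = os.urandom(12)        # must be unique per encryption with the same key
ad = b"associated data, stored alongside the cipher text"

ciphertext = aead.encrypt(nonce, b"the plain text of a chunk", ad)

# Decrypting authenticates both the cipher text and the associated data.
try:
    plaintext = aead.decrypt(nonce, ciphertext, ad)
except InvalidTag:
    print("cipher text or associated data has been modified")
```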
For backups, each chunk of user data would be encrypted with AEAD,
and the associated data is the checksum of the plain text data. When a
backup client de-duplicates data, it splits data into chunks, computes
the checksum of each, and searches the backup repository for chunks with
that associated data.
When restoring a backup, the client decrypts the chunks, using the
checksum. This also authenticates the data: if the decrypt operation
fails, the data can’t be used.
All this requires storing the checksum of each chunk somewhere. There also
needs to be ways to keep track of what backups there are, what files
each contains, and what chunks belong to each file. We’ll not worry
about that yet. For now assume it’s all done using magic.
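A small sketch of that scheme, with an in-memory dict standing in for backup storage and all the bookkeeping magic left out: the associated data is the SHA3-256 checksum of the plain text, storage is indexed by it for de-duplication, and restoring decrypts (and thereby authenticates) under the same associated data.

```python
import os
import hashlib
from cryptography.hazmat.primitives.ciphers.aead import ChaCha20Poly1305

def store_chunk(store, aead, chunk):
    """Encrypt and store a chunk; return its associated data (the plain text checksum)."""
    ad = hashlib.sha3_256(chunk).digest()
    if ad not in store:                     # de-duplicate by associated data
        nonce = os.urandom(12)
        store[ad] = (nonce, aead.encrypt(nonce, chunk, ad))
    return ad

def restore_chunk(store, aead, ad):
    """Fetch and decrypt a chunk; raises InvalidTag if it has been tampered with."""
    nonce, ciphertext = store[ad]
    return aead.decrypt(nonce, ciphertext, ad)
```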
Actually, the associated data for a chunk probably should not be the
checksum of the plain text data. That leaks information: an attacker
could determine that a file contains a specific document by looking for
chunks with the same checksum as the document. Instead, the associated
data could be an encrypted version of the checksum, or the result of
some other similar transformation. For now, let’s not worry about
that.
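One possible transformation, sketched here as an assumption rather than a decision: derive the associated data with HMAC under a client-side secret, so that knowing a document’s plain checksum does not let an attacker recognize its chunks in storage.

```python
import hmac
import hashlib

def chunk_associated_data(ad_key: bytes, chunk: bytes) -> bytes:
    """Keyed checksum of the plain text, safe to store as associated data."""
    return hmac.new(ad_key, chunk, hashlib.sha3_256).digest()
```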
4.2 Managing keys
Note that AEAD is a symmetric operation: the key must be kept secret.
To complicate things, the client should support many keys for different
groups of chunks. This is especially important so that different clients
can share chunks in backup storage.
Imagine Alice and Bob both work for the same spy agency. They both
get a lot of the same management reports and documents. They both also
have confidential letters that they can’t share with each other. It would be
ideal if their backup system let them mark which files are confidential
and which can be shared, and then the chunks from those files can be
shared or not shared with the other.
To implement this, the backup client needs to keep track of several
keys. It also needs a way to keep track of which key each chunk is
using. All these keys need to be computer generated and entirely random,
for security. There is no hope of a user ever remembering any of
them.
The keys should be stored in one place, which I tentatively call the
“client chunk”. This would be encrypted with yet another key, the
“client key”. The client key is stored in one or more “client
credential” chunks, each of which is encrypted with a separate key. This
is similar to what the Linux full disk encryption system LUKS uses: the
actual disk encryption key is encrypted with various passphrases, each
encrypted key stored in a separate key slot. Because LUKS has a fixed
amount of space for this, it limits the slots to eight. A backup program
does not need to have that limitation: we can let the user have as many
client credential chunks as they want.
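Here’s a sketch of this key hierarchy, with made-up names and an AEAD used for key wrapping (assumptions, not a finished design): the chunk keys live in the client chunk, which is encrypted with the client key, and each client credential wraps the same client key under its own credential key, much like a LUKS key slot.

```python
import os
import json
from cryptography.hazmat.primitives.ciphers.aead import ChaCha20Poly1305

def wrap(wrapping_key: bytes, secret: bytes, label: bytes) -> dict:
    """Encrypt a secret under a wrapping key, labelling it via the associated data."""
    nonce = os.urandom(12)
    return {"nonce": nonce,
            "ciphertext": ChaCha20Poly1305(wrapping_key).encrypt(nonce, secret, label),
            "label": label}

def unwrap(wrapping_key: bytes, blob: dict) -> bytes:
    return ChaCha20Poly1305(wrapping_key).decrypt(
        blob["nonce"], blob["ciphertext"], blob["label"])

client_key = ChaCha20Poly1305.generate_key()
chunk_keys = {"shared": os.urandom(32).hex(), "confidential": os.urandom(32).hex()}

# The client chunk: all chunk keys, encrypted with the client key.
client_chunk = wrap(client_key, json.dumps(chunk_keys).encode(), b"client-chunk")

# Any number of credential chunks, each wrapping the same client key
# under a different credential key (passphrase-derived, hardware token, ...).
credential_keys = [os.urandom(32), os.urandom(32)]
credentials = [wrap(k, client_key, b"credential-chunk") for k in credential_keys]

# Any single credential recovers the client key, and then the chunk keys.
recovered_client_key = unwrap(credential_keys[0], credentials[0])
assert json.loads(unwrap(recovered_client_key, client_chunk)) == chunk_keys
```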
I’m assuming here that the backup storage allows lookup via the
associated data. The client and credential chunks can then be found by
using associated data “client-chunk” or “credential-chunk”. If there are
many matching chunks, the client needs to be able to determine which one
it needs. (More magic. Waving my hands frantically.)
If the client chunk is updated to add a new key (or to drop one),
the new client chunk is encrypted with the same key and uploaded to the
backup store. All existing client credentials will continue to work. The
old client chunk can then be deleted.
There can be any number of client credentials, each of which encrypts
the client key using a different method:
- a user-provided passphrase
  - or a key derived from it with a key derivation function (a sketch follows below)
- a hardware key
  - TPM
  - Yubikey challenge/response
- an SSH or OpenPGP key
  - could be stored in a Yubikey or other hardware token
- hopefully there’s more
To perform a backup or a restore, the client would need to be able to
use any one of the credentials.
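For the passphrase case, here is a sketch of deriving the credential key with a memory-hard KDF (scrypt from the Python standard library; the parameters are plausible but only illustrative): the salt is stored next to the credential chunk, and the derived key is what wraps the client key.

```python
import os
import hashlib

def credential_key_from_passphrase(passphrase: str, salt: bytes) -> bytes:
    """Derive a 32-byte credential key from a passphrase with scrypt."""
    return hashlib.scrypt(passphrase.encode(), salt=salt,
                          n=2**14, r=8, p=1, dklen=32)

salt = os.urandom(16)
credential_key = credential_key_from_passphrase("a long, memorable passphrase", salt)
```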
An interesting possible evolution of the above scheme might be to
have some of the credentials be split using a secret sharing
setup: for normal use, the TPM credential might be used (but it would
only enable making new backups and restoring backups, not deleting
backups). For more unusual situations, you might need both a passphrase
and a Yubikey credential. An unusual operation might be to delete
backups, or to adjust the set of data chunk keys a client has.
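As a toy illustration of the splitting idea (an n-of-n XOR split; a real design would likely use Shamir secret sharing to get k-of-n): a credential key is cut into two shares, one of which might live on a hardware token and the other be derived from a passphrase, and both are needed to reassemble it.

```python
import os

def split_two_ways(secret: bytes):
    """Split a secret into two shares; both are required to recover it."""
    share_a = os.urandom(len(secret))
    share_b = bytes(x ^ y for x, y in zip(share_a, secret))
    return share_a, share_b

def combine(share_a: bytes, share_b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(share_a, share_b))

secret = os.urandom(32)
a, b = split_two_ways(secret)
assert combine(a, b) == secret
```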
4.3 Summary
Backup:
- get one or more credentials from the user to decrypt the client key
- get and decrypt the client chunk, using the client key
  - fail if this gives an error
- encrypt each new chunk with the right chunk key
- store the cipher text, authentication tag, and associated data in backup storage

Restore:
- get one or more credentials from the user to decrypt the client key
- get and decrypt the client chunk, using the client key
  - fail if this gives an error
- for each chunk that needs to be restored, decrypt it using the right chunk key and associated data, making sure this works
  - fail if this gives an error
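To make the steps concrete, here is a toy end-to-end run that ties the earlier sketches together: one passphrase credential, one chunk key inside the client chunk, and an in-memory dict standing in for backup storage. Manifests, multiple keys, and error handling beyond the AEAD failures are all omitted.

```python
import os
import hashlib
from cryptography.hazmat.primitives.ciphers.aead import ChaCha20Poly1305

def kdf(passphrase: str, salt: bytes) -> bytes:
    return hashlib.scrypt(passphrase.encode(), salt=salt, n=2**14, r=8, p=1, dklen=32)

def seal(key: bytes, plaintext: bytes, ad: bytes):
    nonce = os.urandom(12)
    return nonce, ChaCha20Poly1305(key).encrypt(nonce, plaintext, ad)

def open_(key: bytes, blob, ad: bytes) -> bytes:
    nonce, ciphertext = blob
    return ChaCha20Poly1305(key).decrypt(nonce, ciphertext, ad)  # raises InvalidTag on error

# Setup: a client key, one chunk key in the client chunk, one passphrase credential.
salt = os.urandom(16)
client_key = ChaCha20Poly1305.generate_key()
chunk_key = ChaCha20Poly1305.generate_key()
storage = {
    b"client-chunk": seal(client_key, chunk_key, b"client-chunk"),
    b"credential-chunk": seal(kdf("a passphrase", salt), client_key, b"credential-chunk"),
}

# Backup: unlock the client key, decrypt the client chunk, encrypt and store new chunks.
unlocked = open_(kdf("a passphrase", salt), storage[b"credential-chunk"], b"credential-chunk")
key = open_(unlocked, storage[b"client-chunk"], b"client-chunk")
ads = []
for chunk in [b"hello ", b"world"]:
    ad = hashlib.sha3_256(chunk).digest()
    storage.setdefault(ad, seal(key, chunk, ad))   # de-duplicates by associated data
    ads.append(ad)

# Restore: same unlocking steps, then decrypt each chunk, which also authenticates it.
assert b"".join(open_(key, storage[ad], ad) for ad in ads) == b"hello world"
```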
There are a lot of steps skipped here, but this is the shape of my
current thinking about backup encryption. I am, however, not an expert
on this, so I expect to get feedback telling me how to do this
better.