Obnam architecture

1 Overview

This document sketches the design of the software architecture of Obnam, the backup program.

This is based on thoughts originally published at https://blog.liw.fi/tag/backup-impl/.

In the context of Obnam, a backup is an independent copy of data that can be restored if the primary copy can no longer be used.

Some definitions:

Important aspects of a backup:

2 Table stakes for backups

For Obnam, I make the following assumptions:

3 Very high level architectural assumptions

4 Encrypting backups

I want my backups to be encrypted at rest so that if someone gains access to the backup storage they can’t see my data. I also want my backups to be signed so that I can trust the data I restore is the data I backed up. This is also called confidentiality and authentication of backed up data.

“At rest” means as stored on disk. I also want transfers to and from a backup server to be encrypted, but that’s easy to achieve with TLS or SSH.

4.1 AEAD: authenticated encryption with associated data

Doing encryption and signing separately has turned out to be easy to get wrong. Since about the year 2000 there have been ways to achieve both with one operation, using authenticated encryption, or its variant with associated data (AEAD). This is easier to get right. In short, with authenticated encryption, if you can decrypt some data, you can be sure that the decrypted data is what was encrypted.

For AEAD, the two operations are:

In other words, you keep the associated data with the cipher text, as you’ll need it to decrypt. If the decryption works, you know the associated data is also good (in addition to the encrypted data). You do need to be careful not to trust the associated data until it’s been authenticated.

For backups, each chunk of user data would be encrypted with AEAD, and the associated data is the checksum of the plain text data. When a backup client de-duplicates data, it splits data into chunks, computes the checksum of each, and searches the backup repository for chunks with that associated data.
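The de-duplication step can be sketched roughly as follows. This is only an illustration under assumptions that are not part of the design: fixed-size chunks, SHA-256 as the checksum, and a plain dictionary standing in for the backup repository. The names `split_chunks`, `checksum`, `backup` and `restore` are hypothetical, and encryption is left out entirely.

```python
import hashlib

CHUNK_SIZE = 4  # tiny on purpose for the example; a real client would use far larger chunks


def split_chunks(data: bytes, size: int = CHUNK_SIZE) -> list:
    """Split data into fixed-size chunks (the last one may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def checksum(chunk: bytes) -> str:
    """Checksum of the plain text chunk; SHA-256 is an assumption."""
    return hashlib.sha256(chunk).hexdigest()


def backup(data: bytes, repo: dict) -> list:
    """Store chunks the repository lacks; return the checksums of all chunks in order."""
    labels = []
    for chunk in split_chunks(data):
        label = checksum(chunk)
        if label not in repo:
            # A real client would encrypt the chunk with AEAD before storing it.
            repo[label] = chunk
        labels.append(label)
    return labels


def restore(labels: list, repo: dict) -> bytes:
    """Reassemble the data by looking up each chunk by its checksum."""
    return b"".join(repo[label] for label in labels)


repo = {}
labels = backup(b"abcdabcdefgh", repo)
```

Here the data splits into three chunks, two of which are identical, so the repository ends up holding only two chunks while `labels` still lists three.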

When restoring a backup, the client decrypts the chunks, using the checksum. This also authenticates the data: if the decrypt operation fails, the data can’t be used.

All this requires storing the checksum for each chunk somewhere. There also need to be ways to keep track of what backups there are, what files each contains, and what chunks belong to each file. We’ll not worry about that yet. For now assume it’s all done using magic.

Actually, the associated data for a chunk probably should not be the checksum of the plain text data. That leaks information: an attacker could determine that a file contains a specific document by looking for chunks with the same checksum as the document. Instead, the associated data could be an encrypted version of the checksum, or the result of some other similar transformation. For now, let’s not worry about that.
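One candidate for such a transformation is a keyed hash. The sketch below uses HMAC-SHA256 from the Python standard library. That choice, and the names `chunk_label` and `label_key`, are assumptions made for illustration, not something this design has settled on. The idea is that an attacker without the key can't compute the label of a known document, while a client holding the key still gets the same label for the same data, so de-duplication keeps working.

```python
import hashlib
import hmac
import secrets


def chunk_label(label_key: bytes, chunk: bytes) -> str:
    """Keyed checksum of a plain text chunk, for use as associated data."""
    return hmac.new(label_key, chunk, hashlib.sha256).hexdigest()


label_key = secrets.token_bytes(32)   # known only to the client
other_key = secrets.token_bytes(32)   # stands in for anyone else's key
document = b"a well-known document"

label = chunk_label(label_key, document)
plain = hashlib.sha256(document).hexdigest()  # what an attacker can compute
```

The same key and data always give the same `label`, but `label` differs from the plain SHA-256 checksum and from the label computed under `other_key`.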

4.2 Managing keys

Note that AEAD is a symmetric operation: the same key encrypts and decrypts, so it must be kept secret. To complicate things, the client should support many keys for different groups of chunks. This is especially important so that different clients can share chunks in backup storage.

Imagine Alice and Bob both work for the same spy agency. They both get a lot of the same management reports and documents. They both also have confidential letters that they can’t share with each other. It would be ideal if their backup system let them mark which files are confidential and which can be shared, and then the chunks from those files can be shared or not shared with the other.

To implement this, the backup client needs to keep track of several keys. It also needs a way to keep track of which key each chunk is using. All these keys need to be computer generated and entirely random, for security. There is no hope of a user ever remembering any of them.
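As a rough sketch of that bookkeeping, a client could hold a table of randomly generated keys and a second table recording which key each chunk uses. Everything here is hypothetical: the `KeyRing` class, its method names, the 256-bit key size, and the way Bob obtains the shared key (handed over directly, which a real system would not do).

```python
import secrets
from dataclasses import dataclass, field


@dataclass
class KeyRing:
    """Hypothetical in-memory record of a client's chunk keys."""

    keys: dict = field(default_factory=dict)        # key id -> secret key
    chunk_keys: dict = field(default_factory=dict)  # chunk id -> key id

    def new_key(self) -> str:
        """Generate a random 256-bit key and return its id."""
        key_id = secrets.token_hex(8)
        self.keys[key_id] = secrets.token_bytes(32)
        return key_id

    def add_key(self, key_id: str, key: bytes) -> None:
        """Record a key received from another client."""
        self.keys[key_id] = key

    def assign(self, chunk_id: str, key_id: str) -> None:
        """Remember which key a chunk is encrypted with."""
        if key_id not in self.keys:
            raise KeyError(f"unknown key id: {key_id}")
        self.chunk_keys[chunk_id] = key_id

    def key_for(self, chunk_id: str) -> bytes:
        """Return the key for a chunk; raises KeyError if the chunk is unknown."""
        return self.keys[self.chunk_keys[chunk_id]]


alice = KeyRing()
bob = KeyRing()
shared = alice.new_key()    # for reports both may read
private = alice.new_key()   # for Alice's confidential letters
bob.add_key(shared, alice.keys[shared])

alice.assign("report-chunk", shared)
alice.assign("letter-chunk", private)
bob.assign("report-chunk", shared)
```

Both clients can now find the key for `report-chunk`, but only Alice has the key for `letter-chunk`.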

The keys should be stored in one place, which I tentatively call the “client chunk”. This would be encrypted with yet another key, the “client key”. The client key is stored in one or more “client credential” chunks, each of which is encrypted with a separate key. This is similar to what the Linux full disk encryption system LUKS uses: the actual disk encryption key is encrypted with various passphrases, each encrypted key stored in a separate key slot. Because LUKS has a fixed amount of space for this, it limits the slots to eight. A backup program does not need to have that limitation: we can let the user have as many client credential chunks as they want.

I’m assuming here that the backup storage allows lookup via the associated data. The client and credential chunks can then be found by using associated data “client-chunk” or “credential-chunk”. If there are many matching chunks, the client needs to be able to determine which one it needs. (More magic. Waving my hands frantically.)

If the client chunk is updated to add a new key (or to drop one), the new client chunk is encrypted with the same client key and uploaded to the backup store. All existing client credentials will continue to work. The old client chunk can then be deleted.

There can be any number of client credentials, each of which encrypts the client key using a different method:

To perform a backup or a restore, the client would need to be able to use any one of the credentials.

An interesting possible evolution of the above scheme might be to split some of the credentials using a secret sharing setup: for normal use, the TPM credential might be used (but it would only enable making new backups and restoring backups, not deleting backups). For more unusual situations, you might need both a passphrase and a Yubikey credential. An unusual operation might be to delete backups, or to adjust the set of data chunk keys a client has.
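The simplest form of secret sharing is a 2-of-2 split using XOR: one share is random, the other is the secret XORed with it, and neither share alone reveals anything about the secret. The sketch below shows that idea only. The scheme, the function names `split_secret` and `join_shares`, and the notion that one share would sit behind the passphrase credential and the other behind the Yubikey credential are all assumptions; the design does not say which secret sharing setup it would use.

```python
import secrets


def split_secret(secret: bytes) -> tuple:
    """Split a secret into two shares; both are needed to recover it."""
    share_a = secrets.token_bytes(len(secret))                # purely random
    share_b = bytes(x ^ y for x, y in zip(secret, share_a))   # secret XOR share_a
    return share_a, share_b


def join_shares(share_a: bytes, share_b: bytes) -> bytes:
    """Recover the secret by XORing the two shares together."""
    if len(share_a) != len(share_b):
        raise ValueError("shares must have the same length")
    return bytes(x ^ y for x, y in zip(share_a, share_b))


client_key = secrets.token_bytes(32)
passphrase_share, yubikey_share = split_secret(client_key)
recovered = join_shares(passphrase_share, yubikey_share)
```

A scheme that needs only some of the shares (say, any two of three) would need something more elaborate than XOR.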

4.3 Summary

Backup:

Restore:

There are a lot of steps skipped in this, but this is the shape of my current thinking about backup encryption. I am, however, not an expert on this, so I expect to get feedback telling me how to do this better.