In the context of Obnam, a backup is an independent copy of data that
can be restored if the primary copy can no longer be used.
Some definitions:
- the primary copy of your data is the one you work with or access directly
- a backup is a separate copy, made at some given time, that won’t change or be lost even if the primary one is
Important aspects of a backup:
- once a backup has been made, it won’t change even if the primary copy changes; it is, of course, possible to make a new backup after the changes have happened, and then have two independent copies
- backups are made of files, or other data on persistent storage, onto other persistent storage
- storage media is always ephemeral and corruptible; backups are meant to guard against that
- the primary copy and the backup need to be stored on different media; if they’re on the same media, the backup doesn’t guard against media failures
2 Table stakes for backups
For Obnam, I make the following assumptions:
- the primary copy of the data is on some device, usually as files on a file system, but it can also be data on a raw disk drive or partition
- backups are stored on local storage or on a remote server
- the user of the backup client should not have to trust the server any more than necessary; I am assuming they can trust the server not to intentionally delete the data stored on it
- backups are encrypted and authenticated only on the client; the server stores opaque blobs; the user does not need to trust the server not to read or modify the data it stores: the client uses cryptography to prevent reading and to detect modification
- a server can be used by multiple people, who are not mutually hostile
- in a disaster situation, the backup system should require as few things as possible for a user to restore their data
3 Very high level architectural assumptions
Backed up data is split into chunks of a suitable size. This makes
de-duplication simple: by splitting files into chunks in just the right
way, identical data that occurs in multiple places can be stored only
once in the backup. The simplest example of this is when a file is
renamed but not otherwise modified. A sensible backup system will
notice the rename and store only the new name, not all the data in the
file all over again.
De-duplication can be done at quite a fine granularity or a coarse
one, and there are a number of approaches here. At this high level of
architectural thinking, we don’t need to care how the splitting into
chunks happens. We do need to take care that the size of chunks can vary
and that the backup storage doesn’t need to care about the specifics of
chunk splitting.
There are ways to do “content-sensitive” chunk splitting so that the
same bit of data is recognized as a chunk even if it’s preceded by
other data. This is exciting, but I don’t know of any research into how
much duplicate data this actually finds in real data sets. A flexible
backup system might need to support many ways to split data into chunks,
so that the optimal method is used for each subset of the precious data
being backed up.
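To sketch the idea (this is not any particular tool’s algorithm): a content-defined splitter cuts a chunk wherever a rolling value over the last few dozen bytes hits a boundary condition, so the cut points depend only on nearby content and re-synchronize even when data is shifted. A real implementation would use a proper rolling hash such as Buzhash or Rabin fingerprints; the toy below just sums a sliding window.

```python
WINDOW = 48            # how many trailing bytes the rolling value looks at
MASK = (1 << 12) - 1   # boundary condition: aim for roughly 4 KiB chunks

def split_content_defined(data: bytes, min_size: int = 1024, max_size: int = 1 << 16):
    """Toy content-defined chunking: cut where a rolling window sum hits the mask."""
    chunks = []
    start = 0
    rolling = 0
    for i, byte in enumerate(data):
        rolling += byte
        if i - start >= WINDOW:
            rolling -= data[i - WINDOW]        # keep the sum over the last WINDOW bytes
        length = i - start + 1
        if length >= max_size or (length >= min_size and (rolling & MASK) == 0):
            chunks.append(data[start:i + 1])
            start = i + 1
            rolling = 0
    if start < len(data):
        chunks.append(data[start:])            # whatever is left becomes the final chunk
    return chunks
```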
I note that the finest possible granularity here is the bit, but it
would be ridiculous to go that far. However the backup system is
implemented, each chunk is going to incur some overhead, and if the
chunks are too small, even the slightest overhead is going to be too
much. A backup system needs to strike a suitable balance here.
To achieve de-duplication, the backup system needs a way to detect
that two chunks are identical. A popular way to do this is to use a
cryptographically secure checksum, or hash, such as SHA3. An important
feature of such hashes is that if two chunks have the same hash, they are
almost certainly identical in content (if the hashes are different, the
chunks are absolutely certain to be different). It can be much more
efficient to compute and compare hashes than to retrieve and compare
chunk data. This is probably good enough for most people most of the time.
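A sketch of this, assuming SHA3-256 as the hash: keep an index from digest to already-stored chunk, and only store a chunk whose digest has not been seen before.

```python
import hashlib

def deduplicate(chunks):
    """Return (index, stored): digest -> position in stored, plus the unique chunks."""
    index = {}
    stored = []
    for chunk in chunks:
        digest = hashlib.sha3_256(chunk).digest()
        if digest not in index:         # first time we see this content
            index[digest] = len(stored)
            stored.append(chunk)
    return index, stored
```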
However, for people who do research into hash function collisions,
it’s not good enough. It makes for a sad researcher who spends a
century of CPU time creating a hash collision, makes a backup of
the generated data, and when restoring finds out that the
backup system decided that the two files with the same checksum were in
fact identical.
A backup system could make this configurable, possibly on a
per-directory basis. A hash collision researcher can mark the directory
where they store hash collisions as “compare to de-duplicate”.
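Here’s a sketch of what that option could look like in the client, with a caller-supplied fetch_chunk function standing in for whatever retrieves stored chunks (a placeholder, not a real API): for directories marked this way, a digest match alone is not enough, and the candidate is compared byte for byte.

```python
def find_duplicate(chunk, digest, index, fetch_chunk, verify=False):
    """Return the id of an existing identical chunk, or None if it must be stored."""
    chunk_id = index.get(digest)
    if chunk_id is None:
        return None                       # nothing with this digest yet
    if verify and fetch_chunk(chunk_id) != chunk:
        return None                       # same digest, different bytes: store separately
    return chunk_id
```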
I admit this is a very rare use case, but it preys on my mind. At
this level of software architectural thinking, the crucial point is
whether to make the backup system use content hashes as the only chunk
identifiers, or if chunk identifiers should be independent of the
content.
Note that the checksum algorithm should be strong for security, not
just collision detection. We want to be able to safely and securely back
up arbitrary data, including data provided by an attacker, and not
suffer from attacks from that source.
I really like the SSH protocol and
its SFTP
sub-system for data transfer. I don’t particularly like it for accessing
a backup server. The needs of a backup system are sufficiently different
from the needs of generic remote file transfer and file system access
that I don’t recommend using SFTP for this. For example, it’s tricky to
set up an SFTP server that allows a backup client to make a new backup,
or to restore an existing backup, but does not allow it to delete a
backup. It makes more sense to me to build a custom HTTP API for the
backup server.
It seems important to me that one can authorize one’s various devices
to make new backups automatically, but not allow them to delete old
backups. This mitigates the situation where a device is compromised: a
compromised client can’t destroy data that has already been backed up,
even if it can make new backups with nonsense or corrupt data.
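As a hedged sketch of what such per-device authorization could look like on the server side (the operation names and capability model are made up for illustration, not an actual Obnam API): each access token carries a set of capabilities, and the server refuses operations the token does not cover.

```python
# Which capability each (hypothetical) server operation requires.
REQUIRED_CAPABILITY = {
    "store-chunk": "backup",
    "fetch-chunk": "restore",
    "delete-chunk": "delete",
}

def authorize(operation: str, capabilities: set) -> bool:
    """Allow the operation only if the token's capabilities cover it."""
    needed = REQUIRED_CAPABILITY.get(operation)
    return needed is not None and needed in capabilities

# A laptop's token might carry {"backup", "restore"} but not "delete",
# so even a compromised laptop cannot destroy existing backups.
assert authorize("store-chunk", {"backup", "restore"})
assert not authorize("delete-chunk", {"backup", "restore"})
```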
4 Encrypting backups
I want my backups to be encrypted at rest so that if someone
gains access to the backup storage they can’t see my data. I also want
my backups to be signed so that I can trust the data I restore is the
data I backed up. This is also called confidentiality and authentication
of backed up data.
“At rest” means as stored on disk. I also want transfers to and from
a backup server to be encrypted, but that’s easy to achieve with TLS or
SSH.
4.1 AEAD: authenticated encryption with associated data
Doing encryption and signing separately has turned out to be easy to
get wrong. Since about the year 2000 there have been ways to achieve
both with one operation, using authenticated
encryption, or its variant with associated data (AEAD). This is easier
to get right. In short, with authenticated encryption, if you can
decrypt some data, you can be sure that the decrypted data is what was
encrypted.
Conceptually, the two AEAD operations are:

encrypt(plain text, key, ad) → cipher text, authentication tag
decrypt(cipher text, authentication tag, key, ad) → plain text, or error

- the cipher text, authentication tag, and associated data (“ad”) are stored in backup storage
- at least some AEAD implementations make the cipher text and authentication tag part of the same output string, but that’s an implementation detail; they’re conceptually separate
In other words, you keep the associated data with the cipher text, as
you’ll need it to decrypt. If the decryption works, you know the
associated data is also good (in addition to the encrypted data). You do
need to be careful not to trust the associated data until it’s been
authenticated.
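Here is a minimal sketch of these operations using ChaCha20-Poly1305 from the Python cryptography package (one AEAD among several; an illustration, not a statement of what Obnam uses). This particular library returns the authentication tag appended to the cipher text, which is the implementation detail mentioned above.

```python
import os
from cryptography.hazmat.primitives.ciphers.aead import ChaCha20Poly1305
from cryptography.exceptions import InvalidTag

key = ChaCha20Poly1305.generate_key()
aead = ChaCha20Poly1305(key)
nonce = os.urandom(12)        # must be unique per encryption with the same key
ad = b"associated data, stored alongside the cipher text"

ciphertext = aead.encrypt(nonce, b"the plain text of a chunk", ad)

# Decrypting authenticates both the cipher text and the associated data.
try:
    plaintext = aead.decrypt(nonce, ciphertext, ad)
except InvalidTag:
    print("cipher text or associated data has been modified")
```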
For backups, each chunk of user data would be encrypted with AEAD,
and the associated data is the checksum of the plain text data. When a
backup client de-duplicates data, it splits data into chunks, computes
the checksum of each, and searches the backup repository for chunks with
that associated data.
When restoring a backup, the client decrypts the chunks, using the
checksum. This also authenticates the data: if the decrypt operation
fails, the data can’t be used.
All this requires storing the checksum of each chunk somewhere. There also
needs to be ways to keep track of what backups there are, what files
each contains, and what chunks belong to each file. We’ll not worry
about that yet. For now assume it’s all done using magic.
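A small sketch of that scheme, with an in-memory dict standing in for backup storage and all the bookkeeping magic left out: the associated data is the SHA3-256 checksum of the plain text, storage is indexed by it for de-duplication, and restoring decrypts (and thereby authenticates) under the same associated data.

```python
import os
import hashlib
from cryptography.hazmat.primitives.ciphers.aead import ChaCha20Poly1305

def store_chunk(store, aead, chunk):
    """Encrypt and store a chunk; return its associated data (the plain text checksum)."""
    ad = hashlib.sha3_256(chunk).digest()
    if ad not in store:                     # de-duplicate by associated data
        nonce = os.urandom(12)
        store[ad] = (nonce, aead.encrypt(nonce, chunk, ad))
    return ad

def restore_chunk(store, aead, ad):
    """Fetch and decrypt a chunk; raises InvalidTag if it has been tampered with."""
    nonce, ciphertext = store[ad]
    return aead.decrypt(nonce, ciphertext, ad)
```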
Actually, the associated data for a chunk probably should not be the
checksum of the plain text data. That leaks information: an attacker
could determine that a file contains a specific document by looking for
chunks with the same checksum as the document. Instead, the associated
data could be an encrypted version of the checksum, or the result of
some other similar transformation. For now, let’s not worry about
that.
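One possible transformation, sketched here as an assumption rather than a decision: derive the associated data with HMAC under a client-side secret, so that knowing a document’s plain checksum does not let an attacker recognize its chunks in storage.

```python
import hmac
import hashlib

def chunk_associated_data(ad_key: bytes, chunk: bytes) -> bytes:
    """Keyed checksum of the plain text, safe to store as associated data."""
    return hmac.new(ad_key, chunk, hashlib.sha3_256).digest()
```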
4.2 Managing keys
Note that AEAD is a symmetric operation: the key must be kept secret.
To complicate things, the client should support many keys for different
groups of chunks. This is especially important so that different clients
can share chunks in backup storage.
Imagine Alice and Bob both work for the same spy agency. They both
get a lot of the same management reports and documents. They both also
have confidential letters that they can’t share with each other. It would be
ideal if their backup system let them mark which files are confidential
and which can be shared, and then the chunks from those files can be
shared or not shared with the other.
To implement this, the backup client needs to keep track of several
keys. It also needs a way to keep track of which key each chunk is
using. All these keys need to be computer generated and entirely random,
for security. There is no hope of a user ever remembering any of
them.
The keys should be stored in one place, which I tentatively call the
“client chunk”. This would be encrypted with yet another key, the
“client key”. The client key is stored in one or more “client
credential” chunks, each of which is encrypted with a separate key. This
is similar to what the Linux full disk encryption system LUKS uses: the
actual disk encryption key is encrypted with various passphrases, each
encrypted key stored in a separate key slot. Because LUKS has a fixed
amount of space for this, it limits the slots to eight. A backup program
does not need to have that limitation: we can let the user have as many
client credential chunks as they want.
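Here’s a sketch of this key hierarchy, with made-up names and an AEAD used for key wrapping (assumptions, not a finished design): the chunk keys live in the client chunk, which is encrypted with the client key, and each client credential wraps the same client key under its own credential key, much like a LUKS key slot.

```python
import os
import json
from cryptography.hazmat.primitives.ciphers.aead import ChaCha20Poly1305

def wrap(wrapping_key: bytes, secret: bytes, label: bytes) -> dict:
    """Encrypt a secret under a wrapping key, labelling it via the associated data."""
    nonce = os.urandom(12)
    return {"nonce": nonce,
            "ciphertext": ChaCha20Poly1305(wrapping_key).encrypt(nonce, secret, label),
            "label": label}

def unwrap(wrapping_key: bytes, blob: dict) -> bytes:
    return ChaCha20Poly1305(wrapping_key).decrypt(
        blob["nonce"], blob["ciphertext"], blob["label"])

client_key = ChaCha20Poly1305.generate_key()
chunk_keys = {"shared": os.urandom(32).hex(), "confidential": os.urandom(32).hex()}

# The client chunk: all chunk keys, encrypted with the client key.
client_chunk = wrap(client_key, json.dumps(chunk_keys).encode(), b"client-chunk")

# Any number of credential chunks, each wrapping the same client key
# under a different credential key (passphrase-derived, hardware token, ...).
credential_keys = [os.urandom(32), os.urandom(32)]
credentials = [wrap(k, client_key, b"credential-chunk") for k in credential_keys]

# Any single credential recovers the client key, and then the chunk keys.
recovered_client_key = unwrap(credential_keys[0], credentials[0])
assert json.loads(unwrap(recovered_client_key, client_chunk)) == chunk_keys
```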
I’m assuming here that the backup storage allows lookup via the
associated data. The client and credential chunks can then be found by
using associated data “client-chunk” or “credential-chunk”. If there are
many matching chunks, the client needs to be able to determine which one
it needs. (More magic. Waving my hands frantically.)
If the client chunk is updated to add a new key (or to drop one),
the new client chunk is encrypted with the same key and uploaded to the
backup store. All existing client credentials will continue to work. The
old client chunk can then be deleted.
There can be any number of client credentials, each of which encrypts
the client key using a different method:
- a user-provided passphrase
  - or a key derived from it with a key derivation function (a sketch follows below)
- a hardware key
  - TPM
  - Yubikey challenge/response
- an SSH or OpenPGP key
  - could be stored in a Yubikey or other hardware token
- hopefully there’s more
To perform a backup or a restore, the client would need to be able to
use any one of the credentials.
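For the passphrase case, here is a sketch of deriving the credential key with a memory-hard KDF (scrypt from the Python standard library; the parameters are plausible but only illustrative): the salt is stored next to the credential chunk, and the derived key is what wraps the client key.

```python
import os
import hashlib

def credential_key_from_passphrase(passphrase: str, salt: bytes) -> bytes:
    """Derive a 32-byte credential key from a passphrase with scrypt."""
    return hashlib.scrypt(passphrase.encode(), salt=salt,
                          n=2**14, r=8, p=1, dklen=32)

salt = os.urandom(16)
credential_key = credential_key_from_passphrase("a long, memorable passphrase", salt)
```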
An interesting possible evolution of the above scheme might be to
have some of the credentials be split using a secret sharing
setup: for normal use, the TPM credential might be used (but it would
only enable making new backups and restoring backups, not deleting
backups). For more unusual situations, you might need both a passphrase
and a Yubikey credential. An unusual operation might be to delete
backups, or to adjust the set of data chunk keys a client has.
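As a toy illustration of the splitting idea (an n-of-n XOR split; a real design would likely use Shamir secret sharing to get k-of-n): a credential key is cut into two shares, one of which might live on a hardware token and the other be derived from a passphrase, and both are needed to reassemble it.

```python
import os

def split_two_ways(secret: bytes):
    """Split a secret into two shares; both are required to recover it."""
    share_a = os.urandom(len(secret))
    share_b = bytes(x ^ y for x, y in zip(share_a, secret))
    return share_a, share_b

def combine(share_a: bytes, share_b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(share_a, share_b))

secret = os.urandom(32)
a, b = split_two_ways(secret)
assert combine(a, b) == secret
```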
4.3 Summary
Backup:
- get one or more credentials from the user to decrypt the client key
- get and decrypt the client chunk, using the client key
  - fail if this gives an error
- encrypt each new chunk with the right chunk key
- store the cipher text, authentication tag, and associated data in backup storage

Restore:
- get one or more credentials from the user to decrypt the client key
- get and decrypt the client chunk, using the client key
  - fail if this gives an error
- for each chunk that needs to be restored, decrypt it using the right chunk key and associated data, making sure this works
  - fail if this gives an error
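To make the steps concrete, here is a toy end-to-end run that ties the earlier sketches together: one passphrase credential, one chunk key inside the client chunk, and an in-memory dict standing in for backup storage. Manifests, multiple keys, and error handling beyond the AEAD failures are all omitted.

```python
import os
import hashlib
from cryptography.hazmat.primitives.ciphers.aead import ChaCha20Poly1305

def kdf(passphrase: str, salt: bytes) -> bytes:
    return hashlib.scrypt(passphrase.encode(), salt=salt, n=2**14, r=8, p=1, dklen=32)

def seal(key: bytes, plaintext: bytes, ad: bytes):
    nonce = os.urandom(12)
    return nonce, ChaCha20Poly1305(key).encrypt(nonce, plaintext, ad)

def open_(key: bytes, blob, ad: bytes) -> bytes:
    nonce, ciphertext = blob
    return ChaCha20Poly1305(key).decrypt(nonce, ciphertext, ad)  # raises InvalidTag on error

# Setup: a client key, one chunk key in the client chunk, one passphrase credential.
salt = os.urandom(16)
client_key = ChaCha20Poly1305.generate_key()
chunk_key = ChaCha20Poly1305.generate_key()
storage = {
    b"client-chunk": seal(client_key, chunk_key, b"client-chunk"),
    b"credential-chunk": seal(kdf("a passphrase", salt), client_key, b"credential-chunk"),
}

# Backup: unlock the client key, decrypt the client chunk, encrypt and store new chunks.
unlocked = open_(kdf("a passphrase", salt), storage[b"credential-chunk"], b"credential-chunk")
key = open_(unlocked, storage[b"client-chunk"], b"client-chunk")
ads = []
for chunk in [b"hello ", b"world"]:
    ad = hashlib.sha3_256(chunk).digest()
    storage.setdefault(ad, seal(key, chunk, ad))   # de-duplicates by associated data
    ads.append(ad)

# Restore: same unlocking steps, then decrypt each chunk, which also authenticates it.
assert b"".join(open_(key, storage[ad], ad) for ad in ads) == b"hello world"
```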
There are a lot of steps skipped here, but this is the shape of my
current thinking about backup encryption. I am, however, not an expert
on this, so I expect to get feedback telling me how to do this
better.