Optimizing a Birthday Attack

13 Aug 2014

Background

A birthday attack is a generic attack on hash functions (and some other cryptographic primitives) that trades time for space (memory).

Let's setup a secenario - you are asking a CA to sign a certificate matching a 128-bit, but otherwise secure hash. The CA will happily sign a certificate for your domain, but not for somebody else's.

A naive attack would be to create a certificate for your domain, then try randomly tweaking your malicious alternative until you find one with the same hash. This is brute-force, and would take on-average 2127 attempts - otherwise described as O(n) time.

A different approach would be store store 264 hashes on disk matching the legit certificate, then try generating malicious alternatives until you find a match. This would require 264 hashes with one prefix (to be stored), then take on-average 264 hashes and comparisons with the other prefix to match. You would need to do this before submitting to the CA -- this isn't a preimage attack.

If your legit hashes were stored in a flat file or array, then you'd have to make 264 x 264 = 2128 comparisons after building the table - this would not complete in your lifetime (even taking Moore's law into account).

On the other hand, if you were to create an O(1) index into that file you would then only have to make 264 comparisons - this approach is done in approximately O(sqrt(n)) time.

Another point to mention is that you don't need to store 264 hashes - if you're willing to double your compute load you can halve your storage requirement. This is an important consideration, as for example a video card can compute hashes at well over 4x the speed as your PC, so if your PC has 16GB of RAM and your video card has 4GB of RAM the video card might still be the faster choice.

Daniel Bilar created a handout which he kindly shared if you want to understand why this works.

The target

Back to reality now - you're probably not going to be breaking 128-bit hashes on your home PC -- although if those hashes are MD4 or MD5 there are certainly quicker alternatives to birthday attacks.

Yet I came across a problem that does lend itself to birthday attacks - librsync - and consequently projects such as duplicity, rdiff-backup, and seemingly Dropbox. By default, librsync truncates hashes to 8 bytes (64 bits). 232 space is over 4 billion times smaller, and almost certainly within reach of mortals.

I should mention that the hash used is MD4, but I decided against performing the quicker attack in favour of a more powerful attack. It's not often you get to perform a birthday attack against a real live system!

If we were to craft two different blocks that collide with both the rsync checksum (similar to Addler32) and half-MD4, and we arranged for them to be placed in different parts of the same file (presumably a file that also contains data that isn't ours), then the first block will be stamped over every instance of the second.

This could actually be quite a critical flaw if the file were a database or filesystem image - you could possibly inject malicous data via a web server, and have it overwrite a different part of the filesystem or database - perhaps to exploit the kernel or database engine, or simply overwrite a filesystem structure to change permissions.

Start off simple

What we need to store is a mapping of hashes to the plaintext that created them. Even for a 64-bit hash, this could be a lot of disk space, but since we're presumably using a seed to generate that plaintext, we could just store that instead.

In this case, I have two prefixes that hash to the same rsync checksum. I then made an encoding scheme that would encode a 40-bit integer to 20 bytes. The code looks like:

func rollsum_expand(n uint64) []byte {
    ret := make([]byte, 20)
    var i uint

    for i = 0; i < 5; i++ {
        b := byte((n >> (i * 8)) & 0xff)
        c := 255 - b
        ret[(i*4)+0] = b
        ret[(i*4)+1] = c
        ret[(i*4)+2] = c
        ret[(i*4)+3] = b
    }
    return ret
}

This allows me to convert some seeds into rsync blocks, which could then be hashed. Note that I need to support more than 232 seeds, as the time portion of the attack takes on average 232 comparisons, which might mean more or less.

My initial design involved simply creating a Go map (Same as a Python dict - a hashtable with built in support from the language). The key is the truncated md4, and the value is the seed. I simply stored the required amount of hashes with my first prefix in the map, then compared hashes from my second prefix against them.

Taking full advantage of the Go programming language, I created multiple goroutines to simultaneously generate hashes, piping them down a channel to a goroutine that built up the hash table. I then re-used the same goroutine function to send candidate hashes to the storage process, which printed the result and quit when it found a match.

Since GOMAXPROCS was set to the number of cores in the system, this kept my system busy, and it could attack scaled-down versions 48-bit versions in around 20 seconds.

Unfortunately I ran into a problem - Go maps only allow 231-1 entries, and I needed double that amount!

memcached style

This is where the first major design improvement came in - I decided to launch multiple store/compare processes, each storing a fair share of the work. I used a technique made popular by memcached - I use a hash of the input (in this case it's already a hash - so I used the first byte of it) to decide which storage process to send the hash to.

When the program switches to candidate generation mode, any collisions generated would also go to the correct process. This has an additional advantage of freeing the store nodes to run on multiple cores simultaniously, widening the pipeline.

The even bigger advantage of this approach - it can scale to multiple systems! Distributing the memory load of the attack is a critical part of scaling it up, should it be needed, and it was looking like this was the case - when trying to store 232 items on a VM with 100GB of RAM, the program just ran out of memory straight away. I decided to go for another approach - maybe I'd optimise the memory usage first.

Ditch the keys!

Well we already threw away the plaintext in favour of the seed -- but since the seed can generate the plaintext and the plaintext can generate the original hash, storing the keys is redundant.

Now you might object, saying that an md4 hash is far slower than a memory comparison, even if the big-O notation stays the same. But this argument ignores the simple fact that we're trading time for space - so if we're wasting 8 times the space storing the key, then we're computing 8 times as many md4 hashes to make up for the keys that don't fit in memory.

So I started with a classic hashtable then list design - in Go that would look like [][]uint64. The first dimension is the number of buckets, which we want to be close to the number of hashes we're storing, and the second dimension is an array of seeds. We can just use the FNV32a hash that ships with the Go standard library of the original hash to decide which bucket to use.

In the comparison phase (once we've stored our hashes), we've added another pipeline stage. We now have the generation goroutine creating candidates from sequencial seeds, then the storage goroutines find the correct bucket for that hash and send it to a validation goroutine, which recomputes the md4 of all the seeds in the bucket and compares them to the candidate.

This approach worked, but unfortunately the memory overhead of Go slices wasn't that much better than the map structure - I still couldn't get a full 2^32 in memory on my 100GB VM.

Ditch the collisions!!

What if we simply made the hashtable a slice of []uint64. While there would be some hash collisions, the percentage was in the 2-3% range in my tests. But now the seeds are stored in an actual flat block of RAM, without the overhead of the slice datastructure.

It may be tempting to leave some extra space in your hashtable to reduce the number of lost collisions. This is thinking about the problem the wrong way - if our limit is the amount of memory we have then empty buckets cause us to do extra computes anyway.

This approach worked, and was fast! It now took only 80G of RAM (Achievement Unlocked - top memory usage displayed in 't'). That's huge by 2014 standards, but within the realms of mortals. Still this isn't really home PC or video card territory, so let's work on it a bit.

Shrink the seeds

I was using uint64 for the seeds, since I envisioned going higher than 64-bit hashes -- definitely in the time phase. But for the storage, I wasn't using seeds out of the 32-bit range. So converting into a 32-bit integer for storage, then back into a 64-bit integer when pulled out halved my storage.

Now it fit in around 36G of RAM. That's much better. At a later point I decided to simply encode the stored seeds into a configurable number of bytes and used a flat byte slice to store the hashtable - with 4 bytes per seed it took the same amount of RAM.

The perfect hash? Or is it just Cuckoo?

This is the part where I actually hit a dead-end. A fun and enlightening dead-end, so I'll share it with you anyway.

Inspired by the CMPH library, I set about creating an almost-perfect hash function. The CHD function listed on the site claims O(n) construction, O(1) lookups, and near the information-theoretical limit of compression.

I decided to try to implement it, but rather than storing keys/values or offsets into files, just put the seeds in there as before.

To start, I looked up Hash and Displace - it turns out the best reference to it is on Wikipedia under Cuckoo hashing. The idea is that you have two or more hash functions that you apply to the same key.

If the location for the item you're inserting is already taken, you replace it with the item you're inserting, then hash the displaced item with another hash function. This is done recursively until you finally settle on a free spot.

My implementation wasn't too different to the original - I use a single table and each entry was this struct:

type DigestSeedHash struct {
    Digest []byte // MD4 digest
    Seed   uint64 // Seed used to create MD4
    H      int    // The index of the hash function used to store the item
}

After some experimentation, I settled on siphash, generating 3 hashes from the single 64-bit value:

h := siphash.Hash(self.K0, self.K1, data)
ret[0] = (h >> 32) % self.M
ret[1] = h % self.M
ret[2] = ((h >> 32) + h) % self.M

Also, since we're now storing a relatively large structure temporarily, I decided to write the hash and seeds for each storage process into a temporary file, then allowing only a couple of storage processes at a time to build a hash table. Once the table is built, it is compacted into a flat []uint32 slice of seeds.

Surprisingly enough, at 80% load (eg. 20% empty buckets) this almost always generates a perfect hash function, although I'm not bothered if it discards 1 or 2 due to looping - almost perfect is good enough for me. Rather than bothering to detect looping, my recursive function just implemented a TTL, and bailed out if it was taking too long.

This approach worked so far, but it hasn't really saved us any RAM (our table is 20% empty) and now we have to compute 3 hashes per candidate instead of 1. But I've only implemented Hash and Displace so far, not Compress, Hash and Displace.

Compression

Another interesting concept that the CHD function brought to the table was using a compression algorithm on the hash table that retained random access.

The compression algorithm they used is available here, but to sum it up, you first start with a frequency count of each character in your bytestream. You then encode the most popular two characters as a single bit (0 and 1), then the next four most popular using 2 bits (00, 01, 10, 11), etc. Now you can pack them into a flat bit vector. In order to access the bits, you create another bit vector of the same size, setting only the bits where a code starts.

Now you can find the start and end of the code with a select(n) query, which simply counts from the start of the array and tells you the position of the nth set bit. This amounts to nearly random access, so long as you're willing to store some summarised counts aswell.

This compression algorithm works ok for highly repetitive data, but the select vector consumes a lot of space, so if you want to halve the size of it you can encode the codes for the most popular four bytes as 2 bits, the next most popular sixteen as 4 bits, etc. Since you know every second bit is a 0, you can simply drop those bits.

My empreical testing found that compression didn't work well. The seeds may have started out sequential, but they have been spread out in a random distribution now. All bytes are equally likely, except zeros in the empty spots.

This may work better for cases where only part of the byte is used (eg. if generating 225 seeds, rather than 232), but the extra time and complexity from disk storage, compression, and accessing the select vector made the couple of hundred megabytes it would save seem insignificant.

In the end, this approach turned out to be much slower than the simpler approach I already had. Perhaps I'll revive this technique for some other project.

Pollard's rho method (update)

Shortly after publishing the first version of this post, Samuel Neves pointed out this paper, although being impatient I found the TL;DR version in these slides.

The basic concept is that instead of storing each hash from one of your prefixes then comparing hashes from the other, you produce chains of hashes - starting from a seed then chaining until you reach a distinguishing point.

A distinguising point is simply a way to identify the message as being 1-in-n, so in my case I chose the first two bytes being 0, meaning that the chains would on-average be 216 hashes long.

To chain the messages, you start with M and M' - the benign prefix and it's malicious replacement, and decide which prefix you'll use based on some characteristic of the input hash (such as a bit). You then hash the prefix with an encoded version of the hash appended - making sure to truncate the hash first if you're working with truncated hashes.

I wrote this as a 3-stage pipeline:

  1. Generators

    These goroutines start with seeds (as previously), and generate the hash chains until they reach a distinguising point, then the seed and final hash are sent to the appropriate storage process (as before). It then repeats with the next seed.

  2. Storage Processes

    The storage processes now accept the hash and seed pairs from the generators, and check if they have it in their hashtable (back to using Go map[[HASH_TRUNC]byte]uint64). If it is in the table, then both seeds are sent to the verifier process, otherwise it is stored in the table, along with it's seed.

  3. Verifier process

    The verifier process accepts a pair of seeds. Starting with the first seed, the chain is regenerated up to the distinguishing point put into a map of hash -> prev.

    Then using the second seed it generates the other chain, checking at each step whether the latest hash is in the map. If it is, then both previous hashes are checked.

    If both of the hashes being checked would've used the same prefix, we have a worthless collision and we move on. If they used different prefixes, then we have the result we were looking for.

This process took far less RAM - for a 64-bit hash it used around 24MB of RAM and completed in 45 minutes on my home PC. This was both the fastest and most memory efficient method! Also since memory accesses have been reduced (due to the chaining), this would benefit more from GPU and other types of hardware acceleration than the other methods I tried.

Conclusion

There's a good reason why crypographic primitives aim for 128-bit security - at the 128-bit level physics kicks in, and it can be shown this solar system won't produce enough energy to perform such a calculation. With hashes, the 128-bit security level is a 256-bit (32 byte) output.

As it stands today, MD4 and MD5 are hopelessly broken, at-least when it comes to collision resistance and SHA-1 isn't looking too good either - although surprisingly nobody's put in the resources to publicly break it yet. See the hashclash project for more information.

So pick a strong hash such as SHA-256 (compliant) or Blake2 (fast), and don't truncate the output unless you're sure that collisions aren't a problem. Keep in mind that with enough hashes, a collision can occur by accident too.