It is now well-known that the crytographic hash function MD5 has been
broken. In March 2005, Xiaoyun Wang and Hongbo Yu of Shandong
University in China published an article
in which they describe an algorithm that can find two different
sequences of 128 bytes with the same MD5 hash. One famous such pair is
the following:
d131dd02c5e6eec4693d9a0698aff95c2fcab58712467eab4004583eb8fb7f89 55ad340609f4b30283e488832571415a085125e8f7cdc99fd91dbdf280373c5b d8823e3156348f5bae6dacd436c919c6dd53e2b487da03fd02396306d248cda0 e99f33420f577ee8ce54b67080a80d1ec69821bcb6a8839396f9652b6ff72a70and d131dd02c5e6eec4693d9a0698aff95c2fcab50712467eab4004583eb8fb7f89 55ad340609f4b30283e4888325f1415a085125e8f7cdc99fd91dbd7280373c5b d8823e3156348f5bae6dacd436c919c6dd53e23487da03fd02396306d248cda0 e99f33420f577ee8ce54b67080280d1ec69821bcb6a8839396f965ab6ff72a70Each of these blocks has MD5 hash 79054025255fb1a26e4bc422aef54eb4. Ben Laurie has a nice website that visualizes this MD5 collision. For a non-technical, though slightly outdated, introduction to hash functions, see Steve Friedle's Illustrated Guide. |
As we will explain below, the algorithm of Wang and Yu can be used to
create files of arbitrary length that have identical MD5 hashes, and
that differ only in 128 bytes somewhere in the middle of the file.
Several people have used this technique to create pairs of interesting
files with identical MD5 hashes:
|
The following is an improvement of Diaz's example, which does not need
a special extractor. Here are two pairs of executable programs (one
pair runs on Windows, one pair on Linux).
|
The above files were generated by exploiting two facts: the block
structure of the MD5 function, and the fact that Wang and Yu's
technique works for an arbitrary initialization vector. To understand
what this means, it is useful to have a general idea of how the MD5
function processes its input. This is done by an iteration method
known as the Merkle-Damgard method. A given input file is first padded
so that its length will be a multiple of 64 bytes. It is then divided
into individual 64-byte blocks M0,
M1, ..., Mn-1. The MD5 hash
is computed by computing a sequence of 16-byte
states s0, ...,
sn, according to the rule:
The method of Wang and Yu makes it possible, for a given initialization vector s, to find two pairs of blocks M,M' and N,N', such that f(f(s, M), M') = f(f(s, N), N'). It is important that this works for any initialization vector s, and not just for the standard initialization vector s0. Combining these observations, it is possible to find pairs of files of arbitrary length, which are identical except for 128 bytes somewhere in the middle of the file, and which have identical MD5 hash. Indeed, let us write the two files as sequences of 64-byte blocks:
The blocks at the beginning of the files, M0, ..., Mi-1, can be chosen arbitrarily. Suppose that the internal state of the MD5 hash function after processing these blocks is si. Now we can apply Wang and Yu's method to the initialization vector si, to find two pairs of blocks Mi, Mi+1 and Ni, Ni+1, such that si+2 = f(f(si, Mi), Mi+1) = f(f(si, Ni), Ni+1). This guarantees that the internal state si+2 after the i+2st block will be the same for the two files. Finally, the remaining blocks Mi+2, ..., Mn can again be chosen arbitrarily. So how can we use this technique to produce a pair of programs (or postscript files) that have identical MD5 hash, yet behave in arbitrary different ways? This is simple. All we have to do is write the two programs like this:
Program 1: if (data1 == data1) then { good_program } else { evil_program } Program 2: if (data2 == data1) then { good_program } else { evil_program } and arrange things so that "data1" = Mi, Mi+1 and "data2" = Ni, Ni+1 in the above scheme. This can even be done in a compiled program, by first compiling it with dummy values for data1 and data2, and later replacing them with the properly computed values. |
Here, you can download the software that I used to create
MD5-colliding executable files.
Quick usage instructions:Note for Windows users: the below instructions are for Unix/Linux. On Windows, you may have to append ".exe" to the names of executable files. Also, to use "make", you must have the GNU tools installed and working.
|