In this first set we cover some of the fundamental building blocks we will use for many of the coming challenges. We also warm up our cryptanalysis skills by breaking a couple classical cipher systems.

Challenge 1: Convert Hex to Base64

To start we are tasked with decoding some hex data and encoding the resulting bytes in base64. Not much cryptography happening here, but some useful utilities since we’ll be given raw byte data (keys, ciphertexts, etc) in encoded form later on. Python has everything we need to encode and decode various representations of bytes so all we have to do is chain together a couple calls from the standard library.

from base64 import b64encode
from binascii import unhexlify


def hex_to_b64(hex_str: bytes) -> bytes:
    raw = unhexlify(hex_str)
    return b64encode(raw)


def challenge01():
    hex_data = b'49276d206b696c6c696e6720796f757220627261696e206c' \
               b'696b65206120706f69736f6e6f7573206d757368726f6f6d'
    b64_data = hex_to_b64(hex_data)
    print(b64_data)

Challenge 2: Fixed XOR

In this second challenge we are tasked with computing the XOR of two byte sequences. First we need to decode the hex encoded bytes, which we learned how to do in the first challenge. Then we can operate on raw bytes and implement the XOR. The XOR operator is implemented via ^ in python. We can utilize the fact that the elements returned when iterating over a byte sequence in python can be interpreted as the integer representation of that byte (e.g. b'\xff' => 255). ^ operates on integers so we grab elements pairwise from the two byte sequences we wish to XOR, apply ^ to the pairs, and then recombine the results into a byte sequence. zip (getting elements from two sequences pairwise) and list comprehensions (combining our results back into a sequence) make this easy to concisely implement.

from binascii import hexlify, unhexlify


def xor(buf1: bytes, buf2: bytes) -> bytes:
    return bytes([b1 ^ b2 for (b1, b2) in zip(buf1, buf2)])


def challenge02():
    left = unhexlify(b'1c0111001f010100061a024b53535009181c')
    right = unhexlify(b'686974207468652062756c6c277320657965')
    xored = xor(left, right)
    print(hexlify(xored))

Challenge 3: Single-byte XOR Cipher

Finally some code breaking! We are tasked with deciphering some data that has been encrypted under a single byte key. The encryption is done via the XOR operation from the last challenge, granted this time the key is a single byte so we iteratively XOR the key byte against the enrypted data to get a candidate plaintext.

The solution is broken into two parts - scoring and book keeping. Scoring is the more interesting of the two (for a cryptographer at least). Considering that the range of possible characters in a candidate plaintext can be any character represented by encoding 0x00 through 0xff we’ll find that most of the candidates end up being gibberish that encode to characters that are not letters. So a good scoring method would be “score candidate plaintexts with lots of letters higher than those with few letters”. We can even do a bit better. We can score those plaintexts that have common english letters higher than those that don’t. Our reasoning being that common english letters are on average more likely to appear, and thus correlate more closely to a plaintext that is valid english.

The book keeping is just a brute force attack. Since the key is a single byte we only have to check 2^8 = 256 possible keys. Childs play for a computer. For each key we score the resulting candidate plaintext and, if the score is a new high score, we mark that key as the best guess for the true key.

Technically we already know the plaintext when we find the key and can just return that along with the key, but for future problems an interface that returns only the key is handier. Plus the extra decryption operation doesn’t change our runtime complexity, the core of the computational complexity here is in the brute force attack which is O(2^n), where n is the size of the key in bits.

from binascii import unhexlify


def score(chars: bytes) -> int:
    points = 0
    most_freq_letters = b'etaoinhs'

    for c in chars:
        if c in most_freq_letters:
            points += 1

    return points


def find_single_byte_key(ctxt: bytes) -> int:
    best_key = None
    best_score = -1

    for key in range(0xff + 1):
        ptxt = bytes([(c ^ key) for c in ctxt])
        key_score = score(ptxt)

        if key_score > best_score:
            best_key = key
            best_score = key_score

    return best_key


def single_byte_cipher(ctxt: bytes) -> bytes:
    key = find_single_byte_key(ctxt)
    return bytes(map(lambda c: key ^ c, ctxt))


def challenge03():
    ctxt = unhexlify(
        b'1b37373331363f78151b7f2b783431333d'
        b'78397828372d363c78373e783a393b3736'
    )
    ptxt = single_byte_cipher(ctxt)
    print(ptxt)

Challenge 4: Detect Single-character XOR

For this challenge we are tasked with building on top of challenge 3 and identifying which ciphertext in a collection of ciphertexts decrypts to a valid (english) plaintext. The keys are again a single byte and the encryption is again done using XOR. We reuse the logic from challenge 3 to find the best (most common english letters) plaintext for each ciphertext. We then add another layer on top of that and score each of the best plaintexts, returning the one with the highest score.

Our runtime complexity is slight higher than the last problem now, O(2^n * m) where n is the number of bits in the key and m is the number of ciphertexts. It’s also worth noting that reading the entire text file and splitting it into a list of ciphertexts to iterate over probably isn’t the most efficient. If we had a huge list then reading line by line until the end of the buffer would be better. In this case the list is small so this implementation works well enough and saves us having to strip newlines and identifying where the EOF is.

from binascii import unhexlify

from challenge03 import score, single_byte_cipher


def challenge04():
    with open('4.txt', 'rb') as f:
        high_score = 0
        ptxt = None

        for ctxt in f.read().split(b'\n'):
            raw = unhexlify(ctxt)
            txt = single_byte_cipher(raw)
            points = score(txt)

            if points > high_score:
                high_score = points
                ptxt = txt

    print(ptxt)

Challenge 5: Implement Repeating-key XOR

Keys longer than one byte! We’re starting to slowly but surely move toward challenges where we can’t brute force the key space quickly. Here we are tasked with implementing a repeating key XOR, meaning we repeat the key to be the length of the plaintext and then xor the two together.

Burning `em
|||||||||||
ICEICEICEIC

We make this a little more efficient by noting that the corresponding key byte for a plaintext byte at index i is the key byte at index i % len(key). This is because the key repeats every len(key) bytes.

from binascii import hexlify


def repeating_key_xor(intxt: bytes, key: bytes) -> bytes:
    outtxt = []

    for i, c in enumerate(intxt):
        key_index = i % len(key)
        outtxt.append(c ^ key[key_index])

    return bytes(outtxt)


def challenge05() :
    plaintext = b'Burning \'em, if you ain\'t quick and nimble ' \
                b'I go crazy when I hear a cymbal'
    key = b'ICE'
    ctxt = repeating_key_xor(plaintext, key)
    print(hexlify(ctxt))

Challenge 6: Break Repeating-key XOR

This challenges greets us with “It is officially on, now.”. Indeed, in this challenge we are tasked with breaking a ciphertext that has been encrypted with the repeating key XOR from challenge 5. The key length is unknown so we will first need to determine that. Once we have done that we will reduce the problem to multiple single byte key ciphers (which we already know how to break). More on that later, first let’s talk about figuring out the key size.

The first utility function we need to implement is hamming_distance, which tells us how many bits differ between two byte sequences. The way we do this is to take the XOR (pairwise, by byte) of the two sequences. The result of the XOR will have a 1 bit at indices where the sequences differ and 0 bit where they are the same. We can then use the newly introduced int.bit_count from python3.10 to count the number of 1 bits in the XOR result, which is the Hamming distance of the two sequences. We then verify that the test vector we are given in the problem produces the expected output -

In [1]: from challenge06 import hamming_distance

In [2]: hamming_distance(b'this is a test', b'wokka wokka!!!')
Out[2]: 37

Next we need to figure out how long the key actually is. The problem statement tells us that to do this we can break the ciphertext into blocks that are the size of a candidate key. We then find the Hamming distance between those blocks and normalize it. Low values in this score correlate to better candidates for key sizes. One thing the problem doesn’t get into is why this is. Consider we have a key that is three bytes long, let’s say b"KEY". Suppose we have a plaintext of b"HIDDENSECRET". Then our ciphertext is as follows -

b"KEY" ^ b"HID" + b"KEY" ^ b"DEN" + b"KEY" ^ b"SEC" + b"KEY" ^ b"RET"

Note that when we compute the Hamming distance of the blocks against one another the keys cancel out.

b"KEY" ^ b"HID" ^ b"KEY" ^ b"DEN" = b"HID" ^ b"DEN"

This follows from some basic properties of XOR, namely -

a ^ a = 0
a ^ 0 = a

So when we pick the right key size our Hamming distance calculation is doing the distance between the plaintext blocks. If we have a plaintext that is in English (and make some sane assumptions about character encoding) then our Hamming distance should be low, because most characters in English have very similar bit representations, so their XOR is mostly 0 bits. Note that if we pick the wrong key size this does not apply as the key bytes will not cancel one another out. With wrong key sizes we expect a much more random distribution of bytes being XORed in the Hamming distance calculation.

Once we have the top candidates for key sizes we apply the same principles that we used to break single byte keys. We reduce each key byte of the candidate key size into its own single byte cipher to be solved. We do this by taking all the ciphertext bytes that would be encrypted under that key byte, concatenate them, and create a new ciphertext that we already have techniques for breaking. Once we have a candidate for each key byte we produce the corresponding plaintext and score it. The highest score, as always, will be our best candidate for the plaintext.

from base64 import b64decode
from itertools import combinations
from typing import List

from challenge03 import score, find_single_byte_key
from challenge05 import repeating_key_xor


def hamming_distance(buf1: bytes, buf2: bytes) -> int:
    distance = 0

    for c1, c2 in zip(buf1, buf2):
        diff = c1 ^ c2
        distance += diff.bit_count()

    return distance


def _best_key_lengths(data: bytes) -> List[int]:
    dist_and_ksize = []

    for ksize in range(2, 41):
        blocks = []
        for block_start in range(0, ksize * 4, ksize):
            block_end = block_start + ksize
            blocks.append(data[block_start:block_end])

        distance = 0
        for block1, block2 in combinations(blocks, 2):
            distance += hamming_distance(block1, block2)

        dist_and_ksize.append((distance / ksize, ksize))

    dist_and_ksize.sort()
    return [ksize for (_, ksize) in dist_and_ksize[:3]]


def break_repeating_key(data: bytes) -> bytes:
    keylens = _best_key_lengths(data)
    best_score, ptxt = 0, b''

    for keylen in keylens:
        key = []
        blocks = [[] for _ in range(keylen)]

        for i, c in enumerate(data):
            blocks[i % keylen].append(c)
        blocks = map(bytes, blocks)

        for block in blocks:
            key.append(find_single_byte_key(block))
        key = bytes(key)

        txt = repeating_key_xor(data, key)
        cur_score = score(txt)

        if cur_score > best_score:
            best_score = cur_score
            ptxt = txt

    return ptxt


def challenge06():
    with open('6.txt', 'rb') as f:
        data = b64decode(f.read().replace(b'\n', b''))
        print(break_repeating_key(data))

Challenge 7: AES in ECB Mode

In this challenge we get our first usage of modern cryptography, namely the AES cipher. Python doesn’t implement AES as part of the standard library so we can either implement it ourselves or use a third party library. Implementing it ourselves would be useful if we’re attacking the internals of the cipher, such as doing cryptanalysis of the S-boxes, but for these challenges we’ll attack constructions that are one or more “layers” higher than AES. That is to say, we’ll attack vulnerable protocols that build on AES, but where AES is not part of the attack vector.

That being said, we’ll be using pycryptodome for most of our primitives that we don’t implement ourselves. Why? Mainly because when I first solved these challenges years ago I used pycrypto and pycryptodome has a compatible API and is still actively maintained. The cryptography library is also probably just fine for this, although I find primitives being under the hazmat category in their package layout a bit obnoxious.

The only other thing to note here is that the block cipher mode we’re using is electronic codebook (ECB), which is somewhat famous for poorly encrypting images of penguins. This mode is quite simple, you take a block (128 bits in AES) of plaintext, feed it into AES, get the ciphertext and rinse and repeat until you’re out of plaintext blocks. The final ciphertext is just the concatenation of the ciphertext blocks. Wikipedia also has a nice diagram describing this mode if you’re a visual learner.

from base64 import b64decode

from Crypto.Cipher import AES


def aes_ecb_decrypt(ctxt: bytes, key: bytes) -> bytes:
    ptxt = b''
    cipher = AES.new(key, AES.MODE_ECB)

    for block in range(len(ctxt) // AES.block_size):
        start = block * AES.block_size
        end = start + AES.block_size
        ptxt += cipher.decrypt(ctxt[start:end])

    return ptxt


def challenge07():
    key = b'YELLOW SUBMARINE'

    with open('7.txt', 'rb') as f:
        data = b64decode(f.read().replace(b'\n', b''))
        print(aes_ecb_decrypt(data, key))

Challenge 8: Detect AES in ECB Mode

In the final challenge of set 1 we need to identify which ciphertext has been encrypted using ECB mode. Recall that ECB mode will always produce the same ciphertext block for two equal plaintexts blocks, provided that they are encrypted under the same key. So an easy way to check for ECB mode is to see if there are two equal blocks in a ciphertext. Note that we do not have to perform any decryption operations for us to be able to identify the ciphertext encrypted in ECB mode.

from binascii import hexlify, unhexlify
from typing import Optional

from Crypto.Cipher import AES


def is_encrypted_in_ecb_mode(ctxt: bytes) -> bool:
    blocks = set()

    for block in range(len(ctxt) // AES.block_size):
        start = block * AES.block_size
        end = start + AES.block_size
        block = ctxt[start:end]

        if block in blocks:
            return True

        blocks.add(block)

    return False


def challenge08():
    with open('8.txt', 'rb') as f:
        ctxts = [unhexlify(txt) for txt in f.read().split(b'\n')]

    for ctxt in ctxts:
        if is_encrypted_in_ecb_mode(ctxt):
            print(hexlify(ctxt))

Appendix

While it is theoretically possible for ciphertexts encrypted with other block cipher modes to have equal blocks, the probability of this happening is so infinitesimal that it’s not worth considering for this problem. To give a more concrete example, consider CBC mode. If we assume that AES is a pseudorandom function, and that the IV is randomly sampled, then we can also state that each ciphertext block is a randomly sampled value from the function image (all possible 128 bit numbers) that is independent of the plaintext block. We can then apply the birthday paradox to determine how likely a collision (two equal blocks) is in such a ciphertext. Given that members of a group can have any of n possible values, and there are k members in the group, the probability that no two members have the same value assigned to them is

(1 / n)^k * (n * (n-1) * (n-2) * ... * (n-k+1))

In our case the range of possible values is all 128 bit numbers, so n=2^128 possible values. If we have a k=1000000 block plaintext then the probability of two equal ciphertext blocks would be

In [1]: n = 2**128

In [2]: k = 1000000

In [3]: from functools import reduce

In [4]: reduce(lambda x, y: x * (y / n), range(n, n-k, -1), 1.0)
Out[4]: 1.0

Python can’t properly represent values this close to 1 due to floating point representation limitations so let’s reduce the possible values to n=2^64, which is about 1.94 * 10^19 times smaller than n=2^128. In this case the probability of no two equal blocks is

In [1]: n = 2**64

In [2]: k = 1000000

In [3]: from functools import reduce

In [4]: reduce(lambda x, y: x * (y / n), range(n, n-k, -1), 1.0)
Out[4]: 0.9999999728949818

This means the probability of there not being a collision in a ciphertext that is a million blocks long, using a block cipher with a function image of size 2^64, is 99.99999%. Now imagine how many more nines the probability has when we expand the function image size by a factor of 1.94 * 10^19. Given how incomprehensibly unlikely this scenario is we do not take it into consideration for our approach to identifying ciphertexts encrypted in ECB mode.

Cryptopals Set 1: Basics

Tags