Nov 18, 2023

Hashing in python

Hashing is hard.
But let’s start from the beginning.

Hashes are a foundational concept in python.

As a very rough approximation, every python object stores attributes (which is, any name bound to the object, be it a member or a method) in a dictionary, called __dict__.
It goes without saying, that calculating hashes is a core responsibility; and hash is a fundamental function in python.
However, hash is just a frontend; the actual hashing function is defined in the __hash__ special method. Builtin objects, like strings, have it implemented (and this is what’s being used for attributes), but custom types must implement it in order to be used in hashable collections.

So far, so good.
However, hashes are also a very common concepts in many algorithms.
For instance, one might want to implement some caching mechanism, and use the hash of an object.
The first idea is then to use the builtin hash function – what’s better than what is already there, and widely used and battle-tested?

Well, not so fast.
The result hash is designed to be unique inside a process boundary.
Once data crosses process – e.g. by marshaling to another process, or by serialization to some storage – there is no guarantee that the result will match. In other words, the result of hash("foobar") on a process might be different from the result of hash("foobar") on a different process.
Actually, it’s almost guaranteed to be different!

What is the solution?
A first approach is to use cryptographic hashing functions (e.g. SHA).
They are stable and well-documented, and can be even used with other languages.
However, many of them they are slow, and they are slow by design, because of their purpose (cryptography).
However, in hashlib, we have also blake2b, which is an implementation of BLAKE2. Its usage is a bit more convoluted than hash(), so I usually try to wrap into some function:

from hashlib import blake2b

def hash_blake(plaintext: str) -> str:
    h = blake2b()  # use digest_size for shorter hashes
    h.update(plaintext.encode('utf-8'))
    return h.hexdigest()