Hashing in python
Hashing is hard.
But let’s start from the beginning.
Hashes are a foundational concept in python.
As a very rough approximation, every python object stores attributes (which
is, any name bound to the object, be it a member or a method) in a
dictionary, called __dict__
.
It goes without saying, that calculating hashes is a core responsibility;
and hash
is a fundamental function in python.
However, hash
is just a frontend; the actual hashing function is defined
in the __hash__
special method. Builtin objects, like strings,
have it implemented (and this is what’s being used for attributes), but
custom types must implement it in order to be used in hashable collections.
So far, so good.
However, hashes are also a very common concepts in many algorithms.
For instance, one might want to implement some caching mechanism, and use
the hash of an object.
The first idea is then to use the builtin hash
function – what’s better
than what is already there, and widely used and battle-tested?
Well, not so fast.
The result hash
is designed to be unique inside a process boundary.
Once data crosses process – e.g. by marshaling to another process, or by
serialization to some storage – there is no guarantee that the result will
match.
In other words, the result of hash("foobar")
on a process might be different
from the result of hash("foobar")
on a different process.
Actually, it’s almost guaranteed to be different!
What is the solution?
A first approach is to use cryptographic hashing functions (e.g. SHA).
They are stable and well-documented, and can be even used with other
languages.
However, many of them they are slow, and they are slow by design, because of
their purpose (cryptography).
However, in hashlib, we have also blake2b
, which is an implementation of
BLAKE2. Its usage is a bit more convoluted than hash()
, so I usually
try to wrap into some function:
from hashlib import blake2b
def hash_blake(plaintext: str) -> str:
h = blake2b() # use digest_size for shorter hashes
h.update(plaintext.encode('utf-8'))
return h.hexdigest()