Hashing Explained, And Why Blockchains Rely On Them
w/Code
What if, what if, you walk into a maze
Lost in the maze and from the path you stray
No memory of how you got there, no, no clue.
You look around, the maze is giant
One that covers the entire globe.
You walk and walk, but there’s no escaping.
You ready yourself and try and try
But in this maze, you’ve been put.
No way to win, can only grind
But no matter what you do, the beginning is gone.
The winning input has been withdrawn
And in this maze, you’ll wander forever.
What if, what if, the input you seek, a treasure to behold.
But there isn’t, never was, and never will be
An X to mark the spot.
You can’t get an input from the output.
That’s just how it works.
How hashing should work.
So What Is Hashing?
Hashing is when you take some input data of any length and convert it into a completely random hexadecimal (0–9 and A-F) string of a fixed length called a ‘hash’. This is achieved using a hash algorithm, such as SHA-1, MD5, or SHA-256.
For example:
data = "For example"
sha1(data)
Output: dc3c09a115b4ee611eb7505826a4fa0c2a8495d8
While there are numerous different algorithms that can be used for hashing, there are still certain common rules that should be followed to ensure maximum security.
- The output should, in no way, relate to the input and vice versa. So if I have ‘abc’ and put it through an algorithm, it should give something like ‘sP5vl21’, not ‘ABC’, which as you can see there is a very clear connection between the input (abc) and output (ABC). It’s like being reborn as a new person with a completely new personality. Nobody, except for you, should be able to figure out what the input was, or who you were in the previous life (unless you tell them, of course).
- The output hash must always be the same length as any other output hash (a fixed length). Even if the input is ‘a’ or ‘aaaaaa’, the output hash’s length must be the same.
- The same input must always produce the same output hash. However, the smallest of changes (including simply making an uppercase letter lowercase or adding a space), should result in a usually, completely and wildly different output hash. The hash of ‘Abc’ is not the same as ‘abc’. Try this out for yourself.
- Two different inputs should never be able to produce the same hash. This is called a collision. And while not impossible, if the hash algorithm is flawed in some way, collisions could appear. For example, on shattered.io you can find two different pdfs. These pdfs when put into a hash algorithm produce the same hash (which is not supposed to happen as every hash should be unique.)
Collisions Matter.
Meet Jeff.
Jeff works at a bank called “Bank”.
At Bank, whenever Jeff receives a cheque, he sends the information to his boss by email who will update the accounts. But before that, Jeff must walk to his boss’s office and tell him the hash of the email. Then, his boss will check the hash of the email again to see if it matches the one Jeff gave him to make sure nothing was tampered with.
But today, unbeknownst to him, there is a hacker. The notorious, the one and only: Bobberoni is ready to intercept the email.
Jeff receives a cheque and sends an email with the following data:
“Receiver: Alex H, Sender: Daisy F” (Hash: 0317d7a2f68780957…)
Jeff sends the email and gives the hash to his boss. However, Bobberoni successfully intercepted the email and he changed “Alex H” to “Bobberoni” so that he would receive the money instead. Since changing anything in the input (the email) changes the output, the hash that Jeff’s boss will get when he checks the email will be different from Jeff’s and they will know something is wrong.
So they do just that. Jeff tells his boss that the hash is 0317d7a2f68780957… And his boss puts the email in the hash algorithm. He gets the same hash, 0317d7a2f68780957…, and they assume nothing is wrong.
But how is that possible? Since the hacker changed something in the email, the output hash must be different? Well, unfortunately for Jeff and his boss, “Bobberoni” and “Alex H” produce the same hash (not actually). So yes, even though something was changed, those two inputs still gave the same output.
This is why collisions are a problem.
A Hash’s Purpose: Integrity And Security
The two biggest things.
Integrity in hashes is like the firewall of a computer. You need to make sure that none of the data has been tampered with and that it is trustworthy and legit. Security is, well, security: keeping the hackers out.
Bob’s Bank was an example of integrity. Making sure that the two hashes match is one of the best ways to prevent any suspicious entities from meddling.
And for security?
Well, if Bob’s Bank has a database of user logins for their bank accounts, and said database has been breached, well… That’s a problem. But if you hash all of the passwords before entering them into the database, then the hackers will have no idea what to do with some useless numbers and letters. Even if they know the hash of the password, that does not mean they know the password itself.
How Do They Work?
What is this magic going on behind the scenes? It seems quite impossible to create an algorithm for “random numbers and letters” but “not actually random numbers and letters”. But, it is.
Let’s take a look at the code of SHA-1 (it’s a fair bit):
# this code is written in python 3, by yours truly :)
data = "Something"
# declare some variables
bytes = ""
h0 = 0x67452301
h1 = 0xEFCDAB89
h2 = 0x98BADCFE
h3 = 0x10325476
h4 = 0xC3D2E1F0
a = h0
b = h1
c = h2
d = h3
e = h4
# for every character in data, get the ascii code of said character
# then convert the ascii code to binary
# pad it with zeroes until it is 8 characters long
for char in data:
bytes += "0"*(8-len(bin(ord(char))[2:])) + bin(ord(char))[2:]
# append a one
bits = bytes + "1"
# append zeroes until the length is 512 mod 448
while len(bits) % 512 != 448:
bits+="0"
# take the length of the combined 8 original bytes
# then pad with zeroes until it is 64 long
bits += "0" *(64 - len(bin(len(bytes))[2:])) + bin(len(bytes))[2:]
# splits a string into chunks
# l = the string, n = bits in every chunk
def chunkify(l, n):
return [l[i:i+n] for i in range(0, len(l), n)]
#left rotate function
def rol(n, b):
return ((n << b) | (n >> (32 - b))) & 0xffffffff
#VVV the magic VVV
for chunk in chunkify(bits, 512):
#split 'chunk' even more into 32 bit strings
words = chunkify(chunk, 32)
w = []
# append the 16 strings in words to w
for n in range(0, 16):
w.append(int(words[n], 2))
#perform bitwise operations to generate another 64 strings
for i in range(16, 80):
w.append(rol((w[i-3] ^ w[i-8] ^ w[i-14] ^ w[i-16]), 1))
#perform a lot of bitwise operations to maipulate variables
for i in range(0, 80):
if i < 20:
bAc = b & c
nbAd = (~b) & d
f = bAc | nbAd
k = 0x5A827999
elif i < 40:
bXc = b ^ c
f = bXc ^ d
k = 0x6ED9EBA1
elif i < 60:
bAc = b & c
bAd = b & d
cAd = c & d
f = bAc | bAd | cAd
k = 0x8F1BBCDC
else:
f = b ^ c ^ d
k = 0xCA62C1D6
temp = rol(a, 5) + f + e + k + w[i] & 0xffffffff
e = d
d = c
c = rol(b, 30)
b = a
a = temp
h0 = h0 + a & 0xffffffff
h1 = h1 + b & 0xffffffff
h2 = h2 + c & 0xffffffff
h3 = h3 + d & 0xffffffff
h4 = h4 + e & 0xffffffff
#and just join them all together
output = '%08x%08x%08x%08x%08x' % (h0, h1, h2, h3, h4)
print(output)
Quite confusing, isn’t it? Well, let’s walk through this, step by step.
Declare Some Variables
First, you need some data, so Data = ‘Something’. Then you must declare some state variables to use as the primary inputs in the future code.
Bytes To Bits
Next, you must take the ascii codes of all the characters (including punctuation and spaces, those are characters!) which would be (S) 83 (o) 111 (m) 109 (e) 101 (t) 116 (h) 104 (i) 105 (n) 110 (g) 103. Convert these values to binary and then pad them with zeroes until they are all 8 characters long. For example, 83 in binary is 1010011, which can then be padded to 01010011 in order to make it 8 characters in length. Do that with all the characters and you get:
010100110110111101101… (just more zeroes and ones…)
Then, append a one.
(P.S. Append means to add to the end, and pad means to add in the front, in case you’re confused)
512 mod 448
We need to make sure that everything, in the end, will be divisible by 512. So, if we have:
010100110110111101101…1 (← after appending the one)
after converting the ascii codes to binary and padding them and appending the 1.
It’s 73 characters long, we must make it 512 mod 448. Basically, make it 448 characters long. If the string is already longer than 448 characters, you do not shorten it. In that case, you’d have to make it 960 (448+512) characters long. Longer than 960? Add another 512. Longer than 1472 (960+512)? Add another 512! Fortunately, in python, the % symbol solves all of this tedious work for us.
But wait, I thought you said divisible by 512? 448 (or 960 or 1472) is certainly not divisible by 512?! Let me explain:
Take your original 8 bytes (before you appended the 1) and join them together and calculate the length. Convert the length (64) to binary (1000000). Then pad it with zeroes until it is 64 characters long. Append that to your 448 bits and voila! 448+64 = 512 or 960+64 = 1024.
Something divisible by 512!
Bitwise Operations — Chunkify and Left Rotate
What is a bitwise operation? There are 6 basic operators.
- And: &. Sets each bit to 1 if both bits are 1
- Or: |. Sets each bit to 1 if one of the two bits is 1
- eXclusive Or (XOR): ^. Sets each bit to 1 if one (and only one) of two bits is 1
- Not: ~. Inverts all the bits.
- Left Shift: <<. Shift everything to the left and fill in the gap with zeroes. Any bits shifted too far to the left will be removed. (1100, you can’t move the first one to the left, since it is already the leftmost bit, so you just get rid of it → 1000)
- Right Shift: Shift everything to the right and fill in the gap with the leftmost bit, and any bits shifted too far to the right will be removed.
The chunkify function takes a string, let’s say ‘01100011’ and splits it into chunks. Let’s say we want to split it into 4 chunks: chunkify(‘01100011’, 4)
Output: [‘01’, ’01’, ’00’, ’11']
The left rotate function just rotates all the bits to the left, n times. For example, if n=2: 0101 → 1010 → 0100
Where All The Magic Happens
This is where it gets confusing.
- You have a string of 512 (or 1024 or 1536 or 2048 or… bits) and split the chunks into chunks which are 512 bits long. So if you had a string of 1536 bits, you’d need to split them in three to get three chunks, each 512 bits long (512 bits, 512 bits, 512 bits). Then split those chunks into 16, 32-bit ‘words’, then append those 16 words to a list (w).
- Extend w to 80 words. Through the use of some XORs and a left rotate function, we can generate 64 new words that will stay the same every time since there is no randomness with bitwise operations.
The next part is hard to explain, but is quite repetitive, so I’ll spare you your time and interest level. But all you need to know is, all that matters are five variables: a, b, c, d, and e. Like before, with bitwise operations we are generating new values for those five variables. This is the secret to how the hash can always stay the same length every time.
I’ve been saying that the input has no relationship with the output. Because it doesn’t. You might say, “Well, you need the input for the output, that’s a relationship.” True, but there’s a middleman. The only thing you actually need are those five variables, because in the end, those five become your hash. Your input only influences how those five variables change. How many times you change and manipulate them.
No matter what your input is, you’re only manipulating a, b, c, d, and e. The five variables never change in length, only their values.
From there, do some more variable reassignment and join the 5 variables together and boom. Hash.
Behold! The magic of SHA-1.
output = sha1('Something')
print(output)
# Output: b74dd130fe4e46c52aeb39878480cfe50324dab9
(Note that SHA-1 is no longer secure and you should use more secure hash functions like SHA-256)
Without Hashing, Blockchains Would Cease To Exist
Enough of this confusing language, how do blockchains actually implement hashing?
The Chain in Block
Blockchains are chains of blocks, since all the blocks are chained together.
The reason that blockchains are so secure is because of the system in which they are chained. Chained with hashes.
There is certain data that is recorded within a block such as transaction data, but it also includes the hash of the previous block: In any two given blockchain blocks, the second block will contain the hash of the first block.
For example, if block #99’s hash is 3m6 and block #100’s is 0pQ, block #100 must contain the hash of the previous block (#99) which would be 3m6. And block #99 must contain the hash of block #98, etc.
Again, when you change something in the input (which would be the block itself), the entire hash changes. So for example, if Bobberoni tries to change the data in block #99, the hash will change (unless a collision happens) but the hash stored in block #100 won’t change and since the hash of block 99 and the hash of block 99 stored in block 100 aren’t the same, something is clearly wrong.
If Bobberoni tries to hack the last block in the chain (let’s say #101), and successfully changes some of the data: the hash will change, like usual. However, since there is no preceding block in the chain to store the “previous hash’s block”, nobody will know that it’s been tampered with. Right?
Well, one thing about blockchains mean that they are distributed ledgers. Bobberoni cannot hack all the ledgers at once. It is beyond doubt that the other ledgers will notice something is wrong and it will be quite clear that something suspicious is going on in that ledger.
Proof Of Work (PoW)
You can find a more detailed article that covers proof of work, consensus mechanisms and miners here.
In a blockchain, to reduce the risk of a hacker changing the data in a block, and causing a hash collision, there are certain requirements. In most blockchains that use PoW as their consensus mechanism, the hash of a block must start with a certain amount of zeroes.
Basically, the hash can’t look like this: 9xIpM4U2, it must look like this: 00000c8fP.
The amount of leading zeroes required exponentially decreases the chance of a collision. Say a valid hash for a block must have 5 leading zeroes, the chance of finding one is approximately 0.0000095367431640625% (1/16^n, where n is the number of leading zeroes required).
I mean, that’s already an insanely low chance for just 5 zeroes. Imagine if it was 20 zeroes. (It’s 0.0000000000000000000000827180613%).
So, if Bobberoni tries to find a collision, they will need to find a hash with 5 leading zeroes, which already has a very very low chance, then match all the remaining digits, which also has a very low chance.
Well, that’s the magic of hashing for you.
Thanks to hashing our passwords and blockchains can stay secure.
And keep 99.9% of the Bobberonis out.
Or rather, in the maze.