With the rapid growth of blockchain, FinTech, or Bitcoin, hash functions gradually attract more attentions in recent years. Several mining machines or application-specific integrated circuits (ASICs) became famous with their great computational hash power. That hash power becomes an ability that profits in the Bitcoin society.

So, what exactly is a hash function?

A hash function is a function that takes any string with arbitrary bit-length as input, then it produces a deterministic fixed-length string as the function output. The short output may be a unique identifier representing the original input. By the way, the term deterministic denotes the same outputs come out while given the same inputs.

A hash function outputs a shot fix-length string whatever it takes as input.

A practical hash operation

Before theorems and definitions, users with a linux-based operating system like MacOS, Fedora or Ubuntu can easily obtain a SHA256 hash value of files through following command.

openssl dgst [algorithm] [path-to-file]

There are some common hash algorithm like -sha256 and -sha384 could be found by entering

openssl dgst -help

Following, a practical command which produces a SHA256 hash value of a file path-to-file. No matter that file accounts tens of megabytes or even gigabytes, the SHA256 hash value is 256 bits.

Command

openssl dgst —sha256 [path-to-file]

Output

SHA256(path-to-file)= bb4a9cbd401e3ed8b9bad674bbb50febe8b9df0acb4e2f87be7a79369ead6430

For more hash implementations, plenty of online hash function calculator might help us understand the computation before and after various hash functions.

Common hash functions

The secure hash algorithm (SHA) series are endorsed by FIPS and NIST so that they’re the mostly widely adopted hash algorithms. The older versions SHA0 and SHA1 have been corrupted so that please carefully avoid these algorithms; while the definition of security and corrupted will be later introduced. By the way, the algorithm SHA1 is pretty similar to SHA2, and SHA1 has been corrupted. However, SHA2 keeps secure or so called survived due to the fact that its output length is 256 bits or more (depend on its variants), which is much longer than outputs of SHA1 (160 bits). NIST chose SHA3 as the next generation member of the SHA family because SHA3 executes totally different algorithm from SHA1 and SHA2. That might be a quite long story if we’re going to discuss about the history. Here we just list some common recommended and corrupted algorithms as common senses.

Recommended

SHA2 series (SHA256, SHA384, SHA512), SHA3 (keccak), RIPEMD-160

Insecure

MD4, MD5, SHA1

Then, let’s keep going deeper to understand what are hash functions and what roles can hash functions play in the modern IT industries. Don’t worry, it will not be very theoretically.

Why do we need hash functions?

A hash function converts its various source inputs into fixed format outputs known as the hash value, the digest or the footprint. There are numerous advantages that make it so powerful in our modern IT industries: on one hand, it works quite efficiently so that each binary files in the computer or Internet could become a digest easily; on the other hand, it provides powerful utilities along with its properties such as one-way, collision-resistant, avalanche effect and universal distribution.

One-way

Let y = hash(x), it is efficient to compute y from x; but it is extremely difficult to reveal any bit of x while given value y and hash function. With this property, it could be guaranteed that value x exists before value y does. Besides, one party may deliver y in order to make a secure data comparison without revealing data x to another party.

Collision-resistant

For any values a and b as inputs, it is harshly difficult to find a collision that satisfies hash(a) = hash(b). Hence, the correctness of hash value overwhelmingly implies the integration of its source input. The collision-resistant property ensures the integrity of original inputs in a very cheap way, which enables the short fixed-length footprint to stand for its large source. Many large file transferring services like Linux distributions provide digest for file integration verification.

Applications of hash functions

Two properties above are mostly common utilized. The most well-known application is that: with the aids of hash functions, the digital signature works on input any format of electronic files or strings. Besides, in the Bitcoin case, block n+1 contains the digest of block n. On one hand, it proves block n exists prior to block n+1(one-way); and on the other hand, it ensures the integrity and correctness of block n (collision-resistant). The latter blocks repeatedly empower the integrity of former blocks so that the blockchain structure continuously become more and more immutable. Bitcoin is an elegant technique that needs more stories to describe, if you’re interested in that part, some resource could be referred.

Another famous application about the hash function is the password storage. For example, those big companies like Google and Facebook owns billions of user accounts and personal information. If they store users’ password as plaintext without processing, the whole data is in danger once the identity server is corrupted. A well-known solution is to save the hash values of users’ passwords rather than the passwords. On one hand, it keeps password verifiable; on the other hand, if the identity server is corrupt, by the one-way property, those digests keep password much much more secure. The acquirement of each password becomes extremely difficult for hackers. This is a serious issue that might be discussed in some future.

Security of hash functions

When it comes to security, it is unavoidably more theoretical, I’ll try my best to keep it not that sophisticated. Let’s recall the process that a hash function maps an any-length input to a fix-length output. In other words, it is well-known that theoretically, there must be many collisions that x != y and hash(x)=hash(y). The pigeon-hole theory explains this situation in an easily-understanding manner: n+1 pigeons live in n cages, there must be at least one cage contains 2 pigeons.

First, the one-way property is easily-understanding while the collisions exists. Given the digest hash(x)=hash(y), it is impossible to precisely indicate the input between value x and y. Considering the fact that an infinity-to-one mapping occurs along with the hash computation, the one-way property is trivially obtained.

Secondly, the collision-resistant!? Contradictorily, it seems that we just discussed about the infinity-to-one mapping that cause a lot collisions. Yes, we’re magically insisting on its collision-resistant property. More precisely, the collisions in fact exist; however, they’re computationally hard to find. The difficulty depends on the output length of hash functions. Take SHA256 for example, it outputs a non-predictable 256-bit number between 0 and ²²⁵⁶ -1. There are trillions of trillions of trillions of trillions of possible outputs so that the collisions quite rarely occur. Note that the wide range of outputs is insufficient to guarantee the collision-resistant property, more properties like uniform distribution is required which will be introduced later, which is even more theoretical and difficult to be prove. Before scientists conquer the proof issues, the hash functions currently act like collision-resistant, and we just believe the employed hash function are collision-resistant until it is corrupted. Aforementioned SHA1 and MD5 hash functions have been corrupted since collisions of them were found.

Other properties

The following two properties are not that commonly adopted in modern IT industries so that readers may decide to skip this phrase or continue to understand the other sides of hash functions.

Avalanche effect

This is an appropriate name about this property which describes a tiny difference between two input sources (even only one-bit difference) makes numerous bias between two hash values. A strong definition called strict avalanche effect defines each one bitwise change influence approximately half bits of the hash outputs.

Uniform distribution

This is an ideal assumption adopted in some cryptographic works, which assumes the outputs of hash function is uniformly distributed in the domain it locates. While the input is unknown, the hash function works unpredictably as a random number generator in the specific domain. This is so far still an argument that whether it exists or not. Random oracle model is a theoretical model aside from cryptographic security proofs that assumed hash functions are universal distribution.

If you enjoy the stories and willing to read more easily-understanding cryptographic or blockchain-related content, please give me some claps. I’ll be flattered if there are some sponsorship via Bitcoin or Ethereum.

Bitcoin: 36FmpfiDPVAJdxKtXM79Lj5G5awZgDNKar

Ethereum: 0x37fd7D56a7228344F1bca1132d7D1b1ED398FCaa

We’re a group of experts major in applied cryptography and blockchain. Contact us for enterprise consulting and education.