Hashes, salt, and pepper
Salt and Pepper
Cryptographic hash functions help you protect your passwords, but hashing is only secure if properly understood.
Hash functions are an integral part of computer science – and not just with databases and checksums. Hashes were originally intended for storing data efficiently in memory, but the hashing concept has evolved into a technique for securely storing passwords.
Linux writes the password hash values to the /etc/shadow
file, which you can only read if you have root privileges. But even if you have the root password, you'll find it difficult to learn any useful access information. The function used to store the password hash values in etc/shadow
is a oneway function, which means you can't work backward from the hash value to create the original password – at least in theory. As you'll learn in this article, attackers still sometimes manage to crack these supposedly irreversible hash functions.
What is a Hash?
The idea of a hash is simple: An address is calculated from the value that is to be stored. Suppose, for example, you need to store the four user names Fritz, Laempel, Max, and Moritz. A hash function would calculate a numeric value from these names.
The simplest option is to use letters in the alphabet (A=1, B=2, etc.), which the program then uses to form a sum of the digits from the letter values in the name (Table 1). The hash value must be in this range because only a limited number of memory locations are available.
Table 1
Simple Hash Function
Value  Sum  Hash 

Fritz 
79 
2 
Laempel 
64 
1 
Max 
38 
3 
Moritz 
101 
3 
The example in Table 1 has seven available memory locations (from 0 to 6). The hash value is therefore determined by the sum modulo 7. Anyone who doesn't fancy computing the numerical values for Fritz, Laempel, Max, and Moritz manually can use Bash with the help of Perl, as shown in Listing 1.
Listing 1
Bash Hash
Each name first ends up in the name
variable via the for
loop; echo
sends the name through a pipe to tr
, which replaces all lowercase and uppercase letters and passes the result to a Perl script. The Perl script then splits the character string into individual letters and outputs its ASCII value separated by the  64 +
character string. This gives A the value 1, B the value 2, and so on. Anyone who wants to see this happening live should enter set x
. To terminate the function again use set +x
(Figure 1).
In this example, Max and Moritz have the same hash value. If you imagine memory as an array, you can only accommodate one entry in each slot and will be unable to handle the task at hand. Theoretical computer science therefore doesn't consider the hash value to be a physical slot, but rather a number of a bucket that can accommodate several entries.
With open hashing (Figure 2), Max and Moritz could live in one bucket because each bucket can contain any number of entries. However, the number of entries is limited with closed hashing.
An array of pointers, which can (in principle) contain an unlimited number of elements, is a suitable data structure for open hashing. At each index of the array a linked list to the elements of this bucket is kept. Open hashing is always useful if the size of the array is comparable to the number of elements, n, for all the values you want to store, because then, on average, only one entry per index is stored, allowing for quick access.
The hash function requires a constant time O(1) to calculate the hash value. However, the time it takes to wade through the list in each bucket increases as a linear function of the number of elements. The worst case scenario is that all n elements end up in one bucket and the required access time is now O(n). In that case, this method offers no advantage over a singly linked list.
The hash function will ideally distribute the number of values equally across all m indices [O(n/m)], which makes finding elements faster than with a singly linked list. Such a structure is particularly useful if you're looking for a specific value from a large amount of data. Closed hashing limits the number of entries in a bucket and thereby prevents long chains.
Keep a Lid on It
Many real implementations draw the line for the maximum number of entries per bucket at 1; however, higher values are also theoretically possible. The simplest implementation is a twodimensional array with n number of rows (=buckets) and m number of columns for the maximum number of elements per bucket. However, this solution isn't ideal for memory usage.
Collisions are now no longer irrelevant; you still have to check whether the bucket has room for another value.
Some methods provide a new hashing (called rehashing) with another hash function as the solution. This rehashing is repeated until no more collisions occur. The worst case scenario is that you'll need a lot of new hash functions, which give rise to different results to find a free space.
The simplest hash functions always add 1 to the previous hash value; h[ ] (h subscript 0) is the original function, n is the number of buckets, and i is the ith rehashing attempt:
This process is called linear probing and is a bit like a visit to an Oktoberfest tent: People go from table to table until they find an empty seat. This process causes block formations. In places where everything is already occupied, the next table is likely to be occupied too.
Therefore refinements alternate between looking above and below. Square probing using i^2 (i squared) instead of i is a possible alternative. Collisions are equally likely with the first hash, but the likelihood decreases with rehashing.
Rehashing leads to another problem: Anyone looking for an entry in memory later (Max for example) just needs to apply the function to the name and will then receive the memory address. However, if nothing is stored at this address, you need to keep rehashing until you find Max or an empty or only partially filled bucket, which would mean that Max wasn't saved.
If an entry has been deleted, you need to set a marker at this position so the search is terminated prematurely. The memory space should be less than 80 percent full to keep the probability of collisions low enough.
Adding Security
Cryptography places higher demands on hash functions. A hash function is considered secure if it isn't possible to recreate the output value from the hash value – at least not in an acceptable time. In other words, the function must not be reversible. The probability of finding a second output value for the given hash and the probability that any two output values lead to the same hash value should be very low.
The last two stipulations sound similar, but they differ in a way that is best illustrated with the birthday paradox: Suppose Max is at a party and asks the other guests if anyone's birthday is on the same day as his. An affirmative response is relatively unlikely. If, on the other hand, he wanted to know if any two arbitrary partygoers' birthdays are on the same day, the probability is more than 50 percent, even with as few as 23 people at the party. For the first scenario, 183 people would be needed to get such a value. So, from a cryptographic point of view, birthdays are a bad hash value, because a very large number of people (7 billion) map to a very small range of values (365 days).
The range of values for a cryptographically secure hash function is a fundamental part of the design. MD5, for example, uses 128 bits (2^128, i.e., about 3 x 10^38 different values). SHA1 uses 160 bits (10^48 different values), SHA256 as many as 256 bits (10^77). A function is considered to be collisionresistant if the probability of a collision is less than 2^(n/2) with a hash value of n bits.
Like all hash processes, the MD5 algorithm first converts a message to the appropriate length. An integer multiple of 512 bits is required. Padding works in such a way that 1 is appended to the message as the first bit, followed by zeros, until 64 bits remain free at the end. The message length is therefore divisible by 512. A 64bit value appended to the end (in binarycoded form) specifies how long the original message was [1].
The algorithm then splits the message up into 512bit blocks and splits these blocks, in turn, into 32bit blocks. The hash values are then calculated for these blocks, also known as words. Four words are always linked to each other logically four times in different ways, moved bitbybit or added. The result is four 32bit words: A, B, C, and D. The algorithm adds them to the previously calculated values A, B, C, and D. This combination of A, B, C, and D after each round is called a status vector.
This continues until the entire message is processed. The MD5 value is then A, B, C, and D in a row. A, B, C, and D receive a fixed output value – the initialization vector – before the calculation.
Buy this article as PDF
(incl. VAT)
Buy Linux Magazine
Direct Download
Read full article as PDF:
Price $2.95
News

Red Hat Reports $823 Revenue for Second Quarter 2019
Strong performance for cloud and container products leads to a 14% increase over last year.

Debian, Ubuntu, and Other Distros are Leaving Users Vulnerable
A security researcher says Linux vendors wait too long to patch the kernel.

NextCloud 14 Arrives
The new version adds two new security features for users.

Linus Torvalds Takes a Break, Apologizes
The creator of Linux is admitting his abusive behavior on LKML and plans to change.

A New Business Model for Open Source Projects
Storj offers a new program that enables open source projects to monetize from storage.

Linux Mint Debian Edition 3 Released
It’s Linux Mint without any Ubuntu base packages.

Zorin OS 12.4 Released
It’s an extremely popular distribution among Windows and macOS users.

Debian Celebrates Its Birthday
The great community Linux distro turns 25 years old.

Chromebooks Support Debian Applications
Containerized Debian environment is now available through the Chrome OS dev channel.

Opera Embraces Snap for Linux
Announcement may open doors for more mainstream applications to adopt Ubuntu's Snap package system.