Hash tables place arbitrary data into linear arrays. They can be used to implement maps or sets; here we use the example of placing strings into sets, taking the strings from this web page and filling the hash tables with them. Hash tables support the operations insert, delete, and lookup, and must also be able to transparently grow the table as the amount of data increases. Hash tables support many applications, such as keeping track of IP addresses in a networking program.
The main ingredient for a hash table is a hash code, which computes a very large number from an arbitrary string. Here we are using MD5, which computes a 128-bit number. These codes spread the data out, so that even very similar strings are mapped to distant numbers. To get a position in the array from the hash code, a remainder is computed from dividing by the size of the array. It must be ensured, though, that the same string yields the same position in the same array every time. Occasionally different strings are mapped to the same position in the array, which is called a hash collision. There are different ways of dealing with this; above you see ‘separate chaining’: under every position of the array a linked list is maintained, which contains an unordered set of hash table entries.
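The hash-to-position step can be sketched in Python using the standard hashlib module (the function name `bucket_index` is ours, not from the page):

```python
import hashlib

def bucket_index(key: str, table_size: int) -> int:
    """Map a string to an array position via its MD5 hash."""
    # MD5 yields a 128-bit number; the remainder modulo the table
    # size gives the same position for the same key every time.
    h = int.from_bytes(hashlib.md5(key.encode("utf-8")).digest(), "big")
    return h % table_size
```

Because MD5 is deterministic, `bucket_index("hash", 16)` returns the same slot on every call, which is the property the table relies on.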
This panel shows the lookup procedure in detail: first the string to look up is hashed with MD5, and then the hash is reduced to a number indicating the position in the array. Next, the procedure jumps to that part of the array; if it is an empty slot, the entry is not present. On the other hand, if a chain of entries is present at that point, the procedure traverses it looking for the entry. If the entire chain is traversed without finding the entry, it is not present in the table.
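The lookup steps above can be sketched in Python, modeling the table as a list whose slots hold either nothing or a chain of keys (names and representation are ours):

```python
import hashlib

def lookup(table: list, key: str) -> bool:
    """Look up `key` in a separate-chaining hash table."""
    # Hash with MD5 and reduce to an array position.
    h = int.from_bytes(hashlib.md5(key.encode()).digest(), "big")
    chain = table[h % len(table)]
    if chain is None:           # empty slot: the entry cannot be present
        return False
    for entry in chain:         # traverse the chain at that position
        if entry == key:
            return True
    return False                # chain exhausted: not in the table
```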
This panel shows the insertion and deletion operations. The panel fills the hash table to a certain size and then alternates insertion and deletion operations. For insertion, as before, we compute the array position from the hash code of the key string, and then either place a single entry into the array in case it was empty, or prepend a new entry to the existing linked list of entries. For deletion we have to do more work: as with lookup, we have to find the specific location of the entry in the linked list, and then splice it out of the list to remove it. In case the entry is the only element in the list, we simply null out the pointer in the array.
Hash tables make a fundamental space vs time tradeoff. In an array that is very large compared to the number of items in it, a lot of space is used but the chains are short and operations are correspondingly fast. In a small array on the other hand, space is reduced at the expense of slower operations on longer chains. A good implementation will choose an appropriate array size automatically and resize the array when the number of items changes significantly.
Instead of maintaining a linked list under every table entry, there are other methods such as ‘open addressing’. In that scheme, entries are written directly into the array, and in case of a hash collision we simply find another place to fit the entry into. The simplest such scheme is ‘linear probing’: the next array location is probed, one slot at a time, until an empty space is found. Such a hash table is shown above, where the columns should be thought of as one long array. At first we show the insertion procedure, which does linear probing as necessary to find a place. Lookup is similar: the array is traversed starting at the position calculated from the hash code, until either the wanted entry is found or an empty spot indicates a lookup miss. This will be displayed after insertion has filled the array to a certain size. One drawback of the method outlined here is that there is a hard upper limit on the number of entries; once it is reached, the array must be resized or no further insertions are possible.
Linear probing can lead to long, filled-up stretches of the array that have to be traversed sequentially to find an empty spot. An alternative is ‘double hashing’, shown above, where a second number is derived from the entry’s hash code; it specifies a stepping distance used to calculate the next probe location, and it differs from hash code to hash code. We have made sure to use an array whose length is a prime number, which prevents the stepping distance from sharing a factor with the array length; otherwise the probe sequence would cycle through only part of the array and could miss empty slots. Removing entries from open-addressing hash tables is considerably harder than insertion and lookup, so we won’t cover it here.
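Double-hashing insertion can be sketched as follows; the way the step size is derived from the hash code is one plausible choice, not necessarily the one used above:

```python
import hashlib

def double_hash_insert(table, key):
    """Insert into an open-addressing table using double hashing.
    The table length should be prime, so the step size can never
    share a factor with it and the probe sequence visits every slot."""
    size = len(table)
    h = int.from_bytes(hashlib.md5(key.encode()).digest(), "big")
    i = h % size
    # Derive a second number from the same hash code; keeping it in
    # the range 1..size-1 ensures probing always advances.
    step = 1 + (h // size) % (size - 1)
    for _ in range(size):
        if table[i] is None or table[i] == key:
            table[i] = key
            return True
        i = (i + step) % size   # jump ahead by the per-key step
    return False                # table is full
```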
Bloom filters are close relatives of hash tables. They save space by not even storing the objects they are keeping track of. Instead they store a few bits, derived from the hash code of an object, in a bit field. In the display above, the random strings taken from this page are hashed into 6 numbers, which in turn indicate positions in the rectangular bit field of 1250 bits. To insert a string, these 6 bits are set to 1. Bits already set to 1 are displayed in blue, newly set bits are highlighted in red, and bits touched by an insert that were already set are highlighted in green: such a bit simply remains 1.
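Deriving the 6 positions and setting the bits can be sketched like this; splitting one MD5 hash into several numbers is one plausible way to do it, and the constants mirror the display above:

```python
import hashlib

BITS = 1250   # size of the bit field, as in the display above
K = 6         # number of positions derived per string

def positions(key: str):
    """Derive K bit positions from the MD5 hash of a string."""
    digest = hashlib.md5(key.encode()).digest()   # 16 bytes = 128 bits
    h = int.from_bytes(digest, "big")
    out = []
    for _ in range(K):
        out.append(h % BITS)  # take one position from the hash...
        h //= BITS            # ...then consume those digits and repeat
    return out

def bloom_insert(bitfield: list, key: str):
    for p in positions(key):
        bitfield[p] = 1       # bits that were already 1 simply remain 1
```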
A common application for Bloom filters is in Web crawlers, which want to keep track of URLs that have already been visited. There may be billions of URLs, but memory should be used sparingly. Note that the length of an individual string is of no concern: the hash code is always the same size.
To look up an entry in the Bloom filter, the same positions are derived from the hash code as before, and they are looked up in the bit field. If one or more bits are not set, the string can never have been inserted in the first place, and we return false: the entry is not present. Finding a string to be present is slightly more subtle: if all bit field positions are set to 1, this indicates that the string was likely inserted at some point, so the result is true. But there is the possibility of a false positive: all the relevant bits may have been set by chance, by the hash codes of other entries! This is the main drawback of Bloom filters. The larger the bit field is, and the fewer entries have been inserted into it, the lower the probability of a false positive becomes. Note also that Bloom filters never produce false negatives.
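The lookup side can be sketched the same way; the position derivation here must match whatever the insertion used, and the names are again ours:

```python
import hashlib

def bloom_lookup(bitfield, key, k=6):
    """Return False if key was definitely never inserted, True if it
    was probably inserted (false positives are possible)."""
    h = int.from_bytes(hashlib.md5(key.encode()).digest(), "big")
    n = len(bitfield)
    for _ in range(k):
        if not bitfield[h % n]:   # a single 0 bit proves absence
            return False
        h //= n                   # move on to the next derived position
    return True                   # all k bits set: probably present

def bloom_insert(bitfield, key, k=6):
    """Set the same k bits the lookup will check."""
    h = int.from_bytes(hashlib.md5(key.encode()).digest(), "big")
    n = len(bitfield)
    for _ in range(k):
        bitfield[h % n] = 1
        h //= n
```

Note that `bloom_lookup` can short-circuit on the first 0 bit, while insertion always touches all k positions.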
If you enjoyed this page, there are more algorithms and data structures to be found on the main page.