Learning Storage and Retrieval with a Hash Function
If you walk around the city of London, you won’t have too much trouble finding Gough Square, hidden behind Fleet Street. There stands the home of Dr. Samuel Johnson, who (among other things) wrote the first English Dictionary.
There is something surprisingly magical about a dictionary. There is probably only one long-ordered chain you know well, and have known well for a long time: your alphabet. This gives a dictionary a few remarkable characteristics. Each entry has just one key. And the indexing works only one way; so using only a dictionary, you can’t discover an unknown word just be knowing its meaning. Even a young child can look up Steatopygia, but finding that word simply by searching the entire book for “the state of having substantial levels of tissue on the buttocks and thighs” might be a very long task.
Some basics of computing have overlapping definitions and get obscured by their own ubiquity. While hashing has a simple definition, different aspects of it are in focus at different times. In today’s world, hash security is of interest, but in this article, we are looking into storing and retrieving items with a key. We’ll use elementary computing concepts, with a small degree of maths curiosity. It starts with the first hash table.
A hash function is any function that can be used to map data of arbitrary size to fixed-size values.
Arrays and Hash Tables
From a computing point of view, the English Dictionary is not quite so wonderful, as there was never any need to restrict its size. The English language was never created with efficiency in mind; if it was then the dictionary would start with the word “A”, and end with the word “ZZZZ”. Indeed, just four letters if used efficiently could represent all English words (with enough room for another language).
For a digital dictionary, we are interested in numeric keys and indices. We know that the simplest form of storage is a fixed-size array. For our example, we will play with a five-element array for storing the worrying mugshots of our five friends: Alan, Beth, Cath, Dave and Eddy. We want to retrieve the images using just their names; and eventually, delete them.
Now, we could use just two arrays: one with the image data, and one with a matching index holding the keys. The data array would need cells big enough to hold image data, while the index array just needs cells big enough to hold small strings.
In the example below, the key array holds the key “Alan” whose index matches the slightly angry dark-bearded visage in the image array above. I guess Midjourney was having a stressful day. This system is simple to maintain, as entries can be created and deleted easily in sync.
For our small example, this seems fine. But for a large array, the problem with this scheme is that as the dual entries increase, you have to potentially search for longer in order to find the right key. Worse, you could search the entire array just to confirm that the key wasn’t present. Once I start deleting entries, I also have to track disparate empty cells somehow. In other words, this system doesn’t scale well.
What we want is a function that sits between the nice human-readable key and an indexed array of images. Something that can look at the key and say “oh, I put your image data here”.
You might also recognize this model from valet parking. I give the attendant my car key, and in return, he gives me a ticket. Then some other person goes and parks my car. Later I give you the ticket, this matches my car details and my car is retrieved. Addition and retrieval take a fixed amount of time. (At least that is how I believe the system works; my only reference is “Ferris Bueller’s Day Off”.)
So, what is inside the “hash function”? Surprisingly it can be more or less anything. As long as it can produce valid index numbers.
The starting point for examples is always the modulo operation, as that precisely fits the bill — it provides a number guaranteed to be less than a given maximum.
In computing, the modulo operation returns the remainder or signed remainder of a division, after one number is divided by another.
So if we have a dictionary with a five-slot array, we will need to output index positions of 0-4. As you can see, modulo 5 does this with any integer value:
17 mod 5 = 2
2345 mod 5 = 0
73 mod 5 = 3
I’m sure you grok the pattern. So we have our hash function, right? We should have no need for the index array — just get an index from our name key! Well, not quite.
Given that 17, 207, and 2347 all produce the same modulo result, this can’t possibly work on its own. Even for five random numbers, there will likely be some collisions. The only way to avoid collisions, certainly for a trivial algorithm, is to store the key as we were doing before. This isn’t so surprising really — for a fixed-size array, no algorithm could possibly perfectly fill our index without remembering what had gone before. This would be like a valet parking where the details of the car and where it is parked don’t get recorded anywhere. So our hash function is really saying “oh, start looking for your image data here.” What I’m describing is called open addressing.
I want keys to be the name of the person’s mugshot, e.g. “Cath” — but that isn’t a number. However, I can get a fairly unique number from a string. We could produce a value based on the positional encoding of numbers (think of English dictionary order, counting in base 26), but with an eye on our very simple modulo function let’s just add all the ASCII values of the characters together.
Alan = 65 + 108 + 97 + 110 = 380
Beth = 66 + 101 + 116 + 104 = 387
Cath = 67 + 97 + 116 + 104 = 384
Dave = 68 + 97 + 118 + 101 = 384
Eddy = 69 + 100 + 100 + 121 = 390
(Tip: If you are using a Mac, you can do both modulo and other lines of math directly in the Spotlight Search bar.)
So, finally, let’s deal with collisions. If we store the key, then we can fulfill our expectation of keeping the information close to where we first looked, by simply putting it in the next available place down the key array. This method is called linear probing.
So now we have all the ingredients. First, we tell our hash algorithm that we want to store an image of “Alan.” It converts the string “Alan” to an integer, then uses modulo to produce an index into the key array. Then it adds the data, using the same index, to the value array.
Then we add ”Beth” in the same way. This time, the index is two.
When we try and add Eddy next, we get a potential collision with Alan’s entry. So we move to the next available slot in the index array:
To retrieve our data, we just follow the algorithm again, but this time to confirm the index. When we search for “Eddie,” we see “Alan” — so move on one. We didn’t have too far to go. The hashing algorithm told us where to start looking like it promised.
Naturally, this makes a little more sense for the larger arrays you would use in real applications. In modern computing languages, you don’t need to specify the size of the dictionary — they resize as needed. But for now, I think Johnson, and his cat Hodge, would be happy with their indirect contribution to computing.