Hashtable

Related data structures: Array

A map, symbol table, a dictionary or an associative array is considered an abstract data type composed of a collection of (unique_key, value) pairs. Operations associated with this data type are addition and removal of a pair in a collection, lookup values by their keys and modification of an existing pair.

A hashtable uses the key of each record to determine the location in an array structure. To achieve this, the key is passed into a hash function which returns a numeric value based on the key.

Hash Functions

A good Hash Function should

be simple to compute.
always return the same numeric value when given the exact same key.
have less number of collisions while placing the record in the hash table. Ideally none should occur (perfect hash function).
produce such keys which will get distributed uniformly over an array.
depend on every bit of the key. Thus a hash function that simply extracts the portion of a key is not suitable.

Load Factor

The load factor denoted by the symbol λ measures the fullness of the hash table. It is calculated by the formula λ = number of records in table ÷ number of locations.

Dictionary Problem

Data structures that maintain a set of data during search, delete and insert operations suffer from a data structural problem known as the dictionary problem.

Consider the following data structural problem.

We have a set of items. Each item is a (key, value) pair. Keys are in the range {1,...,U} and values are arbitrary. A data structure supporting the following operations is called a dynamic dictionary.

insert(k, v): insert a new item into the database with key k and value v. If an item with key k already exists in the database, update its value to v.
delete(x): delete item x from the database (we assume x is a pointer to the item)
query(k): return the the value associated with key k, or null if key k is not in the database.

This is a dynamic data structural problem since we also need to support insertion and deletion. (Dynamic variants of data structure problems are ones that support the data set being updated.)

Hashing is a known method to solve this problem.

A hash family H is a set of functions h: {1,...,U} -> {1,...,m}. The idea is to maintain an array of size m where an item with key k is stored at the h(k)th position of the array. Unfortunately, due to collisions this would not work; Two different keys k, k' in a database may have h(k) = h(k').

Assume we initialize an array X of size a where X[i] stores the pointer to the head of a doubly linked list. We pick a hash h at random from a hash family. This way, X[i] will be a doubly linked list containing (key, value) pairs of all items whose key k satisfies h(k) = i. Basically, colliding items are stored in the same linked list.

For inserting a (k,v) pair you simply insert the pair to the head of the list at X[h(k)].

To query k you traverse the list at X[h(k)] until finding an item with key k or see that the key does not exist.

Deletion splices the item out of the doubly linked list when the key is found.

Collisions

Since a hash function translates all keys into indexes in an array, it is often the case that there will be many more keys than there are indexes. Therefore it is always a possibility that multiple keys will be translated into the same index, which is a collision. Generally speaking, collisions are unavoidable. Thus, hashing implementations must have a collision resolution policy. Collision resolution methods can be broken into two classes: open hashing and closed hashing. The difference between the two methods has to do with whether collisions are stored outside the table (open hashing) or at another slot in the same table (closed hashing).

The simplest way of implementing a form of open hashing is to define each slot in the hash table to be the head of a linked list. All records that hash to a specific slot are placed on its linked list. Ordering of records within a linked list of a slot can be done with insertion sort, key-value ordering, or frequency-of-access order. Ordering the list by key value provides an advantage in the case of an unsuccessful search, because we know to stop searching the list once we encounter a key that is greater than the one being searched for. If records on the list are unordered or ordered by frequency, then an unsuccessful search will need to visit every record on the list.

Open hashing is most appropriate when the hash table is kept in main memory, with the lists implemented by a standard in-memory linked list. Storing an open hash table on disk in an efficient way is difficult, because members of a given linked list might be stored on different disk blocks. This would result in multiple disk accesses when searching for a particular key value, which defeats the purpose of using hashing.

Closed hashing stores all records directly in the hash table. Each record R with key value K has a home position h(K) that is computed by the hash function. If R is inserted and another value already fills the computed h(K) then R will be stored at some other slot in the table. Which collision detection method is used determines which slot that may be.

Next: Collision Resolution: Bucket Hashing