Page 194 - DCAP605_ADVANCED_DATA_STRUCTURE_AND_ALGORITHMS
P. 194
Unit 9: Hashing
9.1 Hashing Notes
In many applications we require to use a data object called symbol table. A symbol table is nothing
but a set of pairs (name, value). Where value represents collection of attributes associated with the
name, and this collection of attributes depends upon the program element identified by the name.
For example, if a name x is used to identify an array in a program, then the attributes associated
with x are the number of dimensions, lower bound and upper bound of each dimension, and the
element type. Therefore a symbol table can be thought of as a linear list of pairs (name, value),
and hence you can use a list of data object for realizing a symbol table. A symbol table is referred
to or accessed frequently either for adding the name, or for storing the attributes of the name, or
for retrieving the attributes of the name. Therefore accessing efficiency is a prime concern while
designing a symbol table. Hence the most common way of getting a symbol table implemented
is to use a hash table. Hashing is a method of directly computing the index of the table by using
some suitable mathematical function called hash function. The hash function operates on the
name to be stored in the symbol table, or whose attributes are to be retrieved from the symbol
table. If h is a hash function and x is a name, then h(x) gives the index of the table where x along
with its attributes can be stored. If x is already stored in the table, then h(x) gives the index of the
table where it is stored to retrieve the attributes of x from the table. There are various methods of
defining a hash function like a division method. In this method, you take the sum of the values of
the characters, divide it by the size of the table, and take the remainder. This gives us an integer
value lying in the range of 0 to (n – 1) if the size of the table is n. The other method is a mid
square method. In this method, the identifi er is fi rst squared and then the appropriate number
of bits from the middle of square is used as the hash value. Since the middle bits of the square
usually depend on all the characters in the identifier, it is expected that different identifi ers will
result into different values. The number of middle bits that you select depends on the table size.
Therefore if r is the number of middle bits that you use to form hash value, then the table size
will be 2r. Hence when you use this method the table size is required to be power of 2. Another
method is folding in which the identifier is partitioned into several parts, all but the last part
being of the same length. These parts are then added together to obtain the hash value.
To store the name or to add attributes of the name, you compute hash value of the name, and
place the name or attributes as the case may be, at that place in the table whose index is the hash
value of the name. For retrieving the attribute values of the name kept in the symbol table, I apply
the hash function to the name to obtain index of the table where you get the attributes of the
name. Hence you find that no comparisons are required to be done. Hence the time required for
the retrieval is independent of the table size. Therefore retrieval is possible in a constant amount
of time, which will be the time taken for computing the hash function. Therefore hash table seems
to be the best for realization, of the symbol table, but there is one problem associated with the
hashing, and it is of collisions. Hash collision occurs when the two identifiers are mapped into
the same hash value. This happens because a hash function defines a mapping from a set of valid
identifi ers to the set of those integers, which are used as indices of the table. Therefore you see
that the domain of the mapping defined by the hash function is much larger than the range of the
mapping, and hence the mapping is of many to one nature. Therefore when I implement a hash
table a suitable collision handling mechanism is to be provided which will be activated when
there is a collision.
Collision handling involve finding out an alternative location for one of the two colliding
symbols. For example, if x and y are the different identifiers and if h(x) = h(y), x and y are the
colliding symbols. If x is encountered before y, then the ith entry of the table will be used for
accommodating symbol x, but later on when y comes there is a hash collision, and therefore
you have to find out an alternative location either for x or y. This means you find out a suitable
alternative location and either accommodate y in that location, or you can move x to that location
LOVELY PROFESSIONAL UNIVERSITY 189