Re: [ADVANCED] Hash-table questions [long]

From: Eric Green (egreen@cika.cypronet.com)
Date: 03/03/97


> > *Optimal* hash table sizes, or functions, only exist in theory...
> > However, there are a few things to think about when choosing the size.
> > 
> > As a general rule, it should be a prime number far from any power of 2.
> 
> THis is what I really want to understand.  I can write all the code just
> fine, and I can use primes for hash keys, but what I want to know is:
> why? I've seen one other person recommemd a prime far from 2^n, but I've
> seen many more people recommend using a prime 2^n plus or minus 1, as in
> 7, 17, 31, etc.
> 
> I haven't seen any explanation for the difference in methods, and I can't
> give it a decent test until I have a signifigant number of spells (I have
> 0 at the moment).

A good hash function is one that each key is equally likely to hash
to any of the slots.  When using a division method hash function, this
means trying to find the set size as unrelated to the probability
distribution of the set of items you will be hashing.

Without knowing the set of items to be hashed, a prime number for the
size of the hash table is a good choice, since the chances of the probability
distribution having a similar "factor" is the lowest (there is only one
factor in a prime number).

A power of two is bad for some distributions, because it has the affect
of only caring about the lowest order bits of the key.  For example,
if h(k) = k mod 16, only the 4 lowest order bits make a difference in
where the key is hashed.  All other bits are irrelevant.

Simiarly, a power of 10 is a bad choice if the application deals
with decimal digits.  The higher order digits are ignored in such
hash functions.

A set size of 2^p - 1 is concidered a poor choice when the key is
a character string interpreted as a radix 2^p, since two strings hash
to the same bucket when the strings are the same except for two adjacent
characters are transposed.


The general rule of "a prime number not too close to powers of two" will
usually work well when the distribution of the set of keys is unknown,
because it avoids the above problems.  If you do know something about the
distribution, violating this rule may improve your hash functions.

One reference i own which goes into some details about choosing a good hash
function is "Introduction to Algorithms", but Cormen, Leiserson, and Rivest.
If you are looking for furthur information on hashing, check for this or
similar titles in your library.


Eric
thrytis@imaxx.net
+-----------------------------------------------------------+
| Ensure that you have read the CircleMUD Mailing List FAQ: |
|   http://cspo.queensu.ca/~fletcher/Circle/list_faq.html   |
|    Or send 'info circle' to majordomo@cspo.queensu.ca     |
+-----------------------------------------------------------+



This archive was generated by hypermail 2b30 : 12/18/00 PST