Re: [ADVANCED] Hash-table questions [long]

From: Johan Eenfeldt (d95joe@csd.uu.se)
Date: 02/28/97


On Fri, 21 Feb 1997, Sammy wrote:

> I'm just starting to play with hash tables, and I seem to have figured
> them out, but I was wondering if anyone could help me optimize my tables.
> All I know of them is what I remember from reading about them once several
> months ago, and I remember a little from a discussion on this list a long
> time ago.

Hashtables are very good datastructures for just about anything. Here
comes a (short) description of it for those who haven't heard of about it
earlier...

A hashtable is somewhat like an array, but instead of looking directly in
slot x when looking for something with that number, we look in

> Finding a spell by name in a list of 1000 strings is the problem.  My
> solution is to use a hash table based on the first three letters in the
> spell name and the length of the name, and I'm using 200 as the hash
> modulus (if that's the correct term).  If more than one spell falls under
> the same hash location, they're in a sequential linked list.

With 1000 entries, I would recommend making the table bigger. You get
faster lookups when less values map to the same slot, and the table
doesn't take much memory.

Now, about the mapping of a string to the key value - the hash function -
a few notes, most rather natural...

The goal is for the function to map the strings as evenly as possible to
numbers, minimizing the chanse that different strings map to same value.

If you are pretty sure that strings of same length won't have, or seldom
will have, the same first three letters, this is an example of a simple
hashfunction to do this (writen in mailer, take care...):

/* Sample simple hashfunction to map strings to key value -jee- */
long
hash_key_value(char *string)
{
  int i;
  long key = 0;

  if (!string)
    return 0;

 /* Bit pattern of key:
  * 0-7   ASCII value of first char
  * 8-15  ASCII value of second char
  * 16-23 ASCII value of third char
  * 24-31 length of string
  */

  key += (MIN(strlen(string), 255) << 24);

  for (i = 0; *string && i < 3; string++, i++)
    key += (*string << (8 * i));

  return key;
}

It is possible to optimize this quite a lot. For one thing we know that
some ASCII values will never be used. If you end up with lots of strings,
and many identical keyvalues, you could for example use something like
this instead of just taking the ASCII code right of, which decrease the
number of bits used by a char to... 5 or so.

char
hash_get_char_value(char c)
{
  if(c >= 'a' && c <= 'z')
    return (c - 'a');
  else
    return ('z' -  'a' + 1);
}

Note that, depending on size of table, all strings with same first letter
will map to same slot.

> Is there a way to pick an optimal hash mod for this system?  I remember
> someone on the list way back saying that prime numbers should be used, but
> I'm not sure why.  I know I can optimize by repetitive building of the
> table with various values, but if someone could make some sense of the
> prime number idea, I'd appreciate it.

*Optimal* hash table sizes, or functions, only exist in theory...
However, there are a few things to think about when choosing the size.

As a general rule, it should be a prime number far from any power of 2.

It is possible to get better performance by analyzing what the key values
in use are, but I'd say this is so much waste of time here... ;-)

> Uh, I hope I didn't just ask a FAQ question.

Hardly.

--
/ Johan Eenfeldt
  Student, Computing Science Department @ the University of Uppsala
  d95joe@csd.uu.se

+-----------------------------------------------------------+
| Ensure that you have read the CircleMUD Mailing List FAQ: |
|   http://cspo.queensu.ca/~fletcher/Circle/list_faq.html   |
|    Or send 'info circle' to majordomo@cspo.queensu.ca     |
+-----------------------------------------------------------+



This archive was generated by hypermail 2b30 : 12/18/00 PST