Re: Cursing code

From: Jesse Becker (hawson@temperedweaves.com)
Date: 05/14/03


On Wed, May 14, 2003 at 06:57:41PM +0200, Templar Viper wrote:
> From: "Thomas Arp" <t_arp@stofanet.dk>
> > From: "Templar Viper" <templarviper@HOTMAIL.COM>
> > > A while ago, I fixed this code together. It checks the argument for
> > > cursing (placed in curse_list), and replaces it with a harmless beep. I
> > > had to use str_strr, as strstr is case sensitive. However, I want to
> > > ignore certain characters, such as spaces, full stops and more of those.
> >
> > I'm a bit unclear on your intent here;
> > Do you wish to make sure to catch "CUR SE", too? Or do you mean you have
> > 'multiword' curse words ?
>
> Yes, I do want to catch "CUR SE".

I had to help out with something like this a few years ago for a
semi-major online game from a semi-major online game company (both of
which shall remain nameless <grin>).  We didn't bother to filter
chat--that was nominally covered by the TOS agreement (<chuckle>), but
we did want to prevent people from choosing handles based on various
'illegal' words.

We did this:

1)  Generate a list of invalid words (BADWORDS), in plain English.[1]
2)  Generate a list of substitutions that could be used, stuff
    like 'l' => '1', and 'k' => '|<', and 'U' => 'V' or '|_|', etc.
3)  Permute BADWORDS against all of our mappings, and generate a HUGE
    list of 'invalid' names.
4)  On each name creation, check the proposed name against this list.
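
Steps 2 through 4 could be sketched in C roughly as follows. This is a
minimal illustration, not the code used at the time: the table SUBSTS, the
fixed-size variants array, and the functions add_badword() and
name_is_invalid() are all hypothetical names, and proposed names are
assumed to be lowercased before they are checked.

```c
#include <assert.h>
#include <string.h>

#define MAX_VARIANTS 4096   /* assumed capacity of the generated list */
#define MAX_LEN      64     /* assumed maximum length of one variant  */
#define MAX_ALTS     4

/* Hypothetical substitution table: one plain letter, up to MAX_ALTS
 * replacements (unused slots are NULL). */
struct subst {
    char plain;
    const char *alts[MAX_ALTS];
};

const struct subst SUBSTS[] = {
    { 'l', { "1" } },
    { 'k', { "|<" } },
    { 'u', { "v", "|_|" } },
};
#define NUM_SUBSTS (sizeof SUBSTS / sizeof SUBSTS[0])

char variants[MAX_VARIANTS][MAX_LEN];
size_t num_variants = 0;

/* Recursively expand `rest`, building the current variant in buf[0..len). */
static void expand(const char *rest, char *buf, size_t len)
{
    if (*rest == '\0') {
        assert(num_variants < MAX_VARIANTS);
        buf[len] = '\0';
        strcpy(variants[num_variants++], buf);
        return;
    }

    /* Option 1: keep the letter as it is. */
    assert(len + 1 < MAX_LEN);
    buf[len] = *rest;
    expand(rest + 1, buf, len + 1);

    /* Option 2: try every substitution listed for this letter. */
    for (size_t i = 0; i < NUM_SUBSTS; i++) {
        if (SUBSTS[i].plain != *rest)
            continue;
        for (size_t j = 0; j < MAX_ALTS && SUBSTS[i].alts[j]; j++) {
            size_t n = strlen(SUBSTS[i].alts[j]);
            assert(len + n < MAX_LEN);
            memcpy(buf + len, SUBSTS[i].alts[j], n);
            expand(rest + 1, buf, len + n);
        }
    }
}

/* Step 3: add a word and all of its permutations to the list. */
void add_badword(const char *word)
{
    char buf[MAX_LEN];
    expand(word, buf, 0);
}

/* Step 4: plain linear check of a proposed (lowercased) name. */
int name_is_invalid(const char *name)
{
    for (size_t i = 0; i < num_variants; i++)
        if (strcmp(variants[i], name) == 0)
            return 1;
    return 0;
}
```

With the harmless stand-in word "luck", add_badword("luck") yields 12
variants (2 choices for 'l', 3 for 'u', 1 for 'c', 2 for 'k'), so
name_is_invalid("1|_|c|<") is true while name_is_invalid("lucky") is not.
The list grows multiplicatively with each substitutable letter, which is
why it only makes sense to generate it ahead of time.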


Note 1:  This was, perhaps, the single most fun thing I've ever done in
a paying job. :-)  I also discovered just how twisted the minds of two
of my co-workers are...

It worked quite well, and the permutation, which was essentially an N*N
operation, only had to be done whenever we updated the BADWORDS list
(which wasn't very often...see note 1...).  The new checks added zero
CPU time overhead.  All of this was running on hardware that was already
old (HP 715 pizza boxes).

Now, if you want to filter on the fly, you have, at worst, a linear
search to perform per word.  Naturally, if you do some clever data
structure work, you can improve on this.  All of the patterns were for
matching whole words only, so we didn't have to worry about words like
"hassle".  You can also add a rule that permutes your words into
spaced-out versions and adds them to your checklist.
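
One way to improve on the linear search, sketched here under assumptions:
sort the generated list once with qsort() and look each whole word up with
bsearch(). The array bad_list and the functions sort_bad_list(),
word_is_bad() and censor() are hypothetical, the list contents are a
harmless stand-in, and words are split on whitespace only because the
substitutions themselves contain punctuation.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the big pre-generated list. */
const char *bad_list[] = { "luck", "1uck", "lvck", "l|_|ck" };
#define BAD_COUNT (sizeof bad_list / sizeof bad_list[0])

/* Comparator for an array of C strings, usable by qsort and bsearch. */
int cmp_str(const void *a, const void *b)
{
    return strcmp(*(const char *const *)a, *(const char *const *)b);
}

/* Call once after the list is (re)generated. */
void sort_bad_list(void)
{
    qsort(bad_list, BAD_COUNT, sizeof bad_list[0], cmp_str);
}

/* O(log n) lookup of one lowercased word in the sorted list. */
int word_is_bad(const char *word)
{
    return bsearch(&word, bad_list, BAD_COUNT,
                   sizeof bad_list[0], cmp_str) != NULL;
}

/* Overwrite every whole word found in the list with '*'.
 * Returns the number of words censored. */
int censor(char *msg)
{
    int hits = 0;
    char *p = msg;

    while (*p) {
        while (*p && isspace((unsigned char)*p))
            p++;
        char *start = p;
        while (*p && !isspace((unsigned char)*p))
            p++;

        size_t len = (size_t)(p - start);
        char word[64];
        if (len == 0 || len >= sizeof word)
            continue;               /* nothing to check, or too long */

        for (size_t i = 0; i < len; i++)
            word[i] = (char)tolower((unsigned char)start[i]);
        word[len] = '\0';

        if (word_is_bad(word)) {
            memset(start, '*', len);
            hits++;
        }
    }
    return hits;
}
```

After sort_bad_list(), censoring "good Luck with your lucky l|_|ck" leaves
"lucky" alone, since only whole words are matched, and stars out the other
two. A hash table or trie would do the same job; bsearch() is simply what
the standard library already provides.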

This might be a fair bit faster than recomputing each word on the fly,
especially if you have a good searching algorithm.
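
The spaced-out rule could be precomputed along these lines. This is only a
sketch: spaced_variant() and print_spaced_variants() are hypothetical
helpers, and words are assumed to be at most 16 letters so the gap
positions fit in a bitmask. A word of n letters has n-1 gaps, so it
produces 2^(n-1) variants per separator character.

```c
#include <stdio.h>
#include <string.h>

/* Write one spaced-out variant of `word` into `out`.
 * Bit i of `mask` set means: put `sep` after letter i.
 * Returns 0 on success, -1 if the word is too long or `out` too small. */
int spaced_variant(const char *word, unsigned mask, char sep,
                   char *out, size_t outsz)
{
    size_t n = strlen(word), o = 0;

    if (outsz == 0 || n > 16)
        return -1;

    for (size_t i = 0; i < n; i++) {
        if (o + 2 >= outsz)         /* letter + separator + terminator */
            return -1;
        out[o++] = word[i];
        if (i + 1 < n && (mask & (1u << i)))
            out[o++] = sep;
    }
    out[o] = '\0';
    return 0;
}

/* Print every variant of `word`, from "curse" up to "c u r s e". */
void print_spaced_variants(const char *word, char sep)
{
    size_t n = strlen(word);
    char buf[40];

    if (n == 0 || n > 16)
        return;
    for (unsigned mask = 0; mask < (1u << (n - 1)); mask++)
        if (spaced_variant(word, mask, sep, buf, sizeof buf) == 0)
            puts(buf);
}
```

For example, mask 4 on "curse" with a space separator gives "cur se".
Entries like that contain whitespace, so they would have to be matched
against the raw text (for instance with strstr) rather than word by word,
which brings back the "hassle" problem for short words.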

--Hawson

--
   +---------------------------------------------------------------+
   | FAQ: http://qsilver.queensu.ca/~fletchra/Circle/list-faq.html |
   | Archives: http://post.queensu.ca/listserv/wwwarch/circle.html |
   | Newbie List:  http://groups.yahoo.com/group/circle-newbies/   |
   +---------------------------------------------------------------+


