Re: Crash Recovery Techniques

From: StormeRider (silk@ici.net)
Date: 03/02/00


At 05:09 PM 3/1/00 -0800, you wrote:
>On Wed, 1 Mar 2000, Pat OLaughlin wrote:
>
> >         On my MUD, I implemented a very simple system that will use
> > signals to detect when a segmentation fault occurs and directly
> > afterwards will automatically "copyover" the MUD.  It also tells the
> > last command that was typed and the person who typed it.  This
> > system works great but it's not as extensive as I want it.
>
>This comes up every few months, and every time it comes up I say the same
>thing: this is not a good idea.  Corrupt data (buffer overflows, etc.)
>cause segmentation faults.  After a seg fault, the state of the memory
>associated with a process cannot be relied upon to be accurate, nor is it
>entirely clear how each individual operating system handles core dumps.
>It's very easy, then, to create a situation where the Mud either further
>corrupts data (especially data being written to files open for writing
>when the crash occurred) or gets into a loop of crashing on corrupt
>data.  What you really want to do is have data (world state) persist over
>boots of the game engine, and...
>
>Catching a segmentation fault is *NOT* the way to handle this -- the
>proper method is to adopt some sort of model of persistence, such as you
>would get with MySQL, and store world state separate of the Mud.  A
>full-featured persistent database storage system has features for
>transaction logging, recovery, and other vital aspects and will be *more*
>than suitable for your purposes.
>
>End result?  People would still get disconnected on crash (unless you
>moved the networking code and game engine apart, as I've recommended in
>times past), but on reconnection the game would appear to be more or less
>the same, depending upon when the world state was last updated.
>
> > Another cool feature is the players wouldn't even know the crash
> > happened.  The world would be restored to its original state.
>
>Better to spend your time fixing the crash bugs than making them
>transparent, and then approach persistence for better reasons (i.e.,
>creating a truly dynamic world, rather than a static one which runs in
>cycles based upon zone reset).
>
>-dak

Ultimately speaking, you're right. However, the amount of work involved in
separating the networking code and moving to a system like the one LP uses,
with a mudlib and a driver, is vastly greater than simply using fork to
exec and then dumping core in the old process. Yes, you do not achieve the
full persistence of the other model, but you can also do it in the space
of about half an hour.

While it is true that this approach isn't perfectly reliable (sometimes
memory is too corrupted for gdb to track the problem perfectly in the
dumped core file), it is more reliable than one would think. I say this
from experience: adding this type of crash recovery and automatic
backtracing has proved invaluable in tracking down a lot of problems with
the ROM MUD I work on, and at the same time it's fairly friendly to
players. When I put it in, our average player base shot up by about 10
people (from an average count of 20 to 30).

(*Side note: after a period of server hardware instability and a pwipe as
a result, the average peak-time player base has moved up to close to 40.
Odd how that happened...*)

--SR




