Re: Crash Recovery Techniques

From: Dan Merillat (harik@chaos.ao.net)
Date: 03/03/00


Pat OLaughlin writes:
>         On my MUD, I implemented a very simple system that will use
> signals to detect when a segmentation fault occurs and directly
> afterwards will automatically "copyover" the MUD.  It also tells the
> last command that was typed and the person who typed it.  This
> system works great but it's not as extensive as I want it.
>         I once IMM'd on a MUD that had a very nice crash
> recovery system that would somehow recover from the crash and
> display a GDB type output that would show what section of code
> the MUD crashed at on syslog.  Another cool feature is the players
> wouldn't even know the crash happened.  The world would be
> restored to its original state.
>         If anyone has done some extensive crash recovering I'd really
> like to know.

Well, most crashes don't occur in 'core' code unless you're modifying
things like db.c or handler.c (why?).  Usually someone messes up an assignment
or a pointer dereference in a do_spell or do_command function.

With that in mind, a copyover works like this:

The MUD tromps on memory it shouldn't (0x0, usually) and gets a SIGSEGV.
Your signal handler catches it and begins the fun part.

The game fork()s.  The child prints the contents of the interpreter buffer to syslog,
then calls core_dump().

The parent:
The game is saved, all of it: all items, mobs, whatever.  Save all descriptor
data, all character data, all fighting data.

Now exec the MUD with an extra argument: --crashfile=bla (where bla is the file
you saved all this to).  Your sockets stay open across an exec (unless they are
marked close-on-exec), so nobody gets disconnected.
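
A rough sketch of the exec step, with made-up helper names.  It assumes
you call keep_open_across_exec() on every player socket first, and that
the crashfile records which fd number belongs to which player so the
new process can pick them up again.

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Clears close-on-exec so fd survives the exec.  0 on success, -1 on error. */
int keep_open_across_exec(int fd)
{
    int flags = fcntl(fd, F_GETFD);

    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFD, flags & ~FD_CLOEXEC) < 0 ? -1 : 0;
}

/* Builds "--crashfile=<path>" in buf.  Returns -1 if it does not fit. */
int format_crash_arg(char *buf, size_t len, const char *path)
{
    int n = snprintf(buf, len, "--crashfile=%s", path);

    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Replaces the running image.  Only returns (with -1) if something failed. */
int reexec_with_crashfile(const char *binary, const char *path)
{
    char arg[512];

    if (format_crash_arg(arg, sizeof(arg), path) < 0)
        return -1;
    execl(binary, binary, arg, (char *)NULL);
    return -1; /* exec failed */
}
```

Something like reexec_with_crashfile("bin/circle", "lib/crash.sav")
would do it; the crashfile path here is only an example.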

Upon seeing --crashfile, rather than loading the zones/players from their normal
files, the MUD loads everything from the crashfile.  Toss it back into the main
game loop and you're ready to go.
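
Spotting the argument at startup could be as simple as the following.
find_crashfile_arg() is a made-up name, and how this fits with your
existing option parsing is up to you.

```c
#include <stddef.h>
#include <string.h>

/*
 * Returns the path given after "--crashfile=", or NULL if the option
 * is absent (meaning: do a normal boot from the usual files).
 */
const char *find_crashfile_arg(int argc, char **argv)
{
    static const char opt[] = "--crashfile=";
    const size_t optlen = sizeof(opt) - 1;
    int i;

    for (i = 1; i < argc; i++)
        if (strncmp(argv[i], opt, optlen) == 0)
            return argv[i] + optlen;
    return NULL;
}
```

In main(), a non-NULL result would select the crashfile loader instead
of the normal zone/player loading.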

I'd recommend letting all gods (at least) know that the game restarted.  Perhaps
send to the players something like "the world shimmers, then solidifies again".

As for GDB, from reading the manpage...
run 'gdb mud -c corefile -batch -x commandfile' with output piped back to the syslog
file.

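Modify core_dump() to do something like this:

/* Needs <stdlib.h>, <unistd.h>, <sys/types.h> and <sys/wait.h>.
 * MAX_FDS and syslogfd are placeholders: the number of descriptors to
 * close and the descriptor of the syslog file. */
{
        pid_t pid;
        int i;

        if ((pid = fork()) < 0)
                return 0;  /* fork failed: give up on the dump */
        else if (pid > 0)
                return 0;  /* parent: back to whatever it was doing */

        /* child from here on */
        for (i = 0; i < MAX_FDS; i++)
                if (i != syslogfd)
                        close(i);

        dup2(syslogfd, 1);
        dup2(syslogfd, 2); /* point stdout/stderr to the syslog file descriptor */
        if ((pid = fork()) < 0)
                _exit(0);  /* can't return, since we're a child.  No core for you. */
        if (!pid) { /* grandchild... */
                abort(); /* dumps core */
        }
        waitpid(pid, NULL, 0); /* waits for the grandchild to die */
        execlp("gdb", "gdb", "bin/circle", "-c", "core", "-batch", "-x", "corecommands", (char *)NULL);
        _exit(0); /* not reached, but if exec fails... */
}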

Now, for the 'corecommands' file, I'd just put something simple like

directory /home/circle/src
bt

for general-purpose debugging (gdb command files take one command per line).
If you've got a command that you think is bombing, perhaps print some suspect
structures as well...

print buf
print arg
print *ch
print *obj
print *vict

... will work in large parts of the code.  If it doesn't, it'll just log "no symbol bla".

Don't even try this code verbatim, BTW.  I wrote this in an email client, so
I'm pretty sure it won't even compile.  The exec line is probably wrong, and
I didn't look up how to find the fd for syslog.  However, it's an idea.
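
On the syslog fd: if the log is written through a stdio stream (an
assumption about your codebase; log_stream_fd() is a made-up name),
fileno() gives you the descriptor behind it.

```c
#include <stdio.h>

/* Returns the descriptor behind a stdio log stream, or -1 if there is none. */
int log_stream_fd(FILE *logfile)
{
    if (logfile == NULL)
        return -1;
    fflush(logfile); /* push buffered log lines out before sharing the fd */
    return fileno(logfile);
}
```

If the MUD logs to stderr, this just hands back fd 2.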

--Dan

