Re: Using lex and yacc for parsing

From: Mark Coletti (mcoletti@clark.net)
Date: 10/30/95


Jeremy Elson pounded furiously on the keyboard:

	[...]

> Lex is a great tool, but I have avoided including a lex-built parser for
> a number of reasons.  Mainly, because the efficiency gains in using a
> lex-built parser are not worth the complexities it introduces.

	I disagree.  I feel that using lex and yacc reduces the overall code
complexity.

> For example, in your example lex file, you had to enumerate

> n
> no
> nor
> north
> 
> For north!  This is complex.  

	I re-emphasize that the code I posted earlier was for illustrative
purposes.  All the examples were something that I cobbled together in little
over half an hour just to show lex and yacc's capabilities.  I didn't intend
for it to be a real basis for a lexer and parser for CircleMUD.

	As to being complex, I heartily disagree!  In this case, the possible
permutations for "north" commands are explicitly stated.  Nowhere in the 1500+
lines of ``interpreter.c'' did I see a way to automagically come up with
shortened prefixes for commands.  And, yet, in four lines I've explicitly
done so in a way that, to me, is intuitive.  :)
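	For reference, those four lines would look roughly like this in a
lex specification (the NORTH token name is illustrative, not from any real
CircleMUD lexer):

```lex
n       |
no      |
nor     |
north   { return NORTH; }
```

	The `|' action means "use the same action as the next rule," so all
four spellings yield the same token.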


> What if you have 400 commands?

	Then you might have too many commands?  ;)

	Seriously, though, the more appropriate (and what's typically done)
method would be to have a core set of tokens that the lexer explicitly looks
for, with the remaining keyword set being in some form of symbol table.  The
book I mentioned actually goes on to illustrate just such a set-up.

	Granted, CircleMUD already has a symbol table of sorts, cmd_info,
located in ``interpreter.c''.  However, the structure of that table is not
clearly defined, requiring the programmer to reverse engineer the parsing
mechanism to deduce the underlying semantics of its various fields.

	(I've just re-read the preceding paragraph.  I've been working with
the government for too long!  Aiee!)

> How will you handle conflicting commands?

	Conflicting command syntax is a sign of linguistic ambiguity, which
is a badism.

	"Remember that conflicts mean that yacc can't properly parse a
	grammar, probably because it's ambiguous, which means that there are
	multiple possible parses for the same input.  ... this usually points
	to a mistake in your language design.  If a grammar is ambiguous to
	yacc, it's almost certainly ambiguous to humans, too."  
		pg. 64, _Lex & Yacc_

> You'd have to do something like..
> 
> n -> north
> no -> north
> nor -> north
> nos -> noshout
> nort -> north
> north -> north
> nosh -> noshout
> nosho -> noshout
> noshou -> noshout
> noshout -> noshout

> What would you have to do if you decide that "n" means "noshout" instead
> of north?  What if you have 15 different commands that start with n?

	Lex and its derivatives always match against the longest rule.  So:

n |
no	{ return NORTH; /* <1> */ }

noshout	{ return NOSHOUT; /* <2> */ }

	A string of "noshout" will always match the second rule even though
the first two characters, 'no', fully match the first rule.  Again, this
is because lex always prefers the longest possible match.

	Besides, I'm leaning towards having only two strings match for the
NORTH token: 'n' and 'north'.  If you want any other variations, you can use
an alias to create them.   And, for toggling various switches, I'd have a
general language construct like this:

	no { SHOUT | TELL | GOSSIP }

	If you wanted "nos" to stand for "no shout," then you can make that an
alias.

	I'm not hung up on this; I'm still working out the command language
syntax.
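	As a sketch, the yacc side of that construct might look something
like the following -- every token and function name here is an assumption,
since the syntax isn't settled:

```yacc
toggle_cmd : NO channel          { toggle_channel($2); }
           ;

channel    : SHOUT               { $$ = CHAN_SHOUT;  }
           | TELL                { $$ = CHAN_TELL;   }
           | GOSSIP              { $$ = CHAN_GOSSIP; }
           ;
```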


> CircleMUD's parser is not efficient at all but it gets the job done and it's
> easy to change.  (In 3.0 you just change the order the terms appear in the
> list to change their precedences.)

	It might be (sorta) easy to add new commands to the command symbol
table, but it's not easy to _change the grammar_.

	For example, when I was dreaming up the earlier examples, I
arbitrarily decided to extend the example grammar to allow an optional numeric
parameter that specified the number of moves to make in a given direction.  It
took just a few minutes to add in the NUMBER token and its support, and to
extend the yacc grammar definition to accommodate specifying the number of
moves.  I've looked through ``interpreter.c'' and I don't know where even to
begin to make the necessary changes to do the same thing.  :-\
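	The extension amounted to adding one alternative to the movement
rule -- roughly the following, with do_move() as a stand-in for the real
action code:

```yacc
move_cmd  : direction            { do_move($1, 1);  }
          | NUMBER direction     { do_move($2, $1); }
          ;

direction : NORTH | SOUTH | EAST | WEST ;
```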

	In short, adding new keywords and their respective support functions
seems to be relatively easy in CircleMUD; modifying the command line _grammar_
is a different story altogether.  

	Besides, it's safe to say that more programmers understand lex and
yacc than know how CircleMUD's parser works -- if the parser were written
using lex and yacc, newbie programmers would be more likely to get up to
speed on the code, as they're more likely to have some familiarity with lex
and yacc.  Anyhoo, they can always run down the street to the local Borders
or whatever and get a lex and yacc textbook; they can't do the same for the
CircleMUD parser.

> Many people have said in the past that the CircleMUD parser is inefficient
> and should be changed.  I agree that it's ineffeicient but I do not think
> it should be changed because it's simple and it's efficient *enough*.
> Have you ever profiled a running CircleMUD?  If you do, you'll find an
> incredibly small amount of time (say, 0.5% or something, if I remember
> correctly) is spent in command_interpreter() and its subsidiaries.
> 
> If you make CircleMUD's parser *10 times* more efficient, you'll make
> CircleMUD 0.45% faster.  Is that really worth it?

	Apparently my mistake in the original post was to over-emphasize the
parsing speedup that using lex would garner.  You're absolutely right -- in
fact, the principle is called Amdahl's Law.  Why spend an inordinate amount of
time optimizing code that consumes an insignificant share of the cycles?  If
your goal is truly to speed up program execution, then you should concentrate
on the "hot spots" where your program is spending most of its time.  However,
my goal in using lex and yacc is not necessarily to speed up CircleMUD
(although that'd be a likely by-product), but to reduce the source code's
complexity and to increase its scalability.

> Besides, CPU time is cheap -- people's time is *not* cheap.  If I've
> learned anything about software engineering, it is that your goal should
> be to reduce the amount of work *people* have to do or the amount of time
> people have to wait, not the amount of work computers have to do.  Computers
> don't care.

	... and it's ironic that you should mention this as this is primarily
the reason why I wanted to re-write the CircleMUD command line parser using
lex and yacc!  =8-)

	I looked through ``interpreter.c'' and had great difficulty making
heads or tails of _exactly_ how it worked.  There are a number of ancillary
lexing/parsing functions whose purpose is unclear and whose relationships with
the rest of the program are difficult to determine.  If something goes awry in
any of the parsing functions, I'm not sure where to start looking for the
cause of the problem.

	I do note that a lot of the work done by some of the lexing functions
is handled implicitly by lex.  Using lex would therefore render a lot of
existing code obsolete -- that alone would reduce the parser's complexity.

> In other words, if you want to rewrite a huge chunk of CircleMUD code in
> a way that's more complex than it already is, your time would be much
> better spent attacking a piece of the code that could benefit much more
> from improvement -- say, the way mobile specprocs are called every 10
> seconds, or changing the specproc structure over to being event driven,
> or some such.

	Before making significant enhancements to a program, the underlying
infrastructure should be sound.  I've worked on adding enhancements to crufty
code before, and trust me, it's an arduous process.  =8-)

Although CircleMUD has its good sides, it needs some basic things done first
before new features are added:

	o Makefile should not have explicit dependencies; ``makedepend'' or
	  gcc -M should be used to automagically generate them
	o Makefile should rely as much as possible on implicit pattern
	  matching rules to eliminate redundancy and complexity
	o Add header file sentinels
	o Reduce dependency on pre-processor wherever possible
	o Simplify header files by moving as much as possible to
	  implementation files
	o Introduce opaque types to reduce code complexity
	o Rely on lex and yacc for command line parsing
	o Re-visit the command language grammar and its implementation
	o Rely on lex and yacc for world, zone, shop, and object file parsing
	o Re-visit the world, zone, shop, and object file language syntax


	I don't mean for this to sound like an out-and-out diatribe on
CircleMUD.  It's a good program with a lot of potential -- as you've
emphasized in the documentation, now that the program works, optimization and
polishing can begin.  Working on the above points before adding features
would go a long way towards achieving those goals.  Then again, that's JMHO
and YMMV.  ;-)


Mark!
-- 
Mark Coletti                       |  DBA Systems, Inc.  Fairfax, VA
mcoletti@clark.net                 |  United States Geological Survey
http://www.clark.net/pub/mcoletti  |  Office of Standards & Technology
code as if whoever maintains your code is a violent psychopath who knows where
				   you live



This archive was generated by hypermail 2b30 : 12/07/00 PST