Re: Using lex and yacc for parsing

From: Mark Coletti (mcoletti@clark.net)
Date: 10/31/95


Yo!  (Again!)

	A couple of folks have asked me what all this "lex" and "yacc"
stuff is all about.  Still others have wondered about whether "lex" or
"yacc" are relevant to CircleMUD.  Well, instead of tailoring a
bazillion individual replies, I've decided to just briefly describe
what the hey "lex" and "yacc" are and (hopefully) convey just exactly
how useful it would be in CircleMUD.  So, here goes!  =8-)

	I'll first discuss lex, and then yacc.

	Lex is a "lexical analyzer."  It's used to take an arbitrary
stream of characters and convert that stream into tokens.

	I wrote a simple lexer in lex in five minutes that would parse
the CircleMUD movement directions for "north," "south," "east," and
"west."  I've included the source below.  (And, no, I don't expect you
to understand it, but I just wanted to show you how easy to understand
a lex file is. =8-)


----->8  snip! 8<--------

%{
	/* Sample lex script for tokenizing directions

	/* C declarations and includes go here */

#include <stdio.h>

%}

%%

[\t ]+	/* this regular expression pattern will ignore whitespace */

n |
no |
nor |
nort |
north	{ printf("Going to the Great White North, a?\n"); }

s |
so |
sou |
sout |
south	{ printf("Going south for the winter!\n"); }

e |
ea |
eas |
east	{ printf("Go to where the sun rises!\n"); }

w |
we |
wes |
west	{ printf("Go west, young man!\n"); }


[a-zA-Z]+	{ printf("Eh, wot?\n"); /* bogus input */ }


.|\n	{ ECHO; /* normal default anyhoo */ }


%%


int
main()
{
	yylex();	/* this starts the lexer, which will start reading
			   stdin and echoing to stdout */

	exit(0);

}


----->8  snip! 8<--------



	I compiled, linked and ran it as follows:


	# this is the lex file with the above source

resdgw17 ~/tmp/src 48 > flex direx.l


	# running flex, which is a public domain version of lex,
	# produces the following C source code file (I chose flex,
	# because it generates superior lexers and has fewer bugs)

	
resdgw17 ~/tmp/src 50 > ll lex*
-rw-rw-r--   1 mcoletti general    37123 Oct 31 10:56 lex.yy.c


	# now I just compile and link it

resdgw17 ~/tmp/src 52 > gcc lex.yy.c -o direx -ll


	# and now I can run it!

resdgw17 ~/tmp/src 54 > direx
n
Going to the Great White North, a?

s
Going south for the winter!

e
Go to where the sun rises!

w
Go west, young man!

nor
Going to the Great White North, a?

east s west nor
Go to where the sun rises!
Going south for the winter!
Go west, young man!
Going to the Great White North, a?

bogosity
Eh, wot?


	Well, that's cute, but somewhat useless.  What we need next is
a _parser_ that will take tokens issued by the lexer and, well, parse
them.  :)  That is, the lexer will read in a stream of characters,
convert those characters into tokens, and pass those tokens up to the
parser.  The parser, in turn, will do something meaningful with those
tokens.

	As you've probably already guessed, yacc builds parsers just
as lex (or flex) builds lexers.  However, before diving into a yacc
file, I need ta make a few niggling changes to the lexer so that it
actually returns tokens instead of printing out what it found.  The
changes follow here:


----->8  snip! 8<--------
%{
	/* Sample lex script for tokenizing directions

	/* C declarations and includes go here */

	/* we need to get the token definitions from yacc,
	   which are found in this header file that's automagically
	   generated by yacc (or bison) */

#include "y.tab.h"

%}

%%

[\t ]+	/* this regular expression pattern will ignore whitespace */

n |
no |
nor |
nort |
north	{ return NORTH; }

s |
so |
sou |
sout |
south	{ return SOUTH; }

e |
ea |
eas |
east	{ return EAST; }

w |
we |
wes |
west	{ return WEST; }


[a-zA-Z]+	{ printf("bogus direction %s\n", yytext); /* bogus
input */ }


[0-9]+	{ yylval.ival = atoi(yytext); return NUMBER; }


.|\n	{ ECHO; /* normal default anyhoo */ }


%%

----->8  snip! 8<--------

	There are a few subtle changes I've made to this lexer.
First, I now return tokens instead of printing stuff.  I've also added
a new token, NUMBER, which returns an integer to the parser.  You'll
see why I added that in a minute.  ;)


	Okiley dokiley, now for, ta daaaa! the parser.  Again, I just
put this up for illustrative purposes.  I don't intend for this to be
a for real parser that will be used in CircleMud; this is something I
kludged up over 'bout a half hour or so.

----->8  snip! direx.y 8<--------
%{
#include <stdio.h>
%}

%union {
  int ival;
}

%token NORTH SOUTH EAST WEST
%token <ival>NUMBER
%right NUMBER

%%

direction_list:		direction
	|		multi_direction
	|		direction_list direction
	|		direction_list multi_direction
	;

direction:	NORTH	{ printf("you move north\n"); }
	|	SOUTH	{ printf("you move south\n"); }
	|	EAST	{ printf("you move east\n");  }
	|	WEST	{ printf("you move west\n");  }
	;

multi_direction:	direction NUMBER { printf(" %d times\n", $2);
}
	;	
%%

----->8  snip! direx.y 8<--------

	In this yacc file I've defined a grammar that consists of
direction tokens for each of the cardinal directions and another token
for numbers.  I've also defined a union type that will store any
integers the underlying lexer runs across.

	The grammar is very simple.  It consists of a list of
directions.  Directions themselves can be merely any one of the
cardinal directions, or a cardinal direction _and_ a number; the
number represents how many times you want to move in that direction.

	Ok, now to build the sucker:

% flex direx.l 
% bison -d --yacc direx.y  
% gcc y.tab.c lex.yy.c -o direx -ly -ll

	I used ``bison'' which is the GNU version of yacc.  This is
preferable because it generates a better parser, gives better
diagnostics, and apparently supports re-entrant parsers (which is big-time
relevant to CircleMUD).

	Ok, now to run it!

% direx
n s e w
you move north
you move south
you move east
you move west

n 1 sout 2 e wes 3
you move north
 1 times
you move south
 2 times
you move east
you move west
 3 times

foo
bogus direction foo

bar ba
bogus direction bar
bogus direction ba

4
parse error
Exit 1


	Hopefully I've shown that lex and yacc (or their equivalents)
are way useful at building parsers.  In a matter of about half an
hour, I was able to cobble together a very, very simple parser that
would've taken much longer to write out by hand.  Better yet, the
generated parser is more efficient than one I could write in a
reasonable amount of time.  To quote from _Lex and Yacc_ by John
Levine, Tony Mason and Doug Brown, "A lex lexer is almost always
faster than a lexer you might write in C by hand." (pg. 1) Also, it's
certainly _much_ easier to understand than the one found (somewhere)
in CircleMUD, IMHO.  =8-)

	I hope this helps!

Mark!
-- 
Mark Coletti                       |  DBA Systems, Inc.  Fairfax, VA
mcoletti@clark.net                 |  United States Geological Survey
http://www.clark.net/pub/mcoletti  |  Office of Standards & Technology
    A toast to bread, for without bread, there could be no toast.



This archive was generated by hypermail 2b30 : 12/07/00 PST