Jeremy Elson's Toy Compiler

This archive contains the source code for the Toy compiler that I
wrote in Spring of 1994 as part of my advanced compiler writing class
at Johns Hopkins, taught by Dr. Scott Smith.  The "doc" directory
contains various class handouts that I received, describing how the
compiler was supposed to work and tips on its implementation.
Unfortunately, I seem to have lost the handout that formally describes
the language itself.  The "samples" directory contains samples of Toy
code that we used to test our compilers.  The "src" directory contains
the source code of my compiler.

Everything below the line is the original text that I submitted along with
my final code, which is in the src directory.

In case you're wondering, I got a B+.

---------------------------------------------------------------------------
---------------------------------------------------------------------------

README FILE

Jeremy Elson
Advanced Compiler Writing, 600.428
Dr. Scott Smith
May 1994

Preamble
---------------------------------------------------------------------

My toy compiler is a full-featured 428 toy compiler which meets or
exceeds all specifications of the 428 compiler assignment.

I have tested it on all examples in server:~scott/examples428.  For
all examples, the toy compiler produces valid C code which is compiled
silently by the C compiler to produce a binary.  No test cases were
able to cause the compiler itself to crash, nor could they cause it
to generate invalid C code or code which generates warnings from the
C compiler.

Most programs, when compiled with the C compiler, ran flawlessly.  The
one notable exception is the infamous put3.toy program, which caused
a segmentation fault when it was run.  This may indeed be due to a bug
in the toy program (not the compiler), but I haven't looked at it
carefully enough to tell.


Extra Credit
---------------------------------------------------------------------

There are a number of aspects of my compiler which I believe are worthy
of extra credit.  They are pointed out briefly in this section and detailed
in the document below.

o  Optimization: addresses are not loaded into temporaries unless necessary.

o  Use of three distinct addressing modes: ADDRESS, LITERAL, and IMMEDIATE.

o  A very powerful, flexible, easily extensible type system.

o  Automatically generated comments

o  Highly encapsulated coding style with data-hiding to make modification
   of the compiler easier.


Table of Contents
---------------------------------------------------------------------

1. User Manual
===Files Used
===Compiling the Compiler
===Running the Compiler

2. A Peek Under the Hood
===Overview
===The Type System
===The Symbol Table
===Keeping Functions Straight
===The Parse Tree
===Binding Functions at Run-Time
===Activation Records and Function Calling
===Optimization of Temporaries
===Automated Comments

1.  User Manual
---------------------------------------------------------------------

===Files Used

README		- This file
Makefile	- Controls the compilation of the program
lexgen		- Text file used as lex input
yaccgen		- Text file used as yacc input
main.c		- Short file containing the compiler driver.
symtable.c	- Functions containing symbol table routines
gencode.c	- Functions containing code generation routines
toy-preamble.c	- Text file copied verbatim to the beginning of
		  compiled programs
toy-postamble.c	- Text file copied verbatim to the end of compiled
		  programs
toy.h		- Header file for toy compiler (manually written)
tuples.h	- Translation from tuples to C code
toy		- Shell script for compiling toy programs


===Compiling the Compiler

Simply typing 'make' compiles the compiler.  It runs lex and yacc using
lexgen and yaccgen as input files, respectively.  yacc is run with a
command-line switch to generate the file y.tab.h which contains (arbitrary)
integer values for each of the compiler's tokens.

Once lex and yacc have been run, their resulting .c files are compiled,
along with symtable.c, main.c and gencode.c.  The 5 object files are then
linked together to form the final toyc compiler.


===Running the Compiler

toyc expects a valid toy program from standard input and produces the file
toymain.c as output.  toymain.c is (very simple) valid C code, mostly in
the form of tuples which are resolved by the inclusion of tuples.h.
The file toymain.c can then be compiled into actual machine code using
a C compiler.  The C compiler should not give any warnings or errors.

The 'toy' shell script automates the process of redirecting its first
argument to the standard input of toyc, then calling the C compiler on
toymain.c to generate a binary called 'toyrun'.


2.  A Peek Under the Hood
---------------------------------------------------------------------

===Overview

The toy compiler is a one-pass, two-stage compiler.  In other words, only
one pass is made through the input, but the input scanning is only the
first stage of the compilation.  The second stage puts together the code
generated by the first stage in such a way that it is compilable by the
C compiler.  

The first stage of the compilation is where most of the actual work of
the compiler is done.  Only one pass is made through the input at which
time the text is simultaneously broken down into lexemes and parsed using
an LALR(1) parser implemented by yacc.  Type-checking and code generation
also happen at the same time.

The lexer recognizes reserved words, symbols (such as <=, =, >=, etc.),
integer constants, and other regular expressions which are assumed to
be identifiers.  The lexer is capable of supporting an arbitrary number
of levels of nested comments.

The lexer is not run directly; rather, it delivers lexemes to the yacc-
generated parser (invoked from main.c via yyparse()) on demand.  The yacc
parser works its magic and calls the appropriate embedded actions as the
various grammar rules are reduced.

The yacc actions consist of code to perform both type-checking and code
generation.  There is excellent correspondence between the rules of the
toy428 type system and the grammar rules governing its syntax; therefore,
the general form for most action code is a one or more type-checking
statements followed by one or more code-generation statements.  There are
exceptions: declarations occurring within functions or "new" blocks do not
generate code, but rather update the symbol table.

The code for the symbol table and code generation are both highly
encapsulated.  For example, only the code generator -- and no outside
functions -- has the knowledge of which file is currently open for writing.
This C++-esque data-hiding scheme allows for easy modification of
individual parts of the program without breaking other parts.

The only difficulty is when some action must be performed which requires
proprietary knowledge from more than one part of the compiler.  Most notably,
this arises in the function which generates activation records.  Only the
symbol table knows how many local variables exist for that function, but
only the code generator knows what the function number is.  This problem
is solved by having two functions: Sym_activate and Gen_activate.
Sym_activate is called; it counts the number of locals and passes that
information on to Gen_activate, which then completes the action.


===The Type System

At run-time, my compiler assigns arbitrary, unique integers to each distinct
type used by the toy program.  Only the only types which are the same across
all compilations: INTEGER is defined as type 0, and EMPTYLIST (an argument
list with no arguments) is defined as type 1.

All complex types are defined as a 'modification' of an existing type.
A modification consists of a base and a modifier.  The base can be any
existing type, either fundamental or derived.  The modifier can be either
the pre-defined constants TYPE_ROW and TYPE_REF.  To create argument
lists, the modifier can be another type (either fundamental or derived.)

The type numbers are assigned at compile-time as necessary.  Each time
a type is encountered, get_type is called.  Get_type checks to see if the
requested type already has a number.  If so, that number is returned.  If
not, that type is given a new number and added to the type table; the newly
assigned number is then returned.

Functions have slightly different formats; their base type is defined as
a type which represents their argument list, and their modifier is their
return type.  They also have a special 'FUNCTION' flag to differentiate
them from simple argument lists.

The true power of this type system is that all types, no matter how complex,
can be compared for equality as easily as comparing one integer to another.
Therefore, normally complex operations such as making sure formal argument
lists match their actual arguments are all reduced to a simple check of
the form "if (type1 == type2)", where type1 and type2 are simply integers.
In addition, all types of arbitrary complexity can be represented in the
parser tree as simple integers requiring constant space.

This system is extremely powerful and flexible.  It is completely general
and is quite easy to extend.  For example, let's say we wanted to add 'float'
as a new fundamental type for toy programs.  All that would have to be done
would be to add 'float' as a new reserved word, add TYPE_FLOAT as a new
fundamental data type (say, type 2), and change the yaccgen file to assign
TYPE_FLOAT to variables declares with the 'float' keyword.  These simple
steps would take only 5 minutes, and yet the symbol table would then
automatically be able to handle all derived types involving floats (i.e.,
arrays of floats, pointers to floats, pointers to pointers to floats,
functions returning floats, functions taking floats or float-derived types
as arguments, etc.)


===The Symbol Table

The symbol table is implemented as an array whose size is defined at
compiler-compile time (i.e. the time at which the compiler is compiled).
This simplifies access to the symbol table, although it does place a
restriction on the maximum number of symbols which can be contained in a
single toy program.  This number is simply a #define in toy.h and can be
easily changed.

Each element of the symbol table has several datums associated with it.
The structure itself looks like this:

struct symbol_info {
   char *name;
   int level;
   int offset;
   int type;
   int temp_number;
};

name is simply the symbol's name.  level and offset are the lexical level
of the symbol and its offset from the beginning of that lexical level;
they are used, at compile-time, to determine which series of run-time
access links should be followed to find non-local variables.  'type' is
the type system's integer representation of the type of the symbol within
the toy428 type system.  temp_number is used for a certain type of
optimization which will be discussed later.

There are four basic operations performed by the symbol table: functions
starting blocks, adding symbols, looking up symbols, and ending blocks.

Each time a block is started, the compiler remembers the state of the
symbol table at the beginning of the block so that that state can be
restored when the block ends.

Adding a symbol is simple.  A check is first made to ensure that another
symbol does not already exist with the same name in the current block.
If one does exist, an error is generated; if not, the symbol is added.

Looking up a symbol is also simple; the symbol table is simply searched
from the end (the most recently added symbol) to the beginning.  Optimally,
this lookup process should be made faster with some type of hash-table
scheme but is not currently implemented.

The symbol table is initialized with two symbols: get() and put().


===Keeping Functions Straight

One of the main problems with implementing a toy428 compiler is dealing
with function definitions.  Since code is generated in a single pass,
code for one function is interrupted by code for a function defined within
it.  Therefore, some mechanism must be implemented to make sure blocks
of code remain together and in the proper order.  My compiler uses a
stack of file pointers, each representing a function currently undergoing
definition.

Note there are two distinct concepts when determining which files should
be open: the function _level_, which defines how 'deep' a function is in
the stack, and the function _number_, which is a unique integer sequence
number given to each function.  Each time a new function is defined, its
level and sequence number are both incremented; when a function definition
ends, only the level is decremented.  For example:

Function 0 (Lev 0)
   Function 1 (Lev 1)
   Function 2 (Lev 1)
      Function 3 (Lev 2)
         Function 4 (Lev 3)
      Function 5 (Lev 2)
      Function 6 (Lev 2)
   Function 7 (Lev 1)

For the sake of generality (and to avoid special case handling), the
outermost ("main") block is simply considered to be a function at lexical
level 0.  Each time a function starts, a file is opened and its file pointer
is pushed onto a stack.  When functions end, the function pointer on the top
of the stack is popped off and closed.  The Gen_write() function is used to
write tuples out to a file; this function automatically writes to whatever
file is on the top of the stack.

Each function is stored in a file in /tmp during the first stage of code
generation.  During the second stage, all the temporary files are
concatenated, preceded by the toy-preamble and succeeded by the
toy-postamble.  The temporary files are then deleted.  (Note, the "NTEMPS"
definitions are written in between the preamble and the function definitions
in the final toymain.c.)

Note that this second stage of compilation is not performed if any errors
are reported during the first stage.


===The Parse Tree

Each element of the parse tree contains 4 datums: 'type', which represents
the type within the toy428 type system (used for the type-checker); 'place',
which represents the data stored by that element of the tree; 'placetype',
which describes what type of data is stored in 'place'; and 'next', which
is a pointer to the same type of structure, used to generate argument
lists.

The standard 428 assignment dictates that 'place' take on one of two modes:
literal, meaning that the data is actually stored in t[place], and address
mode, meaning that t[place] contains a pointer to the data.  However, this
introduces an inefficiency for statements such as "B := 5"; the address of
B is loaded into a temporary, then 5 is loaded into a temporary, then the
contents of one temporary is copied to the other, like this:

(ASSIGN, Addr(B), t[1])
(ASSIGN, 5, t[2])
(ASSIGN, t[2], t[1]^)

My compiler uses the concept of the 'immediate', which is taken to mean
that the data for an element of the tree actually _is_ "place" rather than
being _stored_ in t[place].  This both reduces temporary usage as well as
reducing the total number of lines of code.  The above code would look like:

(ASSIGN, Addr(B), t[1])
(ASSIGN, 5, t[1]^)

under my toy428 compiler.

'placetype' is used to store whether or not 'place' is literal, immediate,
or addressed.  The function Gen_tname() receives an element of the parse
tree as an argument and automatically prints it temp based on the value
of place and placetype.  For example, if place is 5, Gen_tname would
generate:

Mode    	Generated
IMMEDIATE	5
LITERAL		t[5]
ADDRESS		(*(object *)t[5])

The Gen_tname() function vastly simplifies the yaccgen file, for it is
not necessary for the yaccgen code to concern itself with what types of
data it is handling; Gen_tname() resolves all addresses automatically.


===Binding Functions at Run-Time

The code generated for 'FORMAL' keywords is a call to the C function
get_funcconst(), a function contained in the toy-preamble and copied into
every toymain.c file generated.  get_funcconst takes two arguments: the
IP (instruction pointer) and EP (environment pointer).  It then allocates
2 words of memory, stores the IP and EP in that memory, and returns a pointer
to that memory.  This scheme allows functions to be represented in one
machine word.

The EP passed is a pointer to the current environment; i.e., a pointer
to the environment in which the function was defined.  The IP is the name
of the C function (given the name toyfunN, where N is the function's
sequence number).  For this reason, functions are written to the toymain.c
file in reverse order (i.e., highest number to lowest number, followed by
main()); this ensures that the C compiler will always see the definition
of a function before it sees a reference to it.


===Activation Records and Function Calling

For the sake of generality and code simplicity, all blocks, including the
outermost "main()" block, have activation records.  The outermost activation
record is automatically generated with two local symbols: get() and put().
At run-time, that activation record is generated and get() and put() are
bound to the actual get() and put() functions, which are written in C and
contained in the toy-preamble file.

Activation records are generated at the beginning of each new block after
locals have been declared.  Blocks which do not declare locals still have
activation records, although they contain no data except for an access
link up to its nearest lexically enclosing block.  Normal blocks allocate
space for their access link and locals, but do not store any data in their
activation records (except for the access link, of course) by default.
Activation records for actual functions have the additional default action
of copying the function's arguments, passed using the C run-time stack, into
the function's newly created activation record.

A function's activation record's access link is a copy of the pointer to
the environment in which the function was defined; i.e., the EP bound to
the function at the time the FORMAL statement was declared.  This pointer
is passed to functions using the C stack as arg 0.  User-defined arguments
to functions are passed as arg1..argn.

Activation records are never freed at run-time in order to avoid problems
with escaping functions.  Optimally, garbage collection should be performed
to solve this problem, but that has not been implemented.


===Optimization of Temporaries

An important optimization in the use of temporary variables has been made.
Specifically, the compiler remembers whether or not it has already loaded
the address of a variable into a temporary; if so, it does not re-load it
if that variable is accessed again.  This also significantly cuts down on
the number of lines of code generated.

This feature is implemented via the symbol table; part of the record of
a symbol includes the temporary number which has been assigned to it.
(i.e. if B's tempnum = 8, the compiler knows that B's address currently
resides in t[8].)


===Automated Comments

The generated C code (in toymain.c) is commented to give pointers as to
what each tuple corresponds to in terms of toy code.  This helps with
debugging or getting the feel for how code is generated.