Jeremy Elson's Toy Compiler This archive contains the source code for the Toy compiler that I wrote in Spring of 1994 as part of my advanced compiler writing class at Johns Hopkins, taught by Dr. Scott Smith. The "doc" directory contains various class handouts that I received, describing how the compiler was supposed to work and tips on its implementation. Unfortunately, I seem to have lost the handout that formally describes the language itself. The "samples" directory contains samples of Toy code that we used to test our compilers. The "src" directory contains the source code of my compiler. Everything below the line is the original text that I submitted along with my final code, which is in the src directory. In case you're wondering, I got a B+. --------------------------------------------------------------------------- --------------------------------------------------------------------------- README FILE Jeremy Elson Advanced Compiler Writing, 600.428 Dr. Scott Smith May 1994 Preamble --------------------------------------------------------------------- My toy compiler is a full-featured 428 toy compiler which meets or exceeds all specifications of the 428 compiler assignment. I have tested it on all examples in server:~scott/examples428. For all examples, the toy compiler produces valid C code which is compiled silently by the C compiler to produce a binary. No test cases were able to cause the compiler itself to crash, nor could they cause it to generate invalid C code or code which generates warnings from the C compiler. Most programs, when compiled with the C compiler, ran flawlessly. The one notable exception is the infamous put3.toy program, which caused a segmentation fault when it was run. This may indeed be due to a bug in the toy program (not the compiler), but I haven't looked at it carefully enough to tell. Extra Credit --------------------------------------------------------------------- There are a number of aspects of my compiler which I believe are worthy of extra credit. They are pointed out briefly in this section and detailed in the document below. o Optimization: addresses are not loaded into temporaries unless necessary. o Use of three distinct addressing modes: ADDRESS, LITERAL, and IMMEDIATE. o A very powerful, flexible, easily extensible type system. o Automatically generated comments o Highly encapsulated coding style with data-hiding to make modification of the compiler easier. Table of Contents --------------------------------------------------------------------- 1. User Manual ===Files Used ===Compiling the Compiler ===Running the Compiler 2. A Peek Under the Hood ===Overview ===The Type System ===The Symbol Table ===Keeping Functions Straight ===The Parse Tree ===Binding Functions at Run-Time ===Activation Records and Function Calling ===Optimization of Temporaries ===Automated Comments 1. User Manual --------------------------------------------------------------------- ===Files Used README - This file Makefile - Controls the compilation of the program lexgen - Text file used as lex input yaccgen - Text file used as yacc input main.c - Short file containing the compiler driver. symtable.c - Functions containing symbol table routines gencode.c - Functions containing code generation routines toy-preamble.c - Text file copied verbatim to the beginning of compiled programs toy-postamble.c - Text file copied verbatim to the end of compiled programs toy.h - Header file for toy compiler (manually written) tuples.h - Translation from tuples to C code toy - Shell script for compiling toy programs ===Compiling the Compiler Simply typing 'make' compiles the compiler. It runs lex and yacc using lexgen and yaccgen as input files, respectively. yacc is run with a command-line switch to generate the file y.tab.h which contains (arbitrary) integer values for each of the compiler's tokens. Once lex and yacc have been run, their resulting .c files are compiled, along with symtable.c, main.c and gencode.c. The 5 object files are then linked together to form the final toyc compiler. ===Running the Compiler toyc expects a valid toy program from standard input and produces the file toymain.c as output. toymain.c is (very simple) valid C code, mostly in the form of tuples which are resolved by the inclusion of tuples.h. The file toymain.c can then be compiled into actual machine code using a C compiler. The C compiler should not give any warnings or errors. The 'toy' shell script automates the process of redirecting its first argument to the standard input of toyc, then calling the C compiler on toymain.c to generate a binary called 'toyrun'. 2. A Peek Under the Hood --------------------------------------------------------------------- ===Overview The toy compiler is a one-pass, two-stage compiler. In other words, only one pass is made through the input, but the input scanning is only the first stage of the compilation. The second stage puts together the code generated by the first stage in such a way that it is compilable by the C compiler. The first stage of the compilation is where most of the actual work of the compiler is done. Only one pass is made through the input at which time the text is simultaneously broken down into lexemes and parsed using an LALR(1) parser implemented by yacc. Type-checking and code generation also happen at the same time. The lexer recognizes reserved words, symbols (such as <=, =, >=, etc.), integer constants, and other regular expressions which are assumed to be identifiers. The lexer is capable of supporting an arbitrary number of levels of nested comments. The lexer is not run directly; rather, it delivers lexemes to the yacc- generated parser (invoked from main.c via yyparse()) on demand. The yacc parser works its magic and calls the appropriate embedded actions as the various grammar rules are reduced. The yacc actions consist of code to perform both type-checking and code generation. There is excellent correspondence between the rules of the toy428 type system and the grammar rules governing its syntax; therefore, the general form for most action code is a one or more type-checking statements followed by one or more code-generation statements. There are exceptions: declarations occurring within functions or "new" blocks do not generate code, but rather update the symbol table. The code for the symbol table and code generation are both highly encapsulated. For example, only the code generator -- and no outside functions -- has the knowledge of which file is currently open for writing. This C++-esque data-hiding scheme allows for easy modification of individual parts of the program without breaking other parts. The only difficulty is when some action must be performed which requires proprietary knowledge from more than one part of the compiler. Most notably, this arises in the function which generates activation records. Only the symbol table knows how many local variables exist for that function, but only the code generator knows what the function number is. This problem is solved by having two functions: Sym_activate and Gen_activate. Sym_activate is called; it counts the number of locals and passes that information on to Gen_activate, which then completes the action. ===The Type System At run-time, my compiler assigns arbitrary, unique integers to each distinct type used by the toy program. Only the only types which are the same across all compilations: INTEGER is defined as type 0, and EMPTYLIST (an argument list with no arguments) is defined as type 1. All complex types are defined as a 'modification' of an existing type. A modification consists of a base and a modifier. The base can be any existing type, either fundamental or derived. The modifier can be either the pre-defined constants TYPE_ROW and TYPE_REF. To create argument lists, the modifier can be another type (either fundamental or derived.) The type numbers are assigned at compile-time as necessary. Each time a type is encountered, get_type is called. Get_type checks to see if the requested type already has a number. If so, that number is returned. If not, that type is given a new number and added to the type table; the newly assigned number is then returned. Functions have slightly different formats; their base type is defined as a type which represents their argument list, and their modifier is their return type. They also have a special 'FUNCTION' flag to differentiate them from simple argument lists. The true power of this type system is that all types, no matter how complex, can be compared for equality as easily as comparing one integer to another. Therefore, normally complex operations such as making sure formal argument lists match their actual arguments are all reduced to a simple check of the form "if (type1 == type2)", where type1 and type2 are simply integers. In addition, all types of arbitrary complexity can be represented in the parser tree as simple integers requiring constant space. This system is extremely powerful and flexible. It is completely general and is quite easy to extend. For example, let's say we wanted to add 'float' as a new fundamental type for toy programs. All that would have to be done would be to add 'float' as a new reserved word, add TYPE_FLOAT as a new fundamental data type (say, type 2), and change the yaccgen file to assign TYPE_FLOAT to variables declares with the 'float' keyword. These simple steps would take only 5 minutes, and yet the symbol table would then automatically be able to handle all derived types involving floats (i.e., arrays of floats, pointers to floats, pointers to pointers to floats, functions returning floats, functions taking floats or float-derived types as arguments, etc.) ===The Symbol Table The symbol table is implemented as an array whose size is defined at compiler-compile time (i.e. the time at which the compiler is compiled). This simplifies access to the symbol table, although it does place a restriction on the maximum number of symbols which can be contained in a single toy program. This number is simply a #define in toy.h and can be easily changed. Each element of the symbol table has several datums associated with it. The structure itself looks like this: struct symbol_info { char *name; int level; int offset; int type; int temp_number; }; name is simply the symbol's name. level and offset are the lexical level of the symbol and its offset from the beginning of that lexical level; they are used, at compile-time, to determine which series of run-time access links should be followed to find non-local variables. 'type' is the type system's integer representation of the type of the symbol within the toy428 type system. temp_number is used for a certain type of optimization which will be discussed later. There are four basic operations performed by the symbol table: functions starting blocks, adding symbols, looking up symbols, and ending blocks. Each time a block is started, the compiler remembers the state of the symbol table at the beginning of the block so that that state can be restored when the block ends. Adding a symbol is simple. A check is first made to ensure that another symbol does not already exist with the same name in the current block. If one does exist, an error is generated; if not, the symbol is added. Looking up a symbol is also simple; the symbol table is simply searched from the end (the most recently added symbol) to the beginning. Optimally, this lookup process should be made faster with some type of hash-table scheme but is not currently implemented. The symbol table is initialized with two symbols: get() and put(). ===Keeping Functions Straight One of the main problems with implementing a toy428 compiler is dealing with function definitions. Since code is generated in a single pass, code for one function is interrupted by code for a function defined within it. Therefore, some mechanism must be implemented to make sure blocks of code remain together and in the proper order. My compiler uses a stack of file pointers, each representing a function currently undergoing definition. Note there are two distinct concepts when determining which files should be open: the function _level_, which defines how 'deep' a function is in the stack, and the function _number_, which is a unique integer sequence number given to each function. Each time a new function is defined, its level and sequence number are both incremented; when a function definition ends, only the level is decremented. For example: Function 0 (Lev 0) Function 1 (Lev 1) Function 2 (Lev 1) Function 3 (Lev 2) Function 4 (Lev 3) Function 5 (Lev 2) Function 6 (Lev 2) Function 7 (Lev 1) For the sake of generality (and to avoid special case handling), the outermost ("main") block is simply considered to be a function at lexical level 0. Each time a function starts, a file is opened and its file pointer is pushed onto a stack. When functions end, the function pointer on the top of the stack is popped off and closed. The Gen_write() function is used to write tuples out to a file; this function automatically writes to whatever file is on the top of the stack. Each function is stored in a file in /tmp during the first stage of code generation. During the second stage, all the temporary files are concatenated, preceded by the toy-preamble and succeeded by the toy-postamble. The temporary files are then deleted. (Note, the "NTEMPS" definitions are written in between the preamble and the function definitions in the final toymain.c.) Note that this second stage of compilation is not performed if any errors are reported during the first stage. ===The Parse Tree Each element of the parse tree contains 4 datums: 'type', which represents the type within the toy428 type system (used for the type-checker); 'place', which represents the data stored by that element of the tree; 'placetype', which describes what type of data is stored in 'place'; and 'next', which is a pointer to the same type of structure, used to generate argument lists. The standard 428 assignment dictates that 'place' take on one of two modes: literal, meaning that the data is actually stored in t[place], and address mode, meaning that t[place] contains a pointer to the data. However, this introduces an inefficiency for statements such as "B := 5"; the address of B is loaded into a temporary, then 5 is loaded into a temporary, then the contents of one temporary is copied to the other, like this: (ASSIGN, Addr(B), t[1]) (ASSIGN, 5, t[2]) (ASSIGN, t[2], t[1]^) My compiler uses the concept of the 'immediate', which is taken to mean that the data for an element of the tree actually _is_ "place" rather than being _stored_ in t[place]. This both reduces temporary usage as well as reducing the total number of lines of code. The above code would look like: (ASSIGN, Addr(B), t[1]) (ASSIGN, 5, t[1]^) under my toy428 compiler. 'placetype' is used to store whether or not 'place' is literal, immediate, or addressed. The function Gen_tname() receives an element of the parse tree as an argument and automatically prints it temp based on the value of place and placetype. For example, if place is 5, Gen_tname would generate: Mode Generated IMMEDIATE 5 LITERAL t[5] ADDRESS (*(object *)t[5]) The Gen_tname() function vastly simplifies the yaccgen file, for it is not necessary for the yaccgen code to concern itself with what types of data it is handling; Gen_tname() resolves all addresses automatically. ===Binding Functions at Run-Time The code generated for 'FORMAL' keywords is a call to the C function get_funcconst(), a function contained in the toy-preamble and copied into every toymain.c file generated. get_funcconst takes two arguments: the IP (instruction pointer) and EP (environment pointer). It then allocates 2 words of memory, stores the IP and EP in that memory, and returns a pointer to that memory. This scheme allows functions to be represented in one machine word. The EP passed is a pointer to the current environment; i.e., a pointer to the environment in which the function was defined. The IP is the name of the C function (given the name toyfunN, where N is the function's sequence number). For this reason, functions are written to the toymain.c file in reverse order (i.e., highest number to lowest number, followed by main()); this ensures that the C compiler will always see the definition of a function before it sees a reference to it. ===Activation Records and Function Calling For the sake of generality and code simplicity, all blocks, including the outermost "main()" block, have activation records. The outermost activation record is automatically generated with two local symbols: get() and put(). At run-time, that activation record is generated and get() and put() are bound to the actual get() and put() functions, which are written in C and contained in the toy-preamble file. Activation records are generated at the beginning of each new block after locals have been declared. Blocks which do not declare locals still have activation records, although they contain no data except for an access link up to its nearest lexically enclosing block. Normal blocks allocate space for their access link and locals, but do not store any data in their activation records (except for the access link, of course) by default. Activation records for actual functions have the additional default action of copying the function's arguments, passed using the C run-time stack, into the function's newly created activation record. A function's activation record's access link is a copy of the pointer to the environment in which the function was defined; i.e., the EP bound to the function at the time the FORMAL statement was declared. This pointer is passed to functions using the C stack as arg 0. User-defined arguments to functions are passed as arg1..argn. Activation records are never freed at run-time in order to avoid problems with escaping functions. Optimally, garbage collection should be performed to solve this problem, but that has not been implemented. ===Optimization of Temporaries An important optimization in the use of temporary variables has been made. Specifically, the compiler remembers whether or not it has already loaded the address of a variable into a temporary; if so, it does not re-load it if that variable is accessed again. This also significantly cuts down on the number of lines of code generated. This feature is implemented via the symbol table; part of the record of a symbol includes the temporary number which has been assigned to it. (i.e. if B's tempnum = 8, the compiler knows that B's address currently resides in t[8].) ===Automated Comments The generated C code (in toymain.c) is commented to give pointers as to what each tuple corresponds to in terms of toy code. This helps with debugging or getting the feel for how code is generated.