Chapter 1
Introduction

powered by FreeFind

Modified: 

Overview

The course follows the Association of Computing Machinery's Curriculum 2001 guide for university computer science programs. The guide provides the following description of a compilers course:

Introduces the theory and practice of programming language translation. Topics include compiler design, lexical analysis, parsing, symbol tables, declaration and storage management, code generation, and optimization techniques.

This course has two distinct but interrelated goals. First, it explores the theory of language translation. Second, it shows how to apply this theory to build compilers and interpreters, as well as compiler generators. It covers the building of translators both from scratch and using compiler generators. In the process, the course also identifies and explores the main issues of the design of translators.

As is the case in many computer science courses with a significant theoretical component, visualization tools can improve the quality of lectures and serve as animated lecture notes. The most useful kind of algorithm animations are those that show in a synchronized way both an operational view and a conceptual view of the algorithm steps. The construction of a compiler/interpreter is a necessary component of this course, so students can obtain the necessary skills. Compiler programming projects, however, are often problematic for the following reasons:

• The size of a compiler implementation is usually much larger than that of the projects
students have undertaken in earlier courses.
• Most compiler generators are tabular, which makes the resulting compiler more
difficult to debug.


The severity of these problems will be reduced by using declarative scanners and parser generators that produce recursive-descent parsers.


Syllabus:


• Overview of programming languages: History of programming languages; brief survey
of programming paradigms; the role of language translation in the programming
process
• Fundamental issues in language design: General principles of language design; design goals; typing regimes; data structure models; control structure models; abstraction mechanisms
• Virtual machines: The concept of a virtual machine; hierarchy of virtual machines; intermediate languages
• Introduction to language translation: Comparison of interpreters and compilers; language translation phases; machine-dependent and machine-independent aspects of translation; language translation as a software engineering activity
• Lexical analysis: Application of regular expressions in lexical scanners; hand-coded vs. automatically-generated scanners; formal definition of tokens; implementation of finite-state automata
• Syntactic analysis: Formal definition of grammars; BNF and EBNF; bottom-up vs. top-down parsing; tabular vs. recursive-descent parsers; error handling; automatic generation of tabular parsers; symbol table management; the use of tools in support of
the translation process
• Models of execution control: Order of evaluation of subexpressions; exceptions and exception handling; runtime systems
• Declaration, modularity, and storage management: Declaration models; parameterization mechanisms; type parameterization; mechanisms for sharing and restricting visibility of declarations; garbage collection
• Type systems: Data type as set of values with set of operations; data types; typechecking models; semantic models of user-defined types; parametric polymorphism; subtype polymorphism; type-checking algorithms
• Interpretation: Iterative vs. recursive interpretation; iterative interpretation of intermediate code; recursive interpretation of a parse tree
• Code generation: Intermediate and object code; intermediate representations; implementation of code generators; code generation by tree walking; context-sensitive translation; register use
• Optimization: Machine-independent optimization; data-flow analysis; loop optimizations; machine-dependent optimization


Curriculum 2001 Units covered:


PL1 Overview of programming languages 2 core hours
PL2 Virtual machines 1 core hour
PL3 Introduction to language translation 2 core hours
PL8 Language translation systems 15 hours
PL9 Type systems 4 hours
Elective topics 16 hours

1.1 MODULES AND INTERFACES

Compilers are large software systems; the first compiler of note was written for FORTRAN over 50 years ago (e.g. ) by an IBM team lead by John Backus and required 10 man-years to implement. We will individually implement a compiler for a somewhat more complex language (i.e. MiniJava) as part of a one-semester course. How is that possible? Are we that much smarter than the IBM team? Much has been learned about language and software systems since then that will make our job considerably easier than the Backus team, not easy, but doable.

Large systems are best implemented as modules with well-defined interfaces that allow modules to be connected without knowledge of the module's internal operation, much as the USB interface promotes interconnection of devices to a computer. Modern compilers are designed in a modular fashion where one module corresponds to one of some 12 compilation phases (see Figure 1.1 of text). By connecting the modules in a sequential fashion where the output results from one phase is the input to another, a complete compiler can be constructed. That approach will be followed in this course, connecting existing modules (e.g. JavaCC for parsing) to those of our own design.

1.2 TOOLS AND SOFTWARE

The first two compiler phases are:

  1. lexical analysis - analyzing character groupings from the source text to determine corresponding language elements or tokens.

    For example, in the MiniJava language:

    • the 4 characters ' ' 'i' 'f' ' ' correspond to the <IF> token.

    • num = 34; corresponds to tokens <ID> <EQUAL> <INTEGER_LITERAL> <SEMICOLON>

     

  2. parsing - analyzes the phrase structure or how tokens are grouped together.

    For example, the following phrase structure is valid for an assignment statement in the MiniJava language:

    <ID> <EQUAL> <INTEGER_LITERAL> <SEMICOLON>

Lexical analysis and parsing are highly mechanical, regular, routine processes; generators that, given the language grammar, produce lexers and parsers will be used in this course. Since Java is our language of choice, we will use JavaCC and/or SableCC to generate the first two compiler modules in Java. These generators produce a Java program that will lex and parse grammars such as those written in MiniJava. We will add our own parts in Java to generate executable code.

1.3 DATA STRUCTURES FOR TREE LANGUAGES

HTML, XML, XSL, etc. are languages conveniently represented in the Document Object Model, essentially a tree structure. For example, the HTML:

<b><i>C431</i><u>Notes</u></b>

can be represented internally by the browser as:

                 <document>
                         |
                       <b>
                       /    \
                  <i>     <u>
                   |           |
                C431      Notes

where internal nodes are markup commands and leaf nodes are text. The tree is constructed when the HTML is loaded and traversed the browser renders the text as:

C431 Notes

The source of C++, Java and many other programming languages can be represented as a tree, commonly termed an Abstract Syntax Tree (AST). For the grammar:

<exp> ::= <exp>-<mulexp> | <exp> + <mulexp> | <mulexp>
<mulexp> ::= <mulexp> * <rootexp> | <rootexp>
<rootexp> ::= 1 | 2 | 3 | 4

the text:

1-2*3+4

corresponds to the parse tree which corresponds to the grammar rules used to parse the text:

                                <exp>
                              /     |    \
                     <exp>     +    <mulexp>
                   /    |     \                     \
          <exp>     -     <mulexp>       <rootexp>
           /                     /     |    \                 |
 <mulexp>     <mulexp>    *  <rootexp>    4
         |                     |               |
 <rootexp>       <rootexp>         3
         |                     |
         1                    2

 

The AST, usually retaining only the leaf nodes of the parse tree, is a data structure that corresponds to the meaning (semantics) of the text. By traversing in post-order and performing each operation (+,-,*) on the operands (1,2,3,4) this particular AST produces the expected result of (1-(2*3))+4:

AST
    +
   /  \
  -    4
 / \  
1  *
   / \
  2  3

Post-order

123*-4+

Printing

   void print(AST tree) {
       if ( tree == null ) return;
      
       print( tree.left );
       print( tree.right );

       System.out.println( tree.data );
   }

Exercise

What is the result of printing in a preorder traversal?

Execute/Interpret

The point of a language is execution, one value of the AST is an representation on which execution can be performed while traversing the AST. To draw a distinction:

execution - the source is translated into machine code and operations are performed by hardware (the hardware may be simulated as in Java Virtual Machine).

interpretation - the source is usually translated to some intermediate form (e.g. AST) and operations are performed by a program. Interpreters share the first modules for lexing and parsing with compilers to build the the intermediate representation.

Straight-line programs

A compiler for languages such as C++ and Java generally:

  1. Lexes the text into tokens.
  2. Parses the token structure to validate syntax.
  3. Constructs an Abstract Syntax Tree (AST) from a program's text that represents the program's semantics (meaning)
  4. generates corresponding machine code,
  5. the code can then be executed

The text describes a simple language (no iterators, functions, data structures, just assignment, print and compound statements) defined as follows:

Examples

  1. 123
  2. lemon := 4 
  3. one+two
  4. one,two,3
  5. print(one,two,3)
  6. one:=1; two:=2; print(one,two,3)
Exp®num
Stm®id:=Exp
Exp®Exp Binop Exp
ExpList®Exp , ExpList
Stm®print( ExpList )
Stm®Stm ; Stm
    (NumExp)
AssignStm
OpExp
ExpList
PrintStm
CompoundStm

Exercise

Are the following part of the straight-line language defined in Grammar 1.3? If so, what? If not, why not?
  1. a = 3+b
  2. a := 3; b := 5;
  3. (a:=3, 4)
  4. a:=3;print(a)
  5. (a:=3;print(a),4)
  6. (a:=3;print(a);4)

Our simple language will be represented not in text form but in the intermediate AST representation, delaying the need to transform text to the AST; that will come later. For some examples of program text, equivalent Java and graphical AST:

Text Java AST Graphical AST
34 new NumExp(34) NumExp(34)
a := 34

        new AssignStm("a",
           new NumExp(34)
        )

                           AssignStm
                             /          \
                           "a"    NumExp(34)

a := 5 + 3 ;
print ( a )

new CompoundStm(
       new AssignStm( "a",
           new OpExp(
              new NumExp(5),
              OpExp.Plus,
              new NumExp(3)
           )
        ),
        new PrintStm(
             new LastExpList(
                   new IdExp("b")
             )
         )
);

                CompoundStm
                   /                 \
            AssignStm             PrintStm
               /       \                     |        
          "a"    OpExp             LastExpList 
                 /      |       \                    |
NumExp(5) OpExp.Plus NumExp(3)  IdExp("b")

 

Representation of Straight-line Programs

The text gives Java code that corresponds to Grammar 1.3, p. 7. The correspondence of the Java and grammar can be seen through an example:

Grammar Java Structure
Stm ® id := Exp





 

Exp ® num

class AssignStm extends Stm
{
   public String id; public Exp exp;
   AssignStm(String i, Exp e)
   { id = i; exp = e; }
}
 

class NumExp extends Exp
{
   public int num;
   NumExp(int n) {num = n;}
}

    Stm
       |
 AssignStm
     /   \
String Exp
           |
       NumExp

Example: To represent a straight-line assignment statement as an AST in Java:

Text

Java AST

Graphical AST
a := 34; new Stm(
    new AssignStm(
           "a",
           new NumExp(34)
    )
)     
       Stm
         |
    AssignStm
       /     \
   String  NumExp
      |         |
     "a"       34

Exercise

Give the Java and graphical AST for the following using Grammar 1.3:

  1. 123
  2. lemon := 4 
  3. one+two
  4. one,two,3
  5. print(one,two,3)
  6. one:=1; two:=2; print(one,two,3)

 

Grammar

Below is partial definition of the straight-line programming grammar in Java.

StraightLine.java
abstract class Stm {}

class PrintStm extends Stm
{
   public ExpList exps;
   PrintStm(ExpList e)
   { exps = e; }
}

class AssignStm extends Stm
{
   public String id; public Exp exp;
   AssignStm(String i, Exp e)
   { id = i; exp = e; }
}

abstract class Exp {}

class NumExp extends Exp
{
   public int num;
   NumExp(int n) {num = n;}
}
 

abstract class ExpList {}

class LastExpList extends ExpList
{
   public Exp head;
   public LastExpList(Exp h) {
        head = h;}
}

Program

A straight-line program of print( 34 ) is below in Java AST form, contained in file Program.java

Program.java for print(34)
public class Program{  
  Stm program =
     new PrintStm(
         new LastExpList(
              new NumExp(34)
         )
     );
}
       Stm
         |
    PrintStm
         |
  LastExpList
         |
    NumExp

Interpreter

A partial interpreter for the straight-line grammar of print( 34 ) is given below. It can only interpret concrete (versus abstract) AST nodes of the classes: PrintStm, NumExp and LastExp, just those used in our program.

Below are the class definitions and constructors for Stm, PrintStm, AssignStm, ExpList, LastExpList, Exp and NumExp; these are in the separate file StraightLine.java but are used by the Interpreter.java program.

One programming note: an abstract class, such as Stm, does not have attributes, functions or constructors; its sole purpose is to serve as a parent class to concrete classes for inheritance purposes, for example, both PrintStm and AssignStm are Stm's.

StraightLine.java
abstract class Stm {}

class PrintStm extends Stm
{
   public ExpList exps;
   PrintStm(ExpList e)
   { exps = e; }
}

class AssignStm extends Stm
{
   public String id; public Exp exp;
   AssignStm(String i, Exp e)
   { id = i; exp = e; }
}

abstract class Exp {}

class NumExp extends Exp
{
   public int num;
   NumExp(int n) {num = n;}
}
 

abstract class ExpList {}

class LastExpList extends ExpList
{
   public Exp head;
   public LastExpList(Exp h) {
        head = h;}
}

Interpreter has two main functions:

  1. Construct a Program object, the AST representing the program to interpret.
  2. Interpret the program AST by recursively calling function interpret. The class of the parameter (argument) determines which interpret function is called. Essentially, the recursive calls walk the AST tree. The interpret functions usually do one of the following:
    1. perform some base operation (e.g. add two num's) and return the result
    2. call an interpret function closer to the base operation and return result

Interpreter.java

  1. public class Interpreter{
  2.  public static void main(String[] args) {
  3.     Interpreter interpreter = new Interpreter();
        System.out.println("Evaluating program.");
  4.     interpreter.interpret(new Program().program); 
     }
  5.  int interpret(Stm stm)  { 
  6.     return this.interpret((PrintStm)stm);
     }
  7.  int interpret(PrintStm stm) {
  8.     ExpList exp = stm.exps;
  9.     System.out.println(this.interpret(exp));
  10.     return 0;
     }
  1.  int interpret(Exp exp) {
  2.     if (exp instanceof NumExp)
  3.       return this.interpret((NumExp)exp);
  4.     return 0;
     }
  5.  int interpret(NumExp exp) {
  6.     return exp.num;
     }
  7.  int interpret(ExpList list) {
  8.     return this.interpret((LastExpList)list);
     }
  9.  int interpret(LastExpList list) {
  10.     return this.interpret(list.head);
      }		
    }

Tracing Interpreter Execution

Program.java for print(34)
public class Program{  
  Stm program =
     new PrintStm(
         new LastExpList(
              new NumExp(34)
         )
     );
}
  • Line 4 - Construct Program object named program, the AST to interpret. 
  • Line 4 - program is class Stm, polymorphic call made to Line 5 interpret.
  • Line 6 - cast stm as PrintStm (valid because PrintStm inherits from Stm).
  • Line 6 - call Line 7.
  • Line 8 - reference exps attribute (class ExpList) of PrintSum object.
  • Line 9 - call Line 17.
  • Line 18 - cast list as LastExpList (valid because inherits from ExpList)
  • Line 18 - call Line 19.
  • Line 20 - call Line 11.
  • Line 12 - test that exp is a NumExp object (it is).
  • Line 13 - call Line 15.
  • Line 16 - return 34.
  • Line 13, 20, 18 return 34.
  • Line 9 - print 34.
  • Line 10 - return 0
  • Line 4 - execution complete

Example - Extend the Interpreter to handle AssignStm. Store in hash table the pair <"a",34> with "a" the key or hash.

AssignStm in Interpreter.java
public class Program{  
  Stm program =
     new AssignStm( "a",
              new NumExp(34)
     );
}

class AssignStm extends Stm
{
   public String id; public Exp exp;
   AssignStm(String i, Exp e)
   { id = i; exp = e; }
}

 

 java.util.Hashtable table = new java.util.Hashtable();

 int interpret(AssignStm stm)
 {
    table.put(stm.id, new Integer(this.interpret(stm.exp)));

    return 0;
 }
 

Example - The previous Stm assumed everything was a PrintStm. Extend the Interpreter to handle Stm of AssignStm or PrintStm.

Determines class of stm (i.e. stm instanceof AssignStm returns true if stm is an AssignStm object) in order to correctly cast interpret parameter (e.g. interpret((AssignStm)stm); ).

Stm in Interpreter.java
public class Program{  
  Stm program =
     new AssignStm( "a",
              new NumExp(34)
     );
}

abstract class Stm {}

class AssignStm extends Stm
{
   public String id; public Exp exp;
   AssignStm(String i, Exp e)
   { id = i; exp = e; }
}

 int interpret(Stm stm) {

    if (stm instanceof AssignStm)
       return this.interpret((AssignStm)stm);

    if (stm instanceof PrintStm)
       return this.interpret((PrintStm)stm);

    System.err.println("ERROR " + stm);
    return -1;
 }
 

 

Running the Interpreter

With the grammar, interpreter and program defined, the program can be interpreted by compiling and running Interpreter.java. Note that compiling the interpreter compiles the program (Program.java) files automatically).

Below are the steps necessary to compile the interpreter and run the program at IUS:

  1. Download StraightLine.java Interpreter.java Program.java
  2. At the command prompt (at IUS):

    v:\common\user\c311\forjava

    del Program.class                                        Force re-compile

    javac -classpath . StraightLine.java

    javac -classpath . Interpreter.java

    java -cp . Interpreter