Chapter 2
Lexical Analysis

powered by FreeFind

Modified: 

OVERVIEW

The first phase of compiling a program is generally that of lexical analysis as graphically depicted below inputting text and outputting tokens to the next phase:

a := 237; ® lexical analysis ® <id> <assign> <integer_literal> <semicolon>

2.1 LEXICAL TOKENS - a sequence of characters that can be treated as a unit in the grammar of a programming language

Examples of tokens:

Text Token
if <IF>
-2.3 <REAL>
{ <LPAREN>
foobaz <ID>

Examples of non-tokens:

Text Type
// Check again Comment
' ' white space
'\t' white space
'\n' white space

Note that blanks are usually discarded during lexical analysis as in the following:

a := 237; ® lexical analysis ® <id> <assign> <integer_literal> <semicolon>

2.2 REGULAR EXPRESSIONS - a standard notation used to specify the characters corresponding to lexical tokens of a language.

Text notation

Regular Expression Notation
 
Expression Meaning Example
. Any character except "\n"  
a,b,... Non special characters match that character ab;c matches "ab;c"
[] Any character in the brackets.
^ negates it when it is the first character.
- signifies a range if not the first character.
[abz] a single a or b or z
[a-z] lowercase letters
[^a-z] Anything except lowercase letters.
* 0 or more of the preceding pattern a* - nothing, a, aa, aaa,...
+ 1 or more of the preceding pattern  
? 0 or 1 of the preceding pattern. [0-9]? An optional digit
{n} n of the preceding pattern. [a-z]{3} All groups of three letters
{n,m} n to m of the preceding pattern. [a-z]{3,5} All groups of three, four or five letters
\ Escape character \* matches an asterisk
() groups patterns ([ab]1?)? matches nothing, a, a1, b, b1.
| Either the pattern before or after. (if)+|5 matches multiple if's or a single 5
"..." Literally what is in the quotes. "\*" matches an backslash then an asterisk

 

Regular Expressions for some tokens

Regular Expression Token Example
if IF if
[0-9]+ INTEGER_LITERAL 123
[0-9]*"."[0-9]+ REAL_LITERAL 12.3
.2
2. not allowed
[a-zA-Z][a-zA-Z0-9]* ID aBc
aA12B
12b not allowed

Disambiguation

The rules are ambiguous, for example is if8 an <ID> or <IF> <INTEGER_LITERAL> tokens? Two ad hoc rules are commonly used:

Exercise - What are some strings matching the regulars expressions below?

  1. abc
  2. abc+
  3. [abc]+
  4. [ab]c+
  5. a|b|c
  6. [a-c]
  7. a[0-9]*
  8. [a-zA-Z0-9]*

Exercise - Give regular expressions for the following:

  1. both the English and the American spellings of the word "grey/gray".
  2. signed integers
  3. scientific notation (e.g. 1.2E-13)
  4. your name in a string of arbitrary characters.

Example - Regular expressions are useful for matching patterns, part what a text editor does. Many scripting languages (Python/Perl/etc.) and UNIX provide regular expressions; the UNIX grep command allows files to be searched for strings that match regular expressions.

For example:

grep E[+-][0-9]+ Chapter2.htm

displays the line from this file (Chapter2.htm) that contains the string above:

scientific notation (e.g. 1.2E-13)

 

Exercise - Use a command (grep) to search files for text using regular expressions.

UNIX

  • grep E[+-][0-9]+ < Chapter2.htm

Windows

  1. Download grep for Windows.
  2. Follow installation instructions.
  3. Read the quick-start guide.
  4. Example use: grep32 E[+-][0-9]+ Chapter2.htm

 

2.3 FINITE AUTOMATA - a computational state machine with a finite number of states.

Finite automata are represented as a directed graph using the following symbols:

labeled nodes correspond to a state, in this case State 1. A single state is denoted as the start state.

edges correspond to a transition and condition from one state to another, in this case the transition occurs on a value of a-z.

labeled nodes that correspond to a final state, in this case State 2 is final.

Regular expressions can be converted into finite automata capable of computing tokens.

Regular Expression Finite Automata
if
[0-9]+
[a-zA-Z][a-zA-Z0-9]*

 

Implementation of Finite Automata in Code

Code to simulate finite automata for ID.

state = 1;
while state == 1 or 2
    switch (state)
      1 : switch (input character)
           'a'-'z' : advance input
                     
state = 2
           'A'-'Z' : advance input
                     
state = 2
           default : state = error
      2 : switch (input character)
           'a'-'z',
           'A'-'Z',
           '0'-'9' : advance input
 
          default : state = ID token

 

Transition tables

  Input characters c
States s States representing transitions T(s,c) 

 

Example:

        \Input
State  \
a-z A-Z 0-9 other
1 2 2 error error
2 2 2 2 ID token (no transition)

 

Implementation

s = 1
c = Input
while s not error or ID token
   s = T[s, c]
   if s not error or ID token c = Input 

 

2.4 NONDETERMINISTIC FINITE AUTOMATA (NFA) - Automata with the choice of two or more edges labeled with the same symbol or special edge labeled with ε, edges that may be taken without using a symbol.

Example - the following NFA accepts either even number of a's or multiple of 3 a's; the NFA must correctly choose which path to follow at the first transition.

CONVERTING A REGULAR EXPRESSION TO AN NFA

Example: Regular expression and NFA for either even number of a's or multiple of 3 a's.

(aa)+ | (aaa)+

 

Exercise: Construct the NFA for the regular expression:

  1. [ab]a*

  2. [+-][9-0]*

 

Example: Regular expressions and NFA for:

if

[a-z][a-z0-9]*

[0-9]+

IF

ID

NUM

DETERMINISTIC FINITE AUTOMATA (DFA) - Automata with no two edges from the same state labeled with the same symbol or any special edge labeled with ε, edges that may be taken without using a symbol.

Exercise - Start at State 1,4,9,14 and follow edges for strings if8, 123, 123if.

 

Problem    NFA require automata to make correct choice between two or more identical edges out of a state for a character. DFA has only a single edge out of a state for a character.

CONVERTING NFA TO DFA

Converting NFA to DFA requires eliminating multiple edges having the same label and edges labeled with ε.

ε-closure of single state s is the set of states reachable by a series of 0 or more ε-transitions, written as s. The ε-closure of state s always contains state s itself.

Example: NFA for regular expression for a* has  1= {1,2,4}, 2= {2}, 3= {2,3,4}, 4= {4}

ε-closure - S, of a set of states is the union of the ε-closures of each individual state s: S=Ès

Example: {1,3} = 1È3 = {1,2,4} È {2,3,4} = {1,2,3,4} 

 

DFA M construction from NFA M

  1. Compute ε-closure from start state of NFA M, becomes the start state of DFA M.

  2. For this set and each subsequent set compute transitions on characters a by:

    1. Given set S of states and character a, compute S'a = { t | for some s in S there is a transition from s to t on a}

    2. Compute S'a, ε-closure of S'a. Defines new state in DFA subset construction with a new transition on a of S ® S'a.

    3. Continue until no new states or transitions are created.

  3. Mark as final in M those states marked as final in M.

 

Example: DFA M construction from NFA M for regular expression a*

  • DFA start 1 = {1,2,4}

    • DFA states[1] ={1,2,4}

  • Transition on a from state 2 to 3, none on 1 and 4, so transition on a from {1,2,4} to {1,2,4}a  = 3 = {2,3,4}

    • DFA states[2]={2,3,4}, DFA trans[1,a]=2

  • Since no other transitions on a (other than 2 to 3 already computed), ignore states 1, 2, 4; consider only new state {2,3,4}.

  • Again transition on a from state 2 to 3, none on 4, so transition on a from {2,3,4} to {2,3,4}a  = 3 = {2,3,4}

    • DFA states[2]={2,3,4}, DFA trans[2,a]=2.

  • No more states to consider, DFA is constructed.

  • Note that state 4 is final in the NFA and contained in both {1,2,4} and {2,3,4} so both are final in the DFA below.

The NFA to DFA algorithm (given by the text on page 29) produces the following results:

states
0 {}
1 {1,2,4}
2 {2,3,4}
         trans
  0 1 2
a 0 2 2
       
  1. e ¬ DFAedge(states[0],a) = DFAedge({},a) = {}

  2. trans[0,a] ¬ 0 where trans[j,a] ¬ e

  3. e ¬ DFAedge(states[1],a) = DFAedge({1,2,4},a) = {2,3,4}

  4. trans[1,a] ¬ 2 where trans[j,a] ¬ p

  5. e ¬ DFAedge(states[2],a) = DFAedge({2,3,4},a) = {2,3,4}

  6. trans[2,a] ¬ 2 where trans[j,a] ¬ i

  • j=0    p=1
  •  
  • j=1
  • p=2
  • j=2
  • i=2
  • j=3    p=2

Example: NFA for regular expression ab | a

  • DFA start state is 1 = {1,2,6}

    • DFA states[1]={1,2,6}

  • Transition on a from state 2 to 3, and state 6 to 7. Transition on a from {1,2,6} to {1,2,6}a  = {3,7} = {3,4,7,8} or on a {1,2,6} ® {3,4,7,8}

    • DFA states[2]={3,4,7,8}, trans[1,a]=2

  • Since no other transitions on a at 1, 2 or 6 (other than 2 to 3 and 6 to 7 already computed), ignore states 1, 2, 6; consider only new state {3,4,7,8}.

  • Transition on b from state 4 to 5, {3,4,7,8}b= 5 = {5,8} or on b {3,4,7,8} ® {5,8}

    • DFA states[3]={5,8}, trans[2,b]=3

  • No more states to consider, DFA is constructed.

  • Note that state 8 is final in the NFA and contained in both {3,4,7,8} and {5,8} (DFA states[2] and states[3]) so both are final in the DFA.

states
0 {}
1 {1,2,6}
2 {3,4,7,8}
3 {5,8}
         trans
  0 1 2 3
a 0 2 0 0
b 0 0 3 0
         

Example: NFA for regular expression letter (letter | digit)*

  • DFA start state is 1 = {1}

    • DFA states[1]={1}

  • Transition on letter from state 1 to 2, 2 = {2,3,4,5,7,10}.

    • DFA states[2]={2,3,4,5,7,10}, trans[1,letter]=2

  • Transition on letter from 2 to 6 = {4,5,6,7,9,10}.

    • DFA states[3]={4,5,6,7,9,10}, trans[2,letter]=3

  • Transition on digit from 2 to 8 = {4,5,7,8,9,10}.

    • DFA states[4]={4,5,7,8,9,10}, trans[2,digit]=4

  • Transition on letter from 6 to 6 = {4,5,6,7,9,10}

    • DFA states[3] = {4,5,6,7,9,10}, trans[3,letter]=3

  • Transition on digit from 6 to 8 = {4,5,7,8,9,10}

    • DFA states[4] = {4,5,7,8,9,10}, trans[3,digit]=4

  • Transition on digit from 8 to 8 = {4,5,7,8,9,10}

    • DFA states[4] = {4,5,7,8,9,10}, trans[4,digit]=4

  • Transition on letter from 8 to 6 = {4,5,6,7,9,10}

    • DFA states[3] = {4,5,6,7,9,10}, trans[4,letter]=3

  • No more states to consider, DFA is constructed.

  • Note that state 10 is final in the NFA and contained in {2,3,4,5,7,10}, {4,5,6,7,9,10} and {4,5,7,8,9,10} so are final in the DFA.

states
0 {}
1 {1,2,6}
2 {3,4,7,8}
3 {5,8}
         trans
  0 1 2 3 4
letter 0 2 3 3 3
digit 0 0 4 4 4
         

 

Exercise: Construct the DFA from the following NFA for (a|є)b*

 

2.5 LEXICAL-ANALYZER GENERATORS

Translation of regular expression into a DFA produces a lexical analyzer; the construction of which can be automated.

regular expression ® generator ® lexical analyzer

grep

grep inputs a regular expression:

grep E[-+][0-9]+ Chapter2.htm

and automatically constructs a lexical analyzer. Compilers have fixed regular expressions to recognize so construct a complete DFA each time using the methods described above for matching strings input from a file. grep and similar programs with dynamic user input of regular expressions need only form the closure of NFA states for the actual input a and not other possible inputs, generating closures dependent upon the input, potentially reducing the number of states converted. The hazard is that loops produce repeated conversions.

Two generators that produce lexical analyzers in Java useful for implementing a compiler will be examined: JavaCC and SableCC.

JavaCC -Intermixes grammar rules and control code.

Syntax - Similar to regular expressions with the following points of note:

Program2.9.jj

options {
   DEBUG_PARSER = true;                    // Set to false to turn off token printing
}

PARSER_BEGIN(MyParser)

            class MyParser
            {
                        public static void main(String args[]) throws ParseException {
                                    new MyParser(System.in).Start();
                        }
            }

PARSER_END(MyParser) 

TOKEN : {
              < IF: "if" >
            | < #DIGIT: ["0"-"9"] >
            | < ID: ["a"-"z"] ( ["a"-"z"] | <DIGIT>)* >
            | < NUM: (<DIGIT>)+ >
            | < REAL: ((<DIGIT>)+ "." (<DIGIT>)*) | ( (<DIGIT>)* "." (<DIGIT>)+) >

SKIP: {
            < "--" (~["\n", "\r"])* ("\n" | "\r" | "\r\n") >         // Comment
          | " " | "\t" | "\n" | "\r"

void Start() : {}

{          (<IF> | <ID> | <NUM> | <REAL>)*    }

Input - The following is valid for the lexical analyzer produced.

12.34 4 if
abc123 if4
-- a comment

Use

 

SableCC - Object oriented for extensibility and separates the regular expression definition and analyzer execution into two files:

  1. Lexical analyzer generator control that defines the regular expressions for tokens and productions possible from the tokens.

  2. Java program that calls the lexer.

Syntax - Similar to regular expressions with the following points of note:

Program2.10.js

Package Program2;

Helpers
            digit = ['0'..'9'];
            ht = 9;
            lf = 10;
            cr = 13;
            sp = ' ';
Tokens
            if = 'if';
            id = ['a'..'z'] (['a'..'z'] | (digit))*;
            number = digit+;
            real = ((digit)+ '.' (digit)*) | ((digit)* '.' (digit)+);
            whitespace = (sp | ht | cr lf | cr | lf )+;
            comments = '--' [[0..255]-[cr+lf]]*;
Ignored Tokens
            whitespace,
            comments;

Main.java

package Program2; 

import Program2.lexer.*;
import Program2.node.*;
import java.io.*;

public class Main{   

    public static void main(String[] arguments){
            try{
                // Create a lexer instance.

                Lexer l = new Lexer(new PushbackReader
                        (new InputStreamReader(System.in), 1024));                       

                Token t = l.next();
                while (!t.getText().equals("")){
                        System.out.print("<"+t.toString()+">");
                        t = l.next();
                }
            }
            catch(Exception e){ System.out.println(e.getMessage()); }
    }
}

Use