Chapter 2
|
OVERVIEW
- Lexical analysis - breaking up the input into individual words or tokens.
The first phase of compiling a program is generally that of lexical analysis as graphically depicted below inputting text and outputting tokens to the next phase:
a := 237; ® lexical analysis ® <id> <assign> <integer_literal> <semicolon>
2.1 LEXICAL TOKENS - a sequence of characters that can be treated as a unit in the grammar of a programming language
Examples of tokens:
Text Token if <IF> -2.3 <REAL> { <LPAREN> foobaz <ID> Examples of non-tokens:
Text Type // Check again Comment ' ' white space '\t' white space '\n' white space Note that blanks are usually discarded during lexical analysis as in the following:
a := 237; ® lexical analysis ® <id> <assign> <integer_literal> <semicolon>
2.2 REGULAR EXPRESSIONS - a standard notation used to specify the characters corresponding to lexical tokens of a language.
Text notation
Regular Expression Notation
Expression Meaning Example .Any character except "\n" a,b,...Non special characters match that character ab;c matches "ab;c" []Any character in the brackets.
^ negates it when it is the first character.
- signifies a range if not the first character.[abz] a single a or b or z
[a-z] lowercase letters
[^a-z] Anything except lowercase letters.*0 or more of the preceding pattern a* - nothing, a, aa, aaa,... + 1 or more of the preceding pattern ? 0 or 1 of the preceding pattern. [0-9]? An optional digit {n} n of the preceding pattern. [a-z]{3} All groups of three letters {n,m} n to m of the preceding pattern. [a-z]{3,5} All groups of three, four or five letters \Escape character \* matches an asterisk ()groups patterns ([ab]1?)? matches nothing, a, a1, b, b1. |Either the pattern before or after. (if)+|5 matches multiple if's or a single 5 "..."Literally what is in the quotes. "\*" matches an backslash then an asterisk
Regular Expressions for some tokens
Regular Expression Token Example if IF if [0-9]+ INTEGER_LITERAL 123 [0-9]*"."[0-9]+ REAL_LITERAL 12.3
.2
2. not allowed[a-zA-Z][a-zA-Z0-9]* ID aBc
aA12B
12b not allowedDisambiguation
The rules are ambiguous, for example is if8 an <ID> or <IF> <INTEGER_LITERAL> tokens? Two ad hoc rules are commonly used:
- Longest match: if8 is an <ID> rather than <IF> <INTEGER_LITERAL> tokens.
- Rule priority: for a longest matched string, the first rule match determines token type. By rule listing order, if is an <IF> rather than an <ID>.
Exercise - What are some strings matching the regulars expressions below?
- abc
- abc+
- [abc]+
- [ab]c+
- a|b|c
- [a-c]
- a[0-9]*
- [a-zA-Z0-9]*
Exercise - Give regular expressions for the following:
- both the English and the American spellings of the word "grey/gray".
- signed integers
- scientific notation (e.g. 1.2E-13)
- your name in a string of arbitrary characters.
Example - Regular expressions are useful for matching patterns, part what a text editor does. Many scripting languages (Python/Perl/etc.) and UNIX provide regular expressions; the UNIX grep command allows files to be searched for strings that match regular expressions.
For example:
grep E[+-][0-9]+ Chapter2.htm
displays the line from this file (Chapter2.htm) that contains the string above:
scientific notation (e.g. 1.2E-13)
Exercise - Use a command (grep) to search files for text using regular expressions.
UNIX
- grep E[+-][0-9]+ < Chapter2.htm
Windows
- Download grep for Windows.
- Follow installation instructions.
- Read the quick-start guide.
- Example use: grep32 E[+-][0-9]+ Chapter2.htm
2.3 FINITE AUTOMATA - a computational state machine with a finite number of states.
Finite automata are represented as a directed graph using the following symbols:
labeled nodes correspond to a state, in this case State 1. A single state is denoted as the start state.
edges correspond to a transition and condition from one state to another, in this case the transition occurs on a value of a-z.
labeled nodes that correspond to a final state, in this case State 2 is final.
Regular expressions can be converted into finite automata capable of computing tokens.
Regular Expression Finite Automata if [0-9]+ [a-zA-Z][a-zA-Z0-9]*
Implementation of Finite Automata in Code
Code to simulate finite automata for ID.
state = 1;
while state == 1 or 2
switch (state)
1 : switch (input character)
'a'-'z' : advance input
state = 2
'A'-'Z' : advance input
state = 2
default : state = error
2 : switch (input character)
'a'-'z',
'A'-'Z',
'0'-'9' : advance input
default : state = ID token
Transition tables
- Generic data structure (table) versus hard coding of finite automata states.
- State transitions table entries.
Input characters c States s States representing transitions T(s,c)
Example:
\Input
State \a-z A-Z 0-9 other 1 2 2 error error 2 2 2 2 ID token (no transition)
Implementation
s = 1
c = Input
while s not error or ID token
s = T[s, c]
if s not error or ID token c = Input
2.4 NONDETERMINISTIC FINITE AUTOMATA (NFA) - Automata with the choice of two or more edges labeled with the same symbol or special edge labeled with ε, edges that may be taken without using a symbol.
Example - the following NFA accepts either even number of a's or multiple of 3 a's; the NFA must correctly choose which path to follow at the first transition.
CONVERTING A REGULAR EXPRESSION TO AN NFA
Example: Regular expression and NFA for either even number of a's or multiple of 3 a's.
(aa)+ | (aaa)+
Exercise: Construct the NFA for the regular expression:
[ab]a*
[+-][9-0]*
Example: Regular expressions and NFA for:
if [a-z][a-z0-9]*
[0-9]+
IF ID
NUM
DETERMINISTIC FINITE AUTOMATA (DFA) - Automata with no two edges from the same state labeled with the same symbol or any special edge labeled with ε, edges that may be taken without using a symbol.
Exercise - Start at State 1,4,9,14 and follow edges for strings if8, 123, 123if.
Problem NFA require automata to make correct choice between two or more identical edges out of a state for a character. DFA has only a single edge out of a state for a character.
Converting NFA to DFA requires eliminating multiple edges having the same label and edges labeled with ε.
ε-closure of single state s is the set of states reachable by a series of 0 or more ε-transitions, written as s. The ε-closure of state s always contains state s itself.
Example: NFA for regular expression for a* has 1= {1,2,4}, 2= {2}, 3= {2,3,4}, 4= {4}
ε-closure - S, of a set of states is the union of the ε-closures of each individual state s: S=Ès
Example: {1,3} = 1È3 = {1,2,4} È {2,3,4} = {1,2,3,4}
DFA M construction from NFA M
Compute ε-closure from start state of NFA M, becomes the start state of DFA M.
For this set and each subsequent set compute transitions on characters a by:
Given set S of states and character a, compute S'a = { t | for some s in S there is a transition from s to t on a}
Compute S'a, ε-closure of S'a. Defines new state in DFA subset construction with a new transition on a of S ® S'a.
Continue until no new states or transitions are created.
Mark as final in M those states marked as final in M.
Example:
DFA M construction from NFA M for regular expression a*
DFA start 1 = {1,2,4}
DFA states[1] ={1,2,4}
Transition on a from state 2 to 3, none on 1 and 4, so transition on a from {1,2,4} to {1,2,4}a = 3 = {2,3,4}
DFA states[2]={2,3,4}, DFA trans[1,a]=2
Since no other transitions on a (other than 2 to 3 already computed), ignore states 1, 2, 4; consider only new state {2,3,4}.
Again transition on a from state 2 to 3, none on 4, so transition on a from {2,3,4} to {2,3,4}a = 3 = {2,3,4}
DFA states[2]={2,3,4}, DFA trans[2,a]=2.
No more states to consider, DFA is constructed.
Note that state 4 is final in the NFA and contained in both {1,2,4} and {2,3,4} so both are final in the DFA below.
The NFA to DFA algorithm (given by the text on page 29) produces the following results:
states
0 {} 1 {1,2,4} 2 {2,3,4} trans
0 1 2 a 0 2 2
e ¬ DFAedge(states[0],a) = DFAedge({},a) = {}
trans[0,a] ¬ 0 where trans[j,a] ¬ e
e ¬ DFAedge(states[1],a) = DFAedge({1,2,4},a) = {2,3,4}
trans[1,a] ¬ 2 where trans[j,a] ¬ p
e ¬ DFAedge(states[2],a) = DFAedge({2,3,4},a) = {2,3,4}
trans[2,a] ¬ 2 where trans[j,a] ¬ i
- j=0 p=1
- j=1
- p=2
- j=2
- i=2
- j=3 p=2
Example:
NFA for regular expression ab | a
DFA start state is 1 = {1,2,6}
DFA states[1]={1,2,6}
Transition on a from state 2 to 3, and state 6 to 7. Transition on a from {1,2,6} to {1,2,6}a = {3,7} = {3,4,7,8} or on a {1,2,6} ® {3,4,7,8}
DFA states[2]={3,4,7,8}, trans[1,a]=2
Since no other transitions on a at 1, 2 or 6 (other than 2 to 3 and 6 to 7 already computed), ignore states 1, 2, 6; consider only new state {3,4,7,8}.
Transition on b from state 4 to 5, {3,4,7,8}b= 5 = {5,8} or on b {3,4,7,8} ®
{5,8}
DFA states[3]={5,8}, trans[2,b]=3
No more states to consider, DFA is constructed.
Note that state 8 is final in the NFA and contained in both {3,4,7,8} and {5,8} (DFA states[2] and states[3]) so both are final in the DFA.
states
0 {} 1 {1,2,6} 2 {3,4,7,8} 3 {5,8} trans
0 1 2 3 a 0 2 0 0 b 0 0 3 0 Example:
NFA for regular expression letter (letter | digit)*
DFA start state is 1 = {1}
DFA states[1]={1}
Transition on letter from state 1 to 2, 2 = {2,3,4,5,7,10}.
DFA states[2]={2,3,4,5,7,10}, trans[1,letter]=2
Transition on letter from 2 to 6 = {4,5,6,7,9,10}.
DFA states[3]={4,5,6,7,9,10}, trans[2,letter]=3
Transition on digit from 2 to 8 = {4,5,7,8,9,10}.
DFA states[4]={4,5,7,8,9,10}, trans[2,digit]=4
Transition on letter from 6 to 6 = {4,5,6,7,9,10}
DFA states[3] = {4,5,6,7,9,10}, trans[3,letter]=3
Transition on digit from 6 to 8 = {4,5,7,8,9,10}
DFA states[4] = {4,5,7,8,9,10}, trans[3,digit]=4
Transition on digit from 8 to 8 = {4,5,7,8,9,10}
DFA states[4] = {4,5,7,8,9,10}, trans[4,digit]=4
Transition on letter from 8 to 6 = {4,5,6,7,9,10}
DFA states[3] = {4,5,6,7,9,10}, trans[4,letter]=3
No more states to consider, DFA is constructed.
Note that state 10 is final in the NFA and contained in {2,3,4,5,7,10}, {4,5,6,7,9,10} and {4,5,7,8,9,10} so are final in the DFA.
states
0 {} 1 {1,2,6} 2 {3,4,7,8} 3 {5,8} trans
0 1 2 3 4 letter 0 2 3 3 3 digit 0 0 4 4 4
Exercise: Construct the DFA from the following NFA for (a|є)b*
2.5 LEXICAL-ANALYZER GENERATORS
Translation of regular expression into a DFA produces a lexical analyzer; the construction of which can be automated.
regular expression ® generator ® lexical analyzer grep
grep inputs a regular expression:
grep E[-+][0-9]+ Chapter2.htm
and automatically constructs a lexical analyzer. Compilers have fixed regular expressions to recognize so construct a complete DFA each time using the methods described above for matching strings input from a file. grep and similar programs with dynamic user input of regular expressions need only form the closure of NFA states for the actual input a and not other possible inputs, generating closures dependent upon the input, potentially reducing the number of states converted. The hazard is that loops produce repeated conversions.
Two generators that produce lexical analyzers in Java useful for implementing a compiler will be examined: JavaCC and SableCC.
JavaCC -Intermixes grammar rules and control code.
Syntax - Similar to regular expressions with the following points of note:
- sequence: ["0"-"9"] rather than [0-9]
- characters: "A"
- not: [] is no characters, ~[] is all characters, ~["*"] is all characters except "*"
- union: ["a","e","i","o","u","y"] are the vowel characters
PARSER_BEGIN(MyParser) - defines a class (e.g. MyParser) with a main function that:
constructs a lexical analyzer inputting characters from standard input (i.e. keyboard)
calls a generated function Start() to analyze input.
TOKEN - regular expressions defining valid tokens
SKIP - regular expressions defining strings to skip or ignore
Start() - analyzes input for listed tokens; 0 or more of the following are allowed:
(<IF> | <ID> | <NUM> | <REAL>)*
Program2.9.jj options {
DEBUG_PARSER = true; // Set to false to turn off token printing
}
PARSER_BEGIN(MyParser)class MyParser
{
public static void main(String args[]) throws ParseException {
new MyParser(System.in).Start();
}
}PARSER_END(MyParser)
TOKEN : {
< IF: "if" >
| < #DIGIT: ["0"-"9"] >
| < ID: ["a"-"z"] ( ["a"-"z"] | <DIGIT>)* >
| < NUM: (<DIGIT>)+ >
| < REAL: ((<DIGIT>)+ "." (<DIGIT>)*) | ( (<DIGIT>)* "." (<DIGIT>)+) >
}SKIP: {
< "--" (~["\n", "\r"])* ("\n" | "\r" | "\r\n") > // Comment
| " " | "\t" | "\n" | "\r"
}void Start() : {}
{ (<IF> | <ID> | <NUM> | <REAL>)* }
Input - The following is valid for the lexical analyzer produced.
12.34 4 if
abc123 if4
-- a commentUse
- Home:
- Download JavaCC and/or SableCC software and unzip to the root level directory.
- Enter the following at the DOS Command window:
path=%path%;\javacc-3.2\bin\
IUS:
- Enter in the DOS Command window:
v:\common\user\C431\CC
v:\common\user\C311\forjava
- JavaCC
- Download Program2.9.jj and Lexical1.txt.
- Generates lexical analyzer named MyParser.java, assumes JavaCC is a root-level directory.
JavaCC Program2.9.jj
- Compile all Java files in directory.
javac -classpath . *.java
- Analyze the contents of file Lexical1.
java -cp . MyParser < Lexical1.txt
SableCC - Object oriented for extensibility and separates the regular expression definition and analyzer execution into two files:
Lexical analyzer generator control that defines the regular expressions for tokens and productions possible from the tokens.
Java program that calls the lexer.
- Program2.10.js - The generator control file.
- Program2 - the name given to a sub-directory used to group all generated files.
- Helpers - generally used separate non-token definition from tokens.
- Tokens - regular expression for valid tokens.
- Ignored Tokens - just what it says.
- Productions - none required for the lexer.
Syntax - Similar to regular expressions with the following points of note:
- sequence: ['0'..'9'] rather than [0-9]
- characters: either 'A' or 65 (the ASCII code), [0..255] is all the characters
- difference: [[0..255]-['*']] is all the characters except '*'
- union: ['a'+'e'+'i'+'o'+'u'+'y'] are the vowel characters
Program2.10.js Package Program2;
Helpers
digit = ['0'..'9'];
ht = 9;
lf = 10;
cr = 13;
sp = ' ';
Tokens
if = 'if';
id = ['a'..'z'] (['a'..'z'] | (digit))*;
number = digit+;
real = ((digit)+ '.' (digit)*) | ((digit)* '.' (digit)+);
whitespace = (sp | ht | cr lf | cr | lf )+;
comments = '--' [[0..255]-[cr+lf]]*;
Ignored Tokens
whitespace,
comments;
- Main.java - Should be in sub-directory Program2, scans input for tokens and prints any found or error,
- Program2 - same in both files, the sub-directory for the lexer.
- Lexer l - constructs new lexer using standard input (i.e. System.in).
- Token t = l.next(); - reads first token
- System.out.print(t.toString()); - prints the input characters of token t.
Main.java package Program2;
import Program2.lexer.*;
import Program2.node.*;
import java.io.*;public class Main{
public static void main(String[] arguments){
try{
// Create a lexer instance.Lexer l = new Lexer(new PushbackReader
(new InputStreamReader(System.in), 1024));Token t = l.next();
while (!t.getText().equals("")){
System.out.print("<"+t.toString()+">");
t = l.next();
}
}
catch(Exception e){ System.out.println(e.getMessage()); }
}
}Use
- Download Program2.10.js and Lexical1.txt.
- To generate the SableCC lexical analyzer:
Home: java -jar \sablecc-2.18.2\lib\sablecc.jar Program2.10.js
IUS: java -jar v:\common\user\C431\sablecc.jar Program2.10.js
- To delete an older version of the analyzer and compile the Java program named Program2\Main.java enter:
del Program2\Main.class
javac -classpath . Program2\Main.java
- To perform the lexical analysis of Lexical1:
java -cp . Program2.Main < Lexical1.txt