Chapter 3
|
OVERVIEW
- Parsing - determining the syntax or structure of a program.
As graphically depicted below, the first two phases of compiling a program is generally that of lexical analysis inputting text and outputting tokens to the second phase, the parser:
a := 237; ® lexical analysis ® <id> <assign> <integer_literal> <semicolon> ® parsing
Regular expressions are not powerful enough to define the phrase syntax of a language.
- Lexical analyzer can implement abbreviation of digits below by substituting in [0-9]+.
- Fails when substitution of a recursive phrase definition.
Example: defines expr as 611, (45+56), ((4+5)+6)
- digits = [0-9]+
- sum = expr "+" expr
- expr = "(" sum ")" | digits
- Finite automata of N states can recognize only N balanced parenthesis.
- To reduce (45+56) to an expr, substituting sum for 45+56 yields expr = "(" expr "+" expr ")" | digits.
- Substituting expr yields expr = "(" expr = "(" expr "+" expr ")" | digits"+" expr ")" | digits.
- Need recursion not substitution.
Backus Naur Form (BNF) notation for context-free grammars that includes recursion.
Context free grammar
- Regular expressions for lexical declarations (* and +) defined by iteration and substitution.
- Context free grammar for phrase declarations include recursion. Can also use for lexical declarations.
- left recursion - A = Aa | B yields Baaaaaa
- right recursion - A = aA | B yields aaaaaaB
- Alternations.
- expr = ab(c|d)e
- aux = c | d
- expr = ab aux e
- aux = c
- aux = d
- expr = ab aux e
- *
- expr = (abc)*
- expr = abc expr
- expr = є
- +
- expr = (abc)+
- expr = abc expr
- expr = abc
- ? over strings abc
- expr = abc?
- aux = a | b | c | є
- expr = abc aux
Exercise - Convert the regular expression A = a* | ab+ into BNF
3.1 CONTEXT-FREE GRAMMARS - Specification similar to regular expressions but include recursion.
Context-free implies that a definition is valid anywhere. Most programming languages have context-free (syntax) and context-sensitive (scope, typing, etc.) elements.
- non-terminals - S, E and L
- terminals - id num print ( ) := , ;
Derivations
a:= 7;
b := c + ( d := 5 + 6, d)derives sentence:
id := num;
id := id + ( id := num + num, id)so is part of the straight-line language.
Leftmost/Rightmost - leftmost derivation always expands leftmost non-terminal, rightmost ...
Parse trees - data structure corresponding to derivation results.
a:= 7;
b := c + ( d := 5 + 6, d)derivation corresponds to the parse tree:
![]() |
![]() |
Ambiguous Grammar - a grammar that can derive sentences with more than one parse tree.
Example: id := id+id+id
Exercise - Derive id := id+id+id
Exercise - Parse using Grammar 3.5
- 1 * 2 + 3
- 1 * (2 + 3)
Exercise from Figure 3.9
- What non-terminals can occur for ?X
- What non-terminals can occur for ?Y
- What non-terminals can occur for ?U
- What non-terminals can occur for ?V
Exercise - Parse using Grammar 3.8
- 1 * 2 + 3
- 1 * (2 + 3)
End of file marker
S - start symbol
$ - end of file marker
S ® E $ - defines production S where $ must come after a complete E-phrase. Adding S ® E $ to Grammar 3.8 produces the augmented Grammar 3.10.
PARSING BY BACKTRACKING
Parsing by backtracking is somewhat like depth first search: given a legal string, the parser will generate all possible parses until one is discovered that works. Backtracking is an exhaustive or brute force method than can be very expensive when search failures occur and backtracking is required.
Consider parsing bcde by following the grammar rules: ee (fails), bAc, bcAc, bcdc (fails)
Must now back up to try: bAe for the correct parse.
Modifying the grammar by factoring out the duplicate bA has the effect of providing only one choice rather than two. The parser does not need to backtrack.
Backtracking occurs when mistakes are made and rules are tried that eventually fail. Wastes effort and can be difficult to restore scanner input to correct point.
One method of avoiding mistakes is to predict, based on the current token, which rule is the correct one.
In practice, the method determines for each non-terminal all the tokens (terminals) that can occur FIRST in the production, the FIRST tokens can then guide the prediction. We saw that just knowing the correct first token is not always enough when duplications occur (bA), we also need to know what tokens FOLLOW others.
We will see that the FIRST and FOLLOW sets can be used to determine the correct parse rule.
3.2 PREDICTIVE PARSING
Recursive descent parser - View grammar rule for non-terminal A as a function that will recognize an A. The right-hand side of rule specifies the code structure:
- the sequence of terminals and non-terminals correspond to matches of the input and calls to other procedures
- choices correspond to alternatives (case or if-statements) within the code.
Recursive-descent or predictive parsing only works on grammars where first terminal token provides enough information to determine which production to use.
Grammar and Recursive-descent Parser addop ® + | -
mulop ® * | /
exp ® exp addop term | term
term ® term mulop factor | factor
factor ® ( exp ) | number
Token token; // global
void factor() {
switch (token) {
case lpar : match( lpar ); exp(); match( rpar ); break;
case number : match( number ); break;
default : error();
}
}void match( expectedToken : Token ) {
if( token == expectedToken) getToken();
else error();
}
- match - function advances to next token when expected match occurs.
Exercise - Give the functions for exp and term.
Problem
Recursive-descent or predictive parsing only works on grammars where first terminal token provides enough information to determine which production to use.
Alternatives
The BNF below has two alternatives, cannot immediately (based on first token if) determine which alternative; EBNF defines else part as option, designed to translate directly into recursive-descent parser code.
Note that grammar is ambiguous but the natural matching of each else as soon as encountered in input corresponds to the most closely nested if and else disambiguating rule.
Grammar and Recursive-descent Parser BNF
if-stmt ® if ( exp ) statement | if ( exp ) statement else statement
EBNF
if-stmt ® if ( exp ) statement [ else statement ]
void ifStmt() {
match( if );
match( lpar );
exp();
statement();
if (token == else ) {
match( else );
statement();
}
}
Repetition
Attempting to translate the following BNF production directly produces infinite recursion on exp.
Grammar and Recursive-descent Parser exp ® exp addop term | term
void exp() {
switch (token) {
case exp : exp(); match( addop ); term(); break;
case term : term(); break;
default : error();
}
}Since exp and term can both begin with the same tokens (a number of left parenthesis), the following EBNF solution using curly braces for repetition rather than recursion.
Grammar and Recursive-descent Parser exp ® term { addop term }
term ® factor { mulop factor }
void exp() {
term();
while( token == addop ) {
match( addop );
term();
}
}
void term() {
factor();
while( token == mulop ) {
match( mulop );
factor();
}
}
Recursive-descent Parser for Grammar 3.11
final int IF=1, THEN=2, ELSE=3, BEGIN=4, END=5, PRINT=6, SEMI=7, NUM=8, EQ=9;
int tok = getToken();
void advance() {tok=getToken();} .
void eat(int t) {if (tok==t) advance(); else error();}
void S() {
switch(tok) {
case IF: eat(IF); E(); eat(THEN); S(); eat(ELSE); S(); break;
case BEGIN: eat(BEGIN); S(); L(); break;
case PRINT: eat(PRINT); E(); break;
default: error();
} }
void L() {
switch(tok) {
case END: eat(END); break;
case SEMI: eat(SEMI); S(); L(); break;
default: error();
} }
void E() { eat(NUM); eat(EQ); eat(NUM); }
Recursive-descent Parser for Grammar 3.10 void S() { E(); eat(EOF); }
void E() {switch (tok) {
case ?: E(); eat(PLUS); T(); break;
case ?: E(); eat(MINUS); T(); break;
case ?: T(); break;
default: error();
}}
void T() {switch (tok) {
case ?: T(); eat(TIMES); F(); break;
case ?: T(); eat(DIV); F(); break;
case ?: F(); break;
default: error ( ) ;
}}Problem - The recursive-descent parser above only considers one current token, a one token look-ahead.
For (4*2)+3 it should use
E ® E+T
but for (4*2) it should use
E ® T;
with only the ( to consider it cannot determine which rule to follow.
FIRST AND FOLLOW SETS
nullable - X is nullable if X can derive the empty string є
else-part ® else stmt | є
FIRST(g) - the set of terminals that can begin strings derived from g
FIRST(else-part) = {else, є }
FOLLOW(X) - the set of terminals that immediately follow X. tÎFOLLOW(X) if any derivation contains Xt; including XYZt and Y and Z are nullable (derive є).
Given a non-terminal X, the set FIRST(X) consists of terminals and possibly є is defined as:
- if X is a terminal or є then FIRST(X)=X.
- if X is a non-terminal the for each production X ® X1X2...Xn , then FIRST(X) contains FIRST(Xi)-{є}.
- if FIRST(X1), FIRST(X2)...FIRST(Xi) contain є where i<n then FIRST(X) contains FIRST(Xi+1)-{є}
- if all FIRST(X1), FIRST(X2)...FIRST(Xn) contain є then FIRST(X) contains є
Given a non-terminal A, the set FOLLOW(A) consists of terminals and possibly $ is defined as:
- if A is a start symbol then $ is in Follow(A).
- if production B ® a A g , then First(g)-{є} is in Follow(A)
- if production B ® a A g , such that є is in First(g), then Follow(A) contains Follow(B); since whatever follows B can follow A when g=є
- if production B ® a A, then Follow(A) contains Follow(B); since whatever follows B can follow A
Example
![]() |
![]() |
Pass 0![]() |
FIRST[X]={} FIRST[Y]={} FIRST[Z]={} |
FOLLOW[X]={} FOLLOW[Y]={} FOLLOW[Z]={} |
Pass 1![]() |
FIRST[X]={a} X®a
FIRST[Y]={є,c} Y®c Y®є FIRST[Z]={d} Z®d |
FOLLOW[X]={c,d} Y®c,Y®є,Z®d,Z®XYZ FOLLOW[Y]={d} Z®d, Z®XYZ FOLLOW[Z]={} |
Pass 2![]() |
FIRST[X]={a,c,є} X®Y, Y®c
Y®є FIRST[Y]={є,c} FIRST[Z]={a,d} X®a, Z®XYZ FIRST[Z]={a,c,d} X®Y, Y®є, Y®c , Z®XYZ |
FOLLOW[X]={a,c,d} Z®XYZ,Y®є,FIRST[Z]={a,c,d} FOLLOW[Y]={a,c,d} Z®XYZ, FIRST[Z]={a,c,d} FOLLOW[Z]={} |

Example - First sets
Given a non-terminal X, the set FIRST(X) consists of terminals and possibly є is defined as:
- if X is a terminal or є then FIRST(X)=X.
- if X is a non-terminal the for each production X ® X1X2...Xn , then FIRST(X) contains FIRST(Xi)-{є}.
- if FIRST(X1), FIRST(X2)...FIRST(Xi) contain є where i<n then FIRST(X) contains FIRST(Xi+1)-{є}
- if all FIRST(X1), FIRST(X2)...FIRST(Xn) contain є then FIRST(X) contains є
Example Grammar
No є-productions: First(A) for all non-terminals A for (all terminals Z) First[Z] = {Z}; for (all non-terminals A) First[A] = {};
repeat
for (each production choice A ® X1X2X3...Xn )
add First[ X1 ]to First[ A ]until (there are no changes to any First[A])
addop ® + | -
mulop ® *
exp ® exp addop term | term
term ® term mulop factor | factor
factor ® ( exp ) | number
Grammar contains no є-productions
Can ignore exp ® exp addop term and term ® term mulop factor since left recursion adds nothing to First sets (e.g. 1 states First(exp) added to First(exp).
Apply First algorithm considering productions in order, pass 1 listed below:
exp ® exp addop term no changes
exp ® term adds First(exp) = First(term) = {} so no changes
addop ® + adds First(addop) = {+}
addop ® - adds - so First(addop) = {+,-}
term ® term mulop factor adds nothing
term ® factor adds First(term) = First(factor) = {} so no changes
mulop ® * adds First(mulop) = {*}
factor ® ( exp ) adds First(factor)= {(}
factor ® number adds First(factor)={(,number}
Rule Pass 1 Pass 2 Pass 3 Completed 1. exp ® exp addop term First(exp)=
{(,number}First(term)=
{(,number}First(factor)=
{(,number}First(addop)=
{+,-}First(mulop)=
{*}
2. exp ® term First(exp)=
{(,number}3. addop ® + First(addop)=
{+}4. addop ® - First(addop)=
{+,-}5. term ® term mulop factor 6. term ® factor First(term)=
{(,number}7. mulop ® * First(mulop)=
{*}8. factor ® ( exp ) First(factor)=
{(}9. factor ® number First(factor)=
{(,number}
Example
є-productions: First(A) for all non-terminals A stmt ® if-stmt | other
if-stmt ® if ( exp ) stmt else-part
else-part ® else stmt | є
exp ® 0 | 1
Grammar contains є-productions
Apply First algorithm considering productions in order, Pass 1 listed below:
stmt ® if-stmt no changes since First(if-stmt) is empty
stmt ® other adds First(stmt) = other
if-stmt ® if ( exp ) stmt else-part adds if
else-part ® else stmt adds else First(else-part) = {else}
else-part ® є adds є First(else-part)= {else, є}
exp ® 0 adds First(exp) = {0}
exp ® 1 adds First(exp) = {0,1}
Pass 2
stmt ® if-stmt adds if since First(if-stmt)={if}
No other changes occur so complete.
Rule Pass 1 Pass 2 Completed 1. stmt ® if-stmt First(stmt)=
{if, other}First(stmt)=
{if, other}First(if-stmt)=
{if}First(else-part)=
{else, є}First(exp)=
{0,1}2. stmt ® other First(stmt)=
{other}3. if-stmt ® if ( exp )
stmt else-partFirst(if-stmt)=
{if}4. else-part ® else stmt First(else-part)=
{else}5. else-part ® є First(else-part)=
{else, є}6. exp ® 0 First(exp)=
{0}7. exp ® 1 First(exp)=
{0,1}
Given a non-terminal X, the set FIRST(X) consists of terminals and possibly є is defined as:
- if X is a terminal or є then FIRST(X)=X.
- if X is a non-terminal the for each production X ® X1X2...Xn , then FIRST(X) contains FIRST(Xi)-{є}.
- if FIRST(X1), FIRST(X2)...FIRST(Xi) contain є where i<n then FIRST(X) contains FIRST(Xi+1)-{є}
- if all FIRST(X1), FIRST(X2)...FIRST(Xn) contain є then FIRST(X) contains є
Exercise - Determine FIRST(S) = FIRST(ABCd), FIRST(A), FIRST(B), FIRST(C) for grammar below:
S ® ABCd A ® e|f|є
B ® g|h|є
C ® p|q
Exercise - Determine FIRST(S) = FIRST(AdCB), FIRST(A), FIRST(B), FIRST(C) for grammar below:
- S ® AdCB
- A ® e|f|є
- B ® g|h|є
- C ® p|q
Predictive parser - FIRST sets used to predict which production to use given left-hand side X and some terminal T from FIRST set. We can write a function X with a switch statement with a case for each T.
Problem when grammar is ambiguous producing overlapping sets.
Problem when production X ® є
FOLLOW sets - used in handling productions with X ® є
Example - FOLLOW sets
Example addop ® + | -
Given a non-terminal A, the set FOLLOW(A) consists of terminals and possibly $ is defined as:
- if A is a start symbol then $ is in Follow(A).
- if production B ® a A g , then First(g)-{є} is in Follow(A)
- if production B ® a A g , such that є is in First(g), then Follow(A) contains Follow(B); since whatever follows B can follow A when g=є
- if production B ® a A, then Follow(A) contains Follow(B); since whatever follows B can follow A
mulop ® *
exp ® exp addop term | term
term ® term mulop factor | factor
factor ® ( exp ) | number
First(exp)= {(,number} First(term)={(,number}
First(factor)={(,number}
First(addop)={+,-}
First(mulop)={*}
Rules 3,4,7 and 9 below have no non-terminals on right-hand sides so add nothing to Follow sets.
Begin with Follow set of start symbol exp, Follow(exp)={$}
Apply Follow algorithm considering productions in order, pass 1 listed below:
exp ® exp addop term:
First(addop) added to Follow(exp)={$,+,-}
First(term) added to Follow(addop) = {(,number}
Follow(exp) added to Follow(term) = {$,+,-} by 3. at left.exp ® term: Follow(exp) added to Follow(term) = {$,+,-} by 3. at left so no changes
addop ® + : nothing
addop ® - : nothing
term ® term mulop factor :
First(mulop) added to Follow(term) = {$,+,-,*}
First(factor) added to Follow(mulop) = {(,number}
Follow(term) added to Follow(factor) = {$,+,-,*}term ® factor : Follow(term) added to Follow(factor) = {$,+,-,*} so no changes
mulop ® * : nothing
factor ® ( exp ) : First( ) ) added to Follow(exp) = {$,+,-, ) }
factor ® number : nothing
Rule Pass 1 Pass 2 Completed 1. exp ® exp addop term Follow(exp)=
{$,+,-}
Follow(addop)=
{(,number}
Follow(term)=
Follow(exp)Follow(term)=
{$,+,-,*,)}Follow(exp)=
{$,+,-,)}Follow(term)=
{$,+,-,*,)}Follow(factor)=
{$,+,-,*,)}Follow(addop)=
{(,number}Follow(mulop)=
{(,number}
2. exp ® term 5. term ® term mulop factor Follow(term)=
{$,+,-,*}
Follow(mulop)=
{(,number}
Follow(factor)=
Follow(term)Follow(factor)=
{$,+,-,*,)}6. term ® factor 8. factor ® ( exp ) Follow(exp)=
{$,+,-,)}
Follow Example
є-productions: First(A) for all non-terminals A
- stmt ® if-stmt
- stmt ® other
- if-stmt ® if ( exp ) stmt else-part
- else-part ® else stmt
- else-part ® є
- exp ® 0
- exp ® 1
Grammar contains є-productions
Rules 2,5,6,7 have only terminals so ignore. Initialize FOLLOW[stmt]={$}
Pass 1
Rule 1 adds:
FOLLOW[stmt] to FOLLOW[if-stmt] = {$}
Rule 3 adds:
) to FOLLOW[exp]={)}
FIRST[else-part]-є to FOLLOW[stmt]={$,else}
FOLLOW[if-stmt] to FOLLOW[else-part]={$}
FOLLOW[if-stmt] to FOLLOW[stmt]={$,else} (since else-part ® є)
Rule 4 adds:
FOLLOW[else-part] to FOLLOW[stmt]={$,else}
Pass 2
Rule 1 adds:
FOLLOW[stmt] to FOLLOW[if-stmt] = {$,else}
Rule 3 adds:
FOLLOW[if-stmt] to FOLLOW[else-part]={$,else}
Rule 4 adds:
FOLLOW[else-part] to FOLLOW[stmt]={$,else} but no change
Pass 3 No other changes occur so complete.
First(stmt)= {if, other} First(if-stmt)= {if}
First(else-part)= {else, є}
First(exp)={0,1}
Rule Pass 1 Pass 2 Completed 1. stmt ® if-stmt FOLLOW[stmt]={$}
FOLLOW[if-stmt]={$}FOLLOW[if-stmt]={$,else}
FOLLOW[stmt]={$,else} FOLLOW[if-stmt]={$,else}
FOLLOW[else-part]={$,else}
FOLLOW[exp]={)}
3. if-stmt ® if ( exp )
stmt else-partFOLLOW[exp]={)}
FOLLOW[stmt]={$,else}
FOLLOW[else-part]={$}FOLLOW[else-part]={$,else} 4. else-part ® else stmt FOLLOW[stmt]={$,else} 5. else-part ® є 6. exp ® 0 7. exp ® 1 Exercise - Determine FOLLOW sets for grammars below using the rules below. Rules 2-4 have no non-terminals.
Given a non-terminal A, the set FOLLOW(A) consists of terminals and possibly $ is defined as:
- if A is a start symbol then $ is in Follow(A).
- if production B ® a A g , then First(g)-{є} is in Follow(A)
- if production B ® a A g , such that є is in First(g), then Follow(A) contains Follow(B); since whatever follows B can follow A when g=є
- if production B ® a A, then Follow(A) contains Follow(B); since whatever follows B can follow A
- S ® ABCd
- A ® e|f|є
- B ® g|h|є
- C ® p|q
FIRST[S]={e,f,g,h,p,q} FIRST[A]={e,f,є}
FIRST[B]={g,h,є}
FIRST[C]={p,q}
- S ® AdCB
- A ® e|f|є
- B ® g|h|є
- C ® p|q
FIRST[S]={e,f,d} FIRST[A]={e,f,є}
FIRST[B]={g,h,є}
FIRST[C]={p,q}
Example
| E ® TQ
Q ® +TQ | -TQ | є T ® FR R ® *FR|/FR|є F ® (E) | i |
FIRST(E)=FIRST(T)=FIRST(F)={(,i} FIRST(Q)={+,-, є} FIRST(R)={*,/,є) FIRST(+TQ)={+} FIRST(-TQ)={-} FIRST(*FR)={*} FIRST(/FR)={/} FIRST((E)={(} FIRST(i)={i} |
FOLLOW(Q)={$,)}=FOLLOW(E) Q only appears at end (E ® TQ) so includes FOLLOW(E). Other productions have Q on left and right (recursive) so ignore.
FOLLOW(T)={+,-,),$}
FOLLOW(R)=FOLLOW(T)={+,-,),$} because of T ® FR and R at end of production. Rule 3.
FOLLOW(F)={+,-,*,/,),$} includes FIRST(R)-{є}={*,/}. R is nullable include
FOLLOW(T) from T ® FR.
CONSTRUCTING A PREDICTIVE PARSER
- Give the predictive parsing table for the grammar.
E ® TQ
Q ® +TQ | -TQ | є
T ® FR
R ® *FR|/FR|є
F ® (E) | iFIRST(E)=FIRST(T)=FIRST(F)={(,i}
FIRST(Q)={+,-, є}
FIRST(R)={*,/,є)
FIRST(+TQ)={+}
FIRST(-TQ)={-}
FIRST(*FR)={*}
FIRST(/FR)={/}
FIRST((E))={(}
FIRST(i)={i}FOLLOW(E)={$,)}
FOLLOW(Q)={$,)}
FOLLOW(T)={+,-,),$}
FOLLOW(R)={+,-,),$}
FOLLOW(F)={+,-,*,/,),$}
Blank entries are error conditions. Rules for typical production X ® b are:
- For all terminals a in FIRST(b) except є, [X,a] = X ® b
- If b=є or if є is in FIRST(b) then: For all a in FOLLOW(X), [X,a]=X ® є
E ® TQ FIRST(T)={(,i} so [E,(] and [E,i]=E ® TQ
Q ® +TQ FIRST(+TQ)={+} so [Q,+]=Q ® +TQ
Q ® -TQ FIRST(-TQ)={-} so [Q,+]=Q ® -TQ
T ® FR FIRST(F)={(,i} so [T,(] and [T,i] = T ® FR
R ® *FR FIRST(*FR)={*} so [R,*]=R ® *FR
R ® /FR FIRST(/FR)={/} so [R,/]=R ® /FR
F ® (E) FIRST((E))={(} so [F,(]=F ® (E)
F ® i FIRST(i)={i} so [F,i] = F ® iQ ® є FOLLOW(Q)={$,)} so [Q,$]=є and [Q,)]=є
R ® є FOLLOW(R)={+,-,),$} so [R,+]=[R,-]=[R,)]=[R,$]=є
i + - * / ( ) $ E E ® TQ E ® TQ Q Q ® +TQ Q ® -TQ Q ® є Q ® є T T ® FR T ® FR R R ® є R ® є R ® *FR R ® /FR R ® є R ® є F F ® i F ® (E) Example
Blank entries are error conditions. Rules for typical production X ® b are:
- For all terminals a in FIRST(b) except є, [X,a] = X ® b
- If b=є or if є is in FIRST(b) then: For all a in FOLLOW(X), [X,a]=X ® є
[X,a] = X®a because a is a terminal, FIRST(a)=a
X®Y because Y is nullable and aÎFOLLOW[X]={a,c,d}[X,c] = X®Y because cÎFIRST(Y)={a,c,d}
X®Y because Y is nullable and cÎFOLLOW[X]={a,c,d}[Y,c] = Y®c because c is a terminal, FIRST(c)=c
Y®є because Y is nullable and cÎFOLLOW[Y]={a,c,d}[Z,d] = Z®d and dÎFIRST(XYZ)={a,c,d}
[Z,c] = Z®XYZ since cÎFIRST(XYZ)={a,c,d}
FIRST[X]={a,c,є} FIRST[Y]={є,c}
FIRST[Z]={a,c,d}
FOLLOW[X]={a,c,d}
FOLLOW[Y]={a,c,d}
FOLLOW[Z]={}
Partial Recursive-descent Parser for Grammar 3.12 using FIGURE 3.14 void Z() {switch (tok) {
case a,c: X(); Y(); Z(); break;
case d: X(); Y(); Z(); break;
case d: eat(d); break;
default: error();
}}
void X() {switch (tok) {
case a: eat(a); break;
case a: Y(); break;
case c,d: Y(); break;
default: error ( ) ;
}}
Problems for predictive parsing
FIGURE 3.14 has duplicate entries (e.g. [X, a])
Grammar 3.12 is ambiguous, there are two parses of sentence d.
An ambiguous grammar yields duplicate predictive parsing table entries.
Predictive parsing will not work with ambiguous grammars.
LL(1) - Left-to-right, Left-most derivation, 1-symbol input look-ahead. Must be able to predict the production to use for parsing based on the next input symbol
Grammars that contain no duplicate entries in predictive parsing tables.
Recursive-descent parser looks ahead only one symbol.
LL(k) - Left-to-right, Left-most derivation, k-symbol look-ahead. No ambiguous grammar is LL(k) for any k, meaning no matter how far the look-ahead its still not parse-able.
ELIMINATING LEFT RECURSION - Grammars with left recursion cannot be LL(1)
left recursion often used to make operations left associative.
E ® E + T | T
T ® T * F | F
F ® i | (E)
Correct parse is at right, top-down predictive parser produces parse tree for i + i + i at left:
E
/ \
E T
/ \
E T
/ \
E T
/ \
E T
/
¥E
/ \
E T
/ \ |
E T F
| | |
T F i
| |
F i
|
i
immediate left recursion - where only occurs in production of a single non-terminal:
exp ® exp addop term | term
indirect left recursion - more difficult, uncommon in real grammars, example:
A ® B b
B ® A a
Removing immediate left recursion
left recursion present only in form: A ® A a | b
generates strings of the form: ban
Base case: A ® b
Recursive case: A ® A a
remove left recursion by rewriting A ® A a | b as rules:
generate base case b first: A ® bA'
generate repetitions a using:
right rather than left recursion: A' ® aA'
empty: A' ® є
Example
Removing left recursion from:
E ® E - T | T
yields:
E ® T E'
E' ® - T E'
E' ® є
Allows:
T
T - T - T
Question - What strings are defined by: S ® Sa | b
Question - Is the grammar still left associative? Parse 2 - 3 - 4 using:
E ® T E'
E' ® - T E'
E' ® є
T ® 2 | 3 | 4
LL(1) TABLE CONSTRUCTION
construct table by entering in row X, column T :
production X® g for each T Î FIRST(g)
if g is nullable, production for each T Î FOLLOW(X)
Example
Applying the following transformations to Grammar 3.10 converts left-recursion to right-recursion
Removing left-recursion from Grammar 3.10 yields 3.15
Example - Table construction
Production S ® E$
has FIRST(E$) = {(,id,num} so (S,id), (S, (), (S, num) = S ® E$
Production E' ® + T E'
has FIRST(+ T E') = {+} so (E',+) = E' ® +TE'
Production E' ® - T E'
has FIRST(- T E') = {-} so (E',-) = E' ® -TE'
E' ®
E' is nullable, FOLLOW(E') = {),$} so (E',)) and (E',$) = E' ®
Construct table by entering in row X, column T :
production X® g for each T Î FIRST(g)
if g is nullable, production for each T Î FOLLOW(X)
Exercise Explain reason for the entry:
[T,id]
[T',)]
LEFT FACTORING - required when two or more grammar rules share a common prefix string making productions ambiguous to an LL(1) parser.
A ® ab | ag
Problem for parser cannot determine which production to use, given FIRST(A) = {a} for:
A ® ab
A ® ag
Generally, grammar with choices can be factored from the form:
A ® ab | ag
to:
A ® aA'
A' ® b
A' ® g
Example - the following is ambiguous, having two productions starting with if
if-stmt ® if ( exp ) stmt | if ( exp ) stmt else stmt
can be factored to:
if-stmt ® if ( exp ) stmt else-part
else-part ® else stmt
else-part ® є
ERROR RECOVERY
Recursive-descent parser from TABLE 3.17 void T() {
switch (tok) {
case ID:
case NUM:
case LPAREN: F(); Tprime(); break;
default: error!
} }void Tprime() {
switch (tok) {
case PLUS: break;
case TIMES: eat(TIMES); F(); Tprime(); break;
case EOF: break;
case RPAREN: break;
default : error!
} }
Parse errors occur where parse table entries are blank.
Above indicates point where error could be reported and handled.
Three approaches to error handling:
Report error and exit. Problem: syntax errors reported one at a time for each compile.
Report error, insert a valid token and continue. Problem: insertion could produce another error leading to another insertion ... where new errors not fault of input stream.
Report error, delete invalid token and continue. Problem: often results in other, hard to predict errors and misleading error reporting until a production start token is located. One simple but common approach in many languages is to skip and restart parse past a statement separator (e.g. ;) when an error occurs.
Print error cause and exit void T() {
switch (tok) {
case ID:
case NUM:
case LPAREN: F(); Tprime(); break;
default: System.out.println("expected id, num, or left-paren");
System.exit(-1);
} }
Error recovery by skipping to a following token int Tprime_follow [] = {PLUS, RPAREN, EOF};
void Tprime() {
switch (tok) {
case PLUS: break;
case TIMES: eat(TIMES); F(); Tprime(); ~reak;
case RPAREN: break;
case EOF: break;
default: System.out.println("expected +, *, right-paren, or end-of-file");
skipto(Tprime_follow);
}}
3.3 LR (Bottom-up) PARSING
LR(0) - Left-to-right parse, Rightmost-derivation, 0 token look ahead.
LR(1) - Left-to-right parse, Rightmost-derivation, 1 token look ahead.
SLR(1) - Simple LR(1) that uses some look ahead
LALR(1) - Look Ahead LR(1), more powerful than SLR(1) but less complex than LR(1)
Comparison
Bottom-up more powerful than top-down (e.g. left recursion not a problem)
Bottom-up more complex than top-down
Bottom-up generally too complex for hand coding
LL(k) parsing must predict production to use having seen k tokens of right-hand side.
LR(k) postpones production decision until seeing complete right-hand side of production and k input tokens beyond. Uses stack to hold intermediate symbols leading to production rule decisions and input.
Using LR parsing table, 2 types of actions
Shift: sn push input token onto stack
Reduce: rn choose rule X ® A B C; pop C B A; push X
LR PARSING ENGINE
General operation is:
Place $ at end of input string, push beginning state 0
Repeat
Let m be current state and ai the incoming token
Let X = Table[m,ai ]
switch X
case sn : shift ai and enter state m=n, push ai and m
case rk : reduce using production rule k
pop state and as many symbols as in right side of rule k
stop = Stack[top],
push symbol on left side of rule k and new current state m=Table[symbol on left side of rule k, stop]; will be nonterminal from goto part of table.
until input accepted or error
Example of LR(1) parsing - Uses one token look-ahead (11 column entries). LR(2) uses 2 token sequence look-ahead (112 entries for all possible pairings)
Input string: (i+i)/i
Start by pushing state 0.
s4 shifts ( pushes ( and 4; the new current state m=Table[0,(]=4
r8 reduces by rule 8 (i.e. F ® i). Pop the one right-hand symbol of rule 8 and state (i and 5). State at top of stack, 4, and rule 8 left hand symbol, F, determine new current state: m = Table[F,4] = 3. Push F and new state. The bottom-up effect has been to reduce i using F ® i
Question: Explain Lines 7 and 8.
Table[$,1]=accept, parse is complete
Example
Actions
sn Shift into state n
gn Goto state n
rk Reduce by rule k (see Grammar 3.1)
a Accept
Error when blank table entry
TABLE 3.19 meaning
(1,id)=s4 means to shift, next row is 4.
(10,$)=r5 means to reduce using rule 5.
Derive parse of a:=7$ Stack Input Action row,column entry 1 a:=7$ shift 1,id s4 1id4 :=7$ shift 4,:= s6 1id4:=6 7$ shift 6,num s10 1id4:=6num10 $ reduce E ® num 10,$ r5 6,E g11 1id4:=6E11 $ reduce S ® id:=E 11,$ r2 1,S g2 1S2 $ accept 2,$ accept
LR(0) PARSER GENERATION
LR(0) grammars are those that can be parsed by only examining the stack, uses 0 look-ahead. Too weak to be useful but illustrates parser generation.
LR(0) Items - a production choice distinguished by a '.' in its right-hand side. The concept is that an item records an intermediate step in the recognition of the right-hand side of a particular production.
If A ® a is a production and if b and g are any two symbols (including the empty string є) such that a = bg, then A ® b.g is an LR(0) item.
A ® b.g means that b has already been seen and it may be possible to derive the next input tokens from g and b would be at the top of the stack.
A ® .a means we may be about to recognize an A by using the grammar rule A ® a
A ® a. means we have recognized an A by using the grammar rule A ® a
Example
Grammar:
S' ® S
S ® ( S ) S
S ® є3 production choices and 8 items:
S' ® .S
S' ® S.
S ® .(S)S
S ® (.S)S
S ® (S.)S
S ® (S).S
S ® (S)S.
S ® .
Exercise - What are the items for the following grammar?
E' ® E
E ® E +n | n
LR(0) parser construction
We have seen that a NFA can be more readily be constructed than DFA, then convert the NFA to a DFA. Here we will examine the NFA as a means of understanding the process.
NFA transitions - consider item A ® a.g and g begins with symbol X which may be a token or non-terminal. The item can then be written as A ® a.Xh and there is a transition on X:
Shift when X is a token:
If X is a token this corresponds to a shift of X from input to the top of the stack.
If X is a non-terminal this corresponds to a reduction of by a production X ® .b so for every item A ® a.Xh we must add an є transition
Reduce when X is a non-terminal:
NFA generation
E' ® .E
E ® .E+n
E ® .n
E' ® .E
E ® E.+n
E ® E+.n
E ® E+n.
E ® n.
S' ® .S
S' ® S.
S ® .(S)S
S ® (.S)S
S ® (S.)S
S ® (S).S
S ® (S)S.
S ® .
NFA to DFA
![]()
|
Converting NFA to DFA see Chapter 2 ε-closure of single state s is the set of states reachable by a series of 0 or more ε-transitions, written as s. The ε-closure of state s always contains state s itself. Example: NFA for regular expression for a* has 1= {1,2,4}, 2= {2}, 3= {2,3,4}, 4= {4}
ε-closure - S, of a set of states is the union of the ε-closures of each individual state s: S=Ès
|
Closure - I is set of items, closure(I) adds more items to set of items when there is a . to left of non-terminal.
Add item [A ® .B] to set
(Closure rule) For all items in the set so far, if the . precedes another non-terminal D, include all initial D-items unless already there (an initial D-item is one whose left-hand side is D and right-hand side begins with a .
Keep applying closure rule until nothing more can be added.
Goto - X is terminal or non-terminal grammar symbol, goto(I,X) moves . past X in all items in I.
S' ® S$ - auxiliary start production
T - set of states seen so far
E - set of shift or goto edges seen so far
R - set of reduce actions
Example - DFA set construction
Grammar
S' ® S
S ® ( S ) S
S ® є
Items
S' ® .S
S' ® S.
S ® .(S)S
S ® (.S)S
S ® (S.)S
S ® (S).S
S ® (S)S.
S ® .
DFA of sets of LR(0) items
Determine states:
State 0 = [S' ® .S] È Closure(S' ® .S) = { [S ® .(S)S], [S ® .]} because . before non-terminal S
State 1 = [S' ® S.] È Closure(S' ® S.) = { } because . before terminal or end
State 2 = [S ® (.S)S] È Closure(S ® (.S)S) = { [S ® .(S)S], [S ® .]} because . before non-terminal S
Exercise - What is State 4 and why?
Example - Parse Table Transition Entries
Grammar
E' ® E
E ® E+n
E ® n
Items
E' ® .E
E' ® E.
E ® . E+n
E ® E.+n
E ® E+.n
E ® E+n.
E ® .n
E ® n.
DFA of sets of LR(0) items
Transition entries
- Table(0,n) = s2 because n is terminal and transition from state 0 to 2.
- Table(0,E) = goto 1 because E is non-terminal. The next input symbol (either + or $) determines table entry;
- i.e. Table(1, input symbol)
- Table(1,$) = accept because marks end of successful parse of input.
- Note that State 4 has reduction entries not listed here.
State Input Goto n + $ E 0 s2 1 1 s3 accept 2 3 s4 4 Example - Parse Table Transition and Reduction Entries
Grammar
A' ® A
A ® (A)
A ® a
Items
A' ® .A
A' ® A.
A ® . (A)
A ® (.A)
A ® (A.)
A ® (A).
A ® .a
A ® a.
DFA of sets of LR(0) items
Transition entries
State Input Goto ( a ) A 0 s3 s2 1 1 2 3 s3 s2 4 4 s5 5
- Table(0,() = s3, ( is terminal and transition from state 0 to 3. Push ( onto stack.
- Table(0,A) = goto 1, A is non-terminal.
- Table(5,)) = r2, reduce using Rule 2. A ® (A) Push A onto stack.
- Table(2,a) = r3, reduce using Rule 3. A ® a. Push A onto stack.
Note the accept state corresponds the final reduction A' ® A.
FOLLOW(A)={),a}, since a and ) are terminal, defines inputs for which reductions are required.
Reduction entries
- FOLLOW(A)={),a}
State Input Goto ( a ) A 0 s3 s2 g1 1 acc 2 r3 r3 r3 3 s3 s2 g4 4 s5 5 r2 r2 r2
Example - LR(0) construction
S' ® .S, S' is start, right-hand side is S and . is current position of the parser.
State 1, input begins with S or any possible right-hand side of S production.
State 2 reached from state 1 when x (a terminal) is shifted, .S and .(L) ignored. With . at end of item (S ® x.), top of stack has complete right-hand side of production (S ® x) which parser would reduce.
State 3 reached by shifting (.L), all L productions included with S productions
State 4 reached by goto 4 after shift of x (State 2) and reducing S production (S ® x) Rule 2.
For symbol $ make an accept action rather than goto(I,$).
- for each edge
where X is terminal, (I,X)=shift J
- for each edge
where X is non-terminal, (I,X)=goto J
- for each state I containing item S' ® S.$, (I,$)=accept
- for a state I containing A ® g. (production number n), (I,Y)=reduce n for every token Y
Adding Actions - Reductions
If a state q contains a completed item [A ® a.] the action is reduce n where n is the number of the corresponding production of the completed item A. For every token Y, Table(q , Y) = reduce n
Example
For
State 2: Completed item [S ® x.] using grammar Rule 2. Add r2 for {()$x,} table entries.
State 6: Completed item [S ® (L).] using grammar Rule 1. Add r1 for {()$x,} table entries.
State 7: Completed item [L ® S.] using grammar Rule 3. Add r3 for {()$x,} table entries.
State 9: Completed item [L ® L,S.] using grammar Rule 4. Add r4 for {()$x,} table entries.
Example Constructing the State Table
Will use V[a] as set of items reachable via sequence of symbols a.
State 0: V[є]- reachable from sequence of symbols є, corresponds to empty stack and augmented rule Z ® E
Put item [Z ® .E] into set V[є]= { [Z ® .E] }
(Closure) For all items in set so far, if . precedes non-terminal D, include all initial D-items into set. Initial D-item is one with D on left side and right side begins with a dot. Continue until nothing more can be added.
V[є]= { [Z ® .E],[E ® .E+T],[E ® .E-T],[E ® .T],[T ® .T*F],[T ® .T/F],[T ® .F],[F ® .(E)],[F ® .i] }
Transitions out of State 0
For items constituting state, V[є], transitions are for symbols following . or {E, T, F, (, i}. Will define a row (state) for each transition.
State 1: V[E] - transitions on E out of State 0 { [Z ® .E],[E ® .E+T],[E ® .E-T] }
Found by moving . over E. Closure rule does not apply because . followed by terminal.
V[E] = { [Z ® E.],[E ® E.+T],[E ® E.-T] }
State 2: V[T] - transitions on T out of State 0 { [E ® T.],[T ® .T*F],[T ® T./F] }
Again closure rule does not apply because . followed by terminal.
V[T] = { [E ® T.],[T ® T.*F],[T ® T./F] }
State 3: V[F] - transitions on F out of State 0 { [T ® .F] }
V[F] = { [T ® F.] }
State 4: V[(] - transitions on ( out of State 0 { [F ® .(E)] }
Closure rule applies because transition on ( moved . which is followed by E, a nonterminal; must add E, T, F items. Note that . not moved on these items since corresponds to a є-transition.
V[(] = { [F ® (.E)], [E ® .E+T],[E ® .E-T],[E ® .T],[T ® .T*F],[T ® .T/F],[T ® .F],[F ® .(E)],[F ® .i] }
State 5: V[i] - transitions on i out of State 0 { [F ® .i] }
V[i] = { [F ® i.] }
State 1: V[E]
Transitions out of State 1
For items constituting state, V[E]= { [Z ® E.],[E ® E.+T],[E ® E.-T] }, transitions are for symbols following . or {+,-}. Will define a row (state) for each transition.
State 6: V[E+] - transitions on + out of State 1 {[E ® E.+T] }
Closure rule applies because transition on + moved . which is followed by T, a nonterminal; must add T, F items. Note that . not moved on these items since corresponds to a є-transition.
V[E+] = { [E ® E+.T],[T ® .T*F],[T ® T./F],[T ® .F],[F ® .(E)],[F ® .i] }
State 7: V[E-] - transitions on - out of State 1 {[E ® E.-T] }
V[E-] = { [E ® E-.T],[T ® .T*F],[T ® T./F],[T ® .F],[F ® .(E)],[F ® .i] }
State 2: V[T]
Transitions out of State 2
For items constituting state, V[T]= { [E ® T.],[T ® T.*F],[T ® T./F] }, transitions are for symbols following . or {*/}. Will define a row (state) for each transition.
State 8: V[T*] - transitions on * out of State 2 {[T ® T*.F]}
Closure rule applies because transition on * moved . which is followed by F, a nonterminal; must add F item.
V[T*] = { [T ® T*.F],[F ® .(E)],[F ® .i] }
State 9: V[T/] - transitions on / out of State 2 {[T ® T/.F] }
V[T/] = { [T ® T/.F] ,[F ® .(E)],[F ® .i] }
State 3: V[F]
Transitions out of State 3
There are no transitions since V[F] = { [T ® F.] }
Remaining states
State 10: V[(E] = { [F ® (E.)], [E ® E.+T],[E ® E.-T] }
State 11: V[E+T] = { [E ® E+T.],[T ® T.*F],[T ® T./F] }
State 12: V[E-T] = { [E ® E-T.],[T ® T.*F],[T ® T./F] }
State 13: V[T*F] = { [T ® T*F.] }
State 14: V[T/F] = { [T ® T*F.] }
State 15: V[(E)] = { [F ® (E).] }
Completed State table and diagram
SLR (Simple LR(1)) PARSER GENERATION
SLR(1) is more powerful than LR(0) parsing:
uses LR(0) DFA sets
consults input token before the shift to make sure appropriate DFA transition exists
uses FOLLOW set of non-terminal to determine if reduction should be performed.
Table:
Transitions are the same
Reduction difference
LR(0) if a state has a complete item of the form A ® a. then reduce using grammar rule A ® a
SLR(1) if a state has a complete item of the form A ® a. and the look-ahead token is in FOLLOW(A) then reduce using grammar rule A ® a
SLR(1) have fewer reductions in table due to look-ahead
Adding Action-Table Entries for Terminals
Note that non-terminals are in Goto part of table since they are not shifted or reduced.
Change each state entry from n to action sn.
If a state contains a completed item [A ® a.] then for all inputs in FOLLOW(A) the action is reduce n where n is the number of the corresponding production of the completed item A.
If State q contains item [Z ® a.], the completed item corresponding to the dummy production, then Table[q,$]=accept
Example
State 0: V[є] - No completed items hence no reductions. For terminals only, change 5 and 4 to s5 and s4.
State 1: V[E] - Completed item [Z ® E.], the dummy production; by Rule 3 above, Table[1,$]=acc.
State 2: V[T] - Completed item [E ® T.] using grammar Rule 3. FOLLOW(E)={+,-,),$} so add r3 to those table entries.
State 3: V[F] - Completed item [T ® F.] using grammar Rule 6. FOLLOW(T)={*,/,+,-,),$} add r6 to those table entries. Note FOLLOW(E) = {+,-,),$} added to FOLLOW(T) because T comes at end in production E ® T.
Exercise
What are the row entries for State 5: V[i] = { [F ® i.] }
Why aren't we interested in the FIRST sets for LR(0)?
Example - SLR(1) Parse Table
Grammar
E' ® E
E ® E+n
E ® n
Items
E' ® .E
E' ® E.
E ® . E+n
E ® E.+n
E ® E+.n
E ® E+n.
E ® .n
E ® n.
DFA of sets of LR(0) items
Transition entries
- Table(0,n) = s2 because n is terminal and transition from state 0 to 2.
- Table(0,E) = goto 1 because E is non-terminal. The next input symbol (either + or $) determines table entry; Table(1, input symbol)
- Table(1,$) = accept because marks end of successful parse of input.
State Input Goto n + $ E 0 s2 1 1 s3 accept 2 3 s4 4
Reduction entries
- FOLLOW(E')={$}
- FOLLOW(E)={$,+}
State Input Goto n + $ E 0 s2 1 1 s3 accept 2 r3 r3 3 s4 4 r2 r2 (2,+) = r3 = (2,$)
FOLLOW(E)={$,+}
Example SLR
Table[3,+] has duplicate entry, s4 when [E ® T.+E] but also r2 to reduce under grammar Rule 2 [E ® T] - grammar is not LR(0).
SLR parse table transitions same as LR(0).
SLR reductions are computed by algorithm where (I,X,A ® a) indicates that in state I, on look-ahead symbol X (from the FOLLOW(A) set) the parser will reduce by rule A ® a.
Computation of reductions:
FOLLOW(E) = {$} from Rule 0
FOLLOW(T)= {$,+} from Rule 1 and 2
Exercise
Why is Table(5,$) = r3
Why is Table(5,+) = r3
Why isn't Table(3,+) = r2
Parse x+x$
Shift-Reduce Conflicts
The original Table(3,+) contains s4 and r2, meaning both can occur, a shift-reduce conflict.
Reduce-Reduce Conflicts
If Table(3,+) contains r3 and r2, meaning two rules could be used for a reduction, a reduce-reduce conflict occurs.
Exercise - State 3 of the above has a shift-reduce conflict for LR(0). Why? SLR(1) does not. Why?
LR(1) PARSER GENERATION
LR(0) - no look-ahead
SLR(1) - same transitions as LR(0) but look-ahead on FOLLOW set for reductions.
LR(1) - look-ahead on input token.
LALR(1) - same as LR(1) except states differing only by combining look-ahead for reductions.
Shift/Reduce Problem with SLR
S ® L=R
S ® R
L ® *R
L ® id
- R ® L
State 2 S ® L.=R
R ® L.
= 2 shift =
R ® LConsider the actions of State 2.
- FOLLOW(R) includes the = sign (because FOLLOW(L)={=} and R ® L) which requires reducing R ® L. by Rule 2 with = as the next input.
- The shift of = is also called for because of S ® L.=R in Rule 1.
- However there is no rule with a right-hand side of the form: R=....
- State 2 should not reduce L to R but only shift.
- Need to split states when necessary by additional information, the look-ahead.
Look-ahead added to productions by [A ® a.b, a] where production A ® ab has the look-ahead of terminal a or $.
LR(1) Parse Table Construction Example
Similar to LR(0) except that each LR(0) item will be further divided into items that differ by look-ahead. Where LR(0) filled an entire state with a reduction, LR(1) places reductions only in state[n, terminal] specificially where the terminal is the look-ahead.
Grammar Rules
S' ® S$
S ® CC
C ® cC
C ® d
LR(0) Items
S' ® .S$
S' ® S.$
S ® .CC
S ® C.C
S ® CC.
C ® .cC
C ® c.C
C ® cC.
C ® .d
C ® d.
Compute closure( {[S' ® .S, $]} )
note that this is different than for LR(0) or SLR
and that in A ® a.Bb, FIRST(ba)¹FOLLOW(B) in general
closure(I)
for each item [A ® a.Bb, a] in I
each production B ® g in the grammar G
and each terminal b in FIRST(ba)
such that [B ®. g, b] is not in I
add [B ®. g, b] to I
S' ® S$
S ® CC
C ® cC
C ® d
Match item [S' ® .S, $] with [A ® a.Bb, a]
A=S'
a=є
B=S
b=є
a=$
Add [B ®. g, b] for each B ® g and terminal b in FIRST(ba)
Add [S®.CC,$] since b=є and a=$, FIRST(є$)={$}, b=$
Match item [S ® .CC, $] with [A ® a.Bb, a]
A=S
a=є
B=C
b=C
a=$
Add [C®. g, b] for each C® g and terminal b in FIRST(C$)=FIRST(C)={c,d} since C does not derive є the $ is not included.
Add [C ® .cC, c], [C ® .cC, d], [C ® .d, c], [C ® .d, d] since FIRST(C)={c,d}
None of the new items have non-terminals to right of . so completed first set of LR(1) items.
Write shorthand of [C ® .cC, c/d] for [C ® .cC, c], [C ® .cC, d] since the same transition on c.
I0: [S' ® .S, $] Note that I0 is a state in the DFA
[S ®.CC, $]
[C ® .cC, c/d]
[C ® .d, c/d]Now compute goto(I0,X) for various X; shifts for terminals and goto's for non-terminals. Derives new states.
X=S
Closure([S' ® S., $]) = [S' ® S., $] since . is at end
I1: [S' ® S., $]
X=c
Closure([C ® c.C, c/d]) yields:
I3: [C ® c.C, c/d]
[C ® .cC, c/d]
[C ® .d, c/d]X=C
Closure([S ®C.C, $]) yields:
I2: [C ® C.C, $]
[C ® .cC, $]
[C ® .d, $]X=d
Closure([C ® d., c/d]) yields:
I4: [C ® d., c/d]
goto(I1,X) for X=є. adds no new items.
goto(I2,X)S' ® S$
S ® CC
C ® cC
C ® d
X=C
Closure([C ® CC., $]) = [C ® CC., $]
I5: [C ® CC., $]
X=c
Closure([C ® c.C, $]) yields:
I6: [C ® c.C, $]
[C ® .cC, $]
[C ® .d, $]X=d
Closure([C ® d., $]) yields:
I7: [C ® d., $]
goto(I3,X)
X=C
Closure([C ® cC., c/d]) = [C ® cC., c/d]
I8: [C ® CC., $]
X=c
Closure([C ® c.C, c/d]) yields:
I3: [C ® c.C, c/d]
[C ® .cC, c/d]
[C ® .d, c/d]X=d
Closure([C ® d., c/d]) yields:
I4: [C ® d., c/d]
goto(I4,X) and goto(I5,X) for X=є add no new items.
goto(I6,X) and goto(I7,X) for c and d are I6 and I7 respectively.
S' ® S$
S ® CC
C ® cC
C ® dgoto(I6,C)
X=C
Closure([C ® cC., $]) yields:
I9: [C ® cC., $]
goto(I8,X) and goto(I9,X) add no new items.
goto graph and LR(1) parse table
![]() |
![]() Rules needed for reductions
Reductions: If [A ® a., a] in Ii and A¹S' then Table(i,a)=reduce A ® a. Accept: If [S' ® S., $] in Ii then Table(i,a)=accept |
|
Key differences:
SLR
reductions use FOLLOW sets which are formed by union of other FOLLOW sets; can include too much.
FOLLOW(S)={$}
FOLLOW(C)={c d}ÈFOLLOW(S)={c d $}states formed on closure of LR(0) items with no added information
LR(1)
use FIRST sets which are formed by union of other nullable FIRST sets; .
FIRST(C)= { c d }states formed on closure of LR(0) items + FIRST set information, produces more states than SLR.
LALR(1) - Look Ahead LR
All SLR grammars are LR(1) but all LR(1) grammars are not SLR. The disadvantage of LR(1) is the increased number of states over SLR due to the added states from the look-ahead. LALR(1) has the same number of states as SLR, handles most common syntactic constructs in programming languages and any SLR grammar plus some cases not SLR.
In the LR(1) DFA diagram above I4 and I7 [C ® d., c/d] and [C ® d., $] are the same except for the look-ahead.
Note that the grammar generates c*dc*d.
On input of cc...cdcc...cd:
the parser shifts the first group of c's and following d onto stack in [I3,d]=s4,
[I4,c] or [I4,d] reduces by r3 provided the next input is c or d.
The requirement that look-ahead c or d is next makes sense since either of these symbols would begin the second c*d. If $ follows the first d the input is similar to: ccd$ which is not part of the language, state 4 would declare an error. After reducing by r3 in state 4, and eventually reading the second d, the parser enters state 7 and must see a $ or the string is not of the form c*dc*d. It reduces by r3 provided the next input is $.
cores - I4 and I7 are similar, [C ® d., c/d] and [C ® d., $], the core C ® d. forming the reduction is the same.
Union - The union of I4 [C ® d., c/d] and I7 [C ® d., $] is now I47 [C ® d., c/d/$] having the same reduction on the look-ahead c/d/$.
goto's - goto's on d to I4 or I7 from I0, I2, I3 and I6 now enter I47. Other merged sets are done similarly.
The resulting merger will not produce shift/reduce conflicts not already in the LR(1) sets since shifts do not depend on the look-ahead. Shift/reduce conflicts may not be detected immediately, since the look-aheads are merged, but will be detected before another shift occurs.
However, the merger may produce reduce/reduce conflicts.
Consider a grammar that has two item sets {[A®c., d], [B®c., e]} and {[A®c., e], [B®c., d]}.
The union {[A®c., d/e], [B®c., d/e]} causes a reduce/reduce conflict since reductions by both A®c and B®c are called for on inputs d and e.
Example LR(1) and LALR(1)
LR(1) states 7 and 12 identical (r3) ignoring lookahead {= $}
LR(1) states 8 and 11 identical (r4) ignoring lookahead {= $}
LR(1) states 10 and 14 identical (r5) ignoring lookahead {= $}
LR(1) states 6 and 13 identical; since now s8=s11, g10=g14, g7=g12 then s6=s13.
LR PARSING OF AMBIGUOUS GRAMMARS
Ambiguous grammars result when a rule can be parsed in more than one way. For the grammar:
S ® if E then S else S
S ® if E then S
S ® other
allows two parses of the following:
if e1 then if e2 then s1 else s2
as:
if e1 then (if e2 then s1 else s2)
if e1 then (if e2 then s1) else s2
In most languages, the else is paired with the nearest then. LR parsing tables will have a shift-reduce conflict with the else, a shift for parse 1 and reduce for parse 2.
The ambiguity can be removed by rewriting the grammar to add separate rules for matched and unmatched if statements.
S® M
S ® U
M® if E then M else M // else with nearest if rule
M ® other
U ® if E then S
U ® if E then M else U // M is if else match or other, forces else with nearest if
Rather than change grammar can resolve shift-reduce conflicts by favoring shift on an else to enforce parse 1 or reduce for parse 2.
3.4 USING PARSER GENERATORS
Two parser generators in Java will be examined: JavaCC and SableCC.
JavaCC - Intermixes grammar rules and control code.
PARSER_BEGIN(MyParser) - defines a class (e.g. MyParser) with a main function that:
constructs a parser inputting characters from standard input (i.e. keyboard)
calls a generated function Prog() to parse input.
TOKEN - regular expressions defining valid tokens
SKIP - regular expressions defining strings to skip or ignore
Prog() - analyzes input following productions
Grammar3.31.jj options {
DEBUG_PARSER = true; // Set to false to turn off token printing
}
PARSER_BEGIN(MyParser)
class MyParser
{
public static void main(String args[]) throws ParseException {
new MyParser(System.in).Prog();
}
}
PARSER_END(MyParser)
TOKEN : {
< WHILE: "while">
| < BEGIN: "begin">
| < END: "end">
| < DO: "do">
| < IF: "if" >
| < THEN: "then">
| < ELSE: "else">
| < SEMI: ";">
| < ASSIGN: "=">
| < ID: ["a"-"z"] ( ["a"-"z"] | ["0"-"9"])* >
}
SKIP: { " " | "\t" | "\n" | "\r" | "\r\n" }
void Prog() : {}
{ StmList() <EOF> }
void StmList() : {}
{ Stm() StmListPrime() }
void StmListPrime() : {}
{ ( <SEMI> Stm() StmListPrime() )? }
void Stm() : {}
{ <ID> <ASSIGN> <ID>
| <WHILE> <ID> <DO> Stm()
| <BEGIN> StmList() <END>
| <IF> <ID> <THEN> Stm()
[ LOOKAHEAD(2) <ELSE> Stm() ]
}Input - The following is valid for the parser produced.
abc123 = if4; if a then b=c
Use
- Home:
- IUS:
- Enter in the DOS Command window:
v:\common\user\C431\CC
- JavaCC
- Download Grammar3.31.jj and Parse1, Parse2, Parse3.
- Generates lexical analyzer named MyParser.java, assumes JavaCC is a root-level directory.
JavaCC Grammar3.31.jj
- Compile all Java files in directory.
javac -classpath . *.java
- Parse the contents of file Parse1.
java -cp . MyParser < Parse1
SableCC - Object oriented for extensibility and separates the regular expression definition and execution into two files:
Lexical analyzer and parser generator control that defines the regular expressions for tokens and productions possible from the tokens.
Java program that calls the lexer and parser.
- Grammar3.32.js - The generator control file.
- Program3 - the name given to a sub-directory used to group all generated files.
- Helpers - generally used separate non-token definition from tokens.
- Tokens - regular expression for valid tokens.
- Ignored Tokens - just what it says.
- Productions - Define productions.
Grammar3.32.js Package Program3; Tokens while = 'while'; begin = 'begin'; end = 'end'; do = 'do'; if = 'if'; then = 'then'; else = 'else'; semi = ';'; assign = '='; whitespace = (' ' | '\t' | 13 10 | '\n' )+; // 13 10 is '\r' '\n' id = ['a'..'z'] (['a'..'z'] | ['0'..'9'])*;
Ignored Tokens whitespace;
Productions prog = stmlist; stm = {assign} [left]: id assign [right]: id | {while} while id do stm | {begin} begin stmlist end | {if_then} if id then stm | {if_then_else} if id then [true_stm] : stm else [false_stm] : stm ; stmlist = {stmt} stm | {stmlist} stmlist semi stm;
- Main.java - Should be in sub-directory Program3, scans input for tokens and prints any found or error,
- Program3 - same in both files, the sub-directory for the parser.
- Lexer lexer - constructs new lexer using standard input (i.e. System.in).
- Parser parser - constructs new parser using the lexer.
- Start ast = parser.parse - calls the parser and returns the constructed the AST tree.
- ast.apply(new PrintTree()) - starts the traversal of AST tree and calls to the methods defined in PrintTree.
Main.java package Program3;
import Program3.lexer.*;
import Program3.node.*;
import Program3.parser.*;
import java.io.*;
public class Main{
public static void main(String[] arguments){
try{
Lexer lexer = new Lexer(new PushbackReader
(new InputStreamReader(System.in), 1024));
Parser parser = new Parser(lexer);
Start ast = parser.parse();
ast.apply(new PrintTree());
}
catch(Exception e){ System.out.println("Error: " + e.getMessage()); }
}
}
- PrintTree.java - Should be in sub-directory Program3, defines methods that called during AST traversal,
- Program3 - same in both files, the sub-directory for the parser.
- class PrintTree extends DepthFirstAdapter - class for traversing AST tree in depth-first order.
- public void caseTAssign(TAssign node) - called when an assign token node visited (i.e. from our grammar: assign = '=';)
- public void caseAWhileStm(AWhileStm node) - called when a while statement production node visited (i.e. from our grammar:
{while} while id do stm )Grammar method naming conventions
Tokens - assign = '=';
- method - caseTtoken where token has first letter capitalized: caseTAssign
- node parameter - Ttoken where token has first letter capitalized: TAssign
Productions - stm = {while} while id do stm
- method - caseALabelProduction where Label (e.g. {while}) and Production (e.g. stm) have first letter capitalized: caseAWhileStm
- node parameter - ALabelProduction where Label (e.g. {while}) and Production (e.g. stm) have first letter capitalized: AWhileStm
PrintTree.java package Program3;
import Program3.analysis.*;
import Program3.node.*;
class PrintTree extends DepthFirstAdapter
{
public PrintTree() { System.out.println("PrintTree"); }
public void caseTAssign(TAssign node) { System.out.print("caseTAssign <"+node+">"); }
public void caseTId(TId node) { System.out.print("caseTId <"+node+">"); }
public void caseTBegin(TBegin node){ System.out.print("caseTBegin <"+node+">");}
public void caseTDo(TDo node){ System.out.print("caseTDo <"+node+">");}
public void caseTElse(TElse node){ System.out.print("caseTElse <"+node+">");}
public void caseTEnd(TEnd node){ System.out.print("caseTEnd <"+node+">");}
public void caseTIf(TIf node){ System.out.print("caseTIf <"+node+">");}
public void caseTSemi(TSemi node){ System.out.print("caseTSemi <"+node+">");}
public void caseTThen(TThen node){ System.out.print("caseTThen <"+node+">");}
public void caseAAssignStmWithoutTrailingSubstm(AAssignStmWithoutTrailingSubstm node)
{ System.out.println("caseAAssignStmWithoutTrailingSubstm <"+node+">"); }
public void caseAWhileStm(AWhileStm node)
{ System.out.println("caseAWhileStm <"+node+">"); }
public void caseAStmtStmlist(AStmtStmlist node)
{ System.out.println(" caseAStmtStmlist <"+node+">"); }
public void caseAIfThenStm(AIfThenStm node)
{ System.out.println(" caseAIfThenStm <"+node+">"); }
public void caseAIfThenElseStm(AIfThenElseStm node)
{ System.out.println(" caseAIfThenElseStm <"+node+">"); }
public void caseABeginStmWithoutTrailingSubstm(ABeginStmWithoutTrailingSubstm node)
{ System.out.println(" caseABeginStmWithoutTrailingSubstm <"+node+">"); }
public void outAStmtStmlist(AStmtStmlist node)
{ System.out.println("outAStmtStmlist <"+node+">"); }
}Input - The following is valid for the parser. Note that due to the ambiguity of if rule, SableCC will not generate a parser for the above but give the error discussed in the text on p. 71. Use the disambiguated grammar Grammar3.34.js for testing.
abc123 = if4; while a do b=c
Use
- Download Grammar3.32.js, Grammar3.34.js, Main.java, PrintTree.java and Parse1, Parse2, Parse3.
- To generate the SableCC parser:
Home: java -jar \sablecc-2.18.2\lib\sablecc.jar Grammar3.32.js
IUS: java -jar v:\common\user\C431\sablecc.jar Grammar3.32.js
- Copy Main.java and PrintTree.java to sub-directory Program3.
- To delete an older version of the parser and compile the Java program named Program3\Main.java enter:
del Program3\Main.class
javac -classpath . Program3\Main.java
- To perform the parsing of Parse1:
java -cp . Program3.Main < Parse1
3.5 ERROR RECOVERY
- LR(k) parse tables, where blank, denotes errors.
- We have assumed that an error halts the parse.
- Will attempt recovery from error and continue parse.
RECOVERY USING THE ERROR SYMBOL
Local Recovery approaches
- Adjust parse stack and input at point error detected to allow parse to resume.
- Add special grammar rules for handling errors.
Example
One approach uses synchronizing tokens such as ; or ) to provide a recovery point and a special terminal token error.
When error detected:
- pop stack until state reached where action for error token is shift
- shift error token
- skip input up to the synchronization token.
- resume parsing
exp ® ID exp ® ( exps )
exps ® exps ; exp
exp ® ( error ) exps ® error ; exp
In look-ahead parsers (SLR and LALR) the rule exp ® error (not followed by token) would be a reduce action without advancing the input, leading to another error when input examined.
Global Error Repair
- Finds the smallest set of insertions, deletions or replacement that would turn the source string into a syntactically valid string.
- Requires no changes to grammar (error productions) or table modification.
- Parse engine (which uses tables) is modified.
Burke-Fisher error repair examines tries every possible single-token insertion, deletion or replacement at every point K tokens before parser reports error.
- Parser must back up K tokens an reparse on error detection (blank table entry).
- Attempts repair on up to K tokens (e.g. replace { with } ).
- Maintains two parse stacks, current and old.
- Queue of K most recently shifted tokens; each shift moves new token to queue and current stack; and oldest token from queue to old stack.
- Appropriate reduce actions performed on current and old stacks (old is behind current in parsing K tokens).
- Error repair
- Create copy of queue
- Make changes to copy
- Attempt to reparse from old stack (K previous tokens from current stack) using tokens from queue copy.
- If queue copy emptied, repair successful.
Example
![]() |
![]() |