It will be recalled that automata have very limited memory capabilities. These can be strengthened
by adding a stack structure.
5.1 Pushdown Automata
As Fig. 5.1 indicates, a pushdown automaton consists of three components: 1) an
input tape, 2) a control unit and 3) a stack structure. The input tape consists of a linear
configuration of cells each of which contains a character from an alphabet. This tape can be
moved one cell at a time to the left. The stack is also a sequential structure that has a first
element and grows in either direction from the other end. Contrary to the tape head associated
with the input tape, the head positioned over the current stack element can read and write special
stack
characters from
that position. The current stack element is always the top element of the stack, hence the name
``stack''. The control unit contains both tape heads and finds itself at any moment in a
particular state.
Figure 5.1:
Conceptual Model of a Pushdown Automaton
|
|
A (non-deterministic) finite state pushdown automaton (abbreviated PDA or, when the
context is clear, an automaton) is a 7-tuple
= (X, Z,
, R, zA, SA, ZF), where
-
X = {x1, ... , xm} is a finite set of input symbols. As above, it is also called
an alphabet. The empty symbol
is not a member of this set. It does, however,
carry its usual meaning when encountered in the input.
-
Z = {z1, ... zn} is a finite set of states.
-
= {s1, ... , sp} is a finite set of stack symbols. In this case
.
-
R
((X
{
})×Z×
)×(Z×
)) is the
transition relation.
- zA is the initial state.
- SA is the initial stack symbol.
-
ZF
K is a distinguished set of final states.
- The non-determinism comes from the fact that
allows a transition relation, instead of
a transition function. To emphasize the nature of the correspondence extra parentheses have been
utilized,
- The element
((xi, zj, sk),(zr, s)) is a member of the relation R means: At the moment
in which xi is read, the automaton finds itself in state zj and the string s of stack
symbols occupies the top of the stack, the first symbol of which is directly under the read/write
head. Then the automaton changes its internal state to zr and replaces sk with the new word
s. Again, the first symbol of s lies under the read/write head.
- There are two distinguished stack operations:
- The relation
((xi, zj,
),(zr, s))
R signifies that s is written to the top
of the stack. Hence the familiar name ``push''.
- The relation
((xi, zj, sk),(zr,
))
R signifies that the word sk is removed
from the stack and the next read/write position is the character directly preceding sk on the
stack. Hence the term ``pop''.
- There are 5 cases that are worth mentioning
- There is no triple
(xi, zj, sk) as the first component in any relation member. Then the
automaton halts and remains there. This is also possible for the triple
(
, zj, sk).
- If
((xi, zj, sk),(zr, s))
R with xi
X and
s
then the automaton
goes into state zr, sk is replaced by s, the input tape is moved left by one cell and the
read/write head is positioned directly over the first symbol of s.
- If
((
, zj, sk),(zr, s))
R then procede as in (b), except the input tape does
not move to the left.
- If
((xi, zj, sk),(zr,
))
R then
goes into state zj, sk is popped
from the stack and the read/write head is positioned over the symbol immediately preceding sk on
the stack (if, that is, sk is not the first element on the stack).
- If
((xi, zj, sk),(zr, s))
R,
zr
ZF, the input read unit has no more symbols
to read and the stack is empty, then
halts. This will be dealt with in more detail
presently.
A configuration of
is an element
(x1, z1, S1) from
(X
{
})×Z×
. The string x1 is interpreted as that part of the input string x that remains to be
processed and S is the entire contents of the stack.
If
(x1, z1, S1) and
(x2, z2, S2) are configurations then we say the first directly
produces the second, written
(x1, z1, S1)
(x2, z2, S2), if there is an element
((a1, z1,
)(z2,
))
R with
x1 = a1x2,
S1 = 
and
S2 = 
. This situation is sketched in Fig. 5.2. Note that the case
a1 =
is also possible here.
Figure 5.2:
Direct Derivation of Configurations
|
|
As with automata one can speak of acceptance of strings by
over the alphabet X and hence of the
language accepted by a pushdown automaton. This is the content of the following definition.
A string x
X* is accepted by the non deterministic
pushdown automaton
if there is a sequence
of configurations
(
x,
zA,
SA)

(
x2,
z2,
S2)
... 
(

,
zn,

)
with
zn
ZF. The sequence is called a calculation of
and has the length
n. Finally, set the language generated by
L(
) to be the set
Let P be defined through
X = {a, b, c},
Z = {zA = z1, z2, z3},
= {SA, S,
},
ZF = {z3} and, finally the state transitions
| (1) |
((a, z1, SA),(z1, SSA)) R |
| (2) |
((a, z1, S),(z1, SS)) R |
| (3) |
((b, z1, S),(z2, )) R |
| (4) |
((c, z2, S),(z2, )) R |
| (5) |
((c, z2, SA),(z3, )) R. |
Then
L(
) = {x
{a, b, c}| x = anbcn}. Note that this language is not regular.
The calculation for the string x = a3bc3 is adduced in Fig. 5.3. Starting out in initial state
z1 with the initial stack symbol SA on top of the stack and reading input symbol a entails pushing
S onto the stack and remaining in state z1 according to the first rule. Thereafter for each a read
the second rule pushes an additional s in state z1 until input symbol b is encountered. Then the
third rule starts popping S from the stack and goes into state z2; this process is completed by matching the input symbols c against the remaining stack symbols according to the fourth and
fifth rules.
Figure 5.3:
Derivation of the String a3bc3
 |
5.2 Context-Free Languages
As will be recalled from the last chapter there were two basic ways to determine whether a given string
belongs to the language generated by some finite state automaton: One could verify that the string brings
the automaton to a final state or one could derive, or, better, produce, the string in the regular grammar
corresponding to
the automaton. The same option holds for PDAs.
A context-free grammar is a grammar
= (X, T, S, R) for which all
rules, or productions, in R have the special form
A
,
for A
X - T and
X*. Additionally,
for any two strings
u, v
X* write
u
v (u directly produces v
) if and only if (1)
u = u1Au2 for
u1, u2
X* and A
X - T and (2)
v = v1
v2 and
A
,
X*,
is a production from R. The reduction
u
v is also called a direct production.
Finally, write
u
v for two strings
u, v
X* (u derives
v) if there is a sequence
u = u0
u1
u2
...
un = v of
direct productions
ui
ui+1 from R. The length of the derivation is n. The
language generated by
is
{x
T*| S
x}.
Thus, the definition just articulates the reduction of A to
in any context in which A
occurs. It is trivial that every regular language is context-free. The obverse, as will be seen presently,
is not true. Before proving the central theorem for this section two typical examples are given.
Consider
= (X, T, R, S) with T = {a, b} and
X = {S, a, b,
}. The
productions, or grammar rules, are:
S
aSb |
. Then it is clear that
L(
) = {anbn| n
0}. From the previous chapter it is known that this language is not regular.
Let
X = {E, T, F, id, + , - ,*,/,(,), a, b, c}
and
T = {a, b, c, + , - ,*,/,(,)}. The start symbol S is E and the productions are as follows:
| E |
 |
E + T | E - T | T |
|
| T |
 |
T*F | T/F | F |
|
| F |
 |
(E) | id |
|
| id |
 |
a | b | c |
|
Then the string (a + b)*c belongs to
L(
). Indeed, it is easy to write down a derivation of
this string:
| E |
 |
T T*F F*F (E)*F (E + T)*F |
|
| |
 |
(T + T)*F (F + T)*F (id + T)*F (a + T)*F |
|
| |
 |
(a + F)*F (a + id )*F (a + b)*F (a + b)*id (a + b)*c |
|
The derivation just adduced is leftmost in the sense that the leftmost nonterminal was always
substituted. Although derivations are in general by no means unique, the leftmost one is. The
entire derivation can also be nicely represented in a tree form, as Fig. 5.4 suggests.
Figure 5.4:
Derivation Tree for the Expression (a + b)*c
|
|
The internal nodes of the derivation, or syntax, tree are nonterminal symbols and the frontier of
the tree consists of terminal symbols. The start symbol is the root and the derived symbols
are nodes. The order of the tree is the maximal number of successor nodes for any
given node. In this case, the tree has order 3. Finally, the height of the tree is the length
of the longest path from the root to a leaf node, i.e. a node that has no successor. The
string (a + b)*c obtained from the concatenation of the leaf nodes together from left to right is called
the yield of the tree.
The expected relation between pushdown automata and context-free languages is
enunciated in the following theorem.
Theorem 5..1
A language
L (
i.e. a subset
L
X*) is generated by a context-free
grammar

if and only if there is a pushdown automaton

with
L = L(
).
If
= (X, T, R, S) is a given context-free grammar, then it is relatively straightforward
to construct a PDA
with
L(
) = L(
). Let the stack alphabet S be the set
X
{SA}
{
} and the set of states
{z0 = zA, z1} with
ZF = {z1}.
Also let the input alphabet be
T
{
} and define the transitions
- Push the grammar's start symbol straight off:
((
, z0, SA),(z1, SSA))
R.
- Nondeterministically push the right-hand side of a grammar rule onto the stack:
- Match the input symbol with the terminal symbol on top of the stack: For each a
T put
((
z1,
a,
a),(
z1,

))
R.
- Pop the stack start symbol at the very end:
((
, z1, SA),(z1,
))
R.
Consider, as an example, the grammar
= (X, T, R, S) with
X = {S, a, b, c},
T = {a, b, c} and
productions
S
aSa | bSb | c. Then the language generated by
is
L(
) = {xcxR | x
{a, b}* where xR denotes the reverse string to x}.
Then, as in the above general PDA specification, the stack alphabet consists of the set
{
, S, a, b, c, SA}, the input alphabet is {a, b, c} and the state transitions are as follows:
(( , z0, SA),(z1, SSA)) R |
(( , z1, S),(z1, aSa)) R |
(( , z1, S),(z1, bSb)) R |
(( , z1, S),(z1, c)) R |
((a, z1, a),(z1, )) R |
((b, z1, b),(z1, )) R |
((c, z1, c),(z1, )) R |
(( , z1, SA),(z1, )) R |
To see that this PDA works correctly, ascertain that the string a2bcba2 is indeed an element of
L(
) by consulting the derivation diagram in Fig. 5.5
Figure 5.5:
Derivation of the String a2bcba2
 |
Conversely, assume
is a PDA.
To clarify the subsequent definitions the following discussion on the internal operation of
is offered. The goal is, of course, to concoct a context-free grammar that executes a leftmost
derivation of every string that
accepts. If
were as simple as the example in the first
part of this proof, namely, that after pushing the very first nontrivial symbol (not SA) onto the
stack
remains in a single state z1, then it would be very straightforward to reverse the above
process and construct
from
. Basically, if x is the input string write
x = x
ax
, where
x
is that part of x that has already been processed (a so-called
prefix of x) and
ax
is the rest of x whose first input symbol is a. Then the
direct production of configurations of
of the form
(ax
, z1, AA
)
(x
, z1,
A
) corresponds to the grammar rule
A
a
, resulting in the reduction
x
AA
xa
A
. Thus the
sequence of stack moves from the above-mentioned example commences with SA and, after popping that
symbol, derives the string a2bcba2, as can be seen by inspecting the stack column in Fig. 5.5.
Unfortunately, the general case is considerably more complicated, because
's state transitions
also enter into the picture. Proceeding naively, one could reduce
to a 2 state PDA

of the aforementioned type by pushing pairs (z, A) of states and stack symbols from
onto

's stack, thus imitating
's calculation of input strings.
Thus, when
is in state z and pushes A onto the stack,

pushes (z, A)
onto its stack. The reader is invited to pause to discover the fatal shortcoming of this method before
reading further.
The problem becomes immediately transparent when one considers what happens when

pops a stack element (z, A). State z is no longer relevant for

's further
operation-
was in state z when A got pushed, but what state was
in when the
pop occurred? Therefore, it is necessary to push triples
(z, A, z
), where
z
is
's state when the pop takes place. Since it is not known what
's state
z
is
going to be when it pops A,
has to guess what it is going to be, .i.e. it
nondeterministically pushes
(z, A, z
), where
z
Z is arbitrary.
The only restriction is
that when executing two (or more) push operations the unknown state
z
must be manipulated
consistently. This means if A1A2 is pushed, then after
pops A1, or, equivalently,

pops
(z1, A1, z1
), then
finds itself in state
z1
.
Since

does not use its own state information in imitating
's state
transitions,
's current state must be available in describing the next element of

's stack, or, in other words,
better be in state
z1
when
popping A2 from its stack, and so must be of the form
(z1
, A2, z2) for some (predicted)
z2
Z. This train of thought will now be formalized.
For simplicity, assume that
pushes at most
two symbols and that it has a single acceptance statei zF. A moment's reflection shows that these
assumptions are not restrictive; but they do eliminate some extra preprocessing.
The
nonterminals of G are triples
(z, A, z
) of states
z, z
and a stack symbol A.
The
basic idea is to imitate what the machine undergoes in state z finally to pop symbol A and to
wind up thereby in state
z
, having processed some string of
input characters. Thus the rules for the sought-after context-free grammar are
posited as follows:
- For the (extra) start symbol put
S
(zA, SA, zF).
- For each transition
((a, z, B),(z
, C))
R put for each z1
Z
(
z,
B,
z1)
a(
z
,
C,
z1)
- In case two symbols are pushed, i.e.
((a, z, B),(z
, C1C2))
R, then put for
each pair
z1, z2
Z
(
z,
B,
z1)
a(
z
,
C1,
z2)(
z2,
C2,
z1).
- For each z
Z put
(z,
, z)
.
It is important to notice the free choice of z1 and z1, z2 in rules 2. and 3. Consider,
for example, processing the string a2bc2 from the PDA from Section 5.1. Then posit
the start rule
S 
(
z1,
SA,
z3),
since there is only one final state. Now mechanically translate each of the transitions from this
PDA into their grammatical equivalents as shown in Table 5.1.
Table 5.1:
Translation of the PDA
Transition Rules into Grammatical Productions
| Nr. |
Transition Function |
Nr |
Production |
| 1 |
((a, zA, SA),(zA, SSA)) |
1' |
(zA, SA, z') a(zA, S, z'')(z'', SA, z') |
| 2 |
((a, zA, S),(zA, SS)) |
2' |
(zA, S, w') a(zA, s, w'')(w'', s, w') |
| 3 |
((b, zA, S),(z2, )) |
3' |
(zA, S, v') b(z2, , v') |
| 4 |
((c, z2, S),(z2, )) |
4' |
(z2, S, u') c(z2, , u') |
| 5 |
((c, z2, SA),(z3, )) |
5' |
z2, SA, t') c(z3, , t') |
|
It is important to note that states
z', z'', w', w'', v', u', t' can be chosen at will. Hopefully, a proper
choice will lead to success in accordance with the philosophy of nondeterminism.
Derivation of the string
a2bc2
L(
).
| S |
 |
(zA, SA, zF) |
| |
 |
a(zA, S, z'')(z'', SA, zF) |
| |
 |
a2(zA, S, w'')(w'', s, z'')(z'', SA, zF) |
| |
 |
a2b(z2, , w'')(w'', S, z'')(z'', SA, zF) |
| |
 |
a2bc(z2, , z2)(z2, SA, zF) |
| |
 |
a2bc2(z3, , z3) = a2bc2 |
Notice that the steps above correspond in a one-to-one manner with the sequence of transitions from
the original PDA.
This now concludes the sketch of the proof of the
theorem.
Tree representations of derivations, also known as syntax trees, were briefly introduced in the
preceding section to promote intuition of derivations. Since these are such important tools for
the investigation of context-free
languages, they will be dealt with a little more systematically here.
Let
= (X, T, R, S) be a context-free grammar. A syntax tree
for this grammar consists of one of the following
- A single node x for an x
T. This x is both root and leaf node.
- An edge
corresponding to a production
A
R.
- A tree
where the
A1, A2, ... , An are the root nodes of syntax trees. Their yields are read
from left to right.
Until now the syntax trees were uniquely determined-even if the sequence
of direct derivations were not. Separating the productions corresponding to the operator
hierarchy, from weakest to strongest, in the expression grammar
+ , - ,*,/,() preserves this
natural hierarchy. If this is not done, then syntax trees with a false evualation sequence are
often the result. Suppose, for instance, that the rules of the expression grammar were written
E
E + E | E*E | id, then two different syntax trees are the result.
If the first production
E
E + E were chosen then the result would be the tree
On the other hand, choosing the production
E
E*E first results in a syntax tree of an
entirely different ilk.
Thus this grammar is ambiguous, because it is possible to generate two different
syntax trees for the expression a + b*c.
Work with a given context-free grammar is greatly facilitated by putting it into a so-called
normal form. This provides some kind of regularity in the appearance of the right-hand
sides of grammar rules. One of the most important normal forms is the Chomsky normal form.
The context-free Grammar
= (X, T, R, S) is said to be in
Chomsky normal form if all grammar rules have the form
A a | BC, |
(5.1) |
for a
T and
B, C
X - T. There is one exception. If
L(
), then
the single extra rule
S  |
(5.2) |
is permitted. If
L(
) then production rule 5.2 is not allowed.
Theorem 5..2
Any context-free grammar
= (X, T, R, S) can be rewritten in Chomsky
normal form.
To facilitate rewriting the grammar rules it is first a good idea to eliminate unnecessary
pathology in the original grammar. Although a CFG always defines its grammar rules with single
nonterminal symbols on the left-hand side of every production, it can easily happen that there
are nonterminal symbols that never appear on the left-hand side of any production. It is
then seen that they cannot generate any sequence of terminal symbols and, moreover, may never
appear in the right-hand side of any production to help produce a sentence in the associated
language. It can even happen that there are nonterminal symbols that appear in no sentential
form5.1 derivable from the start
symbol. These symbols are called useless and are to be expurgated at the outset. It should
be abundentally clear that no part of any rule containing one of these symbols has the least
influence of L(G). Consider, for example, the grammar
| S |
 |
AB | CA | AD |
| B |
 |
BC | AB |
| A |
 |
aA | a |
| C |
 |
b | aB | bC |
No terminal symbol is derivable from B and D never appears in the right-hand side of any
rule. Thus, the grammar simplifies to:
Now that the grammar has been stripped of extraneous elements, the grammatical transformation can
begin.
The rules for
can be rewritten as follows:
- Purge R of rules of the form
A
. If there is another rule with A
occupying the left-hand side, then proceed as follows. For every rule in which A appears on
the right-hand side, add another rule to R without this occurrence of A. If A occurs more than
once, then add rules with each individual occurrence of A elided, while retaining the other
occurrences of that nonterminal symbol. Finally, add rules with pairs of individual occurrences of
A eliminated, etc. until all combinations have been expunged. For example, the rules
A
a |
and
A
ABAC could be replaced by
A
a and
A
BAC | ABC | BC| ABAC.
- Replace any rule of the form
A

... ,
by the n - 1 rules
A
A2, A2
A3, ... An-1

.
- Eliminate ``useless'' rules
A
B, where A and B are nonterminal symbols. Indeed,
if there is a rule
B
, where
consists of more than one symbol, then reduce
to
A
.
Consider again the expression grammar from Section 5.2. Then the rule
E
E + T | E - T, is replaced by
E
EE
,
E
POPT,
POP
+ | -.
Similar replacements hold for
T
T*F | T/F and
F
(E). Finally the
productions
E
T,
T
F and
F
id are reduced to
E
a | b | c,
T
a | b | c and
F
a | b | c respectively.
As with finite automata there is a version of the pumping lemma that demonstrates certain languages
are not context free. This theorem is often called Ogden's lemma after its discoverer.
Assume that
is in Chomsky normal form. For
x
L(
) consider
the (binary) syntax tree for the derivation of x. Assume the height of this tree is h as
illustrated in Fig. 5.6.
Figure 5.6:
Derivation Tree for the string
x
L(
)
|
|
Then it follows that
| x|
2h-2 + 2h-2 = 2h-1, i.e. the yield of the tree with
height h is at most 2h-1. If
has k nonterminal symbols, let n = 2k. Then let
x
L(
) be a string with
| x|
n. Thus the syntax tree for x has height at least
k + 1, thus on the path from the root downwards that defines the height of the tree there are at least
k + 2 nodes, i.e. at least k + 1 nonterminal symbols. It then follows that there is some
nonterminal symbol A that appears at least twice. Consulting Fig. 5.7, it is seen
that the partial derivation
S
uAz
uvAyz obtains.
Figure 5.7:
Nonterminal A appears twice in the derivation of x
|
|
If, now, both u and z were empty, then derivations of the form
S
uAz
A
would be possible, contrary to the assumption of Chomsky normal form. For the same reason either v
or y are nonempty. If | vwy| > n then apply the procedure anew until the condition
| vwy|
n holds. Finally, since the derivation
A
vAy can be repeated as often as
one pleases, it follows that
S
uAz
uvAyz
uv2Ay2z
uv2wy2zi, etc. can be generated. This completes the proof.
The language
L = {aibici | i
1} is not context free.
Assume L were context-free. Then let n be the n from the preceding theorem and put
x = anbncn. Ogden's lemma then provides the decomposition x = uvwyz with the stated properties.
There are several cases to consider.
Case 1 The string vy contains only a's. But then the string uwz
L, which is impossible,
because it contains fewer a's than b's and c's. .
Case 2,3 vy contains only b's or c's. This case is similar to case 1.
Case 4,5 vy contains only a's and b's or only b's or c's. Then it follows that
uwz contains more c's than
a's and b's or more a's than b's and c's. This is again a contradiction.
Since
| vwy|
n it is not possible that vy contain a's and c's.
In this example the power of Ogden's lemma will be extended. The language
L = {aibjck | i < j < k} is not contetx free.
The pumping properties of Ogden's lemma are of little use here. Hence we return to
the syntax tree in the proof of the theorem. If, now, L were context-free then consider the path in
its syntax tree from the root to the leaf node containing the rightmost a. For sufficiently large
i there is likewise a nonterminal symbol A appearing twice on this path. Then
x = uamwyz for
some m
0. There are 2 cases to be considered here.
Case 1 m = 0. Then the matching substring y cannot be empty. On the other hand the theorem
guarantees that
| wy|
i, hence all c's must be contained in z. But then y contains at least
one b, so for sufficiently large n the string uwynz belongs to L, which is impossible, because
for large enough n, this is string will contain more b's than c's-contrary to assumption
Case 2 m
1.
It then follows that
Now obviously y cannot contain more than 1 letter kind and this letter is either b or c. If y
contained only b's then the string
uanmwynz would contain more a's than c's for n
sufficiently large. As similar argument holds if y contained only c's. In any case, a contradiction
is derived and thus the assumption that L is context-free is false.
Proceding by analogy, one would expect the closure theorems for finite state automata to
generelize to PDAs. Surprisingly, not everything carries over.
The following theorem, however, articulates a property of context-free languages analogous to finite
automata.
Theorem 5..4
The context-free langauges are closed under the formation of
- Union
- Concatenation
- Kleene star.
Let
1 = (X1, T1, R1, S1) and
2 = (X2, T2, R2, S2)
be context-free grammars. Without loss of generality, it can be supposed that
(X1 - T1)
(X2 - T2) =
. If not, then rewrite the grammar rules of one with new nonterminal
symbols. Let S be another nonterminal symbol. Then set
R = R1
R2
{S
S1, S
S2} and
= (X1
X2
{S}, T1
T2, R, S). It then
follows that
L(
) = L(
1)
L(
2). It is left to the reader to verify
closure under concatenation and Kleene star in a similar manner.
The closure under intersection property is not quite exact.
Theorem 5..5
The intersection of a context-free language L1 and a regular language L2 is context-free.
If L1 is context-free then it is recognized by some push down
automaton
= (X, Z1,
, R1, S1, F1). If L2 is a regular language, then it is recognized by
a deterministic automaton
= (X, Z2, f, S2, F2). Define a new PDA

= (X, Z,
, R, SA, F) as follows:
Z = Z1×Z2,
=
,
SA = (S1, S2) and,
finally,
F = F1×F2. The transition relation R is obtained directly from the transition
relation of
and the transition function of
, viz. for every transition of
of the form
((a1, z1, S
),(z1
, S1
))
R1 and for each state
z2
Z2, put
((
a1,(
z1,
z2),
S1
),((
z1
,
f (
a,
z2),
S1
))
R
and for each
-move of
of the form
((
, z1, S1
),(z1
, S1
))
R and
z2
Z2 put
((

,(
z1,
z2),
S1
),((
z1
,
z2),
S1
))
R,
or, stated in words,

passes from state (z1, z2) into state
(z1
, z2
) if and only if
passes from z1 to z2 and
passes from z2 to
z2
, i.e.
x
L(
)) if and only if
x
L(
)
L(
).
Theorem 5..6
The class of context-free languages is not closed under intersection and complement.
It is easy to see that the two languages
L1 = {
anbncm |
m.
n 
0}
and
L2 = {
ambncn |
m,
n 
0}
are context free. The interesction, however,
L1
L2 = {anbncn | n
0} is not context-free.
From the complement identity
L1
L2 =

,
it is seen that the complements
and
are not in general context-free.
An algorithm is called polynomial in case there is an integer k
2
such that the number of steps after which the algorithm halts is
(nk). The argument n
depends only on the input.
Theorem 5..7
There is a polynomial algorithm that constructs to any given push down automaton

a context-free
grammar

with
L(
) = L(
). Conversely, there is a polynomial algorithm that
constructs to any given context-free grammar

a push down automaton

with
L(
) = L(
).
Theorem 5..8
There is a polynomial algorithm that decides, given any context-free grammar
G = (X, T, R, S) and
x
T* whether
x
L(
).
The proof of this theorem sometimes goes under the name CYK algorithm after
their discoverers Cocke, Younger and Kasami. It proceeds as follows:
- Rewrite
in Chomsky normal form. It is easily seen that this can be done in polynomial
time.
- If
x = x1x2 ... xn, then for
0
i, j
n put
xij = xixi+1 ... xi+j-1. It is noteworthy that
| xij| = j. The idea is to determine all A
X - T for which
A
xij. Thus set
Vij = {
A
X -
T |
A
xij}.
- For j = 1 it is readily seen that
Vi1 = {A
X - T | A
xi}.
- For general j it is also seen that
A
Vij
A
xixi+1 ... xi+j-1
A
BC is a rule from R and
B
xi ... xi+k-1 and
C
xi+k ... xi+j-1 for some
k = 1, 2, ... j - 1.
Thus the algorithm can be formulated as follows:
for i : = 1 to n do
Vi1 : = {A
X - T | A
xi
R};
for j : = 2 to n do
for i : = 1 to n - j + 1 do begin
Vij : =
;
for k : = 1 to j - 1 do
Vij : = Vij
{A
X - T | A
BC, B
Vik, C
Vi+k, j-k};
end
Figure 5.8:
Diagonal Procedure for CYK Algorithm
|
|
There is a nice interpretation of the innermost for loop. Formally one processes the pairs
Vi1Vi+1, j-1,
Vi2Vi+2, j-2, ... ,
Vi, j-1Vi+j-1, 1. As evidenced in Fig.
5.8 go down the ith column and simultaneously traverse the diagonal from
Vi+1, j-1
up and to the right. The corresponding elements are compared with each other.
Finally, it is seen that
x
L(
)
S
V1, n, because then
S
x1 ... xn, where
n = length(x).
This technique of producing increasingly larger solutions from smaller ones is called dynamic
programming.
Consider the Grammar
| S |
|
AB | BC |
|
| A |
|
BA | a |
|
| B |
|
CC | b |
|
| C |
|
AB | a |
|
and the string x = baaba with n = 5. Then proceeding as above, the following triangular matrix
results:
| b |
a |
a |
b |
a |
| B |
A, C |
A, C |
B |
A, C |
| S, A |
B |
S, C |
S, A |
|
 |
B |
B |
|
|
 |
S, A, C |
|
|
|
| S, A, C |
|
|
|
|
Since
S
V15 it follows that
x
L(
). It is quite remarkable that the algorithm
time is
(n3). It is also remarkable that the CYK algorithm actually shows how to construct the
derivation, which has great practical importance.
Concomitant to the various automata that have been introduced are the various languages, or, equivalently
subsets of strings over some alphabet, that are recognized by those automata. Chomsky has
characterized a full spectrum of language types that occur in formal language theory. The first step
is to recall that in the definition of a grammar completely general production rules are allowed.
Specifically this allows productions whose left-hand sides are arbitrary non-empty strings over the set
X of terminal and nonterminal symbols. There is a further technical restriction on the nullbar
productions: If the empty string is derivable in the language generated by any grammar. Following Chomsky, such grammars are called Type 0
Grammars. If one further requires that fewer combinations of terminal and nonterminal symbols occur
on the left-hand side than on the right then one speaks of a Type 1 Grammar. The various
types of languages considered in this lecture series are summarized in the following definition.
Let
be a grammar.
- If
is type 0, then its associated language
L(
) is called a type 0
language.
is called type 1 if for all productions
: : =
and its associated language a type 1 language.
is called context free or type 2 and its associated language
a type 2 language if for all productions
: : =
the left-hand side
consists of a single nonterminal symbol and the nullbar
convention holds.
is called regular or a type 3 and its assocated language a
type 3 language if every rule of
is left
or right linear and the nullbar convention holds.
The following very important theorem is stated here with proof of the second to last strict inclusion. The
strict inclusions before that have already been treated. Only the last is stated without proof:
Theorem 5..9
If
Li denotes the class of type
i languages (
i = 0, 1, 2, 3), then
Consider the following context-dependant grammar:
S aSBC |
S aBC |
CB BC |
aB ab |
bB bb |
bC bc |
cC cc |
Then it is easy to derive the string abc:
Similarly, one derives the string a2b2c2:
| S |
aSBC |
a2BCBC |
|
| |
a2B2C2 |
a2bBC2 |
|
| |
a2b2C2 |
a2b2cC |
a2b2c2 |
It is then a routine application of mathematical induction to prove the general formula
S
anbncn.
In this chapter the student has encountered the power of nondeterminism in algorithms in Computer
Science. In this connection, however, practical issues such as time to completion have been relegated
to the background. For example, given a context-free grammar one could always parse a string of input
symbols using nondeterministic back-up methods that repeatedly attempt to produce a leftmost
derivation for that string from the start symbol, or, failing that, reject the string after all possibilities have been
exhausted. Indeed, the very first grammar-driven compilers did precisely that.
Most modern programming languages have syntaxes that allow predictive parsing, thus eliminating
the time-wasting idiosyncracy of back-up. The associative PDA is thus deterministic and the parsing
procedure is very efficient. In this section we apply the methods learned to build both top-down and bottom-up
that function deterministically.
The goal of a top-down parse is to produce a leftmost derivation starting with the start symbol and ends
up either having consumed the input and accepted the input string or determines that there is a conflict
and produces an error message to that extent.
The basic idea is simple: given the current input symbol and the current non-terminal symbol, usually
pushed onto the stack, consult a parsing table, or, in the jargon of automata theory, a finite state
transition function (the current state is the current nonterminal symbol), and replace the symbol on
the stack with the proper right-hand side to continue simulating the derivation. See Fig. 5.9
for an illustration of this process. If the top of the stack is a terminal symbol and matches the input,
then just pop that symbol and continue with the rest of the input string. A simplified process is given in
the following algorithm.
Figure 5.9:
Top-Down Parser as a PDA
|
|
- x = first input symbol
- Z = S
while there is input begin
write Z = uAw for
u
T*, A
X - T, w
X*;
if A
X - T or the end marker $
if
Table[x, A] = A
v = v1v2...vn
pop(A);
push v on thestack with v1 on top;
else ifA
T
pop(A);
x = getnextsymbol();
end
The main point here is that Table[x, A] will consist of exactly one element that allows the predictive
parsing to proceed along unique, well-defined lines. It will easily be seen that there are context-free
grammars for which this lookup technique fails. The expression grammar used throughout will serve
the purpose at hand. Given an input, say an identifier, how do we know to reduce E to E + T, to
T, etc.? Since `+' can appear arbitrarily far away from the current input position, this
grammar as it stands will not allow of computing a parse table. This particular shortcoming, known as
left recursion is easily excised from the grammar (make it right recursive):
E TE' |
E' +TE' | -TE' |  |
T FT' |
T' *FT' | /FT' |  |
F (E) | id |
Note the use of the
-reductions. Allows this particular grammar now works, there remain
other general pitfalls that need to be mentioned. Consider the grammar
S A | B |
A aA | a |
B aB | b |
There is no way, given the input symbol a, to determine whether to pick the right-hand side A or
B. Thus the predictive parsing algorithm will work only on grammars for which the first terminal
symbol produced by the leftmost nonterminal is each right-hand side has to be uniquely determined.
This symbol is denoted by
first(
A), where
X and A
X*. It is formally
defined recursively as follows:
Thus it makes sense to require for the rules
A
|
| ...
that
for all i and j with i
j.
In the above example
first(A) = first(B) = a. There is still one more subtle point: it could happen
that


is nullable, in which case a nonterminal symbol following
in some right-hand side could cause a conflict with
first(
). Thus we define
follow(
A) = {
a
T |
S
uAw,
a
first(
w),
w
X*}
and require for each A
X - T
first(A) follow(A) = . |
(5.5) |
The context-free grammar
G = (X, T, R, S) is LL(1) if equations 5.4 and 5.5 hold for all
A
X - T.
We shall presently see that LL(1) grammars allow the unqiue construction of parsing tables for predictive
parsing.
Consider the left-factored expression grammar above. Then we calculate
| first(E) |
= |
{(, id} |
| first(T) |
= |
{(, id} |
| first(F) |
= |
{(, id} |
| first(E') |
= |
{ + , - , } |
| first(T') |
= |
{*,/, } |
| follow(E) |
= |
{),$} |
|
follow(E') |
= |
{),$} |
| follow(T) |
= |
{ + , - ,$} |
|
follow(T') |
= |
{ + , - ,$} |
| follow(F) |
= |
{ + , - ,$,*,/,)} |
For instance, from the rules
E
TE',
T
FT' and
F
(E) | id, it
is seen that only ( and id can begin a derivation from E alone. Furthermore
follow(T')
consists of the non-
elements of first(E') and the end marker $. The other calculations
are analogous.
Now we are in a position to define the parsing table, call it
Table[*,*]:
It should be intuitively clear that if the grammar G is LL(1) then the so-constructed parsing
table will have at most one item for each entry (It may, of course, have none at all.)
This section will now be concluded with the parsing table for the expression grammar and a derivation
of the legal input string
(id + id )*id.
|
Input Symbol |
|
|
id |
+ |
- |
* |
/ |
| E |
E TE' |
|
|
|
|
| E' |
|
E' + TE' |
E' - TE' |
|
|
| T |
T FT' |
|
|
|
|
| T' |
|
|
|
T' *FT' |
T' /TF' |
| F |
F id |
|
|
|
|
|
Input Symbol |
|
( |
) |
$ |
| E |
E TE' |
|
|
| E' |
|
E'  |
E'  |
| T |
T FT' |
|
|
| T' |
|
T'  |
T'  |
| F |
F (E) |
|
|
For instance, since
id
first(T) add
E
TE' to
Table[id, E]. Also
(
first(T),
so consequently add
A
TE' to
Table[), E]. Finally the rule
E'
means that it must be added to
Table[), E'] and
Table[$, E']. The other entries are left to the
reader.
Finally we adduce the parse sequence for the expression
(id + id )*id. Note the end marker is
pushed as initial stack symbol onto the stack. Then when the expression has been processed, the end
marker `$' of the input string is matched against the initial stack symbol and the string is
accepted. Note that there are really only two mutually exclusive things to be done, depending on the symbol
on the top of the stack: if that symbol is a terminal symbol, then match it to the output; if there is no
match then there is an error. If the symbol is a nonterminal symbol, then look up the entry in the parser
table corresponding to that symbol and to the current input symbol. See Table 5.2.
Table 5.2:
Predictive Parse of the expression
(id + id )*id
| Input String |
Stack |
Remarks |
|
| ( |
id |
+ |
id |
) |
* |
id |
$ |
E$ |
Beginning Configuration |
|
 |
|
|
|
|
|
|
|
 |
|
|
| ( |
id |
+ |
id |
) |
* |
id |
$ |
TE'$ |
Consult
Table[(, E] |
|
 |
|
|
|
|
|
|
|
 |
|
|
| ( |
id |
+ |
id |
) |
* |
id |
$ |
FT'E'$ |
Consult
Table[(, T] |
|
 |
|
|
|
|
|
|
|
 |
|
|
| ( |
id |
+ |
id |
) |
* |
id |
$ |
(E)T'E'$ |
Consult
Table[(, F] |
|
 |
|
|
|
|
|
|
|
 |
|
|
| ( |
id |
+ |
id |
) |
* |
id |
$ |
E)T'E'$ |
Match ('s |
|
| |
 |
|
|
|
|
|
|
 |
|
|
| ( |
id |
+ |
id |
) |
* |
id |
$ |
TE')T'E'E$ |
Consult
Table[id, E] |
|
| |
 |
|
|
|
|
|
|
 |
|
|
| ( |
id |
+ |
id |
) |
* |
id |
$ |
FT'E')T'E'$ |
Consult
Table[id, T] |
|
| |
 |
|
|
|
|
|
|
 |
|
|
| ( |
id |
+ |
id |
) |
* |
id |
$ |
idT'E')T'E'$ |
Consult
Table[id, F] |
|
| |
 |
|
|
|
|
|
|
 |
|
|
| ( |
id |
+ |
id |
) |
* |
id |
$ |
T'E')T'E'$ |
Match id's |
|
| |
|
 |
|
|
|
|
|
 |
|
|
| ( |
id |
+ |
id |
) |
* |
id |
$ |
E')T'E'$ |
Consult
Table[+ , T'] |
|
| |
|
 |
|
|
|
|
|
 |
|
|
| ( |
id |
+ |
id |
) |
* |
id |
$ |
+ TE')T'E'$ |
Match +'s |
|
| |
|
|
 |
|
|
|
|
 |
|
|
| ( |
id |
+ |
id |
) |
* |
id |
$ |
FT'E')T'E'$ |
Consult
Table[id, T] |
|
| |
|
|
 |
|
|
|
|
 |
|
|
| ( |
id |
+ |
id |
) |
* |
id |
$ |
idT'E')T'E'$ |
Consult
Table[id, F] |
|
| |
|
|
 |
|
|
|
|
 |
|
|
| ( |
id |
+ |
id |
) |
* |
id |
$ |
T'E')T'E'$ |
Match id's |
|
| |
|
|
|
 |
|
|
|
 |
|
|
| ( |
id |
+ |
id |
) |
* |
id |
$ |
)T'E'$ |
Consult
Table[), T] |
|
| |
|
|
|
 |
|
|
|
 |
|
|
| ( |
id |
+ |
id |
) |
* |
id |
$ |
T'E'$ |
Match )'s |
|
| |
|
|
|
|
 |
|
|
 |
|
|
| ( |
id |
+ |
id |
) |
* |
id |
$ |
*FT'E'$ |
Consult
Table[*, T'] |
|
| |
|
|
|
|
 |
|
|
 |
|
|
| ( |
id |
+ |
id |
) |
* |
id |
$ |
FT'E'$ |
Match *'s |
|
| |
|
|
|
|
|
 |
|
 |
|
|
| ( |
id |
+ |
id |
) |
* |
id |
$ |
idT'E'$ |
Consult
Table[id, F] |
|
| |
|
|
|
|
|
 |
|
 |
|
|
| ( |
id |
+ |
id |
) |
* |
id |
$ |
T'E'$ |
Match id's |
|
| |
|
|
|
|
|
 |
|
 |
|
|
| ( |
id |
+ |
id |
) |
* |
id |
$ |
E'$ |
Consult
Table[$, T'] |
|
| |
|
|
|
|
|
|
 |
 |
|
|
| ( |
id |
+ |
id |
) |
* |
id |
$ |
$ |
Consult
Table[$, E'] |
|
| |
|
|
|
|
|
|
 |
 |
|
|
|
In this section assume that all grammars do not have a recursive start symbol S. If one does then
introduce a new rule
S'
S and make S' the new start symbol.
Bottom up methods are a little more involved in the theory of parsing. Consider, for example,
the right derivation of the expression a + b*c:
E  |
|
E + T E + T*F E + T*id E + T*c E + F*c E + id*c |
|
| |
|
E + b*c T + b*c F + b*c id + b*c a + b*c. |
|
At each stage of the derivation the sentential form of the stage is of the form uv, where u
X* and
v
T*. Tracing this derivation backwards, now proceed as follows: Starting from the leftmost input
symbol reduce that symbol to a rule for which it is the right-hand side, in this case
id
a.
Then reduce id to F, etc. until an E has been produced. All of the previous symbols are
handles or right-hand sides of rules that allow successful (in the sense that the start symbol will
eventually be produced). After E has been obtained, the next input symbol `+' is kept, or better,
appended to E. Thus the sentential form `E +' is produced. This sentential form is called a viable prefix because there is a rule of the form
E
E + T (a trivial one). If it recognized
that E + is a viable prefix, then, starting with the next input symbol, continue this process from that
point onwards until the rest of the right-hand side has been produced, i.e. a handle has been
found. Then reduce this handle to the left-hand side of the ``correct'' rule until the start symbol alone
has been produced. This process can be nicely realized using a push-down automaton. Thus, proceeding from
left to right on the input string, shift or push one or more input symbols onto the stack until a
handle is found. The reduce or pop that handle from the stack and push the left-hand side of the
associated rule onto the stack. On a successful parse, if no reduction is presently forthcoming then the
contents of the stack constitute a viable prefix for some rule yet to be determined. Another way of
saying the same thing is that the contents of the stack, read from bottom up, are the prefix of a
sentential form produced on the way back to the start symbol during a rightmost derivation.
A correct parse of the string a + b*c as a sequence of shift/reduce actions is given in Table 5.3.
Notice the decision to handle multiplication before addition is governed by ``looking ahead'' one symbol.
Table 5.3:
Predictive Parse of the expression a + b*c
| Stack |
Input |
Action |
| $ |
a + b*c$ |
Shift |
| id$ |
+ b*c$ |
Reduce |
| F$ |
+ b*c$ |
Reduce |
| T$ |
+ b*c$ |
Reduce |
| E$ |
+ b*c$ |
Reduce |
| + E$ |
b*c$ |
Shift |
| b + E$ |
*c$ |
Shift |
| id + E$ |
*c$ |
Reduce |
| F + E$ |
*c$ |
Reduce |
| T + E$ |
*c$ |
Reduce |
| *T + E$ |
c$ |
Shift |
| c*T + E$ |
$ |
Reduce |
| id*T + E$ |
$ |
Reduce |
| F*T + E$ |
$ |
Reduce |
| T + E$ |
$ |
Reduce |
| E$ |
$ |
Accept |
|
Note that this PDA, whose states would consist of all prefixes of all right-hand sides, is nondeterministic
and therefore impractical to implement. The rest of this section is devoted to replacing this PDA by a
deterministic one.
Let
G = (X, T, R, S) be a context-free grammar. The the string uw (
u, w
X*) is
an LR(o)-Context of the rule
A
w if
S
uAv
uwv,
for some v
T*.
Note that uw in an LR(0)-Context of the rule
A
w if it is possible to reduce w directly to
A during a rightmost parse. v is just that part of the candidate string already produced.
Consider the Grammar
G = (X, T, R, S) with
X = {S, A, B, a, b}, T = {a, b} and productions
Then
L(G) = {anbmam | m, n
1}. It is readily seen that
- LR(0)-Context(
S
AB). Note that for the start symbol the LR(0)-Context of the
rule(s) with S on the left is always just the corresponding right-hand side.
- To calculate the LR(0)-Context of the rule
A
Aa we must examine all rightmost
derivations for which this rule is applied at the very last step:
Thus, with u =
, w - Aa, we have
LR(0)-Context(
A
Aa) = {
aA}.
- It is almost trivial that
LR(0)-Context(
A
a) = {
a},
because this is the last symbol generated by any rightmost parse.
- Again, to calculate the LR(0)-Context of the rule
B
bBa consider all rightmost
derivations that terminate in an application of this rule:
and thus
LR(0)-Context(
B
bBa) =
Abb*Ba.
- To calculate the LR(0)-Context of rule
B
ba, note that it terminates all rightmost derivations
starting from B:
| S |
AB |
Aba |
|
| S |
AB |
AbBa |
|
| |
AbmBam |
Abmbaam, |
|
and so
LR(0)-Context(
B
ba) =
Ab*ba.
A point worth mentioning is that even if the actual target strings that are derivable from the start
symbol are in general context-free, the LR(0)-Context is always describable with a regular
expression. If we knew the LR(0) contexts, then it is not hard to construct a bottom-up parse
of language strings, as the following example for the above grammar G and the target string
aaabbaa shows:
| Sentential Form |
LR(0)-Context |
Rule |
Action |
aabbaa |
a |
A a |
reduce |
aabbaa |
Aa |
A Aa |
reduce |
abbaa |
Aa |
A Aa |
reduce |
bbaa |
- |
- |
shift |
A baa |
- |
- |
shift |
Ab aa |
- |
- |
shift |
Abb a |
ba |
B Ba |
reduce |
Ab a |
abBA |
B bBa |
reduce |
A |
AB |
S AB |
reduce |
 |
- |
- |
accept |
Thus, using LR(0)-Contexts results in a deterministic bottom-up parse, because as soon as
uw is seen, w is reduced by the rule
A
w. The main obstacle to implementation is
the calculation of the regular expression that determines the decision to reduce; for it can become
arbitarily large before it can be distinguished (if at all - see the next definition) from other
potential LR(0)-Contexts. Thus the fourth and fifth rules of G above give rise to two contexts
that are distinguishable only after the second-to-last character has been read! For this reason, a slightly
different approach is taken, namely a DFA will be constructed that determines whether the derivation
up to the current point is actually a viable prefix.
The context-free Grammar
G = (X, T, R, S) is LR(0) if for every
u
LR(0) - Context(A
) and
uw
LR(0) - Context(B
), where
A
,
B
R, u
X* and w
T*, then
w =
, A = B and
=
.
Thus it is not possible at some stage to reduce
to A but also to forgo a reduction, shift
more symbols and then reduce. This condition also implies that
is not an acceptable
handle for 2 different rules.
Unfortunately, the expression grammar is not LR(0) because of the ambiguity
T
LR(0)-Context(E
T) and
Ta
LR(0)-Context(id
a). Note the need for a 1 symbol lookahead to decide whether to
reduce or not. The grammar G, on the other hand, is LR(0).
In the following algorithm assume the function getsym(x) returns the next symbol from the string x,
and the function shift(v) pushes the terminal symbol v onto the stack.
Deterministic Parser for LR(0) Grammars
- x = input string
-
u =
, v = getsym(x)
while (u
S) begin
if
u = xw
LR(0) - Context(A
w) reduce: u = xA
else if u is a viable prefix and
v
shift(
v = getsym(x));
else error();
end
We turn now to the problem of finding a DFA to determine viable prefixes. Quite clearly, its states
will contain this information.
Let
G = (X, T, R, S) be a context-free grammar. If
A

R, then the LR(0)-items of this production are the marked rules
A
.
,
A
.
and
A

.. The last item, for reasons that will
soon become clear, is called a complete item. The production
A
has only the
one item
A
.. The LR(0)-items of G is the union of the set of LR(0)-items over
all productions of G.
Thus, an LR(0)-item consists of a production together with a marker to indicate how much of the right-
hand side has been produced up to the current point of a rightmost derivation.
How now is a viable prefix recognized? A moment's reflection reveals that the top symbols of the
stack constitute a sentential form Y that corresponds to some rule
A
.YZ (
X*), if a reduction is to take place, or, in other words, the marker gets moved over. If
a reduction is not possible, then the next step consists of adding i.e. pushing input symbols
until a reduction is in fact possible, which is just another way of saying that a viable prefix is on
the stack at the moment. Thus, the states of the proposed automaton are precisely the LR(0)-items of
the grammar G5.2 and the transaction function
``moves the marker over.''
Given the LR(0)-grammar
G = (X, T, P, S) define the NFA

(G) of G to be the 5-tuple


= (
T,
Z,
zA,
R,
ZF =
Z),
where Z is the set of LR(0)-items augmented by an initial state zA and a relationR defined
as follows:
-
((
, zA),(S
.A))
R for all
S
A
P,
-
((a, A
u.av),(A
ua.v)) for a
T,
-
((B, A
u.Bv),(A
uB.v))
R for B
X - T,
-
((
, A
u.Bv),(B
.A))
R for all
B
A
P.
Notice that rules 1 and 4 render 
nondeterministic and that every state is an acceptance
state. The symbol u is the viable prefix that is currently being processed.
Consider, again, the grammar G of this section.

(G) is easily
constructed as in Fig. 5.10.
Figure 5.10:
The non-deterministic

(G) for the Grammar G
|
|
We have already learned how to convert the NFA

(G) to the DFA
(G) in Chapter3. The
result of this construction is to be found in Fig. 5.11. Note that to keep clarity the arrows
to the empty state, i.e. the error state, have not been drawn. This DFA is the basis upon which all
LR parsing tables are constructed. Instead of constructing any of them, it is perhaps most instructive to
carry out the parsing algorithm directly by using
(G). The strategy is quite simple: We employ
the stack to store the current symbol of the processed sentential form together with the state of
in which it was adduced. If the state is that of a complete LR(0)-item then reduce using that rule. Thus
we must pop as many items from the stack as there are symbols in the right-hand side of the rule. The new state
is then the successor of the state on top of the stack under the single symbol from the left-hand side of the
rule. This is the pushed together with that symbol. If the state is not a complete LR(0)-item, then there is
more to be seen until a reduction can be executed, i.e. the stack contains a viable prefix and the next
input symbols must be pushed, together with the state into which
goes upon that symbol and the
current state. If, at any stage, there is no transition function for the current state-symbol configuration, then
an error has occurred. This strategy is summarized in the following algorithm:
Deterministic Parser for LR(0) Grammars G using
(G)
- x = input string
-
v = getsym(x)
-
state = [zA]
while (
TOS
S) begin
if state contains a complete LR(0)-item
A
{
pop
length(
) symbols
tempstate = TOS
push the pair
(goto(A, tempstate), A)
}
else if state is not error state {
state = goto(v, state)
push (v, state)
v=getsym(x);
}
else error();
end
In the above algorithm TOS has been used interchangeably for the entire contents of the
top of the stack or the state number or the symbol stored there.
The action of this algorithm is then demonstrated in the following table. Note that every
reduction traverses the automaton backwards according to the symbols in the sentential form
on the right-hand side.
| Stack |
Input String |
Current State |
Action |
| (1,$) |
aabbaa$ |
1 |
shift a |
|
(2, 1)(1,$) |
abbaa$ |
2 |
reduce
A a |
|
(3, A)(1,$) |
abbaa$ |
3 |
shift a |
|
(5, a)(3, A)(1,$) |
bbaa$ |
5 |
reduce
A Aa |
|
(3, A)(1$,) |
bbaa$ |
3 |
shift a |
|
(5, a)(3, A)(1,$) |
baa$ |
5 |
reduce
A Aa |
|
(3, A)(1,$) |
baa$ |
3 |
shift b |
|
(6, b)(3, A)(1,$) |
aa$ |
6 |
shift b |
|
(6, b)(6, b)(3, A)(1,$) |
a$ |
6 |
shift a |
|
(7, a)(6, b)(6, b)(3, A)(1,$) |
$ |
7 |
reduce
B ba |
|
(8, B)(6, b)(3, A)(1,$) |
$ |
8 |
shift a |
|
(9, a)(8, B)(6, b)(3, A)(1,$) |
 |
9 |
reduce
B bBa |
|
(4, B)(3, A)(1,$) |
 |
4 |
reduce
S AB |
| S |
|
|
accept |
Figure 5.11:
The Deterministic Automat
(G) for Grammar G
|
|
Although the parsing algorithm from the preceding section will always work correctly if the
associated DFA
(G) is correct. As already observed, even the automat associated with a simple
expression is not correct. The problem is that it is always necessary to determine the next input symbol
in order to make a decision about the next step.
The string uwz is in the LR(1)-Context of the rule
A
w if there
is a derivation
S
uAv
uwv
for v = zv' or
v =
.
Note that z
T or
z =
.
Considering the expression once again, if an LR(0) parser, upon reading an id, has reduced a factor to a
term it is not clear how to proceed. It could
- reduce
E
T, or
- shift T to obtain T*F.
If the next symbol were
, then 1) would be executed and processing stops. If it weren't, then 1) would be
executed and the next symbol read. If it were * then 2) is carried out. It all depends on what the next input symbol
is.
If
G = (X, T, R, S) is a context-free grammar then the LR(1)-items of G have the
form
[A . , {z1,..., zn}] |
[A . , {z1,..., zn}] |
[A  , {z1,..., zn}] |
for
zi
T
{
}. The set
{z1,..., zn} is called the lookahead set of the
LR(1)-item. These are precisely those z that appear in a LR(1)-context. In other words, the terminal symbol
is in the lookahead set of
A

if it follows

in some rightmost
derivation:
It is now relatively straightforward to construct the nondeterministic stack machine

(G) as
before. The states z consist of those LR(1)-items I whose lookahead sets aid in the computation of a
successor state. The transition relation as before ``moves the marker over.''
-
((
, zA),[S
.A,{
}])
R for all
S
A
P,
-
((a,[A
u.av,{z1,..., zn}]),[A
ua.v,{z1,..., zn}]) for a
T,
-
((B,[A
u.Bv,{z1,..., zn}]),[A
uB.v,{z1,..., zn}])
R for B
X - T,
-
((
,[A
u.Bv,{z1,..., zn}]),[B
.
,{y1,..., ym}])
R for all
B
P and
yi
first(vzk) for some k.
Note the transition rule 4. It forms the
-closure of the sets of items, thus allowing
an instant calculation of the determinsitic pushdown automat
(G) without bothering
too much about the nondeterminsitic version: if the current marker position has a nonterminal symbol B followed by some
v
X*, then for each rule of the form
B
and for each terminal symbol
y
first(vzk) for which
[B
.alpha, y] does not satisfy condition 4 conjoin the
LR(1) item to the relation R. This closure operation ensures that the lookahead set is
progated correctly down the tree.
Consider again the augmented expression grammar
S E |
E E + T | T |
T T*F | F |
F (E) | id |
Starting with S calculate the LR(1)-items. For the first production
S
E the lookahead
set is always
{
}, thus begin with the item
[
S 
.
E,{

}]
and then compute the
-closure of it. Thus for the rules
E
E + T and
E
T, add the LR(1)-items
[E
.E + T,{
}] and
[E
.T,{
}].
This then leads to
[T
.T*F,{
}],
[T
.F,{
}],
[F
.(E),{
}] and finally
[F
.id,{
}].
But now care must be taken with the rule
E
.E + T; for E is followed by a `+', and hence
[E
.E + T,{ + }] must be conjoined to the previous LR(1)-items. But then the same applies
to the rule
T
.T*F; we must add
[T
.T*F,{*}] to this set of
items. There are no more new additions. Thus, collecting these items into a single state of items
yields the first state of
(G):
| |
[S .E,{ }] |
[F .(E),{ } |
| |
[E .E + T,{ + , }] |
[F .id,{ } |
| I1: |
[E .T,{ }] |
[T .T*F,{*, } |
| |
[T .F,{ }] |
|
Now apply the transition relation to items in I1 for A
X* to generate the next states.
Remember, the transition relation just moves the marker over. We will denote this with
goto(A, I1): for A = E we have
goto(
E,
I1) =
E
E. +
T, and
S
E..
Now compute the
-closure of these two items to obtain the successor of I1 under
E. Since there is no nonterminal symbol following the dot, there is no change. Thus:
| I2: |
[S E.,{ }], |
[E E. + T,{ + , } |
Plugging in A = T results in a similar set:
| I3: |
[E T.,{ }], |
[T T.*F,{*, } |
For A = F we only obtain
| I4: |
[T F.,{ }] |
Proceeding in this manner yields 11 states. The final result is given in the following
transition table that represents the central LR-parsing table. The reader is strongly
encouraged to carry out the calculations independentally:
| State |
Symbol |
Successor |
| I1 |
E |
I2 |
| I1 |
T |
I3 |
| I1 |
F |
I4 |
| I1 |
( |
I5 |
| I1 |
id |
I6 |
| I2 |
+ |
I7 |
| I3 |
* |
I8 |
| I5 |
E |
I9 |
| I5 |
T |
I3 |
| I5 |
F |
I4 |
| I5 |
id |
I6 |
| I7 |
T |
I10 |
| I7 |
F |
I4 |
| I7 |
( |
I5 |
| I7 |
id |
I5 |
| I8 |
F |
I11 |
| I8 |
( |
I5 |
| I8 |
id |
I6 |
| I9 |
) |
I12 |
| I9 |
+ |
I7 |
| I10 |
* |
I8 |
This automaton can also be used in the main LR-parsing algorithm: Upon encountering a complete
item
[A
.,{z1,...zn}] and the current lookahead symbol is one of the
zi, then reduce as in the algorithm. If the state is a legal state then shift and obtain a new
lookahead symbol.
Prof. Raymond Zavodnik, July 2002