5. Pushdown Automata and Context Free Languages

It will be recalled that automata have very limited memory capabilities. These can be strengthened by adding a stack structure.


5.1 Pushdown Automata

As Fig. 5.1 indicates, a pushdown automaton consists of three components: 1) an input tape, 2) a control unit and 3) a stack structure. The input tape consists of a linear configuration of cells each of which contains a character from an alphabet. This tape can be moved one cell at a time to the left. The stack is also a sequential structure that has a first element and grows in either direction from the other end. Contrary to the tape head associated with the input tape, the head positioned over the current stack element can read and write special stack characters from that position. The current stack element is always the top element of the stack, hence the name ``stack''. The control unit contains both tape heads and finds itself at any moment in a particular state.

Figure 5.1: Conceptual Model of a Pushdown Automaton
\includegraphics[]{pushdown-automaton}

Definition

A (non-deterministic) finite state pushdown automaton (abbreviated PDA or, when the context is clear, an automaton) is a 7-tuple $ \cal {P}$ = (X, Z,$ \bf S$, R, zA, SA, ZF), where

Remarks

  1. The non-determinism comes from the fact that $ \cal {P}$ allows a transition relation, instead of a transition function. To emphasize the nature of the correspondence extra parentheses have been utilized,
  2. The element ((xi, zj, sk),(zr, s)) is a member of the relation R means: At the moment in which xi is read, the automaton finds itself in state zj and the string s of stack symbols occupies the top of the stack, the first symbol of which is directly under the read/write head. Then the automaton changes its internal state to zr and replaces sk with the new word s. Again, the first symbol of s lies under the read/write head.
  3. There are two distinguished stack operations:
    1. The relation ((xi, zj,$ \lambda$),(zr, s)) $ \in$ R signifies that s is written to the top of the stack. Hence the familiar name ``push''.
    2. The relation ((xi, zj, sk),(zr,$ \lambda$)) $ \in$ R signifies that the word sk is removed from the stack and the next read/write position is the character directly preceding sk on the stack. Hence the term ``pop''.
  4. There are 5 cases that are worth mentioning
    1. There is no triple (xi, zj, sk) as the first component in any relation member. Then the automaton halts and remains there. This is also possible for the triple ($ \lambda$, zj, sk).
    2. If ((xi, zj, sk),(zr, s)) $ \in$ R with xi $ \in$ X and s $ \neq$ $ \lambda$ then the automaton goes into state zr, sk is replaced by s, the input tape is moved left by one cell and the read/write head is positioned directly over the first symbol of s.
    3. If (($ \lambda$, zj, sk),(zr, s)) $ \in$ R then procede as in (b), except the input tape does not move to the left.
    4. If ((xi, zj, sk),(zr,$ \lambda$)) $ \in$ R then $ \cal {P}$ goes into state zj, sk is popped from the stack and the read/write head is positioned over the symbol immediately preceding sk on the stack (if, that is, sk is not the first element on the stack).
    5. If ((xi, zj, sk),(zr, s)) $ \in$ R, zr $ \in$ ZF, the input read unit has no more symbols to read and the stack is empty, then $ \cal {P}$ halts. This will be dealt with in more detail presently.

Definition

A configuration of $ \cal {P}$ is an element (x1, z1, S1) from (X $ \cup$ {$ \lambda$})×Z×$ \bf S^{*}_{}$. The string x1 is interpreted as that part of the input string x that remains to be processed and S is the entire contents of the stack. If (x1, z1, S1) and (x2, z2, S2) are configurations then we say the first directly produces the second, written (x1, z1, S1) $ \Rightarrow$ (x2, z2, S2), if there is an element ((a1, z1,$ \alpha_{1}^{}$)(z2,$ \alpha_{2}^{}$)) $ \in$ R with x1 = a1x2, S1 = $ \alpha_{1}^{}$$ \beta$ and S2 = $ \alpha_{2}^{}$$ \beta$. This situation is sketched in Fig. 5.2. Note that the case a1 = $ \lambda$ is also possible here.

Figure 5.2: Direct Derivation of Configurations
\includegraphics[]{pushdown-derivation}



As with automata one can speak of acceptance of strings by $ \cal {P}$ over the alphabet X and hence of the language accepted by a pushdown automaton. This is the content of the following definition.

Definition

A string x $ \in$ X* is accepted by the non deterministic pushdown automaton $ \cal {P}$ if there is a sequence of configurations

(x, zA, SA) $\displaystyle \Rightarrow$ (x2, z2, S2) $\displaystyle \Rightarrow$ ... $\displaystyle \Rightarrow$ ($\displaystyle \lambda$, zn,$\displaystyle \lambda$)

with zn $ \in$ ZF. The sequence is called a calculation of $ \cal {P}$ and has the length n. Finally, set the language generated by L($ \cal {P}$) to be the set

L($\displaystyle \cal {P}$) = {x $\displaystyle \in$ X*|  $\displaystyle \cal {P}$  accepts  x}

Example

Let P be defined through X = {a, b, c}, Z = {zA = z1, z2, z3}, $ \bf S$ = {SA, S,$ \lambda$}, ZF = {z3} and, finally the state transitions

(1) ((a, z1, SA),(z1, SSA)) $ \in$ R
(2) ((a, z1, S),(z1, SS)) $ \in$ R
(3) ((b, z1, S),(z2,$ \lambda$)) $ \in$ R
(4) ((c, z2, S),(z2,$ \lambda$)) $ \in$ R
(5) ((c, z2, SA),(z3,$ \lambda$)) $ \in$ R.

Then L($ \cal {P}$) = {x $ \in$ {a, b, c}| x = anbcn}. Note that this language is not regular. The calculation for the string x = a3bc3 is adduced in Fig. 5.3. Starting out in initial state z1 with the initial stack symbol SA on top of the stack and reading input symbol a entails pushing S onto the stack and remaining in state z1 according to the first rule. Thereafter for each a read the second rule pushes an additional s in state z1 until input symbol b is encountered. Then the third rule starts popping S from the stack and goes into state z2; this process is completed by matching the input symbols c against the remaining stack symbols according to the fourth and fifth rules.

Figure 5.3: Derivation of the String a3bc3
\begin{figure}\begin{center}
\begin{tabular}{\vert c\vert c\vert c\vert c\vert c...
...}{c}{} &\multicolumn{1}{c}{$\uparrow$} \\
\end{tabular}\end{center}\end{figure}


5.2 Context-Free Languages

As will be recalled from the last chapter there were two basic ways to determine whether a given string belongs to the language generated by some finite state automaton: One could verify that the string brings the automaton to a final state or one could derive, or, better, produce, the string in the regular grammar corresponding to the automaton. The same option holds for PDAs.

Definition

A context-free grammar is a grammar $ \cal {G}$ = (X, T, S, R) for which all rules, or productions, in R have the special form A $ \Rightarrow$ $ \alpha$, for A $ \in$ X - T and $ \alpha$ $ \in$ X*. Additionally, for any two strings u, v $ \in$ X* write u $ \Rightarrow$ v (u directly produces v ) if and only if (1) u = u1Au2 for u1, u2 $ \in$ X* and A $ \in$ X - T and (2) v = v1$ \alpha$v2 and A $ \Rightarrow$ $ \alpha$, $ \alpha$ $ \in$ X*, is a production from R. The reduction u $ \Rightarrow$ v is also called a direct production. Finally, write u $ \Rightarrow$ v for two strings u, v $ \in$ X* (u derives v) if there is a sequence u = u0 $ \Rightarrow$ u1 $ \Rightarrow$ u2 $ \Rightarrow$ ... $ \Rightarrow$ un = v of direct productions ui $ \Rightarrow$ ui+1 from R. The length of the derivation is n. The language generated by $ \cal {G}$ is {x $ \in$ T*|  S $ \Rightarrow^{*}_{}$ x}.


Thus, the definition just articulates the reduction of A to $ \alpha$ in any context in which A occurs. It is trivial that every regular language is context-free. The obverse, as will be seen presently, is not true. Before proving the central theorem for this section two typical examples are given.

Example 1

Consider $ \cal {G}$ = (X, T, R, S) with T = {a, b} and X = {S, a, b,$ \lambda$}. The productions, or grammar rules, are: S $ \Rightarrow$ aSb  |  $ \lambda$. Then it is clear that L($ \cal {G}$) = {anbn|  n $ \geq$ 0}. From the previous chapter it is known that this language is not regular.

Example 2: A Grammar for Arithmetic Expressions

Let

X = {E, T, F, id, + , - ,*,/,(,), a, b, c}

and T = {a, b, c, + , - ,*,/,(,)}. The start symbol S is E and the productions are as follows:
E $\displaystyle \Rightarrow$ E + T  |  E - T  |  T  
T $\displaystyle \Rightarrow$ T*F  |  T/F  |  F  
F $\displaystyle \Rightarrow$ (E)  |  id  
id $\displaystyle \Rightarrow$ a  |  b  |  c  

Then the string (a + b)*c belongs to L($ \cal {G}$). Indeed, it is easy to write down a derivation of this string:
E $\displaystyle \Rightarrow$ T $\displaystyle \Rightarrow$ T*F $\displaystyle \Rightarrow$ F*F $\displaystyle \Rightarrow$ (E)*F $\displaystyle \Rightarrow$ (E + T)*F  
  $\displaystyle \Rightarrow$ (T + T)*F $\displaystyle \Rightarrow$ (F + T)*F $\displaystyle \Rightarrow$ (id + T)*F $\displaystyle \Rightarrow$ (a + T)*F  
  $\displaystyle \Rightarrow$ (a + F)*F $\displaystyle \Rightarrow$ (a + id )*F $\displaystyle \Rightarrow$ (a + b)*F $\displaystyle \Rightarrow$ (a + b)*id $\displaystyle \Rightarrow$ (a + b)*c  

The derivation just adduced is leftmost in the sense that the leftmost nonterminal was always substituted. Although derivations are in general by no means unique, the leftmost one is. The entire derivation can also be nicely represented in a tree form, as Fig. 5.4 suggests.

Figure 5.4: Derivation Tree for the Expression (a + b)*c
\includegraphics[height=8cm]{derivation-tree}

The internal nodes of the derivation, or syntax, tree are nonterminal symbols and the frontier of the tree consists of terminal symbols. The start symbol is the root and the derived symbols are nodes. The order of the tree is the maximal number of successor nodes for any given node. In this case, the tree has order 3. Finally, the height of the tree is the length of the longest path from the root to a leaf node, i.e. a node that has no successor. The string (a + b)*c obtained from the concatenation of the leaf nodes together from left to right is called the yield of the tree.

The expected relation between pushdown automata and context-free languages is enunciated in the following theorem.

Theorem 5..1   A language L (i.e. a subset L $ \subseteq$ X*) is generated by a context-free grammar $ \cal {G}$ if and only if there is a pushdown automaton $ \cal {P}$ with L = L($ \cal {P}$).

Proof

If $ \cal {G}$ = (X, T, R, S) is a given context-free grammar, then it is relatively straightforward to construct a PDA $ \cal {P}$ with L($ \cal {P}$) = L($ \cal {G}$). Let the stack alphabet S be the set X $ \cup$ {SA} $ \cup$ {$ \lambda$} and the set of states {z0 = zA, z1} with ZF = {z1}. Also let the input alphabet be T $ \cup$ {$ \lambda$} and define the transitions
  1. Push the grammar's start symbol straight off: (($ \lambda$, z0, SA),(z1, SSA)) $ \in$ R.
  2. Nondeterministically push the right-hand side of a grammar rule onto the stack:

    (($\displaystyle \lambda$, z1, A),(z1,$\displaystyle \alpha$)) $\displaystyle \in$ R.

  3. Match the input symbol with the terminal symbol on top of the stack: For each a $ \in$ T put

    ((z1, a, a),(z1,$\displaystyle \lambda$)) $\displaystyle \in$ R.

  4. Pop the stack start symbol at the very end: (($ \lambda$, z1, SA),(z1,$ \lambda$)) $ \in$ R.
Consider, as an example, the grammar $ \cal {G}$ = (X, T, R, S) with X = {S, a, b, c}, T = {a, b, c} and productions S $ \Rightarrow$ aSa  |  bSb  |  c. Then the language generated by $ \cal {G}$ is L($ \cal {G}$) = {xcxR  |  x $ \in$ {a, b}*  where xR denotes the reverse string to x}. Then, as in the above general PDA specification, the stack alphabet consists of the set {$ \lambda$, S, a, b, c, SA}, the input alphabet is {a, b, c} and the state transitions are as follows:

(($ \lambda$, z0, SA),(z1, SSA)) $ \in$ R
(($ \lambda$, z1, S),(z1, aSa)) $ \in$ R
(($ \lambda$, z1, S),(z1, bSb)) $ \in$ R
(($ \lambda$, z1, S),(z1, c)) $ \in$ R
((a, z1, a),(z1,$ \lambda$)) $ \in$ R
((b, z1, b),(z1,$ \lambda$)) $ \in$ R
((c, z1, c),(z1,$ \lambda$)) $ \in$ R
(($ \lambda$, z1, SA),(z1,$ \lambda$)) $ \in$ R

To see that this PDA works correctly, ascertain that the string a2bcba2 is indeed an element of L($ \cal {P}$) by consulting the derivation diagram in Fig. 5.5

Figure 5.5: Derivation of the String a2bcba2
\begin{figure}\begin{center}
\begin{tabular}{\vert c\vert c\vert c\vert c\vert c...
...{} &\multicolumn{1}{c}{$\uparrow$} \\
\par\end{tabular}\end{center}\end{figure}

Conversely, assume $ \cal {P}$ is a PDA. To clarify the subsequent definitions the following discussion on the internal operation of $ \cal {P}$ is offered. The goal is, of course, to concoct a context-free grammar that executes a leftmost derivation of every string that $ \cal {P}$ accepts. If $ \cal {P}$ were as simple as the example in the first part of this proof, namely, that after pushing the very first nontrivial symbol (not SA) onto the stack $ \cal {P}$ remains in a single state z1, then it would be very straightforward to reverse the above process and construct $ \cal {G}$ from $ \cal {P}$. Basically, if x is the input string write x = x$\scriptstyle \prime$ax$\scriptstyle \prime{^\prime}$, where x$\scriptstyle \prime$ is that part of x that has already been processed (a so-called prefix of x) and ax$\scriptstyle \prime{^\prime}$ is the rest of x whose first input symbol is a. Then the direct production of configurations of $ \cal {P}$ of the form (ax$\scriptstyle \prime{^\prime}$, z1, AA$\scriptstyle \prime$) $ \Rightarrow$ (x$\scriptstyle \prime{^\prime}$, z1,$ \alpha$A$\scriptstyle \prime$) corresponds to the grammar rule A $ \Rightarrow$ a$ \alpha$, resulting in the reduction x$\scriptstyle \prime$AA$\scriptstyle \prime$ $ \Rightarrow$ xa$ \alpha$A$\scriptstyle \prime$. Thus the sequence of stack moves from the above-mentioned example commences with SA and, after popping that symbol, derives the string a2bcba2, as can be seen by inspecting the stack column in Fig. 5.5.

Unfortunately, the general case is considerably more complicated, because $ \cal {P}$'s state transitions also enter into the picture. Proceeding naively, one could reduce $ \cal {P}$ to a 2 state PDA $ \cal {P}$$\scriptstyle \prime$ of the aforementioned type by pushing pairs (z, A) of states and stack symbols from $ \cal {P}$ onto $ \cal {P}$$\scriptstyle \prime$'s stack, thus imitating $ \cal {P}$'s calculation of input strings. Thus, when $ \cal {P}$ is in state z and pushes A onto the stack, $ \cal {P}$$\scriptstyle \prime$ pushes (z, A) onto its stack. The reader is invited to pause to discover the fatal shortcoming of this method before reading further.

The problem becomes immediately transparent when one considers what happens when $ \cal {P}$$\scriptstyle \prime$ pops a stack element (z, A). State z is no longer relevant for $ \cal {P}$$\scriptstyle \prime$'s further operation-$ \cal {P}$ was in state z when A got pushed, but what state was $ \cal {P}$ in when the pop occurred? Therefore, it is necessary to push triples (z, A, z$\scriptstyle \prime$), where z$\scriptstyle \prime$ is $ \cal {P}$'s state when the pop takes place. Since it is not known what $ \cal {P}$'s state z$\scriptstyle \prime$ is going to be when it pops A, $ \cal {P}$ has to guess what it is going to be, .i.e. it nondeterministically pushes (z, A, z$\scriptstyle \prime$), where z$\scriptstyle \prime$ $ \in$ Z is arbitrary. The only restriction is that when executing two (or more) push operations the unknown state z$\scriptstyle \prime$ must be manipulated consistently. This means if A1A2 is pushed, then after $ \cal {P}$ pops A1, or, equivalently, $ \cal {P}$$\scriptstyle \prime$ pops (z1, A1, z1$\scriptstyle \prime$), then $ \cal {P}$ finds itself in state z1$\scriptstyle \prime$. Since $ \cal {P}$$\scriptstyle \prime$ does not use its own state information in imitating $ \cal {P}$'s state transitions, $ \cal {P}$'s current state must be available in describing the next element of $ \cal {P}$$\scriptstyle \prime$'s stack, or, in other words, $ \cal {P}$ better be in state z1$\scriptstyle \prime$ when popping A2 from its stack, and so must be of the form (z1$\scriptstyle \prime$, A2, z2) for some (predicted) z2 $ \in$ Z. This train of thought will now be formalized.

For simplicity, assume that $ \cal {P}$ pushes at most two symbols and that it has a single acceptance statei zF. A moment's reflection shows that these assumptions are not restrictive; but they do eliminate some extra preprocessing. The nonterminals of G are triples (z, A, z$\scriptstyle \prime$) of states z, z$\scriptstyle \prime$ and a stack symbol A. The basic idea is to imitate what the machine undergoes in state z finally to pop symbol A and to wind up thereby in state z$\scriptstyle \prime$, having processed some string of input characters. Thus the rules for the sought-after context-free grammar are posited as follows:

  1. For the (extra) start symbol put S $ \Rightarrow$ (zA, SA, zF).
  2. For each transition ((a, z, B),(z$\scriptstyle \prime$, C)) $ \in$ R put for each z1 $ \in$ Z

    (z, B, z1) $\displaystyle \Rightarrow$ a(z$\scriptstyle \prime$, C, z1)

  3. In case two symbols are pushed, i.e. ((a, z, B),(z$\scriptstyle \prime$, C1C2)) $ \in$ R, then put for each pair z1, z2 $ \in$ Z

    (z, B, z1) $\displaystyle \Rightarrow$ a(z$\scriptstyle \prime$, C1, z2)(z2, C2, z1).

  4. For each z $ \in$ Z put (z,$ \lambda$, z) $ \Rightarrow$ $ \lambda$.
It is important to notice the free choice of z1 and z1, z2 in rules 2. and 3. Consider, for example, processing the string a2bc2 from the PDA from Section 5.1. Then posit the start rule

S $\displaystyle \Rightarrow$ (z1, SA, z3),

since there is only one final state. Now mechanically translate each of the transitions from this PDA into their grammatical equivalents as shown in Table 5.1.

Table 5.1: Translation of the PDA $ \cal {P}$ Transition Rules into Grammatical Productions
Nr. Transition Function Nr Production
1 ((a, zA, SA),(zA, SSA)) 1' (zA, SA, z') $ \Rightarrow$ a(zA, S, z'')(z'', SA, z')
2 ((a, zA, S),(zA, SS)) 2' (zA, S, w') $ \Rightarrow$ a(zA, s, w'')(w'', s, w')
3 ((b, zA, S),(z2,$ \lambda$)) 3' (zA, S, v') $ \Rightarrow$ b(z2,$ \lambda$, v')
4 ((c, z2, S),(z2,$ \lambda$)) 4' (z2, S, u') $ \Rightarrow$ c(z2,$ \lambda$, u')
5 ((c, z2, SA),(z3,$ \lambda$)) 5' z2, SA, t') $ \Rightarrow$ c(z3,$ \lambda$, t')


It is important to note that states z', z'', w', w'', v', u', t' can be chosen at will. Hopefully, a proper choice will lead to success in accordance with the philosophy of nondeterminism.

Example

Derivation of the string a2bc2 $ \in$ L($ \cal {P}$).

S $ \overset{1}{\Rightarrow}$ (zA, SA, zF)
  $ \overset{1'}{\Rightarrow}$ a(zA, S, z'')(z'', SA, zF)
  $ \overset{2'}{\Rightarrow}$ a2(zA, S, w'')(w'', s, z'')(z'', SA, zF)
  $ \overset{3'}{\Rightarrow}$ a2b(z2,$ \lambda$, w'')(w'', S, z'')(z'', SA, zF)
  $ \overset{4'}{\Rightarrow}$ a2bc(z2,$ \lambda$, z2)(z2, SA, zF)
  $ \overset{5'}{\Rightarrow}$ a2bc2(z3,$ \lambda$, z3) = a2bc2

Notice that the steps above correspond in a one-to-one manner with the sequence of transitions from the original PDA. This now concludes the sketch of the proof of the theorem.

5.3 Properties of Context-Free Langauges

5.3.1 Syntax Trees

Tree representations of derivations, also known as syntax trees, were briefly introduced in the preceding section to promote intuition of derivations. Since these are such important tools for the investigation of context-free languages, they will be dealt with a little more systematically here.

Definition

Let $ \cal {G}$ = (X, T, R, S) be a context-free grammar. A syntax tree for this grammar consists of one of the following
  1. A single node x for an x $ \in$ T. This x is both root and leaf node.
  2. An edge
    \includegraphics[]{syntax-tree}
    corresponding to a production A $ \Rightarrow$ $ \alpha$ $ \in$ R.
  3. A tree
    \includegraphics[]{syntax-tree2}
    where the A1, A2, ... , An are the root nodes of syntax trees. Their yields are read from left to right.

Ambiguity

Until now the syntax trees were uniquely determined-even if the sequence of direct derivations were not. Separating the productions corresponding to the operator hierarchy, from weakest to strongest, in the expression grammar + , - ,*,/,() preserves this natural hierarchy. If this is not done, then syntax trees with a false evualation sequence are often the result. Suppose, for instance, that the rules of the expression grammar were written E $ \Rightarrow$ E + E  |  E*E  |  id, then two different syntax trees are the result. If the first production E $ \Rightarrow$ E + E were chosen then the result would be the tree

\includegraphics[]{syntax-tree3}

On the other hand, choosing the production E $ \Rightarrow$ E*E first results in a syntax tree of an entirely different ilk.

\includegraphics[]{syntax-tree4}

Thus this grammar is ambiguous, because it is possible to generate two different syntax trees for the expression a + b*c.

5.3.2 Chomsky Normal Form

Work with a given context-free grammar is greatly facilitated by putting it into a so-called normal form. This provides some kind of regularity in the appearance of the right-hand sides of grammar rules. One of the most important normal forms is the Chomsky normal form.

Definition

The context-free Grammar $ \cal {G}$ = (X, T, R, S) is said to be in Chomsky normal form if all grammar rules have the form

A $\displaystyle \Rightarrow$ a  |  BC, (5.1)

for a $ \in$ T and B, C $ \in$ X - T. There is one exception. If $ \lambda$ $ \in$ L($ \cal {G}$), then the single extra rule

S $\displaystyle \Rightarrow$ $\displaystyle \lambda$ (5.2)

is permitted. If $ \lambda$ $ \notin$L($ \cal {G}$) then production rule 5.2 is not allowed.

Theorem 5..2   Any context-free grammar $ \cal {G}$ = (X, T, R, S) can be rewritten in Chomsky normal form.

Proof

To facilitate rewriting the grammar rules it is first a good idea to eliminate unnecessary pathology in the original grammar. Although a CFG always defines its grammar rules with single nonterminal symbols on the left-hand side of every production, it can easily happen that there are nonterminal symbols that never appear on the left-hand side of any production. It is then seen that they cannot generate any sequence of terminal symbols and, moreover, may never appear in the right-hand side of any production to help produce a sentence in the associated language. It can even happen that there are nonterminal symbols that appear in no sentential form5.1 derivable from the start symbol. These symbols are called useless and are to be expurgated at the outset. It should be abundentally clear that no part of any rule containing one of these symbols has the least influence of L(G). Consider, for example, the grammar

S $ \Rightarrow$ AB  |  CA |  AD
B $ \Rightarrow$ BC  |  AB
A $ \Rightarrow$ aA  |  a
C $ \Rightarrow$ b  |  aB |  bC

No terminal symbol is derivable from B and D never appears in the right-hand side of any rule. Thus, the grammar simplifies to:

S $ \Rightarrow$ CA
A $ \Rightarrow$ aA  |  a
C $ \Rightarrow$ b  |  bC

Now that the grammar has been stripped of extraneous elements, the grammatical transformation can begin. The rules for $ \cal {G}$ can be rewritten as follows:

Example

Consider again the expression grammar from Section 5.2. Then the rule E $ \Rightarrow$ E + T  |  E - T, is replaced by E $ \Rightarrow$ EE$\scriptstyle \prime$, E$\scriptstyle \prime$ $ \Rightarrow$ POPT, POP $ \Rightarrow$ +  |   -. Similar replacements hold for T $ \Rightarrow$ T*F  |  T/F and F $ \Rightarrow$ (E). Finally the productions E $ \Rightarrow$ T, T $ \Rightarrow$ F and F $ \Rightarrow$ id are reduced to E $ \Rightarrow$ a  |  b  |  c, T $ \Rightarrow$ a  |  b  |  c and F $ \Rightarrow$ a  |  b  |  c respectively.

5.3.3 Non Context-Free Languages: Ogden's Lemma

As with finite automata there is a version of the pumping lemma that demonstrates certain languages are not context free. This theorem is often called Ogden's lemma after its discoverer.

Theorem 5..3   Let $ \cal {G}$ = (X, T, R, s) be a context-free language. Then there is an integer n = n($ \cal {G}$) for which every string x $ \in$ L($ \cal {G}$) having a length | x| greater than n can be written

x = uvwyz, (5.3)

and
  1. vy $ \neq$ $ \lambda$ (that is, v $ \neq$ $ \lambda$ or y $ \neq$ $ \lambda$).
  2. The length of vwy satisfies | vwy| $ \leq$ n.
  3. For each integer k $ \geq$ 0, it follows that uvkwykz $ \in$ L($ \cal {G}$).

Proof

Assume that $ \cal {G}$ is in Chomsky normal form. For x $ \in$ L($ \cal {G}$) consider the (binary) syntax tree for the derivation of x. Assume the height of this tree is h as illustrated in Fig. 5.6.

Figure 5.6: Derivation Tree for the string x $ \in$ L($ \cal {G}$)
\includegraphics[]{derivation-tree2}

Then it follows that | x| $ \leq$ 2h-2 + 2h-2 = 2h-1, i.e. the yield of the tree with height h is at most 2h-1. If $ \cal {G}$ has k nonterminal symbols, let n = 2k. Then let x $ \in$ L($ \cal {G}$) be a string with | x| $ \geq$ n. Thus the syntax tree for x has height at least k + 1, thus on the path from the root downwards that defines the height of the tree there are at least k + 2 nodes, i.e. at least k + 1 nonterminal symbols. It then follows that there is some nonterminal symbol A that appears at least twice. Consulting Fig. 5.7, it is seen that the partial derivation S $ \Rightarrow^{*}_{}$ uAz $ \Rightarrow^{*}_{}$ uvAyz obtains.

Figure 5.7: Nonterminal A appears twice in the derivation of x
\includegraphics[]{nonterminal-a}

If, now, both u and z were empty, then derivations of the form S $ \Rightarrow$ uAz $ \Rightarrow$ A would be possible, contrary to the assumption of Chomsky normal form. For the same reason either v or y are nonempty. If | vwy| > n then apply the procedure anew until the condition | vwy| $ \leq$ n holds. Finally, since the derivation A $ \Rightarrow$ vAy can be repeated as often as one pleases, it follows that S $ \Rightarrow^{*}_{}$ uAz $ \Rightarrow^{*}_{}$ uvAyz $ \Rightarrow^{*}_{}$ uv2Ay2z $ \Rightarrow$ uv2wy2zi, etc. can be generated. This completes the proof.

Example 1

The language L = {aibici  |  i $ \geq$ 1} is not context free.

Proof

Assume L were context-free. Then let n be the n from the preceding theorem and put x = anbncn. Ogden's lemma then provides the decomposition x = uvwyz with the stated properties. There are several cases to consider.
Case 1 The string vy contains only a's. But then the string uwz $ \in$ L, which is impossible, because it contains fewer a's than b's and c's. .
Case 2,3 vy contains only b's or c's. This case is similar to case 1.
Case 4,5 vy contains only a's and b's or only b's or c's. Then it follows that uwz contains more c's than a's and b's or more a's than b's and c's. This is again a contradiction.
Since | vwy| $ \leq$ n it is not possible that vy contain a's and c's.

Example 2

In this example the power of Ogden's lemma will be extended. The language L = {aibjck  |  i < j < k} is not contetx free.

Proof

The pumping properties of Ogden's lemma are of little use here. Hence we return to the syntax tree in the proof of the theorem. If, now, L were context-free then consider the path in its syntax tree from the root to the leaf node containing the rightmost a. For sufficiently large i there is likewise a nonterminal symbol A appearing twice on this path. Then x = uamwyz for some m $ \geq$ 0. There are 2 cases to be considered here.
Case 1 m = 0. Then the matching substring y cannot be empty. On the other hand the theorem guarantees that | wy| $ \leq$ i, hence all c's must be contained in z. But then y contains at least one b, so for sufficiently large n the string uwynz belongs to L, which is impossible, because for large enough n, this is string will contain more b's than c's-contrary to assumption
Case 2 m $ \geq$ 1. It then follows that

S $\displaystyle \Rightarrow^{*}_{}$ uAz $\displaystyle \Rightarrow^{*}_{}$ uamAyz $\displaystyle \Rightarrow^{*}_{}$ ua2mAy2z $\displaystyle \Rightarrow$ ... $\displaystyle \Rightarrow$ uanmwynz.

Now obviously y cannot contain more than 1 letter kind and this letter is either b or c. If y contained only b's then the string uanmwynz would contain more a's than c's for n sufficiently large. As similar argument holds if y contained only c's. In any case, a contradiction is derived and thus the assumption that L is context-free is false.

5.3.4 Closure Properties of Context Free Languages

Proceding by analogy, one would expect the closure theorems for finite state automata to generelize to PDAs. Surprisingly, not everything carries over. The following theorem, however, articulates a property of context-free languages analogous to finite automata.

Theorem 5..4   The context-free langauges are closed under the formation of

Proof

Let $ \cal {G}$1 = (X1, T1, R1, S1) and $ \cal {G}$2 = (X2, T2, R2, S2) be context-free grammars. Without loss of generality, it can be supposed that (X1 - T1) $ \cap$ (X2 - T2) = $ \emptyset$. If not, then rewrite the grammar rules of one with new nonterminal symbols. Let S be another nonterminal symbol. Then set R = R1 $ \cup$ R2 $ \cup$ {S $ \Rightarrow$ S1, S $ \Rightarrow$ S2} and $ \cal {G}$ = (X1 $ \cup$ X2 $ \cup$ {S}, T1 $ \cup$ T2, R, S). It then follows that L($ \cal {G}$) = L($ \cal {G}$1) $ \cup$ L($ \cal {G}$2). It is left to the reader to verify closure under concatenation and Kleene star in a similar manner.

The closure under intersection property is not quite exact.

Theorem 5..5   The intersection of a context-free language L1 and a regular language L2 is context-free.

Proof

If L1 is context-free then it is recognized by some push down automaton $ \cal {P}$ = (X, Z1,$ \bf S_1$, R1, S1, F1). If L2 is a regular language, then it is recognized by a deterministic automaton $ \cal {A}$ = (X, Z2, f, S2, F2). Define a new PDA $ \cal {P}$$\scriptstyle \prime$ = (X, Z,$ \bf S$, R, SA, F) as follows: Z = Z1×Z2, $ \bf S$ = $ \bf S_1$, SA = (S1, S2) and, finally, F = F1×F2. The transition relation R is obtained directly from the transition relation of $ \cal {P}$ and the transition function of $ \cal {A}$, viz. for every transition of $ \cal {P}$ of the form ((a1, z1, S$\scriptstyle \prime$),(z1$\scriptstyle \prime$, S1$\scriptstyle \prime{^\prime}$)) $ \in$ R1 and for each state z2 $ \in$ Z2, put

((a1,(z1, z2), S1$\scriptstyle \prime$),((z1$\scriptstyle \prime$, f (a, z2), S1$\scriptstyle \prime{^\prime}$)) $\displaystyle \in$ R

and for each $ \lambda$-move of $ \cal {P}$ of the form (($ \lambda$, z1, S1$\scriptstyle \prime$),(z1$\scriptstyle \prime$, S1$\scriptstyle \prime{^\prime}$)) $ \in$ R and z2 $ \in$ Z2 put

(($\displaystyle \lambda$,(z1, z2), S1$\scriptstyle \prime$),((z1$\scriptstyle \prime$, z2), S1$\scriptstyle \prime{^\prime}$)) $\displaystyle \in$ R,

or, stated in words, $ \cal {P}$$\scriptstyle \prime$ passes from state (z1, z2) into state (z1$\scriptstyle \prime$, z2$\scriptstyle \prime$) if and only if $ \cal {P}$ passes from z1 to z2 and $ \cal {A}$ passes from z2 to z2$\scriptstyle \prime$, i.e. x $ \in$ L($ \cal {P}$$\scriptstyle \prime$)) if and only if x $ \in$ L($ \cal {P}$) $ \cap$ L($ \cal {A}$).

Theorem 5..6   The class of context-free languages is not closed under intersection and complement.

Proof

It is easy to see that the two languages

L1 = {anbncm  |  m.n $\displaystyle \geq$ 0}

and

L2 = {ambncn  |  m, n $\displaystyle \geq$ 0}

are context free. The interesction, however, L1 $ \cap$ L2 = {anbncn  |  n $ \geq$ 0} is not context-free. From the complement identity

L1 $\displaystyle \cap$ L2 = $\displaystyle \overline{{\overline{L_1} \cup \overline{L_2}}}$,

it is seen that the complements $ \overline{{L_1}}$ and $ \overline{{L_2}}$ are not in general context-free.

5.4 Push Down Automata and Context-Free Grammars

Definition

An algorithm is called polynomial in case there is an integer k $ \geq$ 2 such that the number of steps after which the algorithm halts is $ \cal {O}$(nk). The argument n depends only on the input.

Theorem 5..7   There is a polynomial algorithm that constructs to any given push down automaton $ \cal {P}$ a context-free grammar $ \cal {G}$ with L($ \cal {G}$) = L($ \cal {A}$). Conversely, there is a polynomial algorithm that constructs to any given context-free grammar $ \cal {G}$ a push down automaton $ \cal {P}$ with L($ \cal {P}$) = L($ \cal {G}$).

Theorem 5..8   There is a polynomial algorithm that decides, given any context-free grammar G = (X, T, R, S) and x $ \in$ T* whether x $ \in$ L($ \cal {G}$).

Proof

The proof of this theorem sometimes goes under the name CYK algorithm after their discoverers Cocke, Younger and Kasami. It proceeds as follows:
  1. Rewrite $ \cal {G}$ in Chomsky normal form. It is easily seen that this can be done in polynomial time.
  2. If x = x1x2 ... xn, then for 0 $ \leq$ i, j $ \leq$ n put xij = xixi+1 ... xi+j-1. It is noteworthy that | xij| = j. The idea is to determine all A $ \in$ X - T for which A $ \Rightarrow^{*}_{}$ xij. Thus set

    Vij = {A $\displaystyle \in$ X - T  |  A $\displaystyle \Rightarrow$ xij}.

    1. For j = 1 it is readily seen that Vi1 = {A $ \in$ X - T  |  A $ \Rightarrow$ xi}.
    2. For general j it is also seen that A $ \in$ Vij$ \iff$A $ \Rightarrow^{*}_{}$ xixi+1 ... xi+j-1$ \iff$A $ \Rightarrow$ BC is a rule from R and B $ \Rightarrow^{*}_{}$ xi ... xi+k-1 and C $ \Rightarrow^{*}_{}$ xi+k ... xi+j-1 for some k = 1, 2, ... j - 1.
Thus the algorithm can be formulated as follows:

for i : = 1 to n do

Vi1 : = {A $ \in$ X - T  |  A $ \Rightarrow$ xi $ \in$ R};
for j : = 2 to n do
for i : = 1 to n - j + 1 do begin
Vij : = $ \emptyset$;
for k : = 1 to j - 1 do
Vij : = Vij$ \bigcup${A $ \in$ X - T  |  A $ \Rightarrow$ BC,  B $ \in$ Vik,  C $ \in$ Vi+k, j-k};
end

Figure 5.8: Diagonal Procedure for CYK Algorithm
\includegraphics[]{diagonal-procedure}

There is a nice interpretation of the innermost for loop. Formally one processes the pairs Vi1Vi+1, j-1, Vi2Vi+2, j-2, ... , Vi, j-1Vi+j-1, 1. As evidenced in Fig. 5.8 go down the ith column and simultaneously traverse the diagonal from Vi+1, j-1 up and to the right. The corresponding elements are compared with each other.
Finally, it is seen that x $ \in$ L($ \cal {G}$)$ \iff$S $ \in$ V1, n, because then S $ \Rightarrow$ x1 ... xn, where n = length(x).

This technique of producing increasingly larger solutions from smaller ones is called dynamic programming.

Example

Consider the Grammar
S   $\displaystyle \Rightarrow$ AB  |  BC  
A   $\displaystyle \Rightarrow$ BA  |  a  
B   $\displaystyle \Rightarrow$ CC  |  b  
C   $\displaystyle \Rightarrow$ AB  | a  

and the string x = baaba with n = 5. Then proceeding as above, the following triangular matrix results:

b a a b a
B A, C A, C B A, C
S, A B S, C S, A  
$\displaystyle \emptyset$ B B    
$\displaystyle \emptyset$ S, A, C      
S, A, C        

Since S $ \in$ V15 it follows that x $ \in$ L($ \cal {G}$). It is quite remarkable that the algorithm time is $ \cal {O}$(n3). It is also remarkable that the CYK algorithm actually shows how to construct the derivation, which has great practical importance.

5.5 The Chomsky Hierarchie

Concomitant to the various automata that have been introduced are the various languages, or, equivalently subsets of strings over some alphabet, that are recognized by those automata. Chomsky has characterized a full spectrum of language types that occur in formal language theory. The first step is to recall that in the definition of a grammar completely general production rules are allowed. Specifically this allows productions whose left-hand sides are arbitrary non-empty strings over the set X of terminal and nonterminal symbols. There is a further technical restriction on the nullbar productions: If the empty string is derivable in the language generated by any grammar. Following Chomsky, such grammars are called Type 0 Grammars. If one further requires that fewer combinations of terminal and nonterminal symbols occur on the left-hand side than on the right then one speaks of a Type 1 Grammar. The various types of languages considered in this lecture series are summarized in the following definition.

Definition

Let $ \cal {G}$ be a grammar.
  1. If $ \cal {G}$ is type 0, then its associated language L($ \cal {G}$) is called a type 0 language.
  2. $ \cal {G}$ is called type 1 if for all productions $ \alpha$ : : = $ \beta$

    length($\displaystyle \alpha$) $\displaystyle \leq$ length($\displaystyle \beta$).

    and its associated language a type 1 language.
  3. $ \cal {G}$ is called context free or type 2 and its associated language a type 2 language if for all productions $ \alpha$ : : = $ \beta$ the left-hand side $ \alpha$ consists of a single nonterminal symbol and the nullbar convention holds.
  4. $ \cal {G}$ is called regular or a type 3 and its assocated language a type 3 language if every rule of $ \cal {G}$ is left or right linear and the nullbar convention holds.

The following very important theorem is stated here with proof of the second to last strict inclusion. The strict inclusions before that have already been treated. Only the last is stated without proof:

Theorem 5..9   If Li denotes the class of type i languages (i = 0, 1, 2, 3), then

L3 $\displaystyle \not\subseteq$L2 $\displaystyle \not\subseteq$L1 $\displaystyle \not\subseteq$L0

Proof:

Consider the following context-dependant grammar:


S $ \Rightarrow$  aSBC
S $ \Rightarrow$  aBC
CB $ \Rightarrow$  BC
aB $ \Rightarrow$  ab
bB $ \Rightarrow$  bb
bC $ \Rightarrow$  bc
cC $ \Rightarrow$  cc


Then it is easy to derive the string abc:


S $ \Rightarrow$  aBC   $ \Rightarrow$  abC $ \Rightarrow$ abc


Similarly, one derives the string a2b2c2:


S $ \Rightarrow$  aSBC $ \Rightarrow$  a2BCBC  
  $ \Rightarrow$  a2B2C2 $ \Rightarrow$  a2bBC2  
  $ \Rightarrow$  a2b2C2 $ \Rightarrow$  a2b2cC $ \Rightarrow$ a2b2c2


It is then a routine application of mathematical induction to prove the general formula S $ \Rightarrow$  anbncn.

5.6 Application of PDA's: Parsing and Compilation

In this chapter the student has encountered the power of nondeterminism in algorithms in Computer Science. In this connection, however, practical issues such as time to completion have been relegated to the background. For example, given a context-free grammar one could always parse a string of input symbols using nondeterministic back-up methods that repeatedly attempt to produce a leftmost derivation for that string from the start symbol, or, failing that, reject the string after all possibilities have been exhausted. Indeed, the very first grammar-driven compilers did precisely that.

Most modern programming languages have syntaxes that allow predictive parsing, thus eliminating the time-wasting idiosyncracy of back-up. The associative PDA is thus deterministic and the parsing procedure is very efficient. In this section we apply the methods learned to build both top-down and bottom-up that function deterministically.

5.6.1 Determinsitic Top-Down Parsing and LL(1) Grammars

The goal of a top-down parse is to produce a leftmost derivation starting with the start symbol and ends up either having consumed the input and accepted the input string or determines that there is a conflict and produces an error message to that extent.

The basic idea is simple: given the current input symbol and the current non-terminal symbol, usually pushed onto the stack, consult a parsing table, or, in the jargon of automata theory, a finite state transition function (the current state is the current nonterminal symbol), and replace the symbol on the stack with the proper right-hand side to continue simulating the derivation. See Fig. 5.9 for an illustration of this process. If the top of the stack is a terminal symbol and matches the input, then just pop that symbol and continue with the rest of the input string. A simplified process is given in the following algorithm.

Figure 5.9: Top-Down Parser as a PDA
\includegraphics[]{topdown-parser}

Algorithm

  1. x = first input symbol
  2. Z = S
  3. 
    while there is input begin
    
    write Z = uAw for u $ \in$ T*, A $ \in$ X - T, w $ \in$ X*;
    if A $ \in$ X - T or the end marker $
    if Table[x, A] = A $ \Rightarrow$ v = v1v2...vn
    pop(A);
    push v on thestack with v1 on top;
    else ifA $ \in$ T
    pop(A);
    x = getnextsymbol();
    end

The main point here is that Table[x, A] will consist of exactly one element that allows the predictive parsing to proceed along unique, well-defined lines. It will easily be seen that there are context-free grammars for which this lookup technique fails. The expression grammar used throughout will serve the purpose at hand. Given an input, say an identifier, how do we know to reduce E to E + T, to T, etc.? Since `+' can appear arbitrarily far away from the current input position, this grammar as it stands will not allow of computing a parse table. This particular shortcoming, known as left recursion is easily excised from the grammar (make it right recursive):

E $ \Rightarrow$ TE'
E' $ \Rightarrow$ +TE'  |  -TE'  |  $ \lambda$
T $ \Rightarrow$ FT'
T' $ \Rightarrow$ *FT'  |  /FT'  |  $ \lambda$
F $ \Rightarrow$ (E)  |  id

Note the use of the $ \lambda$-reductions. Allows this particular grammar now works, there remain other general pitfalls that need to be mentioned. Consider the grammar

S $ \Rightarrow$ A  |  B
A $ \Rightarrow$ aA  |  a
B $ \Rightarrow$ aB  |  b

There is no way, given the input symbol a, to determine whether to pick the right-hand side A or B. Thus the predictive parsing algorithm will work only on grammars for which the first terminal symbol produced by the leftmost nonterminal is each right-hand side has to be uniquely determined. This symbol is denoted by first($ \alpha$A), where $ \alpha$ $ \in$ X and A $ \in$ X*. It is formally defined recursively as follows:

first($\displaystyle \alpha$A) = $\displaystyle \left\{\vphantom{\begin{array}{ll}
\{\alpha\} &\mbox{if}\; \alpha...
...pha_2) \ldots first(\alpha_n)
&\mbox{and}\; \alpha \in X-T
\end{array}}\right.$$\displaystyle \begin{array}{ll}
\{\alpha\} &\mbox{if}\; \alpha \in T \\
& \mb...
...irst(\alpha_2) \ldots first(\alpha_n)
&\mbox{and}\; \alpha \in X-T
\end{array}$

Thus it makes sense to require for the rules A $ \Rightarrow$ $ \alpha_{1}^{}$  |  $ \alpha_{2}^{}$  |  ...$ \alpha_{n}^{}$ that

first($\displaystyle \alpha_{i}^{}$) $\displaystyle \cap$ first($\displaystyle \alpha_{j}^{}$) = $\displaystyle \emptyset$ (5.4)

for all i and j with i $ \neq$ j. In the above example first(A) = first(B) = a. There is still one more subtle point: it could happen that $ \alpha$$ \overset{*}{\Rightarrow}$$ \lambda$ is nullable, in which case a nonterminal symbol following $ \alpha$ in some right-hand side could cause a conflict with first($ \alpha$). Thus we define

follow(A) = {a $\displaystyle \in$ T  |  S$\displaystyle \overset{*}{\Rightarrow}$uAw, a $\displaystyle \in$ first(w), w $\displaystyle \in$ X*}

and require for each A $ \in$ X - T

first(A) $\displaystyle \cap$ follow(A) = $\displaystyle \emptyset$. (5.5)

Definition

The context-free grammar G = (X, T, R, S) is LL(1) if equations 5.4 and 5.5 hold for all A $ \in$ X - T.


We shall presently see that LL(1) grammars allow the unqiue construction of parsing tables for predictive parsing.

Example

Consider the left-factored expression grammar above. Then we calculate

first(E) = {(, id}
first(T) = {(, id}
first(F) = {(, id}
first(E') = { + , - ,$ \lambda$}
first(T') = {*,/,$ \lambda$}
follow(E) = {),$}
follow(E') = {),$}
follow(T) = { + , - ,$}
follow(T') = { + , - ,$}
follow(F) = { + , - ,$,*,/,)}

For instance, from the rules E $ \Rightarrow$ TE', T $ \Rightarrow$ FT' and F $ \Rightarrow$ (E)  |  id, it is seen that only ( and id can begin a derivation from E alone. Furthermore follow(T') consists of the non-$ \lambda$ elements of first(E') and the end marker $. The other calculations are analogous.

Now we are in a position to define the parsing table, call it Table[*,*]:


\framebox{\begin{minipage}{11cm}
For each grammar rule $A \Rightarrow \alpha$ (...
...pha$.\\
This assignment is also to be carried out for $b = \$$.
\end{minipage}}


It should be intuitively clear that if the grammar G is LL(1) then the so-constructed parsing table will have at most one item for each entry (It may, of course, have none at all.)

This section will now be concluded with the parsing table for the expression grammar and a derivation of the legal input string (id + id )*id.

Input Symbol  
id + - * /
E E $ \Rightarrow$ TE'        
E'   E' $ \Rightarrow$ + TE' E' $ \Rightarrow$ - TE'    
T T $ \Rightarrow$ FT'        
T'       T' $ \Rightarrow$ *FT' T' $ \Rightarrow$ /TF'
F F $ \Rightarrow$ id        


Input Symbol
( ) $
E E $ \Rightarrow$ TE'    
E'   E' $ \Rightarrow$ $ \lambda$ E' $ \Rightarrow$ $ \lambda$
T T $ \Rightarrow$ FT'    
T'   T' $ \Rightarrow$ $ \lambda$ T' $ \Rightarrow$ $ \lambda$
F F $ \Rightarrow$ (E)    


For instance, since id $ \in$ first(T) add E $ \Rightarrow$ TE' to Table[id, E]. Also ( $ \in$ first(T), so consequently add A $ \Rightarrow$ TE' to Table[), E]. Finally the rule E' $ \Rightarrow$ $ \lambda$ means that it must be added to Table[), E'] and Table[$, E']. The other entries are left to the reader.

Finally we adduce the parse sequence for the expression (id + id )*id. Note the end marker is pushed as initial stack symbol onto the stack. Then when the expression has been processed, the end marker `$' of the input string is matched against the initial stack symbol and the string is accepted. Note that there are really only two mutually exclusive things to be done, depending on the symbol on the top of the stack: if that symbol is a terminal symbol, then match it to the output; if there is no match then there is an error. If the symbol is a nonterminal symbol, then look up the entry in the parser table corresponding to that symbol and to the current input symbol. See Table 5.2.


Table 5.2: Predictive Parse of the expression (id + id )*id
Input String Stack Remarks  
( id + id ) * id $ E$ Beginning Configuration  
$ \uparrow$               $ \uparrow$    
( id + id ) * id $ TE'$ Consult Table[(, E]  
$ \uparrow$               $ \uparrow$    
( id + id ) * id $ FT'E'$ Consult Table[(, T]  
$ \uparrow$               $ \uparrow$    
( id + id ) * id $ (E)T'E'$ Consult Table[(, F]  
$ \uparrow$               $ \uparrow$    
( id + id ) * id $ E)T'E'$ Match ('s  
  $ \uparrow$             $ \uparrow$    
( id + id ) * id $ TE')T'E'E$ Consult Table[id, E]  
  $ \uparrow$             $ \uparrow$    
( id + id ) * id $ FT'E')T'E'$ Consult Table[id, T]  
  $ \uparrow$             $ \uparrow$    
( id + id ) * id $ idT'E')T'E'$ Consult Table[id, F]  
  $ \uparrow$             $ \uparrow$    
( id + id ) * id $ T'E')T'E'$ Match id's  
    $ \uparrow$           $ \uparrow$    
( id + id ) * id $ E')T'E'$ Consult Table[+ , T']  
    $ \uparrow$           $ \uparrow$    
( id + id ) * id $ + TE')T'E'$ Match +'s  
      $ \uparrow$         $ \uparrow$    
( id + id ) * id $ FT'E')T'E'$ Consult Table[id, T]  
      $ \uparrow$         $ \uparrow$    
( id + id ) * id $ idT'E')T'E'$ Consult Table[id, F]  
      $ \uparrow$         $ \uparrow$    
( id + id ) * id $ T'E')T'E'$ Match id's  
        $ \uparrow$       $ \uparrow$    
( id + id ) * id $ )T'E'$ Consult Table[), T]  
        $ \uparrow$       $ \uparrow$    
( id + id ) * id $ T'E'$ Match )'s  
          $ \uparrow$     $ \uparrow$    
( id + id ) * id $ *FT'E'$ Consult Table[*, T']  
          $ \uparrow$     $ \uparrow$    
( id + id ) * id $ FT'E'$ Match *'s  
            $ \uparrow$   $ \uparrow$    
( id + id ) * id $ idT'E'$ Consult Table[id, F]  
            $ \uparrow$   $ \uparrow$    
( id + id ) * id $ T'E'$ Match id's  
            $ \uparrow$   $ \uparrow$    
( id + id ) * id $ E'$ Consult Table[$, T']  
              $ \uparrow$ $ \uparrow$    
( id + id ) * id $ $ Consult Table[$, E']  
              $ \uparrow$ $ \uparrow$    


5.6.2 Predictive Bottom-Up Parsing and LR Grammars

In this section assume that all grammars do not have a recursive start symbol S. If one does then introduce a new rule S' $ \Rightarrow$ S and make S' the new start symbol. Bottom up methods are a little more involved in the theory of parsing. Consider, for example, the right derivation of the expression a + b*c:
E $\displaystyle \Rightarrow$   E + T $\displaystyle \Rightarrow$ E + T*F $\displaystyle \Rightarrow$ E + T*id $\displaystyle \Rightarrow$ E + T*c $\displaystyle \Rightarrow$ E + F*c $\displaystyle \Rightarrow$ E + id*c  
    $\displaystyle \Rightarrow$ E + b*c $\displaystyle \Rightarrow$ T + b*c $\displaystyle \Rightarrow$ F + b*c $\displaystyle \Rightarrow$ id + b*c $\displaystyle \Rightarrow$ a + b*c.  

At each stage of the derivation the sentential form of the stage is of the form uv, where u $ \in$ X* and v $ \in$ T*. Tracing this derivation backwards, now proceed as follows: Starting from the leftmost input symbol reduce that symbol to a rule for which it is the right-hand side, in this case id $ \Rightarrow$ a. Then reduce id to F, etc. until an E has been produced. All of the previous symbols are handles or right-hand sides of rules that allow successful (in the sense that the start symbol will eventually be produced). After E has been obtained, the next input symbol `+' is kept, or better, appended to E. Thus the sentential form `E +' is produced. This sentential form is called a viable prefix because there is a rule of the form E $ \Rightarrow$ E + T (a trivial one). If it recognized that E + is a viable prefix, then, starting with the next input symbol, continue this process from that point onwards until the rest of the right-hand side has been produced, i.e. a handle has been found. Then reduce this handle to the left-hand side of the ``correct'' rule until the start symbol alone has been produced. This process can be nicely realized using a push-down automaton. Thus, proceeding from left to right on the input string, shift or push one or more input symbols onto the stack until a handle is found. The reduce or pop that handle from the stack and push the left-hand side of the associated rule onto the stack. On a successful parse, if no reduction is presently forthcoming then the contents of the stack constitute a viable prefix for some rule yet to be determined. Another way of saying the same thing is that the contents of the stack, read from bottom up, are the prefix of a sentential form produced on the way back to the start symbol during a rightmost derivation.

A correct parse of the string a + b*c as a sequence of shift/reduce actions is given in Table 5.3. Notice the decision to handle multiplication before addition is governed by ``looking ahead'' one symbol.

Table 5.3: Predictive Parse of the expression a + b*c
Stack Input Action
$ a + b*c$ Shift
id$ + b*c$ Reduce
F$ + b*c$ Reduce
T$ + b*c$ Reduce
E$ + b*c$ Reduce
+ E$ b*c$ Shift
b + E$ *c$ Shift
id + E$ *c$ Reduce
F + E$ *c$ Reduce
T + E$ *c$ Reduce
*T + E$ c$ Shift
c*T + E$ $ Reduce
id*T + E$ $ Reduce
F*T + E$ $ Reduce
T + E$ $ Reduce
E$ $ Accept


Note that this PDA, whose states would consist of all prefixes of all right-hand sides, is nondeterministic and therefore impractical to implement. The rest of this section is devoted to replacing this PDA by a deterministic one.

5.6.2.1 LR(0) Grammars

Definition

Let G = (X, T, R, S) be a context-free grammar. The the string uw ( u, w $ \in$ X*) is an LR(o)-Context of the rule A $ \Rightarrow$ w if

S$\displaystyle \overset{*}{\Rightarrow}$uAv $\displaystyle \Rightarrow$ uwv,

for some v $ \in$ T*.

Note that uw in an LR(0)-Context of the rule A $ \Rightarrow$ w if it is possible to reduce w directly to A during a rightmost parse. v is just that part of the candidate string already produced.


Example

Consider the Grammar G = (X, T, R, S) with X = {S, A, B, a, b}, T = {a, b} and productions
S $\displaystyle \Rightarrow$ AB  
A $\displaystyle \Rightarrow$ Aa  
A $\displaystyle \Rightarrow$ a  
B $\displaystyle \Rightarrow$ bBa  
B $\displaystyle \Rightarrow$ ba  

Then L(G) = {anbmam  |  m, n $ \geq$ 1}. It is readily seen that
  1. LR(0)-Context( S   $ \Rightarrow$  AB). Note that for the start symbol the LR(0)-Context of the rule(s) with S on the left is always just the corresponding right-hand side.
  2. To calculate the LR(0)-Context of the rule A $ \Rightarrow$ Aa we must examine all rightmost derivations for which this rule is applied at the very last step:
    S $\displaystyle \Rightarrow$ AB $\displaystyle \overset{*}{\Rightarrow}$Abmam $\displaystyle \Rightarrow$ Aabmam.  

    Thus, with u = $ \lambda$, w - Aa, we have

    LR(0)-Context(A $\displaystyle \Rightarrow$ Aa) = {aA}.

  3. It is almost trivial that

    LR(0)-Context(A $\displaystyle \Rightarrow$ a) = {a},

    because this is the last symbol generated by any rightmost parse.
  4. Again, to calculate the LR(0)-Context of the rule B $ \Rightarrow$ bBa consider all rightmost derivations that terminate in an application of this rule:
    S $\displaystyle \Rightarrow$ AB $\displaystyle \Rightarrow$ AbBa $\displaystyle \overset{*}{\Rightarrow}${AbmBa  |  m $\displaystyle \geq$ 1},  

    and thus

    LR(0)-Context(B $\displaystyle \Rightarrow$ bBa) = Abb*Ba.

  5. To calculate the LR(0)-Context of rule B $ \Rightarrow$ ba, note that it terminates all rightmost derivations starting from B:
    S $\displaystyle \Rightarrow$ AB $\displaystyle \Rightarrow$ Aba  
    S $\displaystyle \Rightarrow$ AB $\displaystyle \Rightarrow$ AbBa  
      $\displaystyle \overset{*}{\Rightarrow}$AbmBam $\displaystyle \Rightarrow$ Abmbaam,  

    and so

    LR(0)-Context(B $\displaystyle \Rightarrow$ ba) = Ab*ba.

A point worth mentioning is that even if the actual target strings that are derivable from the start symbol are in general context-free, the LR(0)-Context is always describable with a regular expression. If we knew the LR(0) contexts, then it is not hard to construct a bottom-up parse of language strings, as the following example for the above grammar G and the target string aaabbaa shows:


Sentential Form LR(0)-Context Rule Action
$ \underset{\uparrow}{a}$aabbaa a A $ \Rightarrow$ a reduce
$ \underset{\uparrow}{A}$aabbaa Aa A $ \Rightarrow$ Aa reduce
$ \underset{\uparrow}{A}$abbaa Aa A $ \Rightarrow$ Aa reduce
$ \underset{\uparrow}{A}$bbaa - - shift
A$ \underset{\uparrow}{b}$baa - - shift
Ab$ \underset{\uparrow}{b}$aa - - shift
Abb$ \underset{\uparrow}{a}$a ba B $ \Rightarrow$ Ba reduce
Ab$ \underset{\uparrow}{B}$a abBA B $ \Rightarrow$ bBa reduce
A$ \underset{\uparrow}{B}$ AB S $ \Rightarrow$ AB reduce
$ \underset{\uparrow}{S}$ - - accept


Thus, using LR(0)-Contexts results in a deterministic bottom-up parse, because as soon as uw is seen, w is reduced by the rule A $ \Rightarrow$ w. The main obstacle to implementation is the calculation of the regular expression that determines the decision to reduce; for it can become arbitarily large before it can be distinguished (if at all - see the next definition) from other potential LR(0)-Contexts. Thus the fourth and fifth rules of G above give rise to two contexts that are distinguishable only after the second-to-last character has been read! For this reason, a slightly different approach is taken, namely a DFA will be constructed that determines whether the derivation up to the current point is actually a viable prefix.

Definition

The context-free Grammar G = (X, T, R, S) is LR(0) if for every u $ \in$ LR(0) - Context(A $ \Rightarrow$ $ \alpha$) and uw $ \in$ LR(0) - Context(B $ \Rightarrow$ $ \beta$), where A $ \Rightarrow$ $ \alpha$, B $ \Rightarrow$ $ \beta$ $ \in$ R, u $ \in$ X* and w $ \in$ T*, then w = $ \lambda$, A = B and $ \alpha$ = $ \beta$.


Thus it is not possible at some stage to reduce $ \alpha$ to A but also to forgo a reduction, shift more symbols and then reduce. This condition also implies that $ \alpha$ is not an acceptable handle for 2 different rules.

Unfortunately, the expression grammar is not LR(0) because of the ambiguity T $ \in$ LR(0)-Context(E $ \Rightarrow$ T) and Ta $ \in$ LR(0)-Context(id $ \Rightarrow$ a). Note the need for a 1 symbol lookahead to decide whether to reduce or not. The grammar G, on the other hand, is LR(0).

In the following algorithm assume the function getsym(x) returns the next symbol from the string x, and the function shift(v) pushes the terminal symbol v onto the stack.

Algorithm

Deterministic Parser for LR(0) Grammars
  1. x = input string
  2. u = $ \lambda$, v = getsym(x)
  3. 
    while (u $ \neq$ S) begin
    
    if u = xw $ \in$ LR(0) - Context(A $ \Rightarrow$ w) reduce: u = xA
    else if u is a viable prefix and v $ \neq$ $ \lambda$
    shift( v = getsym(x));
    else error();
    end


We turn now to the problem of finding a DFA to determine viable prefixes. Quite clearly, its states will contain this information.


Definition

Let G = (X, T, R, S) be a context-free grammar. If A $ \Rightarrow$ $ \alpha$$ \beta$ $ \in$ R, then the LR(0)-items of this production are the marked rules A $ \Rightarrow$ .$ \alpha$$ \beta$, A $ \Rightarrow$ $ \alpha$.$ \beta$ and A $ \Rightarrow$ $ \alpha$$ \beta$.. The last item, for reasons that will soon become clear, is called a complete item. The production A $ \Rightarrow$ $ \lambda$ has only the one item A $ \Rightarrow$ .. The LR(0)-items of G is the union of the set of LR(0)-items over all productions of G.


Thus, an LR(0)-item consists of a production together with a marker to indicate how much of the right- hand side has been produced up to the current point of a rightmost derivation.

How now is a viable prefix recognized? A moment's reflection reveals that the top symbols of the stack constitute a sentential form Y that corresponds to some rule A $ \Rightarrow$ $ \alpha$.YZ ( $ \alpha$ $ \in$ X*), if a reduction is to take place, or, in other words, the marker gets moved over. If a reduction is not possible, then the next step consists of adding i.e. pushing input symbols until a reduction is in fact possible, which is just another way of saying that a viable prefix is on the stack at the moment. Thus, the states of the proposed automaton are precisely the LR(0)-items of the grammar G5.2 and the transaction function ``moves the marker over.''


Definition

Given the LR(0)-grammar G = (X, T, P, S) define the NFA $ \cal {A}$$\scriptstyle \prime$(G) of G to be the 5-tuple

$\displaystyle \cal {A}$$\scriptstyle \prime$ = (T, Z, zA, R, ZF = Z),

where Z is the set of LR(0)-items augmented by an initial state zA and a relationR defined as follows:
  1. (($ \lambda$, zA),(S $ \Rightarrow$ .A)) $ \in$ R for all S $ \Rightarrow$ A $ \in$ P,
  2. ((a, A $ \Rightarrow$ u.av),(A $ \Rightarrow$ ua.v)) for a $ \in$ T,
  3. ((B, A $ \Rightarrow$ u.Bv),(A $ \Rightarrow$ uB.v)) $ \in$ R for B $ \in$ X - T,
  4. (($ \lambda$, A $ \Rightarrow$ u.Bv),(B $ \Rightarrow$ .A)) $ \in$ R for all B $ \Rightarrow$ A $ \in$ P.
Notice that rules 1 and 4 render $ \cal {A}$$\scriptstyle \prime$ nondeterministic and that every state is an acceptance state. The symbol u is the viable prefix that is currently being processed.

Example

Consider, again, the grammar G of this section. $ \cal {A}$$\scriptstyle \prime$(G) is easily constructed as in Fig. 5.10.

Figure 5.10: The non-deterministic $ \cal {A}$$\scriptstyle \prime$(G) for the Grammar G
\includegraphics[width=12cm]{grammar}

We have already learned how to convert the NFA $ \cal {A}$$\scriptstyle \prime$(G) to the DFA $ \cal {A}$(G) in Chapter3. The result of this construction is to be found in Fig. 5.11. Note that to keep clarity the arrows to the empty state, i.e. the error state, have not been drawn. This DFA is the basis upon which all LR parsing tables are constructed. Instead of constructing any of them, it is perhaps most instructive to carry out the parsing algorithm directly by using $ \cal {A}$(G). The strategy is quite simple: We employ the stack to store the current symbol of the processed sentential form together with the state of $ \cal {A}$ in which it was adduced. If the state is that of a complete LR(0)-item then reduce using that rule. Thus we must pop as many items from the stack as there are symbols in the right-hand side of the rule. The new state is then the successor of the state on top of the stack under the single symbol from the left-hand side of the rule. This is the pushed together with that symbol. If the state is not a complete LR(0)-item, then there is more to be seen until a reduction can be executed, i.e. the stack contains a viable prefix and the next input symbols must be pushed, together with the state into which $ \cal {A}$ goes upon that symbol and the current state. If, at any stage, there is no transition function for the current state-symbol configuration, then an error has occurred. This strategy is summarized in the following algorithm:

Algorithm

Deterministic Parser for LR(0) Grammars G using $ \cal {A}$(G)
  1. x = input string
  2. v = getsym(x)
  3. state = [zA]
  4. 
    while (
    TOS $ \neq$ S) begin
    
    if state contains a complete LR(0)-item A $ \Rightarrow$ $ \alpha$ {
    pop length($ \alpha$) symbols
    tempstate = TOS
    push the pair (goto(A, tempstate), A)
    }
    else if state is not error state {
    state = goto(v, state)
    push (v, state)
    v=getsym(x);
    }
    else error();
    end

In the above algorithm TOS has been used interchangeably for the entire contents of the top of the stack or the state number or the symbol stored there.

The action of this algorithm is then demonstrated in the following table. Note that every reduction traverses the automaton backwards according to the symbols in the sentential form on the right-hand side.


Stack Input String Current State Action
(1,$) $ \underset{\uparrow}{a}$aabbaa$ 1 shift a
(2, 1)(1,$) $ \underset{\uparrow}{a}$abbaa$ 2 reduce A $ \Rightarrow$ a
(3, A)(1,$) $ \underset{\uparrow}{a}$abbaa$ 3 shift a
(5, a)(3, A)(1,$) $ \underset{\uparrow}{a}$bbaa$ 5 reduce A $ \Rightarrow$ Aa
(3, A)(1$,) $ \underset{\uparrow}{a}$bbaa$ 3 shift a
(5, a)(3, A)(1,$) $ \underset{\uparrow}{b}$baa$ 5 reduce A $ \Rightarrow$ Aa
(3, A)(1,$) $ \underset{\uparrow}{b}$baa$ 3 shift b
(6, b)(3, A)(1,$) $ \underset{\uparrow}{b}$aa$ 6 shift b
(6, b)(6, b)(3, A)(1,$) $ \underset{\uparrow}{a}$a$ 6 shift a
(7, a)(6, b)(6, b)(3, A)(1,$) $ \underset{\uparrow}a$$ 7 reduce B $ \Rightarrow$ ba
(8, B)(6, b)(3, A)(1,$) $ \underset{\uparrow}{a}$$ 8 shift a
(9, a)(8, B)(6, b)(3, A)(1,$) $ \underset{\uparrow}{\$}$ 9 reduce B $ \Rightarrow$ bBa
(4, B)(3, A)(1,$) $ \underset{\uparrow}{\$}$ 4 reduce S $ \Rightarrow$ AB
S     accept


Figure 5.11: The Deterministic Automat $ \cal {A}$(G) for Grammar G
\includegraphics[]{grammar2}

5.6.2.2 LR(1) Grammars

Although the parsing algorithm from the preceding section will always work correctly if the associated DFA $ \cal {A}$(G) is correct. As already observed, even the automat associated with a simple expression is not correct. The problem is that it is always necessary to determine the next input symbol in order to make a decision about the next step.

Definition

The string uwz is in the LR(1)-Context of the rule A $ \Rightarrow$  w if there is a derivation

S$\displaystyle \overset{*}{\Rightarrow}$  uAv   $\displaystyle \Rightarrow$  uwv

for v = zv' or v = $ \lambda$.


Note that z $ \in$ T or z = $ \lambda$.


Considering the expression once again, if an LR(0) parser, upon reading an id, has reduced a factor to a term it is not clear how to proceed. It could

  1. reduce E   $ \Rightarrow$  T, or
  2. shift T to obtain T*F.
If the next symbol were $ \lambda$, then 1) would be executed and processing stops. If it weren't, then 1) would be executed and the next symbol read. If it were * then 2) is carried out. It all depends on what the next input symbol is.


Definition

If G = (X, T, R, S) is a context-free grammar then the LR(1)-items of G have the form


[A   $ \Rightarrow$  .$ \alpha$$ \beta$,  {z1,..., zn}]
[A   $ \Rightarrow$  $ \alpha$.$ \beta$,  {z1,..., zn}]
[A   $ \Rightarrow$  $ \alpha$$ \beta$,  {z1,..., zn}]


for zi $ \in$ T $ \cup$ {$ \lambda$}. The set {z1,..., zn} is called the lookahead set of the LR(1)-item. These are precisely those z that appear in a LR(1)-context. In other words, the terminal symbol is in the lookahead set of A   $ \Rightarrow$ $ \alpha$$ \beta$ if it follows $ \alpha$$ \beta$ in some rightmost derivation:

S  $\displaystyle \overset{*}{\Rightarrow}$  uAzv   $\displaystyle \Rightarrow$  u$\displaystyle \alpha$$\displaystyle \beta$zv.

It is now relatively straightforward to construct the nondeterministic stack machine $ \cal {A}$$\scriptstyle \prime$(G) as before. The states z consist of those LR(1)-items I whose lookahead sets aid in the computation of a successor state. The transition relation as before ``moves the marker over.''

  1. (($ \lambda$, zA),[S $ \Rightarrow$ .A,{$ \lambda$}]) $ \in$ R for all S $ \Rightarrow$ A $ \in$ P,
  2. ((a,[A $ \Rightarrow$ u.av,{z1,..., zn}]),[A $ \Rightarrow$ ua.v,{z1,..., zn}]) for a $ \in$ T,
  3. ((B,[A $ \Rightarrow$ u.Bv,{z1,..., zn}]),[A $ \Rightarrow$ uB.v,{z1,..., zn}]) $ \in$ R for B $ \in$ X - T,
  4. (($ \lambda$,[A $ \Rightarrow$ u.Bv,{z1,..., zn}]),[B $ \Rightarrow$ .$ \alpha$,{y1,..., ym}]) $ \in$ R for all B $ \Rightarrow$ $ \alpha$ $ \in$ P and yi $ \in$ first(vzk) for some k.

Note the transition rule 4. It forms the $ \lambda$-closure of the sets of items, thus allowing an instant calculation of the determinsitic pushdown automat $ \cal {A}$(G) without bothering too much about the nondeterminsitic version: if the current marker position has a nonterminal symbol B followed by some v $ \in$ X*, then for each rule of the form B   $ \Rightarrow$  $ \alpha$ and for each terminal symbol y $ \in$ first(vzk) for which [B   $ \Rightarrow$  .alpha, y] does not satisfy condition 4 conjoin the LR(1) item to the relation R. This closure operation ensures that the lookahead set is progated correctly down the tree.

Example

Consider again the augmented expression grammar


S   $ \Rightarrow$  E
E   $ \Rightarrow$  E + T  |  T
T   $ \Rightarrow$  T*F  |  F
F   $ \Rightarrow$  (E)  |  id


Starting with S calculate the LR(1)-items. For the first production S   $ \Rightarrow$  E the lookahead set is always {$ \lambda$}, thus begin with the item

[S   $\displaystyle \Rightarrow$  .E,{$\displaystyle \lambda$}]

and then compute the $ \lambda$-closure of it. Thus for the rules E   $ \Rightarrow$  E + T and E $ \Rightarrow$ T, add the LR(1)-items [E   $ \Rightarrow$  .E + T,{$ \lambda$}] and [E $ \Rightarrow$  .T,{$ \lambda$}]. This then leads to [T   $ \Rightarrow$  .T*F,{$ \lambda$}], [T   $ \Rightarrow$  .F,{$ \lambda$}], [F   $ \Rightarrow$  .(E),{$ \lambda$}] and finally [F   $ \Rightarrow$  .id,{$ \lambda$}]. But now care must be taken with the rule E   $ \Rightarrow$  .E + T; for E is followed by a `+', and hence [E   $ \Rightarrow$  .E + T,{ + }] must be conjoined to the previous LR(1)-items. But then the same applies to the rule T   $ \Rightarrow$  .T*F; we must add [T   $ \Rightarrow$  .T*F,{*}] to this set of items. There are no more new additions. Thus, collecting these items into a single state of items yields the first state of $ \cal {A}$(G):


  [S   $ \Rightarrow$  .E,{$ \lambda$}] [F   $ \Rightarrow$  .(E),{$ \lambda$}
  [E   $ \Rightarrow$  .E + T,{ + ,$ \lambda$}] [F   $ \Rightarrow$  .id,{$ \lambda$}
I1: [E   $ \Rightarrow$  .T,{$ \lambda$}] [T   $ \Rightarrow$  .T*F,{*,$ \lambda$}
  [T   $ \Rightarrow$  .F,{$ \lambda$}]  


Now apply the transition relation to items in I1 for A $ \in$ X* to generate the next states. Remember, the transition relation just moves the marker over. We will denote this with goto(A, I1): for A = E we have

goto(E, I1) = E   $\displaystyle \Rightarrow$  E. + T, and S   $\displaystyle \Rightarrow$  E..

Now compute the $ \lambda$-closure of these two items to obtain the successor of I1 under E. Since there is no nonterminal symbol following the dot, there is no change. Thus:


I2: [S   $ \Rightarrow$  E.,{$ \lambda$}], [E   $ \Rightarrow$  E. + T,{ + ,$ \lambda$}


Plugging in A = T results in a similar set:


I3: [E   $ \Rightarrow$  T.,{$ \lambda$}], [T   $ \Rightarrow$  T.*F,{*,$ \lambda$}


For A = F we only obtain


I4: [T   $ \Rightarrow$  F.,{$ \lambda$}]


Proceeding in this manner yields 11 states. The final result is given in the following transition table that represents the central LR-parsing table. The reader is strongly encouraged to carry out the calculations independentally:


State Symbol Successor
I1 E I2
I1 T I3
I1 F I4
I1 ( I5
I1 id I6
I2 + I7
I3 * I8
I5 E I9
I5 T I3
I5 F I4
I5 id I6
I7 T I10
I7 F I4
I7 ( I5
I7 id I5
I8 F I11
I8 ( I5
I8 id I6
I9 ) I12
I9 + I7
I10 * I8


This automaton can also be used in the main LR-parsing algorithm: Upon encountering a complete item [A  $ \Rightarrow$  $ \alpha$.,{z1,...zn}] and the current lookahead symbol is one of the zi, then reduce as in the algorithm. If the state is a legal state then shift and obtain a new lookahead symbol.

Prof. Raymond Zavodnik, July 2002