
+++
title = "Lecture 6"
template = "page-math.html"
+++

# Normalising (contd)

## Removal of unit production rules

Unit production rule: A → B, where B is a variable.

Steps:

1.  Remove all λ-productions
2.  Determine all pairs of different variables A, B with $A \rightarrow^+ B$
3.  Whenever $A \rightarrow^+ B$ and B → y is a rule where y is not a single variable, add the new rule A → y.
4.  Remove all unit production rules
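
A minimal Python sketch of steps 2 to 4, assuming step 1 is already done. The encoding is made up for illustration and not from the lecture: every symbol is a single character, uppercase letters are variables, lowercase letters are terminals, and every variable has an entry in the dictionary. The name `remove_unit_productions` is hypothetical too.

```python
def remove_unit_productions(rules):
    """rules: dict mapping each variable to a set of right-hand sides (strings).
    Assumes lambda-productions are already gone (step 1)."""
    def is_unit(rhs):
        return len(rhs) == 1 and rhs.isupper()

    # Step 2: reach[A] = every B with A ->+ B using unit rules only.
    reach = {a: {rhs for rhs in rules[a] if is_unit(rhs)} for a in rules}
    changed = True
    while changed:
        changed = False
        for a in rules:
            for b in list(reach[a]):
                new = reach[b] - reach[a]
                if new:
                    reach[a] |= new
                    changed = True

    # Steps 3 and 4: A gets the non-unit rules of every such B,
    # and the unit rules themselves are left out.
    result = {}
    for a in rules:
        result[a] = {rhs for rhs in rules[a] if not is_unit(rhs)}
        for b in reach[a]:
            result[a] |= {rhs for rhs in rules[b] if not is_unit(rhs)}
    return result


# S -> A | aS,  A -> B | b,  B -> c
grammar = {"S": {"A", "aS"}, "A": {"B", "b"}, "B": {"c"}}
print(remove_unit_productions(grammar))
# S ends up with aS, b and c; A with b and c; B keeps c.
```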

## Chomsky normal form

A grammar is in Chomsky normal form when all rules have the form A → BC or A → a, i.e. the RHS is either two
variables, or a single terminal.

Steps:

1.  Remove all λ-productions
2.  Remove all unit production rules
3.  For every terminal a:
    1.  add: variable C<sub>a</sub>, rule C<sub>a</sub> → a.
    2.  in any rule whose RHS has length ≥ 2, replace the terminal a with
        C<sub>a</sub>
4.  Split all rules so that they have a maximum of 2 variables on the
    RHS, by adding new rules and variables. Example:
    -   start with one rule, {A → BCDE}.
    -   split, introduce variable X<sub>1</sub>: {A → BX<sub>1</sub>, X<sub>1</sub> → CDE}
    -   split, introduce variable X<sub>2</sub>: {A → BX<sub>1</sub>, X<sub>1</sub> → CX<sub>2</sub>, X<sub>2</sub> → DE}
    -   no more splits needed, as every rule has max 2 variables on the
        RHS.
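
Steps 3 and 4 can be sketched in Python as below. This is an illustration under assumptions, not the lecture's procedure verbatim: rules are `(lhs, rhs)` pairs with `rhs` a tuple of symbol names, the set of variables is passed in explicitly, and the generated names `C_a` and `X1`, `X2`, ... are assumed not to clash with existing variables. The function name is hypothetical.

```python
def cnf_steps_3_and_4(rules, variables):
    """rules: list of (lhs, rhs), rhs a tuple of symbols.
    variables: set of variable names; everything else is a terminal.
    Assumes steps 1 and 2 (no lambda or unit productions) are done."""
    result, terminal_var, counter = [], {}, 0
    for lhs, rhs in rules:
        if len(rhs) >= 2:
            # Step 3: replace each terminal a by its variable C_a.
            for s in rhs:
                if s not in variables:
                    terminal_var.setdefault(s, "C_" + s)
            rhs = tuple(terminal_var.get(s, s) for s in rhs)
        # Step 4: peel off the first symbol until only two are left.
        while len(rhs) > 2:
            counter += 1
            fresh = f"X{counter}"
            result.append((lhs, (rhs[0], fresh)))
            lhs, rhs = fresh, rhs[1:]
        result.append((lhs, rhs))
    # The rules C_a -> a introduced by step 3.
    result += [(var, (t,)) for t, var in terminal_var.items()]
    return result


# The example from the notes: A -> BCDE
print(cnf_steps_3_and_4([("A", ("B", "C", "D", "E"))], {"A", "B", "C", "D", "E"}))
# [('A', ('B', 'X1')), ('X1', ('C', 'X2')), ('X2', ('D', 'E'))]
```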

## Removing useless variables

Why? It simplifies the grammar, sometimes by a ton.

A variable is:

-   useless if it does not occur in any derivation of a word from the
    start symbol. If you remove production rules with useless
    variables, the language doesn't change.
-   productive if it derives some word consisting only of terminals. If it's not productive,
    it's useless (just like your average student)

Steps:

1.  Determine the productive variables: if A → y is a rule, and all
    variables in y are productive (which includes the case where y has no variables at all), then A is productive.
2.  Remove rules containing a non-productive variable.
3.  Determine reachable variables:
    -   start symbol is always reachable
    -   if A → y, and A is reachable, then so are all variables in y.
4.  Remove rules containing an unreachable variable.
5.  Any variable from the original grammar that doesn't show up in the
    remaining rules is useless.

So basically, evict the unproductive and useless. Don't quote me on
that.
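
The steps above can be sketched in Python. The encoding is an assumption for illustration: rules are `(lhs, rhs)` pairs, `rhs` is a string of single-character symbols, and uppercase letters are variables. The example grammar is invented to show why step 2 has to come before step 3.

```python
def remove_useless(rules, start):
    """rules: list of (lhs, rhs) pairs; rhs is a string, uppercase = variable."""
    def variables_in(rhs):
        return {s for s in rhs if s.isupper()}

    # Step 1: productive variables, repeated until nothing new is found.
    productive = set()
    changed = True
    while changed:
        changed = False
        for lhs, rhs in rules:
            if lhs not in productive and variables_in(rhs) <= productive:
                productive.add(lhs)
                changed = True

    # Step 2: drop rules that mention a non-productive variable.
    rules = [(l, r) for l, r in rules
             if l in productive and variables_in(r) <= productive]

    # Step 3: reachable variables, starting from the start symbol.
    reachable = {start}
    changed = True
    while changed:
        changed = False
        for lhs, rhs in rules:
            if lhs in reachable and not variables_in(rhs) <= reachable:
                reachable |= variables_in(rhs)
                changed = True

    # Step 4: a rule with a reachable LHS only mentions reachable variables,
    # so filtering on the LHS is enough.
    return [(l, r) for l, r in rules if l in reachable]


# S -> AB | a,  A -> a,  B -> Bb,  C -> c
rules = [("S", "AB"), ("S", "a"), ("A", "a"), ("B", "Bb"), ("C", "c")]
print(remove_useless(rules, "S"))  # [('S', 'a')]
```

B is not productive, so S → AB goes away in step 2. Only then does A turn out to be unreachable, which is why the order of the steps matters.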

## Erasable variables

A is erasable if you can somehow derive λ from it.

-   if A → λ, A is erasable
-   if A → B<sub>1</sub>...B<sub>n</sub>, and B<sub>1</sub>...B<sub>n</sub> are erasable, then so is A.
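
Both bullet points fit into one loop if λ is written as the empty string, since "all symbols of an empty RHS are erasable" is trivially true. A small sketch, assuming the same made-up encoding of `(lhs, rhs)` pairs with single-character symbols:

```python
def erasable_variables(rules):
    """rules: list of (lhs, rhs) pairs; rhs is a string, "" stands for lambda."""
    erasable = set()
    changed = True
    while changed:
        changed = False
        for lhs, rhs in rules:
            # Terminals are never in `erasable`, so a RHS containing one fails here.
            if lhs not in erasable and all(s in erasable for s in rhs):
                erasable.add(lhs)
                changed = True
    return erasable


# S -> AB,  A -> lambda,  B -> A | b
print(erasable_variables([("S", "AB"), ("A", ""), ("B", "A"), ("B", "b")]))
# all three variables S, A and B are erasable
```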

# Parsing
Parsing: the search for a derivation tree for a given word.

For CFGs, parsing is possible in O(\|w\|<sup>3</sup>) time, where \|w\| is the length
of the input word.

## Bottom-up parsing (right-to-left)

Start from the input word, try to construct the starting variable S. Applies
rules backwards.

The CYK (Cocke-Younger-Kasami) algorithm does bottom-up parsing for
grammars in Chomsky normal form. It determines whether a non-empty word
w is in L(G) (i.e., is generated by the grammar).

Steps:

1.  Take grammar G in Chomsky normal form. Hopefully someone will be
    nice enough to give you that; if not, you'll have to normalize it
    yourself.
2.  Compute sets V<sub>u</sub> of variables from which you can derive u, where u is a
    _contiguous_ subword of w.
    -   If u is a letter, then V<sub>u</sub> is the set of variables A with a rule A → u.
    -   If u is multiple letters, then V<sub>u</sub> is the set of all variables A such that:
        -   u = u<sub>1</sub>u<sub>2</sub> with u<sub>1</sub> and u<sub>2</sub> being some nonempty _words_ (potentially multiple letters)
        -   A → BC is a production in the grammar, with B deriving u<sub>1</sub> and C deriving u<sub>2</sub>
3.  If the starting variable is in V<sub>w</sub>, then the grammar generates w.
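
A compact Python sketch of these steps. Instead of indexing the sets by the subword u, it indexes them by the positions `(i, j)` of `word[i:j]`, which is the usual way to implement it. The rule encoding, the function name and the example grammar (for the language a<sup>n</sup>b<sup>n</sup>, n ≥ 1) are assumptions for illustration.

```python
def cyk(word, terminal_rules, pair_rules, start="S"):
    """terminal_rules: list of (A, a) for rules A -> a.
    pair_rules: list of (A, B, C) for rules A -> BC."""
    n = len(word)
    V = {}  # V[(i, j)] = variables that derive word[i:j]
    for i in range(n):
        V[(i, i + 1)] = {A for A, a in terminal_rules if a == word[i]}
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length
            V[(i, j)] = set()
            for k in range(i + 1, j):  # split word[i:j] into u1 = word[i:k], u2 = word[k:j]
                for A, B, C in pair_rules:
                    if B in V[(i, k)] and C in V[(k, j)]:
                        V[(i, j)].add(A)
    # CYK only handles non-empty words, so the empty word is rejected here.
    return n > 0 and start in V[(0, n)]


# S -> AB | AX,  X -> SB,  A -> a,  B -> b
terminal_rules = [("A", "a"), ("B", "b")]
pair_rules = [("S", "A", "B"), ("S", "A", "X"), ("X", "S", "B")]
print(cyk("aabb", terminal_rules, pair_rules))  # True
print(cyk("aab", terminal_rules, pair_rules))   # False
```

The three nested loops over `length`, `i` and `k` are where the cubic running time in \|w\| comes from.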

## Top-down parsing (left-to-right)

Start from starting variable S, try to derive the input word.

Simple leftmost:

-   idea:
    -   always expand leftmost variable A, replace A by u if A → u.
    -   backtrack on mismatch with input string, then try another
        production rule A → v.
-   issue is that backtracking is expensive and hard
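
The simple leftmost idea fits in a few lines of recursive Python. This is only a sketch under assumptions: single-character symbols with uppercase variables, and a grammar without left recursion (a rule like A → Ab would make the recursion loop forever). The function name `derives` is made up.

```python
def derives(rules, symbols, word):
    """True if the sentential form `symbols` can derive exactly `word`.
    rules: list of (lhs, rhs) pairs; rhs is a string, uppercase = variable."""
    if not symbols:
        return word == ""
    head, rest = symbols[0], symbols[1:]
    if not head.isupper():
        # Terminal: it has to match the next input letter.
        return word.startswith(head) and derives(rules, rest, word[1:])
    # Leftmost variable: try every rule for it. When one attempt returns
    # False, `any` moves on to the next rule, which is the backtracking.
    return any(derives(rules, rhs + rest, word)
               for lhs, rhs in rules if lhs == head)


# S -> aSb | ab
rules = [("S", "aSb"), ("S", "ab")]
print(derives(rules, "S", "aabb"))  # True
print(derives(rules, "S", "aab"))   # False
```

Every mismatch throws away work and retries, so the running time can blow up exponentially, which is the "expensive" part.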

## LL parsing

LL: reads the input left to right and builds a leftmost derivation (top-down). Backtracking not
allowed.

-   LL(1): looks at one symbol of input. A grammar is LL(1) if the parser
    table has max one rule in every cell (i.e., no ambiguity when a
    symbol is read)
    -   left factorization can make a grammar LL(1). e.g:
        -   {S → ab \| ac } is not LL(1), ambiguity with one symbol
            lookahead.
        -   if factorized into two rules, {S → aA, A → b \| c}, the grammar
            is LL(1).
-   LL(k): looks k symbols ahead. The table is constructed with k symbol
    lookahead, and the grammar is LL(k) if the table has max one rule in
    every cell. The size of the parser table grows exponentially in k.
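
To see what "no backtracking" looks like in practice, here is a sketch of a table-driven LL(1) parser for the left-factorized grammar {S → aA, A → b \| c}. The table is filled in by hand for this tiny grammar, and the dictionary encoding and the name `ll1_parse` are assumptions for illustration.

```python
# One rule per (variable, lookahead) cell; missing cells mean "reject".
TABLE = {("S", "a"): "aA", ("A", "b"): "b", ("A", "c"): "c"}


def ll1_parse(word, table, start="S"):
    stack = [start]  # the top of the stack is the end of the list
    pos = 0
    while stack:
        top = stack.pop()
        look = word[pos] if pos < len(word) else "$"
        if top.isupper():
            rhs = table.get((top, look))
            if rhs is None:
                return False  # empty cell: reject right away, never retry
            stack.extend(reversed(rhs))  # leftmost symbol ends up on top
        elif top == look:
            pos += 1  # terminal matches the input, consume it
        else:
            return False
    return pos == len(word)


print(ll1_parse("ab", TABLE))  # True
print(ll1_parse("aa", TABLE))  # False
```

With the original rules {S → ab \| ac}, the cell for S and lookahead a would need two entries, and the single dictionary lookup would not be enough.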

CFG prerequisite - must have no useless variables (though λ-productions
and unit productions are allowed)

I'll try to explain this in a more understandable way than the abstract notation we get.

### First set
The set of terminals that can begin strings derivable from variable A (plus λ, if A can derive λ).

To find First(A), you want to look at the RHS of every rule A → XY:
- if the RHS is λ, then First(A) contains λ
- if X is a terminal, then First(A) contains that terminal
- if X is a variable, then First(A) contains:
    - First(X), without λ
    - and if X can derive λ, also First(Y)

#### Example
Take the grammar with rules:

1. A → DbCbz
2. A → dzzzA
3. A → λ
4. B → kkdb
5. C → kzeA
6. D → AneCB

I start with B, because it doesn't depend on other first sets.

First(B):
- rule 4: k is first, and is a terminal
- therefore First(B) = {k}

First(C):
- rule 5: k is first, and is a terminal
- therefore First(C) = {k}

First(A):
- rule 3: λ
- rule 2: d is first, and is a terminal
- rule 1: D is first, and is a variable, so have to find First(D):
    - from below, First(D) = {d, n}
- therefore First(A) = {λ, d, n}

First(D):
- rule 6: A is first, and is a variable, so have to find First(A):
    - from above, First(A) = {λ, d} ∪ First(D), so A contributes d (the circular reference back to First(D) adds nothing new)
    - A can derive λ, so First(D) also includes the next symbol, n
- therefore First(D) = {d, n}

Remember, duplicates are excluded in sets.
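
First(A) and First(D) depend on each other, so a mechanical way to compute them is to start with empty sets and keep re-applying the rules until nothing changes. A sketch in Python for the example grammar. The encoding is an assumption: single-character symbols, uppercase = variable, and the empty string for a λ right-hand side.

```python
LAMBDA = "λ"
RULES = [("A", "DbCbz"), ("A", "dzzzA"), ("A", ""),
         ("B", "kkdb"), ("C", "kzeA"), ("D", "AneCB")]


def first_of_string(symbols, first):
    """First set of a string of symbols, given the sets found so far."""
    result = set()
    for s in symbols:
        if not s.isupper():
            result.add(s)  # a terminal: it is the first symbol, stop here
            return result
        result |= first[s] - {LAMBDA}
        if LAMBDA not in first[s]:
            return result  # this variable cannot vanish, stop here
    result.add(LAMBDA)  # every symbol can vanish (or the string is empty)
    return result


def first_sets(rules):
    first = {lhs: set() for lhs, _ in rules}
    changed = True
    while changed:
        changed = False
        for lhs, rhs in rules:
            new = first_of_string(rhs, first) - first[lhs]
            if new:
                first[lhs] |= new
                changed = True
    return first


print(first_sets(RULES))
# A gets λ, d and n; B and C get k; D gets d and n, as in the notes.
```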

### Follow set
The set of possible terminals immediately following a variable A.

To find Follow(A), you want to look at rules that have A on the RHS:
- if A is followed by a terminal, add that terminal to Follow(A)
- if A is followed by variable V, add First(V) without λ to Follow(A)
    - if V can derive λ, also apply these rules to whatever comes after V
- if A is at the end of a rule V → XA, where X are some variables/terminals, add Follow(V) to Follow(A)
- the start symbol always has \$ in its follow set

#### Example
Take the grammar with rules:

1. A → DbCbz
2. A → dzzzA
3. A → λ
4. B → kkdb
5. C → kzeA
6. D → AneCB

I start with D, as it does not depend much on other rules.

Follow(D):
- rule 1: D is followed by b, which is a terminal
- therefore Follow(D) = {b}

Follow(B):
- rule 6: B is the last symbol, so add Follow(D) to Follow(B)
    - from above, Follow(D) = {b}
- therefore, Follow(B) = {b}

Follow(C):
- rule 1: C is followed by b, which is a terminal
- rule 6: C is followed by B, which is a variable, so add First(B) to Follow(C)
    - in the previous section, we found First(B) = {k}
- therefore, Follow(C) = {b, k}

Follow(A):
- rule 6: A is followed by n, which is a terminal
- rule 5: A is the last symbol, so add Follow(C) to Follow(A)
    - from above, Follow(C) = {b, k}
- rule 2: A is the last symbol, so add Follow(A) to Follow(A)
    - nothing is added, because sets don't have duplicates
- A is the start symbol (taking the variable of rule 1 as the start), so add \$ to Follow(A)
- therefore, Follow(A) = {n, b, k, \$}
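
The same repeat-until-stable approach works for Follow sets. The sketch below takes the First sets of the example grammar as given and assumes A is the start symbol. As before, the encoding (single-character symbols, uppercase = variable, empty string for λ) and the function names are made up for illustration.

```python
LAMBDA, END = "λ", "$"
RULES = [("A", "DbCbz"), ("A", "dzzzA"), ("A", ""),
         ("B", "kkdb"), ("C", "kzeA"), ("D", "AneCB")]
# First sets of the example grammar, taken as given here.
FIRST = {"A": {LAMBDA, "d", "n"}, "B": {"k"}, "C": {"k"}, "D": {"d", "n"}}


def first_of_string(symbols, first):
    """First set of a string of symbols."""
    result = set()
    for s in symbols:
        if not s.isupper():
            result.add(s)
            return result
        result |= first[s] - {LAMBDA}
        if LAMBDA not in first[s]:
            return result
    result.add(LAMBDA)  # the whole string can vanish (or is empty)
    return result


def follow_sets(rules, first, start):
    follow = {lhs: set() for lhs, _ in rules}
    follow[start].add(END)  # the start symbol always has $
    changed = True
    while changed:
        changed = False
        for lhs, rhs in rules:
            for i, s in enumerate(rhs):
                if not s.isupper():
                    continue
                after = first_of_string(rhs[i + 1:], first)
                new = after - {LAMBDA}
                if LAMBDA in after:
                    # Nothing, or only erasable symbols, comes after s:
                    # whatever follows the LHS also follows s.
                    new |= follow[lhs]
                if not new <= follow[s]:
                    follow[s] |= new
                    changed = True
    return follow


print(follow_sets(RULES, FIRST, "A"))
# A gets n, b, k and $; B gets b; C gets b and k; D gets b.
```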

### Parse table construction
Once you have first and follow sets, you can construct a parse table.
The rows are indexed by variables, the columns are indexed by terminals (plus the end marker \$).

A cell at row A and column u contains a rule (LHS → RHS) if:
- u ∈ First(RHS)
- or λ ∈ First(RHS) and u ∈ Follow(LHS)

#### Example
Take the grammar with rules:

1. A → DbCbz
2. A → dzzzA
3. A → λ
4. B → kkdb
5. C → kzeA
6. D → AneCB

Rule 1:
- First(DbCbz) = First(D) = {d, n}
- therefore, it will be in row A, at columns d and n

Rule 2:
- First(dzzzA) = {d}
- therefore, it will be in row A, at column d

Rule 3:
- First(λ) = {λ}
- Follow(A) = {n, b, k, \$}
- therefore, it will be in row A, at columns n, b, k, and \$.

Rule 4:
- First(kkdb) = {k}
- therefore, it will be in row B, at column k

Rule 5:
- First(kzeA) = {k}
- therefore, it will be in row C, at column k
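
Rule 6:
- First(A) = {λ, d, n}, so First(AneCB) contains d and n
- because λ ∈ First(A), the next symbol n is also added to First(AneCB), but it's already part of the set. λ itself is not in First(AneCB), because the terminal n cannot vanish.
- so First(AneCB) = {d, n}; therefore, the rule will be in row D, at columns d and n.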

The resulting parse table:

|   | b     | d                      | n                  | k        | e | z | \$    |
|---|-------|------------------------|--------------------|----------|---|---|-------|
| A | A → λ | A → DbCbz<br>A → dzzzA | A → DbCbz<br>A → λ | A → λ    |   |   | A → λ |
| B |       |                        |                    | B → kkdb |   |   |       |
| C |       |                        |                    | C → kzeA |   |   |       |
| D |       | D → AneCB              | D → AneCB          |          |   |   |       |

This table cannot yet be used for LL(1) parsing, as the cells in row A at columns d and n contain more than one rule.
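
The two conditions for filling a cell translate almost directly into code. The sketch below takes the First and Follow sets of the example grammar as given and lists the cells with more than one rule. The encoding (single-character symbols, uppercase = variable, empty string for λ) and the names are assumptions for illustration.

```python
LAMBDA = "λ"
RULES = [("A", "DbCbz"), ("A", "dzzzA"), ("A", ""),
         ("B", "kkdb"), ("C", "kzeA"), ("D", "AneCB")]
# First and Follow sets of the example grammar, taken as given here.
FIRST = {"A": {LAMBDA, "d", "n"}, "B": {"k"}, "C": {"k"}, "D": {"d", "n"}}
FOLLOW = {"A": {"n", "b", "k", "$"}, "B": {"b"}, "C": {"b", "k"}, "D": {"b"}}


def first_of_string(symbols, first):
    """First set of a string of symbols."""
    result = set()
    for s in symbols:
        if not s.isupper():
            result.add(s)
            return result
        result |= first[s] - {LAMBDA}
        if LAMBDA not in first[s]:
            return result
    result.add(LAMBDA)  # the whole string can vanish (or is empty)
    return result


def build_table(rules, first, follow):
    """Maps (variable, column) to the list of right-hand sides in that cell."""
    table = {}
    for lhs, rhs in rules:
        f = first_of_string(rhs, first)
        columns = f - {LAMBDA}         # u in First(RHS)
        if LAMBDA in f:
            columns |= follow[lhs]     # λ in First(RHS) and u in Follow(LHS)
        for u in columns:
            table.setdefault((lhs, u), []).append(rhs)
    return table


table = build_table(RULES, FIRST, FOLLOW)
conflicts = {cell for cell, entries in table.items() if len(entries) > 1}
print(conflicts)  # the two cells ('A', 'd') and ('A', 'n')
```

A grammar is LL(1) exactly when `conflicts` comes out empty for its table.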