+++
title = "Lecture 6"
template = "page-math.html"
+++

# Normalising (contd)

## Removal of unit production rules

Unit production rule: A → B, where B is a variable.

Steps:

1.  Remove all λ-productions.
2.  Determine all pairs of different variables A, B with
    $A \rightarrow^+ B$, i.e. B can be reached from A using one or more
    unit production rules.
3.  Whenever $A \rightarrow^+ B$ and B → y is a rule that is not a unit
    production rule, add a new rule A → y.
4.  Remove all unit production rules.

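The steps above can be sketched in Python. This is only an illustration under assumptions of my own: rules are `(lhs, rhs)` pairs with `rhs` a tuple of symbols, λ-productions are already gone (step 1), and the names `is_unit` and `remove_unit_rules` are made up for this example.

```python
def is_unit(rhs, variables):
    """A right-hand side is a unit production if it is a single variable."""
    return len(rhs) == 1 and rhs[0] in variables


def remove_unit_rules(rules, variables):
    """rules: set of (lhs, rhs) pairs, rhs a tuple of symbols, no λ-rules."""
    # Step 2: unit[A] = all variables B != A reachable from A via unit rules.
    unit = {A: set() for A in variables}
    changed = True
    while changed:
        changed = False
        for lhs, rhs in rules:
            if not is_unit(rhs, variables):
                continue
            B = rhs[0]
            for A in variables:
                if (A == lhs or lhs in unit[A]) and B != A and B not in unit[A]:
                    unit[A].add(B)
                    changed = True
    # Step 4: keep only the non-unit rules ...
    result = {(l, r) for l, r in rules if not is_unit(r, variables)}
    # Step 3: ... and copy the non-unit rules of B to every A with A ->+ B.
    for A in variables:
        for B in unit[A]:
            result |= {(A, r) for l, r in rules
                       if l == B and not is_unit(r, variables)}
    return result


# Example grammar: S → A | ab, A → B | a, B → b
example = {("S", ("A",)), ("S", ("a", "b")), ("A", ("B",)),
           ("A", ("a",)), ("B", ("b",))}
cleaned = remove_unit_rules(example, {"S", "A", "B"})
```

For the example grammar, `cleaned` holds S → ab | a | b, A → a | b and B → b.
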
## Chomsky normal form

A grammar is in Chomsky normal form when all rules have the form A → BC
or A → a, i.e. the RHS is either two variables or a single terminal.

Steps:

1.  Remove all λ-productions
2.  Remove all unit production rules
3.  For every terminal a:
    1.  add: variable C<sub>a</sub>, rule C<sub>a</sub> → a.
    2.  in any rule with length RHS ≥ 2, replace the terminal a with
        C<sub>a</sub>
4.  Split all rules so that they have a maximum of 2 variables on the
    RHS, by adding new rules and variables. Example:
    -   start with one rule, {A → BCDE}.
    -   split, introduce variable X<sub>1</sub>: {A → BX<sub>1</sub>, X<sub>1</sub> → CDE}
    -   split, introduce variable X<sub>2</sub>: {A → BX<sub>1</sub>, X<sub>1</sub> → CX<sub>2</sub>, X<sub>2</sub> → DE}
    -   no more splits needed, as every rule has max 2 variables on the
        RHS.

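Step 4 is mechanical enough to sketch in Python. The function name `split_long_rules` and the fresh names `X1`, `X2`, ... are illustrative. The sketch assumes those names don't clash with existing variables, and that steps 1 to 3 are done, so every long RHS consists of variables only.

```python
def split_long_rules(rules):
    """rules: list of (lhs, rhs) pairs, with rhs a list of variables."""
    result, counter = [], 0
    for lhs, rhs in rules:
        # Peel off the first variable until at most two remain.
        while len(rhs) > 2:
            counter += 1
            fresh = f"X{counter}"  # assumed not to be used elsewhere
            result.append((lhs, [rhs[0], fresh]))
            lhs, rhs = fresh, rhs[1:]
        result.append((lhs, rhs))
    return result


# The example from the list above: A → BCDE
split = split_long_rules([("A", ["B", "C", "D", "E"])])
```

Here `split` contains A → BX1, X1 → CX2 and X2 → DE, matching the example from the notes.
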
## Removing useless variables

Why? It simplifies the grammar, sometimes by a ton.

A variable is:

-   useless if it never occurs in any derivation of a word (a string of
    terminals) from the start symbol. If you remove production rules
    with useless variables, the language doesn't change.
-   productive if some word consisting only of terminals can be derived
    from it. If it's not productive, it's useless (just like your
    average student)

Steps:

1.  Determine the productive variables: if A → y is a rule, and all
    variables in y are productive, then A is productive. In particular,
    A is productive if y contains no variables at all. Repeat until no
    new productive variables are found.
2.  Remove rules containing a non-productive variable.
3.  Determine reachable variables:
    -   start symbol is always reachable
    -   if A → y, and A is reachable, then so are all variables in y.
4.  Remove rules containing an unreachable variable.
5.  Any variable from the original grammar that doesn't show up in the
    remaining rules is useless.

The order matters: removing non-productive variables first can make
other variables unreachable, so reachability is checked afterwards.

So basically, evict the unproductive and useless. Don't quote me on
that.

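As a rough sketch of steps 1 to 4 in Python: rules are assumed to be `(lhs, rhs)` pairs with `rhs` a tuple of symbols (empty for λ), and `remove_useless` is a name chosen for this example only.

```python
def remove_useless(rules, variables, start):
    """rules: list of (lhs, rhs) pairs, rhs a tuple of symbols."""
    # Step 1: productive variables, computed as a fixpoint.
    productive = set()
    changed = True
    while changed:
        changed = False
        for lhs, rhs in rules:
            if lhs not in productive and all(
                s in productive for s in rhs if s in variables
            ):
                productive.add(lhs)
                changed = True
    # Step 2: drop rules that mention a non-productive variable.
    rules = [
        (l, r) for l, r in rules
        if l in productive
        and all(s in productive for s in r if s in variables)
    ]
    # Step 3: reachable variables, starting from the start symbol.
    reachable, todo = {start}, [start]
    while todo:
        A = todo.pop()
        for lhs, rhs in rules:
            if lhs != A:
                continue
            for s in rhs:
                if s in variables and s not in reachable:
                    reachable.add(s)
                    todo.append(s)
    # Step 4: a rule with a reachable LHS only contains reachable
    # variables, so filtering on the LHS is enough.
    return [(l, r) for l, r in rules if l in reachable]


# Example: S → AB | a, A → a, B → Bb, C → c
example = [("S", ("A", "B")), ("S", ("a",)), ("A", ("a",)),
           ("B", ("B", "b")), ("C", ("c",))]
remaining = remove_useless(example, {"S", "A", "B", "C"}, "S")
```

In the example, B is not productive, so S → AB goes away in step 2. That in turn makes A unreachable, and C was never reachable, which leaves only S → a.
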
## Erasable variables

A is erasable if you can somehow derive λ from it.

-   if A → λ, A is erasable
-   if A → B<sub>1</sub>...B<sub>n</sub>, and B<sub>1</sub>...B<sub>n</sub> are erasable, then so is A.

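The two rules above amount to a fixpoint computation. A minimal sketch, assuming rules are `(lhs, rhs)` pairs where `rhs` is a tuple of symbols and the empty tuple stands for λ (the name `erasable_variables` is only illustrative):

```python
def erasable_variables(rules):
    """rules: list of (lhs, rhs) pairs; rhs == () stands for λ."""
    erasable = set()
    changed = True
    while changed:
        changed = False
        for lhs, rhs in rules:
            # all() over an empty rhs is True, which covers A → λ.
            # A terminal is never in `erasable`, so it blocks the rule.
            if lhs not in erasable and all(s in erasable for s in rhs):
                erasable.add(lhs)
                changed = True
    return erasable


# Example: S → AB, A → λ, B → A | b, C → c
example = [("S", ("A", "B")), ("A", ()), ("B", ("A",)),
           ("B", ("b",)), ("C", ("c",))]
found = erasable_variables(example)
```

For the example, `found` is the set containing S, A and B. C is not erasable because its only rule produces a terminal.
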
# Parsing
Parsing: the search for a derivation tree for a given word.

For CFGs, parsing is possible in O(\|w\|<sup>3</sup>) time, where \|w\| is the length
of the input word.

## Bottom-up parsing (right-to-left)

Start from the input word, try to construct the starting variable S.
Applies rules backwards.

The CYK (Cocke-Younger-Kasami) algorithm does bottom-up parsing for
grammars in Chomsky normal form. It determines whether a non-empty word
w is in L(G) (i.e., is generated by the grammar).

Steps:

1.  Take grammar G in Chomsky normal form. Hopefully someone will be
    nice enough to give you that; if not, you'll have to normalize it
    yourself.
2.  Compute sets V<sub>u</sub> of variables from which you can derive u, where u is a
    _contiguous_ subword of w.
    -   If u is a single letter, then V<sub>u</sub> is the set of variables A with a rule A → u.
    -   If u is multiple letters, then V<sub>u</sub> is the set of all variables A such that:
        -   u = u<sub>1</sub>u<sub>2</sub> with u<sub>1</sub> and u<sub>2</sub> being some nonempty _words_ (potentially multiple letters)
        -   A → BC is a production in the grammar, with B deriving u<sub>1</sub> and C deriving u<sub>2</sub>
3.  If the starting variable is in V<sub>w</sub>, the set of variables that derive the whole word w, then the grammar generates that word.

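A compact Python sketch of the CYK steps. The input format is an assumption of this example: `unary` maps each terminal a to the variables A with A → a, and `binary` lists the rules A → BC as triples. The sets V<sub>u</sub> are stored per position range `(i, j)` rather than per subword, which is equivalent since u = `word[i:j]`.

```python
def cyk(word, unary, binary, start):
    """unary: dict terminal -> set of variables A with A → a.
    binary: list of (A, B, C) triples for rules A → BC."""
    n = len(word)
    V = {}  # V[(i, j)] = variables deriving the subword word[i:j]
    for i, a in enumerate(word):
        V[(i, i + 1)] = set(unary.get(a, ()))
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length
            V[(i, j)] = set()
            # Try every split u = u1 u2, u1 = word[i:k], u2 = word[k:j].
            for k in range(i + 1, j):
                for A, B, C in binary:
                    if B in V[(i, k)] and C in V[(k, j)]:
                        V[(i, j)].add(A)
    # CYK only handles non-empty words.
    return n > 0 and start in V[(0, n)]


# Example grammar in CNF for { a^n b^n | n ≥ 1 }:
# S → AT | AB, T → SB, A → a, B → b
unary = {"a": {"A"}, "b": {"B"}}
binary = [("S", "A", "T"), ("S", "A", "B"), ("T", "S", "B")]
```

With this grammar, `cyk("aabb", unary, binary, "S")` is true and `cyk("abab", unary, binary, "S")` is false. The three nested loops over `length`, `i` and `k` are where the cubic running time comes from.
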
## Top-down parsing (left-to-right)

Start from starting variable S, try to derive the input word.

Simple leftmost:

-   idea:
    -   always expand leftmost variable A, replace A by u if A → u.
    -   backtrack on mismatch with input string, then try another
        production rule A → v.
-   issue is that backtracking is expensive and hard

## LL parsing

LL: left-to-right (top-down), leftmost derivation. Backtracking not
allowed.

-   LL(1): looks at one symbol of input. A grammar is LL(1) if the parser
    table has max one rule in every cell (i.e., no ambiguity when a
    symbol is read)
    -   left factorization can make a grammar LL(1). e.g:
        -   {S → ab \| ac } is not LL(1), ambiguity with one symbol
            lookahead.
        -   if you factorize into two rules, {S → aA, A → b \| c}, the
            grammar is LL(1).
-   LL(k): looks k symbols ahead. The table is constructed with k symbol
    lookahead, and the grammar is LL(k) if the table has max one rule in
    every cell. The size of the parser table grows exponentially in k.

CFG prerequisite: must have no useless variables (though λ-productions
and unit productions are allowed)

I'll try to explain this in a more understandable way than the abstract notation we get.

### First set
The set of terminals that can begin strings derivable from variable A. If A can derive λ, then λ is also in First(A).

To find First(A), you want to look at the RHS of every rule for A:
- if the rule is A → λ, add λ to First(A)
- if the RHS starts with a terminal, add that terminal to First(A)
- if the RHS starts with a variable X, i.e. the rule is A → XY where Y is the rest of the RHS:
    - add First(X), except λ, to First(A)
    - and if X can derive λ, also add First(Y)

#### Example
Take the grammar with rules:

1. A → DbCbz
2. A → dzzzA
3. A → λ
4. B → kkdb
5. C → kzeA
6. D → AneCB

I start with B, because it doesn't depend on other first sets.

First(B):
- rule 4: k is first, and is a terminal
- therefore First(B) = {k}

First(C):
- rule 5: k is first, and is a terminal
- therefore First(C) = {k}

First(A):
- rule 3: λ
- rule 2: d is first, and is a terminal
- rule 1: D is first, and is a variable, so have to find First(D):
    - from below, First(D) = {d, n}
- therefore First(A) = {λ, d, n}

First(D):
- rule 6: A is first, and is a variable, so have to find First(A):
    - from above, First(A) contains λ, d, and everything in First(D)
    - the First(D) part adds nothing new to First(D) itself, so only d is added
    - A can derive λ, so First(D) also includes the next symbol, n
- therefore First(D) = {d, n}

Remember, duplicates are excluded in sets.

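The circular dependency between First(A) and First(D) is easiest to handle by repeating the rules until nothing changes. A sketch in Python, assuming every symbol is a single character, so a RHS can be written as a string, with `""` for λ. The name `first_sets` is made up for this example.

```python
LAMBDA = "λ"


def first_sets(rules, variables):
    """rules: list of (lhs, rhs) pairs, rhs a string ('' for λ)."""
    first = {A: set() for A in variables}
    changed = True
    while changed:
        changed = False
        for lhs, rhs in rules:
            new = set()
            for X in rhs:
                if X not in variables:      # terminal: it starts the string
                    new.add(X)
                    break
                new |= first[X] - {LAMBDA}
                if LAMBDA not in first[X]:  # X cannot vanish, stop here
                    break
            else:
                # No break: every symbol can vanish, or the rhs is λ.
                new.add(LAMBDA)
            if not new <= first[lhs]:
                first[lhs] |= new
                changed = True
    return first


# The grammar from the example above.
rules = [("A", "DbCbz"), ("A", "dzzzA"), ("A", ""),
         ("B", "kkdb"), ("C", "kzeA"), ("D", "AneCB")]
first = first_sets(rules, {"A", "B", "C", "D"})
```

On the example grammar this gives the same sets as the manual calculation: First(A) = {λ, d, n}, First(B) = {k}, First(C) = {k}, First(D) = {d, n}.
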
### Follow set
The set of possible terminals immediately following a variable A in some derivation. The symbol \$ marks the end of the input.

To find Follow(A), you want to look at rules that have A on the RHS:
- if A is followed by a terminal, add that terminal to Follow(A)
- if A is followed by variable V, add First(V), except λ, to Follow(A). If V can derive λ, also look at whatever comes after V in the same way
- if A is at the end of a rule V → XA, where X are some variables/terminals, add Follow(V) to Follow(A). The same applies if everything after A in the rule can derive λ
- the start symbol always has \$ in its follow set

#### Example
Take the grammar with rules:

1. A → DbCbz
2. A → dzzzA
3. A → λ
4. B → kkdb
5. C → kzeA
6. D → AneCB

I start with D, as it does not depend on other follow sets.

Follow(D):
- rule 1: D is followed by b, which is a terminal
- therefore Follow(D) = {b}

Follow(B):
- rule 6: B is the last symbol, so add Follow(D) to Follow(B)
    - from above, Follow(D) = {b}
- therefore, Follow(B) = {b}

Follow(C):
- rule 1: C is followed by b, which is a terminal
- rule 6: C is followed by B, which is a variable, so add First(B) to Follow(C)
    - in the previous section, we found First(B) = {k}
- therefore, Follow(C) = {b, k}

Follow(A):
- rule 6: A is followed by n, which is a terminal
- rule 5: A is the last symbol, so add Follow(C) to Follow(A)
    - from above, Follow(C) = {b, k}
- rule 2: A is the last symbol, so add Follow(A) to Follow(A)
    - nothing is added, because sets don't have duplicates
- taking A as the start symbol, also add \$ to Follow(A)
- therefore, Follow(A) = {n, b, k, \$}

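The follow sets can be computed with the same "repeat until nothing changes" approach. This sketch takes the first sets found earlier as input, assumes single-character symbols (a RHS is a string, `""` for λ), and treats A as the start symbol. The name `follow_sets` is illustrative.

```python
LAMBDA, END = "λ", "$"


def follow_sets(rules, variables, first, start):
    """rules: list of (lhs, rhs) pairs; first: dict variable -> First set."""
    follow = {A: set() for A in variables}
    follow[start].add(END)  # the start symbol always has $
    changed = True
    while changed:
        changed = False
        for lhs, rhs in rules:
            for i, A in enumerate(rhs):
                if A not in variables:
                    continue
                new = set()
                for X in rhs[i + 1:]:
                    if X not in variables:      # A is followed by a terminal
                        new.add(X)
                        break
                    new |= first[X] - {LAMBDA}  # A is followed by a variable
                    if LAMBDA not in first[X]:
                        break
                else:
                    # A is last, or everything after it can vanish.
                    new |= follow[lhs]
                if not new <= follow[A]:
                    follow[A] |= new
                    changed = True
    return follow


rules = [("A", "DbCbz"), ("A", "dzzzA"), ("A", ""),
         ("B", "kkdb"), ("C", "kzeA"), ("D", "AneCB")]
first = {"A": {"λ", "d", "n"}, "B": {"k"}, "C": {"k"}, "D": {"d", "n"}}
follow = follow_sets(rules, {"A", "B", "C", "D"}, first, "A")
```

The result agrees with the manual calculation: Follow(A) = {n, b, k, \$}, Follow(B) = {b}, Follow(C) = {b, k}, Follow(D) = {b}.
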
### Parse table construction
Once you have first and follow sets, you can construct a parse table.
The rows are indexed by variables, the columns are indexed by terminals and the end marker \$.

A cell at row A and column u contains a rule (LHS → RHS) with LHS = A if:
- u ∈ First(RHS)
- or λ ∈ First(RHS) and u ∈ Follow(LHS)

#### Example
Take the grammar with rules:

1. A → DbCbz
2. A → dzzzA
3. A → λ
4. B → kkdb
5. C → kzeA
6. D → AneCB

Rule 1:
- First(DbCbz) = First(D) = {d, n}
- therefore, it will be in row A, at columns d and n

Rule 2:
- First(dzzzA) = {d}
- therefore, it will be in row A, at column d

Rule 3:
- First(λ) = {λ}
- Follow(A) = {n, b, k, \$}
- therefore, it will be in row A, at columns n, b, k, and \$.

Rule 4:
- First(kkdb) = {k}
- therefore, it will be in row B, at column k

Rule 5:
- First(kzeA) = {k}
- therefore, it will be in row C, at column k

Rule 6:
- A is first, and First(A) = {λ, d, n}, so add d and n to First(AneCB)
- because λ ∈ First(A), also add the next symbol n to First(AneCB). But it's already part of the set.
- λ itself is not in First(AneCB), because the terminal n cannot vanish, so First(AneCB) = {d, n}
- therefore, the rule will be in row D, at columns d and n.

The resulting parse table:

|   | b     | d                      | n                  | k        | z | e | \$    |
|---|-------|------------------------|--------------------|----------|---|---|-------|
| A | A → λ | A → DbCbz<br>A → dzzzA | A → DbCbz<br>A → λ | A → λ    |   |   | A → λ |
| B |       |                        |                    | B → kkdb |   |   |       |
| C |       |                        |                    | C → kzeA |   |   |       |
| D |       | D → AneCB              | D → AneCB          |          |   |   |       |

This table cannot be used for LL(1) parsing as it stands: the cells (A, d) and (A, n) each contain more than one rule, so this grammar is not LL(1).
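
The table construction can also be sketched in Python, taking the first and follow sets from the earlier examples as given. The representation (single-character symbols, a RHS as a string with `""` for λ) and the names `first_of_string` and `parse_table` are assumptions of this sketch.

```python
LAMBDA = "λ"


def first_of_string(rhs, first, variables):
    """First set of a whole right-hand side, given First of each variable."""
    result = set()
    for X in rhs:
        if X not in variables:      # a terminal starts the string
            result.add(X)
            return result
        result |= first[X] - {LAMBDA}
        if LAMBDA not in first[X]:  # X cannot vanish
            return result
    result.add(LAMBDA)              # the whole string can vanish
    return result


def parse_table(rules, variables, first, follow):
    """Returns a dict (variable, column) -> list of rules in that cell."""
    table = {}
    for lhs, rhs in rules:
        f = first_of_string(rhs, first, variables)
        columns = f - {LAMBDA}
        if LAMBDA in f:
            columns |= follow[lhs]
        for u in columns:
            table.setdefault((lhs, u), []).append((lhs, rhs))
    return table


rules = [("A", "DbCbz"), ("A", "dzzzA"), ("A", ""),
         ("B", "kkdb"), ("C", "kzeA"), ("D", "AneCB")]
variables = {"A", "B", "C", "D"}
first = {"A": {"λ", "d", "n"}, "B": {"k"}, "C": {"k"}, "D": {"d", "n"}}
follow = {"A": {"n", "b", "k", "$"}, "B": {"b"}, "C": {"b", "k"}, "D": {"b"}}

table = parse_table(rules, variables, first, follow)
# Cells with more than one rule are exactly the LL(1) conflicts.
conflicts = {cell for cell, entries in table.items() if len(entries) > 1}
```

For this grammar, `conflicts` contains the cells (A, d) and (A, n), the same two cells that hold two rules in the table above.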