+++
title = "Lecture 6"
template = "page-math.html"
+++

# Normalising (contd)

## Removal of unit production rules

Unit production rule: A → B, where B is a variable.

Steps:

1. Remove all λ-productions
2. Determine all pairs of different variables with $A \rightarrow^+ B$
3. Whenever there's a derivation A → ... → B → y, where y is not a single variable, add new rule A → y.
4. Remove all unit production rules

## Chomsky normal form

When all rules have the form A → BC or A → a, i.e. the RHS is either two
variables or a single terminal.

Steps:

1. Remove all λ-productions
2. Remove all unit production rules
3. For every terminal a:
    1. add: variable C<sub>a</sub>, rule C<sub>a</sub> → a.
    2. in any rule with length RHS ≥ 2, replace the terminal a with
       C<sub>a</sub>
4. Split all rules so that they have a maximum of 2 variables on the
   RHS, by adding new rules and variables. Example:
    - start with one rule, {A → BCDE}.
    - split, introduce variable X<sub>1</sub>: {A → BX<sub>1</sub>, X<sub>1</sub> → CDE}
    - split, introduce variable X<sub>2</sub>: {A → BX<sub>1</sub>, X<sub>1</sub> → CX<sub>2</sub>, X<sub>2</sub> → DE}
    - no more splits needed, as every rule has max 2 variables on the
      RHS.

## Removing useless variables

Why? It simplifies the grammar, sometimes by a ton.

A variable is:

- useless if it never occurs in a derivation of a word from the start
  symbol. If you remove production rules with useless
  variables, the language doesn't change.
- productive if it derives a word consisting only of terminals. If it's not productive,
  it's useless (just like your average student)

Steps:

1. Determine the productive variables: if A → y is a rule, and all
   variables in y are productive, then A is productive.
2. Remove rules containing a non-productive variable.
3. Determine reachable variables:
    - start symbol is always reachable
    - if A → y, and A is reachable, then so are all variables in y.
4. Remove rules containing an unreachable variable.
5. Any variable from the original grammar that doesn't show up in the
   remaining rules is useless.

So basically, evict the unproductive and useless. Don't quote me on
that.

## Erasable variables

A is erasable if you can somehow derive λ from it.

- if A → λ, A is erasable
- if A → B<sub>1</sub>...B<sub>n</sub>, and B<sub>1</sub>...B<sub>n</sub> are all erasable, then so is A.

# Parsing
Parsing: the search for a derivation tree for a given word.

For CFGs, parsing is possible in O(\|w\|<sup>3</sup>) time, where \|w\| is the length
of the input word.

## Bottom-up parsing (right-to-left)

Start from the input word, try to construct the starting variable S. Applies
rules backwards.

The CYK (Cocke-Younger-Kasami) algorithm does bottom-up parsing for
grammars in Chomsky normal form. It determines whether a non-empty word
w is in L(G) (i.e., is generated by the grammar).

Steps:

1. Take grammar G in Chomsky normal form. Hopefully someone will be
   nice enough to give you that; if not, you'll have to normalize it
   yourself.
2. Compute sets V<sub>u</sub> of variables from which you can derive u, where u is a
   _contiguous_ subword of w.
    - If u is a single letter, then V<sub>u</sub> is the set of variables A with a rule A → u.
    - If u has multiple letters, then V<sub>u</sub> is the set of all variables A such that:
        - u = u<sub>1</sub>u<sub>2</sub> with u<sub>1</sub> and u<sub>2</sub> being some nonempty _words_ (potentially multiple letters)
        - A → BC is a production in the grammar, with B deriving u<sub>1</sub> and C deriving u<sub>2</sub>
3. If the starting variable is in V<sub>w</sub>, the set of variables that derive the whole word w, then the grammar generates that word.

## Top-down parsing (left-to-right)

Start from starting variable S, try to derive the input word.

Simple leftmost:

- idea:
    - always expand leftmost variable A, replace A by u if A → u.
    - backtrack on mismatch with input string, then try another
      production rule A → v.
- issue is that backtracking is expensive and hard

## LL parsing

LL: left-to-right (top-down), leftmost derivation. Backtracking not
allowed.

- LL(1): looks at one symbol of input. A grammar is LL(1) if the parser
  table has max one rule in every cell (i.e., no ambiguity when a
  symbol is read)
    - left factorization can make a grammar LL(1). e.g:
        - {S → ab \| ac } is not LL(1), ambiguity with one symbol
          lookahead.
        - if factorized into two rules, {S → aA, A → b \| c}, the grammar
          is LL(1).
- LL(k): looks k symbols ahead. The table is constructed with k symbol
  lookahead, and the grammar is LL(k) if the table has max one rule in
  every cell. The size of the parser table grows exponentially in k.

CFG prerequisite: it must have no useless variables (though λ-productions
and unit productions are allowed)

I'll try to explain this in a more understandable way than the abstract notation we get.

### First set
The set of terminals that begin strings derivable from variable A.

To find First(A), you want to look at the RHS of every rule A → XY:
- if X is a terminal, then First(A) contains that terminal
- if X is a variable, then First(A) contains:
    - First(X) (without λ)
    - and if X can derive λ, also First(Y)
- if the rule is A → λ, then First(A) contains λ

#### Example
Take the grammar with rules:

1. A → DbCbz
2. A → dzzzA
3. A → λ
4. B → kkdb
5. C → kzeA
6. D → AneCB

I start with B, because it doesn't depend on other first sets.

First(B):
- rule 4: k is first, and is a terminal
- therefore First(B) = {k}

First(C):
- rule 5: k is first, and is a terminal
- therefore First(C) = {k}

First(A):
- rule 3: λ
- rule 2: d is first, and is a terminal
- rule 1: D is first, and is a variable, so have to find First(D):
    - from below, First(D) = {d, n}
- therefore First(A) = {λ, d, n}

First(D):
- rule 6: A is first, and is a variable, so have to find First(A):
    - from above, First(A) = {λ, d} ∪ First(D)
    - A can derive λ, so First(D) also includes n, the symbol after A (λ itself is not passed on, because n cannot be erased)
- therefore First(D) = {d, n}

Remember, duplicates are excluded in sets.

### Follow set
The set of possible terminals immediately following a variable A.

To find Follow(A), you want to look at rules that have A on the RHS:
- if A is followed by a terminal, add that terminal to Follow(A)
- if A is followed by variable V, add First(V) (without λ) to Follow(A); if V can derive λ, also look at whatever follows V
- if A is at the end of a rule V → XA, where X are some variables/terminals, add Follow(V) to Follow(A) (this brings in \$ whenever \$ is in Follow(V))
- the start symbol always has \$ in its follow set

#### Example
Take the grammar with rules:

1. A → DbCbz
2. A → dzzzA
3. A → λ
4. B → kkdb
5. C → kzeA
6. D → AneCB

I start with D, as it does not depend much on other rules.

Follow(D):
- rule 1: D is followed by b, which is a terminal
- therefore Follow(D) = {b}

Follow(B):
- rule 6: B is the last symbol, so add Follow(D) to Follow(B)
    - from above, Follow(D) = {b}
- therefore, Follow(B) = {b}

Follow(C):
- rule 1: C is followed by b, which is a terminal
- rule 6: C is followed by B, which is a variable, so add First(B) to Follow(C)
    - in the previous section, we found First(B) = {k}
- therefore, Follow(C) = {b, k}

Follow(A):
- rule 6: A is followed by n, which is a terminal
- rule 5: A is the last symbol, so add Follow(C) to Follow(A)
    - from above, Follow(C) = {b, k}
- rule 2: A is the last symbol, so add Follow(A) to Follow(A), which adds nothing new
- taking A as the start symbol (it is the LHS of the first rule), also add \$ to Follow(A)
- therefore, Follow(A) = {n, b, k, \$}

### Parse table construction
Once you have first and follow sets, you can construct a parse table.
The rows are indexed by variables, the columns are indexed by terminals (plus the end marker \$).

The cell at row A and column u contains a rule A → RHS if:
- u ∈ First(RHS)
- or λ ∈ First(RHS) and u ∈ Follow(A)

#### Example
Take the grammar with rules:

1. A → DbCbz
2. A → dzzzA
3. A → λ
4. B → kkdb
5. C → kzeA
6. D → AneCB

Rule 1:
- First(DbCbz) = First(D) = {d, n}
- therefore, it will be in row A, at columns d and n

Rule 2:
- First(dzzzA) = {d}
- therefore, it will be in row A, at column d

Rule 3:
- First(λ) = {λ}
- Follow(A) = {n, b, k, \$}
- therefore, it will be in row A, at columns n, b, k, and \$.

Rule 4:
- First(kkdb) = {k}
- therefore, it will be in row B, at column k

Rule 5:
- First(kzeA) = {k}
- therefore, it will be in row C, at column k

Rule 6:
- First(A) = {λ, d, n}, so First(AneCB) contains d and n
- because λ ∈ First(A), drop λ and add n, the symbol after A. But n is already part of the set.
- so First(AneCB) = {d, n}
- therefore, the rule will be in row D, at columns d and n.

The resulting parse table:

|   | b     | d                      | n                  | k        | z | \$    |
|---|-------|------------------------|--------------------|----------|---|-------|
| A | A → λ | A → DbCbz<br>A → dzzzA | A → DbCbz<br>A → λ | A → λ    |   | A → λ |
| B |       |                        |                    | B → kkdb |   |       |
| C |       |                        |                    | C → kzeA |   |       |
| D |       | D → AneCB              | D → AneCB          |          |   |       |

This table cannot be used for LL(1) parsing as it stands, because the cells (A, d) and (A, n) each contain more than one rule.
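
The First sets, Follow sets, and parse table worked out by hand above can be cross-checked with a short script. This is only a sketch under a few assumptions that are not in the lecture: every symbol is a single character, the variables are exactly the LHS symbols of the rules, A is the start symbol, and all names (`RULES`, `first_of_string`, `compute_first`, `compute_follow`, `build_table`) are made up for illustration. It repeats the rules until nothing changes (a fixpoint iteration), which sidesteps the circular dependency between First(A) and First(D).

```python
LAMBDA, END = "λ", "$"

# Example grammar from above; the empty string stands for λ.
# Assumption: A (LHS of the first rule) is the start symbol.
RULES = [
    ("A", "DbCbz"),
    ("A", "dzzzA"),
    ("A", ""),
    ("B", "kkdb"),
    ("C", "kzeA"),
    ("D", "AneCB"),
]
VARIABLES = {lhs for lhs, _ in RULES}


def first_of_string(symbols, first):
    """First set of a string of symbols, given the First sets known so far."""
    result = set()
    for sym in symbols:
        if sym not in VARIABLES:          # a terminal starts the string
            result.add(sym)
            return result
        result |= first[sym] - {LAMBDA}
        if LAMBDA not in first[sym]:      # sym is not (yet known to be) erasable
            return result
    result.add(LAMBDA)                    # empty string, or every symbol erasable
    return result


def compute_first():
    first = {v: set() for v in VARIABLES}
    changed = True
    while changed:                        # repeat until no set grows any more
        changed = False
        for lhs, rhs in RULES:
            new = first_of_string(rhs, first)
            if not new <= first[lhs]:
                first[lhs] |= new
                changed = True
    return first


def compute_follow(first, start="A"):
    follow = {v: set() for v in VARIABLES}
    follow[start].add(END)                # start symbol is followed by $
    changed = True
    while changed:
        changed = False
        for lhs, rhs in RULES:
            for i, sym in enumerate(rhs):
                if sym not in VARIABLES:
                    continue
                rest = first_of_string(rhs[i + 1:], first)
                new = rest - {LAMBDA}
                if LAMBDA in rest:        # nothing (non-erasable) follows sym
                    new |= follow[lhs]
                if not new <= follow[sym]:
                    follow[sym] |= new
                    changed = True
    return follow


def build_table(first, follow):
    table = {}
    for lhs, rhs in RULES:
        rhs_first = first_of_string(rhs, first)
        columns = rhs_first - {LAMBDA}
        if LAMBDA in rhs_first:
            columns |= follow[lhs]
        for col in columns:
            table.setdefault((lhs, col), []).append(f"{lhs} → {rhs or LAMBDA}")
    return table


first = compute_first()
follow = compute_follow(first)
table = build_table(first, follow)
# Cells holding more than one rule are what stops the grammar from being LL(1).
conflicts = sorted(cell for cell, rules in table.items() if len(rules) > 1)
print(conflicts)
```

The `conflicts` list holds the (row, column) pairs with more than one rule, so an empty list would mean the table is usable for LL(1) parsing.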