+++
title = 'Deep learning'
template = 'page-math.html'
+++
# Deep learning
## Deep learning systems (autodiff engines)

### Tensors

To scale up backpropagation, we want to move from operations on scalars to operations on tensors.

Tensor: a generalisation of vectors/matrices to higher dimensions, e.g. a 2-tensor has two dimensions, a 4-tensor has four dimensions.

You can represent data as a tensor, e.g. an RGB image is a 3-tensor of the red, green, and blue values for each pixel.

### Functions on tensors

Functions have inputs and outputs, all of which are tensors.

They implement:

- `forward(...)`: compute the outputs given the inputs
- `backward(...)`: compute the gradients over the inputs, given the gradients over the outputs

The modules we chain together are defined in a computation graph:

![](bf4ec9fe629e41389da29c0de7efb63d.png)

A deep learning system uses this graph to execute a computation (forward pass), and does backpropagation to compute the gradients of the output wrt the data nodes (backward pass).

Autodiff engine:

- performs computation by chaining functions
- keeps track of all computation in a computation graph
- when the computation is done, walks backward through the computation graph for backpropagation
- eager evaluation: the graph is built as the computation is performed

## Backpropagation revisited

Functions can have any number of inputs and outputs, which must be tensors.

The final output must be a scalar (i.e. we always take the derivative of a scalar function).

### Multivariate chain rule

How do you take derivatives when variables aren't scalars?

Multiple inputs:

![](edfac0f6027c40c9a9e012e658f54d68.png)

How do you find the derivative with two inputs? Use the multivariate chain rule, i.e. take the single-variable derivative along each input path and sum them:

$\frac{\partial c}{\partial x} = \frac{\partial c}{\partial a} \frac{\partial a}{\partial x} + \frac{\partial c}{\partial b} \frac{\partial b}{\partial x}$

### Backpropagation with tensors - matrix calculus

Start with scalar derivatives: one output over one input (just pick a random one).

Tensor derivative: put all possible scalar derivatives into a tensor.

But how to arrange/order the tensor?

Solution: accumulate the gradient product.

- forward(x): given input x, compute output y
- backward($l_y$): given $l_{y} = \frac{\partial loss}{\partial y}$, compute $\frac{\partial loss}{\partial y} \frac{\partial y}{\partial x}$

Convention: the gradient of A has the same shape as A.

#### Example

Let:

- $k = Wx + b$
- forward(W, x, b): compute $Wx + b$
- backward($l_k$): compute $\frac{\partial l}{\partial k} \frac{\partial k}{\partial W}, \quad \frac{\partial l}{\partial k} \frac{\partial k}{\partial x}, \quad \frac{\partial l}{\partial k} \frac{\partial k}{\partial b}$

Steps:

1. work out a scalar derivative: $\frac{\partial l}{\partial k} \frac{\partial k}{\partial W_{23}}$
2. apply the multivariate chain rule: $\frac{\partial l}{\partial k} \frac{\partial k}{\partial W_{23}} = \ldots = \frac{\partial l}{\partial k_{2}} x_{3}$
3. now we know that $\frac{\partial l}{\partial k} \frac{\partial k}{\partial W_{ij}} = \frac{\partial l}{\partial k_{i}} x_{j}$
4. so, $\frac{\partial l}{\partial k} \frac{\partial k}{\partial W} = \frac{\partial l}{\partial k} x^{T}$ (an outer product; a code sketch of this module is below)
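
To make step 4 concrete, here is a minimal numpy sketch of this forward/backward pair (my own illustration, not from the lecture; the function names are made up):

```python
import numpy as np

def linear_forward(W, x, b):
    """Forward pass: k = Wx + b."""
    return W @ x + b

def linear_backward(W, x, b, l_k):
    """Backward pass: given l_k = d(loss)/dk, return the gradients
    for W, x and b, each with the same shape as its input."""
    grad_W = np.outer(l_k, x)   # d(loss)/dW_ij = l_k[i] * x[j]
    grad_x = W.T @ l_k          # d(loss)/dx_j  = sum_i l_k[i] * W[i, j]
    grad_b = l_k                # d(loss)/db_i  = l_k[i]
    return grad_W, grad_x, grad_b

# tiny usage example
W = np.random.randn(2, 3)
x = np.random.randn(3)
b = np.random.randn(2)
k = linear_forward(W, x, b)
l_k = np.ones_like(k)           # pretend upstream gradient
grad_W, grad_x, grad_b = linear_backward(W, x, b, l_k)
assert grad_W.shape == W.shape and grad_x.shape == x.shape
```

The shapes follow the convention above: `grad_W` has the shape of `W`, and the $x^{T}$ from step 4 shows up as the outer product with `x`.
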
## Making deep neural nets work

### Overcoming vanishing gradients

If the weights of the network are initialized too high, the activations hit the rightmost, flat part of the sigmoid, so the local gradient for each node will be very close to zero, and the network won't start learning.

If they are too negative, the activations hit the leftmost part of the sigmoid, and we get the same problem.

![](8c2398ce91ed4694abf679c536b2cf61.png)

ReLU preserves the derivatives for nodes whose activations it lets through. It kills the derivatives for nodes that produce a negative value, but as long as the network is properly initialised, around half of the values in a batch will produce a positive input for the ReLU.

There is still a risk that during training the network moves to a configuration where a neuron produces a negative input for every instance in the data. In that case we end up with a dead neuron: its gradient will always be zero, and no weights below that neuron will change anymore (unless they also feed into a non-dead neuron).

Initialization:

- assume that the layer input is roughly distributed so that its mean is 0 and its variance is 1 in every direction (standardise/normalise the data so this is true for the first layer)
- the initialisation is designed to pick a random matrix that keeps these properties true

### Minibatch gradient descent

Like stochastic gradient descent, but with small batches of instances instead of single instances.

- smaller batches: closer to stochastic gradient descent, more noise, less parallelism
- bigger batches: more like regular gradient descent, more parallelism; the limit is memory

In general, stay between 16 and 128 instances.

### Optimizers

#### Momentum

If gradient descent is a hiker in a snowstorm, momentum gradient descent is a boulder rolling down the hill.

The gradient doesn't affect the movement directly, but acts as a force on a moving object. If the gradient is zero, the updates continue in the same direction, only slowed down by a 'friction constant' (μ).

Regular gradient descent: $w \leftarrow w - \eta \nabla loss(w)$

With momentum:

- $v \leftarrow \mu v - \eta \nabla loss(w)$
- $w \leftarrow w + v$

#### Nesterov momentum

In regular momentum, the actual step taken is the sum of two vectors: the momentum step (in the direction we took last iteration) and the gradient step (in the direction of steepest descent at the current point).

Nesterov momentum evaluates the gradient after the momentum step, since we are taking that step anyway. This makes the gradient a bit more accurate.

#### Adam

Combines the idea of momentum with the idea that each weight should have its own learning rate.

Normalize the gradients: keep a running mean m and uncentered variance v of the gradient for each parameter, and use the ratio of the two in the update instead of the raw gradient. A rough code sketch follows the update rules below.

Calculations:

- $m \leftarrow \beta_{1} m + (1 - \beta_{1}) \nabla loss(w)$
- $v \leftarrow \beta_{2} v + (1 - \beta_{2}) (\nabla loss(w))^{2}$
- $w \leftarrow w - \eta \frac{m}{\sqrt{v} + \epsilon}$
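
A numpy sketch of this update (my own illustration, not from the lecture; it skips the bias-correction terms that the full Adam algorithm also applies):

```python
import numpy as np

def adam_step(w, grad, m, v, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam-style update. w, grad, m, v are arrays of the same shape;
    returns the updated (w, m, v). Bias correction is omitted for brevity."""
    m = beta1 * m + (1 - beta1) * grad        # running mean of the gradient
    v = beta2 * v + (1 - beta2) * grad ** 2   # running uncentered variance
    w = w - lr * m / (np.sqrt(v) + eps)       # per-parameter step size
    return w, m, v

# usage: minimise loss(w) = ||w||^2, whose gradient is 2w
w = np.array([3.0, -2.0])
m, v = np.zeros_like(w), np.zeros_like(w)
for _ in range(1000):
    grad = 2 * w
    w, m, v = adam_step(w, grad, m, v, lr=0.05)
print(w)   # ends up close to [0, 0]
```
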
### Regularizers

The bigger your model is, the bigger its capacity for overfitting.

Regularizers pull the model back towards simpler models, but don't eliminate more complex solutions.

#### L2 regularizer

"Simpler means smaller parameters"

Take all the parameters and stick them in one vector θ. Then $loss_{reg} = loss + \lambda \|\theta\|_{2}$

Models with bigger weights get a higher loss, but if it's worth it (i.e. the original loss decreases enough), they can still beat simpler models.

If you have a bowl where you want to roll a marble to the lowest point, L2 loss is like tipping the bowl slightly to the right (shifting the lowest point).

#### L1 regulariser

"Simpler means smaller parameters and more zero parameters"

$l_p$ norm: $\|\theta\|_{p} = \sqrt[p]{|w|^{p} + |b|^{p}}$, so the regularised loss is $loss \leftarrow loss + \lambda \|\theta\|_{1}$

If you have a bowl where you want to roll a marble to the lowest point, L1 loss is like using a square bowl -- it has grooves along the dimensions, and the marble is likely to end up in one of the grooves.

#### Dropout regularisation

"Simpler means more robust; during training, randomly disable hidden units"

During training, remove hidden and input nodes, each with probability p. This prevents co-adaptation -- multiple neurons firing together in specific combinations. A small code sketch of the idea is below.

The analogy is that if you can learn how to do a task repeatedly whilst drunk, you should be able to do the task sober. So basically, do all of the practice exams while drunk, and then you'll ace the final while sober (or you'll fail and disprove all of machine learning, choose your destiny). But if anyone asks, I didn't tell you to do that.
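
A minimal numpy sketch of training-time dropout (my own illustration, not from the lecture), using the common "inverted dropout" trick of rescaling the surviving units by 1/(1-p) so nothing has to change at test time:

```python
import numpy as np

def dropout(h, p=0.5, training=True):
    """Zero out each unit of activation vector h with probability p during
    training; scale the survivors by 1/(1-p) so the expected activation is
    unchanged and test time needs no rescaling."""
    if not training or p == 0.0:
        return h
    mask = np.random.rand(*h.shape) >= p   # True for the units we keep
    return h * mask / (1.0 - p)

h = np.ones(10)
print(dropout(h, p=0.5))   # roughly half the entries zero, the rest equal to 2.0
```

At test time you simply pass `training=False`, which is exactly why the inverted scaling is done during training.
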
## Convolutional neural networks

Disclaimer: I'm gonna revise these notes, the prof basically covered all of CNN theory in ten minutes lol. So I don't have much here atm.

The hidden layer has the shape of another image, with more channels.

Hidden nodes are only wired to nearby nodes in the previous layer.

Weights are shared: each hidden node applies the same weights as the other hidden nodes in its layer.

Maxpooling reduces the image dimensions.

## Deep learning vs machine learning

In ML, you chain things together. But chaining modules that are 99% accurate doesn't mean the whole pipeline is 99% accurate, as error accumulates.

In deep learning, we make each module differentiable - ensure that we can work out the **local** gradient, so we can train the pipeline as a whole using backpropagation. This is "end-to-end learning".

It's a lower level of abstraction, giving you smaller building blocks.

## Generators

Visual shorthand:

![](4f24499ecda0424abfc6b408bf663267.png)

How do you turn a neural network into a probability distribution?

- option 1: take the output and interpret it as the parameters of a multivariate normal (μ, Σ)
    - if the output has high dimensions, take Σ to be a diagonal matrix
    - allows the network to communicate how sure it is about the output (i.e. smaller variances in Σ mean it's more sure)
    - allows sampling from the generator, and computing the probability density ![](45614363f80f489eb6424d1ba48915a8.png)
- option 2: start with an MVN, sample a vector from it, feed that vector to the NN, and look at what comes out
    - cannot easily compute the probability density for an instance
    - can easily sample
    ![](1efde6bbc5484b4481db40d089140c0b.png)
- option 3: both, i.e. sample the input from a standard MVN, interpret the output as another MVN, then sample from that
    - the input is called z
    - the space of inputs is the latent space
- naive approach: sample a random point from the data, sample a point from the model, and train on how close they are; the loss could be any distance between tensors, like mean-square error
    - doesn't work -- mode collapse
    - if a generated point is close to a mode, the model should be rewarded, but since it's also far away from some other points, we might compute the loss to a different point
    - the different modes (areas of high probability) of the data distribution end up being averaged into a single point
    - we want the network to imagine details, not average over all possibilities

![](1a5455e19f984f23b9cc90fb4d99d59c.png)

How do you 'fix' mode collapse?

## Generative adversarial networks

If you can generate adversarial examples (i.e. try to break your network), you can also add them to the dataset and then retrain your network.

Generator: takes an input sampled from a standard MVN, produces an image.

Discriminator: takes an image, classifies it as Pos (real) or Neg (fake).

### Vanilla GANs

Training the discriminator:

- feed it examples from the positive class
- train it to classify them as Pos (just nudge the weights with backpropagation)
- sample images from the generator, and train it to classify them as Neg

Training the generator:

- freeze the discriminator
- train the weights of the generator to produce images that the discriminator labels as positive
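
Putting the two together, here is a rough PyTorch-style sketch of one discriminator step followed by one generator step (my own illustration, not from the lecture; the tiny MLPs and the `latent_dim`/`data_dim` sizes are made up just so it runs):

```python
import torch
import torch.nn as nn

latent_dim, data_dim, batch = 2, 4, 32

# made-up toy networks: the generator maps latent noise to "data",
# the discriminator maps "data" to a single real/fake logit
generator = nn.Sequential(nn.Linear(latent_dim, 16), nn.ReLU(), nn.Linear(16, data_dim))
discriminator = nn.Sequential(nn.Linear(data_dim, 16), nn.ReLU(), nn.Linear(16, 1))

opt_g = torch.optim.Adam(generator.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

real_batch = torch.randn(batch, data_dim)        # stand-in for a batch of real data
real_label = torch.ones(batch, 1)
fake_label = torch.zeros(batch, 1)

# --- discriminator step: real -> Pos, generated -> Neg ---
z = torch.randn(batch, latent_dim)               # sample from the standard MVN
fake_batch = generator(z).detach()               # don't backprop into the generator here
loss_d = bce(discriminator(real_batch), real_label) + \
         bce(discriminator(fake_batch), fake_label)
opt_d.zero_grad()
loss_d.backward()
opt_d.step()

# --- generator step: discriminator is "frozen" (only opt_g takes a step) ---
z = torch.randn(batch, latent_dim)
loss_g = bce(discriminator(generator(z)), real_label)   # want fakes labelled Pos
opt_g.zero_grad()
loss_g.backward()
opt_g.step()
```

The `.detach()` in the discriminator step stops gradients from flowing back into the generator there; in the generator step, the discriminator is effectively frozen because only `opt_g` updates any weights.
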
### Conditional GANs

Used if we want the network to generate output probabilistically, i.e. the network has to fill in realistic details.

Make the generator a function, taking an input and mapping it to an output. It uses randomness to imagine specific output details.

Feed the discriminator:

- either an input/output pair from the data, which it should classify as real
- or an input from the data with an output generated by the generator, which it should classify as fake

Train the generator in two ways:
<ol type="a">
<li>freeze the weights of the discriminator, and train the generator to produce stuff that the discriminator will classify as real</li>
<li>feed it an input from the data, and backpropagate on the corresponding output using L1 loss</li>
</ol>

This only works if inputs and outputs are matched; for some tasks, we only have unmatched bags of images in two domains. We can't match them randomly, because of mode collapse. So what do?

### CycleGAN

Add a "cycle consistency term" to the loss function.

E.g. in the horse-to-zebra example, if we transform a horse to a zebra and back, the result should be close to the original image.

So, the new goal:

- train a horse-to-zebra transformer and a zebra-to-horse transformer, such that
    - the horse discriminator can't tell generated horses (and zebras) from real ones
    - the cycle consistency loss for both combined is low

Think of the generators as doing steganography (hiding info in pictures). For example, hiding a horse inside a zebra (picture, obviously).

### StyleGAN

Feed the network the latent vector at each layer.

Since the deconvolution starts with a low-resolution, high-level description of the image, feeding it the latent vector at each layer allows it to use different parts of the vector to describe different aspects of the image ("styles").

The network also receives separate extra random noise per layer, which allows it to make random choices.

We then generate an image for the destination latent vector, but for a few layers (bottom, middle, or top) we use the source latent vector instead.

### What can we do with a generator?

Gotta fill this in.

## Autoencoders

A type of neural network that tries to make the output as close to the input as possible, but with a middle layer (smaller than the input) that functions as a bottleneck.

After the network is trained, that middle layer becomes a compressed representation of the input.

![](e935de30948c46dfabccb5d24b5e1a5e.png)

The blue layer is the latent representation of the input. If the autoencoder works well, we expect to see similar images clustered together.

To find a direction in latent space that we can use to make someone smile, we label instances as smiling and nonsmiling, and draw a vector between their respective means. That's called the smiling vector (god I can't take this shit seriously).

### Turning an autoencoder into a generator

How:

- train an autoencoder
- encode the data to latent variables Z
- fit an MVN to Z
- sample from the MVN
- "decode" the sample

But we're training for reconstruction error, and then turning the result into a generator after the fact. Can we train for maximum likelihood directly?

## Variational autoencoders

Force the decoder to also decode points near z correctly, and force the latent distribution of the data towards N(0, 1). Can be derived from first principles.

Approximate P(z\|x,θ) with a neural network, and make that the q function.

We want to choose the parameters θ (the weights of the neural network) to maximise the log likelihood of the data.

$\ln{P(x|\theta)} = L(q, \theta) + KL(q,p)$ with $p = P(z|x,\theta)$.

- q(z\|x) is any approximation to P(z\|x)
- KL(q, p) is the Kullback-Leibler divergence
- $L(q, \theta) = E_{q} \ln{\frac{P(x,z|\theta)}{q(z|x)}}$

We can't marginalize out the hidden variable z, or compute the probability of z given x. Instead, we use an approximation to the probability of z given x, called q, and optimise both the probability of x given z and of z given x.

![](650e25dac37b4b4db20998694f3f6146.png)

This solves mode collapse, because we map the input to latent space and back to data space, so we know which instance the generated output should look like.

Sorry guys this lecture was hard to follow, I'll finish this part up when I revise for exams.
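
In the meantime, here is a rough sketch (mine, not from the lecture) of how the two forces mentioned above (reconstructing points sampled near z, and pushing q(z\|x) towards N(0, I)) usually show up in code for a standard VAE with diagonal Gaussians; the network sizes are made up:

```python
import torch
import torch.nn as nn

data_dim, latent_dim = 784, 2   # made-up sizes

# encoder outputs the mean and log-variance of q(z|x), a diagonal Gaussian
encoder = nn.Sequential(nn.Linear(data_dim, 64), nn.ReLU(), nn.Linear(64, 2 * latent_dim))
decoder = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, data_dim))

def vae_loss(x):
    mu, logvar = encoder(x).chunk(2, dim=-1)
    std = torch.exp(0.5 * logvar)
    z = mu + std * torch.randn_like(std)          # sample a point near the encoded z
    x_rec = decoder(z)
    rec = ((x - x_rec) ** 2).sum(dim=-1)          # reconstruction error
    # KL(q(z|x) || N(0, I)) in closed form for diagonal Gaussians
    kl = 0.5 * (mu ** 2 + logvar.exp() - 1 - logvar).sum(dim=-1)
    return (rec + kl).mean()

x = torch.rand(16, data_dim)                      # stand-in for a data batch
loss = vae_loss(x)
loss.backward()
```

The `kl` term is what pushes the latent distribution towards N(0, I); the noise added in the sampling line is what forces the decoder to handle points near z, not just z itself.
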