      1 +++
      2 title = 'Deep learning'
      3 template = 'page-math.html'
      4 +++
      5 # Deep learning
      6 ## Deep learning systems (autodiff engines)
      7 
      8 ### Tensors
      9 
     10 To scale up backpropagation, want to move from operations on scalars to tensors.
     11 
     12 Tensor: generalisation of vectors/matrices to higher dimensions. e.g. a 2-tensor
     13 has two dimensions, a 4-tensor has 4 dimensions.
     14 
     15 You can represent data as a tensor. e.g. an RGB image is a 3-tensor of the red,
     16 green, and blue values for each pixel.
     17 
     18 ### Functions on tensors
     19 
     20 Functions have inputs and outputs, all of which are tensors.
     21 
     22 They implement:
     23 
     24 -   `forward(...)`: computing outputs given the inputs
     25 -   `backward(...)`: computing gradients over inputs, given gradients over
     26     outputs
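
As an illustration of this interface (not any particular library's API), here is a sigmoid function written as such a module in numpy, with `forward` and `backward`; the class and method names are just illustrative:

```python
import numpy as np

class Sigmoid:
    """A function as a module: forward computes outputs, backward computes gradients."""

    def forward(self, x):
        # Store the output; we need it again in the backward pass.
        self.y = 1.0 / (1.0 + np.exp(-x))
        return self.y

    def backward(self, grad_y):
        # grad_y is the gradient of the loss over the output y.
        # Chain rule: dloss/dx = dloss/dy * dy/dx, with dy/dx = y * (1 - y).
        return grad_y * self.y * (1.0 - self.y)
```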
     27 
     28 The modules we chain together are defined in a computation graph:
     29 
     30 ![](bf4ec9fe629e41389da29c0de7efb63d.png)
     31 
A deep learning system uses this graph to execute a computation (forward pass),
and does backpropagation to compute the gradients of the output wrt the data
nodes (backward pass).
     35 
Autodiff engine:

-   performs computation by chaining functions
-   keeps track of all computation in a computation graph
-   when the computation is done, walks backward through the computation graph
    for backpropagation
-   eager evaluation: builds the graph as we perform the computation
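
A minimal sketch of such an eager autodiff engine (scalar-only, in the spirit of the bullets above; all names are my own, not a real library):

```python
class Value:
    """Scalar node in a computation graph, built eagerly as we compute."""

    def __init__(self, data, parents=(), local_backward=lambda grad: ()):
        self.data = data
        self.grad = 0.0
        self.parents = parents                  # nodes this value was computed from
        self.local_backward = local_backward    # maps output grad to parent grads

    def __add__(self, other):
        return Value(self.data + other.data, (self, other),
                     lambda grad: (grad, grad))

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other),
                     lambda grad: (grad * other.data, grad * self.data))

    def backward(self):
        # Walk the recorded graph backwards from this (scalar) output node.
        order, visited = [], set()

        def topo(node):
            if node not in visited:
                visited.add(node)
                for p in node.parents:
                    topo(p)
                order.append(node)

        topo(self)
        self.grad = 1.0
        for node in reversed(order):
            for parent, g in zip(node.parents, node.local_backward(node.grad)):
                parent.grad += g

# usage: c = a*x + b, then c.backward() fills x.grad, a.grad, b.grad
a, x, b = Value(2.0), Value(3.0), Value(1.0)
c = a * x + b
c.backward()
print(x.grad)  # 2.0, since dc/dx = a
```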
     43 
     44 ## Backpropagation revisited
     45 
     46 Functions can have any number of inputs and outputs, which must be tensors.
     47 
     48 The final output must be a scalar (i.e. always take derivative of scalar
     49 function).
     50 
     51 ### Multivariate chain rule
     52 
     53 How do you take derivatives when variables aren't scalars?
     54 
     55 Multiple inputs:
     56 
     57 ![](edfac0f6027c40c9a9e012e658f54d68.png)
     58 
How do you find the derivative when there are two inputs? Use the multivariate
chain rule, i.e. take the single-variable derivative along each path and sum
them.
     61 
     62 $\frac{\partial c}{\partial x} = \frac{\partial c}{\partial a} \frac{\partial a}{\partial x} + \frac{\partial c}{\partial b} \frac{\partial b}{\partial x}$
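
For example (a toy case, not from the slides): if $a = x^{2}$, $b = x + 1$ and $c = ab$, then
$\frac{\partial c}{\partial x} = \frac{\partial c}{\partial a} \frac{\partial a}{\partial x} + \frac{\partial c}{\partial b} \frac{\partial b}{\partial x} = b \cdot 2x + a \cdot 1 = 3x^{2} + 2x$.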
     63 
     64 ### Backpropagation with tensors - matrix calculus
     65 
Start with scalar derivatives: one output element over one input element (just
pick an arbitrary pair).
     68 
     69 Tensor derivative: put all possible scalar derivatives into a tensor.
     70 
     71 But how to arrange/order the tensor?
     72 
     73 Solution: accumulate the gradient product.
     74 
     75 forward(x): given input x, compute output y
     76 
     77 backward(l<sub>y</sub>): given $l_{y} = \frac{\partial loss}{\partial y}$, compute
     78 $\frac{\partial loss}{\partial y} \frac{\partial y}{\partial x}$.
     79 
     80 convention: gradient of A has same shape as A
     81 
     82 #### Example:
     83 
     84 Let:
     85 
     86 -   k = Wx + b
     87 -   forward(W, x, b): compute Wx + b
-   backward(l<sub>k</sub>): compute $\frac{\partial l}{\partial k} \frac{\partial k}{\partial W}, \quad \frac{\partial l}{\partial k} \frac{\partial k}{\partial x}, \quad \frac{\partial l}{\partial k} \frac{\partial k}{\partial b}$

Steps:

1.  work out the scalar derivative: $\frac{\partial l}{\partial k} \frac{\partial k}{\partial W_{23}}$
2.  apply the multivariate chain rule: $\frac{\partial l}{\partial k} \frac{\partial k}{\partial W_{23}} = ... = \frac{\partial l}{\partial k_{2}} x_{3}$
3.  now we know that $\frac{\partial l}{\partial k} \frac{\partial k}{\partial W_{ij}} = \frac{\partial l}{\partial k_{i}} x_{j}$
4.  so, $\frac{\partial l}{\partial k} \frac{\partial k}{\partial W} = \frac{\partial l}{\partial k} x^{T}$
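
As a sketch in numpy (module and variable names are mine), the whole example becomes:

```python
import numpy as np

class Linear:
    """Module computing k = Wx + b for a single input vector x."""

    def __init__(self, W, b):
        self.W, self.b = W, b

    def forward(self, x):
        self.x = x                           # keep the input for the backward pass
        return self.W @ x + self.b

    def backward(self, grad_k):
        # grad_k = dloss/dk; accumulate the gradient product for each input.
        grad_W = np.outer(grad_k, self.x)    # dloss/dW = (dloss/dk) x^T, same shape as W
        grad_x = self.W.T @ grad_k           # dloss/dx = W^T (dloss/dk)
        grad_b = grad_k                      # dloss/db = dloss/dk
        return grad_W, grad_x, grad_b
```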
    102 
    103 ## Making deep neural nets work
    104 
    105 ### Overcoming vanishing gradients
    106 
If the weights of the network are initialized too high, the activations will
hit the rightmost, flat part of the sigmoid, so the local gradient for each
node will be very close to zero, and the network won't start learning.

If they are too negative, the activations hit the leftmost part of the sigmoid,
and we get the same problem.
    113 
    114 ![](8c2398ce91ed4694abf679c536b2cf61.png)
    115 
ReLU preserves derivatives for nodes whose activations it lets through. It
kills the derivatives for nodes that produce a negative value, but as long as
the network is properly initialised, around half of the values in a batch will
produce a positive input for the ReLU.

There is still a risk that during training, the network moves to a
configuration where a neuron produces a negative input for every instance in
the data. In that case, we end up with a dead neuron: its gradient will always
be zero, and no weights below that neuron will change anymore (unless they
also feed into a non-dead neuron).
    125 
    126 Initialization:
    127 
    128 -   assume that the layer input is roughly distributed so that its mean is 0 and
    129     variance is 1 in every direction (standardise/normalise data so this is true
    130     for first layer)
    131 -   initialisation designed to pick a random matrix that keeps these properties
    132     true
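
One standard way to do this is Glorot (Xavier) initialisation; a sketch in numpy:

```python
import numpy as np

def glorot_init(fan_in, fan_out):
    # Draw weights uniformly in [-a, a], with a chosen so that the layer
    # roughly preserves the variance of its inputs.
    a = np.sqrt(6.0 / (fan_in + fan_out))
    return np.random.uniform(-a, a, size=(fan_out, fan_in))

W = glorot_init(784, 128)   # e.g. the first layer of an MNIST classifier
```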
    133 
    134 ### Minibatch gradient descent
    135 
    136 Like stochastic gradient descent, but with small batches of instances, instead
    137 of single instances.
    138 
-   smaller batches: closer to stochastic gradient descent, more noisy, less
    parallelism
-   bigger batches: more like regular gradient descent, more parallelism, the
    limit is memory
    143 
    144 In general, stay between 16 and 128 instances.
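
A small sketch of how the minibatches themselves are usually drawn (my own illustration): shuffle once per epoch, then slice.

```python
import numpy as np

def minibatches(X, y, batch_size=32):
    """Yield shuffled (inputs, targets) batches of the given size."""
    idx = np.random.permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        yield X[batch], y[batch]

# for X_batch, y_batch in minibatches(X_train, y_train, batch_size=64):
#     compute the loss on the batch and take one gradient descent step
```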
    145 
    146 ### Optimizers
    147 
    148 #### Momentum
    149 
If gradient descent is a hiker in a snowstorm, momentum gradient descent is a
boulder rolling down the hill.
    152 
The gradient doesn't affect the boulder's movement directly, but acts as a
force on a moving object. If the gradient is zero, the updates continue in the
same direction, but are slowed down by a 'friction constant' (μ).
    156 
    157 Regular gradient descent: $w \leftarrow w - \eta \nabla loss(w)$
    158 
    159 With momentum:
    160 
    161 -   $v \leftarrow \mu v - \eta \nabla loss(w)$
    162 -   $w \leftarrow w + v$
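
The same update as a numpy sketch (`grad_loss` is a hypothetical function returning the gradient of the loss at w):

```python
import numpy as np

def sgd_momentum(w, grad_loss, lr=0.01, mu=0.9, steps=100):
    """Gradient descent with momentum: the gradient acts as a force on a rolling velocity."""
    v = np.zeros_like(w)
    for _ in range(steps):
        v = mu * v - lr * grad_loss(w)   # friction constant mu slows the velocity down
        w = w + v
    return w
```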
    163 
    164 #### Nesterov momentum
    165 
In regular momentum, the actual step taken is the sum of two vectors: the
momentum step (in the direction we took last iteration) and the gradient step
(in the direction of steepest descent at the current point).
    169 
Nesterov momentum evaluates the gradient after the momentum step, since we are
taking that step anyway. This makes the gradient a bit more accurate.
    172 
    173 #### Adam
    174 
    175 Combines idea of momentum with idea that each weight should have its own
    176 learning rate.
    177 
Normalize the gradients: keep a running mean m and uncentered variance v of
the gradient for each parameter. Update by subtracting the mean divided by the
square root of the variance (scaled by the learning rate), instead of the raw
gradient.
    180 
    181 Calculations:
    182 
    183 -   $m \leftarrow \beta_{1} * m + (1 - \beta_{1}) \nabla loss(w)$
    184 -   $v \leftarrow \beta_{2} * v + (1 - \beta_{2}) (\nabla loss(w))^{2}$
    185 -   $w \leftarrow w - \eta \frac{m}{\sqrt{v} + \epsilon}$
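
As a numpy sketch of the simplified update above (the full Adam paper adds a bias correction for m and v, omitted here; `grad_loss` is hypothetical):

```python
import numpy as np

def adam(w, grad_loss, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    """Adam: momentum plus a per-parameter learning rate from the uncentered variance."""
    m = np.zeros_like(w)   # running mean of the gradient
    v = np.zeros_like(w)   # running uncentered variance of the gradient
    for _ in range(steps):
        g = grad_loss(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        w = w - lr * m / (np.sqrt(v) + eps)
    return w
```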
    186 
    187 ### Regularizers
    188 
    189 The bigger your model is, the bigger the capacity for overfitting.
    190 
    191 Regularizers pull the model back towards simpler models, but don't eliminate
    192 more complex solutions.
    193 
    194 #### L2 regularizer
    195 
    196 "Simpler means smaller parameters"
    197 
Take all params, stick them in one vector ("θ"). Then $loss_{reg} = loss + \lambda \|\theta\|$
    200 
    201 Models with bigger weights get higher loss, but if it's worth it (i.e.  original
    202 loss decreases enough), they can still beat simpler models.
    203 
    204 If you have a bowl where you want to roll a marble to the lowest point, L2 loss
    205 is like tipping the bowl slightly to the right (shifting the lowest point).
    206 
    207 #### L1 regulariser
    208 
    209 "Simpler means smaller parameters and more zero parameters"
    210 
lp norm: $\|\theta\|^{p} = \sqrt[p]{w^{p}+b^{p}}$

$loss \leftarrow loss + \lambda \|\theta\|^{1}$
    213 
If you have a bowl where you want to roll a marble to the lowest point, L1 loss
is like using a square bowl -- if the bowl has grooves along the dimensions,
the marble is likely to end up in one of the grooves.
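
Both penalties are a one-line addition to the loss; a sketch (with `loss` a scalar and `theta` the parameter vector):

```python
import numpy as np

def regularised_loss(loss, theta, lam=0.01, kind="l2"):
    """Add an L1 or L2 penalty on the parameter vector theta to a plain loss value."""
    if kind == "l2":
        return loss + lam * np.sqrt(np.sum(theta ** 2))   # lambda * ||theta|| (2-norm)
    return loss + lam * np.sum(np.abs(theta))             # lambda * ||theta|| (1-norm)
```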
    217 
    218 #### Dropout regularisation
    219 
    220 "Simpler means more robust; during training, randomly disable hidden units"
    221 
    222 During training, remove hidden and input nodes, each with probability p.  This
    223 prevents co-adaptation -- multiple neurons firing together in specific
    224 combinations.
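
A minimal sketch of dropout on a layer's activations (this is the 'inverted' variant, which rescales during training so nothing has to change at test time):

```python
import numpy as np

def dropout(h, p_drop=0.5, training=True):
    """Randomly zero out units with probability p_drop; rescale to keep the expected value."""
    if not training:
        return h
    mask = np.random.rand(*h.shape) >= p_drop
    return h * mask / (1.0 - p_drop)
```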
    225 
    226 The analogy is if you can learn how to do a task repeatedly whilst drunk, you
    227 should be able to do the task sober. So basically, do all of the practice exams
    228 while drunk, and then you'll ace the final while sober (or you'll fail and
    229 disprove all of machine learning, choose your destiny). But if anyone asks, I
    230 didn't tell you to do that.
    231 
    232 ## Convolutional neural networks
    233 
    234 Disclaimer: I'm gonna revise these notes, the prof basically covered all of CNN
    235 theory in ten minutes lol. So I don't have much here atm.
    236 
The hidden layer has the shape of another image, usually with more channels.

Hidden nodes are only wired to nearby nodes (a local patch) in the previous layer.

Weights are shared: each hidden node applies the same weights (the same filter) as the other hidden nodes in its layer.

Max pooling reduces the spatial dimensions of the image.
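
A naive numpy sketch of those two ideas (one shared filter slid over local patches, then max pooling); real implementations are far more efficient, this only shows the wiring:

```python
import numpy as np

def conv2d(image, kernel):
    """Slide one shared kernel over the image: every output pixel uses the same weights."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # each hidden node only looks at a local patch of the previous layer
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def maxpool2d(x, size=2):
    """Reduce spatial dimensions by taking the max over non-overlapping size x size blocks."""
    H, W = x.shape
    H2, W2 = H - H % size, W - W % size     # crop so the blocks fit exactly
    x = x[:H2, :W2]
    return x.reshape(H2 // size, size, W2 // size, size).max(axis=(1, 3))
```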
    244 
    245 ## Deep learning vs machine learning
    246 
In ML, you chain separately built modules together. But chaining modules that
are each 99% accurate doesn't mean the whole pipeline is 99% accurate, as
errors accumulate.
    249 
    250 In deep learning, make each module differentiable - ensure that we can work out
    251 **local** gradient, so we can train pipeline as a whole using backpropagation.
    252 This is "end-to-end learning".
    253 
    254 It's a lower level of abstraction, giving you smaller building blocks.
    255 
    256 ## Generators
    257 
    258 Visual shorthand:
    259 
    260 ![](4f24499ecda0424abfc6b408bf663267.png)
    261 
How do you turn a neural network into a probability distribution?
    263 
-   option 1: take the output and interpret it as the parameters of a multivariate normal (μ, Σ)
    -   if the output has high dimensions, take Σ to be a diagonal matrix
    -   allows the network to communicate how sure it is about the output (i.e. smaller variances in Σ mean it's more sure)
    -   allows sampling from the generator, and computing the probability density ![](45614363f80f489eb6424d1ba48915a8.png)
    268 -   option 2: start with an MVN, sample vector from it, feed that vector to the NN, and look at what comes out
    269     -   cannot easily compute prob density for an instance
    270 
    271     -   can easily sample
    272 
    273         ![](1efde6bbc5484b4481db40d089140c0b.png)
    274 -   option 3: both. i.e., sample input from standard MVN, interpret output as another MVN, then sample from that.
    275     -   input is called z
    276     -   space of inputs is the latent space
    277     -   naive approach: sample random point from data, sample point from model, train on how close they are. loss could be any distance between tensors, like mean-square error
    278         -   doesn't work -- mode collapse.
    279         -   if a generated point is close to a mode, the model should be rewarded, but since it's also far away from some other points, we might compute the loss to a different point
    280         -   the different modes (areas of high prob) of data distr end up being averaged into a single point
    281         -   we want network to imagine details, not average over all possibilities
    282 
    283     ![](1a5455e19f984f23b9cc90fb4d99d59c.png)
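
A sketch of option 3 as code (`decoder` is a hypothetical network that maps a latent vector z to a mean and log-variance per output dimension):

```python
import numpy as np

def sample_from_generator(decoder, latent_dim):
    """Sample z from a standard MVN, run the network, then sample from the output MVN."""
    z = np.random.randn(latent_dim)          # input sampled from N(0, I)
    mu, log_var = decoder(z)                 # output interpreted as a diagonal MVN
    return mu + np.exp(0.5 * log_var) * np.random.randn(*mu.shape)
```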
    284 
    285 How do you 'fix' mode collapse?
    286 
    287 ## Generative adversarial networks
    288 
    289 If you can generate adversarial examples (i.e. try to break your network), you can also add them to the dataset and then retrain your network.
    290 
    291 Generator: takes input sampled from standard MVN, produces image
    292 
    293 Discriminator: takes image, classifies as Pos (real) or Neg (fake)
    294 
    295 ### Vanilla GANs
    296 
    297 Training discriminator:
    298 
    299 -   feed examples from positive class
    300 -   train it to classify them as Pos (just nudge the weights with backpropagation)
-   sample images from the generator, and train the discriminator to classify them as Neg (fake)
    302 
    303 Training generator:
    304 
    305 -   freeze discriminator
    306 -   train weights of generator to produce images that the discriminator labels as positive
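
A compact PyTorch sketch of these two alternating updates (toy architectures and hyperparameters of my own choosing, not from the lecture):

```python
import torch
import torch.nn as nn

latent_dim, data_dim = 32, 784

generator = nn.Sequential(
    nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, data_dim))
discriminator = nn.Sequential(
    nn.Linear(data_dim, 256), nn.ReLU(), nn.Linear(256, 1), nn.Sigmoid())

opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
bce = nn.BCELoss()

def train_step(real_batch):
    n = real_batch.shape[0]
    real_labels, fake_labels = torch.ones(n, 1), torch.zeros(n, 1)

    # Train the discriminator: real images -> Pos, generated images -> Neg.
    fake_batch = generator(torch.randn(n, latent_dim)).detach()
    loss_d = bce(discriminator(real_batch), real_labels) + \
             bce(discriminator(fake_batch), fake_labels)
    opt_d.zero_grad()
    loss_d.backward()
    opt_d.step()

    # Train the generator: the discriminator's weights are not updated here
    # (only opt_g steps); push generated images towards the Pos label.
    fake_batch = generator(torch.randn(n, latent_dim))
    loss_g = bce(discriminator(fake_batch), real_labels)
    opt_g.zero_grad()
    loss_g.backward()
    opt_g.step()
```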
    307 
    308 ### Conditional GANs
    309 
Used if we want the network to generate output probabilistically, i.e. the network has to fill in realistic details.
    311 
    312 Make the generator a function, taking input and mapping it to output. Uses randomness to imagine specific output details.
    313 
    314 Feed discriminator:
    315 
    316 -   either input/output pair from data, which it should classify as real
    317 -   or input from data with output generated by generator, which it should classify as fake
    318 
    319 Training generator in two ways:
    320 <ol type="a">
    321   <li>freeze weights of discriminator, train generator to produce stuff that the discriminator will classify as real</li>
    322   <li>feed it an input from data, backpropagate on corresponding output using L1 loss</li>
    323 </ol>
    324 
This only works if inputs and outputs are matched; for some tasks, we only have unmatched bags of images in two domains. We can't match them randomly, because of mode collapse. So what do?
    326 
    327 ### CycleGAN
    328 
    329 Add "cycle consistency term" to loss function.
    330 
    331 E.g. in horse-to-zebra example, if transform horse to zebra and back, result should be close to original image.
    332 
    333 So, new goal:
    334 
    335 -   train horse-to-zebra transformer and zebra-to-horse transformer, such that
    336 -   horse-discriminator can't tell generated horses (and zebras) from real ones
    337 -   cycle consistency loss for both combined is low
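
As a sketch, the cycle consistency term could look like this (`horse2zebra` and `zebra2horse` are hypothetical generator networks; the L1 distance is computed with torch):

```python
import torch

def cycle_consistency_loss(horse_batch, zebra_batch, horse2zebra, zebra2horse):
    """Transforming to the other domain and back should give back the original image."""
    horse_cycle = zebra2horse(horse2zebra(horse_batch))
    zebra_cycle = horse2zebra(zebra2horse(zebra_batch))
    return torch.mean(torch.abs(horse_cycle - horse_batch)) + \
           torch.mean(torch.abs(zebra_cycle - zebra_batch))
```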
    338 
    339 Think of generators doing steganography (hiding info in pictures). For example, hiding a horse inside a zebra (picture, obviously).
    340 
    341 ### StyleGAN
    342 
    343 Feed the network the latent vector at each layer.
    344 
    345 Since deconvolution starts with low resolution, high level description of image, feeding it latent vector at each layer allows it to use different parts of the vector to describe different aspects of the image ("styles").
    346 
    347 Network also receives separate extra random noise per layer, which allows it to make random choices.
    348 
Style mixing: generate the image for a destination latent vector, but for a few layers (bottom, middle, or top) use the source latent vector instead.
    350 
    351 ### What can we do with a generator?
    352 
    353 Gotta fill this in.
    354 
    355 ## Autoencoders
    356 
    357 A type of neural network that tries to make output as close to input as possible, but there is a middle layer (smaller than input) that functions as a bottleneck.
    358 
    359 After network is trained, that layer becomes a compressed representation of the input.
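
A minimal PyTorch sketch (toy sizes of my own choosing); the 32-dimensional middle layer is the bottleneck:

```python
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 32))
decoder = nn.Sequential(nn.Linear(32, 256), nn.ReLU(), nn.Linear(256, 784), nn.Sigmoid())

def reconstruction_loss(x):
    # The output should be as close to the input as possible.
    z = encoder(x)                     # compressed (latent) representation
    return F.mse_loss(decoder(z), x)
```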
    360 
    361 ![](e935de30948c46dfabccb5d24b5e1a5e.png)
    362 
The blue layer is the latent representation of the input. If the autoencoder works well, we expect to see similar images clustered together.

To find a direction in latent space that we can use to make someone smile, we label instances as smiling and non-smiling, and draw a vector between their respective means. That's called the smiling vector (god I can't take this shit seriously)
    366 
    367 ### Turning an autoencoder into a generator
    368 
    369 How:
    370 
    371 -   train an autoencoder
    372 -   encode the data to latent variables Z
    373 -   fit MVN to Z
    374 -   sample from the MVN
    375 -   "decode" the sample
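
A sketch of these steps (`encode` and `decode` are hypothetical functions wrapping a trained autoencoder, working on numpy arrays):

```python
import numpy as np

def fit_and_sample(encode, decode, data, n_samples=16):
    """Fit an MVN to the latent codes of the data, sample from it, and decode the samples."""
    Z = np.stack([encode(x) for x in data])          # latent variables of all instances
    mu, sigma = Z.mean(axis=0), np.cov(Z, rowvar=False)
    z_new = np.random.multivariate_normal(mu, sigma, size=n_samples)
    return np.stack([decode(z) for z in z_new])      # "decoded" samples = generated data
```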
    376 
But we're training for reconstruction error, and only afterwards turning the result into a generator. Can we train for maximum likelihood directly?
    378 
    379 ## Variational autoencoders
    380 
    381 Force decoder to also decode points near z correctly, and force latent distribution of data towards N(0,1). Can be derived from first principles.
    382 
Approximate P(z \| x,θ) with a neural network, and make that the q function.
    384 
    385 Want to choose parameters θ (weights of neural network) to maximise log likelihood of data.
    386 
$\ln{P(x|\theta)} = L(q, \theta) + KL(q,p)$ with $p = P(z|x,\theta)$.
    388 
-   q(z\|x): any approximation to P(z\|x)
-   KL(q, p): the Kullback-Leibler divergence between q and p
-   $L(q, \theta) = E_{q} \ln{\frac{P(x,z|\theta)}{q(z|x)}}$
    392 
We can't marginalize out the hidden variable z, or compute the probability of z given x. Instead, we use an approximation to the probability of z given x, called q, and optimise both the probability of x given z (the decoder) and of z given x (the encoder).
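
A hedged sketch of how this is typically implemented (reparameterised sampling from q, plus the closed-form KL term between a diagonal Gaussian q and N(0, I); `encoder` and `decoder` are hypothetical networks):

```python
import torch
import torch.nn.functional as F

def vae_loss(x, encoder, decoder):
    # q(z|x): a diagonal Gaussian whose parameters come from the encoder network.
    mu, log_var = encoder(x)
    # Sample z from q using the reparameterisation trick, so gradients can flow.
    z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)
    # Reconstruction term (a Gaussian output with fixed variance, i.e. MSE)
    # plus the closed-form KL divergence between q(z|x) and N(0, I).
    reconstruction = F.mse_loss(decoder(z), x, reduction="sum")
    kl = -0.5 * torch.sum(1 + log_var - mu ** 2 - torch.exp(log_var))
    return reconstruction + kl
```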
    394 
    395 ![](650e25dac37b4b4db20998694f3f6146.png)
    396 
    397 Solves mode collapse, because we map input to latent space and back to data space, so we know which instance the generated output should look like.
    398 
    399 Sorry guys this lecture was hard to follow, I'll finish this part up when I revise for exams.