
+++
title = 'Linear models'
template = 'page-math.html'
+++
# Linear models
## Defining a model

1 feature x: $f_{w,b}(x) = wx + b$

2 features x<sub>1</sub>, x<sub>2</sub>: $f_{w_1,w_2, b}(x_1, x_2) = w_1 x_1 + w_2 x_2 + b$
     11 
Generally,

$
\begin{aligned}
f_{w, b}(x) &= w_1 x_1 + w_2 x_2 + w_3 x_3 + \dots + b \\
            &= w^T x + b \\
            &= \sum_{i} w_i x_i + b \\
            &= \|w\| \|x\| \cos{\alpha} + b
\end{aligned}
$

with $w = \begin{pmatrix} w_1 \\ \dots \\ w_n \end{pmatrix}$ and $x = \begin{pmatrix} x_1 \\ \dots \\ x_n \end{pmatrix}$
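
As a concrete sketch (my own illustration, not from the lecture; the
numbers are made up), evaluating such a model in NumPy:

```
import numpy as np

w = np.array([0.5, -1.2, 2.0])  # example weight vector
b = 0.3                         # example bias
x = np.array([1.0, 0.0, 2.0])   # example feature vector

y = np.dot(w, x) + b            # f(x) = w^T x + b
print(y)                        # 4.8
```
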
## But which model fits best?

Define a loss function, then search for the model that minimises the
loss function.

### Mean squared error loss
$\text{loss}_{x,y}(p) = \frac{1}{n} \sum_j (f_p (x^j) - y^j)^2$

The residuals $f_p (x^j) - y^j$ measure how far the model's predictions
are from the true values.

Why square? It makes everything positive, but also penalises outliers
more heavily.
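
A quick sketch of this loss in NumPy (illustrative; `X` holds one
example per row, `y` the true values):

```
import numpy as np

def mse_loss(w, b, X, y):
    """Mean squared error of the linear model f(x) = w^T x + b."""
    predictions = X @ w + b         # f_p(x^j) for every example x^j
    residuals = predictions - y     # how far each prediction is off
    return np.mean(residuals ** 2)  # mean of the squared residuals
```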

### Optimization & searching

$\hat p = \argmin_p \text{loss}_{x,y}(p)$

To escape local minima: add randomness, add multiple models.

To converge faster: combine known good models (breeding), inspect the
local neighbourhood.

#### Black box optimisation

Simple: you only need to be able to compute the loss function for a
given model; the optimiser treats the model itself as a black box.

##### Random search

Start with a random point p in model space.

```
loop:
  pick random point p' close to p
  if loss(p') < loss(p):
    p <- p'
```

You need to define what 'close to' means though.
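
A runnable sketch of this loop (my own illustration; here 'close to'
is taken to mean a small Gaussian perturbation, which is an assumption,
not something fixed by the lecture):

```
import numpy as np

def random_search(loss, p, steps=1000, scale=0.1, rng=None):
    """Black-box random search: try a nearby point, keep it if better."""
    rng = rng or np.random.default_rng()
    for _ in range(steps):
        p_new = p + rng.normal(0, scale, size=p.shape)  # point close to p
        if loss(p_new) < loss(p):
            p = p_new
    return p

# Example: a simple convex loss with its minimum at (3, -1).
loss = lambda p: (p[0] - 3) ** 2 + (p[1] + 1) ** 2
print(random_search(loss, np.zeros(2)))  # close to [3, -1]
```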

Convexity: the property of having one minimum (i.e. for any two points
on the function, the line segment between them lies above the
function).

The issue with random search is that it can get stuck in a local
minimum. In many situations local minima are fine; we don't *always*
need an algorithm with a guaranteed global minimum.

In discrete model spaces (which have a more graph-like
structure), you need to figure out a transition function.

##### Simulated annealing

'Improved' random search: sometimes accept a worse point, so the
search can escape local minima.

```
pick random point p in model space
loop:
  pick random point p' close to p
  if loss(p') < loss(p):
    p <- p'
  else:
    with small probability, p <- p'
```

The acceptance probability typically depends on how much worse p' is,
and decreases over time (the 'temperature' is lowered).
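
A sketch of the acceptance rule (my own illustration, using the common
Metropolis-style criterion; the lecture may have formulated it
differently):

```
import numpy as np

def accept(loss_old, loss_new, temperature, rng):
    """Always accept improvements; accept a worse point with probability
    exp(-(loss_new - loss_old) / temperature)."""
    if loss_new < loss_old:
        return True
    return rng.random() < np.exp(-(loss_new - loss_old) / temperature)
```

Lowering the temperature over time makes the search behave more and
more like plain random search.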

##### Parallel search

Can also do these searches in parallel, or even in parallel with
some communication between searches.

Population methods, e.g. evolutionary algorithms:

```
start with population of k models
loop:
  rank population by loss
  remove the half with worst loss
  "breed" new population of k models
  optionally, add a little noise to each child
```
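
A minimal sketch of this loop (illustrative; "breeding" is implemented
here as averaging two random parents, one common choice among many):

```
import numpy as np

def evolutionary_search(loss, k, dim, steps=500, noise=0.1, rng=None):
    rng = rng or np.random.default_rng()
    population = rng.normal(size=(k, dim))        # population of k models
    for _ in range(steps):
        order = np.argsort([loss(p) for p in population])
        survivors = population[order[: k // 2]]   # drop the worst half
        parents = rng.integers(0, k // 2, size=(k, 2))
        children = survivors[parents].mean(axis=1)  # "breed" k new models
        population = children + rng.normal(0, noise, size=(k, dim))
    return min(population, key=loss)

loss = lambda p: (p[0] - 3) ** 2 + (p[1] + 1) ** 2
print(evolutionary_search(loss, k=20, dim=2))  # close to [3, -1]
```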

##### Branching search

Coming closer to gradient descent:

```
pick random point p in model space
loop:
  pick k random points {p_i} close to p
  p' <- argmin_{p_i} loss(p_i)
  if loss(p') < loss(p):
    p <- p'
```

#### Gradient descent

Good, but doesn't help with global/local minima.

In 2D, the slope of the tangent line is the derivative. In higher
dimensions, the gradient is the analog of the slope, and the tangent
plane/hyperplane takes the place of the tangent line.

Gradient:
$\nabla f(x, y) = (\frac{\partial f}{\partial x}, \frac{\partial f}{\partial y})$

Tangent hyperplane: $g(x) = \nabla f(p)^T x + c$

This gives the best linear approximation of f at point p.

The direction of steepest ascent:

$
\begin{aligned}
g(x) &= w^T x \\
     &= \|w\| \|x\| \cos{\alpha} \\
\text{with } \|x\| &= 1: \\
     &\rightarrow \|w\| \cos{\alpha}
\end{aligned}
$

This is maximised when cos(α) is 1, i.e. when α is 0. So the gradient
is the direction of steepest ascent.

```
pick a random point p in model space
loop:
  p <- p - \eta \nabla loss(p)
```

Usually set η (step size, learning rate) between 0.0001 and 0.1.

Work out the partial derivatives of the loss function symbolically,
then evaluate them at the current point.
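
For instance, a runnable sketch of gradient descent on the MSE loss of
the 1-feature linear model (my own illustration; the data is made up
to roughly follow y = 2x + 1):

```
import numpy as np

X = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.2, 7.1])

w, b, eta = 0.0, 0.0, 0.01     # starting point and learning rate

for _ in range(5000):
    residuals = w * X + b - y          # f_p(x^j) - y^j
    dw = 2 * np.mean(residuals * X)    # ∂loss/∂w
    db = 2 * np.mean(residuals)        # ∂loss/∂b
    w, b = w - eta * dw, b - eta * db  # p <- p - η ∇loss(p)

print(w, b)  # ≈ 2.03, ≈ 1.03
```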

Cons:

-   only works for continuous model spaces, with smooth loss
    functions, for which we can work out the gradient
-   does not escape local minima

Pros:

-   very fast, low memory
-   very accurate

If the model is linear, you don't actually need to search: you can
just set the partial derivatives equal to zero and solve.
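
For the least-squares loss this closed-form solution is the normal
equations; a sketch (my own illustration, reusing the made-up data from
above):

```
import numpy as np

X = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.2, 7.1])

# Setting the partial derivatives of the MSE to zero gives a linear
# system (the normal equations): A^T A p = A^T y, with p = (w, b).
A = np.column_stack([X, np.ones_like(X)])
w, b = np.linalg.solve(A.T @ A, A.T @ y)
print(w, b)  # ≈ 2.03, ≈ 1.03 -- the optimum gradient descent approaches
```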

Sometimes the loss function shouldn't be the same as the evaluation
function, because the evaluation function may not be smooth.

#### Classification losses
##### Least-squares loss
Applying the least-squares calculation to the classifier gives a
smooth function, on which you can do gradient descent.

## Neural networks (feedforward)

### Overview

Learns a feature extractor together with the classifier.

A neuron has inputs (dendrites) and one output (axon). The simplified
version for computers is the 'perceptron':

-   inputs are features (x)
-   multiply each input with a weight (w)
-   add a bias node (b)
-   y = w<sub>1</sub>x<sub>1</sub> + w<sub>2</sub>x<sub>2</sub> + b
-   output class A if y > 0, otherwise class B

Nonlinearity:

-   sigmoid function $\sigma(x) = \frac{1}{1+e^{-x}}$

    ![](b149c9058e4548719393205f11b3fd74.png)

-   ReLU
    $r(x) = \begin{cases} x &\text{if } x > 0 \\ 0 &\text{otherwise} \end{cases}$

    ![](f9abb2f4be3640919753fd9709e0e764.png)
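
Both activations are one-liners in NumPy (shown for illustration):

```
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))  # squashes any input into (0, 1)

def relu(x):
    return np.maximum(x, 0)      # passes positives through, zeroes negatives
```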

Feedforward network: a multilayer perceptron -- hidden layer(s) between
input and output layers.

Every edge has a weight, and the network learns by adapting the weights.

It trains both the feature extractor and the linear model at once.

### Classification

Binary:

-   add a sigmoid to the output layer
-   the output is then the probability that the class is positive given
    the input

Multiclass:

-   softmax activation:
-   for output nodes o: o<sub>i</sub> = w<sub>i</sub><sup>T</sup>h + b<sub>i</sub>
-   then result $y_i = \frac{\exp(o_i)}{\sum_{j}\exp(o_{j})}$
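
A sketch of softmax (illustrative; subtracting the maximum is a
standard numerical-stability trick, not something from the lecture):

```
import numpy as np

def softmax(o):
    exps = np.exp(o - np.max(o))  # subtract max for numerical stability
    return exps / exps.sum()      # y_i = exp(o_i) / sum_j exp(o_j)

print(softmax(np.array([1.0, 2.0, 3.0])))  # probabilities, sum to 1
```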

### Dealing with loss - gradient descent & backpropagation

Stochastic gradient descent:

1.  Pick random weights w for the whole model
2.  loop:
    -   for x in X: $w \leftarrow w - \eta\nabla \text{loss}_{x}(w)$

For complex models, symbolic and numeric methods for computing the
gradient are expensive. Use backpropagation:

-   break the computation down into a chain of modules
-   work out the local derivative of each module symbolically (like you
    would on paper)
-   do a forward pass for a given input x: compute f(x), remembering the
    intermediate values
-   compute the local derivatives for x, and multiply them to get the
    global derivative (because of the chain rule)

For a feedforward network, you take the derivative of the loss function
with respect to the weights.
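
As an illustration of the module/chain-rule idea (my own example, not
from the lecture): compute f(x) = σ(wx + b) and its derivative with
respect to w by multiplying local derivatives:

```
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

w, b, x = 0.5, -1.0, 2.0

# Forward pass: a chain of modules, remembering intermediate values.
z = w * x + b        # module 1: linear
y = sigmoid(z)       # module 2: sigmoid

# Backward pass: local derivatives, multiplied together (chain rule).
dy_dz = y * (1 - y)  # local derivative of the sigmoid module
dz_dw = x            # local derivative of the linear module w.r.t. w
dy_dw = dy_dz * dz_dw
print(dy_dw)         # 0.5
```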

## Support vector machines (SVMs)

Uses a kernel to expand the feature space.

Margin: the distance from the decision boundary to the nearest positive
and negative points, which should be as big as possible.

Support vectors: the points that the margin just touches.

![](610c2acbde354f9fb8e54b0a9efb4b1f.png)

The support vector machine tries to find this line.

Objective of SVM:

-   maximize 2x the size of the margin
-   such that all positive points are at or above 1, and all negative
    points are at or below -1
-   hard margin SVM:
    -   minimize $\frac{1}{2} \|w\|$
    -   s.t. y<sup>i</sup>(w<sup>T</sup>x<sup>i</sup> + b)  ≥ 1 for all x<sup>i</sup>
    -   but if the data is not linearly separable, this constraint
        cannot be satisfied
-   soft margin SVM:
    -   minimize $\frac{1}{2} \|w\| + C \sum_{i}p^{i}$, with p<sup>i</sup>  ≥ 0
    -   s.t. y<sup>i</sup>(w<sup>T</sup>x<sup>i</sup> + b)  ≥ 1 - p<sup>i</sup> for all x<sup>i</sup>

![](36c6819f0f3a4a7381214b8baf48b2f1.png)

For loss, two options:

-   express everything in terms of w, get rid of constraints:
    -   allows gradient descent
    -   good for neural networks
    -   get SVM loss:
        -   p<sup>i</sup> = max(0, 1 - y<sup>i</sup>(w<sup>T</sup>x<sup>i</sup> + b))
        -   $\frac{1}{2} \|w\| + C\sum_{i}\max(0, 1 - y^{i}(w^{T}x^{i}+b))$
        -   no constraints
-   express everything in terms of the support vectors, get rid of w
    -   doesn't allow error backpropagation
    -   allows the kernel trick (see the sketch after this list):
        -   if you have an algorithm which operates only on dot products
            of instances, you can substitute the dot product for a
            kernel function
        -   a kernel function k(x<sup>i</sup>, x<sup>j</sup>) computes the dot product of x<sup>i</sup>
            and x<sup>j</sup> in a high-dimensional feature space, without
            explicitly computing the features themselves
        -   polynomial kernel: k(a,b) = (a<sup>T</sup>b + 1)<sup>d</sup>
            -   feature space for d=2: all squares, all cross products,
                all single features
            -   feature space for d=3: all cubes and squares, all
                two-way and three-way cross products, all single
                features
        -   RBF kernel: $k(a,b) = \exp(-\gamma \|a-b\|)$, feature space
            is infinite-dimensional
    -   have to optimise under constraints: Lagrange multipliers
        -   minimize f(a) such that g<sub>i</sub>(a)  ≥ 0 for i  ∈ \[1, *n*\]
        -   $L(a, \alpha_{1}, ..., \alpha_n) = f(a) - \sum_{i} \alpha_{i}g_{i}(a)$
        -   solve $\nabla L = 0$ such that α<sub>i</sub>  ≥ 0 for i  ∈ \[1, *n*\]
    -   result:
        -   maximize $-\frac{1}{2} \sum_i \sum_j \alpha^i \alpha^j y^i y^j k(x^i, x^j) + \sum_i \alpha^i$
        -   such that 0 ≤ α<sup>i</sup> ≤ C and $\sum_i \alpha^{i} y^{i} = 0$
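
A small check of the kernel-trick claim for the polynomial kernel with
d=2 (my own illustration): the kernel value equals an explicit dot
product in the expanded feature space.

```
import numpy as np

def poly_kernel(a, b, d=2):
    return (np.dot(a, b) + 1) ** d

def expand(v):
    """Explicit d=2 feature space: squares, a scaled cross product,
    scaled single features, and a constant."""
    x1, x2 = v
    return np.array([x1**2, x2**2, np.sqrt(2)*x1*x2,
                     np.sqrt(2)*x1, np.sqrt(2)*x2, 1.0])

a, b = np.array([1.0, 2.0]), np.array([3.0, 0.5])
print(poly_kernel(a, b))      # 25.0
print(expand(a) @ expand(b))  # 25.0, features computed explicitly
```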

## Summary of classification loss functions

![](b08d80ef6b5241578c3d432a466db7ea.png)