+++
title = 'Linear models'
template = 'page-math.html'
+++

# Linear models
## Defining a model

1 feature x: $f_{w,b}(x) = wx + b$

2 features x<sub>1</sub>, x<sub>2</sub>: $f_{w_1,w_2, b}(x_1, x_2) = w_1 x_1 + w_2 x_2 + b$

Generally,

$
\begin{aligned}
f_{w, b}(x) &= w_1 x_1 + w_2 x_2 + w_3 x_3 + ... + b \\
&= w^T x + b \\
&= \sum_{i} w_i x_i + b \\
&= \|w\| \|x\| \cos{\alpha} + b
\end{aligned}
$

with $w = \begin{pmatrix} w_1 \\ \vdots \\ w_n \end{pmatrix}$ and $x = \begin{pmatrix} x_1 \\ \vdots \\ x_n \end{pmatrix}$

## But which model fits best?

Define a loss function, then search for the model that minimises that loss function.

### Mean squared error loss

$\text{loss}_{x,y}(p) = \frac{1}{n} \sum_j (f_p(x^j) - y^j)^2$

The residuals $f_p(x^j) - y^j$ measure how far each prediction is from the true value.

Why square? It makes every term positive, and it penalises outliers more heavily.

### Optimization & searching

$\hat p = \argmin_p \text{loss}_{x,y}(p)$

To escape local minima: add randomness, use multiple models.

To converge faster: combine known good models (breeding), inspect the local neighbourhood.

#### Black box optimisation

Simple: you only need to be able to compute the loss function, no gradients or other knowledge of the model's internals.

##### Random search

Start with a random point p in model space.

```
loop:
    pick random point p' close to p
    if loss(p') < loss(p):
        p <- p'
```

You need to define what 'close to' means though.

Convexity: the property of having one minimum (i.e. for any two points on the function, the line between those points lies above the function).

The issue with random search is that it can get stuck in a local minimum. In many situations local minima are fine; we don't *always* need an algorithm with a guaranteed global minimum.

In discrete model spaces (which have a more graph-like structure), you need to define a transition function instead of 'close to'.

##### Simulated annealing

'Improved' random search: sometimes accept a worse point, with a small probability, so the search can escape local minima.

```
pick random point p in model space
loop:
    pick random point p' close to p
    if loss(p') < loss(p):
        p <- p'
    else:
        with small probability: p <- p'
```

##### Parallel search

Can also run these searches in parallel, or even in parallel with some communication between the searches.

Population methods, e.g. evolutionary algorithms:

```
start with population of k models
loop:
    rank population by loss
    remove the half with worst loss
    "breed" new population of k models
    optionally, add a little noise to each child
```

##### Branching search

Coming closer to gradient descent:

```
pick random point p in model space
loop:
    pick k random points {p_i} close to p
    p' <- argmin_{p_i} loss(p_i)
    if loss(p') < loss(p):
        p <- p'
```

#### Gradient descent

Works well, but doesn't solve the problem of local vs. global minima.

In 2D, the slope of the tangent line is the derivative. In higher dimensions, the gradient is the analog of the slope, and gives a tangent (hyper)plane.

Gradient:
$\nabla f(x, y) = (\frac{\partial f}{\partial x}, \frac{\partial f}{\partial y})$

Tangent hyperplane: $g(x) = \nabla f(p)^T x + c$

This gives the best linear approximation of f at the point p.
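As a quick numerical check of this claim (a minimal sketch of my own, not from the lecture; the function $f$, the point $p$ and the step direction are arbitrary choices), we can compare $f(x)$ with the tangent-plane value $g(x) = f(p) + \nabla f(p)^T (x - p)$ for points increasingly close to $p$:

```python
# Minimal sketch: the tangent (hyper)plane at p is the best linear approximation of f.
# The function f, the point p and the direction are arbitrary (assumed) choices.
import numpy as np

def f(v):
    x, y = v
    return x**2 + 3*x*y + y**2

def grad_f(v):
    # partial derivatives of f, worked out symbolically
    x, y = v
    return np.array([2*x + 3*y, 3*x + 2*y])

p = np.array([1.0, -2.0])
direction = np.array([0.6, 0.8])                  # a unit vector
for eps in [1.0, 0.1, 0.01]:
    x = p + eps * direction
    tangent = f(p) + grad_f(p) @ (x - p)          # g(x), with c chosen so that g(p) = f(p)
    print(eps, abs(f(x) - tangent))               # error shrinks much faster than eps
```

The error shrinks roughly quadratically in the step size, which is what makes the tangent plane the best linear approximation around $p$.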
The direction of steepest ascent:

$
\begin{aligned}
g(x) &= w^T x \\
&= \|w\| \|x\| \cos{\alpha} \\
&= \|w\| \cos{\alpha} \quad \text{for unit vectors with } \|x\| = 1
\end{aligned}
$

This is maximised when $\cos{\alpha} = 1$, i.e. when $\alpha = 0$. So the gradient is the direction of steepest ascent.

```
pick a random point p in model space
loop:
    p <- p - η ∇loss(p)
```

Usually set η (step size, learning rate) between 0.0001 and 0.1.

Work out the partial derivatives of the loss function symbolically, then evaluate them at the current point.

Cons:

- only works for continuous model spaces, with smooth loss functions, for which we can work out the gradient
- does not escape local minima

Pros:

- very fast, low memory
- very accurate

If the model is linear, you don't actually need to search: you can just set the partial derivatives equal to zero and solve.

Sometimes the loss function shouldn't be the same as the evaluation function, because the evaluation function may not give a smooth function to descend on.

#### Classification losses
##### Least-squares loss
Apply the least-squares calculation and you get a smooth function, so you can do gradient descent.

## Neural networks (feedforward)

### Overview

Learns a feature extractor together with the classifier.

A neuron has inputs (dendrites) and one output (axon). The simplified version for computers is the 'perceptron':

- inputs are features (x)
- multiply each input with a weight (w)
- add a bias node (b)
- y = w<sub>1</sub>x<sub>1</sub> + w<sub>2</sub>x<sub>2</sub> + b
- output class A if y > 0, otherwise class B

Nonlinearity:

- sigmoid function $\sigma(x) = \frac{1}{1+e^{-x}}$

![](b149c9058e4548719393205f11b3fd74.png)

- ReLU
  $r(x) = \begin{cases} x &\text{if } x > 0 \\ 0 &\text{otherwise} \end{cases}$

![](f9abb2f4be3640919753fd9709e0e764.png)

Feedforward network: a multilayer perceptron, with hidden layer(s) between the input and output layers.

Every edge has a weight, and the network learns by adapting the weights.

It trains both the feature extractor and the linear model at once.

### Classification

Binary:

- add a sigmoid to the output layer
- the output is then the probability that the class is positive, given the input

Multiclass:

- softmax activation:
    - for each output node: $o_i = w_i^T h + b_i$
    - then $y_i = \frac{\exp(o_i)}{\sum_{j}\exp(o_{j})}$

### Dealing with loss - gradient descent & backpropagation

Stochastic gradient descent:

1. Pick random weights w for the whole model
2. loop:
    - for x in X: $w \leftarrow w - \eta\nabla \text{loss}_{x}(w)$

For complex models, symbolic and numeric methods for computing the gradient are expensive. Use backpropagation instead:

- break the computation down into a chain of modules
- work out the local derivative of each module symbolically (like you would on paper)
- do a forward pass for a given input x: compute f(x), remembering the intermediate values
- compute the local derivatives for x, and multiply them to get the global derivative (chain rule)

For a feedforward network, you take the derivative of the loss function with respect to the weights.
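To make these steps concrete, here is a minimal sketch (my own illustration, not from the lecture; the network size, data point and learning rate are arbitrary assumptions) of backpropagation for a tiny feedforward network: one sigmoid hidden layer, a linear output, and squared-error loss on a single example.

```python
# Minimal backprop sketch: one hidden layer with sigmoid activations, linear output,
# squared-error loss on a single (assumed) example (x, y).
import numpy as np

rng = np.random.default_rng(0)
x = np.array([0.5, -1.0])        # input features (assumed)
y = 1.0                          # target (assumed)

W = rng.normal(size=(3, 2))      # input -> hidden weights
b = np.zeros(3)
v = rng.normal(size=3)           # hidden -> output weights
c = 0.0

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

eta = 0.1
for step in range(1000):
    # forward pass, remembering intermediate values
    k = W @ x + b                # hidden pre-activations
    h = sigmoid(k)               # hidden activations
    o = v @ h + c                # output
    loss = (o - y) ** 2

    # backward pass: local derivatives, multiplied together (chain rule)
    d_o = 2 * (o - y)            # d loss / d o
    d_v = d_o * h                # d loss / d v
    d_c = d_o                    # d loss / d c
    d_h = d_o * v                # d loss / d h
    d_k = d_h * h * (1 - h)      # sigmoid'(k) = h (1 - h)
    d_W = np.outer(d_k, x)       # d loss / d W
    d_b = d_k                    # d loss / d b

    # gradient descent step on all weights
    v -= eta * d_v; c -= eta * d_c
    W -= eta * d_W; b -= eta * d_b

print(loss)                      # close to 0 after fitting this single example
```

Each `d_*` line is one local derivative; multiplying them together along the computation is exactly the chain rule described above.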
## Support vector machines (SVMs)

Uses a kernel to expand the feature space.

Margin: the distance from the decision boundary to the nearest positive and negative points; the SVM looks for the boundary that makes this as big as possible.

Support vectors: the points that the margin just touches.

![](610c2acbde354f9fb8e54b0a9efb4b1f.png)

The support vector machine tries to find this line.

Objective of the SVM:

- maximize twice the size of the margin: the distance between the lines $w^T x + b = 1$ and $w^T x + b = -1$ is $\frac{2}{\|w\|}$, so maximising it amounts to minimising $\|w\|$
- such that all positive points are at or above the line $w^T x + b = 1$, and all negative points are at or below the line $w^T x + b = -1$
- hard margin SVM:
    - minimize $\frac{1}{2} \|w\|$
    - such that $y^i(w^T x^i + b) \geq 1$ for all $x^i$
    - but if the data is not linearly separable, this constraint cannot be satisfied
- soft margin SVM:
    - minimize $\frac{1}{2} \|w\| + C \sum_{i} p^{i}$, with slack variables $p^i \geq 0$
    - such that $y^i(w^T x^i + b) \geq 1 - p^i$ for all $x^i$

![](36c6819f0f3a4a7381214b8baf48b2f1.png)

For the loss, two options:

- express everything in terms of w, and get rid of the constraints:
    - allows gradient descent
    - good for neural networks
    - gives the SVM (hinge) loss:
        - $p^i = \max(0,\ 1 - y^i(w^T x^i + b))$
        - minimize $\frac{1}{2} \|w\| + C \sum_{i} \max(0,\ 1 - y^i(w^{T}x^{i}+b))$
        - no constraints
- express everything in terms of the support vectors, and get rid of w:
    - doesn't allow error backpropagation
    - allows the kernel trick:
        - if you have an algorithm which operates only on dot products of instances, you can substitute the dot product with a kernel function
        - a kernel function $k(x^i, x^j)$ computes the dot product of $x^i$ and $x^j$ in a high-dimensional feature space, without explicitly computing the features themselves
        - polynomial kernel: $k(a,b) = (a^T b + 1)^d$
            - feature space for d=2: all squares, all cross products, all single features
            - feature space for d=3: all cubes and squares, all two-way and three-way cross products, all single features
        - RBF kernel: $k(a,b) = \exp(-\gamma \|a-b\|)$, the feature space is infinite-dimensional
    - have to optimise under constraints: Lagrange multipliers
        - minimize f(a) such that $g_i(a) \geq 0$ for $i \in [1, n]$
        - $L(a, \alpha_{1}, ..., \alpha_n) = f(a) - \sum_{i} \alpha_{i} g_{i}(a)$
        - solve $\nabla L = 0$ such that $\alpha_i \geq 0$ for $i \in [1, n]$
    - result (the dual problem):
        - maximize $\sum_i \alpha^i - \frac{1}{2} \sum_i \sum_j \alpha^i \alpha^j y^i y^j k(x^i, x^j)$
        - such that $0 \leq \alpha^i \leq C$ and $\sum_i \alpha^i y^i = 0$

## Summary of classification loss functions

![](b08d80ef6b5241578c3d432a466db7ea.png)
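To accompany the summary plot, here is a small sketch (my own, not from the notes; it assumes labels $y \in \{-1, +1\}$ and looks at a positive example) that evaluates two of the losses discussed above, least-squares loss and the SVM (hinge) loss, as a function of the raw model output $o = w^T x + b$:

```python
# Sketch: per-example classification losses as a function of the raw output o = w^T x + b,
# for a positive example (y = +1). Assumes least-squares loss uses labels in {-1, +1}.
import numpy as np

y = 1.0
outputs = np.linspace(-2, 2, 9)

least_squares = (outputs - y) ** 2               # (f(x) - y)^2
hinge = np.maximum(0.0, 1 - y * outputs)         # max(0, 1 - y (w^T x + b))

for o, ls, h in zip(outputs, least_squares, hinge):
    print(f"o = {o:+.1f}   least-squares = {ls:.2f}   hinge = {h:.2f}")
```

The hinge loss is zero as soon as the example is on the correct side of the margin ($y \cdot o \geq 1$), while the least-squares loss keeps growing even for confidently correct predictions.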