+++
title = "Machine Learning"
template = 'page-math.html'
+++
# Machine Learning

design of a learning element is affected by:
* which components of the performance element are to be learned
* what feedback is available to learn these components
* what representation is used for the components

offline learning: learn based on some data, then apply the results to the situation

feedback types (get input, make a decision, learn based on feedback):
* supervised learning: correct answers for each example.
    * only positive examples
    * positive and negative examples
* unsupervised learning: no correct answers given
* semi-supervised learning: learn which questions to ask (active learning)
* reinforcement learning: occasional rewards

![image](neurons-vs-nn.png)

## Learning problems

Classification
* use: from data to discrete classes.
* examples: handwriting recognition, spam filtering, facial recognition.

Regression
* use: predicting a numeric value
* examples: stock market, weather predictions

Ranking
* use: comparing items
* examples: web search

Collaborative filtering
* use: take someone else's information, and based on that, give a prediction
* examples: recommendation systems (books, movies/shows)

Clustering
* use: discovering structure/patterns in data
* examples: clustering images

## Methodology
### Data
labeled instances (like spam, not spam)
* training set
* held-out set (e.g. for validation)
* test set (don't even look at this until you want to test your model)

randomly allocate instances to the data sets.

features: attribute-value pairs characterizing each x

### Experimentation
experimentation cycle:
1. select a hypothesis, tune hyperparameters on the held-out/validation set
2. compute accuracy on the test set (never "peek" at the test set itself)

### Evaluation
accuracy -- fraction of instances predicted correctly

Cross-validation:

![Cross validation diagram](cross-validation.png)

create a confusion matrix (TODO: there's a diagram for this but not in the old slides)

## Machine Learning Steps:
1. choose the features
2. choose the model class
3. choose a search method

### Choose the features
create a feature space: features on the x-y axes, points are individual data instances, classification would be a color scheme.

![Feature space graph](feature-space.png)

#### Inductive learning method
* construct/adjust h to agree with f (the function predicting the value) on the training set
* h is consistent if it agrees with f on all examples
* for example, curve fitting:

![Curve fitting graph](curve-fitting.png)

Occam's razor: "one should not increase, beyond what is necessary, the number of entities required to explain anything"
basically, choose the simplest option.
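a quick sketch of that curve-fitting idea in Python (numpy assumed, the data and polynomial degrees are made up for illustration): a more complex hypothesis agrees with the training points better, but Occam's razor says to prefer the simpler one.

```python
# fit a simple and a complex hypothesis to the same noisy, roughly linear data
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
y = 2 * x + 1 + rng.normal(scale=0.1, size=x.shape)

for degree in (1, 5):
    coeffs = np.polyfit(x, y, degree)                      # hypothesis h: a degree-d polynomial
    train_error = np.sum((np.polyval(coeffs, x) - y) ** 2)
    print(degree, train_error)

# the degree-5 polynomial hugs the noise and gets a lower training error,
# but the degree-1 hypothesis is the one that generalizes: simpler is better here
```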
#### Classifying with naive Bayes

Binary classification
* input: email
* output: spam, not spam
* setup:
    * get a large collection of example emails, labeled "spam"/"not spam"
    * someone has to hand-label them
    * we want to learn to predict labels for new emails in the future
* features: attributes used to make the label decision
    * words, text patterns (e.g. caps), non-text

calculation for the Bayesian classifier: $P(C|F_1,...,F_n)$

using Bayes' theorem:

$P(C|F_1,...,F_n)=\frac{P(C)P(F_1,...,F_n|C)}{P(F_1,...,F_n)}$

rewrite the numerator of the equation:

$P(C)P(F_1,...,F_n |C) = P(C)P(F_1 | C)P(F_2 | C, F_1)P(F_3|C, F_1, F_2)P(F_4,...,F_n | C, F_1, F_2, F_3)$

that uses the chain rule, but it's too computationally expensive.
so naively assume conditional independence:

$P(F_i | C, F_j) = P(F_i | C)$

This simplifies the formula to:

$P(C)P(F_1,...,F_n | C) = P(C) \prod_{i=1}^{n} P(F_i | C)$

![image](bayes-calculation.png)

Laplace smoothing helps with really small probabilities.
Naive Bayes often works; it sometimes performs well even if the assumptions are badly violated.
classification is about predicting the correct class label, _not_ about accurately estimating probabilities.
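a minimal sketch of that classifier in Python, with a made-up toy "dataset", add-one (Laplace) smoothing, and log probabilities so the small numbers don't underflow:

```python
# toy naive Bayes spam classifier: P(C) * product over words of P(word | C)
import math
from collections import Counter

train = [("buy cheap pills now", "spam"),
         ("cheap pills cheap", "spam"),
         ("meeting agenda now", "not spam"),
         ("lunch meeting today", "not spam")]

classes = {"spam", "not spam"}
vocab = {w for text, _ in train for w in text.split()}
prior = {c: sum(1 for _, label in train if label == c) / len(train) for c in classes}
counts = {c: Counter(w for text, label in train if label == c for w in text.split())
          for c in classes}
totals = {c: sum(counts[c].values()) for c in classes}

def classify(text):
    best_class, best_score = None, -math.inf
    for c in classes:
        # log P(C) + sum_i log P(F_i | C), with Laplace (add-one) smoothing
        score = math.log(prior[c])
        for w in text.split():
            score += math.log((counts[c][w] + 1) / (totals[c] + len(vocab)))
        if score > best_score:
            best_class, best_score = c, score
    return best_class

print(classify("cheap pills today"))   # -> spam
```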
#### Classifying with k-nearest neighbor
k-nearest-neighbor classification: find the k most similar cases, copy the majority label

e.g. to classify an unlabeled document, find the most similar documents and copy their label:

![K-nearest neighbor example](knn.png)

the _k_ helps get rid of noise which would otherwise lead to misclassification

#### Linear classifier
linear classifier: come up with a line that divides the feature space, use that for prediction.

works for stuff like $x \lor y$, but not if we want $x \oplus y$ or other things that are not linearly separable.

you can build a design matrix of all the different features you want to include. here's an example with 5 different features (age, height, age², age × height, height²) that classifies a person as male/female:

![Design matrix](design-matrix.png)

if you go to more dimensions (so more features in the design matrix), you need hyperplanes.

hyperplanes sound fancy af, but a hyperplane is just the higher-dimensional version of a line. for example, this is how you would use a hyperplane in three dimensions to separate points:

![3D hyperplane example](3d-hyperplane.png)

#### Support vector machine
k(x,y): "distance" function between instances x and y

SVMs create a linear separating hyperplane

they can embed that hyperplane into a higher-dimensional space using the kernel trick -- as long as _k_ is a kernel, you don't have to explicitly compute the design matrix (which drastically reduces the computation)

try to achieve a good margin -- a large separation from the line for both classes (so reduce the blur between the two classes, make a clear distinction).

watch out for over-fitting -- you might end up with a model that is trained extremely specifically to your data.

### Choose the model (model search)

so we have a lot of ways to put a line on a graph. but how do you choose the right one?

#### Regression
train a learner to produce a model (the model is the line/hyperplane). then you give the model a new instance, and it produces a new value based on a function.

assign a value to each of the points in the feature space.

evaluating regression -- what's a good model for a regression?

you use an error function (the difference between predicted and real values, squared to avoid negatives):

$error(p) = \sum_i (f_p (x_i) - y_i)^2$

![Error graph](error-graph.png)

in this example, each of the possible lines is represented by two parameters (s, the slope, and b, the y-intercept), shown in the left graph. those parameters can be plotted on 2D axes, shown in the right graph.

![Feature and model space](feature-model-space.png)

then you can take those points in the right graph, and plot their respective errors (from the error function above) on the z axis, so you end up with a 3D graph -- an error surface:

![Error surface](error-surface.png)

now the point is to find the parameters with the lowest error (the lowest point in the surface, colored green). and that's what calculus is for, specifically differentiation.

if you've never done calculus, it's not easy to explain, but basically taking the derivative means you're looking for the slope of a certain function at some point in that function. if you set the derivative to zero, you're looking for the point where the slope is zero -- specifically a minimum or maximum of the function.

quick example: if you have the function $y = x^2 + 2$, there's a minimum at x = 0. you may know that by intuition, but you can also take the first derivative ($y' = 2x$), set $y'$ equal to zero, and solve for x -- you'll get the same result. it's trivial with a simple function, but once you get into cubic and quartic functions, you can't really do it by intuition.

so the derivative $f'(x)$ of a function $f(x)$ gives the slope of $f(x)$ at any point in its domain (where $f(x)$ has a value $y$).

the rules for taking derivatives (no need to memorize these, they will be provided on the exam):

![Derivative rules](derivative-rules.png)

the problems:
* not all the derivatives we find can be resolved to zero
* and we also have to consider that this can be in higher dimensions

#### Gradient descent
you use a variant of the hill climbing algorithm

the steps:
1. pick random parameters s (slope), b (intercept)
2. repeat:
    1. compute the gradient
    2. move a small step in the opposite direction

but there may be multiple zero points -- local optima. there's no guarantee of convergence.

## Neural Networks
### Overview
the difference between original machine learning and deep learning:

![image](ml-vs-dl.png)

the original perceptron:
* binary classifier
* bias node, always set to 1
* n inputs, one for each feature, with weights
* $y=w_1 x_1 + w_2 x_2 + b$

but chaining perceptrons doesn't make them more interesting: the function collapses to a linear one

![image](perceptron.png)

So introduce an activation function instead of using a standard linear calculation. This puts the values into a certain range, and now the diagram looks like this:

![image](activation-functions.png)

Then you can build a feed-forward network with hidden layers between input and output:

![image](feedforward.png)

### Training neural networks
The loss function determines how close a given network is to being correct for an input.
If you plot the networks (their parameters) on the x-y axes, and the amount of loss on the z axis, you end up with a loss surface. Then you want to find a point on that surface where the loss is lowest.

![image](loss-plot.png)
![image](loss-surface.png)

You can find the low point in the loss surface with gradient descent (differentiation).

Stochastic gradient descent:
1. pick random weights w
2. loop:
    * $w = w - r \nabla loss(w)$

where r is the learning rate. make sure to pick a good learning rate: if it's too high (like 0.1), the steps will be inaccurate, while if it's too low, training will take a long time.
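a minimal sketch of that loop in plain Python, reusing the two-parameter line f(x) = s·x + b and the squared error from the regression section (the data, learning rate, and epoch count are made up for illustration):

```python
# fit y = s*x + b with stochastic gradient descent: one example at a time
import random

data = [(float(x), 3 * x + 1) for x in range(10)]   # fake data on the line y = 3x + 1

s, b = random.random(), random.random()   # 1. pick random weights
r = 0.005                                 # learning rate

for epoch in range(500):                  # 2. loop
    random.shuffle(data)
    for x, y in data:
        residual = (s * x + b) - y        # error of the current model on this example
        s -= r * (2 * residual * x)       # w = w - r * gradient of the squared error
        b -= r * (2 * residual)

print(s, b)                               # ends up very close to s = 3, b = 1
```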
so in summary:
* get examples of inputs and outputs
* get a loss function
* work out the gradient of the loss with respect to the weights
* use (stochastic) gradient descent to slowly improve the weights

how do you calculate the gradient for complex models?
* symbolically? too computationally expensive
* numerically? too unstable, also expensive
* so settle for a middle ground -- backpropagation

backpropagation: if the system is a composition of modules, the gradient is the product of the gradients of each module with respect to its arguments
* break the computation down into a chain of modules
* work out the derivative of each module with respect to its input (_symbolically_)
* compute the gradient for a given input by multiplying these gradients for the current input

![image](backpropagation.png)

for a given input x, you do a forward pass (calculating the value of each part), then fill those values in to the final calculation.

so like this, with x = 0:

![image](backpropagation-calculation.png)

neural networks are just a way of thinking about differentiable computation.

### Autoencoders: a NN architecture
bottleneck architecture:

![image](autoencoder-architecture.png)

the output should be as close as possible to the input, but it has to pass through a small representation in the middle

autoencoders map data to a _latent representation_

### Trying it out
* code: https://github.com/pbloem/machine-learning/blob/master/code-samples/autoencoder/autoencoder.py
* setup: https://docs.google.com/document/d/1-LXG5Lb76xQy70W2ZdannnYMEXRLt0CsoiaK0gTkmfY/edit?usp=sharing

## The promise of depth
if you have many dimensions, the gradient isn't as easy to detect anymore. there's a huge number of parameters.

solutions:
* Pretraining (no pre-classification by humans)
    * train the network on itself in the hidden layers
    * similar to the autoencoder idea
    * you can then stack the autoencoders (hidden layers)

![Pretraining](pretraining.png)

* Pruning the network
    * convolutional neural network: combine multiple input nodes into one node of a hidden layer (see the sketch below)
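a minimal sketch of that idea, assuming PyTorch (the 28×28 input and the layer sizes are made up for illustration): each node in the convolutional layer combines only a small patch of input nodes instead of being connected to all of them, and the same weights are reused across the whole image.

```python
# a tiny convolutional network: local connections instead of a fully connected layer
import torch
from torch import nn

model = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1),  # each hidden node combines a 3x3 patch of inputs
    nn.ReLU(),
    nn.MaxPool2d(2),                            # downsample: keep the strongest response per 2x2 block
    nn.Flatten(),
    nn.Linear(8 * 14 * 14, 10),                 # classify into 10 classes
)

x = torch.randn(1, 1, 28, 28)                   # one fake 28x28 grayscale image
print(model(x).shape)                           # -> torch.Size([1, 10])
```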