<!DOCTYPE html>
<html>
<head>
<script type="text/javascript" async src="https://cdn.jsdelivr.net/gh/mathjax/MathJax@2.7.5/MathJax.js?config=TeX-AMS-MML_HTMLorMML"></script>
<link rel="Stylesheet" type="text/css" href="style.css">
<title>machine-learning</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body>

<div id="Machine Learning"><h1 id="Machine Learning">Machine Learning</h1></div>

<p>
design of a learning element is affected by:
</p>
<ul>
<li>
which components of the performance element are to be learned

<li>
what feedback is available to learn these components

<li>
what representation is used for the components

</ul>

<p>
offline learning: learn based on some data, then apply the results to the situation
</p>

<p>
feedback types (get input, make decision, learn based on feedback):
</p>
<ul>
<li>
supervised learning: correct answers for each example.

<ul>
<li>
only positive examples

<li>
positive and negative examples

</ul>
<li>
unsupervised learning: no correct answers given

<li>
semi-supervised learning: learn which questions to ask (active learning)

<li>
reinforcement learning: occasional rewards

</ul>

<p>
<img src="img/neurons-vs-nn.png" />
</p>

<div id="Machine Learning-Learning problems"><h2 id="Learning problems">Learning problems</h2></div>

<p>
Classification
</p>
<ul>
<li>
use: from data to discrete classes.

<li>
examples: handwriting recognition, spam filtering, facial recognition.

</ul>

<p>
Regression
</p>
<ul>
<li>
use: predicting a numeric value

<li>
examples: stock market, weather predictions

</ul>

<p>
Ranking
</p>
<ul>
<li>
use: comparing items

<li>
examples: web search

</ul>

<p>
Collaborative filtering
</p>
<ul>
<li>
use: take someone else's information and, based on that, give a prediction

<li>
examples: recommendation systems (books, movies/shows)

</ul>

<p>
Clustering
</p>
<ul>
<li>
use: discovering structure/patterns in data

<li>
examples: clustering images

</ul>

<div id="Machine Learning-Methodology"><h2 id="Methodology">Methodology</h2></div>
<div id="Machine Learning-Methodology-Data"><h3 id="Data">Data</h3></div>
<p>
labeled instances (like spam, not spam)
</p>
<ul>
<li>
training set

<li>
held-out set (e.g. for validation)

<li>
test set (don't even look at this until you want to test your model)

</ul>

<p>
randomly allocate instances to the data sets.
</p>
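<p>
to make that allocation concrete, here's a minimal sketch in Python (the 60/20/20 split ratio is my assumption, not from the notes):
</p>

<pre>
import random

def split_data(instances, train_frac=0.6, heldout_frac=0.2, seed=42):
    """Randomly allocate labeled instances to train/held-out/test sets."""
    rng = random.Random(seed)
    shuffled = instances[:]                # copy, so the original stays intact
    rng.shuffle(shuffled)
    n_train = int(train_frac * len(shuffled))
    n_heldout = int(heldout_frac * len(shuffled))
    train = shuffled[:n_train]
    heldout = shuffled[n_train:n_train + n_heldout]
    test = shuffled[n_train + n_heldout:]  # don't look at this until the end
    return train, heldout, test
</pre>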
<p>
features: attribute-value pairs characterizing each x
</p>

<div id="Machine Learning-Methodology-Experimentation"><h3 id="Experimentation">Experimentation</h3></div>
<p>
experimentation cycle:
</p>
<ol>
<li>
select a hypothesis, tune hyperparameters on the held-out/validation set

<li>
compute accuracy on the test set (never "peek" at the test set while training)

</ol>

<div id="Machine Learning-Methodology-Evaluation"><h3 id="Evaluation">Evaluation</h3></div>
<p>
accuracy -- fraction of instances predicted correctly
</p>

<p>
Cross validation:
</p>

<p>
<img src="img/cross-validation.png" alt="Cross validation diagram" />
</p>

<p>
create a confusion matrix (<span class="todo">TODO</span>: there's a diagram for this but not in the old slides)
</p>

<div id="Machine Learning-Machine Learning Steps:"><h2 id="Machine Learning Steps:">Machine Learning Steps:</h2></div>
<ol>
<li>
choose the features

<li>
choose the model class

<li>
choose a search method

</ol>

<div id="Machine Learning-Machine Learning Steps:-Choose the features"><h3 id="Choose the features">Choose the features</h3></div>
<p>
create a feature space: features on the x-y axes, points are individual data instances, classification would be a color scheme.
</p>

<p>
<img src="img/feature-space.png" alt="Feature space graph" />
</p>


<div id="Machine Learning-Machine Learning Steps:-Choose the features-Inductive learning method"><h4 id="Inductive learning method">Inductive learning method</h4></div>
<ul>
<li>
construct/adjust h to agree with f (the function predicting the value) on the training set

<li>
h is consistent if it agrees with f on all examples

<li>
for example, curve fitting:

</ul>
<blockquote>
<img src="img/curve-fitting.png" alt="Curve fitting graph" />
</blockquote>

<p>
Occam's razor: "one should not increase, beyond what is necessary, the number of entities required to explain anything"
basically, choose the simplest option.
</p>
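<p>
a quick sketch of that curve-fitting idea in Python (the data points are made up for illustration): fit polynomials of increasing degree -- the high-degree curve matches the training points almost perfectly, but Occam's razor says to prefer the simplest hypothesis that agrees well with the data.
</p>

<pre>
import numpy as np

# a handful of noisy training points, roughly y = 2x
xs = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
ys = np.array([0.1, 1.9, 4.2, 5.8, 8.1, 9.9])

for degree in (1, 2, 5):
    coeffs = np.polyfit(xs, ys, degree)   # least-squares curve fit
    preds = np.polyval(coeffs, xs)
    sse = np.sum((preds - ys) ** 2)       # training error only!
    print(degree, sse)                    # degree 5 "wins" here, but overfits
</pre>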
<div id="Machine Learning-Machine Learning Steps:-Choose the features-Classifying with naive Bayes"><h4 id="Classifying with naive Bayes">Classifying with naive Bayes</h4></div>

<p>
Binary classification
</p>
<ul>
<li>
input: email

<li>
output: spam, not spam

<li>
setup:

<ul>
<li>
get a large collection of example emails, labeled "spam/not spam"

<li>
someone has to hand-label those

<li>
want to learn to predict labels for new emails in the future

</ul>
<li>
features: attributes used to make the label decision

<ul>
<li>
words, text patterns (e.g. caps), non-text

</ul>
</ul>

<p>
calculation for a Bayesian classifier: \(P(C|F_1,...,F_n)\)
</p>

<p>
using Bayes' theorem:
</p>

<p>
\(P(C|F_1,...,F_n)=\frac{P(C)P(F_1,...,F_n|C)}{P(F_1,...,F_n)}\)
</p>

<p>
rewrite the numerator of the equation:
</p>

<p>
\(P(C)P(F_1,...,F_n |C) = P(C)P(F_1 | C)P(F_2 | C, F_1)P(F_3|C, F_1, F_2)P(F_4,...,F_n | C, F_1, F_2, F_3)\)
</p>

<p>
that uses the chain rule. but it's too computationally expensive.
so naively assume conditional independence:
</p>

<p>
\(P(F_i | C, F_j) = P(F_i | C)\)
</p>

<p>
This simplifies the formula to:
</p>

<p>
\(P(C)P(F_1,...,F_n | C) = P(C) \prod_{i=1}^{n} P(F_i | C)\)
</p>

<p>
<img src="img/bayes-calculation.png" />
</p>

<p>
Laplace smoothing helps with really small probabilities.
Naive Bayes often works; it sometimes performs well even if the independence assumptions are badly violated.
classification is about predicting the correct class label, <em>not</em> about accurately estimating probabilities
</p>

<div id="Machine Learning-Machine Learning Steps:-Choose the features-Classifying with k-nearest neighbor"><h4 id="Classifying with k-nearest neighbor">Classifying with k-nearest neighbor</h4></div>
<p>
k-nearest neighbor classification: find the k most similar cases, copy the majority label
</p>

<p>
e.g. to classify an unlabeled document, find the most similar documents and copy their label:
</p>

<p>
<img src="img/knn.png" alt="K-nearest neighbor example" />
</p>

<p>
the <em>k</em> helps get rid of noise which would otherwise lead to misclassification
</p>

<div id="Machine Learning-Machine Learning Steps:-Choose the features-Linear classifier"><h4 id="Linear classifier">Linear classifier</h4></div>
<p>
linear classifier: come up with a line that divides the feature space, use that for prediction.
</p>

<p>
works for stuff like \(x \lor y\), but not if we want \(x \oplus y\) or other things that are not linearly separable.
</p>

<p>
you can build a design matrix of all the different features you want to include. here's an example with 5 different features (age, height, age², age × height, height²) that classifies a person as male/female:
</p>

<p>
<img src="img/design-matrix.png" alt="Design matrix" />
</p>

<p>
if you go to more dimensions (so more features in a design matrix), you need hyperplanes.
</p>

<p>
hyperplanes sound fancy af but a hyperplane is just the analogue of a line in a higher dimension. for example, this is how you would use a hyperplane in the third dimension to separate points:
</p>

<p>
<img src="img/3d-hyperplane.png" alt="3D hyperplane example" />
</p>

<div id="Machine Learning-Machine Learning Steps:-Choose the features-Support vector machine"><h4 id="Support vector machine">Support vector machine</h4></div>
<p>
k(x,y): "distance" function between instances x and y
</p>

<p>
SVMs create a linear separating hyperplane
</p>

<p>
they can embed that hyperplane into a higher dimensional domain using the kernel trick -- so long as <em>k</em> is a kernel, you don't have to explicitly compute the design matrix (that drastically reduces the computation)
</p>

<p>
try to achieve a good margin -- a large separation from the line for both classes (so reduce the blur between the two classes, make a clear distinction).
</p>

<p>
watch out for over-fitting -- you might end up with the model being trained extremely specifically to your data.
</p>
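<p>
a minimal sketch of the kernel trick in Python (scikit-learn is my choice here, not the course's; the data and parameters are made up): an RBF-kernel SVM handles the \(x \oplus y\) case that defeats a plain linear classifier.
</p>

<pre>
import numpy as np
from sklearn.svm import SVC

# XOR-style data: not linearly separable in the original feature space
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])

# the RBF kernel k(x, y) implicitly maps the data into a higher-dimensional
# space, so the separating hyperplane is never computed there explicitly
clf = SVC(kernel="rbf", gamma=2.0, C=10.0)
clf.fit(X, y)
print(clf.predict(X))   # should recover the XOR labels: [0 1 1 0]
</pre>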
<div id="Machine Learning-Machine Learning Steps:-Choose the model (model search)"><h3 id="Choose the model (model search)">Choose the model (model search)</h3></div>

<p>
so we have a lot of ways to put a line on a graph. but how do you choose the right one?
</p>

<div id="Machine Learning-Machine Learning Steps:-Choose the model (model search)-Regression"><h4 id="Regression">Regression</h4></div>
<p>
train a learner to produce a model (the model is the line/hyperplane). then you give the model a new instance, and it produces a new value based on a function.
</p>

<p>
assign a value to each of the points in the feature space.
</p>

<p>
evaluating regression -- what's a good model for a regression?
</p>

<p>
you use an error function (the difference between predicted and real values, squared to avoid negatives):
</p>

<p>
\(error(p) = \sum_i (f_p (x_i) - y_i)^2\)
</p>

<p>
<img src="img/error-graph.png" alt="Error graph" />
</p>

<p>
in this example, each of the possible lines is represented by two parameters (s the slope, b the y-intercept), in the left graph. those parameters can be plotted on 2D axes, on the right graph.
</p>

<p>
<img src="img/feature-model-space.png" alt="Feature and model space" />
</p>

<p>
then you can take those points in the right graph, and plot their respective error rates (from the error function above) on the z axis, so you end up with a 3D graph -- an error surface:
</p>

<p>
<img src="img/error-surface.png" alt="Error surface" />
</p>

<p>
now the point is to find the one with the lowest error (the lowest point in the surface, colored green). and that's what calculus is for, specifically differentiation.
</p>

<p>
if you've never done calculus, it's not easy to explain, but basically taking the derivative means finding the slope of a function at some point. if you set the derivative to zero, you're looking for the points where the slope is zero -- specifically the minima or maxima of a function.
</p>

<p>
quick example: if you have a function \(y = x^2 + 2\), there's a minimum at x = 0. you may know that by intuition, but you can also take the first derivative (\(y' = 2x\)), set \(y'\) equal to zero, and solve for x -- you'll get the same result. it's trivial with a simple function, but once you get into cubic and quartic functions, you can't really do it by intuition.
</p>

<p>
so a derivative \(f'(x)\) of a function \(f(x)\) gives the slope of \(f(x)\) at any point in its domain.
</p>

<p>
the rules for taking derivatives (don't need to memorize these, they will be provided on the exam):
</p>

<p>
<img src="img/derivative-rules.png" alt="Derivative rules" />
</p>

<p>
the problems:
</p>
<ul>
<li>
not all derivatives that we find can be resolved to zero

<li>
and we also have to consider that this can be in higher dimensions

</ul>

<div id="Machine Learning-Machine Learning Steps:-Choose the model (model search)-Gradient descent"><h4 id="Gradient descent">Gradient descent</h4></div>
<p>
you use a variant of the hill-climbing algorithm
</p>

<p>
the steps (see the sketch after this list):
</p>
<ol>
<li>
pick random parameters s (slope), b (intercept)

<li>
repeat:

<ol>
<li>
compute the gradient

<li>
move a small step in the opposite direction

</ol>
</ol>

<p>
but there may be multiple zero points -- local optima. there's no guarantee of convergence.
</p>
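<p>
a minimal sketch of those gradient descent steps in Python (the data points and learning rate are my own assumptions for illustration):
</p>

<pre>
import numpy as np

# training data for a 1D regression (made up; roughly y = 2x + 1)
xs = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
ys = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

# 1. pick random parameters s (slope) and b (intercept)
rng = np.random.default_rng(0)
s, b = rng.normal(size=2)

r = 0.01                 # the "small step" size
for _ in range(5000):    # 2. repeat:
    preds = s * xs + b
    # gradient of error(p) = sum_i (s*x_i + b - y_i)^2 w.r.t. s and b
    grad_s = np.sum(2 * (preds - ys) * xs)
    grad_b = np.sum(2 * (preds - ys))
    # move a small step in the opposite direction of the gradient
    s -= r * grad_s
    b -= r * grad_b

print(s, b)              # should end up near slope 2 and intercept 1
</pre>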
<div id="Machine Learning-Neural Networks"><h2 id="Neural Networks">Neural Networks</h2></div>
<div id="Machine Learning-Neural Networks-Overview"><h3 id="Overview">Overview</h3></div>
<p>
the difference between classical machine learning and deep learning:
</p>

<p>
<img src="img/ml-vs-dl.png" />
</p>

<p>
the original perceptron:
</p>
<ul>
<li>
binary classifier

<li>
bias node, always set to 1

<li>
n inputs, one for each feature, with weights

<li>
\(y = w_1 x_1 + w_2 x_2 + b\)

</ul>

<p>
but chaining perceptrons doesn't make them more interesting: the composed function collapses to a linear one
</p>

<p>
<img src="img/perceptron.png" />
</p>

<p>
So introduce an activation function instead of using a standard linear calculation. This puts the values in a certain range, and now the diagram looks like this:
</p>

<p>
<img src="img/activation-functions.png" />
</p>

<p>
Then you can build a feed-forward network with hidden layers between input and output:
</p>

<p>
<img src="img/feedforward.png" />
</p>


<div id="Machine Learning-Neural Networks-Training neural networks"><h3 id="Training neural networks">Training neural networks</h3></div>
<p>
A loss function determines how close a given network is to being correct for an input.
If you plot the networks (their parameters) on the x-y plane, and the amount of loss on the z axis, you end up with a loss surface. Then you want to find a point in that surface where the loss is the lowest.
</p>

<p>
<img src="img/loss-plot.png" />
<img src="img/loss-surface.png" />
</p>

<p>
You can find the low point in the loss surface with gradient descent (differentiation).
</p>

<p>
Stochastic gradient descent (stochastic: estimate the gradient on a single example or small batch rather than the whole dataset):
</p>
<ol>
<li>
pick random weights w

<li>
loop:

<ul>
<li>
\(w = w - r \nabla loss(w)\)

</ul>
</ol>

<p>
where r is the learning rate. make sure to pick a good learning rate, because if it's too high (like 0.1), it will be inaccurate, while if it's too low, it will take a long time.
</p>

<p>
so in summary:
</p>
<ul>
<li>
get examples of input and output

<li>
get a loss function

<li>
work out the gradient of the loss with respect to the weights

<li>
use (stochastic) gradient descent to slowly improve the weights

</ul>

<p>
how do you calculate the gradient for complex data?
</p>
<ul>
<li>
symbolically? too computationally expensive

<li>
numerically? too unstable, also expensive

<li>
so settle for a middle ground -- backpropagation

</ul>

<p>
backpropagation: if the system is a composition of modules, the gradient is the product of the gradients of the modules with respect to their arguments
</p>
<ul>
<li>
break the computation down into a chain of modules

<li>
work out the derivative of each module with respect to its input (<em>symbolically</em>)

<li>
compute the gradient for a given input by multiplying these gradients for the current input

</ul>

<p>
<img src="img/backpropagation.png" />
</p>

<p>
for a given input x, you do a forward pass (calculating the value for each part), then fill the values into the final calculation.
</p>

<p>
so like this, with x = 0:
</p>

<p>
<img src="img/backpropagation-calculation.png" />
</p>

<p>
neural networks are just a way of thinking about differentiable computation.
</p>
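<p>
a minimal sketch of that backpropagation-by-modules idea in Python (a toy composition I made up, not the example from the slides): do a forward pass, work out each module's local derivative symbolically, then multiply them for the current input -- here also with x = 0.
</p>

<pre>
import math

# toy composition: f(x) = sigmoid(w * x + b), broken into two modules
w, b = 2.0, 1.0

def forward(x):
    s = w * x + b                   # module 1: linear
    y = 1 / (1 + math.exp(-s))      # module 2: sigmoid
    return s, y

def gradient(x):
    # forward pass: compute the value of each module
    s, y = forward(x)
    # local derivatives, worked out symbolically per module:
    dy_ds = y * (1 - y)             # derivative of the sigmoid
    ds_dw = x                       # derivative of (w*x + b) w.r.t. w
    ds_db = 1.0                     # derivative of (w*x + b) w.r.t. b
    # backward pass: chain rule = product of the local gradients
    return dy_ds * ds_dw, dy_ds * ds_db

print(gradient(0.0))   # gradients of the output w.r.t. w and b at x = 0
</pre>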
<div id="Machine Learning-Neural Networks-Autoencoders: a NN architecture"><h3 id="Autoencoders: a NN architecture">Autoencoders: a NN architecture</h3></div>
<p>
bottleneck architecture:
</p>

<p>
<img src="img/autoencoder-architecture.png" />
</p>

<p>
the output should be as close as possible to the input, but it must pass through a small representation
</p>

<p>
autoencoders map data to a <em>latent representation</em>
</p>

<div id="Machine Learning-Neural Networks-Trying it out"><h3 id="Trying it out">Trying it out</h3></div>
<ul>
<li>
code: <a href="https://github.com/pbloem/machine-learning/blob/master/code-samples/autoencoder/autoencoder.py">https://github.com/pbloem/machine-learning/blob/master/code-samples/autoencoder/autoencoder.py</a>

<li>
setup: <a href="https://docs.google.com/document/d/1-LXG5Lb76xQy70W2ZdannnYMEXRLt0CsoiaK0gTkmfY/edit?usp=sharing">https://docs.google.com/document/d/1-LXG5Lb76xQy70W2ZdannnYMEXRLt0CsoiaK0gTkmfY/edit?usp=sharing</a>

</ul>

<div id="Machine Learning-The promise of depth"><h2 id="The promise of depth">The promise of depth</h2></div>
<p>
if you have many dimensions, the gradient isn't as easy to detect anymore. there's a huge number of parameters.
</p>

<p>
solutions:
</p>
<ul>
<li>
Pretraining (no pre-classification by humans)

<ul>
<li>
train the network on itself in hidden layers

<li>
similar to the autoencoder idea

<li>
you can then stack the autoencoders (hidden layers)

</ul>
</ul>
<blockquote>
<img src="img/pretraining.png" alt="Pretraining" />
</blockquote>

<ul>
<li>
Pruning the network

<ul>
<li>
convolutional neural network: combine multiple input nodes into one node of a hidden layer (see the sketch below)

</ul>
</ul>
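<p>
a tiny sketch of that last idea in Python (my own toy example, not the course's code): a 1D convolution, where each node of the hidden layer combines a small window of input nodes using one shared set of weights.
</p>

<pre>
import numpy as np

def conv1d(inputs, weights):
    """Each output node combines len(weights) neighbouring input nodes."""
    width = len(weights)
    n_out = len(inputs) - width + 1
    return np.array([np.dot(inputs[i:i + width], weights)
                     for i in range(n_out)])

signal = np.array([0.0, 1.0, 2.0, 3.0, 2.0, 1.0, 0.0])
kernel = np.array([0.25, 0.5, 0.25])   # one shared set of weights
print(conv1d(signal, kernel))          # a smaller hidden layer of 5 nodes
</pre>

</body>
</html>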