lectures.alex.balgavy.eu

Lecture notes from university.
git clone git://git.alex.balgavy.eu/lectures.alex.balgavy.eu.git

machine-learning.html (16367B)


      1 <!DOCTYPE html>
      2 <html>
      3 <head>
      4 <script type="text/javascript" async src="https://cdn.jsdelivr.net/gh/mathjax/MathJax@2.7.5/MathJax.js?config=TeX-AMS-MML_HTMLorMML"></script>
      5 <link rel="Stylesheet" type="text/css" href="style.css">
      6 <title>machine-learning</title>
      7 <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
      8 </head>
      9 <body>
     10 
     11 <div id="Machine Learning"><h1 id="Machine Learning">Machine Learning</h1></div>
     12 
     13 <p>
     14 design of a learning element is affected by:
     15 </p>
     16 <ul>
     17 <li>
     18 which components of performance element are to be learned
     19 
     20 <li>
     21 what feedback is available to learn these components
     22 
     23 <li>
     24 what representation is used for the components
     25 
     26 </ul>
     27 
     28 <p>
     29 offline learning: learn based on some data, then apply results to situation
     30 </p>
     31 
     32 <p>
     33 feedback types (get input, make decision, learn based on feedback):
     34 </p>
     35 <ul>
     36 <li>
     37 supervised learning: correct answers for each example.
     38 
     39 <ul>
     40 <li>
     41 only positive examples
     42 
     43 <li>
     44 positive and negative examples
     45 
     46 </ul>
     47 <li>
     48 unsupervised learning: no correct answers given
     49 
     50 <li>
     51 semi-supervised learning: only some examples are labeled; related is active learning, where the learner chooses which questions to ask
     52 
     53 <li>
     54 reinforcement learning: occasional rewards
     55 
     56 </ul>
     57 
     58 <p>
     59 <img src="img/neurons-vs-nn.png" />
     60 </p>
     61 
     62 <div id="Machine Learning-Learning problems"><h2 id="Learning problems">Learning problems</h2></div>
     63 
     64 <p>
     65 Classification
     66 </p>
     67 <ul>
     68 <li>
     69 use: from data to discrete classes.
     70 
     71 <li>
     72 examples: handwriting recognition, spam filtering, facial recognition.
     73 
     74 </ul>
     75 
     76 <p>
     77 Regression
     78 </p>
     79 <ul>
     80 <li>
     81 use: predicting a numeric value
     82 
     83 <li>
     84 examples: stock market, weather predictions
     85 
     86 </ul>
     87 
     88 <p>
     89 Ranking
     90 </p>
     91 <ul>
     92 <li>
     93 use: comparing items
     94 
     95 <li>
     96 examples: web search
     97 
     98 </ul>
     99 
    100 <p>
    101 Collaborative filtering
    102 </p>
    103 <ul>
    104 <li>
    105 use: take someone else's information and, based on that, give a prediction
    106 
    107 <li>
    108 examples: recommendation systems (books, movies/shows)
    109 
    110 </ul>
    111 
    112 <p>
    113 Clustering
    114 </p>
    115 <ul>
    116 <li>
    117 use: discovering structure/patterns in data
    118 
    119 <li>
    120 examples: clustering images
    121 
    122 </ul>
    123 
    124 <div id="Machine Learning-Methodology"><h2 id="Methodology">Methodology</h2></div>
    125 <div id="Machine Learning-Methodology-Data"><h3 id="Data">Data</h3></div>
    126 <p>
    127 labeled instances (like spam, not spam)
    128 </p>
    129 <ul>
    130 <li>
    131 training set
    132 
    133 <li>
    134 held out set (e.g. for validation)
    135 
    136 <li>
    137 test set (don't even look at this until you want to test your model)
    138 
    139 </ul>
    140 
    141 <p>
    142 randomly allocate to data sets.
    143 </p>
    144 
    145 <p>
    146 features: attribute-value pairs characterizing each x
    147 </p>
    148 
    149 <div id="Machine Learning-Methodology-Experimentation"><h3 id="Experimentation">Experimentation</h3></div>
    150 <p>
    151 experimentation cycle:
    152 </p>
    153 <ol>
    154 <li>
    155 select hypothesis, tune hyperparameters on held-out/validation set
    156 
    157 <li>
    158 compute accuracy on the test set (never "peek" at the test set before this final evaluation)
    159 
    160 </ol>
    161 
    162 <div id="Machine Learning-Methodology-Evaluation"><h3 id="Evaluation">Evaluation</h3></div>
    163 <p>
    164 accuracy -- fraction of instances predicted correctly
    165 </p>
    166 
    167 <p>
    168 Cross validation:
    169 </p>
    170 
    171 <p>
    172 <img src="img/cross-validation.png" alt="Cross validation diagram" />
    173 </p>
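<p>
As an illustration of the diagram above, a minimal sketch of k-fold cross validation in plain Python; <code>train_and_score</code> is a hypothetical stand-in that trains whatever model you are evaluating on one split and returns its accuracy on the held-out fold:
</p>

<pre><code>import random

def cross_validate(data, labels, train_and_score, k=5):
    """Split the data into k folds; each fold is held out once for validation."""
    indices = list(range(len(data)))
    random.shuffle(indices)                      # randomly allocate instances
    folds = [indices[i::k] for i in range(k)]    # k roughly equal folds
    scores = []
    for i in range(k):
        val_idx = set(folds[i])
        train = [(data[j], labels[j]) for j in indices if j not in val_idx]
        val = [(data[j], labels[j]) for j in folds[i]]
        scores.append(train_and_score(train, val))   # accuracy on the held-out fold
    return sum(scores) / k                           # average accuracy over the folds
</code></pre>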
    174 
    175 <p>
    176 create a confusion matrix (<span class="todo">TODO</span>: there's a diagram for this but not in the old slides)
    177 </p>
    178 
    179 <div id="Machine Learning-Machine Learning Steps:"><h2 id="Machine Learning Steps:">Machine Learning Steps:</h2></div>
    180 <ol>
    181 <li>
    182 choose the features
    183 
    184 <li>
    185 choose the model class
    186 
    187 <li>
    188 choose a search method
    189 
    190 </ol>
    191 
    192 <div id="Machine Learning-Machine Learning Steps:-Choose the features"><h3 id="Choose the features">Choose the features</h3></div>
    193 <p>
    194 create a feature space: features on the x-y axes, points are individual data instances, classification is shown as a color scheme.
    195 </p>
    196 
    197 <p>
    198 <img src="img/feature-space.png" alt="Feature space graph" />
    199 </p>
    200 
    201 
    202 <div id="Machine Learning-Machine Learning Steps:-Choose the features-Inductive learning method"><h4 id="Inductive learning method">Inductive learning method</h4></div>
    203 <ul>
    204 <li>
    205 construct/adjust h to agree with f (the true function predicting the value) on the training set
    206 
    207 <li>
    208 h is consistent if it agrees with f on all examples
    209 
    210 <li>
    211 for example, curve fitting:
    212 
    213 </ul>
    214 <blockquote>
    215 <img src="img/curve-fitting.png" alt="Curve fitting graph" />
    216 </blockquote>
    217 
    218 <p>
    219 Occam's razor: "one should not increase, beyond what is necessary, the number of entities required to explain anything"
    220 basically, choose the simplest option.
    221 </p>
    222 
    223 <div id="Machine Learning-Machine Learning Steps:-Choose the features-Classifying with naive Bayes"><h4 id="Classifying with naive Bayes">Classifying with naive Bayes</h4></div>
    224 
    225 <p>
    226 Binary classification
    227 </p>
    228 <ul>
    229 <li>
    230 input: email
    231 
    232 <li>
    233 output: spam, not spam
    234 
    235 <li>
    236 setup:
    237 
    238 <ul>
    239 <li>
    240 get large collection of example emails, labeled "spam/not spam"
    241 
    242 <li>
    243 someone has to hand-label that
    244 
    245 <li>
    246 want to learn to predict label for new emails in future
    247 
    248 </ul>
    249 <li>
    250 features: attributes used to make label decision
    251 
    252 <ul>
    253 <li>
    254 words, text patterns (e.g. caps), non-text
    255 
    256 </ul>
    257 </ul>
    258 
    259 <p>
    260 calculation for Bayesian classifier: \(P(C|F_1,...,F_n)\)
    261 </p>
    262 
    263 <p>
    264 using Bayes' theorem:
    265 </p>
    266 
    267 <p>
    268 \(P(C|F_1,...F_n)=\frac{P(C)P(F_1,...,F_n|C)}{P(F_1,...,F_n)}\)
    269 </p>
    270 
    271 <p>
    272 rewrite the numerator of the equation:
    273 </p>
    274 
    275 <p>
    276 \(P(C)P(F_1,...,F_n |C) = P(C)P(F_1 | C)P(F_2 | C, F_1)P(F_3|C, F_1, F_2)P(F_4,...,F_n | C, F_1, F_2, F_3)\)
    277 </p>
    278 
    279 <p>
    280 that uses the chaining rule. but it's too computationally expensive.
    281 so naively assume conditional independence:
    282 </p>
    283 
    284 <p>
    285 \(P(F_i | C, F_j) = P(F_i | C)\)
    286 </p>
    287 
    288 <p>
    289 This simplifies the formula to:
    290 </p>
    291 
    292 <p>
    293 \(P(C)P(F_1,...,F_n | C) = P(C) \prod_{i=1}^{n} P(F_i | C)\)
    294 </p>
    295 
    296 <p>
    297 <img src="img/bayes-calculation.png" />
    298 </p>
    299 
    300 <p>
    301 Laplace smoothing helps with zero counts and really small probabilities.
    302 Naive Bayes often works well in practice, sometimes even when the independence assumptions are badly violated.
    303 classification is about predicting the correct class label, <em>not</em> about accurately estimating probabilities
    304 </p>
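<p>
A minimal sketch of this naive Bayes calculation for spam filtering, assuming emails are already tokenized into word lists; the counts and the tiny "training set" are made up for illustration. Log-probabilities are used so the product of many small probabilities doesn't underflow, and the \((count + 1)\) terms are the Laplace smoothing:
</p>

<pre><code>from collections import Counter
from math import log

def train(emails, labels):
    """emails: list of word lists; labels: 'spam' or 'ham'."""
    classes = set(labels)
    vocab = {w for email in emails for w in email}
    prior = {c: labels.count(c) / len(labels) for c in classes}     # P(C)
    counts = {c: Counter() for c in classes}                        # word counts per class
    for email, label in zip(emails, labels):
        counts[label].update(email)
    return prior, counts, vocab

def predict(email, prior, counts, vocab):
    scores = {}
    for c in prior:
        total = sum(counts[c].values())
        # log P(C) + sum_i log P(F_i | C), with Laplace (add-one) smoothing
        scores[c] = log(prior[c]) + sum(
            log((counts[c][w] + 1) / (total + len(vocab))) for w in email if w in vocab
        )
    return max(scores, key=scores.get)

prior, counts, vocab = train([["cheap", "pills"], ["meeting", "tomorrow"]],
                             ["spam", "ham"])
print(predict(["cheap", "meeting", "pills"], prior, counts, vocab))   # spam
</code></pre>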
    305 
    306 <div id="Machine Learning-Machine Learning Steps:-Choose the features-Clustering with K-nearest neighbor"><h4 id="Clustering with K-nearest neighbor">Clustering with K-nearest neighbor</h4></div>
    307 <p>
    308 k nearest neighbor classification: find the k most similar cases, copy the majority label
    309 </p>
    310 
    311 <p>
    312 e.g. to classify unlabeled document, find most similar document and copy label:
    313 </p>
    314 
    315 <p>
    316 <img src="img/knn.png" alt="K-nearest neighbor example" />
    317 </p>
    318 
    319 <p>
    320 the <em>k</em> helps get rid of noise which would lead to misclassification  
    321 </p>
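<p>
A minimal sketch of k-nearest-neighbour classification on numeric feature vectors; Euclidean distance, k=3 and the toy points are arbitrary choices for illustration:
</p>

<pre><code>from collections import Counter
from math import dist   # Euclidean distance, Python 3.8+

def knn_predict(train_points, train_labels, query, k=3):
    # sort training instances by distance to the query point
    by_distance = sorted(zip(train_points, train_labels),
                         key=lambda pl: dist(pl[0], query))
    # copy the majority label among the k most similar cases
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

points = [(1.0, 1.0), (1.2, 0.9), (8.0, 9.0), (7.5, 8.5), (8.2, 9.1)]
labels = ["circle", "circle", "square", "square", "square"]
print(knn_predict(points, labels, query=(1.1, 1.0)))   # circle
</code></pre>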
    322 
    323 <div id="Machine Learning-Machine Learning Steps:-Choose the features-Linear classifier"><h4 id="Linear classifier">Linear classifier</h4></div>
    324 <p>
    325 linear classifier: come up with a line that divides feature space, use that for prediction.
    326 </p>
    327 
    328 <p>
    329 works for stuff like \(x \lor y\), but not if we want \(x \oplus y\) or other things that are not linearly separable.
    330 </p>
    331 
    332 <p>
    333 you can build a design matrix of all the different features you want to include. here's an example with 5 different features (age, height, age², age × height, height²) that classifies a person as male/female:
    334 </p>
    335 
    336 <p>
    337 <img src="img/design-matrix.png" alt="Design matrix" />
    338 </p>
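<p>
A sketch of building such a design matrix with numpy from raw age/height columns; the derived columns (age², age × height, height²) mirror the example above, the numbers are made up:
</p>

<pre><code>import numpy as np

age    = np.array([25.0, 40.0, 31.0])      # made-up raw features
height = np.array([1.80, 1.65, 1.75])

# each row is one person, each column is one feature of the design matrix
X = np.column_stack([age, height, age**2, age * height, height**2])
print(X.shape)   # (3, 5): 3 instances, 5 features
</code></pre>

<p>
a linear classifier on this matrix is still linear in the five columns, but because the columns are nonlinear functions of age and height, the decision boundary in the original (age, height) plane can be curved.
</p>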
    339 
    340 <p>
    341 if you go to more dimensions (so more features in a design matrix), you need hyperplanes.
    342 </p>
    343 
    344 <p>
    345 hyperplanes sound fancy, but a hyperplane is just the higher-dimensional analogue of a line: a flat surface that splits the space in two. for example, this is how you would use a hyperplane in the third dimension to separate points:
    346 </p>
    347 
    348 <p>
    349 <img src="img/3d-hyperplane.png" alt="3D hyperplane example" />
    350 </p>
    351 
    352 <div id="Machine Learning-Machine Learning Steps:-Choose the features-Support vector machine"><h4 id="Support vector machine">Support vector machine</h4></div>
    353 <p>
    354 k(x,y): "distance" function between instances x and y
    355 </p>
    356 
    357 <p>
    358 SVMs create a linear separating hyperplane
    359 </p>
    360 
    361 <p>
    362 they can embed that hyperplane into a higher dimensional domain using the kernel trick -- so long as <em>k</em> is a kernel, you don't have to explicitly compute the design matrix (that drastically reduces the computation)
    363 </p>
    364 
    365 <p>
    366 try to achieve a good margin -- a large separation from the line for both classes (so reduce the blur between two classes, make a clear distinction).
    367 </p>
    368 
    369 <p>
    370 watch out for over-fitting -- you might end up with the model being trained extremely specifically to your data.
    371 </p>
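<p>
A minimal sketch using scikit-learn's SVC (assuming scikit-learn is available); the RBF kernel and the toy XOR-like points are illustrative only. The C parameter trades margin size off against training errors, which is one way to keep the over-fitting mentioned above in check:
</p>

<pre><code>from sklearn.svm import SVC

# toy 2D points that are not linearly separable (XOR-like layout)
X = [[0, 0], [1, 1], [0, 1], [1, 0]]
y = [0, 0, 1, 1]

clf = SVC(kernel="rbf", C=1.0)   # kernel trick: no explicit design matrix needed
clf.fit(X, y)
print(clf.predict([[0.9, 0.1]]))   # expected: class 1
</code></pre>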
    372 
    373 <div id="Machine Learning-Machine Learning Steps:-Choose the model (model search)"><h3 id="Choose the model (model search)">Choose the model (model search)</h3></div>
    374 
    375 <p>
    376 so we have a lot of ways to put a line on a graph. but how do you choose the right one?
    377 </p>
    378 
    379 <div id="Machine Learning-Machine Learning Steps:-Choose the model (model search)-Regression"><h4 id="Regression">Regression</h4></div>
    380 <p>
    381 train a learner to produce a model (the model is the line/hyperplane). then you give the model a new instance, and it produces a new value based on a function.
    382 </p>
    383 
    384 <p>
    385 assign a value for each of the points in the feature space.
    386 </p>
    387 
    388 <p>
    389 evaluating regression -- what's a good model for a regression?
    390 </p>
    391 
    392 <p>
    393 you use an error function (the difference between predicted and real values, square to avoid negatives):
    394 </p>
    395 
    396 <p>
    397 \(error(p) = \sum_i (f_p (x_i) - y_i)^2\)
    398 </p>
    399 
    400 <p>
    401 <img src="img/error-graph.png" alt="Error graph" />
    402 </p>
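<p>
A tiny worked example of this error function for one candidate line \(f_p(x) = 2x + 1\) on three made-up points:
</p>

<pre><code>xs = [1.0, 2.0, 3.0]
ys = [3.5, 4.5, 7.5]                    # observed values (made up)

def f_p(x):
    return 2 * x + 1                    # candidate model: slope 2, intercept 1

error = sum((f_p(x) - y) ** 2 for x, y in zip(xs, ys))
print(error)   # (3-3.5)^2 + (5-4.5)^2 + (7-7.5)^2 = 0.75
</code></pre>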
    403 
    404 <p>
    405 in this example, each of the possible lines is represented by two parameters (s the slope, b the y-intercept), in the left graph. those parameters can be plotted on 2D axes, on the right graph.
    406 </p>
    407 
    408 <p>
    409 <img src="img/feature-model-space.png" alt="Feature and model space" />
    410 </p>
    411 
    412 <p>
    413 Then you can take those points in the right graph, and plot their respective error rates (from the error function above) on the z axis, so you end up with a 3D graph -- an error surface:
    414 </p>
    415 
    416 <p>
    417 <img src="img/error-surface.png" alt="Error surface" />
    418 </p>
    419 
    420 <p>
    421 now the point is to find the one with the lowest error (the lowest point in the surface, colored green). and that's what calculus is for, specifically differentiation.
    422 </p>
    423 
    424 <p>
    425 if you've never done calculus, it's not easy to explain, but basically taking the derivative means you're looking for the slope of a certain function at some point in that function. if you set the derivative to zero, you're looking for the point where the slope is zero -- specifically the minimum or maximum of a function.
    426 </p>
    427 
    428 <p>
    429 quick example: if you have a function \(y = x^2 + 2\), there's a minimum at x = 0. you may know that by intuition, but you can also take the first derivative (\(y' = 2x\)), set \(y'\) equal to zero, and solve for x -- you'll get the same result. it's trivial with a simple function, but once you get into cubic and quartic functions, you can't really do it by intuition anymore.
    430 </p>
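<p>
The same tiny example checked with sympy (assuming sympy is installed): take the derivative of \(x^2 + 2\), set it to zero, and solve for x:
</p>

<pre><code>import sympy as sp

x = sp.symbols("x")
f = x**2 + 2
f_prime = sp.diff(f, x)                 # 2*x
print(sp.solve(sp.Eq(f_prime, 0), x))   # [0]: the minimum is at x = 0
</code></pre>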
    431 
    432 <p>
    433 so a derivative \(f'(x)\) of a function \(f(x)\) gives the slope of \(f(x)\) at any point in the domain (where \(f(x)\) has a value \(y\)).
    434 </p>
    435 
    436 <p>
    437 the rules for taking derivatives (don't need to memorize these, they will be provided on the exam):
    438 </p>
    439 
    440 <p>
    441 <img src="img/derivative-rules.png" alt="Derivative rules" />
    442 </p>
    443 
    444 <p>
    445 the problems: 
    446 </p>
    447 <ul>
    448 <li>
    449 not all derivatives that we find can be solved for zero analytically
    450 
    451 <li>
    452 and we also have to consider that this can be in higher dimensions
    453 
    454 </ul>
    455 
    456 <div id="Machine Learning-Machine Learning Steps:-Choose the model (model search)-Gradient descent"><h4 id="Gradient descent">Gradient descent</h4></div>
    457 <p>
    458 you use a variant of the hill climbing algorithm
    459 </p>
    460 
    461 <p>
    462 the steps:
    463 </p>
    464 <ol>
    465 <li>
    466 pick random parameters s (slope), b (intercept)
    467 
    468 <li>
    469 repeat:
    470 
    471 <ol>
    472 <li>
    473 compute the gradient
    474 
    475 <li>
    476 move a small step in the opposite direction
    477 
    478 </ol>
    479 </ol>
    480 
    481 <p>
    482 but there may be multiple zero points -- local optima. there's no guarantee of convergence to the global optimum.
    483 </p>
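<p>
A minimal sketch of the steps above for the two-parameter line, using the squared-error function and its partial derivatives with respect to s and b; the data, step size and iteration count are made up:
</p>

<pre><code>xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.9, 5.1, 7.0, 9.1]          # roughly y = 2x + 1, made up

s, b = 0.0, 0.0                    # 1. pick (arbitrary) starting parameters
rate = 0.01                        # small step size

for _ in range(5000):              # 2. repeat
    # gradient of error(s, b) = sum_i (s*x_i + b - y_i)^2
    grad_s = sum(2 * (s * x + b - y) * x for x, y in zip(xs, ys))
    grad_b = sum(2 * (s * x + b - y) for x, y in zip(xs, ys))
    # move a small step in the opposite direction of the gradient
    s -= rate * grad_s
    b -= rate * grad_b

print(round(s, 2), round(b, 2))    # close to slope 2 and intercept 1
</code></pre>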
    484 
    485 <div id="Machine Learning-Neural Networks"><h2 id="Neural Networks">Neural Networks</h2></div>
    486 <div id="Machine Learning-Neural Networks-Overview"><h3 id="Overview">Overview</h3></div>
    487 <p>
    488 the difference between original machine learning and deep learning:
    489 </p>
    490 
    491 <p>
    492 <img src="img/ml-vs-dl.png" />
    493 </p>
    494 
    495 <p>
    496 the original perceptron:
    497 </p>
    498 <ul>
    499 <li>
    500 binary classifier
    501 
    502 <li>
    503 bias node, always set to 1
    504 
    505 <li>
    506 n inputs for each feature, with weights
    507 
    508 <li>
    509 \(y=w_1 x_1 + w_2 x_2 + b\)
    510 
    511 </ul>
    512 
    513 <p>
    514 but chaining perceptrons doesn't make the model more expressive: a composition of linear functions collapses to a single linear function
    515 </p>
    516   
    517 <p>
    518 <img src="img/perceptron.png" />
    519 </p>
    520 
    521 <p>
    522 So introduce a nonlinear activation function on top of the weighted sum, instead of using only the linear calculation. This squashes the values into a certain range, and now the diagram looks like this:
    523 </p>
    524 
    525 <p>
    526 <img src="img/activation-functions.png" />
    527 </p>
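<p>
A minimal sketch of a single unit with a sigmoid activation (one common choice) applied to the weighted sum; the weights and inputs are made up:
</p>

<pre><code>from math import exp

def sigmoid(a):
    return 1 / (1 + exp(-a))           # squashes any value into (0, 1)

def unit(x1, x2, w1, w2, b):
    a = w1 * x1 + w2 * x2 + b          # the linear part: y = w1*x1 + w2*x2 + b
    return sigmoid(a)                  # the nonlinearity on top

print(unit(0.5, -1.0, w1=2.0, w2=1.0, b=0.1))   # about 0.52
</code></pre>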
    528 
    529 <p>
    530 Then you can build a feed-forward network with hidden layers between input and output:
    531 </p>
    532 
    533 <p>
    534 <img src="img/feedforward.png" />
    535 </p>
    536 
    537   
    538 <div id="Machine Learning-Neural Networks-Training neural networks"><h3 id="Training neural networks">Training neural networks</h3></div>
    539 <p>
    540 Loss function determines how close a given network is to being correct for an input.
    541 If you plot the network's weights on the x-y axes, and the amount of loss on the z axis, you end up with a loss surface. Then you want to find a point on that surface where the loss is the lowest. 
    542 </p>
    543 
    544 <p>
    545 <img src="img/loss-plot.png" />
    546 <img src="img/loss-surface.png" />
    547 </p>
    548 
    549 <p>
    550 You can find the low point in the error surface with gradient descent (differentiation).
    551 </p>
    552 
    553 <p>
    554 Stochastic gradient descent:
    555 </p>
    556 <ol>
    557 <li>
    558 pick random weights w
    559 
    560 <li>
    561 loop:
    562 
    563 <ul>
    564 <li>
    565 \(w = w - r \nabla loss(w)\)
    566 
    567 </ul>
    568 </ol>
    569 
    570 <p>
    571 where r is the learning rate. make sure to pick a good learning rate: if it's too high (like 0.1 for many problems), the updates overshoot and the result is inaccurate, while if it's too low, training takes a long time.
    572 </p>
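<p>
In code, the stochastic version differs from the gradient descent sketch earlier mainly in that each update uses the gradient on a single randomly picked example rather than on the whole data set. A rough sketch, where <code>grad_loss</code> is a hypothetical function returning one gradient component per weight:
</p>

<pre><code>import random

def sgd(w, examples, grad_loss, rate=0.01, steps=10_000):
    """w: list of weights; grad_loss(w, example) returns one gradient per weight."""
    for _ in range(steps):
        example = random.choice(examples)          # stochastic: one example at a time
        gradient = grad_loss(w, example)
        w = [wi - rate * gi for wi, gi in zip(w, gradient)]   # w = w - r * grad loss(w)
    return w
</code></pre>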
    573 
    574 <p>
    575 so in summary:
    576 </p>
    577 <ul>
    578 <li>
    579 get examples of input and output
    580 
    581 <li>
    582 get a loss function
    583 
    584 <li>
    585 work out the gradient of loss with respect to the weights
    586 
    587 <li>
    588 use (stochastic) gradient descent to slowly improve the weights
    589 
    590 </ul>
    591 
    592 <p>
    593 how do you calculate the gradient for complex data?
    594 </p>
    595 <ul>
    596 <li>
    597 symbolically? too computationally expensive
    598 
    599 <li>
    600 numerically? too unstable, also expensive
    601 
    602 <li>
    603 so settle for a middle ground -- backpropagation
    604 
    605 </ul>
    606 
    607 <p>
    608 backpropagation: if the system is a composition of modules, the gradient is the product of the gradient of each module with respect to its arguments
    609 </p>
    610 <ul>
    611 <li>
    612 break computation down into chain of modules
    613 
    614 <li>
    615 work out derivative of each module with respect to its input (<em>symbolically</em>)
    616 
    617 <li>
    618 compute the gradient for a given input by multiplying these gradients for the current input
    619 
    620 </ul>
    621 
    622 <p>
    623 <img src="img/backpropagation.png" />
    624 </p>
    625 
    626 <p>
    627 for a given input x, you do a forward pass (calculating the value of each part), then fill those values in to do the final backward calculation.
    628 </p>
    629 
    630 <p>
    631 so like this, with x = 0:
    632 </p>
    633 
    634 <p>
    635 <img src="img/backpropagation-calculation.png" />
    636 </p>
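<p>
A sketch of the same idea on a small made-up composition (not necessarily the function in the slide): break \(f(x) = \sigma(2x)\), a sigmoid applied to \(2x\), into two modules, do a forward pass, then multiply the local derivatives on the way back:
</p>

<pre><code>from math import exp

x = 0.0                                   # input, here x = 0

# forward pass: compute the value of each module
a = 2 * x                                 # module 1: a = 2x
y = 1 / (1 + exp(-a))                     # module 2: y = sigmoid(a)

# backward pass: multiply the derivative of each module w.r.t. its input
dy_da = y * (1 - y)                       # derivative of sigmoid(a) is y(1 - y)
da_dx = 2                                 # derivative of 2x is 2
dy_dx = dy_da * da_dx                     # chain rule: product of module gradients

print(y, dy_dx)                           # 0.5 and 0.5 at x = 0
</code></pre>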
    637 
    638 <p>
    639 neural networks are just a way of thinking about differentiable computation.
    640 </p>
    641 
    642 <div id="Machine Learning-Neural Networks-Autoencoders: a NN architecture"><h3 id="Autoencoders: a NN architecture">Autoencoders: a NN architecture</h3></div>
    643 <p>
    644 bottleneck architecture:
    645 </p>
    646 
    647 <p>
    648 <img src="img/autoencoder-architecture.png" />
    649 </p>
    650 
    651 <p>
    652 the output should be as close as possible to the input, but the data must pass through a small intermediate representation (the bottleneck)
    653 </p>
    654 
    655 <p>
    656 autoencoders map data to a <em>latent representation</em>
    657 </p>
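<p>
A minimal sketch of the bottleneck idea in PyTorch; this is not the code sample linked in the next section (use that for the actual exercise), and the layer sizes here are arbitrary:
</p>

<pre><code>import torch
from torch import nn

# the encoder squeezes 784 inputs down to a 32-dimensional latent representation,
# the decoder tries to reconstruct the original 784 values from it
autoencoder = nn.Sequential(
    nn.Linear(784, 128), nn.ReLU(),
    nn.Linear(128, 32),              # the bottleneck / latent representation
    nn.ReLU(),
    nn.Linear(32, 128), nn.ReLU(),
    nn.Linear(128, 784), nn.Sigmoid(),
)

x = torch.rand(16, 784)                       # a fake batch of flattened images
loss = nn.MSELoss()(autoencoder(x), x)        # output should be close to the input
loss.backward()                               # gradients via backpropagation
</code></pre>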
    658 
    659 <div id="Machine Learning-Neural Networks-Trying it out"><h3 id="Trying it out">Trying it out</h3></div>
    660 <ul>
    661 <li>
    662 code: <a href="https://github.com/pbloem/machine-learning/blob/master/code-samples/autoencoder/autoencoder.py">https://github.com/pbloem/machine-learning/blob/master/code-samples/autoencoder/autoencoder.py</a>
    663 
    664 <li>
    665 setup: <a href="https://docs.google.com/document/d/1-LXG5Lb76xQy70W2ZdannnYMEXRLt0CsoiaK0gTkmfY/edit?usp=sharing">https://docs.google.com/document/d/1-LXG5Lb76xQy70W2ZdannnYMEXRLt0CsoiaK0gTkmfY/edit?usp=sharing</a>
    666 
    667 </ul>
    668 
    669 <div id="Machine Learning-The promise of depth"><h2 id="The promise of depth">The promise of depth</h2></div>
    670 <p>
    671 if you have many dimensions, the gradient isn't as easy to detect anymore. there's a huge number of parameters.
    672 </p>
    673 
    674 <p>
    675 solutions:
    676 </p>
    677 <ul>
    678 <li>
    679 Pretraining (no pre-classification by humans)
    680 
    681 <ul>
    682 <li>
    683 train each hidden layer to reconstruct its own input (train the network on itself)
    684 
    685 <li>
    686 similar to the autoencoder idea
    687 
    688 <li>
    689 you can then stack the autoencoders (hidden layers)
    690 
    691 </ul>
    692 </ul>
    693 <blockquote>
    694 <img src="img/pretraining.png" alt="Pretraining" />
    695 </blockquote>
    696     
    697 <ul>
    698 <li>
    699 Pruning the network 
    700 
    701 <ul>
    702 <li>
    703 convolutional neural network: combine multiple input nodes into one node of a hidden layer 
    704 
    705 </ul>
    706 </ul>
    707 
    708 </body>
    709 </html>