lectures.alex.balgavy.eu

Lecture notes from university.

machine-learning.wiki (11976B)


= Machine Learning =

design of a learning element is affected by:
  * which components of the performance element are to be learned
  * what feedback is available to learn these components
  * what representation is used for the components

offline learning: learn based on some data, then apply the results to the situation

feedback types (get input, make decision, learn based on feedback):
  * supervised learning: correct answers given for each example.
    * only positive examples
    * positive and negative examples
  * unsupervised learning: no correct answers given
  * semi-supervised learning: only some of the examples are labeled (related: active learning, where the learner picks which questions to ask)
  * reinforcement learning: occasional rewards

{{file:img/neurons-vs-nn.png}}

== Learning problems ==

Classification
  * use: from data to discrete classes.
  * examples: handwriting recognition, spam filtering, facial recognition.

Regression
  * use: predicting a numeric value
  * examples: stock market, weather predictions

Ranking
  * use: comparing items
  * examples: web search

Collaborative filtering
  * use: take someone else's information and, based on that, give a prediction
  * examples: recommendation systems (books, movies/shows)

Clustering
  * use: discovering structure/patterns in data
  * examples: clustering images

== Methodology ==
=== Data ===
labeled instances (like spam, not spam)
  * training set
  * held-out set (e.g. for validation)
  * test set (don't even look at this until you want to test your model)

randomly allocate instances to the data sets.

features: attribute-value pairs characterizing each x

=== Experimentation ===
experimentation cycle:
  1. select a hypothesis, tune hyperparameters on the held-out/validation set
  2. compute accuracy on the test set (never "peek" at the test set itself)

=== Evaluation ===
accuracy -- fraction of instances predicted correctly

Cross validation:

{{file:img/cross-validation.png|Cross validation diagram}}

create a confusion matrix (TODO: there's a diagram for this but not in the old slides)
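
both evaluation ideas can be sketched in a few lines; a minimal version for binary labels (the example data here is made up for illustration):

```python
# Accuracy and a 2x2 confusion matrix for a binary classifier,
# computed from parallel lists of true and predicted labels.

def accuracy(y_true, y_pred):
    # fraction of instances predicted correctly
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def confusion_matrix(y_true, y_pred, positive="spam"):
    # counts of true/false positives/negatives
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn}

y_true = ["spam", "spam", "ham", "ham"]
y_pred = ["spam", "ham",  "ham", "spam"]
print(accuracy(y_true, y_pred))          # 0.5
print(confusion_matrix(y_true, y_pred))  # {'tp': 1, 'fp': 1, 'fn': 1, 'tn': 1}
```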

== Machine Learning Steps ==
  1. choose the features
  2. choose the model class
  3. choose a search method

=== Choose the features ===
create a feature space: features on the x-y axes, points are individual data instances, classification would be a color scheme.

{{file:img/feature-space.png|Feature space graph}}


==== Inductive learning method ====
  * construct/adjust h to agree with f (the function predicting the value) on the training set
  * h is consistent if it agrees with f on all examples
  * for example, curve fitting:

    {{file:img/curve-fitting.png|Curve fitting graph}}

Occam's razor: "one should not increase, beyond what is necessary, the number of entities required to explain anything"
basically, choose the simplest option.

==== Classifying with naive Bayes ====

Binary classification
* input: email
* output: spam, not spam
* setup:
  * get a large collection of example emails, labeled "spam/not spam"
  * someone has to hand-label those
  * want to learn to predict labels for new emails in the future
* features: attributes used to make the label decision
  * words, text patterns (e.g. caps), non-text

calculation for a Bayesian classifier: $P(C|F_1,...,F_n)$

using Bayes' theorem:

$P(C|F_1,...,F_n)=\frac{P(C)P(F_1,...,F_n|C)}{P(F_1,...,F_n)}$

rewrite the numerator of the equation:

$P(C)P(F_1,...,F_n |C) = P(C)P(F_1 | C)P(F_2 | C, F_1)P(F_3|C, F_1, F_2)P(F_4,...,F_n | C, F_1, F_2, F_3)$

that uses the chain rule. but it's too computationally expensive.
so naively assume conditional independence:

$P(F_i | C, F_j) = P(F_i | C)$

This simplifies the formula to:

$P(C)P(F_1,...,F_n | C) = P(C) \prod_{i=1}^{n} P(F_i | C)$

{{file:img/bayes-calculation.png}}

Laplace smoothing helps with really small probabilities.
Naive Bayes often works; it sometimes performs well even if the assumptions are badly violated.
classification is about predicting the correct class label, _not_ about accurately estimating probabilities
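
a minimal sketch of the whole idea, with Laplace (add-one) smoothing and log-probabilities to avoid underflow (the tiny training set is made up for illustration):

```python
import math
from collections import Counter

# Minimal naive Bayes text classifier with Laplace (add-one) smoothing.

def train(examples):
    # examples: list of (list_of_words, label)
    labels = Counter(label for _, label in examples)
    word_counts = {label: Counter() for label in labels}
    vocab = set()
    for words, label in examples:
        word_counts[label].update(words)
        vocab.update(words)
    return labels, word_counts, vocab

def predict(words, labels, word_counts, vocab):
    total = sum(labels.values())
    best, best_logp = None, -math.inf
    for label, count in labels.items():
        logp = math.log(count / total)  # log P(C)
        n = sum(word_counts[label].values())
        for w in words:
            # Laplace smoothing: add 1 to every count so unseen words
            # don't zero out the whole product
            logp += math.log((word_counts[label][w] + 1) / (n + len(vocab)))
        if logp > best_logp:
            best, best_logp = label, logp
    return best

examples = [("win money now".split(), "spam"),
            ("cheap money offer".split(), "spam"),
            ("meeting agenda attached".split(), "ham"),
            ("lunch meeting tomorrow".split(), "ham")]
model = train(examples)
print(predict("money offer".split(), *model))       # spam
print(predict("meeting tomorrow".split(), *model))  # ham
```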

==== Clustering with K-nearest neighbor ====
k-nearest-neighbor classification: find the k most similar cases, copy the majority label

e.g. to classify an unlabeled document, find the most similar document and copy its label:

{{file:img/knn.png|K-nearest neighbor example}}

the _k_ helps get rid of noise which would otherwise lead to misclassification
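
the "find k most similar, copy majority label" recipe is short enough to sketch directly (2D points and squared Euclidean distance are assumptions for the example; any similarity measure would do):

```python
from collections import Counter

# Minimal k-nearest-neighbor classifier on 2D feature vectors.

def knn_classify(point, examples, k=3):
    # examples: list of ((x, y), label)
    def dist2(a, b):
        return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2
    # take the k most similar cases...
    nearest = sorted(examples, key=lambda e: dist2(point, e[0]))[:k]
    # ...and copy the majority label
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

examples = [((0, 0), "a"), ((1, 0), "a"), ((0, 1), "a"),
            ((5, 5), "b"), ((6, 5), "b"), ((5, 6), "b")]
print(knn_classify((1, 1), examples))  # a
print(knn_classify((5, 4), examples))  # b
```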

==== Linear classifier ====
linear classifier: come up with a line that divides the feature space, use that for prediction.

works for stuff like $x \lor y$, but not if we want $x \oplus y$ or other things that are not linearly separable.
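
a quick numeric illustration of that claim: a linear threshold unit with hand-picked weights computes OR, but a brute-force search over a grid of weights never finds a line that computes XOR (the grid and weights here are made up for the demo):

```python
# A linear classifier on two binary inputs: predict 1 if
# w1*x + w2*y + b > 0. The default weights are hand-picked to compute OR.

def linear_classify(x, y, w1=1, w2=1, b=-0.5):
    return 1 if w1 * x + w2 * y + b > 0 else 0

# OR is linearly separable:
print([linear_classify(x, y) for x, y in [(0, 0), (0, 1), (1, 0), (1, 1)]])  # [0, 1, 1, 1]

# XOR is not: no weights on the grid classify all four points correctly.
xor_table = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
steps = [i / 2 for i in range(-8, 9)]  # -4.0 .. 4.0 in steps of 0.5
found = False
for w1 in steps:
    for w2 in steps:
        for b in steps:
            if all(linear_classify(x, y, w1, w2, b) == t for (x, y), t in xor_table):
                found = True
print(found)  # False
```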

you can build a design matrix of all the different features you want to include. here's an example with 5 different features (age, height, age², age × height, height²) that classifies a person as male/female:

{{file:img/design-matrix.png|Design matrix}}

if you go to more dimensions (so more features in a design matrix), you need hyperplanes.

hyperplanes sound fancy af, but a hyperplane is just the higher-dimensional analogue of a line. for example, this is how you would use a hyperplane in the third dimension to separate points:

{{file:img/3d-hyperplane.png|3D hyperplane example}}

==== Support vector machine ====
k(x,y): "distance" function between instances x and y

SVMs create a linear separating hyperplane

they can embed that hyperplane into a higher-dimensional space using the kernel trick -- as long as _k_ is a kernel, you don't have to explicitly compute the design matrix (which drastically reduces the computation)

try to achieve a good margin -- a large separation from the line for both classes (so reduce the blur between the two classes, make a clear distinction).

watch out for overfitting -- you might end up with the model being trained extremely specifically to your data.
=== Choose the model (model search) ===

so we have a lot of ways to put a line on a graph. but how do you choose the right one?

==== Regression ====
train a learner to produce a model (the model is the line/hyperplane). then you give the model a new instance, and it produces a new value based on a function.

assign a value to each of the points in the feature space.

evaluating regression -- what's a good model for a regression?

you use an error function (the difference between predicted and real values, squared to avoid negatives):

$error(p) = \sum_i (f_p (x_i) - y_i)^2$

{{file:img/error-graph.png|Error graph}}
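
for a line $f_p(x) = sx + b$, the error function is one line of code (the data points here are made up, generated from $y = 2x + 1$):

```python
# Sum-of-squares error for a line f_p(x) = s*x + b against data points.

def error(s, b, xs, ys):
    # sum of squared differences between predicted and real values
    return sum((s * x + b - y) ** 2 for x, y in zip(xs, ys))

xs = [0, 1, 2, 3]
ys = [1, 3, 5, 7]           # exactly y = 2x + 1
print(error(2, 1, xs, ys))  # 0: a perfect fit has zero error
print(error(1, 0, xs, ys))  # 30: a worse line has higher error
```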

in this example, each of the possible lines is represented by two parameters (s the slope, b the y-intercept), in the left graph. those parameters can be plotted on 2D axes, in the right graph.

{{file:img/feature-model-space.png|Feature and model space}}

Then you can take those points in the right graph, and plot their respective error rates (from the error function above) on the z axis, so you end up with a 3D graph -- an error surface:

{{file:img/error-surface.png|Error surface}}

now the point is to find the one with the lowest error (the lowest point in the surface, colored green). and that's what calculus is for, specifically differentiation.

if you've never done calculus, it's not easy to explain, but basically taking the derivative means you're looking for the slope of a certain function at some point in that function. if you set the derivative to zero, you're looking for the point where the slope is zero -- specifically a minimum or maximum of the function.

quick example: if you have a function $y = x^2 + 2$, there's a minimum at x = 0. you may know that by intuition, but you can also take the first derivative ($y' = 2x$), set $y'$ equal to zero, and solve for x -- you'll get the same result. it's trivial with a simple function, but once you get into cubic and quartic functions, you can't really do it by intuition.
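
the quick example can also be checked numerically: a finite-difference approximation of the slope of $y = x^2 + 2$ matches $y' = 2x$, and is zero at the minimum x = 0:

```python
# Numerically check the derivative example: for f(x) = x^2 + 2,
# the slope should be 2x, and zero at the minimum x = 0.

def f(x):
    return x ** 2 + 2

def numeric_derivative(f, x, h=1e-6):
    # central finite-difference approximation of f'(x)
    return (f(x + h) - f(x - h)) / (2 * h)

print(round(numeric_derivative(f, 3), 6))  # 6.0, matching y' = 2x
print(round(numeric_derivative(f, 0), 6))  # 0.0: slope is zero at the minimum
```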

so a derivative $f'(x)$ of a function $f(x)$ gives the slope of $f(x)$ at any point in the domain (where $f(x)$ has a value $y$).

the rules for taking derivatives (no need to memorize these, they will be provided on the exam):

{{file:img/derivative-rules.png|Derivative rules}}

the problems:
  * not all derivatives that we find can be resolved to zero
  * and we also have to consider that this can be in higher dimensions

==== Gradient descent ====
you use a variant of the hill-climbing algorithm

the steps:
  1. pick random parameters s (slope), b (intercept)
  2. repeat:
    1. compute the gradient
    2. move a small step in the opposite direction

but there may be multiple zero points -- local optima. there's no guarantee of convergence.
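
the steps above can be sketched for the line-fitting error function from earlier. the partial derivatives in the comments are worked out symbolically from $\sum_i (sx_i + b - y_i)^2$; the data and step size are made up for the demo:

```python
import random

# Gradient descent on the sum-of-squares error for a line y = s*x + b.
# For error(s, b) = sum_i (s*x_i + b - y_i)^2, the partial derivatives are
#   d(error)/ds = sum_i 2 * (s*x_i + b - y_i) * x_i
#   d(error)/db = sum_i 2 * (s*x_i + b - y_i)

xs = [0, 1, 2, 3]
ys = [1, 3, 5, 7]  # generated from y = 2x + 1

random.seed(0)
s, b = random.random(), random.random()  # 1. pick random parameters
step = 0.01                              # a small step size

for _ in range(5000):                    # 2. repeat:
    grad_s = sum(2 * (s * x + b - y) * x for x, y in zip(xs, ys))
    grad_b = sum(2 * (s * x + b - y) for x, y in zip(xs, ys))
    s -= step * grad_s                   # move a small step in the
    b -= step * grad_b                   # opposite direction of the gradient

print(round(s, 3), round(b, 3))  # close to 2.0 and 1.0
```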

== Neural Networks ==
=== Overview ===
the difference between original machine learning and deep learning:

{{file:img/ml-vs-dl.png}}

the original perceptron:
  * binary classifier
  * bias node, always set to 1
  * n inputs for each feature, with weights
  * $y=w_1 x_1 + w_2 x_2 + b$

but chaining perceptrons doesn't make it more interesting: a composition of linear functions collapses to a single linear function

{{file:img/perceptron.png}}
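
a numeric check of that collapse (the weights are arbitrary): substituting one linear layer into another gives exactly another linear function.

```python
# Two chained linear "perceptrons" (no activation function) collapse
# into a single linear function. Weights are arbitrary.

def layer1(x1, x2):
    return 2 * x1 + 3 * x2 + 1  # y = w1*x1 + w2*x2 + b

def layer2(h):
    return 4 * h + 5            # another linear map on top

def collapsed(x1, x2):
    # substituting layer1 into layer2: 4*(2*x1 + 3*x2 + 1) + 5
    return 8 * x1 + 12 * x2 + 9

for x1, x2 in [(0, 0), (1, 2), (-3, 0.5)]:
    assert layer2(layer1(x1, x2)) == collapsed(x1, x2)
print("chained == collapsed for all inputs tested")
```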

So introduce an activation function instead of using a standard linear calculation. This puts the values in a certain range, and now the diagram looks like this:

{{file:img/activation-functions.png}}

Then you can build a feed-forward network with hidden layers between input and output:

{{file:img/feedforward.png}}


=== Training neural networks ===
The loss function determines how close a given network is to being correct for an input.
If you plot neural networks on the x-y plane, and the amount of loss on the z axis, you end up with a loss surface. Then you want to find a point on that surface where the loss is the lowest.

{{file:img/loss-plot.png}}
{{file:img/loss-surface.png}}

You can find the low point in the loss surface with gradient descent (differentiation).

Stochastic gradient descent:
  1. pick random weights w
  2. loop:
    * $w \leftarrow w - r \nabla loss(w)$

where r is the learning rate. make sure to pick a good learning rate: if it's too high (like 0.1), the steps may overshoot the minimum, while if it's too low, it will take a long time.
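
the loop above looks like this for the earlier line-fitting setup, updating on one randomly picked example per step instead of the full gradient (data and learning rate made up for the demo):

```python
import random

# Stochastic gradient descent for a line y = s*x + b: instead of the
# full gradient, update on one randomly picked example per step.

xs = [0, 1, 2, 3]
ys = [1, 3, 5, 7]  # generated from y = 2x + 1
r = 0.02           # learning rate

random.seed(1)
s, b = random.random(), random.random()  # 1. pick random weights

for _ in range(20000):                   # 2. loop:
    i = random.randrange(len(xs))        # one random example
    err = s * xs[i] + b - ys[i]
    s -= r * 2 * err * xs[i]             # w = w - r * grad(loss)(w)
    b -= r * 2 * err

print(round(s, 2), round(b, 2))  # close to 2.0 and 1.0
```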

so in summary:
  * get examples of input and output
  * get a loss function
  * work out the gradient of the loss with respect to the weights
  * use (stochastic) gradient descent to slowly improve the weights

how do you calculate the gradient for complex data?
  * symbolically? too computationally expensive
  * numerically? too unstable, also expensive
  * so settle for a middle ground -- backpropagation

backpropagation: if the system is a composition of modules, the gradient is the product of the gradients of each module with respect to its arguments
  * break the computation down into a chain of modules
  * work out the derivative of each module with respect to its input (_symbolically_)
  * compute the gradient for a given input by multiplying these gradients for the current input
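
a sketch of that recipe for the composition $f(x) = \sigma(wx + b)$, treated as two modules (linear, then sigmoid); the weights are made up, and the backpropagated gradient is checked against a finite-difference approximation:

```python
import math

# Backpropagation for the composition f(x) = sigmoid(w*x + b),
# broken into two modules: linear(x) = w*x + b and sigmoid(a).
# Module derivatives (worked out symbolically):
#   d(linear)/dx  = w
#   d(sigmoid)/da = sigmoid(a) * (1 - sigmoid(a))

w, b = 2.0, -1.0  # arbitrary weights

def sigmoid(a):
    return 1 / (1 + math.exp(-a))

def forward_and_gradient(x):
    a = w * x + b           # forward pass through module 1
    y = sigmoid(a)          # forward pass through module 2
    grad = y * (1 - y) * w  # backward pass: multiply the module gradients
    return y, grad

y, grad = forward_and_gradient(0.0)
# sanity check against a numerical derivative
h = 1e-6
numeric = (forward_and_gradient(h)[0] - forward_and_gradient(-h)[0]) / (2 * h)
print(abs(grad - numeric) < 1e-6)  # True
```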

{{file:img/backpropagation.png}}

for a given input x, you do a forward pass (calculating the value for each part), then fill those values in to the final calculation.

so like this, with x = 0:

{{file:img/backpropagation-calculation.png}}

neural networks are just a way of thinking about differentiable computation.

=== Autoencoders: a NN architecture ===
bottleneck architecture:

{{file:img/autoencoder-architecture.png}}

the input should be as close as possible to the output, but it must pass through a small representation

autoencoders map data to a _latent representation_

=== Trying it out ===
* code: https://github.com/pbloem/machine-learning/blob/master/code-samples/autoencoder/autoencoder.py
* setup: https://docs.google.com/document/d/1-LXG5Lb76xQy70W2ZdannnYMEXRLt0CsoiaK0gTkmfY/edit?usp=sharing

== The promise of depth ==
if you have many dimensions, the gradient isn't as easy to detect anymore. there's a huge number of parameters.

solutions:
  * Pretraining (no pre-classification by humans)
    * train the network on itself in hidden layers
    * similar to the autoencoder idea
    * you can then stack the autoencoders (hidden layers)

    {{file:img/pretraining.png|Pretraining}}

  * Pruning the network
    * convolutional neural network: combine multiple input nodes into one node of a hidden layer