      1 +++
      2 title = "Machine Learning"
      3 template = 'page-math.html'
      4 +++
      5 # Machine Learning
      6 
      7 design of a learning element is affected by:
      8   * which components of performance element are to be learned
      9   * what feedback is available to learn these components
     10   * what representation is used for the components
     11 
     12 offline learning: learn based on some data, then apply results to situation
     13 
     14 feedback types (get input, make decision, learn based on feedback):
     15   * supervised learning: correct answers for each example.
     16     * only positive examples
     17     * positive and negative examples
     18   * unsupervised learning: no correct answers given
  * semi-supervised learning: correct answers for only some examples; related: active learning, where the learner picks which questions to ask
     20   * reinforcement learning: occasional rewards
     21 
     22 ![image](neurons-vs-nn.png)
     23 
     24 ## Learning problems
     25 
     26 Classification
     27   * use: from data to discrete classes.
     28   * examples: handwriting recognition, spam filtering, facial recognition.
     29 
     30 Regression
     31   * use: predicting a numeric value
     32   * examples: stock market, weather predictions
     33 
     34 Ranking
     35   * use: comparing items
     36   * examples: web search
     37 
     38 Collaborative filtering
  * use: take someone else's information and, based on that, give a prediction
     40   * examples: recommendation systems (books, movies/shows)
     41 
     42 Clustering
     43   * use: discovering structure/patterns in data
     44   * examples: clustering images
     45 
     46 ## Methodology
     47 ### Data
     48 labeled instances (like spam, not spam)
     49   * training set
     50   * held out set (e.g. for validation)
     51   * test set (don't even look at this until you want to test your model)
     52 
     53 randomly allocate to data sets.
     54 
     55 features: attribute-value pairs characterizing each x
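
a minimal sketch of that random allocation in Python (the toy instances and the 60/20/20 split are assumptions, not from the notes):

```python
import random

# toy labeled instances: (instance, label) pairs, made up for illustration
data = [(f"email_{i}", "spam" if i % 3 == 0 else "not spam") for i in range(100)]

random.shuffle(data)                              # randomly allocate instances
n = len(data)
training_set = data[:int(0.6 * n)]                # training set
held_out_set = data[int(0.6 * n):int(0.8 * n)]    # held-out set, e.g. for validation/tuning
test_set = data[int(0.8 * n):]                    # test set: don't look at this until the end
```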
     56 
     57 ### Experimentation
     58 experimentation cycle:
     59   1. select hypothesis, tune hyperparameters on held-out/validation set
  2. compute accuracy on the test set (never "peek" at the test set while training or tuning)
     61 
     62 ### Evaluation
     63 accuracy -- fraction of instances predicted correctly
     64 
     65 Cross validation:
     66 
     67 ![Cross validation diagram](cross-validation.png)
     68 
     69 create a confusion matrix (TODO: there's a diagram for this but not in the old slides)
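
since the diagram is missing, here is a small sketch of how a confusion matrix and the accuracy can be computed; the gold labels and predictions are made up for illustration:

```python
from collections import Counter

# hypothetical gold labels and model predictions for a binary task
y_true = ["spam", "spam", "not spam", "not spam", "spam", "not spam"]
y_pred = ["spam", "not spam", "not spam", "spam", "spam", "not spam"]

# confusion matrix: counts of (true label, predicted label) pairs
confusion = Counter(zip(y_true, y_pred))
for (t, p), count in sorted(confusion.items()):
    print(f"true={t:9s} predicted={p:9s} count={count}")

# accuracy = fraction of instances predicted correctly
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print("accuracy:", accuracy)
```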
     70 
     71 ## Machine Learning Steps:
     72   1. choose the features
     73   2. choose the model class
     74   3. choose a search method
     75 
     76 ### Choose the features
create a feature space: features on the x-y axes, points are individual data instances, and the classification shows up as a color scheme.
     78 
     79 ![Feature space graph](feature-space.png)
     80 
     81 
     82 #### Inductive learning method
     83   * construct/adjust h to agree with f (function predicting value) on training set
     84   * h is consistent if it agrees with f on all examples
     85   * for example, curve fitting:
     86 
     87     ![Curve fitting graph](curve-fitting.png)
     88 
     89 Occam's razor: "one should not increase, beyond what is necessary, the number of entities required to explain anything"
     90 basically, choose the simplest option.
     91 
     92 #### Classifying with naive Bayes
     93 
     94 Binary classification
     95 * input: email
     96 * output: spam, not spam
     97 * setup:
     98   * get large collection of example emails, labeled "spam/not spam"
     99   * someone has to hand-label that
    100   * want to learn to predict label for new emails in future
    101 * features: attributes used to make label decision
    102   * words, text patterns (e.g. caps), non-text
    103 
    104 calculation for Bayesian classifier: $P(C|F_1,...,F_n)$
    105 
    106 using Bayes' theorem:
    107 
    108 $P(C|F_1,...F_n)=\frac{P(C)P(F_1,...,F_n|C)}{P(F_1,...,F_n)}$
    109 
    110 rewrite the numerator of the equation:
    111 
    112 $P(C)P(F_1,...,F_n |C) = P(C)P(F_1 | C)P(F_2 | C, F_1)P(F_3|C, F_1, F_2)P(F_4,...,F_n | C, F_1, F_2, F_3)$
    113 
that uses the chain rule. but it's too computationally expensive.
    115 so naively assume conditional independence:
    116 
    117 $P(F_i | C, F_j) = P(F_i | C)$
    118 
    119 This simplifies the formula to:
    120 
$P(C)P(F_1,...,F_n | C) = P(C) \prod_{i=1}^{n} P(F_i | C)$
    122 
    123 ![image](bayes-calculation.png)
    124 
Laplace smoothing helps with zero counts and really small probabilities (e.g. words never seen for a class).
Naive Bayes often works well -- sometimes it performs well even if the independence assumptions are badly violated.
    127 classification is about predicting correct class label, _not_ about accurately estimating probabilities
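
a minimal sketch of such a naive Bayes classifier with Laplace smoothing (the toy emails, their labels, and the add-one smoothing constant are assumptions for illustration):

```python
import math
from collections import Counter, defaultdict

# toy labeled emails, made up for illustration
train = [("buy cheap pills now".split(), "spam"),
         ("cheap pills cheap".split(), "spam"),
         ("meeting notes attached".split(), "not spam"),
         ("lunch meeting tomorrow".split(), "not spam")]

class_counts = Counter(label for _, label in train)   # counts for P(C)
word_counts = defaultdict(Counter)                    # counts for P(F_i | C)
for words, label in train:
    word_counts[label].update(words)
vocab = {w for words, _ in train for w in words}

def predict(words, alpha=1.0):
    """Pick the class maximizing log P(C) + sum_i log P(F_i | C),
    with add-alpha (Laplace) smoothing for unseen words."""
    best, best_score = None, float("-inf")
    for c in class_counts:
        score = math.log(class_counts[c] / len(train))
        total = sum(word_counts[c].values())
        for w in words:
            score += math.log((word_counts[c][w] + alpha)
                              / (total + alpha * len(vocab)))
        if score > best_score:
            best, best_score = c, score
    return best

print(predict("cheap pills".split()))   # expected: spam
```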
    128 
#### Classification with k-nearest neighbor
k-nearest neighbor classification: find the k most similar cases, copy the majority label
    131 
    132 e.g. to classify unlabeled document, find most similar document and copy label:
    133 
    134 ![K-nearest neighbor example](knn.png)
    135 
    136 the _k_ helps get rid of noise which would lead to misclassification
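
a minimal sketch of k-nearest-neighbor classification with Euclidean distance (the toy 2D points and k=3 are assumptions):

```python
import math
from collections import Counter

def knn_predict(query, labeled_points, k=3):
    """Classify `query` by copying the majority label of its k nearest neighbors.
    Points are plain feature vectors (tuples); distance is Euclidean."""
    neighbors = sorted(labeled_points,
                       key=lambda item: math.dist(query, item[0]))[:k]
    labels = [label for _, label in neighbors]
    return Counter(labels).most_common(1)[0][0]

# toy 2D feature space with two classes (values made up for illustration)
points = [((1.0, 1.2), "A"), ((0.8, 1.0), "A"),
          ((3.0, 3.1), "B"), ((3.2, 2.9), "B")]
print(knn_predict((1.1, 1.1), points, k=3))   # expected: "A"
```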
    137 
    138 #### Linear classifier
    139 linear classifier: come up with a line that divides feature space, use that for prediction.
    140 
    141 works for stuff like $x \lor y$, but not if we want $x \oplus y$ or other things that are not linearly separable.
    142 
you can build a design matrix of all the different features you want to include. here's an example with 5 different features (age, height, age², age × height, height²) that classifies a person as male/female:
    144 
    145 ![Design matrix](design-matrix.png)
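
a small sketch of how such a design matrix could be built in code (the ages/heights are made up, and the leading bias column of 1s is an extra assumption, not from the slides):

```python
# columns: bias, age, height, age^2, age*height, height^2
people = [(25, 180), (31, 165), (42, 172)]   # (age, height) pairs, made up

design_matrix = [
    [1, age, height, age**2, age * height, height**2]
    for age, height in people
]
for row in design_matrix:
    print(row)
```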
    146 
    147 if you go to more dimensions (so more features in a design matrix), you need hyperplanes.
    148 
hyperplanes sound fancy af but they're just the higher-dimensional version of a line: a flat surface that splits the space in two. for example, this is how you would use a hyperplane in three dimensions to separate points:
    150 
    151 ![3D hyperplane example](3d-hyperplane.png)
    152 
    153 #### Support vector machine
    154 k(x,y): "distance" function between instances x and y
    155 
    156 SVMs create a linear separating hyperplane
    157 
    158 they can embed that hyperplane into a higher dimensional domain using the kernel trick -- so long as _k_ is a kernel, you don't have to explicitly compute the design matrix (that drastically reduces the computation)
    159 
    160 try to achieve a good margin -- a large separation from the line for both classes (so reduce the blur between two classes, make a clear distinction).
    161 
    162 watch out for over-fitting -- you might end up with the model being trained extremely specifically to your data.
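
a minimal sketch using scikit-learn (not the course's own code); the XOR-style toy points and the RBF kernel/C values are assumptions, chosen to show a problem that isn't linearly separable:

```python
from sklearn.svm import SVC

X = [[0, 0], [0, 1], [1, 0], [1, 1]]   # feature vectors
y = [0, 1, 1, 0]                       # XOR labels: not linearly separable

clf = SVC(kernel="rbf", C=10.0)        # kernel trick: separate in a higher-dimensional space
clf.fit(X, y)
print(clf.predict([[0, 1], [1, 1]]))   # expected: [1 0]
```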
    163 
    164 ### Choose the model (model search)
    165 
    166 so we have a lot of ways to put a line on a graph. but how do you choose the right one?
    167 
    168 #### Regression
train a learner to produce a model (the model is the line/hyperplane). then you give the model a new instance, and it produces a predicted value for it.
    170 
    171 assign a value for each of the points in the feature space.
    172 
    173 evaluating regression -- what's a good model for a regression?
    174 
you use an error function (the difference between predicted and real values, squared so that positive and negative errors don't cancel out):
    176 
    177 $error(p) = \sum_i (f_p (x_i) - y_i)^2$
    178 
    179 ![Error graph](error-graph.png)
    180 
in this example, each of the possible lines in the left graph is represented by two parameters: s (the slope) and b (the y-intercept). those two parameters can be plotted as a single point on the 2D axes in the right graph.
    182 
    183 ![Feature and model space](feature-model-space.png)
    184 
    185 Then you can take those points in the right graph, and plot their respective error rates (from the error function above) on the z axis, so you end up with a 3D graph -- an error surface:
    186 
    187 ![Error surface](error-surface.png)
    188 
    189 now the point is to find the one with the lowest error (the lowest point in the surface, colored green). and that's what calculus is for, specifically differentiation.
    190 
    191 if you've never done calculus, it's not easy to explain, but basically taking the derivative means you're looking for the slope of a certain function at some point in that function. if you set the derivative to zero, you're looking for the point where the slope is zero -- specifically the minimum or maximum of a function.
    192 
quick example: if you have a function $y = x^2 + 2$, there's a minimum at x = 0. you may know that by intuition, but you can also take the first derivative ($y' = 2x$), set $y'$ equal to zero, and solve for x -- you'll get the same result. it's trivial with a simple function, but once you get into cubic and quartic functions, you can't really do it by intuition.
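
you can check that example symbolically, e.g. with sympy (sympy itself is an assumption, the notes don't use it):

```python
import sympy as sp

x = sp.symbols("x")
y = x**2 + 2
dy = sp.diff(y, x)                  # first derivative: 2*x
print(sp.solve(sp.Eq(dy, 0), x))    # [0]  -> minimum at x = 0
```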
    194 
    195 so a derivative $f'(x)$ of a function $f(x)$ gives the slope of $f(x)$ at any point in the domain (where $f(x)$ has a value $y$).
    196 
    197 the rules for taking derivatives (don't need to memorize these, they will be provided on the exam):
    198 
    199 ![Derivative rules](derivative-rules.png)
    200 
    201 the problems:
    202   * not all derivatives that we find can be resolved to zero
    203   * and we also have to consider that this can be in higher dimensions
    204 
    205 #### Gradient descent
    206 you use a variant of the hill climbing algorithm
    207 
    208 the steps:
    209   1. pick random parameters s (slope), b (intercept)
    210   2. repeat:
    211     1. compute the gradient
    2. move a small step in the opposite direction of the gradient
    213 
but there may be multiple points where the gradient is zero -- local optima. there's no guarantee of converging to the global minimum.
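
a minimal sketch of this loop for the slope/intercept regression example above (the toy data points and the learning rate are assumptions):

```python
# gradient descent on the squared-error surface for a line f_p(x) = s*x + b
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 8.1)]   # (x_i, y_i) pairs, made up

s, b = 0.0, 0.0          # starting parameters (could be random)
r = 0.01                 # learning rate

for step in range(2000):
    # gradient of error(p) = sum_i (s*x_i + b - y_i)^2 w.r.t. s and b
    grad_s = sum(2 * (s * x + b - y) * x for x, y in data)
    grad_b = sum(2 * (s * x + b - y) for x, y in data)
    # move a small step in the opposite direction of the gradient
    s -= r * grad_s
    b -= r * grad_b

print(s, b)              # should end up near slope ≈ 2, intercept ≈ 0
```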
    215 
    216 ## Neural Networks
    217 ### Overview
    218 the difference between original machine learning and deep learning:
    219 
    220 ![image](ml-vs-dl.png)
    221 
    222 the original perceptron:
    223   * binary classifier
    224   * bias node, always set to 1
  * n inputs, one per feature, each with its own weight
    226   * $y=w_1 x_1 + w_2 x_2 + b$
    227 
but chaining these units doesn't make the model more interesting: the composed function collapses to a linear one (see the small derivation below the diagram)
    229 
    230 ![image](perceptron.png)
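
to see why, compose two one-input linear units (a simplified example, not from the slides): feeding the output of the first into the second gives

$w_2 (w_1 x + b_1) + b_2 = (w_2 w_1) x + (w_2 b_1 + b_2)$

which is again just a linear function $wx + b$, with $w = w_2 w_1$ and $b = w_2 b_1 + b_2$. stacking more linear units keeps collapsing in the same way.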
    231 
    232 So introduce an activation function instead of using a standard linear calculation. This puts the values in a certain range, and now the diagram looks like this:
    233 
    234 ![image](activation-functions.png)
    235 
    236 Then you can build a feed-forward network with hidden layers between input and output:
    237 
    238 ![image](feedforward.png)
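
a sketch of a single forward pass through such a network, with one hidden layer and a sigmoid activation (all sizes and weights are made up for illustration):

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

x = [0.5, -1.0]                                # two input features
W1 = [[0.1, 0.4], [-0.3, 0.8], [0.7, 0.2]]     # 3 hidden units x 2 inputs
b1 = [0.0, 0.1, -0.1]                          # hidden biases
W2 = [0.5, -0.6, 0.9]                          # weights of the single output unit
b2 = 0.05

# hidden layer: weighted sum of inputs, plus bias, through the activation
hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
          for row, b in zip(W1, b1)]
# output layer: weighted sum of hidden activations, plus bias
output = sum(w * h for w, h in zip(W2, hidden)) + b2
print(output)
```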
    239 
    240 
    241 ### Training neural networks
    242 Loss function determines how close a given network is to being correct for an input.
If you plot the possible networks (their parameters) on the x-y plane, and the amount of loss on the z axis, you end up with a loss surface. Then you want to find a point on that surface where the loss is the lowest.
    244 
    245 ![image](loss-plot.png)
    246 ![image](loss-surface.png)
    247 
    248 You can find the low point in the error surface with gradient descent (differentiation).
    249 
    250 Stochastic gradient descent:
    251   1. pick random weights w
    252   2. loop:
    * $w = w - r \nabla loss(w)$
    254 
where r is the learning rate. make sure to pick a good learning rate: if it's too high (like 0.1), the steps overshoot and the result is inaccurate, while if it's too low, training takes a long time.
    256 
    257 so in summary:
    258   * get examples of input and output
    259   * get a loss function
    260   * work out the gradient of loss with respect to the weights
    261   * use (stochastic) gradient descent to slowly improve the weights
    262 
    263 how do you calculate the gradient for complex data?
    264   * symbolically? too computationally expensive
    265   * numerically? too unstable, also expensive
    266   * so settle for a middle ground -- backpropagation
    267 
    268 backpropagation: if the system is a composition of modules, the gradient is the product of the gradient of each module with respect to its arguments
    269   * break computation down into chain of modules
    270   * work out derivative of each module with respect to its input (_symbolically_)
    271   * compute the gradient for a given input by multiplying these gradients for the current input
    272 
    273 ![image](backpropagation.png)
    274 
for a given input x, you do a forward pass (calculating the value of each module), then fill those values into the chain of local derivatives to get the final gradient.
    276 
    277 so like this, with x = 0:
    278 
    279 ![image](backpropagation-calculation.png)
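
a tiny sketch of that forward/backward idea in code, for one possible composition $y = sigmoid(w \cdot x + b)$ with $x = 0$ (the values of w and b are made up; the exact modules in the slides may differ):

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

w, b, x = 2.0, 1.0, 0.0

# forward pass: compute the value of each module
z = w * x + b          # linear module
y = sigmoid(z)         # activation module

# backward pass: derivative of each module w.r.t. its input (worked out symbolically),
# multiplied together for the current input (chain rule)
dy_dz = y * (1 - y)    # d sigmoid(z) / dz
dz_dw = x              # d(w*x + b) / dw
dz_db = 1.0            # d(w*x + b) / db

dy_dw = dy_dz * dz_dw  # gradient of the output w.r.t. w
dy_db = dy_dz * dz_db  # gradient of the output w.r.t. b
print(dy_dw, dy_db)
```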
    280 
    281 neural networks are just a way of thinking about differentiable computation.
    282 
    283 ### Autoencoders: a NN architecture
    284 bottleneck architecture:
    285 
    286 ![image](autoencoder-architecture.png)
    287 
the output should be as close as possible to the input, but it has to pass through a small (bottleneck) representation
    289 
    290 autoencoders map data to a _latent representation_
    291 
    292 ### Trying it out
    293 * code: https://github.com/pbloem/machine-learning/blob/master/code-samples/autoencoder/autoencoder.py
    294 * setup: https://docs.google.com/document/d/1-LXG5Lb76xQy70W2ZdannnYMEXRLt0CsoiaK0gTkmfY/edit?usp=sharing
    295 
    296 ## The promise of depth
if the network is deep, with many dimensions and a huge number of parameters, the gradient isn't as easy to compute and follow anymore.
    298 
    299 solutions:
    300   * Pretraining (no pre-classification by humans)
    301     * train the network on itself in hidden layers
    302     * similar to the autoencoder idea
    303     * you can then stack the autoencoders (hidden layers)
    304 
    305     ![Pretraining](pretraining.png)
    306 
    307   * Pruning the network
    308     * convolutional neural network: combine multiple input nodes into one node of a hidden layer