+++
title = 'Methodology'
template = 'page-math.html'
+++
# ML: Methodology

## Performing an experiment

Never judge your performance on the training data (or you'll fail the
course and life).

The proportion of training to test data is not important, the _absolute
size_ of the test data is. Aim to have at least 500 examples in the
test data (ideally 10 000 or more).

### What if you need to test many models?

e.g. k-nearest neighbours, which classifies a point based on the
classification of its k nearest neighbours. Here k is a hyperparameter
you have to choose, and you can't choose it on the test set -- which is
what the recipe below is for.

### The modern recipe

1. Split the data into train, validation, and test data. Sample randomly,
   with at least 500 examples in the test set.
2. Choose the model, hyperparameters, etc. only based on the training set.
   Test on validation. Don't use the test set for anything.
3. State the hypothesis.
4. During the final run, train on training + validation data.
5. Test the hypothesis _once_ on the test data. Usually at the
   very end of the project, when you write the report/paper.

Don't re-use test data:

- you'd pick the wrong model
- it would inflate your performance estimate

For temporal data, you'll probably want to keep the data ordered by
time.

Which hyperparameters to try?

- trial-and-error (via intuition)
- grid search: define a finite set of values for each hyperparameter, try
  all combinations
- random search

### Cross-validation

You still split your data, but in every run a different slice becomes the
validation data. Then you average the results for the final result.

If it's temporal data, you might want to do walk-forward validation,
where you always expand your data slices forward in time.

## What to report

### Classification

#### What's a good error (5%)?

It depends, just like in every class so far:

- Class imbalance: how much more likely is a positive example than a
  negative example?
- Cost imbalance: how much worse is a mislabeled positive than a
  mislabeled negative? e.g. how bad is it to mark a real email as spam
  vs letting a spam message into your inbox?

#### Performance metrics

##### Confusion matrix (contingency table)

Metrics for a single classifier.

The margins give four totals: the actual number of each class present in
the data, and the number of each class predicted by the classifier.

![](6fa59a4013a0431a9561c4b00b29e8b9.png)

##### Precision and recall

Also for a single classifier.

- Precision: proportion of returned positives that are
  _actually_ positive
- Recall: proportion of existing positives that the classifier found

![](e395f5797bc3479090c5c7128b77f074.png)

You can then calculate rates (a sketch computing all of these follows
the ROC plot below):

- True positive rate (TPR): proportion of actual positives that we
  classified correctly
- False positive rate (FPR): proportion of actual negatives that we
  misclassified as positive

![](4a756d1ddc51411cbb13957c08b20a8f.png)

ROC (receiver-operating characteristics) space: plot true positives
against false positives. The best classifier is in the top left corner.

![](aba7a57e16944be7b654c26df0acae65.png)
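To make the definitions concrete, here is a minimal sketch (not from the lecture; the `binary_metrics` helper and the toy labels are made up) of how precision, recall, TPR and FPR fall out of the four confusion-matrix cells:

```python
import numpy as np

def binary_metrics(y_true, y_pred):
    """Compute precision, recall (= TPR) and FPR from 0/1 labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)

    # The four cells of the confusion matrix.
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))

    return {
        "precision": tp / (tp + fp),  # of the returned positives, how many are real
        "recall":    tp / (tp + fn),  # of the real positives, how many we found (TPR)
        "fpr":       fp / (fp + tn),  # of the real negatives, how many we flagged
    }

print(binary_metrics([1, 1, 0, 0, 1, 0], [1, 0, 0, 1, 1, 0]))
```

With class or cost imbalance these numbers tell you much more than raw accuracy, which is why they are the ones to report.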
Ranking classifier: also gives a score of how negative/positive a point
is.

- turning a classifier into a ranking classifier:
  - for a linear classifier, measure the distance from the decision
    boundary; now you can scale the classifier from timid to bold by
    moving the decision boundary
  - for a tree classifier: sort by the class proportion in each segment
- ranking errors: one per every pair of instances that's ranked
  wrongly (a negative point is ranked more positively than a positive
  point)

Coverage matrix: shows what happens to TPR and FPR if we move the
threshold from right to left (more or less identical to ROC space).

If we draw a line between two classifiers in ROC space, we can create a
classifier for every point on that line by picking the output of one of
the classifiers at random. E.g. with 50/50 probability, we end up halfway
between the two. The area under the curve of classifiers we can create
(the "convex hull") is a good indication of the quality of a classifier
-- the bigger this area, the more useful classifiers we can achieve.
This is a good way to compare classifiers with class or cost imbalance,
if we're unsure of our preferences.

### Regression

Loss function: mean squared error
($\frac{1}{n} \sum_i (\hat{y}_i - y_i)^2$)

Evaluation function: root mean squared error
($\sqrt{\frac{1}{n} \sum_i (\hat{y}_i - y_i)^2}$)

- you may want to report this because it is minimised in the same places
  as the MSE, but has the same units as the original output value, so
  it's easier to interpret

Bias: distance from the true MSE (which is unknown) to the optimum MSE.

- high bias: the model doesn't fit the generating distribution.
  "underfitting"
- reduce by increasing model capacity or adding features

Variance: spread of different experiments' MSE around the true MSE.

- high variance: high model capacity, sensitivity to random
  fluctuations. "overfitting"
- reduce by reducing model capacity, adding regularization, reducing
  tree depth

Specifically for k-NN regression: increasing k increases bias and
decreases variance.

Dartboard example:

![](748a8a36136244d9bc6e3c5c8e4060cb.png)

### Errors & confidence intervals

Statistics tries to answer: can observed results be attributed to _real
characteristics_ of the models, or are they observed _by chance_?

If you see error bars, the author has to indicate what they mean --
there's no convention.

Standard deviation: measure of spread, variance.

Standard error, confidence interval: measure of confidence.

If the population distribution is normal, the standard error of the mean
is $\frac{\sigma}{\sqrt{n}}$; when $\sigma$ is estimated from the sample,
the sampling distribution of the standardised mean follows a
t distribution.

Re confidence intervals: the correct phrasing is "if we repeat the
experiment many times, computing the confidence interval each time, the
true mean would be inside the interval in 95% of those experiments".

Use statistics in ML to show confidence and spread.
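As a concrete illustration (my own sketch, not from the lecture; the accuracy values are invented), the spread and confidence of repeated runs could be reported like this:

```python
import numpy as np
from scipy import stats

# Hypothetical accuracies from repeating the experiment (e.g. cross-validation runs).
accuracies = np.array([0.81, 0.84, 0.79, 0.83, 0.82, 0.80, 0.85, 0.78, 0.83, 0.81])

n = len(accuracies)
mean = accuracies.mean()
std = accuracies.std(ddof=1)       # sample standard deviation: spread
stderr = std / np.sqrt(n)          # standard error of the mean: confidence

# 95% confidence interval using the t distribution (sigma estimated from the sample).
t_crit = stats.t.ppf(0.975, df=n - 1)
ci = (mean - t_crit * stderr, mean + t_crit * stderr)

print(f"mean={mean:.3f}, std={std:.3f}, stderr={stderr:.3f}, 95% CI={ci}")
```

Whether you plot the standard deviation or the standard error as error bars, say so explicitly, since readers can't tell from the plot alone.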
## The no-free-lunch theorem and principle

An answer to the question "what is the best ML method/model in general?"

Theorem: "any two optimization algorithms are equivalent when their
performance is averaged across all possible problems"

i.e. you can't say shit in general.

A few outs:

- universal distribution: the datasets for which our methods work are
  the likely ones
- Occam's razor: the simplest solution/explanation is often the best

Principle: there is no single best learning method; whether an algorithm
is good depends on the domain.

Inductive bias: the aspects of a learning algorithm which, implicitly or
explicitly, make it suitable for certain problems and unsuitable for
others.

## Cleaning your data
### Missing data
The simplest way: remove features for which values are missing. Maybe they're not important, probably, hopefully.

Or remove instances (rows) with missing data. The problem is that if the data wasn't corrupted uniformly, removing rows with missing values changes the data distribution. An example is if people refuse to answer questions.

Generally, think about the real-world use case -- can you also expect missing data there?
* if yes: keep them in the test set, make a model that can consume them
* if no: try to get a test set without missing values, test methods for completing data only in the training set

Guessing the missing data ("imputation"):
* categorical: use the <dfn title="the value that occurs most often">mode</dfn>
* numerical: use the mean
* or, make the feature a target value and train a model

### Outliers
Are they mistakes?
* Yes: deal with them.
* No: leave them alone, but check the model for strong assumptions of normally distributed data

Can we expect them in production?
* Yes: make sure the model can deal with them
* No: remove them, get a test dataset representing production

Watch out for MSE: it's based on an assumption of normally distributed randomness. If you get data with big outliers, it fucks up.

### Class imbalance
<dfn title="i.e. how much more likely is a positive example than a negative example?">Class imbalance</dfn> is a problem, but how do you improve training?
* Use a big test set
* Don't rely on accuracy -- try ROC plots, precision-recall plots, AUC, look at the confusion matrix...
* Resample the training data
  * oversample: sample with replacement. leads to more data, but creates duplicates and increases the likelihood of overfitting.
  * undersample: doesn't lead to duplicates, but you throw away data. might be useful for multiple-pass algorithms
* Use data augmentation for the minority class
  * oversample the minority class with new data derived from existing data
  * example: SMOTE, which generates synthetic minority points by interpolating between a minority point and one of its nearest minority-class neighbours

## Choosing features
Even if the data is a table, you shouldn't just use the columns as features.
Some algorithms work only on numeric features, some only on categorical, some on both.

Converting between categoric/numeric (see the sketch below):
* numeric to categoric: you're bound to lose information, but it might be tolerable
* categoric to numeric:
  * integer coding: make everything an integer. imposes a false ordering on unordered data, so it's generally not a good idea.
  * one-hot coding: one categorical feature becomes several numeric features, one per category; for each element, you say whether or not that category applies (0 or 1).

Expanding features: adding extra features derived from existing features (can improve performance).
For example, when you have results that don't fit on a line, but _do_ fit on a curve, you can add a derived feature x².
If we don't have any intuition for extra features to add, just add all cross products, or use functions like sin/log.
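A minimal numpy sketch of the two ideas above -- one-hot coding a categorical column and expanding a numeric one with a derived x² feature. The `colour` and `x` columns are invented for illustration:

```python
import numpy as np

# Hypothetical data: one categorical feature and one numeric feature.
colour = np.array(["red", "green", "blue", "green"])   # categorical
x      = np.array([0.5, -1.0, 2.0, 1.5])               # numeric

# One-hot coding: one 0/1 column per category (instead of a false integer ordering).
categories = np.unique(colour)                          # ['blue', 'green', 'red']
one_hot = (colour[:, None] == categories[None, :]).astype(float)

# Feature expansion: add a derived x^2 column so a linear model can fit a curve.
features = np.column_stack([one_hot, x, x ** 2])
print(categories)
print(features)
```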
## Normalisation & standardisation
Create a uniform scale.

### Normalisation
Fit to [0,1].
Scales the data linearly; the smallest point becomes zero, the largest point becomes 1:
$x \leftarrow \frac{x - x_{\min}}{x_{\max} - x_{\min}}$

### Standardisation
Fit to a 1D standard normal distribution.
Rescale the data so the mean becomes zero and the standard deviation becomes 1. Make it look like the data came from a standard normal distribution:
$x \leftarrow \frac{x - \mu}{\sigma}$

### Whitening
Fit to a multivariate standard normal distribution.
If the data is correlated, you don't end up with a spherical shape after normalising/standardising. So you have to choose a different basis (coordinate system) for the points.

Back to linear algebra -- choose a basis
$$B = \begin{bmatrix} c & d \end{bmatrix} = \begin{bmatrix} 1.26 & -0.3 \\ 0.9 & 0.5 \end{bmatrix}$$

Then a coordinate $x$ given in this basis is converted to the standard basis by multiplying $Bx$; to express a standard coordinate in this basis, multiply $B^{-1} x$.

Since the inverse of a matrix is computationally expensive, prefer orthonormal bases (the basis vectors are <dfn title="perpendicular to each other">orthogonal</dfn> and <dfn title="have length 1">normal</dfn>). Then $B^{-1} = B^T$, and the transpose is much easier to compute.

Steps:
1. Compute the sample mean $m = \frac{1}{n} \sum_i x_i$ and the sample covariance $S = \frac{1}{n-1} X X^T$ (where $X = [x_1, \dots, x_n] - m$).
2. Find some $A$ s.t. $S = AA^T$:
   * Cholesky decomposition
   * singular value decomposition
   * matrix square root
3. Whiten the data: $x \leftarrow A^{-1} (x-m)$

So whitening means we choose new basis vectors for a coordinate system where the features are not correlated, and the variance is 1 in every direction.

## Dimensionality reduction
The opposite of feature expansion: reducing the number of features in the data by deriving new features from the old ones, hopefully without losing essential information.

Good for efficiency, reducing the variance of model performance, and visualisation.

Principal component analysis (PCA): whitening with some extra properties. After applying it, you throw away all but the first k dimensions and get a very good projection of the data down to k dimensions (a sketch follows the steps below).
1. Mean-center the data
2. Compute the sample covariance $S$
3. Compute the singular value decomposition $S = UZU^T$
   * the SVD is usually computed from X or A (set equal to X or A)
   * $Z$ is diagonal; its diagonal values, sorted from largest to smallest, are the eigenvalues
   * $U$ is an orthonormal basis whose columns are the eigenvectors of $S$
   * Eigenvectors: a matrix transforms vectors, with some getting stretched and rotated. If a vector only gets stretched/flipped, but its direction doesn't change, it's an eigenvector. In math: if $Au = \lambda u$, then $u$ is an eigenvector of $A$ and $\lambda$ is its corresponding scalar eigenvalue.
4. Transform the data: $x \leftarrow U^T x$. To whiten, also divide by $\operatorname{diag}(Z)$
5. Discard all but the first k features (keep only the features corresponding to the biggest eigenvalues)
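A small sketch of the recipe above (my own illustration, not the lecture's code), using the note in step 3 that the SVD can be taken of the mean-centred data matrix directly. Here the data matrix has one row per point, so the right singular vectors play the role of $U$ and the projection is `centred @ vt[:k].T`; the toy data and the `pca` helper are invented:

```python
import numpy as np

def pca(data, k):
    """Project n x d data down to its first k principal components (a sketch)."""
    # 1. Mean-center the data.
    mean = data.mean(axis=0)
    centred = data - mean

    # 2./3. SVD of the centred data; the rows of vt are the eigenvectors of the
    # sample covariance, ordered by decreasing eigenvalue.
    _, singular_values, vt = np.linalg.svd(centred, full_matrices=False)

    # 4./5. Transform the data and keep only the first k features.
    return centred @ vt[:k].T, singular_values

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 2)) @ np.array([[2.0, 0.0], [1.2, 0.3]])  # correlated toy data
projected, sv = pca(x, k=1)
print(projected.shape, sv)
```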