+++
title = 'Methodology'
template = 'page-math.html'
+++
# ML: Methodology

## Performing an experiment

Never judge your performance on the training data (or you'll fail the
course and life).

The proportion of training to test data is not important, the _absolute
size_ of the test data is. Aim to have at least 500 examples in the
test data (ideally 10 000 or more).

### What if you need to test many models?

e.g. k-nearest neighbours, which classifies a point based on the
classification of its k nearest neighbours. Here k is a hyperparameter
you have to choose, and you can't choose it on the test set -- which is
what the recipe below is for.

### The modern recipe

1. Split the data into train, validation, and test data. Sample randomly,
   with at least 500 examples in the test set.
2. Choose the model, hyperparameters, etc. only based on the training set.
   Test on validation. Don't use the test set for anything.
3. State the hypothesis.
4. During the final run, train on training + validation data.
5. Test the hypothesis _once_ on the test data. Usually at the
   very end of the project, when you write the report/paper.

Don't re-use test data:

- you'd pick the wrong model
- it would inflate your performance estimate

For temporal data, you'll probably want to keep the data ordered by
time.

Which hyperparameters to try?

- trial-and-error (via intuition)
- grid search: define a finite set of values for each hyperparameter, try
  all combinations
- random search

### Cross-validation

You still split your data, but in every run a different slice becomes the
validation data. Then you average the results for the final result.

If it's temporal data, you might want to do walk-forward validation,
where you always expand your data slices forward in time.

## What to report

### Classification

#### What's a good error (5%)?

It depends, just like in every class so far:

- Class imbalance: how much more likely is a positive example than a
  negative example?
- Cost imbalance: how much worse is a mislabeled positive than a
  mislabeled negative? e.g. how bad is it to mark a real email as spam
  vs letting a spam message into your inbox?

#### Performance metrics

##### Confusion matrix (contingency table)

Metrics for a single classifier.

The margins give four totals: the actual number of each class present in
the data, and the number of each class predicted by the classifier.

![](6fa59a4013a0431a9561c4b00b29e8b9.png)

##### Precision and recall

Also for a single classifier.

- Precision: proportion of returned positives that are
  _actually_ positive
- Recall: proportion of existing positives that the classifier found

![](e395f5797bc3479090c5c7128b77f074.png)

You can then calculate rates (a sketch computing all of these follows
the ROC plot below):

- True positive rate (TPR): proportion of actual positives that we
  classified correctly
- False positive rate (FPR): proportion of actual negatives that we
  misclassified as positive

![](4a756d1ddc51411cbb13957c08b20a8f.png)

ROC (receiver-operating characteristics) space: plot true positives
against false positives. The best classifier is in the top left corner.

![](aba7a57e16944be7b654c26df0acae65.png)
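To make the definitions concrete, here is a minimal sketch (not from the lecture; the `binary_metrics` helper and the toy labels are made up) of how precision, recall, TPR and FPR fall out of the four confusion-matrix cells:

```python
import numpy as np

def binary_metrics(y_true, y_pred):
    """Compute precision, recall (= TPR) and FPR from 0/1 labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)

    # The four cells of the confusion matrix.
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))

    return {
        "precision": tp / (tp + fp),  # of the returned positives, how many are real
        "recall":    tp / (tp + fn),  # of the real positives, how many we found (TPR)
        "fpr":       fp / (fp + tn),  # of the real negatives, how many we flagged
    }

print(binary_metrics([1, 1, 0, 0, 1, 0], [1, 0, 0, 1, 1, 0]))
```

With class or cost imbalance these numbers tell you much more than raw accuracy, which is why they are the ones to report.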
Ranking classifier: also gives a score of how negative/positive a point
is.

- turning a classifier into a ranking classifier:
  - for a linear classifier, measure the distance from the decision
    boundary; now you can scale the classifier from timid to bold by
    moving the decision boundary
  - for a tree classifier: sort by the class proportion in each segment
- ranking errors: one per every pair of instances that's ranked
  wrongly (a negative point is ranked more positively than a positive
  point)

Coverage matrix: shows what happens to TPR and FPR if we move the
threshold from right to left (more or less identical to ROC space).

If we draw a line between two classifiers in ROC space, we can create a
classifier for every point on that line by picking the output of one of
the classifiers at random. E.g. with 50/50 probability, we end up halfway
between the two. The area under the curve of classifiers we can create
(the "convex hull") is a good indication of the quality of a classifier
-- the bigger this area, the more useful classifiers we can achieve.
This is a good way to compare classifiers with class or cost imbalance,
if we're unsure of our preferences.

### Regression

Loss function: mean squared error
($\frac{1}{n} \sum_i (\hat{y}_i - y_i)^2$)

Evaluation function: root mean squared error
($\sqrt{\frac{1}{n} \sum_i (\hat{y}_i - y_i)^2}$)

- you may want to report this because it is minimised in the same places
  as the MSE, but has the same units as the original output value, so
  it's easier to interpret

Bias: distance from the true MSE (which is unknown) to the optimum MSE.

- high bias: the model doesn't fit the generating distribution.
  "underfitting"
- reduce by increasing model capacity or adding features

Variance: spread of different experiments' MSE around the true MSE.

- high variance: high model capacity, sensitivity to random
  fluctuations. "overfitting"
- reduce by reducing model capacity, adding regularization, reducing
  tree depth

Specifically for k-NN regression: increasing k increases bias and
decreases variance.

Dartboard example:

![](748a8a36136244d9bc6e3c5c8e4060cb.png)

### Errors & confidence intervals

Statistics tries to answer: can observed results be attributed to _real
characteristics_ of the models, or are they observed _by chance_?

If you see error bars, the author has to indicate what they mean --
there's no convention.

Standard deviation: measure of spread, variance.

Standard error, confidence interval: measure of confidence.

If the population distribution is normal, the standard error of the mean
is $\frac{\sigma}{\sqrt{n}}$; when $\sigma$ is estimated from the sample,
the sampling distribution of the standardised mean follows a
t distribution.

Re confidence intervals: the correct phrasing is "if we repeat the
experiment many times, computing the confidence interval each time, the
true mean would be inside the interval in 95% of those experiments".

Use statistics in ML to show confidence and spread.
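As a concrete illustration (my own sketch, not from the lecture; the accuracy values are invented), the spread and confidence of repeated runs could be reported like this:

```python
import numpy as np
from scipy import stats

# Hypothetical accuracies from repeating the experiment (e.g. cross-validation runs).
accuracies = np.array([0.81, 0.84, 0.79, 0.83, 0.82, 0.80, 0.85, 0.78, 0.83, 0.81])

n = len(accuracies)
mean = accuracies.mean()
std = accuracies.std(ddof=1)       # sample standard deviation: spread
stderr = std / np.sqrt(n)          # standard error of the mean: confidence

# 95% confidence interval using the t distribution (sigma estimated from the sample).
t_crit = stats.t.ppf(0.975, df=n - 1)
ci = (mean - t_crit * stderr, mean + t_crit * stderr)

print(f"mean={mean:.3f}, std={std:.3f}, stderr={stderr:.3f}, 95% CI={ci}")
```

Whether you plot the standard deviation or the standard error as error bars, say so explicitly, since readers can't tell from the plot alone.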
## The no-free-lunch theorem and principle

An answer to the question "what is the best ML method/model in general?"

Theorem: "any two optimization algorithms are equivalent when their
performance is averaged across all possible problems"

i.e. you can't say shit in general.

A few outs:

- universal distribution: the datasets for which our methods work are
  the likely ones
- Occam's razor: the simplest solution/explanation is often the best

Principle: there is no single best learning method; whether an algorithm
is good depends on the domain.

Inductive bias: the aspects of a learning algorithm which, implicitly or
explicitly, make it suitable for certain problems and unsuitable for
others.

## Cleaning your data
### Missing data
The simplest way: remove features for which values are missing. Maybe they're not important, probably, hopefully.

Or remove instances (rows) with missing data. The problem is that if the data wasn't corrupted uniformly, removing rows with missing values changes the data distribution. An example is if people refuse to answer questions.

Generally, think about the real-world use case -- can you also expect missing data there?
* if yes: keep them in the test set, make a model that can consume them
* if no: try to get a test set without missing values, test methods for completing data only in the training set

Guessing the missing data ("imputation"):
* categorical: use the <dfn title="the value that occurs most often">mode</dfn>
* numerical: use the mean
* or, make the feature a target value and train a model

### Outliers
Are they mistakes?
* Yes: deal with them.
* No: leave them alone, but check the model for strong assumptions of normally distributed data

Can we expect them in production?
* Yes: make sure the model can deal with them
* No: remove them, get a test dataset representing production

Watch out for MSE: it's based on an assumption of normally distributed randomness. If you get data with big outliers, it fucks up.

### Class imbalance
<dfn title="i.e. how much more likely is a positive example than a negative example?">Class imbalance</dfn> is a problem, but how do you improve training?
* Use a big test set
* Don't rely on accuracy -- try ROC plots, precision-recall plots, AUC, look at the confusion matrix...
* Resample the training data
  * oversample: sample with replacement. leads to more data, but creates duplicates and increases the likelihood of overfitting.
  * undersample: doesn't lead to duplicates, but you throw away data. might be useful for multiple-pass algorithms
* Use data augmentation for the minority class
  * oversample the minority class with new data derived from existing data
  * example: SMOTE, which generates synthetic minority points by interpolating between a minority point and one of its nearest minority-class neighbours

## Choosing features
Even if the data is a table, you shouldn't just use the columns as features.
Some algorithms work only on numeric features, some only on categorical, some on both.

Converting between categoric/numeric (see the sketch below):
* numeric to categoric: you're bound to lose information, but it might be tolerable
* categoric to numeric:
  * integer coding: make everything an integer. imposes a false ordering on unordered data, so it's generally not a good idea.
  * one-hot coding: one categorical feature becomes several numeric features, one per category; for each element, you say whether or not that category applies (0 or 1).

Expanding features: adding extra features derived from existing features (can improve performance).
For example, when you have results that don't fit on a line, but _do_ fit on a curve, you can add a derived feature x².
If we don't have any intuition for extra features to add, just add all cross products, or use functions like sin/log.
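A minimal numpy sketch of the two ideas above -- one-hot coding a categorical column and expanding a numeric one with a derived x² feature. The `colour` and `x` columns are invented for illustration:

```python
import numpy as np

# Hypothetical data: one categorical feature and one numeric feature.
colour = np.array(["red", "green", "blue", "green"])   # categorical
x      = np.array([0.5, -1.0, 2.0, 1.5])               # numeric

# One-hot coding: one 0/1 column per category (instead of a false integer ordering).
categories = np.unique(colour)                          # ['blue', 'green', 'red']
one_hot = (colour[:, None] == categories[None, :]).astype(float)

# Feature expansion: add a derived x^2 column so a linear model can fit a curve.
features = np.column_stack([one_hot, x, x ** 2])
print(categories)
print(features)
```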
## Normalisation & standardisation
Create a uniform scale.

### Normalisation
Fit to [0,1].
Scales the data linearly; the smallest point becomes zero, the largest point becomes 1:
$x \leftarrow \frac{x - x_{\min}}{x_{\max} - x_{\min}}$

### Standardisation
Fit to a 1D standard normal distribution.
Rescale the data so the mean becomes zero and the standard deviation becomes 1. Make it look like the data came from a standard normal distribution:
$x \leftarrow \frac{x - \mu}{\sigma}$

### Whitening
Fit to a multivariate standard normal distribution.
If the data is correlated, you don't end up with a spherical shape after normalising/standardising. So you have to choose a different basis (coordinate system) for the points.

Back to linear algebra -- choose a basis
$$B = \begin{bmatrix} c & d \end{bmatrix} = \begin{bmatrix} 1.26 & -0.3 \\ 0.9 & 0.5 \end{bmatrix}$$

Then a coordinate $x$ given in this basis is converted to the standard basis by multiplying $Bx$; to express a standard coordinate in this basis, multiply $B^{-1} x$.

Since the inverse of a matrix is computationally expensive, prefer orthonormal bases (the basis vectors are <dfn title="perpendicular to each other">orthogonal</dfn> and <dfn title="have length 1">normal</dfn>). Then $B^{-1} = B^T$, and the transpose is much easier to compute.

Steps:
1. Compute the sample mean $m = \frac{1}{n} \sum_i x_i$ and the sample covariance $S = \frac{1}{n-1} X X^T$ (where $X = [x_1, \dots, x_n] - m$).
2. Find some $A$ s.t. $S = AA^T$:
   * Cholesky decomposition
   * singular value decomposition
   * matrix square root
3. Whiten the data: $x \leftarrow A^{-1} (x-m)$

So whitening means we choose new basis vectors for a coordinate system where the features are not correlated, and the variance is 1 in every direction.

## Dimensionality reduction
The opposite of feature expansion: reducing the number of features in the data by deriving new features from the old ones, hopefully without losing essential information.

Good for efficiency, reducing the variance of model performance, and visualisation.

Principal component analysis (PCA): whitening with some extra properties. After applying it, you throw away all but the first k dimensions and get a very good projection of the data down to k dimensions (a sketch follows the steps below).
1. Mean-center the data
2. Compute the sample covariance $S$
3. Compute the singular value decomposition $S = UZU^T$
   * the SVD is usually computed from X or A (set equal to X or A)
   * $Z$ is diagonal; its diagonal values, sorted from largest to smallest, are the eigenvalues
   * $U$ is an orthonormal basis whose columns are the eigenvectors of $S$
   * Eigenvectors: a matrix transforms vectors, with some getting stretched and rotated. If a vector only gets stretched/flipped, but its direction doesn't change, it's an eigenvector. In math: if $Au = \lambda u$, then $u$ is an eigenvector of $A$ and $\lambda$ is its corresponding scalar eigenvalue.
4. Transform the data: $x \leftarrow U^T x$. To whiten, also divide by $\operatorname{diag}(Z)$
5. Discard all but the first k features (keep only the features corresponding to the biggest eigenvalues)
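A small sketch of the recipe above (my own illustration, not the lecture's code), using the note in step 3 that the SVD can be taken of the mean-centred data matrix directly. Here the data matrix has one row per point, so the right singular vectors play the role of $U$ and the projection is `centred @ vt[:k].T`; the toy data and the `pca` helper are invented:

```python
import numpy as np

def pca(data, k):
    """Project n x d data down to its first k principal components (a sketch)."""
    # 1. Mean-center the data.
    mean = data.mean(axis=0)
    centred = data - mean

    # 2./3. SVD of the centred data; the rows of vt are the eigenvectors of the
    # sample covariance, ordered by decreasing eigenvalue.
    _, singular_values, vt = np.linalg.svd(centred, full_matrices=False)

    # 4./5. Transform the data and keep only the first k features.
    return centred @ vt[:k].T, singular_values

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 2)) @ np.array([[2.0, 0.0], [1.2, 0.3]])  # correlated toy data
projected, sv = pca(x, k=1)
print(projected.shape, sv)
```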