lectures.alex.balgavy.eu

Lecture notes from university.


+++
title = 'Tree models and ensembles'
template = 'page-math.html'
+++
# Tree models & ensembles

## Tree models
### Decision trees (categorical)
Decision trees work on both numerical and categorical features.

Standard decision tree learning algorithm (ID3, C4.5):
- start with an empty tree
- extend step by step
    - stop when: all inputs are the same (no features left to split on), or all outputs are the same (all instances have the same class)
- greedy (no backtracking)
- choose the split that creates the least uniform distribution over class labels in the resulting segmentation
    - entropy is a measure of the uniformity of a distribution
    - recall, entropy $H(X) = - \sum_{x \in X} P(x) \log{P(x)}$
    - conditional entropy: $H(X|Y) = \sum_{y} P(y) H(X | Y = y)$
    - information gain of Y: $I_{X}(Y) = H(X) - H(X | Y)$
    - so, pick the split with the highest information gain (see the worked example after this list)

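A minimal sketch (in Python) of the entropy and information-gain computation above; the toy labels and the candidate split are made up for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """H(X) = -sum_x P(x) log P(x), estimated from label frequencies (log base 2)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, segments):
    """I_X(Y) = H(X) - sum_y P(y) H(X | Y=y); segments are the parts of one candidate split."""
    n = len(labels)
    conditional = sum(len(seg) / n * entropy(seg) for seg in segments)
    return entropy(labels) - conditional

parent = ['pos', 'pos', 'pos', 'neg', 'neg', 'neg']     # segment before the split
split = [['pos', 'pos', 'pos', 'neg'], ['neg', 'neg']]  # one candidate split
print(entropy(parent))                  # 1.0: a 50/50 distribution is maximally uniform
print(information_gain(parent, split))  # ~0.46 bits gained by this split
```
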
The algorithm in steps (a sketch in code follows below):
1. start with a single unlabeled leaf
2. loop until there are no unlabeled leaves:
    - for each unlabeled leaf $l$ with segment $S$:
        - if the stop condition holds, label $l$ with the majority class of $S$
            - stop when: all inputs are the same (no features left to split on), or all outputs are the same (all instances have the same class)
        - else, split $l$ on the feature $F$ with the highest gain $I_{S}(F)$

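A minimal recursive sketch of this algorithm for categorical features, reusing `information_gain` from the example above; it assumes the data is a list of `(feature_dict, label)` pairs, and real ID3/C4.5 implementations add more bookkeeping (missing values, gain ratio, and so on).

```python
from collections import Counter

def majority_class(rows):
    return Counter(label for _, label in rows).most_common(1)[0][0]

def build_tree(rows, features):
    """rows: list of (feature_dict, label) pairs; features: names still available to split on."""
    labels = [label for _, label in rows]
    # stop: all outputs are the same, or no features left to split on
    if len(set(labels)) == 1 or not features:
        return majority_class(rows)                      # a labeled leaf
    # greedy: pick the feature with the highest information gain (see the example above)
    def gain(f):
        values = set(x[f] for x, _ in rows)
        segments = [[label for x, label in rows if x[f] == v] for v in values]
        return information_gain(labels, segments)
    best = max(features, key=gain)
    children = {}
    for v in set(x[best] for x, _ in rows):
        subset = [(x, label) for x, label in rows if x[best] == v]
        # a categorical feature is never reused further down the branch
        children[v] = build_tree(subset, [f for f in features if f != best])
    return (best, children)                              # internal node: split on `best`
```
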
With categorical features, it doesn't make sense to split on the same feature twice.

### Regression trees (numeric)
For numeric features, split at a numeric threshold t.

Of course there's a trade-off: more complicated decision trees lead to overfitting.

Pruning: for every split, ask whether the tree classifies better with or without the split (on validation data).

Using validation data: the test set is only for final testing, the validation set is for hyperparameter selection. If you want to control the search, split the training data and use a part of it for 'validation'.

Label each leaf with its single remaining element, or with the mean of the target values in its segment.

Instead of entropy, use the reduction in variance: $I_{S}(V) = Var(S) - \sum_{i} \frac{| S_i |}{|S|} Var(S_i)$

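A minimal sketch of this variance-reduction criterion for one candidate split; the toy target values and the threshold are made up for illustration.

```python
def variance(values):
    """Var(S) of the target values in a segment (population variance)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def variance_reduction(parent, segments):
    """I_S(V) = Var(S) - sum_i |S_i|/|S| * Var(S_i)."""
    n = len(parent)
    return variance(parent) - sum(len(s) / n * variance(s) for s in segments)

# toy numeric targets, split at the (made-up) threshold t = 10
targets = [1.0, 2.0, 3.0, 11.0, 12.0, 13.0]
left, right = [1.0, 2.0, 3.0], [11.0, 12.0, 13.0]
print(variance_reduction(targets, [left, right]))  # large: the split separates low from high targets
```
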
### Generalization hierarchy
![154f1d15c7808fb3db6bf33d60c61bbb.png](dc5bb005d60e400e9026666b704139da.png)

## Ensembling methods
Bias and variance:

![ba23c24af56d5203ba58cdc8b1b3f8d8.png](2b81999a96204ddeb7dc539b580b8dbb.png)

Real-life example:
- grading by rubric: high bias, low variance
- grading by TA: low bias, high variance

Bootstrapping (from methodology 2):
- sample, with replacement, a dataset of the same size as the whole dataset
- each bootstrapped sample lets you repeat your experiment
- better than cross-validation for small datasets
- but some classifiers don't like duplicates

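A minimal sketch of bootstrap resampling with numpy; the dataset shape and the number of bootstrap samples are made-up values.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))      # made-up dataset: 100 instances, 5 features
y = rng.integers(0, 2, size=100)   # made-up binary labels

def bootstrap_sample(X, y, rng):
    """Sample, with replacement, a dataset of the same size as the original."""
    idx = rng.integers(0, len(X), size=len(X))
    return X[idx], y[idx]

# each bootstrapped sample lets you repeat the experiment (e.g. re-train and re-evaluate)
samples = [bootstrap_sample(X, y, rng) for _ in range(10)]
```
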
Ensembling is:
- used in production to get better performance from a model
- never used in research, since we can improve any model by boosting
- can be expensive for big models

Bagging reduces variance, boosting reduces bias.

### Bagging
Bagging: **b**ootstrap **agg**regating
- resample k datasets, train k models. this is the ensemble
- the ensemble classifies by majority vote
    - for class probabilities, use the relative frequency among the votes

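A minimal sketch of bagging with majority voting, reusing `bootstrap_sample` from above; the decision tree base model and `k` are arbitrary choices.

```python
from collections import Counter
from sklearn.tree import DecisionTreeClassifier   # any base model with fit/predict would do

def bag(X, y, k, rng):
    """Train k models, each on its own bootstrap sample; the list of models is the ensemble."""
    ensemble = []
    for _ in range(k):
        Xb, yb = bootstrap_sample(X, y, rng)       # from the bootstrapping sketch above
        ensemble.append(DecisionTreeClassifier().fit(Xb, yb))
    return ensemble

def predict_majority(ensemble, x):
    """Classify by majority vote; relative vote frequencies give class 'probabilities'."""
    votes = [m.predict(x.reshape(1, -1))[0] for m in ensemble]
    return Counter(votes).most_common(1)[0][0]
```

scikit-learn's `BaggingClassifier` and `RandomForestClassifier` implement this directly (a random forest additionally subsamples the features).
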
Random forest: bagging with decision trees
- subsample data and features for each model in ensemble
- pro: reduces variance, few hyperparameters, easy to parallelize
- con: no reduction of bias

### Boosting
Train some classifier $m_0$, then iteratively train more classifiers:
increase the weights of instances misclassified by the current classifier,
and train the next iteration on the reweighted data.

Weighted loss function: $\text{loss}(\theta) = \sum_{i} w_{i} (f_{\theta}(x_i) - t_i)^2$

Or resample the data by the weights: the weight determines how likely an instance is to end up in the resampled data.

Boosting works even if each model only classifies slightly better than random.

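A minimal sketch of the reweighting and resampling steps, assuming hard predictions and labels; the update factor is a made-up illustration of "increase the weights of misclassified instances", not the lecture's exact rule.

```python
import numpy as np

def reweight(weights, y_true, y_pred, factor=2.0):
    """Increase the weights of instances the current classifier got wrong, then renormalize."""
    weights = np.where(y_pred != y_true, weights * factor, weights)
    return weights / weights.sum()

def resample_by_weights(X, y, weights, rng):
    """Alternative: resample; the weight is the probability that an instance is drawn."""
    idx = rng.choice(len(X), size=len(X), replace=True, p=weights)
    return X[idx], y[idx]
```
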
#### AdaBoost (binary classifiers)
TODO: think of a better way to explain this for the exam

Each model fits a reweighted dataset, and each model defines its own reweighted loss.

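A minimal sketch of the standard discrete AdaBoost loop for labels in $\{-1, +1\}$, with a depth-1 decision tree (stump) standing in for any weak learner; the lecture's exact derivation of the per-model reweighted loss may differ.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, rounds=10):
    """Discrete AdaBoost; assumes y is in {-1, +1}."""
    n = len(X)
    w = np.full(n, 1.0 / n)                            # start with uniform instance weights
    models, alphas = [], []
    for _ in range(rounds):
        stump = DecisionTreeClassifier(max_depth=1)    # weak learner: a decision stump
        stump.fit(X, y, sample_weight=w)               # fit the reweighted dataset
        pred = stump.predict(X)
        err = np.sum(w[pred != y])                     # weighted error of this model
        alpha = 0.5 * np.log((1 - err) / max(err, 1e-10))   # this model's vote in the ensemble
        w = w * np.exp(-alpha * y * pred)              # up-weight misclassified instances
        w = w / w.sum()
        models.append(stump)
        alphas.append(alpha)
    return models, alphas

def adaboost_predict(models, alphas, X):
    """Ensemble prediction: the sign of the alpha-weighted vote."""
    return np.sign(sum(a * m.predict(X) for a, m in zip(alphas, models)))
```
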
#### Gradient boosting
Boosting for regression models.

Fit the next model to the residuals of the current ensemble.

Each model fits the (pseudo-)residuals left by the previous models; the ensemble optimises a global loss, even if the individual models don't optimise a well-defined loss.

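A minimal sketch of gradient boosting for squared-error regression, where the pseudo-residuals are simply the residuals of the current ensemble; the learning rate, tree depth, and number of rounds are made-up values.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, rounds=50, lr=0.1):
    """Each round fits a small tree to the residuals of the current ensemble."""
    base = y.mean()                                      # start from a constant model
    prediction = np.full(len(y), base)
    models = []
    for _ in range(rounds):
        residuals = y - prediction                       # pseudo-residuals for squared error
        tree = DecisionTreeRegressor(max_depth=3).fit(X, residuals)
        prediction = prediction + lr * tree.predict(X)   # the ensemble improves step by step
        models.append(tree)
    return base, models

def boosted_predict(base, models, X, lr=0.1):
    return base + lr * sum(m.predict(X) for m in models)
```
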
### Stacking
When you want to combine some number of existing models into a single model:
compute the outputs of the models for every point in the dataset, and add them to the dataset as new columns.
Then train a new model (the 'combiner', usually a logistic regression) on this extended data. If NNs are used for the ensemble, the whole thing turns into one big NN.
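
A minimal sketch of stacking, assuming some already-trained `models` (e.g. from the sketches above); in practice the combiner should be trained on held-out predictions to avoid leakage, which this simple version skips.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def stack(models, X, y):
    """Add each model's output as an extra column and train the combiner on the extended data."""
    extra = np.column_stack([m.predict(X) for m in models])
    combiner = LogisticRegression().fit(np.hstack([X, extra]), y)
    return combiner

def stacked_predict(combiner, models, X):
    extra = np.column_stack([m.predict(X) for m in models])
    return combiner.predict(np.hstack([X, extra]))
```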