lectures.alex.balgavy.eu

Lecture notes from university.

commit e63763c8aeefd6f77deff857dcbd2f5dd04e668c
parent 4896d121688e89b80d4c3477f241d354fee2bec8
Author: Alex Balgavy <alex@balgavy.eu>
Date:   Mon,  7 Jun 2021 16:38:29 +0200

ML4QS notes

Diffstat:
M content/ml4qs/_index.md               |  3 +++
A content/ml4qs/clustering.md           | 46 ++++++++++++++++++++++++++++++++++++++++++++++
A content/ml4qs/supervised-learning.md  | 39 +++++++++++++++++++++++++++++++++++++++
3 files changed, 88 insertions(+), 0 deletions(-)

diff --git a/content/ml4qs/_index.md b/content/ml4qs/_index.md
@@ -6,3 +6,5 @@ title = 'Machine Learning for the Quantified Self'
 1. [Introduction & Basics of Sensory Data](introduction-basics-of-sensory-data)
 2. [Handling sensory noise](handling-sensory-noise)
 3. [Feature engineering](feature-engineering)
+4. [Clustering](clustering)
+5. [Supervised learning](supervised-learning)
\ No newline at end of file
diff --git a/content/ml4qs/clustering.md b/content/ml4qs/clustering.md
@@ -0,0 +1,46 @@
++++
+title = 'Clustering'
++++
+
+# Clustering
+
+## Learning setup
+Per instance (e.g. multiple instances for each person), or per person (groups of people).
+
+Distance metrics:
+- euclidean distance: length of the straight line between two points (the hypotenuse of a right triangle)
+- manhattan distance: sum of the distances along each dimension (the two other sides of that triangle, added together)
+- minkowski distance is a generalized form of both
+- important to consider the scaling of the data
+- these assume numeric values; for other types you can use Gower's similarity:
+  - dichotomous (present or not): 1 if both present, 0 otherwise
+  - categorical: 1 if both are the same, 0 otherwise
+  - numerical: 1 minus the absolute difference between the values, scaled by the attribute's range
+
+Distance metrics for datasets:
+- non-temporal
+  - summarize the values per attribute over the entire dataset into a single value (mean, min, max...)
+  - estimate parameters of a distribution per attribute, and compare those (e.g. a normal distribution)
+  - compare the distributions of values for attribute i with a statistical test (e.g. Kolmogorov-Smirnov, taking 1 − p as the distance)
+- temporal
+  - raw-data based
+    - simplest case: assume an equal number of points and compute the euclidean distance point by point
+    - if the time series are shifted in time, use a lag to compensate (across all attributes); choose the best shift with the cross-correlation coefficient (higher is better)
+    - if frequencies differ (e.g. people walk at different paces), use dynamic time warping
+      - pairing: time order must be preserved, and the first and last points should be matched
+  - feature-based
+  - model-based
+    - fit a time series model and use its parameters
+
+## Clustering approaches
+Options (a small sketch of the k-means/silhouette idea follows after the supporting list below):
+- k-means: choose k clusters, start with k random centers, assign each point to the closest center, and recompute each center as the mean of its cluster
+  - performance metric: silhouette
+- k-medoids: use actual points as centers instead of artificial means
+- hierarchical:
+  - divisive: start with one big cluster, split at each step
+    - find the dissimilarity of each point, move the most dissimilar one out of the cluster
+    - select the cluster with the largest diameter to split next
+  - agglomerative: start with one cluster per point, merge
+    - merge using single linkage (minimum distance between two clusters), complete linkage (maximum distance), group average, or Ward's criterion (minimize the increase in within-cluster variance)
+- subspace clustering: look at a subspace of the feature space (otherwise, with a huge number of features, distances may even out)
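
A minimal sketch of the k-means + silhouette idea above, assuming scikit-learn and NumPy; the feature matrix `X` here is a made-up stand-in for a per-instance sensor feature matrix:

```python
# Minimal sketch: pick k for k-means via the silhouette score.
# Assumes scikit-learn and NumPy; X is a made-up stand-in for a
# per-instance feature matrix of shape (n_samples, n_features).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))

best_k, best_sil = None, -1.0
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    sil = silhouette_score(X, labels)  # closer to 1 = tighter, better-separated clusters
    if sil > best_sil:
        best_k, best_sil = k, sil

print(f"best k = {best_k}, silhouette = {best_sil:.2f}")
```
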
diff --git a/content/ml4qs/supervised-learning.md b/content/ml4qs/supervised-learning.md
@@ -0,0 +1,39 @@
++++
+title = 'Supervised learning'
++++
+
+# Supervised learning
+Learn a functional relationship (the "unknown target function") from an observation to a target.
+
+Assume an unknown conditional target distribution p(y|x):
+calculate f(x) and add noise from a noise distribution (Bernoulli/categorical for a discrete target, normal for a continuous one).
+
+1. Separate the dataset into training, validation, and test sets.
+2. Learn a function that fits the observed data in the training set.
+   - stratify the training set if it is unbalanced (e.g. oversample)
+3. Evaluate the generalizability of the function on the test set.
+4. Stop the learning process based on the validation set.
+5. If the dataset is small, use cross-validation.
+
+Error measure:
+- assume a hypothesis h for the target function. How far is h from f overall (risk), and what is its value per point (loss)?
+- approximate it using the data we have
+- in-sample error: the error made on the training set
+- out-of-sample error: the error made on all other possible elements
+- we try to minimize the in-sample error, hoping it tracks the out-of-sample error
+
+Model selection:
+- select the hypothesis with the lowest error on the validation set
+- watch out for overfitting; don't use too many features
+- PAC ("probably approximately correct") learnable: a formal definition of an "almost perfect" model
+- VC dimension: the maximum number of input vectors (points) that can be shattered (i.e. the model can represent every possible labelling)
+- all hypothesis sets with finite VC dimension are PAC learnable
+
+## Predictive modeling without a notion of time
+1. Think about the learning setup (what do you want to learn?)
+2. Don't overfit: select features with forward or backward selection, and consider regularization (punishing more complex models); a sketch of forward selection follows after this list
+   - forward selection: iteratively add the most predictive feature
+   - backward selection: iteratively remove the least predictive feature
+   - regularization: add a term to the error function that punishes more complex models
+
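
A minimal sketch of forward feature selection scored with cross-validation, assuming scikit-learn; `X`, `y`, and the logistic-regression model are placeholder choices for illustration:

```python
# Minimal sketch: forward feature selection scored with cross-validation.
# Assumes scikit-learn; X, y, and LogisticRegression are placeholder choices.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def forward_selection(X, y, max_features=5, cv=5):
    """Iteratively add the feature that most improves mean CV accuracy."""
    remaining = list(range(X.shape[1]))
    selected, best_score = [], 0.0
    while remaining and len(selected) < max_features:
        # score each candidate feature added to the current selection
        candidates = [
            (cross_val_score(LogisticRegression(max_iter=1000),
                             X[:, selected + [f]], y, cv=cv).mean(), f)
            for f in remaining
        ]
        score, feature = max(candidates)
        if score <= best_score:  # stop when nothing improves the score
            break
        best_score = score
        selected.append(feature)
        remaining.remove(feature)
    return selected, best_score
```

Backward selection works the same way in reverse: start from all features and iteratively drop the one whose removal hurts the cross-validated score the least.
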