commit e63763c8aeefd6f77deff857dcbd2f5dd04e668c
parent 4896d121688e89b80d4c3477f241d354fee2bec8
Author: Alex Balgavy <alex@balgavy.eu>
Date: Mon, 7 Jun 2021 16:38:29 +0200
ML4QS notes
Diffstat:
3 files changed, 88 insertions(+), 0 deletions(-)
diff --git a/content/ml4qs/_index.md b/content/ml4qs/_index.md
@@ -6,3 +6,5 @@ title = 'Machine Learning for the Quantified Self'
1. [Introduction & Basics of Sensory Data](introduction-basics-of-sensory-data)
2. [Handling sensory noise](handling-sensory-noise)
3. [Feature engineering](feature-engineering)
+4. [Clustering](clustering)
+5. [Supervised learning](supervised-learning)
\ No newline at end of file
diff --git a/content/ml4qs/clustering.md b/content/ml4qs/clustering.md
@@ -0,0 +1,46 @@
++++
+title = 'Clustering'
++++
+
+# Clustering
+
+## Learning setup
+Cluster per instance (e.g. the individual instances of each person) or per person (each person is represented by their set of instances; clustering forms groups of people).
+
+Distance metrics (a small code sketch of these follows the list):
+- Euclidean distance: length of the straight line between two points (the hypotenuse of a triangle)
+- Manhattan distance: sum of the distances per dimension (the two other sides of the triangle, added together)
+- Minkowski distance is a generalized form of both
+- important to consider scaling of data
+- these assume numeric values; for other attribute types you can use Gower's similarity:
+  - dichotomous (present or not): 1 if both present, 0 otherwise
+  - categorical: 1 if both are the same, 0 otherwise
+  - numerical: scaled absolute difference, 1 - |x_i - x_j| divided by the range of the attribute
+
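+A small sketch of these point-to-point metrics, assuming plain NumPy arrays of (already scaled) numeric features; the function names and toy vectors are illustrative only:
+
+```python
+import numpy as np
+
+def minkowski(a, b, p):
+    # Generalized form: p=1 gives the Manhattan distance, p=2 the Euclidean distance
+    return np.sum(np.abs(a - b) ** p) ** (1 / p)
+
+def gower_numeric(a, b, value_range):
+    # Gower's similarity for a numerical attribute: scaled absolute difference
+    return 1 - abs(a - b) / value_range
+
+x = np.array([1.0, 2.0])
+y = np.array([4.0, 6.0])
+print(minkowski(x, y, 2))  # Euclidean: 5.0 (the hypotenuse)
+print(minkowski(x, y, 1))  # Manhattan: 7.0 (the two other sides added together)
+```
+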
+Distance metrics for datasets:
+- non-temporal
+  - summarize the values per attribute over the entire dataset into a single value (mean, min, max, ...)
+  - estimate the parameters of a distribution per attribute (e.g. a normal distribution) and compare those
+  - compare the distributions of values for attribute i with a statistical test (e.g. Kolmogorov-Smirnov; take 1 - p as the distance metric)
+- temporal
+  - raw-data based
+    - simplest case: assume an equal number of points and compute the Euclidean distance on a per-point basis
+    - if the time series are shifted in time, use a lag to compensate (across all attributes); choose the best shift with the cross-correlation coefficient (a higher coefficient is better)
+    - if the series differ in pace/frequency (e.g. people walk at different frequencies), use dynamic time warping (see the sketch after this list)
+      - pairing: the time order should be preserved, and the first and last points should be matched
+  - feature-based: extract features from the series and compare those
+  - model-based
+    - fit a time series model and use its parameters
+
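+A brief sketch of the raw-data-based temporal comparisons: a plain lag search via the correlation coefficient and a naive dynamic time warping implementation. This is an illustrative sketch, not a reference implementation, and the toy series are made up:
+
+```python
+import numpy as np
+
+def best_lag(a, b, max_lag):
+    # Try each shift and keep the one whose overlapping parts correlate best
+    best, best_corr = 0, -np.inf
+    for lag in range(-max_lag, max_lag + 1):
+        if lag >= 0:
+            x, y = a[lag:], b[:len(b) - lag]
+        else:
+            x, y = a[:len(a) + lag], b[-lag:]
+        n = min(len(x), len(y))
+        if n < 2:
+            continue
+        c = np.corrcoef(x[:n], y[:n])[0, 1]
+        if not np.isnan(c) and c > best_corr:
+            best, best_corr = lag, c
+    return best
+
+def dtw_distance(a, b):
+    # Dynamic time warping: time order is preserved, first and last points are matched
+    n, m = len(a), len(b)
+    cost = np.full((n + 1, m + 1), np.inf)
+    cost[0, 0] = 0.0
+    for i in range(1, n + 1):
+        for j in range(1, m + 1):
+            d = abs(a[i - 1] - b[j - 1])
+            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
+    return cost[n, m]
+
+a = np.array([0.0, 1.0, 2.0, 1.0, 0.0])
+b = np.array([0.0, 0.0, 1.0, 2.0, 1.0, 0.0])  # same shape, different pace
+print(best_lag(a, b, 2))    # -1: b is the same pattern delayed by one sample (in this sign convention)
+print(dtw_distance(a, b))   # 0.0: DTW absorbs the difference in pace
+```
+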
+## Clustering approaches
+Options:
+- k-means: choose k clusters, start with k random cluster centers, assign each point to the closest center, and recompute each center as the mean of its assigned points; repeat until it stabilizes (see the sketch after this list)
+  - performance metric: silhouette score (how close points are to their own cluster vs. the nearest other cluster)
+- k-medoids: like k-means, but use actual data points as centers instead of artificial means
+- hierarchical:
+  - divisive: start with one big cluster and split a cluster in each step
+    - compute the dissimilarity of each point and move the most dissimilar point out of its cluster
+    - select the cluster with the largest diameter to split next
+  - agglomerative: start with one cluster per point and merge clusters step by step
+    - merge using single linkage (minimum distance between two clusters), complete linkage (maximum distance), group average, or Ward's criterion (merge the pair that gives the smallest increase in within-cluster variance)
+- subspace clustering: cluster within subspaces of the feature space (with a huge number of features, distances tend to even out, so clusters may only show up in subspaces)
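+
+A short sketch of k-means with the silhouette score and of agglomerative clustering with Ward's criterion, using scikit-learn and SciPy on made-up data; the toy feature matrix and the choice of k are just examples:
+
+```python
+import numpy as np
+from scipy.cluster.hierarchy import fcluster, linkage
+from sklearn.cluster import KMeans
+from sklearn.metrics import silhouette_score
+
+# Toy feature matrix: rows are instances, columns are (already scaled) features
+rng = np.random.default_rng(0)
+X = np.vstack([rng.normal(0, 0.5, (20, 3)), rng.normal(3, 0.5, (20, 3))])
+
+# k-means: each center is the mean of the points assigned to it
+kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
+print(silhouette_score(X, kmeans.labels_))  # closer to 1 means better-separated clusters
+
+# Agglomerative clustering with Ward's criterion (smallest increase in within-cluster variance)
+Z = linkage(X, method="ward")
+labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into 2 clusters
+```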
diff --git a/content/ml4qs/supervised-learning.md b/content/ml4qs/supervised-learning.md
@@ -0,0 +1,39 @@
++++
+title = 'Supervised learning'
++++
+
+# Supervised learning
+Learn a functional relationship (the "unknown target function") from observations to a target.
+
+Assume an unknown conditional target distribution p(y|x): calculate f(x) and add noise from a noise distribution (Bernoulli/categorical for a discrete target, normal for a continuous one); a toy sketch follows below.
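+
+A toy sketch of that generative view; the choice of f here is made up purely for illustration:
+
+```python
+import numpy as np
+
+rng = np.random.default_rng(0)
+
+def f(x):
+    # Stand-in for the unknown target function (here: probability of the positive class)
+    return 1 / (1 + np.exp(-x))
+
+x = rng.normal(size=1000)
+y_discrete = rng.binomial(1, f(x))                         # discrete target: Bernoulli noise around f(x)
+y_continuous = f(x) + rng.normal(scale=0.1, size=x.shape)  # continuous target: normal noise around f(x)
+```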
+
+1. Separate the dataset into training, validation, and test sets (a minimal sketch of this split follows below).
+2. Learn a function that fits the observed data in the training set.
+   - stratify the training set if it is unbalanced (e.g. oversample the minority class)
+3. Evaluate the generalizability of the function on the test set.
+4. Stop the learning process based on the validation set.
+5. If the dataset is small, use cross-validation.
+
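+A minimal sketch of step 1 with scikit-learn, using a stratified split so the class balance is kept; the toy X and y and the split proportions are just an example:
+
+```python
+import numpy as np
+from sklearn.model_selection import train_test_split
+
+# Toy data standing in for the engineered features and activity labels
+X = np.arange(100).reshape(50, 2)
+y = np.array([0] * 25 + [1] * 25)
+
+# First split off the test set, then carve a validation set out of the remainder
+X_rest, X_test, y_rest, y_test = train_test_split(
+    X, y, test_size=0.2, stratify=y, random_state=0)
+X_train, X_val, y_train, y_val = train_test_split(
+    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=0)
+# For small datasets (step 5), sklearn.model_selection.cross_val_score does k-fold cross-validation instead
+```
+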
+Error measure:
+- assume a hypothesis h for the target function f: how far h is from f overall is the risk; its value per point is the loss
+- approximate these using the data we have (a short sketch follows below)
+- in-sample error: the error made on the training set
+- out-of-sample error: the error made on all the other possible elements
+- we try to minimize the in-sample error, hoping the out-of-sample error stays close to it
+
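+A small sketch of estimating these errors, with a deliberately overfitting decision tree on made-up data; the held-out test error only approximates the out-of-sample error:
+
+```python
+import numpy as np
+from sklearn.model_selection import train_test_split
+from sklearn.tree import DecisionTreeClassifier
+
+# Made-up data standing in for features and labels
+rng = np.random.default_rng(0)
+X = rng.normal(size=(200, 4))
+y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)
+
+X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)
+model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
+
+in_sample_error = 1 - model.score(X_train, y_train)  # error on the training set (near 0 for a full tree)
+test_error = 1 - model.score(X_test, y_test)         # estimate of the out-of-sample error
+print(in_sample_error, test_error)                   # a large gap is a sign of overfitting
+```
+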
+Model selection:
+- select the hypothesis with the lowest error on the validation set
+- watch out for overfitting; don't use too many features
+- PAC ("probably approximately correct") learnable: a formal definition of when an "almost perfect" model can be learned
+- VC dimension: the maximum number of input vectors (points) that can be shattered (the model can represent every possible labelling of them); e.g. a linear classifier in 2D has VC dimension 3, since it can shatter 3 points but not 4
+- all hypothesis sets with a finite VC dimension are PAC learnable
+
+## Predictive modeling without notion of time
+1. Think about the learning setup (what do you want to learn)
+2. Don't overfit: select features with forward or backward selection, and consider regularization (punishing more complex models); a short sketch follows below.
+   - forward selection: iteratively add the most predictive feature
+   - backward selection: iteratively remove the least predictive feature
+   - regularization: add a term to the error function that punishes more complex models
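+
+A small sketch of these ideas with scikit-learn on made-up classification data; the estimator, the number of features to keep, and the regularization strength are arbitrary example choices:
+
+```python
+from sklearn.datasets import make_classification
+from sklearn.feature_selection import SequentialFeatureSelector
+from sklearn.linear_model import LogisticRegression
+
+X, y = make_classification(n_samples=200, n_features=10, n_informative=3, random_state=0)
+
+# Forward selection: start empty and iteratively add the most predictive feature
+forward = SequentialFeatureSelector(
+    LogisticRegression(max_iter=1000), n_features_to_select=3,
+    direction="forward", cv=5).fit(X, y)
+print(forward.get_support())  # boolean mask of the selected features
+
+# Backward selection: start with all features and iteratively drop the least predictive one
+backward = SequentialFeatureSelector(
+    LogisticRegression(max_iter=1000), n_features_to_select=3,
+    direction="backward", cv=5).fit(X, y)
+
+# Regularization: a penalty term in the error function punishes more complex models
+# (here an L2 penalty on the weights; a smaller C means a stronger penalty)
+regularized = LogisticRegression(penalty="l2", C=0.1, max_iter=1000).fit(X, y)
+```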
+
+