lectures.alex.balgavy.eu

Lecture notes from university.
git clone git://git.alex.balgavy.eu/lectures.alex.balgavy.eu.git

commit 72ff5e36849d4e6da29cf4b43fed19a8aac48eb6
parent 10dd9a6d460b77a1e43a43459859f2977390e88d
Author: Alex Balgavy <alex@balgavy.eu>
Date:   Tue, 29 Jun 2021 16:19:13 +0200

Finalize ML4QS notes

Diffstat:
M content/_index.md | 2 +-
A content/ml4qs-notes/ML4QS.apkg | 0
A content/ml4qs-notes/_index.md | 16 ++++++++++++++++
R content/ml4qs/clustering.md -> content/ml4qs-notes/clustering.md | 0
R content/ml4qs/feature-engineering.md -> content/ml4qs-notes/feature-engineering.md | 0
A content/ml4qs-notes/handling-sensory-noise.md | 98 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
R content/ml4qs/introduction-basics-of-sensory-data.md -> content/ml4qs-notes/introduction-basics-of-sensory-data.md | 0
R content/ml4qs/supervised-learning.md -> content/ml4qs-notes/supervised-learning.md | 0
D content/ml4qs/_index.md | 11 -----------
D content/ml4qs/handling-sensory-noise.md | 98 --------------------------------------------------------------------------------
10 files changed, 115 insertions(+), 110 deletions(-)

diff --git a/content/_index.md b/content/_index.md
@@ -13,7 +13,7 @@ title = "Alex's university course notes"
 * [Coding and Cryptography](coding-and-cryptography)
 * [Binary and Malware Analysis](binary-malware-analysis-notes)
 * [Distributed Algorithms](distributed-algorithms-notes)
-* [Machine Learning for the Quantified Self](ml4qs)
+* [Machine Learning for the Quantified Self](ml4qs-notes)
 
 # BSc Computer Science (VU Amsterdam)
 ---
diff --git a/content/ml4qs-notes/ML4QS.apkg b/content/ml4qs-notes/ML4QS.apkg
Binary files differ.
diff --git a/content/ml4qs-notes/_index.md b/content/ml4qs-notes/_index.md
@@ -0,0 +1,15 @@
++++
+title = 'Machine Learning for the Quantified Self'
++++
+
+# Machine Learning for the Quantified Self
+1. [Introduction & Basics of Sensory Data](introduction-basics-of-sensory-data)
+2. [Handling sensory noise](handling-sensory-noise)
+3. [Feature engineering](feature-engineering)
+4. [Clustering](clustering)
+5. [Supervised learning](supervised-learning)
+
+[A good video on dynamic time warping](https://www.youtube.com/watch?v=_K1OsqCicBY).
+You can test it out yourself with [the dtw package](https://dynamictimewarping.github.io/) in R and Python.
+
+I used Anki to study for the exam, [here's the Anki deck](ML4QS.apkg).
\ No newline at end of file
diff --git a/content/ml4qs/clustering.md b/content/ml4qs-notes/clustering.md
diff --git a/content/ml4qs/feature-engineering.md b/content/ml4qs-notes/feature-engineering.md
diff --git a/content/ml4qs-notes/handling-sensory-noise.md b/content/ml4qs-notes/handling-sensory-noise.md
@@ -0,0 +1,98 @@
++++
+title = 'Handling sensory noise'
+template = 'page-math.html'
++++
+
+Removing noise: outliers, imputing missing values, transforming data.
+
+Outlier: observation point that's distant from other observations
+- may be caused by measurement error, or variability
+
+Remove with domain knowledge, or without.
+But be careful, don't want to remove valuable info.
+
+Outlier detection:
+- distribution based -- assume a certain distribution of data
+- distance based -- only look at distance between data points
+
+## Distribution based
+### Chauvenet's criterion
+Assume normal distribution, single attribute $X_{i}$.
+
+Take mean and standard dev for attribute j in the data set:
+
+$\mu = \frac{\sum_{n = 1}^{N} x_{n}^{j}}{N}$
+
+$\sigma = \sqrt { \frac{ \sum_{{n=1}}^{{N}} {\left({{x_{{n}}^{{j}}} - \mu} \right)^2} }{N} } $
+
+Take those values as parameters for normal distribution.
+
+For each instance i for attribute j, compute probability of observation:
+
+$P(X \leq x_{i}^{j}) = \int_{-\infty}^{x_{i}^{j}}{\frac{1}{\sqrt{2 \sigma^2 \pi}} e^{-\frac{(u - \mu)^{2}}{2 \sigma^2}}} du$
+
+Define instance as outlier when:
+- $(1 - P(X \leq x_{i}^{j})) < \frac{1}{c \cdot N}$
+- $P(X \leq x_{i}^{j}) < \frac{1}{c \cdot N}$
+
+Typical value for $c$ is 2.
+
+
+### Mixture models
+Assuming data follows a single distribution might be too simple.
+So, assume it can be described with K normal distributions.
+
+$p(x) = \sum_{k=1}^{K} \pi_{k} \mathscr{N} (x | \mu_{k}, \sigma_{k})$ with $\sum_{k=1}^{K} \pi_{k} = 1 \quad \forall k: 0 < \pi_{k} \leq 1$
+
+Find the best parameter values by maximizing the likelihood: $L = \prod_{n=1}^{N} p(x_{n}^{j})$
+
+For example with the expectation maximization algorithm.
+
+## Distance-based
+Use $d(x_{i}^{j}, x_{k}^{j})$ to represent distance between two values of attribute j.
+
+Points are "close" if they are within distance $d_{min}$.
+A point is an outlier when more than a fraction $f_{min}$ of the other points are further than $d_{min}$ away.
+
+### Local outlier factor
+Takes density into account.
+
+Define $k_{dist}$ for point $x_{i}^{j}$ as largest distance to one of its k closest neighbors.
+
+Set of neighbors of $x_{i}^{j}$ within $k_{dist}$ is the k-distance neighborhood.
+
+Reachability distance of $x_{i}^{j}$ to $x$: $k_{reach dist} (x_{i}^{j}, x) = \max (k_{dist}(x), d(x, x_{i}^{j}))$
+
+Define local reachability distance of point $x_{i}^{j}$ and compare to neighbors.
+
+## Missing values
+Replace missing values by a substituted value (imputation).
+Can use mean, mode, median.
+Or other attribute values in same instance, or values of same attributes from other instances.
+
+## Combining outlier detection & imputation
+Kalman filter:
+- estimates expected values based on historical data
+- if observed value is an outlier, impute with the expected value
+
+Assume some latent state $s_{t}$ which can have multiple components.
+We take measurements $x_t$ of that state.
+
+Next value of state is: $s_{t} = F_{t} s_{t-1} + B_{t} u_{t} + w_{t}$
+- $u_{t}$ is the control input (like sending a message)
+- $w_{t}$ is white noise
+- $F_{t}$ and $B_{t}$ are matrices
+
+Measurement associated with $s_{t}$ is $x_{t} = H_{t} s_{t} + v_{t}$
+- $v_{t}$ is white noise
+
+For white noise, assume a normal distribution.
+Try to predict next state, and estimate prediction error (matrix of variances and covariances).
+Based on prediction, look at the error, and update prediction of the state.
+
+## Transforming data
+Filter out more subtle noise.
+
+Lowpass filter: some data has periodicity; decompose series of values into different periodic signals and select most interesting frequencies.
+
+Principal component analysis: find new features explaining most of variability in data, select number of components based on explained variance.
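Chauvenet's criterion as written in handling-sensory-noise.md above maps almost line for line onto NumPy/SciPy. A minimal sketch, assuming a single numeric attribute and the typical $c = 2$; the toy data is only for illustration:

```python
# Chauvenet's criterion: flag values whose tail probability under a fitted
# normal distribution is below 1 / (c * N).
import numpy as np
from scipy.stats import norm

def chauvenet_outliers(values, c=2):
    """Return a boolean mask that is True for values rejected by Chauvenet's criterion."""
    values = np.asarray(values, dtype=float)
    n = len(values)
    mu, sigma = values.mean(), values.std()      # same mu and sigma as in the notes
    p = norm.cdf(values, loc=mu, scale=sigma)    # P(X <= x_i^j)
    threshold = 1.0 / (c * n)
    # Outlier if the observation is too unlikely in either tail
    return (p < threshold) | ((1 - p) < threshold)

print(chauvenet_outliers([12.1, 11.9, 12.0, 12.2, 11.8, 25.0]))  # flags the 25.0
```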
diff --git a/content/ml4qs/introduction-basics-of-sensory-data.md b/content/ml4qs-notes/introduction-basics-of-sensory-data.md
diff --git a/content/ml4qs/supervised-learning.md b/content/ml4qs-notes/supervised-learning.md
diff --git a/content/ml4qs/_index.md b/content/ml4qs/_index.md
@@ -1,10 +0,0 @@
-+++
-title = 'Machine Learning for the Quantified Self'
-+++
-
-# Machine Learning for the Quantified Self
-1. [Introduction & Basics of Sensory Data](introduction-basics-of-sensory-data)
-2. [Handling sensory noise](handling-sensory-noise)
-3. [Feature engineering](feature-engineering)
-4. [Clustering](clustering)
-5. [Supervised learning](supervised-learning)
\ No newline at end of file
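The local outlier factor described in the notes above is available off the shelf in scikit-learn; a minimal sketch, where n_neighbors and the toy data are illustrative assumptions rather than values from the course:

```python
# Local outlier factor: density-based outlier detection, comparing each point's
# local reachability density to that of its k nearest neighbours.
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

X = np.array([[0.00], [0.05], [0.10], [0.15], [0.20], [3.00]])  # one attribute, one obvious outlier
lof = LocalOutlierFactor(n_neighbors=3)
labels = lof.fit_predict(X)               # -1 marks outliers, 1 marks inliers
scores = -lof.negative_outlier_factor_    # larger score = more outlying
print(labels)
print(scores.round(2))
```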
diff --git a/content/ml4qs/handling-sensory-noise.md b/content/ml4qs/handling-sensory-noise.md
@@ -1,98 +0,0 @@
-+++
-title = 'Handling sensory noise'
-template = 'page-math.html'
-+++
-
-Removing noise: outliers, imputing missing values, transforming data.
-
-Outlier: observation point that's distant from other observations
-- may be caused by measurement error, or variability
-
-Remove with domain knowledge, or without.
-But be careful, don't want to remove valuable info.
-
-Outlier detection:
-- distribution based -- assume a certain distribution of data
-- distance based -- only look at distance between data points
-
-## Distribution based
-### Chauvenet's criterion
-Assume normal distribution, single attribute $X_{i}$.
-
-Take mean and standard dev for attribute j in the data set:
-
-$\mu = \frac{\sum_{n = 1}^{N} x_{n}^{j}}{N}$
-
-$\sigma = \sqrt { \frac{ \sum_{{n=1}}^{{N}} {\left({{x_{{n}}^{{j}}} - \mu} \right)^2} }{N} } $
-
-Take those values as parameters for normal distribution.
-
-For each instance i for attribute j, compute probability of observation:
-
-$P(X \leq x_{i}^{j}) = \int_{-\infty}^{x_{i}^{j}}{\frac{1}{\sqrt{2 \sigma^2 \pi}} e^{-\frac{(u - \mu)^{2}}{2 \sigma^2}}} du$
-
-Define instance as outlier when:
-- $(1 - P(X \leq x_{i}^{j})) < \frac{1}{c \cdot N}$
-- $P(X \leq x_{i}^{j}) < \frac{1}{c \cdot N}$
-
-Typical value for $c$ is 2.
-
-
-### Mixture models
-Assuming data follows a single distribution might be too simple.
-So, assume it can be described with K normal distributions.
-
-$p(x) = \sum_{k=1}^{K} \pi_{k} \mathscr{N} (x | \mu_{k}, \sigma_{k})$ with $\sum_{k=1}^{K} \pi_{k} = 1 \quad \forall k: 0 < \pi_{k} \leq 1$
-
-Find the best parameter values by maximizing the likelihood: $L = \prod_{n=1}^{N} p(x_{n}^{j})$
-
-For example with the expectation maximization algorithm.
-
-## Distance-based
-Use $d(x_{i}^{j}, x_{k}^{j})$ to represent distance between two values of attribute j.
-
-Points are "close" if they are within distance $d_{min}$.
-A point is an outlier when more than a fraction $f_{min}$ of the other points are further than $d_{min}$ away.
-
-### Local outlier factor
-Takes density into account.
-
-Define $k_{dist}$ for point $x_{i}^{j}$ as largest distance to one of its k closest neighbors.
-
-Set of neighbors of $x_{i}^{j}$ within $k_{dist]$ is the k-distance neighborhood.
-
-Reachability distance of $x_{i}^{j}$ to $x$: $k_{reach dist} (x_{i}^{j}, x) = \max (k_{dist}(x), d(x, x_{i}^{j}))$
-
-Define local reachability distance of point $x_{i}^{j}$ and compare to neighbors.
-
-## Missing values
-Replace missing values by a substituted value (imputation).
-Can use mean, mode, median.
-Or other attribute values in same instance, or values of same attributes from other instances.
-
-## Combining outlier detection & imputation
-Kalman filter:
-- estimates expected values based on historical data
-- if observed value is an outlier, impute with the expected value
-
-Assume some latent state $s_{t}$ which can have multiple components.
-We take measurements $x_t$ of that state.
-
-Next value of state is: $s_{t} = F_{t} s_{t-1} + B_{t} u_{t} + w_{t}$
-- $u_{t}$ is the control input (like sending a message)
-- $w_{t}$ is white noise
-- $F_{t}$ and $B_{t}$ are matrices
-
-Measurement associated with $s_{t}$ is $x_{t} = H_{t} s_{t} + v_{t}$
-- $v_{t}$ is white noise
-
-For white noise, assume a normal distribution.
-Try to predict next state, and estimate prediction error (matrix of variances and covariances).
-Based on prediction, look at the error, and update prediction of the state.
-
-## Transforming data
-Filter out more subtle noise.
-
-Lowpass filter: some data has periodicity; decompose series of values into different periodic signals and select most interesting frequencies.
-
-Principal component analysis: find new features explaining most of variability in data, select number of components based on explained variance.
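The lowpass filtering mentioned under "Transforming data" can be tried with a Butterworth filter from SciPy; a minimal sketch, where the sampling rate, cutoff frequency, and synthetic signal are illustrative assumptions:

```python
# Lowpass filter: keep the slow periodic part of a signal, drop high-frequency noise.
import numpy as np
from scipy.signal import butter, filtfilt

fs = 50.0       # sampling frequency in Hz (assumed)
cutoff = 1.5    # keep frequencies below 1.5 Hz (assumed)
t = np.arange(0, 10, 1 / fs)
signal = np.sin(2 * np.pi * 0.5 * t) + 0.3 * np.random.randn(len(t))  # 0.5 Hz wave + noise

b, a = butter(N=4, Wn=cutoff, btype="low", fs=fs)   # 4th-order Butterworth coefficients
filtered = filtfilt(b, a, signal)                   # zero-phase filtering (no lag)
print(filtered[:5].round(3))
```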