lectures.alex.balgavy.eu

Lecture notes from university.
git clone git://git.alex.balgavy.eu/lectures.alex.balgavy.eu.git

commit 93ddce2556c939120dac85c969f316d4da6757b8
parent 958587373ae267156f5aaec7e5f0d534d18f162b
Author: Alex Balgavy <alex@balgavy.eu>
Date:   Wed,  2 Jun 2021 10:25:15 +0200

Update ml4qs notes

Diffstat:
M content/ml4qs/_index.md                 |  1 +
A content/ml4qs/handling-sensory-noise.md | 98 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 99 insertions(+), 0 deletions(-)

diff --git a/content/ml4qs/_index.md b/content/ml4qs/_index.md
@@ -4,3 +4,4 @@ title = 'Machine Learning for the Quantified Self'
 # Machine Learning for the Quantified Self
 
 1. [Introduction & Basics of Sensory Data](introduction-basics-of-sensory-data)
+2. [Handling sensory noise](handling-sensory-noise)
diff --git a/content/ml4qs/handling-sensory-noise.md b/content/ml4qs/handling-sensory-noise.md
@@ -0,0 +1,98 @@
++++
+title = 'Handling sensory noise'
+template = 'page-math.html'
++++
+
+Removing noise involves detecting outliers, imputing missing values, and transforming data.
+
+Outlier: an observation point that is distant from other observations.
+- may be caused by measurement error, or by genuine variability
+
+Outliers can be removed with or without domain knowledge.
+But be careful: you don't want to remove valuable information.
+
+Outlier detection:
+- distribution-based -- assume a certain distribution of the data
+- distance-based -- only look at the distance between data points
+
+## Distribution-based
+### Chauvenet's criterion
+Assume a normal distribution for a single attribute $X^{j}$.
+
+Take the mean and standard deviation of attribute $j$ over the data set:
+
+$\mu = \frac{\sum_{n=1}^{N} x_{n}^{j}}{N}$
+
+$\sigma = \sqrt{\frac{\sum_{n=1}^{N} \left(x_{n}^{j} - \mu\right)^{2}}{N}}$
+
+Take those values as the parameters of a normal distribution.
+
+For each instance $i$ of attribute $j$, compute the probability of the observation:
+
+$P(X \leq x_{i}^{j}) = \int_{-\infty}^{x_{i}^{j}}{\frac{1}{\sqrt{2 \sigma^{2} \pi}} e^{-\frac{(u - \mu)^{2}}{2 \sigma^{2}}}} du$
+
+Define an instance as an outlier when either tail probability is too small:
+- $(1 - P(X \leq x_{i}^{j})) < \frac{1}{c \cdot N}$
+- $P(X \leq x_{i}^{j}) < \frac{1}{c \cdot N}$
+
+A typical value for $c$ is 2.
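+
+A minimal sketch of this criterion in Python (my own addition, not part of the lecture; `numpy`/`scipy` and the synthetic data are assumptions):
+
+```python
+import numpy as np
+from scipy import stats
+
+def chauvenet_outliers(values, c=2):
+    """Mark outliers in one attribute using Chauvenet's criterion."""
+    values = np.asarray(values, dtype=float)
+    mu, sigma = values.mean(), values.std()
+    # P(X <= x) under the normal distribution fitted to the attribute
+    p = stats.norm.cdf(values, loc=mu, scale=sigma)
+    threshold = 1.0 / (c * len(values))
+    # Outlier if either tail probability is below 1 / (c * N)
+    return (p < threshold) | ((1.0 - p) < threshold)
+
+x = np.concatenate([np.random.normal(0, 1, 100), [15.0]])
+print(chauvenet_outliers(x).nonzero())  # flags the injected value 15.0
+```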
+
+### Mixture models
+Assuming the data follows a single distribution might be too simple.
+So, assume it can be described with $K$ normal distributions:
+
+$p(x) = \sum_{k=1}^{K} \pi_{k} \mathscr{N}(x | \mu_{k}, \sigma_{k})$ with $\sum_{k=1}^{K} \pi_{k} = 1 \quad \forall k: 0 < \pi_{k} \leq 1$
+
+Find the best values for the parameters by maximizing the likelihood: $L = \prod_{n=1}^{N} p(x_{n}^{j})$
+
+This can be done with, for example, the expectation maximization algorithm.
+
+## Distance-based
+Use $d(x_{i}^{j}, x_{k}^{j})$ to represent the distance between two values of attribute $j$.
+
+Points are "close" if they are within a distance $d_{min}$ of each other.
+A point is an outlier when more than a fraction $f_{min}$ of the other points are further than $d_{min}$ away from it.
+
+### Local outlier factor
+Takes density into account.
+
+Define $k_{dist}$ for point $x_{i}^{j}$ as the largest distance to one of its $k$ closest neighbors.
+
+The set of neighbors of $x_{i}^{j}$ within $k_{dist}$ is its k-distance neighborhood.
+
+Reachability distance of $x_{i}^{j}$ to $x$: $k_{reach\,dist}(x_{i}^{j}, x) = \max(k_{dist}(x), d(x, x_{i}^{j}))$
+
+Define the local reachability density of point $x_{i}^{j}$ (the inverse of its average reachability distance) and compare it to that of its neighbors.
+
+## Missing values
+Replace missing values with a substituted value (imputation); a small example appears at the end of these notes.
+You can use the mean, mode, or median.
+Or use other attribute values in the same instance, or values of the same attribute from other instances.
+
+## Combining outlier detection & imputation
+Kalman filter (sketched at the end of these notes):
+- estimates expected values based on historical data
+- if an observed value is an outlier, imputes it with the expected value
+
+Assume some latent state $s_{t}$, which can have multiple components.
+We take measurements $x_{t}$ of that state.
+
+The next value of the state is: $s_{t} = F_{t} s_{t-1} + B_{t} u_{t} + w_{t}$
+- $u_{t}$ is the control input state (like sending a message)
+- $w_{t}$ is white noise
+- $F_{t}$ and $B_{t}$ are matrices
+
+The measurement associated with $s_{t}$ is $x_{t} = H_{t} s_{t} + v_{t}$
+- $v_{t}$ is white noise
+
+For the white noise, assume a normal distribution.
+Try to predict the next state, and estimate the prediction error (a matrix of variances and covariances).
+Based on the prediction, look at the error, and update the prediction of the state.
+
+## Transforming data
+Filter out more subtle noise.
+
+Lowpass filter: some data has periodicity; decompose the series of values into different periodic signals and select the most interesting frequencies.
+
+Principal component analysis: find new features that explain most of the variability in the data, and select the number of components based on the explained variance.
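+
+As an illustrative sketch of both transforms (my own addition; `scipy`/`scikit-learn`, the sampling rate, cutoff frequency, and variance threshold are all assumptions):
+
+```python
+import numpy as np
+from scipy.signal import butter, filtfilt
+from sklearn.decomposition import PCA
+
+# Lowpass filter: keep frequencies below 1.5 Hz in a 10 Hz signal
+fs, cutoff = 10.0, 1.5
+b, a = butter(N=4, Wn=cutoff / (fs / 2), btype='low')
+t = np.arange(0, 10, 1 / fs)
+signal = np.sin(2 * np.pi * 0.5 * t) + 0.3 * np.random.randn(len(t))
+smoothed = filtfilt(b, a, signal)  # zero-phase filtering
+
+# PCA: keep as many components as needed to explain 95% of the variance
+X = np.random.randn(100, 6)  # stand-in for a feature matrix
+pca = PCA(n_components=0.95)
+X_reduced = pca.fit_transform(X)
+print(pca.explained_variance_ratio_)
+```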
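+
+Looking back at the "Missing values" section, a sketch of simple imputation with pandas (my own example; the column name is made up):
+
+```python
+import pandas as pd
+
+df = pd.DataFrame({"heart_rate": [62.0, None, 65.0, None, 70.0]})
+
+# Impute with a summary statistic of the same attribute...
+df["hr_mean"] = df["heart_rate"].fillna(df["heart_rate"].mean())
+
+# ...or from other instances of the same attribute, by interpolation
+df["hr_interp"] = df["heart_rate"].interpolate()
+print(df)
+```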
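+
+Finally, the Kalman-filter sketch referenced in "Combining outlier detection & imputation" (my own simplification to a one-dimensional random-walk state, i.e. $F_{t} = H_{t} = 1$ with no control input; all noise parameters are assumptions):
+
+```python
+import numpy as np
+
+def kalman_impute(x, q=0.01, r=1.0, n_sigmas=3.0):
+    """Predict each value with a 1D Kalman filter; impute outliers with the prediction.
+
+    q: process noise variance, r: measurement noise variance."""
+    out = np.array(x, dtype=float)
+    s, p = out[0], 1.0                 # initial state estimate and its variance
+    for t in range(1, len(out)):
+        s_pred, p_pred = s, p + q      # predict: state stays, uncertainty grows
+        # Outlier if the measurement is too far from the predicted value
+        if abs(out[t] - s_pred) > n_sigmas * np.sqrt(p_pred + r):
+            out[t] = s_pred            # impute with the expected value
+        k = p_pred / (p_pred + r)      # Kalman gain
+        s = s_pred + k * (out[t] - s_pred)  # update the state estimate
+        p = (1 - k) * p_pred           # update the prediction error
+    return out
+
+x = np.sin(np.linspace(0, 6, 60)) + 0.1 * np.random.randn(60)
+x[30] = 8.0                            # inject an outlier
+print(kalman_impute(x)[30])            # close to the local signal, not 8.0
+```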