+++
title = 'Handling sensory noise'
template = 'page-math.html'
+++

Removing noise: outliers, imputing missing values, transforming data.

Outlier: observation point that's distant from other observations
- may be caused by measurement error, or by natural variability

Remove with domain knowledge, or without.
But be careful, you don't want to remove valuable info.

Outlier detection:
- distribution based -- assume a certain distribution of the data
- distance based -- only look at the distance between data points

## Distribution based
### Chauvenet's criterion
Assume a normal distribution, and consider a single attribute j.

Take the mean and standard deviation for attribute j in the data set:

$\mu = \frac{\sum_{n=1}^{N} x_{n}^{j}}{N}$

$\sigma = \sqrt{\frac{\sum_{n=1}^{N} \left( x_{n}^{j} - \mu \right)^{2}}{N}}$

Take those values as parameters for a normal distribution.

For each instance i of attribute j, compute the probability of the observation:

$P(X \leq x_{i}^{j}) = \int_{-\infty}^{x_{i}^{j}} \frac{1}{\sqrt{2 \sigma^{2} \pi}} e^{-\frac{(u - \mu)^{2}}{2 \sigma^{2}}} \, du$

Define an instance as an outlier when it's improbable in either tail:
- $(1 - P(X \leq x_{i}^{j})) < \frac{1}{c \cdot N}$
- $P(X \leq x_{i}^{j}) < \frac{1}{c \cdot N}$

Typical value for $c$ is 2.

### Mixture models
Assuming the data follows a single distribution might be too simple.
So, assume it can be described with K normal distributions:

$p(x) = \sum_{k=1}^{K} \pi_{k} \mathscr{N}(x \mid \mu_{k}, \sigma_{k})$ with $\sum_{k=1}^{K} \pi_{k} = 1$ and $\forall k: 0 < \pi_{k} \leq 1$

Find the best values for the parameters by maximizing the likelihood: $L = \prod_{n=1}^{N} p(x_{n}^{j})$

For example with the expectation maximization algorithm.

## Distance-based
Use $d(x_{i}^{j}, x_{k}^{j})$ to represent the distance between two values of attribute j.

Points are "close" if they are within distance $d_{min}$ of each other.
A point is an outlier when more than a fraction $f_{min}$ of the other points is further than $d_{min}$ away from it.

### Local outlier factor
Takes density into account.

Define $k_{dist}$ for a point $x_{i}^{j}$ as the largest distance to one of its k closest neighbors.

The set of neighbors of $x_{i}^{j}$ within $k_{dist}$ is the k-distance neighborhood.

Reachability distance of $x_{i}^{j}$ to $x$: $k_{\text{reach dist}}(x_{i}^{j}, x) = \max(k_{dist}(x), d(x, x_{i}^{j}))$

Define the local reachability density of a point $x_{i}^{j}$ (the inverse of its average reachability distance) and compare it to that of its neighbors: points with a much lower density than their neighbors are outliers.

## Missing values
Replace missing values with a substituted value (imputation).
Can use the mean, mode, or median.
Or other attribute values in the same instance, or values of the same attribute from other instances.

## Combining outlier detection & imputation
Kalman filter:
- estimates expected values based on historical data
- if an observed value is an outlier, imputes it with the expected value

Assume some latent state $s_{t}$ which can have multiple components.
The data provides measurements $x_{t}$ of that state.

The next value of the state is: $s_{t} = F_{t} s_{t-1} + B_{t} u_{t} + w_{t}$
- $u_{t}$ is the control input (like sending a message)
- $w_{t}$ is white noise
- $F_{t}$ and $B_{t}$ are matrices

The measurement associated with $s_{t}$ is $x_{t} = H_{t} s_{t} + v_{t}$
- $v_{t}$ is white noise

For the white noise, assume a normal distribution.
Try to predict the next state, and estimate the prediction error (a matrix of variances and covariances).
Based on the prediction, look at the error, and update the estimate of the state.
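A minimal sketch of this idea in Python for a one-dimensional series. Everything here is an assumption for illustration: $F_t$ and $H_t$ are reduced to scalars, the control term $B_t u_t$ is dropped, the noise variances are fixed, and a simple 3-sigma innovation test serves as the outlier check:

```python
import numpy as np

def kalman_impute(xs, F=1.0, H=1.0, q=1e-4, r=0.01, threshold=3.0):
    """Scalar Kalman filter over a series: predict each value, flag
    observations whose innovation is beyond `threshold` standard
    deviations, and replace them (and NaNs) with the prediction."""
    out = np.asarray(xs, dtype=float).copy()
    s, P = out[0], 1.0              # initial state estimate and its variance
    for t in range(1, len(out)):
        s_pred = F * s              # predict step: s_t = F s_{t-1} (+ w_t)
        P_pred = F * P * F + q
        y = out[t] - H * s_pred     # innovation for x_t = H s_t + v_t
        S = H * P_pred * H + r      # innovation variance
        if np.isnan(out[t]) or abs(y) > threshold * np.sqrt(S):
            out[t] = H * s_pred     # outlier or missing: impute prediction
            y = 0.0
        K = P_pred * H / S          # Kalman gain
        s = s_pred + K * y          # update state estimate
        P = (1 - K * H) * P_pred
    return out

# usage: noisy sine with an injected outlier and a missing value
t = np.linspace(0, 10, 200)
xs = np.sin(t) + np.random.normal(0, 0.1, 200)
xs[50], xs[120] = 8.0, np.nan
cleaned = kalman_impute(xs)
```

When a value is rejected, setting the innovation to zero makes the update step fall back on the prediction alone, which is exactly the imputation behaviour described above.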
## Transforming data
Filter out more subtle noise.

Lowpass filter: some data has periodicity. Decompose the series of values into different periodic signals and keep only the most interesting frequencies.

Principal component analysis: find new features explaining most of the variability in the data, and select the number of components based on the explained variance.
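A sketch of the lowpass idea using a Butterworth filter from scipy; the signal, cutoff frequency, sampling rate, and filter order are made-up example values:

```python
import numpy as np
from scipy.signal import butter, filtfilt

def lowpass(series, cutoff_hz, sampling_hz, order=5):
    """Remove frequency components above cutoff_hz from a sampled series."""
    nyquist = 0.5 * sampling_hz
    b, a = butter(order, cutoff_hz / nyquist, btype='low')
    return filtfilt(b, a, series)   # zero-phase filtering, no time shift

# 10 Hz signal drowned in 90 Hz noise, sampled at 500 Hz
t = np.linspace(0, 1, 500, endpoint=False)
noisy = np.sin(2 * np.pi * 10 * t) + 0.5 * np.sin(2 * np.pi * 90 * t)
smooth = lowpass(noisy, cutoff_hz=20, sampling_hz=500)
```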
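And a sketch of selecting components by explained variance with scikit-learn, on made-up data; the 95% threshold is an arbitrary choice:

```python
import numpy as np
from sklearn.decomposition import PCA

# made-up data: 3 measured attributes driven by 2 underlying signals
rng = np.random.default_rng(0)
base = rng.normal(size=(100, 2))
X = np.column_stack([base[:, 0], base[:, 1],
                     base[:, 0] + 0.1 * rng.normal(size=100)])

pca = PCA().fit(X)
print(pca.explained_variance_ratio_)      # variance explained per component

# keep the smallest number of components explaining >= 95% of the variance
n = int(np.argmax(np.cumsum(pca.explained_variance_ratio_) >= 0.95)) + 1
X_reduced = PCA(n_components=n).fit_transform(X)
```

scikit-learn can also do this selection directly by passing a fraction, e.g. `PCA(n_components=0.95)`.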