lectures.alex.balgavy.eu

Lecture notes from university.
git clone git://git.alex.balgavy.eu/lectures.alex.balgavy.eu.git

commit 4896d121688e89b80d4c3477f241d354fee2bec8
parent 93ddce2556c939120dac85c969f316d4da6757b8
Author: Alex Balgavy <alex@balgavy.eu>
Date:   Thu,  3 Jun 2021 11:59:56 +0200

ML4QS

Diffstat:
M content/ml4qs/_index.md | 1 +
A content/ml4qs/feature-engineering.md | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 49 insertions(+), 0 deletions(-)

diff --git a/content/ml4qs/_index.md b/content/ml4qs/_index.md
@@ -5,3 +5,4 @@ title = 'Machine Learning for the Quantified Self'
 # Machine Learning for the Quantified Self
 1. [Introduction & Basics of Sensory Data](introduction-basics-of-sensory-data)
 2. [Handling sensory noise](handling-sensory-noise)
+3. [Feature engineering](feature-engineering)
diff --git a/content/ml4qs/feature-engineering.md b/content/ml4qs/feature-engineering.md
@@ -0,0 +1,48 @@
++++
+title = 'Feature engineering'
++++
+
+# Feature engineering
+Creating useful features in different domains.
+
+## Time domain
+### Numerical
+- want to summarize the values of numerical attribute i within a time window
+- assume a temporal ordering $x_{1}^{i}, \dots, x_{N}^{i}$
+- select a window size λ
+- for each time point t, select the values in the window $[x_{t-\lambda}^{i}, \dots, x_{t}^{i}]$
+- compute the new feature value, per time point, by aggregating over those values
+
+### Categorical
+- generate patterns combining categorical values over time
+- consider a window size λ
+- consider different relationships: succession "(b)", co-occurrence "(c)"
+- support: the fraction of all time points at which the pattern occurs
+
+### Mixed
+Make categories from numerical values & apply the categorical approach:
+- ranges (low, normal, high)
+- temporal relations (increasing, decreasing)
+
+
+### Pattern generation
+- only focus on patterns with sufficient support
+- start with patterns of single attribute-value pairs with sufficient support
+
+## Frequency domain
+Consider a series of values within a window of size λ.
+Perform a Fourier transformation to see which frequencies occur in the window: decompose the signal into sinusoid functions with different periods, each a multiple of a base frequency (the lowest frequency with a complete sinusoid period in the window).
+
+Get feature values, e.g. the frequency with the highest amplitude, and normalize.
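The window-based features from the time and frequency domain notes above can be sketched in a few lines of Python. This is a minimal illustration and not part of the commit: the function names (`window_aggregate`, `amplitudes`, `highest_amplitude_frequency`), the choice of the mean as the default aggregation, and the `sampling_rate_hz` parameter are all assumptions made for the example. A naive discrete Fourier transform stands in for whatever FFT routine would be used in practice.

```python
import cmath
import math
from statistics import mean


def window_aggregate(values, lam, agg=mean):
    """Time domain: aggregate [x_{t-lam}, ..., x_t] for every t with a full window."""
    return [agg(values[t - lam:t + 1]) for t in range(lam, len(values))]


def amplitudes(window):
    """Naive discrete Fourier transform; returns one amplitude per index k.

    Index k stands for k complete sinusoid periods within the window,
    so k = 1 is the base frequency and k = 0 is the constant offset.
    """
    n = len(window)
    result = []
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(window))
        result.append(abs(coeff) / n)
    return result


def highest_amplitude_frequency(window, sampling_rate_hz):
    """Frequency in Hz with the highest amplitude, skipping the k = 0 term."""
    amps = amplitudes(window)
    k_best = max(range(1, len(amps)), key=lambda k: amps[k])
    return k_best * sampling_rate_hz / len(window)


# Usage: a sine wave with 3 periods in a 32-sample window, sampled at 32 Hz.
signal = [math.sin(2 * math.pi * 3 * i / 32) for i in range(32)]
print(window_aggregate([1, 2, 3, 4, 5], lam=2))
print(highest_amplitude_frequency(signal, sampling_rate_hz=32))
```

The naive transform is quadratic in the window length, which is acceptable for short windows; for longer ones an FFT implementation would be the usual substitute.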
+
+## Unstructured data - text
+Perform a number of steps:
+- tokenization: identify sentences and words
+- lower case everything
+- stemming: identify the stem of each word, map all variations of a word to that stem
+- stop word removal: get rid of words like 'the' that are not predictive
+
+Approaches:
+- bag of words: count occurrences of n-grams (n consecutive words)
+- TF-IDF: word frequencies, giving more weight to words that occur in few documents
+- topic modeling: assume the text has some topics associated with words, look at topics instead of words
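The preprocessing steps and the first two approaches in the text notes above could look roughly like the following. This is a hedged sketch using only the Python standard library and is not part of the commit: `STOP_WORDS` is a tiny made-up list, `toy_stem` is a crude suffix stripper standing in for a real stemmer, and the weighting `tf * log(N / df)` is one common TF-IDF variant among several.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop word list (assumption, not a standard list).
STOP_WORDS = {"the", "a", "an", "is", "on", "and", "of"}


def toy_stem(word):
    """Very crude suffix stripping; a placeholder for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word


def preprocess(text):
    """Lower-case, tokenize into words, drop stop words, then stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [toy_stem(t) for t in tokens if t not in STOP_WORDS]


def bag_of_words(tokens, n=1):
    """Count occurrences of n-grams (tuples of n consecutive tokens)."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def tf_idf(docs):
    """docs: list of token lists. Returns one {word: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each word occurs.
    doc_freq = Counter(word for doc in docs for word in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({w: c * math.log(n_docs / doc_freq[w])
                        for w, c in counts.items()})
    return weights


# Usage
tokens = preprocess("The cats walked on the mat")
print(tokens)
print(bag_of_words(tokens, n=2))
print(tf_idf([["cat", "walk"], ["cat", "sleep"]]))
```

With this weighting, a word that appears in every document gets weight 0, which matches the idea of giving more weight to words that are specific to a few documents.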