lectures.alex.balgavy.eu

Lecture notes from university.
git clone git://git.alex.balgavy.eu/lectures.alex.balgavy.eu.git

feature-engineering.md (1868B)


+++
title = 'Feature engineering'
+++

# Feature engineering
Creating useful features in different domains.

## Time domain
### Numerical
- want to summarize the values of a numerical attribute $i$ in a time window
- assume a temporal ordering $x_{1}^{i}, \dots, x_{N}^{i}$
- select a window size $\lambda$
- for each time point $t$, take the values $[x_{t-\lambda}^{i}, \dots, x_{t}^{i}]$
- compute the new feature value per time point by aggregating those values (e.g. mean, max, standard deviation)

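The windowed summary above can be sketched in plain Python; the series, window size, and aggregation function here are illustrative assumptions:

```python
# Summarize a numerical attribute over a sliding time window:
# for each time point t (from lam onward), aggregate x_{t-lam} .. x_t.

def window_feature(values, lam, agg):
    """Apply `agg` to each window values[t-lam .. t] (inclusive)."""
    return [agg(values[t - lam : t + 1]) for t in range(lam, len(values))]

series = [1, 2, 4, 8, 16, 32]
means = window_feature(series, lam=2, agg=lambda w: sum(w) / len(w))
# first window is [1, 2, 4], so the first feature value is 7/3
```

Any aggregation (max, standard deviation, slope of a fitted line) can be plugged in as `agg`.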
### Categorical
- generate patterns combining categorical values over time
- consider a window size $\lambda$
- consider different relationships between values: succession "(b)" (one value followed by another) and co-occurrence "(c)" (values holding at the same time point)
- support: the fraction of all time points at which the pattern occurs

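A minimal sketch of computing support for the two relationships above, on two made-up categorical attributes:

```python
# Support of a categorical pattern: the fraction of time points at which
# the pattern holds. Two relations are sketched:
#   co-occurrence: a and b hold at the same time point
#   succession:    a holds at time t and b holds at time t+1

def support_cooccurrence(attr1, attr2, a, b):
    hits = sum(1 for x, y in zip(attr1, attr2) if x == a and y == b)
    return hits / len(attr1)

def support_succession(attr, a, b):
    hits = sum(1 for x, y in zip(attr, attr[1:]) if x == a and y == b)
    return hits / len(attr)

s1 = ['walk', 'walk', 'run', 'run']      # e.g. an activity attribute
s2 = ['flat', 'hill', 'hill', 'flat']    # e.g. a terrain attribute
support_cooccurrence(s1, s2, 'run', 'hill')  # holds at 1 of 4 time points
```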
### Mixed
Make categories from numerical values & apply the categorical approach:
- ranges (low, normal, high)
- temporal relations (increasing, decreasing)

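Both discretizations can be sketched as below; the thresholds and the heart-rate series are made-up examples:

```python
# Turn a numerical series into categories two ways (thresholds are illustrative):
#  - value ranges: low / normal / high
#  - temporal relation between consecutive points: increasing / decreasing

def to_range(x, low=60, high=100):
    if x < low:
        return 'low'
    if x > high:
        return 'high'
    return 'normal'

def to_trend(values):
    return ['increasing' if b > a else 'decreasing' if b < a else 'stable'
            for a, b in zip(values, values[1:])]

heart_rate = [55, 72, 110, 90]
ranges = [to_range(x) for x in heart_rate]  # ['low', 'normal', 'high', 'normal']
trends = to_trend(heart_rate)               # ['increasing', 'increasing', 'decreasing']
```

The resulting categorical series can then be fed into the pattern approach above.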
### Pattern generation
- only focus on patterns with sufficient support
- start with patterns of single attribute-value pairs with sufficient support
- extend these into larger patterns, pruning any whose support falls below the threshold

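The steps above can be sketched Apriori-style: keep only frequent single attribute-value pairs, then build larger (here, co-occurrence) patterns only out of those. Data layout and names are assumptions for illustration:

```python
from collections import Counter
from itertools import combinations

def frequent_singles(rows, min_support):
    """rows: one dict {attribute: value} per time point.
    Return the (attribute, value) pairs with sufficient support."""
    n = len(rows)
    counts = Counter((a, v) for row in rows for a, v in row.items())
    return {p for p, c in counts.items() if c / n >= min_support}

def frequent_pairs(rows, singles, min_support):
    """Extend to co-occurrence patterns built only from frequent singles."""
    n = len(rows)
    counts = Counter()
    for row in rows:
        items = sorted((a, v) for a, v in row.items() if (a, v) in singles)
        for combo in combinations(items, 2):
            counts[combo] += 1
    return {p for p, c in counts.items() if c / n >= min_support}

rows = [{'act': 'walk', 'ter': 'flat'}, {'act': 'walk', 'ter': 'flat'},
        {'act': 'run', 'ter': 'hill'}, {'act': 'walk', 'ter': 'hill'}]
singles = frequent_singles(rows, min_support=0.5)
pairs = frequent_pairs(rows, singles, min_support=0.5)
```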
## Frequency domain
Consider the series of values within a window of size $\lambda$.
Perform a Fourier transform to find which frequencies occur in the window: the signal is decomposed into sinusoids with different periods, starting from a base frequency (the lowest frequency whose sinusoid completes a full period within the window).

Derive feature values from the decomposition, e.g. the frequency with the highest amplitude (normalized).
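A minimal sketch of the highest-amplitude feature with NumPy, assuming a made-up 100 Hz signal; `np.fft.rfftfreq` spaces the frequencies by the base frequency (sample rate divided by window length):

```python
import numpy as np

# Fourier features for one window: decompose the windowed signal into
# sinusoids and report the frequency with the highest amplitude.

def dominant_frequency(window, sample_rate):
    amplitudes = np.abs(np.fft.rfft(window))
    freqs = np.fft.rfftfreq(len(window), d=1 / sample_rate)
    amplitudes[0] = 0  # ignore the DC component (the window's mean)
    return freqs[np.argmax(amplitudes)]

fs = 100                        # assumed sampling rate in Hz
t = np.arange(0, 1, 1 / fs)     # one-second window
signal = np.sin(2 * np.pi * 5 * t) + 0.3 * np.sin(2 * np.pi * 20 * t)
dominant_frequency(signal, fs)  # -> 5.0, the strongest component
```

Other frequency features (e.g. amplitudes of all bins, normalized by their sum) can be read off the same `amplitudes` array.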
## Unstructured data - text
Perform a number of preprocessing steps:
- tokenization: identify sentences and words
- lower-casing: convert everything to lower case
- stemming: identify the stem of each word, and map all variations of a word to its stem
- stop word removal: get rid of words like 'the' that are not predictive

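The pipeline above can be sketched as follows. Real pipelines use a proper stemmer (e.g. Porter); the suffix-stripper and the tiny stop-word list here are crude illustrative stand-ins:

```python
import re

STOP_WORDS = {'the', 'a', 'an', 'is', 'are', 'of', 'to'}  # tiny illustrative list

def crude_stem(word):
    # Toy stand-in for a real stemmer: strip a few common suffixes.
    # Note it maps 'running' to 'runn', where Porter would give 'run'.
    for suffix in ('ing', 'ed', 's'):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    tokens = re.findall(r"[a-z']+", text.lower())       # tokenize + lower-case
    tokens = [crude_stem(t) for t in tokens]            # stemming
    return [t for t in tokens if t not in STOP_WORDS]   # stop word removal

preprocess("The runner is running to the hills")
# -> ['runner', 'runn', 'hill']
```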
Approaches:
- bag of words: count occurrences of n-grams (sequences of n consecutive words)
- TF-IDF: term frequency weighted by inverse document frequency, giving more weight to words that are rare across documents
- topic modeling: assume the text covers some topics, each associated with certain words; represent documents by topics instead of words
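TF-IDF has several common variants; a sketch of one (raw count times $\log(N/\mathrm{df})$) on made-up tokenized documents:

```python
import math
from collections import Counter

# TF-IDF, one common variant: tf(t, d) * log(N / df(t)).
# A word appearing in every document gets weight 0; rare words are up-weighted.

def tf_idf(docs):
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))  # document frequency
    return [{term: count * math.log(n / df[term])
             for term, count in Counter(doc).items()}
            for doc in docs]

docs = [['heart', 'rate', 'high'],
        ['heart', 'rate', 'low'],
        ['step', 'count', 'high']]
weights = tf_idf(docs)
# 'heart' appears in 2 of 3 docs -> weight log(3/2); 'low' in 1 -> log(3)
```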