+++
title = 'Feature engineering'
+++

# Feature engineering
Creating useful features in different domains. Short code sketches for the individual techniques are collected at the end of these notes.

## Time domain
### Numerical
- goal: summarize the values of a numerical attribute $i$ within a time window
- assume a temporal ordering $x_{1}^{i}, \dots, x_{N}^{i}$
- select a window size $\lambda$
- for each time point $t$, take the values $[x_{t-\lambda}^{i}, \dots, x_{t}^{i}]$ in the window
- compute the new feature value per time point as an aggregate (e.g. mean, standard deviation, maximum) over those values

### Categorical
- generate patterns that combine categorical values over time
- consider a window size $\lambda$
- consider different temporal relationships between values: succession (one value is followed by another) and co-occurrence (two values appear in the same window)
- support: the fraction of all time points at which the pattern occurs

### Mixed
Derive categories from numerical values and apply the categorical approach:
- ranges (low, normal, high)
- temporal relations (increasing, decreasing)

### Pattern generation
- only focus on patterns with sufficient support
- start with patterns of single attribute-value pairs with sufficient support
- extend those patterns step by step, keeping only extensions that still have sufficient support

## Frequency domain
Consider the series of values within a window of size $\lambda$.
Apply a Fourier transformation to see which frequencies are present in the window: the signal is decomposed into sinusoids with different periods, starting from a base frequency (the lowest frequency whose period fits completely in the window).

Derive feature values from the spectrum, e.g. the frequency with the highest amplitude, and normalize the amplitudes so that windows are comparable.

## Unstructured data - text
Perform a number of preprocessing steps:
- tokenization: identify sentences and words
- lower-casing: convert everything to lower case
- stemming: identify the stem of each word and map all variations of a word to that stem
- stop word removal: get rid of words like 'the' that are not predictive

Approaches:
- bag of words: count occurrences of n-grams (sequences of n consecutive words)
- TF-IDF: term frequency weighted by inverse document frequency, so that words that are rare across documents get more weight
- topic modeling: assume the text is generated from topics that are each associated with words, and use topics instead of individual words as features
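
As an illustration of the time-domain numerical aggregation, a minimal sketch assuming the data lives in a pandas DataFrame with one row per time point; the column name `acc_x` and the window size are made up:

```python
# Sketch: summarize a numerical attribute over a sliding time window.
# Assumes rows are ordered in time; column name and window size are examples.
import numpy as np
import pandas as pd

def add_window_aggregates(df: pd.DataFrame, col: str, lam: int) -> pd.DataFrame:
    """Add per-time-point aggregates over the values x_{t-lam}..x_t of `col`."""
    window = df[col].rolling(window=lam + 1, min_periods=1)  # lam + 1 values incl. x_t
    df[f"{col}_mean_{lam}"] = window.mean()
    df[f"{col}_std_{lam}"] = window.std()
    df[f"{col}_max_{lam}"] = window.max()
    return df

# Toy usage: a noisy sine wave as the attribute.
df = pd.DataFrame({"acc_x": np.sin(np.linspace(0, 10, 200)) + np.random.normal(0, 0.1, 200)})
df = add_window_aggregates(df, "acc_x", lam=10)
```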
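
For the categorical temporal patterns, a rough sketch of computing the support of a succession and a co-occurrence pattern for one attribute; the activity labels are invented, and a full implementation would also enumerate and prune candidate patterns:

```python
# Sketch: support of simple temporal patterns for one categorical attribute.
# "a precedes b" (succession) and "a and b both occur" (co-occurrence),
# evaluated over the window [t - lam, t] for every time point t.
from typing import List, Tuple

def pattern_support(values: List[str], lam: int, a: str, b: str) -> Tuple[float, float]:
    n = len(values)
    succession = co_occurrence = 0
    for t in range(n):
        window = values[max(0, t - lam): t + 1]
        if a in window and b in window:
            co_occurrence += 1
            if window.index(a) < window.index(b):  # first a appears before first b
                succession += 1
    return succession / n, co_occurrence / n

# Toy usage with invented labels; keep only patterns whose support
# exceeds some minimum threshold.
labels = ["sit", "sit", "walk", "run", "walk", "sit", "run", "run", "walk"]
succ, co = pattern_support(labels, lam=3, a="sit", b="run")
```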
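
For the frequency domain, a sketch using NumPy's FFT to extract the highest-amplitude frequency and a normalized spectrum from one window; the sampling rate and test signal are made up:

```python
# Sketch: frequency-domain features for a single window of values.
import numpy as np

def frequency_features(window_values, sampling_rate):
    """Return the dominant frequency and the normalized amplitude spectrum."""
    values = np.asarray(window_values, dtype=float)
    amplitudes = np.abs(np.fft.rfft(values))                  # amplitude per frequency bin
    freqs = np.fft.rfftfreq(len(values), d=1.0 / sampling_rate)
    dominant = freqs[np.argmax(amplitudes[1:]) + 1]           # skip the constant (DC) component
    normalized = amplitudes / amplitudes.sum()                # comparable across windows
    return dominant, normalized

# Toy usage: a 2 Hz sinusoid sampled at 20 Hz -> dominant frequency ~ 2 Hz.
t = np.arange(0, 5, 1 / 20)
dominant, spectrum = frequency_features(np.sin(2 * np.pi * 2 * t), sampling_rate=20)
```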
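
For the text pipeline, a sketch combining the preprocessing steps with bag-of-words and TF-IDF features, using NLTK's Porter stemmer and scikit-learn's vectorizers on an invented two-document corpus:

```python
# Sketch: text preprocessing followed by bag-of-words and TF-IDF features.
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import (
    ENGLISH_STOP_WORDS,
    CountVectorizer,
    TfidfVectorizer,
)

stemmer = PorterStemmer()

def preprocess(text: str) -> str:
    tokens = text.lower().split()                                # tokenization + lower-casing
    tokens = [t for t in tokens if t not in ENGLISH_STOP_WORDS]  # stop word removal
    return " ".join(stemmer.stem(t) for t in tokens)             # stemming

corpus = ["The battery drains very quickly", "Battery life keeps draining fast"]
docs = [preprocess(d) for d in corpus]

# Bag of words: occurrence counts of unigrams and bigrams.
bow = CountVectorizer(ngram_range=(1, 2)).fit_transform(docs)

# TF-IDF: term frequency down-weighted for terms that appear in many documents.
tfidf = TfidfVectorizer().fit_transform(docs)
```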