clustering.md (2554B)
1 +++ 2 title = 'Clustering' 3 +++ 4 5 # Clustering 6 7 ## Learning setup 8 Per instance (e.g. instances for each person), or per person (groups of people). 9 10 Distance metrics: 11 - euclidian distance: length of line between two points (hypotenuse of a triangle) 12 - manhattan distance: summing distances in all dimensions (the two other sides of the triangle, added together) 13 - minkowski distance is a generalized form 14 - important to consider scaling of data 15 - assuming numeric values; for others you can use Gower's similarity: 16 - dichotomous (present or not): 1 if both present, 0 otherwise 17 - categorical: 1 if both are the same, 0 otherwise 18 - numerical: some calculation for scaling. don't know if I actually need to know this. 19 20 Distance metrics for datasets: 21 - non-temporal 22 - summarize values per attribute over entire dataset into single value (mean, min, max...) 23 - estimate parameters of distribution per attributes, and compare those (e.g. normal distribution) 24 - compare distributions of values for attribute i with a statistical test (e.g. Kolmogorov Smirnov, take 1-p value as distance metric) 25 - temporal 26 - raw-data based 27 - simplest case: assume equal number of points, compute euclidian distance on per point basis 28 - if time series are shifted in time, use a lag to compensate (across all attributes), use cross correlation coefficient to choose best value for shift (higher coefficient is better) 29 - for frequencies, e.g. people might walk at different frequencies, so use dynamic time warping 30 - pairing: time order should be preserved, first and last points should be matched 31 - feature-based 32 - model-based 33 - fit time series model, use those params 34 35 ## Clustering approaches 36 Options: 37 - k-means: take k number of clusters, start with k random points, and cluster closest points (center of cluster is the average) 38 - performance metric: silhouette 39 - k-medoids: use actual points instead of artifical means 40 - hierarchical: 41 - divisive: start with one big cluster, split in each step 42 - find dissimilarity of a point, move the most dissimilar out of the cluster 43 - select cluster with largest diameter 44 - agglomerative: start with one cluster per point, merge 45 - merge using single linkage (by minimum distance between two clusters), complete linkage (by maximum distance), or group average, or Ward's criterion (minimize standard deviation) 46 - subspace: look at subspace in feature space (otherwise with huge number of features, everything might even out)