lectures.alex.balgavy.eu

Lecture notes from university.
git clone git://git.alex.balgavy.eu/lectures.alex.balgavy.eu.git
Log | Files | Refs | Submodules

clustering.md (2554B)


      1 +++
      2 title = 'Clustering'
      3 +++
      4 
      5 # Clustering
      6 
      7 ## Learning setup
      8 Per instance (e.g. instances for each person), or per person (groups of people).
      9 
     10 Distance metrics:
     11 - euclidian distance: length of line between two points (hypotenuse of a triangle)
     12 - manhattan distance: summing distances in all dimensions (the two other sides of the triangle, added together)
     13 - minkowski distance is a generalized form
     14 - important to consider scaling of data
     15 - assuming numeric values; for others you can use Gower's similarity:
     16     - dichotomous (present or not): 1 if both present, 0 otherwise
     17     - categorical: 1 if both are the same, 0 otherwise
     18     - numerical: some calculation for scaling. don't know if I actually need to know this.
     19 
     20  Distance metrics for datasets:
     21  - non-temporal
     22     - summarize values per attribute over entire dataset into single value (mean, min, max...)
     23     - estimate parameters of distribution per attributes, and compare those (e.g. normal distribution)
     24     - compare distributions of values for attribute i with a statistical test (e.g. Kolmogorov Smirnov, take 1-p value as distance metric)
     25 - temporal
     26     - raw-data based
     27         - simplest case: assume equal number of points, compute euclidian distance on per point basis
     28         - if time series are shifted in time, use a lag to compensate (across all attributes), use cross correlation coefficient to choose best value for shift (higher coefficient is better)
     29         - for frequencies, e.g. people might walk at different frequencies, so use dynamic time warping
     30             - pairing: time order should be preserved, first and last points should be matched
     31     - feature-based
     32     - model-based
     33         - fit time series model, use those params
     34 
     35 ## Clustering approaches
     36 Options:
     37 - k-means: take k number of clusters, start with k random points, and cluster closest points (center of cluster is the average)
     38     - performance metric: silhouette
     39 - k-medoids: use actual points instead of artifical means
     40 - hierarchical:
     41     - divisive: start with one big cluster, split in each step
     42         - find dissimilarity of a point, move the most dissimilar out of the cluster
     43         - select cluster with largest diameter
     44     - agglomerative: start with one cluster per point, merge
     45         - merge using single linkage (by minimum distance between two clusters), complete linkage (by maximum distance), or group average, or Ward's criterion (minimize standard deviation)
     46 - subspace: look at subspace in feature space (otherwise with huge number of features, everything might even out)