
Clustering

AU STAT-427/627

Emil Hvitfeldt

2021-06-05

1 / 64

Clustering

This is a case of unsupervised learning

We are working with unlabeled data

2 / 64

Unsupervised Learning

  • Clustering
  • Anomaly Detection
  • Dimensionality Reduction
  • Association Rules

We are trying to find patterns and/or structure in the data

3 / 64

Unsupervised Learning

The main characteristic of unsupervised learning is that we have unlabeled data

So far, when working with supervised learning, we have had a response variable Y and some predictor variables X

This time we only have X

Our goal is to see if there is anything we can get out of this information

4 / 64

Clustering

We are trying to divide/partition the n observations into several sub-groups/clusters

How do we do this?

5 / 64

Centroid-based Clustering

6 / 64

Density-based Clustering

7 / 64

Distribution-based Clustering

8 / 64

Hierarchical Clustering

9 / 64

Uses of Clustering

  • Semi-Supervised Learning
  • EDA
  • Pre-processing
  • Clusters as Analysis
10 / 64

Semi-Supervised Learning

If we have class labels on some of the objects, we can apply unsupervised clustering, then let the clusters be defined by their class enrichment of labeled objects.

A word of caution for this approach: Just because a clustering structure doesn't align with known labels doesn't mean it is "wrong". It could be capturing a different (true) aspect of the data than the one we have labels for.

11 / 64

EDA

Sometimes clustering is applied as a first exploratory step, to get a sense of the structure of the data. This is somewhat nebulous and usually involves eyeballing a visualization.

12 / 64

Pre-processing

Clustering can be used to discover relationships in data that are undesirable, so that we can residualize or decorrelate the objects before applying an analysis.

A great example of this is in genetics, where we have measurements of gene expression for several subjects. Typically, gene expression is most strongly correlated by race. If we cluster the subjects on gene expression, we can then identify unwanted dependence to remove from the data.

13 / 64

Clusters as analysis

Sometimes, the assignment of cluster membership is the end goal of the study. For example:

In the Enron corruption case in 2001, researchers created a network based on who emailed whom within the company. They then looked at which clusters contained known conspirators and investigated the other individuals in those groups.

In the early days of breast cancer genetic studies, researchers clustered known patients on genetic expression, which led to the discovery of different tumor types (e.g. Basal, Her-2, Luminal). These have later been clinically validated and better defined.

14 / 64

How are clusters found?

One way is to define a geometry that is used to determine whether 2 points are close to each other

Having the "distances" between points allows us to see if there are any points with a lot of "friends"
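As a minimal sketch (on three made-up points), base R's dist() computes exactly this kind of geometry: all pairwise Euclidean distances, which most of the methods below start from.

  x <- matrix(c(0, 0,
                0, 1,
                5, 5), ncol = 2, byrow = TRUE)  # three made-up points
  dist(x)  # pairwise Euclidean distances: points 1 and 2 are close, point 3 is far away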

15 / 64

How are clusters found?

We will focus on K-means clustering and hierarchical clustering

These are examples of centroid-based clustering and hierarchical clustering, respectively

16 / 64

Survey of many more clustering methods

A Comprehensive Survey of Clustering Algorithms

by Dongkuan Xu & Yingjie Tian

17 / 64

K-Means Clustering

A simple and elegant approach

Intuitively easy to understand

It partitions the observations into K non-overlapping clusters

18 / 64

K-Means Clustering

We let C_1, ..., C_K denote sets of indices of the observations in each cluster.

For K-means, the union of C_1, ..., C_K equals {1, ..., n}, and there is no overlap between the sets: each observation belongs to exactly one cluster

19 / 64

K-Means Clustering

As with most methods, we need an objective to optimize

K-means states that we want to minimize the within-cluster variation

\underset{C_1, ..., C_K}{\text{minimize}}\left\{ \sum_{k=1}^K W(C_k) \right\}

This is a reasonable starting point. But we need to define W

20 / 64

K-Means Clustering

The most common choice is squared Euclidean distance

W(C_k) = \dfrac{1}{|C_k|} \sum_{i, i' \in C_k}\sum_{j=1}^p(x_{ij} - x_{i'j})^2

here |C_k| denotes the number of observations in the kth cluster
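A small sketch of this formula in R (the data and the helper name are made up for illustration): sum the squared pairwise Euclidean distances within one cluster, counting both (i, i') and (i', i), and divide by |C_k|.

  within_cluster_variation <- function(x_k) {
    d2 <- as.matrix(dist(x_k))^2   # full symmetric matrix of squared distances
    sum(d2) / nrow(x_k)            # divide by the cluster size |C_k|
  }
  cluster_k <- matrix(rnorm(20), ncol = 2)   # pretend these 10 points form cluster k
  within_cluster_variation(cluster_k)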

21 / 64

K-Means Clustering

The variation is defined as the sum of all the pairwise squared Euclidean distances between the observations within a cluster, scaled by the cluster size

There is no closed-form solution to this problem; the objective is defined over discrete partitions, so we can't simply take derivatives

We have to find a way to search through the different partitions to find a good one. However,

since we are working with partitions, the number of candidates grows VERY fast, on the order of K^n

22 / 64

K-Means Clustering

  1. Randomly assign a number, from 1 to K, to each of the observations. These serve as initial cluster assignments for the observations.
  2. Iterate until the cluster assignments stop changing:
     a. For each of the K clusters, compute the cluster centroid. The kth cluster centroid is the vector of the p feature means for the observations in the kth cluster.
     b. Assign each observation to the cluster whose centroid is closest.

Here closest is defined using Euclidean distance
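A minimal sketch of this loop in base R, on made-up data (it ignores the empty-cluster edge case):

  set.seed(1)
  x <- matrix(rnorm(200), ncol = 2)                   # made-up data: n = 100, p = 2
  K <- 3
  assignment <- sample(1:K, nrow(x), replace = TRUE)  # step 1: random initial labels
  repeat {
    # step 2a: each centroid is the vector of the p feature means in its cluster
    centroids <- t(sapply(1:K, function(k) colMeans(x[assignment == k, , drop = FALSE])))
    # step 2b: reassign each observation to the closest centroid (Euclidean distance)
    d <- as.matrix(dist(rbind(centroids, x)))[-(1:K), 1:K]
    new_assignment <- unname(apply(d, 1, which.min))
    if (identical(new_assignment, assignment)) break  # stop when nothing changes
    assignment <- new_assignment
  }
  table(assignment)                                   # resulting cluster sizes

In practice, stats::kmeans(x, centers = 3, nstart = 20) runs this kind of fit for you, with multiple random starts to guard against a bad initialization.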

23 / 64

A sequence of slides illustrating the K-means steps (Art by Allison Horst)

24-35 / 64

Scaling

Since we are using a Euclidean distance measure, you need to scale the variables so that no single variable dominates the clustering
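A minimal sketch on made-up data: base R's scale() centers each column and divides it by its standard deviation before clustering.

  x <- matrix(rnorm(200), ncol = 2)           # made-up data
  x[, 2] <- x[, 2] * 1000                     # put one variable on a much larger scale
  x_scaled <- scale(x)                        # center each column and divide by its sd
  kmeans(x_scaled, centers = 3, nstart = 20)  # without scale(), column 2 would dominate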

36 / 64

With scaling

37 / 64

Without scaling

38 / 64

Initialization Matters

There is no natural ordering of the clusters; keep that in mind when doing the analysis

39 / 64

No natural ordering of Clusters

40 / 64

Advantages of K-Means

  • Relatively simple to implement
  • Scales to large data sets
  • Guarantees convergence
  • Can warm-start the positions of centroids
  • Easily adapts to new examples
41 / 64

Downsides to K-means

  • You have to specify the number of clusters K in advance

We can fit K-means for many different values of K and draw an elbow chart
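A sketch of one common way to build the elbow chart, on made-up data: fit K-means for a range of K and plot the total within-cluster sum of squares against K.

  x_scaled <- scale(matrix(rnorm(200), ncol = 2))   # made-up, scaled data
  ks  <- 1:10
  wss <- sapply(ks, function(k) kmeans(x_scaled, centers = k, nstart = 20)$tot.withinss)
  plot(ks, wss, type = "b",
       xlab = "Number of clusters K",
       ylab = "Total within-cluster sum of squares")  # look for the "elbow"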

42 / 64

43 / 64

44 / 64

Elbow chart for separated data

45 / 64

Elbow chart for normal data

46 / 64

Downsides to K-means

  • Can only use numerical predictors X
  • K-means assumes roughly spherical clusters, each with roughly equal numbers of observations
  • The result depends on the initial values
  • Struggles with clusters of varying sizes and densities
  • Is sensitive to outliers
  • Scales poorly with the number of dimensions
  • K-means will still return clusters even when the data has no real cluster structure, such as data from a uniform distribution
47 / 64

Hierarchical clustering

One of the main requirements of K-means is that we have to specify the number of clusters we want to find.

Hierarchical clustering is an alternative approach where we don't have to do this

We also get a tree-based representation of the data, called a dendrogram

48 / 64

Hierarchical clustering

HC works as a bottom-up/agglomerative method

We start with each observation as its own cluster, then iteratively merge the closest clusters

A nice property of HC is that we only have to compute the tree once; we can then take our time deciding where to cut it

49 / 64

Hierarchical clustering

  1. Begin with n observations and a measure (such as Euclidean distance) of all the pairwise dissimilarities. Treat each observation as its own cluster.
  2. For i = n, n-1, ..., 2:
     a. Examine all pairwise inter-cluster dissimilarities among the i clusters and identify the pair of clusters that are least dissimilar. Fuse these two clusters. The dissimilarity between these two clusters indicates the height in the dendrogram at which the fusion should be placed.
     b. Compute the new pairwise inter-cluster dissimilarities among the i-1 remaining clusters.
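A minimal sketch of this in base R, on made-up data; complete linkage is just one common choice here, not something the slides fix.

  x_scaled <- scale(matrix(rnorm(200), ncol = 2))  # made-up, scaled data
  d  <- dist(x_scaled)                  # step 1: all pairwise Euclidean dissimilarities
  hc <- hclust(d, method = "complete")  # step 2: repeatedly fuse the least dissimilar pair
  plot(hc)                              # dendrogram; fusion heights are the dissimilarities
  cutree(hc, k = 3)                     # cut the tree into 3 cluster labels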
50 / 64

A sequence of slides illustrating the hierarchical clustering steps (Art by Allison Horst)

51-57 / 64

Working with a dendrogram

We cut the dendrogram at a chosen height to decide how many clusters we want

Where to cut is not entirely obvious
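Assuming hc is an hclust object like the one in the sketch above, cutree() supports both ways of cutting:

  cutree(hc, h = 2)   # cut at height 2 and take however many clusters that gives
  cutree(hc, k = 4)   # or ask directly for 4 clusters and let cutree pick the height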

58 / 64

General Clustering considerations

How do we perform validation?

Clusterings can be very hard to validate properly. So far it hasn't been hard to eyeball, since we only had 2 dimensions, but these algorithms are not limited to 2 variables

There is no consensus on a single best approach

59 / 64

Ways to validate a cluster

The major departure from supervised learning is this: With a supervised method, we have a very clear way to measure success, namely, how well does it predict?

With clustering, there is no "right answer" to compare results against.

There are several ways people typically validate a clustering result

60 / 64

Within-group versus between-group similarity

The goal is to find groups of similar objects. Thus, we can check how close objects in the same cluster are as compared to how close objects in different clusters are.

  • A problem with this is that there is no objective baseline for what counts as a "good" ratio
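One rough way to look at this with K-means output (made-up data again): the kmeans object reports the between-cluster and total sums of squares, so their ratio says how much of the total variation sits between the clusters. As noted above, there is no objective threshold for what a "good" value is.

  km <- kmeans(scale(matrix(rnorm(200), ncol = 2)), centers = 3, nstart = 20)
  km$betweenss / km$totss   # share of the total variation that lies between clusters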
61 / 64

Stability

If we regard the objects being clustered as a random subset of a population, we can ask whether the same cluster structure would have emerged in a different random subset. We can measure this with bootstrapped subsampling.

A cluster structure being stable doesn't necessarily mean it is meaningful.

62 / 64

63 / 64

Other considerations

Both of the methods we saw here will assign every point to exactly one cluster

There are two possible kinds of problems here

  • Every point is forced to be part of a cluster
  • A point can only be part of one cluster
64 / 64