This is a case of unsupervised learning
We are working with unlabeled data
We are trying to find patterns and/or structure in the data
The main characteristic for unsupervised learning is that we have unlabeled data
So far, when working with supervised learning, we have had a response variable Y and some predictor variables X
This time we only have X
Our goal is to see if there is anything we can get out of this information
Trying to divide/partition the n observations into several sub-groups/clusters
How do we do this?
If we have class labels on some of the objects, we can apply unsupervised clustering, then let the clusters be defined by their class enrichment of labeled objects.
A word of caution for this approach: Just because a clustering structure doesn't align with known labels doesn't mean it is "wrong". It could be capturing a different (true) aspect of the data than the one we have labels for.
Sometimes clustering is applied as a first exploratory step, to get a sense of the structure of the data. This is somewhat nebulous and usually involves eyeballing a visualization.
Clustering can be used to discover relationships in data that are undesirable, so that we can residualize or decorrelate the objects before applying an analysis.
A great example of this is in genetics, where we have measurements of gene expression for several subjects. Typically, gene expression is most strongly correlated with race. If we cluster the subjects on gene expression, we can then identify unwanted dependence to remove from the data.
Sometimes, the assignment of cluster membership is the end goal of the study. For example:
In the Enron corruption case in 2001, researchers created a network based on who emailed who within the company. They then looked at which clusters contained known conspirators and investigated the other individuals in those groups.
In the early days of breast cancer genetic studies, researchers clustered known patients on genetic expression, which led to the discovery of different tumor types (e.g. Basal, Her-2, Luminal). These have later been clinically validated and better defined.
One way is to define a geometry that is used to determine whether 2 points are close to each other
Having the "distances" between points allows us to see if there are any points with a lot of "friends"
We will focus on K-means clustering and Hierarchical clustering
which are examples of centroid-based clustering and hierarchical clustering, respectively
A Comprehensive Survey of Clustering Algorithms
by Dongkuan Xu & Yingjie Tian
A simple and elegant approach
Intuitively easy to understand
It partitions the observations into K non-overlapping clusters
We let C_1, ..., C_K denote sets of indices of the observations in each cluster.
For K-means we have that the union of C_1, ..., C_K equals {1, ..., n} and that the sets do not overlap
We need a criterion to optimize
K-means states that we want to minimize the within-cluster variation
\underset{C_1, ..., C_K}{\text{minimize}}\left\{ \sum_{k=1}^K W(C_k) \right\}
This is a reasonable starting point. But we need to define W
The most common way is using squared Euclidean distance
W(C_k) = \dfrac{1}{|C_k|} \sum_{i, i' \in C_k}\sum_{j=1}^p(x_{ij} - x_{i'j})^2
here |C_k| denotes the number of observations in the kth cluster
The variation is defined as the sum of all the pairwise squared Euclidean distances between the observations within a cluster, divided by the number of observations in that cluster
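As a concrete check of the formula, here is a minimal NumPy sketch (assuming `X` is an n × p array and `labels` is an integer array giving each row's cluster index) that computes the total within-cluster variation:

```python
import numpy as np

def within_cluster_variation(X, labels):
    """Sum over clusters of (1/|C_k|) * sum of pairwise squared
    Euclidean distances between observations in the same cluster."""
    total = 0.0
    for k in np.unique(labels):
        Xk = X[labels == k]
        # pairwise squared Euclidean distances within cluster k
        diffs = Xk[:, None, :] - Xk[None, :, :]
        d2 = (diffs ** 2).sum(axis=-1)
        total += d2.sum() / len(Xk)
    return total
```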
There is no closed-form solution to this since the function isn't smooth
We have to find a way to walk through the different partitions to find a good one. HOWEVER!!
Since we are working with partitions, the number of possible partitions grows VERY fast, on the order of K^n
The standard algorithm alternates between assigning each observation to the closest centroid and recomputing the centroids; here "closest" is defined using Euclidean distance
(Step-by-step illustrations of the K-means algorithm. Art by Allison Horst.)
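To make those steps concrete, below is a bare-bones NumPy sketch of the standard (Lloyd's) K-means iteration; it assumes `X` is an n × p array and skips details such as empty-cluster handling and multiple random restarts, which a real implementation would need:

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Bare-bones sketch of the standard (Lloyd's) K-means iteration."""
    rng = np.random.default_rng(seed)
    # 1. start from K randomly chosen observations as initial centroids
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iter):
        # 2. assign each observation to the closest centroid (Euclidean distance)
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        labels = d2.argmin(axis=1)
        # 3. recompute each centroid as the mean of its assigned observations
        new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        if np.allclose(new_centroids, centroids):
            break  # assignments have stabilized
        centroids = new_centroids
    return labels, centroids
```

In practice you would typically use an existing implementation (e.g. scikit-learn's KMeans), which handles initialization and restarts for you.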
Since we are using a Euclidean measure, you need to scale the variables so that each variable influences the clustering evenly
There is no natural ordering in the clusters, keep that in mind when doing the analysis
We can run K-means for many different values of K and draw an elbow chart
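A rough sketch of the scaling + elbow-chart workflow with scikit-learn (assuming `X` is your n × p data matrix; the range of K values tried is an arbitrary choice, and `inertia_` is scikit-learn's total within-cluster sum of squares):

```python
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# scale the variables so each one influences the clustering evenly
X_scaled = StandardScaler().fit_transform(X)

# total within-cluster sum of squares for a range of K
ks = range(1, 11)
wss = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_scaled).inertia_
       for k in ks]

plt.plot(ks, wss, marker="o")
plt.xlabel("Number of clusters K")
plt.ylabel("Total within-cluster sum of squares")
plt.show()
```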
One of the main requirements when using K-means is that we must specify, in advance, the number of clusters we want to find.
Hierarchical clustering is an alternative approach that does not require us to do this
We also get a tree-based representation of the data
HC works as a bottom-up/agglomerative method
We start with each observation as its own cluster, then iteratively merge the closest clusters
A good thing about HC is that we only have to compute the tree once; we can then take our time deciding where to cut it
(Step-by-step illustrations of hierarchical clustering. Art by Allison Horst.)
We cut the dendrogram at a given height to decide how many clusters we want
Where to cut is not entirely obvious
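One possible way to do this in practice, sketched with SciPy (assuming `X_scaled` is the scaled data matrix from before; the linkage method, the cut height `t=5.0`, and the cluster count 3 are placeholder choices):

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

# agglomerative clustering; other linkage methods include "single", "average", "ward"
Z = linkage(X_scaled, method="complete")

# the tree only has to be computed once; inspect it before deciding where to cut
dendrogram(Z)
plt.show()

# cut at a chosen height, or ask directly for a number of clusters
labels_by_height = fcluster(Z, t=5.0, criterion="distance")
labels_by_k = fcluster(Z, t=3, criterion="maxclust")
```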
How do we perform validation?
Clustering results can be very hard to validate properly. So far it hasn't been hard, since we have only had 2 dimensions, but these algorithms are not limited to 2 variables
There is no consensus on a single best approach
The major departure from supervised learning is this: With a supervised method, we have a very clear way to measure success, namely, how well does it predict?
With clustering, there is no "right answer" to compare results against.
There are several ways people typically validate a clustering result
The goal is to find groups of similar objects. Thus, we can check how close objects in the same cluster are as compared to how close objects in different clusters are.
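The silhouette score is one common way to quantify this within- vs between-cluster comparison; a minimal scikit-learn call (assuming `X_scaled` and a vector of cluster `labels` from one of the methods above):

```python
from sklearn.metrics import silhouette_score

# mean silhouette: compares each point's average distance to its own cluster
# with its average distance to the nearest other cluster (-1 poor, +1 good)
score = silhouette_score(X_scaled, labels)
```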
If we regard the objects being clustered as a random subset of a population, we can ask whether the same cluster structure would have emerged in a different random subset. We can measure this with bootstrapped subsampling.
A cluster structure being stable doesn't necessarily mean it is meaningful.
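A rough sketch of such a stability check (assuming `X_scaled` as before; K = 3, 50 repetitions, and 80% subsamples are illustrative choices): cluster random subsamples and compare their labels with the full-data solution using the adjusted Rand index.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
base = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_scaled)

scores = []
for _ in range(50):
    # cluster a random subsample and compare its labels with the full solution
    idx = rng.choice(len(X_scaled), size=int(0.8 * len(X_scaled)), replace=False)
    sub = KMeans(n_clusters=3, n_init=10).fit_predict(X_scaled[idx])
    scores.append(adjusted_rand_score(base[idx], sub))

print("mean agreement (adjusted Rand index):", np.mean(scores))
```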
Both of the methods we have seen assign every point to exactly one cluster
There are two possible kinds of problems here