By Vivek Kumar Bansal on June 20th, 2016

Unsupervised learning is contrasted with supervised learning because it uses an **unlabeled** training set rather than a labeled one.

In other words, we don't have the vector y of expected results, we only have a dataset of features where we can find structure.

Clustering is good for:

- Market segmentation
- Social network analysis
- Organizing computer clusters
- Astronomical data analysis

The K-Means Algorithm is the most popular and widely used algorithm for automatically grouping data into coherent subsets.

1. Randomly initialize two points in the dataset, called the *cluster centroids*.
2. Cluster assignment: assign each example to one of the two groups based on which cluster centroid the example is closest to.
3. Move centroid: compute the averages for all the points inside each of the two cluster centroid groups, then move the cluster centroid points to those averages.
4. Re-run (2) and (3) until we have found our clusters.

Our main variables are:

$$
\begin{align*}
& K = \text{number of clusters} \\
& \text{Training set } x^{(1)}, x^{(2)}, \dots, x^{(m)}, \text{ where } x^{(i)} \in R^n
\end{align*}
$$
Note that we **will not use** the $x_0=1$ convention.

**The algorithm:**

Randomly initialize $K$ cluster centroids $\mu_1, \mu_2, \dots, \mu_K$.
$$
\begin{align*}
& \text{Repeat:} \\
& \hspace{2em} \text{for } i=1 \text{ to } m: \\
& \hspace{4em} c^{(i)} := \text{index (from } 1 \text{ to } K\text{) of cluster centroid closest to } x^{(i)} \\
& \hspace{2em} \text{for } k=1 \text{ to } K: \\
& \hspace{4em} \mu_k := \text{average (mean) of points assigned to cluster } k
\end{align*}
$$
The **first for-loop** is the 'Cluster Assignment' step. We make a vector $c$ where $c^{(i)}$ represents the centroid assigned to example $x^{(i)}$.

We can write the operation of the Cluster Assignment step more mathematically as follows: $$ c^{(i)} = \arg\min_k ||x^{(i)} - \mu_k||^2 $$

That is, each $c^{(i)}$ contains the index of the centroid that has minimal distance to $x^{(i)}$.

By convention, we square the distance on the right-hand side, which makes the function we are trying to minimize increase more sharply; it is mostly just a convention.
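The cluster-assignment step can be sketched in a few lines of NumPy (a minimal illustration; the function name and data are made up for the example):

```python
import numpy as np

def assign_clusters(X, centroids):
    """Return c, where c[i] is the index of the centroid closest to X[i]."""
    # X: (m, n) training examples; centroids: (K, n) cluster centroids.
    # Squared distance from every example to every centroid: shape (m, K).
    dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)  # argmin over k, one index per example

X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
centroids = np.array([[0.0, 0.0], [5.0, 5.0]])
print(assign_clusters(X, centroids))  # -> [0 0 1 1]
```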

The **second for-loop** is the 'Move Centroid' step where we move each centroid to the average of its group.

More formally, the equation for this loop is as follows:

$$ \mu_k = \frac{1}{n} \left[ x^{(k_1)} + x^{(k_2)} + \dots + x^{(k_n)} \right] \in R^n $$ where $x^{(k_1)}, x^{(k_2)}, \dots, x^{(k_n)}$ are the training examples assigned to cluster $\mu_k$, and the $n$ in the fraction is the number of such examples.

If you have a cluster centroid with **0 points** assigned to it, you can randomly **re-initialize** that centroid to a new point. You can also simply **eliminate** that cluster group.
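The move-centroid step, including the empty-cluster handling just described, might look like this (a sketch; names are illustrative):

```python
import numpy as np

def move_centroids(X, c, K, seed=0):
    """Move each centroid to the mean of the points assigned to it."""
    rng = np.random.default_rng(seed)
    centroids = np.empty((K, X.shape[1]))
    for k in range(K):
        members = X[c == k]
        if len(members) == 0:
            # 0 points assigned: re-initialize to a random training example
            centroids[k] = X[rng.integers(len(X))]
        else:
            centroids[k] = members.mean(axis=0)  # average of the cluster's points
    return centroids

X = np.array([[0.0, 0.0], [0.0, 2.0], [4.0, 0.0], [4.0, 2.0]])
print(move_centroids(X, np.array([0, 0, 1, 1]), 2))  # -> [[0. 1.] [4. 1.]]
```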

After a number of iterations the algorithm will **converge**, where new iterations do not affect the clusters.
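Putting the two steps together, here is a minimal sketch of the full loop with a simple convergence check (stop once the centroids stop moving); the names are illustrative:

```python
import numpy as np

def kmeans(X, K, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), K, replace=False)]  # K distinct examples
    for _ in range(max_iters):
        # Cluster assignment: index of the closest centroid for each example
        c = ((X[:, None] - centroids[None]) ** 2).sum(axis=2).argmin(axis=1)
        # Move centroid: mean of assigned points (keep old position if empty)
        new = np.array([X[c == k].mean(axis=0) if (c == k).any() else centroids[k]
                        for k in range(K)])
        if np.allclose(new, centroids):  # converged: iterations change nothing
            break
        centroids = new
    return c, centroids

X = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.8]])
c, mu = kmeans(X, 2)
```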

Note on non-separated clusters: some datasets have no real inner separation or natural structure. K-means can still evenly segment your data into $K$ subsets, so it can still be useful in this case.

Recall some of the parameters we used in our algorithm:

$$
\begin{align*}
c^{(i)} &= \text{index of cluster } (1,2,\dots,K) \text{ to which example } x^{(i)} \text{ is currently assigned} \\
\mu_k &= \text{cluster centroid } k \ (\mu_k \in R^n) \\
\mu_{c^{(i)}} &= \text{cluster centroid of the cluster to which example } x^{(i)} \text{ has been assigned}
\end{align*}
$$

Using these variables we can define our **cost function**:

$$
J(c^{(1)},\dots,c^{(m)}, \mu_1, \dots, \mu_K) = \frac{1}{m} \sum_{i=1}^m ||x^{(i)} - \mu_{c^{(i)}}||^2
$$
Our **optimization objective** is to minimize all our parameters using the above cost function:

$$
\min_{c,\mu} J(c, \mu)
$$
That is, we are finding all the values in sets $c$, representing all our clusters, and $\mu$, representing all our centroids, that will minimize **the average of the distances** of every training example to its corresponding cluster centroid. The above cost function is often called the **distortion** of the training examples.
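The distortion can be computed directly from these definitions (a sketch with illustrative names):

```python
import numpy as np

def distortion(X, c, centroids):
    """Average squared distance from each example to its assigned centroid."""
    return ((X - centroids[c]) ** 2).sum(axis=1).mean()

X = np.array([[0.0, 0.0], [0.0, 2.0], [4.0, 0.0], [4.0, 2.0]])
c = np.array([0, 0, 1, 1])
centroids = np.array([[0.0, 1.0], [4.0, 1.0]])
print(distortion(X, c, centroids))  # each point is 1 away from its centroid -> 1.0
```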

In the **cluster assignment step**, our goal is to:
$$
\text{Minimize } J(\dots) \text{ over } c^{(1)}, \dots, c^{(m)}, \text{ holding } \mu_1, \mu_2, \dots, \mu_K \text{ fixed}
$$
In the **move centroid** step, our goal is to:
$$
\text{Minimize } J(\dots) \text{ over } \mu_1, \mu_2, \dots, \mu_K, \text{ holding } c^{(1)}, \dots, c^{(m)} \text{ fixed}
$$
With k-means, it is **not possible for the cost function to increase** from one iteration to the next. It should always descend (or stay the same).

There's one particular recommended method for randomly initializing your cluster centroids.

- Have $K<m$. That is, make sure the number of your clusters is less than the number of your training examples.
- Randomly pick $K$ training examples. (Also be sure the selected examples are unique).
- Set $\mu_1,\dots,\mu_K$ equal to these $K$ examples.
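These steps can be sketched as follows (illustrative names; `replace=False` is what guarantees the picked examples are unique):

```python
import numpy as np

def init_centroids(X, K, seed=0):
    """Initialize centroids to K distinct, randomly chosen training examples."""
    m = len(X)
    assert K < m, "the number of clusters should be less than m"
    rng = np.random.default_rng(seed)
    idx = rng.choice(m, size=K, replace=False)  # unique indices
    return X[idx].copy()
```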

K-means **can get stuck in local optima**. To decrease the chance of this happening, you can run the algorithm on many different random initializations.

$$
\begin{align*}
& \text{for } i = 1 \text{ to } 100: \\
& \hspace{2em} \text{randomly initialize k-means} \\
& \hspace{2em} \text{run k-means to get } c \text{ and } \mu \\
& \hspace{2em} \text{compute the cost function (distortion) } J(c, \mu) \\
& \text{pick the clustering that gave us the lowest cost}
\end{align*}
$$
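In code, the strategy above might look like this (a sketch; a compact k-means is inlined so the example is self-contained, and all names are illustrative):

```python
import numpy as np

def kmeans_run(X, K, rng, iters=50):
    """One k-means run from a random initialization; returns (c, mu, J)."""
    mu = X[rng.choice(len(X), K, replace=False)]  # init from K distinct examples
    for _ in range(iters):
        c = ((X[:, None] - mu[None]) ** 2).sum(axis=2).argmin(axis=1)
        mu = np.array([X[c == k].mean(axis=0) if (c == k).any() else mu[k]
                       for k in range(K)])
    c = ((X[:, None] - mu[None]) ** 2).sum(axis=2).argmin(axis=1)  # final assignment
    J = ((X - mu[c]) ** 2).sum(axis=1).mean()  # distortion of this run
    return c, mu, J

def kmeans_best(X, K, restarts=100, seed=0):
    """Run k-means `restarts` times; keep the clustering with the lowest J."""
    rng = np.random.default_rng(seed)
    return min((kmeans_run(X, K, rng) for _ in range(restarts)),
               key=lambda run: run[2])
```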

Choosing $K$ can be quite arbitrary and ambiguous.

**The elbow method**: plot the cost $J$ against the number of clusters $K$. The cost should decrease as we increase the number of clusters, and then flatten out. Choose $K$ at the point where the cost curve starts to flatten out.

However, fairly often, the curve is **very gradual**, so there's no clear elbow.

**Note:** $J$ will **always** decrease as $K$ is increased. The one exception is if k-means gets stuck at a bad local optimum.
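An elbow sweep can be sketched by running k-means for each candidate $K$ and recording the distortion (toy data with two real clusters; all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: two well-separated clusters of 30 points each
X = np.vstack([rng.normal(0, 0.3, (30, 2)), rng.normal(5, 0.3, (30, 2))])

def fit_distortion(X, K, iters=30):
    """Run a compact k-means and return the final distortion J."""
    mu = X[rng.choice(len(X), K, replace=False)]
    for _ in range(iters):
        c = ((X[:, None] - mu[None]) ** 2).sum(axis=2).argmin(axis=1)
        mu = np.array([X[c == k].mean(axis=0) if (c == k).any() else mu[k]
                       for k in range(K)])
    c = ((X[:, None] - mu[None]) ** 2).sum(axis=2).argmin(axis=1)
    return ((X - mu[c]) ** 2).sum(axis=1).mean()

costs = [fit_distortion(X, K) for K in range(1, 6)]
# J drops sharply from K=1 to K=2, then flattens: the elbow is at K=2.
```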

Another way to choose $K$ is to observe how well k-means performs on a **downstream purpose**. In other words, you choose $K$ that proves to be most useful for some goal you're trying to achieve from using these clusters.