Clustering Algorithms in Machine Learning (Simplified)

Clustering is an unsupervised learning method in machine learning. Clustering algorithms are used to group data points, and these groups are called clusters. The data points inside a cluster have a high degree of similarity to one another.

Divisive hierarchical clustering is a top-down approach in which a single large cluster is split into multiple clusters until a termination condition is satisfied. The output takes the form of a hierarchy, and the number of clusters does not need to be specified in advance.

Let’s discuss the main clustering algorithms in machine learning.

Hierarchical Clustering

Hierarchical clustering follows either a top-down or a bottom-up approach. HAC, or Hierarchical Agglomerative Clustering, is the bottom-up approach: it forms clusters by merging smaller clusters over successive iterations.

Steps in Hierarchical Clustering:

1. Consider each data point as its own cluster

2. Calculate the distance between every pair of data points and store the results in a proximity matrix (also known as a distance matrix)

3. Combine the two closest clusters

4. Update the proximity matrix to reflect the merge

5. Repeat steps 3 and 4 until a single cluster is formed.

Note that the proximity matrix is symmetric: the distance from point i to point j equals the distance from j to i, so the entries below the diagonal mirror the entries above it.
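The steps above can be sketched in a few lines of NumPy. This is a naive illustration rather than a production implementation: the function name single_linkage_merges and the sample points are made up for this example, single linkage is assumed as the merge rule, and cluster distances are looked up in the original point-to-point matrix on every pass instead of maintaining an updated cluster-level matrix.

```python
import numpy as np

def single_linkage_merges(points):
    """Naive bottom-up clustering; returns the proximity matrix and the merges made."""
    points = np.asarray(points, dtype=float)

    # Proximity matrix: entry [i, j] is the Euclidean distance between points i and j.
    diff = points[:, None, :] - points[None, :, :]
    proximity = np.sqrt((diff ** 2).sum(axis=-1))

    # Start with every point in its own cluster (keyed by the index of its first point).
    clusters = {i: [i] for i in range(len(points))}
    merges = []

    while len(clusters) > 1:
        best = None
        keys = sorted(clusters)
        for pos, a in enumerate(keys):
            for b in keys[pos + 1:]:
                # Single linkage: distance between clusters is the smallest
                # distance between any point in one and any point in the other.
                d = proximity[np.ix_(clusters[a], clusters[b])].min()
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        clusters[a] = clusters[a] + clusters.pop(b)  # merge b into a
        merges.append((a, b, float(d)))

    return proximity, merges

# Hypothetical sample data: two tight pairs that are far from each other.
points = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.5]]
proximity, merges = single_linkage_merges(points)
for a, b, d in merges:
    print(f"merge cluster {b} into cluster {a} at distance {d:.3f}")
```

The two tight pairs are merged first, and the final merge joins them into one cluster containing all four points.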

EM or Expectation Maximisation Clustering

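The Expectation Maximisation algorithm is used for multimodal data, where the observations are assumed to come from a mixture of several probability distributions. EM finds maximum likelihood estimates of the parameters of those distributions. The two basic steps are:

E (Expectation): Fix the model parameters and estimate the missing labels, that is, the probability that each point belongs to each cluster.

M (Maximisation): Choose new parameters that maximise the log likelihood of the data given those estimated labels.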

Soft and Hard Clustering

Soft Clustering: In this type of clustering, a data object can belong to more than one cluster. Its membership in each of the K clusters is described by a probability distribution. EM is an example of soft clustering.

Hard Clustering: In this type of clustering, each data object/observation belongs to exactly one cluster, so the data is partitioned into clusters {C1, C2, C3, C4, ..., Ck}. Examples of hard clustering include K-means and hierarchical clustering.
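The difference between the two can be made concrete with a small NumPy sketch. The two Gaussian components below, with their means, standard deviations and weights, are assumed values chosen only for illustration. The soft assignment keeps a probability for every cluster, while the hard assignment keeps only the most likely cluster.

```python
import numpy as np

def gaussian_pdf(x, mean, std):
    """Density of a normal distribution with the given mean and standard deviation."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

# Three 1-D observations and two assumed clusters (illustrative parameters).
x = np.array([0.0, 2.5, 5.0])
means = np.array([0.0, 5.0])
stds = np.array([1.0, 1.0])
weights = np.array([0.5, 0.5])

# weighted[i, k] = weight of cluster k times the density of point i under cluster k.
weighted = weights * gaussian_pdf(x[:, None], means[None, :], stds[None, :])

# Soft clustering: each row is a probability distribution over the clusters.
soft = weighted / weighted.sum(axis=1, keepdims=True)

# Hard clustering: each point is assigned to exactly one cluster.
hard = soft.argmax(axis=1)

print(soft)
print(hard)
```

The point at 2.5 lies exactly halfway between the two means, so its soft assignment is split evenly, whereas a hard assignment has to pick one side.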

Steps in Expectation Maximisation Clustering:

1. Make initial guesses for the parameters of the probability distribution of each cluster.

2. Feed the observed data into the model and compute the probability that each data point belongs to each cluster under the current parameters. This is the expectation step, or ‘E-Step’.

3. Adjust the parameters of the distributions so that they best fit the observed data, given the probabilities from step 2. This is the maximisation step, or ‘M-Step’.

4. Repeat steps 2 and 3 until the parameters stabilise.
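As a rough sketch of these steps, the NumPy code below runs EM on one-dimensional data with two Gaussian clusters. Several things here are assumptions made for the example: the function name em_gmm_1d, the choice of exactly two components, the initial guesses (smallest and largest observation as the means), the synthetic data, and the rule that "stability" means the means move by less than tol between iterations.

```python
import numpy as np

def em_gmm_1d(x, max_iter=200, tol=1e-6):
    """EM for a two-component 1-D Gaussian mixture; returns weights, means, stds."""
    x = np.asarray(x, dtype=float)

    # Step 1: initial guesses for the parameters.
    means = np.array([x.min(), x.max()])
    stds = np.array([x.std(), x.std()])
    weights = np.array([0.5, 0.5])

    for _ in range(max_iter):
        # Step 2 (E-step): probability that each point belongs to each cluster.
        dens = weights * np.exp(-0.5 * ((x[:, None] - means) / stds) ** 2)
        dens = dens / (stds * np.sqrt(2.0 * np.pi))
        resp = dens / dens.sum(axis=1, keepdims=True)

        # Step 3 (M-step): re-estimate parameters from the weighted data.
        nk = resp.sum(axis=0)
        new_means = (resp * x[:, None]).sum(axis=0) / nk
        new_stds = np.sqrt((resp * (x[:, None] - new_means) ** 2).sum(axis=0) / nk)
        new_stds = np.maximum(new_stds, 1e-6)  # guard against a collapsed cluster
        weights = nk / len(x)

        # Step 4: stop once the means have stabilised.
        converged = np.max(np.abs(new_means - means)) < tol
        means, stds = new_means, new_stds
        if converged:
            break

    return weights, means, stds

# Synthetic bimodal data: two groups centred near 0 and 6.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.0, 1.0, 200), rng.normal(6.0, 1.0, 200)])
weights, means, stds = em_gmm_1d(x)
print(weights, means, stds)
```

On data this well separated, the estimated means should land close to the centres used to generate the two groups.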

Agglomerative Clustering

Agglomerative Clustering, or AGNES (Agglomerative Nesting), is a hierarchical clustering model. The algorithm starts by treating each object as a singleton cluster. Next, pairs of clusters are successively merged until all clusters have been merged into one big cluster containing all objects.

Agglomerative Clustering can be easily applied to projects using Scikit-learn in Python. https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html
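A minimal usage sketch with scikit-learn's AgglomerativeClustering class is shown below. The data array and the choice of two clusters with Ward linkage are assumptions made for illustration; in practice both depend on your data.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Hypothetical sample data: two well-separated groups of three points each.
X = np.array([
    [0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
    [5.0, 5.0], [5.2, 5.1], [4.9, 5.3],
])

# Merge clusters bottom-up and stop once two clusters remain.
model = AgglomerativeClustering(n_clusters=2, linkage="ward")
labels = model.fit_predict(X)

print(labels)  # one cluster label per row of X
```

The label numbers themselves are arbitrary; what matters is which points share a label.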

Metrics in Agglomerative Clustering

1. Euclidean Distance

Euclidean distance is the length of the straight line segment between two points. It is derived from Pythagoras' theorem. So, the distance between two points (x,y) and (a,b) is given by

distance((x,y),(a,b)) = √((x-a)² + (y-b)²)

2. Manhattan Distance

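The Manhattan or city block distance between two points A and B is calculated as

d_AB = ∑_i |P_i - Q_i|

where P_i and Q_i are the i-th coordinates of A and B respectively, so the distance is measured along axes at right angles.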

3. Cosine distance

Cosine distance is derived from cosine similarity, which measures how similar two vectors are by the angle Θ between them. Here data objects are treated as vectors. When Θ = 0°, the similarity is 1, and when Θ = 90°, the similarity is 0. Mathematically,

Cosine distance = 1 - Cosine similarity

and

d_cosine(x1,x2) = 1 - (x1·x2) / (||x1||₂ ||x2||₂)
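The three metrics above translate directly into NumPy. The function names and sample vectors below are chosen only for this sketch; note that cosine distance is undefined when either vector has zero length, and the sketch does not handle that case.

```python
import numpy as np

def euclidean(p, q):
    """Straight-line distance: square root of the summed squared differences."""
    return float(np.sqrt(np.sum((p - q) ** 2)))

def manhattan(p, q):
    """City block distance: sum of absolute coordinate differences."""
    return float(np.sum(np.abs(p - q)))

def cosine_distance(p, q):
    """One minus the cosine of the angle between p and q (both must be non-zero)."""
    similarity = np.dot(p, q) / (np.linalg.norm(p) * np.linalg.norm(q))
    return float(1.0 - similarity)

a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])
u = np.array([1.0, 0.0])
v = np.array([0.0, 1.0])

print(euclidean(a, b))           # 3-4-5 right triangle
print(manhattan(a, b))           # 3 steps along one axis plus 4 along the other
print(cosine_distance(u, v))     # perpendicular vectors
print(cosine_distance(u, 2 * u)) # same direction, different length
```

The last call shows that cosine distance depends only on direction: scaling a vector does not change its distance to another vector.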

Linkage strategies such as single linkage, complete linkage, centroid linkage, average linkage and Ward’s linkage determine how the distance between two clusters is measured when deciding which clusters to merge.
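To see how the choice of linkage changes the result, the sketch below runs SciPy's linkage function on the same data with each strategy and records the height of the final merge. The four one-dimensional points are made up for the example.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Hypothetical 1-D observations at positions 0, 1, 3 and 7.
X = np.array([[0.0], [1.0], [3.0], [7.0]])

final_heights = {}
for method in ("single", "complete", "average", "centroid", "ward"):
    Z = linkage(X, method=method)
    # Column 2 of the last row is the distance at which the last two clusters merge.
    final_heights[method] = float(Z[-1, 2])

for method, height in final_heights.items():
    print(f"{method:>8}: final merge at distance {height:.3f}")
```

Single linkage measures the gap between the nearest members of two clusters, while complete linkage measures the gap between the farthest members, so the two report different heights for the same final merge.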

Dendrogram

A dendrogram is a tree-like structure used in hierarchical clustering to visualise the clustering hierarchy. Agglomerative clustering is visualised through a dendrogram, which is read bottom-up to show how clusters are aggregated.

In building a dendrogram, the initial step is to set the leaf nodes, which are single-point clusters. These clusters are merged until a single cluster containing all the data is formed. Each level of the dendrogram shows the clusters at that level.

Figure: Dendrogram

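A basic way to draw a dendrogram in Python:

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

X = np.random.rand(10, 2)        # placeholder data; replace with your own observations
Z = linkage(X, method='ward')    # 'ward' is a linkage method, not a distance metric
dendrogram(Z)                    # draw the merge hierarchy
plt.show()

Here Z is the linkage matrix. Each row records one merge: the two clusters that were joined, the distance between them, and the number of points in the new cluster.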

Process of forming dendrogram:

1. Each leaf of the dendrogram is an individual cluster, and the root is a single cluster containing all the data

2. A cluster at the i-th level is the union of its children at level i+1.

3. A clustering of the data objects is obtained by cutting the dendrogram at the required level

4. As shown above, SciPy's scipy.cluster.hierarchy module in Python can be used to create a dendrogram for hierarchical clustering.
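Cutting the dendrogram at a required level can be sketched with SciPy's fcluster function, which turns a linkage matrix into flat cluster labels. The sample data, the target of two clusters, and the distance threshold of 1.0 are all assumptions chosen for this example.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Hypothetical sample data: two well-separated groups of three points each.
X = np.array([
    [0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
    [5.0, 5.0], [5.2, 5.1], [4.9, 5.3],
])

Z = linkage(X, method="ward")

# Cut so that at most two clusters remain.
labels_by_count = fcluster(Z, t=2, criterion="maxclust")

# Cut at a fixed height: merges above distance 1.0 are undone.
labels_by_height = fcluster(Z, t=1.0, criterion="distance")

print(labels_by_count)
print(labels_by_height)
```

Both cuts fall between the small within-group merges and the large final merge, so they produce the same two groups here. On less clearly separated data the two criteria can disagree.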
