DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a clustering algorithm used mainly for discovering clusters of arbitrary shape. Its average runtime is O(n log n) when neighborhood queries are served by a spatial index, and O(n^2) in the worst case. Formally, it is based on the notion of density-reachability between k-dimensional points in a region. DBSCAN is frequently used for clustering noisy data.
The two important parameters associated with DBSCAN are:
- Epsilon (Eps) and
- Minimum Points (MinPts).
As the first step, an unvisited point is chosen as the starting point for the clustering process. Epsilon (Eps) is the radius of the neighborhood around a point: two points of the dataset are neighbors if the distance between them is at most Eps. MinPts is the minimum number of points that must lie within that radius for the region to count as dense.
These steps are repeated for each neighbor in turn until no further neighbors can be added to the cluster.
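The Eps-neighborhood lookup can be sketched in a few lines. This is a minimal illustration, assuming Euclidean distance and a brute-force scan; the helper name region_query and the sample points are hypothetical and chosen only for this example.

```python
import math

def region_query(points, i, eps):
    """Return the indices of all points within eps of points[i].

    The point itself is included, since its distance to itself is 0.
    """
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

# Three points close together and one far away (illustrative data).
points = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.4), (5.0, 5.0)]

near_origin = region_query(points, 0, eps=1.0)   # neighbors of the first point
far_point = region_query(points, 3, eps=1.0)     # the distant point is alone
print(near_origin, far_point)
```

A brute-force scan like this costs O(n) per query; the O(n log n) average runtime depends on replacing it with a spatial index.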
Condition for forming a cluster
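If the number of data points within distance Eps of the chosen point (its Eps-neighborhood) is greater than or equal to MinPts, a cluster is formed around that point. Points are marked as visited once they are used in cluster formation.
Note: if a data point has fewer than MinPts points in its Eps-neighborhood, it is marked as noise. It can still join a cluster later as a border point if it lies within Eps of a core point.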
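The condition above can be illustrated with a short sketch that labels each point as core, border, or noise. It assumes Euclidean distance and counts a point as a member of its own neighborhood; the function name classify_points and the sample data are hypothetical.

```python
import math

def classify_points(points, eps, min_pts):
    """Label each point as 'core', 'border', or 'noise'."""
    # Eps-neighborhood of every point (each point counts itself).
    neighborhoods = [
        [j for j, q in enumerate(points) if math.dist(p, q) <= eps]
        for p in points
    ]
    # A core point has at least min_pts points in its neighborhood.
    core = {i for i, nb in enumerate(neighborhoods) if len(nb) >= min_pts}

    roles = []
    for i, nb in enumerate(neighborhoods):
        if i in core:
            roles.append("core")
        elif any(j in core for j in nb):
            roles.append("border")   # not dense itself, but near a core point
        else:
            roles.append("noise")
    return roles

points = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 1), (8, 8)]
roles = classify_points(points, eps=1.0, min_pts=3)
print(roles)
```

Here the four points of the unit square are core points, (2, 1) is a border point because it lies within Eps of the core point (1, 1), and (8, 8) is noise.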
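Each cluster grows from a point p whose Eps-neighborhood contains at least MinPts points (also called a core point). Eps is the radius of the neighborhood around p, not a distance from the centroid of the cluster; DBSCAN does not compute centroids.
The inputs to the DBSCAN algorithm are the dataset, Eps, and MinPts. Core points are not supplied by the user: the algorithm identifies them by checking, for each point p, whether its Eps-neighborhood N contains at least MinPts points.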
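Here is how the DBSCAN method proceeds, step by step:
1. Starting from a core point, all density-reachable points are found and merged into its cluster.
2. Clusters are developed by repeating this expansion from each unvisited core point.
3. The process terminates when every point has been visited and no new point can be added to any cluster.
4. Outliers that belong to no cluster remain labeled as noise and do not affect the clusters.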
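The steps above can be sketched as a small from-scratch implementation. This is an illustrative sketch, not a reference implementation: it assumes Euclidean distance, uses a brute-force neighborhood scan (so it runs in O(n^2)), and the names dbscan, NOISE, and the sample data are chosen for this example only.

```python
import math
from collections import deque

NOISE = -1  # label used here for points that belong to no cluster

def dbscan(points, eps, min_pts):
    """Return one cluster label per point (0, 1, ...) or NOISE."""
    labels = [None] * len(points)  # None means "not visited yet"

    def region_query(i):
        # Brute-force Eps-neighborhood, including the point itself.
        return [j for j, q in enumerate(points)
                if math.dist(points[i], q) <= eps]

    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(i)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may be relabeled as a border point later
            continue
        # i is a core point: start a new cluster and expand it.
        cluster_id += 1
        labels[i] = cluster_id
        queue = deque(neighbors)
        while queue:
            j = queue.popleft()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # noise reached from a core point
            if labels[j] is not None:
                continue                # already assigned to a cluster
            labels[j] = cluster_id
            j_neighbors = region_query(j)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is core: keep expanding
    return labels

points = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 1),
          (8, 8), (8, 9), (9, 8), (20, 20)]
labels = dbscan(points, eps=1.0, min_pts=3)
print(labels)
```

With this data the first five points form one cluster, the next three form a second, and (20, 20) stays labeled as noise.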
The clusters formed are high-density regions, separated from one another by low-density areas whose points are treated as noise.
Important features of DBSCAN:
1. DBSCAN handles noisy data well
2. It is fast, since each point is visited only once during clustering (although each visit requires a neighborhood query)
3. Able to discover clusters of arbitrary shape
4. Density parameters (Eps and MinPts) are required and act as the termination condition
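Limitations of DBSCAN
1. Results are very sensitive to the parameter values; a small change in Eps can merge clusters or turn many points into noise
2. The user needs to supply parameters such as MinPts and the Eps threshold
3. A high degree of dataset analysis is required beforehand to choose suitable parameter values
4. Not suitable for high-dimensional data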
When using Python for machine learning, the scikit-learn library provides DBSCAN as the class sklearn.cluster.DBSCAN, which covers the clustering functionality described here. DBSCAN implementations are also available for other programming languages such as C++.
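A minimal usage sketch with scikit-learn follows. In that library Eps is passed as eps and MinPts as min_samples, and noise points receive the label -1. The data and parameter values are illustrative assumptions, not recommendations.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two small groups of points and one distant outlier (illustrative data).
X = np.array([[1.0, 2.0], [2.0, 2.0], [2.0, 3.0],
              [8.0, 7.0], [8.0, 8.0], [25.0, 80.0]])

# eps plays the role of Eps, min_samples the role of MinPts.
model = DBSCAN(eps=3.0, min_samples=2).fit(X)

labels = model.labels_                      # one label per row; -1 means noise
core_indices = model.core_sample_indices_   # row indices of the core points
print(labels, core_indices)
```

The number of clusters is not given in advance; it can be read off afterwards as the number of distinct labels other than -1.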