bstruct – This article contains the description of clustering as one of the main methods of data analysis. it highlights the goals and problems of clustering and gives a brief explanation of major methods and algorithms of division. Furthermore, the relevance of this topic will be revealed and examples of the use of clustering will be given.
Defining the purpose and the definition of clustering, we will start with an example. Let insurance company has a number of drivers with cars and they have several characteristics (for example for driver – age, driving experience etc. and for car – price, mileage etc.). The insurance company needs to classify them into risk groups, which contains people with similar parameters. In general the task is – let us have a set of data points, we need to split data into several groups, which contains objects with the similar features. Thus, Clustering is the grouping of a particular set of objects based on their characteristics, aggregating them according to their similarities. It is the main method of data classifying in data science.Since the data is divided into clusters, they can be used in further analysis.
Types of Clustering.
Firstly, lets us highlight the main types of clustering:
Within each of the types there exists a wealth of subtypes and different algorithms for finding the clusters. We will cover only two first ones. They are considered to be the main methods in clustering. All types have a common goal – division into clusters, but they have different approaches to this task.
According to this book1, Partitioning Clustering are flexible methods based on iterative relocation of data points between clusters.
Partitioning Clustering attempts to split dataset on disjoint sets. Clustering criterion is the measurement of the quality of the solutions. The main goal of each iteration is reducing the value of the criterion function until convergence. Typically the criterion involve minimizing some measure of dissimilarity in the inside each cluster, while maximizing the dissimilarity between different clusters.We can built reliable clustering methods by changing the criterion, that would make clustering independent to erroneous and missing data values.
There are different types of partitioning clustering methods, however the most well-known data analysis algorithm is k-Means. It is one of Partitioning Clustering algorithms, which “outputs a partition of the entity set into clusters and centroids representing them”2. K-means divided the data into K disjoint groups. Here is algorithm of it:
Selecting a cluster and randomly initialize their respective center points.
Each point is classified by calculating the distance between that its nearest center, and then classifying it accordingly.
After each point is marked, we recalculate the group center by taking the mean of all the points in the cluster.
Repeat these steps until the group centers don’t change much between iterations.
Usually in k-Means the criterion function is:
where is a data point, is is the index of the centroid that is closest to pending data point, is the position, that are corrected iteratively assigning he nearest clusters and then recalculating the centroids.
There are a lot of modifications of this algorithm, such as randomize initialization of the group centers for a few times or different function for distance. Moreover, it has a linear complexity. However, there are some disadvantages. For example, we need to know the amount of clusters in the beginning or the algorithm can make different clustering on different runs because of random choice. Additionally, there is no single criterion that measures how data connects to the real world.
Another algorithm, which is close to k-Means is k-Medians. It is based on the median vector of the group.(instead of recalculating the group center each iteration).
K-means is used in different applications, for example in computer vision (highlighted the background).
Another type of clustering is Hierarchical one. It proceeds successively by merging smaller clusters into larger ones, or by splitting larger clusters.
According to the book3, there are several steps in method:
it starts with all data marked as they are independent cluster.
Then two closest clusters are merged together into the same cluster.
The second step is iterating until only one cluster left.
The result of the algorithm is a tree of clusters called a dendrogram. By cutting the dendrogram at a desired level a data items will separate into several groups.