Sunday, November 22, 2009

Clustering

Clustering is the process of organizing a collection of objects into groups whose members are similar to one another in some way.


There are two main approaches to building clusters: partitional clustering, whose best-known representative is the K-means algorithm, and hierarchical clustering, which includes the Single-Link, Complete-Link and Average-Link methods. We discuss each of these algorithms in the following:
(1) K-means Algorithm
The k-means algorithm is the best known partitional clustering algorithm. It is perhaps also the most widely used among all clustering algorithms due to its simplicity and efficiency. Given a set of data points and the required number of k clusters (k is specified by the user), this algorithm iteratively partitions the data into k clusters based on a distance function.
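The iteration described above alternates two steps: assign each point to its nearest centroid, then move each centroid to the mean of its assigned points. A minimal Python sketch of this loop (the function and variable names are our own, and this is an illustration rather than an optimized implementation):

```python
import random

def dist2(a, b):
    """Squared Euclidean distance between two points (tuples)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=20, seed=0):
    """Minimal k-means sketch: returns (clusters, centroids)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # pick k initial centroids from the data
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: dist2(p, centroids[c]))
            clusters[i].append(p)
        # Update step: move each centroid to the mean of its cluster.
        for i, cl in enumerate(clusters):
            if cl:
                centroids[i] = tuple(sum(x) / len(cl) for x in zip(*cl))
    return clusters, centroids

# Two well-separated groups of points; k-means should recover them.
pts = [(1.0, 1.0), (1.2, 0.8), (8.0, 8.0), (7.8, 8.2)]
clusters, centroids = kmeans(pts, k=2)
```

In practice k-means is run several times with different random initializations, since the result depends on where the initial centroids fall.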
(2) Hierarchical Clustering
Hierarchical clustering is another major clustering approach. It has a number of desirable properties which make it popular. It clusters by producing a nested sequence of clusters like a tree (also called a dendrogram). Singleton clusters (individual data points) are at the bottom of the tree and one root cluster is at the top, which covers all data points. Each internal cluster node contains child cluster nodes. Sibling clusters partition the data points covered by their common parent.
It can be divided into two kinds:
* Agglomerative (bottom up) clustering: It builds the dendrogram (tree) from the bottom level, and merges the most similar (or nearest) pair of clusters at each level to go one level up. The process continues until all the data points are merged into a single cluster (i.e., the root cluster).
* Divisive (top down) clustering: It starts with all data points in one cluster, the root. It then splits the root into a set of child clusters. Each child cluster is recursively divided further until only singleton clusters of individual data points remain, i.e., each cluster with only a single point. Agglomerative methods are much more popular than divisive methods. We will focus on agglomerative hierarchical clustering.
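The agglomerative (bottom-up) process can be sketched as a simple loop: start with singleton clusters and repeatedly merge the closest pair until one root cluster remains. A minimal sketch in Python (all names are illustrative; `single_link` is used here only as one example of an inter-cluster distance):

```python
import math

def single_link(a, b):
    """Example inter-cluster distance: closest pair across the two clusters."""
    return min(math.dist(p, q) for p in a for q in b)

def agglomerative(points, dist_between):
    """Bottom-up clustering: merge the nearest pair of clusters until one
    root cluster remains. Returns the sequence of merges (the dendrogram
    read from the bottom up)."""
    clusters = [[p] for p in points]  # start from singleton clusters
    merges = []
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest inter-cluster distance.
        i, j = min(
            ((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
            key=lambda ij: dist_between(clusters[ij[0]], clusters[ij[1]]),
        )
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)]
        clusters.append(merged)
    return merges

# Three 1-D points: the two nearby points should be merged first.
merges = agglomerative([(0.0,), (1.0,), (10.0,)], single_link)
```

Each entry of `merges` records one level of the dendrogram; real implementations also store the merge distances so the tree can be cut at a chosen height.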
There are many types of hierarchical clustering methods:
* Single-Link Method In single-link (or single linkage) hierarchical clustering, the distance between two clusters is the distance between the two closest data points in the two clusters (one data point from each cluster). In other words, the single-link clustering merges the two clusters in each step whose two nearest data points (or members) have the smallest distance, i.e., the two clusters with the smallest minimum pair-wise distance. The single-link method is suitable for finding non-elliptical shape clusters. However, it can be sensitive to noise in the data, which may cause the chain effect and produce straggly clusters.
* Complete-Link Method In complete-link (or complete linkage) clustering, the distance between two clusters is the maximum of all pair-wise distances between the data points in the two clusters. In other words, the complete-link clustering merges the two clusters in each step whose two furthest data points have the smallest distance, i.e., the two clusters with the smallest maximum pair-wise distance.
* Average-Link Method This is a compromise between the sensitivity of complete-link clustering to outliers and the tendency of single-link clustering to form long chains that do not correspond to the intuitive notion of clusters as compact, spherical objects. In this method, the distance between two clusters is the average distance of all pair-wise distances between the data points in two clusters.
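To make the three linkage definitions concrete, the following sketch computes each inter-cluster distance for two small one-dimensional clusters (the data points are illustrative, chosen by us):

```python
import math

# Two example clusters of 1-D points.
a = [(0.0,), (1.0,)]
b = [(3.0,), (10.0,)]

# All pair-wise distances across the clusters: 3, 10, 2, 9.
pairwise = [math.dist(p, q) for p in a for q in b]

single   = min(pairwise)                 # single-link: closest pair  -> 2.0
complete = max(pairwise)                 # complete-link: furthest pair -> 10.0
average  = sum(pairwise) / len(pairwise) # average-link: mean of all pairs -> 6.0
```

The same pair of clusters thus gets three different distances, which is why the three methods can merge clusters in different orders on the same data.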


