Sunday, November 22, 2009

K-Means Clustering Algorithm

Clustering

The process of organizing a collection of objects into groups whose members are similar in some way.


There are four algorithms commonly used to build clusters, falling under two approaches: K-means clustering (the K-means algorithm) and hierarchical clustering (the single-link, complete-link, and average-link methods). We discuss each of these algorithms in the following:
(1) K-means Algorithm
The k-means algorithm is the best known partitional clustering algorithm. It is perhaps also the most widely used among all clustering algorithms due to its simplicity and efficiency. Given a set of data points and the required number of k clusters (k is specified by the user), this algorithm iteratively partitions the data into k clusters based on a distance function.
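A minimal sketch of this iterative partitioning, assuming tuple-valued numeric points, squared Euclidean distance, and a fixed number of iterations (the names are illustrative, not from the source):

```python
import random

def kmeans(points, k, iterations=100):
    """Minimal k-means sketch: assign points to the nearest
    centroid, then recompute centroids, and repeat."""
    centroids = random.sample(points, k)  # initial centroids: k random points
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster.
        for i, cluster in enumerate(clusters):
            if cluster:  # keep the old centroid if the cluster went empty
                centroids[i] = tuple(sum(dim) / len(cluster) for dim in zip(*cluster))
    return centroids, clusters

points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5), (1.2, 0.8)]
centroids, clusters = kmeans(points, k=2)
print(centroids)
```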
(2) Hierarchical Clustering
Hierarchical clustering is another major clustering approach. It has a number of desirable properties which make it popular. It clusters by producing a nested sequence of clusters like a tree (also called a dendrogram). Singleton clusters (individual data points) are at the bottom of the tree and one root cluster is at the top, which covers all data points. Each internal cluster node contains child cluster nodes. Sibling clusters partition the data points covered by their common parent.
Hierarchical clustering can be divided into two kinds:
* Agglomerative (bottom up) clustering: It builds the dendrogram (tree) from the bottom level, and merges the most similar (or nearest) pair of clusters at each level to go one level up. The process continues until all the data points are merged into a single cluster (i.e., the root cluster).
* Divisive (top down) clustering: It starts with all data points in one cluster, the root. It then splits the root into a set of child clusters. Each child cluster is recursively divided further until only singleton clusters of individual data points remain, i.e., each cluster with only a single point. Agglomerative methods are much more popular than divisive methods. We will focus on agglomerative hierarchical clustering.
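A rough sketch of the bottom-up merge loop; here cluster_distance stands for a linkage function, such as those sketched after the linkage methods below (all names are illustrative, not from the source):

```python
def agglomerative(points, cluster_distance):
    """Bottom-up clustering sketch: start from singleton clusters and
    repeatedly merge the closest pair until one root cluster remains.
    Returns the merge history, a flat record of the dendrogram."""
    clusters = [[p] for p in points]   # every point starts as its own cluster
    merges = []
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest linkage distance.
        pairs = [(i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda ij: cluster_distance(clusters[ij[0]], clusters[ij[1]]))
        merges.append((clusters[i], clusters[j]))
        clusters[i] = clusters[i] + clusters[j]   # merge cluster j into cluster i
        del clusters[j]
    return merges

# Quick demo with a stand-in linkage (nearest pair of points); the proper
# linkage functions are sketched below.
closest_pair = lambda c1, c2: min(
    sum((a - b) ** 2 for a, b in zip(p, q)) for p in c1 for q in c2
)
print(agglomerative([(0.0,), (1.0,), (5.0,)], closest_pair))
```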
There are many types of hierarchical clustering methods:
* Single-Link Method In single-link (or single linkage) hierarchical clustering, the distance between two clusters is the distance between the two closest data points in the two clusters (one data point from each cluster). In other words, the single-link clustering merges the two clusters in each step whose two nearest data points (or members) have the smallest distance, i.e., the two clusters with the smallest minimum pair-wise distance. The single-link method is suitable for finding non-elliptical shape clusters. However, it can be sensitive to noise in the data, which may cause the chain effect and produce straggly clusters.
* Complete-Link Method In complete-link (or complete linkage) clustering, the distance between two clusters is the maximum of all pair-wise distances between the data points in the two clusters. In other words, the complete-link clustering merges the two clusters in each step whose two furthest data points have the smallest distance, i.e., the two clusters with the smallest maximum pair-wise distance.
* Average-Link Method This is a compromise between the sensitivity of complete-link clustering to outliers and the tendency of single-link clustering to form long chains that do not correspond to the intuitive notion of clusters as compact, spherical objects. In this method, the distance between two clusters is the average distance of all pair-wise distances between the data points in the two clusters.
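A minimal sketch of the three linkage functions, assuming Euclidean distance between numeric points; any of them can serve as the cluster_distance argument in the agglomerative sketch above:

```python
import math

def point_distance(p, q):
    """Euclidean distance between two data points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def single_link(c1, c2):
    """Smallest pair-wise distance between members of the two clusters."""
    return min(point_distance(p, q) for p in c1 for q in c2)

def complete_link(c1, c2):
    """Largest pair-wise distance between members of the two clusters."""
    return max(point_distance(p, q) for p in c1 for q in c2)

def average_link(c1, c2):
    """Average of all pair-wise distances between the two clusters."""
    return sum(point_distance(p, q) for p in c1 for q in c2) / (len(c1) * len(c2))

a = [(0.0, 0.0), (1.0, 0.0)]
b = [(3.0, 0.0), (5.0, 0.0)]
print(single_link(a, b), complete_link(a, b), average_link(a, b))  # 2.0 5.0 3.5
```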



Decision Tree Algorithm




Classification Methods


(1) Decision Tree
There are two main steps in decision tree classification:
* Decision tree induction is a technique for partitioning a training set into a set of rules. A decision tree consists of nodes and edges. Each node is labeled with an attribute name and each edge is labeled with a possible value for that attribute. The starting node is called the root node. Depending upon the results of a test, the training samples are partitioned into two or more subsets. The end result is a set of rules covering all the possibilities, with the leaves labeled with the different classes.
To construct a decision tree, we could start with the first attribute, find a certain threshold, go on to the next attribute, find a certain threshold, and repeat this process until we have correctly classified each training sample, thus creating a decision tree for our database. Tree induction algorithms scale up very well for large data sets. They give a true insight into the nature of the decision process.
* Tree pruning When a decision tree is built, many of the branches will reflect anomalies in the training data due to noise or outliers. Tree pruning methods address this problem of overfitting the data. Such methods typically use statistical measures to remove the least reliable branches, generally resulting in faster classification and an improvement in the ability of the tree to correctly classify independent test data.
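A compact sketch of both steps, assuming numeric attributes, a Gini-impurity split criterion, and reduced-error pruning against held-out validation data; the class and function names are illustrative, not from the source:

```python
from collections import Counter

class Node:
    """An internal node tests attr <= threshold; a leaf holds a class label."""
    def __init__(self, attr=None, threshold=None, left=None, right=None, label=None):
        self.attr, self.threshold = attr, threshold
        self.left, self.right, self.label = left, right, label

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def majority(labels):
    return Counter(labels).most_common(1)[0][0]

def build(rows, labels, depth=0, max_depth=3):
    """Induction: greedily pick the attribute/threshold split that
    minimizes weighted Gini impurity, then recurse on each subset."""
    if len(set(labels)) == 1 or depth == max_depth:
        return Node(label=majority(labels))
    best = None
    for attr in range(len(rows[0])):
        for threshold in sorted({r[attr] for r in rows}):
            left = [i for i, r in enumerate(rows) if r[attr] <= threshold]
            right = [i for i, r in enumerate(rows) if r[attr] > threshold]
            if not left or not right:
                continue
            score = (len(left) * gini([labels[i] for i in left])
                     + len(right) * gini([labels[i] for i in right])) / len(rows)
            if best is None or score < best[0]:
                best = (score, attr, threshold, left, right)
    if best is None:
        return Node(label=majority(labels))
    _, attr, threshold, left, right = best
    return Node(attr, threshold,
                build([rows[i] for i in left], [labels[i] for i in left], depth + 1, max_depth),
                build([rows[i] for i in right], [labels[i] for i in right], depth + 1, max_depth))

def predict(node, row):
    while node.label is None:
        node = node.left if row[node.attr] <= node.threshold else node.right
    return node.label

def prune(node, rows, labels):
    """Pruning: collapse a subtree into a leaf whenever the leaf classifies
    the validation data reaching it at least as well as the subtree does."""
    if node.label is not None or not rows:
        return node
    left = [i for i, r in enumerate(rows) if r[node.attr] <= node.threshold]
    right = [i for i, r in enumerate(rows) if r[node.attr] > node.threshold]
    node.left = prune(node.left, [rows[i] for i in left], [labels[i] for i in left])
    node.right = prune(node.right, [rows[i] for i in right], [labels[i] for i in right])
    leaf = Node(label=majority(labels))
    if (sum(leaf.label == y for y in labels)
            >= sum(predict(node, r) == y for r, y in zip(rows, labels))):
        return leaf
    return node

rows = [(25, 30000), (40, 60000), (35, 20000), (50, 90000)]
labels = ["no", "yes", "no", "yes"]
tree = build(rows, labels)
print(predict(tree, (45, 80000)))  # "yes" on this toy data
```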
(2) Classification Using Association Rules
Classification rule mining and association rule mining are two important data mining techniques. Classification rule mining aims to discover a small set of rules in the database to form an accurate classifier. Association rule mining finds all rules in the database that satisfy some minimum support and minimum confidence constraints. For association rule mining, the target of mining is not predetermined, while for classification rule mining there is one and only one predetermined target, i.e., the class. Both classification rule mining and association rule mining are indispensable to practical applications. Thus, great savings and conveniences to the user could result if the two mining techniques can somehow be integrated. Such an integrated framework, called associative classification, has been proposed, and it has been shown that the integration can be done efficiently and without loss of performance, i.e., without loss of accuracy in the resultant classifier. [6]
The integration is done by focusing on a special subset of association rules whose right-hand side is restricted to the classification class attribute. This subset of rules is referred to as the class association rules (CARs). An existing association rule mining algorithm is adapted to mine all the CARs that satisfy the minimum support and minimum confidence constraints. This adaptation is necessary for two main reasons:
1. Unlike the transactional databases normally used in association rule mining, which do not have many associations, classification data tends to contain a huge number of associations. Adapting the existing association rule mining algorithm to mine only the CARs is needed so as to reduce the number of rules generated, thus avoiding combinatorial explosion.
2. Classification datasets often contain many continuous (or numeric) attributes. Mining of association rules with continuous attributes is still a major research issue. The adaptation involves discretizing continuous attributes based on the predetermined class target. There are many good discretization algorithms for this purpose. [7]
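A toy sketch of mining CARs by simple counting, assuming a small categorical dataset, single attribute-value antecedents, and illustrative support/confidence thresholds; real CAR miners handle larger antecedent itemsets:

```python
from collections import Counter

def mine_cars(rows, classes, min_support=0.3, min_confidence=0.7):
    """Find rules (attr, value) -> class whose support and confidence
    meet the thresholds. Antecedents are single attribute-value pairs
    here purely for illustration."""
    n = len(rows)
    item_counts = Counter()   # how often (attr, value) occurs
    rule_counts = Counter()   # how often (attr, value) occurs with a class
    for row, cls in zip(rows, classes):
        for attr, value in enumerate(row):
            item_counts[(attr, value)] += 1
            rule_counts[(attr, value, cls)] += 1
    cars = []
    for (attr, value, cls), count in rule_counts.items():
        support = count / n
        confidence = count / item_counts[(attr, value)]
        if support >= min_support and confidence >= min_confidence:
            cars.append(((attr, value), cls, support, confidence))
    return cars

rows = [("rainy", "cold"), ("rainy", "warm"), ("sunny", "warm"), ("sunny", "warm")]
classes = ["stay-in", "stay-in", "go-out", "go-out"]
for rule in mine_cars(rows, classes):
    print(rule)
```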
(3) Bayesian Classification
Bayesian classifiers are statistical classifiers. They can predict class membership probabilities, such as the probability that a given sample belongs to a particular class.
Bayesian classification is based on Bayes' theorem, described below. Studies comparing classification algorithms have found a simple Bayesian classifier known as the naïve Bayesian classifier to be comparable in performance with decision tree and neural network classifiers. Bayesian classifiers have also exhibited high accuracy and speed when applied to large databases.
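For reference, Bayes' theorem states that for a class C and a data sample X:

$$P(C \mid X) = \frac{P(X \mid C)\,P(C)}{P(X)}$$

where P(C|X) is the posterior probability of C given X, P(X|C) is the likelihood of X given C, P(C) is the prior probability of the class, and P(X) is the evidence, which acts as a normalizing constant.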
Naive Bayesian classifiers assume that the effect of an attribute value on a given class is independent of the values of the other attributes. This assumption is called class conditional independence. It is made to simplify the computations involved, and in this sense, is considered "naive". Bayesian belief networks are graphical models which, unlike naive Bayesian classifiers, allow the representation of dependencies among subsets of attributes. Bayesian belief networks can also be used for classification.
There are two main types of Bayesian Classification methods:
* The Basic Naive Bayes Classifier The most straightforward Bayesian learning method is the Naive Bayesian classifier. This method uses a set of discriminant functions for estimating the probability of a given instance to belong to a certain class. More specifically it uses Bayes rule to compute the probability of each possible value of the target attribute given the instance, assuming the input attributes are conditionally independent given the target attribute.
Because this method is based on the simplistic, and rather unrealistic, assumption that the causes are conditionally independent given the effect, it is widely known as Naive Bayes. A minimal sketch of this classifier is given after this list.
* Bayesian belief networks The naive Bayesian classifier makes the assumption of class conditional independence, i.e., that given the class label of a sample, the values of the attributes are conditionally independent of one another. This assumption simplifies computation. When the assumption holds true, then the naive Bayesian classifier is the most accurate in comparison with all other classifiers. In practice, however, dependencies can exist between variables. Bayesian belief networks specify joint conditional probability distributions. They allow class conditional independencies to be defined between subsets of variables. They provide a graphical model of causal relationships, on which learning can be performed. These networks are also known as belief networks, Bayesian networks, and probabilistic networks. For brevity, we will refer to them as belief networks.

A belief network is defined by two components. The first is a directed acyclic graph, where each node represents a random variable and each arc represents a probabilistic dependence. The second is a set of conditional probability tables, one for each variable, quantifying how each variable depends on its parents in the graph [1].
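As promised in the list above, a minimal sketch of the basic Naive Bayesian classifier for categorical attributes, using Bayes' rule with add-alpha (Laplace) smoothing to avoid zero probabilities; the names and the smoothing choice are illustrative assumptions:

```python
from collections import Counter, defaultdict

def train_nb(rows, classes):
    """Count class priors and per-class attribute-value frequencies."""
    prior = Counter(classes)
    cond = defaultdict(Counter)   # (class, attr) -> Counter of attribute values
    for row, cls in zip(rows, classes):
        for attr, value in enumerate(row):
            cond[(cls, attr)][value] += 1
    return prior, cond

def classify_nb(row, prior, cond, alpha=1.0):
    """Pick the class maximizing P(C) * product_i P(x_i | C),
    assuming the attributes are conditionally independent given C."""
    total = sum(prior.values())
    best_cls, best_score = None, float("-inf")
    for cls, count in prior.items():
        score = count / total
        for attr, value in enumerate(row):
            counter = cond[(cls, attr)]
            # Smoothed conditional probability; the vocabulary-size estimate
            # (observed values + 1) is a rough illustrative choice.
            score *= (counter[value] + alpha) / (sum(counter.values()) + alpha * (len(counter) + 1))
        if score > best_score:
            best_cls, best_score = cls, score
    return best_cls

rows = [("rainy", "cold"), ("rainy", "warm"), ("sunny", "warm"), ("sunny", "hot")]
classes = ["stay-in", "stay-in", "go-out", "go-out"]
prior, cond = train_nb(rows, classes)
print(classify_nb(("sunny", "warm"), prior, cond))  # "go-out" on this toy data
```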
(4) Backpropagation
Backpropagation is a neural network learning algorithm. The field of neural networks was originally kindled by psychologists and neurobiologists who sought to develop and test computational analogues of neurons. Roughly speaking, a neural network is a set of connected input/output units where each connection has a weight associated with it. During the learning phase, the network learns by adjusting the weights so as to be able to predict the correct class label of the input samples. Neural network learning is also referred to as connectionist learning due to the connections between units.
Neural networks involve long training times, and are therefore more suitable for applications where this is feasible. They require a number of parameters which are typically best determined empirically, such as the network topology or "structure". Neural networks have been criticized for their poor interpretability, since it is difficult for humans to interpret the symbolic meaning behind the learned weights. These features initially made neural networks less desirable for data mining.
Advantages of neural networks, however, include their high tolerance to noisy data as well as their ability to classify patterns on which they have not been trained. In addition, several algorithms have recently been developed for the extraction of rules from trained neural networks. These factors contribute towards the usefulness of neural networks for classification in data mining.
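A tiny numpy sketch of this weight-adjustment process for a one-hidden-layer network learning XOR (sigmoid units, squared-error gradient, and an illustrative learning rate; all of these choices are assumptions, not from the source):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)

# Toy dataset: XOR, a classic pattern that is not linearly separable.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

# Randomly initialized connection weights and biases.
W1, b1 = rng.normal(size=(2, 4)), np.zeros((1, 4))   # input -> hidden
W2, b2 = rng.normal(size=(4, 1)), np.zeros((1, 1))   # hidden -> output
lr = 0.5  # learning rate (illustrative)

for _ in range(10000):
    # Forward pass: propagate the inputs through the network.
    hidden = sigmoid(X @ W1 + b1)
    output = sigmoid(hidden @ W2 + b2)
    # Backward pass: propagate the error back and adjust the weights.
    output_delta = (output - y) * output * (1 - output)
    hidden_delta = (output_delta @ W2.T) * hidden * (1 - hidden)
    W2 -= lr * hidden.T @ output_delta
    b2 -= lr * output_delta.sum(axis=0, keepdims=True)
    W1 -= lr * X.T @ hidden_delta
    b1 -= lr * hidden_delta.sum(axis=0, keepdims=True)

print(output.round(2))  # should approach [[0], [1], [1], [0]]
```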

(5) Other classification methods
* K-Nearest Neighbor Method Nearest-neighbor classifiers are based on learning by analogy. The training samples are described by n-dimensional numeric attributes. Each sample represents a point in an n-dimensional space. In this way, all of the training samples are stored in an n-dimensional pattern space. When given an unknown sample, a k-nearest neighbor classifier searches the pattern space for the k training samples that are closest to the unknown sample and assigns it the most common class among them (a sketch is given after this list).
* Case-based reasoning (CBR) classifiers are instance-based. Unlike nearest neighbor classifiers, which store training samples as points in Euclidean space, the samples or "cases" stored by CBR are complex symbolic descriptions. CBR has also been applied to areas such as engineering and law, where cases are either technical designs or legal rulings, respectively.
When given a new case to classify, a case-based reasoner will first check if an identical training case exists. If one is found, then the accompanying solution to that case is returned. If no identical case is found, then the case-based reasoner will search for training cases having components that are similar to those of the new case. Conceptually, these training cases may be considered as neighbors of the new case. If cases are represented as graphs, this involves searching for subgraphs which are similar to subgraphs within the new case. The case-based reasoner tries to combine the solutions of the neighboring training cases in order to propose a solution for the new case. If incompatibilities arise with the individual solutions, then backtracking to search for other solutions may be necessary. The case-based reasoner may employ background knowledge and problem-solving strategies in order to propose a feasible combined solution.
Challenges in case-based reasoning include finding a good similarity metric (e.g., for matching subgraphs), developing efficient techniques for indexing training cases, and methods for combining solutions.
* Genetic algorithms attempt to incorporate ideas of natural evolution. In general, genetic learning starts as follows. An initial population is created consisting of randomly generated rules. Each rule can be represented by a string of bits. A new population is then formed from the fittest rules of the current population, along with offspring created by applying genetic operators such as crossover and mutation, and the process repeats until the population satisfies a stopping condition.
* Rough set theory can be used for classification to discover structural relationships within imprecise or noisy data. It applies to discrete-valued attributes. Continuous-valued attributes must therefore be discretized prior to its use.
Rough set theory is based on the establishment of equivalence classes within the given training data. All of the data samples forming an equivalence class are indiscernible, that is, the samples are identical with respect to the attributes describing the data.
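As referenced in the nearest-neighbor item above, a minimal k-nearest-neighbor sketch (Euclidean distance via math.dist, majority vote among the k closest training samples; the value of k and all names are illustrative):

```python
import math
from collections import Counter

def knn_classify(sample, rows, labels, k=3):
    """Vote among the k training samples nearest to the unknown sample."""
    distances = sorted(
        (math.dist(sample, row), label) for row, label in zip(rows, labels)
    )
    nearest = [label for _, label in distances[:k]]
    return Counter(nearest).most_common(1)[0][0]

rows = [(1.0, 1.0), (1.2, 0.9), (8.0, 8.0), (8.5, 7.5), (7.9, 8.2)]
labels = ["A", "A", "B", "B", "B"]
print(knn_classify((8.1, 7.8), rows, labels))  # "B"
```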
Reference:
Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, Simon Fraser University, 2000.

Classification

The data classification process: a) Learning: Training data are analyzed by a classification algorithm. b) Classification: Test data are used to estimate the accuracy of the classification rules. If the accuracy is considered acceptable, the rules can be applied to the classification of new data tuples.
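A minimal sketch of the accuracy-estimation step (b), assuming any classify(row) function learned in step (a); the rule used here is a trivial illustrative stand-in:

```python
def accuracy(classify, test_rows, test_labels):
    """Step (b): fraction of held-out test samples classified correctly."""
    correct = sum(classify(row) == label for row, label in zip(test_rows, test_labels))
    return correct / len(test_labels)

def classify(row):
    # Trivial threshold rule standing in for rules learned in step (a).
    return "yes" if row[0] > 40 else "no"

print(accuracy(classify, [(25,), (50,), (45,)], ["no", "yes", "no"]))  # 2 of 3 correct
```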



Classification has more than one algorithm, including: classification by backpropagation, decision trees (decision tree induction and tree pruning), Bayesian classification (naive Bayesian classification and Bayesian belief networks), classification using association rules, and other classification methods (k-nearest neighbor classifiers, case-based reasoning, genetic algorithms, and rough set theory).

Data Mining Definition

Data Mining is an analytic process designed to explore data (usually large amounts of data), in search of consistent patterns and/or systematic relationships between variables, and then to validate the findings by applying the detected patterns to new subsets of data.