IMPROVEMENT IN K-MEANS CLUSTERING ALGORITHM FOR DATA CLUSTERING

Omkar Acharya
Department of Computer Engineering
Pimpri Chinchwad College Of Engineering
Savitribai Phule Pune University
Pune, India
omkarchamp1000@gmil.com

Mayur Sharma
Department of Computer Engineering
Pimpri Chinchwad College Of Engineering
Savitribai Phule Pune University
Pune, India
mayur_sharma60@yahoo.com

Mahesh Kopnar
Department of Computer Engineering
Pimpri Chinchwad College Of Engineering
Savitribai Phule Pune University
Pune, India
mkopnar@gmail.com

Abstract - Data clustering organizes objects that share the same characteristics into groups, called clusters. It is an unsupervised learning technique for classifying data. K-means is a widely used and well-known algorithm for cluster analysis. It divides n data points into k clusters based on a similarity criterion. Because K-means is fast, it is one of the most commonly used clustering algorithms. Its applications include vector quantization, cluster analysis and feature learning. However, the results it produces depend mainly on the choice of initial cluster centroids. Its main shortcoming is that the number of clusters must be supplied in advance. Providing the number of clusters before applying the algorithm is highly impractical and requires deep knowledge of clustering
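As an illustration of the procedure the abstract describes, the following is a minimal sketch of standard K-means (Lloyd's iteration), not the improved algorithm proposed by the authors. The function name `kmeans`, the use of squared Euclidean distance as the similarity criterion, and the sample points are assumptions made for the example. The initial centroids are passed in explicitly because, as noted above, the result depends on them, and their count fixes k in advance.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, initial_centroids, max_iter=100):
    """Plain K-means; k is fixed by the number of initial centroids."""
    centroids = list(initial_centroids)
    k = len(centroids)
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                # An empty cluster keeps its previous centroid.
                new_centroids.append(centroids[c])
        if new_centroids == centroids:
            break  # converged: centroids no longer move
        centroids = new_centroids
    return centroids, labels


# Hypothetical data: two well-separated groups of three points each.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]
centroids, labels = kmeans(points, initial_centroids=[points[0], points[3]])
print(labels)
```

Starting from a different pair of initial centroids, or from a different number of them, can change the clusters that are returned, which is the sensitivity the abstract points out.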
Recently, density-based clustering approaches have gained much attention among researchers. These methods assume that clusters occupy high-density regions that are separated by lower-density areas. They seek to identify clusters with arbitrary shapes. They also require minimal domain knowledge to organize data into clusters (Lovely Sharma & Ramya, 2013). DBSCAN (Ester, Kriegel, Sander, & Xu, 1996) is a well-known density-based clustering method whose aim is to identify maximal sets of density-connected points. This method has several advantages: it identifies clusters with arbitrary shapes, handles noise and outliers effectively, and does not require the number of clusters to be predefined. However, it also suffers from several shortcomings. For example, it cannot deal with datasets of uneven density. It has quadratic time complexity, and its effectiveness depends on appropriate selection
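The density-connected expansion that DBSCAN performs can be sketched as follows. This is a simplified illustration rather than the reference implementation: the parameter names `eps` (neighbourhood radius) and `min_pts` (minimum neighbourhood size for a core point) follow common usage and are not named in the text above, and the brute-force neighbourhood search is chosen only to make the quadratic cost visible.

```python
NOISE = -1


def region_query(points, i, eps):
    """Indices of all points within distance eps of points[i], including i."""
    return [
        j for j, q in enumerate(points)
        if sum((a - b) ** 2 for a, b in zip(points[i], q)) <= eps ** 2
    ]


def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, -1 for noise."""
    labels = [None] * len(points)  # None marks an unvisited point
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # core point: keep expanding
    return labels


# Hypothetical data: two dense squares and one isolated point.
data = [(0, 0), (0, 1), (1, 0), (1, 1),
        (10, 10), (10, 11), (11, 10), (11, 11),
        (5, 5)]
density_labels = dbscan(data, eps=1.5, min_pts=3)
print(density_labels)
```

The number of clusters is never supplied; it emerges from the density parameters, and the isolated point is reported as noise.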
To handle the problem of high dimensionality of such data sets, a number of algorithms have been introduced that use row-based enumeration techniques [6, 7] instead of column-based enumeration algorithms such as [8-10]. These methods assume that the data sets have thousands of columns, that is, many dimensions, but a smaller number of rows. The Carpenter algorithm [6] uses a bottom-up method, while the TD-Close algorithm [7] uses a top-down exploration approach. Both of these techniques, however, work best for dense datasets. They also work only for high-dimensional datasets that have significantly fewer rows than columns. Dimension reduction algorithms such as Principal Component Analysis [11], Multi-Dimensional Scaling [12] and Independent Component Analysis [13] are effective and popular for reducing dimensionality. However, the effectiveness of these algorithms is limited by their global linearity.
We have intentionally separated the thresholding technique from region-based techniques because of its use of the histogram and its simplicity. Segmentation faces a number of limitations, namely:
2. Detection using factors: The pre-duplicate record elimination stage removes duplicate data by retaining only one copy of each duplicate record and removing the rest. For this purpose, a threshold value is calculated for all the records, and a similarity threshold value is calculated for elimination purposes. All the possible pairs are selected from the clusters
Abstract - Data clustering is a key step in data processing algorithms for data mining. Clustering is a data mining technique used to place data elements into related groups without advance knowledge of the group definitions. Popular clustering techniques include k-means clustering. Clustering is an essential idea in data analysis and data mining applications. In the last decade, K-means has been a popular clustering algorithm because of its ease of use and simplicity. Nowadays, as data size is continuously increasing, some researchers have started working in distributed environments such as MapReduce to achieve high performance for big data clustering.
Secondly, Omran proposes the Dynamic Clustering (DCPSO) algorithm, which is based on binary PSO combined with k-means clustering. In this approach, PSO is used for clustering the data while k-means is used to refine the clustering solution. The number of clusters is determined automatically and the data sets are clustered with minimal user interference. To decrease the effects of initial conditions, a relatively large number of clusters is generated first. The number of clusters is then optimized by binary particle swarm optimization, while the K-means clustering algorithm is used to select the centroids. Both synthetic and natural images are used to test the approach, and the results show that the optimum number of clusters is generally found on the tested images.
Abstract - Data mining is a logical process used to search, or “mine”, large amounts of data in order to find useful data [2]. Knowledge Discovery from Data (KDD) is a synonym for data mining [13]. Many different types of techniques can be used to retrieve information from large amounts of data, and each type of technique will generate different results. The type of data mining technique that should be selected depends on the type of business problem that we are trying to solve.
Abstract - Data mining is the method of extracting data from large databases. Data mining techniques include clustering, classification, association analysis, regression, summarization, time series analysis and sequence analysis. Clustering is one of the important tasks in mining and is regarded as unsupervised classification. Clustering is the technique used to group similar objects or processes. In this work, four clustering algorithms (K-Means, Farthest First, EM, Hierarchical) have been analyzed to cluster the data and to find the outliers based on the number of clusters. The WEKA (Waikato Environment for Knowledge Analysis) tool is used to analyze the clustering techniques. Here the time, Clustered and un-clustered
Abstract - This paper presents an analysis of the K-means and K-medians clustering algorithms in detecting outliers. Clustering is generally used in pattern recognition, where it reduces the search load when a user wants to search for a particular pattern. The performance of the K-means and K-medians clustering algorithms in detecting outliers is analysed here. K-means clusters similar data using the mean value and the squared error criterion. K-medians is similar to K-means, but median values are calculated instead. Outliers are data points that differ from the norm. If they are not properly detected and handled, the clustering will be greatly affected.
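The difference between the two cluster centres in the presence of an outlier can be seen in a small one-dimensional sketch. The values below are made up for illustration and are not taken from the paper's experiments; the comparison only shows how far each centre moves when a single outlier joins a cluster.

```python
from statistics import mean, median

# Hypothetical one-dimensional cluster and the same cluster with one outlier.
cluster = [10.0, 11.0, 12.0, 13.0, 14.0]
with_outlier = cluster + [100.0]

# How far each kind of centre moves once the outlier is included.
mean_shift = abs(mean(with_outlier) - mean(cluster))
median_shift = abs(median(with_outlier) - median(cluster))

print(f"mean centre moved by   {mean_shift:.2f}")
print(f"median centre moved by {median_shift:.2f}")
```

In this example the mean-based centre is pulled strongly towards the outlier while the median-based centre barely moves, which is why the two algorithms can behave differently when outliers are present.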
Fuzzy clustering plays an important role in feature analysis, system identification and classifier design [21]. Given its capacity for managing uncertainty, imprecision and vagueness, the fuzzy approach is far more realistic for solving real-world problems than hard clustering algorithms. The fuzzy C-means (FCM) clustering algorithm [22] is a well-known method based on fuzzy clustering. In this method, the image is represented in different feature spaces and FCM classifies similar data points according to the distance of each pixel from the cluster centers in the feature space.
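To make the distance-based assignment concrete, the sketch below computes the membership degrees of one data point with respect to a set of cluster centers, using the standard FCM membership formula. The function name `fcm_memberships`, the fuzzifier value `m = 2` and the sample coordinates are assumptions made for the example; a complete FCM run would alternate this step with a center update.

```python
import math


def fcm_memberships(point, centers, m=2.0):
    """Membership degree of `point` in each cluster (values sum to 1)."""
    dists = [math.dist(point, c) for c in centers]
    # A point lying exactly on a center belongs fully to that center.
    if any(d == 0.0 for d in dists):
        hits = [1.0 if d == 0.0 else 0.0 for d in dists]
        return [h / sum(hits) for h in hits]
    exponent = 2.0 / (m - 1.0)
    # u_i = 1 / sum_k (d_i / d_k) ** (2 / (m - 1))
    return [
        1.0 / sum((d_i / d_k) ** exponent for d_k in dists)
        for d_i in dists
    ]


# Hypothetical feature vector and two cluster centers.
memberships = fcm_memberships((1.0, 0.0), [(0.0, 0.0), (4.0, 0.0)])
print(memberships)
```

Unlike a hard assignment, the point keeps a small degree of membership in the farther cluster, which is how FCM represents uncertainty.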
The correlation coefficient approach evaluates how well an individual feature contributes to the separation of classes. A ranking criterion is used to rank all features using their mean and standard deviation over all the samples of both classes. The correlation coefficient approach successfully reduced the number of features while maintaining good classification accuracy (Zong-Xia et al., 2006).
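The text does not give the ranking formula. One commonly used criterion of this kind scores each feature by the difference of its class means divided by the sum of its class standard deviations; the sketch below assumes that form, and the function names and sample values are hypothetical.

```python
from statistics import mean, stdev


def correlation_score(pos_values, neg_values):
    """(mean+ - mean-) / (std+ + std-) for one feature (assumed criterion).

    Raises ZeroDivisionError if the feature is constant in both classes.
    """
    return (mean(pos_values) - mean(neg_values)) / (
        stdev(pos_values) + stdev(neg_values)
    )


def rank_features(pos_samples, neg_samples):
    """Return feature indices ordered from most to least discriminative."""
    n_features = len(pos_samples[0])
    scores = [
        correlation_score(
            [s[f] for s in pos_samples], [s[f] for s in neg_samples]
        )
        for f in range(n_features)
    ]
    ranking = sorted(range(n_features), key=lambda f: abs(scores[f]), reverse=True)
    return ranking, scores


# Hypothetical samples with two features per sample.
pos_samples = [(5.0, 1.0), (6.0, 2.0), (7.0, 3.0)]
neg_samples = [(1.0, 1.5), (2.0, 2.5), (3.0, 3.5)]
ranking, scores = rank_features(pos_samples, neg_samples)
print(ranking, scores)
```

Features at the end of the ranking separate the classes poorly and are candidates for removal.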
The existing classification methods are limited in accuracy and exactness, and they require manual interaction. Designing an automated system using image segmentation techniques therefore helps make detection accurate and efficient.
Mr. Ashish P. Mohod1 , Mr. Najim A. Sheikh2 , Mr. Bharat S. Dhak3 , Ms. Mausami Sawarkar4
Pattern recognition is a technique for separating different patterns into classes using supervised or unsupervised methods. Humans have developed highly sophisticated skills for sensing the environment and taking actions according to what we observe. This sensing and understanding depends mostly on the ability to differentiate between patterns. With the help of machine learning, this pattern recognition ability can be applied in machines, enhancing their ability to make decisions as human beings do. Many applications based on pattern recognition, such as data mining, web searching and face recognition, are already in use. The objective of this review paper is to summarize and compare some of the well-known methods and applications used in pattern recognition systems.
Segmenting, or dividing, a digital image into regions of interest or meaningful structures plays an important role in many image processing tasks, including image analysis, image visualization and object representation. The prime objective of segmenting a digital image is to change its representation so that it is more meaningful for image analysis. During image segmentation, every pixel of the image is assigned a label or value. Pixels that share the same value also share homogeneous traits, such as color, texture, intensity or some other feature. Image segmentation can be defined as the technique of dividing an image f(x, y) into non-empty subsets f1, f2, ..., fn, each of which is contiguous and all of which are disjoint from one another. This step contributes to feature extraction. There are quite a few applications where image segmentation plays a pivotal role. These applications vary from image filtering, face recognition, medical imaging
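The definition above can be illustrated with a minimal sketch that labels every pixel of a small grayscale image so that each label marks one contiguous region. The homogeneity test used here (a pixel is either above or below a fixed intensity threshold), the 4-neighbour connectivity, the function name `segment` and the sample image are all assumptions made for the example.

```python
from collections import deque


def segment(image, threshold):
    """Assign each pixel a region label (1, 2, ...) by flood fill."""
    rows, cols = len(image), len(image[0])
    labels = [[0] * cols for _ in range(rows)]  # 0 means not yet labelled
    next_label = 0
    for r in range(rows):
        for c in range(cols):
            if labels[r][c]:
                continue
            next_label += 1
            bright = image[r][c] >= threshold
            labels[r][c] = next_label
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                # Visit the four direct neighbours of the current pixel.
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and not labels[ny][nx]
                            and (image[ny][nx] >= threshold) == bright):
                        labels[ny][nx] = next_label
                        queue.append((ny, nx))
    return labels


# Hypothetical 3x3 grayscale image.
image = [
    [10, 10, 200],
    [10, 10, 200],
    [200, 10, 10],
]
region_labels = segment(image, threshold=128)
for row in region_labels:
    print(row)
```

Every pixel receives exactly one label, so the regions are non-empty and disjoint and together cover the whole image. The two bright areas receive different labels because they are not connected to each other.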