WorldWideScience

Sample records for k-means cluster analysis

  1. Subspace K-means clustering.

    Science.gov (United States)

    Timmerman, Marieke E; Ceulemans, Eva; De Roover, Kim; Van Leeuwen, Karla

    2013-12-01

    To achieve an insightful clustering of multivariate data, we propose subspace K-means. Its central idea is to model the centroids and cluster residuals in reduced spaces, which allows for dealing with a wide range of cluster types and yields rich interpretations of the clusters. We review the existing related clustering methods, including deterministic, stochastic, and unsupervised learning approaches. To evaluate subspace K-means, we performed a comparative simulation study, in which we manipulated the overlap of subspaces, the between-cluster variance, and the error variance. The study shows that the subspace K-means algorithm is sensitive to local minima but that the problem can be reasonably dealt with by using partitions of various cluster procedures as a starting point for the algorithm. Subspace K-means performs very well in recovering the true clustering across all conditions considered and appears to be superior to its competitor methods: K-means, reduced K-means, factorial K-means, mixtures of factor analyzers (MFA), and MCLUST. The best competitor method, MFA, showed a performance similar to that of subspace K-means in easy conditions but deteriorated in more difficult ones. Using data from a study on parental behavior, we show that subspace K-means analysis provides a rich insight into the cluster characteristics, in terms of both the relative positions of the clusters (via the centroids) and the shape of the clusters (via the within-cluster residuals).

  2. Soil data clustering by using K-means and fuzzy K-means algorithm

    Directory of Open Access Journals (Sweden)

    E. Hot

    2016-06-01

    Full Text Available A problem of soil clustering based on the chemical characteristics of soil, and proper visual representation of the obtained results, is analysed in the paper. To that aim, K-means and fuzzy K-means algorithms are adapted for soil data clustering. A database of soil characteristics sampled in Montenegro is used for a comparative analysis of implemented algorithms. The procedure of setting proper values for control parameters of fuzzy K-means is illustrated on the used database. In addition, validation of clustering is made through visualisation. Classified soil data are presented on the static Google map and dynamic Open Street Map.

  3. Normalization based K means Clustering Algorithm

    OpenAIRE

    Virmani, Deepali; Taneja, Shweta; Malhotra, Geetika

    2015-01-01

    K-means is an effective clustering technique used to separate similar data into groups based on initial centroids of clusters. In this paper, Normalization based K-means clustering algorithm(N-K means) is proposed. Proposed N-K means clustering algorithm applies normalization prior to clustering on the available data as well as the proposed approach calculates initial centroids based on weights. Experimental results prove the betterment of proposed N-K means clustering algorithm over existing...

  4. Fatigue Feature Extraction Analysis based on a K-Means Clustering Approach

    Directory of Open Access Journals (Sweden)

    M.F.M. Yunoh

    2015-06-01

    Full Text Available This paper focuses on clustering analysis using a K-means approach for fatigue feature dataset extraction. The aim of this study is to group the dataset as closely as possible (homogeneity for the scattered dataset. Kurtosis, the wavelet-based energy coefficient and fatigue damage are calculated for all segments after the extraction process using wavelet transform. Kurtosis, the wavelet-based energy coefficient and fatigue damage are used as input data for the K-means clustering approach. K-means clustering calculates the average distance of each group from the centroid and gives the objective function values. Based on the results, maximum values of the objective function can be seen in the two centroid clusters, with a value of 11.58. The minimum objective function value is found at 8.06 for five centroid clusters. It can be seen that the objective function with the lowest value for the number of clusters is equal to five; which is therefore the best cluster for the dataset.

  5. ANALISIS CLUSTER K-MEANS DALAM PENGELOMPOKAN KEMAMPUAN MAHASISWA

    Directory of Open Access Journals (Sweden)

    B. Poerwanto

    2016-12-01

    Full Text Available Abstract. Cluster Analysis, K-Means Algorithm, Student Classification. This study aims to classify students based on learning outcomes for subject the basic of statistics (DDS, which is measured based on attendance, task, midterm (UTS, and final exams (UAS to further used to evaluate learning for subjects that require analysis of quantitative . This study uses k-means cluster analysis to classify the students into three groups based on learning outcomes. After grouped, there are 3 people in the low category, 27 in the medium category and over 70% in the high category.Abstrak. Analisis Cluster K-Means dalam Pengelompokan Kemampuan Mahasiswa. Pene-litian ini bertujuan untuk mengelompokkan mahasiswa berdasarkan hasil belajar mata kuliah dasar-dasar statistika (DDS yang diukur berdasarkan variabel nilai kehadiran, tugas, ujian tengah semester (UTS, dan ujian akhir semester (UAS untuk selanjutnya digunakan untuk mengevaluasi pembelajaran untuk mata kuliah yang membutuhkan kemampuan analisis kuantititatif yang baik. Penelitian ini menggunakan analisis cluster k-means dalam mengelompokkan mahasiswa ke dalam tiga kelompok berdasarkan hasil belajarnya. Seteleh dikelompokkan, terdapat 3 orang yang masuk pada kategori rendah, 27 orang pada kategori sedang dan lebih dari 70% pada kategori tinggi.Kata Kunci: Cluster Analysis, K-Means Algoritma, Klasifikasi Mahasiswa, Universitas Cokroaminoto Palopo

  6. 3D Building Models Segmentation Based on K-Means++ Cluster Analysis

    Science.gov (United States)

    Zhang, C.; Mao, B.

    2016-10-01

    3D mesh model segmentation is drawing increasing attentions from digital geometry processing field in recent years. The original 3D mesh model need to be divided into separate meaningful parts or surface patches based on certain standards to support reconstruction, compressing, texture mapping, model retrieval and etc. Therefore, segmentation is a key problem for 3D mesh model segmentation. In this paper, we propose a method to segment Collada (a type of mesh model) 3D building models into meaningful parts using cluster analysis. Common clustering methods segment 3D mesh models by K-means, whose performance heavily depends on randomized initial seed points (i.e., centroid) and different randomized centroid can get quite different results. Therefore, we improved the existing method and used K-means++ clustering algorithm to solve this problem. Our experiments show that K-means++ improves both the speed and the accuracy of K-means, and achieve good and meaningful results.

  7. 3D BUILDING MODELS SEGMENTATION BASED ON K-MEANS++ CLUSTER ANALYSIS

    Directory of Open Access Journals (Sweden)

    C. Zhang

    2016-10-01

    Full Text Available 3D mesh model segmentation is drawing increasing attentions from digital geometry processing field in recent years. The original 3D mesh model need to be divided into separate meaningful parts or surface patches based on certain standards to support reconstruction, compressing, texture mapping, model retrieval and etc. Therefore, segmentation is a key problem for 3D mesh model segmentation. In this paper, we propose a method to segment Collada (a type of mesh model 3D building models into meaningful parts using cluster analysis. Common clustering methods segment 3D mesh models by K-means, whose performance heavily depends on randomized initial seed points (i.e., centroid and different randomized centroid can get quite different results. Therefore, we improved the existing method and used K-means++ clustering algorithm to solve this problem. Our experiments show that K-means++ improves both the speed and the accuracy of K-means, and achieve good and meaningful results.

  8. K-means Clustering: Lloyd's algorithm

    Indian Academy of Sciences (India)

    First page Back Continue Last page Overview Graphics. K-means Clustering: Lloyd's algorithm. Refines clusters iteratively. Cluster points using Voronoi partitioning of the centers; Centroids of the clusters determine the new centers. Bad example k = 3, n =4.

  9. Subspace K-means clustering

    NARCIS (Netherlands)

    Timmerman, Marieke E.; Ceulemans, Eva; De Roover, Kim; Van Leeuwen, Karla

    2013-01-01

    To achieve an insightful clustering of multivariate data, we propose subspace K-means. Its central idea is to model the centroids and cluster residuals in reduced spaces, which allows for dealing with a wide range of cluster types and yields rich interpretations of the clusters. We review the

  10. Robust K-Median and K-Means Clustering Algorithms for Incomplete Data

    Directory of Open Access Journals (Sweden)

    Jinhua Li

    2016-01-01

    Full Text Available Incomplete data with missing feature values are prevalent in clustering problems. Traditional clustering methods first estimate the missing values by imputation and then apply the classical clustering algorithms for complete data, such as K-median and K-means. However, in practice, it is often hard to obtain accurate estimation of the missing values, which deteriorates the performance of clustering. To enhance the robustness of clustering algorithms, this paper represents the missing values by interval data and introduces the concept of robust cluster objective function. A minimax robust optimization (RO formulation is presented to provide clustering results, which are insensitive to estimation errors. To solve the proposed RO problem, we propose robust K-median and K-means clustering algorithms with low time and space complexity. Comparisons and analysis of experimental results on both artificially generated and real-world incomplete data sets validate the robustness and effectiveness of the proposed algorithms.

  11. Performance Analysis of Entropy Methods on K Means in Clustering Process

    Science.gov (United States)

    Dicky Syahputra Lubis, Mhd.; Mawengkang, Herman; Suwilo, Saib

    2017-12-01

    K Means is a non-hierarchical data clustering method that attempts to partition existing data into one or more clusters / groups. This method partitions the data into clusters / groups so that data that have the same characteristics are grouped into the same cluster and data that have different characteristics are grouped into other groups.The purpose of this data clustering is to minimize the objective function set in the clustering process, which generally attempts to minimize variation within a cluster and maximize the variation between clusters. However, the main disadvantage of this method is that the number k is often not known before. Furthermore, a randomly chosen starting point may cause two points to approach the distance to be determined as two centroids. Therefore, for the determination of the starting point in K Means used entropy method where this method is a method that can be used to determine a weight and take a decision from a set of alternatives. Entropy is able to investigate the harmony in discrimination among a multitude of data sets. Using Entropy criteria with the highest value variations will get the highest weight. Given this entropy method can help K Means work process in determining the starting point which is usually determined at random. Thus the process of clustering on K Means can be more quickly known by helping the entropy method where the iteration process is faster than the K Means Standard process. Where the postoperative patient dataset of the UCI Repository Machine Learning used and using only 12 data as an example of its calculations is obtained by entropy method only with 2 times iteration can get the desired end result.

  12. Implementasi KD-Tree K-Means Clustering untuk Klasterisasi Dokumen

    Directory of Open Access Journals (Sweden)

    Eric Budiman Gosno

    2013-09-01

    Full Text Available Klasterisasi dokumen adalah suatu proses pengelompokan dokumen secara otomatis dan unsupervised. Klasterisasi dokumen merupakan permasalahan yang sering ditemui dalam berbagai bidang seperti text mining dan sistem temu kembali informasi. Metode klasterisasi dokumen yang memiliki akurasi dan efisiensi waktu yang tinggi sangat diperlukan untuk meningkatkan hasil pada mesin pencari web,  dan untuk proses filtering. Salah satu metode klasterisasi yang telah dikenal dan diaplikasikan dalam klasterisasi dokumen adalah K-Means Clustering. Tetapi K-Means Clustering sensitif terhadap pemilihan posisi awal dari titik tengah klaster sehingga pemilihan posisi awal dari titik tengah klaster yang buruk akan mengakibatkan K-Means Clustering terjebak dalam local optimum. KD-Tree K-Means Clustering merupakan perbaikan dari K-Means Clustering. KD-Tree K-Means Clustering menggunakan struktur data K-Dimensional Tree dan nilai kerapatan pada proses inisialisasi titik tengah klaster. Pada makalah ini diimplementasikan algoritma KD-Tree K-Means Clustering untuk permasalahan klasterisasi dokumen. Performa klasterisasi dokumen yang dihasilkan oleh metode KD-Tree K-Means Clustering pada data set 20 newsgroup memiliki nilai distorsi 3×105 lebih rendah dibandingkan dengan nilai rerata distorsi K-Means Clustering dan nilai NIG 0,09 lebih baik dibandingkan dengan nilai NIG K-Means Clustering.

  13. Security and Correctness Analysis on Privacy-Preserving k-Means Clustering Schemes

    Science.gov (United States)

    Su, Chunhua; Bao, Feng; Zhou, Jianying; Takagi, Tsuyoshi; Sakurai, Kouichi

    Due to the fast development of Internet and the related IT technologies, it becomes more and more easier to access a large amount of data. k-means clustering is a powerful and frequently used technique in data mining. Many research papers about privacy-preserving k-means clustering were published. In this paper, we analyze the existing privacy-preserving k-means clustering schemes based on the cryptographic techniques. We show those schemes will cause the privacy breach and cannot output the correct results due to the faults in the protocol construction. Furthermore, we analyze our proposal as an option to improve such problems but with intermediate information breach during the computation.

  14. Effects of Group Size and Lack of Sphericity on the Recovery of Clusters in K-Means Cluster Analysis

    Science.gov (United States)

    de Craen, Saskia; Commandeur, Jacques J. F.; Frank, Laurence E.; Heiser, Willem J.

    2006-01-01

    K-means cluster analysis is known for its tendency to produce spherical and equally sized clusters. To assess the magnitude of these effects, a simulation study was conducted, in which populations were created with varying departures from sphericity and group sizes. An analysis of the recovery of clusters in the samples taken from these…

  15. Clustering performance comparison using K-means and expectation maximization algorithms.

    Science.gov (United States)

    Jung, Yong Gyu; Kang, Min Soo; Heo, Jun

    2014-11-14

    Clustering is an important means of data mining based on separating data categories by similar features. Unlike the classification algorithm, clustering belongs to the unsupervised type of algorithms. Two representatives of the clustering algorithms are the K -means and the expectation maximization (EM) algorithm. Linear regression analysis was extended to the category-type dependent variable, while logistic regression was achieved using a linear combination of independent variables. To predict the possibility of occurrence of an event, a statistical approach is used. However, the classification of all data by means of logistic regression analysis cannot guarantee the accuracy of the results. In this paper, the logistic regression analysis is applied to EM clusters and the K -means clustering method for quality assessment of red wine, and a method is proposed for ensuring the accuracy of the classification results.

  16. Finding reproducible cluster partitions for the k-means algorithm.

    Science.gov (United States)

    Lisboa, Paulo J G; Etchells, Terence A; Jarman, Ian H; Chambers, Simon J

    2013-01-01

    K-means clustering is widely used for exploratory data analysis. While its dependence on initialisation is well-known, it is common practice to assume that the partition with lowest sum-of-squares (SSQ) total i.e. within cluster variance, is both reproducible under repeated initialisations and also the closest that k-means can provide to true structure, when applied to synthetic data. We show that this is generally the case for small numbers of clusters, but for values of k that are still of theoretical and practical interest, similar values of SSQ can correspond to markedly different cluster partitions. This paper extends stability measures previously presented in the context of finding optimal values of cluster number, into a component of a 2-d map of the local minima found by the k-means algorithm, from which not only can values of k be identified for further analysis but, more importantly, it is made clear whether the best SSQ is a suitable solution or whether obtaining a consistently good partition requires further application of the stability index. The proposed method is illustrated by application to five synthetic datasets replicating a real world breast cancer dataset with varying data density, and a large bioinformatics dataset.

  17. *K-means and cluster models for cancer signatures.

    Science.gov (United States)

    Kakushadze, Zura; Yu, Willie

    2017-09-01

    We present *K-means clustering algorithm and source code by expanding statistical clustering methods applied in https://ssrn.com/abstract=2802753 to quantitative finance. *K-means is statistically deterministic without specifying initial centers, etc. We apply *K-means to extracting cancer signatures from genome data without using nonnegative matrix factorization (NMF). *K-means' computational cost is a fraction of NMF's. Using 1389 published samples for 14 cancer types, we find that 3 cancers (liver cancer, lung cancer and renal cell carcinoma) stand out and do not have cluster-like structures. Two clusters have especially high within-cluster correlations with 11 other cancers indicating common underlying structures. Our approach opens a novel avenue for studying such structures. *K-means is universal and can be applied in other fields. We discuss some potential applications in quantitative finance.

  18. Unsupervised Cryo-EM Data Clustering through Adaptively Constrained K-Means Algorithm.

    Science.gov (United States)

    Xu, Yaofang; Wu, Jiayi; Yin, Chang-Cheng; Mao, Youdong

    2016-01-01

    In single-particle cryo-electron microscopy (cryo-EM), K-means clustering algorithm is widely used in unsupervised 2D classification of projection images of biological macromolecules. 3D ab initio reconstruction requires accurate unsupervised classification in order to separate molecular projections of distinct orientations. Due to background noise in single-particle images and uncertainty of molecular orientations, traditional K-means clustering algorithm may classify images into wrong classes and produce classes with a large variation in membership. Overcoming these limitations requires further development on clustering algorithms for cryo-EM data analysis. We propose a novel unsupervised data clustering method building upon the traditional K-means algorithm. By introducing an adaptive constraint term in the objective function, our algorithm not only avoids a large variation in class sizes but also produces more accurate data clustering. Applications of this approach to both simulated and experimental cryo-EM data demonstrate that our algorithm is a significantly improved alterative to the traditional K-means algorithm in single-particle cryo-EM analysis.

  19. A Variable-Selection Heuristic for K-Means Clustering.

    Science.gov (United States)

    Brusco, Michael J.; Cradit, J. Dennis

    2001-01-01

    Presents a variable selection heuristic for nonhierarchical (K-means) cluster analysis based on the adjusted Rand index for measuring cluster recovery. Subjected the heuristic to Monte Carlo testing across more than 2,200 datasets. Results indicate that the heuristic is extremely effective at eliminating masking variables. (SLD)

  20. Cluster analysis of polymers using laser-induced breakdown spectroscopy with K-means

    Science.gov (United States)

    Yangmin, GUO; Yun, TANG; Yu, DU; Shisong, TANG; Lianbo, GUO; Xiangyou, LI; Yongfeng, LU; Xiaoyan, ZENG

    2018-06-01

    Laser-induced breakdown spectroscopy (LIBS) combined with K-means algorithm was employed to automatically differentiate industrial polymers under atmospheric conditions. The unsupervised learning algorithm K-means were utilized for the clustering of LIBS dataset measured from twenty kinds of industrial polymers. To prevent the interference from metallic elements, three atomic emission lines (C I 247.86 nm , H I 656.3 nm, and O I 777.3 nm) and one molecular line C–N (0, 0) 388.3 nm were used. The cluster analysis results were obtained through an iterative process. The Davies–Bouldin index was employed to determine the initial number of clusters. The average relative standard deviation values of characteristic spectral lines were used as the iterative criterion. With the proposed approach, the classification accuracy for twenty kinds of industrial polymers achieved 99.6%. The results demonstrated that this approach has great potential for industrial polymers recycling by LIBS.

  1. Integration K-Means Clustering Method and Elbow Method For Identification of The Best Customer Profile Cluster

    Science.gov (United States)

    Syakur, M. A.; Khotimah, B. K.; Rochman, E. M. S.; Satoto, B. D.

    2018-04-01

    Clustering is a data mining technique used to analyse data that has variations and the number of lots. Clustering was process of grouping data into a cluster, so they contained data that is as similar as possible and different from other cluster objects. SMEs Indonesia has a variety of customers, but SMEs do not have the mapping of these customers so they did not know which customers are loyal or otherwise. Customer mapping is a grouping of customer profiling to facilitate analysis and policy of SMEs in the production of goods, especially batik sales. Researchers will use a combination of K-Means method with elbow to improve efficient and effective k-means performance in processing large amounts of data. K-Means Clustering is a localized optimization method that is sensitive to the selection of the starting position from the midpoint of the cluster. So choosing the starting position from the midpoint of a bad cluster will result in K-Means Clustering algorithm resulting in high errors and poor cluster results. The K-means algorithm has problems in determining the best number of clusters. So Elbow looks for the best number of clusters on the K-means method. Based on the results obtained from the process in determining the best number of clusters with elbow method can produce the same number of clusters K on the amount of different data. The result of determining the best number of clusters with elbow method will be the default for characteristic process based on case study. Measurement of k-means value of k-means has resulted in the best clusters based on SSE values on 500 clusters of batik visitors. The result shows the cluster has a sharp decrease is at K = 3, so K as the cut-off point as the best cluster.

  2. A comparison of latent class, K-means, and K-median methods for clustering dichotomous data.

    Science.gov (United States)

    Brusco, Michael J; Shireman, Emilie; Steinley, Douglas

    2017-09-01

    The problem of partitioning a collection of objects based on their measurements on a set of dichotomous variables is a well-established problem in psychological research, with applications including clinical diagnosis, educational testing, cognitive categorization, and choice analysis. Latent class analysis and K-means clustering are popular methods for partitioning objects based on dichotomous measures in the psychological literature. The K-median clustering method has recently been touted as a potentially useful tool for psychological data and might be preferable to its close neighbor, K-means, when the variable measures are dichotomous. We conducted simulation-based comparisons of the latent class, K-means, and K-median approaches for partitioning dichotomous data. Although all 3 methods proved capable of recovering cluster structure, K-median clustering yielded the best average performance, followed closely by latent class analysis. We also report results for the 3 methods within the context of an application to transitive reasoning data, in which it was found that the 3 approaches can exhibit profound differences when applied to real data. (PsycINFO Database Record (c) 2017 APA, all rights reserved).

  3. Performance Evaluation of Incremental K-means Clustering Algorithm

    OpenAIRE

    Chakraborty, Sanjay; Nagwani, N. K.

    2014-01-01

    The incremental K-means clustering algorithm has already been proposed and analysed in paper [Chakraborty and Nagwani, 2011]. It is a very innovative approach which is applicable in periodically incremental environment and dealing with a bulk of updates. In this paper the performance evaluation is done for this incremental K-means clustering algorithm using air pollution database. This paper also describes the comparison on the performance evaluations between existing K-means clustering and i...

  4. Paternal age related schizophrenia (PARS): Latent subgroups detected by k-means clustering analysis.

    Science.gov (United States)

    Lee, Hyejoo; Malaspina, Dolores; Ahn, Hongshik; Perrin, Mary; Opler, Mark G; Kleinhaus, Karine; Harlap, Susan; Goetz, Raymond; Antonius, Daniel

    2011-05-01

    Paternal age related schizophrenia (PARS) has been proposed as a subgroup of schizophrenia with distinct etiology, pathophysiology and symptoms. This study uses a k-means clustering analysis approach to generate hypotheses about differences between PARS and other cases of schizophrenia. We studied PARS (operationally defined as not having any family history of schizophrenia among first and second-degree relatives and fathers' age at birth ≥ 35 years) in a series of schizophrenia cases recruited from a research unit. Data were available on demographic variables, symptoms (Positive and Negative Syndrome Scale; PANSS), cognitive tests (Wechsler Adult Intelligence Scale-Revised; WAIS-R) and olfaction (University of Pennsylvania Smell Identification Test; UPSIT). We conducted a series of k-means clustering analyses to identify clusters of cases containing high concentrations of PARS. Two analyses generated clusters with high concentrations of PARS cases. The first analysis (N=136; PARS=34) revealed a cluster containing 83% PARS cases, in which the patients showed a significant discrepancy between verbal and performance intelligence. The mean paternal and maternal ages were 41 and 33, respectively. The second analysis (N=123; PARS=30) revealed a cluster containing 71% PARS cases, of which 93% were females; the mean age of onset of psychosis, at 17.2, was significantly early. These results strengthen the evidence that PARS cases differ from other patients with schizophrenia. Hypothesis-generating findings suggest that features of PARS may include a discrepancy between verbal and performance intelligence, and in females, an early age of onset. These findings provide a rationale for separating these phenotypes from others in future clinical, genetic and pathophysiologic studies of schizophrenia and in considering responses to treatment. Copyright © 2011 Elsevier B.V. All rights reserved.

  5. K-means clustering versus validation measures: a data-distribution perspective.

    Science.gov (United States)

    Xiong, Hui; Wu, Junjie; Chen, Jian

    2009-04-01

    K-means is a well-known and widely used partitional clustering method. While there are considerable research efforts to characterize the key features of the K-means clustering algorithm, further investigation is needed to understand how data distributions can have impact on the performance of K-means clustering. To that end, in this paper, we provide a formal and organized study of the effect of skewed data distributions on K-means clustering. Along this line, we first formally illustrate that K-means tends to produce clusters of relatively uniform size, even if input data have varied "true" cluster sizes. In addition, we show that some clustering validation measures, such as the entropy measure, may not capture this uniform effect and provide misleading information on the clustering performance. Viewed in this light, we provide the coefficient of variation (CV) as a necessary criterion to validate the clustering results. Our findings reveal that K-means tends to produce clusters in which the variations of cluster sizes, as measured by CV, are in a range of about 0.3-1.0. Specifically, for data sets with large variation in "true" cluster sizes (e.g., CV > 1.0), K-means reduces variation in resultant cluster sizes to less than 1.0. In contrast, for data sets with small variation in "true" cluster sizes (e.g., CV K-means increases variation in resultant cluster sizes to greater than 0.3. In other words, for the earlier two cases, K-means produces the clustering results which are away from the "true" cluster distributions.

  6. Single pass kernel k-means clustering method

    Indian Academy of Sciences (India)

    paper proposes a simple and faster version of the kernel k-means clustering ... It has been considered as an important tool ... On the other hand, kernel-based clustering methods, like kernel k-means clus- ..... able at the UCI machine learning repository (Murphy 1994). ... All the data sets have only numeric valued features.

  7. Clustering Using Boosted Constrained k-Means Algorithm

    Directory of Open Access Journals (Sweden)

    Masayuki Okabe

    2018-03-01

    Full Text Available This article proposes a constrained clustering algorithm with competitive performance and less computation time to the state-of-the-art methods, which consists of a constrained k-means algorithm enhanced by the boosting principle. Constrained k-means clustering using constraints as background knowledge, although easy to implement and quick, has insufficient performance compared with metric learning-based methods. Since it simply adds a function into the data assignment process of the k-means algorithm to check for constraint violations, it often exploits only a small number of constraints. Metric learning-based methods, which exploit constraints to create a new metric for data similarity, have shown promising results although the methods proposed so far are often slow depending on the amount of data or number of feature dimensions. We present a method that exploits the advantages of the constrained k-means and metric learning approaches. It incorporates a mechanism for accepting constraint priorities and a metric learning framework based on the boosting principle into a constrained k-means algorithm. In the framework, a metric is learned in the form of a kernel matrix that integrates weak cluster hypotheses produced by the constrained k-means algorithm, which works as a weak learner under the boosting principle. Experimental results for 12 data sets from 3 data sources demonstrated that our method has performance competitive to those of state-of-the-art constrained clustering methods for most data sets and that it takes much less computation time. Experimental evaluation demonstrated the effectiveness of controlling the constraint priorities by using the boosting principle and that our constrained k-means algorithm functions correctly as a weak learner of boosting.

  8. *K-means and Cluster Models for Cancer Signatures

    OpenAIRE

    Kakushadze, Zura; Yu, Willie

    2017-01-01

    We present *K-means clustering algorithm and source code by expanding statistical clustering methods applied in https://ssrn.com/abstract=2802753 to quantitative finance. *K-means is statistically deterministic without specifying initial centers, etc. We apply *K-means to extracting cancer signatures from genome data without using nonnegative matrix factorization (NMF). *K-means’ computational cost is a fraction of NMF’s. Using 1389 published samples for 14 cancer types, we find that 3 cancer...

  9. Comparative Performance Of Using PCA With K-Means And Fuzzy C Means Clustering For Customer Segmentation

    Directory of Open Access Journals (Sweden)

    Fahmida Afrin

    2015-08-01

    Full Text Available Abstract Data mining is the process of analyzing data and discovering useful information. Sometimes it is called knowledge Discovery. Clustering refers to groups whereas data are grouped in such a way that the data in one cluster are similar data in different clusters are dissimilar. Many data mining technologies are developed for customer segmentation. PCA is working as a preprocessor of Fuzzy C means and K- means for reducing the high dimensional and noisy data. There are many clustering method apply on customer segmentation. In this paper the performance of Fuzzy C means and K-means after implementing Principal Component Analysis is analyzed. We analyze the performance on a standard dataset for these algorithms. The results indicate that PCA based fuzzy clustering produces better results than PCA based K-means and is a more stable method for customer segmentation.

  10. Android Malware Classification Using K-Means Clustering Algorithm

    Science.gov (United States)

    Hamid, Isredza Rahmi A.; Syafiqah Khalid, Nur; Azma Abdullah, Nurul; Rahman, Nurul Hidayah Ab; Chai Wen, Chuah

    2017-08-01

    Malware was designed to gain access or damage a computer system without user notice. Besides, attacker exploits malware to commit crime or fraud. This paper proposed Android malware classification approach based on K-Means clustering algorithm. We evaluate the proposed model in terms of accuracy using machine learning algorithms. Two datasets were selected to demonstrate the practicing of K-Means clustering algorithms that are Virus Total and Malgenome dataset. We classify the Android malware into three clusters which are ransomware, scareware and goodware. Nine features were considered for each types of dataset such as Lock Detected, Text Detected, Text Score, Encryption Detected, Threat, Porn, Law, Copyright and Moneypak. We used IBM SPSS Statistic software for data classification and WEKA tools to evaluate the built cluster. The proposed K-Means clustering algorithm shows promising result with high accuracy when tested using Random Forest algorithm.

  11. An improved K-means clustering algorithm in agricultural image segmentation

    Science.gov (United States)

    Cheng, Huifeng; Peng, Hui; Liu, Shanmei

    Image segmentation is the first important step to image analysis and image processing. In this paper, according to color crops image characteristics, we firstly transform the color space of image from RGB to HIS, and then select proper initial clustering center and cluster number in application of mean-variance approach and rough set theory followed by clustering calculation in such a way as to automatically segment color component rapidly and extract target objects from background accurately, which provides a reliable basis for identification, analysis, follow-up calculation and process of crops images. Experimental results demonstrate that improved k-means clustering algorithm is able to reduce the computation amounts and enhance precision and accuracy of clustering.

  12. Choosing the Number of Clusters in K-Means Clustering

    Science.gov (United States)

    Steinley, Douglas; Brusco, Michael J.

    2011-01-01

    Steinley (2007) provided a lower bound for the sum-of-squares error criterion function used in K-means clustering. In this article, on the basis of the lower bound, the authors propose a method to distinguish between 1 cluster (i.e., a single distribution) versus more than 1 cluster. Additionally, conditional on indicating there are multiple…

  13. An extended k-means technique for clustering moving objects

    Directory of Open Access Journals (Sweden)

    Omnia Ossama

    2011-03-01

    Full Text Available k-means algorithm is one of the basic clustering techniques that is used in many data mining applications. In this paper we present a novel pattern based clustering algorithm that extends the k-means algorithm for clustering moving object trajectory data. The proposed algorithm uses a key feature of moving object trajectories namely, its direction as a heuristic to determine the different number of clusters for the k-means algorithm. In addition, we use the silhouette coefficient as a measure for the quality of our proposed approach. Finally, we present experimental results on both real and synthetic data that show the performance and accuracy of our proposed technique.

  14. Worst-case and smoothed analysis of k-means clustering with Bregman divergences

    NARCIS (Netherlands)

    Manthey, Bodo; Röglin, H.

    2013-01-01

    The $k$-means method is the method of choice for clustering large-scale data sets and it performs exceedingly well in practice despite its exponential worst-case running-time. To narrow the gap between theory and practice, $k$-means has been studied in the semi-random input model of smoothed

  15. MULTI-K: accurate classification of microarray subtypes using ensemble k-means clustering

    Directory of Open Access Journals (Sweden)

    Ashlock Daniel

    2009-08-01

    Full Text Available Abstract Background Uncovering subtypes of disease from microarray samples has important clinical implications such as survival time and sensitivity of individual patients to specific therapies. Unsupervised clustering methods have been used to classify this type of data. However, most existing methods focus on clusters with compact shapes and do not reflect the geometric complexity of the high dimensional microarray clusters, which limits their performance. Results We present a cluster-number-based ensemble clustering algorithm, called MULTI-K, for microarray sample classification, which demonstrates remarkable accuracy. The method amalgamates multiple k-means runs by varying the number of clusters and identifies clusters that manifest the most robust co-memberships of elements. In addition to the original algorithm, we newly devised the entropy-plot to control the separation of singletons or small clusters. MULTI-K, unlike the simple k-means or other widely used methods, was able to capture clusters with complex and high-dimensional structures accurately. MULTI-K outperformed other methods including a recently developed ensemble clustering algorithm in tests with five simulated and eight real gene-expression data sets. Conclusion The geometric complexity of clusters should be taken into account for accurate classification of microarray data, and ensemble clustering applied to the number of clusters tackles the problem very well. The C++ code and the data sets tested are available from the authors.

  16. MULTI-K: accurate classification of microarray subtypes using ensemble k-means clustering.

    Science.gov (United States)

    Kim, Eun-Youn; Kim, Seon-Young; Ashlock, Daniel; Nam, Dougu

    2009-08-22

    Uncovering subtypes of disease from microarray samples has important clinical implications such as survival time and sensitivity of individual patients to specific therapies. Unsupervised clustering methods have been used to classify this type of data. However, most existing methods focus on clusters with compact shapes and do not reflect the geometric complexity of the high dimensional microarray clusters, which limits their performance. We present a cluster-number-based ensemble clustering algorithm, called MULTI-K, for microarray sample classification, which demonstrates remarkable accuracy. The method amalgamates multiple k-means runs by varying the number of clusters and identifies clusters that manifest the most robust co-memberships of elements. In addition to the original algorithm, we newly devised the entropy-plot to control the separation of singletons or small clusters. MULTI-K, unlike the simple k-means or other widely used methods, was able to capture clusters with complex and high-dimensional structures accurately. MULTI-K outperformed other methods including a recently developed ensemble clustering algorithm in tests with five simulated and eight real gene-expression data sets. The geometric complexity of clusters should be taken into account for accurate classification of microarray data, and ensemble clustering applied to the number of clusters tackles the problem very well. The C++ code and the data sets tested are available from the authors.

  17. Optimasi Pusat Cluster Awal K-Means dengan Algoritma Genetika Pada Pengelompokan Dokumen

    OpenAIRE

    Fauzi, Muhammad

    2017-01-01

    147038065 Clustering a data set of documents based on certain data points in documents are an easy way to organize document for extension to work. K-Means clustering algorithm is one of iterative cluster algorithm to partition a set of entities into K cluster. Unfortunately, resulting in K?Means cluster is depending on the initial cluster center that generally assigned randomly. In this reserach, determining initial cluster center K-Means for documents clustering are investi...

  18. Clustering Educational Digital Library Usage Data: A Comparison of Latent Class Analysis and K-Means Algorithms

    Science.gov (United States)

    Xu, Beijie; Recker, Mimi; Qi, Xiaojun; Flann, Nicholas; Ye, Lei

    2013-01-01

    This article examines clustering as an educational data mining method. In particular, two clustering algorithms, the widely used K-means and the model-based Latent Class Analysis, are compared, using usage data from an educational digital library service, the Instructional Architect (IA.usu.edu). Using a multi-faceted approach and multiple data…

  19. Analysis of k-means clustering approach on the breast cancer Wisconsin dataset.

    Science.gov (United States)

    Dubey, Ashutosh Kumar; Gupta, Umesh; Jain, Sonal

    2016-11-01

    Breast cancer is one of the most common cancers found worldwide and most frequently found in women. An early detection of breast cancer provides the possibility of its cure; therefore, a large number of studies are currently going on to identify methods that can detect breast cancer in its early stages. This study was aimed to find the effects of k-means clustering algorithm with different computation measures like centroid, distance, split method, epoch, attribute, and iteration and to carefully consider and identify the combination of measures that has potential of highly accurate clustering accuracy. K-means algorithm was used to evaluate the impact of clustering using centroid initialization, distance measures, and split methods. The experiments were performed using breast cancer Wisconsin (BCW) diagnostic dataset. Foggy and random centroids were used for the centroid initialization. In foggy centroid, based on random values, the first centroid was calculated. For random centroid, the initial centroid was considered as (0, 0). The results were obtained by employing k-means algorithm and are discussed with different cases considering variable parameters. The calculations were based on the centroid (foggy/random), distance (Euclidean/Manhattan/Pearson), split (simple/variance), threshold (constant epoch/same centroid), attribute (2-9), and iteration (4-10). Approximately, 92 % average positive prediction accuracy was obtained with this approach. Better results were found for the same centroid and the highest variance. The results achieved using Euclidean and Manhattan were better than the Pearson correlation. The findings of this work provided extensive understanding of the computational parameters that can be used with k-means. The results indicated that k-means has a potential to classify BCW dataset.

  20. Canonical PSO Based K-Means Clustering Approach for Real Datasets.

    Science.gov (United States)

    Dey, Lopamudra; Chakraborty, Sanjay

    2014-01-01

    "Clustering" the significance and application of this technique is spread over various fields. Clustering is an unsupervised process in data mining, that is why the proper evaluation of the results and measuring the compactness and separability of the clusters are important issues. The procedure of evaluating the results of a clustering algorithm is known as cluster validity measure. Different types of indexes are used to solve different types of problems and indices selection depends on the kind of available data. This paper first proposes Canonical PSO based K-means clustering algorithm and also analyses some important clustering indices (intercluster, intracluster) and then evaluates the effects of those indices on real-time air pollution database, wholesale customer, wine, and vehicle datasets using typical K-means, Canonical PSO based K-means, simple PSO based K-means, DBSCAN, and Hierarchical clustering algorithms. This paper also describes the nature of the clusters and finally compares the performances of these clustering algorithms according to the validity assessment. It also defines which algorithm will be more desirable among all these algorithms to make proper compact clusters on this particular real life datasets. It actually deals with the behaviour of these clustering algorithms with respect to validation indexes and represents their results of evaluation in terms of mathematical and graphical forms.

  1. The global kernel k-means algorithm for clustering in feature space.

    Science.gov (United States)

    Tzortzis, Grigorios F; Likas, Aristidis C

    2009-07-01

    Kernel k-means is an extension of the standard k -means clustering algorithm that identifies nonlinearly separable clusters. In order to overcome the cluster initialization problem associated with this method, we propose the global kernel k-means algorithm, a deterministic and incremental approach to kernel-based clustering. Our method adds one cluster at each stage, through a global search procedure consisting of several executions of kernel k-means from suitable initializations. This algorithm does not depend on cluster initialization, identifies nonlinearly separable clusters, and, due to its incremental nature and search procedure, locates near-optimal solutions avoiding poor local minima. Furthermore, two modifications are developed to reduce the computational cost that do not significantly affect the solution quality. The proposed methods are extended to handle weighted data points, which enables their application to graph partitioning. We experiment with several data sets and the proposed approach compares favorably to kernel k -means with random restarts.

  2. Smoothed analysis of the k-means method

    NARCIS (Netherlands)

    Arthur, David; Manthey, Bodo; Röglin, Heiko

    2011-01-01

    The k-means method is one of the most widely used clustering algorithms, drawing its popularity from its speed in practice. Recently, however, it was shown to have exponential worst-case running time. In order to close the gap between practical performance and theoretical analysis, the k-means

  3. Segmentation of dermatoscopic images by frequency domain filtering and k-means clustering algorithms.

    Science.gov (United States)

    Rajab, Maher I

    2011-11-01

    Since the introduction of epiluminescence microscopy (ELM), image analysis tools have been extended to the field of dermatology, in an attempt to algorithmically reproduce clinical evaluation. Accurate image segmentation of skin lesions is one of the key steps for useful, early and non-invasive diagnosis of coetaneous melanomas. This paper proposes two image segmentation algorithms based on frequency domain processing and k-means clustering/fuzzy k-means clustering. The two methods are capable of segmenting and extracting the true border that reveals the global structure irregularity (indentations and protrusions), which may suggest excessive cell growth or regression of a melanoma. As a pre-processing step, Fourier low-pass filtering is applied to reduce the surrounding noise in a skin lesion image. A quantitative comparison of the techniques is enabled by the use of synthetic skin lesion images that model lesions covered with hair to which Gaussian noise is added. The proposed techniques are also compared with an established optimal-based thresholding skin-segmentation method. It is demonstrated that for lesions with a range of different border irregularity properties, the k-means clustering and fuzzy k-means clustering segmentation methods provide the best performance over a range of signal to noise ratios. The proposed segmentation techniques are also demonstrated to have similar performance when tested on real skin lesions representing high-resolution ELM images. This study suggests that the segmentation results obtained using a combination of low-pass frequency filtering and k-means or fuzzy k-means clustering are superior to the result that would be obtained by using k-means or fuzzy k-means clustering segmentation methods alone. © 2011 John Wiley & Sons A/S.

  4. Ckmeans.1d.dp: Optimal k-means Clustering in One Dimension by Dynamic Programming.

    Science.gov (United States)

    Wang, Haizhou; Song, Mingzhou

    2011-12-01

    The heuristic k -means algorithm, widely used for cluster analysis, does not guarantee optimality. We developed a dynamic programming algorithm for optimal one-dimensional clustering. The algorithm is implemented as an R package called Ckmeans.1d.dp . We demonstrate its advantage in optimality and runtime over the standard iterative k -means algorithm.

  5. Interactive K-Means Clustering Method Based on User Behavior for Different Analysis Target in Medicine.

    Science.gov (United States)

    Lei, Yang; Yu, Dai; Bin, Zhang; Yang, Yang

    2017-01-01

    Clustering algorithm as a basis of data analysis is widely used in analysis systems. However, as for the high dimensions of the data, the clustering algorithm may overlook the business relation between these dimensions especially in the medical fields. As a result, usually the clustering result may not meet the business goals of the users. Then, in the clustering process, if it can combine the knowledge of the users, that is, the doctor's knowledge or the analysis intent, the clustering result can be more satisfied. In this paper, we propose an interactive K -means clustering method to improve the user's satisfactions towards the result. The core of this method is to get the user's feedback of the clustering result, to optimize the clustering result. Then, a particle swarm optimization algorithm is used in the method to optimize the parameters, especially the weight settings in the clustering algorithm to make it reflect the user's business preference as possible. After that, based on the parameter optimization and adjustment, the clustering result can be closer to the user's requirement. Finally, we take an example in the breast cancer, to testify our method. The experiments show the better performance of our algorithm.

  6. Long-term surface EMG monitoring using K-means clustering and compressive sensing

    Science.gov (United States)

    Balouchestani, Mohammadreza; Krishnan, Sridhar

    2015-05-01

    In this work, we present an advanced K-means clustering algorithm based on Compressed Sensing theory (CS) in combination with the K-Singular Value Decomposition (K-SVD) method for Clustering of long-term recording of surface Electromyography (sEMG) signals. The long-term monitoring of sEMG signals aims at recording of the electrical activity produced by muscles which are very useful procedure for treatment and diagnostic purposes as well as for detection of various pathologies. The proposed algorithm is examined for three scenarios of sEMG signals including healthy person (sEMG-Healthy), a patient with myopathy (sEMG-Myopathy), and a patient with neuropathy (sEMG-Neuropathr), respectively. The proposed algorithm can easily scan large sEMG datasets of long-term sEMG recording. We test the proposed algorithm with Principal Component Analysis (PCA) and Linear Correlation Coefficient (LCC) dimensionality reduction methods. Then, the output of the proposed algorithm is fed to K-Nearest Neighbours (K-NN) and Probabilistic Neural Network (PNN) classifiers in order to calclute the clustering performance. The proposed algorithm achieves a classification accuracy of 99.22%. This ability allows reducing 17% of Average Classification Error (ACE), 9% of Training Error (TE), and 18% of Root Mean Square Error (RMSE). The proposed algorithm also reduces 14% clustering energy consumption compared to the existing K-Means clustering algorithm.

  7. Automatic detection of erythemato-squamous diseases using k-means clustering.

    Science.gov (United States)

    Ubeyli, Elif Derya; Doğdu, Erdoğan

    2010-04-01

    A new approach based on the implementation of k-means clustering is presented for automated detection of erythemato-squamous diseases. The purpose of clustering techniques is to find a structure for the given data by finding similarities between data according to data characteristics. The studied domain contained records of patients with known diagnosis. The k-means clustering algorithm's task was to classify the data points, in this case the patients with attribute data, to one of the five clusters. The algorithm was used to detect the five erythemato-squamous diseases when 33 features defining five disease indications were used. The purpose is to determine an optimum classification scheme for this problem. The present research demonstrated that the features well represent the erythemato-squamous diseases and the k-means clustering algorithm's task achieved high classification accuracies for only five erythemato-squamous diseases.

  8. Support Vector Data Descriptions and k-Means Clustering: One Class?

    Science.gov (United States)

    Gornitz, Nico; Lima, Luiz Alberto; Muller, Klaus-Robert; Kloft, Marius; Nakajima, Shinichi

    2017-09-27

    We present ClusterSVDD, a methodology that unifies support vector data descriptions (SVDDs) and k-means clustering into a single formulation. This allows both methods to benefit from one another, i.e., by adding flexibility using multiple spheres for SVDDs and increasing anomaly resistance and flexibility through kernels to k-means. In particular, our approach leads to a new interpretation of k-means as a regularized mode seeking algorithm. The unifying formulation further allows for deriving new algorithms by transferring knowledge from one-class learning settings to clustering settings and vice versa. As a showcase, we derive a clustering method for structured data based on a one-class learning scenario. Additionally, our formulation can be solved via a particularly simple optimization scheme. We evaluate our approach empirically to highlight some of the proposed benefits on artificially generated data, as well as on real-world problems, and provide a Python software package comprising various implementations of primal and dual SVDD as well as our proposed ClusterSVDD.

  9. Merging K-means with hierarchical clustering for identifying general-shaped groups.

    Science.gov (United States)

    Peterson, Anna D; Ghosh, Arka P; Maitra, Ranjan

    2018-01-01

    Clustering partitions a dataset such that observations placed together in a group are similar but different from those in other groups. Hierarchical and K -means clustering are two approaches but have different strengths and weaknesses. For instance, hierarchical clustering identifies groups in a tree-like structure but suffers from computational complexity in large datasets while K -means clustering is efficient but designed to identify homogeneous spherically-shaped clusters. We present a hybrid non-parametric clustering approach that amalgamates the two methods to identify general-shaped clusters and that can be applied to larger datasets. Specifically, we first partition the dataset into spherical groups using K -means. We next merge these groups using hierarchical methods with a data-driven distance measure as a stopping criterion. Our proposal has the potential to reveal groups with general shapes and structure in a dataset. We demonstrate good performance on several simulated and real datasets.

  10. Hybrid K-means Dan Particle Swarm Optimization Untuk Clustering Nasabah Kredit

    Directory of Open Access Journals (Sweden)

    Yusuf Priyo Anggodo

    2017-05-01

    Credit is the biggest revenue for the bank. However, banks have to be selective in deciding which clients can receive the credit. This issue is becoming increasingly complex because when the bank was wrong to give credit to customers can do harm, apart of that a large number of deciding parameter in determining customer credit. Clustering is one way to be able to resolve this issue. K-means is a simple and popular method for solving clustering. However, K-means pure can’t provide optimum solutions so that needs to be done to get the optimum solution to improve. One method of optimization that can solve the problems of optimization with particle swarm optimization is good (PSO. PSO is very helpful in the process of clustering to perform optimization on the central point of each cluster. To improve better results on PSO there are some that do improve. The first use of time-variant inertia to make the dynamic value of inertial w each iteration. Both control the speed of the particle velocity or clamping to get the best position. Besides to overcome premature convergence do hybrid PSO with random injection. The results of this research provide the optimum results for solving clustering of customer credits. The test results showed the hybrid PSO K-means provide the greatest results than K-means and PSO K-means, where the silhouette of the K-means, PSO K-means, and hybrid PSO K-means respectively 0.57343, 0.792045, 1. Keywords: Credit, Clustering, PSO, K-means, Random Injection

  11. Worst-case and smoothed analysis of $k$-means clustering with Bregman divergences

    NARCIS (Netherlands)

    Manthey, Bodo; Röglin, Heiko; Dong, Yingfei; Du, Dingzhu; Ibarra, Oscar

    2009-01-01

    The $k$-means algorithm is the method of choice for clustering large-scale data sets and it performs exceedingly well in practice. Most of the theoretical work is restricted to the case that squared Euclidean distances are used as similarity measure. In many applications, however, data is to be

  12. Reducing Earth Topography Resolution for SMAP Mission Ground Tracks Using K-Means Clustering

    Science.gov (United States)

    Rizvi, Farheen

    2013-01-01

    The K-means clustering algorithm is used to reduce Earth topography resolution for the SMAP mission ground tracks. As SMAP propagates in orbit, knowledge of the radar antenna footprints on Earth is required for the antenna misalignment calibration. Each antenna footprint contains a latitude and longitude location pair on the Earth surface. There are 400 pairs in one data set for the calibration model. It is computationally expensive to calculate corresponding Earth elevation for these data pairs. Thus, the antenna footprint resolution is reduced. Similar topographical data pairs are grouped together with the K-means clustering algorithm. The resolution is reduced to the mean of each topographical cluster called the cluster centroid. The corresponding Earth elevation for each cluster centroid is assigned to the entire group. Results show that 400 data points are reduced to 60 while still maintaining algorithm performance and computational efficiency. In this work, sensitivity analysis is also performed to show a trade-off between algorithm performance versus computational efficiency as the number of cluster centroids and algorithm iterations are increased.

  13. Parallel k-means++

    Energy Technology Data Exchange (ETDEWEB)

    2017-04-04

    A parallelization of the k-means++ seed selection algorithm on three distinct hardware platforms: GPU, multicore CPU, and multithreaded architecture. K-means++ was developed by David Arthur and Sergei Vassilvitskii in 2007 as an extension of the k-means data clustering technique. These algorithms allow people to cluster multidimensional data, by attempting to minimize the mean distance of data points within a cluster. K-means++ improved upon traditional k-means by using a more intelligent approach to selecting the initial seeds for the clustering process. While k-means++ has become a popular alternative to traditional k-means clustering, little work has been done to parallelize this technique. We have developed original C++ code for parallelizing the algorithm on three unique hardware architectures: GPU using NVidia's CUDA/Thrust framework, multicore CPU using OpenMP, and the Cray XMT multithreaded architecture. By parallelizing the process for these platforms, we are able to perform k-means++ clustering much more quickly than it could be done before.

  14. A hybrid sales forecasting scheme by combining independent component analysis with K-means clustering and support vector regression.

    Science.gov (United States)

    Lu, Chi-Jie; Chang, Chi-Chang

    2014-01-01

    Sales forecasting plays an important role in operating a business since it can be used to determine the required inventory level to meet consumer demand and avoid the problem of under/overstocking. Improving the accuracy of sales forecasting has become an important issue of operating a business. This study proposes a hybrid sales forecasting scheme by combining independent component analysis (ICA) with K-means clustering and support vector regression (SVR). The proposed scheme first uses the ICA to extract hidden information from the observed sales data. The extracted features are then applied to K-means algorithm for clustering the sales data into several disjoined clusters. Finally, the SVR forecasting models are applied to each group to generate final forecasting results. Experimental results from information technology (IT) product agent sales data reveal that the proposed sales forecasting scheme outperforms the three comparison models and hence provides an efficient alternative for sales forecasting.

  15. On the Equivalence of Nonnegative Matrix Factorization and K-means- Spectral Clustering

    Energy Technology Data Exchange (ETDEWEB)

    Ding, Chris; He, Xiaofeng; Simon, Horst D.; Jin, Rong

    2005-12-04

    We provide a systematic analysis of nonnegative matrix factorization (NMF) relating to data clustering. We generalize the usual X = FG{sup T} decomposition to the symmetric W = HH{sup T} and W = HSH{sup T} decompositions. We show that (1) W = HH{sup T} is equivalent to Kernel K-means clustering and the Laplacian-based spectral clustering. (2) X = FG{sup T} is equivalent to simultaneous clustering of rows and columns of a bipartite graph. We emphasizes the importance of orthogonality in NMF and soft clustering nature of NMF. These results are verified with experiments on face images and newsgroups.

  16. Factorial and reduced K-means reconsidered

    NARCIS (Netherlands)

    Timmerman, Marieke E.; Ceulemans, Eva; Kiers, Henk A. L.; Vichi, Maurizio

    2010-01-01

    Factorial K-means analysis (FKM) and Reduced K-means analysis (RKM) are clustering methods that aim at simultaneously achieving a clustering of the objects and a dimension reduction of the variables. Because a comprehensive comparison between FKM and RKM is lacking in the literature so far, a

  17. A New Soft Computing Method for K-Harmonic Means Clustering.

    Science.gov (United States)

    Yeh, Wei-Chang; Jiang, Yunzhi; Chen, Yee-Fen; Chen, Zhe

    2016-01-01

    The K-harmonic means clustering algorithm (KHM) is a new clustering method used to group data such that the sum of the harmonic averages of the distances between each entity and all cluster centroids is minimized. Because it is less sensitive to initialization than K-means (KM), many researchers have recently been attracted to studying KHM. In this study, the proposed iSSO-KHM is based on an improved simplified swarm optimization (iSSO) and integrates a variable neighborhood search (VNS) for KHM clustering. As evidence of the utility of the proposed iSSO-KHM, we present extensive computational results on eight benchmark problems. From the computational results, the comparison appears to support the superiority of the proposed iSSO-KHM over previously developed algorithms for all experiments in the literature.

  18. Clustering for Binary Data Sets by Using Genetic Algorithm-Incremental K-means

    Science.gov (United States)

    Saharan, S.; Baragona, R.; Nor, M. E.; Salleh, R. M.; Asrah, N. M.

    2018-04-01

    This research was initially driven by the lack of clustering algorithms that specifically focus in binary data. To overcome this gap in knowledge, a promising technique for analysing this type of data became the main subject in this research, namely Genetic Algorithms (GA). For the purpose of this research, GA was combined with the Incremental K-means (IKM) algorithm to cluster the binary data streams. In GAIKM, the objective function was based on a few sufficient statistics that may be easily and quickly calculated on binary numbers. The implementation of IKM will give an advantage in terms of fast convergence. The results show that GAIKM is an efficient and effective new clustering algorithm compared to the clustering algorithms and to the IKM itself. In conclusion, the GAIKM outperformed other clustering algorithms such as GCUK, IKM, Scalable K-means (SKM) and K-means clustering and paves the way for future research involving missing data and outliers.

  19. CLUSTERING PENENTUAN POTENSI KEJAHATAN DAERAH DI KOTA BANJARBARU DENGAN METODE K-MEANS

    Directory of Open Access Journals (Sweden)

    Sri Rahayu

    2016-09-01

    Full Text Available Abstract Within the scope of the police, the data held in the database can be used to make a crime report, the presumption of evil to come, and so on. With the data mining based on the amount of data stored so much, these data can be processed to find the useful knowledge for police. One technique that is known in the data mining clustering techniques. The purpose of the job grouping (clustering the data can be divided into two, namely grouping for understanding and grouping to use. Methods K-Means clustering is a method for engineering the most simple and common. KMeans clustering is one method of data non-hierarchy (partition which seeks to partition the existing data in the form of two or more groups. This method of partitioning data into groups so that the same characteristic of data put into the same group and a different characteristic data are grouped into another group. The purpose of this grouping is to minimize the objective function is set in the grouping process, which generally seek to minimize the variation within a group and maximize the variation between groups. The data mined to determine the potential clustering of crime in the city area of crime data Banjarbaru is owned by the city police in the Police Banjarbaru. Thus this study aims to assess the stage of clustering techniques and build clustering determination of potential crime areas in the city Banjarbaru. Keywords:Clustering, Data mining, K-Means, K-Means Clustering ABSTRAK Dalam ruang lingkup kepolisian, data-data yang dimiliki pada basis data dapat dimanfaatkan untuk pembuatan laporan kejahatan, praduga kejahatan yang akan datang, dan sebagainya.Dengan adanya data mining yang didasarkan pada jumlah data yang tersimpan begitu banyak, data-data tersebut dapat diproses untuk menemukan suatu pengetahuan yang berguna bagi pihak kepolisian.Salah satu teknik yang dikenal dalam data mining yaitu teknik clustering.Tujuan pekerjaan pengelompokan (clustering data dapat dibedakan

  20. Profiling Local Optima in K-Means Clustering: Developing a Diagnostic Technique

    Science.gov (United States)

    Steinley, Douglas

    2006-01-01

    Using the cluster generation procedure proposed by D. Steinley and R. Henson (2005), the author investigated the performance of K-means clustering under the following scenarios: (a) different probabilities of cluster overlap; (b) different types of cluster overlap; (c) varying samples sizes, clusters, and dimensions; (d) different multivariate…

  1. Implementasi Pendekatan Rule-Of-Thumb untuk Optimasi Algoritma K-Means Clustering

    Directory of Open Access Journals (Sweden)

    M Nishom

    2018-05-01

    Full Text Available In the big data era, the clustering of data or so-called clustering has attracted great interest or attention from researchers in conducting various studies, many grouping algorithms have been proposed in recent times. However, as technology evolves, data volumes continue to grow and data formats are increasingly varied, thus making massive data grouping into a huge and challenging task. To overcome this problem, various research related methods for data grouping have been done, among them is K-Means. However, this method still has some shortcomings, among them is the sensitivity issue in determining the value of cluster (K. In this paper we discuss the implementation of the rule-of-thumb approach and the normalization of data on the K-Means method to determine the number of clusters or K values dynamically in the data groupings. The results show that the implementation of the approach has a significant impact (related to time, number of iterations, and no outliers in the data grouping.

  2. Crouch gait patterns defined using k-means cluster analysis are related to underlying clinical pathology.

    Science.gov (United States)

    Rozumalski, Adam; Schwartz, Michael H

    2009-08-01

    In this study a gait classification method was developed and applied to subjects with Cerebral palsy who walk with excessive knee flexion at initial contact. Sagittal plane gait data, simplified using the gait features method, is used as input into a k-means cluster analysis to determine homogeneous groups. Several clinical domains were explored to determine if the clusters are related to underlying pathology. These domains included age, joint range-of-motion, strength, selective motor control, and spasticity. Principal component analysis is used to determine one overall score for each of the multi-joint domains (strength, selective motor control, and spasticity). The current study shows that there are five clusters among children with excessive knee flexion at initial contact. These clusters were labeled, in order of increasing gait pathology: (1) mild crouch with mild equinus, (2) moderate crouch, (3) moderate crouch with anterior pelvic tilt, (4) moderate crouch with equinus, and (5) severe crouch. Further analysis showed that age, range-of-motion, strength, selective motor control, and spasticity were significantly different between the clusters (p<0.001). The general tendency was for the clinical domains to worsen as gait pathology increased. This new classification tool can be used to define homogeneous groups of subjects in crouch gait, which can help guide treatment decisions and outcomes assessment.

  3. Integrated analysis of CFD data with K-means clustering algorithm and extreme learning machine for localized HVAC control

    International Nuclear Information System (INIS)

    Zhou, Hongming; Soh, Yeng Chai; Wu, Xiaoying

    2015-01-01

    Maintaining a desired comfort level while minimizing the total energy consumed is an interesting optimization problem in Heating, ventilating and air conditioning (HVAC) system control. This paper proposes a localized control strategy that uses Computational Fluid Dynamics (CFD) simulation results and K-means clustering algorithm to optimally partition an air-conditioned room into different zones. The temperature and air velocity results from CFD simulation are combined in two ways: 1) based on the relationship indicated in predicted mean vote (PMV) formula; 2) based on the relationship extracted from ASHRAE RP-884 database using extreme learning machine (ELM). Localized control can then be effected in which each of the zones can be treated individually and an optimal control strategy can be developed based on the partitioning result. - Highlights: • The paper provides a visual guideline for thermal comfort analysis. • CFD, K-means, PMV and ELM are used to analyze thermal conditions within a room. • Localized control strategy could be developed based on our clustering results

  4. Performance Analysis of Combined Methods of Genetic Algorithm and K-Means Clustering in Determining the Value of Centroid

    Science.gov (United States)

    Adya Zizwan, Putra; Zarlis, Muhammad; Budhiarti Nababan, Erna

    2017-12-01

    The determination of Centroid on K-Means Algorithm directly affects the quality of the clustering results. Determination of centroid by using random numbers has many weaknesses. The GenClust algorithm that combines the use of Genetic Algorithms and K-Means uses a genetic algorithm to determine the centroid of each cluster. The use of the GenClust algorithm uses 50% chromosomes obtained through deterministic calculations and 50% is obtained from the generation of random numbers. This study will modify the use of the GenClust algorithm in which the chromosomes used are 100% obtained through deterministic calculations. The results of this study resulted in performance comparisons expressed in Mean Square Error influenced by centroid determination on K-Means method by using GenClust method, modified GenClust method and also classic K-Means.

  5. Automatic classification of canine PRG neuronal discharge patterns using K-means clustering.

    Science.gov (United States)

    Zuperku, Edward J; Prkic, Ivana; Stucke, Astrid G; Miller, Justin R; Hopp, Francis A; Stuth, Eckehard A

    2015-02-01

    Respiratory-related neurons in the parabrachial-Kölliker-Fuse (PB-KF) region of the pons play a key role in the control of breathing. The neuronal activities of these pontine respiratory group (PRG) neurons exhibit a variety of inspiratory (I), expiratory (E), phase spanning and non-respiratory related (NRM) discharge patterns. Due to the variety of patterns, it can be difficult to classify them into distinct subgroups according to their discharge contours. This report presents a method that automatically classifies neurons according to their discharge patterns and derives an average subgroup contour of each class. It is based on the K-means clustering technique and it is implemented via SigmaPlot User-Defined transform scripts. The discharge patterns of 135 canine PRG neurons were classified into seven distinct subgroups. Additional methods for choosing the optimal number of clusters are described. Analysis of the results suggests that the K-means clustering method offers a robust objective means of both automatically categorizing neuron patterns and establishing the underlying archetypical contours of subtypes based on the discharge patterns of group of neurons. Published by Elsevier B.V.

  6. Differential Spatio-temporal Multiband Satellite Image Clustering using K-means Optimization With Reinforcement Programming

    Directory of Open Access Journals (Sweden)

    Irene Erlyn Wina Rachmawan

    2015-06-01

    Full Text Available Deforestration is one of the crucial issues in Indonesia because now Indonesia has world's highest deforestation rate. In other hand, multispectral image delivers a great source of data for studying spatial and temporal changeability of the environmental such as deforestration area. This research present differential image processing methods for detecting nature change of deforestration. Our differential image processing algorithms extract and indicating area automatically. The feature of our proposed idea produce extracted information from multiband satellite image and calculate the area of deforestration by years with calculating data using temporal dataset. Yet, multiband satellite image consists of big data size that were difficult to be handled for segmentation. Commonly, K- Means clustering is considered to be a powerfull clustering algorithm because of its ability to clustering big data. However K-Means has sensitivity of its first generated centroids, which could lead into a bad performance. In this paper we propose a new approach to optimize K-Means clustering using Reinforcement Programming in order to clustering multispectral image. We build a new mechanism for generating initial centroids by implementing exploration and exploitation knowledge from Reinforcement Programming. This optimization will lead a better result for K-means data cluster. We select multispectral image from Landsat 7 in past ten years in Medawai, Borneo, Indonesia, and apply two segmentation areas consist of deforestration land and forest field. We made series of experiments and compared the experimental results of K-means using Reinforcement Programming as optimizing initiate centroid and normal K-means without optimization process. Keywords: Deforestration, Multispectral images, landsat, automatic clustering, K-means.

  7. Group analyses of connectivity-based cortical parcellation using repeated k-means clustering.

    Science.gov (United States)

    Nanetti, Luca; Cerliani, Leonardo; Gazzola, Valeria; Renken, Remco; Keysers, Christian

    2009-10-01

    K-means clustering has become a popular tool for connectivity-based cortical segmentation using Diffusion Weighted Imaging (DWI) data. A sometimes ignored issue is, however, that the output of the algorithm depends on the initial placement of starting points, and that different sets of starting points therefore could lead to different solutions. In this study we explore this issue. We apply k-means clustering a thousand times to the same DWI dataset collected in 10 individuals to segment two brain regions: the SMA-preSMA on the medial wall, and the insula. At the level of single subjects, we found that in both brain regions, repeatedly applying k-means indeed often leads to a variety of rather different cortical based parcellations. By assessing the similarity and frequency of these different solutions, we show that approximately 256 k-means repetitions are needed to accurately estimate the distribution of possible solutions. Using nonparametric group statistics, we then propose a method to employ the variability of clustering solutions to assess the reliability with which certain voxels can be attributed to a particular cluster. In addition, we show that the proportion of voxels that can be attributed significantly to either cluster in the SMA and preSMA is relatively higher than in the insula and discuss how this difference may relate to differences in the anatomy of these regions.

  8. Privacy-Preserving k-Means Clustering under Multiowner Setting in Distributed Cloud Environments

    Directory of Open Access Journals (Sweden)

    Hong Rong

    2017-01-01

    Full Text Available With the advent of big data era, clients who lack computational and storage resources tend to outsource data mining tasks to cloud service providers in order to improve efficiency and reduce costs. It is also increasingly common for clients to perform collaborative mining to maximize profits. However, due to the rise of privacy leakage issues, the data contributed by clients should be encrypted using their own keys. This paper focuses on privacy-preserving k-means clustering over the joint datasets encrypted under multiple keys. Unfortunately, existing outsourcing k-means protocols are impractical because not only are they restricted to a single key setting, but also they are inefficient and nonscalable for distributed cloud computing. To address these issues, we propose a set of privacy-preserving building blocks and outsourced k-means clustering protocol under Spark framework. Theoretical analysis shows that our scheme protects the confidentiality of the joint database and mining results, as well as access patterns under the standard semihonest model with relatively small computational overhead. Experimental evaluations on real datasets also demonstrate its efficiency improvements compared with existing approaches.

  9. Sleep stages identification in patients with sleep disorder using k-means clustering

    Science.gov (United States)

    Fadhlullah, M. U.; Resahya, A.; Nugraha, D. F.; Yulita, I. N.

    2018-05-01

    Data mining is a computational intelligence discipline where a large dataset processed using a certain method to look for patterns within the large dataset. This pattern then used for real time application or to develop some certain knowledge. This is a valuable tool to solve a complex problem, discover new knowledge, data analysis and decision making. To be able to get the pattern that lies inside the large dataset, clustering method is used to get the pattern. Clustering is basically grouping data that looks similar so a certain pattern can be seen in the large data set. Clustering itself has several algorithms to group the data into the corresponding cluster. This research used data from patients who suffer sleep disorders and aims to help people in the medical world to reduce the time required to classify the sleep stages from a patient who suffers from sleep disorders. This study used K-Means algorithm and silhouette evaluation to find out that 3 clusters are the optimal cluster for this dataset which means can be divided to 3 sleep stages.

  10. A hybrid sequential approach for data clustering using K-Means and ...

    African Journals Online (AJOL)

    Experiments on four kinds of data sets have been conducted. The obtained results are compared with K-Means, PSO, Hybrid, K-Means+Genetic Algorithm and it has been found that the proposed algorithm generates more accurate, robust and better clustering results. International Journal of Engineering, Science and ...

  11. Functional connectivity analysis of the neural bases of emotion regulation: A comparison of independent component method with density-based k-means clustering method.

    Science.gov (United States)

    Zou, Ling; Guo, Qian; Xu, Yi; Yang, Biao; Jiao, Zhuqing; Xiang, Jianbo

    2016-04-29

    Functional magnetic resonance imaging (fMRI) is an important tool in neuroscience for assessing connectivity and interactions between distant areas of the brain. To find and characterize the coherent patterns of brain activity as a means of identifying brain systems for the cognitive reappraisal of the emotion task, both density-based k-means clustering and independent component analysis (ICA) methods can be applied to characterize the interactions between brain regions involved in cognitive reappraisal of emotion. Our results reveal that compared with the ICA method, the density-based k-means clustering method provides a higher sensitivity of polymerization. In addition, it is more sensitive to those relatively weak functional connection regions. Thus, the study concludes that in the process of receiving emotional stimuli, the relatively obvious activation areas are mainly distributed in the frontal lobe, cingulum and near the hypothalamus. Furthermore, density-based k-means clustering method creates a more reliable method for follow-up studies of brain functional connectivity.

  12. An Initial Seed Selection Algorithm for K-means Clustering of Georeferenced Data to Improve Replicability of Cluster Assignments for Mapping Application

    OpenAIRE

    Khan, Fouad

    2016-01-01

    K-means is one of the most widely used clustering algorithms in various disciplines, especially for large datasets. However the method is known to be highly sensitive to initial seed selection of cluster centers. K-means++ has been proposed to overcome this problem and has been shown to have better accuracy and computational efficiency than k-means. In many clustering problems though -such as when classifying georeferenced data for mapping applications- standardization of clustering methodolo...

  13. K-Means Clustering for Problems with Periodic Attributes

    Czech Academy of Sciences Publication Activity Database

    Vejmelka, Martin; Musílek, P.; Paluš, Milan; Pelikán, Emil

    2009-01-01

    Roč. 23, č. 4 (2009), s. 721-743 ISSN 0218-0014 R&D Projects: GA AV ČR 1ET400300513 EU Projects: European Commission(XE) 517133 - BRACCIA Institutional research plan: CEZ:AV0Z10300504 Keywords : clustering algorithms * similarity measures * K- means * periodic attributes Subject RIV: BB - Applied Statistics, Operational Research Impact factor: 0.512, year: 2009

  14. An improved K-means clustering method for cDNA microarray image segmentation.

    Science.gov (United States)

    Wang, T N; Li, T J; Shao, G F; Wu, S X

    2015-07-14

    Microarray technology is a powerful tool for human genetic research and other biomedical applications. Numerous improvements to the standard K-means algorithm have been carried out to complete the image segmentation step. However, most of the previous studies classify the image into two clusters. In this paper, we propose a novel K-means algorithm, which first classifies the image into three clusters, and then one of the three clusters is divided as the background region and the other two clusters, as the foreground region. The proposed method was evaluated on six different data sets. The analyses of accuracy, efficiency, expression values, special gene spots, and noise images demonstrate the effectiveness of our method in improving the segmentation quality.

  15. Order-Constrained Solutions in K-Means Clustering: Even Better than Being Globally Optimal

    Science.gov (United States)

    Steinley, Douglas; Hubert, Lawrence

    2008-01-01

    This paper proposes an order-constrained K-means cluster analysis strategy, and implements that strategy through an auxiliary quadratic assignment optimization heuristic that identifies an initial object order. A subsequent dynamic programming recursion is applied to optimally subdivide the object set subject to the order constraint. We show that…

  16. Pharmacokinetic analysis and k-means clustering of DCEMR images for radiotherapy outcome prediction of advanced cervical cancers.

    Science.gov (United States)

    Andersen, Erlend K F; Kristensen, Gunnar B; Lyng, Heidi; Malinen, Eirik

    2011-08-01

    Pharmacokinetic analysis of dynamic contrast enhanced magnetic resonance images (DCEMRI) allows for quantitative characterization of vascular properties of tumors. The aim of this study is twofold, first to determine if tumor regions with similar vascularization could be labeled by clustering methods, second to determine if the identified regions can be associated with local cancer relapse. Eighty-one patients with locally advanced cervical cancer treated with chemoradiotherapy underwent DCEMRI with Gd-DTPA prior to external beam radiotherapy. The median follow-up time after treatment was four years, in which nine patients had primary tumor relapse. By fitting a pharmacokinetic two-compartment model function to the temporal contrast enhancement in the tumor, two pharmacokinetic parameters, K(trans) and ύ(e), were estimated voxel by voxel from the DCEMR-images. Intratumoral regions with similar vascularization were identified by k-means clustering of the two pharmacokinetic parameter estimates over all patients. The volume fraction of each cluster was used to evaluate the prognostic value of the clusters. Three clusters provided a sufficient reduction of the cluster variance to label different vascular properties within the tumors. The corresponding median volume fraction of each cluster was 38%, 46% and 10%. The second cluster was significantly associated with primary tumor control in a log-rank survival test (p-value: 0.042), showing a decreased risk of treatment failure for patients with high volume fraction of voxels. Intratumoral regions showing similar vascular properties could successfully be labeled in three distinct clusters and the volume fraction of one cluster region was associated with primary tumor control.

  17. Pharmacokinetic analysis and k-means clustering of DCEMR images for radiotherapy outcome prediction of advanced cervical cancers

    International Nuclear Information System (INIS)

    Andersen, Erlend K. F.; Kristensen, Gunnar B.; Lyng, Heidi; Malinen, Eirik

    2011-01-01

    Introduction. Pharmacokinetic analysis of dynamic contrast enhanced magnetic resonance images (DCEMRI) allows for quantitative characterization of vascular properties of tumors. The aim of this study is twofold, first to determine if tumor regions with similar vascularization could be labeled by clustering methods, second to determine if the identified regions can be associated with local cancer relapse. Materials and methods. Eighty-one patients with locally advanced cervical cancer treated with chemoradiotherapy underwent DCEMRI with Gd-DTPA prior to external beam radiotherapy. The median follow-up time after treatment was four years, in which nine patients had primary tumor relapse. By fitting a pharmacokinetic two-compartment model function to the temporal contrast enhancement in the tumor, two pharmacokinetic parameters, K trans and u e , were estimated voxel by voxel from the DCEMR-images. Intratumoral regions with similar vascularization were identified by k-means clustering of the two pharmacokinetic parameter estimates over all patients. The volume fraction of each cluster was used to evaluate the prognostic value of the clusters. Results. Three clusters provided a sufficient reduction of the cluster variance to label different vascular properties within the tumors. The corresponding median volume fraction of each cluster was 38%, 46% and 10%. The second cluster was significantly associated with primary tumor control in a log-rank survival test (p-value: 0.042), showing a decreased risk of treatment failure for patients with high volume fraction of voxels. Conclusions. Intratumoral regions showing similar vascular properties could successfully be labeled in three distinct clusters and the volume fraction of one cluster region was associated with primary tumor control

  18. Nucleus and cytoplasm segmentation in microscopic images using K-means clustering and region growing.

    Science.gov (United States)

    Sarrafzadeh, Omid; Dehnavi, Alireza Mehri

    2015-01-01

    Segmentation of leukocytes acts as the foundation for all automated image-based hematological disease recognition systems. Most of the time, hematologists are interested in evaluation of white blood cells only. Digital image processing techniques can help them in their analysis and diagnosis. The main objective of this paper is to detect leukocytes from a blood smear microscopic image and segment them into their two dominant elements, nucleus and cytoplasm. The segmentation is conducted using two stages of applying K-means clustering. First, the nuclei are segmented using K-means clustering. Then, a proposed method based on region growing is applied to separate the connected nuclei. Next, the nuclei are subtracted from the original image. Finally, the cytoplasm is segmented using the second stage of K-means clustering. The results indicate that the proposed method is able to extract the nucleus and cytoplasm regions accurately and works well even though there is no significant contrast between the components in the image. In this paper, a method based on K-means clustering and region growing is proposed in order to detect leukocytes from a blood smear microscopic image and segment its components, the nucleus and the cytoplasm. As region growing step of the algorithm relies on the information of edges, it will not able to separate the connected nuclei more accurately in poor edges and it requires at least a weak edge to exist between the nuclei. The nucleus and cytoplasm segments of a leukocyte can be used for feature extraction and classification which leads to automated leukemia detection.

  19. Towards enhancement of performance of K-means clustering using nature-inspired optimization algorithms.

    Science.gov (United States)

    Fong, Simon; Deb, Suash; Yang, Xin-She; Zhuang, Yan

    2014-01-01

    Traditional K-means clustering algorithms have the drawback of getting stuck at local optima that depend on the random values of initial centroids. Optimization algorithms have their advantages in guiding iterative computation to search for global optima while avoiding local optima. The algorithms help speed up the clustering process by converging into a global optimum early with multiple search agents in action. Inspired by nature, some contemporary optimization algorithms which include Ant, Bat, Cuckoo, Firefly, and Wolf search algorithms mimic the swarming behavior allowing them to cooperatively steer towards an optimal objective within a reasonable time. It is known that these so-called nature-inspired optimization algorithms have their own characteristics as well as pros and cons in different applications. When these algorithms are combined with K-means clustering mechanism for the sake of enhancing its clustering quality by avoiding local optima and finding global optima, the new hybrids are anticipated to produce unprecedented performance. In this paper, we report the results of our evaluation experiments on the integration of nature-inspired optimization methods into K-means algorithms. In addition to the standard evaluation metrics in evaluating clustering quality, the extended K-means algorithms that are empowered by nature-inspired optimization methods are applied on image segmentation as a case study of application scenario.

  20. Prediction of chemotherapeutic response in bladder cancer using K-means clustering of dynamic contrast-enhanced (DCE)-MRI pharmacokinetic parameters.

    Science.gov (United States)

    Nguyen, Huyen T; Jia, Guang; Shah, Zarine K; Pohar, Kamal; Mortazavi, Amir; Zynger, Debra L; Wei, Lai; Yang, Xiangyu; Clark, Daniel; Knopp, Michael V

    2015-05-01

    To apply k-means clustering of two pharmacokinetic parameters derived from 3T dynamic contrast-enhanced magnetic resonance imaging (DCE-MRI) to predict the chemotherapeutic response in bladder cancer at the mid-cycle timepoint. With the predetermined number of three clusters, k-means clustering was performed on nondimensionalized Amp and kep estimates of each bladder tumor. Three cluster volume fractions (VFs) were calculated for each tumor at baseline and mid-cycle. The changes of three cluster VFs from baseline to mid-cycle were correlated with the tumor's chemotherapeutic response. Receiver-operating-characteristics curve analysis was used to evaluate the performance of each cluster VF change as a biomarker of chemotherapeutic response in bladder cancer. The k-means clustering partitioned each bladder tumor into cluster 1 (low kep and low Amp), cluster 2 (low kep and high Amp), cluster 3 (high kep and low Amp). The changes of all three cluster VFs were found to be associated with bladder tumor response to chemotherapy. The VF change of cluster 2 presented with the highest area-under-the-curve value (0.96) and the highest sensitivity/specificity/accuracy (96%/100%/97%) with a selected cutoff value. The k-means clustering of the two DCE-MRI pharmacokinetic parameters can characterize the complex microcirculatory changes within a bladder tumor to enable early prediction of the tumor's chemotherapeutic response. © 2014 Wiley Periodicals, Inc.

  1. [Automatic Sleep Stage Classification Based on an Improved K-means Clustering Algorithm].

    Science.gov (United States)

    Xiao, Shuyuan; Wang, Bei; Zhang, Jian; Zhang, Qunfeng; Zou, Junzhong

    2016-10-01

    Sleep stage scoring is a hotspot in the field of medicine and neuroscience.Visual inspection of sleep is laborious and the results may be subjective to different clinicians.Automatic sleep stage classification algorithm can be used to reduce the manual workload.However,there are still limitations when it encounters complicated and changeable clinical cases.The purpose of this paper is to develop an automatic sleep staging algorithm based on the characteristics of actual sleep data.In the proposed improved K-means clustering algorithm,points were selected as the initial centers by using a concept of density to avoid the randomness of the original K-means algorithm.Meanwhile,the cluster centers were updated according to the‘Three-Sigma Rule’during the iteration to abate the influence of the outliers.The proposed method was tested and analyzed on the overnight sleep data of the healthy persons and patients with sleep disorders after continuous positive airway pressure(CPAP)treatment.The automatic sleep stage classification results were compared with the visual inspection by qualified clinicians and the averaged accuracy reached 76%.With the analysis of morphological diversity of sleep data,it was proved that the proposed improved K-means algorithm was feasible and valid for clinical practice.

  2. Evaluation of stability of k-means cluster ensembles with respect to random initialization.

    Science.gov (United States)

    Kuncheva, Ludmila I; Vetrov, Dmitry P

    2006-11-01

    Many clustering algorithms, including cluster ensembles, rely on a random component. Stability of the results across different runs is considered to be an asset of the algorithm. The cluster ensembles considered here are based on k-means clusterers. Each clusterer is assigned a random target number of clusters, k and is started from a random initialization. Here, we use 10 artificial and 10 real data sets to study ensemble stability with respect to random k, and random initialization. The data sets were chosen to have a small number of clusters (two to seven) and a moderate number of data points (up to a few hundred). Pairwise stability is defined as the adjusted Rand index between pairs of clusterers in the ensemble, averaged across all pairs. Nonpairwise stability is defined as the entropy of the consensus matrix of the ensemble. An experimental comparison with the stability of the standard k-means algorithm was carried out for k from 2 to 20. The results revealed that ensembles are generally more stable, markedly so for larger k. To establish whether stability can serve as a cluster validity index, we first looked at the relationship between stability and accuracy with respect to the number of clusters, k. We found that such a relationship strongly depends on the data set, varying from almost perfect positive correlation (0.97, for the glass data) to almost perfect negative correlation (-0.93, for the crabs data). We propose a new combined stability index to be the sum of the pairwise individual and ensemble stabilities. This index was found to correlate better with the ensemble accuracy. Following the hypothesis that a point of stability of a clustering algorithm corresponds to a structure found in the data, we used the stability measures to pick the number of clusters. The combined stability index gave best results.

  3. The global Minmax k-means algorithm.

    Science.gov (United States)

    Wang, Xiaoyan; Bai, Yanping

    2016-01-01

    The global k -means algorithm is an incremental approach to clustering that dynamically adds one cluster center at a time through a deterministic global search procedure from suitable initial positions, and employs k -means to minimize the sum of the intra-cluster variances. However the global k -means algorithm sometimes results singleton clusters and the initial positions sometimes are bad, after a bad initialization, poor local optimal can be easily obtained by k -means algorithm. In this paper, we modified the global k -means algorithm to eliminate the singleton clusters at first, and then we apply MinMax k -means clustering error method to global k -means algorithm to overcome the effect of bad initialization, proposed the global Minmax k -means algorithm. The proposed clustering method is tested on some popular data sets and compared to the k -means algorithm, the global k -means algorithm and the MinMax k -means algorithm. The experiment results show our proposed algorithm outperforms other algorithms mentioned in the paper.

  4. Optimization Approach for Multi-scale Segmentation of Remotely Sensed Imagery under k-means Clustering Guidance

    Directory of Open Access Journals (Sweden)

    WANG Huixian

    2015-05-01

    Full Text Available In order to adapt different scale land cover segmentation, an optimized approach under the guidance of k-means clustering for multi-scale segmentation is proposed. At first, small scale segmentation and k-means clustering are used to process the original images; then the result of k-means clustering is used to guide objects merging procedure, in which Otsu threshold method is used to automatically select the impact factor of k-means clustering; finally we obtain the segmentation results which are applicable to different scale objects. FNEA method is taken for an example and segmentation experiments are done using a simulated image and a real remote sensing image from GeoEye-1 satellite, qualitative and quantitative evaluation demonstrates that the proposed method can obtain high quality segmentation results.

  5. Perbandingan Kinerja Metode Ward Dan K-means Dalam Menentukan Cluster Data Mahasiswa Pemohon Beasiswa (Studi Kasus : STMIK Pringsewu)

    OpenAIRE

    Satria, Fiqih; Aziz, R Z Abdul

    2016-01-01

    This research aims to determine the steps cluster analysis method with Ward method and K-Means method, and compare the results of the analysis of the two methods for clustering student data related decision-making to determine the students are eligible to receive a Peningkatan PrestasiAkademik (PPA) scholarship and Bantuan Biaya Akademik (BBA) scholarship in STMIK Pringsewu. Cluster analysis was performed using IBM SPSS Version 23. Cluster Analysis results of both methods were compared using ...

  6. Efficient privacy preserving K-means clustering in a three-party setting

    NARCIS (Netherlands)

    Beye, Michael; Erkin, Zekeriya; Erkin, Zekeriya; Lagendijk, Reginald L.

    2011-01-01

    User clustering is a common operation in online social networks, for example to recommend new friends. In previous work [5], Erkin et al. proposed a privacy-preserving K-means clustering algorithm for the semi-honest model, using homomorphic encryption and multi-party computation. This paper makes

  7. K-means-clustering-based fiber nonlinearity equalization techniques for 64-QAM coherent optical communication system.

    Science.gov (United States)

    Zhang, Junfeng; Chen, Wei; Gao, Mingyi; Shen, Gangxiang

    2017-10-30

    In this work, we proposed two k-means-clustering-based algorithms to mitigate the fiber nonlinearity for 64-quadrature amplitude modulation (64-QAM) signal, the training-sequence assisted k-means algorithm and the blind k-means algorithm. We experimentally demonstrated the proposed k-means-clustering-based fiber nonlinearity mitigation techniques in 75-Gb/s 64-QAM coherent optical communication system. The proposed algorithms have reduced clustering complexity and low data redundancy and they are able to quickly find appropriate initial centroids and select correctly the centroids of the clusters to obtain the global optimal solutions for large k value. We measured the bit-error-ratio (BER) performance of 64-QAM signal with different launched powers into the 50-km single mode fiber and the proposed techniques can greatly mitigate the signal impairments caused by the amplified spontaneous emission noise and the fiber Kerr nonlinearity and improve the BER performance.

  8. Towards Enhancement of Performance of K-Means Clustering Using Nature-Inspired Optimization Algorithms

    Directory of Open Access Journals (Sweden)

    Simon Fong

    2014-01-01

    Full Text Available Traditional K-means clustering algorithms have the drawback of getting stuck at local optima that depend on the random values of initial centroids. Optimization algorithms have their advantages in guiding iterative computation to search for global optima while avoiding local optima. The algorithms help speed up the clustering process by converging into a global optimum early with multiple search agents in action. Inspired by nature, some contemporary optimization algorithms which include Ant, Bat, Cuckoo, Firefly, and Wolf search algorithms mimic the swarming behavior allowing them to cooperatively steer towards an optimal objective within a reasonable time. It is known that these so-called nature-inspired optimization algorithms have their own characteristics as well as pros and cons in different applications. When these algorithms are combined with K-means clustering mechanism for the sake of enhancing its clustering quality by avoiding local optima and finding global optima, the new hybrids are anticipated to produce unprecedented performance. In this paper, we report the results of our evaluation experiments on the integration of nature-inspired optimization methods into K-means algorithms. In addition to the standard evaluation metrics in evaluating clustering quality, the extended K-means algorithms that are empowered by nature-inspired optimization methods are applied on image segmentation as a case study of application scenario.

  9. Towards Enhancement of Performance of K-Means Clustering Using Nature-Inspired Optimization Algorithms

    Science.gov (United States)

    Deb, Suash; Yang, Xin-She

    2014-01-01

    Traditional K-means clustering algorithms have the drawback of getting stuck at local optima that depend on the random values of initial centroids. Optimization algorithms have their advantages in guiding iterative computation to search for global optima while avoiding local optima. The algorithms help speed up the clustering process by converging into a global optimum early with multiple search agents in action. Inspired by nature, some contemporary optimization algorithms which include Ant, Bat, Cuckoo, Firefly, and Wolf search algorithms mimic the swarming behavior allowing them to cooperatively steer towards an optimal objective within a reasonable time. It is known that these so-called nature-inspired optimization algorithms have their own characteristics as well as pros and cons in different applications. When these algorithms are combined with K-means clustering mechanism for the sake of enhancing its clustering quality by avoiding local optima and finding global optima, the new hybrids are anticipated to produce unprecedented performance. In this paper, we report the results of our evaluation experiments on the integration of nature-inspired optimization methods into K-means algorithms. In addition to the standard evaluation metrics in evaluating clustering quality, the extended K-means algorithms that are empowered by nature-inspired optimization methods are applied on image segmentation as a case study of application scenario. PMID:25202730

  10. The effect of mining data k-means clustering toward students profile model drop out potential

    Science.gov (United States)

    Purba, Windania; Tamba, Saut; Saragih, Jepronel

    2018-04-01

    The high of student success and the low of student failure can reflect the quality of a college. One of the factors of fail students was drop out. To solve the problem, so mining data with K-means Clustering was applied. K-Means Clustering method would be implemented to clustering the drop out students potentially. Firstly the the result data would be clustering to get the information of all students condition. Based on the model taken was found that students who potentially drop out because of the unexciting students in learning, unsupported parents, diffident students and less of students behavior time. The result of process of K-Means Clustering could known that students who more potentially drop out were in Cluster 1 caused Credit Total System, Quality Total, and the lowest Grade Point Average (GPA) compared between cluster 2 and 3.

  11. Extending the Functionality of Behavioural Change-Point Analysis with k-Means Clustering: A Case Study with the Little Penguin (Eudyptula minor)

    Science.gov (United States)

    Zhang, Jingjing; Dennis, Todd E.

    2015-01-01

    We present a simple framework for classifying mutually exclusive behavioural states within the geospatial lifelines of animals. This method involves use of three sequentially applied statistical procedures: (1) behavioural change point analysis to partition movement trajectories into discrete bouts of same-state behaviours, based on abrupt changes in the spatio-temporal autocorrelation structure of movement parameters; (2) hierarchical multivariate cluster analysis to determine the number of different behavioural states; and (3) k-means clustering to classify inferred bouts of same-state location observations into behavioural modes. We demonstrate application of the method by analysing synthetic trajectories of known ‘artificial behaviours’ comprised of different correlated random walks, as well as real foraging trajectories of little penguins (Eudyptula minor) obtained by global-positioning-system telemetry. Our results show that the modelling procedure correctly classified 92.5% of all individual location observations in the synthetic trajectories, demonstrating reasonable ability to successfully discriminate behavioural modes. Most individual little penguins were found to exhibit three unique behavioural states (resting, commuting/active searching, area-restricted foraging), with variation in the timing and locations of observations apparently related to ambient light, bathymetry, and proximity to coastlines and river mouths. Addition of k-means clustering extends the utility of behavioural change point analysis, by providing a simple means through which the behaviours inferred for the location observations comprising individual movement trajectories can be objectively classified. PMID:25922935

  12. Extending the Functionality of Behavioural Change-Point Analysis with k-Means Clustering: A Case Study with the Little Penguin (Eudyptula minor).

    Science.gov (United States)

    Zhang, Jingjing; O'Reilly, Kathleen M; Perry, George L W; Taylor, Graeme A; Dennis, Todd E

    2015-01-01

    We present a simple framework for classifying mutually exclusive behavioural states within the geospatial lifelines of animals. This method involves use of three sequentially applied statistical procedures: (1) behavioural change point analysis to partition movement trajectories into discrete bouts of same-state behaviours, based on abrupt changes in the spatio-temporal autocorrelation structure of movement parameters; (2) hierarchical multivariate cluster analysis to determine the number of different behavioural states; and (3) k-means clustering to classify inferred bouts of same-state location observations into behavioural modes. We demonstrate application of the method by analysing synthetic trajectories of known 'artificial behaviours' comprised of different correlated random walks, as well as real foraging trajectories of little penguins (Eudyptula minor) obtained by global-positioning-system telemetry. Our results show that the modelling procedure correctly classified 92.5% of all individual location observations in the synthetic trajectories, demonstrating reasonable ability to successfully discriminate behavioural modes. Most individual little penguins were found to exhibit three unique behavioural states (resting, commuting/active searching, area-restricted foraging), with variation in the timing and locations of observations apparently related to ambient light, bathymetry, and proximity to coastlines and river mouths. Addition of k-means clustering extends the utility of behavioural change point analysis, by providing a simple means through which the behaviours inferred for the location observations comprising individual movement trajectories can be objectively classified.

  13. Extending the Functionality of Behavioural Change-Point Analysis with k-Means Clustering: A Case Study with the Little Penguin (Eudyptula minor.

    Directory of Open Access Journals (Sweden)

    Jingjing Zhang

    Full Text Available We present a simple framework for classifying mutually exclusive behavioural states within the geospatial lifelines of animals. This method involves use of three sequentially applied statistical procedures: (1 behavioural change point analysis to partition movement trajectories into discrete bouts of same-state behaviours, based on abrupt changes in the spatio-temporal autocorrelation structure of movement parameters; (2 hierarchical multivariate cluster analysis to determine the number of different behavioural states; and (3 k-means clustering to classify inferred bouts of same-state location observations into behavioural modes. We demonstrate application of the method by analysing synthetic trajectories of known 'artificial behaviours' comprised of different correlated random walks, as well as real foraging trajectories of little penguins (Eudyptula minor obtained by global-positioning-system telemetry. Our results show that the modelling procedure correctly classified 92.5% of all individual location observations in the synthetic trajectories, demonstrating reasonable ability to successfully discriminate behavioural modes. Most individual little penguins were found to exhibit three unique behavioural states (resting, commuting/active searching, area-restricted foraging, with variation in the timing and locations of observations apparently related to ambient light, bathymetry, and proximity to coastlines and river mouths. Addition of k-means clustering extends the utility of behavioural change point analysis, by providing a simple means through which the behaviours inferred for the location observations comprising individual movement trajectories can be objectively classified.

  14. Automated spike sorting algorithm based on Laplacian eigenmaps and k-means clustering.

    Science.gov (United States)

    Chah, E; Hok, V; Della-Chiesa, A; Miller, J J H; O'Mara, S M; Reilly, R B

    2011-02-01

    This study presents a new automatic spike sorting method based on feature extraction by Laplacian eigenmaps combined with k-means clustering. The performance of the proposed method was compared against previously reported algorithms such as principal component analysis (PCA) and amplitude-based feature extraction. Two types of classifier (namely k-means and classification expectation-maximization) were incorporated within the spike sorting algorithms, in order to find a suitable classifier for the feature sets. Simulated data sets and in-vivo tetrode multichannel recordings were employed to assess the performance of the spike sorting algorithms. The results show that the proposed algorithm yields significantly improved performance with mean sorting accuracy of 73% and sorting error of 10% compared to PCA which combined with k-means had a sorting accuracy of 58% and sorting error of 10%.A correction was made to this article on 22 February 2011. The spacing of the title was amended on the abstract page. No changes were made to the article PDF and the print version was unaffected.

  15. An Enhanced K-Means Algorithm for Water Quality Analysis of The Haihe River in China.

    Science.gov (United States)

    Zou, Hui; Zou, Zhihong; Wang, Xiaojing

    2015-11-12

    The increase and the complexity of data caused by the uncertain environment is today's reality. In order to identify water quality effectively and reliably, this paper presents a modified fast clustering algorithm for water quality analysis. The algorithm has adopted a varying weights K-means cluster algorithm to analyze water monitoring data. The varying weights scheme was the best weighting indicator selected by a modified indicator weight self-adjustment algorithm based on K-means, which is named MIWAS-K-means. The new clustering algorithm avoids the margin of the iteration not being calculated in some cases. With the fast clustering analysis, we can identify the quality of water samples. The algorithm is applied in water quality analysis of the Haihe River (China) data obtained by the monitoring network over a period of eight years (2006-2013) with four indicators at seven different sites (2078 samples). Both the theoretical and simulated results demonstrate that the algorithm is efficient and reliable for water quality analysis of the Haihe River. In addition, the algorithm can be applied to more complex data matrices with high dimensionality.

  16. Implementation of K-Means Clustering Method for Electronic Learning Model

    Science.gov (United States)

    Latipa Sari, Herlina; Suranti Mrs., Dewi; Natalia Zulita, Leni

    2017-12-01

    Teaching and Learning process at SMK Negeri 2 Bengkulu Tengah has applied e-learning system for teachers and students. The e-learning was based on the classification of normative, productive, and adaptive subjects. SMK Negeri 2 Bengkulu Tengah consisted of 394 students and 60 teachers with 16 subjects. The record of e-learning database was used in this research to observe students’ activity pattern in attending class. K-Means algorithm in this research was used to classify students’ learning activities using e-learning, so that it was obtained cluster of students’ activity and improvement of student’s ability. Implementation of K-Means Clustering method for electronic learning model at SMK Negeri 2 Bengkulu Tengah was conducted by observing 10 students’ activities, namely participation of students in the classroom, submit assignment, view assignment, add discussion, view discussion, add comment, download course materials, view article, view test, and submit test. In the e-learning model, the testing was conducted toward 10 students that yielded 2 clusters of membership data (C1 and C2). Cluster 1: with membership percentage of 70% and it consisted of 6 members, namely 1112438 Anggi Julian, 1112439 Anis Maulita, 1112441 Ardi Febriansyah, 1112452 Berlian Sinurat, 1112460 Dewi Anugrah Anwar and 1112467 Eka Tri Oktavia Sari. Cluster 2:with membership percentage of 30% and it consisted of 4 members, namely 1112463 Dosita Afriyani, 1112471 Erda Novita, 1112474 Eskardi and 1112477 Fachrur Rozi.

  17. K-means clustering for support construction in diffractive imaging.

    Science.gov (United States)

    Hattanda, Shunsuke; Shioya, Hiroyuki; Maehara, Yosuke; Gohara, Kazutoshi

    2014-03-01

    A method for constructing an object support based on K-means clustering of the object-intensity distribution is newly presented in diffractive imaging. This releases the adjustment of unknown parameters in the support construction, and it is well incorporated with the Gerchberg and Saxton diagram. A simple numerical simulation reveals that the proposed method is effective for dynamically constructing the support without an initial prior support.

  18. A Fast Exact k-Nearest Neighbors Algorithm for High Dimensional Search Using k-Means Clustering and Triangle Inequality.

    Science.gov (United States)

    Wang, Xueyi

    2012-02-08

    The k-nearest neighbors (k-NN) algorithm is a widely used machine learning method that finds nearest neighbors of a test object in a feature space. We present a new exact k-NN algorithm called kMkNN (k-Means for k-Nearest Neighbors) that uses the k-means clustering and the triangle inequality to accelerate the searching for nearest neighbors in a high dimensional space. The kMkNN algorithm has two stages. In the buildup stage, instead of using complex tree structures such as metric trees, kd-trees, or ball-tree, kMkNN uses a simple k-means clustering method to preprocess the training dataset. In the searching stage, given a query object, kMkNN finds nearest training objects starting from the nearest cluster to the query object and uses the triangle inequality to reduce the distance calculations. Experiments show that the performance of kMkNN is surprisingly good compared to the traditional k-NN algorithm and tree-based k-NN algorithms such as kd-trees and ball-trees. On a collection of 20 datasets with up to 10(6) records and 10(4) dimensions, kMkNN shows a 2-to 80-fold reduction of distance calculations and a 2- to 60-fold speedup over the traditional k-NN algorithm for 16 datasets. Furthermore, kMkNN performs significant better than a kd-tree based k-NN algorithm for all datasets and performs better than a ball-tree based k-NN algorithm for most datasets. The results show that kMkNN is effective for searching nearest neighbors in high dimensional spaces.

  19. Reducing the time requirement of k-means algorithm.

    Science.gov (United States)

    Osamor, Victor Chukwudi; Adebiyi, Ezekiel Femi; Oyelade, Jelilli Olarenwaju; Doumbia, Seydou

    2012-01-01

    Traditional k-means and most k-means variants are still computationally expensive for large datasets, such as microarray data, which have large datasets with large dimension size d. In k-means clustering, we are given a set of n data points in d-dimensional space R(d) and an integer k. The problem is to determine a set of k points in R(d), called centers, so as to minimize the mean squared distance from each data point to its nearest center. In this work, we develop a novel k-means algorithm, which is simple but more efficient than the traditional k-means and the recent enhanced k-means. Our new algorithm is based on the recently established relationship between principal component analysis and the k-means clustering. We provided the correctness proof for this algorithm. Results obtained from testing the algorithm on three biological data and six non-biological data (three of these data are real, while the other three are simulated) also indicate that our algorithm is empirically faster than other known k-means algorithms. We assessed the quality of our algorithm clusters against the clusters of a known structure using the Hubert-Arabie Adjusted Rand index (ARI(HA)). We found that when k is close to d, the quality is good (ARI(HA)>0.8) and when k is not close to d, the quality of our new k-means algorithm is excellent (ARI(HA)>0.9). In this paper, emphases are on the reduction of the time requirement of the k-means algorithm and its application to microarray data due to the desire to create a tool for clustering and malaria research. However, the new clustering algorithm can be used for other clustering needs as long as an appropriate measure of distance between the centroids and the members is used. This has been demonstrated in this work on six non-biological data.

  20. k-Means has polynomial smoothed complexity

    NARCIS (Netherlands)

    Arthur, David; Manthey, Bodo; Röglin, Heiko; Spielman, D.A.

    2009-01-01

    The k-means method is one of the most widely used clustering algorithms, drawing its popularity from its speed in practice. Recently, however, it was shown to have exponential worst-case running time. In order to close the gap between practical performance and theoretical analysis, the k-means

  1. K-means cluster analysis of rehabilitation service users in the Home Health Care System of Ontario: examining the heterogeneity of a complex geriatric population.

    Science.gov (United States)

    Armstrong, Joshua J; Zhu, Mu; Hirdes, John P; Stolee, Paul

    2012-12-01

    To examine the heterogeneity of home care clients who use rehabilitation services by using the K-means algorithm to identify previously unknown patterns of clinical characteristics. Observational study of secondary data. Home care system. Assessment information was collected on 150,253 home care clients using the provincially mandated Resident Assessment Instrument-Home Care (RAI-HC) data system. Not applicable. Assessment information from every long-stay (>60 d) home care client that entered the home care system between 2005 and 2008 and used rehabilitation services within 3 months of their initial assessment was analyzed. The K-means clustering algorithm was applied using 37 variables from the RAI-HC assessment. The K-means cluster analysis resulted in the identification of 7 relatively homogeneous subgroups that differed on characteristics such as age, sex, cognition, and functional impairment. Client profiles were created to illustrate the diversity of this geriatric population. The K-means algorithm provided a useful way to segment a heterogeneous rehabilitation client population into more homogeneous subgroups. This analysis provides an enhanced understanding of client characteristics and needs, and could enable more appropriate targeting of rehabilitation services for home care clients. Copyright © 2012 American Congress of Rehabilitation Medicine. Published by Elsevier Inc. All rights reserved.

  2. Towards explaining the speed of k-means

    NARCIS (Netherlands)

    Manthey, Bodo; van de Pol, Jan Cornelis; Raamsdonk, F.; Stoelinga, Mariëlle Ida Antoinette

    2011-01-01

    The $k$-means method is a popular algorithm for clustering, known for its speed in practice. This stands in contrast to its exponential worst-case running-time. To explain the speed of the $k$-means method, a smoothed analysis has been conducted. We sketch this smoothed analysis and a generalization

  3. Single pass kernel k-means clustering method

    Indian Academy of Sciences (India)

    In unsupervised classification, kernel -means clustering method has been shown to perform better than conventional -means clustering method in ... 518501, India; Department of Computer Science and Engineering, Jawaharlal Nehru Technological University, Anantapur College of Engineering, Anantapur 515002, India ...

  4. Pharmacokinetic analysis and k-means clustering of DCEMR images for radiotherapy outcome prediction of advanced cervical cancers

    Energy Technology Data Exchange (ETDEWEB)

    Andersen, Erlend K. F. (Dept. of Medical Physics, The Norwegian Radium Hospital, Oslo Univ. Hospital, Oslo (Norway)), e-mail: eirik.malinen@fys.uio.no; Kristensen, Gunnar B. (Section for Gynaecological Oncology, The Norwegian Radium Hospital, Oslo Univ. Hospital, Oslo (Norway)); Lyng, Heidi (Dept. of Radiation Biology, The Norwegian Radium Hospital, Oslo Univ. Hospital, Oslo (Norway)); Malinen, Eirik (Dept. of Medical Physics, The Norwegian Radium Hospital, Oslo Univ. Hospital, Oslo (Norway); Dept. of Physics, Univ. of Oslo, Oslo (Norway))

    2011-08-15

    Introduction. Pharmacokinetic analysis of dynamic contrast enhanced magnetic resonance images (DCEMRI) allows for quantitative characterization of vascular properties of tumors. The aim of this study is twofold, first to determine if tumor regions with similar vascularization could be labeled by clustering methods, second to determine if the identified regions can be associated with local cancer relapse. Materials and methods. Eighty-one patients with locally advanced cervical cancer treated with chemoradiotherapy underwent DCEMRI with Gd-DTPA prior to external beam radiotherapy. The median follow-up time after treatment was four years, in which nine patients had primary tumor relapse. By fitting a pharmacokinetic two-compartment model function to the temporal contrast enhancement in the tumor, two pharmacokinetic parameters, Ktrans and u{sub e}, were estimated voxel by voxel from the DCEMR-images. Intratumoral regions with similar vascularization were identified by k-means clustering of the two pharmacokinetic parameter estimates over all patients. The volume fraction of each cluster was used to evaluate the prognostic value of the clusters. Results. Three clusters provided a sufficient reduction of the cluster variance to label different vascular properties within the tumors. The corresponding median volume fraction of each cluster was 38%, 46% and 10%. The second cluster was significantly associated with primary tumor control in a log-rank survival test (p-value: 0.042), showing a decreased risk of treatment failure for patients with high volume fraction of voxels. Conclusions. Intratumoral regions showing similar vascular properties could successfully be labeled in three distinct clusters and the volume fraction of one cluster region was associated with primary tumor control

  5. Hopfield-K-Means clustering algorithm: A proposal for the segmentation of electricity customers

    Energy Technology Data Exchange (ETDEWEB)

    Lopez, Jose J.; Aguado, Jose A.; Martin, F.; Munoz, F.; Rodriguez, A.; Ruiz, Jose E. [Department of Electrical Engineering, University of Malaga, C/ Dr. Ortiz Ramos, sn., Escuela de Ingenierias, 29071 Malaga (Spain)

    2011-02-15

    Customer classification aims at providing electric utilities with a volume of information to enable them to establish different types of tariffs. Several methods have been used to segment electricity customers, including, among others, the hierarchical clustering, Modified Follow the Leader and K-Means methods. These, however, entail problems with the pre-allocation of the number of clusters (Follow the Leader), randomness of the solution (K-Means) and improvement of the solution obtained (hierarchical algorithm). Another segmentation method used is Hopfield's autonomous recurrent neural network, although the solution obtained only guarantees that it is a local minimum. In this paper, we present the Hopfield-K-Means algorithm in order to overcome these limitations. This approach eliminates the randomness of the initial solution provided by K-Means based algorithms and it moves closer to the global optimun. The proposed algorithm is also compared against other customer segmentation and characterization techniques, on the basis of relative validation indexes. Finally, the results obtained by this algorithm with a set of 230 electricity customers (residential, industrial and administrative) are presented. (author)

  6. Hopfield-K-Means clustering algorithm: A proposal for the segmentation of electricity customers

    International Nuclear Information System (INIS)

    Lopez, Jose J.; Aguado, Jose A.; Martin, F.; Munoz, F.; Rodriguez, A.; Ruiz, Jose E.

    2011-01-01

    Customer classification aims at providing electric utilities with a volume of information to enable them to establish different types of tariffs. Several methods have been used to segment electricity customers, including, among others, the hierarchical clustering, Modified Follow the Leader and K-Means methods. These, however, entail problems with the pre-allocation of the number of clusters (Follow the Leader), randomness of the solution (K-Means) and improvement of the solution obtained (hierarchical algorithm). Another segmentation method used is Hopfield's autonomous recurrent neural network, although the solution obtained only guarantees that it is a local minimum. In this paper, we present the Hopfield-K-Means algorithm in order to overcome these limitations. This approach eliminates the randomness of the initial solution provided by K-Means based algorithms and it moves closer to the global optimun. The proposed algorithm is also compared against other customer segmentation and characterization techniques, on the basis of relative validation indexes. Finally, the results obtained by this algorithm with a set of 230 electricity customers (residential, industrial and administrative) are presented. (author)

  7. A New Approach to Identify High Burnout Medical Staffs by Kernel K-Means Cluster Analysis in a Regional Teaching Hospital in Taiwan.

    Science.gov (United States)

    Lee, Yii-Ching; Huang, Shian-Chang; Huang, Chih-Hsuan; Wu, Hsin-Hung

    2016-01-01

    This study uses kernel k-means cluster analysis to identify medical staffs with high burnout. The data collected in October to November 2014 are from the emotional exhaustion dimension of the Chinese version of Safety Attitudes Questionnaire in a regional teaching hospital in Taiwan. The number of effective questionnaires including the entire staffs such as physicians, nurses, technicians, pharmacists, medical administrators, and respiratory therapists is 680. The results show that 8 clusters are generated by kernel k-means method. Employees in clusters 1, 4, and 5 are relatively in good conditions, whereas employees in clusters 2, 3, 6, 7, and 8 need to be closely monitored from time to time because they have relatively higher degree of burnout. When employees with higher degree of burnout are identified, the hospital management can take actions to improve the resilience, reduce the potential medical errors, and, eventually, enhance the patient safety. This study also suggests that the hospital management needs to keep track of medical staffs' fatigue conditions and provide timely assistance for burnout recovery through employee assistance programs, mindfulness-based stress reduction programs, positivity currency buildup, and forming appreciative inquiry groups. © The Author(s) 2016.

  8. A New Approach to Identify High Burnout Medical Staffs by Kernel K-Means Cluster Analysis in a Regional Teaching Hospital in Taiwan

    Directory of Open Access Journals (Sweden)

    Yii-Ching Lee PhD

    2016-11-01

    Full Text Available This study uses kernel k-means cluster analysis to identify medical staffs with high burnout. The data collected in October to November 2014 are from the emotional exhaustion dimension of the Chinese version of Safety Attitudes Questionnaire in a regional teaching hospital in Taiwan. The number of effective questionnaires including the entire staffs such as physicians, nurses, technicians, pharmacists, medical administrators, and respiratory therapists is 680. The results show that 8 clusters are generated by kernel k-means method. Employees in clusters 1, 4, and 5 are relatively in good conditions, whereas employees in clusters 2, 3, 6, 7, and 8 need to be closely monitored from time to time because they have relatively higher degree of burnout. When employees with higher degree of burnout are identified, the hospital management can take actions to improve the resilience, reduce the potential medical errors, and, eventually, enhance the patient safety. This study also suggests that the hospital management needs to keep track of medical staffs’ fatigue conditions and provide timely assistance for burnout recovery through employee assistance programs, mindfulness-based stress reduction programs, positivity currency buildup, and forming appreciative inquiry groups.

  9. What to Do When K-Means Clustering Fails: A Simple yet Principled Alternative Algorithm.

    Science.gov (United States)

    Raykov, Yordan P; Boukouvalas, Alexis; Baig, Fahd; Little, Max A

    The K-means algorithm is one of the most popular clustering algorithms in current use as it is relatively fast yet simple to understand and deploy in practice. Nevertheless, its use entails certain restrictive assumptions about the data, the negative consequences of which are not always immediately apparent, as we demonstrate. While more flexible algorithms have been developed, their widespread use has been hindered by their computational and technical complexity. Motivated by these considerations, we present a flexible alternative to K-means that relaxes most of the assumptions, whilst remaining almost as fast and simple. This novel algorithm which we call MAP-DP (maximum a-posteriori Dirichlet process mixtures), is statistically rigorous as it is based on nonparametric Bayesian Dirichlet process mixture modeling. This approach allows us to overcome most of the limitations imposed by K-means. The number of clusters K is estimated from the data instead of being fixed a-priori as in K-means. In addition, while K-means is restricted to continuous data, the MAP-DP framework can be applied to many kinds of data, for example, binary, count or ordinal data. Also, it can efficiently separate outliers from the data. This additional flexibility does not incur a significant computational overhead compared to K-means with MAP-DP convergence typically achieved in the order of seconds for many practical problems. Finally, in contrast to K-means, since the algorithm is based on an underlying statistical model, the MAP-DP framework can deal with missing data and enables model testing such as cross validation in a principled way. We demonstrate the simplicity and effectiveness of this algorithm on the health informatics problem of clinical sub-typing in a cluster of diseases known as parkinsonism.

  10. Comparison of five cluster validity indices performance in brain [18 F]FET-PET image segmentation using k-means.

    Science.gov (United States)

    Abualhaj, Bedor; Weng, Guoyang; Ong, Melissa; Attarwala, Ali Asgar; Molina, Flavia; Büsing, Karen; Glatting, Gerhard

    2017-01-01

    Dynamic [ 18 F]fluoro-ethyl-L-tyrosine positron emission tomography ([ 18 F]FET-PET) is used to identify tumor lesions for radiotherapy treatment planning, to differentiate glioma recurrence from radiation necrosis and to classify gliomas grading. To segment different regions in the brain k-means cluster analysis can be used. The main disadvantage of k-means is that the number of clusters must be pre-defined. In this study, we therefore compared different cluster validity indices for automated and reproducible determination of the optimal number of clusters based on the dynamic PET data. The k-means algorithm was applied to dynamic [ 18 F]FET-PET images of 8 patients. Akaike information criterion (AIC), WB, I, modified Dunn's and Silhouette indices were compared on their ability to determine the optimal number of clusters based on requirements for an adequate cluster validity index. To check the reproducibility of k-means, the coefficients of variation CVs of the objective function values OFVs (sum of squared Euclidean distances within each cluster) were calculated using 100 random centroid initialization replications RCI 100 for 2 to 50 clusters. k-means was performed independently on three neighboring slices containing tumor for each patient to investigate the stability of the optimal number of clusters within them. To check the independence of the validity indices on the number of voxels, cluster analysis was applied after duplication of a slice selected from each patient. CVs of index values were calculated at the optimal number of clusters using RCI 100 to investigate the reproducibility of the validity indices. To check if the indices have a single extremum, visual inspection was performed on the replication with minimum OFV from RCI 100 . The maximum CV of OFVs was 2.7 × 10 -2 from all patients. The optimal number of clusters given by modified Dunn's and Silhouette indices was 2 or 3 leading to a very poor segmentation. WB and I indices suggested in

  11. The k-means clustering technique: General considerations and implementation in Mathematica

    Directory of Open Access Journals (Sweden)

    Laurence Morissette

    2013-02-01

    Full Text Available Data clustering techniques are valuable tools for researchers working with large databases of multivariate data. In this tutorial, we present a simple yet powerful one: the k-means clustering technique, through three different algorithms: the Forgy/Lloyd, algorithm, the MacQueen algorithm and the Hartigan and Wong algorithm. We then present an implementation in Mathematica and various examples of the different options available to illustrate the application of the technique.

  12. Balanced Cluster Head Selection Based on Modified k-Means in a Distributed Wireless Sensor Network

    OpenAIRE

    Periyasamy, Sasikumar; Khara, Sibaram; Thangavelu, Shankar

    2016-01-01

    A major problem with Wireless Sensor Networks (WSNs) is the maximization of effective network lifetime through minimization of energy usage in the network nodes. A modified k-means (Mk-means) algorithm for clustering was proposed which includes three cluster heads (simultaneously chosen) for each cluster. These cluster heads (CHs) use a load sharing mechanism to rotate as the active cluster head, which conserves residual energy of the nodes, thereby extending network lifetime. Moreover, it re...

  13. A K-means multivariate approach for clustering independent components from magnetoencephalographic data.

    Science.gov (United States)

    Spadone, Sara; de Pasquale, Francesco; Mantini, Dante; Della Penna, Stefania

    2012-09-01

    Independent component analysis (ICA) is typically applied on functional magnetic resonance imaging, electroencephalographic and magnetoencephalographic (MEG) data due to its data-driven nature. In these applications, ICA needs to be extended from single to multi-session and multi-subject studies for interpreting and assigning a statistical significance at the group level. Here a novel strategy for analyzing MEG independent components (ICs) is presented, Multivariate Algorithm for Grouping MEG Independent Components K-means based (MAGMICK). The proposed approach is able to capture spatio-temporal dynamics of brain activity in MEG studies by running ICA at subject level and then clustering the ICs across sessions and subjects. Distinctive features of MAGMICK are: i) the implementation of an efficient set of "MEG fingerprints" designed to summarize properties of MEG ICs as they are built on spatial, temporal and spectral parameters; ii) the implementation of a modified version of the standard K-means procedure to improve its data-driven character. This algorithm groups the obtained ICs automatically estimating the number of clusters through an adaptive weighting of the parameters and a constraint on the ICs independence, i.e. components coming from the same session (at subject level) or subject (at group level) cannot be grouped together. The performances of MAGMICK are illustrated by analyzing two sets of MEG data obtained during a finger tapping task and median nerve stimulation. The results demonstrate that the method can extract consistent patterns of spatial topography and spectral properties across sessions and subjects that are in good agreement with the literature. In addition, these results are compared to those from a modified version of affinity propagation clustering method. The comparison, evaluated in terms of different clustering validity indices, shows that our methodology often outperforms the clustering algorithm. Eventually, these results are

  14. Segmentation of Mushroom and Cap width Measurement using Modified K-Means Clustering Algorithm

    Directory of Open Access Journals (Sweden)

    Eser Sert

    2014-01-01

    Full Text Available Mushroom is one of the commonly consumed foods. Image processing is one of the effective way for examination of visual features and detecting the size of a mushroom. We developed software for segmentation of a mushroom in a picture and also to measure the cap width of the mushroom. K-Means clustering method is used for the process. K-Means is one of the most successful clustering methods. In our study we customized the algorithm to get the best result and tested the algorithm. In the system, at first mushroom picture is filtered, histograms are balanced and after that segmentation is performed. Results provided that customized algorithm performed better segmentation than classical K-Means algorithm. Tests performed on the designed software showed that segmentation on complex background pictures is performed with high accuracy, and 20 mushrooms caps are measured with 2.281 % relative error.

  15. GLOBAL CLASSIFICATION OF DERMATITIS DISEASE WITH K-MEANS CLUSTERING IMAGE SEGMENTATION METHODS

    OpenAIRE

    Prafulla N. Aerkewar1 & Dr. G. H. Agrawal2

    2018-01-01

    The objective of this paper to presents a global technique for classification of different dermatitis disease lesions using the process of k-Means clustering image segmentation method. The word global is used such that the all dermatitis disease having skin lesion on body are classified in to four category using k-means image segmentation and nntool of Matlab. Through the image segmentation technique and nntool can be analyze and study the segmentation properties of skin lesions occurs in...

  16. Conveyor Performance based on Motor DC 12 Volt Eg-530ad-2f using K-Means Clustering

    Science.gov (United States)

    Arifin, Zaenal; Artini, Sri DP; Much Ibnu Subroto, Imam

    2017-04-01

    To produce goods in industry, a controlled tool to improve production is required. Separation process has become a part of production process. Separation process is carried out based on certain criteria to get optimum result. By knowing the characteristics performance of a controlled tools in separation process the optimum results is also possible to be obtained. Clustering analysis is popular method for clustering data into smaller segments. Clustering analysis is useful to divide a group of object into a k-group in which the member value of the group is homogeny or similar. Similarity in the group is set based on certain criteria. The work in this paper based on K-Means method to conduct clustering of loading in the performance of a conveyor driven by a dc motor 12 volt eg-530-2f. This technique gives a complete clustering data for a prototype of conveyor driven by dc motor to separate goods in term of height. The parameters involved are voltage, current, time of travelling. These parameters give two clusters namely optimal cluster with center of cluster 10.50 volt, 0.3 Ampere, 10.58 second, and unoptimal cluster with center of cluster 10.88 volt, 0.28 Ampere and 40.43 second.

  17. Location and Size Planning of Distributed Photovoltaic Generation in Distribution network System Based on K-means Clustering Analysis

    Science.gov (United States)

    Lu, Siqi; Wang, Xiaorong; Wu, Junyong

    2018-01-01

    The paper presents a method to generate the planning scenarios, which is based on K-means clustering analysis algorithm driven by data, for the location and size planning of distributed photovoltaic (PV) units in the network. Taken the power losses of the network, the installation and maintenance costs of distributed PV, the profit of distributed PV and the voltage offset as objectives and the locations and sizes of distributed PV as decision variables, Pareto optimal front is obtained through the self-adaptive genetic algorithm (GA) and solutions are ranked by a method called technique for order preference by similarity to an ideal solution (TOPSIS). Finally, select the planning schemes at the top of the ranking list based on different planning emphasis after the analysis in detail. The proposed method is applied to a 10-kV distribution network in Gansu Province, China and the results are discussed.

  18. AN EFFICIENT INITIALIZATION METHOD FOR K-MEANS CLUSTERING OF HYPERSPECTRAL DATA

    Directory of Open Access Journals (Sweden)

    A. Alizade Naeini

    2014-10-01

    Full Text Available K-means is definitely the most frequently used partitional clustering algorithm in the remote sensing community. Unfortunately due to its gradient decent nature, this algorithm is highly sensitive to the initial placement of cluster centers. This problem deteriorates for the high-dimensional data such as hyperspectral remotely sensed imagery. To tackle this problem, in this paper, the spectral signatures of the endmembers in the image scene are extracted and used as the initial positions of the cluster centers. For this purpose, in the first step, A Neyman–Pearson detection theory based eigen-thresholding method (i.e., the HFC method has been employed to estimate the number of endmembers in the image. Afterwards, the spectral signatures of the endmembers are obtained using the Minimum Volume Enclosing Simplex (MVES algorithm. Eventually, these spectral signatures are used to initialize the k-means clustering algorithm. The proposed method is implemented on a hyperspectral dataset acquired by ROSIS sensor with 103 spectral bands over the Pavia University campus, Italy. For comparative evaluation, two other commonly used initialization methods (i.e., Bradley & Fayyad (BF and Random methods are implemented and compared. The confusion matrix, overall accuracy and Kappa coefficient are employed to assess the methods’ performance. The evaluations demonstrate that the proposed solution outperforms the other initialization methods and can be applied for unsupervised classification of hyperspectral imagery for landcover mapping.

  19. K-mean clustering algorithm for processing signals from compound semiconductor detectors

    International Nuclear Information System (INIS)

    Tada, Tsutomu; Hitomi, Keitaro; Wu, Yan; Kim, Seong-Yun; Yamazaki, Hiromichi; Ishii, Keizo

    2011-01-01

    The K-mean clustering algorithm was employed for processing signal waveforms from TlBr detectors. The signal waveforms were classified based on its shape reflecting the charge collection process in the detector. The classified signal waveforms were processed individually to suppress the pulse height variation of signals due to the charge collection loss. The obtained energy resolution of a 137 Cs spectrum measured with a 0.5 mm thick TlBr detector was 1.3% FWHM by employing 500 clusters.

  20. Identification of spatiotemporal nutrient patterns in a coastal bay via an integrated k-means clustering and gravity model.

    Science.gov (United States)

    Chang, Ni-Bin; Wimberly, Brent; Xuan, Zhemin

    2012-03-01

    This study presents an integrated k-means clustering and gravity model (IKCGM) for investigating the spatiotemporal patterns of nutrient and associated dissolved oxygen levels in Tampa Bay, Florida. By using a k-means clustering analysis to first partition the nutrient data into a user-specified number of subsets, it is possible to discover the spatiotemporal patterns of nutrient distribution in the bay and capture the inherent linkages of hydrodynamic and biogeochemical features. Such patterns may then be combined with a gravity model to link the nutrient source contribution from each coastal watershed to the generated clusters in the bay to aid in the source proportion analysis for environmental management. The clustering analysis was carried out based on 1 year (2008) water quality data composed of 55 sample stations throughout Tampa Bay collected by the Environmental Protection Commission of Hillsborough County. In addition, hydrological and river water quality data of the same year were acquired from the United States Geological Survey's National Water Information System to support the gravity modeling analysis. The results show that the k-means model with 8 clusters is the optimal choice, in which cluster 2 at Lower Tampa Bay had the minimum values of total nitrogen (TN) concentrations, chlorophyll a (Chl-a) concentrations, and ocean color values in every season as well as the minimum concentration of total phosphorus (TP) in three consecutive seasons in 2008. The datasets indicate that Lower Tampa Bay is an area with limited nutrient input throughout the year. Cluster 5, located in Middle Tampa Bay, displayed elevated TN concentrations, ocean color values, and Chl-a concentrations, suggesting that high values of colored dissolved organic matter are linked with some nutrient sources. The data presented by the gravity modeling analysis indicate that the Alafia River Basin is the major contributor of nutrients in terms of both TP and TN values in all seasons

  1. MEMANFAATKAN ALGORITMA K-MEANS DALAM MENENTUKAN PEGAWAI YANG LAYAK MENGIKUTI ASESSMENT CENTER UNTUK CLUSTERING PROGRAM SDP

    Directory of Open Access Journals (Sweden)

    Iin Parlina

    2018-01-01

    Full Text Available Data mining merupakan teknik pengolahan data dalam jumlah besar untuk pengelompokan. Teknik Data mining mempunyai beberapa metode dalam  mengelompokkan salah satu teknik yang dipakai penulis saat ini adalah K-Means. Dalam hal ini penulis mengelompokan data daftar program SDP tahun 2017 untuk mengetahui manakah pegawai yang layak lolos dalam program SDP sehingga dapat melakukan Registrasi Asessment Center. Pengelompokan tersebut berdasarkan kriteria – kriteria data Program SDP. Pada penelitian ini, penulis menerapkan algoritma K-Means Clustering untuk pengelompokan data Program SDP di PT.Bank Syariah. Dalam hal ini, pada umumnya untuk memamasuki program SDP tersebut disesuaikan dengan ketentuan dan parameter Program SDP saja, namun dalam penelitian ini pengelompokan disesuaikan dengan kriteria – kriteria Program SDP seperti kedisiplinan pegawai, Target Kerja Pegawai, Kepatuhan Program SDP. Penulis menggunakan beberapa kriteria tersebut agar pengelompokan yang dihasilkan menjadi lebih optimal. Tujuan dari pengelompokan ini adalah terbentuknya kelompok SDP pada Program SDP yang menggunakan algoritma K-Means clustering. Hasil dari pengelompokan tersebut diperoleh tiga kelompok yaitu kelompok Lolos, Hampir Lolos dan Tidak Lolos. Terdapat pusat cluster dengan Cluster-1= 8;66;13, Cluster-2= 10;71;14 dan Cluster-3=7;60;12. Pusat cluster tersebut didapat dari beberapa iterasi sehingga mengahasilakan pusat cluster yang optimal.

  2. An additional k-means clustering step improves the biological features of WGCNA gene co-expression networks.

    Science.gov (United States)

    Botía, Juan A; Vandrovcova, Jana; Forabosco, Paola; Guelfi, Sebastian; D'Sa, Karishma; Hardy, John; Lewis, Cathryn M; Ryten, Mina; Weale, Michael E

    2017-04-12

    Weighted Gene Co-expression Network Analysis (WGCNA) is a widely used R software package for the generation of gene co-expression networks (GCN). WGCNA generates both a GCN and a derived partitioning of clusters of genes (modules). We propose k-means clustering as an additional processing step to conventional WGCNA, which we have implemented in the R package km2gcn (k-means to gene co-expression network, https://github.com/juanbot/km2gcn ). We assessed our method on networks created from UKBEC data (10 different human brain tissues), on networks created from GTEx data (42 human tissues, including 13 brain tissues), and on simulated networks derived from GTEx data. We observed substantially improved module properties, including: (1) few or zero misplaced genes; (2) increased counts of replicable clusters in alternate tissues (x3.1 on average); (3) improved enrichment of Gene Ontology terms (seen in 48/52 GCNs) (4) improved cell type enrichment signals (seen in 21/23 brain GCNs); and (5) more accurate partitions in simulated data according to a range of similarity indices. The results obtained from our investigations indicate that our k-means method, applied as an adjunct to standard WGCNA, results in better network partitions. These improved partitions enable more fruitful downstream analyses, as gene modules are more biologically meaningful.

  3. Further heuristics for $k$-means: The merge-and-split heuristic and the $(k,l)$-means

    OpenAIRE

    Nielsen, Frank; Nock, Richard

    2014-01-01

    Finding the optimal $k$-means clustering is NP-hard in general and many heuristics have been designed for minimizing monotonically the $k$-means objective. We first show how to extend Lloyd's batched relocation heuristic and Hartigan's single-point relocation heuristic to take into account empty-cluster and single-point cluster events, respectively. Those events tend to increasingly occur when $k$ or $d$ increases, or when performing several restarts. First, we show that those special events ...

  4. Estimating Single and Multiple Target Locations Using K-Means Clustering with Radio Tomographic Imaging in Wireless Sensor Networks

    Science.gov (United States)

    2015-03-26

    clustering is an algorithm that has been used in data mining applications such as machine learning applications , pattern recognition, hyper-spectral imagery...42 3.7.2 Application of K-means Clustering . . . . . . . . . . . . . . . . . 42 3.8 Experiment Design...Tomographic Imaging WLAN Wireless Local Area Networks WSN Wireless Sensor Network xx ESTIMATING SINGLE AND MULTIPLE TARGET LOCATIONS USING K-MEANS CLUSTERING

  5. Classification of Two Class Motor Imagery Tasks Using Hybrid GA-PSO Based K-Means Clustering.

    Science.gov (United States)

    Suraj; Tiwari, Purnendu; Ghosh, Subhojit; Sinha, Rakesh Kumar

    2015-01-01

    Transferring the brain computer interface (BCI) from laboratory condition to meet the real world application needs BCI to be applied asynchronously without any time constraint. High level of dynamism in the electroencephalogram (EEG) signal reasons us to look toward evolutionary algorithm (EA). Motivated by these two facts, in this work a hybrid GA-PSO based K-means clustering technique has been used to distinguish two class motor imagery (MI) tasks. The proposed hybrid GA-PSO based K-means clustering is found to outperform genetic algorithm (GA) and particle swarm optimization (PSO) based K-means clustering techniques in terms of both accuracy and execution time. The lesser execution time of hybrid GA-PSO technique makes it suitable for real time BCI application. Time frequency representation (TFR) techniques have been used to extract the feature of the signal under investigation. TFRs based features are extracted and relying on the concept of event related synchronization (ERD) and desynchronization (ERD) feature vector is formed.

  6. [Research on K-means clustering segmentation method for MRI brain image based on selecting multi-peaks in gray histogram].

    Science.gov (United States)

    Chen, Zhaoxue; Yu, Haizhong; Chen, Hao

    2013-12-01

    To solve the problem of traditional K-means clustering in which initial clustering centers are selected randomly, we proposed a new K-means segmentation algorithm based on robustly selecting 'peaks' standing for White Matter, Gray Matter and Cerebrospinal Fluid in multi-peaks gray histogram of MRI brain image. The new algorithm takes gray value of selected histogram 'peaks' as the initial K-means clustering center and can segment the MRI brain image into three parts of tissue more effectively, accurately, steadily and successfully. Massive experiments have proved that the proposed algorithm can overcome many shortcomings caused by traditional K-means clustering method such as low efficiency, veracity, robustness and time consuming. The histogram 'peak' selecting idea of the proposed segmentootion method is of more universal availability.

  7. Optimized data fusion for K-means Laplacian clustering

    Science.gov (United States)

    Yu, Shi; Liu, Xinhai; Tranchevent, Léon-Charles; Glänzel, Wolfgang; Suykens, Johan A. K.; De Moor, Bart; Moreau, Yves

    2011-01-01

    Motivation: We propose a novel algorithm to combine multiple kernels and Laplacians for clustering analysis. The new algorithm is formulated on a Rayleigh quotient objective function and is solved as a bi-level alternating minimization procedure. Using the proposed algorithm, the coefficients of kernels and Laplacians can be optimized automatically. Results: Three variants of the algorithm are proposed. The performance is systematically validated on two real-life data fusion applications. The proposed Optimized Kernel Laplacian Clustering (OKLC) algorithms perform significantly better than other methods. Moreover, the coefficients of kernels and Laplacians optimized by OKLC show some correlation with the rank of performance of individual data source. Though in our evaluation the K values are predefined, in practical studies, the optimal cluster number can be consistently estimated from the eigenspectrum of the combined kernel Laplacian matrix. Availability: The MATLAB code of algorithms implemented in this paper is downloadable from http://homes.esat.kuleuven.be/~sistawww/bioi/syu/oklc.html. Contact: shiyu@uchicago.edu Supplementary information: Supplementary data are available at Bioinformatics online. PMID:20980271

  8. A new locally weighted K-means for cancer-aided microarray data analysis.

    Science.gov (United States)

    Iam-On, Natthakan; Boongoen, Tossapon

    2012-11-01

    Cancer has been identified as the leading cause of death. It is predicted that around 20-26 million people will be diagnosed with cancer by 2020. With this alarming rate, there is an urgent need for a more effective methodology to understand, prevent and cure cancer. Microarray technology provides a useful basis of achieving this goal, with cluster analysis of gene expression data leading to the discrimination of patients, identification of possible tumor subtypes and individualized treatment. Amongst clustering techniques, k-means is normally chosen for its simplicity and efficiency. However, it does not account for the different importance of data attributes. This paper presents a new locally weighted extension of k-means, which has proven more accurate across many published datasets than the original and other extensions found in the literature.

  9. Group analyses of connectivity-based cortical parcellation using repeated k-means clustering

    NARCIS (Netherlands)

    Nanetti, Luca; Cerliani, Leonardo; Gazzola, Valeria; Renken, Remco; Keysers, Christian

    2009-01-01

    K-means clustering has become a popular tool for connectivity-based cortical segmentation using Diffusion Weighted Imaging (DWI) data. A sometimes ignored issue is, however, that the output of the algorithm depends on the initial placement of starting points, and that different sets of starting

  10. Elastic K-means using posterior probability.

    Science.gov (United States)

    Zheng, Aihua; Jiang, Bo; Li, Yan; Zhang, Xuehan; Ding, Chris

    2017-01-01

    The widely used K-means clustering is a hard clustering algorithm. Here we propose a Elastic K-means clustering model (EKM) using posterior probability with soft capability where each data point can belong to multiple clusters fractionally and show the benefit of proposed Elastic K-means. Furthermore, in many applications, besides vector attributes information, pairwise relations (graph information) are also available. Thus we integrate EKM with Normalized Cut graph clustering into a single clustering formulation. Finally, we provide several useful matrix inequalities which are useful for matrix formulations of learning models. Based on these results, we prove the correctness and the convergence of EKM algorithms. Experimental results on six benchmark datasets demonstrate the effectiveness of proposed EKM and its integrated model.

  11. Utility of the k-means clustering algorithm in differentiating apparent diffusion coefficient values of benign and malignant neck pathologies.

    Science.gov (United States)

    Srinivasan, A; Galbán, C J; Johnson, T D; Chenevert, T L; Ross, B D; Mukherji, S K

    2010-04-01

    Does the K-means algorithm do a better job of differentiating benign and malignant neck pathologies compared to only mean ADC? The objective of our study was to analyze the differences between ADC partitions to evaluate whether the K-means technique can be of additional benefit to whole-lesion mean ADC alone in distinguishing benign and malignant neck pathologies. MR imaging studies of 10 benign and 10 malignant proved neck pathologies were postprocessed on a PC by using in-house software developed in Matlab. Two neuroradiologists manually contoured the lesions, with the ADC values within each lesion clustered into 2 (low, ADC-ADC(L); high, ADC-ADC(H)) and 3 partitions (ADC(L); intermediate, ADC-ADC(I); ADC(H)) by using the K-means clustering algorithm. An unpaired 2-tailed Student t test was performed for all metrics to determine statistical differences in the means of the benign and malignant pathologies. A statistically significant difference between the mean ADC(L) clusters in benign and malignant pathologies was seen in the 3-cluster models of both readers (P = .03 and .022, respectively) and the 2-cluster model of reader 2 (P = .04), with the other metrics (ADC(H), ADC(I); whole-lesion mean ADC) not revealing any significant differences. ROC curves demonstrated the quantitative differences in mean ADC(H) and ADC(L) in both the 2- and 3-cluster models to be predictive of malignancy (2 clusters: P = .008, area under curve = 0.850; 3 clusters: P = .01, area under curve = 0.825). The K-means clustering algorithm that generates partitions of large datasets may provide a better characterization of neck pathologies and may be of additional benefit in distinguishing benign and malignant neck pathologies compared with whole-lesion mean ADC alone.

  12. Complex time series analysis of PM10 and PM2.5 for a coastal site using artificial neural network modelling and k-means clustering

    Science.gov (United States)

    Elangasinghe, M. A.; Singhal, N.; Dirks, K. N.; Salmond, J. A.; Samarasinghe, S.

    2014-09-01

    This paper uses artificial neural networks (ANN), combined with k-means clustering, to understand the complex time series of PM10 and PM2.5 concentrations at a coastal location of New Zealand based on data from a single site. Out of available meteorological parameters from the network (wind speed, wind direction, solar radiation, temperature, relative humidity), key factors governing the pattern of the time series concentrations were identified through input sensitivity analysis performed on the trained neural network model. The transport pathways of particulate matter under these key meteorological parameters were further analysed through bivariate concentration polar plots and k-means clustering techniques. The analysis shows that the external sources such as marine aerosols and local sources such as traffic and biomass burning contribute equally to the particulate matter concentrations at the study site. These results are in agreement with the results of receptor modelling by the Auckland Council based on Positive Matrix Factorization (PMF). Our findings also show that contrasting concentration-wind speed relationships exist between marine aerosols and local traffic sources resulting in very noisy and seemingly large random PM10 concentrations. The inclusion of cluster rankings as an input parameter to the ANN model showed a statistically significant (p advanced air dispersion models.

  13. Utility of K-Means clustering algorithm in differentiating apparent diffusion coefficient values between benign and malignant neck pathologies

    Science.gov (United States)

    Srinivasan, A.; Galbán, C.J.; Johnson, T.D.; Chenevert, T.L.; Ross, B.D.; Mukherji, S.K.

    2014-01-01

    Purpose The objective of our study was to analyze the differences between apparent diffusion coefficient (ADC) partitions (created using the K-Means algorithm) between benign and malignant neck lesions and evaluate its benefit in distinguishing these entities. Material and methods MRI studies of 10 benign and 10 malignant proven neck pathologies were post-processed on a PC using in-house software developed in MATLAB (The MathWorks, Inc., Natick, MA). Lesions were manually contoured by two neuroradiologists with the ADC values within each lesion clustered into two (low ADC-ADCL, high ADC-ADCH) and three partitions (ADCL, intermediate ADC-ADCI, ADCH) using the K-Means clustering algorithm. An unpaired two-tailed Student’s t-test was performed for all metrics to determine statistical differences in the means between the benign and malignant pathologies. Results Statistically significant difference between the mean ADCL clusters in benign and malignant pathologies was seen in the 3 cluster models of both readers (p=0.03, 0.022 respectively) and the 2 cluster model of reader 2 (p=0.04) with the other metrics (ADCH, ADCI, whole lesion mean ADC) not revealing any significant differences. Receiver operating characteristics curves demonstrated the quantitative difference in mean ADCH and ADCL in both the 2 and 3 cluster models to be predictive of malignancy (2 clusters: p=0.008, area under curve=0.850, 3 clusters: p=0.01, area under curve=0.825). Conclusion The K-Means clustering algorithm that generates partitions of large datasets may provide a better characterization of neck pathologies and may be of additional benefit in distinguishing benign and malignant neck pathologies compared to whole lesion mean ADC alone. PMID:20007723

  14. Analysis of Home Energy Consumption by K-Mean

    Directory of Open Access Journals (Sweden)

    Fahad Razaque

    2017-10-01

    Full Text Available The smart meter offered exceptional chances to well comprehend energy consumption manners in which quantity of data being generated. One request was the separation of energy load-profiles into clusters of related conduct. The Research measured the resemblance between groups them together and load-profiles into clusters by k-means clustering algorithm. The cluster met, also called “Gender (Male/Female, House (Rented/Owned and customers status (Satisfied/Unsatisfied” display methods of consuming energy. It provided value information aimed at utilities to generate specific electricity charges and healthier aim energy efficiency programs. The results show that 43% extremely dissatisfied of energy customer is achieved by using energy consumption.

  15. Determining the k in k-means with MapReduce

    OpenAIRE

    Debatty , Thibault; Michiardi , Pietro; Mees , Wim; Thonnard , Olivier

    2014-01-01

    International audience; In this paper we propose a MapReduce implementation of G-means, a variant of k-means that is able to automatically determine k, the number of clusters. We show that our implementation scales to very large datasets and very large values of k, as the computation cost is proportional to nk. Other techniques that run a clustering algorithm with different values of k and choose the value of k that provides the " best " results have a computation cost that is proportional to...

  16. An improved initialization center k-means clustering algorithm based on distance and density

    Science.gov (United States)

    Duan, Yanling; Liu, Qun; Xia, Shuyin

    2018-04-01

    Aiming at the problem of the random initial clustering center of k means algorithm that the clustering results are influenced by outlier data sample and are unstable in multiple clustering, a method of central point initialization method based on larger distance and higher density is proposed. The reciprocal of the weighted average of distance is used to represent the sample density, and the data sample with the larger distance and the higher density are selected as the initial clustering centers to optimize the clustering results. Then, a clustering evaluation method based on distance and density is designed to verify the feasibility of the algorithm and the practicality, the experimental results on UCI data sets show that the algorithm has a certain stability and practicality.

  17. IP2P K-means: an efficient method for data clustering on sensor networks

    Directory of Open Access Journals (Sweden)

    Peyman Mirhadi

    2013-03-01

    Full Text Available Many wireless sensor network applications require data gathering as the most important parts of their operations. There are increasing demands for innovative methods to improve energy efficiency and to prolong the network lifetime. Clustering is considered as an efficient topology control methods in wireless sensor networks, which can increase network scalability and lifetime. This paper presents a method, IP2P K-means – Improved P2P K-means, which uses efficient leveling in clustering approach, reduces false labeling and restricts the necessary communication among various sensors, which obviously saves more energy. The proposed method is examined in Network Simulator Ver.2 (NS2 and the preliminary results show that the algorithm works effectively and relatively more precisely.

  18. Linear regression models and k-means clustering for statistical analysis of fNIRS data.

    Science.gov (United States)

    Bonomini, Viola; Zucchelli, Lucia; Re, Rebecca; Ieva, Francesca; Spinelli, Lorenzo; Contini, Davide; Paganoni, Anna; Torricelli, Alessandro

    2015-02-01

    We propose a new algorithm, based on a linear regression model, to statistically estimate the hemodynamic activations in fNIRS data sets. The main concern guiding the algorithm development was the minimization of assumptions and approximations made on the data set for the application of statistical tests. Further, we propose a K-means method to cluster fNIRS data (i.e. channels) as activated or not activated. The methods were validated both on simulated and in vivo fNIRS data. A time domain (TD) fNIRS technique was preferred because of its high performances in discriminating cortical activation and superficial physiological changes. However, the proposed method is also applicable to continuous wave or frequency domain fNIRS data sets.

  19. An implementation of the relational k-means algorithm

    OpenAIRE

    Szalkai, Balázs

    2013-01-01

    A C# implementation of a generalized k-means variant called relational k-means is described here. Relational k-means is a generalization of the well-known k-means clustering method which works for non-Euclidean scenarios as well. The input is an arbitrary distance matrix, as opposed to the traditional k-means method, where the clustered objects need to be identified with vectors.

  20. Prioritizing the risk of plant pests by clustering methods; self-organising maps, k-means and hierarchical clustering

    Directory of Open Access Journals (Sweden)

    Susan Worner

    2013-09-01

    Full Text Available For greater preparedness, pest risk assessors are required to prioritise long lists of pest species with potential to establish and cause significant impact in an endangered area. Such prioritization is often qualitative, subjective, and sometimes biased, relying mostly on expert and stakeholder consultation. In recent years, cluster based analyses have been used to investigate regional pest species assemblages or pest profiles to indicate the risk of new organism establishment. Such an approach is based on the premise that the co-occurrence of well-known global invasive pest species in a region is not random, and that the pest species profile or assemblage integrates complex functional relationships that are difficult to tease apart. In other words, the assemblage can help identify and prioritise species that pose a threat in a target region. A computational intelligence method called a Kohonen self-organizing map (SOM, a type of artificial neural network, was the first clustering method applied to analyse assemblages of invasive pests. The SOM is a well known dimension reduction and visualization method especially useful for high dimensional data that more conventional clustering methods may not analyse suitably. Like all clustering algorithms, the SOM can give details of clusters that identify regions with similar pest assemblages, possible donor and recipient regions. More important, however SOM connection weights that result from the analysis can be used to rank the strength of association of each species within each regional assemblage. Species with high weights that are not already established in the target region are identified as high risk. However, the SOM analysis is only the first step in a process to assess risk to be used alongside or incorporated within other measures. Here we illustrate the application of SOM analyses in a range of contexts in invasive species risk assessment, and discuss other clustering methods such as k-means

  1. Sistem Pendukung Keputusan Pemilihan Line-up Pemain Sepak Bola Menggunakan Metode Fuzzy Multiple Attribute Decision Making dan K-Means Clustering

    Directory of Open Access Journals (Sweden)

    Aldi Nurzahputra

    2017-07-01

    Full Text Available In football, the selection of players line-up is based on their statistical performance. In this research, the line-up selection can implement the decision support system (DSS with FMADM SAW method. The criterias were used are goal, assists, saves, clean sheets, yellow cards, red cards, games, and an own goal. Then, the assessment players performance is using K-Means Clustering. There are two clusters: cluster_cukup and cluster_baik. The system used Manchester City player data in Forward, Mildfilder, Defender and Goal Keeper position. The purpose of this research is applying the FMADM and K-Means Clustering method to the system. Based on the results, the line-up selection can be processed by FMADM method and the performance assessed by K-Means Clustering method. By using the system, the selection and the assessment can be conducted and give the best decision for footbal coach objectively. Dalam sepak bola, pemilihan line-up pemain oleh pelatih dilakukan berdasarkan statistik yang dimiliki pemain. Penelitian ini menerapkan sistem pendukung keputusan (SPK dengan metode FMADM SAW untuk memilih pemain dari hasil pembobotan dari beberapa kriteria, yaitu goal, assist, saves, clean sheet, kartu kuning, kartu merah, main, dan gol bunuh diri. Penilaian performa pemain menggunakan metode K-Means clustering dengan dua cluster, yaitu cluster_cukup dan cluster_baik. Data yang digunakan dalam sistem ini menggunakan data pemain club Manchester City dengan posisi Forward, Mildfilder, Defender, dan Goal Keeper. Berdasarkan hasil yang diteliti, data statistik pemain dapat diolah dengan metode FMADM dan penilaian performa dengan metode K-Means clustering. Dengan adanya sistem ini, pemilihan dan penilaian dilakukan secara objektif dan memberikan pilihan untuk pelatih dalam mengambil keputusan.

  2. Improved smoothed analysis of the k-means method

    NARCIS (Netherlands)

    Manthey, Bodo; Röglin, Heiko; Mathieu, C.

    2009-01-01

    The k-means method is a widely used clustering algorithm. One of its distinguished features is its speed in practice. Its worst-case running-time, however, is exponential, leaving a gap between practical and theoretical performance. Arthur and Vassilvitskii [3] aimed at closing this gap, and they

  3. Improved Performance of Unsupervised Method by Renovated K-Means

    OpenAIRE

    Ashok, P.; Nawaz, G. M Kadhar; Elayaraja, E.; Vadivel, V.

    2013-01-01

    Clustering is a separation of data into groups of similar objects. Every group called cluster consists of objects that are similar to one another and dissimilar to objects of other groups. In this paper, the K-Means algorithm is implemented by three distance functions and to identify the optimal distance function for clustering methods. The proposed K-Means algorithm is compared with K-Means, Static Weighted K-Means (SWK-Means) and Dynamic Weighted K-Means (DWK-Means) algorithm by using Davis...

  4. A novel intrusion detection method based on OCSVM and K-means recursive clustering

    Directory of Open Access Journals (Sweden)

    Leandros A. Maglaras

    2015-01-01

    Full Text Available In this paper we present an intrusion detection module capable of detecting malicious network traffic in a SCADA (Supervisory Control and Data Acquisition system, based on the combination of One-Class Support Vector Machine (OCSVM with RBF kernel and recursive k-means clustering. Important parameters of OCSVM, such as Gaussian width o and parameter v affect the performance of the classifier. Tuning of these parameters is of great importance in order to avoid false positives and over fitting. The combination of OCSVM with recursive k- means clustering leads the proposed intrusion detection module to distinguish real alarms from possible attacks regardless of the values of parameters o and v, making it ideal for real-time intrusion detection mechanisms for SCADA systems. Extensive simulations have been conducted with datasets extracted from small and medium sized HTB SCADA testbeds, in order to compare the accuracy, false alarm rate and execution time against the base line OCSVM method.

  5. An enhanced deterministic K-Means clustering algorithm for cancer subtype prediction from gene expression data.

    Science.gov (United States)

    Nidheesh, N; Abdul Nazeer, K A; Ameer, P M

    2017-12-01

    Clustering algorithms with steps involving randomness usually give different results on different executions for the same dataset. This non-deterministic nature of algorithms such as the K-Means clustering algorithm limits their applicability in areas such as cancer subtype prediction using gene expression data. It is hard to sensibly compare the results of such algorithms with those of other algorithms. The non-deterministic nature of K-Means is due to its random selection of data points as initial centroids. We propose an improved, density based version of K-Means, which involves a novel and systematic method for selecting initial centroids. The key idea of the algorithm is to select data points which belong to dense regions and which are adequately separated in feature space as the initial centroids. We compared the proposed algorithm to a set of eleven widely used single clustering algorithms and a prominent ensemble clustering algorithm which is being used for cancer data classification, based on the performances on a set of datasets comprising ten cancer gene expression datasets. The proposed algorithm has shown better overall performance than the others. There is a pressing need in the Biomedical domain for simple, easy-to-use and more accurate Machine Learning tools for cancer subtype prediction. The proposed algorithm is simple, easy-to-use and gives stable results. Moreover, it provides comparatively better predictions of cancer subtypes from gene expression data. Copyright © 2017 Elsevier Ltd. All rights reserved.

  6. Enhanced K-means clustering with encryption on cloud

    Science.gov (United States)

    Singh, Iqjot; Dwivedi, Prerna; Gupta, Taru; Shynu, P. G.

    2017-11-01

    This paper tries to solve the problem of storing and managing big files over cloud by implementing hashing on Hadoop in big-data and ensure security while uploading and downloading files. Cloud computing is a term that emphasis on sharing data and facilitates to share infrastructure and resources.[10] Hadoop is an open source software that gives us access to store and manage big files according to our needs on cloud. K-means clustering algorithm is an algorithm used to calculate distance between the centroid of the cluster and the data points. Hashing is a algorithm in which we are storing and retrieving data with hash keys. The hashing algorithm is called as hash function which is used to portray the original data and later to fetch the data stored at the specific key. [17] Encryption is a process to transform electronic data into non readable form known as cipher text. Decryption is the opposite process of encryption, it transforms the cipher text into plain text that the end user can read and understand well. For encryption and decryption we are using Symmetric key cryptographic algorithm. In symmetric key cryptography are using DES algorithm for a secure storage of the files. [3

  7. Automatic video shot boundary detection using k-means clustering and improved adaptive dual threshold comparison

    Science.gov (United States)

    Sa, Qila; Wang, Zhihui

    2018-03-01

    At present, content-based video retrieval (CBVR) is the most mainstream video retrieval method, using the video features of its own to perform automatic identification and retrieval. This method involves a key technology, i.e. shot segmentation. In this paper, the method of automatic video shot boundary detection with K-means clustering and improved adaptive dual threshold comparison is proposed. First, extract the visual features of every frame and divide them into two categories using K-means clustering algorithm, namely, one with significant change and one with no significant change. Then, as to the classification results, utilize the improved adaptive dual threshold comparison method to determine the abrupt as well as gradual shot boundaries.Finally, achieve automatic video shot boundary detection system.

  8. The implementation of two stages clustering (k-means clustering and adaptive neuro fuzzy inference system) for prediction of medicine need based on medical data

    Science.gov (United States)

    Husein, A. M.; Harahap, M.; Aisyah, S.; Purba, W.; Muhazir, A.

    2018-03-01

    Medication planning aim to get types, amount of medicine according to needs, and avoid the emptiness medicine based on patterns of disease. In making the medicine planning is still rely on ability and leadership experience, this is due to take a long time, skill, difficult to obtain a definite disease data, need a good record keeping and reporting, and the dependence of the budget resulted in planning is not going well, and lead to frequent lack and excess of medicines. In this research, we propose Adaptive Neuro Fuzzy Inference System (ANFIS) method to predict medication needs in 2016 and 2017 based on medical data in 2015 and 2016 from two source of hospital. The framework of analysis using two approaches. The first phase is implementing ANFIS to a data source, while the second approach we keep using ANFIS, but after the process of clustering from K-Means algorithm, both approaches are calculated values of Root Mean Square Error (RMSE) for training and testing. From the testing result, the proposed method with better prediction rates based on the evaluation analysis of quantitative and qualitative compared with existing systems, however the implementation of K-Means Algorithm against ANFIS have an effect on the timing of the training process and provide a classification accuracy significantly better without clustering.

  9. Variance-Based Cluster Selection Criteria in a K-Means Framework for One-Mode Dissimilarity Data.

    Science.gov (United States)

    Vera, J Fernando; Macías, Rodrigo

    2017-06-01

    One of the main problems in cluster analysis is that of determining the number of groups in the data. In general, the approach taken depends on the cluster method used. For K-means, some of the most widely employed criteria are formulated in terms of the decomposition of the total point scatter, regarding a two-mode data set of N points in p dimensions, which are optimally arranged into K classes. This paper addresses the formulation of criteria to determine the number of clusters, in the general situation in which the available information for clustering is a one-mode [Formula: see text] dissimilarity matrix describing the objects. In this framework, p and the coordinates of points are usually unknown, and the application of criteria originally formulated for two-mode data sets is dependent on their possible reformulation in the one-mode situation. The decomposition of the variability of the clustered objects is proposed in terms of the corresponding block-shaped partition of the dissimilarity matrix. Within-block and between-block dispersion values for the partitioned dissimilarity matrix are derived, and variance-based criteria are subsequently formulated in order to determine the number of groups in the data. A Monte Carlo experiment was carried out to study the performance of the proposed criteria. For simulated clustered points in p dimensions, greater efficiency in recovering the number of clusters is obtained when the criteria are calculated from the related Euclidean distances instead of the known two-mode data set, in general, for unequal-sized clusters and for low dimensionality situations. For simulated dissimilarity data sets, the proposed criteria always outperform the results obtained when these criteria are calculated from their original formulation, using dissimilarities instead of distances.

  10. Supplier Risk Assessment Based on Best-Worst Method and K-Means Clustering: A Case Study

    Directory of Open Access Journals (Sweden)

    Merve Er Kara

    2018-04-01

    Full Text Available Supplier evaluation and selection is one of the most critical strategic decisions for developing a competitive and sustainable organization. Companies have to consider supplier related risks and threats in their purchasing decisions. In today’s competitive and risky business environment, it is very important to work with reliable suppliers. This study proposes a clustering based approach to group suppliers based on their risk profile. Suppliers of a company in the heavy-machinery sector are assessed based on 17 qualitative and quantitative risk types. The weights of the criteria are determined by using the Best-Worst method. Four factors are extracted by applying Factor Analysis to the supplier risk data. Then k-means clustering algorithm is applied to group core suppliers of the company based on the four risk factors. Three clusters are created with different risk exposure levels. The interpretation of the results provides insights for risk management actions and supplier development programs to mitigate supplier risk.

  11. A Parametric k-Means Algorithm

    Science.gov (United States)

    Tarpey, Thaddeus

    2007-01-01

    Summary The k points that optimally represent a distribution (usually in terms of a squared error loss) are called the k principal points. This paper presents a computationally intensive method that automatically determines the principal points of a parametric distribution. Cluster means from the k-means algorithm are nonparametric estimators of principal points. A parametric k-means approach is introduced for estimating principal points by running the k-means algorithm on a very large simulated data set from a distribution whose parameters are estimated using maximum likelihood. Theoretical and simulation results are presented comparing the parametric k-means algorithm to the usual k-means algorithm and an example on determining sizes of gas masks is used to illustrate the parametric k-means algorithm. PMID:17917692

  12. K-Means Algorithm Performance Analysis With Determining The Value Of Starting Centroid With Random And KD-Tree Method

    Science.gov (United States)

    Sirait, Kamson; Tulus; Budhiarti Nababan, Erna

    2017-12-01

    Clustering methods that have high accuracy and time efficiency are necessary for the filtering process. One method that has been known and applied in clustering is K-Means Clustering. In its application, the determination of the begining value of the cluster center greatly affects the results of the K-Means algorithm. This research discusses the results of K-Means Clustering with starting centroid determination with a random and KD-Tree method. The initial determination of random centroid on the data set of 1000 student academic data to classify the potentially dropout has a sse value of 952972 for the quality variable and 232.48 for the GPA, whereas the initial centroid determination by KD-Tree has a sse value of 504302 for the quality variable and 214,37 for the GPA variable. The smaller sse values indicate that the result of K-Means Clustering with initial KD-Tree centroid selection have better accuracy than K-Means Clustering method with random initial centorid selection.

  13. Artificial Bee Colony Algorithm Based on K-Means Clustering for Multiobjective Optimal Power Flow Problem

    Directory of Open Access Journals (Sweden)

    Liling Sun

    2015-01-01

    Full Text Available An improved multiobjective ABC algorithm based on K-means clustering, called CMOABC, is proposed. To fasten the convergence rate of the canonical MOABC, the way of information communication in the employed bees’ phase is modified. For keeping the population diversity, the multiswarm technology based on K-means clustering is employed to decompose the population into many clusters. Due to each subcomponent evolving separately, after every specific iteration, the population will be reclustered to facilitate information exchange among different clusters. Application of the new CMOABC on several multiobjective benchmark functions shows a marked improvement in performance over the fast nondominated sorting genetic algorithm (NSGA-II, the multiobjective particle swarm optimizer (MOPSO, and the multiobjective ABC (MOABC. Finally, the CMOABC is applied to solve the real-world optimal power flow (OPF problem that considers the cost, loss, and emission impacts as the objective functions. The 30-bus IEEE test system is presented to illustrate the application of the proposed algorithm. The simulation results demonstrate that, compared to NSGA-II, MOPSO, and MOABC, the proposed CMOABC is superior for solving OPF problem, in terms of optimization accuracy.

  14. K-means cluster analysis of tourist destination in special region of Yogyakarta using spatial approach and social network analysis (a case study: post of @explorejogja instagram account in 2016)

    Science.gov (United States)

    Iswandhani, N.; Muhajir, M.

    2018-03-01

    This research was conducted in Department of Statistics Islamic University of Indonesia. The data used are primary data obtained by post @explorejogja instagram account from January until December 2016. In the @explorejogja instagram account found many tourist destinations that can be visited by tourists both in the country and abroad, Therefore it is necessary to form a cluster of existing tourist destinations based on the number of likes from user instagram assumed as the most popular. The purpose of this research is to know the most popular distribution of tourist spot, the cluster formation of tourist destinations, and central popularity of tourist destinations based on @explorejogja instagram account in 2016. Statistical analysis used is descriptive statistics, k-means clustering, and social network analysis. The results of this research were obtained the top 10 most popular destinations in Yogyakarta, map of html-based tourist destination distribution consisting of 121 tourist destination points, formed 3 clusters each consisting of cluster 1 with 52 destinations, cluster 2 with 9 destinations and cluster 3 with 60 destinations, and Central popularity of tourist destinations in the special region of Yogyakarta by district.

  15. Normalized mutual information based PET-MR registration using K-Means clustering and shading correction

    NARCIS (Netherlands)

    Knops, Z.F.; Maintz, J.B.A.; Viergever, M.A.; Pluim, J.P.W.; Gee, J.C.; Maintz, J.B.A.; Vannier, M.W.

    2003-01-01

    A method for the efficient re-binning and shading based correction of intensity distributions of the images prior to normalized mutual information based registration is presented. Our intensity distribution re-binning method is based on the K-means clustering algorithm as opposed to the generally

  16. Clustering analysis

    International Nuclear Information System (INIS)

    Romli

    1997-01-01

    Cluster analysis is the name of group of multivariate techniques whose principal purpose is to distinguish similar entities from the characteristics they process.To study this analysis, there are several algorithms that can be used. Therefore, this topic focuses to discuss the algorithms, such as, similarity measures, and hierarchical clustering which includes single linkage, complete linkage and average linkage method. also, non-hierarchical clustering method, which is popular name K -mean method ' will be discussed. Finally, this paper will be described the advantages and disadvantages of every methods

  17. Vertebra identification using template matching modelmp and K-means clustering.

    Science.gov (United States)

    Larhmam, Mohamed Amine; Benjelloun, Mohammed; Mahmoudi, Saïd

    2014-03-01

    Accurate vertebra detection and segmentation are essential steps for automating the diagnosis of spinal disorders. This study is dedicated to vertebra alignment measurement, the first step in a computer-aided diagnosis tool for cervical spine trauma. Automated vertebral segment alignment determination is a challenging task due to low contrast imaging and noise. A software tool for segmenting vertebrae and detecting subluxations has clinical significance. A robust method was developed and tested for cervical vertebra identification and segmentation that extracts parameters used for vertebra alignment measurement. Our contribution involves a novel combination of a template matching method and an unsupervised clustering algorithm. In this method, we build a geometric vertebra mean model. To achieve vertebra detection, manual selection of the region of interest is performed initially on the input image. Subsequent preprocessing is done to enhance image contrast and detect edges. Candidate vertebra localization is then carried out by using a modified generalized Hough transform (GHT). Next, an adapted cost function is used to compute local voted centers and filter boundary data. Thereafter, a K-means clustering algorithm is applied to obtain clusters distribution corresponding to the targeted vertebrae. These clusters are combined with the vote parameters to detect vertebra centers. Rigid segmentation is then carried out by using GHT parameters. Finally, cervical spine curves are extracted to measure vertebra alignment. The proposed approach was successfully applied to a set of 66 high-resolution X-ray images. Robust detection was achieved in 97.5 % of the 330 tested cervical vertebrae. An automated vertebral identification method was developed and demonstrated to be robust to noise and occlusion. This work presents a first step toward an automated computer-aided diagnosis system for cervical spine trauma detection.

  18. Surface EMG decomposition based on K-means clustering and convolution kernel compensation.

    Science.gov (United States)

    Ning, Yong; Zhu, Xiangjun; Zhu, Shanan; Zhang, Yingchun

    2015-03-01

    A new approach has been developed by combining the K-mean clustering (KMC) method and a modified convolution kernel compensation (CKC) method for multichannel surface electromyogram (EMG) decomposition. The KMC method was first utilized to cluster vectors of observations at different time instants and then estimate the initial innervation pulse train (IPT). The CKC method, modified with a novel multistep iterative process, was conducted to update the estimated IPT. The performance of the proposed K-means clustering-Modified CKC (KmCKC) approach was evaluated by reconstructing IPTs from both simulated and experimental surface EMG signals. The KmCKC approach successfully reconstructed all 10 IPTs from the simulated surface EMG signals with true positive rates (TPR) of over 90% with a low signal-to-noise ratio (SNR) of -10 dB. More than 10 motor units were also successfully extracted from the 64-channel experimental surface EMG signals of the first dorsal interosseous (FDI) muscles when a contraction force was held at 8 N by using the KmCKC approach. A "two-source" test was further conducted with 64-channel surface EMG signals. The high percentage of common MUs and common pulses (over 92% at all force levels) between the IPTs reconstructed from the two independent groups of surface EMG signals demonstrates the reliability and capability of the proposed KmCKC approach in multichannel surface EMG decomposition. Results from both simulated and experimental data are consistent and confirm that the proposed KmCKC approach can successfully reconstruct IPTs with high accuracy at different levels of contraction.

  19. Simultaneous Two-Way Clustering of Multiple Correspondence Analysis

    Science.gov (United States)

    Hwang, Heungsun; Dillon, William R.

    2010-01-01

    A 2-way clustering approach to multiple correspondence analysis is proposed to account for cluster-level heterogeneity of both respondents and variable categories in multivariate categorical data. Specifically, in the proposed method, multiple correspondence analysis is combined with k-means in a unified framework in which "k"-means is…

  20. Are judgments a form of data clustering? Reexamining contrast effects with the k-means algorithm.

    Science.gov (United States)

    Boillaud, Eric; Molina, Guylaine

    2015-04-01

    A number of theories have been proposed to explain in precise mathematical terms how statistical parameters and sequential properties of stimulus distributions affect category ratings. Various contextual factors such as the mean, the midrange, and the median of the stimuli; the stimulus range; the percentile rank of each stimulus; and the order of appearance have been assumed to influence judgmental contrast. A data clustering reinterpretation of judgmental relativity is offered wherein the influence of the initial choice of centroids on judgmental contrast involves 2 combined frequency and consistency tendencies. Accounts of the k-means algorithm are provided, showing good agreement with effects observed on multiple distribution shapes and with a variety of interaction effects relating to the number of stimuli, the number of response categories, and the method of skewing. Experiment 1 demonstrates that centroid initialization accounts for contrast effects obtained with stretched distributions. Experiment 2 demonstrates that the iterative convergence inherent to the k-means algorithm accounts for the contrast reduction observed across repeated blocks of trials. The concept of within-cluster variance minimization is discussed, as is the applicability of a backward k-means calculation method for inferring, from empirical data, the values of the centroids that would serve as a representation of the judgmental context. (c) 2015 APA, all rights reserved.

  1. Stroke localization and classification using microwave tomography with k-means clustering and support vector machine.

    Science.gov (United States)

    Guo, Lei; Abbosh, Amin

    2018-05-01

    For any chance for stroke patients to survive, the stroke type should be classified to enable giving medication within a few hours of the onset of symptoms. In this paper, a microwave-based stroke localization and classification framework is proposed. It is based on microwave tomography, k-means clustering, and a support vector machine (SVM) method. The dielectric profile of the brain is first calculated using the Born iterative method, whereas the amplitude of the dielectric profile is then taken as the input to k-means clustering. The cluster is selected as the feature vector for constructing and testing the SVM. A database of MRI-derived realistic head phantoms at different signal-to-noise ratios is used in the classification procedure. The performance of the proposed framework is evaluated using the receiver operating characteristic (ROC) curve. The results based on a two-dimensional framework show that 88% classification accuracy, with a sensitivity of 91% and a specificity of 87%, can be achieved. Bioelectromagnetics. 39:312-324, 2018. © 2018 Wiley Periodicals, Inc. © 2018 Wiley Periodicals, Inc.

  2. [Cluster analysis in biomedical researches].

    Science.gov (United States)

    Akopov, A S; Moskovtsev, A A; Dolenko, S A; Savina, G D

    2013-01-01

    Cluster analysis is one of the most popular methods for the analysis of multi-parameter data. The cluster analysis reveals the internal structure of the data, group the separate observations on the degree of their similarity. The review provides a definition of the basic concepts of cluster analysis, and discusses the most popular clustering algorithms: k-means, hierarchical algorithms, Kohonen networks algorithms. Examples are the use of these algorithms in biomedical research.

  3. Quick detection of QRS complexes and R-waves using a wavelet transform and K-means clustering.

    Science.gov (United States)

    Xia, Yong; Han, Junze; Wang, Kuanquan

    2015-01-01

    Based on the idea of telemedicine, 24-hour uninterrupted monitoring on electrocardiograms (ECG) has started to be implemented. To create an intelligent ECG monitoring system, an efficient and quick detection algorithm for the characteristic waveforms is needed. This paper aims to give a quick and effective method for detecting QRS-complexes and R-waves in ECGs. The real ECG signal from the MIT-BIH Arrhythmia Database is used for the performance evaluation. The method proposed combined a wavelet transform and the K-means clustering algorithm. A wavelet transform is adopted in the data analysis and preprocessing. Then, based on the slope information of the filtered data, a segmented K-means clustering method is adopted to detect the QRS region. Detection of the R-peak is based on comparing the local amplitudes in each QRS region, which is different from other approaches, and the time cost of R-wave detection is reduced. Of the tested 8 records (total 18201 beats) from the MIT-BIH Arrhythmia Database, an average R-peak detection sensitivity of 99.72 and a positive predictive value of 99.80% are gained; the average time consumed detecting a 30-min original signal is 5.78s, which is competitive with other methods.

  4. K-means clustering for optimal partitioning and dynamic load balancing of parallel hierarchical N-body simulations

    International Nuclear Information System (INIS)

    Marzouk, Youssef M.; Ghoniem, Ahmed F.

    2005-01-01

    A number of complex physical problems can be approached through N-body simulation, from fluid flow at high Reynolds number to gravitational astrophysics and molecular dynamics. In all these applications, direct summation is prohibitively expensive for large N and thus hierarchical methods are employed for fast summation. This work introduces new algorithms, based on k-means clustering, for partitioning parallel hierarchical N-body interactions. We demonstrate that the number of particle-cluster interactions and the order at which they are performed are directly affected by partition geometry. Weighted k-means partitions minimize the sum of clusters' second moments and create well-localized domains, and thus reduce the computational cost of N-body approximations by enabling the use of lower-order approximations and fewer cells. We also introduce compatible techniques for dynamic load balancing, including adaptive scaling of cluster volumes and adaptive redistribution of cluster centroids. We demonstrate the performance of these algorithms by constructing a parallel treecode for vortex particle simulations, based on the serial variable-order Cartesian code developed by Lindsay and Krasny [Journal of Computational Physics 172 (2) (2001) 879-907]. The method is applied to vortex simulations of a transverse jet. Results show outstanding parallel efficiencies even at high concurrencies, with velocity evaluation errors maintained at or below their serial values; on a realistic distribution of 1.2 million vortex particles, we observe a parallel efficiency of 98% on 1024 processors. Excellent load balance is achieved even in the face of several obstacles, such as an irregular, time-evolving particle distribution containing a range of length scales and the continual introduction of new vortex particles throughout the domain. Moreover, results suggest that k-means yields a more efficient partition of the domain than a global oct-tree

  5. k-Means Clustering with Hölder Divergences

    KAUST Repository

    Nielsen, Frank

    2017-10-24

    We introduced two novel classes of Hölder divergences and Hölder pseudo-divergences that are both invariant to rescaling, and that both encapsulate the Cauchy-Schwarz divergence and the skew Bhattacharyya divergences. We review the elementary concepts of those parametric divergences, and perform a clustering analysis on two synthetic datasets. It is shown experimentally that the symmetrized Hölder divergences consistently outperform significantly the Cauchy-Schwarz divergence in clustering tasks.

  6. k-Means Clustering with Hölder Divergences

    KAUST Repository

    Nielsen, Frank; Sun, Ke; Marchand-Maillet, Sté phane

    2017-01-01

    We introduced two novel classes of Hölder divergences and Hölder pseudo-divergences that are both invariant to rescaling, and that both encapsulate the Cauchy-Schwarz divergence and the skew Bhattacharyya divergences. We review the elementary concepts of those parametric divergences, and perform a clustering analysis on two synthetic datasets. It is shown experimentally that the symmetrized Hölder divergences consistently outperform significantly the Cauchy-Schwarz divergence in clustering tasks.

  7. Developing cluster strategy of apples dodol SMEs by integration K-means clustering and analytical hierarchy process method

    Science.gov (United States)

    Mustaniroh, S. A.; Effendi, U.; Silalahi, R. L. R.; Sari, T.; Ala, M.

    2018-03-01

    The purposes of this research were to determine the grouping of apples dodol small and medium enterprises (SMEs) in Batu City and to determine an appropriate development strategy for each cluster. The methods used for clustering SMEs was k-means. The Analytical Hierarchy Process (AHP) approach was then applied to determine the development strategy priority for each cluster. The variables used in grouping include production capacity per month, length of operation, investment value, average sales revenue per month, amount of SMEs assets, and the number of workers. Several factors were considered in AHP include industry cluster, government, as well as related and supporting industries. Data was collected using the methods of questionaire and interviews. SMEs respondents were selected among SMEs appels dodol in Batu City using purposive sampling. The result showed that two clusters were formed from five apples dodol SMEs. The 1stcluster of apples dodol SMEs, classified as small enterprises, included SME A, SME C, and SME D. The 2ndcluster of SMEs apples dodol, classified as medium enterprises, consisted of SME B and SME E. The AHP results indicated that the priority development strategy for the 1stcluster of apples dodol SMEs was improving quality and the product standardisation, while for the 2nd cluster was increasing the marketing access.

  8. Identify High-Quality Protein Structural Models by Enhanced K-Means.

    Science.gov (United States)

    Wu, Hongjie; Li, Haiou; Jiang, Min; Chen, Cheng; Lv, Qiang; Wu, Chuang

    2017-01-01

    Background. One critical issue in protein three-dimensional structure prediction using either ab initio or comparative modeling involves identification of high-quality protein structural models from generated decoys. Currently, clustering algorithms are widely used to identify near-native models; however, their performance is dependent upon different conformational decoys, and, for some algorithms, the accuracy declines when the decoy population increases. Results. Here, we proposed two enhanced K -means clustering algorithms capable of robustly identifying high-quality protein structural models. The first one employs the clustering algorithm SPICKER to determine the initial centroids for basic K -means clustering ( SK -means), whereas the other employs squared distance to optimize the initial centroids ( K -means++). Our results showed that SK -means and K -means++ were more robust as compared with SPICKER alone, detecting 33 (59%) and 42 (75%) of 56 targets, respectively, with template modeling scores better than or equal to those of SPICKER. Conclusions. We observed that the classic K -means algorithm showed a similar performance to that of SPICKER, which is a widely used algorithm for protein-structure identification. Both SK -means and K -means++ demonstrated substantial improvements relative to results from SPICKER and classical K -means.

  9. IMPLEMENTASI ALGORITMA K-MEANS CLUSTERING UNTUK MENENTUKAN STRATEGI MARKETING PRESIDENT UNIVERSITY

    Directory of Open Access Journals (Sweden)

    Johan Oscar Ong

    2013-06-01

    Full Text Available Information technology advances very rapidly at this time to generate thousands or even millions of data from various aspect of life. However, what can be done with that much data?. In this research, we start from calculation of data set of students who have graduated from President University using k-means clustering algorithm, namely by classifying the data of students into several clusters based on the characteristics of this data in order to discover the information hidden from the data set of student who have graduated from President University. The attribute data that is used in this study is hometown, major and GPA. The purpose of this study is to help the President University's marketing department in predicting promotion strategies undertaken in the cities in Indonesia. Information gained in this study can be used as a references in determining the proper strategy for marketing team in their promotion activities in the cities in Indonesia so that the campaign will be more effective and efficient.

  10. Anthropometric typology of male and female rowers using k-means clustering.

    Science.gov (United States)

    Forjasz, Justyna

    2011-06-01

    The aim of this paper is to present the morphological features of rowers. The objective is to establish the type of body build best suited to the present requirements of this sports discipline through the determination of the most important morphological features in rowing with regard to the type of racing boat. The subjects of this study included competitors who practise rowing and were members of the Junior National Team. The considered variables included a group of 32 anthropometric measurements of body composition determined using the BIA method among male and female athletes, while also including rowing boat categories. In order to determine the analysed structures of male and female rowers, an observation analysis was taken into consideration and performed by the k-means clustering method. In the group of male and female rowers using long paddles, higher mean values for the analysed features were observed, with the exception of fat-free mass, and water content in both genders, and trunk length and horizontal reach in women who achieved higher means in the short-paddle group. On the men's team, both groups differed significantly in body mass, longitudinal features, horizontal reach, hand width and body circumferences, while on the women's, they differed in body mass, width and length of the chest, body circumferences and fat content. The method of grouping used in this paper confirmed morphological differences in the competitors with regard to the type of racing boat.

  11. OPTIMASI PUSAT CLUSTER K-PROTOTYPE DENGAN ALGORITMA GENETIKA

    Directory of Open Access Journals (Sweden)

    Pivin Suwirmayanti

    2014-12-01

    Full Text Available Teknik clustering saat ini telah banyak digunakan untuk mengatasi permasalahan yang terkait dengansegementasi data. Implementasi clustering ini dapat diterapkan pada berbagai bidang sebagai contoh dalam halpemasaran, clustering dapat digunakan sebagai metode untuk mengelompokkan data. Metode Clustering memilikitujuan untuk mengelompokkan beberapa data ke dalam beberapa kelompok data sehingga kelompok yang terbentukmemiliki kemiripan data, secara umum proses clustering diolah menggunakan tipe data numerik, namun padakenyataannya proses pengelompokan data tidak hanya menggunakan tipe data numerik, terdapat juga tipe datakategorikal. Untuk itu penulis menggunakan metode K-Prototype yang dioptimasi dengan Algortima Genetikadimana data uji yang digunakan adalah Data German Credit yang memiliki tipe data numerikal dan kategorikal.Dalam penelitian dilakukan perbandingan kinerja antara metode K-Prototype dengan Algoritma Genetika, denganmetode K-Prototype Tanpa Algortima Genetika, dan metode K-Means. Dari beberapa hasil percobaan yangdilakukan metode K-Prototype dengan Algoritma Genetika menghasilkan hasil yang terbaik dari metode KPrototypetanpa Algortima Genetika, dan metode K-Means

  12. Technical Note: Using k-means clustering to determine the number and position of isocenters in MLC-based multiple target intracranial radiosurgery.

    Science.gov (United States)

    Yock, Adam D; Kim, Gwe-Ya

    2017-09-01

    To present the k-means clustering algorithm as a tool to address treatment planning considerations characteristic of stereotactic radiosurgery using a single isocenter for multiple targets. For 30 patients treated with stereotactic radiosurgery for multiple brain metastases, the geometric centroids and radii of each met were determined from the treatment planning system. In-house software used this as well as weighted and unweighted versions of the k-means clustering algorithm to group the targets to be treated with a single isocenter, and to position each isocenter. The algorithm results were evaluated using within-cluster sum of squares as well as a minimum target coverage metric that considered the effect of target size. Both versions of the algorithm were applied to an example patient to demonstrate the prospective determination of the appropriate number and location of isocenters. Both weighted and unweighted versions of the k-means algorithm were applied successfully to determine the number and position of isocenters. Comparing the two, both the within-cluster sum of squares metric and the minimum target coverage metric resulting from the unweighted version were less than those from the weighted version. The average magnitudes of the differences were small (-0.2 cm 2 and 0.1% for the within cluster sum of squares and minimum target coverage, respectively) but statistically significant (Wilcoxon signed-rank test, P k-means clustering algorithm represented an advantage of the unweighted version for the within-cluster sum of squares metric, and an advantage of the weighted version for the minimum target coverage metric. While additional treatment planning considerations have a large influence on the final treatment plan quality, both versions of the k-means algorithm provide automatic, consistent, quantitative, and objective solutions to the tasks associated with SRS treatment planning using a single isocenter for multiple targets. © 2017 The Authors. Journal

  13. An Automatic K-Means Clustering Algorithm of GPS Data Combining a Novel Niche Genetic Algorithm with Noise and Density

    Directory of Open Access Journals (Sweden)

    Xiangbing Zhou

    2017-12-01

    Full Text Available Rapidly growing Global Positioning System (GPS data plays an important role in trajectory and their applications (e.g., GPS-enabled smart devices. In order to employ K-means to mine the better origins and destinations (OD behind the GPS data and overcome its shortcomings including slowness of convergence, sensitivity to initial seeds selection, and getting stuck in a local optimum, this paper proposes and focuses on a novel niche genetic algorithm (NGA with density and noise for K-means clustering (NoiseClust. In NoiseClust, an improved noise method and K-means++ are proposed to produce the initial population and capture higher quality seeds that can automatically determine the proper number of clusters, and also handle the different sizes and shapes of genes. A density-based method is presented to divide the number of niches, with its aim to maintain population diversity. Adaptive probabilities of crossover and mutation are also employed to prevent the convergence to a local optimum. Finally, the centers (the best chromosome are obtained and then fed into the K-means as initial seeds to generate even higher quality clustering results by allowing the initial seeds to readjust as needed. Experimental results based on taxi GPS data sets demonstrate that NoiseClust has high performance and effectiveness, and easily mine the city’s situations in four taxi GPS data sets.

  14. Cluster analysis of rural, urban, and curbside atmospheric particle size data.

    Science.gov (United States)

    Beddows, David C S; Dall'Osto, Manuel; Harrison, Roy M

    2009-07-01

    Particle size is a key determinant of the hazard posed by airborne particles. Continuous multivariate particle size data have been collected using aerosol particle size spectrometers sited at four locations within the UK: Harwell (Oxfordshire); Regents Park (London); British Telecom Tower (London); and Marylebone Road (London). These data have been analyzed using k-means cluster analysis, deduced to be the preferred cluster analysis technique, selected from an option of four partitional cluster packages, namelythe following: Fuzzy; k-means; k-median; and Model-Based clustering. Using cluster validation indices k-means clustering was shown to produce clusters with the smallest size, furthest separation, and importantly the highest degree of similarity between the elements within each partition. Using k-means clustering, the complexity of the data set is reduced allowing characterization of the data according to the temporal and spatial trends of the clusters. At Harwell, the rural background measurement site, the cluster analysis showed that the spectra may be differentiated by their modal-diameters and average temporal trends showing either high counts during the day-time or night-time hours. Likewise for the urban sites, the cluster analysis differentiated the spectra into a small number of size distributions according their modal-diameter, the location of the measurement site, and time of day. The responsible aerosol emission, formation, and dynamic processes can be inferred according to the cluster characteristics and correlation to concurrently measured meteorological, gas phase, and particle phase measurements.

  15. Detection of sensor degradation using K-means clustering and support vector regression in nuclear power plant

    International Nuclear Information System (INIS)

    Seo, Inyong; Ha, Bokam; Lee, Sungwoo; Shin, Changhoon; Lee, Jaeyong; Kim, Seongjun

    2011-01-01

    In a nuclear power plant (NPP), periodic sensor calibrations are required to assure sensors are operating correctly. However, only a few faulty sensors are found to be rectified. For the safe operation of an NPP and the reduction of unnecessary calibration, on-line calibration monitoring is needed. In this study, an on-line calibration monitoring called KPCSVR using k-means clustering and principal component based Auto-Associative support vector regression (PCSVR) is proposed for nuclear power plant. To reduce the training time of the model, k-means clustering method was used. Response surface methodology is employed to efficiently determine the optimal values of support vector regression hyperparameters. The proposed KPCSVR model was confirmed with actual plant data of Kori Nuclear Power Plant Unit 3 which were measured from the primary and secondary systems of the plant, and compared with the PCSVR model. By using data clustering, the average accuracy of PCSVR improved from 1.228×10 -4 to 0.472×10 -4 and the average sensitivity of PCSVR from 0.0930 to 0.0909, which results in good detection of sensor drift. Moreover, the training time is greatly reduced from 123.5 to 31.5 sec. (author)

  16. Determining the Number of Instars in Simulium quinquestriatum (Diptera: Simuliidae) Using k-Means Clustering via the Canberra Distance.

    Science.gov (United States)

    Yang, Yao Ming; Jia, Ruo; Xun, Hui; Yang, Jie; Chen, Qiang; Zeng, Xiang Guang; Yang, Ming

    2018-02-21

    Simulium quinquestriatum Shiraki (Diptera: Simuliidae), a human-biting fly that is distributed widely across Asia, is a vector for multiple pathogens. However, the larval development of this species is poorly understood. In this study, we determined the number of instars in this pest using three batches of field-collected larvae from Guiyang, Guizhou, China. The postgenal length, head capsule width, mandibular phragma length, and body length of 773 individuals were measured, and k-means clustering was used for instar grouping. Four distance measures-Manhattan, Euclidean, Chebyshev, and Canberra-were determined. The reported instar numbers, ranging from 4 to 11, were set as initial cluster centers for k-means clustering. The Canberra distance yielded reliable instar grouping, which was consistent with the first instar, as characterized by egg bursters and prepupae with dark histoblasts. Females and males of the last cluster of larvae were identified using Feulgen-stained gonads. Morphometric differences between the two sexes were not significant. Validation was performed using the Brooks-Dyar and Crosby rules, revealing that the larval stage of S. quinquestriatum is composed of eight instars.

  17. White blood cell segmentation by color-space-based k-means clustering.

    Science.gov (United States)

    Zhang, Congcong; Xiao, Xiaoyan; Li, Xiaomei; Chen, Ying-Jie; Zhen, Wu; Chang, Jun; Zheng, Chengyun; Liu, Zhi

    2014-09-01

    White blood cell (WBC) segmentation, which is important for cytometry, is a challenging issue because of the morphological diversity of WBCs and the complex and uncertain background of blood smear images. This paper proposes a novel method for the nucleus and cytoplasm segmentation of WBCs for cytometry. A color adjustment step was also introduced before segmentation. Color space decomposition and k-means clustering were combined for segmentation. A database including 300 microscopic blood smear images were used to evaluate the performance of our method. The proposed segmentation method achieves 95.7% and 91.3% overall accuracy for nucleus segmentation and cytoplasm segmentation, respectively. Experimental results demonstrate that the proposed method can segment WBCs effectively with high accuracy.

  18. flowPeaks: a fast unsupervised clustering for flow cytometry data via K-means and density peak finding.

    Science.gov (United States)

    Ge, Yongchao; Sealfon, Stuart C

    2012-08-01

    For flow cytometry data, there are two common approaches to the unsupervised clustering problem: one is based on the finite mixture model and the other on spatial exploration of the histograms. The former is computationally slow and has difficulty to identify clusters of irregular shapes. The latter approach cannot be applied directly to high-dimensional data as the computational time and memory become unmanageable and the estimated histogram is unreliable. An algorithm without these two problems would be very useful. In this article, we combine ideas from the finite mixture model and histogram spatial exploration. This new algorithm, which we call flowPeaks, can be applied directly to high-dimensional data and identify irregular shape clusters. The algorithm first uses K-means algorithm with a large K to partition the cell population into many small clusters. These partitioned data allow the generation of a smoothed density function using the finite mixture model. All local peaks are exhaustively searched by exploring the density function and the cells are clustered by the associated local peak. The algorithm flowPeaks is automatic, fast and reliable and robust to cluster shape and outliers. This algorithm has been applied to flow cytometry data and it has been compared with state of the art algorithms, including Misty Mountain, FLOCK, flowMeans, flowMerge and FLAME. The R package flowPeaks is available at https://github.com/yongchao/flowPeaks. yongchao.ge@mssm.edu Supplementary data are available at Bioinformatics online.

  19. Bagged K-means clustering of metabolome data

    NARCIS (Netherlands)

    Hageman, J. A.; van den Berg, R. A.; Westerhuis, J. A.; Hoefsloot, H. C. J.; Smilde, A. K.

    2006-01-01

    Clustering of metabolomics data can be hampered by noise originating from biological variation, physical sampling error and analytical error. Using data analysis methods which are not specially suited for dealing with noisy data will yield sub optimal solutions. Bootstrap aggregating (bagging) is a

  20. Fast and Accuracy Control Chart Pattern Recognition using a New cluster-k-Nearest Neighbor

    OpenAIRE

    Samir Brahim Belhaouari

    2009-01-01

    By taking advantage of both k-NN which is highly accurate and K-means cluster which is able to reduce the time of classification, we can introduce Cluster-k-Nearest Neighbor as "variable k"-NN dealing with the centroid or mean point of all subclasses generated by clustering algorithm. In general the algorithm of K-means cluster is not stable, in term of accuracy, for that reason we develop another algorithm for clustering our space which gives a higher accuracy than K-means cluster, less ...

  1. Mitigate the impact of transmitter finite extinction ratio using K-means clustering algorithm for 16QAM signal

    Science.gov (United States)

    Yu, Miao; Li, Yan; Shu, Tong; Zhang, Yifan; Hong, Xiaobin; Qiu, Jifang; Zuo, Yong; Guo, Hongxiang; Li, Wei; Wu, Jian

    2018-02-01

    A method of recognizing 16QAM signal based on k-means clustering algorithm is proposed to mitigate the impact of transmitter finite extinction ratio. There are pilot symbols with 0.39% overhead assigned to be regarded as initial centroids of k-means clustering algorithm. Simulation result in 10 GBaud 16QAM system shows that the proposed method obtains higher precision of identification compared with traditional decision method for finite ER and IQ mismatch. Specially, the proposed method improves the required OSNR by 5.5 dB, 4.5 dB, 4 dB and 3 dB at FEC limit with ER= 12 dB, 16 dB, 20 dB and 24 dB, respectively, and the acceptable bias error and IQ mismatch range is widened by 767% and 360% with ER =16 dB, respectively.

  2. Distribuição de subgrupos com base nas respostas fisiológicas em jogadores profissionais de futebol pela técnica K Means Cluster Subgroup distribution based on physiological responses in professional soccer players by K-means cluster technique

    Directory of Open Access Journals (Sweden)

    Luiz Fernando Novack

    2013-04-01

    Full Text Available INTRODUÇÃO: A preparação física no futebol necessita estar sempre em constante atualização em virtude das exigências presentes no futebol contemporâneo. OBJETIVO: Verificar a sensibilidade da técnica estatística K Means Cluster na distribuição de grupos com base nas respostas fisiológicas pertinentes ao futebol. MÉTODOS: Os atletas foram submetidos a avaliações antropométricas para determinar o percentual de gordura (%G e de massa magra (MM, teste incremental em esteira para obter o VO2 máximo (VO2máx e a velocidade de limiar ventilatório (VLim, bem como testes de campo para a agilidade (AG e o salto vertical (SV. Os dados foram analisados pelo teste de Kruskal-Wallis e a distribuição dos grupos foi desenvolvida pela técnica de K Means Cluster conforme as semelhanças dos jogadores com essas variáveis fisiológicas, assumindo o nível de significância de p INTRODUCTION: Physical fitness in soccer needs to be constantly updated due to current demands in contemporary soccer. OBJECTIVE: To assess the sensitivity of the K Means Clustering in group distribution based on physiological responses relevant to soccer. METHODS: The athletes underwent anthropometric evaluations to determine fat percentage (%F lean mass (LM, treadmill incremental test to obtain the VO2 maximum (VO2max and ventilatory threshold velocity (VL, as well as a field test for agility (AG and vertical jump (VJ. Data were analyzed by Kruskal-Wallis and distribution of groups was determined by K Means Clustering according to their similarities with these physiological variables, assuming significance level of p < 0.05. RESULTS: Showed that both groups were significantly different only concerning VJ (p < 0.001; LM (p < 0.001; VL (p = 0.011 and VO2max (p = 0.029 indicating that the athletes need to be distributed in groups for these variables. Nevertheless, %F and AG (p = 0.317; p = 0.922 respectively, were not different, indicating that these variables can be

  3. Dietary patterns derived from principal component- and k-means cluster analysis: long-term association with coronary heart disease and stroke.

    Science.gov (United States)

    Stricker, M D; Onland-Moret, N C; Boer, J M A; van der Schouw, Y T; Verschuren, W M M; May, A M; Peeters, P H M; Beulens, J W J

    2013-03-01

    Studies comparing dietary patterns derived from different a posteriori methods in view of predicting disease risk are scarce. We aimed to explore differences between dietary patterns derived from principal component- (PCA) and k-means cluster analysis (KCA) in relation to their food group composition and ability to predict CHD and stroke risk. The study was conducted in the EPIC-NL cohort that consists of 40,011 men and women. Baseline dietary intake was measured using a validated food-frequency questionnaire. Food items were consolidated into 31 food groups. Occurrence of CHD and stroke was assessed through linkage with registries. After 13 years of follow-up, 1,843 CHD and 588 stroke cases were documented. Both PCA and KCA extracted a prudent pattern (high intakes of fish, high-fiber products, raw vegetables, wine) and a western pattern (high consumption of French fries, fast food, low-fiber products, other alcoholic drinks, soft drinks with sugar) with small variation between components and clusters. The prudent component was associated with a reduced risk of CHD (HR for extreme quartiles: 0.87; 95%-CI: 0.75-1.00) and stroke (0.68; 0.53-0.88). The western component was not related to any outcome. The prudent cluster was related with a lower risk of CHD (0.91; 0.82-1.00) and stroke (0.79; 0.67-0.94) compared to the western cluster. PCA and KCA found similar underlying patterns with comparable associations with CHD and stroke risk. A prudent pattern reduced the risk of CHD and stroke. Copyright © 2012 Elsevier B.V. All rights reserved.

  4. Improved Fuzzy Art Method for Initializing K-means

    Directory of Open Access Journals (Sweden)

    Sevinc Ilhan

    2010-09-01

    Full Text Available The K-means algorithm is quite sensitive to the cluster centers selected initially and can perform different clusterings depending on these initialization conditions. Within the scope of this study, a new method based on the Fuzzy ART algorithm which is called Improved Fuzzy ART (IFART is used in the determination of initial cluster centers. By using IFART, better quality clusters are achieved than Fuzzy ART do and also IFART is as good as Fuzzy ART about capable of fast clustering and capability on large scaled data clustering. Consequently, it is observed that, with the proposed method, the clustering operation is completed in fewer steps, that it is performed in a more stable manner by fixing the initialization points and that it is completed with a smaller error margin compared with the conventional K-means.

  5. Comparison of K-means and fuzzy c-means algorithm performance for automated determination of the arterial input function.

    Science.gov (United States)

    Yin, Jiandong; Sun, Hongzan; Yang, Jiawen; Guo, Qiyong

    2014-01-01

    The arterial input function (AIF) plays a crucial role in the quantification of cerebral perfusion parameters. The traditional method for AIF detection is based on manual operation, which is time-consuming and subjective. Two automatic methods have been reported that are based on two frequently used clustering algorithms: fuzzy c-means (FCM) and K-means. However, it is still not clear which is better for AIF detection. Hence, we compared the performance of these two clustering methods using both simulated and clinical data. The results demonstrate that K-means analysis can yield more accurate and robust AIF results, although it takes longer to execute than the FCM method. We consider that this longer execution time is trivial relative to the total time required for image manipulation in a PACS setting, and is acceptable if an ideal AIF is obtained. Therefore, the K-means method is preferable to FCM in AIF detection.

  6. K-AP: Generating specified K clusters by efficient Affinity Propagation

    KAUST Repository

    Zhang, Xiangliang

    2010-12-01

    The Affinity Propagation (AP) clustering algorithm proposed by Frey and Dueck (2007) provides an understandable, nearly optimal summary of a data set. However, it suffers two major shortcomings: i) the number of clusters is vague with the user-defined parameter called self-confidence, and ii) the quadratic computational complexity. When aiming at a given number of clusters due to prior knowledge, AP has to be launched many times until an appropriate setting of self-confidence is found. The re-launched AP increases the computational cost by one order of magnitude. In this paper, we propose an algorithm, called K-AP, to exploit the immediate results of K clusters by introducing a constraint in the process of message passing. Through theoretical analysis and experimental validation, K-AP was shown to be able to directly generate K clusters as user defined, with a negligible increase of computational cost compared to AP. In the meanwhile, K-AP preserves the clustering quality as AP in terms of the distortion. K-AP is more effective than k-medoids w.r.t. the distortion minimization and higher clustering purity. © 2010 IEEE.

  7. Fuzzy C-means method for clustering microarray data.

    Science.gov (United States)

    Dembélé, Doulaye; Kastner, Philippe

    2003-05-22

    Clustering analysis of data from DNA microarray hybridization studies is essential for identifying biologically relevant groups of genes. Partitional clustering methods such as K-means or self-organizing maps assign each gene to a single cluster. However, these methods do not provide information about the influence of a given gene for the overall shape of clusters. Here we apply a fuzzy partitioning method, Fuzzy C-means (FCM), to attribute cluster membership values to genes. A major problem in applying the FCM method for clustering microarray data is the choice of the fuzziness parameter m. We show that the commonly used value m = 2 is not appropriate for some data sets, and that optimal values for m vary widely from one data set to another. We propose an empirical method, based on the distribution of distances between genes in a given data set, to determine an adequate value for m. By setting threshold levels for the membership values, genes which are tigthly associated to a given cluster can be selected. Using a yeast cell cycle data set as an example, we show that this selection increases the overall biological significance of the genes within the cluster. Supplementary text and Matlab functions are available at http://www-igbmc.u-strasbg.fr/fcm/

  8. Findings in resting-state fMRI by differences from K-means clustering.

    Science.gov (United States)

    Chyzhyk, Darya; Graña, Manuel

    2014-01-01

    Resting state fMRI has growing number of studies with diverse aims, always centered on some kind of functional connectivity biomarker obtained from correlation regarding seed regions, or by analytical decomposition of the signal towards the localization of the spatial distribution of functional connectivity patterns. In general, studies are computationally costly and very sensitive to noise and preprocessing of data. In this paper we consider clustering by K-means as a exploratory procedure which can provide some results with little computational effort, due to efficient implementations that are readily available. We demonstrate the approach on a dataset of schizophrenia patients, finding differences between patients with and without auditory hallucinations.

  9. Analisis Perbandingan Algoritma Fuzzy C-Means dan K-Means

    OpenAIRE

    Yohannes, Yohannes

    2016-01-01

    Klasterisasi merupakan teknik pengelompokkan data berdasarkan kemiripan data. Teknik klasterisasi ini banyak digunakan pada bidang ilmu komputer khususnya pengolahan citra, pengenalan pola, dan data mining. Banyak sekali algoritma yang digunakan untuk klasterisasi data. Algoritma yang sering digunakan untuk klasterisasi data pada umumnya adalah Fuzzy C-Means dan K-Means. Algoritma Fuzzy C-Means merupakan algoritma klasterisasi dimana data dikelompokkan ke dalam suatu pusat cluster data denga...

  10. A Simple but Powerful Heuristic Method for Accelerating k-Means Clustering of Large-Scale Data in Life Science.

    Science.gov (United States)

    Ichikawa, Kazuki; Morishita, Shinichi

    2014-01-01

    K-means clustering has been widely used to gain insight into biological systems from large-scale life science data. To quantify the similarities among biological data sets, Pearson correlation distance and standardized Euclidean distance are used most frequently; however, optimization methods have been largely unexplored. These two distance measurements are equivalent in the sense that they yield the same k-means clustering result for identical sets of k initial centroids. Thus, an efficient algorithm used for one is applicable to the other. Several optimization methods are available for the Euclidean distance and can be used for processing the standardized Euclidean distance; however, they are not customized for this context. We instead approached the problem by studying the properties of the Pearson correlation distance, and we invented a simple but powerful heuristic method for markedly pruning unnecessary computation while retaining the final solution. Tests using real biological data sets with 50-60K vectors of dimensions 10-2001 (~400 MB in size) demonstrated marked reduction in computation time for k = 10-500 in comparison with other state-of-the-art pruning methods such as Elkan's and Hamerly's algorithms. The BoostKCP software is available at http://mlab.cb.k.u-tokyo.ac.jp/~ichikawa/boostKCP/.

  11. A Modified MinMax k-Means Algorithm Based on PSO.

    Science.gov (United States)

    Wang, Xiaoyan; Bai, Yanping

    The MinMax k -means algorithm is widely used to tackle the effect of bad initialization by minimizing the maximum intraclustering errors. Two parameters, including the exponent parameter and memory parameter, are involved in the executive process. Since different parameters have different clustering errors, it is crucial to choose appropriate parameters. In the original algorithm, a practical framework is given. Such framework extends the MinMax k -means to automatically adapt the exponent parameter to the data set. It has been believed that if the maximum exponent parameter has been set, then the programme can reach the lowest intraclustering errors. However, our experiments show that this is not always correct. In this paper, we modified the MinMax k -means algorithm by PSO to determine the proper values of parameters which can subject the algorithm to attain the lowest clustering errors. The proposed clustering method is tested on some favorite data sets in several different initial situations and is compared to the k -means algorithm and the original MinMax k -means algorithm. The experimental results indicate that our proposed algorithm can reach the lowest clustering errors automatically.

  12. MEANS CLUSTERING FOR HIDDEN MARKOV MODEL

    NARCIS (Netherlands)

    Perrone, M.P.; Connell, S.D.

    2004-01-01

    An unsupervised k­means clustering algorithm for hidden Markov models is described and applied to the task of generating subclass models for individual handwritten character classes. The algorithm is compared to a related clustering method and shown to give a relative change in the error rate of as

  13. Accuracies of genomic breeding values in American Angus beef cattle using K-means clustering for cross-validation.

    Science.gov (United States)

    Saatchi, Mahdi; McClure, Mathew C; McKay, Stephanie D; Rolf, Megan M; Kim, JaeWoo; Decker, Jared E; Taxis, Tasia M; Chapple, Richard H; Ramey, Holly R; Northcutt, Sally L; Bauck, Stewart; Woodward, Brent; Dekkers, Jack C M; Fernando, Rohan L; Schnabel, Robert D; Garrick, Dorian J; Taylor, Jeremy F

    2011-11-28

    Genomic selection is a recently developed technology that is beginning to revolutionize animal breeding. The objective of this study was to estimate marker effects to derive prediction equations for direct genomic values for 16 routinely recorded traits of American Angus beef cattle and quantify corresponding accuracies of prediction. Deregressed estimated breeding values were used as observations in a weighted analysis to derive direct genomic values for 3570 sires genotyped using the Illumina BovineSNP50 BeadChip. These bulls were clustered into five groups using K-means clustering on pedigree estimates of additive genetic relationships between animals, with the aim of increasing within-group and decreasing between-group relationships. All five combinations of four groups were used for model training, with cross-validation performed in the group not used in training. Bivariate animal models were used for each trait to estimate the genetic correlation between deregressed estimated breeding values and direct genomic values. Accuracies of direct genomic values ranged from 0.22 to 0.69 for the studied traits, with an average of 0.44. Predictions were more accurate when animals within the validation group were more closely related to animals in the training set. When training and validation sets were formed by random allocation, the accuracies of direct genomic values ranged from 0.38 to 0.85, with an average of 0.65, reflecting the greater relationship between animals in training and validation. The accuracies of direct genomic values obtained from training on older animals and validating in younger animals were intermediate to the accuracies obtained from K-means clustering and random clustering for most traits. The genetic correlation between deregressed estimated breeding values and direct genomic values ranged from 0.15 to 0.80 for the traits studied. These results suggest that genomic estimates of genetic merit can be produced in beef cattle at a young age but

  14. Breast Cancer Image Segmentation Using K-Means Clustering Based on GPU Cuda Parallel Computing

    Directory of Open Access Journals (Sweden)

    Andika Elok Amalia

    2018-02-01

    Full Text Available Image processing technology is now widely used in the health area, one example is to help the radiologist to analyze the result of MRI (Magnetic Resonance Imaging, CT Scan and Mammography. Image segmentation is a process which is intended to obtain the objects contained in the image by dividing the image into several areas that have similarity attributes on an object with the aim of facilitating the analysis process. The increasing amount  of patient data and larger image size are new challenges in segmentation process to use time efficiently while still keeping the process quality. Research on the segmentation of medical images have been done but still few that combine with parallel computing. In this research, K-Means clustering on the image of mammography result is implemented using two-way computation which are serial and parallel. The result shows that parallel computing  gives faster average performance execution up to twofold.

  15. Sistem Multiagen untuk Pengklasteran Pendaki Menggunakan K-Means

    Directory of Open Access Journals (Sweden)

    Maya Cendana

    2015-01-01

    Abstract The beginner climbers should do mountain climbing as a group, however, manually grouping method that is currently running is not effective and efficient , especially for the solo climber who do not have the mountaineering community . Therefore, it is necessary to build an online media that is able to classify mountaineers automatically. The grouping is done by the algorithm K-Means clustering -based intelligent agents . The agents will collaborate in the negotiation process and determines the cluster members that have similar criteria . The main advantage of using multiagentsis multithreading proces, so the clustering process will run at once . The agentsconsists of the user agent , the database agent , clustering agents , and validation agent . The agents are implemented with JADE platform because JADE is using FIPA ACL communication language . Evaluation will calculate the value of cohesion / density in one cluster and inter - cluster separation distances with 10 , 100 and 200 data. Metric measurement used are WGAD and BGAD . Thequality of a cluster member is better than using an usual k-means .   Keywords: agen, jade, fipa acl, wgad, bgad

  16. Parallel k-means++ for Multiple Shared-Memory Architectures

    Energy Technology Data Exchange (ETDEWEB)

    Mackey, Patrick S.; Lewis, Robert R.

    2016-09-22

    In recent years k-means++ has become a popular initialization technique for improved k-means clustering. To date, most of the work done to improve its performance has involved parallelizing algorithms that are only approximations of k-means++. In this paper we present a parallelization of the exact k-means++ algorithm, with a proof of its correctness. We develop implementations for three distinct shared-memory architectures: multicore CPU, high performance GPU, and the massively multithreaded Cray XMT platform. We demonstrate the scalability of the algorithm on each platform. In addition we present a visual approach for showing which platform performed k-means++ the fastest for varying data sizes.

  17. Fuzzy Modeled K-Cluster Quality Mining of Hidden Knowledge for Decision Support

    OpenAIRE

    S. Parkash  Kumar; K. S. Ramaswami

    2011-01-01

    Problem statement: The work presented Fuzzy Modeled K-means Cluster Quality Mining of hidden knowledge for Decision Support. Based on the number of clusters, number of objects in each cluster and its cohesiveness, precision and recall values, the cluster quality metrics is measured. The fuzzy k-means is adapted approach by using heuristic method which iterates the cluster to form an efficient valid cluster. With the obtained data clusters, quality assessment is made by predictive mining using...

  18. Re-weighted Discriminatively Embedded K-Means for Multi-view Clustering.

    Science.gov (United States)

    Xu, Jinglin; Han, Junwei; Nie, Feiping; Li, Xuelong

    2017-02-08

    Recent years, more and more multi-view data are widely used in many real world applications. This kind of data (such as image data) are high dimensional and obtained from different feature extractors, which represents distinct perspectives of the data. How to cluster such data efficiently is a challenge. In this paper, we propose a novel multi-view clustering framework, called Re-weighted Discriminatively Embedded KMeans (RDEKM), for this task. The proposed method is a multiview least-absolute residual model which induces robustness to efficiently mitigates the influence of outliers and realizes dimension reduction during multi-view clustering. Specifically, the proposed model is an unsupervised optimization scheme which utilizes Iterative Re-weighted Least Squares to solve leastabsolute residual and adaptively controls the distribution of multiple weights in a re-weighted manner only based on its own low-dimensional subspaces and a common clustering indicator matrix. Furthermore, theoretical analysis (including optimality and convergence analysis) and the optimization algorithm are also presented. Compared to several state-of-the-art multi-view clustering methods, the proposed method substantially improves the accuracy of the clustering results on widely used benchmark datasets, which demonstrates the superiority of the proposed work.

  19. A Trajectory Regression Clustering Technique Combining a Novel Fuzzy C-Means Clustering Algorithm with the Least Squares Method

    Directory of Open Access Journals (Sweden)

    Xiangbing Zhou

    2018-04-01

    Full Text Available Rapidly growing GPS (Global Positioning System trajectories hide much valuable information, such as city road planning, urban travel demand, and population migration. In order to mine the hidden information and to capture better clustering results, a trajectory regression clustering method (an unsupervised trajectory clustering method is proposed to reduce local information loss of the trajectory and to avoid getting stuck in the local optimum. Using this method, we first define our new concept of trajectory clustering and construct a novel partitioning (angle-based partitioning method of line segments; second, the Lagrange-based method and Hausdorff-based K-means++ are integrated in fuzzy C-means (FCM clustering, which are used to maintain the stability and the robustness of the clustering process; finally, least squares regression model is employed to achieve regression clustering of the trajectory. In our experiment, the performance and effectiveness of our method is validated against real-world taxi GPS data. When comparing our clustering algorithm with the partition-based clustering algorithms (K-means, K-median, and FCM, our experimental results demonstrate that the presented method is more effective and generates a more reasonable trajectory.

  20. CLASSIFICATION OF IRANIAN NURSES ACCORDING TO THEIR MENTAL HEALTH OUTCOMES USING GHQ-12 QUESTIONNAIRE: A COMPARISON BETWEEN LATENT CLASS ANALYSIS AND K-MEANS CLUSTERING WITH TRADITIONAL SCORING METHOD.

    Science.gov (United States)

    Jamali, Jamshid; Ayatollahi, Seyyed Mohammad Taghi

    2015-10-01

    Nurses constitute the most providers of health care systems. Their mental health can affect the quality of services and patients' satisfaction. General Health Questionnaire (GHQ-12) is a general screening tool used to detect mental disorders. Scoring method and determining thresholds for this questionnaire are debatable and the cut-off points can vary from sample to sample. This study was conducted to estimate the prevalence of mental disorders among Iranian nurses using GHQ-12 and also compare Latent Class Analysis (LCA) and K-means clustering with traditional scoring method. A cross-sectional study was carried out in Fars and Bushehr provinces of southern Iran in 2014. Participants were 771 Iranian nurses, who filled out the GHQ-12 questionnaire. Traditional scoring method, LCA and K-means were used to estimate the prevalence of mental disorder among Iranian nurses. Cohen's kappa statistic was applied to assess the agreement between the LCA and K-means with traditional scoring method of GHQ-12. The nurses with mental disorder by scoring method, LCA and K-mean were 36.3% (n=280), 32.2% (n=248), and 26.5% (n=204), respectively. LCA and logistic regression revealed that the prevalence of mental disorder in females was significantly higher than males. Mental disorder in nurses was in a medium level compared to other people living in Iran. There was a little difference between prevalence of mental disorder estimated by scoring method, K-means and LCA. According to the advantages of LCA than K-means and different results in scoring method, we suggest LCA for classification of Iranian nurses according to their mental health outcomes using GHQ-12 questionnaire.

  1. Discriminating isogenic cancer cells and identifying altered unsaturated fatty acid content as associated with metastasis status, using k-means clustering and partial least squares-discriminant analysis of Raman maps

    DEFF Research Database (Denmark)

    Hedegaard, Martin; Krafft, Christoph; Ditzel, Henrik J

    2010-01-01

    level of a few proteins and genes. Raman maps were recorded of single cells after fixation and drying using 785 nm laser excitation. K-means clustering reduced the amount of data from each cell and improved the signal-to-noise ratio of cluster-averaged spectra. Spectra representing the nucleus were...

  2. Accuracies of genomic breeding values in American Angus beef cattle using K-means clustering for cross-validation

    Directory of Open Access Journals (Sweden)

    Saatchi Mahdi

    2011-11-01

    Full Text Available Abstract Background Genomic selection is a recently developed technology that is beginning to revolutionize animal breeding. The objective of this study was to estimate marker effects to derive prediction equations for direct genomic values for 16 routinely recorded traits of American Angus beef cattle and quantify corresponding accuracies of prediction. Methods Deregressed estimated breeding values were used as observations in a weighted analysis to derive direct genomic values for 3570 sires genotyped using the Illumina BovineSNP50 BeadChip. These bulls were clustered into five groups using K-means clustering on pedigree estimates of additive genetic relationships between animals, with the aim of increasing within-group and decreasing between-group relationships. All five combinations of four groups were used for model training, with cross-validation performed in the group not used in training. Bivariate animal models were used for each trait to estimate the genetic correlation between deregressed estimated breeding values and direct genomic values. Results Accuracies of direct genomic values ranged from 0.22 to 0.69 for the studied traits, with an average of 0.44. Predictions were more accurate when animals within the validation group were more closely related to animals in the training set. When training and validation sets were formed by random allocation, the accuracies of direct genomic values ranged from 0.38 to 0.85, with an average of 0.65, reflecting the greater relationship between animals in training and validation. The accuracies of direct genomic values obtained from training on older animals and validating in younger animals were intermediate to the accuracies obtained from K-means clustering and random clustering for most traits. The genetic correlation between deregressed estimated breeding values and direct genomic values ranged from 0.15 to 0.80 for the traits studied. Conclusions These results suggest that genomic estimates

  3. A hybridized K-means clustering approach for high dimensional ...

    African Journals Online (AJOL)

    International Journal of Engineering, Science and Technology ... Due to incredible growth of high dimensional dataset, conventional data base querying methods are inadequate to extract useful information, so researchers nowadays ... Recently cluster analysis is a popularly used data analysis method in number of areas.

  4. Quantitative Volumetric K-Means Cluster Segmentation of Fibroglandular Tissue and Skin in Breast MRI.

    Science.gov (United States)

    Niukkanen, Anton; Arponen, Otso; Nykänen, Aki; Masarwah, Amro; Sutela, Anna; Liimatainen, Timo; Vanninen, Ritva; Sudah, Mazen

    2017-10-18

    Mammographic breast density (MBD) is the most commonly used method to assess the volume of fibroglandular tissue (FGT). However, MRI could provide a clinically feasible and more accurate alternative. There were three aims in this study: (1) to evaluate a clinically feasible method to quantify FGT with MRI, (2) to assess the inter-rater agreement of MRI-based volumetric measurements and (3) to compare them to measurements acquired using digital mammography and 3D tomosynthesis. This retrospective study examined 72 women (mean age 52.4 ± 12.3 years) with 105 disease-free breasts undergoing diagnostic 3.0-T breast MRI and either digital mammography or tomosynthesis. Two observers analyzed MRI images for breast and FGT volumes and FGT-% from T1-weighted images (0.7-, 2.0-, and 4.0-mm-thick slices) using K-means clustering, data from histogram, and active contour algorithms. Reference values were obtained with Quantra software. Inter-rater agreement for MRI measurements made with 2-mm-thick slices was excellent: for FGT-%, r = 0.994 (95% CI 0.990-0.997); for breast volume, r = 0.985 (95% CI 0.934-0.994); and for FGT volume, r = 0.979 (95% CI 0.958-0.989). MRI-based FGT-% correlated strongly with MBD in mammography (r = 0.819-0.904, P K-means clustering-based assessments of the proportion of the fibroglandular tissue in the breast at MRI are highly reproducible. In the future, quantitative assessment of FGT-% to complement visual estimation of FGT should be performed on a more regular basis as it provides a component which can be incorporated into the individual's breast cancer risk stratification.

  5. Application of cluster analysis to geochemical compositional data for identifying ore-related geochemical anomalies

    Science.gov (United States)

    Zhou, Shuguang; Zhou, Kefa; Wang, Jinlin; Yang, Genfang; Wang, Shanshan

    2017-12-01

    Cluster analysis is a well-known technique that is used to analyze various types of data. In this study, cluster analysis is applied to geochemical data that describe 1444 stream sediment samples collected in northwestern Xinjiang with a sample spacing of approximately 2 km. Three algorithms (the hierarchical, k-means, and fuzzy c-means algorithms) and six data transformation methods (the z-score standardization, ZST; the logarithmic transformation, LT; the additive log-ratio transformation, ALT; the centered log-ratio transformation, CLT; the isometric log-ratio transformation, ILT; and no transformation, NT) are compared in terms of their effects on the cluster analysis of the geochemical compositional data. The study shows that, on the one hand, the ZST does not affect the results of column- or variable-based (R-type) cluster analysis, whereas the other methods, including the LT, the ALT, and the CLT, have substantial effects on the results. On the other hand, the results of the row- or observation-based (Q-type) cluster analysis obtained from the geochemical data after applying NT and the ZST are relatively poor. However, we derive some improved results from the geochemical data after applying the CLT, the ILT, the LT, and the ALT. Moreover, the k-means and fuzzy c-means clustering algorithms are more reliable than the hierarchical algorithm when they are used to cluster the geochemical data. We apply cluster analysis to the geochemical data to explore for Au deposits within the study area, and we obtain a good correlation between the results retrieved by combining the CLT or the ILT with the k-means or fuzzy c-means algorithms and the potential zones of Au mineralization. Therefore, we suggest that the combination of the CLT or the ILT with the k-means or fuzzy c-means algorithms is an effective tool to identify potential zones of mineralization from geochemical data.

  6. Research on hotspot discovery in internet public opinions based on improved K-means.

    Science.gov (United States)

    Wang, Gensheng

    2013-01-01

    How to discover hotspot in the Internet public opinions effectively is a hot research field for the researchers related which plays a key role for governments and corporations to find useful information from mass data in the Internet. An improved K-means algorithm for hotspot discovery in internet public opinions is presented based on the analysis of existing defects and calculation principle of original K-means algorithm. First, some new methods are designed to preprocess website texts, select and express the characteristics of website texts, and define the similarity between two website texts, respectively. Second, clustering principle and the method of initial classification centers selection are analyzed and improved in order to overcome the limitations of original K-means algorithm. Finally, the experimental results verify that the improved algorithm can improve the clustering stability and classification accuracy of hotspot discovery in internet public opinions when used in practice.

  7. Research on Hotspot Discovery in Internet Public Opinions Based on Improved K-Means

    Science.gov (United States)

    2013-01-01

    How to discover hotspot in the Internet public opinions effectively is a hot research field for the researchers related which plays a key role for governments and corporations to find useful information from mass data in the Internet. An improved K-means algorithm for hotspot discovery in internet public opinions is presented based on the analysis of existing defects and calculation principle of original K-means algorithm. First, some new methods are designed to preprocess website texts, select and express the characteristics of website texts, and define the similarity between two website texts, respectively. Second, clustering principle and the method of initial classification centers selection are analyzed and improved in order to overcome the limitations of original K-means algorithm. Finally, the experimental results verify that the improved algorithm can improve the clustering stability and classification accuracy of hotspot discovery in internet public opinions when used in practice. PMID:24106496

  8. Magnetic resonance imaging with k-means clustering objectively measures whole muscle volume compartments in sarcopenia/cancer cachexia.

    Science.gov (United States)

    Gray, Calum; MacGillivray, Thomas J; Eeley, Clare; Stephens, Nathan A; Beggs, Ian; Fearon, Kenneth C; Greig, Carolyn A

    2011-02-01

    Sarcopenia and cachexia are characterized by infiltration of non-contractile tissue within muscle which influences area and volume measurements. We applied a statistical clustering (k-means) technique to magnetic resonance (MR) images of the quadriceps of young and elderly healthy women and women with cancer to objectively separate the contractile and non-contractile tissue compartments. MR scans of the thigh were obtained for 34 women (n = 16 young, (median) age 26 y; n = 9 older, age 80 y; n = 9 upper gastrointestinal cancer patients, age 65 y). Segmented regions of consecutive axial images were used to calculate cross-sectional area and (gross) volume. The k-means unsupervised algorithm was subsequently applied to the MR binary mask image array data with resultant volumes compared between groups. Older women and women with cancer had 37% and 48% less quadriceps muscle respectively than young women (p k-means subtracted a significant 9%, 14% and 20% non-contractile tissue from the quadriceps of young, older and patient groups respectively (p K-means objectively separates contractile and non-contractile tissue components. Women with upper GI cancer have significant fatty infiltration throughout whole muscle groups which is maintained when controlling for age. Copyright © 2010 Elsevier Ltd and European Society for Clinical Nutrition and Metabolism. All rights reserved.

  9. Genetic k-means clustering approach for mapping human vulnerability to chemical hazards in the industrialized city: a case study of Shanghai, China.

    Science.gov (United States)

    Shi, Weifang; Zeng, Weihua

    2013-06-20

    Reducing human vulnerability to chemical hazards in the industrialized city is a matter of great urgency. Vulnerability mapping is an alternative approach for providing vulnerability-reducing interventions in a region. This study presents a method for mapping human vulnerability to chemical hazards by using clustering analysis for effective vulnerability reduction. Taking the city of Shanghai as the study area, we measure human exposure to chemical hazards by using the proximity model with additionally considering the toxicity of hazardous substances, and capture the sensitivity and coping capacity with corresponding indicators. We perform an improved k-means clustering approach on the basis of genetic algorithm by using a 500 m × 500 m geographical grid as basic spatial unit. The sum of squared errors and silhouette coefficient are combined to measure the quality of clustering and to determine the optimal clustering number. Clustering result reveals a set of six typical human vulnerability patterns that show distinct vulnerability dimension combinations. The vulnerability mapping of the study area reflects cluster-specific vulnerability characteristics and their spatial distribution. Finally, we suggest specific points that can provide new insights in rationally allocating the limited funds for the vulnerability reduction of each cluster.

  10. Genetic k-Means Clustering Approach for Mapping Human Vulnerability to Chemical Hazards in the Industrialized City: A Case Study of Shanghai, China

    Directory of Open Access Journals (Sweden)

    Weihua Zeng

    2013-06-01

    Full Text Available Reducing human vulnerability to chemical hazards in the industrialized city is a matter of great urgency. Vulnerability mapping is an alternative approach for providing vulnerability-reducing interventions in a region. This study presents a method for mapping human vulnerability to chemical hazards by using clustering analysis for effective vulnerability reduction. Taking the city of Shanghai as the study area, we measure human exposure to chemical hazards by using the proximity model with additionally considering the toxicity of hazardous substances, and capture the sensitivity and coping capacity with corresponding indicators. We perform an improved k-means clustering approach on the basis of genetic algorithm by using a 500 m × 500 m geographical grid as basic spatial unit. The sum of squared errors and silhouette coefficient are combined to measure the quality of clustering and to determine the optimal clustering number. Clustering result reveals a set of six typical human vulnerability patterns that show distinct vulnerability dimension combinations. The vulnerability mapping of the study area reflects cluster-specific vulnerability characteristics and their spatial distribution. Finally, we suggest specific points that can provide new insights in rationally allocating the limited funds for the vulnerability reduction of each cluster.

  11. Application of k-means clustering algorithm in grouping the DNA sequences of hepatitis B virus (HBV)

    Science.gov (United States)

    Bustamam, A.; Tasman, H.; Yuniarti, N.; Frisca, Mursidah, I.

    2017-07-01

    Based on WHO data, an estimated of 15 millions people worldwide who are infected with hepatitis B (HBsAg+), which is caused by HBV virus, are also infected by hepatitis D, which is caused by HDV virus. Hepatitis D infection can occur simultaneously with hepatitis B (co infection) or after a person is exposed to chronic hepatitis B (super infection). Since HDV cannot live without HBV, HDV infection is closely related to HBV infection, hence it is very realistic that every effort of prevention against hepatitis B can indirectly prevent hepatitis D. This paper presents clustering of HBV DNA sequences by using k-means clustering algorithm and R programming. Clustering processes are started with collecting HBV DNA sequences from GenBank, then performing extraction HBV DNA sequences using n-mers frequency and furthermore the extraction results are collected as a matrix and normalized using the min-max normalization with interval [0, 1] which will later be used as an input data. The number of clusters is two and the initial centroid selected of the cluster is chosen randomly. In each iteration, the distance of every object to each centroid are calculated using the Euclidean distance and the minimum distance is selected to determine the membership in a cluster until two convergent clusters are created. As the result, the HBV viruses in the first cluster is more virulent than the HBV viruses in the second cluster, so the HBV viruses in the first cluster can potentially evolve with HDV viruses that cause hepatitis D.

  12. Semi-supervised consensus clustering for gene expression data analysis

    OpenAIRE

    Wang, Yunli; Pan, Youlian

    2014-01-01

    Background Simple clustering methods such as hierarchical clustering and k-means are widely used for gene expression data analysis; but they are unable to deal with noise and high dimensionality associated with the microarray gene expression data. Consensus clustering appears to improve the robustness and quality of clustering results. Incorporating prior knowledge in clustering process (semi-supervised clustering) has been shown to improve the consistency between the data partitioning and do...

  13. A User-Adaptive Algorithm for Activity Recognition Based on K-Means Clustering, Local Outlier Factor, and Multivariate Gaussian Distribution

    Directory of Open Access Journals (Sweden)

    Shizhen Zhao

    2018-06-01

    Full Text Available Mobile activity recognition is significant to the development of human-centric pervasive applications including elderly care, personalized recommendations, etc. Nevertheless, the distribution of inertial sensor data can be influenced to a great extent by varying users. This means that the performance of an activity recognition classifier trained by one user’s dataset will degenerate when transferred to others. In this study, we focus on building a personalized classifier to detect four categories of human activities: light intensity activity, moderate intensity activity, vigorous intensity activity, and fall. In order to solve the problem caused by different distributions of inertial sensor signals, a user-adaptive algorithm based on K-Means clustering, local outlier factor (LOF, and multivariate Gaussian distribution (MGD is proposed. To automatically cluster and annotate a specific user’s activity data, an improved K-Means algorithm with a novel initialization method is designed. By quantifying the samples’ informative degree in a labeled individual dataset, the most profitable samples can be selected for activity recognition model adaption. Through experiments, we conclude that our proposed models can adapt to new users with good recognition performance.

  14. Implementasi Algoritma K-Means Clustering Untuk Mengetahui Bidang Skripsi Mahasiswa Multimedia Pendidikan Teknik Informatika Dan Komputer Universitas Negeri Jakarta

    Directory of Open Access Journals (Sweden)

    Widodo

    2017-12-01

    Full Text Available Penelitian ini bertujuan untuk mengetahui bidang skripsi mahasiswa peminatan multimedia PTIK UNJ angkatan 2010 yang dapat digunakan sebagai saran bagi mahasiswa yang belum mengajukan skripsi. Sedangkan bagi mahasiswa yang telah mengajukan skripsi dapat dijadikan sebagai perbandingan antara hasil program bantu dengan bidang skripsi yang diteliti. Penelitian dilakukan dengan menggunakan implementasi dari algoritma K-Means clustering, yaitu setiap data dikelompokkan berdasarkan jarak minimum terdekat dengan centroid. Data yang diolah merupakan rata-rata nilai mata kuliah Proyek Video Digital, Desain Grafis, dan Animasi Komputer untuk cluster 1 yaitu bidang video dan animasi. Untuk cluster 2 yaitu bidang aplikasi pengembangan perangkat lunak menggunakan rata-rata nilai mata kuliah Algoritma dan Pemrograman, Struktur Data, serta Pemrograman Web. Proses perhitungan menggunakan software MATLAB. Input data nilai berjumlah 53 mahasiswa dengan lima kali uji centroid. Hasil saran bidang skripsi untuk 19 mahasiswa yang belum mengajukan skripsi adalah 2 mahasiswa di bidang video dan animasi dan 17 mahasiswa di bidang aplikasi pengembangan perangkat lunak. Untuk 34 mahasiswa yang telah mengajukan skripsi, dilakukan perbandingan hasil bidang skripsi dengan menggunakan sign test. Berdasarkan hasil uji diketahui X2hitung < X2 tabel yaitu terima H0 dengan kesimpulan tidak terdapat perbedaan antara bidang skripsi hasil perhitungan program bantu algoritma K-Means dengan bidang skripsi yang telah diajukan mahasiswa.

  15. A SURVEY ON DOCUMENT CLUSTERING APPROACH FOR COMPUTER FORENSIC ANALYSIS

    OpenAIRE

    Monika Raghuvanshi*, Rahul Patel

    2016-01-01

    In a forensic analysis, large numbers of files are examined. Much of the information comprises of in unstructured format, so it’s quite difficult task for computer forensic to perform such analysis. That’s why to do the forensic analysis of document within a limited period of time require a special approach such as document clustering. This paper review different document clustering algorithms methodologies for example K-mean, K-medoid, single link, complete link, average link in accorandance...

  16. K-Line Patterns’ Predictive Power Analysis Using the Methods of Similarity Match and Clustering

    Directory of Open Access Journals (Sweden)

    Lv Tao

    2017-01-01

    Full Text Available Stock price prediction based on K-line patterns is the essence of candlestick technical analysis. However, there are some disputes on whether the K-line patterns have predictive power in academia. To help resolve the debate, this paper uses the data mining methods of pattern recognition, pattern clustering, and pattern knowledge mining to research the predictive power of K-line patterns. The similarity match model and nearest neighbor-clustering algorithm are proposed for solving the problem of similarity match and clustering of K-line series, respectively. The experiment includes testing the predictive power of the Three Inside Up pattern and Three Inside Down pattern with the testing dataset of the K-line series data of Shanghai 180 index component stocks over the latest 10 years. Experimental results show that (1 the predictive power of a pattern varies a great deal for different shapes and (2 each of the existing K-line patterns requires further classification based on the shape feature for improving the prediction performance.

  17. PENERAPAN DATAMINING PADA POPULASI DAGING AYAM RAS PEDAGING DI INDONESIA BERDASARKAN PROVINSI MENGGUNAKAN K-MEANS CLUSTERING

    Directory of Open Access Journals (Sweden)

    Mhd Gading Sadewo

    2017-09-01

    Full Text Available Ayam bukanlah makanan yang asing bagi penduduk Indonesia. Makanan tersebut sangat mudah dijumpai dalam kehidupan masyarakat sehari-hari. Namun tingkat konsumsi daging ayam di Indonesia masih tergolong rendah dibandingkan dengan Negara tetangga. Penelitian ini membahas tentang Penerapan Datamining Pada Populasi Daging Ayam Ras Pedaging di Indonesia Berdasarkan Provinsi Menggunakan K-Means Clustering. Sumber data penelitian ini dikumpulkan berdasarkan dokumen-dokumen keterangan populasi daging ayam yang dihasilkan oleh Badan Pusat Statistik Nasional. Data yang digunakan dalam penelitian ini adalah data dari tahun 2009-2016 yang terdiri dari 34 provinsi. Variable yang digunakan (1 jumlah populasi dari tahun 2009-2016. Data akan diolah dengan melakukan clushtering dalam 3 clushter yaitu clusther tingkat populasi tinggi, clusther tingkat populasi sedang dan rendah. Centroid data untuk cluster tingkat populasi tinggi 4711403141, Centroid data untuk cluster tingkat populasi sedang 304240647, dan Centroid data untuk cluster tingkat populasi rendah 554200. Sehingga diperoleh penilaian berdasarkan indeks populasi daging ayam dengan 1 provinsi tingkat populasi tinggi yaitu Jawa Barat, 6 provinsi tingkat populasi sedang yaitu Sumatera Utara, Jawa Tengah, Jawa Timur, Banten, Kalimantan Selatan dan Kalimantan Timur, dan 27 provinsi lainnya termasuk tingkat populasi rendah. Hal ini dapat menjadi masukan kepada pemerintah, provinsi yang menjadi perhatian lebih pada populasi daging ayam berdasarkan cluster yang telah dilakukan

  18. Segmentation of Brain Lesions in MRI and CT Scan Images: A Hybrid Approach Using k-Means Clustering and Image Morphology

    Science.gov (United States)

    Agrawal, Ritu; Sharma, Manisha; Singh, Bikesh Kumar

    2018-04-01

    Manual segmentation and analysis of lesions in medical images is time consuming and subjected to human errors. Automated segmentation has thus gained significant attention in recent years. This article presents a hybrid approach for brain lesion segmentation in different imaging modalities by combining median filter, k means clustering, Sobel edge detection and morphological operations. Median filter is an essential pre-processing step and is used to remove impulsive noise from the acquired brain images followed by k-means segmentation, Sobel edge detection and morphological processing. The performance of proposed automated system is tested on standard datasets using performance measures such as segmentation accuracy and execution time. The proposed method achieves a high accuracy of 94% when compared with manual delineation performed by an expert radiologist. Furthermore, the statistical significance test between lesion segmented using automated approach and that by expert delineation using ANOVA and correlation coefficient achieved high significance values of 0.986 and 1 respectively. The experimental results obtained are discussed in lieu of some recently reported studies.

  19. Prediction of settled water turbidity and optimal coagulant dosage in drinking water treatment plant using a hybrid model of k-means clustering and adaptive neuro-fuzzy inference system

    Science.gov (United States)

    Kim, Chan Moon; Parnichkun, Manukid

    2017-11-01

    Coagulation is an important process in drinking water treatment to attain acceptable treated water quality. However, the determination of coagulant dosage is still a challenging task for operators, because coagulation is nonlinear and complicated process. Feedback control to achieve the desired treated water quality is difficult due to lengthy process time. In this research, a hybrid of k-means clustering and adaptive neuro-fuzzy inference system ( k-means-ANFIS) is proposed for the settled water turbidity prediction and the optimal coagulant dosage determination using full-scale historical data. To build a well-adaptive model to different process states from influent water, raw water quality data are classified into four clusters according to its properties by a k-means clustering technique. The sub-models are developed individually on the basis of each clustered data set. Results reveal that the sub-models constructed by a hybrid k-means-ANFIS perform better than not only a single ANFIS model, but also seasonal models by artificial neural network (ANN). The finally completed model consisting of sub-models shows more accurate and consistent prediction ability than a single model of ANFIS and a single model of ANN based on all five evaluation indices. Therefore, the hybrid model of k-means-ANFIS can be employed as a robust tool for managing both treated water quality and production costs simultaneously.

  20. Two-Way Regularized Fuzzy Clustering of Multiple Correspondence Analysis.

    Science.gov (United States)

    Kim, Sunmee; Choi, Ji Yeh; Hwang, Heungsun

    2017-01-01

    Multiple correspondence analysis (MCA) is a useful tool for investigating the interrelationships among dummy-coded categorical variables. MCA has been combined with clustering methods to examine whether there exist heterogeneous subclusters of a population, which exhibit cluster-level heterogeneity. These combined approaches aim to classify either observations only (one-way clustering of MCA) or both observations and variable categories (two-way clustering of MCA). The latter approach is favored because its solutions are easier to interpret by providing explicitly which subgroup of observations is associated with which subset of variable categories. Nonetheless, the two-way approach has been built on hard classification that assumes observations and/or variable categories to belong to only one cluster. To relax this assumption, we propose two-way fuzzy clustering of MCA. Specifically, we combine MCA with fuzzy k-means simultaneously to classify a subgroup of observations and a subset of variable categories into a common cluster, while allowing both observations and variable categories to belong partially to multiple clusters. Importantly, we adopt regularized fuzzy k-means, thereby enabling us to decide the degree of fuzziness in cluster memberships automatically. We evaluate the performance of the proposed approach through the analysis of simulated and real data, in comparison with existing two-way clustering approaches.

  1. Comparative analysis of clustering methods for gene expression time course data

    Directory of Open Access Journals (Sweden)

    Ivan G. Costa

    2004-01-01

    Full Text Available This work performs a data driven comparative study of clustering methods used in the analysis of gene expression time courses (or time series. Five clustering methods found in the literature of gene expression analysis are compared: agglomerative hierarchical clustering, CLICK, dynamical clustering, k-means and self-organizing maps. In order to evaluate the methods, a k-fold cross-validation procedure adapted to unsupervised methods is applied. The accuracy of the results is assessed by the comparison of the partitions obtained in these experiments with gene annotation, such as protein function and series classification.

  2. A Genetic Algorithm That Exchanges Neighboring Centers for Fuzzy c-Means Clustering

    Science.gov (United States)

    Chahine, Firas Safwan

    2012-01-01

    Clustering algorithms are widely used in pattern recognition and data mining applications. Due to their computational efficiency, partitional clustering algorithms are better suited for applications with large datasets than hierarchical clustering algorithms. K-means is among the most popular partitional clustering algorithm, but has a major…

  3. A Flocking Based algorithm for Document Clustering Analysis

    Energy Technology Data Exchange (ETDEWEB)

    Cui, Xiaohui [ORNL; Gao, Jinzhu [ORNL; Potok, Thomas E [ORNL

    2006-01-01

    Social animals or insects in nature often exhibit a form of emergent collective behavior known as flocking. In this paper, we present a novel Flocking based approach for document clustering analysis. Our Flocking clustering algorithm uses stochastic and heuristic principles discovered from observing bird flocks or fish schools. Unlike other partition clustering algorithm such as K-means, the Flocking based algorithm does not require initial partitional seeds. The algorithm generates a clustering of a given set of data through the embedding of the high-dimensional data items on a two-dimensional grid for easy clustering result retrieval and visualization. Inspired by the self-organized behavior of bird flocks, we represent each document object with a flock boid. The simple local rules followed by each flock boid result in the entire document flock generating complex global behaviors, which eventually result in a clustering of the documents. We evaluate the efficiency of our algorithm with both a synthetic dataset and a real document collection that includes 100 news articles collected from the Internet. Our results show that the Flocking clustering algorithm achieves better performance compared to the K- means and the Ant clustering algorithm for real document clustering.

  4. Sensitivity Sampling Over Dynamic Geometric Data Streams with Applications to $k$-Clustering

    OpenAIRE

    Song, Zhao; Yang, Lin F.; Zhong, Peilin

    2018-01-01

    Sensitivity based sampling is crucial for constructing nearly-optimal coreset for $k$-means / median clustering. In this paper, we provide a novel data structure that enables sensitivity sampling over a dynamic data stream, where points from a high dimensional discrete Euclidean space can be either inserted or deleted. Based on this data structure, we provide a one-pass coreset construction for $k$-means %and M-estimator clustering using space $\\widetilde{O}(k\\mathrm{poly}(d))$ over $d$-dimen...

  5. Statistical Techniques Applied to Aerial Radiometric Surveys (STAARS): cluster analysis. National Uranium Resource Evaluation

    International Nuclear Information System (INIS)

    Pirkle, F.L.; Stablein, N.K.; Howell, J.A.; Wecksung, G.W.; Duran, B.S.

    1982-11-01

    One objective of the aerial radiometric surveys flown as part of the US Department of Energy's National Uranium Resource Evaluation (NURE) program was to ascertain the regional distribution of near-surface radioelement abundances. Some method for identifying groups of observations with similar radioelement values was therefore required. It is shown in this report that cluster analysis can identify such groups even when no a priori knowledge of the geology of an area exists. A method of convergent k-means cluster analysis coupled with a hierarchical cluster analysis is used to classify 6991 observations (three radiometric variables at each observation location) from the Precambrian rocks of the Copper Mountain, Wyoming, area. Another method, one that combines a principal components analysis with a convergent k-means analysis, is applied to the same data. These two methods are compared with a convergent k-means analysis that utilizes available geologic knowledge. All three methods identify four clusters. Three of the clusters represent background values for the Precambrian rocks of the area, and one represents outliers (anomalously high 214 Bi). A segmentation of the data corresponding to geologic reality as discovered by other methods has been achieved based solely on analysis of aerial radiometric data. The techniques employed are composites of classical clustering methods designed to handle the special problems presented by large data sets. 20 figures, 7 tables

  6. Classifying epileptic EEG signals with delay permutation entropy and Multi-Scale K-means.

    Science.gov (United States)

    Zhu, Guohun; Li, Yan; Wen, Peng Paul; Wang, Shuaifang

    2015-01-01

    Most epileptic EEG classification algorithms are supervised and require large training datasets, that hinder their use in real time applications. This chapter proposes an unsupervised Multi-Scale K-means (MSK-means) MSK-means algorithm to distinguish epileptic EEG signals and identify epileptic zones. The random initialization of the K-means algorithm can lead to wrong clusters. Based on the characteristics of EEGs, the MSK-means MSK-means algorithm initializes the coarse-scale centroid of a cluster with a suitable scale factor. In this chapter, the MSK-means algorithm is proved theoretically superior to the K-means algorithm on efficiency. In addition, three classifiers: the K-means, MSK-means MSK-means and support vector machine (SVM), are used to identify seizure and localize epileptogenic zone using delay permutation entropy features. The experimental results demonstrate that identifying seizure with the MSK-means algorithm and delay permutation entropy achieves 4. 7 % higher accuracy than that of K-means, and 0. 7 % higher accuracy than that of the SVM.

  7. Assessment of Random Assignment in Training and Test Sets using Generalized Cluster Analysis Technique

    Directory of Open Access Journals (Sweden)

    Sorana D. BOLBOACĂ

    2011-06-01

    Full Text Available Aim: The properness of random assignment of compounds in training and validation sets was assessed using the generalized cluster technique. Material and Method: A quantitative Structure-Activity Relationship model using Molecular Descriptors Family on Vertices was evaluated in terms of assignment of carboquinone derivatives in training and test sets during the leave-many-out analysis. Assignment of compounds was investigated using five variables: observed anticancer activity and four structure descriptors. Generalized cluster analysis with K-means algorithm was applied in order to investigate if the assignment of compounds was or not proper. The Euclidian distance and maximization of the initial distance using a cross-validation with a v-fold of 10 was applied. Results: All five variables included in analysis proved to have statistically significant contribution in identification of clusters. Three clusters were identified, each of them containing both carboquinone derivatives belonging to training as well as to test sets. The observed activity of carboquinone derivatives proved to be normal distributed on every. The presence of training and test sets in all clusters identified using generalized cluster analysis with K-means algorithm and the distribution of observed activity within clusters sustain a proper assignment of compounds in training and test set. Conclusion: Generalized cluster analysis using the K-means algorithm proved to be a valid method in assessment of random assignment of carboquinone derivatives in training and test sets.

  8. Development of a Semi-Automatic Technique for Flow Estimation using Optical Flow Registration and k-means Clustering on Two Dimensional Cardiovascular Magnetic Resonance Flow Images

    DEFF Research Database (Denmark)

    Brix, Lau; Christoffersen, Christian P. V.; Kristiansen, Martin Søndergaard

    was then categorized into groups by the k-means clustering method. Finally, the cluster containing the vessel under investigation was selected manually by a single mouse click. All calculations were performed on a Nvidia 8800 GTX graphics card using the Compute Unified Device Architecture (CUDA) extension to the C...

  9. A Simple Density with Distance Based Initial Seed Selection Technique for K Means Algorithm

    Directory of Open Access Journals (Sweden)

    Sajidha Syed Azimuddin

    2017-01-01

    Full Text Available Open issues with respect to K means algorithm are identifying the number of clusters, initial seed concept selection, clustering tendency, handling empty clusters, identifying outliers etc. In this paper we propose a novel and a simple technique considering both density and distance of the concepts in a dataset to identify initial seed concepts for clustering. Many authors have proposed different techniques to identify initial seed concepts; but our method ensures that the initial seed concepts are chosen from different clusters that are to be generated by the clustering solution. The hallmark of our algorithm is that it is a single pass algorithm that does not require any extra parameters to be estimated. Further, our seed concepts are one among the actual concepts and not the mean of representative concepts as is the case in many other algorithms. We have implemented our proposed algorithm and compared the results with the interval based technique of Fouad Khan. We see that our method outperforms the interval based method. We have also compared our method with the original random K means and K Means++ algorithms.

  10. Integrating K-means Clustering with Kernel Density Estimation for the Development of a Conditional Weather Generation Downscaling Model

    Science.gov (United States)

    Chen, Y.; Ho, C.; Chang, L.

    2011-12-01

    In previous decades, the climate change caused by global warming increases the occurrence frequency of extreme hydrological events. Water supply shortages caused by extreme events create great challenges for water resource management. To evaluate future climate variations, general circulation models (GCMs) are the most wildly known tools which shows possible weather conditions under pre-defined CO2 emission scenarios announced by IPCC. Because the study area of GCMs is the entire earth, the grid sizes of GCMs are much larger than the basin scale. To overcome the gap, a statistic downscaling technique can transform the regional scale weather factors into basin scale precipitations. The statistic downscaling technique can be divided into three categories include transfer function, weather generator and weather type. The first two categories describe the relationships between the weather factors and precipitations respectively based on deterministic algorithms, such as linear or nonlinear regression and ANN, and stochastic approaches, such as Markov chain theory and statistical distributions. In the weather type, the method has ability to cluster weather factors, which are high dimensional and continuous variables, into weather types, which are limited number of discrete states. In this study, the proposed downscaling model integrates the weather type, using the K-means clustering algorithm, and the weather generator, using the kernel density estimation. The study area is Shihmen basin in northern of Taiwan. In this study, the research process contains two steps, a calibration step and a synthesis step. Three sub-steps were used in the calibration step. First, weather factors, such as pressures, humidities and wind speeds, obtained from NCEP and the precipitations observed from rainfall stations were collected for downscaling. Second, the K-means clustering grouped the weather factors into four weather types. Third, the Markov chain transition matrixes and the

  11. A new method of spatio-temporal topographic mapping by correlation coefficient of K-means cluster.

    Science.gov (United States)

    Li, Ling; Yao, Dezhong

    2007-01-01

    It would be of the utmost interest to map correlated sources in the working human brain by Event-Related Potentials (ERPs). This work is to develop a new method to map correlated neural sources based on the time courses of the scalp ERPs waveforms. The ERP data are classified first by k-means cluster analysis, and then the Correlation Coefficients (CC) between the original data of each electrode channel and the time course of each cluster centroid are calculated and utilized as the mapping variable on the scalp surface. With a normalized 4-concentric-sphere head model with radius 1, the performance of the method is evaluated by simulated data. CC, between simulated four sources (s (1)-s (4)) and the estimated cluster centroids (c (1)-c (4)), and the distances (Ds), between the scalp projection points of the s (1)-s (4) and that of the c (1)-c (4), are utilized as the evaluation indexes. Applied to four sources with two of them partially correlated (with maximum mutual CC = 0.4892), CC (Ds) between s (1)-s (4) and c (1)-c (4) are larger (smaller) than 0.893 (0.108) for noise levels NSRclusters located at left, right occipital and frontal. The estimated vectors of the contra-occipital area demonstrate that attention to the stimulus location produces increased amplitude of the P1 and N1 components over the contra-occipital scalp. The estimated vector in the frontal area displays two large processing negativity waves around 100 ms and 250 ms when subjects are attentive, and there is a small negative wave around 140 ms and a P300 when subjects are unattentive. The results of simulations and real Visual Evoked Potentials (VEPs) data demonstrate the validity of the method in mapping correlated sources. This method may be an objective, heuristic and important tool to study the properties of cerebral, neural networks in cognitive and clinical neurosciences.

  12. Recognizing upper limb movements with wrist worn inertial sensors using k-means clustering classification.

    Science.gov (United States)

    Biswas, Dwaipayan; Cranny, Andy; Gupta, Nayaab; Maharatna, Koushik; Achner, Josy; Klemke, Jasmin; Jöbges, Michael; Ortmann, Steffen

    2015-04-01

    In this paper we present a methodology for recognizing three fundamental movements of the human forearm (extension, flexion and rotation) using pattern recognition applied to the data from a single wrist-worn, inertial sensor. We propose that this technique could be used as a clinical tool to assess rehabilitation progress in neurodegenerative pathologies such as stroke or cerebral palsy by tracking the number of times a patient performs specific arm movements (e.g. prescribed exercises) with their paretic arm throughout the day. We demonstrate this with healthy subjects and stroke patients in a simple proof of concept study in which these arm movements are detected during an archetypal activity of daily-living (ADL) - 'making-a-cup-of-tea'. Data is collected from a tri-axial accelerometer and a tri-axial gyroscope located proximal to the wrist. In a training phase, movements are initially performed in a controlled environment which are represented by a ranked set of 30 time-domain features. Using a sequential forward selection technique, for each set of feature combinations three clusters are formed using k-means clustering followed by 10 runs of 10-fold cross validation on the training data to determine the best feature combinations. For the testing phase, movements performed during the ADL are associated with each cluster label using a minimum distance classifier in a multi-dimensional feature space, comprised of the best ranked features, using Euclidean or Mahalanobis distance as the metric. Experiments were performed with four healthy subjects and four stroke survivors and our results show that the proposed methodology can detect the three movements performed during the ADL with an overall average accuracy of 88% using the accelerometer data and 83% using the gyroscope data across all healthy subjects and arm movement types. The average accuracy across all stroke survivors was 70% using accelerometer data and 66% using gyroscope data. We also use a Linear

  13. CLASSIFICATION OF LIDAR DATA OVER BUILDING ROOFS USING K-MEANS AND PRINCIPAL COMPONENT ANALYSIS

    Directory of Open Access Journals (Sweden)

    Renato César dos Santos

    Full Text Available Abstract: The classification is an important step in the extraction of geometric primitives from LiDAR data. Normally, it is applied for the identification of points sampled on geometric primitives of interest. In the literature there are several studies that have explored the use of eigenvalues to classify LiDAR points into different classes or structures, such as corner, edge, and plane. However, in some works the classes are defined considering an ideal geometry, which can be affected by the inadequate sampling and/or by the presence of noise when using real data. To overcome this limitation, in this paper is proposed the use of metrics based on eigenvalues and the k-means method to carry out the classification. So, the concept of principal component analysis is used to obtain the eigenvalues and the derived metrics, while the k-means is applied to cluster the roof points in two classes: edge and non-edge. To evaluate the proposed method four test areas with different levels of complexity were selected. From the qualitative and quantitative analyses, it could be concluded that the proposed classification procedure gave satisfactory results, resulting in completeness and correctness above 92% for the non-edge class, and between 61% to 98% for the edge class.

  14. A Fast SVM-Based Tongue's Colour Classification Aided by k-Means Clustering Identifiers and Colour Attributes as Computer-Assisted Tool for Tongue Diagnosis

    Science.gov (United States)

    Ooi, Chia Yee; Kawanabe, Tadaaki; Odaguchi, Hiroshi; Kobayashi, Fuminori

    2017-01-01

    In tongue diagnosis, colour information of tongue body has kept valuable information regarding the state of disease and its correlation with the internal organs. Qualitatively, practitioners may have difficulty in their judgement due to the instable lighting condition and naked eye's ability to capture the exact colour distribution on the tongue especially the tongue with multicolour substance. To overcome this ambiguity, this paper presents a two-stage tongue's multicolour classification based on a support vector machine (SVM) whose support vectors are reduced by our proposed k-means clustering identifiers and red colour range for precise tongue colour diagnosis. In the first stage, k-means clustering is used to cluster a tongue image into four clusters of image background (black), deep red region, red/light red region, and transitional region. In the second-stage classification, red/light red tongue images are further classified into red tongue or light red tongue based on the red colour range derived in our work. Overall, true rate classification accuracy of the proposed two-stage classification to diagnose red, light red, and deep red tongue colours is 94%. The number of support vectors in SVM is improved by 41.2%, and the execution time for one image is recorded as 48 seconds. PMID:29065640

  15. A Fast SVM-Based Tongue's Colour Classification Aided by k-Means Clustering Identifiers and Colour Attributes as Computer-Assisted Tool for Tongue Diagnosis.

    Science.gov (United States)

    Kamarudin, Nur Diyana; Ooi, Chia Yee; Kawanabe, Tadaaki; Odaguchi, Hiroshi; Kobayashi, Fuminori

    2017-01-01

    In tongue diagnosis, colour information of tongue body has kept valuable information regarding the state of disease and its correlation with the internal organs. Qualitatively, practitioners may have difficulty in their judgement due to the instable lighting condition and naked eye's ability to capture the exact colour distribution on the tongue especially the tongue with multicolour substance. To overcome this ambiguity, this paper presents a two-stage tongue's multicolour classification based on a support vector machine (SVM) whose support vectors are reduced by our proposed k -means clustering identifiers and red colour range for precise tongue colour diagnosis. In the first stage, k -means clustering is used to cluster a tongue image into four clusters of image background (black), deep red region, red/light red region, and transitional region. In the second-stage classification, red/light red tongue images are further classified into red tongue or light red tongue based on the red colour range derived in our work. Overall, true rate classification accuracy of the proposed two-stage classification to diagnose red, light red, and deep red tongue colours is 94%. The number of support vectors in SVM is improved by 41.2%, and the execution time for one image is recorded as 48 seconds.

  16. Cluster analysis of midlatitude oceanic cloud regimes: mean properties and temperature sensitivity

    Directory of Open Access Journals (Sweden)

    N. D. Gordon

    2010-07-01

    Full Text Available Clouds play an important role in the climate system by reducing the amount of shortwave radiation reaching the surface and the amount of longwave radiation escaping to space. Accurate simulation of clouds in computer models remains elusive, however, pointing to a lack of understanding of the connection between large-scale dynamics and cloud properties. This study uses a k-means clustering algorithm to group 21 years of satellite cloud data over midlatitude oceans into seven clusters, and demonstrates that the cloud clusters are associated with distinct large-scale dynamical conditions. Three clusters correspond to low-level cloud regimes with different cloud fraction and cumuliform or stratiform characteristics, but all occur under large-scale descent and a relatively dry free troposphere. Three clusters correspond to vertically extensive cloud regimes with tops in the middle or upper troposphere, and they differ according to the strength of large-scale ascent and enhancement of tropospheric temperature and humidity. The final cluster is associated with a lower troposphere that is dry and an upper troposphere that is moist and experiencing weak ascent and horizontal moist advection.

    Since the present balance of reflection of shortwave and absorption of longwave radiation by clouds could change as the atmosphere warms from increasing anthropogenic greenhouse gases, we must also better understand how increasing temperature modifies cloud and radiative properties. We therefore undertake an observational analysis of how midlatitude oceanic clouds change with temperature when dynamical processes are held constant (i.e., partial derivative with respect to temperature. For each of the seven cloud regimes, we examine the difference in cloud and radiative properties between warm and cold subsets. To avoid misinterpreting a cloud response to large-scale dynamical forcing as a cloud response to temperature, we require horizontal and vertical

  17. 2-Way k-Means as a Model for Microbiome Samples.

    Science.gov (United States)

    Jackson, Weston J; Agarwal, Ipsita; Pe'er, Itsik

    2017-01-01

    Motivation . Microbiome sequencing allows defining clusters of samples with shared composition. However, this paradigm poorly accounts for samples whose composition is a mixture of cluster-characterizing ones and which therefore lie in between them in the cluster space. This paper addresses unsupervised learning of 2-way clusters. It defines a mixture model that allows 2-way cluster assignment and describes a variant of generalized k -means for learning such a model. We demonstrate applicability to microbial 16S rDNA sequencing data from the Human Vaginal Microbiome Project.

  18. K-AP: Generating specified K clusters by efficient Affinity Propagation

    KAUST Repository

    Zhang, Xiangliang; Wang, Wei; Nø rvå g, Kjetil; Sebag, Michè le

    2010-01-01

    and experimental validation, K-AP was shown to be able to directly generate K clusters as user defined, with a negligible increase of computational cost compared to AP. In the meanwhile, K-AP preserves the clustering quality as AP in terms of the distortion. K

  19. A comparison of heuristic and model-based clustering methods for dietary pattern analysis.

    Science.gov (United States)

    Greve, Benjamin; Pigeot, Iris; Huybrechts, Inge; Pala, Valeria; Börnhorst, Claudia

    2016-02-01

    Cluster analysis is widely applied to identify dietary patterns. A new method based on Gaussian mixture models (GMM) seems to be more flexible compared with the commonly applied k-means and Ward's method. In the present paper, these clustering approaches are compared to find the most appropriate one for clustering dietary data. The clustering methods were applied to simulated data sets with different cluster structures to compare their performance knowing the true cluster membership of observations. Furthermore, the three methods were applied to FFQ data assessed in 1791 children participating in the IDEFICS (Identification and Prevention of Dietary- and Lifestyle-Induced Health Effects in Children and Infants) Study to explore their performance in practice. The GMM outperformed the other methods in the simulation study in 72 % up to 100 % of cases, depending on the simulated cluster structure. Comparing the computationally less complex k-means and Ward's methods, the performance of k-means was better in 64-100 % of cases. Applied to real data, all methods identified three similar dietary patterns which may be roughly characterized as a 'non-processed' cluster with a high consumption of fruits, vegetables and wholemeal bread, a 'balanced' cluster with only slight preferences of single foods and a 'junk food' cluster. The simulation study suggests that clustering via GMM should be preferred due to its higher flexibility regarding cluster volume, shape and orientation. The k-means seems to be a good alternative, being easier to use while giving similar results when applied to real data.

  20. "Analyzing the Longitudinal K-12 Grading Histories of Entire Cohorts of Students: Grades, Data Driven Decision Making, Dropping out and Hierarchical Cluster Analysis"

    Directory of Open Access Journals (Sweden)

    Alex J. Bowers

    2010-05-01

    Full Text Available School personnel currently lack an effective method to pattern and visually interpret disaggregated achievement data collected on students as a means to help inform decision making. This study, through the examination of longitudinal K-12 teacher assigned grading histories for entire cohorts of students from a school district (n=188, demonstrates a novel application of hierarchical cluster analysis and pattern visualization in which all data points collected on every student in a cohort can be patterned, visualized and interpreted to aid in data driven decision making by teachers and administrators. Additionally, as a proof-of-concept study, overall schooling outcomes, such as student dropout or taking a college entrance exam, are identified from the data patterns and compared to past methods of dropout identification as one example of the usefulness of the method. Hierarchical cluster analysis correctly identified over 80% of the students who dropped out using the entire student grade history patterns from either K-12 or K-8.

  1. Star clusters and K2

    Science.gov (United States)

    Dotson, Jessie; Barentsen, Geert; Cody, Ann Marie

    2018-01-01

    The K2 survey has expanded the Kepler legacy by using the repurposed spacecraft to observe over 20 star clusters. The sample includes open and globular clusters at all ages, including very young (1-10 Myr, e.g. Taurus, Upper Sco, NGC 6530), moderately young (0.1-1 Gyr, e.g. M35, M44, Pleiades, Hyades), middle-aged (e.g. M67, Ruprecht 147, NGC 2158), and old globular clusters (e.g. M9, M19, Terzan 5). K2 observations of stellar clusters are exploring the rotation period-mass relationship to significantly lower masses than was previously possible, shedding light on the angular momentum budget and its dependence on mass and circumstellar disk properties, and illuminating the role of multiplicity in stellar angular momentum. Exoplanets discovered by K2 in stellar clusters provides planetary systems ripe for modeling given the extensive information available about their ages and environment. I will review the star clusters sampled by K2 across 16 fields so far, highlighting several characteristics, caveats, and unexplored uses of the public data set along the way. With fuel expected to run out in 2018, I will discuss the closing Campaigns, highlight the final target selection opportunities, and explain the data archive and TESS-compatible software tools the K2 mission intends to leave behind for posterity.

  2. Enhanced bag of words using multilevel k-means for human activity recognition

    Directory of Open Access Journals (Sweden)

    Motasem Elshourbagy

    2016-07-01

    Full Text Available This paper aims to enhance the bag of features in order to improve the accuracy of human activity recognition. In this paper, human activity recognition process consists of four stages: local space time features detection, feature description, bag of features representation, and SVMs classification. The k-means step in the bag of features is enhanced by applying three levels of clustering: clustering per video, clustering per action class, and clustering for the final code book. The experimental results show that the proposed method of enhancement reduces the time and memory requirements, and enables the use of all training data in the k-means clustering algorithm. The evaluation of accuracy of action classification on two popular datasets (KTH and Weizmann has been performed. In addition, the proposed method improves the human activity recognition accuracy by 5.57% on the KTH dataset using the same detector, descriptor, and classifier.

  3. SISTEM PEMBAGIAN KELAS KULIAH MAHASISWA DENGAN METODE K-MEANS DAN K-NEAREST NEIGHBORS UNTUK MENINGKATKAN KUALITAS PEMBELAJARAN

    Directory of Open Access Journals (Sweden)

    Gede Aditra Pradnyana

    2018-01-01

    Full Text Available Permasalahan yang terjadi saat pembentukan atau pembagian kelas mahasiswa adalah perbedaan kemampuan yang dimiliki oleh mahasiswa di setiap kelasnya yang dapat berdampak pada tidak efektifnya proses pembelajaran yang berlangsung. Pengelompokkan mahasiswa dengan kemampuan yang sama merupakan hal yang sangat penting dalam rangka meningkatkan kualitas proses belajar mengajar yang dilakukan. Dengan pengelompokkan mahasiswa yang tepat, mereka akan dapat saling membantu dalam proses pembelajaran. Selain itu, membagi kelas mahasiswa sesuai dengan kemampuannya dapat mempermudah tenaga pendidik dalam menentukan metode atau strategi pembelajaran yang sesuai. Penggunaan metode dan strategi pembelajaran yang tepat akan meningkatkan efektifitas proses belajar mengajar. Pada penelitian ini dirancang sebuah metode baru untuk pembagian kelas kuliah mahasiswa dengan mengkombinasikan metode K-Means dan K-Nearest Neighbors (KNN. Metode K-means digunakan untuk pembagian kelas kuliah mahasiswa berdasarkan komponen penilaian dari mata kuliah prasyaratnya. Adapun fitur yang digunakan dalam pengelompokkan adalah nilai tugas, nilai ujian tengah semester, nilai ujian akhir semester, dan indeks prestasi kumulatif (IPK. Metode KNN digunakan untuk memprediksi kelulusan seoarang mahasiswa di sebuah matakuliah berdasarkan data sebelumnya. Hasil prediksi ini akan digunakan sebagai fitur tambahan yang digunakan dalam pembentukan kelas mahasiswa menggunakan metode K-means. Pendekatan yang digunakan dalam penelitian ini adalah Software Development Live Cycle (SDLC dengan model waterfall. Berdasarkan hasil pengujian yang dilakukan diperoleh kesimpulan bahwa jumlah cluster atau kelas dan jumlah data yang digunakan mempengaruhi dari kualitas cluster yang dibentuk oleh metode K-Means dan KNN yang digunakan. Nilai Silhouette Indeks tertinggi diperolah saat menggunakan 100 data dengan jumlah cluster 10 sebesar 0,534 yang tergolong kelas dengan kualitas medium structure.

  4. Hierarchical Aligned Cluster Analysis for Temporal Clustering of Human Motion.

    Science.gov (United States)

    Zhou, Feng; De la Torre, Fernando; Hodgins, Jessica K

    2013-03-01

    Temporal segmentation of human motion into plausible motion primitives is central to understanding and building computational models of human motion. Several issues contribute to the challenge of discovering motion primitives: the exponential nature of all possible movement combinations, the variability in the temporal scale of human actions, and the complexity of representing articulated motion. We pose the problem of learning motion primitives as one of temporal clustering, and derive an unsupervised hierarchical bottom-up framework called hierarchical aligned cluster analysis (HACA). HACA finds a partition of a given multidimensional time series into m disjoint segments such that each segment belongs to one of k clusters. HACA combines kernel k-means with the generalized dynamic time alignment kernel to cluster time series data. Moreover, it provides a natural framework to find a low-dimensional embedding for time series. HACA is efficiently optimized with a coordinate descent strategy and dynamic programming. Experimental results on motion capture and video data demonstrate the effectiveness of HACA for segmenting complex motions and as a visualization tool. We also compare the performance of HACA to state-of-the-art algorithms for temporal clustering on data of a honey bee dance. The HACA code is available online.

  5. Noise reduction and functional maps image quality improvement in dynamic CT perfusion using a new k-means clustering guided bilateral filter (KMGB).

    Science.gov (United States)

    Pisana, Francesco; Henzler, Thomas; Schönberg, Stefan; Klotz, Ernst; Schmidt, Bernhard; Kachelrieß, Marc

    2017-07-01

    Dynamic CT perfusion (CTP) consists in repeated acquisitions of the same volume in different time steps, slightly before, during and slightly afterwards the injection of contrast media. Important functional information can be derived for each voxel, which reflect the local hemodynamic properties and hence the metabolism of the tissue. Different approaches are being investigated to exploit data redundancy and prior knowledge for noise reduction of such datasets, ranging from iterative reconstruction schemes to high dimensional filters. We propose a new spatial bilateral filter which makes use of the k-means clustering algorithm and of an optimal calculated guiding image. We named the proposed filter as k-means clustering guided bilateral filter (KMGB). In this study, the KMGB filter is compared with the partial temporal non-local means filter (PATEN), with the time-intensity profile similarity (TIPS) filter, and with a new version derived from it, by introducing the guiding image (GB-TIPS). All the filters were tested on a digital in-house developed brain CTP phantom, were noise was added to simulate 80 kV and 200 mAs (default scanning parameters), 100 mAs and 30 mAs. Moreover, the filters performances were tested on 7 noisy clinical datasets with different pathologies in different body regions. The original contribution of our work is two-fold: first we propose an efficient algorithm to calculate a guiding image to improve the results of the TIPS filter, secondly we propose the introduction of the k-means clustering step and demonstrate how this can potentially replace the TIPS part of the filter obtaining better results at lower computational efforts. As expected, in the GB-TIPS, the introduction of the guiding image limits the over-smoothing of the TIPS filter, improving spatial resolution by more than 50%. Furthermore, replacing the time-intensity profile similarity calculation with a fuzzy k-means clustering strategy (KMGB) allows to control the edge preserving

  6. Coexistence of Cluster Structure and Mean-field-type Structure in Medium-weight Nuclei

    International Nuclear Information System (INIS)

    Taniguchi, Yasutaka; Horiuchi, Hisashi; Kimura, Masaaki

    2006-01-01

    We have studied the coexistence of cluster structure and mean-field-type structure in 20Ne and 40Ca using Antisymmetrized Molecular Dynamics (AMD) + Generator Coordinate Method (GCM). By energy variation with new constraint for clustering, we calculate cluster structure wave function. Superposing cluster structure wave functions and mean-field-type structure wave function, we found that 8Be-12C, α-36Ar and 12C-28Si cluster structure are important components of K π = 0 3 + band of 20Ne, that of normal deformed band of 40Ca and that of super deformed band of 40Ca, respectively

  7. An initialization method for the k-means using the concept of useful nearest centers

    OpenAIRE

    Ismkhan, Hassan

    2017-01-01

    The aim of the k-means is to minimize squared sum of Euclidean distance from the mean (SSEDM) of each cluster. The k-means can effectively optimize this function, but it is too sensitive for initial centers (seeds). This paper proposed a method for initialization of the k-means using the concept of useful nearest center for each data point.

  8. Unemployment and sovereign debt crisis in the Eurozone: A k-means- r analysis

    Science.gov (United States)

    Dias, João

    2017-09-01

    Some southern countries in Europe, together with Ireland, were particularly affected by the sovereign debt crises in the Eurozone and were obliged to implement tough corrective measures which proved to be very recessive in nature. As a result, not only GDP declined but unemployment jumped to very high levels as well. This paper uses a modified version of k-means (restricted k-means) to analyze the clustering of the Eurozone countries during the recent sovereign debt crisis, combining monthly data on unemployment and government bond yield rates. Our method shows that the separation of southern Europe from the other Eurozone is not necessarily a good characterization of this area before the crisis but the group of externally assisted countries plus Italy gains consistence as the crisis evolved, although there is no perfect homogeneity in this group, since the problems they faced, the type of response requested, the speed of reaction to the crisis and the lasting effects were not the same for all these countries.

  9. A Fast SVM-Based Tongue’s Colour Classification Aided by k-Means Clustering Identifiers and Colour Attributes as Computer-Assisted Tool for Tongue Diagnosis

    Directory of Open Access Journals (Sweden)

    Nur Diyana Kamarudin

    2017-01-01

    Full Text Available In tongue diagnosis, colour information of tongue body has kept valuable information regarding the state of disease and its correlation with the internal organs. Qualitatively, practitioners may have difficulty in their judgement due to the instable lighting condition and naked eye’s ability to capture the exact colour distribution on the tongue especially the tongue with multicolour substance. To overcome this ambiguity, this paper presents a two-stage tongue’s multicolour classification based on a support vector machine (SVM whose support vectors are reduced by our proposed k-means clustering identifiers and red colour range for precise tongue colour diagnosis. In the first stage, k-means clustering is used to cluster a tongue image into four clusters of image background (black, deep red region, red/light red region, and transitional region. In the second-stage classification, red/light red tongue images are further classified into red tongue or light red tongue based on the red colour range derived in our work. Overall, true rate classification accuracy of the proposed two-stage classification to diagnose red, light red, and deep red tongue colours is 94%. The number of support vectors in SVM is improved by 41.2%, and the execution time for one image is recorded as 48 seconds.

  10. Cluster radioactive decay within the preformed cluster model using relativistic mean-field theory densities

    International Nuclear Information System (INIS)

    Singh, BirBikram; Patra, S. K.; Gupta, Raj K.

    2010-01-01

    We have studied the (ground-state) cluster radioactive decays within the preformed cluster model (PCM) of Gupta and collaborators [R. K. Gupta, in Proceedings of the 5th International Conference on Nuclear Reaction Mechanisms, Varenna, edited by E. Gadioli (Ricerca Scientifica ed Educazione Permanente, Milano, 1988), p. 416; S. S. Malik and R. K. Gupta, Phys. Rev. C 39, 1992 (1989)]. The relativistic mean-field (RMF) theory is used to obtain the nuclear matter densities for the double folding procedure used to construct the cluster-daughter potential with M3Y nucleon-nucleon interaction including exchange effects. Following the PCM approach, we have deduced empirically the preformation probability P 0 emp from the experimental data on both the α- and exotic cluster-decays, specifically of parents in the trans-lead region having doubly magic 208 Pb or its neighboring nuclei as daughters. Interestingly, the RMF-densities-based nuclear potential supports the concept of preformation for both the α and heavier clusters in radioactive nuclei. P 0 α(emp) for α decays is almost constant (∼10 -2 -10 -3 ) for all the parent nuclei considered here, and P 0 c(emp) for cluster decays of the same parents decrease with the size of clusters emitted from different parents. The results obtained for P 0 c(emp) are reasonable and are within two to three orders of magnitude of the well-accepted phenomenological model of Blendowske-Walliser for light clusters.

  11. Forecasting hourly global solar radiation using hybrid k-means and nonlinear autoregressive neural network models

    International Nuclear Information System (INIS)

    Benmouiza, Khalil; Cheknane, Ali

    2013-01-01

    Highlights: • An unsupervised clustering algorithm with a neural network model was explored. • The forecasting results of solar radiation time series and the comparison of their performance was simulated. • A new method was proposed combining k-means algorithm and NAR network to provide better prediction results. - Abstract: In this paper, we review our work for forecasting hourly global horizontal solar radiation based on the combination of unsupervised k-means clustering algorithm and artificial neural networks (ANN). k-Means algorithm focused on extracting useful information from the data with the aim of modeling the time series behavior and find patterns of the input space by clustering the data. On the other hand, nonlinear autoregressive (NAR) neural networks are powerful computational models for modeling and forecasting nonlinear time series. Taking the advantage of both methods, a new method was proposed combining k-means algorithm and NAR network to provide better forecasting results

  12. Clustering Dycom

    KAUST Repository

    Minku, Leandro L.; Hou, Siqing

    2017-01-01

    baseline WC model is also included in the analysis. Results: Clustering Dycom with K-Means can potentially help to split the CC projects, managing to achieve similar or better predictive performance than Dycom. However, K-Means still requires the number

  13. Approximate fuzzy C-means (AFCM) cluster analysis of medical magnetic resonance image (MRI) data

    International Nuclear Information System (INIS)

    DelaPaz, R.L.; Chang, P.J.; Bernstein, R.; Dave, J.V.

    1987-01-01

    The authors describe the application of an approximate fuzzy C-means (AFCM) clustering algorithm as a data dimension reduction approach to medical magnetic resonance images (MRI). Image data consisted of one T1-weighted, two T2-weighted, and one T2*-weighted (magnetic susceptibility) image for each cranial study and a matrix of 10 images generated from 10 combinations of TE and TR for each body lymphoma study. All images were obtained with a 1.5 Tesla imaging system (GE Signa). Analyses were performed on over 100 MR image sets with a variety of pathologies. The cluster analysis was operated in an unsupervised mode and computational overhead was minimized by utilizing a table look-up approach without adversely affecting accuracy. Image data were first segmented into 2 coarse clusters, each of which was then subdivided into 16 fine clusters. The final tissue classifications were presented as color-coded anatomically-mapped images and as two and three dimensional displays of cluster center data in selected feature space (minimum spanning tree). Fuzzy cluster analysis appears to be a clinically useful dimension reduction technique which results in improved diagnostic specificity of medical magnetic resonance images

  14. Extending the input–output energy balance methodology in agriculture through cluster analysis

    International Nuclear Information System (INIS)

    Bojacá, Carlos Ricardo; Casilimas, Héctor Albeiro; Gil, Rodrigo; Schrevens, Eddie

    2012-01-01

    The input–output balance methodology has been applied to characterize the energy balance of agricultural systems. This study proposes to extend this methodology with the inclusion of multivariate analysis to reveal particular patterns in the energy use of a system. The objective was to demonstrate the usefulness of multivariate exploratory techniques to analyze the variability found in a farming system and, establish efficiency categories that can be used to improve the energy balance of the system. To this purpose an input–output analysis was applied to the major greenhouse tomato production area in Colombia. Individual energy profiles were built and the k-means clustering method was applied to the production factors. On average, the production system in the study zone consumes 141.8 GJ ha −1 to produce 96.4 GJ ha −1 , resulting in an energy efficiency of 0.68. With the k-means clustering analysis, three clusters of farmers were identified with energy efficiencies of 0.54, 0.67 and 0.78. The most energy efficient cluster grouped 56.3% of the farmers. It is possible to optimize the production system by improving the management practices of those with the lowest energy use efficiencies. Multivariate analysis techniques demonstrated to be a complementary pathway to improve the energy efficiency of a system. -- Highlights: ► An input–output energy balance was estimated for greenhouse tomatoes in Colombia. ► We used the k-means clustering method to classify growers based on their energy use. ► Three clusters of growers were found with energy efficiencies of 0.54, 0.67 and 0.78. ► Overall system optimization is possible by improving the energy use of the less efficient.

  15. A New Variable Weighting and Selection Procedure for K-Means Cluster Analysis

    Science.gov (United States)

    Steinley, Douglas; Brusco, Michael J.

    2008-01-01

    A variance-to-range ratio variable weighting procedure is proposed. We show how this weighting method is theoretically grounded in the inherent variability found in data exhibiting cluster structure. In addition, a variable selection procedure is proposed to operate in conjunction with the variable weighting technique. The performances of these…

  16. A Comparison of AOP Classification Based on Difficulty, Importance, and Frequency by Cluster Analysis and Standardized Mean

    International Nuclear Information System (INIS)

    Choi, Sun Yeong; Jung, Wondea

    2014-01-01

    In Korea, there are plants that have more than one-hundred kinds of abnormal operation procedures (AOPs). Therefore, operators have started to recognize the importance of classifying the AOPs. They should pay attention to those AOPs required to take emergency measures against an abnormal status that has a more serious effect on plant safety and/or often occurs. We suggested a measure of prioritizing AOPs for a training purpose based on difficulty, importance, and frequency. A DIF analysis based on how difficult the task is, how important it is, and how frequently they occur is a well-known method of assessing the performance, prioritizing training needs and planning. We used an SDIF-mean (Standardized DIF-mean) to prioritize AOPs in the previous paper. For the SDIF-mean, we standardized the three kinds of data respectively. The results of this research will be utilized not only to understand the AOP characteristics at a job analysis level but also to develop an effective AOP training program. The purpose of this paper is to perform a cluster analysis for an AOP classification and compare the results through a cluster analysis with that by a standardized mean based on difficulty, importance, and frequency. In this paper, we categorized AOPs into three groups by a cluster analysis based on D, I, and F. Clustering is the classification of similar objects into groups so that each group shares some common characteristics. In addition, we compared the result by the cluster analysis in this paper with the classification result by the SDIF-mean in the previous paper. From the comparison, we found that a reevaluation can be required to assign a training interval for the AOPs of group C' in the previous paper those have lower SDIF-mean. The reason for this is that some of the AOPs of group C' have quite high D and I values while they have the lowest frequencies. From an educational point of view, AOPs in group which have the highest difficulty and importance, but

  17. A Comparison of AOP Classification Based on Difficulty, Importance, and Frequency by Cluster Analysis and Standardized Mean

    Energy Technology Data Exchange (ETDEWEB)

    Choi, Sun Yeong; Jung, Wondea [Korea Atomic Energy Research Institute, Daejeon (Korea, Republic of)

    2014-05-15

    In Korea, there are plants that have more than one-hundred kinds of abnormal operation procedures (AOPs). Therefore, operators have started to recognize the importance of classifying the AOPs. They should pay attention to those AOPs required to take emergency measures against an abnormal status that has a more serious effect on plant safety and/or often occurs. We suggested a measure of prioritizing AOPs for a training purpose based on difficulty, importance, and frequency. A DIF analysis based on how difficult the task is, how important it is, and how frequently they occur is a well-known method of assessing the performance, prioritizing training needs and planning. We used an SDIF-mean (Standardized DIF-mean) to prioritize AOPs in the previous paper. For the SDIF-mean, we standardized the three kinds of data respectively. The results of this research will be utilized not only to understand the AOP characteristics at a job analysis level but also to develop an effective AOP training program. The purpose of this paper is to perform a cluster analysis for an AOP classification and compare the results through a cluster analysis with that by a standardized mean based on difficulty, importance, and frequency. In this paper, we categorized AOPs into three groups by a cluster analysis based on D, I, and F. Clustering is the classification of similar objects into groups so that each group shares some common characteristics. In addition, we compared the result by the cluster analysis in this paper with the classification result by the SDIF-mean in the previous paper. From the comparison, we found that a reevaluation can be required to assign a training interval for the AOPs of group C' in the previous paper those have lower SDIF-mean. The reason for this is that some of the AOPs of group C' have quite high D and I values while they have the lowest frequencies. From an educational point of view, AOPs in group which have the highest difficulty and importance, but

  18. A diabetic retinopathy detection method using an improved pillar K-means algorithm.

    Science.gov (United States)

    Gogula, Susmitha Valli; Divakar, Ch; Satyanarayana, Ch; Rao, Allam Appa

    2014-01-01

    The paper presents a new approach for medical image segmentation. Exudates are a visible sign of diabetic retinopathy that is the major reason of vision loss in patients with diabetes. If the exudates extend into the macular area, blindness may occur. Automated detection of exudates will assist ophthalmologists in early diagnosis. This segmentation process includes a new mechanism for clustering the elements of high-resolution images in order to improve precision and reduce computation time. The system applies K-means clustering to the image segmentation after getting optimized by Pillar algorithm; pillars are constructed in such a way that they can withstand the pressure. Improved pillar algorithm can optimize the K-means clustering for image segmentation in aspects of precision and computation time. This evaluates the proposed approach for image segmentation by comparing with Kmeans and Fuzzy C-means in a medical image. Using this method, identification of dark spot in the retina becomes easier and the proposed algorithm is applied on diabetic retinal images of all stages to identify hard and soft exudates, where the existing pillar K-means is more appropriate for brain MRI images. This proposed system help the doctors to identify the problem in the early stage and can suggest a better drug for preventing further retinal damage.

  19. Integrasi Algoritma K-Means Dengan Bahasa SQL Untuk Klasterisasi IPK Mahasiswa (Studi Kasus: Fakultas Ilmu Komputer Universitas Brawijaya

    Directory of Open Access Journals (Sweden)

    Issa Arwani

    2015-07-01

    Abstract Generally, clustering implemented with taking data from database to be stored temporarily in a program variable (eg, in an array then continue with clustering process.Directclustering where the data is storedby integrating the clustering algorithm using the SQL language on the DBMS is proposed.In this study focused on the design and implementation of K-means clustering algorithm on a Relational DBMS using the SQL language. The clustering process carried out with a case study of GPA student in the Faculty of Computer Science University of Brawijaya.Based on results with a variety of dimensions, the number of clusters and different distance calculation methods, has obtained clustering data correctly. Based on time complexity to review each stage of the implementation K - means using SQL and without SQL, showing the same results of asymptotic time complexity where phase euclidean distance still requires the highest time complexity. Keywords: Clustering, K-means, SQL, GPA (Grade Point Average

  20. Improving estimation of kinetic parameters in dynamic force spectroscopy using cluster analysis

    Science.gov (United States)

    Yen, Chi-Fu; Sivasankar, Sanjeevi

    2018-03-01

    Dynamic Force Spectroscopy (DFS) is a widely used technique to characterize the dissociation kinetics and interaction energy landscape of receptor-ligand complexes with single-molecule resolution. In an Atomic Force Microscope (AFM)-based DFS experiment, receptor-ligand complexes, sandwiched between an AFM tip and substrate, are ruptured at different stress rates by varying the speed at which the AFM-tip and substrate are pulled away from each other. The rupture events are grouped according to their pulling speeds, and the mean force and loading rate of each group are calculated. These data are subsequently fit to established models, and energy landscape parameters such as the intrinsic off-rate (koff) and the width of the potential energy barrier (xβ) are extracted. However, due to large uncertainties in determining mean forces and loading rates of the groups, errors in the estimated koff and xβ can be substantial. Here, we demonstrate that the accuracy of fitted parameters in a DFS experiment can be dramatically improved by sorting rupture events into groups using cluster analysis instead of sorting them according to their pulling speeds. We test different clustering algorithms including Gaussian mixture, logistic regression, and K-means clustering, under conditions that closely mimic DFS experiments. Using Monte Carlo simulations, we benchmark the performance of these clustering algorithms over a wide range of koff and xβ, under different levels of thermal noise, and as a function of both the number of unbinding events and the number of pulling speeds. Our results demonstrate that cluster analysis, particularly K-means clustering, is very effective in improving the accuracy of parameter estimation, particularly when the number of unbinding events are limited and not well separated into distinct groups. Cluster analysis is easy to implement, and our performance benchmarks serve as a guide in choosing an appropriate method for DFS data analysis.

  1. Multivariate spatial condition mapping using subtractive fuzzy cluster means.

    Science.gov (United States)

    Sabit, Hakilo; Al-Anbuky, Adnan

    2014-10-13

    Wireless sensor networks are usually deployed for monitoring given physical phenomena taking place in a specific space and over a specific duration of time. The spatio-temporal distribution of these phenomena often correlates to certain physical events. To appropriately characterise these events-phenomena relationships over a given space for a given time frame, we require continuous monitoring of the conditions. WSNs are perfectly suited for these tasks, due to their inherent robustness. This paper presents a subtractive fuzzy cluster means algorithm and its application in data stream mining for wireless sensor systems over a cloud-computing-like architecture, which we call sensor cloud data stream mining. Benchmarking on standard mining algorithms, the k-means and the FCM algorithms, we have demonstrated that the subtractive fuzzy cluster means model can perform high quality distributed data stream mining tasks comparable to centralised data stream mining.

  2. Integrating an artificial intelligence approach with k-means clustering to model groundwater salinity: the case of Gaza coastal aquifer (Palestine)

    Science.gov (United States)

    Alagha, Jawad S.; Seyam, Mohammed; Md Said, Md Azlin; Mogheir, Yunes

    2017-12-01

    Artificial intelligence (AI) techniques have increasingly become efficient alternative modeling tools in the water resources field, particularly when the modeled process is influenced by complex and interrelated variables. In this study, two AI techniques—artificial neural networks (ANNs) and support vector machine (SVM)—were employed to achieve deeper understanding of the salinization process (represented by chloride concentration) in complex coastal aquifers influenced by various salinity sources. Both models were trained using 11 years of groundwater quality data from 22 municipal wells in Khan Younis Governorate, Gaza, Palestine. Both techniques showed satisfactory prediction performance, where the mean absolute percentage error (MAPE) and correlation coefficient ( R) for the test data set were, respectively, about 4.5 and 99.8% for the ANNs model, and 4.6 and 99.7% for SVM model. The performances of the developed models were further noticeably improved through preprocessing the wells data set using a k-means clustering method, then conducting AI techniques separately for each cluster. The developed models with clustered data were associated with higher performance, easiness and simplicity. They can be employed as an analytical tool to investigate the influence of input variables on coastal aquifer salinity, which is of great importance for understanding salinization processes, leading to more effective water-resources-related planning and decision making.

  3. Spin dynamics study of magnetic molecular clusters by means of Moessbauer spectroscopy

    International Nuclear Information System (INIS)

    Cianchi, L.; Del Giallo, F.; Spina, G.; Reiff, W.; Caneschi, A.

    2002-01-01

    Spin dynamics of the two magnetic molecular clusters Fe4 and Fe8, with four and eight Fe(III) ions, respectively, was studied by means of Moessbauer spectroscopy. The transition probabilities W's between the spin states of the ground multiplet were obtained from the fitting of the spectra. For the Fe4 cluster we found that, in the range from 1.38 to 77 K, the trend of W's versus the temperature corresponds to an Orbach's process involving an excited state with energy of about 160 K. For the Fe8, which, due to the presence of a low-energy excited state, could not be studied at temperatures greater than 20 K, the trend of W's in the range from 4 to 18 K seems to correspond to a direct process. The correlation functions of the magnetization were then calculated in terms of the W's. They have an exponential trend for the Fe4 cluster, while a small oscillating component is also present for the Fe8 cluster. For the first of the clusters, τ vs T (τ is the decay time of the magnetization) has a trend which, at low temperatures (T 15 K, τ follows the trend of W -1 . For the Fe8, τ follows an Arrhenius law, but with a prefactor which is smaller than the one obtained susceptibility measurements

  4. Usage of K-cluster and factor analysis for grouping and evaluation the quality of olive oil in accordance with physico-chemical parameters

    Science.gov (United States)

    Milev, M.; Nikolova, Kr.; Ivanova, Ir.; Dobreva, M.

    2015-11-01

    25 olive oils were studied- different in origin and ways of extraction, in accordance with 17 physico-chemical parameters as follows: color parameters - a and b, light, fluorescence peaks, pigments - chlorophyll and β-carotene, fatty-acid content. The goals of the current study were: Conducting correlation analysis to find the inner relation between the studied indices; By applying factor analysis with the help of the method of Principal Components (PCA), to reduce the great number of variables into a few factors, which are of main importance for distinguishing the different types of olive oil;Using K-means cluster to compare and group the tested types olive oils based on their similarity. The inner relation between the studied indices was found by applying correlation analysis. A factor analysis using PCA was applied on the basis of the found correlation matrix. Thus the number of the studied indices was reduced to 4 factors, which explained 79.3% from the entire variation. The first one unified the color parameters, β-carotene and the related with oxidative products fluorescence peak - about 520 nm. The second one was determined mainly by the chlorophyll content and related to it fluorescence peak - about 670 nm. The third and the fourth factors were determined by the fatty-acid content of the samples. The third one unified the fatty-acids, which give us the opportunity to distinguish olive oil from the other plant oils - oleic, linoleic and stearin acids. The fourth factor included fatty-acids with relatively much lower content in the studied samples. It is enquired the number of clusters to be determined preliminary in order to apply the K-Cluster analysis. The variant K = 3 was worked out because the types of the olive oil were three. The first cluster unified all salad and pomace olive oils, the second unified the samples of extra virgin oilstaken as controls from producers, which were bought from the trade network. The third cluster unified samples from

  5. Clustering of users of digital libraries through log file analysis

    Directory of Open Access Journals (Sweden)

    Juan Antonio Martínez-Comeche

    2017-09-01

    Full Text Available This study analyzes how users perform information retrieval tasks when introducing queries to the Hispanic Digital Library. Clusters of users are differentiated based on their distinct information behavior. The study used the log files collected by the server over a year and different possible clustering algorithms are compared. The k-means algorithm is found to be a suitable clustering method for the analysis of large log files from digital libraries. In the case of the Hispanic Digital Library the results show three clusters of users and the characteristic information behavior of each group is described.

  6. Iris recognition using image moments and k-means algorithm.

    Science.gov (United States)

    Khan, Yaser Daanial; Khan, Sher Afzal; Ahmad, Farooq; Islam, Saeed

    2014-01-01

    This paper presents a biometric technique for identification of a person using the iris image. The iris is first segmented from the acquired image of an eye using an edge detection algorithm. The disk shaped area of the iris is transformed into a rectangular form. Described moments are extracted from the grayscale image which yields a feature vector containing scale, rotation, and translation invariant moments. Images are clustered using the k-means algorithm and centroids for each cluster are computed. An arbitrary image is assumed to belong to the cluster whose centroid is the nearest to the feature vector in terms of Euclidean distance computed. The described model exhibits an accuracy of 98.5%.

  7. Applications of Cluster Analysis to the Creation of Perfectionism Profiles: A Comparison of two Clustering Approaches

    Directory of Open Access Journals (Sweden)

    Jocelyn H Bolin

    2014-04-01

    Full Text Available Although traditional clustering methods (e.g., K-means have been shown to be useful in the social sciences it is often difficult for such methods to handle situations where clusters in the population overlap or are ambiguous. Fuzzy clustering, a method already recognized in many disciplines, provides a more flexible alternative to these traditional clustering methods. Fuzzy clustering differs from other traditional clustering methods in that it allows for a case to belong to multiple clusters simultaneously. Unfortunately, fuzzy clustering techniques remain relatively unused in the social and behavioral sciences. The purpose of this paper is to introduce fuzzy clustering to these audiences who are currently relatively unfamiliar with the technique. In order to demonstrate the advantages associated with this method, cluster solutions of a common perfectionism measure were created using both fuzzy clustering and K-means clustering, and the results compared. Results of these analyses reveal that different cluster solutions are found by the two methods, and the similarity between the different clustering solutions depends on the amount of cluster overlap allowed for in fuzzy clustering.

  8. Applications of cluster analysis to the creation of perfectionism profiles: a comparison of two clustering approaches.

    Science.gov (United States)

    Bolin, Jocelyn H; Edwards, Julianne M; Finch, W Holmes; Cassady, Jerrell C

    2014-01-01

    Although traditional clustering methods (e.g., K-means) have been shown to be useful in the social sciences it is often difficult for such methods to handle situations where clusters in the population overlap or are ambiguous. Fuzzy clustering, a method already recognized in many disciplines, provides a more flexible alternative to these traditional clustering methods. Fuzzy clustering differs from other traditional clustering methods in that it allows for a case to belong to multiple clusters simultaneously. Unfortunately, fuzzy clustering techniques remain relatively unused in the social and behavioral sciences. The purpose of this paper is to introduce fuzzy clustering to these audiences who are currently relatively unfamiliar with the technique. In order to demonstrate the advantages associated with this method, cluster solutions of a common perfectionism measure were created using both fuzzy clustering and K-means clustering, and the results compared. Results of these analyses reveal that different cluster solutions are found by the two methods, and the similarity between the different clustering solutions depends on the amount of cluster overlap allowed for in fuzzy clustering.

  9. SU-E-J-98: Radiogenomics: Correspondence Between Imaging and Genetic Features Based On Clustering Analysis

    International Nuclear Information System (INIS)

    Harmon, S; Wendelberger, B; Jeraj, R

    2014-01-01

    Purpose: Radiogenomics aims to establish relationships between patient genotypes and imaging phenotypes. An open question remains on how best to integrate information from these distinct datasets. This work investigates if similarities in genetic features across patients correspond to similarities in PET-imaging features, assessed with various clustering algorithms. Methods: [ 18 F]FDG PET data was obtained for 26 NSCLC patients from a public database (TCIA). Tumors were contoured using an in-house segmentation algorithm combining gradient and region-growing techniques; resulting ROIs were used to extract 54 PET-based features. Corresponding genetic microarray data containing 48,778 elements were also obtained for each tumor. Given mismatch in feature sizes, two dimension reduction techniques were also applied to the genetic data: principle component analysis (PCA) and selective filtering of 25 NSCLC-associated genes-ofinterest (GOI). Gene datasets (full, PCA, and GOI) and PET feature datasets were independently clustered using K-means and hierarchical clustering using variable number of clusters (K). Jaccard Index (JI) was used to score similarity of cluster assignments across different datasets. Results: Patient clusters from imaging data showed poor similarity to clusters from gene datasets, regardless of clustering algorithms or number of clusters (JI mean = 0.3429±0.1623). Notably, we found clustering algorithms had different sensitivities to data reduction techniques. Using hierarchical clustering, the PCA dataset showed perfect cluster agreement to the full-gene set (JI =1) for all values of K, and the agreement between the GOI set and the full-gene set decreased as number of clusters increased (JI=0.9231 and 0.5769 for K=2 and 5, respectively). K-means clustering assignments were highly sensitive to data reduction and showed poor stability for different values of K (JI range : 0.2301–1). Conclusion: Using commonly-used clustering algorithms, we found

  10. SU-E-J-98: Radiogenomics: Correspondence Between Imaging and Genetic Features Based On Clustering Analysis

    Energy Technology Data Exchange (ETDEWEB)

    Harmon, S; Wendelberger, B [University of Wisconsin-Madison, Madison, WI (United States); Jeraj, R [University of Wisconsin-Madison, Madison, WI (United States); University of Ljubljana (Slovenia)

    2014-06-01

    Purpose: Radiogenomics aims to establish relationships between patient genotypes and imaging phenotypes. An open question remains on how best to integrate information from these distinct datasets. This work investigates if similarities in genetic features across patients correspond to similarities in PET-imaging features, assessed with various clustering algorithms. Methods: [{sup 18}F]FDG PET data was obtained for 26 NSCLC patients from a public database (TCIA). Tumors were contoured using an in-house segmentation algorithm combining gradient and region-growing techniques; resulting ROIs were used to extract 54 PET-based features. Corresponding genetic microarray data containing 48,778 elements were also obtained for each tumor. Given mismatch in feature sizes, two dimension reduction techniques were also applied to the genetic data: principle component analysis (PCA) and selective filtering of 25 NSCLC-associated genes-ofinterest (GOI). Gene datasets (full, PCA, and GOI) and PET feature datasets were independently clustered using K-means and hierarchical clustering using variable number of clusters (K). Jaccard Index (JI) was used to score similarity of cluster assignments across different datasets. Results: Patient clusters from imaging data showed poor similarity to clusters from gene datasets, regardless of clustering algorithms or number of clusters (JI{sub mean}= 0.3429±0.1623). Notably, we found clustering algorithms had different sensitivities to data reduction techniques. Using hierarchical clustering, the PCA dataset showed perfect cluster agreement to the full-gene set (JI =1) for all values of K, and the agreement between the GOI set and the full-gene set decreased as number of clusters increased (JI=0.9231 and 0.5769 for K=2 and 5, respectively). K-means clustering assignments were highly sensitive to data reduction and showed poor stability for different values of K (JI{sub range}: 0.2301–1). Conclusion: Using commonly-used clustering algorithms

  11. Cluster analysis for validated climatology stations using precipitation in Mexico

    NARCIS (Netherlands)

    Bravo Cabrera, J. L.; Azpra-Romero, E.; Zarraluqui-Such, V.; Gay-García, C.; Estrada Porrúa, F.

    2012-01-01

    Annual average of daily precipitation was used to group climatological stations into clusters using the k-means procedure and principal component analysis with varimax rotation. After a careful selection of the stations deployed in Mexico since 1950, we selected 349 characterized by having 35 to 40

  12. Aplikasi Pengolahan Citra Digital Meat Detection Dengan Metode Segmentasi K-Mean Clustering Berbasis OpenCV Dan Eclipse

    Directory of Open Access Journals (Sweden)

    Lazuardi Arsy

    2016-04-01

    Full Text Available Kualitas suatu daging sapi ditentukan oleh beberapa parameter, diantaranya adalah parameter ukuran, terkstur, ciri warna, bau dari daging dan lain – lain. Parameter terseburt merupakan salah satu faktor penting untuk menentukan kualitas daging. Umunya dalam menetukan kualitas baik buruknya daging dilakukan dengan cara manual yaitu menggunakan indera penglihatan dari segi warna maupun bentuk yang memiliki banyak kelemahan seperti penilaian oleh manusia yang bersifat subyektif dan tak konsisten. Tujuan dari penelitian ini adalah untuk membuat aplikasi untuk mendeteksi kualitas daging. Aplikasi dibangun dengan menggunakan bahasa pemrograman Java pada Android yang terintegrasi dengan Android SDK dan Eclipse menggunakan library OpenCV sehingga aplikasi ini berbasis mobile. Metode yang dipakai menggunakan segmentasi k-mean clustering selanjutnya dianalisis secara statistik. Pendeteksian kualitas dilakukan dengan menggunakan pencocokan tekstur dan warna daging berdasar data yang sudah ada. Aplikasi yang dibuat dapat digunakan untuk mencari nilai k yang signifikan serta mampu mendeteksi kualitas baik atau buruknya daging dengan melakukan pengujian terhadap beberapa jenis daging serta aplikasi ini dapat digunakan oleh masyarakat luas.

  13. From virtual clustering analysis to self-consistent clustering analysis: a mathematical study

    Science.gov (United States)

    Tang, Shaoqiang; Zhang, Lei; Liu, Wing Kam

    2018-03-01

    In this paper, we propose a new homogenization algorithm, virtual clustering analysis (VCA), as well as provide a mathematical framework for the recently proposed self-consistent clustering analysis (SCA) (Liu et al. in Comput Methods Appl Mech Eng 306:319-341, 2016). In the mathematical theory, we clarify the key assumptions and ideas of VCA and SCA, and derive the continuous and discrete Lippmann-Schwinger equations. Based on a key postulation of "once response similarly, always response similarly", clustering is performed in an offline stage by machine learning techniques (k-means and SOM), and facilitates substantial reduction of computational complexity in an online predictive stage. The clear mathematical setup allows for the first time a convergence study of clustering refinement in one space dimension. Convergence is proved rigorously, and found to be of second order from numerical investigations. Furthermore, we propose to suitably enlarge the domain in VCA, such that the boundary terms may be neglected in the Lippmann-Schwinger equation, by virtue of the Saint-Venant's principle. In contrast, they were not obtained in the original SCA paper, and we discover these terms may well be responsible for the numerical dependency on the choice of reference material property. Since VCA enhances the accuracy by overcoming the modeling error, and reduce the numerical cost by avoiding an outer loop iteration for attaining the material property consistency in SCA, its efficiency is expected even higher than the recently proposed SCA algorithm.

  14. AUTOMATIC EXTRACTION OF ROCK JOINTS FROM LASER SCANNED DATA BY MOVING LEAST SQUARES METHOD AND FUZZY K-MEANS CLUSTERING

    Directory of Open Access Journals (Sweden)

    S. Oh

    2012-09-01

    Full Text Available Recent development of laser scanning device increased the capability of representing rock outcrop in a very high resolution. Accurate 3D point cloud model with rock joint information can help geologist to estimate stability of rock slope on-site or off-site. An automatic plane extraction method was developed by computing normal directions and grouping them in similar direction. Point normal was calculated by moving least squares (MLS method considering every point within a given distance to minimize error to the fitting plane. Normal directions were classified into a number of dominating clusters by fuzzy K-means clustering. Region growing approach was exploited to discriminate joints in a point cloud. Overall procedure was applied to point cloud with about 120,000 points, and successfully extracted joints with joint information. The extraction procedure was implemented to minimize number of input parameters and to construct plane information into the existing point cloud for less redundancy and high usability of the point cloud itself.

  15. Analysis of Learning Development With Sugeno Fuzzy Logic And Clustering

    Directory of Open Access Journals (Sweden)

    Maulana Erwin Saputra

    2017-06-01

    Full Text Available In the first journal, I made this attempt to analyze things that affect the achievement of students in each school of course vary. Because students are one of the goals of achieving the goals of successful educational organizations. The mental influence of students’ emotions and behaviors themselves in relation to learning performance. Fuzzy logic can be used in various fields as well as Clustering for grouping, as in Learning Development analyzes. The process will be performed on students based on the symptoms that exist. In this research will use fuzzy logic and clustering. Fuzzy is an uncertain logic but its excess is capable in the process of language reasoning so that in its design is not required complicated mathematical equations. However Clustering method is K-Means method is method where data analysis is broken down by group k (k = 1,2,3, .. k. To know the optimal number of Performance group. The results of the research is with a questionnaire entered into matlab will produce a value that means in generating the graph. And simplify the school in seeing Student performance in the learning process by using certain criteria. So from the system that obtained the results for a decision-making required by the school.

  16. Distributed k-Means Algorithm and Fuzzy c-Means Algorithm for Sensor Networks Based on Multiagent Consensus Theory.

    Science.gov (United States)

    Qin, Jiahu; Fu, Weiming; Gao, Huijun; Zheng, Wei Xing

    2016-03-03

    This paper is concerned with developing a distributed k-means algorithm and a distributed fuzzy c-means algorithm for wireless sensor networks (WSNs) where each node is equipped with sensors. The underlying topology of the WSN is supposed to be strongly connected. The consensus algorithm in multiagent consensus theory is utilized to exchange the measurement information of the sensors in WSN. To obtain a faster convergence speed as well as a higher possibility of having the global optimum, a distributed k-means++ algorithm is first proposed to find the initial centroids before executing the distributed k-means algorithm and the distributed fuzzy c-means algorithm. The proposed distributed k-means algorithm is capable of partitioning the data observed by the nodes into measure-dependent groups which have small in-group and large out-group distances, while the proposed distributed fuzzy c-means algorithm is capable of partitioning the data observed by the nodes into different measure-dependent groups with degrees of membership values ranging from 0 to 1. Simulation results show that the proposed distributed algorithms can achieve almost the same results as that given by the centralized clustering algorithms.

  17. Identification of column edges of DNA fragments by using K-means clustering and mean algorithm on lane histograms of DNA agarose gel electrophoresis images

    Science.gov (United States)

    Turan, Muhammed K.; Sehirli, Eftal; Elen, Abdullah; Karas, Ismail R.

    2015-07-01

    Gel electrophoresis (GE) is one of the most used method to separate DNA, RNA, protein molecules according to size, weight and quantity parameters in many areas such as genetics, molecular biology, biochemistry, microbiology. The main way to separate each molecule is to find borders of each molecule fragment. This paper presents a software application that show columns edges of DNA fragments in 3 steps. In the first step the application obtains lane histograms of agarose gel electrophoresis images by doing projection based on x-axis. In the second step, it utilizes k-means clustering algorithm to classify point values of lane histogram such as left side values, right side values and undesired values. In the third step, column edges of DNA fragments is shown by using mean algorithm and mathematical processes to separate DNA fragments from the background in a fully automated way. In addition to this, the application presents locations of DNA fragments and how many DNA fragments exist on images captured by a scientific camera.

  18. Performance analysis of clustering techniques over microarray data: A case study

    Science.gov (United States)

    Dash, Rasmita; Misra, Bijan Bihari

    2018-03-01

    Handling big data is one of the major issues in the field of statistical data analysis. In such investigation cluster analysis plays a vital role to deal with the large scale data. There are many clustering techniques with different cluster analysis approach. But which approach suits a particular dataset is difficult to predict. To deal with this problem a grading approach is introduced over many clustering techniques to identify a stable technique. But the grading approach depends on the characteristic of dataset as well as on the validity indices. So a two stage grading approach is implemented. In this study the grading approach is implemented over five clustering techniques like hybrid swarm based clustering (HSC), k-means, partitioning around medoids (PAM), vector quantization (VQ) and agglomerative nesting (AGNES). The experimentation is conducted over five microarray datasets with seven validity indices. The finding of grading approach that a cluster technique is significant is also established by Nemenyi post-hoc hypothetical test.

  19. Cluster Analysis of Flow Cytometric List Mode Data on a Personal Computer

    NARCIS (Netherlands)

    Bakker Schut, Tom C.; Bakker schut, T.C.; de Grooth, B.G.; Greve, Jan

    1993-01-01

    A cluster analysis algorithm, dedicated to analysis of flow cytometric data is described. The algorithm is written in Pascal and implemented on an MS-DOS personal computer. It uses k-means, initialized with a large number of seed points, followed by a modified nearest neighbor technique to reduce

  20. Research on classified real-time flood forecasting framework based on K-means cluster and rough set.

    Science.gov (United States)

    Xu, Wei; Peng, Yong

    2015-01-01

    This research presents a new classified real-time flood forecasting framework. In this framework, historical floods are classified by a K-means cluster according to the spatial and temporal distribution of precipitation, the time variance of precipitation intensity and other hydrological factors. Based on the classified results, a rough set is used to extract the identification rules for real-time flood forecasting. Then, the parameters of different categories within the conceptual hydrological model are calibrated using a genetic algorithm. In real-time forecasting, the corresponding category of parameters is selected for flood forecasting according to the obtained flood information. This research tests the new classified framework on Guanyinge Reservoir and compares the framework with the traditional flood forecasting method. It finds that the performance of the new classified framework is significantly better in terms of accuracy. Furthermore, the framework can be considered in a catchment with fewer historical floods.

  1. A cluster analysis on road traffic accidents using genetic algorithms

    Science.gov (United States)

    Saharan, Sabariah; Baragona, Roberto

    2017-04-01

    The analysis of traffic road accidents is increasingly important because of the accidents cost and public road safety. The availability or large data sets makes the study of factors that affect the frequency and severity accidents are viable. However, the data are often highly unbalanced and overlapped. We deal with the data set of the road traffic accidents recorded in Christchurch, New Zealand, from 2000-2009 with a total of 26440 accidents. The data is in a binary set and there are 50 factors road traffic accidents with four level of severity. We used genetic algorithm for the analysis because we are in the presence of a large unbalanced data set and standard clustering like k-means algorithm may not be suitable for the task. The genetic algorithm based on clustering for unknown K, (GCUK) has been used to identify the factors associated with accidents of different levels of severity. The results provided us with an interesting insight into the relationship between factors and accidents severity level and suggest that the two main factors that contributes to fatal accidents are "Speed greater than 60 km h" and "Did not see other people until it was too late". A comparison with the k-means algorithm and the independent component analysis is performed to validate the results.

  2. AUTOMATED UNSUPERVISED CLASSIFICATION OF THE SLOAN DIGITAL SKY SURVEY STELLAR SPECTRA USING k-MEANS CLUSTERING

    Energy Technology Data Exchange (ETDEWEB)

    Sanchez Almeida, J.; Allende Prieto, C., E-mail: jos@iac.es, E-mail: callende@iac.es [Instituto de Astrofisica de Canarias, E-38205 La Laguna, Tenerife (Spain)

    2013-01-20

    Large spectroscopic surveys require automated methods of analysis. This paper explores the use of k-means clustering as a tool for automated unsupervised classification of massive stellar spectral catalogs. The classification criteria are defined by the data and the algorithm, with no prior physical framework. We work with a representative set of stellar spectra associated with the Sloan Digital Sky Survey (SDSS) SEGUE and SEGUE-2 programs, which consists of 173,390 spectra from 3800 to 9200 A sampled on 3849 wavelengths. We classify the original spectra as well as the spectra with the continuum removed. The second set only contains spectral lines, and it is less dependent on uncertainties of the flux calibration. The classification of the spectra with continuum renders 16 major classes. Roughly speaking, stars are split according to their colors, with enough finesse to distinguish dwarfs from giants of the same effective temperature, but with difficulties to separate stars with different metallicities. There are classes corresponding to particular MK types, intrinsically blue stars, dust-reddened, stellar systems, and also classes collecting faulty spectra. Overall, there is no one-to-one correspondence between the classes we derive and the MK types. The classification of spectra without continuum renders 13 classes, the color separation is not so sharp, but it distinguishes stars of the same effective temperature and different metallicities. Some classes thus obtained present a fairly small range of physical parameters (200 K in effective temperature, 0.25 dex in surface gravity, and 0.35 dex in metallicity), so that the classification can be used to estimate the main physical parameters of some stars at a minimum computational cost. We also analyze the outliers of the classification. Most of them turn out to be failures of the reduction pipeline, but there are also high redshift QSOs, multiple stellar systems, dust-reddened stars, galaxies, and, finally, odd

  3. Machine learning in APOGEE. Unsupervised spectral classification with K-means

    Science.gov (United States)

    Garcia-Dias, Rafael; Allende Prieto, Carlos; Sánchez Almeida, Jorge; Ordovás-Pascual, Ignacio

    2018-05-01

    Context. The volume of data generated by astronomical surveys is growing rapidly. Traditional analysis techniques in spectroscopy either demand intensive human interaction or are computationally expensive. In this scenario, machine learning, and unsupervised clustering algorithms in particular, offer interesting alternatives. The Apache Point Observatory Galactic Evolution Experiment (APOGEE) offers a vast data set of near-infrared stellar spectra, which is perfect for testing such alternatives. Aims: Our research applies an unsupervised classification scheme based on K-means to the massive APOGEE data set. We explore whether the data are amenable to classification into discrete classes. Methods: We apply the K-means algorithm to 153 847 high resolution spectra (R ≈ 22 500). We discuss the main virtues and weaknesses of the algorithm, as well as our choice of parameters. Results: We show that a classification based on normalised spectra captures the variations in stellar atmospheric parameters, chemical abundances, and rotational velocity, among other factors. The algorithm is able to separate the bulge and halo populations, and distinguish dwarfs, sub-giants, RC, and RGB stars. However, a discrete classification in flux space does not result in a neat organisation in the parameters' space. Furthermore, the lack of obvious groups in flux space causes the results to be fairly sensitive to the initialisation, and disrupts the efficiency of commonly-used methods to select the optimal number of clusters. Our classification is publicly available, including extensive online material associated with the APOGEE Data Release 12 (DR12). Conclusions: Our description of the APOGEE database can help greatly with the identification of specific types of targets for various applications. We find a lack of obvious groups in flux space, and identify limitations of the K-means algorithm in dealing with this kind of data. Full Tables B.1-B.4 are only available at the CDS via

  4. Cluster analysis of HZE particle tracks as applied to space radiobiology problems

    International Nuclear Information System (INIS)

    Batmunkh, M.; Bayarchimeg, L.; Lkhagva, O.; Belov, O.

    2013-01-01

    A cluster analysis is performed of ionizations in tracks produced by the most abundant nuclei in the charge and energy spectra of the galactic cosmic rays. The frequency distribution of clusters is estimated for cluster sizes comparable to the DNA molecule at different packaging levels. For this purpose, an improved K-mean-based algorithm is suggested. This technique allows processing particle tracks containing a large number of ionization events without setting the number of clusters as an input parameter. Using this method, the ionization distribution pattern is analyzed depending on the cluster size and particle's linear energy transfer

  5. Using Cluster Analysis to Group Countries for Cost-effectiveness Analysis: An Application to Sub-Saharan Africa.

    Science.gov (United States)

    Russell, Louise B; Bhanot, Gyan; Kim, Sun-Young; Sinha, Anushua

    2018-02-01

    To explore the use of cluster analysis to define groups of similar countries for the purpose of evaluating the cost-effectiveness of a public health intervention-maternal immunization-within the constraints of a project budget originally meant for an overall regional analysis. We used the most common cluster analysis algorithm, K-means, and the most common measure of distance, Euclidean distance, to group 37 low-income, sub-Saharan African countries on the basis of 24 measures of economic development, general health resources, and past success in public health programs. The groups were tested for robustness and reviewed by regional disease experts. We explored 2-, 3- and 4-group clustering. Public health performance was consistently important in determining the groups. For the 2-group clustering, for example, infant mortality in Group 1 was 81 per 1,000 live births compared with 51 per 1,000 in Group 2, and 67% of children in Group 1 received DPT immunization compared with 87% in Group 2. The experts preferred four groups to fewer, on the ground that national decision makers would more readily recognize their country among four groups. Clusters defined by K-means clustering made sense to subject experts and allowed a more detailed evaluation of the cost-effectiveness of maternal immunization within the constraint of the project budget. The method may be useful for other evaluations that, without having the resources to conduct separate analyses for each unit, seek to inform decision makers in numerous countries or subdivisions within countries, such as states or counties.

  6. Fast segmentation of industrial quality pavement images using Laws texture energy measures and k -means clustering

    Science.gov (United States)

    Mathavan, Senthan; Kumar, Akash; Kamal, Khurram; Nieminen, Michael; Shah, Hitesh; Rahman, Mujib

    2016-09-01

    Thousands of pavement images are collected by road authorities daily for condition monitoring surveys. These images typically have intensity variations and texture nonuniformities that make their segmentation challenging. The automated segmentation of such pavement images is crucial for accurate, thorough, and expedited health monitoring of roads. In the pavement monitoring area, well-known texture descriptors, such as gray-level co-occurrence matrices and local binary patterns, are often used for surface segmentation and identification. These, despite being the established methods for texture discrimination, are inherently slow. This work evaluates Laws texture energy measures as a viable alternative for pavement images for the first time. k-means clustering is used to partition the feature space, limiting the human subjectivity in the process. Data classification, hence image segmentation, is performed by the k-nearest neighbor method. Laws texture energy masks are shown to perform well with resulting accuracy and precision values of more than 80%. The implementations of the algorithm, in both MATLAB® and OpenCV/C++, are extensively compared against the state of the art for execution speed, clearly showing the advantages of the proposed method. Furthermore, the OpenCV-based segmentation shows a 100% increase in processing speed when compared to the fastest algorithm available in literature.

  7. Methodology сomparative statistical analysis of Russian industry based on cluster analysis

    Directory of Open Access Journals (Sweden)

    Sergey S. Shishulin

    2017-01-01

    Full Text Available The article is devoted to researching of the possibilities of applying multidimensional statistical analysis in the study of industrial production on the basis of comparing its growth rates and structure with other developed and developing countries of the world. The purpose of this article is to determine the optimal set of statistical methods and the results of their application to industrial production data, which would give the best access to the analysis of the result.Data includes such indicators as output, output, gross value added, the number of employed and other indicators of the system of national accounts and operational business statistics. The objects of observation are the industry of the countrys of the Customs Union, the United States, Japan and Erope in 2005-2015. As the research tool used as the simplest methods of transformation, graphical and tabular visualization of data, and methods of statistical analysis. In particular, based on a specialized software package (SPSS, the main components method, discriminant analysis, hierarchical methods of cluster analysis, Ward’s method and k-means were applied.The application of the method of principal components to the initial data makes it possible to substantially and effectively reduce the initial space of industrial production data. Thus, for example, in analyzing the structure of industrial production, the reduction was from fifteen industries to three basic, well-interpreted factors: the relatively extractive industries (with a low degree of processing, high-tech industries and consumer goods (medium-technology sectors. At the same time, as a result of comparison of the results of application of cluster analysis to the initial data and data obtained on the basis of the principal components method, it was established that clustering industrial production data on the basis of new factors significantly improves the results of clustering.As a result of analyzing the parameters of

  8. The composite sequential clustering technique for analysis of multispectral scanner data

    Science.gov (United States)

    Su, M. Y.

    1972-01-01

    The clustering technique consists of two parts: (1) a sequential statistical clustering which is essentially a sequential variance analysis, and (2) a generalized K-means clustering. In this composite clustering technique, the output of (1) is a set of initial clusters which are input to (2) for further improvement by an iterative scheme. This unsupervised composite technique was employed for automatic classification of two sets of remote multispectral earth resource observations. The classification accuracy by the unsupervised technique is found to be comparable to that by traditional supervised maximum likelihood classification techniques. The mathematical algorithms for the composite sequential clustering program and a detailed computer program description with job setup are given.

  9. Identification of new candidate drugs for lung cancer using chemical-chemical interactions, chemical-protein interactions and a K-means clustering algorithm.

    Science.gov (United States)

    Lu, Jing; Chen, Lei; Yin, Jun; Huang, Tao; Bi, Yi; Kong, Xiangyin; Zheng, Mingyue; Cai, Yu-Dong

    2016-01-01

    Lung cancer, characterized by uncontrolled cell growth in the lung tissue, is the leading cause of global cancer deaths. Until now, effective treatment of this disease is limited. Many synthetic compounds have emerged with the advancement of combinatorial chemistry. Identification of effective lung cancer candidate drug compounds among them is a great challenge. Thus, it is necessary to build effective computational methods that can assist us in selecting for potential lung cancer drug compounds. In this study, a computational method was proposed to tackle this problem. The chemical-chemical interactions and chemical-protein interactions were utilized to select candidate drug compounds that have close associations with approved lung cancer drugs and lung cancer-related genes. A permutation test and K-means clustering algorithm were employed to exclude candidate drugs with low possibilities to treat lung cancer. The final analysis suggests that the remaining drug compounds have potential anti-lung cancer activities and most of them have structural dissimilarity with approved drugs for lung cancer.

  10. Are clusters of dietary patterns and cluster membership stable over time? Results of a longitudinal cluster analysis study.

    Science.gov (United States)

    Walthouwer, Michel Jean Louis; Oenema, Anke; Soetens, Katja; Lechner, Lilian; de Vries, Hein

    2014-11-01

    Developing nutrition education interventions based on clusters of dietary patterns can only be done adequately when it is clear if distinctive clusters of dietary patterns can be derived and reproduced over time, if cluster membership is stable, and if it is predictable which type of people belong to a certain cluster. Hence, this study aimed to: (1) identify clusters of dietary patterns among Dutch adults, (2) test the reproducibility of these clusters and stability of cluster membership over time, and (3) identify sociodemographic predictors of cluster membership and cluster transition. This study had a longitudinal design with online measurements at baseline (N=483) and 6 months follow-up (N=379). Dietary intake was assessed with a validated food frequency questionnaire. A hierarchical cluster analysis was performed, followed by a K-means cluster analysis. Multinomial logistic regression analyses were conducted to identify the sociodemographic predictors of cluster membership and cluster transition. At baseline and follow-up, a comparable three-cluster solution was derived, distinguishing a healthy, moderately healthy, and unhealthy dietary pattern. Male and lower educated participants were significantly more likely to have a less healthy dietary pattern. Further, 251 (66.2%) participants remained in the same cluster, 45 (11.9%) participants changed to an unhealthier cluster, and 83 (21.9%) participants shifted to a healthier cluster. Men and people living alone were significantly more likely to shift toward a less healthy dietary pattern. Distinctive clusters of dietary patterns can be derived. Yet, cluster membership is unstable and only few sociodemographic factors were associated with cluster membership and cluster transition. These findings imply that clusters based on dietary intake may not be suitable as a basis for nutrition education interventions. Copyright © 2014 Elsevier Ltd. All rights reserved.

  11. Estimation of Tree Lists from Airborne Laser Scanning Using Tree Model Clustering and k-MSN Imputation

    Directory of Open Access Journals (Sweden)

    Jörgen Wallerman

    2013-04-01

    Full Text Available Individual tree crowns may be delineated from airborne laser scanning (ALS data by segmentation of surface models or by 3D analysis. Segmentation of surface models benefits from using a priori knowledge about the proportions of tree crowns, which has not yet been utilized for 3D analysis to any great extent. In this study, an existing surface segmentation method was used as a basis for a new tree model 3D clustering method applied to ALS returns in 104 circular field plots with 12 m radius in pine-dominated boreal forest (64°14'N, 19°50'E. For each cluster below the tallest canopy layer, a parabolic surface was fitted to model a tree crown. The tree model clustering identified more trees than segmentation of the surface model, especially smaller trees below the tallest canopy layer. Stem attributes were estimated with k-Most Similar Neighbours (k-MSN imputation of the clusters based on field-measured trees. The accuracy at plot level from the k-MSN imputation (stem density root mean square error or RMSE 32.7%; stem volume RMSE 28.3% was similar to the corresponding results from the surface model (stem density RMSE 33.6%; stem volume RMSE 26.1% with leave-one-out cross-validation for one field plot at a time. Three-dimensional analysis of ALS data should also be evaluated in multi-layered forests since it identified a larger number of small trees below the tallest canopy layer.

  12. FINGER KNUCKLE PRINT RECOGNITION WITH SIFT AND K-MEANS ALGORITHM

    Directory of Open Access Journals (Sweden)

    A. Muthukumar

    2013-02-01

    Full Text Available In general, the identification and verification are done by passwords, pin number, etc., which is easily cracked by others. Biometrics is a powerful and unique tool based on the anatomical and behavioral characteristics of the human beings in order to prove their authentication. This paper proposes a novel recognition methodology of biometrics named as Finger Knuckle print (FKP. Hence this paper has focused on the extraction of features of Finger knuckle print using Scale Invariant Feature Transform (SIFT, and the key points are derived from FKP are clustered using K-Means Algorithm. The centroid of K-Means is stored in the database which is compared with the query FKP K-Means centroid value to prove the recognition and authentication. The comparison is based on the XOR operation. Hence this paper provides a novel recognition method to provide authentication. Results are performed on the PolyU FKP database to check the proposed FKP recognition method.

  13. Electricity Consumption Clustering Using Smart Meter Data

    Directory of Open Access Journals (Sweden)

    Alexander Tureczek

    2018-04-01

    Full Text Available Electricity smart meter consumption data is enabling utilities to analyze consumption information at unprecedented granularity. Much focus has been directed towards consumption clustering for diversifying tariffs; through modern clustering methods, cluster analyses have been performed. However, the clusters developed exhibit a large variation with resulting shadow clusters, making it impossible to truly identify the individual clusters. Using clearly defined dwelling types, this paper will present methods to improve clustering by harvesting inherent structure from the smart meter data. This paper clusters domestic electricity consumption using smart meter data from the Danish city of Esbjerg. Methods from time series analysis and wavelets are applied to enable the K-Means clustering method to account for autocorrelation in data and thereby improve the clustering performance. The results show the importance of data knowledge and we identify sub-clusters of consumption within the dwelling types and enable K-Means to produce satisfactory clustering by accounting for a temporal component. Furthermore our study shows that careful preprocessing of the data to account for intrinsic structure enables better clustering performance by the K-Means method.

  14. Discrimination of neutrons and γ-rays in liquid scintillators based of fuzzy c-means clustering

    International Nuclear Information System (INIS)

    Luo Xiaoliang; Liu Guofu; Yang Jun

    2011-01-01

    A novel method based on fuzzy c-means (FCM) clustering for the discrimination of neutrons and γ-rays in liquid scintillators was presented. The neutrons and γ-rays in the environment were firstly acquired by the portable real-time n-γ discriminator and then discriminated using fuzzy c-means clustering and pulse gradient analysis, respectively. By comparing the results with each other, it is shown that the discrimination results of the fuzzy c-means clustering are consistent with those of the pulse gradient analysis. The decrease in uncertainty and the improvement in discrimination performance of the fuzzy c-means clustering were also observed. (authors)

  15. A Novel Grouping Method for Lithium Iron Phosphate Batteries Based on a Fractional Joint Kalman Filter and a New Modified K-Means Clustering Algorithm

    Directory of Open Access Journals (Sweden)

    Xiaoyu Li

    2015-07-01

    Full Text Available This paper presents a novel grouping method for lithium iron phosphate batteries. In this method, a simplified electrochemical impedance spectroscopy (EIS model is utilized to describe the battery characteristics. Dynamic stress test (DST and fractional joint Kalman filter (FJKF are used to extract battery model parameters. In order to realize equal-number grouping of batteries, a new modified K-means clustering algorithm is proposed. Two rules are designed to equalize the numbers of elements in each group and exchange samples among groups. In this paper, the principles of battery model selection, physical meaning and identification method of model parameters, data preprocessing and equal-number clustering method for battery grouping are comprehensively described. Additionally, experiments for battery grouping and method validation are designed. This method is meaningful to application involving the grouping of fresh batteries for electric vehicles (EVs and screening of aged batteries for recycling.

  16. Performance Analysis of Unsupervised Clustering Methods for Brain Tumor Segmentation

    Directory of Open Access Journals (Sweden)

    Tushar H Jaware

    2013-10-01

    Full Text Available Medical image processing is the most challenging and emerging field of neuroscience. The ultimate goal of medical image analysis in brain MRI is to extract important clinical features that would improve methods of diagnosis & treatment of disease. This paper focuses on methods to detect & extract brain tumour from brain MR images. MATLAB is used to design, software tool for locating brain tumor, based on unsupervised clustering methods. K-Means clustering algorithm is implemented & tested on data base of 30 images. Performance evolution of unsupervised clusteringmethods is presented.

  17. Clustering Analysis for Credit Default Probabilities in a Retail Bank Portfolio

    Directory of Open Access Journals (Sweden)

    Elena ANDREI (DRAGOMIR

    2012-08-01

    Full Text Available Methods underlying cluster analysis are very useful in data analysis, especially when the processed volume of data is very large, so that it becomes impossible to extract essential information, unless specific instruments are used to summarize and structure the gross information. In this context, cluster analysis techniques are used particularly, for systematic information analysis. The aim of this article is to build an useful model for banking field, based on data mining techniques, by dividing the groups of borrowers into clusters, in order to obtain a profile of the customers (debtors and good payers. We assume that a class is appropriate if it contains members that have a high degree of similarity and the standard method for measuring the similarity within a group shows the lowest variance. After clustering, data mining techniques are implemented on the cluster with bad debtors, reaching a very high accuracy after implementation. The paper is structured as follows: Section 2 describes the model for data analysis based on a specific scoring model that we proposed. In section 3, we present a cluster analysis using K-means algorithm and the DM models are applied on a specific cluster. Section 4 shows the conclusions.

  18. A REVIEW WAVELET TRANSFORM AND FUZZY K-MEANS BASED IMAGE DE-NOISING METHOD

    OpenAIRE

    Nidhi Patel*, Asst. Prof. Pratik Kumar Soni

    2017-01-01

    The research area of image processing technique using fuzzy k-means and wavelet transform. The enormous amount of data necessary for images is a main reason for the growth of many areas within the research field of computer imaging such as image processing and compression. In order to get this in requisites of the concerned research work, wavelet transforms and k-means clustering is applied. This can be done in order to discover more possible combinations that may lead to the finest de-noisin...

  19. Study on Adaptive Parameter Determination of Cluster Analysis in Urban Management Cases

    Science.gov (United States)

    Fu, J. Y.; Jing, C. F.; Du, M. Y.; Fu, Y. L.; Dai, P. P.

    2017-09-01

    The fine management for cities is the important way to realize the smart city. The data mining which uses spatial clustering analysis for urban management cases can be used in the evaluation of urban public facilities deployment, and support the policy decisions, and also provides technical support for the fine management of the city. Aiming at the problem that DBSCAN algorithm which is based on the density-clustering can not realize parameter adaptive determination, this paper proposed the optimizing method of parameter adaptive determination based on the spatial analysis. Firstly, making analysis of the function Ripley's K for the data set to realize adaptive determination of global parameter MinPts, which means setting the maximum aggregation scale as the range of data clustering. Calculating every point object's highest frequency K value in the range of Eps which uses K-D tree and setting it as the value of clustering density to realize the adaptive determination of global parameter MinPts. Then, the R language was used to optimize the above process to accomplish the precise clustering of typical urban management cases. The experimental results based on the typical case of urban management in XiCheng district of Beijing shows that: The new DBSCAN clustering algorithm this paper presents takes full account of the data's spatial and statistical characteristic which has obvious clustering feature, and has a better applicability and high quality. The results of the study are not only helpful for the formulation of urban management policies and the allocation of urban management supervisors in XiCheng District of Beijing, but also to other cities and related fields.

  20. STUDY ON ADAPTIVE PARAMETER DETERMINATION OF CLUSTER ANALYSIS IN URBAN MANAGEMENT CASES

    Directory of Open Access Journals (Sweden)

    J. Y. Fu

    2017-09-01

    Full Text Available The fine management for cities is the important way to realize the smart city. The data mining which uses spatial clustering analysis for urban management cases can be used in the evaluation of urban public facilities deployment, and support the policy decisions, and also provides technical support for the fine management of the city. Aiming at the problem that DBSCAN algorithm which is based on the density-clustering can not realize parameter adaptive determination, this paper proposed the optimizing method of parameter adaptive determination based on the spatial analysis. Firstly, making analysis of the function Ripley's K for the data set to realize adaptive determination of global parameter MinPts, which means setting the maximum aggregation scale as the range of data clustering. Calculating every point object’s highest frequency K value in the range of Eps which uses K-D tree and setting it as the value of clustering density to realize the adaptive determination of global parameter MinPts. Then, the R language was used to optimize the above process to accomplish the precise clustering of typical urban management cases. The experimental results based on the typical case of urban management in XiCheng district of Beijing shows that: The new DBSCAN clustering algorithm this paper presents takes full account of the data’s spatial and statistical characteristic which has obvious clustering feature, and has a better applicability and high quality. The results of the study are not only helpful for the formulation of urban management policies and the allocation of urban management supervisors in XiCheng District of Beijing, but also to other cities and related fields.

  1. Will farmers intend to cultivate Provitamin A genetically modified (GM cassava in Nigeria? Evidence from a k-means segmentation analysis of beliefs and attitudes.

    Directory of Open Access Journals (Sweden)

    Adewale Oparinde

    Full Text Available Analysis of market segments within a population remains critical to agricultural systems and policy processes for targeting new innovations. Patterns in attitudes and intentions toward cultivating Provitamin A GM cassava are examined through the use of a combination of behavioural theory and k-means cluster analysis method, investigating the interrelationship among various behavioural antecedents. Using a state-level sample of smallholder cassava farmers in Nigeria, this paper identifies three distinct classes of attitude and intention denoted as low opposition, medium opposition and high opposition farmers. It was estimated that only 25% of the surveyed population of farmers was highly opposed to cultivating Provitamin A GM cassava.

  2. Will farmers intend to cultivate Provitamin A genetically modified (GM) cassava in Nigeria? Evidence from a k-means segmentation analysis of beliefs and attitudes.

    Science.gov (United States)

    Oparinde, Adewale; Abdoulaye, Tahirou; Mignouna, Djana Babatima; Bamire, Adebayo Simeon

    2017-01-01

    Analysis of market segments within a population remains critical to agricultural systems and policy processes for targeting new innovations. Patterns in attitudes and intentions toward cultivating Provitamin A GM cassava are examined through the use of a combination of behavioural theory and k-means cluster analysis method, investigating the interrelationship among various behavioural antecedents. Using a state-level sample of smallholder cassava farmers in Nigeria, this paper identifies three distinct classes of attitude and intention denoted as low opposition, medium opposition and high opposition farmers. It was estimated that only 25% of the surveyed population of farmers was highly opposed to cultivating Provitamin A GM cassava.

  3. Will farmers intend to cultivate Provitamin A genetically modified (GM) cassava in Nigeria? Evidence from a k-means segmentation analysis of beliefs and attitudes

    Science.gov (United States)

    Abdoulaye, Tahirou; Mignouna, Djana Babatima; Bamire, Adebayo Simeon

    2017-01-01

    Analysis of market segments within a population remains critical to agricultural systems and policy processes for targeting new innovations. Patterns in attitudes and intentions toward cultivating Provitamin A GM cassava are examined through the use of a combination of behavioural theory and k-means cluster analysis method, investigating the interrelationship among various behavioural antecedents. Using a state-level sample of smallholder cassava farmers in Nigeria, this paper identifies three distinct classes of attitude and intention denoted as low opposition, medium opposition and high opposition farmers. It was estimated that only 25% of the surveyed population of farmers was highly opposed to cultivating Provitamin A GM cassava. PMID:28700605

  4. Segmentation of White Blood Cells From Microscopic Images Using a Novel Combination of K-Means Clustering and Modified Watershed Algorithm.

    Science.gov (United States)

    Ghane, Narjes; Vard, Alireza; Talebi, Ardeshir; Nematollahy, Pardis

    2017-01-01

    Recognition of white blood cells (WBCs) is the first step to diagnose some particular diseases such as acquired immune deficiency syndrome, leukemia, and other blood-related diseases that are usually done by pathologists using an optical microscope. This process is time-consuming, extremely tedious, and expensive and needs experienced experts in this field. Thus, a computer-aided diagnosis system that assists pathologists in the diagnostic process can be so effective. Segmentation of WBCs is usually a first step in developing a computer-aided diagnosis system. The main purpose of this paper is to segment WBCs from microscopic images. For this purpose, we present a novel combination of thresholding, k-means clustering, and modified watershed algorithms in three stages including (1) segmentation of WBCs from a microscopic image, (2) extraction of nuclei from cell's image, and (3) separation of overlapping cells and nuclei. The evaluation results of the proposed method show that similarity measures, precision, and sensitivity respectively were 92.07, 96.07, and 94.30% for nucleus segmentation and 92.93, 97.41, and 93.78% for cell segmentation. In addition, statistical analysis presents high similarity between manual segmentation and the results obtained by the proposed method.

  5. Clustering Binary Data in the Presence of Masking Variables

    Science.gov (United States)

    Brusco, Michael J.

    2004-01-01

    A number of important applications require the clustering of binary data sets. Traditional nonhierarchical cluster analysis techniques, such as the popular K-means algorithm, can often be successfully applied to these data sets. However, the presence of masking variables in a data set can impede the ability of the K-means algorithm to recover the…

  6. Comparing clustering models in bank customers: Based on Fuzzy relational clustering approach

    Directory of Open Access Journals (Sweden)

    Ayad Hendalianpour

    2016-11-01

    Full Text Available Clustering is absolutely useful information to explore data structures and has been employed in many places. It organizes a set of objects into similar groups called clusters, and the objects within one cluster are both highly similar and dissimilar with the objects in other clusters. The K-mean, C-mean, Fuzzy C-mean and Kernel K-mean algorithms are the most popular clustering algorithms for their easy implementation and fast work, but in some cases we cannot use these algorithms. Regarding this, in this paper, a hybrid model for customer clustering is presented that is applicable in five banks of Fars Province, Shiraz, Iran. In this way, the fuzzy relation among customers is defined by using their features described in linguistic and quantitative variables. As follows, the customers of banks are grouped according to K-mean, C-mean, Fuzzy C-mean and Kernel K-mean algorithms and the proposed Fuzzy Relation Clustering (FRC algorithm. The aim of this paper is to show how to choose the best clustering algorithms based on density-based clustering and present a new clustering algorithm for both crisp and fuzzy variables. Finally, we apply the proposed approach to five datasets of customer's segmentation in banks. The result of the FCR shows the accuracy and high performance of FRC compared other clustering methods.

  7. Clustering Dycom

    KAUST Repository

    Minku, Leandro L.

    2017-10-06

    Background: Software Effort Estimation (SEE) can be formulated as an online learning problem, where new projects are completed over time and may become available for training. In this scenario, a Cross-Company (CC) SEE approach called Dycom can drastically reduce the number of Within-Company (WC) projects needed for training, saving the high cost of collecting such training projects. However, Dycom relies on splitting CC projects into different subsets in order to create its CC models. Such splitting can have a significant impact on Dycom\\'s predictive performance. Aims: This paper investigates whether clustering methods can be used to help finding good CC splits for Dycom. Method: Dycom is extended to use clustering methods for creating the CC subsets. Three different clustering methods are investigated, namely Hierarchical Clustering, K-Means, and Expectation-Maximisation. Clustering Dycom is compared against the original Dycom with CC subsets of different sizes, based on four SEE databases. A baseline WC model is also included in the analysis. Results: Clustering Dycom with K-Means can potentially help to split the CC projects, managing to achieve similar or better predictive performance than Dycom. However, K-Means still requires the number of CC subsets to be pre-defined, and a poor choice can negatively affect predictive performance. EM enables Dycom to automatically set the number of CC subsets while still maintaining or improving predictive performance with respect to the baseline WC model. Clustering Dycom with Hierarchical Clustering did not offer significant advantage in terms of predictive performance. Conclusion: Clustering methods can be an effective way to automatically generate Dycom\\'s CC subsets.

  8. Research on retailer data clustering algorithm based on Spark

    Science.gov (United States)

    Huang, Qiuman; Zhou, Feng

    2017-03-01

    Big data analysis is a hot topic in the IT field now. Spark is a high-reliability and high-performance distributed parallel computing framework for big data sets. K-means algorithm is one of the classical partition methods in clustering algorithm. In this paper, we study the k-means clustering algorithm on Spark. Firstly, the principle of the algorithm is analyzed, and then the clustering analysis is carried out on the supermarket customers through the experiment to find out the different shopping patterns. At the same time, this paper proposes the parallelization of k-means algorithm and the distributed computing framework of Spark, and gives the concrete design scheme and implementation scheme. This paper uses the two-year sales data of a supermarket to validate the proposed clustering algorithm and achieve the goal of subdividing customers, and then analyze the clustering results to help enterprises to take different marketing strategies for different customer groups to improve sales performance.

  9. Full text clustering and relationship network analysis of biomedical publications.

    Directory of Open Access Journals (Sweden)

    Renchu Guan

    Full Text Available Rapid developments in the biomedical sciences have increased the demand for automatic clustering of biomedical publications. In contrast to current approaches to text clustering, which focus exclusively on the contents of abstracts, a novel method is proposed for clustering and analysis of complete biomedical article texts. To reduce dimensionality, Cosine Coefficient is used on a sub-space of only two vectors, instead of computing the Euclidean distance within the space of all vectors. Then a strategy and algorithm is introduced for Semi-supervised Affinity Propagation (SSAP to improve analysis efficiency, using biomedical journal names as an evaluation background. Experimental results show that by avoiding high-dimensional sparse matrix computations, SSAP outperforms conventional k-means methods and improves upon the standard Affinity Propagation algorithm. In constructing a directed relationship network and distribution matrix for the clustering results, it can be noted that overlaps in scope and interests among BioMed publications can be easily identified, providing a valuable analytical tool for editors, authors and readers.

  10. Full text clustering and relationship network analysis of biomedical publications.

    Science.gov (United States)

    Guan, Renchu; Yang, Chen; Marchese, Maurizio; Liang, Yanchun; Shi, Xiaohu

    2014-01-01

    Rapid developments in the biomedical sciences have increased the demand for automatic clustering of biomedical publications. In contrast to current approaches to text clustering, which focus exclusively on the contents of abstracts, a novel method is proposed for clustering and analysis of complete biomedical article texts. To reduce dimensionality, Cosine Coefficient is used on a sub-space of only two vectors, instead of computing the Euclidean distance within the space of all vectors. Then a strategy and algorithm is introduced for Semi-supervised Affinity Propagation (SSAP) to improve analysis efficiency, using biomedical journal names as an evaluation background. Experimental results show that by avoiding high-dimensional sparse matrix computations, SSAP outperforms conventional k-means methods and improves upon the standard Affinity Propagation algorithm. In constructing a directed relationship network and distribution matrix for the clustering results, it can be noted that overlaps in scope and interests among BioMed publications can be easily identified, providing a valuable analytical tool for editors, authors and readers.

  11. Phenotypes Determined by Cluster Analysis in Moderate to Severe Bronchial Asthma.

    Science.gov (United States)

    Youroukova, Vania M; Dimitrova, Denitsa G; Valerieva, Anna D; Lesichkova, Spaska S; Velikova, Tsvetelina V; Ivanova-Todorova, Ekaterina I; Tumangelova-Yuzeir, Kalina D

    2017-06-01

    Bronchial asthma is a heterogeneous disease that includes various subtypes. They may share similar clinical characteristics, but probably have different pathological mechanisms. To identify phenotypes using cluster analysis in moderate to severe bronchial asthma and to compare differences in clinical, physiological, immunological and inflammatory data between the clusters. Forty adult patients with moderate to severe bronchial asthma out of exacerbation were included. All underwent clinical assessment, anthropometric measurements, skin prick testing, standard spirometry and measurement fraction of exhaled nitric oxide. Blood eosinophilic count, serum total IgE and periostin levels were determined. Two-step cluster approach, hierarchical clustering method and k-mean analysis were used for identification of the clusters. We have identified four clusters. Cluster 1 (n=14) - late-onset, non-atopic asthma with impaired lung function, Cluster 2 (n=13) - late-onset, atopic asthma, Cluster 3 (n=6) - late-onset, aspirin sensitivity, eosinophilic asthma, and Cluster 4 (n=7) - early-onset, atopic asthma. Our study is the first in Bulgaria in which cluster analysis is applied to asthmatic patients. We identified four clusters. The variables with greatest force for differentiation in our study were: age of asthma onset, duration of diseases, atopy, smoking, blood eosinophils, nonsteroidal anti-inflammatory drugs hypersensitivity, baseline FEV1/FVC and symptoms severity. Our results support the concept of heterogeneity of bronchial asthma and demonstrate that cluster analysis can be an useful tool for phenotyping of disease and personalized approach to the treatment of patients.

  12. Hierarchical Adaptive Means (HAM) clustering for hardware-efficient, unsupervised and real-time spike sorting.

    Science.gov (United States)

    Paraskevopoulou, Sivylla E; Wu, Di; Eftekhar, Amir; Constandinou, Timothy G

    2014-09-30

    This work presents a novel unsupervised algorithm for real-time adaptive clustering of neural spike data (spike sorting). The proposed Hierarchical Adaptive Means (HAM) clustering method combines centroid-based clustering with hierarchical cluster connectivity to classify incoming spikes using groups of clusters. It is described how the proposed method can adaptively track the incoming spike data without requiring any past history, iteration or training and autonomously determines the number of spike classes. Its performance (classification accuracy) has been tested using multiple datasets (both simulated and recorded) achieving a near-identical accuracy compared to k-means (using 10-iterations and provided with the number of spike classes). Also, its robustness in applying to different feature extraction methods has been demonstrated by achieving classification accuracies above 80% across multiple datasets. Last but crucially, its low complexity, that has been quantified through both memory and computation requirements makes this method hugely attractive for future hardware implementation. Copyright © 2014 Elsevier B.V. All rights reserved.

  13. Structuring heterogeneous biological information using fuzzy clustering of k-partite graphs

    Directory of Open Access Journals (Sweden)

    Theis Fabian J

    2010-10-01

    Full Text Available Abstract Background Extensive and automated data integration in bioinformatics facilitates the construction of large, complex biological networks. However, the challenge lies in the interpretation of these networks. While most research focuses on the unipartite or bipartite case, we address the more general but common situation of k-partite graphs. These graphs contain k different node types and links are only allowed between nodes of different types. In order to reveal their structural organization and describe the contained information in a more coarse-grained fashion, we ask how to detect clusters within each node type. Results Since entities in biological networks regularly have more than one function and hence participate in more than one cluster, we developed a k-partite graph partitioning algorithm that allows for overlapping (fuzzy clusters. It determines for each node a degree of membership to each cluster. Moreover, the algorithm estimates a weighted k-partite graph that connects the extracted clusters. Our method is fast and efficient, mimicking the multiplicative update rules commonly employed in algorithms for non-negative matrix factorization. It facilitates the decomposition of networks on a chosen scale and therefore allows for analysis and interpretation of structures on various resolution levels. Applying our algorithm to a tripartite disease-gene-protein complex network, we were able to structure this graph on a large scale into clusters that are functionally correlated and biologically meaningful. Locally, smaller clusters enabled reclassification or annotation of the clusters' elements. We exemplified this for the transcription factor MECP2. Conclusions In order to cope with the overwhelming amount of information available from biomedical literature, we need to tackle the challenge of finding structures in large networks with nodes of multiple types. To this end, we presented a novel fuzzy k-partite graph partitioning

  14. Cluster mislocation in kinematic Sunyaev-Zel'dovich (kSZ) effect extraction

    Science.gov (United States)

    Calafut, Victoria Rose; Bean, Rachel; Yu, Byeonghee

    2018-01-01

    We investigate the impact of a variety of analysis assumptions that influence cluster identification and location on the kSZ pairwise momentum signal and covariance estimation. Photometric and spectroscopic galaxy tracers from SDSS, WISE, and DECaLs, spanning redshifts 0.05zgeneration of CMB and LSS surveys the statistical and photometric errors will shrink markedly. Our results demonstrate that uncertainties introduced through using galaxy proxies for cluster locations will need to be fully incorporated, and actively mitigated, for the kSZ to reach its full potential as a cosmological constraining tool for dark energy and neutrino physics.

  15. Exploring the individual patterns of spiritual well-being in people newly diagnosed with advanced cancer: a cluster analysis.

    Science.gov (United States)

    Bai, Mei; Dixon, Jane; Williams, Anna-Leila; Jeon, Sangchoon; Lazenby, Mark; McCorkle, Ruth

    2016-11-01

    Research shows that spiritual well-being correlates positively with quality of life (QOL) for people with cancer, whereas contradictory findings are frequently reported with respect to the differentiated associations between dimensions of spiritual well-being, namely peace, meaning and faith, and QOL. This study aimed to examine individual patterns of spiritual well-being among patients newly diagnosed with advanced cancer. Cluster analysis was based on the twelve items of the 12-item Functional Assessment of Chronic Illness Therapy-Spiritual Well-Being Scale at Time 1. A combination of hierarchical and k-means (non-hierarchical) clustering methods was employed to jointly determine the number of clusters. Self-rated health, depressive symptoms, peace, meaning and faith, and overall QOL were compared at Time 1 and Time 2. Hierarchical and k-means clustering methods both suggested four clusters. Comparison of the four clusters supported statistically significant and clinically meaningful differences in QOL outcomes among clusters while revealing contrasting relations of faith with QOL. Cluster 1, Cluster 3, and Cluster 4 represented high, medium, and low levels of overall QOL, respectively, with correspondingly high, medium, and low levels of peace, meaning, and faith. Cluster 2 was distinguished from other clusters by its medium levels of overall QOL, peace, and meaning and low level of faith. This study provides empirical support for individual difference in response to a newly diagnosed cancer and brings into focus conceptual and methodological challenges associated with the measure of spiritual well-being, which may partly contribute to the attenuated relation between faith and QOL.

  16. Robust multi-scale clustering of large DNA microarray datasets with the consensus algorithm

    DEFF Research Database (Denmark)

    Grotkjær, Thomas; Winther, Ole; Regenberg, Birgitte

    2006-01-01

    Motivation: Hierarchical and relocation clustering (e.g. K-means and self-organizing maps) have been successful tools in the display and analysis of whole genome DNA microarray expression data. However, the results of hierarchical clustering are sensitive to outliers, and most relocation methods...... analysis by collecting re-occurring clustering patterns in a co-occurrence matrix. The results show that consensus clustering obtained from clustering multiple times with Variational Bayes Mixtures of Gaussians or K-means significantly reduces the classification error rate for a simulated dataset...

  17. ARK: Aggregation of Reads by K-Means for Estimation of Bacterial Community Composition.

    Science.gov (United States)

    Koslicki, David; Chatterjee, Saikat; Shahrivar, Damon; Walker, Alan W; Francis, Suzanna C; Fraser, Louise J; Vehkaperä, Mikko; Lan, Yueheng; Corander, Jukka

    2015-01-01

    Estimation of bacterial community composition from high-throughput sequenced 16S rRNA gene amplicons is a key task in microbial ecology. Since the sequence data from each sample typically consist of a large number of reads and are adversely impacted by different levels of biological and technical noise, accurate analysis of such large datasets is challenging. There has been a recent surge of interest in using compressed sensing inspired and convex-optimization based methods to solve the estimation problem for bacterial community composition. These methods typically rely on summarizing the sequence data by frequencies of low-order k-mers and matching this information statistically with a taxonomically structured database. Here we show that the accuracy of the resulting community composition estimates can be substantially improved by aggregating the reads from a sample with an unsupervised machine learning approach prior to the estimation phase. The aggregation of reads is a pre-processing approach where we use a standard K-means clustering algorithm that partitions a large set of reads into subsets with reasonable computational cost to provide several vectors of first order statistics instead of only single statistical summarization in terms of k-mer frequencies. The output of the clustering is then processed further to obtain the final estimate for each sample. The resulting method is called Aggregation of Reads by K-means (ARK), and it is based on a statistical argument via mixture density formulation. ARK is found to improve the fidelity and robustness of several recently introduced methods, with only a modest increase in computational complexity. An open source, platform-independent implementation of the method in the Julia programming language is freely available at https://github.com/dkoslicki/ARK. A Matlab implementation is available at http://www.ee.kth.se/ctsoftware.

  18. Risk assessment of water pollution sources based on an integrated k-means clustering and set pair analysis method in the region of Shiyan, China.

    Science.gov (United States)

    Li, Chunhui; Sun, Lian; Jia, Junxiang; Cai, Yanpeng; Wang, Xuan

    2016-07-01

    Source water areas are facing many potential water pollution risks. Risk assessment is an effective method to evaluate such risks. In this paper an integrated model based on k-means clustering analysis and set pair analysis was established aiming at evaluating the risks associated with water pollution in source water areas, in which the weights of indicators were determined through the entropy weight method. Then the proposed model was applied to assess water pollution risks in the region of Shiyan in which China's key source water area Danjiangkou Reservoir for the water source of the middle route of South-to-North Water Diversion Project is located. The results showed that eleven sources with relative high risk value were identified. At the regional scale, Shiyan City and Danjiangkou City would have a high risk value in term of the industrial discharge. Comparatively, Danjiangkou City and Yunxian County would have a high risk value in terms of agricultural pollution. Overall, the risk values of north regions close to the main stream and reservoir of the region of Shiyan were higher than that in the south. The results of risk level indicated that five sources were in lower risk level (i.e., level II), two in moderate risk level (i.e., level III), one in higher risk level (i.e., level IV) and three in highest risk level (i.e., level V). Also risks of industrial discharge are higher than that of the agricultural sector. It is thus essential to manage the pillar industry of the region of Shiyan and certain agricultural companies in the vicinity of the reservoir to reduce water pollution risks of source water areas. Copyright © 2016 Elsevier B.V. All rights reserved.

  19. TOWARDS FINDING A NEW KERNELIZED FUZZY C-MEANS CLUSTERING ALGORITHM

    Directory of Open Access Journals (Sweden)

    Samarjit Das

    2014-04-01

    Full Text Available Kernelized Fuzzy C-Means clustering technique is an attempt to improve the performance of the conventional Fuzzy C-Means clustering technique. Recently this technique where a kernel-induced distance function is used as a similarity measure instead of a Euclidean distance which is used in the conventional Fuzzy C-Means clustering technique, has earned popularity among research community. Like the conventional Fuzzy C-Means clustering technique this technique also suffers from inconsistency in its performance due to the fact that here also the initial centroids are obtained based on the randomly initialized membership values of the objects. Our present work proposes a new method where we have applied the Subtractive clustering technique of Chiu as a preprocessor to Kernelized Fuzzy CMeans clustering technique. With this new method we have tried not only to remove the inconsistency of Kernelized Fuzzy C-Means clustering technique but also to deal with the situations where the number of clusters is not predetermined. We have also provided a comparison of our method with the Subtractive clustering technique of Chiu and Kernelized Fuzzy C-Means clustering technique using two validity measures namely Partition Coefficient and Clustering Entropy.

  20. Cluster analysis for DNA methylation profiles having a detection threshold

    Directory of Open Access Journals (Sweden)

    Siegmund Kimberly D

    2006-07-01

    Full Text Available Abstract Background DNA methylation, a molecular feature used to investigate tumor heterogeneity, can be measured on many genomic regions using the MethyLight technology. Due to the combination of the underlying biology of DNA methylation and the MethyLight technology, the measurements, while being generated on a continuous scale, have a large number of 0 values. This suggests that conventional clustering methodology may not perform well on this data. Results We compare performance of existing methodology (such as k-means with two novel methods that explicitly allow for the preponderance of values at 0. We also consider how the ability to successfully cluster such data depends upon the number of informative genes for which methylation is measured and the correlation structure of the methylation values for those genes. We show that when data is collected for a sufficient number of genes, our models do improve clustering performance compared to methods, such as k-means, that do not explicitly respect the supposed biological realities of the situation. Conclusion The performance of analysis methods depends upon how well the assumptions of those methods reflect the properties of the data being analyzed. Differing technologies will lead to data with differing properties, and should therefore be analyzed differently. Consequently, it is prudent to give thought to what the properties of the data are likely to be, and which analysis method might therefore be likely to best capture those properties.

  1. Orbit Clustering Based on Transfer Cost

    Science.gov (United States)

    Gustafson, Eric D.; Arrieta-Camacho, Juan J.; Petropoulos, Anastassios E.

    2013-01-01

    We propose using cluster analysis to perform quick screening for combinatorial global optimization problems. The key missing component currently preventing cluster analysis from use in this context is the lack of a useable metric function that defines the cost to transfer between two orbits. We study several proposed metrics and clustering algorithms, including k-means and the expectation maximization algorithm. We also show that proven heuristic methods such as the Q-law can be modified to work with cluster analysis.

  2. K-nearest uphill clustering in the protein structure space

    KAUST Repository

    Cui, Xuefeng

    2016-08-26

    The protein structure classification problem, which is to assign a protein structure to a cluster of similar proteins, is one of the most fundamental problems in the construction and application of the protein structure space. Early manually curated protein structure classifications (e.g., SCOP and CATH) are very successful, but recently suffer the slow updating problem because of the increased throughput of newly solved protein structures. Thus, fully automatic methods to cluster proteins in the protein structure space have been designed and developed. In this study, we observed that the SCOP superfamilies are highly consistent with clustering trees representing hierarchical clustering procedures, but the tree cutting is very challenging and becomes the bottleneck of clustering accuracy. To overcome this challenge, we proposed a novel density-based K-nearest uphill clustering method that effectively eliminates noisy pairwise protein structure similarities and identifies density peaks as cluster centers. Specifically, the density peaks are identified based on K-nearest uphills (i.e., proteins with higher densities) and K-nearest neighbors. To our knowledge, this is the first attempt to apply and develop density-based clustering methods in the protein structure space. Our results show that our density-based clustering method outperforms the state-of-the-art clustering methods previously applied to the problem. Moreover, we observed that computational methods and human experts could produce highly similar clusters at high precision values, while computational methods also suggest to split some large superfamilies into smaller clusters. © 2016 Elsevier B.V.

  3. A Preliminary Study Application Clustering System in Acoustic Emission Monitoring

    Directory of Open Access Journals (Sweden)

    Saiful Bahari Nur Amira Afiza

    2017-01-01

    Full Text Available Acoustic Emission (AE is a non-destructive testing known as assessment on damage detection in structural engineering. It also can be used to discriminate the different types of damage occurring in a composite materials. The main problem associated with the data analysis is the discrimination between the different AE sources and analysis of the AE signal in order to identify the most critical damage mechanism. Clustering analysis is a technique in which the set of object are assigned to a group called cluster. The objective of the cluster analysis is to separate a set of data into several classes that reflect the internal structure of data. In this paper was used k-means algorithm for partitioned clustering method, numerous effort have been made to improve the performance of application k-means clustering algorithm. This paper presents a current review on application clustering system in Acoustic Emission.

  4. B0-correction and k-means clustering for accurate and automatic identification of regions with reduced apparent diffusion coefficient (ADC) in adva nced cervical cancer at the time of brachytherapy

    DEFF Research Database (Denmark)

    Haack, Søren; Pedersen, Erik Morre; Vinding, Mads Sloth

    in dose planning of radiotherapy. This study evaluates the use of k-means clustering for automatic user independent delineation of regions of reduced apparent diffusion coefficient (ADC) and the value of B0-correction of DW-MRI for reduction of geometrical distortions during dose planning of brachytherapy...

  5. Magnetic cluster mean-field description of spin glasses in amorphous La-Gd-Au alloys

    International Nuclear Information System (INIS)

    Poon, S.J.; Durand, J.

    1978-03-01

    Bulk magnetic properties of splat-cooled amorphous alloys of composition La/sub 80-x/Gd/sub x/Au 20 (0 less than or equal to x less than or equal to 80) were studied. Zero-field susceptibility, high-field magnetization (up to 75 kOe) and saturated remanence were measured between 1.8 and 290 0 K. Data were analyzed using a cluster mean-field approximation for the spin-glass and mictomagnetic alloys (x less than or equal to 56). Mean-field theories can account for the experimental freezing-temperatures of dilute spin-glasses in which the Ruderman-Kittel-Kasuya-Yosida interaction is dominant. For the dilute alloys, the role of amorphousness on the magnetic interactions is discussed. By extending the mean-field approximation, the concentrated spin-glasses are represented by rigid ferromagnetic clusters as individual spin-entities interacting via random forces. Scaling laws for the magnetization M and saturation remanent magnetization M/sub rs/ are obtained and presented graphically for the x less than or equal to 32 alloys in which M/x = g(H/x*, T/x), M/sub rs/(T)/x = M/sub rs/(0)/x/ exp (-α*T/x/sup p/) where x* is the concentration of clusters, α* is a constant, and p is the freezing-temperature exponent given by T/sub M/ infinity x/sup p/. It is found that p = 1 and 1.3 for the regions 4 less than or equal to x less than or equal to 40 respectively. An attempt is also made to account for the freezing temperatures of concentrated spin glasses. The strength of the interaction among clusters is determined from high-field magnetization measurements using the Larkin-Smith method modified for clusters. It is shown that for the x < 24 alloys, the size of the clusters can be correlated to the structural short-range order in the amorphous state. More concentrated alloys are marked by the emergence of cluster percolation

  6. Fast clustering algorithm for large ECG data sets based on CS theory in combination with PCA and K-NN methods.

    Science.gov (United States)

    Balouchestani, Mohammadreza; Krishnan, Sridhar

    2014-01-01

    Long-term recording of Electrocardiogram (ECG) signals plays an important role in health care systems for diagnostic and treatment purposes of heart diseases. Clustering and classification of collecting data are essential parts for detecting concealed information of P-QRS-T waves in the long-term ECG recording. Currently used algorithms do have their share of drawbacks: 1) clustering and classification cannot be done in real time; 2) they suffer from huge energy consumption and load of sampling. These drawbacks motivated us in developing novel optimized clustering algorithm which could easily scan large ECG datasets for establishing low power long-term ECG recording. In this paper, we present an advanced K-means clustering algorithm based on Compressed Sensing (CS) theory as a random sampling procedure. Then, two dimensionality reduction methods: Principal Component Analysis (PCA) and Linear Correlation Coefficient (LCC) followed by sorting the data using the K-Nearest Neighbours (K-NN) and Probabilistic Neural Network (PNN) classifiers are applied to the proposed algorithm. We show our algorithm based on PCA features in combination with K-NN classifier shows better performance than other methods. The proposed algorithm outperforms existing algorithms by increasing 11% classification accuracy. In addition, the proposed algorithm illustrates classification accuracy for K-NN and PNN classifiers, and a Receiver Operating Characteristics (ROC) area of 99.98%, 99.83%, and 99.75% respectively.

  7. Cluster Analysis of Customer Reviews Extracted from Web Pages

    Directory of Open Access Journals (Sweden)

    S. Shivashankar

    2010-01-01

    Full Text Available As e-commerce is gaining popularity day by day, the web has become an excellent source for gathering customer reviews / opinions by the market researchers. The number of customer reviews that a product receives is growing at very fast rate (It could be in hundreds or thousands. Customer reviews posted on the websites vary greatly in quality. The potential customer has to read necessarily all the reviews irrespective of their quality to make a decision on whether to purchase the product or not. In this paper, we make an attempt to assess are view based on its quality, to help the customer make a proper buying decision. The quality of customer review is assessed as most significant, more significant, significant and insignificant.A novel and effective web mining technique is proposed for assessing a customer review of a particular product based on the feature clustering techniques, namely, k-means method and fuzzy c-means method. This is performed in three steps : (1Identify review regions and extract reviews from it, (2 Extract and cluster the features of reviews by a clustering technique and then assign weights to the features belonging to each of the clusters (groups and (3 Assess the review by considering the feature weights and group belongingness. The k-means and fuzzy c-means clustering techniques are implemented and tested on customer reviews extracted from web pages. Performance of these techniques are analyzed.

  8. A hybrid method based on a new clustering technique and multilayer perceptron neural networks for hourly solar radiation forecasting

    International Nuclear Information System (INIS)

    Azimi, R.; Ghayekhloo, M.; Ghofrani, M.

    2016-01-01

    Highlights: • A novel clustering approach is proposed based on the data transformation approach. • A novel cluster selection method based on correlation analysis is presented. • The proposed hybrid clustering approach leads to deep learning for MLPNN. • A hybrid forecasting method is developed to predict solar radiations. • The evaluation results show superior performance of the proposed forecasting model. - Abstract: Accurate forecasting of renewable energy sources plays a key role in their integration into the grid. This paper proposes a hybrid solar irradiance forecasting framework using a Transformation based K-means algorithm, named TB K-means, to increase the forecast accuracy. The proposed clustering method is a combination of a new initialization technique, K-means algorithm and a new gradual data transformation approach. Unlike the other K-means based clustering methods which are not capable of providing a fixed and definitive answer due to the selection of different cluster centroids for each run, the proposed clustering provides constant results for different runs of the algorithm. The proposed clustering is combined with a time-series analysis, a novel cluster selection algorithm and a multilayer perceptron neural network (MLPNN) to develop the hybrid solar radiation forecasting method for different time horizons (1 h ahead, 2 h ahead, …, 48 h ahead). The performance of the proposed TB K-means clustering is evaluated using several different datasets and compared with different variants of K-means algorithm. Solar datasets with different solar radiation characteristics are also used to determine the accuracy and processing speed of the developed forecasting method with the proposed TB K-means and other clustering techniques. The results of direct comparison with other well-established forecasting models demonstrate the superior performance of the proposed hybrid forecasting method. Furthermore, a comparative analysis with the benchmark solar

  9. [Predicting Incidence of Hepatitis E in Chinausing Fuzzy Time Series Based on Fuzzy C-Means Clustering Analysis].

    Science.gov (United States)

    Luo, Yi; Zhang, Tao; Li, Xiao-song

    2016-05-01

    To explore the application of fuzzy time series model based on fuzzy c-means clustering in forecasting monthly incidence of Hepatitis E in mainland China. Apredictive model (fuzzy time series method based on fuzzy c-means clustering) was developed using Hepatitis E incidence data in mainland China between January 2004 and July 2014. The incidence datafrom August 2014 to November 2014 were used to test the fitness of the predictive model. The forecasting results were compared with those resulted from traditional fuzzy time series models. The fuzzy time series model based on fuzzy c-means clustering had 0.001 1 mean squared error (MSE) of fitting and 6.977 5 x 10⁻⁴ MSE of forecasting, compared with 0.0017 and 0.0014 from the traditional forecasting model. The results indicate that the fuzzy time series model based on fuzzy c-means clustering has a better performance in forecasting incidence of Hepatitis E.

  10. Dynamic Trajectory Extraction from Stereo Vision Using Fuzzy Clustering

    Science.gov (United States)

    Onishi, Masaki; Yoda, Ikushi

    In recent years, many human tracking researches have been proposed in order to analyze human dynamic trajectory. These researches are general technology applicable to various fields, such as customer purchase analysis in a shopping environment and safety control in a (railroad) crossing. In this paper, we present a new approach for tracking human positions by stereo image. We use the framework of two-stepped clustering with k-means method and fuzzy clustering to detect human regions. In the initial clustering, k-means method makes middle clusters from objective features extracted by stereo vision at high speed. In the last clustering, c-means fuzzy method cluster middle clusters based on attributes into human regions. Our proposed method can be correctly clustered by expressing ambiguity using fuzzy clustering, even when many people are close to each other. The validity of our technique was evaluated with the experiment of trajectories extraction of doctors and nurses in an emergency room of a hospital.

  11. Small traveling clusters in attractive and repulsive Hamiltonian mean-field models.

    Science.gov (United States)

    Barré, Julien; Yamaguchi, Yoshiyuki Y

    2009-03-01

    Long-lasting small traveling clusters are studied in the Hamiltonian mean-field model by comparing between attractive and repulsive interactions. Nonlinear Landau damping theory predicts that a Gaussian momentum distribution on a spatially homogeneous background permits the existence of traveling clusters in the repulsive case, as in plasma systems, but not in the attractive case. Nevertheless, extending the analysis to a two-parameter family of momentum distributions of Fermi-Dirac type, we theoretically predict the existence of traveling clusters in the attractive case; these findings are confirmed by direct N -body numerical simulations. The parameter region with the traveling clusters is much reduced in the attractive case with respect to the repulsive case.

  12. PREPAID TELECOM CUSTOMERS SEGMENTATION USING THE K-MEAN ALGORITHM

    Directory of Open Access Journals (Sweden)

    Marar Liviu Ioan

    2012-07-01

    Full Text Available The scope of relationship marketing is to retain customers and win their loyalty. This can be achieved if the companies’ products and services are developed and sold considering customers’ demands. Fulfilling customers’ demands, taken as the starting point of relationship marketing, can be obtained by acknowledging that the customers’ needs and wishes are heterogeneous. The segmentation of the customers’ base allows operators to overcome this because it illustrates the whole heterogeneous market as the sum of smaller homogeneous markets. The concept of segmentation relies on the high probability of persons grouped into segments based on common demands and behaviours to have a similar response to marketing strategies. This article focuses on the segmentation of a telecom customer base according to specific and noticeable criteria of a certain service. Although the segmentation concept is widely approached in professional literature, articles on the segmentation of a telecom customer base are very scarce, due to the strategic nature of this information. Market segmentation is carried out based on how customers spent their money on credit recharging, on making calls, on sending SMS and on Internet navigation. The method used for customer segmentation is the K-mean cluster analysis. To assess the internal cohesion of the clusters we employed the average sum of squares error indicator, and to determine the differences among the clusters we used the ANOVA and the post-hoc Tukey tests. The analyses revealed seven customer segments with different features and behaviours. The results enable the telecom company to conceive marketing strategies and planning which lead to better understanding of its customers’ needs and ultimately to a more efficient relationship with the subscribers and enhanced customer satisfaction. At the same time, the results enable the description and characterization of expenditure patterns

  13. Internet of Things-Based Arduino Intelligent Monitoring and Cluster Analysis of Seasonal Variation in Physicochemical Parameters of Jungnangcheon, an Urban Stream

    Directory of Open Access Journals (Sweden)

    Byungwan Jo

    2017-03-01

    Full Text Available In the present case study, the use of an advanced, efficient and low-cost technique for monitoring an urban stream was reported. Physicochemical parameters (PcPs of Jungnangcheon stream (Seoul, South Korea were assessed using an Internet of Things (IoT platform. Temperature, dissolved oxygen (DO, and pH parameters were monitored for the three summer months and the first fall month at a fixed location. Analysis was performed using clustering techniques (CTs, such as K-means clustering, agglomerative hierarchical clustering (AHC, and density-based spatial clustering of applications with noise (DBSCAN. An IoT-based Arduino sensor module (ASM network with a 99.99% efficient communication platform was developed to allow collection of stream data with user-friendly software and hardware and facilitated data analysis by interested individuals using their smartphones. Clustering was used to formulate relationships among physicochemical parameters. K-means clustering was used to identify natural clusters using the silhouette coefficient based on cluster compactness and looseness. AHC grouped all data into two clusters as well as temperature, DO and pH into four, eight, and four clusters, respectively. DBSCAN analysis was also performed to evaluate yearly variations in physicochemical parameters. Noise points (NOISE of temperature in 2016 were border points (ƥ, whereas in 2014 and 2015 they remained core points (ɋ, indicating a trend toward increasing stream temperature. We found the stream parameters were within the permissible limits set by the Water Quality Standards for River Water, South Korea.

  14. Vessel Segmentation in Retinal Images Using Multi-scale Line Operator and K-Means Clustering.

    Science.gov (United States)

    Saffarzadeh, Vahid Mohammadi; Osareh, Alireza; Shadgar, Bita

    2014-04-01

    Detecting blood vessels is a vital task in retinal image analysis. The task is more challenging with the presence of bright and dark lesions in retinal images. Here, a method is proposed to detect vessels in both normal and abnormal retinal fundus images based on their linear features. First, the negative impact of bright lesions is reduced by using K-means segmentation in a perceptive space. Then, a multi-scale line operator is utilized to detect vessels while ignoring some of the dark lesions, which have intensity structures different from the line-shaped vessels in the retina. The proposed algorithm is tested on two publicly available STARE and DRIVE databases. The performance of the method is measured by calculating the area under the receiver operating characteristic curve and the segmentation accuracy. The proposed method achieves 0.9483 and 0.9387 localization accuracy against STARE and DRIVE respectively.

  15. Integrative Sparse K-Means With Overlapping Group Lasso in Genomic Applications for Disease Subtype Discovery.

    Science.gov (United States)

    Huo, Zhiguang; Tseng, George

    2017-06-01

    Cancer subtypes discovery is the first step to deliver personalized medicine to cancer patients. With the accumulation of massive multi-level omics datasets and established biological knowledge databases, omics data integration with incorporation of rich existing biological knowledge is essential for deciphering a biological mechanism behind the complex diseases. In this manuscript, we propose an integrative sparse K -means (is- K means) approach to discover disease subtypes with the guidance of prior biological knowledge via sparse overlapping group lasso. An algorithm using an alternating direction method of multiplier (ADMM) will be applied for fast optimization. Simulation and three real applications in breast cancer and leukemia will be used to compare is- K means with existing methods and demonstrate its superior clustering accuracy, feature selection, functional annotation of detected molecular features and computing efficiency.

  16. Fault detection of flywheel system based on clustering and principal component analysis

    Directory of Open Access Journals (Sweden)

    Wang Rixin

    2015-12-01

    Full Text Available Considering the nonlinear, multifunctional properties of double-flywheel with closed-loop control, a two-step method including clustering and principal component analysis is proposed to detect the two faults in the multifunctional flywheels. At the first step of the proposed algorithm, clustering is taken as feature recognition to check the instructions of “integrated power and attitude control” system, such as attitude control, energy storage or energy discharge. These commands will ask the flywheel system to work in different operation modes. Therefore, the relationship of parameters in different operations can define the cluster structure of training data. Ordering points to identify the clustering structure (OPTICS can automatically identify these clusters by the reachability-plot. K-means algorithm can divide the training data into the corresponding operations according to the reachability-plot. Finally, the last step of proposed model is used to define the relationship of parameters in each operation through the principal component analysis (PCA method. Compared with the PCA model, the proposed approach is capable of identifying the new clusters and learning the new behavior of incoming data. The simulation results show that it can effectively detect the faults in the multifunctional flywheels system.

  17. Extension of K-Means Algorithm for clustering mixed data | Onuodu ...

    African Journals Online (AJOL)

    Also proposed is a new dissimilarity measure that uses relative cumulative frequency-based method in clustering objects with mixed values. The dissimilarity model developed could serve as a predictive tool for identifying attributes of objects in mixed datasets. It has been implemented using JAVA programming language ...

  18. A spectral k-means approach to bright-field cell image segmentation.

    Science.gov (United States)

    Bradbury, Laura; Wan, Justin W L

    2010-01-01

    Automatic segmentation of bright-field cell images is important to cell biologists, but difficult to complete due to the complex nature of the cells in bright-field images (poor contrast, broken halo, missing boundaries). Standard approaches such as level set segmentation and active contours work well for fluorescent images where cells appear as round shape, but become less effective when optical artifacts such as halo exist in bright-field images. In this paper, we present a robust segmentation method which combines the spectral and k-means clustering techniques to locate cells in bright-field images. This approach models an image as a matrix graph and segment different regions of the image by computing the appropriate eigenvectors of the matrix graph and using the k-means algorithm. We illustrate the effectiveness of the method by segmentation results of C2C12 (muscle) cells in bright-field images.

  19. Semi-supervised clustering methods.

    Science.gov (United States)

    Bair, Eric

    2013-01-01

    Cluster analysis methods seek to partition a data set into homogeneous subgroups. It is useful in a wide variety of applications, including document processing and modern genetics. Conventional clustering methods are unsupervised, meaning that there is no outcome variable nor is anything known about the relationship between the observations in the data set. In many situations, however, information about the clusters is available in addition to the values of the features. For example, the cluster labels of some observations may be known, or certain observations may be known to belong to the same cluster. In other cases, one may wish to identify clusters that are associated with a particular outcome variable. This review describes several clustering algorithms (known as "semi-supervised clustering" methods) that can be applied in these situations. The majority of these methods are modifications of the popular k-means clustering method, and several of them will be described in detail. A brief description of some other semi-supervised clustering algorithms is also provided.

  20. Transcriptional organization of the DNA region controlling expression of the K99 gene cluster.

    Science.gov (United States)

    Roosendaal, B; Damoiseaux, J; Jordi, W; de Graaf, F K

    1989-01-01

    The transcriptional organization of the K99 gene cluster was investigated in two ways. First, the DNA region, containing the transcriptional signals was analyzed using a transcription vector system with Escherichia coli galactokinase (GalK) as assayable marker and second, an in vitro transcription system was employed. A detailed analysis of the transcription signals revealed that a strong promoter PA and a moderate promoter PB are located upstream of fanA and fanB, respectively. No promoter activity was detected in the intercistronic region between fanB and fanC. Factor-dependent terminators of transcription were detected and are probably located in the intercistronic region between fanA and fanB (T1), and between fanB and fanC (T2). A third terminator (T3) was observed between fanC and fanD and has an efficiency of 90%. Analysis of the regulatory region in an in vitro transcription system confirmed the location of the respective transcription signals. A model for the transcriptional organization of the K99 cluster is presented. Indications were obtained that the trans-acting regulatory polypeptides FanA and FanB both function as anti-terminators. A model for the regulation of expression of the K99 gene cluster is postulated.

  1. Cluster analysis as a method for determining size ranges for spinal implants: disc lumbar replacement prosthesis dimensions from magnetic resonance images.

    Science.gov (United States)

    Lei, Dang; Holder, Roger L; Smith, Francis W; Wardlaw, Douglas; Hukins, David W L

    2006-12-01

    Statistical analysis of clinical radiologic data. To develop an objective method for finding the number of sizes for a lumbar disc replacement. Cluster analysis is a well-established technique for sorting observations into clusters so that the "similarity level" is maximal if they belong to the same cluster and minimal otherwise. Magnetic resonance scans from 69 patients, with no abnormal discs, yielded 206 sagittal and transverse images of 206 discs (levels L3-L4-L5-S1). Anteroposterior and lateral dimensions were measured from vertebral margins on transverse images; disc heights were measured from sagittal images. Hierarchical cluster analysis was performed to determine the number of clusters followed by nonhierarchical (K-means) cluster analysis. Discriminant analysis was used to determine how well the clusters could be used to classify an observation. The most successful method of clustering the data involved the following parameters: anteroposterior dimension; lateral dimension (both were the mean of results from the superior and inferior margins of a vertebral body, measured on transverse images); and maximum disc height (from a midsagittal image). These were grouped into 7 clusters so that a discriminant analysis was capable of correctly classifying 97.1% of the observations. The mean and standard deviations for the parameter values in each cluster were determined. Cluster analysis has been successfully used to find the dimensions of the minimum number of prosthesis sizes required to replace L3-L4 to L5-S1 discs; the range of sizes would enable them to be used at higher lumbar levels in some patients.

  2. The association between content of the elements S, Cl, K, Fe, Cu, Zn and Br in normal and cirrhotic liver tissue from Danes and Greenlandic Inuit examined by dual hierarchical clustering analysis.

    Science.gov (United States)

    Laursen, Jens; Milman, Nils; Pind, Niels; Pedersen, Henrik; Mulvad, Gert

    2014-01-01

    Meta-analysis of previous studies evaluating associations between content of elements sulphur (S), chlorine (Cl), potassium (K), iron (Fe), copper (Cu), zinc (Zn) and bromine (Br) in normal and cirrhotic autopsy liver tissue samples. Normal liver samples from 45 Greenlandic Inuit, median age 60 years and from 71 Danes, median age 61 years. Cirrhotic liver samples from 27 Danes, median age 71 years. Element content was measured using X-ray fluorescence spectrometry. Dual hierarchical clustering analysis, creating a dual dendrogram, one clustering element contents according to calculated similarities, one clustering elements according to correlation coefficients between the element contents, both using Euclidian distance and Ward Procedure. One dendrogram separated subjects in 7 clusters showing no differences in ethnicity, gender or age. The analysis discriminated between elements in normal and cirrhotic livers. The other dendrogram clustered elements in four clusters: sulphur and chlorine; copper and bromine; potassium and zinc; iron. There were significant correlations between the elements in normal liver samples: S was associated with Cl, K, Br and Zn; Cl with S and Br; K with S, Br and Zn; Cu with Br. Zn with S and K. Br with S, Cl, K and Cu. Fe did not show significant associations with any other element. In contrast to simple statistical methods, which analyses content of elements separately one by one, dual hierarchical clustering analysis incorporates all elements at the same time and can be used to examine the linkage and interplay between multiple elements in tissue samples. Copyright © 2013 Elsevier GmbH. All rights reserved.

  3. Detecting COPD exacerbations early using daily telemonitoring of symptoms and k-means clustering: a pilot study.

    Science.gov (United States)

    Sanchez-Morillo, Daniel; Fernandez-Granero, Miguel Angel; Jiménez, Antonio León

    2015-05-01

    COPD places an enormous burden on the healthcare systems and causes diminished health-related quality of life. The highest proportion of human and economic cost is associated with admissions for acute exacerbation of respiratory symptoms (AECOPD). Since prompt detection and treatment of exacerbations may improve outcomes, early detection of AECOPD is a critical issue. This pilot study was aimed to determine whether a mobile health system could enable early detection of AECOPD on a day-to-day basis. A novel electronic questionnaire for the early detection of COPD exacerbations was evaluated during a 6-months field trial in a group of 16 patients. Pattern recognition techniques were applied. A k-means clustering algorithm was trained and validated, and its accuracy in detecting AECOPD was assessed. Sensitivity and specificity were 74.6 and 89.7 %, respectively, and area under the receiver operating characteristic curve was 0.84. 31 out of 33 AECOPD were early identified with an average of 4.5 ± 2.1 days prior to the onset of the exacerbation that was considered the day of medical attendance. Based on the findings of this preliminary pilot study, the proposed electronic questionnaire and the applied methodology could help to early detect COPD exacerbations on a day-to-day basis and therefore could provide support to patients and physicians.

  4. Cluster analysis of track structure

    International Nuclear Information System (INIS)

    Michalik, V.

    1991-01-01

    One of the possibilities of classifying track structures is application of conventional partition techniques of analysis of multidimensional data to the track structure. Using these cluster algorithms this paper attempts to find characteristics of radiation reflecting the spatial distribution of ionizations in the primary particle track. An absolute frequency distribution of clusters of ionizations giving the mean number of clusters produced by radiation per unit of deposited energy can serve as this characteristic. General computation techniques used as well as methods of calculations of distributions of clusters for different radiations are discussed. 8 refs.; 5 figs

  5. α/β-particle radiation identification based on fuzzy C-means clustering

    International Nuclear Information System (INIS)

    Yang Yijianxia; Yang Lu; Li Wenqiang

    2013-01-01

    A pulse shape recognition method based on fuzzy C-means clustering for the discrimination of α/βparticle was presented. A detection circuit to isolate α/β-particles is designed. Using a single probe scintillating detector to acquire α/β particles. By comparing the results to pulse amplitude analysis, it is shown that by Fuzzy C-means clustering α-particle count rate increased by 42.9% and the cross-talk ratio of α-β is decreased by 15.9% for 6190 cps 0420 αsource; β-particle count rate increased by 31.8% and the cross -talk ratio of β-α is decreased by 7.7% for 05-05β source. (authors)

  6. Using Cluster Analysis and ICP-MS to Identify Groups of Ecstasy Tablets in Sao Paulo State, Brazil.

    Science.gov (United States)

    Maione, Camila; de Oliveira Souza, Vanessa Cristina; Togni, Loraine Rezende; da Costa, José Luiz; Campiglia, Andres Dobal; Barbosa, Fernando; Barbosa, Rommel Melgaço

    2017-11-01

    The variations found in the elemental composition in ecstasy samples result in spectral profiles with useful information for data analysis, and cluster analysis of these profiles can help uncover different categories of the drug. We provide a cluster analysis of ecstasy tablets based on their elemental composition. Twenty-five elements were determined by ICP-MS in tablets apprehended by Sao Paulo's State Police, Brazil. We employ the K-means clustering algorithm along with C4.5 decision tree to help us interpret the clustering results. We found a better number of two clusters within the data, which can refer to the approximated number of sources of the drug which supply the cities of seizures. The C4.5 model was capable of differentiating the ecstasy samples from the two clusters with high prediction accuracy using the leave-one-out cross-validation. The model used only Nd, Ni, and Pb concentration values in the classification of the samples. © 2017 American Academy of Forensic Sciences.

  7. Potential shallow aquifers characterization through an integrated geophysical method: multivariate approach by means of k-means algorithms

    Directory of Open Access Journals (Sweden)

    Stefano Bernardinetti

    2017-06-01

    Full Text Available The need to obtain a detailed hydrogeological characterization of the subsurface and its interpretation for the groundwater resources management, often requires to apply several and complementary geophysical methods. The goal of the approach in this paper is to provide a unique model of the aquifer by synthesizing and optimizing the information provided by several geophysical methods. This approach greatly reduces the degree of uncertainty and subjectivity of the interpretation by exploiting the different physical and mechanic characteristics of the aquifer. The studied area, into the municipality of Laterina (Arezzo, Italy, is a shallow basin filled by lacustrine and alluvial deposits (Pleistocene and Olocene epochs, Quaternary period, with alternated silt, sand with variable content of gravel and clay where the bottom is represented by arenaceous-pelitic rocks (Mt. Cervarola Unit, Tuscan Domain, Miocene epoch. This shallow basin constitutes the unconfined superficial aquifer to be exploited in the nearly future. To improve the geological model obtained from a detailed geological survey we performed electrical resistivity and P wave refraction tomographies along the same line in order to obtain different, independent and integrable data sets. For the seismic data also the reflected events have been processed, a remarkable contribution to draw the geologic setting. Through the k-means algorithm, we perform a cluster analysis for the bivariate data set to individuate relationships between the two sets of variables. This algorithm allows to individuate clusters with the aim of minimizing the dissimilarity within each cluster and maximizing it among different clusters of the bivariate data set. The optimal number of clusters “K”, corresponding to the individuated geophysical facies, depends to the multivariate data set distribution and in this work is estimated with the Silhouettes. The result is an integrated tomography that shows a finite

  8. Comparative Investigation of Guided Fuzzy Clustering and Mean Shift Clustering for Edge Detection in Electrical Resistivity Tomography Images of Mineral Deposits

    Science.gov (United States)

    Ward, Wil; Wilkinson, Paul; Chambers, Jon; Bai, Li

    2014-05-01

    is an optimised choice of kernel size, for which automated procedures are available. In general, since the gfcm does not require a convergence criterion it is notably faster than mean-shift. The mean-shift approach is sensitive to small changes of the bandwidth, whereas with gfcm the guiding pdf approximation can be intermediately reviewed and smoothed if desired. In mapping the resistivity information to 1D in pre-processing, no assumptions on the spatial shape of the clusters are needed, a common problem in centroid-based clustering. References: [1] River Terrace Sand and Gravel Deposit Reserve Estimation Using Three Dimensional Electrical Resistivity Tomography for Bedrock Surface Detection. Journal of Applied Geophysics, 2013. Chambers, J.E. et al. [2] Distribution-based Fuzzy Clustering of Electrical Resistivity Tomography Images for Interface Detection. Geophysical Journal International, 2014. Ward W.O.C. et al. [3] Mean Shift: A Robust Approach Toward Feature Space Analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2002. Comaniciu, D and Meer, P.

  9. Cluster analysis and its application to healthcare claims data: a study of end-stage renal disease patients who initiated hemodialysis.

    Science.gov (United States)

    Liao, Minlei; Li, Yunfeng; Kianifard, Farid; Obi, Engels; Arcona, Stephen

    2016-03-02

    Cluster analysis (CA) is a frequently used applied statistical technique that helps to reveal hidden structures and "clusters" found in large data sets. However, this method has not been widely used in large healthcare claims databases where the distribution of expenditure data is commonly severely skewed. The purpose of this study was to identify cost change patterns of patients with end-stage renal disease (ESRD) who initiated hemodialysis (HD) by applying different clustering methods. A retrospective, cross-sectional, observational study was conducted using the Truven Health MarketScan® Research Databases. Patients aged ≥18 years with ≥2 ESRD diagnoses who initiated HD between 2008 and 2010 were included. The K-means CA method and hierarchical CA with various linkage methods were applied to all-cause costs within baseline (12-months pre-HD) and follow-up periods (12-months post-HD) to identify clusters. Demographic, clinical, and cost information was extracted from both periods, and then examined by cluster. A total of 18,380 patients were identified. Meaningful all-cause cost clusters were generated using K-means CA and hierarchical CA with either flexible beta or Ward's methods. Based on cluster sample sizes and change of cost patterns, the K-means CA method and 4 clusters were selected: Cluster 1: Average to High (n = 113); Cluster 2: Very High to High (n = 89); Cluster 3: Average to Average (n = 16,624); or Cluster 4: Increasing Costs, High at Both Points (n = 1554). Median cost changes in the 12-month pre-HD and post-HD periods increased from $185,070 to $884,605 for Cluster 1 (Average to High), decreased from $910,930 to $157,997 for Cluster 2 (Very High to High), were relatively stable and remained low from $15,168 to $13,026 for Cluster 3 (Average to Average), and increased from $57,909 to $193,140 for Cluster 4 (Increasing Costs, High at Both Points). Relatively stable costs after starting HD were associated with more stable scores

  10. Clustering User Behavior in Scientific Collections

    OpenAIRE

    Blixhavn, Øystein Hoel

    2014-01-01

    This master thesis looks at how clustering techniques can be appliedto a collection of scientific documents. Approximately one year of serverlogs from the CERN Document Server (CDS) are analyzed and preprocessed.Based on the findings of this analysis, and a review of thecurrent state of the art, three different clustering methods are selectedfor further work: Simple k-Means, Hierarchical Agglomerative Clustering(HAC) and Graph Partitioning. In addition, a custom, agglomerativeclustering algor...

  11. Application of clustering for customer segmentation in private banking

    Science.gov (United States)

    Yang, Xuan; Chen, Jin; Hao, Pengpeng; Wang, Yanbo J.

    2015-07-01

    With fierce competition in banking industry, more and more banks have realised that accurate customer segmentation is of fundamental importance, especially for the identification of those high-value customers. In order to solve this problem, we collected real data about private banking customers of a commercial bank in China, conducted empirical analysis by applying K-means clustering technique. When determine the K value, we propose a mechanism that meet both academic requirements and practical needs. Through K-means clustering, we successfully segmented the customers into three categories, and features of each group have been illustrated in details.

  12. Improved infrared precipitation estimation approaches based on k-means clustering: Application to north Algeria using MSG-SEVIRI satellite data

    Science.gov (United States)

    Mokdad, Fatiha; Haddad, Boualem

    2017-06-01

    In this paper, two new infrared precipitation estimation approaches based on the concept of k-means clustering are first proposed, named the NAW-Kmeans and the GPI-Kmeans methods. Then, they are adapted to the southern Mediterranean basin, where the subtropical climate prevails. The infrared data (10.8 μm channel) acquired by MSG-SEVIRI sensor in winter and spring 2012 are used. Tests are carried out in eight areas distributed over northern Algeria: Sebra, El Bordj, Chlef, Blida, Bordj Menael, Sidi Aich, Beni Ourthilane, and Beni Aziz. The validation is performed by a comparison of the estimated rainfalls to rain gauges observations collected by the National Office of Meteorology in Dar El Beida (Algeria). Despite the complexity of the subtropical climate, the obtained results indicate that the NAW-Kmeans and the GPI-Kmeans approaches gave satisfactory results for the considered rain rates. Also, the proposed schemes lead to improvement in precipitation estimation performance when compared to the original algorithms NAW (Nagri, Adler, and Wetzel) and GPI (GOES Precipitation Index).

  13. Accident patterns for construction-related workers: a cluster analysis

    Science.gov (United States)

    Liao, Chia-Wen; Tyan, Yaw-Yauan

    2012-01-01

    The construction industry has been identified as one of the most hazardous industries. The risk of constructionrelated workers is far greater than that in a manufacturing based industry. However, some steps can be taken to reduce worker risk through effective injury prevention strategies. In this article, k-means clustering methodology is employed in specifying the factors related to different worker types and in identifying the patterns of industrial occupational accidents. Accident reports during the period 1998 to 2008 are extracted from case reports of the Northern Region Inspection Office of the Council of Labor Affairs of Taiwan. The results show that the cluster analysis can indicate some patterns of occupational injuries in the construction industry. Inspection plans should be proposed according to the type of construction-related workers. The findings provide a direction for more effective inspection strategies and injury prevention programs.

  14. Cluster analysis in severe emphysema subjects using phenotype and genotype data: an exploratory investigation

    Directory of Open Access Journals (Sweden)

    Martinez Fernando J

    2010-03-01

    Full Text Available Abstract Background Numerous studies have demonstrated associations between genetic markers and COPD, but results have been inconsistent. One reason may be heterogeneity in disease definition. Unsupervised learning approaches may assist in understanding disease heterogeneity. Methods We selected 31 phenotypic variables and 12 SNPs from five candidate genes in 308 subjects in the National Emphysema Treatment Trial (NETT Genetics Ancillary Study cohort. We used factor analysis to select a subset of phenotypic variables, and then used cluster analysis to identify subtypes of severe emphysema. We examined the phenotypic and genotypic characteristics of each cluster. Results We identified six factors accounting for 75% of the shared variability among our initial phenotypic variables. We selected four phenotypic variables from these factors for cluster analysis: 1 post-bronchodilator FEV1 percent predicted, 2 percent bronchodilator responsiveness, and quantitative CT measurements of 3 apical emphysema and 4 airway wall thickness. K-means cluster analysis revealed four clusters, though separation between clusters was modest: 1 emphysema predominant, 2 bronchodilator responsive, with higher FEV1; 3 discordant, with a lower FEV1 despite less severe emphysema and lower airway wall thickness, and 4 airway predominant. Of the genotypes examined, membership in cluster 1 (emphysema-predominant was associated with TGFB1 SNP rs1800470. Conclusions Cluster analysis may identify meaningful disease subtypes and/or groups of related phenotypic variables even in a highly selected group of severe emphysema subjects, and may be useful for genetic association studies.

  15. Detection of uterine MMG contractions using a multiple change point estimator and the K-means cluster algorithm.

    Science.gov (United States)

    La Rosa, Patricio S; Nehorai, Arye; Eswaran, Hari; Lowery, Curtis L; Preissl, Hubert

    2008-02-01

    We propose a single channel two-stage time-segment discriminator of uterine magnetomyogram (MMG) contractions during pregnancy. We assume that the preprocessed signals are piecewise stationary having distribution in a common family with a fixed number of parameters. Therefore, at the first stage, we propose a model-based segmentation procedure, which detects multiple change-points in the parameters of a piecewise constant time-varying autoregressive model using a robust formulation of the Schwarz information criterion (SIC) and a binary search approach. In particular, we propose a test statistic that depends on the SIC, derive its asymptotic distribution, and obtain closed-form optimal detection thresholds in the sense of the Neyman-Pearson criterion; therefore, we control the probability of false alarm and maximize the probability of change-point detection in each stage of the binary search algorithm. We compute and evaluate the relative energy variation [root mean squares (RMS)] and the dominant frequency component [first order zero crossing (FOZC)] in discriminating between time segments with and without contractions. The former consistently detects a time segment with contractions. Thus, at the second stage, we apply a nonsupervised K-means cluster algorithm to classify the detected time segments using the RMS values. We apply our detection algorithm to real MMG records obtained from ten patients admitted to the hospital for contractions with gestational ages between 31 and 40 weeks. We evaluate the performance of our detection algorithm in computing the detection and false alarm rate, respectively, using as a reference the patients' feedback. We also analyze the fusion of the decision signals from all the sensors as in the parallel distributed detection approach.

  16. Fault Detection Using the Clustering-kNN Rule for Gas Sensor Arrays

    Directory of Open Access Journals (Sweden)

    Jingli Yang

    2016-12-01

    Full Text Available The k-nearest neighbour (kNN rule, which naturally handles the possible non-linearity of data, is introduced to solve the fault detection problem of gas sensor arrays. In traditional fault detection methods based on the kNN rule, the detection process of each new test sample involves all samples in the entire training sample set. Therefore, these methods can be computation intensive in monitoring processes with a large volume of variables and training samples and may be impossible for real-time monitoring. To address this problem, a novel clustering-kNN rule is presented. The landmark-based spectral clustering (LSC algorithm, which has low computational complexity, is employed to divide the entire training sample set into several clusters. Further, the kNN rule is only conducted in the cluster that is nearest to the test sample; thus, the efficiency of the fault detection methods can be enhanced by reducing the number of training samples involved in the detection process of each test sample. The performance of the proposed clustering-kNN rule is fully verified in numerical simulations with both linear and non-linear models and a real gas sensor array experimental system with different kinds of faults. The results of simulations and experiments demonstrate that the clustering-kNN rule can greatly enhance both the accuracy and efficiency of fault detection methods and provide an excellent solution to reliable and real-time monitoring of gas sensor arrays.

  17. Fault Detection Using the Clustering-kNN Rule for Gas Sensor Arrays

    Science.gov (United States)

    Yang, Jingli; Sun, Zhen; Chen, Yinsheng

    2016-01-01

    The k-nearest neighbour (kNN) rule, which naturally handles the possible non-linearity of data, is introduced to solve the fault detection problem of gas sensor arrays. In traditional fault detection methods based on the kNN rule, the detection process of each new test sample involves all samples in the entire training sample set. Therefore, these methods can be computation intensive in monitoring processes with a large volume of variables and training samples and may be impossible for real-time monitoring. To address this problem, a novel clustering-kNN rule is presented. The landmark-based spectral clustering (LSC) algorithm, which has low computational complexity, is employed to divide the entire training sample set into several clusters. Further, the kNN rule is only conducted in the cluster that is nearest to the test sample; thus, the efficiency of the fault detection methods can be enhanced by reducing the number of training samples involved in the detection process of each test sample. The performance of the proposed clustering-kNN rule is fully verified in numerical simulations with both linear and non-linear models and a real gas sensor array experimental system with different kinds of faults. The results of simulations and experiments demonstrate that the clustering-kNN rule can greatly enhance both the accuracy and efficiency of fault detection methods and provide an excellent solution to reliable and real-time monitoring of gas sensor arrays. PMID:27929412

  18. A simple and fast method to determine the parameters for fuzzy c-means cluster analysis

    DEFF Research Database (Denmark)

    Schwämmle, Veit; Jensen, Ole Nørregaard

    2010-01-01

    MOTIVATION: Fuzzy c-means clustering is widely used to identify cluster structures in high-dimensional datasets, such as those obtained in DNA microarray and quantitative proteomics experiments. One of its main limitations is the lack of a computationally fast method to set optimal values...... of algorithm parameters. Wrong parameter values may either lead to the inclusion of purely random fluctuations in the results or ignore potentially important data. The optimal solution has parameter values for which the clustering does not yield any results for a purely random dataset but which detects cluster...... formation with maximum resolution on the edge of randomness. RESULTS: Estimation of the optimal parameter values is achieved by evaluation of the results of the clustering procedure applied to randomized datasets. In this case, the optimal value of the fuzzifier follows common rules that depend only...

  19. Clustering Moving Objects Using Segments Slopes

    OpenAIRE

    Mohamed E. El-Sharkawi; Hoda M. O. Mokhtar; Omnia Ossama

    2011-01-01

    Given a set of moving object trajectories, we show how to cluster them using k-meansclustering approach. Our proposed clustering algorithm is competitive with the k-means clusteringbecause it specifies the value of “k” based on the segment’s slope of the moving object trajectories. Theadvantage of this approach is that it overcomes the known drawbacks of the k-means algorithm, namely,the dependence on the number of clusters (k), and the dependence on the initial choice of the clusters’centroi...

  20. Semi-supervised clustering methods

    Science.gov (United States)

    Bair, Eric

    2013-01-01

    Cluster analysis methods seek to partition a data set into homogeneous subgroups. It is useful in a wide variety of applications, including document processing and modern genetics. Conventional clustering methods are unsupervised, meaning that there is no outcome variable nor is anything known about the relationship between the observations in the data set. In many situations, however, information about the clusters is available in addition to the values of the features. For example, the cluster labels of some observations may be known, or certain observations may be known to belong to the same cluster. In other cases, one may wish to identify clusters that are associated with a particular outcome variable. This review describes several clustering algorithms (known as “semi-supervised clustering” methods) that can be applied in these situations. The majority of these methods are modifications of the popular k-means clustering method, and several of them will be described in detail. A brief description of some other semi-supervised clustering algorithms is also provided. PMID:24729830

  1. PREDIKSI VOLUME LALU LINTAS ANGKUTAN LEBARAN PADA WILAYAH JAWA TENGAH DENGAN METODE K-MEANS CLUSTERING UNTUK ADAPTIVE NEURO FUZZY INFERENCE SYSTEM (ANFIS

    Directory of Open Access Journals (Sweden)

    Evanita Evanita

    2016-04-01

    Full Text Available Di Indonesia kepadatan arus lalu lintas terjadi pada jam berangkat dan pulang kantor, hari-hari libur panjang atau hari-hari besar nasional terutama saat hari raya Idul Fitri (lebaran. Mudik sudah menjadi tradisi bagi masyarakat Indonesia yang ditunggu-tunggu menjelang lebaran, berbondong-bondong untuk pulang ke kampung halaman untuk bertemu dan berkumpul dengan keluarga. Kegiatan rutin tahunan ini banyak di lakukan khususnya bagi masyarakat kota-kota besar seperti Jakarta, dimana diketahui bahwa Jakarta adalah Ibu kota negara Republik Indonesia dan menjadi tujuan merantau untuk mencari pekerjaan yang lebih layak yang merupakan harapan besar bagi masyarakat desa. Volume kendaraan bertambah sejak 7 hari menjelang lebaran sampai 7 hari setelah lebaran tiap tahunnya terutama pada arah keluar dan masuk wilayah Jawa Tengah yang banyak menjadi tujuan mudik. Volume kendaraan saat arus mudik yang selalu meningkat inilah yang akan diteliti lebih lanjut dengan metode ANFIS agar dapat menjadi alternatif solusi langkah apa yang akan dilakukan di tahun selanjutnya agar pelayanan lalu lintas, kemacetan panjang dan angka kecelakaan berkurang. Dengan input parameter ANFIS yang digunakan yaitu pengclusteran hingga 5 cluster, epoch 100, error goal 0 diperoleh performa terbaik ANFIS dengan K-Means clustering yang terbagi menjadi 3 cluster, epoch terbaik sebesar 20 dengan RMSE Training terbaik sebesar 0,1198, RMSE Testing terbaik sebesar 0,0282 dan waktu proses tersingkat sebesar 0,0695.Selanjutnya hasil prediksi diharapkan dapat bermanfaat menjadi alternatif solusi langkah apa yang akan dilakukan di tahun selanjutnya agar pelayanan lalu lintas lebih baik lagi. Kata kunci: angkutan lebaran, Jawa Tengah, ANFIS.

  2. Adaptive phase k-means algorithm for waveform classification

    Science.gov (United States)

    Song, Chengyun; Liu, Zhining; Wang, Yaojun; Xu, Feng; Li, Xingming; Hu, Guangmin

    2018-01-01

    Waveform classification is a powerful technique for seismic facies analysis that describes the heterogeneity and compartments within a reservoir. Horizon interpretation is a critical step in waveform classification. However, the horizon often produces inconsistent waveform phase, and thus results in an unsatisfied classification. To alleviate this problem, an adaptive phase waveform classification method called the adaptive phase k-means is introduced in this paper. Our method improves the traditional k-means algorithm using an adaptive phase distance for waveform similarity measure. The proposed distance is a measure with variable phases as it moves from sample to sample along the traces. Model traces are also updated with the best phase interference in the iterative process. Therefore, our method is robust to phase variations caused by the interpretation horizon. We tested the effectiveness of our algorithm by applying it to synthetic and real data. The satisfactory results reveal that the proposed method tolerates certain waveform phase variation and is a good tool for seismic facies analysis.

  3. Genome-scale cluster analysis of replicated microarrays using shrinkage correlation coefficient.

    Science.gov (United States)

    Yao, Jianchao; Chang, Chunqi; Salmi, Mari L; Hung, Yeung Sam; Loraine, Ann; Roux, Stanley J

    2008-06-18

    Currently, clustering with some form of correlation coefficient as the gene similarity metric has become a popular method for profiling genomic data. The Pearson correlation coefficient and the standard deviation (SD)-weighted correlation coefficient are the two most widely-used correlations as the similarity metrics in clustering microarray data. However, these two correlations are not optimal for analyzing replicated microarray data generated by most laboratories. An effective correlation coefficient is needed to provide statistically sufficient analysis of replicated microarray data. In this study, we describe a novel correlation coefficient, shrinkage correlation coefficient (SCC), that fully exploits the similarity between the replicated microarray experimental samples. The methodology considers both the number of replicates and the variance within each experimental group in clustering expression data, and provides a robust statistical estimation of the error of replicated microarray data. The value of SCC is revealed by its comparison with two other correlation coefficients that are currently the most widely-used (Pearson correlation coefficient and SD-weighted correlation coefficient) using statistical measures on both synthetic expression data as well as real gene expression data from Saccharomyces cerevisiae. Two leading clustering methods, hierarchical and k-means clustering were applied for the comparison. The comparison indicated that using SCC achieves better clustering performance. Applying SCC-based hierarchical clustering to the replicated microarray data obtained from germinating spores of the fern Ceratopteris richardii, we discovered two clusters of genes with shared expression patterns during spore germination. Functional analysis suggested that some of the genetic mechanisms that control germination in such diverse plant lineages as mosses and angiosperms are also conserved among ferns. This study shows that SCC is an alternative to the Pearson

  4. K-Nearest Neighbor Intervals Based AP Clustering Algorithm for Large Incomplete Data

    Directory of Open Access Journals (Sweden)

    Cheng Lu

    2015-01-01

    Full Text Available The Affinity Propagation (AP algorithm is an effective algorithm for clustering analysis, but it can not be directly applicable to the case of incomplete data. In view of the prevalence of missing data and the uncertainty of missing attributes, we put forward a modified AP clustering algorithm based on K-nearest neighbor intervals (KNNI for incomplete data. Based on an Improved Partial Data Strategy, the proposed algorithm estimates the KNNI representation of missing attributes by using the attribute distribution information of the available data. The similarity function can be changed by dealing with the interval data. Then the improved AP algorithm can be applicable to the case of incomplete data. Experiments on several UCI datasets show that the proposed algorithm achieves impressive clustering results.

  5. The Classification of Diabetes Mellitus Using Kernel k-means

    Science.gov (United States)

    Alamsyah, M.; Nafisah, Z.; Prayitno, E.; Afida, A. M.; Imah, E. M.

    2018-01-01

    Diabetes Mellitus is a metabolic disorder which is characterized by chronicle hypertensive glucose. Automatics detection of diabetes mellitus is still challenging. This study detected diabetes mellitus by using kernel k-Means algorithm. Kernel k-means is an algorithm which was developed from k-means algorithm. Kernel k-means used kernel learning that is able to handle non linear separable data; where it differs with a common k-means. The performance of kernel k-means in detecting diabetes mellitus is also compared with SOM algorithms. The experiment result shows that kernel k-means has good performance and a way much better than SOM.

  6. "K"-Means Clustering and Mixture Model Clustering: Reply to McLachlan (2011) and Vermunt (2011)

    Science.gov (United States)

    Steinley, Douglas; Brusco, Michael J.

    2011-01-01

    McLachlan (2011) and Vermunt (2011) each provided thoughtful replies to our original article (Steinley & Brusco, 2011). This response serves to incorporate some of their comments while simultaneously clarifying our position. We argue that greater caution against overparamaterization must be taken when assuming that clusters are highly elliptical…

  7. A Dimensionality Reduction-Based Multi-Step Clustering Method for Robust Vessel Trajectory Analysis

    Directory of Open Access Journals (Sweden)

    Huanhuan Li

    2017-08-01

    Full Text Available The Shipboard Automatic Identification System (AIS is crucial for navigation safety and maritime surveillance, data mining and pattern analysis of AIS information have attracted considerable attention in terms of both basic research and practical applications. Clustering of spatio-temporal AIS trajectories can be used to identify abnormal patterns and mine customary route data for transportation safety. Thus, the capacities of navigation safety and maritime traffic monitoring could be enhanced correspondingly. However, trajectory clustering is often sensitive to undesirable outliers and is essentially more complex compared with traditional point clustering. To overcome this limitation, a multi-step trajectory clustering method is proposed in this paper for robust AIS trajectory clustering. In particular, the Dynamic Time Warping (DTW, a similarity measurement method, is introduced in the first step to measure the distances between different trajectories. The calculated distances, inversely proportional to the similarities, constitute a distance matrix in the second step. Furthermore, as a widely-used dimensional reduction method, Principal Component Analysis (PCA is exploited to decompose the obtained distance matrix. In particular, the top k principal components with above 95% accumulative contribution rate are extracted by PCA, and the number of the centers k is chosen. The k centers are found by the improved center automatically selection algorithm. In the last step, the improved center clustering algorithm with k clusters is implemented on the distance matrix to achieve the final AIS trajectory clustering results. In order to improve the accuracy of the proposed multi-step clustering algorithm, an automatic algorithm for choosing the k clusters is developed according to the similarity distance. Numerous experiments on realistic AIS trajectory datasets in the bridge area waterway and Mississippi River have been implemented to compare our

  8. A Dimensionality Reduction-Based Multi-Step Clustering Method for Robust Vessel Trajectory Analysis.

    Science.gov (United States)

    Li, Huanhuan; Liu, Jingxian; Liu, Ryan Wen; Xiong, Naixue; Wu, Kefeng; Kim, Tai-Hoon

    2017-08-04

    The Shipboard Automatic Identification System (AIS) is crucial for navigation safety and maritime surveillance, data mining and pattern analysis of AIS information have attracted considerable attention in terms of both basic research and practical applications. Clustering of spatio-temporal AIS trajectories can be used to identify abnormal patterns and mine customary route data for transportation safety. Thus, the capacities of navigation safety and maritime traffic monitoring could be enhanced correspondingly. However, trajectory clustering is often sensitive to undesirable outliers and is essentially more complex compared with traditional point clustering. To overcome this limitation, a multi-step trajectory clustering method is proposed in this paper for robust AIS trajectory clustering. In particular, the Dynamic Time Warping (DTW), a similarity measurement method, is introduced in the first step to measure the distances between different trajectories. The calculated distances, inversely proportional to the similarities, constitute a distance matrix in the second step. Furthermore, as a widely-used dimensional reduction method, Principal Component Analysis (PCA) is exploited to decompose the obtained distance matrix. In particular, the top k principal components with above 95% accumulative contribution rate are extracted by PCA, and the number of the centers k is chosen. The k centers are found by the improved center automatically selection algorithm. In the last step, the improved center clustering algorithm with k clusters is implemented on the distance matrix to achieve the final AIS trajectory clustering results. In order to improve the accuracy of the proposed multi-step clustering algorithm, an automatic algorithm for choosing the k clusters is developed according to the similarity distance. Numerous experiments on realistic AIS trajectory datasets in the bridge area waterway and Mississippi River have been implemented to compare our proposed method with

  9. Data Clustering

    Science.gov (United States)

    Wagstaff, Kiri L.

    2012-03-01

    particular application involves considerations of the kind of data being analyzed, algorithm runtime efficiency, and how much prior knowledge is available about the problem domain, which can dictate the nature of clusters sought. Fundamentally, the clustering method and its representations of clusters carries with it a definition of what a cluster is, and it is important that this be aligned with the analysis goals for the problem at hand. In this chapter, I emphasize this point by identifying for each algorithm the cluster representation as a model, m_j , even for algorithms that are not typically thought of as creating a “model.” This chapter surveys a basic collection of clustering methods useful to any practitioner who is interested in applying clustering to a new data set. The algorithms include k-means (Section 25.2), EM (Section 25.3), agglomerative (Section 25.4), and spectral (Section 25.5) clustering, with side mentions of variants such as kernel k-means and divisive clustering. The chapter also discusses each algorithm’s strengths and limitations and provides pointers to additional in-depth reading for each subject. Section 25.6 discusses methods for incorporating domain knowledge into the clustering process. This chapter concludes with a brief survey of interesting applications of clustering methods to astronomy data (Section 25.7). The chapter begins with k-means because it is both generally accessible and so widely used that understanding it can be considered a necessary prerequisite for further work in the field. EM can be viewed as a more sophisticated version of k-means that uses a generative model for each cluster and probabilistic item assignments. Agglomerative clustering is the most basic form of hierarchical clustering and provides a basis for further exploration of algorithms in that vein. Spectral clustering permits a departure from feature-vector-based clustering and can operate on data sets instead represented as affinity, or similarity

  10. Deconstructing Bipolar Disorder and Schizophrenia: A cross-diagnostic cluster analysis of cognitive phenotypes.

    Science.gov (United States)

    Lee, Junghee; Rizzo, Shemra; Altshuler, Lori; Glahn, David C; Miklowitz, David J; Sugar, Catherine A; Wynn, Jonathan K; Green, Michael F

    2017-02-01

    Bipolar disorder (BD) and schizophrenia (SZ) show substantial overlap. It has been suggested that a subgroup of patients might contribute to these overlapping features. This study employed a cross-diagnostic cluster analysis to identify subgroups of individuals with shared cognitive phenotypes. 143 participants (68 BD patients, 39 SZ patients and 36 healthy controls) completed a battery of EEG and performance assessments on perception, nonsocial cognition and social cognition. A K-means cluster analysis was conducted with all participants across diagnostic groups. Clinical symptoms, functional capacity, and functional outcome were assessed in patients. A two-cluster solution across 3 groups was the most stable. One cluster including 44 BD patients, 31 controls and 5 SZ patients showed better cognition (High cluster) than the other cluster with 24 BD patients, 35 SZ patients and 5 controls (Low cluster). BD patients in the High cluster performed better than BD patients in the Low cluster across cognitive domains. Within each cluster, participants with different clinical diagnoses showed different profiles across cognitive domains. All patients are in the chronic phase and out of mood episode at the time of assessment and most of the assessment were behavioral measures. This study identified two clusters with shared cognitive phenotype profiles that were not proxies for clinical diagnoses. The finding of better social cognitive performance of BD patients than SZ patients in the Lowe cluster suggest that relatively preserved social cognition may be important to identify disease process distinct to each disorder. Copyright © 2016 Elsevier B.V. All rights reserved.

  11. Identification of multiple cracks in 2D elasticity by means of the reciprocity principle and cluster analysis

    Science.gov (United States)

    Shifrin, Efim I.; Kaptsov, Alexander V.

    2018-01-01

    An inverse 2D elastostatic problem is considered. It is assumed that an isotropic, linear elastic body can contain a finite number of rectilinear, well-separated cracks. The surfaces of the cracks are assumed to be free of the loads. A method is developed for reconstruction the cracks by means of the applied loads and displacements on the boundary of the body, obtained in a single static test. The method is based on the reciprocity principle, elements of the theory of distributions, and cluster analysis. Numerical examples are considered.

  12. Characterizing Heterogeneity within Head and Neck Lesions Using Cluster Analysis of Multi-Parametric MRI Data.

    Directory of Open Access Journals (Sweden)

    Marco Borri

    Full Text Available To describe a methodology, based on cluster analysis, to partition multi-parametric functional imaging data into groups (or clusters of similar functional characteristics, with the aim of characterizing functional heterogeneity within head and neck tumour volumes. To evaluate the performance of the proposed approach on a set of longitudinal MRI data, analysing the evolution of the obtained sub-sets with treatment.The cluster analysis workflow was applied to a combination of dynamic contrast-enhanced and diffusion-weighted imaging MRI data from a cohort of squamous cell carcinoma of the head and neck patients. Cumulative distributions of voxels, containing pre and post-treatment data and including both primary tumours and lymph nodes, were partitioned into k clusters (k = 2, 3 or 4. Principal component analysis and cluster validation were employed to investigate data composition and to independently determine the optimal number of clusters. The evolution of the resulting sub-regions with induction chemotherapy treatment was assessed relative to the number of clusters.The clustering algorithm was able to separate clusters which significantly reduced in voxel number following induction chemotherapy from clusters with a non-significant reduction. Partitioning with the optimal number of clusters (k = 4, determined with cluster validation, produced the best separation between reducing and non-reducing clusters.The proposed methodology was able to identify tumour sub-regions with distinct functional properties, independently separating clusters which were affected differently by treatment. This work demonstrates that unsupervised cluster analysis, with no prior knowledge of the data, can be employed to provide a multi-parametric characterization of functional heterogeneity within tumour volumes.

  13. Identification and validation of asthma phenotypes in Chinese population using cluster analysis.

    Science.gov (United States)

    Wang, Lei; Liang, Rui; Zhou, Ting; Zheng, Jing; Liang, Bing Miao; Zhang, Hong Ping; Luo, Feng Ming; Gibson, Peter G; Wang, Gang

    2017-10-01

    Asthma is a heterogeneous airway disease, so it is crucial to clearly identify clinical phenotypes to achieve better asthma management. To identify and prospectively validate asthma clusters in a Chinese population. Two hundred eighty-four patients were consecutively recruited and 18 sociodemographic and clinical variables were collected. Hierarchical cluster analysis was performed by the Ward method followed by k-means cluster analysis. Then, a prospective 12-month cohort study was used to validate the identified clusters. Five clusters were successfully identified. Clusters 1 (n = 71) and 3 (n = 81) were mild asthma phenotypes with slight airway obstruction and low exacerbation risk, but with a sex differential. Cluster 2 (n = 65) described an "allergic" phenotype, cluster 4 (n = 33) featured a "fixed airflow limitation" phenotype with smoking, and cluster 5 (n = 34) was a "low socioeconomic status" phenotype. Patients in clusters 2, 4, and 5 had distinctly lower socioeconomic status and more psychological symptoms. Cluster 2 had a significantly increased risk of exacerbations (risk ratio [RR] 1.13, 95% confidence interval [CI] 1.03-1.25), unplanned visits for asthma (RR 1.98, 95% CI 1.07-3.66), and emergency visits for asthma (RR 7.17, 95% CI 1.26-40.80). Cluster 4 had an increased risk of unplanned visits (RR 2.22, 95% CI 1.02-4.81), and cluster 5 had increased emergency visits (RR 12.72, 95% CI 1.95-69.78). Kaplan-Meier analysis confirmed that cluster grouping was predictive of time to the first asthma exacerbation, unplanned visit, emergency visit, and hospital admission (P clusters as "allergic asthma," "fixed airflow limitation," and "low socioeconomic status" phenotypes that are at high risk of severe asthma exacerbations and that have management implications for clinical practice in developing countries. Copyright © 2017 American College of Allergy, Asthma & Immunology. Published by Elsevier Inc. All rights reserved.

  14. A Framework To Support Management Of HIVAIDS Using K-Means And Random Forest Algorithm

    Directory of Open Access Journals (Sweden)

    Gladys Iseu

    2017-06-01

    Full Text Available Healthcare industry generates large amounts of complex data about patients hospital resources disease management electronic patient records and medical devices among others. The availability of these huge amounts of medical data creates a need for powerful mining tools to support health care professionals in diagnosis treatment and management of HIVAIDS. Several data mining techniques have been used in management of different data sets. Data mining techniques have been categorized into regression algorithms segmentation algorithms association algorithms sequence analysis algorithms and classification algorithms. In the medical field there has not been a specific study that has incorporated two or more data mining algorithms hence limiting decision making levels by medical practitioners. This study identified the extent to which K-means algorithm cluster patient characteristics it has also evaluated the extent to which random forest algorithm can classify the data for informed decision making as well as design a framework to support medical decision making in the treatment of HIVAIDS related diseases in Kenya. The paper further used random forest classification algorithm to compute proximities between pairs of cases that can be used in clustering locating outliers or by scaling to give interesting views of the data.

  15. k-Means: Random Sampling Procedure

    Indian Academy of Sciences (India)

    First page Back Continue Last page Overview Graphics. k-Means: Random Sampling Procedure. Optimal 1-Mean is. Approximation of Centroid (Inaba et al). S = random sample of size O(1/ ); Centroid of S is a (1+ )-approx centroid of P with constant probability.

  16. Metode RCE-Kmeans untuk Clustering Data

    Directory of Open Access Journals (Sweden)

    Izmy Alwiah Musdar

    2015-07-01

    Abstract  There have been many methods developed to solve the clustering problem. One of them is method in swarm intelligence field such as Particle Swarm Optimization (PSO. Rapid Centroid Estimation (RCE is a method of clustering based Particle Swarm Optimization. RCE, like other variants of PSO clustering, does not depend on initial cluster centers. Moreover, RCE has faster computational time than the previous method like PSC and mPSC. However, RCE has higher standar deviation value than PSC and mPSC in which has impact in the variance of clustering result. It is happaned because of improper equilibrium state, a condition in which the position of the particle does not change anymore, when  the stopping criteria is reached. This study proposes RCE-Kmeans which is a  method applying K-means after the equilibrium state of RCE  reached to update the particle's position which is generated from the RCE method. The results showed that RCE-Kmeans has better quality of the clustering scheme in 7 of 10 datasets compared to K-means and better in 8 of 10 dataset then RCE method. The use of K-means clustering on the RCE method is also able to reduce the standard deviation from RCE method.   Keywords—Data Clustering, Particle Swarm, K-means, Rapid Centroid Estimation.

  17. Properties of polycyclic aromatic hydrocarbons in the northwest photon dominated region of NGC 7023. II. Traditional PAH analysis using k-means as a visualization tool

    International Nuclear Information System (INIS)

    Boersma, C.; Bregman, J.; Allamandola, L. J.

    2014-01-01

    Polycyclic aromatic hydrocarbon (PAH) emission in the Spitzer-IRS spectral map of the northwest photon dominated region (PDR) in NGC 7023 is analyzed using the 'traditional' approach in which the PAH bands and plateaus between 5.2-19.5 μm are isolated by subtracting the underlying continuum and removing H 2 emission lines. The spectra are organized into seven spectroscopic bins by using k-means clustering. Each cluster corresponds to, and reveals, a morphological zone within NGC 7023. The zones self-organize parallel to the well-defined PDR front that coincides with an increase in intensity of the H 2 emission lines. PAH band profiles and integrated strengths are measured, classified, and mapped. The morphological zones revealed by the k-means clustering provides deeper insight into the conditions that drive variations in band strength ratios and evolution of the PAH population that otherwise would be lost. For example, certain band-band relations are bifurcated, revealing two limiting cases; one associated with the PDR, the other with the diffuse medium. Traditionally, PAH band strength ratios are used to gain insight into the properties of the emitting PAH population, i.e., charge, size, structure, and composition. Insights inferred from this work are compared and contrasted to those from Boersma et al. (first paper in this series), where the PAH emission in NGC 7023 is decomposed exclusively using the PAH spectra and tools made available through the NASA Ames PAH IR Spectroscopic Database.

  18. Properties of polycyclic aromatic hydrocarbons in the northwest photon dominated region of NGC 7023. II. Traditional PAH analysis using k-means as a visualization tool

    Energy Technology Data Exchange (ETDEWEB)

    Boersma, C.; Bregman, J.; Allamandola, L. J., E-mail: Christiaan.Boersma@nasa.gov [NASA Ames Research Center, MS 245-6, Moffett Field, CA 94035-0001 (United States)

    2014-11-10

    Polycyclic aromatic hydrocarbon (PAH) emission in the Spitzer-IRS spectral map of the northwest photon dominated region (PDR) in NGC 7023 is analyzed using the 'traditional' approach in which the PAH bands and plateaus between 5.2-19.5 μm are isolated by subtracting the underlying continuum and removing H{sub 2} emission lines. The spectra are organized into seven spectroscopic bins by using k-means clustering. Each cluster corresponds to, and reveals, a morphological zone within NGC 7023. The zones self-organize parallel to the well-defined PDR front that coincides with an increase in intensity of the H{sub 2} emission lines. PAH band profiles and integrated strengths are measured, classified, and mapped. The morphological zones revealed by the k-means clustering provides deeper insight into the conditions that drive variations in band strength ratios and evolution of the PAH population that otherwise would be lost. For example, certain band-band relations are bifurcated, revealing two limiting cases; one associated with the PDR, the other with the diffuse medium. Traditionally, PAH band strength ratios are used to gain insight into the properties of the emitting PAH population, i.e., charge, size, structure, and composition. Insights inferred from this work are compared and contrasted to those from Boersma et al. (first paper in this series), where the PAH emission in NGC 7023 is decomposed exclusively using the PAH spectra and tools made available through the NASA Ames PAH IR Spectroscopic Database.

  19. Plasmon response in K, Na and Li clusters: systematics using the separable random-phase approximation with pseudo-Hamiltonians

    International Nuclear Information System (INIS)

    Kleinig, W.; Nesterenko, V.O.; Reinhard, P.-G.; Serra, Ll.

    1998-01-01

    The systematics of the plasmon response in spherical K, Na and Li clusters in a wide size region (8≤N≤440) is studied. We have considered two simplifying approximations whose validity has been established previously. First, a separable approach to the random-phase approximation is used. This involves an expansion of the residual interaction into a sum of separable terms until convergence is reached. Second, the electron-ion interaction is modelled by using the pseudo-Hamiltonian jellium model (MHJM) which includes nonlocal effects by means of realistic atomic pseudo-Hamiltonians. In cases where nonlocal effects are negligible the Structure Averaged Jellium Model (SAJM) has been used. Good agreement with available experimental data is achieved for K, Na (using the SAJM) and small Li clusters (invoking the PHJM). The trends for peak position and width are generally well reproduced, even up to details of the Landau fragmentation in several clusters. Less good agreement, however, is found for large Li clusters. This remains an open question

  20. Application Of WIMS Code To Calculation Kartini Reactor Parameters By Pin-Cell And Cluster Method

    International Nuclear Information System (INIS)

    Sumarsono, Bambang; Tjiptono, T.W.

    1996-01-01

    Analysis UZrH fuel element parameters calculation in Kartini Reactor by WIMS Code has been done. The analysis is done by pin cell and cluster method. The pin cell method is done as a function percent burn-up and by 8 group 3 region analysis and cluster method by 8 group 12 region analysis. From analysis and calculation resulted K ∼ = 1.3687 by pin cell method and K ∼ = 1.3162 by cluster method and so deviation is 3.83%. By pin cell analysis as a function percent burn-up at the percent burn-up greater than 59.50%, the multiplication factor is less than one (k ∼ < 1) it is mean that the fuel element reactivity is negative

  1. A comparative analysis of DBSCAN, K-means, and quadratic variation algorithms for automatic identification of swallows from swallowing accelerometry signals.

    Science.gov (United States)

    Dudik, Joshua M; Kurosu, Atsuko; Coyle, James L; Sejdić, Ervin

    2015-04-01

    Cervical auscultation with high resolution sensors is currently under consideration as a method of automatically screening for specific swallowing abnormalities. To be clinically useful without human involvement, any devices based on cervical auscultation should be able to detect specified swallowing events in an automatic manner. In this paper, we comparatively analyze the density-based spatial clustering of applications with noise algorithm (DBSCAN), a k-means based algorithm, and an algorithm based on quadratic variation as methods of differentiating periods of swallowing activity from periods of time without swallows. These algorithms utilized swallowing vibration data exclusively and compared the results to a gold standard measure of swallowing duration. Data was collected from 23 subjects that were actively suffering from swallowing difficulties. Comparing the performance of the DBSCAN algorithm with a proven segmentation algorithm that utilizes k-means clustering demonstrated that the DBSCAN algorithm had a higher sensitivity and correctly segmented more swallows. Comparing its performance with a threshold-based algorithm that utilized the quadratic variation of the signal showed that the DBSCAN algorithm offered no direct increase in performance. However, it offered several other benefits including a faster run time and more consistent performance between patients. All algorithms showed noticeable differentiation from the endpoints provided by a videofluoroscopy examination as well as reduced sensitivity. In summary, we showed that the DBSCAN algorithm is a viable method for detecting the occurrence of a swallowing event using cervical auscultation signals, but significant work must be done to improve its performance before it can be implemented in an unsupervised manner. Copyright © 2015 Elsevier Ltd. All rights reserved.

  2. Time-of-flight secondary ion mass spectrometry of a range of coal samples: a chemometrics (PCA, cluster, and PLS) analysis

    Energy Technology Data Exchange (ETDEWEB)

    Lei Pei; Guilin Jiang; Bonnie J. Tyler; Larry L. Baxter; Matthew R. Linford [Brigham Young University, Provo, UT (United States). Department of Chemistry and Biochemistry

    2008-03-15

    This paper documents time-of-flight secondary ion mass spectrometry (ToF-SIMS) analyses of 34 different coal samples. In many cases, the inorganic Na{sup +}, Al{sup +}, Si{sup +}, and K{sup +} ions dominate the spectra, eclipsing the organic peaks. A scores plot of principal component 1 (PC1) versus principal component 2 (PC2) in a principal components analysis (PCA) effectively separates the coal spectra into a triangular pattern, where the different vertices of this pattern come from (I) spectra that have a strong inorganic signature that is dominated by Na{sup +}, (ii) spectra that have a strong inorganic signature that is dominated by Al{sup +}, Si{sup +}, and K{sup +}, and (iii) spectra that have a strong organic signature. Loadings plots of PC1 and PC2 confirm these observations. The spectra with the more prominent inorganic signatures come from samples with higher ash contents. Cluster analysis with the K-means algorithm was also applied to the data. The progressive clustering revealed in the dendrogram correlates extremely well with the clustering of the data points found in the scores plot of PC1 versus PC2 from the PCA. In addition, this clustering often correlates with properties of the coal samples, as measured by traditional analyses. Partial least-squares (PLS), which included the use of interval PLS and a genetic algorithm for variable selection, shows a good correlation between ToF-SIMS spectra and some of the properties measured by traditional means. Thus, ToF-SIMS appears to be a promising technique for the analysis of this important fuel. 33 refs., 9 figs., 5 tabs.

  3. An Efficient MapReduce-Based Parallel Clustering Algorithm for Distributed Traffic Subarea Division

    Directory of Open Access Journals (Sweden)

    Dawen Xia

    2015-01-01

    Full Text Available Traffic subarea division is vital for traffic system management and traffic network analysis in intelligent transportation systems (ITSs. Since existing methods may not be suitable for big traffic data processing, this paper presents a MapReduce-based Parallel Three-Phase K-Means (Par3PKM algorithm for solving traffic subarea division problem on a widely adopted Hadoop distributed computing platform. Specifically, we first modify the distance metric and initialization strategy of K-Means and then employ a MapReduce paradigm to redesign the optimized K-Means algorithm for parallel clustering of large-scale taxi trajectories. Moreover, we propose a boundary identifying method to connect the borders of clustering results for each cluster. Finally, we divide traffic subarea of Beijing based on real-world trajectory data sets generated by 12,000 taxis in a period of one month using the proposed approach. Experimental evaluation results indicate that when compared with K-Means, Par2PK-Means, and ParCLARA, Par3PKM achieves higher efficiency, more accuracy, and better scalability and can effectively divide traffic subarea with big taxi trajectory data.

  4. Cluster analysis for applications

    CERN Document Server

    Anderberg, Michael R

    1973-01-01

    Cluster Analysis for Applications deals with methods and various applications of cluster analysis. Topics covered range from variables and scales to measures of association among variables and among data units. Conceptual problems in cluster analysis are discussed, along with hierarchical and non-hierarchical clustering methods. The necessary elements of data analysis, statistics, cluster analysis, and computer implementation are integrated vertically to cover the complete path from raw data to a finished analysis.Comprised of 10 chapters, this book begins with an introduction to the subject o

  5. clusterMaker: a multi-algorithm clustering plugin for Cytoscape

    Directory of Open Access Journals (Sweden)

    Morris John H

    2011-11-01

    Full Text Available Abstract Background In the post-genomic era, the rapid increase in high-throughput data calls for computational tools capable of integrating data of diverse types and facilitating recognition of biologically meaningful patterns within them. For example, protein-protein interaction data sets have been clustered to identify stable complexes, but scientists lack easily accessible tools to facilitate combined analyses of multiple data sets from different types of experiments. Here we present clusterMaker, a Cytoscape plugin that implements several clustering algorithms and provides network, dendrogram, and heat map views of the results. The Cytoscape network is linked to all of the other views, so that a selection in one is immediately reflected in the others. clusterMaker is the first Cytoscape plugin to implement such a wide variety of clustering algorithms and visualizations, including the only implementations of hierarchical clustering, dendrogram plus heat map visualization (tree view, k-means, k-medoid, SCPS, AutoSOME, and native (Java MCL. Results Results are presented in the form of three scenarios of use: analysis of protein expression data using a recently published mouse interactome and a mouse microarray data set of nearly one hundred diverse cell/tissue types; the identification of protein complexes in the yeast Saccharomyces cerevisiae; and the cluster analysis of the vicinal oxygen chelate (VOC enzyme superfamily. For scenario one, we explore functionally enriched mouse interactomes specific to particular cellular phenotypes and apply fuzzy clustering. For scenario two, we explore the prefoldin complex in detail using both physical and genetic interaction clusters. For scenario three, we explore the possible annotation of a protein as a methylmalonyl-CoA epimerase within the VOC superfamily. Cytoscape session files for all three scenarios are provided in the Additional Files section. Conclusions The Cytoscape plugin cluster

  6. Mathematical classification and clustering

    CERN Document Server

    Mirkin, Boris

    1996-01-01

    I am very happy to have this opportunity to present the work of Boris Mirkin, a distinguished Russian scholar in the areas of data analysis and decision making methodologies. The monograph is devoted entirely to clustering, a discipline dispersed through many theoretical and application areas, from mathematical statistics and combina­ torial optimization to biology, sociology and organizational structures. It compiles an immense amount of research done to date, including many original Russian de­ velopments never presented to the international community before (for instance, cluster-by-cluster versions of the K-Means method in Chapter 4 or uniform par­ titioning in Chapter 5). The author's approach, approximation clustering, allows him both to systematize a great part of the discipline and to develop many in­ novative methods in the framework of optimization problems. The optimization methods considered are proved to be meaningful in the contexts of data analysis and clustering. The material presented in ...

  7. An evaluation of centrality measures used in cluster analysis

    Science.gov (United States)

    Engström, Christopher; Silvestrov, Sergei

    2014-12-01

    Clustering of data into groups of similar objects plays an important part when analysing many types of data, especially when the datasets are large as they often are in for example bioinformatics, social networks and computational linguistics. Many clustering algorithms such as K-means and some types of hierarchical clustering need a number of centroids representing the 'center' of the clusters. The choice of centroids for the initial clusters often plays an important role in the quality of the clusters. Since a data point with a high centrality supposedly lies close to the 'center' of some cluster, this can be used to assign centroids rather than through some other method such as picking them at random. Some work have been done to evaluate the use of centrality measures such as degree, betweenness and eigenvector centrality in clustering algorithms. The aim of this article is to compare and evaluate the usefulness of a number of common centrality measures such as the above mentioned and others such as PageRank and related measures.

  8. Breast Cancer Symptom Clusters Derived from Social Media and Research Study Data Using Improved K-Medoid Clustering

    Science.gov (United States)

    Ping, Qing; Yang, Christopher C.; Marshall, Sarah A.; Avis, Nancy E.; Ip, Edward H.

    2017-01-01

    Most cancer patients, including patients with breast cancer, experience multiple symptoms simultaneously while receiving active treatment. Some symptoms tend to occur together and may be related, such as hot flashes and night sweats. Co-occurring symptoms may have a multiplicative effect on patients’ functioning, mental health, and quality of life. Symptom clusters in the context of oncology were originally described as groups of three or more related symptoms. Some authors have suggested symptom clusters may have practical applications, such as the formulation of more effective therapeutic interventions that address the combined effects of symptoms rather than treating each symptom separately. Most studies that have sought to identify clusters in breast cancer survivors have relied on traditional research studies. Social media, such as online health-related forums, contain a bevy of user-generated content in the form of threads and posts, and could be used as a data source to identify and characterize symptom clusters among cancer patients. The present study seeks to determine patterns of symptom clusters in breast cancer survivors derived from both social media and research study data using improved K-Medoid clustering. A total of 50,426 publicly available messages were collected from Medhelp.com and 653 questionnaires were collected as part of a research study. The network of symptoms built from social media was sparse compared to that of the research study data, making the social media data easier to partition. The proposed revised K-Medoid clustering helps to improve the clustering performance by re-assigning some of the negative-ASW (average silhouette width) symptoms to other clusters after initial K-Medoid clustering. This retains an overall non-decreasing ASW and avoids the problem of trapping in local optima. The overall ASW, individual ASW, and improved interpretation of the final clustering solution suggest improvement. The clustering results suggest

  9. Breast Cancer Symptom Clusters Derived from Social Media and Research Study Data Using Improved K-Medoid Clustering.

    Science.gov (United States)

    Ping, Qing; Yang, Christopher C; Marshall, Sarah A; Avis, Nancy E; Ip, Edward H

    2016-06-01

    Most cancer patients, including patients with breast cancer, experience multiple symptoms simultaneously while receiving active treatment. Some symptoms tend to occur together and may be related, such as hot flashes and night sweats. Co-occurring symptoms may have a multiplicative effect on patients' functioning, mental health, and quality of life. Symptom clusters in the context of oncology were originally described as groups of three or more related symptoms. Some authors have suggested symptom clusters may have practical applications, such as the formulation of more effective therapeutic interventions that address the combined effects of symptoms rather than treating each symptom separately. Most studies that have sought to identify clusters in breast cancer survivors have relied on traditional research studies. Social media, such as online health-related forums, contain a bevy of user-generated content in the form of threads and posts, and could be used as a data source to identify and characterize symptom clusters among cancer patients. The present study seeks to determine patterns of symptom clusters in breast cancer survivors derived from both social media and research study data using improved K-Medoid clustering. A total of 50,426 publicly available messages were collected from Medhelp.com and 653 questionnaires were collected as part of a research study. The network of symptoms built from social media was sparse compared to that of the research study data, making the social media data easier to partition. The proposed revised K-Medoid clustering helps to improve the clustering performance by re-assigning some of the negative-ASW (average silhouette width) symptoms to other clusters after initial K-Medoid clustering. This retains an overall non-decreasing ASW and avoids the problem of trapping in local optima. The overall ASW, individual ASW, and improved interpretation of the final clustering solution suggest improvement. The clustering results suggest

  10. Convex Clustering: An Attractive Alternative to Hierarchical Clustering

    Science.gov (United States)

    Chen, Gary K.; Chi, Eric C.; Ranola, John Michael O.; Lange, Kenneth

    2015-01-01

    The primary goal in cluster analysis is to discover natural groupings of objects. The field of cluster analysis is crowded with diverse methods that make special assumptions about data and address different scientific aims. Despite its shortcomings in accuracy, hierarchical clustering is the dominant clustering method in bioinformatics. Biologists find the trees constructed by hierarchical clustering visually appealing and in tune with their evolutionary perspective. Hierarchical clustering operates on multiple scales simultaneously. This is essential, for instance, in transcriptome data, where one may be interested in making qualitative inferences about how lower-order relationships like gene modules lead to higher-order relationships like pathways or biological processes. The recently developed method of convex clustering preserves the visual appeal of hierarchical clustering while ameliorating its propensity to make false inferences in the presence of outliers and noise. The solution paths generated by convex clustering reveal relationships between clusters that are hidden by static methods such as k-means clustering. The current paper derives and tests a novel proximal distance algorithm for minimizing the objective function of convex clustering. The algorithm separates parameters, accommodates missing data, and supports prior information on relationships. Our program CONVEXCLUSTER incorporating the algorithm is implemented on ATI and nVidia graphics processing units (GPUs) for maximal speed. Several biological examples illustrate the strengths of convex clustering and the ability of the proximal distance algorithm to handle high-dimensional problems. CONVEXCLUSTER can be freely downloaded from the UCLA Human Genetics web site at http://www.genetics.ucla.edu/software/ PMID:25965340

  11. Graph analysis of cell clusters forming vascular networks

    Science.gov (United States)

    Alves, A. P.; Mesquita, O. N.; Gómez-Gardeñes, J.; Agero, U.

    2018-03-01

    This manuscript describes the experimental observation of vasculogenesis in chick embryos by means of network analysis. The formation of the vascular network was observed in the area opaca of embryos from 40 to 55 h of development. In the area opaca endothelial cell clusters self-organize as a primitive and approximately regular network of capillaries. The process was observed by bright-field microscopy in control embryos and in embryos treated with Bevacizumab (Avastin), an antibody that inhibits the signalling of the vascular endothelial growth factor (VEGF). The sequence of images of the vascular growth were thresholded, and used to quantify the forming network in control and Avastin-treated embryos. This characterization is made by measuring vessels density, number of cell clusters and the largest cluster density. From the original images, the topology of the vascular network was extracted and characterized by means of the usual network metrics such as: the degree distribution, average clustering coefficient, average short path length and assortativity, among others. This analysis allows to monitor how the largest connected cluster of the vascular network evolves in time and provides with quantitative evidence of the disruptive effects that Avastin has on the tree structure of vascular networks.

  12. Analyzing coastal environments by means of functional data analysis

    Science.gov (United States)

    Sierra, Carlos; Flor-Blanco, Germán; Ordoñez, Celestino; Flor, Germán; Gallego, José R.

    2017-07-01

    Here we used Functional Data Analysis (FDA) to examine particle-size distributions (PSDs) in a beach/shallow marine sedimentary environment in Gijón Bay (NW Spain). The work involved both Functional Principal Components Analysis (FPCA) and Functional Cluster Analysis (FCA). The grainsize of the sand samples was characterized by means of laser dispersion spectroscopy. Within this framework, FPCA was used as a dimension reduction technique to explore and uncover patterns in grain-size frequency curves. This procedure proved useful to describe variability in the structure of the data set. Moreover, an alternative approach, FCA, was applied to identify clusters and to interpret their spatial distribution. Results obtained with this latter technique were compared with those obtained by means of two vector approaches that combine PCA with CA (Cluster Analysis). The first method, the point density function (PDF), was employed after adapting a log-normal distribution to each PSD and resuming each of the density functions by its mean, sorting, skewness and kurtosis. The second applied a centered-log-ratio (clr) to the original data. PCA was then applied to the transformed data, and finally CA to the retained principal component scores. The study revealed functional data analysis, specifically FPCA and FCA, as a suitable alternative with considerable advantages over traditional vector analysis techniques in sedimentary geology studies.

  13. Objective Classification of Rainfall in Northern Europe for Online Operation of Urban Water Systems Based on Clustering Techniques

    DEFF Research Database (Denmark)

    Löwe, Roland; Madsen, Henrik; McSharry, Patrick

    2016-01-01

    operators to change modes of control of their facilities. A k-means clustering technique was applied to group events retrospectively and was able to distinguish events with clearly different temporal and spatial correlation properties. For online applications, techniques based on k-means clustering...... and quadratic discriminant analysis both provided a fast and reliable identification of rain events of "high" variability, while the k-means provided the smallest number of rain events falsely identified as being of "high" variability (false hits). A simple classification method based on a threshold...

  14. Image Registration Algorithm Based on Parallax Constraint and Clustering Analysis

    Science.gov (United States)

    Wang, Zhe; Dong, Min; Mu, Xiaomin; Wang, Song

    2018-01-01

    To resolve the problem of slow computation speed and low matching accuracy in image registration, a new image registration algorithm based on parallax constraint and clustering analysis is proposed. Firstly, Harris corner detection algorithm is used to extract the feature points of two images. Secondly, use Normalized Cross Correlation (NCC) function to perform the approximate matching of feature points, and the initial feature pair is obtained. Then, according to the parallax constraint condition, the initial feature pair is preprocessed by K-means clustering algorithm, which is used to remove the feature point pairs with obvious errors in the approximate matching process. Finally, adopt Random Sample Consensus (RANSAC) algorithm to optimize the feature points to obtain the final feature point matching result, and the fast and accurate image registration is realized. The experimental results show that the image registration algorithm proposed in this paper can improve the accuracy of the image matching while ensuring the real-time performance of the algorithm.

  15. Clusters of cultures: diversity in meaning of family value and gender role items across Europe.

    Science.gov (United States)

    van Vlimmeren, Eva; Moors, Guy B D; Gelissen, John P T M

    2017-01-01

    Survey data are often used to map cultural diversity by aggregating scores of attitude and value items across countries. However, this procedure only makes sense if the same concept is measured in all countries. In this study we argue that when (co)variances among sets of items are similar across countries, these countries share a common way of assigning meaning to the items. Clusters of cultures can then be observed by doing a cluster analysis on the (co)variance matrices of sets of related items. This study focuses on family values and gender role attitudes. We find four clusters of cultures that assign a distinct meaning to these items, especially in the case of gender roles. Some of these differences reflect response style behavior in the form of acquiescence. Adjusting for this style effect impacts on country comparisons hence demonstrating the usefulness of investigating the patterns of meaning given to sets of items prior to aggregating scores into cultural characteristics.

  16. Implementation of hierarchical clustering using k-mer sparse matrix to analyze MERS-CoV genetic relationship

    Science.gov (United States)

    Bustamam, A.; Ulul, E. D.; Hura, H. F. A.; Siswantining, T.

    2017-07-01

    Hierarchical clustering is one of effective methods in creating a phylogenetic tree based on the distance matrix between DNA (deoxyribonucleic acid) sequences. One of the well-known methods to calculate the distance matrix is k-mer method. Generally, k-mer is more efficient than some distance matrix calculation techniques. The steps of k-mer method are started from creating k-mer sparse matrix, and followed by creating k-mer singular value vectors. The last step is computing the distance amongst vectors. In this paper, we analyze the sequences of MERS-CoV (Middle East Respiratory Syndrome - Coronavirus) DNA by implementing hierarchical clustering using k-mer sparse matrix in order to perform the phylogenetic analysis. Our results show that the ancestor of our MERS-CoV is coming from Egypt. Moreover, we found that the MERS-CoV infection that occurs in one country may not necessarily come from the same country of origin. This suggests that the process of MERS-CoV mutation might not only be influenced by geographical factor.

  17. Cluster analysis

    CERN Document Server

    Everitt, Brian S; Leese, Morven; Stahl, Daniel

    2011-01-01

    Cluster analysis comprises a range of methods for classifying multivariate data into subgroups. By organizing multivariate data into such subgroups, clustering can help reveal the characteristics of any structure or patterns present. These techniques have proven useful in a wide range of areas such as medicine, psychology, market research and bioinformatics.This fifth edition of the highly successful Cluster Analysis includes coverage of the latest developments in the field and a new chapter dealing with finite mixture models for structured data.Real life examples are used throughout to demons

  18. Towards Development of Clustering Applications for Large-Scale Comparative Genotyping and Kinship Analysis Using Y-Short Tandem Repeats.

    Science.gov (United States)

    Seman, Ali; Sapawi, Azizian Mohd; Salleh, Mohd Zaki

    2015-06-01

    Y-chromosome short tandem repeats (Y-STRs) are genetic markers with practical applications in human identification. However, where mass identification is required (e.g., in the aftermath of disasters with significant fatalities), the efficiency of the process could be improved with new statistical approaches. Clustering applications are relatively new tools for large-scale comparative genotyping, and the k-Approximate Modal Haplotype (k-AMH), an efficient algorithm for clustering large-scale Y-STR data, represents a promising method for developing these tools. In this study we improved the k-AMH and produced three new algorithms: the Nk-AMH I (including a new initial cluster center selection), the Nk-AMH II (including a new dominant weighting value), and the Nk-AMH III (combining I and II). The Nk-AMH III was the superior algorithm, with mean clustering accuracy that increased in four out of six datasets and remained at 100% in the other two. Additionally, the Nk-AMH III achieved a 2% higher overall mean clustering accuracy score than the k-AMH, as well as optimal accuracy for all datasets (0.84-1.00). With inclusion of the two new methods, the Nk-AMH III produced an optimal solution for clustering Y-STR data; thus, the algorithm has potential for further development towards fully automatic clustering of any large-scale genotypic data.

  19. Deeply quasi-bound state in single- and double-K nuclear clusters

    Energy Technology Data Exchange (ETDEWEB)

    Marri, S.; Kalantari, S.Z. [Isfahan University of Technology, Department of Physics, Isfahan (Iran, Islamic Republic of); Esmaili, J. [Shahrekord University, Department of Physics, Faculty of Basic Sciences, Shahrekord (Iran, Islamic Republic of)

    2016-12-15

    New calculations of the quasi-bound state positions in K{sup -}K{sup -}pp kaonic nuclear cluster are performed using non-relativistic four-body Faddeev-type equations in AGS form. The corresponding separable approximation for the integral kernels in the three- and four-body kaonic clusters is obtained by using the Hilbert-Schmidt expansion procedure. Different phenomenological models of anti KN-πΣ potentials with one- and two-pole structure of Λ(1405) resonance and separable potential models for anti K- anti K and nucleon-nucleon interactions, are used. The dependence of the resulting four-body binding energy on models of anti KN-πΣ interaction is investigated. We obtained the binding energy of the K{sup -}K{sup -}pp quasi-bound state ∝ 80-94 MeV with the phenomenological anti KN potentials. The width is about ∝ 5-8 MeV for the two-pole models of the interaction, while the one-pole potentials give ∝ 24-31 MeV width. (orig.)

  20. Percolation with multiple giant clusters

    International Nuclear Information System (INIS)

    Ben-Naim, E; Krapivsky, P L

    2005-01-01

    We study mean-field percolation with freezing. Specifically, we consider cluster formation via two competing processes: irreversible aggregation and freezing. We find that when the freezing rate exceeds a certain threshold, the percolation transition is suppressed. Below this threshold, the system undergoes a series of percolation transitions with multiple giant clusters ('gels') formed. Giant clusters are not self-averaging as their total number and their sizes fluctuate from realization to realization. The size distribution F k , of frozen clusters of size k, has a universal tail, F kk -3 . We propose freezing as a practical mechanism for controlling the gel size. (letter to the editor)

  1. Stability cluster links hydrofobic gate to K873 in ATP8A2

    DEFF Research Database (Denmark)

    Mikkelsen, Stine; Vestergaard, Anna Lindeløv; Coleman, Jonathan Allan

    ATPases, though it catalyzes the transport of a much larger substrate, an enigma referred to as the “giant substrate problem”. Recently, based on mutational analysis and molecular dynamics we have identified a hydrophobic gate in a groove surrounded by M1, M2, M4 and M6 (1). A plausible water filled...... harboring key residues in the center of the enzyme important for linking M5 through K873 to M4 and M6. This stability cluster supposedly allows M4 to act as a pumping rod during enzyme reaction cycle. We find that mutation of residues in this stability cluster affects the substrate affinity, as previously...

  2. Novel Clustering Method Based on K-Medoids and Mobility Metric

    Directory of Open Access Journals (Sweden)

    Y. Hamzaoui

    2018-06-01

    Full Text Available The structure and constraint of MANETS influence negatively the performance of QoS, moreover the main routing protocols proposed generally operate in flat routing. Hence, this structure gives the bad results of QoS when the network becomes larger and denser. To solve this problem we use one of the most popular methods named clustering. The present paper comes within the frameworks of research to improve the QoS in MANETs. In this paper we propose a new algorithm of clustering based on the new mobility metric and K-Medoid to distribute the nodes into several clusters. Intuitively our algorithm can give good results in terms of stability of the cluster, and can also extend life time of cluster head.

  3. Research on Abnormal Detection Based on Improved Combination of K - means and SVDD

    Science.gov (United States)

    Hao, Xiaohong; Zhang, Xiaofeng

    2018-01-01

    In order to improve the efficiency of network intrusion detection and reduce the false alarm rate, this paper proposes an anomaly detection algorithm based on improved K-means and SVDD. The algorithm first uses the improved K-means algorithm to cluster the training samples of each class, so that each class is independent and compact in class; Then, according to the training samples, the SVDD algorithm is used to construct the minimum superspheres. The subordinate relationship of the samples is determined by calculating the distance of the minimum superspheres constructed by SVDD. If the test sample is less than the center of the hypersphere, the test sample belongs to this class, otherwise it does not belong to this class, after several comparisons, the final test of the effective detection of the test sample.In this paper, we use KDD CUP99 data set to simulate the proposed anomaly detection algorithm. The results show that the algorithm has high detection rate and low false alarm rate, which is an effective network security protection method.

  4. A new collaborative recommendation approach based on users clustering using artificial bee colony algorithm.

    Science.gov (United States)

    Ju, Chunhua; Xu, Chonghuan

    2013-01-01

    Although there are many good collaborative recommendation methods, it is still a challenge to increase the accuracy and diversity of these methods to fulfill users' preferences. In this paper, we propose a novel collaborative filtering recommendation approach based on K-means clustering algorithm. In the process of clustering, we use artificial bee colony (ABC) algorithm to overcome the local optimal problem caused by K-means. After that we adopt the modified cosine similarity to compute the similarity between users in the same clusters. Finally, we generate recommendation results for the corresponding target users. Detailed numerical analysis on a benchmark dataset MovieLens and a real-world dataset indicates that our new collaborative filtering approach based on users clustering algorithm outperforms many other recommendation methods.

  5. A New Collaborative Recommendation Approach Based on Users Clustering Using Artificial Bee Colony Algorithm

    Directory of Open Access Journals (Sweden)

    Chunhua Ju

    2013-01-01

    Full Text Available Although there are many good collaborative recommendation methods, it is still a challenge to increase the accuracy and diversity of these methods to fulfill users’ preferences. In this paper, we propose a novel collaborative filtering recommendation approach based on K-means clustering algorithm. In the process of clustering, we use artificial bee colony (ABC algorithm to overcome the local optimal problem caused by K-means. After that we adopt the modified cosine similarity to compute the similarity between users in the same clusters. Finally, we generate recommendation results for the corresponding target users. Detailed numerical analysis on a benchmark dataset MovieLens and a real-world dataset indicates that our new collaborative filtering approach based on users clustering algorithm outperforms many other recommendation methods.

  6. Parkinson's Disease Subtypes Identified from Cluster Analysis of Motor and Non-motor Symptoms.

    Science.gov (United States)

    Mu, Jesse; Chaudhuri, Kallol R; Bielza, Concha; de Pedro-Cuesta, Jesus; Larrañaga, Pedro; Martinez-Martin, Pablo

    2017-01-01

    Parkinson's disease is now considered a complex, multi-peptide, central, and peripheral nervous system disorder with considerable clinical heterogeneity. Non-motor symptoms play a key role in the trajectory of Parkinson's disease, from prodromal premotor to end stages. To understand the clinical heterogeneity of Parkinson's disease, this study used cluster analysis to search for subtypes from a large, multi-center, international, and well-characterized cohort of Parkinson's disease patients across all motor stages, using a combination of cardinal motor features (bradykinesia, rigidity, tremor, axial signs) and, for the first time, specific validated rater-based non-motor symptom scales. Two independent international cohort studies were used: (a) the validation study of the Non-Motor Symptoms Scale ( n = 411) and (b) baseline data from the global Non-Motor International Longitudinal Study ( n = 540). k -means cluster analyses were performed on the non-motor and motor domains (domains clustering) and the 30 individual non-motor symptoms alone (symptoms clustering), and hierarchical agglomerative clustering was performed to group symptoms together. Four clusters are identified from the domains clustering supporting previous studies: mild, non-motor dominant, motor-dominant, and severe. In addition, six new smaller clusters are identified from the symptoms clustering, each characterized by clinically-relevant non-motor symptoms. The clusters identified in this study present statistical confirmation of the increasingly important role of non-motor symptoms (NMS) in Parkinson's disease heterogeneity and take steps toward subtype-specific treatment packages.

  7. Developing a Clustering-Based Empirical Bayes Analysis Method for Hotspot Identification

    Directory of Open Access Journals (Sweden)

    Yajie Zou

    2017-01-01

    Full Text Available Hotspot identification (HSID is a critical part of network-wide safety evaluations. Typical methods for ranking sites are often rooted in using the Empirical Bayes (EB method to estimate safety from both observed crash records and predicted crash frequency based on similar sites. The performance of the EB method is highly related to the selection of a reference group of sites (i.e., roadway segments or intersections similar to the target site from which safety performance functions (SPF used to predict crash frequency will be developed. As crash data often contain underlying heterogeneity that, in essence, can make them appear to be generated from distinct subpopulations, methods are needed to select similar sites in a principled manner. To overcome this possible heterogeneity problem, EB-based HSID methods that use common clustering methodologies (e.g., mixture models, K-means, and hierarchical clustering to select “similar” sites for building SPFs are developed. Performance of the clustering-based EB methods is then compared using real crash data. Here, HSID results, when computed on Texas undivided rural highway cash data, suggest that all three clustering-based EB analysis methods are preferred over the conventional statistical methods. Thus, properly classifying the road segments for heterogeneous crash data can further improve HSID accuracy.

  8. Application of Clustering Techniques for Lung Sounds to Improve Interpretability and Detection of Crackles

    Directory of Open Access Journals (Sweden)

    Germán D. Sosa

    2015-01-01

    Full Text Available Due to the subjectivity involved currently in pulmonary auscultation process and its diagnostic to evaluate the condition of respiratory airways, this work pretends to evaluate the performance of clustering algorithms such as k-means and DBSCAN to perform a computational analysis of lung sounds aiming to visualize a representation of such sounds that highlights the presence of crackles and the energy associated with them. In order to achieve that goal, Wavelet analysis techniques were used in contrast to traditional frequency analysis given the similarity between the typical waveform for a crackle and the wavelet sym4. Once the lung sound signal with isolated crackles is obtained, the clustering process groups crackles in regions of high density and provides visualization that might be useful for the diagnostic made by an expert. Evaluation suggests that k-means groups crackle more effective than DBSCAN in terms of generated clusters.

  9. Clustering Trajectories by Relevant Parts for Air Traffic Analysis.

    Science.gov (United States)

    Andrienko, Gennady; Andrienko, Natalia; Fuchs, Georg; Garcia, Jose Manuel Cordero

    2018-01-01

    Clustering of trajectories of moving objects by similarity is an important technique in movement analysis. Existing distance functions assess the similarity between trajectories based on properties of the trajectory points or segments. The properties may include the spatial positions, times, and thematic attributes. There may be a need to focus the analysis on certain parts of trajectories, i.e., points and segments that have particular properties. According to the analysis focus, the analyst may need to cluster trajectories by similarity of their relevant parts only. Throughout the analysis process, the focus may change, and different parts of trajectories may become relevant. We propose an analytical workflow in which interactive filtering tools are used to attach relevance flags to elements of trajectories, clustering is done using a distance function that ignores irrelevant elements, and the resulting clusters are summarized for further analysis. We demonstrate how this workflow can be useful for different analysis tasks in three case studies with real data from the domain of air traffic. We propose a suite of generic techniques and visualization guidelines to support movement data analysis by means of relevance-aware trajectory clustering.

  10. Analyzing patients' values by applying cluster analysis and LRFM model in a pediatric dental clinic in Taiwan.

    Science.gov (United States)

    Wu, Hsin-Hung; Lin, Shih-Yen; Liu, Chih-Wei

    2014-01-01

    This study combines cluster analysis and LRFM (length, recency, frequency, and monetary) model in a pediatric dental clinic in Taiwan to analyze patients' values. A two-stage approach by self-organizing maps and K-means method is applied to segment 1,462 patients into twelve clusters. The average values of L, R, and F excluding monetary covered by national health insurance program are computed for each cluster. In addition, customer value matrix is used to analyze customer values of twelve clusters in terms of frequency and monetary. Customer relationship matrix considering length and recency is also applied to classify different types of customers from these twelve clusters. The results show that three clusters can be classified into loyal patients with L, R, and F values greater than the respective average L, R, and F values, while three clusters can be viewed as lost patients without any variable above the average values of L, R, and F. When different types of patients are identified, marketing strategies can be designed to meet different patients' needs.

  11. Analyzing Patients' Values by Applying Cluster Analysis and LRFM Model in a Pediatric Dental Clinic in Taiwan

    Science.gov (United States)

    Lin, Shih-Yen; Liu, Chih-Wei

    2014-01-01

    This study combines cluster analysis and LRFM (length, recency, frequency, and monetary) model in a pediatric dental clinic in Taiwan to analyze patients' values. A two-stage approach by self-organizing maps and K-means method is applied to segment 1,462 patients into twelve clusters. The average values of L, R, and F excluding monetary covered by national health insurance program are computed for each cluster. In addition, customer value matrix is used to analyze customer values of twelve clusters in terms of frequency and monetary. Customer relationship matrix considering length and recency is also applied to classify different types of customers from these twelve clusters. The results show that three clusters can be classified into loyal patients with L, R, and F values greater than the respective average L, R, and F values, while three clusters can be viewed as lost patients without any variable above the average values of L, R, and F. When different types of patients are identified, marketing strategies can be designed to meet different patients' needs. PMID:25045741

  12. Analyzing Patients’ Values by Applying Cluster Analysis and LRFM Model in a Pediatric Dental Clinic in Taiwan

    Directory of Open Access Journals (Sweden)

    Hsin-Hung Wu

    2014-01-01

    Full Text Available This study combines cluster analysis and LRFM (length, recency, frequency, and monetary model in a pediatric dental clinic in Taiwan to analyze patients’ values. A two-stage approach by self-organizing maps and K-means method is applied to segment 1,462 patients into twelve clusters. The average values of L, R, and F excluding monetary covered by national health insurance program are computed for each cluster. In addition, customer value matrix is used to analyze customer values of twelve clusters in terms of frequency and monetary. Customer relationship matrix considering length and recency is also applied to classify different types of customers from these twelve clusters. The results show that three clusters can be classified into loyal patients with L, R, and F values greater than the respective average L, R, and F values, while three clusters can be viewed as lost patients without any variable above the average values of L, R, and F. When different types of patients are identified, marketing strategies can be designed to meet different patients’ needs.

  13. Gas load forecasting based on optimized fuzzy c-mean clustering analysis of selecting similar days

    Directory of Open Access Journals (Sweden)

    Qiu Jing

    2017-08-01

    Full Text Available Traditional fuzzy c-means (FCM clustering in short term load forecasting method is easy to fall into local optimum and is sensitive to the initial cluster center.In this paper,we propose to use global search feature of particle swarm optimization (PSO algorithm to avoid these shortcomings,and to use FCM optimization to select similar date of forecast as training sample of support vector machines.This will not only strengthen the data rule of training samples,but also ensure the consistency of data characteristics.Experimental results show that the prediction accuracy of this prediction model is better than that of BP neural network and support vector machine (SVM algorithms.

  14. Prediction of Tibial Rotation Pathologies Using Particle Swarm Optimization and K-Means Algorithms.

    Science.gov (United States)

    Sari, Murat; Tuna, Can; Akogul, Serkan

    2018-03-28

    The aim of this article is to investigate pathological subjects from a population through different physical factors. To achieve this, particle swarm optimization (PSO) and K-means (KM) clustering algorithms have been combined (PSO-KM). Datasets provided by the literature were divided into three clusters based on age and weight parameters and each one of right tibial external rotation (RTER), right tibial internal rotation (RTIR), left tibial external rotation (LTER), and left tibial internal rotation (LTIR) values were divided into three types as Type 1, Type 2 and Type 3 (Type 2 is non-pathological (normal) and the other two types are pathological (abnormal)), respectively. The rotation values of every subject in any cluster were noted. Then the algorithm was run and the produced values were also considered. The values of the produced algorithm, the PSO-KM, have been compared with the real values. The hybrid PSO-KM algorithm has been very successful on the optimal clustering of the tibial rotation types through the physical criteria. In this investigation, Type 2 (pathological subjects) is of especially high predictability and the PSO-KM algorithm has been very successful as an operation system for clustering and optimizing the tibial motion data assessments. These research findings are expected to be very useful for health providers, such as physiotherapists, orthopedists, and so on, in which this consequence may help clinicians to appropriately designing proper treatment schedules for patients.

  15. Cluster Oriented Spatio Temporal Multidimensional Data Visualization of Earthquakes in Indonesia

    Directory of Open Access Journals (Sweden)

    Mohammad Nur Shodiq

    2016-03-01

    Full Text Available Spatio temporal data clustering is challenge task. The result of clustering data are utilized to investigate the seismic parameters. Seismic parameters are used to describe the characteristics of earthquake behavior. One of the effective technique to study multidimensional spatio temporal data is visualization. But, visualization of multidimensional data is complicated problem. Because, this analysis consists of observed data cluster and seismic parameters. In this paper, we propose a visualization system, called as IES (Indonesia Earthquake System, for cluster analysis, spatio temporal analysis, and visualize the multidimensional data of seismic parameters. We analyze the cluster analysis by using automatic clustering, that consists of get optimal number of cluster and Hierarchical K-means clustering. We explore the visual cluster and multidimensional data in low dimensional space visualization. We made experiment with observed data, that consists of seismic data around Indonesian archipelago during 2004 to 2014. Keywords: Clustering, visualization, multidimensional data, seismic parameters.

  16. Performance Evaluation of Spectral Clustering Algorithm using Various Clustering Validity Indices

    OpenAIRE

    M. T. Somashekara; D. Manjunatha

    2014-01-01

    In spite of the popularity of spectral clustering algorithm, the evaluation procedures are still in developmental stage. In this article, we have taken benchmarking IRIS dataset for performing comparative study of twelve indices for evaluating spectral clustering algorithm. The results of the spectral clustering technique were also compared with k-mean algorithm. The validity of the indices was also verified with accuracy and (Normalized Mutual Information) NMI score. Spectral clustering algo...

  17. Marketing research cluster analysis

    OpenAIRE

    Marić Nebojša

    2002-01-01

    One area of applications of cluster analysis in marketing is identification of groups of cities and towns with similar demographic profiles. This paper considers main aspects of cluster analysis by an example of clustering 12 cities with the use of Minitab software.

  18. Membership, binarity, and rotation of F-G-K stars in the open cluster Blanco 1

    Science.gov (United States)

    Mermilliod, J.-C.; Platais, I.; James, D. J.; Grenon, M.; Cargile, P. A.

    2008-07-01

    Context: The nearby open cluster Blanco 1 is of considerable astrophysical interest for formation and evolution studies of open clusters because it is the third highest Galactic latitude cluster known. It has been observed often, but so far no definitive and comprehensive membership determination is readily available. Aims: An observing programme was carried out to study the stellar population of Blanco 1, and especially the membership and binary frequency of the F5-K0 dwarfs. Methods: We obtained radial-velocities with the CORAVEL spectrograph in the field of Blanco 1 for a sample of 148 F-G-K candidate stars in the magnitude range 10 rate reaches 40% (27/68) if one includes the photometric binaries. The cluster mean heliocentric radial velocity is +5.53 ± 0.11 km s-1 based on the most reliable 49 members. The V sin i distribution is similar to that of the Pleiades, confirming the age similarities between the two clusters. Conclusions: This study clearly demonstrates that, in spite of the cluster's high Galactic latitude, three membership criteria - radial velocity, proper motion, and photometry - are necessary for performing a reliable membership selection. Furthermore, even with accurate and extensive data, ambiguous cases still remain. Based on observations collected with the Danish 1.54-m and the Swiss telescopes at the European Southern Observatory, La Silla, Chile, and with the old YALO 1-m telescope at the Cerro Tololo InterAmerican Observatory, Chile. Table [see full textsee full textsee full textsee full textsee full textsee full text] is also available in electronic form at the CDS via anonymous ftp to cdsarc.u-strasbg.fr (130.79.128.5) or via http://cdsweb.u-strasbg.fr/cgi-bin/qcat?J/A+A/485/95

  19. Development and optimization of SPECT gated blood pool cluster analysis for the prediction of CRT outcome

    Energy Technology Data Exchange (ETDEWEB)

    Lalonde, Michel, E-mail: mlalonde15@rogers.com; Wassenaar, Richard [Department of Physics, Carleton University, Ottawa, Ontario K1S 5B6 (Canada); Wells, R. Glenn; Birnie, David; Ruddy, Terrence D. [Division of Cardiology, University of Ottawa Heart Institute, Ottawa, Ontario K1Y 4W7 (Canada)

    2014-07-15

    Purpose: Phase analysis of single photon emission computed tomography (SPECT) radionuclide angiography (RNA) has been investigated for its potential to predict the outcome of cardiac resynchronization therapy (CRT). However, phase analysis may be limited in its potential at predicting CRT outcome as valuable information may be lost by assuming that time-activity curves (TAC) follow a simple sinusoidal shape. A new method, cluster analysis, is proposed which directly evaluates the TACs and may lead to a better understanding of dyssynchrony patterns and CRT outcome. Cluster analysis algorithms were developed and optimized to maximize their ability to predict CRT response. Methods: About 49 patients (N = 27 ischemic etiology) received a SPECT RNA scan as well as positron emission tomography (PET) perfusion and viability scans prior to undergoing CRT. A semiautomated algorithm sampled the left ventricle wall to produce 568 TACs from SPECT RNA data. The TACs were then subjected to two different cluster analysis techniques, K-means, and normal average, where several input metrics were also varied to determine the optimal settings for the prediction of CRT outcome. Each TAC was assigned to a cluster group based on the comparison criteria and global and segmental cluster size and scores were used as measures of dyssynchrony and used to predict response to CRT. A repeated random twofold cross-validation technique was used to train and validate the cluster algorithm. Receiver operating characteristic (ROC) analysis was used to calculate the area under the curve (AUC) and compare results to those obtained for SPECT RNA phase analysis and PET scar size analysis methods. Results: Using the normal average cluster analysis approach, the septal wall produced statistically significant results for predicting CRT results in the ischemic population (ROC AUC = 0.73;p < 0.05 vs. equal chance ROC AUC = 0.50) with an optimal operating point of 71% sensitivity and 60% specificity. Cluster

  20. Development and optimization of SPECT gated blood pool cluster analysis for the prediction of CRT outcome

    International Nuclear Information System (INIS)

    Lalonde, Michel; Wassenaar, Richard; Wells, R. Glenn; Birnie, David; Ruddy, Terrence D.

    2014-01-01

    Purpose: Phase analysis of single photon emission computed tomography (SPECT) radionuclide angiography (RNA) has been investigated for its potential to predict the outcome of cardiac resynchronization therapy (CRT). However, phase analysis may be limited in its potential at predicting CRT outcome as valuable information may be lost by assuming that time-activity curves (TAC) follow a simple sinusoidal shape. A new method, cluster analysis, is proposed which directly evaluates the TACs and may lead to a better understanding of dyssynchrony patterns and CRT outcome. Cluster analysis algorithms were developed and optimized to maximize their ability to predict CRT response. Methods: About 49 patients (N = 27 ischemic etiology) received a SPECT RNA scan as well as positron emission tomography (PET) perfusion and viability scans prior to undergoing CRT. A semiautomated algorithm sampled the left ventricle wall to produce 568 TACs from SPECT RNA data. The TACs were then subjected to two different cluster analysis techniques, K-means, and normal average, where several input metrics were also varied to determine the optimal settings for the prediction of CRT outcome. Each TAC was assigned to a cluster group based on the comparison criteria and global and segmental cluster size and scores were used as measures of dyssynchrony and used to predict response to CRT. A repeated random twofold cross-validation technique was used to train and validate the cluster algorithm. Receiver operating characteristic (ROC) analysis was used to calculate the area under the curve (AUC) and compare results to those obtained for SPECT RNA phase analysis and PET scar size analysis methods. Results: Using the normal average cluster analysis approach, the septal wall produced statistically significant results for predicting CRT results in the ischemic population (ROC AUC = 0.73;p < 0.05 vs. equal chance ROC AUC = 0.50) with an optimal operating point of 71% sensitivity and 60% specificity. Cluster

  1. Hardness of Clustering

    Indian Academy of Sciences (India)

    First page Back Continue Last page Overview Graphics. Hardness of Clustering. Both k-means and k-medians intractable (when n and d are both inputs even for k =2). The best known deterministic algorithms. are based on Voronoi partitioning that. takes about time. Need for approximation – “close” to optimal.

  2. A Location-Aware Service Deployment Algorithm Based on K-Means for Cloudlets

    Directory of Open Access Journals (Sweden)

    Tyng-Yeu Liang

    2017-01-01

    Full Text Available Cloudlet recently was proposed to push data centers towards network edges for reducing the network latency of delivering cloud services to mobile devices. For the sake of user mobility, it is necessary to deploy and hand off services anytime anywhere for achieving the minimal network latency for users’ service requests. However, the cost of this solution usually is too high for service providers and is not effective for resource exploitation. To resolve this problem, we propose a location-aware service deployment algorithm based on K-means for cloudlets in this paper. Simply speaking, the proposed algorithm divides mobile devices into a number of device clusters according to the geographical location of mobile devices and then deploys service instances onto the edge cloud servers nearest to the centers of device clusters. Our performance evaluation has shown that the proposed algorithm can effectively reduce not only the network latency of edge cloud services but also the number of service instances used for satisfying the condition of tolerable network latency.

  3. Clustering for data mining a data recovery approach

    CERN Document Server

    Mirkin, Boris

    2005-01-01

    Often considered more as an art than a science, the field of clustering has been dominated by learning through examples and by techniques chosen almost through trial-and-error. Even the most popular clustering methods--K-Means for partitioning the data set and Ward's method for hierarchical clustering--have lacked the theoretical attention that would establish a firm relationship between the two methods and relevant interpretation aids.Rather than the traditional set of ad hoc techniques, Clustering for Data Mining: A Data Recovery Approach presents a theory that not only closes gaps in K-Mean

  4. Marketing research cluster analysis

    Directory of Open Access Journals (Sweden)

    Marić Nebojša

    2002-01-01

    Full Text Available One area of applications of cluster analysis in marketing is identification of groups of cities and towns with similar demographic profiles. This paper considers main aspects of cluster analysis by an example of clustering 12 cities with the use of Minitab software.

  5. Model selection for semiparametric marginal mean regression accounting for within-cluster subsampling variability and informative cluster size.

    Science.gov (United States)

    Shen, Chung-Wei; Chen, Yi-Hau

    2018-03-13

    We propose a model selection criterion for semiparametric marginal mean regression based on generalized estimating equations. The work is motivated by a longitudinal study on the physical frailty outcome in the elderly, where the cluster size, that is, the number of the observed outcomes in each subject, is "informative" in the sense that it is related to the frailty outcome itself. The new proposal, called Resampling Cluster Information Criterion (RCIC), is based on the resampling idea utilized in the within-cluster resampling method (Hoffman, Sen, and Weinberg, 2001, Biometrika 88, 1121-1134) and accommodates informative cluster size. The implementation of RCIC, however, is free of performing actual resampling of the data and hence is computationally convenient. Compared with the existing model selection methods for marginal mean regression, the RCIC method incorporates an additional component accounting for variability of the model over within-cluster subsampling, and leads to remarkable improvements in selecting the correct model, regardless of whether the cluster size is informative or not. Applying the RCIC method to the longitudinal frailty study, we identify being female, old age, low income and life satisfaction, and chronic health conditions as significant risk factors for physical frailty in the elderly. © 2018, The International Biometric Society.

  6. Variable selection based on clustering analysis for improvement of polyphenols prediction in green tea using synchronous fluorescence spectra

    Science.gov (United States)

    Shan, Jiajia; Wang, Xue; Zhou, Hao; Han, Shuqing; Riza, Dimas Firmanda Al; Kondo, Naoshi

    2018-04-01

    Synchronous fluorescence spectra, combined with multivariate analysis were used to predict flavonoids content in green tea rapidly and nondestructively. This paper presented a new and efficient spectral intervals selection method called clustering based partial least square (CL-PLS), which selected informative wavelengths by combining clustering concept and partial least square (PLS) methods to improve models’ performance by synchronous fluorescence spectra. The fluorescence spectra of tea samples were obtained and k-means and kohonen-self organizing map clustering algorithms were carried out to cluster full spectra into several clusters, and sub-PLS regression model was developed on each cluster. Finally, CL-PLS models consisting of gradually selected clusters were built. Correlation coefficient (R) was used to evaluate the effect on prediction performance of PLS models. In addition, variable influence on projection partial least square (VIP-PLS), selectivity ratio partial least square (SR-PLS), interval partial least square (iPLS) models and full spectra PLS model were investigated and the results were compared. The results showed that CL-PLS presented the best result for flavonoids prediction using synchronous fluorescence spectra.

  7. Comprehensive cluster analysis with Transitivity Clustering.

    Science.gov (United States)

    Wittkop, Tobias; Emig, Dorothea; Truss, Anke; Albrecht, Mario; Böcker, Sebastian; Baumbach, Jan

    2011-03-01

    Transitivity Clustering is a method for the partitioning of biological data into groups of similar objects, such as genes, for instance. It provides integrated access to various functions addressing each step of a typical cluster analysis. To facilitate this, Transitivity Clustering is accessible online and offers three user-friendly interfaces: a powerful stand-alone version, a web interface, and a collection of Cytoscape plug-ins. In this paper, we describe three major workflows: (i) protein (super)family detection with Cytoscape, (ii) protein homology detection with incomplete gold standards and (iii) clustering of gene expression data. This protocol guides the user through the most important features of Transitivity Clustering and takes ∼1 h to complete.

  8. Glutamic acid promotes monacolin K production and monacolin K biosynthetic gene cluster expression in Monascus.

    Science.gov (United States)

    Zhang, Chan; Liang, Jian; Yang, Le; Chai, Shiyuan; Zhang, Chenxi; Sun, Baoguo; Wang, Chengtao

    2017-12-01

    This study investigated the effects of glutamic acid on production of monacolin K and expression of the monacolin K biosynthetic gene cluster. When Monascus M1 was grown in glutamic medium instead of in the original medium, monacolin K production increased from 48.4 to 215.4 mg l -1 , monacolin K production increased by 3.5 times. Glutamic acid enhanced monacolin K production by upregulating the expression of mokB-mokI; on day 8, the expression level of mokA tended to decrease by Reverse Transcription-polymerase Chain Reaction. Our findings demonstrated that mokA was not a key gene responsible for the quantity of monacolin K production in the presence of glutamic acid. Observation of Monascus mycelium morphology using Scanning Electron Microscope showed glutamic acid significantly increased the content of Monascus mycelium, altered the permeability of Monascus mycelium, enhanced secretion of monacolin K from the cell, and reduced the monacolin K content in Monascus mycelium, thereby enhancing monacolin K production.

  9. A Link-Based Cluster Ensemble Approach For Improved Gene Expression Data Analysis

    Directory of Open Access Journals (Sweden)

    P.Balaji

    2015-01-01

    Full Text Available Abstract It is difficult from possibilities to select a most suitable effective way of clustering algorithm and its dataset for a defined set of gene expression data because we have a huge number of ways and huge number of gene expressions. At present many researchers are preferring to use hierarchical clustering in different forms this is no more totally optimal. Cluster ensemble research can solve this type of problem by automatically merging multiple data partitions from a wide range of different clusterings of any dimensions to improve both the quality and robustness of the clustering result. But we have many existing ensemble approaches using an association matrix to condense sample-cluster and co-occurrence statistics and relations within the ensemble are encapsulated only at raw level while the existing among clusters are totally discriminated. Finding these missing associations can greatly expand the capability of those ensemble methodologies for microarray data clustering. We propose general K-means cluster ensemble approach for the clustering of general categorical data into required number of partitions.

  10. A fuzzy clustering technique for calorimetric data reconstruction

    International Nuclear Information System (INIS)

    Sandhir, Radha Pyari; Muhuri, Sanjib; Nayak, Tapan K.

    2010-01-01

    In high energy physics experiments, electromagnetic calorimeters are used to measure shower particles produced in p-p or heavy-ion collisions. In order to extract information and reconstruct the characteristics of the various incoming particles, clustering is required to be performed on each of the calorimeter planes. Hard clustering techniques such as Local Maxima Search, Connected-cell Search and K-means Clustering simply assign a data point to a cluster. A data point either lies in a cluster or it does not, and so, overlapping clusters are hardly distinguishable. Fuzzy c-means clustering is a version of the k-means algorithm that incorporates fuzzy logic, so that each point has a weak or strong association to the cluster, determined by the inverse distance to the center of the cluster. The term fuzzy is used because an observation may in fact lie in more than one cluster simultaneously, though to different degrees called 'memberships', as is the case with many high energy physics applications. The centers obtained using the FCM algorithm are based on the geometric locations of the data points

  11. Study on text mining algorithm for ultrasound examination of chronic liver diseases based on spectral clustering

    Science.gov (United States)

    Chang, Bingguo; Chen, Xiaofei

    2018-05-01

    Ultrasonography is an important examination for the diagnosis of chronic liver disease. The doctor gives the liver indicators and suggests the patient's condition according to the description of ultrasound report. With the rapid increase in the amount of data of ultrasound report, the workload of professional physician to manually distinguish ultrasound results significantly increases. In this paper, we use the spectral clustering method to cluster analysis of the description of the ultrasound report, and automatically generate the ultrasonic diagnostic diagnosis by machine learning. 110 groups ultrasound examination report of chronic liver disease were selected as test samples in this experiment, and the results were validated by spectral clustering and compared with k-means clustering algorithm. The results show that the accuracy of spectral clustering is 92.73%, which is higher than that of k-means clustering algorithm, which provides a powerful ultrasound-assisted diagnosis for patients with chronic liver disease.

  12. Bayesian Simultaneous Estimation for Means in k Sample Problems

    OpenAIRE

    Imai, Ryo; Kubokawa, Tatsuya; Ghosh, Malay

    2017-01-01

    This paper is concerned with the simultaneous estimation of k population means when one suspects that the k means are nearly equal. As an alternative to the preliminary test estimator based on the test statistics for testing hypothesis of equal means, we derive Bayesian and minimax estimators which shrink individual sample means toward a pooled mean estimator given under the hypothesis. Interestingly, it is shown that both the preliminary test estimator and the Bayesian minimax shrinkage esti...

  13. Cluster Analysis on Longitudinal Data of Patients with Adult-Onset Asthma.

    Science.gov (United States)

    Ilmarinen, Pinja; Tuomisto, Leena E; Niemelä, Onni; Tommola, Minna; Haanpää, Jussi; Kankaanranta, Hannu

    Previous cluster analyses on asthma are based on cross-sectional data. To identify phenotypes of adult-onset asthma by using data from baseline (diagnostic) and 12-year follow-up visits. The Seinäjoki Adult Asthma Study is a 12-year follow-up study of patients with new-onset adult asthma. K-means cluster analysis was performed by using variables from baseline and follow-up visits on 171 patients to identify phenotypes. Five clusters were identified. Patients in cluster 1 (n = 38) were predominantly nonatopic males with moderate smoking history at baseline. At follow-up, 40% of these patients had developed persistent obstruction but the number of patients with uncontrolled asthma (5%) and rhinitis (10%) was the lowest. Cluster 2 (n = 19) was characterized by older men with heavy smoking history, poor lung function, and persistent obstruction at baseline. At follow-up, these patients were mostly uncontrolled (84%) despite daily use of inhaled corticosteroid (ICS) with add-on therapy. Cluster 3 (n = 50) consisted mostly of nonsmoking females with good lung function at diagnosis/follow-up and well-controlled/partially controlled asthma at follow-up. Cluster 4 (n = 25) had obese and symptomatic patients at baseline/follow-up. At follow-up, these patients had several comorbidities (40% psychiatric disease) and were treated daily with ICS and add-on therapy. Patients in cluster 5 (n = 39) were mostly atopic and had the earliest onset of asthma, the highest blood eosinophils, and FEV 1 reversibility at diagnosis. At follow-up, these patients used the lowest ICS dose but 56% were well controlled. Results can be used to predict outcomes of patients with adult-onset asthma and to aid in development of personalized therapy (NCT02733016 at ClinicalTrials.gov). Copyright © 2017 The Authors. Published by Elsevier Inc. All rights reserved.

  14. Cluster Analysis-Based Approaches for Geospatiotemporal Data Mining of Massive Data Sets for Identification of Forest Threats

    Energy Technology Data Exchange (ETDEWEB)

    Mills, Richard T [ORNL; Hoffman, Forrest M [ORNL; Kumar, Jitendra [ORNL; HargroveJr., William Walter [USDA Forest Service

    2011-01-01

    We investigate methods for geospatiotemporal data mining of multi-year land surface phenology data (250 m2 Normalized Difference Vegetation Index (NDVI) values derived from the Moderate Resolution Imaging Spectrometer (MODIS) in this study) for the conterminous United States (CONUS) as part of an early warning system for detecting threats to forest ecosystems. The approaches explored here are based on k-means cluster analysis of this massive data set, which provides a basis for defining the bounds of the expected or normal phenological patterns that indicate healthy vegetation at a given geographic location. We briefly describe the computational approaches we have used to make cluster analysis of such massive data sets feasible, describe approaches we have explored for distinguishing between normal and abnormal phenology, and present some examples in which we have applied these approaches to identify various forest disturbances in the CONUS.

  15. Pendekatan Clustering untuk Ekstraksi Pengetahuan pada Pembangunan Sistem Manajemen Pengetahuan

    Directory of Open Access Journals (Sweden)

    Dwinta Rahmallah Pulukadang

    2016-01-01

    Full Text Available The importance of knowledge management in any organization encourages the development of a knowledge management system with features that can facilitate knowledge management processes such as storing, organizing, filtering, searching, and most important is the transfer of knowledge. The purpose of this researchis to develop a knowledge management system with a clustering approach for knowledge extraction by using a knowledge of publication writing. This study uses clustering k-means method which is used for cluster knowledge feature where at the same time can help the process of organizing, filtering, browsing and searching knowledge. The results of this research showed that the clustering k-means can be used for knowledge management system with the best value of purity= 0,8454 which is found by using k = 20. Clustering approach in the system again can help the process for knowledge searching based on knowledge cluster. This can be proved by carried out 15 times experiments which result in average level of accuracy (precision about 89.13% and the average rate of completeness (recallabout 85.73 %.   Keywords : Knowledge Management System; Extraction; K-means; Clustering

  16. Fuzzy C-Means Clustering Model Data Mining For Recognizing Stock Data Sampling Pattern

    Directory of Open Access Journals (Sweden)

    Sylvia Jane Annatje Sumarauw

    2007-06-01

    Full Text Available Abstract Capital market has been beneficial to companies and investor. For investors, the capital market provides two economical advantages, namely deviden and capital gain, and a non-economical one that is a voting .} hare in Shareholders General Meeting. But, it can also penalize the share owners. In order to prevent them from the risk, the investors should predict the prospect of their companies. As a consequence of having an abstract commodity, the share quality will be determined by the validity of their company profile information. Any information of stock value fluctuation from Jakarta Stock Exchange can be a useful consideration and a good measurement for data analysis. In the context of preventing the shareholders from the risk, this research focuses on stock data sample category or stock data sample pattern by using Fuzzy c-Me, MS Clustering Model which providing any useful information jar the investors. lite research analyses stock data such as Individual Index, Volume and Amount on Property and Real Estate Emitter Group at Jakarta Stock Exchange from January 1 till December 31 of 204. 'he mining process follows Cross Industry Standard Process model for Data Mining (CRISP,. DM in the form of circle with these steps: Business Understanding, Data Understanding, Data Preparation, Modelling, Evaluation and Deployment. At this modelling process, the Fuzzy c-Means Clustering Model will be applied. Data Mining Fuzzy c-Means Clustering Model can analyze stock data in a big database with many complex variables especially for finding the data sample pattern, and then building Fuzzy Inference System for stimulating inputs to be outputs that based on Fuzzy Logic by recognising the pattern. Keywords: Data Mining, AUz..:y c-Means Clustering Model, Pattern Recognition

  17. Cluster Mean-Field Approach to the Steady-State Phase Diagram of Dissipative Spin Systems

    Directory of Open Access Journals (Sweden)

    Jiasen Jin

    2016-07-01

    Full Text Available We show that short-range correlations have a dramatic impact on the steady-state phase diagram of quantum driven-dissipative systems. This effect, never observed in equilibrium, follows from the fact that ordering in the steady state is of dynamical origin, and is established only at very long times, whereas in thermodynamic equilibrium it arises from the properties of the (free energy. To this end, by combining the cluster methods extensively used in equilibrium phase transitions to quantum trajectories and tensor-network techniques, we extend them to nonequilibrium phase transitions in dissipative many-body systems. We analyze in detail a model of spin-1/2 on a lattice interacting through an XYZ Hamiltonian, each of them coupled to an independent environment that induces incoherent spin flips. In the steady-state phase diagram derived from our cluster approach, the location of the phase boundaries and even its topology radically change, introducing reentrance of the paramagnetic phase as compared to the single-site mean field where correlations are neglected. Furthermore, a stability analysis of the cluster mean field indicates a susceptibility towards a possible incommensurate ordering, not present if short-range correlations are ignored.

  18. Clustering Professional Basketball Players by Performance

    OpenAIRE

    Patel, Riki

    2017-01-01

    Basketball players are traditionally grouped into five distinct positions, but these designationsare quickly becoming outdated. We attempt to reclassify players into new groupsbased on personal performance in the 2016-2017 NBA regular season. Two dimensionalityreduction techniques, t-Distributed Stochastic Neighbor Embedding (t-SNE) and principalcomponent analysis (PCA), were employed to reduce 18 classic metrics down to two dimensionsfor visualization. k-means clustering discovered four grou...

  19. Synthesis and crystal structure of new K and Rb selenido/tellurido ferrate cluster compounds

    Energy Technology Data Exchange (ETDEWEB)

    Stueble, Pirmin; Berroth, Angela; Roehr, Caroline [Freiburg Univ. (Germany). Inst. fuer Anorganische und Analytische Chemie

    2016-08-01

    In the course of a systematic study of alkali iron chalcogenido salts containing clusters [Fe{sub 4}Q{sub 8}] a series of new mixed-valent potassium and rubidium selenido and tellurido ferrates(II/III) was synthesized by carefully heating the pure elements enclosed in sample tubes under an argon atmosphere up to maximum temperatures of 800-900 C. Their crystal structures have been determined by means of single crystal X-ray diffraction. The mixed-valent Fe{sup II/III} tellurido ferrates A{sub 7}[Fe{sub 4}Te{sub 8}] form three different structure types. All structures contain tetramers of four edge sharing [FeTe{sub 4}] tetrahedra, which are connected by common edges to form only slightly distorted tetrahedral [Fe{sub 4}Te{sub 8}]{sup 7-} anions ('stella quadrangula') with a [Fe{sub 4}Te{sub 4}] cubane core. In all cases, these anions are surrounded by 26 alkali cations, which are located at the eight corners and the midpoints of the six faces and 12 edges of a cube. The three crystal structures can thus be described by three different packings of cuboid moieties: The monoclinic rubidium compound Rb{sub 7}[Fe{sub 4}Te{sub 8}] (space group C2/c, a = 2000.16(7), b = 897.79(3), c = 1768.12(6) pm, β = 117.4995(10) , Z = 4, R1 = 0.0296) is isotypic to the known cesium tellurido and sulfido ferrates Cs{sub 7}[Fe{sub 4}(S/Te){sub 8}]. Depending on the temperature, K{sub 7}[Fe{sub 4}Te{sub 8}] forms two different but closely related new structure types: The tetragonal r.t. modification (space group P4{sub 2}/nmc, a = 1222.25(14), c = 872.1(2) pm, Z = 2, R1 = 0.0583) crystallizes in a supergroup of the orthorhombic l.t. (100 K) form (space group Pbcn, a = 1715.5, b = 866.76(3), c = 1715.50(7) pm, Z = 4, R1 = 0.0160). In all structures, the cluster centered cubes are stacked to form columns along the short (∼ 870 pm) axis. These columns are themselves densely packed with 4 (both K compounds) and 6 (A = Rb) adjacent face-sharing columns. According to these

  20. Comparison of cluster and principal component analysis techniques to derive dietary patterns in Irish adults.

    Science.gov (United States)

    Hearty, Aine P; Gibney, Michael J

    2009-02-01

    The aims of the present study were to examine and compare dietary patterns in adults using cluster and factor analyses and to examine the format of the dietary variables on the pattern solutions (i.e. expressed as grams/day (g/d) of each food group or as the percentage contribution to total energy intake). Food intake data were derived from the North/South Ireland Food Consumption Survey 1997-9, which was a randomised cross-sectional study of 7 d recorded food and nutrient intakes of a representative sample of 1379 Irish adults aged 18-64 years. Cluster analysis was performed using the k-means algorithm and principal component analysis (PCA) was used to extract dietary factors. Food data were reduced to thirty-three food groups. For cluster analysis, the most suitable format of the food-group variable was found to be the percentage contribution to energy intake, which produced six clusters: 'Traditional Irish'; 'Continental'; 'Unhealthy foods'; 'Light-meal foods & low-fat milk'; 'Healthy foods'; 'Wholemeal bread & desserts'. For PCA, food groups in the format of g/d were found to be the most suitable format, and this revealed four dietary patterns: 'Unhealthy foods & high alcohol'; 'Traditional Irish'; 'Healthy foods'; 'Sweet convenience foods & low alcohol'. In summary, cluster and PCA identified similar dietary patterns when presented with the same dataset. However, the two dietary pattern methods required a different format of the food-group variable, and the most appropriate format of the input variable should be considered in future studies.

  1. A roadmap of clustering algorithms: finding a match for a biomedical application.

    Science.gov (United States)

    Andreopoulos, Bill; An, Aijun; Wang, Xiaogang; Schroeder, Michael

    2009-05-01

    Clustering is ubiquitously applied in bioinformatics with hierarchical clustering and k-means partitioning being the most popular methods. Numerous improvements of these two clustering methods have been introduced, as well as completely different approaches such as grid-based, density-based and model-based clustering. For improved bioinformatics analysis of data, it is important to match clusterings to the requirements of a biomedical application. In this article, we present a set of desirable clustering features that are used as evaluation criteria for clustering algorithms. We review 40 different clustering algorithms of all approaches and datatypes. We compare algorithms on the basis of desirable clustering features, and outline algorithms' benefits and drawbacks as a basis for matching them to biomedical applications.

  2. Quantum annealing for combinatorial clustering

    Science.gov (United States)

    Kumar, Vaibhaw; Bass, Gideon; Tomlin, Casey; Dulny, Joseph

    2018-02-01

    Clustering is a powerful machine learning technique that groups "similar" data points based on their characteristics. Many clustering algorithms work by approximating the minimization of an objective function, namely the sum of within-the-cluster distances between points. The straightforward approach involves examining all the possible assignments of points to each of the clusters. This approach guarantees the solution will be a global minimum; however, the number of possible assignments scales quickly with the number of data points and becomes computationally intractable even for very small datasets. In order to circumvent this issue, cost function minima are found using popular local search-based heuristic approaches such as k-means and hierarchical clustering. Due to their greedy nature, such techniques do not guarantee that a global minimum will be found and can lead to sub-optimal clustering assignments. Other classes of global search-based techniques, such as simulated annealing, tabu search, and genetic algorithms, may offer better quality results but can be too time-consuming to implement. In this work, we describe how quantum annealing can be used to carry out clustering. We map the clustering objective to a quadratic binary optimization problem and discuss two clustering algorithms which are then implemented on commercially available quantum annealing hardware, as well as on a purely classical solver "qbsolv." The first algorithm assigns N data points to K clusters, and the second one can be used to perform binary clustering in a hierarchical manner. We present our results in the form of benchmarks against well-known k-means clustering and discuss the advantages and disadvantages of the proposed techniques.

  3. Genetic diversity of K-antigen gene clusters of Escherichia coli and their molecular typing using a suspension array.

    Science.gov (United States)

    Yang, Shuang; Xi, Daoyi; Jing, Fuyi; Kong, Deju; Wu, Junli; Feng, Lu; Cao, Boyang; Wang, Lei

    2018-04-01

    Capsular polysaccharides (CPSs), or K-antigens, are the major surface antigens of Escherichia coli. More than 80 serologically unique K-antigens are classified into 4 groups (Groups 1-4) of capsules. Groups 1 and 4 contain the Wzy-dependent polymerization pathway and the gene clusters are in the order galF to gnd; Groups 2 and 3 contain the ABC-transporter-dependent pathway and the gene clusters consist of 3 regions, regions 1, 2 and 3. Little is known about the variations among the gene clusters. In this study, 9 serotypes of K-antigen gene clusters (K2ab, K11, K20, K24, K38, K84, K92, K96, and K102) were sequenced and correlated with their CPS chemical structures. On the basis of sequence data, a K-antigen-specific suspension array that detects 10 distinct CPSs, including the above 9 CPSs plus K30, was developed. This is the first report to catalog the genetic features of E. coli K-antigen variations and to develop a suspension array for their molecular typing. The method has a number of advantages over traditional bacteriophage and serum agglutination methods and lays the foundation for straightforward identification and detection of additional K-antigens in the future.

  4. Internal Cluster Validation on Earthquake Data in the Province of Bengkulu

    Science.gov (United States)

    Rini, D. S.; Novianti, P.; Fransiska, H.

    2018-04-01

    K-means method is an algorithm for cluster n object based on attribute to k partition, where k < n. There is a deficiency of algorithms that is before the algorithm is executed, k points are initialized randomly so that the resulting data clustering can be different. If the random value for initialization is not good, the clustering becomes less optimum. Cluster validation is a technique to determine the optimum cluster without knowing prior information from data. There are two types of cluster validation, which are internal cluster validation and external cluster validation. This study aims to examine and apply some internal cluster validation, including the Calinski-Harabasz (CH) Index, Sillhouette (S) Index, Davies-Bouldin (DB) Index, Dunn Index (D), and S-Dbw Index on earthquake data in the Bengkulu Province. The calculation result of optimum cluster based on internal cluster validation is CH index, S index, and S-Dbw index yield k = 2, DB Index with k = 6 and Index D with k = 15. Optimum cluster (k = 6) based on DB Index gives good results for clustering earthquake in the Bengkulu Province.

  5. Envelopment filter and K-means for the detection of QRS waveforms in electrocardiogram.

    Science.gov (United States)

    Merino, Manuel; Gómez, Isabel María; Molina, Alberto J

    2015-06-01

    The electrocardiogram (ECG) is a well-established technique for determining the electrical activity of the heart and studying its diseases. One of the most common pieces of information that can be read from the ECG is the heart rate (HR) through the detection of its most prominent feature: the QRS complex. This paper describes an offline version and a real-time implementation of a new algorithm to determine QRS localization in the ECG signal based on its envelopment and K-means clustering algorithm. The envelopment is used to obtain a signal with only QRS complexes, deleting P, T, and U waves and baseline wander. Two moving average filters are applied to smooth data. The K-means algorithm classifies data into QRS and non-QRS. The technique is validated using 22 h of ECG data from five Physionet databases. These databases were arbitrarily selected to analyze different morphologies of QRS complexes: three stored data with cardiac pathologies, and two had data with normal heartbeats. The algorithm has a low computational load, with no decision thresholds. Furthermore, it does not require any additional parameter. Sensitivity, positive prediction and accuracy from results are over 99.7%. Copyright © 2015 IPEM. Published by Elsevier Ltd. All rights reserved.

  6. Sistem Informasi Geografis Berbasis Web untuk Pemetaan Sebaran Alumni Menggunakan Metode K-Means

    Directory of Open Access Journals (Sweden)

    Slamet Handoko Handoko

    2014-01-01

    Full Text Available The  graduates  of  State  Polytechnic  of  Semarang  are  not  only  the  member  of  social  community  but  also part  of  State Polytechnic  of Semarang  community  who  have  academic  knowledge  and  special  skills.  Based  on  the  researcher  observation  the  graduates  of  State Polytechnic  of  Semarang  are  not  recorded,  the  management  of  State  Polytechnic  of  Semarang  has  not  provided  a  system  that  can facilitate  the  interaction  between  State  Polytechnic  of  Semarang  and  graduates.  In  this  thesis,  the  system  for  mapping  the  graduates distribution is aimed to measure the level of the  graduates  compliance skill with the competence area of their job. The method of KMeans Clustering is used for grouping the distribution of State Polytechnic of Semarang graduates. Grouping or clustering mechanism in this system is based on four variables. They are  type of company, job classification, working area, and competency of study program. While  the geographical position of graduates is used to filter the data when the users are searching the graduates location in a ce rtain province.  In  this  research  the  cluster  is  divided  into  three,  they  are,  cluster  one:  graduates  have  matching  competence,  cluster  two: graduates have matching enough competence, and cluster three: graduates have no matching competence.Keywords: Clustering;  GIS ;  K-Means

  7. Clustering analysis of line indices for LAMOST spectra with AstroStat

    Science.gov (United States)

    Chen, Shu-Xin; Sun, Wei-Min; Yan, Qi

    2018-06-01

    The application of data mining in astronomical surveys, such as the Large Sky Area Multi-Object Fiber Spectroscopic Telescope (LAMOST) survey, provides an effective approach to automatically analyze a large amount of complex survey data. Unsupervised clustering could help astronomers find the associations and outliers in a big data set. In this paper, we employ the k-means method to perform clustering for the line index of LAMOST spectra with the powerful software AstroStat. Implementing the line index approach for analyzing astronomical spectra is an effective way to extract spectral features for low resolution spectra, which can represent the main spectral characteristics of stars. A total of 144 340 line indices for A type stars is analyzed through calculating their intra and inter distances between pairs of stars. For intra distance, we use the definition of Mahalanobis distance to explore the degree of clustering for each class, while for outlier detection, we define a local outlier factor for each spectrum. AstroStat furnishes a set of visualization tools for illustrating the analysis results. Checking the spectra detected as outliers, we find that most of them are problematic data and only a few correspond to rare astronomical objects. We show two examples of these outliers, a spectrum with abnormal continuumand a spectrum with emission lines. Our work demonstrates that line index clustering is a good method for examining data quality and identifying rare objects.

  8. A Cluster-Analytical Approach towards Physical Activity and Eating Habits among 10-Year-Old Children

    Science.gov (United States)

    Sabbe, Dieter; De Bourdeaudhuij, I.; Legiest, E.; Maes, L.

    2008-01-01

    The purpose was to investigate whether clusters--based on physical activity (PA) and eating habits--can be found among children, and to explore subgroups' characteristics. A total of 1725 10-year olds completed a self-administered questionnaire. K-means cluster analysis was based on the weekly quantity of vigorous and moderate PA, the excess index…

  9. Objectively Measured Baseline Physical Activity Patterns in Women in the mPED Trial: Cluster Analysis.

    Science.gov (United States)

    Fukuoka, Yoshimi; Zhou, Mo; Vittinghoff, Eric; Haskell, William; Goldberg, Ken; Aswani, Anil

    2018-02-01

    Determining patterns of physical activity throughout the day could assist in developing more personalized interventions or physical activity guidelines in general and, in particular, for women who are less likely to be physically active than men. The aims of this report are to identify clusters of women based on accelerometer-measured baseline raw metabolic equivalent of task (MET) values and a normalized version of the METs ≥3 data, and to compare sociodemographic and cardiometabolic risks among these identified clusters. A total of 215 women who were enrolled in the Mobile Phone Based Physical Activity Education (mPED) trial and wore an accelerometer for at least 8 hours per day for the 7 days prior to the randomization visit were analyzed. The k-means clustering method and the Lloyd algorithm were used on the data. We used the elbow method to choose the number of clusters, looking at the percentage of variance explained as a function of the number of clusters. The results of the k-means cluster analyses of raw METs revealed three different clusters. The unengaged group (n=102) had the highest depressive symptoms score compared with the afternoon engaged (n=65) and morning engaged (n=48) groups (overall Pcluster groups using a large national dataset. ClinicalTrials.gov NCT01280812; https://clinicaltrials.gov/ct2/show/NCT01280812 (Archived by WebCite at http://www.webcitation.org/6vVyLzwft). ©Yoshimi Fukuoka, Mo Zhou, Eric Vittinghoff, William Haskell, Ken Goldberg, Anil Aswani. Originally published in JMIR Public Health and Surveillance (http://publichealth.jmir.org), 01.02.2018.

  10. ANALISIS SEGMENTASI PELANGGAN MENGGUNAKAN KOMBINASI RFM MODEL DAN TEKNIK CLUSTERING

    Directory of Open Access Journals (Sweden)

    Beta Estri Adiana

    2018-04-01

    Full Text Available Intense competition in the business field motivates a small and medium enterprises (SMEs to manage customer services to the maximal. Improve of customer royalty by grouping cunstomers into some of groups and determining appropriate and effective marketing strategies for each group. Customer segmentation can be performed by data mining approach with clustering method. The main purpose of this paper is customer segmentation and measure their loyalty to a SME’s product. Using CRISP-DM method which consist of six phases, namely business understanding, data understanding, data preparatuin, modeling, evaluation and deployment. The K-Means algorithm is used for cluster formation and RapidMiner as a tool used to evaluate the result of clusters. Cluster formation is based on RFM (recency, frequency, monetary analysis. Davies Bouldin Index (DBI is used to find the optimal number of clusters (k. The customers are divided into 3 clusters, total of customer in first cluster is 30 customers who entered in typical customer category, the second cluster there are 8 customer whho entered in superstar customer and 89 customers in third cluster is dormant cluster category.

  11. Genomic signal processing for DNA sequence clustering.

    Science.gov (United States)

    Mendizabal-Ruiz, Gerardo; Román-Godínez, Israel; Torres-Ramos, Sulema; Salido-Ruiz, Ricardo A; Vélez-Pérez, Hugo; Morales, J Alejandro

    2018-01-01

    Genomic signal processing (GSP) methods which convert DNA data to numerical values have recently been proposed, which would offer the opportunity of employing existing digital signal processing methods for genomic data. One of the most used methods for exploring data is cluster analysis which refers to the unsupervised classification of patterns in data. In this paper, we propose a novel approach for performing cluster analysis of DNA sequences that is based on the use of GSP methods and the K-means algorithm. We also propose a visualization method that facilitates the easy inspection and analysis of the results and possible hidden behaviors. Our results support the feasibility of employing the proposed method to find and easily visualize interesting features of sets of DNA data.

  12. Clinical fracture risk evaluated by hierarchical agglomerative clustering

    DEFF Research Database (Denmark)

    Kruse, C; Eiken, P; Vestergaard, P

    2017-01-01

    reimbursement, primary healthcare sector use and comorbidity of female subjects were combined. Standardized variable means, Euclidean distances and Ward's D2 method of hierarchical agglomerative clustering (HAC), were used to form the clustering object. K number of clusters was selected with the lowest cluster...

  13. Mining the National Career Assessment Examination Result Using Clustering Algorithm

    Science.gov (United States)

    Pagudpud, M. V.; Palaoag, T. T.; Padirayon, L. M.

    2018-03-01

    Education is an essential process today which elicits authorities to discover and establish innovative strategies for educational improvement. This study applied data mining using clustering technique for knowledge extraction from the National Career Assessment Examination (NCAE) result in the Division of Quirino. The NCAE is an examination given to all grade 9 students in the Philippines to assess their aptitudes in the different domains. Clustering the students is helpful in identifying students’ learning considerations. With the use of the RapidMiner tool, clustering algorithms such as Density-Based Spatial Clustering of Applications with Noise (DBSCAN), k-means, k-medoid, expectation maximization clustering, and support vector clustering algorithms were analyzed. The silhouette indexes of the said clustering algorithms were compared, and the result showed that the k-means algorithm with k = 3 and silhouette index equal to 0.196 is the most appropriate clustering algorithm to group the students. Three groups were formed having 477 students in the determined group (cluster 0), 310 proficient students (cluster 1) and 396 developing students (cluster 2). The data mining technique used in this study is essential in extracting useful information from the NCAE result to better understand the abilities of students which in turn is a good basis for adopting teaching strategies.

  14. A Comparison of Methods for Player Clustering via Behavioral Telemetry

    DEFF Research Database (Denmark)

    Drachen, Anders; Thurau, C.; Sifa, R.

    2013-01-01

    patterns in the behavioral data, and developing profiles that are actionable to game developers. There are numerous methods for unsupervised clustering of user behavior, e.g. k-means/c-means, Nonnegative Matrix Factorization, or Principal Component Analysis. Although all yield behavior categorizations......, interpretation of the resulting categories in terms of actual play behavior can be difficult if not impossible. In this paper, a range of unsupervised techniques are applied together with Archetypal Analysis to develop behavioral clusters from playtime data of 70,014 World of Warcraft players, covering a five......The analysis of user behavior in digital games has been aided by the introduction of user telemetry in game development, which provides unprecedented access to quantitative data on user behavior from the installed game clients of the entire population of players. Player behavior telemetry datasets...

  15. Transcriptional analysis of ESAT-6 cluster 3 in Mycobacterium smegmatis

    Directory of Open Access Journals (Sweden)

    Riccardi Giovanna

    2009-03-01

    Full Text Available Abstract Background The ESAT-6 (early secreted antigenic target, 6 kDa family collects small mycobacterial proteins secreted by Mycobacterium tuberculosis, particularly in the early phase of growth. There are 23 ESAT-6 family members in M. tuberculosis H37Rv. In a previous work, we identified the Zur- dependent regulation of five proteins of the ESAT-6/CFP-10 family (esxG, esxH, esxQ, esxR, and esxS. esxG and esxH are part of ESAT-6 cluster 3, whose expression was already known to be induced by iron starvation. Results In this research, we performed EMSA experiments and transcriptional analysis of ESAT-6 cluster 3 in Mycobacterium smegmatis (msmeg0615-msmeg0625 and M. tuberculosis. In contrast to what we had observed in M. tuberculosis, we found that in M. smegmatis ESAT-6 cluster 3 responds only to iron and not to zinc. In both organisms we identified an internal promoter, a finding which suggests the presence of two transcriptional units and, by consequence, a differential expression of cluster 3 genes. We compared the expression of msmeg0615 and msmeg0620 in different growth and stress conditions by means of relative quantitative PCR. The expression of msmeg0615 and msmeg0620 genes was essentially similar; they appeared to be repressed in most of the tested conditions, with the exception of acid stress (pH 4.2 where msmeg0615 was about 4-fold induced, while msmeg0620 was repressed. Analysis revealed that in acid stress conditions M. tuberculosis rv0282 gene was 3-fold induced too, while rv0287 induction was almost insignificant. Conclusion In contrast with what has been reported for M. tuberculosis, our results suggest that in M. smegmatis only IdeR-dependent regulation is retained, while zinc has no effect on gene expression. The role of cluster 3 in M. tuberculosis virulence is still to be defined; however, iron- and zinc-dependent expression strongly suggests that cluster 3 is highly expressed in the infective process, and that the cluster

  16. Dietary Patterns Derived by Cluster Analysis are Associated with Cognitive Function among Korean Older Adults.

    Science.gov (United States)

    Kim, Jihye; Yu, Areum; Choi, Bo Youl; Nam, Jung Hyun; Kim, Mi Kyung; Oh, Dong Hoon; Yang, Yoon Jung

    2015-05-29

    The objective of this study was to investigate major dietary patterns among older Korean adults through cluster analysis and to determine an association between dietary patterns and cognitive function. This is a cross-sectional study. The data from the Korean Multi-Rural Communities Cohort Study was used. Participants included 765 participants aged 60 years and over. A quantitative food frequency questionnaire with 106 items was used to investigate dietary intake. The Korean version of the MMSE-KC (Mini-Mental Status Examination-Korean version) was used to assess cognitive function. Two major dietary patterns were identified using K-means cluster analysis. The "MFDF" dietary pattern indicated high consumption of Multigrain rice, Fish, Dairy products, Fruits and fruit juices, while the "WNC" dietary pattern referred to higher intakes of White rice, Noodles, and Coffee. Means of the total MMSE-KC and orientation score of the participants in the MFDF dietary pattern were higher than those of the WNC dietary pattern. Compared with the WNC dietary pattern, the MFDF dietary pattern showed a lower risk of cognitive impairment after adjusting for covariates (OR 0.64, 95% CI 0.44-0.94). The MFDF dietary pattern, with high consumption of multigrain rice, fish, dairy products, and fruits may be related to better cognition among Korean older adults.

  17. Short-Term Wind Power Forecasting Based on Clustering Pre-Calculated CFD Method

    Directory of Open Access Journals (Sweden)

    Yimei Wang

    2018-04-01

    Full Text Available To meet the increasing wind power forecasting (WPF demands of newly built wind farms without historical data, physical WPF methods are widely used. The computational fluid dynamics (CFD pre-calculated flow fields (CPFF-based WPF is a promising physical approach, which can balance well the competing demands of computational efficiency and accuracy. To enhance its adaptability for wind farms in complex terrain, a WPF method combining wind turbine clustering with CPFF is first proposed where the wind turbines in the wind farm are clustered and a forecasting is undertaken for each cluster. K-means, hierarchical agglomerative and spectral analysis methods are used to establish the wind turbine clustering models. The Silhouette Coefficient, Calinski-Harabaz index and within-between index are proposed as criteria to evaluate the effectiveness of the established clustering models. Based on different clustering methods and schemes, various clustering databases are built for clustering pre-calculated CFD (CPCC-based short-term WPF. For the wind farm case studied, clustering evaluation criteria show that hierarchical agglomerative clustering has reasonable results, spectral clustering is better and K-means gives the best performance. The WPF results produced by different clustering databases also prove the effectiveness of the three evaluation criteria in turn. The newly developed CPCC model has a much higher WPF accuracy than the CPFF model without using clustering techniques, both on temporal and spatial scales. The research provides supports for both the development and improvement of short-term physical WPF systems.

  18. Exponential Convergence of Cellular Dynamical Mean Field Theory: Reply to the comment by K. Aryanpour, Th. Maier and M. Jarrell (cond-mat/0301460)

    OpenAIRE

    Biroli, G.; Kotliar, G.

    2004-01-01

    We reply to the comment by K. Aryanpour, Th. Maier and M. Jarrell (cond-mat/0301460) on our paper (Phys. Rev. B {\\bf 65} 155112 (2002)). We demonstrate using general arguments and explicit examples that whenever the correlation length is finite, local observables converge exponentially fast in the cluster size, $L_{c}$, within Cellular Dynamical Mean Field Theory (CDMFT). This is a faster rate of convergence than the $1/L_{c}^{2}$ behavior of the Dynamical Cluster approximation (DCA) thus ref...

  19. Properties of proton clusters in inelastic CC interactions accompanied by the production of Λ and K0 particles at p = 4.2 GeV/c per nucleon

    International Nuclear Information System (INIS)

    Bekmirzaev, R.N.; Shukurov, E.Kh.; Kuznetsov, A.A.; Yuldashev, B.S.

    2004-01-01

    Within a new relativistically invariant approach, the properties of proton clusters that are formed together with Λ and K 0 particles in inelastic CC interactions at p = 4.2 GeV/c per nucleon are investigated in the space of relative 4-velocities. The observed proton clusters are shown to be characterized by high values of the mean kinetic energy of the protons in the cluster rest frame: p > = 100 ± 2 MeV

  20. EU Travel and Tourism Industry - A Cluster Analysis of Impact and Competitiveness

    Directory of Open Access Journals (Sweden)

    DANIEL BULIN

    2014-05-01

    Full Text Available The tourism industry has known a sustained growth in recent decades, this sector taking advantage of the capacity to generate added value regardless of the type of capital input. The impact of tourism on the economy is indisputable, but the efficiency and the competitiveness provide a sustainable development of this industry. The EU enlargement, geographical and numerical, provides both-a diversity of tourism destinations and an opportunity for growth, considering widening of single market. This paper aims to assess tourism in the European Union, considering the impact and value of the tourism multiplier effect in economy, respectively efficiency and competitiveness of tourism. This paper proposes a classification of EU countries based on cluster analysis, using K-means algorithm.

  1. Performance Analysis of Cluster Formation in Wireless Sensor Networks.

    Science.gov (United States)

    Montiel, Edgar Romo; Rivero-Angeles, Mario E; Rubino, Gerardo; Molina-Lozano, Heron; Menchaca-Mendez, Rolando; Menchaca-Mendez, Ricardo

    2017-12-13

    Clustered-based wireless sensor networks have been extensively used in the literature in order to achieve considerable energy consumption reductions. However, two aspects of such systems have been largely overlooked. Namely, the transmission probability used during the cluster formation phase and the way in which cluster heads are selected. Both of these issues have an important impact on the performance of the system. For the former, it is common to consider that sensor nodes in a clustered-based Wireless Sensor Network (WSN) use a fixed transmission probability to send control data in order to build the clusters. However, due to the highly variable conditions experienced by these networks, a fixed transmission probability may lead to extra energy consumption. In view of this, three different transmission probability strategies are studied: optimal, fixed and adaptive. In this context, we also investigate cluster head selection schemes, specifically, we consider two intelligent schemes based on the fuzzy C-means and k-medoids algorithms and a random selection with no intelligence. We show that the use of intelligent schemes greatly improves the performance of the system, but their use entails higher complexity and selection delay. The main performance metrics considered in this work are energy consumption, successful transmission probability and cluster formation latency. As an additional feature of this work, we study the effect of errors in the wireless channel and the impact on the performance of the system under the different transmission probability schemes.

  2. Performance Analysis of Cluster Formation in Wireless Sensor Networks

    Directory of Open Access Journals (Sweden)

    Edgar Romo Montiel

    2017-12-01

    Full Text Available Clustered-based wireless sensor networks have been extensively used in the literature in order to achieve considerable energy consumption reductions. However, two aspects of such systems have been largely overlooked. Namely, the transmission probability used during the cluster formation phase and the way in which cluster heads are selected. Both of these issues have an important impact on the performance of the system. For the former, it is common to consider that sensor nodes in a clustered-based Wireless Sensor Network (WSN use a fixed transmission probability to send control data in order to build the clusters. However, due to the highly variable conditions experienced by these networks, a fixed transmission probability may lead to extra energy consumption. In view of this, three different transmission probability strategies are studied: optimal, fixed and adaptive. In this context, we also investigate cluster head selection schemes, specifically, we consider two intelligent schemes based on the fuzzy C-means and k-medoids algorithms and a random selection with no intelligence. We show that the use of intelligent schemes greatly improves the performance of the system, but their use entails higher complexity and selection delay. The main performance metrics considered in this work are energy consumption, successful transmission probability and cluster formation latency. As an additional feature of this work, we study the effect of errors in the wireless channel and the impact on the performance of the system under the different transmission probability schemes.

  3. Acinetobacter baumannii K27 and K44 capsular polysaccharides have the same K unit but different structures due to the presence of distinct wzy genes in otherwise closely related K gene clusters.

    Science.gov (United States)

    Shashkov, Alexander S; Kenyon, Johanna J; Senchenkova, Sof'ya N; Shneider, Mikhail M; Popova, Anastasiya V; Arbatsky, Nikolay P; Miroshnikov, Konstantin A; Volozhantsev, Nikolay V; Hall, Ruth M; Knirel, Yuriy A

    2016-05-01

    Capsular polysaccharides (CPSs), from Acinetobacter baumannii isolates 1432, 4190 and NIPH 70, which have related gene content at the K locus, were examined, and the chemical structures established using 2D(1)H and(13)C NMR spectroscopy. The three isolates produce the same pentasaccharide repeat unit, which consists of 5-N-acetyl-7-N-[(S)-3-hydroxybutanoyl] (major) or 5,7-di-N-acetyl (minor) derivatives of 5,7-diamino-3,5,7,9-tetradeoxy-D-glycero-D-galacto-non-2-ulosonic (legionaminic) acid (Leg5Ac7R), D-galactose, N-acetyl-D-galactosamine and N-acetyl-D-glucosamine. However, the linkage between repeat units in NIPH 70 was different to that in 1432 and 4190, and this significantly alters the CPS structure. The KL27 gene cluster in 4190 and KL44 gene cluster in NIPH 70 are organized identically and contain lga genes for Leg5Ac7R synthesis, genes for the synthesis of the common sugars, as well as anitrA2 initiating transferase and four glycosyltransferases genes. They share high-level nucleotide sequence identity for corresponding genes, but differ in the wzy gene encoding the Wzy polymerase. The Wzy proteins, which have different lengths and share no similarity, would form the unrelated linkages in the K27 and K44 structures. The linkages formed by the four shared glycosyltransferases were predicted by comparison with gene clusters that synthesize related structures. These findings unambiguously identify the linkages formed by WzyK27 and WzyK44, and show that the presence of different wzy genes in otherwise closely related K gene clusters changes the structure of the CPS. This may affect its capacity as a protective barrier for A. baumannii. © The Author 2015. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com.

  4. An Initialization Method Based on Hybrid Distance for k-Means Algorithm.

    Science.gov (United States)

    Yang, Jie; Ma, Yan; Zhang, Xiangfen; Li, Shunbao; Zhang, Yuping

    2017-11-01

    The traditional [Formula: see text]-means algorithm has been widely used as a simple and efficient clustering method. However, the performance of this algorithm is highly dependent on the selection of initial cluster centers. Therefore, the method adopted for choosing initial cluster centers is extremely important. In this letter, we redefine the density of points according to the number of its neighbors, as well as the distance between points and their neighbors. In addition, we define a new distance measure that considers both Euclidean distance and density. Based on that, we propose an algorithm for selecting initial cluster centers that can dynamically adjust the weighting parameter. Furthermore, we propose a new internal clustering validation measure, the clustering validation index based on the neighbors (CVN), which can be exploited to select the optimal result among multiple clustering results. Experimental results show that the proposed algorithm outperforms existing initialization methods on real-world data sets and demonstrates the adaptability of the proposed algorithm to data sets with various characteristics.

  5. Hierarchical cluster analysis of progression patterns in open-angle glaucoma patients with medical treatment.

    Science.gov (United States)

    Bae, Hyoung Won; Rho, Seungsoo; Lee, Hye Sun; Lee, Naeun; Hong, Samin; Seong, Gong Je; Sung, Kyung Rim; Kim, Chan Yun

    2014-04-29

    To classify medically treated open-angle glaucoma (OAG) by the pattern of progression using hierarchical cluster analysis, and to determine OAG progression characteristics by comparing clusters. Ninety-five eyes of 95 OAG patients who received medical treatment, and who had undergone visual field (VF) testing at least once per year for 5 or more years. OAG was classified into subgroups using hierarchical cluster analysis based on the following five variables: baseline mean deviation (MD), baseline visual field index (VFI), MD slope, VFI slope, and Glaucoma Progression Analysis (GPA) printout. After that, other parameters were compared between clusters. Two clusters were made after a hierarchical cluster analysis. Cluster 1 showed -4.06 ± 2.43 dB baseline MD, 92.58% ± 6.27% baseline VFI, -0.28 ± 0.38 dB per year MD slope, -0.52% ± 0.81% per year VFI slope, and all "no progression" cases in GPA printout, whereas cluster 2 showed -8.68 ± 3.81 baseline MD, 77.54 ± 12.98 baseline VFI, -0.72 ± 0.55 MD slope, -2.22 ± 1.89 VFI slope, and seven "possible" and four "likely" progression cases in GPA printout. There were no significant differences in age, sex, mean IOP, central corneal thickness, and axial length between clusters. However, cluster 2 included more high-tension glaucoma patients and used a greater number of antiglaucoma eye drops significantly compared with cluster 1. Hierarchical cluster analysis of progression patterns divided OAG into slow and fast progression groups, evidenced by assessing the parameters of glaucomatous progression in VF testing. In the fast progression group, the prevalence of high-tension glaucoma was greater and the number of antiglaucoma medications administered was increased versus the slow progression group. Copyright 2014 The Association for Research in Vision and Ophthalmology, Inc.

  6. Clustering and training set selection methods for improving the accuracy of quantitative laser induced breakdown spectroscopy

    International Nuclear Information System (INIS)

    Anderson, Ryan B.; Bell, James F.; Wiens, Roger C.; Morris, Richard V.; Clegg, Samuel M.

    2012-01-01

    We investigated five clustering and training set selection methods to improve the accuracy of quantitative chemical analysis of geologic samples by laser induced breakdown spectroscopy (LIBS) using partial least squares (PLS) regression. The LIBS spectra were previously acquired for 195 rock slabs and 31 pressed powder geostandards under 7 Torr CO 2 at a stand-off distance of 7 m at 17 mJ per pulse to simulate the operational conditions of the ChemCam LIBS instrument on the Mars Science Laboratory Curiosity rover. The clustering and training set selection methods, which do not require prior knowledge of the chemical composition of the test-set samples, are based on grouping similar spectra and selecting appropriate training spectra for the partial least squares (PLS2) model. These methods were: (1) hierarchical clustering of the full set of training spectra and selection of a subset for use in training; (2) k-means clustering of all spectra and generation of PLS2 models based on the training samples within each cluster; (3) iterative use of PLS2 to predict sample composition and k-means clustering of the predicted compositions to subdivide the groups of spectra; (4) soft independent modeling of class analogy (SIMCA) classification of spectra, and generation of PLS2 models based on the training samples within each class; (5) use of Bayesian information criteria (BIC) to determine an optimal number of clusters and generation of PLS2 models based on the training samples within each cluster. The iterative method and the k-means method using 5 clusters showed the best performance, improving the absolute quadrature root mean squared error (RMSE) by ∼ 3 wt.%. The statistical significance of these improvements was ∼ 85%. Our results show that although clustering methods can modestly improve results, a large and diverse training set is the most reliable way to improve the accuracy of quantitative LIBS. In particular, additional sulfate standards and specifically

  7. Integrating Data Clustering and Visualization for the Analysis of 3D Gene Expression Data

    Energy Technology Data Exchange (ETDEWEB)

    Data Analysis and Visualization (IDAV) and the Department of Computer Science, University of California, Davis, One Shields Avenue, Davis CA 95616, USA,; nternational Research Training Group ``Visualization of Large and Unstructured Data Sets,' ' University of Kaiserslautern, Germany; Computational Research Division, Lawrence Berkeley National Laboratory, One Cyclotron Road, Berkeley, CA 94720, USA; Genomics Division, Lawrence Berkeley National Laboratory, One Cyclotron Road, Berkeley CA 94720, USA; Life Sciences Division, Lawrence Berkeley National Laboratory, One Cyclotron Road, Berkeley CA 94720, USA,; Computer Science Division,University of California, Berkeley, CA, USA,; Computer Science Department, University of California, Irvine, CA, USA,; All authors are with the Berkeley Drosophila Transcription Network Project, Lawrence Berkeley National Laboratory,; Rubel, Oliver; Weber, Gunther H.; Huang, Min-Yu; Bethel, E. Wes; Biggin, Mark D.; Fowlkes, Charless C.; Hendriks, Cris L. Luengo; Keranen, Soile V. E.; Eisen, Michael B.; Knowles, David W.; Malik, Jitendra; Hagen, Hans; Hamann, Bernd

    2008-05-12

    The recent development of methods for extracting precise measurements of spatial gene expression patterns from three-dimensional (3D) image data opens the way for new analyses of the complex gene regulatory networks controlling animal development. We present an integrated visualization and analysis framework that supports user-guided data clustering to aid exploration of these new complex datasets. The interplay of data visualization and clustering-based data classification leads to improved visualization and enables a more detailed analysis than previously possible. We discuss (i) integration of data clustering and visualization into one framework; (ii) application of data clustering to 3D gene expression data; (iii) evaluation of the number of clusters k in the context of 3D gene expression clustering; and (iv) improvement of overall analysis quality via dedicated post-processing of clustering results based on visualization. We discuss the use of this framework to objectively define spatial pattern boundaries and temporal profiles of genes and to analyze how mRNA patterns are controlled by their regulatory transcription factors.

  8. Clustering for Different Scales of Measurement - the Gap-Ratio Weighted K-means Algorithm

    OpenAIRE

    Guérin, Joris; Gibaru, Olivier; Thiery, Stéphane; Nyiri, Eric

    2017-01-01

    This paper describes a method for clustering data that are spread out over large regions and which dimensions are on different scales of measurement. Such an algorithm was developed to implement a robotics application consisting in sorting and storing objects in an unsupervised way. The toy dataset used to validate such application consists of Lego bricks of different shapes and colors. The uncontrolled lighting conditions together with the use of RGB color features, respectively involve data...

  9. Hybrid Tracking Algorithm Improvements and Cluster Analysis Methods.

    Science.gov (United States)

    1982-02-26

    UPGMA ), and Ward’s method. Ling’s papers describe a (k,r) clustering method. Each of these methods have individual characteristics which make them...Reference 7), UPGMA is probably the most frequently used clustering strategy. UPGMA tries to group new points into an existing cluster by using an

  10. Text Mining Untuk Analisis Sentimen Review Film Menggunakan Algoritma K-Means

    Directory of Open Access Journals (Sweden)

    Setyo Budi

    2017-02-01

    Full Text Available Kemudahan manusia didalam menggunakan website mengakibatkan bertambahnya dokumen teks yang berupa pendapat dan informasi. Dalam waktu yang lama dokumen teks akan bertambah besar. Text mining merupakan salah satu teknik yang digunakan untuk menggali kumpulan dokumen text sehingga dapat diambil intisarinya. Ada beberapa algoritma yang di gunakan untuk penggalian dokumen untuk analisis sentimen, salah satunya adalah K-Means. Didalam penelitian ini algoritma yang digunakan adalah K-Means. Hasil penelitian menunjukkan bahwa akurasi K-Means dengan dataset digunakan 300 positif dan 300 negatif  akurasinya 57.83%,  700 dokumen positif dan 700  negatif akurasinya 56.71%%, 1000 dokumen positif dan 1000  negatif akurasinya 50.40%%. Dari hasil pengujian disimpulkan bahwa semakin besar dataset yang digunakan semakin rendah akurasi K-Means.   Kata Kunci : Text Mining, Analisis Sentimen, K-Means, Review Film 

  11. Large-Scale Multi-Dimensional Document Clustering on GPU Clusters

    Energy Technology Data Exchange (ETDEWEB)

    Cui, Xiaohui [ORNL; Mueller, Frank [North Carolina State University; Zhang, Yongpeng [ORNL; Potok, Thomas E [ORNL

    2010-01-01

    Document clustering plays an important role in data mining systems. Recently, a flocking-based document clustering algorithm has been proposed to solve the problem through simulation resembling the flocking behavior of birds in nature. This method is superior to other clustering algorithms, including k-means, in the sense that the outcome is not sensitive to the initial state. One limitation of this approach is that the algorithmic complexity is inherently quadratic in the number of documents. As a result, execution time becomes a bottleneck with large number of documents. In this paper, we assess the benefits of exploiting the computational power of Beowulf-like clusters equipped with contemporary Graphics Processing Units (GPUs) as a means to significantly reduce the runtime of flocking-based document clustering. Our framework scales up to over one million documents processed simultaneously in a sixteennode GPU cluster. Results are also compared to a four-node cluster with higher-end GPUs. On these clusters, we observe 30X-50X speedups, which demonstrates the potential of GPU clusters to efficiently solve massive data mining problems. Such speedups combined with the scalability potential and accelerator-based parallelization are unique in the domain of document-based data mining, to the best of our knowledge.

  12. Extragalactic globular clusters. I. The metallicity calibration

    International Nuclear Information System (INIS)

    Brodie, J.P.; Huchra, J.P.

    1990-01-01

    The ability of absorption-line strength indices, measured from integrated globular cluster spectra, to predict mean cluster metallicity is explored. Statistical criteria, are used to identify the six best indices out of about 20 measured in a large sample of Galactic and M31 cluster spectra. Linear relations between index and metallicity have been derived along with new calibrations of infrared colors (V - K, J - K, and CO) versus Fe/H. Estimates of metallicity from the six spectroscopic index-metallicity relations have been combined in three different ways to identify the most efficient estimator and the minimum bias estimator of Fe/H - the weighted mean. This provides an estimate of Fe/H accurate to about 15 percent. 37 refs

  13. Dense Fe cluster-assembled films by energetic cluster deposition

    International Nuclear Information System (INIS)

    Peng, D.L.; Yamada, H.; Hihara, T.; Uchida, T.; Sumiyama, K.

    2004-01-01

    High-density Fe cluster-assembled films were produced at room temperature by an energetic cluster deposition. Though cluster-assemblies are usually sooty and porous, the present Fe cluster-assembled films are lustrous and dense, revealing a soft magnetic behavior. Size-monodispersed Fe clusters with the mean cluster size d=9 nm were synthesized using a plasma-gas-condensation technique. Ionized clusters are accelerated electrically and deposited onto the substrate together with neutral clusters from the same cluster source. Packing fraction and saturation magnetic flux density increase rapidly and magnetic coercivity decreases remarkably with increasing acceleration voltage. The Fe cluster-assembled film obtained at the acceleration voltage of -20 kV has a packing fraction of 0.86±0.03, saturation magnetic flux density of 1.78±0.05 Wb/m 2 , and coercivity value smaller than 80 A/m. The resistivity at room temperature is ten times larger than that of bulk Fe metal

  14. Clustering algorithms for Stokes space modulation format recognition

    DEFF Research Database (Denmark)

    Boada, Ricard; Borkowski, Robert; Tafur Monroy, Idelfonso

    2015-01-01

    influences the performance of the detection process, particularly at low signal-to-noise ratios. This paper reports on an extensive study of six different clustering algorithms: k-means, expectation maximization, density-based DBSCAN and OPTICS, spectral clustering and maximum likelihood clustering, used...

  15. K2: A NEW METHOD FOR THE DETECTION OF GALAXY CLUSTERS BASED ON CANADA-FRANCE-HAWAII TELESCOPE LEGACY SURVEY MULTICOLOR IMAGES

    International Nuclear Information System (INIS)

    Thanjavur, Karun; Willis, Jon; Crampton, David

    2009-01-01

    We have developed a new method, K2, optimized for the detection of galaxy clusters in multicolor images. Based on the Red Sequence approach, K2 detects clusters using simultaneous enhancements in both colors and position. The detection significance is robustly determined through extensive Monte Carlo simulations and through comparison with available cluster catalogs based on two different optical methods, and also on X-ray data. K2 also provides quantitative estimates of the candidate clusters' richness and photometric redshifts. Initially, K2 was applied to the two color (gri) 161 deg 2 images of the Canada-France-Hawaii Telescope Legacy Survey Wide (CFHTLS-W) data. Our simulations show that the false detection rate for these data, at our selected threshold, is only ∼1%, and that the cluster catalogs are ∼80% complete up to a redshift of z = 0.6 for Fornax-like and richer clusters and to z ∼ 0.3 for poorer clusters. Based on the g-, r-, and i-band photometric catalogs of the Terapix T05 release, 35 clusters/deg 2 are detected, with 1-2 Fornax-like or richer clusters every 2 deg 2 . Catalogs containing data for 6144 galaxy clusters have been prepared, of which 239 are rich clusters. These clusters, especially the latter, are being searched for gravitational lenses-one of our chief motivations for cluster detection in CFHTLS. The K2 method can be easily extended to use additional color information and thus improve overall cluster detection to higher redshifts. The complete set of K2 cluster catalogs, along with the supplementary catalogs for the member galaxies, are available on request from the authors.

  16. Changing cluster composition in cluster randomised controlled trials: design and analysis considerations

    Science.gov (United States)

    2014-01-01

    Background There are many methodological challenges in the conduct and analysis of cluster randomised controlled trials, but one that has received little attention is that of post-randomisation changes to cluster composition. To illustrate this, we focus on the issue of cluster merging, considering the impact on the design, analysis and interpretation of trial outcomes. Methods We explored the effects of merging clusters on study power using standard methods of power calculation. We assessed the potential impacts on study findings of both homogeneous cluster merges (involving clusters randomised to the same arm of a trial) and heterogeneous merges (involving clusters randomised to different arms of a trial) by simulation. To determine the impact on bias and precision of treatment effect estimates, we applied standard methods of analysis to different populations under analysis. Results Cluster merging produced a systematic reduction in study power. This effect depended on the number of merges and was most pronounced when variability in cluster size was at its greatest. Simulations demonstrate that the impact on analysis was minimal when cluster merges were homogeneous, with impact on study power being balanced by a change in observed intracluster correlation coefficient (ICC). We found a decrease in study power when cluster merges were heterogeneous, and the estimate of treatment effect was attenuated. Conclusions Examples of cluster merges found in previously published reports of cluster randomised trials were typically homogeneous rather than heterogeneous. Simulations demonstrated that trial findings in such cases would be unbiased. However, simulations also showed that any heterogeneous cluster merges would introduce bias that would be hard to quantify, as well as having negative impacts on the precision of estimates obtained. Further methodological development is warranted to better determine how to analyse such trials appropriately. Interim recommendations

  17. Cataloging the Praesepe Cluster: Identifying Interlopers and Binary Systems

    Science.gov (United States)

    Lucey, Madeline R.; Gosnell, Natalie M.; Mann, Andrew; Douglas, Stephanie

    2018-01-01

    We present radial velocity measurements from an ongoing survey of the Praesepe open cluster using the WIYN 3.5m Telescope. Our target stars include 229 early-K to mid-M dwarfs with proper motion memberships that have been observed by the repurposed Kepler mission, K2. With this survey, we will provide a well-constrained membership list of the cluster. By removing interloping stars and determining the cluster binary frequency we can avoid systematic errors in our analysis of the K2 findings and more accurately determine exoplanet properties in the Praesepe cluster. Obtaining accurate exoplanet parameters in open clusters allows us to study the temporal dimension of exoplanet parameter space. We find Praesepe to have a mean radial velocity of 34.09 km/s and a velocity dispersion of 1.13 km/s, which is consistent with previous studies. We derive radial velocity membership probabilities for stars with ≥3 radial velocity measurements and compare against published membership probabilities. We also identify radial velocity variables and potential double-lined spectroscopic binaries. We plan to obtain more observations to determine the radial velocity membership of all the stars in our sample, as well as follow up on radial velocity variables to determine binary orbital solutions.

  18. Nanocomposite metal/plasma polymer films prepared by means of gas aggregation cluster source

    Energy Technology Data Exchange (ETDEWEB)

    Polonskyi, O.; Solar, P.; Kylian, O.; Drabik, M.; Artemenko, A.; Kousal, J.; Hanus, J.; Pesicka, J.; Matolinova, I. [Charles University in Prague, Faculty of Mathematics and Physics, V Holesovickach 2, 18000 Prague 8 (Czech Republic); Kolibalova, E. [Tescan, Libusina trida 21, 632 00 Brno (Czech Republic); Slavinska, D. [Charles University in Prague, Faculty of Mathematics and Physics, V Holesovickach 2, 18000 Prague 8 (Czech Republic); Biederman, H., E-mail: bieder@kmf.troja.mff.cuni.cz [Charles University in Prague, Faculty of Mathematics and Physics, V Holesovickach 2, 18000 Prague 8 (Czech Republic)

    2012-04-02

    Nanocomposite metal/plasma polymer films have been prepared by simultaneous plasma polymerization using a mixture of Ar/n-hexane and metal cluster beams. A simple compact cluster gas aggregation source is described and characterized with emphasis on the determination of the amount of charged clusters and their size distribution. It is shown that the fraction of neutral, positively and negatively charged nanoclusters leaving the gas aggregation source is largely influenced by used operational conditions. In addition, it is demonstrated that a large portion of Ag clusters is positively charged, especially when higher currents are used for their production. Deposition of nanocomposite Ag/C:H plasma polymer films is described in detail by means of cluster gas aggregation source. Basic characterization of the films is performed using transmission electron microscopy, ultraviolet-visible and Fourier-transform infrared spectroscopies. It is shown that the morphology, structure and optical properties of such prepared nanocomposites differ significantly from the ones fabricated by means of magnetron sputtering of Ag target in Ar/n-hexane mixture.

  19. Integrative cluster analysis in bioinformatics

    CERN Document Server

    Abu-Jamous, Basel; Nandi, Asoke K

    2015-01-01

    Clustering techniques are increasingly being put to use in the analysis of high-throughput biological datasets. Novel computational techniques to analyse high throughput data in the form of sequences, gene and protein expressions, pathways, and images are becoming vital for understanding diseases and future drug discovery. This book details the complete pathway of cluster analysis, from the basics of molecular biology to the generation of biological knowledge. The book also presents the latest clustering methods and clustering validation, thereby offering the reader a comprehensive review o

  20. Efficient computation of k-Nearest Neighbour Graphs for large high-dimensional data sets on GPU clusters.

    Directory of Open Access Journals (Sweden)

    Ali Dashti

    Full Text Available This paper presents an implementation of the brute-force exact k-Nearest Neighbor Graph (k-NNG construction for ultra-large high-dimensional data cloud. The proposed method uses Graphics Processing Units (GPUs and is scalable with multi-levels of parallelism (between nodes of a cluster, between different GPUs on a single node, and within a GPU. The method is applicable to homogeneous computing clusters with a varying number of nodes and GPUs per node. We achieve a 6-fold speedup in data processing as compared with an optimized method running on a cluster of CPUs and bring a hitherto impossible [Formula: see text]-NNG generation for a dataset of twenty million images with 15 k dimensionality into the realm of practical possibility.

  1. Fractal dimension to classify the heart sound recordings with KNN and fuzzy c-mean clustering methods

    Science.gov (United States)

    Juniati, D.; Khotimah, C.; Wardani, D. E. K.; Budayasa, K.

    2018-01-01

    The heart abnormalities can be detected from heart sound. A heart sound can be heard directly with a stethoscope or indirectly by a phonocardiograph, a machine of the heart sound recording. This paper presents the implementation of fractal dimension theory to make a classification of phonocardiograms into a normal heart sound, a murmur, or an extrasystole. The main algorithm used to calculate the fractal dimension was Higuchi’s Algorithm. There were two steps to make a classification of phonocardiograms, feature extraction, and classification. For feature extraction, we used Discrete Wavelet Transform to decompose the signal of heart sound into several sub-bands depending on the selected level. After the decomposition process, the signal was processed using Fast Fourier Transform (FFT) to determine the spectral frequency. The fractal dimension of the FFT output was calculated using Higuchi Algorithm. The classification of fractal dimension of all phonocardiograms was done with KNN and Fuzzy c-mean clustering methods. Based on the research results, the best accuracy obtained was 86.17%, the feature extraction by DWT decomposition level 3 with the value of kmax 50, using 5-fold cross validation and the number of neighbors was 5 at K-NN algorithm. Meanwhile, for fuzzy c-mean clustering, the accuracy was 78.56%.

  2. Clustering-based analysis for residential district heating data

    DEFF Research Database (Denmark)

    Gianniou, Panagiota; Liu, Xiufeng; Heller, Alfred

    2018-01-01

    The wide use of smart meters enables collection of a large amount of fine-granular time series, which can be used to improve the understanding of consumption behavior and used for consumption optimization. This paper presents a clustering-based knowledge discovery in databases method to analyze r....... These findings will be valuable for district heating utilities and energy planners to optimize their operations, design demand-side management strategies, and develop targeting energy-efficiency programs or policies.......The wide use of smart meters enables collection of a large amount of fine-granular time series, which can be used to improve the understanding of consumption behavior and used for consumption optimization. This paper presents a clustering-based knowledge discovery in databases method to analyze...... residential heating consumption data and evaluate information included in national building databases. The proposed method uses the K-means algorithm to segment consumption groups based on consumption intensity and representative patterns and ranks the groups according to daily consumption. This paper also...

  3. Clustering Millions of Faces by Identity.

    Science.gov (United States)

    Otto, Charles; Wang, Dayong; Jain, Anil K

    2018-02-01

    Given a large collection of unlabeled face images, we address the problem of clustering faces into an unknown number of identities. This problem is of interest in social media, law enforcement, and other applications, where the number of faces can be of the order of hundreds of million, while the number of identities (clusters) can range from a few thousand to millions. To address the challenges of run-time complexity and cluster quality, we present an approximate Rank-Order clustering algorithm that performs better than popular clustering algorithms (k-Means and Spectral). Our experiments include clustering up to 123 million face images into over 10 million clusters. Clustering results are analyzed in terms of external (known face labels) and internal (unknown face labels) quality measures, and run-time. Our algorithm achieves an F-measure of 0.87 on the LFW benchmark (13 K faces of 5,749 individuals), which drops to 0.27 on the largest dataset considered (13 K faces in LFW + 123M distractor images). Additionally, we show that frames in the YouTube benchmark can be clustered with an F-measure of 0.71. An internal per-cluster quality measure is developed to rank individual clusters for manual exploration of high quality clusters that are compact and isolated.

  4. Generalized Smoluchowski equation with correlation between clusters

    International Nuclear Information System (INIS)

    Sittler, Lionel

    2008-01-01

    In this paper we compute new reaction rates of the Smoluchowski equation which takes into account correlations. The new rate K = K MF + K C is the sum of two terms. The first term is the known Smoluchowski rate with the mean-field approximation. The second takes into account a correlation between clusters. For this purpose we introduce the average path of a cluster. We relate the length of this path to the reaction rate of the Smoluchowski equation. We solve the implicit dependence between the average path and the density of clusters. We show that this correlation length is the same for all clusters. Our result depends strongly on the spatial dimension d. The mean-field term K MF i,j = (D i + D j )(r j + r i ) d-2 , which vanishes for d = 1 and is valid up to logarithmic correction for d = 2, is the usual rate found with the Smoluchowski model without correlation (where r i is the radius and D i is the diffusion constant of the cluster). We compute a new rate: the correlation rate K i,j C = (D i +D j )(r j +r i ) d-1 M((d-1)/d f ) is valid for d ≥ 1(where M(α) = Σ +∞ i=1 i α N i is the moment of the density of clusters and d f is the fractal dimension of the cluster). The result is valid for a large class of diffusion processes and mass-radius relations. This approach confirms some analytical solutions in d = 1 found with other methods. We also show Monte Carlo simulations which illustrate some exact new solvable models

  5. Melodic pattern discovery by structural analysis via wavelets and clustering techniques

    DEFF Research Database (Denmark)

    Velarde, Gissel; Meredith, David

    We present an automatic method to support melodic pattern discovery by structural analysis of symbolic representations by means of wavelet analysis and clustering techniques. In previous work, we used the method to recognize the parent works of melodic segments, or to classify tunes into tune......-means to cluster melodic segments into groups of measured similarity and obtain a raking of the most prototypical melodic segments or patterns and their occurrences. We test the method on the JKU Patterns Development Database and evaluate it based on the ground truth defined by the MIREX 2013 Discovery of Repeated...... Themes & Sections task. We compare the results of our method to the output of geometric approaches. Finally, we discuss about the relevance of our wavelet-based analysis in relation to structure, pattern discovery, similarity and variation, and comment about the considerations of the method when used...

  6. Cluster Analysis of an International Pressure Pain Threshold Database Identifies 4 Meaningful Subgroups of Adults With Mechanical Neck Pain

    DEFF Research Database (Denmark)

    Walton, David M; Kwok, Timothy S H; Mehta, Swati

    2017-01-01

    OBJECTIVE: To determine pressure pain detection threshold (PPDT) related phenotypes of individuals with mechanical neck pain that may be identifiable in clinical practice. METHODS: This report describes a secondary analysis of 5 independent, international mechanical neck pain databases of PPDT...... values taken at both a local and distal region (total N=1176). Minor systematic differences in mean PPDT values across cohorts necessitated z-transformation before analysis, and each cohort was split into male and female sexes. Latent profile analysis (LPA) using the k-means approach was undertaken...... to identify the most parsimonious set of PPDT-based phenotypes that were both statistically and clinically meaningful. RESULTS: LPA revealed 4 distinct clusters named according to PPDT levels at the local and distal zones: low-low PPDT (67%), mod-mod (25%), mod-high (4%), and high-high (4%). Secondary...

  7. Nonuniform Sparse Data Clustering Cascade Algorithm Based on Dynamic Cumulative Entropy

    Directory of Open Access Journals (Sweden)

    Ning Li

    2016-01-01

    Full Text Available A small amount of prior knowledge and randomly chosen initial cluster centers have a direct impact on the accuracy of the performance of iterative clustering algorithm. In this paper we propose a new algorithm to compute initial cluster centers for k-means clustering and the best number of the clusters with little prior knowledge and optimize clustering result. It constructs the Euclidean distance control factor based on aggregation density sparse degree to select the initial cluster center of nonuniform sparse data and obtains initial data clusters by multidimensional diffusion density distribution. Multiobjective clustering approach based on dynamic cumulative entropy is adopted to optimize the initial data clusters and the best number of the clusters. The experimental results show that the newly proposed algorithm has good performance to obtain the initial cluster centers for the k-means algorithm and it effectively improves the clustering accuracy of nonuniform sparse data by about 5%.

  8. Image Segmentation Method Using Fuzzy C Mean Clustering Based on Multi-Objective Optimization

    Science.gov (United States)

    Chen, Jinlin; Yang, Chunzhi; Xu, Guangkui; Ning, Li

    2018-04-01

    Image segmentation is not only one of the hottest topics in digital image processing, but also an important part of computer vision applications. As one kind of image segmentation algorithms, fuzzy C-means clustering is an effective and concise segmentation algorithm. However, the drawback of FCM is that it is sensitive to image noise. To solve the problem, this paper designs a novel fuzzy C-mean clustering algorithm based on multi-objective optimization. We add a parameter λ to the fuzzy distance measurement formula to improve the multi-objective optimization. The parameter λ can adjust the weights of the pixel local information. In the algorithm, the local correlation of neighboring pixels is added to the improved multi-objective mathematical model to optimize the clustering cent. Two different experimental results show that the novel fuzzy C-means approach has an efficient performance and computational time while segmenting images by different type of noises.

  9. A new methodology to study customer electrocardiogram using RFM analysis and clustering

    Directory of Open Access Journals (Sweden)

    Mohammad Reza Gholamian

    2011-04-01

    Full Text Available One of the primary issues on marketing planning is to know the customer's behavioral trends. A customer's purchasing interest may fluctuate for different reasons and it is important to find the declining or increasing trends whenever they happen. It is important to study these fluctuations to improve customer relationships. There are different methods to increase the customer's willingness such as planning good promotions, an increase on advertisement, etc. This paper proposes a new methodology to measure customer's behavioral trends called customer electrocardiogram. The proposed model of this paper uses K-means clustering method with RFM analysis to study customer's fluctuations over different time frames. We also apply the proposed electrocardiogram methodology for a real-world case study of food industry and the results are discussed in details.

  10. [Optimization of cluster analysis based on drug resistance profiles of MRSA isolates].

    Science.gov (United States)

    Tani, Hiroya; Kishi, Takahiko; Gotoh, Minehiro; Yamagishi, Yuka; Mikamo, Hiroshige

    2015-12-01

    We examined 402 methicillin-resistant Staphylococcus aureus (MRSA) strains isolated from clinical specimens in our hospital between November 19, 2010 and December 27, 2011 to evaluate the similarity between cluster analysis of drug susceptibility tests and pulsed-field gel electrophoresis (PFGE). The results showed that the 402 strains tested were classified into 27 PFGE patterns (151 subtypes of patterns). Cluster analyses of drug susceptibility tests with the cut-off distance yielding a similar classification capability showed favorable results--when the MIC method was used, and minimum inhibitory concentration (MIC) values were used directly in the method, the level of agreement with PFGE was 74.2% when 15 drugs were tested. The Unweighted Pair Group Method with Arithmetic mean (UPGMA) method was effective when the cut-off distance was 16. Using the SIR method in which susceptible (S), intermediate (I), and resistant (R) were coded as 0, 2, and 3, respectively, according to the Clinical and Laboratory Standards Institute (CLSI) criteria, the level of agreement with PFGE was 75.9% when the number of drugs tested was 17, the method used for clustering was the UPGMA, and the cut-off distance was 3.6. In addition, to assess the reproducibility of the results, 10 strains were randomly sampled from the overall test and subjected to cluster analysis. This was repeated 100 times under the same conditions. The results indicated good reproducibility of the results, with the level of agreement with PFGE showing a mean of 82.0%, standard deviation of 12.1%, and mode of 90.0% for the MIC method and a mean of 80.0%, standard deviation of 13.4%, and mode of 90.0% for the SIR method. In summary, cluster analysis for drug susceptibility tests is useful for the epidemiological analysis of MRSA.

  11. Cluster-based analysis of multi-model climate ensembles

    Science.gov (United States)

    Hyde, Richard; Hossaini, Ryan; Leeson, Amber A.

    2018-06-01

    Clustering - the automated grouping of similar data - can provide powerful and unique insight into large and complex data sets, in a fast and computationally efficient manner. While clustering has been used in a variety of fields (from medical image processing to economics), its application within atmospheric science has been fairly limited to date, and the potential benefits of the application of advanced clustering techniques to climate data (both model output and observations) has yet to be fully realised. In this paper, we explore the specific application of clustering to a multi-model climate ensemble. We hypothesise that clustering techniques can provide (a) a flexible, data-driven method of testing model-observation agreement and (b) a mechanism with which to identify model development priorities. We focus our analysis on chemistry-climate model (CCM) output of tropospheric ozone - an important greenhouse gas - from the recent Atmospheric Chemistry and Climate Model Intercomparison Project (ACCMIP). Tropospheric column ozone from the ACCMIP ensemble was clustered using the Data Density based Clustering (DDC) algorithm. We find that a multi-model mean (MMM) calculated using members of the most-populous cluster identified at each location offers a reduction of up to ˜ 20 % in the global absolute mean bias between the MMM and an observed satellite-based tropospheric ozone climatology, with respect to a simple, all-model MMM. On a spatial basis, the bias is reduced at ˜ 62 % of all locations, with the largest bias reductions occurring in the Northern Hemisphere - where ozone concentrations are relatively large. However, the bias is unchanged at 9 % of all locations and increases at 29 %, particularly in the Southern Hemisphere. The latter demonstrates that although cluster-based subsampling acts to remove outlier model data, such data may in fact be closer to observed values in some locations. We further demonstrate that clustering can provide a viable and

  12. Determining the number of clusters for kernelized fuzzy C-means algorithms for automatic medical image segmentation

    Directory of Open Access Journals (Sweden)

    E.A. Zanaty

    2012-03-01

    Full Text Available In this paper, we determine the suitable validity criterion of kernelized fuzzy C-means and kernelized fuzzy C-means with spatial constraints for automatic segmentation of magnetic resonance imaging (MRI. For that; the original Euclidean distance in the FCM is replaced by a Gaussian radial basis function classifier (GRBF and the corresponding algorithms of FCM methods are derived. The derived algorithms are called as the kernelized fuzzy C-means (KFCM and kernelized fuzzy C-means with spatial constraints (SKFCM. These methods are implemented on eighteen indexes as validation to determine whether indexes are capable to acquire the optimal clusters number. The performance of segmentation is estimated by applying these methods independently on several datasets to prove which method can give good results and with which indexes. Our test spans various indexes covering the classical and the rather more recent indexes that have enjoyed noticeable success in that field. These indexes are evaluated and compared by applying them on various test images, including synthetic images corrupted with noise of varying levels, and simulated volumetric MRI datasets. Comparative analysis is also presented to show whether the validity index indicates the optimal clustering for our datasets.

  13. Topic modeling for cluster analysis of large biological and medical datasets.

    Science.gov (United States)

    Zhao, Weizhong; Zou, Wen; Chen, James J

    2014-01-01

    The big data moniker is nowhere better deserved than to describe the ever-increasing prodigiousness and complexity of biological and medical datasets. New methods are needed to generate and test hypotheses, foster biological interpretation, and build validated predictors. Although multivariate techniques such as cluster analysis may allow researchers to identify groups, or clusters, of related variables, the accuracies and effectiveness of traditional clustering methods diminish for large and hyper dimensional datasets. Topic modeling is an active research field in machine learning and has been mainly used as an analytical tool to structure large textual corpora for data mining. Its ability to reduce high dimensionality to a small number of latent variables makes it suitable as a means for clustering or overcoming clustering difficulties in large biological and medical datasets. In this study, three topic model-derived clustering methods, highest probable topic assignment, feature selection and feature extraction, are proposed and tested on the cluster analysis of three large datasets: Salmonella pulsed-field gel electrophoresis (PFGE) dataset, lung cancer dataset, and breast cancer dataset, which represent various types of large biological or medical datasets. All three various methods are shown to improve the efficacy/effectiveness of clustering results on the three datasets in comparison to traditional methods. A preferable cluster analysis method emerged for each of the three datasets on the basis of replicating known biological truths. Topic modeling could be advantageously applied to the large datasets of biological or medical research. The three proposed topic model-derived clustering methods, highest probable topic assignment, feature selection and feature extraction, yield clustering improvements for the three different data types. Clusters more efficaciously represent truthful groupings and subgroupings in the data than traditional methods, suggesting

  14. Dietary Patterns Derived by Cluster Analysis are Associated with Cognitive Function among Korean Older Adults

    Directory of Open Access Journals (Sweden)

    Jihye Kim

    2015-05-01

    Full Text Available The objective of this study was to investigate major dietary patterns among older Korean adults through cluster analysis and to determine an association between dietary patterns and cognitive function. This is a cross-sectional study. The data from the Korean Multi-Rural Communities Cohort Study was used. Participants included 765 participants aged 60 years and over. A quantitative food frequency questionnaire with 106 items was used to investigate dietary intake. The Korean version of the MMSE-KC (Mini-Mental Status Examination–Korean version was used to assess cognitive function. Two major dietary patterns were identified using K-means cluster analysis. The “MFDF” dietary pattern indicated high consumption of Multigrain rice, Fish, Dairy products, Fruits and fruit juices, while the “WNC” dietary pattern referred to higher intakes of White rice, Noodles, and Coffee. Means of the total MMSE-KC and orientation score of the participants in the MFDF dietary pattern were higher than those of the WNC dietary pattern. Compared with the WNC dietary pattern, the MFDF dietary pattern showed a lower risk of cognitive impairment after adjusting for covariates (OR 0.64, 95% CI 0.44–0.94. The MFDF dietary pattern, with high consumption of multigrain rice, fish, dairy products, and fruits may be related to better cognition among Korean older adults.

  15. A Comparison between Standard and Functional Clustering Methodologies: Application to Agricultural Fields for Yield Pattern Assessment

    Directory of Open Access Journals (Sweden)

    Simone Pascucci

    2018-04-01

    Full Text Available The recognition of spatial patterns within agricultural fields, presenting similar yield potential areas, stable through time, is very important for optimizing agricultural practices. This study proposes the evaluation of different clustering methodologies applied to multispectral satellite time series for retrieving temporally stable (constant patterns in agricultural fields, related to within-field yield spatial distribution. The ability of different clustering procedures for the recognition and mapping of constant patterns in fields of cereal crops was assessed. Crop vigor patterns, considered to be related to soils characteristics, and possibly indicative of yield potential, were derived by applying the different clustering algorithms to time series of Landsat images acquired on 94 agricultural fields near Rome (Italy. Two different approaches were applied and validated using Landsat 7 and 8 archived imagery. The first approach automatically extracts and calculates for each field of interest (FOI the Normalized Difference Vegetation Index (NDVI, then exploits the standard K-means clustering algorithm to derive constant patterns at the field level. The second approach applies novel clustering procedures directly to spectral reflectance time series, in particular: (1 standard K-means; (2 functional K-means; (3 multivariate functional principal components clustering analysis; (4 hierarchical clustering. The different approaches were validated through cluster accuracy estimates on a reference set of FOIs for which yield maps were available for some years. Results show that multivariate functional principal components clustering, with an a priori determination of the optimal number of classes for each FOI, provides a better accuracy than those of standard clustering algorithms. The proposed novel functional clustering methodologies are effective and efficient for constant pattern retrieval and can be used for a sustainable management of

  16. Ionization of nitrogen cluster beam

    International Nuclear Information System (INIS)

    Yano, Katsuki; Be, S.H.; Enjoji, Hiroshi; Okamoto, Kosuke

    1975-01-01

    A nitrogen cluster beam (neutral particle intensity of 28.6 mAsub(eq)) is ionized by electron collisions in a Bayard-Alpert gauge type ionizer. The extraction efficiency of about 65% is obtained at an electron current of 10 mA with an energy of 50 eV. The mean cluster size produced at a pressure of 663 Torr and temperature of 77.3 K is 2x10 5 molecules per cluster. By the Coulomb repulsion force, multiply ionized cluster ions are broken up into smaller fragments and the cluster ion size reduces to one-fourth at an electron current of 15 mA. Mean neutral cluster sizes depend strongly on the initial degree of saturation PHI 0 and are 2x10 5 , 7x10 4 and 3x10 4 molecules per cluster at PHI 0 's of 0.87, 0.66 and 0.39, respectively. (auth.)

  17. Fuzzy cluster means algorithm for the diagnosis of confusable disease

    African Journals Online (AJOL)

    ... end platform while Microsoft Access was used as the database application. The system gives a measure of each disease within a set of confusable disease. The proposed system had a classification accuracy of 60%. Keywords: Artificial Intelligence, expert system Fuzzy clustermeans Algorithm, physician, Diagnosis ...

  18. Sm cluster superlattice on graphene/Ir(111)

    Science.gov (United States)

    Mousadakos, Dimitris; Pivetta, Marina; Brune, Harald; Rusponi, Stefano

    2017-12-01

    We report on the first example of a self-assembled rare earth cluster superlattice. As a template, we use the moiré pattern formed by graphene on Ir(111); its lattice constant of 2.52 nm defines the interparticle distance. The samarium cluster superlattice forms for substrate temperatures during deposition ranging from 80 to 110 K, and it is stable upon annealing to 140 K. By varying the samarium coverage, the mean cluster size can be increased up to 50 atoms, without affecting the long-range order. The spatial order and the width of the cluster size distribution match the best examples of metal cluster superlattices grown by atomic beam epitaxy on template surfaces.

  19. Clustering and training set selection methods for improving the accuracy of quantitative laser induced breakdown spectroscopy

    Energy Technology Data Exchange (ETDEWEB)

    Anderson, Ryan B., E-mail: randerson@astro.cornell.edu [Cornell University Department of Astronomy, 406 Space Sciences Building, Ithaca, NY 14853 (United States); Bell, James F., E-mail: Jim.Bell@asu.edu [Arizona State University School of Earth and Space Exploration, Bldg.: INTDS-A, Room: 115B, Box 871404, Tempe, AZ 85287 (United States); Wiens, Roger C., E-mail: rwiens@lanl.gov [Los Alamos National Laboratory, P.O. Box 1663 MS J565, Los Alamos, NM 87545 (United States); Morris, Richard V., E-mail: richard.v.morris@nasa.gov [NASA Johnson Space Center, 2101 NASA Parkway, Houston, TX 77058 (United States); Clegg, Samuel M., E-mail: sclegg@lanl.gov [Los Alamos National Laboratory, P.O. Box 1663 MS J565, Los Alamos, NM 87545 (United States)

    2012-04-15

    We investigated five clustering and training set selection methods to improve the accuracy of quantitative chemical analysis of geologic samples by laser induced breakdown spectroscopy (LIBS) using partial least squares (PLS) regression. The LIBS spectra were previously acquired for 195 rock slabs and 31 pressed powder geostandards under 7 Torr CO{sub 2} at a stand-off distance of 7 m at 17 mJ per pulse to simulate the operational conditions of the ChemCam LIBS instrument on the Mars Science Laboratory Curiosity rover. The clustering and training set selection methods, which do not require prior knowledge of the chemical composition of the test-set samples, are based on grouping similar spectra and selecting appropriate training spectra for the partial least squares (PLS2) model. These methods were: (1) hierarchical clustering of the full set of training spectra and selection of a subset for use in training; (2) k-means clustering of all spectra and generation of PLS2 models based on the training samples within each cluster; (3) iterative use of PLS2 to predict sample composition and k-means clustering of the predicted compositions to subdivide the groups of spectra; (4) soft independent modeling of class analogy (SIMCA) classification of spectra, and generation of PLS2 models based on the training samples within each class; (5) use of Bayesian information criteria (BIC) to determine an optimal number of clusters and generation of PLS2 models based on the training samples within each cluster. The iterative method and the k-means method using 5 clusters showed the best performance, improving the absolute quadrature root mean squared error (RMSE) by {approx} 3 wt.%. The statistical significance of these improvements was {approx} 85%. Our results show that although clustering methods can modestly improve results, a large and diverse training set is the most reliable way to improve the accuracy of quantitative LIBS. In particular, additional sulfate standards and

  20. Structures of three different neutral polysaccharides of Acinetobacter baumannii, NIPH190, NIPH201, and NIPH615, assigned to K30, K45, and K48 capsule types, respectively, based on capsule biosynthesis gene clusters.

    Science.gov (United States)

    Shashkov, Alexander S; Kenyon, Johanna J; Arbatsky, Nikolay P; Shneider, Mikhail M; Popova, Anastasiya V; Miroshnikov, Konstantin A; Volozhantsev, Nikolay V; Knirel, Yuriy A

    2015-11-19

    Neutral capsular polysaccharides (CPSs) were isolated from Acinetobacter baumannii NIPH190, NIPH201, and NIPH615. The CPSs were found to contain common monosaccharides only and to be branched with a side-chain 1→3-linked β-d-glucopyranose residue. Structures of the oligosaccharide repeat units (K units) of the CPSs were elucidated by 1D and 2D (1)H and (13)C NMR spectroscopy. Novel CPS biosynthesis gene clusters, designated KL30, KL45, and KL48, were found at the K locus in the genome sequences of NIPH190, NIPH201, and NIPH615, respectively. The genetic content of each gene cluster correlated with the structure of the CPS unit established, and therefore, the capsular types of the strains studied were designated as K30, K45, and K48, respectively. The initiating sugar of each K unit was predicted, and glycosyltransferases encoded by each gene cluster were assigned to the formation of the linkages between sugars in the corresponding K unit. Copyright © 2015 Elsevier Ltd. All rights reserved.