Kernel method-based fuzzy clustering algorithm
Institute of Scientific and Technical Information of China (English)
Wu Zhongdong; Gao Xinbo; Xie Weixin; Yu Jianping
2005-01-01
The fuzzy C-means (FCM) clustering algorithm is extended to the fuzzy kernel C-means (FKCM) clustering algorithm in order to perform effective cluster analysis on diverse data structures, such as non-hyperspherical data, noisy data, data mixing heterogeneous cluster prototypes, and asymmetric data. Based on a Mercer kernel, the FKCM clustering algorithm is derived by combining the FCM algorithm with the kernel method. Experiments on synthetic and real data show that, in contrast to the FCM algorithm, FKCM is more universal and can effectively perform unsupervised analysis of datasets with varied structures. Kernel-based clustering can therefore be expected to become an important research direction in fuzzy clustering analysis.
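The kernelized update above can be sketched in a few lines. The following is a minimal, illustrative KFCM using a Gaussian (Mercer) kernel and the kernel-induced distance d²(x, v) = 2(1 − K(x, v)), with prototypes kept in the input space; the paper's exact derivation may differ, and all parameter values here are assumptions:

```python
import numpy as np

def kfcm(X, c=2, m=2.0, sigma=1.0, n_iter=50):
    """Sketch of kernel fuzzy c-means with a Gaussian kernel.

    Uses d^2(x, v) = 2 * (1 - K(x, v)) as the kernel-induced distance;
    this is one common KFCM formulation, not necessarily the paper's.
    """
    # farthest-point initialization of the c prototypes
    V = [X[0]]
    for _ in range(c - 1):
        d = np.min([((X - v) ** 2).sum(1) for v in V], axis=0)
        V.append(X[np.argmax(d)])
    V = np.array(V)
    for _ in range(n_iter):
        # Gaussian kernel between each prototype and each point, shape (c, n)
        K = np.exp(-((V[:, None, :] - X[None, :, :]) ** 2).sum(-1) / (2 * sigma ** 2))
        d2 = np.maximum(2.0 * (1.0 - K), 1e-12)      # kernel-induced squared distance
        inv = d2 ** (-1.0 / (m - 1.0))
        U = inv / inv.sum(axis=0)                    # fuzzy memberships, columns sum to 1
        W = (U ** m) * K                             # kernel-weighted memberships
        V = (W @ X) / W.sum(axis=1, keepdims=True)   # prototype update
    return U, V

# toy demo: two well-separated groups
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.2, (10, 2)), rng.normal(5.0, 0.2, (10, 2))])
U, V = kfcm(X, c=2, sigma=1.0)
labels = U.argmax(axis=0)
```

Hardening the memberships with `argmax` recovers a crisp partition of the two groups.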
A Clustering Method Based on the Maximum Entropy Principle
Directory of Open Access Journals (Sweden)
Edwin Aldana-Bobadilla
2015-01-01
Full Text Available Clustering is an unsupervised process for determining which unlabeled objects in a set share interesting properties. The objects are grouped into k subsets (clusters) whose elements optimize a proximity measure. Methods based on information theory have proven to be feasible alternatives. They rest on the assumption that a cluster is a subset with the minimal possible degree of “disorder”, and they attempt to minimize the entropy of each cluster. We propose a clustering method based on the maximum entropy principle. Such a method explores the space of all possible probability distributions of the data to find one that maximizes the entropy subject to extra conditions based on prior information about the clusters. The prior information rests on the assumption that the elements of a cluster are “similar” to each other in accordance with some statistical measure. As a consequence of this principle, those distributions of high entropy that satisfy the conditions are favored over others. Searching the space to find the optimal distribution of objects in the clusters is a hard combinatorial problem, which rules out the use of traditional optimization techniques; genetic algorithms are a good alternative for solving it. We benchmark our method against the best theoretical performance, given by the Bayes classifier when data are normally distributed, and against a multilayer perceptron network, which offers the best practical performance when data are not normal. In general, a supervised classification method will outperform an unsupervised one, since the elements of the classes are known a priori. In what follows, we show that our method’s effectiveness is comparable to that of a supervised one, which clearly exhibits its strength.
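The entropy objective described above can be illustrated without the full genetic algorithm. The sketch below uses an assumed penalty weight `lam` standing in for the prior-information constraints, and shows why a balanced, compact partition scores higher than a degenerate one:

```python
import numpy as np

def partition_entropy(labels):
    """Shannon entropy of the cluster-size distribution."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log(p)).sum())

def within_dispersion(X, labels):
    """Mean squared distance of each point to its cluster centroid."""
    total = 0.0
    for k in np.unique(labels):
        pts = X[labels == k]
        total += ((pts - pts.mean(axis=0)) ** 2).sum()
    return total / len(X)

def fitness(X, labels, lam=0.5):
    # maximize entropy while softly enforcing the similarity constraint;
    # lam is an illustrative trade-off weight, not from the paper
    return partition_entropy(labels) - lam * within_dispersion(X, labels)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (10, 2)), rng.normal(4, 0.3, (10, 2))])
good = np.array([0] * 10 + [1] * 10)      # the natural partition
bad = np.zeros(20, dtype=int)             # everything in one cluster
```

A genetic algorithm would search over candidate label vectors using a fitness of this shape.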
Agent-based method for distributed clustering of textual information
Potok, Thomas E. [Oak Ridge, TN; Reed, Joel W. [Knoxville, TN; Elmore, Mark T. [Oak Ridge, TN; Treadwell, Jim N. [Louisville, TN
2010-09-28
A computer method and system for storing, retrieving and displaying information has a multiplexing agent (20) that calculates a new document vector (25) for a new document (21) to be added to the system and transmits the new document vector (25) to master cluster agents (22) and cluster agents (23) for evaluation. These agents (22, 23) perform the evaluation and return values upstream to the multiplexing agent (20) based on the similarity of the document to documents stored under their control. The multiplexing agent (20) then sends the document (21) and the document vector (25) to the master cluster agent (22), which then forwards it to a cluster agent (23) or creates a new cluster agent (23) to manage the document (21). The system also searches for stored documents according to a search query having at least one term and identifying the documents found in the search, and displays the documents in a clustering display (80) of similarity so as to indicate similarity of the documents to each other.
Super pixel density based clustering automatic image classification method
Xu, Mingxing; Zhang, Chuan; Zhang, Tianxu
2015-12-01
Image classification is an important means of image segmentation and data mining, and achieving rapid automated image classification has been a focus of research. In this paper, an automatic image classification and outlier identification method is proposed, based on the densities of cluster centers over superpixels. Pixel coordinates and gray values are used to compute density and distance, achieving automatic classification and outlier extraction. Because a large number of pixels dramatically increases the computational complexity, the image is preprocessed into a small number of superpixel sub-blocks before the density and distance calculations. A normalized density-and-distance decision rule is designed to select cluster centers automatically, whereby the image is classified and outliers are identified. Extensive experiments show that our method requires no human intervention, computes faster than the density clustering algorithm on raw pixels, and can effectively perform automated classification and outlier extraction.
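A common way to realize the density-and-distance idea described above (without the superpixel preprocessing) is density-peaks clustering, where each point gets a local density rho and a distance delta to its nearest denser point, and centers are the points maximizing rho·delta. The following is a simplified sketch; `dc` and `k` are assumed parameters:

```python
import numpy as np

def density_peaks(X, dc, k):
    """Simplified density-peaks clustering: compute rho and delta per point,
    pick the k points with the largest rho*delta as centers, and let every
    other point inherit the label of its nearest denser neighbor."""
    n = len(X)
    D = np.sqrt(((X[:, None] - X[None]) ** 2).sum(-1))
    rho = (D < dc).sum(axis=1) - 1                  # cutoff-kernel density (minus self)
    order = np.argsort(-rho)                        # points by decreasing density
    delta = np.zeros(n)
    parent = np.full(n, -1)
    delta[order[0]] = D[order[0]].max()             # densest point: max distance
    for rank in range(1, n):
        p = order[rank]
        denser = order[:rank]
        j = denser[np.argmin(D[p, denser])]         # nearest point of higher density
        delta[p], parent[p] = D[p, j], j
    centers = np.argsort(-(rho * delta))[:k]        # peaks of the decision graph
    labels = np.full(n, -1)
    labels[centers] = np.arange(k)
    for p in order:                                 # inherit label from denser parent
        if labels[p] == -1 and parent[p] >= 0:
            labels[p] = labels[parent[p]]
    return labels, centers

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0.0, 0.2, (10, 2)), rng.normal(5.0, 0.2, (10, 2))])
labels, centers = density_peaks(X, dc=0.5, k=2)
```

In the paper's setting the rows of `X` would be superpixel features rather than raw points.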
Density-based clustering method in the moving object database
Institute of Scientific and Technical Information of China (English)
ZHOU Xing; XIANG Shu; GE Jun-wei; LIU Zhao-hong; BAE Hae-young
2004-01-01
With the rapid advance of wireless communication, tracking the positions of moving objects is becoming increasingly feasible and necessary. Because a large number of people use mobile phones, we must handle a large moving object database and the problems that come with it: how can we provide customers with high-quality service, that is, how can we answer so many queries in as little time as possible? Because of the large amount of data, the gap between CPU speed and the size of main memory has been growing considerably. One way to reduce query-handling time is to reduce the number of I/O operations between the buffer and secondary storage, and an effective clustering of the objects can minimize this I/O cost. In this paper, in view of the characteristics of the moving object database, we analyze the objects in the buffer according to their mappings in two-dimensional coordinates, and develop a density-based clustering method to effectively reorganize the clusters. This new mechanism leads to lower I/O cost and more efficient responses to queries.
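A minimal density-based clustering sketch in the spirit described above, here a plain DBSCAN over 2-D object positions; the paper's buffer-specific mechanism is not reproduced, and `eps` and `min_pts` are assumptions:

```python
import numpy as np
from collections import deque

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN over 2-D positions; label -1 marks noise objects."""
    n = len(points)
    D = np.sqrt(((points[:, None] - points[None]) ** 2).sum(-1))
    neigh = [np.flatnonzero(D[i] <= eps) for i in range(n)]   # eps-neighborhoods (incl. self)
    UNSEEN, NOISE = -2, -1
    labels = np.full(n, UNSEEN)
    cluster = 0
    for i in range(n):
        if labels[i] != UNSEEN:
            continue
        if len(neigh[i]) < min_pts:                 # not a core point (for now)
            labels[i] = NOISE
            continue
        labels[i] = cluster                         # start a new cluster
        queue = deque(neigh[i])
        while queue:
            j = queue.popleft()
            if labels[j] == NOISE:                  # border point: claim it
                labels[j] = cluster
            if labels[j] != UNSEEN:
                continue
            labels[j] = cluster
            if len(neigh[j]) >= min_pts:            # expand only from core points
                queue.extend(neigh[j])
        cluster += 1
    return labels

rng = np.random.default_rng(7)
pos = np.vstack([
    rng.normal((0.0, 0.0), 0.05, (8, 2)),          # objects gathered in one area
    rng.normal((3.0, 3.0), 0.05, (8, 2)),          # objects gathered in another
    [[10.0, 10.0]],                                # a stray object far from everyone
])
labels = dbscan(pos, eps=0.5, min_pts=4)
```

Objects that end up in the same dense region can then be stored in the same disk pages to cut I/O.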
Urban Fire Risk Clustering Method Based on Fire Statistics
Institute of Scientific and Technical Information of China (English)
WU Lizhi; REN Aizhu
2008-01-01
Fire statistics and fire analysis have become important ways for us to understand the laws of fire, prevent the occurrence of fire, and improve the ability to control fire. Based on existing fire statistics, a weighted fire risk calculation method characterized by the number of fire occurrences, direct economic losses, and fire casualties is put forward. On the basis of this method, and with an improved K-means clustering algorithm, this paper establishes a fire risk K-means clustering model, which can better solve the problem of automatically classifying fire risk. Fire risk clusters should be formed using the absolute distance to the target instead of the relative distance used in the traditional clustering algorithm. Finally, to apply the established model, this paper carries out fire risk clustering on fire statistics of Shenyang, China, from January 2000 to December 2004. This research provides technical support for urban fire management.
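The weighted-risk-plus-clustering pipeline might look as follows. The district statistics and indicator weights below are purely illustrative, and a plain 1-D k-means on the absolute risk scores stands in for the paper's improved variant:

```python
import numpy as np

# hypothetical yearly fire statistics per district:
# [number of fires, direct economic loss, casualties] (invented values)
stats = np.array([
    [1, 10, 0], [2, 12, 0], [1, 9, 0],        # low-risk districts
    [10, 100, 2], [12, 110, 2], [11, 95, 3],  # medium-risk districts
    [30, 400, 8], [28, 380, 7], [32, 420, 8], # high-risk districts
], dtype=float)

# min-max normalize each indicator, then apply illustrative weights
weights = np.array([0.4, 0.4, 0.2])
norm = (stats - stats.min(0)) / (stats.max(0) - stats.min(0))
risk = norm @ weights                          # weighted fire-risk score per district

def kmeans_1d(x, k, iters=50):
    """Lloyd's k-means on scalar risk scores, using absolute distance."""
    centers = np.linspace(x.min(), x.max(), k)
    for _ in range(iters):
        labels = np.argmin(np.abs(x[:, None] - centers[None, :]), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = x[labels == j].mean()
    return labels, centers

labels, centers = kmeans_1d(risk, 3)           # three risk levels
```

The resulting clusters can be read as low/medium/high risk levels by sorting the cluster centers.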
Color Image Segmentation Method Based on Improved Spectral Clustering Algorithm
Directory of Open Access Journals (Sweden)
Dong Qin
2014-08-01
Full Text Available Considering the high sparsity of image data and the problem of determining the number of clusters, we put forward a color image segmentation algorithm that combines semi-supervised machine learning with spectral graph theory. Drawing on related theories and methods of spectral clustering, we introduce the concept of information entropy to design a method that automatically optimizes the scale parameter value, avoiding the instability of clustering results caused by manually chosen scale parameters. In addition, we mine the prior information available in the abundant non-rare-class data and apply a semi-supervised algorithm to improve clustering performance on rare classes. We also use the labeled data to compute the similarity matrix and perform clustering with the FKCM algorithm. Experiments on standard datasets and on image segmentation demonstrate that our algorithm overcomes the defects of traditional spectral clustering methods, which are sensitive to outliers, prone to falling into local optima, and poor in convergence rate.
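A bare-bones spectral bipartition in the spirit of the method above; the paper's entropy-based scale selection is replaced here by the common median-distance heuristic as a stand-in for automatic tuning, and the semi-supervised and FKCM refinements are omitted:

```python
import numpy as np

def spectral_bipartition(X):
    """Two-way spectral clustering sketch: Gaussian affinity with an
    automatically chosen scale, normalized Laplacian, sign split of the
    second-smallest eigenvector (the Fiedler vector)."""
    D2 = ((X[:, None] - X[None]) ** 2).sum(-1)
    sigma = np.median(np.sqrt(D2[D2 > 0]))        # median heuristic, not the paper's entropy rule
    W = np.exp(-D2 / (2 * sigma ** 2))            # Gaussian affinity matrix
    np.fill_diagonal(W, 0.0)
    d = W.sum(axis=1)
    L = np.eye(len(X)) - W / np.sqrt(d[:, None] * d[None, :])  # normalized Laplacian
    _, vecs = np.linalg.eigh(L)                   # eigenvalues in ascending order
    fiedler = vecs[:, 1]                          # eigenvector of 2nd-smallest eigenvalue
    return (fiedler > 0).astype(int)              # sign gives the bipartition

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0.0, 0.3, (10, 2)), rng.normal(6.0, 0.3, (10, 2))])
labels = spectral_bipartition(X)
```

For image segmentation, the rows of `X` would be per-pixel (or per-region) color features.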
Comparison of chemical clustering methods using graph- and fingerprint-based similarity measures
Raymond, J.W.; Blankley, C.J.; Willett, P.
2003-01-01
This paper compares several published methods for clustering chemical structures, using both graph- and fingerprint-based similarity measures. The clusterings from each method were compared to determine the degree of cluster overlap. Each method was also evaluated on how well it grouped structures into clusters possessing a non-trivial substructural commonality. The methods which employ adjustable parameters were tested to determine the stability of each parameter for datasets of varying size...
Directory of Open Access Journals (Sweden)
Ichiro IWASAKI
2010-06-01
Full Text Available Michael Porter’s concept of competitive advantages emphasizes the importance of regional cooperation of various actors in order to gain competitiveness on globalized markets. Foreign investors may play an important role in forming such cooperation networks. Their local suppliers tend to concentrate regionally. They can form, together with local institutions of education, research, financial and other services, development agencies, the nucleus of cooperative clusters. This paper deals with the relationship between supplier networks and clusters. Two main issues are discussed in more detail: the interest of multinational companies in entering regional clusters and the spillover effects that may stem from their participation. After the discussion on the theoretical background, the paper introduces a relatively new analytical method: “cluster mapping” - a method that can spot regional hot spots of specific economic activities with cluster building potential. Experience with the method was gathered in the US and in the European Union. After the discussion on the existing empirical evidence, the authors introduce their own cluster mapping results, which they obtained by using a refined version of the original methodology.
Image Clustering Method Based on Density Maps Derived from Self-Organizing Mapping: SOM
Directory of Open Access Journals (Sweden)
Kohei Arai
2012-07-01
Full Text Available A new method for image clustering with density maps derived from Self-Organizing Maps (SOM) is proposed, together with a clarification of the learning processes during the construction of clusters. The proposed SOM-based image clustering method is found to give much better clustering results for both simulated and real satellite imagery data. The separability among clusters of the proposed method is also found to be 16% greater than that of the existing k-means clustering. According to the experimental results with a Landsat-5 TM image, the SOM learning processes take more than 20,000 iterations to converge.
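A minimal SOM, trained on toy 2-D data and used to assign samples to map units, can illustrate the learning process discussed above; the density-map step of the proposed method is not reproduced, and the grid size, learning rate, and radius schedules are assumptions:

```python
import numpy as np

def train_som(X, grid=(4, 4), iters=800, seed=0):
    """Minimal Self-Organizing Map: repeatedly pick a sample, find its
    best-matching unit (BMU), and pull the BMU and its map neighbors
    toward the sample with decaying strength and radius."""
    rng = np.random.default_rng(seed)
    h, w = grid
    units = rng.random((h * w, X.shape[1]))                 # unit weight vectors
    coords = np.array([(i, j) for i in range(h) for j in range(w)], dtype=float)
    for t in range(iters):
        x = X[rng.integers(len(X))]                         # random training sample
        bmu = np.argmin(((units - x) ** 2).sum(axis=1))     # best-matching unit
        frac = 1.0 - t / iters
        lr = 0.5 * frac                                     # decaying learning rate
        radius = max(0.5, 2.0 * frac)                       # shrinking neighborhood
        g = np.exp(-((coords - coords[bmu]) ** 2).sum(1) / (2 * radius ** 2))
        units += lr * g[:, None] * (x - units)              # pull units toward x
    return units

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0.1, 0.03, (12, 2)), rng.normal(0.9, 0.03, (12, 2))])
units = train_som(X)
bmus = np.array([np.argmin(((units - x) ** 2).sum(1)) for x in X])
```

After training, samples from distinct groups map to distinct units; the paper builds its density maps on top of such unit assignments.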
Clustering Based Classification in Data Mining Method Recommendation
Czech Academy of Sciences Publication Activity Database
Kazík, O.; Pešková, K.; Šmíd, J.; Neruda, Roman
Vol. 2. Los Alamitos: IEEE Computer Society, 2013 - (Wani, M.; Tecuci, G.; Boicu, M.; Kubát, M.; Khoshgoftaar, T.; Seliya, N.), s. 356-361 ISBN 978-0-7695-5144-9. [ICMLA 2013. International Conference on Machine Learning and Applications /12./. Miami (US), 04.12.2013-07.12.2013] R&D Projects: GA ČR GAP202/11/1368; GA MŠk(CZ) LD13002 Grant ostatní: GA UK(CZ) 29612; SVV(CZ) 265314 Institutional support: RVO:67985807 Keywords : metalearning * clustering * data mining * method recommendation Subject RIV: IN - Informatics, Computer Science
AN ADAPTIVE GRID-BASED METHOD FOR CLUSTERING MULTIDIMENSIONAL ONLINE DATA STREAMS
Directory of Open Access Journals (Sweden)
Toktam Dehghani
2012-10-01
Full Text Available Clustering is an important task in mining evolving data streams. Many data streams are high dimensional in nature, and clustering in a high-dimensional space is a complex problem, inherently more so for data streams. Most data stream clustering methods cannot deal with high-dimensional data streams and therefore sacrifice the accuracy of the clusters. To solve this problem, we propose an adaptive grid-based clustering method. Our focus is on providing up-to-date, arbitrarily shaped clusters while improving the processing time and bounding the amount of memory usage. In our method (B+C tree), a structure called the “B+cell tree” is used to keep the recent information of a data stream. To reduce the complexity of the clustering, a structure called the “cluster tree” is proposed to maintain multidimensional clusters. A cluster tree yields high-quality clusters by keeping the boundaries of clusters in a semi-optimal way; it captures the dynamic changes of data streams and adjusts the clusters accordingly. Our performance study over a number of real and synthetic data streams demonstrates the scalability of the algorithm in the number of dimensions and the amount of data, without sacrificing the accuracy of the identified clusters.
Farthest-Point Heuristic based Initialization Methods for K-Modes Clustering
He, Zengyou
2006-01-01
The k-modes algorithm has become a popular technique for solving categorical data clustering problems in different application domains. However, the algorithm requires random selection of initial points for the clusters, and different initial points often lead to considerably different clustering results. In this paper we present an experimental study on applying a farthest-point heuristic based initialization method to k-modes clustering to improve its performance. Experiments show that new initia...
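The farthest-point heuristic over Hamming distance can be sketched directly; seeding from the first object is a simplification here, and the paper's exact seeding rule may differ:

```python
import numpy as np

def farthest_point_modes(X, k):
    """Farthest-point heuristic for initial k-modes centers on categorical
    data: repeatedly add the object whose Hamming distance to its nearest
    already-chosen center is largest."""
    centers = [0]                                   # simplification: seed with the first object
    for _ in range(k - 1):
        # each object's Hamming distance to its nearest chosen center
        d = np.min(np.stack([(X != X[c]).sum(axis=1) for c in centers]), axis=0)
        centers.append(int(np.argmax(d)))           # pick the farthest object
    return X[centers]

# toy categorical data with two obvious groups
X = np.array([
    ['a', 'x', 'p'], ['a', 'x', 'p'], ['a', 'y', 'p'],
    ['b', 'y', 'q'], ['b', 'y', 'q'], ['b', 'x', 'q'],
])
modes = farthest_point_modes(X, 2)
```

The two selected initial modes land in different groups, which is exactly the spread that random initialization fails to guarantee.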
Šubelj, Lovro; Waltman, Ludo
2015-01-01
Clustering methods are applied regularly in the bibliometric literature to identify research areas or scientific fields. These methods are for instance used to group publications into clusters based on their relations in a citation network. In the network science literature, many clustering methods, often referred to as graph partitioning or community detection techniques, have been developed. Focusing on the problem of clustering the publications in a citation network, we present a systematic comparison of the performance of a large number of these clustering methods. Using a number of different citation networks, some of them relatively small and others very large, we extensively study the statistical properties of the results provided by different methods. In addition, we also carry out an expert-based assessment of the results produced by different methods. The expert-based assessment focuses on publications in the field of scientometrics. Our findings seem to indicate that there is a trade-off between different properties that may be considered desirable for a good clustering of publications. Overall, map equation methods appear to perform best in our analysis, suggesting that these methods deserve more attention from the bibliometric community.
Smooth Splicing: A Robust SNN-Based Method for Clustering High-Dimensional Data
Directory of Open Access Journals (Sweden)
JingDong Tan
2013-01-01
Full Text Available Shared nearest neighbor (SNN) similarity is a novel similarity measure that can overcome two difficulties: the low similarity between samples and the differing densities of classes. At present, there are two popular SNN-similarity-based clustering methods: JP clustering and SNN density-based clustering. Their clustering results depend heavily on the weight of a single edge, and thus they are very vulnerable. Motivated by the idea of smooth splicing in computational geometry, the authors design a novel SNN-similarity-based clustering algorithm within the framework of graph theory. Since it inherits a complementary intensity-smoothness principle, its generalization ability surpasses that of the two previously mentioned methods. Experiments on text datasets show its effectiveness.
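The shared-nearest-neighbor similarity itself is easy to sketch; on two well-separated groups, cross-group SNN similarity drops to zero while within-group similarity stays positive, which is what makes it robust to differing class densities. The parameter `k` below is an assumption:

```python
import numpy as np

def snn_similarity(X, k):
    """Shared-nearest-neighbor similarity: the number of k-nearest
    neighbors two points have in common."""
    D = ((X[:, None] - X[None]) ** 2).sum(-1)
    np.fill_diagonal(D, np.inf)                   # exclude self from the neighbor lists
    knn = np.argsort(D, axis=1)[:, :k]            # indices of each point's k nearest neighbors
    sets = [set(row.tolist()) for row in knn]
    n = len(X)
    S = np.zeros((n, n), dtype=int)
    for i in range(n):
        for j in range(i + 1, n):
            S[i, j] = S[j, i] = len(sets[i] & sets[j])   # count shared neighbors
    return S

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0.0, 0.2, (6, 2)), rng.normal(5.0, 0.2, (6, 2))])
S = snn_similarity(X, k=3)
```

JP clustering and SNN density-based clustering both build their graphs from a matrix of exactly this form.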
Clustering with Spectral Methods
Gaertler, Marco
2002-01-01
Grouping and sorting are problems with a great tradition in the history of mankind. Clustering and cluster analysis form a small aspect of this wide spectrum, but these topics have applications in most scientific disciplines. Graph clustering is, in turn, a small fragment of the clustering area; nevertheless, it has the potential for new, pioneering, and innovative methods. One such method is Markov Clustering, presented by van Dongen in 'Graph Clustering by Flow Simulation'. We investigated the qu...
Šubelj, Lovro; van Eck, Nees Jan; Waltman, Ludo
2016-01-01
Clustering methods are applied regularly in the bibliometric literature to identify research areas or scientific fields. These methods are for instance used to group publications into clusters based on their relations in a citation network. In the network science literature, many clustering methods, often referred to as graph partitioning or community detection techniques, have been developed. Focusing on the problem of clustering the publications in a citation network, we present a systematic comparison of the performance of a large number of these clustering methods. Using a number of different citation networks, some of them relatively small and others very large, we extensively study the statistical properties of the results provided by different methods. In addition, we also carry out an expert-based assessment of the results produced by different methods. The expert-based assessment focuses on publications in the field of scientometrics. Our findings seem to indicate that there is a trade-off between different properties that may be considered desirable for a good clustering of publications. Overall, map equation methods appear to perform best in our analysis, suggesting that these methods deserve more attention from the bibliometric community. PMID:27124610
A clustering based method to evaluate soil corrosivity for pipeline external integrity management
International Nuclear Information System (INIS)
One important category of transportation infrastructure is underground pipelines. Corrosion of these buried pipeline systems may cause pipeline failures with the attendant hazards of property loss and fatalities. Therefore, developing the capability to estimate the soil corrosivity is important for designing and preserving materials and for risk assessment. The deterioration rate of metal is highly influenced by the physicochemical characteristics of a material and the environment of its surroundings. In this study, the field data obtained from the southeast region of Mexico was examined using various data mining techniques to determine the usefulness of these techniques for clustering soil corrosivity level. Specifically, the soil was classified into different corrosivity level clusters by k-means and Gaussian mixture model (GMM). In terms of physical space, GMM shows better separability; therefore, the distributions of the material loss of the buried petroleum pipeline walls were estimated via the empirical density within GMM clusters. The soil corrosivity levels of the clusters were determined based on the medians of metal loss. The proposed clustering method was demonstrated to be capable of classifying the soil into different levels of corrosivity severity. - Highlights: • The clustering approach is applied to the data extracted from a real-life pipeline system. • Soil properties in the right-of-way are analyzed via clustering techniques to assess corrosivity. • GMM is selected as the preferred method for detecting the hidden pattern of in-situ data. • K–W test is performed for significant difference of corrosivity level between clusters
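A 1-D Gaussian mixture fitted by EM, with clusters ranked by mean to mimic the corrosivity-level assignment described above, can sketch the GMM step; the metal-loss values below are synthetic stand-ins for the field data, and the initialization scheme is an assumption:

```python
import numpy as np

def em_gmm_1d(x, k=2, iters=200):
    """EM for a 1-D Gaussian mixture model (illustrative sketch)."""
    mu = np.quantile(x, np.linspace(0, 1, k))     # deterministic, spread-out initial means
    var = np.full(k, x.var())
    w = np.full(k, 1.0 / k)                       # mixture weights
    for _ in range(iters):
        # E-step: responsibility of each component for each sample
        dens = w * np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, and variances
        nk = resp.sum(axis=0)
        w = nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-9
    return mu, var, w, resp

rng = np.random.default_rng(6)
# synthetic wall metal-loss measurements: a mild group and a severe group
loss = np.concatenate([rng.normal(1.0, 0.1, 15), rng.normal(5.0, 0.3, 15)])
mu, var, w, resp = em_gmm_1d(loss, k=2)
levels = resp.argmax(axis=1)                      # cluster per pipeline segment
severity_order = np.argsort(mu)                   # rank clusters: higher mean = more corrosive
```

Ranking the fitted clusters by mean (or median) metal loss reproduces the severity-level assignment described in the study.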
A method for context-based adaptive QRS clustering in real-time
Castro, Daniel; Presedo, Jesús
2014-01-01
Continuous follow-up of heart condition through long-term electrocardiogram monitoring is an invaluable tool for diagnosing some cardiac arrhythmias. In such context, providing tools for fast locating alterations of normal conduction patterns is mandatory and still remains an open issue. This work presents a real-time method for adaptive clustering QRS complexes from multilead ECG signals that provides the set of QRS morphologies that appear during an ECG recording. The method processes the QRS complexes sequentially, grouping them into a dynamic set of clusters based on the information content of the temporal context. The clusters are represented by templates which evolve over time and adapt to the QRS morphology changes. Rules to create, merge and remove clusters are defined along with techniques for noise detection in order to avoid their proliferation. To cope with beat misalignment, Derivative Dynamic Time Warping is used. The proposed method has been validated against the MIT-BIH Arrhythmia Database and...
A scale-independent clustering method with automatic variable selection based on trees
Lynch, Sarah K.
2014-01-01
Approved for public release; distribution is unlimited. Clustering is the process of putting observations into groups based on their distance, or dissimilarity, from one another. Measuring distance for continuous variables often requires scaling or monotonic transformation. Determining dissimilarity when observations have both continuous and categorical measurements can be difficult because each type of measurement must be approached differently. We introduce a new clustering method that u...
An effective trust-based recommendation method using a novel graph clustering algorithm
Moradi, Parham; Ahmadian, Sajad; Akhlaghian, Fardin
2015-10-01
Recommender systems are programs that aim to provide personalized recommendations to users for specific items (e.g. music, books) in online sharing communities or on e-commerce sites. Collaborative filtering methods are important and widely accepted types of recommender systems that generate recommendations based on the ratings of like-minded users. On the other hand, these systems confront several inherent issues, such as the data sparsity and cold start problems caused by having few ratings for the unknowns that need to be predicted. Incorporating trust information into collaborative filtering systems is an attractive approach to resolving these problems. In this paper, we present a model-based collaborative filtering method that applies a novel graph clustering algorithm and also considers trust statements. In the proposed method, the problem space is first represented as a graph, and a sparsest-subgraph-finding algorithm is applied to the graph to find the initial cluster centers. Then, the proposed graph clustering algorithm is performed to obtain the appropriate user/item clusters. Finally, the identified clusters are used as a set of neighbors to recommend unseen items to the current active user. Experimental results based on three real-world datasets demonstrate that the proposed method outperforms several state-of-the-art recommender system methods.
A semantics-based method for clustering of Chinese web search results
Zhang, Hui; Wang, Deqing; Wang, Li; Bi, Zhuming; Chen, Yong
2014-01-01
Information explosion is a critical challenge to the development of modern information systems. In particular, when the application of an information system is over the Internet, the amount of information over the web has been increasing exponentially and rapidly. Search engines, such as Google and Baidu, are essential tools for people to find the information from the Internet. Valuable information, however, is still likely submerged in the ocean of search results from those tools. By clustering the results into different groups based on subjects automatically, a search engine with the clustering feature allows users to select most relevant results quickly. In this paper, we propose an online semantics-based method to cluster Chinese web search results. First, we employ the generalised suffix tree to extract the longest common substrings (LCSs) from search snippets. Second, we use the HowNet to calculate the similarities of the words derived from the LCSs, and extract the most representative features by constructing the vocabulary chain. Third, we construct a vector of text features and calculate snippets' semantic similarities. Finally, we improve the Chameleon algorithm to cluster snippets. Extensive experimental results have shown that the proposed algorithm has outperformed over the suffix tree clustering method and other traditional clustering methods.
A polymerization-based method to construct a plasmid containing clustered DNA damage and a mismatch.
Takahashi, Momoko; Akamatsu, Ken; Shikazono, Naoya
2016-10-01
Exposure of biological materials to ionizing radiation often induces clustered DNA damage. The mutagenicity of clustered DNA damage can be analyzed with plasmids carrying a clustered DNA damage site, in which the strand bias of a replicating plasmid (i.e., the degree to which each of the two strands of the plasmid are used as the template for replication of the plasmid) can help to clarify how clustered DNA damage enhances the mutagenic potential of comprising lesions. Placement of a mismatch near a clustered DNA damage site can help to determine the strand bias, but present plasmid-based methods do not allow insertion of a mismatch at a given site in the plasmid. Here, we describe a polymerization-based method for constructing a plasmid containing clustered DNA lesions and a mismatch. The presence of a DNA lesion and a mismatch in the plasmid was verified by enzymatic treatment and by determining the relative abundance of the progeny plasmids derived from each of the two strands of the plasmid. PMID:27449134
Cluster Evaluation of Density Based Subspace Clustering
Sembiring, Rahmat Widia
2010-01-01
Clustering real-world data often faces the curse of dimensionality, since real-world data often consist of many dimensions. Multidimensional data clustering can be evaluated through a density-based approach. Density approaches are based on the paradigm introduced by DBSCAN clustering, in which the density of each object's neighborhood, given MinPoints, is calculated; clusters change in accordance with changes in the density of each object's neighborhood. The neighbors of each object are typically determined using a distance function, for example the Euclidean distance. In this paper the SUBCLU, FIRES and INSCY methods are applied to clustering 6x1595-dimension synthetic datasets. IO entropy, F1 measure, coverage, accuracy and time consumption are used as performance evaluation parameters. Evaluation results showed that the SUBCLU method requires considerable time for subspace clustering; however, its coverage is better. Meanwhile, the INSCY method has better accuracy than the two other methods, altho...
Directory of Open Access Journals (Sweden)
Issam SAHMOUDI
2013-12-01
Full Text Available Document clustering is a branch of a larger area of scientific study known as data mining; it is an unsupervised classification used to find structure in a collection of unlabeled data. With full-text representation, the useful information in documents can be accompanied by a large amount of noise words, which negatively affects the result of the clustering process. There is thus a great need to eliminate the noise words and keep just the useful information in order to enhance the quality of the clustering results. This problem occurs, to different degrees, in any language, such as English, European languages, Hindi, Chinese, and Arabic. To overcome this problem, in this paper we propose a new and efficient keyphrase extraction method based on the suffix tree data structure (KpST); the extracted keyphrases are then used in the clustering process instead of the full-text representation. The proposed method for keyphrase extraction is language independent and may therefore be applied to any language. In this investigation, we are interested in the Arabic language, which is one of the most complex languages. To evaluate our method, we conduct an experimental study on Arabic documents using the most popular hierarchical clustering approach: the agglomerative hierarchical algorithm with seven linkage techniques and a variety of distance functions and similarity measures to perform the Arabic document clustering task. The obtained results show that our method for extracting keyphrases increases the quality of the clustering results. We also propose to study the effect of using stemming on the test dataset, clustering it with the same document clustering techniques and similarity/distance measures.
New Clustering Method in High-Dimensional Space Based on Hypergraph-Models
Institute of Scientific and Technical Information of China (English)
CHEN Jian-bin; WANG Shu-jing; SONG Han-tao
2006-01-01
To overcome the limitation of traditional clustering algorithms, which fail to produce meaningful clusters in high-dimensional, sparse, binary-valued data sets, a new method based on a hypergraph model is proposed. The hypergraph model maps the relationships present in the original high-dimensional data into a hypergraph, where a hyperedge represents the similarity of the attribute-value distribution between two points. A hypergraph partitioning algorithm is used to find a partitioning of the vertices such that the corresponding data items in each partition are highly related and the weight of the hyperedges cut by the partitioning is minimized. The quality of the clustering result can be evaluated by applying an intra-cluster singularity value. Analysis and experimental results demonstrate that this approach is applicable and effective over a wide range of schemes.
A NOVEL METHOD FOR MULTISTAGE SCENARIO GENERATION BASED ON CLUSTER ANALYSIS
XIAODONG JI; XIUJUAN ZHAO; XIULI CHAO
2006-01-01
Based on cluster analysis, a novel method is introduced in this paper to generate multistage scenarios. A linear programming model is proposed to exclude the arbitrage opportunity by appending a scenario to the generated scenario set. By means of a cited stochastic linear goal programming portfolio model, a case is given to exhibit the virtues of this scenario generation approach.
DIMK-means “Distance-based Initialization Method for K-means Clustering Algorithm”
Raed T. Aldahdooh; Wesam Ashour
2013-01-01
Partition-based clustering is one of several clustering techniques that attempt to directly decompose the dataset into a set of disjoint clusters. The k-means algorithm, which relies on this partition-based technique, is popular, widely used, and applied in a variety of domains. K-means clustering results are extremely sensitive to the initial centroids; this is one of the major drawbacks of the k-means algorithm. Due to such sensitivity, several different initialization approaches were ...
A novel PPGA-based clustering analysis method for business cycle indicator selection
Institute of Scientific and Technical Information of China (English)
Dabin ZHANG; Lean YU; Shouyang WANG; Yingwen SONG
2009-01-01
A new clustering analysis method based on the pseudo parallel genetic algorithm (PPGA) is proposed for business cycle indicator selection. In the proposed method, the category of each indicator is coded by real numbers, and illegal chromosomes are repaired by the identification and restoration of empty classes. Two mutation operators, namely the discrete random mutation operator and the optimal direction mutation operator, are designed to balance the local convergence speed and the global convergence performance; they are then combined with a migration strategy and an insertion strategy. For the purpose of verification and illustration, the proposed method is compared with the k-means clustering algorithm and standard genetic algorithms via a numerical simulation experiment. The experimental results show the feasibility and effectiveness of the new PPGA-based clustering analysis algorithm. The proposed clustering analysis algorithm is also applied to select business cycle indicators for examining the status of the macro economy. Empirical results demonstrate that the proposed method can effectively and correctly select leading, coincident, and lagging indicators that reflect the business cycle, which is highly operational for macroeconomic administrators and business decision-makers.
Unconventional methods for clustering
Kotyrba, Martin
2016-06-01
Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters). It is the main task of exploratory data mining and a common technique for statistical data analysis used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, and bioinformatics. The topic of this paper is one of the modern methods of clustering, namely the SOM (Self-Organising Map). The paper describes the theory needed to understand the principle of clustering and describes the algorithms used with clustering in our experiments.
Centroid Based Text Clustering
Directory of Open Access Journals (Sweden)
Priti Maheshwari
2010-09-01
Full Text Available Web mining is a burgeoning new field that attempts to glean meaningful information from natural language text; it refers generally to the process of extracting interesting information and knowledge from unstructured text. Text clustering is one of the important Web mining functionalities: the task in which texts are classified into groups of similar objects based on their contents. Current research in the area of Web mining tackles problems of text data representation, classification, clustering, information extraction, and the search for and modeling of hidden patterns. In this paper we propose that, for mining large document collections, it is necessary to pre-process the web documents and store the information in a data structure that is more appropriate for further processing than a plain web file. We developed a PHP/MySQL-based utility to convert unstructured web documents into a structured tabular representation by preprocessing and indexing, and we apply a centroid-based web clustering method on the preprocessed data. We apply three methods for clustering, and finally we propose a method that can increase accuracy based on the clustering of documents.
Galaxy Cluster Mass Reconstruction Project: I. Methods and first results on galaxy-based techniques
Old, L; Pearce, F R; Croton, D; Muldrew, S I; Muñoz-Cuartas, J C; Gifford, D; Gray, M E; von der Linden, A; Mamon, G A; Merrifield, M R; Müller, V; Pearson, R J; Ponman, T J; Saro, A; Sepp, T; Sifón, C; Tempel, E; Tundo, E; Wang, Y O; Wojtak, R
2014-01-01
This paper is the first in a series in which we perform an extensive comparison of various galaxy-based cluster mass estimation techniques that utilise the positions, velocities and colours of galaxies. Our primary aim is to test the performance of these cluster mass estimation techniques on a diverse set of models that will increase in complexity. We begin by providing participating methods with data from a simple model that delivers idealised clusters, enabling us to quantify the underlying scatter intrinsic to these mass estimation techniques. The mock catalogue is based on a Halo Occupation Distribution (HOD) model that assumes spherical Navarro, Frenk and White (NFW) haloes truncated at R_200, with no substructure nor colour segregation, and with isotropic, isothermal Maxwellian velocities. We find that, above 10^14 M_solar, recovered cluster masses are correlated with the true underlying cluster mass with an intrinsic scatter of typically a factor of two. Below 10^14 M_solar, the scatter rises as the nu...
Jai-Houng Leu; Chih-Yao Lo; Chi-Hau Liu
2009-01-01
New analytical methods and tools, called FAKDT (Fixed Average K-means based Decision Trees), for analysing human performance have been developed; they allow us to look at the enterprise from different aspects in this study. The Decision Tree Clustering Method is one of the data mining methods that has been applied widely in different fields to analyze large amounts of data in recent years. Generally speaking, in the human resource incubation of an enterprise, if employees of high learning poten...
Watts, Michael J.; Worner, Susan P.
2011-01-01
Existing cluster-based methods for investigating insect species assemblages, or profiles of a region that indicate the risk of new insect pest invasion, have a major limitation: they assign the same species risk factors to each region in a cluster. Clearly, regions assigned to the same cluster have different degrees of similarity with respect to their species profile or assemblage. This study addresses this concern by applying weighting factors to the cluster elements used to calculate regi...
Are fragment-based quantum chemistry methods applicable to medium-sized water clusters?
Yuan, Dandan; Shen, Xiaoling; Li, Wei; Li, Shuhua
2016-06-28
Fragment-based quantum chemistry methods are either based on the many-body expansion or the inclusion-exclusion principle. To compare the applicability of these two categories of methods, we have systematically evaluated the performance of the generalized energy based fragmentation (GEBF) method (J. Phys. Chem. A, 2007, 111, 2193) and the electrostatically embedded many-body (EE-MB) method (J. Chem. Theory Comput., 2007, 3, 46) for medium-sized water clusters (H2O)n (n = 10, 20, 30). Our calculations demonstrate that the GEBF method provides uniformly accurate ground-state energies for 10 low-energy isomers of three water clusters under study at a series of theory levels, while the EE-MB method (with one water molecule as a fragment and without using the cutoff distance) shows a poor convergence for (H2O)20 and (H2O)30 when the basis set contains diffuse functions. Our analysis shows that the neglect of the basis set superposition error for each subsystem has little effect on the accuracy of the GEBF method, but leads to much less accurate results for the EE-MB method. The accuracy of the EE-MB method can be dramatically improved by using an appropriate cutoff distance and using two water molecules as a fragment. For (H2O)30, the average deviation of the EE-MB method truncated up to the three-body level calculated using this strategy (relative to the conventional energies) is about 0.003 hartree at the M06-2X/6-311++G** level, while the deviation of the GEBF method with a similar computational cost is less than 0.001 hartree. The GEBF method is demonstrated to be applicable for electronic structure calculations of water clusters at any basis set. PMID:27263629
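The many-body expansion that underlies the EE-MB category above truncates the cluster energy as E ≈ Σ_i E_i + Σ_{i<j} (E_ij − E_i − E_j). The sketch below applies this two-body truncation to a toy model whose monomer and pair energy functions are invented for illustration; because the toy potential is strictly pairwise-additive, the truncation is exact by construction, which is a convenient sanity check (real quantum chemistry energies are not pairwise-additive, which is where the convergence issues discussed above arise).

```python
from itertools import combinations

# Toy energies for a pairwise-additive model (assumed, not ab initio):
# E_exact(cluster) = sum_i e(i) + sum_{i<j} v(i, j)
def e_mono(i):
    return -1.0                        # constant monomer energy

def v_pair(i, j):
    return -0.1 / (1 + abs(i - j))     # distance-like pair interaction

def exact_energy(n):
    """Full cluster energy of the toy model."""
    return (sum(e_mono(i) for i in range(n))
            + sum(v_pair(i, j) for i, j in combinations(range(n), 2)))

def mbe2_energy(n):
    """Two-body many-body expansion:
       E ~ sum_i E_i + sum_{i<j} (E_ij - E_i - E_j)."""
    e1 = [e_mono(i) for i in range(n)]
    total = sum(e1)
    for i, j in combinations(range(n), 2):
        e_dimer = e_mono(i) + e_mono(j) + v_pair(i, j)  # "dimer calculation"
        total += e_dimer - e1[i] - e1[j]
    return total
```

For (H2O)n-like systems the dimer energies would come from actual electronic structure calculations, and the neglected three-body and higher terms are what the embedding charges and cutoff strategies in the paper try to recover.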
Directory of Open Access Journals (Sweden)
Deepa Devasenapathy
2015-01-01
Full Text Available Traffic in road networks is increasing steadily. Good knowledge of network traffic can minimize congestion using information pertaining to the road network obtained with the aid of communal callers, pavement detectors, and so on. Using these methods, low-featured information is generated with respect to the user in the road network. Although the existing schemes obtain urban traffic information, they fail to calculate the energy drain rate of nodes and to strike a balance between the overhead and the quality of the routing protocol, which poses a great challenge. Thus, an energy-efficient cluster-based vehicle detection in road networks using the intention numeration method (CVDRN-IN) is developed. Initially, sensor nodes that detect a vehicle are grouped into separate clusters. Further, the node drain rate for a cluster is approximated using a polynomial regression function, and the total node energy is estimated by taking the integral over the area. Finally, enhanced data aggregation is performed to reduce the amount of data transmission using a digital signature tree. The experimental performance is evaluated with the Dodgers loop sensor data set from the UCI repository, and the evaluation outperforms existing work on energy consumption, clustering efficiency, and node drain rate.
A novel intrusion detection method based on OCSVM and K-means recursive clustering
Directory of Open Access Journals (Sweden)
Leandros A. Maglaras
2015-01-01
Full Text Available In this paper we present an intrusion detection module capable of detecting malicious network traffic in a SCADA (Supervisory Control and Data Acquisition) system, based on the combination of a One-Class Support Vector Machine (OCSVM) with an RBF kernel and recursive k-means clustering. Important parameters of the OCSVM, such as the Gaussian width σ and the parameter ν, affect the performance of the classifier, and tuning them is of great importance in order to avoid false positives and overfitting. The combination of OCSVM with recursive k-means clustering leads the proposed intrusion detection module to distinguish real alarms from possible attacks regardless of the values of σ and ν, making it ideal for real-time intrusion detection mechanisms for SCADA systems. Extensive simulations have been conducted with datasets extracted from small and medium sized HTB SCADA testbeds, in order to compare the accuracy, false alarm rate and execution time against the baseline OCSVM method.
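The recursive k-means stage described above can be sketched independently of the OCSVM front end: cluster the alarm scores with 2-means, then recursively re-split any cluster that is still too spread out. Everything below (1-D scores, the spread threshold, the seeding) is an illustrative assumption, not the paper's actual parameters.

```python
import statistics

def kmeans2(xs, iters=20):
    """Plain 1-D 2-means, seeded with the min and max values."""
    c1, c2 = min(xs), max(xs)
    for _ in range(iters):
        a = [x for x in xs if abs(x - c1) <= abs(x - c2)]
        b = [x for x in xs if abs(x - c1) > abs(x - c2)]
        if a: c1 = statistics.fmean(a)
        if b: c2 = statistics.fmean(b)
    return a, b

def recursive_kmeans(xs, spread=1.0):
    """Split with 2-means, recursing until each cluster is tight
    (standard deviation below `spread`)."""
    if len(xs) < 2 or statistics.pstdev(xs) < spread:
        return [xs]
    a, b = kmeans2(xs)
    if not a or not b:
        return [xs]
    return recursive_kmeans(a, spread) + recursive_kmeans(b, spread)

# toy anomaly scores: three well-separated severity groups
scores = [0.1, 0.2, 0.15, 5.0, 5.2, 5.1, 9.8, 10.0, 9.9]
clusters = recursive_kmeans(scores)
```

The appeal of the recursive variant is that the number of clusters is discovered rather than fixed in advance, which matches the paper's goal of separating real alarms from possible attacks without hand-tuned parameters.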
Targets Separation and Imaging Method in Sparse Scene Based on Cluster Result of Range Profile Peaks
Directory of Open Access Journals (Sweden)
YANG Qiu
2015-08-01
Full Text Available This paper focuses on synthetic aperture radar (SAR) imaging of space-sparse targets such as ships on the sea, and proposes a method for the separation and imaging of targets in a sparse scene based on the cluster result of range profile peaks. Firstly, a wavelet de-noising algorithm is used to preprocess the original echo; the range profiles at different viewing positions can then be obtained by range compression and range migration correction. Peaks of the range profiles are detected by a fast peak detection algorithm based on the second-order difference operator. Targets with sparse energy intervals can be imaged through azimuth compression after clustering the peaks in the range dimension. Moreover, targets without coupling in the range energy interval and direction synthetic aperture time can be imaged through azimuth compression after clustering the peaks in both the range and direction dimensions. Lastly, the effectiveness of the proposed method is validated by simulations. The experimental results demonstrate that space-sparse targets such as ships can be imaged separately and completely with little computation in azimuth compression, and the resulting images are more beneficial for target recognition.
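Peak detection via the second-order difference operator, as used above, amounts to finding where the first difference changes sign from positive to non-positive. The sample profile below is invented for illustration; real range profiles would come out of range compression.

```python
def find_peaks(profile):
    """Detect local maxima: a peak sits where the first difference
    goes from positive to non-positive (negative second difference)."""
    diffs = [b - a for a, b in zip(profile, profile[1:])]
    peaks = []
    for i in range(1, len(diffs)):
        if diffs[i - 1] > 0 >= diffs[i]:
            peaks.append(i)          # index of the local maximum
    return peaks

# toy range profile with two energy concentrations (two "targets")
profile = [0, 1, 3, 7, 3, 1, 0, 2, 6, 9, 6, 2, 0]
peaks = find_peaks(profile)
```

The detected peak indices would then be fed to the range-dimension clustering step to group peaks belonging to the same target.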
Comparison three methods of clustering: k-means, spectral clustering and hierarchical clustering
Kowsari, Kamran
2013-01-01
This paper compares three kinds of clustering, deriving their cost and loss functions and calculating them. The error rate of a clustering method, and how to calculate the error percentage, is always one of the important factors for evaluating clustering methods, so this paper introduces a way to calculate the error rate of clustering methods. Clustering algorithms can be divided into several categories including partitioning clustering algorithms, hierarchical algorithms and density based algor...
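One standard way to compute a clustering error rate, in the spirit of the comparison above, is to relabel the clusters by whichever permutation best matches the ground truth and then count the remaining mismatches. The tiny labelled example is an assumption for illustration; brute-force permutation matching is only practical for small numbers of clusters.

```python
from itertools import permutations

def clustering_error_rate(true_labels, cluster_labels):
    """Error rate of a clustering: map cluster ids to classes by the
    best permutation, then count mismatched points."""
    clusters = sorted(set(cluster_labels))
    classes = sorted(set(true_labels))
    best_correct = 0
    for perm in permutations(classes, len(clusters)):
        mapping = dict(zip(clusters, perm))
        correct = sum(mapping[c] == t
                      for c, t in zip(cluster_labels, true_labels))
        best_correct = max(best_correct, correct)
    return 1 - best_correct / len(true_labels)

truth = ["a", "a", "a", "b", "b", "b"]
guess = [0, 0, 1, 1, 1, 1]           # one point assigned to the wrong cluster
err = clustering_error_rate(truth, guess)
```

For larger numbers of clusters the same matching is usually done with the Hungarian algorithm instead of enumerating permutations.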
Bishop, R. F.; Li, P. H. Y.
2011-04-01
An approximation hierarchy, called the lattice-path-based subsystem (LPSUBm) approximation scheme, is described for the coupled-cluster method (CCM). It is applicable to systems defined on a regular spatial lattice. We then apply it to two well-studied prototypical (spin-1/2 Heisenberg antiferromagnetic) spin-lattice models, namely, the XXZ and the XY models on the square lattice in two dimensions. Results are obtained in each case for the ground-state energy, the ground-state sublattice magnetization, and the quantum critical point. They are all in good agreement with those from such alternative methods as spin-wave theory, series expansions, quantum Monte Carlo methods, and the CCM using the alternative lattice-animal-based subsystem (LSUBm) and the distance-based subsystem (DSUBm) schemes. Each of the three CCM schemes (LSUBm, DSUBm, and LPSUBm) for use with systems defined on a regular spatial lattice is shown to have its own advantages in particular applications.
Santos, Miriam Seoane; Abreu, Pedro Henriques; García-Laencina, Pedro J; Simão, Adélia; Carvalho, Armando
2015-12-01
Liver cancer is the sixth most frequently diagnosed cancer and, particularly, Hepatocellular Carcinoma (HCC) represents more than 90% of primary liver cancers. Clinicians assess each patient's treatment on the basis of evidence-based medicine, which may not always apply to a specific patient, given the biological variability among individuals. Over the years, and for the particular case of Hepatocellular Carcinoma, some research studies have been developing strategies for assisting clinicians in decision making, using computational methods (e.g. machine learning techniques) to extract knowledge from the clinical data. However, these studies have some limitations that have not yet been addressed: some do not focus entirely on Hepatocellular Carcinoma patients, others have strict application boundaries, and none considers the heterogeneity between patients nor the presence of missing data, a common drawback in healthcare contexts. In this work, a real complex Hepatocellular Carcinoma database composed of heterogeneous clinical features is studied. We propose a new cluster-based oversampling approach robust to small and imbalanced datasets, which accounts for the heterogeneity of patients with Hepatocellular Carcinoma. The preprocessing procedures of this work are based on data imputation considering appropriate distance metrics for both heterogeneous and missing data (HEOM) and clustering studies to assess the underlying patient groups in the studied dataset (K-means). The final approach is applied in order to diminish the impact of underlying patient profiles with reduced sizes on survival prediction. It is based on K-means clustering and the SMOTE algorithm to build a representative dataset and use it as training example for different machine learning procedures (logistic regression and neural networks). The results are evaluated in terms of survival prediction and compared across baseline approaches that do not consider clustering and/or oversampling using the
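The SMOTE step in the approach above can be sketched as nearest-neighbour interpolation within the minority class. This simplified version (interpolating toward the single nearest neighbour, toy 2-D points) is an assumption-laden sketch, not the authors' exact cluster-based pipeline.

```python
import math
import random

def smote_like(minority, n_new, seed=0):
    """Generate synthetic minority samples by interpolating each picked
    sample toward its nearest neighbour (a simplified SMOTE)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        nn = min((p for p in minority if p is not x),
                 key=lambda p: math.dist(p, x))
        t = rng.random()             # position along the segment x -> nn
        synthetic.append(tuple(a + t * (b - a) for a, b in zip(x, nn)))
    return synthetic

# toy minority-class points (e.g. a small patient subgroup)
minority = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1)]
new_pts = smote_like(minority, 5)
```

Because every synthetic point lies on a segment between two real minority samples, the oversampled set stays inside the minority region rather than duplicating points, which is what makes it useful as extra training data for small patient subgroups.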
A Method of Clustering Components into Modules Based on Products' Functional and Structural Analysis
Institute of Scientific and Technical Information of China (English)
MENG Xiang-hui; JIANG Zu-hua; ZHENG Ying-fei
2006-01-01
Modularity is the key to improving the cost-variety trade-off in product development. To achieve the functional independency and structural independency of modules, a method of clustering components to identify modules based on functional and structural analysis was presented. Two stages were included in the method. In the first stage the product's function was analyzed to determine the primary level of modules. Then the objective function for module identification was formulated to achieve functional independency of modules. Finally the genetic algorithm was used to solve the combinatorial optimization problem of module identification to form the primary modules of the product. In the second stage the cohesion degree of modules and the coupling degree between modules were analyzed. Based on this structural analysis the modular scheme was refined according to the idea of structural independency. A case study on a gear reducer was conducted to illustrate the validity of the presented method.
Image reconstruction of muon tomographic data using a density-based clustering method
Perry, Kimberly B.
Muons are subatomic particles capable of reaching the Earth's surface before decaying. When these particles collide with an object that has a high atomic number (Z), their path of travel changes substantially. Tracking muon movement through shielded containers can indicate what types of materials lie inside. This thesis proposes using a density-based clustering algorithm called OPTICS to perform image reconstructions using muon tomographic data. The results show that this method is capable of detecting high-Z materials quickly, and can also produce detailed reconstructions with large amounts of data.
Targets Separation and Imaging Method in Sparse Scene Based on Cluster Result of Range Profile Peaks
Yang, Qiu; Qun ZHANG; Wang, Min; Sun, Li
2015-01-01
This paper focuses on the synthetic aperture radar (SAR) imaging of space-sparse targets such as ships on the sea, and proposes a method of targets separation and imaging of sparse scene based on cluster result of range profile peaks. Firstly, wavelet de-noising algorithm is used to preprocess the original echo, and then the range profile at different viewing positions can be obtained by range compression and range migration correction. Peaks of the range profiles can be detected by the fast ...
A Novel Method to Predict Genomic Islands Based on Mean Shift Clustering Algorithm.
de Brito, Daniel M; Maracaja-Coutinho, Vinicius; de Farias, Savio T; Batista, Leonardo V; do Rêgo, Thaís G
2016-01-01
Genomic Islands (GIs) are regions of bacterial genomes that are acquired from other organisms through the phenomenon of horizontal transfer. These regions are often responsible for many important acquired adaptations of the bacteria, with great impact on their evolution and behavior; such adaptations are usually associated with pathogenicity, antibiotic resistance, degradation and metabolism. Identification of such regions is of medical and industrial interest, and for this reason different approaches for genomic island prediction have been proposed. However, none of them is capable of precisely predicting the complete repertoire of GIs in a genome. The difficulties arise from the changes in performance of different algorithms in the face of the variety of nucleotide distributions in different species. In this paper, we present a novel method to predict GIs that is built upon the mean shift clustering algorithm. It does not require any information regarding the number of clusters, and the bandwidth parameter is automatically calculated based on a heuristic approach. The method was implemented in a new user-friendly tool named MSGIP (Mean Shift Genomic Island Predictor). Genomes of bacteria with GIs discussed in other papers were used to evaluate the proposed method. The application of this tool revealed the same GIs predicted by other methods, as well as novel, previously unpredicted islands. A detailed investigation of the different features related to typical GI elements inserted in these new regions confirmed its effectiveness. Stand-alone and user-friendly versions of this new methodology are available at http://msgip.integrativebioinformatics.me. PMID:26731657
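The mean shift procedure underlying MSGIP can be sketched in one dimension: every point is iteratively moved to the kernel-weighted mean of its neighbourhood, and the points that converge to the same place define a cluster around a density mode. The data, the Gaussian kernel choice, and the mode-merging threshold below are assumptions for illustration; the tool itself operates on genomic features.

```python
import math

def mean_shift_1d(points, bandwidth, iters=50):
    """Shift every point to the Gaussian-weighted mean of the data
    around it until the modes stabilise, then merge nearby modes."""
    modes = list(points)
    for _ in range(iters):
        new_modes = []
        for m in modes:
            w = [math.exp(-0.5 * ((m - p) / bandwidth) ** 2) for p in points]
            new_modes.append(sum(wi * p for wi, p in zip(w, points)) / sum(w))
        modes = new_modes
    # merge modes that converged to (nearly) the same place
    centers = []
    for m in sorted(modes):
        if not centers or abs(m - centers[-1]) > bandwidth / 2:
            centers.append(m)
    return centers

data = [1.0, 1.1, 0.9, 5.0, 5.2, 4.8]
centers = mean_shift_1d(data, bandwidth=1.0)
```

Note that the number of clusters (two here) falls out of the procedure; only the bandwidth must be supplied, which is exactly the parameter MSGIP estimates heuristically.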
Stability of maximum-likelihood-based clustering methods: exploring the backbone of classifications
International Nuclear Information System (INIS)
Components of complex systems are often classified according to the way they interact with each other. In graph theory such groups are known as clusters or communities. Many different techniques have been recently proposed to detect them, some of which involve inference methods using either Bayesian or maximum likelihood approaches. In this paper, we study a statistical model designed for detecting clusters based on connection similarity. The basic assumption of the model is that the graph was generated by a certain grouping of the nodes and an expectation maximization algorithm is employed to infer that grouping. We show that the method admits further development to yield a stability analysis of the groupings that quantifies the extent to which each node influences its neighbors' group membership. Our approach naturally allows for the identification of the key elements responsible for the grouping and their resilience to changes in the network. Given the generality of the assumptions underlying the statistical model, such nodes are likely to play special roles in the original system. We illustrate this point by analyzing several empirical networks for which further information about the properties of the nodes is available. The search and identification of stabilizing nodes constitutes thus a novel technique to characterize the relevance of nodes in complex networks
Directory of Open Access Journals (Sweden)
Jai-Houng Leu
2009-01-01
Full Text Available New analytical methods and tools, called FAKDT (Fixed Average K-means based Decision Trees), for analysing human performance have been developed; they allow us to look at the enterprise from different aspects in this study. The Decision Tree Clustering Method is one of the data mining methods that has been applied widely in different fields to analyze large amounts of data in recent years. Generally speaking, in the human resource incubation of an enterprise, if employees of high learning potential, high stability and high emotional quotient are selected, the return on investment in human resources will be more apparent. If employees with the above-mentioned traits can be well utilized and incubated, the industry competitiveness of the enterprise will be enhanced effectively. From the personality specialty point of view, the function of the method is to predict the efficiency of personal achievement from its correlation with certain implied personality specialties (blood group, constellation, etc.). The main purpose of this research is to extract useful information and important messages about human performance from historical records with this method. The Decision Tree Clustering Method data mining skills were improved and applied to obtain the critical factors that affect human traits, and the feasibility of the approach is examined in this study.
Fan, Chang-ke; Wu, Yu
2010-01-01
A total of 10 indices of regional economic development in Guangxi are selected. According to the relevant economic data, regional economic development in Guangxi is analyzed by using the System Clustering Method and the Principal Component Analysis Method. The results show that the System Clustering Method and the Principal Component Analysis Method reveal similar analyses of the economic development level. The overall economic strength of Guangxi is weak, and Nanning has relatively high scores of fac...
Institute of Scientific and Technical Information of China (English)
2010-01-01
A total of 10 indices of regional economic development in Guangxi are selected. According to the relevant economic data, regional economic development in Guangxi is analyzed by using the System Clustering Method and the Principal Component Analysis Method. The results show that the two methods reveal similar analyses of the economic development level. The overall economic strength of Guangxi is weak, and Nanning has relatively high factor scores due to its advantage as the political, economic and cultural center. The comprehensive scores of the other regions are all lower than 1, showing a big gap with the development of Nanning. The overall development strategy points out that Guangxi should accelerate the construction of the Ring Northern Bay Economic Zone, create a strong logistics system of strategic significance to national development, and use its unique location advantage, relying on the modern transportation system, to establish a logistics center and business center connecting the hinterland and the ASEAN market. Based on the problems of unbalanced regional economic development in Guangxi, the development of the service industry in Nanning should be accelerated, a circular economy system of industrial cities constructed, and the industrialization process of tourism cities sped up, in order to realize balanced development of the regional economy in Guangxi, China.
Effective Term Based Text Clustering Algorithms
P. Ponmuthuramalingam,; T. Devi
2010-01-01
Text clustering methods can be used to group large sets of text documents. Most text clustering methods do not address problems of text clustering such as the very high dimensionality of the data and the understandability of the cluster descriptions. In this paper, a frequent-term-based approach to clustering is introduced; it provides a natural way of reducing the large dimensionality of the document vector space. This approach is based on clustering the low dimensionality frequent...
The tidal tails of globular cluster Palomar 5 based on the neural networks method
Institute of Scientific and Technical Information of China (English)
Hu Zou; Zhen-Yu WU; Jun Ma; Xu Zhou
2009-01-01
The sixth Data Release (DR6) of the Sloan Digital Sky Survey (SDSS) provides more photometric regions, new features and more accurate data around the globular cluster Palomar 5. A new method, the Back Propagation Neural Network (BPNN), is used to estimate the cluster membership probability in order to detect its tidal tails. Cluster and field stars, used for training the networks, are extracted over a 40×20 deg² field by color-magnitude diagrams (CMDs). The best BPNNs, with two hidden layers and a Levenberg-Marquardt (LM) training algorithm, are determined by the chosen cluster and field samples. The membership probabilities of stars in the whole field are obtained with the BPNNs, and contour maps of the probability distribution show that a tail extends 5.42° to the north of the cluster and another tail extends 3.77° to the south. The tails are similar to those detected by Odenkirchen et al., but no more debris from the cluster is found to the northeast in the sky. The radial density profiles are investigated both along the tails and near the cluster center. Quite a few substructures are discovered in the tails. The number density profile of the cluster is fitted with the King model and the tidal radius is determined as 14.28'. However, the King model cannot fit the observed profile in the outer regions (R > 8') because of the tidal tails generated by the tidal force. Luminosity functions of the cluster and the tidal tails are calculated, which confirm that the tails originate from Palomar 5.
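A back-propagation network of the kind used above can be sketched with a single hidden layer and plain stochastic gradient descent, its sigmoid output read as a membership probability. The toy two-feature samples stand in for colour-magnitude data, and the architecture, learning rate and epoch count are all assumptions (the paper uses two hidden layers and Levenberg-Marquardt training).

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_mlp(samples, hidden=4, lr=0.5, epochs=500, seed=1):
    """One-hidden-layer back-propagation network trained by SGD on
    log-loss; returns a function mapping features to membership."""
    rng = random.Random(seed)
    n_in = len(samples[0][0])
    W1 = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(hidden)]
    b1 = [0.0] * hidden
    W2 = [rng.uniform(-1, 1) for _ in range(hidden)]
    b2 = 0.0

    def forward(x):
        h = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
             for row, b in zip(W1, b1)]
        return h, sigmoid(sum(w * hi for w, hi in zip(W2, h)) + b2)

    for _ in range(epochs):
        for x, t in samples:
            h, y = forward(x)
            d_out = y - t                      # log-loss gradient at output
            d_hid = [d_out * w * hj * (1 - hj) for w, hj in zip(W2, h)]
            for j in range(hidden):
                W2[j] -= lr * d_out * h[j]
                b1[j] -= lr * d_hid[j]
                for i in range(n_in):
                    W1[j][i] -= lr * d_hid[j] * x[i]
            b2 -= lr * d_out
    return lambda x: forward(x)[1]

# toy training set: label 1 = "cluster member", 0 = "field star"
samples = [((0.9, 0.8), 1), ((1.0, 1.1), 1), ((0.8, 1.0), 1),
           ((-0.9, -1.0), 0), ((-1.1, -0.8), 0), ((-1.0, -1.0), 0)]
membership = train_mlp(samples)
```

As in the paper, the trained network is then evaluated over the whole field, and the resulting membership probabilities can be contoured to reveal structures such as tidal tails.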
Splitting Methods for Convex Clustering
Chi, Eric C.; Lange, Kenneth
2013-01-01
Clustering is a fundamental problem in many scientific applications. Standard methods such as $k$-means, Gaussian mixture models, and hierarchical clustering, however, are beset by local minima, which are sometimes drastically suboptimal. Recently introduced convex relaxations of $k$-means and hierarchical clustering shrink cluster centroids toward one another and ensure a unique global minimizer. In this work we present two splitting methods for solving the convex clustering problem. The fir...
Dynamic cluster formation using level set methods.
Yip, Andy M; Ding, Chris; Chan, Tony F
2006-06-01
Density-based clustering has the advantages of (1) allowing arbitrary cluster shapes and (2) not requiring the number of clusters as input. However, when clusters touch each other, both the cluster centers and cluster boundaries (as the peaks and valleys of the density distribution) become fuzzy and difficult to determine. We introduce the notion of the cluster intensity function (CIF), which captures the important characteristics of clusters. When clusters are well-separated, CIFs are similar to density functions. But when clusters become close to each other, CIFs still clearly reveal cluster centers, cluster boundaries, and the degree of membership of each data point to the cluster it belongs to. Clustering through bump hunting and valley seeking based on these functions is more robust than that based on density functions obtained by kernel density estimation, which are often oscillatory or oversmoothed. These problems of kernel density estimation are resolved using Level Set Methods and related techniques. Comparisons with two existing density-based methods, valley seeking and DBSCAN, are presented which illustrate the advantages of our approach. PMID:16724583
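The oscillatory-versus-oversmoothed behaviour of kernel density estimation mentioned above can be shown directly: a small bandwidth yields a spiky estimate with one mode per data point, while a large bandwidth merges genuinely separate clusters into a single mode. The data set and bandwidths below are assumptions chosen to exaggerate both failure modes.

```python
import math

def kde(points, h):
    """Gaussian kernel density estimate with bandwidth h."""
    norm = len(points) * h * math.sqrt(2 * math.pi)
    return lambda x: sum(math.exp(-0.5 * ((x - p) / h) ** 2)
                         for p in points) / norm

def count_modes(f, lo, hi, steps=400):
    """Count strict local maxima of f on a uniform grid."""
    ys = [f(lo + (hi - lo) * i / steps) for i in range(steps + 1)]
    return sum(1 for i in range(1, steps)
               if ys[i - 1] < ys[i] >= ys[i + 1])

data = [0.0, 0.2, 0.4, 5.0, 5.2, 5.4]        # two genuine clusters
narrow = count_modes(kde(data, 0.05), -1.0, 7.0)  # oscillatory estimate
wide = count_modes(kde(data, 3.0), -1.0, 7.0)     # oversmoothed estimate
```

Neither bandwidth recovers the true two-cluster structure, which is the motivation for replacing raw density estimates with the smoother cluster intensity functions proposed in the paper.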
Cluster Evaluation of Density Based Subspace Clustering
Sembiring, Rahmat Widia; Zain, Jasni Mohamad
2010-01-01
Clustering real-world data is often faced with the curse of dimensionality, since such data often consist of many dimensions. Multidimensional data clustering can be evaluated through a density-based approach. Density approaches are based on the paradigm introduced by the DBSCAN clustering algorithm. In this approach, the density of each object's neighbours with respect to MinPoints is calculated. Cluster changes occur in accordance with changes in the density of each object's neighbours. The neighbours of each object ...
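For reference, the DBSCAN paradigm mentioned above can be sketched in a few lines: a point with at least MinPoints neighbours within a radius eps is a core point, clusters grow outward from core points, and unreachable points are labeled noise. The parameters and test data below are illustrative assumptions.

```python
import numpy as np

def dbscan(X, eps=0.5, min_pts=4):
    """Minimal DBSCAN sketch: grow clusters from core points; -1 marks noise."""
    n = len(X)
    dist = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    neighbors = [np.where(dist[i] <= eps)[0] for i in range(n)]
    labels = np.full(n, -1)
    cluster = 0
    for i in range(n):
        if labels[i] != -1 or len(neighbors[i]) < min_pts:
            continue                     # already assigned, or not a core point
        labels[i] = cluster
        frontier = list(neighbors[i])
        while frontier:                  # breadth-first cluster expansion
            j = frontier.pop()
            if labels[j] == -1:
                labels[j] = cluster
                if len(neighbors[j]) >= min_pts:   # expand only through cores
                    frontier.extend(neighbors[j])
        cluster += 1
    return labels

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.2, (20, 2)),
               rng.normal(5, 0.2, (20, 2)),
               [[10.0, 10.0]]])          # two dense blobs plus one isolated outlier
labels = dbscan(X, eps=1.0, min_pts=4)
```

Note how the number of clusters is never supplied: it emerges from the density structure, which is exactly the property the abstract evaluates.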
Dioba, A.
2010-01-01
The article considers the use of the EM algorithm of fuzzy clustering analysis to assess the level of productive personnel employment at industrial enterprises. A methodical approach to assessing productive personnel employment is suggested. The criteria for evaluating productive personnel employment are developed in the article.
Clustering Software Methods and Comparison
Directory of Open Access Journals (Sweden)
Rachana Kamble
2014-12-01
Full Text Available Document clustering, as an unsupervised approach, is extensively used to navigate, filter, summarize and manage huge collections of document repositories such as the World Wide Web (WWW). Document clustering is the process of segmenting a particular collection of texts into subgroups of content-based similar ones. The purpose of document clustering is to meet human interests in information searching and understanding. Component-based software development has gained a lot of practical importance in the field of software engineering, both from academic researchers and from a business perspective. Finding components for efficient code reuse is one of the important problems addressed by researchers. Clustering reduces the search space of components by grouping similar entities together, thus guaranteeing reduced time complexity as it reduces the search time for component retrieval. This work studies the key challenges of the clustering problem as it applies to the text domain, and also discusses the key methods used for text clustering and their relative merits.
GPU-based Multilevel Clustering.
Chiosa, Iurie; Kolb, Andreas
2010-04-01
The processing power of parallel co-processors like the Graphics Processing Unit (GPU) is dramatically increasing. However, until now only a few approaches have been presented that utilize this kind of hardware for mesh clustering purposes. In this paper we introduce a Multilevel clustering technique designed as a parallel algorithm and implemented solely on the GPU. Our formulation uses the spatial coherence present in the cluster optimization and hierarchical cluster merging to significantly reduce the number of comparisons in both parts. Our approach provides a fast, high-quality and complete clustering analysis. Furthermore, based on the original concept we present a generalization of the method to data clustering. All advantages of the mesh-based technique carry over smoothly to the generalized clustering approach. Additionally, this approach solves the problem of the missing topological information inherent to general data clustering and leads to a Local Neighbors k-means algorithm. We evaluate both techniques by applying them to Centroidal Voronoi Diagram (CVD) based clustering. Compared to classical approaches, our techniques generate results with at least the same clustering quality. Our technique proves to scale very well, currently being limited only by the available amount of graphics memory. PMID:20421676
Zou, Ling; Guo, Qian; Xu, Yi; Yang, Biao; Jiao, Zhuqing; Xiang, Jianbo
2016-04-29
Functional magnetic resonance imaging (fMRI) is an important tool in neuroscience for assessing connectivity and interactions between distant areas of the brain. To find and characterize the coherent patterns of brain activity as a means of identifying brain systems for the cognitive reappraisal of emotion task, both density-based k-means clustering and independent component analysis (ICA) methods can be applied to characterize the interactions between brain regions involved in cognitive reappraisal of emotion. Our results reveal that, compared with the ICA method, the density-based k-means clustering method provides a higher sensitivity of polymerization. In addition, it is more sensitive to relatively weak functional connection regions. Thus, the study concludes that in the process of receiving emotional stimuli, the relatively obvious activation areas are mainly distributed in the frontal lobe, cingulum and near the hypothalamus. Furthermore, the density-based k-means clustering method provides a more reliable basis for follow-up studies of brain functional connectivity. PMID:27177109
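For reference, a minimal sketch of the plain k-means (Lloyd's) iteration underlying the clustering method discussed above; the synthetic data and initialization are illustrative assumptions, and no claim is made about the authors' density-based variant.

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Plain Lloyd's k-means: alternate nearest-center assignment and
    centroid recomputation until the centers stop moving."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        assign = np.linalg.norm(X[:, None] - centers[None], axis=-1).argmin(1)
        new = np.array([X[assign == j].mean(0) if np.any(assign == j)
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return assign, centers

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 0.3, (30, 2)), rng.normal(6, 0.3, (30, 2))])
assign, centers = kmeans(X, k=2)
```

In an fMRI setting each row of X would be a feature vector derived from a voxel's time course rather than a 2-D point.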
A New Elliptical Grid Clustering Method
Guansheng, Zheng
A new grid-based clustering method is presented in this paper. The method first performs unsupervised learning on high-dimensional data. We propose a grid-based approach to clustering that maps the data onto a multi-dimensional space and applies a linear transformation to the feature space, instead of to the objects themselves, before applying a grid-clustering method. Unlike conventional methods, it uses multidimensional hyper-ellipse grid cells. Some case studies and ideas on how to use the algorithm are described. The experimental results show that EGC can discover clusters of irregular shapes.
Institute of Scientific and Technical Information of China (English)
LING Ling; HU Yu-jin; WANG Xue-lin; LI Cheng-gang
2006-01-01
In order to improve the efficiency of ontology construction from heterogeneous knowledge sources, a semantic-based approach is presented. The ontology is constructed incrementally with the application of clustering techniques. First, terms are extracted from the knowledge sources and assembled into a term set after pretreatment. Then the concept set is built via semantic-based clustering, according to the semantics of terms provided by WordNet. Next, a concept tree is constructed in terms of mapping rules between semantic relationships and concept relationships. This semi-automatic approach avoids inconsistencies caused by knowledge engineers understanding the same concept differently, and the obtained ontology is easily expanded.
Fast Density Based Clustering Algorithm
Priyanka Trikha; Singh Vijendra
2013-01-01
Clustering is an unsupervised learning problem. It is a procedure that partitions data objects into matching clusters: the data objects in the same cluster are quite similar to each other and dissimilar to those in other clusters. Traditional algorithms do not meet the latest multiple requirements simultaneously for objects. Density-based clustering algorithms find clusters based on the density of data points in a region. The DBSCAN algorithm is one of the density-based clustering algorithms. I...
Sanfilippo, Antonio; Calapristi, Augustin J.; Crow, Vernon L.; Hetzler, Elizabeth G.; Turner, Alan E.
2009-12-22
Document clustering methods, document cluster label disambiguation methods, document clustering apparatuses, and articles of manufacture are described. In one aspect, a document clustering method includes providing a document set comprising a plurality of documents, providing a cluster comprising a subset of the documents of the document set, using a plurality of terms of the documents, providing a cluster label indicative of subject matter content of the documents of the cluster, wherein the cluster label comprises a plurality of word senses, and selecting one of the word senses of the cluster label.
Indian Academy of Sciences (India)
Andrea Paz; Andrew J Crawford
2012-11-01
Molecular markers offer a universal source of data for quantifying biodiversity. DNA barcoding uses a standardized genetic marker and a curated reference database to identify known species and to reveal cryptic diversity within well-sampled clades. Rapid biological inventories, e.g. rapid assessment programs (RAPs), unlike most barcoding campaigns, are focused on particular geographic localities rather than on clades. Because of the potentially sparse phylogenetic sampling, the addition of DNA barcoding to RAPs may present a greater challenge for the identification of named species or for revealing cryptic diversity. In this article we evaluate the use of DNA barcoding for quantifying lineage diversity within a single sampling site as compared to clade-based sampling, and present examples from amphibians. We compared algorithms for identifying DNA barcode clusters (e.g. species, cryptic species or Evolutionary Significant Units) using previously published DNA barcode data obtained from geography-based sampling at a site in Central Panama, and from clade-based sampling in Madagascar. We found that clustering algorithms based on genetic distance performed similarly on sympatric as well as clade-based barcode data, while a promising coalescent-based method performed poorly on sympatric data. The various clustering algorithms were also compared in terms of speed and software implementation. Although each method has its shortcomings in certain contexts, we recommend the use of the ABGD method, which not only performs fairly well under either sampling method, but does so in a few seconds and with a user-friendly Web interface.
Directory of Open Access Journals (Sweden)
Michael J. Watts
2011-09-01
Full Text Available Existing cluster-based methods for investigating insect species assemblages or profiles of a region to indicate the risk of new insect pest invasion have a major limitation in that they assign the same species risk factors to each region in a cluster. Clearly regions assigned to the same cluster have different degrees of similarity with respect to their species profile or assemblage. This study addresses this concern by applying weighting factors to the cluster elements used to calculate regional risk factors, thereby producing region-specific risk factors. Using a database of the global distribution of crop insect pest species, we found that we were able to produce highly differentiated region-specific risk factors for insect pests. We did this by weighting cluster elements by their Euclidean distance from the target region. Using this approach meant that risk weightings were derived that were more realistic, as they were specific to the pest profile or species assemblage of each region. This weighting method provides an improved tool for estimating the potential invasion risk posed by exotic species given that they have an opportunity to establish in a target region.
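The distance-weighted risk idea described above can be sketched as follows. The target region's risk for each species is a weighted average of the other cluster members' species-presence profiles, with weights decreasing in the Euclidean distance from the target region's profile; the inverse-distance weighting form, the helper name, and the toy profiles are illustrative assumptions, not the study's exact formula.

```python
import numpy as np

def region_risk(profiles, cluster_members, target, eps=1e-9):
    """Region-specific risk: weight each cluster member's species profile by
    the inverse of its Euclidean distance from the target region's profile.
    (The inverse-distance form is an illustrative assumption.)"""
    t = profiles[target]
    total_w, acc = 0.0, np.zeros_like(t, dtype=float)
    for m in cluster_members:
        if m == target:
            continue                                   # skip the region itself
        w = 1.0 / (np.linalg.norm(profiles[m] - t) + eps)
        acc += w * profiles[m]
        total_w += w
    return acc / total_w    # per-species risk in [0, 1]

profiles = np.array([[1, 1, 0, 0],    # presence/absence of 4 pest species
                     [1, 1, 1, 0],    # per region (rows)
                     [0, 1, 1, 1]], dtype=float)
risk = region_risk(profiles, cluster_members=[0, 1, 2], target=0)
```

Because the weights differ per target region, two regions in the same cluster receive different risk factors, which is the limitation of the unweighted cluster approach that the study addresses.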
Saeidi, Omid; Torabi, Seyed Rahman; Ataei, Mohammad
2014-03-01
Rock mass classification systems are one of the most common ways of determining rock mass excavatability and related equipment assessment. However, the strength and weak points of such rating-based classifications have always been questionable. Such classification systems assign quantifiable values to predefined classified geotechnical parameters of rock mass. This causes particular ambiguities, leading to the misuse of such classifications in practical applications. Recently, intelligence system approaches such as artificial neural networks (ANNs) and neuro-fuzzy methods, along with multiple regression models, have been used successfully to overcome such uncertainties. The purpose of the present study is the construction of several models by using an adaptive neuro-fuzzy inference system (ANFIS) method with two data clustering approaches, including fuzzy c-means (FCM) clustering and subtractive clustering, an ANN and non-linear multiple regression to estimate the basic rock mass diggability index. A set of data from several case studies was used to obtain the real rock mass diggability index and compared to the predicted values by the constructed models. In conclusion, it was observed that ANFIS based on the FCM model shows higher accuracy and correlation with actual data compared to that of the ANN and multiple regression. As a result, one can use the assimilation of ANNs with fuzzy clustering-based models to construct such rigorous predictor tools.
A Vibration Method for Discovering Density Varied Clusters
Elbatta, Mohammad T.; Bolbol, Raed M.; Wesam M. Ashour
2012-01-01
DBSCAN is a base algorithm for density-based clustering. It can find clusters of different shapes and sizes in a large amount of data containing noise and outliers. However, it fails to handle the local density variation that exists within a cluster. A good clustering method should allow a significant density variation within the cluster, because if we insist on homogeneous clustering, a large number of smaller, unimportant clusters may be generated. In this paper, a...
Directory of Open Access Journals (Sweden)
Peixin Zhao
2014-01-01
Full Text Available This paper suggests a novel clustering method for analyzing the National Incident-Based Reporting System (NIBRS) data, which includes the determination of correlation of different crime types, the development of a likelihood index for crimes to occur in a jurisdiction, and the clustering of jurisdictions based on crime type. The method was tested by using the 2005 assault data from 121 jurisdictions in Virginia as a test case. The analyses of these data show that some different crime types are correlated and some different crime parameters are correlated with different crime types. The analyses also show that certain jurisdictions within Virginia share certain crime patterns. This information assists with constructing a pattern for a specific crime type and can be used to determine whether a jurisdiction may be more likely to see this type of crime occur in their area.
Watchdog-LEACH: A new method based on LEACH protocol to Secure Clustered Wireless Sensor Networks
Directory of Open Access Journals (Sweden)
Mohammad Reza Rohbanian
2013-07-01
Full Text Available Wireless sensor networks comprise small sensor nodes with limited resources. Clustered networks have been proposed in many studies to reduce the power consumption in sensor networks. LEACH is one of the most popular techniques, offering an efficient way to minimize the power consumption in sensor networks. However, due to the characteristics of restricted resources and operation in a hostile environment, WSNs are subject to numerous threats and are vulnerable to attacks. This research proposes a solution that can be applied on top of LEACH to increase the level of security. In Watchdog-LEACH, some nodes are considered as watchdogs and some changes are applied to the LEACH protocol for intrusion detection. Watchdog-LEACH is able to protect against a wide range of attacks and provides security, energy efficiency and memory efficiency. The simulation results show that, in comparison to LEACH, the energy overhead is about 2%, so this method is practical and can be applied to WSNs.
Transfer Prototype-based Fuzzy Clustering
Deng, Zhaohong; Jiang, Yizhang; Chung, Fu-Lai; Ishibuchi, Hisao; Choi, Kup-Sze; Wang, Shitong
2014-01-01
Traditional prototype-based clustering methods, such as the well-known fuzzy c-means (FCM) algorithm, usually need sufficient data to find a good clustering partition. If the available data is limited or scarce, most of the existing prototype-based clustering algorithms will no longer be effective. While the data for the current clustering task may be scarce, there is usually some useful knowledge available in related scenes/domains. In this study, the concept of transfer learning is a...
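For reference, a minimal sketch of the classical FCM updates mentioned above, alternating prototype and membership recomputation; the fuzzifier m, the data, and the random initialization are illustrative assumptions.

```python
import numpy as np

def fcm(X, c, m=2.0, iters=100, seed=0):
    """Fuzzy c-means sketch: alternate prototype and membership updates.
    u_ik = 1 / sum_j (d_ik / d_ij)^(2/(m-1));  v_k = sum_i u_ik^m x_i / sum_i u_ik^m"""
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), c))
    U /= U.sum(axis=1, keepdims=True)              # memberships sum to 1 per point
    for _ in range(iters):
        Um = U ** m
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]
        d = np.linalg.norm(X[:, None] - centers[None], axis=-1) + 1e-12
        ratio = d[:, :, None] / d[:, None, :]      # ratio[i, k, j] = d_ik / d_ij
        U = 1.0 / (ratio ** (2.0 / (m - 1))).sum(axis=2)
    return U, centers

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 0.3, (25, 2)), rng.normal(5, 0.3, (25, 2))])
U, centers = fcm(X, c=2)
```

Unlike hard k-means, every point retains a graded membership in every prototype, which is what the transfer-learning extension in the abstract builds on.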
International Nuclear Information System (INIS)
We have developed a new method, K2, optimized for the detection of galaxy clusters in multicolor images. Based on the Red Sequence approach, K2 detects clusters using simultaneous enhancements in both colors and position. The detection significance is robustly determined through extensive Monte Carlo simulations and through comparison with available cluster catalogs based on two different optical methods, and also on X-ray data. K2 also provides quantitative estimates of the candidate clusters' richness and photometric redshifts. Initially, K2 was applied to the two-color (gri) 161 deg² images of the Canada-France-Hawaii Telescope Legacy Survey Wide (CFHTLS-W) data. Our simulations show that the false detection rate for these data, at our selected threshold, is only ∼1%, and that the cluster catalogs are ∼80% complete up to a redshift of z = 0.6 for Fornax-like and richer clusters and to z ∼ 0.3 for poorer clusters. Based on the g-, r-, and i-band photometric catalogs of the Terapix T05 release, 35 clusters/deg² are detected, with 1-2 Fornax-like or richer clusters every 2 deg². Catalogs containing data for 6144 galaxy clusters have been prepared, of which 239 are rich clusters. These clusters, especially the latter, are being searched for gravitational lenses, one of our chief motivations for cluster detection in CFHTLS. The K2 method can be easily extended to use additional color information and thus improve overall cluster detection to higher redshifts. The complete set of K2 cluster catalogs, along with the supplementary catalogs for the member galaxies, are available on request from the authors.
Combination Clustering Analysis Method and its Application
Bang-Chun Wen; Li-Yuan Dong; Qin-Liang Li; Yang Liu
2013-01-01
Traditional clustering analysis methods cannot automatically determine the optimal number of clusters. In this study, we provide a new clustering analysis method, the combination clustering analysis method, to solve this problem. By analyzing 25 kinds of automobile data samples with the combination clustering analysis method, the correctness of the analysis result was verified. It showed that the combination clustering analysis method could objectively determine the number of clustering firs...
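The abstract does not specify how the combination method selects the number of clusters. As a hedged illustration of automatic selection in general (not the authors' method), the sketch below scores candidate cluster counts with the mean silhouette coefficient over deterministic k-means runs; all data and parameters are assumptions.

```python
import numpy as np

def kmeans(X, k, iters=50):
    # deterministic farthest-first initialization, then plain Lloyd iterations
    centers = [X[0]]
    for _ in range(k - 1):
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[d.argmax()])
    centers = np.array(centers)
    for _ in range(iters):
        assign = np.linalg.norm(X[:, None] - centers[None], axis=-1).argmin(1)
        centers = np.array([X[assign == j].mean(0) if np.any(assign == j)
                            else centers[j] for j in range(k)])
    return assign

def mean_silhouette(X, labels):
    """Mean silhouette: (b - a) / max(a, b), where a is the mean intra-cluster
    distance and b the mean distance to the nearest other cluster."""
    d = np.linalg.norm(X[:, None] - X[None], axis=-1)
    scores = []
    for i in range(len(X)):
        same = (labels == labels[i])
        same[i] = False
        if not same.any():
            scores.append(0.0)            # singleton cluster convention
            continue
        a = d[i, same].mean()
        b = min(d[i, labels == c].mean() for c in set(labels) if c != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(m, 0.2, (15, 2)) for m in (0.0, 4.0, 8.0)])
best_k = max(range(2, 6), key=lambda k: mean_silhouette(X, kmeans(X, k)))
```

The candidate count with the highest silhouette score is taken as the cluster number, removing the need to fix it in advance.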
FAULT DIAGNOSIS BASED ON INTEGRATION OF CLUSTER ANALYSIS, ROUGH SET METHOD AND FUZZY NEURAL NETWORK
Institute of Scientific and Technical Information of China (English)
Feng Zhipeng; Song Xigeng; Chu Fulei
2004-01-01
In order to increase the efficiency and decrease the cost of machinery diagnosis, a hybrid system of computational intelligence methods is presented. Firstly, the continuous attributes in the diagnosis decision system are discretized with the self-organizing map (SOM) neural network. Then, dynamic reducts are computed based on the rough set method, and the key conditions for diagnosis are found according to the maximum cluster ratio. Lastly, according to the optimal reduct, an adaptive neuro-fuzzy inference system (ANFIS) is designed for fault identification. The diagnosis of a diesel engine verifies the feasibility of engineering applications.
CORECLUSTER: A Degeneracy Based Graph Clustering Framework
Giatsidis, Christos; Malliaros, Fragkiskos; Thilikos, Dimitrios M.; Vazirgiannis, Michalis
2014-01-01
Graph clustering or community detection constitutes an important task for investigating the internal structure of graphs, with a plethora of applications in several domains. Traditional tools for graph clustering, such as spectral methods, typically suffer from high time and space complexity. In this article, we present CoreCluster, an efficient graph clustering framework based on the concept of graph degeneracy, that can be used along with any known graph clustering algorithm. Our approa...
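Graph degeneracy is defined through core numbers: the core number of a vertex is the largest k such that it belongs to a subgraph in which every vertex has degree at least k, and the degeneracy is the maximum core number. A minimal sketch of the standard peeling computation (not the CoreCluster framework itself) follows; the example graph is an illustrative assumption.

```python
from collections import defaultdict

def core_numbers(edges):
    """Core number of each vertex by repeatedly peeling a minimum-degree
    vertex; the running maximum of peeled degrees gives the core numbers."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    deg = {u: len(adj[u]) for u in adj}
    core, k = {}, 0
    while deg:
        u = min(deg, key=deg.get)        # peel a minimum-degree vertex
        k = max(k, deg[u])
        core[u] = k
        for v in adj[u]:
            if v in deg:
                deg[v] -= 1
        del deg[u]
    return core

# triangle {a, b, c} (a 2-core) with a pendant path c-d-e (core number 1)
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d"), ("d", "e")]
core = core_numbers(edges)
```

CoreCluster-style frameworks exploit exactly this decomposition: low-core vertices can be set aside and the expensive clustering algorithm run only on the dense cores.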
Cluster Tree Based Hybrid Document Similarity Measure
Directory of Open Access Journals (Sweden)
M. Varshana Devi
2015-10-01
Full Text Available A cluster tree based hybrid similarity measure is established to measure hybrid similarity. In a cluster tree, the hybrid similarity can be calculated for random data even when items do not co-occur, and it generates different views. The different views of the tree can be combined, choosing the one that is most significant in cost. A method is proposed to combine the multiple views, where the views given by different distance measures are represented in a single cluster. Compared with traditional statistical methods, the cluster tree based hybrid similarity gives better feasibility for intelligence-based search, and it helps improve dimensionality reduction and semantic analysis.
Model-free functional MRI analysis using cluster-based methods
Otto, Thomas D.; Meyer-Baese, Anke; Hurdal, Monica; Sumners, DeWitt; Auer, Dorothee; Wismuller, Axel
2003-08-01
Conventional model-based or statistical analysis methods for functional MRI (fMRI) are easy to implement and are effective in analyzing data with simple paradigms. However, they are not applicable in situations in which patterns of neural response are complicated and the fMRI response is unknown. In this paper the "neural gas" network is adapted and rigorously studied for analyzing fMRI data. The algorithm supports spatial connectivity, aiding the identification of activation sites in functional brain imaging. A comparison of this new method with Kohonen's self-organizing map and with a minimal free energy vector quantizer is done in a systematic fMRI study showing comparative quantitative evaluations. The most important findings in this paper are: (1) the "neural gas" network outperforms the other two methods in terms of detecting small activation areas, and (2) evaluations based on the computed reference function show that the "neural gas" network outperforms the other two methods. The applicability of the new algorithm is demonstrated on experimental data.
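A minimal sketch of the "neural gas" update rule studied above: every prototype moves toward each presented sample by a step that decays exponentially with the prototype's distance rank, and both the step size and the neighborhood range are annealed over training. The schedules, unit count, and data below are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def neural_gas(X, n_units=2, epochs=30, eps0=0.5, lam0=1.0, seed=0):
    """Neural gas sketch: rank-based soft competitive learning with
    exponentially annealed step size and neighborhood range."""
    rng = np.random.default_rng(seed)
    W = X[rng.choice(len(X), n_units, replace=False)].astype(float)
    for t in range(epochs):
        frac = t / (epochs - 1)
        eps = eps0 * (0.01 / eps0) ** frac      # annealed learning rate
        lam = lam0 * (0.01 / lam0) ** frac      # annealed neighborhood range
        for x in rng.permutation(X):            # shuffled pass over the data
            ranks = np.argsort(np.argsort(np.linalg.norm(W - x, axis=1)))
            W += (eps * np.exp(-ranks / lam))[:, None] * (x - W)
    return W

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0, 0.2, (40, 2)), rng.normal(5, 0.2, (40, 2))])
W = neural_gas(X, n_units=2)
```

The rank-based neighborhood is what distinguishes neural gas from Kohonen's SOM, which instead uses a fixed lattice topology among the units.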
A Fast Three-Phase Line Segments Clustering Method Based on Relative Spatial Relationship
Liu, Y. Q.; Su, X. H.; Wu, E. H.
2013-01-01
Lines indicate structure information of objects. However, general line detectors cannot give sufficiently clear information, producing many short or discontinuous line segments. This study presents a new fast three-phase line segment clustering algorithm. Firstly, the Hough transform or the LSD algorithm is used to obtain an initial line set; then these lines are grouped into different sets according to direction; and then each direction set is further subdivided into dif...
Projection-based curve clustering
International Nuclear Information System (INIS)
This paper focuses on unsupervised curve classification in the context of nuclear industry. At the Commissariat a l'Energie Atomique (CEA), Cadarache (France), the thermal-hydraulic computer code CATHARE is used to study the reliability of reactor vessels. The code inputs are physical parameters and the outputs are time evolution curves of a few other physical quantities. As the CATHARE code is quite complex and CPU time-consuming, it has to be approximated by a regression model. This regression process involves a clustering step. In the present paper, the CATHARE output curves are clustered using a k-means scheme, with a projection onto a lower dimensional space. We study the properties of the empirically optimal cluster centres found by the clustering method based on projections, compared with the 'true' ones. The choice of the projection basis is discussed, and an algorithm is implemented to select the best projection basis among a library of orthonormal bases. The approach is illustrated on a simulated example and then applied to the industrial problem. (authors)
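The projection-plus-k-means scheme described above can be sketched as follows: each curve is represented by its coefficients in a small orthonormal cosine basis, and k-means is run in the low-dimensional coefficient space. The basis choice, basis size, and synthetic curves are illustrative assumptions, not the CATHARE setup or the paper's basis-selection algorithm.

```python
import numpy as np

def cosine_basis(T, m):
    """First m rows of an orthogonal cosine (DCT-II style) basis on T points."""
    t = (np.arange(T) + 0.5) / T
    B = np.cos(np.pi * np.outer(np.arange(m), t))
    return B / np.linalg.norm(B, axis=1, keepdims=True)

def project_and_cluster(curves, m=4, k=2, iters=50, seed=0):
    B = cosine_basis(curves.shape[1], m)
    coeffs = curves @ B.T                     # projection coefficients per curve
    rng = np.random.default_rng(seed)
    centers = coeffs[rng.choice(len(coeffs), k, replace=False)]
    for _ in range(iters):                    # plain k-means on the coefficients
        assign = np.linalg.norm(coeffs[:, None] - centers[None], axis=-1).argmin(1)
        centers = np.array([coeffs[assign == j].mean(0) if np.any(assign == j)
                            else centers[j] for j in range(k)])
    return assign

t = np.linspace(0, 1, 100)
rng = np.random.default_rng(6)
curves = np.vstack([np.sin(2 * np.pi * t) + rng.normal(0, 0.05, (20, 100)),
                    np.cos(2 * np.pi * t) + rng.normal(0, 0.05, (20, 100))])
assign = project_and_cluster(curves)
```

Working on a handful of projection coefficients instead of the full curves is what makes the clustering step cheap enough to embed in the regression pipeline.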
High Dimensional Data Clustering Using Fast Cluster Based Feature Selection
Directory of Open Access Journals (Sweden)
Karthikeyan.P
2014-03-01
Full Text Available Feature selection involves identifying a subset of the most useful features that produces results compatible with the original entire set of features. A feature selection algorithm may be evaluated from both the efficiency and effectiveness points of view. While efficiency concerns the time required to find a subset of features, effectiveness is related to the quality of the subset of features. Based on these criteria, a fast clustering-based feature selection algorithm (FAST) is proposed and experimentally evaluated in this paper. The FAST algorithm works in two steps. In the first step, features are divided into clusters by using graph-theoretic clustering methods. In the second step, the most representative feature that is strongly related to target classes is selected from each cluster to form a subset of features. Features in different clusters are relatively independent; the clustering-based strategy of FAST has a high probability of producing a subset of useful and independent features. To ensure the efficiency of FAST, we adopt the efficient minimum spanning tree (MST) clustering method using Kruskal's algorithm. The efficiency and effectiveness of the FAST algorithm are evaluated through an empirical study.
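The MST clustering step can be illustrated with Kruskal's algorithm. The sketch below clusters plain points by Euclidean distance and cuts the k-1 heaviest MST edges to form k clusters; FAST itself builds the MST over a feature graph with information-theoretic edge weights, so the distance choice here is an illustrative assumption.

```python
import numpy as np

def mst_clusters(X, k):
    """Kruskal's algorithm: grow the MST from the shortest edges using
    union-find, then drop the k-1 heaviest MST edges to leave k components."""
    n = len(X)
    edges = sorted((float(np.linalg.norm(X[i] - X[j])), i, j)
                   for i in range(n) for j in range(i + 1, n))
    parent = list(range(n))

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]   # path halving
            u = parent[u]
        return u

    mst = []
    for w, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:                        # edge joins two components
            parent[ri] = rj
            mst.append((w, i, j))
    mst.sort()
    keep = mst[:len(mst) - (k - 1)]         # cut the k-1 heaviest MST edges
    parent = list(range(n))
    for _, i, j in keep:
        parent[find(i)] = find(j)
    roots = [find(u) for u in range(n)]
    relabel = {r: c for c, r in enumerate(dict.fromkeys(roots))}
    return np.array([relabel[r] for r in roots])

rng = np.random.default_rng(7)
X = np.vstack([rng.normal(0, 0.2, (15, 2)), rng.normal(4, 0.2, (15, 2))])
labels = mst_clusters(X, k=2)
```

Cutting the heaviest MST edges is equivalent to single-linkage clustering at the corresponding distance threshold, which is why the MST makes the clustering step so cheap.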
Directory of Open Access Journals (Sweden)
G V S Rajkumar
2011-07-01
Full Text Available Image segmentation is one of the most important areas of image retrieval. In colour image segmentation the feature vector of each image region is n-dimensional, unlike the grey-level case. In this paper a new image segmentation algorithm is developed and analyzed using a finite mixture of doubly truncated bivariate Gaussian distributions integrated with hierarchical clustering. The number of image regions in the whole image is determined using the hierarchical clustering algorithm. Assuming that the bivariate feature vector (consisting of the hue angle and saturation) of each pixel in an image region follows a doubly truncated bivariate Gaussian distribution, the segmentation algorithm is developed. The model parameters are estimated using the EM algorithm; the updated equations of the EM algorithm for a finite mixture of doubly truncated Gaussian distributions are derived. A segmentation algorithm for colour images is proposed using component maximum likelihood. The performance of the developed algorithm is evaluated by carrying out experiments with five images taken from the Berkeley image dataset and computing image segmentation metrics such as the Global Consistency Error (GCE), Variation of Information (VOI), and Probability Rand Index (PRI). The experimental results show that this algorithm outperforms existing image segmentation algorithms.
Cycle-Based Cluster Variational Method for Direct and Inverse Inference
Furtlehner, Cyril; Decelle, Aurélien
2016-08-01
Large scale inference problems of practical interest can often be addressed with help of Markov random fields. This requires to solve in principle two related problems: the first one is to find offline the parameters of the MRF from empirical data (inverse problem); the second one (direct problem) is to set up the inference algorithm to make it as precise, robust and efficient as possible. In this work we address both the direct and inverse problem with mean-field methods of statistical physics, going beyond the Bethe approximation and associated belief propagation algorithm. We elaborate on the idea that loop corrections to belief propagation can be dealt with in a systematic way on pairwise Markov random fields, by using the elements of a cycle basis to define regions in a generalized belief propagation setting. For the direct problem, the region graph is specified in such a way as to avoid feed-back loops as much as possible by selecting a minimal cycle basis. Following this line we are led to propose a two-level algorithm, where a belief propagation algorithm is run alternatively at the level of each cycle and at the inter-region level. Next we observe that the inverse problem can be addressed region by region independently, with one small inverse problem per region to be solved. It turns out that each elementary inverse problem on the loop geometry can be solved efficiently. In particular in the random Ising context we propose two complementary methods based respectively on fixed point equations and on a one-parameter log likelihood function minimization. Numerical experiments confirm the effectiveness of this approach both for the direct and inverse MRF inference. Heterogeneous problems of size up to 10^5 are addressed in a reasonable computational time, notably with better convergence properties than ordinary belief propagation.
Liu, Xin
2015-01-01
In a cognitive sensor network (CSN), the wastage of sensing time and energy is a challenge to cooperative spectrum sensing when the number of cooperative cognitive nodes (CNs) becomes very large. In this paper, a novel wireless power transfer (WPT)-based weighted clustering cooperative spectrum sensing model is proposed, which divides all the CNs into several clusters, then selects the most favorable CNs as the cluster heads and allows the common CNs to transfer the received radio frequency (RF) energy of the primary node (PN) to the cluster heads, in order to supply the electrical energy needed for sensing and cooperation. A joint resource optimization is formulated to maximize the spectrum access probability of the CSN, through jointly allocating sensing time and clustering number. According to the resource optimization results, a clustering algorithm is proposed. The simulation results have shown that, compared to the traditional model, the cluster heads of the proposed model can achieve more transmission power, and there exist an optimal sensing time and clustering number that maximize the spectrum access probability. PMID:26528987
Cluster identification based on correlations
Schulman, L. S.
2012-04-01
The problem addressed is the identification of cooperating agents based on correlations created as a result of the joint action of these and other agents. A systematic method for using correlations beyond second moments is developed. The technique is applied to a didactic example, the identification of alphabet letters based on correlations among the pixels used in an image of the letter. As in this example, agents can belong to more than one cluster. Moreover, the identification scheme does not require that the patterns be known ahead of time.
Cluster Based Text Classification Model
DEFF Research Database (Denmark)
Nizamani, Sarwat; Memon, Nasrullah; Wiil, Uffe Kock
2011-01-01
We propose a cluster based classification model for suspicious email detection and other text classification tasks. The text classification tasks comprise many training examples that require a complex classification model. Using clusters for classification makes the model simpler and increases the...... classifier is trained on each cluster having reduced dimensionality and less number of examples. The experimental results show that the proposed model outperforms the existing classification models for the task of suspicious email detection and topic categorization on the Reuters-21578 and 20 Newsgroups...... datasets. Our model also outperforms A Decision Cluster Classification (ADCC) and the Decision Cluster Forest Classification (DCFC) models on the Reuters-21578 dataset....
International Nuclear Information System (INIS)
Highlights: • A novel pattern sequence-based direct time series forecasting method is proposed. • Due to the use of SOM's topology-preserving property, only SOM can be applied. • SCPSNSP deals only with the cluster patterns, not each specific time series value. • SCPSNSP performs better than recently developed forecasting algorithms. - Abstract: In this paper, we propose a new day-ahead direct time series forecasting method for competitive electricity markets based on clustering and next-symbol prediction. In the clustering step, the pattern sequence and its topology relations are obtained from self-organizing map (SOM) time series clustering. In the next-symbol prediction step, with each cluster label in the pattern sequence represented as a pair of its topologically identical coordinates, an artificial neural network is used to predict the topological coordinates of the next day by training on the relationship between the previous daily pattern sequence and its next-day pattern. According to the obtained topology relations, the nearest nonzero-hits pattern is assigned to the next day, so that the whole time series can be directly forecasted from the assigned cluster pattern. The proposed method was evaluated on the Spanish, Australian and New York electricity markets and compared with PSF and some of the most recently published forecasting methods. Experimental results show that the proposed method outperforms the best of these forecasting methods by at least 3.64%.
Cosine-Based Clustering Algorithm Approach
Directory of Open Access Journals (Sweden)
Mohammed A. H. Lubbad
2012-02-01
Full Text Available Since many applications require the management of spatial data, clustering large spatial databases is an important problem, which tries to find the densely populated regions in the feature space to be used in data mining, knowledge discovery, or efficient information retrieval. A good clustering approach should be efficient and detect clusters of arbitrary shapes, and it must be insensitive to outliers (noise) and to the order of input data. In this paper, Cosine Cluster is proposed based on the cosine transform, which satisfies all the above requirements. Using the multi-resolution property of cosine transforms, arbitrarily shaped clusters can be effectively identified at different degrees of accuracy. Cosine Cluster is also shown to be highly efficient in terms of time complexity. Experimental results on very large data sets are presented, showing the efficiency and effectiveness of the proposed approach compared to other recent clustering methods.
Methods for co-clustering: a review
Brault, Vincent; Lomet, Aurore
2015-01-01
Co-clustering aims to identify block patterns in a data table from a joint clustering of rows and columns. This problem has been studied since 1965, with renewed interest in fields ranging from graph analysis and machine learning to data mining and genomics. Several variants have been proposed under diverse names: bi-clustering, block clustering, cross-clustering, or simultaneous clustering. We propose here a review of these methods in order to describe, compare and discuss the different ...
Izadi, Hossein; Sadri, Javad; Mehran, Nosrat-Agha
2015-08-01
Mineral segmentation in thin sections is a challenging, popular, and important research topic in computational geology, mineralogy, and mining engineering. Mineral segmentation in thin sections containing altered minerals, in which there are no evident and close boundaries, is a rather complex process. Most thin sections created in industry include altered minerals; however, intelligent mineral segmentation in such thin sections has not been widely investigated in the literature, and the current state-of-the-art algorithms are not able to accurately segment their minerals. In this paper, a novel method based on incremental learning for clustering pixels is proposed in order to segment index minerals in thin sections both with and without altered minerals. Our algorithm uses 12 color features extracted from thin section images: red, green, blue, hue, saturation and intensity, under plane- and cross-polarized light at the maximum-intensity situation. The proposed method has been tested on 155 igneous samples, and overall accuracies of 92.15% and 85.24% have been obtained for thin sections without and with altered minerals, respectively. Experimental results indicate that the proposed method outperforms other similar methods in the literature, especially for segmenting thin sections containing altered minerals. The proposed algorithm could be applied in applications that require real-time segmentation or an efficient identification map, such as petroleum geology, petrography and NASA Mars exploration.
Breaking the hierarchy - a new cluster selection mechanism for hierarchical clustering methods
Directory of Open Access Journals (Sweden)
Zweig Katharina A
2009-10-01
Full Text Available Abstract Background Hierarchical clustering methods like Ward's method have been used for decades to understand biological and chemical data sets. In order to get a partition of the data set, it is necessary to choose an optimal level of the hierarchy by a so-called level selection algorithm. In 2005, a new kind of hierarchical clustering method was introduced by Palla et al. that differs in two ways from Ward's method: it can be used on data for which no full similarity matrix is defined, and it can produce overlapping clusters, i.e., allow for multiple membership of items in clusters. These features are optimal for biological and chemical data sets, but until now no level selection algorithm has been published for this method. Results In this article we provide a general selection scheme, the level independent clustering selection method, called LInCS. With it, clusters can be selected from any level in quadratic time with respect to the number of clusters. Since hierarchically clustered data is not necessarily associated with a similarity measure, the selection is based on a graph theoretic notion of cohesive clusters. We present results of our method on two data sets, a set of drug-like molecules and a set of protein-protein interaction (PPI) data. In both cases the method provides a clustering with very good sensitivity and specificity values according to a given reference clustering. Moreover, we can show for the PPI data set that our graph theoretic cohesiveness measure indeed chooses biologically homogeneous clusters and disregards inhomogeneous ones in most cases. We finally discuss how the method can be generalized to other hierarchical clustering methods to allow for a level independent cluster selection. Conclusion Using our new cluster selection method together with the method by Palla et al. provides a new interesting clustering mechanism that allows computing overlapping clusters, which is especially valuable for biological and
Document Clustering Based on Semi-Supervised Term Clustering
Directory of Open Access Journals (Sweden)
Hamid Mahmoodi
2012-05-01
Full Text Available The study proposes a multi-step feature (term) selection process that, in a semi-supervised fashion, provides initial centers for term clusters. The fuzzy c-means (FCM) clustering algorithm is then used to cluster the terms, and finally each document is assigned to its closest associated term clusters. While most text clustering algorithms cluster documents directly, we propose to first group the terms using the FCM algorithm and then cluster the documents based on the term clusters. We evaluate the effectiveness of our technique on several standard text collections and compare our results with some classical text clustering algorithms.
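The two-step scheme described above (group the terms, then assign documents to term clusters) can be sketched as follows. This is a toy illustration, not the paper's implementation: the term-document matrix is invented, and hard k-means with farthest-point initialization stands in for the paper's semi-supervised FCM step.

```python
import numpy as np

# Toy term-document matrix: rows = terms, columns = documents (illustrative).
# Terms 0-2 co-occur in docs 0-1 (topic A); terms 3-5 in docs 2-3 (topic B).
A = np.array([[2, 1, 0, 0],
              [1, 2, 0, 0],
              [2, 2, 0, 0],
              [0, 0, 1, 2],
              [0, 0, 2, 1],
              [0, 0, 2, 2]], dtype=float)

def kmeans(X, k, n_iter=20):
    # Farthest-point initialization keeps this tiny example deterministic.
    centers = [X[0]]
    for _ in range(1, k):
        d2 = np.min([((X - c) ** 2).sum(1) for c in centers], axis=0)
        centers.append(X[d2.argmax()])
    centers = np.array(centers)
    for _ in range(n_iter):
        labels = ((X[:, None] - centers[None]) ** 2).sum(-1).argmin(1)
        centers = np.array([X[labels == c].mean(0) if (labels == c).any()
                            else centers[c] for c in range(k)])
    return labels

term_cluster = kmeans(A, k=2)   # step 1: group the terms
# step 2: assign each document to the term cluster with the largest total weight
doc_scores = np.array([A[term_cluster == c].sum(0) for c in range(2)])
doc_cluster = doc_scores.argmax(0)
```

On this toy matrix the terms split into the two topic groups and each document follows the term cluster that dominates it.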
A Local Pair Natural Orbital-Based Multireference Mukherjee’s Coupled Cluster Method
Czech Academy of Sciences Publication Activity Database
Demel, Ondřej; Pittner, Jiří
2015-01-01
Roč. 11, č. 7 (2015), s. 3104-3114. ISSN 1549-9618 R&D Projects: GA ČR GAP208/11/2222; GA ČR(CZ) GJ15-00058Y Institutional support: RVO:61388955 Keywords : ELECTRON CORRELATION METHODS * BRILLOUIN-WIGNER * CONFIGURATION-INTERACTION Subject RIV: CF - Physical ; Theoretical Chemistry Impact factor: 5.498, year: 2014
Ghahari, Alireza
2009-01-01
Multiview 3D face modeling has attracted increasing attention recently and has become one of the potential avenues in future video systems. We aim to make more reliable and robust automatic feature extraction and natural 3D feature construction from 2D features detected on a pair of frontal and profile view face images. We propose several heuristic algorithms to minimize possible errors introduced by the prevalent nonperfect orthogonal condition and noncoherent luminance. In our approach, we first extract the 2D features that are visible to both cameras in both views. Then, we estimate the coordinates of the features in the hidden profile view based on the visible features extracted in the two orthogonal views. Finally, based on the coordinates of the extracted features, we deform a 3D generic model to perform the desired 3D clone modeling. The present study demonstrates the applicability of the resulting facial models to practical applications such as face recognition and facial animation.
Niching method using clustering crowding
Institute of Scientific and Technical Information of China (English)
GUO Guan-qi; GUI Wei-hua; WU Min; YU Shou-yi
2005-01-01
This study analyzes drift phenomena of deterministic crowding and probabilistic crowding by using an equivalence class model and expectation proportion equations. It is proved that the replacement errors of deterministic crowding cause the population to converge to a single individual, thus resulting in premature stagnation or the loss of optima, and that probabilistic crowding can maintain multiple subpopulations in equilibrium when the population size is adequately large. An improved niching method using clustering crowding is proposed. By analyzing the topology of the fitness landscape using the hill-valley function and extending the search space for similarity analysis, clustering crowding determines the locality of the search space more accurately, thus greatly decreasing the replacement errors of crowding. The integration of deterministic and probabilistic replacement increases the capacity for both parallel local hill climbing and maintaining multiple subpopulations. The experimental results on various multimodal functions show that the performance of clustering crowding, in terms of the number of effective peaks maintained, average peak ratio and global optimum ratio, is uniformly superior to that of evolutionary algorithms using fitness sharing, simple deterministic crowding and probabilistic crowding.
Coupled-cluster method for excitation energies
International Nuclear Information System (INIS)
The coupled-cluster method of electronic-structure calculation is briefly introduced and examined as to its dependence upon the choice of reference state. It is found that the method depends relatively weakly on the reference state if single-particle "clusters" are included in the calculations. This fact makes it reasonable to combine coupled-cluster calculations of ground and excited states, based on the same reference wave function, to obtain an equation for the excitation energy. This excitation-energy equation is of nearly the same form as that obtained by the "equations of motion" approach, but contains additional terms which should improve the description of orbital-relaxation and state-dependent correlation effects.
Model-based clustered-dot screening
Kim, Sang Ho
2006-01-01
I propose a halftone screen design method based on a human visual system model and the characteristics of the electro-photographic (EP) printer engine. Generally, screen design methods based on human visual models produce dispersed-dot type screens while design methods considering EP printer characteristics generate clustered-dot type screens. In this paper, I propose a cost function balancing the conflicting characteristics of the human visual system and the printer. By minimizing the obtained cost function, I design a model-based clustered-dot screen using a modified direct binary search algorithm. Experimental results demonstrate the superior quality of the model-based clustered-dot screen compared to a conventional clustered-dot screen.
Pavement Crack Detection Using Spectral Clustering Method
Directory of Open Access Journals (Sweden)
Jin Huazhong
2015-01-01
Full Text Available Pavement crack detection plays an important role in pavement maintenance and management, and nowadays it can be performed through remote image analysis, so the edges of pavement cracks must be extracted in advance. In general, traditional edge detection methods consider neither phase information nor the spatial relationship between adjacent image areas when extracting edges. To overcome this deficiency, this paper proposes a pavement crack detection algorithm based on the spectral clustering method. Firstly, a measure of similarity between pairs of pixels is computed from orientation energy. Then, spatial relationships are used to find regions where the similarity between pixels within a region is high and the similarity between pixels in different regions is low. After that, crack edge detection is completed with the spectral clustering method. The presented method has been run on real-life images of pavement cracks, and experimental results show that it obtains satisfactory detection results.
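The spectral clustering step named above can be sketched in a few lines: embed the pixels with the smallest eigenvectors of the graph Laplacian of a similarity matrix, then run k-means in the embedding. This is a generic unnormalized-Laplacian sketch, not the paper's pipeline; the block-structured similarity matrix below is a toy stand-in for the pixel similarities derived from orientation energy.

```python
import numpy as np

def spectral_clustering(W, k):
    """Unnormalized spectral clustering on a symmetric similarity matrix W."""
    L = np.diag(W.sum(axis=1)) - W          # graph Laplacian L = D - W
    _, vecs = np.linalg.eigh(L)             # eigenvalues in ascending order
    emb = vecs[:, :k]                       # embed with the k smallest eigenvectors
    # tiny k-means on the embedding, farthest-point init for determinism
    centers = [emb[0]]
    for _ in range(1, k):
        d2 = np.min([((emb - c) ** 2).sum(1) for c in centers], axis=0)
        centers.append(emb[d2.argmax()])
    centers = np.array(centers)
    for _ in range(50):
        labels = ((emb[:, None] - centers[None]) ** 2).sum(-1).argmin(1)
        centers = np.array([emb[labels == c].mean(0) if (labels == c).any()
                            else centers[c] for c in range(k)])
    return labels

# Block-structured similarity: items 0-4 and 5-9 form two tight groups
# with weak cross-links (all weights are illustrative).
W = np.full((10, 10), 0.01)
W[:5, :5] = 1.0
W[5:, 5:] = 1.0
np.fill_diagonal(W, 0.0)
labels = spectral_clustering(W, k=2)
```

Because the second-smallest (Fiedler) eigenvector takes one value on each block, the embedding collapses each group to a point and k-means recovers the two groups exactly.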
A method of open cluster membership determination
Javakhishvili, G; Todua, M; Inasaridze, R
2006-01-01
A new method for the determination of open cluster membership based on a cumulative effect is proposed. In the field of a plate, the relative x and y coordinate positions of each star with respect to all the other stars are added. The procedure is carried out for two epochs t_1 and t_2 separately, and then one sum is subtracted from the other. For a field star, the differences in its relative coordinate positions between the two epochs accumulate. For a cluster star, on the contrary, the changes in relative positions of cluster members between t_1 and t_2 will be very small. On the histogram of sums the cluster stars will gather to the left of the diagram, while the field stars will form a tail to the right. The procedure allows us to efficiently discriminate one group from the other. The longer the interval between t_1 and t_2 and the more cluster stars present, the greater is the effect. The accumulation method does not require reference stars, determination of centroids and modelling the distribution of field stars, nec...
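The accumulation idea described above can be sketched on synthetic data. This is only an illustration of the principle, with invented positions and proper motions: cluster stars share one motion, so their epoch-to-epoch difference of relative-position sums is identical, while field stars scatter.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic field: 30 cluster stars share a common proper motion,
# 30 field stars move independently (all values are illustrative).
n_c, n_f = 30, 30
pos1 = rng.uniform(0, 100, size=(n_c + n_f, 2))      # positions at epoch t1
motion = np.zeros((n_c + n_f, 2))
motion[:n_c] = [0.5, -0.3]                           # shared cluster motion
motion[n_c:] = rng.uniform(-5, 5, size=(n_f, 2))     # independent field motions
pos2 = pos1 + motion                                 # positions at epoch t2

def relative_sums(pos):
    # For star i, sum the offsets (p_i - p_j) over all other stars j,
    # vectorised as n * p_i - sum_j p_j.
    return len(pos) * pos - pos.sum(axis=0)

# Subtract the two epochs' sums; co-moving cluster stars all land on the
# same value of d, while field stars spread out into a tail.
d = np.abs(relative_sums(pos2) - relative_sums(pos1)).sum(axis=1)
```

A histogram of `d` then shows the cluster stars piled into a single narrow peak, which is the discrimination effect the abstract describes.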
Abusamra, Heba
2016-07-20
The high-dimension, low-sample-size nature of gene expression data makes the classification task challenging; therefore, feature (gene) selection becomes an apparent need. Selecting meaningful and relevant genes for a classifier not only decreases the computational time and cost, but also improves classification performance. However, most existing feature selection approaches suffer from problems such as lack of robustness and validation issues. Here, we present a new feature selection technique that takes advantage of clustering both samples and genes. Materials and methods We used a leukemia gene expression dataset [1]. The effectiveness of the selected features was evaluated by four different classification methods: support vector machines, k-nearest neighbor, random forest, and linear discriminant analysis. The method evaluates the importance and relevance of each gene cluster by summing the expression levels of the genes belonging to that cluster. A gene cluster is considered important if it satisfies conditions depending on thresholds and percentages; otherwise it is eliminated. Results Initial analysis identified 7120 differentially expressed genes in leukemia (Fig. 15a); after applying our feature selection methodology we ended up with 1117 specific genes discriminating the two classes of leukemia (Fig. 15b). Applying the same method with a more stringent higher positive and lower negative threshold condition reduced the number to 58 genes, which were tested to evaluate the effectiveness of the method (Fig. 15c). The results of the four classification methods are summarized in Table 11. Conclusions The feature selection method gave good results with minimum classification error. Our heat-map result shows a distinct pattern of refined genes discriminating between the two classes of leukemia.
Model-based clustering using copulas with applications
Kosmidis, Ioannis; Karlis, Dimitris
2014-01-01
The majority of model-based clustering techniques are based on multivariate normal models and their variants. In this paper, copulas are used for the construction of flexible families of models for clustering applications. The use of copulas in model-based clustering offers two direct advantages over current methods: i) the appropriate choice of copulas provides the ability to obtain a range of exotic shapes for the clusters, and ii) the explicit choice of marginal distributions for the cluster...
Resampling methods for document clustering
Volk, D.; Stepanov, M. G.
2001-01-01
We compare the performance of different clustering algorithms applied to the task of unsupervised text categorization. We consider agglomerative clustering algorithms, principal direction divisive partitioning and (for the first time) superparamagnetic clustering with several distance measures. The algorithms have been applied to test databases extracted from the Reuters-21578 text categorization test database. We find that simple application of the different clustering algorithms yields clus...
Voting-based consensus clustering for combining multiple clusterings of chemical structures
Directory of Open Access Journals (Sweden)
Saeed Faisal
2012-12-01
Full Text Available Abstract Background Although many consensus clustering methods have been successfully used for combining multiple classifiers in areas such as machine learning, applied statistics, pattern recognition and bioinformatics, few consensus clustering methods have been applied for combining multiple clusterings of chemical structures. It is known that no individual clustering method will always give the best results for all types of applications. So, in this paper, three voting- and graph-based consensus clustering methods were used for combining multiple clusterings of chemical structures to enhance the ability of separating biologically active molecules from inactive ones in each cluster. Results The cumulative voting-based aggregation algorithm (CVAA), the cluster-based similarity partitioning algorithm (CSPA) and the hyper-graph partitioning algorithm (HGPA) were examined. The F-measure and the Quality Partition Index (QPI) were used to evaluate the clusterings, and the results were compared to Ward's clustering method. The MDL Drug Data Report (MDDR) dataset was used for the experiments and was represented by two 2D fingerprints, ALOGP and ECFP_4. The voting-based consensus clustering method outperformed Ward's method under both the F-measure and QPI for both the ALOGP and ECFP_4 fingerprints, while the graph-based consensus clustering methods outperformed Ward's method only for ALOGP under QPI. The Jaccard and Euclidean distance measures were the methods of choice to generate the ensembles, giving the highest values for both criteria. Conclusions The results of the experiments show that consensus clustering methods can improve the effectiveness of clusterings of chemical structures. The cumulative voting-based aggregation algorithm (CVAA) was the method of choice among the consensus clustering methods.
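The voting idea behind consensus clustering can be sketched in miniature: since cluster labels are arbitrary, each partition is first relabeled to best match a reference partition, and the consensus label of each object is then the majority vote. This is a simplified stand-in for CVAA, not its published formulation; the label vectors and brute-force permutation matching (fine for small k) are illustrative.

```python
import numpy as np
from itertools import permutations

def align(labels, reference, k):
    """Relabel `labels` to best agree with `reference` (try all k! mappings)."""
    best, best_score = labels, -1
    for perm in permutations(range(k)):
        mapped = np.array([perm[l] for l in labels])
        score = (mapped == reference).sum()
        if score > best_score:
            best, best_score = mapped, score
    return best

def voting_consensus(partitions, k):
    """Majority vote after aligning every partition to the first one."""
    ref = partitions[0]
    aligned = np.array([align(p, ref, k) for p in partitions])
    return np.array([np.bincount(aligned[:, i], minlength=k).argmax()
                     for i in range(aligned.shape[1])])

# Three noisy clusterings of six objects (labels invented for illustration).
parts = [np.array([0, 0, 0, 1, 1, 1]),
         np.array([1, 1, 1, 0, 0, 0]),   # same partition, labels swapped
         np.array([0, 0, 1, 1, 1, 1])]   # one object misclustered
consensus = voting_consensus(parts, k=2)
```

The swapped labels of the second partition are undone by the alignment step, and the single misclustered object in the third partition is outvoted.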
Chen Bernard; Mete Mutlu; Kockara Sinan; Aydin Kemal
2010-01-01
Abstract Background Computer-aided segmentation and border detection in dermoscopic images is one of the core components of diagnostic procedures and therapeutic interventions for skin cancer. Automated assessment tools for dermoscopy images have become an important research field mainly because of inter- and intra-observer variations in human interpretation. In this study, we compare two approaches for automatic border detection in dermoscopy images: density based clustering (DBSCAN) and Fuz...
A local distribution based spatial clustering algorithm
Deng, Min; Liu, Qiliang; Li, Guangqiang; Cheng, Tao
2009-10-01
Spatial clustering is an important means for spatial data mining and spatial analysis, and it can be used to discover potential spatial association rules and outliers in spatial data. Most existing spatial clustering algorithms utilize only the spatial distance or local density to find the spatial clusters in a spatial database, without taking the local spatial distribution characteristics into account, so the clustered results are unreasonable in many cases. To overcome such limitations, this paper first develops a new indicator (the local median angle) to measure the local distribution, and further proposes a new algorithm, called the local distribution based spatial clustering algorithm (LDBSC). In the process of spatial clustering, a series of recursive searches is carried out over all the entities, so that entities whose local median angles are very close or equal are clustered together. In this way, all the spatial entities in the spatial database can be automatically divided into clusters. Finally, two experiments demonstrate that the proposed method outperforms DBSCAN, and that it is robust, feasible, and able to find clusters with different shapes.
Single pass kernel -means clustering method
Indian Academy of Sciences (India)
T Hitendra Sarma; P Viswanath; B Eswara Reddy
2013-06-01
In unsupervised classification, the kernel $k$-means clustering method has been shown to perform better than the conventional $k$-means clustering method in identifying non-isotropic clusters in a data set. The space and time requirements of this method are $O(n^2)$, where $n$ is the data set size. Because of this quadratic time complexity, the kernel $k$-means method is not applicable to large data sets. The paper proposes a simple and faster version of the kernel $k$-means clustering method, called the single pass kernel $k$-means clustering method. The proposed method works as follows. First, a random sample $\mathcal{S}$ is selected from the data set $\mathcal{D}$. A partition $\Pi_{\mathcal{S}}$ is obtained by applying the conventional kernel $k$-means method on the random sample $\mathcal{S}$. The novelty of the paper is that, for each cluster in $\Pi_{\mathcal{S}}$, the exact cluster center in the input space is obtained using the gradient descent approach. Finally, each unsampled pattern is assigned to its closest exact cluster center to get a partition of the entire data set. The proposed method needs to scan the data set only once and is much faster than the conventional kernel $k$-means method. The time complexity of this method is $O(s^2 + t + nk)$, where $s$ is the size of the random sample $\mathcal{S}$, $k$ is the number of clusters required, and $t$ is the time taken by the gradient descent method (to find the exact cluster centers). The space complexity of the method is $O(s^2)$. The proposed method can be easily implemented and is suitable for large data sets, like those in data mining applications. Experimental results show that, with a small loss of quality, the proposed method can significantly reduce the time taken compared to the conventional kernel $k$-means clustering method. The proposed method is also compared with other recent similar methods.
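The sample-then-assign structure of the method above can be sketched as follows. This is a simplified illustration, not the paper's algorithm: the gradient-descent recovery of exact input-space centers is replaced by plain sample means of each cluster, and the RBF kernel, data, and parameters are all invented for the demo.

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def kernel_kmeans(K, k, n_iter=20, seed=0):
    """Kernel k-means on a precomputed kernel matrix K (n x n)."""
    rng = np.random.default_rng(seed)
    n = K.shape[0]
    labels = rng.integers(0, k, n)
    for _ in range(n_iter):
        dist = np.zeros((n, k))
        for c in range(k):
            mask = labels == c
            nc = max(mask.sum(), 1)
            # squared feature-space distance to the implicit cluster mean
            dist[:, c] = (np.diag(K)
                          - 2 * K[:, mask].sum(1) / nc
                          + K[np.ix_(mask, mask)].sum() / nc ** 2)
        new = dist.argmin(1)
        if (new == labels).all():
            break
        labels = new
    return labels

# Single-pass variant (simplified): cluster a random sample, then assign
# every point to the nearest approximate input-space center.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (100, 2)),
               rng.normal(3, 0.3, (100, 2))])
idx = rng.choice(len(X), 40, replace=False)
S = X[idx]                                    # the random sample
lab_s = kernel_kmeans(rbf_kernel(S, S, gamma=0.5), k=2)
centers = np.array([S[lab_s == c].mean(0) if (lab_s == c).any() else S.mean(0)
                    for c in range(2)])       # stand-in for the gradient-descent step
labels = ((X[:, None, :] - centers[None]) ** 2).sum(-1).argmin(1)
```

Only the 40-point sample ever touches the $O(s^2)$ kernel matrix; the remaining points are assigned in a single linear pass, which is the source of the speedup the abstract describes.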
Data Clustering Analysis Based on Wavelet Feature Extraction
Institute of Scientific and Technical Information of China (English)
QIAN Yuntao; TANG Yuanyan
2003-01-01
A novel wavelet-based data clustering method is presented in this paper, which includes wavelet feature extraction and a cluster growing algorithm. The wavelet transform can provide rich and diversified information for representing the global and local inherent structures of a dataset; therefore, it is a very powerful tool for clustering feature extraction. As an unsupervised classification, the target of clustering analysis depends on the specific clustering criteria. Several criteria that should be considered for a general-purpose clustering algorithm are proposed, and the cluster growing algorithm is constructed to connect the clustering criteria with the wavelet features. Compared with other popular clustering methods, our clustering approach provides multi-resolution clustering results, needs few prior parameters, correctly deals with irregularly shaped clusters, and is insensitive to noise and outliers. As this wavelet-based clustering method is aimed at solving two-dimensional data clustering problems, for high-dimensional datasets the self-organizing map and U-matrix methods are applied to transform them into a two-dimensional Euclidean space, so that high-dimensional data clustering analysis can also be performed. Results on some simulated data and standard test data are reported to illustrate the power of our method.
Document Clustering using Sequential Information Bottleneck Method
Ms. P. J. Gayathri; S. C. Punitha; Dr. M. Punithavalli
2010-01-01
Document clustering is a subset of the larger field of data clustering, which borrows concepts from the fields of information retrieval (IR), natural language processing (NLP), and machine learning (ML). It is a more specific technique for unsupervised document organization, automatic topic extraction, and fast information retrieval or filtering. A wide variety of unsupervised clustering algorithms exist. This paper presents a sequential algorithm for document clustering based on an...
Clustering based segmentation of text in complex color images
Institute of Scientific and Technical Information of China (English)
毛文革; 王洪滨; 张田文
2004-01-01
We propose a novel scheme based on clustering analysis in color space to solve text segmentation in complex color images. Text segmentation includes automatic clustering of the color space and foreground image generation. Two methods are also proposed for automatic clustering: the first determines the optimal number of clusters, and the second is a fuzzy competitive clustering method based on competitive learning techniques. Essential foreground images obtained from any of the color clusters are combined into the final foreground images. Further performance analysis reveals the advantages of the proposed methods.
Normalization based K means Clustering Algorithm
Virmani, Deepali; Taneja, Shweta; Malhotra, Geetika
2015-01-01
K-means is an effective clustering technique used to separate similar data into groups based on initial centroids of clusters. In this paper, a normalization-based K-means clustering algorithm (N-K means) is proposed. The proposed N-K means algorithm applies normalization to the available data prior to clustering, and calculates the initial centroids based on weights. Experimental results demonstrate the improvement of the proposed N-K means clustering algorithm over existing...
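The normalize-then-cluster idea can be sketched as below. This is not the paper's N-K means: the weight-based centroid seeding is replaced by plain random initialization, and min-max scaling is assumed as the normalization; the two-blob data set is invented to show why scaling matters when features live on very different ranges.

```python
import numpy as np

def minmax_normalize(X):
    """Scale each attribute to [0, 1] so no single feature dominates the distance."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    return (X - lo) / np.where(hi > lo, hi - lo, 1.0)

def kmeans(X, k, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(n_iter):
        labels = ((X[:, None] - centers[None]) ** 2).sum(-1).argmin(1)
        new = np.array([X[labels == c].mean(0) if (labels == c).any()
                        else centers[c] for c in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return labels, centers

# Two groups separated on feature 0, drowned out by feature 1's huge scale;
# without normalization the second feature dominates the Euclidean distance.
rng = np.random.default_rng(1)
X = np.vstack([np.column_stack([rng.normal(0, 1, 50), rng.normal(0, 1000, 50)]),
               np.column_stack([rng.normal(10, 1, 50), rng.normal(0, 1000, 50)])])
labels, _ = kmeans(minmax_normalize(X), k=2)
```

After scaling, both features contribute comparably to the distance, so the separation on feature 0 can drive the clustering.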
The polarizable embedding coupled cluster method
DEFF Research Database (Denmark)
Sneskov, Kristian; Schwabe, Tobias; Kongsted, Jacob; Christiansen, Ove
2011-01-01
We formulate a new combined quantum mechanics/molecular mechanics (QM/MM) method based on a self-consistent polarizable embedding (PE) scheme. For the description of the QM region, we apply the popular coupled cluster (CC) method detailing the inclusion of electrostatic and polarization effects...... hyperpolarizabilities all coupled to a polarizable MM environment. In the process, we identify CC densitylike intermediates that allow for a very efficient implementation retaining a computational low cost of the QM/MM terms even when the number of MM sites increases. The strengths of the new implementation are...
Fuzzy Clustering Methods and their Application to Fuzzy Modeling
DEFF Research Database (Denmark)
Kroszynski, Uri; Zhou, Jianjun
1999-01-01
Fuzzy modeling techniques based upon the analysis of measured input/output data sets result in a set of rules that allow to predict system outputs from given inputs. Fuzzy clustering methods for system modeling and identification result in relatively small rule-bases, allowing fast, yet accurate...... prediction of outputs. This article presents an overview of some of the most popular clustering methods, namely Fuzzy C-Means (FCM) and its generalizations to Fuzzy C-Lines and Elliptotypes. The algorithms for computing cluster centers and principal directions from a training data-set are described. A...
Robust Clustering Method in the Presence of Scattered Observations.
Notsu, Akifumi; Eguchi, Shinto
2016-06-01
Contamination of scattered observations, which are either featureless or unlike the other observations, frequently degrades the performance of standard methods such as K-means and model-based clustering. In this letter, we propose a robust clustering method in the presence of scattered observations called Gamma-clust. Gamma-clust is based on a robust estimation for cluster centers using gamma-divergence. It provides a proper solution for clustering in which the distributions for clustered data are nonnormal, such as t-distributions with different variance-covariance matrices and degrees of freedom. As demonstrated in a simulation study and data analysis, Gamma-clust is more flexible and provides superior results compared to the robustified K-means and model-based clustering. PMID:26942745
Quartile Clustering: A quartile based technique for Generating Meaningful Clusters
Goswami, Saptarsi
2012-01-01
Clustering is one of the main tasks in exploratory data analysis and descriptive statistics where the main objective is partitioning observations in groups. Clustering has a broad range of application in varied domains like climate, business, information retrieval, biology, psychology, to name a few. A variety of methods and algorithms have been developed for clustering tasks in the last few decades. We observe that most of these algorithms define a cluster in terms of value of the attributes, density, distance etc. However these definitions fail to attach a clear meaning/semantics to the generated clusters. We argue that clusters having understandable and distinct semantics defined in terms of quartiles/halves are more appealing to business analysts than the clusters defined by data boundaries or prototypes. On the same premise, we propose our new algorithm named as quartile clustering technique. Through a series of experiments we establish efficacy of this algorithm. We demonstrate that the quartile clusteri...
ATAT@WIEN2k: An interface for cluster expansion based on the linearized augmented planewave method
Chakraborty, Monodeep; Spitaler, Jürgen; Puschnig, Peter; Ambrosch-Draxl, Claudia
2010-05-01
We have developed an interface between the all-electron density functional theory code WIEN2k, and the MIT Ab-initio Phase Stability (MAPS) code of the Alloy-Theoretic Automated Toolkit (ATAT). WIEN2k is an implementation of the full-potential linearized augmented planewave method which yields highly accurate total energies and optimized geometries for any given structure. The ATAT package consists of two parts. The first one is the MAPS code, which constructs a cluster expansion (CE) in conjunction with a first-principles code. These results form the basis for the second part, which computes the thermodynamic properties of the alloy. The main task of the CE is to calculate the many-body potentials or effective cluster interactions (ECIs) from the first-principles total energies of different structures or supercells using the structure-inversion technique. By linking MAPS seamlessly with WIEN2k we have created a tool to obtain the ECIs for any lattice type of an alloy. We have chosen fcc Al-Ti and bcc W-Re to evaluate our implementation. Our calculated ECIs exhibit all features of a converged CE and compare well with literature results.
A simulation study of three methods for detecting disease clusters
Directory of Open Access Journals (Sweden)
Samuelsen Sven O
2006-04-01
Full Text Available Abstract Background Cluster detection is an important part of spatial epidemiology because it can help identify environmental factors associated with disease and thus guide investigation of the aetiology of diseases. In this article we study three methods suitable for detecting local spatial clusters: (1) a spatial scan statistic (SaTScan), (2) generalized additive models (GAM) and (3) Bayesian disease mapping (BYM). We conducted a simulation study to compare the methods. Seven geographic clusters with different shapes were initially chosen as high-risk areas. Different scenarios for the magnitude of the relative risk of these areas as compared to the normal-risk areas were considered. For each scenario the performance of the methods was assessed in terms of the sensitivity, specificity, and percentage correctly classified for each cluster. Results The performance depends on the relative risk, but all methods are in general suitable for identifying clusters with a relative risk larger than 1.5. However, it is difficult to detect clusters with lower relative risks. The GAM approach had the highest sensitivity, but relatively low specificity, leading to an overestimation of the cluster area. Both the BYM and the SaTScan methods work well. Clusters with irregular shapes are more difficult to detect than more circular clusters. Conclusion Based on our simulations we conclude that the methods differ in their ability to detect spatial clusters. Different aspects should be considered for an appropriate choice of method, such as the size and shape of the assumed spatial clusters and the relative importance of sensitivity and specificity. In general, the BYM method seems preferable for local cluster detection with relatively high relative risks, whereas the SaTScan method appears preferable for lower relative risks. The GAM method needs to be tuned (using cross-validation) to get satisfactory results.
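The per-cluster performance measures used in the comparison can be computed from indicator vectors over the study areas; the example vectors below are hypothetical.

```python
def sensitivity_specificity(true_in_cluster, predicted_in_cluster):
    """Per-cluster performance as used in spatial cluster detection studies:
    sensitivity = fraction of true cluster areas that were detected,
    specificity = fraction of non-cluster areas correctly excluded."""
    pairs = list(zip(true_in_cluster, predicted_in_cluster))
    tp = sum(1 for t, p in pairs if t and p)
    tn = sum(1 for t, p in pairs if not t and not p)
    fn = sum(1 for t, p in pairs if t and not p)
    fp = sum(1 for t, p in pairs if not t and p)
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical 4-area example: the detector finds one of two true cluster
# areas and flags no false positives -> sensitivity 0.5, specificity 1.0.
sens, spec = sensitivity_specificity([1, 1, 0, 0], [1, 0, 0, 0])
```

The GAM behaviour reported above (high sensitivity, low specificity) corresponds to large `tp` at the cost of large `fp`, i.e. an over-extended detected cluster area.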
Comparison of Clustering Methods for Time Course Genomic Data: Applications to Aging Effects
Zhang, Y.; Horvath, S.; Ophoff, R; Telesca, D
2014-01-01
Time course microarray data provide insight into dynamic biological processes. While several clustering methods have been proposed for the analysis of these data structures, comparison and selection of appropriate clustering methods are seldom discussed. We compared three probabilistic-based clustering methods and three distance-based clustering methods for time course microarray data. Among probabilistic methods, we considered: smoothing spline clustering, also known as model b...
Quantum Monte Carlo methods and lithium cluster properties. [Atomic clusters
Energy Technology Data Exchange (ETDEWEB)
Owen, R.K.
1990-12-01
Properties of small lithium clusters with sizes ranging from n = 1 to 5 atoms were investigated using quantum Monte Carlo (QMC) methods. Cluster geometries were found from complete active space self-consistent field (CASSCF) calculations. A detailed development of the QMC method leading to the variational QMC (V-QMC) and diffusion QMC (D-QMC) methods is shown. The many-body aspect of electron correlation is introduced into the QMC importance-sampling electron-electron correlation functions by using density-dependent parameters, which are shown to increase the amount of correlation energy obtained in V-QMC calculations. A detailed analysis of D-QMC time-step bias is made, and the bias is found to be at least linear with respect to the time-step. The D-QMC calculations determined the lithium cluster ionization potentials to be 0.1982(14) (0.1981), 0.1895(9) (0.1874(4)), 0.1530(34) (0.1599(73)), 0.1664(37) (0.1724(110)), 0.1613(43) (0.1675(110)) Hartrees for lithium clusters n = 1 through 5, respectively, in good agreement with the experimental results shown in brackets. Also, the binding energies per atom were computed to be 0.0177(8) (0.0203(12)), 0.0188(10) (0.0220(21)), 0.0247(8) (0.0310(12)), 0.0253(8) (0.0351(8)) Hartrees for lithium clusters n = 2 through 5, respectively. The lithium cluster one-electron density is shown to have charge concentrations corresponding to nonnuclear attractors. The overall shape of the electronic charge density also bears a remarkable similarity to the anisotropic harmonic oscillator model shape for the given number of valence electrons.
Scalable Density-Based Subspace Clustering
DEFF Research Database (Denmark)
Müller, Emmanuel; Assent, Ira; Günnemann, Stephan;
2011-01-01
For knowledge discovery in high dimensional databases, subspace clustering detects clusters in arbitrary subspace projections. Scalability is a crucial issue, as the number of possible projections is exponential in the number of dimensions. We propose a scalable density-based subspace clustering...... synthetic databases show that steering is efficient and scalable, with high quality results. For future work, our steering paradigm for density-based subspace clustering opens research potential for speeding up other subspace clustering approaches as well....
Comparison between optical and X-ray cluster detection methods
Basilakos, S; Georgakakis, A; Georgantopoulos, I; Gaga, T; Kolokotronis, V G; Stewart, G C
2003-01-01
In this work we present combined optical and X-ray cluster detection methods in an area near the North Galactic Pole, previously covered by the SDSS and 2dF optical surveys. The same area has been covered by shallow ($\sim 1.8$ deg$^{2}$) XMM-{\em Newton} observations. The optical cluster detection procedure is based on merging two independent selection methods - a smoothing+percolation technique, and a Matched Filter Algorithm. The X-ray cluster detection is based on a wavelet-based algorithm, incorporated in the SAS v.5.2 package. The final optical sample contains 9 candidate clusters with a richness of more than 20 galaxies, corresponding roughly to APM richness class. Three of our optically detected clusters are also detected in our X-ray survey.
Directory of Open Access Journals (Sweden)
Jinfei Liu
2013-04-01
Full Text Available DBSCAN is a well-known density-based clustering algorithm which offers advantages for finding clusters of arbitrary shapes compared to partitioning and hierarchical clustering methods. However, there are few papers studying the DBSCAN algorithm under the privacy-preserving distributed data mining model, in which the data are distributed between two or more parties, and the parties cooperate to obtain the clustering results without revealing the data held by the individual parties. In this paper, we address the problem of two-party privacy-preserving DBSCAN clustering. We first propose two protocols for privacy-preserving DBSCAN clustering over horizontally and vertically partitioned data respectively, and then extend them to arbitrarily partitioned data. We also provide a performance analysis and privacy proof of our solution.
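For reference, the single-machine DBSCAN baseline that the privacy-preserving protocols distribute can be sketched in a few lines; this is a naive O(n²) version with the usual `eps` (neighborhood radius) and `min_pts` (core-point density) parameters, and the sample points are hypothetical.

```python
def dbscan(points, eps, min_pts):
    """Minimal single-machine DBSCAN: grows clusters from core points;
    a label of -1 marks noise (unless later absorbed as a border point)."""
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    neighbors = lambda i: [j for j in range(len(points))
                           if dist(points[i], points[j]) <= eps]
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1          # noise (may later become a border point)
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise absorbed as a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            more = neighbors(j)
            if len(more) >= min_pts:
                queue.extend(more)   # j is a core point: expand the cluster
    return labels

pts = [(0, 0), (0.1, 0), (0.2, 0.1), (5, 5), (5.1, 5), (5, 5.1), (9, 0)]
labels = dbscan(pts, eps=0.5, min_pts=2)
```

In the two-party setting, the expensive primitive is the `neighbors` query: deciding whether two points are within `eps` without either party revealing its coordinates (horizontal partitioning) or its share of each coordinate (vertical partitioning).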
Sequential Combination Methods for Data Clustering Analysis
Institute of Scientific and Technical Information of China (English)
钱 涛; Ching Y.Suen; 唐远炎
2002-01-01
This paper proposes the use of more than one clustering method to improve clustering performance. Clustering is an optimization procedure based on a specific clustering criterion. Clustering combination can be regarded as a technique that constructs and processes multiple clustering criteria. Since the global and local clustering criteria are complementary rather than competitive, combining these two types of clustering criteria may enhance the clustering performance. In our past work, a multi-objective programming based simultaneous clustering combination algorithm has been proposed, which incorporates multiple criteria into an objective function by a weighting method, and solves this problem with constrained nonlinear optimization programming. But this algorithm has high computational complexity. Here a sequential combination approach is investigated, which first uses the global criterion based clustering to produce an initial result, then uses the local criterion based information to improve the initial result with a probabilistic relaxation algorithm or linear additive model. Compared with the simultaneous combination method, sequential combination has low computational complexity. Results on some simulated data and standard test data are reported. It appears that clustering performance improvement can be achieved at low cost through sequential combination.
Cluster beam sources. Part 1. Methods of cluster beams generation
Directory of Open Access Journals (Sweden)
A.Ju. Karpenko
2012-10-01
Full Text Available A short review of cluster beam generation is presented. The basic types of cluster sources are considered and the processes leading to cluster formation are analyzed. The parameters that affect the operation of cluster sources are presented.
Cluster beam sources. Part 1. Methods of cluster beams generation
A.Ju. Karpenko; V.A. Baturin
2012-01-01
A short review of cluster beam generation is presented. The basic types of cluster sources are considered and the processes leading to cluster formation are analyzed. The parameters that affect the operation of cluster sources are presented.
Spanning Tree Based Attribute Clustering
DEFF Research Database (Denmark)
Zeng, Yifeng; Jorge, Cordero Hernandez
2009-01-01
inconsistent edges from a maximum spanning tree by starting appropriate initial modes, therefore generating stable clusters. It discovers sound clusters through simple graph operations and achieves significant computational savings. We compare the Star Discovery algorithm against earlier attribute clustering...
ADVANCED CLUSTER BASED IMAGE SEGMENTATION
Directory of Open Access Journals (Sweden)
D. Kesavaraja
2011-11-01
Full Text Available This paper presents efficient and portable implementations of a useful image segmentation technique based on a faster variant of the conventional connected components algorithm, which we call parallel components. Many doctors today need image segmentation as a service for various purposes, and they expect such a system to run fast and securely. Image segmentation algorithms usually do not run fast, and despite several ongoing research efforts, conventional segmentation algorithms may still not run fast enough. We therefore propose a cluster computing environment for parallel image segmentation to provide faster results. This paper presents a real-time implementation of distributed image segmentation on a cluster of nodes. We demonstrate the effectiveness and feasibility of our method on a set of medical CT scan images. Our general framework is a single address space, distributed memory programming model. We use efficient techniques for distributing and coalescing data as well as efficient combinations of task and data parallelism. The image segmentation algorithm makes use of an efficient cluster process which uses a novel approach for parallel merging. Our experimental results are consistent with the theoretical analysis, and show faster execution time for segmentation when compared with the conventional method. Our test data consist of different CT scan images from a medical database. More efficient implementations of image segmentation will likely result in even faster execution times.
An alternative method to study star cluster disruption
Gieles, Mark
2008-01-01
Many embedded star clusters do not evolve into long-lived bound clusters. The most popular explanation for this "infant mortality" of young clusters is the expulsion of natal gas by stellar winds and supernovae, which leaves up to 90% of them unbound. A cluster disruption model has recently been proposed in which this mass-independent disruption of clusters proceeds for another Gyr after gas expulsion. In this scenario, the survival chances of massive clusters are much smaller than in the traditional mass-dependent disruption models. The most common way to study cluster disruption is to use the cluster age distribution, which, however, can be heavily affected by incompleteness. To avoid this, we introduce a new method based on size-of-sample effects, namely the relation between the most massive cluster, M_max, and the age range sampled. Assuming that clusters are sampled from a power-law initial mass function with index -2, and that the cluster formation rate is constant, M_max scales with the age range sam...
A PSO-Based Subtractive Data Clustering Algorithm
Directory of Open Access Journals (Sweden)
Gamal Abdel-Azeem
2013-03-01
Full Text Available There is a tremendous proliferation in the amount of information available on the largest shared information source, the World Wide Web. Fast and high-quality clustering algorithms play an important role in helping users effectively navigate, summarize, and organize the information. Recent studies have shown that partitional clustering algorithms such as the k-means algorithm are the most popular algorithms for clustering large datasets. The major problem with partitional clustering algorithms is that they are sensitive to the selection of the initial partitions and are prone to premature convergence to local optima. Subtractive clustering is a fast, one-pass algorithm for estimating the number of clusters and the cluster centers for any given set of data. The cluster estimates can be used to initialize iterative optimization-based clustering methods and model identification methods. In this paper, we present a hybrid Subtractive+(PSO) clustering algorithm, based on Particle Swarm Optimization, that performs fast clustering. For comparison purposes, we applied the Subtractive+(PSO) clustering algorithm, PSO, and the Subtractive clustering algorithm to three different datasets. The results illustrate that the Subtractive+(PSO) clustering algorithm generates the most compact clustering results compared to the other algorithms.
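The one-pass subtractive step can be sketched as follows, using the standard Chiu-style potential function with the usual squash radius of 1.5·r_a. This is a generic sketch under those assumptions; the paper's exact parameters and the PSO hybridization are not reproduced, and the sample points are hypothetical.

```python
import math

def subtractive_centers(points, ra=1.0, n_centers=2):
    """One-pass subtractive clustering used to seed iterative methods:
    each point's potential is a density estimate; repeatedly pick the
    highest-potential point and subtract its influence."""
    alpha = 4.0 / ra ** 2             # neighborhood radius for the potential
    beta = 4.0 / (1.5 * ra) ** 2      # squash radius rb = 1.5 * ra (convention)
    d2 = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    potential = [sum(math.exp(-alpha * d2(p, q)) for q in points)
                 for p in points]
    centers = []
    for _ in range(n_centers):
        k = max(range(len(points)), key=potential.__getitem__)
        centers.append(points[k])
        peak = potential[k]
        # subtract the chosen peak's influence from every point's potential
        potential = [p - peak * math.exp(-beta * d2(points[k], q))
                     for p, q in zip(potential, points)]
    return centers

pts = [(0, 0), (0.2, 0), (0, 0.2), (4, 4), (4.2, 4), (4, 4.2)]
centers = subtractive_centers(pts, ra=1.0, n_centers=2)
```

The returned centers (one per dense region) are exactly the kind of initial estimates the abstract says can seed iterative optimization-based clustering, with PSO then refining them.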
Ontology Partitioning: Clustering Based Approach
Directory of Open Access Journals (Sweden)
Soraya Setti Ahmed
2015-05-01
Full Text Available The semantic web goal is to share and integrate data across different domains and organizations. The knowledge representation of semantic data is made possible by ontologies. As usage of the semantic web increases, construction of semantic web ontologies also increases. Moreover, due to the monolithic nature of an ontology, various semantic web operations like query answering, data sharing, data matching, data reuse and data integration become more complicated as the size of the ontology increases. Partitioning the ontology is the key solution to handle this scalability issue. In this work, we propose a revision and an enhancement of the K-means clustering algorithm, based on a new semantic similarity measure, for partitioning a given ontology into high-quality modules. The results show that our approach produces more meaningful clusters than the traditional K-means algorithm.
Semi-supervised clustering method based on active learning strategy
Institute of Scientific and Technical Information of China (English)
芦世丹; 崔荣一
2013-01-01
We propose a semi-supervised clustering method based on an active learning strategy that selects the most informative data points for labelling. First, the traditional K-means algorithm is used to coarsely cluster the unlabelled dataset. Then, based on the coarse clustering result, the membership degree of each data point to each cluster is computed, and candidate points for which the difference between the largest and second-largest membership degrees is below a threshold are screened out; among these, the points with the smallest differences are selected as the most informative samples and labelled. Finally, each remaining unlabelled candidate is assigned to the cluster of labelled data to which it has the minimum average distance. Experiments show that the proposed active learning strategy effectively identifies the most informative data, and the semi-supervised clustering method based on this strategy achieves high accuracy across various test datasets.
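The candidate-selection step can be sketched as follows. Fuzzy c-means style membership degrees are assumed here as the membership computation (the paper may define membership differently), and the centers, points, and threshold are hypothetical.

```python
def memberships(point, centers, m=2.0):
    """Fuzzy c-means style membership degrees of one point in each cluster,
    computed from its distances to the cluster centers (they sum to 1)."""
    d = [max(sum((x - c) ** 2 for x, c in zip(point, ctr)) ** 0.5, 1e-12)
         for ctr in centers]
    return [1.0 / sum((d[i] / d[j]) ** (2 / (m - 1)) for j in range(len(d)))
            for i in range(len(d))]

def informative(points, centers, threshold=0.2):
    """Points whose top-two membership degrees differ by less than the
    threshold: the ambiguous samples the active learner asks to label."""
    out = []
    for p in points:
        u = sorted(memberships(p, centers), reverse=True)
        if u[0] - u[1] < threshold:
            out.append(p)
    return out

centers = [(0.0, 0.0), (4.0, 0.0)]
pts = [(0.1, 0.0), (2.0, 0.0), (3.9, 0.0)]
cands = informative(pts, centers, threshold=0.2)
```

Only the midpoint (2.0, 0.0) is ambiguous between the two clusters, so it is the one selected for labelling; the two near-center points are confidently assigned and left unlabelled.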
Coupled Cluster Methods in Lattice Gauge Theory
Watson, Nicholas Jay
Available from UMI in association with The British Library. Requires signed TDF. The many-body coupled cluster method is applied to Hamiltonian pure lattice gauge theories. The vacuum wavefunction is written as the exponential of a single sum over the lattice of clusters of gauge-invariant operators at fixed relative orientation and separation, generating excitations of the bare vacuum. The basic approximation scheme involves a truncation according to the geometrical size on the lattice of the clusters in the wavefunction. For a wavefunction including clusters up to a given size, all larger clusters generated in the Schrödinger equation are discarded. The general formalism is first given, including that for excited states. Two possible procedures for discarding clusters are considered. The first involves discarding clusters describing excitations of the bare vacuum which are larger than those in the given wavefunction. The second involves rearranging the clusters so that they describe fluctuations of the gauge-invariant excitations about their self-consistently calculated expectation values, and then discarding fluctuations larger than those in the given wavefunction. The coupled cluster method is applied to the Z_2 and SU(2) models in 2+1D. For the Z_2 model, the first procedure gives poor results, while the second gives wavefunctions which explicitly display a phase transition, with critical couplings in good agreement with those obtained by other methods. For the SU(2) model, the first procedure also gives poor results, while the second gives vacuum wavefunctions valid at all couplings. The general properties of the wavefunctions at weak coupling are discussed. Approximations with clusters spanning up to four plaquettes are considered. Excited states are calculated, yielding mass gaps with fair scaling properties. Insight is obtained into the form of the wavefunctions at all couplings.
A Modified Ant-based Clustering for Medical Data
Directory of Open Access Journals (Sweden)
C. Immaculate Mary
2010-10-01
Full Text Available Ant-based techniques in computer science are designed by taking biological inspiration from the behavior of social insects. Data clustering techniques are classification algorithms with a wide range of applications, from biology to image processing and data presentation. The ant-based clustering technique has proven to be promising for data clustering problems. In this paper a modified ant-based clustering is proposed for medical data processing. The performance of the proposed method is compared with k-means clustering.
Clustering Methods in Data Mining
Institute of Scientific and Technical Information of China (English)
王实; 高文
2000-01-01
In this paper we introduce clustering methods in Data Mining. Clustering has been studied very deeply. In the field of Data Mining, clustering is facing a new situation. We summarize the major clustering methods and introduce four kinds of clustering methods that have been used broadly in Data Mining. Finally we draw the conclusion that the distance-based partitional clustering method in data mining is a typical two-phase iteration process: 1) assign each point to a cluster; 2) update the cluster centers.
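The two-phase iteration named in the conclusion is essentially k-means, which can be sketched as follows (the sample points are hypothetical):

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Distance-based partitional clustering as the two-phase iteration
    described above: 1) assign each point to a cluster; 2) update centers."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    clusters = []
    for _ in range(iters):
        # Phase 1: assign each point to its nearest center
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: sum((a - b) ** 2
                                                for a, b in zip(p, centers[c])))
            clusters[i].append(p)
        # Phase 2: update each center as the mean of its cluster
        new = [tuple(sum(xs) / len(cl) for xs in zip(*cl)) if cl else centers[i]
               for i, cl in enumerate(clusters)]
        if new == centers:           # converged: assignments are stable
            break
        centers = new
    return centers, clusters

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, clusters = kmeans(pts, k=2)
```

The two phases alternate until the centers stop moving; on this toy data the algorithm recovers the two well-separated groups regardless of which points are sampled as initial centers.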
PERFORMANCE OF SELECTED AGGLOMERATIVE HIERARCHICAL CLUSTERING METHODS
Directory of Open Access Journals (Sweden)
Nusa Erman
2015-01-01
Full Text Available A broad variety of different methods of agglomerative hierarchical clustering raises the problem of how to choose the most appropriate method for the given data. It is well known that some methods outperform others if the analysed data have a specific structure. In the presented study we have observed the behaviour of the centroid method, the median method (Gower's median), and the average method (unweighted pair-group method with arithmetic mean, UPGMA; average linkage between groups). We have compared them with the most commonly used methods of hierarchical clustering: the minimum (single linkage) clustering, the maximum (complete linkage) clustering, the Ward method, and the McQuitty method (group method average; weighted pair-group method using arithmetic averages, WPGMA). We have applied the comparison of these methods to spherical, ellipsoid, umbrella-like, "core-and-sphere", ring-like and intertwined three-dimensional data structures. To generate the data and execute the analysis, we have used the R statistical software. Results show that all seven methods are successful in finding compact, ball-shaped or ellipsoid structures when they are sufficiently separated. Conversely, all methods except the minimum perform poorly on non-homogeneous, irregular and elongated ones. Especially challenging is a circular double helix structure; it is correctly revealed only by the minimum method. We can also confirm formerly published results of other simulation studies, which usually favour the average method (besides the Ward method) in cases when the data are assumed to be fairly compact and well separated.
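A naive agglomerative procedure makes the linkage choice explicit: passing `min` as the linkage gives single linkage (the "minimum" method above), `max` gives complete linkage. The elongated-chain example below, with hypothetical data, illustrates why the minimum method is the one that follows non-compact, elongated structures.

```python
def agglomerate(points, k, linkage):
    """Naive agglomerative clustering; `linkage` reduces all pairwise
    inter-cluster distances to one value (min -> single linkage,
    max -> complete linkage)."""
    d = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                dist = linkage(d(a, b) for a in clusters[i]
                               for b in clusters[j])
                if best is None or dist < best[0]:
                    best = (dist, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)   # merge the two closest clusters
    return clusters

# An elongated chain next to a compact blob: single linkage chains through
# the elongated structure because only the nearest pair of points matters.
chain = [(x * 0.5, 0.0) for x in range(6)]
blob = [(0.0, 5.0), (0.2, 5.1), (0.1, 4.9)]
single = agglomerate(chain + blob, k=2, linkage=min)
```

Centroid, median, average, Ward, and McQuitty linkages plug into the same loop by replacing the `linkage` reduction, which is the design space the simulation study explores.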
A clustering method based on the Dirichlet process mixture model
Institute of Scientific and Technical Information of China (English)
张林; 刘辉
2012-01-01
The number of clusters must be determined in advance when a finite mixture model is built to cluster high-dimensional data, which deteriorates the precision and generalization of the clustering. In this paper, a Dirichlet process infinite mixture model was built to cluster high-dimensional data. Based on the Urn model, the posterior distributions of the parameters were derived. All parameters, including the number of potential clusters, were estimated through Gibbs sampling MCMC. The clustering results on both a simulated dataset and the IRIS dataset show that this method correctly estimates the number of potential clusters after 200 Gibbs sampling MCMC iterations. The average time per iteration for the simulated and IRIS datasets was 0.1850 s and 0.1455 s, respectively, and the time complexity of each iteration is O(N), where N is the number of samples.
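The role of the Urn model can be illustrated by drawing cluster assignments from the Dirichlet process prior itself, via the Polya urn / Chinese restaurant construction: each new item joins an existing cluster with probability proportional to its size, or opens a new cluster with weight alpha, so the number of clusters is not fixed in advance. The parameters below are hypothetical.

```python
import random

def crp_assignments(n, alpha, seed=0):
    """Draw cluster assignments from the Dirichlet process prior
    (Polya urn / Chinese restaurant process)."""
    rng = random.Random(seed)
    counts = []                       # counts[k] = current size of cluster k
    labels = []
    for i in range(n):
        # existing clusters weighted by size, plus a new cluster with weight alpha
        weights = counts + [alpha]
        r = rng.uniform(0, i + alpha)  # total weight so far is i + alpha
        acc, k = 0.0, 0
        for k, w in enumerate(weights):
            acc += w
            if r < acc:
                break
        if k == len(counts):
            counts.append(0)           # open a new cluster
        counts[k] += 1
        labels.append(k)
    return labels

labels = crp_assignments(100, alpha=1.0)
n_clusters = len(set(labels))
```

A Gibbs sampler for the full mixture model alternates draws of this kind with draws of the cluster parameters, which is how the number of potential clusters is estimated rather than fixed.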
Cosmological Constraints with Clustering-Based Redshifts
Kovetz, Ely D; Rahman, Mubdi
2016-01-01
We demonstrate that observations lacking reliable redshift information, such as photometric and radio continuum surveys, can produce robust measurements of cosmological parameters when empowered by clustering-based redshift estimation. This method infers the redshift distribution based on the spatial clustering of sources, using cross-correlation with a reference dataset with known redshifts. Applying this method to the existing SDSS photometric galaxies, and projecting to future radio continuum surveys, we show that sources can be efficiently divided into several redshift bins, increasing their ability to constrain cosmological parameters. We forecast constraints on the dark-energy equation-of-state and on local non-gaussianity parameters. We explore several pertinent issues, including the tradeoff between including more sources versus minimizing the overlap between bins, the shot-noise limitations on binning, and the predicted performance of the method at high redshifts. Remarkably, we find that, once this ...
User-filtering-based campus WLAN user clustering method
Institute of Scientific and Technical Information of China (English)
仇一泓; 尧婷娟; 秦丰林; 葛连升
2014-01-01
With the widespread adoption of smart terminals such as smart phones and tablets, using the MAC address as the user identifier in campus wireless local area network (WLAN) user clustering research can no longer accurately represent user behavior. A user-filtering-based user clustering method is proposed. This method filters users' behavior data by their degree of activeness, and then conducts clustering analysis of campus WLAN user behavior. The experimental results verify the effectiveness of the proposed method.
Directory of Open Access Journals (Sweden)
Galway LP
2012-04-01
Full Text Available Abstract Background Mortality estimates can measure and monitor the impacts of conflict on a population, guide humanitarian efforts, and help to better understand the public health impacts of conflict. Vital statistics registration and surveillance systems are rarely functional in conflict settings, posing the challenge of estimating mortality using retrospective population-based surveys. Results We present a two-stage cluster sampling method for application in population-based mortality surveys. The sampling method utilizes gridded population data and a geographic information system (GIS) to select clusters in the first sampling stage, and Google Earth™ imagery and sampling grids to select households in the second sampling stage. The sampling method was implemented in a household mortality study in Iraq in 2011. Factors affecting feasibility and methodological quality are described. Conclusion Sampling is a challenge in retrospective population-based mortality studies, and alternatives that improve on the conventional approaches are needed. The sampling strategy presented here was designed to generate a representative sample of the Iraqi population while reducing the potential for bias and considering the context-specific challenges of the study setting. This sampling strategy, or variations on it, is adaptable and should be considered and tested in other conflict settings.
Variable cluster analysis method for building neural network model
Institute of Scientific and Technical Information of China (English)
王海东; 刘元东
2004-01-01
To address the problem that input variables should be reduced as much as possible while still fully explaining the output variables when building a neural network model of a complicated system, a variable selection method based on cluster analysis was investigated. A similarity coefficient describing the mutual relation of variables was defined. The methods of the highest contribution rate, part replacing whole, and variable replacement are put forward and derived using information theory. Software for neural networks based on cluster analysis, which provides several methods for defining the variable similarity coefficient, clustering system variables and evaluating variable clusters, was developed and applied to build a neural network forecast model of cement clinker quality. The results show that the network scale, training time and prediction accuracy are all satisfactory. The practical application demonstrates that this method of selecting variables for neural networks is feasible and effective.
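One plausible form of a variable similarity coefficient is the absolute correlation between the variables' observed samples; this is an assumption for illustration (the paper defines its own coefficient), and the three candidate variables below are hypothetical.

```python
def similarity(x, y):
    """Similarity coefficient between two variables: absolute Pearson
    correlation of their samples (one plausible choice; the paper
    defines its own coefficient)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return abs(cov / (sx * sy))

# Three candidate inputs: v2 is almost a copy of v1 while v3 is unrelated,
# so a similarity threshold would cluster v1 with v2 and keep v3 separate,
# letting one representative of {v1, v2} replace both as a network input.
v1 = [1.0, 2.0, 3.0, 4.0, 5.0]
v2 = [1.1, 2.0, 3.2, 3.9, 5.1]
v3 = [2.0, 1.0, 4.0, 1.5, 0.5]
s12, s13 = similarity(v1, v2), similarity(v1, v3)
```

Clustering variables by such a coefficient and keeping one representative per cluster is exactly the kind of input reduction the abstract describes: fewer inputs, with the retained variables still explaining the outputs.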
Progressive Exponential Clustering-Based Steganography
Directory of Open Access Journals (Sweden)
Li Yue
2010-01-01
Full Text Available Cluster indexing-based steganography is an important branch of data-hiding techniques. Such schemes normally achieve good balance between high embedding capacity and low embedding distortion. However, most cluster indexing-based steganographic schemes utilise less efficient clustering algorithms for embedding data, which causes redundancy and leaves room for increasing the embedding capacity further. In this paper, a new clustering algorithm, called progressive exponential clustering (PEC, is applied to increase the embedding capacity by avoiding redundancy. Meanwhile, a cluster expansion algorithm is also developed in order to further increase the capacity without sacrificing imperceptibility.
Energy Technology Data Exchange (ETDEWEB)
Riplinger, Christoph; Pinski, Peter; Becker, Ute; Neese, Frank, E-mail: frank.neese@cec.mpg.de, E-mail: evaleev@vt.edu [Max Planck Institute for Chemical Energy Conversion, Stiftstr. 34-36, D-45470 Mülheim an der Ruhr (Germany); Valeev, Edward F., E-mail: frank.neese@cec.mpg.de, E-mail: evaleev@vt.edu [Department of Chemistry, Virginia Tech, Blacksburg, Virginia 24061 (United States)
2016-01-14
Domain based local pair natural orbital coupled cluster theory with single-, double-, and perturbative triple excitations (DLPNO-CCSD(T)) is a highly efficient local correlation method. It is known to be accurate and robust and can be used in a black box fashion in order to obtain coupled cluster quality total energies for large molecules with several hundred atoms. While previous implementations showed near linear scaling up to a few hundred atoms, several nonlinear scaling steps limited the applicability of the method for very large systems. In this work, these limitations are overcome and a linear scaling DLPNO-CCSD(T) method for closed shell systems is reported. The new implementation is based on the concept of sparse maps that was introduced in Part I of this series [P. Pinski, C. Riplinger, E. F. Valeev, and F. Neese, J. Chem. Phys. 143, 034108 (2015)]. Using the sparse map infrastructure, all essential computational steps (integral transformation and storage, initial guess, pair natural orbital construction, amplitude iterations, triples correction) are achieved in a linear scaling fashion. In addition, a number of additional algorithmic improvements are reported that lead to significant speedups of the method. The new, linear-scaling DLPNO-CCSD(T) implementation typically is 7 times faster than the previous implementation and consumes 4 times less disk space for large three-dimensional systems. For linear systems, the performance gains and memory savings are substantially larger. Calculations with more than 20 000 basis functions and 1000 atoms are reported in this work. In all cases, the time required for the coupled cluster step is comparable to or lower than for the preceding Hartree-Fock calculation, even if this is carried out with the efficient resolution-of-the-identity and chain-of-spheres approximations. The new implementation even reduces the error in absolute correlation energies by about a factor of two, compared to the already accurate
International Nuclear Information System (INIS)
Graphical abstract: The structure of a minimum in Ar19K+ cluster. Abstract: In this paper we explore the possibility of using stochastic optimizers, namely simulated annealing (SA), in locating critical points (global minima, local minima and first-order saddle points) in argon noble gas clusters perturbed by alkali metal ions, namely sodium and potassium. The atomic interaction potential is the Lennard-Jones potential. We also try to see if a continuous transformation in geometry during the search process can lead to the realization of a kind of minimum energy path (MEP) for transformation from one minimum geometry to another through a transition state (a first-order saddle point). We try our recipe for three sizes of clusters, namely (Ar)16M+, (Ar)19M+ and (Ar)24M+, where M+ is Na+ and K+.
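The annealing search described above can be sketched in miniature. The following is an illustrative sketch under stated assumptions, not the authors' actual protocol: it minimizes the Lennard-Jones energy of a small pure cluster (no embedded ion, reduced units, a made-up cooling schedule) with a Metropolis acceptance rule:

```python
import math
import random

def lj_energy(coords):
    """Total Lennard-Jones energy (epsilon = sigma = 1) of a set of 3-D points."""
    e = 0.0
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            r2 = sum((a - b) ** 2 for a, b in zip(coords[i], coords[j]))
            inv6 = 1.0 / r2 ** 3
            e += 4.0 * (inv6 * inv6 - inv6)
    return e

def anneal(coords, steps=20000, t0=1.0, cooling=0.9995, step_size=0.1, seed=0):
    """Metropolis simulated annealing: perturb one atom, accept by the Boltzmann rule."""
    rng = random.Random(seed)
    cur = [list(p) for p in coords]
    best = [list(p) for p in coords]
    e_cur = e_best = lj_energy(cur)
    t = t0
    for _ in range(steps):
        i = rng.randrange(len(cur))
        old = list(cur[i])
        cur[i] = [c + rng.gauss(0.0, step_size) for c in old]  # trial move of one atom
        e_new = lj_energy(cur)
        if e_new < e_cur or rng.random() < math.exp(-(e_new - e_cur) / t):
            e_cur = e_new
            if e_new < e_best:
                e_best, best = e_new, [list(p) for p in cur]
        else:
            cur[i] = old  # reject: restore the atom
        t *= cooling      # geometric cooling schedule (an assumption)
    return best, e_best

# A random 7-atom cluster; annealing should drive the energy well below the start.
rng = random.Random(1)
start = [[rng.uniform(-1.5, 1.5) for _ in range(3)] for _ in range(7)]
_, e_min = anneal(start)
```

Locating saddle points, as in the paper, would require tracking the path between minima rather than only the best-so-far configuration.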
A statistical method to determine open cluster metallicities
Poehnl, Harald
2010-01-01
The study of open cluster metallicities helps us understand local stellar formation and evolution throughout the Milky Way. Its metallicity gradient is an important tracer of Galactic formation in a global sense. Because open clusters can be treated in a statistical way, the error of the cluster mean is minimized. Our final goal is a semi-automatic, statistically robust method to estimate the metallicity of a statistically significant number of open clusters based on Johnson BV data of their members, an algorithm that can easily be extended to other photometric systems for a systematic investigation. This method incorporates evolutionary grids for different metallicities and a calibration of the effective temperature and luminosity. With cluster parameters (age, reddening and distance) it is possible to estimate the metallicity from a statistical point of view. The iterative process includes an intrinsic consistency check of the starting input parameters and allows us to modify them. We extensively test...
Initialization independent clustering with actively self-training method.
Nie, Feiping; Xu, Dong; Li, Xuelong
2012-02-01
The results of traditional clustering methods are usually unreliable, as there is no guidance from data labels, while class labels can be predicted more reliably by semisupervised learning if the labels of part of the data are given. In this paper, we propose an actively self-training clustering method, in which samples are actively selected as the training set to minimize an estimated Bayes error, and semisupervised learning is then used to perform clustering. Traditional graph-based semisupervised learning methods are not convenient for estimating the Bayes error, so we develop a specific regularization framework on the graph to perform semisupervised learning, in which the Bayes error can be effectively estimated. In addition, the proposed clustering algorithm can readily be applied in a semisupervised setting with partial class labels. Experimental results on toy data and real-world data sets demonstrate the effectiveness of the proposed clustering method in both the unsupervised and the semisupervised settings. It is worth noting that the proposed clustering method is free of initialization, while traditional clustering methods usually depend on initialization. PMID:22086542
DNA splice site sequences clustering method for conservativeness analysis
Institute of Scientific and Technical Information of China (English)
Quanwei Zhang; Qinke Peng; Tao Xu
2009-01-01
DNA sequences near splice sites are remarkably conserved, and many researchers have contributed to the prediction of splice sites. In order to mine the underlying biological knowledge, we analyze the conservation of DNA sequences adjacent to splice sites by clustering. First, we propose a DNA splice site sequence clustering method based on DBSCAN, using four kinds of dissimilarity calculation methods. Then, we analyze the conservation features of the clustering results and of the experimental data set.
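The DBSCAN core of such a method can be shown with a minimal, self-contained sketch (plain Euclidean dissimilarity on 2-D points; the paper's four sequence dissimilarity measures are not reproduced here):

```python
def dbscan(points, eps, min_pts):
    """Minimal DBSCAN: returns one cluster id per point, -1 for noise."""
    def neighbors(i):
        return [j for j, q in enumerate(points)
                if sum((a - b) ** 2 for a, b in zip(points[i], q)) <= eps * eps]

    labels = [None] * len(points)
    next_id = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:      # not a core point (for now): mark as noise
            labels[i] = -1
            continue
        labels[i] = next_id
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels[j] == -1:       # noise reachable from a core point: border point
                labels[j] = next_id
            if labels[j] is not None:
                continue
            labels[j] = next_id
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:  # j is itself a core point: keep expanding
                queue.extend(nbrs)
        next_id += 1
    return labels

# Two dense blobs plus one outlier.
pts = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
       (5.0, 5.0), (5.1, 5.0), (5.0, 5.1),
       (9.0, 0.0)]
labels = dbscan(pts, eps=0.5, min_pts=3)
```

For sequence data the `neighbors` function would use a sequence dissimilarity instead of squared Euclidean distance; the expansion logic is unchanged.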
Model-Based Clustering of Large Networks
Vu, Duy Quang; Schweinberger, Michael
2012-01-01
We describe a network clustering framework, based on finite mixture models, that can be applied to discrete-valued networks with hundreds of thousands of nodes and billions of edge variables. Relative to other recent model-based clustering work for networks, we introduce a more flexible modeling framework, improve the variational-approximation estimation algorithm, discuss and implement standard error estimation via a parametric bootstrap approach, and apply these methods to much larger datasets than those seen elsewhere in the literature. The more flexible modeling framework is achieved through introducing novel parameterizations of the model, giving varying degrees of parsimony, using exponential family models whose structure may be exploited in various theoretical and algorithmic ways. The algorithms, which we show how to adapt to the more complicated optimization requirements introduced by the constraints imposed by the novel parameterizations we propose, are based on variational generalized EM algorithms...
Incremental Web Usage Mining Based on Active Ant Colony Clustering
Institute of Scientific and Technical Information of China (English)
SHEN Jie; LIN Ying; CHEN Zhimin
2006-01-01
To alleviate the scalability problem caused by increasing Web usage and changing user interests, this paper presents a novel Web usage mining algorithm: an incremental Web usage mining algorithm based on active ant colony clustering. First, an active movement strategy for direction selection and speed, different from the passive strategy employed by other ant colony clustering algorithms, is proposed to construct an active ant colony clustering algorithm, which avoids idle movement and the "flying over the plane" phenomenon and effectively improves the quality and speed of clustering on large datasets. Then a mechanism for decomposing clusters based on the above methods is introduced to form new clusters when users' interests change. Empirical studies on a real Web dataset show that the active ant colony clustering algorithm performs better than previous algorithms, and that the incremental approach based on the proposed mechanism can efficiently implement incremental Web usage mining.
Orbit Clustering Based on Transfer Cost
Gustafson, Eric D.; Arrieta-Camacho, Juan J.; Petropoulos, Anastassios E.
2013-01-01
We propose using cluster analysis to perform quick screening for combinatorial global optimization problems. The key missing component currently preventing cluster analysis from use in this context is the lack of a useable metric function that defines the cost to transfer between two orbits. We study several proposed metrics and clustering algorithms, including k-means and the expectation maximization algorithm. We also show that proven heuristic methods such as the Q-law can be modified to work with cluster analysis.
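Because a cluster "mean" is not defined under an arbitrary transfer-cost metric, a k-medoids-style sketch shows how clustering can run directly on a precomputed pairwise cost matrix. The `cost` matrix below is a stand-in (absolute difference of a toy orbit parameter), not any of the transfer metrics studied in the paper:

```python
import random

def k_medoids(cost, k, iters=50, seed=0):
    """PAM-style k-medoids over a precomputed pairwise cost matrix.

    cost[i][j] stands in for an orbit-to-orbit transfer cost; any symmetric
    nonnegative matrix works, so no coordinate representation is needed.
    """
    n = len(cost)
    rng = random.Random(seed)
    medoids = rng.sample(range(n), k)
    for _ in range(iters):
        # Assign each item to its cheapest medoid.
        assign = [min(medoids, key=lambda m: cost[i][m]) for i in range(n)]
        new_medoids = []
        for m in medoids:
            members = [i for i in range(n) if assign[i] == m]
            # The new medoid minimizes total cost to the cluster's members.
            best = min(members, key=lambda c: sum(cost[c][i] for i in members))
            new_medoids.append(best)
        if set(new_medoids) == set(medoids):
            break
        medoids = new_medoids
    assign = [min(medoids, key=lambda m: cost[i][m]) for i in range(n)]
    return medoids, assign

# Toy "orbits": scalar parameters whose transfer cost is absolute difference.
params = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
cost = [[abs(a - b) for b in params] for a in params]
medoids, assign = k_medoids(cost, k=2)
```

The same structure would accept a Q-law-derived cost matrix unchanged, which is the appeal of medoid-based methods for this screening problem.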
Directory of Open Access Journals (Sweden)
Li Ma
2015-01-01
Full Text Available Image segmentation plays an important role in medical image processing. Fuzzy c-means (FCM) clustering is one of the popular clustering algorithms for medical image segmentation. However, FCM has the problems of depending on initial clustering centers, falling into local optimal solutions easily, and sensitivity to noise disturbance. To solve these problems, this paper proposes a hybrid artificial fish swarm algorithm (HAFSA). The proposed algorithm combines the artificial fish swarm algorithm (AFSA) with FCM, exploiting the global optimization searching and parallel computing ability of AFSA to find a superior result. Meanwhile, the Metropolis criterion and a noise reduction mechanism are introduced into AFSA to enhance the convergence rate and antinoise ability. An artificial grid graph and Magnetic Resonance Imaging (MRI) are used in the experiments, and the experimental results show that the proposed algorithm has stronger antinoise ability and higher precision. A number of evaluation indicators also demonstrate that HAFSA performs better than FCM and suppressed FCM (SFCM).
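A minimal sketch of the plain FCM core that HAFSA builds on (one-dimensional data, the standard membership and center update rules; the fish-swarm hybridization itself is not reproduced):

```python
import random

def fuzzy_c_means(data, c, m=2.0, iters=100, seed=0):
    """Plain fuzzy C-means on scalars: returns centers and a membership matrix.

    This is only the FCM baseline that HAFSA improves on, not the hybrid.
    """
    rng = random.Random(seed)
    centers = rng.sample(data, c)            # random initialization: FCM's known weakness
    u = [[0.0] * c for _ in data]
    for _ in range(iters):
        # Membership update: inverse-distance rule of standard FCM.
        for i, x in enumerate(data):
            d = [abs(x - ck) + 1e-12 for ck in centers]
            for k in range(c):
                u[i][k] = 1.0 / sum((d[k] / d[j]) ** (2.0 / (m - 1.0)) for j in range(c))
        # Center update: membership-weighted means.
        centers = [
            sum((u[i][k] ** m) * x for i, x in enumerate(data))
            / sum(u[i][k] ** m for i in range(len(data)))
            for k in range(c)
        ]
    return centers, u

data = [1.0, 1.1, 0.9, 8.0, 8.1, 7.9]
centers, u = fuzzy_c_means(data, c=2)
```

HAFSA replaces the random initialization and local iteration above with a swarm-based global search; the two update rules stay the same.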
Ma, Li; Li, Yang; Fan, Suohai; Fan, Runzhu
2015-01-01
Image segmentation plays an important role in medical image processing. Fuzzy c-means (FCM) clustering is one of the popular clustering algorithms for medical image segmentation. However, FCM has the problems of depending on initial clustering centers, falling into local optimal solution easily, and sensitivity to noise disturbance. To solve these problems, this paper proposes a hybrid artificial fish swarm algorithm (HAFSA). The proposed algorithm combines artificial fish swarm algorithm (AFSA) with FCM whose advantages of global optimization searching and parallel computing ability of AFSA are utilized to find a superior result. Meanwhile, Metropolis criterion and noise reduction mechanism are introduced to AFSA for enhancing the convergence rate and antinoise ability. The artificial grid graph and Magnetic Resonance Imaging (MRI) are used in the experiments, and the experimental results show that the proposed algorithm has stronger antinoise ability and higher precision. A number of evaluation indicators also demonstrate that the effect of HAFSA is more excellent than FCM and suppressed FCM (SFCM). PMID:26649068
Comparing the performance of biomedical clustering methods.
Wiwie, Christian; Baumbach, Jan; Röttger, Richard
2015-11-01
Identifying groups of similar objects is a popular first step in biomedical data analysis, but it is error-prone and impossible to perform manually. Many computational methods have been developed to tackle this problem. Here we assessed 13 well-known methods using 24 data sets ranging from gene expression to protein domains. Performance was judged on the basis of 13 common cluster validity indices. We developed a clustering analysis platform, ClustEval (http://clusteval.mpi-inf.mpg.de), to promote streamlined evaluation, comparison and reproducibility of clustering results in the future. This allowed us to objectively evaluate the performance of all tools on all data sets with up to 1,000 different parameter sets each, resulting in a total of more than 4 million calculated cluster validity indices. We observed that there was no universal best performer, but on the basis of this wide-ranging comparison we were able to develop a short guideline for biomedical clustering tasks. ClustEval allows biomedical researchers to pick the appropriate tool for their data type and allows method developers to compare their tool to the state of the art. PMID:26389570
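To illustrate what a cluster validity index computes, here is a sketch of one widely used index, the mean silhouette width. This is one plausible example; the abstract does not enumerate the 13 indices ClustEval actually uses:

```python
def silhouette(points, labels):
    """Mean silhouette width: (b - a) / max(a, b) averaged over all points,
    where a = mean intra-cluster distance, b = mean distance to the nearest
    other cluster. Values near +1 indicate compact, well-separated clusters."""
    def dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)

    scores = []
    for p, l in zip(points, labels):
        own = clusters[l]
        if len(own) == 1:
            scores.append(0.0)   # common convention for singleton clusters
            continue
        a = sum(dist(p, q) for q in own if q is not p) / (len(own) - 1)
        b = min(
            sum(dist(p, q) for q in other) / len(other)
            for lab, other in clusters.items() if lab != l
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette(pts, [0, 0, 1, 1])   # clustering that matches the two blobs
bad = silhouette(pts, [0, 1, 0, 1])    # clustering that straddles both blobs
```

Running many such indices over many parameterizations, as ClustEval does, turns the comparison of clustering tools into a (large) bookkeeping exercise.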
CCM: A Text Classification Method by Clustering
DEFF Research Database (Denmark)
Nizamani, Sarwat; Memon, Nasrullah; Wiil, Uffe Kock;
2011-01-01
In this paper, a new Cluster based Classification Model (CCM) for suspicious email detection and other text classification tasks, is presented. Comparative experiments of the proposed model against traditional classification models and the boosting algorithm are also discussed. Experimental results...... show that the CCM outperforms traditional classification models as well as the boosting algorithm for the task of suspicious email detection on terrorism domain email dataset and topic categorization on the Reuters-21578 and 20 Newsgroups datasets. The overall finding is that applying a cluster based...... approach to text classification tasks simplifies the model and at the same time increases the accuracy....
Performance Analysis of Unsupervised Clustering Methods for Brain Tumor Segmentation
Directory of Open Access Journals (Sweden)
Tushar H Jaware
2013-10-01
Full Text Available Medical image processing is a most challenging and emerging field of neuroscience. The ultimate goal of medical image analysis in brain MRI is to extract important clinical features that would improve methods of diagnosis and treatment of disease. This paper focuses on methods to detect and extract brain tumors from brain MR images. MATLAB is used to design a software tool for locating brain tumors based on unsupervised clustering methods. The K-means clustering algorithm is implemented and tested on a database of 30 images. A performance evaluation of the unsupervised clustering methods is presented.
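A sketch of K-means on scalar pixel intensities, the essential step behind this kind of intensity-based segmentation (the synthetic intensities and the simple spread-out initialization below are assumptions for illustration):

```python
def kmeans_1d(values, k, iters=50):
    """Lloyd's k-means on scalar intensities, as used for MR image thresholding."""
    lo, hi = min(values), max(values)
    # Spread initial centers evenly across the intensity range.
    centers = [lo + (hi - lo) * (i + 0.5) / k for i in range(k)]
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for v in values:
            buckets[min(range(k), key=lambda c: abs(v - centers[c]))].append(v)
        centers = [sum(b) / len(b) if b else centers[c] for c, b in enumerate(buckets)]
    return centers

# Synthetic "image": background around 20, tissue around 120, bright lesion around 230.
pixels = [20, 22, 18, 25, 118, 122, 121, 119, 228, 232, 230]
centers = sorted(kmeans_1d(pixels, k=3))
```

Each pixel is then labeled with its nearest center, which is what separates a bright tumor region from surrounding tissue in the MR slice.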
Institute of Scientific and Technical Information of China (English)
程宏斌; 乐德广; 孙霞; 王海军
2012-01-01
This paper establishes an energy consumption model for clustering protocols to address the low energy efficiency of nodes in the LEACH protocol. Based on an analysis of the energy cost of cluster-head election and the differences in energy consumption between nodes, a non-competitive cluster-head rotation method for WSNs is proposed: a cluster head is elected by competition only once, in the first round of each rotation cycle, while in the remaining rounds the other nodes take the cluster-head role in turn according to a fixed rotation. Furthermore, setting a reasonable number of data collections in each round can effectively reduce the energy consumed by cluster-head election. Theoretical analysis and simulation results show that the optimized clustering algorithm effectively improves the overall energy consumption performance of WSN clustering.
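The energy argument can be made concrete with a deliberately toy cost model (all constants below are made up for illustration, not taken from the paper): paying the election cost once per rotation cycle instead of once per round reduces total energy whenever election is expensive:

```python
def total_energy(cycles, rounds_per_cycle, e_election, e_per_round, elect_every_round):
    """Toy cost model: electing a cluster head costs extra energy. The proposed
    scheme pays that cost once per cycle; the baseline pays it every round."""
    rounds = cycles * rounds_per_cycle
    elections = rounds if elect_every_round else cycles
    return elections * e_election + rounds * e_per_round

# Hypothetical numbers: 10 cycles of 5 rounds, election twice as costly as a round.
baseline = total_energy(10, 5, e_election=2.0, e_per_round=1.0, elect_every_round=True)
proposed = total_energy(10, 5, e_election=2.0, e_per_round=1.0, elect_every_round=False)
```

In this toy model the saving is exactly `(rounds - cycles) * e_election`; the paper's model additionally accounts for per-node energy differences, which this sketch ignores.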
Generating a multilingual taxonomy based on multilingual terminology clustering
Institute of Scientific and Technical Information of China (English)
ZHANG Chengzhi
2011-01-01
Taxonomy denotes the hierarchical structure of a knowledge organization system. It has important applications in knowledge navigation, semantic annotation and semantic search. It is useful to study taxonomies generated automatically in the dynamic information environment in which massive amounts of information are processed and found. A multilingual taxonomy is the core component of a multilingual thesaurus or ontology. This paper presents two methods of bilingual taxonomy generation: cross-language terminology clustering and mixed-language based terminology clustering. According to our experimental results on terminology clustering in four specific subject domains, we found that if a parallel corpus is used to cluster multilingual terminologies, mixed-language based terminology clustering outperforms cross-language terminology clustering.
Study on Grey Clustering Decision Methods Based on Rényi Entropy
Institute of Scientific and Technical Information of China (English)
吴正朋; 张友萍; 李梅
2011-01-01
In traditional grey fixed-weight clustering methods, the weights are given in advance and are therefore not objective. Drawing on the idea of classical Shannon information entropy, this paper proposes a method of determining the weights based on Rényi entropy and constructs a grey fixed-weight clustering evaluation algorithm based on Rényi entropy weights. The algorithm uses system state data and obtains the decision weights by computing the entropy, and a case study set against a practical problem is presented. The results show that the method is simple to compute and determines the weights objectively, supplementing and improving grey clustering decision theory.
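A sketch of one plausible entropy-weighting rule of this kind (order-2 Rényi entropy; the abstract does not give the paper's exact formula, so the normalization used here is an assumption): a criterion whose values are spread almost uniformly across objects carries little discriminating information and receives a low weight:

```python
import math

def renyi_entropy(p, alpha=2.0):
    """Rényi entropy of order alpha (alpha != 1); tends to Shannon entropy as alpha -> 1."""
    return math.log(sum(pi ** alpha for pi in p)) / (1.0 - alpha)

def entropy_weights(criteria, alpha=2.0):
    """Assumed weighting rule: normalize each criterion's values into a
    distribution, then weight the criterion by (1 - entropy / max entropy)."""
    max_h = math.log(len(criteria[0]))      # entropy of the uniform distribution
    weights = []
    for values in criteria:
        total = sum(values)
        p = [v / total for v in values]
        weights.append(1.0 - renyi_entropy(p, alpha) / max_h)
    s = sum(weights)
    return [w / s for w in weights]

# Criterion 0 varies a lot across the objects, criterion 1 barely at all:
# the nearly uniform criterion should get much less weight.
criteria = [[9.0, 1.0, 1.0, 1.0], [3.0, 3.1, 2.9, 3.0]]
w = entropy_weights(criteria)
```

The resulting weights would then feed the grey fixed-weight clustering step in place of the manually specified ones.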
FLCW: Frequent Itemset Based Text Clustering with Window Constraint
Institute of Scientific and Technical Information of China (English)
ZHOU Chong; LU Yansheng; ZOU Lei; HU Rong
2006-01-01
Most existing text clustering algorithms overlook the fact that a document is a word sequence with semantic information, and important semantic information resides in the positions of words in the sequence. In this paper, a novel method named Frequent Itemset-based Clustering with Window (FICW) is proposed, which makes use of this semantic information for text clustering with a window constraint. The experimental results obtained from tests on three (hypertext) text sets show that FICW outperforms the compared method in both clustering accuracy and efficiency.
New clustering methods for population comparison on paternal lineages.
Juhász, Z; Fehér, T; Bárány, G; Zalán, A; Németh, E; Pádár, Z; Pamjav, H
2015-04-01
The goal of this study is to show two new clustering and visualising techniques developed to find the most typical clusters of 18-dimensional Y chromosomal haplogroup frequency distributions of 90 Western Eurasian populations. The first technique called "self-organizing cloud (SOC)" is a vector-based self-learning method derived from the Self Organising Map and non-metric Multidimensional Scaling algorithms. The second technique is a new probabilistic method called the "maximal relation probability" (MRP) algorithm, based on a probability function having its local maximal values just in the condensation centres of the input data. This function is calculated immediately from the distance matrix of the data and can be interpreted as the probability that a given element of the database has a real genetic relation with at least one of the remaining elements. We tested these two new methods by comparing their results to both each other and the k-medoids algorithm. By means of these new algorithms, we determined 10 clusters of populations based on the similarity of haplogroup composition. The results obtained represented a genetically, geographically and historically well-interpretable picture of 10 genetic clusters of populations mirroring the early spread of populations from the Fertile Crescent to the Caucasus, Central Asia, Arabia and Southeast Europe. The results show that a parallel clustering of populations using SOC and MRP methods can be an efficient tool for studying the demographic history of populations sharing common genetic footprints. PMID:25388803
Web Document Clustering Using Cuckoo Search Clustering Algorithm based on Levy Flight
Directory of Open Access Journals (Sweden)
Moe Moe Zaw
2013-09-01
Full Text Available The World Wide Web serves as a huge, widely distributed, global information service center. The tremendous amount of information on the web grows day by day, so finding relevant information on the web is a major challenge in Information Retrieval. This leads to the need for new techniques that help users effectively navigate, summarize and organize the overwhelming amount of information. One technique that can play an important role towards this objective is web document clustering. This paper aims to develop a clustering algorithm and apply it to web document clustering. Cuckoo Search is a recently developed optimization algorithm based on the obligate brood-parasitic behavior of some cuckoo species, combined with Lévy flights. In this paper, a Cuckoo Search clustering algorithm based on Lévy flights is proposed. This algorithm applies Cuckoo Search optimization to web document clustering in order to locate the optimal centroids of the clusters and to find a global solution to the clustering problem. To test the performance of the proposed method, experimental results on a benchmark dataset are presented. The results obtained show that the Cuckoo Search clustering algorithm based on Lévy flights performs well in web document clustering.
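The Lévy-flight step that drives cuckoo search can be sketched with Mantegna's algorithm, a standard way to draw such heavy-tailed steps (the `move_centroid` update rule below is an illustrative assumption, not the paper's exact update):

```python
import math
import random

def levy_step(beta, rng):
    """One Lévy-flight step via Mantegna's algorithm: mostly small moves
    punctuated by occasional very large jumps, which is what lets cuckoo
    search escape poor centroid configurations."""
    num = math.gamma(1.0 + beta) * math.sin(math.pi * beta / 2.0)
    den = math.gamma((1.0 + beta) / 2.0) * beta * 2.0 ** ((beta - 1.0) / 2.0)
    sigma = (num / den) ** (1.0 / beta)
    u = rng.gauss(0.0, sigma)
    v = rng.gauss(0.0, 1.0)
    return u / abs(v) ** (1.0 / beta)

def move_centroid(centroid, best, beta, scale, rng):
    """Hypothetical candidate move: Lévy step scaled by distance to the best nest."""
    return [c + scale * levy_step(beta, rng) * (c - b) for c, b in zip(centroid, best)]

rng = random.Random(42)
steps = [levy_step(1.5, rng) for _ in range(2000)]
```

The heavy tail is the point: a Gaussian walk of the same typical step size would almost never produce the long exploratory jumps that Lévy flights provide.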
Wang, Tai-Chi; Phoa, Frederick Kin Hing
2016-03-01
Community/cluster structure is one of the most important features of social networks. Many cluster detection methods have been proposed to identify this important pattern, but few are able to assess the statistical significance of the clusters by considering the likelihood of the network structure and its attributes. Based on the definition of clustering, we propose a scanning method, originating in the analysis of spatial data, for identifying clusters in social networks. Since the properties of network data are more complicated than those of spatial data, we verify our method's feasibility via simulation studies. The results show that the detection power is affected by cluster sizes and connection probabilities. According to our simulation results, the detection accuracy for structure clusters, and for combined structure and attribute clusters, is better with our proposed method than with other methods in most of our simulation cases. In addition, we apply our proposed method to empirical data to identify statistically significant clusters.
A New Method of Open Cluster Membership Determination
Gao, Xin-hua; Chen, Li; Hou, Zhen-jie
2014-07-01
Membership determination is the key step in studying open clusters, as it directly influences the estimation of their physical parameters. DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based clustering algorithm from data mining. In this paper the DBSCAN algorithm is used for the first time to determine the membership of the open clusters NGC 6791 and M 67 (NGC 2682). Our results indicate that the DBSCAN algorithm can effectively eliminate the contamination of field stars. The obtained member stars of NGC 6791 exhibit a clearly doubled main-sequence structure in the color-magnitude diagram, implying that NGC 6791 may have a more complicated history of star formation and evolution. The clustering analysis of M 67 indicates the presence of mass segregation, and a distinct relative motion between the central and outer parts of the cluster. These results demonstrate that the DBSCAN algorithm is an effective method of membership determination, with some advantages over the conventional kinematic method.
Comparing the performance of biomedical clustering methods
DEFF Research Database (Denmark)
Wiwie, Christian; Baumbach, Jan; Röttger, Richard
2015-01-01
Identifying groups of similar objects is a popular first step in biomedical data analysis, but it is error-prone and impossible to perform manually. Many computational methods have been developed to tackle this problem. Here we assessed 13 well-known methods using 24 data sets ranging from gene......-ranging comparison we were able to develop a short guideline for biomedical clustering tasks. ClustEval allows biomedical researchers to pick the appropriate tool for their data type and allows method developers to compare their tool to the state of the art....
An Empirical Comparison of the Summarization Power of Graph Clustering Methods
Liu, Yike; Shah, Neil; Koutra, Danai
2015-01-01
How do graph clustering techniques compare with respect to their summarization power? How well can they summarize a million-node graph with a few representative structures? Graph clustering or community detection algorithms can summarize a graph in terms of coherent and tightly connected clusters. In this paper, we compare and contrast different techniques: METIS, Louvain, spectral clustering, SlashBurn and KCBC, our proposed k-core-based clustering method. Unlike prior work that focuses on v...
A survey of kernel and spectral methods for clustering
Filippone, M.; Camastra, F.; Masulli, F.; Rovetta, S.
2008-01-01
Clustering algorithms are a useful tool to explore data structures and have been employed in many disciplines. The focus of this paper is the partitioning clustering problem with a special interest in two recent approaches: kernel and spectral methods. The aim of this paper is to present a survey of kernel and spectral clustering methods, two approaches able to produce nonlinear separating hypersurfaces between clusters. The presented kernel clustering methods are the kernel version of many c...
Structure based alignment and clustering of proteins (STRALCP)
Zemla, Adam T.; Zhou, Carol E.; Smith, Jason R.; Lam, Marisa W.
2013-06-18
Disclosed are computational methods of clustering a set of protein structures based on local and pair-wise global similarity values. Pair-wise local and global similarity values are generated based on pair-wise structural alignments for each protein in the set of protein structures. Initially, the protein structures are clustered based on pair-wise local similarity values. The protein structures are then clustered based on pair-wise global similarity values. For each given cluster both a representative structure and spans of conserved residues are identified. The representative protein structure is used to assign newly-solved protein structures to a group. The spans are used to characterize conservation and assign a "structural footprint" to the cluster.
A Cluster Based Approach for Classification of Web Results
Directory of Open Access Journals (Sweden)
Apeksha Khabia
2014-12-01
Full Text Available Nowadays a significant amount of information on the web is present in the form of text, e.g., reviews, forum postings, blogs, news articles, email messages and web pages. It becomes difficult to classify documents into predefined categories as the number of documents grows. Clustering is the division of data into clusters, so that the data in each cluster share some common trait, often proximity according to some defined measure. The underlying distribution of a data set can be partially depicted from the learned clusters under the guidance of the initial data set. Thus, clusters of documents can be employed to train a classifier by using defined features of those clusters. An important issue is also to classify text data from the web into different clusters by mining the knowledge. Accordingly, this paper presents a review of most document clustering techniques and cluster-based classification techniques used so far. Pre-processing of text data sets and a document clustering method are also explained briefly.
Open cluster membership probability based on K-means clustering algorithm
El Aziz, Mohamed Abd; Selim, I. M.; Essam, A.
2016-05-01
In star cluster images, the relative coordinate positions of each star with respect to all the other stars are adopted. The membership of a star cluster is therefore determined by two basic criteria, one for geometric membership and the other for physical (photometric) membership. In this paper, we present a new method for the determination of open cluster membership based on the K-means clustering algorithm. This algorithm allows us to efficiently discriminate cluster members from field stars. To validate the method we applied it to NGC 188 and NGC 2266, and membership stars in these clusters have been obtained. The color-magnitude diagram of the member stars is significantly clearer and shows a well-defined main sequence and a red giant branch in NGC 188, which allows us to better constrain the cluster members and estimate their physical parameters. The membership probabilities have been calculated and compared to those obtained by other methods. The results show that the K-means clustering algorithm can effectively select probable member stars in space without any assumption about the spatial distribution of stars in the cluster or field. Our results are in good agreement with those derived in previous works.
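A minimal two-means sketch of this kind of membership discrimination on synthetic star coordinates (the farthest-point initialization and the Gaussian/uniform mock data are assumptions for illustration, not the paper's setup):

```python
import random

def kmeans2(points, iters=20):
    """Two-means with a deterministic farthest-point initialization:
    center 0 starts at the first point, center 1 at the point farthest from it."""
    c1 = points[0]
    c2 = max(points, key=lambda p: (p[0] - c1[0]) ** 2 + (p[1] - c1[1]) ** 2)
    centers = [c1, c2]
    labels = [0] * len(points)
    for _ in range(iters):
        for i, p in enumerate(points):
            d = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centers]
            labels[i] = 0 if d[0] <= d[1] else 1
        for c in (0, 1):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centers[c] = (sum(p[0] for p in members) / len(members),
                              sum(p[1] for p in members) / len(members))
    return centers, labels

# Concentrated mock "cluster stars" near the origin plus scattered "field stars".
rng = random.Random(7)
cluster_stars = [(rng.gauss(0.0, 0.2), rng.gauss(0.0, 0.2)) for _ in range(40)]
field_stars = [(rng.uniform(4.0, 6.0), rng.uniform(4.0, 6.0)) for _ in range(20)]
stars = cluster_stars + field_stars
centers, labels = kmeans2(stars)
```

The label of the compact group identifies the probable members; membership probabilities, as in the paper, would additionally weight each star by its distance to the cluster center.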
Eros-based Fuzzy Clustering Method for Longitudinal Data
Institute of Scientific and Technical Information of China (English)
李会民; 闫健卓; 方丽英; 王普
2013-01-01
Considering the characteristics of longitudinal data sets, such as multiple variables, missing values, unequal series lengths and irregular time intervals, a similarity measure for longitudinal data based on the Eros distance is studied, and the fuzzy C-means clustering algorithm is modified accordingly, yielding a fuzzy clustering method for longitudinal data based on the Eros distance (FErosCM). First, the unbalanced longitudinal data set is preprocessed, including filling missing values, standardizing variables, and reducing redundant attributes using rough set theory. Then the FErosCM clustering method is used for automatic classification, with information entropy used to assess the performance of the clustering algorithm. Experiments show that this method is effective and efficient, in both clustering efficiency and accuracy, for the classification of longitudinal data.
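A sketch of an Eros-style similarity for bivariate series: a weighted sum of the absolute cosines between matching eigenvectors of the two series' covariance matrices. In the original definition the weights aggregate eigenvalues over the whole data set; averaging the two series' normalized eigenvalues here is a simplifying assumption:

```python
import math

def cov2(series):
    """Entries (sxx, sxy, syy) of the 2x2 covariance matrix of [(x, y), ...]."""
    n = len(series)
    mx = sum(p[0] for p in series) / n
    my = sum(p[1] for p in series) / n
    sxx = sum((p[0] - mx) ** 2 for p in series) / n
    syy = sum((p[1] - my) ** 2 for p in series) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in series) / n
    return sxx, sxy, syy

def eig2(sxx, sxy, syy):
    """Eigenvalues (descending) and unit eigenvectors of a symmetric 2x2 matrix."""
    tr, det = sxx + syy, sxx * syy - sxy * sxy
    disc = math.sqrt(max(tr * tr / 4.0 - det, 0.0))
    vals = [tr / 2.0 + disc, tr / 2.0 - disc]
    if abs(sxy) < 1e-12:  # already diagonal: axis-aligned eigenvectors
        vecs = [(1.0, 0.0), (0.0, 1.0)] if sxx >= syy else [(0.0, 1.0), (1.0, 0.0)]
    else:
        vecs = []
        for lam in vals:
            v = (sxy, lam - sxx)
            n = math.hypot(*v)
            vecs.append((v[0] / n, v[1] / n))
    return vals, vecs

def eros(sa, sb):
    """Eros similarity in [0, 1]: weighted |cosine| between matching eigenvectors."""
    va, ua = eig2(*cov2(sa))
    vb, ub = eig2(*cov2(sb))
    w = [(va[i] / sum(va) + vb[i] / sum(vb)) / 2.0 for i in range(2)]
    return sum(w[i] * abs(ua[i][0] * ub[i][0] + ua[i][1] * ub[i][1]) for i in range(2))

line = [(t, 2.0 * t) for t in range(10)]            # strongly correlated series
same_shape = [(t, 2.0 * t + 1.0) for t in range(10)]  # shifted copy: identical covariance
anti = [(t, -2.0 * t) for t in range(10)]             # opposite principal direction
```

Because Eros compares principal directions rather than pointwise values, shifted copies score 1 while series with different internal structure score lower, which suits unequal-length longitudinal records.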
Web Resource Recommendation Method Based on Intuitionistic Fuzzy Clustering
Institute of Scientific and Technical Information of China (English)
肖满生; 汪新凡; 周丽娟
2012-01-01
In the classification of Web resources, traditional methods based on user interest cannot accurately reflect changes in user interests and have difficulty distinguishing the quality and style of resource content. To solve these problems, a Web resource recommendation method based on intuitionistic fuzzy C-means clustering is proposed. First, Web resources are expressed as intuitionistic fuzzy numbers according to the user's degree of interest. Then the aggregation theory of intuitionistic fuzzy information is applied to classify the resources. Finally, similar or related resources are recommended to the user. Theoretical analysis and experimental results show that this method greatly improves recommendation quality compared with traditional fuzzy C-means and collaborative filtering methods.
Recent advances in coupled-cluster methods
Bartlett, Rodney J
1997-01-01
Today, coupled-cluster (CC) theory has emerged as the most accurate, widely applicable approach to the correlation problem in molecules. Furthermore, the correct scaling of the energy and wavefunction with size (i.e. extensivity) recommends it for studies of polymers and crystals as well as molecules. CC methods have also paid dividends for nuclei and for certain strongly correlated systems of interest in field theory. For CC methods to have achieved this distinction, it has been necessary to formulate new theoretical approaches for the treatment of a variety of essential quantities
SOFT CLUSTERING BASED EXPOSITION TO MULTIPLE DICTIONARY BAG OF WORDS
Directory of Open Access Journals (Sweden)
K. S. Sujatha
2012-01-01
Object classification is a highly important area of computer vision with many applications, including robotics, image search, face recognition, aiding visually impaired people, and image censoring. A common recent classification method that uses features is the Bag of Words (BoW) approach, in which a codebook of visual words is created using various clustering methods. To increase performance, the Multiple Dictionaries BoW (MDBoW) method, which draws more visual words from different independent dictionaries instead of adding more words to the same dictionary, was previously implemented with hard clustering. Hard clustering uses nearest-neighbor assignments: a feature that lies nearly the same distance from two cluster centers is assigned only to the slightly nearer one, so ambiguous features are not well represented by the visual vocabulary. To address this problem, a soft-clustering-based Multiple Dictionary Bag of Visual Words for image classification is implemented, with the dictionary generated by a modified fuzzy C-means algorithm using the R1 norm. Performance was evaluated on images by varying the dictionary size. The proposed method works better when the number of topics and the number of images per topic are larger, and the results indicate that the multiple-dictionary bag-of-words model with fuzzy clustering improves recognition performance over the baseline method.
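The soft-assignment idea can be sketched with the standard fuzzy c-means membership formula, giving each feature a graded membership in every visual word. This is a generic soft vector-quantization sketch, not the paper's exact R1-norm variant:

```python
import math

def soft_assign(feature, codebook, m=2.0):
    """Fuzzy membership of one feature vector to every visual word.

    codebook: list of word centers; m: fuzzifier (> 1). Returns a list
    of memberships summing to 1, so an ambiguous feature contributes
    to several histogram bins instead of only the nearest one.
    """
    d = [math.dist(feature, c) for c in codebook]
    # Exact hit on a word center -> crisp assignment.
    if any(x == 0.0 for x in d):
        return [1.0 if x == 0.0 else 0.0 for x in d]
    p = 2.0 / (m - 1.0)
    return [1.0 / sum((di / dj) ** p for dj in d) for di in d]
```

A feature equidistant from two centers splits its vote roughly evenly, which is exactly the case hard assignment handles poorly.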
Zhuang, X. W.; Li, Y. P.; Huang, G. H.; Liu, J.
2016-07-01
An integrated multi-GCM-based stochastic weather generator and stepwise cluster analysis (MGCM-SWG-SCA) method is developed by incorporating multiple global climate models (MGCM), a stochastic weather generator (SWG), and a stepwise-clustered hydrological model (SCHM) within a general framework. MGCM-SWG-SCA can investigate uncertainties of projected climate changes, create watershed-scale climate projections from large-scale variables, assess climate change impacts on hydrological processes, and capture the nonlinear relationship between input variables and outputs in watershed systems. MGCM-SWG-SCA is then applied to the Kaidu watershed, a cold-arid region in the Xinjiang Uyghur Autonomous Region of northwest China, to demonstrate its efficiency. Results reveal that the variability of streamflow is mainly affected by (1) temperature change during spring, (2) precipitation change during winter, and (3) both temperature and precipitation changes in summer and autumn. Results also disclose that: (1) the projected minimum and maximum temperatures and precipitation from MGCM change with seasons in different ways; (2) various climate change projections can reproduce the seasonal variability of watershed-scale climate series; (3) SCHM can simulate daily streamflow to a satisfactory degree, and a significant increasing trend of streamflow is indicated from the validation (2006-2011) to the future (2015-2035) period; (4) streamflow can vary under different climate change projections. These findings can be explained by the fact that, for the Kaidu watershed located in the cold-arid region, glacier melt is mainly related to temperature changes, while precipitation changes directly cause the variability of streamflow.
Document Clustering based on Topic Maps
Rafi, Muhammad; Farooq, Amir; 10.5120/1640-2204
2011-01-01
The importance of document clustering is now widely acknowledged by researchers for better management, smart navigation, efficient filtering, and concise summarization of large collections of documents such as the World Wide Web (WWW). The next challenge lies in performing clustering based on the semantic contents of documents. The problem of document clustering has two main components: (1) representing the document in a form that inherently captures the semantics of the text, which may also help reduce the dimensionality of the document; and (2) defining a similarity measure based on that semantic representation such that it assigns higher values to document pairs with a stronger semantic relationship. The feature space of documents can be very challenging for document clustering: a document may contain multiple topics, a large set of class-independent general words, and only a handful of class-specific core words. With these features in mind, traditional agglomerative clustering algori...
An Efficient Fuzzy Clustering-Based Approach for Intrusion Detection
Nguyen, Huu Hoa; Darmont, Jérôme
2011-01-01
The need to increase accuracy in detecting sophisticated cyber attacks poses a great challenge not only to the research community but also to corporations. So far, many approaches have been proposed to cope with this threat. Among them, data mining has brought on remarkable contributions to the intrusion detection problem. However, the generalization ability of data mining-based methods remains limited, and hence detecting sophisticated attacks remains a tough task. In this thread, we present a novel method based on both clustering and classification for developing an efficient intrusion detection system (IDS). The key idea is to take useful information exploited from fuzzy clustering into account for the process of building an IDS. To this aim, we first present cornerstones to construct additional cluster features for a training set. Then, we come up with an algorithm to generate an IDS based on such cluster features and the original input features. Finally, we experimentally prove that our method outperform...
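The idea of feeding cluster-derived information into an IDS classifier can be sketched as simple feature augmentation. The distance-based membership below is a hypothetical stand-in for the paper's fuzzy cluster features, and the cluster centers are assumed to come from a prior fuzzy clustering of the training set:

```python
import math

def augment_with_cluster_features(X, centers):
    """Append cluster-derived features to each record.

    For every record we add its distance to each cluster center and a
    normalized (inverse-distance) membership degree. The classifier is
    then trained on the original features plus these extra columns.
    """
    augmented = []
    for x in X:
        dists = [math.dist(x, c) for c in centers]
        inv = [1.0 / (d + 1e-12) for d in dists]
        total = sum(inv)
        members = [v / total for v in inv]
        augmented.append(list(x) + dists + members)
    return augmented
```

Each augmented record gains `2 * len(centers)` columns, and the membership columns always sum to 1.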
Fuzzy Clustering - Principles, Methods and Examples
DEFF Research Database (Denmark)
Kroszynski, Uri; Zhou, Jianjun
1998-01-01
One of the most remarkable advances in the field of identification and control of systems (in particular mechanical systems) whose behaviour cannot be described by means of the usual mathematical models has been achieved by the application of methods of fuzzy theory. In the framework of a study about identification of "black-box" properties by analysis of system input/output data sets, we have prepared an introductory note on the principles and the most popular data classification methods used in fuzzy modeling. This introductory note also includes some examples that illustrate the use of the methods. The examples were solved by hand and served as a test bench for exploration of the MATLAB capabilities included in the Fuzzy Control Toolbox. The fuzzy clustering methods described include fuzzy c-means (FCM), fuzzy c-lines (FCL) and fuzzy c-elliptotypes (FCE).
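The FCM method named above alternates two textbook updates, memberships then centers. A minimal sketch with deterministic initialization and a fixed iteration count (no convergence test):

```python
import math

def fuzzy_c_means(X, c, m=2.0, iters=100):
    """Plain fuzzy c-means.

    X: list of points (tuples), c: number of clusters, m: fuzzifier
    (> 1). Returns (centers, U) where U[i][j] is the membership of
    point i in cluster j. Centers start at evenly spaced data points,
    a simple deterministic choice for this sketch.
    """
    X = list(X)
    centers = [X[i * (len(X) - 1) // max(c - 1, 1)] for i in range(c)]
    U = [[0.0] * c for _ in X]
    dim = len(X[0])
    for _ in range(iters):
        # Membership update: u_ij = 1 / sum_k (d_ij / d_ik)^(2/(m-1)).
        for i, x in enumerate(X):
            d = [max(math.dist(x, v), 1e-12) for v in centers]
            for j in range(c):
                U[i][j] = 1.0 / sum((d[j] / dk) ** (2.0 / (m - 1.0))
                                    for dk in d)
        # Center update: v_j = sum_i u_ij^m x_i / sum_i u_ij^m.
        for j in range(c):
            w = [U[i][j] ** m for i in range(len(X))]
            s = sum(w)
            centers[j] = tuple(sum(wi * x[k] for wi, x in zip(w, X)) / s
                               for k in range(dim))
    return centers, U
```

On two well-separated blobs the centers settle inside the blobs and every membership row sums to 1 by construction.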
Sakumichi, Naoyuki; Kawakami, Norio; Ueda, Masahito
2011-01-01
The quantum-statistical cluster expansion method of Lee and Yang is extended to investigate off-diagonal long-range order (ODLRO) in one- and multi-component mixtures of bosons or fermions. Our formulation is applicable to both a uniform system and a trapped system without local-density approximation and allows systematic expansions of one- and multi-particle reduced density matrices in terms of cluster functions which are defined for the same system with Boltzmann statistics. Each term in th...
Park, Sang Ha; Lee, Seokjin; Sung, Koeng-Mo
Non-negative matrix factorization (NMF) is widely used for monaural musical sound source separation because of its efficiency and good performance. However, an additional clustering process is required because the musical sound mixture is separated into more signals than the number of musical tracks during NMF separation. In the conventional method, manual clustering or training-based clustering is performed with an additional learning process. Recently, a clustering algorithm based on the mel-frequency cepstrum coefficient (MFCC) was proposed for unsupervised clustering. However, MFCC clustering supplies limited information for clustering. In this paper, we propose various timbre features for unsupervised clustering and a clustering algorithm with these features. Simulation experiments are carried out using various musical sound mixtures. The results indicate that the proposed method improves clustering performance, as compared to conventional MFCC-based clustering.
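The NMF step described above can be sketched with the classic Lee-Seung multiplicative updates for the Euclidean loss; the subsequent clustering of basis columns into tracks (the paper's actual contribution) is omitted:

```python
import numpy as np

def nmf(V, rank, iters=500, seed=0):
    """Multiplicative-update NMF, Euclidean loss.

    Factorizes a non-negative matrix V (e.g. a magnitude spectrogram)
    as V ~ W @ H, with W holding spectral bases and H activations.
    Each basis column is one separated component; a clustering step
    would then group columns belonging to the same musical track.
    """
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, rank)) + 1e-3
    H = rng.random((rank, m)) + 1e-3
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + 1e-12)  # activation update
        W *= (V @ H.T) / (W @ H @ H.T + 1e-12)  # basis update
    return W, H
```

The multiplicative form keeps both factors non-negative throughout, which is why no projection step is needed.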
Method for determining experts' weights based on entropy and cluster analysis
Institute of Scientific and Technical Information of China (English)
周漩; 张凤鸣; 惠晓滨; 李克武
2011-01-01
Existing methods for determining experts' weights in group decision-making take into account the consistency of the experts' collating vectors but lack a measure of their information similarity. It may therefore happen that an expert whose collating vector is close to the group consensus, but which carries great information uncertainty, is assigned the same weight as the other experts. To address this, a method for deriving experts' weights based on entropy and cluster analysis is proposed: the collating vectors of all experts are classified using an information similarity coefficient, and the experts' weights are determined according to the classification result and the entropy of the collating vectors. A numerical example shows that the method is effective and feasible.
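The entropy ingredient of this weighting scheme can be sketched as follows, assuming each expert's collating vector is a distribution over alternatives (non-negative, summing to 1); the clustering step is omitted:

```python
import math

def entropy_weights(collating_vectors):
    """Weight experts by the entropy of their collating vectors.

    Lower entropy means less uncertainty in the expert's judgement and
    therefore a higher weight; weights are normalized to sum to 1.
    A sketch of the entropy idea only, not the paper's full method.
    """
    n = len(collating_vectors[0])
    def norm_entropy(p):
        return -sum(pi * math.log(pi) for pi in p if pi > 0) / math.log(n)
    # Certainty = 1 - normalized entropy.
    cert = [1.0 - norm_entropy(v) for v in collating_vectors]
    total = sum(cert)
    if total <= 0:            # all experts maximally uncertain
        return [1.0 / len(cert)] * len(cert)
    return [c / total for c in cert]
```

An expert with a peaked collating vector ends up with more weight than one whose vector is uniform.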
MANNER OF STOCKS SORTING USING CLUSTER ANALYSIS METHODS
Directory of Open Access Journals (Sweden)
Jana Halčinová
2014-06-01
The aim of the present article is to show the possibility of using cluster analysis methods in the classification of stocks of finished products. Cluster analysis creates groups (clusters) of finished products according to similarity in demand, i.e. customer requirements for each product. The manner of sorting stocks of finished products into clusters is described with a practical example, and the resulting clusters are incorporated into the draft layout of the distribution warehouse.
Malware Classification based on Call Graph Clustering
Kinable, Joris
2010-01-01
Each day, anti-virus companies receive tens of thousands of samples of potentially harmful executables. Many of the malicious samples are variations of previously encountered malware, created by their authors to evade pattern-based detection. Dealing with these large amounts of data requires robust, automatic detection approaches. This paper studies malware classification based on call graph clustering. By representing malware samples as call graphs, it is possible to abstract certain variations away and enable the detection of structural similarities between samples. The ability to cluster similar samples together makes more generic detection techniques possible, targeting the commonalities of the samples within a cluster. To compare call graphs mutually, we compute pairwise graph similarity scores via graph matchings which approximately minimize the graph edit distance. Next, to facilitate the discovery of similar malware samples, we employ several clustering algorithms, including k-medoids and DB...
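Clustering from pairwise graph similarity scores needs only a distance matrix, which is what makes k-medoids a natural fit here. A minimal Voronoi-iteration sketch (the full PAM algorithm additionally searches medoid swaps):

```python
import random

def k_medoids(D, k, iters=50, seed=0):
    """k-medoids on a precomputed distance matrix D.

    D[i][j] is a pairwise distance (e.g. an approximate graph edit
    distance between call graphs). Alternates nearest-medoid
    assignment with moving each medoid to the member that minimizes
    total in-cluster distance.
    """
    n = len(D)
    medoids = random.Random(seed).sample(range(n), k)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: D[i][medoids[j]])
                  for i in range(n)]
        new = []
        for j in range(k):
            members = [i for i in range(n) if labels[i] == j] or [medoids[j]]
            new.append(min(members,
                           key=lambda c: sum(D[c][i] for i in members)))
        if new == medoids:
            break
        medoids = new
    labels = [min(range(k), key=lambda j: D[i][medoids[j]]) for i in range(n)]
    return medoids, labels
```

Because only D is consulted, the same code works for any similarity-derived distance, with no vector representation of the graphs required.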
The Local Maximum Clustering Method and Its Application in Microarray Gene Expression Data Analysis
Directory of Open Access Journals (Sweden)
Chen Yidong
2004-01-01
An unsupervised data clustering method, called the local maximum clustering (LMC) method, is proposed for identifying clusters in experimental data sets based on research interest. A magnitude property is defined according to research purposes, and data sets are clustered around each local maximum of the magnitude property. By properly defining a magnitude property, this method can overcome many difficulties in microarray data clustering such as reduced projection in similarities, noise, and arbitrary gene distribution. To critically evaluate the performance of this clustering method in comparison with other methods, we designed three model data sets with known cluster distributions and applied the LMC method as well as the hierarchical clustering method, the k-means clustering method, and the self-organized map method to these model data sets. The results show that the LMC method produces the most accurate clustering results. As an example of application, we applied the method to cluster the leukemia samples reported in the microarray study of Golub et al. (1999).
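The core LMC idea, clustering points around local maxima of a magnitude property, can be sketched as hill-climbing to the highest-magnitude neighbor within a radius. This is a simplified illustration, not the authors' exact procedure:

```python
import math

def local_maximum_clustering(points, magnitude, radius):
    """Cluster points around local maxima of a magnitude property.

    Each point repeatedly moves to the highest-magnitude point within
    `radius`; points ending at the same local maximum form one
    cluster. Returns consecutive cluster ids, one per point.
    """
    n = len(points)
    def neighbors(i):
        return [j for j in range(n)
                if math.dist(points[i], points[j]) <= radius]
    def climb(i):
        while True:
            j = max(neighbors(i), key=lambda t: magnitude[t])
            if magnitude[j] <= magnitude[i]:   # i is a local maximum
                return i
            i = j
    roots = [climb(i) for i in range(n)]
    ids = {r: c for c, r in enumerate(sorted(set(roots)))}
    return [ids[r] for r in roots]
```

The number of clusters falls out of the data (one per local maximum) rather than being fixed in advance, which is the property the abstract emphasizes.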
A Clustering Ensemble approach based on the similarities in 2-mode social networks
Institute of Scientific and Technical Information of China (English)
SU Bao-ping; ZHANG Meng-jie
2014-01-01
For a particular clustering problem, selecting the best clustering method is challenging. Research suggests that integrating multiple clusterings can greatly improve the accuracy of a clustering ensemble. A new clustering ensemble approach based on similarities in 2-mode networks is proposed in this paper. First, the data objects and the initial clusters are transformed into a 2-mode network; then the similarities in the 2-mode network are used to iteratively compute the similarity between different clusters and refine the adjacency matrix; finally, the K-means algorithm is applied to obtain the final clustering results. The method makes effective use of the similarity between different clusters, and an example shows its feasibility.
Market Segmentation Using Bayesian Model Based Clustering
Van Hattum, P.
2009-01-01
This dissertation deals with two basic problems in marketing, that are market segmentation, which is the grouping of persons who share common aspects, and market targeting, which is focusing your marketing efforts on one or more attractive market segments. For the grouping of persons who share common aspects a Bayesian model based clustering approach is proposed such that it can be applied to data sets that are specifically used for market segmentation. The cluster algorithm can handle very l...
Li, Chunhui; Sun, Lian; Jia, Junxiang; Cai, Yanpeng; Wang, Xuan
2016-07-01
Source water areas are facing many potential water pollution risks. Risk assessment is an effective method to evaluate such risks. In this paper an integrated model based on k-means clustering analysis and set pair analysis was established for evaluating the risks associated with water pollution in source water areas, in which the weights of indicators were determined through the entropy weight method. The proposed model was then applied to assess water pollution risks in the region of Shiyan, in which China's key source water area, the Danjiangkou Reservoir, the water source of the middle route of the South-to-North Water Diversion Project, is located. The results showed that eleven sources with relatively high risk values were identified. At the regional scale, Shiyan City and Danjiangkou City would have a high risk value in terms of industrial discharge. Comparatively, Danjiangkou City and Yunxian County would have a high risk value in terms of agricultural pollution. Overall, the risk values of northern regions close to the main stream and reservoir of the region of Shiyan were higher than those in the south. The results of risk level indicated that five sources were at a lower risk level (i.e., level II), two at a moderate risk level (i.e., level III), one at a higher risk level (i.e., level IV) and three at the highest risk level (i.e., level V). Also, the risks of industrial discharge are higher than those of the agricultural sector. It is thus essential to manage the pillar industry of the region of Shiyan and certain agricultural companies in the vicinity of the reservoir to reduce water pollution risks of source water areas. PMID:27016678
Water Quality of Pipe Network Based on the Grey Clustering Method
Institute of Scientific and Technical Information of China (English)
李明
2011-01-01
The water quality of a pipe network can be regarded as a grey system, which can be evaluated using the grey clustering approach. The grey clustering method overcomes the disadvantage of traditional methods that evaluate multi-factor, multi-index problems with a single value. The Guangzhou pipe network is taken as an example to assess water quality. The results show that the grey clustering method can use a small number of samples to assess the water-quality grade of the pipe network, making it convenient to obtain status information for each water-quality testing point.
Model-based clustering of array CGH data
Shah, Sohrab P.; Cheung, K-John; Johnson, Nathalie A.; Alain, Guillaume; Gascoyne, Randy D.; Horsman, Douglas E.; Ng, Raymond T.; Murphy, Kevin P.
2009-01-01
Motivation: Analysis of array comparative genomic hybridization (aCGH) data for recurrent DNA copy number alterations from a cohort of patients can yield distinct sets of molecular signatures or profiles. This can be due to the presence of heterogeneous cancer subtypes within a supposedly homogeneous population. Results: We propose a novel statistical method for automatically detecting such subtypes or clusters. Our approach is model based: each cluster is defined in terms of a sparse profile...
Detecting influential observations in a model-based cluster analysis
Bruckers, L.; Molenberghs, G; Verbeke, G; Geys, H.
2016-01-01
Finite mixture models have been used to model population heterogeneity and to relax distributional assumptions. These models are also convenient tools for clustering and classification of complex data such as, for example, repeated-measurements data. The performance of model-based clustering algorithms is sensitive to influential and outlying observations. Methods for identifying outliers in a finite mixture model have been described in the literature. Approaches to identify influential obser...
Seeland, Madeleine
2014-01-01
This thesis focuses on graph clustering. It introduces scalable methods for clustering large databases of small graphs by common scaffolds, i.e., the existence of one sufficiently large subgraph shared by all cluster elements. Further, the thesis studies applications for classification and regression. The experimental results show that it is for the first time possible to cluster millions of graphs within a reasonable time using an accurate scaffold-based similarity measure.
Clustering Methods Application for Customer Segmentation to Manage Advertisement Campaign
Maciej Kutera; Mirosława Lasek
2010-01-01
Clustering methods have become such well-developed algorithms for analyzing large data collections that they are now counted among data mining methods. They form an ever larger group of methods, evolving quickly and finding more and more applications. In the article, our research concerning the usefulness of clustering methods in customer segmentation to manage advertisement campaigns is presented. We introduce results obtained by using four sel...
Seniority-based coupled cluster theory
Henderson, Thomas M; Stein, Tamar; Scuseria, Gustavo E
2014-01-01
Doubly occupied configuration interaction (DOCI) with optimized orbitals often accurately describes strong correlations while working in a Hilbert space much smaller than that needed for full configuration interaction. However, the scaling of such calculations remains combinatorial with system size. Pair coupled cluster doubles (pCCD) is very successful in reproducing DOCI energetically, but can do so with low polynomial scaling ($N^3$, disregarding the two-electron integral transformation from atomic to molecular orbitals). We show here several examples illustrating the success of pCCD in reproducing both the DOCI energy and wave function, and show how this success frequently comes about. What DOCI and pCCD lack are an effective treatment of dynamic correlations, which we here add by including higher-seniority cluster amplitudes which are excluded from pCCD. This frozen pair coupled cluster approach is comparable in cost to traditional closed-shell coupled cluster methods with results that are competitive fo...
CORM: An R Package Implementing the Clustering of Regression Models Method for Gene Clustering
Jiejun Shi; Li-Xuan Qin
2014-01-01
We report a new R package implementing the clustering of regression models (CORM) method for clustering genes using gene expression data and provide data examples illustrating each clustering function in the package. The CORM package is freely available at CRAN from http://cran.r-project.org.
Cluster-based control of nonlinear dynamics
Kaiser, Eurika; Spohn, Andreas; Cattafesta, Louis N; Morzynski, Marek
2016-01-01
The ability to manipulate and control fluid flows is of great importance in many scientific and engineering applications. Here, a cluster-based control framework is proposed to determine optimal control laws with respect to a cost function for unsteady flows. The proposed methodology frames high-dimensional, nonlinear dynamics into low-dimensional, probabilistic, linear dynamics which considerably simplifies the optimal control problem while preserving nonlinear actuation mechanisms. The data-driven approach builds upon a state space discretization using a clustering algorithm which groups kinematically similar flow states into a low number of clusters. The temporal evolution of the probability distribution on this set of clusters is then described by a Markov model. The Markov model can be used as predictor for the ergodic probability distribution for a particular control law. This probability distribution approximates the long-term behavior of the original system on which basis the optimal control law is de...
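The Markov-model step above, describing the temporal evolution over clusters, can be sketched as estimating an empirical transition matrix from a time-ordered sequence of cluster labels:

```python
from collections import Counter

def cluster_markov_model(labels, k):
    """Estimate the cluster-transition matrix from a label sequence.

    labels: time-ordered cluster indices of flow snapshots (the output
    of the clustering step); k: number of clusters. P[a][b] is the
    empirical probability of moving from cluster a to cluster b; rows
    with no observations fall back to a uniform distribution.
    """
    counts = Counter(zip(labels, labels[1:]))
    P = [[0.0] * k for _ in range(k)]
    for a in range(k):
        row = sum(counts[(a, b)] for b in range(k))
        for b in range(k):
            P[a][b] = counts[(a, b)] / row if row else 1.0 / k
    return P
```

Iterating this matrix (or taking its stationary distribution) gives the ergodic cluster probabilities that the framework uses to score a control law.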
Query Expansion Based on Clustered Results
Liu, Ziyang; Chen, Yi
2011-01-01
Query expansion is a functionality of search engines that suggests a set of related queries for a user-issued keyword query. Typical corpus-driven keyword query expansion approaches return popular words in the results as expanded queries. Using these approaches, the expanded queries may correspond to a subset of possible query semantics, and thus miss relevant results. To handle ambiguous queries and exploratory queries, whose result relevance is difficult to judge, we propose a new framework for keyword query expansion: we start with clustering the results according to user specified granularity, and then generate expanded queries, such that one expanded query is generated for each cluster whose result set should ideally be the corresponding cluster. We formalize this problem and show its APX-hardness. Then we propose two efficient algorithms named iterative single-keyword refinement and partial elimination based convergence, respectively, which effectively generate a set of expanded queries from clustered r...
Logistics Enterprise Evaluation Model Based On Fuzzy Clustering Analysis
Fu, Pei-hua; Yin, Hong-bo
In this thesis, we introduce an evaluation model for logistics enterprises based on a fuzzy clustering algorithm. First of all, we present the evaluation index system, which covers basic information, management level, technical strength, transport capacity, informatization level, market competition and customer service. We decided the index weights according to the grades and evaluated the integrated ability of the logistics enterprises using the fuzzy cluster analysis method. We describe the system evaluation module and the cluster analysis module in detail, including how the two modules were implemented. At last, we give the results of the system.
An Evolutionary Dynamic Clustering based Colour Image Segmentation
Directory of Open Access Journals (Sweden)
Amiya Halder, Nilvra Pathak
2011-02-01
We present a novel Dynamic Colour Image Segmentation (DCIS) system for colour images. In this paper, we propose an efficient colour image segmentation algorithm based on an evolutionary approach, i.e. dynamic GA-based clustering (GADCIS). The proposed technique automatically determines the optimum number of clusters for colour images. The optimal number of clusters is obtained by using a cluster validity criterion with the help of the Gaussian distribution. The advantage of this method is that no a priori knowledge is required to segment the colour image. The proposed algorithm is evaluated on well-known natural images and its performance is compared to other clustering techniques. Experimental results show that the proposed algorithm produces comparable segmentation results.
Unbiased methods for removing systematics from galaxy clustering measurements
Elsner, Franz; Peiris, Hiranya V
2015-01-01
Measuring the angular clustering of galaxies as a function of redshift is a powerful method for extracting information from the three-dimensional galaxy distribution. The precision of such measurements will dramatically increase with ongoing and future wide-field galaxy surveys. However, these are also increasingly sensitive to observational and astrophysical contaminants. Here, we study the statistical properties of three methods proposed for controlling such systematics - template subtraction, basic mode projection, and extended mode projection - all of which make use of externally supplied template maps, designed to characterise and capture the spatial variations of potential systematic effects. Based on a detailed mathematical analysis, and in agreement with simulations, we find that the template subtraction method in its original formulation returns biased estimates of the galaxy angular clustering. We derive closed-form expressions that should be used to correct results for this shortcoming. Turning to th...
Bugge, Anna; Tarp, Jakob; Østergaard, Lars; Domazet, Sidsel Louise; Andersen, Lars Bo; Froberg, Karsten
2014-01-01
Background: The aim of the study LCoMotion (Learning, Cognition and Motion) was to develop, document, and evaluate a multi-component physical activity (PA) intervention in public schools in Denmark. The primary outcome was cognitive function. Secondary outcomes were academic skills, body composition, aerobic fitness and PA. The primary aim of the present paper is to describe the rationale, design and methods of the LCoMotion study. Methods/Design: LCoMotion was designed as a cluster-randomize...
Directory of Open Access Journals (Sweden)
Kohei Arai
2013-07-01
Cluster analysis aims at identifying groups of similar objects and therefore helps to discover the distribution of patterns and interesting correlations in data sets. In this paper, we propose a consistent partitioning of a dataset that allows identifying cluster patterns of any shape in numerical clustering, convex or non-convex. The method is based on a layered structure representation obtained from the distance and angle of each numerical data point to the data centroid, and on iterative cluster construction using the nearest-neighbor distance between clusters to merge. Encouraging results show the effectiveness of the proposed technique.
Finding Within Cluster Dense Regions Using Distance Based Technique
Wesam Ashour; Motaz Murtaja
2012-01-01
One of the main categories in data clustering is density-based clustering. Density-based clustering techniques like DBSCAN are attractive because they can find arbitrarily shaped clusters along with noisy outliers. The main weakness of traditional density-based algorithms like DBSCAN is clustering data sets with different density levels: DBSCAN's calculations are done according to given parameters applied to all points in a data set, while the densities of the data set's clusters may be totally different....
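For reference, a minimal textbook DBSCAN sketch. Note that it takes a single global pair (eps, min_pts), which is exactly the multi-density weakness the abstract points out:

```python
import math

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN: grow clusters from core points, label noise -1.

    A core point has at least min_pts neighbors within eps (itself
    included); clusters are grown by expanding core points, border
    points are absorbed without expansion, everything else is noise.
    """
    n = len(points)
    def region(i):
        return [j for j in range(n)
                if math.dist(points[i], points[j]) <= eps]
    labels = [None] * n
    cid = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = region(i)
        if len(seeds) < min_pts:
            labels[i] = -1              # noise (may be claimed later)
            continue
        labels[i] = cid
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cid         # border point: absorb, don't expand
            if labels[j] is not None:
                continue
            labels[j] = cid
            nb = region(j)
            if len(nb) >= min_pts:      # j is itself a core point
                queue.extend(nb)
        cid += 1
    return labels
```

On data with clusters of very different densities, no single (eps, min_pts) pair works for all of them, which motivates the multi-density variants the text discusses.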
Myllys, Nanna; Elm, Jonas; Halonen, Roope; Kurtén, Theo; Vehkamäki, Hanna
2016-02-01
We investigate the utilization of the domain local pair natural orbital coupled cluster (DLPNO-CCSD(T)) method for calculating binding energies of atmospheric molecular clusters. Applied to small complexes of atmospheric relevance, we find that the DLPNO method significantly reduces the scatter in the binding energy that is commonly present in DFT calculations. For medium-sized clusters consisting of sulfuric acid and bases, the DLPNO method yields a systematic underestimation of the binding energy compared to canonical coupled cluster results. The errors in the DFT binding energies appear to be more random, while the systematic nature of the DLPNO results allows the establishment of a scaling factor to better mimic the canonical coupled cluster calculations. Based on the trends identified for the small and medium-sized systems, we further extend the application of the DLPNO method to large acid-base clusters consisting of up to 10 molecules, which have previously been out of reach with accurate coupled cluster methods. Using the Atmospheric Cluster Dynamics Code (ACDC) we compare the sulfuric acid dimer formation based on the new DLPNO binding energies with previously published RI-CC2/aug-cc-pV(T+d)Z results. We also compare the simulated sulfuric acid dimer concentration as a function of the base concentration with measurement data from the CLOUD chamber and flow tube experiments. The DLPNO method, even after scaling, underpredicts the dimer concentration significantly. Reasons for this are discussed. PMID:26771121
Ontology-based topic clustering for online discussion data
Wang, Yongheng; Cao, Kening; Zhang, Xiaoming
2013-03-01
With the rapid development of online communities, mining and extracting quality knowledge from online discussions has become very important for the industrial and marketing sector, as well as for e-commerce applications and government. Most existing techniques model a discussion as a social network of users, represented by a user-based graph, without considering the content of the discussion. In this paper we propose a new multilayered model to analyze online discussions, combining the user-based and message-based representations. A novel clustering method based on frequent concept sets is used to cluster the original online discussion network into a topic space. Domain ontology is used to improve the clustering accuracy, and parallel methods make the algorithms scalable to very large data sets. Our experimental study shows that the model and algorithms are effective when analyzing large-scale online discussion data.
Missing data treatment method on cluster analysis
Elsiddig Elsadig Mohamed Koko; Amin Ibrahim Adam Mohamed
2015-01-01
Missing data in household health surveys challenge the researcher because they leave the analysis incomplete. Cluster analysis was applied to the data collected in Sudan's 2006 household health survey. The current research focuses specifically on the data analysis, as the objective is to deal with missing values in cluster analysis. Two-Step Cluster Analysis is applied, in which each participant is classified into one of the identified patterns and the opt...
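The treatment the abstract describes, fill in missing values and then cluster, can be sketched in plain Python. This is a hedged illustration only: it uses mean imputation and ordinary k-means rather than the SPSS Two-Step procedure used in the study, and the tiny data set is invented.

```python
import random

def impute_means(rows):
    """Replace None entries with the column mean of the observed values."""
    cols = len(rows[0])
    means = []
    for j in range(cols):
        observed = [r[j] for r in rows if r[j] is not None]
        means.append(sum(observed) / len(observed))
    return [[means[j] if r[j] is None else r[j] for j in range(cols)]
            for r in rows]

def kmeans(rows, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns one cluster label per row."""
    rng = random.Random(seed)
    centers = rng.sample(rows, k)
    labels = [0] * len(rows)
    for _ in range(iters):
        for i, r in enumerate(rows):
            labels[i] = min(range(k),
                            key=lambda c: sum((a - b) ** 2
                                              for a, b in zip(r, centers[c])))
        for c in range(k):
            members = [rows[i] for i in range(len(rows)) if labels[i] == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Invented toy records with missing fields (None).
data = [[1.0, 2.0], [1.2, None], [8.0, 9.0], [None, 9.5]]
labels = kmeans(impute_means(data), k=2)
```

Mean imputation is the crudest choice; it keeps the sketch short but, as the abstract implies, the imputation strategy itself affects which clusters emerge.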
Model-based clustering in networks with Stochastic Community Finding
McDaid, Aaron F; Friel, Nial; Hurley, Neil J
2012-01-01
In the model-based clustering of networks, blockmodelling may be used to identify roles in the network. We identify a special case of the Stochastic Block Model (SBM) where we constrain the cluster-cluster interactions such that the density inside the clusters of nodes is expected to be greater than the density between clusters. This corresponds to the intuition behind community-finding methods, where nodes tend to be clustered together if they link to each other. We call this model Stochastic Community Finding (SCF) and present an efficient MCMC algorithm which can cluster the nodes, given the network. The algorithm is evaluated on synthetic data and is applied to a social network of interactions at a karate club and at a monastery, demonstrating how the SCF finds the 'ground truth' clustering where sometimes the SBM does not. The SCF is only one possible form of constraint or specialization that may be applied to the SBM. In a more supervised context, it may be appropriate to use other specializations to guide...
Commodity-Based Computing Clusters at PPPL.
Wah, Darren; Davis, Steven L.; Johansson, Marques; Klasky, Scott; Tang, William; Valeo, Ernest
2002-11-01
In order to cost-effectively facilitate mid-scale serial and parallel computations and code development, a number of commodity-based clusters have been built at PPPL. A recent addition is the PETREL cluster, consisting of 100 dual-processor machines, both Intel and AMD, interconnected by a 100Mbit switch. Sixteen machines have an additional Myrinet 2000 interconnect. Also underway is the implementation of a Prototype Topical Computing Facility which will explore the effectiveness and scaling of cluster computing for larger scale fusion codes, specifically including those being developed under the SCIDAC auspices. This facility will consist of two parts: a 64 dual-processor node cluster, with high speed interconnect, and a 16 dual-processor node cluster, utilizing gigabit networking, built for the purpose of exploring grid-enabled computing. The initial grid explorations will be in collaboration with the Princeton University Institute for Computational Science and Engineering (PICSciE), where a 16 processor cluster dedicated to investigation of grid computing is being built. The initial objectives are to (1) grid-enable the GTC code and an MHD code, making use of MPICH-G2 and (2) implement grid-enabled interactive visualization using DXMPI and the Chromium API.
Graph-based clustering and data visualization algorithms
Vathy-Fogarassy, Ágnes
2013-01-01
This work presents a data visualization technique that combines graph-based topology representation and dimensionality reduction methods to visualize the intrinsic data structure in a low-dimensional vector space. The application of graphs in clustering and visualization has several advantages. A graph of important edges (where edges characterize relations and weights represent similarities or distances) provides a compact representation of the entire complex data set. This text describes clustering and visualization methods that are able to utilize information hidden in these graphs, based on
Cluster-based aggregation for inter-vehicle communication
Balanici, Mihail
2015-01-01
The present master thesis is focused on the design and evaluation of a cluster-based aggregation protocol (CBAP), which defines a set of rules and procedures for data aggregation based on a cluster structure. The proposed protocol is regarded as a complex mechanism consisting of two component sub-protocols: a clustering algorithm, grouping vehicles into cluster entities, and an aggregation scheme, deploying in-network and hierarchical data aggregation atop the prebuilt clusters. A cluster-bas...
Unbiased methods for removing systematics from galaxy clustering measurements
Elsner, Franz; Leistedt, Boris; Peiris, Hiranya V.
2016-02-01
Measuring the angular clustering of galaxies as a function of redshift is a powerful method for extracting information from the three-dimensional galaxy distribution. The precision of such measurements will dramatically increase with ongoing and future wide-field galaxy surveys. However, these are also increasingly sensitive to observational and astrophysical contaminants. Here, we study the statistical properties of three methods proposed for controlling such systematics - template subtraction, basic mode projection, and extended mode projection - all of which make use of externally supplied template maps, designed to characterize and capture the spatial variations of potential systematic effects. Based on a detailed mathematical analysis, and in agreement with simulations, we find that the template subtraction method in its original formulation returns biased estimates of the galaxy angular clustering. We derive closed-form expressions that should be used to correct results for this shortcoming. Turning to the basic mode projection algorithm, we prove it to be free of any bias, whereas we conclude that results computed with extended mode projection are biased. Within a simplified setup, we derive analytical expressions for the bias and discuss the options for correcting it in more realistic configurations. Common to all three methods is an increased estimator variance induced by the cleaning process, albeit at different levels. These results enable unbiased high-precision clustering measurements in the presence of spatially varying systematics, an essential step towards realizing the full potential of current and planned galaxy surveys.
Web-based Interface in Public Cluster
Akbar, Z
2007-01-01
A web-based interface dedicated to a cluster computer which is publicly accessible for free is introduced. The interface plays an important role in enabling secure public access, while providing a user-friendly computational environment for end-users and easy maintenance for administrators. The whole architecture, which integrates both hardware and software aspects, is briefly explained. It is argued that the public cluster is globally a unique approach, and could be a new kind of e-learning system, especially for parallel programming communities.
Clustering-based selective neural network ensemble
Institute of Scientific and Technical Information of China (English)
FU Qiang; HU Shang-xu; ZHAO Sheng-ying
2005-01-01
An effective ensemble should consist of a set of networks that are both accurate and diverse. We propose a novel clustering-based selective algorithm for constructing neural network ensembles, where clustering technology is used to classify trained networks according to similarity, and the most accurate individual network is selected from each cluster to make up the ensemble. Empirical studies on regression with four typical datasets showed that this approach yields significantly smaller ensembles achieving better performance than traditional ones such as Bagging and Boosting. The bias/variance decomposition of the predictive error shows that the success of the proposed approach may lie in properly tuning the bias/variance trade-off to reduce the prediction error (the sum of squared bias and variance).
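The selection idea, cluster the trained models by similarity of behaviour and keep the most accurate member of each cluster, can be sketched as follows. The greedy grouping rule, the threshold `tau`, and the toy predictions are illustrative assumptions, not the paper's algorithm.

```python
def select_ensemble(preds, targets, tau):
    """Group models whose validation predictions are similar (greedy
    single-link grouping with threshold tau), then pick the most
    accurate model from each group."""
    def dist(a, b):          # mean squared difference between two models
        return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)
    def err(a):              # mean squared error against the targets
        return sum((x - t) ** 2 for x, t in zip(a, targets)) / len(a)
    clusters = []
    for i in range(len(preds)):
        for c in clusters:
            if any(dist(preds[i], preds[j]) < tau for j in c):
                c.append(i)      # similar to an existing group: join it
                break
        else:
            clusters.append([i])  # behaves differently: new group
    return [min(c, key=lambda i: err(preds[i])) for c in clusters]

preds = [
    [1.0, 2.0, 3.0],   # model 0
    [1.1, 2.1, 3.1],   # model 1: near-duplicate of model 0
    [5.0, 5.0, 5.0],   # model 2: behaves differently
]
targets = [1.0, 2.0, 3.0]
picked = select_ensemble(preds, targets, tau=0.5)  # one model per behaviour group
```

Dropping near-duplicates preserves diversity while the per-cluster accuracy pick preserves quality, which is the trade-off the abstract attributes to the method.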
MHCcluster, a method for functional clustering of MHC molecules
DEFF Research Database (Denmark)
Thomsen, Martin Christen Frølund; Lundegaard, Claus; Buus, Søren;
2013-01-01
binding specificity. The method has a flexible web interface that allows the user to include any MHC of interest in the analysis. The output consists of a static heat map and graphical tree-based visualizations of the functional relationship between MHC variants and a dynamic TreeViewer interface where...... both the functional relationship and the individual binding specificities of MHC molecules are visualized. We demonstrate that conventional sequence-based clustering will fail to identify the functional relationship between molecules, when applied to MHC system, and only through the use of the...
DBCSVM: Density Based Clustering Using Support VectorMachines
Directory of Open Access Journals (Sweden)
Santosh Kumar Rai
2012-07-01
Data categorization is a challenging job in the current scenario, as the volume of multimedia data on the Internet grows day by day. For better retrieval and efficient searching, a process is required for grouping the data. Data mining can find useful implicit information in large databases, and various data mining techniques are used to detect it. Data clustering is an important data mining technique for grouping data sets into different clusters, with each cluster holding data of similar properties. In this paper we take image data sets and first apply density-based clustering to group the images; density-based clustering groups the images according to the nearest feature sets but leaves outliers ungrouped. We then use a support vector machine (SVM), a maximum-margin hyperplane classifier, to classify all the outliers left by the density-based clustering. This method improves the efficiency of image grouping and gives better results.
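A minimal sketch of the two-stage pipeline: a toy DBSCAN finds dense groups and marks outliers, and a stand-in classifier then assigns each outlier to a cluster. The paper's SVM stage is replaced here by nearest-neighbour assignment purely to keep the sketch short; all names, parameters, and points are invented.

```python
def dbscan(points, eps, min_pts):
    """Label each point with a cluster id, or -1 for noise."""
    def neighbors(i):
        return [j for j in range(len(points))
                if sum((a - b) ** 2
                       for a, b in zip(points[i], points[j])) <= eps ** 2]
    labels = [None] * len(points)
    cid = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:
            labels[i] = -1           # provisionally noise
            continue
        labels[i] = cid
        queue = list(nbrs)
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cid      # border point absorbed into cluster
            if labels[j] is not None:
                continue
            labels[j] = cid
            jn = neighbors(j)
            if len(jn) >= min_pts:   # core point: keep expanding
                queue.extend(jn)
        cid += 1
    return labels

def assign_outliers(points, labels):
    """Stand-in for the SVM stage: give each noise point the label of
    its nearest clustered point (a hypothetical simplification)."""
    clustered = [i for i, l in enumerate(labels) if l != -1]
    out = labels[:]
    for i, l in enumerate(labels):
        if l == -1 and clustered:
            j = min(clustered,
                    key=lambda j: sum((a - b) ** 2
                                      for a, b in zip(points[i], points[j])))
            out[i] = labels[j]
    return out

pts = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
       (5.0, 5.0), (5.1, 5.0), (5.0, 5.1),
       (2.5, 2.5)]                      # last point is isolated
raw = dbscan(pts, eps=0.5, min_pts=3)
final = assign_outliers(pts, raw)
```

In the paper, a trained SVM replaces `assign_outliers`, so the outlier boundary is learned rather than nearest-point.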
Semisupervised Clustering for Networks Based on Fast Affinity Propagation
Directory of Open Access Journals (Sweden)
Mu Zhu
2013-01-01
Most existing clustering algorithms for networks are unsupervised and cannot improve the clustering quality by utilizing small amounts of prior knowledge. We propose a semisupervised clustering algorithm for networks based on fast affinity propagation (SCAN-FAP), which is essentially a kind of similarity metric learning method. First, we define a new constraint similarity measure integrating structural information and pairwise constraints, which reflects the effective similarities between nodes in networks. Then, taking the constraint similarities as input, we propose a fast affinity propagation algorithm which keeps the advantages of the original affinity propagation algorithm while increasing time efficiency by passing messages only between certain nodes. Finally, through extensive experimental studies, we demonstrate that the proposed algorithm takes full advantage of the prior knowledge and improves the clustering quality significantly. Furthermore, our algorithm has superior performance to some state-of-the-art approaches.
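The first step, folding pairwise constraints into the similarity matrix before propagation, might look like the sketch below. The adjustment values are assumptions for illustration; the paper's constraint measure also integrates structural information, which is omitted here.

```python
def constrained_similarity(sim, must_link, cannot_link):
    """Return a copy of the similarity matrix with prior-knowledge
    constraints folded in: must-link pairs get the highest observed
    similarity, cannot-link pairs are pushed well below the lowest."""
    hi = max(max(row) for row in sim)
    lo = min(min(row) for row in sim)
    out = [row[:] for row in sim]
    for i, j in must_link:
        out[i][j] = out[j][i] = hi
    for i, j in cannot_link:
        out[i][j] = out[j][i] = lo - (hi - lo)  # invented penalty value
    return out

sim = [
    [1.0, 0.2, 0.7],
    [0.2, 1.0, 0.1],
    [0.7, 0.1, 1.0],
]
adj = constrained_similarity(sim, must_link=[(0, 1)], cannot_link=[(0, 2)])
```

The adjusted matrix would then be fed to an affinity-propagation routine in place of the raw similarities.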
Clustering Seven Data Sets by Means of Some or All of Seven Clustering Methods.
Dreger, Ralph Mason; And Others
1988-01-01
Seven data sets (namely, clinical data on children) were subjected to clustering by seven algorithms: the B-coefficient method, Linear Typal Analysis, elementary linkage analysis, the Numerical Taxonomy System, the Statistical Analysis System hierarchical clustering method, Taxonomy, and Bolz's Type Analysis. The little-known B-coefficient method compared…
Clustering Methods for Real Estate Portfolios
William N. Goetzmann; Susan M. Wachter
1998-01-01
A clustering algorithm is applied to effective rents for twenty-one U.S. office markets, and to twenty-two metropolitan markets using vacancy data. It provides support for the conjecture that there exist a few major families of cities, including an oil-and-gas group and an industrial Northeast group. Unlike other clustering studies, we find strong evidence of bicoastal city associations among cities such as Boston and Los Angeles. We present a bootstrapping methodology for investigating the ...
ENERGY OPTIMIZATION IN CLUSTER BASED WIRELESS SENSOR NETWORKS
Directory of Open Access Journals (Sweden)
T. SHANKAR
2014-04-01
Wireless sensor networks (WSN) are made up of sensor nodes, which are usually battery-operated devices, so energy saving is a major design issue. To prolong the network's lifetime, minimization of energy consumption should be implemented at all layers of the network protocol stack, from the physical to the application layer, including cross-layer optimization. Optimizing energy consumption is the main concern when designing and planning the operation of a WSN. Clustering is one of the methods used to extend network lifetime by applying data aggregation and balancing energy consumption among the sensor nodes. This paper proposes new versions of the Low Energy Adaptive Clustering Hierarchy (LEACH) protocol, called Advanced Optimized Low Energy Adaptive Clustering Hierarchy (AOLEACH), Optimal Deterministic Low Energy Adaptive Clustering Hierarchy (ODLEACH), and Varying Probability Distance Low Energy Adaptive Clustering Hierarchy (VPDL), in combination with the Shuffled Frog Leap Algorithm (SFLA). These protocols select the best adaptive cluster heads using an improved threshold energy distribution compared to the LEACH protocol, and rotate the cluster-head position for uniform energy dissipation based on energy levels. The proposed algorithms optimize the lifetime of the network by increasing the time to first node death (FND) and the number of alive nodes.
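For context, the classical LEACH cluster-head election that these protocols refine uses a rotating threshold: a node that has not served as head in the current epoch of 1/p rounds becomes one with probability T(n). A minimal sketch of standard LEACH only, not the proposed AOLEACH/ODLEACH/VPDL variants; the node set and seed are invented.

```python
import random

def leach_threshold(p, r):
    """Classical LEACH threshold T(n) for round r, for a node that has
    not served as cluster head in the current epoch of 1/p rounds:
    T(n) = p / (1 - p * (r mod 1/p))."""
    return p / (1 - p * (r % int(round(1 / p))))

def elect_heads(node_ids, p, r, seed=0):
    """Each eligible node draws a uniform random number; a draw below
    the threshold makes it a cluster head for this round."""
    rng = random.Random(seed)
    t = leach_threshold(p, r)
    return [n for n in node_ids if rng.random() < t]

t0 = leach_threshold(p=0.1, r=0)   # first round of an epoch: T = p
t9 = leach_threshold(p=0.1, r=9)   # last round: T = 1, remaining nodes must serve
heads = elect_heads(range(100), p=0.1, r=0)
```

The threshold grows over the epoch so that, on average, every node serves as head exactly once per 1/p rounds, which is the rotation mechanism the proposed variants improve with energy-aware terms.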
Cancer detection based on Raman spectra super-paramagnetic clustering
González-Solís, José Luis; Guizar-Ruiz, Juan Ignacio; Martínez-Espinosa, Juan Carlos; Martínez-Zerega, Brenda Esmeralda; Juárez-López, Héctor Alfonso; Vargas-Rodríguez, Héctor; Gallegos-Infante, Luis Armando; González-Silva, Ricardo Armando; Espinoza-Padilla, Pedro Basilio; Palomares-Anda, Pascual
2016-08-01
The clustering of Raman spectra of serum samples is analyzed using the super-paramagnetic clustering technique based on the Potts spin model. We investigated the clustering of biochemical networks by using Raman data to define edge lengths in the network, where the interactions are functions of the individual band intensities of the Raman spectra. For this study, we used two groups of 58 and 102 control Raman spectra and the intensities of 160, 150, and 42 Raman spectra of serum samples from breast cancer, cervical cancer, and leukemia patients, respectively. The spectra were collected from patients at different hospitals in Mexico. Using the super-paramagnetic clustering technique, we identified the most natural and compact clusters, allowing us to discriminate between control and cancer patients. Of special interest was the leukemia case, where the nearly hierarchical structure observed allowed identification of the patient's leukemia type. The goal of this study is to apply a model of statistical physics, the super-paramagnetic model, to find natural clusters that allow us to design a cancer detection method. To the best of our knowledge, this is the first report of preliminary results evaluating the usefulness of super-paramagnetic clustering in spectroscopy, where it is used for the classification of spectra.
Visual cluster analysis and pattern recognition template and methods
Energy Technology Data Exchange (ETDEWEB)
Osbourn, G.C.; Martinez, R.F.
1993-12-31
This invention comprises a method of clustering using a novel template to define a region of influence. Using neighboring approximation methods, computation times can be significantly reduced. The template and method are applicable to, and improve, pattern recognition techniques.
Malware Classification based on Call Graph Clustering
Kinable, Joris; Kostakis, Orestis
2010-01-01
Each day, anti-virus companies receive tens of thousands of samples of potentially harmful executables. Many of the malicious samples are variations of previously encountered malware, created by their authors to evade pattern-based detection. Dealing with these large amounts of data requires robust, automatic detection approaches. This paper studies malware classification based on call graph clustering. By representing malware samples as call graphs, it is possible to abstract certain variations...
TOWARDS MORE ACCURATE CLUSTERING METHOD BY USING DYNAMIC TIME WARPING
Directory of Open Access Journals (Sweden)
Khadoudja Ghanem
2013-03-01
An intrinsic problem of classifiers based on machine learning (ML) methods is that their learning time grows with the size and complexity of the training dataset. For this reason, it is important to have efficient computational methods and algorithms that can be applied to large datasets, such that it is still possible to complete the machine learning tasks in reasonable time. In this context, we present in this paper a simple, more accurate process to speed up ML methods. An unsupervised clustering algorithm is combined with the Expectation-Maximization (EM) algorithm to develop an efficient Hidden Markov Model (HMM) training. The proposed process consists of two steps. In the first step, training instances with similar inputs are clustered, and a weight factor representing the frequency of these instances is assigned to each representative cluster; the Dynamic Time Warping technique is used as the dissimilarity function to cluster similar examples. In the second step, all formulas in the classical HMM training algorithm (EM) associated with the number of training instances are modified to include the weight factor in the appropriate terms. This process significantly accelerates HMM training while maintaining the same initial, transition, and emission probability matrices as those obtained with the classical HMM training algorithm, so the classification accuracy is preserved. Depending on the size of the training set, speedups of up to 2200 times are possible when the size is about 100,000 instances. The proposed approach is not limited to training HMMs but can be employed for a large variety of ML methods.
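The dissimilarity function at the heart of the first step can be sketched directly. This is the textbook dynamic-programming DTW with a squared-error local cost, not the authors' exact implementation; the example sequences are invented.

```python
def dtw(a, b):
    """Dynamic Time Warping distance between two sequences,
    with a squared-error local cost."""
    inf = float("inf")
    n, m = len(a), len(b)
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (a[i - 1] - b[j - 1]) ** 2
            # best of insertion, deletion, and match
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]

same_shape = dtw([1, 2, 3, 4], [1, 1, 2, 3, 4])  # time-shifted copy aligns exactly
different = dtw([1, 2, 3, 4], [4, 3, 2, 1])
```

Because DTW tolerates time shifts and stretches, instances that differ only in timing land in the same cluster and share one weight factor, which is what makes the weighted EM update a faithful stand-in for training on every instance.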
Core Business Selection Based on Ant Colony Clustering Algorithm
Directory of Open Access Journals (Sweden)
Yu Lan
2014-01-01
Core business is the most important business of an enterprise with diversified operations. In this paper, we first introduce the definition and characteristics of core business and then describe the ant colony clustering algorithm. To test the effectiveness of the proposed method, Tianjin Port Logistics Development Co., Ltd. is selected as the research object. Based on the current state of the company's development, its core business can be identified by the ant colony clustering algorithm. The results indicate that the proposed method is an effective way to determine a company's core business.
Comparison of two cluster analysis methods using single particle mass spectra
Zhao, Weixiang; Hopke, Philip K.; Prather, Kimberly A.
Cluster analysis of aerosol time-of-flight mass spectrometry (ATOFMS) data has been an effective tool for the identification of possible sources of ambient aerosols. In this study, the clustering results of two typical methods, adaptive resonance theory-based neural networks-2a (ART-2a) and density-based spatial clustering of applications with noise (DBSCAN), on ATOFMS data were investigated by employing a set of benchmark ATOFMS data. The advantages and disadvantages of these two methods are discussed, and some feasible remedies are proposed for problems encountered in the clustering process. The results of this study provide promising directions for future work on ambient aerosol cluster analysis, suggesting a more effective and feasible clustering strategy based on the integration of ART-2a and DBSCAN.
Directory of Open Access Journals (Sweden)
Peixin Zhao
2013-01-01
Community detection in social networks plays an important role in cluster analysis. Many traditional techniques for one-dimensional problems have proven inadequate for high-dimensional or mixed-type datasets due to data sparseness and attribute redundancy. In this paper we propose a graph-based clustering method for multidimensional datasets. This novel method has two distinguishing features: a nonbinary hierarchical tree and multi-membership clusters. The nonbinary hierarchical tree clearly highlights meaningful clusters, while the multi-membership feature may provide more useful service strategies. Experimental results on customer relationship management data confirm the effectiveness of the new method.
Sakumichi, Naoyuki; Kawakami, Norio; Ueda, Masahito
2012-04-01
The quantum-statistical cluster expansion method of Lee and Yang is extended to investigate off-diagonal long-range order (ODLRO) in one-component and multicomponent mixtures of bosons or fermions. Our formulation is applicable to both a uniform system and a trapped system without local-density approximation and allows systematic expansions of one-particle and multiparticle reduced density matrices in terms of cluster functions, which are defined for the same system with Boltzmann statistics. Each term in this expansion can be associated with a Lee-Yang graph. We elucidate a physical meaning of each Lee-Yang graph; in particular, for a mixture of ultracold atoms and bound dimers, an infinite sum of the ladder-type Lee-Yang 0-graphs is shown to lead to Bose-Einstein condensation of dimers below the critical temperature. In the case of Bose statistics, an infinite series of Lee-Yang 1-graphs is shown to converge and gives the criteria of ODLRO at the one-particle level. Applications to a dilute Bose system of hard spheres are also made. In the case of Fermi statistics, an infinite series of Lee-Yang 2-graphs is shown to converge and gives the criteria of ODLRO at the two-particle level. Applications to a two-component Fermi gas in the tightly bound limit are also made.
Sagar S. De; Minati Mishra; Satchidananda Dehuri
2013-01-01
In visual data mining, visualization of clusters is a challenging task. Although many techniques have already been developed, challenges remain in representing large volumes of data with multiple dimensions and overlapping clusters. In this paper, a multivariate cluster visualization technique (MVClustViz) is presented to visualize centroid-based clusters. The geographic projection technique supports multiple dimensions, large volumes, and both crisp and fuzzy cluster visual...
Finding Within Cluster Dense Regions Using Distance Based Technique
Directory of Open Access Journals (Sweden)
Wesam Ashour
2012-03-01
One of the main categories in data clustering is density-based clustering. Density-based clustering techniques like DBSCAN are attractive because they can find arbitrarily shaped clusters along with noisy outliers. The main weakness of traditional density-based algorithms like DBSCAN is clustering data sets with different density levels: DBSCAN applies calculations based on a single set of given parameters to all points in a data set, while the densities of the clusters may be totally different. The proposed algorithm overcomes this weakness of the traditional density-based algorithms. The algorithm starts by partitioning the data within a cluster into units based on a user parameter and computing the density of each unit separately. The algorithm then compares the results and merges neighboring units with approximately equal density values into a new cluster. The experimental results of the simulation show that the proposed algorithm gives good results in finding clusters in data sets whose clusters have different densities.
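The unit-partitioning idea can be sketched in one dimension: split a cluster's values into fixed-width units, compute a density per unit, and merge adjacent units with close densities. The unit width and merge threshold here are invented parameters standing in for the paper's user parameter, and the data is a toy example.

```python
def unit_densities(values, unit_width):
    """Partition a cluster's 1-D values into fixed-width units and
    count how many points fall in each unit."""
    lo = min(values)
    counts = {}
    for v in values:
        u = int((v - lo) // unit_width)
        counts[u] = counts.get(u, 0) + 1
    return counts

def merge_similar_units(counts, ratio=0.5):
    """Merge adjacent units whose densities differ by at most `ratio`
    (an invented threshold) into sub-cluster groups."""
    units = sorted(counts)
    groups = {units[0]: 0}
    gid = 0
    for prev, cur in zip(units, units[1:]):
        a, b = counts[prev], counts[cur]
        if cur == prev + 1 and abs(a - b) / max(a, b) <= ratio:
            groups[cur] = gid          # similar density: same sub-cluster
        else:
            gid += 1                   # gap or density jump: start a new one
            groups[cur] = gid
    return groups

vals = [0, 0.2, 0.4, 0.6, 1.2, 1.4, 5.0, 5.2, 5.4]
counts = unit_densities(vals, unit_width=1.0)
groups = merge_similar_units(counts)
```

Units 0 and 1 have comparable densities and merge into one sub-cluster, while the distant dense unit stays separate, which is the per-region density comparison a single global DBSCAN parameter cannot make.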
MHCcluster, a method for functional clustering of MHC molecules.
Thomsen, Martin; Lundegaard, Claus; Buus, Søren; Lund, Ole; Nielsen, Morten
2013-09-01
The identification of peptides binding to major histocompatibility complexes (MHC) is a critical step in the understanding of T cell immune responses. The human MHC genomic region (HLA) is extremely polymorphic, comprising several thousand alleles, many encoding a distinct molecule. The potentially unique specificities remain experimentally uncharacterized for the vast majority of HLA molecules. Likewise, for nonhuman species, only a minor fraction of the known MHC molecules have been characterized. Here, we describe a tool, MHCcluster, to functionally cluster MHC molecules based on their predicted binding specificity. The method has a flexible web interface that allows the user to include any MHC of interest in the analysis. The output consists of a static heat map and graphical tree-based visualizations of the functional relationship between MHC variants and a dynamic TreeViewer interface where both the functional relationship and the individual binding specificities of MHC molecules are visualized. We demonstrate that conventional sequence-based clustering will fail to identify the functional relationship between molecules when applied to the MHC system, and that only through the use of the predicted binding specificity can a correct clustering be found. Clustering of prevalent HLA-A and HLA-B alleles using MHCcluster confirms the presence of 12 major specificity groups (supertypes), some, however, with highly divergent specificities. Importantly, some HLA molecules are shown not to fit any supertype classification. Also, we use MHCcluster to show that chimpanzee MHC class I molecules have a reduced functional diversity compared to that of HLA class I molecules. MHCcluster is available at www.cbs.dtu.dk/services/MHCcluster-2.0. PMID:23775223
Li, Hao; Li, Peng; Xie, Jing; Yi, Shengjie; Yang, Chaojie; Wang, Jian; Sun, Jichao; Liu, Nan; Wang, Xu; Wu, Zhihao; Wang, Ligui; Hao, Rongzhang; Wang, Yong; Jia, Leili; Li, Kaiqin; Qiu, Shaofu; Song, Hongbin
2014-08-01
A clustered regularly interspaced short palindromic repeat (CRISPR) typing method has recently been developed and used for typing and subtyping of Salmonella spp., but it is complicated and labor intensive because it has to analyze all spacers in two CRISPR loci. Here, we developed a more convenient and efficient method, namely, CRISPR locus spacer pair typing (CLSPT), which only needs to analyze the two newly incorporated spacers adjoining the leader array in the two CRISPR loci. We analyzed a CRISPR array of 82 strains belonging to 21 Salmonella serovars isolated from humans in different areas of China by using this new method. We also retrieved the newly incorporated spacers in each CRISPR locus of 537 Salmonella isolates which have definite serotypes in the Pasteur Institute's CRISPR Database to evaluate this method. Our findings showed that this new CLSPT method presents a high level of consistency (kappa = 0.9872, Matthew's correlation coefficient = 0.9712) with the results of traditional serotyping, and thus, it can also be used to predict serotypes of Salmonella spp. Moreover, this new method has a considerable discriminatory power (discriminatory index [DI] = 0.8145), comparable to those of multilocus sequence typing (DI = 0.8088) and conventional CRISPR typing (DI = 0.8684). Because CLSPT only costs about $5 to $10 per isolate, it is a much cheaper and more attractive method for subtyping of Salmonella isolates. In conclusion, this new method will provide considerable advantages over other molecular subtyping methods, and it may become a valuable epidemiologic tool for the surveillance of Salmonella infections. PMID:24899040
Institute of Scientific and Technical Information of China (English)
李超顺; 周建中; 肖剑; 肖汉
2013-01-01
Kernel clustering is a valid kind of method for vibration fault diagnosis of hydro-turbine generating units (HGU). In order to solve the problems of evaluating clustering results and selecting kernel function parameters, a novel gravitational search based kernel clustering (GSKC) method is proposed. First, a kernel clustering objective function is built on the kernel Xie-Beni clustering index; then the gravitational search method is introduced to solve the objective function, with the clustering centers and the kernel function parameter encoded together as optimization variables; finally, a fault diagnosis model based on similarity in the kernel space is defined. UCI benchmark data sets were used to check the classification accuracy, and GSKC was then applied to fault diagnosis of an HGU. Experimental results show that GSKC is more accurate in classification than traditional clustering methods, clusters the fault samples of the HGU effectively, and diagnoses different kinds of faults accurately.
Institute of Scientific and Technical Information of China (English)
殷春武
2013-01-01
Cooperative innovation between subject clusters and industrial clusters is the top priority for guaranteeing the sustainable development of a regional economy. Building on an analysis of the importance of subject cluster and industrial cluster cooperative innovation ability, an evaluation index system for double-cluster cooperative innovation ability is constructed, and the combination weights of the evaluation indices are obtained by using the OWA operator to assemble multiple weight-determination methods. Given that cooperative innovation ability is hard to quantify, an assessment scale combining linguistic scales with grey degrees is proposed for the evaluation. Finally, a double-cluster cooperative innovation ability evaluation method based on fuzzy sets and grey degrees is presented, enriching the theory of double-cluster cooperative innovation ability evaluation.
Comparison of Selected Methods for Document Clustering
Czech Academy of Sciences Publication Activity Database
Ševčík, R.; Řezanková, H.; Húsek, Dušan
Berlin : Springer, 2011 - (Mugellini, E.; Szczepaniak, P.; Pettenati, M.; Sokhn, M.), s. 101-110 ISBN 978-3-642-18028-6. ISSN 1867-5662. - (Advances in Intelligent and Soft Computing. 86). [AWIC 2011. Atlantic Web Intelligence Conference /7./. Fribourg (CH), 26.01.2011-28.01.2011] R&D Projects: GA ČR GAP202/10/0262; GA ČR GA205/09/1079 Institutional research plan: CEZ:AV0Z10300504 Keywords : web clustering * cluster analysis * textual documents * web content classification * newsgroups analysis * vector model Subject RIV: IN - Informatics, Computer Science
ONTOLOGY BASED DOCUMENT CLUSTERING USING MAPREDUCE
Directory of Open Access Journals (Sweden)
Abdelrahman Elsayed
2015-05-01
Full Text Available Nowadays, document clustering is considered a data-intensive task due to the dramatic, fast increase in the number of available documents. Moreover, the feature sets that represent those documents are also very large. The most common method for representing documents is the vector space model, which represents document features as a bag of words and does not capture semantic relations between words. In this paper we introduce a distributed implementation of bisecting k-means using the MapReduce programming model. The aim behind our proposed implementation is to solve the problem of clustering data-intensive document collections. In addition, we propose integrating the WordNet ontology with bisecting k-means in order to utilize the semantic relations between words to enhance document clustering results. Our experimental results show that using lexical categories for nouns only improves the internal evaluation measures of document clustering, and reduces the document features from thousands to tens of features. Our experiments were conducted using Amazon Elastic MapReduce to deploy the bisecting k-means algorithm.
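The bisecting k-means procedure that the paper distributes over MapReduce can be sketched sequentially. This is a minimal single-machine sketch under assumed details (function names, random seeding, split-the-largest-cluster policy), not the authors' MapReduce implementation:

```python
import random

def kmeans2(points, iters=20, seed=0):
    # plain 2-means on a list of coordinate tuples; returns two sub-clusters
    rng = random.Random(seed)
    c1, c2 = rng.sample(points, 2)
    for _ in range(iters):
        a, b = [], []
        for p in points:
            d1 = sum((x - y) ** 2 for x, y in zip(p, c1))
            d2 = sum((x - y) ** 2 for x, y in zip(p, c2))
            (a if d1 <= d2 else b).append(p)
        if not a or not b:          # degenerate split, stop early
            break
        c1 = tuple(sum(xs) / len(a) for xs in zip(*a))
        c2 = tuple(sum(xs) / len(b) for xs in zip(*b))
    return a, b

def bisecting_kmeans(points, k):
    # repeatedly split the largest cluster with 2-means until k clusters exist
    clusters = [list(points)]
    while len(clusters) < k:
        clusters.sort(key=len, reverse=True)
        target = clusters.pop(0)
        if len(target) < 2:         # nothing left to split
            clusters.insert(0, target)
            break
        left, right = kmeans2(target)
        clusters.extend([left, right])
    return clusters
```

In the MapReduce setting described in the abstract, the point-to-nearest-center assignment inside `kmeans2` is the natural map step and the center recomputation the reduce step.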
PRIVACY PRESERVING CLUSTERING BASED ON LINEAR APPROXIMATION OF FUNCTION
Rajesh Pasupuleti; Narsimha Gugulothu
2014-01-01
Clustering analysis opens a new direction in data mining that has major impact in various domains, including machine learning, pattern recognition, image processing, information retrieval, and bioinformatics. Current clustering techniques do not address some of the requirements adequately and have failed to standardize clustering algorithms to support all real applications. Many clustering methods mostly depend on user-specified parameters, and initial seeds of clusters are randoml...
CLUSTERING-BASED ANALYSIS OF TEXT SIMILARITY
Bovcon, Borja
2013-01-01
The focus of this thesis is the comparison of text-document similarity analysis using clustering algorithms. We begin by defining the main problem and then proceed to describe the two most used text-document representation techniques, where we present word-filtering methods and their importance, Porter's algorithm, and the tf-idf term-weighting algorithm. We then apply all previously described algorithms to selected data sets, which vary in size and compactness. Following this, we ...
A clustering routing algorithm based on improved ant colony clustering for wireless sensor networks
Xiao, Xiaoli; Li, Yang
Because node distribution in real wireless sensor networks is not uniform, this paper presents a clustering strategy based on the ant colony clustering algorithm (ACC-C). To reduce the energy consumption of the cluster heads near the base station and of the whole network, the algorithm applies ant colony clustering to non-uniform clustering. A route optimality degree is introduced to evaluate the performance of the chosen route. Simulation results show that, compared with other algorithms, such as the LEACH algorithm and an improved particle swarm based clustering algorithm (PSC-C), the proposed approach is able to keep away from nodes with less residual energy, which can improve the lifetime of the network.
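The "keep away from nodes with less residual energy" behaviour can be illustrated with a minimal energy-aware cluster-head election. This is not the ACC-C algorithm itself; the function name, the energy-proportional sampling rule, and the node encoding are all illustrative assumptions:

```python
import random

def pick_cluster_heads(nodes, n_heads, seed=0):
    """Energy-aware cluster-head election: heads are drawn with probability
    proportional to residual energy, so depleted nodes are avoided.
    `nodes` maps node id -> residual energy."""
    rng = random.Random(seed)
    ids = list(nodes)
    heads = []
    for _ in range(n_heads):
        weights = [nodes[i] for i in ids]
        # rng.choices draws one id, weighted by residual energy
        head = rng.choices(ids, weights=weights, k=1)[0]
        heads.append(head)
        ids.remove(head)  # a node serves as head at most once per round
    return heads
```

A node with zero residual energy has zero sampling weight and is never elected, which mimics the route-avoidance property the simulation results report.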
A new method to measure the mass of galaxy clusters
Falco, Martina; Wojtak, Radoslaw; Brinckmann, Thejs; Lindholmer, Mikkel; Pandolfi, Stefania
2013-01-01
The mass measurement of galaxy clusters is an important tool for determining cosmological parameters describing the matter and energy content of the Universe. However, the standard methods rely on various assumptions about the shape or the level of equilibrium of the cluster. We present a novel method of measuring cluster masses. It is complementary to most of the other methods, since it only uses kinematical information from outside the virialized region of the cluster. Our method identifies objects, such as galaxy sheets or filaments, in the cluster outer region, and infers the cluster mass by modeling how the massive cluster perturbs the motion of these structures away from the Hubble flow. At the same time, this technique allows us to constrain the three-dimensional orientation of the detected structures with good accuracy. We use a cosmological numerical simulation to test the method. We then apply the method to the Coma cluster, where we find two galaxy sheets, and measure the mass of Coma to be Mvir=(9.2\pm2.4)10^{14} M...
Information Clustering Based on Fuzzy Multisets.
Miyamoto, Sadaaki
2003-01-01
Proposes a fuzzy multiset model for information clustering with application to information retrieval on the World Wide Web. Highlights include search engines; term clustering; document clustering; algorithms for calculating cluster centers; theoretical properties concerning clustering algorithms; and examples to show how the algorithms work.…
Directory of Open Access Journals (Sweden)
Susan Worner
2013-09-01
Full Text Available For greater preparedness, pest risk assessors are required to prioritise long lists of pest species with the potential to establish and cause significant impact in an endangered area. Such prioritization is often qualitative, subjective, and sometimes biased, relying mostly on expert and stakeholder consultation. In recent years, cluster-based analyses have been used to investigate regional pest species assemblages, or pest profiles, to indicate the risk of new organism establishment. Such an approach is based on the premise that the co-occurrence of well-known global invasive pest species in a region is not random, and that the pest species profile or assemblage integrates complex functional relationships that are difficult to tease apart. In other words, the assemblage can help identify and prioritise species that pose a threat in a target region. A computational intelligence method called a Kohonen self-organizing map (SOM), a type of artificial neural network, was the first clustering method applied to analyse assemblages of invasive pests. The SOM is a well-known dimension reduction and visualization method, especially useful for high-dimensional data that more conventional clustering methods may not analyse suitably. Like all clustering algorithms, the SOM can give details of clusters that identify regions with similar pest assemblages, as well as possible donor and recipient regions. More importantly, however, the SOM connection weights that result from the analysis can be used to rank the strength of association of each species within each regional assemblage. Species with high weights that are not already established in the target region are identified as high risk. However, the SOM analysis is only the first step in a process to assess risk, to be used alongside or incorporated within other measures. Here we illustrate the application of SOM analyses in a range of contexts in invasive species risk assessment, and discuss other clustering methods such as k
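How a SOM separates species assemblages can be illustrated with a toy one-dimensional map over binary presence/absence profiles. This sketch is not the software used in the study; the node count, decay schedules, and function names are all assumptions made for illustration:

```python
import math
import random

def train_som(data, n_nodes=4, epochs=200, seed=1):
    # 1-D SOM: each node holds a weight vector of the same length as the inputs
    rng = random.Random(seed)
    dim = len(data[0])
    nodes = [[rng.random() for _ in range(dim)] for _ in range(n_nodes)]
    for t in range(epochs):
        lr = 0.5 * (1 - t / epochs)                            # decaying learning rate
        radius = max(1e-9, (n_nodes / 2) * (1 - t / epochs))   # decaying neighbourhood
        x = data[rng.randrange(len(data))]
        # best-matching unit = node closest to the input
        bmu = min(range(n_nodes),
                  key=lambda i: sum((w - v) ** 2 for w, v in zip(nodes[i], x)))
        for i in range(n_nodes):
            # Gaussian neighbourhood pulls the BMU and its neighbours toward x
            h = math.exp(-((i - bmu) ** 2) / (2 * radius ** 2))
            nodes[i] = [w + lr * h * (v - w) for w, v in zip(nodes[i], x)]
    return nodes

def bmu_index(nodes, x):
    return min(range(len(nodes)),
               key=lambda i: sum((w - v) ** 2 for w, v in zip(nodes[i], x)))
```

After training, the weight vector of each node plays the role of the "connection weights" mentioned in the abstract: a species' weight in a node approximates its strength of association with the assemblage that node represents.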
A Method of Deep Web Clustering Based on SOM Neural Network
Institute of Scientific and Technical Information of China (English)
吴凌云
2012-01-01
In order to improve the efficiency of Deep Web data source clustering and reduce manual work, this paper presents a method of Deep Web interface clustering based on a self-organizing map (SOM) neural network, which adopts a pre-query approach and takes structural feature statistics of the interface forms as inputs. Tested on the UIUC data set, the method achieves the expected results.
On Comparison of Clustering Methods for Pharmacoepidemiological Data.
Feuillet, Fanny; Bellanger, Lise; Hardouin, Jean-Benoit; Victorri-Vigneau, Caroline; Sébille, Véronique
2015-01-01
The high consumption of psychotropic drugs is a public health problem. Rigorous statistical methods are needed to identify consumption characteristics in the post-marketing phase. Agglomerative hierarchical clustering (AHC) and latent class analysis (LCA) can both provide clusters of subjects with similar characteristics. The objective of this study was to compare these two methods in pharmacoepidemiology on several criteria: number of clusters, concordance, interpretation, and stability over time. On a dataset of bromazepam consumption, the two methods show good concordance. AHC is a very stable method and provides homogeneous classes. LCA is an inferential approach and seems to identify extreme deviant behavior more accurately. PMID:24905478
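The AHC side of the comparison can be sketched as follows. This is a generic illustration, not the study's procedure: single linkage and the squared-distance metric are assumed here for brevity, and the study may well have used a different linkage.

```python
def ahc(points, k):
    """Agglomerative hierarchical clustering down to k clusters, repeatedly
    merging the pair of clusters with the smallest single-linkage distance."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single linkage: closest pair of points across the two clusters
                d = min(sum((a - b) ** 2 for a, b in zip(p, q))
                        for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters.pop(j))
    return clusters
```

Unlike LCA, which fits a probabilistic model and assigns subjects by posterior membership, this procedure is purely deterministic given the data, which is one source of the stability the abstract reports.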
Directory of Open Access Journals (Sweden)
Alex Ing
Full Text Available Functional connectivity has become an increasingly important area of research in recent years. At a typical spatial resolution, approximately 300 million connections link each voxel in the brain with every other. This pattern of connectivity is known as the functional connectome. Connectivity is often compared between experimental groups and conditions. Standard methods used to control the type 1 error rate are likely to be insensitive when comparisons are carried out across the whole connectome, due to the huge number of statistical tests involved. To address this problem, two new cluster-based methods, the cluster size statistic (CSS) and the cluster mass statistic (CMS), are introduced to control the family-wise error rate across all connectivity values. These methods operate within a statistical framework similar to the cluster-based methods used in conventional task-based fMRI. Both methods are data driven, permutation based, and require minimal statistical assumptions. Here, the performance of each procedure is evaluated in a receiver operating characteristic (ROC) analysis, utilising a simulated dataset. The relative sensitivity of each method is also tested on real data: BOLD (blood oxygen level dependent) fMRI scans were carried out on twelve subjects under normal conditions and during the hypercapnic state (induced through the inhalation of 6% CO2 in 21% O2 and 73% N2). Both CSS and CMS detected significant changes in connectivity between normal and hypercapnic states. A family-wise error correction carried out at the individual connection level exhibited no significant changes in connectivity.
Directory of Open Access Journals (Sweden)
Baumbach Jan
2007-10-01
Full Text Available Abstract Background Detecting groups of functionally related proteins from their amino acid sequence alone has been a long-standing challenge in computational genome research. Several clustering approaches, following different strategies, have been published to attack this problem. Today, new sequencing technologies provide huge amounts of sequence data that has to be efficiently clustered with constant or increased accuracy, at increased speed. Results We advocate that the model of weighted cluster editing, also known as transitive graph projection, is well-suited to protein clustering. We present the FORCE heuristic that is based on transitive graph projection and clusters arbitrary sets of objects, given pairwise similarity measures. In particular, we apply FORCE to the problem of protein clustering and show that it outperforms the most popular existing clustering tools (Spectral clustering, TribeMCL, GeneRAGE, Hierarchical clustering, and Affinity Propagation). Furthermore, we show that FORCE is able to handle huge datasets by calculating clusters for all 192,187 prokaryotic protein sequences (66 organisms) obtained from the COG database. Finally, FORCE is integrated into the corynebacterial reference database CoryneRegNet. Conclusion FORCE is an applicable alternative to existing clustering algorithms. Its theoretical foundation, weighted cluster editing, can outperform other clustering paradigms on protein homology clustering. FORCE is open source and implemented in Java. The software, including the source code, the clustering results for COG and CoryneRegNet, and all evaluation datasets are available at http://gi.cebitec.uni-bielefeld.de/comet/force/.
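The weighted cluster editing cost that FORCE heuristically minimises can be sketched as follows. The encoding is an assumption made for illustration (FORCE itself is implemented in Java and uses a layout-based heuristic): positive weights mark similar pairs, negative weights dissimilar pairs, and a partition pays for every edge it has to edit.

```python
def editing_cost(partition, sim):
    """Weighted cluster editing cost: pay |s(u,v)| for every similar pair
    (s > 0) split across clusters and for every dissimilar pair (s < 0)
    kept inside one cluster."""
    label = {v: i for i, cluster in enumerate(partition) for v in cluster}
    cost = 0.0
    for (u, v), s in sim.items():
        same = label[u] == label[v]
        if same and s < 0:
            cost += -s   # dissimilar pair wrongly kept together
        elif not same and s > 0:
            cost += s    # similar pair wrongly separated
    return cost
```

A zero-cost partition corresponds to a transitive similarity graph; FORCE searches for the partition whose total editing cost is smallest.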
Component Based Clustering in Wireless Sensor Networks
Amaxilatis, Dimitrios; Koninis, Christos; Pyrgelis, Apostolos
2011-01-01
Clustering is an important research topic for wireless sensor networks (WSNs). A large variety of approaches has been presented focusing on different performance metrics. Even though all of them have many practical applications, an extremely limited number of software implementations is available to the research community. Furthermore, these very few techniques are implemented for specific WSN systems or are integrated in complex applications. Thus it is very difficult to comparatively study their performance and almost impossible to reuse them in future applications under a different scope. In this work we study a large body of well established algorithms. We identify their main building blocks and propose a component-based architecture for developing clustering algorithms that (a) promotes exchangeability of algorithms thus enabling the fast prototyping of new approaches, (b) allows cross-layer implementations to realize complex applications, (c) offers a common platform to comparatively study the performan...
Li, Xin-Xiong; Wang, Yang-Xin; Wang, Rui-Hu; Cui, Cai-Yan; Tian, Chong-Bin; Yang, Guo-Yu
2016-05-23
A new approach to prepare heterometallic cluster organic frameworks has been developed. The method was employed to link Anderson-type polyoxometalate (POM) clusters and transition-metal clusters by using a designed rigid tris(alkoxo) ligand containing a pyridyl group to form a three-fold interpenetrated anionic diamondoid structure and a 2D anionic layer, respectively. This technique facilitates the integration of the unique inherent properties of Anderson-type POM clusters and cuprous iodide clusters into one cluster organic framework. PMID:27061042
AN IMPROVED TEACHING-LEARNING BASED OPTIMIZATION APPROACH FOR FUZZY CLUSTERING
Directory of Open Access Journals (Sweden)
Parastou Shahsamandi E.
2014-11-01
Full Text Available Fuzzy clustering has been widely studied and applied in a variety of key areas of science and engineering. In this paper the Improved Teaching-Learning-Based Optimization (ITLBO) algorithm is used for data clustering, in which objects in the same cluster are similar. The algorithm has been tested on several datasets and compared with other popular clustering algorithms. The results show that the proposed method improves the clustering output and can be efficiently used for fuzzy clustering.
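The general scheme, teaching-learning-based optimization searching over cluster centres to minimise a fuzzy clustering objective, can be sketched as below. This is a plain TLBO loop over the standard fuzzy c-means objective, not the paper's improved ITLBO variant; population size, iteration count, and all function names are assumptions.

```python
import random

def fcm_objective(centers, data, m=2.0):
    # fuzzy c-means objective J = sum_k sum_i u_ik^m * ||x_k - v_i||^2,
    # with memberships u_ik derived from the distances to the centres
    total = 0.0
    for x in data:
        d2 = [max(1e-12, sum((a - b) ** 2 for a, b in zip(x, c))) for c in centers]
        inv = [d ** (-1.0 / (m - 1)) for d in d2]
        s = sum(inv)
        for di, iv in zip(d2, inv):
            u = iv / s
            total += (u ** m) * di
    return total

def tlbo_fcm(data, n_clusters, pop=10, iters=60, seed=0):
    """Plain TLBO searching over sets of cluster centres for the FCM objective."""
    rng = random.Random(seed)
    dim = len(data[0])
    lo = [min(x[d] for x in data) for d in range(dim)]
    hi = [max(x[d] for x in data) for d in range(dim)]

    def rand_solution():
        return [[rng.uniform(lo[d], hi[d]) for d in range(dim)]
                for _ in range(n_clusters)]

    def fit(s):
        return fcm_objective(s, data)

    popn = [rand_solution() for _ in range(pop)]
    for _ in range(iters):
        teacher = min(popn, key=fit)
        mean = [[sum(s[i][d] for s in popn) / pop for d in range(dim)]
                for i in range(n_clusters)]
        for idx, s in enumerate(popn):
            tf = rng.choice([1, 2])  # teaching factor
            # teacher phase: move toward the best solution, away from the mean
            cand = [[s[i][d] + rng.random() * (teacher[i][d] - tf * mean[i][d])
                     for d in range(dim)] for i in range(n_clusters)]
            if fit(cand) < fit(s):
                popn[idx] = s = cand
            # learner phase: learn from a randomly chosen peer
            other = popn[rng.randrange(pop)]
            sign = 1 if fit(other) < fit(s) else -1
            cand = [[s[i][d] + rng.random() * sign * (other[i][d] - s[i][d])
                     for d in range(dim)] for i in range(n_clusters)]
            if fit(cand) < fit(s):
                popn[idx] = cand
    return min(popn, key=fit)
```

Because candidate moves are only accepted when they lower the objective, the best solution improves monotonically, and the population easily beats a degenerate placement of all centres at one point.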