mining cluster analysis: Topics by WorldWideScience.org

Sample records for mining cluster analysis

URL Mining Using Agglomerative Clustering Algorithm

Directory of Open Access Journals (Sweden)

Chinmay R. Deshmukh

2015-02-01

Full Text Available Abstract The tremendous growth of the web has led to the application of data mining techniques to web logs. Data mining on the World Wide Web is an important and active area of research. Web log mining is the analysis of web log files containing sequences of web pages. Web mining is broadly classified into web content mining, web usage mining and web structure mining. Web usage mining is a technique to discover usage patterns from Web data in order to understand and better serve the needs of Web-based applications. URL mining refers to a subclass of Web mining that helps us to investigate the details of a Uniform Resource Locator. URL mining can be advantageous in the fields of security and protection. The paper introduces a technique for mining a collection of user transactions with an Internet search engine to discover clusters of similar queries and similar URLs. The information we exploit is clickthrough data: each record consists of a user's query to a search engine along with the URL which the user selected from among the candidates offered by the search engine. By viewing this dataset as a bipartite graph, with the vertices on one side corresponding to queries and on the other side to URLs, one can apply an agglomerative clustering algorithm to the graph's vertices to identify related queries and URLs.
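The bipartite-graph idea in this abstract can be illustrated with a small sketch. It is not the paper's algorithm: it merges only the query side of the graph, uses Jaccard overlap of clicked URLs as the similarity, and stops at a threshold, all of which are assumptions made for illustration. The function names, the threshold value and the sample records are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Overlap of two URL sets: |a & b| / |a | b|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_queries(clicks, threshold=0.3):
    """Greedy agglomerative merging of query vertices (illustrative only).

    clicks: iterable of (query, url) click-through records.
    Returns a list of frozensets, each holding the queries of one cluster.
    """
    # Query side of the bipartite graph: cluster of queries -> clicked URLs.
    neighbours = {}
    for query, url in clicks:
        neighbours.setdefault(frozenset([query]), set()).add(url)

    while len(neighbours) > 1:
        # Pick the pair of clusters whose URL neighbourhoods overlap most.
        a, b = max(
            combinations(neighbours, 2),
            key=lambda pair: jaccard(neighbours[pair[0]], neighbours[pair[1]]),
        )
        if jaccard(neighbours[a], neighbours[b]) < threshold:
            break  # nothing similar enough is left to merge
        # Merge the pair: union of the queries and of the URLs they reach.
        neighbours[a | b] = neighbours.pop(a) | neighbours.pop(b)
    return list(neighbours)


# Hypothetical click-through records (query, selected URL).
sample_clicks = [
    ("cheap flights", "travel-a.example"),
    ("cheap flights", "travel-b.example"),
    ("airline tickets", "travel-a.example"),
    ("airline tickets", "travel-b.example"),
    ("python tutorial", "docs.example"),
]
query_clusters = cluster_queries(sample_clicks)
```

In this toy data the two travel queries share both clicked URLs and are merged, while the unrelated query stays in its own cluster. A fuller treatment would also merge URL vertices, as the abstract describes.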
Frequent Pattern Mining Algorithms for Data Clustering

DEFF Research Database (Denmark)

Zimek, Arthur; Assent, Ira; Vreeken, Jilles

2014-01-01

Discovering clusters in subspaces, or subspace clustering and related clustering paradigms, is a research field where we find many frequent pattern mining related influences. In fact, as the first algorithms for subspace clustering were based on frequent pattern mining algorithms, it is fair to say that frequent pattern mining was at the cradle of subspace clustering; yet, it quickly developed into an independent research field. In this chapter, we discuss how frequent pattern mining algorithms have been extended and generalized towards the discovery of local clusters in high-dimensional data. In particular, we discuss several example algorithms for subspace clustering or projected clustering as well as point out recent research questions and open topics in this area relevant to researchers in either clustering or pattern mining...
Android Malware Clustering through Malicious Payload Mining

OpenAIRE

Li, Yuping; Jang, Jiyong; Hu, Xin; Ou, Xinming

2017-01-01

Clustering has been well studied for desktop malware analysis as an effective triage method. Conventional similarity-based clustering techniques, however, cannot be immediately applied to Android malware analysis due to the excessive use of third-party libraries in Android application development and the widespread use of repackaging in malware development. We design and implement an Android malware clustering system through iterative mining of malicious payload and checking whether malware s...
Clustering-based approaches to SAGE data mining

Directory of Open Access Journals (Sweden)

Wang Haiying

2008-07-01

Full Text Available Abstract Serial analysis of gene expression (SAGE) is one of the most powerful tools for global gene expression profiling. It has led to several biological discoveries and biomedical applications, such as the prediction of new gene functions and the identification of biomarkers in human cancer research. Clustering techniques have become fundamental approaches in these applications. This paper reviews relevant clustering techniques specifically designed for this type of data. It places an emphasis on current limitations and opportunities in this area for supporting biologically-meaningful data mining and visualisation.
EOQ estimation for imperfect quality items using association rule mining with clustering

Directory of Open Access Journals (Sweden)

Mandeep Mittal

2015-09-01

Full Text Available Timely identification of newly emerging trends is needed in business processes. Data mining techniques like clustering, association rule mining, classification, etc. are very important for business support and decision making. This paper presents a method for redesigning the ordering policy by including the cross-selling effect. Initially, association rules are mined on the transactional database and EOQ is estimated with revenue earned. Then, transactions are clustered to obtain homogeneous clusters and association rules are mined in each cluster to estimate EOQ with revenue earned for each cluster. Further, this paper compares the ordering policy for imperfect quality items developed by applying rules derived from the Apriori algorithm, viz. (a) without clustering the transactions, and (b) after clustering the transactions. A numerical example is given to validate the results.
Cluster analysis to evaluate stable chemical elements and physical-chemical parameters behavior on uranium mining waste

International Nuclear Information System (INIS)

Pereira, Wagner de Souza; Py Junior, Delcy de Azevedo; Goncalves, Simone; Kelecom, Alphonse; Morais, Gustavo Ferrari de; Campelo, Emanuele Lazzaretti Cordova; Dores, Luis Augusto de Carvalho Bresser

2011-01-01

The Ore Treating Unit (UTM, in Portuguese) is a deactivated uranium mine. A cluster analysis was used to evaluate the behavior of stable chemical elements and physical-chemical parameters in its effluents. The cluster analysis proved effective in the assessment, allowing the identification of groups of chemical elements, of physical-chemical parameters, and their joint analysis (elements and parameters). As a result we may assert, based on data analysis, that there is a strong link between calcium and magnesium and between aluminum and rare-earth oxides in UTM's effluents. Sulphate was also identified as strongly linked to total and dissolved solids, and those to electrical conductivity. There were other associations, but not so strongly linked. Further data gathering, for seasonal evaluation, is required in order to confirm these analyses. Additional statistical analysis (factor analysis) must be used to try to identify the origin of the groups identified in this analysis. (author)
Cluster analysis to evaluate stable chemical elements and physical-chemical parameters behavior on uranium mining waste

Energy Technology Data Exchange (ETDEWEB)

Pereira, Wagner de Souza; Py Junior, Delcy de Azevedo; Goncalves, Simone, E-mail: wspereira@inb.gov.br [Unidade de Tratamento de Minerio (UTM/INB), Pocos de Caldas, MG (Brazil). Coordenacao de Protecao Radiologica. Grupo Multidisciplinar de Radioprotecao; Kelecom, Alphonse [Universidade Federal Fluminense (UFF), Niteroi, RJ (Brazil). Inst. de Biologia. Lab. de Radiobiologia e Radiometria Pedro Lopes dos Santos; Morais, Gustavo Ferrari de; Campelo, Emanuele Lazzaretti Cordova [Unidade de Tratamento de Minerio (UTM/INB), Pocos de Caldas, MG (Brazil). Coordenacao de Desenvolvimento de Processos; Dores, Luis Augusto de Carvalho Bresser [Unidade de Tratamento de Minerio (UTM/INB), Pocos de Caldas, MG (Brazil). Gerencia de Descomissionamento

2011-07-01

The Ore Treating Unit (UTM, in Portuguese) is a deactivated uranium mine. A cluster analysis was used to evaluate the behavior of stable chemical elements and physical-chemical parameters in its effluents. The cluster analysis proved effective in the assessment, allowing the identification of groups of chemical elements, of physical-chemical parameters, and their joint analysis (elements and parameters). As a result we may assert, based on data analysis, that there is a strong link between calcium and magnesium and between aluminum and rare-earth oxides in UTM's effluents. Sulphate was also identified as strongly linked to total and dissolved solids, and those to electrical conductivity. There were other associations, but not so strongly linked. Further data gathering, for seasonal evaluation, is required in order to confirm these analyses. Additional statistical analysis (factor analysis) must be used to try to identify the origin of the groups identified in this analysis. (author)
Fuzzy Modeled K-Cluster Quality Mining of Hidden Knowledge for Decision Support

OpenAIRE

S. Parkash Kumar; K. S. Ramaswami

2011-01-01

Problem statement: This work presents Fuzzy Modeled K-means Cluster Quality Mining of hidden knowledge for decision support. The cluster quality metrics are measured based on the number of clusters, the number of objects in each cluster and its cohesiveness, and precision and recall values. The fuzzy k-means approach is adapted using a heuristic method that iterates over the clusters to form efficient, valid clusters. With the obtained data clusters, quality assessment is made by predictive mining using...
Cluster Analysis-Based Approaches for Geospatiotemporal Data Mining of Massive Data Sets for Identification of Forest Threats

Energy Technology Data Exchange (ETDEWEB)

Mills, Richard T [ORNL; Hoffman, Forrest M [ORNL; Kumar, Jitendra [ORNL; HargroveJr., William Walter [USDA Forest Service

2011-01-01

We investigate methods for geospatiotemporal data mining of multi-year land surface phenology data (250 m2 Normalized Difference Vegetation Index (NDVI) values derived from the Moderate Resolution Imaging Spectrometer (MODIS) in this study) for the conterminous United States (CONUS) as part of an early warning system for detecting threats to forest ecosystems. The approaches explored here are based on k-means cluster analysis of this massive data set, which provides a basis for defining the bounds of the expected or normal phenological patterns that indicate healthy vegetation at a given geographic location. We briefly describe the computational approaches we have used to make cluster analysis of such massive data sets feasible, describe approaches we have explored for distinguishing between normal and abnormal phenology, and present some examples in which we have applied these approaches to identify various forest disturbances in the CONUS.
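As a rough illustration of using k-means clusters as a baseline for "normal" phenology, the sketch below clusters a few toy NDVI profiles and flags a new profile when it lies far from every centroid. The distance threshold, the four-value profiles and all function names are assumptions made for illustration; the abstract does not say how the authors separate normal from abnormal phenology, and their massive data sets call for parallel implementations rather than this plain-Python loop.

```python
import math


def nearest(point, centroids):
    """Return (index, distance) of the centroid closest to point."""
    dists = [math.dist(point, c) for c in centroids]
    i = min(range(len(dists)), key=dists.__getitem__)
    return i, dists[i]


def kmeans(points, centroids, iterations=20):
    """Plain Lloyd iterations starting from the given centroids."""
    centroids = [list(c) for c in centroids]
    for _ in range(iterations):
        groups = [[] for _ in centroids]
        for p in points:
            groups[nearest(p, centroids)[0]].append(p)
        for i, g in enumerate(groups):
            if g:  # keep the old centroid if a cluster goes empty
                centroids[i] = [sum(col) / len(g) for col in zip(*g)]
    return centroids


def is_abnormal(profile, centroids, threshold=0.15):
    """Flag a profile that is far from every 'normal' centroid.

    The threshold is a hypothetical tuning parameter.
    """
    return nearest(profile, centroids)[1] > threshold


# Toy seasonal NDVI profiles (four values per year), invented for the example.
normal_profiles = [
    (0.30, 0.70, 0.80, 0.40),
    (0.32, 0.68, 0.82, 0.38),
    (0.60, 0.70, 0.70, 0.60),
    (0.62, 0.72, 0.68, 0.58),
]
centroids = kmeans(normal_profiles, [normal_profiles[0], normal_profiles[2]])

healthy_flag = is_abnormal((0.31, 0.69, 0.80, 0.40), centroids)
disturbed_flag = is_abnormal((0.30, 0.35, 0.40, 0.30), centroids)
```

Here the first test profile sits close to one of the two centroids and is treated as normal, while the second, with a suppressed growing-season peak, is flagged.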
Clustering for data mining a data recovery approach

CERN Document Server

Mirkin, Boris

2005-01-01

Often considered more as an art than a science, the field of clustering has been dominated by learning through examples and by techniques chosen almost through trial-and-error. Even the most popular clustering methods--K-Means for partitioning the data set and Ward's method for hierarchical clustering--have lacked the theoretical attention that would establish a firm relationship between the two methods and relevant interpretation aids. Rather than the traditional set of ad hoc techniques, Clustering for Data Mining: A Data Recovery Approach presents a theory that not only closes gaps in K-Mean
Mining the National Career Assessment Examination Result Using Clustering Algorithm

Science.gov (United States)

Pagudpud, M. V.; Palaoag, T. T.; Padirayon, L. M.

2018-03-01

Education is an essential process today which elicits authorities to discover and establish innovative strategies for educational improvement. This study applied data mining using clustering technique for knowledge extraction from the National Career Assessment Examination (NCAE) result in the Division of Quirino. The NCAE is an examination given to all grade 9 students in the Philippines to assess their aptitudes in the different domains. Clustering the students is helpful in identifying students’ learning considerations. With the use of the RapidMiner tool, clustering algorithms such as Density-Based Spatial Clustering of Applications with Noise (DBSCAN), k-means, k-medoid, expectation maximization clustering, and support vector clustering algorithms were analyzed. The silhouette indexes of the said clustering algorithms were compared, and the result showed that the k-means algorithm with k = 3 and silhouette index equal to 0.196 is the most appropriate clustering algorithm to group the students. Three groups were formed having 477 students in the determined group (cluster 0), 310 proficient students (cluster 1) and 396 developing students (cluster 2). The data mining technique used in this study is essential in extracting useful information from the NCAE result to better understand the abilities of students which in turn is a good basis for adopting teaching strategies.
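The silhouette index used in this study to compare algorithms can be computed directly from its definition. The sketch below is a generic implementation with Euclidean distance, not the RapidMiner computation used by the authors; the sample points and labels are invented, and the function assumes at least two clusters.

```python
import math


def silhouette_index(points, labels):
    """Mean silhouette width over all points (Euclidean distance).

    For each point, a is the mean distance to the other members of its own
    cluster and b is the smallest mean distance to any other cluster; the
    point's silhouette is (b - a) / max(a, b). Requires at least two clusters.
    """
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)

    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # convention: a singleton cluster contributes 0
        # Distance from p to itself is 0, so divide by the other members only.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        b = min(
            sum(math.dist(p, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


# Two well-separated pairs of points, labelled sensibly and then badly.
sample_points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good_score = silhouette_index(sample_points, [0, 0, 1, 1])
bad_score = silhouette_index(sample_points, [0, 1, 0, 1])
```

Values near 1 indicate compact, well-separated clusters, values near 0 indicate overlapping clusters, and negative values suggest points assigned to the wrong cluster. Comparing the index across candidate algorithms or values of k, as the study does, amounts to calling such a function once per labelling.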
Data mining theories, algorithms, and examples

CERN Document Server

Ye, Nong

2013-01-01

AN OVERVIEW OF DATA MINING METHODOLOGIES: Introduction to data mining methodologies
METHODOLOGIES FOR MINING CLASSIFICATION AND PREDICTION PATTERNS: Regression models; Bayes classifiers; Decision trees; Multi-layer feedforward artificial neural networks; Support vector machines; Supervised clustering
METHODOLOGIES FOR MINING CLUSTERING AND ASSOCIATION PATTERNS: Hierarchical clustering; Partitional clustering; Self-organized map; Probability distribution estimation; Association rules; Bayesian networks
METHODOLOGIES FOR MINING DATA REDUCTION PATTERNS: Principal components analysis; Multi-dimensional scaling; Latent variable anal
Marine data users clustering using data mining technique

Directory of Open Access Journals (Sweden)

Farnaz Ghiasi

2015-09-01

Full Text Available The objective of this research is to cluster marine data users using a data mining technique. Achieving this objective will enable marine organizations to understand their data and their users' requirements. In this research, the CRISP-DM standard model was used to implement the data mining technique. The required data were extracted from the profile database of 500 marine data users of the Iranian National Institute for Oceanography and Atmospheric Sciences (INIOAS) from 1386 to 1393. The TwoStep algorithm was used for clustering. In this research, patterns were discovered for the first time, using clustering, between marine data users (such as students, organizations and scientists) and their data requests (data source, data type, data set, parameter and geographic area). The most important clusters are: Student with International data source, Chemistry data type, “World Ocean Database” dataset, Persian Gulf geographic area; and Organization with Nitrate parameter. Senior managers of marine organizations will be able to make correct decisions concerning their existing data and will be directed toward planning better data collection in the future. Data users will also be guided with respect to their requests. Finally, valuable suggestions were offered to improve the performance of marine organizations.
Fuzzy C-Means Clustering Model Data Mining For Recognizing Stock Data Sampling Pattern

Directory of Open Access Journals (Sweden)

Sylvia Jane Annatje Sumarauw

2007-06-01

Full Text Available Abstract The capital market has been beneficial to companies and investors. For investors, the capital market provides two economic advantages, namely dividends and capital gains, and a non-economic one, a voting share in the Shareholders General Meeting. But it can also penalize the share owners. In order to protect themselves from this risk, investors should predict the prospects of their companies. Because a share is an abstract commodity, its quality is determined by the validity of the company profile information. Any information on stock value fluctuation from the Jakarta Stock Exchange can be a useful consideration and a good measurement for data analysis. In the context of protecting shareholders from risk, this research focuses on stock data sample categories, or stock data sample patterns, using the Fuzzy c-Means Clustering Model, which provides useful information for investors. The research analyses stock data such as Individual Index, Volume and Amount for the Property and Real Estate Emitter Group at the Jakarta Stock Exchange from January 1 to December 31, 2004. The mining process follows the Cross Industry Standard Process model for Data Mining (CRISP-DM), in the form of a cycle with these steps: Business Understanding, Data Understanding, Data Preparation, Modelling, Evaluation and Deployment. In the modelling step, the Fuzzy c-Means Clustering Model is applied. The Data Mining Fuzzy c-Means Clustering Model can analyze stock data in a big database with many complex variables, especially for finding the data sample pattern, and then building a Fuzzy Inference System for turning inputs into outputs based on Fuzzy Logic by recognising the pattern. Keywords: Data Mining, Fuzzy c-Means Clustering Model, Pattern Recognition
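For readers unfamiliar with fuzzy c-means, the following sketch shows the standard alternating update of memberships and cluster centres. It is a generic textbook formulation, not the authors' implementation; the fuzzifier value, the initial centres and the two-feature sample records are assumptions chosen only to make the example run.

```python
import math


def fuzzy_c_means(points, centers, m=2.0, iterations=50):
    """Alternate the membership and centre updates of fuzzy c-means.

    points:  list of equal-length numeric tuples
    centers: initial cluster centres (one tuple per cluster)
    m:       fuzzifier (> 1); larger values give softer memberships
    Returns (centers, memberships), where memberships[j][i] is the degree to
    which point j belongs to cluster i and each row sums to 1.
    """
    centers = [list(c) for c in centers]
    power = 2.0 / (m - 1.0)
    memberships = []
    for _ in range(iterations):
        memberships = []
        for p in points:
            dists = [math.dist(p, c) for c in centers]
            if 0.0 in dists:
                # Point coincides with a centre: full membership there.
                hits = dists.count(0.0)
                row = [1.0 / hits if d == 0.0 else 0.0 for d in dists]
            else:
                row = [
                    1.0 / sum((d_i / d_k) ** power for d_k in dists)
                    for d_i in dists
                ]
            memberships.append(row)
        for i in range(len(centers)):
            # Each centre is the mean of all points weighted by membership^m.
            weights = [row[i] ** m for row in memberships]
            total = sum(weights)
            if total > 0:
                centers[i] = [
                    sum(w * p[d] for w, p in zip(weights, points)) / total
                    for d in range(len(points[0]))
                ]
    return centers, memberships


# Invented two-feature records standing in for scaled stock indicators.
sample_records = [(1.0, 1.0), (1.2, 0.9), (5.0, 5.1), (5.2, 4.9)]
fcm_centers, fcm_memberships = fuzzy_c_means(
    sample_records, centers=[(0.0, 0.0), (6.0, 6.0)]
)
```

Unlike k-means, every record receives a graded membership in every cluster, which is what lets a downstream fuzzy inference system work with degrees of belonging rather than hard labels.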
Clustering Analysis for Credit Default Probabilities in a Retail Bank Portfolio

Directory of Open Access Journals (Sweden)

Elena ANDREI (DRAGOMIR)

2012-08-01

Full Text Available Methods underlying cluster analysis are very useful in data analysis, especially when the processed volume of data is very large, so that it becomes impossible to extract essential information unless specific instruments are used to summarize and structure the raw information. In this context, cluster analysis techniques are used particularly for systematic information analysis. The aim of this article is to build a useful model for the banking field, based on data mining techniques, by dividing the groups of borrowers into clusters in order to obtain a profile of the customers (debtors and good payers). We assume that a class is appropriate if it contains members that have a high degree of similarity and the standard method for measuring the similarity within a group shows the lowest variance. After clustering, data mining techniques are implemented on the cluster with bad debtors, reaching a very high accuracy after implementation. The paper is structured as follows: Section 2 describes the model for data analysis based on a specific scoring model that we proposed. In Section 3, we present a cluster analysis using the K-means algorithm, and the DM models are applied on a specific cluster. Section 4 shows the conclusions.
Using Cluster Analysis for Data Mining in Educational Technology Research

Science.gov (United States)

Antonenko, Pavlo D.; Toy, Serkan; Niederhauser, Dale S.

2012-01-01

Cluster analysis is a group of statistical methods that has great potential for analyzing the vast amounts of web server-log data to understand student learning from hyperlinked information resources. In this methodological paper we provide an introduction to cluster analysis for educational technology researchers and illustrate its use through…
Text Mining in Biomedical Domain with Emphasis on Document Clustering.

Science.gov (United States)

Renganathan, Vinaitheerthan

2017-07-01

With the exponential increase in the number of articles published every year in the biomedical domain, there is a need to build automated systems to extract unknown information from the articles published. Text mining techniques enable the extraction of unknown knowledge from unstructured documents. This paper reviews text mining processes in detail and the software tools available to carry out text mining. It also reviews the roles and applications of text mining in the biomedical domain. Text mining processes, such as search and retrieval of documents, pre-processing of documents, natural language processing, methods for text clustering, and methods for text classification are described in detail. Text mining techniques can facilitate the mining of vast amounts of knowledge on a given topic from published biomedical research articles and draw meaningful conclusions that are not possible otherwise.
Data mining approach to bipolar cognitive map development and decision analysis

Science.gov (United States)

Zhang, Wen-Ran

2002-03-01

A data mining approach to cognitive mapping is presented based on bipolar logic, bipolar relations, and bipolar clustering. It is shown that a correlation network derived from a database can be converted to a bipolar cognitive map (or bipolar relation). A transitive, symmetric, and reflexive bipolar relation (equilibrium relation) can be used to identify focal links in decision analysis. It can also be used to cluster a set of events or itemsets into three different clusters: coalition sets, conflict sets, and harmony sets. The coalition sets are positively correlated events or itemsets; each conflict set is a negatively correlated set of two coalition subsets; and a harmony set consists of events that are both negatively and positively correlated. A cognitive map and the clusters can then be used for online decision analysis. This approach combines knowledge discovery with the views of decision makers and provides an effective means for online analytical processing (OLAP) and online analytical mining (OLAM).
Identification of nitrogen-fixing genes and gene clusters from metagenomic library of acid mine drainage.

Directory of Open Access Journals (Sweden)

Zhimin Dai

Full Text Available Biological nitrogen fixation is an essential function of acid mine drainage (AMD) microbial communities. However, most acidophiles in AMD environments are uncultured microorganisms and little is known about the diversity of nitrogen-fixing genes and structure of nif gene cluster in AMD microbial communities. In this study, we used metagenomic sequencing to isolate nif genes in the AMD microbial community from Dexing Copper Mine, China. Meanwhile, a metagenome microarray containing 7,776 large-insertion fosmids was constructed to screen novel nif gene clusters. Metagenomic analyses revealed that 742 sequences were identified as nif genes including structural subunit genes nifH, nifD, nifK and various additional genes. The AMD community is massively dominated by the genus Acidithiobacillus. However, the phylogenetic diversity of nitrogen-fixing microorganisms is much higher than previously thought in the AMD community. Furthermore, a 32.5-kb genomic sequence harboring nif, fix and associated genes was screened by metagenome microarray. Comparative genome analysis indicated that most nif genes in this cluster are most similar to those of Herbaspirillum seropedicae, but the organization of the nif gene cluster had significant differences from H. seropedicae. Sequence analysis and reverse transcription PCR also suggested that distinct transcription units of nif genes exist in this gene cluster. nifQ gene falls into the same transcription unit with fixABCX genes, which have not been reported in other diazotrophs before. All of these results indicated that more novel diazotrophs survive in the AMD community.
Identification of nitrogen-fixing genes and gene clusters from metagenomic library of acid mine drainage.

Science.gov (United States)

Dai, Zhimin; Guo, Xue; Yin, Huaqun; Liang, Yili; Cong, Jing; Liu, Xueduan

2014-01-01

Biological nitrogen fixation is an essential function of acid mine drainage (AMD) microbial communities. However, most acidophiles in AMD environments are uncultured microorganisms and little is known about the diversity of nitrogen-fixing genes and structure of nif gene cluster in AMD microbial communities. In this study, we used metagenomic sequencing to isolate nif genes in the AMD microbial community from Dexing Copper Mine, China. Meanwhile, a metagenome microarray containing 7,776 large-insertion fosmids was constructed to screen novel nif gene clusters. Metagenomic analyses revealed that 742 sequences were identified as nif genes including structural subunit genes nifH, nifD, nifK and various additional genes. The AMD community is massively dominated by the genus Acidithiobacillus. However, the phylogenetic diversity of nitrogen-fixing microorganisms is much higher than previously thought in the AMD community. Furthermore, a 32.5-kb genomic sequence harboring nif, fix and associated genes was screened by metagenome microarray. Comparative genome analysis indicated that most nif genes in this cluster are most similar to those of Herbaspirillum seropedicae, but the organization of the nif gene cluster had significant differences from H. seropedicae. Sequence analysis and reverse transcription PCR also suggested that distinct transcription units of nif genes exist in this gene cluster. nifQ gene falls into the same transcription unit with fixABCX genes, which have not been reported in other diazotrophs before. All of these results indicated that more novel diazotrophs survive in the AMD community.

Identification of Nitrogen-Fixing Genes and Gene Clusters from Metagenomic Library of Acid Mine Drainage

Science.gov (United States)

Yin, Huaqun; Liang, Yili; Cong, Jing; Liu, Xueduan

2014-01-01

Biological nitrogen fixation is an essential function of acid mine drainage (AMD) microbial communities. However, most acidophiles in AMD environments are uncultured microorganisms and little is known about the diversity of nitrogen-fixing genes and structure of nif gene cluster in AMD microbial communities. In this study, we used metagenomic sequencing to isolate nif genes in the AMD microbial community from Dexing Copper Mine, China. Meanwhile, a metagenome microarray containing 7,776 large-insertion fosmids was constructed to screen novel nif gene clusters. Metagenomic analyses revealed that 742 sequences were identified as nif genes including structural subunit genes nifH, nifD, nifK and various additional genes. The AMD community is massively dominated by the genus Acidithiobacillus. However, the phylogenetic diversity of nitrogen-fixing microorganisms is much higher than previously thought in the AMD community. Furthermore, a 32.5-kb genomic sequence harboring nif, fix and associated genes was screened by metagenome microarray. Comparative genome analysis indicated that most nif genes in this cluster are most similar to those of Herbaspirillum seropedicae, but the organization of the nif gene cluster had significant differences from H. seropedicae. Sequence analysis and reverse transcription PCR also suggested that distinct transcription units of nif genes exist in this gene cluster. nifQ gene falls into the same transcription unit with fixABCX genes, which have not been reported in other diazotrophs before. All of these results indicated that more novel diazotrophs survive in the AMD community. PMID:24498417
Cytokine profile determined by data-mining analysis set into clusters of non-small-cell lung cancer patients according to prognosis.

Science.gov (United States)

Barrera, L; Montes-Servín, E; Barrera, A; Ramírez-Tirado, L A; Salinas-Parra, F; Bañales-Méndez, J L; Sandoval-Ríos, M; Arrieta, Ó

2015-02-01

Immunoregulatory cytokines may play a fundamental role in tumor growth and metastases. Their effects are mediated through complex regulatory networks. Human cytokine profiles could define patient subgroups and represent new potential biomarkers. The aim of this study was to associate a cytokine profile obtained through data mining with the clinical characteristics of patients with advanced non-small-cell lung cancer (NSCLC). We conducted a prospective study of the plasma levels of 14 immunoregulatory cytokines by ELISA and a cytometric bead array assay in 110 NSCLC patients before chemotherapy and 25 control subjects. Cytokine levels and data-mining profiles were associated with clinical, quality of life and pathological outcomes. NSCLC patients had higher levels of interleukin (IL)-6, IL-8, IL-12p70, IL-17a and interferon (IFN)-γ, and lower levels of IL-33 and IL-29 compared with controls. The pro-inflammatory cytokines IL-1b, IL-6 and IL-8 were associated with lower hemoglobin levels, worse functional performance status (Eastern Cooperative Oncology Group, ECOG), fatigue and hyporexia. The anti-inflammatory cytokines IL-4, IL-10 and IL-33 were associated with anorexia and lower body mass index. We identified three clusters of patients according to data-mining analysis with different overall survival (OS; 25.4, 16.8 and 5.09 months, respectively, P = 0.0012). Multivariate analysis showed that ECOG performance status and data-mining clusters were significantly associated with OS (RR 3.59, [95% CI 1.9-6.7], P < 0.001 and 2.2, [1.2-3.8], P = 0.005). Our results provide evidence that complex cytokine networks may be used to identify patient subgroups with different prognoses in advanced NSCLC. These cytokines may represent potential biomarkers, particularly in the immunotherapy era in cancer research.
Genome cluster database. A sequence family analysis platform for Arabidopsis and rice.

Science.gov (United States)

Horan, Kevin; Lauricha, Josh; Bailey-Serres, Julia; Raikhel, Natasha; Girke, Thomas

2005-05-01

The genome-wide protein sequences from Arabidopsis (Arabidopsis thaliana) and rice (Oryza sativa) spp. japonica were clustered into families using sequence similarity and domain-based clustering. The two fundamentally different methods resulted in separate cluster sets with complementary properties to compensate the limitations for accurate family analysis. Functional names for the identified families were assigned with an efficient computational approach that uses the description of the most common molecular function gene ontology node within each cluster. Subsequently, multiple alignments and phylogenetic trees were calculated for the assembled families. All clustering results and their underlying sequences were organized in the Web-accessible Genome Cluster Database (http://bioinfo.ucr.edu/projects/GCD) with rich interactive and user-friendly sequence family mining tools to facilitate the analysis of any given family of interest for the plant science community. An automated clustering pipeline ensures current information for future updates in the annotations of the two genomes and clustering improvements. The analysis allowed the first systematic identification of family and singlet proteins present in both organisms as well as those restricted to one of them. In addition, the established Web resources for mining these data provide a road map for future studies of the composition and structure of protein families between the two species.
Environmental conflict analysis using an integrated grey clustering and entropy-weight method: A case study of a mining project in Peru.

OpenAIRE

Delgado-Villanueva, Kiko Alexi; Romero Gil, Inmaculada

2016-01-01

Environmental conflict analysis (henceforth ECA) has become a key factor for the viability of projects and welfare of affected populations. In this study, we propose an approach for ECA using an integrated grey clustering and entropy-weight method (The IGCEW method). The case study considered a mining project in northern Peru. Three stakeholder groups and seven criteria were identified. The data were gathered by conducting field interviews. The results revealed that for the groups urban ...
antiSMASH 3.0—a comprehensive resource for the genome mining of biosynthetic gene clusters

DEFF Research Database (Denmark)

Weber, Tilmann; Blin, Kai; Duddela, Srikanth

2015-01-01

Microbial secondary metabolism constitutes a rich source of antibiotics, chemotherapeutics, insecticides and other high-value chemicals. Genome mining of gene clusters that encode the biosynthetic pathways for these metabolites has become a key methodology for novel compound discovery. In 2011, we introduced antiSMASH, a web server and stand-alone tool for the automatic genomic identification and analysis of biosynthetic gene clusters, available at http://antismash.secondarymetabolites.org. Here, we present version 3.0 of antiSMASH, which has undergone major improvements. A full integration of the recently published ClusterFinder algorithm now allows using this probabilistic algorithm to detect putative gene clusters of unknown types. Also, a new dereplication variant of the ClusterBlast module now identifies similarities of identified clusters to any of 1172 clusters with known end products...
An improved Pearson's correlation proximity-based hierarchical clustering for mining biological association between genes.

Science.gov (United States)

Booma, P M; Prabhakaran, S; Dhanalakshmi, R

2014-01-01

Microarray gene expression datasets have attracted great attention among molecular biologists, statisticians, and computer scientists. Data mining that extracts hidden and usual information from datasets fails to identify the most significant biological associations between genes. A heuristic search for standard biological processes measures only the gene expression level, threshold, and response time. Heuristic search identifies and mines the best biological solution, but the association process has not been addressed efficiently. To monitor higher rates of expression levels between genes, a hierarchical clustering model is proposed, in which the biological association between genes is measured simultaneously using an improved Pearson's correlation as the proximity measure (PCPHC). Additionally, the Seed Augment algorithm adopts average linkage methods on rows and columns in order to expand a seed PCPHC model into a maximal global PCPHC (GL-PCPHC) model and to identify associations between the clusters. Moreover, GL-PCPHC applies a pattern-growing method to mine the PCPHC patterns. Compared to existing gene expression analysis, the PCPHC model achieves better performance. Experimental evaluations are conducted for the GL-PCPHC model with standard benchmark gene expression datasets extracted from the UCI repository and the GenBank database in terms of execution time, size of pattern, significance level, biological association efficiency, and pattern quality.
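For orientation, the sketch below shows plain average-linkage hierarchical clustering with a correlation-based proximity (distance = 1 - Pearson r). It is not the improved PCPHC measure or the Seed Augment algorithm described in the abstract, whose details are not given here; the function names and the toy expression profiles are assumptions.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def average_linkage(profiles, n_clusters):
    """Merge the two closest clusters until n_clusters remain.

    Cluster distance is the mean pairwise (1 - Pearson r) between members.
    """
    dist = [[1.0 - pearson(a, b) for b in profiles] for a in profiles]
    clusters = [[i] for i in range(len(profiles))]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = sum(dist[a][b] for a in clusters[i] for b in clusters[j])
                d /= len(clusters[i]) * len(clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]

# Hypothetical expression profiles over four conditions.
genes = [
    [1.0, 2.0, 3.0, 4.0],   # rising
    [2.0, 4.1, 6.0, 8.2],   # rising, different scale
    [4.0, 3.0, 2.0, 1.0],   # falling
    [8.0, 6.1, 4.0, 2.2],   # falling, different scale
]
groups = average_linkage(genes, 2)
```

Because the proximity is correlation-based, genes with the same trend end up together regardless of their absolute expression scale.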
Mining Co-Location Patterns with Clustering Items from Spatial Data Sets

Science.gov (United States)

Zhou, G.; Li, Q.; Deng, G.; Yue, T.; Zhou, X.

2018-05-01

The explosive growth of spatial data and the widespread use of spatial databases emphasize the need for spatial data mining. Co-location pattern discovery is an important branch of spatial data mining. Spatial co-locations represent subsets of features that are frequently located together in geographic space. However, the appearance of a spatial feature C is often determined not by a single spatial feature A or B but by the two together: where A and B appear together, C often appears. This co-location pattern differs from the traditional co-location pattern. This paper therefore presents a new concept called clustering items, and refers to this pattern as a co-location pattern with clustering items. Because the traditional algorithm cannot mine this pattern, we introduce the related concepts in detail and propose a novel algorithm, which extends the join-based approach proposed by Huang. Finally, we evaluate the performance of this algorithm.
Process mining : overview and opportunities

NARCIS (Netherlands)

Aalst, van der W.M.P.

2012-01-01

Over the last decade, process mining emerged as a new research field that focuses on the analysis of processes using event data. Classical data mining techniques such as classification, clustering, regression, association rule learning, and sequence/episode mining do not focus on business process
Manipulating measurement scales in medical statistical analysis and data mining: A review of methodologies

Directory of Open Access Journals (Sweden)

Hamid Reza Marateb

2014-01-01

Full Text Available Background: selecting the correct statistical test and data mining method depends highly on the measurement scale of data, type of variables, and purpose of the analysis. Different measurement scales are studied in detail, and statistical comparison, modeling, and data mining methods are studied using several medical examples. We have presented two ordinal-variable clustering examples, as a more challenging variable type in analysis, using Wisconsin Breast Cancer Data (WBCD). Ordinal-to-Interval scale conversion example: a breast cancer database of nine 10-level ordinal variables for 683 patients was analyzed by two ordinal-scale clustering methods. The performance of the clustering methods was assessed by comparison with the gold standard groups of malignant and benign cases that had been identified by clinical tests. Results: the sensitivity and accuracy of the two clustering methods were 98% and 96%, respectively. Their specificity was comparable. Conclusion: by using an appropriate clustering algorithm based on the measurement scale of the variables in the study, high performance is achieved. Moreover, descriptive and inferential statistics in addition to the modeling approach must be selected based on the scale of the variables.
Manipulating measurement scales in medical statistical analysis and data mining: A review of methodologies

Science.gov (United States)

Marateb, Hamid Reza; Mansourian, Marjan; Adibi, Peyman; Farina, Dario

2014-01-01

Background: selecting the correct statistical test and data mining method depends highly on the measurement scale of data, type of variables, and purpose of the analysis. Different measurement scales are studied in detail, and statistical comparison, modeling, and data mining methods are studied using several medical examples. We have presented two ordinal-variable clustering examples, as a more challenging variable type in analysis, using Wisconsin Breast Cancer Data (WBCD). Ordinal-to-Interval scale conversion example: a breast cancer database of nine 10-level ordinal variables for 683 patients was analyzed by two ordinal-scale clustering methods. The performance of the clustering methods was assessed by comparison with the gold standard groups of malignant and benign cases that had been identified by clinical tests. Results: the sensitivity and accuracy of the two clustering methods were 98% and 96%, respectively. Their specificity was comparable. Conclusion: by using an appropriate clustering algorithm based on the measurement scale of the variables in the study, high performance is achieved. Moreover, descriptive and inferential statistics in addition to the modeling approach must be selected based on the scale of the variables. PMID:24672565
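The sensitivity, specificity and accuracy quoted above come from comparing cluster assignments with the gold-standard malignant/benign labels. A minimal sketch of that comparison follows; the function name `diagnostic_metrics` and the ten sample labels are made up for illustration and are not WBCD data.

```python
def diagnostic_metrics(truth, predicted, positive="malignant"):
    """Compare predicted group labels with gold-standard labels."""
    pairs = list(zip(truth, predicted))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return {
        "sensitivity": tp / (tp + fn),   # share of true positives found
        "specificity": tn / (tn + fp),   # share of true negatives found
        "accuracy": (tp + tn) / len(pairs),
    }

# Hypothetical labels: 4 malignant and 6 benign cases.
truth = ["malignant"] * 4 + ["benign"] * 6
predicted = ["malignant", "malignant", "malignant", "benign",
             "benign", "benign", "benign", "benign", "benign", "malignant"]
metrics = diagnostic_metrics(truth, predicted)
```

In practice each cluster first has to be mapped to a class (for example by majority vote) before these counts can be taken.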
Data mining with unsupervised clustering using photonic micro-ring resonators

Science.gov (United States)

McAulay, Alastair D.

2013-09-01

Data is commonly moved through optical fiber in modern data centers and may be stored optically. We propose an optical method of data mining for future data centers to enhance performance. For example, in clustering, a form of unsupervised learning, we propose that parameters corresponding to information in a database are converted from analog values to frequencies, as in the brain's neurons, where similar data will have close frequencies. We describe the Wilson-Cowan model for oscillating neurons. In optics we implement the frequencies with micro ring resonators. Due to the influence of weak coupling, a group of resonators will form clusters of similar frequencies that will indicate the desired parameters having close relations. Fewer clusters are formed as clustering proceeds, which allows the creation of a tree showing topics of importance and their relationships in the database. The tree can be used for instance to target advertising and for planning.
Data Mining and Knowledge Management in Higher Education -Potential Applications.

Science.gov (United States)

Luan, Jing

This paper introduces a new decision support tool, data mining, in the context of knowledge management. The most striking features of data mining techniques are clustering and prediction. The clustering aspect of data mining offers comprehensive characteristics analysis of students, while the predicting function estimates the likelihood for a…
Functional Genome Mining for Metabolites Encoded by Large Gene Clusters through Heterologous Expression of a Whole-Genome Bacterial Artificial Chromosome Library in Streptomyces spp.

Science.gov (United States)

Xu, Min; Wang, Yemin; Zhao, Zhilong; Gao, Guixi; Huang, Sheng-Xiong; Kang, Qianjin; He, Xinyi; Lin, Shuangjun; Pang, Xiuhua; Deng, Zixin

2016-01-01

ABSTRACT Genome sequencing projects in the last decade revealed numerous cryptic biosynthetic pathways for unknown secondary metabolites in microbes, revitalizing drug discovery from microbial metabolites by approaches called genome mining. In this work, we developed a heterologous expression and functional screening approach for genome mining from genomic bacterial artificial chromosome (BAC) libraries in Streptomyces spp. We demonstrate mining from a strain of Streptomyces rochei, which is known to produce streptothricins and borrelidin, by expressing its BAC library in the surrogate host Streptomyces lividans SBT5, and screening for antimicrobial activity. In addition to the successful capture of the streptothricin and borrelidin biosynthetic gene clusters, we discovered two novel linear lipopeptides and their corresponding biosynthetic gene cluster, as well as a novel cryptic gene cluster for an unknown antibiotic from S. rochei. This high-throughput functional genome mining approach can be easily applied to other streptomycetes, and it is very suitable for the large-scale screening of genomic BAC libraries for bioactive natural products and the corresponding biosynthetic pathways. IMPORTANCE Microbial genomes encode numerous cryptic biosynthetic gene clusters for unknown small metabolites with potential biological activities. Several genome mining approaches have been developed to activate and bring these cryptic metabolites to biological tests for future drug discovery. Previous sequence-guided procedures relied on bioinformatic analysis to predict potentially interesting biosynthetic gene clusters. In this study, we describe an efficient approach based on heterologous expression and functional screening of a whole-genome library for the mining of bioactive metabolites from Streptomyces. The usefulness of this function-driven approach was demonstrated by the capture of four large biosynthetic gene clusters for metabolites of various chemical types, including
Using data mining to segment healthcare markets from patients' preference perspectives.

Science.gov (United States)

Liu, Sandra S; Chen, Jie

2009-01-01

This paper aims to provide an example of how to use data mining techniques to identify patient segments regarding preferences for healthcare attributes and their demographic characteristics. Data were derived from a number of individuals who received in-patient care at a health network in 2006. Data mining and conventional hierarchical clustering with average linkage and Pearson correlation procedures are employed and compared to show how each procedure best determines segmentation variables. Data mining tools identified three differentiable segments by means of cluster analysis. These three clusters have significantly different demographic profiles. The study reveals, when compared with traditional statistical methods, that data mining provides an efficient and effective tool for market segmentation. When there are numerous cluster variables involved, researchers and practitioners need to incorporate factor analysis for reducing variables to clearly and meaningfully understand clusters. Interests and applications in data mining are increasing in many businesses. However, this technology is seldom applied to healthcare customer experience management. The paper shows that efficient and effective application of data mining methods can aid the understanding of patient healthcare preferences.
Practical graph mining with R

CERN Document Server

Hendrix, William; Jenkins, John; Padmanabhan, Kanchana; Chakraborty, Arpan

2014-01-01

Practical Graph Mining with R presents a "do-it-yourself" approach to extracting interesting patterns from graph data. It covers many basic and advanced techniques for the identification of anomalous or frequently recurring patterns in a graph, the discovery of groups or clusters of nodes that share common patterns of attributes and relationships, the extraction of patterns that distinguish one category of graphs from another, and the use of those patterns to predict the category of new graphs. Hands-On Application of Graph Data Mining Each chapter in the book focuses on a graph mining task, such as link analysis, cluster analysis, and classification. Through applications using real data sets, the book demonstrates how computational techniques can help solve real-world problems. The applications covered include network intrusion detection, tumor cell diagnostics, face recognition, predictive toxicology, mining metabolic and protein-protein interaction networks, and community detection in social networks. De...
Text-mining analysis of mHealth research

Science.gov (United States)

Zengul, Ferhat; Oner, Nurettin; Delen, Dursun

2017-01-01

In recent years, because of the advancements in communication and networking technologies, mobile technologies have been developing at an unprecedented rate. mHealth, the use of mobile technologies in medicine, and the related research has also surged parallel to these technological advancements. Although there have been several attempts to review mHealth research through manual processes such as systematic reviews, the sheer magnitude of the number of studies published in recent years makes this task very challenging. The most recent developments in machine learning and text mining offer some potential solutions to address this challenge by allowing analyses of large volumes of texts through semi-automated processes. The objective of this study is to analyze the evolution of mHealth research by utilizing text-mining and natural language processing (NLP) analyses. The study sample included abstracts of 5,644 mHealth research articles, which were gathered from five academic search engines by using search terms such as mobile health and mHealth. The analysis used the Text Explorer module of JMP Pro 13 and an iterative semi-automated process involving tokenizing, phrasing, and terming. After developing the document term matrix (DTM), analyses such as singular value decomposition (SVD), topic, and hierarchical document clustering were performed, along with the topic-informed document clustering approach. The results were presented in the form of word-clouds and trend analyses. There were several major findings regarding research clusters and trends. First, our results confirmed the time-dependent nature of terminology use in mHealth research. For example, in earlier versus recent years the use of terminology changed from “mobile phone” to “smartphone” and from “applications” to “apps”. Second, ten clusters for mHealth research were identified including (I) Clinical Research on Lifestyle Management, (II) Community Health, (III) Literature Review, (IV) Medical
Text-mining analysis of mHealth research.

Science.gov (United States)

Ozaydin, Bunyamin; Zengul, Ferhat; Oner, Nurettin; Delen, Dursun

2017-01-01

In recent years, because of the advancements in communication and networking technologies, mobile technologies have been developing at an unprecedented rate. mHealth, the use of mobile technologies in medicine, and the related research has also surged parallel to these technological advancements. Although there have been several attempts to review mHealth research through manual processes such as systematic reviews, the sheer magnitude of the number of studies published in recent years makes this task very challenging. The most recent developments in machine learning and text mining offer some potential solutions to address this challenge by allowing analyses of large volumes of texts through semi-automated processes. The objective of this study is to analyze the evolution of mHealth research by utilizing text-mining and natural language processing (NLP) analyses. The study sample included abstracts of 5,644 mHealth research articles, which were gathered from five academic search engines by using search terms such as mobile health and mHealth. The analysis used the Text Explorer module of JMP Pro 13 and an iterative semi-automated process involving tokenizing, phrasing, and terming. After developing the document term matrix (DTM), analyses such as singular value decomposition (SVD), topic, and hierarchical document clustering were performed, along with the topic-informed document clustering approach. The results were presented in the form of word-clouds and trend analyses. There were several major findings regarding research clusters and trends. First, our results confirmed the time-dependent nature of terminology use in mHealth research. For example, in earlier versus recent years the use of terminology changed from "mobile phone" to "smartphone" and from "applications" to "apps". Second, ten clusters for mHealth research were identified including (I) Clinical Research on Lifestyle Management, (II) Community Health, (III) Literature Review, (IV) Medical Interventions
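The study built its document term matrix inside JMP Pro's Text Explorer. Purely to illustrate what a DTM is, here is a small standard-library sketch that tokenizes a few invented abstracts and counts terms per document; the tokenizer rule and the sample texts are assumptions, and the later SVD and clustering steps are not shown.

```python
import re
from collections import Counter

def document_term_matrix(documents):
    """Return (vocabulary, rows) where rows[i][j] counts term j in document i."""
    # Lower-case and keep alphabetic runs only (a deliberately crude tokenizer).
    tokenized = [re.findall(r"[a-z]+", doc.lower()) for doc in documents]
    vocabulary = sorted({tok for toks in tokenized for tok in toks})
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        rows.append([counts[term] for term in vocabulary])
    return vocabulary, rows

# Invented mini-corpus, not taken from the 5,644 abstracts.
abstracts = [
    "Mobile phone reminders improve medication adherence",
    "Smartphone apps support medication adherence",
    "Community health workers use mobile phone surveys",
]
vocab, dtm = document_term_matrix(abstracts)
```

Each row of `dtm` is one document and each column one term, which is the matrix that decomposition and clustering methods then operate on.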
ArrayMining: a modular web-application for microarray analysis combining ensemble and consensus methods with cross-study normalization

Directory of Open Access Journals (Sweden)

Krasnogor Natalio

2009-10-01

Full Text Available Abstract Background Statistical analysis of DNA microarray data provides a valuable diagnostic tool for the investigation of genetic components of diseases. To take advantage of the multitude of available data sets and analysis methods, it is desirable to combine both different algorithms and data from different studies. Applying ensemble learning, consensus clustering and cross-study normalization methods for this purpose in an almost fully automated process and linking different analysis modules together under a single interface would simplify many microarray analysis tasks. Results We present ArrayMining.net, a web-application for microarray analysis that provides easy access to a wide choice of feature selection, clustering, prediction, gene set analysis and cross-study normalization methods. In contrast to other microarray-related web-tools, multiple algorithms and data sets for an analysis task can be combined using ensemble feature selection, ensemble prediction, consensus clustering and cross-platform data integration. By interlinking different analysis tools in a modular fashion, new exploratory routes become available, e.g. ensemble sample classification using features obtained from a gene set analysis and data from multiple studies. The analysis is further simplified by automatic parameter selection mechanisms and linkage to web tools and databases for functional annotation and literature mining. Conclusion ArrayMining.net is a free web-application for microarray analysis combining a broad choice of algorithms based on ensemble and consensus methods, using automatic parameter selection and integration with annotation databases.
Data mining for clustering naming of the village at Java Island

Science.gov (United States)

Setiawan Abdullah, Atje; Nurani Ruchjana, Budi; Hidayat, Akik; Akmal; Setiana, Deni

2017-10-01

Query-based data mining is used to cluster village names on the island of Java and identify their meaning. The village database is explored in three categories: the prefix of the village name, the syllables contained in the village name, and the full word actually used as the village name. The syllables contained in village names are classified according to the culture and character of each province, describing business, feelings, circumstances, places, nature, respect, plants, fruits, and animals. The data used for clustering were obtained from the Geospatial Information Agency (BIG) as complete village-name data with coordinates for six provinces in Java, arranged in a hierarchy of provinces, districts/cities, districts and villages. The research method uses KDD (Knowledge Discovery in Database), which proceeds through preprocessing, data mining and postprocessing to obtain knowledge. In this study, the data mining application that facilitates query-based search on village names was built using Java software, while the map contours were processed using ArcGIS software. The results can give recommendations to stakeholders such as the Department of Tourism for describing the meaning of the village-name classification according to the character of each province on Java island.
A Dimensionality Reduction-Based Multi-Step Clustering Method for Robust Vessel Trajectory Analysis

Directory of Open Access Journals (Sweden)

Huanhuan Li

2017-08-01

Full Text Available The Shipboard Automatic Identification System (AIS) is crucial for navigation safety and maritime surveillance, and data mining and pattern analysis of AIS information have attracted considerable attention in terms of both basic research and practical applications. Clustering of spatio-temporal AIS trajectories can be used to identify abnormal patterns and mine customary route data for transportation safety. Thus, the capacities of navigation safety and maritime traffic monitoring could be enhanced correspondingly. However, trajectory clustering is often sensitive to undesirable outliers and is essentially more complex compared with traditional point clustering. To overcome this limitation, a multi-step trajectory clustering method is proposed in this paper for robust AIS trajectory clustering. In particular, Dynamic Time Warping (DTW), a similarity measurement method, is introduced in the first step to measure the distances between different trajectories. The calculated distances, inversely proportional to the similarities, constitute a distance matrix in the second step. Furthermore, as a widely-used dimensional reduction method, Principal Component Analysis (PCA) is exploited to decompose the obtained distance matrix. In particular, the top k principal components with above 95% accumulative contribution rate are extracted by PCA, and the number of the centers k is chosen. The k centers are found by the improved automatic center selection algorithm. In the last step, the improved center clustering algorithm with k clusters is implemented on the distance matrix to achieve the final AIS trajectory clustering results. In order to improve the accuracy of the proposed multi-step clustering algorithm, an automatic algorithm for choosing the k clusters is developed according to the similarity distance. Numerous experiments on realistic AIS trajectory datasets in the bridge area waterway and Mississippi River have been implemented to compare our

A Dimensionality Reduction-Based Multi-Step Clustering Method for Robust Vessel Trajectory Analysis.

Science.gov (United States)

Li, Huanhuan; Liu, Jingxian; Liu, Ryan Wen; Xiong, Naixue; Wu, Kefeng; Kim, Tai-Hoon

2017-08-04

The Shipboard Automatic Identification System (AIS) is crucial for navigation safety and maritime surveillance, and data mining and pattern analysis of AIS information have attracted considerable attention in terms of both basic research and practical applications. Clustering of spatio-temporal AIS trajectories can be used to identify abnormal patterns and mine customary route data for transportation safety. Thus, the capacities of navigation safety and maritime traffic monitoring could be enhanced correspondingly. However, trajectory clustering is often sensitive to undesirable outliers and is essentially more complex compared with traditional point clustering. To overcome this limitation, a multi-step trajectory clustering method is proposed in this paper for robust AIS trajectory clustering. In particular, Dynamic Time Warping (DTW), a similarity measurement method, is introduced in the first step to measure the distances between different trajectories. The calculated distances, inversely proportional to the similarities, constitute a distance matrix in the second step. Furthermore, as a widely-used dimensional reduction method, Principal Component Analysis (PCA) is exploited to decompose the obtained distance matrix. In particular, the top k principal components with above 95% accumulative contribution rate are extracted by PCA, and the number of the centers k is chosen. The k centers are found by the improved automatic center selection algorithm. In the last step, the improved center clustering algorithm with k clusters is implemented on the distance matrix to achieve the final AIS trajectory clustering results. In order to improve the accuracy of the proposed multi-step clustering algorithm, an automatic algorithm for choosing the k clusters is developed according to the similarity distance. Numerous experiments on realistic AIS trajectory datasets in the bridge area waterway and Mississippi River have been implemented to compare our proposed method with
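The first two steps above (pairwise DTW distances, then choosing k from a 95% cumulative contribution rate) can be sketched as follows. This is the classic dynamic-programming DTW and a simple threshold rule, not the authors' improved center selection or clustering algorithms. The routes and eigenvalues are invented, and the eigen-decomposition itself is assumed to be done elsewhere.

```python
import math

def dtw_distance(a, b):
    """Classic DTW between two sequences of (x, y) points."""
    inf = float("inf")
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            step = math.dist(a[i - 1], b[j - 1])   # Euclidean point distance
            cost[i][j] = step + min(cost[i - 1][j],      # repeat point of b
                                    cost[i][j - 1],      # repeat point of a
                                    cost[i - 1][j - 1])  # advance both
    return cost[len(a)][len(b)]

def components_needed(eigenvalues, threshold=0.95):
    """Smallest k whose top-k eigenvalues reach the cumulative share threshold."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    running = 0.0
    for k, value in enumerate(ordered, start=1):
        running += value
        if running / total >= threshold:
            return k
    return len(ordered)

# Invented routes: route_b repeats one point of route_a, route_c is offset by 1.
route_a = [(0, 0), (1, 0), (2, 0), (3, 0)]
route_b = [(0, 0), (1, 0), (1, 0), (2, 0), (3, 0)]
route_c = [(0, 1), (1, 1), (2, 1), (3, 1)]
routes = [route_a, route_b, route_c]
distance_matrix = [[dtw_distance(p, q) for q in routes] for p in routes]

# Invented eigenvalues standing in for a PCA of the distance matrix.
k = components_needed([6.0, 3.0, 0.6, 0.4])
```

DTW tolerates different sampling rates along the same path, which is why `route_a` and `route_b` are at distance zero even though their lengths differ.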
Depth data research of GIS based on clustering analysis algorithm

Science.gov (United States)

Xiong, Yan; Xu, Wenli

2018-03-01

The data of GIS have spatial distribution. Geographic data has both spatial characteristics and attribute characteristics, and also changes with time. Therefore, the amount of data is very large. Nowadays, many industries and departments in the society are using GIS. However, without proper data analysis and mining scheme, GIS will not exert its maximum effectiveness and will waste a lot of data. In this paper, we use the geographic information demand of a national security department as the experimental object, combining the characteristics of GIS data, taking into account the characteristics of time, space, attributes and so on, and using cluster analysis algorithm. We further study the mining scheme for depth data, and get the algorithm model. This algorithm can automatically classify sample data, and then carry out exploratory analysis. The research shows that the algorithm model and the information mining scheme can quickly find hidden depth information from the surface data of GIS, thus improving the efficiency of the security department. This algorithm can also be extended to other fields.
Multiscale visual quality assessment for cluster analysis with self-organizing maps

Science.gov (United States)

Bernard, Jürgen; von Landesberger, Tatiana; Bremm, Sebastian; Schreck, Tobias

2011-01-01

Cluster analysis is an important data mining technique for analyzing large amounts of data, reducing many objects to a limited number of clusters. Cluster visualization techniques aim at supporting the user in better understanding the characteristics and relationships among the found clusters. While promising approaches to visual cluster analysis already exist, these usually fall short of incorporating the quality of the obtained clustering results. However, due to the nature of the clustering process, quality plays an important aspect, as for most practical data sets, typically many different clusterings are possible. Being aware of clustering quality is important to judge the expressiveness of a given cluster visualization, or to adjust the clustering process with refined parameters, among others. In this work, we present an encompassing suite of visual tools for quality assessment of an important visual cluster algorithm, namely, the Self-Organizing Map (SOM) technique. We define, measure, and visualize the notion of SOM cluster quality along a hierarchy of cluster abstractions. The quality abstractions range from simple scalar-valued quality scores up to the structural comparison of a given SOM clustering with output of additional supportive clustering methods. The suite of methods allows the user to assess the SOM quality on the appropriate abstraction level, and arrive at improved clustering results. We implement our tools in an integrated system, apply it on experimental data sets, and show its applicability.
antiSMASH 3.0-a comprehensive resource for the genome mining of biosynthetic gene clusters.

Science.gov (United States)

Weber, Tilmann; Blin, Kai; Duddela, Srikanth; Krug, Daniel; Kim, Hyun Uk; Bruccoleri, Robert; Lee, Sang Yup; Fischbach, Michael A; Müller, Rolf; Wohlleben, Wolfgang; Breitling, Rainer; Takano, Eriko; Medema, Marnix H

2015-07-01

Microbial secondary metabolism constitutes a rich source of antibiotics, chemotherapeutics, insecticides and other high-value chemicals. Genome mining of gene clusters that encode the biosynthetic pathways for these metabolites has become a key methodology for novel compound discovery. In 2011, we introduced antiSMASH, a web server and stand-alone tool for the automatic genomic identification and analysis of biosynthetic gene clusters, available at http://antismash.secondarymetabolites.org. Here, we present version 3.0 of antiSMASH, which has undergone major improvements. A full integration of the recently published ClusterFinder algorithm now allows using this probabilistic algorithm to detect putative gene clusters of unknown types. Also, a new dereplication variant of the ClusterBlast module now identifies similarities of identified clusters to any of 1172 clusters with known end products. At the enzyme level, active sites of key biosynthetic enzymes are now pinpointed through a curated pattern-matching procedure and Enzyme Commission numbers are assigned to functionally classify all enzyme-coding genes. Additionally, chemical structure prediction has been improved by incorporating polyketide reduction states. Finally, in order for users to be able to organize and analyze multiple antiSMASH outputs in a private setting, a new XML output module allows offline editing of antiSMASH annotations within the Geneious software. © The Author(s) 2015. Published by Oxford University Press on behalf of Nucleic Acids Research.
Clustering Educational Digital Library Usage Data: A Comparison of Latent Class Analysis and K-Means Algorithms

Science.gov (United States)

Xu, Beijie; Recker, Mimi; Qi, Xiaojun; Flann, Nicholas; Ye, Lei

2013-01-01

This article examines clustering as an educational data mining method. In particular, two clustering algorithms, the widely used K-means and the model-based Latent Class Analysis, are compared, using usage data from an educational digital library service, the Instructional Architect (IA.usu.edu). Using a multi-faceted approach and multiple data…
Herd Clustering: A synergistic data clustering approach using collective intelligence

KAUST Repository

Wong, Kachun

2014-10-01

Traditional data mining methods emphasize analytical abilities to decipher data, assuming that data are static during a mining process. We challenge this assumption, arguing that we can improve the analysis by vitalizing data. In this paper, this principle is used to develop a new clustering algorithm. Inspired by herd behavior, the clustering method is a synergistic approach using collective intelligence called Herd Clustering (HC). The novelty lies in its first stage, where data instances are represented by moving particles. Particles attract each other locally and form clusters by themselves, as shown in the case studies reported. To demonstrate its effectiveness, the performance of HC is compared to other state-of-the-art clustering methods on more than thirty datasets using four performance metrics. An application for DNA motif discovery is also conducted. The results support the effectiveness of HC and thus the underlying philosophy. © 2014 Elsevier B.V.
TIME SERIES ANALYSIS ON STOCK MARKET FOR TEXT MINING CORRELATION OF ECONOMY NEWS

Directory of Open Access Journals (Sweden)

Sadi Evren SEKER

2014-01-01

Full Text Available This paper proposes an information retrieval method for economy news. The effect of economy news is researched at the word level, and stock market values are considered as the ground proof. The correlation between stock market prices and economy news is an already addressed problem for most countries. The most well-known approach is applying text mining approaches to the news and some time series analysis techniques over stock market closing values in order to apply classification or clustering algorithms over the features extracted. This study goes further and asks which time series analysis techniques are available for stock market closing values and which one is the most suitable. In this study, the news and their dates are collected into a database and text mining is applied over the news; the text mining part has been kept simple, with only the term frequency – inverse document frequency method. For the time series analysis part, we have studied 10 different methods such as random walk, moving average, acceleration, Bollinger band, price rate of change, periodic average, difference, momentum or relative strength index and their variation. In this study we have also explained these techniques in a comparative way and we have applied the methods over Turkish Stock Market closing values for more than a 2 year period. On the other hand, we have applied the term frequency – inverse document frequency method on the economy news of one of the high-circulating newspapers in Turkey.
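Three of the time-series techniques named above have simple closed forms. The sketch below gives the usual definitions of moving average, momentum and price rate of change. The paper's exact variants are not given in the abstract, so the window and lag conventions here are assumptions, and the closing prices are invented.

```python
def moving_average(prices, window):
    """Simple moving average over a trailing window."""
    return [sum(prices[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(prices))]

def momentum(prices, lag):
    """Price difference against the value `lag` steps earlier."""
    return [prices[i] - prices[i - lag] for i in range(lag, len(prices))]

def rate_of_change(prices, lag):
    """Percentage change against the value `lag` steps earlier."""
    return [(prices[i] - prices[i - lag]) / prices[i - lag] * 100.0
            for i in range(lag, len(prices))]

# Invented closing values, not Turkish stock market data.
closes = [100.0, 102.0, 101.0, 105.0, 110.0]
sma3 = moving_average(closes, 3)
mom1 = momentum(closes, 1)
roc4 = rate_of_change(closes, 4)
```

Each function returns a shorter series than its input because the first `window - 1` or `lag` values have no complete history.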
Grey Wolf Optimizer Based on Powell Local Optimization Method for Clustering Analysis

Directory of Open Access Journals (Sweden)

Sen Zhang

2015-01-01

Full Text Available One heuristic evolutionary algorithm recently proposed is the grey wolf optimizer (GWO), inspired by the leadership hierarchy and hunting mechanism of grey wolves in nature. This paper presents an extended GWO algorithm based on Powell local optimization method, and we call it PGWO. PGWO algorithm significantly improves the original GWO in solving complex optimization problems. Clustering is a popular data analysis and data mining technique. Hence, the PGWO could be applied in solving clustering problems. In this study, first the PGWO algorithm is tested on seven benchmark functions. Second, the PGWO algorithm is used for data clustering on nine data sets. Compared to other state-of-the-art evolutionary algorithms, the results of benchmark and data clustering demonstrate the superior performance of PGWO algorithm.
The effect of mining data k-means clustering toward students profile model drop out potential

Science.gov (United States)

Purba, Windania; Tamba, Saut; Saragih, Jepronel

2018-04-01

High student success and low student failure can reflect the quality of a college. One factor in student failure is dropping out. To address this problem, data mining with K-means clustering was applied. The K-means clustering method was used to cluster students who are at risk of dropping out. First, the result data were clustered to obtain information on the condition of all students. The resulting model showed that students are at risk of dropping out because of a lack of interest in learning, unsupportive parents, diffidence, and a lack of student behavior time. The K-means clustering results showed that the students most likely to drop out were in Cluster 1, which had the lowest Credit Total System, Quality Total, and Grade Point Average (GPA) compared with Clusters 2 and 3.
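The K-means procedure mentioned in this record can be sketched as plain Lloyd iterations. The feature pairs (GPA, credits earned), the student values, and the choice to seed the centroids with the first k points are all assumptions made for this illustration; the abstract does not describe the authors' data layout or initialization.

```python
def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm; the first k points seed the centroids."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [
            min(range(k),
                key=lambda c: sum((x - m) ** 2
                                  for x, m in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members)
                                for col in zip(*members)]
    return labels, centroids


# Hypothetical (GPA, credits earned) records, invented for this sketch.
students = [(3.6, 120), (1.9, 40), (3.4, 110),
            (2.0, 45), (3.5, 115), (1.8, 38)]
labels, centroids = kmeans(students, k=2)
print(labels)
print(centroids)
```

In practice the features would normally be rescaled first, since the credit count otherwise dominates the distance; the cluster with the lowest centroid values would then be the one to inspect for drop-out risk.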
Text Clustering Algorithm Based on Random Cluster Core

Directory of Open Access Journals (Sweden)

Huang Long-Jun

2016-01-01

Full Text Available Nowadays clustering has become a popular text mining technique, but very large data volumes place higher demands on the accuracy and performance of text mining. In view of the performance bottleneck of traditional text clustering algorithms, this paper proposes a text clustering algorithm with random features. It is a clustering algorithm based on text density; using neighboring heuristic rules, it introduces the concept of a random cluster, which effectively reduces the complexity of the distance calculation.
Off-road truck-related accidents in U.S. mines.

Science.gov (United States)

Dindarloo, Saeid R; Pollard, Jonisha P; Siami-Irdemoosa, Elnaz

2016-09-01

Off-road trucks are one of the major sources of equipment-related accidents in the U.S. mining industries. A systematic analysis of all off-road truck-related accidents, injuries, and illnesses, which are reported and published by the Mine Safety and Health Administration (MSHA), is expected to provide practical insights for identifying the accident patterns and trends in the available raw database. Therefore, appropriate safety management measures can be administered and implemented based on these accident patterns/trends. A hybrid clustering-classification methodology using K-means clustering and gene expression programming (GEP) is proposed for the analysis of severe and non-severe off-road truck-related injuries at U.S. mines. Using the GEP sub-model, a small subset of the 36 recorded attributes was found to be correlated to the severity level. Given the set of specified attributes, the clustering sub-model was able to cluster the accident records into 5 distinct groups. For instance, the first cluster contained accidents related to minerals processing mills and coal preparation plants (91%). More than two-thirds of the victims in this cluster had less than 5 years of job experience. This cluster was associated with the highest percentage of severe injuries (22 severe accidents, 3.4%). Almost 50% of all accidents in this cluster occurred at stone operations. Similarly, the other four clusters were characterized to highlight important patterns that can be used to determine areas of focus for safety initiatives. The identified clusters of accidents may play a vital role in the prevention of severe injuries in mining. Further research into the cluster attributes and identified patterns will be necessary to determine how these factors can be mitigated to reduce the risk of severe injuries. Analyzing injury data using data mining techniques provides some insight into attributes that are associated with high accuracies for predicting injury severity. Copyright © 2016
Topic modeling for cluster analysis of large biological and medical datasets.

Science.gov (United States)

Zhao, Weizhong; Zou, Wen; Chen, James J

2014-01-01

The big data moniker is nowhere better deserved than to describe the ever-increasing prodigiousness and complexity of biological and medical datasets. New methods are needed to generate and test hypotheses, foster biological interpretation, and build validated predictors. Although multivariate techniques such as cluster analysis may allow researchers to identify groups, or clusters, of related variables, the accuracies and effectiveness of traditional clustering methods diminish for large and hyper dimensional datasets. Topic modeling is an active research field in machine learning and has been mainly used as an analytical tool to structure large textual corpora for data mining. Its ability to reduce high dimensionality to a small number of latent variables makes it suitable as a means for clustering or overcoming clustering difficulties in large biological and medical datasets. In this study, three topic model-derived clustering methods, highest probable topic assignment, feature selection and feature extraction, are proposed and tested on the cluster analysis of three large datasets: Salmonella pulsed-field gel electrophoresis (PFGE) dataset, lung cancer dataset, and breast cancer dataset, which represent various types of large biological or medical datasets. All three various methods are shown to improve the efficacy/effectiveness of clustering results on the three datasets in comparison to traditional methods. A preferable cluster analysis method emerged for each of the three datasets on the basis of replicating known biological truths. Topic modeling could be advantageously applied to the large datasets of biological or medical research. The three proposed topic model-derived clustering methods, highest probable topic assignment, feature selection and feature extraction, yield clustering improvements for the three different data types. Clusters more efficaciously represent truthful groupings and subgroupings in the data than traditional methods, suggesting
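One plausible reading of "highest probable topic assignment" is that each sample is placed in the cluster of its most probable topic. The sketch below shows only that reading; the matrix `theta` is invented, and in a real analysis it would come from a fitted topic model, which the abstract does not specify.

```python
def assign_by_top_topic(doc_topic):
    """Cluster label for each sample = index of its most probable topic."""
    return [max(range(len(row)), key=row.__getitem__) for row in doc_topic]


# Hypothetical sample-by-topic probabilities (each row sums to 1).
theta = [
    [0.7, 0.2, 0.1],
    [0.1, 0.1, 0.8],
    [0.3, 0.6, 0.1],
]
print(assign_by_top_topic(theta))
```

The number of topics then fixes the number of clusters, which is one reason the other two approaches named in the abstract (feature selection and feature extraction) would instead pass topic-derived features to a conventional clustering method.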
Sentiment Analysis and Opinion Mining

CERN Document Server

Liu, Bing

2012-01-01

Sentiment analysis and opinion mining is the field of study that analyzes people's opinions, sentiments, evaluations, attitudes, and emotions from written language. It is one of the most active research areas in natural language processing and is also widely studied in data mining, Web mining, and text mining. In fact, this research has spread outside of computer science to the management sciences and social sciences due to its importance to business and society as a whole. The growing importance of sentiment analysis coincides with the growth of social media such as reviews, forum discussions
Hydrochemical characteristics of mine waters from abandoned mining sites in Serbia and their impact on surface water quality.

Science.gov (United States)

Atanacković, Nebojša; Dragišić, Veselin; Stojković, Jana; Papić, Petar; Zivanović, Vladimir

2013-11-01

Upon completion of exploration and extraction of mineral resources, many mining sites have been abandoned without previously putting environmental protection measures in place. As a consequence, mine waters originating from such sites are discharged freely into surface water. Regional scale analyses were conducted to determine the hydrochemical characteristics of mine waters from abandoned sites featuring metal (Cu, Pb-Zn, Au, Fe, Sb, Mo, Bi, Hg) deposits, non-metallic minerals (coal, Mg, F, B) and uranium. The study included 80 mine water samples from 59 abandoned mining sites. Their cation composition was dominated by Ca2+, while the most common anions were found to be SO4(2-) and HCO3-. Strong correlations were established between the pH level and metal (Fe, Mn, Zn, Cu) concentrations in the mine waters. Hierarchical cluster analysis was applied to parameters generally indicative of pollution, such as pH, TDS, SO4(2-), Fe total, and As total. Following this approach, mine water samples were grouped into three main clusters and six subclusters, depending on their potential environmental impact. Principal component analysis was used to group together variables that share the same variance. The extracted principal components indicated that sulfide oxidation and weathering of silicate and carbonate rocks were the primary processes, while pH buffering, adsorption and ion exchange were secondary drivers of the chemical composition of the analyzed mine waters. Surface waters, which received the mine waters, were examined. Analysis showed increases of sulfate and metal concentrations and general degradation of surface water quality.
Cluster analysis as a prediction tool for pregnancy outcomes.

Science.gov (United States)

Banjari, Ines; Kenjerić, Daniela; Šolić, Krešimir; Mandić, Milena L

2015-03-01

Considering the specific physiological changes during gestation and thinking of pregnancy as a "critical window", classification of pregnant women in early pregnancy can be considered crucial. The paper demonstrates the use of a method based on an approach from intelligent data mining, cluster analysis. Cluster analysis is a statistical method that makes it possible to group individuals based on sets of identifying variables. The method was chosen in order to determine the possibility of classifying pregnant women in early pregnancy and to analyze unknown correlations between different variables so that certain outcomes could be predicted. 222 pregnant women from two general obstetric offices were recruited. The main focus was on the characteristics of these pregnant women: their age, pre-pregnancy body mass index (BMI) and haemoglobin value. Cluster analysis achieved a 94.1% classification accuracy rate with three branches or groups of pregnant women showing statistically significant correlations with pregnancy outcomes. The results show that pregnant women of both older age and higher pre-pregnancy BMI have a significantly higher incidence of delivering a baby of higher birth weight, but they gain significantly less weight during pregnancy. Their babies are also longer, and these women have a significantly higher probability of complications during pregnancy (gestosis) and a higher probability of induced or caesarean delivery. We can conclude that the cluster analysis method can appropriately classify pregnant women in early pregnancy to predict certain outcomes.
Identification of mine waters by statistical multivariate methods

Energy Technology Data Exchange (ETDEWEB)

Mali, N [IGGG, Ljubljana (Slovenia)

1992-01-01

Three water-bearing aquifers are present in the Velenje lignite mine. The aquifer waters have differing chemical composition; a geochemical water analysis can therefore determine the source of mine water influx. Mine water samples from different locations in the mine were analyzed, and the results for chemical content and electric conductivity of the mine water were statistically processed by means of the MICROGAS, SPSS-X and IN STATPAC computer programs, which apply three multivariate statistical methods (discriminant, cluster and factor analysis). Reliability of calculated values was determined with the Kolmogorov and Smirnov tests. It is concluded that laboratory analysis of single water samples can produce measurement errors, but statistical processing of water sample data can identify origin and movement of mine water. 15 refs.
CLUSTERING ANALYSIS OF OFFICER'S BEHAVIOURS IN LONDON POLICE FOOT PATROL ACTIVITIES

Directory of Open Access Journals (Sweden)

J. Shen

2015-07-01

Full Text Available In this short paper we present a framework for conceptual representation and clustering analysis of police officers’ patrol patterns obtained from mining their raw movement trajectory data. This has been achieved by a model developed to account for the spatio-temporal dynamics of human movements by incorporating both the behaviour features of the travellers and the semantic meaning of the environment they are moving in. Hence, the similarity metric of traveller behaviours is jointly defined according to the stay time allocation in each spatio-temporal region of interest (ST-ROI) to support clustering analysis of patrol behaviours. The proposed framework enables the analysis of behaviour and preferences at a higher level based on raw movement trajectories. The model is first applied to police patrol data provided by the Metropolitan Police and will be tested on other types of datasets afterwards.
Research of the Space Clustering Method for the Airport Noise Data Minings

Directory of Open Access Journals (Sweden)

Jiwen Xie

2014-03-01

Full Text Available Mining the distribution pattern and evolution of airport noise from the airport noise data and the geographic information of the monitoring points is of great significance for the scientific and rational governance of the airport noise pollution problem. However, most traditional clustering methods are based on either the closeness of spatial location or the similarity of non-spatial features, which splits the duality of spatial elements, so that the clustering result has difficulty satisfying both the closeness of spatial location and the similarity of non-spatial features. This paper therefore proposes a spatial clustering algorithm based on dual distance. The algorithm uses a distance function, in which spatial and non-spatial features are combined, as the similarity measure. The experimental results show that the proposed algorithm can effectively discover the noise distribution pattern around the airport.
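A distance that combines spatial and non-spatial features could, for example, be a weighted sum of the two normalized parts. The abstract does not give the paper's actual function, so the weighted-sum form, the weight `w_spatial`, the scale parameters and the monitoring-point values below are all assumptions for illustration.

```python
import math


def dual_distance(a, b, w_spatial=0.5, spatial_scale=1.0, attr_scale=1.0):
    """Weighted sum of normalized spatial and non-spatial distances.

    a, b: (x, y, noise_level) tuples for two monitoring points.
    The weighted-sum form, the weight and the scales are illustrative
    assumptions, not the function used in the paper.
    """
    d_space = math.hypot(a[0] - b[0], a[1] - b[1]) / spatial_scale
    d_attr = abs(a[2] - b[2]) / attr_scale
    return w_spatial * d_space + (1.0 - w_spatial) * d_attr


# Two hypothetical monitoring points: (x in km, y in km, noise in dB).
p = (0.0, 0.0, 60.0)
q = (3.0, 4.0, 70.0)
print(dual_distance(p, q, w_spatial=0.5, spatial_scale=5.0, attr_scale=10.0))
```

Any clustering method that accepts a custom pairwise distance can use such a function; the weight controls how strongly geographic closeness counts relative to similarity in measured noise.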
Cluster analysis for applications

CERN Document Server

Anderberg, Michael R

1973-01-01

Cluster Analysis for Applications deals with methods and various applications of cluster analysis. Topics covered range from variables and scales to measures of association among variables and among data units. Conceptual problems in cluster analysis are discussed, along with hierarchical and non-hierarchical clustering methods. The necessary elements of data analysis, statistics, cluster analysis, and computer implementation are integrated vertically to cover the complete path from raw data to a finished analysis. Comprised of 10 chapters, this book begins with an introduction to the subject o
K-Line Patterns’ Predictive Power Analysis Using the Methods of Similarity Match and Clustering

Directory of Open Access Journals (Sweden)

Lv Tao

2017-01-01

Full Text Available Stock price prediction based on K-line patterns is the essence of candlestick technical analysis. However, there are some disputes on whether the K-line patterns have predictive power in academia. To help resolve the debate, this paper uses the data mining methods of pattern recognition, pattern clustering, and pattern knowledge mining to research the predictive power of K-line patterns. The similarity match model and nearest neighbor-clustering algorithm are proposed for solving the problem of similarity match and clustering of K-line series, respectively. The experiment includes testing the predictive power of the Three Inside Up pattern and Three Inside Down pattern with the testing dataset of the K-line series data of Shanghai 180 index component stocks over the latest 10 years. Experimental results show that (1) the predictive power of a pattern varies a great deal for different shapes and (2) each of the existing K-line patterns requires further classification based on the shape feature for improving the prediction performance.

Applications of Data Mining in Higher Education

OpenAIRE

Monika Goyal; Rajan Vohra

2012-01-01

Data analysis plays an important role in decision support irrespective of the type of industry, such as manufacturing units or education systems. There are many domains in which data mining techniques play an important role. This paper proposes the use of data mining techniques to improve the efficiency of higher education institutions. If data mining techniques such as clustering, decision tree and association are applied to higher education processes, it would help to improve students performa...
Social big data mining

CERN Document Server

Ishikawa, Hiroshi

2015-01-01

Social Media. Big Data and Social Data. Hypotheses in the Era of Big Data. Social Big Data Applications. Basic Concepts in Data Mining. Association Rule Mining. Clustering. Classification. Prediction. Web Structure Mining. Web Content Mining. Web Access Log Mining, Information Extraction and Deep Web Mining. Media Mining. Scalability and Outlier Detection.
Web Mining of Hotel Customer Survey Data

Directory of Open Access Journals (Sweden)

Richard S. Segall

2008-12-01

Full Text Available This paper provides an extensive literature review and list of references on the background of web mining as applied specifically to hotel customer survey data. This research applies the techniques of web mining to actual text of written comments for hotel customers using Megaputer PolyAnalyst®. Web mining functionalities utilized include those such as clustering, link analysis, key word and phrase extraction, taxonomy, and dimension matrices. This paper provides screen shots of the web mining applications using Megaputer PolyAnalyst®. Conclusions and future directions of the research are presented.
Data mining in radiology

International Nuclear Information System (INIS)

Kharat, Amit T; Singh, Amarjit; Kulkarni, Vilas M; Shah, Digish

2014-01-01

Data mining facilitates the study of radiology data in various dimensions. It converts large patient image and text datasets into useful information that helps in improving patient care and provides informative reports. Data mining technology analyzes data within the Radiology Information System and Hospital Information System using specialized software which assesses relationships and agreement in available information. By using similar data analysis tools, radiologists can make informed decisions and predict the future outcome of a particular imaging finding. Data, information and knowledge are the components of data mining. Classes, Clusters, Associations, Sequential patterns, Classification, Prediction and Decision tree are the various types of data mining. Data mining has the potential to make delivery of health care affordable and ensure that the best imaging practices are followed. It is a tool for academic research. Data mining is considered to be ethically neutral; however, concerns regarding privacy and legality exist which need to be addressed to ensure success of data mining
Clustering analysis

International Nuclear Information System (INIS)

Romli

1997-01-01

Cluster analysis is the name of a group of multivariate techniques whose principal purpose is to distinguish similar entities from the characteristics they possess. Several algorithms can be used for this analysis. This topic therefore focuses on discussing the algorithms, such as similarity measures and hierarchical clustering, which includes the single linkage, complete linkage and average linkage methods. The non-hierarchical clustering method popularly known as the K-means method will also be discussed. Finally, this paper describes the advantages and disadvantages of each method.
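The three linkage rules named in this record differ only in how the distance between two clusters is derived from the pairwise point distances. A minimal sketch follows, assuming Euclidean distance and made-up points (the function names are chosen here for illustration):

```python
import math
from itertools import product


def pairwise(c1, c2):
    """All Euclidean distances between a point of c1 and a point of c2."""
    return [math.dist(p, q) for p, q in product(c1, c2)]


def single_linkage(c1, c2):
    """Distance between the two closest members."""
    return min(pairwise(c1, c2))


def complete_linkage(c1, c2):
    """Distance between the two farthest members."""
    return max(pairwise(c1, c2))


def average_linkage(c1, c2):
    """Mean of all between-cluster pairwise distances."""
    d = pairwise(c1, c2)
    return sum(d) / len(d)


left = [(0.0, 0.0), (1.0, 0.0)]   # made-up cluster
right = [(3.0, 0.0), (5.0, 0.0)]  # made-up cluster
print(single_linkage(left, right),
      complete_linkage(left, right),
      average_linkage(left, right))
```

Hierarchical (agglomerative) clustering repeatedly merges the pair of clusters with the smallest linkage value, so the choice of rule shapes the result: single linkage tends to chain clusters together, while complete linkage favors compact groups.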
Cluster analysis

CERN Document Server

Everitt, Brian S; Leese, Morven; Stahl, Daniel

2011-01-01

Cluster analysis comprises a range of methods for classifying multivariate data into subgroups. By organizing multivariate data into such subgroups, clustering can help reveal the characteristics of any structure or patterns present. These techniques have proven useful in a wide range of areas such as medicine, psychology, market research and bioinformatics. This fifth edition of the highly successful Cluster Analysis includes coverage of the latest developments in the field and a new chapter dealing with finite mixture models for structured data. Real life examples are used throughout to demons
ESTminer: a Web interface for mining EST contig and cluster databases.

Science.gov (United States)

Huang, Yecheng; Pumphrey, Janie; Gingle, Alan R

2005-03-01

ESTminer is a Web application and database schema for interactive mining of expressed sequence tag (EST) contig and cluster datasets. The Web interface contains a query frame that allows the selection of contigs/clusters with specific cDNA library makeup or a threshold number of members. The results are displayed as color-coded tree nodes, where the color indicates the fractional size of each cDNA library component. The nodes are expandable, revealing library statistics as well as EST or contig members, with links to sequence data, GenBank records or user configurable links. Also, the interface allows 'queries within queries' where the result set of a query is further filtered by the subsequent query. ESTminer is implemented in Java/JSP and the package, including MySQL and Oracle schema creation scripts, is available from http://cggc.agtec.uga.edu/Data/download.asp agingle@uga.edu.
Large-Scale Multi-Dimensional Document Clustering on GPU Clusters

Energy Technology Data Exchange (ETDEWEB)

Cui, Xiaohui [ORNL; Mueller, Frank [North Carolina State University; Zhang, Yongpeng [ORNL; Potok, Thomas E [ORNL

2010-01-01

Document clustering plays an important role in data mining systems. Recently, a flocking-based document clustering algorithm has been proposed to solve the problem through simulation resembling the flocking behavior of birds in nature. This method is superior to other clustering algorithms, including k-means, in the sense that the outcome is not sensitive to the initial state. One limitation of this approach is that the algorithmic complexity is inherently quadratic in the number of documents. As a result, execution time becomes a bottleneck with a large number of documents. In this paper, we assess the benefits of exploiting the computational power of Beowulf-like clusters equipped with contemporary Graphics Processing Units (GPUs) as a means to significantly reduce the runtime of flocking-based document clustering. Our framework scales up to over one million documents processed simultaneously in a sixteen-node GPU cluster. Results are also compared to a four-node cluster with higher-end GPUs. On these clusters, we observe 30X-50X speedups, which demonstrates the potential of GPU clusters to efficiently solve massive data mining problems. Such speedups combined with the scalability potential and accelerator-based parallelization are unique in the domain of document-based data mining, to the best of our knowledge.
Text Mining in Organizational Research.

Science.gov (United States)

Kobayashi, Vladimer B; Mol, Stefan T; Berkers, Hannah A; Kismihók, Gábor; Den Hartog, Deanne N

2018-07-01

Despite the ubiquity of textual data, so far few researchers have applied text mining to answer organizational research questions. Text mining, which essentially entails a quantitative approach to the analysis of (usually) voluminous textual data, helps accelerate knowledge discovery by radically increasing the amount of data that can be analyzed. This article aims to acquaint organizational researchers with the fundamental logic underpinning text mining, the analytical stages involved, and contemporary techniques that may be used to achieve different types of objectives. The specific analytical techniques reviewed are (a) dimensionality reduction, (b) distance and similarity computing, (c) clustering, (d) topic modeling, and (e) classification. We describe how text mining may extend contemporary organizational research by allowing the testing of existing or new research questions with data that are likely to be rich, contextualized, and ecologically valid. After an exploration of how evidence for the validity of text mining output may be generated, we conclude the article by illustrating the text mining process in a job analysis setting using a dataset composed of job vacancies.
Mining Views : database views for data mining

NARCIS (Netherlands)

Blockeel, H.; Calders, T.; Fromont, É.; Goethals, B.; Prado, A.; Nijssen, S.; De Raedt, L.

2007-01-01

We propose a relational database model towards the integration of data mining into relational database systems, based on the so called virtual mining views. We show that several types of patterns and models over the data, such as itemsets, association rules, decision trees and clusterings, can be
Highly Robust Methods in Data Mining

Czech Academy of Sciences Publication Activity Database

Kalina, Jan

2013-01-01

Vol. 8, No. 1 (2013), pp. 9-24 ISSN 1452-4864 Institutional support: RVO:67985807 Keywords : data mining * robust statistics * high-dimensional data * cluster analysis * logistic regression * neural networks Subject RIV: BB - Applied Statistics, Operational Research
Recent development of antiSMASH and other computational approaches to mine secondary metabolite biosynthetic gene clusters

DEFF Research Database (Denmark)

Blin, Kai; Kim, Hyun Uk; Medema, Marnix H.

2017-01-01

Many drugs are derived from small molecules produced by microorganisms and plants, so-called natural products. Natural products have diverse chemical structures, but the biosynthetic pathways producing those compounds are often organized as biosynthetic gene clusters (BGCs) and follow a highly conserved biosynthetic logic. This allows for the identification of core biosynthetic enzymes using genome mining strategies that are based on the sequence similarity of the involved enzymes/genes. However, mining for a variety of BGCs quickly approaches a complexity level where manual analyses are no longer possible and require the use of automated genome mining pipelines, such as the antiSMASH software. In this review, we discuss the principles underlying the predictions of antiSMASH and other tools and provide practical advice for their application. Furthermore, we discuss important caveats...
HC StratoMineR: A Web-Based Tool for the Rapid Analysis of High-Content Datasets.

Science.gov (United States)

Omta, Wienand A; van Heesbeen, Roy G; Pagliero, Romina J; van der Velden, Lieke M; Lelieveld, Daphne; Nellen, Mehdi; Kramer, Maik; Yeong, Marley; Saeidi, Amir M; Medema, Rene H; Spruit, Marco; Brinkkemper, Sjaak; Klumperman, Judith; Egan, David A

2016-10-01

High-content screening (HCS) can generate large multidimensional datasets and when aligned with the appropriate data mining tools, it can yield valuable insights into the mechanism of action of bioactive molecules. However, easy-to-use data mining tools are not widely available, with the result that these datasets are frequently underutilized. Here, we present HC StratoMineR, a web-based tool for high-content data analysis. It is a decision-supportive platform that guides even non-expert users through a high-content data analysis workflow. HC StratoMineR is built by using My Structured Query Language for storage and querying, PHP: Hypertext Preprocessor as the main programming language, and jQuery for additional user interface functionality. R is used for statistical calculations, logic and data visualizations. Furthermore, C++ and graphical processor unit power is diffusely embedded in R by using the rcpp and rpud libraries for operations that are computationally highly intensive. We show that we can use HC StratoMineR for the analysis of multivariate data from a high-content siRNA knock-down screen and a small-molecule screen. It can be used to rapidly filter out undesirable data; to select relevant data; and to perform quality control, data reduction, data exploration, morphological hit picking, and data clustering. Our results demonstrate that HC StratoMineR can be used to functionally categorize HCS hits and, thus, provide valuable information for hit prioritization.
The Top Ten Algorithms in Data Mining

CERN Document Server

Wu, Xindong

2009-01-01

From classification and clustering to statistical learning, association analysis, and link mining, this book covers the most important topics in data mining research. It presents the ten most influential algorithms used in the data mining community today. Each chapter provides a detailed description of the algorithm, a discussion of available software implementation, advanced topics, and exercises. With a simple data set, examples illustrate how each algorithm works and highlight the overall performance of each algorithm in a real-world application. Featuring contributions from leading researc
Text mining to decipher free-response consumer complaints: insights from the NHTSA vehicle owner's complaint database.

Science.gov (United States)

Ghazizadeh, Mahtab; McDonald, Anthony D; Lee, John D

2014-09-01

This study applies text mining to extract clusters of vehicle problems and associated trends from free-response data in the National Highway Traffic Safety Administration's vehicle owner's complaint database. As the automotive industry adopts new technologies, it is important to systematically assess the effect of these changes on traffic safety. Driving simulators, naturalistic driving data, and crash databases all contribute to a better understanding of how drivers respond to changing vehicle technology, but other approaches, such as automated analysis of incident reports, are needed. Free-response data from incidents representing two severity levels (fatal incidents and incidents involving injury) were analyzed using a text mining approach: latent semantic analysis (LSA). LSA and hierarchical clustering identified clusters of complaints for each severity level, which were compared and analyzed across time. Cluster analysis identified eight clusters of fatal incidents and six clusters of incidents involving injury. Comparisons showed that although the airbag clusters across the two severity levels have the same most frequent terms, the circumstances around the incidents differ. The time trends show clear increases in complaints surrounding the Ford/Firestone tire recall and the Toyota unintended acceleration recall. Increases in complaints may be partially driven by these recall announcements and the associated media attention. Text mining can reveal useful information from free-response databases that would otherwise be prohibitively time-consuming and difficult to summarize manually. Text mining can extend human analysis capabilities for large free-response databases to support earlier detection of problems and more timely safety interventions.
Availability analysis of selected mining machinery

Directory of Open Access Journals (Sweden)

Brodny Jarosław

2017-06-01

Full Text Available Underground coal extraction is characterized by highly variable mining and geological conditions. Despite ever more effective methods and tools for identifying the factors that influence this process, underground mining machinery works in difficult and not always foreseeable conditions, which means that these machines should be very versatile and reliable. Additionally, strong competition on the coal market makes it necessary to reduce production costs, e.g. by increasing the efficiency of machine utilization. The analysis presented in this paper serves this objective. It concerns the availability of selected mining machinery and was conducted using the OEE model, a tool for quantitatively assessing the TPM strategy. In this article we considered the machines that are part of a mechanized longwall complex, and the analysis was based on data recorded by the industrial automation system. Using this data set we evaluated the availability of the studied machines and the structure of the registered breaks in their work. The results should be an important source of information for the maintenance staff and management of mining plants, needed to improve the economic efficiency of underground mining.
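In the usual OEE formulation, availability is the share of planned time during which a machine actually ran, and OEE is the product of availability, performance and quality. The sketch below follows that general definition; the shift length, the break durations and the performance and quality rates are invented, and the paper's own data format is not described in the abstract.

```python
def availability(planned_minutes, break_minutes):
    """Share of the planned time that the machine was actually running."""
    downtime = sum(break_minutes)
    return (planned_minutes - downtime) / planned_minutes


def oee(avail, performance, quality):
    """Overall equipment effectiveness as the product of its three factors."""
    return avail * performance * quality


# Hypothetical shift: 480 planned minutes and three registered breaks.
shift_availability = availability(480, [30, 45, 45])
print(shift_availability)
print(oee(shift_availability, performance=0.8, quality=1.0))
```

Grouping the registered breaks by cause before summing them would give the break structure that the abstract mentions alongside the availability figure.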
Fuzzy Clustering: An Approach for Mining Usage Profiles from Web

OpenAIRE

Ms.Archana N. Boob; Prof. D. M. Dakhane

2012-01-01

Web usage mining is an application of data mining technology to mining the data of the web server log file. It can discover the browsing patterns of user and some kind of correlations between the web pages. Web usage mining provides the support for the web site design, providing personalization server and other business making decision, etc. Web mining applies the data mining, the artificial intelligence and the chart technology and so on to the web data and traces users' visiting characteris...
Marketing research cluster analysis

OpenAIRE

Marić Nebojša

2002-01-01

One area of applications of cluster analysis in marketing is identification of groups of cities and towns with similar demographic profiles. This paper considers main aspects of cluster analysis by an example of clustering 12 cities with the use of Minitab software.
The accident analysis of mobile mine machinery in Indian opencast coal mines.

Science.gov (United States)

Kumar, R; Ghosh, A K

2014-01-01

This paper presents the analysis of large mining machinery related accidents in Indian opencast coal mines. The trends of coal production, share of mining methods in production, machinery deployment in open cast mines, size and population of machinery, accidents due to machinery, types and causes of accidents have been analysed from the year 1995 to 2008. The scrutiny of accidents during this period reveals that most of the responsible factors are machine reversal, haul road design, human fault, operator's fault, machine fault, visibility and dump design. Considering the types of machines, namely, dumpers, excavators, dozers and loaders together the maximum number of fatal accidents has been caused by operator's faults and human faults jointly during the period from 1995 to 2008. The novel finding of this analysis is that large machines with state-of-the-art safety system did not reduce the fatal accidents in Indian opencast coal mines.
Marketing research cluster analysis

Directory of Open Access Journals (Sweden)

Marić Nebojša

2002-01-01

Full Text Available One area of applications of cluster analysis in marketing is identification of groups of cities and towns with similar demographic profiles. This paper considers main aspects of cluster analysis by an example of clustering 12 cities with the use of Minitab software.

Data Mining and Statistics for Decision Making

CERN Document Server

Tufféry, Stéphane

2011-01-01

Data mining is the process of automatically searching large volumes of data for models and patterns using computational techniques from statistics, machine learning and information theory; it is the ideal tool for such an extraction of knowledge. Data mining is usually associated with a business or an organization's need to identify trends and profiles, allowing, for example, retailers to discover patterns on which to base marketing objectives. This book looks at both classical and recent techniques of data mining, such as clustering, discriminant analysis, logistic regression, generalized lin
Study on Adaptive Parameter Determination of Cluster Analysis in Urban Management Cases

Science.gov (United States)

Fu, J. Y.; Jing, C. F.; Du, M. Y.; Fu, Y. L.; Dai, P. P.

2017-09-01

The fine management of cities is an important way to realize the smart city. Data mining that applies spatial clustering analysis to urban management cases can be used to evaluate the deployment of urban public facilities, support policy decisions, and provide technical support for the fine management of the city. To address the problem that DBSCAN, a density-based clustering algorithm, cannot determine its parameters adaptively, this paper proposes an optimized method for adaptive parameter determination based on spatial analysis. First, Ripley's K function is analysed for the data set to adaptively determine the global parameter Eps, which means setting the maximum aggregation scale as the range of data clustering. Next, using a K-D tree, the highest-frequency K value within the range of Eps is calculated for every point object and set as the clustering density, which adaptively determines the global parameter MinPts. The R language was then used to optimize this process and accomplish precise clustering of typical urban management cases. Experimental results for typical urban management cases in the XiCheng district of Beijing show that the new DBSCAN clustering algorithm takes full account of the spatial and statistical characteristics of data with obvious clustering features, and has better applicability and high quality. The results of the study are helpful not only for the formulation of urban management policies and the allocation of urban management supervisors in XiCheng District of Beijing, but also for other cities and related fields.
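To make the roles of the two DBSCAN parameters concrete, here is a minimal textbook DBSCAN sketch. It assumes that `eps` and `min_pts` are supplied by hand; it does not implement the adaptive selection via Ripley's K described in the abstract, it uses a brute-force neighbour search instead of a K-D tree, and it is written in Python rather than the R used by the authors. The sample coordinates are hypothetical.

```python
import math

NOISE = -1


def region_query(points, i, eps):
    """Indices of all points within distance eps of point i (including i)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(points)  # None means "not visited yet"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbours = region_query(points, i, eps)
        if len(neighbours) < min_pts:
            labels[i] = NOISE  # may later be relabelled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbours if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbours = region_query(points, j, eps)
            if len(j_neighbours) >= min_pts:  # j is a core point: keep expanding
                queue.extend(j_neighbours)
    return labels


# Hypothetical case locations: two dense groups and one isolated report.
cases = [(0, 0), (0, 1), (1, 0), (1, 1),
         (10, 10), (10, 11), (11, 10), (11, 11),
         (50, 50)]
labels = dbscan(cases, eps=1.5, min_pts=3)
print(labels)
```

Choosing a smaller `eps` or a larger `min_pts` turns more points into noise, which is why the abstract treats their adaptive determination as the central problem.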
STUDY ON ADAPTIVE PARAMETER DETERMINATION OF CLUSTER ANALYSIS IN URBAN MANAGEMENT CASES

Directory of Open Access Journals (Sweden)

J. Y. Fu

2017-09-01

Full Text Available The fine management of cities is an important way to realize the smart city. Data mining that applies spatial clustering analysis to urban management cases can be used to evaluate the deployment of urban public facilities, support policy decisions, and provide technical support for the fine management of the city. To address the problem that DBSCAN, a density-based clustering algorithm, cannot determine its parameters adaptively, this paper proposes an optimized method for adaptive parameter determination based on spatial analysis. First, Ripley's K function is analysed for the data set to adaptively determine the global parameter Eps, which means setting the maximum aggregation scale as the range of data clustering. Next, using a K-D tree, the highest-frequency K value within the range of Eps is calculated for every point object and set as the clustering density, which adaptively determines the global parameter MinPts. The R language was then used to optimize this process and accomplish precise clustering of typical urban management cases. Experimental results for typical urban management cases in the XiCheng district of Beijing show that the new DBSCAN clustering algorithm takes full account of the spatial and statistical characteristics of data with obvious clustering features, and has better applicability and high quality. The results of the study are helpful not only for the formulation of urban management policies and the allocation of urban management supervisors in XiCheng District of Beijing, but also for other cities and related fields.
Communication Base Station Log Analysis Based on Hierarchical Clustering

Directory of Open Access Journals (Sweden)

Zhang Shao-Hua

2017-01-01

Full Text Available Communication base stations generate massive amounts of data every day, and these base station logs are valuable for mining business circles. This paper uses data mining technology and a hierarchical clustering algorithm to group base stations into business circles based on their recorded data. By extracting features from the data of the different business circles and comparing the characteristics of each business circle category, operators can choose a suitable area for commercial marketing.
Hot Zone Identification: Analyzing Effects of Data Sampling on Spam Clustering

Directory of Open Access Journals (Sweden)

Rasib Khan

2014-03-01

Full Text Available Email is the most common and comparatively the most efficient means of exchanging information in today's world. However, given the widespread use of emails in all sectors, they have been the target of spammers since the beginning. Filtering spam emails has now led to critical actions such as forensic activities based on mining spam email. The data mine for spam emails at the University of Alabama at Birmingham is considered to be one of the most prominent resources for mining and identifying spam sources. It is a widely researched repository used by researchers from different global organizations. The usual process of mining the spam data involves going through every email in the data mine and clustering them based on their different attributes. However, given the size of the data mine, it takes an exceptionally long time to execute the clustering mechanism each time. In this paper, we have illustrated sampling as an efficient tool for data reduction, while preserving the information within the clusters, which would thus allow the spam forensic experts to quickly and effectively identify the ‘hot zone’ from the spam campaigns. We have provided detailed comparative analysis of the quality of the clusters after sampling, the overall distribution of clusters on the spam data, and timing measurements for our sampling approach. Additionally, we present different strategies which allowed us to optimize the sampling process using data-preprocessing and using the database engine's computational resources, and thus improving the performance of the clustering process.
Comprehensive cluster analysis with Transitivity Clustering.

Science.gov (United States)

Wittkop, Tobias; Emig, Dorothea; Truss, Anke; Albrecht, Mario; Böcker, Sebastian; Baumbach, Jan

2011-03-01

Transitivity Clustering is a method for the partitioning of biological data into groups of similar objects, such as genes, for instance. It provides integrated access to various functions addressing each step of a typical cluster analysis. To facilitate this, Transitivity Clustering is accessible online and offers three user-friendly interfaces: a powerful stand-alone version, a web interface, and a collection of Cytoscape plug-ins. In this paper, we describe three major workflows: (i) protein (super)family detection with Cytoscape, (ii) protein homology detection with incomplete gold standards and (iii) clustering of gene expression data. This protocol guides the user through the most important features of Transitivity Clustering and takes ∼1 h to complete.
Data Mining of University Philanthropic Giving: Cluster-Discriminant Analysis and Pareto Effects

Science.gov (United States)

Le Blanc, Louis A.; Rucks, Conway T.

2009-01-01

A large sample of 33,000 university alumni records were cluster-analyzed to generate six groups relatively unique in their respective attribute values. The attributes used to cluster the former students included average gift to the university's foundation and to the alumni association for the same institution. Cluster detection is useful in this…
[Cluster analysis in biomedical researches].

Science.gov (United States)

Akopov, A S; Moskovtsev, A A; Dolenko, S A; Savina, G D

2013-01-01

Cluster analysis is one of the most popular methods for the analysis of multi-parameter data. Cluster analysis reveals the internal structure of the data, grouping separate observations by their degree of similarity. The review defines the basic concepts of cluster analysis and discusses the most popular clustering algorithms: k-means, hierarchical algorithms, and Kohonen network algorithms. Examples of the use of these algorithms in biomedical research are given.
Fatal accidents analysis in Peruvian mining industry

International Nuclear Information System (INIS)

Candia, R. C.; Hennies, W. T.; Azevedo, R. c.; Almeida, I.G.; Soto, J. F.

2010-01-01

Although reductions in injury and accident rates have been observed in recent years, mining is still one of the highest-risk industries. The basic causes of fatalities can be attributed to unsafe conditions and unsafe acts. In this setting it is necessary to identify safety problems and aim for effective solutions. On the other hand, the dependence of developing countries on primary industries such as mining is evident. In the Peruvian economy, approximately 16% of the GNP and more than 50% of exports are due to the mining sector, which highlights its competitive position in worldwide mining. This paper presents an analysis of fatal accidents in the Peruvian mining industry, based on the register of fatal accidents that occurred from 2000 to 2007, and identifies the main types of accidents. The primary source of information is the General Mining Direction (DGM) of the Peruvian Mining and Energy Ministry (MEM). The majority of victims belonged to tertiary contractor companies that render services to mining companies. The results of the analysis also show that the majority of accidents happened in underground mines, and that effective solutions to manage risks are needed in order to reduce fatal accident rates. (Author)
The legacy of war: an epidemiological study of cluster weapon and land mine accidents in Quang Tri Province, Vietnam.

Science.gov (United States)

Phung, Tran Kim; Le, Viet; Husum, Hans

2012-07-01

The study examines the epidemiology of cluster weapon and land mine accidents in Quang Tri Province since the end of the Vietnam War. The province is located just south of the demarcation line and was the province most affected during the war. In 2009, a cross sectional household study was conducted in all nine districts of the province. During the study period of 1975-2009, 7,030 persons in the study area were exposed to unexploded ordnances (UXO) or land mine accidents, or 1.1% of the provincial population. There were 2,620 fatalities and 4,410 accident survivors. The study documents that the main problem is cluster weapons and other unexploded ordnances; only 4.3% of casualties were caused by land mines. The legacy of the war affects poor people the most; the accident rate was highest among villagers living in mountainous areas, ethnic minorities, and low-income families. The most common activities leading to the accidents were farming (38.6%), collecting scrap metal (11.2%), and herding of cattle (8.3%). The study documents that the people of the Quang Tri Province until this day have suffered heavily due to the legacy of war. Mine risk education programs should account for the epidemiological findings when future accident prevention programs are designed to target high-risk areas and activities.
Application of multivariate analysis to investigate the trace element contamination in top soil of coal mining district in Jorong, South Kalimantan, Indonesia

Science.gov (United States)

Pujiwati, Arie; Nakamura, K.; Watanabe, N.; Komai, T.

2018-02-01

Multivariate analysis is applied to investigate the geochemistry of several trace elements in top soils and their relation to the contamination source under the influence of coal mines in Jorong, South Kalimantan. The total concentrations of Cd, V, Co, Ni, Cr, Zn, As, Pb, Sb, Cu and Ba were determined in 20 soil samples by bulk analysis. Pearson correlation is applied to specify the linear correlation among the elements. Principal Component Analysis (PCA) and Cluster Analysis (CA) were applied to observe the classification of trace elements and contamination sources. The results suggest that contamination loading is contributed by Cr, Cu, Ni, Zn, As, and Pb. The elemental loading mostly affects the non-coal mining area, for instance the area near settlements and agricultural land use. Moreover, the contamination source is classified into the areas that are influenced by the coal mining activity, the agricultural types, and the river mixing zone. Multivariate analysis could elucidate the elemental loading and the contamination sources of trace elements in the vicinity of the coal mine area.
Semi-Supervised Clustering for High-Dimensional and Sparse Features

Science.gov (United States)

Yan, Su

2010-01-01

Clustering is one of the most common data mining tasks, used frequently for data organization and analysis in various application domains. Traditional machine learning approaches to clustering are fully automated and unsupervised where class labels are unknown a priori. In real application domains, however, some "weak" form of side…
Genome mining of the sordarin biosynthetic gene cluster from Sordaria araneosa Cain ATCC 36386: characterization of cycloaraneosene synthase and GDP-6-deoxyaltrose transferase.

Science.gov (United States)

Kudo, Fumitaka; Matsuura, Yasunori; Hayashi, Takaaki; Fukushima, Masayuki; Eguchi, Tadashi

2016-07-01

Sordarin is a glycoside antibiotic with a unique tetracyclic diterpene aglycone structure called sordaricin. To understand its intriguing biosynthetic pathway that may include a Diels-Alder-type [4+2]cycloaddition, genome mining of the gene cluster from the draft genome sequence of the producer strain, Sordaria araneosa Cain ATCC 36386, was carried out. A contiguous 67 kb gene cluster consisting of 20 open reading frames encoding a putative diterpene cyclase, a glycosyltransferase, a type I polyketide synthase, and six cytochrome P450 monooxygenases was identified. In vitro enzymatic analysis of the putative diterpene cyclase SdnA showed that it catalyzes the transformation of geranylgeranyl diphosphate to cycloaraneosene, a known biosynthetic intermediate of sordarin. Furthermore, a putative glycosyltransferase SdnJ was found to catalyze the glycosylation of sordaricin in the presence of GDP-6-deoxy-d-altrose to give 4'-O-demethylsordarin. These results suggest that the identified sdn gene cluster is responsible for the biosynthesis of sordarin. Based on the isolated potential biosynthetic intermediates and bioinformatics analysis, a plausible biosynthetic pathway for sordarin is proposed.
A Novel Double Cluster and Principal Component Analysis-Based Optimization Method for the Orbit Design of Earth Observation Satellites

Directory of Open Access Journals (Sweden)

Yunfeng Dong

2017-01-01

Full Text Available The weighted sum and genetic algorithm-based hybrid method (WSGA-based HM), which has been applied to multiobjective orbit optimizations, is negatively influenced by human factors through the artificial choice of the weight coefficients in the weighted sum method and by the slow convergence of GA. To address these two problems, a cluster and principal component analysis-based optimization method (CPC-based OM) is proposed, in which many candidate orbits are gradually randomly generated until the optimal orbit is obtained using a data mining method, that is, cluster analysis based on principal components. Then, the second cluster analysis of the orbital elements is introduced into CPC-based OM to improve the convergence, developing a novel double cluster and principal component analysis-based optimization method (DCPC-based OM). In DCPC-based OM, the cluster analysis based on principal components has the advantage of reducing the human influences, and the cluster analysis based on six orbital elements can reduce the search space to effectively accelerate convergence. The test results from a multiobjective numerical benchmark function and the orbit design results of an Earth observation satellite show that DCPC-based OM converges more efficiently than WSGA-based HM. DCPC-based OM also, to some degree, reduces the influence of human factors present in WSGA-based HM.
Co-clustering Analysis of Weblogs Using Bipartite Spectral Projection Approach

DEFF Research Database (Denmark)

Xu, Guandong; Zong, Yu; Dolog, Peter

2010-01-01

Web clustering is an approach for aggregating Web objects into various groups according to underlying relationships among them. Finding co-clusters of Web objects is an interesting topic in the context of Web usage mining, which is able to capture the underlying user navigational interest and content preference simultaneously. In this paper we will present an algorithm using bipartite spectral clustering to co-cluster Web users and pages. The usage data of users visiting Web sites is modeled as a bipartite graph and the spectral clustering is then applied to the graph representation of usage data. The proposed approach is evaluated by experiments performed on real datasets, and the impact of using various clustering algorithms is also investigated. Experimental results have demonstrated the employed method can effectively reveal the subset aggregates of Web users and pages which...
Exploitation of Clustering Techniques in Transactional Healthcare Data

Directory of Open Access Journals (Sweden)

Naeem Ahmad Mahoto

2014-03-01

Full Text Available Healthcare service centres equipped with electronic health systems have improved their resources as well as treatment processes. The dynamic nature of each individual's healthcare data makes it complex and difficult for physicians to handle manually; therefore, automatic techniques are essential to manage the quality and standardization of treatment procedures. Exploratory data analysis, pattern analysis and grouping of data are managed using clustering techniques, which work as an unsupervised classification. A number of healthcare applications have been developed that use several data mining techniques for classification, clustering and extracting useful information from healthcare data. The challenging issue in this domain is to select an adequate data mining algorithm for optimal results. This paper exploits three different clustering algorithms, DBSCAN (density-based clustering), agglomerative hierarchical and k-means, on real transactional healthcare data of diabetic patients (taken as a case study) to analyse their performance on large and dispersed healthcare data. The best cluster set among the exploited algorithms is evaluated using clustering quality indexes and is selected to identify the possible subgroups of patients having similar treatment patterns.
Epidemiological geomatics in evaluation of mine risk education in Afghanistan: introducing population weighted raster maps

Directory of Open Access Journals (Sweden)

Andersson Neil

2006-01-01

Full Text Available Evaluation of mine risk education in Afghanistan used population weighted raster maps as an evaluation tool to assess mine education performance, coverage and costs. A stratified last-stage random cluster sample produced representative data on mine risk and exposure to education. Clusters were weighted by the population they represented, rather than the land area. A "friction surface" hooked the population weight into interpolation of cluster-specific indicators. The resulting population weighted raster contours offer a model of the population effects of landmine risks and risk education. Five indicator levels ordered the evidence from simple description of the population-weighted indicators (level 0), through risk analysis (levels 1–3), to modelling programme investment and local variations (level 4). Using graphic overlay techniques, it was possible to metamorphose the map, portraying the prediction of what might happen over time, based on the causality models developed in the epidemiological analysis. Based on a lattice of local site-specific predictions, each cluster being a small universe, the "average" prediction was immediately interpretable without losing the spatial complexity.
An application of data mining in district heating substations for improving energy performance

Science.gov (United States)

Xue, Puning; Zhou, Zhigang; Chen, Xin; Liu, Jing

2017-11-01

Automatic meter reading system is capable of collecting and storing a huge number of district heating (DH) data. However, the data obtained are rarely fully utilized. Data mining is a promising technology to discover potential interesting knowledge from vast data. This paper applies data mining methods to analyse the massive data for improving energy performance of DH substation. The technical approach contains three steps: data selection, cluster analysis and association rule mining (ARM). Two-heating-season data of a substation are used for case study. Cluster analysis identifies six distinct heating patterns based on the primary heat of the substation. ARM reveals that secondary pressure difference and secondary flow rate have a strong correlation. Using the discovered rules, a fault occurring in remote flow meter installed at secondary network is detected accurately. The application demonstrates that data mining techniques can effectively extrapolate potential useful knowledge to better understand substation operation strategies and improve substation energy performance.
An improved clustering algorithm based on reverse learning in intelligent transportation

Science.gov (United States)

Qiu, Guoqing; Kou, Qianqian; Niu, Ting

2017-05-01

With the development of artificial intelligence and data mining technology, big data has gradually entered people's field of vision. In processing large data sets, clustering is an important method. This paper introduces the reverse learning method into the clustering process of the PAM clustering algorithm to overcome the limitations of one-time clustering in unsupervised clustering learning and to increase the diversity of clusters, thereby improving clustering quality. Algorithm analysis and experimental results show that the algorithm is feasible.
Clustering box office movie with Partition Around Medoids (PAM) Algorithm based on Text Mining of Indonesian subtitle

Science.gov (United States)

Alfarizy, A. D.; Indahwati; Sartono, B.

2017-03-01

In 2015, Indonesia was the largest target market for the Hollywood movie industry in Southeast Asia. Hollywood movies distributed in Indonesia target people of all ages, including children. Low awareness of the need to guide children while they watch movies means that they can watch films of any rating, even those unsuitable for their age. Even after being translated into Bahasa and passing the censorship phase, words that are unsuitable for children still remain. The purpose of this research is to cluster box office Hollywood movies based on Indonesian subtitles, revenue, IMDb user rating and genre, as a reference for adults choosing suitable movies for their children to watch. Text mining is used to extract words from the subtitles and count the frequency of three groups of words (bad words, sexual words and terror words), while the Partition Around Medoids (PAM) algorithm, with the Gower similarity coefficient as the proximity matrix, is used as the clustering method. We clustered 624 movies from 2006 until the first half of 2016 from IMDb. The solution with the highest silhouette coefficient value (0.36) is the one with 5 clusters. Animation, Adventure and Comedy movies with high revenue, as in cluster 5, are recommended for children to watch, while Comedy movies with high revenue, as in cluster 4, should be avoided.
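The abstract above selects the number of clusters by the silhouette coefficient. As a hedged illustration of how that index is computed, the sketch below implements the standard definition with Euclidean distance on toy one-dimensional data. The study itself used a Gower-based proximity with PAM, which is not reproduced here; the `dist` argument is a hypothetical hook for swapping in another distance.

```python
import math


def silhouette(points, labels, dist=math.dist):
    """Mean silhouette coefficient over all points (standard definition)."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("the silhouette needs at least two clusters")

    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:  # singleton cluster: score defined as 0
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(dist(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Toy data: two well-separated pairs, labelled sensibly and badly.
toy = [(0.0,), (1.0,), (10.0,), (11.0,)]
good = silhouette(toy, [0, 0, 1, 1])
bad = silhouette(toy, [0, 1, 0, 1])
print(round(good, 3), round(bad, 3))
```

Values near 1 indicate compact, well-separated clusters, and values near or below 0 indicate overlapping or misassigned points, so a best value of 0.36 points to only moderately separated clusters.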

Haplotyping Problem, A Clustering Approach

International Nuclear Information System (INIS)

Eslahchi, Changiz; Sadeghi, Mehdi; Pezeshk, Hamid; Kargar, Mehdi; Poormohammadi, Hadi

2007-01-01

Construction of two haplotypes from a set of Single Nucleotide Polymorphism (SNP) fragments is called the haplotype reconstruction problem. One of the most popular computational models for this problem is Minimum Error Correction (MEC). Since MEC is an NP-hard problem, we propose a novel heuristic algorithm for haplotype reconstruction based on clustering analysis in data mining. Based on the Hamming distance and similarity between two fragments, our iterative algorithm produces two clusters of fragments; in each iteration, the algorithm assigns a fragment to one of the clusters. Our results suggest that the algorithm has a lower reconstruction error rate than other algorithms.
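The general idea of clustering SNP fragments into two groups by Hamming distance can be sketched as follows. This is a generic two-centre heuristic written for illustration, not the authors' algorithm: the seeding rule (most distant fragment pair), the majority-vote consensus, the tie-breaking towards "0" and the gap symbol "-" are all assumptions, as are the sample fragments.

```python
from itertools import combinations

GAP = "-"  # assumed symbol for positions a fragment does not cover


def hamming(a, b):
    """Mismatches between two strings, ignoring positions where either has a gap."""
    return sum(1 for x, y in zip(a, b) if x != GAP and y != GAP and x != y)


def consensus(fragments, length):
    """Column-wise majority vote; ties go to '0', uncovered columns stay gaps."""
    result = []
    for col in range(length):
        ones = sum(1 for f in fragments if f[col] == "1")
        zeros = sum(1 for f in fragments if f[col] == "0")
        if ones == 0 and zeros == 0:
            result.append(GAP)
        else:
            result.append("1" if ones > zeros else "0")
    return "".join(result)


def two_cluster_haplotypes(fragments, max_iter=20):
    """Alternate between assigning fragments and recomputing two consensus strings."""
    length = len(fragments[0])
    # Seed the two centres with the most distant pair of fragments.
    h1, h2 = max(combinations(fragments, 2), key=lambda pair: hamming(*pair))
    assignment = None
    for _ in range(max_iter):
        new_assignment = [0 if hamming(f, h1) <= hamming(f, h2) else 1
                          for f in fragments]
        if new_assignment == assignment:
            break  # assignments are stable
        assignment = new_assignment
        group1 = [f for f, a in zip(fragments, assignment) if a == 0]
        group2 = [f for f, a in zip(fragments, assignment) if a == 1]
        if group1:
            h1 = consensus(group1, length)
        if group2:
            h2 = consensus(group2, length)
    # MEC-style score: corrections needed to make every fragment fit its haplotype.
    errors = sum(min(hamming(f, h1), hamming(f, h2)) for f in fragments)
    return h1, h2, assignment, errors


# Hypothetical fragments over six SNP sites; the last one carries one read error.
reads = ["0011--", "--1100", "1100--", "--0011", "0111--"]
hap1, hap2, groups, mec = two_cluster_haplotypes(reads)
print(hap1, hap2, groups, mec)
```

Because MEC is NP-hard, a heuristic like this can stop in a local optimum; its result depends on the seeding and tie-breaking choices made above.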
Clustering Methods Application for Customer Segmentation to Manage Advertisement Campaign

OpenAIRE

Maciej Kutera; Mirosława Lasek

2010-01-01

Clustering methods have recently become such advanced algorithms for analysing large data collections that they are now counted among data mining methods. They form an ever larger and rapidly evolving group of methods with more and more varied applications. This article presents our research on the usefulness of clustering methods in customer segmentation for managing an advertisement campaign. We introduce results obtained by using four sel...
Exploratory analysis of textual data from the Mother and Child Handbook using a text mining method (II): Monthly changes in the words recorded by mothers.

Science.gov (United States)

Tagawa, Miki; Matsuda, Yoshio; Manaka, Tomoko; Kobayashi, Makiko; Ohwada, Michitaka; Matsubara, Shigeki

2017-01-01

The aim of the study was to examine the possibility of converting subjective textual data written in the free column space of the Mother and Child Handbook (MCH) into objective information using text mining and to compare any monthly changes in the words written by the mothers. Pregnant women without complications (n = 60) were divided into two groups according to State-Trait Anxiety Inventory grade: low trait anxiety (group I, n = 39) and high trait anxiety (group II, n = 21). Exploratory analysis of the textual data from the MCH was conducted by text mining using the Word Miner software program. Using 1203 structural elements extracted after processing, a comparison of monthly changes in the words used in the mothers' comments was made between the two groups. The data was mainly analyzed by a correspondence analysis. The structural elements in groups I and II were divided into seven and six clusters, respectively, by cluster analysis. Correspondence analysis revealed clear monthly changes in the words used in the mothers' comments as the pregnancy progressed in group I, whereas the association was not clear in group II. The text mining method was useful for exploratory analysis of the textual data obtained from pregnant women, and the monthly change in the words used in the mothers' comments as pregnancy progressed differed according to their degree of unease. © 2016 Japan Society of Obstetrics and Gynecology.
Changing cluster composition in cluster randomised controlled trials: design and analysis considerations

Science.gov (United States)

2014-01-01

Background There are many methodological challenges in the conduct and analysis of cluster randomised controlled trials, but one that has received little attention is that of post-randomisation changes to cluster composition. To illustrate this, we focus on the issue of cluster merging, considering the impact on the design, analysis and interpretation of trial outcomes. Methods We explored the effects of merging clusters on study power using standard methods of power calculation. We assessed the potential impacts on study findings of both homogeneous cluster merges (involving clusters randomised to the same arm of a trial) and heterogeneous merges (involving clusters randomised to different arms of a trial) by simulation. To determine the impact on bias and precision of treatment effect estimates, we applied standard methods of analysis to different populations under analysis. Results Cluster merging produced a systematic reduction in study power. This effect depended on the number of merges and was most pronounced when variability in cluster size was at its greatest. Simulations demonstrate that the impact on analysis was minimal when cluster merges were homogeneous, with impact on study power being balanced by a change in observed intracluster correlation coefficient (ICC). We found a decrease in study power when cluster merges were heterogeneous, and the estimate of treatment effect was attenuated. Conclusions Examples of cluster merges found in previously published reports of cluster randomised trials were typically homogeneous rather than heterogeneous. Simulations demonstrated that trial findings in such cases would be unbiased. However, simulations also showed that any heterogeneous cluster merges would introduce bias that would be hard to quantify, as well as having negative impacts on the precision of estimates obtained. Further methodological development is warranted to better determine how to analyse such trials appropriately. Interim recommendations
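One way to see why merging clusters can reduce power is through the design effect. The sketch below uses a commonly cited approximation for unequal cluster sizes, DE = 1 + (Σm_i² / Σm_i − 1) × ICC, and holds the ICC fixed. It is an illustration under those assumptions, not the simulation reported in the abstract, which also found that a change in the observed ICC can offset this loss for homogeneous merges. The cluster sizes and ICC are hypothetical.

```python
def design_effect(cluster_sizes, icc):
    """Approximate design effect allowing for unequal cluster sizes."""
    total = sum(cluster_sizes)
    # Equals the mean cluster size when all clusters are the same size.
    size_term = sum(m * m for m in cluster_sizes) / total
    return 1 + (size_term - 1) * icc


def effective_sample_size(cluster_sizes, icc):
    """Number of independent observations the clustered sample is worth."""
    return sum(cluster_sizes) / design_effect(cluster_sizes, icc)


ICC = 0.05                          # hypothetical intracluster correlation
before = [20] * 10                  # ten clusters of 20 participants
after = [40, 40] + [20] * 6         # two pairs of clusters have merged

ess_before = effective_sample_size(before, ICC)
ess_after = effective_sample_size(after, ICC)
print(f"before merging: DE={design_effect(before, ICC):.2f}, ESS={ess_before:.1f}")
print(f"after merging:  DE={design_effect(after, ICC):.2f}, ESS={ess_after:.1f}")
```

With the same 200 participants, fewer and more unequal clusters give a larger design effect and therefore a smaller effective sample size.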
Co-clustering for Weblogs in Semantic Space

DEFF Research Database (Denmark)

Zong, Yu; Xu, Guandong; Dolog, Peter

2010-01-01

Web clustering is an approach for aggregating web objects into various groups according to underlying relationships among them. Finding co-clusters of web objects in semantic space is an interesting topic in the context of web usage mining, which is able to capture the underlying user navigational...... interest and content preference simultaneously. In this paper we will present a novel web co-clustering algorithm named Co-Clustering in Semantic space (COCS) to simultaneously partition web users and pages via a latent semantic analysis approach. In COCS, we first, train the latent semantic space...... of weblog data by using Probabilistic Latent Semantic Analysis (PLSA) model, and then, project all weblog data objects into this semantic space with probability distribution to capture the relationship among web pages and web users, at last, propose a clustering algorithm to generate the co...
Integrative cluster analysis in bioinformatics

CERN Document Server

Abu-Jamous, Basel; Nandi, Asoke K

2015-01-01

Clustering techniques are increasingly being put to use in the analysis of high-throughput biological datasets. Novel computational techniques to analyse high throughput data in the form of sequences, gene and protein expressions, pathways, and images are becoming vital for understanding diseases and future drug discovery. This book details the complete pathway of cluster analysis, from the basics of molecular biology to the generation of biological knowledge. The book also presents the latest clustering methods and clustering validation, thereby offering the reader a comprehensive review o
An intelligent hybrid system for surface coal mine safety analysis

Energy Technology Data Exchange (ETDEWEB)

Lilic, N.; Obradovic, I.; Cvjetic, A. [University of Belgrade, Belgrade (Serbia)]

2010-06-15

Analysis of safety in surface coal mines represents a very complex process. Published studies on mine safety analysis are usually based on research related to accidents statistics and hazard identification with risk assessment within the mining industry. Discussion in this paper is focused on the application of AI methods in the analysis of safety in mining environment. Complexity of the subject matter requires a high level of expert knowledge and great experience. The solution was found in the creation of a hybrid system PROTECTOR, whose knowledge base represents a formalization of the expert knowledge in the mine safety field. The main goal of the system is the estimation of mining environment as one of the significant components of general safety state in a mine. This global goal is subdivided into a hierarchical structure of subgoals where each subgoal can be viewed as the estimation of a set of parameters (gas, dust, climate, noise, vibration, illumination, geotechnical hazard) which determine the general mine safety state and category of hazard in mining environment. Both the hybrid nature of the system and the possibilities it offers are illustrated through a case study using field data related to an existing Serbian surface coal mine.
Accounting and Financial Data Analysis Data Mining Tools

Directory of Open Access Journals (Sweden)

Diana Elena Codreanu

2011-05-01

Full Text Available Computerized accounting systems have grown more complex in recent years because of the competitive economic environment. Data analysis solutions such as OLAP and data mining make it possible to analyse data multidimensionally, detect fraud, and discover knowledge hidden in data, yielding information that is useful for decision making within the organization. The literature offers many definitions of data mining, but they all boil down to the same idea: a process that extracts new information from large data collections, information that would be very difficult to obtain without data mining tools. Information obtained through data mining has the advantage that it not only answers the question of what is happening but also shows why certain things are happening. In this paper we present advanced techniques for the analysis and exploitation of data stored in a multidimensional database.
Review of Data Mining Techniques for Churn Prediction in Telecom

Directory of Open Access Journals (Sweden)

Vishal Mahajan

2015-12-01

service. This data can be usefully mined for churn analysis and prediction. Significant research had been undertaken by researchers worldwide to understand the data mining practices that can be used for predicting customer churn. This paper provides a review of around 100 recent journal articles starting from year 2000 to present the various data mining techniques used in multiple customer based churn models. It then summarizes the existing telecom literature by highlighting the sample size used, churn variables employed and the findings of different DM techniques. Finally, we list the most popular techniques for churn prediction in telecom as decision trees, regression analysis and clustering, thereby providing a roadmap to new researchers to build upon novel churn management models.
Security and Correctness Analysis on Privacy-Preserving k-Means Clustering Schemes

Science.gov (United States)

Su, Chunhua; Bao, Feng; Zhou, Jianying; Takagi, Tsuyoshi; Sakurai, Kouichi

Due to the fast development of the Internet and related IT technologies, it is becoming increasingly easy to access large amounts of data. k-means clustering is a powerful and frequently used technique in data mining. Many research papers about privacy-preserving k-means clustering have been published. In this paper, we analyze the existing privacy-preserving k-means clustering schemes based on cryptographic techniques. We show that those schemes cause privacy breaches and cannot output correct results because of faults in the protocol construction. Furthermore, we analyze our proposal as an option for mitigating such problems, although it still leaks intermediate information during the computation.
Analysis on safety production in coal mines Henan Province

Institute of Scientific and Technical Information of China (English)

KONG Liu-an; ZHANG Wen-yong

2006-01-01

In view of the severe safety situation in coal mine production, the paper analyzes statistical data on recent accident indices in Henan's coal mines. Using investigation and comparative analysis, it examines in detail the mining conditions, level of technical facilities, safety investment, and vocational quality of workers in Henan's coal mines. The results indicate that the main safety problems are weak safety management, low-level facilities, inadequate safety investment, and poor vocational quality of workers. Finally, it proposes solutions for reference: establishing and improving the coal mining supervision and management system, increasing safety investment in techniques and facilities, strengthening workers' safety education, and recruiting more highly qualified professionals.
Critical analysis of the Colombian mining legislation

International Nuclear Information System (INIS)

Vargas P, Elkin; Gonzalez S, Carmen Lucia

2003-01-01

The document analyses the Colombian mining legislation, Act 685 of 2001, based on the reasons expressed by the government and the miners for its conception and approval. The document tries to determine the developments achieved by this new Mining Code considering international mining competitiveness and its adaptation to the constitutional rules about environment, indigenous communities, decentralization and sustainable development. The analysis formulates general and specific hypotheses about the proposed objectives of the reform, which are confronted with the arguments and critical evaluations of the results. Most hypotheses are not verified, thus demonstrating that the Colombian mining legislation is far from being the necessary instrument to promote mining activities, making it competitive according to international standards and adapted to the principles of sustainable development, healthy environment, community participation, ethnic minorities and regional autonomy.
Text mining analysis of public comments regarding high-level radioactive waste disposal

International Nuclear Information System (INIS)

Kugo, Akihide; Yoshikawa, Hidekazu; Shimoda, Hiroshi; Wakabayashi, Yasunaga

2005-01-01

In order to narrow the risk perception gap, as seen in social investigations, between the general public and people who are involved in the nuclear industry, public comments on high-level radioactive waste (HLW) disposal have been conducted to find the significant talking points with the general public for constructing an effective risk communication model of social risk information regarding HLW disposal. Text mining was introduced to examine public comments to identify the core public interest underlying the comments. The text mining method used is to cluster specific groups of words with negative meanings and then to analyze public understanding by employing text structural analysis to extract words from subjective expressions. Using these procedures, it was found that the public does not trust the nuclear fuel cycle promotion policy and shows signs of anxiety about the long-lasting technological reliability of waste storage. To develop effective social risk communication of HLW issues, these findings are expected to help experts in the nuclear industry to communicate with the general public more effectively to obtain their trust. (author)
Clustering Game Behavior Data

DEFF Research Database (Denmark)

Bauckhage, C.; Drachen, Anders; Sifa, Rafet

2015-01-01

of the causes, the proliferation of behavioral data poses the problem of how to derive insights therefrom. Behavioral data sets can be large, time-dependent and high-dimensional. Clustering offers a way to explore such data and to discover patterns that can reduce the overall complexity of the data. Clustering...... and other techniques for player profiling and play style analysis have, therefore, become popular in the nascent field of game analytics. However, the proper use of clustering techniques requires expertise and an understanding of games is essential to evaluate results. With this paper, we address game data...... scientists and present a review and tutorial focusing on the application of clustering techniques to mine behavioral game data. Several algorithms are reviewed and examples of their application shown. Key topics such as feature normalization are discussed and open problems in the context of game analytics...
Online Analytical Processing (OLAP): A Fast and Effective Data Mining Tool for Gene Expression Databases

Directory of Open Access Journals (Sweden)

Alkharouf Nadim W.

2005-01-01

Full Text Available Gene expression databases contain a wealth of information, but current data mining tools are limited in their speed and effectiveness in extracting meaningful biological knowledge from them. Online analytical processing (OLAP) can be used as a supplement to cluster analysis for fast and effective data mining of gene expression databases. We used Analysis Services 2000, a product that ships with SQLServer2000, to construct an OLAP cube that was used to mine a time series experiment designed to identify genes associated with resistance of soybean to the soybean cyst nematode, a devastating pest of soybean. The data for these experiments is stored in the soybean genomics and microarray database (SGMD). A number of candidate resistance genes and pathways were found. Compared to traditional cluster analysis of gene expression data, OLAP was more effective and faster in finding biologically meaningful information. OLAP is available from a number of vendors and can work with any relational database management system through OLE DB.
Online analytical processing (OLAP): a fast and effective data mining tool for gene expression databases.

Science.gov (United States)

Alkharouf, Nadim W; Jamison, D Curtis; Matthews, Benjamin F

2005-06-30

Gene expression databases contain a wealth of information, but current data mining tools are limited in their speed and effectiveness in extracting meaningful biological knowledge from them. Online analytical processing (OLAP) can be used as a supplement to cluster analysis for fast and effective data mining of gene expression databases. We used Analysis Services 2000, a product that ships with SQLServer2000, to construct an OLAP cube that was used to mine a time series experiment designed to identify genes associated with resistance of soybean to the soybean cyst nematode, a devastating pest of soybean. The data for these experiments is stored in the soybean genomics and microarray database (SGMD). A number of candidate resistance genes and pathways were found. Compared to traditional cluster analysis of gene expression data, OLAP was more effective and faster in finding biologically meaningful information. OLAP is available from a number of vendors and can work with any relational database management system through OLE DB.
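The core OLAP operation behind such a cube is the roll-up: aggregating a measure over a chosen subset of dimensions. The sketch below shows that idea in plain Python rather than in Analysis Services; the dimension names (`gene`, `hours`, `line`), the measure name (`log_ratio`), and all values are hypothetical and do not come from the SGMD.

```python
from collections import defaultdict


def roll_up(facts, dims, measure):
    """Average `measure` over all fact rows that share the same values of `dims`."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for row in facts:
        key = tuple(row[d] for d in dims)  # one cell of the rolled-up cube
        sums[key] += row[measure]
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}


# Hypothetical fact table: one row per gene, time point, and plant line.
EXAMPLE_FACTS = [
    {"gene": "geneA", "hours": 6, "line": "resistant", "log_ratio": 1.0},
    {"gene": "geneA", "hours": 6, "line": "susceptible", "log_ratio": 3.0},
    {"gene": "geneA", "hours": 12, "line": "resistant", "log_ratio": 2.0},
    {"gene": "geneB", "hours": 6, "line": "resistant", "log_ratio": 4.0},
]

# Roll up to one value per gene, then drill down to gene by time point.
by_gene = roll_up(EXAMPLE_FACTS, ("gene",), "log_ratio")
by_gene_and_time = roll_up(EXAMPLE_FACTS, ("gene", "hours"), "log_ratio")
print(by_gene)
print(by_gene_and_time)
```

A real cube precomputes and indexes such aggregates, which is where the speed reported in the abstract would come from; this sketch recomputes them on every call.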
A Dimensionally Reduced Clustering Methodology for Heterogeneous Occupational Medicine Data Mining.

Science.gov (United States)

Saâdaoui, Foued; Bertrand, Pierre R; Boudet, Gil; Rouffiac, Karine; Dutheil, Frédéric; Chamoux, Alain

2015-10-01

Clustering is a set of techniques of the statistical learning aimed at finding structures of heterogeneous partitions grouping homogenous data called clusters. There are several fields in which clustering was successfully applied, such as medicine, biology, finance, economics, etc. In this paper, we introduce the notion of clustering in multifactorial data analysis problems. A case study is conducted for an occupational medicine problem with the purpose of analyzing patterns in a population of 813 individuals. To reduce the data set dimensionality, we base our approach on the Principal Component Analysis (PCA), which is the statistical tool most commonly used in factorial analysis. However, the problems in nature, especially in medicine, are often based on heterogeneous-type qualitative-quantitative measurements, whereas PCA only processes quantitative ones. Besides, qualitative data are originally unobservable quantitative responses that are usually binary-coded. Hence, we propose a new set of strategies allowing to simultaneously handle quantitative and qualitative data. The principle of this approach is to perform a projection of the qualitative variables on the subspaces spanned by quantitative ones. Subsequently, an optimal model is allocated to the resulting PCA-regressed subspaces.
Unsupervised text mining methods for literature analysis: a case study for Thomas Pynchon's V.

Directory of Open Access Journals (Sweden)

Christos Iraklis Tsatsoulis

2013-08-01

Full Text Available We investigate the use of unsupervised text mining methods for the analysis of prose literature works, using Thomas Pynchon's novel 'V.' as a case study. Our results suggest that such methods may be employed to reveal meaningful information regarding the novel’s structure. We report results using a wide variety of clustering algorithms, several distinct distance functions, and different visualization techniques. The application of a simple topic model is also demonstrated. We discuss the meaningfulness of our results along with the limitations of our approach, and we suggest some possible paths for further study.
BGDMdocker: a Docker workflow for data mining and visualization of bacterial pan-genomes and biosynthetic gene clusters

Directory of Open Access Journals (Sweden)

Gong Cheng

2017-11-01

Full Text Available Recently, Docker technology has received increasing attention throughout the bioinformatics community. However, its implementation has not yet been mastered by most biologists; accordingly, its application in biological research has been limited. In order to popularize this technology in the field of bioinformatics and to promote the use of publicly available bioinformatics tools, such as Dockerfiles and Images from communities, government sources, and private owners in the Docker Hub Registry and other Docker-based resources, we introduce here a complete and accurate bioinformatics workflow based on Docker. The present workflow enables analysis and visualization of pan-genomes and biosynthetic gene clusters of bacteria. This provides a new solution for bioinformatics mining of big data from various publicly available biological databases. The present step-by-step guide creates an integrative workflow through a Dockerfile to allow researchers to build their own Image and run Container easily.
BGDMdocker: a Docker workflow for data mining and visualization of bacterial pan-genomes and biosynthetic gene clusters.

Science.gov (United States)

Cheng, Gong; Lu, Quan; Ma, Ling; Zhang, Guocai; Xu, Liang; Zhou, Zongshan

2017-01-01

Recently, Docker technology has received increasing attention throughout the bioinformatics community. However, its implementation has not yet been mastered by most biologists; accordingly, its application in biological research has been limited. In order to popularize this technology in the field of bioinformatics and to promote the use of publicly available bioinformatics tools, such as Dockerfiles and Images from communities, government sources, and private owners in the Docker Hub Registry and other Docker-based resources, we introduce here a complete and accurate bioinformatics workflow based on Docker. The present workflow enables analysis and visualization of pan-genomes and biosynthetic gene clusters of bacteria. This provides a new solution for bioinformatics mining of big data from various publicly available biological databases. The present step-by-step guide creates an integrative workflow through a Dockerfile to allow researchers to build their own Image and run Container easily.

Clustering performance comparison using K-means and expectation maximization algorithms.

Science.gov (United States)

Jung, Yong Gyu; Kang, Min Soo; Heo, Jun

2014-11-14

Clustering is an important means of data mining based on separating data categories by similar features. Unlike the classification algorithm, clustering belongs to the unsupervised type of algorithms. Two representatives of the clustering algorithms are the K-means and the expectation maximization (EM) algorithm. Linear regression analysis was extended to the category-type dependent variable, while logistic regression was achieved using a linear combination of independent variables. To predict the possibility of occurrence of an event, a statistical approach is used. However, the classification of all data by means of logistic regression analysis cannot guarantee the accuracy of the results. In this paper, the logistic regression analysis is applied to EM clusters and the K-means clustering method for quality assessment of red wine, and a method is proposed for ensuring the accuracy of the classification results.
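For readers unfamiliar with the first of the two algorithms, here is a minimal sketch of K-means (Lloyd's iteration) in plain Python. It is not the authors' implementation: the function names, the random initialisation, and the toy points are assumptions. EM for a Gaussian mixture follows the same alternating pattern but replaces the hard assignment step with soft membership probabilities.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, iterations=100, seed=0):
    """Cluster `points` into k groups; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialise with k distinct data points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster in place
        if new_centroids == centroids:
            break  # converged: assignments can no longer change
        centroids = new_centroids
    return centroids, labels


# Toy data: two well-separated groups of three points each.
POINTS = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, labels = k_means(POINTS, k=2)
print(centroids, labels)
```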
Online Nonparametric Bayesian Activity Mining and Analysis From Surveillance Video.

Science.gov (United States)

Bastani, Vahid; Marcenaro, Lucio; Regazzoni, Carlo S

2016-05-01

A method for online incremental mining of activity patterns from the surveillance video stream is presented in this paper. The framework consists of a learning block in which Dirichlet process mixture model is employed for the incremental clustering of trajectories. Stochastic trajectory pattern models are formed using the Gaussian process regression of the corresponding flow functions. Moreover, a sequential Monte Carlo method based on Rao-Blackwellized particle filter is proposed for tracking and online classification as well as the detection of abnormality during the observation of an object. Experimental results on real surveillance video data are provided to show the performance of the proposed algorithm in different tasks of trajectory clustering, classification, and abnormality detection.
Cluster analysis in phenotyping a Portuguese population.

Science.gov (United States)

Loureiro, C C; Sa-Couto, P; Todo-Bom, A; Bousquet, J

2015-09-03

Unbiased cluster analysis using clinical parameters has identified asthma phenotypes. Adding inflammatory biomarkers to this analysis provided a better insight into the disease mechanisms. This approach has not yet been applied to asthmatic Portuguese patients. To identify phenotypes of asthma using cluster analysis in a Portuguese asthmatic population treated in secondary medical care. Consecutive patients with asthma were recruited from the outpatient clinic. Patients were optimally treated according to GINA guidelines and enrolled in the study. Procedures were performed according to a standard evaluation of asthma. Phenotypes were identified by cluster analysis using Ward's clustering method. Of the 72 patients enrolled, 57 had full data and were included for cluster analysis. Distribution was set in 5 clusters described as follows: cluster (C) 1, early onset mild allergic asthma; C2, moderate allergic asthma, with long evolution, female prevalence and mixed inflammation; C3, allergic brittle asthma in young females with early disease onset and no evidence of inflammation; C4, severe asthma in obese females with late disease onset, highly symptomatic despite low Th2 inflammation; C5, severe asthma with chronic airflow obstruction, late disease onset and eosinophilic inflammation. In our study population, the identified clusters were mainly coincident with other larger-scale cluster analysis. Variables such as age at disease onset, obesity, lung function, FeNO (Th2 biomarker) and disease severity were important for cluster distinction. Copyright © 2015. Published by Elsevier España, S.L.U.
Research on forecast technology of mine gas emission based on fuzzy data mining (FDM)

Energy Technology Data Exchange (ETDEWEB)

Xu Chang-kai; Wang Yao-cai; Wang Jun-wei [CUMT, Xuzhou (China). School of Information and Electrical Engineering

2004-07-01

Coal mine production safety can be further improved by forecasting the quantity of gas emission from the real-time and historical data saved by the gas monitoring system. Exploiting the strengths of data warehouse and data mining technology in processing large quantities of redundant data, a method for forecasting mine gas emission quantity based on FDM, and its application, were studied. The construction of a fuzzy resemblance relation and clustering analysis were proposed, through which potential relationships within the gas emission data may be found. A mode-finding model and a forecast model were presented, together with a detailed approach to realizing the forecast, which has been applied to forecast gas emission quantity efficiently.
Text Mining of Journal Articles for Sleep Disorder Terminologies.

Directory of Open Access Journals (Sweden)

Calvin Lam

Full Text Available Research on publication trends in journal articles on sleep disorders (SDs) and the associated methodologies by using text mining has been limited. The present study involved text mining for terms to determine the publication trends in sleep-related journal articles published during 2000-2013 and to identify associations between SD and methodology terms as well as conducting statistical analyses of the text mining findings. SD and methodology terms were extracted from 3,720 sleep-related journal articles in the PubMed database by using MetaMap. The extracted data set was analyzed using hierarchical cluster analyses and adjusted logistic regression models to investigate publication trends and associations between SD and methodology terms. MetaMap had a text mining precision, recall, and false positive rate of 0.70, 0.77, and 11.51%, respectively. The most common SD term was breathing-related sleep disorder, whereas narcolepsy was the least common. Cluster analyses showed similar methodology clusters for each SD term, except narcolepsy. The logistic regression models showed an increasing prevalence of insomnia, parasomnia, and other sleep disorders but a decreasing prevalence of breathing-related sleep disorder during 2000-2013. Different SD terms were positively associated with different methodology terms regarding research design terms, measure terms, and analysis terms. Insomnia-, parasomnia-, and other sleep disorder-related articles showed an increasing publication trend, whereas those related to breathing-related sleep disorder showed a decreasing trend. Furthermore, experimental studies more commonly focused on hypersomnia and other SDs and less commonly on insomnia, breathing-related sleep disorder, narcolepsy, and parasomnia. Thus, text mining may facilitate the exploration of the publication trends in SDs and the associated methodologies.
Text Mining of Journal Articles for Sleep Disorder Terminologies.

Science.gov (United States)

Lam, Calvin; Lai, Fu-Chih; Wang, Chia-Hui; Lai, Mei-Hsin; Hsu, Nanly; Chung, Min-Huey

2016-01-01

Research on publication trends in journal articles on sleep disorders (SDs) and the associated methodologies by using text mining has been limited. The present study involved text mining for terms to determine the publication trends in sleep-related journal articles published during 2000-2013 and to identify associations between SD and methodology terms as well as conducting statistical analyses of the text mining findings. SD and methodology terms were extracted from 3,720 sleep-related journal articles in the PubMed database by using MetaMap. The extracted data set was analyzed using hierarchical cluster analyses and adjusted logistic regression models to investigate publication trends and associations between SD and methodology terms. MetaMap had a text mining precision, recall, and false positive rate of 0.70, 0.77, and 11.51%, respectively. The most common SD term was breathing-related sleep disorder, whereas narcolepsy was the least common. Cluster analyses showed similar methodology clusters for each SD term, except narcolepsy. The logistic regression models showed an increasing prevalence of insomnia, parasomnia, and other sleep disorders but a decreasing prevalence of breathing-related sleep disorder during 2000-2013. Different SD terms were positively associated with different methodology terms regarding research design terms, measure terms, and analysis terms. Insomnia-, parasomnia-, and other sleep disorder-related articles showed an increasing publication trend, whereas those related to breathing-related sleep disorder showed a decreasing trend. Furthermore, experimental studies more commonly focused on hypersomnia and other SDs and less commonly on insomnia, breathing-related sleep disorder, narcolepsy, and parasomnia. Thus, text mining may facilitate the exploration of the publication trends in SDs and the associated methodologies.
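The three evaluation figures quoted for MetaMap (precision, recall, false positive rate) are standard ratios over a confusion matrix. The sketch below shows how each is computed; the function name and the counts are invented for illustration and are not the study's data.

```python
def extraction_metrics(tp, fp, fn, tn):
    """Return precision, recall, and false positive rate from confusion counts.

    tp: terms correctly extracted      fp: terms extracted in error
    fn: relevant terms that were missed tn: irrelevant terms correctly ignored
    """
    if tp + fp == 0 or tp + fn == 0 or fp + tn == 0:
        raise ValueError("each ratio needs a non-zero denominator")
    return {
        "precision": tp / (tp + fp),            # share of extracted terms that are right
        "recall": tp / (tp + fn),               # share of relevant terms that were found
        "false_positive_rate": fp / (fp + tn),  # share of irrelevant terms wrongly extracted
    }


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
metrics = extraction_metrics(tp=8, fp=2, fn=2, tn=88)
print(metrics)
```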
Clustering analysis of line indices for LAMOST spectra with AstroStat

Science.gov (United States)

Chen, Shu-Xin; Sun, Wei-Min; Yan, Qi

2018-06-01

The application of data mining in astronomical surveys, such as the Large Sky Area Multi-Object Fiber Spectroscopic Telescope (LAMOST) survey, provides an effective approach to automatically analyze a large amount of complex survey data. Unsupervised clustering could help astronomers find the associations and outliers in a big data set. In this paper, we employ the k-means method to perform clustering for the line index of LAMOST spectra with the powerful software AstroStat. Implementing the line index approach for analyzing astronomical spectra is an effective way to extract spectral features for low resolution spectra, which can represent the main spectral characteristics of stars. A total of 144 340 line indices for A type stars is analyzed through calculating their intra and inter distances between pairs of stars. For intra distance, we use the definition of Mahalanobis distance to explore the degree of clustering for each class, while for outlier detection, we define a local outlier factor for each spectrum. AstroStat furnishes a set of visualization tools for illustrating the analysis results. Checking the spectra detected as outliers, we find that most of them are problematic data and only a few correspond to rare astronomical objects. We show two examples of these outliers, a spectrum with abnormal continuum and a spectrum with emission lines. Our work demonstrates that line index clustering is a good method for examining data quality and identifying rare objects.
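The Mahalanobis distance mentioned above measures how far a point lies from a cluster centre in units that account for the cluster's spread and correlation. The sketch below restricts it to two features so that the covariance matrix can be inverted in closed form; this restriction, the function names, and the sample points are assumptions for illustration and do not reflect how AstroStat computes it.

```python
from math import sqrt
from statistics import mean


def covariance_2d(points):
    """Return the centre and sample covariance (sxx, sxy, syy) of 2-D points."""
    n = len(points)
    mx = mean(p[0] for p in points)
    my = mean(p[1] for p in points)
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return (mx, my), (sxx, sxy, syy)


def mahalanobis_2d(point, center, cov):
    """Mahalanobis distance of `point` from `center` under covariance `cov`."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy ** 2
    if det <= 0:
        raise ValueError("covariance matrix must be positive definite")
    dx = point[0] - center[0]
    dy = point[1] - center[1]
    # Uses the closed-form inverse of [[sxx, sxy], [sxy, syy]].
    d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return sqrt(d2)


# Hypothetical cluster of four points in a two-feature space.
CLUSTER = [(0, 0), (2, 0), (0, 2), (2, 2)]
center, cov = covariance_2d(CLUSTER)
print(mahalanobis_2d((3, 1), center, cov))
```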
Data Mining in Earth System Science (DMESS 2011)

Science.gov (United States)

Forrest M. Hoffman; J. Walter Larson; Richard Tran Mills; Bhorn-Gustaf Brooks; Auroop R. Ganguly; William Hargrove; et al

2011-01-01

From field-scale measurements to global climate simulations and remote sensing, the growing body of very large and long time series Earth science data are increasingly difficult to analyze, visualize, and interpret. Data mining, information theoretic, and machine learning techniques, such as cluster analysis, singular value decomposition, block entropy, Fourier and...
MINING ON CAR DATABASE EMPLOYING LEARNING AND CLUSTERING ALGORITHMS

OpenAIRE

Muhammad Rukunuddin Ghalib; Shivam Vohra; Sunish Vohra; Akash Juneja

2013-01-01

In data mining, classification is a form of data analysis that can be used to extract models describing important data classes. Two of the known learning algorithms used are Naïve Bayesian (NB) and SMO (Self-Minimal-Optimisation). These two learning algorithms are applied to a car review database to build a model that, after training, predicts the characteristic of a review comment. It was found that model successfully predicted correctly about the review comm...
Academic Performance: An Approach From Data Mining

Directory of Open Access Journals (Sweden)

David L. La Red Martinez

2012-02-01

Full Text Available The relatively low percentage of students promoted and regularized (academic success) in the Operating Systems course of the LSI (Bachelor’s Degree in Information Systems) of FaCENA (Faculty of Sciences and Natural Surveying - Facultad de Ciencias Exactas, Naturales y Agrimensura) of UNNE prompted this work, whose objective is to determine the variables that affect academic performance, considering the final status of the student according to Res. 185/03 CD (scheme for evaluation and promotion): promoted, regular or free. The variables considered are: status of the student, educational level of parents, secondary education, socio-economic level, and others. Data warehouse (DW) and data mining (DM) techniques were used to search for profiles of students and to identify situations of potential academic success or failure. Classifications were made using clustering techniques according to different criteria, including: classification mining according to academic program, according to final status of the student, and according to importance given to study; demographic clustering; and Kohonen clustering according to final status of the student. For each process, the following were produced: partition statistics, details of partitions, details of clusters, details of fields and field frequencies, overall quality and detailed quality (precision, classification, reliability, confusion matrices, gain/lift charts, trees, node distributions, field importance), field correspondence tables, and cluster statistics. Once the profiles of students with low academic performance have been determined, actions can be taken to avoid potential academic failures. This work aims to provide a brief description of aspects of the data warehouse built and of some of the data mining processes developed on it.
Privacy-preserving distributed clustering

NARCIS (Netherlands)

Erkin, Z.; Veugen, T.; Toft, T.; Lagendijk, R.L.

2013-01-01

Clustering is a very important tool in data mining and is widely used in on-line services for medical, financial and social environments. The main goal in clustering is to create sets of similar objects in a data set. The data set to be used for clustering can be owned by a single entity, or in some
Data Mining and Data Fusion for Enhanced Decision Support

Energy Technology Data Exchange (ETDEWEB)

Khan, Shiraj [ORNL; Ganguly, Auroop R [ORNL; Gupta, Amar [University of Arizona

2008-01-01

The process of Data Mining converts information to knowledge by utilizing tools from the disciplines of computational statistics, database technologies, machine learning, signal processing, nonlinear dynamics, process modeling, simulation, and allied disciplines. Data Mining allows business problems to be analyzed from diverse perspectives, including dimensionality reduction, correlation and co-occurrence, clustering and classification, regression and forecasting, anomaly detection, and change analysis. The predictive insights generated from Data Mining can be further utilized through real-time analysis and decision sciences, as well as through human-driven analysis based on management by exceptions or by objectives, to generate actionable knowledge. The tools that enable the transformation of raw data to actionable predictive insights are collectively referred to as Decision Support tools. This chapter presents a new formalization of the decision process, leading to a new Decision Superiority model, partially motivated by the Joint Directors of Laboratories (JDL) Data Fusion Model. In addition, it examines the growing importance of Data Fusion concepts.
A novel procedure on next generation sequencing data analysis using text mining algorithm.

Science.gov (United States)

Zhao, Weizhong; Chen, James J; Perkins, Roger; Wang, Yuping; Liu, Zhichao; Hong, Huixiao; Tong, Weida; Zou, Wen

2016-05-13

Next-generation sequencing (NGS) technologies have provided researchers with vast possibilities in various biological and biomedical research areas. Efficient data mining strategies are in high demand for large scale comparative and evolutionary studies to be performed on the large amounts of data derived from NGS projects. Topic modeling is an active research field in machine learning and has been mainly used as an analytical tool to structure large textual corpora for data mining. We report a novel procedure to analyse NGS data using topic modeling. It consists of four major procedures: NGS data retrieval, preprocessing, topic modeling, and data mining using Latent Dirichlet Allocation (LDA) topic outputs. The NGS data set of the Salmonella enterica strains was used as a case study to show the workflow of this procedure. The perplexity measurement of the topic numbers and the convergence efficiencies of Gibbs sampling were calculated and discussed for achieving the best result from the proposed procedure. The output topics by LDA algorithms could be treated as features of Salmonella strains to accurately describe the genetic diversity of the fliC gene in various serotypes. The results of a two-way hierarchical clustering and data matrix analysis on LDA-derived matrices successfully classified Salmonella serotypes based on the NGS data. The implementation of topic modeling in the NGS data analysis procedure provides a new way to elucidate genetic information from NGS data, and to identify gene-phenotype relationships and biomarkers, especially in the era of biological and medical big data.
Application of Learning Analytics Using Clustering Data Mining for Students' Disposition Analysis

Science.gov (United States)

Bharara, Sanyam; Sabitha, Sai; Bansal, Abhay

2018-01-01

Learning Analytics (LA) is an emerging field in which sophisticated analytic tools are used to improve learning and education. It draws from, and is closely tied to, a series of other fields of study like business intelligence, web analytics, academic analytics, educational data mining, and action analytics. The main objective of this research…
The Analysis of Object-Based Change Detection in Mining Area: a Case Study with Pingshuo Coal Mine

Science.gov (United States)

Zhang, M.; Zhou, W.; Li, Y.

2017-09-01

Accurate information on mining land use and land cover change is crucial for monitoring and environmental change studies. In this paper, a RapidEye remote sensing image (Map 2012) and a SPOT7 remote sensing image (Map 2015) of the Pingshuo mining area are used to monitor changes by combining object-based classification with change vector analysis. We also used R for mining land classification on the high-resolution remote sensing imagery, which demonstrated the feasibility and flexibility of open source software. The results show that (1) the classification of reclaimed mining land has higher precision; the overall accuracy and kappa coefficient of the classification of the change region map were 86.67 % and 89.44 %, respectively. Object-based classification combined with change vector analysis, which substantially improves monitoring accuracy, can therefore be used to monitor mining land, especially reclaimed mining land; (2) the share of vegetation in the total area fell from 46 % to 40 % between 2012 and 2015, and most of the lost vegetation was converted to arable land. The sum of arable land and vegetation area increased from 51 % to 70 %; meanwhile, built-up land increased to some extent and part of the water area was converted to arable land, but neither change was pronounced. The results illustrate the transformation of the reclaimed mining area; at the same time, some land is still being converted to mining land, which shows that the mine is still operating and that mining land use and land cover change is a dynamic process.
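The abstract above reports an overall accuracy and a kappa coefficient without showing how they are derived. As a minimal sketch, assuming the usual definitions (overall accuracy is the share of correctly classified samples, Cohen's kappa corrects that share for chance agreement), both can be computed from a confusion matrix. The matrix below is hypothetical and is not data from the paper.

```python
def accuracy_and_kappa(confusion):
    """Return (overall accuracy, Cohen's kappa) for a square confusion matrix.

    confusion[i][j] is the number of samples of reference class i
    that the classifier assigned to class j.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    # Observed agreement: fraction of samples on the diagonal.
    observed = sum(confusion[i][i] for i in range(n)) / total
    # Chance agreement: product of row and column marginals per class.
    row_totals = [sum(row) for row in confusion]
    col_totals = [sum(confusion[i][j] for i in range(n)) for j in range(n)]
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa


# Hypothetical 3-class confusion matrix (illustrative values only).
matrix = [
    [50, 3, 2],
    [4, 40, 6],
    [1, 5, 39],
]
oa, kappa = accuracy_and_kappa(matrix)
print(f"overall accuracy = {oa:.4f}, kappa = {kappa:.4f}")
```

Because kappa subtracts the agreement expected by chance, it is normally reported alongside the overall accuracy rather than in place of it.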
Trust and safety in the coal mining sector

Energy Technology Data Exchange (ETDEWEB)

Neil Gunningham; Darren Sinclair [Gunningham and Associates (Australia)

2008-08-15

This report examines the relationship between trust (and mistrust) and occupational health and safety (OHS) in the Australian coal mining sector. Previous research in Australian coal mining companies indicated that mistrust is deep-seated at a number of mines, and that these mines are usually the worst performers in terms of OHS. Mistrust also handicaps the ability of inspectors to work together with mine sites to improve OHS outcomes. Given this, there is a compelling need to understand how mistrust comes about, and to identify practical steps that can be adopted by companies, mine sites and the inspectorate to foster the development of trust. The report builds on these earlier findings by investigating trust in a much more detailed and sophisticated fashion, drawing on an in-depth analysis at mines, across a number of coal mining companies, and in two state jurisdictions. Research revealed that a 'cluster of characteristics' is associated with the formation and maintenance of mistrust at mines with a lower OHS track-record. These findings, together with an analysis of the characteristics of mines with better OHS outcomes, enabled the report to outline a variety of ways in which mines may build trust within and between management and the workforce. It also considers the at times fractious relationship between trade unions and management, and flags some of the challenges confronting these two groups in working together to improve OHS performance in the coal mining sector. Finally, the report examines the rise and impact of mistrust on the operations of the New South Wales and Queensland inspectorates, and suggests ways in which a fairer and more just enforcement policy may help foster greater trust between inspectors and mines.
Risk assessment of particle dispersion and trace element contamination from mine-waste dumps.

Science.gov (United States)

Romero, Antonio; González, Isabel; Martín, José María; Vázquez, María Auxiliadora; Ortiz, Pilar

2015-04-01

In this study, a model to delimit risk zones influenced by atmospheric particle dispersion from mine-waste dumps is developed to assess their influence on the soil and the population according to the concentration of trace elements in the waste. The model is applied to the Riotinto Mine (in SW Spain), which has a long history of mining and heavy land contamination. The waste materials are separated into three clusters according to the mapping, mineralogy, and geochemical classification using cluster analysis. Two of the clusters are composed of slag, fresh pyrite, and roasted pyrite ashes, which may contain high concentrations of trace elements (e.g., >1 % As or >4 % Pb). The average pollution load index (PLI) calculated for As, Cd, Co, Cu, Pb, Tl, and Zn versus the baseline of the regional soil is 19. The other cluster is primarily composed of sterile rocks and ochreous tailings, and the average PLI is 3. The combination of particle dispersion calculated by a Gaussian model, the PLI, the surface area of each waste and the wind direction is used to develop a risk-assessment model with Geographic Information System (GIS) software. The zone of high risk can affect the agricultural soil and the population in the study area, particularly if mining activity is restarted in the near future. This model can be applied to spatial planning and environmental protection if the information is complemented with atmospheric particulate matter studies.
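The abstract quotes pollution load index (PLI) values but not the formula. A minimal sketch, assuming the commonly used definition of the PLI as the geometric mean of per-element contamination factors (sample concentration divided by baseline concentration); whether the authors used exactly this form is an assumption. The function name and all concentrations below are hypothetical.

```python
import math


def pollution_load_index(sample, baseline):
    """Geometric mean of contamination factors (sample / baseline).

    Both arguments map element symbols to concentrations in the same units.
    """
    factors = [sample[element] / baseline[element] for element in sample]
    return math.prod(factors) ** (1.0 / len(factors))


# Hypothetical concentrations in mg/kg (illustrative only, not from the study).
baseline_soil = {"As": 20.0, "Pb": 40.0, "Zn": 100.0}
waste_sample = {"As": 200.0, "Pb": 400.0, "Zn": 100.0}

pli = pollution_load_index(waste_sample, baseline_soil)
print(f"PLI = {pli:.2f}")
```

Under this definition a PLI of 1 means the sample matches the baseline, so values well above 1 indicate enrichment relative to the regional soil.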
A novel model for Time-Series Data Clustering Based on piecewise SVD and BIRCH for Stock Data Analysis on Hadoop Platform

Directory of Open Access Journals (Sweden)

Ibgtc Bowala

2017-06-01

Full Text Available With the rapid growth of financial markets, analysts are paying more attention to prediction. Stock data are time series data and come in huge volumes. A feasible way to handle the growing amount of data is to use a cluster for parallel processing, and the Hadoop parallel computing platform is a typical representative. There are various statistical models for forecasting time series data, but accurate clusters are a prerequisite. Clustering is one of the main methods for mining time series data and underpins many other analysis processes. However, general clustering algorithms are ill-suited to time series data, because such data have a special structure, high dimensionality, and highly correlated values owing to a high noise level. A novel model for time series clustering is presented that uses BIRCH on top of piecewise SVD, which yields a novel dimension reduction approach. Highly correlated features are handled with SVD-based dimensionality reduction so as to preserve the correlated behavior as far as possible, and BIRCH is then used for clustering. The resulting model can handle massive time series data. Finally, the model is applied to real stock time series data from Yahoo Finance with satisfactory results.
plantiSMASH: automated identification, annotation and expression analysis of plant biosynthetic gene clusters

DEFF Research Database (Denmark)

Kautsar, Satria A.; Suarez Duran, Hernando G.; Blin, Kai

2017-01-01

exploration of the nature and dynamics of gene clustering in plant metabolism. Moreover, spurred by the continuing decrease in costs of plant genome sequencing, they will allow genome mining technologies to be applied to plant natural product discovery. The plantiSMASH web server, precalculated results...
Comparative Performance Of Using PCA With K-Means And Fuzzy C Means Clustering For Customer Segmentation

Directory of Open Access Journals (Sweden)

Fahmida Afrin

2015-08-01

Full Text Available Abstract Data mining is the process of analyzing data and discovering useful information; it is sometimes called knowledge discovery. Clustering groups data in such a way that the data within one cluster are similar and the data in different clusters are dissimilar. Many data mining techniques have been developed for customer segmentation, and many clustering methods have been applied to it. PCA serves as a preprocessor for Fuzzy C-means and K-means, reducing high-dimensional and noisy data. In this paper, the performance of Fuzzy C-means and K-means after applying Principal Component Analysis is analyzed on a standard dataset. The results indicate that PCA-based fuzzy clustering produces better results than PCA-based K-means and is a more stable method for customer segmentation.
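To make the contrast with hard clustering concrete, the following is a minimal sketch of the standard fuzzy C-means update rules, in which every point receives a degree of membership in every cluster. It is not the authors' implementation: the PCA step is omitted (the input is assumed to be already reduced), and the function names, the fuzzifier `m=2.0`, the starting centres and the customer data are all hypothetical.

```python
import math


def memberships_for(points, centers, m):
    """Membership of each point in each cluster; every row sums to 1."""
    exponent = 2.0 / (m - 1.0)
    rows = []
    for p in points:
        # A tiny floor avoids division by zero when a point sits on a centre.
        dists = [max(math.dist(p, c), 1e-12) for c in centers]
        rows.append([1.0 / sum((d_i / d_j) ** exponent for d_j in dists)
                     for d_i in dists])
    return rows


def fuzzy_c_means(points, centers, m=2.0, iterations=50):
    """Alternate membership and centre updates for a fixed number of rounds."""
    dims = len(points[0])
    for _ in range(iterations):
        u = memberships_for(points, centers, m)
        new_centers = []
        for i in range(len(centers)):
            # Each centre is the mean of all points weighted by membership ** m.
            weights = [row[i] ** m for row in u]
            total = sum(weights)
            new_centers.append(tuple(
                sum(w * p[d] for w, p in zip(weights, points)) / total
                for d in range(dims)))
        centers = new_centers
    return centers, memberships_for(points, centers, m)


# Hypothetical customers as (scaled spend, scaled visit frequency).
customers = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
             (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
centers, memberships = fuzzy_c_means(customers, centers=[(0.0, 0.0), (6.0, 6.0)])
for point, row in zip(customers, memberships):
    print(point, [round(value, 3) for value in row])
```

Unlike K-means, which assigns each customer to exactly one segment, the membership rows show how strongly a customer belongs to each segment, which is one reason fuzzy clustering can behave more stably on borderline cases.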

Analysis of post-blasting source mechanisms of mining-induced seismic events in Rudna copper mine, Poland

Directory of Open Access Journals (Sweden)

Caputa Alicja

2015-10-01

Full Text Available The exploitation of georesources by underground mining can be responsible for seismic activity in areas considered aseismic. Since strong seismic events are connected with rockburst hazard, it is a continuous requirement to reduce seismic risk. One of the most effective methods to do so is blasting in potentially hazardous mining panels. In this way, small to moderate tremors are provoked and stress accumulation is substantially reduced. In this paper we present an analysis of post-blasting events using full Moment Tensor (MT) inversion of data from the underground seismic network of the Rudna mine, Poland. In addition, we describe the problems we faced when analyzing seismic signals. Our studies show that focal mechanisms for events that occurred after blasts exhibit common features in the MT solution. The strong isotropic and small Double Couple (DC) components of the MT indicate that these events were provoked by detonations. On the other hand, the post-blasting MT is considerably different from the MT obtained for strong mining events. We believe that seismological analysis of provoked and unprovoked events can be a very useful tool in confirming the effectiveness of blasting in seismic hazard reduction in mining areas.
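Analisis Data Lulusan dengan Data Mining untuk Mendukung Strategi Promosi Universitas Lancang Kuning [Analysis of Graduate Data with Data Mining to Support the Promotion Strategy of Universitas Lancang Kuning]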

Directory of Open Access Journals (Sweden)

Elvira Asril

2015-11-01

Full Text Available Every company or organization that wants to survive needs to determine an appropriate promotion strategy. Choosing the right promotion strategy reduces promotion costs and reaches the right promotion targets. One way to determine a promotion strategy is to use data mining techniques. The data mining technique used here is the K-Means clustering algorithm. Clustering is the grouping of records, observations, or cases into classes of similar objects. K-Means is a non-hierarchical data clustering method that tries to partition the data into one or more clusters. The study examined several variables that universities often consider when setting their promotion targets, namely school of origin, region, and department. The result of this study is a set of interesting patterns mined from the data, which provide important information to support an appropriate promotion strategy for attracting prospective new students. Keywords: Data Mining, Clustering, K-Means
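The K-Means procedure described above alternates between assigning each record to its nearest centre and moving each centre to the mean of its records. A minimal sketch of that loop follows. It assumes the records have already been encoded as numeric vectors (the study's variables, such as school of origin and region, are categorical, and the encoding is not described); the function name, starting centres and data are hypothetical.

```python
import math


def k_means(points, centers, max_iter=100):
    """Plain K-Means: repeat assignment and update until centres stop moving."""
    centers = [tuple(c) for c in centers]
    labels = []
    for _ in range(max_iter):
        # Assignment step: index of the nearest centre for every point.
        labels = [min(range(len(centers)),
                      key=lambda i: math.dist(p, centers[i]))
                  for p in points]
        # Update step: each centre becomes the mean of its assigned points.
        new_centers = []
        for i, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:
                new_centers.append(tuple(sum(axis) / len(members)
                                         for axis in zip(*members)))
            else:
                new_centers.append(old)  # an empty cluster keeps its old centre
        if new_centers == centers:
            break
        centers = new_centers
    return centers, labels


# Hypothetical numerically encoded graduate records (illustrative only).
records = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
final_centers, labels = k_means(records, centers=[(0, 0), (10, 10)])
print(labels)
print(final_centers)
```

The result depends on the starting centres, so in practice the algorithm is usually run from several different starts and the partition with the smallest within-cluster distance is kept.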
Information mining in remote sensing imagery

Science.gov (United States)

Li, Jiang

The volume of remotely sensed imagery continues to grow at an enormous rate due to the advances in sensor technology, and our capability for collecting and storing images has greatly outpaced our ability to analyze and retrieve information from the images. This motivates us to develop image information mining techniques, which is very much an interdisciplinary endeavor drawing upon expertise in image processing, databases, information retrieval, machine learning, and software design. This dissertation proposes and implements an extensive remote sensing image information mining (ReSIM) system prototype for mining useful information implicitly stored in remote sensing imagery. The system consists of three modules: image processing subsystem, database subsystem, and visualization and graphical user interface (GUI) subsystem. Land cover and land use (LCLU) information corresponding to spectral characteristics is identified by supervised classification based on support vector machines (SVM) with automatic model selection, while textural features that characterize spatial information are extracted using Gabor wavelet coefficients. Within LCLU categories, textural features are clustered using an optimized k-means clustering approach to acquire search efficient space. The clusters are stored in an object-oriented database (OODB) with associated images indexed in an image database (IDB). A k-nearest neighbor search is performed using a query-by-example (QBE) approach. Furthermore, an automatic parametric contour tracing algorithm and an O(n) time piecewise linear polygonal approximation (PLPA) algorithm are developed for shape information mining of interesting objects within the image. A fuzzy object-oriented database based on the fuzzy object-oriented data (FOOD) model is developed to handle the fuzziness and uncertainty. Three specific applications are presented: integrated land cover and texture pattern mining, shape information mining for change detection of lakes, and
Improved Density Based Spatial Clustering of Applications of Noise Clustering Algorithm for Knowledge Discovery in Spatial Data

Directory of Open Access Journals (Sweden)

Arvind Sharma

2016-01-01

Full Text Available Many techniques are available in data mining and its subfield, spatial data mining, for understanding relationships between data objects. Collections of data objects with spatial features are called spatial databases. These relationships can be used for prediction and trend detection between spatial and nonspatial objects for social and scientific purposes. Huge data sets may be collected from different sources such as satellite images, X-rays, medical images, traffic cameras, and GIS systems. The primary purpose of this paper is to handle this large amount of data and to establish relationships within it in a defined manner with reliable results. The paper gives a complete process for understanding how spatial data differ from other kinds of data sets and how they are refined to obtain useful results and to identify trends for geographic information systems and the spatial data mining process. A new, improved clustering algorithm is designed, because clustering plays an indispensable role in spatial data mining. Clustering methods are useful in various fields of human life such as GIS (Geographic Information System), GPS (Global Positioning System), weather forecasting, air traffic control, water treatment, area selection, cost estimation, planning of rural and urban areas, remote sensing, and VLSI design. This paper presents a study of various clustering methods and algorithms and an improved version of DBSCAN, called IDBSCAN (Improved Density Based Spatial Clustering of Application of Noise). The algorithm is designed by adding some important attributes that are responsible for generating better clusters from existing data sets than other methods.
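For orientation, the baseline that the paper sets out to improve can be sketched as follows. This is plain DBSCAN, not the IDBSCAN variant, whose added attributes the abstract does not specify. The parameter names `eps` and `min_pts` follow common usage; the coordinates are hypothetical.

```python
import math

NOISE = -1


def dbscan(points, eps, min_pts):
    """Plain DBSCAN; returns one label per point (NOISE for outliers)."""
    labels = [None] * len(points)

    def neighbors(i):
        # Indices of all points within eps of point i, including i itself.
        return [j for j, q in enumerate(points)
                if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be relabelled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is a core point, keep expanding
    return labels


# Hypothetical planar coordinates: two dense groups and one isolated point.
coords = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
spatial_labels = dbscan(coords, eps=1.5, min_pts=3)
print(spatial_labels)
```

The neighbour search here is a linear scan, which is fine for a sketch; spatial data sets of realistic size would normally use a spatial index for that step.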
Grouping of Cities In Terms Of Primary Health Indicators in Turkey: An Application of Cluster Analysis

Directory of Open Access Journals (Sweden)

Bilgehan TEKİN

2015-12-01

Full Text Available Determining the differences between the cities of Turkey is considered important in the context of primary health care indicators. The subject of this study is the classification of cities in Turkey in terms of health indicators. Cluster analysis, which is one of the data mining and multivariate statistical methods, is used as the classification method. The main objective of the study is to examine the results of the transformation in health in terms of basic health indicators on a city basis. In this context, the 81 cities of Turkey are grouped using sixteen health indicators for 2013 that are assumed to demonstrate the effectiveness of health care services. The groupings are also compared with the health and socio-economic development rankings of previous studies. Provinces are grouped into 21, 13, 11, 7, and 5 clusters; the 11-, 7-, and 5-cluster solutions are identified as the most meaningful. The study reveals a development gap between the eastern and western provinces in terms of the health variables.
Cluster analysis of track structure

International Nuclear Information System (INIS)

Michalik, V.

1991-01-01

One of the possibilities of classifying track structures is application of conventional partition techniques of analysis of multidimensional data to the track structure. Using these cluster algorithms this paper attempts to find characteristics of radiation reflecting the spatial distribution of ionizations in the primary particle track. An absolute frequency distribution of clusters of ionizations giving the mean number of clusters produced by radiation per unit of deposited energy can serve as this characteristic. General computation techniques used as well as methods of calculations of distributions of clusters for different radiations are discussed. 8 refs.; 5 figs
Mining Heterogeneous Information Networks by Exploring the Power of Links

Science.gov (United States)

Han, Jiawei

Knowledge is power, but for interrelated data, knowledge is often hidden in massive links in heterogeneous information networks. We explore the power of links in mining heterogeneous information networks with several interesting tasks, including link-based object distinction, veracity analysis, multidimensional online analytical processing of heterogeneous information networks, and rank-based clustering. Some recent results of our research that explore the crucial information hidden in links will be introduced, including (1) Distinct for object distinction analysis, (2) TruthFinder for veracity analysis, (3) Infonet-OLAP for online analytical processing of information networks, and (4) RankClus for integrated ranking-based clustering. We also discuss some of our on-going studies in this direction.
Statistic analysis of grouping in evaluation of the behavior of stable chemical elements and physical-chemical parameters in effluent from uranium mining

International Nuclear Information System (INIS)

Pereira, Wagner de S.

2013-01-01

The Ore Treatment Unit (UTM) is a uranium mine that is no longer in operation. Statistical cluster analysis was used to evaluate the behavior of stable chemical elements and physico-chemical variables in its effluents. Cluster analysis proved effective in the evaluation, making it possible to identify groups among the chemical elements, among the physico-chemical variables, and in the joint analysis (elements and variables). Based on the data, there is a strong link between Ca and Mg and between Al and TR2O3 (rare earth oxides) in the UTM effluents. SO4 was also found to be strongly linked to total and dissolved solids, and these in turn to electrical conductivity. Other associations existed but were weaker. Additional sampling for seasonal evaluation is required to confirm these assessments. Additional statistical analyses (ordination techniques) should be used to help identify the origins of the groups identified in this analysis. (author)
SegMine workflows for semantic microarray data analysis in Orange4WS

Directory of Open Access Journals (Sweden)

Kulovesi Kimmo

2011-10-01

Full Text Available Abstract Background: In experimental data analysis, bioinformatics researchers increasingly rely on tools that enable the composition and reuse of scientific workflows. The utility of current bioinformatics workflow environments can be significantly increased by offering advanced data mining services as workflow components. Such services can support, for instance, knowledge discovery from diverse distributed data and knowledge sources (such as GO, KEGG, PubMed, and experimental databases). Specifically, cutting-edge data analysis approaches, such as semantic data mining, link discovery, and visualization, have not yet been made available to researchers investigating complex biological datasets. Results: We present a new methodology, SegMine, for semantic analysis of microarray data by exploiting general biological knowledge, and a new workflow environment, Orange4WS, with integrated support for web services in which the SegMine methodology is implemented. The SegMine methodology consists of two main steps. First, the semantic subgroup discovery algorithm is used to construct elaborate rules that identify enriched gene sets. Then, a link discovery service is used for the creation and visualization of new biological hypotheses. The utility of SegMine, implemented as a set of workflows in Orange4WS, is demonstrated in two microarray data analysis applications. In the analysis of senescence in human stem cells, the use of SegMine resulted in three novel research hypotheses that could improve understanding of the underlying mechanisms of senescence and identification of candidate marker genes. Conclusions: Compared to the available data analysis systems, SegMine offers improved hypothesis generation and data interpretation for bioinformatics in an easy-to-use integrated workflow environment.
Development of Database for Accident Analysis in Indian Mines

Science.gov (United States)

Tripathy, Debi Prasad; Guru Raghavendra Reddy, K.

2016-10-01

Mining is a hazardous industry, and the high accident rates associated with underground mining are a cause of deep concern. Technological developments notwithstanding, the rates of fatal accidents and reportable incidents have not shown corresponding levels of decline. This paper argues that adoption of appropriate safety standards by both mine management and the government may result in an appreciable reduction in accident frequency. This can be achieved by using technology to improve working conditions and by sensitising workers and managers to the causes and prevention of accidents. Inputs required for a detailed analysis of an accident include information on location, time, type, cost of accident, victim, nature of injury, personal and environmental factors, etc. Such information can be generated from data available in the standard coded accident report form. This paper presents a web-based application for accident analysis in Indian mines during 2001-2013. A prototype accident database (SafeStat), developed by the authors and running on an intranet that uses the TCP/IP protocol, is also discussed.
Are clusters of dietary patterns and cluster membership stable over time? Results of a longitudinal cluster analysis study.

Science.gov (United States)

Walthouwer, Michel Jean Louis; Oenema, Anke; Soetens, Katja; Lechner, Lilian; de Vries, Hein

2014-11-01

Developing nutrition education interventions based on clusters of dietary patterns can only be done adequately when it is clear if distinctive clusters of dietary patterns can be derived and reproduced over time, if cluster membership is stable, and if it is predictable which type of people belong to a certain cluster. Hence, this study aimed to: (1) identify clusters of dietary patterns among Dutch adults, (2) test the reproducibility of these clusters and stability of cluster membership over time, and (3) identify sociodemographic predictors of cluster membership and cluster transition. This study had a longitudinal design with online measurements at baseline (N=483) and 6 months follow-up (N=379). Dietary intake was assessed with a validated food frequency questionnaire. A hierarchical cluster analysis was performed, followed by a K-means cluster analysis. Multinomial logistic regression analyses were conducted to identify the sociodemographic predictors of cluster membership and cluster transition. At baseline and follow-up, a comparable three-cluster solution was derived, distinguishing a healthy, moderately healthy, and unhealthy dietary pattern. Male and lower educated participants were significantly more likely to have a less healthy dietary pattern. Further, 251 (66.2%) participants remained in the same cluster, 45 (11.9%) participants changed to an unhealthier cluster, and 83 (21.9%) participants shifted to a healthier cluster. Men and people living alone were significantly more likely to shift toward a less healthy dietary pattern. Distinctive clusters of dietary patterns can be derived. Yet, cluster membership is unstable and only few sociodemographic factors were associated with cluster membership and cluster transition. These findings imply that clusters based on dietary intake may not be suitable as a basis for nutrition education interventions. Copyright © 2014 Elsevier Ltd. All rights reserved.
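The stability figures in this abstract can be checked with a few lines of arithmetic. The sketch below recomputes the reported shares from the reported counts (251, 45 and 83 of the 379 follow-up participants) and shows one way such counts could be tallied from paired cluster labels. The ordering of the three patterns, the function names and the toy label lists are assumptions made for illustration.

```python
from collections import Counter

# Assumed ordering of the three dietary patterns, from least to most healthy.
RANK = {"unhealthy": 0, "moderately healthy": 1, "healthy": 2}


def classify_transitions(baseline, follow_up):
    """Count participants who stayed put or moved to a healthier/unhealthier cluster."""
    counts = Counter()
    for before, after in zip(baseline, follow_up):
        delta = RANK[after] - RANK[before]
        if delta == 0:
            counts["same"] += 1
        elif delta > 0:
            counts["healthier"] += 1
        else:
            counts["unhealthier"] += 1
    return counts


def shares(counts):
    """Percentage of the total for each category, rounded to one decimal."""
    total = sum(counts.values())
    return {key: round(100 * value / total, 1) for key, value in counts.items()}


# Counts as reported in the abstract (N = 379 at follow-up).
reported = {"same": 251, "unhealthier": 45, "healthier": 83}
print(shares(reported))

# Toy paired labels (hypothetical) to show the tallying step.
toy_before = ["healthy", "unhealthy", "moderately healthy", "healthy"]
toy_after = ["healthy", "moderately healthy", "unhealthy", "healthy"]
toy_counts = classify_transitions(toy_before, toy_after)
print(dict(toy_counts))
```

The recomputed shares agree with the percentages quoted in the abstract, and the three counts add up to the follow-up sample size.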
CytoCluster: A Cytoscape Plugin for Cluster Analysis and Visualization of Biological Networks.

Science.gov (United States)

Li, Min; Li, Dongyan; Tang, Yu; Wu, Fangxiang; Wang, Jianxin

2017-08-31

Nowadays, cluster analysis of biological networks has become one of the most important approaches to identifying functional modules as well as predicting protein complexes and network biomarkers. Furthermore, the visualization of clustering results is crucial to display the structure of biological networks. Here we present CytoCluster, a Cytoscape plugin integrating six clustering algorithms, namely HC-PIN (Hierarchical Clustering algorithm in Protein Interaction Networks), OH-PIN (identifying Overlapping and Hierarchical modules in Protein Interaction Networks), IPCA (Identifying Protein Complex Algorithm), ClusterONE (Clustering with Overlapping Neighborhood Expansion), DCU (Detecting Complexes based on Uncertain graph model), and IPC-MCE (Identifying Protein Complexes based on Maximal Complex Extension), together with the BinGO (the Biological networks Gene Ontology) function. Users can select different clustering algorithms according to their requirements. The main function of these six clustering algorithms is to detect protein complexes or functional modules. In addition, BinGO is used to determine which Gene Ontology (GO) categories are statistically overrepresented in a set of genes or a subgraph of a biological network. CytoCluster can be easily expanded, so that more clustering algorithms and functions can be added to this plugin. Since it was created in July 2013, CytoCluster has been downloaded more than 9700 times in the Cytoscape App store and has already been applied to the analysis of different biological networks. CytoCluster is available from http://apps.cytoscape.org/apps/cytocluster.
Knowledge Discovery and Data Mining in Iran's Climatic Researches

Science.gov (United States)

Karimi, Mostafa

2013-04-01

Advances in measurement technology and data collection have made databases ever larger, and large databases require powerful tools for data analysis. The iterative process of acquiring knowledge from information obtained through data processing takes various forms in all scientific fields. However, when the data volume is large, traditional methods cannot cope with many of the problems. In recent years, the use of databases has expanded in various scientific fields, especially atmospheric databases in climatology. In addition, the growing amount of data generated by climate models makes it a challenge to analyze these data and extract hidden patterns and knowledge. The approach taken to this problem in recent years is the process of knowledge discovery and data mining, which draws on concepts from machine learning, artificial intelligence, and expert (professional) systems. Data mining is an analytical process for exploring massive volumes of data. The ultimate goal of data mining is access to information and, finally, knowledge. Climatology is a branch of science that uses varied and massive volumes of data. The goal of climate data mining is to obtain information from varied and massive atmospheric and non-atmospheric data. In fact, knowledge discovery performs these activities in a logical, predetermined, and almost automatic process. The goal of this research is to study the use of knowledge discovery and data mining techniques in Iranian climate research. To achieve this goal, the studies were examined through descriptive content analysis and classified by method and by issue. The results show that, in the climatic research of Iran, clustering, k-means, and Ward's method were the most frequently applied techniques and that, in terms of issues, precipitation and atmospheric circulation patterns were the most frequently addressed. Although several studies on geography and climate issues have been done with statistical techniques such as clustering and pattern extraction, owing to the nature of statistics and data mining one cannot say for
MANNER OF STOCKS SORTING USING CLUSTER ANALYSIS METHODS

Directory of Open Access Journals (Sweden)

Jana Halčinová

2014-06-01

Full Text Available The aim of the present article is to show the possibility of using the methods of cluster analysis in the classification of stocks of finished products. Cluster analysis creates groups (clusters) of finished products according to similarity in demand, i.e. customer requirements for each product. The manner of sorting stocks of finished products by clusters is described in a practical example. The resulting clusters are incorporated into the draft layout of the distribution warehouse.
Using data mining and OLAP to discover patterns in a database of patients with Y-chromosome deletions.

Science.gov (United States)

Dzeroski, S; Hristovski, D; Peterlin, B

2000-01-01

The paper presents a database of published Y chromosome deletions and the results of analyzing the database with data mining techniques. The database describes 382 patients for whom 177 different markers were tested: 364 of the 382 patients had deletions. Two data mining techniques, clustering and decision tree induction, were used. Clustering was used to group patients according to the overall presence/absence of deletions at the tested markers. Decision trees and On-Line Analytical Processing (OLAP) were used to inspect the resulting clustering and look for correlations between deletion patterns, populations and the clinical picture of infertility. The results of the analysis indicate that there are correlations between deletion patterns and patient populations, as well as clinical phenotype severity.
The Evaluation on Data Mining Methods of Horizontal Bar Training Based on BP Neural Network

Directory of Open Access Journals (Sweden)

Zhang Yanhui

2015-01-01

Full Text Available With the rapid development of science and technology, data analysis has become an indispensable part of people’s work and life. Horizontal bar training has multiple categories. A research emphasis for practitioners in the field is that the categories of training and matches should be reduced. The application of data mining methods is discussed based on the problem of reducing the categories of horizontal bar training. The BP neural network is applied to cluster analysis and principal component analysis, which are used to evaluate horizontal bar training. The two data mining methods are analyzed from two aspects, namely the operational convenience of data mining and the rationality of the results. It turns out that principal component analysis is more suitable for data processing of horizontal bar training.
Mining High-Dimensional Data

Science.gov (United States)

Wang, Wei; Yang, Jiong

With the rapid growth of computational biology and e-commerce applications, high-dimensional data has become very common. Thus, mining high-dimensional data is an urgent problem of great practical importance. However, there are some unique challenges for mining data of high dimensions, including (1) the curse of dimensionality and, more crucially, (2) the meaningfulness of the similarity measure in high-dimensional space. In this chapter, we present several state-of-the-art techniques for analyzing high-dimensional data, e.g., frequent pattern mining, clustering, and classification. We will discuss how these methods deal with the challenges of high dimensionality.
Exact WKB analysis and cluster algebras

International Nuclear Information System (INIS)

Iwaki, Kohei; Nakanishi, Tomoki

2014-01-01

We develop the mutation theory in the exact WKB analysis using the framework of cluster algebras. Under a continuous deformation of the potential of the Schrödinger equation on a compact Riemann surface, the Stokes graph may change the topology. We call this phenomenon the mutation of Stokes graphs. Along the mutation of Stokes graphs, the Voros symbols, which are monodromy data of the equation, also mutate due to the Stokes phenomenon. We show that the Voros symbols mutate as variables of a cluster algebra with surface realization. As an application, we obtain the identities of Stokes automorphisms associated with periods of cluster algebras. The paper also includes an extensive introduction of the exact WKB analysis and the surface realization of cluster algebras for nonexperts. This article is part of a special issue of Journal of Physics A: Mathematical and Theoretical devoted to ‘Cluster algebras in mathematical physics’. (paper)
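CLASSIFYING THE CHARACTERISTICS OF NEW STUDENTS IN CHOOSING A STUDY PROGRAM USING CLUSTER ANALYSIS (PENGKLASIFIKASIAN KARAKTERISTIK MAHASISWA BARU DALAM MEMILIH PROGRAM STUDI MENGGUNAKAN ANALISIS CLUSTER)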

Directory of Open Access Journals (Sweden)

Maxsi Ary

2016-03-01

Full Text Available Abstract - Object clustering is one of the processes of object mining; it aims to partition existing objects into one or more clusters based on their characteristics. Private universities are an alternative for the community in meeting the rising demand for education. The number of private universities is quite large, particularly in Bandung and in Indonesia generally. The number of universities and the means they use to attract prospective students make an interesting subject of study. Because competition for new students is intense, some of the methods used are arguably unnecessary. The problem addressed is classifying the characteristics of new students in choosing a study program using cluster analysis. Data were obtained from a questionnaire given to prospective new students in February 2014 and processed using SPSS. The SPSS analysis is intended to make it easier to describe the characteristics of each group of new students in choosing a study program. Keywords: Clustering, characteristics of students, courses, cluster analysis
Data Mining and Analysis

Science.gov (United States)

Samms, Kevin O.

2015-01-01

The Data Mining project seeks to bring the capability of data visualization to NASA anomaly and problem reporting systems for the purpose of improving data trending, evaluations, and analyses. Currently NASA systems are tailored to meet the specific needs of its organizations. This tailoring has led to a variety of nomenclatures and levels of annotation for procedures, parts, and anomalies making difficult the realization of the common causes for anomalies. Making significant observations and realizing the connection between these causes without a common way to view large data sets is difficult to impossible. In the first phase of the Data Mining project a portal was created to present a common visualization of normalized sensitive data to customers with the appropriate security access. The tool of the visualization itself was also developed and fine-tuned. In the second phase of the project we took on the difficult task of searching and analyzing the target data set for common causes between anomalies. In the final part of the second phase we have learned more about how much of the analysis work will be the job of the Data Mining team, how to perform that work, and how that work may be used by different customers in different ways. In this paper I detail how our perspective has changed after gaining more insight into how the customers wish to interact with the output and how that has changed the product.

Data clustering algorithms and applications

CERN Document Server

Aggarwal, Charu C

2013-01-01

Research on the problem of clustering tends to be fragmented across the pattern recognition, database, data mining, and machine learning communities. Addressing this problem in a unified way, Data Clustering: Algorithms and Applications provides complete coverage of the entire area of clustering, from basic methods to more refined and complex data clustering approaches. It pays special attention to recent issues in graphs, social networks, and other domains. The book focuses on three primary aspects of data clustering: Methods, describing key techniques commonly used for clustering, such as fea
Application of data mining techniques for nuclear data and instrumentation

International Nuclear Information System (INIS)

Toshniwal, Durga

2013-01-01

Data mining is defined as the discovery of previously unknown, valid, novel, potentially useful, and understandable patterns in large databases. It encompasses many different techniques and algorithms which differ in the kinds of data that can be analyzed and the form of knowledge representation used to convey the discovered knowledge. Patterns in the data can be represented in many different forms, including classification rules, association rules, clusters, etc. Data mining thus deals with the discovery of hidden trends and patterns from large quantities of data. The field of data mining is emerging as a new, fundamental research area with important applications to science, engineering, medicine, business, and education. It is an interdisciplinary research area and draws upon several roots, including database systems, machine learning, information systems, statistics and expert systems. Data mining, when performed on time series data, is known as time series data mining (TSDM). A time series is a sequence of real numbers, each number representing a value at a point of time. During the past few years, there has been an explosion of research in the area of time series data mining. This includes attempts to model time series data, to design languages to query such data, and to develop access structures to efficiently process queries on such data. Time series data arises naturally in many real-world applications. Efficient discovery of knowledge through time series data mining can be helpful in several domains such as: Stock market analysis, Weather forecasting etc. An important application area of data mining techniques is in nuclear power plant and related data. Nuclear power plant data can be represented in form of time sequences. Often it may be of prime importance to analyze such data to find trends and anomalies. The general goals of data mining include feature extraction, similarity search, clustering and classification, association rule mining and anomaly
Performance Analysis Tool for HPC and Big Data Applications on Scientific Clusters

Energy Technology Data Exchange (ETDEWEB)

Yoo, Wucherl [Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States); Koo, Michelle [Univ. of California, Berkeley, CA (United States); Cao, Yu [California Inst. of Technology (CalTech), Pasadena, CA (United States); Sim, Alex [Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States); Nugent, Peter [Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States); Univ. of California, Berkeley, CA (United States); Wu, Kesheng [Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)

2016-09-17

Big data is prevalent in HPC computing. Many HPC projects rely on complex workflows to analyze terabytes or petabytes of data. These workflows often require running over thousands of CPU cores and performing simultaneous data accesses, data movements, and computation. It is challenging to analyze the performance involving terabytes or petabytes of workflow data or measurement data of the executions, from complex workflows over a large number of nodes and multiple parallel task executions. To help identify performance bottlenecks or debug the performance issues in large-scale scientific applications and scientific clusters, we have developed a performance analysis framework, using state-of-the-art open-source big data processing tools. Our tool can ingest system logs and application performance measurements to extract key performance features, and apply the most sophisticated statistical tools and data mining methods on the performance data. It utilizes an efficient data processing engine to allow users to interactively analyze a large amount of different types of logs and measurements. To illustrate the functionality of the big data analysis framework, we conduct case studies on the workflows from an astronomy project known as the Palomar Transient Factory (PTF) and the job logs from the genome analysis scientific cluster. Our study processed many terabytes of system logs and application performance measurements collected on the HPC systems at NERSC. The implementation of our tool is generic enough to be used for analyzing the performance of other HPC systems and Big Data workflows.
Analysis of queuing mine-cars affecting shaft station radon concentrations in Quzhou uranium mine, eastern China

Directory of Open Access Journals (Sweden)

Changshou Hong

2018-04-01

Full Text Available Shaft stations of underground uranium mines in China are not only utilized as waiting space for loaded mine-cars queuing to be hoisted but also as the principal channel for fresh air taken to working places. Therefore, an assessment of how mine-car queuing processes affect shaft station radon concentration was carried out. The queuing network of mine-cars was analyzed in an underground uranium mine located in Quzhou, Zhejiang province, Eastern China. On the basis of a mathematical analysis of the queue network, a MATLAB-based quasi-random number generating program utilizing Monte-Carlo methods was developed. Extensive simulations were then implemented via MATLAB operating on a DELL PC. Thereafter, theoretical calculations and field measurements of shaft station radon concentrations for several working conditions were performed. The queuing performance measures of interest, like average queuing length and waiting time, were found to be significantly affected by the utilization rate (positively correlated). However, even with respect to the “worst case”, the shaft station radon concentration was always lower than 200 Bq/m³. The model predictions were compared with the measuring results, and a satisfactory agreement was noted. Under current working conditions, queuing-induced variations of shaft station radon concentration of the study mine are not remarkable. Keywords: Hoist and Transport Systems, Mine-cars, Queuing Simulation, Radon Concentration, Underground Uranium Mine
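The reported positive correlation between utilization rate and waiting time can be illustrated with a small Monte Carlo queue simulation. The sketch below is in Python rather than the MATLAB used in the study, and it assumes a single hoist with exponentially distributed arrival gaps and hoisting times (an M/M/1 queue); the abstract does not specify the actual queue network, so the rates and the function name `mean_wait` are purely hypothetical.

```python
import random

def mean_wait(arrival_rate, service_rate, n_cars=20000, seed=0):
    """Estimate the mean time a mine-car waits before hoisting starts.

    Uses the Lindley recursion W[n+1] = max(0, W[n] + S[n] - T[n+1]),
    where S is a hoisting (service) time and T the gap to the next arrival.
    Exponential S and T are an assumption of this sketch.
    """
    rng = random.Random(seed)
    wait = 0.0
    total = 0.0
    for _ in range(n_cars):
        total += wait
        service = rng.expovariate(service_rate)
        gap = rng.expovariate(arrival_rate)
        wait = max(0.0, wait + service - gap)
    return total / n_cars

# Utilization rate = arrival_rate / service_rate (hypothetical values).
wait_low = mean_wait(arrival_rate=0.5, service_rate=1.0)   # utilization 0.5
wait_high = mean_wait(arrival_rate=0.9, service_rate=1.0)  # utilization 0.9
```

For this assumed model the theoretical mean wait is ρ/(μ − λ), i.e. 1.0 at utilization 0.5 and 9.0 at utilization 0.9, so the simulated wait should grow sharply as utilization approaches 1.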
From virtual clustering analysis to self-consistent clustering analysis: a mathematical study

Science.gov (United States)

Tang, Shaoqiang; Zhang, Lei; Liu, Wing Kam

2018-03-01

In this paper, we propose a new homogenization algorithm, virtual clustering analysis (VCA), as well as provide a mathematical framework for the recently proposed self-consistent clustering analysis (SCA) (Liu et al. in Comput Methods Appl Mech Eng 306:319-341, 2016). In the mathematical theory, we clarify the key assumptions and ideas of VCA and SCA, and derive the continuous and discrete Lippmann-Schwinger equations. Based on a key postulation of "once response similarly, always response similarly", clustering is performed in an offline stage by machine learning techniques (k-means and SOM), and facilitates substantial reduction of computational complexity in an online predictive stage. The clear mathematical setup allows for the first time a convergence study of clustering refinement in one space dimension. Convergence is proved rigorously, and found to be of second order from numerical investigations. Furthermore, we propose to suitably enlarge the domain in VCA, such that the boundary terms may be neglected in the Lippmann-Schwinger equation, by virtue of Saint-Venant's principle. In contrast, they were not obtained in the original SCA paper, and we discover these terms may well be responsible for the numerical dependency on the choice of reference material property. Since VCA enhances the accuracy by overcoming the modeling error, and reduces the numerical cost by avoiding an outer loop iteration for attaining the material property consistency in SCA, its efficiency is expected to be even higher than that of the recently proposed SCA algorithm.
Cluster Analysis of Customer Reviews Extracted from Web Pages

Directory of Open Access Journals (Sweden)

S. Shivashankar

2010-01-01

Full Text Available As e-commerce gains popularity day by day, the web has become an excellent source from which market researchers can gather customer reviews and opinions. The number of customer reviews that a product receives is growing at a very fast rate (it could be in the hundreds or thousands). Customer reviews posted on websites vary greatly in quality. A potential customer has to read all the reviews, irrespective of their quality, to decide whether or not to purchase the product. In this paper, we make an attempt to assess a review based on its quality, to help the customer make a proper buying decision. The quality of a customer review is assessed as most significant, more significant, significant, or insignificant. A novel and effective web mining technique is proposed for assessing a customer review of a particular product based on feature clustering techniques, namely the k-means method and the fuzzy c-means method. This is performed in three steps: (1) identify review regions and extract reviews from them, (2) extract and cluster the features of reviews by a clustering technique and then assign weights to the features belonging to each of the clusters (groups), and (3) assess the review by considering the feature weights and group belongingness. The k-means and fuzzy c-means clustering techniques are implemented and tested on customer reviews extracted from web pages. The performance of these techniques is analyzed.
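Fuzzy c-means, one of the two clustering techniques named in this abstract, can be sketched in a few lines. The example below uses the standard membership and center updates on one-dimensional data; the feature scores, the initial centers, and the function name `fuzzy_c_means` are assumptions for illustration and do not come from the paper.

```python
def fuzzy_c_means(xs, centers, m=2.0, iters=50):
    """Standard fuzzy c-means on 1-D data.

    Membership update: u[j][i] = 1 / sum_k (d_ji / d_jk) ** (2 / (m - 1))
    Center update:     c[i]    = sum_j u[j][i]**m * x_j / sum_j u[j][i]**m
    """
    power = 2.0 / (m - 1.0)
    u = []
    for _ in range(iters):
        u = []
        for x in xs:
            d = [abs(x - c) for c in centers]
            if 0.0 in d:
                # The point coincides with a center: give it full membership there.
                row = [1.0 if di == 0.0 else 0.0 for di in d]
                total = sum(row)
                row = [r / total for r in row]
            else:
                row = [1.0 / sum((di / dk) ** power for dk in d) for di in d]
            u.append(row)
        centers = [
            sum(u[j][i] ** m * xs[j] for j in range(len(xs)))
            / sum(u[j][i] ** m for j in range(len(xs)))
            for i in range(len(centers))
        ]
    return u, centers

# Hypothetical review-feature scores in [0, 1] and hypothetical initial centers.
scores = [0.10, 0.15, 0.20, 0.80, 0.85, 0.90]
memberships, centers = fuzzy_c_means(scores, centers=[0.0, 1.0])
```

Unlike k-means, each score receives a degree of membership in every cluster, and the memberships of one score sum to 1; this is what makes a notion such as "group belongingness" available for weighting.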
Real Options Analysis of Mining Projects

OpenAIRE

Rudolf Zdravlje

2011-01-01

When long life assets are being evaluated based on constant predictions of future variables and the assumptions of zero management flexibility, is value being missed? In project evaluation today, the most common evaluation methods that calculate a net present value are discounted cash flow (DCF) analysis, decision tree analysis and Monte Carlo simulation. A fourth method, which is beginning to gain ground in terms of its use in the mining industry, is real option analysis (ROA). ROA utilizes ...
Analysis of Bonds as an Instrument for Financing Mining Investments

Science.gov (United States)

Ranosz, Robert

2017-06-01

The purpose of this article is to examine the structure of financing for mining enterprises in the years 2007-2013, with particular emphasis on bonds. The document pays special attention to Polish mining enterprises. The financing structure analysis was based on data collected from financial statements (cash flows) of the largest mining companies in Poland, and their comparison with the results of global mining enterprises pursuant to reports prepared by international advisory firms. The article takes into account capital sources such as: corporate bonds, bank loans and issue of shares. As indicated by the performed analysis, mining enterprises both around the world and in Poland are increasingly eager to take advantage of obtaining business financing from issue of corporate bonds. It should also be recognized that in the analyzed period, both global and Polish mining enterprises deviate from forms of financing such as issue of shares. This may be caused by the fact that the bonds market in Poland is becoming increasingly popular, mainly due to interest rate on bonds being lower in comparison with bank loans. Another reason may be that banks and potential buyers of shares are less eager to finance this type of investment due to a relatively substantial risk acceptable to bondholders.
antiSMASH 2.0-a versatile platform for genome mining of secondary metabolite producers

NARCIS (Netherlands)

Blin, Kai; Medema, Marnix H.; Kazempour, Daniyal; Fischbach, Michael A.; Breitling, Rainer; Takano, Eriko; Weber, Tilmann

Microbial secondary metabolites are a potent source of antibiotics and other pharmaceuticals. Genome mining of their biosynthetic gene clusters has become a key method to accelerate their identification and characterization. In 2011, we developed antiSMASH, a web-based analysis platform that
Analysis on present radon ventilation situation of Chinese uranium mines

International Nuclear Information System (INIS)

Li Xianjie; Hu Penghua

2010-01-01

Mine ventilation is the most important means of lowering radon levels in uranium mines. At present, for the same investment in protective ventilation, the radon and radon daughter concentrations in the underground air of Chinese uranium mines using the compaction methodology are 3∼5 times higher than those of advanced uranium mines abroad. Through an analysis of the status of ventilation-based radon reduction in Chinese uranium mines and a comparison of the advantages and shortcomings of various ventilation and radon reduction approaches, this paper explains the reasons for the higher radon and radon daughter concentrations in Chinese uranium mines and identifies problems in three areas: ventilation radon reduction theory, ventilation radon reduction measures, and ventilation management. In response to these problems, the paper puts forward proposals and measures such as strengthening examination, verification, and monitoring of the actual situation; drawing up a clear ventilation plan; strictly following the mining sequence; vigorously training ventilation technicians; enhancing ventilation system management; developing research on ventilation radon reduction technology in uranium mines; and installing ventilation equipment as soon as possible. (authors)
DrugQuest - a text mining workflow for drug association discovery.

Science.gov (United States)

Papanikolaou, Nikolas; Pavlopoulos, Georgios A; Theodosiou, Theodosios; Vizirianakis, Ioannis S; Iliopoulos, Ioannis

2016-06-06

Text mining and data integration methods are gaining ground in the field of health sciences due to the exponential growth of bio-medical literature and information stored in biological databases. While such methods mostly try to extract bioentity associations from PubMed, very few of them are dedicated to mining other types of repositories such as chemical databases. Herein, we apply a text mining approach on the DrugBank database in order to explore drug associations based on the DrugBank "Description", "Indication", "Pharmacodynamics" and "Mechanism of Action" text fields. We apply Named Entity Recognition (NER) techniques on these fields to identify chemicals, proteins, genes, pathways, diseases, and we utilize the TextQuest algorithm to find additional biologically significant words. Using a plethora of similarity and partitional clustering techniques, we group the DrugBank records based on their common terms and investigate possible scenarios why these records are clustered together. Different views such as clustered chemicals based on their textual information, tag clouds consisting of Significant Terms along with the terms that were used for clustering are delivered to the user through a user-friendly web interface. DrugQuest is a text mining tool for knowledge discovery: it is designed to cluster DrugBank records based on text attributes in order to find new associations between drugs. The service is freely available at http://bioinformatics.med.uoc.gr/drugquest.
Spatiotemporal analysis of changes in lode mining claims around the McDermitt Caldera, northern Nevada and southern Oregon

Science.gov (United States)

Coyan, Joshua; Zientek, Michael L.; Mihalasky, Mark J.

2017-01-01

Resource managers and agencies involved with planning for future federal land needs are required to complete an assessment of and forecast for future land use every ten years. Predicting mining activities on federal lands is difficult as current regulations do not require disclosure of exploration results. In these cases, historic mining claims may serve as a useful proxy for determining where mining-related activities may occur. We assess the utility of using a space–time cube (STC) and associated analyses to evaluate and characterize mining claim activities around the McDermitt Caldera in northern Nevada and southern Oregon. The most significant advantage of arranging the mining claim data into a STC is the ability to visualize and compare the data, which allows scientists to better understand patterns and results. Additional analyses of the STC (i.e., Trend, Emerging Hot Spot, Hot Spot, and Cluster and Outlier Analyses) provide extra insights into the data and may aid in predicting future mining claim activities.
The modernisation of mining

CSIR Research Space (South Africa)

Ritchken, E

2017-10-01

Full Text Available mechanisms that will entrench the collaboration. The Phakisa had the task of developing collaborative solutions in response to the mining cluster challenges. • Operates through threat • Company focus • Objective is to comply • Company acts... in relative isolation. • Focus on ticking boxes • Focus on individual, easy to measure, projects of limited ambition • Funding through mining company balance sheet. • Creativity unlocked in finding loopholes in compliance framework – does...
Applications of Clustering

Indian Academy of Sciences (India)

Biology – medical imaging, bioinformatics, ecology, phylogenies problems etc. Market research. Data Mining. Social Networks. Any problem measuring similarity/correlation. (dimensions represent different parameters)
A Study of Data Mining for Customer Relationship Management in Microfinance Institutions (Kajian Data Mining Customer Relationship Management pada Lembaga Keuangan Mikro)

Directory of Open Access Journals (Sweden)

Tikaridha Hardiani

2016-01-01

Full Text Available Companies must be ready to face increasingly intense competition from other companies, and microfinance institutions are no exception. This intensifying competition has led many microfinance institutions to look for profitable strategies that distinguish them from the others. One strategy that can be applied is implementing Customer Relationship Management (CRM) together with data mining. Data mining can be used by microfinance institutions that have sufficiently large data. Determining potential customers through customer segmentation can support decisions on the marketing strategy to be implemented. This paper discusses several data mining techniques that can be used for customer segmentation. The proposed data mining technique is fuzzy clustering with the fuzzy C-Means algorithm and fuzzy RFM. Keywords: Customer relationship management; Data mining; Fuzzy clustering; Micro-finance institutions; Fuzzy C-Means; Fuzzy RFM
Robust cluster analysis and variable selection

CERN Document Server

Ritter, Gunter

2014-01-01

Clustering remains a vibrant area of research in statistics. Although there are many books on this topic, there are relatively few that are well founded in the theoretical aspects. In Robust Cluster Analysis and Variable Selection, Gunter Ritter presents an overview of the theory and applications of probabilistic clustering and variable selection, synthesizing the key research results of the last 50 years. The author focuses on the robust clustering methods he found to be the most useful on simulated data and real-time applications. The book provides clear guidance for the varying needs of bot
Examining Mobile Learning Trends 2003-2008: A Categorical Meta-Trend Analysis Using Text Mining Techniques

Science.gov (United States)

Hung, Jui-Long; Zhang, Ke

2012-01-01

This study investigated the longitudinal trends of academic articles in Mobile Learning (ML) using text mining techniques. One hundred and nineteen (119) refereed journal articles and proceedings papers from the SCI/SSCI database were retrieved and analyzed. The taxonomies of ML publications were grouped into twelve clusters (topics) and four…
Analysis of Occupational Accidents in Underground and Surface Mining in Spain Using Data-Mining Techniques.

Science.gov (United States)

Sanmiquel, Lluís; Bascompta, Marc; Rossell, Josep M; Anticoi, Hernán Francisco; Guash, Eduard

2018-03-07

An analysis of occupational accidents in the mining sector was conducted using data from the Spanish Ministry of Employment and Social Safety between 2005 and 2015, and data-mining techniques were applied. Data were processed with the Weka software. Two scenarios were chosen from the accidents database: surface and underground mining. The most important variables involved in occupational accidents and their association rules were determined. These rules are composed of several predictor variables that cause accidents, defining their characteristics and context. This study presents the 20 most important association rules in the sector (either surface or underground mining), based on the statistical confidence levels of each rule as obtained by Weka. The outcomes display the most typical immediate causes, along with the percentage of accidents with a basis in each association rule. The most important immediate cause is body movement with physical effort or overexertion, and the type of accident is physical effort or overexertion. On the other hand, the second most important immediate cause and type of accident are different between the two scenarios. Data-mining techniques were chosen as a useful tool to find out the root cause of the accidents.
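The confidence level by which association rules are ranked can be computed directly from record counts. The sketch below uses plain Python instead of Weka, and the five accident records, their attribute labels, and the function names `support` and `confidence` are invented for illustration; they are not taken from the Spanish accident database.

```python
def support(records, items):
    """Fraction of records that contain every item in `items`."""
    items = set(items)
    return sum(1 for r in records if items <= r) / len(records)

def confidence(records, antecedent, consequent):
    """confidence(A -> C) = support(A and C) / support(A)."""
    both = set(antecedent) | set(consequent)
    return support(records, both) / support(records, antecedent)

# Hypothetical accident records: each is a set of attribute=value labels.
records = [
    {"site=underground", "cause=overexertion", "type=overexertion"},
    {"site=underground", "cause=overexertion", "type=overexertion"},
    {"site=underground", "cause=overexertion", "type=fall"},
    {"site=surface", "cause=overexertion", "type=overexertion"},
    {"site=surface", "cause=loss_of_control", "type=struck_by_object"},
]

rule_if = {"site=underground", "cause=overexertion"}
rule_then = {"type=overexertion"}
rule_support = support(records, rule_if | rule_then)
rule_confidence = confidence(records, rule_if, rule_then)
```

Here the rule holds in 2 of the 3 records that match its antecedent, so its confidence is 2/3, while its support is 2 of 5 records overall. Ranking rules by confidence alone can favor rules with very low support, which is why both measures are usually reported together.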
Granular-relational data mining: how to mine relational data in the paradigm of granular computing?

CERN Document Server

Hońko, Piotr

2017-01-01

This book provides two general granular computing approaches to mining relational data, the first of which uses abstract descriptions of relational objects to build their granular representation, while the second extends existing granular data mining solutions to a relational case. Both approaches make it possible to perform and improve popular data mining tasks such as classification, clustering, and association discovery. How can different relational data mining tasks best be unified? How can the construction process of relational patterns be simplified? How can richer knowledge from relational data be discovered? All these questions can be answered in the same way: by mining relational data in the paradigm of granular computing! This book will allow readers with previous experience in the field of relational data mining to discover the many benefits of its granular perspective. In turn, those readers familiar with the paradigm of granular computing will find valuable insights on its application to mining r...
Mining the SDSS SkyServer SQL queries log

Science.gov (United States)

Hirota, Vitor M.; Santos, Rafael; Raddick, Jordan; Thakar, Ani

2016-05-01

SkyServer, the Internet portal for the Sloan Digital Sky Survey (SDSS) astronomic catalog, provides a set of tools that allows data access for astronomers and scientific education. One of SkyServer's data access interfaces allows users to enter ad-hoc SQL statements to query the catalog. SkyServer also presents some template queries that can be used as a basis for more complex queries. This interface has logged over 330 million queries submitted since 2001. It is expected that analysis of this data can be used to investigate usage patterns, identify potential new classes of queries, find similar queries, etc. and to shed some light on how users interact with the Sloan Digital Sky Survey data and how scientists have adopted the new paradigm of e-Science, which could in turn lead to enhancements to the user interfaces and experience in general. In this paper we review some approaches to SQL query mining, apply the traditional techniques used in the literature and present lessons learned, namely, that the general text mining approach for feature extraction and clustering does not seem to be adequate for this type of data, and, most importantly, we find that this type of analysis can result in very different queries being clustered together.

Multivariate spatial condition mapping using subtractive fuzzy cluster means.

Science.gov (United States)

Sabit, Hakilo; Al-Anbuky, Adnan

2014-10-13

Wireless sensor networks are usually deployed for monitoring given physical phenomena taking place in a specific space and over a specific duration of time. The spatio-temporal distribution of these phenomena often correlates to certain physical events. To appropriately characterise these events-phenomena relationships over a given space for a given time frame, we require continuous monitoring of the conditions. WSNs are perfectly suited for these tasks, due to their inherent robustness. This paper presents a subtractive fuzzy cluster means algorithm and its application in data stream mining for wireless sensor systems over a cloud-computing-like architecture, which we call sensor cloud data stream mining. Benchmarking on standard mining algorithms, the k-means and the FCM algorithms, we have demonstrated that the subtractive fuzzy cluster means model can perform high quality distributed data stream mining tasks comparable to centralised data stream mining.
Supporting Solar Physics Research via Data Mining

Science.gov (United States)

Angryk, Rafal; Banda, J.; Schuh, M.; Ganesan Pillai, K.; Tosun, H.; Martens, P.

2012-05-01

In this talk we will briefly introduce three pillars of data mining (i.e. frequent patterns discovery, classification, and clustering), and discuss some possible applications of known data mining techniques which can directly benefit solar physics research. In particular, we plan to demonstrate applicability of frequent patterns discovery methods for the verification of hypotheses about co-occurrence (in space and time) of filaments and sigmoids. We will also show how classification/machine learning algorithms can be utilized to verify human-created software modules to discover individual types of solar phenomena. Finally, we will discuss applicability of clustering techniques to image data processing.
Pharmacovigilance from social media: mining adverse drug reaction mentions using sequence labeling with word embedding cluster features.

Science.gov (United States)

Nikfarjam, Azadeh; Sarker, Abeed; O'Connor, Karen; Ginn, Rachel; Gonzalez, Graciela

2015-05-01

Social media is becoming increasingly popular as a platform for sharing personal health-related information. This information can be utilized for public health monitoring tasks, particularly for pharmacovigilance, via the use of natural language processing (NLP) techniques. However, the language in social media is highly informal, and user-expressed medical concepts are often nontechnical, descriptive, and challenging to extract. There has been limited progress in addressing these challenges, and thus far, advanced machine learning-based NLP techniques have been underutilized. Our objective is to design a machine learning-based approach to extract mentions of adverse drug reactions (ADRs) from highly informal text in social media. We introduce ADRMine, a machine learning-based concept extraction system that uses conditional random fields (CRFs). ADRMine utilizes a variety of features, including a novel feature for modeling words' semantic similarities. The similarities are modeled by clustering words based on unsupervised, pretrained word representation vectors (embeddings) generated from unlabeled user posts in social media using a deep learning technique. ADRMine outperforms several strong baseline systems in the ADR extraction task by achieving an F-measure of 0.82. Feature analysis demonstrates that the proposed word cluster features significantly improve extraction performance. It is possible to extract complex medical concepts, with relatively high performance, from informal, user-generated content. Our approach is particularly scalable, suitable for social media mining, as it relies on large volumes of unlabeled data, thus diminishing the need for large, annotated training data sets. © The Author 2015. Published by Oxford University Press on behalf of the American Medical Informatics Association.
A cost-benefit analysis of landfill mining and material recycling in China

International Nuclear Information System (INIS)

Zhou, Chuanbin; Gong, Zhe; Hu, Junsong; Cao, Aixin; Liang, Hanwen

2015-01-01

Highlights: • Assessing the economic feasibility of landfill mining. • We applied a cost-benefit analysis model for landfill mining. • Four material cycling and energy recovery scenarios were designed. • We used net present value to evaluate the cost-benefit efficiency. - Abstract: Landfill mining is an environmentally-friendly technology that combines the concepts of material recycling and sustainable waste management, and it has received a great deal of worldwide attention because of its significant environmental and economic potential in material recycling, energy recovery, land reclamation and pollution prevention. This work applied a cost-benefit analysis model for assessing the economic feasibility, which is important for promoting landfill mining. The model includes eight indicators of costs and nine indicators of benefits. Four landfill mining scenarios were designed and analyzed based on field data. The economic feasibility of landfill mining was then evaluated by the indicator of net present value (NPV). According to our case study of a typical old landfill mining project in China (Yingchun landfill), rental of excavation and hauling equipment, waste processing and material transportation were the top three costs of landfill mining, accounting for 88.2% of the total cost, and the average cost per unit of stored waste was 12.7 USD ton⁻¹. The top three benefits of landfill mining were electricity generation by incineration, land reclamation and recycling soil-like materials. The NPV analysis of the four different scenarios indicated that the Yingchun landfill mining project could obtain a net positive benefit varying from 1.92 million USD to 16.63 million USD. However, the NPV was sensitive to the mode of land reuse, the availability of energy recovery facilities and the possibility of obtaining financial support by avoiding post-closure care.
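The abstract above evaluates scenarios by net present value. As a minimal sketch of the standard NPV formula, the Python below discounts a series of yearly net cash flows. The discount rate and cash flows are hypothetical placeholders, not figures from the Yingchun case study, and the function name is an illustrative assumption.

```python
from typing import Sequence


def net_present_value(rate: float, cash_flows: Sequence[float]) -> float:
    """Sum of cash_flows[t] / (1 + rate) ** t, with t = 0 for the first entry."""
    return sum(cf / (1.0 + rate) ** t for t, cf in enumerate(cash_flows))


# Hypothetical scenario: an up-front cost followed by two years of net benefits.
example_flows = [-100.0, 60.0, 60.0]
example_npv = net_present_value(0.10, example_flows)
```

A positive result means the discounted benefits outweigh the discounted costs under the assumed rate, and a different rate or cash-flow timing can change the sign.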
A cost-benefit analysis of landfill mining and material recycling in China

Energy Technology Data Exchange (ETDEWEB)

Zhou, Chuanbin, E-mail: cbzhou@rcees.ac.cn; Gong, Zhe; Hu, Junsong; Cao, Aixin; Liang, Hanwen

2015-01-15

Highlights: • Assessing the economic feasibility of landfill mining. • We applied a cost-benefit analysis model for landfill mining. • Four material cycling and energy recovery scenarios were designed. • We used net present value to evaluate the cost-benefit efficiency. - Abstract: Landfill mining is an environmentally-friendly technology that combines the concepts of material recycling and sustainable waste management, and it has received a great deal of worldwide attention because of its significant environmental and economic potential in material recycling, energy recovery, land reclamation and pollution prevention. This work applied a cost-benefit analysis model for assessing the economic feasibility, which is important for promoting landfill mining. The model includes eight indicators of costs and nine indicators of benefits. Four landfill mining scenarios were designed and analyzed based on field data. The economic feasibility of landfill mining was then evaluated by the indicator of net present value (NPV). According to our case study of a typical old landfill mining project in China (Yingchun landfill), rental of excavation and hauling equipment, waste processing and material transportation were the top three costs of landfill mining, accounting for 88.2% of the total cost, and the average cost per unit of stored waste was 12.7 USD ton⁻¹. The top three benefits of landfill mining were electricity generation by incineration, land reclamation and recycling soil-like materials. The NPV analysis of the four different scenarios indicated that the Yingchun landfill mining project could obtain a net positive benefit varying from 1.92 million USD to 16.63 million USD. However, the NPV was sensitive to the mode of land reuse, the availability of energy recovery facilities and the possibility of obtaining financial support by avoiding post-closure care.
Cluster analysis

OpenAIRE

Mucha, Hans-Joachim; Sofyan, Hizir

2000-01-01

As an explorative technique, cluster analysis provides a description or a reduction in the dimension of the data. It classifies a set of observations into two or more mutually exclusive unknown groups based on combinations of many variables. Its aim is to construct groups in such a way that the profiles of objects in the same groups are relatively homogeneous whereas the profiles of objects in different groups are relatively heterogeneous. Clustering is distinct from classification techniques, ...
Clusters, orders, and trees: methods and applications in honor of Boris Mirkin's 70th birthday

CERN Document Server

Goldengorin, Boris; Pardalos, Panos

2014-01-01

The volume is dedicated to Boris Mirkin on the occasion of his 70th birthday. In addition to his startling PhD results in abstract automata theory, Mirkin’s groundbreaking contributions in various fields of decision making and data analysis have marked the fourth quarter of the 20th century and beyond. Mirkin has done pioneering work in group choice, clustering, data mining and knowledge discovery aimed at finding and describing non-trivial or hidden structures (first of all, clusters, orderings, and hierarchies) in multivariate and/or network data. This volume contains a collection of papers reflecting recent developments rooted in Mirkin's fundamental contribution to the state-of-the-art in group choice, ordering, clustering, data mining, and knowledge discovery. Researchers, students, and software engineers will benefit from new knowledge discovery techniques and application directions.
Clustering to Determine Regional Crime Potential in the City of Banjarbaru Using the K-Means Method

Directory of Open Access Journals (Sweden)

Sri Rahayu

2016-09-01

Full Text Available Abstract Within the police, the data held in databases can be used to produce crime reports, anticipate future crimes, and so on. Because so much data is stored, data mining can process it to find knowledge that is useful to the police. One well-known data mining technique is clustering. The purposes of clustering can be divided into two: clustering for understanding and clustering for use. K-Means is among the simplest and most common clustering methods. It is a non-hierarchical (partitioning) method that seeks to partition the existing data into two or more groups, so that data with the same characteristics are put into the same group and data with different characteristics are put into other groups. The purpose of this grouping is to minimize the objective function set in the grouping process, which generally seeks to minimize the variation within a group and maximize the variation between groups. The data mined to determine the clustering of crime potential by area in the city of Banjarbaru are crime data held by the Banjarbaru city police. This study thus aims to examine the stages of the clustering technique and to build a clustering that determines areas of potential crime in the city of Banjarbaru. Keywords: Clustering, Data mining, K-Means, K-Means Clustering
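The partitioning idea described in the abstract above can be sketched in a few lines of Python. This is a generic K-Means (Lloyd's iteration) under stated assumptions: the initial centers are supplied by the caller, distances are Euclidean, and the sample points are hypothetical rather than crime data from the study.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def nearest(p: Point, centers: Sequence[Point]) -> int:
    """Index of the center closest to p (Euclidean distance)."""
    return min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))


def kmeans(
    points: Sequence[Point], centers: Sequence[Point], max_iter: int = 100
) -> Tuple[List[Point], List[int]]:
    """Alternate assignment and mean-update steps until the centers stop moving."""
    current = [tuple(c) for c in centers]
    for _ in range(max_iter):
        clusters: List[List[Point]] = [[] for _ in current]
        for p in points:
            clusters[nearest(p, current)].append(p)
        # An empty cluster keeps its previous center.
        updated = [
            tuple(sum(col) / len(members) for col in zip(*members))
            if members
            else current[i]
            for i, members in enumerate(clusters)
        ]
        if updated == current:
            break
        current = updated
    labels = [nearest(p, current) for p in points]
    return current, labels


# Hypothetical 2-D points forming two well-separated groups.
sample = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
final_centers, final_labels = kmeans(sample, [(1.0, 1.0), (8.0, 8.0)])
```

Each pass reduces or keeps the within-group variation, which is the objective the abstract refers to. Results in general depend on the initial centers, which this sketch leaves to the caller.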
Data mining analysis of Professor Liu Shangyi’s prescription characteristics in clinical medicine for the treatment of cancer patients with stomachache

Directory of Open Access Journals (Sweden)

Wen-Qi Huang

2018-01-01

Full Text Available Objective: To analyze National Chinese Medicine Master Liu Shangyi’s prescription characteristics of clinical medicine for the treatment of cancer patients with stomachache. Methods: Data on prescriptions for cancer patients with stomachache between January 2014 and July 2016 were collected. The composing principles were analyzed by unsupervised data mining methods, including the Apriori algorithm for association rules and complex system entropy clustering. Results: Based on the analysis of 120 prescriptions, the frequency of each herb and the association rules among the herbs were computed. Four core combinations and two new prescriptions were mined from the database. Compared with before treatment, the clinical symptomatic grading of stomachache after treatment was lower (P < 0.001). Conclusion: Professor Liu has been successful in the treatment of cancer patients with stomachache by prescribing medication that aids in activating blood circulation, removing dampness, and alleviating pain.
Clustering with Obstacles in Spatial Databases

OpenAIRE

El-Zawawy, Mohamed A.; El-Sharkawi, Mohamed E.

2009-01-01

Clustering large spatial databases is an important problem, which tries to find the densely populated regions in a spatial area to be used in data mining, knowledge discovery, or efficient information retrieval. However most algorithms have ignored the fact that physical obstacles such as rivers, lakes, and highways exist in the real world and could thus affect the result of the clustering. In this paper, we propose CPO, an efficient clustering technique to solve the problem of clustering in ...
Scalable Density-Based Subspace Clustering

DEFF Research Database (Denmark)

Müller, Emmanuel; Assent, Ira; Günnemann, Stephan

2011-01-01

For knowledge discovery in high dimensional databases, subspace clustering detects clusters in arbitrary subspace projections. Scalability is a crucial issue, as the number of possible projections is exponential in the number of dimensions. We propose a scalable density-based subspace clustering method that steers mining to few selected subspace clusters. Our novel steering technique reduces subspace processing by identifying and clustering promising subspaces and their combinations directly. Thereby, it narrows down the search space while maintaining accuracy. Thorough experiments on real and synthetic databases show that steering is efficient and scalable, with high quality results. For future work, our steering paradigm for density-based subspace clustering opens research potential for speeding up other subspace clustering approaches as well.
Lead exposure from soil in Peruvian mining towns: a national assessment supported by two contrasting examples.

Science.gov (United States)

van Geen, Alexander; Bravo, Carolina; Gil, Vladimir; Sherpa, Shaky; Jack, Darby

2012-12-01

To estimate the population of Peru living in the vicinity of active or former mining operations that could be exposed to lead from contaminated soil. Geographic coordinates were compiled for 113 active mines, 138 ore processing plants and 3 smelters, as well as 7743 former mining sites. The population living within 5 km of these sites was calculated from census data for 2000. In addition, the lead content of soil in the historic mining town of Cerro de Pasco and around a recent mine and ore processing plant near the city of Huaral was mapped in 2009 using a hand-held X-ray fluorescence analyser. Spatial analysis indicated that 1.6 million people in Peru could be living within 5 km of an active or former mining operation. Two thirds of the population potentially exposed was accounted for by 29 clusters of mining operations, each with a population of over 10 000. These clusters included 112 active and 3438 former mining operations. Soil lead levels exceeded 1200 mg/kg, a reference standard for residential soil, in 35 of 74 sites tested in Cerro de Pasco but in only 4 of 47 sites tested around the newer operations near Huaral. Soil contamination with lead is likely to be extensive in Peruvian mining towns but the level of contamination is spatially far from uniform. Childhood exposure by soil ingestion could be substantially reduced by mapping soil lead levels, making this information public and encouraging local communities to isolate contaminated areas from children.
Advances in Machine Learning and Data Mining for Astronomy

Science.gov (United States)

Way, Michael J.; Scargle, Jeffrey D.; Ali, Kamal M.; Srivastava, Ashok N.

2012-03-01

Advances in Machine Learning and Data Mining for Astronomy documents numerous successful collaborations among computer scientists, statisticians, and astronomers who illustrate the application of state-of-the-art machine learning and data mining techniques in astronomy. Due to the massive amount and complexity of data in most scientific disciplines, the material discussed in this text transcends traditional boundaries between various areas in the sciences and computer science. The book's introductory part provides context to issues in the astronomical sciences that are also important to health, social, and physical sciences, particularly probabilistic and statistical aspects of classification and cluster analysis. The next part describes a number of astrophysics case studies that leverage a range of machine learning and data mining technologies. In the last part, developers of algorithms and practitioners of machine learning and data mining show how these tools and techniques are used in astronomical applications. With contributions from leading astronomers and computer scientists, this book is a practical guide to many of the most important developments in machine learning, data mining, and statistics. It explores how these advances can solve current and future problems in astronomy and looks at how they could lead to the creation of entirely new algorithms within the data mining community.
An AK-LDMeans algorithm based on image clustering

Science.gov (United States)

Chen, Huimin; Li, Xingwei; Zhang, Yongbin; Chen, Nan

2018-03-01

Clustering is an effective analytical technique for mining value from unlabeled data. Its ultimate goal is to label unclassified data quickly and correctly. We use the roadmap for current image processing as the experimental background. In this paper, we propose an AK-LDMeans algorithm that automatically locks the K value by designing the Kcost fold line, and then uses the long-distance high-density method to select the clustering centers, replacing the traditional method of selecting initial clustering centers, which further improves the efficiency and accuracy of the traditional K-Means algorithm. The experimental results are compared with those of current clustering algorithms. The algorithm can provide an effective reference in the fields of image processing, machine vision and data mining.
Reduct Driven Pattern Extraction from Clusters

Directory of Open Access Journals (Sweden)

Shuchita Upadhyaya

2009-03-01

Full Text Available Clustering algorithms give a general description of clusters, listing the number of clusters and the member entities in those clusters. However, these algorithms fall short in generating cluster descriptions in the form of patterns. From a data mining perspective, pattern learning from clusters is as important as cluster finding. In the proposed approach, reducts derived from rough set theory are employed for pattern formulation. Further, a reduct is the set of attributes that distinguishes the entities in a homogeneous cluster, so these attributes can be removed outright from it. The remaining attributes are then ranked by their contribution to the cluster. A pattern is formulated as the conjunction of the most contributing attributes, such that the pattern distinctively describes the cluster with minimum error.
XML documents cluster research based on frequent subpatterns

Science.gov (United States)

Ding, Tienan; Li, Wei; Li, Xiongfei

2015-12-01

XML data is widely used for information exchange on the Internet, and XML document clustering is a hot research topic. In the XML document clustering process, measuring the differences between two XML documents is time-consuming and reduces the efficiency of clustering. This paper proposes an XML document clustering method based on the frequent patterns of an XML document dataset. It first proposes a coding tree structure for encoding XML documents, which translates frequent pattern mining from XML documents into frequent pattern mining from strings. It then clusters the XML document dataset by frequent patterns, using cosine similarity and cohesive (agglomerative) hierarchical clustering. Because the frequent patterns are subsets of the original XML document data, the time consumed by the XML document similarity measure is reduced. Experiments run on synthetic and real datasets show that the method is efficient.
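The similarity step mentioned in the abstract above can be illustrated with a short Python sketch. It assumes, purely for illustration, that each document has already been reduced to counts of the frequent patterns it contains. The pattern strings and document names are hypothetical, and the paper's coding-tree encoding is not reproduced here.

```python
import math
from collections import Counter


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse pattern-count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a document with no patterns is similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical documents described by the frequent patterns found in them.
doc_a = Counter({"book/title": 1, "book/author": 1})
doc_b = Counter({"book/title": 1, "book/price": 1})
doc_c = Counter({"order/item": 1})

sim_ab = cosine_similarity(doc_a, doc_b)  # one shared pattern out of two each
sim_ac = cosine_similarity(doc_a, doc_c)  # no shared patterns
```

Comparing short pattern vectors rather than whole document trees is what makes each pairwise comparison cheap, which matches the efficiency argument in the abstract.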
Planning, implementation and analysis of mine-surveying measurements to detect rock movements at the Asse salt mine

International Nuclear Information System (INIS)

Hensel, G.

1991-01-01

At the Asse pit, a former salt mine, research has been carried out since 1965, mainly on the ultimate disposal of radioactive wastes. Within this framework a mine-surveying measurement program has been developed to detect local and extensive rock movements in the mine structure and on the surface. The rock observation program consists of surface levelling, levelling in the mine structure, measurement of shaft depth, shaft sounding, position and gyroscopic measurements, as well as cavity convergence and extensometer measurements. The results of the measuring program are taken into account in judging stability. The subject of this work is to analyse the position measurements by priority, to find out to what extent the results, that is the horizontal displacement components, are interpretable. The analysis is carried out according to the rules of adjustment computation ("compensating calculation"), by means of a rigorous adjustment of indirect ("mediating") observations. (HS) [de
Nonlinear coupling analysis of coal seam floor during mining based on FLAC3D

Institute of Scientific and Technical Information of China (English)

YAO Duo-xi; XU Ji-ying; LU Hai-feng

2011-01-01

Based on the hydro-geological conditions of the 1028 mining face in Suntuan Coal Mine, the mining seepage-strain mechanism of the seam floor was simulated by a nonlinear coupling method using the fluid-solid coupling analysis module of FLAC3D. The results indicate that the permeability coefficient of the adjoining rock changes greatly due to mining. The maximum value reaches 1 379.9 times the original value, at the immediate roof of the mined-out area. According to the analysis of the seepage field, mining does not destroy the water resistance of the floor aquiclude. The mining fissures do not connect with the limestone aquifer, so damage is unlikely to form. The plastic zone does not exactly correspond to the seepage area, and the scope of the altered seepage area is much larger than the plastic zone.
The analysis of strategies for the mining regions’ development in Russia as a condition of effective management of economy

Directory of Open Access Journals (Sweden)

Zaruba Natalya

2017-01-01

Full Text Available The article considers conceptual issues of a new approach to the strategic management of coal-mining region development as a condition of effective government regulation of the economy at the macro level. The purpose of the study is to justify the use of marketing techniques in the strategic management of the region, clustering on the basis of territorial concentration and the combination of all available resources, and the integration of regional economic networks. A comparative analysis of the main strategic directions of development of the coal-mining regions is carried out from the point of view of the leading economic development strategies. The main result is an estimate of the value of the synergy effects that occur when the resources of sectors and industries in the region are combined. The results of the study can be recommended for use in developing strategies for the sustainable development of a «mono-territory».
Statistical and Machine-Learning Data Mining Techniques for Better Predictive Modeling and Analysis of Big Data

CERN Document Server

Ratner, Bruce

2011-01-01

The second edition of a bestseller, Statistical and Machine-Learning Data Mining: Techniques for Better Predictive Modeling and Analysis of Big Data is still the only book, to date, to distinguish between statistical data mining and machine-learning data mining. The first edition, titled Statistical Modeling and Analysis for Database Marketing: Effective Techniques for Mining Big Data, contained 17 chapters of innovative and practical statistical data mining techniques. In this second edition, renamed to reflect the increased coverage of machine-learning data mining techniques, the author has

A Case Study for Student Performance Analysis based on Educational Data Mining (EDM)

OpenAIRE

Daxa Kundariya; Prof. Vaseem Ghada

2016-01-01

Educational Data Mining (EDM) is a study methodology and an application of data mining techniques to students' data from academic databases. Like other domains, the educational domain also produces a vast amount of study data. To enhance the quality of the education system, student performance analysis plays an important role in decision support. This paper elaborates a study of various educational data mining techniques and how they could be applied to the educational system to analyze student perfor...
Decision-making on the integration of renewable energy in the mining industry: A case studies analysis, a cost analysis and a SWOT analysis

Directory of Open Access Journals (Sweden)

Kateryna Zharan

2017-01-01

Full Text Available The mining industry is showing increasing interest in using renewable energy (RE) technologies as one of the principles of sustainable mining. This is witnessed in several pilot projects in major mining countries around the world. Positive factors which favor this interest are gaining importance and negative barrier factors seem to be less relevant. For a mine operator, the switch from fossil fuel to RE technologies is the outcome of decision making processes. So far, research about such decision making on the use of RE in mining is underdeveloped. The purpose of this paper is to present a practical decision rule based on a principle of indifference between RE and fossil fuel technologies and on appropriate time management. To achieve this objective, three investigations are made: (i) a case studies analysis, (ii) a comparative cost analysis, and (iii) a SWOT analysis.
Clustering high dimensional data

DEFF Research Database (Denmark)

Assent, Ira

2012-01-01

High-dimensional data, i.e., data described by a large number of attributes, pose specific challenges to clustering. The so-called ‘curse of dimensionality’, coined originally to describe the general increase in complexity of various computational problems as dimensionality increases, is known to render traditional clustering algorithms ineffective. The curse of dimensionality, among other effects, means that with increasing number of dimensions, a loss of meaningful differentiation between similar and dissimilar objects is observed. As high-dimensional objects appear almost alike, new approaches for clustering are required. Consequently, recent research has focused on developing techniques and clustering algorithms specifically for high-dimensional data. Still, open research issues remain. Clustering is a data mining task devoted to the automatic grouping of data based on mutual similarity. Each cluster...
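The loss of differentiation described in the abstract above can be observed with a small experiment. The Python sketch below is an illustration under stated assumptions (points drawn uniformly from the unit hypercube, Euclidean distance, a fixed random seed). It measures the relative contrast between the farthest and nearest neighbour of a query point, and the function name is hypothetical.

```python
import math
import random


def relative_contrast(dim: int, n_points: int = 200, seed: int = 0) -> float:
    """(max distance - min distance) / min distance from one random query point."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = [
        math.dist(query, [rng.random() for _ in range(dim)])
        for _ in range(n_points)
    ]
    return (max(dists) - min(dists)) / min(dists)


contrast_low = relative_contrast(2)     # few dimensions: neighbours differ a lot
contrast_high = relative_contrast(500)  # many dimensions: distances concentrate
```

When the contrast approaches zero, the nearest and farthest points are almost equally far away, which is why distance-based clustering in the full space loses its meaning.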
Working with Data: Discovering Knowledge through Mining and Analysis; Systematic Knowledge Management and Knowledge Discovery; Text Mining; Methodological Approach in Discovering User Search Patterns through Web Log Analysis; Knowledge Discovery in Databases Using Formal Concept Analysis; Knowledge Discovery with a Little Perspective.

Science.gov (United States)

Qin, Jian; Jurisica, Igor; Liddy, Elizabeth D.; Jansen, Bernard J; Spink, Amanda; Priss, Uta; Norton, Melanie J.

2000-01-01

These six articles discuss knowledge discovery in databases (KDD). Topics include data mining; knowledge management systems; applications of knowledge discovery; text and Web mining; text mining and information retrieval; user search patterns through Web log analysis; concept analysis; data collection; and data structure inconsistency. (LRW)
Clustering Trajectories by Relevant Parts for Air Traffic Analysis.

Science.gov (United States)

Andrienko, Gennady; Andrienko, Natalia; Fuchs, Georg; Garcia, Jose Manuel Cordero

2018-01-01

Clustering of trajectories of moving objects by similarity is an important technique in movement analysis. Existing distance functions assess the similarity between trajectories based on properties of the trajectory points or segments. The properties may include the spatial positions, times, and thematic attributes. There may be a need to focus the analysis on certain parts of trajectories, i.e., points and segments that have particular properties. According to the analysis focus, the analyst may need to cluster trajectories by similarity of their relevant parts only. Throughout the analysis process, the focus may change, and different parts of trajectories may become relevant. We propose an analytical workflow in which interactive filtering tools are used to attach relevance flags to elements of trajectories, clustering is done using a distance function that ignores irrelevant elements, and the resulting clusters are summarized for further analysis. We demonstrate how this workflow can be useful for different analysis tasks in three case studies with real data from the domain of air traffic. We propose a suite of generic techniques and visualization guidelines to support movement data analysis by means of relevance-aware trajectory clustering.
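The idea of a distance function that ignores irrelevant elements can be sketched as follows. This is only an illustration under strong assumptions: trajectories are equal-length lists of (x, y, relevant) tuples, points are compared position by position, and the distance is the mean Euclidean distance over the pairs where both points are flagged relevant. The abstract does not specify the authors' actual distance function, so the names and the formula are hypothetical.

```python
import math
from typing import List, Tuple

# (x, y, relevant) where `relevant` is the flag attached by interactive filtering.
FlaggedPoint = Tuple[float, float, bool]


def relevant_distance(
    traj_a: List[FlaggedPoint], traj_b: List[FlaggedPoint]
) -> float:
    """Mean distance over position pairs in which both points are relevant."""
    pairs = [(p, q) for p, q in zip(traj_a, traj_b) if p[2] and q[2]]
    if not pairs:
        return math.inf  # nothing comparable: treat the trajectories as far apart
    return sum(math.dist(p[:2], q[:2]) for p, q in pairs) / len(pairs)


# Hypothetical trajectories; the last point of traj_1 is flagged irrelevant.
traj_1 = [(0.0, 0.0, True), (1.0, 0.0, True), (50.0, 50.0, False)]
traj_2 = [(0.0, 1.0, True), (1.0, 1.0, True), (2.0, 1.0, True)]

d_relevant = relevant_distance(traj_1, traj_2)
```

Because the outlying third point is flagged irrelevant, it does not affect the result. Changing the flags changes which trajectories look similar, which mirrors the shifting analysis focus described in the abstract.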
Reliability analysis of mining equipment: A case study of a crushing plant at Jajarm Bauxite Mine in Iran

International Nuclear Information System (INIS)

Barabady, Javad; Kumar, Uday

2008-01-01

The performance of mining machines depends on the reliability of the equipment used, the operating environment, the maintenance efficiency, the operation process, the technical expertise of the miners, etc. As the size and complexity of mining equipment continue to increase, the implications of equipment failure become ever more critical. Therefore, reliability analysis is required to identify the bottlenecks in the system and to find the components or subsystems with low reliability for a given designed performance. It is important to select a suitable method for data collection as well as for reliability analysis. This paper presents a case study describing reliability and availability analysis of crushing plant number 3 at Jajarm Bauxite Mine in Iran. In this study, crushing plant number 3 is divided into six subsystems. The parameters of some probability distributions, such as the Weibull, Exponential, and Lognormal distributions, have been estimated using ReliaSoft's Weibull++6 software. The results of the analysis show that the conveyer subsystem and secondary screen subsystem are critical from a reliability point of view, and the secondary crusher subsystem and conveyer subsystem are critical from an availability point of view. The study also shows that reliability analysis is very useful for deciding maintenance intervals.
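As a rough illustration of how fitted Weibull parameters feed a system-level reliability figure, the sketch below evaluates the two-parameter Weibull reliability function R(t) = exp(-(t/eta)^beta) and multiplies the subsystem values. The series arrangement, the function names, and every parameter value are assumptions made for illustration; none of them come from the study, which used ReliaSoft's Weibull++6.

```python
import math

def weibull_reliability(t, beta, eta):
    """Two-parameter Weibull reliability: R(t) = exp(-(t / eta) ** beta)."""
    return math.exp(-((t / eta) ** beta))

def series_reliability(t, subsystems):
    """Reliability of independent subsystems in series (assumed layout):
    the system works only if every subsystem works, so the R(t) multiply."""
    r = 1.0
    for beta, eta in subsystems:
        r *= weibull_reliability(t, beta, eta)
    return r

# Hypothetical (beta, eta in hours) pairs, NOT values from the case study.
subsystems = [(1.2, 400.0), (0.9, 250.0), (1.5, 600.0)]

for hours in (10, 50, 100):
    print(hours, round(series_reliability(hours, subsystems), 3))
```

Under these assumptions, the subsystem with the lowest R(t) at the operating time of interest dominates the product, which is one way to read "critical from a reliability point of view".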
Analysis of Occupational Accidents in Underground and Surface Mining in Spain Using Data-Mining Techniques

Science.gov (United States)

Sanmiquel, Lluís; Bascompta, Marc; Rossell, Josep M.; Anticoi, Hernán Francisco; Guash, Eduard

2018-01-01

An analysis of occupational accidents in the mining sector was conducted using the data from the Spanish Ministry of Employment and Social Safety between 2005 and 2015, and data-mining techniques were applied. Data was processed with the software Weka. Two scenarios were chosen from the accidents database: surface and underground mining. The most important variables involved in occupational accidents and their association rules were determined. These rules are composed of several predictor variables that cause accidents, defining its characteristics and context. This study exposes the 20 most important association rules in the sector—either surface or underground mining—based on the statistical confidence levels of each rule as obtained by Weka. The outcomes display the most typical immediate causes, along with the percentage of accidents with a basis in each association rule. The most important immediate cause is body movement with physical effort or overexertion, and the type of accident is physical effort or overexertion. On the other hand, the second most important immediate cause and type of accident are different between the two scenarios. Data-mining techniques were chosen as a useful tool to find out the root cause of the accidents. PMID:29518921
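The confidence level by which association rules are ranked can be sketched in a few lines. The example below computes support and confidence for a rule "antecedent implies consequent" over a small set of records; the records, attribute names, and helper functions are hypothetical and are not taken from the Spanish accident database or from Weka.

```python
def support(records, itemset):
    """Fraction of records that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for record in records if itemset <= set(record))
    return hits / len(records)

def confidence(records, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent:
    support(antecedent and consequent) / support(antecedent)."""
    sup_a = support(records, antecedent)
    if sup_a == 0:
        return 0.0
    return support(records, set(antecedent) | set(consequent)) / sup_a

# Hypothetical accident records (illustrative attribute values only).
records = [
    {"underground", "overexertion", "manual_handling"},
    {"underground", "overexertion"},
    {"surface", "fall", "manual_handling"},
    {"underground", "fall"},
]

print(support(records, {"underground"}))
print(confidence(records, {"underground"}, {"overexertion"}))
```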
Analysis of Occupational Accidents in Underground and Surface Mining in Spain Using Data-Mining Techniques

Directory of Open Access Journals (Sweden)

Lluís Sanmiquel

2018-03-01

Full Text Available An analysis of occupational accidents in the mining sector was conducted using the data from the Spanish Ministry of Employment and Social Safety between 2005 and 2015, and data-mining techniques were applied. Data was processed with the software Weka. Two scenarios were chosen from the accidents database: surface and underground mining. The most important variables involved in occupational accidents and their association rules were determined. These rules are composed of several predictor variables that cause accidents, defining its characteristics and context. This study exposes the 20 most important association rules in the sector—either surface or underground mining—based on the statistical confidence levels of each rule as obtained by Weka. The outcomes display the most typical immediate causes, along with the percentage of accidents with a basis in each association rule. The most important immediate cause is body movement with physical effort or overexertion, and the type of accident is physical effort or overexertion. On the other hand, the second most important immediate cause and type of accident are different between the two scenarios. Data-mining techniques were chosen as a useful tool to find out the root cause of the accidents.
Performance Analysis of Indonesia’s Mining Sector Price Index

Directory of Open Access Journals (Sweden)

Hastra Reza Satyatama

2017-07-01

Full Text Available The 2008 subprime mortgage crisis in the United States affected global capital markets, especially the stock price index of Indonesia's mining sector. This research analyzes the effect of the BI Rate, exchange rate, world gold price, crude oil price, and Dow Jones Industrial Average on the stock price index of the mining sector. It employs monthly time series data for 2009-2016 with the Error Correction Model-Engle Granger (ECM-EG) as the method. The analysis showed that the BI rate, exchange rate, and world gold price have negative and significant effects. World oil prices have a positive but not significant effect, while the Dow Jones Industrial Average has a positive and significant impact on the stock price index of the mining sector. Investors in the mining sector should pay attention to the rupiah exchange rate and the Dow Jones index, which significantly affect the mining sector stock price index. DOI: 10.15408/sjie.v6i2.5395
The smart cluster method. Adaptive earthquake cluster identification and analysis in strong seismic regions

Science.gov (United States)

Schaefer, Andreas M.; Daniell, James E.; Wenzel, Friedemann

2017-07-01

Earthquake clustering is an essential part of almost any statistical analysis of spatial and temporal properties of seismic activity. The nature of earthquake clusters and subsequent declustering of earthquake catalogues plays a crucial role in determining the magnitude-dependent earthquake return period and its respective spatial variation for probabilistic seismic hazard assessment. This study introduces the Smart Cluster Method (SCM), a new methodology to identify earthquake clusters, which uses an adaptive point process for spatio-temporal cluster identification. It utilises the magnitude-dependent spatio-temporal earthquake density to adjust the search properties, subsequently analyses the identified clusters to determine directional variation and adjusts its search space with respect to directional properties. In the case of rapid subsequent ruptures like the 1992 Landers sequence or the 2010-2011 Darfield-Christchurch sequence, a reclassification procedure is applied to disassemble subsequent ruptures using near-field searches, nearest neighbour classification and temporal splitting. The method is capable of identifying and classifying earthquake clusters in space and time. It has been tested and validated using earthquake data from California and New Zealand. A total of more than 1500 clusters have been found in both regions since 1980 with M_min = 2.0. Utilising the knowledge of cluster classification, the method has been adjusted to provide an earthquake declustering algorithm, which has been compared to existing methods. Its performance is comparable to established methodologies. The analysis of earthquake clustering statistics led to various new and updated correlation functions, e.g. for ratios between mainshock and strongest aftershock and general aftershock activity metrics.
Functional Principal Component Analysis and Randomized Sparse Clustering Algorithm for Medical Image Analysis

Science.gov (United States)

Lin, Nan; Jiang, Junhai; Guo, Shicheng; Xiong, Momiao

2015-01-01

Due to advances in sensor technology, the growing volume of medical image data makes it possible to visualize anatomical changes in biological tissues. As a consequence, medical images have the potential to enhance the diagnosis of disease, the prediction of clinical outcomes and the characterization of disease progression. At the same time, the growing data dimensions pose great methodological and computational challenges for the representation and selection of features in image cluster analysis. To address these challenges, we first extend functional principal component analysis (FPCA) from one dimension to two dimensions to fully capture the spatial variation of the image signals. The image signals contain a large number of redundant features which provide no additional information for clustering analysis. The widely used methods for removing the irrelevant features are sparse clustering algorithms using a lasso-type penalty to select the features. However, the accuracy of clustering using a lasso-type penalty depends on the selection of the penalty parameters and the threshold value. In practice, they are difficult to determine. Recently, randomized algorithms have received a great deal of attention in big data analysis. This paper presents a randomized algorithm for accurate feature selection in image clustering analysis. The proposed method is applied to both the liver and kidney cancer histology image data from the TCGA database. The results demonstrate that the randomized feature selection method coupled with functional principal component analysis substantially outperforms the current sparse clustering algorithms in image cluster analysis. PMID:26196383
Advanced Data Mining of Leukemia Cells Micro-Arrays

Directory of Open Access Journals (Sweden)

Richard S. Segall

2009-12-01

Full Text Available This paper provides continuation and extensions of previous research by Segall and Pierce (2009a) that discussed data mining for micro-array databases of Leukemia cells, primarily with self-organized maps (SOM). As in Segall and Pierce (2009a) and Segall and Pierce (2009b), the results of applying data mining are shown and discussed for the data categories of microarray databases of HL60, Jurkat, NB4 and U937 Leukemia cells that are also described in this article. First, a background section is provided on the work of others pertaining to the applications of data mining to micro-array databases of Leukemia cells and micro-array databases in general. As noted in the predecessor article by Segall and Pierce (2009a), micro-array databases are one of the most popular functional genomics tools in use today. The research in this paper is intended to use advanced data mining technologies for better interpretations and knowledge discovery as generated by the patterns of gene expressions of HL60, Jurkat, NB4 and U937 Leukemia cells. The advanced data mining performed entailed using other data mining tools such as cubic clustering criterion, variable importance rankings, decision trees, and more detailed examinations of data mining statistics and study of other self-organized maps (SOM) clustering regions of workspace as generated by SAS Enterprise Miner version 4. Conclusions and future directions of the research are also presented.
Mining Outlier Data in Mobile Internet-Based Large Real-Time Databases

Directory of Open Access Journals (Sweden)

Xin Liu

2018-01-01

Full Text Available Mining outlier data guarantees access security and data scheduling of parallel databases and maintains high-performance operation of real-time databases. Traditional mining methods generate abundant interference data with reduced accuracy, efficiency, and stability, causing severe deficiencies. This paper proposes a new mining outlier data method, which is used to analyze real-time data features, obtain magnitude spectra models of outlier data, establish a decisional-tree information chain transmission model for outlier data in mobile Internet, obtain the information flow of internal outlier data in the information chain of a large real-time database, and cluster data. Upon local characteristic time scale parameters of information flow, the phase position features of the outlier data before filtering are obtained; the decision-tree outlier-classification feature-filtering algorithm is adopted to acquire signals for analysis and instant amplitude and to achieve the phase-frequency characteristics of outlier data. Wavelet transform threshold denoising is combined with signal denoising to analyze data offset, to correct formed detection filter model, and to realize outlier data mining. The simulation suggests that the method detects the characteristic outlier data feature response distribution, reduces response time, iteration frequency, and mining error rate, improves mining adaptation and coverage, and shows good mining outcomes.
Simultaneous Two-Way Clustering of Multiple Correspondence Analysis

Science.gov (United States)

Hwang, Heungsun; Dillon, William R.

2010-01-01

A 2-way clustering approach to multiple correspondence analysis is proposed to account for cluster-level heterogeneity of both respondents and variable categories in multivariate categorical data. Specifically, in the proposed method, multiple correspondence analysis is combined with k-means in a unified framework in which "k"-means is…
Drug repurposing by integrated literature mining and drug–gene–disease triangulation

DEFF Research Database (Denmark)

Sun, Peng; Guo, Jiong; Winnenburg, Rainer

2017-01-01

recent developments in computational drug repositioning and introduce the utilized data sources. Afterwards, we introduce a new data fusion model based on n-cluster editing as a novel multi-source triangulation strategy, which was further combined with semantic literature mining. Our evaluation suggests...... that utilizing drug–gene–disease triangulation coupled to sophisticated text analysis is a robust approach for identifying new drug candidates for repurposing....
Integration K-Means Clustering Method and Elbow Method For Identification of The Best Customer Profile Cluster

Science.gov (United States)

Syakur, M. A.; Khotimah, B. K.; Rochman, E. M. S.; Satoto, B. D.

2018-04-01

Clustering is a data mining technique used to analyse large and varied data sets. It groups data into clusters so that the objects within a cluster are as similar as possible and differ from the objects in other clusters. SMEs in Indonesia have a variety of customers but no mapping of them, so they do not know which customers are loyal. Customer mapping groups customer profiles to support SME analysis and policy in the production of goods, in particular batik sales. The researchers combine the K-Means method with the elbow method to make K-Means more efficient and effective when processing large amounts of data. K-Means clustering is a local optimization method that is sensitive to the choice of initial cluster centers, so a poor choice of initial centers leads to high errors and poor clusters. K-Means also has difficulty determining the best number of clusters, so the elbow method is used to find it. The results show that the elbow method can produce the same number of clusters K for different amounts of data. The number of clusters chosen by the elbow method then serves as the default for the characterization process in the case study. Evaluating K-Means by SSE values on 500 clusters of batik visitors gave the best clusters: the SSE decreases sharply at K = 3, so K = 3 is taken as the cut-off point for the best number of clusters.
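The elbow procedure described above amounts to running K-Means for several values of K and inspecting how the sum of squared errors (SSE) falls. The following is a minimal sketch in plain Python, not the authors' implementation: the seeded random choice of initial centers, the helper names, and the toy points are all assumptions made for illustration.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(cluster):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(cluster)
    return tuple(sum(coord) / n for coord in zip(*cluster))

def kmeans_sse(points, k, iters=50, seed=0):
    """Run Lloyd's K-Means and return the final SSE.
    Initial centers are k input points drawn with a fixed seed (an assumption)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centers[c]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous center.
        new_centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
        if new_centers == centers:
            break
        centers = new_centers
    return sum(min(sq_dist(p, c) for c in centers) for p in points)

# Hypothetical 2-D data with three visually separate groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8), (1, 8), (2, 9), (1, 9)]

for k in range(1, 6):
    print(k, round(kmeans_sse(points, k), 2))
```

Reading the printed curve, the elbow is the K after which further increases in K reduce the SSE only slightly. Because the result depends on the initial centers, repeating the run with several seeds and keeping the lowest SSE per K is a common precaution.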
Semi-supervised consensus clustering for gene expression data analysis

OpenAIRE

Wang, Yunli; Pan, Youlian

2014-01-01

Background Simple clustering methods such as hierarchical clustering and k-means are widely used for gene expression data analysis; but they are unable to deal with noise and high dimensionality associated with the microarray gene expression data. Consensus clustering appears to improve the robustness and quality of clustering results. Incorporating prior knowledge in clustering process (semi-supervised clustering) has been shown to improve the consistency between the data partitioning and do...
Data Analysis and Data Mining: Current Issues in Biomedical Informatics

Science.gov (United States)

Bellazzi, Riccardo; Diomidous, Marianna; Sarkar, Indra Neil; Takabayashi, Katsuhiko; Ziegler, Andreas; McCray, Alexa T.

2011-01-01

Summary. Background: Medicine and biomedical sciences have become data-intensive fields, which, at the same time, enable the application of data-driven approaches and require sophisticated data analysis and data mining methods. Biomedical informatics provides a proper interdisciplinary context to integrate data and knowledge when processing available information, with the aim of giving effective decision-making support in clinics and translational research. Objectives: To reflect on different perspectives related to the role of data analysis and data mining in biomedical informatics. Methods: On the occasion of the 50th year of Methods of Information in Medicine a symposium was organized that reflected on opportunities, challenges and priorities of organizing, representing and analysing data, information and knowledge in biomedicine and health care. The contributions of experts with a variety of backgrounds in the area of biomedical data analysis have been collected as one outcome of this symposium, in order to provide a broad, though coherent, overview of some of the most interesting aspects of the field. Results: The paper presents sections on data accumulation and data-driven approaches in medical informatics, data and knowledge integration, statistical issues for the evaluation of data mining models, translational bioinformatics and bioinformatics aspects of genetic epidemiology. Conclusions: Biomedical informatics represents a natural framework to properly and effectively apply data analysis and data mining methods in a decision-making context. In the future, it will be necessary to preserve the inclusive nature of the field and to foster an increasing sharing of data and methods between researchers. PMID:22146916
Allergen Sensitization Pattern by Sex: A Cluster Analysis in Korea.

Science.gov (United States)

Ohn, Jungyoon; Paik, Seung Hwan; Doh, Eun Jin; Park, Hyun-Sun; Yoon, Hyun-Sun; Cho, Soyun

2017-12-01

Allergens tend to sensitize simultaneously. Etiology of this phenomenon has been suggested to be allergen cross-reactivity or concurrent exposure. However, little is known about specific allergen sensitization patterns. This study investigates allergen sensitization characteristics according to gender. The multiple allergen simultaneous test (MAST) is widely used as a screening tool for detecting allergen sensitization in dermatologic clinics. We retrospectively reviewed the medical records of patients with MAST results between 2008 and 2014 in our Department of Dermatology. A cluster analysis was performed to elucidate the allergen-specific immunoglobulin (Ig)E cluster pattern. The results of MAST (39 allergen-specific IgEs) from 4,360 cases were analyzed. By cluster analysis, the 39 items were grouped into 8 clusters. Each cluster had characteristic features. Compared with the female group, the male group tended to be sensitized more frequently to all tested allergens, except for the fungus allergen cluster. The cluster and comparative analysis results demonstrate that allergen sensitization is clustered, manifesting allergen similarity or co-exposure. Only the fungus cluster allergens tend to sensitize the female group more frequently than the male group.
Hydrologic analysis for ecological risk assessment of watersheds with abandoned mine lands

International Nuclear Information System (INIS)

Gallagher, D.; Babendreier, J.; Cherry, D.

1999-01-01

As part of an ongoing study of acid mine drainage (AMD), a comprehensive ecological risk assessment was conducted in the Leading Creek Watershed in southeast Ohio. The watershed is influenced by agriculture and active and abandoned coal-mining operations. This work presents a broad overview of several quantitative measures of hydrology and hydraulic watershed properties available for use in risk assessment and evaluates their relation to metrics of ecology. Data analysis included statistical comparisons of metrics of ecology, ecotoxicology, water quality, and physically based parameters describing land use, geomorphology, flow, velocity, and particle size. A multiple regression analysis indicated that abandoned mining operations dominated impacts upon aquatic ecology. It also indicated that low-flow velocity measurements and the ratio of maximum velocity to average velocity at low flow were helpful in describing variation in macroinvertebrate Total Taxa scores. Other key parameters that showed strong relationships with biodiversity trends included pH, simple knowledge of any mining upstream, the calculated percentage of the subshed covered by strip mines, and the measured depth of streambed sediments from site to site.

Hierarchical Aligned Cluster Analysis for Temporal Clustering of Human Motion.

Science.gov (United States)

Zhou, Feng; De la Torre, Fernando; Hodgins, Jessica K

2013-03-01

Temporal segmentation of human motion into plausible motion primitives is central to understanding and building computational models of human motion. Several issues contribute to the challenge of discovering motion primitives: the exponential nature of all possible movement combinations, the variability in the temporal scale of human actions, and the complexity of representing articulated motion. We pose the problem of learning motion primitives as one of temporal clustering, and derive an unsupervised hierarchical bottom-up framework called hierarchical aligned cluster analysis (HACA). HACA finds a partition of a given multidimensional time series into m disjoint segments such that each segment belongs to one of k clusters. HACA combines kernel k-means with the generalized dynamic time alignment kernel to cluster time series data. Moreover, it provides a natural framework to find a low-dimensional embedding for time series. HACA is efficiently optimized with a coordinate descent strategy and dynamic programming. Experimental results on motion capture and video data demonstrate the effectiveness of HACA for segmenting complex motions and as a visualization tool. We also compare the performance of HACA to state-of-the-art algorithms for temporal clustering on data of a honey bee dance. The HACA code is available online.
Science and Technology Text Mining Basic Concepts

National Research Council Canada - National Science Library

Losiewicz, Paul

2003-01-01

...). It then presents some of the most widely used data and text mining techniques, including clustering and classification methods, such as nearest neighbor, relational learning models, and genetic...
WebGimm: An integrated web-based platform for cluster analysis, functional analysis, and interactive visualization of results.

Science.gov (United States)

Joshi, Vineet K; Freudenberg, Johannes M; Hu, Zhen; Medvedovic, Mario

2011-01-17

Cluster analysis methods have been extensively researched, but the adoption of new methods is often hindered by technical barriers in their implementation and use. WebGimm is a free cluster analysis web-service, and an open source general purpose clustering web-server infrastructure designed to facilitate easy deployment of integrated cluster analysis servers based on clustering and functional annotation algorithms implemented in R. Integrated functional analyses and interactive browsing of both, clustering structure and functional annotations provides a complete analytical environment for cluster analysis and interpretation of results. The Java Web Start client-based interface is modeled after the familiar cluster/treeview packages making its use intuitive to a wide array of biomedical researchers. For biomedical researchers, WebGimm provides an avenue to access state of the art clustering procedures. For Bioinformatics methods developers, WebGimm offers a convenient avenue to deploy their newly developed clustering methods. WebGimm server, software and manuals can be freely accessed at http://ClusterAnalysis.org/.
DATA MINING IN SPORTS BETTING

Directory of Open Access Journals (Sweden)

Cristian Georgescu

2013-12-01

Full Text Available In this paper, we make a brief analysis of how to make decisions in betting on European football with the help of data mining techniques. Both betting a few days in advance of the sporting event and live betting have been taken into consideration. By using a clustering algorithm to analyze both the database containing events from football matches and the odds given by bookmakers, we have obtained graphs indicating the probabilities associated with the analyzed events. Given the purely informative aspect of the current paper, we have only analyzed the number of corners from a match.
Analysis of Network Clustering Algorithms and Cluster Quality Metrics at Scale.

Science.gov (United States)

Emmons, Scott; Kobourov, Stephen; Gallant, Mike; Börner, Katy

2016-01-01

Notions of community quality underlie the clustering of networks. While studies surrounding network clustering are increasingly common, a precise understanding of the relationship between different cluster quality metrics is unknown. In this paper, we examine the relationship between stand-alone cluster quality metrics and information recovery metrics through a rigorous analysis of four widely-used network clustering algorithms: Louvain, Infomap, label propagation, and smart local moving. We consider the stand-alone quality metrics of modularity, conductance, and coverage, and we consider the information recovery metrics of adjusted Rand score, normalized mutual information, and a variant of normalized mutual information used in previous work. Our study includes both synthetic graphs and empirical data sets of sizes varying from 1,000 to 1,000,000 nodes. We find significant differences among the results of the different cluster quality metrics. For example, clustering algorithms can return a value of 0.4 out of 1 on modularity but score 0 out of 1 on information recovery. We find conductance, though imperfect, to be the stand-alone quality metric that best indicates performance on the information recovery metrics. Additionally, our study shows that the variant of normalized mutual information used in previous work cannot be assumed to differ only slightly from traditional normalized mutual information. Smart local moving is the overall best performing algorithm in our study, but discrepancies between cluster evaluation metrics prevent us from declaring it an absolutely superior algorithm. Interestingly, Louvain performed better than Infomap in nearly all the tests in our study, contradicting the results of previous work in which Infomap was superior to Louvain. We find that although label propagation performs poorly when clusters are less clearly defined, it scales efficiently and accurately to large graphs with well-defined clusters.
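Two of the stand-alone quality metrics named above can be computed directly from an edge list. The sketch below uses the common textbook definitions for an undirected, unweighted graph: modularity Q = sum over communities of (L_c / m - (d_c / 2m)^2), and conductance of a node set as cut edges divided by the smaller of the two volumes. Whether the paper uses exactly these variants is an assumption, and the toy graph is hypothetical.

```python
def modularity(edges, communities):
    """Newman modularity of a partition of an undirected, unweighted graph.
    L_c = edges inside community c, d_c = total degree of c, m = edge count."""
    m = len(edges)
    label = {node: i for i, com in enumerate(communities) for node in com}
    inside = [0] * len(communities)
    degree = [0] * len(communities)
    for u, v in edges:
        degree[label[u]] += 1
        degree[label[v]] += 1
        if label[u] == label[v]:
            inside[label[u]] += 1
    return sum(inside[c] / m - (degree[c] / (2 * m)) ** 2
               for c in range(len(communities)))

def conductance(edges, community):
    """Cut edges divided by the smaller volume (sum of degrees) of the two sides.
    Assumes both sides have at least one edge endpoint."""
    inside_nodes = set(community)
    cut = vol_in = vol_out = 0
    for u, v in edges:
        for node in (u, v):
            if node in inside_nodes:
                vol_in += 1
            else:
                vol_out += 1
        if (u in inside_nodes) != (v in inside_nodes):
            cut += 1
    return cut / min(vol_in, vol_out)

# Hypothetical graph: two triangles joined by the single edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
communities = [{0, 1, 2}, {3, 4, 5}]

print(modularity(edges, communities))    # 5/14 for this toy graph
print(conductance(edges, {0, 1, 2}))     # 1/7 for this toy graph
```

Higher modularity and lower conductance both indicate better-separated communities, which is why the two can disagree with information recovery scores that compare against a known ground truth.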
Cluster analysis of activity-time series in motor learning

DEFF Research Database (Denmark)

Balslev, Daniela; Nielsen, Finn Å; Futiger, Sally A

2002-01-01

Neuroimaging studies of learning focus on brain areas where the activity changes as a function of time. To circumvent the difficult problem of model selection, we used a data-driven analytic tool, cluster analysis, which extracts representative temporal and spatial patterns from the voxel......-time series. The optimal number of clusters was chosen using a cross-validated likelihood method, which highlights the clustering pattern that generalizes best over the subjects. Data were acquired with PET at different time points during practice of a visuomotor task. The results from cluster analysis show...
ANALISIS SEGMENTASI PELANGGAN MENGGUNAKAN KOMBINASI RFM MODEL DAN TEKNIK CLUSTERING

Directory of Open Access Journals (Sweden)

Beta Estri Adiana

2018-04-01

Full Text Available Intense competition in the business field motivates small and medium enterprises (SMEs) to manage customer services as well as possible. Customer loyalty can be improved by grouping customers into several groups and determining appropriate and effective marketing strategies for each group. Customer segmentation can be performed with a data mining approach using clustering. The main purpose of this paper is to segment customers and measure their loyalty to an SME's product. The study uses the CRISP-DM method, which consists of six phases: business understanding, data understanding, data preparation, modeling, evaluation, and deployment. The K-Means algorithm is used for cluster formation, and RapidMiner is the tool used to evaluate the resulting clusters. Cluster formation is based on RFM (recency, frequency, monetary) analysis. The Davies-Bouldin Index (DBI) is used to find the optimal number of clusters (k). The customers are divided into 3 clusters: the first cluster has 30 customers in the typical customer category, the second has 8 customers in the superstar customer category, and the third has 89 customers in the dormant customer category.
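The RFM features that drive the clustering can be derived from a plain transaction log. The sketch below is illustrative only: the record layout (customer, purchase date, amount), the reference date, and the function name are assumptions, and the study itself worked in RapidMiner rather than Python.

```python
from datetime import date

def rfm_table(transactions, today):
    """Return {customer: (recency_days, frequency, monetary)}.
    recency   = days since the customer's most recent purchase
    frequency = number of purchases
    monetary  = total amount spent"""
    table = {}
    for customer, day, amount in transactions:
        last, freq, total = table.get(customer, (day, 0, 0.0))
        table[customer] = (max(last, day), freq + 1, total + amount)
    return {c: ((today - last).days, f, m) for c, (last, f, m) in table.items()}

# Hypothetical transaction log: (customer id, purchase date, amount).
transactions = [
    ("A", date(2018, 1, 5), 120.0),
    ("A", date(2018, 3, 1), 80.0),
    ("B", date(2017, 11, 20), 40.0),
]

rfm = rfm_table(transactions, today=date(2018, 3, 31))
print(rfm["A"])  # (30, 2, 200.0)
print(rfm["B"])  # (131, 1, 40.0)
```

In a pipeline like the one described, these three columns would typically be rescaled to comparable ranges before K-Means is applied, since monetary totals otherwise dominate the distance.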
Advanced analysis of forest fire clustering

Science.gov (United States)

Kanevski, Mikhail; Pereira, Mario; Golay, Jean

2017-04-01

Analysis of point pattern clustering is an important topic in spatial statistics and for many applications: biodiversity, epidemiology, natural hazards, geomarketing, etc. There are several fundamental approaches used to quantify spatial data clustering using topological, statistical and fractal measures. In the present research, the recently introduced multi-point Morisita index (mMI) is applied to study the spatial clustering of forest fires in Portugal. The data set consists of more than 30000 fire events covering the time period from 1975 to 2013. The distribution of forest fires is very complex and highly variable in space. mMI is a multi-point extension of the classical two-point Morisita index. In essence, mMI is estimated by covering the region under study by a grid and by computing how many times more likely it is that m points selected at random will be from the same grid cell than it would be in the case of a complete random Poisson process. By changing the number of grid cells (size of the grid cells), mMI characterizes the scaling properties of spatial clustering. From mMI, the data intrinsic dimension (fractal dimension) of the point distribution can be estimated as well. In this study, the mMI of forest fires is compared with the mMI of random patterns (RPs) generated within the validity domain defined as the forest area of Portugal. It turns out that the forest fires are highly clustered inside the validity domain in comparison with the RPs. Moreover, they demonstrate different scaling properties at different spatial scales. The results obtained from the mMI analysis are also compared with those of fractal measures of clustering - box counting and sand box counting approaches. REFERENCES Golay J., Kanevski M., Vega Orozco C., Leuenberger M., 2014: The multipoint Morisita index for the analysis of spatial patterns. Physica A, 406, 191-202. Golay J., Kanevski M. 2015: A new estimator of intrinsic dimension based on the multipoint Morisita index
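The grid-based definition given above translates into a short computation. With Q grid cells, N points, and n_i points in cell i, the probability that m randomly chosen points share a cell is the sum of n_i(n_i-1)...(n_i-m+1) divided by N(N-1)...(N-m+1), and for a completely random pattern it is Q^(1-m), so the index is their ratio. The sketch below follows that reading for points in the unit square; the function name, the square grid, and the toy patterns are assumptions, and the authors' own estimator may differ in detail.

```python
from collections import Counter

def multipoint_morisita(points, cells_per_side, m=2):
    """m-point Morisita index for 2-D points in the unit square [0, 1] x [0, 1].
    Requires at least m points. Values near 1 suggest a random pattern,
    values above 1 clustering, values below 1 regular dispersion."""
    q = cells_per_side ** 2
    counts = Counter(
        (min(int(x * cells_per_side), cells_per_side - 1),
         min(int(y * cells_per_side), cells_per_side - 1))
        for x, y in points
    )

    def falling(v):
        """Falling factorial v (v - 1) ... (v - m + 1)."""
        out = 1
        for i in range(m):
            out *= v - i
        return out

    n = len(points)
    return q ** (m - 1) * sum(falling(c) for c in counts.values()) / falling(n)

# Hypothetical patterns: four points in one cell versus one point per cell.
clustered = [(0.1, 0.1), (0.2, 0.1), (0.1, 0.2), (0.2, 0.2)]
spread = [(0.25, 0.25), (0.75, 0.25), (0.25, 0.75), (0.75, 0.75)]

print(multipoint_morisita(clustered, cells_per_side=2))  # 4.0
print(multipoint_morisita(spread, cells_per_side=2))     # 0.0
```

Repeating the calculation over a range of `cells_per_side` values gives the scaling behaviour the abstract refers to.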
Uranium solution mining cost estimating technique: means for rapid comparative analysis of deposits

International Nuclear Information System (INIS)

Anon.

1978-01-01

Twelve graphs provide a technique for determining relative cost ranges for uranium solution mining projects. The technique provides a consistent framework for rapid comparative analysis of various properties or mining situations. It is also useful for determining the sensitivity of cost figures to incremental changes in mining factors or deposit characteristics.
Analysis of water control in an underground mine under strong karst media influence (Vazante mine, Brazil)

Science.gov (United States)

Ninanya, Hugo; Guiguer, Nilson; Vargas, Eurípedes A.; Nascimento, Gustavo; Araujo, Edmar; Cazarin, Caroline L.

2018-05-01

This work presents an analysis of groundwater flow conditions and groundwater control measures for the Vazante underground mine located in the state of Minas Gerais, Brazil. According to field observations, groundwater flow processes in this mine are highly influenced by the presence of karst features located in the near-surface terrain next to the Santa Catarina River. The karstic features, such as caves, sinkholes, dolines and conduits, have direct contact with the aquifer and tend to increase water flow into the mine. These effects are more acute in areas under the influence of groundwater-level drawdown by pumping. Numerical analyses of this condition were carried out using the computer program FEFLOW. This program represents karstic features as one-dimensional discrete flow conduits inside a three-dimensional finite element structure representing the geologic medium, following a combined discrete-continuum approach for representing the karst system. These features create preferential flow paths between the river and the mine; incorporating them into the model gives a more realistic representation of the hydrogeological environment of the mine surroundings. In order to mitigate the water-inflow problems, impermeabilization of the river through construction of a reinforced concrete channel was incorporated in the developed hydrogeological model. Different scenarios for channelization lengths for the most critical zones along the river were studied. The results made it possible to compare the effectiveness of the different river channelization scenarios. It was also possible to determine whether the use of these impermeabilization measures would be able to reduce, in large part, the elevated costs of pumping inside the mine.
Physicochemical properties of different corn varieties by principal components analysis and cluster analysis

International Nuclear Information System (INIS)

Zeng, J.; Li, G.; Sun, J.

2013-01-01

Principal components analysis and cluster analysis were used to investigate the properties of different corn varieties. The chemical compositions and some properties of corn flour, which was processed by dry milling, were determined. The results showed that the chemical compositions and physicochemical properties differed significantly among the twenty-six corn varieties. The quality of corn flour was related to five principal components from principal component analysis, and the contribution rate of starch pasting properties was important, accounting for 48.90%. The twenty-six corn varieties could be classified into four groups by cluster analysis. The consistency between principal components analysis and cluster analysis indicated that multivariate analyses were feasible in the study of corn variety properties. (author)
Data Mining Application in Customer Relationship Management for Hospital Inpatients

OpenAIRE

Lee, Eun Whan

2012-01-01

Objectives: This study aims to discover patients loyal to a hospital and model their medical service usage patterns. Consequently, this study proposes a data mining application in customer relationship management (CRM) for hospital inpatients. Methods: A recency, frequency, monetary (RFM) model was applied to 14,072 patients discharged from a university hospital. Cluster analysis was conducted to segment customers, and it modeled the patterns of the loyal customers' medical services us...
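To make the RFM idea concrete, here is a minimal sketch of how recency, frequency and monetary values could be derived from discharge records before any clustering step. The record layout (patient id, discharge date, charge), the function name and the sample values are assumptions for illustration only; the study's actual variables and scoring are not given in this excerpt.

```python
from datetime import date


def rfm_table(visits, today):
    """Build (recency_days, frequency, monetary) per patient.

    visits: iterable of (patient_id, discharge_date, charge) tuples.
    This layout is hypothetical, not the study's data schema.
    """
    summary = {}
    for patient_id, discharged, charge in visits:
        rec = summary.setdefault(
            patient_id, {"last": discharged, "frequency": 0, "monetary": 0.0}
        )
        rec["last"] = max(rec["last"], discharged)  # most recent discharge
        rec["frequency"] += 1                       # number of stays
        rec["monetary"] += charge                   # total charges
    return {
        pid: ((today - rec["last"]).days, rec["frequency"], rec["monetary"])
        for pid, rec in summary.items()
    }


if __name__ == "__main__":
    sample = [
        ("a", date(2011, 1, 1), 100.0),
        ("a", date(2011, 6, 1), 50.0),
        ("b", date(2010, 12, 31), 10.0),
    ]
    for pid, row in sorted(rfm_table(sample, date(2011, 6, 11)).items()):
        print(pid, row)
```

The resulting three-column table is the kind of input a cluster analysis would then segment.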
Privacy-Preserving k-Means Clustering under Multiowner Setting in Distributed Cloud Environments

Directory of Open Access Journals (Sweden)

Hong Rong

2017-01-01

Full Text Available With the advent of the big data era, clients who lack computational and storage resources tend to outsource data mining tasks to cloud service providers in order to improve efficiency and reduce costs. It is also increasingly common for clients to perform collaborative mining to maximize profits. However, due to the rise of privacy leakage issues, the data contributed by clients should be encrypted using their own keys. This paper focuses on privacy-preserving k-means clustering over joint datasets encrypted under multiple keys. Unfortunately, existing outsourcing k-means protocols are impractical because not only are they restricted to a single-key setting, but they are also inefficient and nonscalable for distributed cloud computing. To address these issues, we propose a set of privacy-preserving building blocks and an outsourced k-means clustering protocol under the Spark framework. Theoretical analysis shows that our scheme protects the confidentiality of the joint database and mining results, as well as access patterns, under the standard semihonest model with relatively small computational overhead. Experimental evaluations on real datasets also demonstrate its efficiency improvements compared with existing approaches.
Software tool for data mining and its applications

Science.gov (United States)

Yang, Jie; Ye, Chenzhou; Chen, Nianyi

2002-03-01

A software tool for data mining is introduced, which integrates pattern recognition (PCA, Fisher, clustering, hyperenvelop, regression), artificial intelligence (knowledge representation, decision trees), statistical learning (rough set, support vector machine), and computational intelligence (neural network, genetic algorithm, fuzzy systems). It consists of nine function models: pattern recognition, decision trees, association rule, fuzzy rule, neural network, genetic algorithm, Hyper Envelop, support vector machine, and visualization. The principle and knowledge representation of some function models of data mining are described. The software tool is implemented in Visual C++ under Windows 2000. Nonmonotony in data mining is dealt with by concept hierarchy and layered mining. The software tool has been satisfactorily applied to the prediction of regularities in the formation of ternary intermetallic compounds in alloy systems, and to the diagnosis of brain glioma.
Improving clustering with metabolic pathway data.

Science.gov (United States)

Milone, Diego H; Stegmayer, Georgina; López, Mariana; Kamenetzky, Laura; Carrari, Fernando

2014-04-10

It is a common practice in bioinformatics to validate each group returned by a clustering algorithm through manual analysis, according to a-priori biological knowledge. This procedure helps to find functionally related patterns and to propose hypotheses for their behavior and the biological processes involved. Therefore, this knowledge is used only as a second step, after the data have been clustered according to their expression patterns. Thus, it could be very useful to be able to improve the clustering of biological data by incorporating prior knowledge into the cluster formation itself, in order to enhance the biological value of the clusters. A novel training algorithm for clustering is presented, which evaluates the biological internal connections of the data points while the clusters are being formed. Within this training algorithm, the calculation of distances among data points and neuron centroids includes a new term based on information from well-known metabolic pathways. The standard self-organizing map (SOM) training and the biologically-inspired SOM (bSOM) training were tested with two real data sets of transcripts and metabolites from Solanum lycopersicum and Arabidopsis thaliana species. Classical data mining validation measures were used to evaluate the clustering solutions obtained by both algorithms. Moreover, a new measure that takes into account the biological connectivity of the clusters was applied. The results of bSOM show important improvements in the convergence and performance of the proposed clustering method in comparison to standard SOM training, in particular from the application point of view. Analyses of the clusters obtained with bSOM indicate that including biological information during training can certainly increase the biological value of the clusters found with the proposed method. It is worth highlighting that this fact has effectively improved the results, which can simplify their further analysis. The algorithm is available as a
Taxonomical analysis of the Cancer cluster of galaxies

International Nuclear Information System (INIS)

Perea, J.; Olmo, A. del; Moles, M.

1986-01-01

A description is presented of the Cancer cluster of galaxies, based on a taxonomical analysis in (α, δ, V_r) space. Earlier results by previous authors on the lack of dynamical entity of the cluster are confirmed. The present analysis points out the existence of a binary structure in the most populated region of the complex. (author)
Detection of land mines using fast and thermal neutron analysis

International Nuclear Information System (INIS)

Bach, P.

1998-01-01

The detection of land mines is made possible by using a nuclear sensor based on neutron interrogation. Neutron interrogation makes it possible to detect the sensitive elements (C, H, O, N) of the explosives in land mines or in unexploded shells: the evaluation of the characteristic ratios N/O and C/O in a volume element gives a signature of high explosives. Fast neutron interrogation has been qualified in our laboratories as a powerful close-distance method for identifying the presence of a mine or explosive. This method could be implemented together with a multisensor detection system (for instance IR or microwave) to reduce the false alarm rate by addressing the suspected area. The principle of operation is based on the measurement of gamma rays induced by neutron interaction with irradiated nuclei from the soil and from a possible mine. The specific energy of these gamma rays makes it possible to recognise the elements at the origin of the neutron interaction. Several detection methods can be used, depending on the nuclei to be identified. Analysis of physical data, computations by simulation codes, and experiments performed in our laboratory have shown the value of Fast Neutron Analysis (FNA) combined with Thermal Neutron Analysis (TNA) techniques, especially for the detection of nitrogen ¹⁴N, carbon ¹²C and oxygen ¹⁶O. The FNA technique can be implemented using a 14 MeV sealed neutron tube and a set of detectors. Mine detection has been demonstrated in our investigations, using a low-power neutron generator working in the 10^8 n/s range, which is reasonable when considering safety rules. A fieldable demonstrator would be made with a detection head including tube and detectors, and with remote electronics, power supplies and computer installed in a vehicle. (author)
SURVEY ON CRIME ANALYSIS AND PREDICTION USING DATA MINING TECHNIQUES

Directory of Open Access Journals (Sweden)

H Benjamin Fredrick David

2017-04-01

Full Text Available Data mining is the procedure of evaluating and examining large pre-existing databases in order to generate new information which may be essential to the organization. The new information is predicted from the existing datasets. Many approaches to analysis and prediction in data mining have been applied, but very few efforts have been made in the field of criminology, and very few have compared the information that these approaches produce. Police stations and other similar criminal justice agencies hold many large databases of information which can be used to predict or analyze criminal movements and involvement in criminal activity in society. Criminals can also be predicted based on the crime data. The main aim of this work is to survey the supervised and unsupervised learning techniques that have been applied to criminal identification. This paper presents a survey of crime analysis and crime prediction using several data mining techniques.
Revealing Significant Relations between Chemical/Biological Features and Activity: Associative Classification Mining for Drug Discovery

Science.gov (United States)

Yu, Pulan

2012-01-01

Classification, clustering and association mining are major tasks of data mining and have been widely used for knowledge discovery. Associative classification mining, the combination of both association rule mining and classification, has emerged as an indispensable way to support decision making and scientific research. In particular, it offers a…
Assessment of surface water quality using hierarchical cluster analysis

Directory of Open Access Journals (Sweden)

Dheeraj Kumar Dabgerwal

2016-02-01

Full Text Available This study was carried out to assess the physicochemical quality of the river Varuna in Varanasi, India. Water samples were collected from 10 sites during January-June 2015. Pearson correlation analysis was used to assess the direction and strength of the relationships between physicochemical parameters. Hierarchical cluster analysis was also performed to determine the sources of pollution in the river Varuna. The results showed quite high values of DO, nitrate, BOD, COD and total alkalinity, above the BIS permissible limit. The correlation analysis identified pH, electrical conductivity, total alkalinity and nitrate as the key water parameters that influence the concentration of other water parameters. Cluster analysis grouped the 10 sampling sites into three major clusters according to the similarity in water quality. This study illustrated the usefulness of correlation and cluster analysis for getting better information about river water quality. International Journal of Environment Vol. 5 (1) 2016, pp. 32-44
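The two techniques named in this abstract, Pearson correlation and hierarchical cluster analysis, can be sketched with the standard library alone. This is an illustrative sketch under assumptions: average linkage and Euclidean distance are chosen here for simplicity (the abstract does not state which linkage or distance was used), and the site measurements below are made-up numbers, not data from the study.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def agglomerate(rows, k):
    """Average-linkage agglomerative clustering down to k clusters.

    rows: list of numeric tuples (one per sampling site).
    Returns a list of clusters, each a list of row indices.
    In practice the columns would be standardized first.
    """
    clusters = [[i] for i in range(len(rows))]

    def linkage(c1, c2):
        total = sum(math.dist(rows[a], rows[b]) for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > k:
        # Find the closest pair of clusters and merge them.
        _, i, j = min(
            (linkage(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


if __name__ == "__main__":
    # Hypothetical (pH, nitrate) readings for five sites.
    sites = [(7.1, 10.0), (7.2, 11.0), (8.0, 40.0), (8.1, 41.0), (6.5, 90.0)]
    print(pearson([s[0] for s in sites], [s[1] for s in sites]))
    print(agglomerate(sites, 3))
```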

Analysis on the influence of rainfall and mine water ratio against pH in East pit 3 West Banko coal mine

Directory of Open Access Journals (Sweden)

Rochyani Neny

2017-01-01

Full Text Available In coal mining areas, the pH of mine water tends to be low and acidic. In order to increase the pH, it is important to consider treating acid mine drainage with lime, because pH is an indicator of pollution. This work focuses on the influence of rainfall volume on the pH of acid mine drainage. The research was conducted using ratios of mine water to rainwater varied over nine conditions: 1:1, 1:2, 1:3, 1:4 and 1:5, and 5:4, 5:3, 5:2 and 5:1. The results were then measured and tested with statistical analysis. The ratio of rainfall to mine water showed a significant effect on the pH: higher rainfall led to a higher pH. This condition will affect the water neutralization process using lime, since the dose of lime needed to neutralize the acid mine drainage may differ between the rainy season and the dry season.
75 FR 17529 - High-Voltage Continuous Mining Machine Standard for Underground Coal Mines

Science.gov (United States)

2010-04-06

... High-Voltage Continuous Mining Machine Standard for Underground Coal Mines AGENCY: Mine Safety and... of high-voltage continuous mining machines in underground coal mines. It also revises MSHA's design...-- Underground Coal Mines III. Section-by-Section Analysis A. Part 18--Electric Motor-Driven Mine Equipment and...
DATA MINING IN EDUCATION: CURRENT STATE AND PERSPECTIVES OF DEVELOPMENT

Directory of Open Access Journals (Sweden)

Yurii O. Kovalchuk

2016-01-01

Full Text Available The main tasks (classification and regression, association rules, clustering) and the basic principles of data mining algorithms are considered in the context of their use for a variety of research in the field of education, which is the subject of a relatively new independent direction, Educational Data Mining. Findings about the most popular research topics within this area, as well as the perspectives of its development, are presented. The presentation of the material is illustrated by simple examples. This article is intended for readers who are engaged in research in the field of education at various levels, especially those involved in the use of e-learning systems, but who are not very familiar with this area of data analysis.
Analysis of Aspects of Innovation in a Brazilian Cluster

Directory of Open Access Journals (Sweden)

Adriana Valélia Saraceni

2012-09-01

Full Text Available Innovation through clustering has become very important given the increased significance of interaction for innovation and for the concept of the learning process. This study aims to identify what a case analysis of the innovation process in a cluster represents for the learning process. The study is developed in two stages. In the first stage, we used a preliminary case study to verify a cluster innovation analysis and its Innovation Index, and then explored a combined body of theory and practice. The second stage explores the concept of the learning process. Both stages allowed us to build a theoretical model for the development of the learning process in clusters. The main result of the model development is a mechanism for implementing improvements in clusters when case studies are applied.
Effects of Group Size and Lack of Sphericity on the Recovery of Clusters in K-Means Cluster Analysis

Science.gov (United States)

de Craen, Saskia; Commandeur, Jacques J. F.; Frank, Laurence E.; Heiser, Willem J.

2006-01-01

K-means cluster analysis is known for its tendency to produce spherical and equally sized clusters. To assess the magnitude of these effects, a simulation study was conducted, in which populations were created with varying departures from sphericity and group sizes. An analysis of the recovery of clusters in the samples taken from these…
The Application of Data Mining Techniques to Create Promotion Strategy for Mobile Phone Shop

Science.gov (United States)

Khasanah, A. U.; Wibowo, K. S.; Dewantoro, H. F.

2017-12-01

The number of mobile phone shops is growing very fast in various regions of Indonesia, including Yogyakarta, due to the increasing demand for mobile phones. This leads to high competition among mobile phone shops. Under these conditions a mobile phone shop, especially a small one, needs a good promotion strategy in order to survive. To create an attractive promotion strategy, companies/shops should know their customer segmentation and the buying patterns of their target market. This kind of analysis can be done using data mining techniques. This study aims to segment customers using Agglomerative Hierarchical Clustering and to identify customer buying patterns using Association Rule Mining. The research was conducted in a mobile phone shop in Sleman, Yogyakarta. The clustering result shows that the biggest customer segment of the shop was male university students who come on weekends, and from association rule mining it can be concluded that tempered glass and smart phone “x”, as well as action camera and waterproof monopod and power bank, have strong relationships. These results are used to create promotion strategies, which are presented at the end of the study.
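The strength of an association rule such as "smart phone implies tempered glass" is usually judged by its support and confidence. The sketch below shows those two measures on a handful of made-up baskets; the transactions and function names are hypothetical, and the study's own thresholds and algorithm are not stated in this abstract.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    wanted = set(itemset)
    hits = sum(1 for basket in transactions if wanted <= set(basket))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent); 0.0 if antecedent never occurs."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0
    both = support(transactions, set(antecedent) | set(consequent))
    return both / base


if __name__ == "__main__":
    # Made-up baskets, for illustration only.
    baskets = [
        {"smart phone", "tempered glass"},
        {"smart phone", "tempered glass", "power bank"},
        {"smart phone"},
        {"action camera", "waterproof monopod"},
    ]
    print(support(baskets, {"smart phone", "tempered glass"}))
    print(confidence(baskets, {"smart phone"}, {"tempered glass"}))
```

A rule is typically kept only when both values exceed analyst-chosen minimums.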
Two-Way Regularized Fuzzy Clustering of Multiple Correspondence Analysis.

Science.gov (United States)

Kim, Sunmee; Choi, Ji Yeh; Hwang, Heungsun

2017-01-01

Multiple correspondence analysis (MCA) is a useful tool for investigating the interrelationships among dummy-coded categorical variables. MCA has been combined with clustering methods to examine whether there exist heterogeneous subclusters of a population, which exhibit cluster-level heterogeneity. These combined approaches aim to classify either observations only (one-way clustering of MCA) or both observations and variable categories (two-way clustering of MCA). The latter approach is favored because its solutions are easier to interpret by providing explicitly which subgroup of observations is associated with which subset of variable categories. Nonetheless, the two-way approach has been built on hard classification that assumes observations and/or variable categories to belong to only one cluster. To relax this assumption, we propose two-way fuzzy clustering of MCA. Specifically, we combine MCA with fuzzy k-means simultaneously to classify a subgroup of observations and a subset of variable categories into a common cluster, while allowing both observations and variable categories to belong partially to multiple clusters. Importantly, we adopt regularized fuzzy k-means, thereby enabling us to decide the degree of fuzziness in cluster memberships automatically. We evaluate the performance of the proposed approach through the analysis of simulated and real data, in comparison with existing two-way clustering approaches.
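The idea of partial membership in several clusters can be illustrated with the textbook fuzzy c-means membership update. Note that this is the standard update with a fixed fuzzifier, shown only to make "belonging partially to multiple clusters" concrete; it is not the regularized fuzzy k-means variant the abstract proposes, whose details are not given here. The function name and sample points are hypothetical.

```python
import math


def fuzzy_memberships(point, centers, fuzzifier=2.0):
    """Standard fuzzy c-means memberships of one point in each cluster.

    u_k = 1 / sum_j (d_k / d_j) ** (2 / (fuzzifier - 1)),
    where d_k is the distance from the point to center k.
    fuzzifier must be greater than 1; memberships sum to 1.
    """
    dists = [math.dist(point, c) for c in centers]
    zeros = [k for k, d in enumerate(dists) if d == 0.0]
    if zeros:
        # The point coincides with one or more centers: share membership there.
        return [1.0 / len(zeros) if k in zeros else 0.0
                for k in range(len(dists))]
    power = 2.0 / (fuzzifier - 1.0)
    return [1.0 / sum((d_k / d_j) ** power for d_j in dists) for d_k in dists]


if __name__ == "__main__":
    centers = [(0.0, 0.0), (3.0, 0.0)]
    print(fuzzy_memberships((1.0, 0.0), centers))  # closer to the first center
    print(fuzzy_memberships((1.5, 0.0), centers))  # equidistant
```

Raising the fuzzifier makes memberships more even across clusters; values near 1 approach a hard assignment.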
A Genetic Algorithm That Exchanges Neighboring Centers for Fuzzy c-Means Clustering

Science.gov (United States)

Chahine, Firas Safwan

2012-01-01

Clustering algorithms are widely used in pattern recognition and data mining applications. Due to their computational efficiency, partitional clustering algorithms are better suited for applications with large datasets than hierarchical clustering algorithms. K-means is among the most popular partitional clustering algorithms, but has a major…
A knowledge discovery approach to urban analysis: Beyoglu Preservation Area as a data mine

Directory of Open Access Journals (Sweden)

Ahu Sokmenoglu Sohtorik

2017-11-01

Full Text Available Enhancing our knowledge of the complexities of cities in order to empower ourselves to make more informed decisions has always been a challenge for urban research. Recent developments in large-scale computing, together with new techniques and automated tools for data collection and analysis, are opening up promising opportunities for addressing this problem. The main motivation that served as the driving force behind this research is how these developments may contribute to urban data analysis. On this basis, the thesis focuses on urban data analysis in order to search for findings that can enhance our knowledge of urban environments, using the generic process of knowledge discovery using data mining. A knowledge discovery process based on data mining is a fully automated or semi-automated process which involves the application of computational tools and techniques to explore the “previously unknown, and potentially useful information” (Witten & Frank, 2005) hidden in large and often complex and multi-dimensional databases. This information can be obtained in the form of correlations amongst variables, data groupings (classes and clusters) or more complex hypotheses (probabilistic rules of co-occurrence, performance vectors of prediction models etc.). This research targets researchers and practitioners working in the field of urban studies who are interested in quantitative/computational approaches to urban data analysis and specifically aims to engage the interest of architects, urban designers and planners who do not have a background in statistics or in using data mining methods in their work. Accordingly, the overall aim of the thesis is the development of a knowledge discovery approach to urban analysis; a domain-specific adaptation of the generic process of knowledge discovery using data mining enabling the analyst to discover ‘relational urban knowledge’. ‘Relational urban knowledge’ is a term employed in this thesis to refer
Phenotypes Determined by Cluster Analysis in Moderate to Severe Bronchial Asthma.

Science.gov (United States)

Youroukova, Vania M; Dimitrova, Denitsa G; Valerieva, Anna D; Lesichkova, Spaska S; Velikova, Tsvetelina V; Ivanova-Todorova, Ekaterina I; Tumangelova-Yuzeir, Kalina D

2017-06-01

Bronchial asthma is a heterogeneous disease that includes various subtypes. They may share similar clinical characteristics, but probably have different pathological mechanisms. The aim was to identify phenotypes using cluster analysis in moderate to severe bronchial asthma and to compare differences in clinical, physiological, immunological and inflammatory data between the clusters. Forty adult patients with moderate to severe bronchial asthma out of exacerbation were included. All underwent clinical assessment, anthropometric measurements, skin prick testing, standard spirometry and measurement of the fraction of exhaled nitric oxide. Blood eosinophil count, serum total IgE and periostin levels were determined. A two-step cluster approach, a hierarchical clustering method and k-means analysis were used for identification of the clusters. We identified four clusters: Cluster 1 (n=14), late-onset, non-atopic asthma with impaired lung function; Cluster 2 (n=13), late-onset, atopic asthma; Cluster 3 (n=6), late-onset, aspirin-sensitive, eosinophilic asthma; and Cluster 4 (n=7), early-onset, atopic asthma. Our study is the first in Bulgaria in which cluster analysis is applied to asthmatic patients. The variables with the greatest power for differentiation in our study were: age of asthma onset, duration of disease, atopy, smoking, blood eosinophils, nonsteroidal anti-inflammatory drug hypersensitivity, baseline FEV1/FVC and symptom severity. Our results support the concept of heterogeneity of bronchial asthma and demonstrate that cluster analysis can be a useful tool for phenotyping of the disease and for a personalized approach to the treatment of patients.
Community Clustering Algorithm in Complex Networks Based on Microcommunity Fusion

Directory of Open Access Journals (Sweden)

Jin Qi

2015-01-01

Full Text Available With the further research on physical meaning and digital features of the community structure in complex networks in recent years, the improvement of effectiveness and efficiency of the community mining algorithms in complex networks has become an important subject in this area. This paper puts forward a concept of the microcommunity and gets final mining results of communities through fusing different microcommunities. This paper starts with the basic definition of the network community and applies Expansion to the microcommunity clustering which provides prerequisites for the microcommunity fusion. The proposed algorithm is more efficient and has higher solution quality compared with other similar algorithms through the analysis of test results based on network data set.
A Clustering Approach Using Cooperative Artificial Bee Colony Algorithm

Directory of Open Access Journals (Sweden)

Wenping Zou

2010-01-01

Full Text Available Artificial Bee Colony (ABC) is one of the most recently introduced algorithms based on the intelligent foraging behavior of a honey bee swarm. This paper presents an extended ABC algorithm, namely, the Cooperative Artificial Bee Colony (CABC), which significantly improves the original ABC in solving complex optimization problems. Clustering is a popular data analysis and data mining technique; therefore, the CABC could be used for solving clustering problems. In this work, first the CABC algorithm is used for optimizing six widely used benchmark functions, and the comparative results produced by ABC, Particle Swarm Optimization (PSO), and its cooperative version (CPSO) are studied. Second, the CABC algorithm is used for data clustering on several benchmark data sets. The performance of the CABC algorithm is compared with the PSO, CPSO, and ABC algorithms on clustering problems. The simulation results show that the proposed CABC outperforms the other three algorithms in terms of accuracy, robustness, and convergence speed.
Understanding Teacher Users of a Digital Library Service: A Clustering Approach

Science.gov (United States)

Xu, Beijie; Recker, Mimi

2011-01-01

This article describes the Knowledge Discovery and Data Mining (KDD) process and its application in the field of educational data mining (EDM) in the context of a digital library service called the Instructional Architect (IA.usu.edu). In particular, the study reported in this article investigated a certain type of data mining problem, clustering,…
Analysis of radon reduction and ventilation systems in uranium mines in China.

Science.gov (United States)

Hu, Peng-hua; Li, Xian-jie

2012-09-01

Mine ventilation is the most important way of reducing radon in uranium mines. At present, the radon and radon progeny levels in Chinese uranium mines where the cut and fill stoping method is used are 3-5 times higher than those in foreign uranium mines, as there is not much difference in the investments for ventilation protection between Chinese uranium mines and international advanced uranium mines with compaction methodology. In this paper, through the analysis of radon reduction and ventilation systems in Chinese uranium mines and the comparison of advantages and disadvantages between a variety of ventilation systems in terms of radon control, the authors try to illustrate the reasons for the higher radon and radon progeny levels in Chinese uranium mines and put forward some problems in three areas, namely the theory of radon control and ventilation systems, radon reduction ventilation measures and ventilation management. For these problems, this paper puts forward some proposals regarding some aspects, such as strengthening scrutiny, verifying and monitoring the practical situation, making clear ventilation plans, strictly following the mining sequence, promoting training of ventilation staff, enhancing ventilation system management, developing radon reduction ventilation technology, purchasing ventilation equipment as soon as possible in the future, and so on.
Analysis of radon reduction and ventilation systems in uranium mines in China

International Nuclear Information System (INIS)

Hu Penghua; Li Xianjie

2012-01-01

Mine ventilation is the most important way of reducing radon in uranium mines. At present, the radon and radon progeny levels in Chinese uranium mines where the cut and fill stoping method is used are 3–5 times higher than those in foreign uranium mines, as there is not much difference in the investments for ventilation protection between Chinese uranium mines and international advanced uranium mines with compaction methodology. In this paper, through the analysis of radon reduction and ventilation systems in Chinese uranium mines and the comparison of advantages and disadvantages between a variety of ventilation systems in terms of radon control, the authors try to illustrate the reasons for the higher radon and radon progeny levels in Chinese uranium mines and put forward some problems in three areas, namely the theory of radon control and ventilation systems, radon reduction ventilation measures and ventilation management. For these problems, this paper puts forward some proposals regarding some aspects, such as strengthening scrutiny, verifying and monitoring the practical situation, making clear ventilation plans, strictly following the mining sequence, promoting training of ventilation staff, enhancing ventilation system management, developing radon reduction ventilation technology, purchasing ventilation equipment as soon as possible in the future, and so on.
Ontology-based topic clustering for online discussion data

Science.gov (United States)

Wang, Yongheng; Cao, Kening; Zhang, Xiaoming

2013-03-01

With the rapid development of online communities, mining and extracting quality knowledge from online discussions becomes very important for the industrial and marketing sector, as well as for e-commerce applications and government. Most of the existing techniques model a discussion as a social network of users represented by a user-based graph without considering the content of the discussion. In this paper we propose a new multilayered model to analyze online discussions. The user-based and message-based representations are combined in this model. A novel clustering method based on frequent concept sets is used to cluster the original online discussion network into topic space. Domain ontology is used to improve the clustering accuracy. Parallel methods are also used to make the algorithms scalable to very large data sets. Our experimental study shows that the model and algorithms are effective when analyzing large-scale online discussion data.
Clustering Methods Application for Customer Segmentation to Manage Advertisement Campaign

Directory of Open Access Journals (Sweden)

Maciej Kutera

2010-10-01

Full Text Available Clustering methods are now advanced, elaborate algorithms for analyzing large data collections and have been included among data mining methods. They form a growing, rapidly evolving group of methods with an increasing variety of applications. This article presents our research on the usefulness of clustering methods in customer segmentation for managing advertisement campaigns. We report results obtained with four selected methods, chosen because their characteristics suggested they would suit our purposes. One of the analyzed methods, k-means clustering with randomly selected initial cluster seeds, gave very good results in customer segmentation for managing advertisement campaigns, and these results are presented in detail in the article. In contrast, one of the methods (hierarchical average linkage) was found useless in customer segmentation. Further investigation of the benefits of clustering methods in customer segmentation for managing advertisement campaigns is worth continuing, particularly because solutions in this field can bring measurable profits for marketing activity.
Citation-related reliability analysis for a pilot sample of underground coal mines

Energy Technology Data Exchange (ETDEWEB)

Kinilakodi, H.; Grayson, R.L. [Penn State University, University Park, PA (United States)

2011-05-15

The scrutiny of underground coal mine safety was heightened because of the disasters that occurred in 2006-2007, and more recently in 2010. In the aftermath of the 2006 incidents, the U.S. Congress passed the Mine Improvement and New Emergency Response Act of 2006 (MINER Act), which strengthened the existing regulations and mandated new laws to address various issues related to emergency preparedness and response, escape from an emergency situation, and protection of miners. The National Mining Association-sponsored Mine Safety Technology and Training Commission study highlighted the role of risk management in identifying and controlling major hazards, which are elements that could come together and cause a mine disaster. In 2007 MSHA revised its approach to the 'Pattern of Violations' (POV) process in order to target unsafe mines and then force them to remediate conditions in their mines. The POV approach has certain limitations that make it difficult for it to be enforced. One very understandable way to focus on removing threats from major-hazard conditions is to use citation-related reliability analysis. The citation reliability approach, which focuses on the probability of not getting a citation on a given inspector day, is considered an analogue to the maintenance reliability approach, which many mine operators understand and use. In this study, the citation reliability approach was applied to a stratified random sample of 31 underground coal mines to examine its potential for broader application. The results clearly show the best-performing and worst-performing mines for compliance with mine safety standards, and they highlight differences among different mine sizes.
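The citation reliability idea described above (the probability of not getting a citation on a given inspector day) can be illustrated with a small sketch. The abstract does not state the underlying model, so this example assumes citations follow a Poisson process with a constant rate per inspector day, which gives R = exp(-citations / inspector_days). The function name, mine names, and figures are hypothetical.

```python
import math


def citation_reliability(citations: int, inspector_days: float) -> float:
    """Probability of zero citations on one inspector day.

    Assumption (not from the study): citations arrive as a Poisson
    process with a constant rate per inspector day.
    """
    if inspector_days <= 0:
        raise ValueError("inspector_days must be positive")
    rate = citations / inspector_days  # mean citations per inspector day
    return math.exp(-rate)


# Hypothetical data: mine name -> (citations, inspector days)
mines = {"Mine A": (12, 120.0), "Mine B": (90, 150.0)}

# Rank mines from most reliable (fewest expected citations) to least.
ranked = sorted(mines, key=lambda m: citation_reliability(*mines[m]), reverse=True)
for name in ranked:
    print(name, round(citation_reliability(*mines[name]), 3))
```

Ranking by this value is one simple way to separate better-performing from worse-performing mines; the study's own stratification by mine size is not reproduced here.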
Phrase Mining of Textual Data to Analyze Extracellular Matrix Protein Patterns Across Cardiovascular Disease.

Science.gov (United States)

Liem, David Alexandre; Murali, Sanjana; Sigdel, Dibakar; Shi, Yu; Wang, Xuan; Shen, Jiaming; Choi, Howard; Caufield, J Harry; Wang, Wei; Ping, Peipei; Han, Jiawei

2018-05-18

Extracellular matrix (ECM) proteins have been shown to play important roles regulating multiple biological processes in an array of organ systems, including the cardiovascular system. By using a novel bioinformatics text-mining tool, we studied six categories of cardiovascular disease (CVD), namely ischemic heart disease (IHD), cardiomyopathies (CM), cerebrovascular accident (CVA), congenital heart disease (CHD), arrhythmias (ARR), and valve disease (VD), anticipating novel ECM protein-disease and protein-protein relationships hidden within vast quantities of textual data. We conducted a phrase-mining analysis, delineating the relationships of 709 ECM proteins with the six groups of CVDs reported in 1,099,254 abstracts. The technology pipeline known as Context-aware Semantic Online Analytical Processing (CaseOLAP) was applied to semantically rank the association of proteins to each and all six CVDs, performing analyses to quantify each protein-disease relationship. We performed principal component analysis and hierarchical clustering of the data, where each protein is visualized as a six dimensional vector. We found that ECM proteins display variable degrees of association with the six CVDs; certain CVDs share groups of associated proteins whereas others have divergent protein associations. We identified 82 ECM proteins sharing associations with all six CVDs. Our bioinformatics analysis ascribed distinct ECM pathways (via Reactome) from this subset of proteins, namely insulin-like growth factor regulation and interleukin-4 and interleukin-13 signaling, suggesting their contribution to the pathogenesis of all six CVDs. Finally, we performed hierarchical clustering analysis and identified protein clusters associated with a targeted CVD; analyses revealed unexpected insights underlying ECM-pathogenesis of CVDs.
Event metadata records as a testbed for scalable data mining

International Nuclear Information System (INIS)

Gemmeren, P van; Malon, D

2010-01-01

At a data rate of 200 hertz, event metadata records ('TAGs,' in ATLAS parlance) provide fertile grounds for development and evaluation of tools for scalable data mining. It is easy, of course, to apply HEP-specific selection or classification rules to event records and to label such an exercise 'data mining,' but our interest is different. Advanced statistical methods and tools such as classification, association rule mining, and cluster analysis are common outside the high energy physics community. These tools can prove useful, not for discovery physics, but for learning about our data, our detector, and our software. A fixed and relatively simple schema makes TAG export to other storage technologies such as HDF5 straightforward. This simplifies the task of exploiting very-large-scale parallel platforms such as Argonne National Laboratory's BlueGene/P, currently the largest supercomputer in the world for open science, in the development of scalable tools for data mining. Using a domain-neutral scientific data format may also enable us to take advantage of existing data mining components from other communities. There is, further, a substantial literature on the topic of one-pass algorithms and stream mining techniques, and such tools may be inserted naturally at various points in the event data processing and distribution chain. This paper describes early experience with event metadata records from ATLAS simulation and commissioning as a testbed for scalable data mining tool development and evaluation.

Cluster analysis of rural, urban, and curbside atmospheric particle size data.

Science.gov (United States)

Beddows, David C S; Dall'Osto, Manuel; Harrison, Roy M

2009-07-01

Particle size is a key determinant of the hazard posed by airborne particles. Continuous multivariate particle size data have been collected using aerosol particle size spectrometers sited at four locations within the UK: Harwell (Oxfordshire); Regents Park (London); British Telecom Tower (London); and Marylebone Road (London). These data have been analyzed using k-means cluster analysis, deduced to be the preferred cluster analysis technique, selected from an option of four partitional cluster packages, namely the following: Fuzzy; k-means; k-median; and Model-Based clustering. Using cluster validation indices, k-means clustering was shown to produce clusters with the smallest size, furthest separation, and importantly the highest degree of similarity between the elements within each partition. Using k-means clustering, the complexity of the data set is reduced, allowing characterization of the data according to the temporal and spatial trends of the clusters. At Harwell, the rural background measurement site, the cluster analysis showed that the spectra may be differentiated by their modal-diameters and average temporal trends showing either high counts during the day-time or night-time hours. Likewise for the urban sites, the cluster analysis differentiated the spectra into a small number of size distributions according to their modal-diameter, the location of the measurement site, and time of day. The responsible aerosol emission, formation, and dynamic processes can be inferred according to the cluster characteristics and correlation to concurrently measured meteorological, gas phase, and particle phase measurements.
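As a rough illustration of the k-means step described above, the following is a minimal sketch of Lloyd's algorithm in plain Python, together with the within-cluster sum of squares as one simple compactness measure. It is not the authors' software or their validation indices; the three-bin "spectra", the function names, and the seed are all hypothetical.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=100, seed=0):
    """Plain Lloyd's k-means; returns (labels, centers)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initial centers drawn from the data
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center for every point.
        labels = [min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centers.append(centers[j])  # keep an empty cluster's center
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers


def wcss(points, labels, centers):
    """Within-cluster sum of squares (smaller means more compact clusters)."""
    return sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))


# Hypothetical size spectra: counts in three size bins (small, medium, large).
spectra = [(9, 1, 0), (8, 2, 0), (10, 1, 1), (0, 2, 9), (1, 1, 8), (0, 1, 10)]
labels, centers = kmeans(spectra, k=2)
print(labels, round(wcss(spectra, labels, centers), 2))
```

In practice the number of clusters would be chosen by comparing validation indices across several values of k, as the abstract describes.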
Clustering of Sun Exposure Measurements

OpenAIRE

Have, Anna Szynkowiak; Larsen, Jan; Hansen, Lars Kai; Philipsen, Peter Alshede; Thieden, Elisabeth; Wulf, Hans Christian

2002-01-01

In a medically motivated Sun-exposure study, questionnaires concerning Sun-habits were collected from a number of subjects together with UV radiation measurements. This paper focuses on identifying clusters in the heterogeneous set of data for the purpose of understanding possible relations between Sun-habits exposure and eventually assessing the risk of skin cancer. A general probabilistic framework originally developed for text and Web mining is demonstrated to be useful for clustering of b...
ANALYSIS OF WEB MINING APPLICATIONS AND BENEFICIAL AREAS

Directory of Open Access Journals (Sweden)

Khaleel Ahmad

2011-10-01

Full Text Available The main purpose of this paper is to study the process of Web mining: its techniques, features, applications (e-commerce and e-business) and beneficial areas. Web mining has become more popular and is widely used in various application areas (such as business intelligence systems, e-commerce and e-business). E-commerce and e-business results are improved by applying mining techniques such as data mining and text mining; among all the mining techniques, web mining is better.
Preliminary analysis of surface mining options for Naval Oil Shale Reserve 1

Energy Technology Data Exchange (ETDEWEB)

1981-07-20

The study was undertaken to determine the economic viability of surface mining to exploit the reserves. It is based on resource information already developed for NOSR 1 and conceptual designs of mining systems compatible with this resource. Environmental considerations as they relate to surface mining have been addressed qualitatively. The conclusions on economic viability were based primarily on mining costs projected from other industries using surface mining. An analysis of surface mining for the NOSR 1 resource was performed based on its particular overburden thickness, oil shale thickness, oil shale grade, and topography. This evaluation considered reclamation of the surface as part of its design and cost estimate. The capital costs for mining 25 GPT and 30 GPT shale and the operating costs for mining 25 GPT, 30 GPT, and 35 GPT shale are presented. The relationship between operating cost and stripping ratio, and the break-even stripping ratio (BESR) for surface mining to be competitive with room-and-pillar mining, are shown. Identification of potential environmental impacts shows that environmental control procedures for surface mining are more difficult to implement than those for underground mining. The following three areas are of prime concern: maintenance of air quality standards by disruption, movement, and placement of large quantities of overburden; disruption or cutting of aquifers during the mining process which affect area water supplies; and potential mineral leaching from spent shales into the aquifers. Although it is an operational benefit to place spent shale in the open pit, leaching of the spent shales and contamination of the water is detrimental. It is therefore concluded that surface mining on NOSR 1 currently is neither economically desirable nor environmentally safe. Stringent mitigation measures would have to be implemented to overcome some of the potential environmental hazards.
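The break-even stripping ratio (BESR) mentioned above is commonly defined as the cost advantage of surface over underground mining per ton of ore, divided by the cost of removing one ton of overburden. The sketch below uses that textbook definition; the report's own cost figures are not given in the abstract, so the numbers and the function name are hypothetical.

```python
def break_even_stripping_ratio(underground_cost, surface_ore_cost, stripping_cost):
    """Tons of overburden per ton of ore at which surface mining breaks even.

    All costs are per ton: underground (e.g. room-and-pillar) mining of ore,
    surface mining of ore excluding stripping, and stripping of overburden.
    """
    if stripping_cost <= 0:
        raise ValueError("stripping_cost must be positive")
    return (underground_cost - surface_ore_cost) / stripping_cost


# Hypothetical costs in dollars per ton.
besr = break_even_stripping_ratio(
    underground_cost=12.0, surface_ore_cost=4.0, stripping_cost=2.0
)
print(besr)  # 4.0 tons of overburden per ton of shale
```

A deposit whose actual stripping ratio is below this value would favor surface mining on cost alone, before environmental considerations.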
Data mining a functional neuroimaging database for functional segregation in brain regions

DEFF Research Database (Denmark)

Nielsen, Finn Årup; Balslev, Daniela; Hansen, Lars Kai

2006-01-01

We describe a specialized neuroinformatic data mining technique in connection with a meta-analytic functional neuroimaging database: We mine for functional segregation within brain regions by identifying journal articles that report brain activations within the regions and clustering the abstract...
Data mining a functional neuroimaging database for functional segregation in brain regions

DEFF Research Database (Denmark)

Nielsen, Finn Årup

2006-01-01

We describe a specialized neuroinformatic data mining technique in connection with a meta-analytic functional neuroimaging database: We mine for functional segregation within brain regions by identifying journal articles that report brain activations within the regions and clustering the abstract...
An Application of Multiplier Analysis in Analyzing the Role of Mining Sectors on Indonesian National Economy

Science.gov (United States)

Subanti, S.; Hakim, A. R.; Hakim, I. M.

2018-03-01

The purpose of the current study is to apply multiplier analysis to the mining sector in Indonesia. The mining sector is defined as coal and metal; crude oil, natural gas, and geothermal; and other mining and quarrying. The multiplier analysis is based on input-output analysis and is divided into an income multiplier and an output multiplier. The results show that (1) Indonesian mining sectors ranked 6th, contributing 6.81% of national total output; (2) based on total gross value added, the sector contributed 12.13%, ranking 4th; (3) the income multiplier is 0.7062 and the output multiplier is 1.2426.
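In standard input-output analysis, the output multiplier of a sector is the corresponding column sum of the Leontief inverse (I - A)^-1, where A is the matrix of technical coefficients. The sketch below shows that calculation for a hypothetical two-sector economy; the coefficients are invented for illustration and are not the Indonesian table used in the study.

```python
def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    aug = [
        [float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
        for i, row in enumerate(matrix)
    ]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * c for v, c in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def output_multipliers(a):
    """Column sums of the Leontief inverse (I - A)^-1."""
    n = len(a)
    i_minus_a = [[(1.0 if i == j else 0.0) - a[i][j] for j in range(n)] for i in range(n)]
    leontief = invert(i_minus_a)
    return [sum(leontief[i][j] for i in range(n)) for j in range(n)]


# Hypothetical technical coefficients: sector 0 = mining, sector 1 = rest of economy.
A = [[0.2, 0.3],
     [0.1, 0.4]]
multipliers = output_multipliers(A)
print([round(m, 4) for m in multipliers])
```

Each multiplier is the total output generated across all sectors by one extra unit of final demand in that sector.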
Hyperspectral analysis for qualitative and quantitative features related to acid mine drainage at a remediated open-pit mine

Science.gov (United States)

Davies, G.; Calvin, W. M.

2015-12-01

The exposure of pyrite to oxygen and water in mine waste environments is known to generate acidity and the accumulation of secondary iron minerals. Sulfates and secondary iron minerals associated with acid mine drainage (AMD) exhibit diverse spectral properties in the ultraviolet, visible and near-infrared regions of the electromagnetic spectrum. The use of hyperspectral imagery for identification of AMD mineralogy and contamination has been well studied. Fewer studies have examined the impacts of hydrologic variations on mapping AMD or the unique spectral signatures of mine waters. Open-pit mine lakes are an additional environmental hazard which have not been widely studied using imaging spectroscopy. A better understanding of AMD variation related to climate fluctuations and the spectral signatures of contaminated surface waters will aid future assessments of environmental contamination. This study examined the ability of multi-season airborne hyperspectral data to identify the geochemical evolution of substances and contaminant patterns at the Leviathan Mine Superfund site. The mine is located 24 miles southeast of Lake Tahoe and contains remnant tailings piles and several AMD collection ponds. The objectives were to 1) distinguish temporal changes in mineralogy at the remediated open-pit sulfur mine, 2) identify the absorption features of mine affected waters, and 3) quantitatively link water spectra to known dissolved iron concentrations. Images from NASA's AVIRIS instrument were collected in the spring, summer, and fall seasons for two consecutive years at Leviathan (HyspIRI campaign). Images had a spatial resolution of 15 meters at nadir. Ground-based surveys using the ASD FieldSpecPro spectrometer and laboratory spectral and chemical analysis complemented the remote sensing data. Temporal changes in surface mineralogy were difficult to distinguish. However, seasonal changes in pond water quality were identified. Dissolved ferric iron and chlorophyll
Cluster analysis for determining distribution center location

Science.gov (United States)

Lestari Widaningrum, Dyah; Andika, Aditya; Murphiyanto, Richard Dimas Julian

2017-12-01

Determining the location of distribution facilities is highly important for surviving the high level of competition in today’s business world. Companies can operate multiple distribution centers to mitigate supply chain risk. Thus, new problems arise, namely how many facilities should be provided and where. This study examines a fast-food restaurant brand located in Greater Jakarta. This brand is among the top 5 fast-food restaurant chains based on retail sales. There were three stages in this study: compiling spatial data, cluster analysis, and network analysis. Cluster analysis results are used to consider the location of the additional distribution center. Network analysis results show a more efficient process, with a shorter distance in the distribution process.
Quantitative analysis of the taxation of uranium mines in Australia and Canada

International Nuclear Information System (INIS)

Barnett, D.W.; Anderson, D.L.

1984-01-01

The degree of neutrality of a tax policy is a gauge of how willing a government is to share in the risk of mineral development. This paper analyzes the practical characteristics of the uranium taxation policies of the Northern Territory in Australia and Saskatchewan in Canada. It superimposes these two policies on a large Australian uranium mine, based on the Ranger mine, and on a slightly larger Canadian mine, based on the Key Lake mine. The analysis focuses on the impact on the net-present-value of the producers' returns, the sharing of economic rent between the arms of government and the producer, and on the apparent neutrality of the tax policies. 24 references, 6 figures
Cluster Analysis as an Analytical Tool of Population Policy

Directory of Open Access Journals (Sweden)

Oksana Mikhaylovna Shubat

2017-12-01

Full Text Available The predicted negative trends in Russian demography (falling birth rates, population decline) make it more urgent to strengthen family and population policy measures. Our research purpose is to identify groups of Russian regions with similar characteristics in the family sphere using cluster analysis. The findings should make an important contribution to the field of family policy. We used hierarchical cluster analysis based on the Ward method and the Euclidean distance for segmentation of Russian regions. Clustering is based on four variables, which allowed assessing the family institution in the region. The authors used the data of the Federal State Statistics Service from 2010 to 2015. Clustering and profiling of each segment has allowed forming a model of Russian regions depending on the features of the family institution in these regions. The authors revealed four clusters grouping regions with similar problems in the family sphere. This segmentation makes it possible to develop the most relevant family policy measures in each group of regions. Thus, the analysis has shown a high degree of differentiation of the family institution in the regions. This suggests that a unified approach to solving population problems is far from being effective. To achieve greater results in the implementation of family policy, a differentiated approach is needed. Methods of multidimensional data classification can be successfully applied as a relevant analytical toolkit. Further research could develop the adaptation of multidimensional classification methods to the analysis of the population problems in Russian regions. In particular, the algorithms of nonparametric cluster analysis may be of relevance in future studies.
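The Ward method with Euclidean distance named above can be sketched as a naive agglomerative procedure: at each step, merge the two clusters whose union increases the within-cluster sum of squares the least. The code below is a minimal illustration, not the authors' procedure; the four indicator values per region are hypothetical, and real data would normally be standardized first.

```python
def centroid(rows):
    """Component-wise mean of a list of equal-length tuples."""
    return tuple(sum(col) / len(rows) for col in zip(*rows))


def ward_cost(points, a, b):
    """Increase in within-cluster sum of squares if clusters a and b merge."""
    ca = centroid([points[i] for i in a])
    cb = centroid([points[i] for i in b])
    sq = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return len(a) * len(b) / (len(a) + len(b)) * sq


def ward_clusters(points, k):
    """Merge clusters greedily by Ward's criterion until k clusters remain."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        pairs = [(i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda ij: ward_cost(points, clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Hypothetical regions described by four family-sphere indicators.
regions = [
    (1.0, 1.0, 0.2, 0.1),
    (1.1, 0.9, 0.3, 0.1),
    (5.0, 5.2, 4.0, 4.1),
    (5.1, 5.0, 4.2, 3.9),
]
groups = ward_clusters(regions, k=2)
print(groups)
```

The result lists region indices per cluster; profiling each cluster then amounts to comparing the cluster centroids on the original variables.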
Clinical Characteristics of Exacerbation-Prone Adult Asthmatics Identified by Cluster Analysis.

Science.gov (United States)

Kim, Mi Ae; Shin, Seung Woo; Park, Jong Sook; Uh, Soo Taek; Chang, Hun Soo; Bae, Da Jeong; Cho, You Sook; Park, Hae Sim; Yoon, Ho Joo; Choi, Byoung Whui; Kim, Yong Hoon; Park, Choon Sik

2017-11-01

Asthma is a heterogeneous disease characterized by various types of airway inflammation and obstruction. Therefore, it is classified into several subphenotypes, such as early-onset atopic, obese non-eosinophilic, benign, and eosinophilic asthma, using cluster analysis. A number of asthmatics frequently experience exacerbation over a long-term follow-up period, but the exacerbation-prone subphenotype has rarely been evaluated by cluster analysis. This prompted us to identify clusters reflecting asthma exacerbation. A uniform cluster analysis method was applied to 259 adult asthmatics who were regularly followed-up for over 1 year using 12 variables, selected on the basis of their contribution to asthma phenotypes. After clustering, clinical profiles and exacerbation rates during follow-up were compared among the clusters. Four subphenotypes were identified: cluster 1 was comprised of patients with early-onset atopic asthma with preserved lung function, cluster 2 late-onset non-atopic asthma with impaired lung function, cluster 3 early-onset atopic asthma with severely impaired lung function, and cluster 4 late-onset non-atopic asthma with well-preserved lung function. The patients in clusters 2 and 3 were identified as exacerbation-prone asthmatics, showing a higher risk of asthma exacerbation. Two different phenotypes of exacerbation-prone asthma were identified among Korean asthmatics using cluster analysis; both were characterized by impaired lung function, but the age at asthma onset and atopic status were different between the two. Copyright © 2017 The Korean Academy of Asthma, Allergy and Clinical Immunology · The Korean Academy of Pediatric Allergy and Respiratory Disease
Data Exploration and Analysis of Alternative Learning System Accreditation and Equivalency Test Result Using Data Mining

Science.gov (United States)

Talingdan, J. A.; Trinidad, J. T., Jr.; Palaoag, T. D.

2018-03-01

The Alternative Learning System (ALS) is a subsystem of the Department of Education (DepEd) that serves as an option for learners who cannot afford formal education. The research focuses on the data exploration and analysis of ALS accreditation and equivalency (A & E) test results using data mining. The ALS 2014 to 2016 A & E test results at the secondary level were used as data sets in the study. The A & E test results revealed that the passing rate doubled each year. The results were clustered using the k-means clustering algorithm and grouped into good, medium, and low standard learners to identify students who need extra help to improve. From the clustered data, it was found that the strand in which learners are weakest is strand 4, Development of Self and a Sense of Community, with a general average of 84.23. The data also revealed that the essay type of exam got the lowest score, with a general average of 2.14, compared to the multiple type of exam that covers the five learning strands. Furthermore, decision tree and naive Bayes classifiers were employed to predict the performance of the learners in the A & E test and to determine which is better to use for prediction. It was concluded that naive Bayes performs better because its accuracy rate is higher than that of the decision tree algorithm.
Study on text mining algorithm for ultrasound examination of chronic liver diseases based on spectral clustering

Science.gov (United States)

Chang, Bingguo; Chen, Xiaofei

2018-05-01

Ultrasonography is an important examination for the diagnosis of chronic liver disease. The doctor reports the liver indicators and assesses the patient's condition according to the description in the ultrasound report. With the rapid increase in the volume of ultrasound report data, the workload of physicians who manually interpret ultrasound results increases significantly. In this paper, we use the spectral clustering method to cluster the descriptions in ultrasound reports and automatically generate the ultrasound diagnosis by machine learning. Ultrasound examination reports of 110 groups of chronic liver disease cases were selected as test samples in this experiment, and the spectral clustering results were validated and compared with those of the k-means clustering algorithm. The results show that the accuracy of spectral clustering is 92.73%, which is higher than that of the k-means clustering algorithm, providing strong support for ultrasound-assisted diagnosis of patients with chronic liver disease.
A Proposed Data Fusion Architecture for Micro-Zone Analysis and Data Mining

Energy Technology Data Exchange (ETDEWEB)

Kevin McCarthy; Milos Manic

2012-08-01

Data Fusion requires the ability to combine or “fuse” data from multiple data sources. Time Series Analysis is a data mining technique used to predict future values from a data set based upon past values. Unlike other data mining techniques, however, Time Series places special emphasis on periodicity and how seasonal and other time-based factors tend to affect trends over time. One of the difficulties encountered in developing generic time series techniques is the wide variability of the data sets available for analysis. This presents challenges all the way from the data gathering stage to results presentation. This paper presents an architecture designed and used to facilitate the collection of disparate data sets well suited to Time Series analysis as well as other predictive data mining techniques. Results show this architecture provides a flexible, dynamic framework for the capture and storage of a myriad of dissimilar data sets and can serve as a foundation from which to build a complete data fusion architecture.
The cluster analysis based on non-teacher artificial neural network for the danger prediction of coal spontaneous fire

Energy Technology Data Exchange (ETDEWEB)

Wang, D.; Wang, J. [China University of Mining and Technology (China)

1999-04-01

This paper focuses on the problem of predicting the danger level of spontaneous fire in coal mines. Firstly, the inadequacy of the present artificial neural network prediction model is analysed. Then a new cluster model based on a non-teacher (unsupervised) neural network is constructed according to the danger judgement standards given by experts. On this basis, by adopting the error square sum criterion and its algorithm, the corresponding prediction software is developed and applied in two working faces of Chaili Coal Mine. The forecasting results are significant for the prevention of spontaneous fire. 4 refs., 1 fig., 1 tab.
Overview of the INEX 2008 XML Mining Track

Science.gov (United States)

Denoyer, Ludovic; Gallinari, Patrick

We describe here the XML Mining Track at INEX 2008. This track was launched for exploring two main ideas: first identifying key problems for mining semi-structured documents and new challenges of this emerging field and second studying and assessing the potential of machine learning techniques for dealing with generic Machine Learning (ML) tasks in the structured domain i.e. classification and clustering of semi structured documents. This year, the track focuses on the supervised classification and the unsupervised clustering of XML documents using link information. We consider a corpus of about 100,000 Wikipedia pages with the associated hyperlinks. The participants have developed models using the content information, the internal structure information of the XML documents and also the link information between documents.
Automated analysis of organic particles using cluster SIMS

Energy Technology Data Exchange (ETDEWEB)

Gillen, Greg; Zeissler, Cindy; Mahoney, Christine; Lindstrom, Abigail; Fletcher, Robert; Chi, Peter; Verkouteren, Jennifer; Bright, David; Lareau, Richard T.; Boldman, Mike

2004-06-15

Cluster primary ion bombardment combined with secondary ion imaging is used on an ion microscope secondary ion mass spectrometer for the spatially resolved analysis of organic particles on various surfaces. Compared to the use of monoatomic primary ion beam bombardment, the use of a cluster primary ion beam (SF₅⁺ or C₈⁻) provides significant improvement in molecular ion yields and a reduction in beam-induced degradation of the analyte molecules. These characteristics of cluster bombardment, along with automated sample stage control and custom image analysis software are utilized to rapidly characterize the spatial distribution of trace explosive particles, narcotics and inkjet-printed microarrays on a variety of surfaces.
Clustering Spam Domains and Destination Websites: Digital Forensics with Data Mining

Directory of Open Access Journals (Sweden)

Chun Wei

2010-03-01

Full Text Available Spam related cyber crimes have become a serious threat to society. Current spam research mainly aims to detect spam more effectively. We believe the prosecution of spammers is a more effective way of stopping spam emails than filtering, therefore more research is needed to help forensic investigators to collect useful evidence. This research proposes an algorithm for clustering spam domains extracted from spam emails based on the hosting IP addresses and tracing the domains over a period of time. The results reveal several facts that merit law enforcement attention: many seemingly unrelated spam campaigns are actually related; spammers have a sophisticated mechanism for combating URL blacklisting by registering many new domain names every day and flushing out old domains; the domains are hosted at different IP addresses across several networks, mostly in China where legislation is not as tight as in the US; old IP addresses are replaced by new ones from time to time, but still show strong correlation among them. These facts lead to the conclusion that spam-related cyber crimes are operated by well-organized criminal syndicates that have sufficient manpower to distribute a huge volume of spam through bots, purchase a large number of domain names and hosting servers and maintain websites to sell counterfeit products online. Traditional law enforcement technology has not scaled well in cases involving millions of data elements. This paper demonstrates an effective use of data mining to respond to this challenge.
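One simple way to group spam domains by their hosting IP addresses is to treat domains and IPs as nodes of a graph and take connected components, so that domains sharing an IP, directly or through a chain of shared IPs, fall into the same cluster. The sketch below does this with a union-find structure. It is an assumed simplification, not the paper's algorithm; the domain names and addresses are made up from reserved documentation ranges.

```python
from collections import defaultdict


def cluster_domains(records):
    """Group domains that are linked through shared hosting IPs.

    records: iterable of (domain, ip) observations.
    Returns a sorted list of sorted domain lists, one per cluster.
    """
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Tag nodes so a domain and an IP can never collide as keys.
    for domain, ip in records:
        union(("domain", domain), ("ip", ip))

    groups = defaultdict(set)
    for node in list(parent):
        if node[0] == "domain":
            groups[find(node)].add(node[1])
    return sorted(sorted(g) for g in groups.values())


# Hypothetical observations extracted from spam emails.
observations = [
    ("pills-a.example", "192.0.2.10"),
    ("pills-b.example", "192.0.2.10"),
    ("pills-b.example", "198.51.100.7"),
    ("watches.example", "198.51.100.7"),
    ("loans.example", "203.0.113.5"),
]
clusters = cluster_domains(observations)
print(clusters)
```

Because observations can be added over time, the same structure also links old and new IP addresses of a campaign as long as some domain was seen on both.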
Analysis of Air Particles Around Site Plan of Gold Mining, North Sumatera

International Nuclear Information System (INIS)

Gatot-Suhariyono; Erizal-Tanjung

2004-01-01

An analysis of air particles around the site plan of a gold mine in North Sumatra has been conducted. Air particles of TSP (Total Suspended Particulate), which have a maximum diameter of around 45 μm (PM2.5), were sampled in four places using a cascade impactor. The measurement results indicate that the concentrations of TSP and PM10/PM2.5 in the center of the mining site plan were below the ambient air quality standard (PP RI no. 41/1999), while the concentrations in the surrounding areas exceeded it. The concentrations in the areas around the mine were not caused by air particles from the center of the mining site plan. Based on the regulation of the head of BAPEDAL no. Kep-107/BAPEDAL/11/1997, the concentrations of PM10/PM2.5 and TSP in the center of the mining site plan are in the moderate category, while those in the surrounding areas are in the unhealthy category. The unhealthy category entails reduced visibility and dust contamination everywhere, while the moderate category entails only reduced visibility. (author)

Astronomy and big data a data clustering approach to identifying uncertain galaxy morphology

CERN Document Server

Edwards, Kieran Jay

2014-01-01

With the onset of massive cosmological data collection through media such as the Sloan Digital Sky Survey (SDSS), galaxy classification has been accomplished for the most part with the help of citizen science communities like Galaxy Zoo. Seeking the wisdom of the crowd for such Big Data processing has proved extremely beneficial. However, an analysis of one of the Galaxy Zoo morphological classification data sets has shown that a significant majority of all classified galaxies are labelled as “Uncertain”. This book reports on how to use data mining, more specifically clustering, to identify galaxies for which the public has shown some degree of uncertainty as to whether they belong to one morphology type or another. The book shows the importance of transitions between different data mining techniques in an insightful workflow. It demonstrates that clustering makes it possible to identify discriminating features in the analysed data sets, adopting a novel feature selection algorithm called Incremental Feature Select...
Network Analysis Tools: from biological networks to clusters and pathways.

Science.gov (United States)

Brohée, Sylvain; Faust, Karoline; Lima-Mendez, Gipsi; Vanderstocken, Gilles; van Helden, Jacques

2008-01-01

Network Analysis Tools (NeAT) is a suite of computer tools that integrate various algorithms for the analysis of biological networks: comparison between graphs, between clusters, or between graphs and clusters; network randomization; analysis of degree distribution; network-based clustering and path finding. The tools are interconnected to enable a stepwise analysis of the network through a complete analytical workflow. In this protocol, we present a typical case of utilization, where the tasks above are combined to decipher a protein-protein interaction network retrieved from the STRING database. The results returned by NeAT are typically subnetworks, networks enriched with additional information (i.e., clusters or paths) or tables displaying statistics. Typical networks comprising several thousands of nodes and arcs can be analyzed within a few minutes. The complete protocol can be read and executed in approximately 1 h.
Performance analysis of clustering techniques over microarray data: A case study

Science.gov (United States)

Dash, Rasmita; Misra, Bijan Bihari

2018-03-01

Handling big data is one of the major issues in statistical data analysis, and cluster analysis plays a vital role in dealing with large-scale data. There are many clustering techniques, each taking a different approach, but it is difficult to predict which approach suits a particular dataset. To deal with this problem, a grading approach is applied across many clustering techniques to identify a stable technique. Because the grading depends on the characteristics of the dataset as well as on the validity indices, a two-stage grading approach is implemented. In this study, the grading approach is applied to five clustering techniques: hybrid swarm based clustering (HSC), k-means, partitioning around medoids (PAM), vector quantization (VQ) and agglomerative nesting (AGNES). The experiments are conducted on five microarray datasets with seven validity indices. The grading approach's finding that a clustering technique is significant is also confirmed by the Nemenyi post-hoc hypothesis test.
Recommending Learning Activities in Social Network Using Data Mining Algorithms

Science.gov (United States)

Mahnane, Lamia

2017-01-01

In this paper, we show how data mining algorithms (e.g. Apriori Algorithm (AP) and Collaborative Filtering (CF)) are useful in a New Social Network (NSN-AP-CF). "NSN-AP-CF" processes the clusters based on different learning styles. Next, it analyzes the habits and the interests of the users through mining the frequent episodes by the…
Cluster analysis of typhoid cases in Kota Bharu, Kelantan, Malaysia

Directory of Open Access Journals (Sweden)

Nazarudin Safian

2008-09-01

Full Text Available Typhoid fever is still a major public health problem globally as well as in Malaysia. This study was done to identify the spatial epidemiology of typhoid fever in the Kota Bharu District of Malaysia as a first step to developing more advanced analysis of the whole country. The main characteristic of the epidemiological pattern that interested us was whether typhoid cases occurred in clusters or whether they were evenly distributed throughout the area. We also wanted to know at what spatial distances they were clustered. All confirmed typhoid cases that were reported to the Kota Bharu District Health Department from the year 2001 to June of 2005 were taken as the samples. From the home address of the cases, the location of the house was traced and a coordinate was taken using handheld GPS devices. Spatial statistical analysis was done to determine the distribution of typhoid cases, whether clustered, random or dispersed. The spatial statistical analysis was done using CrimeStat III software to determine whether typhoid cases occur in clusters, and later on to determine at what distances they clustered. Among the 736 cases involved in the study, there was significant clustering for cases occurring in the years 2001, 2002, 2003 and 2005. There was no significant clustering in year 2004. Typhoid clustering also occurred strongly for distances up to 6 km. This study shows that typhoid cases occur in clusters, and this method could be applicable to describe spatial epidemiology for a specific area. (Med J Indones 2008; 17: 175-82) Keywords: typhoid, clustering, spatial epidemiology, GIS
Cluster analysis of Southeastern U.S. climate stations

Science.gov (United States)

Stooksbury, D. E.; Michaels, P. J.

1991-09-01

A two-step cluster analysis of 449 Southeastern climate stations is used to objectively determine general climate clusters (groups of climate stations) for eight southeastern states. The purpose is objectively to define regions of climatic homogeneity that should perform more robustly in subsequent climatic impact models. This type of analysis has been successfully used in many related climate research problems including the determination of corn/climate districts in Iowa (Ortiz-Valdez, 1985) and the classification of synoptic climate types (Davis, 1988). These general climate clusters may be more appropriate for climate research than the standard climate divisions (CD) groupings of climate stations, which are modifications of the agro-economic United States Department of Agriculture crop reporting districts. Unlike the CD's, these objectively determined climate clusters are not restricted by state borders and thus have reduced multicollinearity which makes them more appropriate for the study of the impact of climate and climatic change.
Analysis of radon reduction by ventilation in uranium mines in China

International Nuclear Information System (INIS)

Hu Penghua; Li Xianjie

2011-01-01

Mine ventilation is the most important way to reduce radon in uranium mines. At present, the concentrations of radon and its daughters in underground air in Chinese uranium mines are 3-5 times higher than those in other countries under the same protection conditions. In this paper, the status of radon reduction in Chinese uranium mines is analysed and the advantages and shortcomings of a variety of ventilation and radon reduction measures are compared. The reasons for the higher radon and radon daughter concentrations in Chinese uranium mines are discussed, and problems are identified in three areas: radon reduction ventilation theory, measures and management. Based on these problems, the paper puts forward proposals and measures such as strengthening examination, verification and monitoring of the actual situation, clarifying ventilation plans, training ventilation technicians, enhancing ventilation system management, developing research on radon reduction ventilation, and putting ventilation equipment in place as soon as possible. (authors)
Cluster analysis by optimal decomposition of induced fuzzy sets

Energy Technology Data Exchange (ETDEWEB)

Backer, E

1978-01-01

Nonsupervised pattern recognition is addressed and the concept of fuzzy sets is explored in order to provide the investigator (data analyst) additional information supplied by the pattern class membership values apart from the classical pattern class assignments. The basic ideas behind the pattern recognition problem, the clustering problem, and the concept of fuzzy sets in cluster analysis are discussed, and a brief review of the literature of the fuzzy cluster analysis is given. Some mathematical aspects of fuzzy set theory are briefly discussed; in particular, a measure of fuzziness is suggested. The optimization-clustering problem is characterized. Then the fundamental idea behind affinity decomposition is considered. Next, further analysis takes place with respect to the partitioning-characterization functions. The iterative optimization procedure is then addressed. The reclassification function is investigated and convergence properties are examined. Finally, several experiments in support of the method suggested are described. Four object data sets serve as appropriate test cases. 120 references, 70 figures, 11 tables. (RWR)
Graph analysis of cell clusters forming vascular networks

Science.gov (United States)

Alves, A. P.; Mesquita, O. N.; Gómez-Gardeñes, J.; Agero, U.

2018-03-01

This manuscript describes the experimental observation of vasculogenesis in chick embryos by means of network analysis. The formation of the vascular network was observed in the area opaca of embryos from 40 to 55 h of development. In the area opaca, endothelial cell clusters self-organize as a primitive and approximately regular network of capillaries. The process was observed by bright-field microscopy in control embryos and in embryos treated with Bevacizumab (Avastin), an antibody that inhibits the signalling of the vascular endothelial growth factor (VEGF). The sequences of images of the vascular growth were thresholded and used to quantify the forming network in control and Avastin-treated embryos. This characterization is made by measuring the vessel density, the number of cell clusters and the largest cluster density. From the original images, the topology of the vascular network was extracted and characterized by means of the usual network metrics, such as the degree distribution, average clustering coefficient, average shortest path length and assortativity, among others. This analysis makes it possible to monitor how the largest connected cluster of the vascular network evolves in time and provides quantitative evidence of the disruptive effects that Avastin has on the tree structure of vascular networks.
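Two of the network metrics named in this abstract, the degree distribution and the average clustering coefficient, can be illustrated with a minimal Python sketch. The toy graph and the function names below are assumptions made for illustration; they are not taken from the paper, which works on networks extracted from microscopy images.

```python
from collections import Counter
from itertools import combinations


def degree_distribution(adj):
    """Map each degree value to the number of nodes that have that degree."""
    return Counter(len(neighbours) for neighbours in adj.values())


def average_clustering(adj):
    """Mean local clustering coefficient; nodes with degree < 2 contribute 0."""
    total = 0.0
    for neighbours in adj.values():
        k = len(neighbours)
        if k < 2:
            continue
        # Count the edges that exist between pairs of this node's neighbours.
        links = sum(1 for u, v in combinations(neighbours, 2) if v in adj[u])
        total += 2.0 * links / (k * (k - 1))
    return total / len(adj)


# Hypothetical toy graph: a triangle (a, b, c) with a pendant node d attached to c.
toy = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}

print(degree_distribution(toy))
print(average_clustering(toy))
```

The graph is stored as an undirected adjacency dictionary, so every edge appears in the neighbour sets of both endpoints.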
Prediction accident triangle in maintenance of underground mine facilities using Poisson distribution analysis

Science.gov (United States)

Khuluqi, M. H.; Prapdito, R. R.; Sambodo, F. P.

2018-04-01

In Indonesia, mining is categorized as a hazardous industry. In recent years, a dramatic increase in mining equipment and technological complexity has raised maintenance expectations and changed working conditions, especially with regard to safety. Ensuring safety during maintenance work in underground mines is an integral part of accident prevention programs. The accident triangle helps safety practitioners draw a road map for preventing accidents. The Poisson distribution is appropriate for analysing accidents at a specific site in a given time period. Based on an analysis of accident statistics from underground mine maintenance at PT. Freeport Indonesia from 2011 through 2016, a new accident triangle is found: 12 minor accidents and 66 equipment damages for every 1 major accident. The result can be used to improve future accident prevention programs.
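As a rough illustration of why the Poisson distribution suits accident counts at one site over a fixed period, the sketch below evaluates the Poisson probability mass function in Python. The rate of 2.0 events per period is a hypothetical value chosen for the example and does not come from the study.

```python
import math


def poisson_pmf(k, lam):
    """Probability of exactly k events when the expected count per period is lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def prob_at_least(k, lam):
    """Probability of k or more events in the period."""
    return 1.0 - sum(poisson_pmf(i, lam) for i in range(k))


# Hypothetical rate: 2.0 minor accidents expected per period (not from the study).
rate = 2.0
p_none = poisson_pmf(0, rate)
p_three_or_more = prob_at_least(3, rate)
print(p_none, p_three_or_more)
```

In practice the rate would be estimated from the recorded accident counts, for example as the mean number of accidents per period.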
PARALLEL SPATIOTEMPORAL SPECTRAL CLUSTERING WITH MASSIVE TRAJECTORY DATA

Directory of Open Access Journals (Sweden)

Y. Z. Gu

2017-09-01

Full Text Available Massive trajectory data contains a wealth of useful information and knowledge. Spectral clustering, which has been shown to be effective in finding clusters, has become an important clustering approach in trajectory data mining. However, traditional spectral clustering lacks a temporal extension and is limited in its applicability to large-scale problems by its high computational complexity. This paper presents a parallel spatiotemporal spectral clustering method based on multiple acceleration solutions to make the algorithm more effective and efficient. Its performance is demonstrated by an experiment carried out on a massive taxi trajectory dataset from Wuhan city, China.
An Evaluation of Practical Applicability of Multi-Assortment Production Break-Even Analysis based on Mining Companies

Science.gov (United States)

Fuksa, Dariusz; Trzaskuś-Żak, Beata; Gałaś, Zdzisław; Utrata, Arkadiusz

2017-03-01

The vast majority of mining companies produce more than one product. For them, break-even analysis, also referred to as CVP (Cost-Volume-Profit) analysis (Wilkinson, 2005; Czopek, 2003), is significantly constrained by the need to include the multi-assortment structure in the analysis; an open-pit mine, for example, may offer more than 20 assortments (depending on the grain size). The article presents methods for evaluating the break-even point (in volume and value) for both single-assortment and multi-assortment production. The complexity of break-even evaluation for multi-assortment production has led to many methods and various approaches to the analysis. They differ especially in how fixed costs are accounted for: allocated entirely to particular assortments, attributed to the company as a whole, or partly allocated to particular assortments and partly attributed to the company as a whole. The chosen methods of break-even analysis were evaluated on two examples of mining companies: an open-pit mine of rock materials and an underground hard coal mine. The selection of methods was determined by the data the companies made available. The data for the analysis come from internal documentation of the mines: financial statements, breakdowns and cost calculations.
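The break-even idea behind CVP analysis can be sketched in Python as follows. The single-product formula divides fixed costs by the unit contribution margin (price minus unit variable cost). The multi-product function uses the common textbook approach of a fixed sales mix with a weighted-average contribution margin; this is an assumption for illustration and is not necessarily one of the methods evaluated in the article. All figures are hypothetical.

```python
def break_even_volume(fixed_costs, price, unit_variable_cost):
    """Units that must be sold for revenue to cover total costs."""
    margin = price - unit_variable_cost
    if margin <= 0:
        raise ValueError("unit contribution margin must be positive")
    return fixed_costs / margin


def multi_product_break_even(fixed_costs, products):
    """Break-even units per product under a fixed sales mix.

    products is a list of (mix_share, price, unit_variable_cost) tuples.
    Fixed costs are treated as company-wide (an assumption of this sketch).
    """
    total_share = sum(share for share, _, _ in products)
    # Weighted-average contribution margin per unit of the overall mix.
    wacm = sum(share * (p - v) for share, p, v in products) / total_share
    if wacm <= 0:
        raise ValueError("weighted contribution margin must be positive")
    total_units = fixed_costs / wacm
    return [total_units * share / total_share for share, _, _ in products]


# Hypothetical figures, not taken from the companies studied.
print(break_even_volume(1000, 10, 6))
print(multi_product_break_even(1200, [(0.5, 10, 6), (0.5, 8, 6)]))
```

Allocating part of the fixed costs to individual assortments, as the article discusses, would change the result; the sketch covers only the simplest company-wide case.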
Unsupervised text mining for assessing and augmenting GWAS results.

Science.gov (United States)

Ailem, Melissa; Role, François; Nadif, Mohamed; Demenais, Florence

2016-04-01

Text mining can assist in the analysis and interpretation of large-scale biomedical data, helping biologists to quickly and cheaply gain confirmation of hypothesized relationships between biological entities. We set this question in the context of genome-wide association studies (GWAS), an actively emerging field that has helped to identify many genes associated with multifactorial diseases. These studies make it possible to identify groups of genes associated with the same phenotype, but provide no information about the relationships between these genes. Therefore, our objective is to leverage unsupervised text mining techniques, using text-based cosine similarity comparisons and clustering applied to candidate and random gene vectors, in order to augment the GWAS results. We propose a generic framework which we used to characterize the relationships between 10 genes reported to be associated with asthma by a previous GWAS. The results of this experiment showed that the similarities between these 10 genes were significantly stronger than would be expected by chance (one-sided p-value<0.01). The clustering of observed and randomly selected genes also made it possible to generate hypotheses about potential functional relationships between these genes and thus contributed to the discovery of new candidate genes for asthma. Copyright © 2016 Elsevier Inc. All rights reserved.
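The text-based cosine similarity comparison mentioned in this abstract can be illustrated with a small Python sketch. It assumes plain term-frequency vectors built by whitespace tokenization; the weighting scheme actually used by the authors is not stated in the abstract, and the example strings are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the term-frequency vectors of two texts."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    # Only terms present in both texts contribute to the dot product.
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm = (math.sqrt(sum(c * c for c in a.values()))
            * math.sqrt(sum(c * c for c in b.values())))
    return dot / norm if norm else 0.0


# Hypothetical gene descriptions, for illustration only.
doc_1 = "airway inflammation cytokine receptor"
doc_2 = "cytokine receptor signalling in airway cells"
print(cosine_similarity(doc_1, doc_2))
```

A value near 1 means the two texts use terms in similar proportions, and 0 means they share no terms.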
application of single-linkage clustering method in the analysis of ...

African Journals Online (AJOL)

Admin

ANALYSIS OF GROWTH RATE OF GROSS DOMESTIC PRODUCT. (GDP) AT ... The end result of the algorithm is a tree of clusters called a dendrogram, which shows how the clusters are ..... Number of cluster sum from from observations of ...
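The single-linkage procedure named in this record can be sketched in Python as follows. The sketch clusters one-dimensional values and returns the sequence of merges, which is the information a dendrogram displays. The input values are hypothetical and the function name is an assumption; neither comes from the article.

```python
def single_linkage(points):
    """Agglomerative clustering of 1-D values using single linkage.

    Returns the merges in order as (cluster_a, cluster_b, distance) tuples.
    """
    clusters = [(p,) for p in points]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the two closest members.
                d = min(abs(x - y) for x in clusters[i] for y in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


# Hypothetical growth-rate values, for illustration only.
for left, right, distance in single_linkage([1.0, 2.0, 6.0, 7.5]):
    print(left, right, distance)
```

Each merge joins the two clusters whose closest members are nearest to each other, and the merge distances give the heights at which branches join in the dendrogram.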
Redundancy in electronic health record corpora: analysis, impact on text mining performance and mitigation strategies.

Science.gov (United States)

Cohen, Raphael; Elhadad, Michael; Elhadad, Noémie

2013-01-16

The increasing availability of Electronic Health Record (EHR) data and specifically free-text patient notes presents opportunities for phenotype extraction. Text-mining methods in particular can help disease modeling by mapping named-entity mentions to terminologies and clustering semantically related terms. EHR corpora, however, exhibit specific statistical and linguistic characteristics when compared with corpora in the biomedical literature domain. We focus on copy-and-paste redundancy: clinicians typically copy and paste information from previous notes when documenting a current patient encounter. Thus, within a longitudinal patient record, one expects to observe heavy redundancy. In this paper, we ask three research questions: (i) How can redundancy be quantified in large-scale text corpora? (ii) Conventional wisdom is that larger corpora yield better results in text mining. But how does the observed EHR redundancy affect text mining? Does such redundancy introduce a bias that distorts learned models? Or does the redundancy introduce benefits by highlighting stable and important subsets of the corpus? (iii) How can one mitigate the impact of redundancy on text mining? We analyze a large-scale EHR corpus and quantify redundancy both in terms of word and semantic concept repetition. We observe redundancy levels of about 30% and non-standard distribution of both words and concepts. We measure the impact of redundancy on two standard text-mining applications: collocation identification and topic modeling. We compare the results of these methods on synthetic data with controlled levels of redundancy and observe significant performance variation. Finally, we compare two mitigation strategies to avoid redundancy-induced bias: (i) a baseline strategy, keeping only the last note for each patient in the corpus; (ii) removing redundant notes with an efficient fingerprinting-based algorithm. (a) For text mining, preprocessing the EHR corpus with fingerprinting yields
A Flocking Based algorithm for Document Clustering Analysis

Energy Technology Data Exchange (ETDEWEB)

Cui, Xiaohui [ORNL; Gao, Jinzhu [ORNL; Potok, Thomas E [ORNL

2006-01-01

Social animals or insects in nature often exhibit a form of emergent collective behavior known as flocking. In this paper, we present a novel Flocking based approach for document clustering analysis. Our Flocking clustering algorithm uses stochastic and heuristic principles discovered from observing bird flocks or fish schools. Unlike other partition clustering algorithms such as K-means, the Flocking based algorithm does not require initial partitional seeds. The algorithm generates a clustering of a given set of data by embedding the high-dimensional data items on a two-dimensional grid for easy retrieval and visualization of the clustering result. Inspired by the self-organized behavior of bird flocks, we represent each document object with a flock boid. The simple local rules followed by each flock boid result in the entire document flock generating complex global behaviors, which eventually result in a clustering of the documents. We evaluate the efficiency of our algorithm with both a synthetic dataset and a real document collection that includes 100 news articles collected from the Internet. Our results show that the Flocking clustering algorithm achieves better performance than the K-means and Ant clustering algorithms for real document clustering.
Using data mining on student behavior and cognitive style data for improving e-learning systems: a case study

Directory of Open Access Journals (Sweden)

Milos Jovanovic

2012-06-01

Full Text Available In this research we applied classification models to predict students' performance, and cluster models to group students based on their cognitive styles in an e-learning environment. The classification models described in this paper should help teachers, students and business people to engage early with students who are likely to excel on a selected topic. Clustering students based on cognitive styles and their overall performance should enable better adaptation of the learning materials to their learning styles. The approach is tested using well-established data mining algorithms, and evaluated by several evaluation measures. The model building process included data preprocessing, parameter optimization and attribute selection steps, which enhanced the overall performance. Additionally, we propose a Moodle module that allows automatic extraction of the data needed for educational data mining analysis and deploys the models developed in this study.
CLUSTER ANALYSIS UKRAINIAN REGIONAL DISTRIBUTION BY LEVEL OF INNOVATION

Directory of Open Access Journals (Sweden)

Roman Shchur

2016-07-01

Full Text Available A SWOT analysis of the threats and benefits of the innovation development strategy of the Ivano-Frankivsk region in the context of financial support was conducted. A methodical approach to determining the potential of public-private partnerships as a tool for financing innovative economic development was identified. A cluster analysis of the possibilities for forming public-private partnerships in a particular region was carried out. An optimal set of problem areas that require urgent solutions and financial security is defined on the basis of the cluster approach. This will help to form practical recommendations for establishing an effective financial mechanism in the regions of Ukraine. Key words: the mechanism of innovation development financial provision, innovation development, public-private partnerships, cluster analysis, innovative development strategy.
A tm Plug-In for Distributed Text Mining in R

Directory of Open Access Journals (Sweden)

Stefan Theussl

2012-11-01

Full Text Available R has gained explicit text mining support with the tm package enabling statisticians to answer many interesting research questions via statistical analysis or modeling of (text) corpora. However, we typically face two challenges when analyzing large corpora: (1) the amount of data to be processed in a single machine is usually limited by the available main memory (i.e., RAM), and (2) the more data to be analyzed the higher the need for efficient procedures for calculating valuable results. Fortunately, adequate programming models like MapReduce facilitate parallelization of text mining tasks and allow for processing data sets beyond what would fit into memory by using a distributed file system possibly spanning over several machines, e.g., in a cluster of workstations. In this paper we present a plug-in package to tm called tm.plugin.dc implementing a distributed corpus class which can take advantage of the Hadoop MapReduce library for large scale text mining tasks. We show on the basis of an application in culturomics that we can efficiently handle data sets of significant size.
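The MapReduce programming model referred to in this abstract can be illustrated with a single-process Python sketch of a word count. This is not the API of tm, tm.plugin.dc or Hadoop; the function names and the tiny corpus are assumptions for illustration. In a real deployment the map and reduce steps would run on different machines over a distributed file system.

```python
from collections import defaultdict


def map_phase(doc_id, text):
    """Emit a (word, 1) pair for every token in one document."""
    return [(word, 1) for word in text.lower().split()]


def shuffle(pairs):
    """Group the emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values emitted for one key into a single count."""
    return key, sum(values)


def word_count(corpus):
    """Run map, shuffle and reduce over a dict of {doc_id: text}."""
    pairs = [pair
             for doc_id, text in corpus.items()
             for pair in map_phase(doc_id, text)]
    return dict(reduce_phase(key, values)
                for key, values in shuffle(pairs).items())


# Hypothetical two-document corpus.
print(word_count({"d1": "a b a", "d2": "b c"}))
```

Because each document is mapped independently and each key is reduced independently, both phases can be spread over many workers, which is what lets the approach handle corpora that do not fit into the memory of one machine.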
Mining Predictors of Success in Air Force Flight Training Regiments via Semantic Analysis of Instructor Evaluations

Science.gov (United States)

2018-03-01

...the flight-training course. Subject terms: text mining, feedback analysis, semantic network, binary classification. Number of pages: 105.

Cluster Analysis of Clinical Data Identifies Fibromyalgia Subgroups

Science.gov (United States)

Docampo, Elisa; Collado, Antonio; Escaramís, Geòrgia; Carbonell, Jordi; Rivera, Javier; Vidal, Javier; Alegre, José

2013-01-01

Introduction: Fibromyalgia (FM) is mainly characterized by widespread pain and multiple accompanying symptoms, which hinder FM assessment and management. In order to reduce FM heterogeneity we classified clinical data into simplified dimensions that were used to define FM subgroups. Material and Methods: 48 variables were evaluated in 1,446 Spanish FM cases fulfilling 1990 ACR FM criteria. A partitioning analysis was performed to find groups of variables similar to each other. Similarities between variables were identified and the variables were grouped into dimensions. This was performed in a subset of 559 patients, and cross-validated in the remaining 887 patients. For each sample and dimension, a composite index was obtained based on the weights of the variables included in the dimension. Finally, a clustering procedure was applied to the indexes, resulting in FM subgroups. Results: Variables clustered into three independent dimensions: “symptomatology”, “comorbidities” and “clinical scales”. Only the two first dimensions were considered for the construction of FM subgroups. Resulting scores classified FM samples into three subgroups: low symptomatology and comorbidities (Cluster 1), high symptomatology and comorbidities (Cluster 2), and high symptomatology but low comorbidities (Cluster 3), showing differences in measures of disease severity. Conclusions: We have identified three subgroups of FM samples in a large cohort of FM by clustering clinical data. Our analysis stresses the importance of family and personal history of FM comorbidities. Also, the resulting patient clusters could indicate different forms of the disease, relevant to future research, and might have an impact on clinical assessment. PMID:24098674
Cluster analysis of clinical data identifies fibromyalgia subgroups.

Directory of Open Access Journals (Sweden)

Elisa Docampo

Full Text Available INTRODUCTION: Fibromyalgia (FM) is mainly characterized by widespread pain and multiple accompanying symptoms, which hinder FM assessment and management. In order to reduce FM heterogeneity we classified clinical data into simplified dimensions that were used to define FM subgroups. MATERIAL AND METHODS: 48 variables were evaluated in 1,446 Spanish FM cases fulfilling 1990 ACR FM criteria. A partitioning analysis was performed to find groups of variables similar to each other. Similarities between variables were identified and the variables were grouped into dimensions. This was performed in a subset of 559 patients, and cross-validated in the remaining 887 patients. For each sample and dimension, a composite index was obtained based on the weights of the variables included in the dimension. Finally, a clustering procedure was applied to the indexes, resulting in FM subgroups. RESULTS: Variables clustered into three independent dimensions: "symptomatology", "comorbidities" and "clinical scales". Only the two first dimensions were considered for the construction of FM subgroups. Resulting scores classified FM samples into three subgroups: low symptomatology and comorbidities (Cluster 1), high symptomatology and comorbidities (Cluster 2), and high symptomatology but low comorbidities (Cluster 3), showing differences in measures of disease severity. CONCLUSIONS: We have identified three subgroups of FM samples in a large cohort of FM by clustering clinical data. Our analysis stresses the importance of family and personal history of FM comorbidities. Also, the resulting patient clusters could indicate different forms of the disease, relevant to future research, and might have an impact on clinical assessment.
Mathematical tools for data mining set theory, partial orders, combinatorics

CERN Document Server

Simovici, Dan A

2014-01-01

Data mining essentially relies on several mathematical disciplines, many of which are presented in this second edition of this book. Topics include partially ordered sets, combinatorics, general topology, metric spaces, linear spaces, graph theory. To motivate the reader a significant number of applications of these mathematical tools are included ranging from association rules, clustering algorithms, classification, data constraints, logical data analysis, etc. The book is intended as a reference for researchers and graduate students. The current edition is a significant expansion of the firs
DATA MINING APPLICATION IN CREDIT CARD FRAUD DETECTION SYSTEM

Directory of Open Access Journals (Sweden)

FRANCISCA NONYELUM OGWUELEKA

2011-06-01

Full Text Available Data mining is widely used to combat fraud because of its effectiveness. It is a well-defined procedure that takes data as input and produces models or patterns as output. A neural network, a data mining technique, was used in this study. The design of the neural network (NN) architecture for the credit card detection system was based on an unsupervised method, which was applied to the transaction data to generate four clusters: low, high, risky and high-risk. The self-organizing map neural network (SOMNN) technique was used to solve the problem of optimally classifying each transaction into its associated group, since the output is not known a priori. The receiver-operating curve (ROC) for credit card fraud (CCF) detection watch detected over 95% of fraud cases without causing false alarms, unlike other statistical models and the two-stage clusters. This shows that the performance of CCF detection watch is in agreement with other detection software, but performs better.
Mining and energy clusters in Colombia: development, findings and proposals (Clusters minero energéticos en Colombia: Desarrollo, hallazgos y propuestas)

Directory of Open Access Journals (Sweden)

Ángela Inés Cadena

2011-01-01

Full Text Available This paper presents a summary of the DNP-Universidad de los Andes project on clusters in the mining and energy industry, which the Universidad de los Andes carried out for the National Planning Department (Departamento Nacional de Planeación). It starts with a synthesis of the methodology, which combined macro-level analysis with detailed analysis. It then presents the main findings on the characterization and location of the mining and energy industries and of the related industries. Four proposals are presented for taking advantage of the growth of this industry through value creation. The first relates to structuring the marketing function to improve productivity in small and medium-scale coal mining; the second relates to creating a national program to develop suppliers of goods and services for the oil and large-scale mining industries. The last two proposals focus on developing business clusters: one for petrochemicals and plastics in the Bolívar-Atlántico region, and the other for new plastic materials in Bogotá, with a leading role for academia. The detailed analysis and proposals can be found in the project reports.
Cluster Analysis-Based Approaches for Geospatiotemporal Data Mining of Massive Data Sets for Identification of Forest Threats

Science.gov (United States)

Richard Trans Mills; Forrest M Hoffman; Jitendra Kumar; William W. Hargrove

2011-01-01

We investigate methods for geospatiotemporal data mining of multi-year land surface phenology data (250 m2 Normalized Difference Vegetation Index (NDVI) values derived from the Moderate Resolution Imaging Spectrometer (MODIS) in this study) for the conterminous United States (CONUS) as part of an early warning system for detecting threats to forest ecosystems. The...
Mining algorithm for association rules in big data based on Hadoop

Science.gov (United States)

Fu, Chunhua; Wang, Xiaojing; Zhang, Lijun; Qiao, Liying

2018-04-01

To address the problem that traditional association rule mining algorithms cannot meet the efficiency and scalability demands of mining large amounts of data, the FP-Growth algorithm is taken as an example and parallelized on the Hadoop framework with the MapReduce model. On this basis, it is improved using a transaction reduction method to further enhance the algorithm's mining efficiency. Experiments on a Hadoop cluster cover verification of the parallel mining results, an efficiency comparison between the serial and parallel versions, and the relationships between mining time and node number and between mining time and data amount. The experiments show that the parallelized FP-Growth algorithm accurately mines frequent item sets with better performance and scalability. It can better meet the requirements of big data mining and efficiently mine frequent item sets and association rules from large datasets.
Examination of Clustering in Eutectic Microstructure

Directory of Open Access Journals (Sweden)

Bortnyik K.

2017-06-01

Full Text Available Eutectic microstructures are complex and difficult to describe with a few numbers. Eutectics build up eutectic cells, within which the phases are clustered. As large databases have grown, data mining has developed many methods for handling large datasets and extracting information from them. One typical method is clustering, which finds groups in a dataset. In this article, partitioning and hierarchical clustering are applied to eutectic structures to find the clusters. For the AlMn alloy, the K-means algorithm works well and finds the eutectic cells. For ductile cast iron, hierarchical clustering works better. By combining partitioning and hierarchical clustering with image transformation, an effective method is developed for clustering the objects in microstructures.
ANALYSIS METHODS OF BANKRUPTCY RISK IN ROMANIAN ENERGY MINING INDUSTRY

Directory of Open Access Journals (Sweden)

CORICI MARIAN CATALIN

2016-12-01

Full Text Available The study analyzes bankruptcy risk and assesses the economic performance of the entity in charge of the energy mining industry in the southwest region. It assesses the risk of bankruptcy using the score method and a set of indicators that reflect the results obtained and balance-sheet elements of the organization, which is involved in mining and energy and contributes to the stability of the national energy system. The analysis focuses on applying business models that allow a comprehensive assessment of bankruptcy risk and can serve as an instrument for forecasting it. The study highlights how bankruptcy risk has developed within the organization, using the Altman model and the Conan-Holder model, in order to give a rounded picture of the organization's ability to ensure business continuity.
Development of small scale cluster computer for numerical analysis

Science.gov (United States)

Zulkifli, N. H. N.; Sapit, A.; Mohammed, A. N.

2017-09-01

In this study, two personal computers were successfully networked to form a small-scale cluster. Each has a multicore processor with four cores, giving the cluster eight processors in total. The cluster runs Ubuntu 14.04 Linux with an MPI implementation (MPICH2). Two main tests were conducted: a communication test and a performance test. The communication test, which used a simple MPI "hello" program written in C, verified that the computers can exchange the required information without problems. The performance test was done to show that the cluster's computational performance is much better than that of a single-CPU computer. The same code was run in four configurations: a single node, 2 processors, 4 processors, and 8 processors. The results show that the time required to solve the problem decreases as processors are added, and is halved when the number of processors is doubled. To conclude, a small-scale cluster computer was successfully developed from common hardware; it offers higher computing power than a single-CPU machine, which can benefit research that requires high computing power, especially numerical analysis such as finite element analysis, computational fluid dynamics, and computational physics.
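The scaling behaviour reported above (run time halving as the processor count doubles) corresponds to ideal speedup and a parallel efficiency of 1. A minimal Python sketch of that arithmetic follows; the timings are hypothetical values chosen to match the described trend, not measurements from the study.

```python
def speedup(t_serial, t_parallel):
    """Speedup S = T_1 / T_p."""
    return t_serial / t_parallel

def efficiency(t_serial, t_parallel, n_procs):
    """Parallel efficiency E = S / p; 1.0 means ideal linear scaling."""
    return speedup(t_serial, t_parallel) / n_procs

# Hypothetical wall-clock times in seconds, halving as processors double.
timings = {1: 80.0, 2: 40.0, 4: 20.0, 8: 10.0}

for n_procs, t in timings.items():
    s = speedup(timings[1], t)
    e = efficiency(timings[1], t, n_procs)
    print(f"{n_procs} processors: speedup {s:.1f}, efficiency {e:.2f}")
```

In practice, communication overhead between the two machines would be expected to pull efficiency below 1 as more processors are used.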
GPR Detection of Buried Symmetrically Shaped Mine-like Objects using Selective Independent Component Analysis

DEFF Research Database (Denmark)

Karlsen, Brian; Sørensen, Helge Bjarup Dissing; Larsen, Jan

2003-01-01

This paper addresses the detection of mine-like objects in stepped-frequency ground penetrating radar (SF-GPR) data as a function of object size, object content, and burial depth. The detection approach is based on a Selective Independent Component Analysis (SICA). SICA provides an automatic ranking of components, which enables the suppression of clutter, hence extraction of components carrying mine information. The goal of the investigation is to evaluate various time and frequency domain ICA approaches based on SICA. Performance comparison is based on a series of mine-like objects ranging from small-scale anti-personnel (AP) mines to large-scale anti-tank (AT) mines. Large-scale SF-GPR measurements on this series of mine-like objects buried in soil were performed. The SF-GPR data was acquired using a wideband monostatic bow-tie antenna operating in the frequency range 750...
Sleep stages identification in patients with sleep disorder using k-means clustering

Science.gov (United States)

Fadhlullah, M. U.; Resahya, A.; Nugraha, D. F.; Yulita, I. N.

2018-05-01

Data mining is a computational intelligence discipline in which a large dataset is processed with a given method to look for patterns. These patterns are then used in real-time applications or to develop new knowledge. It is a valuable tool for solving complex problems, discovering new knowledge, analyzing data, and making decisions. Clustering is used to find the patterns in a large dataset: it groups similar data so that patterns become visible, and several algorithms exist for assigning the data to clusters. This research used data from patients with sleep disorders and aims to help medical practitioners reduce the time required to classify the sleep stages of such patients. The study used the K-Means algorithm with silhouette evaluation and found that three clusters are optimal for this dataset, meaning the data can be divided into three sleep stages.
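Choosing the number of clusters by silhouette evaluation, as described above, can be illustrated with a small standard-library Python sketch. It is not the study's code: the 2-D points are hypothetical stand-ins for the (unspecified) sleep features, and the deterministic farthest-first seeding is a substitution made here so the example is reproducible; the abstract does not say how K-Means was initialized.

```python
import math

def farthest_first(points, k):
    """Deterministic seeding: start at the first point, then repeatedly add
    the point farthest from all centers chosen so far."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    return centers

def kmeans(points, k, max_iter=100):
    """Lloyd's algorithm; returns one cluster label per point."""
    centers = farthest_first(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels

def silhouette(points, labels):
    """Mean silhouette coefficient; singleton clusters score 0 by convention."""
    total = 0.0
    for i, p in enumerate(points):
        dists = {}
        for j, q in enumerate(points):
            if i != j:
                dists.setdefault(labels[j], []).append(math.dist(p, q))
        own = dists.pop(labels[i], [])
        if own and dists:
            a = sum(own) / len(own)                            # mean intra-cluster distance
            b = min(sum(d) / len(d) for d in dists.values())   # nearest other cluster
            if max(a, b) > 0:
                total += (b - a) / max(a, b)
    return total / len(points)

# Hypothetical data: three well-separated groups of four points each.
points = [(cx + dx, cy + dy)
          for cx, cy in [(0, 0), (10, 0), (0, 10)]
          for dx, dy in [(0, 0), (1, 0), (0, 1), (1, 1)]]

best_k = max(range(2, 6), key=lambda k: silhouette(points, kmeans(points, k)))
```

On this toy data the mean silhouette peaks at three clusters, mirroring the kind of result the abstract reports; real recordings would need feature extraction and scaling first.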
Clustering of Sun Exposure Measurements

DEFF Research Database (Denmark)

Have, Anna Szynkowiak; Larsen, Jan; Hansen, Lars Kai

2002-01-01

In a medically motivated Sun-exposure study, questionnaires concerning Sun-habits were collected from a number of subjects together with UV radiation measurements. This paper focuses on identifying clusters in the heterogeneous set of data for the purpose of understanding possible relations between Sun-habits exposure and eventually assessing the risk of skin cancer. A general probabilistic framework originally developed for text and Web mining is demonstrated to be useful for clustering of behavioral data. The framework combines principal component subspace projection with probabilistic...
Analysis of Mining-induced Valley Closure Movements

Science.gov (United States)

Zhang, C.; Mitra, R.; Oh, J.; Hebblewhite, B.

2016-05-01

Valley closure movements have been observed for decades in Australia and overseas when underground mining occurred beneath or in close proximity to valleys and other forms of irregular topography. Valley closure is defined as the inward movement of the valley sides towards the valley centreline. Due to the complexity of the local geology and the interplay between several geological, topographical and mining factors, the underlying mechanisms that actually cause this behaviour are not completely understood. A comprehensive programme of numerical modelling investigations has been carried out to further evaluate and quantify the influence of a number of these mining and geological factors and their inter-relationships. The factors investigated in this paper include longwall positional factors, horizontal stress, panel width, depth of cover and geological structures around the valley. It is found that mining in a series passing beneath the valley dramatically increases valley closure, and mining parallel to the valley induces much more closure than other mining orientations. The redistribution of horizontal stress and the influence of mining activity have also been recognised as important factors promoting valley closure, and the effect of geological structure around the valley is found to be relatively small. This paper provides further insight into both the valley closure mechanisms and how these mechanisms should be considered in valley closure prediction models.
Analysis of the planned post-mining landscape of MIBRAG's open-cast mines with regard to a possible environmental impact of alteration processes in mixed dumps

International Nuclear Information System (INIS)

Jolas, P.; Hofmann, B.

2010-01-01

There has been an increasing body of knowledge with regard to hydro- and geochemical alteration processes in overburden dumps and their impact on groundwater quality in lignite mining and reclamation operations associated with post-mining landscapes in Germany. The operators of the MIBRAG mines have examined issues regarding alteration processes and how they affect the environment and which opportunities exist to actively influence the dumping process. The objectives were to counteract any possible negative impact of the alteration processes. Special emphasis was on the impact caused by oxidation of sulfur containing minerals. This paper presented an analysis of the situation at United Schleenhain Mine and how it reflects on the work to date for MIBRAG's mines. A future outlook was also presented. Specifically, the paper discussed the development of the United Schleenhain mine and the post-mining landscape. The potential for discharge of substances was also evaluated along with acidification. 1 tab., 5 figs.
Quantitative analysis of raw materials mining of Sverdlovsk region in Russia

Science.gov (United States)

Tarasyev, Alexander M.; Vasilev, Julian; Turygina, Victoria F.

2016-06-01

The purpose of this article is to show the application of some qualitative methods for the analysis of a dataset for raw materials. The main approaches used are related to the correlation analysis and forecasting with trend lines. It is proved that the future mining of particular ores can be predicted on the basis of mathematical modeling. It is also shown that there exists a strong correlation between the mining of some specific raw materials. Some of the revealed correlations have meaningful explanations, and for others one should look for sophisticated interpretations. The applied approach can be used for forecasting of raw materials exploitation in various regions of Russia and in other countries.
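The two techniques named above, correlation analysis and forecasting with a trend line, can be sketched as follows. This is an illustrative Python example only: the yearly production figures and the names `ore_a` and `ore_b` are hypothetical and are not taken from the Sverdlovsk dataset.

```python
def pearson(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def linear_trend(t, y):
    """Ordinary least-squares line y = slope * t + intercept."""
    n = len(t)
    mt, my = sum(t) / n, sum(y) / n
    slope = sum((a - mt) * (b - my) for a, b in zip(t, y)) / sum((a - mt) ** 2 for a in t)
    return slope, my - slope * mt

# Hypothetical yearly output of two raw materials (arbitrary units).
years = [2010, 2011, 2012, 2013, 2014]
ore_a = [10.0, 12.0, 14.0, 16.0, 18.0]
ore_b = [5.0, 6.1, 6.9, 8.0, 9.0]

r = pearson(ore_a, ore_b)
slope, intercept = linear_trend(years, ore_a)
forecast_2015 = slope * 2015 + intercept
```

A high `r` only indicates that the two series move together; as the abstract notes, whether a correlation has a meaningful explanation is a separate question.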
Analysis of the Potential for Use of Floating Photovoltaic Systems on Mine Pit Lakes: Case Study at the Ssangyong Open-Pit Limestone Mine in Korea

Directory of Open Access Journals (Sweden)

Jinyoung Song

2016-02-01

Full Text Available Recently, the mining industry has introduced renewable energy technologies to resolve power supply problems at mines operating in polar regions or other remote areas, and to foster substitute industries able to benefit from the abandoned sites of exhausted mines. However, little attention has been paid to the potential placement of floating photovoltaic (PV) systems on mine pit lakes, because it was assumed that the topographic characteristics of open-pit mines are unsuitable for installing any type of PV system. This study analyzed the potential of floating PV systems on a mine pit lake in Korea to challenge this misconception. Using a fish-eye lens camera and digital elevation models, a shading analysis was performed to identify the area suitable for installing a floating PV system. The layout of the floating PV system was designed in consideration of the optimal tilt angle and array spacing of the PV panels. The System Advisor Model (SAM) by the National Renewable Energy Laboratory, USA, was used to conduct energy simulations based on weather data and the system design. The results indicated that the proposed PV system could generate 971.57 MWh/year. The economic analysis (accounting for the discount rate and a 20-year operational lifetime) showed a net present value of USD 897,000 and a payback period of about 12.3 years. The economic effect of the floating PV system on the mine pit lake is therefore relatively higher than that of PV systems at other abandoned mines in Korea. The annual reduction of greenhouse gas emissions was found to be 471.21 tCO2/year, which is twice the reduction achieved by forest restoration of an abandoned mine site. The economic feasibility of a floating PV system on a pit lake of an abandoned mine was thus established, and it may be considered an efficient reuse option for abandoned mines.
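The economic indicators quoted above follow standard definitions, which a short Python sketch can make concrete. The cost and cash-flow figures below are hypothetical (the abstract does not give the capital cost, tariff, or discount rate), and the simple undiscounted payback formula is an assumption about how such a figure might be computed. Only the last calculation uses numbers from the abstract: dividing the reported emission reduction by the reported generation gives the emission factor those two figures imply, assuming the reduction is proportional to generation.

```python
def npv(rate, initial_cost, annual_cash_flow, years):
    """Net present value of a constant annual cash flow after an upfront cost."""
    discounted = sum(annual_cash_flow / (1 + rate) ** t for t in range(1, years + 1))
    return discounted - initial_cost

def simple_payback(initial_cost, annual_cash_flow):
    """Undiscounted payback period in years."""
    return initial_cost / annual_cash_flow

# Hypothetical project: cost 1000, income 100 per year, 20-year lifetime.
npv_no_discount = npv(0.0, 1000.0, 100.0, 20)
npv_5_percent = npv(0.05, 1000.0, 100.0, 20)
payback_years = simple_payback(1000.0, 100.0)

# Figures reported in the abstract: 471.21 tCO2/year avoided, 971.57 MWh/year generated.
implied_emission_factor = 471.21 / 971.57  # tCO2 per MWh
```

The implied factor comes out at roughly 0.485 tCO2 per MWh. Discounting lowers the NPV relative to the undiscounted case, which is why the choice of discount rate matters for results such as the one reported.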
Improvement on LEACH Agreement of Mine Wireless Sensor Network

Directory of Open Access Journals (Sweden)

Yun-xiang Liu

2017-05-01

Full Text Available Based on the characteristics of wireless sensor network communication in mines, the clustering of the LEACH protocol is optimized with full consideration of energy and distance. The selection of cluster head nodes is optimized, and a routing algorithm based on K-means++ clustering is proposed. This effectively mitigates the uneven distribution of cluster head nodes, the uneven energy consumption, and the network instability of the LEACH algorithm. Simulation results show that the proposed algorithm reduces the energy consumption of the whole network and improves energy utilization, effectively extending the network life cycle.
[Typologies of Madrid's citizens (Spain) at the end-of-life: cluster analysis].

Science.gov (United States)

Ortiz-Gonçalves, Belén; Perea-Pérez, Bernardo; Labajo González, Elena; Albarrán Juan, Elena; Santiago-Sáez, Andrés

2018-03-06

To establish typologies within Madrid's citizens (Spain) with regard to end-of-life by cluster analysis. The SPAD 8 programme was applied to a sample from a health care centre in the autonomous region of Madrid (Spain). A multiple correspondence analysis technique was used, followed by a cluster analysis to create a dendrogram. A cross-sectional study was made beforehand with the results of the questionnaire. Five clusters stand out. Cluster 1: a group who preferred not to answer numerous questions (5%). Cluster 2: in favour of receiving palliative care and euthanasia (40%). Cluster 3: would oppose assisted suicide and would not ask for spiritual assistance (15%). Cluster 4: would like to receive palliative care and assisted suicide (16%). Cluster 5: would oppose assisted suicide and would ask for spiritual assistance (24%). The following four clusters stood out. Clusters 2 and 4 would like to receive palliative care, euthanasia (2) and assisted suicide (4). Clusters 4 and 5 regularly practiced their faith and their family members did not receive palliative care. Clusters 3 and 5 would be opposed to euthanasia and assisted suicide in particular. Clusters 2, 4 and 5 had not completed an advance directive document. Clusters 2 and 3 seldom practiced their faith. This study could be taken into consideration to improve the quality of end-of-life care choices. Copyright © 2017 SESPAS. Publicado por Elsevier España, S.L.U. All rights reserved.
Exploring the potential of data mining techniques for the analysis of accident patterns

DEFF Research Database (Denmark)

Prato, Carlo Giacomo; Bekhor, Shlomo; Galtzur, Ayelet

2010-01-01

Research in road safety faces major challenges: individuation of the most significant determinants of traffic accidents, recognition of the most recurrent accident patterns, and allocation of resources necessary to address the most relevant issues. This paper intends to comprehend which data mining...... and association rules) data mining techniques are implemented for the analysis of traffic accidents occurred in Israel between 2001 and 2004. Results show that descriptive techniques are useful to classify the large amount of analyzed accidents, even though introduce problems with respect to the clear...... importance of input and intermediate neurons, and the relative importance of hundreds of association rules. Further research should investigate whether limiting the analysis to fatal accidents would simplify the task of data mining techniques in recognizing accident patterns without the “noise” probably...

A novel water quality data analysis framework based on time-series data mining.

Science.gov (United States)

Deng, Weihui; Wang, Guoyin

2017-07-01

The rapid development of time-series data mining provides an emerging method for water resource management research. In this paper, based on the time-series data mining methodology, we propose a novel and general analysis framework for water quality time-series data. It consists of two parts: implementation components and common tasks of time-series data mining in water quality data. In the first part, we propose to granulate the time series into several two-dimensional normal clouds and calculate the similarities at the granulated level. On the basis of the similarity matrix, the similarity search, anomaly detection, and pattern discovery tasks in the water quality time-series instance dataset can be easily implemented in the second part. We present a case study of this analysis framework on weekly dissolved oxygen (DO) time-series data collected from five monitoring stations on the upper reaches of the Yangtze River, China. It discovered the relationship of water quality in the mainstream and tributary as well as the main changing patterns of DO. The experimental results show that the proposed analysis framework is a feasible and efficient method to mine the hidden and valuable knowledge from water quality historical time-series data. Copyright © 2017 Elsevier Ltd. All rights reserved.
Analysis on evaluation ability of nonlinear safety assessment model of coal mines based on artificial neural network

Institute of Scientific and Technical Information of China (English)

SHI Shi-liang; LIU Hai-bo; LIU Ai-hua

2004-01-01

Based on an integrated analysis of the strengths and shortcomings of the various methods used in coal mine safety assessment, and taking into account the nonlinear features of the mine safety sub-system, this paper establishes a neural network assessment model of mine safety, analyzes the ability of artificial neural networks to evaluate the mine safety state, and lays the theoretical foundation for using artificial neural networks in the systematic optimization of mine safety assessment and for obtaining reasonably accurate safety assessment results.
Traffic Flow Management: Data Mining Update

Science.gov (United States)

Grabbe, Shon R.

2012-01-01

This presentation provides an update on recent data mining efforts that have been designed to (1) identify like/similar days in the national airspace system, (2) cluster/aggregate national-level rerouting data and (3) apply machine learning techniques to predict when Ground Delay Programs are required at a weather-impacted airport
Opinion Mining in Latvian Text Using Semantic Polarity Analysis and Machine Learning Approach

Directory of Open Access Journals (Sweden)

Gatis Špats

2016-07-01

Full Text Available In this paper we demonstrate approaches for opinion mining in Latvian text. The authors have applied, combined and extended the results of several previous studies and public resources to perform opinion mining in Latvian text using two approaches, namely semantic polarity analysis and machine learning. One of the most significant constraints that makes opinion mining for written content classification in Latvian challenging is the limited number of publicly available text corpora for classifier training. We have joined several sources and created a publicly available extended lexicon. Our results are comparable to or outperform current achievements in opinion mining in Latvian. Experiments show that lexicon-based methods provide more accurate opinion mining than the Naive Bayes machine learning classifier on Latvian tweets. The methods used in this study could be further extended using human annotators, unsupervised machine learning and bootstrapping to create larger corpora of classified text.
Integrated pathway clusters with coherent biological themes for target prioritisation.

Directory of Open Access Journals (Sweden)

Yi-An Chen

Full Text Available Prioritising candidate genes for further experimental characterisation is an essential, yet challenging task in biomedical research. One way of achieving this goal is to identify specific biological themes that are enriched within the gene set of interest to obtain insights into the biological phenomena under study. Biological pathway data have been particularly useful in identifying functional associations of genes and/or gene sets. However, biological pathway information as compiled in varied repositories often differs in scope and content, preventing a more effective and comprehensive characterisation of gene sets. Here we describe a new approach to constructing biologically coherent gene sets from pathway data in major public repositories and employing them for functional analysis of large gene sets. We first revealed significant overlaps in gene content between different pathways and then defined a clustering method based on the shared gene content and the similarity of gene overlap patterns. We established the biological relevance of the constructed pathway clusters using independent quantitative measures and we finally demonstrated the effectiveness of the constructed pathway clusters in comparative functional enrichment analysis of gene sets associated with diverse human diseases gathered from the literature. The pathway clusters and gene mappings have been integrated into the TargetMine data warehouse and are likely to provide a concise, manageable and biologically relevant means of functional analysis of gene sets and to facilitate candidate gene prioritisation.
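Grouping pathways by shared gene content, as described above, can be illustrated with a small Python sketch. It rests on assumptions throughout: the abstract does not name the similarity measure or the clustering rule, so the Jaccard index, the 0.5 threshold, and single-linkage grouping are stand-ins, and the pathway and gene names are hypothetical.

```python
def jaccard(a, b):
    """Jaccard index: shared genes divided by all genes in either set."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cluster_by_overlap(pathways, threshold=0.5):
    """Single-linkage grouping: pathways whose overlap reaches the threshold
    end up in the same cluster (implemented with a small union-find)."""
    names = list(pathways)
    parent = {n: n for n in names}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path compression
            n = parent[n]
        return n

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if jaccard(pathways[a], pathways[b]) >= threshold:
                parent[find(a)] = find(b)

    groups = {}
    for n in names:
        groups.setdefault(find(n), set()).add(n)
    return sorted(sorted(g) for g in groups.values())

# Hypothetical pathways from different repositories with overlapping gene sets.
pathways = {
    "P1": {"g1", "g2", "g3"},
    "P2": {"g2", "g3", "g4"},
    "P3": {"g7", "g8"},
    "P4": {"g7", "g8", "g9"},
}
clusters = cluster_by_overlap(pathways)
```

Here `P1` and `P2` share two of four genes and `P3` and `P4` share two of three, so the sketch returns two pathway clusters.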
Using cluster analysis to organize and explore regional GPS velocities

Science.gov (United States)

Simpson, Robert W.; Thatcher, Wayne; Savage, James C.

2012-01-01

Cluster analysis offers a simple visual exploratory tool for the initial investigation of regional Global Positioning System (GPS) velocity observations, which are providing increasingly precise mappings of actively deforming continental lithosphere. The deformation fields from dense regional GPS networks can often be concisely described in terms of relatively coherent blocks bounded by active faults, although the choice of blocks, their number and size, can be subjective and is often guided by the distribution of known faults. To illustrate our method, we apply cluster analysis to GPS velocities from the San Francisco Bay Region, California, to search for spatially coherent patterns of deformation, including evidence of block-like behavior. The clustering process identifies four robust groupings of velocities that we identify with four crustal blocks. Although the analysis uses no prior geologic information other than the GPS velocities, the cluster/block boundaries track three major faults, both locked and creeping.
Coal Mine Permit Boundaries

Data.gov (United States)

Earth Data Analysis Center, University of New Mexico — ESRI ArcView shapefile depicting New Mexico coal mines permitted under the Surface Mining Control and Reclamation Act of 1977 (SMCRA), by either the NM Mining these...
A Review of Subsequence Time Series Clustering

Directory of Open Access Journals (Sweden)

Seyedjamal Zolhavarieh

2014-01-01

Full Text Available Clustering of subsequence time series remains an open issue in time series clustering. Subsequence time series clustering is used in different fields, such as e-commerce, outlier detection, speech recognition, biological systems, DNA recognition, and text mining. One of the useful fields in the domain of subsequence time series clustering is pattern recognition. To improve this field, a sequence of time series data is used. This paper reviews some definitions and backgrounds related to subsequence time series clustering. The categorization of the literature reviews is divided into three groups: preproof, interproof, and postproof period. Moreover, various state-of-the-art approaches in performing subsequence time series clustering are discussed under each of the following categories. The strengths and weaknesses of the employed methods are evaluated as potential issues for future studies.
A review of subsequence time series clustering.

Science.gov (United States)

Zolhavarieh, Seyedjamal; Aghabozorgi, Saeed; Teh, Ying Wah

2014-01-01

Clustering of subsequence time series remains an open issue in time series clustering. Subsequence time series clustering is used in different fields, such as e-commerce, outlier detection, speech recognition, biological systems, DNA recognition, and text mining. One of the useful fields in the domain of subsequence time series clustering is pattern recognition. To improve this field, a sequence of time series data is used. This paper reviews some definitions and backgrounds related to subsequence time series clustering. The categorization of the literature reviews is divided into three groups: preproof, interproof, and postproof period. Moreover, various state-of-the-art approaches in performing subsequence time series clustering are discussed under each of the following categories. The strengths and weaknesses of the employed methods are evaluated as potential issues for future studies. PMID:25140332
A Dynamic Fuzzy Cluster Algorithm for Time Series

Directory of Open Access Journals (Sweden)

Min Ji

2013-01-01

clustering time series by introducing the definition of key point and improving FCM algorithm. The proposed algorithm works by determining those time series whose class labels are vague and further partitions them into different clusters over time. The main advantage of this approach compared with other existing algorithms is that the property of some time series belonging to different clusters over time can be partially revealed. Results from simulation-based experiments on geographical data demonstrate excellent performance, and the desired results were obtained. The proposed algorithm can be applied to solve other clustering problems in data mining.
A citation analysis of the research reports of the Central Mining Institute. Mining and Environment using the Web of Science, Scopus, BazTech, and Google Scholar: A case study

OpenAIRE

Magdalena Bemke-Switilnik; Aneta Drabek

2015-01-01

This paper presents the analysis of a Polish mining sciences journal (Prace Naukowe GIG. Górnictwo i Środowisko; title in English: Research Reports of the Central Mining Institute. Mining and Environment; acronym in English [RRCMIME]). The analysis is based on data from the following sources: the Web of Science (WoS), Scopus, BazTech (a bibliographic database containing citations from Polish Technical Journals), and Google Scholar (GS). The data from the WoS and Scopus were collected manually...
Methodology comparative statistical analysis of Russian industry based on cluster analysis

Directory of Open Access Journals (Sweden)

Sergey S. Shishulin

2017-01-01

Full Text Available The article investigates the possibilities of applying multidimensional statistical analysis to the study of industrial production by comparing its growth rates and structure with those of other developed and developing countries. Its purpose is to determine the optimal set of statistical methods, and the results of applying them to industrial production data, that give the best basis for analyzing the results. The data include indicators such as output, gross value added, the number of employed, and other indicators from the system of national accounts and operational business statistics. The objects of observation are the industries of the countries of the Customs Union, the United States, Japan and Europe in 2005-2015. The research tools range from the simplest methods of transformation and graphical and tabular visualization of data to methods of statistical analysis. In particular, using a specialized software package (SPSS), the principal components method, discriminant analysis, hierarchical cluster analysis, Ward’s method and k-means were applied. Applying the principal components method to the initial data substantially and effectively reduces the initial space of industrial production data. For example, in analyzing the structure of industrial production, fifteen industries were reduced to three basic, well-interpreted factors: relatively extractive industries (with a low degree of processing), high-tech industries, and consumer goods (medium-technology sectors). A comparison of cluster analysis applied to the initial data and to the data obtained from the principal components method established that clustering industrial production data on the basis of the new factors significantly improves the results of clustering. As a result of analyzing the parameters of
Data mining of mental health issues of non-bone marrow donor siblings.

Science.gov (United States)

Takita, Morihito; Tanaka, Yuji; Kodama, Yuko; Murashige, Naoko; Hatanaka, Nobuyo; Kishi, Yukiko; Matsumura, Tomoko; Ohsawa, Yukio; Kami, Masahiro

2011-07-20

Allogeneic hematopoietic stem cell transplantation is a curative treatment for patients with advanced hematologic malignancies. However, the long-term mental health issues of siblings who were not selected as donors (non-donor siblings, NDS) in the transplantation have not been well assessed. Data mining is useful for discovering new findings in a large, multidisciplinary data set, and Scenario Map analysis is a novel approach that allows keywords linking different conditions/events to be extracted from the text of interviews even when the keywords appear infrequently. The aim of this study is to assess the mental health issues of NDSs and to find helpful keywords for clinical follow-up using a Scenario Map analysis. A 47-year-old woman whose younger sister had undergone allogeneic hematopoietic stem cell transplantation 20 years earlier was interviewed as an NDS. The text data from the interview transcriptions was analyzed using Scenario Mapping. Four clusters of words and six keywords were identified. Upon review of the word clusters and keywords, both the subject and the researchers noticed that the subject has had mental health issues as an NDS from the onset of the disease to date. The issues have been alleviated by her family. This single-subject study suggests the advantages of data mining in clinical follow-up for mental health issues of patients and/or their families.
Critical analysis of the Colombian mining legislation; Analisis critico de la legislacion minera colombiana

Energy Technology Data Exchange (ETDEWEB)

Vargas P, Elkin; Gonzalez S, Carmen Lucia

2003-12-15

The document analyses the Colombian mining legislation, Act 685 of 2001, based on the reasons expressed by the government and the miners for its conception and approval. The document tries to determine the developments achieved by this new Mining Code considering international mining competitiveness and its adaptation to the constitutional rules about environment, indigenous communities, decentralization and sustainable development. The analysis formulates general and specific hypotheses about the proposed objectives of the reform, which are confronted with the arguments and critical evaluations of the results. Most hypotheses are not verified, thus demonstrating that the Colombian mining legislation is far from being the necessary instrument to promote mining activities, making them competitive according to international standards and adapted to the principles of sustainable development, healthy environment, community participation, ethnic minorities and regional autonomy.
Genome-scale analysis of positional clustering of mouse testis-specific genes

Directory of Open Access Journals (Sweden)

Lee Bernett TK

2005-01-01

Full Text Available Abstract. Background: Genes are not randomly distributed on a chromosome, as was once thought, even after removal of tandem repeats. The positional clustering of co-expressed genes is known in prokaryotes and was recently reported in several eukaryotic organisms such as Caenorhabditis elegans, Drosophila melanogaster, and Homo sapiens. In order to further investigate the mode of tissue-specific gene clustering in higher eukaryotes, we have performed a genome-scale analysis of positional clustering of the mouse testis-specific genes. Results: Our computational analysis shows that a large proportion of testis-specific genes are clustered in groups of 2 to 5 genes in the mouse genome. The number of clusters is much higher than expected by chance even after removal of tandem repeats. Conclusion: Our result suggests that testis-specific genes tend to cluster on the mouse chromosomes. This provides another piece of evidence for the hypothesis that clusters of tissue-specific genes do exist.
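The comparison with chance described in this record can be illustrated with a small permutation test. This is only a sketch under stated assumptions: the record does not say how a cluster was defined or how the chance expectation was computed, so the run-based definition (two or more adjacent tissue-specific genes), the function names and the example flags below are all hypothetical.

```python
import random


def count_clusters(flags, min_size=2):
    """Count runs of at least `min_size` consecutive flagged genes.

    `flags` lists genes in chromosomal order; a truthy value marks a
    tissue-specific gene (an assumed, simplified cluster definition).
    """
    clusters = 0
    run = 0
    for flagged in flags:
        if flagged:
            run += 1
        else:
            if run >= min_size:
                clusters += 1
            run = 0
    if run >= min_size:  # close a run that reaches the end of the list
        clusters += 1
    return clusters


def permutation_p_value(flags, n_perm=1000, seed=0):
    """Share of random gene orders with at least as many clusters as observed."""
    rng = random.Random(seed)
    observed = count_clusters(flags)
    shuffled = list(flags)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if count_clusters(shuffled) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_perm + 1)


# Hypothetical gene order: 1 = testis-specific, 0 = other.
example_flags = [1, 1, 0, 1, 0, 1, 1, 1]
example_clusters = count_clusters(example_flags)
example_p = permutation_p_value(example_flags, n_perm=200)
```

A small p-value from such a test would indicate more clusters than random gene order produces; a real analysis would also have to handle tandem repeats, which this sketch ignores.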
Pattern recognition in menstrual bleeding diaries by statistical cluster analysis

Directory of Open Access Journals (Sweden)

Wessel Jens

2009-07-01

Full Text Available Abstract. Background: The aim of this paper is to empirically identify a treatment-independent statistical method to describe clinically relevant bleeding patterns by using bleeding diaries of clinical studies on various sex hormone containing drugs. Methods: We used the four cluster analysis methods single, average and complete linkage as well as the method of Ward for the pattern recognition in menstrual bleeding diaries. The optimal number of clusters was determined using the semi-partial R², the cubic cluster criterion, the pseudo-F- and the pseudo-t²-statistic. Finally, the interpretability of the results from a gynecological point of view was assessed. Results: The method of Ward yielded distinct clusters of the bleeding diaries. The other methods successively chained the observations into one cluster. The optimal number of distinctive bleeding patterns was six. We found two desirable and four undesirable bleeding patterns. Cyclic and non-cyclic bleeding patterns were well separated. Conclusion: Using this cluster analysis with the method of Ward, medications and devices having an impact on bleeding can be easily compared and categorized.
Numerical analysis of the resonance mechanism of the lumped parameter system model for acoustic mine detection

International Nuclear Information System (INIS)

Wang Chi; Zhou Yu-Qiu; Shen Gao-Wei; Wu Wen-Wen; Ding Wei

2013-01-01

The method of numerical analysis is employed to study the resonance mechanism of the lumped parameter system model for acoustic mine detection. Based on the basic principle of the acoustic resonance technique for mine detection and the characteristics of low-frequency acoustics, the "soil-mine" system could be equivalent to a damping "mass-spring" resonance model with a lumped parameter analysis method. The dynamic simulation software, Adams, is adopted to analyze the lumped parameter system model numerically. The simulated resonance frequency and anti-resonance frequency are 151 Hz and 512 Hz respectively, basically in agreement with the published resonance frequency of 155 Hz and anti-resonance frequency of 513 Hz, which were measured in the experiment. Therefore, the technique of numerical simulation is validated to have the potential for analyzing the acoustic mine detection model quantitatively. The influences of the soil and mine parameters on the resonance characteristics of the soil-mine system could be investigated by changing the parameter setup in a flexible manner. (electromagnetism, optics, acoustics, heat transfer, classical mechanics, and fluid dynamics)
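For orientation, the resonance of a single damped mass-spring element follows from the textbook relations f0 = sqrt(k/m) / (2π) and fd = f0 · sqrt(1 − ζ²). The sketch below only evaluates those relations; it is not the Adams model from the record, it does not reproduce the anti-resonance (which needs more than one degree of freedom), and the mass, stiffness and damping values are hypothetical, chosen so that f0 lands on the 151 Hz quoted above.

```python
import math


def resonance_frequency_hz(stiffness_n_per_m, mass_kg):
    """Undamped natural frequency f0 = sqrt(k / m) / (2 * pi)."""
    return math.sqrt(stiffness_n_per_m / mass_kg) / (2.0 * math.pi)


def damped_frequency_hz(stiffness_n_per_m, mass_kg, damping_ratio):
    """Damped natural frequency fd = f0 * sqrt(1 - zeta^2), for 0 <= zeta < 1."""
    if not 0.0 <= damping_ratio < 1.0:
        raise ValueError("damping_ratio must satisfy 0 <= zeta < 1")
    f0 = resonance_frequency_hz(stiffness_n_per_m, mass_kg)
    return f0 * math.sqrt(1.0 - damping_ratio ** 2)


# Hypothetical lumped parameters, not taken from the paper.
mass_kg = 2.0
stiffness_n_per_m = (2.0 * math.pi * 151.0) ** 2 * mass_kg  # back-solved for 151 Hz
f0_hz = resonance_frequency_hz(stiffness_n_per_m, mass_kg)
fd_hz = damped_frequency_hz(stiffness_n_per_m, mass_kg, damping_ratio=0.1)
```

Changing `mass_kg` or `stiffness_n_per_m` shows how soil and mine parameters shift the resonance, which is the kind of parameter study the record describes.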
Comparative analysis of clustering methods for gene expression time course data

Directory of Open Access Journals (Sweden)

Ivan G. Costa

2004-01-01

Full Text Available This work performs a data driven comparative study of clustering methods used in the analysis of gene expression time courses (or time series. Five clustering methods found in the literature of gene expression analysis are compared: agglomerative hierarchical clustering, CLICK, dynamical clustering, k-means and self-organizing maps. In order to evaluate the methods, a k-fold cross-validation procedure adapted to unsupervised methods is applied. The accuracy of the results is assessed by the comparison of the partitions obtained in these experiments with gene annotation, such as protein function and series classification.
Survey of Analysis of Crime Detection Techniques Using Data Mining and Machine Learning

Science.gov (United States)

Prabakaran, S.; Mitra, Shilpa

2018-04-01

Data mining is the field containing procedures for finding designs or patterns in a huge dataset; it includes strategies at the convergence of machine learning and database systems. It can be applied to various fields such as future healthcare, market basket analysis, education, manufacturing engineering and crime investigation. Among these, crime investigation is an interesting application that processes crime characteristics to help society toward better living. This paper surveys various data mining techniques used in this domain. This study may be helpful in designing new strategies for crime prediction and analysis.

The Productivity Analysis of Chennai Automotive Industry Cluster

Science.gov (United States)

Bhaskaran, E.

2014-07-01

Chennai, also called the Detroit of India, is India's second fastest growing auto market and exports auto components and vehicles to US, Germany, Japan and Brazil. For inclusive growth and sustainable development, 250 auto component industries in Ambattur, Thirumalisai and Thirumudivakkam Industrial Estates located in Chennai have adopted the Cluster Development Approach called Automotive Component Cluster. The objective is to study the Value Chain, Correlation and Data Envelopment Analysis by determining technical efficiency, peer weights, input and output slacks of 100 auto component industries in three estates. The methodology adopted is using Data Envelopment Analysis of Output Oriented Banker Charnes Cooper model by taking net worth, fixed assets, employment as inputs and gross output as outputs. The non-zero represents the weights for efficient clusters. The higher slack obtained reveals the excess net worth, fixed assets, employment and shortage in gross output. To conclude, the variables are highly correlated and the inefficient industries should increase their gross output or decrease the fixed assets or employment. Moreover for sustainable development, the cluster should strengthen infrastructure, technology, procurement, production and marketing interrelationships to decrease costs and to increase productivity and efficiency to compete in the indigenous and export market.
3D Visual Data Mining: goals and experiences

DEFF Research Database (Denmark)

Bøhlen, Michael Hanspeter; Bukauskas, Linas; Eriksen, Poul Svante

2003-01-01

, statistical analyses, perceptual and cognitive psychology, and scientific visualization. At the conceptual level we offer perceptual and cognitive insights to guide the information visualization process. We then choose cluster surfaces to exemplify the data mining process, to discuss the tasks involved...
MMPI profiles of males accused of severe crimes: a cluster analysis

NARCIS (Netherlands)

Spaans, M.; Barendregt, M.; Muller, E.; Beurs, E. de; Nijman, H.L.I.; Rinne, T.

2009-01-01

In studies attempting to classify criminal offenders by cluster analysis of Minnesota Multiphasic Personality Inventory-2 (MMPI-2) data, the number of clusters found varied between 10 (the Megargee System) and two (one cluster indicating no psychopathology and one exhibiting serious
Assessment of water quality in the elbe river at flood water conditions based on cluster analysis, principle components analysis, and source apportionment

Energy Technology Data Exchange (ETDEWEB)

Baborowski, Martina [Department of River Ecology, UFZ-Helmholtz Centre for Environmental Research, Magdeburg (Germany); Simeonov, Vasil [Faculty of Chemistry, University of Sofia, Sofia (Bulgaria); Einax, Juergen W. [Institute of Inorganic and Analytical Chemistry, Friedrich Schiller University of Jena, Jena (Germany)

2012-04-15

An assessment of water quality measurements during a spring flood in the Elbe River is presented. Daily samples were taken at a site in the middle Elbe, which is part of the network of the International Commission for the Protection of the Elbe River (IKSE/MKOL). Cluster analysis (CA), principal components analysis (PCA), and source apportionment (APCS apportioning) were used to assess the flood-dependent matter transport. As a result, three main components could be extracted as important to the matter transport in the Elbe River basin during flood events: (i) re-suspended contaminated sediments, which led to temporarily increased concentrations of suspended matter and of most of the investigated heavy metals; (ii) water discharge related concentrations of pedogenic dissolved organic matter (DOM) as well as preliminary diluted concentrations of uranium and chloride, parameters with stable pollution background in the river basin; and (iii) abandoned mines, i.e., their dewatering systems, with particular influence on nickel, manganese, and zinc concentrations. (Copyright copyright 2012 WILEY-VCH Verlag GmbH and Co. KGaA, Weinheim)
Incremental temporal pattern mining using efficient batch-free stream clustering

NARCIS (Netherlands)

Lu, Y.; Hassani, M.; Seidl, T.

2017-01-01

This paper addresses the problem of temporal pattern mining from multiple data streams containing temporal events. Temporal events are considered as real world events aligned with comprehensive starting and ending timing information rather than simple integer timestamps. Predefined relations, such as
Crime analysis using open source information

DEFF Research Database (Denmark)

Nizamani, Sarwat; Memon, Nasrullah; Shah, Azhar Ali

2015-01-01

In this paper, we present a method of crime analysis from open source information. We employed un-supervised methods of data mining to explore the facts regarding the crimes of an area of interest. The analysis is based on well known clustering and association techniques. The results show...
Cluster bomb ocular injuries.

Science.gov (United States)

Mansour, Ahmad M; Hamade, Haya; Ghaddar, Ayman; Mokadem, Ahmad Samih; El Hajj Ali, Mohamad; Awwad, Shady

2012-01-01

To present the visual outcomes and ocular sequelae of victims of cluster bombs. This retrospective, multicenter case series of ocular injury due to cluster bombs was conducted for 3 years after the war in South Lebanon (July 2006). Data were gathered from the reports to the Information Management System for Mine Action. There were 308 victims of cluster bombs: 36 individuals were killed, of whom 2 received ocular lacerations, and 272 individuals were injured, with 18 receiving ocular injury. These 18 surviving individuals were assessed by the authors. Ocular injury occurred in 6.5% (20/308) of cluster bomb victims. Trauma to multiple organs occurred in 12 of 18 cases (67%) with ocular injury. Ocular findings included corneal or scleral lacerations (16 eyes), corneal foreign bodies (9 eyes), corneal decompensation (2 eyes), ruptured cataract (6 eyes), and intravitreal foreign bodies (10 eyes). The corneas of one patient had extreme attenuation of the endothelium. Ocular injury occurred in 6.5% of cluster bomb victims and 67% of the patients with ocular injury sustained trauma to multiple organs. Visual morbidity in civilians is an additional reason for a global ban on the use of cluster bombs.
The design and implementation of web mining in web sites security

Science.gov (United States)

Li, Jian; Zhang, Guo-Yin; Gu, Guo-Chang; Li, Jian-Li

2003-06-01

The backdoor or information leak of Web servers can be detected by using Web Mining techniques on some abnormal Web log and Web application log data. The security of Web servers can be enhanced and the damage of illegal access can be avoided. Firstly, the system for discovering the patterns of information leakages in CGI scripts from Web log data was proposed. Secondly, those patterns for system administrators to modify their codes and enhance their Web site security were provided. The following aspects were described: one is to combine web application log with web log to extract more information, so web data mining could be used to mine web log for discovering the information that firewall and Information Detection System cannot find. Another approach is to propose an operation module of web site to enhance Web site security. In cluster server session, Density-Based Clustering technique is used to reduce resource cost and obtain better efficiency.
Web Mining and Social Networking

DEFF Research Database (Denmark)

Xu, Guandong; Zhang, Yanchun; Li, Lin

This book examines the techniques and applications involved in the Web Mining, Web Personalization and Recommendation and Web Community Analysis domains, including a detailed presentation of the principles, developed algorithms, and systems of the research in these areas. The applications of web mining, and the issue of how to incorporate web mining into web personalization and recommendation systems are also reviewed. Additionally, the volume explores web community mining and analysis to find the structural, organizational and temporal developments of web communities and reveal the societal sense of individuals or communities. The volume will benefit both academic and industry communities interested in the techniques and applications of web search, web data management, web mining and web knowledge discovery, as well as web community and social network analysis.
ANALYSIS OF DEVELOPING BATIK INDUSTRY CLUSTER IN BAKARAN VILLAGE CENTRAL JAVA PROVINCE

Directory of Open Access Journals (Sweden)

Hermanto Hermanto

2017-06-01

Full Text Available SMEs grow in clusters within a given geographical area, and entrepreneurs grow and thrive through the business cluster. Central Java Province has many business clusters that improve the regional economy, one of which is the batik industry cluster. Pati Regency is one of the regencies/cities in Central Java with the lowest turnover. The batik industry cluster in Pati is developing quite well, as can be seen from the increasing number of batik businesses incorporated in the cluster. This research examines the strategy for developing the batik industry cluster in Pati Regency. The purpose of this research is to determine the proper strategy for developing the batik industry cluster in Pati. The research method is quantitative. The analysis tool is Strengths, Weaknesses, Opportunities, Threats (SWOT) analysis. The result of the SWOT analysis shows that the proper strategy for developing the batik industry cluster in Pati is: optimizing the management of the batik business cluster in Bakaran Village; having the local government provide information on business capital loan facilities; employing labor from Bakaran Village while improving labor quality through training; and marketing Bakaran batik to broader markets while maintaining its quality. Advice that can be given from this research is that the parties who have a role in batik industry cluster development in Bakaran Village, Pati Regency, such as the Local Government.
Water quality assessment with hierarchical cluster analysis based on Mahalanobis distance.

Science.gov (United States)

Du, Xiangjun; Shao, Fengjing; Wu, Shunyao; Zhang, Hanlin; Xu, Si

2017-07-01

Water quality assessment is crucial for assessment of marine eutrophication, prediction of harmful algal blooms, and environment protection. Previous studies have developed many numeric modeling methods and data driven approaches for water quality assessment. Cluster analysis, an approach widely used for grouping data, has also been employed. However, there are complex correlations between water quality variables, which play important roles in water quality assessment but have always been overlooked. In this paper, we analyze correlations between water quality variables and propose an alternative method for water quality assessment with hierarchical cluster analysis based on Mahalanobis distance. Further, we cluster water quality data collected from coastal waters of the Bohai Sea and North Yellow Sea of China, and apply the clustering results to evaluate water quality. To evaluate the validity, we also cluster the water quality data with cluster analysis based on Euclidean distance, which is widely adopted by previous studies. The results show that our method is more suitable for water quality assessment with many correlated water quality variables. To our knowledge, it is the first attempt to apply Mahalanobis distance for coastal water quality assessment.
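The difference between the two distances compared in this record can be seen on two correlated variables. The following is a minimal sketch restricted to two dimensions so that the covariance matrix can be inverted by hand; the sample values and function names are hypothetical and are not the Bohai Sea data or the authors' code.

```python
import math


def covariance_2d(points):
    """Sample covariance of 2-D points, returned as (sxx, sxy, syy)."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mean_x) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - mean_y) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mean_x) * (p[1] - mean_y) for p in points) / (n - 1)
    return sxx, sxy, syy


def mahalanobis_2d(a, b, cov):
    """sqrt(d^T S^-1 d) for d = a - b, with S = [[sxx, sxy], [sxy, syy]]."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy * sxy
    if det <= 0.0:
        raise ValueError("covariance matrix must be positive definite")
    dx, dy = a[0] - b[0], a[1] - b[1]
    # Inverse of a symmetric 2x2 matrix: (1/det) * [[syy, -sxy], [-sxy, sxx]].
    quad = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det
    return math.sqrt(quad)


def euclidean_2d(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])


# Hypothetical, strongly correlated measurements of two variables.
samples = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8), (5.0, 10.1)]
cov = covariance_2d(samples)
origin = (0.0, 0.0)
along = (1.0, 2.0)     # step along the correlation trend
against = (2.0, -1.0)  # step of equal Euclidean length against the trend
d_along = mahalanobis_2d(along, origin, cov)
d_against = mahalanobis_2d(against, origin, cov)
```

Both steps have the same Euclidean length, yet the step against the trend is far larger in Mahalanobis terms, which is why a hierarchical clustering built on this distance can group correlated measurements differently.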
A SURVEY ON DOCUMENT CLUSTERING APPROACH FOR COMPUTER FORENSIC ANALYSIS

OpenAIRE

Monika Raghuvanshi*, Rahul Patel

2016-01-01

In a forensic analysis, large numbers of files are examined. Much of the information is in unstructured format, so it is quite a difficult task for computer forensics to perform such analysis. That is why performing forensic analysis of documents within a limited period of time requires a special approach such as document clustering. This paper reviews different document clustering algorithms and methodologies, for example K-means, K-medoids, single link, complete link, average link in accordance...
Geologic mapping around Mahoma mining. San Jose mining company

International Nuclear Information System (INIS)

Techera, J.; Arrighetii, R.

1993-01-01

The main objective of this study is to carry out geological mapping and structural analysis at 1:5,000 scale in the zone where the gold beneficiation plant of the San Jose mining company (Mahoma Mining) is located. Based on this study, many drilling sites have been marked.
Cluster Analysis in Rapeseed (Brassica Napus L.)

International Nuclear Information System (INIS)

Mahasi, J.M

2002-01-01

With a widening edible oil deficit, Kenya has become increasingly dependent on imported edible oils. Many oilseed crops (e.g. sunflower, soya beans, rapeseed/mustard, sesame, groundnuts etc) can be grown in Kenya. But oilseed rape is preferred because it is very high yielding (1.5 tons-4.0 tons/ha) with oil content of 42-46%. Other uses include fitting in various cropping systems as relay/inter crops, rotational crops, trap crops and fodder. It is soft seeded, hence oil extraction is relatively easy. The meal is high in protein and very useful in livestock supplementation. Rapeseed can be straight combined using adjusted wheat combines. The priority is to expand domestic oilseed production, hence the need to introduce improved rapeseed germplasm from other countries. The success of any crop improvement programme depends on the extent of genetic diversity in the material. Hence, it is essential to understand the adaptation of introduced genotypes and the similarities, if any, among them. Evaluation trials were carried out on 17 rapeseed genotypes (nine of Canadian origin and eight of European origin) grown at 4 locations, namely Endebess, Njoro, Timau and Mau Narok, in three years (1992, 1993 and 1994). Results for 1993 were discarded due to severe drought. An analysis of variance was carried out only on seed yields and the treatments were found to be significantly different. Cluster analysis was then carried out on mean seed yields and, based on this analysis, only one major group exists within the material. In 1992, varieties 2, 3, 8 and 9 didn't fall in the same cluster as the rest. Variety 8 was the only one not classified with the rest of the Canadian varieties. Three European varieties (2, 3 and 9) were however not classified with the others. In 1994, varieties 10 and 6 didn't fall in the major cluster. Of these two, variety 10 is of Canadian origin. Varieties were more similar in 1994 than 1992 due to favorable weather. It is evident that, genotypes from different geographical
A survey of text clustering techniques used for web mining

Directory of Open Access Journals (Sweden)

Dan MUNTEANU

2005-12-01

Full Text Available This paper contains an overview of basic formulations and approaches to clustering. Then it presents two important clustering paradigms: a bottom-up agglomerative technique, which collects similar documents into larger and larger groups, and a top-down partitioning technique, which divides a corpus into topic-oriented partitions.
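The bottom-up paradigm mentioned in this record can be sketched in a few lines: every document starts as its own cluster and the two most similar clusters are merged until a target count remains. The choices below (whitespace tokens, Jaccard similarity, single-link merging, the sample documents) are illustrative assumptions, not the formulations surveyed in the paper.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity between two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def agglomerate(docs, target_clusters):
    """Single-link agglomerative clustering; returns lists of document indices."""
    if target_clusters < 1:
        raise ValueError("target_clusters must be at least 1")
    tokens = [frozenset(doc.lower().split()) for doc in docs]
    clusters = [[i] for i in range(len(docs))]  # one cluster per document
    while len(clusters) > target_clusters:
        best = None
        for (i, ci), (j, cj) in combinations(enumerate(clusters), 2):
            # Single link: similarity of the closest pair across the clusters.
            sim = max(jaccard(tokens[a], tokens[b]) for a in ci for b in cj)
            if best is None or sim > best[0]:
                best = (sim, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return [sorted(c) for c in clusters]


# Hypothetical mini-corpus with two obvious topics.
corpus = [
    "web usage mining logs",
    "web log mining sessions",
    "gene expression clustering",
    "clustering gene expression profiles",
]
groups = agglomerate(corpus, target_clusters=2)
```

A top-down partitioning method would start from the opposite end, splitting the whole corpus into topic-oriented partitions instead of merging single documents.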
Big data mining analysis method based on cloud computing

Science.gov (United States)

Cai, Qing Qiu; Cui, Hong Gang; Tang, Hao

2017-08-01

In the era of information explosion, big data that are extremely large, discrete and non-structured or semi-structured have gone far beyond what traditional data management methods can handle. With the arrival of the cloud computing era, cloud computing provides a new technical way to mine and analyze massive data, which can effectively solve the problem that traditional data mining methods cannot adapt to massive data. This paper introduces the meaning and characteristics of cloud computing, analyzes the advantages of using cloud computing technology to realize data mining, designs a mining algorithm for association rules based on the MapReduce parallel processing architecture, and carries out experimental verification. The parallel association rule mining algorithm based on a cloud computing platform can greatly improve the execution speed of data mining.
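The record does not give the algorithm itself, so the following is only a generic illustration of how support counting for association rules splits into a map step and a reduce step. It runs in a single process and does not use any MapReduce framework; the function names and the sample transactions are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def map_phase(transaction):
    """Emit (itemset, 1) for every single item and item pair in a transaction."""
    items = sorted(set(transaction))
    for item in items:
        yield (item,), 1
    for pair in combinations(items, 2):
        yield pair, 1


def reduce_phase(mapped_pairs):
    """Sum the emitted counts per itemset key."""
    counts = defaultdict(int)
    for key, value in mapped_pairs:
        counts[key] += value
    return dict(counts)


def support_counts(transactions):
    mapped = (kv for t in transactions for kv in map_phase(t))
    return reduce_phase(mapped)


def confidence(counts, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent for single items."""
    pair = tuple(sorted((antecedent, consequent)))
    return counts[pair] / counts[(antecedent,)]


# Hypothetical shopping baskets.
baskets = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["butter", "milk"],
    ["bread", "butter"],
]
counts = support_counts(baskets)
frequent = {k: v for k, v in counts.items() if v >= 2}  # minimum support of 2
bread_to_milk = confidence(counts, "bread", "milk")
```

Because each transaction is mapped independently and counts are merged by key, the map calls could be distributed across workers, which is the property a MapReduce-based design relies on.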
MinePath: Mining for Phenotype Differential Sub-paths in Molecular Pathways

Science.gov (United States)

Koumakis, Lefteris; Kartsaki, Evgenia; Chatzimina, Maria; Zervakis, Michalis; Vassou, Despoina; Marias, Kostas; Moustakis, Vassilis; Potamias, George

2016-01-01

Pathway analysis methodologies couple traditional gene expression analysis with knowledge encoded in established molecular pathway networks, offering a promising approach towards the biological interpretation of phenotype differentiating genes. Early pathway analysis methodologies, known as gene set analysis (GSA), view pathways just as plain lists of genes without taking into account either the underlying pathway network topology or the involved gene regulatory relations. These approaches, even if they achieve computational efficiency and simplicity, consider pathways that involve the same genes as equivalent in terms of their gene enrichment characteristics. Most recent pathway analysis approaches take into account the underlying gene regulatory relations by examining their consistency with gene expression profiles and computing a score for each profile. Even with this approach, assessing and scoring single-relations limits the ability to reveal key gene regulation mechanisms hidden in longer pathway sub-paths. We introduce MinePath, a pathway analysis methodology that addresses and overcomes the aforementioned problems. MinePath facilitates the decomposition of pathways into their constituent sub-paths. Decomposition leads to the transformation of single-relations to complex regulation sub-paths. Regulation sub-paths are then matched with gene expression sample profiles in order to evaluate their functional status and to assess phenotype differential power. Assessment of differential power supports the identification of the most discriminant profiles. In addition, MinePath assesses the significance of the pathways as a whole, ranking them by their p-values. Comparison results with state-of-the-art pathway analysis systems are indicative of the soundness and reliability of the MinePath approach. In contrast with many pathway analysis tools, MinePath is a web-based system (www.minepath.org) offering dynamic and rich pathway visualization functionality, with the
The Quantitative Analysis of Chennai Automotive Industry Cluster

Science.gov (United States)

Bhaskaran, Ethirajan

2016-07-01

Chennai is also called the Detroit of India due to the presence of an automotive industry producing over 40 % of India's vehicles and components. During 2001-2002, the Automotive Component Industries (ACI) in the Ambattur, Thirumalizai and Thirumudivakkam Industrial Estates, Chennai, faced problems with infrastructure, technology, procurement, production and marketing. The objective is to study the quantitative performance of the Chennai Automotive Industry Cluster before (2001-2002) and after (2008-2009) the Cluster Development Approach (CDA). The methodology adopted is collection of primary data from 100 ACI using a quantitative questionnaire and analysis using Correlation Analysis (CA), Regression Analysis (RA), Friedman Test (FMT), and Kruskal-Wallis Test (KWT). The CA computed for the different sets of variables reveals that there is a high degree of relationship between the variables studied. The RA models constructed establish the strong relationship between the dependent variable and a host of independent variables. The models proposed here reveal the approximate relationship in a closer form. KWT shows there is no significant difference between the three location clusters with respect to Net Profit, Production Cost, Marketing Costs, Procurement Costs and Gross Output. This supports that each location has contributed uniformly to the development of the automobile component cluster. The FMT shows there is no significant difference between industrial units in respect of costs such as Production, Infrastructure, Technology, Marketing and Net Profit. To conclude, the Automotive Industries have fully utilized the Physical Infrastructure and Centralised Facilities by adopting CDA and are now exporting their products to North America, South America, Europe, Australia, Africa and Asia. The value chain analysis models have been implemented in all the cluster units. This CDA model can be implemented in industries of under developed and developing countries for cost reduction and productivity
Complementing the Numbers: A Text Mining Analysis of College Course Withdrawals

Science.gov (United States)

Michalski, Greg V.

2011-01-01

Excessive college course withdrawals are costly to the student and the institution in terms of time to degree completion, available classroom space, and other resources. Although generally well quantified, detailed analysis of the reasons given by students for course withdrawal is less common. To address this, a text mining analysis was performed…

Heavy metal contamination of agricultural soils affected by mining activities around the Ganxi River in Chenzhou, Southern China.

Science.gov (United States)

Ma, Li; Sun, Jing; Yang, Zhaoguang; Wang, Lin

2015-12-01

Heavy metal contamination has attracted widespread attention because of the strong toxicity and persistence of these metals. The Ganxi River, located in Chenzhou City, Southern China, has been severely polluted by lead/zinc ore mining activities. This work investigated the heavy metal pollution in agricultural soils around the Ganxi River. The total concentrations of heavy metals were determined by inductively coupled plasma-mass spectrometry. The potential risk associated with the heavy metals in soil was assessed by the Nemerow comprehensive index and the potential ecological risk index. In both methods, the study area was rated as very high risk. Multivariate statistical methods including Pearson's correlation analysis, hierarchical cluster analysis, and principal component analysis were employed to evaluate the relationships between heavy metals, as well as the correlation between heavy metals and pH, to identify the metal sources. Three distinct clusters were observed by hierarchical cluster analysis. In principal component analysis, a total of two components were extracted to explain over 90% of the total variance, both of which were associated with anthropogenic sources.
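The two risk indices named in this record have simple closed forms in their commonly used versions: the Nemerow comprehensive index is sqrt((mean(P)² + max(P)²) / 2) with single-factor indices P_i = C_i / S_i, and the potential ecological risk index is the sum of T_i · C_i / B_i. The sketch below evaluates both; all concentrations, reference values and toxic-response factors are hypothetical placeholders, since the record does not report the values used in the study.

```python
import math


def single_factor_indices(concentrations, standards):
    """P_i = C_i / S_i for each metal."""
    return {metal: concentrations[metal] / standards[metal] for metal in concentrations}


def nemerow_index(concentrations, standards):
    """Nemerow comprehensive index: sqrt((mean(P)^2 + max(P)^2) / 2)."""
    p = list(single_factor_indices(concentrations, standards).values())
    p_mean = sum(p) / len(p)
    p_max = max(p)
    return math.sqrt((p_mean ** 2 + p_max ** 2) / 2.0)


def potential_ecological_risk(concentrations, backgrounds, toxic_factors):
    """RI = sum over metals of T_i * C_i / B_i."""
    return sum(
        toxic_factors[metal] * concentrations[metal] / backgrounds[metal]
        for metal in concentrations
    )


# Hypothetical soil data in mg/kg; none of these numbers come from the study.
soil = {"Pb": 200.0, "Zn": 400.0}
reference = {"Pb": 100.0, "Zn": 100.0}   # assumed standard / background values
toxicity = {"Pb": 5.0, "Zn": 1.0}        # assumed toxic-response factors
pn = nemerow_index(soil, reference)
ri = potential_ecological_risk(soil, reference, toxicity)
```

Because the Nemerow index weights the maximum single-factor index as heavily as the mean, one strongly enriched metal is enough to push a site into a high-risk class.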
Clusters of Insomnia Disorder: An Exploratory Cluster Analysis of Objective Sleep Parameters Reveals Differences in Neurocognitive Functioning, Quantitative EEG, and Heart Rate Variability.

Science.gov (United States)

Miller, Christopher B; Bartlett, Delwyn J; Mullins, Anna E; Dodds, Kirsty L; Gordon, Christopher J; Kyle, Simon D; Kim, Jong Won; D'Rozario, Angela L; Lee, Rico S C; Comas, Maria; Marshall, Nathaniel S; Yee, Brendon J; Espie, Colin A; Grunstein, Ronald R

2016-11-01

To empirically derive and evaluate potential clusters of Insomnia Disorder through cluster analysis from polysomnography (PSG). We hypothesized that clusters would differ on neurocognitive performance, sleep-onset measures of quantitative ( q )-EEG and heart rate variability (HRV). Research volunteers with Insomnia Disorder (DSM-5) completed a neurocognitive assessment and overnight PSG measures of total sleep time (TST), wake time after sleep onset (WASO), and sleep onset latency (SOL) were used to determine clusters. From 96 volunteers with Insomnia Disorder, cluster analysis derived at least two clusters from objective sleep parameters: Insomnia with normal objective sleep duration (I-NSD: n = 53) and Insomnia with short sleep duration (I-SSD: n = 43). At sleep onset, differences in HRV between I-NSD and I-SSD clusters suggest attenuated parasympathetic activity in I-SSD (P insomnia clusters derived from cluster analysis differ in sleep onset HRV. Preliminary data suggest evidence for three clusters in insomnia with differences for sustained attention and sleep-onset q -EEG. Insomnia 100 sleep study: Australia New Zealand Clinical Trials Registry (ANZCTR) identification number 12612000049875. URL: https://www.anzctr.org.au/Trial/Registration/TrialReview.aspx?id=347742. © 2016 Associated Professional Sleep Societies, LLC.
Clusters of Insomnia Disorder: An Exploratory Cluster Analysis of Objective Sleep Parameters Reveals Differences in Neurocognitive Functioning, Quantitative EEG, and Heart Rate Variability

Science.gov (United States)

Miller, Christopher B.; Bartlett, Delwyn J.; Mullins, Anna E.; Dodds, Kirsty L.; Gordon, Christopher J.; Kyle, Simon D.; Kim, Jong Won; D'Rozario, Angela L.; Lee, Rico S.C.; Comas, Maria; Marshall, Nathaniel S.; Yee, Brendon J.; Espie, Colin A.; Grunstein, Ronald R.

2016-01-01

Study Objectives: To empirically derive and evaluate potential clusters of Insomnia Disorder through cluster analysis from polysomnography (PSG). We hypothesized that clusters would differ on neurocognitive performance, sleep-onset measures of quantitative (q)-EEG and heart rate variability (HRV). Methods: Research volunteers with Insomnia Disorder (DSM-5) completed a neurocognitive assessment and overnight PSG measures of total sleep time (TST), wake time after sleep onset (WASO), and sleep onset latency (SOL) were used to determine clusters. Results: From 96 volunteers with Insomnia Disorder, cluster analysis derived at least two clusters from objective sleep parameters: Insomnia with normal objective sleep duration (I-NSD: n = 53) and Insomnia with short sleep duration (I-SSD: n = 43). At sleep onset, differences in HRV between I-NSD and I-SSD clusters suggest attenuated parasympathetic activity in I-SSD (P insomnia clusters derived from cluster analysis differ in sleep onset HRV. Preliminary data suggest evidence for three clusters in insomnia with differences for sustained attention and sleep-onset q-EEG. Clinical Trial Registration: Insomnia 100 sleep study: Australia New Zealand Clinical Trials Registry (ANZCTR) identification number 12612000049875. URL: https://www.anzctr.org.au/Trial/Registration/TrialReview.aspx?id=347742. Citation: Miller CB, Bartlett DJ, Mullins AE, Dodds KL, Gordon CJ, Kyle SD, Kim JW, D'Rozario AL, Lee RS, Comas M, Marshall NS, Yee BJ, Espie CA, Grunstein RR. Clusters of Insomnia Disorder: an exploratory cluster analysis of objective sleep parameters reveals differences in neurocognitive functioning, quantitative EEG, and heart rate variability. SLEEP 2016;39(11):1993–2004. PMID:27568796
Assessment of genetic divergence in tomato through agglomerative hierarchical clustering and principal component analysis

International Nuclear Information System (INIS)

Iqbal, Q.; Saleem, M.Y.; Hameed, A.; Asghar, M.

2014-01-01

For the improvement of qualitative and quantitative traits, the existence of variability is of prime importance in plant breeding. Data on different morphological and reproductive traits of 47 tomato genotypes were analyzed by correlation, agglomerative hierarchical clustering and principal component analysis (PCA) to select genotypes and traits for a future breeding program. Correlation analysis revealed a significant positive association between yield and yield components such as fruit diameter, single fruit weight and number of fruits per plant. PCA identified three principal components (PCs) with eigenvalues higher than 1, together contributing 81.72% of the total variability for the different traits. PC-I showed positive factor loadings for all traits except number of fruits per plant. The contribution of single fruit weight and fruit diameter was highest in PC-I. Cluster analysis grouped all genotypes into five divergent clusters. The genotypes in cluster-II and cluster-V exhibited uniform maturity and higher yield. The D² statistic confirmed the greatest distance between cluster-III and cluster-V, while maximum similarity was observed between cluster-II and cluster-III. It is therefore suggested that crosses between genotypes of cluster-II and cluster-V with those of cluster-I and cluster-III may exhibit heterosis in F1 for hybrid breeding and allow selection of superior genotypes in succeeding generations in a cross-breeding programme. (author)
An application of data mining techniques in designing catalogue for a laundry service

Directory of Open Access Journals (Sweden)

Khasanah Annisa Uswatun

2018-01-01

Full Text Available Catalogues are media that companies use to promote their products or services. Since a catalogue is a marketing medium, the first essential step before designing a product catalogue is determining the target market. It is also important to include information that appeals to the target market, such as discounts or promotions, by analysing customers' preference patterns in using services or buying products. This study applies two data mining techniques. The first is clustering analysis to segment customers, and the second is association rule mining to discover interesting patterns in the services that customers commonly use at the same service time. The results are used as recommendations for an attractive marketing strategy to be included in the service catalogue promotion for a laundry in Sleman, Yogyakarta. The clustering result showed that the biggest customer segment is university students who come 3 to 5 times a month on weekends, while the association rule result showed that clothes, shoes, and bed sheets have a strong relationship. The catalogue design is presented at the end of the paper.
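The association rule step described above can be sketched in a few lines of Python. This is a minimal illustration restricted to rules between pairs of items, scored by support and confidence; the function name, the thresholds, and the laundry orders are hypothetical, and the study's actual algorithm and data are not given in the abstract.

```python
from itertools import combinations


def pair_rules(transactions, min_support=0.4, min_confidence=0.6):
    """Return rules (lhs, rhs, support, confidence) between pairs of items."""
    n = len(transactions)
    item_count = {}
    pair_count = {}
    for transaction in transactions:
        items = sorted(set(transaction))
        for item in items:
            item_count[item] = item_count.get(item, 0) + 1
        for pair in combinations(items, 2):
            pair_count[pair] = pair_count.get(pair, 0) + 1

    rules = []
    for (a, b), count in pair_count.items():
        support = count / n  # share of transactions containing both items
        if support < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_count[lhs]  # estimate of P(rhs | lhs)
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    return sorted(rules)


# Hypothetical laundry orders, for illustration only.
orders = [
    {"clothes", "shoes"},
    {"clothes", "bed sheet"},
    {"clothes", "shoes", "bed sheet"},
    {"shoes"},
    {"clothes", "bed sheet"},
]

rules = pair_rules(orders)
for lhs, rhs, support, confidence in rules:
    print(f"{lhs} -> {rhs}: support={support:.2f}, confidence={confidence:.2f}")
```

Rules over larger item sets would need a frequent-itemset search such as Apriori, which this sketch leaves out.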
Web Mining and Social Networking

CERN Document Server

Xu, Guandong; Li, Lin

2011-01-01

This book examines the techniques and applications involved in the Web Mining, Web Personalization and Recommendation and Web Community Analysis domains, including a detailed presentation of the principles, developed algorithms, and systems of the research in these areas. The applications of web mining, and the issue of how to incorporate web mining into web personalization and recommendation systems are also reviewed. Additionally, the volume explores web community mining and analysis to find the structural, organizational and temporal developments of web communities and reveal the societal s
Analysis of the planned post-mining landscape of MIBRAG's open-cast mines with regard to a possible environmental impact of alteration processes in mixed dumps

Energy Technology Data Exchange (ETDEWEB)

Jolas, P.; Hofmann, B. [Mitteldeutsche Braunkohlengesellschaft, Theissen (Germany)

2010-07-01

There has been an increasing body of knowledge with regard to hydro- and geochemical alteration processes in overburden dumps and their impact on groundwater quality in lignite mining and reclamation operations associated with post-mining landscapes in Germany. The operators of the MIBRAG mines have examined issues regarding alteration processes and how they affect the environment and which opportunities exist to actively influence the dumping process. The objectives were to counteract any possible negative impact of the alteration processes. Special emphasis was on the impact caused by oxidation of sulfur containing minerals. This paper presented an analysis of the situation at United Schleenhain Mine and how it reflects on the work to date for MIBRAG's mines. A future outlook was also presented. Specifically, the paper discussed the development of the United Schleenhain mine and the post-mining landscape. The potential for discharge of substances was also evaluated along with acidification. 1 tab., 5 figs.
Introduction to the JASIST Special Topic Issue on Web Retrieval and Mining: A Machine Learning Perspective.

Science.gov (United States)

Chen, Hsinchun

2003-01-01

Discusses information retrieval techniques used on the World Wide Web. Topics include machine learning in information extraction; relevance feedback; information filtering and recommendation; text classification and text clustering; Web mining, based on data mining techniques; hyperlink structure; and Web size. (LRW)
A supplier selection using a hybrid grey based hierarchical clustering and artificial bee colony

Directory of Open Access Journals (Sweden)

Farshad Faezy Razi

2014-06-01

Full Text Available Selecting one or a combination of the most suitable potential providers, together with the outsourcing problem, is among the most important strategies in logistics and supply chain management. In this paper, the selection of an optimal combination of suppliers in inventory and supply chain management is studied and analyzed via a multiple attribute decision making approach, data mining and evolutionary optimization algorithms. For supplier selection in the supply chain, hierarchical clustering first clusters the suppliers according to the studied indexes. Then each supplier is evaluated within its cluster through Grey Relational Analysis. Finally, the combination of suppliers' Pareto optimal rank and costs is obtained using the Artificial Bee Colony meta-heuristic algorithm. A case study is conducted to better describe the new algorithm for selecting multiple sources of suppliers.
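The Grey Relational Analysis step mentioned above can be illustrated with a minimal Python sketch. It assumes that every criterion is benefit-type (larger is better), that the reference sequence is the ideal value 1 after min-max normalization, and that the distinguishing coefficient is the conventional 0.5. The function name and the supplier scores are hypothetical; the clustering and Artificial Bee Colony stages of the paper are not reproduced.

```python
def grey_relational_grades(matrix, rho=0.5):
    """Grey relational grade of each alternative (row) against the ideal sequence.

    matrix: rows are alternatives, columns are benefit-type criteria (assumption).
    rho: distinguishing coefficient, conventionally 0.5.
    """
    # Min-max normalize each criterion so the best observed value maps to 1.
    normalized_columns = []
    for column in zip(*matrix):
        lo, hi = min(column), max(column)
        span = hi - lo
        normalized_columns.append(
            [(value - lo) / span if span else 1.0 for value in column]
        )
    normalized = list(zip(*normalized_columns))

    # Deviation of every normalized value from the ideal value 1.
    deltas = [[1.0 - value for value in row] for row in normalized]
    flat = [d for row in deltas for d in row]
    d_min, d_max = min(flat), max(flat)
    if d_max == 0:
        return [1.0] * len(matrix)

    grades = []
    for row in deltas:
        coefficients = [(d_min + rho * d_max) / (d + rho * d_max) for d in row]
        grades.append(sum(coefficients) / len(coefficients))
    return grades


# Hypothetical supplier scores: [on-time delivery rate, quality score].
suppliers = [[0.9, 80], [0.7, 95], [0.8, 60]]
grades = grey_relational_grades(suppliers)
print([round(g, 3) for g in grades])
```

A higher grade means the supplier lies closer to the ideal sequence across all criteria; cost-type criteria would need the normalization reversed.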
A Distributed Flocking Approach for Information Stream Clustering Analysis

Energy Technology Data Exchange (ETDEWEB)

Cui, Xiaohui [ORNL; Potok, Thomas E [ORNL

2006-01-01

Intelligence analysts are currently overwhelmed by the volume of information streams generated every day. There is a lack of comprehensive tools that can analyze these information streams in real time. Document clustering analysis plays an important role in improving the accuracy of information retrieval. However, most clustering technologies can only be applied to static document collections because they normally require a large amount of computational resources and a long time to obtain accurate results. It is very difficult to cluster dynamically changing text information streams on an individual computer. Our earlier research resulted in a dynamic reactive flock clustering algorithm that can continually refine the clustering result and quickly react to changes in document contents. This characteristic makes the algorithm suitable for cluster analysis of dynamically changing document information, such as a text information stream. Because of the decentralized character of this algorithm, a distributed approach is a very natural way to increase its clustering speed. In this paper, we present a distributed multi-agent flocking approach for text information stream clustering and discuss the decentralized architectures and communication schemes for load balancing and status information synchronization in this approach.
Grizzly bear diet shifting on reclaimed mines

Directory of Open Access Journals (Sweden)

Bogdan Cristescu

2015-07-01

Full Text Available Industrial developments and reclamation change habitat, possibly altering the food base of large carnivores. We monitored the diet of a low-density population of grizzly bears occupying a landscape with open-pit coal mines in Canada. During 2009–2010 we instrumented 10 bears with GPS radiocollars and compared their feeding on reclaimed coal mines and in the neighboring Rocky Mountains and their foothills. In addition, we compared our data with historical bear diet for the same population collected in 2001–2003, before extensive mine reclamation occurred. Diet on mines (n=331 scats) was dominated by non-native forbs and graminoids, while diets in the Foothills and Mountains consisted primarily of ungulates and Hedysarum spp. roots respectively, showing diet shifting with availability. Field visitation of feeding sites (n=234 GPS relocation clusters) also showed that ungulates were the main diet component in the Foothills, whereas on reclaimed mines bears were least carnivorous. These differences illustrate a shift to feeding on non-native forbs, while comparisons with historical diet reveal the emergence of elk as an important bear food. Food resources on reclaimed mines attract bears from wilderness areas, and bears may be more adaptable to landscape change than previously thought. The grizzly bear's ready use of mines cautions against the universal view of this species as an umbrella species indicative of biodiversity.
Cluster analysis of obesity and asthma phenotypes.

Directory of Open Access Journals (Sweden)

E Rand Sutherland

Full Text Available Asthma is a heterogeneous disease with variability among patients in characteristics such as lung function, symptoms and control, body weight, markers of inflammation, and responsiveness to glucocorticoids (GC). Cluster analysis of well-characterized cohorts can advance understanding of disease subgroups in asthma and point to unsuspected disease mechanisms. We utilized a hypothesis-free cluster analytical approach to define the contribution of obesity and related variables to asthma phenotype. In a cohort of clinical trial participants (n = 250), minimum-variance hierarchical clustering was used to identify clinical and inflammatory biomarkers important in determining disease cluster membership in mild and moderate persistent asthmatics. In a subset of participants, GC sensitivity was assessed via expression of GC receptor alpha (GCRα) and induction of MAP kinase phosphatase-1 (MKP-1) expression by dexamethasone. Four asthma clusters were identified, with body mass index (BMI, kg/m²) and severity of asthma symptoms (AEQ score) the most significant determinants of cluster membership (F = 57.1, p<0.0001 and F = 44.8, p<0.0001, respectively). Two clusters were composed of predominantly obese individuals; these two obese asthma clusters differed from one another with regard to age of asthma onset, measures of asthma symptoms (AEQ) and control (ACQ), exhaled nitric oxide concentration (FENO) and airway hyperresponsiveness (methacholine PC20) but were similar with regard to measures of lung function (FEV1 (%) and FEV1/FVC), airway eosinophilia, IgE, leptin, adiponectin and C-reactive protein (hsCRP). Members of obese clusters demonstrated evidence of reduced expression of GCRα, a finding which was correlated with a reduced induction of MKP-1 expression by dexamethasone. Obesity is an important determinant of asthma phenotype in adults. There is heterogeneity in expression of clinical and inflammatory biomarkers of asthma across obese individuals
RHSEG and Subdue: Background and Preliminary Approach for Combining these Technologies for Enhanced Image Data Analysis, Mining and Knowledge Discovery

Science.gov (United States)

Tilton, James C.; Cook, Diane J.

2008-01-01

Under a project recently selected for funding by NASA's Science Mission Directorate under the Applied Information Systems Research (AISR) program, Tilton and Cook will design and implement the integration of the Subdue graph based knowledge discovery system, developed at the University of Texas Arlington and Washington State University, with image segmentation hierarchies produced by the RHSEG software, developed at NASA GSFC, and perform pilot demonstration studies of data analysis, mining and knowledge discovery on NASA data. Subdue represents a method for discovering substructures in structural databases. Subdue is devised for general-purpose automated discovery, concept learning, and hierarchical clustering, with or without domain knowledge. Subdue was developed by Cook and her colleague, Lawrence B. Holder. For Subdue to be effective in finding patterns in imagery data, the data must be abstracted up from the pixel domain. An appropriate abstraction of imagery data is a segmentation hierarchy: a set of several segmentations of the same image at different levels of detail in which the segmentations at coarser levels of detail can be produced from simple merges of regions at finer levels of detail. The RHSEG program, a recursive approximation to a Hierarchical Segmentation approach (HSEG), can produce segmentation hierarchies quickly and effectively for a wide variety of images. RHSEG and HSEG were developed at NASA GSFC by Tilton. In this presentation we provide background on the RHSEG and Subdue technologies and present a preliminary analysis on how RHSEG and Subdue may be combined to enhance image data analysis, mining and knowledge discovery.
Cluster: A New Application for Spatial Analysis of Pixelated Data for Epiphytotics.

Science.gov (United States)

Nelson, Scot C; Corcoja, Iulian; Pethybridge, Sarah J

2017-12-01

Spatial analysis of epiphytotics is essential to develop and test hypotheses about pathogen ecology, disease dynamics, and to optimize plant disease management strategies. Data collection for spatial analysis requires substantial investment in time to depict patterns in various frames and hierarchies. We developed a new approach for spatial analysis of pixelated data in digital imagery and incorporated the method in a stand-alone desktop application called Cluster. The user isolates target entities (clusters) by designating up to 24 pixel colors as nontargets and moves a threshold slider to visualize the targets. The app calculates the percent area occupied by targeted pixels, identifies the centroids of targeted clusters, and computes the relative compass angle of orientation for each cluster. Users can deselect anomalous clusters manually and/or automatically by specifying a size threshold value to exclude smaller targets from the analysis. Up to 1,000 stochastic simulations randomly place the centroids of each cluster in ranked order of size (largest to smallest) within each matrix while preserving their calculated angles of orientation for the long axes. A two-tailed probability t test compares the mean inter-cluster distances for the observed versus the values derived from randomly simulated maps. This is the basis for statistical testing of the null hypothesis that the clusters are randomly distributed within the frame of interest. These frames can assume any shape, from natural (e.g., leaf) to arbitrary (e.g., a rectangular or polygonal field). Cluster summarizes normalized attributes of clusters, including pixel number, axis length, axis width, compass orientation, and the length/width ratio, available to the user as a downloadable spreadsheet. Each simulated map may be saved as an image and inspected. Provided examples demonstrate the utility of Cluster to analyze patterns at various spatial scales in plant pathology and ecology and highlight the
Improve Data Mining and Knowledge Discovery through the use of MatLab

Science.gov (United States)

Shaykahian, Gholan Ali; Martin, Dawn Elliott; Beil, Robert

2011-01-01

Data mining is widely used to mine business, engineering, and scientific data. Data mining uses pattern-based queries, searches, or other analyses of one or more electronic databases/datasets in order to discover or locate a predictive pattern or anomaly indicative of system failure, criminal or terrorist activity, etc. Various algorithms, techniques and methods are used to mine data, including neural networks, genetic algorithms, decision trees, the nearest neighbor method, rule induction, association analysis, slice and dice, segmentation, and clustering. These algorithms, techniques and methods for detecting patterns in a dataset have been used in the development of numerous open source and commercially available products and technologies for data mining. Data mining is best realized when latent information in a large quantity of stored data is discovered. No one technique solves all data mining problems; the challenge is to select algorithms or methods appropriate to strengthen data/text mining and trending within given datasets. In recent years, throughout industry, academia and government agencies, thousands of data systems have been designed and tailored to serve specific engineering and business needs. Many of these systems use databases with relational algebra and structured query language to categorize and retrieve data. In these systems, data analyses are limited and require prior explicit knowledge of metadata and database relations; they lack exploratory data mining and discovery of latent information. This presentation introduces MatLab(TM) (MATrix LABoratory), an engineering and scientific data analysis tool, as a means to perform data mining. MatLab was originally intended to perform purely numerical calculations (a glorified calculator). Now, in addition to having hundreds of mathematical functions, it is a programming language with hundreds of built-in standard functions and numerous available toolboxes. MatLab's ease of data processing, visualization and
Randomized algorithms in automatic control and data mining

CERN Document Server

Granichin, Oleg; Toledano-Kitai, Dvora

2015-01-01

In the fields of data mining and control, the huge amount of unstructured data and the presence of uncertainty in system descriptions have always been critical issues. The book Randomized Algorithms in Automatic Control and Data Mining introduces the readers to the fundamentals of randomized algorithm applications in data mining (especially clustering) and in automatic control synthesis. The methods proposed in this book guarantee that the computational complexity of classical algorithms and the conservativeness of standard robust control techniques will be reduced. It is shown that when a problem requires "brute force" in selecting among options, algorithms based on random selection of alternatives offer good results with certain probability for a restricted time and significantly reduce the volume of operations.
Relationships between sources of acid mine drainage and the hydrochemistry of acid effluents during rainy season in the Iberian Pyrite Belt.

Science.gov (United States)

Pérez-Ostalé, E; Grande, J A; Valente, T; de la Torre, M L; Santisteban, M; Fernández, P; Diaz-Curiel, J

2016-01-01

In the Iberian Pyrite Belt (IPB), southwest Spain, a prolonged and intense mining activity of more than 4,500 years has resulted in almost a hundred mines scattered through the region. After years of inactivity, these mines are still causing high levels of hydrochemical degradation in the fluvial network. This situation represents a unique scenario in the world, taking into consideration its magnitude and intensity of the contamination processes. In order to obtain a benchmark regarding the degree of acid mine drainage (AMD) pollution in the aquatic environment, the relationship between the areas occupied by the sulfide mines and the characteristics of the respective effluents after rainfall was analysed. The methodology developed, which includes the design of a sampling network, analytical treatment and cluster analysis, is a useful tool for diagnosing the contamination level by AMD in an entire metallogenic province, at the scale of each mining group. The results presented the relationship between sulfate, total dissolved solids and electrical conductivity, as well as other parameters that are typically associated with AMD and the major elements that compose the polymetallic sulfides of IPB. This analysis also indicates the low level of proximity between the affectation area and the other variables.
Phenotypes of asthma in low-income children and adolescents: cluster analysis

Directory of Open Access Journals (Sweden)

Anna Lucia Barros Cabral

Full Text Available ABSTRACT Objective: Studies characterizing asthma phenotypes have predominantly included adults or have involved children and adolescents in developed countries. Therefore, their applicability in other populations, such as those of developing countries, remains indeterminate. Our objective was to determine how low-income children and adolescents with asthma in Brazil are distributed across a cluster analysis. Methods: We included 306 children and adolescents (6-18 years of age) with a clinical diagnosis of asthma and under medical treatment for at least one year of follow-up. At enrollment, all the patients were clinically stable. For the cluster analysis, we selected 20 variables commonly measured in clinical practice and considered important in defining asthma phenotypes. Variables with high multicollinearity were excluded. A cluster analysis was applied using a two-step agglomerative test and log-likelihood distance measure. Results: Three clusters were defined for our population. Cluster 1 (n = 94) included subjects with normal pulmonary function, mild eosinophil inflammation, few exacerbations, later age at asthma onset, and mild atopy. Cluster 2 (n = 87) included those with normal pulmonary function, a moderate number of exacerbations, early age at asthma onset, more severe eosinophil inflammation, and moderate atopy. Cluster 3 (n = 108) included those with poor pulmonary function, frequent exacerbations, severe eosinophil inflammation, and severe atopy. Conclusions: Asthma was characterized by the presence of atopy, number of exacerbations, and lung function in low-income children and adolescents in Brazil. The many similarities with previous cluster analyses of phenotypes indicate that this approach shows good generalizability.
The pit ventilation features and the design principle of ventilation system in trackless mining uranium mine

International Nuclear Information System (INIS)

Deng Wenhui; Zhou Xinghuo; Li Xianjie

2001-01-01

According to the pit arrangement features of trackless mining uranium mine, based on the fundamental of radon permeation and control, and analysis of radon pollution characteristics and radon education, the design principle of ventilation system in trackless mining uranium mine has been raised
Reproducibility of Cognitive Profiles in Psychosis Using Cluster Analysis.

Science.gov (United States)

Lewandowski, Kathryn E; Baker, Justin T; McCarthy, Julie M; Norris, Lesley A; Öngür, Dost

2018-04-01

Cognitive dysfunction is a core symptom dimension that cuts across the psychoses. Recent findings support classification of patients along the cognitive dimension using cluster analysis; however, data-derived groupings may be highly determined by sampling characteristics and the measures used to derive the clusters, and so their interpretability must be established. We examined cognitive clusters in a cross-diagnostic sample of patients with psychosis and associations with clinical and functional outcomes. We then compared our findings to a previous report of cognitive clusters in a separate sample using a different cognitive battery. Participants with affective or non-affective psychosis (n=120) and healthy controls (n=31) were administered the MATRICS Consensus Cognitive Battery, and clinical and community functioning assessments. Cluster analyses were performed on cognitive variables, and clusters were compared on demographic, cognitive, and clinical measures. Results were compared to findings from our previous report. A four-cluster solution provided a good fit to the data; profiles included a neuropsychologically normal cluster, a globally impaired cluster, and two clusters of mixed profiles. Cognitive burden was associated with symptom severity and poorer community functioning. The patterns of cognitive performance by cluster were highly consistent with our previous findings. We found evidence of four cognitive subgroups of patients with psychosis, with cognitive profiles that map closely to those produced in our previous work. Clusters were associated with clinical and community variables and a measure of premorbid functioning, suggesting that they reflect meaningful groupings: replicable, and related to clinical presentation and functional outcomes. (JINS, 2018, 24, 382-390).

Identifying novel phenotypes of acute heart failure using cluster analysis of clinical variables.

Science.gov (United States)

Horiuchi, Yu; Tanimoto, Shuzou; Latif, A H M Mahbub; Urayama, Kevin Y; Aoki, Jiro; Yahagi, Kazuyuki; Okuno, Taishi; Sato, Yu; Tanaka, Tetsu; Koseki, Keita; Komiyama, Kota; Nakajima, Hiroyoshi; Hara, Kazuhiro; Tanabe, Kengo

2018-07-01

Acute heart failure (AHF) is a heterogeneous disease caused by various cardiovascular (CV) pathophysiology and multiple non-CV comorbidities. We aimed to identify clinically important subgroups to improve our understanding of the pathophysiology of AHF and inform clinical decision-making. We evaluated detailed clinical data of 345 consecutive AHF patients using non-hierarchical cluster analysis of 77 variables, including age, sex, HF etiology, comorbidities, physical findings, laboratory data, electrocardiogram, echocardiogram and treatment during hospitalization. Cox proportional hazards regression analysis was performed to estimate the association between the clusters and clinical outcomes. Three clusters were identified. Cluster 1 (n=108) represented "vascular failure". This cluster had the highest average systolic blood pressure at admission and lung congestion with type 2 respiratory failure. Cluster 2 (n=89) represented "cardiac and renal failure". They had the lowest ejection fraction (EF) and worst renal function. Cluster 3 (n=148) comprised mostly older patients and had the highest prevalence of atrial fibrillation and preserved EF. Death or HF hospitalization within 12-month occurred in 23% of Cluster 1, 36% of Cluster 2 and 36% of Cluster 3 (p=0.034). Compared with Cluster 1, risk of death or HF hospitalization was 1.74 (95% CI, 1.03-2.95, p=0.037) for Cluster 2 and 1.82 (95% CI, 1.13-2.93, p=0.014) for Cluster 3. Cluster analysis may be effective in producing clinically relevant categories of AHF, and may suggest underlying pathophysiology and potential utility in predicting clinical outcomes. Copyright © 2018 Elsevier B.V. All rights reserved.
Identification and validation of asthma phenotypes in Chinese population using cluster analysis.

Science.gov (United States)

Wang, Lei; Liang, Rui; Zhou, Ting; Zheng, Jing; Liang, Bing Miao; Zhang, Hong Ping; Luo, Feng Ming; Gibson, Peter G; Wang, Gang

2017-10-01

Asthma is a heterogeneous airway disease, so it is crucial to clearly identify clinical phenotypes to achieve better asthma management. To identify and prospectively validate asthma clusters in a Chinese population. Two hundred eighty-four patients were consecutively recruited and 18 sociodemographic and clinical variables were collected. Hierarchical cluster analysis was performed by the Ward method followed by k-means cluster analysis. Then, a prospective 12-month cohort study was used to validate the identified clusters. Five clusters were successfully identified. Clusters 1 (n = 71) and 3 (n = 81) were mild asthma phenotypes with slight airway obstruction and low exacerbation risk, but with a sex differential. Cluster 2 (n = 65) described an "allergic" phenotype, cluster 4 (n = 33) featured a "fixed airflow limitation" phenotype with smoking, and cluster 5 (n = 34) was a "low socioeconomic status" phenotype. Patients in clusters 2, 4, and 5 had distinctly lower socioeconomic status and more psychological symptoms. Cluster 2 had a significantly increased risk of exacerbations (risk ratio [RR] 1.13, 95% confidence interval [CI] 1.03-1.25), unplanned visits for asthma (RR 1.98, 95% CI 1.07-3.66), and emergency visits for asthma (RR 7.17, 95% CI 1.26-40.80). Cluster 4 had an increased risk of unplanned visits (RR 2.22, 95% CI 1.02-4.81), and cluster 5 had increased emergency visits (RR 12.72, 95% CI 1.95-69.78). Kaplan-Meier analysis confirmed that cluster grouping was predictive of time to the first asthma exacerbation, unplanned visit, emergency visit, and hospital admission (P clusters as "allergic asthma," "fixed airflow limitation," and "low socioeconomic status" phenotypes that are at high risk of severe asthma exacerbations and that have management implications for clinical practice in developing countries. Copyright © 2017 American College of Allergy, Asthma & Immunology. Published by Elsevier Inc. All rights reserved.
Comparison of population-averaged and cluster-specific models for the analysis of cluster randomized trials with missing binary outcomes: a simulation study

Directory of Open Access Journals (Sweden)

Ma Jinhui

2013-01-01

Full Text Available Abstract Background: The objective of this simulation study is to compare the accuracy and efficiency of population-averaged (i.e. generalized estimating equations (GEE)) and cluster-specific (i.e. random-effects logistic regression (RELR)) models for analyzing data from cluster randomized trials (CRTs) with missing binary responses. Methods: In this simulation study, clustered responses were generated from a beta-binomial distribution. The number of clusters per trial arm, the number of subjects per cluster, the intra-cluster correlation coefficient, and the percentage of missing data were allowed to vary. Under the assumption of covariate dependent missingness, missing outcomes were handled by complete case analysis, standard multiple imputation (MI) and within-cluster MI strategies. Data were analyzed using GEE and RELR. Performance of the methods was assessed using standardized bias, empirical standard error, root mean squared error (RMSE), and coverage probability. Results: GEE performs well on all four measures, provided the downward bias of the standard error (when the number of clusters per arm is small) is adjusted appropriately, under the following scenarios: complete case analysis for CRTs with a small amount of missing data; standard MI for CRTs with variance inflation factor (VIF 50. RELR performs well only when a small amount of data was missing, and complete case analysis was applied. Conclusion: GEE performs well as long as appropriate missing data strategies are adopted based on the design of CRTs and the percentage of missing data. In contrast, RELR does not perform well when either standard or within-cluster MI strategy is applied prior to the analysis.
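To make the data-generating step and the variance inflation factor in the abstract above more concrete, here is a minimal Python sketch. It relies on two standard relations: for equal cluster sizes m, VIF = 1 + (m - 1) × ICC, and for a beta-binomial model with parameters alpha and beta, ICC = 1 / (alpha + beta + 1). The function names, the parameter values, and the equal-cluster-size assumption are illustrative only and are not taken from the study.

```python
import random


def variance_inflation_factor(cluster_size, icc):
    """Design effect for equal cluster sizes: 1 + (m - 1) * ICC."""
    return 1 + (cluster_size - 1) * icc


def beta_parameters(mean, icc):
    """Beta(alpha, beta) parameters giving the requested mean and ICC.

    Uses ICC = 1 / (alpha + beta + 1), so alpha + beta = 1 / ICC - 1.
    Requires 0 < icc < 1 and 0 < mean < 1.
    """
    total = 1 / icc - 1
    return mean * total, (1 - mean) * total


def simulate_cluster(size, mean, icc, rng):
    """Draw one cluster of binary outcomes from a beta-binomial model."""
    alpha, beta = beta_parameters(mean, icc)
    p = rng.betavariate(alpha, beta)  # cluster-specific success probability
    return [1 if rng.random() < p else 0 for _ in range(size)]


rng = random.Random(2013)  # fixed seed so the sketch is repeatable
cluster = simulate_cluster(size=20, mean=0.3, icc=0.05, rng=rng)
print(sum(cluster), "events in a cluster of", len(cluster))
print("VIF:", variance_inflation_factor(20, 0.05))
```

Repeating `simulate_cluster` for each cluster in each arm, and then deleting outcomes according to a covariate-dependent rule, would give datasets of the kind the abstract describes; that missingness step is not shown here.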
Environmental assessment of mining industry solid pollution in the mercurial district of Azzaba, northeast Algeria.

Science.gov (United States)

Seklaoui, M'hamed; Boutaleb, Abdelhak; Benali, Hanafi; Alligui, Fadila; Prochaska, Walter

2016-11-01

To date, there have been few detailed studies regarding the impact of mining and metallogenic activities on solid fractions in the Azzaba mercurial district (northeast Algeria) despite its importance and global similarity with large Hg mines. To assess the degree, distribution, and sources of pollution, a physical inventory of apparent pollution was developed, and several samples of mining waste, process waste, sediment, and soil were collected on regional and local scales to determine the concentration of Hg and other metals according to their existing mineralogical association. Several physico-chemical parameters that are known to influence the pollution distribution were also measured. The extremely high concentrations of all metals exceed all norms and predominantly characterize the metallurgic and mining areas; metal concentrations decrease significantly within a short distance of these sources. The geo-accumulation index, which is the most realistic assessment method, demonstrates that soils and sediments near waste dumps and abandoned Hg mines are extremely polluted by all analyzed metals. The pollution by these metals decreases significantly with distance, which indicates a limited dispersion. The results of a clustering analysis and an integrated pollution index suggest that waste dumps, which are composed of calcine and condensation wastes, are the main source of pollution. Correlations and principal component analysis reveal the important role of hosting carbonate rocks in limiting pollution and differentiating calcine wastes from condensation waste, which has an extremely high Hg concentration (>1%).
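For readers unfamiliar with the geo-accumulation index mentioned above, here is a small sketch of the standard Müller form, Igeo = log2(C / (1.5 B)), where C is the measured concentration and B the geochemical background. The sample values below are hypothetical; the abstract does not report the concentrations or backgrounds used.

```python
import math

def geoaccumulation_index(concentration, background, factor=1.5):
    """Mueller geo-accumulation index: log2(C / (factor * B)).

    The factor 1.5 is the conventional allowance for natural
    fluctuations in the background value.
    """
    return math.log2(concentration / (factor * background))

# Hypothetical values in mg/kg (NOT from the study).
igeo = geoaccumulation_index(concentration=48.0, background=1.0)
print(f"Igeo = {igeo:.2f}")
```

On Müller's scale, values at or below 0 are read as unpolluted and values above 5 as extremely polluted.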
Cluster analysis of radionuclide concentrations in beach sand

NARCIS (Netherlands)

de Meijer, R.J.; James, I.; Jennings, P.J.; Keoyers, J.E.

This paper presents a method in which natural radionuclide concentrations of beach sand minerals are traced along a stretch of coast by cluster analysis. This analysis yields two groups of mineral deposit with different origins. The method deviates from standard methods of following dispersal of
Numerical Analysis on Failure Modes and Mechanisms of Mine Pillars under Shear Loading

Directory of Open Access Journals (Sweden)

Tianhui Ma

2016-01-01

Full Text Available Severe damage occurs frequently in mine pillars subjected to shear stresses. The empirical design charts or formulas for mine pillars are not applicable to orebodies under shear. In this paper, the failure process of pillars under shear stresses was investigated by numerical simulations using the rock failure process analysis (RFPA 2D software. The numerical simulation results indicate that the strength of mine pillars and the corresponding failure mode vary with different width-to-height ratios and dip angles. With increasing dip angle, stress concentration first occurs at the intersection between the pillar and the roof, leading to formation of microcracks. Damage gradually develops from the surface to the core of the pillar. The damage process is tracked with acoustic emission monitoring. The study in this paper can provide an effective means for understanding the failure mechanism, planning, and design of mine pillars.
Exploration on feasibility of mining under pressure in depth at uranium mine No.711

International Nuclear Information System (INIS)

Xie Jun

1993-01-01

Analysis of mining practice at mine No. 711 showed that it is technically feasible to mine the deep part of the deposit, which is affected by abundant pressurized underground hot water, and that doing so yielded good economic benefits and environmental effects.
MINE-NEC - A Game for the Analysis of Regional Water Policies in Open-Pit Lignite Mining Areas: An Improved Implementation for the NEC PC-8201A

OpenAIRE

Kaden, S.; Varis, O.

1986-01-01

The game MINE was developed for the analysis of regional water policies in open-pit lignite mining areas. It is implemented for a GDR test area. The purpose of the game is above all to teach decision makers and their staff in mining regions, so that they gain a better understanding of the complex interrelated socio-economic processes with respect to water management in such regions. The game is designed to be played by five groups of players representing municipal and industrial water supply, a...
Clustering of GPS velocities in the Mojave Block, southeastern California

Science.gov (United States)

Savage, James C.; Simpson, Robert W.

2013-01-01

We find subdivisions within the Mojave Block using cluster analysis to identify groupings in the velocities observed at GPS stations there. The clusters are represented on a fault map by symbols located at the positions of the GPS stations, each symbol representing the cluster to which the velocity of that GPS station belongs. Fault systems that separate the clusters are readily identified on such a map. The most significant representation as judged by the gap test involves 4 clusters within the Mojave Block. The fault systems bounding the clusters from east to west are 1) the faults defining the eastern boundary of the Northeast Mojave Domain extended southward to connect to the Hector Mine rupture, 2) the Calico-Paradise fault system, 3) the Landers-Blackwater fault system, and 4) the Helendale-Lockhart fault system. This division of the Mojave Block is very similar to that proposed by Meade and Hager. However, no cluster boundary coincides with the Garlock Fault, the northern boundary of the Mojave Block. Rather, the clusters appear to continue without interruption from the Mojave Block north into the southern Walker Lane Belt, similar to the continuity across the Garlock Fault of the shear zone along the Blackwater-Little Lake fault system observed by Peltzer et al. Mapped traces of individual faults in the Mojave Block terminate within the block and do not continue across the Garlock Fault [Dokka and Travis, ].
Genome-scale cluster analysis of replicated microarrays using shrinkage correlation coefficient.

Science.gov (United States)

Yao, Jianchao; Chang, Chunqi; Salmi, Mari L; Hung, Yeung Sam; Loraine, Ann; Roux, Stanley J

2008-06-18

Currently, clustering with some form of correlation coefficient as the gene similarity metric has become a popular method for profiling genomic data. The Pearson correlation coefficient and the standard deviation (SD)-weighted correlation coefficient are the two most widely-used correlations as the similarity metrics in clustering microarray data. However, these two correlations are not optimal for analyzing replicated microarray data generated by most laboratories. An effective correlation coefficient is needed to provide statistically sufficient analysis of replicated microarray data. In this study, we describe a novel correlation coefficient, shrinkage correlation coefficient (SCC), that fully exploits the similarity between the replicated microarray experimental samples. The methodology considers both the number of replicates and the variance within each experimental group in clustering expression data, and provides a robust statistical estimation of the error of replicated microarray data. The value of SCC is revealed by its comparison with two other correlation coefficients that are currently the most widely-used (Pearson correlation coefficient and SD-weighted correlation coefficient) using statistical measures on both synthetic expression data as well as real gene expression data from Saccharomyces cerevisiae. Two leading clustering methods, hierarchical and k-means clustering were applied for the comparison. The comparison indicated that using SCC achieves better clustering performance. Applying SCC-based hierarchical clustering to the replicated microarray data obtained from germinating spores of the fern Ceratopteris richardii, we discovered two clusters of genes with shared expression patterns during spore germination. Functional analysis suggested that some of the genetic mechanisms that control germination in such diverse plant lineages as mosses and angiosperms are also conserved among ferns. This study shows that SCC is an alternative to the Pearson
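As a point of reference for the baseline mentioned in this abstract, the sketch below computes a plain Pearson correlation between two expression profiles and turns it into a clustering distance. It is not the shrinkage correlation coefficient (SCC) proposed by the authors, whose formula is not given here; the profiles are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def correlation_distance(x, y):
    """Distance in [0, 2]: 0 for perfectly correlated profiles."""
    return 1.0 - pearson(x, y)

# Hypothetical expression profiles over four conditions.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [2.1, 3.9, 6.2, 8.0]
gene_c = [4.0, 3.0, 2.0, 1.0]
print(correlation_distance(gene_a, gene_b), correlation_distance(gene_a, gene_c))
```

A distance of this kind can be fed to either hierarchical or k-means-style clustering, the two families the study compares.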
Application of microarray analysis on computer cluster and cloud platforms.

Science.gov (United States)

Bernau, C; Boulesteix, A-L; Knaus, J

2013-01-01

Analysis of recent high-dimensional biological data tends to be computationally intensive as many common approaches such as resampling or permutation tests require the basic statistical analysis to be repeated many times. A crucial advantage of these methods is that they can be easily parallelized due to the computational independence of the resampling or permutation iterations, which has induced many statistics departments to establish their own computer clusters. An alternative is to rent computing resources in the cloud, e.g. at Amazon Web Services. In this article we analyze whether a selection of statistical projects, recently implemented at our department, can be efficiently realized on these cloud resources. Moreover, we illustrate an opportunity to combine computer cluster and cloud resources. In order to compare the efficiency of computer cluster and cloud implementations and their respective parallelizations we use microarray analysis procedures and compare their runtimes on the different platforms. Amazon Web Services provide various instance types which meet the particular needs of the different statistical projects we analyzed in this paper. Moreover, the network capacity is sufficient and the parallelization is comparable in efficiency to standard computer cluster implementations. Our results suggest that many statistical projects can be efficiently realized on cloud resources. It is important to mention, however, that workflows can change substantially as a result of a shift from computer cluster to cloud computing.
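The independence of permutation iterations that makes these analyses easy to parallelize can be illustrated with a small sketch. It assumes a two-sample permutation test on a difference in means and uses Python's standard `concurrent.futures` thread pool purely to show the structure; all names and data are hypothetical, and a real cluster or cloud run would replace the pool with processes or remote workers.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def mean_diff(a, b):
    return sum(a) / len(a) - sum(b) / len(b)

def one_permutation(args):
    """One independent iteration: shuffle the pooled values with its own seed."""
    pooled, n_a, seed = args
    rng = random.Random(seed)
    shuffled = pooled[:]
    rng.shuffle(shuffled)
    return mean_diff(shuffled[:n_a], shuffled[n_a:])

def permutation_p_value(a, b, n_perm=1000, workers=4):
    observed = abs(mean_diff(a, b))
    # Each task carries its own seed, so results do not depend on scheduling.
    tasks = [(a + b, len(a), seed) for seed in range(n_perm)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        stats = list(pool.map(one_permutation, tasks))
    extreme = sum(1 for s in stats if abs(s) >= observed)
    return (extreme + 1) / (n_perm + 1)

# Hypothetical measurements for two groups.
group_a = [5.1, 4.9, 5.6, 5.8, 6.0]
group_b = [4.2, 4.0, 4.5, 4.1, 4.4]
p_value = permutation_p_value(group_a, group_b)
print(p_value)
```

Because every iteration is seeded separately, the same p-value is obtained with one worker or many, which is the property that lets such jobs move between a local cluster and rented cloud instances.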
GLOBULAR CLUSTER ABUNDANCES FROM HIGH-RESOLUTION, INTEGRATED-LIGHT SPECTROSCOPY. II. EXPANDING THE METALLICITY RANGE FOR OLD CLUSTERS AND UPDATED ANALYSIS TECHNIQUES

Energy Technology Data Exchange (ETDEWEB)

Colucci, Janet E.; Bernstein, Rebecca A.; McWilliam, Andrew [The Observatories of the Carnegie Institution for Science, 813 Santa Barbara St., Pasadena, CA 91101 (United States)

2017-01-10

We present abundances of globular clusters (GCs) in the Milky Way and Fornax from integrated-light (IL) spectra. Our goal is to evaluate the consistency of the IL analysis relative to standard abundance analysis for individual stars in those same clusters. This sample includes an updated analysis of seven clusters from our previous publications and results for five new clusters that expand the metallicity range over which our technique has been tested. We find that the [Fe/H] measured from IL spectra agrees to ∼0.1 dex for GCs with metallicities as high as [Fe/H] = −0.3, but the abundances measured for more metal-rich clusters may be underestimated. In addition we systematically evaluate the accuracy of abundance ratios, [X/Fe], for Na i, Mg i, Al i, Si i, Ca i, Ti i, Ti ii, Sc ii, V i, Cr i, Mn i, Co i, Ni i, Cu i, Y ii, Zr i, Ba ii, La ii, Nd ii, and Eu ii. The elements for which the IL analysis gives results that are most similar to analysis of individual stellar spectra are Fe i, Ca i, Si i, Ni i, and Ba ii. The elements that show the greatest differences include Mg i and Zr i. Some elements show good agreement only over a limited range in metallicity. More stellar abundance data in these clusters would enable more complete evaluation of the IL results for other important elements.
A Novel Divisive Hierarchical Clustering Algorithm for Geospatial Analysis

Directory of Open Access Journals (Sweden)

Shaoning Li

2017-01-01

Full Text Available In the fields of geographic information systems (GIS) and remote sensing (RS), clustering algorithms have been widely used for image segmentation, pattern recognition, and cartographic generalization. Although clustering analysis plays a key role in geospatial modelling, traditional clustering methods are limited by their computational complexity, noise resistance, and robustness. Furthermore, traditional methods focus mainly on the adjacent spatial context, which makes them hard to apply to multi-density discrete objects. In this paper, a new method, cell-dividing hierarchical clustering (CDHC), is proposed based on convex hull retraction. The main steps are as follows. First, a convex hull structure is constructed to describe the global spatial context of geospatial objects. Then, the retracting structure of each borderline is established in sequence by setting the initial parameter. The objects are split into two clusters (i.e., “sub-clusters”) if the retracting structure intersects with the borderlines. Finally, clusters are repeatedly split and the initial parameter is updated until the termination condition is satisfied. The experimental results show that CDHC separates the multi-density objects from noise sufficiently and also reduces complexity compared to the traditional agglomerative hierarchical clustering algorithm.
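The first step listed above, building a convex hull around the objects, can be sketched with Andrew's monotone chain algorithm. This covers only the hull construction; the retraction and splitting steps of CDHC are not specified in enough detail in the abstract to reproduce, and the point set is hypothetical.

```python
def cross(o, a, b):
    """Z-component of the cross product (a - o) x (b - o)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(points):
    """Convex hull in counter-clockwise order (Andrew's monotone chain)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower = []
    for p in pts:
        # Drop points that would make a clockwise or collinear turn.
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other.
    return lower[:-1] + upper[:-1]

# Hypothetical object locations: a square with one interior and one edge point.
objects = [(0, 0), (2, 0), (2, 2), (0, 2), (1, 1), (1, 0)]
hull = convex_hull(objects)
print(hull)
```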
Arabic web pages clustering and annotation using semantic class features

Directory of Open Access Journals (Sweden)

Hanan M. Alghamdi

2014-12-01

Full Text Available Effectively managing the great amount of data on Arabic web pages and enabling the classification of relevant information are very important research problems. Studies on sentiment text mining have been very limited in the Arabic language because they need to involve deep semantic processing. Therefore, in this paper, we aim to retrieve machine-understandable data with the help of a Web content mining technique to detect covert knowledge within these data. We propose an approach to achieve clustering with semantic similarities. This approach integrates k-means document clustering with semantic feature extraction and document vectorization to group Arabic web pages according to semantic similarities and then show the semantic annotation. The document vectorization helps to transform text documents into a semantic class probability distribution or semantic class density. To reach semantic similarities, the approach extracts the semantic class features and integrates them into the similarity weighting schema. The quality of the clustering result was evaluated using the purity and the mean intra-cluster distance (MICD) evaluation measures. We have evaluated the proposed approach on a set of common Arabic news web pages. We obtained favorable clustering results that are effective in minimizing the MICD, increasing the purity, and lowering the runtime.
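Of the two evaluation measures named above, purity is easy to illustrate: each cluster is credited with its most frequent true class, and the credited counts are divided by the number of documents. The sketch below uses the standard definition with hypothetical cluster assignments and class labels; MICD is left out because the abstract does not define it precisely.

```python
from collections import Counter

def purity(cluster_ids, true_labels):
    """Fraction of documents that belong to the majority class of their cluster."""
    by_cluster = {}
    for cid, label in zip(cluster_ids, true_labels):
        by_cluster.setdefault(cid, []).append(label)
    majority_total = sum(
        Counter(labels).most_common(1)[0][1] for labels in by_cluster.values()
    )
    return majority_total / len(true_labels)

# Hypothetical result for six news pages grouped into two clusters.
clusters = [0, 0, 0, 1, 1, 1]
classes = ["sport", "sport", "politics", "politics", "politics", "sport"]
score = purity(clusters, classes)
print(f"purity = {score:.3f}")
```

Purity rises toward 1 as clusters become homogeneous, but it also rises trivially with the number of clusters, which is one reason to pair it with a distance-based measure.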
Spatial cluster detection using dynamic programming

Directory of Open Access Journals (Sweden)

Sverchkov Yuriy

2012-03-01

Full Text Available Abstract. Background: The task of spatial cluster detection involves finding spatial regions where some property deviates from the norm or the expected value. In a probabilistic setting this task can be expressed as finding a region where some event is significantly more likely than usual. Spatial cluster detection is of interest in fields such as biosurveillance, mining of astronomical data, military surveillance, and analysis of fMRI images. In almost all such applications we are interested both in whether a cluster exists in the data and, if it does, in finding the most accurate characterization of it. Methods: We present a general dynamic programming algorithm for grid-based spatial cluster detection. The algorithm can be used for both Bayesian maximum a-posteriori (MAP) estimation of the most likely spatial distribution of clusters and Bayesian model averaging over a large space of spatial cluster distributions to compute the posterior probability of an unusual spatial clustering. The algorithm is explained and evaluated in the context of a biosurveillance application, specifically the detection and identification of influenza outbreaks based on emergency department visits. A relatively simple underlying model is constructed for the purpose of evaluating the algorithm, and the algorithm is evaluated using the model and semi-synthetic test data. Results: When compared to baseline methods, tests indicate that the new algorithm can improve MAP estimates under certain conditions: the greedy algorithm we compared our method to was found to be more sensitive to smaller outbreaks, while as the size of the outbreaks increases, in terms of area affected and proportion of individuals affected, our method overtakes the greedy algorithm in spatial precision and recall. The new algorithm performs on par with baseline methods in the task of Bayesian model averaging. Conclusions: We conclude that the dynamic
Clustering of users of digital libraries through log file analysis

Directory of Open Access Journals (Sweden)

Juan Antonio Martínez-Comeche

2017-09-01

Full Text Available This study analyzes how users perform information retrieval tasks when introducing queries to the Hispanic Digital Library. Clusters of users are differentiated based on their distinct information behavior. The study used the log files collected by the server over a year and different possible clustering algorithms are compared. The k-means algorithm is found to be a suitable clustering method for the analysis of large log files from digital libraries. In the case of the Hispanic Digital Library the results show three clusters of users and the characteristic information behavior of each group is described.
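To make the k-means step concrete, here is a minimal sketch of Lloyd's algorithm in plain Python. The two per-user features (queries per session and session length in minutes), the data, and the choice of initial centroids are all hypothetical; the abstract does not list the variables extracted from the log files.

```python
import math

def kmeans(points, centroids, iterations=20):
    """Lloyd's algorithm with caller-supplied initial centroids."""
    centroids = [tuple(c) for c in centroids]
    assignment = []
    for _ in range(iterations):
        # Assignment step: nearest centroid by Euclidean distance.
        assignment = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:
                new_centroids.append(
                    tuple(sum(v) / len(members) for v in zip(*members))
                )
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return assignment, centroids

# Hypothetical users: (queries per session, session length in minutes).
users = [(1, 2), (2, 1), (1, 1), (10, 12), (11, 10), (12, 11),
         (30, 3), (31, 2), (29, 4)]
assignment, centroids = kmeans(users, [users[0], users[3], users[6]])
print(assignment)
print(centroids)
```

With real log data the features would normally be standardized first, since k-means is sensitive to the scale of each variable.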
SIMPL: A Simplified Model-Based Program for the Analysis and Visualization of Groundwater Rebound in Abandoned Mines to Prevent Contamination of Water and Soils by Acid Mine Drainage

Directory of Open Access Journals (Sweden)

Sung-Min Kim

2018-05-01

Full Text Available Cessation of dewatering following underground mine closure typically results in groundwater rebound, because mine voids and surrounding strata undergo flooding up to the levels of the decant points, such as shafts and drifts. SIMPL (Simplified groundwater program In Mine workings using the Pipe equation and Lumped parameter model, a simplified lumped parameter model-based program for predicting groundwater levels in abandoned mines, is presented herein. The program comprises a simulation engine module, 3D visualization module, and graphical user interface, which aids data processing, analysis, and visualization of results. The 3D viewer facilitates effective visualization of the predicted groundwater level rebound phenomenon together with a topographic map, mine drift, goaf, and geological properties from borehole data. SIMPL is applied to data from the Dongwon coal mine and Dalsung copper mine in Korea, with strong similarities in simulated and observed results. By considering mine workings and interpond connections, SIMPL can thus be used to effectively analyze and visualize groundwater rebound. In addition, the predictions by SIMPL can be utilized to prevent the surrounding environment (water and soil from being polluted by acid mine drainage.
Clustering big data streams : recent challenges and contributions

NARCIS (Netherlands)

Hassani, M.; Seidl, T.

Traditional clustering algorithms merely considered static data. Today's various applications and research issues in big data mining have however to deal with continuous, possibly infinite streams of data, arriving at high velocity. Web traffic data, surveillance data, sensor measurements and stock
Application of EREP imagery to fracture-related mine safety hazards and environmental problems in mining

Science.gov (United States)

Wier, C. E.; Wobber, F. J.; Amato, R. V.; Russell, O. R. (Principal Investigator)

1973-01-01

The author has identified the following significant results. Numerous fracture traces were detected on both the color transparencies and black and white spectral bands. Fracture traces of value to mining hazards analysis were noted on the EREP imagery which could not be detected on either the ERTS-1 or high altitude aircraft color infrared photography. Several areas of mine subsidence occurring in the Busseron Creek area near Sullivan, Indiana were successfully identified using color photography. Skylab photography affords an increase over comparable scale ERTS-1 imagery in level of information obtained in mined lands inventory and reclamation analysis. A review of EREP color photography permitted the identification of a substantial number of non-fuel mines within the Southern Indiana test area. A new mine was detected on the EREP photography without prior data. EREP has definite value for estimating areal changes in active mines and for detecting new non-fuel mines. Gob piles and slurry ponds of several acres could be detected on the S-190B color photography when observed in association with large scale mining operations. Apparent degradation of water quality resulting from acid mine drainage and/or siltation was noted in several ponds or small lakes and appear to be related to intensive mining activity near Sullivan, Indiana.
Feasibility Study of Parallel Finite Element Analysis on Cluster-of-Clusters

Science.gov (United States)

Muraoka, Masae; Okuda, Hiroshi

With the rapid growth of WAN infrastructure and the development of Grid middleware, connecting cluster machines over a wide-area network has become a realistic and attractive way to execute computation-demanding applications. Many existing parallel finite element (FE) applications have, however, been designed and developed with a single computing resource in mind, since such applications require frequent synchronization and communication among processes. So far, few FE applications can exploit the distributed environment. In this study, we explore the feasibility of FE applications on the cluster-of-clusters. First, we classify FE applications into two types, tightly coupled applications (TCA) and loosely coupled applications (LCA), based on their communication pattern. A prototype of each application is implemented on the cluster-of-clusters. We perform numerical experiments executing TCA and LCA on both the cluster-of-clusters and a single cluster. Through these experiments, by comparing the performance and communication cost in each case, we evaluate the feasibility of FEA on the cluster-of-clusters.

An extended k-means technique for clustering moving objects

Directory of Open Access Journals (Sweden)

Omnia Ossama

2011-03-01

Full Text Available The k-means algorithm is one of the basic clustering techniques used in many data mining applications. In this paper we present a novel pattern-based clustering algorithm that extends the k-means algorithm for clustering moving object trajectory data. The proposed algorithm uses a key feature of moving object trajectories, namely their direction, as a heuristic to determine the number of clusters for the k-means algorithm. In addition, we use the silhouette coefficient as a measure of the quality of our proposed approach. Finally, we present experimental results on both real and synthetic data that show the performance and accuracy of our proposed technique.
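The silhouette coefficient used here as a quality measure can be sketched as follows. This is the standard definition with Euclidean distance applied to hypothetical two-dimensional points; how the authors represent trajectories as vectors is not stated in the abstract.

```python
import math

def silhouette_score(points, labels):
    """Mean silhouette coefficient (Euclidean); needs at least two clusters."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        # (the point's distance to itself contributes 0 to the sum).
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(p, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Hypothetical points forming two well-separated groups.
pts = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = silhouette_score(pts, [0, 0, 1, 1])
bad = silhouette_score(pts, [0, 1, 0, 1])
print(good, bad)
```

Values near 1 indicate compact, well-separated clusters, so comparing the score across candidate cluster counts is a common way to check a heuristic choice of k.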
Full text clustering and relationship network analysis of biomedical publications.

Directory of Open Access Journals (Sweden)

Renchu Guan

Full Text Available Rapid developments in the biomedical sciences have increased the demand for automatic clustering of biomedical publications. In contrast to current approaches to text clustering, which focus exclusively on the contents of abstracts, a novel method is proposed for clustering and analysis of complete biomedical article texts. To reduce dimensionality, Cosine Coefficient is used on a sub-space of only two vectors, instead of computing the Euclidean distance within the space of all vectors. Then a strategy and algorithm is introduced for Semi-supervised Affinity Propagation (SSAP) to improve analysis efficiency, using biomedical journal names as an evaluation background. Experimental results show that by avoiding high-dimensional sparse matrix computations, SSAP outperforms conventional k-means methods and improves upon the standard Affinity Propagation algorithm. In constructing a directed relationship network and distribution matrix for the clustering results, it can be noted that overlaps in scope and interests among BioMed publications can be easily identified, providing a valuable analytical tool for editors, authors and readers.
Full text clustering and relationship network analysis of biomedical publications.

Science.gov (United States)

Guan, Renchu; Yang, Chen; Marchese, Maurizio; Liang, Yanchun; Shi, Xiaohu

2014-01-01

Rapid developments in the biomedical sciences have increased the demand for automatic clustering of biomedical publications. In contrast to current approaches to text clustering, which focus exclusively on the contents of abstracts, a novel method is proposed for clustering and analysis of complete biomedical article texts. To reduce dimensionality, Cosine Coefficient is used on a sub-space of only two vectors, instead of computing the Euclidean distance within the space of all vectors. Then a strategy and algorithm is introduced for Semi-supervised Affinity Propagation (SSAP) to improve analysis efficiency, using biomedical journal names as an evaluation background. Experimental results show that by avoiding high-dimensional sparse matrix computations, SSAP outperforms conventional k-means methods and improves upon the standard Affinity Propagation algorithm. In constructing a directed relationship network and distribution matrix for the clustering results, it can be noted that overlaps in scope and interests among BioMed publications can be easily identified, providing a valuable analytical tool for editors, authors and readers.
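The pairwise Cosine Coefficient mentioned in this abstract can be illustrated with a short sketch. It assumes simple whitespace tokenization and raw term frequencies; the tokenization, weighting, and example sentences are hypothetical and are not the authors' pipeline. Only the terms present in the two documents are touched, which is the sense in which the computation stays in a sub-space of two vectors.

```python
import math
from collections import Counter

def cosine_coefficient(text_a, text_b):
    """Cosine similarity of two term-frequency vectors built from whitespace tokens."""
    tf_a = Counter(text_a.lower().split())
    tf_b = Counter(text_b.lower().split())
    # Only terms shared by both documents contribute to the dot product.
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(v * v for v in tf_a.values()))
    norm_b = math.sqrt(sum(v * v for v in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical snippets standing in for full article texts.
doc_1 = "gene expression clustering in yeast"
doc_2 = "clustering of gene expression profiles"
doc_3 = "mine pillar failure under shear loading"
print(cosine_coefficient(doc_1, doc_2), cosine_coefficient(doc_1, doc_3))
```

A matrix of such similarities is the kind of input that Affinity Propagation and its semi-supervised variants operate on.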
Steady state subchannel analysis of AHWR fuel cluster

International Nuclear Information System (INIS)

Dasgupta, A.; Chandraker, D.K.; Vijayan, P.K.; Saha, D.

2006-09-01

Subchannel analysis is a technique used to predict the thermal hydraulic behavior of reactor fuel assemblies. The rod cluster is subdivided into a number of parallel interacting flow subchannels. The conservation equations are solved for each of these subchannels, taking into account subchannel interactions. Subchannel analysis of AHWR D-5 fuel cluster has been carried out to determine the variations in thermal hydraulic conditions of coolant and fuel temperatures along the length of the fuel bundle. The hottest regions within the AHWR fuel bundle have been identified. The effect of creep on the fuel performance has also been studied. MCHFR has been calculated using Jansen-Levy correlation. The calculations have been backed by sensitivity analysis for parameters whose values are not known accurately. The sensitivity analysis showed the calculations to have a very low sensitivity to these parameters. Apart from the analysis, the report also includes a brief introduction of a few subchannel codes. A brief description of the equations and solution methodology used in COBRA-IIIC and COBRA-IV-I is also given. (author)
Oil sands mine planning and waste management using goal programming

Energy Technology Data Exchange (ETDEWEB)

Ben-Awuah, E.; Askari-Nasab, H. [Alberta Univ., Edmonton, AB (Canada). Dept. of Civil and Environmental Engineering; Alberta Univ., Edmonton, AB (Canada). Mining Optimization Laboratory

2010-07-01

A goal programming method was used to plan waste management processes at an oil sands mine. This method requires the decision maker (DM) to set goals. Mine planning is used to determine a block extraction schedule that maximizes net present value (NPV). Due to land restrictions, tailings facilities are sited within the pit area and dykes are used to contain the tailings. Many of the materials used to construct the dykes come from the mining operation. The mine plan scheduled ore and dyke material concurrently, and dykes were constructed as the mine phase advanced. A model was used to classify an oil sands block model into different material types. A mixed integer goal programming (MIGP) method was used to generate a strategic schedule. Block clustering techniques were applied to large-scale mine planning projects. The method was verified and validated with synthetic and real case data related to the cost of mining all material as waste, and the extra cost of mining dyke material. A case study of an oil sands project was used to demonstrate the method. The study showed that the developed model generates a smooth and uniform strategic schedule for large-scale mine planning projects. tabs., figs.
Oil sands mine planning and waste management using goal programming

International Nuclear Information System (INIS)

Ben-Awuah, E.; Askari-Nasab, H.; Alberta Univ., Edmonton, AB

2010-01-01

A goal programming method was used to plan waste management processes at an oil sands mine. This method requires the decision maker (DM) to set goals. Mine planning is used to determine a block extraction schedule that maximizes net present value (NPV). Due to land restrictions, tailings facilities are sited within the pit area and dykes are used to contain the tailings. Many of the materials used to construct the dykes come from the mining operation. The mine plan scheduled ore and dyke material concurrently, and dykes were constructed as the mine phase advanced. A model was used to classify an oil sands block model into different material types. A mixed integer goal programming (MIGP) method was used to generate a strategic schedule. Block clustering techniques were applied to large-scale mine planning projects. The method was verified and validated with synthetic and real case data related to the cost of mining all material as waste, and the extra cost of mining dyke material. A case study of an oil sands project was used to demonstrate the method. The study showed that the developed model generates a smooth and uniform strategic schedule for large-scale mine planning projects. tabs., figs.
Mobility in Europe: Recent Trends from a Cluster Analysis

Directory of Open Access Journals (Sweden)

Ioana Manafi

2017-08-01

Full Text Available During the past decade, Europe was confronted with major changes and events offering large opportunities for mobility. The EU enlargement process, the EU policies regarding youth, the economic crisis affecting national economies on different levels, political instabilities in some European countries, high rates of unemployment or the increasing number of refugees are only a few of the factors influencing net migration in Europe. Based on a set of socio-economic indicators for EU/EFTA countries and cluster analysis, the paper provides an overview of regional differences across European countries, related to migration magnitude in the identified clusters. The obtained clusters are in accordance with previous studies in migration, and appear stable during the period of 2005-2013, with only some exceptions. The analysis revealed three country clusters: EU/EFTA center-receiving countries, EU/EFTA periphery-sending countries and EU/EFTA outlier countries, the names suggesting not only the geographical position within Europe, but the trends in net migration flows during the years. Therewith, the results provide evidence for the persistence of a movement from periphery to center countries, which is correlated with recent flows of mobility in Europe.
Cluster analysis for portfolio optimization

OpenAIRE

Vincenzo Tola; Fabrizio Lillo; Mauro Gallegati; Rosario N. Mantegna

2005-01-01

We consider the problem of the statistical uncertainty of the correlation matrix in the optimization of a financial portfolio. We show that the use of clustering algorithms can improve the reliability of the portfolio in terms of the ratio between predicted and realized risk. Bootstrap analysis indicates that this improvement is obtained in a wide range of the parameters N (number of assets) and T (investment horizon). The predicted and realized risk level and the relative portfolio compositi...
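The ratio between predicted and realized risk referred to above can be made concrete with a small sketch. It assumes risk is measured as the portfolio standard deviation sqrt(w^T C w), with one covariance matrix estimated before investing and another observed afterwards; the weights and both matrices are hypothetical, and the clustering-based filtering of the correlation matrix studied by the authors is not reproduced.

```python
import math

def portfolio_risk(weights, cov):
    """Portfolio standard deviation sqrt(w^T C w) for a covariance matrix C."""
    n = len(weights)
    variance = sum(
        weights[i] * cov[i][j] * weights[j] for i in range(n) for j in range(n)
    )
    return math.sqrt(variance)

# Hypothetical two-asset example (NOT from the paper).
weights = [0.6, 0.4]
estimated_cov = [[0.04, 0.01], [0.01, 0.09]]  # from the estimation window
realized_cov = [[0.05, 0.02], [0.02, 0.10]]   # from the investment window

predicted = portfolio_risk(weights, estimated_cov)
realized = portfolio_risk(weights, realized_cov)
print(f"predicted/realized = {predicted / realized:.3f}")
```

A ratio well below 1 means the optimizer was overconfident; a more reliable covariance estimate should bring the ratio closer to 1.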
A Trajectory Regression Clustering Technique Combining a Novel Fuzzy C-Means Clustering Algorithm with the Least Squares Method

Directory of Open Access Journals (Sweden)

Xiangbing Zhou

2018-04-01

Full Text Available Rapidly growing GPS (Global Positioning System) trajectories hide much valuable information, such as city road planning, urban travel demand, and population migration. In order to mine the hidden information and to capture better clustering results, a trajectory regression clustering method (an unsupervised trajectory clustering method) is proposed to reduce local information loss of the trajectory and to avoid getting stuck in the local optimum. Using this method, we first define our new concept of trajectory clustering and construct a novel partitioning (angle-based partitioning) method of line segments; second, the Lagrange-based method and Hausdorff-based K-means++ are integrated in fuzzy C-means (FCM) clustering, which are used to maintain the stability and the robustness of the clustering process; finally, a least squares regression model is employed to achieve regression clustering of the trajectory. In our experiment, the performance and effectiveness of our method are validated against real-world taxi GPS data. When comparing our clustering algorithm with the partition-based clustering algorithms (K-means, K-median, and FCM), our experimental results demonstrate that the presented method is more effective and generates a more reasonable trajectory.
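For readers unfamiliar with fuzzy C-means, the following is a minimal sketch of the standard FCM membership update only. It is not the Lagrange-based or Hausdorff-based variant described in the abstract; the function name, the fuzzifier value m = 2 and the sample coordinates are assumptions made for illustration.

```python
import math

def fcm_memberships(points, centers, m=2.0):
    """Standard FCM membership update: u_ij is proportional to d_ij^(-2/(m-1)).

    Illustrative only; the fuzzifier m and the Euclidean distance are assumptions.
    """
    memberships = []
    for p in points:
        dists = [math.dist(p, c) for c in centers]
        if 0.0 in dists:
            # A point sitting exactly on a center belongs fully to that center.
            row = [1.0 if d == 0.0 else 0.0 for d in dists]
            total = sum(row)
        else:
            row = [d ** (-2.0 / (m - 1.0)) for d in dists]
            total = sum(row)
        memberships.append([v / total for v in row])
    return memberships

# Hypothetical data: one point, two cluster centers.
memberships = fcm_memberships([(1.0, 0.0)], [(0.0, 0.0), (4.0, 0.0)])
```

Each row of the result sums to 1, so a point close to one center receives a membership near 1 for it while still keeping a small membership in the other cluster.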
Hierarchical cluster analysis of progression patterns in open-angle glaucoma patients with medical treatment.

Science.gov (United States)

Bae, Hyoung Won; Rho, Seungsoo; Lee, Hye Sun; Lee, Naeun; Hong, Samin; Seong, Gong Je; Sung, Kyung Rim; Kim, Chan Yun

2014-04-29

To classify medically treated open-angle glaucoma (OAG) by the pattern of progression using hierarchical cluster analysis, and to determine OAG progression characteristics by comparing clusters. Ninety-five eyes of 95 OAG patients who received medical treatment, and who had undergone visual field (VF) testing at least once per year for 5 or more years. OAG was classified into subgroups using hierarchical cluster analysis based on the following five variables: baseline mean deviation (MD), baseline visual field index (VFI), MD slope, VFI slope, and Glaucoma Progression Analysis (GPA) printout. After that, other parameters were compared between clusters. Two clusters were made after a hierarchical cluster analysis. Cluster 1 showed -4.06 ± 2.43 dB baseline MD, 92.58% ± 6.27% baseline VFI, -0.28 ± 0.38 dB per year MD slope, -0.52% ± 0.81% per year VFI slope, and all "no progression" cases in GPA printout, whereas cluster 2 showed -8.68 ± 3.81 baseline MD, 77.54 ± 12.98 baseline VFI, -0.72 ± 0.55 MD slope, -2.22 ± 1.89 VFI slope, and seven "possible" and four "likely" progression cases in GPA printout. There were no significant differences in age, sex, mean IOP, central corneal thickness, and axial length between clusters. However, cluster 2 included more high-tension glaucoma patients and used a greater number of antiglaucoma eye drops significantly compared with cluster 1. Hierarchical cluster analysis of progression patterns divided OAG into slow and fast progression groups, evidenced by assessing the parameters of glaucomatous progression in VF testing. In the fast progression group, the prevalence of high-tension glaucoma was greater and the number of antiglaucoma medications administered was increased versus the slow progression group. Copyright 2014 The Association for Research in Vision and Ophthalmology, Inc.
Cluster analysis of spontaneous preterm birth phenotypes identifies potential associations among preterm birth mechanisms.

Science.gov (United States)

Esplin, M Sean; Manuck, Tracy A; Varner, Michael W; Christensen, Bryce; Biggio, Joseph; Bukowski, Radek; Parry, Samuel; Zhang, Heping; Huang, Hao; Andrews, William; Saade, George; Sadovsky, Yoel; Reddy, Uma M; Ilekis, John

2015-09-01

We sought to use an innovative tool that is based on common biologic pathways to identify specific phenotypes among women with spontaneous preterm birth (SPTB) to enhance investigators' ability to identify and to highlight common mechanisms and underlying genetic factors that are responsible for SPTB. We performed a secondary analysis of a prospective case-control multicenter study of SPTB. All cases delivered a preterm singleton at SPTB ≤34.0 weeks' gestation. Each woman was assessed for the presence of underlying SPTB causes. A hierarchic cluster analysis was used to identify groups of women with homogeneous phenotypic profiles. One of the phenotypic clusters was selected for candidate gene association analysis with the use of VEGAS software. One thousand twenty-eight women with SPTB were assigned phenotypes. Hierarchic clustering of the phenotypes revealed 5 major clusters. Cluster 1 (n = 445) was characterized by maternal stress; cluster 2 (n = 294) was characterized by premature membrane rupture; cluster 3 (n = 120) was characterized by familial factors, and cluster 4 (n = 63) was characterized by maternal comorbidities. Cluster 5 (n = 106) was multifactorial and characterized by infection (INF), decidual hemorrhage (DH), and placental dysfunction (PD). These 3 phenotypes were correlated highly by χ² analysis (PD and DH, P cluster 3 of SPTB. We identified 5 major clusters of SPTB based on a phenotype tool and hierarchic clustering. There was significant correlation between several of the phenotypes. The INS gene was associated with familial factors that were underlying SPTB. Copyright © 2015 Elsevier Inc. All rights reserved.
Traversability analysis for a mine safety inspection robot

CSIR Research Space (South Africa)

Senekal, F

2013-09-01

Full Text Available A new fast algorithm for traversability analysis of an arbitrary three-dimensional point cloud is presented. The algorithm segments a three-dimensional point cloud into vertical sections; each of which is clustered into bins and further analysed...
Security Measures in Data Mining

OpenAIRE

Anish Gupta; Vimal Bibhu; Rashid Hussain

2012-01-01

Data mining is a technique for extracting data from large databases for analysis and executive decision making. Security is one of the major requirements for data mining applications. In this paper we present security requirement measures for data mining. We summarize the security requirements for data mining in tabular format. The summarization is organized by requirement across the different aspects of data mining security. The performances and outcomes are determin...
A critical cluster analysis of 44 indicators of author-level performance

DEFF Research Database (Denmark)

Wildgaard, Lorna Elizabeth

2016-01-01

This paper explores a 7-stage cluster methodology as a process to identify appropriate indicators for evaluation of individual researchers at a disciplinary and seniority level. Publication and citation data for 741 researchers from 4 disciplines was collected in Web of Science. Forty-four indicators of individual researcher performance were computed using the data. The clustering solution was supported by continued reference to the researcher’s curriculum vitae, an effect analysis and a risk analysis. Disciplinary appropriate indicators were identified and used to divide the researchers... of statistics in research evaluation. The strength of the 7-stage cluster methodology is that it makes clear that in the evaluation of individual researchers, statistics cannot stand alone. The methodology is reliant on contextual information to verify the bibliometric values and cluster solution...
Tweets clustering using latent semantic analysis

Science.gov (United States)

Rasidi, Norsuhaili Mahamed; Bakar, Sakhinah Abu; Razak, Fatimah Abdul

2017-04-01

Social media are becoming overloaded with information due to the increasing number of information feeds. Unlike other social media, Twitter allows users to broadcast a short message called a "tweet". In this study, we extract tweets related to MH370 over a certain period of time. In this paper, we present an overview of our approach to tweet clustering for analyzing users' responses to the MH370 tragedy. The tweets were clustered based on the frequency of terms obtained from the classification process. The method we used for the text classification is Latent Semantic Analysis. As a result, we find two types of tweets responding to the MH370 tragedy: emotional and non-emotional. We show some of our initial results to demonstrate the effectiveness of our approach.
Symptom Cluster Research With Biomarkers and Genetics Using Latent Class Analysis.

Science.gov (United States)

Conley, Samantha

2017-12-01

The purpose of this article is to provide an overview of latent class analysis (LCA) and examples from symptom cluster research that includes biomarkers and genetics. A review of LCA with genetics and biomarkers was conducted using Medline, Embase, PubMed, and Google Scholar. LCA is a robust latent variable model used to cluster categorical data and allows for the determination of empirically determined symptom clusters. Researchers should consider using LCA to link empirically determined symptom clusters to biomarkers and genetics to better understand the underlying etiology of symptom clusters. The full potential of LCA in symptom cluster research has not yet been realized because it has been used in limited populations, and researchers have explored limited biologic pathways.
The composite sequential clustering technique for analysis of multispectral scanner data

Science.gov (United States)

Su, M. Y.

1972-01-01

The clustering technique consists of two parts: (1) a sequential statistical clustering which is essentially a sequential variance analysis, and (2) a generalized K-means clustering. In this composite clustering technique, the output of (1) is a set of initial clusters which are input to (2) for further improvement by an iterative scheme. This unsupervised composite technique was employed for automatic classification of two sets of remote multispectral earth resource observations. The classification accuracy by the unsupervised technique is found to be comparable to that by traditional supervised maximum likelihood classification techniques. The mathematical algorithms for the composite sequential clustering program and a detailed computer program description with job setup are given.
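The two-stage idea in this abstract (initial clusters from one procedure, then iterative improvement by K-means) can be sketched as follows. This is only an illustration using standard K-means (Lloyd) iterations, not the generalized K-means of the report; the function name, the sample points and the initial centroids standing in for the output of stage (1) are all assumptions.

```python
import math

def kmeans_refine(points, centroids, max_iter=100):
    """Refine given initial centroids with standard K-means iterations."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return centroids, labels

# Hypothetical two-band observations forming two obvious groups.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
initial = [(0.0, 0.0), (10.0, 10.0)]  # stand-in for the output of stage (1)
centroids, labels = kmeans_refine(points, initial)
```

Starting from reasonable initial clusters matters because K-means only converges to a local optimum, which is presumably why the report feeds it the output of a preliminary sequential pass.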
Cluster-based analysis of multi-model climate ensembles

Science.gov (United States)

Hyde, Richard; Hossaini, Ryan; Leeson, Amber A.

2018-06-01

Clustering - the automated grouping of similar data - can provide powerful and unique insight into large and complex data sets, in a fast and computationally efficient manner. While clustering has been used in a variety of fields (from medical image processing to economics), its application within atmospheric science has been fairly limited to date, and the potential benefits of the application of advanced clustering techniques to climate data (both model output and observations) has yet to be fully realised. In this paper, we explore the specific application of clustering to a multi-model climate ensemble. We hypothesise that clustering techniques can provide (a) a flexible, data-driven method of testing model-observation agreement and (b) a mechanism with which to identify model development priorities. We focus our analysis on chemistry-climate model (CCM) output of tropospheric ozone - an important greenhouse gas - from the recent Atmospheric Chemistry and Climate Model Intercomparison Project (ACCMIP). Tropospheric column ozone from the ACCMIP ensemble was clustered using the Data Density based Clustering (DDC) algorithm. We find that a multi-model mean (MMM) calculated using members of the most-populous cluster identified at each location offers a reduction of up to ~20% in the global absolute mean bias between the MMM and an observed satellite-based tropospheric ozone climatology, with respect to a simple, all-model MMM. On a spatial basis, the bias is reduced at ~62% of all locations, with the largest bias reductions occurring in the Northern Hemisphere - where ozone concentrations are relatively large. However, the bias is unchanged at 9% of all locations and increases at 29%, particularly in the Southern Hemisphere. The latter demonstrates that although cluster-based subsampling acts to remove outlier model data, such data may in fact be closer to observed values in some locations. We further demonstrate that clustering can provide a viable and
Mining gene expression data by interpreting principal components

Directory of Open Access Journals (Sweden)

Mortazavi Ali

2006-04-01

Full Text Available Abstract Background There are many methods for analyzing microarray data that group together genes having similar patterns of expression over all conditions tested. However, in many instances the biologically important goal is to identify relatively small sets of genes that share coherent expression across only some conditions, rather than all or most conditions as required in traditional clustering; e.g. genes that are highly up-regulated and/or down-regulated similarly across only a subset of conditions. Equally important is the need to learn which conditions are the decisive ones in forming such gene sets of interest, and how they relate to diverse conditional covariates, such as disease diagnosis or prognosis. Results We present a method for automatically identifying such candidate sets of biologically relevant genes using a combination of principal components analysis and information theoretic metrics. To enable easy use of our methods, we have developed a data analysis package that facilitates visualization and subsequent data mining of the independent sources of significant variation present in gene microarray expression datasets (or in any other similarly structured high-dimensional dataset). We applied these tools to two public datasets, and highlight sets of genes most affected by specific subsets of conditions (e.g. tissues, treatments, samples, etc.). Statistically significant associations for highlighted gene sets were shown via global analysis for Gene Ontology term enrichment. Together with covariate associations, the tool provides a basis for building testable hypotheses about the biological or experimental causes of observed variation. Conclusion We provide an unsupervised data mining technique for diverse microarray expression datasets that is distinct from major methods now in routine use. In test uses, this method, based on publicly available gene annotations, appears to identify numerous sets of biologically relevant genes. It
Robust MST-Based Clustering Algorithm.

Science.gov (United States)

Liu, Qidong; Zhang, Ruisheng; Zhao, Zhili; Wang, Zhenghai; Jiao, Mengyao; Wang, Guangjing

2018-06-01

Minimax similarity stresses the connectedness of points via mediating elements rather than favoring high mutual similarity. The grouping principle yields superior clustering results when mining arbitrarily-shaped clusters in data. However, it is not robust against noise and outliers in the data. There are two main problems with the grouping principle: first, a single object that is far away from all other objects defines a separate cluster, and second, two connected clusters would be regarded as two parts of one cluster. In order to solve such problems, we propose a robust minimum spanning tree (MST)-based clustering algorithm in this letter. First, we separate the connected objects by applying a density-based coarsening phase, resulting in a low-rank matrix in which each element denotes a supernode combining a set of nodes. Then a greedy method is presented to partition those supernodes by working on the low-rank matrix. Instead of removing the longest edges from the MST, our algorithm groups the data set based on the minimax similarity. Finally, the assignment of all data points can be achieved through their corresponding supernodes. Experimental results on many synthetic and real-world data sets show that our algorithm consistently outperforms the compared clustering algorithms.
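To make the minimax idea concrete, here is a small sketch that computes minimax path distances, expressed as distances rather than similarities. The minimax distance between two points is the smallest possible value of the largest hop over all paths connecting them, which equals the largest edge on their minimum spanning tree path. The sketch uses a Floyd-Warshall-style relaxation for brevity; it does not implement the coarsening or supernode steps of the letter, and the function name and sample points are assumptions.

```python
import math

def minimax_distances(points):
    """All-pairs minimax (bottleneck) path distances over the complete graph."""
    n = len(points)
    d = [[math.dist(p, q) for q in points] for p in points]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                # Cost of a path through k is its largest hop.
                via_k = max(d[i][k], d[k][j])
                if via_k < d[i][j]:
                    d[i][j] = via_k
    return d

# Hypothetical data: a chain of four points plus one distant outlier.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (10.0, 0.0)]
d = minimax_distances(points)
chain_end_to_end = d[0][3]   # Euclidean distance is 3, minimax distance is 1
chain_to_outlier = d[0][4]   # dominated by the single gap of 7
```

The two values show both properties mentioned in the abstract: points linked through mediating elements stay close under the minimax measure, while a single far-away object is separated from everything else and would form its own cluster.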

An Approach for Understanding and Promoting Coal Mine Safety by Exploring Coal Mine Risk Network

Directory of Open Access Journals (Sweden)

Yongliang Deng

2017-01-01

Full Text Available Capturing the interrelations among risks is essential to thoroughly understand and promote coal mining safety. From this standpoint, 105 risks and 135 interrelations among risks were identified from 126 typical accidents, which formed the foundation for constructing a coal mine risk network (CMRN). Based on complex network theory and Pajek, six parameters (i.e., network diameter, network density, average path length, degree, betweenness, and clustering coefficient) were employed to reveal the topological properties of CMRN. As indicated by the results, CMRN possesses the scale-free network property because its cumulative degree distribution obeys a power-law distribution. This means that CMRN is robust to random hazards and vulnerable to deliberate attack. CMRN is also a small-world network due to its relatively small average path length as well as high clustering coefficient, implying that accident propagation in CMRN is faster than in a regular network. Furthermore, the effect of risk control is explored. The results show that roof collapse, fire, and gas concentration exceeding the limit are the three most valuable targets for risk control among all the risks. This study will help offer recommendations and proposals for making beforehand strategies that can restrain original risks and reduce accidents.
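Three of the six parameters named in this abstract (network density, clustering coefficient and average path length) can be illustrated on a toy graph. The sketch below assumes an undirected, connected graph stored as an adjacency dictionary; the node labels are placeholders rather than risks from the study, and the study itself used Pajek, not this code.

```python
from collections import deque
from itertools import combinations

def density(adj):
    """Fraction of possible undirected edges that are present."""
    n = len(adj)
    m = sum(len(nb) for nb in adj.values()) / 2
    return 2 * m / (n * (n - 1))

def clustering_coefficient(adj, v):
    """Share of pairs of neighbours of v that are themselves linked."""
    nb = adj[v]
    k = len(nb)
    if k < 2:
        return 0.0
    links = sum(1 for a, b in combinations(nb, 2) if b in adj[a])
    return 2 * links / (k * (k - 1))

def average_path_length(adj):
    """Mean shortest-path length over all ordered pairs (graph assumed connected)."""
    total, pairs = 0, 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:  # breadth-first search from source
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs

# Toy network: a triangle A-B-C with a pendant node D attached to C.
adj = {"A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B", "D"}, "D": {"C"}}
net_density = density(adj)
cc_of_c = clustering_coefficient(adj, "C")
apl = average_path_length(adj)
```

A short average path length combined with high clustering coefficients is the small-world signature the abstract refers to.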
OCHRE PRECIPITATES AND ACID MINE DRAINAGE IN A MINE ENVIRONMENT

Directory of Open Access Journals (Sweden)

BRANISLAV MÁŠA

2012-03-01

Full Text Available This paper focuses on characterizing the ochre precipitates and the mine water effluents of some old mine adits and settling pits after mining of polymetallic ores in Slovakia. It was shown that the mine water effluents from two different types of deposits (adits; settling pits) have similar composition and represent slightly acidic sulphate water (pH in range 5.60-6.05, sulphate concentration from 1160 to 1905 g.dm-3). The ochreous precipitates were characterized by X-ray diffraction analysis (XRD), scanning electron microscopy (SEM) and the B.E.T. method for measuring the specific surface area and porosity. The dominant phases were ferrihydrite with goethite or goethite with lepidocrocite.
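Data Clustering Menggunakan Metodologi CRISP-DM Untuk Pengenalan Pola Proporsi Pelaksanaan Tridharma [Data Clustering Using the CRISP-DM Methodology for Recognizing Proportion Patterns in Tridharma Implementation]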

Directory of Open Access Journals (Sweden)

Irwan Budiman

2014-01-01

Full Text Available The quality of faculty human resources can be seen in the productivity and quality with which Tridharma (education, research, community service) and supporting field activities are carried out. The Lecturer Workload and Evaluation of Higher Education Tridharma (BKD and the EPT-PT) aims to ensure that faculty tasks are carried out according to the criteria set out in legislation. Clustering of Tridharma implementation data is needed to gain knowledge of the pattern of Tridharma implementation at a college. Clustering as a data mining technique should be scalable, reliable and meet an agreed standard. CRISP-DM is the data mining standard used in this study. The data clustering results show the pattern of Tridharma proportions as 3 clusters representing the patterns: professionals, managers and teachers. Keywords: Clustering, CRISP-DM, K-Means, Tridharma
An novel frequent probability pattern mining algorithm based on circuit simulation method in uncertain biological networks

Science.gov (United States)

2014-01-01

Background Motif mining has always been a hot research topic in bioinformatics. Most current research on biological networks focuses on exact motif mining. However, due to the inevitable experimental error and noisy data, biological network data represented as a probability model could better reflect the authenticity and biological significance; therefore, it is more biologically meaningful to discover probability motifs in uncertain biological networks. One of the key steps in probability motif mining is frequent pattern discovery, which is usually based on the possible world model and has a relatively high computational complexity. Methods In this paper, we present a novel method for detecting frequent probability patterns based on circuit simulation in uncertain biological networks. First, the partition based efficient search is applied to the non-tree like subgraph mining where the probability of occurrence in random networks is small. Then, an algorithm of probability isomorphic based on circuit simulation is proposed. The probability isomorphic combines the analysis of circuit topology structure with related physical properties of voltage in order to evaluate the probability isomorphism between probability subgraphs. The circuit simulation based probability isomorphic can avoid using the traditional possible world model. Finally, based on the algorithm of probability subgraph isomorphism, a two-step hierarchical clustering method is used to cluster subgraphs, and discover frequent probability patterns from the clusters. Results The experiment results on data sets of the Protein-Protein Interaction (PPI) networks and the transcriptional regulatory networks of E. coli and S. cerevisiae show that the proposed method can efficiently discover the frequent probability subgraphs. The discovered subgraphs in our study contain all probability motifs reported in the experiments published in other related papers. Conclusions The algorithm of probability graph isomorphism
Clusters of galaxies as tools in observational cosmology : results from x-ray analysis

International Nuclear Information System (INIS)

Weratschnig, J.M.

2009-01-01

Clusters of galaxies are the largest gravitationally bound structures in the universe. They can be used as ideal tools to study large scale structure formation (e.g. when studying merger clusters) and provide highly interesting environments to analyse several characteristic interaction processes (like ram pressure stripping of galaxies, magnetic fields). In this dissertation thesis, we have studied several clusters of galaxies using X-ray observations. To obtain scientific results, we have applied different data reduction and analysis methods. With a combination of morphological and spectral analysis, the merger cluster Abell 514 was studied in much detail. It has a highly interesting morphology and shows signs of an ongoing merger as well as a shock. Using a new method to detect substructure, we have analysed several clusters to determine whether any substructure is present in the X-ray image. This hints towards a real structure in the distribution of the intra-cluster medium (ICM) and is evidence for ongoing mergers. The results from this analysis are extensively used with the cluster of galaxies Abell S1136. Here, we study the ICM distribution and compare its structure with the spatial distribution of star forming galaxies. Cluster magnetic fields are another important topic of my thesis. They can be studied in radio observations, which can be put into relation with results from X-ray observations. Using observational data from several clusters, we could support the theory that cluster magnetic fields are frozen into the ICM. (author)
Characterizing Heterogeneity within Head and Neck Lesions Using Cluster Analysis of Multi-Parametric MRI Data.

Directory of Open Access Journals (Sweden)

Marco Borri

Full Text Available To describe a methodology, based on cluster analysis, to partition multi-parametric functional imaging data into groups (or clusters) of similar functional characteristics, with the aim of characterizing functional heterogeneity within head and neck tumour volumes. To evaluate the performance of the proposed approach on a set of longitudinal MRI data, analysing the evolution of the obtained sub-sets with treatment. The cluster analysis workflow was applied to a combination of dynamic contrast-enhanced and diffusion-weighted imaging MRI data from a cohort of squamous cell carcinoma of the head and neck patients. Cumulative distributions of voxels, containing pre and post-treatment data and including both primary tumours and lymph nodes, were partitioned into k clusters (k = 2, 3 or 4). Principal component analysis and cluster validation were employed to investigate data composition and to independently determine the optimal number of clusters. The evolution of the resulting sub-regions with induction chemotherapy treatment was assessed relative to the number of clusters. The clustering algorithm was able to separate clusters which significantly reduced in voxel number following induction chemotherapy from clusters with a non-significant reduction. Partitioning with the optimal number of clusters (k = 4, determined with cluster validation) produced the best separation between reducing and non-reducing clusters. The proposed methodology was able to identify tumour sub-regions with distinct functional properties, independently separating clusters which were affected differently by treatment. This work demonstrates that unsupervised cluster analysis, with no prior knowledge of the data, can be employed to provide a multi-parametric characterization of functional heterogeneity within tumour volumes.
Latent cluster analysis of ALS phenotypes identifies prognostically differing groups.

Directory of Open Access Journals (Sweden)

Jeban Ganesalingam

2009-09-01

Full Text Available Amyotrophic lateral sclerosis (ALS) is a degenerative disease predominantly affecting motor neurons and manifesting as several different phenotypes. Whether these phenotypes correspond to different underlying disease processes is unknown. We used latent cluster analysis to identify groupings of clinical variables in an objective and unbiased way to improve phenotyping for clinical and research purposes. Latent class cluster analysis was applied to a large database consisting of 1467 records of people with ALS, using discrete variables which can be readily determined at the first clinic appointment. The model was tested for clinical relevance by survival analysis of the phenotypic groupings using the Kaplan-Meier method. The best model generated five distinct phenotypic classes that strongly predicted survival (p<0.0001). Eight variables were used for the latent class analysis, but a good estimate of the classification could be obtained using just two variables: site of first symptoms (bulbar or limb) and time from symptom onset to diagnosis (p<0.00001). The five phenotypic classes identified using latent cluster analysis can predict prognosis. They could be used to stratify patients recruited into clinical trials and to generate more homogeneous disease groups for genetic, proteomic and risk factor research.
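The Kaplan-Meier method mentioned in this abstract estimates a survival curve from follow-up times in which some subjects are censored. A minimal sketch of the product-limit estimator follows; the function name and the five follow-up records are invented for illustration and have no connection to the ALS database.

```python
def kaplan_meier(durations, events):
    """Product-limit estimate: list of (time, survival) at each event time.

    events[i] is 1 if the event was observed at durations[i], 0 if censored.
    """
    data = sorted(zip(durations, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        observed = 0   # events occurring exactly at time t
        leaving = 0    # subjects leaving the risk set at time t
        while i < len(data) and data[i][0] == t:
            observed += data[i][1]
            leaving += 1
            i += 1
        if observed:
            survival *= 1 - observed / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve

# Hypothetical follow-up times (months); 0 marks a censored record.
durations = [1, 2, 2, 3, 4]
events = [1, 1, 0, 1, 0]
curve = kaplan_meier(durations, events)
```

Computing one such curve per phenotypic class and comparing them is the usual way to check whether a clustering has prognostic value.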
Global classification of human facial healthy skin using PLS discriminant analysis and clustering analysis.

Science.gov (United States)

Guinot, C; Latreille, J; Tenenhaus, M; Malvy, D J

2001-04-01

Today's classifications of healthy skin are predominantly based on a very limited number of skin characteristics, such as skin oiliness or susceptibility to sun exposure. The aim of the present analysis was to set up a global classification of healthy facial skin, using mathematical models. This classification is based on clinical, biophysical skin characteristics and self-reported information related to the skin, as well as the results of a theoretical skin classification assessed separately for the frontal and the malar zones of the face. In order to maximize the predictive power of the models with a minimum of variables, the Partial Least Square (PLS) discriminant analysis method was used. The resulting PLS components were subjected to clustering analyses to identify the plausible number of clusters and to group the individuals according to their proximities. Using this approach, four PLS components could be constructed and six clusters were found relevant. Thus, from the 36 hypothetical combinations of the theoretical skin type classification, we arrived at a consolidated proposal of six classes. Our data suggest that the association of the PLS discriminant analysis and the clustering methods leads to a valid and simple way to classify healthy human skin and represents a potentially useful tool for cosmetic and dermatological research.
Analysis of rockburst and rockfall accidents in relation to class of stope support, regional support, energy of seismic events and mining layout

CSIR Research Space (South Africa)

Cichowicz, A

1994-01-01

Full Text Available This report discusses the assessment of safety risk and the analysis of Falls Of Ground (FOG) in mines due to seismic events and mining layout during the period of 1991-1992 on a single mine. The multivariate analysis was used to obtain a...
Clustering by reordering of similarity and Laplacian matrices: Application to galaxy clusters

Science.gov (United States)

Mahmoud, E.; Shoukry, A.; Takey, A.

2018-04-01

Similarity metrics, kernels and similarity-based algorithms have gained much attention due to their increasing applications in information retrieval, data mining, pattern recognition and machine learning. Similarity Graphs are often adopted as the underlying representation of similarity matrices and are at the origin of known clustering algorithms such as spectral clustering. Similarity matrices offer the advantage of working in object-object (two-dimensional) space where visualization of clusters similarities is available instead of object-features (multi-dimensional) space. In this paper, sparse ɛ-similarity graphs are constructed and decomposed into strong components using appropriate methods such as Dulmage-Mendelsohn permutation (DMperm) and/or Reverse Cuthill-McKee (RCM) algorithms. The obtained strong components correspond to groups (clusters) in the input (feature) space. Parameter ɛi is estimated locally, at each data point i from a corresponding narrow range of the number of nearest neighbors. Although more advanced clustering techniques are available, our method has the advantages of simplicity, better complexity and direct visualization of the clusters similarities in a two-dimensional space. Also, no prior information about the number of clusters is needed. We conducted our experiments on two and three dimensional, low and high-sized synthetic datasets as well as on an astronomical real-dataset. The results are verified graphically and analyzed using gap statistics over a range of neighbors to verify the robustness of the algorithm and the stability of the results. Combining the proposed algorithm with gap statistics provides a promising tool for solving clustering problems. An astronomical application is conducted for confirming the existence of 45 galaxy clusters around the X-ray positions of galaxy clusters in the redshift range [0.1..0.8]. We re-estimate the photometric redshifts of the identified galaxy clusters and obtain acceptable values
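The core idea of clustering through an ɛ-similarity graph can be sketched in a few lines: connect two points when they lie within ɛ of each other and read clusters off as connected groups. The version below is a simplification under stated assumptions: it uses a single global ɛ instead of the locally estimated ɛi, and a union-find pass over an undirected graph instead of the DMperm or RCM reorderings named in the abstract. The function name and sample points are hypothetical.

```python
import math

def epsilon_graph_clusters(points, eps):
    """Label points by connected component of the eps-neighbourhood graph."""
    n = len(points)
    parent = list(range(n))

    def find(i):
        # Follow parent links to the root, compressing the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(points[i], points[j]) <= eps:
                ri, rj = find(i), find(j)
                if ri != rj:
                    parent[rj] = ri  # merge the two components

    # Relabel roots as 0, 1, 2, ... in order of first appearance.
    root_to_label = {}
    return [root_to_label.setdefault(find(i), len(root_to_label)) for i in range(n)]

# Hypothetical 2-D data: two well-separated groups.
points = [(0.0, 0.0), (0.5, 0.0), (1.0, 0.0), (5.0, 5.0), (5.5, 5.0)]
labels = epsilon_graph_clusters(points, eps=0.6)
```

As in the abstract, the number of clusters is not supplied in advance; it falls out of the graph structure once ɛ is chosen.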
Introduction to stream: An Extensible Framework for Data Stream Clustering Research with R

Directory of Open Access Journals (Sweden)

Michael Hahsler

2017-02-01

Full Text Available In recent years, data streams have become an increasingly important area of research for the computer science, database and statistics communities. Data streams are ordered and potentially unbounded sequences of data points created by a typically non-stationary data generating process. Common data mining tasks associated with data streams include clustering, classification and frequent pattern mining. New algorithms for these types of data are proposed regularly and it is important to evaluate them thoroughly under standardized conditions. In this paper we introduce stream, a research tool that includes modeling and simulating data streams as well as an extensible framework for implementing, interfacing and experimenting with algorithms for various data stream mining tasks. The main advantage of stream is that it seamlessly integrates with the large existing infrastructure provided by R. In addition to data handling, plotting and easy scripting capabilities, R also provides many existing algorithms and enables users to interface code written in many programming languages popular among data mining researchers (e.g., C/C++, Java and Python). In this paper we describe the architecture of stream and focus on its use for data stream clustering research. stream was implemented with extensibility in mind and will be extended in the future to cover additional data stream mining tasks like classification and frequent pattern mining.
An Event Reporting and Early-Warning Safety System Based on the Internet of Things for Underground Coal Mines: A Case Study

Directory of Open Access Journals (Sweden)

Byung Wan Jo

2017-09-01

Full Text Available Fatal accidents associated with underground coal mines require the implementation of high-level gas monitoring and miner’s localization approaches to promote underground safety and health. This study introduces a real-time monitoring, event-reporting and early-warning platform, based on cluster analysis for outlier detection, spatiotemporal statistical analysis, and an RSS range-based weighted centroid localization algorithm for improving safety management and preventing accidents in underground coal mines. The proposed platform seamlessly integrates monitoring, analyzing, and localization approaches using the Internet of Things (IoT), cloud computing, a real-time operational database, application gateways, and application program interfaces. The prototype has been validated and verified at the operating underground Hassan Kishore coal mine. Sensors for air quality parameters including temperature, humidity, CH4, CO2, and CO demonstrated an excellent performance, with regression constants always greater than 0.97 for each parameter when compared to their commercial equivalent. This framework enables real-time monitoring, identification of abnormal events (>90%), and verification of a miner’s localization (with <1.8 m of error) in the harsh environment of underground mines. The main contribution of this study is the development of an open source, customizable, and cost-effective platform for effectively promoting underground coal mine safety. This system is helpful for solving the problems of accessibility, serviceability, interoperability, and flexibility associated with safety in coal mines.
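The abstract names an RSS range-based weighted centroid localization algorithm without giving details. The sketch below shows one common form of that idea, under assumptions that are not from the source: received signal strength is converted to a range with a log-distance path-loss model (the reference power `rss_at_1m` and exponent `path_loss_exp` are illustrative values), and each anchor is weighted by the inverse of its estimated range. All function and parameter names are hypothetical.

```python
def rss_to_distance(rss_dbm, rss_at_1m=-40.0, path_loss_exp=2.0):
    """Estimate range in meters from RSS with a log-distance path-loss model.

    Assumed model: rss = rss_at_1m - 10 * path_loss_exp * log10(distance).
    """
    return 10 ** ((rss_at_1m - rss_dbm) / (10.0 * path_loss_exp))


def weighted_centroid(anchors, rss_values, **model):
    """Estimate a 2-D position as the centroid of anchors weighted by 1/range."""
    weights = [1.0 / rss_to_distance(r, **model) for r in rss_values]
    total = sum(weights)
    x = sum(w * ax for w, (ax, _) in zip(weights, anchors)) / total
    y = sum(w * ay for w, (_, ay) in zip(weights, anchors)) / total
    return x, y


anchors = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0), (10.0, 10.0)]
# Equal signal strength at every anchor: the estimate is the plain centroid.
center = weighted_centroid(anchors, [-60.0, -60.0, -60.0, -60.0])
# A stronger signal at the first anchor pulls the estimate toward it.
near_origin = weighted_centroid(anchors, [-50.0, -70.0, -70.0, -70.0])
```

Inverse-range weighting is only one possible choice; other weight functions (for example inverse squared range) change how strongly the nearest anchor dominates.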
Systematic analysis of molecular mechanisms for HCC metastasis via text mining approach.

Science.gov (United States)

Zhen, Cheng; Zhu, Caizhong; Chen, Haoyang; Xiong, Yiru; Tan, Junyuan; Chen, Dong; Li, Jin

2017-02-21

To systematically explore the molecular mechanism for hepatocellular carcinoma (HCC) metastasis and identify regulatory genes with text mining methods. Genes with highest frequencies and significant pathways related to HCC metastasis were listed. A handful of proteins such as EGFR, MDM2, TP53 and APP, were identified as hub nodes in PPI (protein-protein interaction) network. Compared with unique genes for HBV-HCCs, genes particular to HCV-HCCs were less, but may participate in more extensive signaling processes. VEGFA, PI3KCA, MAPK1, MMP9 and other genes may play important roles in multiple phenotypes of metastasis. Genes in abstracts of HCC-metastasis literatures were identified. Word frequency analysis, KEGG pathway and PPI network analysis were performed. Then co-occurrence analysis between genes and metastasis-related phenotypes were carried out. Text mining is effective for revealing potential regulators or pathways, but the purpose of it should be specific, and the combination of various methods will be more useful.
Data mining, mining data : energy consumption modelling

Energy Technology Data Exchange (ETDEWEB)

Dessureault, S. [Arizona Univ., Tucson, AZ (United States)

2007-09-15

Most modern mining operations are accumulating large amounts of data on production and business processes. Data, however, provides value only if it can be translated into information that appropriate users can utilize. This paper emphasized that a new technological focus should emerge, notably how to concentrate data into information; analyze information sufficiently to become knowledge; and, act on that knowledge. Researchers at the Mining Information Systems and Operations Management (MISOM) laboratory at the University of Arizona have created a method to transform data into action. The data-to-action approach was exercised in the development of an energy consumption model (ECM), in partnership with a major US-based copper mining company, 2 software companies, and the MISOM laboratory. The approach begins by integrating several key data sources using data warehousing techniques, and increasing the existing level of integration and data cleaning. An online analytical processing (OLAP) cube was also created to investigate the data and identify a subset of several million records. Data mining algorithms were applied using the information that was isolated by the OLAP cube. The data mining results showed that traditional cost drivers of energy consumption are poor predictors. A comparison was made between traditional methods of predicting energy consumption and the prediction formed using data mining. Traditionally, in the mines for which data were available, monthly averages of tons and distance are used to predict diesel fuel consumption. However, this article showed that new information technology can be used to incorporate many more variables into the budgeting process, resulting in more accurate predictions. The ECM helped mine planners improve the prediction of energy use through more data integration, measure development, and workflow analysis. 5 refs., 11 figs.
Sustainability Activities In The Mining Sector: Current Status And Challenges Ahead Limestone Mining In Nusakambangan

Science.gov (United States)

Ayuningrum, Theresia Vika; Purnaweni, Hartuti

2018-02-01

The karst area in Nusakambangan plays an important role in maintaining the balance of nature, but mining activities inevitably change the environmental conditions there. To balance the interests of mining against the sustainability of the environment in the use of resources, every activity in the mining sector requires a variety of environmental studies. The purpose of this study is to analyze environmental management in relation to limestone mining activities in Nusakambangan, in order to determine whether the mining areas are managed optimally, wisely on the basis of ecological principles, and sustainably. The study uses qualitative research methods; data are analyzed using descriptive percentages, and both primary and secondary data are collected.
Application of cluster analysis and unsupervised learning to multivariate tissue characterization

International Nuclear Information System (INIS)

Momenan, R.; Insana, M.F.; Wagner, R.F.; Garra, B.S.; Loew, M.H.

1987-01-01

This paper describes a procedure for classifying tissue types from unlabeled acoustic measurements (data type unknown) using unsupervised cluster analysis. These techniques are being applied to unsupervised ultrasonic image segmentation and tissue characterization. The performance of a new clustering technique is measured and compared with supervised methods, such as a linear Bayes classifier. In these comparisons two objectives are sought: a) How well does the clustering method group the data?; b) Do the clusters correspond to known tissue classes? The first question is investigated by a measure of cluster similarity and dispersion. The second question involves a comparison with a supervised technique using labeled data
Clustering analysis for muon tomography data elaboration in the Muon Portal project

Science.gov (United States)

Bandieramonte, M.; Antonuccio-Delogu, V.; Becciani, U.; Costa, A.; La Rocca, P.; Massimino, P.; Petta, C.; Pistagna, C.; Riggi, F.; Riggi, S.; Sciacca, E.; Vitello, F.

2015-05-01

Clustering analysis is a multivariate data analysis technique that gathers statistical data units into groups so as to minimize the logical distance within each group and maximize the distance between different groups. In these proceedings, the authors present a novel approach to muon tomography data analysis based on clustering algorithms. As a case study we present the Muon Portal project, which aims to build and operate a dedicated particle detector for the inspection of harbor containers to hinder the smuggling of nuclear materials. Clustering techniques, working directly on scattering points, help to detect the presence of suspicious items inside the container, acting, as will be shown, as a filter for a preliminary analysis of the data.
Subtypes of autism by cluster analysis based on structural MRI data.

Science.gov (United States)

Hrdlicka, Michal; Dudova, Iva; Beranova, Irena; Lisy, Jiri; Belsan, Tomas; Neuwirth, Jiri; Komarek, Vladimir; Faladova, Ludvika; Havlovicova, Marketa; Sedlacek, Zdenek; Blatny, Marek; Urbanek, Tomas

2005-05-01

The aim of our study was to subcategorize Autistic Spectrum Disorders (ASD) using a multidisciplinary approach. Sixty four autistic patients (mean age 9.4+/-5.6 years) were entered into a cluster analysis. The clustering analysis was based on MRI data. The clusters obtained did not differ significantly in the overall severity of autistic symptomatology as measured by the total score on the Childhood Autism Rating Scale (CARS). The clusters could be characterized as showing significant differences: Cluster 1: showed the largest sizes of the genu and splenium of the corpus callosum (CC), the lowest pregnancy order and the lowest frequency of facial dysmorphic features. Cluster 2: showed the largest sizes of the amygdala and hippocampus (HPC), the least abnormal visual response on the CARS, the lowest frequency of epilepsy and the least frequent abnormal psychomotor development during the first year of life. Cluster 3: showed the largest sizes of the caput of the nucleus caudatus (NC), the smallest sizes of the HPC and facial dysmorphic features were always present. Cluster 4: showed the smallest sizes of the genu and splenium of the CC, as well as the amygdala, and caput of the NC, the most abnormal visual response on the CARS, the highest frequency of epilepsy, the highest pregnancy order, abnormal psychomotor development during the first year of life was always present and facial dysmorphic features were always present. This multidisciplinary approach seems to be a promising method for subtyping autism.
Symptom Clusters in People Living with HIV Attending Five Palliative Care Facilities in Two Sub-Saharan African Countries: A Hierarchical Cluster Analysis.

Science.gov (United States)

Moens, Katrien; Siegert, Richard J; Taylor, Steve; Namisango, Eve; Harding, Richard

2015-01-01

Symptom research across conditions has historically focused on single symptoms, and the burden of multiple symptoms and their interactions has been relatively neglected, especially in people living with HIV. Symptom cluster studies are required to set priorities in treatment planning, and to lessen the total symptom burden. This study aimed to identify and compare symptom clusters among people living with HIV attending five palliative care facilities in two sub-Saharan African countries. Data from cross-sectional self-report of seven-day symptom prevalence on the 32-item Memorial Symptom Assessment Scale-Short Form were used. A hierarchical cluster analysis was conducted using Ward's method applying squared Euclidean distance as the similarity measure to determine the clusters. Contingency tables, chi-squared tests and ANOVA were used to compare the clusters by patient specific characteristics and distress scores. Among the sample (N=217) the mean age was 36.5 (SD 9.0), 73.2% were female, and 49.1% were on antiretroviral therapy (ART). The cluster analysis produced five symptom clusters identified as: 1) dermatological; 2) generalised anxiety and elimination; 3) social and image; 4) persistently present; and 5) a gastrointestinal-related symptom cluster. The patients in the first three symptom clusters reported the highest physical and psychological distress scores. Patient characteristics varied significantly across the five clusters by functional status (worst functional physical status in cluster one, ppeople living with HIV with longitudinally collected symptom data to test cluster stability and identify common symptom trajectories is recommended.
The application of Double-difference technique to improve localization of induced microseismic events at Pyhäsalmi copper mine, Pyhäjärvi, Finland.

Science.gov (United States)

Nevalainen, Jouni; Usoltseva, Olga; Kozlovskaya, Elena; Mäki, Timo

2017-04-01

Pyhäsalmi mine, an underground copper mine at Pyhäjärvi, Finland, has been known to experience induced seismicity due to ore excavation for over half a century. In 2002, the excavation depth increased as mining activity focused on the Pyhäsalmi deep ore body, a potato-shaped ore concentration that lies roughly 1000 to 1425 meters below the surface. The stress level in the rock was found to be very high with a clear main direction, and because of this microseismicity started occurring immediately when construction of the "new mine" section began. A microseismic monitoring system was therefore installed to trace this frequently occurring induced seismicity, as seismic observations are one of the quickest ways to map a mine's state of health. The system consists of over 25 geophones that are mainly located around the excavation site. Since the installation, over 250000 events have been observed. Currently, the automated (triggered) and subsequently manually verified seismic events are located with an absolute location method that minimizes a penalty function so that the calculated location and origin time match the observed arrival times of the corresponding events as closely as possible. However, with this method the best location accuracy is around 20 meters at the center of the excavation, since it uses a homogeneous velocity model applied to the whole mine, whereas in reality the seismic velocity structure is very complex, with tunnels, fill material and ore. This suits the mine's seismic alarm purposes well, but it is not accurate enough for more advanced source analysis. We apply the double-difference technique to relocate microseismic-scale events at Pyhäsalmi mine. This iterative least-squares procedure utilizes pairs of events with a common receiver. The basic principle of the technique is that it relates the residual between the observed and the predicted phase travel-time difference for pairs of earthquakes observed at a common station to adjustments in the vector that connects

SOMFlow: Guided Exploratory Cluster Analysis with Self-Organizing Maps and Analytic Provenance.

Science.gov (United States)

Sacha, Dominik; Kraus, Matthias; Bernard, Jurgen; Behrisch, Michael; Schreck, Tobias; Asano, Yuki; Keim, Daniel A

2018-01-01

Clustering is a core building block for data analysis, aiming to extract otherwise hidden structures and relations from raw datasets, such as particular groups that can be effectively related, compared, and interpreted. A plethora of visual-interactive cluster analysis techniques has been proposed to date, however, arriving at useful clusterings often requires several rounds of user interactions to fine-tune the data preprocessing and algorithms. We present a multi-stage Visual Analytics (VA) approach for iterative cluster refinement together with an implementation (SOMFlow) that uses Self-Organizing Maps (SOM) to analyze time series data. It supports exploration by offering the analyst a visual platform to analyze intermediate results, adapt the underlying computations, iteratively partition the data, and to reflect previous analytical activities. The history of previous decisions is explicitly visualized within a flow graph, allowing to compare earlier cluster refinements and to explore relations. We further leverage quality and interestingness measures to guide the analyst in the discovery of useful patterns, relations, and data partitions. We conducted two pair analytics experiments together with a subject matter expert in speech intonation research to demonstrate that the approach is effective for interactive data analysis, supporting enhanced understanding of clustering results as well as the interactive process itself.
Using Dynamic Fourier Analysis to Discriminate Between Seismic Signals from Natural Earthquakes and Mining Explosions

Directory of Open Access Journals (Sweden)

Maria C. Mariani

2017-08-01

Full Text Available A sequence of intraplate earthquakes occurred in Arizona at the same location where mining explosions were carried out in previous years. The explosions and some of the earthquakes generated very similar seismic signals. In this study Dynamic Fourier Analysis is used for discriminating signals originating from natural earthquakes and mining explosions. Frequency analysis of seismograms recorded at regional distances shows that compared with the mining explosions the earthquake signals have larger amplitudes in the frequency interval ~ 6 to 8 Hz and significantly smaller amplitudes in the frequency interval ~ 2 to 4 Hz. This type of analysis makes it possible to identify frequency characteristics of the seismograms that help detect potentially risky seismic events.
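To make the band comparison concrete, the sketch below computes, over sliding windows, the ratio of mean spectral amplitude in the 6 to 8 Hz band to that in the 2 to 4 Hz band. It assumes that Dynamic Fourier Analysis can be approximated by a plain discrete Fourier transform on successive windows, with no tapering or smoothing; the function names, window settings and synthetic signals are illustrative and not taken from the study.

```python
import cmath
import math


def band_amplitude(signal, fs, f_lo, f_hi):
    """Mean DFT amplitude of `signal` over bins with frequency in [f_lo, f_hi] Hz."""
    n = len(signal)
    amps = []
    for k in range(n // 2 + 1):
        if f_lo <= k * fs / n <= f_hi:
            coeff = sum(
                x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(signal)
            )
            amps.append(abs(coeff) / n)
    if not amps:
        raise ValueError("no DFT bins fall inside the requested band")
    return sum(amps) / len(amps)


def dynamic_band_ratio(signal, fs, window, step, high=(6.0, 8.0), low=(2.0, 4.0)):
    """Ratio of high-band to low-band amplitude for each sliding window."""
    ratios = []
    for start in range(0, len(signal) - window + 1, step):
        seg = signal[start:start + window]
        ratios.append(band_amplitude(seg, fs, *high) / band_amplitude(seg, fs, *low))
    return ratios


fs = 40  # samples per second (assumed)
t = [i / fs for i in range(2 * fs)]
# Synthetic stand-ins: one dominated by 7 Hz energy, the other by 3 Hz energy.
quake_like = [math.sin(2 * math.pi * 7 * s) + 0.1 * math.sin(2 * math.pi * 3 * s) for s in t]
blast_like = [0.1 * math.sin(2 * math.pi * 7 * s) + math.sin(2 * math.pi * 3 * s) for s in t]
quake_ratios = dynamic_band_ratio(quake_like, fs, window=fs, step=fs)
blast_ratios = dynamic_band_ratio(blast_like, fs, window=fs, step=fs)
```

A ratio above 1 in most windows would point toward the earthquake-like pattern described in the abstract, and a ratio below 1 toward the explosion-like pattern; any real threshold would have to be calibrated on recorded data.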
Mining biological databases for candidate disease genes

Science.gov (United States)

Braun, Terry A.; Scheetz, Todd; Webster, Gregg L.; Casavant, Thomas L.

2001-07-01

The publicly-funded effort to sequence the complete nucleotide sequence of the human genome, the Human Genome Project (HGP), has currently produced more than 93% of the 3 billion nucleotides of the human genome into a preliminary `draft' format. In addition, several valuable sources of information have been developed as direct and indirect results of the HGP. These include the sequencing of model organisms (rat, mouse, fly, and others), gene discovery projects (ESTs and full-length), and new technologies such as expression analysis and resources (micro-arrays or gene chips). These resources are invaluable for the researchers identifying the functional genes of the genome that transcribe and translate into the transcriptome and proteome, both of which potentially contain orders of magnitude more complexity than the genome itself. Preliminary analyses of this data identified approximately 30,000 - 40,000 human `genes.' However, the bulk of the effort still remains -- to identify the functional and structural elements contained within the transcriptome and proteome, and to associate function in the transcriptome and proteome to genes. A fortuitous consequence of the HGP is the existence of hundreds of databases containing biological information that may contain relevant data pertaining to the identification of disease-causing genes. The task of mining these databases for information on candidate genes is a commercial application of enormous potential. We are developing a system to acquire and mine data from specific databases to aid our efforts to identify disease genes. A high speed cluster of Linux workstations is used to analyze sequence and perform distributed sequence alignments as part of our data mining and processing. This system has been used to mine GeneMap99 sequences within specific genomic intervals to identify potential candidate disease genes associated with Bardet-Biedl Syndrome (BBS).
Assessment of Random Assignment in Training and Test Sets using Generalized Cluster Analysis Technique

Directory of Open Access Journals (Sweden)

Sorana D. BOLBOACĂ

2011-06-01

Full Text Available Aim: The properness of random assignment of compounds in training and validation sets was assessed using the generalized cluster technique. Material and Method: A quantitative Structure-Activity Relationship model using Molecular Descriptors Family on Vertices was evaluated in terms of assignment of carboquinone derivatives in training and test sets during the leave-many-out analysis. Assignment of compounds was investigated using five variables: observed anticancer activity and four structure descriptors. Generalized cluster analysis with the K-means algorithm was applied in order to investigate whether the assignment of compounds was proper. The Euclidean distance and maximization of the initial distance using a cross-validation with a v-fold of 10 was applied. Results: All five variables included in the analysis proved to have a statistically significant contribution to the identification of clusters. Three clusters were identified, each of them containing carboquinone derivatives belonging to both the training and the test sets. The observed activity of carboquinone derivatives proved to be normally distributed in every cluster. The presence of training and test sets in all clusters identified using generalized cluster analysis with the K-means algorithm and the distribution of observed activity within clusters sustain a proper assignment of compounds in training and test sets. Conclusion: Generalized cluster analysis using the K-means algorithm proved to be a valid method in assessment of random assignment of carboquinone derivatives in training and test sets.
Cluster analysis in severe emphysema subjects using phenotype and genotype data: an exploratory investigation

Directory of Open Access Journals (Sweden)

Martinez Fernando J

2010-03-01

Full Text Available Abstract Background Numerous studies have demonstrated associations between genetic markers and COPD, but results have been inconsistent. One reason may be heterogeneity in disease definition. Unsupervised learning approaches may assist in understanding disease heterogeneity. Methods We selected 31 phenotypic variables and 12 SNPs from five candidate genes in 308 subjects in the National Emphysema Treatment Trial (NETT) Genetics Ancillary Study cohort. We used factor analysis to select a subset of phenotypic variables, and then used cluster analysis to identify subtypes of severe emphysema. We examined the phenotypic and genotypic characteristics of each cluster. Results We identified six factors accounting for 75% of the shared variability among our initial phenotypic variables. We selected four phenotypic variables from these factors for cluster analysis: (1) post-bronchodilator FEV1 percent predicted, (2) percent bronchodilator responsiveness, and quantitative CT measurements of (3) apical emphysema and (4) airway wall thickness. K-means cluster analysis revealed four clusters, though separation between clusters was modest: (1) emphysema predominant; (2) bronchodilator responsive, with higher FEV1; (3) discordant, with a lower FEV1 despite less severe emphysema and lower airway wall thickness; and (4) airway predominant. Of the genotypes examined, membership in cluster 1 (emphysema-predominant) was associated with TGFB1 SNP rs1800470. Conclusions Cluster analysis may identify meaningful disease subtypes and/or groups of related phenotypic variables even in a highly selected group of severe emphysema subjects, and may be useful for genetic association studies.
Diagnostic analysis of electrodialysis in mine tailing materials

DEFF Research Database (Denmark)

Hansen, Henrik K.; Ribeiro, Alexandra B.; Mateus, Eduardo

2007-01-01

Removal of heavy metals from mine tailings and soil contaminated by copper mining activities was studied under batch electrodialytic conditions. Two types of mine tailings were treated: (i) freshly produced tailings coming directly from the flotation process, and (ii) tailings deposited...... in a tailings pond, for approximately 20 years. The main contaminant was copper-found in concentration around 800-1800 ppm. The fractionation of copper and other characteristics of the tailings differ for the two tailings, indicating natural oxidation reactions in the old deposited ones. Electrodialytical...
Transcriptional analysis of exopolysaccharides biosynthesis gene clusters in Lactobacillus plantarum.

Science.gov (United States)

Vastano, Valeria; Perrone, Filomena; Marasco, Rosangela; Sacco, Margherita; Muscariello, Lidia

2016-04-01

Exopolysaccharides (EPS) from lactic acid bacteria contribute to specific rheology and texture of fermented milk products and find applications also in non-dairy foods and in therapeutics. Recently, four clusters of genes (cps) associated with surface polysaccharide production have been identified in Lactobacillus plantarum WCFS1, a probiotic and food-associated lactobacillus. These clusters are involved in cell surface architecture and probably in release and/or exposure of immunomodulating bacterial molecules. Here we show a transcriptional analysis of these clusters. Indeed, RT-PCR experiments revealed that the cps loci are organized in five operons. Moreover, by reverse transcription-qPCR analysis performed on L. plantarum WCFS1 (wild type) and WCFS1-2 (ΔccpA), we demonstrated that expression of three cps clusters is under the control of the global regulator CcpA. These results, together with the identification of putative CcpA target sequences (catabolite responsive element CRE) in the regulatory region of four out of five transcriptional units, strongly suggest for the first time a role of the master regulator CcpA in EPS gene transcription among lactobacilli.
An Analysis of Trainers' Perspectives within an Ecological Framework: Factors that Influence Mine Safety Training Processes.

Science.gov (United States)

Haas, Emily J; Hoebbel, Cassandra L; Rost, Kristen A

2014-09-01

Satisfactory completion of mine safety training is a prerequisite for being hired and for continued employment in the coal industry. Although training includes content to develop skills in a variety of mineworker competencies, research and recommendations continue to specify that specific limitations in the self-escape portion of training still exist and that mineworkers need to be better prepared to respond to emergencies that could occur in their mine. Ecological models are often used to inform the development of health promotion programs but have not been widely applied to occupational health and safety training programs. Nine mine safety trainers participated in in-depth semi-structured interviews. A theoretical analysis of the interviews was completed via an ecological lens. Each level of the social ecological model was used to examine factors that could be addressed both during and after mine safety training. The analysis suggests that problems surrounding communication and collaboration, leadership development, and responsibility and accountability at different levels within the mining industry contribute to deficiencies in mineworkers' mastery and maintenance of skills. This study offers a new technique to identify limitations in safety training systems and processes. The analysis suggests that training should be developed and disseminated with consideration of various levels-individual, interpersonal, organizational, and community-to promote skills. If factors identified within and between levels are addressed, it may be easier to sustain mineworker competencies that are established during safety training.
Value, Cost, and Sharing: Open Issues in Constrained Clustering

Science.gov (United States)

Wagstaff, Kiri L.

2006-01-01

Clustering is an important tool for data mining, since it can identify major patterns or trends without any supervision (labeled data). Over the past five years, semi-supervised (constrained) clustering methods have become very popular. These methods began with incorporating pairwise constraints and have developed into more general methods that can learn appropriate distance metrics. However, several important open questions have arisen about which constraints are most useful, how they can be actively acquired, and when and how they should be propagated to neighboring points. This position paper describes these open questions and suggests future directions for constrained clustering research.
Data-driven modeling of background and mine-related acidity and metals in river basins

International Nuclear Information System (INIS)

Friedel, Michael J.

2014-01-01

A novel application of self-organizing map (SOM) and multivariate statistical techniques is used to model the nonlinear interaction among basin mineral-resources, mining activity, and surface-water quality. First, the SOM is trained using sparse measurements from 228 sample sites in the Animas River Basin, Colorado. The model performance is validated by comparing stochastic predictions of basin-alteration assemblages and mining activity at 104 independent sites. The SOM correctly predicts (>98%) the predominant type of basin hydrothermal alteration and presence (or absence) of mining activity. Second, application of the Davies–Bouldin criteria to k-means clustering of SOM neurons identified ten unique environmental groups. Median statistics of these groups define a nonlinear water-quality response along the spatiotemporal hydrothermal alteration-mining gradient. These results reveal that it is possible to differentiate among the continuum between inputs of background and mine-related acidity and metals, and it provides a basis for future research and empirical model development. The trained self-organizing map is used to determine upstream hydrothermal alteration (AS – acid sulfate; PROP – propylitic, PROP-V – propylitic veins, QSP – quartz-sericite-pyrite, WSP – weak-sericite-pyrite; Mining activity: MINES) from water-quality measurements in the Animas river basin, Colorado, USA. The white hexagons are sized proportional to the number of water-quality samples associated with that SOM neuron. Highlights: • We model surface-water quality response using a self-organizing map and multivariate statistics. • Applying Davies–Bouldin criteria to k-means clusters defines ten environmental response groups. • The approach differentiates between background and mine-related acidity and metals.
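The Davies–Bouldin criterion mentioned in this record scores a partition by comparing within-cluster scatter to the separation between cluster centroids; lower values indicate more compact, better separated clusters. The sketch below is a minimal pure-Python version of the standard index, offered as an illustration rather than the author's implementation. The function name and the toy data are assumptions.

```python
from math import dist


def davies_bouldin(points, labels):
    """Davies-Bouldin index of a labelled partition (lower is better).

    Requires at least two clusters with distinct centroids.
    """
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)

    # Centroid and mean point-to-centroid distance (scatter) of each cluster.
    centroids = {
        lab: tuple(sum(coord) / len(ps) for coord in zip(*ps))
        for lab, ps in clusters.items()
    }
    scatter = {
        lab: sum(dist(p, centroids[lab]) for p in ps) / len(ps)
        for lab, ps in clusters.items()
    }

    # For each cluster take its worst (largest) similarity to any other cluster.
    total = 0.0
    for i in clusters:
        total += max(
            (scatter[i] + scatter[j]) / dist(centroids[i], centroids[j])
            for j in clusters
            if j != i
        )
    return total / len(clusters)


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = davies_bouldin(points, [0, 0, 1, 1])  # groups the nearby points together
bad = davies_bouldin(points, [0, 1, 0, 1])   # splits each nearby pair apart
```

In a workflow like the one described, the index would be evaluated for k-means partitions with different numbers of clusters and the partition with the lowest value retained.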
Empirical advances with text mining of electronic health records.

Science.gov (United States)

Delespierre, T; Denormandie, P; Bar-Hen, A; Josseran, L

2017-08-22

Korian is a private group specializing in medical accommodations for elderly and dependent people. A professional data warehouse (DWH) established in 2010 hosts all of the residents' data. Inside this information system (IS), clinical narratives (CNs) were used only by medical staff as a residents' care linking tool. The objective of this study was to show that, through qualitative and quantitative textual analysis of a relatively small physiotherapy and well-defined CN sample, it was possible to build a physiotherapy corpus and, through this process, generate a new body of knowledge by adding relevant information to describe the residents' care and lives. Meaningful words were extracted through Structured Query Language (SQL) with the LIKE function and wildcards to perform pattern matching, followed by text mining and a word cloud using R® packages. Another step involved principal components and multiple correspondence analyses, plus clustering on the same residents' sample as well as on other health data using a health model measuring the residents' care level needs. By combining these techniques, physiotherapy treatments could be characterized by a list of constructed keywords, and the residents' health characteristics were built. Feeding defects or health outlier groups could be detected, physiotherapy residents' data and their health data were matched, and differences in health situations showed qualitative and quantitative differences in physiotherapy narratives. This textual experiment using a textual process in two stages showed that text mining and data mining techniques provide convenient tools to improve residents' health and quality of care by adding new, simple, useable data to the electronic health record (EHR). When used with a normalized physiotherapy problem list, text mining through information extraction (IE), named entity recognition (NER) and data mining (DM) can provide a real advantage to describe health care, adding new medical material and
Cluster Analysis of Maize Inbred Lines

Directory of Open Access Journals (Sweden)

Jiban Shrestha

2016-12-01

Full Text Available The determination of diversity among inbred lines is important for heterosis breeding. Sixty maize inbred lines were evaluated for eight agro-morphological traits during the winter season of 2011 to analyze their genetic diversity. Clustering was done by the average linkage method. The inbred lines were grouped into six clusters. Inbred lines grouped into cluster II had taller plants with the maximum number of leaves. Cluster III was characterized by shorter plants with the minimum number of leaves. The inbred lines categorized into cluster V flowered early, whereas those in cluster VI flowered late. The inbred lines grouped into cluster III were characterized by a higher anthesis-silking interval (ASI), and those of cluster VI had a lower ASI. These results showed that inbred lines from widely divergent clusters can be utilized in hybrid breeding programmes.
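The average linkage method named in this abstract can be illustrated with a minimal sketch. It is not the authors' code: the function name `average_linkage` and the two-trait values are hypothetical, and in practice traits on different scales would normally be standardized first.

```python
from itertools import combinations
from math import dist


def average_linkage(points, n_clusters):
    """Agglomerative clustering: repeatedly merge the two clusters whose
    mean pairwise Euclidean distance is smallest, until n_clusters remain.
    Returns a list of clusters, each a sorted list of point indices."""
    clusters = [[i] for i in range(len(points))]

    def avg_dist(a, b):
        # Mean distance over all cross-cluster pairs (the "average linkage").
        total = sum(dist(points[i], points[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > n_clusters:
        # combinations() yields a < b, so deleting index b after the merge is safe.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: avg_dist(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]


# Hypothetical (plant height in cm, leaf count) values, not data from the study.
inbred_lines = [(180.0, 14.0), (178.0, 13.0), (120.0, 9.0), (118.0, 8.0), (150.0, 11.0)]
groups = average_linkage(inbred_lines, 3)
print(groups)
```

The brute-force search over all cluster pairs is cubic or worse in the number of lines, which is acceptable for sixty inbred lines but not for large data sets.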
Groundwater Mixing Process Identification in Deep Mines Based on Hydrogeochemical Property Analysis

Directory of Open Access Journals (Sweden)

Bo Liu

2016-12-01

Full Text Available Karst collapse columns, as a potential water passageway for mine water inrush, are always considered a critical problem for the development of deep mining techniques. This study aims to identify the mixing process of groundwater deriving from two different limestone karst-fissure aquifer systems. Based on analysis of the hydrogeochemical properties of mine groundwater, a hydraulic connection between the target karst-fissure aquifer systems was revealed. In this paper, a Piper diagram was used to calculate the mixing ratios at different sampling points in the aquifer systems, and the PHREEQC Interactive model (Version 2.5, USGS, Reston, VA, USA, 2001) was applied to modify the mixing ratios and model the water–rock interactions during the mixing processes. The analysis results show that the highest mixing ratio is 0.905 in the C12 borehole, which is located nearest to the #2 karst collapse column, and the mixing ratio decreases as the distance from the #2 karst collapse column increases. This demonstrates that groundwater of the two aquifers mixed through the passage of the #2 karst collapse column. As a result, the proposed Piper-PHREEQC based method can provide accurate identification of karst collapse columns’ water conductivity, and can be applied in practice.
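As a rough illustration of what a mixing ratio means, the sketch below computes the fraction of one end-member water in a two-component mixture from a single conservative tracer. This is a generic mass-balance formula, not the Piper-PHREEQC procedure of the paper; the function name and the concentrations are hypothetical.

```python
def mixing_ratio(c_mix, c_a, c_b):
    """Fraction of end-member A in a mixture of waters A and B, assuming the
    tracer is conservative (no reaction, sorption or precipitation):
        c_mix = f * c_a + (1 - f) * c_b  =>  f = (c_mix - c_b) / (c_a - c_b)
    """
    if c_a == c_b:
        raise ValueError("end-members must differ in tracer concentration")
    return (c_mix - c_b) / (c_a - c_b)


# Hypothetical chloride concentrations in mg/L (illustrative, not from the study).
ratio = mixing_ratio(c_mix=95.0, c_a=100.0, c_b=50.0)
print(f"fraction of aquifer A water: {ratio:.2f}")
```

A ratio outside the range 0 to 1 would indicate that the tracer is not conservative or that a third water source is involved, which is one reason a geochemical model is used to refine such estimates.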
[Principal component analysis and cluster analysis of inorganic elements in sea cucumber Apostichopus japonicus].

Science.gov (United States)

Liu, Xiao-Fang; Xue, Chang-Hu; Wang, Yu-Ming; Li, Zhao-Jie; Xue, Yong; Xu, Jie

2011-11-01

The present study investigates the feasibility of multi-element analysis for determining the geographical origin of the sea cucumber Apostichopus japonicus, and selects effective tracers for assessing its geographical origin. The contents of elements such as Al, V, Cr, Mn, Fe, Co, Ni, Cu, Zn, As, Se, Mo, Cd, Hg and Pb in sea cucumber Apostichopus japonicus samples from seven places of geographical origin were determined by means of ICP-MS. The results were used to develop an element database. Cluster analysis (CA) and principal component analysis (PCA) were applied to differentiate the geographical origin of the sea cucumber Apostichopus japonicus. Three principal components, which accounted for over 89% of the total variance, were extracted from the standardized data. The results of Q-type cluster analysis showed that the 26 samples could be clustered reasonably into five groups; the classification results were significantly associated with the marine distribution of the sea cucumber Apostichopus japonicus samples. CA and PCA were effective methods for element analysis of sea cucumber Apostichopus japonicus samples. The contents of the mineral elements in sea cucumber Apostichopus japonicus samples were good chemical descriptors for differentiating their geographical origins.
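The statement that principal components "account for" a share of the total variance of standardized data can be made concrete with a small sketch. It assumes nothing about the authors' software: the helper names and the three-element table are hypothetical, and only the first component is computed, using power iteration on the correlation matrix.

```python
from statistics import mean, pstdev


def standardize(rows):
    """Scale every column to zero mean and unit (population) variance."""
    cols = list(zip(*rows))
    mu = [mean(c) for c in cols]
    sd = [pstdev(c) for c in cols]
    return [[(x - m) / s for x, m, s in zip(row, mu, sd)] for row in rows]


def correlation_matrix(z):
    """Correlation matrix of already standardized data."""
    n, p = len(z), len(z[0])
    return [[sum(r[i] * r[j] for r in z) / n for j in range(p)] for i in range(p)]


def leading_eigenvalue(mat, iters=200):
    """Power iteration for the largest eigenvalue of a symmetric
    positive semi-definite matrix."""
    v = [1.0] * len(mat)
    lam = 0.0
    for _ in range(iters):
        w = [sum(a * b for a, b in zip(row, v)) for row in mat]
        lam = sum(x * x for x in w) ** 0.5
        v = [x / lam for x in w]
    return lam


# Hypothetical contents of three elements in five samples (not from the study).
samples = [(1.0, 2.1, 0.5), (2.0, 3.9, 0.7), (3.0, 6.2, 0.6), (4.0, 8.1, 1.1), (5.0, 9.8, 1.0)]
corr = correlation_matrix(standardize(samples))
# For standardized data the total variance equals the number of variables.
share = leading_eigenvalue(corr) / len(corr)
print(f"variance explained by the first component: {share:.1%}")
```

Summing the shares of the first few components gives a cumulative figure of the kind quoted in the abstract.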
Mining concepts of health responsibility using text mining and exploratory graph analysis.

Science.gov (United States)

Kjellström, Sofia; Golino, Hudson

2018-05-24

Occupational therapists need to know about people's beliefs about personal responsibility for health to help them pursue everyday activities. The study aims to employ state-of-the-art quantitative approaches to understand people's views of health and responsibility at different ages. A mixed method approach was adopted, using text mining to extract information from 233 interviews with participants aged 5 to 96 years, and then exploratory graph analysis to estimate the number of latent variables. The fit of the structure estimated via the exploratory graph analysis was verified using confirmatory factor analysis. Exploratory graph analysis estimated three dimensions of health responsibility: (1) creating good health habits and feeling good; (2) thinking about one's own health and wanting to improve it; and (3) adopting explicitly normative attitudes to take care of one's health. The comparison of the three dimensions among age groups showed, in general, that children and adolescents, as well as the old elderly (>73 years old), expressed ideas about personal responsibility for health less often than young adults, adults and the young elderly. Occupational therapists' knowledge of the concepts of health responsibility is of value when working with a patient's health, but an identified challenge is how to engage children and older persons.
Statistical Techniques Applied to Aerial Radiometric Surveys (STAARS): cluster analysis. National Uranium Resource Evaluation

International Nuclear Information System (INIS)

Pirkle, F.L.; Stablein, N.K.; Howell, J.A.; Wecksung, G.W.; Duran, B.S.

1982-11-01

One objective of the aerial radiometric surveys flown as part of the US Department of Energy's National Uranium Resource Evaluation (NURE) program was to ascertain the regional distribution of near-surface radioelement abundances. Some method for identifying groups of observations with similar radioelement values was therefore required. It is shown in this report that cluster analysis can identify such groups even when no a priori knowledge of the geology of an area exists. A method of convergent k-means cluster analysis coupled with a hierarchical cluster analysis is used to classify 6991 observations (three radiometric variables at each observation location) from the Precambrian rocks of the Copper Mountain, Wyoming, area. Another method, one that combines a principal components analysis with a convergent k-means analysis, is applied to the same data. These two methods are compared with a convergent k-means analysis that utilizes available geologic knowledge. All three methods identify four clusters. Three of the clusters represent background values for the Precambrian rocks of the area, and one represents outliers (anomalously high ²¹⁴Bi). A segmentation of the data corresponding to geologic reality as discovered by other methods has been achieved based solely on analysis of aerial radiometric data. The techniques employed are composites of classical clustering methods designed to handle the special problems presented by large data sets. 20 figures, 7 tables
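A k-means pass that iterates until the assignments stop changing can be sketched as follows. This is plain Lloyd's algorithm with hand-picked starting centers, offered only as an assumption about the general idea; it is not necessarily the "convergent k-means" variant used in the report, and the three-variable observations are invented.

```python
from math import dist


def kmeans(points, initial_centers, max_iter=100):
    """Lloyd's algorithm: assign each point to its nearest center, move each
    center to the mean of its members, and stop once labels are stable."""
    centers = list(initial_centers)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centers)), key=lambda k: dist(p, centers[k]))
            for p in points
        ]
        if new_labels == labels:
            break  # converged: no observation changed cluster
        labels = new_labels
        for k in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # keep the previous center if a cluster becomes empty
                centers[k] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centers


# Hypothetical observations with three radiometric variables each (not NURE data);
# the last two have an anomalously high second variable.
observations = [
    (1.0, 2.0, 8.0), (1.1, 2.1, 8.2), (0.9, 1.9, 7.9),
    (1.0, 9.0, 8.1), (1.2, 9.5, 8.0),
]
labels, centers = kmeans(observations, [observations[0], observations[3]])
print(labels)
```

The result depends on the starting centers, which is one motivation for seeding k-means from a hierarchical clustering or from available geologic knowledge, as the report describes.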
Diffusion maps, clustering and fuzzy Markov modeling in peptide folding transitions

International Nuclear Information System (INIS)

Nedialkova, Lilia V.; Amat, Miguel A.; Kevrekidis, Ioannis G.; Hummer, Gerhard

2014-01-01

Using the helix-coil transitions of alanine pentapeptide as an illustrative example, we demonstrate the use of diffusion maps in the analysis of molecular dynamics simulation trajectories. Diffusion maps and other nonlinear data-mining techniques provide powerful tools to visualize the distribution of structures in conformation space. The resulting low-dimensional representations help in partitioning conformation space, and in constructing Markov state models that capture the conformational dynamics. In an initial step, we use diffusion maps to reduce the dimensionality of the conformational dynamics of Ala5. The resulting pretreated data are then used in a clustering step. The identified clusters show excellent overlap with clusters obtained previously by using the backbone dihedral angles as input, with small—but nontrivial—differences reflecting torsional degrees of freedom ignored in the earlier approach. We then construct a Markov state model describing the conformational dynamics in terms of a discrete-time random walk between the clusters. We show that by combining fuzzy C-means clustering with a transition-based assignment of states, we can construct robust Markov state models. This state-assignment procedure suppresses short-time memory effects that result from the non-Markovianity of the dynamics projected onto the space of clusters. In a comparison with previous work, we demonstrate how manifold learning techniques may complement and enhance informed intuition commonly used to construct reduced descriptions of the dynamics in molecular conformation space
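The step of describing the dynamics as a discrete-time random walk between clusters amounts to estimating a transition matrix from a sequence of cluster labels. The sketch below uses hard (crisp) assignments and simple transition counting; it does not implement the fuzzy C-means or transition-based assignment of the paper, and the function name and trajectory are hypothetical.

```python
def transition_matrix(trajectory, n_states, lag=1):
    """Row-normalized transition counts between cluster labels observed
    `lag` frames apart. Row i holds estimates of P(j at t+lag | i at t)."""
    counts = [[0] * n_states for _ in range(n_states)]
    for a, b in zip(trajectory, trajectory[lag:]):
        counts[a][b] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        # A state that is never left in the data gets an all-zero row.
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix


# Hypothetical sequence of cluster labels along a simulation trajectory.
labels = [0, 0, 1, 1, 1, 0, 0, 2, 2, 0]
T = transition_matrix(labels, n_states=3)
for row in T:
    print(row)
```

With crisp labels, rapid recrossings at cluster boundaries inflate the off-diagonal counts; the state-assignment procedure described in the abstract is aimed at suppressing exactly that kind of short-time memory effect.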
Design database for quantitative trait loci (QTL) data warehouse, data mining, and meta-analysis.

Science.gov (United States)

Hu, Zhi-Liang; Reecy, James M; Wu, Xiao-Lin

2012-01-01

A database can be used to warehouse quantitative trait loci (QTL) data from multiple sources for comparison, genomic data mining, and meta-analysis. A robust database design involves sound data structure logistics, meaningful data transformations, normalization, and proper user interface designs. This chapter starts with a brief review of relational database basics and concentrates on issues associated with curation of QTL data into a relational database, with emphasis on the principles of data normalization and structure optimization. In addition, some simple examples of QTL data mining and meta-analysis are included. These examples are provided to help readers better understand the potential and importance of sound database design.
Mine drivage in hydraulic mines

Energy Technology Data Exchange (ETDEWEB)

Ehkber, B Ya

1983-09-01

From 20 to 25% of labor cost in hydraulic coal mines falls on mine drivage. Range of mine drivage is high due to the large number of shortwalls mined by hydraulic monitors. Reducing mining cost in hydraulic mines depends on lowering drivage cost by use of new drivage systems or by increasing efficiency of drivage systems used at present. The following drivage methods used in hydraulic mines are compared: heading machines with hydraulic haulage of cut rocks and coal, hydraulic monitors with hydraulic haulage, drilling and blasting with hydraulic haulage of blasted rocks. Mining and geologic conditions which influence selection of the optimum mine drivage system are analyzed. Standardized cross sections of mine roadways driven by the 3 methods are shown in schemes. Support systems used in mine roadways are compared: timber supports, roof bolts, roof bolts with steel elements, and roadways driven in rocks without a support system. Heading machines (K-56MG, GPKG, 4PU, PK-3M) and hydraulic monitors (GMDTs-3M, 12GD-2) used for mine drivage are described. Data on mine drivage in hydraulic coal mines in the Kuzbass are discussed. From 40 to 46% of roadways are driven by heading machines with hydraulic haulage and from 12 to 15% by hydraulic monitors with hydraulic haulage.

Assessment of atmospheric heavy metal deposition in the Tarkwa gold mining area of Ghana using epiphytic lichens

Energy Technology Data Exchange (ETDEWEB)

Boamponsem, L.K. [Department of Theoretical and Applied Biology, College of Science, Kwame Nkrumah University of Science and Technology, University Post Office, Kumasi (Ghana); Department of Laboratory Technology, School of Physical Sciences, University of Cape Coast, Cape Coast (Ghana); Adam, J.I. [Department of Theoretical and Applied Biology, College of Science, Kwame Nkrumah University of Science and Technology, University Post Office, Kumasi (Ghana); Dampare, S.B., E-mail: dampare@cc.okayama-u.ac.j [National Nuclear Research Institute, Ghana Atomic Energy Commission, P.O. Box LG 80, Legon-Accra (Ghana); Department of Earth Sciences, Okayama University, 1-1, Tsushima-Naka 3-Chome, Okayama 700-8530 (Japan); Nyarko, B.J.B. [National Nuclear Research Institute, Ghana Atomic Energy Commission, P.O. Box LG 80, Legon-Accra (Ghana); Essumang, D.K. [Department of Laboratory Technology, School of Physical Sciences, University of Cape Coast, Cape Coast (Ghana)

2010-05-01

In situ lichens (Parmelia sulcata) have been used to assess atmospheric heavy metal deposition in the Tarkwa gold mining area of Ghana. Total heavy metal concentrations obtained by instrumental neutron activation analysis (INAA) were processed by positive matrix factorization (PMF), principal component (PCA) and cluster (CA) analyses. The pollution index factor (PIF) and pollution load index (PLI) criteria revealed elevated levels of Sb, Mn, Cu, V, Al, Co, Hg, Cd and As in excess of the background values. The PCA and CA classified the examined elements into anthropogenic and natural sources, and PMF resolved three primary sources/factors: agricultural activities and other non-point anthropogenic origins, natural soil dust, and gold mining activities. Gold mining activities, which are characterized by dominant species of Sb, Th, As, Hg, Cd and Co, and significant contributions of Cu, Al, Mn and V, are the main contributors of heavy metals in the atmosphere of the study area.
Performance Evaluation of Hadoop-based Large-scale Network Traffic Analysis Cluster

Directory of Open Access Journals (Sweden)

Tao Ran

2016-01-01

Full Text Available As Hadoop has gained popularity in the big data era, it is widely used in various fields. Our self-designed and self-developed large-scale network traffic analysis cluster is based on Hadoop and works well, with off-line applications running on it to analyze massive network traffic data. To evaluate the performance of the analysis cluster scientifically and reasonably, we propose a performance evaluation system. First, we take the execution times of three benchmark applications as the benchmark of performance, and pick 40 metrics of customized statistical resource data. Then we identify the relationship between the resource data and the execution times by a statistical modeling approach composed of principal component analysis and multiple linear regression. After training the models on historical data, we can predict the execution times from current resource data. Finally, we evaluate the performance of the analysis cluster using the validated predictions of execution times. Experimental results show that the execution times predicted by the trained models are within an acceptable error range, and that the performance evaluation results are accurate and reliable.
Development and optimization of SPECT gated blood pool cluster analysis for the prediction of CRT outcome

Energy Technology Data Exchange (ETDEWEB)

Lalonde, Michel, E-mail: mlalonde15@rogers.com; Wassenaar, Richard [Department of Physics, Carleton University, Ottawa, Ontario K1S 5B6 (Canada); Wells, R. Glenn; Birnie, David; Ruddy, Terrence D. [Division of Cardiology, University of Ottawa Heart Institute, Ottawa, Ontario K1Y 4W7 (Canada)

2014-07-15

Purpose: Phase analysis of single photon emission computed tomography (SPECT) radionuclide angiography (RNA) has been investigated for its potential to predict the outcome of cardiac resynchronization therapy (CRT). However, phase analysis may be limited in its potential at predicting CRT outcome as valuable information may be lost by assuming that time-activity curves (TAC) follow a simple sinusoidal shape. A new method, cluster analysis, is proposed which directly evaluates the TACs and may lead to a better understanding of dyssynchrony patterns and CRT outcome. Cluster analysis algorithms were developed and optimized to maximize their ability to predict CRT response. Methods: About 49 patients (N = 27 ischemic etiology) received a SPECT RNA scan as well as positron emission tomography (PET) perfusion and viability scans prior to undergoing CRT. A semiautomated algorithm sampled the left ventricle wall to produce 568 TACs from SPECT RNA data. The TACs were then subjected to two different cluster analysis techniques, K-means and normal average, where several input metrics were also varied to determine the optimal settings for the prediction of CRT outcome. Each TAC was assigned to a cluster group based on the comparison criteria, and global and segmental cluster size and scores were used as measures of dyssynchrony and used to predict response to CRT. A repeated random twofold cross-validation technique was used to train and validate the cluster algorithm. Receiver operating characteristic (ROC) analysis was used to calculate the area under the curve (AUC) and compare results to those obtained for SPECT RNA phase analysis and PET scar size analysis methods. Results: Using the normal average cluster analysis approach, the septal wall produced statistically significant results for predicting CRT results in the ischemic population (ROC AUC = 0.73; p < 0.05 vs. equal chance ROC AUC = 0.50) with an optimal operating point of 71% sensitivity and 60% specificity. Cluster
Integrating Data Clustering and Visualization for the Analysis of 3D Gene Expression Data

Energy Technology Data Exchange (ETDEWEB)

Institute for Data Analysis and Visualization (IDAV) and the Department of Computer Science, University of California, Davis, One Shields Avenue, Davis, CA 95616, USA; International Research Training Group "Visualization of Large and Unstructured Data Sets," University of Kaiserslautern, Germany; Computational Research Division, Lawrence Berkeley National Laboratory, One Cyclotron Road, Berkeley, CA 94720, USA; Genomics Division, Lawrence Berkeley National Laboratory, One Cyclotron Road, Berkeley, CA 94720, USA; Life Sciences Division, Lawrence Berkeley National Laboratory, One Cyclotron Road, Berkeley, CA 94720, USA; Computer Science Division, University of California, Berkeley, CA, USA; Computer Science Department, University of California, Irvine, CA, USA; All authors are with the Berkeley Drosophila Transcription Network Project, Lawrence Berkeley National Laboratory; Rubel, Oliver; Weber, Gunther H.; Huang, Min-Yu; Bethel, E. Wes; Biggin, Mark D.; Fowlkes, Charless C.; Hendriks, Cris L. Luengo; Keranen, Soile V. E.; Eisen, Michael B.; Knowles, David W.; Malik, Jitendra; Hagen, Hans; Hamann, Bernd

2008-05-12

The recent development of methods for extracting precise measurements of spatial gene expression patterns from three-dimensional (3D) image data opens the way for new analyses of the complex gene regulatory networks controlling animal development. We present an integrated visualization and analysis framework that supports user-guided data clustering to aid exploration of these new complex datasets. The interplay of data visualization and clustering-based data classification leads to improved visualization and enables a more detailed analysis than previously possible. We discuss (i) integration of data clustering and visualization into one framework; (ii) application of data clustering to 3D gene expression data; (iii) evaluation of the number of clusters k in the context of 3D gene expression clustering; and (iv) improvement of overall analysis quality via dedicated post-processing of clustering results based on visualization. We discuss the use of this framework to objectively define spatial pattern boundaries and temporal profiles of genes and to analyze how mRNA patterns are controlled by their regulatory transcription factors.
Advances in research methods for information systems research data mining, data envelopment analysis, value focused thinking

CERN Document Server

Osei-Bryson, Kweku-Muata

2013-01-01

Advances in social science research methodologies and data analytic methods are changing the way research in information systems is conducted. New developments in statistical software technologies for data mining (DM) such as regression splines or decision tree induction can be used to assist researchers in systematic post-positivist theory testing and development. Established management science techniques like data envelopment analysis (DEA), and value focused thinking (VFT) can be used in combination with traditional statistical analysis and data mining techniques to more effectively explore
Analysis of RXTE data on Clusters of Galaxies

Science.gov (United States)

Petrosian, Vahe

2004-01-01

This grant provided support for the reduction, analysis and interpretation of hard X-ray (HXR, for short) observations of the cluster of galaxies RXJ0658-5557 scheduled for the week of August 23, 2002 under the RXTE Cycle 7 program (PI Vahe Petrosian, Obs. ID 70165). The goal of the observation was to search for and characterize the shape of the HXR component beyond the well established thermal soft X-ray (SXR) component. Such hard components have been detected in several nearby clusters. The planned observation of a relatively distant cluster would provide information on the characteristics of this radiation at a different epoch in the evolution of the universe and shed light on its origin. We (Petrosian, 2001) have argued that thermal bremsstrahlung, as proposed earlier, cannot be the mechanism for the production of the HXRs and that the most likely mechanism is Compton upscattering of the cosmic microwave radiation by relativistic electrons, which are known to be present in the clusters and to be responsible for the observed radio emission. Based on this picture we estimated that this cluster, in spite of its relatively large distance, will have an HXR signal comparable to the other nearby ones. The proposed RXTE observations were carried out and the data have been analyzed. We detect a hard X-ray tail in the spectrum of this cluster with a flux very nearly equal to our predicted value. This has strengthened the case for the Compton scattering model. We intend the data obtained via this observation to be a part of a larger data set. We have identified other clusters of galaxies (in archival RXTE and other instrument data sets) with sufficiently high quality data where we can search for and measure (or at least put meaningful limits on) the strength of the hard component. With these studies we expect to clarify the mechanism for acceleration of particles in the intercluster medium and provide guidance for future observations of this intriguing phenomenon by instrument
Profitability and efficiency of Italian utilities: cluster analysis of financial statement ratios

International Nuclear Information System (INIS)

Linares, E.

2008-01-01

The last ten years have witnessed conspicuous changes in European and Italian regulation of public utility services and in the strategies of the major players in these fields. In response to these changes Italian utilities have made a variety of choices regarding size, presence in more or less capital-intensive stages of different value chains, and diversification. These choices have been implemented both through internal growth and by means of mergers and acquisitions. In this context it is interesting to try to establish whether there is a nexus between these choices and the performance of Italian utilities in terms of profitability and efficiency. Therefore statistical multivariate analysis techniques (cluster analysis and factor analysis) have been applied to several ratios obtained from the 2005 financial statements of 34 utilities. First, a hierarchical cluster analysis method has been applied to financial statement data in order to identify homogeneous groups based on several indicators of the incidence of costs (external costs, personnel costs, depreciation and amortization), profitability (return on sales, return on assets, return on equity) and efficiency (in the utilization of personnel, of total assets, of property, plant and equipment). Five clusters have been found. Then the clusters have been characterized in terms of the aforementioned indicators, the presence in different stages of the energy value chains (electricity and gas) and other descriptive variables (such as turnover, number of employees, assets, percentage of property, plant and equipment on total assets, sales revenues from electricity, gas, water supply and sanitation, waste collection and treatment and other services). In a second round cluster analysis has been preceded by factor analysis, in order to find a smaller set of variables. This procedure has revealed three not directly observable factors that can be interpreted as follows: i) efficiency in ordinary and financial management
The Cluster Science Archive and its relevance for multi-missions data analysis

Science.gov (United States)

Masson, A.; Escoubet, C. P.; Laakso, H. E.; Perry, C. H.

2014-12-01

The science data archive of the Cluster mission is a major contribution of the European Space Agency (ESA) to the International Living With a Star program. Known as the Cluster Active Archive (CAA), its availability since 2006 has resulted in a significant increase of the scientific return of this on-going mission. The Cluster science archive (CSA) has been developed in parallel to CAA over the last few years at the European Space Astronomy Center (ESAC) in Madrid, Spain. It is the long-term science archive of the Cluster mission, developed and managed along with all the other ESA science archives. Publicly opened in November 2013, CSA is available in parallel with CAA during a transition period until CAA public closing in early autumn 2014. Our goal here is to present what has been put in place to help geophysicists in their research. We will first talk about some aspects of the CSA user interface (data visualization including particle distribution; user data profiles) and how users can access data remotely (data streaming in Matlab, or via IDL or Python). The second goal is to present unique value added datasets that are now available on the CSA/CAA. These data have been produced by the scientific community, thanks to two EU FP7 projects: ECLAT and MAARBLE. For instance, the polarization and propagation parameters of ULF Pc waves measured by Cluster and Themis (since 2007) are available and cover more than a decade; along with magnetic spectra of Pc waves measured simultaneously by CHAMP and ground-based magnetometers. These data are clearly an outstanding data resource for low frequency waves researchers. Other datasets will be presented to show that CSA/CAA allow much more than downloading Cluster data from a graphical user interface. It's a single point entry that allows studies from micro-scale physics in the tail (e.g. catalogues of dipolarization fronts), to meso- and large-scale M-I coupling studies (e.g. Cluster magnetic footprints based on T96 and TS05
antiSMASH 4.0-improvements in chemistry prediction and gene cluster boundary identification

DEFF Research Database (Denmark)

Blin, Kai; Wolf, Thomas; Chevrette, Marc G.

2017-01-01

Many antibiotics, chemotherapeutics, crop protection agents and food preservatives originate from molecules produced by bacteria, fungi or plants. In recent years, genome mining methodologies have been widely adopted to identify and characterize the biosynthetic gene clusters encoding the production of such compounds. Since 2011, the 'antibiotics and secondary metabolite analysis shell-antiSMASH' has assisted researchers in efficiently performing this, both as a web server and a standalone tool. Here, we present the thoroughly updated antiSMASH version 4, which adds several novel features...
A formal concept analysis approach to consensus clustering of multi-experiment expression data

Science.gov (United States)

2014-01-01

Background Presently, with the increasing number and complexity of available gene expression datasets, the combination of data from multiple microarray studies addressing a similar biological question is gaining importance. The analysis and integration of multiple datasets are expected to yield more reliable and robust results since they are based on a larger number of samples and the effects of the individual study-specific biases are diminished. This is supported by recent studies suggesting that important biological signals are often preserved or enhanced by multiple experiments. An approach to combining data from different experiments is the aggregation of their clusterings into a consensus or representative clustering solution which increases the confidence in the common features of all the datasets and reveals the important differences among them. Results We propose a novel generic consensus clustering technique that applies Formal Concept Analysis (FCA) approach for the consolidation and analysis of clustering solutions derived from several microarray datasets. These datasets are initially divided into groups of related experiments with respect to a predefined criterion. Subsequently, a consensus clustering algorithm is applied to each group resulting in a clustering solution per group. These solutions are pooled together and further analysed by employing FCA which allows extracting valuable insights from the data and generating a gene partition over all the experiments. In order to validate the FCA-enhanced approach two consensus clustering algorithms are adapted to incorporate the FCA analysis. Their performance is evaluated on gene expression data from multi-experiment study examining the global cell-cycle control of fission yeast. The FCA results derived from both methods demonstrate that, although both algorithms optimize different clustering characteristics, FCA is able to overcome and diminish these differences and preserve some relevant biological
Interactive K-Means Clustering Method Based on User Behavior for Different Analysis Target in Medicine.

Science.gov (United States)

Lei, Yang; Yu, Dai; Bin, Zhang; Yang, Yang

2017-01-01

Clustering algorithms are widely used as a basis of data analysis in analysis systems. However, with high-dimensional data, a clustering algorithm may overlook the business relations between the dimensions, especially in the medical field. As a result, the clustering result often fails to meet the business goals of the users. If the clustering process can incorporate the knowledge of the users, that is, the doctor's knowledge or the analysis intent, the clustering result can be more satisfactory. In this paper, we propose an interactive K-means clustering method to improve the user's satisfaction with the result. The core of this method is to collect the user's feedback on the clustering result and use it to optimize that result. A particle swarm optimization algorithm is then used to optimize the parameters, especially the weight settings in the clustering algorithm, so that they reflect the user's business preference as far as possible. After this parameter optimization and adjustment, the clustering result can be closer to the user's requirement. Finally, we test our method on a breast cancer example. The experiments show the better performance of our algorithm.
Protecting reliability of mining at individual levels against phase conflict at the Belchatow mine and power plant

Energy Technology Data Exchange (ETDEWEB)

Sztandera, T [Akademia Gorniczo-Hutnicza, Cracow (Poland)

1988-01-01

Discusses problems of overburden removal and mining at the Belchatow brown coal surface mine in Poland. The analysis is based on statistical data on surface mining in Belchatow from 1985 to 1987. Information on number of benches, their dimensions (height, width), slope inclination and volume of overburden is evaluated. The analysis showed that the time delay from overburden removal to coal mining is excessively short; reserves of coal ready for mining are too low, especially in winter. This phenomenon is illustrated by the increasing slope inclinations of the whole cut in winter as the mine yields coal reserves made ready for excavation. In summer the rate of overburden removal increases. Such seasonal fluctuations negatively affect safety conditions in winter and especially in spring, when increased moisture content in sedimentary rocks reduces their resistance to landslide. 5 refs.
Analysis of US underground thin seam mining potential. Volume 1. Text. Final technical report, December 1978. [In thin seams

Energy Technology Data Exchange (ETDEWEB)

Pimental, R. A; Barell, D.; Fine, R. J.; Douglas, W. J.

1979-06-01

An analysis of the potential for US underground thin seam (< 28'') coal mining is undertaken to provide basic information for use in making a decision on further thin seam mining equipment development. The characteristics of the present low seam mines and their mining methods are determined, in order to establish baseline data against which changes in mine characteristics can be monitored as a function of time. A detailed data base of thin seam coal resources is developed through a quantitative and qualitative analysis at the bed, county and state level. By establishing present and future coal demand and relating demand to production and resources, the market for thin seam coal has been identified. No thin seam coal demand of significance is forecast before the year 2000. Current uncertainty as to coal's future does not permit market forecasts beyond the year 2000 with a sufficient level of reliability.
The evolution of genome mining in microbes – a review

DEFF Research Database (Denmark)

Ziemert, Nadine; Alanjary, Mohammad; Weber, Tilmann

2016-01-01

Covering: 2006 to 2016. The computational mining of genomes has become an important part in the discovery of novel natural products as drug leads. Thousands of bacterial genome sequences are publicly available these days containing an even larger number and diversity of secondary metabolite gene clusters that await linkage to their encoded natural products. With the development of high-throughput sequencing methods and the wealth of DNA data available, a variety of genome mining methods and tools have been developed to guide discovery and characterisation of these compounds. This article reviews...
Assessment of Heavy Metals in Mining Tailing around Boroo and Zuunkharaa Gold Mining Areas of Mongolia

OpenAIRE

Solongo, Enkhzaya; Ohe, Kaoru; Shiomori, Koichiro; Bolormaa, Oyuntsetseg; Ochirkhuyag, Bayanjargal; Watanabe, Makiko

2016-01-01

This study aimed to examine the mobility of heavy metals using sequential extraction analysis and to assess heavy metals in soil samples of mining tailings around the small-scale gold mining areas at Boroo and Zuunkharaa in Mongolia. The samples were collected from small-scale gold mining areas in Tuv and Selenge provinces, Mongolia. Physicochemical, chemical and some statistical analyses were made for the mining tailing samples. The pH of the mining tailing samples was determined as 6.10 – 7...
Operational analysis of the tailings bund wall drainage system at mirny ore mining and processing enterprise

Directory of Open Access Journals (Sweden)

Aniskin Nikolay Alekseevich

2016-12-01

Full Text Available Issues of environmental safety of tailings of ore mining and processing enterprises are considered; parameters of drainage of bund walls are of great significance for the environmental safety. Description of the bund wall of Mirny ore mining and processing enterprise and the tailings filling layouts are given. Results of field observation and model study of the tailings bund wall drainage system at Mirny ore mining and processing enterprise are presented. The drainage system rebuilding project analysis was performed. Proposals for its improvement were set forward.
Technological and mining analysis of mechanized systems used in roadways in Polish mines

Energy Technology Data Exchange (ETDEWEB)

Sikora, W; Giza, T; Siwiec, J [Politechnika Slaska, Gliwice (Poland). Instytut Mechanizacji Gornictwa

1987-01-01

Analyzes methods of mine drivage in Poland and materials handling systems. Of 1,620 km of roadways driven in 1982, 12% were driven in coal and 88% in stone or stone and coal. Roadways driven in coal in most cases were situated at depths from 500 to 700 m. Roadway cross-section ranged from 12 to 18 m². Roadways in stone or stone and coal were driven by drilling and blasting. Loaders were used for stone handling. Roadways in coal were driven by heading machines. Advance rates of mine drivage by heading machines were 2 to 3 times higher than those by drilling and blasting with loaders for stone handling. Basic statistical data characterizing roadways and drivage methods are evaluated: roadway dimensions and depth, advance rate depending on drivage methods and mining conditions, types of heading machines and loaders.
A novel data-mining approach leveraging social media to monitor consumer opinion of sitagliptin.

Science.gov (United States)

Akay, Altug; Dragomir, Andrei; Erlandsson, Björn-Erik

2015-01-01

A novel data mining method was developed to gauge the experience of the drug Sitagliptin (trade name Januvia) by patients with diabetes mellitus type 2. To this goal, we devised a two-step analysis framework. Initial exploratory analysis using self-organizing maps was performed to determine structures based on user opinions among the forum posts. The results were a compilation of user's clusters and their correlated (positive or negative) opinion of the drug. Subsequent modeling using network analysis methods was used to determine influential users among the forum members. These findings can open new avenues of research into rapid data collection, feedback, and analysis that can enable improved outcomes and solutions for public health and important feedback for the manufacturer.

Adaptation of chemical methods of analysis to the matrix of pyrite-acidified mining lakes

International Nuclear Information System (INIS)

Herzsprung, P.; Friese, K.

2000-01-01

Owing to the unusual matrix of pyrite-acidified mining lakes, the analysis of chemical parameters may be difficult. A number of methodological improvements have been developed so far, and a comprehensive validation of methods is envisaged. The adaptation of the available methods to small-volume samples of sediment pore waters and the adaptation of sensitivity to the expected concentration ranges is an important element of the methods applied in analyses of biogeochemical processes in mining lakes [de
Improving estimation of kinetic parameters in dynamic force spectroscopy using cluster analysis

Science.gov (United States)

Yen, Chi-Fu; Sivasankar, Sanjeevi

2018-03-01

Dynamic Force Spectroscopy (DFS) is a widely used technique to characterize the dissociation kinetics and interaction energy landscape of receptor-ligand complexes with single-molecule resolution. In an Atomic Force Microscope (AFM)-based DFS experiment, receptor-ligand complexes, sandwiched between an AFM tip and substrate, are ruptured at different stress rates by varying the speed at which the AFM-tip and substrate are pulled away from each other. The rupture events are grouped according to their pulling speeds, and the mean force and loading rate of each group are calculated. These data are subsequently fit to established models, and energy landscape parameters such as the intrinsic off-rate (koff) and the width of the potential energy barrier (xβ) are extracted. However, due to large uncertainties in determining mean forces and loading rates of the groups, errors in the estimated koff and xβ can be substantial. Here, we demonstrate that the accuracy of fitted parameters in a DFS experiment can be dramatically improved by sorting rupture events into groups using cluster analysis instead of sorting them according to their pulling speeds. We test different clustering algorithms including Gaussian mixture, logistic regression, and K-means clustering, under conditions that closely mimic DFS experiments. Using Monte Carlo simulations, we benchmark the performance of these clustering algorithms over a wide range of koff and xβ, under different levels of thermal noise, and as a function of both the number of unbinding events and the number of pulling speeds. Our results demonstrate that cluster analysis, particularly K-means clustering, is very effective in improving the accuracy of parameter estimation, particularly when the number of unbinding events is limited and the events are not well separated into distinct groups. Cluster analysis is easy to implement, and our performance benchmarks serve as a guide in choosing an appropriate method for DFS data analysis.
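The grouping step described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the rupture events are made-up numbers, the feature choice (log10 loading rate and force, each z-scored) is an assumption, and `zscore`, `kmeans` and `group_means` are hypothetical helper names. It uses plain Lloyd's K-means with a deterministic start so that the grouping is reproducible.

```python
import math
import statistics

def zscore(column):
    """Standardize one feature column to zero mean and unit variance."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)
    return [(v - mean) / std for v in column]

def kmeans(points, k, iters=100):
    """Lloyd's K-means on a list of tuples; returns one label per point."""
    # Deterministic start (an assumption): spread the initial centroids
    # over the points sorted by their first coordinate.
    ordered = sorted(points)
    n = len(ordered)
    centroids = [ordered[(2 * j + 1) * n // (2 * k)] for j in range(k)]
    labels = []
    for _ in range(iters):
        # Assignment step: nearest centroid by Euclidean distance.
        labels = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members.
        updated = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                updated.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                updated.append(centroids[j])
        if updated == centroids:
            break
        centroids = updated
    return labels

def group_means(events, labels, k):
    """Mean loading rate and mean force per cluster (assumes no empty cluster)."""
    result = []
    for j in range(k):
        rates, forces = zip(*[e for e, lab in zip(events, labels) if lab == j])
        result.append((statistics.fmean(rates), statistics.fmean(forces)))
    return result

# Hypothetical rupture events: (loading rate in pN/s, rupture force in pN).
events = [
    (1.0e2, 31.0), (1.3e2, 35.0), (0.8e2, 29.0),
    (1.0e3, 52.0), (1.2e3, 55.0), (0.9e3, 49.0),
    (1.0e4, 74.0), (1.1e4, 78.0), (0.9e4, 71.0),
]
features = list(zip(zscore([math.log10(r) for r, _ in events]),
                    zscore([f for _, f in events])))
labels = kmeans(features, k=3)
means = group_means(events, labels, k=3)
```

The per-cluster means in `means` would then take the place of the per-pulling-speed means when fitting a kinetic model; the fit itself is outside this sketch.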
Spatio-Temporal Rule Mining

DEFF Research Database (Denmark)

Gidofalvi, Gyozo; Pedersen, Torben Bach

2005-01-01

Recent advances in communication and information technology, such as the increasing accuracy of GPS technology and the miniaturization of wireless communication devices pave the road for Location-Based Services (LBS). To achieve high quality for such services, spatio-temporal data mining techniques are needed. In this paper, we describe experiences with spatio-temporal rule mining in a Danish data mining company. First, a number of real world spatio-temporal data sets are described, leading to a taxonomy of spatio-temporal data. Second, the paper describes a general methodology that transforms the spatio-temporal rule mining task to the traditional market basket analysis task and applies it to the described data sets, enabling traditional association rule mining methods to discover spatio-temporal rules for LBS. Finally, unique issues in spatio-temporal rule mining are identified and discussed.
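One way to picture the transformation to market basket analysis is sketched below. The abstract does not give the paper's actual procedure, so everything here is an assumption for illustration: locations are discretized into grid cells, times into fixed slots, and each user's set of (cell, slot) items is treated as one basket. The record layout and the names `to_baskets` and `frequent_pairs` are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations

def to_baskets(records, cell_size=1.0, slot_hours=6):
    """Turn (user, x, y, hour) records into one basket of items per user.

    Each item is ((cell_x, cell_y), time_slot): a discretized place and time.
    """
    baskets = defaultdict(set)
    for user, x, y, hour in records:
        cell = (int(x // cell_size), int(y // cell_size))
        slot = hour // slot_hours
        baskets[user].add((cell, slot))
    return dict(baskets)

def frequent_pairs(baskets, min_support):
    """Count item pairs that co-occur in at least min_support baskets."""
    counts = Counter()
    for items in baskets.values():
        for pair in combinations(sorted(items), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}

# Hypothetical position records: (user id, x, y, hour of day).
records = [
    ("u1", 0.2, 0.3, 8), ("u1", 3.5, 1.1, 18),
    ("u2", 0.7, 0.9, 9), ("u2", 3.1, 1.8, 19),
    ("u3", 0.4, 0.1, 7), ("u3", 5.0, 5.0, 13),
]
baskets = to_baskets(records)
pairs = frequent_pairs(baskets, min_support=2)
```

Once the data are in basket form, any standard association rule miner can be applied; the pair counting above merely stands in for that step.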
Clustering and classification of email contents

Directory of Open Access Journals (Sweden)

Izzat Alsmadi

2015-01-01

Full Text Available Information users depend heavily on the email system as one of the major sources of communication. Its importance and usage are continuously growing despite the evolution of mobile applications, social networks, etc. Emails are used on both the personal and professional levels. They can be considered as official documents in communication among users. Email data mining and analysis can be conducted for several purposes, such as spam detection and classification, subject classification, etc. In this paper, a large set of personal emails is used for the purpose of folder and subject classifications. Algorithms are developed to perform clustering and classification for this large text collection. Classification based on NGram is shown to be the best for such a large text collection, especially as the text is bilingual (i.e. with English and Arabic content).
Fatigue Feature Extraction Analysis based on a K-Means Clustering Approach

Directory of Open Access Journals (Sweden)

M.F.M. Yunoh

2015-06-01

Full Text Available This paper focuses on clustering analysis using a K-means approach for fatigue feature dataset extraction. The aim of this study is to group the dataset as closely as possible (homogeneity) for the scattered dataset. Kurtosis, the wavelet-based energy coefficient and fatigue damage are calculated for all segments after the extraction process using wavelet transform. Kurtosis, the wavelet-based energy coefficient and fatigue damage are used as input data for the K-means clustering approach. K-means clustering calculates the average distance of each group from the centroid and gives the objective function values. Based on the results, the maximum value of the objective function, 11.58, is found for two centroid clusters. The minimum objective function value, 8.06, is found for five centroid clusters. The objective function thus takes its lowest value when the number of clusters is five, which is therefore taken as the best clustering for the dataset.
The dynamics of cyclone clustering in re-analysis and a high-resolution climate model

Science.gov (United States)

Priestley, Matthew; Pinto, Joaquim; Dacre, Helen; Shaffrey, Len

2017-04-01

Extratropical cyclones have a tendency to occur in groups (clusters) in the exit of the North Atlantic storm track during wintertime, potentially leading to widespread socioeconomic impacts. The Winter of 2013/14 was the stormiest on record for the UK and was characterised by the recurrent clustering of intense extratropical cyclones. This clustering was associated with a strong, straight and persistent North Atlantic 250 hPa jet with Rossby wave-breaking (RWB) on both flanks, pinning the jet in place. Here, we provide for the first time an analysis of all clustered events in 36 years of the ERA-Interim Re-analysis at three latitudes (45˚ N, 55˚ N, 65˚ N) encompassing various regions of Western Europe. The relationship between the occurrence of RWB and cyclone clustering is studied in detail. Clustering at 55˚ N is associated with an extended and anomalously strong jet flanked on both sides by RWB. However, clustering at 65(45)˚ N is associated with RWB to the south (north) of the jet, deflecting the jet northwards (southwards). A positive correlation was found between the intensity of the clustering and RWB occurrence to the north and south of the jet. However, there is considerable spread in these relationships. Finally, analysis has shown that the relationships identified in the re-analysis are also present in a high-resolution coupled global climate model (HiGEM). In particular, clustering is associated with the same dynamical conditions at each of our three latitudes in spite of the identified biases in frequency and intensity of RWB.
Spatial variability of sediment erosion processes using GIS analysis within watersheds in a historically mined region, Patagonia Mountains, Arizona

Science.gov (United States)

Brady, Laura M.; Gray, Floyd; Wissler, Craig A.; Guertin, D. Phillip

2001-01-01

In this study, a geographic information system (GIS) is used to integrate and accurately map field studies, information from remotely sensed data, watershed models, and the dispersion of potentially toxic mine waste and tailings. The purpose of this study is to identify erosion rates and net sediment delivery of soil and mine waste/tailings to the drainage channel within several watershed regions to determine source areas of sediment delivery as a method of quantifying geo-environmental analysis of transport mechanisms in abandoned mine lands in arid climate conditions. Users of this study are the researchers interested in exploration of approaches to depicting historical activity in an area which has no baseline data records for environmental analysis of heavily mined terrain.
The ClusTree : indexing micro-clusters for anytime stream mining

DEFF Research Database (Denmark)

Kranen, Philipp; Assent, Ira; Baldauf, Corinna

2011-01-01

-arrival times of the stream. Likewise, memory is limited, making it impossible to store all data. For clustering, we are faced with the challenge of maintaining a current result that can be presented to the user at any given time. In this work, we propose a parameter-free algorithm that automatically adapts...... introduce the ClusTree, a compact and self-adaptive index structure for maintaining stream summaries. Additionally we present solutions to handle very fast streams through aggregation mechanisms and propose novel descent strategies that improve the clustering result on slower streams as long as time permits...
Environmental management zoning for coal mining in mainland China based on ecological and resources conditions.

Science.gov (United States)

Geng, Haiqing; Chen, Fan; Wang, Zhiyuan; Liu, Jie; Xu, Weihua

2017-05-01

The purpose of this research is to establish an environmental management zoning for the coal mining industry that serves as a basis for making environmental management policies. Based on the specific impacts of coal mining and regional characteristics of environment and resources, the ecological impact, water resources impact, and arable land impact are chosen as the zoning indexes to construct the index system. The ecological sensitivity is graded into three levels of low, medium, and high according to analytical hierarchy processes and gray fixed weight clustering analysis, and the water resources sensitivity is divided into five levels of lower, low, medium, high, and higher according to the weighted sum of sub-indexes, while only the arable land sensitive zone was extracted on the basis of the ratio of arable land to the county or city. By combining the ecological sensitivity zoning and the water resources sensitivity zoning and then overlaying the arable-sensitive areas, mainland China is classified into six types of environmental management zones for coal mining, excluding the forbidden exploitation areas.
Cluster analysis of HZE particle tracks as applied to space radiobiology problems

International Nuclear Information System (INIS)

Batmunkh, M.; Bayarchimeg, L.; Lkhagva, O.; Belov, O.

2013-01-01

A cluster analysis is performed of ionizations in tracks produced by the most abundant nuclei in the charge and energy spectra of the galactic cosmic rays. The frequency distribution of clusters is estimated for cluster sizes comparable to the DNA molecule at different packaging levels. For this purpose, an improved K-mean-based algorithm is suggested. This technique allows processing particle tracks containing a large number of ionization events without setting the number of clusters as an input parameter. Using this method, the ionization distribution pattern is analyzed depending on the cluster size and particle's linear energy transfer
Mining biological information from 3D short time-series gene expression data: the OPTricluster algorithm.

Science.gov (United States)

Tchagang, Alain B; Phan, Sieu; Famili, Fazel; Shearer, Heather; Fobert, Pierre; Huang, Yi; Zou, Jitao; Huang, Daiqing; Cutler, Adrian; Liu, Ziying; Pan, Youlian

2012-04-04

Nowadays, it is possible to collect expression levels of a set of genes from a set of biological samples during a series of time points. Such data have three dimensions: gene-sample-time (GST). Thus they are called 3D microarray gene expression data. To take advantage of the 3D data collected, and to fully understand the biological knowledge hidden in the GST data, novel subspace clustering algorithms have to be developed to effectively address the biological problem in the corresponding space. We developed a subspace clustering algorithm called Order Preserving Triclustering (OPTricluster), for 3D short time-series data mining. OPTricluster is able to identify 3D clusters with coherent evolution from a given 3D dataset using a combinatorial approach on the sample dimension, and the order preserving (OP) concept on the time dimension. The fusion of the two methodologies allows one to study similarities and differences between samples in terms of their temporal expression profile. OPTricluster has been successfully applied to four case studies: immune response in mice infected by malaria (Plasmodium chabaudi), systemic acquired resistance in Arabidopsis thaliana, similarities and differences between inner and outer cotyledon in Brassica napus during seed development, and to Brassica napus whole seed development. These studies showed that OPTricluster is robust to noise and is able to detect the similarities and differences between biological samples. Our analysis showed that OPTricluster generally outperforms other well known clustering algorithms such as the TRICLUSTER, gTRICLUSTER and K-means; it is robust to noise and can effectively mine the biological knowledge hidden in the 3D short time-series gene expression data.
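The order-preserving idea on the time dimension can be illustrated with a small sketch. It is not the OPTricluster algorithm itself, whose details the abstract does not give. It only groups genes whose expression values rank the time points in the same order. The expression values and the names `order_pattern` and `group_by_pattern` are hypothetical.

```python
from collections import defaultdict

def order_pattern(values):
    """Return the time-point indices sorted by increasing expression value."""
    return tuple(sorted(range(len(values)), key=lambda t: values[t]))

def group_by_pattern(expression):
    """Group genes that share the same ordering of time points."""
    groups = defaultdict(list)
    for gene, series in expression.items():
        groups[order_pattern(series)].append(gene)
    return dict(groups)

# Hypothetical expression levels of four genes at three time points.
expression = {
    "g1": [1.0, 2.5, 4.0],   # steadily increasing
    "g2": [0.2, 0.9, 1.1],   # steadily increasing, different scale
    "g3": [3.0, 1.0, 2.0],   # drops, then partly recovers
    "g4": [5.0, 2.0, 4.0],   # drops, then partly recovers
}
groups = group_by_pattern(expression)
```

Genes g1 and g2 end up together even though their absolute levels differ, which is the point of comparing orderings instead of raw values. A practical method would also need a tolerance for near-equal values, which this sketch omits.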
A High-Order CFS Algorithm for Clustering Big Data

Directory of Open Access Journals (Sweden)

Fanyu Bu

2016-01-01

Full Text Available With the development of Internet of Everything such as Internet of Things, Internet of People, and Industrial Internet, big data is being generated. Clustering is a widely used technique for big data analytics and mining. However, most current algorithms are not effective at clustering heterogeneous data, which is prevalent in big data. In this paper, we propose a high-order CFS algorithm (HOCFS) to cluster heterogeneous data by combining the CFS clustering algorithm and the dropout deep learning model, whose functionality rests on three pillars: (i) an adaptive dropout deep learning model to learn features from each type of data, (ii) a feature tensor model to capture the correlations of heterogeneous data, and (iii) a tensor distance-based high-order CFS algorithm to cluster heterogeneous data. Furthermore, we verify our proposed algorithm on different datasets, by comparison with two other clustering schemes, that is, HOPCM and CFS. Results confirm the effectiveness of the proposed algorithm in clustering heterogeneous data.
Cluster analysis of autoantibodies in 852 patients with systemic lupus erythematosus from a single center.

Science.gov (United States)

Artim-Esen, Bahar; Çene, Erhan; Şahinkaya, Yasemin; Ertan, Semra; Pehlivan, Özlem; Kamali, Sevil; Gül, Ahmet; Öcal, Lale; Aral, Orhan; Inanç, Murat

2014-07-01

Associations between autoantibodies and clinical features have been described in systemic lupus erythematosus (SLE). Herein, we aimed to define autoantibody clusters and their clinical correlations in a large cohort of patients with SLE. We analyzed 852 patients with SLE who attended our clinic. Seven autoantibodies were selected for cluster analysis: anti-DNA, anti-Sm, anti-RNP, anticardiolipin (aCL) immunoglobulin (Ig)G or IgM, lupus anticoagulant (LAC), anti-Ro, and anti-La. Two-step clustering and Kaplan-Meier survival analyses were used. Five clusters were identified. A cluster consisted of patients with only anti-dsDNA antibodies, a cluster of anti-Sm and anti-RNP, a cluster of aCL IgG/M and LAC, and a cluster of anti-Ro and anti-La antibodies. Analysis revealed 1 more cluster that consisted of patients who did not belong to any of the clusters formed by antibodies chosen for cluster analysis. Sm/RNP cluster had significantly higher incidence of pulmonary hypertension and Raynaud phenomenon. DsDNA cluster had the highest incidence of renal involvement. In the aCL/LAC cluster, there were significantly more patients with neuropsychiatric involvement, antiphospholipid syndrome, autoimmune hemolytic anemia, and thrombocytopenia. According to the Systemic Lupus International Collaborating Clinics damage index, the highest frequency of damage was in the aCL/LAC cluster. Comparison of 10 and 20 years survival showed reduced survival in the aCL/LAC cluster. This study supports the existence of autoantibody clusters with distinct clinical features in SLE and shows that forming clinical subsets according to autoantibody clusters may be useful in predicting the outcome of the disease. Autoantibody clusters in SLE may exhibit differences according to the clinical setting or population.
Application of cluster analysis to geochemical compositional data for identifying ore-related geochemical anomalies

Science.gov (United States)

Zhou, Shuguang; Zhou, Kefa; Wang, Jinlin; Yang, Genfang; Wang, Shanshan

2017-12-01

Cluster analysis is a well-known technique that is used to analyze various types of data. In this study, cluster analysis is applied to geochemical data that describe 1444 stream sediment samples collected in northwestern Xinjiang with a sample spacing of approximately 2 km. Three algorithms (the hierarchical, k-means, and fuzzy c-means algorithms) and six data transformation methods (the z-score standardization, ZST; the logarithmic transformation, LT; the additive log-ratio transformation, ALT; the centered log-ratio transformation, CLT; the isometric log-ratio transformation, ILT; and no transformation, NT) are compared in terms of their effects on the cluster analysis of the geochemical compositional data. The study shows that, on the one hand, the ZST does not affect the results of column- or variable-based (R-type) cluster analysis, whereas the other methods, including the LT, the ALT, and the CLT, have substantial effects on the results. On the other hand, the results of the row- or observation-based (Q-type) cluster analysis obtained from the geochemical data after applying NT and the ZST are relatively poor. However, we derive some improved results from the geochemical data after applying the CLT, the ILT, the LT, and the ALT. Moreover, the k-means and fuzzy c-means clustering algorithms are more reliable than the hierarchical algorithm when they are used to cluster the geochemical data. We apply cluster analysis to the geochemical data to explore for Au deposits within the study area, and we obtain a good correlation between the results retrieved by combining the CLT or the ILT with the k-means or fuzzy c-means algorithms and the potential zones of Au mineralization. Therefore, we suggest that the combination of the CLT or the ILT with the k-means or fuzzy c-means algorithms is an effective tool to identify potential zones of mineralization from geochemical data.
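The centered log-ratio transformation (CLT in this abstract) maps each part of a composition to the logarithm of its ratio to the geometric mean of all parts. The sketch below shows that transform and the resulting distance between samples, which is what a k-means or fuzzy c-means step would then operate on. The sample values and the function names `clr` and `aitchison_distance` are illustrative assumptions, and zero concentrations would need to be replaced before taking logarithms.

```python
import math

def clr(composition):
    """Centered log-ratio: log of each part minus the mean of the logs."""
    logs = [math.log(x) for x in composition]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]

def aitchison_distance(a, b):
    """Euclidean distance between the CLR-transformed compositions."""
    return math.dist(clr(a), clr(b))

# Hypothetical three-element compositions of stream sediment samples.
sample_percent = [70.0, 20.0, 10.0]   # expressed in percent
sample_fraction = [0.7, 0.2, 0.1]     # same sample as fractions
other_sample = [0.1, 0.2, 0.7]        # a differently composed sample

same = aitchison_distance(sample_percent, sample_fraction)
different = aitchison_distance(sample_fraction, other_sample)
```

Because the transform depends only on ratios between parts, the same sample expressed in percent or in fractions lands on the same point, so the clustering does not depend on the unit in which concentrations are reported.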
Influence of birth cohort on age of onset cluster analysis in bipolar I disorder

DEFF Research Database (Denmark)

Bauer, M; Glenn, T; Alda, M

2015-01-01

Purpose: Two common approaches to identify subgroups of patients with bipolar disorder are clustering methodology (mixture analysis) based on the age of onset, and a birth cohort analysis. This study investigates if a birth cohort effect will influence the results of clustering on the age of onset...... cohort. Model-based clustering (mixture analysis) was then performed on the age of onset data using the residuals. Clinical variables in subgroups were compared. Results: There was a strong birth cohort effect. Without adjusting for the birth cohort, three subgroups were found by clustering. After...... on the age of onset, and that there is a birth cohort effect. Including the birth cohort adjustment altered the number and characteristics of subgroups detected when clustering by age of onset. Further investigation is needed to determine if combining both approaches will identify subgroups that are more...
MMPI-2: Cluster Analysis of Personality Profiles in Perinatal Depression—Preliminary Evidence

Directory of Open Access Journals (Sweden)

Valentina Meuti

2014-01-01

Full Text Available Background. To assess personality characteristics of women who develop perinatal depression. Methods. The study started with a screening of a sample of 453 women in their third trimester of pregnancy, to which was administered a survey data form, the Edinburgh Postnatal Depression Scale (EPDS) and the Minnesota Multiphasic Personality Inventory 2 (MMPI-2). A clinical group of subjects with perinatal depression (PND, 55 subjects) was selected; clinical and validity scales of MMPI-2 were used as predictors in hierarchical cluster analysis carried out. Results. The analysis identified three clusters of personality profile: two “clinical” clusters (1 and 3) and an “apparently common” one (cluster 2). The first cluster (39.5%) collects structures of personality with prevalent obsessive or dependent functioning tending to develop a “psychasthenic” depression; the third cluster (13.95%) includes women with prevalent borderline functioning tending to develop “dysphoric” depression; the second cluster (46.5%) shows a normal profile with a “defensive” attitude, probably due to the presence of defense mechanisms or to the fear of stigma. Conclusion. Characteristics of personality have a key role in clinical manifestations of perinatal depression; it is important to detect them to identify mothers at risk and to plan targeted therapeutic interventions.
MMPI-2: Cluster Analysis of Personality Profiles in Perinatal Depression—Preliminary Evidence

Science.gov (United States)

Grillo, Alessandra; Lauriola, Marco; Giacchetti, Nicoletta

2014-01-01

Background. To assess personality characteristics of women who develop perinatal depression. Methods. The study started with a screening of a sample of 453 women in their third trimester of pregnancy, to which was administered a survey data form, the Edinburgh Postnatal Depression Scale (EPDS) and the Minnesota Multiphasic Personality Inventory 2 (MMPI-2). A clinical group of subjects with perinatal depression (PND, 55 subjects) was selected; clinical and validity scales of MMPI-2 were used as predictors in hierarchical cluster analysis carried out. Results. The analysis identified three clusters of personality profile: two “clinical” clusters (1 and 3) and an “apparently common” one (cluster 2). The first cluster (39.5%) collects structures of personality with prevalent obsessive or dependent functioning tending to develop a “psychasthenic” depression; the third cluster (13.95%) includes women with prevalent borderline functioning tending to develop “dysphoric” depression; the second cluster (46.5%) shows a normal profile with a “defensive” attitude, probably due to the presence of defense mechanisms or to the fear of stigma. Conclusion. Characteristics of personality have a key role in clinical manifestations of perinatal depression; it is important to detect them to identify mothers at risk and to plan targeted therapeutic interventions. PMID:25574499
The application of the analytic hierarchy process (AHP) in uranium mine mining method of the optimal selection

International Nuclear Information System (INIS)

Tan Zhongyin; Kuang Zhengping; Qiu Huiyuan

2014-01-01

The analytic hierarchy process (AHP) is a systematic, hierarchical analysis method that combines qualitative and quantitative assessment. This article applies the basic decision theory of AHP, taking a project in the north Guangdong region as the research object, and establishes a hierarchical analysis model for optimizing the choice of in-situ mining method. The results show that the AHP model for mining method selection was reliable, that the optimization results were consistent with the in-situ mining method actually in use, and that the model has good practicability. (authors)
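
As a rough illustration of the kind of calculation AHP involves, the sketch below derives priority weights from a pairwise comparison matrix and checks its consistency. It is a minimal example under stated assumptions, not the model used in the article: the three criteria (safety, cost, recovery) and all comparison values are hypothetical, the weights use the row geometric mean approximation rather than an exact eigenvector computation, and the helper names are invented for this sketch.

```python
import math

# Saaty's random consistency index for comparison matrices of size 3 to 5.
RANDOM_INDEX = {3: 0.58, 4: 0.90, 5: 1.12}


def ahp_weights(matrix):
    """Approximate AHP priority weights with the row geometric mean method."""
    n = len(matrix)
    geo_means = [math.prod(row) ** (1.0 / n) for row in matrix]
    total = sum(geo_means)
    return [g / total for g in geo_means]


def consistency_ratio(matrix, weights):
    """Estimate the consistency ratio (CR) of a pairwise comparison matrix."""
    n = len(matrix)
    # Estimate lambda_max as the mean of (A w)_i / w_i over all rows.
    aw = [sum(matrix[i][j] * weights[j] for j in range(n)) for i in range(n)]
    lambda_max = sum(aw[i] / weights[i] for i in range(n)) / n
    consistency_index = (lambda_max - n) / (n - 1)
    return consistency_index / RANDOM_INDEX[n]


# Hypothetical pairwise comparisons of three criteria: safety, cost, recovery.
# Entry [i][j] states how much more important criterion i is than criterion j.
criteria = [
    [1.0, 3.0, 5.0],
    [1 / 3, 1.0, 2.0],
    [1 / 5, 1 / 2, 1.0],
]

weights = ahp_weights(criteria)
cr = consistency_ratio(criteria, weights)
print("weights:", [round(w, 3) for w in weights])
print("consistency ratio:", round(cr, 4))
```

A consistency ratio below 0.1 is the conventional threshold for accepting the pairwise judgments; in a full AHP study the same step would be repeated for each candidate mining method under each criterion before the scores are combined.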
MONITORING METAL POLLUTION LEVELS IN MINE WASTES AROUND A COAL MINE SITE USING GIS

Directory of Open Access Journals (Sweden)

D. Sanliyuksel Yucel

2017-11-01

Full Text Available In this case study, metal pollution levels in mine wastes at a coal mine site in Etili coal mine (Can coal basin, NW Turkey) are evaluated using geographical information system (GIS) tools. Etili coal mine was operated since the 1980s as an open pit. Acid mine drainage is the main environmental problem around the coal mine. The main environmental contamination source is mine wastes stored around the mine site. Mine wastes were dumped over an extensive area along the riverbeds, and are now abandoned. Mine waste samples were homogenously taken at 10 locations within the sampling area of 102.33 ha. The paste pH and electrical conductivity values of mine wastes ranged from 2.87 to 4.17 and 432 to 2430 μS/cm, respectively. Maximum Al, Fe, Mn, Pb, Zn and Ni concentrations of wastes were measured as 109300, 70600, 309.86, 115.2, 38 and 5.3 mg/kg, respectively. The Al, Fe and Pb concentrations of mine wastes are higher than world surface rock average values. The geochemical analysis results from the study area were presented in the form of maps. The GIS based environmental database will serve as a reference study for our future work.
Monitoring Metal Pollution Levels in Mine Wastes around a Coal Mine Site Using GIS

Science.gov (United States)

Sanliyuksel Yucel, D.; Yucel, M. A.; Ileri, B.

2017-11-01

In this case study, metal pollution levels in mine wastes at a coal mine site in Etili coal mine (Can coal basin, NW Turkey) are evaluated using geographical information system (GIS) tools. Etili coal mine was operated since the 1980s as an open pit. Acid mine drainage is the main environmental problem around the coal mine. The main environmental contamination source is mine wastes stored around the mine site. Mine wastes were dumped over an extensive area along the riverbeds, and are now abandoned. Mine waste samples were homogenously taken at 10 locations within the sampling area of 102.33 ha. The paste pH and electrical conductivity values of mine wastes ranged from 2.87 to 4.17 and 432 to 2430 μS/cm, respectively. Maximum Al, Fe, Mn, Pb, Zn and Ni concentrations of wastes were measured as 109300, 70600, 309.86, 115.2, 38 and 5.3 mg/kg, respectively. The Al, Fe and Pb concentrations of mine wastes are higher than world surface rock average values. The geochemical analysis results from the study area were presented in the form of maps. The GIS based environmental database will serve as a reference study for our future work.
