WorldWideScience

Sample records for data mining

  1. Data mining

    CERN Document Server

    Gorunescu, Florin

    2011-01-01

    The knowledge discovery process is as old as Homo sapiens. Until some time ago, this process was solely based on the 'natural personal' computer provided by Mother Nature. Fortunately, in recent decades the problem has begun to be solved based on the development of the Data mining technology, aided by the huge computational power of the 'artificial' computers. Digging intelligently in different large databases, data mining aims to extract implicit, previously unknown and potentially useful information from data, since 'knowledge is power'. The goal of this book is to provide, in a friendly way

  2. Social big data mining

    CERN Document Server

    Ishikawa, Hiroshi

    2015-01-01

    Social Media. Big Data and Social Data. Hypotheses in the Era of Big Data. Social Big Data Applications. Basic Concepts in Data Mining. Association Rule Mining. Clustering. Classification. Prediction. Web Structure Mining. Web Content Mining. Web Access Log Mining, Information Extraction and Deep Web Mining. Media Mining. Scalability and Outlier Detection.

  3. Data Mining for CRM

    Science.gov (United States)

    Thearling, Kurt

    Data Mining technology allows marketing organizations to better understand their customers and respond to their needs. This chapter describes how Data Mining can be combined with customer relationship management to help drive improved interactions with customers. An example showing how to use Data Mining to drive customer acquisition activities is presented.

  4. Data mining, mining data : energy consumption modelling

    Energy Technology Data Exchange (ETDEWEB)

    Dessureault, S. [Arizona Univ., Tucson, AZ (United States)]

    2007-09-15

    Most modern mining operations are accumulating large amounts of data on production and business processes. Data, however, provides value only if it can be translated into information that appropriate users can utilize. This paper emphasized that a new technological focus should emerge, notably how to concentrate data into information; analyze information sufficiently to become knowledge; and, act on that knowledge. Researchers at the Mining Information Systems and Operations Management (MISOM) laboratory at the University of Arizona have created a method to transform data into action. The data-to-action approach was exercised in the development of an energy consumption model (ECM), in partnership with a major US-based copper mining company, 2 software companies, and the MISOM laboratory. The approach begins by integrating several key data sources using data warehousing techniques, and increasing the existing level of integration and data cleaning. An online analytical processing (OLAP) cube was also created to investigate the data and identify a subset of several million records. Data mining algorithms were applied using the information that was isolated by the OLAP cube. The data mining results showed that traditional cost drivers of energy consumption are poor predictors. A comparison was made between traditional methods of predicting energy consumption and the prediction formed using data mining. Traditionally, in the mines for which data were available, monthly averages of tons and distance are used to predict diesel fuel consumption. However, this article showed that new information technology can be used to incorporate many more variables into the budgeting process, resulting in more accurate predictions. The ECM helped mine planners improve the prediction of energy use through more data integration, measure development, and workflow analysis. 5 refs., 11 figs.

  5. Data mining in radiology

    International Nuclear Information System (INIS)

    Kharat, Amit T; Singh, Amarjit; Kulkarni, Vilas M; Shah, Digish

    2014-01-01

    Data mining facilitates the study of radiology data in various dimensions. It converts large patient image and text datasets into useful information that helps in improving patient care and provides informative reports. Data mining technology analyzes data within the Radiology Information System and Hospital Information System using specialized software which assesses relationships and agreement in available information. By using similar data analysis tools, radiologists can make informed decisions and predict the future outcome of a particular imaging finding. Data, information and knowledge are the components of data mining. Classes, Clusters, Associations, Sequential patterns, Classification, Prediction and Decision tree are the various types of data mining. Data mining has the potential to make delivery of health care affordable and ensure that the best imaging practices are followed. It is a tool for academic research. Data mining is considered to be ethically neutral; however, concerns regarding privacy and legality exist which need to be addressed to ensure the success of data mining

  6. Collaborative Data Mining

    Science.gov (United States)

    Moyle, Steve

    Collaborative Data Mining is a setting where the Data Mining effort is distributed to multiple collaborating agents - human or software. The objective of the collaborative Data Mining effort is to produce solutions to the tackled Data Mining problem which are considered better by some metric, with respect to those solutions that would have been achieved by individual, non-collaborating agents. The solutions require evaluation, comparison, and approaches for combination. Collaboration requires communication, and implies some form of community. The human form of collaboration is a social task. Organizing communities in an effective manner is non-trivial and often requires well defined roles and processes. Data Mining, too, benefits from a standard process. This chapter explores the standard Data Mining process CRISP-DM utilized in a collaborative setting.

  7. Data mining for service

    CERN Document Server

    2014-01-01

    Virtually all nontrivial and modern service related problems and systems involve data volumes and types that clearly fall into what is presently meant as "big data", that is, are huge, heterogeneous, complex, distributed, etc. Data mining is a series of processes which include collecting and accumulating data, modeling phenomena, and discovering new information, and it is one of the most important steps to scientific analysis of the processes of services.  Data mining application in services requires a thorough understanding of the characteristics of each service and knowledge of the compatibility of data mining technology within each particular service, rather than knowledge only in calculation speed and prediction accuracy. Varied examples of services provided in this book will help readers understand the relation between services and data mining technology. This book is intended to stimulate interest among researchers and practitioners in the relation between data mining technology and its application to ...

  8. Data preprocessing in data mining

    CERN Document Server

    García, Salvador; Herrera, Francisco

    2015-01-01

    Data Preprocessing for Data Mining addresses one of the most important issues within the well-known Knowledge Discovery from Data process. Data taken directly from the source will likely have inconsistencies and errors or, most importantly, will not be ready to be considered for a data mining process. Furthermore, the increasing amount of data in recent science, industry and business applications calls for more complex tools to analyze it. Thanks to data preprocessing, it is possible to convert the impossible into possible, adapting the data to fulfill the input demands of each data mining algorithm. Data preprocessing includes the data reduction techniques, which aim at reducing the complexity of the data, detecting or removing irrelevant and noisy elements from the data. This book is intended to review the tasks that fill the gap between the data acquisition from the source and the data mining process. A comprehensive look from a practical point of view, including basic concepts and surveying t...

  9. Data Stream Mining

    Science.gov (United States)

    Gaber, Mohamed Medhat; Zaslavsky, Arkady; Krishnaswamy, Shonali

    Data mining is concerned with the process of computationally extracting hidden knowledge structures represented in models and patterns from large data repositories. It is an interdisciplinary field of study that has its roots in databases, statistics, machine learning, and data visualization. Data mining has emerged as a direct outcome of the data explosion that resulted from the success in database and data warehousing technologies over the past two decades (Fayyad, 1997; Fayyad, 1998; Kantardzic, 2003).

  10. Data mining in agriculture

    CERN Document Server

    Mucherino, Antonio; Pardalos, Panos M

    2009-01-01

    Data Mining in Agriculture represents a comprehensive effort to provide graduate students and researchers with an analytical text on data mining techniques applied to agriculture and environmental related fields. This book presents both theoretical and practical insights with a focus on presenting the context of each data mining technique rather intuitively with ample concrete examples represented graphically and with algorithms written in MATLAB®. Examples and exercises with solutions are provided at the end of each chapter to facilitate the comprehension of the material. For each data mining technique described in the book variants and improvements of the basic algorithm are also given. Also by P.J. Papajorgji and P.M. Pardalos: Advances in Modeling Agricultural Systems, 'Springer Optimization and its Applications' vol. 25, ©2009.

  11. Ensemble Data Mining Methods

    Data.gov (United States)

    National Aeronautics and Space Administration — Ensemble Data Mining Methods, also known as Committee Methods or Model Combiners, are machine learning methods that leverage the power of multiple models to achieve...

  12. Applied data mining

    CERN Document Server

    Xu, Guandong

    2013-01-01

    Data mining has witnessed substantial advances in recent decades. New research questions and practical challenges have arisen from emerging areas and applications within the various fields closely related to human daily life, e.g. social media and social networking. This book aims to bridge the gap between traditional data mining and the latest advances in newly emerging information services. It explores the extension of well-studied algorithms and approaches into these new research arenas.

  13. Data mining methods

    CERN Document Server

    Chattamvelli, Rajan

    2015-01-01

    DATA MINING METHODS, Second Edition discusses both theoretical foundation and practical applications of data mining in a web field including banking, e-commerce, medicine, engineering and management. This book starts by introducing data and information, basic data type, data category and applications of data mining. The second chapter briefly reviews data visualization technology and importance in data mining. Fundamentals of probability and statistics are discussed in chapter 3, and novel algorithm for sample covariants are derived. The next two chapters give an in-depth and useful discussion of data warehousing and OLAP. Decision trees are clearly explained and a new tabular method for decision tree building is discussed. The chapter on association rules discusses popular algorithms and compares various algorithms in summary table form. An interesting application of genetic algorithm is introduced in the next chapter. Foundations of neural networks are built from scratch and the back propagation algorithm is derived...

  14. Data mining for dummies

    CERN Document Server

    Brown, Meta S

    2014-01-01

    Delve into your data for the key to success. Data mining is quickly becoming integral to creating value and business momentum. The ability to detect unseen patterns hidden in the numbers exhaustively generated by day-to-day operations allows savvy decision-makers to exploit every tool at their disposal in the pursuit of better business. By creating models and testing whether patterns hold up, it is possible to discover new intelligence that could change your business's entire paradigm for a more successful outcome. Data Mining for Dummies shows you why it doesn't take a data scientist to gain

  15. Biomedical Data Mining

    NARCIS (Netherlands)

    Peek, N.; Combi, C.; Tucker, A.

    2009-01-01

    Objective: To introduce the special topic of Methods of Information in Medicine on data mining in biomedicine, with selected papers from two workshops on Intelligent Data Analysis in bioMedicine (IDAMAP) held in Verona (2006) and Amsterdam (2007). Methods: Defining the field of biomedical data

  16. Security Measures in Data Mining

    OpenAIRE

    Anish Gupta; Vimal Bibhu; Rashid Hussain

    2012-01-01

    Data mining is a technique to dig the data from the large databases for analysis and executive decision making. Security aspect is one of the measure requirement for data mining applications. In this paper we present security requirement measures for the data mining. We summarize the requirements of security for data mining in tabular format. The summarization is performed by the requirements with different aspects of security measure of data mining. The performances and outcomes are determin...

  17. Data mining mobile devices

    CERN Document Server

    Mena, Jesus

    2013-01-01

    With today's consumers spending more time on their mobiles than on their PCs, new methods of empirical stochastic modeling have emerged that can provide marketers with detailed information about the products, content, and services their customers desire. Data Mining Mobile Devices defines the collection of machine-sensed environmental data pertaining to human social behavior. It explains how the integration of data mining and machine learning can enable the modeling of conversation context, proximity sensing, and geospatial location throughout large communities of mobile users

  18. Mining Views : database views for data mining

    NARCIS (Netherlands)

    Blockeel, H.; Calders, T.; Fromont, É.; Goethals, B.; Prado, A.

    2008-01-01

    We present a system towards the integration of data mining into relational databases. To this end, a relational database model is proposed, based on the so called virtual mining views. We show that several types of patterns and models over the data, such as itemsets, association rules and decision

  19. Mining Views : database views for data mining

    NARCIS (Netherlands)

    Blockeel, H.; Calders, T.; Fromont, É.; Goethals, B.; Prado, A.; Nijssen, S.; De Raedt, L.

    2007-01-01

    We propose a relational database model towards the integration of data mining into relational database systems, based on the so called virtual mining views. We show that several types of patterns and models over the data, such as itemsets, association rules, decision trees and clusterings, can be

  20. Large Data Set Mining

    NARCIS (Netherlands)

    Leemans, I.B.; Broomhall, Susan

    2017-01-01

    Digital emotion research has yet to make history. Until now large data set mining has not been a very active field of research in early modern emotion studies. This is indeed surprising since first, the early modern field has such rich, copyright-free, digitized data sets and second, emotion studies

  1. Data preprocessing for data mining

    OpenAIRE

    Ren, Yifei

    2013-01-01

    People have increasing amounts of data in the current prosperous information age. In order to improve competitive power and work efficiency, discovering knowledge from data is becoming more and more important. Data mining, as an emerging interdisciplinary applications field, plays a significant role in decision making across various trades and industries. However, it is known that original data is always dirty and not suitable for further analysis, which has become a major obstacle of finding knowled...

  2. Data mining in Cloud Computing

    Directory of Open Access Journals (Sweden)

    Ruxandra-Ştefania PETRE

    2012-10-01

    This paper describes how data mining is used in cloud computing. Data mining is used for extracting potentially useful information from raw data. The integration of data mining techniques into normal day-to-day activities has become commonplace. Every day people are confronted with targeted advertising, and data mining techniques help businesses to become more efficient by reducing costs. Data mining techniques and applications are very much needed in the cloud computing paradigm. The implementation of data mining techniques through cloud computing will allow users to retrieve meaningful information from a virtually integrated data warehouse that reduces the costs of infrastructure and storage.

  3. Data Mining and Analysis

    Science.gov (United States)

    Samms, Kevin O.

    2015-01-01

    The Data Mining project seeks to bring the capability of data visualization to NASA anomaly and problem reporting systems for the purpose of improving data trending, evaluations, and analyses. Currently NASA systems are tailored to meet the specific needs of its organizations. This tailoring has led to a variety of nomenclatures and levels of annotation for procedures, parts, and anomalies making difficult the realization of the common causes for anomalies. Making significant observations and realizing the connection between these causes without a common way to view large data sets is difficult to impossible. In the first phase of the Data Mining project a portal was created to present a common visualization of normalized sensitive data to customers with the appropriate security access. The tool of the visualization itself was also developed and fine-tuned. In the second phase of the project we took on the difficult task of searching and analyzing the target data set for common causes between anomalies. In the final part of the second phase we have learned more about how much of the analysis work will be the job of the Data Mining team, how to perform that work, and how that work may be used by different customers in different ways. In this paper I detail how our perspective has changed after gaining more insight into how the customers wish to interact with the output and how that has changed the product.

  4. Organizational Data Mining

    Science.gov (United States)

    Nemati, Hamid R.; Barko, Christopher D.

    Many organizations today possess substantial quantities of business information but have very little real business knowledge. A recent survey of 450 business executives reported that managerial intuition and instinct are more prevalent than hard facts in driving organizational decisions. To reverse this trend, businesses of all sizes would be well advised to adopt Organizational Data Mining (ODM). ODM is defined as leveraging Data Mining tools and technologies to enhance the decision-making process by transforming data into valuable and actionable knowledge to gain a competitive advantage. ODM has helped many organizations optimize internal resource allocations while better understanding and responding to the needs of their customers. The fundamental aspects of ODM can be categorized into Artificial Intelligence (AI), Information Technology (IT), and Organizational Theory (OT), with OT being the key distinction between ODM and Data Mining. In this chapter, we introduce ODM, explain its unique characteristics, and report on the current status of ODM research. Next we illustrate how several leading organizations have adopted ODM and are benefiting from it. Then we examine the evolution of ODM to the present day and conclude our chapter by contemplating ODM's challenging yet opportunistic future.

  5. Data mining and education.

    Science.gov (United States)

    Koedinger, Kenneth R; D'Mello, Sidney; McLaughlin, Elizabeth A; Pardos, Zachary A; Rosé, Carolyn P

    2015-01-01

    An emerging field of educational data mining (EDM) is building on and contributing to a wide variety of disciplines through analysis of data coming from various educational technologies. EDM researchers are addressing questions of cognition, metacognition, motivation, affect, language, social discourse, etc. using data from intelligent tutoring systems, massive open online courses, educational games and simulations, and discussion forums. The data include detailed action and timing logs of student interactions in user interfaces such as graded responses to questions or essays, steps in rich problem solving environments, games or simulations, discussion forum posts, or chat dialogs. They might also include external sensors such as eye tracking, facial expression, body movement, etc. We review how EDM has addressed the research questions that surround the psychology of learning with an emphasis on assessment, transfer of learning and model discovery, the role of affect, motivation and metacognition on learning, and analysis of language data and collaborative learning. For example, we discuss (1) how different statistical assessment methods were used in a data mining competition to improve prediction of student responses to intelligent tutor tasks, (2) how better cognitive models can be discovered from data and used to improve instruction, (3) how data-driven models of student affect can be used to focus discussion in a dialog-based tutoring system, and (4) how machine learning techniques applied to discussion data can be used to produce automated agents that support student learning as they collaborate in a chat room or a discussion board. © 2015 John Wiley & Sons, Ltd.

  6. Data mining for bioinformatics applications

    CERN Document Server

    Zengyou, He

    2015-01-01

    Data Mining for Bioinformatics Applications provides valuable information on the data mining methods that have been widely used for solving real bioinformatics problems, including problem definition, data collection, data preprocessing, modeling, and validation. The text uses an example-based method to illustrate how to apply data mining techniques to solve real bioinformatics problems, containing 45 bioinformatics problems that have been investigated in recent research. For each example, the entire data mining process is described, ranging from data preprocessing to modeling and result validation.

  7. Data mining goes multidimensional.

    Science.gov (United States)

    Hettler, M

    1997-03-01

    The success of a healthcare organization depends on its ability to acquire, store, analyze and compare data across many parts of the enterprise, by many individuals. While relational databases have been around since the 1970s, their two-dimensional structure has limited--or made impossible--the kind of cross-dimensional trend analysis so necessary to healthcare today. Enter online analytical processing (OLAP), in which servers store data in multiple dimensions, opening a world of opportunity for data-mining across the enterprise. In this issue of HEALTHCARE INFORMATICS, we feature our first report from the National Software Testing Laboratories (NSTL) about technologies that will change the way healthcare does business. A division of The McGraw-Hill Companies, NSTL is an independent software and hardware testing lab offering services that include compatibility testing, bug testing, comparison testing, documentation evaluation and usability.

  8. Data Mining SIAM Presentation

    Science.gov (United States)

    Srivastava, Ashok; McIntosh, Dawn; Castle, Pat; Pontikakis, Manos; Diev, Vesselin; Zane-Ulman, Brett; Turkov, Eugene; Akella, Ram; Xu, Zuobing; Kumaresan, Sakthi Preethi

    2006-01-01

    This viewgraph document describes the data mining system developed at NASA Ames. Many NASA programs have large numbers (and types) of problem reports. These free text reports are written by a number of different people, thus the emphasis and wording vary considerably. With so much data to sift through, analysts (subject experts) need help identifying any possible safety issues or concerns and help confirming that they haven't missed important problems. Unsupervised clustering is the initial step to accomplish this; we think we can go much farther, specifically, identify possible recurring anomalies. Recurring anomalies may be indicators of larger systemic problems. The requirement to identify these anomalies has led to the development of Recurring Anomaly Discovery System (ReADS).

  9. Data Mining Applications in Livestock

    Directory of Open Access Journals (Sweden)

    Feyza ALEV ÇETİN

    2016-03-01

    Data mining enables the discovery of required and applicable knowledge from very large amounts of information collected in one centre. Data mining has been used in the information industry and society. Although many data mining methods have been used, these techniques have become notable in animal husbandry in recent years. Many methods have been discussed and developed for the solution of complex problems in animal husbandry. Brief information on data mining techniques such as the k-means approach, the k-nearest neighbor approach, multivariate adaptive regression splines (MARS), naive Bayesian classifiers (NBC), artificial neural networks (ANN), support vector machines (SVM), and decision trees is given in the study. Some data mining methods are presented, and examples of the application of data mining in animal husbandry around the world are provided.

  10. Data mining applications in healthcare.

    Science.gov (United States)

    Koh, Hian Chye; Tan, Gerald

    2005-01-01

    Data mining has been used intensively and extensively by many organizations. In healthcare, data mining is becoming increasingly popular, if not increasingly essential. Data mining applications can greatly benefit all parties involved in the healthcare industry. For example, data mining can help healthcare insurers detect fraud and abuse, healthcare organizations make customer relationship management decisions, physicians identify effective treatments and best practices, and patients receive better and more affordable healthcare services. The huge amounts of data generated by healthcare transactions are too complex and voluminous to be processed and analyzed by traditional methods. Data mining provides the methodology and technology to transform these mounds of data into useful information for decision making. This article explores data mining applications in healthcare. In particular, it discusses data mining and its applications within healthcare in major areas such as the evaluation of treatment effectiveness, management of healthcare, customer relationship management, and the detection of fraud and abuse. It also gives an illustrative example of a healthcare data mining application involving the identification of risk factors associated with the onset of diabetes. Finally, the article highlights the limitations of data mining and discusses some future directions.

  11. Ensemble Data Mining Methods

    Science.gov (United States)

    Oza, Nikunj C.

    2004-01-01

    Ensemble Data Mining Methods, also known as Committee Methods or Model Combiners, are machine learning methods that leverage the power of multiple models to achieve better prediction accuracy than any of the individual models could on their own. The basic goal when designing an ensemble is the same as when establishing a committee of people: each member of the committee should be as competent as possible, but the members should be complementary to one another. If the members are not complementary, i.e., if they always agree, then the committee is unnecessary; any one member is sufficient. If the members are complementary, then when one or a few members make an error, the probability is high that the remaining members can correct this error. Research in ensemble methods has largely revolved around designing ensembles consisting of competent yet complementary models.
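
    The committee idea in this abstract can be illustrated with a minimal majority-vote sketch. It is not taken from the cited record: the labels and the three member prediction lists below are made-up toy data, chosen so that each member errs on different examples (that is, the members are complementary).

```python
from collections import Counter


def majority_vote(votes):
    """Return the label predicted by most committee members."""
    return Counter(votes).most_common(1)[0][0]


def accuracy(predictions, truth):
    """Fraction of predictions that match the true labels."""
    return sum(p == t for p, t in zip(predictions, truth)) / len(truth)


# Hypothetical toy data: true labels and three members' predictions.
truth = [1, 0, 1, 1, 0, 1]
member_a = [0, 1, 1, 1, 0, 1]  # errs on examples 0 and 1
member_b = [1, 0, 0, 0, 0, 1]  # errs on examples 2 and 3
member_c = [1, 0, 1, 1, 1, 0]  # errs on examples 4 and 5

# Combine the members example by example.
ensemble = [majority_vote(v) for v in zip(member_a, member_b, member_c)]

for name, preds in [("a", member_a), ("b", member_b),
                    ("c", member_c), ("ensemble", ensemble)]:
    print(name, accuracy(preds, truth))
```

    Because no two members are wrong on the same example, the remaining two always outvote the one in error. If all three members made identical mistakes, the vote would reproduce those mistakes and the committee would add nothing.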

  12. Process Mining Online Assessment Data

    Science.gov (United States)

    Pechenizkiy, Mykola; Trcka, Nikola; Vasilyeva, Ekaterina; van der Aalst, Wil; De Bra, Paul

    2009-01-01

    Traditional data mining techniques have been extensively applied to find interesting patterns, build descriptive and predictive models from large volumes of data accumulated through the use of different information systems. The results of data mining can be used for getting a better understanding of the underlying educational processes, for…

  13. Process mining online assessment data

    NARCIS (Netherlands)

    Pechenizkiy, M.; Trcka, N.; Vasilyeva, E.; Aalst, van der W.M.P.; De Bra, P.M.E.; Barnes, T.; Desmarais, M.; Romero, C.; Ventura, S.

    2009-01-01

    Traditional data mining techniques have been extensively applied to find interesting patterns, build descriptive and predictive models from large volumes of data accumulated through the use of different information systems. The results of data mining can be used for getting a better understanding of

  14. Implications of Emerging Data Mining

    Science.gov (United States)

    Kulathuramaiyer, Narayanan; Maurer, Hermann

    Data Mining describes a technology that discovers non-trivial hidden patterns in a large collection of data. Although this technology has a tremendous impact on our lives, the invaluable contributions of this invisible technology often go unnoticed. This paper discusses advances in data mining while focusing on the emerging data mining capability. Such data mining applications perform multidimensional mining on a wide variety of heterogeneous data sources, providing solutions to many unresolved problems. This paper also highlights the advantages and disadvantages arising from the ever-expanding scope of data mining. Data Mining augments human intelligence by equipping us with a wealth of knowledge and by empowering us to perform our daily tasks better. As the mining scope and capacity increases, users and organizations become more willing to compromise privacy. The huge data stores of the 'master miners' allow them to gain deep insights into individual lifestyles and their social and behavioural patterns. Data integration and analysis capability of combining business and financial trends together with the ability to deterministically track market changes will drastically affect our lives.

  15. Real world data mining applications

    CERN Document Server

    Abou-Nasr, Mahmoud; Stahlbock, Robert; Weiss, Gary M

    2014-01-01

    Data mining applications range from commercial to social domains, with novel applications appearing swiftly; for example, within the context of social networks. The expanding application sphere and social reach of advanced data mining raise pertinent issues of privacy and security. Present-day data mining is a progressive multidisciplinary endeavor. This inter- and multidisciplinary approach is well reflected within the field of information systems. The information systems research addresses software and hardware requirements for supporting computationally and data-intensive applications. Furthermore, it encompasses analyzing system and data aspects, and all manual or automated activities. In that respect, research at the interface of information systems and data mining has significant potential to produce actionable knowledge vital for corporate decision-making. The aim of the proposed volume is to provide a balanced treatment of the latest advances and developments in data mining; in particular, exploring s...

  16. Finding Gold in Data Mining

    Science.gov (United States)

    Flaherty, Bill

    2013-01-01

    Data-mining systems provide a variety of opportunities for school district personnel to streamline operations and focus on student achievement. This article describes the value of data mining for school personnel, finance departments, teacher evaluations, and in the classroom. It suggests that much could be learned about district practices if one…

  17. Data Mining for Intrusion Detection

    Science.gov (United States)

    Singhal, Anoop; Jajodia, Sushil

    Data Mining Techniques have been successfully applied in many different fields including marketing, manufacturing, fraud detection and network management. In recent years there has been a lot of interest in security technologies such as intrusion detection, cryptography, authentication and firewalls. This chapter discusses the application of Data Mining techniques to computer security. Conclusions are drawn and directions for future research are suggested.

  18. Learning data mining with R

    CERN Document Server

    Makhabel, Bater

    2015-01-01

    This book is intended for the budding data scientist or quantitative analyst with only a basic exposure to R and statistics. This book assumes familiarity with only the very basics of R, such as the main data types, simple functions, and how to move data around. No prior experience with data mining packages is necessary; however, you should have a basic understanding of data mining concepts and processes.

  19. Privacy Preserving Distributed Data Mining

    Data.gov (United States)

    National Aeronautics and Space Administration — Distributed data mining from privacy-sensitive multi-party data is likely to play an important role in the next generation of integrated vehicle health monitoring...

  20. Mining High-Dimensional Data

    Science.gov (United States)

    Wang, Wei; Yang, Jiong

    With the rapid growth of computational biology and e-commerce applications, high-dimensional data has become very common. Thus, mining high-dimensional data is an urgent problem of great practical importance. However, there are some unique challenges for mining data of high dimensions, including (1) the curse of dimensionality and, more crucially, (2) the meaningfulness of the similarity measure in the high-dimensional space. In this chapter, we present several state-of-the-art techniques for analyzing high-dimensional data, e.g., frequent pattern mining, clustering, and classification. We will discuss how these methods deal with the challenges of high dimensionality.
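
    Challenge (2) can be illustrated with a small experiment that is not taken from the chapter: as the number of dimensions grows, the nearest and farthest neighbours of a query point end up at almost the same distance, so a distance-based similarity measure discriminates less and less. The sketch below assumes uniformly random points in the unit hypercube and Euclidean distance; the function name `relative_contrast` and all parameters are hypothetical choices made for the illustration.

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """Return (d_max - d_min) / d_min for the Euclidean distances between
    one random query point and n_points random points in [0, 1]^dim."""
    rng = random.Random(seed)  # fixed seed so repeated runs agree
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))
    return (max(dists) - min(dists)) / min(dists)


if __name__ == "__main__":
    # The contrast shrinks as the dimension grows.
    for dim in (2, 10, 100, 1000):
        print(dim, round(relative_contrast(dim), 3))
```

    A contrast close to zero means that "nearest" and "farthest" are nearly indistinguishable, which is one reason clustering and nearest-neighbour methods need special care in high dimensions.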

  1. The handbook of data mining

    CERN Document Server

    Ye, Nong

    2003-01-01

    This book is the first comprehensive one to feature systematic coverage of the concepts, techniques, examples, issues, software tools and future advancements of data mining. The demand for data mining applications is increasing in industry, government, and academia.

  2. Data mining methods and applications

    CERN Document Server

    Lawrence, Kenneth D; Klimberg, Ronald K

    2007-01-01

    With today's information explosion, many organizations are now able to access a wealth of valuable data. Unfortunately, most of these organizations find they are ill-equipped to organize this information, let alone put it to work for them. Gain a Competitive Advantage: Employ data mining in research and forecasting. Build models with data management tools and methodology optimization. Gain sophisticated breakdowns and complex analysis through multivariate, evolutionary, and neural net methods. Learn how to classify data and maintain quality. Transform Data into Business Acumen: Data Mining Methods and

  3. Data Mining Tools in Science Education

    OpenAIRE

    Premysl Zaskodny

    2012-01-01

    The main principle of the paper is Data Mining in Science Education (DMSE) as Problem Solving. The main goal of the paper is the delimitation of the Complex Data Mining Tool and the Partial Data Mining Tool of DMSE. The paper proceeds through Data Preprocessing in Science Education, Data Processing in Science Education, Description of Curricular Process as Complex Data Mining Tool (CP-DMSE), Description of Analytical Synthetic Modeling as Partial Data Mining Tool (ASM-DMSE) and finally...

  4. Mastering SQL Server 2014 data mining

    CERN Document Server

    Bassan, Amarpreet Singh

    2014-01-01

    If you are a developer who is working on data mining for large companies and would like to enhance your knowledge of SQL Server Data Mining Suite, this book is for you. Whether you are brand new to data mining or are a seasoned expert, you will be able to master the skills needed to build a data mining solution.

  5. PROGRAMS WITH DATA MINING CAPABILITIES

    Directory of Open Access Journals (Sweden)

    Ciobanu Dumitru

    2012-03-01

    Full Text Available The fact that the Internet has become a commodity in the world has created a framework for a new economy. Traditional businesses migrate to this new environment that offers many features and options at relatively low prices. However, competitiveness is fierce, and successful Internet business is tied to rigorous use of all available information. The information is often hidden in data, and retrieving it requires software capable of applying data mining algorithms and techniques. In this paper we want to review some of the programs with data mining capabilities currently available in this area. We also propose some classifications of this software to assist those who wish to use such software.

  6. Signal system data mining

    Science.gov (United States)

    2000-09-01

    Intelligent transportation systems (ITS) include large numbers of traffic sensors that collect enormous quantities of data. The data provided by ITS is necessary for advanced forms of control; however, basic forms of control, primarily time-of-day (TO...

  7. Collaborative Data Mining Tool for Education

    Science.gov (United States)

    Garcia, Enrique; Romero, Cristobal; Ventura, Sebastian; Gea, Miguel; de Castro, Carlos

    2009-01-01

    This paper describes a collaborative educational data mining tool based on association rule mining for the continuous improvement of e-learning courses, allowing teachers with similar course profiles to share and score the discovered information. This mining tool is intended for use by instructors who are not experts in data mining such that, its…

  8. Data mining concepts and techniques

    CERN Document Server

    Han, Jiawei

    2005-01-01

    Our ability to generate and collect data has been increasing rapidly. Not only are all of our business, scientific, and government transactions now computerized, but the widespread use of digital cameras, publication tools, and bar codes also generate data. On the collection side, scanned text and image platforms, satellite remote sensing systems, and the World Wide Web have flooded us with a tremendous amount of data. This explosive growth has generated an even more urgent need for new techniques and automated tools that can help us transform this data into useful information and knowledge. Like the first edition, voted the most popular data mining book by KD Nuggets readers, this book explores concepts and techniques for the discovery of patterns hidden in large data sets, focusing on issues relating to their feasibility, usefulness, effectiveness, and scalability. However, since the publication of the first edition, great progress has been made in the development of new data mining methods, systems, and app...

  9. Data mining theories, algorithms, and examples

    CERN Document Server

    Ye, Nong

    2013-01-01

    AN OVERVIEW OF DATA MINING METHODOLOGIES: Introduction to data mining methodologies. METHODOLOGIES FOR MINING CLASSIFICATION AND PREDICTION PATTERNS: Regression models; Bayes classifiers; Decision trees; Multi-layer feedforward artificial neural networks; Support vector machines; Supervised clustering. METHODOLOGIES FOR MINING CLUSTERING AND ASSOCIATION PATTERNS: Hierarchical clustering; Partitional clustering; Self-organized map; Probability distribution estimation; Association rules; Bayesian networks. METHODOLOGIES FOR MINING DATA REDUCTION PATTERNS: Principal components analysis; Multi-dimensional scaling; Latent variable anal

  10. Data mining for social network data

    CERN Document Server

    Memon, Nasrullah; Hicks, David L; Chen, Hsinchun

    2010-01-01

    Driven by counter-terrorism efforts, marketing analysis and an explosion in online social networking in recent years, data mining has moved to the forefront of information science. This proposed Special Issue on "Data Mining for Social Network Data" will present a broad range of recent studies in social networking analysis. It will focus on emerging trends and needs in discovery and analysis of communities, solitary and social activities, and activities in open fora, and commercial sites as well. It will also look at network modeling, infrastructure construction, dynamic growth and evolution

  11. Mining Marketing Data

    Science.gov (United States)

    2002-01-01

    MarketMiner(R) Products, a line of automated marketing analysis tools manufactured by MarketMiner, Inc., can benefit organizations that perform significant amounts of direct marketing. MarketMiner received a Small Business Innovation Research (SBIR) contract from NASA's Johnson Space Center to develop the software as a data modeling tool for space mission applications. The technology was then built into the company's current products to provide decision support for business and marketing applications. With the tool, users gain valuable information about customers and prospects from existing data in order to increase sales and profitability. MarketMiner(R) is a registered trademark of MarketMiner, Inc.

  12. Mining Connected Data

    Science.gov (United States)

    Michel, L.; Motch, C.; Pineau, F. X.

    2009-05-01

    As a member of the Survey Science Consortium of the XMM-Newton mission, the Strasbourg Observatory is in charge of the real-time cross-correlations of X-ray data with archival catalogs. We are also committed to providing specific tools to handle these cross-correlations and to propose identifications at other wavelengths. In order to do so, we developed a database generator (Saada) managing persistent links and supporting heterogeneous input datasets. This system makes it easy to build an archive containing numerous and complex links between individual items [1]. It also offers a powerful query engine able to select sources on the basis of the properties (existence, distance, colours) of the X-ray-archival associations. We present such a database in operation for the 2XMMi catalogue. This system is flexible enough to provide both a public data interface and a servicing interface which could be used in the framework of the Simbol-X ground segment.

  13. Mining for data

    Energy Technology Data Exchange (ETDEWEB)

    Ross, Elsie

    2011-12-15

    Launching a new oil and gas company involves certain initial challenges and risks. But a Calgary-based software company has a novel idea to help reduce that initial exploration risk. This paper presents the Visage information software tool, developed by Visage Information Solutions, which quickly accesses, analyzes and visually interprets data on more than 710,000 wells in the Western Canada sedimentary basin. Elkhorn Resources Inc. used this technology to screen exploration areas in developing their business plan and the software helped them analyze different play types and technical concepts in real time with the click of a button. This software helps in making quick and accurate decisions based on a thoroughly researched and well founded database. The technique focuses on a particular area to provide a better understanding of the economics involved for a potential play type with actual numbers based on publicly available production data.

  14. Contrast data mining concepts, algorithms, and applications

    CERN Document Server

    Dong, Guozhu

    2012-01-01

    A Fruitful Field for Researching Data Mining Methodology and for Solving Real-Life Problems Contrast Data Mining: Concepts, Algorithms, and Applications collects recent results from this specialized area of data mining that have previously been scattered in the literature, making them more accessible to researchers and developers in data mining and other fields. The book not only presents concepts and techniques for contrast data mining, but also explores the use of contrast mining to solve challenging problems in various scientific, medical, and business domains. Learn from Real Case Studies

  15. A survey of temporal data mining

    Indian Academy of Sciences (India)

    other subtle relationships in the data using a combination of techniques from ... stamped list of items bought by customers lends itself to data mining analysis that ...... Frequent episode mining can be used here as part of an alarm management.

  16. data mining in distributed database

    International Nuclear Information System (INIS)

    Ghunaim, A.A.A.

    2007-01-01

    As we march into the age of digital information, the collection and storage of large quantities of data is increasing, and the problem of data overload looms ominously ahead. It is estimated today that the volume of data stored by a company doubles every year, but the amount of meaningful information decreases rapidly. The ability to analyze and understand massive datasets lags far behind the ability to gather and store the data. The unbridled growth of data will inevitably lead to a situation in which it is increasingly difficult to access the desired information; it will always be like looking for a needle in a haystack, where only the amount of hay grows all the time. So, a new generation of computational techniques and tools is required to analyze and understand the rapidly growing volumes of data. And, because information technology (IT) has become a strategic weapon in modern life, new decision support tools are needed to be a powerful international competitor. Data mining is one of these tools, and its methods make it possible to extract the decisive knowledge needed by an enterprise. It is concerned with inferring models from data, including statistical pattern recognition, applied statistics, machine learning, and neural networks. Data mining is a tool for increasing the productivity of people trying to build predictive models. Data mining techniques have been applied successfully to several real-world problem domains, but their application in the nuclear reactor field has received only little attention. One of the main reasons is the difficulty in obtaining the data sets

  17. Data mining and visualization techniques

    Science.gov (United States)

    Wong, Pak Chung [Richland, WA; Whitney, Paul [Richland, WA; Thomas, Jim [Richland, WA

    2004-03-23

    Disclosed are association rule identification and visualization methods, systems, and apparatus. An association rule in data mining is an implication of the form X → Y where X is a set of antecedent items and Y is the consequent item. A unique visualization technique that provides multiple antecedent, consequent, confidence, and support information is disclosed to facilitate better presentation of large quantities of complex association rules.
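
    To make the support and confidence of a rule X → Y concrete, here is a minimal sketch using the standard textbook definitions (support is the fraction of transactions containing all items of the rule; confidence is that support divided by the support of the antecedent). It is not the disclosed visualization method; the function names and the small basket data set are made up for the illustration.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent, i.e.
    support(antecedent plus consequent) / support(antecedent)."""
    both = support(transactions, set(antecedent) | {consequent})
    ante = support(transactions, antecedent)
    return both / ante if ante > 0 else 0.0


# Hypothetical market-basket data, for illustration only.
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk"},
]

if __name__ == "__main__":
    # Rule {bread} -> milk
    print(support(baskets, {"bread", "milk"}))
    print(confidence(baskets, {"bread"}, "milk"))
```

    Here bread and milk occur together in two of the four baskets, and bread occurs in three, so the rule {bread} → milk has support 2/4 and confidence 2/3.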

  18. Applied data mining for business and industry

    CERN Document Server

    Giudici, Paolo

    2009-01-01

    The increasing availability of data in our current, information overloaded society has led to the need for valid tools for its modelling and analysis. Data mining and applied statistical methods are the appropriate tools to extract knowledge from such data. This book provides an accessible introduction to data mining methods in a consistent and application oriented statistical framework, using case studies drawn from real industry projects and highlighting the use of data mining methods in a variety of business applications. Introduces data mining methods and applications. Covers classical and Bayesian multivariate statistical methodology as well as machine learning and computational data mining methods. Includes many recent developments such as association and sequence rules, graphical Markov models, lifetime value modelling, credit risk, operational risk and web mining. Features detailed case studies based on applied projects within industry. Incorporates discussion of data mining software, with case studies a...

  19. Visual cues for data mining

    Science.gov (United States)

    Rogowitz, Bernice E.; Rabenhorst, David A.; Gerth, John A.; Kalin, Edward B.

    1996-04-01

    This paper describes a set of visual techniques, based on principles of human perception and cognition, which can help users analyze and develop intuitions about tabular data. Collections of tabular data are widely available, including, for example, multivariate time series data, customer satisfaction data, stock market performance data, multivariate profiles of companies and individuals, and scientific measurements. In our approach, we show how visual cues can help users perform a number of data mining tasks, including identifying correlations and interaction effects, finding clusters and understanding the semantics of cluster membership, identifying anomalies and outliers, and discovering multivariate relationships among variables. These cues are derived from psychological studies on perceptual organization, visual search, perceptual scaling, and color perception. These visual techniques are presented as a complement to the statistical and algorithmic methods more commonly associated with these tasks, and provide an interactive interface for the human analyst.

  20. Data mining for ontology development.

    Energy Technology Data Exchange (ETDEWEB)

    Davidson, George S.; Strasburg, Jana (Pacific Northwest National Laboratory, Richland, WA); Stampf, David (Brookhaven National Laboratory, Upton, NY); Neymotin,Lev (Brookhaven National Laboratory, Upton, NY); Czajkowski, Carl (Brookhaven National Laboratory, Upton, NY); Shine, Eugene (Savannah River National Laboratory, Aiken, SC); Bollinger, James (Savannah River National Laboratory, Aiken, SC); Ghosh, Vinita (Brookhaven National Laboratory, Upton, NY); Sorokine, Alexandre (Oak Ridge National Laboratory, Oak Ridge, TN); Ferrell, Regina (Oak Ridge National Laboratory, Oak Ridge, TN); Ward, Richard (Oak Ridge National Laboratory, Oak Ridge, TN); Schoenwald, David Alan

    2010-06-01

    A multi-laboratory ontology construction effort during the summer and fall of 2009 prototyped an ontology for counterfeit semiconductor manufacturing. This effort included an ontology development team and an ontology validation methods team. Here the third team of the Ontology Project, the Data Analysis (DA) team reports on their approaches, the tools they used, and results for mining literature for terminology pertinent to counterfeit semiconductor manufacturing. A discussion of the value of ontology-based analysis is presented, with insights drawn from other ontology-based methods regularly used in the analysis of genomic experiments. Finally, suggestions for future work are offered.

  1. Data mining utilizando redes neuronales

    OpenAIRE

    Ale, Juan María; Bot, Romina Laura

    2004-01-01

    Neural networks are widely used for tasks related to pattern recognition and classification. Although they are very accurate classifiers, they are not commonly used for data mining because they produce learning models that cannot be explained. The TREPAN algorithm extracts explainable hypotheses from a trained neural network. The hypotheses produced by the algorithm are represented as a decision tree that approximates the network. The decision trees extracted by TREPAN cannot...

  2. DATA MINING IN SPORTS BETTING

    Directory of Open Access Journals (Sweden)

    Cristian Georgescu

    2013-12-01

    Full Text Available In this paper, we have made a brief analysis on how to make decisions in betting on European football with the help of data mining techniques. Whether you refer to betting a few days in advance of the sporting event or live betting, both options have been taken into consideration. By using a clustering algorithm for analyzing both the database containing events from football matches and the odds given by bookmakers, we have obtained graphs indicating the probabilities associated with analyzed events. Given the purely informative aspect of the current paper, we have only analyzed the number of corners from a match.

  3. Data Mining Solutions for the Business Environment

    Directory of Open Access Journals (Sweden)

    Ruxandra-Stefania PETRE

    2014-02-01

    Full Text Available Over the past years, data mining became a matter of considerable importance due to the large amounts of data available in the applications belonging to various domains. Data mining is a dynamic and fast-expanding field that applies advanced data analysis techniques from statistics, machine learning, database systems or artificial intelligence in order to discover relevant patterns, trends and relations contained within the data, information impossible to observe using other techniques. The paper focuses on presenting the applications of data mining in the business environment. It contains a general overview of data mining, providing a definition of the concept, enumerating six primary data mining techniques and mentioning the main fields for which data mining can be applied. The paper also presents the main business areas which can benefit from the use of data mining tools, along with their use cases: retail, banking and insurance. The main commercially available data mining tools and their key features are also presented. Besides the analysis of data mining and the business areas that can successfully apply it, the paper presents the main features of a data mining solution that can be applied in the business environment, and the architecture of that solution with its main components, which would help improve customer experiences and decision-making

  4. Statistically significant relational data mining :

    Energy Technology Data Exchange (ETDEWEB)

    Berry, Jonathan W.; Leung, Vitus Joseph; Phillips, Cynthia Ann; Pinar, Ali; Robinson, David Gerald; Berger-Wolf, Tanya; Bhowmick, Sanjukta; Casleton, Emily; Kaiser, Mark; Nordman, Daniel J.; Wilson, Alyson G.

    2014-02-01

    This report summarizes the work performed under the project "Statistically significant relational data mining." The goal of the project was to add more statistical rigor to the fairly ad hoc area of data mining on graphs. Our goal was to develop better algorithms and better ways to evaluate algorithm quality. We concentrated on algorithms for community detection, approximate pattern matching, and graph similarity measures. Approximate pattern matching involves finding an instance of a relatively small pattern, expressed with tolerance, in a large graph of data observed with uncertainty. This report gathers the abstracts and references for the eight refereed publications that have appeared as part of this work. We then archive three pieces of research that have not yet been published. The first is theoretical and experimental evidence that a popular statistical measure for comparison of community assignments favors over-resolved communities over approximations to a ground truth. The second is a set of statistically motivated methods for measuring the quality of an approximate match of a small pattern in a large graph. The third is a new probabilistic random graph model. Statisticians favor these models for graph analysis. The new local structure graph model overcomes some of the issues with popular models such as exponential random graph models and latent variable models.

  5. Data Mining Mining Data: MSHA Enforcement Efforts, Underground Coal Mine Safety, and New Health Policy Implications

    OpenAIRE

    Thomas J. Kniesner; John D. Leeth

    2003-01-01

    Studies of industrial safety regulations, Occupational Safety and Health Administration (OSHA) in particular, often find little effect on worker safety. Critics of the regulatory approach argue that safety standards have little to do with industrial injuries and defenders of the regulatory approach cite infrequent inspections and low fines for violating safety standards. We use recently assembled data from the Mine Safety and Health Administration (MSHA) concerning underground coal mine produ...

  6. Data Mining Mining Data: MSHA Enforcement Efforts, Underground Coal Mine Safety, and New Health Implications

    OpenAIRE

    Kniesner, Thomas J.; Leeth, John D.

    2003-01-01

    Studies of industrial safety regulations, OSHA in particular, often find little effect on worker safety. Critics of the regulatory approach argue that safety standards have little to do with industrial injuries, and defenders of the regulatory approach cite infrequent inspections and low penalties for violating safety standards. We use recently assembled data from the Mine Safety and Health Administration (MSHA) concerning underground coal mine production, safety regulatory activities, and wo...

  7. Recurrent process mining with live event data

    NARCIS (Netherlands)

    Syamsiyah, A.; van Dongen, B.F.; van der Aalst, W.M.P.; Teniente, E.; Weidlich, M.

    2018-01-01

    In organizations, process mining activities are typically performed in a recurrent fashion, e.g. once a week, an event log is extracted from the information systems and a process mining tool is used to analyze the process’ characteristics. Typically, process mining tools import the data from a

  8. Data mining in pharma sector: benefits.

    Science.gov (United States)

    Ranjan, Jayanthi

    2009-01-01

    The amount of data generated in any sector at present is enormous. The information flow in the pharma industry is huge. Pharma firms are progressing into increased technology-enabled products and services. Data mining, which is knowledge discovery from large sets of data, helps pharma firms to discover patterns in improving the quality of drug discovery and delivery methods. The paper aims to present how data mining is useful in the pharma industry, how its techniques can yield good results in the pharma sector, and how data mining can enhance decision-making using pharmaceutical data. This conceptual paper is based on secondary study, research and observations from magazines, reports and notes. The author has listed the types of patterns that can be discovered using data mining in pharma data. The paper shows how data mining is useful in the pharma industry and how its techniques can yield good results in the pharma sector. Although much work can be produced for discovering knowledge in pharma data using data mining, the paper is limited to conceptualizing the ideas and viewpoints at this stage; future work may include applying data mining techniques to pharma data based on primary research using the available, well-known data mining tools. Research papers and conceptual papers related to data mining in the pharma industry are rare; this is the motivation for the paper.

  9. The Hazards of Data Mining in Healthcare.

    Science.gov (United States)

    Househ, Mowafa; Aldosari, Bakheet

    2017-01-01

    From the mid-1990s, data mining methods have been used to explore and find patterns and relationships in healthcare data. During the 1990s and early 2000s, data mining was a topic of great interest to healthcare researchers, as data mining showed some promise in the use of its predictive techniques to help model the healthcare system and improve the delivery of healthcare services. However, it was soon discovered that mining healthcare data had many challenges relating to the veracity of healthcare data and limitations around predictive modelling, leading to failures of data mining projects. As the Big Data movement has gained momentum over the past few years, there has been a reemergence of interest in the use of data mining techniques and methods to analyze healthcare-generated Big Data. Much has been written on the positive impacts of data mining on healthcare practice relating to issues of best practice, fraud detection, chronic disease management, and general healthcare decision making. Little has been written about the limitations and challenges of data mining use in healthcare. In this review paper, we explore some of the limitations and challenges in the use of data mining techniques in healthcare. Our results show that the limitations of data mining in healthcare include reliability of medical data, data sharing between healthcare organizations, and inappropriate modelling leading to inaccurate predictions. We conclude that there are many pitfalls in the use of data mining in healthcare; more work is needed to show evidence of its utility in facilitating healthcare decision-making for healthcare providers, managers, and policy makers, and more evidence is needed on data mining's overall impact on healthcare services and patient care.

  10. Applications of Data Mining in Higher Education

    OpenAIRE

    Monika Goyal; Rajan Vohra

    2012-01-01

    Data analysis plays an important role in decision support irrespective of the type of industry, from manufacturing units to education systems. There are many domains in which data mining techniques play an important role. This paper proposes the use of data mining techniques to improve the efficiency of higher education institutions. If data mining techniques such as clustering, decision tree and association are applied to higher education processes, it would help to improve students performa...

  11. Data-Mining Research in Education

    OpenAIRE

    Cheng, Jiechao

    2017-01-01

    As an interdisciplinary discipline, data mining (DM) is popular in the education area, especially when examining students' learning performances. It focuses on analyzing education-related data to develop models for improving learners' learning experiences and enhancing institutional effectiveness. Therefore, DM does help education institutions provide high-quality education for their learners. Applying data mining in education is also known as educational data mining (EDM), which enables to better un...

  12. DATA MINING TECHNIQUES FOR EDUCATIONAL DATA: A REVIEW

    OpenAIRE

    Pragati Sharma; Dr. Sanjiv Sharma

    2018-01-01

    Recently, data mining has been gaining popularity among researchers. Data mining provides various techniques and methods for analysing data produced by various applications in different domains. Similarly, educational mining provides a way of analyzing educational data sets. Educational mining is concerned with developing methods for discovering knowledge from data that come from the educational field, and it helps to extract the hidden patterns and to discover new knowledge from large educational da...

  13. A survey of temporal data mining

    Indian Academy of Sciences (India)

    Data mining is concerned with analysing large volumes of (often unstructured) data to automatically discover interesting regularities or relationships which in turn lead to better understanding of the underlying processes. The field of temporal data mining is concerned with such analysis in the case of ordered data streams ...

  14. EXTRACTING KNOWLEDGE FROM DATA - DATA MINING

    Directory of Open Access Journals (Sweden)

    DIANA ELENA CODREANU

    2011-04-01

    Full Text Available Managers of economic organizations have at their disposal a large volume of information and practically face an avalanche of it, but they cannot operate by studying reports that contain volumes of detailed, uncorrelated data, because what is good for an organization may have to be decided in fractions of time. Thus, to take the best and most effective decisions in real time, managers need the correct information presented quickly, in a synthetic but relevant way that allows for predictions and analysis. This paper highlights solutions for extracting knowledge from data, namely data mining. This technology not only verifies hypotheses but also aims at discovering new knowledge, so that the economic organization can cope with fierce competition in the market.

  15. Data Mining for Anomaly Detection

    Science.gov (United States)

    Biswas, Gautam; Mack, Daniel; Mylaraswamy, Dinkar; Bharadwaj, Raj

    2013-01-01

    The Vehicle Integrated Prognostics Reasoner (VIPR) program describes methods for enhanced diagnostics as well as a prognostic extension to the current state-of-the-art Aircraft Diagnostic and Maintenance System (ADMS). VIPR introduced a new anomaly detection function for discovering previously undetected and undocumented situations, where there are clear deviations from nominal behavior. Once a baseline (nominal model of operations) is established, the detection and analysis is split between on-aircraft outlier generation and off-aircraft expert analysis to characterize and classify events that may not have been anticipated by individual system providers. Offline expert analysis is supported by data curation and data mining algorithms that can be applied in the contexts of supervised and unsupervised learning methods. In this report, we discuss efficient methods to implement the Kolmogorov complexity measure using compression algorithms, and run a systematic empirical analysis to determine the best compression measure. Our experiments established that the combination of the DZIP compression algorithm and CiDM distance measure provides the best results for capturing relevant properties of time series data encountered in aircraft operations. This combination was used as the basis for developing an unsupervised learning algorithm to define "nominal" flight segments using historical flight segments.
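
    The idea of approximating Kolmogorov complexity with a compressor can be illustrated with a small sketch. The example below is not the DZIP/CiDM combination named in the abstract; it swaps in Python's standard zlib compressor and the generic normalized compression distance (NCD) purely for illustration. The byte encoding of the series and the sample data are assumptions made up for this sketch.

```python
import random
import zlib


def compressed_size(data: bytes) -> int:
    """Length of the zlib-compressed input: a computable stand-in for Kolmogorov complexity."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance: small when x and y share structure."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def encode(series) -> bytes:
    """Quantize a numeric series to one byte per sample (hypothetical encoding for this sketch)."""
    lo, hi = min(series), max(series)
    span = (hi - lo) or 1.0
    return bytes(int(255 * (v - lo) / span) for v in series)


# Illustrative data only: two periodic "nominal" segments and one noisy segment.
nominal = encode([i % 10 for i in range(400)])
shifted = encode([(i + 3) % 10 for i in range(400)])
rng = random.Random(0)
noisy = encode([rng.randrange(256) for _ in range(400)])

print("nominal vs shifted:", round(ncd(nominal, shifted), 3))
print("nominal vs noisy:  ", round(ncd(nominal, noisy), 3))
```

    Segments that share a repeating pattern compress well together and so score a low distance, while the noisy segment does not. A clustering step over such pairwise distances is one plausible way to group "nominal" segments, though the report's actual algorithm may differ.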

  16. Process mining : data science in action

    NARCIS (Netherlands)

    Van der Aalst, W.M.P.

    2016-01-01

    This is the second edition of Wil van der Aalst’s seminal book on process mining, which now discusses the field also in the broader context of data science and big data approaches. It includes several additions and updates, e.g. on inductive mining techniques, the notion of alignments, a

  17. Data Mining and Homeland Security: An Overview

    National Research Council Canada - National Science Library

    Seifert, Jeffrey W

    2008-01-01

    .... Often used as a means for detecting fraud, assessing risk, and product retailing, data mining involves the use of data analysis tools to discover previously unknown, valid patterns and relationships in large data sets...

  18. Data Mining and Homeland Security: An Overview

    National Research Council Canada - National Science Library

    Seifert, Jeffrey W

    2007-01-01

    .... Often used as a means for detecting fraud, assessing risk, and product retailing, data mining involves the use of data analysis tools to discover previously unknown, valid patterns and relationships in large data sets...

  19. Data Mining and Homeland Security: An Overview

    National Research Council Canada - National Science Library

    Seifert, Jeffrey W

    2006-01-01

    .... Often used as a means for detecting fraud, assessing risk, and product retailing, data mining involves the use of data analysis tools to discover previously unknown, valid patterns and relationships in large data sets...

  20. Web Mining of Hotel Customer Survey Data

    Directory of Open Access Journals (Sweden)

    Richard S. Segall

    2008-12-01

    Full Text Available This paper provides an extensive literature review and list of references on the background of web mining as applied specifically to hotel customer survey data. This research applies the techniques of web mining to actual text of written comments for hotel customers using Megaputer PolyAnalyst®. Web mining functionalities utilized include those such as clustering, link analysis, key word and phrase extraction, taxonomy, and dimension matrices. This paper provides screen shots of the web mining applications using Megaputer PolyAnalyst®. Conclusions and future directions of the research are presented.

  1. Data mining and business analytics with R

    CERN Document Server

    Ledolter, Johannes

    2013-01-01

    Collecting, analyzing, and extracting valuable information from a large amount of data requires easily accessible, robust, computational and analytical tools. Data Mining and Business Analytics with R utilizes the open source software R for the analysis, exploration, and simplification of large high-dimensional data sets. As a result, readers are provided with the needed guidance to model and interpret complicated data and become adept at building powerful models for prediction and classification. Highlighting both underlying concepts and practical computational skills, Data Mining

  2. Solar Data Mining at Georgia State University

    Science.gov (United States)

    Angryk, R.; Martens, P. C.; Schuh, M.; Aydin, B.; Kempton, D.; Banda, J.; Ma, R.; Naduvil-Vadukootu, S.; Akkineni, V.; Küçük, A.; Filali Boubrahimi, S.; Hamdi, S. M.

    2016-12-01

    In this talk we give an overview of research projects related to solar data analysis that are conducted at Georgia State University. We will provide an update on multiple advances made by our research team on the analysis of image parameters, spatio-temporal pattern mining, temporal data analysis and our experiences with big, heterogeneous solar data visualization, analysis, processing and storage. We will talk about up-to-date data mining methodologies, and their importance for big data-driven solar physics research.

  3. Data Mining Solutions for the Business Environment

    OpenAIRE

    Ruxandra-Stefania PETRE

    2013-01-01

    Over the past years, data mining has become a matter of considerable importance due to the large amounts of data available in applications belonging to various domains. Data mining is a dynamic and fast-expanding field that applies advanced data analysis techniques, from statistics, machine learning, database systems or artificial intelligence, in order to discover relevant patterns, trends and relations contained within the data, information impossible to observe using other techniques. The p...

  4. Granular-relational data mining how to mine relational data in the paradigm of granular computing ?

    CERN Document Server

    Hońko, Piotr

    2017-01-01

    This book provides two general granular computing approaches to mining relational data, the first of which uses abstract descriptions of relational objects to build their granular representation, while the second extends existing granular data mining solutions to a relational case. Both approaches make it possible to perform and improve popular data mining tasks such as classification, clustering, and association discovery. How can different relational data mining tasks best be unified? How can the construction process of relational patterns be simplified? How can richer knowledge from relational data be discovered? All these questions can be answered in the same way: by mining relational data in the paradigm of granular computing! This book will allow readers with previous experience in the field of relational data mining to discover the many benefits of its granular perspective. In turn, those readers familiar with the paradigm of granular computing will find valuable insights on its application to mining r...

  5. Mining Product Data Models: A Case Study

    Directory of Open Access Journals (Sweden)

    Cristina-Claudia DOLEAN

    2014-01-01

    Full Text Available This paper presents two case studies used to prove the validity of some data-flow mining algorithms. We proposed the data-flow mining algorithms because most mining algorithms focus on the control-flow perspective. The first case study uses event logs generated by an ERP system (Navision) after we set several trackers on the data elements needed in the process analyzed, while the second case study uses the event logs generated by the YAWL system. We offered a general solution for data-flow model extraction from different data sources. In order to apply the data-flow mining algorithms, the event logs must comply with a certain format (using the InputOutput extension). To respect this format, however, a set of conversion tools is needed. We describe the conversion tools used and how we obtained the data-flow models. Moreover, the data-flow model is compared to the control-flow model.

  6. A survey on Big Data Stream Mining

    African Journals Online (AJOL)

    pc

    2018-03-05

    Mar 5, 2018 ... huge amount of stream like telecommunication systems. So, there ... streams have many challenges for data mining algorithm design like using of ..... A. Bifet and R. Gavalda, "Learning from Time-Changing Data with. Adaptive ...

  7. Data Mining and Statistics for Decision Making

    CERN Document Server

    Tufféry, Stéphane

    2011-01-01

    Data mining is the process of automatically searching large volumes of data for models and patterns using computational techniques from statistics, machine learning and information theory; it is the ideal tool for such an extraction of knowledge. Data mining is usually associated with a business or an organization's need to identify trends and profiles, allowing, for example, retailers to discover patterns on which to base marketing objectives. This book looks at both classical and recent techniques of data mining, such as clustering, discriminant analysis, logistic regression, generalized lin

  8. IT Data Mining Tool Uses in Aerospace

    Science.gov (United States)

    Monroe, Gilena A.; Freeman, Kenneth; Jones, Kevin L.

    2012-01-01

    Data mining has a broad spectrum of uses throughout the realms of aerospace and information technology. Each of these areas has useful methods for processing, distributing, and storing its corresponding data. This paper focuses on ways to leverage the data mining tools and resources used in NASA's information technology area to meet the similar data mining needs of aviation and aerospace domains. This paper details the searching, alerting, reporting, and application functionalities of the Splunk system, used by NASA's Security Operations Center (SOC), and their potential shared solutions to address aircraft and spacecraft flight and ground systems data mining requirements. This paper also touches on capacity and security requirements when addressing sizeable amounts of data across a large data infrastructure.

  9. Exploring the Integration of Data Mining and Data Visualization

    Science.gov (United States)

    Zhang, Yi

    2011-01-01

    Due to the rapid advances in computing and sensing technologies, enormous amounts of data are being generated every day in various applications. The integration of data mining and data visualization has been widely used to analyze these massive and complex data sets to discover hidden patterns. For both data mining and visualization to be…

  10. The Top Ten Algorithms in Data Mining

    CERN Document Server

    Wu, Xindong

    2009-01-01

    From classification and clustering to statistical learning, association analysis, and link mining, this book covers the most important topics in data mining research. It presents the ten most influential algorithms used in the data mining community today. Each chapter provides a detailed description of the algorithm, a discussion of available software implementation, advanced topics, and exercises. With a simple data set, examples illustrate how each algorithm works and highlight the overall performance of each algorithm in a real-world application. Featuring contributions from leading researc

  11. Educational data mining and learning analytics

    OpenAIRE

    Vera Hernández, Joan Carles

    2017-01-01

    Work based on Educational Data Mining & Learning Analytics: an analysis of student enrollment and its impact on the decision to re-enroll.

  12. Data Mining Tools for Malware Detection

    CERN Document Server

    Masud, Mehedy; Thuraisingham, Bhavani; Andreasson, Kim J

    2011-01-01

    Although the use of data mining for security and malware detection is quickly on the rise, most books on the subject provide high-level theoretical discussions to the near exclusion of the practical aspects. Breaking the mold, Data Mining Tools for Malware Detection provides a step-by-step breakdown of how to develop data mining tools for malware detection. Integrating theory with practical techniques and experimental results, it focuses on malware detection applications for email worms, malicious code, remote exploits, and botnets. The authors describe the systems they have designed and devel

  13. Open-source tools for data mining.

    Science.gov (United States)

    Zupan, Blaz; Demsar, Janez

    2008-03-01

    With a growing volume of biomedical databases and repositories, the need to develop a set of tools to address their analysis and support knowledge discovery is becoming acute. The data mining community has developed a substantial set of techniques for computational treatment of these data. In this article, we discuss the evolution of open-source toolboxes that data mining researchers and enthusiasts have developed over the span of a few decades and review several currently available open-source data mining suites. The approaches we review are diverse in data mining methods and user interfaces and also demonstrate that the field and its tools are ready to be fully exploited in biomedical research.

  14. Frequent Pattern Mining Algorithms for Data Clustering

    DEFF Research Database (Denmark)

    Zimek, Arthur; Assent, Ira; Vreeken, Jilles

    2014-01-01

    Discovering clusters in subspaces, or subspace clustering and related clustering paradigms, is a research field where we find many frequent pattern mining related influences. In fact, as the first algorithms for subspace clustering were based on frequent pattern mining algorithms, it is fair to say that frequent pattern mining was at the cradle of subspace clustering, yet it quickly developed into an independent research field. In this chapter, we discuss how frequent pattern mining algorithms have been extended and generalized towards the discovery of local clusters in high-dimensional data. In particular, we discuss several example algorithms for subspace clustering or projected clustering as well as point out recent research questions and open topics in this area relevant to researchers in either clustering or pattern mining...

  15. DATA MINING THE GALAXY ZOO MERGERS

    Data.gov (United States)

    National Aeronautics and Space Administration — DATA MINING THE GALAXY ZOO MERGERS STEVEN BAEHR, ARUN VEDACHALAM, KIRK BORNE, AND DANIEL SPONSELLER Abstract. Collisions between pairs of galaxies usually end in the...

  16. Quantification of Operational Risk Using A Data Mining

    Science.gov (United States)

    Perera, J. Sebastian

    1999-01-01

    What is Data Mining?
    - Data Mining is the process of finding actionable information hidden in raw data.
    - Data Mining helps find hidden patterns, trends, and important relationships often buried in a sea of data.
    - Typically, automated software tools based on advanced statistical analysis and data modeling technology can be utilized to automate the data mining process.

  17. Challenges in computational statistics and data mining

    CERN Document Server

    Mielniczuk, Jan

    2016-01-01

    This volume contains nineteen research papers belonging to the areas of computational statistics, data mining, and their applications. Those papers, all written specifically for this volume, are their authors’ contributions to honour and celebrate Professor Jacek Koronacki on the occasion of his 70th birthday. The book’s related and often interconnected topics represent Jacek Koronacki’s research interests and their evolution. They also clearly indicate how close the areas of computational statistics and data mining are.

  18. Application of data mining techniques for nuclear data and instrumentation

    International Nuclear Information System (INIS)

    Toshniwal, Durga

    2013-01-01

    Data mining is defined as the discovery of previously unknown, valid, novel, potentially useful, and understandable patterns in large databases. It encompasses many different techniques and algorithms which differ in the kinds of data that can be analyzed and the form of knowledge representation used to convey the discovered knowledge. Patterns in the data can be represented in many different forms, including classification rules, association rules, clusters, etc. Data mining thus deals with the discovery of hidden trends and patterns from large quantities of data. The field of data mining is emerging as a new, fundamental research area with important applications to science, engineering, medicine, business, and education. It is an interdisciplinary research area and draws upon several roots, including database systems, machine learning, information systems, statistics and expert systems. Data mining, when performed on time series data, is known as time series data mining (TSDM). A time series is a sequence of real numbers, each number representing a value at a point of time. During the past few years, there has been an explosion of research in the area of time series data mining. This includes attempts to model time series data, to design languages to query such data, and to develop access structures to efficiently process queries on such data. Time series data arises naturally in many real-world applications. Efficient discovery of knowledge through time series data mining can be helpful in several domains such as stock market analysis and weather forecasting. An important application area of data mining techniques is in nuclear power plant and related data. Nuclear power plant data can be represented in the form of time sequences. Often it may be of prime importance to analyze such data to find trends and anomalies. The general goals of data mining include feature extraction, similarity search, clustering and classification, association rule mining and anomaly

  19. Robust processing of mining subsidence monitoring data

    Energy Technology Data Exchange (ETDEWEB)

    Mingzhong, Wang; Guogang, Huang [Pingdingshan Mining Bureau (China); Yunjia, Wang; Guogangli, [China Univ. of Mining and Technology, Xuzhou (China)

    1997-12-31

    Since China began research on mining subsidence in the 1950s, more than one thousand lines have been observed. Yet monitoring data sometimes contain a large number of outliers because of the limits of observation and of geological and mining conditions. In China today, mining subsidence monitoring data are processed on the principle of the least squares method, which can yield lower accuracy, less reliability, or even errors. For this reason, the authors, in light of the actual situation in China, have carried out research on the robust processing of mining subsidence monitoring data with respect to obtaining prediction parameters. The authors have derived the related formulas, designed computational programs, carried out a large number of actual calculations and simulations, and achieved good results. (orig.)
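
    Why least squares suffers from outliers, and what a robust alternative buys, can be seen in a toy line fit. The sketch below is not the authors' method and does not model subsidence prediction parameters; it compares an ordinary least squares fit with the Theil-Sen estimator (median of pairwise slopes), chosen here only as a well-known robust technique. The data points are invented for illustration.

```python
from statistics import median


def least_squares_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def theil_sen_line(xs, ys):
    """Robust fit: median of all pairwise slopes, then median residual as intercept."""
    slopes = [
        (ys[j] - ys[i]) / (xs[j] - xs[i])
        for i in range(len(xs))
        for j in range(i + 1, len(xs))
        if xs[j] != xs[i]
    ]
    slope = median(slopes)
    intercept = median(y - slope * x for x, y in zip(xs, ys))
    return slope, intercept


# Invented data: points on y = 2x + 1 with one gross outlier at x = 7.
xs = list(range(10))
ys = [2 * x + 1 for x in xs]
ys[7] = 60

ols_slope, ols_intercept = least_squares_line(xs, ys)
ts_slope, ts_intercept = theil_sen_line(xs, ys)
print("least squares:", ols_slope, ols_intercept)
print("Theil-Sen:    ", ts_slope, ts_intercept)
```

    A single bad observation pulls the least squares slope well away from 2, whereas the median-based estimate is unaffected because most pairwise slopes do not involve the outlier.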

  20. Robust processing of mining subsidence monitoring data

    Energy Technology Data Exchange (ETDEWEB)

    Wang Mingzhong; Huang Guogang [Pingdingshan Mining Bureau (China); Wang Yunjia; Guogangli [China Univ. of Mining and Technology, Xuzhou (China)

    1996-12-31

    Since China began research on mining subsidence in the 1950s, more than one thousand lines have been observed. Yet monitoring data sometimes contain a large number of outliers because of the limits of observation and of geological and mining conditions. In China today, mining subsidence monitoring data are processed on the principle of the least squares method, which can yield lower accuracy, less reliability, or even errors. For this reason, the authors, in light of the actual situation in China, have carried out research on the robust processing of mining subsidence monitoring data with respect to obtaining prediction parameters. The authors have derived the related formulas, designed computational programs, carried out a large number of actual calculations and simulations, and achieved good results. (orig.)

  1. Data Mining and Machine Learning in Astronomy

    Science.gov (United States)

    Ball, Nicholas M.; Brunner, Robert J.

    We review the current state of data mining and machine learning in astronomy. Data Mining can have a somewhat mixed connotation from the point of view of a researcher in this field. If used correctly, it can be a powerful approach, holding the potential to fully exploit the exponentially increasing amount of available data, promising great scientific advance. However, if misused, it can be little more than the black box application of complex computing algorithms that may give little physical insight, and provide questionable results. Here, we give an overview of the entire data mining process, from data collection through to the interpretation of results. We cover common machine learning algorithms, such as artificial neural networks and support vector machines, applications from a broad range of astronomy, emphasizing those in which data mining techniques directly contributed to improving science, and important current and future directions, including probability density functions, parallel algorithms, Peta-Scale computing, and the time domain. We conclude that, so long as one carefully selects an appropriate algorithm and is guided by the astronomical problem at hand, data mining can be very much the powerful tool, and not the questionable black box.

  2. Warehousing Structured and Unstructured Data for Data Mining.

    Science.gov (United States)

    Miller, L. L.; Honavar, Vasant; Barta, Tom

    1997-01-01

    Describes an extensible object-oriented view system that supports the integration of both structured and unstructured data sources in either the multidatabase or data warehouse environment. Discusses related work and data mining issues. (AEF)

  3. Set-oriented data mining in relational databases

    NARCIS (Netherlands)

    Houtsma, M.A.W.; Swami, Arun

    1995-01-01

    Data mining is an important real-life application for businesses. It is critical to find efficient ways of mining large data sets. In order to benefit from the experience with relational databases, a set-oriented approach to mining data is needed. In such an approach, the data mining operations are

  4. Supporting Solar Physics Research via Data Mining

    Science.gov (United States)

    Angryk, Rafal; Banda, J.; Schuh, M.; Ganesan Pillai, K.; Tosun, H.; Martens, P.

    2012-05-01

    In this talk we will briefly introduce three pillars of data mining (i.e. frequent patterns discovery, classification, and clustering), and discuss some possible applications of known data mining techniques which can directly benefit solar physics research. In particular, we plan to demonstrate applicability of frequent patterns discovery methods for the verification of hypotheses about co-occurrence (in space and time) of filaments and sigmoids. We will also show how classification/machine learning algorithms can be utilized to verify human-created software modules to discover individual types of solar phenomena. Finally, we will discuss applicability of clustering techniques to image data processing.

  5. Accounting and Financial Data Analysis Data Mining Tools

    Directory of Open Access Journals (Sweden)

    Diana Elena Codreanu

    2011-05-01

    Full Text Available Computerized accounting systems have grown more complex in recent years because of the competitive economic environment. With data analysis solutions such as OLAP and data mining, however, it is possible to perform multidimensional data analysis, detect fraud, and discover knowledge hidden in data, ensuring that such information is useful for decision making within the organization. The literature offers many definitions of data mining, but all boil down to the same idea: a process that extracts new information from large data collections, information that would be very difficult to obtain without the aid of data mining tools. Information obtained through data mining has the advantage that it not only answers the question of what is happening but also explains why certain things are happening. In this paper we present advanced techniques for the analysis and exploitation of data stored in a multidimensional database.

  6. Data Mining Techniques for Customer Relationship Management

    Science.gov (United States)

    Guo, Feng; Qin, Huilin

    2017-10-01

    Data mining has made customer relationship management (CRM) a new area where firms can gain a competitive advantage, and it plays a key role in firms’ management decisions. In this paper, we first analyze the value and application fields of data mining techniques for CRM, and then explore how data mining is applied to customer churn analysis. A new business culture is developing today. The conventional production-centered and sales-oriented market strategy is gradually shifting to a customer-centered and service-oriented one. Customers’ value orientation is increasingly affecting that of firms, and customer resources have become one of the most important strategic resources. Therefore, understanding customers’ needs and identifying the customers who contribute most has become the driving force of most modern businesses.

  7. Mining and Integration of Environmental Data

    Science.gov (United States)

    Tran, V.; Hluchy, L.; Habala, O.; Ciglan, M.

    2009-04-01

    The project ADMIRE (Advanced Data Mining and Integration Research for Europe) is a 7th FP EU ICT project that aims to deliver a consistent and easy-to-use technology for extracting information and knowledge. The project is motivated by the difficulty of extracting meaningful information by data mining combinations of data from multiple heterogeneous and distributed resources. It will also provide an abstract view of data mining and integration, which will give users and developers the power to cope with complexity and heterogeneity of services, data and processes. The data sets describing phenomena from domains like business, society, and environment often contain spatial and temporal dimensions. Integration of spatio-temporal data from different sources is a challenging task due to those dimensions. Different spatio-temporal data sets contain data at different resolutions (e.g. size of the spatial grid) and frequencies. This heterogeneity is the principal challenge of geo-spatial and temporal data sets integration: the integrated data set should hold homogeneous data of the same resolution and frequency. Thus, to integrate heterogeneous spatio-temporal data from distinct sources, transformation of one or more data sets is necessary. The following transformation operations are required:
    • transformation to a common spatial and temporal representation (e.g. transformation to a common coordinate system),
    • spatial and/or temporal aggregation: data from a detailed data source are aggregated to match the resolution of other resources involved in the integration process,
    • spatial and/or temporal record decomposition: records from a source with lower resolution data are decomposed to match the granularity of the other data source.
    This operation decreases data quality (e.g. transformation of data from 50km grid to 10 km grid) - data from lower resolution data set in the integrated schema are imprecise, but it allows us to preserve higher resolution data. We can decompose the
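
    The aggregation and decomposition operations described in this abstract can be sketched for a regular spatial grid. This is a minimal illustration under stated assumptions, not ADMIRE's implementation: grids are plain nested lists, aggregation averages non-overlapping blocks, decomposition simply replicates each coarse cell, and the function names `aggregate` and `decompose` are hypothetical.

```python
def aggregate(grid, factor):
    """Fine -> coarse: average non-overlapping factor x factor blocks."""
    rows, cols = len(grid), len(grid[0])
    if rows % factor or cols % factor:
        raise ValueError("grid dimensions must be multiples of factor")
    out = []
    for r in range(0, rows, factor):
        out_row = []
        for c in range(0, cols, factor):
            block = [grid[r + i][c + j] for i in range(factor) for j in range(factor)]
            out_row.append(sum(block) / len(block))
        out.append(out_row)
    return out


def decompose(grid, factor):
    """Coarse -> fine: replicate each cell into factor x factor cells."""
    out = []
    for row in grid:
        fine_row = [value for value in row for _ in range(factor)]
        out.extend(list(fine_row) for _ in range(factor))
    return out


# A 50 km grid with two cells, decomposed to a 10 km grid (factor 5).
coarse = [[10.0, 20.0]]
fine = decompose(coarse, 5)
print(len(fine), "x", len(fine[0]))       # 5 x 10
print(aggregate(fine, 5) == coarse)       # True: aggregation undoes replication

# Detail is lost in the other direction: aggregate, then decompose.
detailed = [[1.0, 2.0], [3.0, 4.0]]
print(decompose(aggregate(detailed, 2), 2))  # [[2.5, 2.5], [2.5, 2.5]]
```

    Replication leaves the decomposed values imprecise, which matches the abstract's remark that decomposition decreases data quality; an interpolating scheme would be another option and is not shown here.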

  8. Educational Data Mining Acceptance among Undergraduate Students

    Science.gov (United States)

    Wook, Muslihah; Yusof, Zawiyah M.; Nazri, Mohd Zakree Ahmad

    2017-01-01

    The acceptance of Educational Data Mining (EDM) technology is on the rise due to its ability to extract new knowledge from large amounts of students' data. This knowledge is important for educational stakeholders, such as policy makers, educators, and students themselves, to enhance efficiency and achievements. However, previous studies on EDM…

  9. Data Mining Gets Traction in Education

    Science.gov (United States)

    Sparks, Sarah D.

    2011-01-01

    The new and rapidly growing field of educational data mining is using the chaff from data collected through normal school activities to explore learning in more detail than ever, and researchers say the day when educators can make use of Amazon.com-like feedback on student learning behaviors may be closer than most people think. Educational data…

  10. Highly Robust Methods in Data Mining

    Czech Academy of Sciences Publication Activity Database

    Kalina, Jan

    2013-01-01

    Vol. 8, No. 1 (2013), pp. 9-24. ISSN 1452-4864. Institutional support: RVO:67985807. Keywords: data mining; robust statistics; high-dimensional data; cluster analysis; logistic regression; neural networks. Subject RIV: BB - Applied Statistics, Operational Research

  11. Engaging Business Students with Data Mining

    Science.gov (United States)

    Brandon, Dan

    2016-01-01

    The Economist calls it "a golden vein", and many business experts now say it is the new science of winning. Business and technologists have many names for this new science; "business intelligence" (BI), "data analytics," and "data mining" are among the most common. The job market for people skilled in this…

  12. Traffic Flow Management: Data Mining Update

    Science.gov (United States)

    Grabbe, Shon R.

    2012-01-01

    This presentation provides an update on recent data mining efforts that have been designed to (1) identify like/similar days in the national airspace system, (2) cluster/aggregate national-level rerouting data and (3) apply machine learning techniques to predict when Ground Delay Programs are required at a weather-impacted airport.

  13. Relational XES: Data management for process mining

    NARCIS (Netherlands)

    Dongen, van B.F.; Shabani, S.; Grabis, J.; Sandkuhl, K.

    2015-01-01

    Information systems log data during the execution of business processes in so-called "event logs". Process mining aims to improve business processes by extracting knowledge from event logs. Currently, the de facto standard for storing and managing event data, XES, is tailored towards sequential

  14. Relational XES : data management for process mining

    NARCIS (Netherlands)

    Dongen, van B.F.; Shabani, S.

    2015-01-01

    Information systems log data during the execution of business processes in so-called "event logs". Process mining aims to improve business processes by extracting knowledge from event logs. Currently, the de facto standard for storing and managing event data, XES, is tailored towards sequential

  15. Pocket data mining big data on small devices

    CERN Document Server

    Gaber, Mohamed Medhat; Gomes, Joao Bartolo

    2014-01-01

    Owing to continuous advances in the computational power of handheld devices like smartphones and tablet computers, it has become possible to perform Big Data operations, including modern data mining processes, onboard these small devices. A decade of research has proved the feasibility of what has been termed Mobile Data Mining, with a focus on one mobile device running data mining processes. However, it was not until 2010 that the authors of this book initiated the Pocket Data Mining (PDM) project, exploiting the seamless communication among handheld devices performing data analysis tasks that were infeasible until recently. PDM is the process of collaboratively extracting knowledge from distributed data streams in a mobile computing environment. This book provides the reader with an in-depth treatment of this emerging area of research. Details of the techniques used and thorough experimental studies are given. More importantly and exclusive to this book, the authors provide a detailed practical guide on the depl...

  16. Data mining in time series databases

    CERN Document Server

    Kandel, Abraham; Bunke, Horst

    2004-01-01

    Adding the time dimension to real-world databases produces Time Series Databases (TSDB) and introduces new aspects and difficulties to data mining and knowledge discovery. This book covers the state-of-the-art methodology for mining time series databases. The novel data mining methods presented in the book include techniques for efficient segmentation, indexing, and classification of noisy and dynamic time series. A graph-based method for anomaly detection in time series is described and the book also studies the implications of a novel and potentially useful representation of time series as strings. The problem of detecting changes in data mining models that are induced from temporal databases is additionally discussed.

  17. Open data mining for Taiwan's dengue epidemic.

    Science.gov (United States)

    Wu, ChienHsing; Kao, Shu-Chen; Shih, Chia-Hung; Kan, Meng-Hsuan

    2018-07-01

    Using a quantitative approach, this study examines the applicability of data mining techniques to discover knowledge from open data related to Taiwan's dengue epidemic. We compare results when Google trend data are included or excluded. Data sources are government open data, climate data, and Google trend data. Research findings from analysis of 70,914 cases are obtained. Location and time (month) in open data show the highest classification power followed by climate variables (temperature and humidity), whereas gender and age show the lowest values. Both prediction accuracy and simplicity decrease when Google trends are considered (respectively 0.94 and 0.37, compared to 0.96 and 0.46). The article demonstrates the value of open data mining in the context of public health care. Copyright © 2018 Elsevier B.V. All rights reserved.

  18. Data Mining Web Services for Science Data Repositories

    Science.gov (United States)

    Graves, S.; Ramachandran, R.; Keiser, K.; Maskey, M.; Lynnes, C.; Pham, L.

    2006-12-01

    The maturation of web services standards and technologies sets the stage for a distributed "Service-Oriented Architecture" (SOA) for NASA's next generation science data processing. This architecture will allow members of the scientific community to create and combine persistent distributed data processing services and make them available to other users over the Internet. NASA has initiated a project to create a suite of specialized data mining web services designed specifically for science data. The project leverages the Algorithm Development and Mining (ADaM) toolkit as its basis. The ADaM toolkit is a robust, mature and freely available science data mining toolkit that is being used by several research organizations and educational institutions worldwide. These mining services will give the scientific community a powerful and versatile data mining capability that can be used to create higher order products such as thematic maps from current and future NASA satellite data records with methods that are not currently available. The package of mining and related services is being developed using Web Services standards so that community-based measurement processing systems can access and interoperate with them. These standards-based services give users different options for utilizing them, from direct remote invocation by a client application to deployment of a Business Process Execution Language (BPEL) solutions package where a complex data mining workflow is exposed to others as a single service. The ability to deploy and operate these services at a data archive allows the data mining algorithms to be run where the data are stored, a more efficient scenario than moving large amounts of data over the network. This will be demonstrated in a scenario in which a user uses a remote Web-Service-enabled clustering algorithm to create cloud masks from satellite imagery at the Goddard Earth Sciences Data and Information Services Center (GES DISC).

  19. Integrating Data Mining Techniques into Telemedicine Systems

    Directory of Open Access Journals (Sweden)

    Mihaela GHEORGHE

    2014-01-01

    The medical system is facing a wide range of challenges nowadays due to changes that are taking place in global healthcare systems. These challenges consist mostly of economic constraints (spiraling costs, financial issues), but also of the increased emphasis on accountability and transparency, changes in the education field, and the growing complexity of biomedical research studies. The new partnerships formed in medical care systems and the great advances in the IT industry also suggest that a paradigm shift is occurring. This requires a focus on interaction, collaboration and increased sharing of information and knowledge, all of which may in turn be leading healthcare organizations to embrace the techniques of data mining in order to create and sustain optimal healthcare outcomes. Data mining is a domain of great importance nowadays, as it provides advanced data analysis techniques for extracting knowledge from the huge volumes of data collected and stored by every system on a daily basis. In healthcare organizations, data mining can provide valuable information for patient diagnosis and treatment planning, customer relationship management, organization resources management or fraud detection. In this article we describe the importance of data mining techniques and systems for healthcare organizations, with a focus on developing and implementing telemedicine solutions in order to improve the healthcare services provided to patients. We provide an architecture for integrating data mining techniques into telemedicine systems and also offer an overview of how to understand and improve the implemented solution using Business Process Management methods.

  20. Tools for Educational Data Mining: A Review

    Science.gov (United States)

    Slater, Stefan; Joksimovic, Srecko; Kovanovic, Vitomir; Baker, Ryan S.; Gasevic, Dragan

    2017-01-01

    In recent years, a wide array of tools have emerged for the purposes of conducting educational data mining (EDM) and/or learning analytics (LA) research. In this article, we hope to highlight some of the most widely used, most accessible, and most powerful tools available for the researcher interested in conducting EDM/LA research. We will…

  1. Mining Diagnostic Assessment Data for Concept Similarity

    Science.gov (United States)

    Madhyastha, Tara; Hunt, Earl

    2009-01-01

    This paper introduces a method for mining multiple-choice assessment data for similarity of the concepts represented by the multiple choice responses. The resulting similarity matrix can be used to visualize the distance between concepts in a lower-dimensional space. This gives an instructor a visualization of the relative difficulty of concepts…

  2. Process mining data science in action

    CERN Document Server

    van der Aalst, Wil

    2016-01-01

    The first to cover this missing link between data mining and process modeling, this book provides real-world techniques for monitoring and analyzing processes in real time. It is a powerful new tool destined to play a key role in business process management.

  3. Comparative genomics using data mining tools

    Indian Academy of Sciences (India)

    We have analysed the genomes of representatives of three kingdoms of life, namely, archaea, eubacteria and eukaryota using data mining tools based on compositional analyses of the protein sequences. The representatives chosen in this analysis were Methanococcus jannaschii, Haemophilus influenzae and ...

  4. Supplementary data: Eucalyptus microsatellites mined in silico ...

    Indian Academy of Sciences (India)

    Supplementary data: Eucalyptus microsatellites mined in silico: survey and evaluation. R. Yasodha, R. Sumathi, P. Chezhian, S. Kavitha and M. Ghosh. J. Genet. 87, XX-XX. Table 1. List of EST-SSR primers developed for E. globulus (column headings include Acc. no., SSR repeats, forward primer, Tm and product).

  5. Academic Performance: An Approach From Data Mining

    Directory of Open Access Journals (Sweden)

    David L. La Red Martinez

    2012-02-01

    The relatively low percentage of students promoted and regularized (academic success) in the Operating Systems course of the LSI (Bachelor's Degree in Information Systems) of FaCENA (Faculty of Exact and Natural Sciences and Surveying - Facultad de Ciencias Exactas, Naturales y Agrimensura) of UNNE prompted this work, whose objective is to determine the variables that affect academic performance, considering the final status of the student according to Res. 185/03 CD (scheme for evaluation and promotion): promoted, regular or free. The variables considered are: status of the student, educational level of parents, secondary education, socio-economic level, and others. Data warehouse (DW) and data mining (DM) techniques were used to search for profiles of students and to identify situations of potential academic success or failure. Classifications were carried out through clustering techniques according to different criteria, including: classification mining according to academic program, according to final status of the student, and according to importance given to study; demographic clustering; and Kohonen clustering according to final status of the student. The following outputs were produced: partition statistics, partition details, cluster details, field details and field frequencies, overall quality of each process and detailed quality (precision, classification, reliability), confusion matrices, gain/lift charts, trees, node distributions, field importance, field correspondence tables and cluster statistics. Once the profiles of students with low academic performance have been determined, actions aimed at avoiding potential academic failures can be taken. This work aims to provide a brief description of aspects of the data warehouse built and of some of the data mining processes developed on it.

  6. Visual Data Mining of Robot Performance Data, Phase II

    Data.gov (United States)

    National Aeronautics and Space Administration — We propose to design and develop VDM/RP, a visual data mining system that will enable analysts to acquire, store, query, analyze, and visualize recent and historical...

  7. Transparent data mining for big and small data

    CERN Document Server

    Quercia, Daniele; Pasquale, Frank

    2017-01-01

    This book focuses on new and emerging data mining solutions that offer a greater level of transparency than existing solutions. Transparent data mining solutions with desirable properties (e.g. effective, fully automatic, scalable) are covered in the book. Experimental findings of transparent solutions are tailored to different domain experts, and experimental metrics for evaluating algorithmic transparency are presented. The book also discusses societal effects of black box vs. transparent approaches to data mining, as well as real-world use cases for these approaches. As algorithms increasingly support different aspects of modern life, a greater level of transparency is sorely needed, not least because discrimination and biases have to be avoided. With contributions from domain experts, this book provides an overview of an emerging area of data mining that has profound societal consequences, and provides the technical background for readers to contribute to the field or to put existing approaches to prac...

  8. Data Mining of Network Logs

    Science.gov (United States)

    Collazo, Carlimar

    2011-01-01

    The purpose of this work is to analyze network monitoring logs to support the computer incident response team. Specifically, the aims are to gain a clear understanding of the Uniform Resource Locator (URL) and its structure, and to provide a way to break down a URL based on protocol, host name, domain name, path, and other attributes. Finally, the work provides a method to perform data reduction by identifying the different types of advertisements shown on a webpage for incident data analysis. The procedure used for analysis and data reduction will be a computer program that analyzes the URL and distinguishes advertisement links from the actual content links.
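
    The URL breakdown described in this abstract can be sketched roughly as follows. This is a minimal illustration using Python's standard urllib.parse module, not the program the author wrote; the function names, the two-label domain heuristic and the AD_HOST_SUFFIXES list are assumptions made for the example.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical ad-serving host suffixes (an assumption for this sketch);
# a real tool would load a curated list instead.
AD_HOST_SUFFIXES = ("doubleclick.net", "adserver.example")


def break_down_url(url):
    """Split a URL into protocol, host name, domain name, path and query."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    labels = host.split(".")
    # Naive heuristic: treat the last two labels as the domain name.
    # This is wrong for suffixes such as "co.uk".
    domain = ".".join(labels[-2:]) if len(labels) >= 2 else host
    return {
        "protocol": parts.scheme,
        "host": host,
        "domain": domain,
        "path": parts.path,
        "query": parse_qs(parts.query),
    }


def is_advertisement(url):
    """Flag a URL whose host matches one of the assumed ad-serving suffixes."""
    host = break_down_url(url)["host"]
    return any(host == s or host.endswith("." + s) for s in AD_HOST_SUFFIXES)
```

    Filtering a log with is_advertisement before analysis is one simple form of the data reduction mentioned above; matching on host suffixes alone will miss advertisements served from content hosts.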

  9. Data Mining Smart Energy Time Series

    Directory of Open Access Journals (Sweden)

    Janina POPEANGA

    2015-07-01

    With the advent of smart metering technology, the amount of energy data will increase significantly, and the utilities industry will have to face another big challenge: finding relationships within time-series data and, even more, analyzing huge numbers of time series to find useful patterns and trends with fast or even real-time response. This study presents a small review of the literature in the field, aiming to demonstrate how essential the application of data mining techniques to time series is for making the best use of this large quantity of data, despite all the difficulties. The most important Time Series Data Mining techniques are also presented, highlighting their applicability in the energy domain.

  10. Data Mining Supercomputing with SAS JMP® Genomics

    Directory of Open Access Journals (Sweden)

    Richard S. Segall

    2011-02-01

    JMP® Genomics is statistical discovery software that can uncover meaningful patterns in high-throughput genomics and proteomics data. JMP® Genomics is designed for biologists, biostatisticians, statistical geneticists, and those engaged in analyzing the vast stores of data that are common in genomic research (SAS, 2009). Data mining was performed using JMP® Genomics on the two collections of microarray databases available from the National Center for Biotechnology Information (NCBI) for lung cancer and breast cancer. The Gene Expression Omnibus (GEO) of NCBI serves as a public repository for a wide range of high-throughput experimental data, including the two collections of lung cancer and breast cancer data that were used for this research. The results of applying data mining using JMP® Genomics are shown in this paper with numerous screenshots.

  11. A Data Mining Approach for Cardiovascular Diagnosis

    Directory of Open Access Journals (Sweden)

    Pereira Joana

    2017-12-01

    The large amounts of data generated by healthcare transactions are too complex and voluminous to be processed and analysed by traditional methods. Data mining can improve decision-making by discovering patterns and trends in large amounts of complex data. In the healthcare industry specifically, data mining can be used to decrease costs by increasing efficiency, improve patient quality of life, and perhaps most importantly, save the lives of more patients. The main goal of this project is to apply data mining techniques to predict the degree of disability that patients will present when they are discharged from hospital. The clinical data that compose the data set were obtained from a single hospital and contain information about patients who were hospitalized in the Cardiovascular Disease (CVD) unit in 2016 after suffering a cardiovascular accident. To develop this project, the Waikato Environment for Knowledge Analysis (WEKA) machine learning workbench will be used, since it allows users to quickly try out and compare different machine learning methods on new data sets.

  12. On-board Data Mining

    Science.gov (United States)

    Tanner, Steve; Stein, Cara; Graves, Sara J.

    Networks of remote sensors are becoming more common as technology improves and costs decline. In the past, a remote sensor was usually a device that collected data to be retrieved at a later time by some other mechanism. These collected data were usually processed well after the fact at a computer greatly removed from the in situ sensing location. This has begun to change as sensor technology, on-board processing, and network communication capabilities have increased and their prices have dropped. There has been an explosion in the number of sensors and sensing devices, not just around the world, but literally throughout the solar system. These sensors are not only becoming vastly more sophisticated, accurate, and detailed in the data they gather but they are also becoming cheaper, lighter, and smaller. At the same time, engineers have developed improved methods to embed computing systems, memory, storage, and communication capabilities into the platforms that host these sensors. Now, it is not unusual to see large networks of sensors working in cooperation with one another. Nor does it seem strange to see the autonomous operation of sensor-based systems, from space-based satellites to smart vacuum cleaners that keep our homes clean and robotic toys that help to entertain and educate our children. But access to sensor data and computing power is only part of the story. For all the power of these systems, there are still substantial limits to what they can accomplish. These include the well-known limits to current Artificial Intelligence capabilities and our limited ability to program the abstract concepts, goals, and improvisation needed for fully autonomous systems. But they also include much more basic engineering problems such as lack of adequate power, communications bandwidth, and memory, as well as problems with the geolocation and real-time georeferencing required to integrate data from multiple sensors to be used together.

  13. Data mining the EXFOR database

    International Nuclear Information System (INIS)

    Brown, David; Herman, Michal; Hirdt, John

    2014-01-01

    The EXFOR database contains the largest collection of experimental nuclear reaction data available as well as this data's bibliographic information and experimental details. We created an undirected graph from the EXFOR datasets with graph nodes representing single observables and graph links representing the connections of various types between these observables. This graph is an abstract representation of the connections in EXFOR, similar to graphs of social networks, authorship networks, etc. Analysing this abstract graph, we are able to address very specific questions such as: i) What observables are being used as reference measurements by the experimental community? ii) Are these observables given the attention needed by various standards organisations? iii) Are there classes of observables that are not connected to these reference measurements? In addressing these questions, we propose several (mostly cross-section) observables that should be evaluated and made into reaction reference standards. (authors)
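
    The kind of graph analysis described in this abstract can be illustrated with a toy example. The sketch below builds an undirected graph from pairs of linked observables, ranks nodes by number of links, and lists nodes not connected to the top-ranked node. The data are invented, and using node degree as a proxy for "used as a reference measurement" is an assumption of this sketch, not a statement of the authors' method.

```python
from collections import defaultdict, deque


def build_graph(links):
    """Build an undirected graph (adjacency sets) from pairs of observables."""
    graph = defaultdict(set)
    for a, b in links:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def rank_by_degree(graph):
    """Order nodes from most to least connected (ties broken by name)."""
    return sorted(graph, key=lambda node: (-len(graph[node]), node))


def reachable_from(graph, start):
    """Return every node connected to `start`, using breadth-first search."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen


# Made-up observables and links, for illustration only.
links = [("A", "REF"), ("B", "REF"), ("C", "REF"), ("A", "B"), ("X", "Y")]
graph = build_graph(links)
reference = rank_by_degree(graph)[0]
unconnected = set(graph) - reachable_from(graph, reference)
```

    Here reference is the most-linked node and unconnected holds the observables with no path to it, which mirrors questions i) and iii) in the abstract on a very small scale.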

  14. Big data mining: In-database Oracle data mining over hadoop

    Science.gov (United States)

    Kovacheva, Zlatinka; Naydenova, Ina; Kaloyanova, Kalinka; Markov, Krasimir

    2017-07-01

    Big data challenges different aspects of storing, processing and managing data, as well as analyzing and using data for business purposes. Applying Data Mining methods over Big Data is another challenge because of huge data volumes, the variety of information, and the dynamics of the sources. Different applications have been developed in this area, but their successful usage depends on understanding many specific parameters. In this paper we present several opportunities for using Data Mining techniques provided by the analytical engine of RDBMS Oracle over data stored in Hadoop Distributed File System (HDFS). Some experimental results are presented and discussed.

  15. A Survey of Educational Data-Mining Research

    Science.gov (United States)

    Huebner, Richard A.

    2013-01-01

    Educational data mining (EDM) is an emerging discipline that focuses on applying data mining tools and techniques to educationally related data. The discipline focuses on analyzing educational data to develop models for improving learning experiences and improving institutional effectiveness. A literature review on educational data mining topics…

  16. Spatiotemporal Data Mining: A Computational Perspective

    Directory of Open Access Journals (Sweden)

    Shashi Shekhar

    2015-10-01

    Explosive growth in geospatial and temporal data as well as the emergence of new technologies emphasize the need for automated discovery of spatiotemporal knowledge. Spatiotemporal data mining studies the process of discovering interesting and previously unknown, but potentially useful patterns from large spatiotemporal databases. It has broad application domains including ecology and environmental management, public safety, transportation, earth science, epidemiology, and climatology. The complexity of spatiotemporal data and intrinsic relationships limits the usefulness of conventional data science techniques for extracting spatiotemporal patterns. In this survey, we review recent computational techniques and tools in spatiotemporal data mining, focusing on several major pattern families: spatiotemporal outlier, spatiotemporal coupling and tele-coupling, spatiotemporal prediction, spatiotemporal partitioning and summarization, spatiotemporal hotspots, and change detection. Compared with other surveys in the literature, this paper emphasizes the statistical foundations of spatiotemporal data mining and provides comprehensive coverage of computational approaches for various pattern families. We also list popular software tools for spatiotemporal data analysis. The survey concludes with a look at future research needs.

  17. Temporal data mining for hospital management

    Science.gov (United States)

    Tsumoto, Shusaku; Hirano, Shoji

    2009-04-01

    About twenty years have passed since clinical information began to be stored electronically in hospital information systems in the 1980s. Stored data range from accounting information to laboratory data, and even patient records have now started to be accumulated: in other words, a hospital cannot function without the information system, where almost all the pieces of medical information are stored as multimedia databases. In this paper, we applied temporal data mining and exploratory data analysis techniques to hospital management data. The analysis yields several interesting results, which suggest that the reuse of stored data can provide a powerful tool for hospital management.

  18. A Data Mining Approach to Intelligence Operations

    DEFF Research Database (Denmark)

    Memon, Nasrullah; Hicks, David; Harkiolakis, Nicholas

    2008-01-01

    agencies. An emphasis in the paper is placed on Social Network Analysis and Investigative Data Mining, and the use of these technologies in the counterterrorism domain. Tools and techniques from both areas are described, along with the important tasks for which they can be used to assist...... with the investigation and analysis of terrorist organizations. The process of collecting data about these organizations is also considered along with the inherent difficulties that are involved....

  19. Data Mining Based on Cloud-Computing Technology

    Directory of Open Access Journals (Sweden)

    Ren Ying

    2016-01-01

    There are performance bottlenecks and scalability problems when a traditional data-mining system is used in cloud computing. In this paper, we present a data-mining platform based on cloud computing. Compared with a traditional data-mining system, this platform is highly scalable, has massive data processing capacities, is service-oriented, and has low hardware cost. This platform can support the design and applications of a wide range of distributed data-mining systems.

  20. Possibility of Integrated Data Mining of Clinical Data

    Directory of Open Access Journals (Sweden)

    Akinori Abe

    2007-03-01

    In this paper, we introduce integrated data mining. Because of recent rapid progress in medical science as well as clinical diagnosis and treatment, integrated and cooperative research among researchers in medicine, biology, engineering, cultural science, and sociology is required. Therefore, we propose a framework called Cyber Integrated Medical Infrastructure (CIMI). Within this framework, we can deal with various types of data and consequently need to integrate those data prior to analysis. In this study, for medical science, we analyze the features and relationships among various types of data and show the possibility of integrated data mining.

  1. Application of data mining in performance measures

    Science.gov (United States)

    Chan, Michael F. S.; Chung, Walter W.; Wong, Tai Sun

    2001-10-01

    This paper proposes a structured framework for applying data mining to performance measures. The framework considers how knowledge workers interact with performance information at the enterprise level to make informed decisions in managing the effectiveness of operations. A case study applying data mining technology to performance data in an airline company illustrates its use; performance measures are applied specifically to assist the aircraft delay management process. The increasingly dispersed and complex nature of airline operations puts considerable strain on knowledge workers in searching for, acquiring and analyzing information to manage performance. One major problem faced by knowledge workers is identifying the root causes of performance deficiencies. Analyzing the large number of factors involved in root causes can be time consuming, and the objective of applying data mining technology is to reduce the time and resources needed for this process. Increasing market competition for better performance management in various industries gives rise to the need for intelligent use of data. Because of this, the framework proposed here is readily generalizable to industries such as manufacturing. It could assist knowledge workers who are constantly looking for ways to improve operational effectiveness through new initiatives and who must act quickly to gain competitive advantage in the marketplace.

  2. Utility Independent Privacy Preserving Data Mining - Horizontally Partitioned Data

    Directory of Open Access Journals (Sweden)

    E Poovammal

    2010-06-01

    Micro data is a valuable source of information for research. However, publishing data about individuals for research purposes, without revealing sensitive information, is an important problem. The main objective of privacy preserving data mining algorithms is to obtain accurate results/rules by analyzing the maximum possible amount of data without unintended information disclosure. Data sets for analysis may be in a centralized server or in a distributed environment. In a distributed environment, the data may be horizontally or vertically partitioned. We have developed a simple technique by which horizontally partitioned data can be used for any type of mining task without information loss. The partitioned sensitive data at 'm' different sites are transformed using a mapping table or graded grouping technique, depending on the data type. This transformed data set is given to a third party for analysis. This may not be a trusted party, but it is still allowed to perform mining operations on the data set and to release the results to all the 'm' parties. The results are interpreted among the 'm' parties involved in the data sharing. The experiments conducted on real data sets show that our proposed simple transformation procedure preserves one hundred percent of the performance of any data mining algorithm as compared to the original data set while preserving privacy.
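
    The mapping-table idea in this abstract can be pictured with a small sketch. The code below is a simplified illustration under stated assumptions, not the authors' procedure: it assumes the sites have already agreed on a shared table that replaces each distinct sensitive value with an opaque code, and all names and records are invented. Graded grouping is not shown.

```python
from collections import Counter


def build_mapping(values):
    """Assign an opaque code to each distinct sensitive value.

    Assumption of this sketch: all sites share the resulting table and
    the third party never sees it.
    """
    return {value: f"C{i}" for i, value in enumerate(sorted(set(values)))}


def transform(records, field, mapping):
    """Return copies of the records with the sensitive field replaced by its code."""
    return [{**record, field: mapping[record[field]]} for record in records]


# Invented horizontally partitioned data: same fields, different rows per site.
site_1 = [{"age_band": "30-39", "diagnosis": "flu"},
          {"age_band": "40-49", "diagnosis": "asthma"}]
site_2 = [{"age_band": "30-39", "diagnosis": "asthma"},
          {"age_band": "50-59", "diagnosis": "flu"}]

mapping = build_mapping(r["diagnosis"] for site in (site_1, site_2) for r in site)

# The third party receives only the coded records and mines them.
pooled = transform(site_1, "diagnosis", mapping) + transform(site_2, "diagnosis", mapping)
coded_counts = Counter(r["diagnosis"] for r in pooled)

# The sites interpret the result by inverting the shared table.
inverse = {code: value for value, code in mapping.items()}
decoded_counts = {inverse[code]: n for code, n in coded_counts.items()}
```

    Because the mapping is one-to-one, frequency-based results computed on the coded records are identical to those on the original records, which is the intuition behind the "no information loss" claim; how much privacy a plain substitution table provides in practice is a separate question that this sketch does not address.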

  3. Using Data Mining to Teach Applied Statistics and Correlation

    Science.gov (United States)

    Hartnett, Jessica L.

    2016-01-01

    This article describes two class activities that introduce the concept of data mining and very basic data mining analyses. Assessment data suggest that students learned some of the conceptual basics of data mining, understood some of the ethical concerns related to the practice, and were able to perform correlations via the Statistical Package for…

  4. 76 FR 14637 - State Medicaid Fraud Control Units; Data Mining

    Science.gov (United States)

    2011-03-17

    ...] State Medicaid Fraud Control Units; Data Mining AGENCY: Office of Inspector General (OIG), HHS. ACTION... and analyzing State Medicaid claims data, known as data mining. To support and modernize MFCU efforts... (FFP) in the costs of defined data mining activities under specified conditions. In addition, we...

  5. Data mining in e-commerce: A survey

    Indian Academy of Sciences (India)

    it is only apposite to seek the services of data mining to make (business) sense out of these data sets. Data mining ..... for the simple reason that for practical purposes, it is sufficient to include snapshots of data taken at say, weekly ..... of the mining environment and the expenses the user is willing to incur). The authors have.

  6. On data mining in context : cases, fusion and evaluation

    NARCIS (Netherlands)

    Putten, Petrus Wilhelmus Henricus van der

    2010-01-01

    Data mining can be seen as a process, with modeling as the core step. However, other steps such as planning, data preparation, evaluation and deployment are of key importance for applications. This thesis studies data mining in the context of these other steps with the goal of improving data mining

  7. Data Mining and Data Fusion for Enhanced Decision Support

    Energy Technology Data Exchange (ETDEWEB)

    Khan, Shiraj [ORNL; Ganguly, Auroop R [ORNL; Gupta, Amar [University of Arizona

    2008-01-01

    The process of Data Mining converts information to knowledge by utilizing tools from the disciplines of computational statistics, database technologies, machine learning, signal processing, nonlinear dynamics, process modeling, simulation, and allied disciplines. Data Mining allows business problems to be analyzed from diverse perspectives, including dimensionality reduction, correlation and co-occurrence, clustering and classification, regression and forecasting, anomaly detection, and change analysis. The predictive insights generated from Data Mining can be further utilized through real-time analysis and decision sciences, as well as through human-driven analysis based on management by exceptions or by objectives, to generate actionable knowledge. The tools that enable the transformation of raw data to actionable predictive insights are collectively referred to as Decision Support tools. This chapter presents a new formalization of the decision process, leading to a new Decision Superiority model, partially motivated by the Joint Directors of Laboratories (JDL) Data Fusion Model. In addition, it examines the growing importance of Data Fusion concepts.

  8. Uncertainty modeling for data mining a label semantics approach

    CERN Document Server

    Qin, Zengchang

    2014-01-01

    Outlining a new research direction in fuzzy set theory applied to data mining, this volume proposes a number of new data mining algorithms and includes dozens of figures and illustrations that help the reader grasp the complexities of the concepts.

  9. Visual data mining for quantized spatial data

    Science.gov (United States)

    Braverman, Amy; Kahn, Brian

    2004-01-01

    In previous papers we've shown how a well-known data compression algorithm called Entropy-constrained Vector Quantization can be modified to reduce the size and complexity of very large satellite data sets. In this paper, we discuss how to visualize and understand the content of such reduced data sets.

  10. Educational data mining applications and trends

    CERN Document Server

    2014-01-01

    This book is devoted to the Educational Data Mining arena. It highlights works that show relevant proposals, developments, and achievements that shape trends and inspire future research. After a rigorous revision process, sixteen manuscripts were accepted and organized into four parts as follows: · Profile: The first part embraces three chapters oriented to: 1) describe the nature of educational data mining (EDM); 2) describe how to pre-process raw data to facilitate data mining (DM); 3) explain how EDM supports government policies to enhance education. · Student modeling: The second part contains five chapters concerned with: 4) explore the factors having an impact on students' academic success; 5) detect student's personality and behaviors in an educational game; 6) predict students' performance to adjust content and strategies; 7) identify students who will most benefit from tutor support; 8) hypothesize the student answer correctness based on eye metrics and mouse click. · As...

  11. Data Mining Thesis Topics in Finland

    OpenAIRE

    Bajo Rouvinen, Ari

    2017-01-01

    The Theseus open repository contains metadata about more than 100,000 thesis publications from the different universities of applied sciences in Finland. Different data mining techniques were applied to the Theseus dataset to build a web application to explore thesis topics and degree programmes using different libraries in Python and JavaScript. Thesis topics were extracted from manually annotated keywords by the authors and curated subjects by the librarians. During the project, the quality...

  12. Data mining teaching throughout cards game competition

    OpenAIRE

    Antoñanzas-Torres, Javier; Urraca, Ruben; Sodupe-Ortega, Enrique; Martínez-de-Pison, Francisco; Pernía-Espinoza, Alpha

    2015-01-01

    [EN] Data-mining techniques and statistical metrics learning can be complicated because of the complexity and overwhelming nature of this field. In this paper a class competition to improve learning of designing Decision Support Systems (DSS) by playing a classic cards game named "Copo" is proposed. The fact that this game is based on a probabilistic problem and that different solutions can be obtained represents a very typical kind of problem in the field of engineering and compu...

  13. Data Mining in Institutional Economics Tasks

    Science.gov (United States)

    Kirilyuk, Igor; Kuznetsova, Anna; Senko, Oleg

    2018-02-01

    The paper discusses problems associated with the use of data mining tools to study discrepancies between countries with different types of institutional matrices by variety of potential explanatory variables: climate, economic or infrastructure indicators. An approach is presented which is based on the search of statistically valid regularities describing the dependence of the institutional type on a single variable or a pair of variables. Examples of regularities are given.

  14. Time Dependent Data Mining in RAVEN

    Energy Technology Data Exchange (ETDEWEB)

    Cogliati, Joshua Joseph [Idaho National Lab. (INL), Idaho Falls, ID (United States); Chen, Jun [Idaho National Lab. (INL), Idaho Falls, ID (United States); Patel, Japan Ketan [Idaho National Lab. (INL), Idaho Falls, ID (United States); Mandelli, Diego [Idaho National Lab. (INL), Idaho Falls, ID (United States); Maljovec, Daniel Patrick [Idaho National Lab. (INL), Idaho Falls, ID (United States); Alfonsi, Andrea [Idaho National Lab. (INL), Idaho Falls, ID (United States); Talbot, Paul William [Idaho National Lab. (INL), Idaho Falls, ID (United States); Rabiti, Cristian [Idaho National Lab. (INL), Idaho Falls, ID (United States)

    2016-09-01

    RAVEN is a generic software framework to perform parametric and probabilistic analysis based on the response of complex system codes. The goal of this type of analysis is to understand the response of such systems, in particular with respect to their probabilistic behavior, and to understand their predictability and drivers, or lack thereof. Data mining capabilities are the cornerstone for performing such deep learning of system responses. For this reason, static data mining capabilities were added last fiscal year (FY 15). In real applications, when dealing with complex multi-scale, multi-physics systems, it seems natural that, during transients, the relevance of the different scales and physics would evolve over time. For these reasons the data mining capabilities have been extended to allow their application over time. This report describes the new RAVEN capabilities, using several simple analytical tests to explain their application and demonstrate their proper implementation. The report concludes with the application of those newly implemented capabilities to the analysis of a simulation performed with the Bison code.

  15. Data Mining and Knowledge Management in Higher Education -Potential Applications.

    Science.gov (United States)

    Luan, Jing

    This paper introduces a new decision support tool, data mining, in the context of knowledge management. The most striking features of data mining techniques are clustering and prediction. The clustering aspect of data mining offers comprehensive characteristics analysis of students, while the predicting function estimates the likelihood for a…

  16. Comparative analysis of data mining techniques for business data

    Science.gov (United States)

    Jamil, Jastini Mohd; Shaharanee, Izwan Nizal Mohd

    2014-12-01

    Data mining is the process of employing one or more computer learning techniques to automatically analyze and extract knowledge from data contained within a database. Companies are using this tool to further understand their customers, to design targeted sales and marketing campaigns, to predict what product customers will buy and the frequency of purchase, and to spot trends in customer preferences that can lead to new product development. In this paper, we take a systematic approach to exploring several data mining techniques in business applications. The experimental results reveal that all the data mining techniques accomplish their goals perfectly, but each technique has its own characteristics and specifications that demonstrate its accuracy, proficiency and preference.

  17. Data mining and data integration in biology

    DEFF Research Database (Denmark)

    Ólason, Páll Ísólfur

    2008-01-01

    . They also necessitate new ways of data preparation as established methods for sequence sets are often useless when dealing with sets of sequence pairs. Therefore careful analysis on the sequence level as well as the integrated network level is needed to benchmark these data prior to use. The networks, which...... between molecules, the essence of systems biology. Internet technologies are very important in this respect as bioinformatics labs around the world generate staggering amounts of novel annotations, increasing the importance of on-line processing and distributed systems. One of the most important new data...... types in proteomics is protein-protein interactions. Interactions between the functional elements in the cell are a natural place to start when integrating protein annotations with the aim of gaining a systems view of the cell. Interaction data, however, are notoriously biased, erroneous and incomplete...

  18. Mining Significant Semantic Locations from GPS Data

    DEFF Research Database (Denmark)

    Cao, Xin; Cong, Gao; Jensen, Christian Søndergaard

    2010-01-01

    With the increasing deployment and use of GPS-enabled devices, massive amounts of GPS data are becoming available. We propose a general framework for the mining of semantically meaningful, significant locations, e.g., shopping malls and restaurants, from such data. We present techniques capable...... of extracting semantic locations from GPS data. We capture the relationships between locations and between locations and users with a graph. Significance is then assigned to locations using random walks over the graph that propagates significance among the locations. In doing so, mutual reinforcement between...
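
    The random-walk step mentioned above can be illustrated with a small power-iteration sketch. It is not the authors' algorithm: the toy graph, the damping value, and the helper name `propagate_significance` are assumptions made for illustration, in the style of a PageRank-like walk with restart.

```python
def propagate_significance(edges, damping=0.85, iterations=50):
    """Score nodes with a random walk with restart over a directed graph.

    edges: dict mapping each node to the list of nodes it links to.
    Hypothetical helper; a PageRank-style sketch, not the paper's method.
    """
    nodes = set(edges)
    for targets in edges.values():
        nodes.update(targets)
    nodes = sorted(nodes)
    n = len(nodes)
    score = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Every node receives the restart (teleport) share first.
        nxt = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            targets = edges.get(v, [])
            if targets:
                # Split this node's score evenly among the nodes it links to.
                share = damping * score[v] / len(targets)
                for t in targets:
                    nxt[t] += share
            else:
                # Dangling node: spread its score uniformly over all nodes.
                share = damping * score[v] / n
                for t in nodes:
                    nxt[t] += share
        score = nxt
    return score

# Toy graph (made up): both users visit the mall, one also visits the cafe.
graph = {"user_a": ["mall", "cafe"], "user_b": ["mall"], "mall": [], "cafe": []}
scores = propagate_significance(graph)
```

    In this toy graph the mall ends up with a higher score than the cafe because more visits point to it; the scores always sum to one.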

  19. Mining significant semantic locations from GPS data

    DEFF Research Database (Denmark)

    Cao, Xin; Cong, Gao; Jensen, Christian S.

    2010-01-01

    With the increasing deployment and use of GPS-enabled devices, massive amounts of GPS data are becoming available. We propose a general framework for the mining of semantically meaningful, significant locations, e.g., shopping malls and restaurants, from such data. We present techniques capable...... of extracting semantic locations from GPS data. We capture the relationships between locations and between locations and users with a graph. Significance is then assigned to locations using random walks over the graph that propagates significance among the locations. In doing so, mutual reinforcement between...

  20. Stratified sampling design based on data mining.

    Science.gov (United States)

    Kim, Yeonkook J; Oh, Yoonhwan; Park, Sunghoon; Cho, Sungzoon; Park, Hayoung

    2013-09-01

    To explore classification rules based on data mining methodologies which are to be used in defining strata in stratified sampling of healthcare providers with improved sampling efficiency. We performed k-means clustering to group providers with similar characteristics and then constructed decision trees on cluster labels to generate stratification rules. We assessed the variance explained by the stratification proposed in this study and by conventional stratification to evaluate the performance of the sampling design. We constructed a study database from health insurance claims data and providers' profile data made available to this study by the Health Insurance Review and Assessment Service of South Korea, and population data from Statistics Korea. From our database, we used the data for single specialty clinics or hospitals in two specialties, general surgery and ophthalmology, for the year 2011 in this study. Data mining resulted in five strata in general surgery with two stratification variables, the number of inpatients per specialist and population density of provider location, and five strata in ophthalmology with two stratification variables, the number of inpatients per specialist and number of beds. The percentages of variance in annual changes in the productivity of specialists explained by the stratification in general surgery and ophthalmology were 22% and 8%, respectively, whereas conventional stratification by the type of provider location and number of beds explained 2% and 0.2% of variance, respectively. This study demonstrated that data mining methods can be used in designing efficient stratified sampling with variables readily available to the insurer and government; it offers an alternative to the existing stratification method that is widely used in healthcare provider surveys in South Korea.
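
    The "variance explained by the stratification" figure can be read as the between-strata sum of squares divided by the total sum of squares. The sketch below assumes that definition (the abstract does not spell out the exact formula); the function name and the six data points are made up for illustration.

```python
def variance_explained(values, strata):
    """Share of the total variance in `values` explained by `strata`.

    Computed as between-strata sum of squares / total sum of squares.
    Assumed definition; the study's exact computation is not given.
    """
    if len(values) != len(strata):
        raise ValueError("values and strata must have the same length")
    grand_mean = sum(values) / len(values)
    total_ss = sum((v - grand_mean) ** 2 for v in values)
    # Group the values by stratum label.
    groups = {}
    for v, s in zip(values, strata):
        groups.setdefault(s, []).append(v)
    between_ss = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values()
    )
    return between_ss / total_ss if total_ss else 0.0

# Made-up productivity changes for six providers in two strata.
changes = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]
labels = ["A", "A", "A", "B", "B", "B"]
share = variance_explained(changes, labels)
```

    A stratification that separates providers well pushes this share toward 1, while a single stratum yields 0.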

  1. Parallel object-oriented data mining system

    Science.gov (United States)

    Kamath, Chandrika; Cantu-Paz, Erick

    2004-01-06

    A data mining system uncovers patterns, associations, anomalies and other statistically significant structures in data. Data files are read and displayed. Objects in the data files are identified. Relevant features for the objects are extracted. Patterns among the objects are recognized based upon the features. Data from the Faint Images of the Radio Sky at Twenty Centimeters (FIRST) sky survey was used to search for bent doubles. This test was conducted on data from the Very Large Array in New Mexico which seeks to locate a special type of quasar (radio-emitting stellar object) called bent doubles. The FIRST survey has generated more than 32,000 images of the sky to date. Each image is 7.1 megabytes, yielding more than 100 gigabytes of image data in the entire data set.

  2. Data mining, knowledge discovery and data-driven modelling

    NARCIS (Netherlands)

    Solomatine, D.P.; Velickov, S.; Bhattacharya, B.; Van der Wal, B.

    2003-01-01

    The project was aimed at exploring the possibilities of a new paradigm in modelling - data-driven modelling, often referred to as "data mining". Several application areas were considered: sedimentation problems in the Port of Rotterdam, automatic soil classification on the basis of cone penetration

  3. Separation in Data Mining Based on Fractal Nature of Data

    Czech Academy of Sciences Publication Activity Database

    Jiřina, Marcel; Jiřina jr., M.

    2013-01-01

    Vol. 3, No. 1 (2013), pp. 44-60. ISSN 2225-658X. Institutional support: RVO:67985807. Keywords: nearest neighbor; fractal set; multifractal; IINC method; correlation dimension. Subject RIV: JC - Computer Hardware; Software. http://sdiwc.net/digital-library/separation-in-data-mining-based-on-fractal-nature-of-data.html

  4. Analyzing Log Files using Data-Mining

    Directory of Open Access Journals (Sweden)

    Marius Mihut

    2008-01-01

    Information systems (i.e. servers, applications and communication devices) create a large amount of monitoring data that are saved as log files. A data-mining approach is helpful for analyzing them. This article presents the steps necessary for creating an ‘analyzing instrument’ based on the open source software Waikato Environment for Knowledge Analysis (Weka) [1]. As an example, a system log file created by a Windows-based operating system is used as the input file.

  5. Multimedia data mining and analytics disruptive innovation

    CERN Document Server

    Baughman, Aaron; Pan, Jia-Yu; Petrushin, Valery A

    2015-01-01

    This authoritative text/reference provides fresh insights into the cutting edge of multimedia data mining, reflecting how the research focus has shifted towards networked social communities, mobile devices and sensors. Presenting a detailed exploration into the progression of the field, the book describes how the history of multimedia data processing can be viewed as a sequence of disruptive innovations. Across the chapters, the discussion covers the practical frameworks, libraries, and open source software that enable the development of ground-breaking research into practical applications.

  6. The viability of business data mining in the sports environment ...

    African Journals Online (AJOL)

    Data mining can be viewed as the process of extracting previously unknown information from large databases and utilising this information to make crucial business decisions (Simoudis, 1996: 26). This paper considers the viability of using data mining tools and techniques in sports, particularly with regard to mining the ...

  7. Optimal sampling strategy for data mining

    International Nuclear Information System (INIS)

    Ghaffar, A.; Shahbaz, M.; Mahmood, W.

    2013-01-01

    Recent technologies such as the Internet, corporate intranets, data warehouses, ERP systems, satellites, digital sensors, embedded systems, and mobile networks are generating such massive amounts of data that it is becoming very difficult to analyze and understand all of it, even with data mining tools. Huge datasets are becoming a difficult challenge for classification algorithms. With increasing amounts of data, data mining algorithms are getting slower and analysis is getting less interactive. Sampling can be a solution: using a fraction of the computing resources, sampling can often provide the same level of accuracy. The process of sampling requires much care because many factors are involved in determining the correct sample size. The approach proposed in this paper tries to find a solution to this problem. Based on a statistical formula, after some parameters are set, it returns a sample size called the sufficient sample size, and a sample of that size is then selected through probability sampling. Results indicate the usefulness of this technique in coping with the problem of huge datasets. (author)
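
    The abstract does not state which statistical formula is used. As a stand-in, the sketch below uses Cochran's sample-size formula with a finite population correction and then draws a simple random sample; the function name, the default parameters, and the population of 10,000 records are assumptions for illustration only.

```python
import math
import random

def cochran_sample_size(population=None, z=1.96, p=0.5, margin=0.05):
    """Sample size from Cochran's formula (a stand-in, not the paper's formula).

    z: z-score for the confidence level (1.96 for about 95%).
    p: expected proportion (0.5 is the most conservative choice).
    margin: tolerated margin of error.
    population: dataset size, or None to skip the finite population correction.
    """
    n0 = (z ** 2) * p * (1.0 - p) / (margin ** 2)
    if population is None:
        return math.ceil(n0)
    # Finite population correction shrinks the sample for small datasets.
    return math.ceil(n0 / (1.0 + (n0 - 1.0) / population))

# Hypothetical dataset of 10,000 record indices.
size = cochran_sample_size(population=10_000)
# Probability (simple random) sampling with a fixed seed for repeatability.
sample = random.Random(0).sample(range(10_000), size)
```

    With these defaults the uncorrected size is 385 records, and the correction for a 10,000-record dataset lowers it to 370.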

  8. Data mining: childhood injury control and beyond.

    Science.gov (United States)

    Tepas, Joseph J

    2009-08-01

    Data mining is defined as the automatic extraction of useful, often previously unknown information from large databases or data sets. It has become a major part of modern life and is extensively used in industry, banking, government, and health care delivery. The process requires a data collection system that integrates input from multiple sources containing critical elements that define outcomes of interest. Appropriately designed data mining processes identify and adjust for confounding variables. The statistical modeling used to manipulate accumulated data may involve any number of techniques. As predicted results are periodically analyzed against those observed, the model is consistently refined to optimize precision and accuracy. Whether applying integrated sources of clinical data to inferential probabilistic prediction of risk of ventilator-associated pneumonia or population surveillance for signs of bioterrorism, it is essential that modern health care providers have at least a rudimentary understanding of what the concept means, how it basically works, and what it means to current and future health care.

  9. Virtual Observatories, Data Mining, and Astroinformatics

    Science.gov (United States)

    Borne, Kirk

    The historical, current, and future trends in knowledge discovery from data in astronomy are presented here. The story begins with a brief history of data gathering and data organization. A description of the development of new information science technologies for astronomical discovery is then presented. Among these are e-Science and the virtual observatory, with its data discovery, access, display, and integration protocols; astroinformatics and data mining for exploratory data analysis, information extraction, and knowledge discovery from distributed data collections; new sky surveys' databases, including rich multivariate observational parameter sets for large numbers of objects; and the emerging discipline of data-oriented astronomical research, called astroinformatics. Astroinformatics is described as the fourth paradigm of astronomical research, following the three traditional research methodologies: observation, theory, and computation/modeling. Astroinformatics research areas include machine learning, data mining, visualization, statistics, semantic science, and scientific data management. Each of these areas is now an active research discipline, with significant science-enabling applications in astronomy. Research challenges and sample research scenarios are presented in these areas, in addition to sample algorithms for data-oriented research. These information science technologies enable scientific knowledge discovery from the increasingly large and complex data collections in astronomy. The education and training of the modern astronomy student must consequently include skill development in these areas, whose practitioners have traditionally been limited to applied mathematicians, computer scientists, and statisticians. Modern astronomical researchers must cross these traditional discipline boundaries, thereby borrowing the best of breed methodologies from multiple disciplines. In the era of large sky surveys and numerous large telescopes, the potential

  10. Mining Building Metadata by Data Stream Comparison

    DEFF Research Database (Denmark)

    Holmegaard, Emil; Kjærgaard, Mikkel Baun

    2016-01-01

    to handle data streams with only slightly similar patterns. We have evaluated Metafier with points and data from one building located in Denmark. We have evaluated Metafier with 903 points, and the overall accuracy, with only 3 known examples, was 94.71%. Furthermore we found that using DTW for mining...... ways to annotate sensor and actuation points. This makes it difficult to create intuitive queries for retrieving data streams from points. Another problem is the amount of insufficient or missing metadata. We introduce Metafier, a tool for extracting metadata from comparing data streams. Metafier...... enables a semi-automatic labeling of metadata to building instrumentation. Metafier annotates points with metadata by comparing the data from a set of validated points with unvalidated points. Metafier has three different algorithms to compare points with based on their data. The three algorithms...
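
    The abstract mentions DTW (dynamic time warping) as one way to compare data streams. The following is a textbook DTW distance with an absolute-difference cost, offered as a sketch only; it is not Metafier's implementation, and the function name and sample streams are made up.

```python
def dtw_distance(a, b):
    """Classic dynamic time warping distance between two numeric sequences.

    Uses |x - y| as the local cost and no warping window.
    Textbook sketch; not taken from the Metafier tool.
    """
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best alignment cost of a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # a advances, b repeats
                cost[i][j - 1],      # b advances, a repeats
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]

# Two made-up streams with the same shape but different timing.
same_shape = dtw_distance([1, 2, 3], [1, 2, 2, 3])
shifted_level = dtw_distance([0, 0], [1, 1])
```

    Streams that differ only in timing get a distance of zero, which is why DTW suits points whose patterns are similar but not aligned sample by sample.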

  11. Big data mining analysis method based on cloud computing

    Science.gov (United States)

    Cai, Qing Qiu; Cui, Hong Gang; Tang, Hao

    2017-08-01

    In the era of information explosion, the super-large scale, discreteness, and unstructured or semi-structured nature of big data have gone far beyond what traditional data management methods can handle. With the arrival of the cloud computing era, cloud computing provides a new technical approach to mining massive data, which can effectively solve the problem that traditional data mining methods cannot adapt to massive data. This paper introduces the meaning and characteristics of cloud computing, analyzes the advantages of using cloud computing technology for data mining, designs an association rule mining algorithm based on the MapReduce parallel processing architecture, and carries out experimental verification. The parallel association rule mining algorithm based on a cloud computing platform can greatly improve the execution speed of data mining.
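
    The general idea of MapReduce-style association rule mining can be sketched in a few lines: each data split counts items and item pairs locally (map), the partial counts are summed (reduce), and rules are filtered by support and confidence. This is an in-process illustration under assumed names and made-up transactions, not the algorithm designed in the paper; a real deployment would run the map phase on separate workers.

```python
from collections import Counter
from itertools import combinations

def map_phase(partition):
    """Count single items and item pairs within one data split."""
    counts = Counter()
    for transaction in partition:
        items = sorted(set(transaction))
        for item in items:
            counts[(item,)] += 1
        for pair in combinations(items, 2):
            counts[pair] += 1
    return counts

def reduce_phase(partial_counts):
    """Sum the per-split counts into global counts."""
    total = Counter()
    for counts in partial_counts:
        total.update(counts)
    return total

def pair_rules(total, min_support, min_confidence):
    """Return (lhs, rhs, support_count, confidence) for qualifying pair rules."""
    found = []
    for itemset, count in total.items():
        if len(itemset) == 2 and count >= min_support:
            a, b = itemset
            for lhs, rhs in ((a, b), (b, a)):
                confidence = count / total[(lhs,)]
                if confidence >= min_confidence:
                    found.append((lhs, rhs, count, confidence))
    return sorted(found)

# Two made-up splits of a transaction database.
partitions = [
    [["bread", "milk"], ["bread", "butter", "milk"]],
    [["milk"], ["bread", "milk"]],
]
total = reduce_phase(map_phase(p) for p in partitions)
rules = pair_rules(total, min_support=3, min_confidence=0.8)
```

    Here only the rule bread -> milk survives: the pair occurs in three transactions and every transaction containing bread also contains milk, whereas milk -> bread has a confidence of 0.75.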

  12. Data Analysis and Data Mining: Current Issues in Biomedical Informatics

    Science.gov (United States)

    Bellazzi, Riccardo; Diomidous, Marianna; Sarkar, Indra Neil; Takabayashi, Katsuhiko; Ziegler, Andreas; McCray, Alexa T.

    2011-01-01

    Summary. Background: Medicine and biomedical sciences have become data-intensive fields, which, at the same time, enable the application of data-driven approaches and require sophisticated data analysis and data mining methods. Biomedical informatics provides a proper interdisciplinary context to integrate data and knowledge when processing available information, with the aim of giving effective decision-making support in clinics and translational research. Objectives: To reflect on different perspectives related to the role of data analysis and data mining in biomedical informatics. Methods: On the occasion of the 50th year of Methods of Information in Medicine, a symposium was organized that reflected on opportunities, challenges and priorities of organizing, representing and analysing data, information and knowledge in biomedicine and health care. The contributions of experts with a variety of backgrounds in the area of biomedical data analysis have been collected as one outcome of this symposium, in order to provide a broad, though coherent, overview of some of the most interesting aspects of the field. Results: The paper presents sections on data accumulation and data-driven approaches in medical informatics, data and knowledge integration, statistical issues for the evaluation of data mining models, translational bioinformatics and bioinformatics aspects of genetic epidemiology. Conclusions: Biomedical informatics represents a natural framework to properly and effectively apply data analysis and data mining methods in a decision-making context. In the future, it will be necessary to preserve the inclusive nature of the field and to foster an increasing sharing of data and methods between researchers. PMID: 22146916

  13. 4D seismic data acquisition method during coal mining

    International Nuclear Information System (INIS)

    Du, Wen-Feng; Peng, Su-Ping

    2014-01-01

    In order to observe overburden media changes caused by mining processing, we take the fully-mechanized working face of the BLT coal mine in Shendong mine district as an example to develop a 4D seismic data acquisition methodology during coal mining. The 4D seismic data acquisition is implemented to collect 3D seismic data four times in different periods, such as before mining, during the mining process and after mining to observe the changes of the overburden layer during coal mining. The seismic data in the research area demonstrates that seismic waves are stronger in energy, higher in frequency and have better continuous reflectors before coal mining. However, all this is reversed after coal mining because the overburden layer has been mined, the seismic energy and frequency decrease, and reflections have more discontinuities. Comparing the records collected in the survey with those from newly mined areas and other records acquired in the same survey with the same geometry and with a long time for settling after mining, it clearly shows that the seismic reflections have stronger amplitudes and are more continuous because the media have recovered by overburden layer compaction after a long time of settling after mining. By 4D seismic acquisition, the original background investigation of the coal layers can be derived from the first records, then the layer structure changes can be monitored through the records of mining action and compaction action after mining. This method has laid the foundation for further research into the variation principles of the overburden layer under modern coal-mining conditions. (paper)

  14. Unsupervised Tensor Mining for Big Data Practitioners.

    Science.gov (United States)

    Papalexakis, Evangelos E; Faloutsos, Christos

    2016-09-01

    Multiaspect data are ubiquitous in modern Big Data applications. For instance, different aspects of a social network are the different types of communication between people, the time stamp of each interaction, and the location associated to each individual. How can we jointly model all those aspects and leverage the additional information that they introduce to our analysis? Tensors, which are multidimensional extensions of matrices, are a principled and mathematically sound way of modeling such multiaspect data. In this article, our goal is to popularize tensors and tensor decompositions to Big Data practitioners by demonstrating their effectiveness, outlining challenges that pertain to their application in Big Data scenarios, and presenting our recent work that tackles those challenges. We view this work as a step toward a fully automated, unsupervised tensor mining tool that can be easily and broadly adopted by practitioners in academia and industry.

  15. Data mining in bioinformatics using Weka.

    Science.gov (United States)

    Frank, Eibe; Hall, Mark; Trigg, Len; Holmes, Geoffrey; Witten, Ian H

    2004-10-12

    The Weka machine learning workbench provides a general-purpose environment for automatic classification, regression, clustering and feature selection, which are common data mining problems in bioinformatics research. It contains an extensive collection of machine learning algorithms and data pre-processing methods complemented by graphical user interfaces for data exploration and the experimental comparison of different machine learning techniques on the same problem. Weka can process data given in the form of a single relational table. Its main objectives are to (a) assist users in extracting useful information from data and (b) enable them to easily identify a suitable algorithm for generating an accurate predictive model from it. http://www.cs.waikato.ac.nz/ml/weka.

  16. MouseMine: a new data warehouse for MGI.

    Science.gov (United States)

    Motenko, H; Neuhauser, S B; O'Keefe, M; Richardson, J E

    2015-08-01

    MouseMine (www.mousemine.org) is a new data warehouse for accessing mouse data from Mouse Genome Informatics (MGI). Based on the InterMine software framework, MouseMine supports powerful query, reporting, and analysis capabilities, the ability to save and combine results from different queries, easy integration into larger workflows, and a comprehensive Web Services layer. Through MouseMine, users can access a significant portion of MGI data in new and useful ways. Importantly, MouseMine is also a member of a growing community of online data resources based on InterMine, including those established by other model organism databases. Adopting common interfaces and collaborating on data representation standards are critical to fostering cross-species data analysis. This paper presents a general introduction to MouseMine, presents examples of its use, and discusses the potential for further integration into the MGI interface.

  17. Research on Customer Value Based on Extension Data Mining

    Science.gov (United States)

    Chun-Yan, Yang; Wei-Hua, Li

    Extenics is a new discipline for dealing with contradiction problems using formalized models. Extension data mining (EDM) combines Extenics with data mining. It uses extension methods and data mining technology to acquire knowledge based on extension transformations, which is called extension knowledge (EK). EK includes extensible classification knowledge, conductive knowledge and so on. Extension data mining technology (EDMT) is a new data mining technology that mines EK in databases or data warehouses. Customer value (CV) can weigh the importance of a customer relationship for an enterprise, with the enterprise as the subject that assesses value and the customers as the objects of that assessment at the same time. CV varies continually. Using EDMT to mine knowledge about changes in CV from databases, including quantitative change knowledge and qualitative change knowledge, can provide a foundation for an enterprise to decide its customer relationship management (CRM) strategy. It can also provide a new idea for studying CV.

  18. Report from Dagstuhl Seminar 12331 Mobility Data Mining and Privacy

    OpenAIRE

    Clifton, Christopher W.; Kuijpers, Bart; Morik, Katharina; Saygin, Yucel

    2012-01-01

    This report documents the program and the outcomes of Dagstuhl Seminar 12331 “Mobility Data Mining and Privacy”. Mobility data mining aims to extract knowledge from movement behaviour of people, but this data also poses novel privacy risks. This seminar gathered a multidisciplinary team for a conversation on how to balance the value in mining mobility data with privacy issues. The seminar focused on four key issues: Privacy in vehicular data, in cellular data, context-dependent privacy, and ...

  19. Proactive data mining with decision trees

    CERN Document Server

    Dahan, Haim; Rokach, Lior; Maimon, Oded

    2014-01-01

    This book explores a proactive and domain-driven method to classification tasks. This novel proactive approach to data mining not only induces a model for predicting or explaining a phenomenon, but also utilizes specific problem/domain knowledge to suggest specific actions to achieve optimal changes in the value of the target attribute. In particular, the authors suggest a specific implementation of the domain-driven proactive approach for classification trees. The book centers on the core idea of moving observations from one branch of the tree to another. It introduces a novel splitting crite

  20. Patent data mining method and apparatus

    Science.gov (United States)

    Boyack, Kevin W.; Grafe, V. Gerald; Johnson, David K.; Wylie, Brian N.

    2002-01-01

    A method of data mining represents related patents in a multidimensional space. Distance between patents in the multidimensional space corresponds to the extent of relationship between the patents. The relationship between pairings of patents can be expressed based on weighted combinations of several predicates. The user can select portions of the space to perceive. The user also can interact with and control the communication of the space, focusing attention on aspects of the space of most interest. The multidimensional spatial representation allows more ready comprehension of the structure of the relationships among the patents.

  1. Data Mining Methods for Recommender Systems

    Science.gov (United States)

    Amatriain, Xavier; Jaimes*, Alejandro; Oliver, Nuria; Pujol, Josep M.

    In this chapter, we give an overview of the main Data Mining techniques used in the context of Recommender Systems. We first describe common preprocessing methods such as sampling or dimensionality reduction. Next, we review the most important classification techniques, including Bayesian Networks and Support Vector Machines. We describe the k-means clustering algorithm and discuss several alternatives. We also present association rules and related algorithms for an efficient training process. In addition to introducing these techniques, we survey their uses in Recommender Systems and present cases where they have been successfully applied.

  2. Mining the Kepler Data using Machine Learning

    Science.gov (United States)

    Walkowicz, Lucianne; Howe, A. R.; Nayar, R.; Turner, E. L.; Scargle, J.; Meadows, V.; Zee, A.

    2014-01-01

    Kepler's high cadence and incredible precision have provided an unprecedented view into stars and their planetary companions, revealing both expected and novel phenomena and systems. Due to the large number of Kepler lightcurves, the discovery of novel phenomena in particular has often been serendipitous in the course of searching for known forms of variability (for example, the discovery of the doubly pulsating elliptical binary KOI-54, originally identified by the transiting planet search pipeline). In this talk, we discuss progress on mining the Kepler data through both supervised and unsupervised machine learning, intended both to systematically search the Kepler lightcurves for rare or anomalous variability and to create a variability catalog for community use. Mining the dataset in this way also allows for a quantitative identification of anomalous variability, and so may also be used as a signal-agnostic form of optical SETI. As the Kepler data are exceptionally rich, they provide an interesting counterpoint to machine learning efforts typically performed on sparser and/or noisier survey data, and will inform similar characterization carried out on future survey datasets.

  3. Data Mining for Imbalanced Datasets: An Overview

    Science.gov (United States)

    Chawla, Nitesh V.

    A dataset is imbalanced if the classification categories are not approximately equally represented. Recent years brought increased interest in applying machine learning techniques to difficult "real-world" problems, many of which are characterized by imbalanced data. Additionally the distribution of the testing data may differ from that of the training data, and the true misclassification costs may be unknown at learning time. Predictive accuracy, a popular choice for evaluating performance of a classifier, might not be appropriate when the data is imbalanced and/or the costs of different errors vary markedly. In this Chapter, we discuss some of the sampling techniques used for balancing the datasets, and the performance measures more appropriate for mining imbalanced datasets.
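
    The point about predictive accuracy can be illustrated with a minimal sketch. The abstract does not name the specific measures the chapter discusses, so recall and balanced accuracy are used here as assumed examples, and the 5-versus-95 class split is invented purely for illustration.

```python
# Hypothetical data: 5 positive and 95 negative examples (an assumed split).
y_true = [1] * 5 + [0] * 95
# A trivial classifier that always predicts the majority (negative) class.
y_pred = [0] * 100


def confusion_counts(truth, predicted, positive=1):
    """Return (tp, fp, fn, tn) for a binary labelling."""
    tp = sum(1 for t, p in zip(truth, predicted) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(truth, predicted) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(truth, predicted) if t == positive and p != positive)
    tn = sum(1 for t, p in zip(truth, predicted) if t != positive and p != positive)
    return tp, fp, fn, tn


def accuracy(truth, predicted):
    tp, fp, fn, tn = confusion_counts(truth, predicted)
    return (tp + tn) / (tp + fp + fn + tn)


def recall(truth, predicted):
    tp, _, fn, _ = confusion_counts(truth, predicted)
    return tp / (tp + fn) if (tp + fn) else 0.0


def balanced_accuracy(truth, predicted):
    """Mean of the recall on the positive class and the recall on the negative class."""
    tp, fp, fn, tn = confusion_counts(truth, predicted)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return (sensitivity + specificity) / 2


print("accuracy:", accuracy(y_true, y_pred))
print("recall:", recall(y_true, y_pred))
print("balanced accuracy:", balanced_accuracy(y_true, y_pred))
```

    The majority-class predictor scores high accuracy while finding none of the minority examples, which is the failure mode the abstract warns about.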

  4. Mining Staff Assignment Rules from Event-Based Data

    NARCIS (Netherlands)

    Ly, Linh Thao; Rinderle, Stefanie; Dadam, Peter; Reichert, Manfred; Bussler, Christoph J.; Haller, Armin

    2006-01-01

    Process mining offers methods and techniques for capturing process behaviour from log data of past process executions. Although many promising approaches on mining the control flow have been published, no attempt has been made to mine the staff assignment situation of business processes. In this

  5. Data Mining and Privacy of Social Network Sites' Users: Implications of the Data Mining Problem.

    Science.gov (United States)

    Al-Saggaf, Yeslam; Islam, Md Zahidul

    2015-08-01

    This paper explores the potential of data mining as a technique that could be used by malicious data miners to threaten the privacy of social network sites (SNS) users. It applies a data mining algorithm to a real dataset to provide empirically-based evidence of the ease with which characteristics about the SNS users can be discovered and used in a way that could invade their privacy. One major contribution of this article is the application of the decision forest data mining algorithm (SysFor) to the context of SNS; SysFor builds not just a single decision tree but a forest, allowing the exploration of more logic rules from a dataset. One logic rule that SysFor built in this study, for example, revealed that anyone having a profile picture showing just the face or a picture showing a family is less likely to be lonely. Another contribution of this article is the discussion of the implications of the data mining problem for governments, businesses, developers and the SNS users themselves.

  6. Asymmetric threat data mining and knowledge discovery

    Science.gov (United States)

    Gilmore, John F.; Pagels, Michael A.; Palk, Justin

    2001-03-01

    Asymmetric threats differ from the conventional force-on-force military encounters that the Defense Department has historically been trained to engage. Terrorism by its nature is now an operational activity that is neither easily detected nor countered, as its very existence depends on small covert attacks exploiting the element of surprise. But terrorism does have defined forms, motivations, tactics and organizational structure. Exploiting a terrorism taxonomy provides the opportunity to discover and assess knowledge of terrorist operations. This paper describes the Asymmetric Threat Terrorist Assessment, Countering, and Knowledge (ATTACK) system. ATTACK has been developed to (a) data mine open source intelligence (OSINT) information from web-based newspaper sources, video news web casts, and actual terrorist web sites, (b) evaluate this information against a terrorism taxonomy, (c) exploit country/region specific social, economic, political, and religious knowledge, and (d) discover and predict potential terrorist activities and association links. Details of the asymmetric threat structure and the ATTACK system architecture are presented with results of an actual terrorist data mining and knowledge discovery test case shown.

  7. DATA MINING AND APPLICATION OF IT TO CAPITAL MARKETS

    Directory of Open Access Journals (Sweden)

    Cenk AKKAYA

    2011-07-01

    Nowadays, with the development of technology, the importance given to knowledge is gradually increasing. Data mining makes it possible to form forecasts and models of the future by making use of past data. Any method that helps to discover data can be used as a data mining method. Enterprises gain an important competitive advantage through data mining methods. Data mining is used in different fields. In finance, it is used especially in financial performance applications, predicting enterprise bankruptcies and failures, detecting transaction manipulation, financial risk management, determining customer profiles and depth management. Gaining knowledge can be costly, risky and time-consuming for enterprises. Thus, enterprises today use data mining as an innovative competitive tool. The aim of the study is to determine the importance of data mining applications to capital markets.

  8. Data warehousing and data mining: A case study

    Directory of Open Access Journals (Sweden)

    Suknović Milija

    2005-01-01

    This paper shows the design and implementation of a data warehouse, as well as the use of data mining algorithms for knowledge discovery as the basic resource of an adequate business decision-making process. The project was realized for the needs of the Student's Service Department of the Faculty of Organizational Sciences (FOS), University of Belgrade, Serbia and Montenegro. This system represents a good base for analysis and predictions in the following time period for the purpose of quality business decision-making by top management. Thus, the first part of the paper shows the steps in the design and development of the data warehouse of the mentioned business system. The second part of the paper shows the implementation of data mining algorithms for the purpose of deducing rules, patterns and knowledge as a resource for support in the process of decision making.

  9. Marine data users clustering using data mining technique

    Directory of Open Access Journals (Sweden)

    Farnaz Ghiasi

    2015-09-01

    The objective of this research is to cluster marine data users using a data mining technique. Achieving this objective will enable marine organizations to understand their data and their users' requirements. In this research, the CRISP-DM standard model was used to implement the data mining technique. The required data were extracted from the profile database of 500 marine data users of the Iranian National Institute for Oceanography and Atmospheric Sciences (INIOAS) from 1386 to 1393. The TwoStep algorithm was used for clustering. In this research, patterns were discovered for the first time, using clustering, between marine data users such as students, organizations and scientists and their data requests (data source, data type, data set, parameter and geographic area). The most important clusters are: Student with International data source, Chemistry data type, “World Ocean Database” dataset, Persian Gulf geographic area and Organization with Nitrate parameter. Senior managers of marine organizations will be able to make correct decisions concerning their existing data and will be directed toward planning better data collection in the future. Data users will also be guided with respect to their requests. Finally, valuable suggestions were offered to improve the performance of marine organizations.

  10. Detecting Internet Worms Using Data Mining Techniques

    Directory of Open Access Journals (Sweden)

    Muazzam Siddiqui

    2008-12-01

    Internet worms pose a serious threat to computer security. Traditional approaches using signatures to detect worms pose little danger to zero-day attacks. The focus of malware research is shifting from using signature patterns to identifying the malicious behavior displayed by malware. This paper presents a novel idea of extracting variable-length instruction sequences that can distinguish worms from clean programs using data mining techniques. The analysis is facilitated by the program control flow information contained in the instruction sequences. Based upon general statistics gathered from these instruction sequences, we formulated the problem as a binary classification problem and built tree-based classifiers including decision tree, bagging and random forest. Our approach showed a 95.6% detection rate on novel worms whose data was not used in the model building process.

  11. Visual mining of semi-structured data

    CERN Multimedia

    CERN. Geneva; Posada, Jorge; Quartulli, Marco

    2013-01-01

    Background: Vicomtech is visiting CERN to present their activities and explore possible lines of collaboration. As part of the programme they will be offering a presentation, staged in three parts: (1) Presentation of Vicomtech – Seán Gaines; (2) Descriptions of technologies and specialities – Dr. Jorge Posada; (3) Details on projects related to the development of visually-based algorithms for intelligent storage, processing, visualization and interaction with Big Data, for massive sources of information – Dr. Marco Quartulli. Abstract: Mining semi-structured data is fundamental for archive monitoring, understanding and exploitation. Typical analysis systems are based on a three-tiered architecture, in which efficient databases feed highly parallelised application servers that in turn feed client user interfaces. Yet the sharing of analysis, content identification and semantic level summarization tasks among the two bot...

  12. Multipass mining sequence room closures: In situ data report

    International Nuclear Information System (INIS)

    Munson, D.E.; Jones, R.L.; Northrop-Salazar, C.L.; Woerner, S.J.

    1992-12-01

    During the construction of the Thermal/Structural In Situ Test Rooms at the Waste Isolation Pilot Plant (WIPP) facility, measurements of the salt displacements were obtained at very early times, essentially concurrent with the mining activity. This was accomplished by emplacing manually read closure gage stations directly at the mining face, actually between the face and the mining machine, immediately upon mining of the intended gage location. Typically, these mining sequence closure measurements were taken within one hour of mining of the location and within one meter of the mining face. Readings were taken at these gage stations as the multipass mining continued, with the gage station reestablished as each successive mining pass destroyed the earlier gage points. Data reduction yields the displacement history during the mining operation. These early mining sequence closure data, when combined with the later data of the permanently emplaced closure gages, give the total time-dependent closure displacements of the test rooms. This complete closure history is an essential part of assuring that the in situ test databases will provide an adequate basis for validation of the predictive technology of salt creep behavior, as required by the WIPP technology development program for disposal of radioactive waste in bedded salt.

  13. Mining Educational Data to Analyze the Student Motivation Behavior

    OpenAIRE

    Kunyanuth Kularbphettong; Cholticha Tongsiri

    2012-01-01

    This research aims to discover knowledge for analyzing student motivation behavior in e-Learning using data mining techniques, in the case of the Information Technology for Communication and Learning course at Suan Sunandha Rajabhat University. The data mining techniques applied in this research include association rules and classification techniques. The results showed that using data mining techniques can indicate the important variables that influenc...

  14. The study on privacy preserving data mining for information security

    Science.gov (United States)

    Li, Xiaohui

    2012-04-01

    Privacy-preserving data mining has developed rapidly in a short time, but it still faces many challenges. Firstly, the level of privacy is defined differently in different fields, so the measures of how well privacy-preserving data mining technology protects private information are not the same. Presenting a unified privacy definition and measure is therefore an urgent issue. Secondly, most research in privacy-preserving data mining is presently confined to theoretical study.

  15. WEKA-G: Parallel data mining on computational grids

    Directory of Open Access Journals (Sweden)

    PIMENTA, A.

    2009-12-01

    Data mining is a technology that can extract useful information from large amounts of data. However, mining a database often requires high computational power. To resolve this problem, this paper presents a tool (Weka-G) that runs in parallel the algorithms used in the data mining process. As the environment for doing so, we use a computational grid, adding several features within a WAN.

  16. Software tool for data mining and its applications

    Science.gov (United States)

    Yang, Jie; Ye, Chenzhou; Chen, Nianyi

    2002-03-01

    A software tool for data mining is introduced, which integrates pattern recognition (PCA, Fisher, clustering, hyperenvelop, regression), artificial intelligence (knowledge representation, decision trees), statistical learning (rough set, support vector machine) and computational intelligence (neural network, genetic algorithm, fuzzy systems). It consists of nine function models: pattern recognition, decision trees, association rule, fuzzy rule, neural network, genetic algorithm, Hyper Envelop, support vector machine and visualization. The principle and knowledge representation of some function models of data mining are described. The software tool is implemented in Visual C++ under Windows 2000. Nonmonotony in data mining is dealt with by concept hierarchy and layered mining. The software tool has been satisfactorily applied in the prediction of regularities of the formation of ternary intermetallic compounds in alloy systems, and in the diagnosis of brain glioma.

  17. Data Mining Practical Machine Learning Tools and Techniques

    CERN Document Server

    Witten, Ian H; Hall, Mark A

    2011-01-01

    Data Mining: Practical Machine Learning Tools and Techniques offers a thorough grounding in machine learning concepts as well as practical advice on applying machine learning tools and techniques in real-world data mining situations. This highly anticipated third edition of the most acclaimed work on data mining and machine learning will teach you everything you need to know about preparing inputs, interpreting outputs, evaluating results, and the algorithmic methods at the heart of successful data mining. Thorough updates reflect the technical changes and modernizations that have taken place

  18. The multiple zeta value data mine

    International Nuclear Information System (INIS)

    Buemlein, J.; Broadhurst, D.J.

    2009-07-01

    We provide a data mine of proven results for multiple zeta values (MZVs) of the form $\zeta(s_1,s_2,\ldots,s_k) = \sum_{n_1>n_2>\cdots>n_k>0}^{\infty} 1/(n_1^{s_1}\cdots n_k^{s_k})$ with weight $w=\sum_{i=1}^{k} s_i$ and depth $k$, and for Euler sums of the form $\sum_{n_1>n_2>\cdots>n_k>0}^{\infty} (\varepsilon_1^{n_1}\cdots\varepsilon_k^{n_k})/(n_1^{s_1}\cdots n_k^{s_k})$ with signs $\varepsilon_i=\pm 1$. Notably, we achieve explicit proven reductions of all MZVs with weights $w\le 22$, and all Euler sums with weights $w\le 12$, to bases whose dimensions, bigraded by weight and depth, have sizes in precise agreement with the Broadhurst-Kreimer and Broadhurst conjectures. Moreover, we lend further support to these conjectures by studying even greater weights ($w\le 30$), using modular arithmetic. To obtain these results we derive a new type of relation for Euler sums, the Generalized Doubling Relations. We elucidate the "pushdown" mechanism, whereby the ornate enumeration of primitive MZVs, by weight and depth, is reconciled with the far simpler enumeration of primitive Euler sums. There is some evidence that this pushdown mechanism finds its origin in doubling relations. We hope that our data mine, obtained by exploiting the unique power of the computer algebra language FORM, will enable the study of many more such consequences of the double-shuffle algebra of MZVs, and their Euler cousins, which are already the subject of keen interest, to practitioners of quantum field theory, and to mathematicians alike. (orig.)
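
    As a small numerical illustration of the nested sums defined in this abstract, the sketch below evaluates truncated MZV and Euler sums in floating point. It is not the exact FORM-based algebra the authors describe; the function name `nested_sum_partial` and the truncation parameter `terms` are assumptions made for this example. It checks two classical facts: Euler's identity ζ(2,1) = ζ(3), and the alternating depth-1 sum with ε = -1, s = 1, which converges to -ln 2.

```python
import math


def nested_sum_partial(exponents, terms, signs=None):
    """Truncated sum over terms >= n_1 > n_2 > ... > n_k > 0 of
    prod_j signs[j]**n_j / n_j**exponents[j].

    With all signs equal to +1 this is a partial sum of the MZV
    zeta(s_1, ..., s_k); with signs of -1 allowed it is an Euler sum.
    """
    if signs is None:
        signs = [1] * len(exponents)
    # inner[n] = sum over n > n_{j+1} > ... > n_k > 0 of the trailing factors;
    # for the empty tail this is the empty product, 1.
    inner = [1.0] * (terms + 2)
    for s, eps in zip(reversed(exponents), reversed(signs)):
        outer = [0.0] * (terms + 2)
        running = 0.0
        for m in range(1, terms + 1):
            running += (eps ** m) * inner[m] / m ** s
            outer[m + 1] = running  # sum over all indices strictly below m + 1
        inner = outer
    return inner[terms + 1]


def weight(exponents):
    """Weight w = s_1 + ... + s_k; the depth is len(exponents)."""
    return sum(exponents)


N = 100_000  # truncation point, chosen only for illustration
zeta_2_1 = nested_sum_partial((2, 1), N)
zeta_3 = nested_sum_partial((3,), N)
alternating = nested_sum_partial((1,), N, signs=(-1,))

print("zeta(2,1) partial sum:", zeta_2_1)
print("zeta(3) partial sum:  ", zeta_3)
print("Euler sum, eps=-1, s=1:", alternating, "vs -ln 2 =", -math.log(2))
```

    The depth-2 sum converges slowly (the truncation error behaves like (ln N)/N), so the two printed values agree only to a few digits; this is one reason exact algebraic reductions are preferable to brute-force summation.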

  19. Recent advances in environmental data mining

    Science.gov (United States)

    Leuenberger, Michael; Kanevski, Mikhail

    2016-04-01

    Due to the large amount and complexity of data available nowadays in geo- and environmental sciences, we face the need to develop and incorporate more robust and efficient methods for their analysis, modelling and visualization. An important part of these developments deals with the elaboration and application of a contemporary and coherent methodology following the process from data collection to the justification and communication of the results. Recent fundamental progress in machine learning (ML) can contribute considerably to the development of the emerging field of environmental data science. The present research highlights and investigates the different issues that can occur when dealing with environmental data mining using cutting-edge machine learning algorithms. In particular, the main attention is paid to the description of a self-consistent methodology and two efficient algorithms, Random Forest (RF; Breiman, 2001) and Extreme Learning Machines (ELM; Huang et al., 2006), which have recently gained great popularity. Although they are based on two different concepts, i.e. decision trees vs artificial neural networks, both offer promising results for complex, high-dimensional and non-linear data modelling. In addition, the study discusses several important issues of data-driven modelling, including feature selection and uncertainties. The approach considered is accompanied by simulated and real data case studies from renewable resources assessment and natural hazards tasks. In conclusion, the current challenges and future developments in statistical environmental data learning are discussed. References: Breiman, L., 2001. Random Forests. Machine Learning 45 (1), 5-32. Huang, G.-B., Zhu, Q.-Y., Siew, C.-K., 2006. Extreme learning machine: theory and applications. Neurocomputing 70 (1-3), 489-501. Kanevski, M., Pozdnoukhov, A., Timonin, V., 2009. Machine Learning for Spatial Environmental Data. EPFL Press, Lausanne, Switzerland, p. 392.

  20. Introduction to the special section on educational data mining

    NARCIS (Netherlands)

    Calders, T.G.K.; Pechenizkiy, M.

    2012-01-01

    Educational Data Mining (EDM) is an emerging multidisciplinary research area, in which methods and techniques for exploring data originating from various educational information systems have been developed. EDM is both a learning science, as well as a rich application area for data mining, due to

  1. Expressive power of an algebra for data mining

    NARCIS (Netherlands)

    Calders, T.; Lakshmanan, L.V.S.; Ng, R.T.; Paredaens, J.

    2006-01-01

    The relational data model has simple and clear foundations on which significant theoretical and systems research has flourished. By contrast, most research on data mining has focused on algorithmic issues. A major open question is: what's an appropriate foundation for data mining, which can

  2. Clustering for data mining a data recovery approach

    CERN Document Server

    Mirkin, Boris

    2005-01-01

    Often considered more as an art than a science, the field of clustering has been dominated by learning through examples and by techniques chosen almost through trial-and-error. Even the most popular clustering methods--K-Means for partitioning the data set and Ward's method for hierarchical clustering--have lacked the theoretical attention that would establish a firm relationship between the two methods and relevant interpretation aids. Rather than the traditional set of ad hoc techniques, Clustering for Data Mining: A Data Recovery Approach presents a theory that not only closes gaps in K-Mean

  3. Application of Data Mining in direct marketing

    Directory of Open Access Journals (Sweden)

    Dejana Pavlović

    2014-04-01

    The key to successful business operations lies in good communication with clients. A growing number of brokers in the financial market collect excess funds from clients and transfer them to those who need the funds. However, many external and internal factors influence the decision on the disposal of available funds. This paper identifies and investigates clients’ satisfaction in the banking system. By applying the disclosure of data legality, we try to point to the factors that influence the clients' decision to invest their long-term deposits in the parent bank. Upon classification and clustering, we interpret and identify the strengths and weaknesses of the target results. This analysis provides guidelines through the use of a decision tree; the application of data mining and the possibility of using a large set of data increase the value and accuracy of this technique. The problem with this technique is the accuracy of the data submitted by the client.

  4. Development of National Health Data Warehouse for Data Mining

    Directory of Open Access Journals (Sweden)

    Shahidul Islam Khan

    2015-07-01

    Health informatics is currently one of the top focuses of computer science researchers. Availability of timely and accurate data is essential for medical decision making. Health care organizations face a common problem with the large amount of data they hold in numerous systems. Researchers, health care providers and patients will not be able to utilize the knowledge stored in different repositories unless the information from disparate sources is amalgamated. This problem can be solved by data warehousing. Data warehousing techniques share a common set of tasks, including requirements analysis, data design, architectural design, implementation and deployment. Developing a health data warehouse is complex and time-consuming but is also essential to deliver quality health services. This paper depicts the prospects and complexities of health data warehousing and mining and illustrates a data-warehousing model suitable for integrating data from different health care sources to discover effective knowledge.

  5. An Intelligent Archive Testbed Incorporating Data Mining

    Science.gov (United States)

    Ramapriyan, H.; Isaac, D.; Yang, W.; Bonnlander, B.; Danks, D.

    2009-01-01

    interoperability, and being able to convert data to information and usable knowledge in an efficient, convenient manner, aided significantly by automation (Ramapriyan et al. 2004; NASA 2005). We can look upon the distributed provider environment with capabilities to convert data to information and to knowledge as an Intelligent Archive in the Context of a Knowledge Building system (IA-KBS). Some of the key capabilities of an IA-KBS are: Virtual Product Generation, Significant Event Detection, Automated Data Quality Assessment, Large-Scale Data Mining, Dynamic Feedback Loop, and Data Discovery and Efficient Requesting (Ramapriyan et al. 2004).

  6. A Data Mining Classification Approach for Behavioral Malware Detection

    Directory of Open Access Journals (Sweden)

    Monire Norouzi

    2016-01-01

    Data mining techniques have numerous applications in malware detection. Classification is one of the most popular data mining techniques. In this paper we present a data mining classification approach to detect malware behavior. We propose different classification methods to detect malware based on the features and behavior of each piece of malware. A dynamic analysis method is presented for identifying the malware features. A program is presented for converting an XML file of malware behavior execution history into suitable input for the WEKA tool. To illustrate performance efficiency on training and test data, we apply the proposed approaches to a real case study data set using the WEKA tool. The evaluation results demonstrated the feasibility of the proposed data mining approach. Our proposed approach is also more efficient for detecting malware, and behavioral classification of malware can be useful for detecting malware in a behavioral antivirus.

  7. Mining TCGA data using Boolean implications.

    Directory of Open Access Journals (Sweden)

    Subarna Sinha

    Boolean implications (if-then rules) provide a conceptually simple, uniform and highly scalable way to find associations between pairs of random variables. In this paper, we propose to use Boolean implications to find relationships between variables of different data types (mutation, copy number alteration, DNA methylation and gene expression) from the glioblastoma (GBM) and ovarian serous cystadenoma (OV) data sets from The Cancer Genome Atlas (TCGA). We find hundreds of thousands of Boolean implications from these data sets. A direct comparison of the relationships found by Boolean implications and those found by commonly used methods for mining associations shows that existing methods would miss relationships found by Boolean implications. Furthermore, many relationships exposed by Boolean implications reflect important aspects of cancer biology. Examples of our findings include cis relationships between copy number alteration, DNA methylation and expression of genes, a new hierarchy of mutations and recurrent copy number alterations, loss-of-heterozygosity of well-known tumor suppressors, and the hypermethylation phenotype associated with IDH1 mutations in GBM. The Boolean implication results used in the paper can be accessed at http://crookneck.stanford.edu/microarray/TCGANetworks/.
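
    To make the idea of an if-then rule between two variables concrete, the following is a minimal sketch. The abstract does not describe the statistical test used in the paper, so the simple violation-rate threshold below, the function name `implies`, and the toy samples are all assumptions for illustration only.

```python
def implies(a, b, max_violation_rate=0.05):
    """Return True if the rule 'a is True => b is True' holds approximately.

    a, b: equal-length sequences of booleans, one entry per sample.
    The rule is accepted when the fraction of samples with a True but
    b False (the violating quadrant) is at most max_violation_rate of
    the samples where a is True. This threshold is an assumed,
    simplified criterion, not the paper's test.
    """
    support = sum(1 for x in a if x)
    if support == 0:
        return False  # no evidence for a rule whose premise never occurs
    violations = sum(1 for x, y in zip(a, b) if x and not y)
    return violations / support <= max_violation_rate


# Toy samples (invented): every sample with feature A also has feature B,
# but B additionally occurs in samples without A.
feature_a = [True] * 10 + [False] * 10
feature_b = [True] * 10 + [True] * 5 + [False] * 5

print("A => B:", implies(feature_a, feature_b))
print("B => A:", implies(feature_b, feature_a))
```

    The rule holds in one direction only. Asymmetric relationships of this kind are what a symmetric association measure can understate, which is consistent with the abstract's claim that existing methods miss some of these relationships.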

  8. Application and Exploration of Big Data Mining in Clinical Medicine

    Science.gov (United States)

    Zhang, Yue; Guo, Shu-Li; Han, Li-Na; Li, Tie-Ling

    2016-01-01

    Objective: To review theories and technologies of big data mining and their application in clinical medicine. Data Sources: Literature published in English or Chinese regarding theories and technologies of big data mining and the concrete applications of data mining technology in clinical medicine was obtained from PubMed and Chinese Hospital Knowledge Database from 1975 to 2015. Study Selection: Original articles regarding big data mining theory/technology and big data mining's application in the medical field were selected. Results: This review characterized the basic theories and technologies of big data mining including fuzzy theory, rough set theory, cloud theory, Dempster–Shafer theory, artificial neural network, genetic algorithm, inductive learning theory, Bayesian network, decision tree, pattern recognition, high-performance computing, and statistical analysis. The application of big data mining in clinical medicine was analyzed in the fields of disease risk assessment, clinical decision support, prediction of disease development, guidance of rational use of drugs, medical management, and evidence-based medicine. Conclusion: Big data mining has the potential to play an important role in clinical medicine. PMID:26960378

  9. Kajian Data Mining Customer Relationship Management pada Lembaga Keuangan Mikro

    Directory of Open Access Journals (Sweden)

    Tikaridha Hardiani

    2016-01-01

    Companies, including microfinance institutions, must be ready to face intense competition from other companies. Increasingly intense competition has led many microfinance institutions to seek profitable strategies that distinguish them from others. One strategy that can be applied is implementing Customer Relationship Management (CRM) and data mining. Data mining can be used by microfinance institutions that have sufficiently large data. Determining potential customers through customer segmentation can help decision-making on the marketing strategy to be implemented. This paper discusses several data mining techniques that can be used for customer segmentation. The proposed data mining technique is fuzzy clustering with the fuzzy C-Means algorithm and fuzzy RFM. Keywords: Customer relationship management; Data mining; Fuzzy clustering; Micro-finance institutions; Fuzzy C-Means; Fuzzy RFM

  10. Privacy-Preserving Data Mining of Medical Data Using Data Separation-Based Techniques

    Directory of Open Access Journals (Sweden)

    Gang Kou

    2007-08-01

    Data mining is concerned with the extraction of useful knowledge from various types of data. Medical data mining has been a popular data mining topic of late. Compared with other data mining areas, medical data mining has some unique characteristics. Because medical files are related to human subjects, privacy concerns are taken more seriously than in other data mining tasks. This paper applies data separation-based techniques to preserve privacy in the classification of medical data. We take two approaches to protect privacy: one approach is to vertically partition the medical data and mine these partitioned data at multiple sites; the other approach is to horizontally split data across multiple sites. In the vertical partition approach, each site uses a portion of the attributes to compute its results, and the distributed results are assembled at a central trusted party using a majority-vote ensemble method. In the horizontal partition approach, data are distributed among several sites. Each site computes on its own data, and a central trusted party is responsible for integrating these results. We implement these two approaches using medical datasets from the UCI KDD archive and report the experimental results.
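
    The vertical partition approach can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' implementation: each site's "model" is a hypothetical threshold rule on the single attribute it holds, the records and thresholds are invented, and the function names are chosen for this example. Only the per-site predictions, not the raw attribute values, reach the central party.

```python
from collections import Counter


def site_predict(attribute_values, threshold):
    """Hypothetical local model: a site labels a record 1 when the one
    attribute it holds is at or above its threshold, else 0."""
    return [1 if value >= threshold else 0 for value in attribute_values]


def majority_vote(site_predictions):
    """Combine per-site label lists at the central trusted party.

    site_predictions: one list of labels per site, aligned by record.
    With an odd number of sites and binary labels no ties can occur;
    otherwise Counter.most_common keeps the first-seen label on a tie.
    """
    combined = []
    for votes in zip(*site_predictions):
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined


# Three records with three attributes each (invented values).
records = [(5, 1, 9), (1, 1, 1), (7, 8, 2)]

# Vertical partition: site i sees only column i of every record.
site_views = [[row[i] for row in records] for i in range(3)]
thresholds = [4, 4, 4]  # assumed per-site thresholds

predictions = [site_predict(view, t) for view, t in zip(site_views, thresholds)]
final_labels = majority_vote(predictions)
print("per-site predictions:", predictions)
print("majority vote:", final_labels)
```

    A horizontal split would instead give each site complete records for a subset of the rows, with the central party integrating the sites' results rather than voting on the same record.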

  11. Seminal quality prediction using data mining methods.

    Science.gov (United States)

    Sahoo, Anoop J; Kumar, Yugal

    2014-01-01

    Nowadays, some new classes of diseases, known as lifestyle diseases, have come into existence. The main reasons behind these diseases are changes in people's lifestyle, such as alcohol consumption, smoking, and food habits. A review of the various lifestyle diseases found that the fertility rate (sperm quantity) in men has decreased considerably in the last two decades. Lifestyle factors as well as environmental factors are mainly responsible for the change in semen quality. The objective of this paper is to identify the lifestyle and environmental features that affect seminal quality and fertility rate in men using data mining methods. Five artificial intelligence techniques, namely multilayer perceptron (MLP), decision tree (DT), Naive Bayes (Kernel), support vector machine plus particle swarm optimization (SVM+PSO), and support vector machine (SVM), have been applied to a fertility dataset to evaluate seminal quality and to predict whether a person is normal or has an altered fertility rate. Eight feature selection techniques, namely support vector machine (SVM), neural network (NN), evolutionary logistic regression (LR), support vector machine plus particle swarm optimization (SVM+PSO), principal component analysis (PCA), chi-square test, correlation, and T-test methods, have been used to identify the more relevant features that affect seminal quality. These techniques are applied to a fertility dataset that contains 100 instances with nine attributes and two classes. The experimental results show that SVM+PSO provides higher accuracy and area under curve (AUC) (94% & 0.932) than MLP (92% & 0.728), SVM (91% & 0.758), Naive Bayes (Kernel) (89% & 0.850), and decision tree (89% & 0.735) for some of the seminal parameters. This paper also focuses on the feature selection process, i.e. how to select the features which are more important for prediction of

  12. Student Privacy and Educational Data Mining: Perspectives from Industry

    Science.gov (United States)

    Sabourin, Jennifer; Kosturko, Lucy; FitzGerald, Clare; McQuiggan, Scott

    2015-01-01

    While the field of educational data mining (EDM) has generated many innovations for improving educational software and student learning, the mining of student data has recently come under a great deal of scrutiny. Many stakeholder groups, including public officials, media outlets, and parents, have voiced concern over the privacy of student data…

  13. A Tools-Based Approach to Teaching Data Mining Methods

    Science.gov (United States)

    Jafar, Musa J.

    2010-01-01

    Data mining is an emerging field of study in Information Systems programs. Although the course content has been streamlined, the underlying technology is still in a state of flux. The purpose of this paper is to describe how we utilized Microsoft Excel's data mining add-ins as a front-end to Microsoft's Cloud Computing and SQL Server 2008 Business…

  14. Application and Exploration of Big Data Mining in Clinical Medicine.

    Science.gov (United States)

    Zhang, Yue; Guo, Shu-Li; Han, Li-Na; Li, Tie-Ling

    2016-03-20

    To review theories and technologies of big data mining and their application in clinical medicine. Literature published in English or Chinese regarding theories and technologies of big data mining and the concrete applications of data mining technology in clinical medicine was obtained from PubMed and the Chinese Hospital Knowledge Database from 1975 to 2015. Original articles regarding big data mining theory/technology and big data mining's application in the medical field were selected. This review characterized the basic theories and technologies of big data mining including fuzzy theory, rough set theory, cloud theory, Dempster-Shafer theory, artificial neural network, genetic algorithm, inductive learning theory, Bayesian network, decision tree, pattern recognition, high-performance computing, and statistical analysis. The application of big data mining in clinical medicine was analyzed in the fields of disease risk assessment, clinical decision support, prediction of disease development, guidance of rational use of drugs, medical management, and evidence-based medicine. Big data mining has the potential to play an important role in clinical medicine.

  15. Data Mining: A Hybrid Methodology for Complex and Dynamic Research

    Science.gov (United States)

    Lang, Susan; Baehr, Craig

    2012-01-01

    This article provides an overview of the ways in which data and text mining have potential as research methodologies in composition studies. It introduces data mining in the context of the field of composition studies and discusses ways in which this methodology can complement and extend our existing research practices by blending the best of what…

  16. Data Mine and Forget It?: A Cautionary Tale

    Science.gov (United States)

    Tada, Yuri; Kraft, Norbert Otto; Orasanu, Judith M.

    2011-01-01

    With the development of new technologies, data mining has become increasingly popular. However, caution should be exercised in choosing the variables to include in data mining. A series of regression trees was created to demonstrate the change in the selection by the program of significant predictors based on the nature of variables.

  17. Model architecture of intelligent data mining oriented urban transportation information

    Science.gov (United States)

    Yang, Bogang; Tao, Yingchun; Sui, Jianbo; Zhang, Feizhou

    2007-06-01

    Aiming to solve practical problems in urban traffic, the paper presents a model architecture for intelligent data mining from a hierarchical view. By using artificial intelligence technologies in the framework, the intelligent data mining technology is improved and becomes more suitable for changing real-time road conditions. It also provides efficient technical support for the distribution, transmission and display of urban transport information.

  18. Informatics, Data Mining, Econometrics and Financial Economics: A Connection

    NARCIS (Netherlands)

    C-L. Chang (Chia-Lin); M.J. McAleer (Michael); W.-K. Wong (Wing-Keung)

    2015-01-01

    This short communication reviews some of the literature in econometrics and financial economics that is related to informatics and data mining. We then discuss some of the research on econometrics and financial economics that could be extended to informatics and data mining beyond the

  19. Class association rules mining from students’ test data (Abstract)

    NARCIS (Netherlands)

    Romero, C.; Ventura, S.; Vasilyeva, E.; Pechenizkiy, M.; Baker, de R.S.J.; Merceron, A.; Pavlik Jr., P.I.

    2010-01-01

    In this paper we propose the use of a special type of association rule mining for discovering interesting relationships in students’ test data collected, in our case, with the Moodle learning management system (LMS). In particular, we apply Class Association Rule (CAR) mining to different data
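A class association rule is an association rule whose consequent is a class label. The sketch below shows how the usual support and confidence measures could be computed for one such rule. The item names, the pass/fail labels, and the helper `rule_metrics` are hypothetical; no Moodle data or API is used, and the paper's actual rule-mining procedure is not reproduced.

```python
def rule_metrics(records, antecedent, class_label):
    """Return (support, confidence) of the rule: antecedent -> class_label."""
    antecedent = set(antecedent)
    n_total = len(records)
    # Records whose item set contains every item of the antecedent.
    n_antecedent = sum(1 for items, _ in records if antecedent <= items)
    # Records that match the antecedent and also carry the class label.
    n_both = sum(
        1 for items, label in records
        if antecedent <= items and label == class_label
    )
    support = n_both / n_total
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence


# Made-up test attempts: (set of observed items, final class).
attempts = [
    ({"q1_correct", "q2_correct"}, "pass"),
    ({"q1_correct"}, "pass"),
    ({"q1_correct", "q2_wrong"}, "fail"),
    ({"q1_wrong", "q2_wrong"}, "fail"),
    ({"q1_correct", "q2_correct"}, "pass"),
]

support, confidence = rule_metrics(attempts, {"q1_correct"}, "pass")
```

Here the rule `{q1_correct} -> pass` holds in 3 of 5 attempts (support 0.6) and in 3 of the 4 attempts that contain `q1_correct` (confidence 0.75).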

  20. Experienced ethical issues of personalized data-mined media services

    DEFF Research Database (Denmark)

    Sørensen, Jannick Kirk

    2008-01-01

    This tentative PhD project description concerns the ethnographic examination of users’ experience of privacy issues and usability related to personalized data mined (web-) services for media content.

  1. Data mining algorithms for land cover change detection: a review

    Indian Academy of Sciences (India)

    Sangram Panigrahi

    2017-11-24

    Nov 24, 2017 ... values, poor quality measurement, high resolution and high dimensional data. The land cover .... These data sets also include quality assurance information, ...... 2012 A new data mining framework for forest fire mapping.

  2. The multiple zeta value data mine

    Energy Technology Data Exchange (ETDEWEB)

    Bluemlein, J. [Deutsches Elektronen-Synchrotron (DESY), Zeuthen (Germany)]; Broadhurst, D.J. [Open Univ., Milton Keynes (United Kingdom). Physics and Astronomy Dept.]; Vermaseren, J.A.M. [Deutsches Elektronen-Synchrotron (DESY), Zeuthen (Germany); NIKHEF, Amsterdam (Netherlands)]

    2009-07-15

    We provide a data mine of proven results for multiple zeta values (MZVs) of the form $\zeta(s_1,s_2,\ldots,s_k) = \sum_{n_1>n_2>\cdots>n_k>0}^{\infty} \frac{1}{n_1^{s_1}\cdots n_k^{s_k}}$ with weight $w = \sum_{i=1}^{k} s_i$ and depth $k$, and for Euler sums of the form $\sum_{n_1>n_2>\cdots>n_k>0}^{\infty} \frac{\epsilon_1^{n_1}\cdots\epsilon_k^{n_k}}{n_1^{s_1}\cdots n_k^{s_k}}$ with signs $\epsilon_i = \pm 1$. Notably, we achieve explicit proven reductions of all MZVs with weights $w \le 22$, and all Euler sums with weights $w \le 12$, to bases whose dimensions, bigraded by weight and depth, have sizes in precise agreement with the Broadhurst-Kreimer and Broadhurst conjectures. Moreover, we lend further support to these conjectures by studying even greater weights ($w \le 30$), using modular arithmetic. To obtain these results we derive a new type of relation for Euler sums, the Generalized Doubling Relations. We elucidate the "pushdown" mechanism, whereby the ornate enumeration of primitive MZVs, by weight and depth, is reconciled with the far simpler enumeration of primitive Euler sums. There is some evidence that this pushdown mechanism finds its origin in doubling relations. We hope that our data mine, obtained by exploiting the unique power of the computer algebra language FORM, will enable the study of many more such consequences of the double-shuffle algebra of MZVs, and their Euler cousins, which are already the subject of keen interest, to practitioners of quantum field theory, and to mathematicians alike. (orig.)
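As a small numerical illustration of the MZV definition, the following sketch evaluates a truncated depth-2 sum and compares it with Euler's classical identity zeta(2,1) = zeta(3). It is a floating-point check only and has nothing to do with the exact reductions or the FORM programs described in the abstract; the function names and the truncation point are assumptions.

```python
def mzv_depth2(s1, s2, n_terms):
    """Truncated double sum over n1 > n2 > 0 of 1 / (n1**s1 * n2**s2)."""
    total = 0.0
    inner = 0.0  # running sum of 1 / n2**s2 over all n2 < n1
    for n1 in range(1, n_terms + 1):
        total += inner / n1 ** s1
        inner += 1.0 / n1 ** s2
    return total


def zeta(s, n_terms):
    """Truncated Riemann zeta sum, i.e. the depth-1 case."""
    return sum(1.0 / n ** s for n in range(1, n_terms + 1))


n_terms = 200_000
approx_zeta_2_1 = mzv_depth2(2, 1, n_terms)  # weight 3, depth 2
approx_zeta_3 = zeta(3, n_terms)             # weight 3, depth 1
difference = abs(approx_zeta_2_1 - approx_zeta_3)
```

The truncated double sum converges slowly (the tail decays roughly like log(N)/N), so the two values agree only to about four decimal places at this truncation. That slow convergence is one reason exact algebraic reductions are of more interest than direct summation.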

  3. DECISION SUPPORT SYSTEM TO SUPPORT DECISION PROCESSES WITH DATA MINING

    OpenAIRE

    Rupnik, Rok; Kukar, Matjaž

    2007-01-01

    Traditional techniques of data analysis do not enable the solution of all kinds of problems, and for that reason they have become insufficient. This caused a new interdisciplinary field of data mining to arise, encompassing both classical statistical and modern machine learning techniques to support data analysis and knowledge discovery from data. Data mining methods are powerful in dealing with large quantities of data, but on the other hand they are difficult to master by business users t...

  4. Data and Statistics on New York's Mining Resources - NYS Dept. of

    Science.gov (United States)

    Data and Statistics on New York's Mining Resources: Mines in New York - Information on active mines in New York State

  5. Using Data Mining for Wine Quality Assessment

    Science.gov (United States)

    Cortez, Paulo; Teixeira, Juliana; Cerdeira, António; Almeida, Fernando; Matos, Telmo; Reis, José

    Certification and quality assessment are crucial issues within the wine industry. Currently, wine quality is mostly assessed by physicochemical (e.g. alcohol levels) and sensory (e.g. human expert evaluation) tests. In this paper, we propose a data mining approach to predict wine preferences that is based on easily available analytical tests at the certification step. A large dataset is considered with white vinho verde samples from the Minho region of Portugal. Wine quality is modeled under a regression approach, which preserves the order of the grades. Explanatory knowledge is given in terms of a sensitivity analysis, which measures the response changes when a given input variable is varied through its domain. Three regression techniques were applied, under a computationally efficient procedure that performs simultaneous variable and model selection and that is guided by the sensitivity analysis. The support vector machine achieved promising results, outperforming the multiple regression and neural network methods. Such a model is useful for understanding how physicochemical tests affect the sensory preferences. Moreover, it can support the wine expert evaluations and ultimately improve the production.

  6. Data Mining at NASA: From Theory to Applications

    Science.gov (United States)

    Srivastava, Ashok N.

    2009-01-01

    This slide presentation, prepared for Knowledge Discovery and Data Mining 2009, demonstrates the data mining/machine learning capabilities of NASA Ames and its Intelligent Data Understanding (IDU) group. It encompasses work done recently in the group by various group members. The IDU group develops novel algorithms to detect, classify, and predict events in large data streams for scientific and engineering systems.

  7. Data mining for the social sciences an introduction

    CERN Document Server

    Attewell, Paul

    2015-01-01

    We live in a world of big data: the amount of information collected on human behavior each day is staggering, and exponentially greater than at any time in the past. Additionally, powerful algorithms are capable of churning through seas of data to uncover patterns. Providing a simple and accessible introduction to data mining, Paul Attewell and David B. Monaghan discuss how data mining substantially differs from conventional statistical modeling familiar to most social scientists. The authors also empower social scientists to tap into these new resources and incorporate data mining

  8. Spatio-Temporal Data Mining for Location-Based Services

    DEFF Research Database (Denmark)

    Gidofalvi, Gyozo

    Location-Based Services (LBS) are continuously gaining popularity. Innovative LBSes integrate knowledge about the users into the service. Such knowledge can be derived by analyzing the location data of users. Such data contain two unique dimensions, space and time, which need to be analyzed... The objectives of the presented thesis are three-fold. First, to extend popular data mining methods to the spatio-temporal domain. Second, to demonstrate the usefulness of the extended methods and the derived knowledge in promising LBS examples. Finally, to eliminate privacy concerns in connection with spatio-temporal data mining by devising systems for privacy-preserving location data collection and mining.

  9. Event metadata records as a testbed for scalable data mining

    International Nuclear Information System (INIS)

    Gemmeren, P van; Malon, D

    2010-01-01

    At a data rate of 200 hertz, event metadata records ('TAGs,' in ATLAS parlance) provide fertile grounds for development and evaluation of tools for scalable data mining. It is easy, of course, to apply HEP-specific selection or classification rules to event records and to label such an exercise 'data mining,' but our interest is different. Advanced statistical methods and tools such as classification, association rule mining, and cluster analysis are common outside the high energy physics community. These tools can prove useful, not for discovery physics, but for learning about our data, our detector, and our software. A fixed and relatively simple schema makes TAG export to other storage technologies such as HDF5 straightforward. This simplifies the task of exploiting very-large-scale parallel platforms such as Argonne National Laboratory's BlueGene/P, currently the largest supercomputer in the world for open science, in the development of scalable tools for data mining. Using a domain-neutral scientific data format may also enable us to take advantage of existing data mining components from other communities. There is, further, a substantial literature on the topic of one-pass algorithms and stream mining techniques, and such tools may be inserted naturally at various points in the event data processing and distribution chain. This paper describes early experience with event metadata records from ATLAS simulation and commissioning as a testbed for scalable data mining tool development and evaluation.
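To make the reference to one-pass algorithms concrete, here is a minimal sketch of Welford's running mean and variance, a standard technique that processes each record once and keeps only constant state. It is a generic illustration, not part of the ATLAS tooling; the class name `RunningStats` and the sample values are assumptions.

```python
class RunningStats:
    """One-pass (streaming) mean and sample variance via Welford's update."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        # Uses the old and the new mean, which keeps the update stable.
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        return self._m2 / (self.count - 1) if self.count > 1 else 0.0


# Hypothetical per-event quantity arriving as a stream; values are made up.
stats = RunningStats()
for value in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.update(value)
```

Because the state is three numbers regardless of stream length, an update of this kind could in principle sit at any point of a processing chain without buffering records.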

  10. Data mining applications in the context of casemix.

    Science.gov (United States)

    Koh, H C; Leong, S K

    2001-07-01

    In October 1999, the Singapore Government introduced casemix-based funding to public hospitals. The casemix approach to health care funding is expected to yield significant benefits, including equity and rationality in financing health care, the use of comparative casemix data for quality improvement activities, and the provision of information that enables hospitals to understand their cost behaviour and reinforces the drive for more cost-efficient services. However, there is some concern about the "quicker and sicker" syndrome (that is, the rapid discharge of patients with little regard for the quality of outcome). As it is likely that consequences of premature discharges will be reflected in the readmission data, an analysis of possible systematic patterns in readmission data can provide useful insight into the "quicker and sicker" syndrome. This paper explores potential data mining applications in the context of casemix by using readmission data as an illustration. In particular, it illustrates how data mining can be used to better understand readmission data and to detect systematic patterns, if any. From a technical perspective, data mining (which is capable of analysing complex non-linear and interaction relationships) supplements and complements traditional statistical methods in data analysis. From an applications perspective, data mining provides the technology and methodology to analyse mass volume of data to detect hidden patterns in data. Using readmission data as an illustrative data mining application, this paper explores potential data mining applications in the general casemix context.

  11. Interestingness of association rules in data mining: Issues relevant ...

    Indian Academy of Sciences (India)

    R. Narasimhan (Krishtel eMaging) 1461 1996 Oct 15 13:05:22

    mental changes in many spheres of our daily life. .... concentrate on association rule mining since it features as one of the main data mining tech- ..... years, a lot of work has been done in defining and quantifying 'interestingness. .... a critical effect on both, selection of interesting events and variation of interestingness thresh-.

  12. Recommending Learning Activities in Social Network Using Data Mining Algorithms

    Science.gov (United States)

    Mahnane, Lamia

    2017-01-01

    In this paper, we show how data mining algorithms (e.g. Apriori Algorithm (AP) and Collaborative Filtering (CF)) are useful in a New Social Network (NSN-AP-CF). "NSN-AP-CF" processes the clusters based on different learning styles. Next, it analyzes the habits and the interests of the users through mining the frequent episodes by the…

  13. Advanced Data Mining of Leukemia Cells Micro-Arrays

    Directory of Open Access Journals (Sweden)

    Richard S. Segall

    2009-12-01

    Full Text Available This paper provides a continuation and extension of previous research by Segall and Pierce (2009a) that discussed data mining for micro-array databases of Leukemia cells, primarily with self-organized maps (SOM). As in Segall and Pierce (2009a) and Segall and Pierce (2009b), the results of applying data mining are shown and discussed for the data categories of microarray databases of HL60, Jurkat, NB4 and U937 Leukemia cells that are also described in this article. First, a background section is provided on the work of others pertaining to the applications of data mining to micro-array databases of Leukemia cells and micro-array databases in general. As noted in the predecessor article by Segall and Pierce (2009a), micro-array databases are one of the most popular functional genomics tools in use today. The research in this paper is intended to use advanced data mining technologies for better interpretation and knowledge discovery from the patterns of gene expression of HL60, Jurkat, NB4 and U937 Leukemia cells. The advanced data mining performed entailed using other data mining tools such as cubic clustering criterion, variable importance rankings, decision trees, and more detailed examinations of data mining statistics, as well as the study of other self-organized map (SOM) clustering regions of the workspace as generated by SAS Enterprise Miner version 4. Conclusions and future directions of the research are also presented.

  14. BOOK REVIEW EDUCATIONAL DATA MINING: APPLICATIONS AND TRENDS

    Directory of Open Access Journals (Sweden)

    Aylin OZTURK

    2016-04-01

    Full Text Available Educational Data Mining (EDM) is a developing field based on data mining techniques. EDM emerged as a combination of areas such as machine learning, statistics, computer science, education, cognitive science, and psychometry. EDM focuses on learner characteristics, behaviors, academic achievements, process of learning, educational functionalities, domain knowledge content, assessments, and applications. Educational data mining is defined by Baker (2010) as "an emerging discipline, concerned with developing methods for exploring the unique types of data that come from educational settings, and using those methods to better understand students, and the settings which they learn in". EDM is concerned with improving the learning process and environment.

  15. Development of an Enhanced Generic Data Mining Life Cycle (DMLC)

    OpenAIRE

    Hofmann, Markus; Tierney, Brendan

    2017-01-01

    Data mining projects are complex and have a high failure rate. A life cycle is vital for improving the project management and success rates of such projects. This paper reports on a research project that was concerned with the life cycle development for large scale data mining projects. The paper provides a detailed view of the design and development of a generic data mining life cycle called DMLC. The life cycle aims to support all members of data mini...

  16. Review of Data Mining Techniques for Churn Prediction in Telecom

    OpenAIRE

    Vishal Mahajan; Richa Misra; Renuka Mahajan

    2015-01-01

    The telecommunication sector generates a huge amount of data due to the increasing number of subscribers, rapidly changing technologies, data-based applications, and other value-added services. This data can be usefully mined for churn analysis and prediction. Significant research has been undertaken by researchers worldwide to understand the data mining practices that can be used for predicting customer churn. This paper provides a review of around 100 recent journal articles starting from year 2000 ...

  17. Usage of Data Mining at Financial Decision Making

    Directory of Open Access Journals (Sweden)

    Levent BORAN

    2014-06-01

    Full Text Available The knowledge age requires controlling every kind of information. Recognition of patterns in data may provide previously unknown and useful information that can provide competitive advantages. If related techniques are applied to financial statements, it is possible to acquire valuable information about companies’ financial situations. Data mining is considered a possible alternative to common financial analysis techniques such as vertical analysis, horizontal analysis, trend analysis and ratio analysis. Compared with existing financial analysis methods, data mining offers some advantages, namely the ability to handle huge amounts of data and to obtain previously unknown information. There are two major constraints on data mining implementation: the lack of experts in both data mining and the related domains, and the cost of the computer software and hardware used.

  18. An Application of Multithreaded Data Mining in Educational Leadership Research

    OpenAIRE

    Fikis, David; Wang, Yinying; Bowers, Alex

    2015-01-01

    This study aims to apply high-performance computing to educational leadership research. Specifically, we applied an array of data acquisition and analytical techniques to the field of educational leadership research, including text data mining, probabilistic topic modeling, and the use of software (CasperJS, GNU utilities, R, etc.) as well as hardware (the VELA batch computer and the multi-threaded data mining environment).

  19. Visual data mining for developing competitive strategies in higher education

    OpenAIRE

    Ertek, Gürdal

    2009-01-01

    Information visualization is the growing field of computer science that aims at visually mining data for knowledge discovery. In this paper, a data mining framework and a novel information visualization scheme are developed and applied to the domain of higher education. The presented framework consists of three main types of visual data analysis: Discovering general insights, carrying out competitive benchmarking, and planning for High School Relationship Management (HSRM). In this paper the f...

  20. Data processing in management of Dolni Rozinka uranium mines

    International Nuclear Information System (INIS)

    Benes, B.

    1987-01-01

    In 1985, a qualitative innovation in data processing was introduced with the commissioning of the EC 1026 computer with a terminal network and a remote data communication system. The design jobs, which are being implemented gradually, are mainly oriented to creating an automated information system for the operative control of mining production, to data preparation in mining plants, and to areas such as personnel, wages, and material consumption. (J.B.)

  1. [Aspects for data mining implementation in gerontology and geriatrics].

    Science.gov (United States)

    Mikhal'skiĭ, A I

    2014-01-01

    Current challenges facing theory and practice in ageing sciences call for new methods of investigating experimental data. This results both from developments in the experimental basis of biological research and from progress in information technology. These achievements make it possible to apply data mining methods, well proven in different fields of science and engineering, to tasks in gerontology and geriatrics. Some examples of implementing data mining methods in gerontology are presented.

  2. A REVIEW ON PREDICTIVE ANALYTICS IN DATA MINING

    OpenAIRE

    Arumugam.S

    2016-01-01

    The main process of data mining is to collect, extract and store valuable information, and nowadays many enterprises do this actively. Predictive analytics is a branch of advanced analytics that is mainly used to make predictions about unknown future events. Predictive analytics uses various techniques from machine learning, statistics, data mining, modeling, and artificial intelligence for analyzing the current data and to make predictions about futu...

  3. Predictive models in churn data mining: a review

    OpenAIRE

    García, David L.; Vellido Alcacena, Alfredo; Nebot Castells, M. Àngela

    2007-01-01

    The development of predictive models of customer abandonment plays a central role in any churn management strategy. These models can be developed using either qualitative approaches or can take a data-centred point of view. In the latter case, the use of Data Mining procedures and techniques can provide useful and actionable insights into the processes leading to abandonment. In this report, we provide a brief and structured review of some of the Data Mining approaches that have been put forw...

  4. Data Mining and Complex Problems: Case Study in Composite Materials

    Science.gov (United States)

    Rabelo, Luis; Marin, Mario

    2009-01-01

    Data mining is defined as the discovery of useful, possibly unexpected, patterns and relationships in data using statistical and non-statistical techniques in order to develop schemes for decision and policy making. Data mining can be used to discover the sources and causes of problems in complex systems. In addition, data mining can support simulation strategies by finding the different constants and parameters to be used in the development of simulation models. This paper introduces a framework for data mining and its application to complex problems. To further explain some of the concepts outlined in this paper, the potential application to the NASA Shuttle Reinforced Carbon-Carbon structures and genetic programming is used as an illustration.

  5. 1st International Conference on Computational Intelligence in Data Mining

    CERN Document Server

    Behera, Himansu; Mandal, Jyotsna; Mohapatra, Durga

    2015-01-01

    The contributed volume aims to explicate and address the difficulties and challenges for the seamless integration of two core disciplines of computer science, i.e., computational intelligence and data mining. Data Mining aims at the automatic discovery of underlying non-trivial knowledge from datasets by applying intelligent analysis techniques. The interest in this research area has experienced a considerable growth in the last years due to two key factors: (a) knowledge hidden in organizations’ databases can be exploited to improve strategic and managerial decision-making; (b) the large volume of data managed by organizations makes it impossible to carry out a manual analysis. The book addresses different methods and techniques of integration for enhancing the overall goal of data mining. The book helps to disseminate the knowledge about some innovative, active research directions in the field of data mining, machine and computational intelligence, along with some current issues and applications of relate...

  6. Towards data warehousing and mining of protein unfolding simulation data.

    Science.gov (United States)

    Berrar, Daniel; Stahl, Frederic; Silva, Candida; Rodrigues, J Rui; Brito, Rui M M; Dubitzky, Werner

    2005-10-01

    The prediction of protein structure and the precise understanding of protein folding and unfolding processes remain among the greatest challenges in structural biology and bioinformatics. Computer simulations based on molecular dynamics (MD) are at the forefront of the effort to gain a deeper understanding of these complex processes. Currently, these MD simulations are usually on the order of tens of nanoseconds, generate a large amount of conformational data and are computationally expensive. More and more groups run such simulations and generate a myriad of data, which raises new challenges in managing and analyzing these data. Because of the vast range of proteins researchers want to study and simulate, the computational effort needed to generate data, the large data volumes involved, and the different types of analyses scientists need to perform, it is desirable to provide a public repository allowing researchers to pool and share protein unfolding data. To adequately organize, manage, and analyze the data generated by unfolding simulation studies, we designed a data warehouse system that is embedded in a grid environment to facilitate the seamless sharing of available computer resources and thus enable many groups to share complex molecular dynamics simulations on a more regular basis. To gain insight into the conformational fluctuations and stability of the monomeric forms of the amyloidogenic protein transthyretin (TTR), molecular dynamics unfolding simulations of the monomer of human TTR have been conducted. Trajectory data and meta-data of the wild-type (WT) protein and the highly amyloidogenic variant L55P-TTR represent the test case for the data warehouse. Web and grid services, especially pre-defined data mining services that can run on or 'near' the data repository of the data warehouse, are likely to play a pivotal role in the analysis of molecular dynamics unfolding data.

  7. A Mining Algorithm for Extracting Decision Process Data Models

    Directory of Open Access Journals (Sweden)

    Cristina-Claudia DOLEAN

    2011-01-01

    The paper introduces an algorithm that mines logs of user interaction with simulation software. It outputs a model that explicitly shows the data perspective of the decision process, namely the Decision Data Model (DDM). In the first part of the paper we focus on how the DDM is extracted by our mining algorithm. We introduce it as pseudo-code and then provide explanations and examples of how it actually works. In the second part of the paper, we use a series of small case studies to demonstrate the robustness of the mining algorithm and how it deals with the most common patterns we found in real logs.

  8. Usage reporting on recorded lectures using educational data mining

    NARCIS (Netherlands)

    Gorissen, Pierre; Van Bruggen, Jan; Jochems, Wim

    2012-01-01

    Gorissen, P., Van Bruggen, J., & Jochems, W. M. G. (2012). Usage reporting on recorded lectures using educational data mining. International Journal of Learning Technology, 7, 23-40. doi:10.1504/IJLT.2012.046864

  9. 2nd International Conference on Computational Intelligence in Data Mining

    CERN Document Server

    Mohapatra, Durga

    2016-01-01

    The book is a collection of high-quality peer-reviewed research papers presented in the Second International Conference on Computational Intelligence in Data Mining (ICCIDM 2015) held at Bhubaneswar, Odisha, India during 5 – 6 December 2015. The two-volume Proceedings address the difficulties and challenges for the seamless integration of two core disciplines of computer science, i.e., computational intelligence and data mining. The book addresses different methods and techniques of integration for enhancing the overall goal of data mining. The book helps to disseminate the knowledge about some innovative, active research directions in the field of data mining, machine and computational intelligence, along with some current issues and applications of related topics.

  10. Application of Data Mining for Card Fraud Detection

    Directory of Open Access Journals (Sweden)

    I.V. Andrianov

    2012-03-01

    This paper focuses on implementing Data Mining methods for card fraud detection. The approach to classification and prediction tasks for detection of unauthorized transactions is considered.

  11. Data mining of air traffic control operational errors

    Science.gov (United States)

    2006-01-01

    In this paper we present the results of applying data mining techniques to identify patterns and anomalies in air traffic control operational errors (OEs). Reducing the OE rate is of high importance and remains a challenge in the aviation saf...

  12. artery disease guidelines with extracted knowledge from data mining

    Directory of Open Access Journals (Sweden)

    Peyman Rezaei-Hachesu

    2017-06-01

    Conclusion: Guidelines confirm the results achieved with data mining (DM) techniques and help to rank important risk factors based on national and local information. Evaluation of the extracted rules revealed new patterns for CAD patients.

  13. Towards educational data mining: Using data mining methods for automated chat analysis to understand and support inquiry learning processes

    OpenAIRE

    Anjewierden, Anjo; Kolloffel, Bas; Hulshof, Casper

    2007-01-01

    In this paper we investigate the application of data mining methods to provide learners with real-time adaptive feedback on the nature and patterns of their on-line communication while learning collaboratively. We derived two models for classifying chat messages using data mining techniques and tested these on an actual data set [16]. The reliability of the classification of chat messages is established by comparing the models' performance to that of humans. Results indicate that the classifica...

  14. Advances in Machine Learning and Data Mining for Astronomy

    Science.gov (United States)

    Way, Michael J.; Scargle, Jeffrey D.; Ali, Kamal M.; Srivastava, Ashok N.

    2012-03-01

    Advances in Machine Learning and Data Mining for Astronomy documents numerous successful collaborations among computer scientists, statisticians, and astronomers who illustrate the application of state-of-the-art machine learning and data mining techniques in astronomy. Due to the massive amount and complexity of data in most scientific disciplines, the material discussed in this text transcends traditional boundaries between various areas in the sciences and computer science. The book's introductory part provides context to issues in the astronomical sciences that are also important to health, social, and physical sciences, particularly probabilistic and statistical aspects of classification and cluster analysis. The next part describes a number of astrophysics case studies that leverage a range of machine learning and data mining technologies. In the last part, developers of algorithms and practitioners of machine learning and data mining show how these tools and techniques are used in astronomical applications. With contributions from leading astronomers and computer scientists, this book is a practical guide to many of the most important developments in machine learning, data mining, and statistics. It explores how these advances can solve current and future problems in astronomy and looks at how they could lead to the creation of entirely new algorithms within the data mining community.

  15. Predicting Software Projects Cost Estimation Based on Mining Historical Data

    OpenAIRE

    Najadat, Hassan; Alsmadi, Izzat; Shboul, Yazan

    2012-01-01

    In this research, a hybrid cost estimation model is proposed to produce a realistic prediction model that takes into consideration software project, product, process, and environmental elements. A cost estimation dataset is built from a large number of open source projects. Those projects are divided into three domains: communication, finance, and game projects. Several data mining techniques are used to classify software projects in terms of their development complexity. Data mining techniqu...

  16. An Intelligent Agent based Architecture for Visual Data Mining

    OpenAIRE

    Hamdi Ellouzi; Hela Ltifi; Mounir Ben Ayed

    2016-01-01

    The aim of this paper is to present an intelligent architecture for a Decision Support System (DSS) based on visual data mining. This architecture applies multi-agent technology to facilitate the design and development of DSS in complex and dynamic environments. Multi-Agent Systems add a high level of abstraction. To validate the proposed architecture, it is used to develop a distributed visual data mining based DSS that predicts the occurrence of nosocomial infections in intensive care units. Th...

  17. Data mining in e-commerce: A survey

    Indian Academy of Sciences (India)

    Data mining has matured as a field of basic and applied research in computer science in general and e-commerce in particular. In this paper, we survey some of the recent approaches and architectures where data mining has been applied in the fields of e-commerce and e-business. Our intent is not to survey the plethora ...

  18. Profiling Oman education data using data mining approach

    Science.gov (United States)

    Alawi, Sultan Juma Sultan; Shaharanee, Izwan Nizal Mohd; Jamil, Jastini Mohd

    2017-10-01

    The large amount of data generated by application services in different learning fields has created new challenges for the education sector. An education portal is an important system that supports the development of the education field. This paper presents data mining techniques to understand and summarize Oman's education data generated from the Ministry of Education Oman "Educational Portal". The research performs student profiling on the Oman student database, using the k-means clustering technique to determine the students' profiles. A total of 42,484 student records from the Sultanate of Oman were extracted for this study. The findings show the practicality of clustering for investigating student profiles, allowing a better understanding of students' behavior and academic performance. The Oman Education Portal contains large amounts of user activity and interaction data. Analysis of these data can help educators improve student performance and recognize students who need additional attention.
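
    The k-means profiling step mentioned in this record can be illustrated with a minimal sketch. It is not the authors' implementation: the two features (attendance percentage and average grade), the six sample records, and the choice to seed centroids with the first k points are all assumptions made for illustration.

```python
import math

def kmeans(points, k, iterations=20):
    """Plain Lloyd's k-means. Seeding with the first k points is an
    assumption made here so that the example is deterministic."""
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster in place
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return labels, centroids

# Hypothetical student records: (attendance %, average grade).
students = [(95.0, 88.0), (50.0, 52.0), (92.0, 91.0),
            (55.0, 48.0), (90.0, 85.0), (45.0, 55.0)]
labels, centroids = kmeans(students, k=2)
print(labels)
print(centroids)
```

    Each resulting centroid can be read as a student profile; on real data the features would normally be standardized first so that no single attribute dominates the distance.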

  19. G-Tunnel welded tuff mining experiment data summary

    International Nuclear Information System (INIS)

    Zimmerman, R.M.; Bellman, R.A. Jr.; Mann, K.L.; Zerga, D.P.; Fowler, M.

    1990-03-01

    Designers and analysts of radioactive waste repositories must be able to predict the mechanical behavior of the host rock. Sandia National Laboratories elected to conduct a mine-by in welded tuff so that predictive-type information could be obtained regarding the response of the rock to a drill and blast excavation process, where smooth blasting techniques were used. Included in the study were evaluations of and recommendations for various measurement systems that might be used in future mine-by efforts. This report summarizes all of the data obtained in the welded tuff mining experiment. 6 refs., 29 figs., 12 tabs

  20. Data mining for the online retail industry: A case study of RFM model-based customer segmentation using data mining

    OpenAIRE

    Chen, D

    2012-01-01

    Many small online retailers and new entrants to the online retail sector are keen to practice data mining and consumer-centric marketing in their businesses yet technically lack the necessary knowledge and expertise to do so. In this article a case study of using data mining techniques in customer-centric business intelligence for an online retailer is presented. The main purpose of this analysis is to help the business better understand its customers and therefore conduct customer-centric ma...
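
    RFM stands for recency, frequency, and monetary value. As a minimal sketch of how such a table might be derived before any segmentation is applied, the following assumes a hypothetical transaction format of (customer, date, amount) and a hypothetical reference date; none of these values come from the article.

```python
from collections import defaultdict
from datetime import date

def rfm_table(transactions, today):
    """Return {customer: (recency_days, frequency, monetary)}."""
    last_purchase = {}
    frequency = defaultdict(int)
    monetary = defaultdict(float)
    for customer, day, amount in transactions:
        # Recency is measured from the most recent purchase.
        last_purchase[customer] = max(last_purchase.get(customer, day), day)
        frequency[customer] += 1
        monetary[customer] += amount
    return {c: ((today - last_purchase[c]).days, frequency[c], monetary[c])
            for c in last_purchase}

# Hypothetical transactions: (customer id, purchase date, amount spent).
transactions = [
    ("A", date(2011, 11, 1), 120.0),
    ("A", date(2011, 12, 1), 80.0),
    ("B", date(2011, 6, 15), 40.0),
    ("C", date(2011, 12, 5), 300.0),
]
rfm = rfm_table(transactions, today=date(2011, 12, 9))
for customer, values in sorted(rfm.items()):
    print(customer, values)
```

    The three values per customer can then be fed to a clustering or scoring step to form segments.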

  1. Data mining in healthcare: decision making and precision

    Directory of Open Access Journals (Sweden)

    Ionuţ ŢĂRANU

    2016-05-01

    The application of data mining in healthcare is growing because the health sector is rich in information and data mining has become a necessity. Healthcare organizations generate and collect large volumes of information on a daily basis. Information technology enables the automation of data mining and knowledge discovery, which helps reveal interesting patterns. This means eliminating manual tasks and extracting data easily and directly from electronic records, using electronic transfer systems that secure medical records, saving lives and reducing the cost of medical services, and enabling early detection of infectious diseases on the basis of advanced data collection. Data mining can enable healthcare organizations to anticipate trends in a patient's medical condition and behaviour by analysing different perspectives and by making connections between seemingly unrelated information. The raw data from healthcare organizations are voluminous and heterogeneous. They need to be collected and stored in an organized form, and their integration allows the formation of a unified medical information system. Data mining in health offers wide-ranging possibilities for analyzing data patterns that are less visible or hidden to common analysis techniques. These patterns can be used by healthcare practitioners to make forecasts, establish diagnoses, and set treatments for patients in healthcare organizations.

  2. Data mining in soft computing framework: a survey.

    Science.gov (United States)

    Mitra, S; Pal, S K; Mitra, P

    2002-01-01

    The present article provides a survey of the available literature on data mining using soft computing. A categorization has been provided based on the different soft computing tools and their hybridizations used, the data mining function implemented, and the preference criterion selected by the model. The utility of the different soft computing methodologies is highlighted. Generally fuzzy sets are suitable for handling the issues related to understandability of patterns, incomplete/noisy data, mixed media information and human interaction, and can provide approximate solutions faster. Neural networks are nonparametric, robust, and exhibit good learning and generalization capabilities in data-rich environments. Genetic algorithms provide efficient search algorithms to select a model, from mixed media data, based on some preference criterion/objective function. Rough sets are suitable for handling different types of uncertainty in data. Some challenges to data mining and the application of soft computing methodologies are indicated. An extensive bibliography is also included.

  3. Analysis of Occupational Accidents in Underground and Surface Mining in Spain Using Data-Mining Techniques.

    Science.gov (United States)

    Sanmiquel, Lluís; Bascompta, Marc; Rossell, Josep M; Anticoi, Hernán Francisco; Guash, Eduard

    2018-03-07

    An analysis of occupational accidents in the mining sector was conducted using the data from the Spanish Ministry of Employment and Social Safety between 2005 and 2015, and data-mining techniques were applied. Data was processed with the software Weka. Two scenarios were chosen from the accidents database: surface and underground mining. The most important variables involved in occupational accidents and their association rules were determined. These rules are composed of several predictor variables that cause accidents, defining their characteristics and context. This study presents the 20 most important association rules in the sector, for either surface or underground mining, based on the statistical confidence levels of each rule as obtained by Weka. The outcomes display the most typical immediate causes, along with the percentage of accidents with a basis in each association rule. The most important immediate cause is body movement with physical effort or overexertion, and the type of accident is physical effort or overexertion. On the other hand, the second most important immediate cause and type of accident are different between the two scenarios. Data-mining techniques were chosen as a useful tool to find out the root cause of the accidents.
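
    The confidence levels used to rank association rules can be illustrated with a small sketch. It computes the standard support and confidence of a rule "antecedent implies consequent" over a handful of invented accident records; the attribute names and data are hypothetical and the code does not reproduce Weka's output.

```python
def rule_metrics(records, antecedent, consequent):
    """Support and confidence of the rule antecedent -> consequent.

    Each record is a set of attribute values observed for one accident."""
    antecedent, consequent = set(antecedent), set(consequent)
    n_antecedent = sum(1 for r in records if antecedent <= r)
    n_both = sum(1 for r in records if (antecedent | consequent) <= r)
    support = n_both / len(records)
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence

# Hypothetical accident records (attribute values are invented).
accidents = [
    {"underground", "overexertion", "physical_effort"},
    {"underground", "overexertion", "physical_effort"},
    {"underground", "fall"},
    {"surface", "overexertion", "physical_effort"},
    {"surface", "vehicle"},
]
support, confidence = rule_metrics(accidents, {"underground"}, {"overexertion"})
print(f"support={support:.2f} confidence={confidence:.2f}")
```

    Here the rule holds in 2 of 5 records (support 0.40) and in 2 of the 3 underground records (confidence about 0.67).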

  4. Physics Mining of Multi-Source Data Sets

    Science.gov (United States)

    Helly, John; Karimabadi, Homa; Sipes, Tamara

    2012-01-01

    Powerful new parallel data mining algorithms can produce diagnostic and prognostic numerical models and analyses from observational data. These techniques yield higher-resolution measures than ever before of environmental parameters by fusing synoptic imagery and time-series measurements. These techniques are general and relevant to observational data, including raster, vector, and scalar, and can be applied in all Earth- and environmental science domains. Because they can be highly automated and are parallel, they scale to large spatial domains and are well suited to change and gap detection. This makes it possible to analyze spatial and temporal gaps in information, and facilitates within-mission replanning to optimize the allocation of observational resources. The basis of the innovation is the extension of a recently developed set of algorithms packaged into MineTool to multi-variate time-series data. MineTool is unique in that it automates the various steps of the data mining process, thus making it amenable to autonomous analysis of large data sets. Unlike techniques such as Artificial Neural Nets, which yield a blackbox solution, MineTool's outcome is always an analytical model in parametric form that expresses the output in terms of the input variables. This has the advantage that the derived equation can then be used to gain insight into the physical relevance and relative importance of the parameters and coefficients in the model. This is referred to as physics-mining of data. The capabilities of MineTool are extended to include both supervised and unsupervised algorithms, handle multi-type data sets, and parallelize it.

  5. SparkText: Biomedical Text Mining on Big Data Framework

    Science.gov (United States)

    He, Karen Y.; Wang, Kai

    2016-01-01

    Background: Many new biomedical research articles are published every day, accumulating rich information, such as genetic variants, genes, diseases, and treatments. Rapid yet accurate text mining on large-scale scientific literature can discover novel knowledge to better understand human diseases and to improve the quality of disease diagnosis, prevention, and treatment. Results: In this study, we designed and developed an efficient text mining framework called SparkText on a Big Data infrastructure, which is composed of Apache Spark data streaming and machine learning methods, combined with a Cassandra NoSQL database. To demonstrate its performance for classifying cancer types, we extracted information (e.g., breast, prostate, and lung cancers) from tens of thousands of articles downloaded from PubMed, and then employed Naïve Bayes, Support Vector Machine (SVM), and Logistic Regression to build prediction models to mine the articles. The accuracy of predicting a cancer type by SVM using the 29,437 full-text articles was 93.81%. While competing text-mining tools took more than 11 hours, SparkText mined the dataset in approximately 6 minutes. Conclusions: This study demonstrates the potential for mining large-scale scientific articles on a Big Data infrastructure, with real-time update from new articles published daily. SparkText can be extended to other areas of biomedical research. PMID:27685652
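
    To make the classification step concrete, here is a minimal single-machine sketch of one of the three classifiers named above, multinomial Naïve Bayes with add-one smoothing. It is not SparkText and uses neither Spark nor Cassandra; the four toy documents and their labels are invented for illustration.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (label, text). Returns the counts a multinomial NB needs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for label, text in docs:
        words = text.lower().split()
        label_counts[label] += 1
        word_counts[label].update(words)
        vocab.update(words)
    return label_counts, word_counts, vocab

def predict_nb(model, text):
    """Return the label with the highest log-posterior for the given text."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(label_counts):
        score = math.log(label_counts[label] / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word in vocab:  # words never seen in training are ignored
                # Add-one (Laplace) smoothed word likelihood.
                score += math.log((word_counts[label][word] + 1)
                                  / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy corpus standing in for labelled article text.
docs = [
    ("breast", "brca1 mutation mammography tumor"),
    ("breast", "mammography screening brca2"),
    ("lung", "smoking nodule tumor bronchial"),
    ("lung", "bronchial biopsy smoking"),
]
model = train_nb(docs)
print(predict_nb(model, "mammography brca1 screening"))
print(predict_nb(model, "smoking bronchial nodule"))
```

    In a distributed setting the counting in `train_nb` is the part that would be spread across workers, since the per-label word counts can be summed independently.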

  6. SparkText: Biomedical Text Mining on Big Data Framework.

    Directory of Open Access Journals (Sweden)

    Zhan Ye

    Many new biomedical research articles are published every day, accumulating rich information, such as genetic variants, genes, diseases, and treatments. Rapid yet accurate text mining on large-scale scientific literature can discover novel knowledge to better understand human diseases and to improve the quality of disease diagnosis, prevention, and treatment. In this study, we designed and developed an efficient text mining framework called SparkText on a Big Data infrastructure, which is composed of Apache Spark data streaming and machine learning methods, combined with a Cassandra NoSQL database. To demonstrate its performance for classifying cancer types, we extracted information (e.g., breast, prostate, and lung cancers) from tens of thousands of articles downloaded from PubMed, and then employed Naïve Bayes, Support Vector Machine (SVM), and Logistic Regression to build prediction models to mine the articles. The accuracy of predicting a cancer type by SVM using the 29,437 full-text articles was 93.81%. While competing text-mining tools took more than 11 hours, SparkText mined the dataset in approximately 6 minutes. This study demonstrates the potential for mining large-scale scientific articles on a Big Data infrastructure, with real-time update from new articles published daily. SparkText can be extended to other areas of biomedical research.

  7. SparkText: Biomedical Text Mining on Big Data Framework.

    Science.gov (United States)

    Ye, Zhan; Tafti, Ahmad P; He, Karen Y; Wang, Kai; He, Max M

    Many new biomedical research articles are published every day, accumulating rich information, such as genetic variants, genes, diseases, and treatments. Rapid yet accurate text mining on large-scale scientific literature can discover novel knowledge to better understand human diseases and to improve the quality of disease diagnosis, prevention, and treatment. In this study, we designed and developed an efficient text mining framework called SparkText on a Big Data infrastructure, which is composed of Apache Spark data streaming and machine learning methods, combined with a Cassandra NoSQL database. To demonstrate its performance for classifying cancer types, we extracted information (e.g., breast, prostate, and lung cancers) from tens of thousands of articles downloaded from PubMed, and then employed Naïve Bayes, Support Vector Machine (SVM), and Logistic Regression to build prediction models to mine the articles. The accuracy of predicting a cancer type by SVM using the 29,437 full-text articles was 93.81%. While competing text-mining tools took more than 11 hours, SparkText mined the dataset in approximately 6 minutes. This study demonstrates the potential for mining large-scale scientific articles on a Big Data infrastructure, with real-time update from new articles published daily. SparkText can be extended to other areas of biomedical research.

  8. HSM: Heterogeneous Subspace Mining in High Dimensional Data

    DEFF Research Database (Denmark)

    Müller, Emmanuel; Assent, Ira; Seidl, Thomas

    2009-01-01

    Heterogeneous data, i.e. data with both categorical and continuous values, is common in many databases. However, most data mining algorithms assume either continuous or categorical attributes, but not both. In high dimensional data, phenomena due to the "curse of dimensionality" pose additional challenges. Usually, due to locally varying relevance of attributes, patterns do not show across the full set of attributes. In this paper we propose HSM, which defines a new pattern model for heterogeneous high dimensional data. It allows data mining in arbitrary subsets of the attributes that are relevant for the respective patterns. Based on this model we propose an efficient algorithm, which is aware of the heterogeneity of the attributes. We extend an indexing structure for continuous attributes such that HSM indexing adapts to different attribute types. In our experiments we show that HSM efficiently mines...

  9. APLIKASI DATA MINING UNTUK MENAMPILKAN INFORMASI TINGKAT KELULUSAN MAHASISWA [Data Mining Application for Displaying Student Graduation Rate Information]

    Directory of Open Access Journals (Sweden)

    Yuli Asriningtias

    2014-01-01

    Universities are required to have a competitive advantage by making use of their resources, including human resources, which in this case are students. Not all students complete their studies on time, and their grade point averages (GPA) vary. The time students take to complete their studies and their GPA are among the factors that determine the standing of a university. This potential value can be uncovered using data mining techniques. Data mining is the activity of discovering interesting patterns in large amounts of data; the data may be stored in databases, data warehouses, or other information stores. A data warehouse is a data store that is object-oriented, integrated, time-variant, and nonvolatile, and that supports management in the decision-making process. This research was developed by scanning the data in the database directly to produce the required information. The data mining application was built using the Borland Delphi 7 programming language with a SQL Server 2000 database as the data store. The results show that students' on-time completion rate and graduation grades can be determined in relation to the attributes of the students' admission data. Keywords: data mining, data warehouse, student graduation.

  10. Compass: A hybrid method for clinical and biobank data mining

    DEFF Research Database (Denmark)

    Krysiak-Baltyn, Konrad; Petersen, Thomas Nordahl; Audouze, Karine Marie Laure

    2014-01-01

    We describe a new method for identification of confident associations within large clinical data sets. The method is a hybrid of two existing methods: Self-Organizing Maps and Association Mining. We utilize Self-Organizing Maps as the initial step to reduce the search space, and then apply Association Mining in order to find association rules. We demonstrate that this procedure has a number of advantages compared to traditional Association Mining; it allows for handling numerical variables without a priori binning and is able to generate variable groups which act as “hotspots” for statistically significant associations. We showcase the method on infertility-related data from Danish military conscripts. The clinical data we analyzed contained both categorical type questionnaire data and continuous variables generated from biological measurements, including missing values. From this data set, we...

  11. Teaching Financial Data Mining using Stocks and Futures Contracts

    Directory of Open Access Journals (Sweden)

    Gary Boetticher

    2005-06-01

    Building financial data mining models is considered to be "the hardest way to make easy money." Data miners are certainly motivated by the prospect of discovering a financial "Holy Grail." However, designing and implementing a successful model poses many intellectual challenges. These include securing and cleaning data; acquiring a sufficient amount of financial domain knowledge; bounding the complexity of the problem; and properly validating results. Teaching financial data mining is especially difficult due to the student's limited financial domain knowledge and the relatively short period (one semester) for building financial models. This paper describes an application of a financial data mining term project based on Stock and E-Mini futures contracts and discusses "lessons learned" from assigning similar term projects over six different semesters. Results of each case study are presented and discussed.

  12. Clinical diabetes research using data mining: a Canadian perspective.

    Science.gov (United States)

    Shah, Baiju R; Lipscombe, Lorraine L

    2015-06-01

    With the advent of the digitization of large amounts of information and the computer power capable of analyzing this volume of information, data mining is increasingly being applied to medical research. Datasets created for administration of the healthcare system provide a wealth of information from different healthcare sectors, and Canadian provinces' single-payer universal healthcare systems mean that data are more comprehensive and complete in this country than in many other jurisdictions. The increasing ability to also link clinical information, such as electronic medical records, laboratory test results and disease registries, has broadened the types of data available for analysis. Data-mining methods have been used in many different areas of diabetes clinical research, including classic epidemiology, effectiveness research, population health and health services research. Although methodologic challenges and privacy concerns remain important barriers to using these techniques, data mining remains a powerful tool for clinical research. Copyright © 2015 Canadian Diabetes Association. Published by Elsevier Inc. All rights reserved.

  13. Redo log process mining in real life : data challenges & opportunities

    NARCIS (Netherlands)

    González López de Murillas, E.; Hoogendoorn, G.E.; Reijers, H.A.; Teniente, E.; Weidlich, M.

    2018-01-01

    Data extraction and preparation are the most time-consuming phases of any process mining project. Due to the variability in the sources of event data, it remains a highly manual process in most cases. Moreover, it is very difficult to obtain reliable event data in enterprise systems that are

  14. Data mining to detect clinical mastitis with automatic milking

    NARCIS (Netherlands)

    Kamphuis, C.; Mollenhorst, H.; Heesterbeek, J.A.P.; Hogeveen, H.

    2010-01-01

    Our objective was to use data mining to develop and validate a detection model for clinical mastitis (CM) using sensor data collected at nine Dutch dairy herds milking automatically. Sensor data was available for almost 3.5 million quarter milkings (QM) from 1,109 cows; 348 QM with CM were observed

  15. Process mining on databases: Unearthing historical data from redo logs

    NARCIS (Netherlands)

    González-López de Murillas, E.; van der Aalst, W.M.P.; Reijers, H.A.

    2015-01-01

    Process Mining techniques rely on the existence of event data. However, in many cases it is far from trivial to obtain such event data. Considerable efforts may need to be spent on making IT systems record historic data at all. But even if such records are available, it may not be possible to derive

  16. Prediction of thermodynamic properties of refrigerants using data mining

    International Nuclear Information System (INIS)

    Kuecueksille, Ecir Ugur; Selbas, Resat; Sencan, Arzu

    2011-01-01

    The analysis of vapor compression refrigeration systems requires the availability of simple and efficient mathematical formulations for the determination of thermodynamic properties of refrigerants. The aim of this study is to determine thermodynamic properties such as enthalpy, entropy and specific volume of alternative refrigerants using a data mining method. The alternative refrigerants used in the study are R134a, R404a, R407c and R410a. The results obtained from data mining have been compared to actual data from the literature. The study shows that the data mining methodology can be applied successfully to determine enthalpy, entropy and specific volume values for any temperature and pressure of refrigerants. As a result, computation time is reduced and the simulation of vapor compression refrigeration systems becomes considerably easier.

  17. Randomized algorithms in automatic control and data mining

    CERN Document Server

    Granichin, Oleg; Toledano-Kitai, Dvora

    2015-01-01

    In the fields of data mining and control, the huge amount of unstructured data and the presence of uncertainty in system descriptions have always been critical issues. The book Randomized Algorithms in Automatic Control and Data Mining introduces the readers to the fundamentals of randomized algorithm applications in data mining (especially clustering) and in automatic control synthesis. The methods proposed in this book guarantee that the computational complexity of classical algorithms and the conservativeness of standard robust control techniques will be reduced. It is shown that when a problem requires "brute force" in selecting among options, algorithms based on random selection of alternatives offer good results with certain probability for a restricted time and significantly reduce the volume of operations.
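
    The closing claim about random selection of alternatives can be illustrated with a well-known fact: if n options are sampled uniformly and independently, the chance that at least one falls in the best fraction q of all options is 1 - (1 - q)^n. The sketch below is a generic illustration rather than an algorithm from the book; the cost function, the option set, and the seed are assumptions.

```python
import random

def hit_probability(top_fraction, samples):
    """Chance that at least one of `samples` uniform independent draws
    lands in the best `top_fraction` of the options."""
    return 1.0 - (1.0 - top_fraction) ** samples

def random_search(options, cost, samples, seed=0):
    """Evaluate only `samples` randomly chosen options and keep the cheapest."""
    rng = random.Random(seed)
    picks = [rng.choice(options) for _ in range(samples)]
    return min(picks, key=cost)

# Hypothetical problem: 1000 options, cost is the squared distance from 700.
options = list(range(1000))
best = random_search(options, cost=lambda x: (x - 700) ** 2, samples=59)
print(best, hit_probability(0.05, 59))
```

    With 59 samples the probability of hitting the best 5% of options already exceeds 0.95, independent of how many options exist, which is why random selection can replace exhaustive search when a near-best answer is acceptable.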

  18. Locating previously unknown patterns in data-mining results: a dual data- and knowledge-mining method

    Directory of Open Access Journals (Sweden)

    Knaus William A

    2006-03-01

    Background: Data mining can be utilized to automate analysis of substantial amounts of data produced in many organizations. However, data mining produces large numbers of rules and patterns, many of which are not useful. Existing methods for pruning uninteresting patterns have only begun to automate the knowledge acquisition step (which is required for subjective measures of interestingness), hence leaving a serious bottleneck. In this paper we propose a method for automatically acquiring knowledge to shorten the pattern list by locating the novel and interesting ones. Methods: The dual-mining method is based on automatically comparing the strength of patterns mined from a database with the strength of equivalent patterns mined from a relevant knowledgebase. When these two estimates of pattern strength do not match, a high "surprise score" is assigned to the pattern, identifying the pattern as potentially interesting. The surprise score captures the degree of novelty or interestingness of the mined pattern. In addition, we show how to compute p values for each surprise score, thus filtering out noise and attaching statistical significance. Results: We have implemented the dual-mining method using scripts written in Perl and R. We applied the method to a large patient database and a biomedical literature citation knowledgebase. The system estimated association scores for 50,000 patterns, composed of disease entities and lab results, by querying the database and the knowledgebase. It then computed the surprise scores by comparing the pairs of association scores. Finally, the system estimated statistical significance of the scores. Conclusion: The dual-mining method eliminates more than 90% of patterns with strong associations, thus identifying them as uninteresting. We found that the pruning of patterns using the surprise score matched the biomedical evidence in the 100 cases that were examined by hand. The method automates the acquisition of

  19. Advances in machine learning and data mining for astronomy

    CERN Document Server

    Way, Michael J

    2012-01-01

    Advances in Machine Learning and Data Mining for Astronomy documents numerous successful collaborations among computer scientists, statisticians, and astronomers who illustrate the application of state-of-the-art machine learning and data mining techniques in astronomy. Due to the massive amount and complexity of data in most scientific disciplines, the material discussed in this text transcends traditional boundaries between various areas in the sciences and computer science. The book's introductory part provides context to issues in the astronomical sciences that are also important to health

  20. Advanced Data Mining of Leukemia Cells Micro-Arrays

    OpenAIRE

    Richard S. Segall; Ryan M. Pierce

    2009-01-01

    This paper continues and extends previous research by Segall and Pierce (2009a) that discussed data mining for micro-array databases of Leukemia cells, primarily using self-organized maps (SOM). As in Segall and Pierce (2009a) and Segall and Pierce (2009b), the results of applying data mining are shown and discussed for the data categories of microarray databases of HL60, Jurkat, NB4 and U937 Leukemia cells that are also described in this article. First, a background section is pro...

  1. Data mining with SPSS modeler theory, exercises and solutions

    CERN Document Server

    Wendler, Tilo

    2016-01-01

    Introducing the IBM SPSS Modeler, this book guides readers through data mining processes and presents relevant statistical methods. There is a special focus on step-by-step tutorials and well-documented examples that help demystify complex mathematical algorithms and computer programs. The variety of exercises and solutions as well as an accompanying website with data sets and SPSS Modeler streams are particularly valuable. While intended for students, the simplicity of the Modeler makes the book useful for anyone wishing to learn about basic and more advanced data mining, and put this knowledge into practice.

  2. Data Mining Process Optimization in Computational Multi-agent Systems

    OpenAIRE

    Kazík, O.; Neruda, R. (Roman)

    2015-01-01

    In this paper, we present an agent-based solution to the meta-learning problem which focuses on the optimization of data mining processes. We exploit the framework of computational multi-agent systems in which various meta-learning problems have already been studied, e.g. parameter-space search or simple method recommendation. In this paper, we examine the effect of data preprocessing for machine learning problems. We perform a set of experiments in the search-space of data mining processes which is...

  3. DATA MINING. CONCEPTS AND APPLICATIONS IN BANKING SECTOR

    Directory of Open Access Journals (Sweden)

    ADRIAN IONUT PASCU

    2018-02-01

    Full Text Available The concept of banking refers to the multitude of services and products that commercial banks offer to clients and includes, besides transactional accounts, both passive and active products. Due to the increased competitiveness in banking, the relationship between the bank and the client has become an essential factor in strategies to increase customer satisfaction. Currently, banks are able to store the impressive amounts of data that they collect daily, from customer data and transaction details to data on customers' transactional or risk profiles. The process through which large amounts of data are extracted and analyzed, patterns are identified, and the information obtained using mathematical and statistical models is interpreted is known as data mining. The discovery of knowledge from data involves identifying models and patterns with which certain events or possible risks can be anticipated. This process helps banks to develop strategies in areas such as customer retention and loyalty, customer satisfaction, fraud detection and prevention, risk management, and money laundering prevention. The aim of this paper is to present the concept of data mining and the concept of knowledge discovery in databases (KDD), as well as the impact and important uses of data mining techniques in the banking sector. This paper explores and reviews various data mining techniques that are applied in the banking sector and provides insight into how these techniques are used in different areas to make decision-making easier and more efficient.

  4. Combining complex networks and data mining: Why and how

    Science.gov (United States)

    Zanin, M.; Papo, D.; Sousa, P. A.; Menasalvas, E.; Nicchi, A.; Kubik, E.; Boccaletti, S.

    2016-05-01

    The increasing power of computer technology does not dispense with the need to extract meaningful information out of data sets of ever growing size, and indeed typically exacerbates the complexity of this task. To tackle this general problem, two methods have emerged, at chronologically different times, that are now commonly used in the scientific community: data mining and complex network theory. Not only do complex network analysis and data mining share the same general goal, that of extracting information from complex systems to ultimately create a new compact quantifiable representation, but they also often address similar problems too. In the face of that, a surprisingly low number of researchers turn out to resort to both methodologies. One may then be tempted to conclude that these two fields are either largely redundant or totally antithetic. The starting point of this review is that this state of affairs should be put down to contingent rather than conceptual differences, and that these two fields can in fact advantageously be used in a synergistic manner. An overview of both fields is first provided, some fundamental concepts of which are illustrated. A variety of contexts in which complex network theory and data mining have been used in a synergistic manner are then presented. Contexts in which the appropriate integration of complex network metrics can lead to improved classification rates with respect to classical data mining algorithms and, conversely, contexts in which data mining can be used to tackle important issues in complex network theory applications are illustrated. Finally, ways to achieve a tighter integration between complex networks and data mining, and open lines of research are discussed.

  5. Review of Data Mining Techniques for Churn Prediction in Telecom

    Directory of Open Access Journals (Sweden)

    Vishal Mahajan

    2015-12-01

    service. This data can be usefully mined for churn analysis and prediction. Significant research has been undertaken worldwide to understand the data mining practices that can be used for predicting customer churn. This paper provides a review of around 100 recent journal articles, starting from the year 2000, to present the various data mining techniques used in multiple customer-based churn models. It then summarizes the existing telecom literature by highlighting the sample size used, the churn variables employed and the findings of different DM techniques. Finally, we list the most popular techniques for churn prediction in telecom as decision trees, regression analysis and clustering, thereby providing a roadmap for new researchers to build novel churn management models.

  6. The First International Conference on Soft Computing and Data Mining

    CERN Document Server

    Ghazali, Rozaida; Deris, Mustafa

    2014-01-01

    This book constitutes the refereed proceedings of the First International Conference on Soft Computing and Data Mining, SCDM 2014, held at Universiti Tun Hussein Onn Malaysia on June 16-18, 2014. The 65 revised full papers presented in this book were carefully reviewed and selected from 145 submissions, and organized into two main topical sections: Data Mining and Soft Computing. The goal of this book is to provide both theoretical concepts and, especially, practical techniques in the fields of soft computing and data mining, ready to be applied in real-world applications. The exchange of views on future research directions in this field and the resulting dissemination of the latest research findings make this work of value to all those with an interest in the topics covered.

  7. Classification of Internet banking customers using data mining algorithms

    Directory of Open Access Journals (Sweden)

    Reza Radfar

    2014-03-01

    Full Text Available Classifying customers using data mining algorithms enables banks to retain the loyalty of existing customers while attracting new ones. Using a decision tree as a data mining technique, we can optimize customer classification provided that the appropriate decision tree is selected. In this article we present a model to classify customers who use an internet banking service. The model is developed based on the CRISP-DM standard, and we have used real data from Sina bank's Internet bank. Compared to other decision trees, ours is based on both optimization and accuracy factors and recognizes new potential internet banking customers using a three-level classification: low, medium and high. This is a practical, documentary-based research study. Mining customer rules enables managers to make policies based on the discovered patterns in order to gain a better perception of what customers really desire.

  8. Data mining-aided materials discovery and optimization

    Directory of Open Access Journals (Sweden)

    Wencong Lu

    2017-09-01

    Full Text Available Recent developments in data mining-aided materials discovery and optimization are reviewed in this paper, and an introduction to the materials data mining (MDM) process is provided using case studies. Both qualitative and quantitative methods in machine learning can be adopted in the MDM process to accomplish different tasks in materials discovery, design, and optimization. State-of-the-art techniques in data mining-aided materials discovery and optimization are demonstrated by reviewing the controllable synthesis of dendritic Co3O4 superstructures, materials design of layered double hydroxide, battery materials discovery, and thermoelectric materials design. The results of the case studies indicate that MDM is a powerful approach for use in materials discovery and innovation, and will play an important role in the development of the Materials Genome Initiative and Materials Informatics.

  9. Towards Cooperative Predictive Data Mining in Competitive Environments

    Science.gov (United States)

    Lisý, Viliam; Jakob, Michal; Benda, Petr; Urban, Štěpán; Pěchouček, Michal

    We study the problem of predictive data mining in a competitive multi-agent setting, in which each agent is assumed to have some partial knowledge required for correctly classifying a set of unlabelled examples. The agents are self-interested and therefore need to reason about the trade-offs between increasing their classification accuracy by collaborating with other agents and disclosing their private classification knowledge to other agents through such collaboration. We analyze the problem and propose a set of components which can enable cooperation in this otherwise competitive task. These components include measures for quantifying private knowledge disclosure, data-mining models suitable for multi-agent predictive data mining, and a set of strategies by which agents can improve their classification accuracy through collaboration. The overall framework and its individual components are validated on a synthetic experimental domain.

  10. High Performance Data mining by Genetic Neural Network

    Directory of Open Access Journals (Sweden)

    Dadmehr Rahbari

    2013-10-01

    Full Text Available Data mining in computer science is the process of discovering interesting and useful patterns and relationships in large volumes of data. Most methods for mining problems are based on artificial intelligence algorithms. Neural network optimization based on three basic parameters (topology, weights and the learning rate) is a powerful method. We introduce an optimal method for solving this problem. In this paper, a genetic algorithm with mutation and crossover operators changes the network structure and optimizes it. The dataset used in our work is a stroke disease dataset with twenty features, the number of which is optimized by the new hybrid algorithm. The results of this work compare very well with those of other similar methods. The low percentage of error shows that our method is an efficient new approach to high-performance data mining problems.
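    The abstract above does not give the authors' operators, so the following is only a minimal sketch of how mutation and crossover might act on a network topology. It assumes a genome is a list of hidden-layer sizes and uses a stand-in fitness function; `crossover`, `mutate`, `evolve` and `toy_fitness` are hypothetical names, and a real run would train and score a network inside the fitness function.

```python
import random


def crossover(parent_a, parent_b, rng):
    """One-point crossover of two equal-length genomes (lists of layer sizes)."""
    point = rng.randint(1, len(parent_a) - 1)
    return parent_a[:point] + parent_b[point:]


def mutate(genome, rate, rng, low=1, high=32):
    """Resample each gene with probability `rate`."""
    return [rng.randint(low, high) if rng.random() < rate else g for g in genome]


def evolve(fitness, genome_len=3, pop_size=20, generations=30, seed=0):
    """Keep the fitter half each generation and refill with mutated offspring."""
    rng = random.Random(seed)
    population = [[rng.randint(1, 32) for _ in range(genome_len)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        survivors = population[: pop_size // 2]  # elitist selection
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.sample(survivors, 2)
            children.append(mutate(crossover(a, b, rng), 0.1, rng))
        population = survivors + children
    return max(population, key=fitness)


def toy_fitness(genome):
    """Assumed stand-in objective: prefer layers close to 8 units."""
    return -sum((g - 8) ** 2 for g in genome)


best = evolve(toy_fitness)
```

    Because the fitter half of the population is carried over unchanged, the best fitness found never decreases from one generation to the next.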

  11. Analysis of Occupational Accidents in Underground and Surface Mining in Spain Using Data-Mining Techniques

    Directory of Open Access Journals (Sweden)

    Lluís Sanmiquel

    2018-03-01

    Full Text Available An analysis of occupational accidents in the mining sector was conducted using the data from the Spanish Ministry of Employment and Social Safety between 2005 and 2015, and data-mining techniques were applied. Data was processed with the software Weka. Two scenarios were chosen from the accidents database: surface and underground mining. The most important variables involved in occupational accidents and their association rules were determined. These rules are composed of several predictor variables that cause accidents, defining its characteristics and context. This study exposes the 20 most important association rules in the sector—either surface or underground mining—based on the statistical confidence levels of each rule as obtained by Weka. The outcomes display the most typical immediate causes, along with the percentage of accidents with a basis in each association rule. The most important immediate cause is body movement with physical effort or overexertion, and the type of accident is physical effort or overexertion. On the other hand, the second most important immediate cause and type of accident are different between the two scenarios. Data-mining techniques were chosen as a useful tool to find out the root cause of the accidents.
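    As a hedged illustration of what the confidence of an association rule means, the sketch below computes support and confidence over a handful of made-up records. The records, attribute values and function names are hypothetical and are not taken from the Spanish accident database or from Weka.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate P(consequent | antecedent) as count(A and C) / count(A)."""
    a = frozenset(antecedent)
    both = a | frozenset(consequent)
    n_a = sum(1 for t in transactions if a <= t)
    if n_a == 0:
        return 0.0  # rule never fires, so report zero confidence
    n_both = sum(1 for t in transactions if both <= t)
    return n_both / n_a


# Hypothetical toy records, invented for illustration only.
records = [
    frozenset({"underground", "overexertion", "sprain"}),
    frozenset({"underground", "overexertion", "sprain"}),
    frozenset({"underground", "fall", "fracture"}),
    frozenset({"surface", "overexertion", "sprain"}),
    frozenset({"surface", "vehicle", "contusion"}),
]

rule_conf = confidence(records, {"underground"}, {"overexertion"})
```

    Ranking rules by confidence, as the study does with Weka's output, amounts to sorting candidate rules by this ratio, usually after discarding rules whose support is too low to be reliable.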

  12. Analysis of Occupational Accidents in Underground and Surface Mining in Spain Using Data-Mining Techniques

    Science.gov (United States)

    Sanmiquel, Lluís; Bascompta, Marc; Rossell, Josep M.; Anticoi, Hernán Francisco; Guash, Eduard

    2018-01-01

    An analysis of occupational accidents in the mining sector was conducted using the data from the Spanish Ministry of Employment and Social Safety between 2005 and 2015, and data-mining techniques were applied. Data was processed with the software Weka. Two scenarios were chosen from the accidents database: surface and underground mining. The most important variables involved in occupational accidents and their association rules were determined. These rules are composed of several predictor variables that cause accidents, defining its characteristics and context. This study exposes the 20 most important association rules in the sector—either surface or underground mining—based on the statistical confidence levels of each rule as obtained by Weka. The outcomes display the most typical immediate causes, along with the percentage of accidents with a basis in each association rule. The most important immediate cause is body movement with physical effort or overexertion, and the type of accident is physical effort or overexertion. On the other hand, the second most important immediate cause and type of accident are different between the two scenarios. Data-mining techniques were chosen as a useful tool to find out the root cause of the accidents. PMID:29518921

  13. Research on forecast technology of mine gas emission based on fuzzy data mining (FDM)

    Energy Technology Data Exchange (ETDEWEB)

    Xu Chang-kai; Wang Yao-cai; Wang Jun-wei [CUMT, Xuzhou (China). School of Information and Electrical Engineering

    2004-07-01

    Coal mine production safety can be further improved by forecasting the quantity of gas emission from the real-time and historical data saved by the gas monitoring system. Exploiting the strengths of data warehouse and data mining technology in processing large quantities of redundant data, a method for forecasting mine gas emission quantity based on FDM, and its application, were studied. The construction of a fuzzy resemblance relation and clustering analysis were proposed, through which potential relationships within the gas emission data may be found. The mode-finding model and the forecast model were presented, along with a detailed approach to realizing the forecast, which has been applied to forecast the gas emission quantity efficiently.

  14. Mining multi-dimensional data for decision support

    Energy Technology Data Exchange (ETDEWEB)

    Donato, J.M.; Schryver, J.C.; Hinkel, G.C.; Schmoyer, R.L. Jr. [Oak Ridge National Lab., TN (United States); Grady, N.W.; Leuze, M.R. [Oak Ridge National Lab., TN (United States)]|[Joint Inst. for Computational Science, Knoxville, TN (United States)

    1998-06-01

    While it is widely recognized that data can be a valuable resource for any organization, extracting information contained within the data is often a difficult problem. Attempts to obtain information from data may be limited by legacy data storage formats, lack of expert knowledge about the data, difficulty in viewing the data, or the volume of data needing to be processed. The rapidly developing field of Data Mining or Knowledge Data Discovery is a blending of Artificial Intelligence, Statistics, and Human-Computer Interaction. Sophisticated data navigation tools to obtain the information needed for decision support do not yet exist. Each data mining task requires a custom solution that depends upon the character and quantity of the data. This paper presents a two-stage approach for handling the prediction of personal bankruptcy using credit card account data, combining decision tree and artificial neural network technologies. Topics to be discussed include the pre-processing of data, including data cleansing, the filtering of data for pertinent records, and the reduction of data for attributes contributing to the prediction of bankruptcy, and the two steps in the mining process itself.

  15. Data Mining Tools Make Flights Safer, More Efficient

    Science.gov (United States)

    2014-01-01

    A small data mining team at Ames Research Center developed a set of algorithms ideal for combing through flight data to find anomalies. Dallas-based Southwest Airlines Co. signed a Space Act Agreement with Ames in 2011 to access the tools, helping the company refine its safety practices, improve its safety reviews, and increase flight efficiencies.

  16. Briefly on the GUHA Method of Data Mining

    Czech Academy of Sciences Publication Activity Database

    Hájek, Petr

    -, No. 3 (2003), pp. 112-114. ISSN 1509-4553. R&D Projects: GA MŠk OC 274.001. Grant - others: COST(XE) Action 274 TARSKI. Institutional research plan: AV0Z1030915. Keywords: GUHA method * data mining * exploratory data analysis. Subject RIV: BA - General Mathematics. http://www.nit.eu/czasopisma/JTIT/2003/3/112.pdf

  17. Modeling issues & choices in the data mining optimization ontology

    CSIR Research Space (South Africa)

    Keet, CM

    2013-05-01

    Full Text Available We describe the Data Mining Optimization Ontology (DMOP), which was developed to support informed decision-making at various choice points of the knowledge discovery (KD) process. It can be used as a reference by data miners, but its primary purpose...

  18. Dengue fatality prediction using data mining | Rahim | Journal of ...

    African Journals Online (AJOL)

    The aim of this research is to study the current implementation of dengue outbreak control in Malaysia and predict dengue fever cases using data mining techniques. Real data on dengue fever and weather are collected from the Ministry of Health in its Perak Tengah district office and Perak Meteorological office respectively ...

  19. Data Mining in Earth System Science (DMESS 2011)

    Science.gov (United States)

    Forrest M. Hoffman; J. Walter Larson; Richard Tran Mills; Bhorn-Gustaf Brooks; Auroop R. Ganguly; William Hargrove; et al

    2011-01-01

    From field-scale measurements to global climate simulations and remote sensing, the growing body of very large and long time series Earth science data are increasingly difficult to analyze, visualize, and interpret. Data mining, information theoretic, and machine learning techniques—such as cluster analysis, singular value decomposition, block entropy, Fourier and...

  20. Data mining approach to model the diagnostic service management.

    Science.gov (United States)

    Lee, Sun-Mi; Lee, Ae-Kyung; Park, Il-Su

    2006-01-01

    Korea has a National Health Insurance Program operated by the government-owned National Health Insurance Corporation, and diagnostic services are provided every two years for the insured and their family members. Developing a customer relationship management (CRM) system using data mining technology would be useful to improve the performance of diagnostic service programs. Under these circumstances, this study developed a model for diagnostic service management, taking into account the characteristics of subjects, using a data mining approach. This study could be further used to develop an automated CRM system contributing to an increase in the rate of receiving diagnostic services.

  1. International Conference on Computational Intelligence in Data Mining

    CERN Document Server

    Mohapatra, Durga

    2017-01-01

    The book presents high quality papers from the International Conference on Computational Intelligence in Data Mining (ICCIDM 2016), organized by the School of Computer Engineering, Kalinga Institute of Industrial Technology (KIIT), Bhubaneswar, Odisha, India during December 10-11, 2016. The book disseminates knowledge about innovative, active research directions in the field of data mining, machine and computational intelligence, along with current issues and applications of related topics. The volume aims to explicate and address the difficulties and challenges of the seamless integration of the two core disciplines of computer science.

  2. 4th International conference on Knowledge Discovery and Data Mining

    CERN Document Server

    Knowledge Discovery and Data Mining

    2012-01-01

    The volume includes a set of selected papers extended and revised from the 4th International Conference on Knowledge Discovery and Data Mining, March 1-2, 2011, Macau, China. The volume provides a forum for researchers, educators, engineers, and government officials involved in the general areas of knowledge discovery, data mining and learning to disseminate their latest research results and exchange views on the future research directions of these fields. 108 high-quality papers are included in the volume.

  3. Data Mining for Education Decision Support: A Review

    Directory of Open Access Journals (Sweden)

    Suhirman Suhirman

    2014-12-01

    Full Text Available Higher education management must evaluate institutions on an ongoing basis in order to improve their quality. Doing so requires evaluating various data, information, and knowledge from both inside and outside the institution. Institutions plan to use the collected data more efficiently and to develop tools to collect and direct management information in support of managerial decision making. The collected data could be utilized to evaluate quality, perform analyses and diagnoses, evaluate conformity with the standards and practices of curricula and syllabi, and suggest alternatives in decision processes. Data mining methods are well suited to providing decision support in education environments, by generating and presenting relevant information and knowledge for the quality improvement of education processes. In the educational domain, this information is very useful since it can be used as a basis for investigating and enhancing current educational standards and management. In this paper, a review of data mining for academic decision support in the education field is presented. The paper reviews recent data mining work in the educational field and outlines future research in educational data mining.

  4. Explaining and predicting workplace accidents using data-mining techniques

    International Nuclear Information System (INIS)

    Rivas, T.; Paz, M.; Martin, J.E.; Matias, J.M.; Garcia, J.F.; Taboada, J.

    2011-01-01

    Current research into workplace risk is mainly conducted using conventional descriptive statistics, which, however, fail to properly identify cause-effect relationships and are unable to construct models that could predict accidents. The authors of the present study modelled incidents and accidents in two companies in the mining and construction sectors in order to identify the most important causes of accidents and develop predictive models. Data-mining techniques (decision rules, Bayesian networks, support vector machines and classification trees) were used to model accident and incident data compiled from the mining and construction sectors and obtained in interviews conducted soon after an incident/accident occurred. The results were compared with those for a classical statistical technique (logistic regression), revealing the superiority of decision rules, classification trees and Bayesian networks in predicting and identifying the factors underlying accidents/incidents.

  5. Explaining and predicting workplace accidents using data-mining techniques

    Energy Technology Data Exchange (ETDEWEB)

    Rivas, T., E-mail: trivas@uvigo.e [Dpto. Ingenieria de los Recursos Naturales y Medio Ambiente, E.T.S.I. Minas, University of Vigo, Campus Lagoas, 36310 Vigo (Spain); Paz, M., E-mail: mpaz.minas@gmail.co [Dpto. Ingenieria de los Recursos Naturales y Medio Ambiente, E.T.S.I. Minas, University of Vigo, Campus Lagoas, 36310 Vigo (Spain); Martin, J.E., E-mail: jmartin@cippinternacional.co [CIPP International, S.L. Parque Tecnologico de Asturias, Parcela 43, Oficina 11, 33428 Llanera (Spain); Matias, J.M., E-mail: jmmatias@uvigo.e [Dpto. Estadistica e Investigacion Operativa, E.T.S.I. Minas, University of Vigo, Campus Lagoas, 36310 Vigo (Spain); Garcia, J.F., E-mail: jgarcia@cippinternacional.co [CIPP International, S.L. Parque Tecnologico de Asturias, Parcela 43, Oficina 11, 33428 Llanera (Spain); Taboada, J., E-mail: jtaboada@uvigo.e [Dpto. Ingenieria de los Recursos Naturales y Medio Ambiente, E.T.S.I. Minas, University of Vigo, Campus Lagoas, 36310 Vigo (Spain)

    2011-07-15

    Current research into workplace risk is mainly conducted using conventional descriptive statistics, which, however, fail to properly identify cause-effect relationships and are unable to construct models that could predict accidents. The authors of the present study modelled incidents and accidents in two companies in the mining and construction sectors in order to identify the most important causes of accidents and develop predictive models. Data-mining techniques (decision rules, Bayesian networks, support vector machines and classification trees) were used to model accident and incident data compiled from the mining and construction sectors and obtained in interviews conducted soon after an incident/accident occurred. The results were compared with those for a classical statistical technique (logistic regression), revealing the superiority of decision rules, classification trees and Bayesian networks in predicting and identifying the factors underlying accidents/incidents.

  6. Mining Risk Factors in RFID Baggage Tracking Data

    DEFF Research Database (Denmark)

    Ahmed, Tanvir; Calders, Toon; Pedersen, Torben Bach

    2015-01-01

    and frustration to the passengers. To remedy these problems we propose a detailed methodology for mining risk factors from Radio Frequency Identification (RFID) baggage tracking data. The factors should identify potential issues in the baggage management. However, the baggage tracking data are low level...... and not directly accessible for finding such factors. Moreover, baggage tracking data are highly imbalanced, for example, our experimental data, which is a large real-world data set from the Scandinavian countries, contains only 0.8% mishandled bags. This imbalance presents difficulties to most data mining...... techniques. The paper presents detailed steps for pre-processing the unprocessed raw tracking data for higher-level analysis and handling the imbalance problem. We fragment the data set based on a number of relevant factors and find the best classifier for each of them. The paper reports on a comprehensive...

  7. Mining Social Media and DBpedia Data Using Gephi and R

    Directory of Open Access Journals (Sweden)

    Sadiq HUSSAIN

    2018-04-01

    Full Text Available Big data plays a major role in the field of machine learning and data mining. Extracting meaningful and interesting information from big data is a challenge. The size of the data on social media and Wikipedia is increasing exponentially. Visualizing such huge data is another aspect of big data. Graphs are becoming important for the visualization and modelling of such data. Gephi and R are two important visualization and exploration tools in this field. Using a graph, one may compute modularity, eccentricity, in-degree, out-degree, betweenness centrality, etc. In this paper, we used DBpedia, Facebook and Twitter datasets. We used Gephi and R to look inside the structure of such data and to compare different graph-based statistics by exploring the graphs.
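    To make two of the graph statistics named in this abstract concrete, here is a small sketch that counts in-degree and out-degree from a directed edge list. The edge list and the `degree_counts` helper are hypothetical; the paper itself used Gephi and R, not this code.

```python
from collections import Counter


def degree_counts(edges):
    """Return (in_degree, out_degree) counters for a directed edge list."""
    in_deg = Counter()
    out_deg = Counter()
    for src, dst in edges:
        out_deg[src] += 1  # edge leaves src
        in_deg[dst] += 1   # edge arrives at dst
    return in_deg, out_deg


# Hypothetical "follows" relation between four accounts.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "a"), ("d", "c")]
in_deg, out_deg = degree_counts(edges)
```

    Nodes that never appear as a target, such as "d" here, simply have an in-degree of zero, which `Counter` returns for missing keys.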

  8. Methodologies of Knowledge Discovery from Data and Data Mining Methods in Mechanical Engineering

    Directory of Open Access Journals (Sweden)

    Rogalewicz Michał

    2016-12-01

    Full Text Available The paper contains a review of methodologies for the process of knowledge discovery from data and of the methods of data exploration (Data Mining) that are most frequently used in mechanical engineering. The methodologies contain various scenarios of data exploration, while DM methods are used within their scope. The paper shows the premises for the use of DM methods in industry, as well as their advantages and disadvantages. The development of methodologies of knowledge discovery from data is also presented, along with a classification of the most widespread Data Mining methods, divided by type of task performed. The paper concludes with a presentation of selected Data Mining applications in mechanical engineering.

  9. A survey on Big Data Stream Mining

    African Journals Online (AJOL)

    pc

    2018-03-05

    Mar 5, 2018 ... Big Data can be static on one machine or distributed ... decision making, and process automation. Big data .... Concept Drifting: concept drifting mean the classifier .... transactions generated by a prefix tree structure. EstDec ...

  10. Contrast data mining: concepts, algorithms, and applications

    National Research Council Canada - National Science Library

    Dong, Guozhu; Bailey, James

    2013-01-01

    .... Contrasting involves the comparison of one dataset against another. The datasets may represent data of different time periods, spatial locations, or classes, or they may represent data satisfying different conditions...

  11. Mining Electronic Health Records using Linked Data.

    Science.gov (United States)

    Odgers, David J; Dumontier, Michel

    2015-01-01

    Meaningful Use guidelines have pushed the United States Healthcare System to adopt electronic health record systems (EHRs) at an unprecedented rate. Hospitals and medical centers are providing access to clinical data via clinical data warehouses such as i2b2, or Stanford's STRIDE database. In order to realize the potential of using these data for translational research, clinical data warehouses must be interoperable with standardized health terminologies, biomedical ontologies, and growing networks of Linked Open Data such as Bio2RDF. Applying the principles of Linked Data, we transformed a de-identified version of the STRIDE into a semantic clinical data warehouse containing visits, labs, diagnoses, prescriptions, and annotated clinical notes. We demonstrate the utility of this system through basic cohort selection, phenotypic profiling, and identification of disease genes. This work is significant in that it demonstrates the feasibility of using semantic web technologies to directly exploit existing biomedical ontologies and Linked Open Data.

  12. Is Europe Falling Behind in Data Mining? Copyright’s Impact on Data Mining in Academic Research

    NARCIS (Netherlands)

    Handke, C.; Guibault, L.; Vallbé, J.J.; Schmidt, B.; Dobreva, M.

    2015-01-01

    With the diffusion of digital information technology, data mining (DM) is widely expected to increase the productivity of all kinds of research activities. Based on bibliometric data, we demonstrate that the share of DM-related research articles in all published academic papers has increased

  13. Mining Login Data for Actionable Student Insight

    Science.gov (United States)

    Agnihotri, Lalitha; Aghababyan, Ani; Mojarad, Shirin; Riedesel, Mark; Essa, Alfred

    2015-01-01

    Student login data is a key resource for gaining insight into their learning experience. However, the scale and the complexity of this data necessitate a thorough exploration to identify potential actionable insights, thus rendering it less valuable compared to student achievement data. To compensate for the underestimation of login data…

  14. Data Preparation for Web Mining – A survey

    OpenAIRE

    Amog Rajenderan

    2012-01-01

    An accepted trend is to categorize web mining into three main areas: web content mining, web structure mining and web usage mining. Web content mining involves extracting details/information from the contents of webpages and performing things like knowledge synthesis. Web structure mining involves the usage of graph theory to understand website structure/hierarchy. Web usage mining involves the mining of useful information from things like server logs, to understand what the user does while on the inte...

  15. Radiological data acquisition, investigation and evaluation of mining relics

    International Nuclear Information System (INIS)

    1992-01-01

    Within the scope of a Federal Project, the environmental radioactivity and the radon concentration in buildings caused by mining relics in the new Federal Lands of Germany are investigated. In the first phase of the project, about 8000 relics of former mining were identified by analysing existing documents, categorised, and recorded in a special data bank. Thereby, 'areas of suspicion' of 1500 km², spaciously defined in the beginning, could be reduced to 'areas of investigation' of 250 km², now to be examined in close coordination with the land and district authorities by a programme gradually adapted to the radiological significance of the relics. Experience with site-specific measuring programmes has already been gained through three pilot projects at typical sites of former mining activities. Recommendations of the German Radiation Protection Commission serve for the evaluation of the results. By the measuring programme for radon in buildings of mining and geological predestined regions, more than 25000 buildings of 210 communities have been investigated. The results confirm the expected prevailing influence of the geologic underground on the radon concentration. Extreme values are observed where direct connections additionally exist to mining relics in the ground. (orig./HP) With 11 figs. in annex [de

  16. Data mining and knowledge discovery technologies

    National Research Council Canada - National Science Library

    Taniar, David

    2008-01-01

    "This book presents researchers and practitioners in fields such as knowledge management, information science, Web engineering, and medical informatics, with comprehensive, innovative research on data...

  17. Advances in learning analytics and educational data mining

    NARCIS (Netherlands)

    Vahdat, Mehrnoosh; Ghio, A; Oneto, L.; Anguita, D.; Funk, M.; Rauterberg, G.W.M.

    2015-01-01

    The growing interest in recent years towards Learning Analytics (LA) and Educational Data Mining (EDM) has enabled novel approaches and advancements in educational settings. The wide variety of research and practice in this context has enforced important possibilities and applications from

  18. A Demonstration of Regression False Positive Selection in Data Mining

    Science.gov (United States)

    Pinder, Jonathan P.

    2014-01-01

    Business analytics courses, such as marketing research, data mining, forecasting, and advanced financial modeling, have substantial predictive modeling components. The predictive modeling in these courses requires students to estimate and test many linear regressions. As a result, false positive variable selection ("type I errors") is…
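
    The false-positive effect described in this record can be illustrated with a small simulation. The sketch below is not taken from the article: it assumes pure-noise data, a hypothetical helper `count_false_positives`, and a two-sided 5% test using the approximate critical value 1.984 for 98 degrees of freedom. Each candidate predictor is regressed on the response one at a time, and the slope t-statistic is computed from the sample correlation.

```python
import math
import random


def pearson_r(x, y):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def count_false_positives(n_obs=100, n_predictors=200, t_crit=1.984, seed=0):
    """Count noise predictors that look 'significant' in a simple regression.

    The response and every predictor are independent Gaussian noise, so any
    predictor that passes the t-test is a type I error by construction.
    """
    rng = random.Random(seed)
    y = [rng.gauss(0, 1) for _ in range(n_obs)]
    hits = 0
    for _ in range(n_predictors):
        x = [rng.gauss(0, 1) for _ in range(n_obs)]
        r = pearson_r(x, y)
        # In simple linear regression the slope t-statistic equals
        # r * sqrt((n - 2) / (1 - r^2)).
        t = r * math.sqrt((n_obs - 2) / (1 - r * r))
        if abs(t) > t_crit:
            hits += 1
    return hits


if __name__ == "__main__":
    print(count_false_positives(), "of 200 noise predictors passed the 5% test")
```

    With a 5% test, roughly one in twenty unrelated predictors is expected to pass, which is why screening many regressions without correction tends to select spurious variables.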

  19. Data mining for the identification of metabolic syndrome status.

    Science.gov (United States)

    Worachartcheewan, Apilak; Schaduangrat, Nalini; Prachayasittikul, Virapong; Nantasenamat, Chanin

    2018-01-01

    Metabolic syndrome (MS) is a condition associated with metabolic abnormalities that are characterized by central obesity (e.g. waist circumference or body mass index), hypertension (e.g. systolic or diastolic blood pressure), hyperglycemia (e.g. fasting plasma glucose) and dyslipidemia (e.g. triglyceride and high-density lipoprotein cholesterol). It is also associated with the development of diabetes mellitus (DM) type 2 and cardiovascular disease (CVD). Therefore, the rapid identification of MS is required to prevent the occurrence of such diseases. Herein, we review the utilization of data mining approaches for MS identification. Furthermore, the concept of quantitative population-health relationship (QPHR) is also presented, which can be defined as the elucidation/understanding of the relationship that exists between health parameters and health status. The QPHR modeling uses data mining techniques such as artificial neural network (ANN), support vector machine (SVM), principal component analysis (PCA), decision tree (DT), random forest (RF) and association analysis (AA) for modeling and construction of predictive models for MS characterization. The DT method has been found to outperform other data mining techniques in the identification of MS status. Moreover, the AA technique has proved useful in the discovery of in-depth as well as frequently occurring health parameters that can be used for revealing the rules of MS development. This review presents the potential benefits on the applications of data mining as a rapid identification tool for classifying MS.

  20. An Application of Data Mining Algorithms for Shipbuilding Cost Estimation

    NARCIS (Netherlands)

    Kaluzny, B.L.; Barbici, S.; Berg, G.; Chiomento, R.; Derpanis,D.; Jonsson, U.; Shaw, R.H.A.D.; Smit, M.C.; Ramaroson, F.

    2011-01-01

    This article presents a novel application of known data mining algorithms to the problem of estimating the cost of ship development and construction. The work is a product of North Atlantic Treaty Organization Research and Technology Organization Systems Analysis and Studies 076 Task Group “NATO

  1. BAGEL2 : mining for bacteriocins in genomic data

    NARCIS (Netherlands)

    de Jong, Anne; van Heel, Auke J.; Kok, Jan; Kuipers, Oscar P.

    Mining bacterial genomes for bacteriocins is a challenging task due to the substantial structure and sequence diversity, and generally small sizes, of these antimicrobial peptides. Major progress in the research of antimicrobial peptides and the ever-increasing quantities of genomic data, varying

  2. Archetypal analysis for machine learning and data mining

    DEFF Research Database (Denmark)

    Mørup, Morten; Hansen, Lars Kai

    2012-01-01

    of the observed data. We further demonstrate that the aa model is relevant for feature extraction and dimensionality reduction for a large variety of machine learning problems taken from computer vision, neuroimaging, chemistry, text mining and collaborative filtering leading to highly interpretable...

  3. 78 FR 29055 - State Medicaid Fraud Control Units; Data Mining

    Science.gov (United States)

    2013-05-17

    ... pursue Medicaid provider fraud, we finalize proposals to permit Federal financial participation (FFP) in... general approach to data mining by MFCUs is to give each MFCU the autonomy to choose how to operate its...) to read as follows: Sec. 1007.19 Federal financial participation (FFP). * * * * * (e) * * * (2...

  4. 3D Visual Data Mining: goals and experiences

    DEFF Research Database (Denmark)

    Bøhlen, Michael Hanspeter; Bukauskas, Linas; Eriksen, Poul Svante

    2003-01-01

    , statistical analyses, perceptual and cognitive psychology, and scientific visualization. At the conceptual level we offer perceptual and cognitive insights to guide the information visualization process. We then choose cluster surfaces to exemplify the data mining process, to discuss the tasks involved...

  5. A Data Mining Approach to Modelling of Water Supply Assets

    DEFF Research Database (Denmark)

    Babovic, V.; Drecourt, J.; Keijzer, M.

    2002-01-01

    supply assets are mainly situated underground, and therefore not visible and under the influence of various highly unpredictable forces. This paper proposes the use of advanced data mining methods in order to determine the risks of pipe bursts. For example, analysis of the database of already occurred...

  6. Managing Multiuser Database Buffers Using Data Mining Techniques

    NARCIS (Netherlands)

    Feng, L.; Lu, H.J.

    2004-01-01

    In this paper, we propose a data-mining-based approach to public buffer management for a multiuser database system, where database buffers are organized into two areas – public and private. While the private buffer areas contain pages to be updated by particular users, the public

  7. Highlights of recent articles on data mining in genomics & proteomics

    Science.gov (United States)

    This editorial elaborates on investigations consisting of different “OMICS” technologies and their application to biological sciences. In addition, advantages and recent development of the proteomic, genomic and data mining technologies are discussed. This information will be useful to scientists ...

  8. Recommendation in Higher Education Using Data Mining Techniques

    Science.gov (United States)

    Vialardi, Cesar; Bravo, Javier; Shafti, Leila; Ortigosa, Alvaro

    2009-01-01

    One of the main problems faced by university students is to take the right decision in relation to their academic itinerary based on available information (for example courses, schedules, sections, classrooms and professors). In this context, this work proposes the use of a recommendation system based on data mining techniques to help students to…

  9. Feature extraction for classification in the data mining process

    NARCIS (Netherlands)

    Pechenizkiy, M.; Puuronen, S.; Tsymbal, A.

    2003-01-01

    Dimensionality reduction is a very important step in the data mining process. In this paper, we consider feature extraction for classification tasks as a technique to overcome problems occurring because of "the curse of dimensionality". Three different eigenvector-based feature extraction approaches

  10. Model Validation and Verification of Data Mining from the ...

    African Journals Online (AJOL)

    Michael Horsfall

    In this paper, we seek to present a hybrid method for Model Validation and Verification of Data Mining from the ... This model generally states the numerical value of knowledge .... procedures found in the field of software engineering should be ...

  11. Data mining for the identification of metabolic syndrome status

    Science.gov (United States)

    Worachartcheewan, Apilak; Schaduangrat, Nalini; Prachayasittikul, Virapong; Nantasenamat, Chanin

    2018-01-01

    Metabolic syndrome (MS) is a condition associated with metabolic abnormalities that are characterized by central obesity (e.g. waist circumference or body mass index), hypertension (e.g. systolic or diastolic blood pressure), hyperglycemia (e.g. fasting plasma glucose) and dyslipidemia (e.g. triglyceride and high-density lipoprotein cholesterol). It is also associated with the development of diabetes mellitus (DM) type 2 and cardiovascular disease (CVD). Therefore, the rapid identification of MS is required to prevent the occurrence of such diseases. Herein, we review the utilization of data mining approaches for MS identification. Furthermore, the concept of quantitative population-health relationship (QPHR) is also presented, which can be defined as the elucidation/understanding of the relationship that exists between health parameters and health status. The QPHR modeling uses data mining techniques such as artificial neural network (ANN), support vector machine (SVM), principal component analysis (PCA), decision tree (DT), random forest (RF) and association analysis (AA) for modeling and construction of predictive models for MS characterization. The DT method has been found to outperform other data mining techniques in the identification of MS status. Moreover, the AA technique has proved useful in the discovery of in-depth as well as frequently occurring health parameters that can be used for revealing the rules of MS development. This review presents the potential benefits on the applications of data mining as a rapid identification tool for classifying MS. PMID:29383020

  12. A framework for query optimization to support data mining

    NARCIS (Netherlands)

    S.R. Choenni (Sunil); A.P.J.M. Siebes (Arno)

    1996-01-01

    textabstractIn order to extract knowledge from databases, data mining algorithms heavily query the databases. Inefficient processing of these queries will inevitably have its impact on the performance of these algorithms, making them less valuable. In this paper, we describe an optimization

  13. Applying data mining techniques to improve diagnosis in neonatal jaundice

    Directory of Open Access Journals (Sweden)

    Ferreira Duarte

    2012-12-01

    Background: Hyperbilirubinemia is emerging as an increasingly common problem in newborns due to a decreasing hospital length of stay after birth. Jaundice is the most common disease of the newborn and, although benign in most cases, it can lead to severe neurological consequences if poorly evaluated. In different areas of medicine, data mining has contributed to improving the results obtained with other methodologies. Hence, the aim of this study was to improve the diagnosis of neonatal jaundice with the application of data mining techniques. Methods: This study followed the different phases of the Cross Industry Standard Process for Data Mining model as its methodology. This observational study was performed at the Obstetrics Department of a central hospital (Centro Hospitalar Tâmega e Sousa – EPE), from February to March of 2011. A total of 227 healthy newborn infants with 35 or more weeks of gestation were enrolled in the study. Over 70 variables were collected and analyzed. Also, transcutaneous bilirubin levels were measured from birth to hospital discharge with maximum time intervals of 8 hours between measurements, using a noninvasive bilirubinometer. Different attribute subsets were used to train and test classification models using algorithms included in the Weka data mining software, such as decision trees (J48) and neural networks (multilayer perceptron). The accuracy results were compared with the traditional methods for prediction of hyperbilirubinemia. Results: The application of different classification algorithms to the collected data allowed predicting subsequent hyperbilirubinemia with high accuracy. In particular, at 24 hours of life of newborns, the accuracy for the prediction of hyperbilirubinemia was 89%. The best results were obtained using the following algorithms: naive Bayes, multilayer perceptron and simple logistic. Conclusions: The findings of our study sustain that new approaches, such as data mining, may support

  14. Data mining and the human genome

    Energy Technology Data Exchange (ETDEWEB)

    Abarbanel, Henry [The MITRE Corporation, McLean, VA (US). JASON Program Office; Callan, Curtis [The MITRE Corporation, McLean, VA (US). JASON Program Office; Dally, William [The MITRE Corporation, McLean, VA (US). JASON Program Office; Dyson, Freeman [The MITRE Corporation, McLean, VA (US). JASON Program Office; Hwa, Terence [The MITRE Corporation, McLean, VA (US). JASON Program Office; Koonin, Steven [The MITRE Corporation, McLean, VA (US). JASON Program Office; Levine, Herbert [The MITRE Corporation, McLean, VA (US). JASON Program Office; Rothaus, Oscar [The MITRE Corporation, McLean, VA (US). JASON Program Office; Schwitters, Roy [The MITRE Corporation, McLean, VA (US). JASON Program Office; Stubbs, Christopher [The MITRE Corporation, McLean, VA (US). JASON Program Office; Weinberger, Peter [The MITRE Corporation, McLean, VA (US). JASON Program Office

    2000-01-07

    As genomics research moves from an era of data acquisition to one of both acquisition and interpretation, new methods are required for organizing and prioritizing the data. These methods would allow an initial level of data analysis to be carried out before committing resources to a particular genetic locus. This JASON study sought to delineate the main problems that must be faced in bioinformatics and to identify information technologies that can help to overcome those problems. While the current influx of data greatly exceeds what biologists have experienced in the past, other scientific disciplines and the commercial sector have been handling much larger datasets for many years. Powerful datamining techniques have been developed in other fields that, with appropriate modification, could be applied to the biological sciences.

  15. Statistical and Machine-Learning Data Mining Techniques for Better Predictive Modeling and Analysis of Big Data

    CERN Document Server

    Ratner, Bruce

    2011-01-01

    The second edition of a bestseller, Statistical and Machine-Learning Data Mining: Techniques for Better Predictive Modeling and Analysis of Big Data is still the only book, to date, to distinguish between statistical data mining and machine-learning data mining. The first edition, titled Statistical Modeling and Analysis for Database Marketing: Effective Techniques for Mining Big Data, contained 17 chapters of innovative and practical statistical data mining techniques. In this second edition, renamed to reflect the increased coverage of machine-learning data mining techniques, the author has

  16. Mining on Big Data Using Hadoop MapReduce Model

    Science.gov (United States)

    Salman Ahmed, G.; Bhattacharya, Sweta

    2017-11-01

    Conventional parallel algorithms for frequent itemset mining try to balance load by distributing similar data among nodes. The paper reviews this process by analyzing a critical performance drawback of common parallel frequent itemset mining algorithms. Given a large dataset, the data partitioning strategies of existing solutions suffer high communication and mining overhead caused by redundant transactions transmitted among computing nodes. We address this drawback by developing a data partitioning approach on Hadoop using the MapReduce programming model. The overall objective is to speed up parallel frequent itemset mining on Hadoop clusters. Incorporating a similarity metric and the locality-sensitive hashing technique, the approach places highly similar transactions into the same data partition to improve locality without creating an excessive number of redundant transactions. We implement the approach on a 34-node Hadoop cluster, driven by a wide range of datasets created by the IBM Quest market-basket synthetic data generator. Experiments show that the approach reduces network and computing load by eliminating redundant transactions on Hadoop nodes, and that it considerably outperforms the other models.
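
    The idea of placing similar transactions in the same partition by locality-sensitive hashing can be sketched in a few lines. The abstract does not say which hashing scheme the authors use, so the example below assumes MinHash (a common LSH family for set similarity) and runs in a single process rather than on Hadoop. The function names, the bucket rule and the toy transactions are all hypothetical.

```python
import hashlib
from collections import defaultdict


def minhash_signature(items, num_hashes=2):
    """MinHash signature of a non-empty collection of items.

    For each seeded hash function, keep the smallest hash value over the
    items. Sets with high Jaccard similarity tend to share signature values.
    """
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int(hashlib.md5(f"{seed}:{item}".encode()).hexdigest(), 16)
            for item in items
        ))
    return tuple(signature)


def partition_transactions(transactions, num_partitions=4, num_hashes=2):
    """Assign each transaction to a partition derived from its signature.

    Assumption for illustration: the bucket is the signature sum modulo the
    number of partitions. Identical item sets always land together; similar
    item sets land together more often than dissimilar ones.
    """
    partitions = defaultdict(list)
    for transaction in transactions:
        signature = minhash_signature(transaction, num_hashes)
        partitions[sum(signature) % num_partitions].append(transaction)
    return dict(partitions)


if __name__ == "__main__":
    baskets = [["milk", "bread"], ["bread", "milk"], ["beer", "chips", "salsa"]]
    for bucket, members in sorted(partition_transactions(baskets).items()):
        print(bucket, members)
```

    In a MapReduce setting the bucket number would typically serve as the shuffle key, so that each reducer mines one partition of similar transactions.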

  17. Mining Personal Data Using Smartphones and Wearable Devices: A Survey

    Science.gov (United States)

    Rehman, Muhammad Habib ur; Liew, Chee Sun; Wah, Teh Ying; Shuja, Junaid; Daghighi, Babak

    2015-01-01

    The staggering growth in smartphone and wearable device use has led to a massive scale generation of personal (user-specific) data. To explore, analyze, and extract useful information and knowledge from the deluge of personal data, one has to leverage these devices as the data-mining platforms in ubiquitous, pervasive, and big data environments. This study presents the personal ecosystem where all computational resources, communication facilities, storage and knowledge management systems are available in user proximity. An extensive review on recent literature has been conducted and a detailed taxonomy is presented. The performance evaluation metrics and their empirical evidences are sorted out in this paper. Finally, we have highlighted some future research directions and potentially emerging application areas for personal data mining using smartphones and wearable devices. PMID:25688592

  18. Mining Personal Data Using Smartphones and Wearable Devices: A Survey

    Directory of Open Access Journals (Sweden)

    Muhammad Habib ur Rehman

    2015-02-01

    The staggering growth in smartphone and wearable device use has led to a massive scale generation of personal (user-specific) data. To explore, analyze, and extract useful information and knowledge from the deluge of personal data, one has to leverage these devices as the data-mining platforms in ubiquitous, pervasive, and big data environments. This study presents the personal ecosystem where all computational resources, communication facilities, storage and knowledge management systems are available in user proximity. An extensive review on recent literature has been conducted and a detailed taxonomy is presented. The performance evaluation metrics and their empirical evidences are sorted out in this paper. Finally, we have highlighted some future research directions and potentially emerging application areas for personal data mining using smartphones and wearable devices.

  19. Utilization of data transmission in modern mining

    Energy Technology Data Exchange (ETDEWEB)

    Ostermann, W

    1989-01-01

    On the lowest hierarchical level much effort is being focussed on the automation of subsystems, particularly in the winning area. The missing links in this chain mainly comprise sensors capable of detecting the boundary layer between coal and the surrounding rock. A high potential for rationalization is offered by the computer-backed area control station, which can control and monitor underground operations from a surface vantage point. This has resulted in a significant improvement in the machine utilization rate. Information centres are currently being introduced throughout the industry to serve as the management centre for that particular colliery. The data thus compiled are also processed by the colliery's business computer unit; this would include information from the subsystem personnel, material, operations management, planning and cost controlling. Extensive data stocks are stored and processed in the company's central computer. All computer units, including personal computers networked together at management level, have access to the mass data stores of the other computers. 6 figs.

  20. Design of data warehouse in teaching state based on OLAP and data mining

    Science.gov (United States)

    Zhou, Lijuan; Wu, Minhua; Li, Shuang

    2009-04-01

    Data warehouse and data mining technology is one of the hot topics of information technology research. At present, data warehouse and data mining technology is widely applied in commerce, the financial industry, enterprise production and marketing, but relatively little in education. Over the years, teaching and management in colleges and universities have accumulated large amounts of data that cannot be used effectively. In the light of the social needs of university development and the current state of data management, it is particularly important to establish a data warehouse for the teaching state of universities, to make better use of existing data and, on that basis, to carry out higher-level processing, namely data mining. In this paper, starting from decision-making needs, we design the structure of a data warehouse for the university teaching state, then create a data warehouse model through structure design and data extraction, loading and conversion, and finally use an association rule mining algorithm for data mining to obtain effective results that are applied in practice. The data analysis and mining yield much valuable information, which can be used to guide teaching management, thereby improving the quality of teaching, promoting teaching devotion in universities and enhancing teaching infrastructure. At the same time it can provide detailed, multi-dimensional information for university assessment and higher education research.
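
    Association rule mining, mentioned in this record, rests on two standard measures: the support of an itemset and the confidence of a rule. The sketch below is only an illustration of those definitions. The abstract does not give the authors' algorithm or data, so the function names and the toy teaching records are hypothetical.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    wanted = set(itemset)
    matching = sum(1 for t in transactions if wanted <= set(t))
    return matching / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent.

    Defined as support(antecedent and consequent) / support(antecedent);
    returns 0.0 when the antecedent never occurs.
    """
    antecedent_support = support(transactions, antecedent)
    if antecedent_support == 0:
        return 0.0
    joint = support(transactions, set(antecedent) | set(consequent))
    return joint / antecedent_support


if __name__ == "__main__":
    # Hypothetical records: one set of observed attributes per course section.
    records = [
        {"small_class", "high_rating"},
        {"small_class", "high_rating"},
        {"small_class"},
        {"high_rating"},
    ]
    print(support(records, {"small_class", "high_rating"}))
    print(confidence(records, {"small_class"}, {"high_rating"}))
```

    A full miner such as Apriori would enumerate candidate itemsets and keep only the rules whose support and confidence exceed chosen thresholds.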

  1. Developing and Implementing the Data Mining Algorithms in RAVEN

    International Nuclear Information System (INIS)

    Sen, Ramazan Sonat; Maljovec, Daniel Patrick; Alfonsi, Andrea; Rabiti, Cristian

    2015-01-01

    The RAVEN code is becoming a comprehensive tool to perform probabilistic risk assessment, uncertainty quantification, and verification and validation. The RAVEN code is being developed to support many programs and to provide a set of methodologies and algorithms for advanced analysis. Scientific computer codes can generate enormous amounts of data. To post-process and analyze such data might, in some cases, take longer than the initial software runtime. Data mining algorithms/methods help in recognizing and understanding patterns in the data, and thus discover knowledge in databases. The methodologies used in the dynamic probabilistic risk assessment or in uncertainty and error quantification analysis couple system/physics codes with simulation controller codes, such as RAVEN. RAVEN introduces both deterministic and stochastic elements into the simulation while the system/physics code models the dynamics deterministically. A typical analysis is performed by sampling values of a set of parameters. A major challenge in using dynamic probabilistic risk assessment or uncertainty and error quantification analysis for a complex system is to analyze the large number of scenarios generated. Data mining techniques are typically used to better organize and understand data, i.e. recognizing patterns in the data. This report focuses on development and implementation of Application Programming Interfaces (APIs) for different data mining algorithms, and the application of these algorithms to different databases.

  2. Developing and Implementing the Data Mining Algorithms in RAVEN

    Energy Technology Data Exchange (ETDEWEB)

    Sen, Ramazan Sonat [Idaho National Lab. (INL), Idaho Falls, ID (United States); Maljovec, Daniel Patrick [Idaho National Lab. (INL), Idaho Falls, ID (United States); Alfonsi, Andrea [Idaho National Lab. (INL), Idaho Falls, ID (United States); Rabiti, Cristian [Idaho National Lab. (INL), Idaho Falls, ID (United States)

    2015-09-01

    The RAVEN code is becoming a comprehensive tool to perform probabilistic risk assessment, uncertainty quantification, and verification and validation. The RAVEN code is being developed to support many programs and to provide a set of methodologies and algorithms for advanced analysis. Scientific computer codes can generate enormous amounts of data. To post-process and analyze such data might, in some cases, take longer than the initial software runtime. Data mining algorithms/methods help in recognizing and understanding patterns in the data, and thus discover knowledge in databases. The methodologies used in the dynamic probabilistic risk assessment or in uncertainty and error quantification analysis couple system/physics codes with simulation controller codes, such as RAVEN. RAVEN introduces both deterministic and stochastic elements into the simulation while the system/physics code models the dynamics deterministically. A typical analysis is performed by sampling values of a set of parameters. A major challenge in using dynamic probabilistic risk assessment or uncertainty and error quantification analysis for a complex system is to analyze the large number of scenarios generated. Data mining techniques are typically used to better organize and understand data, i.e. recognizing patterns in the data. This report focuses on development and implementation of Application Programming Interfaces (APIs) for different data mining algorithms, and the application of these algorithms to different databases.

  3. Data mining application in industrial energy audit for lighting

    Energy Technology Data Exchange (ETDEWEB)

    Maricar, N.M.; Kim, G.C.; Jamal, N. [Kolej Univ., Melaka (Malaysia). Faculty of Electrical Engineering

    2005-07-01

    A data mining application for lighting energy audits at industrial sites was presented. Data collection was based on the parameters needed for the analysis part of the audit. Data collection included the activity for which the room was used; its dimensions; light level readings in lux; the number of luminaires; the number of lamps per luminaire; lamp fixtures; and lamp wattage. The lumen method was used to calculate the recommended number of luminaires in the room. The number was then compared with the number of luminaires in the existing system. The installed load efficacy ratio (ILER) was then used to determine proper retrofit action to maximize energy usage. The difference between the calculated lux and the standard lux was used to create data subsets. A data mining algorithm was used to determine that the ILER plays an important role in calculating the efficiency of lighting systems. It was also concluded that the method can be used to minimize the time needed to analyze large amounts of lighting data. The results of case studies were also used to show that the combined data mining algorithm provided accurate assessments using existing calculated data. 7 refs., 8 tabs., 5 figs.
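
    The two audit calculations named in this record can be written out as a short sketch. It assumes the usual textbook form of the lumen method (required luminaires = target illuminance × area / (lumens per luminaire × utilization factor × maintenance factor)) and the common definition of ILER as actual installed load efficacy divided by a target efficacy. The abstract does not give the authors' exact formulas or input values, so the function names, parameters and numbers below are illustrative assumptions.

```python
import math


def luminaires_required(target_lux, area_m2, lumens_per_lamp,
                        lamps_per_luminaire, utilization_factor,
                        maintenance_factor):
    """Lumen method: luminaires needed to reach `target_lux` on the work plane."""
    lumens_per_luminaire = lumens_per_lamp * lamps_per_luminaire
    useful_lumens = lumens_per_luminaire * utilization_factor * maintenance_factor
    # Round up, because a fraction of a luminaire cannot be installed.
    return math.ceil(target_lux * area_m2 / useful_lumens)


def installed_load_efficacy_ratio(measured_lux, area_m2, circuit_watts,
                                  target_efficacy):
    """ILER: actual installed load efficacy relative to a target value.

    Actual efficacy is measured illuminance times floor area per circuit
    watt; `target_efficacy` is an assumed reference in the same units.
    """
    actual_efficacy = measured_lux * area_m2 / circuit_watts
    return actual_efficacy / target_efficacy


if __name__ == "__main__":
    # Hypothetical room: 50 m2, 300 lux target, twin-lamp fittings.
    print(luminaires_required(300, 50, 3000, 2, 0.5, 0.8))
    print(installed_load_efficacy_ratio(250, 50, 1000, 25))
```

    Comparing the computed luminaire count with the installed count, and checking how far the ILER falls below 1, mirrors the comparison steps the abstract describes.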

  4. Data mining for geoinformatics methods and applications

    CERN Document Server

    Cervone, Guido; Waters, Nigel

    2013-01-01

    Presents a series of geoinformatic techniques and methodologies to solve real world problems of important societal value. Provides geoinformatic algorithms necessary to extract knowledge from massive amounts of geographically distributed data. Bridges fields together such as GIS, statistics, machine learning, remote sensing, natural hazards, earth and atmospheric sciences.

  5. Process mining for electronic data interchange

    NARCIS (Netherlands)

    Engel, R.; Krathu, W.; Zapletal, M.; Pichler, C.; Aalst, van der W.M.P.; Werthner, H.; Huemer, C.; Setzer, T.

    2011-01-01

    Choreography modeling and service integration received a lot of attention in the last decade. However, most real-world implementations of inter-organizational systems are still realized by traditional Electronic Data Interchange (EDI) standards. In traditional EDI standards, the notion of process or

  6. Mining survey data for SWOT analysis

    OpenAIRE

    Phadermrod, Boonyarat

    2016-01-01

    Strengths, Weaknesses, Opportunities and Threats (SWOT) analysis is one of the most important tools for strategic planning. The traditional method of conducting SWOT analysis does not prioritize and is likely to hold subjective views that may result in an improper strategic action. Accordingly, this research exploits Importance-Performance Analysis (IPA), a technique for measuring customers’ satisfaction based on survey data, to systematically generate prioritized SWOT factors based on custom...

  7. Extracting software static defect models using data mining

    Directory of Open Access Journals (Sweden)

    Ahmed H. Yousef

    2015-03-01

    Large software projects are subject to quality risks of having defective modules that will cause failures during software execution. Several software repositories contain source code of large projects that are composed of many modules. These software repositories include data for the software metrics of these modules and the defective state of each module. In this paper, a data mining approach is used to show the attributes that predict the defective state of software modules. A software solution architecture is proposed to convert the extracted knowledge into data mining models that can be integrated with the current software project metrics and bugs data in order to enhance the prediction. The results show better prediction capabilities when all the algorithms are combined using weighted votes. When only one individual algorithm is used, the Naïve Bayes algorithm has the best results, followed by the Neural Network and the Decision Trees algorithms.
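
    The abstract reports that combining algorithms by weighted votes gave the best predictions but does not state the weighting scheme. A minimal sketch of weighted voting, with made-up classifier names, labels and weights, might look like this:

```python
from collections import defaultdict


def weighted_vote(predictions, weights):
    """Return the label whose supporting classifiers carry the largest total weight.

    predictions: {classifier_name: predicted_label}
    weights:     {classifier_name: weight}  (hypothetical, e.g. validation accuracy)
    """
    totals = defaultdict(float)
    for name, label in predictions.items():
        totals[label] += weights[name]
    return max(totals, key=totals.get)


# Illustrative values only; the paper's actual weights are not given in the abstract.
predictions = {"naive_bayes": "defective", "neural_net": "clean", "decision_tree": "clean"}
weights = {"naive_bayes": 0.9, "neural_net": 0.4, "decision_tree": 0.4}
print(weighted_vote(predictions, weights))
```

    With these assumed weights the single strong classifier outvotes the two weaker ones, which is the behaviour that distinguishes weighted voting from a simple majority.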

  8. Knowledge Discovery and Data Mining in Iran's Climatic Researches

    Science.gov (United States)

    Karimi, Mostafa

    2013-04-01

    Advances in measurement technology and data collection have made databases ever larger, and large databases require powerful analysis tools. The iterative process of acquiring knowledge from processed data is carried out in various forms in all scientific fields. However, when the data volume is large, traditional methods cannot cope with many of the problems. In recent years, the use of databases in various scientific fields, especially atmospheric databases in climatology, has expanded. In addition, the growing amount of data generated by climate models makes it a challenge to analyze them and extract hidden patterns and knowledge. The approach taken to this problem in recent years uses the process of knowledge discovery and data mining techniques, drawing on concepts from machine learning, artificial intelligence and expert (professional) systems. Data mining is an analytical process for mining massive volumes of data. The ultimate goal of data mining is access to information and, finally, knowledge. Climatology is a branch of science that uses varied and massive volumes of data. The goal of climate data mining is to obtain information from varied and massive atmospheric and non-atmospheric data. In fact, knowledge discovery performs these activities in a logical, predetermined and almost automatic process. The goal of this research is to study the use of knowledge discovery and data mining techniques in Iranian climate research. To achieve this goal, the study applies content (descriptive) analysis and classifies the research by method and by issue. The results show that in climatic research on Iran, clustering (k-means and Ward's method) is the most widely applied technique, and precipitation and atmospheric circulation patterns are the most frequently addressed issues. Although several studies on geography and climate issues have used statistical techniques such as clustering and pattern extraction, due to the nature of statistics and data mining one cannot say for

  9. 2nd International Conference on Soft Computing and Data Mining

    CERN Document Server

    Ghazali, Rozaida; Nawi, Nazri; Deris, Mustafa

    2017-01-01

    This book provides a comprehensive introduction and practical look at the concepts and techniques readers need to get the most out of their data in real-world, large-scale data mining projects. It also guides readers through the data-analytic thinking necessary for extracting useful knowledge and business value from the data. The book is based on the Soft Computing and Data Mining (SCDM-16) conference, which was held in Bandung, Indonesia on August 18th–20th 2016 to discuss the state of the art in soft computing techniques, and offer participants sufficient knowledge to tackle a wide range of complex systems. The scope of the conference is reflected in the book, which presents a balance of soft computing techniques and data mining approaches. The two constituents are introduced to the reader systematically and brought together using different combinations of applications and practices. It offers engineers, data analysts, practitioners, scientists and managers the insights into the concepts, tools and techni...

  10. Statistical and Visualization Data Mining Tools for Foundry Production

    Directory of Open Access Journals (Sweden)

    M. Perzyk

    2007-07-01

    In recent years a rapid development of a new, interdisciplinary knowledge area, called data mining, has been observed. Its main task is extracting useful information from previously collected large amounts of data. The main possibilities and potential applications of data mining in manufacturing industry are characterized. The main types of data mining techniques are briefly discussed, including statistical, artificial intelligence, database and visualization tools. The statistical methods and visualization methods are presented in more detail, showing their general possibilities, advantages as well as characteristic examples of applications in foundry production. Results of the author’s research are presented, aimed at validation of selected statistical tools which can be easily and effectively used in manufacturing industry. A performance analysis of ANOVA and contingency-table-based methods, dedicated to determination of the most significant process parameters as well as to detection of possible interactions among them, has been made. Several numerical tests have been performed using simulated data sets with assumed hidden relationships, as well as some real data, related to the strength of ductile cast iron, collected in a foundry. It is concluded that the statistical methods offer relatively easy and fairly reliable tools for extraction of that type of knowledge about foundry manufacturing processes. However, further research is needed, aimed at explanation of some imperfections of the investigated tools as well as assessment of their validity for more complex tasks.
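
    The abstract does not say which statistic its contingency-table methods use. As one common choice, and purely as an assumption, the sketch below computes Pearson's chi-square statistic for a table of counts; the table itself is invented for illustration (for example, a process parameter level against a casting quality class).

```python
def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for a contingency table.

    table: list of rows, each a list of observed counts.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n  # count expected under independence
            stat += (observed - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, dof


# Hypothetical counts: rows = low/high parameter setting, columns = weak/strong castings.
stat, dof = chi_square([[30, 10], [10, 30]])
print(stat, dof)
```

    A large statistic relative to the chi-square distribution with `dof` degrees of freedom would flag the parameter as significant; the significance test itself is left out of this sketch.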

  11. Building a Classification Model for Enrollment In Higher Educational Courses using Data Mining Techniques

    OpenAIRE

    Saini, Priyanka

    2014-01-01

    Data mining is the process of extracting useful patterns from huge databases, and many data mining techniques are used for mining these patterns. Recently, one of the remarkable facts in higher educational institutes is the rapid growth of data, and this educational data is expanding quickly without any advantage to the educational management. The main aim of the management is to refine the education standard; therefore by applying the various data mining techniques on this data one ca...

  12. Data Mining in Education : A Review on the Knowledge Discovery Perspective

    OpenAIRE

    Pratiyush Guleria; Manu Sood

    2014-01-01

    Knowledge Discovery in Databases is the process of finding knowledge in massive amount of data where data mining is the core of this process. Data mining can be used to mine understandable meaningful patterns from large databases and these patterns may then be converted into knowledge. Data mining is the process of extracting the information and patterns derived by the KDD process which helps in crucial decision-making. Data mining works with data warehouse and...

  13. Distributed video data fusion and mining

    Science.gov (United States)

    Chang, Edward Y.; Wang, Yuan-Fang; Rodoplu, Volkan

    2004-09-01

    This paper presents an event sensing paradigm for intelligent event-analysis in a wireless, ad hoc, multi-camera, video surveillance system. In particular, we present statistical methods that we have developed to support three aspects of event sensing: 1) energy-efficient, resource-conserving, and robust sensor data fusion and analysis, 2) intelligent event modeling and recognition, and 3) rapid deployment, dynamic configuration, and continuous operation of the camera networks. We outline our preliminary results, and discuss future directions that research might take.

  14. Data Mining for ISHM of Liquid Rocket Propulsion Status Update

    Science.gov (United States)

    Srivastava, Ashok; Schwabacher, Mark; Oza, Nijunj; Martin, Rodney; Watson, Richard; Matthews, Bryan

    2006-01-01

    This document consists of presentation slides that review the current status of data mining to support the work with the Integrated Systems Health Management (ISHM) for the systems associated with Liquid Rocket Propulsion. The aim of this project is to have test stand data from Rocketdyne to design algorithms that will aid in the early detection of impending failures during operation. These methods will be extended and improved for future platforms (i.e., CEV/CLV).

  15. DATA MINING IN HIGHER EDUCATION : UNIVERSITY STUDENT DROPOUT CASE STUDY

    OpenAIRE

    Ghadeer S. Abu-Oda; Alaa M. El-Halees

    2015-01-01

    In this paper, we apply different data mining approaches for the purpose of examining and predicting students’ dropouts through their university programs. For the subject of the study we select a total of 1290 records of computer science students who graduated from ALAQSA University between 2005 and 2011. The collected data included student study history and transcript for courses taught in the first two years of computer science major in addition to student GPA, high school average ...

  16. Using Data Mining to Predict Possible Future Depression Cases

    OpenAIRE

    Daimi, Kevin; Banitaan, Shadi

    2014-01-01

    Depression is a disorder characterized by misery and gloominess felt over a period of time. Some symptoms of depression overlap with somatic illnesses implying considerable difficulty in diagnosing it. This paper contributes to its diagnosis through the application of data mining, namely classification, to predict patients who will most likely develop depression or are currently suffering from depression. Synthetic data is used for this study. To acquire the results, the popular suite of mach...

  17. Collaborative mining and transfer learning for relational data

    Science.gov (United States)

    Levchuk, Georgiy; Eslami, Mohammed

    2015-06-01

    Many real-world problems, including human knowledge, communication, biological, and cyber network analysis, deal with data entities for which the essential information is contained in the relations among those entities. Such data must be modeled and analyzed as graphs, with attributes on both objects and relations that encode and differentiate their semantics. Traditional data mining algorithms were originally designed for analyzing discrete objects for which a set of features can be defined, and thus cannot be easily adapted to deal with graph data. This gave rise to the relational data mining field of research, of which graph pattern learning is a key sub-domain [11]. In this paper, we describe a model for learning graph patterns in a collaborative distributed manner. Distributed pattern learning is challenging due to dependencies between the nodes and relations in the graph, and variability across graph instances. We present three algorithms that trade off the benefits of parallelization and data aggregation, compare their performance to centralized graph learning, and discuss individual benefits and weaknesses of each model. The presented algorithms are designed for linear speedup in distributed computing environments, and learn graph patterns that are both closer to ground truth and provide higher detection rates than the centralized mining algorithm.

  18. Mining gene expression data of multiple sclerosis.

    Directory of Open Access Journals (Sweden)

    Pi Guo

    Microarray produces a large amount of gene expression data, containing various biological implications. The challenge is to detect a panel of discriminative genes associated with disease. This study proposed a robust classification model for gene selection using gene expression data, and performed an analysis to identify disease-related genes using multiple sclerosis as an example. Gene expression profiles based on the transcriptome of peripheral blood mononuclear cells from a total of 44 samples from 26 multiple sclerosis patients and 18 individuals with other neurological diseases (control) were analyzed. Feature selection algorithms including Support Vector Machine based on Recursive Feature Elimination, Receiver Operating Characteristic Curve, and Boruta algorithms were jointly performed to select candidate genes associating with multiple sclerosis. Multiple classification models categorized samples into two different groups based on the identified genes. Models' performance was evaluated using cross-validation methods, and an optimal classifier for gene selection was determined. An overlapping feature set was identified consisting of 8 genes that were differentially expressed between the two phenotype groups. The genes were significantly associated with the pathways of apoptosis and cytokine-cytokine receptor interaction. TNFSF10 was significantly associated with multiple sclerosis. A Support Vector Machine model was established based on the featured genes and gave a practical accuracy of ∼86%. This binary classification model also outperformed the other models in terms of Sensitivity, Specificity and F1 score. The combined analytical framework integrating feature ranking algorithms and Support Vector Machine model could be used for selecting genes for other diseases.

  19. Data mining for gravitationally lensed quasars

    Science.gov (United States)

    Agnello, Adriano; Kelly, Brandon C.; Treu, Tommaso; Marshall, Philip J.

    2015-04-01

    Gravitationally lensed quasars are brighter than their unlensed counterparts and produce images with distinctive morphological signatures. Past searches and target-selection algorithms, in particular the Sloan Quasar Lens Search (SQLS), have relied on basic morphological criteria, which were applied to samples of bright, spectroscopically confirmed quasars. The SQLS techniques are not sufficient for searching into new surveys (e.g. DES, PS1, LSST), because spectroscopic information is not readily available and the large data volume requires higher purity in target/candidate selection. We carry out a systematic exploration of machine-learning techniques and demonstrate that a two-step strategy can be highly effective. In the first step, we use catalogue-level information (griz+WISE magnitudes, second moments) to pre-select targets, using artificial neural networks. The accepted targets are then inspected with pixel-by-pixel pattern recognition algorithms (gradient-boosted trees), to form a final set of candidates. The results from this procedure can be used to further refine the simpler SQLS algorithms, with a twofold (or threefold) gain in purity and the same (or 80 per cent) completeness at target-selection stage, or a purity of 70 per cent and a completeness of 60 per cent after the candidate-selection step. Simpler photometric searches in griz+WISE based on colour cuts would provide samples with 7 per cent purity or less. Our technique is extremely fast, as a list of candidates can be obtained from a Stage III experiment (e.g. DES catalogue/data base) in a few CPU hours. The techniques are easily extendable to Stage IV experiments like LSST with the addition of time domain information.
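
    The abstract compares selection strategies by purity and completeness. Assuming the usual definitions (purity is the fraction of selected objects that are true lenses; completeness is the fraction of true lenses that were selected), the two figures can be computed as in this sketch. The object IDs are invented for illustration and are not survey data.

```python
def purity_completeness(selected, true_lenses):
    """Return (purity, completeness) for a candidate selection.

    selected:    iterable of object IDs picked by the selection step
    true_lenses: iterable of object IDs known to be lensed quasars
    """
    selected, true_lenses = set(selected), set(true_lenses)
    true_positives = len(selected & true_lenses)
    purity = true_positives / len(selected) if selected else 0.0
    completeness = true_positives / len(true_lenses) if true_lenses else 0.0
    return purity, completeness


# Toy example: 10 candidates selected, 7 of them real; 14 real lenses exist overall.
selected_ids = range(10)
lens_ids = list(range(7)) + list(range(100, 107))
purity, completeness = purity_completeness(selected_ids, lens_ids)
print(purity, completeness)
```

    Tightening a selection typically raises purity at the cost of completeness, which is the trade-off the quoted percentages describe.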

  20. An Overview on Data Mining of Nighttime Light Remote Sensing

    Directory of Open Access Journals (Sweden)

    LI Deren

    2015-06-01

    When observing the Earth from above at night, it is clear that human settlements and major economic regions emit glorious light. On cloud-free nights, some remote sensing satellites can record visible radiance sources, including city lights, fishing boat lights and fires, and these nighttime cloud-free images are remotely sensed nighttime light images. Different from daytime remote sensing, nighttime light remote sensing provides a unique perspective on human social activities, and thus it has been widely used for spatial data mining in socioeconomic domains. Historically, research on nighttime light remote sensing mostly focused on urban land cover and urban expansion mapping using DMSP/OLS imagery, but nighttime light images are not the only remote sensing source for this work. Through decades of development of nighttime light products, nighttime light remote sensing applications have been extended to numerous scientific study domains such as econometrics, poverty estimation, light pollution, fishery and armed conflict. Among the application cases, it is surprising to see that Gross Domestic Product (GDP) data can be corrected using nighttime light data, and it is interesting to see that the mechanisms of several diseases can be revealed by nighttime light images, while nighttime light is the only remote sensing source for such work. As nighttime light remote sensing has numerous applications, it is important to summarize them and the associated data mining fields. This paper first introduces the major satellite platforms and sensors for observing nighttime light. It then summarizes the progress of nighttime light remote sensing data mining in socioeconomic parameter estimation, urbanization monitoring, important event evaluation, environmental and health effects, fishery dynamic mapping, epidemiological research and natural gas flaring monitoring.
Finally, future

  1. Using Advanced Data Mining And Integration In Environmental Prediction Scenarios

    Directory of Open Access Journals (Sweden)

    Habala Ondrej

    2012-01-01

    We present one of the meteorological and hydrological experiments performed in the FP7 project ADMIRE. It serves as an experimental platform for hydrologists, and we have used it also as a testing platform for a suite of advanced data integration and data mining (DMI) tools, developed within ADMIRE. The idea of ADMIRE is to develop an advanced DMI platform accessible even to users who are not familiar with data mining techniques. To this end, we have designed a novel DMI architecture, supported by a set of software tools, managed by DMI process descriptions written in a specialized high-level DMI language called DISPEL, and controlled via several different user interfaces, each performing a different set of tasks and targeting a different user group.

  2. DATA MINING IN EDUCATION: CURRENT STATE AND PERSPECTIVES OF DEVELOPMENT

    Directory of Open Access Journals (Sweden)

    Yurii O. Kovalchuk

    2016-01-01

    The main tasks (classification and regression, association rules, clustering) and the basic principles of Data Mining algorithms are considered in the context of their use for a variety of research in the field of education, which is the subject of a relatively new independent direction, Educational Data Mining. The findings about the most popular topics of research within this area as well as the perspectives of its development are presented. The material is illustrated by simple examples. This article is intended for readers who are engaged in research in the field of education at various levels, especially those involved in the use of e-learning systems, but who have little familiarity with this area of data analysis.

  3. Identifying Drug–Drug Interactions by Data Mining

    DEFF Research Database (Denmark)

    Hansen, Peter Wæde; Clemmensen, Line Katrine Harder; Sehested, Thomas S.G.

    2016-01-01

    Background—Knowledge about drug–drug interactions commonly arises from preclinical trials, from adverse drug reports, or based on knowledge of mechanisms of action. Our aim was to investigate whether drug–drug interactions were discoverable without prior hypotheses using data mining. We focused...... registries. Additionally, we discovered a few potentially novel interactions. This opens up for the use of data mining to discover unknown drug–drug interactions in cardiovascular medicine....... on warfarin–drug interactions as the prototype. Methods and Results—We analyzed altered prothrombin time (measured as international normalized ratio [INR]) after initiation of a novel prescription in previously INR-stable warfarin-treated patients with nonvalvular atrial fibrillation. Data sets were retrieved...

  4. Data mining and knowledge discovery for big data methodologies, challenge and opportunities

    CERN Document Server

    2014-01-01

    The field of data mining has made significant and far-reaching advances over the past three decades.  Because of its potential power for solving complex problems, data mining has been successfully applied to diverse areas such as business, engineering, social media, and biological science. Many of these applications search for patterns in complex structural information. In biomedicine for example, modeling complex biological systems requires linking knowledge across many levels of science, from genes to disease.  Further, the data characteristics of the problems have also grown from static to dynamic and spatiotemporal, complete to incomplete, and centralized to distributed, and grow in their scope and size (this is known as big data). The effective integration of big data for decision-making also requires privacy preservation. The contributions to this monograph summarize the advances of data mining in the respective fields. This volume consists of nine chapters that address subjects ranging from mining da...

  5. Kernel Methods for Mining Instance Data in Ontologies

    Science.gov (United States)

    Bloehdorn, Stephan; Sure, York

    The number of ontologies and the amount of meta data available on the Web are constantly growing. The successful application of machine learning techniques for learning of ontologies from textual data, i.e. mining for the Semantic Web, contributes to this trend. However, no principled approaches exist so far for mining from the Semantic Web. We investigate how machine learning algorithms can be made amenable for directly taking advantage of the rich knowledge expressed in ontologies and associated instance data. Kernel methods have been successfully employed in various learning tasks and provide a clean framework for interfacing between non-vectorial data and machine learning algorithms. In this spirit, we express the problem of mining instances in ontologies as the problem of defining valid corresponding kernels. We present a principled framework for designing such kernels by means of decomposing the kernel computation into specialized kernels for selected characteristics of an ontology which can be flexibly assembled and tuned. Initial experiments on real world Semantic Web data show promising results and demonstrate the usefulness of our approach.

  6. Using data mining techniques to characterize participation in observational studies.

    Science.gov (United States)

    Linden, Ariel; Yarnold, Paul R

    2016-12-01

    Data mining techniques are gaining in popularity among health researchers for an array of purposes, such as improving diagnostic accuracy, identifying high-risk patients and extracting concepts from unstructured data. In this paper, we describe how these techniques can be applied to another area in the health research domain: identifying characteristics of individuals who do and do not choose to participate in observational studies. In contrast to randomized studies where individuals have no control over their treatment assignment, participants in observational studies self-select into the treatment arm and therefore have the potential to differ in their characteristics from those who elect not to participate. These differences may explain part, or all, of the difference in the observed outcome, making it crucial to assess whether there is differential participation based on observed characteristics. As compared to traditional approaches to this assessment, data mining offers a more precise understanding of these differences. To describe and illustrate the application of data mining in this domain, we use data from a primary care-based medical home pilot programme and compare the performance of commonly used classification approaches - logistic regression, support vector machines, random forests and classification tree analysis (CTA) - in correctly classifying participants and non-participants. We find that CTA is substantially more accurate than the other models. Moreover, unlike the other models, CTA offers transparency in its computational approach, ease of interpretation via the decision rules produced and provides statistical results familiar to health researchers. Beyond their application to research, data mining techniques could help administrators to identify new candidates for participation who may most benefit from the intervention. © 2016 John Wiley & Sons, Ltd.

  7. Data Mining and Optimization Tools for Developing Engine Parameters Tools

    Science.gov (United States)

    Dhawan, Atam P.

    1998-01-01

    This project was awarded for understanding the problem and developing a plan for Data Mining tools for use in designing and implementing an Engine Condition Monitoring System. Tricia Erhardt and I studied the problem domain for developing an Engine Condition Monitoring system using the sparse and non-standardized datasets to be available through a consortium at NASA Lewis Research Center. We visited NASA three times to discuss additional issues related to the dataset, which was not made available to us. We discussed and developed a general framework of data mining and optimization tools to extract useful information from sparse and non-standard datasets. These discussions led to the training of Tricia Erhardt to develop Genetic Algorithm based search programs which were written in C++ and used to demonstrate the capability of the GA algorithm in searching for an optimal solution in noisy datasets. From the study and discussion with NASA LeRC personnel, we then prepared a proposal, which is being submitted to NASA for future work on the development of data mining algorithms for engine condition monitoring. The proposed set of algorithms uses wavelet processing for creating a multi-resolution pyramid of the data for GA based multi-resolution optimal search.

  8. SURVEY ON CRIME ANALYSIS AND PREDICTION USING DATA MINING TECHNIQUES

    Directory of Open Access Journals (Sweden)

    H Benjamin Fredrick David

    2017-04-01

    Data mining is the procedure of evaluating and examining large pre-existing databases in order to generate new information which may be essential to the organization. The new information is predicted from the existing datasets. Many approaches for analysis and prediction in data mining have been developed, but very few efforts have been made in the criminology field, and very few have compared the information all these approaches produce. Police stations and other similar criminal justice agencies hold many large databases of information which can be used to predict or analyze criminal movements and criminal activity in society. Criminals can also be predicted based on the crime data. The main aim of this work is to survey the supervised learning and unsupervised learning techniques that have been applied towards criminal identification. This paper presents the survey on crime analysis and crime prediction using several data mining techniques.

  9. Improving clinical decision support using data mining techniques

    Science.gov (United States)

    Burn-Thornton, Kath E.; Thorpe, Simon I.

    1999-02-01

    Physicians, in their ever-demanding jobs, are looking to decision support systems for aid in clinical diagnosis. However, clinical decision support systems need to be of sufficiently high accuracy that they help, rather than hinder, the physician in his/her diagnosis. Decision support systems with patient state determination accuracies greater than 80 percent are generally perceived to be sufficiently accurate to fulfill the role of helping the physician. We have previously shown that data mining techniques have the potential to provide the underpinning technology for clinical decision support systems. In this paper, an extension of the work in reference 2, we describe how changes in data mining methodologies, for the analysis of 12-lead ECG data, improve the accuracy with which data mining algorithms determine which patients are suffering from heart disease. We show that the accuracy of patient state prediction, for all the algorithms which we investigated, can be increased by up to 6 percent, using the combination of appropriate test training ratios and 5-fold cross-validation. The use of cross-validation greater than 5-fold appears to reduce the improvement in algorithm classification accuracy gained by the use of this validation method. The accuracy of 84 percent in patient state predictions, obtained using the algorithm OCI, suggests that this algorithm will be capable of providing the required accuracy for clinical decision support systems.
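
    The abstract credits 5-fold cross-validation with part of the accuracy gain but does not describe how the folds were built. The sketch below shows plain k-fold cross-validation with contiguous, unshuffled folds, which is an assumption made here for simplicity. The majority-class "model" and the labels are placeholders and have nothing to do with the ECG data or the algorithms in the paper.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_indices, test_indices) for k contiguous folds (no shuffling)."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


def cross_val_accuracy(samples, labels, fit, predict, k=5):
    """Mean test accuracy over k folds; fit and predict are caller-supplied callables."""
    scores = []
    for train, test in k_fold_indices(len(samples), k):
        model = fit([samples[i] for i in train], [labels[i] for i in train])
        correct = sum(predict(model, samples[i]) == labels[i] for i in test)
        scores.append(correct / len(test))
    return sum(scores) / len(scores)


# Placeholder classifier: always predicts the most frequent training label.
def fit_majority(train_samples, train_labels):
    return max(set(train_labels), key=train_labels.count)


def predict_majority(model, sample):
    return model


labels = [1] * 8 + [0] * 2          # toy labels, not patient data
samples = list(range(len(labels)))
mean_accuracy = cross_val_accuracy(samples, labels, fit_majority, predict_majority, k=5)
print(mean_accuracy)
```

    Changing `k` changes the train/test ratio in every fold, which is one way the fold count and the "test training ratios" mentioned in the abstract interact.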

  10. A Survey on Accessing Data over Cloud Environment using Data mining Algorithms

    OpenAIRE

    B.Prasanalakshmi; A.Selvaraj

    2015-01-01

    In today's world, accessing large sets of data is complex, because the data may be structured or unstructured, in the form of text, images, videos, etc., and cannot be controlled by internet users; this is known as Big data. Useful data can be extracted from big data with the help of data mining algorithms. Data mining is a technique for determining patterns, classifying data, and clustering large sets of data. In this paper we will discuss how large s...

  11. USING ADVANCED DATA MINING AND INTEGRATION IN ENVIRONMENTAL PREDICTION SCENARIOS

    Directory of Open Access Journals (Sweden)

    Ondrej Habala

    2012-01-01

    We present one of the meteorological and hydrological experiments performed in the FP7 project ADMIRE. It serves as an experimental platform for hydrologists, and we have used it also as a testing platform for a suite of advanced data integration and data mining (DMI) tools, developed within ADMIRE. The idea of ADMIRE is to develop an advanced DMI platform accessible even to users who are not familiar with data mining techniques. To this end, we have designed a novel DMI architecture, supported by a set of software tools, managed by DMI process descriptions written in a specialized high-level DMI language called DISPEL, and controlled via several different user interfaces, each performing a different set of tasks and targeting different user group.

  12. Educational data mining: a sample of review and study case

    Directory of Open Access Journals (Sweden)

    Alejandro Pena, Rafael Domínguez, Jose de Jesus Medel

    2009-12-01

    The aim of this work is to encourage the research in a novel merged field: Educational data mining (EDM). Thereby, two subjects are outlined: The first one corresponds to a review of data mining (DM) methods and EDM applications. The second topic represents an EDM study case. As a result of the application of DM in Web-based Education Systems (WBES), stratified groups of students were found during a trial. Such groups reveal key attributes of volunteers that deserted or remained during a WBES experiment. This kind of discovered knowledge inspires the statement of correlational hypothesis to set relations between attributes and behavioral patterns of WBES users. We concluded that: When EDM findings are taken into account for designing and managing WBES, the learning objectives are improved

  13. Data Mining and Machine Learning Methods for Dementia Research.

    Science.gov (United States)

    Li, Rui

    2018-01-01

    Patient data in clinical research often includes large amounts of structured information, such as neuroimaging data, neuropsychological test results, and demographic variables. Given the various sources of information, we can develop computerized methods that can be a great help to clinicians to discover hidden patterns in the data. The computerized methods often employ data mining and machine learning algorithms, lending themselves as the computer-aided diagnosis (CAD) tool that assists clinicians in making diagnostic decisions. In this chapter, we review state-of-the-art methods used in dementia research, and briefly introduce some recently proposed algorithms subsequently.

  14. Evolutionary Data Mining Approach to Creating Digital Logic

    Science.gov (United States)

    2010-01-01

    To deal with this problem a genetic program (GP) based data mining (DM) procedure has been invented (Smith 2005). A genetic program is an algorithm...that can operate on the variables. When a GP was used as a DM function in the past to automatically create fuzzy decision trees, the Report...rules represents an approach to the determining the effect of linguistic imprecision, i.e., the inability of experts to provide crisp rules. The

  15. Survey of Insurance Fraud Detection Using Data Mining Techniques

    OpenAIRE

    Sithic, H. Lookman; Balasubramanian, T.

    2013-01-01

    With the increase in financial accounting fraud in the current economic scenario, financial accounting fraud detection has become an emerging topic of great importance for academia, research and industry. Financial fraud is a deliberate act that is contrary to law, rule or policy, with intent to obtain unauthorized financial benefit, and involves intentional misstatements or omission of amounts by deceiving users of financial statements, especially investors and creditors. Data mining tec...

  16. Using Copulas in Data Mining Based on the Observational Calculus

    Czech Academy of Sciences Publication Activity Database

    Holeňa, Martin; Bajer, L.; Ščavnický, M.

    2015-01-01

    Vol. 27, No. 10 (2015), pp. 2851-2864 ISSN 1041-4347 R&D Projects: GA ČR GA13-17187S Grant - others: SLU(CZ) SGS/21/2014 Institutional support: RVO:67985807 Keywords: data mining * observational calculus * generalized quantifiers * joint probability distribution * copulas * hierarchical Archimedean copulas Subject RIV: IN - Informatics, Computer Science Impact factor: 2.476, year: 2015

  17. Data Mining Application in Customer Relationship Management for Hospital Inpatients

    OpenAIRE

    Lee, Eun Whan

    2012-01-01

    Objectives: This study aims to discover patients loyal to a hospital and model their medical service usage patterns. Consequently, this study proposes a data mining application in customer relationship management (CRM) for hospital inpatients. Methods: A recency, frequency, monetary (RFM) model has been applied toward 14,072 patients discharged from a university hospital. Cluster analysis was conducted to segment customers, and it modeled the patterns of the loyal customers' medical services us...

  18. Data mining application in customer relationship management for hospital inpatients.

    Science.gov (United States)

    Lee, Eun Whan

    2012-09-01

    This study aims to discover patients loyal to a hospital and model their medical service usage patterns. Consequently, this study proposes a data mining application in customer relationship management (CRM) for hospital inpatients. A recency, frequency, monetary (RFM) model has been applied toward 14,072 patients discharged from a university hospital. Cluster analysis was conducted to segment customers, and it modeled the patterns of the loyal customers' medical services usage via a decision tree. Patients were divided into two groups according to the variables of the RFM model, and the group which had significantly high frequency of medical use and expenses was defined as loyal customers, a target market. As a result of the decision tree, the predictable factors of the loyal clients were: length of stay, certainty of selectable treatment, surgery, number of accompanying treatments, kind of patient room, and department from which they were discharged. Particularly, this research showed that when a patient within the internal medicine department who did not have surgery stayed for more than 13.5 days, their probability of being classified as a loyal customer was 70.0%. To discover a hospital's loyal patients and model their medical usage patterns, the application of data-mining has been suggested. This paper suggests practical use of combining segmentation, targeting, positioning (STP) strategy and the RFM model with data-mining in CRM.
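
    The segmentation step described in this record can be illustrated with a small sketch. It is not the authors' implementation: the discharge records, the field layout, and the use of a plain two-cluster k-means are all assumptions made for illustration (the abstract only says "cluster analysis"), and the decision-tree step is left out.

```python
from datetime import date
from statistics import mean, pstdev

# Hypothetical discharge records: (patient_id, discharge_date, charge).
VISITS = [
    ("p1", date(2011, 1, 10), 500.0),
    ("p1", date(2011, 6, 2), 700.0),
    ("p1", date(2011, 11, 20), 900.0),
    ("p2", date(2010, 3, 5), 300.0),
    ("p3", date(2011, 10, 1), 1200.0),
    ("p3", date(2011, 12, 15), 1500.0),
    ("p4", date(2010, 7, 9), 200.0),
]


def rfm_table(visits, today):
    """Return {patient: (recency_days, frequency, monetary)}."""
    table = {}
    for pid, day, charge in visits:
        rec, freq, mon = table.get(pid, (None, 0, 0.0))
        days = (today - day).days
        rec = days if rec is None else min(rec, days)
        table[pid] = (rec, freq + 1, mon + charge)
    return table


def standardize(rows):
    """Z-score each column so that no single variable dominates the distance."""
    cols = list(zip(*rows))
    stats = [(mean(c), pstdev(c) or 1.0) for c in cols]
    return [tuple((v - m) / s for v, (m, s) in zip(r, stats)) for r in rows]


def two_means(points, iters=20):
    """Plain k-means with k=2, seeded with the lowest and highest monetary rows."""
    order = sorted(range(len(points)), key=lambda i: points[i][-1])
    centers = [points[order[0]], points[order[-1]]]
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [
            min((0, 1), key=lambda k: sum((a - b) ** 2 for a, b in zip(p, centers[k])))
            for p in points
        ]
        for k in (0, 1):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centers[k] = tuple(mean(c) for c in zip(*members))
    return labels


def loyal_patients(visits, today):
    """Split patients into two RFM clusters; the higher-frequency one is 'loyal'."""
    table = rfm_table(visits, today)
    ids = sorted(table)
    labels = two_means(standardize([table[i] for i in ids]))

    def avg_freq(k):
        vals = [table[i][1] for i, lab in zip(ids, labels) if lab == k]
        return mean(vals) if vals else 0

    loyal = max((0, 1), key=avg_freq)
    return {i for i, lab in zip(ids, labels) if lab == loyal}


print(sorted(loyal_patients(VISITS, date(2012, 1, 1))))
```

    With these made-up records the frequent, high-spending patients end up in one cluster. A real study would then fit a classifier such as a decision tree on the cluster labels, as the abstract describes.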

  19. Data Mining Application in Customer Relationship Management for Hospital Inpatients

    Science.gov (United States)

    2012-01-01

    Objectives: This study aims to discover patients loyal to a hospital and model their medical service usage patterns. Consequently, this study proposes a data mining application in customer relationship management (CRM) for hospital inpatients. Methods: A recency, frequency, monetary (RFM) model has been applied toward 14,072 patients discharged from a university hospital. Cluster analysis was conducted to segment customers, and it modeled the patterns of the loyal customers' medical services usage via a decision tree. Results: Patients were divided into two groups according to the variables of the RFM model, and the group which had significantly high frequency of medical use and expenses was defined as loyal customers, a target market. As a result of the decision tree, the predictable factors of the loyal clients were: length of stay, certainty of selectable treatment, surgery, number of accompanying treatments, kind of patient room, and department from which they were discharged. Particularly, this research showed that when a patient within the internal medicine department who did not have surgery stayed for more than 13.5 days, their probability of being classified as a loyal customer was 70.0%. Conclusions: To discover a hospital's loyal patients and model their medical usage patterns, the application of data-mining has been suggested. This paper suggests practical use of combining segmentation, targeting, positioning (STP) strategy and the RFM model with data-mining in CRM. PMID: 23115740

  20. The Potentials of Educational Data Mining for Researching Metacognition, Motivation and Self-Regulated Learning

    Science.gov (United States)

    Winne, Philip H.; Baker, Ryan S. J. D.

    2013-01-01

    Our article introduces the "Journal of Educational Data Mining's" Special Issue on Educational Data Mining on Motivation, Metacognition, and Self-Regulated Learning. We outline general research challenges for data mining researchers who conduct investigations in these areas, the potential of EDM to advance research in this area, and…

  1. Data Mining Activities for Bone Discipline - Current Status

    Science.gov (United States)

    Sibonga, J. D.; Pietrzyk, R. A.; Johnston, S. L.; Arnaud, S. B.

    2008-01-01

    The disciplinary goals of the Human Research Program are broadly discussed. There is a critical need to identify gaps in the evidence that would substantiate a skeletal health risk during and after spaceflight missions. As a result, data mining activities will be engaged to gather reviews of medical data and flight analog data and to propose additional measures and specific analyses. Several studies are briefly reviewed which have topics that partially address these gaps in knowledge, including bone strength recovery with recovery of bone mass density, current renal stone formation knowledge, herniated discs, and a review of bed rest studies conducted at Ames Human Research Facility.

  2. Mathematical tools for data mining set theory, partial orders, combinatorics

    CERN Document Server

    Simovici, Dan A

    2014-01-01

    Data mining essentially relies on several mathematical disciplines, many of which are presented in this second edition of this book. Topics include partially ordered sets, combinatorics, general topology, metric spaces, linear spaces, graph theory. To motivate the reader a significant number of applications of these mathematical tools are included ranging from association rules, clustering algorithms, classification, data constraints, logical data analysis, etc. The book is intended as a reference for researchers and graduate students. The current edition is a significant expansion of the firs

  3. Astroinformatics, data mining and the future of astronomical research

    Energy Technology Data Exchange (ETDEWEB)

    Brescia, Massimo, E-mail: longo@na.infn.it [INAF, Astronomical Obs. of Capodimonte, Via Moiariello 16, I-80131 Napoli (Italy); Longo, Giuseppe [Department of Physics, University Federico II, Via Cintia 6, 80126 Napoli (Italy); Department of Astronomy, Caltech, Pasadena (United States)

    2013-08-21

    Astronomy, like many other scientific disciplines, is facing a true data deluge which is bound to change both the praxis and the methodology of everyday research work. The emerging field of astroinformatics, while on the one hand appearing crucial to face the technological challenges, on the other is opening new exciting perspectives for new astronomical discoveries through the implementation of advanced data mining procedures. The complexity of astronomical data and the variety of scientific problems, however, call for innovative algorithms and methods as well as for an extreme usage of ICT.

  4. Data mining practical machine learning tools and techniques

    CERN Document Server

    Witten, Ian H

    2005-01-01

    As with any burgeoning technology that enjoys commercial attention, the use of data mining is surrounded by a great deal of hype. Exaggerated reports tell of secrets that can be uncovered by setting algorithms loose on oceans of data. But there is no magic in machine learning, no hidden power, no alchemy. Instead there is an identifiable body of practical techniques that can extract useful information from raw data. This book describes these techniques and shows how they work. The book is a major revision of the first edition that appeared in 1999. While the basic core remains the same

  5. Astroinformatics, data mining and the future of astronomical research

    International Nuclear Information System (INIS)

    Brescia, Massimo; Longo, Giuseppe

    2013-01-01

    Astronomy, like many other scientific disciplines, is facing a true data deluge which is bound to change both the praxis and the methodology of everyday research work. The emerging field of astroinformatics, while on the one hand appearing crucial to face the technological challenges, on the other is opening new exciting perspectives for new astronomical discoveries through the implementation of advanced data mining procedures. The complexity of astronomical data and the variety of scientific problems, however, call for innovative algorithms and methods as well as for an extreme usage of ICT.

  6. Visualizing data mining results with the Brede tools

    DEFF Research Database (Denmark)

    Nielsen, Finn Årup

    2009-01-01

    A few neuroinformatics databases now exist that record results from neuroimaging studies in the form of brain coordinates in stereotaxic space. The Brede Toolbox was originally developed to extract, analyze and visualize data from one of them, the BrainMap database. Since then the Brede Toolbox has expanded and now includes its own database with coordinates along with ontologies for brain regions and functions: The Brede Database. With Brede Toolbox and Database combined we set up automated workflows for extraction of data, mass meta-analytic data mining and visualizations. Most of the Web...

  7. Design database for quantitative trait loci (QTL) data warehouse, data mining, and meta-analysis.

    Science.gov (United States)

    Hu, Zhi-Liang; Reecy, James M; Wu, Xiao-Lin

    2012-01-01

    A database can be used to warehouse quantitative trait loci (QTL) data from multiple sources for comparison, genomic data mining, and meta-analysis. A robust database design involves sound data structure logistics, meaningful data transformations, normalization, and proper user interface designs. This chapter starts with a brief review of relational database basics and concentrates on issues associated with curation of QTL data into a relational database, with emphasis on the principles of data normalization and structure optimization. In addition, some simple examples of QTL data mining and meta-analysis are included. These examples are provided to help readers better understand the potential and importance of sound database design.

  8. Implementasi Data Warehouse dan Data Mining: Studi Kasus Analisis Peminatan Studi Siswa

    Directory of Open Access Journals (Sweden)

    Eka Miranda

    2011-06-01

    Full Text Available This paper discusses the implementation of data mining and its role in supporting decision-making related to students’ specialization program selection. Currently, the university uses a database to store transaction records, which cannot be used directly to assist analysis and decision making. To address this, a data warehouse was designed to store large amounts of data; it also has the potential to provide new perspectives on data distribution, to answer ad hoc questions, and to support data analysis. The method consists of analysis of records related to students’ academic achievement, data warehouse design, and data mining. The results are a data warehouse and data mining design and its implementation with classification techniques and association rules. These results reveal students’ tendencies and background patterns in choosing a specialization, which helps them make decisions.

  9. A Big Data Platform for Storing, Accessing, Mining and Learning Geospatial Data

    Science.gov (United States)

    Yang, C. P.; Bambacus, M.; Duffy, D.; Little, M. M.

    2017-12-01

    Big Data is becoming a norm in geoscience domains. A platform that is capable of efficiently managing, accessing, analyzing, mining, and learning from big data for new information and knowledge is desired. This paper introduces our latest effort on developing such a platform based on our past years' experiences on cloud and high performance computing, analyzing big data, comparing big data containers, and mining big geospatial data for new information. The platform includes four layers: a) the bottom layer includes a computing infrastructure with proper network, computer, and storage systems; b) the 2nd layer is a cloud computing layer based on virtualization to provide on demand computing services for upper layers; c) the 3rd layer is big data containers that are customized for dealing with different types of data and functionalities; d) the 4th layer is a big data presentation layer that supports the efficient management, access, analysis, mining and learning of big geospatial data.

  10. Mining Social and Affective Data for Recommendation of Student Tutors

    Directory of Open Access Journals (Sweden)

    Elisa Boff

    2013-03-01

    Full Text Available This paper presents a learning environment where a mining algorithm is used to learn patterns of interaction with the user and to represent these patterns in a scheme called item descriptors. The learning environment keeps theoretical information about subjects, as well as tools and exercises where the student can put into practice the knowledge gained. One of the main purposes of the project is to stimulate collaborative learning through the interaction of students with different levels of knowledge. The students' actions, as well as their interactions, are monitored by the system and used to find patterns that can guide the search for students that may play the role of a tutor. Such patterns are found with a particular learning algorithm and represented in item descriptors. The paper presents the educational environment, the representation mechanism and learning algorithm used to mine social-affective data in order to create a recommendation model of tutors.

  11. Opinion data mining based on DNA method and ORA software

    Science.gov (United States)

    Tian, Ru-Ya; Wu, Lei; Liang, Xiao-He; Zhang, Xue-Fu

    2018-01-01

    Online public opinion is a critical subject for characteristic mining, because it can form directly and intensely in a short time, may lead to the outbreak of online group events, and can develop into an online public opinion crisis. Such a crisis may drive a public crisis event or even have negative social impacts, which poses great challenges to government management. Data from the mass media, which reveal implicit, previously unknown, and potentially valuable information, can effectively help us to understand the evolution law of public opinion and provide a useful reference for rumor intervention. Based on the Dynamic Network Analysis method, this paper uses ORA software to mine characteristics of public opinion information, opinion topics, and public opinion agents through a series of indicators, and quantitatively analyzes the relationships between them. The results show that through the analysis of the 8 indexes associated with opinion data mining, we can have a basic understanding of the public opinion characteristics of an opinion event, such as who is important in the opinion spreading process, the information grasping condition, and the opinion topics release situation.

  12. An Integrative data mining approach to identifying Adverse ...

    Science.gov (United States)

    The Adverse Outcome Pathway (AOP) framework is a tool for making biological connections and summarizing key information across different levels of biological organization to connect biological perturbations at the molecular level to adverse outcomes for an individual or population. Computational approaches to explore and determine these connections can accelerate the assembly of AOPs. By leveraging the wealth of publicly available data covering chemical effects on biological systems, computationally-predicted AOPs (cpAOPs) were assembled via data mining of high-throughput screening (HTS) in vitro data, in vivo data and other disease phenotype information. Frequent Itemset Mining (FIM) was used to find associations between the gene targets of ToxCast HTS assays and disease data from Comparative Toxicogenomics Database (CTD) by using the chemicals as the common aggregators between datasets. The method was also used to map gene expression data to disease data from CTD. A cpAOP network was defined by considering genes and diseases as nodes and FIM associations as edges. This network contained 18,283 gene to disease associations for the ToxCast data and 110,253 for CTD gene expression. Two case studies show the value of the cpAOP network by extracting subnetworks focused either on fatty liver disease or the Aryl Hydrocarbon Receptor (AHR). The subnetwork surrounding fatty liver disease included many genes known to play a role in this disease. When querying the cpAOP

  13. Data Mining Methods to Generate Severe Wind Gust Models

    Directory of Open Access Journals (Sweden)

    Subana Shanmuganathan

    2014-01-01

    Full Text Available Gaining knowledge on weather patterns, trends and the influence of their extremes on various crop production yields and quality continues to be a quest by scientists, agriculturists, and managers. Precise and timely information aids decision-making, which is widely accepted as intrinsically necessary for increased production and improved quality. Studies in this research domain, especially those related to data mining and interpretation are being carried out by the authors and their colleagues. Some of this work that relates to data definition, description, analysis, and modelling is described in this paper. This includes studies that have evaluated extreme dry/wet weather events against reported yield at different scales in general. They indicate the effects of weather extremes such as prolonged high temperatures, heavy rainfall, and severe wind gusts. Occurrences of these events are among the main weather extremes that impact on many crops worldwide. Wind gusts are difficult to anticipate due to their rapid manifestation and yet can have catastrophic effects on crops and buildings. This paper examines the use of data mining methods to reveal patterns in the weather conditions, such as time of the day, month of the year, wind direction, speed, and severity using a data set from a single location. Case study data is used to provide examples of how the methods used can elicit meaningful information and depict it in a fashion usable for management decision making. Historical weather data acquired between 2008 and 2012 has been used for this study from telemetry devices installed in a vineyard in the north of New Zealand. The results show that using data mining techniques and the local weather conditions, such as relative pressure, temperature, wind direction and speed recorded at irregular intervals, can produce new knowledge relating to wind gust patterns for vineyard management decision making.

  14. Knowledge-Based Reinforcement Learning for Data Mining

    Science.gov (United States)

    Kudenko, Daniel; Grzes, Marek

    Data Mining is the process of extracting patterns from data. Two general avenues of research in the intersecting areas of agents and data mining can be distinguished. The first approach is concerned with mining an agent’s observation data in order to extract patterns, categorize environment states, and/or make predictions of future states. In this setting, data is normally available as a batch, and the agent’s actions and goals are often independent of the data mining task. The data collection is mainly considered as a side effect of the agent’s activities. Machine learning techniques applied in such situations fall into the class of supervised learning. In contrast, the second scenario occurs where an agent is actively performing the data mining, and is responsible for the data collection itself. For example, a mobile network agent is acquiring and processing data (where the acquisition may incur a certain cost), or a mobile sensor agent is moving in a (perhaps hostile) environment, collecting and processing sensor readings. In these settings, the tasks of the agent and the data mining are highly intertwined and interdependent (or even identical). Supervised learning is not a suitable technique for these cases. Reinforcement Learning (RL) enables an agent to learn from experience (in form of reward and punishment for explorative actions) and adapt to new situations, without a teacher. RL is an ideal learning technique for these data mining scenarios, because it fits the agent paradigm of continuous sensing and acting, and the RL agent is able to learn to make decisions on the sampling of the environment which provides the data. Nevertheless, RL still suffers from scalability problems, which have prevented its successful use in many complex real-world domains. The more complex the tasks, the longer it takes a reinforcement learning algorithm to converge to a good solution. For many real-world tasks, human expert knowledge is available. For example, human

  15. Clustering-based approaches to SAGE data mining

    Directory of Open Access Journals (Sweden)

    Wang Haiying

    2008-07-01

    Full Text Available Abstract Serial analysis of gene expression (SAGE) is one of the most powerful tools for global gene expression profiling. It has led to several biological discoveries and biomedical applications, such as the prediction of new gene functions and the identification of biomarkers in human cancer research. Clustering techniques have become fundamental approaches in these applications. This paper reviews relevant clustering techniques specifically designed for this type of data. It places an emphasis on current limitations and opportunities in this area for supporting biologically-meaningful data mining and visualisation.

  16. A Data Preparation Methodology in Data Mining Applied to Mortality Population Databases.

    Science.gov (United States)

    Pérez, Joaquín; Iturbide, Emmanuel; Olivares, Víctor; Hidalgo, Miguel; Martínez, Alicia; Almanza, Nelva

    2015-11-01

    It is known that the data preparation phase is the most time consuming in the data mining process, using from 50% up to 70% of the total project time. Currently, data mining methodologies are general purpose, and one of their limitations is that they do not provide guidance on which particular tasks to carry out in a specific domain. This paper shows a new data preparation methodology oriented to the epidemiological domain in which we have identified two sets of tasks: General Data Preparation and Specific Data Preparation. For both sets, the Cross-Industry Standard Process for Data Mining (CRISP-DM) is adopted as a guideline. The main contribution of our methodology is fourteen specialized tasks for this domain. To validate the proposed methodology, we developed a data mining system and the entire process was applied to real mortality databases. The results were encouraging because it was observed that the use of the methodology reduced some of the time consuming tasks and the data mining system showed findings of unknown and potentially useful patterns for the public health services in Mexico.

  17. TSCA Chemical Data Reporting Fact Sheet: Reporting Manufactured Chemical Substances from Metal Mining and Related Activities

    Science.gov (United States)

    This fact sheet provides guidance on the Chemical Data Reporting (CDR) rule requirements related to the reporting of mined metals, intermediates, and byproducts manufactured during metal mining and related activities.

  18. Machine Learning and Data Mining Methods in Diabetes Research.

    Science.gov (United States)

    Kavakiotis, Ioannis; Tsave, Olga; Salifoglou, Athanasios; Maglaveras, Nicos; Vlahavas, Ioannis; Chouvarda, Ioanna

    2017-01-01

    The remarkable advances in biotechnology and health sciences have led to a significant production of data, such as high throughput genetic data and clinical information, generated from large Electronic Health Records (EHRs). To this end, application of machine learning and data mining methods in biosciences is presently, more than ever before, vital and indispensable in efforts to transform intelligently all available information into valuable knowledge. Diabetes mellitus (DM) is defined as a group of metabolic disorders exerting significant pressure on human health worldwide. Extensive research in all aspects of diabetes (diagnosis, etiopathophysiology, therapy, etc.) has led to the generation of huge amounts of data. The aim of the present study is to conduct a systematic review of the applications of machine learning, data mining techniques and tools in the field of diabetes research with respect to a) Prediction and Diagnosis, b) Diabetic Complications, c) Genetic Background and Environment, and d) Health Care and Management, with the first category appearing to be the most popular. A wide range of machine learning algorithms were employed. In general, 85% of those used were characterized by supervised learning approaches and 15% by unsupervised ones, and more specifically, association rules. Support vector machines (SVM) arise as the most successful and widely used algorithm. Concerning the type of data, clinical datasets were mainly used. The title applications in the selected articles project the usefulness of extracting valuable knowledge leading to new hypotheses targeting deeper understanding and further investigation in DM.

  19. Genomics Portals: integrative web-platform for mining genomics data.

    Science.gov (United States)

    Shinde, Kaustubh; Phatak, Mukta; Johannes, Freudenberg M; Chen, Jing; Li, Qian; Vineet, Joshi K; Hu, Zhen; Ghosh, Krishnendu; Meller, Jaroslaw; Medvedovic, Mario

    2010-01-13

    A large amount of experimental data generated by modern high-throughput technologies is available through various public repositories. Our knowledge about molecular interaction networks, functional biological pathways and transcriptional regulatory modules is rapidly expanding, and is being organized in lists of functionally related genes. Jointly, these two sources of information hold a tremendous potential for gaining new insights into functioning of living systems. Genomics Portals platform integrates access to an extensive knowledge base and a large database of human, mouse, and rat genomics data with basic analytical visualization tools. It provides the context for analyzing and interpreting new experimental data and the tool for effective mining of a large number of publicly available genomics datasets stored in the back-end databases. The uniqueness of this platform lies in the volume and the diversity of genomics data that can be accessed and analyzed (gene expression, ChIP-chip, ChIP-seq, epigenomics, computationally predicted binding sites, etc), and the integration with an extensive knowledge base that can be used in such analysis. The integrated access to primary genomics data, functional knowledge and analytical tools makes Genomics Portals platform a unique tool for interpreting results of new genomics experiments and for mining the vast amount of data stored in the Genomics Portals backend databases. Genomics Portals can be accessed and used freely at http://GenomicsPortals.org.

  20. Genomics Portals: integrative web-platform for mining genomics data

    Directory of Open Access Journals (Sweden)

    Ghosh Krishnendu

    2010-01-01

    Full Text Available Abstract Background: A large amount of experimental data generated by modern high-throughput technologies is available through various public repositories. Our knowledge about molecular interaction networks, functional biological pathways and transcriptional regulatory modules is rapidly expanding, and is being organized in lists of functionally related genes. Jointly, these two sources of information hold a tremendous potential for gaining new insights into functioning of living systems. Results: Genomics Portals platform integrates access to an extensive knowledge base and a large database of human, mouse, and rat genomics data with basic analytical visualization tools. It provides the context for analyzing and interpreting new experimental data and the tool for effective mining of a large number of publicly available genomics datasets stored in the back-end databases. The uniqueness of this platform lies in the volume and the diversity of genomics data that can be accessed and analyzed (gene expression, ChIP-chip, ChIP-seq, epigenomics, computationally predicted binding sites, etc), and the integration with an extensive knowledge base that can be used in such analysis. Conclusion: The integrated access to primary genomics data, functional knowledge and analytical tools makes Genomics Portals platform a unique tool for interpreting results of new genomics experiments and for mining the vast amount of data stored in the Genomics Portals backend databases. Genomics Portals can be accessed and used freely at http://GenomicsPortals.org.

  1. A Statistical Toolbox For Mining And Modeling Spatial Data

    Directory of Open Access Journals (Sweden)

    D’Aubigny Gérard

    2016-12-01

    Full Text Available Most data mining projects in spatial economics start with an evaluation of a set of attribute variables on a sample of spatial entities, looking for the existence and strength of spatial autocorrelation, based on Moran’s and Geary’s coefficients, the adequacy of which is rarely challenged, despite the fact that when reporting on their properties, many users seem likely to make mistakes and to foster confusion. My paper begins with a critical appraisal of the classical definition and rationale of these indices. I argue that while intuitively founded, they are plagued by an inconsistency in their conception. Then, I propose a principled small change leading to corrected spatial autocorrelation coefficients, which strongly simplifies their relationship, and opens the way to an augmented toolbox of statistical methods of dimension reduction and data visualization, also useful for modeling purposes. A second section presents a formal framework, adapted from recent work in statistical learning, which gives theoretical support to our definition of corrected spatial autocorrelation coefficients. More specifically, the multivariate data mining methods presented here are easily implementable in existing (free) software, yield methods useful to exploit the proposed corrections in spatial data analysis practice, and, from a mathematical point of view, have an asymptotic behavior, already studied in a series of papers by Belkin & Niyogi, which suggests that they possess qualities of robustness and a limited sensitivity to the Modifiable Areal Unit Problem (MAUP), valuable in exploratory spatial data analysis.
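
    For readers unfamiliar with the two indices this record discusses, the following minimal sketch computes the classical Moran's I and Geary's C. It does not implement the corrected coefficients the paper proposes, which are not given in the abstract. The four-unit line layout, its binary adjacency weights, and the function names are assumptions chosen for illustration.

```python
def morans_i(x, w):
    """Classical Moran's I: (n / S0) * sum_ij w_ij z_i z_j / sum_i z_i^2."""
    n = len(x)
    xbar = sum(x) / n
    z = [v - xbar for v in x]
    s0 = sum(sum(row) for row in w)  # total weight
    num = sum(w[i][j] * z[i] * z[j] for i in range(n) for j in range(n))
    den = sum(v * v for v in z)
    return (n / s0) * num / den


def gearys_c(x, w):
    """Classical Geary's C: ((n - 1) / (2 S0)) * sum_ij w_ij (x_i - x_j)^2 / sum_i z_i^2."""
    n = len(x)
    xbar = sum(x) / n
    s0 = sum(sum(row) for row in w)
    num = sum(w[i][j] * (x[i] - x[j]) ** 2 for i in range(n) for j in range(n))
    den = sum((v - xbar) ** 2 for v in x)
    return ((n - 1) / (2 * s0)) * num / den


# Hypothetical example: four spatial units on a line, neighbours share an edge.
W = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]

smooth = [1, 2, 3, 4]       # neighbours have similar values
alternating = [1, 4, 1, 4]  # neighbours have opposite values

print(morans_i(smooth, W), gearys_c(smooth, W))
print(morans_i(alternating, W), gearys_c(alternating, W))
```

    On the smooth series Moran's I is positive and Geary's C is below 1, which indicates positive spatial autocorrelation; on the alternating series Moran's I is negative and Geary's C is above 1. The two indices are built from different cross-products (deviations from the mean versus pairwise differences), which is the relationship the paper sets out to simplify.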

  2. SPATIO-TEMPORAL PATTERN MINING ON TRAJECTORY DATA USING ARM

    Directory of Open Access Journals (Sweden)

    S. Khoshahval

    2017-09-01

    Full Text Available Mobile phones were initially considered devices for making human connections easier. Today their use has evolved into a platform for gaming, web surfing and GPS-enabled applications. Embedding GPS in handheld devices has turned them into significant facilities for gathering trajectory data. Raw GPS trajectory data is a series of points that contains hidden information, and trajectory data analysis is needed to reveal it. Among the most useful information concealed in trajectory data are user activity patterns. Each pattern contains multiple stops and moves that identify the places a user visited and the tasks performed. This paper proposes an approach to discover users' daily activity patterns from GPS trajectories using association rules. Finding user patterns requires extracting the user's visited places from the stops and moves of GPS trajectories. To locate stops and moves, we implemented a place recognition algorithm. After the visited points were extracted, an advanced association rule mining algorithm called Apriori was used to extract user activity patterns. The study showed that each trajectory contains useful patterns that can be extracted from raw GPS data using association rule mining techniques to learn about the behaviour of multiple users in a system, and that these patterns can be used in various location-based applications.
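
    The Apriori step mentioned in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: each "transaction" is taken to be the set of places visited in one day, and the place labels, thresholds and function names are hypothetical.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for every frequent itemset."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)
    current = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while current:
        # Count support of each candidate and keep the frequent ones
        level = {}
        for cand in current:
            support = sum(1 for t in transactions if cand <= t) / n
            if support >= min_support:
                level[cand] = support
        frequent.update(level)
        k += 1
        # Join step: merge frequent (k-1)-itemsets into k-item candidates
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent
        current = {c for c in candidates
                   if all(frozenset(s) in level for s in combinations(c, k - 1))}
    return frequent


def association_rules(frequent, min_confidence):
    """Return (lhs, rhs, support, confidence) for rules lhs -> rhs."""
    rules = []
    for itemset, support in frequent.items():
        for r in range(1, len(itemset)):
            for lhs in map(frozenset, combinations(itemset, r)):
                confidence = support / frequent[lhs]
                if confidence >= min_confidence:
                    rules.append((lhs, itemset - lhs, support, confidence))
    return rules


# Hypothetical visited places per day, as produced by a place recognition step.
days = [
    {"home", "work", "gym"},
    {"home", "work"},
    {"home", "work", "shop"},
    {"home", "gym"},
]
frequent = apriori(days, min_support=0.5)
rules = association_rules(frequent, min_confidence=0.8)
```

    With these sample days, the only rules that reach the confidence threshold are work -> home and gym -> home.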

  3. Spatio-Temporal Pattern Mining on Trajectory Data Using ARM

    Science.gov (United States)

    Khoshahval, S.; Farnaghi, M.; Taleai, M.

    2017-09-01

    Mobile phones were initially considered devices for making human connections easier. Today their use has evolved into a platform for gaming, web surfing and GPS-enabled applications. Embedding GPS in handheld devices has turned them into significant facilities for gathering trajectory data. Raw GPS trajectory data is a series of points that contains hidden information, and trajectory data analysis is needed to reveal it. Among the most useful information concealed in trajectory data are user activity patterns. Each pattern contains multiple stops and moves that identify the places a user visited and the tasks performed. This paper proposes an approach to discover users' daily activity patterns from GPS trajectories using association rules. Finding user patterns requires extracting the user's visited places from the stops and moves of GPS trajectories. To locate stops and moves, we implemented a place recognition algorithm. After the visited points were extracted, an advanced association rule mining algorithm called Apriori was used to extract user activity patterns. The study showed that each trajectory contains useful patterns that can be extracted from raw GPS data using association rule mining techniques to learn about the behaviour of multiple users in a system, and that these patterns can be used in various location-based applications.

  4. DATA MINING APPLICATION IN CREDIT CARD FRAUD DETECTION SYSTEM

    Directory of Open Access Journals (Sweden)

    FRANCISCA NONYELUM OGWUELEKA

    2011-06-01

    Full Text Available Data mining is popularly used to combat fraud because of its effectiveness. It is a well-defined procedure that takes data as input and produces models or patterns as output. A neural network, a data mining technique, was used in this study. The design of the neural network (NN) architecture for the credit card detection system was based on an unsupervised method, which was applied to the transaction data to generate four clusters: low, high, risky and high-risk. The self-organizing map neural network (SOMNN) technique was used to solve the problem of optimally classifying each transaction into its associated group, since the output is not known a priori. The receiver operating characteristic (ROC) curve for the credit card fraud (CCF) detection watch showed that it detected over 95% of fraud cases without causing false alarms, unlike other statistical models and the two-stage clusters. This shows that the performance of the CCF detection watch agrees with that of other detection software, but is better.

  5. DATA MINING FOR CUSTOMER CLASSIFICATION WITH ANT COLONY OPTIMIZATION

    Directory of Open Access Journals (Sweden)

    Maulani Kapiudin

    2007-01-01

    Full Text Available In this research, a system for classifying potential customers is designed by extracting classification rules from raw data under certain criteria. The search process applies data mining with ant colony optimization to a bank's customer database. Tests varying min_case_per_rule and pheromone updating were carried out over a certain period of time. The result is a set of customer classes based on rules built by the ants; modifying the pheromone updating enlarges the area of cases covered. The software prototype is coded in C++ version 6, and the master customer database is created with Microsoft Access. The paper shows that a bank's potential customers can be classified by the software prototype. Keywords: ant colony optimization, classification, min_case_per_rule, term, pheromone updating

  6. Advances in research methods for information systems research: data mining, data envelopment analysis, value focused thinking

    CERN Document Server

    Osei-Bryson, Kweku-Muata

    2013-01-01

    Advances in social science research methodologies and data analytic methods are changing the way research in information systems is conducted. New developments in statistical software technologies for data mining (DM) such as regression splines or decision tree induction can be used to assist researchers in systematic post-positivist theory testing and development. Established management science techniques like data envelopment analysis (DEA), and value focused thinking (VFT) can be used in combination with traditional statistical analysis and data mining techniques to more effectively explore

  7. Data Mining Learning Models and Algorithms on a Scada System Data Repository

    Directory of Open Access Journals (Sweden)

    Mircea Rîşteiu

    2010-06-01

    Full Text Available This paper presents three data mining techniques applied to a SCADA system data repository: Naive Bayes, k-Nearest Neighbor and Decision Trees. Based on the mining results and their reasonable explanation, the paper concludes that k-Nearest Neighbor is a suitable method for classifying the large amount of data considered. The experiments are built on the training data set and evaluated on a new test set with the machine learning tool WEKA.
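
    As a rough illustration of the k-Nearest Neighbor technique named in this abstract, the sketch below classifies a reading by majority vote among its k closest training readings. The paper used WEKA; this Python version, the two-feature readings and the labels "normal" and "alarm" are assumptions made only for the example.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training rows.

    train -- list of (feature_tuple, label) pairs
    """
    # Sort training rows by Euclidean distance to the query and keep k of them
    neighbours = sorted(train, key=lambda row: math.dist(row[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Hypothetical SCADA readings: (pressure, temperature) in arbitrary units.
train = [
    ((1.0, 1.0), "normal"),
    ((1.2, 0.9), "normal"),
    ((0.9, 1.1), "normal"),
    ((5.0, 5.2), "alarm"),
    ((5.1, 4.9), "alarm"),
    ((4.8, 5.0), "alarm"),
]
low_reading = knn_predict(train, (1.1, 1.0))
high_reading = knn_predict(train, (5.0, 5.0))
```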

  8. Mining

    Directory of Open Access Journals (Sweden)

    Khairullah Khan

    2014-09-01

    Full Text Available Opinion mining is an interesting area of research because of its applications in various fields. Collecting opinions of people about products and about social and political events and problems through the Web is becoming increasingly popular every day. The opinions of users are helpful for the public and for stakeholders when making certain decisions. Opinion mining is a way to retrieve information through search engines, Web blogs and social networks. Because of the huge number of reviews in the form of unstructured text, it is impossible to summarize the information manually. Accordingly, efficient computational methods are needed for mining and summarizing the reviews from corpuses and Web documents. This study presents a systematic literature survey regarding the computational techniques, models and algorithms for mining opinion components from unstructured reviews.

  9. Multiagent data warehousing and multiagent data mining for cerebrum/cerebellum modeling

    Science.gov (United States)

    Zhang, Wen-Ran

    2002-03-01

    An algorithm named Neighbor-Miner is outlined for multiagent data warehousing and multiagent data mining. The algorithm is defined in an evolving dynamic environment with autonomous or semiautonomous agents. Instead of mining frequent itemsets from customer transactions, the new algorithm discovers new agents and mines agent associations in first-order logic from agent attributes and actions. While the Apriori algorithm uses frequency as an a priori threshold, the new algorithm uses agent similarity as a priori knowledge. The concept of agent similarity leads to the notions of agent cuboid, orthogonal multiagent data warehousing (MADWH), and multiagent data mining (MADM). Based on agent similarities and action similarities, Neighbor-Miner is proposed and illustrated in a MADWH/MADM approach to cerebrum/cerebellum modeling. It is shown that (1) semiautonomous neurofuzzy agents can be identified for uniped locomotion and gymnastic training based on attribute relevance analysis; (2) new agents can be discovered and agent cuboids can be dynamically constructed in an orthogonal MADWH, which resembles an evolving cerebrum/cerebellum system; and (3) dynamic motion laws can be discovered as association rules in first-order logic. Although examples in legged robot gymnastics are used to illustrate the basic ideas, the new approach is generally suitable for a broad category of data mining tasks where knowledge can be discovered collectively by a set of agents from a geographically or geometrically distributed but relevant environment, especially in scientific and engineering data environments.

  10. Applying Fuzzy Data Mining to Telecom Churn Management

    Science.gov (United States)

    Liao, Kuo-Hsiung; Chueh, Hao-En

    Customers tend to change telecommunications service providers in pursuit of more favorable telecommunication rates. Therefore, how to avoid customer churn is an extremely critical topic for the intensely competitive telecommunications industry. To assist telecommunications service providers in effectively reducing the rate of customer churn, this study used fuzzy data mining to determine effective marketing strategies by analyzing the responses of customers to various marketing activities. These techniques can help telecommunications service providers determine the most appropriate marketing opportunities and methods for different customer groups, to reduce effectively the rate of customer turnover.

  11. Building clusters for CRM strategies by mining airlines customer data

    OpenAIRE

    Miranda, Helena Sofia Guerreiro de

    2013-01-01

    Project work presented as a partial requirement for obtaining the degree of Master in Statistics and Information Management. As airlines strive to gain market share and sustain profitability in today’s economically challenging environment, they should develop new ways to optimize their frequent flyer programs while increasing revenues. Aware of the challenges, airlines want to implement a customer relationship management (CRM) strategy based on customer analytics and data mining ...

  12. Ensemble Methods in Data Mining: Improving Accuracy Through Combining Predictions

    CERN Document Server

    Seni, Giovanni

    2010-01-01

    This book is aimed at novice and advanced analytic researchers and practitioners -- especially in Engineering, Statistics, and Computer Science. Those with little exposure to ensembles will learn why and how to employ this breakthrough method, and advanced practitioners will gain insight into building even more powerful models. Throughout, snippets of code in R are provided to illustrate the algorithms described and to encourage the reader to try the techniques. The authors are industry experts in data mining and machine learning who are also adjunct professors and popular speakers. Although e

  13. Applications of data mining to the study of biodiversity

    OpenAIRE

    Santa María, Cristóbal; Soria, Marcelo A.

    2011-01-01

    This work proposes the joint use of data mining and simulation techniques to assess the richness and diversity of microbial communities. It starts from a sample of distinct DNA sequences, which are aligned and then grouped into clusters according to their similarity. Each of these clusters is a species, and the aim is to estimate their number and distribution in the community from the information the sample provides. The rarefaction technique, based on the procedure...

  14. Data Mining and Knowledge Discovery via Logic-Based Methods

    CERN Document Server

    Triantaphyllou, Evangelos

    2010-01-01

    There are many approaches to data mining and knowledge discovery (DM&KD), including neural networks, closest neighbor methods, and various statistical methods. This monograph, however, focuses on the development and use of a novel approach, based on mathematical logic, that the author and his research associates have worked on over the last 20 years. The methods presented in the book deal with key DM&KD issues in an intuitive manner and in a natural sequence. Compared to other DM&KD methods, those based on mathematical logic offer a direct and often intuitive approach for extracting easily int

  15. Data warehousing as a basis for web-based documentation of data mining and analysis.

    Science.gov (United States)

    Karlsson, J; Eklund, P; Hallgren, C G; Sjödin, J G

    1999-01-01

    In this paper we present a case study for data warehousing intended to support data mining and analysis. We also describe a prototype for data retrieval. Further we discuss some technical issues related to a particular choice of a patient record environment.

  16. Combining Data Warehouse and Data Mining Techniques for Web Log Analysis

    DEFF Research Database (Denmark)

    Pedersen, Torben Bach; Jespersen, Søren; Thorhauge, Jesper

    2008-01-01

    a number of approaches that combine data warehousing and data mining techniques in order to analyze Web logs. After introducing the well-known click and session data warehouse (DW) schemas, the chapter presents the subsession schema, which allows fast queries on sequences...

  17. Spatiotemporal Data Mining, Analysis, and Visualization of Human Activity Data

    Science.gov (United States)

    Li, Xun

    2012-01-01

    This dissertation addresses the research challenge of developing efficient new methods for discovering useful patterns and knowledge in large volumes of electronically collected spatiotemporal activity data. I propose to analyze three types of such spatiotemporal activity data in a methodological framework that integrates spatial analysis, data…

  18. Data Mining and Visualization of Large Human Behavior Data Sets

    DEFF Research Database (Denmark)

    Cuttone, Andrea

    and credit card transactions – have provided us new sources for studying our behavior. In particular smartphones have emerged as new tools for collecting data about human activity, thanks to their sensing capabilities and their ubiquity. This thesis investigates the question of what we can learn about human...... behavior from this rich and pervasive mobile sensing data. In the first part, we describe a large-scale data collection deployment collecting high-resolution data for over 800 students at the Technical University of Denmark using smartphones, including location, social proximity, calls and SMS. We provide...... an overview of the technical infrastructure, the experimental design, and the privacy measures. The second part investigates the usage of this mobile sensing data for understanding personal behavior. We describe two large-scale user studies on the deployment of self-tracking apps, in order to understand...

  19. Data Mining as a Service (DMaaS)

    Science.gov (United States)

    Tejedor, E.; Piparo, D.; Mascetti, L.; Moscicki, J.; Lamanna, M.; Mato, P.

    2016-10-01

    Data Mining as a Service (DMaaS) is a software and computing infrastructure that allows interactive mining of scientific data in the cloud. It allows users to run advanced data analyses by leveraging the widely adopted Jupyter notebook interface. Furthermore, the system makes it easier to share results and scientific code, access scientific software, produce tutorials and demonstrations as well as preserve the analyses of scientists. This paper describes how a first pilot of the DMaaS service is being deployed at CERN, starting from the notebook interface that has been fully integrated with the ROOT analysis framework, in order to provide all the tools for scientists to run their analyses. Additionally, we characterise the service backend, which combines a set of IT services such as user authentication, virtual computing infrastructure, mass storage, file synchronisation, development portals or batch systems. The added value acquired by the combination of the aforementioned categories of services is discussed, focusing on the opportunities offered by the CERNBox synchronisation service and its massive storage backend, EOS.

  20. Unrecorded Accidents Detection on Highways Based on Temporal Data Mining

    Directory of Open Access Journals (Sweden)

    Shi An

    2014-01-01

    Full Text Available Automatic detection of traffic accidents, especially those not recorded by traffic police, is crucial to the identification of accident black spots and to traffic safety. A new method of detecting traffic accidents is proposed based on temporal data mining, which can identify accidents that are unknown to and unrecorded by traffic police. A time series model was constructed using ternary numbers to reflect the state of traffic flow based on the cell transmission model. In order to deal with the aftereffects of linear drift between time series and to reduce the computational cost, the discrete Fourier transform was used to convert the time series from the time domain to the frequency domain. The pattern of the time series when an accident happened could be recognized using historical crash data. Then, taking Euclidean distance as the similarity evaluation function, similarity data mining of the transformed time series was carried out. If the result was less than the given threshold, the two time series were considered similar and an accident had probably happened. A numerical example was carried out and the results verified the effectiveness of the proposed method.
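
    The frequency-domain similarity test described in this abstract can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names, the number of retained Fourier coefficients, the threshold and the ternary sample series are all assumptions, and both the drift handling and the cell transmission model are left out.

```python
import cmath
import math


def dft(series, n_coeffs):
    """First n_coeffs coefficients of the unitary discrete Fourier transform."""
    n = len(series)
    return [
        sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
            for t, x in enumerate(series)) / math.sqrt(n)
        for k in range(n_coeffs)
    ]


def distance(a, b):
    """Euclidean distance between two lists of complex coefficients."""
    return math.sqrt(sum(abs(x - y) ** 2 for x, y in zip(a, b)))


def is_similar(series, pattern, n_coeffs=4, threshold=1.0):
    """True when the truncated spectra are closer than the threshold."""
    return distance(dft(series, n_coeffs), dft(pattern, n_coeffs)) < threshold


# Hypothetical ternary traffic states (0 = free flow, 1 = dense, 2 = congested).
accident_pattern = [0, 0, 1, 2, 2, 2, 1, 0]
observed = [0, 0, 1, 2, 2, 2, 1, 0]
congested_all_day = [2, 2, 2, 2, 2, 2, 2, 2]

match = is_similar(observed, accident_pattern)
no_match = is_similar(congested_all_day, accident_pattern)
```

    Keeping only a few low-frequency coefficients reduces the cost of each comparison. With the unitary scaling used here, the truncated distance never exceeds the time-domain Euclidean distance, so a series that fails the test in the frequency domain would also fail it in the time domain.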

  1. Common Subcluster Mining in Microarray Data for Molecular Biomarker Discovery.

    Science.gov (United States)

    Sadhu, Arnab; Bhattacharyya, Balaram

    2017-10-11

    Molecular biomarkers can be potential facilitators for detection of cancer at an early stage, which is otherwise difficult through conventional biomarkers. Gene expression data from microarray experiments on both normal and diseased cell samples provide enormous scope to explore genetic relations of disease using computational techniques. Varied patterns of expression of thousands of genes at different cell conditions, along with inherent experimental error, make the task of isolating disease-related genes challenging. In this paper, we present a data mining method, common subcluster mining (CSM), to discover highly perturbed genes under diseased conditions from differential expression patterns. The method builds a heap by superposing near-centroid clusters from gene expression data of normal samples and extracts its core part. It thus isolates genes exhibiting the most stable state across normal samples, which constitute a reference set for each centroid. It performs the same operation on datasets from corresponding diseased samples and isolates the genes showing drastic changes in their expression patterns. The method thus finds the disease-sensitive gene sets when applied to datasets of lung cancer, prostate cancer, pancreatic cancer, breast cancer, leukemia and pulmonary arterial hypertension. In the majority of cases, a few new genes are found over and above some previously reported ones. Genes with distinct deviations in diseased samples are prospective candidates for molecular biomarkers of the respective disease.

  2. Data Mining Methods Applied to Flight Operations Quality Assurance Data: A Comparison to Standard Statistical Methods

    Science.gov (United States)

    Stolzer, Alan J.; Halford, Carl

    2007-01-01

    In a previous study, multiple regression techniques were applied to Flight Operations Quality Assurance-derived data to develop parsimonious model(s) for fuel consumption on the Boeing 757 airplane. The present study examined several data mining algorithms, including neural networks, on the fuel consumption problem and compared them to the multiple regression results obtained earlier. Using regression methods, parsimonious models were obtained that explained approximately 85% of the variation in fuel flow. In general data mining methods were more effective in predicting fuel consumption. Classification and Regression Tree methods reported correlation coefficients of .91 to .92, and General Linear Models and Multilayer Perceptron neural networks reported correlation coefficients of about .99. These data mining models show great promise for use in further examining large FOQA databases for operational and safety improvements.

  3. Data mining in large sets of complex data

    CERN Document Server

    Cordeiro, Robson L F; Júnior, Caetano Traina

    2013-01-01

    The amount and the complexity of the data gathered by current enterprises are increasing at an exponential rate. Consequently, the analysis of Big Data is nowadays a central challenge in Computer Science, especially for complex data. For example, given a satellite image database containing tens of Terabytes, how can we find regions aiming at identifying native rainforests, deforestation or reforestation? Can it be made automatically? Based on the work discussed in this book, the answers to both questions are a sound "yes", and the results can be obtained in just minutes. In fact, results that

  4. Analysis of Graduate Data with Data Mining to Support the Promotion Strategy of Universitas Lancang Kuning

    Directory of Open Access Journals (Sweden)

    Elvira Asril

    2015-11-01

    Full Text Available Every company or organization that wants to survive needs to determine an appropriate promotion strategy. An appropriate strategy reduces promotion costs and reaches the right promotion targets. One way to determine a promotion strategy is to use data mining techniques; the technique used here is the K-Means clustering algorithm. Clustering is the grouping of records, observations, or cases into classes of similar objects. K-Means is a non-hierarchical data clustering method that tries to divide the data into one or more clusters. The study observed several variables that universities often consider when determining their promotion targets: school of origin, region, and department. The result of the study is a set of interesting patterns produced by data mining, which constitute important information for supporting an appropriate promotion strategy for attracting prospective new students. Keywords: Data Mining, Clustering, K-Means
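
    The K-Means procedure this abstract relies on can be sketched as below. It is a plain illustration rather than the study's implementation. It assumes the attributes have already been encoded as numbers, it initialises the centroids from the first k points to stay deterministic, and the sample points and function name are hypothetical.

```python
import math


def kmeans(points, k, iterations=100):
    """Basic K-Means; returns (centroids, cluster index per point)."""
    centroids = [tuple(p) for p in points[:k]]  # deterministic initialisation
    assignment = None
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members)
                                     for col in zip(*members))
    return centroids, assignment


# Hypothetical numerically encoded graduate records (two features each).
points = [(0, 0), (10, 10), (0, 1), (1, 0), (10, 11), (11, 10)]
centroids, assignment = kmeans(points, k=2)
```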

  5. A Case Study for Student Performance Analysis based on Educational Data Mining (EDM)

    OpenAIRE

    Daxa Kundariya; Prof. Vaseem Ghada

    2016-01-01

    Educational Data Mining (EDM) is a research methodology and an application of data mining techniques to student data from academic databases. Like other domains, the educational domain produces a vast amount of data. To enhance the quality of the education system, student performance analysis plays an important role in decision support. This paper elaborates a study of various educational data mining techniques and how they could be applied to the educational system to analyse student perfor...

  6. From Visualisation to Data Mining with Large Data Sets

    CERN Document Server

    Adelmann, Andreas; Shalf, John M; Siegerist, Cristina

    2005-01-01

    In 3D particle simulations, the generated 6D phase space data can be very large due to the need for accurate statistics, sufficient noise attenuation in the field solver and tracking of many turns in ring machines or accelerators. There is a need for distributed applications that allow users to peruse these extremely large, remotely located datasets with the same ease as locally downloaded data. This paper will show concepts and a prototype tool to extract useful physical information out of 6D raw phase space data. ParViT allows the user to project 6D data into 3D space by selecting which dimensions will be represented spatially and which dimensions are represented as particle attributes, and supports the construction of complex transfer functions for representing the particle attributes. It also allows management of time-series data. An HDF5-based parallel-I/O library with C++, C and Fortran bindings simplifies the interface with a variety of codes. A number of hooks in ParViT will allow it to connect with a para...

  7. Undergraduate Biocuration: Developing Tomorrow's Researchers While Mining Today's Data.

    Science.gov (United States)

    Mitchell, Cassie S; Cates, Ashlyn; Kim, Renaid B; Hollinger, Sabrina K

    2015-01-01

    Biocuration is a time-intensive process that involves extraction, transcription, and organization of biological or clinical data from disjointed data sets into a user-friendly database. Curated data is subsequently used primarily for text mining or informatics analysis (bioinformatics, neuroinformatics, health informatics, etc.) and secondarily as a researcher resource. Biocuration is traditionally considered a Ph.D. level task, but a massive shortage of curators to consolidate the ever-mounting biomedical "big data" opens the possibility of utilizing biocuration as a means to mine today's data while teaching students skill sets they can utilize in any career. By developing a biocuration assembly line of simplified and compartmentalized tasks, we have enabled biocuration to be effectively performed by a hierarchy of undergraduate students. We summarize the necessary physical resources, process for establishing a data path, biocuration workflow, and undergraduate hierarchy of curation, technical, information technology (IT), quality control and managerial positions. We detail the undergraduate application and training processes and give detailed job descriptions for each position on the assembly line. We present case studies of neuropathology curation performed entirely by undergraduates, namely the construction of experimental databases of Amyotrophic Lateral Sclerosis (ALS) transgenic mouse models and clinical data from ALS patient records. Our results reveal undergraduate biocuration is scalable for a group of 8-50+ with relatively minimal required resources. Moreover, with average accuracy rates greater than 98.8%, undergraduate biocurators are equivalently accurate to their professional counterparts. Initial training to be completely proficient at the entry-level takes about five weeks with a minimal student time commitment of four hours/week.

  8. Symbolic Data Analysis: Conceptual Statistics and Data Mining

    CERN Document Server

    Billard, Lynne

    2012-01-01

    With the advent of computers, very large datasets have become routine. Standard statistical methods don't have the power or flexibility to analyse these efficiently, and extract the required knowledge. An alternative approach is to summarize a large dataset in such a way that the resulting summary dataset is of a manageable size and yet retains as much of the knowledge in the original dataset as possible. One consequence of this is that the data may no longer be formatted as single values, but be represented by lists, intervals, distributions, etc. The summarized data have their own internal s

  9. Mining data from hemodynamic simulations via Bayesian emulation

    Directory of Open Access Journals (Sweden)

    Nair Prasanth B

    2007-12-01

    Full Text Available Abstract Background: Arterial geometry variability is inevitable both within and across individuals. To ensure realistic prediction of cardiovascular flows, there is a need for efficient numerical methods that can systematically account for geometric uncertainty. Methods and results: A statistical framework based on Bayesian Gaussian process modeling was proposed for mining data generated from computer simulations. The proposed approach was applied to analyze the influence of geometric parameters on hemodynamics in the human carotid artery bifurcation. A parametric model in conjunction with a design of computer experiments strategy was used for generating a set of observational data that contains the maximum wall shear stress values for a range of probable arterial geometries. The dataset was mined via a Bayesian Gaussian process emulator to estimate: (a) the influence of key parameters on the output via sensitivity analysis, (b) uncertainty in output as a function of uncertainty in input, and (c) which settings of the input parameters result in maximum and minimum values of the output. Finally, potential diagnostic indicators were proposed that can be used to aid the assessment of stroke risk for a given patient's geometry.

  10. Connecting traditional sciences with the OLAP and data mining paradigms

    Science.gov (United States)

    Guergachi, Aziz A.

    2003-03-01

    The paradigms of OLAP, multidimensional modeling and data mining first emerged in the areas of market analysis and finance to address various needs of people working in these areas. Does this mean that they are useful and applicable in these areas only? Or can they also be applied in the more traditional areas of science and engineering? What characterizes the systems for which these paradigms are suitable? What are the goals of these paradigms? How do they relate to the traditional body of knowledge that has been developed throughout the centuries in the areas of mathematics, statistics, systems science and engineering? Where, how and to what extent can we leverage the conventional wisdom that has been accumulated in the aforementioned disciplines to develop a foundational basis for the above paradigms? The goal of this paper is to address these questions at the foundational level. We argue that the paradigms of OLAP, multidimensional modeling and data mining can also be applied successfully to complex engineering systems, such as membrane-based water/wastewater treatment plants. We develop a mathematically based axiomatic definition of the concepts of 'dimension,' 'dimension level,' 'dimension hierarchy' and 'measure' using set theory and equivalence relations.

  11. A web server for mining Comparative Genomic Hybridization (CGH) data

    Science.gov (United States)

    Liu, Jun; Ranka, Sanjay; Kahveci, Tamer

    2007-11-01

    Advances in cytogenetics and molecular biology have established that chromosomal alterations are critical in the pathogenesis of human cancer. Recurrent chromosomal alterations provide cytological and molecular markers for the diagnosis and prognosis of disease. They also facilitate the identification of genes that are important in carcinogenesis, which in the future may help in the development of targeted therapy. A large and growing amount of cancer genetic data is now publicly available. There is a need for public domain tools that allow users to analyze their data and visualize the results. This chapter describes a web-based software tool that will allow researchers to analyze and visualize Comparative Genomic Hybridization (CGH) datasets. It employs novel data mining methodologies for clustering and classification of CGH datasets as well as algorithms for identifying important markers (small sets of genomic intervals with aberrations) that are potentially cancer signatures. The developed software will help in understanding the relationships between genomic aberrations and cancer types.

  12. Mining Behavior Based Safety Data to Predict Safety Performance

    Energy Technology Data Exchange (ETDEWEB)

    Jeffrey C. Joe

    2010-06-01

    The Idaho National Laboratory (INL) operates a behavior based safety program called Safety Observations Achieve Results (SOAR). This peer-to-peer observation program encourages employees to perform in-field observations of each other's work practices and habits (i.e., behaviors). The underlying premise of conducting these observations is that more serious accidents are prevented from occurring because lower level “at risk” behaviors are identified and corrected before they can propagate into culturally accepted “unsafe” behaviors that result in injuries or fatalities. Although the approach increases employee involvement in safety, the premise of the program has not been subject to sufficient empirical evaluation. The INL now has a significant amount of SOAR data on these lower level “at risk” behaviors. This paper describes the use of data mining techniques to analyze these data to determine whether they can predict if and when a more serious accident will occur.

  13. Prediksi Pendapatan Sewa Dengan Data Mining Pada Perusahaan XYZ

    Directory of Open Access Journals (Sweden)

    May Liana

    2010-12-01

    Full Text Available XYZ Company has a program for predicting leasing income, but it only predicts under a constant condition in which every tenant is assumed to renew the lease. This research was carried out to build an accurate income prediction system that supports the company's strategic decision-making. Primary data were collected through direct interviews with the company's management. The analysis used data from previous years as training data to build a neural network model. The results show that this model produces a total error value smaller than the total error value of previous years. It can therefore be concluded that data mining with the neural network technique produces a more accurate leasing income prediction, which can help the company make decisions based on information hidden in the database.

  14. Application of Data Mining in Library-Based Personalized Learning

    Directory of Open Access Journals (Sweden)

    Lin Luo

    2017-12-01

    Full Text Available This paper describes how to mine data with the DBSCAN algorithm in order to help teachers and students find the books they want in the library's large collection. First, a model for applying the DBSCAN algorithm to library data mining is proposed, and the algorithm is then improved to meet these demands. Finally, an experiment is presented to validate the algorithm. The results show that book price and inventory level have less impact on the resulting clusters than book classification and borrowing frequency. Library procurers should therefore purchase and subscribe on the basis of the cluster analysis results in order to improve the hierarchy and structural distribution of library resources, making them more scientific and reasonable, which also helps to arouse readers' interest in borrowing.
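
    For readers unfamiliar with DBSCAN, the following is a minimal sketch of the standard algorithm, not the improved variant this abstract refers to. The feature pairs (for example a scaled book-category code and a scaled borrowing frequency) and the `eps` and `min_pts` values are made up for illustration.

```python
import math

def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    def neighbours(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)    # None means "not visited yet"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = -1           # provisional noise; may become a border point
            continue
        cluster += 1                 # i is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reachable from a core point is a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbours(j)
            if len(nbrs) >= min_pts: # j is itself a core point, so keep expanding
                queue.extend(nbrs)
    return labels

# Made-up, already scaled feature pairs for seven books.
books = [(1.0, 1.0), (1.1, 0.9), (0.9, 1.1),
         (5.0, 5.0), (5.1, 4.9), (4.9, 5.1),
         (9.0, 0.0)]
labels = dbscan(books, eps=0.5, min_pts=3)
```

    With these values the first three books form one cluster, the next three form a second, and the last book is labelled as noise. In practice the result depends strongly on how the features are scaled and on the choice of `eps`.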

  15. Data Mining Relationships Among Urban Socioeconomic, Land Cover, and Remotely Sensed Ecological Data

    Science.gov (United States)

    Mennis, J.; Wessman, C.; Golubiewski, N.

    2003-12-01

    This research investigates the relationships among socioeconomic character, land cover, and ecological function in a rapidly urbanizing region, the Front Range of Colorado. We use novel spatial geographic information systems- (GIS-) based data integration and data mining techniques to integrate and analyze diverse spatial data sets. These data include elevation data, transportation data, land cover data derived from aerial photography, block group-level U.S. Census data, and vegetation greenness (NDVI) data derived from Landsat imagery. These data are used to derive a variety of U.S. block group-level variables indicating demographic, geographic, ecological, and land cover characteristics. We employ spatial association rule mining, decision tree induction, and spatial on-line analytical processing (OLAP), in addition to more conventional multivariate statistical techniques, to investigate relationships among these variables.
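
    As a rough illustration of association rule mining over block-group attributes, the sketch below computes the support and confidence of a single rule. The attribute labels and the five "block groups" are invented, and the spatial aspects of the method described in this abstract (neighborhood relations between block groups) are not modeled.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent) for the rule antecedent -> consequent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Invented categorical attributes for five block groups.
block_groups = [
    {"high_density", "low_ndvi", "near_highway"},
    {"high_density", "low_ndvi"},
    {"low_density", "high_ndvi"},
    {"high_density", "high_ndvi"},
    {"low_density", "high_ndvi", "near_highway"},
]

rule_support = support(block_groups, {"high_density", "low_ndvi"})
rule_confidence = confidence(block_groups, {"high_density"}, {"low_ndvi"})
```

    Here the rule "high_density -> low_ndvi" holds in two of five block groups (support 0.4) and in two of the three high-density block groups (confidence of about 0.67).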

  16. A practitioners guide to resampling for data analysis, data mining, and modeling: A cookbook for starters

    NARCIS (Netherlands)

    van den Broek, Egon

    A practitioner’s guide to resampling for data analysis, data mining, and modeling provides a gentle and pragmatic introduction to the proposed topics. Its supporting Web site was offline and, hence, its potentially added value could not be verified. The book refrains from using advanced mathematics

  17. Web based parallel/distributed medical data mining using software agents

    Energy Technology Data Exchange (ETDEWEB)

    Kargupta, H.; Stafford, B.; Hamzaoglu, I.

    1997-12-31

    This paper describes an experimental parallel/distributed data mining system PADMA (PArallel Data Mining Agents) that uses software agents for local data accessing and analysis and a web based interface for interactive data visualization. It also presents the results of applying PADMA for detecting patterns in unstructured texts of postmortem reports and laboratory test data for Hepatitis C patients.

  18. Some remarks on parallel data mining using a persistent object manager

    International Nuclear Information System (INIS)

    Araujo, Neil; Grossman, Robert; Hanley, David

    1996-01-01

    Our underlying assumption is that high performance data management will be as important as high performance computing by the beginning of the next millennium. Given this, data mining will take on increasing importance. In this paper, we discuss our experience with parallel data mining on an IBM SP-2, focusing on four issues which we feel are emerging as critical for data mining applications in general. (author)

  19. Data Mining CMMSs: How to Convert Data into Knowledge.

    Science.gov (United States)

    Fennigkoh, Larry; Nanney, D Courtney

    2018-01-01

    Although the healthcare technology management (HTM) community has decades of accumulated medical device-related maintenance data, little knowledge has been gleaned from these data. Finding and extracting such knowledge requires the use of the well-established, but admittedly somewhat foreign to HTM, application of inferential statistics. This article sought to provide a basic background on inferential statistics and describe a case study of their application, limitations, and proper interpretation. The research question associated with this case study involved examining the effects of ventilator preventive maintenance (PM) labor hours, age, and manufacturer on needed unscheduled corrective maintenance (CM) labor hours. The study sample included more than 21,000 combined PM inspections and CM work orders on 2,045 ventilators from 26 manufacturers during a five-year period (2012-16). A multiple regression analysis revealed that device age, manufacturer, and accumulated PM inspection labor hours all influenced the amount of CM labor significantly (P < 0.001). In essence, CM labor hours increased with increasing PM labor. However, and despite the statistical significance of these predictors, the regression analysis also indicated that ventilator age, manufacturer, and PM labor hours only explained approximately 16% of all variability in CM labor, with the remainder (84%) caused by other factors that were not included in the study. As such, the regression model obtained here is not suitable for predicting ventilator CM labor hours.
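
    The distinction this abstract draws between a statistically significant predictor and a useful predictive model rests on the coefficient of determination, R². The sketch below fits a one-predictor least-squares line in plain Python and reports R². The numbers are invented and are not the study's data; the study itself used multiple regression with several predictors.

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def r_squared(x, y, intercept, slope):
    """Share of the variance in y explained by the fitted line."""
    mean_y = sum(y) / len(y)
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - mean_y) ** 2 for b in y)
    return 1 - ss_res / ss_tot

# Invented example: PM labor hours (x) against CM labor hours (y).
pm_hours = [1, 2, 3, 4, 5, 6]
cm_hours = [5, 1, 6, 2, 4, 6]

intercept, slope = fit_line(pm_hours, cm_hours)
r2 = r_squared(pm_hours, cm_hours, intercept, slope)
```

    In this toy data the slope is positive, yet R² is only about 0.06, so the line leaves most of the variation in `cm_hours` unexplained. That is the same kind of situation the abstract describes when it reports that roughly 16% of variability was explained.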

  20. Telecare service activity analysis using Big Data and Data Mining

    Directory of Open Access Journals (Sweden)

    Alfredo Moreno Muñoz

    2017-01-01

    Full Text Available The use of Big Data is currently gaining strength and great relevance. The largest companies in the social and service sectors are using Big Data technologies that allow them to store and process all the information they hold about users and, in a second step, to bring the knowledge gained from processing this information into users' lives, in order to improve the services offered and take the next step in the customer/company relationship. In telecare, with IP technology in the telecare unit, communication between the unit and the control centre will take place over the internet instead of telephone cable. Companies will start to use these technologies to store all the information that the unit sends to the control centre. With all this information, companies will be able to discover patterns in users' behavior and detect some illnesses such as Alzheimer's. Most importantly, companies will have more information about the status of all devices and sensors installed in the user's home when the emergency alarm is raised.

  1. A novel water quality data analysis framework based on time-series data mining.

    Science.gov (United States)

    Deng, Weihui; Wang, Guoyin

    2017-07-01

    The rapid development of time-series data mining provides an emerging method for water resource management research. In this paper, based on the time-series data mining methodology, we propose a novel and general analysis framework for water quality time-series data. It consists of two parts: implementation components and common tasks of time-series data mining in water quality data. In the first part, we propose to granulate the time series into several two-dimensional normal clouds and calculate the similarities at the granulated level. On the basis of the similarity matrix, the similarity search, anomaly detection, and pattern discovery tasks in the water quality time-series instance dataset can be easily implemented in the second part. We present a case study of this analysis framework on weekly dissolved oxygen (DO) time-series data collected from five monitoring stations on the upper reaches of the Yangtze River, China. It discovered the relationship of water quality in the mainstream and tributary as well as the main changing patterns of DO. The experimental results show that the proposed analysis framework is a feasible and efficient method to mine the hidden and valuable knowledge from water quality historical time-series data. Copyright © 2017 Elsevier Ltd. All rights reserved.
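
    The two-stage idea in this abstract (granulate first, then work from a similarity matrix) can be sketched in a heavily simplified form. The code below does not implement the paper's two-dimensional normal cloud model; as a stand-in it summarises each window by its mean and standard deviation and uses a distance-based similarity. The weekly dissolved oxygen values (mg/L) are invented.

```python
import math
from statistics import mean, pstdev

def granulate(series, width):
    """Summarise consecutive windows as (mean, standard deviation) pairs."""
    return [(mean(series[i:i + width]), pstdev(series[i:i + width]))
            for i in range(0, len(series) - width + 1, width)]

def similarity_matrix(granules):
    """Similarity in (0, 1]: 1 for identical granules, smaller when further apart."""
    return [[1.0 / (1.0 + math.dist(a, b)) for b in granules] for a in granules]

def most_anomalous(sim):
    """Index of the granule with the lowest average similarity to the others."""
    n = len(sim)
    avg = [(sum(row) - row[i]) / (n - 1) for i, row in enumerate(sim)]
    return min(range(n), key=avg.__getitem__)

# Invented weekly dissolved oxygen readings with one low-oxygen episode.
do_series = [8.1, 8.0, 8.2, 8.1, 8.0, 8.1, 8.2, 8.0,
             4.9, 5.2, 5.0, 5.1, 8.1, 8.2, 8.0, 8.1]

granules = granulate(do_series, width=4)
sim = similarity_matrix(granules)
anomaly = most_anomalous(sim)
```

    The third window (index 2) stands out because its mean differs sharply from the others. Similarity search and pattern discovery could be built on the same matrix, for example by ranking or clustering its rows.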

  2. Collecting, storing, and mining research data in a brain bank.

    Science.gov (United States)

    Webster, Maree J; Kim, Sanghyeon

    2018-01-01

    The Stanley Medical Research Institute Brain Collection distributes samples from specified cohorts that contain demographically matched groups of subjects with mental illnesses such as schizophrenia, bipolar disorder, and major depression, as well as unaffected controls. The groups are matched by age, sex, race, postmortem interval, pH, side of brain, and mRNA quality. The samples are distributed coded so that all data must be returned in order to obtain the demographic information. The database contains more than 5000 individual data sets, as well as data from high-throughput microarray, sequencing, and proteomic studies. While most data were generated from the frontal cortex and hippocampus, the cerebellum has the most data sets that differ significantly between diagnostic groups and controls. The database contains interactive features and statistical tools that enable online data mining and real-time data analysis. The decrease in density of parvalbumin-positive neurons in schizophrenia, one of the most replicated findings in the field, is used to illustrate features of the brain bank. We describe how this finding can be replicated and validated in this database. We also show how the density of parvalbumin-positive neurons is correlated with markers of immune activation in the neuropathology data sets, how it is correlated with immune-related genes in a microarray data set, and how it is associated with a single-nucleotide polymorphism in the immune complement system. Copyright © 2018 Elsevier B.V. All rights reserved.

  3. Data Integration and Mining for Synthetic Biology Design.

    Science.gov (United States)

    Mısırlı, Göksel; Hallinan, Jennifer; Pocock, Matthew; Lord, Phillip; McLaughlin, James Alastair; Sauro, Herbert; Wipat, Anil

    2016-10-21

    One aim of synthetic biologists is to create novel and predictable biological systems from simpler modular parts. This approach is currently hampered by a lack of well-defined and characterized parts and devices. However, there is a wealth of existing biological information, which can be used to identify and characterize biological parts, and their design constraints in the literature and numerous biological databases. However, this information is spread among these databases in many different formats. New computational approaches are required to make this information available in an integrated format that is more amenable to data mining. A tried and tested approach to this problem is to map disparate data sources into a single data set, with common syntax and semantics, to produce a data warehouse or knowledge base. Ontologies have been used extensively in the life sciences, providing this common syntax and semantics as a model for a given biological domain, in a fashion that is amenable to computational analysis and reasoning. Here, we present an ontology for applications in synthetic biology design, SyBiOnt, which facilitates the modeling of information about biological parts and their relationships. SyBiOnt was used to create the SyBiOntKB knowledge base, incorporating and building upon existing life sciences ontologies and standards. The reasoning capabilities of ontologies were then applied to automate the mining of biological parts from this knowledge base. We propose that this approach will be useful to speed up synthetic biology design and ultimately help facilitate the automation of the biological engineering life cycle.

  4. Data mining concepts, methods and applications in management and engineering design

    CERN Document Server

    Yin, Yong; Tang, Jiafu; Zhu, JianMing

    2011-01-01

    Data Mining introduces in clear and simple ways how to use existing data mining methods to obtain effective solutions for a variety of management and engineering design problems. Data Mining is organised into two parts: the first provides a focused introduction to data mining and the second goes into greater depth on subjects such as customer analysis. It covers almost all managerial activities of a company, including supply chain design, product development, manufacturing system design, product quality control, and preservation of privacy. Incorporating recent developments of data

  5. Advanced Query and Data Mining Capabilities for MaROS

    Science.gov (United States)

    Wang, Paul; Wallick, Michael N.; Allard, Daniel A.; Gladden, Roy E.; Hy, Franklin H.

    2013-01-01

    The Mars Relay Operational Service (MaROS) comprises a number of tools to coordinate, plan, and visualize various aspects of the Mars Relay network. As part of MaROS, the innovators have developed and implemented a feature set that operates on several levels of the software architecture. These levels include a Web-based user interface, a back-end "ReSTlet" built in Java, and databases that store the data as it is received from the network. This new feature is an advanced querying capability, available through either the Web-based user interface or a back-end REST interface, for accessing all of the data gathered from the network. This software is not meant to replace the REST interface, but to augment and expand the range of available data. The current REST interface provides specific data that is used by the MaROS Web application to display and visualize the information; however, the returned information from the REST interface has typically been pre-processed to return only a subset of the entire information within the repository, particularly only the information that is of interest to the GUI (graphical user interface). The new, advanced query and data mining capabilities allow users to retrieve the raw data and/or to perform their own data processing. The query language used to access the repository is a restricted subset of the structured query language (SQL) that can be built safely from the Web user interface, or entered as freeform SQL by a user. The results are returned in a CSV (Comma Separated Values) format for easy exporting to third party tools and applications that can be used for data mining or user-defined visualization and interpretation. This is the first time that a service is capable of providing access to all cross-project relay data from a single Web resource. Because MaROS contains the data for a variety of missions from the Mars network, which span both NASA and ESA, the software also establishes an access control list (ACL) on each data record
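
    The combination of a restricted SQL subset with CSV output can be sketched generically. The example below is not the MaROS implementation (which this abstract describes as Java-based); it uses Python's standard `sqlite3` and `csv` modules, and the table `relay_pass`, its columns, its rows, and the keyword filter are all hypothetical. The filter is deliberately conservative and would also reject a harmless query that merely mentions a blocked word inside a string literal.

```python
import csv
import io
import re
import sqlite3

# Keywords that are never allowed in this illustrative SELECT-only subset.
FORBIDDEN = re.compile(
    r"\b(insert|update|delete|drop|alter|create|attach|pragma)\b", re.I)

def is_allowed(query):
    """Accept only a single SELECT statement with no write or schema keywords."""
    q = query.strip().rstrip(";")
    return (q.lower().startswith("select")
            and ";" not in q
            and not FORBIDDEN.search(q))

def query_to_csv(conn, query):
    """Run an allowed query and return the result set as CSV text with a header row."""
    if not is_allowed(query):
        raise ValueError("query is outside the permitted SELECT-only subset")
    cur = conn.execute(query)
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(col[0] for col in cur.description)  # column names
    writer.writerows(cur)                               # data rows
    return out.getvalue()

# Hypothetical in-memory repository with made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE relay_pass (orbiter TEXT, lander TEXT, data_mb REAL)")
conn.executemany("INSERT INTO relay_pass VALUES (?, ?, ?)",
                 [("ORB-A", "LANDER-1", 55.0), ("ORB-B", "LANDER-1", 120.5)])

report = query_to_csv(
    conn, "SELECT orbiter, data_mb FROM relay_pass WHERE data_mb > 100")
```

    A production service would more likely parse the query properly and run it over a read-only connection with per-record access checks; the keyword filter here only illustrates the idea of a safe subset.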

  6. Application of Data Mining Algorithm to Recipient of Motorcycle Installment

    Directory of Open Access Journals (Sweden)

    Harry Dhika

    2015-12-01

    Full Text Available The study was conducted at subsidiaries that provide financing services for the purchase of motorcycles on credit. At the time of applying, consumers enter their personal data. Based on the personal data, it will be known whether the consumer credit application is approved or rejected. From 224 consumer data obtained, it is known that the number of consumers whose applications are approved is 87% or about 217 consumers and consumers whose application is rejected is 16% or as much as 6 consumers. Acceptance of motorcycle financing on credit is analyzed by applying the algorithm through CRISP-DM, the industry-standard process for data mining. The algorithm used in the decision making is the C4.5 algorithm. For the results obtained, the level of accuracy is measured with the Confusion Matrix and the Receiver Operating Characteristic (ROC). Evaluation with the Confusion Matrix is intended to find the accuracy, precision, and recall values, while the Receiver Operating Characteristic (ROC) is used to obtain the data tables and compare the Area Under the Curve (AUC).
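
    The accuracy, precision, and recall values named in this abstract are all derived from the four cells of a confusion matrix. The sketch below shows the arithmetic on ten invented approve/reject decisions; the labels and counts are illustrative and unrelated to the study's data.

```python
def confusion_counts(actual, predicted, positive):
    """Count true/false positives and negatives for one positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    tn = sum(a != positive and p != positive for a, p in pairs)
    return tp, fp, fn, tn

def metrics(tp, fp, fn, tn):
    """Accuracy, precision, and recall from the four confusion-matrix cells."""
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }

# Invented outcomes for ten applications.
actual = ["approved"] * 6 + ["rejected"] * 4
predicted = ["approved"] * 5 + ["rejected"] + ["approved"] + ["rejected"] * 3

counts = confusion_counts(actual, predicted, positive="approved")
scores = metrics(*counts)
```

    With strongly imbalanced classes, accuracy alone can look high even when the minority class is predicted poorly, which is one reason precision, recall, and AUC are usually reported alongside it.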

  7. Information Extraction for Clinical Data Mining: A Mammography Case Study.

    Science.gov (United States)

    Nassif, Houssam; Woods, Ryan; Burnside, Elizabeth; Ayvaci, Mehmet; Shavlik, Jude; Page, David

    2009-01-01

    Breast cancer is the leading cause of cancer mortality in women between the ages of 15 and 54. During mammography screening, radiologists use a strict lexicon (BI-RADS) to describe and report their findings. Mammography records are then stored in a well-defined database format (NMD). Lately, researchers have applied data mining and machine learning techniques to these databases. They successfully built breast cancer classifiers that can help in early detection of malignancy. However, the validity of these models depends on the quality of the underlying databases. Unfortunately, most databases suffer from inconsistencies, missing data, inter-observer variability and inappropriate term usage. In addition, many databases are not compliant with the NMD format and/or solely consist of text reports. BI-RADS feature extraction from free text and consistency checks between recorded predictive variables and text reports are crucial to addressing this problem. We describe a general scheme for concept information retrieval from free text given a lexicon, and present a BI-RADS features extraction algorithm for clinical data mining. It consists of a syntax analyzer, a concept finder and a negation detector. The syntax analyzer preprocesses the input into individual sentences. The concept finder uses a semantic grammar based on the BI-RADS lexicon and the experts' input. It parses sentences detecting BI-RADS concepts. Once a concept is located, a lexical scanner checks for negation. Our method can handle multiple latent concepts within the text, filtering out ultrasound concepts. On our dataset, our algorithm achieves 97.7% precision, 95.5% recall and an F1-score of 0.97. It outperforms manual feature extraction at the 5% statistical significance level.
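
    The three-stage pipeline in this abstract (sentence splitting, concept finding, negation detection) can be illustrated with a toy version. This is only a sketch under strong assumptions: the two-term lexicon, the negation cue list, and the three-token window are invented here, whereas the paper uses a semantic grammar built from the BI-RADS lexicon and expert input.

```python
import re

# Toy lexicon mapping surface terms to concept labels (illustrative only).
LEXICON = {"mass": "MASS",
           "calcification": "CALCIFICATION",
           "calcifications": "CALCIFICATION"}
NEGATION_CUES = {"no", "not", "without", "negative"}

def split_sentences(text):
    """Syntax-analysis stand-in: split the report after ., ! or ?"""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def find_concepts(sentence, window=3):
    """Return (concept, negated) pairs for each lexicon term in the sentence.

    A term counts as negated if a negation cue occurs within `window`
    tokens before it.
    """
    tokens = re.findall(r"[a-z]+", sentence.lower())
    found = []
    for i, tok in enumerate(tokens):
        if tok in LEXICON:
            negated = any(t in NEGATION_CUES for t in tokens[max(0, i - window):i])
            found.append((LEXICON[tok], negated))
    return found

def extract(text):
    """Run the whole pipeline over a free-text report."""
    return [c for s in split_sentences(text) for c in find_concepts(s)]

report = ("There is an oval mass in the left breast. "
          "No suspicious calcifications are seen.")
concepts = extract(report)
```

    On this invented report the sketch yields an affirmed MASS concept and a negated CALCIFICATION concept. A fixed token window is a crude rule: it misses negations that follow the term and can misfire across clause boundaries.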

  8. Data mining learning bootstrap through semantic thumbnail analysis

    Science.gov (United States)

    Battiato, Sebastiano; Farinella, Giovanni Maria; Giuffrida, Giovanni; Tribulato, Giuseppe

    2007-01-01

    The rapid increase of technological innovations in the mobile phone industry induces the research community to develop new and advanced systems to optimize the services offered by mobile phone operators (telcos), maximize their effectiveness, and improve their business. Data mining algorithms can run over data produced by mobile phone usage (e.g. image, video, text and log files) to discover users' preferences and predict the offer most likely to be purchased by each individual customer. One of the main challenges is reducing the learning time and cost of these automatic tasks. In this paper we discuss an experiment in which a commercial offer is composed of a small picture augmented with a short text describing the offer itself. Each customer's purchase is properly logged with all relevant information. Upon arrival of new items we need to learn who the best customers (prospects) for each item are, that is, the ones most likely to be interested in purchasing that specific item. Such learning activity is time consuming and, in our specific case, is not feasible given the large number of new items arriving every day. Basically, given the current customer base we are not able to learn on all new items. Thus, we need to select among those new items to identify the best candidates. We do so by using a joint analysis of visual features and text to estimate how good each new item could be, that is, whether or not it is worth learning on it. Preliminary results show the effectiveness of the proposed approach in improving classical data mining techniques.

  9. A Quantitative Analysis of Organizational Factors That Relate to Data Mining Success

    Science.gov (United States)

    Huebner, Richard A.

    2017-01-01

    The ubiquity of data in various forms has fueled the need for advanced data-mining techniques within organizations. The advent of data mining methods used to uncover hidden nuggets of information buried within large data sets has also fueled the need for determining how these unique projects can be successful. There are many challenges associated…

  10. Data mining and Pattern Recognizing Models for Identifying Inherited Diseases: Challenges and Implications

    OpenAIRE

    Lahiru Iddamalgoda; Partha Sarathi Das; Achala Aponso; Vijayaraghava Seshadri Sundararajan; Prashanth Suravajhala; Jayaraman K Valadi

    2016-01-01

    Data mining and pattern recognition methods reveal interesting findings in genetic studies, especially on how genetic makeup is associated with inherited diseases. Although researchers have proposed various data mining models for biomedical approaches, there remains a challenge in accurately determining the responsible genetic factors for prioritizing the single nucleotide polymorphisms (SNP) associated with the disease. In this commentary, we review the state-of-art data mining and pattern r...

  11. Data Mining and Pattern Recognition Models for Identifying Inherited Diseases: Challenges and Implications

    OpenAIRE

    Iddamalgoda, Lahiru; Das, Partha S.; Aponso, Achala; Sundararajan, Vijayaraghava S.; Suravajhala, Prashanth; Valadi, Jayaraman K.

    2016-01-01

    Data mining and pattern recognition methods reveal interesting findings in genetic studies, especially on how the genetic makeup is associated with inherited diseases. Although researchers have proposed various data mining models for biomedical approaches, there remains a challenge in accurately prioritizing the single nucleotide polymorphisms (SNP) associated with the disease. In this commentary, we review the state-of-art data mining and pattern recognition models for identifying inherited ...

  12. Data Mining Foundations and Intelligent Paradigms Volume 2 Statistical, Bayesian, Time Series and other Theoretical Aspects

    CERN Document Server

    Jain, Lakhmi

    2012-01-01

    Data mining is one of the most rapidly growing research areas in computer science and statistics. In Volume 2 of this three volume series, we have brought together contributions from some of the most prestigious researchers in theoretical data mining. Each of the chapters is self contained. Statisticians and applied scientists/ engineers will find this volume valuable. Additionally, it provides a sourcebook for graduate students interested in the current direction of research in data mining.

  13. Data Warehouse, Data Mining Dan Konsep Cross-Selling Pada Analisis Penjualan Produk

    Directory of Open Access Journals (Sweden)

    Eka Miranda

    2010-12-01

    Full Text Available This paper discusses the design and implementation of data warehousing and data mining and their roles in supporting decision-making for product sales analysis under the cross-selling concept at PT XYZ. The database the company currently uses does not support data analysis and decision-making. A data warehouse was therefore designed that can store large amounts of data, produce reports, and answer users' ad hoc questions. The method used to design and implement the data warehouse and data mining consists of a literature study, analysis of the company's problems, data warehouse design, and testing of the results. The results are a data warehouse and data mining design, together with an implementation of the cross-selling concept for analyzing sales, purchase, and customer cancellation data. The data can be displayed and analyzed from several points of view, which helps managers analyze it and obtain more information.

  14. Mining manufacturing data for discovery of high productivity process characteristics.

    Science.gov (United States)

    Charaniya, Salim; Le, Huong; Rangwala, Huzefa; Mills, Keri; Johnson, Kevin; Karypis, George; Hu, Wei-Shou

    2010-06-01

    Modern manufacturing facilities for bioproducts are highly automated with advanced process monitoring and data archiving systems. The time dynamics of hundreds of process parameters and outcome variables over a large number of production runs are archived in the data warehouse. This vast amount of data is a vital resource to comprehend the complex characteristics of bioprocesses and enhance production robustness. Cell culture process data from 108 'trains' comprising production as well as inoculum bioreactors from Genentech's manufacturing facility were investigated. Each run constitutes over one-hundred on-line and off-line temporal parameters. A kernel-based approach combined with a maximum margin-based support vector regression algorithm was used to integrate all the process parameters and develop predictive models for a key cell culture performance parameter. The model was also used to identify and rank process parameters according to their relevance in predicting process outcome. Evaluation of cell culture stage-specific models indicates that production performance can be reliably predicted days prior to harvest. Strong associations between several temporal parameters at various manufacturing stages and final process outcome were uncovered. This model-based data mining represents an important step forward in establishing a process data-driven knowledge discovery in bioprocesses. Implementation of this methodology on the manufacturing floor can facilitate a real-time decision making process and thereby improve the robustness of large scale bioprocesses. 2010 Elsevier B.V. All rights reserved.

  15. Utilization of Selected Data Mining Methods for Communication Network Analysis

    Directory of Open Access Journals (Sweden)

    V. Ondryhal

    2011-06-01

    Full Text Available The aim of the project was to analyze the behavior of military communication networks based on work with real data collected continuously since 2005. With regard to the nature and amount of the data, data mining methods were selected for the purpose of analyses and experiments. The quality of real data is often insufficient for an immediate analysis. The article presents the data cleaning operations which have been carried out with the aim of improving the input data sample to obtain reliable models. Gradually, by means of properly chosen software, network models were developed to verify generally valid patterns of network behavior as a bulk service. Furthermore, unlike the commercially available communication network simulators, the models designed allowed us to capture nonstandard models of network behavior under an increased load, verify the correct sizing of the network for the increased load, and thus test its reliability. Finally, based on previous experience, the models enabled us to predict emergency situations with reasonable accuracy.

  16. Grist: Grid-based Data Mining for Astronomy

    Science.gov (United States)

    Jacob, J. C.; Katz, D. S.; Miller, C. D.; Walia, H.; Williams, R. D.; Djorgovski, S. G.; Graham, M. J.; Mahabal, A. A.; Babu, G. J.; vanden Berk, D. E.; Nichol, R.

    2005-12-01

    The Grist project is developing a grid-technology based system as a research environment for astronomy with massive and complex datasets. This knowledge extraction system will consist of a library of distributed grid services controlled by a workflow system, compliant with standards emerging from the grid computing, web services, and virtual observatory communities. This new technology is being used to find high redshift quasars, study peculiar variable objects, search for transients in real time, and fit SDSS QSO spectra to measure black hole masses. Grist services are also a component of the "hyperatlas" project to serve high-resolution multi-wavelength imagery over the Internet. In support of these science and outreach objectives, the Grist framework will provide the enabling fabric to tie together distributed grid services in the areas of data access, federation, mining, subsetting, source extraction, image mosaicking, statistics, and visualization.

  17. Healthcare Scheduling by Data Mining: Literature Review and Future Directions

    Directory of Open Access Journals (Sweden)

    Maria M. Rinder

    2012-01-01

    Full Text Available This article presents a systematic literature review of the application of industrial engineering methods in healthcare scheduling, with a focus on the role of patient behavior in scheduling. Nine articles that used mathematical programming, data mining, genetic algorithms, and local searches for optimum schedules were obtained from an extensive search of the literature. These methods are new approaches to solving the problems in healthcare scheduling. Some are adapted from areas such as manufacturing and transportation. Key findings from these studies include reduced time for scheduling, the capability of solving more complex problems, and the incorporation of more variables and constraints simultaneously than traditional scheduling methods. However, none of these methods modeled no-show and walk-in patient behavior. Future research should include more variables related to the patient and/or the environment.

  18. Data mining techniques for thermophysical properties of refrigerants

    International Nuclear Information System (INIS)

    Kuecueksille, Ecir Ugur; Selbas, Resat; Sencan, Arzu

    2009-01-01

    This study presents ten modeling techniques within the data mining process for predicting the thermophysical properties of refrigerants (R134a, R404a, R407c and R410a). These are linear regression (LR), multilayer perceptron (MLP), pace regression (PR), simple linear regression (SLR), sequential minimal optimization (SMO), KStar, additive regression (AR), M5 model tree, decision table (DT), and M5'Rules models. Relations depending on temperature and pressure were derived to determine thermophysical properties such as the specific heat capacity, viscosity, heat conduction coefficient, and density of the refrigerants. The model results obtained for every refrigerant were compared and the best model was determined. The results indicate that formulations derived from these techniques will facilitate the design and optimization of heat exchangers, which are components of vapor compression refrigeration systems in particular.

  19. CrossRef text and data mining services

    Directory of Open Access Journals (Sweden)

    Rachael Lammey

    2015-02-01

    Full Text Available CrossRef is an association of scholarly publishers that develops shared infrastructure to support more effective scholarly communications. It is a registration agency for the digital object identifier (DOI), and has built additional services for CrossRef members around the DOI and the bibliographic metadata that publishers deposit in order to register DOIs for their publications. Among these services are CrossCheck, powered by iThenticate, which helps publishers screen for plagiarism in submitted manuscripts, and FundRef, which gives publishers a standard way to report funding sources for published scholarly research. To add to these services, CrossRef launched CrossRef text and data mining services in May 2014. This article explains the thinking behind CrossRef launching this new service, what it offers to publishers and researchers alike, how publishers can participate in it, and the uptake of the service so far.

  20. Grist : grid-based data mining for astronomy

    Science.gov (United States)

    Jacob, Joseph C.; Katz, Daniel S.; Miller, Craig D.; Walia, Harshpreet; Williams, Roy; Djorgovski, S. George; Graham, Matthew J.; Mahabal, Ashish; Babu, Jogesh; Berk, Daniel E. Vanden

    2004-01-01

    The Grist project is developing a grid-technology based system as a research environment for astronomy with massive and complex datasets. This knowledge extraction system will consist of a library of distributed grid services controlled by a workflow system, compliant with standards emerging from the grid computing, web services, and virtual observatory communities. This new technology is being used to find high redshift quasars, study peculiar variable objects, search for transients in real time, and fit SDSS QSO spectra to measure black hole masses. Grist services are also a component of the 'hyperatlas' project to serve high-resolution multi-wavelength imagery over the Internet. In support of these science and outreach objectives, the Grist framework will provide the enabling fabric to tie together distributed grid services in the areas of data access, federation, mining, subsetting, source extraction, image mosaicking, statistics, and visualization.

  1. Base Oils Biodegradability Prediction with Data Mining Techniques

    Directory of Open Access Journals (Sweden)

    Malika Trabelsi

    2010-02-01

    Full Text Available In this paper, we apply various data mining techniques including continuous numeric and discrete classification prediction models of base oils biodegradability, with emphasis on improving prediction accuracy. The results show that highly biodegradable oils can be better predicted through numeric models. In contrast, classification models did not uncover a similar dichotomy. With the exception of Memory Based Reasoning and Decision Trees, tested classification techniques achieved high classification prediction. However, the technique of Decision Trees helped uncover the most significant predictors. A simple classification rule derived based on this predictor resulted in good classification accuracy. The application of this rule enables efficient classification of base oils into either low or high biodegradability classes with high accuracy. For the latter, a higher precision biodegradability prediction can be obtained using continuous modeling techniques.

  2. A Proposed Data Fusion Architecture for Micro-Zone Analysis and Data Mining

    Energy Technology Data Exchange (ETDEWEB)

    Kevin McCarthy; Milos Manic

    2012-08-01

    Data Fusion requires the ability to combine or “fuse” data from multiple data sources. Time Series Analysis is a data mining technique used to predict future values from a data set based upon past values. Unlike other data mining techniques, however, Time Series places special emphasis on periodicity and how seasonal and other time-based factors tend to affect trends over time. One of the difficulties encountered in developing generic time series techniques is the wide variability of the data sets available for analysis. This presents challenges all the way from the data gathering stage to results presentation. This paper presents an architecture designed and used to facilitate the collection of disparate data sets well suited to Time Series analysis as well as other predictive data mining techniques. Results show this architecture provides a flexible, dynamic framework for the capture and storage of a myriad of dissimilar data sets and can serve as a foundation from which to build a complete data fusion architecture.

  3. Physics Mining of Multi-source Data Sets, Phase I

    Data.gov (United States)

    National Aeronautics and Space Administration — We propose to implement novel physics mining algorithms with analytical capabilities to derive diagnostic and prognostic numerical models from multi-source...

  4. A genetic algorithm approach to recognition and data mining

    Energy Technology Data Exchange (ETDEWEB)

    Punch, W.F.; Goodman, E.D.; Min, Pei [Michigan State Univ., East Lansing, MI (United States)] [and others

    1996-12-31

    We review here our use of genetic algorithm (GA) and genetic programming (GP) techniques to perform "data mining," the discovery of particular/important data within large datasets, by finding optimal data classifications using known examples. Our first experiments concentrated on the use of a K-nearest neighbor algorithm in combination with a GA. The GA selected weights for each feature so as to optimize knn classification based on a linear combination of features. This combined GA-knn approach was successfully applied to both generated and real-world data. We later extended this work by substituting a GP for the GA. The GP-knn could not only optimize data classification via linear combinations of features but also determine functional relationships among the features. This allowed for improved performance and new information on important relationships among features. We review the effectiveness of the overall approach on examples from biology and compare the effectiveness of the GA and GP.
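
    The GA-knn idea described above (a GA searching for per-feature weights that improve knn classification) can be illustrated with a minimal sketch. The abstract does not state which GA operators, fitness function, or distance were used, so everything here is an assumption: leave-one-out accuracy as fitness, truncation selection, one-point crossover, Gaussian mutation, and the hypothetical names knn_predict, loo_accuracy, evolve_weights and make_toy_data.

```python
import random


def knn_predict(train, query, weights, k=3):
    """Majority vote among the k nearest training points, using a
    feature-weighted squared Euclidean distance (assumed distance)."""
    def dist(x):
        return sum(w * (a - b) ** 2 for w, a, b in zip(weights, x, query))

    nearest = sorted(train, key=lambda item: dist(item[0]))[:k]
    votes = {}
    for _, label in nearest:
        votes[label] = votes.get(label, 0) + 1
    # Iterate labels in sorted order so ties break deterministically.
    return max(sorted(votes), key=lambda lab: votes[lab])


def loo_accuracy(data, weights, k=3):
    """Leave-one-out accuracy; used here as the GA fitness (an assumption)."""
    hits = 0
    for i, (x, y) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        hits += knn_predict(rest, x, weights, k) == y
    return hits / len(data)


def evolve_weights(data, n_features, pop_size=20, generations=30, seed=0):
    """Tiny elitist GA over non-negative feature-weight vectors."""
    rng = random.Random(seed)
    # Uniform weights plus random candidates form the initial population.
    pop = [[1.0] * n_features] + [
        [rng.random() for _ in range(n_features)] for _ in range(pop_size - 1)
    ]
    for _ in range(generations):
        pop.sort(key=lambda w: loo_accuracy(data, w), reverse=True)
        parents = pop[: pop_size // 2]  # truncation selection keeps the best
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_features) if n_features > 1 else 0
            child = a[:cut] + b[cut:]  # one-point crossover
            j = rng.randrange(n_features)  # mutate a single weight
            child[j] = max(0.0, child[j] + rng.gauss(0, 0.3))
            children.append(child)
        pop = parents + children
    return max(pop, key=lambda w: loo_accuracy(data, w))


def make_toy_data(seed=1, n=30):
    """Hypothetical data: feature 0 is informative, feature 1 is noise."""
    rng = random.Random(seed)
    data = []
    for i in range(n):
        label = i % 2
        data.append(((label + rng.gauss(0, 0.2), rng.uniform(-5, 5)), label))
    return data
```

    Because the uniform weight vector is in the initial population and the best individuals are always retained, the returned weights never score worse than unweighted knn on the fitness measure; on data like the toy set, the search would be expected to shrink the weight of the noise feature.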

  5. PRIVACY PRESERVING DATA MINING USING MULTIPLE OBJECTIVE OPTIMIZATION

    Directory of Open Access Journals (Sweden)

    V. Shyamala Susan

    2016-10-01

    Full Text Available Privacy preservation is the most prominent issue in data publishing, because sensitive data must not be leaked. For this reason, several privacy-preserving data mining algorithms have been proposed. In this work, feature selection using an evolutionary algorithm and data masking coupled with slicing is treated as a multiple objective optimisation to preserve privacy. First, a Genetic Algorithm (GA) is run over the datasets to identify the sensitive attributes and prioritise them for treatment according to their determined sensitivity level. In the next phase, to distort the data, noise is added to the more sensitive values using the Hybrid Data Transformation (HDT) method. In the following phase, the slicing algorithm groups the correlated attributes and thereby reduces the dimensionality, using the Advanced Clustering Algorithm (ACA). To obtain the optimal bucket dimensions, tuple partitioning is performed by the Metaheuristic Firefly Algorithm (MFA). The experimental results indicate that the proposed technique preserves confidentiality while keeping information utility high. The slicing algorithm preserves attribute associations and utility, which reduces both the data dimensionality and the information loss. Performance analysis was carried out on OCC 7 and OCC 15, and the optimization method proves its effectiveness on the two different datasets by showing 92.98% and 96.92% respectively.

  6. A Framework for Investigating Influence of Organizational Decision Makers on Data Mining Process Achievement

    Directory of Open Access Journals (Sweden)

    Hanieh Hajisafari

    2012-02-01

    Full Text Available Currently, few studies deal with the evaluation of data mining plans in the context of solving organizational problems. A successful data miner seeks to solve a fully defined business problem. To make the data mining (DM) results actionable, the data miner must explain them to the business insider. The interaction process between the business insiders and data miners is actually a knowledge-sharing process. This study presents a framework for investigating the influence of organizational decision makers on the data mining process and its results. By reviewing the research literature, the critical success factors of data mining plans were identified and the role of organizational decision makers in each step of data mining was investigated. Then, the conceptual framework of the influence of organizational decision makers on data mining process achievement was designed. The proposed framework was analyzed using expert opinions, and the final framework of the influence of organizational decision makers on data mining process achievement was eventually designed. Analysis of the experts' opinions showed that, when data mining results are shared with decision makers, "learning", "action or internalization" and "enforcing/unlearning" become critical success factors. Also, examining the importance of decision makers' feedback on the data mining steps showed that feedback from decision makers could have the most influence on the "knowledge extraction and representing model" step and the least on the "data cleaning and preprocessing" step.

  7. On the classification techniques in data mining for microarray data classification

    Science.gov (United States)

    Aydadenta, Husna; Adiwijaya

    2018-03-01

    Cancer is one of the deadliest diseases; according to WHO data, cancer caused 8.8 million deaths in 2015, and this number will increase every year if the problem is not addressed early. Microarray data have become one of the most popular subjects of cancer-identification studies in the health field, since microarray data can be used to examine gene expression levels in particular cell samples, allowing thousands of genes to be analyzed simultaneously. Using data mining techniques, we can classify microarray samples so that they can be identified as cancerous or not. In this paper we discuss research applying several data mining techniques to microarray data, such as Support Vector Machine (SVM), Artificial Neural Network (ANN), Naive Bayes, k-Nearest Neighbor (kNN), and C4.5, together with a simulation of the Random Forest algorithm with dimensionality reduction using Relief. The paper reports the performance measure (accuracy) of the classification algorithms (SVM, ANN, Naive Bayes, kNN, C4.5, and Random Forest). The results show that the accuracy of the Random Forest algorithm is higher than that of the other classification algorithms (Support Vector Machine (SVM), Artificial Neural Network (ANN), Naive Bayes, k-Nearest Neighbor (kNN), and C4.5). It is hoped that this paper can provide some information about the speed, accuracy, performance and computational cost of each data mining classification technique on microarray data.

  8. Data Mining in Course Management Systems: Moodle Case Study and Tutorial

    Science.gov (United States)

    Romero, Cristobal; Ventura, Sebastian; Garcia, Enrique

    2008-01-01

    Educational data mining is an emerging discipline, concerned with developing methods for exploring the unique types of data that come from the educational context. This work is a survey of the specific application of data mining in learning management systems and a case study tutorial with the Moodle system. Our objective is to introduce it both…

  9. Towards the generic framework for utility considerations in data mining research

    NARCIS (Netherlands)

    Puuronen, S.; Pechenizkiy, M.; Soares, C.; Ghani, R.

    2010-01-01

    Rigor data mining (DM) research has successfully developed advanced data mining techniques and algorithms, and many organizations have great expectations to take more benefit of their vast data warehouses in decision making. Even when there are some success stories the current status in practice is

  10. IBM SPSS modeler essentials effective techniques for building powerful data mining and predictive analytics solutions

    CERN Document Server

    McCormick, Keith; Wei, Bowen

    2017-01-01

    IBM SPSS Modeler allows quick, efficient predictive analytics and insight building from your data, and is a popularly used data mining tool. This book will guide you through the data mining process, and presents relevant statistical methods which are used to build predictive models and conduct other analytic tasks using IBM SPSS Modeler. From ...

  11. Fuzzy C-Means Clustering Model Data Mining For Recognizing Stock Data Sampling Pattern

    Directory of Open Access Journals (Sweden)

    Sylvia Jane Annatje Sumarauw

    2007-06-01

    Full Text Available Abstract The capital market has been beneficial to companies and investors. For investors, the capital market provides two economic advantages, namely dividends and capital gains, and a non-economic one, which is a voting share in the Shareholders General Meeting. However, it can also penalize the share owners. To protect themselves from this risk, investors should predict the prospects of their companies. Because a share is an abstract commodity, its quality is determined by the validity of the company profile information. Any information on stock value fluctuation from the Jakarta Stock Exchange can be a useful consideration and a good measurement for data analysis. In the context of protecting shareholders from risk, this research focuses on stock data sample categories or stock data sample patterns, using the Fuzzy c-Means Clustering Model to provide useful information for the investors. The research analyses stock data such as Individual Index, Volume and Amount for the Property and Real Estate Emitter Group at the Jakarta Stock Exchange from January 1 till December 31 of 204. The mining process follows the Cross Industry Standard Process model for Data Mining (CRISP-DM), in the form of a cycle with these steps: Business Understanding, Data Understanding, Data Preparation, Modelling, Evaluation and Deployment. In the modelling step, the Fuzzy c-Means Clustering Model is applied. The Data Mining Fuzzy c-Means Clustering Model can analyze stock data in a big database with many complex variables, especially for finding the data sample pattern, and then building a Fuzzy Inference System for stimulating inputs to be outputs based on Fuzzy Logic by recognising the pattern. Keywords: Data Mining, Fuzzy c-Means Clustering Model, Pattern Recognition
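
    As an illustration of the clustering step named above, the following is a minimal sketch of the standard fuzzy c-means update rules (membership-weighted centres, then memberships from inverse distances with fuzzifier m). It is not the study's implementation: the function name fuzzy_c_means, the random initialisation, the default m = 2 and the stopping tolerance are assumptions, and the input would need to be stock feature vectors such as (index, volume, amount) prepared by the user.

```python
import random


def fuzzy_c_means(points, c=2, m=2.0, iters=100, tol=1e-6, seed=0):
    """Standard fuzzy c-means on a list of equal-length numeric tuples.

    Returns (centers, u) where u[k][i] is the membership of point k
    in cluster i and every row of u sums to 1.
    """
    rng = random.Random(seed)
    n, dim = len(points), len(points[0])

    # Random initial memberships, each row normalised to sum to 1.
    u = []
    for _ in range(n):
        row = [rng.random() + 1e-9 for _ in range(c)]
        total = sum(row)
        u.append([v / total for v in row])

    centers = []
    for _ in range(iters):
        # Centre update: mean of the points weighted by u^m.
        centers = []
        for i in range(c):
            wts = [u[k][i] ** m for k in range(n)]
            total = sum(wts)
            centers.append(
                [sum(w * p[d] for w, p in zip(wts, points)) / total
                 for d in range(dim)]
            )

        # Membership update: u_ki proportional to d_ki^(-2/(m-1)).
        shift = 0.0
        for k, p in enumerate(points):
            d2 = [sum((a - b) ** 2 for a, b in zip(p, ctr)) for ctr in centers]
            if min(d2) == 0.0:
                # Point coincides with a centre: assign it fully there.
                new = [1.0 if x == 0.0 else 0.0 for x in d2]
            else:
                # d2 holds squared distances, so the exponent is -1/(m-1).
                new = [x ** (-1.0 / (m - 1.0)) for x in d2]
            total = sum(new)
            new = [v / total for v in new]
            shift = max(shift, max(abs(a - b) for a, b in zip(new, u[k])))
            u[k] = new

        if shift < tol:  # memberships have stopped changing
            break
    return centers, u
```

    Unlike hard k-means, each record keeps a degree of membership in every cluster, which is what allows the resulting pattern to feed a fuzzy inference system afterwards.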

  12. Detecting Structural Damage of Nuclear Power Plant by Interactive Data Mining Approach

    International Nuclear Information System (INIS)

    Yufei Shu

    2006-01-01

    This paper presents a nonlinear structural damage identification technique, based on an interactive data mining approach, which integrates a human cognitive model in a data mining loop. A mining control agent emulating human analysts is developed, which directly interacts with the data miner, analyzing and verifying the output of the data miner and controlling the data mining process. Additionally, an artificial neural network method, which is adopted as a core component of the proposed interactive data mining method, is evolved by adding a novelty detecting and retraining function for handling complicated nuclear power plant quake-proof data. Plant quake-proof testing data has been applied to the system to show the validation of the proposed method. (author)

  13. Effective approach toward Intrusion Detection System using data mining techniques

    Directory of Open Access Journals (Sweden)

    G.V. Nadiammai

    2014-03-01

    Full Text Available With the tremendous growth in the use of computers over networks and the development of applications running on various platforms, network security is attracting increasing attention. This paradigm exploits security vulnerabilities on all computer systems that are technically difficult and expensive to solve. Hence intrusion is used as a key to compromise the integrity, availability and confidentiality of a computer resource. The Intrusion Detection System (IDS) plays a vital role in detecting anomalies and attacks in the network. In this work, data mining is integrated with an IDS to identify the relevant, hidden data of interest to the user effectively and with less execution time. Four issues, namely Classification of Data, High Level of Human Interaction, Lack of Labeled Data, and Effectiveness of Distributed Denial of Service Attack, are addressed using the proposed EDADT algorithm, Hybrid IDS model, Semi-Supervised Approach and Varying HOPERAA Algorithm respectively. The proposed algorithms have been tested using the KDD Cup dataset. All of the proposed algorithms show better accuracy and a reduced false alarm rate compared with existing algorithms.

  14. Generative Topic Modeling in Image Data Mining and Bioinformatics Studies

    Science.gov (United States)

    Chen, Xin

    2012-01-01

    Probabilistic topic models have been developed for applications in various domains such as text mining, information retrieval and computer vision and bioinformatics domain. In this thesis, we focus on developing novel probabilistic topic models for image mining and bioinformatics studies. Specifically, a probabilistic topic-connection (PTC) model…

  15. Mining algorithm for association rules in big data based on Hadoop

    Science.gov (United States)

    Fu, Chunhua; Wang, Xiaojing; Zhang, Lijun; Qiao, Liying

    2018-04-01

    To address the problem that traditional association rule mining algorithms cannot meet the efficiency and scalability needs of mining large amounts of data, the FP-Growth algorithm is taken as an example and parallelized on the Hadoop framework and the MapReduce model. On this basis, it is improved using the transaction reduction method to further enhance the algorithm's mining efficiency. The experiments, run on a Hadoop cluster, cover verification of the parallel mining results, an efficiency comparison between the serial and parallel versions, and the relationships between mining time and node number and between mining time and data volume. They show that the parallelized FP-Growth algorithm accurately mines frequent item sets, with better performance and scalability. It better meets the requirements of big data mining and efficiently mines frequent item sets and association rules from large datasets.
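
    The abstract does not spell out its implementation, so the following is only a sketch of two ingredients it mentions, written in plain Python rather than against the Hadoop API: a MapReduce-style item count over data partitions (the usual first pass before building FP-trees) and a simple form of transaction reduction. The interpretation of "transaction reduction" as pruning infrequent items and discarding transactions that become too short is an assumption, as are the names map_count, reduce_counts, frequent_items and reduce_transactions. FP-tree construction and pattern growth are not shown.

```python
from collections import Counter
from functools import reduce


def map_count(partition):
    """Map step: count, within one partition, how many transactions
    contain each item (each item counted once per transaction)."""
    counts = Counter()
    for transaction in partition:
        counts.update(set(transaction))
    return counts


def reduce_counts(partials):
    """Reduce step: merge the per-partition counts into global supports."""
    return reduce(lambda a, b: a + b, partials, Counter())


def frequent_items(counts, min_support):
    """Items whose global support count reaches the threshold."""
    return {item for item, n in counts.items() if n >= min_support}


def reduce_transactions(transactions, frequent, min_len=2):
    """Assumed form of transaction reduction: drop infrequent items,
    then drop transactions too short to hold an itemset of min_len."""
    reduced = []
    for transaction in transactions:
        kept = sorted(item for item in set(transaction) if item in frequent)
        if len(kept) >= min_len:
            reduced.append(kept)
    return reduced
```

    In a real Hadoop job the map and reduce functions would run on separate nodes over HDFS splits; here the partitions are just nested lists, which is enough to show why shrinking the transaction set before the tree-building pass reduces the work each node has to do.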

  16. Large Scale Data Mining to Improve Usability of Data: An Intelligent Archive Testbed

    Science.gov (United States)

    Ramapriyan, Hampapuram; Isaac, David; Yang, Wenli; Morse, Steve

    2005-01-01

    Research in certain scientific disciplines - including Earth science, particle physics, and astrophysics - continually faces the challenge that the volume of data needed to perform valid scientific research can at times overwhelm even a sizable research community. The desire to improve utilization of this data gave rise to the Intelligent Archives project, which seeks to make data archives active participants in a knowledge building system capable of discovering events or patterns that represent new information or knowledge. Data mining can automatically discover patterns and events, but it is generally viewed as unsuited for large-scale use in disciplines like Earth science that routinely involve very high data volumes. Dozens of research projects have shown promising uses of data mining in Earth science, but all of these are based on experiments with data subsets of a few gigabytes or less, rather than the terabytes or petabytes typically encountered in operational systems. To bridge this gap, the Intelligent Archives project is establishing a testbed with the goal of demonstrating the use of data mining techniques in an operationally-relevant environment. This paper discusses the goals of the testbed and the design choices surrounding critical issues that arose during testbed implementation.

  17. Integrating text mining, data mining, and network analysis for identifying genetic breast cancer trends.

    Science.gov (United States)

    Jurca, Gabriela; Addam, Omar; Aksac, Alper; Gao, Shang; Özyer, Tansel; Demetrick, Douglas; Alhajj, Reda

    2016-04-26

    Breast cancer is a serious disease which affects many women and may lead to death. It has received considerable attention from the research community. Thus, biomedical researchers aim to find genetic biomarkers indicative of the disease. Novel biomarkers can be elucidated from the existing literature. However, the vast amount of scientific publications on breast cancer make this a daunting task. This paper presents a framework which investigates existing literature data for informative discoveries. It integrates text mining and social network analysis in order to identify new potential biomarkers for breast cancer. We utilized PubMed for the testing. We investigated gene-gene interactions, as well as novel interactions such as gene-year, gene-country, and abstract-country to find out how the discoveries varied over time and how overlapping/diverse are the discoveries and the interest of various research groups in different countries. Interesting trends have been identified and discussed, e.g., different genes are highlighted in relationship to different countries though the various genes were found to share functionality. Some text analysis based results have been validated against results from other tools that predict gene-gene relations and gene functions.

  18. Mining gene expression data by interpreting principal components

    Directory of Open Access Journals (Sweden)

    Mortazavi Ali

    2006-04-01

    Full Text Available Abstract Background: There are many methods for analyzing microarray data that group together genes having similar patterns of expression over all conditions tested. However, in many instances the biologically important goal is to identify relatively small sets of genes that share coherent expression across only some conditions, rather than all or most conditions as required in traditional clustering; e.g. genes that are highly up-regulated and/or down-regulated similarly across only a subset of conditions. Equally important is the need to learn which conditions are the decisive ones in forming such gene sets of interest, and how they relate to diverse conditional covariates, such as disease diagnosis or prognosis. Results: We present a method for automatically identifying such candidate sets of biologically relevant genes using a combination of principal components analysis and information theoretic metrics. To enable easy use of our methods, we have developed a data analysis package that facilitates visualization and subsequent data mining of the independent sources of significant variation present in gene microarray expression datasets (or in any other similarly structured high-dimensional dataset). We applied these tools to two public datasets, and highlight sets of genes most affected by specific subsets of conditions (e.g. tissues, treatments, samples, etc.). Statistically significant associations for highlighted gene sets were shown via global analysis for Gene Ontology term enrichment. Together with covariate associations, the tool provides a basis for building testable hypotheses about the biological or experimental causes of observed variation. Conclusion: We provide an unsupervised data mining technique for diverse microarray expression datasets that is distinct from major methods now in routine use. In test uses, this method, based on publicly available gene annotations, appears to identify numerous sets of biologically relevant genes. It
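
    To make the principal-components part of this record concrete, here is a small sketch that extracts the first principal component of an expression matrix by power iteration and ranks genes by the magnitude of their loadings. It covers only that one building block: the information theoretic metrics and Gene Ontology analysis of the paper are not reproduced, and the names first_principal_component and rank_genes, the starting vector and the iteration count are assumptions. Power iteration can fail if the starting vector happens to be orthogonal to the leading component.

```python
import math


def first_principal_component(rows, iters=200):
    """rows: one list of gene expression values per condition/sample.

    Returns (loadings, variance): the unit-length first principal
    component over genes and the variance it explains.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]

    # Sample covariance matrix between genes.
    cov = [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]

    # Asymmetric starting vector (1, 2, ..., d), normalised.
    norm = math.sqrt(sum((j + 1) ** 2 for j in range(d)))
    v = [(j + 1) / norm for j in range(d)]

    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # no variance at all in the data
            break
        v = [x / norm for x in w]

    # Rayleigh quotient v' C v gives the explained variance.
    variance = sum(
        v[i] * sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)
    )
    return v, variance


def rank_genes(loadings):
    """Gene indices ordered from largest to smallest absolute loading."""
    return sorted(range(len(loadings)), key=lambda j: -abs(loadings[j]))
```

    Genes at the top of the ranking are the ones that contribute most to the dominant source of variation across the conditions; inspecting which conditions score high or low on the same component is then what links a gene set to a subset of conditions.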

  19. Data-Throughput Enhancement Using Data Mining-Informed Cognitive Radio

    Directory of Open Access Journals (Sweden)

    Khashayar Kotobi

    2015-03-01

    Full Text Available We propose the data mining-informed cognitive radio, which uses non-traditional data sources and data-mining techniques for decision making and improving the performance of a wireless network. To date, the application of information other than wireless channel data in cognitive radios has not been significantly studied. We use a novel dataset (Twitter traffic) as an indicator of network load in a wireless channel. Using this dataset, we present and test a series of predictive algorithms that show an improvement in wireless channel utilization over traditional collision-detection algorithms. Our results demonstrate the viability of using these novel datasets to inform and create more efficient cognitive radio networks.

  20. Prediction of Thyroid Disease Using Data Mining Techniques

    Directory of Open Access Journals (Sweden)

    Irina Ioniţă

    2016-08-01

    Full Text Available Recently, thyroid diseases have become increasingly widespread worldwide. In Romania, for example, one in eight women suffers from hypothyroidism, hyperthyroidism or thyroid cancer. Various research studies estimate that about 30% of Romanians are diagnosed with endemic goiter. The factors that affect thyroid function are: stress, infection, trauma, toxins, low-calorie diet, certain medication, etc. It is very important to prevent such diseases rather than cure them, because the majority of treatments consist of long-term medication or surgical intervention. The current study addresses the classification of thyroid disease into two of the most common thyroid dysfunctions (hyperthyroidism and hypothyroidism) among the population. The authors analyzed and compared four classification models: Naive Bayes, Decision Tree, Multilayer Perceptron and Radial Basis Function Network. The results indicate a significant accuracy for all the classification models mentioned above, the best classification rate being that of the Decision Tree model. The data set used to build and to validate the classifier was provided by the UCI machine learning repository and by a website with Romanian data. The frameworks for building and testing the classification models were KNIME Analytics Platform and Weka, two data mining software tools.

  1. Knowledge Discovery and Data Mining (KDDM) survey report.

    Energy Technology Data Exchange (ETDEWEB)

    Phillips, Laurence R.; Jordan, Danyelle N.; Bauer, Travis L.; Elmore, Mark T. (Oak Ridge National Laboratory, Oak Ridge, TN); Treadwell, Jim N. (Oak Ridge National Laboratory, Oak Ridge, TN); Homan, Rossitza A.; Chapman, Leon Darrel; Spires, Shannon V.

    2005-02-01

    The large number of government and industry activities supporting the Unit of Action (UA), with attendant documents, reports and briefings, can overwhelm decision-makers with an overabundance of information that hampers the ability to make quick decisions often resulting in a form of gridlock. In particular, the large and rapidly increasing amounts of data and data formats stored on UA Advanced Collaborative Environment (ACE) servers has led to the realization that it has become impractical and even impossible to perform manual analysis leading to timely decisions. UA Program Management (PM UA) has recognized the need to implement a Decision Support System (DSS) on UA ACE. The objective of this document is to research the commercial Knowledge Discovery and Data Mining (KDDM) market and publish the results in a survey. Furthermore, a ranking mechanism based on UA ACE-specific criteria has been developed and applied to a representative set of commercially available KDDM solutions. In addition, an overview of four R&D areas identified as critical to the implementation of DSS on ACE is provided. Finally, a comprehensive database containing detailed information on surveyed KDDM tools has been developed and is available upon customer request.

  2. Visualizing data mining results with the Brede tools

    Directory of Open Access Journals (Sweden)

    Finn A Nielsen

    2009-07-01

    Full Text Available A few neuroinformatics databases now exist that record results from neuroimaging studies in the form of brain coordinates in stereotaxic space. The Brede Toolbox was originally developed to extract, analyze and visualize data from one of them, the BrainMap database. Since then the Brede Toolbox has expanded and now includes its own database with coordinates along with ontologies for brain regions and functions: The Brede Database. With the Brede Toolbox and Database combined, we set up automated workflows for extraction of data, mass meta-analytic data mining and visualizations. Most of the Web presence of the Brede Database is established by a single script executing a workflow involving these steps together with a final generation of Web pages with embedded visualizations and links to interactive three-dimensional models in the Virtual Reality Modeling Language. Apart from the Brede tools I briefly review alternate visualization tools and methods for Internet-based visualization and information visualization as well as portals for visualization tools.

  3. Text mining factor analysis (TFA) in green tea patent data

    Science.gov (United States)

    Rahmawati, Sela; Suprijadi, Jadi; Zulhanif

    2017-03-01

    Factor analysis has become one of the most widely used multivariate statistical procedures in applied research across a multitude of domains. There are two main types of factor analysis: Exploratory Factor Analysis (EFA) and Confirmatory Factor Analysis (CFA). Both EFA and CFA aim to model the observed relationships among a group of indicators through a latent variable, but they differ fundamentally in the a priori assumptions and restrictions placed on the factor model. These methods are applied to patent data from the green tea technology sector to determine the development of green tea technology in the world. Patent analysis is useful in identifying future technological trends in a specific field of technology. The patent database was obtained from the European Patent Organization (EPO). In this paper, the CFA model is applied to nominal data obtained from a presence-absence matrix. The CFA for the nominal data was based on the tetrachoric correlation matrix. Meanwhile, the EFA model is applied to the titles from the dominant technology sector. The titles are first pre-processed using text mining analysis.

  4. Development of turbine cycle performance analyzer using intelligent data mining

    Energy Technology Data Exchange (ETDEWEB)

    Heo, Gyun Young

    2004-02-15

    In recent years, the performance enhancement of the turbine cycle in nuclear power plants has been highlighted because of the worldwide deregulation environment. In particular, the primary goal of operating plants has become the reduction of operating cost in order to compete with other power plants. It is known that the overhaul interval is closely related to operating cost. Through field investigation, the author identified that rapid and reliable performance tests, analysis, and diagnosis play an important role in controlling the overhaul interval. First, a technical road map was proposed to set out the objectives clearly. The main issues were grouped into data gathering, analysis tools, and diagnosis methods. The author proposed an integrated solution based on intelligent data mining techniques. For reliable data gathering, a state analyzer composed of statistical regression, wavelet analysis, and a neural network was developed. The role of the state analyzer is to estimate unmeasured data and to increase the reliability of the collected data. For advanced performance analysis, a performance analysis toolbox was developed. This tool makes the analysis process easier and more accurate by providing three novel heat balance diagrams, and it includes the state analyzer and a turbine cycle simulation code. The diagnosis module provides both a probabilistic technique based on a Bayesian network model and a deterministic technique based on an algebraic model, which balances the uncertainty in the diagnosis process against pinpoint capability. All the modules were validated with simulated data as well as actual test data, and some modules are used in industrial applications. Much remains to be improved in the turbine cycle in order to increase plant availability. This study was carried out to draw attention to the importance of the turbine cycle and to propose solutions based on academic as well as industrial needs.

  5. Development of turbine cycle performance analyzer using intelligent data mining

    International Nuclear Information System (INIS)

    Heo, Gyun Young

    2004-02-01

    In recent years, the performance enhancement of the turbine cycle in nuclear power plants has been highlighted because of the worldwide deregulation environment. In particular, the primary goal of operating plants has become the reduction of operating cost in order to compete with other power plants. It is known that the overhaul interval is closely related to operating cost. Through field investigation, the author identified that rapid and reliable performance tests, analysis, and diagnosis play an important role in controlling the overhaul interval. First, a technical road map was proposed to set out the objectives clearly. The main issues were grouped into data gathering, analysis tools, and diagnosis methods. The author proposed an integrated solution based on intelligent data mining techniques. For reliable data gathering, a state analyzer composed of statistical regression, wavelet analysis, and a neural network was developed. The role of the state analyzer is to estimate unmeasured data and to increase the reliability of the collected data. For advanced performance analysis, a performance analysis toolbox was developed. This tool makes the analysis process easier and more accurate by providing three novel heat balance diagrams, and it includes the state analyzer and a turbine cycle simulation code. The diagnosis module provides both a probabilistic technique based on a Bayesian network model and a deterministic technique based on an algebraic model, which balances the uncertainty in the diagnosis process against pinpoint capability. All the modules were validated with simulated data as well as actual test data, and some modules are used in industrial applications. Much remains to be improved in the turbine cycle in order to increase plant availability. This study was carried out to draw attention to the importance of the turbine cycle and to propose solutions based on academic as well as industrial needs.

  6. Microarray data and gene expression statistics for Saccharomyces cerevisiae exposed to simulated asbestos mine drainage

    Directory of Open Access Journals (Sweden)

    Heather E. Driscoll

    2017-08-01

    Full Text Available Here we describe microarray expression data (raw and normalized), experimental metadata, and gene-level data with expression statistics from Saccharomyces cerevisiae exposed to simulated asbestos mine drainage from the Vermont Asbestos Group (VAG) Mine on Belvidere Mountain in northern Vermont, USA. For nearly 100 years (between the late 1890s and 1993), chrysotile asbestos fibers were extracted from serpentinized ultramafic rock at the VAG Mine for use in construction and manufacturing industries. Studies have shown that water courses and streambeds nearby have become contaminated with asbestos mine tailings runoff, including elevated levels of magnesium, nickel, chromium, and arsenic, elevated pH, and chrysotile asbestos-laden mine tailings, due to leaching and gradual erosion of massive piles of mine waste covering approximately 9 km². We exposed yeast to simulated VAG Mine tailings leachate to help gain insight on how eukaryotic cells exposed to VAG Mine drainage may respond in the mine environment. Affymetrix GeneChip® Yeast Genome 2.0 Arrays were utilized to assess gene expression after 24-h exposure to simulated VAG Mine tailings runoff. The chemistry of mine-tailings leachate, mine-tailings leachate plus yeast extract peptone dextrose media, and control yeast extract peptone dextrose media is also reported. To our knowledge this is the first dataset to assess global gene expression patterns in a eukaryotic model system simulating asbestos mine tailings runoff exposure. Raw and normalized gene expression data are accessible through the National Center for Biotechnology Information Gene Expression Omnibus (NCBI GEO) Database Series GSE89875 (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE89875).

  7. Data mining with iPlant: a meeting report from the 2013 GARNet workshop, Data mining with iPlant.

    Science.gov (United States)

    Martin, Lisa; Cook, Charis; Matasci, Naim; Williams, Jason; Bastow, Ruth

    2015-01-01

    High-throughput sequencing technologies have rapidly moved from large international sequencing centres to individual laboratory benchtops. These changes have driven the 'data deluge' of modern biology. Submissions of nucleotide sequences to GenBank, for example, have doubled in size every year since 1982, and individual data sets now frequently reach terabytes in size. While 'big data' present exciting opportunities for scientific discovery, data analysis skills are not part of the typical wet bench biologist's experience. Knowing what to do with data, how to visualize and analyse them, make predictions, and test hypotheses are important barriers to success. Many researchers also lack adequate capacity to store and share these data, creating further bottlenecks to effective collaboration between groups and institutes. The US National Science Foundation-funded iPlant Collaborative was established in 2008 to form part of the data collection and analysis pipeline and help alleviate the bottlenecks associated with the big data challenge in plant science. Leveraging the power of high-performance computing facilities, iPlant provides free-to-use cyberinfrastructure to enable terabytes of data storage, improve analysis, and facilitate collaborations. To help train UK plant science researchers to use the iPlant platform and understand how it can be exploited to further research, GARNet organized a four-day Data mining with iPlant workshop at Warwick University in September 2013. This report provides an overview of the workshop, and highlights the power of the iPlant environment for lowering barriers to using complex bioinformatics resources, furthering discoveries in plant science research and providing a platform for education and outreach programmes. © The Author 2014. Published by Oxford University Press on behalf of the Society for Experimental Biology. All rights reserved. For permissions, please email: journals.permissions@oup.com.

  8. Signaling pathway networks mined from human pituitary adenoma proteomics data

    Directory of Open Access Journals (Sweden)

    Zhan Xianquan

    2010-04-01

    Full Text Available Abstract. Background: We obtained a series of pituitary adenoma proteomic expression data, including protein-mapping data (111 proteins), comparative proteomic data (56 differentially expressed proteins), and nitroproteomic data (17 nitroproteins). There is a pressing need to clarify the significant signaling pathway networks that derive from those proteins in order to better understand the molecular basis of pituitary adenoma pathogenesis and to discover biomarkers. Here, we describe the significant signaling pathway networks that were mined from human pituitary adenoma proteomic data with the Ingenuity pathway analysis system. Methods: The Ingenuity pathway analysis system was used to analyze signal pathway networks and canonical pathways from protein-mapping data, comparative proteomic data, adenoma nitroproteomic data, and control nitroproteomic data. A Fisher's exact test was used to test the statistical significance with a significance level of 0.05. Statistically significant results were rationalized within the pituitary adenoma biological system with literature-based bioinformatics analyses. Results: For the protein-mapping data, the top pathway networks were related to cancer, cell death, and lipid metabolism; the top canonical toxicity pathways included acute-phase response, oxidative-stress response, oxidative stress, and cell-cycle G2/M transition regulation. For the comparative proteomic data, top pathway networks were related to cancer, endocrine system development and function, and lipid metabolism; the top canonical toxicity pathways included mitochondrial dysfunction, oxidative phosphorylation, oxidative-stress response, and ERK/MAPK signaling. The nitroproteomic data from a pituitary adenoma were related to cancer, cell death, lipid metabolism, and reproductive system disease, and the top canonical toxicity pathways mainly related to p38 MAPK signaling and cell-cycle G2/M transition regulation. Nitroproteins from a
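
    The significance test named in the Methods of this record can be illustrated with a minimal sketch. The abstract does not say how the Ingenuity system builds its contingency tables, so the 2x2 layout, the one-sided (enrichment) alternative, and the function name `fisher_exact_right_tail` are assumptions chosen for illustration, not the authors' procedure.

```python
from math import comb


def fisher_exact_right_tail(a: int, b: int, c: int, d: int) -> float:
    """One-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Assumed layout (hypothetical):
        a = dataset proteins in the pathway
        b = dataset proteins not in the pathway
        c = background proteins in the pathway
        d = background proteins not in the pathway
    Returns P(X >= a) under the hypergeometric null distribution.
    """
    row1 = a + b          # size of the dataset
    col1 = a + c          # size of the pathway
    total = a + b + c + d
    denom = comb(total, row1)
    upper = min(row1, col1)
    # Sum the probabilities of all tables at least as extreme as the observed one.
    tail = sum(
        comb(col1, x) * comb(total - col1, row1 - x)
        for x in range(a, upper + 1)
    )
    return tail / denom


if __name__ == "__main__":
    p = fisher_exact_right_tail(3, 1, 1, 3)
    print(f"p = {p:.4f}; significant at 0.05: {p < 0.05}")
```

    With a significance level of 0.05, as stated in the abstract, a pathway would be reported only when the returned p-value falls below 0.05.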

  9. Mining Outlier Data in Mobile Internet-Based Large Real-Time Databases

    Directory of Open Access Journals (Sweden)

    Xin Liu

    2018-01-01

    Full Text Available Mining outlier data guarantees access security and data scheduling of parallel databases and maintains high-performance operation of real-time databases. Traditional mining methods generate abundant interference data and suffer from reduced accuracy, efficiency, and stability. This paper proposes a new outlier data mining method, which analyzes real-time data features, obtains magnitude spectrum models of outlier data, establishes a decision-tree information chain transmission model for outlier data in the mobile Internet, obtains the information flow of internal outlier data in the information chain of a large real-time database, and clusters the data. Based on the local characteristic time-scale parameters of the information flow, the phase position features of the outlier data before filtering are obtained; a decision-tree outlier-classification feature-filtering algorithm is adopted to acquire signals for analysis and the instantaneous amplitude and to obtain the phase-frequency characteristics of the outlier data. Wavelet transform threshold denoising is combined with signal denoising to analyze data offset, correct the resulting detection filter model, and realize outlier data mining. The simulation suggests that the method detects the characteristic feature response distribution of outlier data, reduces response time, iteration frequency, and mining error rate, improves mining adaptation and coverage, and shows good mining outcomes.

  10. A Data Mining Approach to Reveal Representative Collaboration Indicators in Open Collaboration Frameworks

    Science.gov (United States)

    Anaya, Antonio R.; Boticario, Jesus G.

    2009-01-01

    Data mining methods are successful in educational environments to discover new knowledge or learner skills or features. Unfortunately, they have not been used in depth with collaboration. We have developed a scalable data mining method, whose objective is to infer information on the collaboration during the collaboration process in a…

  11. Educational Data Mining Applications and Tasks: A Survey of the Last 10 Years

    Science.gov (United States)

    Bakhshinategh, Behdad; Zaiane, Osmar R.; ElAtia, Samira; Ipperciel, Donald

    2018-01-01

    Educational Data Mining (EDM) is the field of using data mining techniques in educational environments. There exist various methods and applications in EDM which can follow both applied research objectives such as improving and enhancing learning quality, as well as pure research objectives, which tend to improve our understanding of the learning…

  12. Workshop on Educational Data Mining @ ICALT07 (EDM@ICALT07)

    NARCIS (Netherlands)

    Beck, J.E.; Calders, T.; Pechenizkiy, M.; Viola, S.R.; Spector, J.M.; Sampson, D.G.; Okamoto, T.; Cerri, S.A.; Ueno, M.; Kashihara, A.

    2007-01-01

    The educational data mining workshop was held in conjunction with the 7th IEEE International Conference on Advanced Learning Technologies (ICALT) in Niigata, Japan on July 18-20, 2007. EDM@ICALT07 continues the series of Workshops organized by the International Working Group on Educational Data Mining

  13. Process cubes : slicing, dicing, rolling up and drilling down event data for process mining

    NARCIS (Netherlands)

    Aalst, van der W.M.P.

    2013-01-01

    Recent breakthroughs in process mining research make it possible to discover, analyze, and improve business processes based on event data. The growth of event data provides many opportunities but also imposes new challenges. Process mining is typically done for an isolated well-defined process in

  14. Data mining methods for quality assurance in an environmental monitoring network

    NARCIS (Netherlands)

    Athanasiadis, Ioannis N.; Rizzoli, Andrea Emilio; Beard, Daniel W.

    2010-01-01

    The paper presents a system architecture that employs data mining techniques for ensuring quality assurance in an environmental monitoring network. We investigate how data mining techniques can be incorporated in the quality assurance decision making process. As prior expert decisions are

  15. An XML-Enabled Data Mining Query Language XML-DMQL

    NARCIS (Netherlands)

    Feng, L.; Dillon, T.

    2005-01-01

    Inspired by the good work of Han et al. (1996) and Elfeky et al. (2001) on the design of data mining query languages for relational and object-oriented databases, in this paper, we develop an expressive XML-enabled data mining query language by extension of XQuery. We first describe some

  16. A novel Neuro-fuzzy classification technique for data mining

    Directory of Open Access Journals (Sweden)

    Soumadip Ghosh

    2014-11-01

    Full Text Available In our study, we proposed a novel Neuro-fuzzy classification technique for data mining. The inputs to the Neuro-fuzzy classification system were fuzzified by applying a generalized bell-shaped membership function. The proposed method utilized a fuzzification matrix in which the input patterns were associated with a degree of membership to different classes. Based on the value of the degree of membership, a pattern would be attributed to a specific category or class. We applied our method to ten benchmark data sets from the UCI machine learning repository for classification. Our objective was to analyze the proposed method and then compare its performance with two powerful supervised classification algorithms, Radial Basis Function Neural Network (RBFNN) and Adaptive Neuro-fuzzy Inference System (ANFIS). We assessed the performance of these classification methods in terms of different performance measures such as accuracy, root-mean-square error, kappa statistic, true positive rate, false positive rate, precision, recall, and f-measure. In every aspect the proposed method proved to be superior to the RBFNN and ANFIS algorithms.
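
    The fuzzification step described in this record can be sketched as follows. The generalized bell-shaped membership function has the standard form 1 / (1 + |(x - c) / a|^(2b)). The parameter values, the per-class parameter layout `class_params`, and the sum-of-memberships decision rule in `classify` are assumptions made for illustration; the abstract does not give the authors' parameters or say how the fuzzification matrix is passed on to the classifier.

```python
def gbell(x: float, a: float, b: float, c: float) -> float:
    """Generalized bell membership: 1 / (1 + |(x - c) / a| ** (2 * b))."""
    if a == 0:
        raise ValueError("width parameter a must be non-zero")
    return 1.0 / (1.0 + abs((x - c) / a) ** (2.0 * b))


def fuzzify(pattern, class_params):
    """Build a fuzzification matrix: one row per feature, one column per class.

    class_params[k][j] holds the hypothetical (a, b, c) for class k, feature j.
    """
    n_classes = len(class_params)
    return [
        [gbell(x, *class_params[k][j]) for k in range(n_classes)]
        for j, x in enumerate(pattern)
    ]


def classify(pattern, class_params):
    """Pick the class with the largest summed membership (illustrative rule only)."""
    matrix = fuzzify(pattern, class_params)
    n_classes = len(class_params)
    totals = [sum(row[k] for row in matrix) for k in range(n_classes)]
    return max(range(n_classes), key=totals.__getitem__)


if __name__ == "__main__":
    # Two features, two classes; all parameter values are made up.
    params = [
        [(1.0, 2.0, 0.0), (1.0, 2.0, 0.0)],  # class 0 centred at 0
        [(1.0, 2.0, 5.0), (1.0, 2.0, 5.0)],  # class 1 centred at 5
    ]
    print(fuzzify([0.5, 4.0], params))
    print(classify([0.5, 4.0], params))
```

    Here `a` controls the width of the bell, `b` the steepness of its shoulders, and `c` its centre, so membership is exactly 1 at `x == c` and 0.5 at `x == c ± a`.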

  17. Apply data mining to analyze the rainfall of landslide

    Directory of Open Access Journals (Sweden)

    Lee Chou-Yuan

    2018-01-01

    Full Text Available Taiwan is listed as an extremely dangerous country that suffers from many disasters. Landslide disasters cause losses of agricultural production, life, and property. Many researchers have studied landslide disasters, but there has been little discussion of the rainfall threshold for landslides. In this paper, data mining is applied to establish rules and the rainfall threshold for landslides at Huafan University, Taiwan. The variables used include rainfall, insolation, insolation rate, average humidity, average temperature, wind speed, and inclinometer tilt. The inclinometer is an important instrument for measuring the tilt, elevation, or depression of an object with respect to gravity. There are 26 inclinometers in the Talun mountain area of Huafan University. The data used in this research were collected from January 2008 to July 2014. In the proposed approach, regression analysis is first used to predict rainfall. A decision tree is then used to obtain decision rules and set the rainfall threshold for landslides. The output of this approach provides more information for understanding changes in rainfall, and the rainfall threshold could also provide useful information for maintaining safety at Huafan University.
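
    How a decision tree can yield a rainfall threshold is easiest to see with a single split. The sketch below finds the cut point on one variable that minimises weighted Gini impurity, which is the criterion commonly used in classification trees. The abstract does not name the tree algorithm the authors used, and the rainfall values and event labels are invented for illustration; the regression step is not covered.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2.0 * p * (1.0 - p)


def best_threshold(values, labels):
    """Return the split value that minimises weighted Gini impurity.

    Candidate thresholds are midpoints between consecutive distinct values.
    Returns None when all values are identical and no split is possible.
    """
    pairs = sorted(zip(values, labels))
    best_score, best_thr = float("inf"), None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no boundary between equal values
        thr = (pairs[i - 1][0] + pairs[i][0]) / 2.0
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best_score:
            best_score, best_thr = score, thr
    return best_thr


if __name__ == "__main__":
    # Hypothetical daily rainfall (mm) and whether slope movement was flagged.
    rainfall = [10, 20, 30, 80, 90, 100]
    moved = [0, 0, 0, 1, 1, 1]
    print("rainfall threshold:", best_threshold(rainfall, moved))
```

    A full tree would repeat this search recursively over all of the listed variables (rainfall, humidity, temperature, wind speed, and so on) and read the decision rules off the resulting splits.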

  18. Simulation of California's Major Reservoirs Outflow Using Data Mining Technique

    Science.gov (United States)

    Yang, T.; Gao, X.; Sorooshian, S.

    2014-12-01

    Unlike the upstream inflow, a reservoir's outflow is controlled by reservoir operators. For downstream water users, the outflow is more important than the reservoir's inflow. In order to simulate the complicated reservoir operation and extract the outflow decision making patterns for California's 12 major reservoirs, we build a data-driven, computer-based ("artificial intelligence") reservoir decision making tool, using a classification and regression decision tree approach. This is a well-developed statistical and graphical modeling methodology in the field of data mining. A shuffled cross validation approach is also employed to extract the outflow decision making patterns and rules based on the selected decision variables (inflow amount, precipitation, timing, water year type, etc.). To show the accuracy of the model, a verification study is carried out comparing the model-generated outflow decisions ("artificial intelligence" decisions) with those made by reservoir operators (human decisions). The simulation results show that the machine-generated outflow decisions are very similar to the real reservoir operators' decisions. This conclusion is based on statistical evaluations using the Nash-Sutcliffe test. The proposed model is able to detect the most influential variables and their weights when the reservoir operators make an outflow decision. While the proposed approach was first applied and tested on California's 12 major reservoirs, the method is universally adaptable to other reservoir systems.
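
    The Nash-Sutcliffe evaluation mentioned in this record is usually computed as the Nash-Sutcliffe efficiency, NSE = 1 - Σ(obs - sim)² / Σ(obs - mean(obs))². A minimal sketch follows. The function name and the sample outflow values are hypothetical, and the abstract does not state which variant of the statistic the authors used.

```python
def nash_sutcliffe(observed, simulated):
    """Nash-Sutcliffe efficiency of simulated values against observations.

    1.0 means a perfect match; 0.0 means the simulation is no better than
    always predicting the observed mean; negative values are worse than that.
    """
    if len(observed) != len(simulated) or not observed:
        raise ValueError("series must be non-empty and of equal length")
    mean_obs = sum(observed) / len(observed)
    residual = sum((o - s) ** 2 for o, s in zip(observed, simulated))
    variance = sum((o - mean_obs) ** 2 for o in observed)
    if variance == 0:
        raise ValueError("observed series has zero variance")
    return 1.0 - residual / variance


if __name__ == "__main__":
    # Hypothetical operator releases vs. model-generated releases.
    human = [120.0, 150.0, 180.0, 140.0]
    model = [118.0, 155.0, 176.0, 143.0]
    print(f"NSE = {nash_sutcliffe(human, model):.3f}")
```

    In the setting of the abstract, `observed` would hold the operators' outflow decisions and `simulated` the tree-generated ones; values close to 1 support the claim that the two are very similar.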

  19. Data mining for isotope discrimination in atom probe tomography

    Energy Technology Data Exchange (ETDEWEB)

    Broderick, Scott R. [Department of Materials Science and Engineering and Institute for Combinatorial Discovery, Iowa State University, Ames, IA 50011-2230 (United States); Bryden, Aaron [Ames National Laboratory, Ames, IA 50011-2230 (United States); Suram, Santosh K. [Department of Materials Science and Engineering and Institute for Combinatorial Discovery, Iowa State University, Ames, IA 50011-2230 (United States); Rajan, Krishna, E-mail: krajan@iastate.edu [Department of Materials Science and Engineering and Institute for Combinatorial Discovery, Iowa State University, Ames, IA 50011-2230 (United States)

    2013-09-15

    Ions with similar times of flight (TOF) can be discriminated by mapping their kinetic energy. While current generation position-sensitive detectors have been considered insufficient for capturing the isotope kinetic energy, we demonstrate in this paper that statistical learning methodologies can be used to capture the kinetic energy from all of the parameters currently measured by mathematically transforming the signal. This approach works because the kinetic energy is sufficiently described by the descriptors on the potential, the material, and the evaporation process within atom probe tomography (APT). We discriminate the isotopes for Mg and Al by capturing the kinetic energy, and then decompose the TOF spectrum into its isotope components and identify the isotope for each individual atom measured. This work demonstrates the value of advanced data mining methods to help enhance the information resolution of the atom probe. - Highlights: ► Atom probe tomography and statistical learning were combined for data enhancement. ► Multiple eigenvalue decompositions decomposed a spectrum with overlapping peaks. ► The isotope of each atom was determined by kinetic energy discrimination. ► Eigenspectra were identified and new chemical information was identified.

  20. Sistem Informasi Pemetaan Pendidikan Menggunakan Algoritma Data Mining [Education Mapping Information System Using Data Mining Algorithms]

    Directory of Open Access Journals (Sweden)

    Olha Musa

    2016-04-01

    Full Text Available This study aims to identify improvements in educational services, using the quality of non-formal education as an indicator. Alongside tiered formal education, non-formal education (training) is one of the prerequisites for multiplying the potential for self-development. The data mining algorithm used is basic k-means clustering, which places each object in the cluster with the nearest mean. The aim is to design an education mapping information system based on k-means clustering, which is a non-hierarchical method. The education mapping system was tested on 171 data samples from the Islamic Students Association (HMI), and the study showed that the non-hierarchical method (k-means clustering) has a good degree of accuracy because the number of clusters is specified in advance. The education mapping information system is used to cluster the data by the level of formal education and the training attended. The test results show that the spread of the data within each cluster is similar. During the iteration process, no difference was visible in the mapping results obtained with k-means clustering. The cluster centroid model with 4 variables gives the following: members educated to S1 or S2 who have taken basic training fall in cluster 0; members educated to S1, S2, or S3 who have taken basic training fall in cluster 1; members educated to S1 who have taken basic and intermediate training fall in cluster 2; and members educated to S1 who have taken basic training fall in cluster 3. For formal education, tiered education is seen in cluster 1; for non-formal education (training), tiered education is seen in cluster 2, based on the k-means clustering test results.   Keywords: Information Systems; Educational Mapping; Cluster; K-means
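
    The basic k-means procedure this record relies on (assign each object to the nearest mean, recompute the means, repeat until assignments stop changing) can be sketched in a few lines. This is a generic illustration with made-up two-dimensional points and a deterministic "first k points" initialisation chosen for reproducibility; the abstract does not describe the authors' encoding of education levels or their initialisation.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, max_iter=100):
    """Basic k-means (Lloyd's algorithm) with the first k points as seeds.

    Returns (labels, centroids). The number of clusters k must be chosen
    in advance, as noted in the abstract.
    """
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    centroids = [list(p) for p in points[:k]]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centroids are too
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


if __name__ == "__main__":
    # Hypothetical data: (formal education level, number of trainings attended).
    data = [(0, 0), (0, 1), (10, 10), (10, 11)]
    labels, centroids = kmeans(data, k=2)
    print(labels, centroids)
```

    Because the result depends on the starting centroids, practical implementations usually run several random initialisations and keep the solution with the lowest within-cluster sum of squares.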