WorldWideScience

Sample records for machine learning prediction

  1. Using Machine Learning to Predict Student Performance

    OpenAIRE

    Pojon, Murat

    2017-01-01

    This thesis examines the application of machine learning algorithms to predict whether a student will be successful or not. The specific focus of the thesis is the comparison of machine learning methods and feature engineering techniques in terms of how much they improve the prediction performance. Three different machine learning methods were used in this thesis. They are linear regression, decision trees, and naïve Bayes classification. Feature engineering, the process of modification ...

  2. Recent Advances in Predictive (Machine) Learning

    Energy Technology Data Exchange (ETDEWEB)

    Friedman, J

    2004-01-24

    Prediction involves estimating the unknown value of an attribute of a system under study given the values of other measured attributes. In prediction (machine) learning the prediction rule is derived from data consisting of previously solved cases. Most methods for predictive learning were originated many years ago at the dawn of the computer age. Recently two new techniques have emerged that have revitalized the field. These are support vector machines and boosted decision trees. This paper provides an introduction to these two new methods tracing their respective ancestral roots to standard kernel methods and ordinary decision trees.

  3. Machine Learning Methods to Predict Diabetes Complications.

    Science.gov (United States)

    Dagliati, Arianna; Marini, Simone; Sacchi, Lucia; Cogni, Giulia; Teliti, Marsida; Tibollo, Valentina; De Cata, Pasquale; Chiovato, Luca; Bellazzi, Riccardo

    2018-03-01

    One of the areas where Artificial Intelligence is having more impact is machine learning, which develops algorithms able to learn patterns and decision rules from data. Machine learning algorithms have been embedded into data mining pipelines, which can combine them with classical statistical strategies, to extract knowledge from data. Within the EU-funded MOSAIC project, a data mining pipeline has been used to derive a set of predictive models of type 2 diabetes mellitus (T2DM) complications based on electronic health record data of nearly one thousand patients. Such pipeline comprises clinical center profiling, predictive model targeting, predictive model construction and model validation. After having dealt with missing data by means of random forest (RF) and having applied suitable strategies to handle class imbalance, we have used Logistic Regression with stepwise feature selection to predict the onset of retinopathy, neuropathy, or nephropathy, at different time scenarios, at 3, 5, and 7 years from the first visit at the Hospital Center for Diabetes (not from the diagnosis). Considered variables are gender, age, time from diagnosis, body mass index (BMI), glycated hemoglobin (HbA1c), hypertension, and smoking habit. Final models, tailored in accordance with the complications, provided an accuracy up to 0.838. Different variables were selected for each complication and time scenario, leading to specialized models easy to translate to the clinical practice.

  4. Using Machine Learning to Predict MCNP Bias

    Energy Technology Data Exchange (ETDEWEB)

    Grechanuk, Pavel Aleksandrovi [Los Alamos National Lab. (LANL), Los Alamos, NM (United States)

    2018-01-09

    For many real-world applications in radiation transport where simulations are compared to experimental measurements, like in nuclear criticality safety, the bias (simulated - experimental keff) in the calculation is an extremely important quantity used for code validation. The objective of this project is to accurately predict the bias of MCNP6 [1] criticality calculations using machine learning (ML) algorithms, with the intention of creating a tool that can complement the current nuclear criticality safety methods. In the latest release of MCNP6, the Whisper tool is available for criticality safety analysts and includes a large catalogue of experimental benchmarks, sensitivity profiles, and nuclear data covariance matrices. This data, coming from 1100+ benchmark cases, is used in this study of ML algorithms for criticality safety bias predictions.

  5. Machine learning methods for metabolic pathway prediction

    Directory of Open Access Journals (Sweden)

    Karp Peter D

    2010-01-01

    Full Text Available Abstract Background A key challenge in systems biology is the reconstruction of an organism's metabolic network from its genome sequence. One strategy for addressing this problem is to predict which metabolic pathways, from a reference database of known pathways, are present in the organism, based on the annotated genome of the organism. Results To quantitatively validate methods for pathway prediction, we developed a large "gold standard" dataset of 5,610 pathway instances known to be present or absent in curated metabolic pathway databases for six organisms. We defined a collection of 123 pathway features, whose information content we evaluated with respect to the gold standard. Feature data were used as input to an extensive collection of machine learning (ML methods, including naïve Bayes, decision trees, and logistic regression, together with feature selection and ensemble methods. We compared the ML methods to the previous PathoLogic algorithm for pathway prediction using the gold standard dataset. We found that ML-based prediction methods can match the performance of the PathoLogic algorithm. PathoLogic achieved an accuracy of 91% and an F-measure of 0.786. The ML-based prediction methods achieved accuracy as high as 91.2% and F-measure as high as 0.787. The ML-based methods output a probability for each predicted pathway, whereas PathoLogic does not, which provides more information to the user and facilitates filtering of predicted pathways. Conclusions ML methods for pathway prediction perform as well as existing methods, and have qualitative advantages in terms of extensibility, tunability, and explainability. More advanced prediction methods and/or more sophisticated input features may improve the performance of ML methods. However, pathway prediction performance appears to be limited largely by the ability to correctly match enzymes to the reactions they catalyze based on genome annotations.

  6. Machine learning methods for metabolic pathway prediction

    Science.gov (United States)

    2010-01-01

    Background A key challenge in systems biology is the reconstruction of an organism's metabolic network from its genome sequence. One strategy for addressing this problem is to predict which metabolic pathways, from a reference database of known pathways, are present in the organism, based on the annotated genome of the organism. Results To quantitatively validate methods for pathway prediction, we developed a large "gold standard" dataset of 5,610 pathway instances known to be present or absent in curated metabolic pathway databases for six organisms. We defined a collection of 123 pathway features, whose information content we evaluated with respect to the gold standard. Feature data were used as input to an extensive collection of machine learning (ML) methods, including naïve Bayes, decision trees, and logistic regression, together with feature selection and ensemble methods. We compared the ML methods to the previous PathoLogic algorithm for pathway prediction using the gold standard dataset. We found that ML-based prediction methods can match the performance of the PathoLogic algorithm. PathoLogic achieved an accuracy of 91% and an F-measure of 0.786. The ML-based prediction methods achieved accuracy as high as 91.2% and F-measure as high as 0.787. The ML-based methods output a probability for each predicted pathway, whereas PathoLogic does not, which provides more information to the user and facilitates filtering of predicted pathways. Conclusions ML methods for pathway prediction perform as well as existing methods, and have qualitative advantages in terms of extensibility, tunability, and explainability. More advanced prediction methods and/or more sophisticated input features may improve the performance of ML methods. However, pathway prediction performance appears to be limited largely by the ability to correctly match enzymes to the reactions they catalyze based on genome annotations. PMID:20064214

  7. Predicting Increased Blood Pressure Using Machine Learning

    Science.gov (United States)

    Golino, Hudson Fernandes; Amaral, Liliany Souza de Brito; Duarte, Stenio Fernando Pimentel; Soares, Telma de Jesus; dos Reis, Luciana Araujo

    2014-01-01

    The present study investigates the prediction of increased blood pressure by body mass index (BMI), waist (WC) and hip circumference (HC), and waist hip ratio (WHR) using a machine learning technique named classification tree. Data were collected from 400 college students (56.3% women) from 16 to 63 years old. Fifteen trees were calculated in the training group for each sex, using different numbers and combinations of predictors. The result shows that for women BMI, WC, and WHR are the combination that produces the best prediction, since it has the lowest deviance (87.42), misclassification (.19), and the higher pseudo R 2 (.43). This model presented a sensitivity of 80.86% and specificity of 81.22% in the training set and, respectively, 45.65% and 65.15% in the test sample. For men BMI, WC, HC, and WHC showed the best prediction with the lowest deviance (57.25), misclassification (.16), and the higher pseudo R 2 (.46). This model had a sensitivity of 72% and specificity of 86.25% in the training set and, respectively, 58.38% and 69.70% in the test set. Finally, the result from the classification tree analysis was compared with traditional logistic regression, indicating that the former outperformed the latter in terms of predictive power. PMID:24669313

  8. Predicting increased blood pressure using machine learning.

    Science.gov (United States)

    Golino, Hudson Fernandes; Amaral, Liliany Souza de Brito; Duarte, Stenio Fernando Pimentel; Gomes, Cristiano Mauro Assis; Soares, Telma de Jesus; Dos Reis, Luciana Araujo; Santos, Joselito

    2014-01-01

    The present study investigates the prediction of increased blood pressure by body mass index (BMI), waist (WC) and hip circumference (HC), and waist hip ratio (WHR) using a machine learning technique named classification tree. Data were collected from 400 college students (56.3% women) from 16 to 63 years old. Fifteen trees were calculated in the training group for each sex, using different numbers and combinations of predictors. The result shows that for women BMI, WC, and WHR are the combination that produces the best prediction, since it has the lowest deviance (87.42), misclassification (.19), and the higher pseudo R (2) (.43). This model presented a sensitivity of 80.86% and specificity of 81.22% in the training set and, respectively, 45.65% and 65.15% in the test sample. For men BMI, WC, HC, and WHC showed the best prediction with the lowest deviance (57.25), misclassification (.16), and the higher pseudo R (2) (.46). This model had a sensitivity of 72% and specificity of 86.25% in the training set and, respectively, 58.38% and 69.70% in the test set. Finally, the result from the classification tree analysis was compared with traditional logistic regression, indicating that the former outperformed the latter in terms of predictive power.

  9. Predicting Increased Blood Pressure Using Machine Learning

    Directory of Open Access Journals (Sweden)

    Hudson Fernandes Golino

    2014-01-01

    Full Text Available The present study investigates the prediction of increased blood pressure by body mass index (BMI, waist (WC and hip circumference (HC, and waist hip ratio (WHR using a machine learning technique named classification tree. Data were collected from 400 college students (56.3% women from 16 to 63 years old. Fifteen trees were calculated in the training group for each sex, using different numbers and combinations of predictors. The result shows that for women BMI, WC, and WHR are the combination that produces the best prediction, since it has the lowest deviance (87.42, misclassification (.19, and the higher pseudo R2 (.43. This model presented a sensitivity of 80.86% and specificity of 81.22% in the training set and, respectively, 45.65% and 65.15% in the test sample. For men BMI, WC, HC, and WHC showed the best prediction with the lowest deviance (57.25, misclassification (.16, and the higher pseudo R2 (.46. This model had a sensitivity of 72% and specificity of 86.25% in the training set and, respectively, 58.38% and 69.70% in the test set. Finally, the result from the classification tree analysis was compared with traditional logistic regression, indicating that the former outperformed the latter in terms of predictive power.

  10. Machine learning applied to crime prediction

    OpenAIRE

    Vaquero Barnadas, Miquel

    2016-01-01

    Machine Learning is a cornerstone when it comes to artificial intelligence and big data analysis. It provides powerful algorithms that are capable of recognizing patterns, classifying data, and, basically, learn by themselves to perform a specific task. This field has incredibly grown in popularity these days, however, it still remains unknown for the majority of people, and even for most professionals. This project intends to provide an understandable explanation of what is it, what types ar...

  11. Exploration of Machine Learning Approaches to Predict Pavement Performance

    Science.gov (United States)

    2018-03-23

    Machine learning (ML) techniques were used to model and predict pavement condition index (PCI) for various pavement types using a variety of input variables. The primary objective of this research was to develop and assess PCI predictive models for t...

  12. Statistical and Machine Learning Models to Predict Programming Performance

    OpenAIRE

    Bergin, Susan

    2006-01-01

    This thesis details a longitudinal study on factors that influence introductory programming success and on the development of machine learning models to predict incoming student performance. Although numerous studies have developed models to predict programming success, the models struggled to achieve high accuracy in predicting the likely performance of incoming students. Our approach overcomes this by providing a machine learning technique, using a set of three significant...

  13. Prostate Cancer Probability Prediction By Machine Learning Technique.

    Science.gov (United States)

    Jović, Srđan; Miljković, Milica; Ivanović, Miljan; Šaranović, Milena; Arsić, Milena

    2017-11-26

    The main goal of the study was to explore possibility of prostate cancer prediction by machine learning techniques. In order to improve the survival probability of the prostate cancer patients it is essential to make suitable prediction models of the prostate cancer. If one make relevant prediction of the prostate cancer it is easy to create suitable treatment based on the prediction results. Machine learning techniques are the most common techniques for the creation of the predictive models. Therefore in this study several machine techniques were applied and compared. The obtained results were analyzed and discussed. It was concluded that the machine learning techniques could be used for the relevant prediction of prostate cancer.

  14. Using machine learning, neural networks and statistics to predict bankruptcy

    NARCIS (Netherlands)

    Pompe, P.P.M.; Feelders, A.J.; Feelders, A.J.

    1997-01-01

    Recent literature strongly suggests that machine learning approaches to classification outperform "classical" statistical methods. We make a comparison between the performance of linear discriminant analysis, classification trees, and neural networks in predicting corporate bankruptcy. Linear

  15. Machine learning applied to the prediction of citrus production

    OpenAIRE

    Díaz Rodríguez, Susana Irene; Mazza, Silvia M.; Fernández-Combarro Álvarez, Elías; Giménez, Laura I.; Gaiad, José E.

    2017-01-01

    An in-depth knowledge about variables affecting production is required in order to predict global production and take decisions in agriculture. Machine learning is a technique used in agricultural planning and precision agriculture. This work (i) studies the effectiveness of machine learning techniques for predicting orchards production; and (ii) variables affecting this production were also identified. Data from 964 orchards of lemon, mandarin, and orange in Corrientes, Argentina are analyse...

  16. Conformal prediction for reliable machine learning theory, adaptations and applications

    CERN Document Server

    Balasubramanian, Vineeth; Vovk, Vladimir

    2014-01-01

    The conformal predictions framework is a recent development in machine learning that can associate a reliable measure of confidence with a prediction in any real-world pattern recognition application, including risk-sensitive applications such as medical diagnosis, face recognition, and financial risk prediction. Conformal Predictions for Reliable Machine Learning: Theory, Adaptations and Applications captures the basic theory of the framework, demonstrates how to apply it to real-world problems, and presents several adaptations, including active learning, change detection, and anomaly detecti

  17. Predicting the dissolution kinetics of silicate glasses using machine learning

    Science.gov (United States)

    Anoop Krishnan, N. M.; Mangalathu, Sujith; Smedskjaer, Morten M.; Tandia, Adama; Burton, Henry; Bauchy, Mathieu

    2018-05-01

    Predicting the dissolution rates of silicate glasses in aqueous conditions is a complex task as the underlying mechanism(s) remain poorly understood and the dissolution kinetics can depend on a large number of intrinsic and extrinsic factors. Here, we assess the potential of data-driven models based on machine learning to predict the dissolution rates of various aluminosilicate glasses exposed to a wide range of solution pH values, from acidic to caustic conditions. Four classes of machine learning methods are investigated, namely, linear regression, support vector machine regression, random forest, and artificial neural network. We observe that, although linear methods all fail to describe the dissolution kinetics, the artificial neural network approach offers excellent predictions, thanks to its inherent ability to handle non-linear data. Overall, we suggest that a more extensive use of machine learning approaches could significantly accelerate the design of novel glasses with tailored properties.

  18. Machine learning in Python essential techniques for predictive analysis

    CERN Document Server

    Bowles, Michael

    2015-01-01

    Learn a simpler and more effective way to analyze data and predict outcomes with Python Machine Learning in Python shows you how to successfully analyze data using only two core machine learning algorithms, and how to apply them using Python. By focusing on two algorithm families that effectively predict outcomes, this book is able to provide full descriptions of the mechanisms at work, and the examples that illustrate the machinery with specific, hackable code. The algorithms are explained in simple terms with no complex math and applied using Python, with guidance on algorithm selection, d

  19. Applications of machine learning in cancer prediction and prognosis.

    Science.gov (United States)

    Cruz, Joseph A; Wishart, David S

    2007-02-11

    Machine learning is a branch of artificial intelligence that employs a variety of statistical, probabilistic and optimization techniques that allows computers to "learn" from past examples and to detect hard-to-discern patterns from large, noisy or complex data sets. This capability is particularly well-suited to medical applications, especially those that depend on complex proteomic and genomic measurements. As a result, machine learning is frequently used in cancer diagnosis and detection. More recently machine learning has been applied to cancer prognosis and prediction. This latter approach is particularly interesting as it is part of a growing trend towards personalized, predictive medicine. In assembling this review we conducted a broad survey of the different types of machine learning methods being used, the types of data being integrated and the performance of these methods in cancer prediction and prognosis. A number of trends are noted, including a growing dependence on protein biomarkers and microarray data, a strong bias towards applications in prostate and breast cancer, and a heavy reliance on "older" technologies such artificial neural networks (ANNs) instead of more recently developed or more easily interpretable machine learning methods. A number of published studies also appear to lack an appropriate level of validation or testing. Among the better designed and validated studies it is clear that machine learning methods can be used to substantially (15-25%) improve the accuracy of predicting cancer susceptibility, recurrence and mortality. At a more fundamental level, it is also evident that machine learning is also helping to improve our basic understanding of cancer development and progression.

  20. Applications of Machine Learning in Cancer Prediction and Prognosis

    Directory of Open Access Journals (Sweden)

    Joseph A. Cruz

    2006-01-01

    Full Text Available Machine learning is a branch of artificial intelligence that employs a variety of statistical, probabilistic and optimization techniques that allows computers to “learn” from past examples and to detect hard-to-discern patterns from large, noisy or complex data sets. This capability is particularly well-suited to medical applications, especially those that depend on complex proteomic and genomic measurements. As a result, machine learning is frequently used in cancer diagnosis and detection. More recently machine learning has been applied to cancer prognosis and prediction. This latter approach is particularly interesting as it is part of a growing trend towards personalized, predictive medicine. In assembling this review we conducted a broad survey of the different types of machine learning methods being used, the types of data being integrated and the performance of these methods in cancer prediction and prognosis. A number of trends are noted, including a growing dependence on protein biomarkers and microarray data, a strong bias towards applications in prostate and breast cancer, and a heavy reliance on “older” technologies such artificial neural networks (ANNs instead of more recently developed or more easily interpretable machine learning methods. A number of published studies also appear to lack an appropriate level of validation or testing. Among the better designed and validated studies it is clear that machine learning methods can be used to substantially (15-25% improve the accuracy of predicting cancer susceptibility, recurrence and mortality. At a more fundamental level, it is also evident that machine learning is also helping to improve our basic understanding of cancer development and progression.

  1. Randomized Prediction Games for Adversarial Machine Learning.

    Science.gov (United States)

    Rota Bulo, Samuel; Biggio, Battista; Pillai, Ignazio; Pelillo, Marcello; Roli, Fabio

    In spam and malware detection, attackers exploit randomization to obfuscate malicious data and increase their chances of evading detection at test time, e.g., malware code is typically obfuscated using random strings or byte sequences to hide known exploits. Interestingly, randomization has also been proposed to improve security of learning algorithms against evasion attacks, as it results in hiding information about the classifier to the attacker. Recent work has proposed game-theoretical formulations to learn secure classifiers, by simulating different evasion attacks and modifying the classification function accordingly. However, both the classification function and the simulated data manipulations have been modeled in a deterministic manner, without accounting for any form of randomization. In this paper, we overcome this limitation by proposing a randomized prediction game, namely, a noncooperative game-theoretic formulation in which the classifier and the attacker make randomized strategy selections according to some probability distribution defined over the respective strategy set. We show that our approach allows one to improve the tradeoff between attack detection and false alarms with respect to the state-of-the-art secure classifiers, even against attacks that are different from those hypothesized during design, on application examples including handwritten digit recognition, spam, and malware detection.In spam and malware detection, attackers exploit randomization to obfuscate malicious data and increase their chances of evading detection at test time, e.g., malware code is typically obfuscated using random strings or byte sequences to hide known exploits. Interestingly, randomization has also been proposed to improve security of learning algorithms against evasion attacks, as it results in hiding information about the classifier to the attacker. Recent work has proposed game-theoretical formulations to learn secure classifiers, by simulating different

  2. Predicting Solar Activity Using Machine-Learning Methods

    Science.gov (United States)

    Bobra, M.

    2017-12-01

    Of all the activity observed on the Sun, two of the most energetic events are flares and coronal mass ejections. However, we do not, as of yet, fully understand the physical mechanism that triggers solar eruptions. A machine-learning algorithm, which is favorable in cases where the amount of data is large, is one way to [1] empirically determine the signatures of this mechanism in solar image data and [2] use them to predict solar activity. In this talk, we discuss the application of various machine learning algorithms - specifically, a Support Vector Machine, a sparse linear regression (Lasso), and Convolutional Neural Network - to image data from the photosphere, chromosphere, transition region, and corona taken by instruments aboard the Solar Dynamics Observatory in order to predict solar activity on a variety of time scales. Such an approach may be useful since, at the present time, there are no physical models of flares available for real-time prediction. We discuss our results (Bobra and Couvidat, 2015; Bobra and Ilonidis, 2016; Jonas et al., 2017) as well as other attempts to predict flares using machine-learning (e.g. Ahmed et al., 2013; Nishizuka et al. 2017) and compare these results with the more traditional techniques used by the NOAA Space Weather Prediction Center (Crown, 2012). We also discuss some of the challenges in using machine-learning algorithms for space science applications.

  3. Predicting Market Impact Costs Using Nonparametric Machine Learning Models.

    Directory of Open Access Journals (Sweden)

    Saerom Park

    Full Text Available Market impact cost is the most significant portion of implicit transaction costs that can reduce the overall transaction cost, although it cannot be measured directly. In this paper, we employed the state-of-the-art nonparametric machine learning models: neural networks, Bayesian neural network, Gaussian process, and support vector regression, to predict market impact cost accurately and to provide the predictive model that is versatile in the number of variables. We collected a large amount of real single transaction data of US stock market from Bloomberg Terminal and generated three independent input variables. As a result, most nonparametric machine learning models outperformed a-state-of-the-art benchmark parametric model such as I-star model in four error measures. Although these models encounter certain difficulties in separating the permanent and temporary cost directly, nonparametric machine learning models can be good alternatives in reducing transaction costs by considerably improving in prediction performance.

  4. Predicting Market Impact Costs Using Nonparametric Machine Learning Models.

    Science.gov (United States)

    Park, Saerom; Lee, Jaewook; Son, Youngdoo

    2016-01-01

    Market impact cost is the most significant portion of implicit transaction costs that can reduce the overall transaction cost, although it cannot be measured directly. In this paper, we employed the state-of-the-art nonparametric machine learning models: neural networks, Bayesian neural network, Gaussian process, and support vector regression, to predict market impact cost accurately and to provide the predictive model that is versatile in the number of variables. We collected a large amount of real single transaction data of US stock market from Bloomberg Terminal and generated three independent input variables. As a result, most nonparametric machine learning models outperformed a-state-of-the-art benchmark parametric model such as I-star model in four error measures. Although these models encounter certain difficulties in separating the permanent and temporary cost directly, nonparametric machine learning models can be good alternatives in reducing transaction costs by considerably improving in prediction performance.

  5. Dropout Prediction in E-Learning Courses through the Combination of Machine Learning Techniques

    Science.gov (United States)

    Lykourentzou, Ioanna; Giannoukos, Ioannis; Nikolopoulos, Vassilis; Mpardis, George; Loumos, Vassili

    2009-01-01

    In this paper, a dropout prediction method for e-learning courses, based on three popular machine learning techniques and detailed student data, is proposed. The machine learning techniques used are feed-forward neural networks, support vector machines and probabilistic ensemble simplified fuzzy ARTMAP. Since a single technique may fail to…

  6. WORMHOLE: Novel Least Diverged Ortholog Prediction through Machine Learning.

    Directory of Open Access Journals (Sweden)

    George L Sutphin

    2016-11-01

    Full Text Available The rapid advancement of technology in genomics and targeted genetic manipulation has made comparative biology an increasingly prominent strategy to model human disease processes. Predicting orthology relationships between species is a vital component of comparative biology. Dozens of strategies for predicting orthologs have been developed using combinations of gene and protein sequence, phylogenetic history, and functional interaction with progressively increasing accuracy. A relatively new class of orthology prediction strategies combines aspects of multiple methods into meta-tools, resulting in improved prediction performance. Here we present WORMHOLE, a novel ortholog prediction meta-tool that applies machine learning to integrate 17 distinct ortholog prediction algorithms to identify novel least diverged orthologs (LDOs between 6 eukaryotic species-humans, mice, zebrafish, fruit flies, nematodes, and budding yeast. Machine learning allows WORMHOLE to intelligently incorporate predictions from a wide-spectrum of strategies in order to form aggregate predictions of LDOs with high confidence. In this study we demonstrate the performance of WORMHOLE across each combination of query and target species. We show that WORMHOLE is particularly adept at improving LDO prediction performance between distantly related species, expanding the pool of LDOs while maintaining low evolutionary distance and a high level of functional relatedness between genes in LDO pairs. We present extensive validation, including cross-validated prediction of PANTHER LDOs and evaluation of evolutionary divergence and functional similarity, and discuss future applications of machine learning in ortholog prediction. A WORMHOLE web tool has been developed and is available at http://wormhole.jax.org/.

  7. WORMHOLE: Novel Least Diverged Ortholog Prediction through Machine Learning

    Science.gov (United States)

    Sutphin, George L.; Mahoney, J. Matthew; Sheppard, Keith; Walton, David O.; Korstanje, Ron

    2016-01-01

    The rapid advancement of technology in genomics and targeted genetic manipulation has made comparative biology an increasingly prominent strategy to model human disease processes. Predicting orthology relationships between species is a vital component of comparative biology. Dozens of strategies for predicting orthologs have been developed using combinations of gene and protein sequence, phylogenetic history, and functional interaction with progressively increasing accuracy. A relatively new class of orthology prediction strategies combines aspects of multiple methods into meta-tools, resulting in improved prediction performance. Here we present WORMHOLE, a novel ortholog prediction meta-tool that applies machine learning to integrate 17 distinct ortholog prediction algorithms to identify novel least diverged orthologs (LDOs) between 6 eukaryotic species—humans, mice, zebrafish, fruit flies, nematodes, and budding yeast. Machine learning allows WORMHOLE to intelligently incorporate predictions from a wide-spectrum of strategies in order to form aggregate predictions of LDOs with high confidence. In this study we demonstrate the performance of WORMHOLE across each combination of query and target species. We show that WORMHOLE is particularly adept at improving LDO prediction performance between distantly related species, expanding the pool of LDOs while maintaining low evolutionary distance and a high level of functional relatedness between genes in LDO pairs. We present extensive validation, including cross-validated prediction of PANTHER LDOs and evaluation of evolutionary divergence and functional similarity, and discuss future applications of machine learning in ortholog prediction. A WORMHOLE web tool has been developed and is available at http://wormhole.jax.org/. PMID:27812085

  8. Machine Learning Principles Can Improve Hip Fracture Prediction

    DEFF Research Database (Denmark)

    Kruse, Christian; Eiken, Pia; Vestergaard, Peter

    2017-01-01

    Apply machine learning principles to predict hip fractures and estimate predictor importance in Dual-energy X-ray absorptiometry (DXA)-scanned men and women. Dual-energy X-ray absorptiometry data from two Danish regions between 1996 and 2006 were combined with national Danish patient data.......89 [0.82; 0.95], but with poor calibration in higher probabilities. A ten predictor subset (BMD, biochemical cholesterol and liver function tests, penicillin use and osteoarthritis diagnoses) achieved a test AUC of 0.86 [0.78; 0.94] using an “xgbTree” model. Machine learning can improve hip fracture...... prediction beyond logistic regression using ensemble models. Compiling data from international cohorts of longer follow-up and performing similar machine learning procedures has the potential to further improve discrimination and calibration....

  9. Machine Learning and Conflict Prediction: A Use Case

    Directory of Open Access Journals (Sweden)

    Chris Perry

    2013-10-01

    Full Text Available For at least the last two decades, the international community in general and the United Nations specifically have attempted to develop robust, accurate and effective conflict early warning system for conflict prevention. One potential and promising component of integrated early warning systems lies in the field of machine learning. This paper aims at giving conflict analysis a basic understanding of machine learning methodology as well as to test the feasibility and added value of such an approach. The paper finds that the selection of appropriate machine learning methodologies can offer substantial improvements in accuracy and performance. It also finds that even at this early stage in testing machine learning on conflict prediction, full models offer more predictive power than simply using a prior outbreak of violence as the leading indicator of current violence. This suggests that a refined data selection methodology combined with strategic use of machine learning algorithms could indeed offer a significant addition to the early warning toolkit. Finally, the paper suggests a number of steps moving forward to improve upon this initial test methodology.

  10. Predicting breast screening attendance using machine learning techniques.

    Science.gov (United States)

    Baskaran, Vikraman; Guergachi, Aziz; Bali, Rajeev K; Naguib, Raouf N G

    2011-03-01

    Machine learning-based prediction has been effectively applied for many healthcare applications. Predicting breast screening attendance using machine learning (prior to the actual mammogram) is a new field. This paper presents new predictor attributes for such an algorithm. It describes a new hybrid algorithm that relies on back-propagation and radial basis function-based neural networks for prediction. The algorithm has been developed in an open source-based environment. The algorithm was tested on a 13-year dataset (1995-2008). This paper compares the algorithm and validates its accuracy and efficiency with different platforms. Nearly 80% accuracy and 88% positive predictive value and sensitivity were recorded for the algorithm. The results were encouraging; 40-50% of negative predictive value and specificity warrant further work. Preliminary results were promising and provided ample amount of reasons for testing the algorithm on a larger scale.

  11. Predicting genome-wide redundancy using machine learning

    Directory of Open Access Journals (Sweden)

    Shasha Dennis E

    2010-11-01

    Full Text Available Abstract Background Gene duplication can lead to genetic redundancy, which masks the function of mutated genes in genetic analyses. Methods to increase sensitivity in identifying genetic redundancy can improve the efficiency of reverse genetics and lend insights into the evolutionary outcomes of gene duplication. Machine learning techniques are well suited to classifying gene family members into redundant and non-redundant gene pairs in model species where sufficient genetic and genomic data is available, such as Arabidopsis thaliana, the test case used here. Results Machine learning techniques that combine multiple attributes led to a dramatic improvement in predicting genetic redundancy over single trait classifiers alone, such as BLAST E-values or expression correlation. In withholding analysis, one of the methods used here, Support Vector Machines, was two-fold more precise than single attribute classifiers, reaching a level where the majority of redundant calls were correctly labeled. Using this higher confidence in identifying redundancy, machine learning predicts that about half of all genes in Arabidopsis showed the signature of predicted redundancy with at least one but typically less than three other family members. Interestingly, a large proportion of predicted redundant gene pairs were relatively old duplications (e.g., Ks > 1, suggesting that redundancy is stable over long evolutionary periods. Conclusions Machine learning predicts that most genes will have a functionally redundant paralog but will exhibit redundancy with relatively few genes within a family. The predictions and gene pair attributes for Arabidopsis provide a new resource for research in genetics and genome evolution. These techniques can now be applied to other organisms.

  12. Machine learning modelling for predicting soil liquefaction susceptibility

    Directory of Open Access Journals (Sweden)

    P. Samui

    2011-01-01

    Full Text Available This study describes two machine learning techniques applied to predict liquefaction susceptibility of soil based on the standard penetration test (SPT data from the 1999 Chi-Chi, Taiwan earthquake. The first machine learning technique which uses Artificial Neural Network (ANN based on multi-layer perceptions (MLP that are trained with Levenberg-Marquardt backpropagation algorithm. The second machine learning technique uses the Support Vector machine (SVM that is firmly based on the theory of statistical learning theory, uses classification technique. ANN and SVM have been developed to predict liquefaction susceptibility using corrected SPT [(N160] and cyclic stress ratio (CSR. Further, an attempt has been made to simplify the models, requiring only the two parameters [(N160 and peck ground acceleration (amax/g], for the prediction of liquefaction susceptibility. The developed ANN and SVM models have also been applied to different case histories available globally. The paper also highlights the capability of the SVM over the ANN models.

  13. Machine Learning

    CERN Multimedia

    CERN. Geneva

    2017-01-01

    Machine learning, which builds on ideas in computer science, statistics, and optimization, focuses on developing algorithms to identify patterns and regularities in data, and using these learned patterns to make predictions on new observations. Boosted by its industrial and commercial applications, the field of machine learning is quickly evolving and expanding. Recent advances have seen great success in the realms of computer vision, natural language processing, and broadly in data science. Many of these techniques have already been applied in particle physics, for instance for particle identification, detector monitoring, and the optimization of computer resources. Modern machine learning approaches, such as deep learning, are only just beginning to be applied to the analysis of High Energy Physics data to approach more and more complex problems. These classes will review the framework behind machine learning and discuss recent developments in the field.

  14. Prediction of Employee Turnover in Organizations using Machine Learning Algorithms

    OpenAIRE

    Rohit Punnoose; Pankaj Ajit

    2016-01-01

    Employee turnover has been identified as a key issue for organizations because of its adverse impact on work place productivity and long term growth strategies. To solve this problem, organizations use machine learning techniques to predict employee turnover. Accurate predictions enable organizations to take action for retention or succession planning of employees. However, the data for this modeling problem comes from HR Information Systems (HRIS); these are typically under-funded compared t...

  15. Machine learning derived risk prediction of anorexia nervosa.

    Science.gov (United States)

    Guo, Yiran; Wei, Zhi; Keating, Brendan J; Hakonarson, Hakon

    2016-01-20

    Anorexia nervosa (AN) is a complex psychiatric disease with a moderate to strong genetic contribution. In addition to conventional genome wide association (GWA) studies, researchers have been using machine learning methods in conjunction with genomic data to predict risk of diseases in which genetics play an important role. In this study, we collected whole genome genotyping data on 3940 AN cases and 9266 controls from the Genetic Consortium for Anorexia Nervosa (GCAN), the Wellcome Trust Case Control Consortium 3 (WTCCC3), Price Foundation Collaborative Group and the Children's Hospital of Philadelphia (CHOP), and applied machine learning methods for predicting AN disease risk. The prediction performance is measured by area under the receiver operating characteristic curve (AUC), indicating how well the model distinguishes cases from unaffected control subjects. Logistic regression model with the lasso penalty technique generated an AUC of 0.693, while Support Vector Machines and Gradient Boosted Trees reached AUC's of 0.691 and 0.623, respectively. Using different sample sizes, our results suggest that larger datasets are required to optimize the machine learning models and achieve higher AUC values. To our knowledge, this is the first attempt to assess AN risk based on genome wide genotype level data. Future integration of genomic, environmental and family-based information is likely to improve the AN risk evaluation process, eventually benefitting AN patients and families in the clinical setting.

  16. Machine learning algorithms for datasets popularity prediction

    CERN Document Server

    Kancys, Kipras

    2016-01-01

    This report represents continued study where ML algorithms were used to predict databases popularity. Three topics were covered. First of all, there was a discrepancy between old and new meta-data collection procedures, so a reason for that had to be found. Secondly, different parameters were analysed and dropped to make algorithms perform better. And third, it was decided to move modelling part on Spark.

  17. Machine learning models in breast cancer survival prediction.

    Science.gov (United States)

    Montazeri, Mitra; Montazeri, Mohadeseh; Montazeri, Mahdieh; Beigzadeh, Amin

    2016-01-01

    Breast cancer is one of the most common cancers with a high mortality rate among women. With the early diagnosis of breast cancer survival will increase from 56% to more than 86%. Therefore, an accurate and reliable system is necessary for the early diagnosis of this cancer. The proposed model is the combination of rules and different machine learning techniques. Machine learning models can help physicians to reduce the number of false decisions. They try to exploit patterns and relationships among a large number of cases and predict the outcome of a disease using historical cases stored in datasets. The objective of this study is to propose a rule-based classification method with machine learning techniques for the prediction of different types of Breast cancer survival. We use a dataset with eight attributes that include the records of 900 patients in which 876 patients (97.3%) and 24 (2.7%) patients were females and males respectively. Naive Bayes (NB), Trees Random Forest (TRF), 1-Nearest Neighbor (1NN), AdaBoost (AD), Support Vector Machine (SVM), RBF Network (RBFN), and Multilayer Perceptron (MLP) machine learning techniques with 10-cross fold technique were used with the proposed model for the prediction of breast cancer survival. The performance of machine learning techniques were evaluated with accuracy, precision, sensitivity, specificity, and area under ROC curve. Out of 900 patients, 803 patients and 97 patients were alive and dead, respectively. In this study, Trees Random Forest (TRF) technique showed better results in comparison to other techniques (NB, 1NN, AD, SVM and RBFN, MLP). The accuracy, sensitivity and the area under ROC curve of TRF are 96%, 96%, 93%, respectively. However, 1NN machine learning technique provided poor performance (accuracy 91%, sensitivity 91% and area under ROC curve 78%). This study demonstrates that Trees Random Forest model (TRF) which is a rule-based classification model was the best model with the highest level of

  18. Deep learning versus traditional machine learning methods for aggregated energy demand prediction

    NARCIS (Netherlands)

    Paterakis, N.G.; Mocanu, E.; Gibescu, M.; Stappers, B.; van Alst, W.

    2018-01-01

    In this paper the more advanced, in comparison with traditional machine learning approaches, deep learning methods are explored with the purpose of accurately predicting the aggregated energy consumption. Despite the fact that a wide range of machine learning methods have been applied to

  19. Machine learning methods in predicting the student academic motivation

    Directory of Open Access Journals (Sweden)

    Ivana Đurđević Babić

    2017-01-01

    Full Text Available Academic motivation is closely related to academic performance. For educators, it is equally important to detect early students with a lack of academic motivation as it is to detect those with a high level of academic motivation. In endeavouring to develop a classification model for predicting student academic motivation based on their behaviour in learning management system (LMS courses, this paper intends to establish links between the predicted student academic motivation and their behaviour in the LMS course. Students from all years at the Faculty of Education in Osijek participated in this research. Three machine learning classifiers (neural networks, decision trees, and support vector machines were used. To establish whether a significant difference in the performance of models exists, a t-test of the difference in proportions was used. Although, all classifiers were successful, the neural network model was shown to be the most successful in detecting the student academic motivation based on their behaviour in LMS course.

  20. Machine learning application in online lending risk prediction

    OpenAIRE

    Yu, Xiaojiao

    2017-01-01

    Online leading has disrupted the traditional consumer banking sector with more effective loan processing. Risk prediction and monitoring is critical for the success of the business model. Traditional credit score models fall short in applying big data technology in building risk model. In this manuscript, data with various format and size were collected from public website, third-parties and assembled with client's loan application information data. Ensemble machine learning models, random fo...

  1. Pileup Subtraction and Jet Energy Prediction Using Machine Learning

    OpenAIRE

    Kong, Vein S; Li, Jiakun; Zhang, Yujia

    2015-01-01

    In the Large Hardron Collider (LHC), multiple proton-proton collisions cause pileup in reconstructing energy information for a single primary collision (jet). This project aims to select the most important features and create a model to accurately estimate jet energy. Different machine learning methods were explored, including linear regression, support vector regression and decision tree. The best result is obtained by linear regression with predictive features and the performance is improve...

  2. Improving orbit prediction accuracy through supervised machine learning

    Science.gov (United States)

    Peng, Hao; Bai, Xiaoli

    2018-05-01

    Due to the lack of information such as the space environment condition and resident space objects' (RSOs') body characteristics, current orbit predictions that are solely grounded on physics-based models may fail to achieve required accuracy for collision avoidance and have led to satellite collisions already. This paper presents a methodology to predict RSOs' trajectories with higher accuracy than that of the current methods. Inspired by the machine learning (ML) theory through which the models are learned based on large amounts of observed data and the prediction is conducted without explicitly modeling space objects and space environment, the proposed ML approach integrates physics-based orbit prediction algorithms with a learning-based process that focuses on reducing the prediction errors. Using a simulation-based space catalog environment as the test bed, the paper demonstrates three types of generalization capability for the proposed ML approach: (1) the ML model can be used to improve the same RSO's orbit information that is not available during the learning process but shares the same time interval as the training data; (2) the ML model can be used to improve predictions of the same RSO at future epochs; and (3) the ML model based on a RSO can be applied to other RSOs that share some common features.

  3. Prediction of length-of-day using extreme learning machine

    Directory of Open Access Journals (Sweden)

    Yu Lei

    2015-03-01

    Full Text Available Traditional artificial neural networks (ANN such as back-propagation neural networks (BPNN provide good predictions of length-of-day (LOD. However, the determination of network topology is difficult and time consuming. Therefore, we propose a new type of neural network, extreme learning machine (ELM, to improve the efficiency of LOD predictions. Earth orientation parameters (EOP C04 time-series provides daily values from International Earth Rotation and Reference Systems Service (IERS, which serves as our database. First, the known predictable effects that can be described by functional models—such as the effects of solid earth, ocean tides, or seasonal atmospheric variations—are removed a priori from the C04 time-series. Only the residuals after the subtraction of a priori model from the observed LOD data (i.e., the irregular and quasi-periodic variations are employed for training and predictions. The predicted LOD is the sum of a prior extrapolation model and the ELM predictions of the residuals. Different input patterns are discussed and compared to optimize the network solution. The prediction results are analyzed and compared with those obtained by other machine learning-based prediction methods, including BPNN, generalization regression neural networks (GRNN, and adaptive network-based fuzzy inference systems (ANFIS. It is shown that while achieving similar prediction accuracy, the developed method uses much less training time than other methods. Furthermore, to conduct a direct comparison with the existing prediction techniques, the mean-absolute-error (MAE from the proposed method is compared with that from the EOP prediction comparison campaign (EOP PCC. The results indicate that the accuracy of the proposed method is comparable with that of the former techniques. The implementation of the proposed method is simple.

  4. Prediction of skin sensitization potency using machine learning approaches.

    Science.gov (United States)

    Zang, Qingda; Paris, Michael; Lehmann, David M; Bell, Shannon; Kleinstreuer, Nicole; Allen, David; Matheson, Joanna; Jacobs, Abigail; Casey, Warren; Strickland, Judy

    2017-07-01

    The replacement of animal use in testing for regulatory classification of skin sensitizers is a priority for US federal agencies that use data from such testing. Machine learning models that classify substances as sensitizers or non-sensitizers without using animal data have been developed and evaluated. Because some regulatory agencies require that sensitizers be further classified into potency categories, we developed statistical models to predict skin sensitization potency for murine local lymph node assay (LLNA) and human outcomes. Input variables for our models included six physicochemical properties and data from three non-animal test methods: direct peptide reactivity assay; human cell line activation test; and KeratinoSens™ assay. Models were built to predict three potency categories using four machine learning approaches and were validated using external test sets and leave-one-out cross-validation. A one-tiered strategy modeled all three categories of response together while a two-tiered strategy modeled sensitizer/non-sensitizer responses and then classified the sensitizers as strong or weak sensitizers. The two-tiered model using the support vector machine with all assay and physicochemical data inputs provided the best performance, yielding accuracy of 88% for prediction of LLNA outcomes (120 substances) and 81% for prediction of human test outcomes (87 substances). The best one-tiered model predicted LLNA outcomes with 78% accuracy and human outcomes with 75% accuracy. By comparison, the LLNA predicts human potency categories with 69% accuracy (60 of 87 substances correctly categorized). These results suggest that computational models using non-animal methods may provide valuable information for assessing skin sensitization potency. Copyright © 2017 John Wiley & Sons, Ltd. Copyright © 2017 John Wiley & Sons, Ltd.

  5. A machine learning approach to the accurate prediction of monitor units for a compact proton machine.

    Science.gov (United States)

    Sun, Baozhou; Lam, Dao; Yang, Deshan; Grantham, Kevin; Zhang, Tiezhi; Mutic, Sasa; Zhao, Tianyu

    2018-05-01

    Clinical treatment planning systems for proton therapy currently do not calculate monitor units (MUs) in passive scatter proton therapy due to the complexity of the beam delivery systems. Physical phantom measurements are commonly employed to determine the field-specific output factors (OFs) but are often subject to limited machine time, measurement uncertainties and intensive labor. In this study, a machine learning-based approach was developed to predict output (cGy/MU) and derive MUs, incorporating the dependencies on gantry angle and field size for a single-room proton therapy system. The goal of this study was to develop a secondary check tool for OF measurements and eventually eliminate patient-specific OF measurements. The OFs of 1754 fields previously measured in a water phantom with calibrated ionization chambers and electrometers for patient-specific fields with various range and modulation width combinations for 23 options were included in this study. The training data sets for machine learning models in three different methods (Random Forest, XGBoost and Cubist) included 1431 (~81%) OFs. Ten-fold cross-validation was used to prevent "overfitting" and to validate each model. The remaining 323 (~19%) OFs were used to test the trained models. The difference between the measured and predicted values from machine learning models was analyzed. Model prediction accuracy was also compared with that of the semi-empirical model developed by Kooy (Phys. Med. Biol. 50, 2005). Additionally, gantry angle dependence of OFs was measured for three groups of options categorized on the selection of the second scatters. Field size dependence of OFs was investigated for the measurements with and without patient-specific apertures. All three machine learning methods showed higher accuracy than the semi-empirical model which shows considerably large discrepancy of up to 7.7% for the treatment fields with full range and full modulation width. The Cubist-based solution

  6. Automatically explaining machine learning prediction results: a demonstration on type 2 diabetes risk prediction.

    Science.gov (United States)

    Luo, Gang

    2016-01-01

    Predictive modeling is a key component of solutions to many healthcare problems. Among all predictive modeling approaches, machine learning methods often achieve the highest prediction accuracy, but suffer from a long-standing open problem precluding their widespread use in healthcare. Most machine learning models give no explanation for their prediction results, whereas interpretability is essential for a predictive model to be adopted in typical healthcare settings. This paper presents the first complete method for automatically explaining results for any machine learning predictive model without degrading accuracy. We did a computer coding implementation of the method. Using the electronic medical record data set from the Practice Fusion diabetes classification competition containing patient records from all 50 states in the United States, we demonstrated the method on predicting type 2 diabetes diagnosis within the next year. For the champion machine learning model of the competition, our method explained prediction results for 87.4 % of patients who were correctly predicted by the model to have type 2 diabetes diagnosis within the next year. Our demonstration showed the feasibility of automatically explaining results for any machine learning predictive model without degrading accuracy.

  7. Machine Learning Techniques for Prediction of Early Childhood Obesity.

    Science.gov (United States)

    Dugan, T M; Mukhopadhyay, S; Carroll, A; Downs, S

    2015-01-01

    This paper aims to predict childhood obesity after age two, using only data collected prior to the second birthday by a clinical decision support system called CHICA. Analyses of six different machine learning methods: RandomTree, RandomForest, J48, ID3, Naïve Bayes, and Bayes trained on CHICA data show that an accurate, sensitive model can be created. Of the methods analyzed, the ID3 model trained on the CHICA dataset proved the best overall performance with accuracy of 85% and sensitivity of 89%. Additionally, the ID3 model had a positive predictive value of 84% and a negative predictive value of 88%. The structure of the tree also gives insight into the strongest predictors of future obesity in children. Many of the strongest predictors seen in the ID3 modeling of the CHICA dataset have been independently validated in the literature as correlated with obesity, thereby supporting the validity of the model. This study demonstrated that data from a production clinical decision support system can be used to build an accurate machine learning model to predict obesity in children after age two.

  8. Machine learning applied to the prediction of citrus production

    Energy Technology Data Exchange (ETDEWEB)

    Díaz, I.; Mazza, S.M.; Combarro, E.F.; Giménez, L.I.; Gaiad, J.E.

    2017-07-01

    An in-depth knowledge about variables affecting production is required in order to predict global production and take decisions in agriculture. Machine learning is a technique used in agricultural planning and precision agriculture. This work (i) studies the effectiveness of machine learning techniques for predicting orchards production; and (ii) variables affecting this production were also identified. Data from 964 orchards of lemon, mandarin, and orange in Corrientes, Argentina are analysed. Graphic and analytical descriptive statistics, correlation coefficients, principal component analysis and Biplot were performed. Production was predicted via M5-Prime, a model regression tree constructor which produces a classification based on piecewise linear functions. For all the species studied, the most informative variable was the trees' age; in mandarin and orange orchards, age was followed by between and within row distances; irrigation also affected mandarin production. Also, the performance of M5-Prime in the prediction of production is adequate, as shown when measured with correlation coefficients (~0.8) and relative mean absolute error (~0.1). These results show that M5-Prime is an appropriate method to classify citrus orchards according to production and, in addition, it allows for identifying the most informative variables affecting production by tree.

  9. Machine learning applied to the prediction of citrus production

    International Nuclear Information System (INIS)

    Díaz, I.; Mazza, S.M.; Combarro, E.F.; Giménez, L.I.; Gaiad, J.E.

    2017-01-01

    An in-depth knowledge about variables affecting production is required in order to predict global production and take decisions in agriculture. Machine learning is a technique used in agricultural planning and precision agriculture. This work (i) studies the effectiveness of machine learning techniques for predicting orchards production; and (ii) variables affecting this production were also identified. Data from 964 orchards of lemon, mandarin, and orange in Corrientes, Argentina are analysed. Graphic and analytical descriptive statistics, correlation coefficients, principal component analysis and Biplot were performed. Production was predicted via M5-Prime, a model regression tree constructor which produces a classification based on piecewise linear functions. For all the species studied, the most informative variable was the trees' age; in mandarin and orange orchards, age was followed by between and within row distances; irrigation also affected mandarin production. Also, the performance of M5-Prime in the prediction of production is adequate, as shown when measured with correlation coefficients (~0.8) and relative mean absolute error (~0.1). These results show that M5-Prime is an appropriate method to classify citrus orchards according to production and, in addition, it allows for identifying the most informative variables affecting production by tree.

  10. Machine learning applied to the prediction of citrus production

    Directory of Open Access Journals (Sweden)

    Irene Díaz

    2017-07-01

    Full Text Available An in-depth knowledge about variables affecting production is required in order to predict global production and take decisions in agriculture. Machine learning is a technique used in agricultural planning and precision agriculture. This work (i studies the effectiveness of machine learning techniques for predicting orchards production; and (ii variables affecting this production were also identified. Data from 964 orchards of lemon, mandarin, and orange in Corrientes, Argentina are analysed. Graphic and analytical descriptive statistics, correlation coefficients, principal component analysis and Biplot were performed. Production was predicted via M5-Prime, a model regression tree constructor which produces a classification based on piecewise linear functions. For all the species studied, the most informative variable was the trees’ age; in mandarin and orange orchards, age was followed by between and within row distances; irrigation also affected mandarin production. Also, the performance of M5-Prime in the prediction of production is adequate, as shown when measured with correlation coefficients (~0.8 and relative mean absolute error (~0.1. These results show that M5-Prime is an appropriate method to classify citrus orchards according to production and, in addition, it allows for identifying the most informative variables affecting production by tree.

  11. Prediction of stroke thrombolysis outcome using CT brain machine learning

    Directory of Open Access Journals (Sweden)

    Paul Bentley

    2014-01-01

    Full Text Available A critical decision-step in the emergency treatment of ischemic stroke is whether or not to administer thrombolysis — a treatment that can result in good recovery, or deterioration due to symptomatic intracranial haemorrhage (SICH. Certain imaging features based upon early computerized tomography (CT, in combination with clinical variables, have been found to predict SICH, albeit with modest accuracy. In this proof-of-concept study, we determine whether machine learning of CT images can predict which patients receiving tPA will develop SICH as opposed to showing clinical improvement with no haemorrhage. Clinical records and CT brains of 116 acute ischemic stroke patients treated with intravenous thrombolysis were collected retrospectively (including 16 who developed SICH. The sample was split into training (n = 106 and test sets (n = 10, repeatedly for 1760 different combinations. CT brain images acted as inputs into a support vector machine (SVM, along with clinical severity. Performance of the SVM was compared with established prognostication tools (SEDAN and HAT scores; original, or after adaptation to our cohort. Predictive performance, assessed as area under receiver-operating-characteristic curve (AUC, of the SVM (0.744 compared favourably with that of prognostic scores (original and adapted versions: 0.626–0.720; p < 0.01. The SVM also identified 9 out of 16 SICHs, as opposed to 1–5 using prognostic scores, assuming a 10% SICH frequency (p < 0.001. In summary, machine learning methods applied to acute stroke CT images offer automation, and potentially improved performance, for prediction of SICH following thrombolysis. Larger-scale cohorts, and incorporation of advanced imaging, should be tested with such methods.

  12. Predicting DPP-IV inhibitors with machine learning approaches

    Science.gov (United States)

    Cai, Jie; Li, Chanjuan; Liu, Zhihong; Du, Jiewen; Ye, Jiming; Gu, Qiong; Xu, Jun

    2017-04-01

    Dipeptidyl peptidase IV (DPP-IV) is a promising Type 2 diabetes mellitus (T2DM) drug target. DPP-IV inhibitors prolong the action of glucagon-like peptide-1 (GLP-1) and gastric inhibitory peptide (GIP), improve glucose homeostasis without weight gain, edema, and hypoglycemia. However, the marketed DPP-IV inhibitors have adverse effects such as nasopharyngitis, headache, nausea, hypersensitivity, skin reactions and pancreatitis. Therefore, it is still expected for novel DPP-IV inhibitors with minimal adverse effects. The scaffolds of existing DPP-IV inhibitors are structurally diversified. This makes it difficult to build virtual screening models based upon the known DPP-IV inhibitor libraries using conventional QSAR approaches. In this paper, we report a new strategy to predict DPP-IV inhibitors with machine learning approaches involving naïve Bayesian (NB) and recursive partitioning (RP) methods. We built 247 machine learning models based on 1307 known DPP-IV inhibitors with optimized molecular properties and topological fingerprints as descriptors. The overall predictive accuracies of the optimized models were greater than 80%. An external test set, composed of 65 recently reported compounds, was employed to validate the optimized models. The results demonstrated that both NB and RP models have a good predictive ability based on different combinations of descriptors. Twenty "good" and twenty "bad" structural fragments for DPP-IV inhibitors can also be derived from these models for inspiring the new DPP-IV inhibitor scaffold design.

  13. Machine learning landscapes and predictions for patient outcomes

    Science.gov (United States)

    Das, Ritankar; Wales, David J.

    2017-07-01

    The theory and computational tools developed to interpret and explore energy landscapes in molecular science are applied to the landscapes defined by local minima for neural networks. These machine learning landscapes correspond to fits of training data, where the inputs are vital signs and laboratory measurements for a database of patients, and the objective is to predict a clinical outcome. In this contribution, we test the predictions obtained by fitting to single measurements, and then to combinations of between 2 and 10 different patient medical data items. The effect of including measurements over different time intervals from the 48 h period in question is analysed, and the most recent values are found to be the most important. We also compare results obtained for neural networks as a function of the number of hidden nodes, and for different values of a regularization parameter. The predictions are compared with an alternative convex fitting function, and a strong correlation is observed. The dependence of these results on the patients randomly selected for training and testing decreases systematically with the size of the database available. The machine learning landscapes defined by neural network fits in this investigation have single-funnel character, which probably explains why it is relatively straightforward to obtain the global minimum solution, or a fit that behaves similarly to this optimal parameterization.

  14. Daily sea level prediction at Chiayi coast, Taiwan using extreme learning machine and relevance vector machine

    Science.gov (United States)

    Imani, Moslem; Kao, Huan-Chin; Lan, Wen-Hau; Kuo, Chung-Yen

    2018-02-01

    The analysis and the prediction of sea level fluctuations are core requirements of marine meteorology and operational oceanography. Estimates of sea level with hours-to-days warning times are especially important for low-lying regions and coastal zone management. The primary purpose of this study is to examine the applicability and capability of extreme learning machine (ELM) and relevance vector machine (RVM) models for predicting sea level variations and compare their performances with powerful machine learning methods, namely, support vector machine (SVM) and radial basis function (RBF) models. The input dataset from the period of January 2004 to May 2011 used in the study was obtained from the Dongshi tide gauge station in Chiayi, Taiwan. Results showed that the ELM and RVM models outperformed the other methods. The performance of the RVM approach was superior in predicting the daily sea level time series given the minimum root mean square error of 34.73 mm and the maximum determination coefficient of 0.93 (R2) during the testing periods. Furthermore, the obtained results were in close agreement with the original tide-gauge data, which indicates that RVM approach is a promising alternative method for time series prediction and could be successfully used for daily sea level forecasts.

  15. Application of Machine Learning for Dragline Failure Prediction

    Directory of Open Access Journals (Sweden)

    Taghizadeh Amir

    2017-01-01

    Full Text Available Overburden stripping in open cast coal mines is extensively carried out by walking draglines. Draglines’ unavailability and unexpected failures result in delayed productions and increased maintenance and operating costs. Therefore, achieving high availability of draglines plays a crucial role for increasing economic feasibility of mining projects. Applications of methodologies which can forecast the failure type of dragline based on the available failure data not only help to reduce the maintenance and operating costs but also increase the availability and the production rate. In this study, Machine Learning approaches have been applied for data which has been gathered from an operating coal mine in Turkey. The study methodology consists of three algorithms as: i implementation of K-Nearest Neighbors, ii implementation of Multi-Layer Perceptron, and iii implementation of Radial Basis Function. The algorithms have been utilized for predicting the draglines’ failure types. In this sense, the input data, which are mean time-to-failure, and the output data, failure types, have been fed to the algorithms. The regression analysis of methodologies have been compared and showed the K- Nearest Neighbors has a higher rate of regression which is around 70 percent. Thus, the K-Nearest Neighbor algorithm can be applied in order to preventive components replacement which causes to minimized preventive and corrective cost parameters. The accurate prediction of failure type, indeed, causes to optimized number of inspections. The novelty of this study is application of machine learning approaches in draglines’ reliability subject for first time.

  16. Predicting Smoking Status Using Machine Learning Algorithms and Statistical Analysis

    Directory of Open Access Journals (Sweden)

    Charles Frank

    2018-03-01

    Full Text Available Smoking has been proven to negatively affect health in a multitude of ways. As of 2009, smoking has been considered the leading cause of preventable morbidity and mortality in the United States, continuing to plague the country’s overall health. This study aims to investigate the viability and effectiveness of some machine learning algorithms for predicting the smoking status of patients based on their blood tests and vital readings results. The analysis of this study is divided into two parts: In part 1, we use One-way ANOVA analysis with SAS tool to show the statistically significant difference in blood test readings between smokers and non-smokers. The results show that the difference in INR, which measures the effectiveness of anticoagulants, was significant in favor of non-smokers which further confirms the health risks associated with smoking. In part 2, we use five machine learning algorithms: Naïve Bayes, MLP, Logistic regression classifier, J48 and Decision Table to predict the smoking status of patients. To compare the effectiveness of these algorithms we use: Precision, Recall, F-measure and Accuracy measures. The results show that the Logistic algorithm outperformed the four other algorithms with Precision, Recall, F-Measure, and Accuracy of 83%, 83.4%, 83.2%, 83.44%, respectively.

  17. Prediction of Baseflow Index of Catchments using Machine Learning Algorithms

    Science.gov (United States)

    Yadav, B.; Hatfield, K.

    2017-12-01

    We present the results of eight machine learning techniques for predicting the baseflow index (BFI) of ungauged basins using a surrogate of catchment scale climate and physiographic data. The tested algorithms include ordinary least squares, ridge regression, least absolute shrinkage and selection operator (lasso), elasticnet, support vector machine, gradient boosted regression trees, random forests, and extremely randomized trees. Our work seeks to identify the dominant controls of BFI that can be readily obtained from ancillary geospatial databases and remote sensing measurements, such that the developed techniques can be extended to ungauged catchments. More than 800 gauged catchments spanning the continental United States were selected to develop the general methodology. The BFI calculation was based on the baseflow separated from daily streamflow hydrograph using HYSEP filter. The surrogate catchment attributes were compiled from multiple sources including digital elevation model, soil, landuse, climate data, other publicly available ancillary and geospatial data. 80% catchments were used to train the ML algorithms, and the remaining 20% of the catchments were used as an independent test set to measure the generalization performance of fitted models. A k-fold cross-validation using exhaustive grid search was used to fit the hyperparameters of each model. Initial model development was based on 19 independent variables, but after variable selection and feature ranking, we generated revised sparse models of BFI prediction that are based on only six catchment attributes. These key predictive variables selected after the careful evaluation of bias-variance tradeoff include average catchment elevation, slope, fraction of sand, permeability, temperature, and precipitation. The most promising algorithms exceeding an accuracy score (r-square) of 0.7 on test data include support vector machine, gradient boosted regression trees, random forests, and extremely randomized

  18. Prediction of beta-turns with learning machines.

    Science.gov (United States)

    Cai, Yu-Dong; Liu, Xiao-Jun; Li, Yi-Xue; Xu, Xue-biao; Chou, Kuo-Chen

    2003-05-01

    The support vector machine approach was introduced to predict the beta-turns in proteins. The overall self-consistency rate by the re-substitution test for the training or learning dataset reached 100%. Both the training dataset and independent testing dataset were taken from Chou [J. Pept. Res. 49 (1997) 120]. The success prediction rates by the jackknife test for the beta-turn subset of 455 tetrapeptides and non-beta-turn subset of 3807 tetrapeptides in the training dataset were 58.1 and 98.4%, respectively. The success rates with the independent dataset test for the beta-turn subset of 110 tetrapeptides and non-beta-turn subset of 30,231 tetrapeptides were 69.1 and 97.3%, respectively. The results obtained from this study support the conclusion that the residue-coupled effect along a tetrapeptide is important for the formation of a beta-turn.

  19. Machine-Learning-Based No Show Prediction in Outpatient Visits

    Directory of Open Access Journals (Sweden)

    Carlos Elvira

    2018-03-01

    Full Text Available A recurring problem in healthcare is the high percentage of patients who miss their appointment, be it a consultation or a hospital test. The present study seeks patient’s behavioural patterns that allow predicting the probability of no- shows. We explore the convenience of using Big Data Machine Learning models to accomplish this task. To begin with, a predictive model based only on variables associated with the target appointment is built. Then the model is improved by considering the patient’s history of appointments. In both cases, the Gradient Boosting algorithm was the predictor of choice. Our numerical results are considered promising given the small amount of information available. However, there seems to be plenty of room to improve the model if we manage to collect additional data for both patients and appointments.

  20. Extremely Randomized Machine Learning Methods for Compound Activity Prediction

    Directory of Open Access Journals (Sweden)

    Wojciech M. Czarnecki

    2015-11-01

    Full Text Available Speed, a relatively low requirement for computational resources and high effectiveness of the evaluation of the bioactivity of compounds have caused a rapid growth of interest in the application of machine learning methods to virtual screening tasks. However, due to the growth of the amount of data also in cheminformatics and related fields, the aim of research has shifted not only towards the development of algorithms of high predictive power but also towards the simplification of previously existing methods to obtain results more quickly. In the study, we tested two approaches belonging to the group of so-called ‘extremely randomized methods’—Extreme Entropy Machine and Extremely Randomized Trees—for their ability to properly identify compounds that have activity towards particular protein targets. These methods were compared with their ‘non-extreme’ competitors, i.e., Support Vector Machine and Random Forest. The extreme approaches were not only found out to improve the efficiency of the classification of bioactive compounds, but they were also proved to be less computationally complex, requiring fewer steps to perform an optimization procedure.

  1. Investigating Mesoscale Convective Systems and their Predictability Using Machine Learning

    Science.gov (United States)

    Daher, H.; Duffy, D.; Bowen, M. K.

    2016-12-01

    A mesoscale convective system (MCS) is a thunderstorm region that lasts several hours long and forms near weather fronts and can often develop into tornadoes. Here we seek to answer the question of whether these tornadoes are "predictable" by looking for a defining characteristic(s) separating MCSs that evolve into tornadoes versus those that do not. Using NASA's Modern Era Retrospective-analysis for Research and Applications 2 reanalysis data (M2R12K), we apply several state of the art machine learning techniques to investigate this question. The spatial region examined in this experiment is Tornado Alley in the United States over the peak tornado months. A database containing select variables from M2R12K is created using PostgreSQL. This database is then analyzed using machine learning methods such as Symbolic Aggregate approXimation (SAX) and DBSCAN (an unsupervised density-based data clustering algorithm). The incentive behind using these methods is to mathematically define a MCS so that association rule mining techniques can be used to uncover some sort of signal or teleconnection that will help us forecast which MCSs will result in tornadoes and therefore give society more time to prepare and in turn reduce casualties and destruction.

  2. Support vector machine incremental learning triggered by wrongly predicted samples

    Science.gov (United States)

    Tang, Ting-long; Guan, Qiu; Wu, Yi-rong

    2018-05-01

    According to the classic Karush-Kuhn-Tucker (KKT) theorem, at every step of incremental support vector machine (SVM) learning, the newly adding sample which violates the KKT conditions will be a new support vector (SV) and migrate the old samples between SV set and non-support vector (NSV) set, and at the same time the learning model should be updated based on the SVs. However, it is not exactly clear at this moment that which of the old samples would change between SVs and NSVs. Additionally, the learning model will be unnecessarily updated, which will not greatly increase its accuracy but decrease the training speed. Therefore, how to choose the new SVs from old sets during the incremental stages and when to process incremental steps will greatly influence the accuracy and efficiency of incremental SVM learning. In this work, a new algorithm is proposed to select candidate SVs and use the wrongly predicted sample to trigger the incremental processing simultaneously. Experimental results show that the proposed algorithm can achieve good performance with high efficiency, high speed and good accuracy.

  3. Machine Learning and Neurosurgical Outcome Prediction: A Systematic Review.

    Science.gov (United States)

    Senders, Joeky T; Staples, Patrick C; Karhade, Aditya V; Zaki, Mark M; Gormley, William B; Broekman, Marike L D; Smith, Timothy R; Arnaout, Omar

    2018-01-01

    Accurate measurement of surgical outcomes is highly desirable to optimize surgical decision-making. An important element of surgical decision making is identification of the patient cohort that will benefit from surgery before the intervention. Machine learning (ML) enables computers to learn from previous data to make accurate predictions on new data. In this systematic review, we evaluate the potential of ML for neurosurgical outcome prediction. A systematic search in the PubMed and Embase databases was performed to identify all potential relevant studies up to January 1, 2017. Thirty studies were identified that evaluated ML algorithms used as prediction models for survival, recurrence, symptom improvement, and adverse events in patients undergoing surgery for epilepsy, brain tumor, spinal lesions, neurovascular disease, movement disorders, traumatic brain injury, and hydrocephalus. Depending on the specific prediction task evaluated and the type of input features included, ML models predicted outcomes after neurosurgery with a median accuracy and area under the receiver operating curve of 94.5% and 0.83, respectively. Compared with logistic regression, ML models performed significantly better and showed a median absolute improvement in accuracy and area under the receiver operating curve of 15% and 0.06, respectively. Some studies also demonstrated a better performance in ML models compared with established prognostic indices and clinical experts. In the research setting, ML has been studied extensively, demonstrating an excellent performance in outcome prediction for a wide range of neurosurgical conditions. However, future studies should investigate how ML can be implemented as a practical tool supporting neurosurgical care. Copyright © 2017 Elsevier Inc. All rights reserved.

  4. Application of Machine Learning Algorithms for the Query Performance Prediction

    Directory of Open Access Journals (Sweden)

    MILICEVIC, M.

    2015-08-01

    Full Text Available This paper analyzes the relationship between the system load/throughput and the query response time in a real Online transaction processing (OLTP system environment. Although OLTP systems are characterized by short transactions, which normally entail high availability and consistent short response times, the need for operational reporting may jeopardize these objectives. We suggest a new approach to performance prediction for concurrent database workloads, based on the system state vector which consists of 36 attributes. There is no bias to the importance of certain attributes, but the machine learning methods are used to determine which attributes better describe the behavior of the particular database server and how to model that system. During the learning phase, the system's profile is created using multiple reference queries, which are selected to represent frequent business processes. The possibility of the accurate response time prediction may be a foundation for automated decision-making for database (DB query scheduling. Possible applications of the proposed method include adaptive resource allocation, quality of service (QoS management or real-time dynamic query scheduling (e.g. estimation of the optimal moment for a complex query execution.

  5. Machine learning applications in cancer prognosis and prediction.

    Science.gov (United States)

    Kourou, Konstantina; Exarchos, Themis P; Exarchos, Konstantinos P; Karamouzis, Michalis V; Fotiadis, Dimitrios I

    2015-01-01

    Cancer has been characterized as a heterogeneous disease consisting of many different subtypes. The early diagnosis and prognosis of a cancer type have become a necessity in cancer research, as it can facilitate the subsequent clinical management of patients. The importance of classifying cancer patients into high or low risk groups has led many research teams, from the biomedical and the bioinformatics field, to study the application of machine learning (ML) methods. Therefore, these techniques have been utilized as an aim to model the progression and treatment of cancerous conditions. In addition, the ability of ML tools to detect key features from complex datasets reveals their importance. A variety of these techniques, including Artificial Neural Networks (ANNs), Bayesian Networks (BNs), Support Vector Machines (SVMs) and Decision Trees (DTs) have been widely applied in cancer research for the development of predictive models, resulting in effective and accurate decision making. Even though it is evident that the use of ML methods can improve our understanding of cancer progression, an appropriate level of validation is needed in order for these methods to be considered in the everyday clinical practice. In this work, we present a review of recent ML approaches employed in the modeling of cancer progression. The predictive models discussed here are based on various supervised ML techniques as well as on different input features and data samples. Given the growing trend on the application of ML methods in cancer research, we present here the most recent publications that employ these techniques as an aim to model cancer risk or patient outcomes.

  6. Prediction of preterm deliveries from EHG signals using machine learning.

    Directory of Open Access Journals (Sweden)

    Paul Fergus

    Full Text Available There has been some improvement in the treatment of preterm infants, which has helped to increase their chance of survival. However, the rate of premature births is still globally increasing. As a result, this group of infants are most at risk of developing severe medical conditions that can affect the respiratory, gastrointestinal, immune, central nervous, auditory and visual systems. In extreme cases, this can also lead to long-term conditions, such as cerebral palsy, mental retardation, learning difficulties, including poor health and growth. In the US alone, the societal and economic cost of preterm births, in 2005, was estimated to be $26.2 billion, per annum. In the UK, this value was close to £2.95 billion, in 2009. Many believe that a better understanding of why preterm births occur, and a strategic focus on prevention, will help to improve the health of children and reduce healthcare costs. At present, most methods of preterm birth prediction are subjective. However, a strong body of evidence suggests the analysis of uterine electrical signals (Electrohysterography, could provide a viable way of diagnosing true labour and predict preterm deliveries. Most Electrohysterography studies focus on true labour detection during the final seven days, before labour. The challenge is to utilise Electrohysterography techniques to predict preterm delivery earlier in the pregnancy. This paper explores this idea further and presents a supervised machine learning approach that classifies term and preterm records, using an open source dataset containing 300 records (38 preterm and 262 term. The synthetic minority oversampling technique is used to oversample the minority preterm class, and cross validation techniques, are used to evaluate the dataset against other similar studies. Our approach shows an improvement on existing studies with 96% sensitivity, 90% specificity, and a 95% area under the curve value with 8% global error using the polynomial

  7. Spatial extreme learning machines: An application on prediction of disease counts.

    Science.gov (United States)

    Prates, Marcos O

    2018-01-01

    Extreme learning machines have gained a lot of attention by the machine learning community because of its interesting properties and computational advantages. With the increase in collection of information nowadays, many sources of data have missing information making statistical analysis harder or unfeasible. In this paper, we present a new model, coined spatial extreme learning machine, that combine spatial modeling with extreme learning machines keeping the nice properties of both methodologies and making it very flexible and robust. As explained throughout the text, the spatial extreme learning machines have many advantages in comparison with the traditional extreme learning machines. By a simulation study and a real data analysis we present how the spatial extreme learning machine can be used to improve imputation of missing data and uncertainty prediction estimation.

  8. INSIGHTS FROM MACHINE-LEARNED DIET SUCCESS PREDICTION.

    Science.gov (United States)

    Weber, Ingmar; Achananuparp, Palakorn

    2016-01-01

    To support people trying to lose weight and stay healthy, more and more fitness apps have sprung up including the ability to track both calories intake and expenditure. Users of such apps are part of a wider "quantified self" movement and many opt-in to publicly share their logged data. In this paper, we use public food diaries of more than 4,000 long-term active MyFitnessPal users to study the characteristics of a (un-)successful diet. Concretely, we train a machine learning model to predict repeatedly being over or under self-set daily calories goals and then look at which features contribute to the model's prediction. Our findings include both expected results, such as the token "mcdonalds" or the category "dessert" being indicative for being over the calories goal, but also less obvious ones such as the difference between pork and poultry concerning dieting success, or the use of the "quick added calories" functionality being indicative of over-shooting calorie-wise. This study also hints at the feasibility of using such data for more in-depth data mining, e.g., looking at the interaction between consumed foods such as mixing protein- and carbohydrate-rich foods. To the best of our knowledge, this is the first systematic study of public food diaries.

  9. Novel Breast Imaging and Machine Learning: Predicting Breast Lesion Malignancy at Cone-Beam CT Using Machine Learning Techniques.

    Science.gov (United States)

    Uhlig, Johannes; Uhlig, Annemarie; Kunze, Meike; Beissbarth, Tim; Fischer, Uwe; Lotz, Joachim; Wienbeck, Susanne

    2018-05-24

    The purpose of this study is to evaluate the diagnostic performance of machine learning techniques for malignancy prediction at breast cone-beam CT (CBCT) and to compare them to human readers. Five machine learning techniques, including random forests, back propagation neural networks (BPN), extreme learning machines, support vector machines, and K-nearest neighbors, were used to train diagnostic models on a clinical breast CBCT dataset with internal validation by repeated 10-fold cross-validation. Two independent blinded human readers with profound experience in breast imaging and breast CBCT analyzed the same CBCT dataset. Diagnostic performance was compared using AUC, sensitivity, and specificity. The clinical dataset comprised 35 patients (American College of Radiology density type C and D breasts) with 81 suspicious breast lesions examined with contrast-enhanced breast CBCT. Forty-five lesions were histopathologically proven to be malignant. Among the machine learning techniques, BPNs provided the best diagnostic performance, with AUC of 0.91, sensitivity of 0.85, and specificity of 0.82. The diagnostic performance of the human readers was AUC of 0.84, sensitivity of 0.89, and specificity of 0.72 for reader 1 and AUC of 0.72, sensitivity of 0.71, and specificity of 0.67 for reader 2. AUC was significantly higher for BPN when compared with both reader 1 (p = 0.01) and reader 2 (p Machine learning techniques provide a high and robust diagnostic performance in the prediction of malignancy in breast lesions identified at CBCT. BPNs showed the best diagnostic performance, surpassing human readers in terms of AUC and specificity.

  10. Developing robust arsenic awareness prediction models using machine learning algorithms.

    Science.gov (United States)

    Singh, Sushant K; Taylor, Robert W; Rahman, Mohammad Mahmudur; Pradhan, Biswajeet

    2018-04-01

    Arsenic awareness plays a vital role in ensuring the sustainability of arsenic mitigation technologies. Thus far, however, few studies have dealt with the sustainability of such technologies and its associated socioeconomic dimensions. As a result, arsenic awareness prediction has not yet been fully conceptualized. Accordingly, this study evaluated arsenic awareness among arsenic-affected communities in rural India, using a structured questionnaire to record socioeconomic, demographic, and other sociobehavioral factors with an eye to assessing their association with and influence on arsenic awareness. First a logistic regression model was applied and its results compared with those produced by six state-of-the-art machine-learning algorithms (Support Vector Machine [SVM], Kernel-SVM, Decision Tree [DT], k-Nearest Neighbor [k-NN], Naïve Bayes [NB], and Random Forests [RF]) as measured by their accuracy at predicting arsenic awareness. Most (63%) of the surveyed population was found to be arsenic-aware. Significant arsenic awareness predictors were divided into three types: (1) socioeconomic factors: caste, education level, and occupation; (2) water and sanitation behavior factors: number of family members involved in water collection, distance traveled and time spent for water collection, places for defecation, and materials used for handwashing after defecation; and (3) social capital and trust factors: presence of anganwadi and people's trust in other community members, NGOs, and private agencies. Moreover, individuals' having higher social network positively contributed to arsenic awareness in the communities. Results indicated that both the SVM and the RF algorithms outperformed at overall prediction of arsenic awareness-a nonlinear classification problem. Lower-caste, less educated, and unemployed members of the population were found to be the most vulnerable, requiring immediate arsenic mitigation. To this end, local social institutions and NGOs could play a

  11. Sensor Data Air Pollution Prediction by Machine Learning Methods

    Czech Academy of Sciences Publication Activity Database

    Vidnerová, Petra; Neruda, Roman

    submitted 25. 1. (2018) ISSN 1530-437X R&D Projects: GA ČR GA15-18108S Grant - others:GA MŠk(CZ) LM2015042 Institutional support: RVO:67985807 Keywords : machine learning * sensors * air pollution * deep neural networks * regularization networks Subject RIV: IN - Informatics, Computer Science Impact factor: 2.512, year: 2016

  12. Predicting the concentration of residual methanol in industrial formalin using machine learning

    OpenAIRE

    Heidkamp, William

    2016-01-01

    In this thesis, a machine learning approach was used to develop a predictive model for residual methanol concentration in industrial formalin produced at the Akzo Nobel factory in Kristinehamn, Sweden. The MATLABTM computational environment supplemented with the Statistics and Machine LearningTM toolbox from the MathWorks were used to test various machine learning algorithms on the formalin production data from Akzo Nobel. As a result, the Gaussian Process Regression algorithm was found to pr...

  13. Machine learning in updating predictive models of planning and scheduling transportation projects

    Science.gov (United States)

    1997-01-01

    A method combining machine learning and regression analysis to automatically and intelligently update predictive models used in the Kansas Department of Transportations (KDOTs) internal management system is presented. The predictive models used...

  14. Machine Learning.

    Science.gov (United States)

    Kirrane, Diane E.

    1990-01-01

    As scientists seek to develop machines that can "learn," that is, solve problems by imitating the human brain, a gold mine of information on the processes of human learning is being discovered, expert systems are being improved, and human-machine interactions are being enhanced. (SK)

  15. Predicting Refractive Surgery Outcome: Machine Learning Approach With Big Data.

    Science.gov (United States)

    Achiron, Asaf; Gur, Zvi; Aviv, Uri; Hilely, Assaf; Mimouni, Michael; Karmona, Lily; Rokach, Lior; Kaiserman, Igor

    2017-09-01

    To develop a decision forest for prediction of laser refractive surgery outcome. Data from consecutive cases of patients who underwent LASIK or photorefractive surgeries during a 12-year period in a single center were assembled into a single dataset. Training of machine-learning classifiers and testing were performed with a statistical classifier algorithm. The decision forest was created by feature vectors extracted from 17,592 cases and 38 clinical parameters for each patient. A 10-fold cross-validation procedure was applied to estimate the predictive value of the decision forest when applied to new patients. Analysis included patients younger than 40 years who were not treated for monovision. Efficacy of 0.7 or greater and 0.8 or greater was achieved in 16,198 (92.0%) and 14,945 (84.9%) eyes, respectively. Efficacy of less than 0.4 and less than 0.5 was achieved in 322 (1.8%) and 506 (2.9%) eyes, respectively. Patients in the low efficacy group (differences compared with the high efficacy group (≥ 0.8), yet were clinically similar (mean differences between groups of 0.7 years, of 0.43 mm in pupil size, of 0.11 D in cylinder, of 0.22 logMAR in preoperative CDVA, of 0.11 mm in optical zone size, of 1.03 D in actual sphere treatment, and of 0.64 D in actual cylinder treatment). The preoperative subjective CDVA had the highest gain (most important to the model). Correlations analysis revealed significantly decreased efficacy with increased age (r = -0.67, P big data from refractive surgeries may be of interest. [J Refract Surg. 2017;33(9):592-597.]. Copyright 2017, SLACK Incorporated.

  16. Predicting High Frequency Exchange Rates using Machine Learning

    OpenAIRE

    Palikuca, Aleksandar; Seidl,, Timo

    2016-01-01

    This thesis applies a committee of Artificial Neural Networks and Support Vector Machines on high-dimensional, high-frequency EUR/USD exchange rate data in an effort to predict directional market movements on up to a 60 second prediction horizon. The study shows that combining multiple classifiers into a committee produces improved precision relative to the best individual committee members and outperforms previously reported results. A trading simulation implementing the committee classifier...

  17. Different protein-protein interface patterns predicted by different machine learning methods.

    Science.gov (United States)

    Wang, Wei; Yang, Yongxiao; Yin, Jianxin; Gong, Xinqi

    2017-11-22

    Different types of protein-protein interactions make different protein-protein interface patterns. Different machine learning methods are suitable to deal with different types of data. Then, is it the same situation that different interface patterns are preferred for prediction by different machine learning methods? Here, four different machine learning methods were employed to predict protein-protein interface residue pairs on different interface patterns. The performances of the methods for different types of proteins are different, which suggest that different machine learning methods tend to predict different protein-protein interface patterns. We made use of ANOVA and variable selection to prove our result. Our proposed methods taking advantages of different single methods also got a good prediction result compared to single methods. In addition to the prediction of protein-protein interactions, this idea can be extended to other research areas such as protein structure prediction and design.

  18. A Machine Learning Perspective on Predictive Coding with PAQ

    OpenAIRE

    Knoll, Byron; de Freitas, Nando

    2011-01-01

    PAQ8 is an open source lossless data compression algorithm that currently achieves the best compression rates on many benchmarks. This report presents a detailed description of PAQ8 from a statistical machine learning perspective. It shows that it is possible to understand some of the modules of PAQ8 and use this understanding to improve the method. However, intuitive statistical explanations of the behavior of other modules remain elusive. We hope the description in this report will be a sta...

  19. Mortality risk prediction in burn injury: Comparison of logistic regression with machine learning approaches.

    Science.gov (United States)

    Stylianou, Neophytos; Akbarov, Artur; Kontopantelis, Evangelos; Buchan, Iain; Dunn, Ken W

    2015-08-01

    Predicting mortality from burn injury has traditionally employed logistic regression models. Alternative machine learning methods have been introduced in some areas of clinical prediction as the necessary software and computational facilities have become accessible. Here we compare logistic regression and machine learning predictions of mortality from burn. An established logistic mortality model was compared to machine learning methods (artificial neural network, support vector machine, random forests and naïve Bayes) using a population-based (England & Wales) case-cohort registry. Predictive evaluation used: area under the receiver operating characteristic curve; sensitivity; specificity; positive predictive value and Youden's index. All methods had comparable discriminatory abilities, similar sensitivities, specificities and positive predictive values. Although some machine learning methods performed marginally better than logistic regression the differences were seldom statistically significant and clinically insubstantial. Random forests were marginally better for high positive predictive value and reasonable sensitivity. Neural networks yielded slightly better prediction overall. Logistic regression gives an optimal mix of performance and interpretability. The established logistic regression model of burn mortality performs well against more complex alternatives. Clinical prediction with a small set of strong, stable, independent predictors is unlikely to gain much from machine learning outside specialist research contexts. Copyright © 2015 Elsevier Ltd and ISBI. All rights reserved.

  20. A comparison of machine learning techniques for predicting downstream acid mine drainage

    CSIR Research Space (South Africa)

    van Zyl, TL

    2014-07-01

    Full Text Available windowing approach over historical values to generate a prediction for the current value. We evaluate a number of Machine Learning techniques as regressors including Support Vector Regression, Random Forests, Stochastic Gradient Decent Regression, Linear...

  1. Applying machine learning to predict patient-specific current CD4 ...

    African Journals Online (AJOL)

    Apple apple

    This work shows the application of machine learning to predict current CD4 cell count of an HIV- .... Pre-processing ... remaining data elements of the PR and RT datasets. ... technique based on the structure of the human brain's neuron.

  2. Building Customer Churn Prediction Models in Fitness Industry with Machine Learning Methods

    OpenAIRE

    Shan, Min

    2017-01-01

    With the rapid growth of digital systems, churn management has become a major focus within customer relationship management in many industries. Ample research has been conducted for churn prediction in different industries with various machine learning methods. This thesis aims to combine feature selection and supervised machine learning methods for defining models of churn prediction and apply them on fitness industry. Forward selection is chosen as feature selection methods. Support Vector ...

  3. Application of Machine Learning Approaches for Protein-protein Interactions Prediction.

    Science.gov (United States)

    Zhang, Mengying; Su, Qiang; Lu, Yi; Zhao, Manman; Niu, Bing

    2017-01-01

    Proteomics endeavors to study the structures, functions and interactions of proteins. Information of the protein-protein interactions (PPIs) helps to improve our knowledge of the functions and the 3D structures of proteins. Thus determining the PPIs is essential for the study of the proteomics. In this review, in order to study the application of machine learning in predicting PPI, some machine learning approaches such as support vector machine (SVM), artificial neural networks (ANNs) and random forest (RF) were selected, and the examples of its applications in PPIs were listed. SVM and RF are two commonly used methods. Nowadays, more researchers predict PPIs by combining more than two methods. This review presents the application of machine learning approaches in predicting PPI. Many examples of success in identification and prediction in the area of PPI prediction have been discussed, and the PPIs research is still in progress. Copyright© Bentham Science Publishers; For any queries, please email at epub@benthamscience.org.

  4. Machine Learning

    Energy Technology Data Exchange (ETDEWEB)

    Chikkagoudar, Satish; Chatterjee, Samrat; Thomas, Dennis G.; Carroll, Thomas E.; Muller, George

    2017-04-21

    The absence of a robust and unified theory of cyber dynamics presents challenges and opportunities for using machine learning based data-driven approaches to further the understanding of the behavior of such complex systems. Analysts can also use machine learning approaches to gain operational insights. In order to be operationally beneficial, cybersecurity machine learning based models need to have the ability to: (1) represent a real-world system, (2) infer system properties, and (3) learn and adapt based on expert knowledge and observations. Probabilistic models and Probabilistic graphical models provide these necessary properties and are further explored in this chapter. Bayesian Networks and Hidden Markov Models are introduced as an example of a widely used data driven classification/modeling strategy.

  5. Phishtest: Measuring the Impact of Email Headers on the Predictive Accuracy of Machine Learning Techniques

    Science.gov (United States)

    Tout, Hicham

    2013-01-01

    The majority of documented phishing attacks have been carried by email, yet few studies have measured the impact of email headers on the predictive accuracy of machine learning techniques in detecting email phishing attacks. Research has shown that the inclusion of a limited subset of email headers as features in training machine learning…

  6. CAT-PUMA: CME Arrival Time Prediction Using Machine learning Algorithms

    Science.gov (United States)

    Liu, Jiajia; Ye, Yudong; Shen, Chenglong; Wang, Yuming; Erdélyi, Robert

    2018-04-01

    CAT-PUMA (CME Arrival Time Prediction Using Machine learning Algorithms) quickly and accurately predicts the arrival of Coronal Mass Ejections (CMEs) of CME arrival time. The software was trained via detailed analysis of CME features and solar wind parameters using 182 previously observed geo-effective partial-/full-halo CMEs and uses algorithms of the Support Vector Machine (SVM) to make its predictions, which can be made within minutes of providing the necessary input parameters of a CME.

  7. Introduction to machine learning

    OpenAIRE

    Baştanlar, Yalın; Özuysal, Mustafa

    2014-01-01

    The machine learning field, which can be briefly defined as enabling computers make successful predictions using past experiences, has exhibited an impressive development recently with the help of the rapid increase in the storage capacity and processing power of computers. Together with many other disciplines, machine learning methods have been widely employed in bioinformatics. The difficulties and cost of biological analyses have led to the development of sophisticated machine learning app...

  8. An ensemble machine learning approach to predict survival in breast cancer.

    Science.gov (United States)

    Djebbari, Amira; Liu, Ziying; Phan, Sieu; Famili, Fazel

    2008-01-01

    Current breast cancer predictive signatures are not unique. Can we use this fact to our advantage to improve prediction? From the machine learning perspective, it is well known that combining multiple classifiers can improve classification performance. We propose an ensemble machine learning approach which consists of choosing feature subsets and learning predictive models from them. We then combine models based on certain model fusion criteria and we also introduce a tuning parameter to control sensitivity. Our method significantly improves classification performance with a particular emphasis on sensitivity which is critical to avoid misclassifying poor prognosis patients as good prognosis.

  9. Structure prediction of boron-doped graphene by machine learning

    Science.gov (United States)

    M. Dieb, Thaer; Hou, Zhufeng; Tsuda, Koji

    2018-06-01

    Heteroatom doping has endowed graphene with manifold aspects of material properties and boosted its applications. The atomic structure determination of doped graphene is vital to understand its material properties. Motivated by the recently synthesized boron-doped graphene with relatively high concentration, here we employ machine learning methods to search the most stable structures of doped boron atoms in graphene, in conjunction with the atomistic simulations. From the determined stable structures, we find that in the free-standing pristine graphene, the doped boron atoms energetically prefer to substitute for the carbon atoms at different sublattice sites and that the para configuration of boron-boron pair is dominant in the cases of high boron concentrations. The boron doping can increase the work function of graphene by 0.7 eV for a boron content higher than 3.1%.

  10. Applying machine learning to predict patient-specific current CD 4 ...

    African Journals Online (AJOL)

    This work shows the application of machine learning to predict current CD4 cell count of an HIV-positive patient using genome sequences, viral load and time. A regression model predicting actual CD4 cell counts and a classification model predicting if a patient's CD4 cell count is less than 200 was built using a support ...

  11. Prediction of Student Dropout in E-Learning Program Through the Use of Machine Learning Method

    Directory of Open Access Journals (Sweden)

    Mingjie Tan

    2015-02-01

    Full Text Available The high rate of dropout is a serious problem in E-learning program. Thus it has received extensive concern from the education administrators and researchers. Predicting the potential dropout students is a workable solution to prevent dropout. Based on the analysis of related literature, this study selected student’s personal characteristic and academic performance as input attributions. Prediction models were developed using Artificial Neural Network (ANN, Decision Tree (DT and Bayesian Networks (BNs. A large sample of 62375 students was utilized in the procedures of model training and testing. The results of each model were presented in confusion matrix, and analyzed by calculating the rates of accuracy, precision, recall, and F-measure. The results suggested all of the three machine learning methods were effective in student dropout prediction, and DT presented a better performance. Finally, some suggestions were made for considerable future research.

  12. The use of machine learning and nonlinear statistical tools for ADME prediction.

    Science.gov (United States)

    Sakiyama, Yojiro

    2009-02-01

    Absorption, distribution, metabolism and excretion (ADME)-related failure of drug candidates is a major issue for the pharmaceutical industry today. Prediction of ADME by in silico tools has now become an inevitable paradigm to reduce cost and enhance efficiency in pharmaceutical research. Recently, machine learning as well as nonlinear statistical tools has been widely applied to predict routine ADME end points. To achieve accurate and reliable predictions, it would be a prerequisite to understand the concepts, mechanisms and limitations of these tools. Here, we have devised a small synthetic nonlinear data set to help understand the mechanism of machine learning by 2D-visualisation. We applied six new machine learning methods to four different data sets. The methods include Naive Bayes classifier, classification and regression tree, random forest, Gaussian process, support vector machine and k nearest neighbour. The results demonstrated that ensemble learning and kernel machine displayed greater accuracy of prediction than classical methods irrespective of the data set size. The importance of interaction with the engineering field is also addressed. The results described here provide insights into the mechanism of machine learning, which will enable appropriate usage in the future.

  13. Exploration of machine learning techniques in predicting multiple sclerosis disease course

    OpenAIRE

    Zhao, Yijun; Healy, Brian C.; Rotstein, Dalia; Guttmann, Charles R. G.; Bakshi, Rohit; Weiner, Howard L.; Brodley, Carla E.; Chitnis, Tanuja

    2017-01-01

    Objective To explore the value of machine learning methods for predicting multiple sclerosis disease course. Methods 1693 CLIMB study patients were classified as increased EDSS?1.5 (worsening) or not (non-worsening) at up to five years after baseline visit. Support vector machines (SVM) were used to build the classifier, and compared to logistic regression (LR) using demographic, clinical and MRI data obtained at years one and two to predict EDSS at five years follow-up. Results Baseline data...

  14. Machine Learning Methods for Prediction of CDK-Inhibitors

    Science.gov (United States)

    Ramana, Jayashree; Gupta, Dinesh

    2010-01-01

    Progression through the cell cycle involves the coordinated activities of a suite of cyclin/cyclin-dependent kinase (CDK) complexes. The activities of the complexes are regulated by CDK inhibitors (CDKIs). Apart from its role as cell cycle regulators, CDKIs are involved in apoptosis, transcriptional regulation, cell fate determination, cell migration and cytoskeletal dynamics. As the complexes perform crucial and diverse functions, these are important drug targets for tumour and stem cell therapeutic interventions. However, CDKIs are represented by proteins with considerable sequence heterogeneity and may fail to be identified by simple similarity search methods. In this work we have evaluated and developed machine learning methods for identification of CDKIs. We used different compositional features and evolutionary information in the form of PSSMs, from CDKIs and non-CDKIs for generating SVM and ANN classifiers. In the first stage, both the ANN and SVM models were evaluated using Leave-One-Out Cross-Validation and in the second stage these were tested on independent data sets. The PSSM-based SVM model emerged as the best classifier in both the stages and is publicly available through a user-friendly web interface at http://bioinfo.icgeb.res.in/cdkipred. PMID:20967128

  15. Introduction to machine learning.

    Science.gov (United States)

    Baştanlar, Yalin; Ozuysal, Mustafa

    2014-01-01

    The machine learning field, which can be briefly defined as enabling computers make successful predictions using past experiences, has exhibited an impressive development recently with the help of the rapid increase in the storage capacity and processing power of computers. Together with many other disciplines, machine learning methods have been widely employed in bioinformatics. The difficulties and cost of biological analyses have led to the development of sophisticated machine learning approaches for this application area. In this chapter, we first review the fundamental concepts of machine learning such as feature assessment, unsupervised versus supervised learning and types of classification. Then, we point out the main issues of designing machine learning experiments and their performance evaluation. Finally, we introduce some supervised learning methods.

  16. A Symbiotic Framework for coupling Machine Learning and Geosciences in Prediction and Predictability

    Science.gov (United States)

    Ravela, S.

    2017-12-01

    In this presentation we review the two directions of a symbiotic relationship between machine learning and the geosciences in relation to prediction and predictability. In the first direction, we develop ensemble, information theoretic and manifold learning framework to adaptively improve state and parameter estimates in nonlinear high-dimensional non-Gaussian problems, showing in particular that tractable variational approaches can be produced. We demonstrate these applications in the context of autonomous mapping of environmental coherent structures and other idealized problems. In the reverse direction, we show that data assimilation, particularly probabilistic approaches for filtering and smoothing offer a novel and useful way to train neural networks, and serve as a better basis than gradient based approaches when we must quantify uncertainty in association with nonlinear, chaotic processes. In many inference problems in geosciences we seek to build reduced models to characterize local sensitivies, adjoints or other mechanisms that propagate innovations and errors. Here, the particular use of neural approaches for such propagation trained using ensemble data assimilation provides a novel framework. Through these two examples of inference problems in the earth sciences, we show that not only is learning useful to broaden existing methodology, but in reverse, geophysical methodology can be used to influence paradigms in learning.

  17. The validation and assessment of machine learning: a game of prediction from high-dimensional data

    DEFF Research Database (Denmark)

    Pers, Tune Hannes; Albrechtsen, A; Holst, C

    2009-01-01

    In applied statistics, tools from machine learning are popular for analyzing complex and high-dimensional data. However, few theoretical results are available that could guide to the appropriate machine learning tool in a new application. Initial development of an overall strategy thus often...... the ideas, the game is applied to data from the Nugenob Study where the aim is to predict the fat oxidation capacity based on conventional factors and high-dimensional metabolomics data. Three players have chosen to use support vector machines, LASSO, and random forests, respectively....

  18. Machine learning and predictive data analytics enabling metrology and process control in IC fabrication

    Science.gov (United States)

    Rana, Narender; Zhang, Yunlin; Wall, Donald; Dirahoui, Bachir; Bailey, Todd C.

    2015-03-01

    Integrate circuit (IC) technology is going through multiple changes in terms of patterning techniques (multiple patterning, EUV and DSA), device architectures (FinFET, nanowire, graphene) and patterning scale (few nanometers). These changes require tight controls on processes and measurements to achieve the required device performance, and challenge the metrology and process control in terms of capability and quality. Multivariate data with complex nonlinear trends and correlations generally cannot be described well by mathematical or parametric models but can be relatively easily learned by computing machines and used to predict or extrapolate. This paper introduces the predictive metrology approach which has been applied to three different applications. Machine learning and predictive analytics have been leveraged to accurately predict dimensions of EUV resist patterns down to 18 nm half pitch leveraging resist shrinkage patterns. These patterns could not be directly and accurately measured due to metrology tool limitations. Machine learning has also been applied to predict the electrical performance early in the process pipeline for deep trench capacitance and metal line resistance. As the wafer goes through various processes its associated cost multiplies. It may take days to weeks to get the electrical performance readout. Predicting the electrical performance early on can be very valuable in enabling timely actionable decision such as rework, scrap, feedforward, feedback predicted information or information derived from prediction to improve or monitor processes. This paper provides a general overview of machine learning and advanced analytics application in the advanced semiconductor development and manufacturing.

  19. Prediction of mortality after radical cystectomy for bladder cancer by machine learning techniques.

    Science.gov (United States)

    Wang, Guanjin; Lam, Kin-Man; Deng, Zhaohong; Choi, Kup-Sze

    2015-08-01

    Bladder cancer is a common cancer in genitourinary malignancy. For muscle invasive bladder cancer, surgical removal of the bladder, i.e. radical cystectomy, is in general the definitive treatment which, unfortunately, carries significant morbidities and mortalities. Accurate prediction of the mortality of radical cystectomy is therefore needed. Statistical methods have conventionally been used for this purpose, despite the complex interactions of high-dimensional medical data. Machine learning has emerged as a promising technique for handling high-dimensional data, with increasing application in clinical decision support, e.g. cancer prediction and prognosis. Its ability to reveal the hidden nonlinear interactions and interpretable rules between dependent and independent variables is favorable for constructing models of effective generalization performance. In this paper, seven machine learning methods are utilized to predict the 5-year mortality of radical cystectomy, including back-propagation neural network (BPN), radial basis function (RBFN), extreme learning machine (ELM), regularized ELM (RELM), support vector machine (SVM), naive Bayes (NB) classifier and k-nearest neighbour (KNN), on a clinicopathological dataset of 117 patients of the urology unit of a hospital in Hong Kong. The experimental results indicate that RELM achieved the highest average prediction accuracy of 0.8 at a fast learning speed. The research findings demonstrate the potential of applying machine learning techniques to support clinical decision making. Copyright © 2015 Elsevier Ltd. All rights reserved.

  20. Predicting Coronal Mass Ejections Using Machine Learning Methods

    Science.gov (United States)

    Bobra, M. G.; Ilonidis, S.

    2016-04-01

    Of all the activity observed on the Sun, two of the most energetic events are flares and coronal mass ejections (CMEs). Usually, solar active regions that produce large flares will also produce a CME, but this is not always true. Despite advances in numerical modeling, it is still unclear which circumstances will produce a CME. Therefore, it is worthwhile to empirically determine which features distinguish flares associated with CMEs from flares that are not. At this time, no extensive study has used physically meaningful features of active regions to distinguish between these two populations. As such, we attempt to do so by using features derived from (1) photospheric vector magnetic field data taken by the Solar Dynamics Observatory’s Helioseismic and Magnetic Imager instrument and (2) X-ray flux data from the Geostationary Operational Environmental Satellite’s X-ray Flux instrument. We build a catalog of active regions that either produced both a flare and a CME (the positive class) or simply a flare (the negative class). We then use machine-learning algorithms to (1) determine which features distinguish these two populations, and (2) forecast whether an active region that produces an M- or X-class flare will also produce a CME. We compute the True Skill Statistic, a forecast verification metric, and find that it is a relatively high value of ∼0.8 ± 0.2. We conclude that a combination of six parameters, which are all intensive in nature, will capture most of the relevant information contained in the photospheric magnetic field.

  1. PREDICTING CORONAL MASS EJECTIONS USING MACHINE LEARNING METHODS

    Energy Technology Data Exchange (ETDEWEB)

    Bobra, M. G.; Ilonidis, S. [W.W. Hansen Experimental Physics Laboratory, Stanford University, Stanford, CA 94305 (United States)

    2016-04-20

    Of all the activity observed on the Sun, two of the most energetic events are flares and coronal mass ejections (CMEs). Usually, solar active regions that produce large flares will also produce a CME, but this is not always true. Despite advances in numerical modeling, it is still unclear which circumstances will produce a CME. Therefore, it is worthwhile to empirically determine which features distinguish flares associated with CMEs from flares that are not. At this time, no extensive study has used physically meaningful features of active regions to distinguish between these two populations. As such, we attempt to do so by using features derived from (1) photospheric vector magnetic field data taken by the Solar Dynamics Observatory ’s Helioseismic and Magnetic Imager instrument and (2) X-ray flux data from the Geostationary Operational Environmental Satellite’s X-ray Flux instrument. We build a catalog of active regions that either produced both a flare and a CME (the positive class) or simply a flare (the negative class). We then use machine-learning algorithms to (1) determine which features distinguish these two populations, and (2) forecast whether an active region that produces an M- or X-class flare will also produce a CME. We compute the True Skill Statistic, a forecast verification metric, and find that it is a relatively high value of ∼0.8 ± 0.2. We conclude that a combination of six parameters, which are all intensive in nature, will capture most of the relevant information contained in the photospheric magnetic field.

  2. Comparing statistical and machine learning classifiers: alternatives for predictive modeling in human factors research.

    Science.gov (United States)

    Carnahan, Brian; Meyer, Gérard; Kuntz, Lois-Ann

    2003-01-01

    Multivariate classification models play an increasingly important role in human factors research. In the past, these models have been based primarily on discriminant analysis and logistic regression. Models developed from machine learning research offer the human factors professional a viable alternative to these traditional statistical classification methods. To illustrate this point, two machine learning approaches--genetic programming and decision tree induction--were used to construct classification models designed to predict whether or not a student truck driver would pass his or her commercial driver license (CDL) examination. The models were developed and validated using the curriculum scores and CDL exam performances of 37 student truck drivers who had completed a 320-hr driver training course. Results indicated that the machine learning classification models were superior to discriminant analysis and logistic regression in terms of predictive accuracy. Actual or potential applications of this research include the creation of models that more accurately predict human performance outcomes.

  3. The machine learning in the prediction of elections

    Directory of Open Access Journals (Sweden)

    M.S.C. José A. León-Borges

    2015-05-01

    Full Text Available This research article, presents an analysis and a comparison of three different algorithms: A.- Grouping method K-means, B.-Expectation a convergence criteria, EM and C.- Methodology for classification LAMDA, using two software of classification Weka and SALSA, as an aid for the prediction of future elections in the state of Quintana Roo. When working with electoral data, these are classified in a qualitative and quantitative way, by such virtue at the end of this article you will have the elements necessary to decide, which software, has better performance for such learning of classification. The main reason for the development of this work, is to demonstrate the efficiency of algorithms, with different data types. At the end, it may be decided, the algorithm with the better performance in data management.

  4. Machine Learning and Deep Learning Models to Predict Runoff Water Quantity and Quality

    Science.gov (United States)

    Bradford, S. A.; Liang, J.; Li, W.; Murata, T.; Simunek, J.

    2017-12-01

    Contaminants can be rapidly transported at the soil surface by runoff to surface water bodies. Physically-based models, which are based on the mathematical description of main hydrological processes, are key tools for predicting surface water impairment. Along with physically-based models, data-driven models are becoming increasingly popular for describing the behavior of hydrological and water resources systems since these models can be used to complement or even replace physically based-models. In this presentation we propose a new data-driven model as an alternative to a physically-based overland flow and transport model. First, we have developed a physically-based numerical model to simulate overland flow and contaminant transport (the HYDRUS-1D overland flow module). A large number of numerical simulations were carried out to develop a database containing information about the impact of various input parameters (weather patterns, surface topography, vegetation, soil conditions, contaminants, and best management practices) on runoff water quantity and quality outputs. This database was used to train data-driven models. Three different methods (Neural Networks, Support Vector Machines, and Recurrence Neural Networks) were explored to prepare input- output functional relations. Results demonstrate the ability and limitations of machine learning and deep learning models to predict runoff water quantity and quality.

  5. A machine learning approach for predicting the relationship between energy resources and economic development

    Science.gov (United States)

    Cogoljević, Dušan; Alizamir, Meysam; Piljan, Ivan; Piljan, Tatjana; Prljić, Katarina; Zimonjić, Stefan

    2018-04-01

    The linkage between energy resources and economic development is a topic of great interest. Research in this area is also motivated by contemporary concerns about global climate change, carbon emissions fluctuating crude oil prices, and the security of energy supply. The purpose of this research is to develop and apply the machine learning approach to predict gross domestic product (GDP) based on the mix of energy resources. Our results indicate that GDP predictive accuracy can be improved slightly by applying a machine learning approach.

  6. Applying Machine Learning and High Performance Computing to Water Quality Assessment and Prediction

    OpenAIRE

    Ruijian Zhang; Deren Li

    2017-01-01

    Water quality assessment and prediction is a more and more important issue. Traditional ways either take lots of time or they can only do assessments. In this research, by applying machine learning algorithm to a long period time of water attributes’ data; we can generate a decision tree so that it can predict the future day’s water quality in an easy and efficient way. The idea is to combine the traditional ways and the computer algorithms together. Using machine learning algorithms, the ass...

  7. Moving beyond regression techniques in cardiovascular risk prediction: applying machine learning to address analytic challenges.

    Science.gov (United States)

    Goldstein, Benjamin A; Navar, Ann Marie; Carter, Rickey E

    2017-06-14

    Risk prediction plays an important role in clinical cardiology research. Traditionally, most risk models have been based on regression models. While useful and robust, these statistical methods are limited to using a small number of predictors which operate in the same way on everyone, and uniformly throughout their range. The purpose of this review is to illustrate the use of machine-learning methods for development of risk prediction models. Typically presented as black box approaches, most machine-learning methods are aimed at solving particular challenges that arise in data analysis that are not well addressed by typical regression approaches. To illustrate these challenges, as well as how different methods can address them, we consider trying to predicting mortality after diagnosis of acute myocardial infarction. We use data derived from our institution's electronic health record and abstract data on 13 regularly measured laboratory markers. We walk through different challenges that arise in modelling these data and then introduce different machine-learning approaches. Finally, we discuss general issues in the application of machine-learning methods including tuning parameters, loss functions, variable importance, and missing data. Overall, this review serves as an introduction for those working on risk modelling to approach the diffuse field of machine learning. © The Author 2016. Published by Oxford University Press on behalf of the European Society of Cardiology.

  8. Machine learning-based methods for prediction of linear B-cell epitopes.

    Science.gov (United States)

    Wang, Hsin-Wei; Pai, Tun-Wen

    2014-01-01

    B-cell epitope prediction facilitates immunologists in designing peptide-based vaccine, diagnostic test, disease prevention, treatment, and antibody production. In comparison with T-cell epitope prediction, the performance of variable length B-cell epitope prediction is still yet to be satisfied. Fortunately, due to increasingly available verified epitope databases, bioinformaticians could adopt machine learning-based algorithms on all curated data to design an improved prediction tool for biomedical researchers. Here, we have reviewed related epitope prediction papers, especially those for linear B-cell epitope prediction. It should be noticed that a combination of selected propensity scales and statistics of epitope residues with machine learning-based tools formulated a general way for constructing linear B-cell epitope prediction systems. It is also observed from most of the comparison results that the kernel method of support vector machine (SVM) classifier outperformed other machine learning-based approaches. Hence, in this chapter, except reviewing recently published papers, we have introduced the fundamentals of B-cell epitope and SVM techniques. In addition, an example of linear B-cell prediction system based on physicochemical features and amino acid combinations is illustrated in details.

  9. Prediction of Human Drug Targets and Their Interactions Using Machine Learning Methods: Current and Future Perspectives.

    Science.gov (United States)

    Nath, Abhigyan; Kumari, Priyanka; Chaube, Radha

    2018-01-01

    Identification of drug targets and drug target interactions are important steps in the drug-discovery pipeline. Successful computational prediction methods can reduce the cost and time demanded by the experimental methods. Knowledge of putative drug targets and their interactions can be very useful for drug repurposing. Supervised machine learning methods have been very useful in drug target prediction and in prediction of drug target interactions. Here, we describe the details for developing prediction models using supervised learning techniques for human drug target prediction and their interactions.

  10. Machine Learning for Hackers

    CERN Document Server

    Conway, Drew

    2012-01-01

    If you're an experienced programmer interested in crunching data, this book will get you started with machine learning-a toolkit of algorithms that enables computers to train themselves to automate useful tasks. Authors Drew Conway and John Myles White help you understand machine learning and statistics tools through a series of hands-on case studies, instead of a traditional math-heavy presentation. Each chapter focuses on a specific problem in machine learning, such as classification, prediction, optimization, and recommendation. Using the R programming language, you'll learn how to analyz

  11. Clinical chemistry in higher dimensions: Machine-learning and enhanced prediction from routine clinical chemistry data.

    Science.gov (United States)

    Richardson, Alice; Signor, Ben M; Lidbury, Brett A; Badrick, Tony

    2016-11-01

    Big Data is having an impact on many areas of research, not the least of which is biomedical science. In this review paper, big data and machine learning are defined in terms accessible to the clinical chemistry community. Seven myths associated with machine learning and big data are then presented, with the aim of managing expectation of machine learning amongst clinical chemists. The myths are illustrated with four examples investigating the relationship between biomarkers in liver function tests, enhanced laboratory prediction of hepatitis virus infection, the relationship between bilirubin and white cell count, and the relationship between red cell distribution width and laboratory prediction of anaemia. Copyright © 2016 The Canadian Society of Clinical Chemists. Published by Elsevier Inc. All rights reserved.

  12. Guidelines for Developing and Reporting Machine Learning Predictive Models in Biomedical Research: A Multidisciplinary View.

    Science.gov (United States)

    Luo, Wei; Phung, Dinh; Tran, Truyen; Gupta, Sunil; Rana, Santu; Karmakar, Chandan; Shilton, Alistair; Yearwood, John; Dimitrova, Nevenka; Ho, Tu Bao; Venkatesh, Svetha; Berk, Michael

    2016-12-16

    As more and more researchers are turning to big data for new opportunities of biomedical discoveries, machine learning models, as the backbone of big data analysis, are mentioned more often in biomedical journals. However, owing to the inherent complexity of machine learning methods, they are prone to misuse. Because of the flexibility in specifying machine learning models, the results are often insufficiently reported in research articles, hindering reliable assessment of model validity and consistent interpretation of model outputs. To attain a set of guidelines on the use of machine learning predictive models within clinical settings to make sure the models are correctly applied and sufficiently reported so that true discoveries can be distinguished from random coincidence. A multidisciplinary panel of machine learning experts, clinicians, and traditional statisticians were interviewed, using an iterative process in accordance with the Delphi method. The process produced a set of guidelines that consists of (1) a list of reporting items to be included in a research article and (2) a set of practical sequential steps for developing predictive models. A set of guidelines was generated to enable correct application of machine learning models and consistent reporting of model specifications and results in biomedical research. We believe that such guidelines will accelerate the adoption of big data analysis, particularly with machine learning methods, in the biomedical research community. ©Wei Luo, Dinh Phung, Truyen Tran, Sunil Gupta, Santu Rana, Chandan Karmakar, Alistair Shilton, John Yearwood, Nevenka Dimitrova, Tu Bao Ho, Svetha Venkatesh, Michael Berk. Originally published in the Journal of Medical Internet Research (http://www.jmir.org), 16.12.2016.

  13. Machine learning for outcome prediction of acute ischemic stroke post intra-arterial therapy.

    Directory of Open Access Journals (Sweden)

    Hamed Asadi

    Full Text Available INTRODUCTION: Stroke is a major cause of death and disability. Accurately predicting stroke outcome from a set of predictive variables may identify high-risk patients and guide treatment approaches, leading to decreased morbidity. Logistic regression models allow for the identification and validation of predictive variables. However, advanced machine learning algorithms offer an alternative, in particular, for large-scale multi-institutional data, with the advantage of easily incorporating newly available data to improve prediction performance. Our aim was to design and compare different machine learning methods, capable of predicting the outcome of endovascular intervention in acute anterior circulation ischaemic stroke. METHOD: We conducted a retrospective study of a prospectively collected database of acute ischaemic stroke treated by endovascular intervention. Using SPSS®, MATLAB®, and Rapidminer®, classical statistics as well as artificial neural network and support vector algorithms were applied to design a supervised machine capable of classifying these predictors into potential good and poor outcomes. These algorithms were trained, validated and tested using randomly divided data. RESULTS: We included 107 consecutive acute anterior circulation ischaemic stroke patients treated by endovascular technique. Sixty-six were male and the mean age of 65.3. All the available demographic, procedural and clinical factors were included into the models. The final confusion matrix of the neural network, demonstrated an overall congruency of ∼ 80% between the target and output classes, with favourable receiving operative characteristics. However, after optimisation, the support vector machine had a relatively better performance, with a root mean squared error of 2.064 (SD: ± 0.408. DISCUSSION: We showed promising accuracy of outcome prediction, using supervised machine learning algorithms, with potential for incorporation of larger multicenter

  14. Machine learning approach for the outcome prediction of temporal lobe epilepsy surgery.

    Directory of Open Access Journals (Sweden)

    Rubén Armañanzas

    Full Text Available Epilepsy surgery is effective in reducing both the number and frequency of seizures, particularly in temporal lobe epilepsy (TLE. Nevertheless, a significant proportion of these patients continue suffering seizures after surgery. Here we used a machine learning approach to predict the outcome of epilepsy surgery based on supervised classification data mining taking into account not only the common clinical variables, but also pathological and neuropsychological evaluations. We have generated models capable of predicting whether a patient with TLE secondary to hippocampal sclerosis will fully recover from epilepsy or not. The machine learning analysis revealed that outcome could be predicted with an estimated accuracy of almost 90% using some clinical and neuropsychological features. Importantly, not all the features were needed to perform the prediction; some of them proved to be irrelevant to the prognosis. Personality style was found to be one of the key features to predict the outcome. Although we examined relatively few cases, findings were verified across all data, showing that the machine learning approach described in the present study may be a powerful method. Since neuropsychological assessment of epileptic patients is a standard protocol in the pre-surgical evaluation, we propose to include these specific psychological tests and machine learning tools to improve the selection of candidates for epilepsy surgery.

  15. Prediction of drug synergy in cancer using ensemble-based machine learning techniques

    Science.gov (United States)

    Singh, Harpreet; Rana, Prashant Singh; Singh, Urvinder

    2018-04-01

    Drug synergy prediction plays a significant role in the medical field for inhibiting specific cancer agents. It can be developed as a pre-processing tool for therapeutic successes. Examination of different drug-drug interaction can be done by drug synergy score. It needs efficient regression-based machine learning approaches to minimize the prediction errors. Numerous machine learning techniques such as neural networks, support vector machines, random forests, LASSO, Elastic Nets, etc., have been used in the past to realize requirement as mentioned above. However, these techniques individually do not provide significant accuracy in drug synergy score. Therefore, the primary objective of this paper is to design a neuro-fuzzy-based ensembling approach. To achieve this, nine well-known machine learning techniques have been implemented by considering the drug synergy data. Based on the accuracy of each model, four techniques with high accuracy are selected to develop ensemble-based machine learning model. These models are Random forest, Fuzzy Rules Using Genetic Cooperative-Competitive Learning method (GFS.GCCL), Adaptive-Network-Based Fuzzy Inference System (ANFIS) and Dynamic Evolving Neural-Fuzzy Inference System method (DENFIS). Ensembling is achieved by evaluating the biased weighted aggregation (i.e. adding more weights to the model with a higher prediction score) of predicted data by selected models. The proposed and existing machine learning techniques have been evaluated on drug synergy score data. The comparative analysis reveals that the proposed method outperforms others in terms of accuracy, root mean square error and coefficient of correlation.

  16. Machine Learning Algorithms Outperform Conventional Regression Models in Predicting Development of Hepatocellular Carcinoma

    Science.gov (United States)

    Singal, Amit G.; Mukherjee, Ashin; Elmunzer, B. Joseph; Higgins, Peter DR; Lok, Anna S.; Zhu, Ji; Marrero, Jorge A; Waljee, Akbar K

    2015-01-01

    Background Predictive models for hepatocellular carcinoma (HCC) have been limited by modest accuracy and lack of validation. Machine learning algorithms offer a novel methodology, which may improve HCC risk prognostication among patients with cirrhosis. Our study's aim was to develop and compare predictive models for HCC development among cirrhotic patients, using conventional regression analysis and machine learning algorithms. Methods We enrolled 442 patients with Child A or B cirrhosis at the University of Michigan between January 2004 and September 2006 (UM cohort) and prospectively followed them until HCC development, liver transplantation, death, or study termination. Regression analysis and machine learning algorithms were used to construct predictive models for HCC development, which were tested on an independent validation cohort from the Hepatitis C Antiviral Long-term Treatment against Cirrhosis (HALT-C) Trial. Both models were also compared to the previously published HALT-C model. Discrimination was assessed using receiver operating characteristic curve analysis and diagnostic accuracy was assessed with net reclassification improvement and integrated discrimination improvement statistics. Results After a median follow-up of 3.5 years, 41 patients developed HCC. The UM regression model had a c-statistic of 0.61 (95%CI 0.56-0.67), whereas the machine learning algorithm had a c-statistic of 0.64 (95%CI 0.60–0.69) in the validation cohort. The machine learning algorithm had significantly better diagnostic accuracy as assessed by net reclassification improvement (pmachine learning algorithm (p=0.047). Conclusion Machine learning algorithms improve the accuracy of risk stratifying patients with cirrhosis and can be used to accurately identify patients at high-risk for developing HCC. PMID:24169273

  17. Machine learning competition in immunology – Prediction of HLA class I binding peptides

    DEFF Research Database (Denmark)

    Zhang, Guang Lan; Ansari, Hifzur Rahman; Bradley, Phil

    2011-01-01

    of peptide binding, therefore, determines the accuracy of the overall method. Computational predictions of peptide binding to HLA, both class I and class II, use a variety of algorithms ranging from binding motifs to advanced machine learning techniques ( [Brusic et al., 2004] and [Lafuente and Reche, 2009...

  18. Machine learning approaches for the prediction of signal peptides and otherprotein sorting signals

    DEFF Research Database (Denmark)

    Nielsen, Henrik; Brunak, Søren; von Heijne, Gunnar

    1999-01-01

    Prediction of protein sorting signals from the sequence of amino acids has great importance in the field of proteomics today. Recently,the growth of protein databases, combined with machine learning approaches, such as neural networks and hidden Markov models, havemade it possible to achieve...

  19. Pressure Prediction of Coal Slurry Transportation Pipeline Based on Particle Swarm Optimization Kernel Function Extreme Learning Machine

    Directory of Open Access Journals (Sweden)

    Xue-cun Yang

    2015-01-01

    Full Text Available For coal slurry pipeline blockage prediction problem, through the analysis of actual scene, it is determined that the pressure prediction from each measuring point is the premise of pipeline blockage prediction. Kernel function of support vector machine is introduced into extreme learning machine, the parameters are optimized by particle swarm algorithm, and blockage prediction method based on particle swarm optimization kernel function extreme learning machine (PSOKELM is put forward. The actual test data from HuangLing coal gangue power plant are used for simulation experiments and compared with support vector machine prediction model optimized by particle swarm algorithm (PSOSVM and kernel function extreme learning machine prediction model (KELM. The results prove that mean square error (MSE for the prediction model based on PSOKELM is 0.0038 and the correlation coefficient is 0.9955, which is superior to prediction model based on PSOSVM in speed and accuracy and superior to KELM prediction model in accuracy.

  20. Can machine-learning improve cardiovascular risk prediction using routine clinical data?

    Science.gov (United States)

    Kai, Joe; Garibaldi, Jonathan M.; Qureshi, Nadeem

    2017-01-01

    Background Current approaches to predict cardiovascular risk fail to identify many people who would benefit from preventive treatment, while others receive unnecessary intervention. Machine-learning offers opportunity to improve accuracy by exploiting complex interactions between risk factors. We assessed whether machine-learning can improve cardiovascular risk prediction. Methods Prospective cohort study using routine clinical data of 378,256 patients from UK family practices, free from cardiovascular disease at outset. Four machine-learning algorithms (random forest, logistic regression, gradient boosting machines, neural networks) were compared to an established algorithm (American College of Cardiology guidelines) to predict first cardiovascular event over 10-years. Predictive accuracy was assessed by area under the ‘receiver operating curve’ (AUC); and sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV) to predict 7.5% cardiovascular risk (threshold for initiating statins). Findings 24,970 incident cardiovascular events (6.6%) occurred. Compared to the established risk prediction algorithm (AUC 0.728, 95% CI 0.723–0.735), machine-learning algorithms improved prediction: random forest +1.7% (AUC 0.745, 95% CI 0.739–0.750), logistic regression +3.2% (AUC 0.760, 95% CI 0.755–0.766), gradient boosting +3.3% (AUC 0.761, 95% CI 0.755–0.766), neural networks +3.6% (AUC 0.764, 95% CI 0.759–0.769). The highest achieving (neural networks) algorithm predicted 4,998/7,404 cases (sensitivity 67.5%, PPV 18.4%) and 53,458/75,585 non-cases (specificity 70.7%, NPV 95.7%), correctly predicting 355 (+7.6%) more patients who developed cardiovascular disease compared to the established algorithm. Conclusions Machine-learning significantly improves accuracy of cardiovascular risk prediction, increasing the number of patients identified who could benefit from preventive treatment, while avoiding unnecessary treatment of others

  1. Life Prediction for FRP composites with Data Fusion & Machine Learning

    Data.gov (United States)

    National Aeronautics and Space Administration — High-fidelity, probabilistic predictions of damage evolution in fiber-reinforced polymer (FRP) composite structures could accelerate development and certification of...

  2. Quantum Machine Learning

    OpenAIRE

    Romero García, Cristian

    2017-01-01

    [EN] In a world in which accessible information grows exponentially, the selection of the appropriate information turns out to be an extremely relevant problem. In this context, the idea of Machine Learning (ML), a subfield of Artificial Intelligence, emerged to face problems in data mining, pattern recognition, automatic prediction, among others. Quantum Machine Learning is an interdisciplinary research area combining quantum mechanics with methods of ML, in which quantum properties allow fo...

  3. Efficient Prediction of Low-Visibility Events at Airports Using Machine-Learning Regression

    Science.gov (United States)

    Cornejo-Bueno, L.; Casanova-Mateo, C.; Sanz-Justo, J.; Cerro-Prada, E.; Salcedo-Sanz, S.

    2017-11-01

    We address the prediction of low-visibility events at airports using machine-learning regression. The proposed model successfully forecasts low-visibility events in terms of the runway visual range at the airport, with the use of support-vector regression, neural networks (multi-layer perceptrons and extreme-learning machines) and Gaussian-process algorithms. We assess the performance of these algorithms based on real data collected at the Valladolid airport, Spain. We also propose a study of the atmospheric variables measured at a nearby tower related to low-visibility atmospheric conditions, since they are considered as the inputs of the different regressors. A pre-processing procedure of these input variables with wavelet transforms is also described. The results show that the proposed machine-learning algorithms are able to predict low-visibility events well. The Gaussian process is the best algorithm among those analyzed, obtaining over 98% of the correct classification rate in low-visibility events when the runway visual range is {>}1000 m, and about 80% under this threshold. The performance of all the machine-learning algorithms tested is clearly affected in extreme low-visibility conditions ({algorithm performance in daytime and nighttime conditions, and for different prediction time horizons.

  4. A Machine-Learning Approach to Predict Main Energy Consumption under Realistic Operational Conditions

    DEFF Research Database (Denmark)

    Petersen, Joan P; Winther, Ole; Jacobsen, Daniel J

    2012-01-01

    The paper presents a novel and publicly available set of high-quality sensory data collected from a ferry over a period of two months and overviews exixting machine-learning methods for the prediction of main propulsion efficiency. Neural networks are applied on both real-time and predictive...... settings. Performance results for the real-time models are shown. The presented models were successfully developed in a trim optimisation application onboard a product tanker....

  5. Multi-fidelity machine learning models for accurate bandgap predictions of solids

    International Nuclear Information System (INIS)

    Pilania, Ghanshyam; Gubernatis, James E.; Lookman, Turab

    2016-01-01

    Here, we present a multi-fidelity co-kriging statistical learning framework that combines variable-fidelity quantum mechanical calculations of bandgaps to generate a machine-learned model that enables low-cost accurate predictions of the bandgaps at the highest fidelity level. Additionally, the adopted Gaussian process regression formulation allows us to predict the underlying uncertainties as a measure of our confidence in the predictions. In using a set of 600 elpasolite compounds as an example dataset and using semi-local and hybrid exchange correlation functionals within density functional theory as two levels of fidelities, we demonstrate the excellent learning performance of the method against actual high fidelity quantum mechanical calculations of the bandgaps. The presented statistical learning method is not restricted to bandgaps or electronic structure methods and extends the utility of high throughput property predictions in a significant way.

  6. A Review of Current Machine Learning Methods Used for Cancer Recurrence Modeling and Prediction

    Energy Technology Data Exchange (ETDEWEB)

    Hemphill, Geralyn M. [Los Alamos National Lab. (LANL), Los Alamos, NM (United States)

    2016-09-27

    Cancer has been characterized as a heterogeneous disease consisting of many different subtypes. The early diagnosis and prognosis of a cancer type has become a necessity in cancer research. A major challenge in cancer management is the classification of patients into appropriate risk groups for better treatment and follow-up. Such risk assessment is critically important in order to optimize the patient’s health and the use of medical resources, as well as to avoid cancer recurrence. This paper focuses on the application of machine learning methods for predicting the likelihood of a recurrence of cancer. It is not meant to be an extensive review of the literature on the subject of machine learning techniques for cancer recurrence modeling. Other recent papers have performed such a review, and I will rely heavily on the results and outcomes from these papers. The electronic databases that were used for this review include PubMed, Google, and Google Scholar. Query terms used include “cancer recurrence modeling”, “cancer recurrence and machine learning”, “cancer recurrence modeling and machine learning”, and “machine learning for cancer recurrence and prediction”. The most recent and most applicable papers to the topic of this review have been included in the references. It also includes a list of modeling and classification methods to predict cancer recurrence.

  7. Predictive ability of machine learning methods for massive crop yield prediction

    Directory of Open Access Journals (Sweden)

    Alberto Gonzalez-Sanchez

    2014-04-01

    Full Text Available An important issue for agricultural planning purposes is the accurate yield estimation for the numerous crops involved in the planning. Machine learning (ML is an essential approach for achieving practical and effective solutions for this problem. Many comparisons of ML methods for yield prediction have been made, seeking for the most accurate technique. Generally, the number of evaluated crops and techniques is too low and does not provide enough information for agricultural planning purposes. This paper compares the predictive accuracy of ML and linear regression techniques for crop yield prediction in ten crop datasets. Multiple linear regression, M5-Prime regression trees, perceptron multilayer neural networks, support vector regression and k-nearest neighbor methods were ranked. Four accuracy metrics were used to validate the models: the root mean square error (RMS, root relative square error (RRSE, normalized mean absolute error (MAE, and correlation factor (R. Real data of an irrigation zone of Mexico were used for building the models. Models were tested with samples of two consecutive years. The results show that M5-Prime and k-nearest neighbor techniques obtain the lowest average RMSE errors (5.14 and 4.91, the lowest RRSE errors (79.46% and 79.78%, the lowest average MAE errors (18.12% and 19.42%, and the highest average correlation factors (0.41 and 0.42. Since M5-Prime achieves the largest number of crop yield models with the lowest errors, it is a very suitable tool for massive crop yield prediction in agricultural planning.

  8. Wind Power Ramp Events Prediction with Hybrid Machine Learning Regression Techniques and Reanalysis Data

    Directory of Open Access Journals (Sweden)

    Laura Cornejo-Bueno

    2017-11-01

    Full Text Available Wind Power Ramp Events (WPREs are large fluctuations of wind power in a short time interval, which lead to strong, undesirable variations in the electric power produced by a wind farm. Its accurate prediction is important in the effort of efficiently integrating wind energy in the electric system, without affecting considerably its stability, robustness and resilience. In this paper, we tackle the problem of predicting WPREs by applying Machine Learning (ML regression techniques. Our approach consists of using variables from atmospheric reanalysis data as predictive inputs for the learning machine, which opens the possibility of hybridizing numerical-physical weather models with ML techniques for WPREs prediction in real systems. Specifically, we have explored the feasibility of a number of state-of-the-art ML regression techniques, such as support vector regression, artificial neural networks (multi-layer perceptrons and extreme learning machines and Gaussian processes to solve the problem. Furthermore, the ERA-Interim reanalysis from the European Center for Medium-Range Weather Forecasts is the one used in this paper because of its accuracy and high resolution (in both spatial and temporal domains. Aiming at validating the feasibility of our predicting approach, we have carried out an extensive experimental work using real data from three wind farms in Spain, discussing the performance of the different ML regression tested in this wind power ramp event prediction problem.

  9. Taxi-Out Time Prediction for Departures at Charlotte Airport Using Machine Learning Techniques

    Science.gov (United States)

    Lee, Hanbong; Malik, Waqar; Jung, Yoon C.

    2016-01-01

    Predicting the taxi-out times of departures accurately is important for improving airport efficiency and takeoff time predictability. In this paper, we attempt to apply machine learning techniques to actual traffic data at Charlotte Douglas International Airport for taxi-out time prediction. To find the key factors affecting aircraft taxi times, surface surveillance data is first analyzed. From this data analysis, several variables, including terminal concourse, spot, runway, departure fix and weight class, are selected for taxi time prediction. Then, various machine learning methods such as linear regression, support vector machines, k-nearest neighbors, random forest, and neural networks model are applied to actual flight data. Different traffic flow and weather conditions at Charlotte airport are also taken into account for more accurate prediction. The taxi-out time prediction results show that linear regression and random forest techniques can provide the most accurate prediction in terms of root-mean-square errors. We also discuss the operational complexity and uncertainties that make it difficult to predict the taxi times accurately.

  10. Using machine learning to predict wind turbine power output

    International Nuclear Information System (INIS)

    Clifton, A; Kilcher, L; Lundquist, J K; Fleming, P

    2013-01-01

    Wind turbine power output is known to be a strong function of wind speed, but is also affected by turbulence and shear. In this work, new aerostructural simulations of a generic 1.5 MW turbine are used to rank atmospheric influences on power output. Most significant is the hub height wind speed, followed by hub height turbulence intensity and then wind speed shear across the rotor disk. These simulation data are used to train regression trees that predict the turbine response for any combination of wind speed, turbulence intensity, and wind shear that might be expected at a turbine site. For a randomly selected atmospheric condition, the accuracy of the regression tree power predictions is three times higher than that from the traditional power curve methodology. The regression tree method can also be applied to turbine test data and used to predict turbine performance at a new site. No new data are required in comparison to the data that are usually collected for a wind resource assessment. Implementing the method requires turbine manufacturers to create a turbine regression tree model from test site data. Such an approach could significantly reduce bias in power predictions that arise because of the different turbulence and shear at the new site, compared to the test site. (letter)

  11. Rapid Prediction of Bacterial Heterotrophic Fluxomics Using Machine Learning and Constraint Programming.

    Directory of Open Access Journals (Sweden)

    Stephen Gang Wu

    2016-04-01

    Full Text Available 13C metabolic flux analysis (13C-MFA has been widely used to measure in vivo enzyme reaction rates (i.e., metabolic flux in microorganisms. Mining the relationship between environmental and genetic factors and metabolic fluxes hidden in existing fluxomic data will lead to predictive models that can significantly accelerate flux quantification. In this paper, we present a web-based platform MFlux (http://mflux.org that predicts the bacterial central metabolism via machine learning, leveraging data from approximately 100 13C-MFA papers on heterotrophic bacterial metabolisms. Three machine learning methods, namely Support Vector Machine (SVM, k-Nearest Neighbors (k-NN, and Decision Tree, were employed to study the sophisticated relationship between influential factors and metabolic fluxes. We performed a grid search of the best parameter set for each algorithm and verified their performance through 10-fold cross validations. SVM yields the highest accuracy among all three algorithms. Further, we employed quadratic programming to adjust flux profiles to satisfy stoichiometric constraints. Multiple case studies have shown that MFlux can reasonably predict fluxomes as a function of bacterial species, substrate types, growth rate, oxygen conditions, and cultivation methods. Due to the interest of studying model organism under particular carbon sources, bias of fluxome in the dataset may limit the applicability of machine learning models. This problem can be resolved after more papers on 13C-MFA are published for non-model species.

  12. Prediction of GPCR-Ligand Binding Using Machine Learning Algorithms

    Directory of Open Access Journals (Sweden)

    Sangmin Seo

    2018-01-01

    Full Text Available We propose a novel method that predicts binding of G-protein coupled receptors (GPCRs and ligands. The proposed method uses hub and cycle structures of ligands and amino acid motif sequences of GPCRs, rather than the 3D structure of a receptor or similarity of receptors or ligands. The experimental results show that these new features can be effective in predicting GPCR-ligand binding (average area under the curve [AUC] of 0.944, because they are thought to include hidden properties of good ligand-receptor binding. Using the proposed method, we were able to identify novel ligand-GPCR bindings, some of which are supported by several studies.

  13. Machine learning for predicting the response of breast cancer to neoadjuvant chemotherapy.

    Science.gov (United States)

    Mani, Subramani; Chen, Yukun; Li, Xia; Arlinghaus, Lori; Chakravarthy, A Bapsi; Abramson, Vandana; Bhave, Sandeep R; Levy, Mia A; Xu, Hua; Yankeelov, Thomas E

    2013-01-01

    To employ machine learning methods to predict the eventual therapeutic response of breast cancer patients after a single cycle of neoadjuvant chemotherapy (NAC). Quantitative dynamic contrast-enhanced MRI and diffusion-weighted MRI data were acquired on 28 patients before and after one cycle of NAC. A total of 118 semiquantitative and quantitative parameters were derived from these data and combined with 11 clinical variables. We used Bayesian logistic regression in combination with feature selection using a machine learning framework for predictive model building. The best predictive models using feature selection obtained an area under the curve of 0.86 and an accuracy of 0.86, with a sensitivity of 0.88 and a specificity of 0.82. With the numerous options for NAC available, development of a method to predict response early in the course of therapy is needed. Unfortunately, by the time most patients are found not to be responding, their disease may no longer be surgically resectable, and this situation could be avoided by the development of techniques to assess response earlier in the treatment regimen. The method outlined here is one possible solution to this important clinical problem. Predictive modeling approaches based on machine learning using readily available clinical and quantitative MRI data show promise in distinguishing breast cancer responders from non-responders after the first cycle of NAC.

  14. SU-F-P-20: Predicting Waiting Times in Radiation Oncology Using Machine Learning

    International Nuclear Information System (INIS)

    Joseph, A; Herrera, D; Hijal, T; Kildea, J; Hendren, L; Leung, A; Wainberg, J; Sawaf, M; Gorshkov, M; Maglieri, R; Keshavarz, M

    2016-01-01

    Purpose: Waiting times remain one of the most vexing patient satisfaction challenges facing healthcare. Waiting time uncertainty can cause patients, who are already sick or in pain, to worry about when they will receive the care they need. These waiting periods are often difficult for staff to predict and only rough estimates are typically provided based on personal experience. This level of uncertainty leaves most patients unable to plan their calendar, making the waiting experience uncomfortable, even painful. In the present era of electronic health records (EHRs), waiting times need not be so uncertain. Extensive EHRs provide unprecedented amounts of data that can statistically cluster towards representative values when appropriate patient cohorts are selected. Predictive modelling, such as machine learning, is a powerful approach that benefits from large, potentially complex, datasets. The essence of machine learning is to predict future outcomes by learning from previous experience. The application of a machine learning algorithm to waiting time data has the potential to produce personalized waiting time predictions such that the uncertainty may be removed from the patient’s waiting experience. Methods: In radiation oncology, patients typically experience several types of waiting (eg waiting at home for treatment planning, waiting in the waiting room for oncologist appointments and daily waiting in the waiting room for radiotherapy treatments). A daily treatment wait time model is discussed in this report. To develop a prediction model using our large dataset (with more than 100k sample points) a variety of machine learning algorithms from the Python package sklearn were tested. Results: We found that the Random Forest Regressor model provides the best predictions for daily radiotherapy treatment waiting times. Using this model, we achieved a median residual (actual value minus predicted value) of 0.25 minutes and a standard deviation residual of 6.5 minutes

  15. SU-F-P-20: Predicting Waiting Times in Radiation Oncology Using Machine Learning

    Energy Technology Data Exchange (ETDEWEB)

    Joseph, A; Herrera, D; Hijal, T; Kildea, J [McGill University Health Centre, Montreal, Quebec (Canada); Hendren, L; Leung, A; Wainberg, J; Sawaf, M; Gorshkov, M; Maglieri, R; Keshavarz, M [McGill University, Montreal, Quebec (Canada)

    2016-06-15

    Purpose: Waiting times remain one of the most vexing patient satisfaction challenges facing healthcare. Waiting time uncertainty can cause patients, who are already sick or in pain, to worry about when they will receive the care they need. These waiting periods are often difficult for staff to predict and only rough estimates are typically provided based on personal experience. This level of uncertainty leaves most patients unable to plan their calendar, making the waiting experience uncomfortable, even painful. In the present era of electronic health records (EHRs), waiting times need not be so uncertain. Extensive EHRs provide unprecedented amounts of data that can statistically cluster towards representative values when appropriate patient cohorts are selected. Predictive modelling, such as machine learning, is a powerful approach that benefits from large, potentially complex, datasets. The essence of machine learning is to predict future outcomes by learning from previous experience. The application of a machine learning algorithm to waiting time data has the potential to produce personalized waiting time predictions such that the uncertainty may be removed from the patient’s waiting experience. Methods: In radiation oncology, patients typically experience several types of waiting (eg waiting at home for treatment planning, waiting in the waiting room for oncologist appointments and daily waiting in the waiting room for radiotherapy treatments). A daily treatment wait time model is discussed in this report. To develop a prediction model using our large dataset (with more than 100k sample points) a variety of machine learning algorithms from the Python package sklearn were tested. Results: We found that the Random Forest Regressor model provides the best predictions for daily radiotherapy treatment waiting times. Using this model, we achieved a median residual (actual value minus predicted value) of 0.25 minutes and a standard deviation residual of 6.5 minutes

  16. Breast Cancer Patients' Depression Prediction by Machine Learning Approach.

    Science.gov (United States)

    Cvetković, Jovana

    2017-09-14

    One of the most common cancer in females is breasts cancer. This cancer can has high impact on the women including health and social dimensions. One of the most common social dimension is depression caused by breast cancer. Depression can impairs life quality. Depression is one of the symptom among the breast cancer patients. One of the solution is to eliminate the depression in breast cancer patients is by treatments but these treatments can has different unpredictable impacts on the patients. Therefore it is suitable to develop algorithm in order to predict the depression range.

  17. Exploring prediction uncertainty of spatial data in geostatistical and machine learning Approaches

    Science.gov (United States)

    Klump, J. F.; Fouedjio, F.

    2017-12-01

    Geostatistical methods such as kriging with external drift as well as machine learning techniques such as quantile regression forest have been intensively used for modelling spatial data. In addition to providing predictions for target variables, both approaches are able to deliver a quantification of the uncertainty associated with the prediction at a target location. Geostatistical approaches are, by essence, adequate for providing such prediction uncertainties and their behaviour is well understood. However, they often require significant data pre-processing and rely on assumptions that are rarely met in practice. Machine learning algorithms such as random forest regression, on the other hand, require less data pre-processing and are non-parametric. This makes the application of machine learning algorithms to geostatistical problems an attractive proposition. The objective of this study is to compare kriging with external drift and quantile regression forest with respect to their ability to deliver reliable prediction uncertainties of spatial data. In our comparison we use both simulated and real world datasets. Apart from classical performance indicators, comparisons make use of accuracy plots, probability interval width plots, and the visual examinations of the uncertainty maps provided by the two approaches. By comparing random forest regression to kriging we found that both methods produced comparable maps of estimated values for our variables of interest. However, the measure of uncertainty provided by random forest seems to be quite different to the measure of uncertainty provided by kriging. In particular, the lack of spatial context can give misleading results in areas without ground truth data. These preliminary results raise questions about assessing the risks associated with decisions based on the predictions from geostatistical and machine learning algorithms in a spatial context, e.g. mineral exploration.

  18. A machine learns to predict the stability of circumbinary planets

    Science.gov (United States)

    Lam, Christopher; Kipping, David

    2018-01-01

    Long-period circumbinary planets appear to be as common as those orbiting single stars and have been found to frequently have orbital radii just beyond the critical distance for dynamical stability. Assessing the stability is typically done either through N-body simulations or using the classic Holman-Wiegert stability criterion: a second-order polynomial calibrated to broadly match numerical simulations. However, the polynomial is unable to capture islands of instability introduced by mean motion resonances, causing the accuracy of the criterion to approach that of a random coin-toss when close to the boundary. We show how a deep neural network (DNN) trained on N-body simulations generated with REBOUND is able to significantly improve stability predictions for circumbinary planets on initially coplanar, circular orbits. Specifically, we find that the accuracy of our DNN never drops below 86%, even when tightly surrounding the boundary of instability, and is fast enough to be practical for on-the-fly calls during likelihood evaluations typical of modern Bayesian inference. Our binary classifier DNN is made publicly available at https://github.com/CoolWorlds/orbital-stability.

  19. A machine learns to predict the stability of circumbinary planets

    Science.gov (United States)

    Lam, Christopher; Kipping, David

    2018-06-01

    Long-period circumbinary planets appear to be as common as those orbiting single stars and have been found to frequently have orbital radii just beyond the critical distance for dynamical stability. Assessing the stability is typically done either through N-body simulations or using the classic stability criterion first considered by Dvorak and later developed by Holman and Wiegert: a second-order polynomial calibrated to broadly match numerical simulations. However, the polynomial is unable to capture islands of instability introduced by mean motion resonances, causing the accuracy of the criterion to approach that of a random coin-toss when close to the boundary. We show how a deep neural network (DNN) trained on N-body simulations generated with REBOUND is able to significantly improve stability predictions for circumbinary planets on initially coplanar, circular orbits. Specifically, we find that the accuracy of our DNN never drops below 86 per cent, even when tightly surrounding the boundary of instability, and is fast enough to be practical for on-the-fly calls during likelihood evaluations typical of modern Bayesian inference. Our binary classifier DNN is made publicly available at https://github.com/CoolWorlds/orbital-stability.

  20. Applying Machine Learning and High Performance Computing to Water Quality Assessment and Prediction

    Directory of Open Access Journals (Sweden)

    Ruijian Zhang

    2017-12-01

    Full Text Available Water quality assessment and prediction is a more and more important issue. Traditional ways either take lots of time or they can only do assessments. In this research, by applying machine learning algorithm to a long period time of water attributes’ data; we can generate a decision tree so that it can predict the future day’s water quality in an easy and efficient way. The idea is to combine the traditional ways and the computer algorithms together. Using machine learning algorithms, the assessment of water quality will be far more efficient, and by generating the decision tree, the prediction will be quite accurate. The drawback of the machine learning modeling is that the execution takes quite long time, especially when we employ a better accuracy but more time-consuming algorithm in clustering. Therefore, we applied the high performance computing (HPC System to deal with this problem. Up to now, the pilot experiments have achieved very promising preliminary results. The visualized water quality assessment and prediction obtained from this project would be published in an interactive website so that the public and the environmental managers could use the information for their decision making.

  1. Failure prediction using machine learning and time series in optical network.

    Science.gov (United States)

    Wang, Zhilong; Zhang, Min; Wang, Danshi; Song, Chuang; Liu, Min; Li, Jin; Lou, Liqi; Liu, Zhuo

    2017-08-07

    In this paper, we propose a performance monitoring and failure prediction method in optical networks based on machine learning. The primary algorithms of this method are the support vector machine (SVM) and double exponential smoothing (DES). With a focus on risk-aware models in optical networks, the proposed protection plan primarily investigates how to predict the risk of an equipment failure. To the best of our knowledge, this important problem has not yet been fully considered. Experimental results showed that the average prediction accuracy of our method was 95% when predicting the optical equipment failure state. This finding means that our method can forecast an equipment failure risk with high accuracy. Therefore, our proposed DES-SVM method can effectively improve traditional risk-aware models to protect services from possible failures and enhance the optical network stability.

  2. In Silico Prediction of Chemical Toxicity for Drug Design Using Machine Learning Methods and Structural Alerts

    Science.gov (United States)

    Yang, Hongbin; Sun, Lixia; Li, Weihua; Liu, Guixia; Tang, Yun

    2018-02-01

    For a drug, safety is always the most important issue, including a variety of toxicities and adverse drug effects, which should be evaluated in preclinical and clinical trial phases. This review article at first simply introduced the computational methods used in prediction of chemical toxicity for drug design, including machine learning methods and structural alerts. Machine learning methods have been widely applied in qualitative classification and quantitative regression studies, while structural alerts can be regarded as a complementary tool for lead optimization. The emphasis of this article was put on the recent progress of predictive models built for various toxicities. Available databases and web servers were also provided. Though the methods and models are very helpful for drug design, there are still some challenges and limitations to be improved for drug safety assessment in the future.

  3. In Silico Prediction of Chemical Toxicity for Drug Design Using Machine Learning Methods and Structural Alerts

    Directory of Open Access Journals (Sweden)

    Hongbin Yang

    2018-02-01

    Full Text Available During drug development, safety is always the most important issue, including a variety of toxicities and adverse drug effects, which should be evaluated in preclinical and clinical trial phases. This review article at first simply introduced the computational methods used in prediction of chemical toxicity for drug design, including machine learning methods and structural alerts. Machine learning methods have been widely applied in qualitative classification and quantitative regression studies, while structural alerts can be regarded as a complementary tool for lead optimization. The emphasis of this article was put on the recent progress of predictive models built for various toxicities. Available databases and web servers were also provided. Though the methods and models are very helpful for drug design, there are still some challenges and limitations to be improved for drug safety assessment in the future.

  4. In Silico Prediction of Chemical Toxicity for Drug Design Using Machine Learning Methods and Structural Alerts.

    Science.gov (United States)

    Yang, Hongbin; Sun, Lixia; Li, Weihua; Liu, Guixia; Tang, Yun

    2018-01-01

    During drug development, safety is always the most important issue, including a variety of toxicities and adverse drug effects, which should be evaluated in preclinical and clinical trial phases. This review article at first simply introduced the computational methods used in prediction of chemical toxicity for drug design, including machine learning methods and structural alerts. Machine learning methods have been widely applied in qualitative classification and quantitative regression studies, while structural alerts can be regarded as a complementary tool for lead optimization. The emphasis of this article was put on the recent progress of predictive models built for various toxicities. Available databases and web servers were also provided. Though the methods and models are very helpful for drug design, there are still some challenges and limitations to be improved for drug safety assessment in the future.

  5. MU-LOC: A Machine-Learning Method for Predicting Mitochondrially Localized Proteins in Plants

    DEFF Research Database (Denmark)

    Zhang, Ning; Rao, R Shyama Prasad; Salvato, Fernanda

    2018-01-01

    -sequence or a multitude of internal signals. Compared with experimental approaches, computational predictions provide an efficient way to infer subcellular localization of a protein. However, it is still challenging to predict plant mitochondrially localized proteins accurately due to various limitations. Consequently......, the performance of current tools can be improved with new data and new machine-learning methods. We present MU-LOC, a novel computational approach for large-scale prediction of plant mitochondrial proteins. We collected a comprehensive dataset of plant subcellular localization, extracted features including amino...

  6. PredPsych: A toolbox for predictive machine learning based approach in experimental psychology research

    OpenAIRE

    Cavallo, Andrea; Becchio, Cristina; Koul, Atesh

    2016-01-01

    Recent years have seen an increased interest in machine learning based predictive methods for analysing quantitative behavioural data in experimental psychology. While these methods can achieve relatively greater sensitivity compared to conventional univariate techniques, they still lack an established and accessible software framework. The goal of this work was to build an open-source toolbox – “PredPsych” – that could make these methods readily available to all psychologists. PredPsych is a...

  7. Using Unsupervised Machine Learning for Outlier Detection in Data to Improve Wind Power Production Prediction

    OpenAIRE

    Åkerberg, Ludvig

    2017-01-01

    The expansion of wind power for electrical energy production has increased in recent years and shows no signs of slowing down. This unpredictable source of energy has contributed to destabilization of the electrical grid causing the energy market prices to vary significantly on a daily basis. For energy producers and consumers to make good investments, methods have been developed to make predictions of wind power production. These methods are often based on machine learning were historical we...

  8. ROBUSTNESS AND PREDICTION ACCURACY OF MACHINE LEARNING FOR OBJECTIVE VISUAL QUALITY ASSESSMENT

    OpenAIRE

    Hines, Andrew; Kendrick, Paul; Barri, Adriaan; Narwaria, Manish; Redi, Judith A.

    2014-01-01

    Machine Learning (ML) is a powerful tool to support the development of objective visual quality assessment metrics, serving as a substitute model for the perceptual mechanisms acting in visual quality appreciation. Nevertheless, the reliability of ML-based techniques within objective quality assessment metrics is often questioned. In this study, the robustness of ML in supporting objective quality assessment is investigated, specifically when the feature set adopted for prediction is suboptim...

  9. Solar Flare Prediction Model with Three Machine-learning Algorithms using Ultraviolet Brightening and Vector Magnetograms

    Science.gov (United States)

    Nishizuka, N.; Sugiura, K.; Kubo, Y.; Den, M.; Watari, S.; Ishii, M.

    2017-02-01

    We developed a flare prediction model using machine learning, which is optimized to predict the maximum class of flares occurring in the following 24 hr. Machine learning is used to devise algorithms that can learn from and make decisions on a huge amount of data. We used solar observation data during the period 2010-2015, such as vector magnetograms, ultraviolet (UV) emission, and soft X-ray emission taken by the Solar Dynamics Observatory and the Geostationary Operational Environmental Satellite. We detected active regions (ARs) from the full-disk magnetogram, from which ˜60 features were extracted with their time differentials, including magnetic neutral lines, the current helicity, the UV brightening, and the flare history. After standardizing the feature database, we fully shuffled and randomly separated it into two for training and testing. To investigate which algorithm is best for flare prediction, we compared three machine-learning algorithms: the support vector machine, k-nearest neighbors (k-NN), and extremely randomized trees. The prediction score, the true skill statistic, was higher than 0.9 with a fully shuffled data set, which is higher than that for human forecasts. It was found that k-NN has the highest performance among the three algorithms. The ranking of the feature importance showed that previous flare activity is most effective, followed by the length of magnetic neutral lines, the unsigned magnetic flux, the area of UV brightening, and the time differentials of features over 24 hr, all of which are strongly correlated with the flux emergence dynamics in an AR.

  10. Using social media and machine learning to predict financial performance of a company

    OpenAIRE

    Forouzani, Sepehr

    2016-01-01

    Social media have recently become one of the most popular communicating form of media for numerous number of people. the text and posts shared on social media is widely used by researcher to analyze, study and relate them to various fields. In this master thesis, sentiment analysis has been performed on posts containing information about two companies that are shared on Twitter, and machine learning algorithms has been used to predict the financial performance of these companies.

  11. Solar Flare Prediction Model with Three Machine-learning Algorithms using Ultraviolet Brightening and Vector Magnetograms

    Energy Technology Data Exchange (ETDEWEB)

    Nishizuka, N.; Kubo, Y.; Den, M.; Watari, S.; Ishii, M. [Applied Electromagnetic Research Institute, National Institute of Information and Communications Technology, 4-2-1, Nukui-Kitamachi, Koganei, Tokyo 184-8795 (Japan); Sugiura, K., E-mail: nishizuka.naoto@nict.go.jp [Advanced Speech Translation Research and Development Promotion Center, National Institute of Information and Communications Technology (Japan)

    2017-02-01

    We developed a flare prediction model using machine learning, which is optimized to predict the maximum class of flares occurring in the following 24 hr. Machine learning is used to devise algorithms that can learn from and make decisions on a huge amount of data. We used solar observation data during the period 2010–2015, such as vector magnetograms, ultraviolet (UV) emission, and soft X-ray emission taken by the Solar Dynamics Observatory and the Geostationary Operational Environmental Satellite . We detected active regions (ARs) from the full-disk magnetogram, from which ∼60 features were extracted with their time differentials, including magnetic neutral lines, the current helicity, the UV brightening, and the flare history. After standardizing the feature database, we fully shuffled and randomly separated it into two for training and testing. To investigate which algorithm is best for flare prediction, we compared three machine-learning algorithms: the support vector machine, k-nearest neighbors (k-NN), and extremely randomized trees. The prediction score, the true skill statistic, was higher than 0.9 with a fully shuffled data set, which is higher than that for human forecasts. It was found that k-NN has the highest performance among the three algorithms. The ranking of the feature importance showed that previous flare activity is most effective, followed by the length of magnetic neutral lines, the unsigned magnetic flux, the area of UV brightening, and the time differentials of features over 24 hr, all of which are strongly correlated with the flux emergence dynamics in an AR.

  12. Solar Flare Prediction Model with Three Machine-learning Algorithms using Ultraviolet Brightening and Vector Magnetograms

    International Nuclear Information System (INIS)

    Nishizuka, N.; Kubo, Y.; Den, M.; Watari, S.; Ishii, M.; Sugiura, K.

    2017-01-01

    We developed a flare prediction model using machine learning, which is optimized to predict the maximum class of flares occurring in the following 24 hr. Machine learning is used to devise algorithms that can learn from and make decisions on a huge amount of data. We used solar observation data during the period 2010–2015, such as vector magnetograms, ultraviolet (UV) emission, and soft X-ray emission taken by the Solar Dynamics Observatory and the Geostationary Operational Environmental Satellite . We detected active regions (ARs) from the full-disk magnetogram, from which ∼60 features were extracted with their time differentials, including magnetic neutral lines, the current helicity, the UV brightening, and the flare history. After standardizing the feature database, we fully shuffled and randomly separated it into two for training and testing. To investigate which algorithm is best for flare prediction, we compared three machine-learning algorithms: the support vector machine, k-nearest neighbors (k-NN), and extremely randomized trees. The prediction score, the true skill statistic, was higher than 0.9 with a fully shuffled data set, which is higher than that for human forecasts. It was found that k-NN has the highest performance among the three algorithms. The ranking of the feature importance showed that previous flare activity is most effective, followed by the length of magnetic neutral lines, the unsigned magnetic flux, the area of UV brightening, and the time differentials of features over 24 hr, all of which are strongly correlated with the flux emergence dynamics in an AR.

  13. Application of Machine-Learning Models to Predict Tacrolimus Stable Dose in Renal Transplant Recipients

    Science.gov (United States)

    Tang, Jie; Liu, Rong; Zhang, Yue-Li; Liu, Mou-Ze; Hu, Yong-Fang; Shao, Ming-Jie; Zhu, Li-Jun; Xin, Hua-Wen; Feng, Gui-Wen; Shang, Wen-Jun; Meng, Xiang-Guang; Zhang, Li-Rong; Ming, Ying-Zi; Zhang, Wei

    2017-02-01

    Tacrolimus has a narrow therapeutic window and considerable variability in clinical use. Our goal was to compare the performance of multiple linear regression (MLR) and eight machine learning techniques in pharmacogenetic algorithm-based prediction of tacrolimus stable dose (TSD) in a large Chinese cohort. A total of 1,045 renal transplant patients were recruited, 80% of which were randomly selected as the “derivation cohort” to develop dose-prediction algorithm, while the remaining 20% constituted the “validation cohort” to test the final selected algorithm. MLR, artificial neural network (ANN), regression tree (RT), multivariate adaptive regression splines (MARS), boosted regression tree (BRT), support vector regression (SVR), random forest regression (RFR), lasso regression (LAR) and Bayesian additive regression trees (BART) were applied and their performances were compared in this work. Among all the machine learning models, RT performed best in both derivation [0.71 (0.67-0.76)] and validation cohorts [0.73 (0.63-0.82)]. In addition, the ideal rate of RT was 4% higher than that of MLR. To our knowledge, this is the first study to use machine learning models to predict TSD, which will further facilitate personalized medicine in tacrolimus administration in the future.

  14. Machine learning and statistical methods for the prediction of maximal oxygen uptake: recent advances

    Directory of Open Access Journals (Sweden)

    Abut F

    2015-08-01

    Full Text Available Fatih Abut, Mehmet Fatih AkayDepartment of Computer Engineering, Çukurova University, Adana, TurkeyAbstract: Maximal oxygen uptake (VO2max indicates how many milliliters of oxygen the body can consume in a state of intense exercise per minute. VO2max plays an important role in both sport and medical sciences for different purposes, such as indicating the endurance capacity of athletes or serving as a metric in estimating the disease risk of a person. In general, the direct measurement of VO2max provides the most accurate assessment of aerobic power. However, despite a high level of accuracy, practical limitations associated with the direct measurement of VO2max, such as the requirement of expensive and sophisticated laboratory equipment or trained staff, have led to the development of various regression models for predicting VO2max. Consequently, a lot of studies have been conducted in the last years to predict VO2max of various target audiences, ranging from soccer athletes, nonexpert swimmers, cross-country skiers to healthy-fit adults, teenagers, and children. Numerous prediction models have been developed using different sets of predictor variables and a variety of machine learning and statistical methods, including support vector machine, multilayer perceptron, general regression neural network, and multiple linear regression. The purpose of this study is to give a detailed overview about the data-driven modeling studies for the prediction of VO2max conducted in recent years and to compare the performance of various VO2max prediction models reported in related literature in terms of two well-known metrics, namely, multiple correlation coefficient (R and standard error of estimate. The survey results reveal that with respect to regression methods used to develop prediction models, support vector machine, in general, shows better performance than other methods, whereas multiple linear regression exhibits the worst performance

  15. Controlling misses and false alarms in a machine learning framework for predicting uniformity of printed pages

    Science.gov (United States)

    Nguyen, Minh Q.; Allebach, Jan P.

    2015-01-01

    In our previous work1 , we presented a block-based technique to analyze printed page uniformity both visually and metrically. The features learned from the models were then employed in a Support Vector Machine (SVM) framework to classify the pages into one of the two categories of acceptable and unacceptable quality. In this paper, we introduce a set of tools for machine learning in the assessment of printed page uniformity. This work is primarily targeted to the printing industry, specifically the ubiquitous laser, electrophotographic printer. We use features that are well-correlated with the rankings of expert observers to develop a novel machine learning framework that allows one to achieve the minimum "false alarm" rate, subject to a chosen "miss" rate. Surprisingly, most of the research that has been conducted on machine learning does not consider this framework. During the process of developing a new product, test engineers will print hundreds of test pages, which can be scanned and then analyzed by an autonomous algorithm. Among these pages, most may be of acceptable quality. The objective is to find the ones that are not. These will provide critically important information to systems designers, regarding issues that need to be addressed in improving the printer design. A "miss" is defined to be a page that is not of acceptable quality to an expert observer that the prediction algorithm declares to be a "pass". Misses are a serious problem, since they represent problems that will not be seen by the systems designers. On the other hand, "false alarms" correspond to pages that an expert observer would declare to be of acceptable quality, but which are flagged by the prediction algorithm as "fails". In a typical printer testing and development scenario, such pages would be examined by an expert, and found to be of acceptable quality after all. "False alarm" pages result in extra pages to be examined by expert observers, which increases labor cost. But "false

  16. Use of a Machine-learning Method for Predicting Highly Cited Articles Within General Radiology Journals.

    Science.gov (United States)

    Rosenkrantz, Andrew B; Doshi, Ankur M; Ginocchio, Luke A; Aphinyanaphongs, Yindalon

    2016-12-01

    This study aimed to assess the performance of a text classification machine-learning model in predicting highly cited articles within the recent radiological literature and to identify the model's most influential article features. We downloaded from PubMed the title, abstract, and medical subject heading terms for 10,065 articles published in 25 general radiology journals in 2012 and 2013. Three machine-learning models were applied to predict the top 10% of included articles in terms of the number of citations to the article in 2014 (reflecting the 2-year time window in conventional impact factor calculations). The model having the highest area under the curve was selected to derive a list of article features (words) predicting high citation volume, which was iteratively reduced to identify the smallest possible core feature list maintaining predictive power. Overall themes were qualitatively assigned to the core features. The regularized logistic regression (Bayesian binary regression) model had highest performance, achieving an area under the curve of 0.814 in predicting articles in the top 10% of citation volume. We reduced the initial 14,083 features to 210 features that maintain predictivity. These features corresponded with topics relating to various imaging techniques (eg, diffusion-weighted magnetic resonance imaging, hyperpolarized magnetic resonance imaging, dual-energy computed tomography, computed tomography reconstruction algorithms, tomosynthesis, elastography, and computer-aided diagnosis), particular pathologies (prostate cancer; thyroid nodules; hepatic adenoma, hepatocellular carcinoma, non-alcoholic fatty liver disease), and other topics (radiation dose, electroporation, education, general oncology, gadolinium, statistics). Machine learning can be successfully applied to create specific feature-based models for predicting articles likely to achieve high influence within the radiological literature. Copyright © 2016 The Association of University

  17. Two Machine Learning Approaches for Short-Term Wind Speed Time-Series Prediction.

    Science.gov (United States)

    Ak, Ronay; Fink, Olga; Zio, Enrico

    2016-08-01

    The increasing liberalization of European electricity markets, the growing proportion of intermittent renewable energy being fed into the energy grids, and also new challenges in the patterns of energy consumption (such as electric mobility) require flexible and intelligent power grids capable of providing efficient, reliable, economical, and sustainable energy production and distribution. From the supplier side, particularly, the integration of renewable energy sources (e.g., wind and solar) into the grid imposes an engineering and economic challenge because of the limited ability to control and dispatch these energy sources due to their intermittent characteristics. Time-series prediction of wind speed for wind power production is a particularly important and challenging task, wherein prediction intervals (PIs) are preferable results of the prediction, rather than point estimates, because they provide information on the confidence in the prediction. In this paper, two different machine learning approaches to assess PIs of time-series predictions are considered and compared: 1) multilayer perceptron neural networks trained with a multiobjective genetic algorithm and 2) extreme learning machines combined with the nearest neighbors approach. The proposed approaches are applied for short-term wind speed prediction from a real data set of hourly wind speed measurements for the region of Regina in Saskatchewan, Canada. Both approaches demonstrate good prediction precision and provide complementary advantages with respect to different evaluation criteria.

  18. An Approach for Predicting Essential Genes Using Multiple Homology Mapping and Machine Learning Algorithms.

    Science.gov (United States)

    Hua, Hong-Li; Zhang, Fa-Zhan; Labena, Abraham Alemayehu; Dong, Chuan; Jin, Yan-Ting; Guo, Feng-Biao

    Investigation of essential genes is significant to comprehend the minimal gene sets of cell and discover potential drug targets. In this study, a novel approach based on multiple homology mapping and machine learning method was introduced to predict essential genes. We focused on 25 bacteria which have characterized essential genes. The predictions yielded the highest area under receiver operating characteristic (ROC) curve (AUC) of 0.9716 through tenfold cross-validation test. Proper features were utilized to construct models to make predictions in distantly related bacteria. The accuracy of predictions was evaluated via the consistency of predictions and known essential genes of target species. The highest AUC of 0.9552 and average AUC of 0.8314 were achieved when making predictions across organisms. An independent dataset from Synechococcus elongatus , which was released recently, was obtained for further assessment of the performance of our model. The AUC score of predictions is 0.7855, which is higher than other methods. This research presents that features obtained by homology mapping uniquely can achieve quite great or even better results than those integrated features. Meanwhile, the work indicates that machine learning-based method can assign more efficient weight coefficients than using empirical formula based on biological knowledge.

  19. Multi-level machine learning prediction of protein–protein interactions in Saccharomyces cerevisiae

    Directory of Open Access Journals (Sweden)

    Julian Zubek

    2015-07-01

    Full Text Available Accurate identification of protein–protein interactions (PPI is the key step in understanding proteins’ biological functions, which are typically context-dependent. Many existing PPI predictors rely on aggregated features from protein sequences, however only a few methods exploit local information about specific residue contacts. In this work we present a two-stage machine learning approach for prediction of protein–protein interactions. We start with the carefully filtered data on protein complexes available for Saccharomyces cerevisiae in the Protein Data Bank (PDB database. First, we build linear descriptions of interacting and non-interacting sequence segment pairs based on their inter-residue distances. Secondly, we train machine learning classifiers to predict binary segment interactions for any two short sequence fragments. The final prediction of the protein–protein interaction is done using the 2D matrix representation of all-against-all possible interacting sequence segments of both analysed proteins. The level-I predictor achieves 0.88 AUC for micro-scale, i.e., residue-level prediction. The level-II predictor improves the results further by a more complex learning paradigm. We perform 30-fold macro-scale, i.e., protein-level cross-validation experiment. The level-II predictor using PSIPRED-predicted secondary structure reaches 0.70 precision, 0.68 recall, and 0.70 AUC, whereas other popular methods provide results below 0.6 threshold (recall, precision, AUC. Our results demonstrate that multi-scale sequence features aggregation procedure is able to improve the machine learning results by more than 10% as compared to other sequence representations. Prepared datasets and source code for our experimental pipeline are freely available for download from: http://zubekj.github.io/mlppi/ (open source Python implementation, OS independent.

  20. SU-E-J-191: Motion Prediction Using Extreme Learning Machine in Image Guided Radiotherapy

    International Nuclear Information System (INIS)

    Jia, J; Cao, R; Pei, X; Wang, H; Hu, L

    2015-01-01

    Purpose: Real-time motion tracking is a critical issue in image guided radiotherapy due to the time latency caused by image processing and system response. It is of great necessity to fast and accurately predict the future position of the respiratory motion and the tumor location. Methods: The prediction of respiratory position was done based on the positioning and tracking module in ARTS-IGRT system which was developed by FDS Team (www.fds.org.cn). An approach involving with the extreme learning machine (ELM) was adopted to predict the future respiratory position as well as the tumor’s location by training the past trajectories. For the training process, a feed-forward neural network with one single hidden layer was used for the learning. First, the number of hidden nodes was figured out for the single layered feed forward network (SLFN). Then the input weights and hidden layer biases of the SLFN were randomly assigned to calculate the hidden neuron output matrix. Finally, the predicted movement were obtained by applying the output weights and compared with the actual movement. Breathing movement acquired from the external infrared markers was used to test the prediction accuracy. And the implanted marker movement for the prostate cancer was used to test the implementation of the tumor motion prediction. Results: The accuracy of the predicted motion and the actual motion was tested. Five volunteers with different breathing patterns were tested. The average prediction time was 0.281s. And the standard deviation of prediction accuracy was 0.002 for the respiratory motion and 0.001 for the tumor motion. Conclusion: The extreme learning machine method can provide an accurate and fast prediction of the respiratory motion and the tumor location and therefore can meet the requirements of real-time tumor-tracking in image guided radiotherapy

  1. SU-E-J-191: Motion Prediction Using Extreme Learning Machine in Image Guided Radiotherapy

    Energy Technology Data Exchange (ETDEWEB)

    Jia, J; Cao, R; Pei, X; Wang, H; Hu, L [Key Laboratory of Neutronics and Radiation Safety, Institute of Nuclear Energy Safety Technology, Chinese Academy of Sciences, Hefei, Anhui, 230031 (China); Engineering Technology Research Center of Accurate Radiotherapy of Anhui Province, Hefei 230031 (China); Collaborative Innovation Center of Radiation Medicine of Jiangsu Higher Education Institutions, SuZhou (China)

    2015-06-15

    Purpose: Real-time motion tracking is a critical issue in image guided radiotherapy due to the time latency caused by image processing and system response. It is of great necessity to fast and accurately predict the future position of the respiratory motion and the tumor location. Methods: The prediction of respiratory position was done based on the positioning and tracking module in ARTS-IGRT system which was developed by FDS Team (www.fds.org.cn). An approach involving with the extreme learning machine (ELM) was adopted to predict the future respiratory position as well as the tumor’s location by training the past trajectories. For the training process, a feed-forward neural network with one single hidden layer was used for the learning. First, the number of hidden nodes was figured out for the single layered feed forward network (SLFN). Then the input weights and hidden layer biases of the SLFN were randomly assigned to calculate the hidden neuron output matrix. Finally, the predicted movement were obtained by applying the output weights and compared with the actual movement. Breathing movement acquired from the external infrared markers was used to test the prediction accuracy. And the implanted marker movement for the prostate cancer was used to test the implementation of the tumor motion prediction. Results: The accuracy of the predicted motion and the actual motion was tested. Five volunteers with different breathing patterns were tested. The average prediction time was 0.281s. And the standard deviation of prediction accuracy was 0.002 for the respiratory motion and 0.001 for the tumor motion. Conclusion: The extreme learning machine method can provide an accurate and fast prediction of the respiratory motion and the tumor location and therefore can meet the requirements of real-time tumor-tracking in image guided radiotherapy.

  2. PredPsych: A toolbox for predictive machine learning-based approach in experimental psychology research.

    Science.gov (United States)

    Koul, Atesh; Becchio, Cristina; Cavallo, Andrea

    2017-12-12

    Recent years have seen an increased interest in machine learning-based predictive methods for analyzing quantitative behavioral data in experimental psychology. While these methods can achieve relatively greater sensitivity compared to conventional univariate techniques, they still lack an established and accessible implementation. The aim of current work was to build an open-source R toolbox - "PredPsych" - that could make these methods readily available to all psychologists. PredPsych is a user-friendly, R toolbox based on machine-learning predictive algorithms. In this paper, we present the framework of PredPsych via the analysis of a recently published multiple-subject motion capture dataset. In addition, we discuss examples of possible research questions that can be addressed with the machine-learning algorithms implemented in PredPsych and cannot be easily addressed with univariate statistical analysis. We anticipate that PredPsych will be of use to researchers with limited programming experience not only in the field of psychology, but also in that of clinical neuroscience, enabling computational assessment of putative bio-behavioral markers for both prognosis and diagnosis.

  3. Semi-supervised prediction of gene regulatory networks using machine learning algorithms.

    Science.gov (United States)

    Patel, Nihir; Wang, Jason T L

    2015-10-01

    Use of computational methods to predict gene regulatory networks (GRNs) from gene expression data is a challenging task. Many studies have been conducted using unsupervised methods to fulfill the task; however, such methods usually yield low prediction accuracies due to the lack of training data. In this article, we propose semi-supervised methods for GRN prediction by utilizing two machine learning algorithms, namely, support vector machines (SVM) and random forests (RF). The semi-supervised methods make use of unlabelled data for training. We investigated inductive and transductive learning approaches, both of which adopt an iterative procedure to obtain reliable negative training data from the unlabelled data. We then applied our semi-supervised methods to gene expression data of Escherichia coli and Saccharomyces cerevisiae, and evaluated the performance of our methods using the expression data. Our analysis indicated that the transductive learning approach outperformed the inductive learning approach for both organisms. However, there was no conclusive difference identified in the performance of SVM and RF. Experimental results also showed that the proposed semi-supervised methods performed better than existing supervised methods for both organisms.

  4. Machine learning methods enable predictive modeling of antibody feature:function relationships in RV144 vaccinees.

    Science.gov (United States)

    Choi, Ickwon; Chung, Amy W; Suscovich, Todd J; Rerks-Ngarm, Supachai; Pitisuttithum, Punnee; Nitayaphan, Sorachai; Kaewkungwal, Jaranit; O'Connell, Robert J; Francis, Donald; Robb, Merlin L; Michael, Nelson L; Kim, Jerome H; Alter, Galit; Ackerman, Margaret E; Bailey-Kellogg, Chris

    2015-04-01

    The adaptive immune response to vaccination or infection can lead to the production of specific antibodies to neutralize the pathogen or recruit innate immune effector cells for help. The non-neutralizing role of antibodies in stimulating effector cell responses may have been a key mechanism of the protection observed in the RV144 HIV vaccine trial. In an extensive investigation of a rich set of data collected from RV144 vaccine recipients, we here employ machine learning methods to identify and model associations between antibody features (IgG subclass and antigen specificity) and effector function activities (antibody dependent cellular phagocytosis, cellular cytotoxicity, and cytokine release). We demonstrate via cross-validation that classification and regression approaches can effectively use the antibody features to robustly predict qualitative and quantitative functional outcomes. This integration of antibody feature and function data within a machine learning framework provides a new, objective approach to discovering and assessing multivariate immune correlates.

  5. Predicting the stability of ternary intermetallics with density functional theory and machine learning

    Science.gov (United States)

    Schmidt, Jonathan; Chen, Liming; Botti, Silvana; Marques, Miguel A. L.

    2018-06-01

    We use a combination of machine learning techniques and high-throughput density-functional theory calculations to explore ternary compounds with the AB2C2 composition. We chose the two most common intermetallic prototypes for this composition, namely, the tI10-CeAl2Ga2 and the tP10-FeMo2B2 structures. Our results suggest that there may be ˜10 times more stable compounds in these phases than previously known. These are mostly metallic and non-magnetic. While the use of machine learning reduces the overall calculation cost by around 75%, some limitations of its predictive power still exist, in particular, for compounds involving the second-row of the periodic table or magnetic elements.

  6. Machine learning methods enable predictive modeling of antibody feature:function relationships in RV144 vaccinees.

    Directory of Open Access Journals (Sweden)

    Ickwon Choi

    2015-04-01

    Full Text Available The adaptive immune response to vaccination or infection can lead to the production of specific antibodies to neutralize the pathogen or recruit innate immune effector cells for help. The non-neutralizing role of antibodies in stimulating effector cell responses may have been a key mechanism of the protection observed in the RV144 HIV vaccine trial. In an extensive investigation of a rich set of data collected from RV144 vaccine recipients, we here employ machine learning methods to identify and model associations between antibody features (IgG subclass and antigen specificity and effector function activities (antibody dependent cellular phagocytosis, cellular cytotoxicity, and cytokine release. We demonstrate via cross-validation that classification and regression approaches can effectively use the antibody features to robustly predict qualitative and quantitative functional outcomes. This integration of antibody feature and function data within a machine learning framework provides a new, objective approach to discovering and assessing multivariate immune correlates.

  7. Osteoporosis risk prediction for bone mineral density assessment of postmenopausal women using machine learning.

    Science.gov (United States)

    Yoo, Tae Keun; Kim, Sung Kean; Kim, Deok Won; Choi, Joon Yul; Lee, Wan Hyung; Oh, Ein; Park, Eun-Cheol

    2013-11-01

    A number of clinical decision tools for osteoporosis risk assessment have been developed to select postmenopausal women for the measurement of bone mineral density. We developed and validated machine learning models with the aim of more accurately identifying the risk of osteoporosis in postmenopausal women compared to the ability of conventional clinical decision tools. We collected medical records from Korean postmenopausal women based on the Korea National Health and Nutrition Examination Surveys. The training data set was used to construct models based on popular machine learning algorithms such as support vector machines (SVM), random forests, artificial neural networks (ANN), and logistic regression (LR) based on simple surveys. The machine learning models were compared to four conventional clinical decision tools: osteoporosis self-assessment tool (OST), osteoporosis risk assessment instrument (ORAI), simple calculated osteoporosis risk estimation (SCORE), and osteoporosis index of risk (OSIRIS). SVM had significantly better area under the curve (AUC) of the receiver operating characteristic than ANN, LR, OST, ORAI, SCORE, and OSIRIS for the training set. SVM predicted osteoporosis risk with an AUC of 0.827, accuracy of 76.7%, sensitivity of 77.8%, and specificity of 76.0% at total hip, femoral neck, or lumbar spine for the testing set. The significant factors selected by SVM were age, height, weight, body mass index, duration of menopause, duration of breast feeding, estrogen therapy, hyperlipidemia, hypertension, osteoarthritis, and diabetes mellitus. Considering various predictors associated with low bone density, the machine learning methods may be effective tools for identifying postmenopausal women at high risk for osteoporosis.

  8. Prediction of Air Pollutants Concentration Based on an Extreme Learning Machine: The Case of Hong Kong

    Directory of Open Access Journals (Sweden)

    Jiangshe Zhang

    2017-01-01

    Full Text Available With the development of the economy and society all over the world, most metropolitan cities are experiencing elevated concentrations of ground-level air pollutants. It is urgent to predict and evaluate the concentration of air pollutants for some local environmental or health agencies. Feed-forward artificial neural networks have been widely used in the prediction of air pollutants concentration. However, there are some drawbacks, such as the low convergence rate and the local minimum. The extreme learning machine for single hidden layer feed-forward neural networks tends to provide good generalization performance at an extremely fast learning speed. The major sources of air pollutants in Hong Kong are mobile, stationary, and from trans-boundary sources. We propose predicting the concentration of air pollutants by the use of trained extreme learning machines based on the data obtained from eight air quality parameters in two monitoring stations, including Sham Shui Po and Tap Mun in Hong Kong for six years. The experimental results show that our proposed algorithm performs better on the Hong Kong data both quantitatively and qualitatively. Particularly, our algorithm shows better predictive ability, with R 2 increased and root mean square error values decreased respectively.

  9. Prediction of Air Pollutants Concentration Based on an Extreme Learning Machine: The Case of Hong Kong.

    Science.gov (United States)

    Zhang, Jiangshe; Ding, Weifu

    2017-01-24

    With the development of the economy and society all over the world, most metropolitan cities are experiencing elevated concentrations of ground-level air pollutants. It is urgent to predict and evaluate the concentration of air pollutants for some local environmental or health agencies. Feed-forward artificial neural networks have been widely used in the prediction of air pollutants concentration. However, there are some drawbacks, such as the low convergence rate and the local minimum. The extreme learning machine for single hidden layer feed-forward neural networks tends to provide good generalization performance at an extremely fast learning speed. The major sources of air pollutants in Hong Kong are mobile, stationary, and from trans-boundary sources. We propose predicting the concentration of air pollutants by the use of trained extreme learning machines based on the data obtained from eight air quality parameters in two monitoring stations, including Sham Shui Po and Tap Mun in Hong Kong for six years. The experimental results show that our proposed algorithm performs better on the Hong Kong data both quantitatively and qualitatively. Particularly, our algorithm shows better predictive ability, with R 2 increased and root mean square error values decreased respectively.

  10. Identifying predictive features in drug response using machine learning: opportunities and challenges.

    Science.gov (United States)

    Vidyasagar, Mathukumalli

    2015-01-01

    This article reviews several techniques from machine learning that can be used to study the problem of identifying a small number of features, from among tens of thousands of measured features, that can accurately predict a drug response. Prediction problems are divided into two categories: sparse classification and sparse regression. In classification, the clinical parameter to be predicted is binary, whereas in regression, the parameter is a real number. Well-known methods for both classes of problems are briefly discussed. These include the SVM (support vector machine) for classification and various algorithms such as ridge regression, LASSO (least absolute shrinkage and selection operator), and EN (elastic net) for regression. In addition, several well-established methods that do not directly fall into machine learning theory are also reviewed, including neural networks, PAM (pattern analysis for microarrays), SAM (significance analysis for microarrays), GSEA (gene set enrichment analysis), and k-means clustering. Several references indicative of the application of these methods to cancer biology are discussed.

  11. Exploration of machine learning techniques in predicting multiple sclerosis disease course.

    Directory of Open Access Journals (Sweden)

    Yijun Zhao

    Full Text Available To explore the value of machine learning methods for predicting multiple sclerosis disease course.1693 CLIMB study patients were classified as increased EDSS≥1.5 (worsening or not (non-worsening at up to five years after baseline visit. Support vector machines (SVM were used to build the classifier, and compared to logistic regression (LR using demographic, clinical and MRI data obtained at years one and two to predict EDSS at five years follow-up.Baseline data alone provided little predictive value. Clinical observation for one year improved overall SVM sensitivity to 62% and specificity to 65% in predicting worsening cases. The addition of one year MRI data improved sensitivity to 71% and specificity to 68%. Use of non-uniform misclassification costs in the SVM model, weighting towards increased sensitivity, improved predictions (up to 86%. Sensitivity, specificity, and overall accuracy improved minimally with additional follow-up data. Predictions improved within specific groups defined by baseline EDSS. LR performed more poorly than SVM in most cases. Race, family history of MS, and brain parenchymal fraction, ranked highly as predictors of the non-worsening group. Brain T2 lesion volume ranked highly as predictive of the worsening group.SVM incorporating short-term clinical and brain MRI data, class imbalance corrective measures, and classification costs may be a promising means to predict MS disease course, and for selection of patients suitable for more aggressive treatment regimens.

  12. Machine-learning scoring functions to improve structure-based binding affinity prediction and virtual screening.

    Science.gov (United States)

    Ain, Qurrat Ul; Aleksandrova, Antoniya; Roessler, Florian D; Ballester, Pedro J

    2015-01-01

    Docking tools to predict whether and how a small molecule binds to a target can be applied if a structural model of such target is available. The reliability of docking depends, however, on the accuracy of the adopted scoring function (SF). Despite intense research over the years, improving the accuracy of SFs for structure-based binding affinity prediction or virtual screening has proven to be a challenging task for any class of method. New SFs based on modern machine-learning regression models, which do not impose a predetermined functional form and thus are able to exploit effectively much larger amounts of experimental data, have recently been introduced. These machine-learning SFs have been shown to outperform a wide range of classical SFs at both binding affinity prediction and virtual screening. The emerging picture from these studies is that the classical approach of using linear regression with a small number of expert-selected structural features can be strongly improved by a machine-learning approach based on nonlinear regression allied with comprehensive data-driven feature selection. Furthermore, the performance of classical SFs does not grow with larger training datasets and hence this performance gap is expected to widen as more training data becomes available in the future. Other topics covered in this review include predicting the reliability of a SF on a particular target class, generating synthetic data to improve predictive performance and modeling guidelines for SF development. WIREs Comput Mol Sci 2015, 5:405-424. doi: 10.1002/wcms.1225 For further resources related to this article, please visit the WIREs website.

  13. Prediction of diffuse solar irradiance using machine learning and multivariable regression

    International Nuclear Information System (INIS)

    Lou, Siwei; Li, Danny H.W.; Lam, Joseph C.; Chan, Wilco W.H.

    2016-01-01

    Highlights: • 54.9% of the annual global irradiance is composed by its diffuse part in HK. • Hourly diffuse irradiance was predicted by accessible variables. • The importance of variable in prediction was assessed by machine learning. • Simple prediction equations were developed with the knowledge of variable importance. - Abstract: The paper studies the horizontal global, direct-beam and sky-diffuse solar irradiance data measured in Hong Kong from 2008 to 2013. A machine learning algorithm was employed to predict the horizontal sky-diffuse irradiance and conduct sensitivity analysis for the meteorological variables. Apart from the clearness index (horizontal global/extra atmospheric solar irradiance), we found that predictors including solar altitude, air temperature, cloud cover and visibility are also important in predicting the diffuse component. The mean absolute error (MAE) of the logistic regression using the aforementioned predictors was less than 21.5 W/m"2 and 30 W/m"2 for Hong Kong and Denver, USA, respectively. With the systematic recording of the five variables for more than 35 years, the proposed model would be appropriate to estimate of long-term diffuse solar radiation, study climate change and develope typical meteorological year in Hong Kong and places with similar climates.

  14. Open source machine-learning algorithms for the prediction of optimal cancer drug therapies.

    Science.gov (United States)

    Huang, Cai; Mezencev, Roman; McDonald, John F; Vannberg, Fredrik

    2017-01-01

    Precision medicine is a rapidly growing area of modern medical science and open source machine-learning codes promise to be a critical component for the successful development of standardized and automated analysis of patient data. One important goal of precision cancer medicine is the accurate prediction of optimal drug therapies from the genomic profiles of individual patient tumors. We introduce here an open source software platform that employs a highly versatile support vector machine (SVM) algorithm combined with a standard recursive feature elimination (RFE) approach to predict personalized drug responses from gene expression profiles. Drug specific models were built using gene expression and drug response data from the National Cancer Institute panel of 60 human cancer cell lines (NCI-60). The models are highly accurate in predicting the drug responsiveness of a variety of cancer cell lines including those comprising the recent NCI-DREAM Challenge. We demonstrate that predictive accuracy is optimized when the learning dataset utilizes all probe-set expression values from a diversity of cancer cell types without pre-filtering for genes generally considered to be "drivers" of cancer onset/progression. Application of our models to publically available ovarian cancer (OC) patient gene expression datasets generated predictions consistent with observed responses previously reported in the literature. By making our algorithm "open source", we hope to facilitate its testing in a variety of cancer types and contexts leading to community-driven improvements and refinements in subsequent applications.

  15. Open source machine-learning algorithms for the prediction of optimal cancer drug therapies.

    Directory of Open Access Journals (Sweden)

    Cai Huang

    Full Text Available Precision medicine is a rapidly growing area of modern medical science and open source machine-learning codes promise to be a critical component for the successful development of standardized and automated analysis of patient data. One important goal of precision cancer medicine is the accurate prediction of optimal drug therapies from the genomic profiles of individual patient tumors. We introduce here an open source software platform that employs a highly versatile support vector machine (SVM algorithm combined with a standard recursive feature elimination (RFE approach to predict personalized drug responses from gene expression profiles. Drug specific models were built using gene expression and drug response data from the National Cancer Institute panel of 60 human cancer cell lines (NCI-60. The models are highly accurate in predicting the drug responsiveness of a variety of cancer cell lines including those comprising the recent NCI-DREAM Challenge. We demonstrate that predictive accuracy is optimized when the learning dataset utilizes all probe-set expression values from a diversity of cancer cell types without pre-filtering for genes generally considered to be "drivers" of cancer onset/progression. Application of our models to publically available ovarian cancer (OC patient gene expression datasets generated predictions consistent with observed responses previously reported in the literature. By making our algorithm "open source", we hope to facilitate its testing in a variety of cancer types and contexts leading to community-driven improvements and refinements in subsequent applications.

  16. Comparison of four statistical and machine learning methods for crash severity prediction.

    Science.gov (United States)

    Iranitalab, Amirfarrokh; Khattak, Aemal

    2017-11-01

    Crash severity prediction models enable different agencies to predict the severity of a reported crash with unknown severity or the severity of crashes that may be expected to occur sometime in the future. This paper had three main objectives: comparison of the performance of four statistical and machine learning methods including Multinomial Logit (MNL), Nearest Neighbor Classification (NNC), Support Vector Machines (SVM) and Random Forests (RF), in predicting traffic crash severity; developing a crash costs-based approach for comparison of crash severity prediction methods; and investigating the effects of data clustering methods comprising K-means Clustering (KC) and Latent Class Clustering (LCC), on the performance of crash severity prediction models. The 2012-2015 reported crash data from Nebraska, United States was obtained and two-vehicle crashes were extracted as the analysis data. The dataset was split into training/estimation (2012-2014) and validation (2015) subsets. The four prediction methods were trained/estimated using the training/estimation dataset and the correct prediction rates for each crash severity level, overall correct prediction rate and a proposed crash costs-based accuracy measure were obtained for the validation dataset. The correct prediction rates and the proposed approach showed NNC had the best prediction performance in overall and in more severe crashes. RF and SVM had the next two sufficient performances and MNL was the weakest method. Data clustering did not affect the prediction results of SVM, but KC improved the prediction performance of MNL, NNC and RF, while LCC caused improvement in MNL and RF but weakened the performance of NNC. Overall correct prediction rate had almost the exact opposite results compared to the proposed approach, showing that neglecting the crash costs can lead to misjudgment in choosing the right prediction method. Copyright © 2017 Elsevier Ltd. All rights reserved.

  17. Evaluation of machine learning algorithms for prediction of regions of high Reynolds averaged Navier Stokes uncertainty

    Science.gov (United States)

    Ling, J.; Templeton, J.

    2015-08-01

    Reynolds Averaged Navier Stokes (RANS) models are widely used in industry to predict fluid flows, despite their acknowledged deficiencies. Not only do RANS models often produce inaccurate flow predictions, but there are very limited diagnostics available to assess RANS accuracy for a given flow configuration. If experimental or higher fidelity simulation results are not available for RANS validation, there is no reliable method to evaluate RANS accuracy. This paper explores the potential of utilizing machine learning algorithms to identify regions of high RANS uncertainty. Three different machine learning algorithms were evaluated: support vector machines, Adaboost decision trees, and random forests. The algorithms were trained on a database of canonical flow configurations for which validated direct numerical simulation or large eddy simulation results were available, and were used to classify RANS results on a point-by-point basis as having either high or low uncertainty, based on the breakdown of specific RANS modeling assumptions. Classifiers were developed for three different basic RANS eddy viscosity model assumptions: the isotropy of the eddy viscosity, the linearity of the Boussinesq hypothesis, and the non-negativity of the eddy viscosity. It is shown that these classifiers are able to generalize to flows substantially different from those on which they were trained. Feature selection techniques, model evaluation, and extrapolation detection are discussed in the context of turbulence modeling applications.

  18. The validation and assessment of machine learning: a game of prediction from high-dimensional data.

    Directory of Open Access Journals (Sweden)

    Tune H Pers

    Full Text Available In applied statistics, tools from machine learning are popular for analyzing complex and high-dimensional data. However, few theoretical results are available that could guide to the appropriate machine learning tool in a new application. Initial development of an overall strategy thus often implies that multiple methods are tested and compared on the same set of data. This is particularly difficult in situations that are prone to over-fitting where the number of subjects is low compared to the number of potential predictors. The article presents a game which provides some grounds for conducting a fair model comparison. Each player selects a modeling strategy for predicting individual response from potential predictors. A strictly proper scoring rule, bootstrap cross-validation, and a set of rules are used to make the results obtained with different strategies comparable. To illustrate the ideas, the game is applied to data from the Nugenob Study where the aim is to predict the fat oxidation capacity based on conventional factors and high-dimensional metabolomics data. Three players have chosen to use support vector machines, LASSO, and random forests, respectively.

  19. Predictive Power of Machine Learning for Optimizing Solar Water Heater Performance: The Potential Application of High-Throughput Screening

    Directory of Open Access Journals (Sweden)

    Hao Li

    2017-01-01

    Full Text Available Predicting the performance of solar water heater (SWH is challenging due to the complexity of the system. Fortunately, knowledge-based machine learning can provide a fast and precise prediction method for SWH performance. With the predictive power of machine learning models, we can further solve a more challenging question: how to cost-effectively design a high-performance SWH? Here, we summarize our recent studies and propose a general framework of SWH design using a machine learning-based high-throughput screening (HTS method. Design of water-in-glass evacuated tube solar water heater (WGET-SWH is selected as a case study to show the potential application of machine learning-based HTS to the design and optimization of solar energy systems.

  20. Machine Learning Approaches for Predicting Radiation Therapy Outcomes: A Clinician's Perspective.

    Science.gov (United States)

    Kang, John; Schwartz, Russell; Flickinger, John; Beriwal, Sushil

    2015-12-01

    Radiation oncology has always been deeply rooted in modeling, from the early days of isoeffect curves to the contemporary Quantitative Analysis of Normal Tissue Effects in the Clinic (QUANTEC) initiative. In recent years, medical modeling for both prognostic and therapeutic purposes has exploded thanks to increasing availability of electronic data and genomics. One promising direction that medical modeling is moving toward is adopting the same machine learning methods used by companies such as Google and Facebook to combat disease. Broadly defined, machine learning is a branch of computer science that deals with making predictions from complex data through statistical models. These methods serve to uncover patterns in data and are actively used in areas such as speech recognition, handwriting recognition, face recognition, "spam" filtering (junk email), and targeted advertising. Although multiple radiation oncology research groups have shown the value of applied machine learning (ML), clinical adoption has been slow due to the high barrier to understanding these complex models by clinicians. Here, we present a review of the use of ML to predict radiation therapy outcomes from the clinician's point of view with the hope that it lowers the "barrier to entry" for those without formal training in ML. We begin by describing 7 principles that one should consider when evaluating (or creating) an ML model in radiation oncology. We next introduce 3 popular ML methods--logistic regression (LR), support vector machine (SVM), and artificial neural network (ANN)--and critique 3 seminal papers in the context of these principles. Although current studies are in exploratory stages, the overall methodology has progressively matured, and the field is ready for larger-scale further investigation. Copyright © 2015 Elsevier Inc. All rights reserved.

  1. Machine Learning Approaches for Predicting Radiation Therapy Outcomes: A Clinician's Perspective

    International Nuclear Information System (INIS)

    Kang, John; Schwartz, Russell; Flickinger, John; Beriwal, Sushil

    2015-01-01

    Radiation oncology has always been deeply rooted in modeling, from the early days of isoeffect curves to the contemporary Quantitative Analysis of Normal Tissue Effects in the Clinic (QUANTEC) initiative. In recent years, medical modeling for both prognostic and therapeutic purposes has exploded thanks to increasing availability of electronic data and genomics. One promising direction that medical modeling is moving toward is adopting the same machine learning methods used by companies such as Google and Facebook to combat disease. Broadly defined, machine learning is a branch of computer science that deals with making predictions from complex data through statistical models. These methods serve to uncover patterns in data and are actively used in areas such as speech recognition, handwriting recognition, face recognition, “spam” filtering (junk email), and targeted advertising. Although multiple radiation oncology research groups have shown the value of applied machine learning (ML), clinical adoption has been slow due to the high barrier to understanding these complex models by clinicians. Here, we present a review of the use of ML to predict radiation therapy outcomes from the clinician's point of view with the hope that it lowers the “barrier to entry” for those without formal training in ML. We begin by describing 7 principles that one should consider when evaluating (or creating) an ML model in radiation oncology. We next introduce 3 popular ML methods—logistic regression (LR), support vector machine (SVM), and artificial neural network (ANN)—and critique 3 seminal papers in the context of these principles. Although current studies are in exploratory stages, the overall methodology has progressively matured, and the field is ready for larger-scale further investigation.

  2. Machine Learning Approaches for Predicting Radiation Therapy Outcomes: A Clinician's Perspective

    Energy Technology Data Exchange (ETDEWEB)

    Kang, John [Medical Scientist Training Program, University of Pittsburgh-Carnegie Mellon University, Pittsburgh, Pennsylvania (United States); Schwartz, Russell [Department of Biological Sciences, Carnegie Mellon University, Pittsburgh, Pennsylvania (United States); Flickinger, John [Departments of Radiation Oncology and Neurological Surgery, University of Pittsburgh Medical Center, Pittsburgh, Pennsylvania (United States); Beriwal, Sushil, E-mail: beriwals@upmc.edu [Department of Radiation Oncology, University of Pittsburgh Medical Center, Pittsburgh, Pennsylvania (United States)

    2015-12-01

    Radiation oncology has always been deeply rooted in modeling, from the early days of isoeffect curves to the contemporary Quantitative Analysis of Normal Tissue Effects in the Clinic (QUANTEC) initiative. In recent years, medical modeling for both prognostic and therapeutic purposes has exploded thanks to increasing availability of electronic data and genomics. One promising direction that medical modeling is moving toward is adopting the same machine learning methods used by companies such as Google and Facebook to combat disease. Broadly defined, machine learning is a branch of computer science that deals with making predictions from complex data through statistical models. These methods serve to uncover patterns in data and are actively used in areas such as speech recognition, handwriting recognition, face recognition, “spam” filtering (junk email), and targeted advertising. Although multiple radiation oncology research groups have shown the value of applied machine learning (ML), clinical adoption has been slow due to the high barrier to understanding these complex models by clinicians. Here, we present a review of the use of ML to predict radiation therapy outcomes from the clinician's point of view with the hope that it lowers the “barrier to entry” for those without formal training in ML. We begin by describing 7 principles that one should consider when evaluating (or creating) an ML model in radiation oncology. We next introduce 3 popular ML methods—logistic regression (LR), support vector machine (SVM), and artificial neural network (ANN)—and critique 3 seminal papers in the context of these principles. Although current studies are in exploratory stages, the overall methodology has progressively matured, and the field is ready for larger-scale further investigation.

  3. Plant MicroRNA Prediction by Supervised Machine Learning Using C5.0 Decision Trees

    Directory of Open Access Journals (Sweden)

    Philip H. Williams

    2012-01-01

    Full Text Available MicroRNAs (miRNAs are nonprotein coding RNAs between 20 and 22 nucleotides long that attenuate protein production. Different types of sequence data are being investigated for novel miRNAs, including genomic and transcriptomic sequences. A variety of machine learning methods have successfully predicted miRNA precursors, mature miRNAs, and other nonprotein coding sequences. MirTools, mirDeep2, and miRanalyzer require “read count” to be included with the input sequences, which restricts their use to deep-sequencing data. Our aim was to train a predictor using a cross-section of different species to accurately predict miRNAs outside the training set. We wanted a system that did not require read-count for prediction and could therefore be applied to short sequences extracted from genomic, EST, or RNA-seq sources. A miRNA-predictive decision-tree model has been developed by supervised machine learning. It only requires that the corresponding genome or transcriptome is available within a sequence window that includes the precursor candidate so that the required sequence features can be collected. Some of the most critical features for training the predictor are the miRNA:miRNA∗ duplex energy and the number of mismatches in the duplex. We present a cross-species plant miRNA predictor with 84.08% sensitivity and 98.53% specificity based on rigorous testing by leave-one-out validation.

  4. Plant MicroRNA Prediction by Supervised Machine Learning Using C5.0 Decision Trees.

    Science.gov (United States)

    Williams, Philip H; Eyles, Rod; Weiller, Georg

    2012-01-01

    MicroRNAs (miRNAs) are nonprotein coding RNAs between 20 and 22 nucleotides long that attenuate protein production. Different types of sequence data are being investigated for novel miRNAs, including genomic and transcriptomic sequences. A variety of machine learning methods have successfully predicted miRNA precursors, mature miRNAs, and other nonprotein coding sequences. MirTools, mirDeep2, and miRanalyzer require "read count" to be included with the input sequences, which restricts their use to deep-sequencing data. Our aim was to train a predictor using a cross-section of different species to accurately predict miRNAs outside the training set. We wanted a system that did not require read-count for prediction and could therefore be applied to short sequences extracted from genomic, EST, or RNA-seq sources. A miRNA-predictive decision-tree model has been developed by supervised machine learning. It only requires that the corresponding genome or transcriptome is available within a sequence window that includes the precursor candidate so that the required sequence features can be collected. Some of the most critical features for training the predictor are the miRNA:miRNA(∗) duplex energy and the number of mismatches in the duplex. We present a cross-species plant miRNA predictor with 84.08% sensitivity and 98.53% specificity based on rigorous testing by leave-one-out validation.

  5. Heart Failure: Diagnosis, Severity Estimation and Prediction of Adverse Events Through Machine Learning Techniques

    Directory of Open Access Journals (Sweden)

    Evanthia E. Tripoliti

    Full Text Available Heart failure is a serious condition with high prevalence (about 2% in the adult population in developed countries, and more than 8% in patients older than 75 years. About 3–5% of hospital admissions are linked with heart failure incidents. Heart failure is the first cause of admission by healthcare professionals in their clinical practice. The costs are very high, reaching up to 2% of the total health costs in the developed countries. Building an effective disease management strategy requires analysis of large amount of data, early detection of the disease, assessment of the severity and early prediction of adverse events. This will inhibit the progression of the disease, will improve the quality of life of the patients and will reduce the associated medical costs. Toward this direction machine learning techniques have been employed. The aim of this paper is to present the state-of-the-art of the machine learning methodologies applied for the assessment of heart failure. More specifically, models predicting the presence, estimating the subtype, assessing the severity of heart failure and predicting the presence of adverse events, such as destabilizations, re-hospitalizations, and mortality are presented. According to the authors' knowledge, it is the first time that such a comprehensive review, focusing on all aspects of the management of heart failure, is presented. Keywords: Heart failure, Diagnosis, Prediction, Severity estimation, Classification, Data mining

  6. Prediction of outcome in internet-delivered cognitive behaviour therapy for paediatric obsessive-compulsive disorder: A machine learning approach.

    Science.gov (United States)

    Lenhard, Fabian; Sauer, Sebastian; Andersson, Erik; Månsson, Kristoffer Nt; Mataix-Cols, David; Rück, Christian; Serlachius, Eva

    2018-03-01

    There are no consistent predictors of treatment outcome in paediatric obsessive-compulsive disorder (OCD). One reason for this might be the use of suboptimal statistical methodology. Machine learning is an approach to efficiently analyse complex data. Machine learning has been widely used within other fields, but has rarely been tested in the prediction of paediatric mental health treatment outcomes. To test four different machine learning methods in the prediction of treatment response in a sample of paediatric OCD patients who had received Internet-delivered cognitive behaviour therapy (ICBT). Participants were 61 adolescents (12-17 years) who enrolled in a randomized controlled trial and received ICBT. All clinical baseline variables were used to predict strictly defined treatment response status three months after ICBT. Four machine learning algorithms were implemented. For comparison, we also employed a traditional logistic regression approach. Multivariate logistic regression could not detect any significant predictors. In contrast, all four machine learning algorithms performed well in the prediction of treatment response, with 75 to 83% accuracy. The results suggest that machine learning algorithms can successfully be applied to predict paediatric OCD treatment outcome. Validation studies and studies in other disorders are warranted. Copyright © 2017 John Wiley & Sons, Ltd.

  7. Predicting the Geothermal Heat Flux in Greenland: A Machine Learning Approach

    Science.gov (United States)

    Rezvanbehbahani, Soroush; Stearns, Leigh A.; Kadivar, Amir; Walker, J. Doug; van der Veen, C. J.

    2017-12-01

    Geothermal heat flux (GHF) is a crucial boundary condition for making accurate predictions of ice sheet mass loss, yet it is poorly known in Greenland due to inaccessibility of the bedrock. Here we use a machine learning algorithm on a large collection of relevant geologic features and global GHF measurements and produce a GHF map of Greenland that we argue is within ˜15% accuracy. The main features of our predicted GHF map include a large region with high GHF in central-north Greenland surrounding the NorthGRIP ice core site, and hot spots in the Jakobshavn Isbræ catchment, upstream of Petermann Gletscher, and near the terminus of Nioghalvfjerdsfjorden glacier. Our model also captures the trajectory of Greenland movement over the Icelandic plume by predicting a stripe of elevated GHF in central-east Greenland. Finally, we show that our model can produce substantially more accurate predictions if additional measurements of GHF in Greenland are provided.

  8. MU-LOC: A Machine-Learning Method for Predicting Mitochondrially Localized Proteins in Plants

    Directory of Open Access Journals (Sweden)

    Ning Zhang

    2018-05-01

    Full Text Available Targeting and translocation of proteins to the appropriate subcellular compartments are crucial for cell organization and function. Newly synthesized proteins are transported to mitochondria with the assistance of complex targeting sequences containing either an N-terminal pre-sequence or a multitude of internal signals. Compared with experimental approaches, computational predictions provide an efficient way to infer subcellular localization of a protein. However, it is still challenging to predict plant mitochondrially localized proteins accurately due to various limitations. Consequently, the performance of current tools can be improved with new data and new machine-learning methods. We present MU-LOC, a novel computational approach for large-scale prediction of plant mitochondrial proteins. We collected a comprehensive dataset of plant subcellular localization, extracted features including amino acid composition, protein position weight matrix, and gene co-expression information, and trained predictors using deep neural network and support vector machine. Benchmarked on two independent datasets, MU-LOC achieved substantial improvements over six state-of-the-art tools for plant mitochondrial targeting prediction. In addition, MU-LOC has the advantage of predicting plant mitochondrial proteins either possessing or lacking N-terminal pre-sequences. We applied MU-LOC to predict candidate mitochondrial proteins for the whole proteome of Arabidopsis and potato. MU-LOC is publicly available at http://mu-loc.org.

  9. Application of Machine Learning to Predict Dietary Lapses During Weight Loss.

    Science.gov (United States)

    Goldstein, Stephanie P; Zhang, Fengqing; Thomas, John G; Butryn, Meghan L; Herbert, James D; Forman, Evan M

    2018-05-01

    Individuals who adhere to dietary guidelines provided during weight loss interventions tend to be more successful with weight control. Any deviation from dietary guidelines can be referred to as a "lapse." There is a growing body of research showing that lapses are predictable using a variety of physiological, environmental, and psychological indicators. With recent technological advancements, it may be possible to assess these triggers and predict dietary lapses in real time. The current study sought to use machine learning techniques to predict lapses and evaluate the utility of combining both group- and individual-level data to enhance lapse prediction. The current study trained and tested a machine learning algorithm capable of predicting dietary lapses from a behavioral weight loss program among adults with overweight/obesity (n = 12). Participants were asked to follow a weight control diet for 6 weeks and complete ecological momentary assessment (EMA; repeated brief surveys delivered via smartphone) regarding dietary lapses and relevant triggers. WEKA decision trees were used to predict lapses with an accuracy of 0.72 for the group of participants. However, generalization of the group algorithm to each individual was poor, and as such, group- and individual-level data were combined to improve prediction. The findings suggest that 4 weeks of individual data collection is recommended to attain optimal model performance. The predictive algorithm could be utilized to provide in-the-moment interventions to prevent dietary lapses and therefore enhance weight losses. Furthermore, methods in the current study could be translated to other types of health behavior lapses.

  10. Machine learning for predicting soil classes in three semi-arid landscapes

    Science.gov (United States)

    Brungard, Colby W.; Boettinger, Janis L.; Duniway, Michael C.; Wills, Skye A.; Edwards, Thomas C.

    2015-01-01

    Mapping the spatial distribution of soil taxonomic classes is important for informing soil use and management decisions. Digital soil mapping (DSM) can quantitatively predict the spatial distribution of soil taxonomic classes. Key components of DSM are the method and the set of environmental covariates used to predict soil classes. Machine learning is a general term for a broad set of statistical modeling techniques. Many different machine learning models have been applied in the literature and there are different approaches for selecting covariates for DSM. However, there is little guidance as to which, if any, machine learning model and covariate set might be optimal for predicting soil classes across different landscapes. Our objective was to compare multiple machine learning models and covariate sets for predicting soil taxonomic classes at three geographically distinct areas in the semi-arid western United States of America (southern New Mexico, southwestern Utah, and northeastern Wyoming). All three areas were the focus of digital soil mapping studies. Sampling sites at each study area were selected using conditioned Latin hypercube sampling (cLHS). We compared models that had been used in other DSM studies, including clustering algorithms, discriminant analysis, multinomial logistic regression, neural networks, tree based methods, and support vector machine classifiers. Tested machine learning models were divided into three groups based on model complexity: simple, moderate, and complex. We also compared environmental covariates derived from digital elevation models and Landsat imagery that were divided into three different sets: 1) covariates selected a priori by soil scientists familiar with each area and used as input into cLHS, 2) the covariates in set 1 plus 113 additional covariates, and 3) covariates selected using recursive feature elimination. Overall, complex models were consistently more accurate than simple or moderately complex models. Random

  11. Predicting Pre-planting Risk of Stagonospora nodorum blotch in Winter Wheat Using Machine Learning Models

    Directory of Open Access Journals (Sweden)

    Lucky eMehra

    2016-03-01

    Full Text Available Pre-planting factors have been associated with the late-season severity of Stagonospora nodorum blotch (SNB, caused by the fungal pathogen Parastagonospora nodorum, in winter wheat (Triticum aestivum. The relative importance of these factors in the risk of SNB has not been determined and this knowledge can facilitate disease management decisions prior to planting of the wheat crop. In this study, we examined the performance of multiple regression (MR and three machine learning algorithms namely artificial neural networks, categorical and regression trees, and random forests (RF in predicting the pre-planting risk of SNB in wheat. Pre-planting factors tested as potential predictor variables were cultivar resistance, latitude, longitude, previous crop, seeding rate, seed treatment, tillage type, and wheat residue. Disease severity assessed at the end of the growing season was used as the response variable. The models were developed using 431 disease cases (unique combinations of predictors collected from 2012 to 2014 and these cases were randomly divided into training, validation, and test datasets. Models were evaluated based on the regression of observed against predicted severity values of SNB, sensitivity-specificity ROC analysis, and the Kappa statistic. A strong relationship was observed between late-season severity of SNB and specific pre-planting factors in which latitude, longitude, wheat residue, and cultivar resistance were the most important predictors. The MR model explained 33% of variability in the data, while machine learning models explained 47 to 79% of the total variability. Similarly, the MR model correctly classified 74% of the disease cases, while machine learning models correctly classified 81 to 83% of these cases. Results show that the RF algorithm, which explained 79% of the variability within the data, was the most accurate in predicting the risk of SNB, with an accuracy rate of 93%. The RF algorithm could allow early

  12. Predicting Pre-planting Risk of Stagonospora nodorum blotch in Winter Wheat Using Machine Learning Models.

    Science.gov (United States)

    Mehra, Lucky K; Cowger, Christina; Gross, Kevin; Ojiambo, Peter S

    2016-01-01

    Pre-planting factors have been associated with the late-season severity of Stagonospora nodorum blotch (SNB), caused by the fungal pathogen Parastagonospora nodorum, in winter wheat (Triticum aestivum). The relative importance of these factors in the risk of SNB has not been determined and this knowledge can facilitate disease management decisions prior to planting of the wheat crop. In this study, we examined the performance of multiple regression (MR) and three machine learning algorithms namely artificial neural networks, categorical and regression trees, and random forests (RF), in predicting the pre-planting risk of SNB in wheat. Pre-planting factors tested as potential predictor variables were cultivar resistance, latitude, longitude, previous crop, seeding rate, seed treatment, tillage type, and wheat residue. Disease severity assessed at the end of the growing season was used as the response variable. The models were developed using 431 disease cases (unique combinations of predictors) collected from 2012 to 2014 and these cases were randomly divided into training, validation, and test datasets. Models were evaluated based on the regression of observed against predicted severity values of SNB, sensitivity-specificity ROC analysis, and the Kappa statistic. A strong relationship was observed between late-season severity of SNB and specific pre-planting factors in which latitude, longitude, wheat residue, and cultivar resistance were the most important predictors. The MR model explained 33% of variability in the data, while machine learning models explained 47 to 79% of the total variability. Similarly, the MR model correctly classified 74% of the disease cases, while machine learning models correctly classified 81 to 83% of these cases. Results show that the RF algorithm, which explained 79% of the variability within the data, was the most accurate in predicting the risk of SNB, with an accuracy rate of 93%. The RF algorithm could allow early assessment of

  13. Statistical and Machine-Learning Data Mining Techniques for Better Predictive Modeling and Analysis of Big Data

    CERN Document Server

    Ratner, Bruce

    2011-01-01

    The second edition of a bestseller, Statistical and Machine-Learning Data Mining: Techniques for Better Predictive Modeling and Analysis of Big Data is still the only book, to date, to distinguish between statistical data mining and machine-learning data mining. The first edition, titled Statistical Modeling and Analysis for Database Marketing: Effective Techniques for Mining Big Data, contained 17 chapters of innovative and practical statistical data mining techniques. In this second edition, renamed to reflect the increased coverage of machine-learning data mining techniques, the author has

  14. Development of Predictive QSAR Models of 4-Thiazolidinones Antitrypanosomal Activity using Modern Machine Learning Algorithms.

    Science.gov (United States)

    Kryshchyshyn, Anna; Devinyak, Oleg; Kaminskyy, Danylo; Grellier, Philippe; Lesyk, Roman

    2017-11-14

    This paper presents novel QSAR models for the prediction of antitrypanosomal activity among thiazolidines and related heterocycles. The performance of four machine learning algorithms: Random Forest regression, Stochastic gradient boosting, Multivariate adaptive regression splines and Gaussian processes regression have been studied in order to reach better levels of predictivity. The results for Random Forest and Gaussian processes regression are comparable and outperform other studied methods. The preliminary descriptor selection with Boruta method improved the outcome of machine learning methods. The two novel QSAR-models developed with Random Forest and Gaussian processes regression algorithms have good predictive ability, which was proved by the external evaluation of the test set with corresponding Q 2 ext =0.812 and Q 2 ext =0.830. The obtained models can be used further for in silico screening of virtual libraries in the same chemical domain in order to find new antitrypanosomal agents. Thorough analysis of descriptors influence in the QSAR models and interpretation of their chemical meaning allows to highlight a number of structure-activity relationships. The presence of phenyl rings with electron-withdrawing atoms or groups in para-position, increased number of aromatic rings, high branching but short chains, high HOMO energy, and the introduction of 1-substituted 2-indolyl fragment into the molecular structure have been recognized as trypanocidal activity prerequisites. © 2017 Wiley-VCH Verlag GmbH & Co. KGaA, Weinheim.

  15. Pol II promoter prediction using characteristic 4-mer motifs: a machine learning approach

    Directory of Open Access Journals (Sweden)

    Shoyaib Mohammad

    2008-10-01

    Full Text Available Abstract Background Eukaryotic promoter prediction using computational analysis techniques is one of the most difficult jobs in computational genomics that is essential for constructing and understanding genetic regulatory networks. The increased availability of sequence data for various eukaryotic organisms in recent years has necessitated for better tools and techniques for the prediction and analysis of promoters in eukaryotic sequences. Many promoter prediction methods and tools have been developed to date but they have yet to provide acceptable predictive performance. One obvious criteria to improve on current methods is to devise a better system for selecting appropriate features of promoters that distinguish them from non-promoters. Secondly improved performance can be achieved by enhancing the predictive ability of the machine learning algorithms used. Results In this paper, a novel approach is presented in which 128 4-mer motifs in conjunction with a non-linear machine-learning algorithm utilising a Support Vector Machine (SVM are used to distinguish between promoter and non-promoter DNA sequences. By applying this approach to plant, Drosophila, human, mouse and rat sequences, the classification model has showed 7-fold cross-validation percentage accuracies of 83.81%, 94.82%, 91.25%, 90.77% and 82.35% respectively. The high sensitivity and specificity value of 0.86 and 0.90 for plant; 0.96 and 0.92 for Drosophila; 0.88 and 0.92 for human; 0.78 and 0.84 for mouse and 0.82 and 0.80 for rat demonstrate that this technique is less prone to false positive results and exhibits better performance than many other tools. Moreover, this model successfully identifies location of promoter using TATA weight matrix. Conclusion The high sensitivity and specificity indicate that 4-mer frequencies in conjunction with supervised machine-learning methods can be beneficial in the identification of RNA pol II promoters comparative to other methods. This

  16. Structure Based Thermostability Prediction Models for Protein Single Point Mutations with Machine Learning Tools.

    Directory of Open Access Journals (Sweden)

    Lei Jia

    Full Text Available Thermostability issue of protein point mutations is a common occurrence in protein engineering. An application which predicts the thermostability of mutants can be helpful for guiding decision making process in protein design via mutagenesis. An in silico point mutation scanning method is frequently used to find "hot spots" in proteins for focused mutagenesis. ProTherm (http://gibk26.bio.kyutech.ac.jp/jouhou/Protherm/protherm.html is a public database that consists of thousands of protein mutants' experimentally measured thermostability. Two data sets based on two differently measured thermostability properties of protein single point mutations, namely the unfolding free energy change (ddG and melting temperature change (dTm were obtained from this database. Folding free energy change calculation from Rosetta, structural information of the point mutations as well as amino acid physical properties were obtained for building thermostability prediction models with informatics modeling tools. Five supervised machine learning methods (support vector machine, random forests, artificial neural network, naïve Bayes classifier, K nearest neighbor and partial least squares regression are used for building the prediction models. Binary and ternary classifications as well as regression models were built and evaluated. Data set redundancy and balancing, the reverse mutations technique, feature selection, and comparison to other published methods were discussed. Rosetta calculated folding free energy change ranked as the most influential features in all prediction models. Other descriptors also made significant contributions to increasing the accuracy of the prediction models.

  17. The Prediction of Yarn Elongation of Kenyan Ring-Spun Yarn using Extreme Learning Machines (ELM

    Directory of Open Access Journals (Sweden)

    Josphat Igadwa Mwasiagi

    2017-03-01

    Full Text Available The optimization of the manufacture of cotton yarns involves several processes, while the prediction of yarn quality parameters forms an important area of investigation. This research work concentrated on the prediction of cotton yarn elongation. Cotton lint and yarn samples were collected in textile factories in Kenya.The collected samples were tested under standard testing conditions. Cotton lint parameters, machine parameters and yarn elongation were used to design yarn elongation prediction models. The elongation prediction models used three network training algorithms, including backpropagation (BP, an extreme learning machine (ELM, and a hybrid of differential evolution (DE and an ELM referred to as DE-ELM. The prediction models recorded a mean squared error (mse value of 0.001 using 11, 43 and 2 neurons in the hidden layer for the BP, ELM and DE-ELM models respectively. The ELM models exhibited faster training speeds than the BP algorithms, but required more neurons in the hidden layer than other models. The DEELM hybrid algorithm was faster than the BP algorithm, but slower than the ELM algorithm.

  18. Machine learning and statistical methods for the prediction of maximal oxygen uptake: recent advances.

    Science.gov (United States)

    Abut, Fatih; Akay, Mehmet Fatih

    2015-01-01

    Maximal oxygen uptake (VO2max) indicates how many milliliters of oxygen the body can consume in a state of intense exercise per minute. VO2max plays an important role in both sport and medical sciences for different purposes, such as indicating the endurance capacity of athletes or serving as a metric in estimating the disease risk of a person. In general, the direct measurement of VO2max provides the most accurate assessment of aerobic power. However, despite a high level of accuracy, practical limitations associated with the direct measurement of VO2max, such as the requirement of expensive and sophisticated laboratory equipment or trained staff, have led to the development of various regression models for predicting VO2max. Consequently, a lot of studies have been conducted in the last years to predict VO2max of various target audiences, ranging from soccer athletes, nonexpert swimmers, cross-country skiers to healthy-fit adults, teenagers, and children. Numerous prediction models have been developed using different sets of predictor variables and a variety of machine learning and statistical methods, including support vector machine, multilayer perceptron, general regression neural network, and multiple linear regression. The purpose of this study is to give a detailed overview about the data-driven modeling studies for the prediction of VO2max conducted in recent years and to compare the performance of various VO2max prediction models reported in related literature in terms of two well-known metrics, namely, multiple correlation coefficient (R) and standard error of estimate. The survey results reveal that with respect to regression methods used to develop prediction models, support vector machine, in general, shows better performance than other methods, whereas multiple linear regression exhibits the worst performance.

  19. Reversible Watermarking Using Prediction-Error Expansion and Extreme Learning Machine

    Directory of Open Access Journals (Sweden)

    Guangyong Gao

    2015-01-01

    Full Text Available Currently, the research for reversible watermarking focuses on the decreasing of image distortion. Aiming at this issue, this paper presents an improvement method to lower the embedding distortion based on the prediction-error expansion (PE technique. Firstly, the extreme learning machine (ELM with good generalization ability is utilized to enhance the prediction accuracy for image pixel value during the watermarking embedding, and the lower prediction error results in the reduction of image distortion. Moreover, an optimization operation for strengthening the performance of ELM is taken to further lessen the embedding distortion. With two popular predictors, that is, median edge detector (MED predictor and gradient-adjusted predictor (GAP, the experimental results for the classical images and Kodak image set indicate that the proposed scheme achieves improvement for the lowering of image distortion compared with the classical PE scheme proposed by Thodi et al. and outperforms the improvement method presented by Coltuc and other existing approaches.

  20. Genome-wide prediction of discrete traits using bayesian regressions and machine learning

    Directory of Open Access Journals (Sweden)

    Forni Selma

    2011-02-01

    Full Text Available Abstract Background Genomic selection has gained much attention and the main goal is to increase the predictive accuracy and the genetic gain in livestock using dense marker information. Most methods dealing with the large p (number of covariates small n (number of observations problem have dealt only with continuous traits, but there are many important traits in livestock that are recorded in a discrete fashion (e.g. pregnancy outcome, disease resistance. It is necessary to evaluate alternatives to analyze discrete traits in a genome-wide prediction context. Methods This study shows two threshold versions of Bayesian regressions (Bayes A and Bayesian LASSO and two machine learning algorithms (boosting and random forest to analyze discrete traits in a genome-wide prediction context. These methods were evaluated using simulated and field data to predict yet-to-be observed records. Performances were compared based on the models' predictive ability. Results The simulation showed that machine learning had some advantages over Bayesian regressions when a small number of QTL regulated the trait under pure additivity. However, differences were small and disappeared with a large number of QTL. Bayesian threshold LASSO and boosting achieved the highest accuracies, whereas Random Forest presented the highest classification performance. Random Forest was the most consistent method in detecting resistant and susceptible animals, phi correlation was up to 81% greater than Bayesian regressions. Random Forest outperformed other methods in correctly classifying resistant and susceptible animals in the two pure swine lines evaluated. Boosting and Bayes A were more accurate with crossbred data. Conclusions The results of this study suggest that the best method for genome-wide prediction may depend on the genetic basis of the population analyzed. All methods were less accurate at correctly classifying intermediate animals than extreme animals. Among the different

  1. Machine learning methods to predict child posttraumatic stress: a proof of concept study.

    Science.gov (United States)

    Saxe, Glenn N; Ma, Sisi; Ren, Jiwen; Aliferis, Constantin

    2017-07-10

    The care of traumatized children would benefit significantly from accurate predictive models for Posttraumatic Stress Disorder (PTSD), using information available around the time of trauma. Machine Learning (ML) computational methods have yielded strong results in recent applications across many diseases and data types, yet they have not been previously applied to childhood PTSD. Since these methods have not been applied to this complex and debilitating disorder, there is a great deal that remains to be learned about their application. The first step is to prove the concept: Can ML methods - as applied in other fields - produce predictive classification models for childhood PTSD? Additionally, we seek to determine if specific variables can be identified - from the aforementioned predictive classification models - with putative causal relations to PTSD. ML predictive classification methods - with causal discovery feature selection - were applied to a data set of 163 children hospitalized with an injury and PTSD was determined three months after hospital discharge. At the time of hospitalization, 105 risk factor variables were collected spanning a range of biopsychosocial domains. Seven percent of subjects had a high level of PTSD symptoms. A predictive classification model was discovered with significant predictive accuracy. A predictive model constructed based on subsets of potentially causally relevant features achieves similar predictivity compared to the best predictive model constructed with all variables. Causal Discovery feature selection methods identified 58 variables of which 10 were identified as most stable. In this first proof-of-concept application of ML methods to predict childhood Posttraumatic Stress we were able to determine both predictive classification models for childhood PTSD and identify several causal variables. This set of techniques has great potential for enhancing the methodological toolkit in the field and future studies should seek to

  2. Gene prediction in metagenomic fragments: A large scale machine learning approach

    Directory of Open Access Journals (Sweden)

    Morgenstern Burkhard

    2008-04-01

    Full Text Available Abstract Background Metagenomics is an approach to the characterization of microbial genomes via the direct isolation of genomic sequences from the environment without prior cultivation. The amount of metagenomic sequence data is growing fast while computational methods for metagenome analysis are still in their infancy. In contrast to genomic sequences of single species, which can usually be assembled and analyzed by many available methods, a large proportion of metagenome data remains as unassembled anonymous sequencing reads. One of the aims of all metagenomic sequencing projects is the identification of novel genes. Short length, for example, Sanger sequencing yields on average 700 bp fragments, and unknown phylogenetic origin of most fragments require approaches to gene prediction that are different from the currently available methods for genomes of single species. In particular, the large size of metagenomic samples requires fast and accurate methods with small numbers of false positive predictions. Results We introduce a novel gene prediction algorithm for metagenomic fragments based on a two-stage machine learning approach. In the first stage, we use linear discriminants for monocodon usage, dicodon usage and translation initiation sites to extract features from DNA sequences. In the second stage, an artificial neural network combines these features with open reading frame length and fragment GC-content to compute the probability that this open reading frame encodes a protein. This probability is used for the classification and scoring of gene candidates. With large scale training, our method provides fast single fragment predictions with good sensitivity and specificity on artificially fragmented genomic DNA. Additionally, this method is able to predict translation initiation sites accurately and distinguishes complete from incomplete genes with high reliability. Conclusion Large scale machine learning methods are well-suited for gene

  3. A data-driven predictive approach for drug delivery using machine learning techniques.

    Directory of Open Access Journals (Sweden)

    Yuanyuan Li

    Full Text Available In drug delivery, there is often a trade-off between effective killing of the pathogen, and harmful side effects associated with the treatment. Due to the difficulty in testing every dosing scenario experimentally, a computational approach will be helpful to assist with the prediction of effective drug delivery methods. In this paper, we have developed a data-driven predictive system, using machine learning techniques, to determine, in silico, the effectiveness of drug dosing. The system framework is scalable, autonomous, robust, and has the ability to predict the effectiveness of the current drug treatment and the subsequent drug-pathogen dynamics. The system consists of a dynamic model incorporating both the drug concentration and pathogen population into distinct states. These states are then analyzed using a temporal model to describe the drug-cell interactions over time. The dynamic drug-cell interactions are learned in an adaptive fashion and used to make sequential predictions on the effectiveness of the dosing strategy. Incorporated into the system is the ability to adjust the sensitivity and specificity of the learned models based on a threshold level determined by the operator for the specific application. As a proof-of-concept, the system was validated experimentally using the pathogen Giardia lamblia and the drug metronidazole in vitro.

  4. Prediction of Student Dropout in E-Learning Program Through the Use of Machine Learning Method

    OpenAIRE

    Mingjie Tan; Peiji Shao

    2015-01-01

    The high rate of dropout is a serious problem in E-learning program. Thus it has received extensive concern from the education administrators and researchers. Predicting the potential dropout students is a workable solution to prevent dropout. Based on the analysis of related literature, this study selected student’s personal characteristic and academic performance as input attributions. Prediction models were developed using Artificial Neural Network (ANN), Decision Tree (DT) and Bayesian Ne...

  5. Variable complexity online sequential extreme learning machine, with applications to streamflow prediction

    Science.gov (United States)

    Lima, Aranildo R.; Hsieh, William W.; Cannon, Alex J.

    2017-12-01

    In situations where new data arrive continually, online learning algorithms are computationally much less costly than batch learning ones in maintaining the model up-to-date. The extreme learning machine (ELM), a single hidden layer artificial neural network with random weights in the hidden layer, is solved by linear least squares, and has an online learning version, the online sequential ELM (OSELM). As more data become available during online learning, information on the longer time scale becomes available, so ideally the model complexity should be allowed to change, but the number of hidden nodes (HN) remains fixed in OSELM. A variable complexity VC-OSELM algorithm is proposed to dynamically add or remove HN in the OSELM, allowing the model complexity to vary automatically as online learning proceeds. The performance of VC-OSELM was compared with OSELM in daily streamflow predictions at two hydrological stations in British Columbia, Canada, with VC-OSELM significantly outperforming OSELM in mean absolute error, root mean squared error and Nash-Sutcliffe efficiency at both stations.

  6. Promises of Machine Learning Approaches in Prediction of Absorption of Compounds.

    Science.gov (United States)

    Kumar, Rajnish; Sharma, Anju; Siddiqui, Mohammed Haris; Tiwari, Rajesh Kumar

    2018-01-01

    The Machine Learning (ML) is one of the fastest developing techniques in the prediction and evaluation of important pharmacokinetic properties such as absorption, distribution, metabolism and excretion. The availability of a large number of robust validation techniques for prediction models devoted to pharmacokinetics has significantly enhanced the trust and authenticity in ML approaches. There is a series of prediction models generated and used for rapid screening of compounds on the basis of absorption in last one decade. Prediction of absorption of compounds using ML models has great potential across the pharmaceutical industry as a non-animal alternative to predict absorption. However, these prediction models still have to go far ahead to develop the confidence similar to conventional experimental methods for estimation of drug absorption. Some of the general concerns are selection of appropriate ML methods and validation techniques in addition to selecting relevant descriptors and authentic data sets for the generation of prediction models. The current review explores published models of ML for the prediction of absorption using physicochemical properties as descriptors and their important conclusions. In addition, some critical challenges in acceptance of ML models for absorption are also discussed. Copyright© Bentham Science Publishers; For any queries, please email at epub@benthamscience.org.

  7. A New Tool for CME Arrival Time Prediction using Machine Learning Algorithms: CAT-PUMA

    Science.gov (United States)

    Liu, Jiajia; Ye, Yudong; Shen, Chenglong; Wang, Yuming; Erdélyi, Robert

    2018-03-01

    Coronal mass ejections (CMEs) are arguably the most violent eruptions in the solar system. CMEs can cause severe disturbances in interplanetary space and can even affect human activities in many aspects, causing damage to infrastructure and loss of revenue. Fast and accurate prediction of CME arrival time is vital to minimize the disruption that CMEs may cause when interacting with geospace. In this paper, we propose a new approach for partial-/full halo CME Arrival Time Prediction Using Machine learning Algorithms (CAT-PUMA). Via detailed analysis of the CME features and solar-wind parameters, we build a prediction engine taking advantage of 182 previously observed geo-effective partial-/full halo CMEs and using algorithms of the Support Vector Machine. We demonstrate that CAT-PUMA is accurate and fast. In particular, predictions made after applying CAT-PUMA to a test set unknown to the engine show a mean absolute prediction error of ∼5.9 hr within the CME arrival time, with 54% of the predictions having absolute errors less than 5.9 hr. Comparisons with other models reveal that CAT-PUMA has a more accurate prediction for 77% of the events investigated that can be carried out very quickly, i.e., within minutes of providing the necessary input parameters of a CME. A practical guide containing the CAT-PUMA engine and the source code of two examples are available in the Appendix, allowing the community to perform their own applications for prediction using CAT-PUMA.

  8. In Silico Prediction of Chemicals Binding to Aromatase with Machine Learning Methods.

    Science.gov (United States)

    Du, Hanwen; Cai, Yingchun; Yang, Hongbin; Zhang, Hongxiao; Xue, Yuhan; Liu, Guixia; Tang, Yun; Li, Weihua

    2017-05-15

    Environmental chemicals may affect endocrine systems through multiple mechanisms, one of which is via effects on aromatase (also known as CYP19A1), an enzyme critical for maintaining the normal balance of estrogens and androgens in the body. Therefore, rapid and efficient identification of aromatase-related endocrine disrupting chemicals (EDCs) is important for toxicology and environment risk assessment. In this study, on the basis of the Tox21 10K compound library, in silico classification models for predicting aromatase binders/nonbinders were constructed by machine learning methods. To improve the prediction ability of the models, a combined classifier (CC) strategy that combines different independent machine learning methods was adopted. Performances of the models were measured by test and external validation sets containing 1336 and 216 chemicals, respectively. The best model was obtained with the MACCS (Molecular Access System) fingerprint and CC method, which exhibited an accuracy of 0.84 for the test set and 0.91 for the external validation set. Additionally, several representative substructures for characterizing aromatase binders, such as ketone, lactone, and nitrogen-containing derivatives, were identified using information gain and substructure frequency analysis. Our study provided a systematic assessment of chemicals binding to aromatase. The built models can be helpful to rapidly identify potential EDCs targeting aromatase.

  9. Human Machine Learning Symbiosis

    Science.gov (United States)

    Walsh, Kenneth R.; Hoque, Md Tamjidul; Williams, Kim H.

    2017-01-01

    Human Machine Learning Symbiosis is a cooperative system where both the human learner and the machine learner learn from each other to create an effective and efficient learning environment adapted to the needs of the human learner. Such a system can be used in online learning modules so that the modules adapt to each learner's learning state both…

  10. Prediction of lung cancer patient survival via supervised machine learning classification techniques.

    Science.gov (United States)

    Lynch, Chip M; Abdollahi, Behnaz; Fuqua, Joshua D; de Carlo, Alexandra R; Bartholomai, James A; Balgemann, Rayeanne N; van Berkel, Victor H; Frieboes, Hermann B

    2017-12-01

    Outcomes for cancer patients have been previously estimated by applying various machine learning techniques to large datasets such as the Surveillance, Epidemiology, and End Results (SEER) program database. In particular for lung cancer, it is not well understood which types of techniques would yield more predictive information, and which data attributes should be used in order to determine this information. In this study, a number of supervised learning techniques is applied to the SEER database to classify lung cancer patients in terms of survival, including linear regression, Decision Trees, Gradient Boosting Machines (GBM), Support Vector Machines (SVM), and a custom ensemble. Key data attributes in applying these methods include tumor grade, tumor size, gender, age, stage, and number of primaries, with the goal to enable comparison of predictive power between the various methods The prediction is treated like a continuous target, rather than a classification into categories, as a first step towards improving survival prediction. The results show that the predicted values agree with actual values for low to moderate survival times, which constitute the majority of the data. The best performing technique was the custom ensemble with a Root Mean Square Error (RMSE) value of 15.05. The most influential model within the custom ensemble was GBM, while Decision Trees may be inapplicable as it had too few discrete outputs. The results further show that among the five individual models generated, the most accurate was GBM with an RMSE value of 15.32. Although SVM underperformed with an RMSE value of 15.82, statistical analysis singles the SVM as the only model that generated a distinctive output. The results of the models are consistent with a classical Cox proportional hazards model used as a reference technique. We conclude that application of these supervised learning techniques to lung cancer data in the SEER database may be of use to estimate patient survival time

  11. Supervised machine learning techniques to predict binding affinity. A study for cyclin-dependent kinase 2.

    Science.gov (United States)

    de Ávila, Maurício Boff; Xavier, Mariana Morrone; Pintro, Val Oliveira; de Azevedo, Walter Filgueira

    2017-12-09

    Here we report the development of a machine-learning model to predict binding affinity based on the crystallographic structures of protein-ligand complexes. We used an ensemble of crystallographic structures (resolution better than 1.5 Å resolution) for which half-maximal inhibitory concentration (IC 50 ) data is available. Polynomial scoring functions were built using as explanatory variables the energy terms present in the MolDock and PLANTS scoring functions. Prediction performance was tested and the supervised machine learning models showed improvement in the prediction power, when compared with PLANTS and MolDock scoring functions. In addition, the machine-learning model was applied to predict binding affinity of CDK2, which showed a better performance when compared with AutoDock4, AutoDock Vina, MolDock, and PLANTS scores. Copyright © 2017 Elsevier Inc. All rights reserved.

  12. A Machine Learning Method for the Prediction of Receptor Activation in the Simulation of Synapses

    Science.gov (United States)

    Montes, Jesus; Gomez, Elena; Merchán-Pérez, Angel; DeFelipe, Javier; Peña, Jose-Maria

    2013-01-01

    Chemical synaptic transmission involves the release of a neurotransmitter that diffuses in the extracellular space and interacts with specific receptors located on the postsynaptic membrane. Computer simulation approaches provide fundamental tools for exploring various aspects of the synaptic transmission under different conditions. In particular, Monte Carlo methods can track the stochastic movements of neurotransmitter molecules and their interactions with other discrete molecules, the receptors. However, these methods are computationally expensive, even when used with simplified models, preventing their use in large-scale and multi-scale simulations of complex neuronal systems that may involve large numbers of synaptic connections. We have developed a machine-learning based method that can accurately predict relevant aspects of the behavior of synapses, such as the percentage of open synaptic receptors as a function of time since the release of the neurotransmitter, with considerably lower computational cost compared with the conventional Monte Carlo alternative. The method is designed to learn patterns and general principles from a corpus of previously generated Monte Carlo simulations of synapses covering a wide range of structural and functional characteristics. These patterns are later used as a predictive model of the behavior of synapses under different conditions without the need for additional computationally expensive Monte Carlo simulations. This is performed in five stages: data sampling, fold creation, machine learning, validation and curve fitting. The resulting procedure is accurate, automatic, and it is general enough to predict synapse behavior under experimental conditions that are different to the ones it has been trained on. Since our method efficiently reproduces the results that can be obtained with Monte Carlo simulations at a considerably lower computational cost, it is suitable for the simulation of high numbers of synapses and it is

  13. A machine learning method for the prediction of receptor activation in the simulation of synapses.

    Directory of Open Access Journals (Sweden)

    Jesus Montes

    Full Text Available Chemical synaptic transmission involves the release of a neurotransmitter that diffuses in the extracellular space and interacts with specific receptors located on the postsynaptic membrane. Computer simulation approaches provide fundamental tools for exploring various aspects of the synaptic transmission under different conditions. In particular, Monte Carlo methods can track the stochastic movements of neurotransmitter molecules and their interactions with other discrete molecules, the receptors. However, these methods are computationally expensive, even when used with simplified models, preventing their use in large-scale and multi-scale simulations of complex neuronal systems that may involve large numbers of synaptic connections. We have developed a machine-learning based method that can accurately predict relevant aspects of the behavior of synapses, such as the percentage of open synaptic receptors as a function of time since the release of the neurotransmitter, with considerably lower computational cost compared with the conventional Monte Carlo alternative. The method is designed to learn patterns and general principles from a corpus of previously generated Monte Carlo simulations of synapses covering a wide range of structural and functional characteristics. These patterns are later used as a predictive model of the behavior of synapses under different conditions without the need for additional computationally expensive Monte Carlo simulations. This is performed in five stages: data sampling, fold creation, machine learning, validation and curve fitting. The resulting procedure is accurate, automatic, and it is general enough to predict synapse behavior under experimental conditions that are different to the ones it has been trained on. Since our method efficiently reproduces the results that can be obtained with Monte Carlo simulations at a considerably lower computational cost, it is suitable for the simulation of high numbers of

  14. Machine Learning Approach for Prediction and Understanding of Glass-Forming Ability.

    Science.gov (United States)

    Sun, Y T; Bai, H Y; Li, M Z; Wang, W H

    2017-07-20

    The prediction of the glass-forming ability (GFA) by varying the composition of alloys is a challenging problem in glass physics, as well as a problem for industry, with enormous financial ramifications. Although different empirical guides for the prediction of GFA were established over decades, a comprehensive model or approach that is able to deal with as many variables as possible simultaneously for efficiently predicting good glass formers is still highly desirable. Here, by applying the support vector classification method, we develop models for predicting the GFA of binary metallic alloys from random compositions. The effect of different input descriptors on GFA were evaluated, and the best prediction model was selected, which shows that the information related to liquidus temperatures plays a key role in the GFA of alloys. On the basis of this model, good glass formers can be predicted with high efficiency. The prediction efficiency can be further enhanced by improving larger database and refined input descriptor selection. Our findings suggest that machine learning is very powerful and efficient and has great potential for discovering new metallic glasses with good GFA.

  15. Computational prediction of multidisciplinary team decision-making for adjuvant breast cancer drug therapies: a machine learning approach.

    Science.gov (United States)

    Lin, Frank P Y; Pokorny, Adrian; Teng, Christina; Dear, Rachel; Epstein, Richard J

    2016-12-01

    Multidisciplinary team (MDT) meetings are used to optimise expert decision-making about treatment options, but such expertise is not digitally transferable between centres. To help standardise medical decision-making, we developed a machine learning model designed to predict MDT decisions about adjuvant breast cancer treatments. We analysed MDT decisions regarding adjuvant systemic therapy for 1065 breast cancer cases over eight years. Machine learning classifiers with and without bootstrap aggregation were correlated with MDT decisions (recommended, not recommended, or discussable) regarding adjuvant cytotoxic, endocrine and biologic/targeted therapies, then tested for predictability using stratified ten-fold cross-validations. The predictions so derived were duly compared with those based on published (ESMO and NCCN) cancer guidelines. Machine learning more accurately predicted adjuvant chemotherapy MDT decisions than did simple application of guidelines. No differences were found between MDT- vs. ESMO/NCCN- based decisions to prescribe either adjuvant endocrine (97%, p = 0.44/0.74) or biologic/targeted therapies (98%, p = 0.82/0.59). In contrast, significant discrepancies were evident between MDT- and guideline-based decisions to prescribe chemotherapy (87%, p machine learning models. A machine learning approach based on clinicopathologic characteristics can predict MDT decisions about adjuvant breast cancer drug therapies. The discrepancy between MDT- and guideline-based decisions regarding adjuvant chemotherapy implies that certain non-clincopathologic criteria, such as patient preference and resource availability, are factored into clinical decision-making by local experts but not captured by guidelines.

  16. Comparison of Machine Learning Techniques for the Prediction of Compressive Strength of Concrete

    Directory of Open Access Journals (Sweden)

    Palika Chopra

    2018-01-01

    Full Text Available A comparative analysis for the prediction of compressive strength of concrete at the ages of 28, 56, and 91 days has been carried out using machine learning techniques via “R” software environment. R is digging out a strong foothold in the statistical realm and is becoming an indispensable tool for researchers. The dataset has been generated under controlled laboratory conditions. Using R miner, the most widely used data mining techniques decision tree (DT model, random forest (RF model, and neural network (NN model have been used and compared with the help of coefficient of determination (R2 and root-mean-square error (RMSE, and it is inferred that the NN model predicts with high accuracy for compressive strength of concrete.

  17. Optimized Extreme Learning Machine for Power System Transient Stability Prediction Using Synchrophasors

    Directory of Open Access Journals (Sweden)

    Yanjun Zhang

    2015-01-01

    Full Text Available A new optimized extreme learning machine- (ELM- based method for power system transient stability prediction (TSP using synchrophasors is presented in this paper. First, the input features symbolizing the transient stability of power systems are extracted from synchronized measurements. Then, an ELM classifier is employed to build the TSP model. And finally, the optimal parameters of the model are optimized by using the improved particle swarm optimization (IPSO algorithm. The novelty of the proposal is in the fact that it improves the prediction performance of the ELM-based TSP model by using IPSO to optimize the parameters of the model with synchrophasors. And finally, based on the test results on both IEEE 39-bus system and a large-scale real power system, the correctness and validity of the presented approach are verified.

  18. Prediction of laser cutting heat affected zone by extreme learning machine

    Science.gov (United States)

    Anicic, Obrad; Jović, Srđan; Skrijelj, Hivzo; Nedić, Bogdan

    2017-01-01

    Heat affected zone (HAZ) of the laser cutting process may be developed based on combination of different factors. In this investigation the HAZ forecasting, based on the different laser cutting parameters, was analyzed. The main goal was to predict the HAZ according to three inputs. The purpose of this research was to develop and apply the Extreme Learning Machine (ELM) to predict the HAZ. The ELM results were compared with genetic programming (GP) and artificial neural network (ANN). The reliability of the computational models were accessed based on simulation results and by using several statistical indicators. Based upon simulation results, it was demonstrated that ELM can be utilized effectively in applications of HAZ forecasting.

  19. A comparison of machine learning techniques for survival prediction in breast cancer.

    Science.gov (United States)

    Vanneschi, Leonardo; Farinaccio, Antonella; Mauri, Giancarlo; Antoniotti, Mauro; Provero, Paolo; Giacobini, Mario

    2011-05-11

    The ability to accurately classify cancer patients into risk classes, i.e. to predict the outcome of the pathology on an individual basis, is a key ingredient in making therapeutic decisions. In recent years gene expression data have been successfully used to complement the clinical and histological criteria traditionally used in such prediction. Many "gene expression signatures" have been developed, i.e. sets of genes whose expression values in a tumor can be used to predict the outcome of the pathology. Here we investigate the use of several machine learning techniques to classify breast cancer patients using one of such signatures, the well established 70-gene signature. We show that Genetic Programming performs significantly better than Support Vector Machines, Multilayered Perceptrons and Random Forests in classifying patients from the NKI breast cancer dataset, and comparably to the scoring-based method originally proposed by the authors of the 70-gene signature. Furthermore, Genetic Programming is able to perform an automatic feature selection. Since the performance of Genetic Programming is likely to be improvable compared to the out-of-the-box approach used here, and given the biological insight potentially provided by the Genetic Programming solutions, we conclude that Genetic Programming methods are worth further investigation as a tool for cancer patient classification based on gene expression data.

  20. A comparison of machine learning techniques for survival prediction in breast cancer

    Directory of Open Access Journals (Sweden)

    Vanneschi Leonardo

    2011-05-01

    Full Text Available Abstract Background The ability to accurately classify cancer patients into risk classes, i.e. to predict the outcome of the pathology on an individual basis, is a key ingredient in making therapeutic decisions. In recent years gene expression data have been successfully used to complement the clinical and histological criteria traditionally used in such prediction. Many "gene expression signatures" have been developed, i.e. sets of genes whose expression values in a tumor can be used to predict the outcome of the pathology. Here we investigate the use of several machine learning techniques to classify breast cancer patients using one of such signatures, the well established 70-gene signature. Results We show that Genetic Programming performs significantly better than Support Vector Machines, Multilayered Perceptrons and Random Forests in classifying patients from the NKI breast cancer dataset, and comparably to the scoring-based method originally proposed by the authors of the 70-gene signature. Furthermore, Genetic Programming is able to perform an automatic feature selection. Conclusions Since the performance of Genetic Programming is likely to be improvable compared to the out-of-the-box approach used here, and given the biological insight potentially provided by the Genetic Programming solutions, we conclude that Genetic Programming methods are worth further investigation as a tool for cancer patient classification based on gene expression data.

  1. Prediction and early detection of delirium in the intensive care unit by using heart rate variability and machine learning.

    Science.gov (United States)

    Oh, Jooyoung; Cho, Dongrae; Park, Jaesub; Na, Se Hee; Kim, Jongin; Heo, Jaeseok; Shin, Cheung Soo; Kim, Jae-Jin; Park, Jin Young; Lee, Boreom

    2018-03-27

    Delirium is an important syndrome found in patients in the intensive care unit (ICU), however, it is usually under-recognized during treatment. This study was performed to investigate whether delirious patients can be successfully distinguished from non-delirious patients by using heart rate variability (HRV) and machine learning. Electrocardiography data of 140 patients was acquired during daily ICU care, and HRV data were analyzed. Delirium, including its type, severity, and etiologies, was evaluated daily by trained psychiatrists. HRV data and various machine learning algorithms including linear support vector machine (SVM), SVM with radial basis function (RBF) kernels, linear extreme learning machine (ELM), ELM with RBF kernels, linear discriminant analysis, and quadratic discriminant analysis were utilized to distinguish delirium patients from non-delirium patients. HRV data of 4797 ECGs were included, and 39 patients had delirium at least once during their ICU stay. The maximum classification accuracy was acquired using SVM with RBF kernels. Our prediction method based on HRV with machine learning was comparable to previous delirium prediction models using massive amounts of clinical information. Our results show that autonomic alterations could be a significant feature of patients with delirium in the ICU, suggesting the potential for the automatic prediction and early detection of delirium based on HRV with machine learning.

  2. Quantum machine learning.

    Science.gov (United States)

    Biamonte, Jacob; Wittek, Peter; Pancotti, Nicola; Rebentrost, Patrick; Wiebe, Nathan; Lloyd, Seth

    2017-09-13

    Fuelled by increasing computer power and algorithmic advances, machine learning techniques have become powerful tools for finding patterns in data. Quantum systems produce atypical patterns that classical systems are thought not to produce efficiently, so it is reasonable to postulate that quantum computers may outperform classical computers on machine learning tasks. The field of quantum machine learning explores how to devise and implement quantum software that could enable machine learning that is faster than that of classical computers. Recent work has produced quantum algorithms that could act as the building blocks of machine learning programs, but the hardware and software challenges are still considerable.

  3. ReactionPredictor: prediction of complex chemical reactions at the mechanistic level using machine learning.

    Science.gov (United States)

    Kayala, Matthew A; Baldi, Pierre

    2012-10-22

    Proposing reasonable mechanisms and predicting the course of chemical reactions is important to the practice of organic chemistry. Approaches to reaction prediction have historically used obfuscating representations and manually encoded patterns or rules. Here we present ReactionPredictor, a machine learning approach to reaction prediction that models elementary, mechanistic reactions as interactions between approximate molecular orbitals (MOs). A training data set of productive reactions known to occur at reasonable rates and yields and verified by inclusion in the literature or textbooks is derived from an existing rule-based system and expanded upon with manual curation from graduate level textbooks. Using this training data set of complex polar, hypervalent, radical, and pericyclic reactions, a two-stage machine learning prediction framework is trained and validated. In the first stage, filtering models trained at the level of individual MOs are used to reduce the space of possible reactions to consider. In the second stage, ranking models over the filtered space of possible reactions are used to order the reactions such that the productive reactions are the top ranked. The resulting model, ReactionPredictor, perfectly ranks polar reactions 78.1% of the time and recovers all productive reactions 95.7% of the time when allowing for small numbers of errors. Pericyclic and radical reactions are perfectly ranked 85.8% and 77.0% of the time, respectively, rising to >93% recovery for both reaction types with a small number of allowed errors. Decisions about which of the polar, pericyclic, or radical reaction type ranking models to use can be made with >99% accuracy. Finally, for multistep reaction pathways, we implement the first mechanistic pathway predictor using constrained tree-search to discover a set of reasonable mechanistic steps from given reactants to given products. Webserver implementations of both the single step and pathway versions of Reaction

  4. The Hybrid of Classification Tree and Extreme Learning Machine for Permeability Prediction in Oil Reservoir

    KAUST Repository

    Prasetyo Utomo, Chandra

    2011-06-01

    Permeability is an important parameter connected with oil reservoir. Predicting the permeability could save millions of dollars. Unfortunately, petroleum engineers have faced numerous challenges arriving at cost-efficient predictions. Much work has been carried out to solve this problem. The main challenge is to handle the high range of permeability in each reservoir. For about a hundred year, mathematicians and engineers have tried to deliver best prediction models. However, none of them have produced satisfying results. In the last two decades, artificial intelligence models have been used. The current best prediction model in permeability prediction is extreme learning machine (ELM). It produces fairly good results but a clear explanation of the model is hard to come by because it is so complex. The aim of this research is to propose a way out of this complexity through the design of a hybrid intelligent model. In this proposal, the system combines classification and regression models to predict the permeability value. These are based on the well logs data. In order to handle the high range of the permeability value, a classification tree is utilized. A benefit of this innovation is that the tree represents knowledge in a clear and succinct fashion and thereby avoids the complexity of all previous models. Finally, it is important to note that the ELM is used as a final predictor. Results demonstrate that this proposed hybrid model performs better when compared with support vector machines (SVM) and ELM in term of correlation coefficient. Moreover, the classification tree model potentially leads to better communication among petroleum engineers concerning this important process and has wider implications for oil reservoir management efficiency.

  5. Employability and Related Context Prediction Framework for University Graduands: A Machine Learning Approach

    Directory of Open Access Journals (Sweden)

    Manushi P. Wijayapala

    2016-12-01

    Full Text Available In Sri Lanka (SL, graduands’ employability remains a national issue due to the increasing number of graduates produced by higher education institutions each year. Thus, predicting the employability of university graduands can mitigate this issue since graduands can identify what qualifications or skills they need to strengthen up in order to find a job of their desired field with a good salary, before they complete the degree. The main objective of the study is to discover the plausibility of applying machine learning approach efficiently and effectively towards predicting the employability and related context of university graduands in Sri Lanka by proposing an architectural framework which consists of four modules; employment status prediction, job salary prediction, job field prediction and job relevance prediction of graduands while also comparing performance of classification algorithms under each prediction module. Series of machine learning algorithms such as C4.5, Naïve Bayes and AODE have been experimented on the Graduand Employment Census - 2014 data. A pre-processing step is proposed to overcome challenges embedded in graduand employability data and a feature selection process is proposed in order to reduce computational complexity. Additionally, parameter tuning is also done to get the most optimized parameters. More importantly, this study utilizes several types of Sampling (Oversampling, Undersampling and Ensemble (Bagging, Boosting, RF techniques as well as a newly proposed hybrid approach to overcome the limitations caused by the class imbalance phenomena. For the validation purposes, a wide range of evaluation measures was used to analyze the effectiveness of applying classification algorithms and class imbalance mitigation techniques on the dataset. The experimented results indicated that RandomForest has recorded the highest classification performance for 3 modules, achieving the selected best predictive models under hybrid

  6. TargetSpy: a supervised machine learning approach for microRNA target prediction.

    Science.gov (United States)

    Sturm, Martin; Hackenberg, Michael; Langenberger, David; Frishman, Dmitrij

    2010-05-28

    Virtually all currently available microRNA target site prediction algorithms require the presence of a (conserved) seed match to the 5' end of the microRNA. Recently however, it has been shown that this requirement might be too stringent, leading to a substantial number of missed target sites. We developed TargetSpy, a novel computational approach for predicting target sites regardless of the presence of a seed match. It is based on machine learning and automatic feature selection using a wide spectrum of compositional, structural, and base pairing features covering current biological knowledge. Our model does not rely on evolutionary conservation, which allows the detection of species-specific interactions and makes TargetSpy suitable for analyzing unconserved genomic sequences.In order to allow for an unbiased comparison of TargetSpy to other methods, we classified all algorithms into three groups: I) no seed match requirement, II) seed match requirement, and III) conserved seed match requirement. TargetSpy predictions for classes II and III are generated by appropriate postfiltering. On a human dataset revealing fold-change in protein production for five selected microRNAs our method shows superior performance in all classes. In Drosophila melanogaster not only our class II and III predictions are on par with other algorithms, but notably the class I (no-seed) predictions are just marginally less accurate. We estimate that TargetSpy predicts between 26 and 112 functional target sites without a seed match per microRNA that are missed by all other currently available algorithms. Only a few algorithms can predict target sites without demanding a seed match and TargetSpy demonstrates a substantial improvement in prediction accuracy in that class. Furthermore, when conservation and the presence of a seed match are required, the performance is comparable with state-of-the-art algorithms. TargetSpy was trained on mouse and performs well in human and drosophila

  7. TargetSpy: a supervised machine learning approach for microRNA target prediction

    Directory of Open Access Journals (Sweden)

    Langenberger David

    2010-05-01

    Full Text Available Abstract Background Virtually all currently available microRNA target site prediction algorithms require the presence of a (conserved seed match to the 5' end of the microRNA. Recently however, it has been shown that this requirement might be too stringent, leading to a substantial number of missed target sites. Results We developed TargetSpy, a novel computational approach for predicting target sites regardless of the presence of a seed match. It is based on machine learning and automatic feature selection using a wide spectrum of compositional, structural, and base pairing features covering current biological knowledge. Our model does not rely on evolutionary conservation, which allows the detection of species-specific interactions and makes TargetSpy suitable for analyzing unconserved genomic sequences. In order to allow for an unbiased comparison of TargetSpy to other methods, we classified all algorithms into three groups: I no seed match requirement, II seed match requirement, and III conserved seed match requirement. TargetSpy predictions for classes II and III are generated by appropriate postfiltering. On a human dataset revealing fold-change in protein production for five selected microRNAs our method shows superior performance in all classes. In Drosophila melanogaster not only our class II and III predictions are on par with other algorithms, but notably the class I (no-seed predictions are just marginally less accurate. We estimate that TargetSpy predicts between 26 and 112 functional target sites without a seed match per microRNA that are missed by all other currently available algorithms. Conclusion Only a few algorithms can predict target sites without demanding a seed match and TargetSpy demonstrates a substantial improvement in prediction accuracy in that class. Furthermore, when conservation and the presence of a seed match are required, the performance is comparable with state-of-the-art algorithms. TargetSpy was trained on

  8. Taxi Time Prediction at Charlotte Airport Using Fast-Time Simulation and Machine Learning Techniques

    Science.gov (United States)

    Lee, Hanbong

    2016-01-01

    Accurate taxi time prediction is required for enabling efficient runway scheduling that can increase runway throughput and reduce taxi times and fuel consumptions on the airport surface. Currently NASA and American Airlines are jointly developing a decision-support tool called Spot and Runway Departure Advisor (SARDA) that assists airport ramp controllers to make gate pushback decisions and improve the overall efficiency of airport surface traffic. In this presentation, we propose to use Linear Optimized Sequencing (LINOS), a discrete-event fast-time simulation tool, to predict taxi times and provide the estimates to the runway scheduler in real-time airport operations. To assess its prediction accuracy, we also introduce a data-driven analytical method using machine learning techniques. These two taxi time prediction methods are evaluated with actual taxi time data obtained from the SARDA human-in-the-loop (HITL) simulation for Charlotte Douglas International Airport (CLT) using various performance measurement metrics. Based on the taxi time prediction results, we also discuss how the prediction accuracy can be affected by the operational complexity at this airport and how we can improve the fast time simulation model before implementing it with an airport scheduling algorithm in a real-time environment.

  9. Machine learning-based prediction of adverse drug effects: An example of seizure-inducing compounds

    Directory of Open Access Journals (Sweden)

    Mengxuan Gao

    2017-02-01

    Full Text Available Various biological factors have been implicated in convulsive seizures, involving side effects of drugs. For the preclinical safety assessment of drug development, it is difficult to predict seizure-inducing side effects. Here, we introduced a machine learning-based in vitro system designed to detect seizure-inducing side effects. We recorded local field potentials from the CA1 alveus in acute mouse neocortico-hippocampal slices, while 14 drugs were bath-perfused at 5 different concentrations each. For each experimental condition, we collected seizure-like neuronal activity and merged their waveforms as one graphic image, which was further converted into a feature vector using Caffe, an open framework for deep learning. In the space of the first two principal components, the support vector machine completely separated the vectors (i.e., doses of individual drugs that induced seizure-like events and identified diphenhydramine, enoxacin, strychnine and theophylline as “seizure-inducing” drugs, which indeed were reported to induce seizures in clinical situations. Thus, this artificial intelligence-based classification may provide a new platform to detect the seizure-inducing side effects of preclinical drugs.

  10. Prediction of druggable proteins using machine learning and systems biology: a mini-review

    Directory of Open Access Journals (Sweden)

    Gaurav eKandoi

    2015-12-01

    Full Text Available The emergence of -omics technologies has allowed the collection of vast amounts of data on biological systems. Although the pace of such collection has been exponential, the impact of these data remains small on many critical biomedical applications such as drug development. Limited resources, high costs and low hit-to-lead ratio have led researchers to search for more cost effective methodologies. A possible alternative is to incorporate computational methods of potential drug target prediction early during drug discovery workflow. Computational methods based on systems approaches have the advantage of taking into account the global properties of a molecule not limited to its sequence, structure or function. Machine learning techniques are powerful tools that can extract relevant information from massive and noisy data sets. In recent years the scientific community has explored the combined power of these fields to propose increasingly accurate and low cost methods to propose interesting drug targets. In this mini-review, we describe promising approaches based on the simultaneous use of systems biology and machine learning to access gene and protein druggability. Moreover, we discuss the state-of-the-art of this emerging and interdisciplinary field, discussing data sources, algorithms and the performance of the different methodologies. Finally, we indicate interesting avenues of research and some remaining open challenges.

  11. Non-linear dynamical signal characterization for prediction of defibrillation success through machine learning

    Directory of Open Access Journals (Sweden)

    Shandilya Sharad

    2012-10-01

    Full Text Available Abstract Background Ventricular Fibrillation (VF is a common presenting dysrhythmia in the setting of cardiac arrest whose main treatment is defibrillation through direct current countershock to achieve return of spontaneous circulation. However, often defibrillation is unsuccessful and may even lead to the transition of VF to more nefarious rhythms such as asystole or pulseless electrical activity. Multiple methods have been proposed for predicting defibrillation success based on examination of the VF waveform. To date, however, no analytical technique has been widely accepted. We developed a unique approach of computational VF waveform analysis, with and without addition of the signal of end-tidal carbon dioxide (PetCO2, using advanced machine learning algorithms. We compare these results with those obtained using the Amplitude Spectral Area (AMSA technique. Methods A total of 90 pre-countershock ECG signals were analyzed form an accessible preshosptial cardiac arrest database. A unified predictive model, based on signal processing and machine learning, was developed with time-series and dual-tree complex wavelet transform features. Upon selection of correlated variables, a parametrically optimized support vector machine (SVM model was trained for predicting outcomes on the test sets. Training and testing was performed with nested 10-fold cross validation and 6–10 features for each test fold. Results The integrative model performs real-time, short-term (7.8 second analysis of the Electrocardiogram (ECG. For a total of 90 signals, 34 successful and 56 unsuccessful defibrillations were classified with an average Accuracy and Receiver Operator Characteristic (ROC Area Under the Curve (AUC of 82.2% and 85%, respectively. Incorporation of the end-tidal carbon dioxide signal boosted Accuracy and ROC AUC to 83.3% and 93.8%, respectively, for a smaller dataset containing 48 signals. VF analysis using AMSA resulted in accuracy and ROC AUC of 64

  12. A Hybrid Supervised/Unsupervised Machine Learning Approach to Solar Flare Prediction

    Science.gov (United States)

    Benvenuto, Federico; Piana, Michele; Campi, Cristina; Massone, Anna Maria

    2018-01-01

    This paper introduces a novel method for flare forecasting, combining prediction accuracy with the ability to identify the most relevant predictive variables. This result is obtained by means of a two-step approach: first, a supervised regularization method for regression, namely, LASSO is applied, where a sparsity-enhancing penalty term allows the identification of the significance with which each data feature contributes to the prediction; then, an unsupervised fuzzy clustering technique for classification, namely, Fuzzy C-Means, is applied, where the regression outcome is partitioned through the minimization of a cost function and without focusing on the optimization of a specific skill score. This approach is therefore hybrid, since it combines supervised and unsupervised learning; realizes classification in an automatic, skill-score-independent way; and provides effective prediction performances even in the case of imbalanced data sets. Its prediction power is verified against NOAA Space Weather Prediction Center data, using as a test set, data in the range between 1996 August and 2010 December and as training set, data in the range between 1988 December and 1996 June. To validate the method, we computed several skill scores typically utilized in flare prediction and compared the values provided by the hybrid approach with the ones provided by several standard (non-hybrid) machine learning methods. The results showed that the hybrid approach performs classification better than all other supervised methods and with an effectiveness comparable to the one of clustering methods; but, in addition, it provides a reliable ranking of the weights with which the data properties contribute to the forecast.

  13. A machine learning approach to the accurate prediction of multi-leaf collimator positional errors

    Science.gov (United States)

    Carlson, Joel N. K.; Park, Jong Min; Park, So-Yeon; In Park, Jong; Choi, Yunseok; Ye, Sung-Joon

    2016-03-01

    Discrepancies between planned and delivered movements of multi-leaf collimators (MLCs) are an important source of errors in dose distributions during radiotherapy. In this work we used machine learning techniques to train models to predict these discrepancies, assessed the accuracy of the model predictions, and examined the impact these errors have on quality assurance (QA) procedures and dosimetry. Predictive leaf motion parameters for the models were calculated from the plan files, such as leaf position and velocity, whether the leaf was moving towards or away from the isocenter of the MLC, and many others. Differences in positions between synchronized DICOM-RT planning files and DynaLog files reported during QA delivery were used as a target response for training of the models. The final model is capable of predicting MLC positions during delivery to a high degree of accuracy. For moving MLC leaves, predicted positions were shown to be significantly closer to delivered positions than were planned positions. By incorporating predicted positions into dose calculations in the TPS, increases were shown in gamma passing rates against measured dose distributions recorded during QA delivery. For instance, head and neck plans with 1%/2 mm gamma criteria had an average increase in passing rate of 4.17% (SD  =  1.54%). This indicates that the inclusion of predictions during dose calculation leads to a more realistic representation of plan delivery. To assess impact on the patient, dose volumetric histograms (DVH) using delivered positions were calculated for comparison with planned and predicted DVHs. In all cases, predicted dose volumetric parameters were in closer agreement to the delivered parameters than were the planned parameters, particularly for organs at risk on the periphery of the treatment area. By incorporating the predicted positions into the TPS, the treatment planner is given a more realistic view of the dose distribution as it will truly be

  14. Short communication: Prediction of retention pay-off using a machine learning algorithm.

    Science.gov (United States)

    Shahinfar, Saleh; Kalantari, Afshin S; Cabrera, Victor; Weigel, Kent

    2014-05-01

    Replacement decisions have a major effect on dairy farm profitability. Dynamic programming (DP) has been widely studied to find the optimal replacement policies in dairy cattle. However, DP models are computationally intensive and might not be practical for daily decision making. Hence, the ability of applying machine learning on a prerun DP model to provide fast and accurate predictions of nonlinear and intercorrelated variables makes it an ideal methodology. Milk class (1 to 5), lactation number (1 to 9), month in milk (1 to 20), and month of pregnancy (0 to 9) were used to describe all cows in a herd in a DP model. Twenty-seven scenarios based on all combinations of 3 levels (base, 20% above, and 20% below) of milk production, milk price, and replacement cost were solved with the DP model, resulting in a data set of 122,716 records, each with a calculated retention pay-off (RPO). Then, a machine learning model tree algorithm was used to mimic the evaluated RPO with DP. The correlation coefficient factor was used to observe the concordance of RPO evaluated by DP and RPO predicted by the model tree. The obtained correlation coefficient was 0.991, with a corresponding value of 0.11 for relative absolute error. At least 100 instances were required per model constraint, resulting in 204 total equations (models). When these models were used for binary classification of positive and negative RPO, error rates were 1% false negatives and 9% false positives. Applying this trained model from simulated data for prediction of RPO for 102 actual replacement records from the University of Wisconsin-Madison dairy herd resulted in a 0.994 correlation with 0.10 relative absolute error rate. Overall results showed that model tree has a potential to be used in conjunction with DP to assist farmers in their replacement decisions. Copyright © 2014 American Dairy Science Association. Published by Elsevier Inc. All rights reserved.

  15. Machine learning techniques in disease forecasting: a case study on rice blast prediction

    Directory of Open Access Journals (Sweden)

    Kapoor Amar S

    2006-11-01

    Full Text Available Abstract Background Diverse modeling approaches viz. neural networks and multiple regression have been followed to date for disease prediction in plant populations. However, due to their inability to predict value of unknown data points and longer training times, there is need for exploiting new prediction softwares for better understanding of plant-pathogen-environment relationships. Further, there is no online tool available which can help the plant researchers or farmers in timely application of control measures. This paper introduces a new prediction approach based on support vector machines for developing weather-based prediction models of plant diseases. Results Six significant weather variables were selected as predictor variables. Two series of models (cross-location and cross-year were developed and validated using a five-fold cross validation procedure. For cross-year models, the conventional multiple regression (REG approach achieved an average correlation coefficient (r of 0.50, which increased to 0.60 and percent mean absolute error (%MAE decreased from 65.42 to 52.24 when back-propagation neural network (BPNN was used. With generalized regression neural network (GRNN, the r increased to 0.70 and %MAE also improved to 46.30, which further increased to r = 0.77 and %MAE = 36.66 when support vector machine (SVM based method was used. Similarly, cross-location validation achieved r = 0.48, 0.56 and 0.66 using REG, BPNN and GRNN respectively, with their corresponding %MAE as 77.54, 66.11 and 58.26. The SVM-based method outperformed all the three approaches by further increasing r to 0.74 with improvement in %MAE to 44.12. Overall, this SVM-based prediction approach will open new vistas in the area of forecasting plant diseases of various crops. Conclusion Our case study demonstrated that SVM is better than existing machine learning techniques and conventional REG approaches in forecasting plant diseases. In this direction, we have also

  16. Predicting Freeway Work Zone Delays and Costs with a Hybrid Machine-Learning Model

    Directory of Open Access Journals (Sweden)

    Bo Du

    2017-01-01

    Full Text Available A hybrid machine-learning model, integrating an artificial neural network (ANN and a support vector machine (SVM model, is developed to predict spatiotemporal delays, subject to road geometry, number of lane closures, and work zone duration in different periods of a day and in the days of a week. The model is very user friendly, allowing the least inputs from the users. With that the delays caused by a work zone on any location of a New Jersey freeway can be predicted. To this end, tremendous amounts of data from different sources were collected to establish the relationship between the model inputs and outputs. A comparative analysis was conducted, and results indicate that the proposed model outperforms others in terms of the least root mean square error (RMSE. The proposed hybrid model can be used to calculate contractor penalty in terms of cost overruns as well as incentive reward schedule in case of early work competition. Additionally, it can assist work zone planners in determining the best start and end times of a work zone for developing and evaluating traffic mitigation and management plans.

  17. Using machine learning and quantum chemistry descriptors to predict the toxicity of ionic liquids.

    Science.gov (United States)

    Cao, Lingdi; Zhu, Peng; Zhao, Yongsheng; Zhao, Jihong

    2018-06-15

    Large-scale application of ionic liquids (ILs) hinges on the advancement of designable and eco-friendly nature. Research of the potential toxicity of ILs towards different organisms and trophic levels is insufficient. Quantitative structure-activity relationships (QSAR) model is applied to evaluate the toxicity of ILs towards the leukemia rat cell line (ICP-81). The structures of 57 cations and 21 anions were optimized by quantum chemistry. The electrostatic potential surface area (S EP ) and charge distribution area (S σ-profile ) descriptors are calculated and used to predict the toxicity of ILs. The performance and predictive aptitude of extreme learning machine (ELM) model are analyzed and compared with those of multiple linear regression (MLR) and support vector machine (SVM) models. The highest R 2 and the lowest AARD% and RMSE of the training set, test set and total set for the ELM are observed, which validates the superior performance of the ELM than that of obtained by the MLR and SVM. The applicability domain of the model is assessed by the Williams plot. Copyright © 2018 Elsevier B.V. All rights reserved.

  18. Comparison of Advanced Machine Learning Tools for Disruption Prediction and Disruption Studies

    Czech Academy of Sciences Publication Activity Database

    Odstrčil, Michal; Murari, A.; Mlynář, Jan

    2013-01-01

    Roč. 41, č. 7 (2013), s. 1751-1759 ISSN 0093-3813 R&D Projects: GA ČR GAP205/10/2055 Institutional support: RVO:61389021 Keywords : Learning Machines * Support Vector Machines * Neural Network * ASDEX Upgrade * JET * Disruption mitigation * Tokamaks * ITER Subject RIV: BL - Plasma and Gas Discharge Physics Impact factor: 0.950, year: 2013

  19. e-Bitter: Bitterant Prediction by the Consensus Voting From the Machine-Learning Methods.

    Science.gov (United States)

    Zheng, Suqing; Jiang, Mengying; Zhao, Chengwei; Zhu, Rui; Hu, Zhicheng; Xu, Yong; Lin, Fu

    2018-01-01

    In-silico bitterant prediction received the considerable attention due to the expensive and laborious experimental-screening of the bitterant. In this work, we collect the fully experimental dataset containing 707 bitterants and 592 non-bitterants, which is distinct from the fully or partially hypothetical non-bitterant dataset used in the previous works. Based on this experimental dataset, we harness the consensus votes from the multiple machine-learning methods (e.g., deep learning etc.) combined with the molecular fingerprint to build the bitter/bitterless classification models with five-fold cross-validation, which are further inspected by the Y-randomization test and applicability domain analysis. One of the best consensus models affords the accuracy, precision, specificity, sensitivity, F1-score, and Matthews correlation coefficient (MCC) of 0.929, 0.918, 0.898, 0.954, 0.936, and 0.856 respectively on our test set. For the automatic prediction of bitterant, a graphic program "e-Bitter" is developed for the convenience of users via the simple mouse click. To our best knowledge, it is for the first time to adopt the consensus model for the bitterant prediction and develop the first free stand-alone software for the experimental food scientist.

  20. e-Bitter: Bitterant Prediction by the Consensus Voting From the Machine-Learning Methods

    Directory of Open Access Journals (Sweden)

    Suqing Zheng

    2018-03-01

    Full Text Available In-silico bitterant prediction received the considerable attention due to the expensive and laborious experimental-screening of the bitterant. In this work, we collect the fully experimental dataset containing 707 bitterants and 592 non-bitterants, which is distinct from the fully or partially hypothetical non-bitterant dataset used in the previous works. Based on this experimental dataset, we harness the consensus votes from the multiple machine-learning methods (e.g., deep learning etc. combined with the molecular fingerprint to build the bitter/bitterless classification models with five-fold cross-validation, which are further inspected by the Y-randomization test and applicability domain analysis. One of the best consensus models affords the accuracy, precision, specificity, sensitivity, F1-score, and Matthews correlation coefficient (MCC of 0.929, 0.918, 0.898, 0.954, 0.936, and 0.856 respectively on our test set. For the automatic prediction of bitterant, a graphic program “e-Bitter” is developed for the convenience of users via the simple mouse click. To our best knowledge, it is for the first time to adopt the consensus model for the bitterant prediction and develop the first free stand-alone software for the experimental food scientist.

  1. e-Bitter: Bitterant Prediction by the Consensus Voting From the Machine-learning Methods

    Science.gov (United States)

    Zheng, Suqing; Jiang, Mengying; Zhao, Chengwei; Zhu, Rui; Hu, Zhicheng; Xu, Yong; Lin, Fu

    2018-03-01

    In-silico bitterant prediction received the considerable attention due to the expensive and laborious experimental-screening of the bitterant. In this work, we collect the fully experimental dataset containing 707 bitterants and 592 non-bitterants, which is distinct from the fully or partially hypothetical non-bitterant dataset used in the previous works. Based on this experimental dataset, we harness the consensus votes from the multiple machine-learning methods (e.g., deep learning etc.) combined with the molecular fingerprint to build the bitter/bitterless classification models with five-fold cross-validation, which are further inspected by the Y-randomization test and applicability domain analysis. One of the best consensus models affords the accuracy, precision, specificity, sensitivity, F1-score, and Matthews correlation coefficient (MCC) of 0.929, 0.918, 0.898, 0.954, 0.936, and 0.856 respectively on our test set. For the automatic prediction of bitterant, a graphic program “e-Bitter” is developed for the convenience of users via the simple mouse click. To our best knowledge, it is for the first time to adopt the consensus model for the bitterant prediction and develop the first free stand-alone software for the experimental food scientist.

  2. Improving virtual screening predictive accuracy of Human kallikrein 5 inhibitors using machine learning models.

    Science.gov (United States)

    Fang, Xingang; Bagui, Sikha; Bagui, Subhash

    2017-08-01

    The readily available high throughput screening (HTS) data from the PubChem database provides an opportunity for mining of small molecules in a variety of biological systems using machine learning techniques. From the thousands of available molecular descriptors developed to encode useful chemical information representing the characteristics of molecules, descriptor selection is an essential step in building an optimal quantitative structural-activity relationship (QSAR) model. For the development of a systematic descriptor selection strategy, we need the understanding of the relationship between: (i) the descriptor selection; (ii) the choice of the machine learning model; and (iii) the characteristics of the target bio-molecule. In this work, we employed the Signature descriptor to generate a dataset on the Human kallikrein 5 (hK 5) inhibition confirmatory assay data and compared multiple classification models including logistic regression, support vector machine, random forest and k-nearest neighbor. Under optimal conditions, the logistic regression model provided extremely high overall accuracy (98%) and precision (90%), with good sensitivity (65%) in the cross validation test. In testing the primary HTS screening data with more than 200K molecular structures, the logistic regression model exhibited the capability of eliminating more than 99.9% of the inactive structures. As part of our exploration of the descriptor-model-target relationship, the excellent predictive performance of the combination of the Signature descriptor and the logistic regression model on the assay data of the Human kallikrein 5 (hK 5) target suggested a feasible descriptor/model selection strategy on similar targets. Copyright © 2017 Elsevier Ltd. All rights reserved.

  3. How long will my mouse live? Machine learning approaches for prediction of mouse life span.

    Science.gov (United States)

    Swindell, William R; Harper, James M; Miller, Richard A

    2008-09-01

    Prediction of individual life span based on characteristics evaluated at middle-age represents a challenging objective for aging research. In this study, we used machine learning algorithms to construct models that predict life span in a stock of genetically heterogeneous mice. Life-span prediction accuracy of 22 algorithms was evaluated using a cross-validation approach, in which models were trained and tested with distinct subsets of data. Using a combination of body weight and T-cell subset measures evaluated before 2 years of age, we show that the life-span quartile to which an individual mouse belongs can be predicted with an accuracy of 35.3% (+/-0.10%). This result provides a new benchmark for the development of life-span-predictive models, but improvement can be expected through identification of new predictor variables and development of computational approaches. Future work in this direction can provide tools for aging research and will shed light on associations between phenotypic traits and longevity.

  4. Prediction of HDR quality by combining perceptually transformed display measurements with machine learning

    Science.gov (United States)

    Choudhury, Anustup; Farrell, Suzanne; Atkins, Robin; Daly, Scott

    2017-09-01

    We present an approach to predict overall HDR display quality as a function of key HDR display parameters. We first performed subjective experiments on a high quality HDR display that explored five key HDR display parameters: maximum luminance, minimum luminance, color gamut, bit-depth and local contrast. Subjects rated overall quality for different combinations of these display parameters. We explored two models | a physical model solely based on physically measured display characteristics and a perceptual model that transforms physical parameters using human vision system models. For the perceptual model, we use a family of metrics based on a recently published color volume model (ICT-CP), which consists of the PQ luminance non-linearity (ST2084) and LMS-based opponent color, as well as an estimate of the display point spread function. To predict overall visual quality, we apply linear regression and machine learning techniques such as Multilayer Perceptron, RBF and SVM networks. We use RMSE and Pearson/Spearman correlation coefficients to quantify performance. We found that the perceptual model is better at predicting subjective quality than the physical model and that SVM is better at prediction than linear regression. The significance and contribution of each display parameter was investigated. In addition, we found that combined parameters such as contrast do not improve prediction. Traditional perceptual models were also evaluated and we found that models based on the PQ non-linearity performed better.

  5. Prediction of the Chloride Resistance of Concrete Modified with High Calcium Fly Ash Using Machine Learning.

    Science.gov (United States)

    Marks, Michał; Glinicki, Michał A; Gibas, Karolina

    2015-12-11

    The aim of the study was to generate rules for the prediction of the chloride resistance of concrete modified with high calcium fly ash using machine learning methods. The rapid chloride permeability test, according to the Nordtest Method Build 492, was used for determining the chloride ions' penetration in concrete containing high calcium fly ash (HCFA) for partial replacement of Portland cement. The results of the performed tests were used as the training set to generate rules describing the relation between material composition and the chloride resistance. Multiple methods for rule generation were applied and compared. The rules generated by algorithm J48 from the Weka workbench provided the means for adequate classification of plain concretes and concretes modified with high calcium fly ash as materials of good, acceptable or unacceptable resistance to chloride penetration.

  6. Using Five Machine Learning for Breast Cancer Biopsy Predictions Based on Mammographic Diagnosis

    OpenAIRE

    Oyewola, David; Hakimi, Danladi; Adeboye, Kayode; Shehu, Musa Danjuma

    2017-01-01

    Breast cancer is one of thecauses of female death in the world. Mammography  is commonly  used for  distinguishing  malignant tumors  from benign  ones. In this research,  a mammographic  diagnostic method  is  presented for breast  cancer  biopsy outcome  predictions  using  fivemachine learning which includes: Logistic Regression(LR), Linear DiscriminantAnalysis(LDA), Quadratic Discriminant Analysis(QDA), Random Forest(RF) andSupport  Vector Machine(SVM)  classification.  The testing result...

  7. 3D Cloud Field Prediction using A-Train Data and Machine Learning Techniques

    Science.gov (United States)

    Johnson, C. L.

    2017-12-01

    Validation of cloud process parameterizations used in global climate models (GCMs) would greatly benefit from observed 3D cloud fields at the size comparable to that of a GCM grid cell. For the highest resolution simulations, surface grid cells are on the order of 100 km by 100 km. CloudSat/CALIPSO data provides 1 km width of detailed vertical cloud fraction profile (CFP) and liquid and ice water content (LWC/IWC). This work utilizes four machine learning algorithms to create nonlinear regressions of CFP, LWC, and IWC data using radiances, surface type and location of measurement as predictors and applies the regression equations to off-track locations generating 3D cloud fields for 100 km by 100 km domains. The CERES-CloudSat-CALIPSO-MODIS (C3M) merged data set for February 2007 is used. Support Vector Machines, Artificial Neural Networks, Gaussian Processes and Decision Trees are trained on 1000 km of continuous C3M data. Accuracy is computed using existing vertical profiles that are excluded from the training data and occur within 100 km of the training data. Accuracy of the four algorithms is compared. Average accuracy for one day of predicted data is 86% for the most successful algorithm. The methodology for training the algorithms, determining valid prediction regions and applying the equations off-track is discussed. Predicted 3D cloud fields are provided as inputs to the Ed4 NASA LaRC Fu-Liou radiative transfer code and resulting TOA radiances compared to observed CERES/MODIS radiances. Differences in computed radiances using predicted profiles and observed radiances are compared.

  8. Machine learning with R

    CERN Document Server

    Lantz, Brett

    2013-01-01

    Written as a tutorial to explore and understand the power of R for machine learning. This practical guide that covers all of the need to know topics in a very systematic way. For each machine learning approach, each step in the process is detailed, from preparing the data for analysis to evaluating the results. These steps will build the knowledge you need to apply them to your own data science tasks.Intended for those who want to learn how to use R's machine learning capabilities and gain insight from your data. Perhaps you already know a bit about machine learning, but have never used R; or

  9. Prediction of revascularization after myocardial perfusion SPECT by machine learning in a large population.

    Science.gov (United States)

    Arsanjani, Reza; Dey, Damini; Khachatryan, Tigran; Shalev, Aryeh; Hayes, Sean W; Fish, Mathews; Nakanishi, Rine; Germano, Guido; Berman, Daniel S; Slomka, Piotr

    2015-10-01

    We aimed to investigate if early revascularization in patients with suspected coronary artery disease can be effectively predicted by integrating clinical data and quantitative image features derived from perfusion SPECT (MPS) by machine learning (ML) approach. 713 rest (201)Thallium/stress (99m)Technetium MPS studies with correlating invasive angiography with 372 revascularization events (275 PCI/97 CABG) within 90 days after MPS (91% within 30 days) were considered. Transient ischemic dilation, stress combined supine/prone total perfusion deficit (TPD), supine rest and stress TPD, exercise ejection fraction, and end-systolic volume, along with clinical parameters including patient gender, history of hypertension and diabetes mellitus, ST-depression on baseline ECG, ECG and clinical response during stress, and post-ECG probability by boosted ensemble ML algorithm (LogitBoost) to predict revascularization events. These features were selected using an automated feature selection algorithm from all available clinical and quantitative data (33 parameters). Tenfold cross-validation was utilized to train and test the prediction model. The prediction of revascularization by ML algorithm was compared to standalone measures of perfusion and visual analysis by two experienced readers utilizing all imaging, quantitative, and clinical data. The sensitivity of machine learning (ML) (73.6% ± 4.3%) for prediction of revascularization was similar to one reader (73.9% ± 4.6%) and standalone measures of perfusion (75.5% ± 4.5%). The specificity of ML (74.7% ± 4.2%) was also better than both expert readers (67.2% ± 4.9% and 66.0% ± 5.0%, P < .05), but was similar to ischemic TPD (68.3% ± 4.9%, P < .05). The receiver operator characteristics areas under curve for ML (0.81 ± 0.02) was similar to reader 1 (0.81 ± 0.02) but superior to reader 2 (0.72 ± 0.02, P < .01) and standalone measure of perfusion (0.77 ± 0.02, P < .01). ML approach is comparable or better than

  10. Prediction of insemination outcomes in Holstein dairy cattle using alternative machine learning algorithms.

    Science.gov (United States)

    Shahinfar, Saleh; Page, David; Guenther, Jerry; Cabrera, Victor; Fricke, Paul; Weigel, Kent

    2014-02-01

    When making the decision about whether or not to breed a given cow, knowledge about the expected outcome would have an economic impact on profitability of the breeding program and net income of the farm. The outcome of each breeding can be affected by many management and physiological features that vary between farms and interact with each other. Hence, the ability of machine learning algorithms to accommodate complex relationships in the data and missing values for explanatory variables makes these algorithms well suited for investigation of reproduction performance in dairy cattle. The objective of this study was to develop a user-friendly and intuitive on-farm tool to help farmers make reproduction management decisions. Several different machine learning algorithms were applied to predict the insemination outcomes of individual cows based on phenotypic and genotypic data. Data from 26 dairy farms in the Alta Genetics (Watertown, WI) Advantage Progeny Testing Program were used, representing a 10-yr period from 2000 to 2010. Health, reproduction, and production data were extracted from on-farm dairy management software, and estimated breeding values were downloaded from the US Department of Agriculture Agricultural Research Service Animal Improvement Programs Laboratory (Beltsville, MD) database. The edited data set consisted of 129,245 breeding records from primiparous Holstein cows and 195,128 breeding records from multiparous Holstein cows. Each data point in the final data set included 23 and 25 explanatory variables and 1 binary outcome for of 0.756 ± 0.005 and 0.736 ± 0.005 for primiparous and multiparous cows, respectively. The naïve Bayes algorithm, Bayesian network, and decision tree algorithms showed somewhat poorer classification performance. An information-based variable selection procedure identified herd average conception rate, incidence of ketosis, number of previous (failed) inseminations, days in milk at breeding, and mastitis as the most

  11. A Novel Online Sequential Extreme Learning Machine for Gas Utilization Ratio Prediction in Blast Furnaces

    Directory of Open Access Journals (Sweden)

    Yanjiao Li

    2017-08-01

    Full Text Available Gas utilization ratio (GUR is an important indicator used to measure the operating status and energy consumption of blast furnaces (BFs. In this paper, we present a soft-sensor approach, i.e., a novel online sequential extreme learning machine (OS-ELM named DU-OS-ELM, to establish a data-driven model for GUR prediction. In DU-OS-ELM, firstly, the old collected data are discarded gradually and the newly acquired data are given more attention through a novel dynamic forgetting factor (DFF, depending on the estimation errors to enhance the dynamic tracking ability. Furthermore, we develop an updated selection strategy (USS to judge whether the model needs to be updated with the newly coming data, so that the proposed approach is more in line with the actual production situation. Then, the convergence analysis of the proposed DU-OS-ELM is presented to ensure the estimation of output weight converge to the true value with the new data arriving. Meanwhile, the proposed DU-OS-ELM is applied to build a soft-sensor model to predict GUR. Experimental results demonstrate that the proposed DU-OS-ELM obtains better generalization performance and higher prediction accuracy compared with a number of existing related approaches using the real production data from a BF and the created GUR prediction model can provide an effective guidance for further optimization operation.

  12. A Novel Online Sequential Extreme Learning Machine for Gas Utilization Ratio Prediction in Blast Furnaces.

    Science.gov (United States)

    Li, Yanjiao; Zhang, Sen; Yin, Yixin; Xiao, Wendong; Zhang, Jie

    2017-08-10

    Gas utilization ratio (GUR) is an important indicator used to measure the operating status and energy consumption of blast furnaces (BFs). In this paper, we present a soft-sensor approach, i.e., a novel online sequential extreme learning machine (OS-ELM) named DU-OS-ELM, to establish a data-driven model for GUR prediction. In DU-OS-ELM, firstly, the old collected data are discarded gradually and the newly acquired data are given more attention through a novel dynamic forgetting factor (DFF), depending on the estimation errors to enhance the dynamic tracking ability. Furthermore, we develop an updated selection strategy (USS) to judge whether the model needs to be updated with the newly coming data, so that the proposed approach is more in line with the actual production situation. Then, the convergence analysis of the proposed DU-OS-ELM is presented to ensure the estimation of output weight converge to the true value with the new data arriving. Meanwhile, the proposed DU-OS-ELM is applied to build a soft-sensor model to predict GUR. Experimental results demonstrate that the proposed DU-OS-ELM obtains better generalization performance and higher prediction accuracy compared with a number of existing related approaches using the real production data from a BF and the created GUR prediction model can provide an effective guidance for further optimization operation.

  13. Is Romantic Desire Predictable? Machine Learning Applied to Initial Romantic Attraction.

    Science.gov (United States)

    Joel, Samantha; Eastwick, Paul W; Finkel, Eli J

    2017-10-01

    Matchmaking companies and theoretical perspectives on close relationships suggest that initial attraction is, to some extent, a product of two people's self-reported traits and preferences. We used machine learning to test how well such measures predict people's overall tendencies to romantically desire other people (actor variance) and to be desired by other people (partner variance), as well as people's desire for specific partners above and beyond actor and partner variance (relationship variance). In two speed-dating studies, romantically unattached individuals completed more than 100 self-report measures about traits and preferences that past researchers have identified as being relevant to mate selection. Each participant met each opposite-sex participant attending a speed-dating event for a 4-min speed date. Random forests models predicted 4% to 18% of actor variance and 7% to 27% of partner variance; crucially, however, they were unable to predict relationship variance using any combination of traits and preferences reported before the dates. These results suggest that compatibility elements of human mating are challenging to predict before two people meet.

  14. Machine Learning meets Mathematical Optimization to predict the optimal production of offshore wind parks

    DEFF Research Database (Denmark)

    Fischetti, Martina; Fraccaro, Marco

    2018-01-01

    In this paper we propose a combination of Mathematical Optimization and Machine Learning to estimate the value of optimized solutions. In particular, we investigate if a machine, trained on a large number of optimized solutions, could accurately estimate the value of the optimized solution for new...... in production between optimized/non optimized solutions, it is not trivial to understand the potential value of a new site without running a complete optimization. This could be too time consuming if a lot of sites need to be evaluated, therefore we propose to use Machine Learning to quickly estimate...... the potential of new sites (i.e., to estimate the optimized production of a site without explicitly running the optimization). To do so, we trained and tested different Machine Learning models on a dataset of 3000+ optimized layouts found by the optimizer. Thanks to the close collaboration with a leading...

  15. Microsoft Azure machine learning

    CERN Document Server

    Mund, Sumit

    2015-01-01

    The book is intended for those who want to learn how to use Azure Machine Learning. Perhaps you already know a bit about Machine Learning, but have never used ML Studio in Azure; or perhaps you are an absolute newbie. In either case, this book will get you up-and-running quickly.

  16. Machine-learning-based calving prediction from activity, lying, and ruminating behaviors in dairy cattle.

    Science.gov (United States)

    Borchers, M R; Chang, Y M; Proudfoot, K L; Wadsworth, B A; Stone, A E; Bewley, J M

    2017-07-01

    The objective of this study was to use automated activity, lying, and rumination monitors to characterize prepartum behavior and predict calving in dairy cattle. Data were collected from 20 primiparous and 33 multiparous Holstein dairy cattle from September 2011 to May 2013 at the University of Kentucky Coldstream Dairy. The HR Tag (SCR Engineers Ltd., Netanya, Israel) automatically collected neck activity and rumination data in 2-h increments. The IceQube (IceRobotics Ltd., South Queensferry, United Kingdom) automatically collected number of steps, lying time, standing time, number of transitions from standing to lying (lying bouts), and total motion, summed in 15-min increments. IceQube data were summed in 2-h increments to match HR Tag data. All behavioral data were collected for 14 d before the predicted calving date. Retrospective data analysis was performed using mixed linear models to examine behavioral changes by day in the 14 d before calving. Bihourly behavioral differences from baseline values over the 14 d before calving were also evaluated using mixed linear models. Changes in daily rumination time, total motion, lying time, and lying bouts occurred in the 14 d before calving. In the bihourly analysis, extreme values for all behaviors occurred in the final 24 h, indicating that the monitored behaviors may be useful in calving prediction. To determine whether technologies were useful at predicting calving, random forest, linear discriminant analysis, and neural network machine-learning techniques were constructed and implemented using R version 3.1.0 (R Foundation for Statistical Computing, Vienna, Austria). These methods were used on variables from each technology and all combined variables from both technologies. A neural network analysis that combined variables from both technologies at the daily level yielded 100.0% sensitivity and 86.8% specificity. A neural network analysis that combined variables from both technologies in bihourly increments was

  17. Prediction of selective estrogen receptor beta agonist using open data and machine learning approach

    Directory of Open Access Journals (Sweden)

    Niu AQ

    2016-07-01

    for the classification of selective ER-β agonists. Chemistry Development Kit extended fingerprints and MACCS fingerprint performed better in structural representation between active and inactive agonists. Conclusion: These results demonstrate that combining the fingerprint and ML approaches leads to robust ER-β agonist prediction models, which are potentially applicable to the identification of selective ER-β agonists. Keywords: estrogen receptor subtype β, selective estrogen receptor modulators, quantitative structure-activity relationship models, machine learning approach

  18. Predicting the Occurrence of Haze Events in Southeast Asia using Machine Learning Algorithms

    Science.gov (United States)

    Lee, H. H.; Chulakadabba, A.; Tonks, A.; Yang, Z.; Wang, C.

    2017-12-01

    Severe local- and regional-scale air pollution episodes typically originate from 1) high emissions of air pollutants, 2) poor dispersion conditions, and 3) trans-boundary pollutant transport. Biomass burning activities have become more frequent in Southeast Asia, especially in Sumatra, Borneo, and the mainland Southeast. Trans-boundary transport of biomass burning aerosols often lead to air quality problems in the region. Furthermore, particulate pollutants from human activities besides biomass burning also play an important role in the air quality of Southeast Asia. Singapore, for example, has a dynamic industrial sector including chemical, electric and metallurgic industries, and is the region's major petroleum-refining center. In addition, natural gas and oil power plants, waste incinerators, active port traffic, and a major regional airport further complicate Singapore's air quality issues. In this study, we compare five Machine Learning algorithms: k-Nearest Neighbors, Linear Support Vector Machine, Decision Tree, Random Forest and Artificial Neural Network, to identify haze patterns and determine variable importance. The algorithms were trained using local atmospheric data (i.e. months, atmospheric conditions, wind direction and relative humidity) from three observation stations in Singapore (Changi, Seletar and Paya Labar). We find that the algorithms reveal the associations in data within and between the stations, and provide in-depth interpretation of the haze sources. The algorithms also allow us to predict the probability of haze episodes in Singapore and to determine the correlation between this probability and atmospheric conditions.

  19. GWAS-based machine learning approach to predict duloxetine response in major depressive disorder.

    Science.gov (United States)

    Maciukiewicz, Malgorzata; Marshe, Victoria S; Hauschild, Anne-Christin; Foster, Jane A; Rotzinger, Susan; Kennedy, James L; Kennedy, Sidney H; Müller, Daniel J; Geraci, Joseph

    2018-04-01

    Major depressive disorder (MDD) is one of the most prevalent psychiatric disorders and is commonly treated with antidepressant drugs. However, large variability is observed in terms of response to antidepressants. Machine learning (ML) models may be useful to predict treatment outcomes. A sample of 186 MDD patients received treatment with duloxetine for up to 8 weeks were categorized as "responders" based on a MADRS change >50% from baseline; or "remitters" based on a MADRS score ≤10 at end point. The initial dataset (N = 186) was randomly divided into training and test sets in a nested 5-fold cross-validation, where 80% was used as a training set and 20% made up five independent test sets. We performed genome-wide logistic regression to identify potentially significant variants related to duloxetine response/remission and extracted the most promising predictors using LASSO regression. Subsequently, classification-regression trees (CRT) and support vector machines (SVM) were applied to construct models, using ten-fold cross-validation. With regards to response, none of the pairs performed significantly better than chance (accuracy p > .1). For remission, SVM achieved moderate performance with an accuracy = 0.52, a sensitivity = 0.58, and a specificity = 0.46, and 0.51 for all coefficients for CRT. The best performing SVM fold was characterized by an accuracy = 0.66 (p = .071), sensitivity = 0.70 and a sensitivity = 0.61. In this study, the potential of using GWAS data to predict duloxetine outcomes was examined using ML models. The models were characterized by a promising sensitivity, but specificity remained moderate at best. The inclusion of additional non-genetic variables to create integrated models may improve prediction. Copyright © 2017. Published by Elsevier Ltd.

  20. Machine Learning for Medical Imaging.

    Science.gov (United States)

    Erickson, Bradley J; Korfiatis, Panagiotis; Akkus, Zeynettin; Kline, Timothy L

    2017-01-01

    Machine learning is a technique for recognizing patterns that can be applied to medical images. Although it is a powerful tool that can help in rendering medical diagnoses, it can be misapplied. Machine learning typically begins with the machine learning algorithm system computing the image features that are believed to be of importance in making the prediction or diagnosis of interest. The machine learning algorithm system then identifies the best combination of these image features for classifying the image or computing some metric for the given image region. There are several methods that can be used, each with different strengths and weaknesses. There are open-source versions of most of these machine learning methods that make them easy to try and apply to images. Several metrics for measuring the performance of an algorithm exist; however, one must be aware of the possible associated pitfalls that can result in misleading metrics. More recently, deep learning has started to be used; this method has the benefit that it does not require image feature identification and calculation as a first step; rather, features are identified as part of the learning process. Machine learning has been used in medical imaging and will have a greater influence in the future. Those working in medical imaging must be aware of how machine learning works. © RSNA, 2017.

  1. Pattern recognition & machine learning

    CERN Document Server

    Anzai, Y

    1992-01-01

    This is the first text to provide a unified and self-contained introduction to visual pattern recognition and machine learning. It is useful as a general introduction to artifical intelligence and knowledge engineering, and no previous knowledge of pattern recognition or machine learning is necessary. Basic for various pattern recognition and machine learning methods. Translated from Japanese, the book also features chapter exercises, keywords, and summaries.

  2. Machine-Learning-Based Future Received Signal Strength Prediction Using Depth Images for mmWave Communications

    OpenAIRE

    Okamoto, Hironao; Nishio, Takayuki; Nakashima, Kota; Koda, Yusuke; Yamamoto, Koji; Morikura, Masahiro; Asai, Yusuke; Miyatake, Ryo

    2018-01-01

    This paper discusses a machine-learning (ML)-based future received signal strength (RSS) prediction scheme using depth camera images for millimeter-wave (mmWave) networks. The scheme provides the future RSS prediction of any mmWave links within the camera's view, including links where nodes are not transmitting frames. This enables network controllers to conduct network operations before line-of-sight path blockages degrade the RSS. Using the ML techniques, the prediction scheme automatically...

  3. Prediction of selective estrogen receptor beta agonist using open data and machine learning approach.

    Science.gov (United States)

    Niu, Ai-Qin; Xie, Liang-Jun; Wang, Hui; Zhu, Bing; Wang, Sheng-Qi

    2016-01-01

    Estrogen receptors (ERs) are nuclear transcription factors that are involved in the regulation of many complex physiological processes in humans. ERs have been validated as important drug targets for the treatment of various diseases, including breast cancer, ovarian cancer, osteoporosis, and cardiovascular disease. ERs have two subtypes, ER-α and ER-β. Emerging data suggest that the development of subtype-selective ligands that specifically target ER-β could be a more optimal approach to elicit beneficial estrogen-like activities and reduce side effects. Herein, we focused on ER-β and developed its in silico quantitative structure-activity relationship models using machine learning (ML) methods. The chemical structures and ER-β bioactivity data were extracted from public chemogenomics databases. Four types of popular fingerprint generation methods including MACCS fingerprint, PubChem fingerprint, 2D atom pairs, and Chemistry Development Kit extended fingerprint were used as descriptors. Four ML methods including Naïve Bayesian classifier, k-nearest neighbor, random forest, and support vector machine were used to train the models. The range of classification accuracies was 77.10% to 88.34%, and the range of area under the ROC (receiver operating characteristic) curve values was 0.8151 to 0.9475, evaluated by the 5-fold cross-validation. Comparison analysis suggests that both the random forest and the support vector machine are superior for the classification of selective ER-β agonists. Chemistry Development Kit extended fingerprints and MACCS fingerprint performed better in structural representation between active and inactive agonists. These results demonstrate that combining the fingerprint and ML approaches leads to robust ER-β agonist prediction models, which are potentially applicable to the identification of selective ER-β agonists.

  4. Prediction of recombinant protein overexpression in Escherichia coli using a machine learning based model (RPOLP).

    Science.gov (United States)

    Habibi, Narjeskhatoon; Norouzi, Alireza; Mohd Hashim, Siti Z; Shamsir, Mohd Shahir; Samian, Razip

    2015-11-01

    Recombinant protein overexpression, an important biotechnological process, is ruled by complex biological rules which are mostly unknown, is in need of an intelligent algorithm so as to avoid resource-intensive lab-based trial and error experiments in order to determine the expression level of the recombinant protein. The purpose of this study is to propose a predictive model to estimate the level of recombinant protein overexpression for the first time in the literature using a machine learning approach based on the sequence, expression vector, and expression host. The expression host was confined to Escherichia coli which is the most popular bacterial host to overexpress recombinant proteins. To provide a handle to the problem, the overexpression level was categorized as low, medium and high. A set of features which were likely to affect the overexpression level was generated based on the known facts (e.g. gene length) and knowledge gathered from related literature. Then, a representative sub-set of features generated in the previous objective was determined using feature selection techniques. Finally a predictive model was developed using random forest classifier which was able to adequately classify the multi-class imbalanced small dataset constructed. The result showed that the predictive model provided a promising accuracy of 80% on average, in estimating the overexpression level of a recombinant protein. Copyright © 2015 Elsevier Ltd. All rights reserved.

  5. Voxel-wise prostate cell density prediction using multiparametric magnetic resonance imaging and machine learning.

    Science.gov (United States)

    Sun, Yu; Reynolds, Hayley M; Wraith, Darren; Williams, Scott; Finnegan, Mary E; Mitchell, Catherine; Murphy, Declan; Haworth, Annette

    2018-04-26

    There are currently no methods to estimate cell density in the prostate. This study aimed to develop predictive models to estimate prostate cell density from multiparametric magnetic resonance imaging (mpMRI) data at a voxel level using machine learning techniques. In vivo mpMRI data were collected from 30 patients before radical prostatectomy. Sequences included T2-weighted imaging, diffusion-weighted imaging and dynamic contrast-enhanced imaging. Ground truth cell density maps were computed from histology and co-registered with mpMRI. Feature extraction and selection were performed on mpMRI data. Final models were fitted using three regression algorithms including multivariate adaptive regression spline (MARS), polynomial regression (PR) and generalised additive model (GAM). Model parameters were optimised using leave-one-out cross-validation on the training data and model performance was evaluated on test data using root mean square error (RMSE) measurements. Predictive models to estimate voxel-wise prostate cell density were successfully trained and tested using the three algorithms. The best model (GAM) achieved a RMSE of 1.06 (± 0.06) × 10 3 cells/mm 2 and a relative deviation of 13.3 ± 0.8%. Prostate cell density can be quantitatively estimated non-invasively from mpMRI data using high-quality co-registered data at a voxel level. These cell density predictions could be used for tissue classification, treatment response evaluation and personalised radiotherapy.

  6. Machine learning predictions of molecular properties: Accurate many-body potentials and nonlocality in chemical space

    International Nuclear Information System (INIS)

    Hansen, Katja; Biegler, Franziska; Ramakrishnan, Raghunathan; Pronobis, Wiktor; Lilienfeld, O. Anatole von; Müller, Klaus-Robert; Tkatchenko, Alexandre

    2015-01-01

    Simultaneously accurate and efficient prediction of molecular properties throughout chemical compound space is a critical ingredient toward rational compound design in chemical and pharmaceutical industries. Aiming toward this goal, we develop and apply a systematic hierarchy of efficient empirical methods to estimate atomization and total energies of molecules. These methods range from a simple sum over atoms, to addition of bond energies, to pairwise interatomic force fields, reaching to the more sophisticated machine learning approaches that are capable of describing collective interactions between many atoms or bonds. In the case of equilibrium molecular geometries, even simple pairwise force fields demonstrate prediction accuracy comparable to benchmark energies calculated using density functional theory with hybrid exchange-correlation functionals; however, accounting for the collective many-body interactions proves to be essential for approaching the 'holy grail' of chemical accuracy of 1 kcal/mol for both equilibrium and out-of-equilibrium geometries. This remarkable accuracy is achieved by a vectorized representation of molecules (so-called Bag of Bonds model) that exhibits strong nonlocality in chemical space. The same representation allows us to predict accurate electronic properties of molecules, such as their polarizability and molecular frontier orbital energies

  7. Machine-Learning Research

    OpenAIRE

    Dietterich, Thomas G.

    1997-01-01

    Machine-learning research has been making great progress in many directions. This article summarizes four of these directions and discusses some current open problems. The four directions are (1) the improvement of classification accuracy by learning ensembles of classifiers, (2) methods for scaling up supervised learning algorithms, (3) reinforcement learning, and (4) the learning of complex stochastic models.

  8. Predicting activities of daily living for cancer patients using an ontology-guided machine learning methodology.

    Science.gov (United States)

    Min, Hua; Mobahi, Hedyeh; Irvin, Katherine; Avramovic, Sanja; Wojtusiak, Janusz

    2017-09-16

    Bio-ontologies are becoming increasingly important in knowledge representation and in the machine learning (ML) fields. This paper presents a ML approach that incorporates bio-ontologies and its application to the SEER-MHOS dataset to discover patterns of patient characteristics that impact the ability to perform activities of daily living (ADLs). Bio-ontologies are used to provide computable knowledge for ML methods to "understand" biomedical data. This retrospective study included 723 cancer patients from the SEER-MHOS dataset. Two ML methods were applied to create predictive models for ADL disabilities for the first year after a patient's cancer diagnosis. The first method is a standard rule learning algorithm; the second is that same algorithm additionally equipped with methods for reasoning with ontologies. The models showed that a patient's race, ethnicity, smoking preference, treatment plan and tumor characteristics including histology, staging, cancer site, and morphology were predictors for ADL performance levels one year after cancer diagnosis. The ontology-guided ML method was more accurate at predicting ADL performance levels (P ontologies. This study demonstrated that bio-ontologies can be harnessed to provide medical knowledge for ML algorithms. The presented method demonstrates that encoding specific types of hierarchical relationships to guide rule learning is possible, and can be extended to other types of semantic relationships present in biomedical ontologies. The ontology-guided ML method achieved better performance than the method without ontologies. The presented method can also be used to promote the effectiveness and efficiency of ML in healthcare, in which use of background knowledge and consistency with existing clinical expertise is critical.

  9. Machine Learning Algorithms For Predicting the Instability Timescales of Compact Planetary Systems

    Science.gov (United States)

    Tamayo, Daniel; Ali-Dib, Mohamad; Cloutier, Ryan; Huang, Chelsea; Van Laerhoven, Christa L.; Leblanc, Rejean; Menou, Kristen; Murray, Norman; Obertas, Alysa; Paradise, Adiv; Petrovich, Cristobal; Rachkov, Aleksandar; Rein, Hanno; Silburt, Ari; Tacik, Nick; Valencia, Diana

    2016-10-01

    The Kepler mission has uncovered hundreds of compact multi-planet systems. The dynamical pathways to instability in these compact systems and their associated timescales are not well understood theoretically. However, long-term stability is often used as a constraint to narrow down the space of orbital solutions from the transit data. This requires a large suite of N-body integrations that can each take several weeks to complete. This computational bottleneck is therefore an important limitation in our ability to characterize compact multi-planet systems.From suites of numerical simulations, previous studies have fit simple scaling relations between the instability timescale and various system parameters. However, the numerically simulated systems can deviate strongly from these empirical fits.We present a new approach to the problem using machine learning algorithms that have enjoyed success across a broad range of high-dimensional industry applications. In particular, we have generated large training sets of direct N-body integrations of synthetic compact planetary systems to train several regression models (support vector machine, gradient boost) that predict the instability timescale. We find that ensembling these models predicts the instability timescale of planetary systems better than previous approaches using the simple scaling relations mentioned above.Finally, we will discuss how these models provide a powerful tool for not only understanding the current Kepler multi-planet sample, but also for characterizing and shaping the radial-velocity follow-up strategies of multi-planet systems from the upcoming Transiting Exoplanet Survey Satellite (TESS) mission, given its shorter observation baselines.

  10. Evaluation of different machine learning models for predicting and mapping the susceptibility of gully erosion

    Science.gov (United States)

    Rahmati, Omid; Tahmasebipour, Nasser; Haghizadeh, Ali; Pourghasemi, Hamid Reza; Feizizadeh, Bakhtiar

    2017-12-01

    Gully erosion constitutes a serious problem for land degradation in a wide range of environments. The main objective of this research was to compare the performance of seven state-of-the-art machine learning models (SVM with four kernel types, BP-ANN, RF, and BRT) to model the occurrence of gully erosion in the Kashkan-Poldokhtar Watershed, Iran. In the first step, a gully inventory map consisting of 65 gully polygons was prepared through field surveys. Three different sample data sets (S1, S2, and S3), including both positive and negative cells (70% for training and 30% for validation), were randomly prepared to evaluate the robustness of the models. To model the gully erosion susceptibility, 12 geo-environmental factors were selected as predictors. Finally, the goodness-of-fit and prediction skill of the models were evaluated by different criteria, including efficiency percent, kappa coefficient, and the area under the ROC curves (AUC). In terms of accuracy, the RF, RBF-SVM, BRT, and P-SVM models performed excellently both in the degree of fitting and in predictive performance (AUC values well above 0.9), which resulted in accurate predictions. Therefore, these models can be used in other gully erosion studies, as they are capable of rapidly producing accurate and robust gully erosion susceptibility maps (GESMs) for decision-making and soil and water management practices. Furthermore, it was found that performance of RF and RBF-SVM for modelling gully erosion occurrence is quite stable when the learning and validation samples are changed.

  11. Sci-Fri AM: Quality, Safety, and Professional Issues 04: Predicting waiting times in Radiation Oncology using machine learning

    International Nuclear Information System (INIS)

    Joseph, Ackeem; Herrera, David; Hijal, Tarek; Hendren, Laurie; Leung, Alvin; Wainberg, Justin; Sawaf, Marya; Maxim, Gorshkov; Maglieri, Robert; Keshavarz, Mehryar; Kildea, John

    2016-01-01

    We describe a method for predicting waiting times in radiation oncology. Machine learning is a powerful predictive modelling tool that benefits from large, potentially complex, datasets. The essence of machine learning is to predict future outcomes by learning from previous experience. The patient waiting experience remains one of the most vexing challenges facing healthcare. Waiting time uncertainty can cause patients, who are already sick and in pain, to worry about when they will receive the care they need. In radiation oncology, patients typically experience three types of waiting: Waiting at home for their treatment plan to be prepared Waiting in the waiting room for daily radiotherapy Waiting in the waiting room to see a physician in consultation or follow-up These waiting periods are difficult for staff to predict and only rough estimates are typically provided, based on personal experience. In the present era of electronic health records, waiting times need not be so uncertain. At our centre, we have incorporated the electronic treatment records of all previously-treated patients into our machine learning model. We found that the Random Forest Regression model provides the best predictions for daily radiotherapy treatment waiting times (type 2). Using this model, we achieved a median residual (actual minus predicted value) of 0.25 minutes and a standard deviation residual of 6.5 minutes. The main features that generated the best fit model (from most to least significant) are: Allocated time, median past duration, fraction number and the number of treatment fields.

  12. Sci-Fri AM: Quality, Safety, and Professional Issues 04: Predicting waiting times in Radiation Oncology using machine learning

    Energy Technology Data Exchange (ETDEWEB)

    Joseph, Ackeem; Herrera, David; Hijal, Tarek; Hendren, Laurie; Leung, Alvin; Wainberg, Justin; Sawaf, Marya; Maxim, Gorshkov; Maglieri, Robert; Keshavarz, Mehryar; Kildea, John [McGill University Health Centre (Canada)

    2016-08-15

    We describe a method for predicting waiting times in radiation oncology. Machine learning is a powerful predictive modelling tool that benefits from large, potentially complex, datasets. The essence of machine learning is to predict future outcomes by learning from previous experience. The patient waiting experience remains one of the most vexing challenges facing healthcare. Waiting time uncertainty can cause patients, who are already sick and in pain, to worry about when they will receive the care they need. In radiation oncology, patients typically experience three types of waiting: Waiting at home for their treatment plan to be prepared Waiting in the waiting room for daily radiotherapy Waiting in the waiting room to see a physician in consultation or follow-up These waiting periods are difficult for staff to predict and only rough estimates are typically provided, based on personal experience. In the present era of electronic health records, waiting times need not be so uncertain. At our centre, we have incorporated the electronic treatment records of all previously-treated patients into our machine learning model. We found that the Random Forest Regression model provides the best predictions for daily radiotherapy treatment waiting times (type 2). Using this model, we achieved a median residual (actual minus predicted value) of 0.25 minutes and a standard deviation residual of 6.5 minutes. The main features that generated the best fit model (from most to least significant) are: Allocated time, median past duration, fraction number and the number of treatment fields.

  13. Machine Learning Methods to Predict Density Functional Theory B3LYP Energies of HOMO and LUMO Orbitals.

    Science.gov (United States)

    Pereira, Florbela; Xiao, Kaixia; Latino, Diogo A R S; Wu, Chengcheng; Zhang, Qingyou; Aires-de-Sousa, Joao

    2017-01-23

    Machine learning algorithms were explored for the fast estimation of HOMO and LUMO orbital energies calculated by DFT B3LYP, on the basis of molecular descriptors exclusively based on connectivity. The whole project involved the retrieval and generation of molecular structures, quantum chemical calculations for a database with >111 000 structures, development of new molecular descriptors, and training/validation of machine learning models. Several machine learning algorithms were screened, and an applicability domain was defined based on Euclidean distances to the training set. Random forest models predicted an external test set of 9989 compounds achieving mean absolute error (MAE) up to 0.15 and 0.16 eV for the HOMO and LUMO orbitals, respectively. The impact of the quantum chemical calculation protocol was assessed with a subset of compounds. Inclusion of the orbital energy calculated by PM7 as an additional descriptor significantly improved the quality of estimations (reducing the MAE in >30%).

  14. The Large Scale Machine Learning in an Artificial Society: Prediction of the Ebola Outbreak in Beijing

    Directory of Open Access Journals (Sweden)

    Peng Zhang

    2015-01-01

    Full Text Available Ebola virus disease (EVD distinguishes its feature as high infectivity and mortality. Thus, it is urgent for governments to draw up emergency plans against Ebola. However, it is hard to predict the possible epidemic situations in practice. Luckily, in recent years, computational experiments based on artificial society appeared, providing a new approach to study the propagation of EVD and analyze the corresponding interventions. Therefore, the rationality of artificial society is the key to the accuracy and reliability of experiment results. Individuals’ behaviors along with travel mode directly affect the propagation among individuals. Firstly, artificial Beijing is reconstructed based on geodemographics and machine learning is involved to optimize individuals’ behaviors. Meanwhile, Ebola course model and propagation model are built, according to the parameters in West Africa. Subsequently, propagation mechanism of EVD is analyzed, epidemic scenario is predicted, and corresponding interventions are presented. Finally, by simulating the emergency responses of Chinese government, the conclusion is finally drawn that Ebola is impossible to outbreak in large scale in the city of Beijing.

  15. Machine learning with R

    CERN Document Server

    Lantz, Brett

    2015-01-01

    Perhaps you already know a bit about machine learning but have never used R, or perhaps you know a little R but are new to machine learning. In either case, this book will get you up and running quickly. It would be helpful to have a bit of familiarity with basic programming concepts, but no prior experience is required.

  16. Cross-trial prediction of treatment outcome in depression: a machine learning approach.

    Science.gov (United States)

    Chekroud, Adam Mourad; Zotti, Ryan Joseph; Shehzad, Zarrar; Gueorguieva, Ralitza; Johnson, Marcia K; Trivedi, Madhukar H; Cannon, Tyrone D; Krystal, John Harrison; Corlett, Philip Robert

    2016-03-01

    Antidepressant treatment efficacy is low, but might be improved by matching patients to interventions. At present, clinicians have no empirically validated mechanisms to assess whether a patient with depression will respond to a specific antidepressant. We aimed to develop an algorithm to assess whether patients will achieve symptomatic remission from a 12-week course of citalopram. We used patient-reported data from patients with depression (n=4041, with 1949 completers) from level 1 of the Sequenced Treatment Alternatives to Relieve Depression (STAR*D; ClinicalTrials.gov, number NCT00021528) to identify variables that were most predictive of treatment outcome, and used these variables to train a machine-learning model to predict clinical remission. We externally validated the model in the escitalopram treatment group (n=151) of an independent clinical trial (Combining Medications to Enhance Depression Outcomes [COMED]; ClinicalTrials.gov, number NCT00590863). We identified 25 variables that were most predictive of treatment outcome from 164 patient-reportable variables, and used these to train the model. The model was internally cross-validated, and predicted outcomes in the STAR*D cohort with accuracy significantly above chance (64·6% [SD 3·2]; p<0·0001). The model was externally validated in the escitalopram treatment group (N=151) of COMED (accuracy 59·6%, p=0.043). The model also performed significantly above chance in a combined escitalopram-buproprion treatment group in COMED (n=134; accuracy 59·7%, p=0·023), but not in a combined venlafaxine-mirtazapine group (n=140; accuracy 51·4%, p=0·53), suggesting specificity of the model to underlying mechanisms. Building statistical models by mining existing clinical trial data can enable prospective identification of patients who are likely to respond to a specific antidepressant. Yale University. Copyright © 2016 Elsevier Ltd. All rights reserved.

  17. A novel machine learning model to predict abnormal Runway Occupancy Times and observe related precursors

    NARCIS (Netherlands)

    Herrema, Herrema Floris; Treve, V; Desart, B; Curran, R.; Visser, H.G.

    2017-01-01

    Accidents on the runway triggered the development and implementation of mitigation strategies. Therefore, the airline industry is moving toward proactive risk management, which aims to identify and predict risk percursors and to mitigate risks before accidents occur. For certain predictions Machine

  18. Prediction of Aerosol Optical Depth in West Asia: Machine Learning Methods versus Numerical Models

    Science.gov (United States)

    Omid Nabavi, Seyed; Haimberger, Leopold; Abbasi, Reyhaneh; Samimi, Cyrus

    2017-04-01

    Dust-prone areas of West Asia are releasing increasingly large amounts of dust particles during warm months. Because of the lack of ground-based observations in the region, this phenomenon is mainly monitored through remotely sensed aerosol products. The recent development of mesoscale Numerical Models (NMs) has offered an unprecedented opportunity to predict dust emission, and, subsequently Aerosol Optical Depth (AOD), at finer spatial and temporal resolutions. Nevertheless, the significant uncertainties in input data and simulations of dust activation and transport limit the performance of numerical models in dust prediction. The presented study aims to evaluate if machine-learning algorithms (MLAs), which require much less computational expense, can yield the same or even better performance than NMs. Deep blue (DB) AOD, which is observed by satellites but also predicted by MLAs and NMs, is used for validation. We concentrate our evaluations on the over dry Iraq plains, known as the main origin of recently intensified dust storms in West Asia. Here we examine the performance of four MLAs including Linear regression Model (LM), Support Vector Machine (SVM), Artificial Neural Network (ANN), Multivariate Adaptive Regression Splines (MARS). The Weather Research and Forecasting model coupled to Chemistry (WRF-Chem) and the Dust REgional Atmosphere Model (DREAM) are included as NMs. The MACC aerosol re-analysis of European Centre for Medium-range Weather Forecast (ECMWF) is also included, although it has assimilated satellite-based AOD data. Using the Recursive Feature Elimination (RFE) method, nine environmental features including soil moisture and temperature, NDVI, dust source function, albedo, dust uplift potential, vertical velocity, precipitation and 9-month SPEI drought index are selected for dust (AOD) modeling by MLAs. During the feature selection process, we noticed that NDVI and SPEI are of the highest importance in MLAs predictions. The data set was divided

  19. A Machine Learned Classifier That Uses Gene Expression Data to Accurately Predict Estrogen Receptor Status

    Science.gov (United States)

    Bastani, Meysam; Vos, Larissa; Asgarian, Nasimeh; Deschenes, Jean; Graham, Kathryn; Mackey, John; Greiner, Russell

    2013-01-01

    Background Selecting the appropriate treatment for breast cancer requires accurately determining the estrogen receptor (ER) status of the tumor. However, the standard for determining this status, immunohistochemical analysis of formalin-fixed paraffin embedded samples, suffers from numerous technical and reproducibility issues. Assessment of ER-status based on RNA expression can provide more objective, quantitative and reproducible test results. Methods To learn a parsimonious RNA-based classifier of hormone receptor status, we applied a machine learning tool to a training dataset of gene expression microarray data obtained from 176 frozen breast tumors, whose ER-status was determined by applying ASCO-CAP guidelines to standardized immunohistochemical testing of formalin fixed tumor. Results This produced a three-gene classifier that can predict the ER-status of a novel tumor, with a cross-validation accuracy of 93.17±2.44%. When applied to an independent validation set and to four other public databases, some on different platforms, this classifier obtained over 90% accuracy in each. In addition, we found that this prediction rule separated the patients' recurrence-free survival curves with a hazard ratio lower than the one based on the IHC analysis of ER-status. Conclusions Our efficient and parsimonious classifier lends itself to high throughput, highly accurate and low-cost RNA-based assessments of ER-status, suitable for routine high-throughput clinical use. This analytic method provides a proof-of-principle that may be applicable to developing effective RNA-based tests for other biomarkers and conditions. PMID:24312637

  20. A machine learned classifier that uses gene expression data to accurately predict estrogen receptor status.

    Directory of Open Access Journals (Sweden)

    Meysam Bastani

    Full Text Available BACKGROUND: Selecting the appropriate treatment for breast cancer requires accurately determining the estrogen receptor (ER status of the tumor. However, the standard for determining this status, immunohistochemical analysis of formalin-fixed paraffin embedded samples, suffers from numerous technical and reproducibility issues. Assessment of ER-status based on RNA expression can provide more objective, quantitative and reproducible test results. METHODS: To learn a parsimonious RNA-based classifier of hormone receptor status, we applied a machine learning tool to a training dataset of gene expression microarray data obtained from 176 frozen breast tumors, whose ER-status was determined by applying ASCO-CAP guidelines to standardized immunohistochemical testing of formalin fixed tumor. RESULTS: This produced a three-gene classifier that can predict the ER-status of a novel tumor, with a cross-validation accuracy of 93.17±2.44%. When applied to an independent validation set and to four other public databases, some on different platforms, this classifier obtained over 90% accuracy in each. In addition, we found that this prediction rule separated the patients' recurrence-free survival curves with a hazard ratio lower than the one based on the IHC analysis of ER-status. CONCLUSIONS: Our efficient and parsimonious classifier lends itself to high throughput, highly accurate and low-cost RNA-based assessments of ER-status, suitable for routine high-throughput clinical use. This analytic method provides a proof-of-principle that may be applicable to developing effective RNA-based tests for other biomarkers and conditions.

  1. Classification and Diagnostic Output Prediction of Cancer Using Gene Expression Profiling and Supervised Machine Learning Algorithms

    DEFF Research Database (Denmark)

    Yoo, C.; Gernaey, Krist

    2008-01-01

    importance in the projection (VIP) information of the DPLS method. The power of the gene selection method and the proposed supervised hierarchical clustering method is illustrated on a three microarray data sets of leukemia, breast, and colon cancer. Supervised machine learning algorithms thus enable...

  2. Improved method for SNR prediction in machine-learning-based test

    NARCIS (Netherlands)

    Sheng, Xiaoqin; Kerkhoff, Hans G.

    2010-01-01

    This paper applies an improved method for testing the signal-to-noise ratio (SNR) of Analogue-to-Digital Converters (ADC). In previous work, a noisy and nonlinear pulse signal is exploited as the input stimulus to obtain the signature results of ADC. By applying a machine-learning-based approach,

  3. A Mobile Health Application to Predict Postpartum Depression Based on Machine Learning.

    Science.gov (United States)

    Jiménez-Serrano, Santiago; Tortajada, Salvador; García-Gómez, Juan Miguel

    2015-07-01

    Postpartum depression (PPD) is a disorder that often goes undiagnosed. The development of a screening program requires considerable and careful effort, where evidence-based decisions have to be taken in order to obtain an effective test with a high level of sensitivity and an acceptable specificity that is quick to perform, easy to interpret, culturally sensitive, and cost-effective. The purpose of this article is twofold: first, to develop classification models for detecting the risk of PPD during the first week after childbirth, thus enabling early intervention; and second, to develop a mobile health (m-health) application (app) for the Android(®) (Google, Mountain View, CA) platform based on the model with best performance for both mothers who have just given birth and clinicians who want to monitor their patient's test. A set of predictive models for estimating the risk of PPD was trained using machine learning techniques and data about postpartum women collected from seven Spanish hospitals. An internal evaluation was carried out using a hold-out strategy. An easy flowchart and architecture for designing the graphical user interface of the m-health app was followed. Naive Bayes showed the best balance between sensitivity and specificity as a predictive model for PPD during the first week after delivery. It was integrated into the clinical decision support system for Android mobile apps. This approach can enable the early prediction and detection of PPD because it fulfills the conditions of an effective screening test with a high level of sensitivity and specificity that is quick to perform, easy to interpret, culturally sensitive, and cost-effective.

  4. Prediction and characterization of human ageing-related proteins by using machine learning.

    Science.gov (United States)

    Kerepesi, Csaba; Daróczy, Bálint; Sturm, Ádám; Vellai, Tibor; Benczúr, András

    2018-03-06

    Ageing has a huge impact on human health and economy, but its molecular basis - regulation and mechanism - is still poorly understood. By today, more than three hundred genes (almost all of them function as protein-coding genes) have been related to human ageing. Although individual ageing-related genes or some small subsets of these genes have been intensively studied, their analysis as a whole has been highly limited. To fill this gap, for each human protein we extracted 21000 protein features from various databases, and using these data as an input to state-of-the-art machine learning methods, we classified human proteins as ageing-related or non-ageing-related. We found a simple classification model based on only 36 protein features, such as the "number of ageing-related interaction partners", "response to oxidative stress", "damaged DNA binding", "rhythmic process" and "extracellular region". Predicted values of the model quantify the relevance of a given protein in the regulation or mechanisms of the human ageing process. Furthermore, we identified new candidate proteins having strong computational evidence of their important role in ageing. Some of them, like Cytochrome b-245 light chain (CY24A) and Endoribonuclease ZC3H12A (ZC12A) have no previous ageing-associated annotations.

  5. Predictions of new AB O3 perovskite compounds by combining machine learning and density functional theory

    Science.gov (United States)

    Balachandran, Prasanna V.; Emery, Antoine A.; Gubernatis, James E.; Lookman, Turab; Wolverton, Chris; Zunger, Alex

    2018-04-01

    We apply machine learning (ML) methods to a database of 390 experimentally reported A B O3 compounds to construct two statistical models that predict possible new perovskite materials and possible new cubic perovskites. The first ML model classified the 390 compounds into 254 perovskites and 136 that are not perovskites with a 90% average cross-validation (CV) accuracy; the second ML model further classified the perovskites into 22 known cubic perovskites and 232 known noncubic perovskites with a 94% average CV accuracy. We find that the most effective chemical descriptors affecting our classification include largely geometric constructs such as the A and B Shannon ionic radii, the tolerance and octahedral factors, the A -O and B -O bond length, and the A and B Villars' Mendeleev numbers. We then construct an additional list of 625 A B O3 compounds assembled from charge conserving combinations of A and B atoms absent from our list of known compounds. Then, using the two ML models constructed on the known compounds, we predict that 235 of the 625 exist in a perovskite structure with a confidence greater than 50% and among them that 20 exist in the cubic structure (albeit, the latter with only ˜50 % confidence). We find that the new perovskites are most likely to occur when the A and B atoms are a lanthanide or actinide, when the A atom is an alkali, alkali earth, or late transition metal atom, or when the B atom is a p -block atom. We also compare the ML findings with the density functional theory calculations and convex hull analyses in the Open Quantum Materials Database (OQMD), which predicts the T =0 K ground-state stability of all the A B O3 compounds. We find that OQMD predicts 186 of 254 of the perovskites in the experimental database to be thermodynamically stable within 100 meV/atom of the convex hull and predicts 87 of the 235 ML-predicted perovskite compounds to be thermodynamically stable within 100 meV/atom of the convex hull, including 6 of these to

  6. Machine Learning and Radiology

    Science.gov (United States)

    Wang, Shijun; Summers, Ronald M.

    2012-01-01

    In this paper, we give a short introduction to machine learning and survey its applications in radiology. We focused on six categories of applications in radiology: medical image segmentation, registration, computer aided detection and diagnosis, brain function or activity analysis and neurological disease diagnosis from fMR images, content-based image retrieval systems for CT or MRI images, and text analysis of radiology reports using natural language processing (NLP) and natural language understanding (NLU). This survey shows that machine learning plays a key role in many radiology applications. Machine learning identifies complex patterns automatically and helps radiologists make intelligent decisions on radiology data such as conventional radiographs, CT, MRI, and PET images and radiology reports. In many applications, the performance of machine learning-based automatic detection and diagnosis systems has shown to be comparable to that of a well-trained and experienced radiologist. Technology development in machine learning and radiology will benefit from each other in the long run. Key contributions and common characteristics of machine learning techniques in radiology are discussed. We also discuss the problem of translating machine learning applications to the radiology clinical setting, including advantages and potential barriers. PMID:22465077

  7. Machine learning and radiology.

    Science.gov (United States)

    Wang, Shijun; Summers, Ronald M

    2012-07-01

    In this paper, we give a short introduction to machine learning and survey its applications in radiology. We focused on six categories of applications in radiology: medical image segmentation, registration, computer aided detection and diagnosis, brain function or activity analysis and neurological disease diagnosis from fMR images, content-based image retrieval systems for CT or MRI images, and text analysis of radiology reports using natural language processing (NLP) and natural language understanding (NLU). This survey shows that machine learning plays a key role in many radiology applications. Machine learning identifies complex patterns automatically and helps radiologists make intelligent decisions on radiology data such as conventional radiographs, CT, MRI, and PET images and radiology reports. In many applications, the performance of machine learning-based automatic detection and diagnosis systems has shown to be comparable to that of a well-trained and experienced radiologist. Technology development in machine learning and radiology will benefit from each other in the long run. Key contributions and common characteristics of machine learning techniques in radiology are discussed. We also discuss the problem of translating machine learning applications to the radiology clinical setting, including advantages and potential barriers. Copyright © 2012. Published by Elsevier B.V.

  8. Creativity in Machine Learning

    OpenAIRE

    Thoma, Martin

    2016-01-01

    Recent machine learning techniques can be modified to produce creative results. Those results did not exist before; it is not a trivial combination of the data which was fed into the machine learning system. The obtained results come in multiple forms: As images, as text and as audio. This paper gives a high level overview of how they are created and gives some examples. It is meant to be a summary of the current work and give people who are new to machine learning some starting points.

  9. Local-learning-based neuron selection for grasping gesture prediction in motor brain machine interfaces

    Science.gov (United States)

    Xu, Kai; Wang, Yiwen; Wang, Yueming; Wang, Fang; Hao, Yaoyao; Zhang, Shaomin; Zhang, Qiaosheng; Chen, Weidong; Zheng, Xiaoxiang

    2013-04-01

    Objective. The high-dimensional neural recordings bring computational challenges to movement decoding in motor brain machine interfaces (mBMI), especially for portable applications. However, not all recorded neural activities relate to the execution of a certain movement task. This paper proposes to use a local-learning-based method to perform neuron selection for the gesture prediction in a reaching and grasping task. Approach. Nonlinear neural activities are decomposed into a set of linear ones in a weighted feature space. A margin is defined to measure the distance between inter-class and intra-class neural patterns. The weights, reflecting the importance of neurons, are obtained by minimizing a margin-based exponential error function. To find the most dominant neurons in the task, 1-norm regularization is introduced to the objective function for sparse weights, where near-zero weights indicate irrelevant neurons. Main results. The signals of only 10 neurons out of 70 selected by the proposed method could achieve over 95% of the full recording's decoding accuracy of gesture predictions, no matter which different decoding methods are used (support vector machine and K-nearest neighbor). The temporal activities of the selected neurons show visually distinguishable patterns associated with various hand states. Compared with other algorithms, the proposed method can better eliminate the irrelevant neurons with near-zero weights and provides the important neuron subset with the best decoding performance in statistics. The weights of important neurons converge usually within 10-20 iterations. In addition, we study the temporal and spatial variation of neuron importance along a period of one and a half months in the same task. A high decoding performance can be maintained by updating the neuron subset. Significance. The proposed algorithm effectively ascertains the neuronal importance without assuming any coding model and provides a high performance with different

  10. Synergistic target combination prediction from curated signaling networks: Machine learning meets systems biology and pharmacology.

    Science.gov (United States)

    Chua, Huey Eng; Bhowmick, Sourav S; Tucker-Kellogg, Lisa

    2017-10-01

    Given a signaling network, the target combination prediction problem aims to predict efficacious and safe target combinations for combination therapy. State-of-the-art in silico methods use Monte Carlo simulated annealing (mcsa) to modify a candidate solution stochastically, and use the Metropolis criterion to accept or reject the proposed modifications. However, such stochastic modifications ignore the impact of the choice of targets and their activities on the combination's therapeutic effect and off-target effects, which directly affect the solution quality. In this paper, we present mascot, a method that addresses this limitation by leveraging two additional heuristic criteria to minimize off-target effects and achieve synergy for candidate modification. Specifically, off-target effects measure the unintended response of a signaling network to the target combination and is often associated with toxicity. Synergy occurs when a pair of targets exerts effects that are greater than the sum of their individual effects, and is generally a beneficial strategy for maximizing effect while minimizing toxicity. mascot leverages on a machine learning-based target prioritization method which prioritizes potential targets in a given disease-associated network to select more effective targets (better therapeutic effect and/or lower off-target effects); and on Loewe additivity theory from pharmacology which assesses the non-additive effects in a combination drug treatment to select synergistic target activities. Our experimental study on two disease-related signaling networks demonstrates the superiority of mascot in comparison to existing approaches. Copyright © 2017 Elsevier Inc. All rights reserved.

  11. Analysis and design of machine learning techniques evolutionary solutions for regression, prediction, and control problems

    CERN Document Server

    Stalph, Patrick

    2014-01-01

    Manipulating or grasping objects seems like a trivial task for humans, as these are motor skills of everyday life. Nevertheless, motor skills are not easy to learn for humans and this is also an active research topic in robotics. However, most solutions are optimized for industrial applications and, thus, few are plausible explanations for human learning. The fundamental challenge, that motivates Patrick Stalph, originates from the cognitive science: How do humans learn their motor skills? The author makes a connection between robotics and cognitive sciences by analyzing motor skill learning using implementations that could be found in the human brain – at least to some extent. Therefore three suitable machine learning algorithms are selected – algorithms that are plausible from a cognitive viewpoint and feasible for the roboticist. The power and scalability of those algorithms is evaluated in theoretical simulations and more realistic scenarios with the iCub humanoid robot. Convincing results confirm the...

  12. Trip time prediction in mass transit companies. A machine learning approach

    OpenAIRE

    João M. Moreira; Alípio Jorge; Jorge Freire de Sousa; Carlos Soares

    2005-01-01

    In this paper we discuss how trip time prediction can be useful foroperational optimization in mass transit companies and which machine learningtechniques can be used to improve results. Firstly, we analyze which departmentsneed trip time prediction and when. Secondly, we review related work and thirdlywe present the analysis of trip time over a particular path. We proceed by presentingexperimental results conducted on real data with the forecasting techniques wefound most adequate, and concl...

  13. Machine learning techniques for optical communication system optimization

    DEFF Research Database (Denmark)

    Zibar, Darko; Wass, Jesper; Thrane, Jakob

    In this paper, machine learning techniques relevant to optical communication are presented and discussed. The focus is on applying machine learning tools to optical performance monitoring and performance prediction.......In this paper, machine learning techniques relevant to optical communication are presented and discussed. The focus is on applying machine learning tools to optical performance monitoring and performance prediction....

  14. Prediction of interactions between viral and host proteins using supervised machine learning methods.

    Directory of Open Access Journals (Sweden)

    Ranjan Kumar Barman

    Full Text Available BACKGROUND: Viral-host protein-protein interaction plays a vital role in pathogenesis, since it defines viral infection of the host and regulation of the host proteins. Identification of key viral-host protein-protein interactions (PPIs has great implication for therapeutics. METHODS: In this study, a systematic attempt has been made to predict viral-host PPIs by integrating different features, including domain-domain association, network topology and sequence information using viral-host PPIs from VirusMINT. The three well-known supervised machine learning methods, such as SVM, Naïve Bayes and Random Forest, which are commonly used in the prediction of PPIs, were employed to evaluate the performance measure based on five-fold cross validation techniques. RESULTS: Out of 44 descriptors, best features were found to be domain-domain association and methionine, serine and valine amino acid composition of viral proteins. In this study, SVM-based method achieved better sensitivity of 67% over Naïve Bayes (37.49% and Random Forest (55.66%. However the specificity of Naïve Bayes was the highest (99.52% as compared with SVM (74% and Random Forest (89.08%. Overall, the SVM and Random Forest achieved accuracy of 71% and 72.41%, respectively. The proposed SVM-based method was evaluated on blind dataset and attained a sensitivity of 64%, specificity of 83%, and accuracy of 74%. In addition, unknown potential targets of hepatitis B virus-human and hepatitis E virus-human PPIs have been predicted through proposed SVM model and validated by gene ontology enrichment analysis. Our proposed model shows that, hepatitis B virus "C protein" binds to membrane docking protein, while "X protein" and "P protein" interacts with cell-killing and metabolic process proteins, respectively. CONCLUSION: The proposed method can predict large scale interspecies viral-human PPIs. The nature and function of unknown viral proteins (HBV and HEV, interacting partners of host

  15. Combining structural modeling with ensemble machine learning to accurately predict protein fold stability and binding affinity effects upon mutation.

    Directory of Open Access Journals (Sweden)

    Niklas Berliner

    Full Text Available Advances in sequencing have led to a rapid accumulation of mutations, some of which are associated with diseases. However, to draw mechanistic conclusions, a biochemical understanding of these mutations is necessary. For coding mutations, accurate prediction of significant changes in either the stability of proteins or their affinity to their binding partners is required. Traditional methods have used semi-empirical force fields, while newer methods employ machine learning of sequence and structural features. Here, we show how combining both of these approaches leads to a marked boost in accuracy. We introduce ELASPIC, a novel ensemble machine learning approach that is able to predict stability effects upon mutation in both, domain cores and domain-domain interfaces. We combine semi-empirical energy terms, sequence conservation, and a wide variety of molecular details with a Stochastic Gradient Boosting of Decision Trees (SGB-DT algorithm. The accuracy of our predictions surpasses existing methods by a considerable margin, achieving correlation coefficients of 0.77 for stability, and 0.75 for affinity predictions. Notably, we integrated homology modeling to enable proteome-wide prediction and show that accurate prediction on modeled structures is possible. Lastly, ELASPIC showed significant differences between various types of disease-associated mutations, as well as between disease and common neutral mutations. Unlike pure sequence-based prediction methods that try to predict phenotypic effects of mutations, our predictions unravel the molecular details governing the protein instability, and help us better understand the molecular causes of diseases.

  16. Prediction Errors of Molecular Machine Learning Models Lower than Hybrid DFT Error.

    Science.gov (United States)

    Faber, Felix A; Hutchison, Luke; Huang, Bing; Gilmer, Justin; Schoenholz, Samuel S; Dahl, George E; Vinyals, Oriol; Kearnes, Steven; Riley, Patrick F; von Lilienfeld, O Anatole

    2017-11-14

    We investigate the impact of choosing regressors and molecular representations for the construction of fast machine learning (ML) models of 13 electronic ground-state properties of organic molecules. The performance of each regressor/representation/property combination is assessed using learning curves which report out-of-sample errors as a function of training set size with up to ∼118k distinct molecules. Molecular structures and properties at the hybrid density functional theory (DFT) level of theory come from the QM9 database [ Ramakrishnan et al. Sci. Data 2014 , 1 , 140022 ] and include enthalpies and free energies of atomization, HOMO/LUMO energies and gap, dipole moment, polarizability, zero point vibrational energy, heat capacity, and the highest fundamental vibrational frequency. Various molecular representations have been studied (Coulomb matrix, bag of bonds, BAML and ECFP4, molecular graphs (MG)), as well as newly developed distribution based variants including histograms of distances (HD), angles (HDA/MARAD), and dihedrals (HDAD). Regressors include linear models (Bayesian ridge regression (BR) and linear regression with elastic net regularization (EN)), random forest (RF), kernel ridge regression (KRR), and two types of neural networks, graph convolutions (GC) and gated graph networks (GG). Out-of sample errors are strongly dependent on the choice of representation and regressor and molecular property. Electronic properties are typically best accounted for by MG and GC, while energetic properties are better described by HDAD and KRR. The specific combinations with the lowest out-of-sample errors in the ∼118k training set size limit are (free) energies and enthalpies of atomization (HDAD/KRR), HOMO/LUMO eigenvalue and gap (MG/GC), dipole moment (MG/GC), static polarizability (MG/GG), zero point vibrational energy (HDAD/KRR), heat capacity at room temperature (HDAD/KRR), and highest fundamental vibrational frequency (BAML/RF). We present numerical

  17. Can machine learning on learner analytics produce a predictive model on student performance?

    OpenAIRE

    Busch, John; Hanna, Philip; O'Neill, Ian; McGowan, Aidan; Collins, Matthew

    2017-01-01

    The aim of this research is to analysis past student learner analytics using machine learning algorithms that had undertaken a web development and programming module. By specifically using the access and error web server logs from each student web server it provides a deeper learner analytic data. The web server logs every web file access and error access from a browser so in turn each data file can directly relate to a student's engagement level and assessment strategy. Each log holds severa...

  18. Fault Tolerance Automotive Air-Ratio Control Using Extreme Learning Machine Model Predictive Controller

    OpenAIRE

    Pak Kin Wong; Hang Cheong Wong; Chi Man Vong; Tong Meng Iong; Ka In Wong; Xianghui Gao

    2015-01-01

    Effective air-ratio control is desirable to maintain the best engine performance. However, traditional air-ratio control assumes the lambda sensor located at the tail pipe works properly and relies strongly on the air-ratio feedback signal measured by the lambda sensor. When the sensor is warming up during cold start or under failure, the traditional air-ratio control no longer works. To address this issue, this paper utilizes an advanced modelling technique, kernel extreme learning machine (...

  19. Robustness and prediction accuracy of machine learning for objective visual quality assessment

    OpenAIRE

    HINES, ANDREW

    2014-01-01

    PUBLISHED Lisbon, Portugal Machine Learning (ML) is a powerful tool to support the development of objective visual quality assessment metrics, serving as a substitute model for the perceptual mechanisms acting in visual quality appreciation. Nevertheless, the reli- ability of ML-based techniques within objective quality as- sessment metrics is often questioned. In this study, the ro- bustness of ML in supporting objective quality assessment is investigated, specific...

  20. Prediction of breast cancer risk using a machine learning approach embedded with a locality preserving projection algorithm

    Science.gov (United States)

    Heidari, Morteza; Zargari Khuzani, Abolfazl; Hollingsworth, Alan B.; Danala, Gopichandh; Mirniaharikandehei, Seyedehnafiseh; Qiu, Yuchen; Liu, Hong; Zheng, Bin

    2018-02-01

    In order to automatically identify a set of effective mammographic image features and build an optimal breast cancer risk stratification model, this study aims to investigate advantages of applying a machine learning approach embedded with a locally preserving projection (LPP) based feature combination and regeneration algorithm to predict short-term breast cancer risk. A dataset involving negative mammograms acquired from 500 women was assembled. This dataset was divided into two age-matched classes of 250 high risk cases in which cancer was detected in the next subsequent mammography screening and 250 low risk cases, which remained negative. First, a computer-aided image processing scheme was applied to segment fibro-glandular tissue depicted on mammograms and initially compute 44 features related to the bilateral asymmetry of mammographic tissue density distribution between left and right breasts. Next, a multi-feature fusion based machine learning classifier was built to predict the risk of cancer detection in the next mammography screening. A leave-one-case-out (LOCO) cross-validation method was applied to train and test the machine learning classifier embedded with a LLP algorithm, which generated a new operational vector with 4 features using a maximal variance approach in each LOCO process. Results showed a 9.7% increase in risk prediction accuracy when using this LPP-embedded machine learning approach. An increased trend of adjusted odds ratios was also detected in which odds ratios increased from 1.0 to 11.2. This study demonstrated that applying the LPP algorithm effectively reduced feature dimensionality, and yielded higher and potentially more robust performance in predicting short-term breast cancer risk.

  1. Machine learning and statistical techniques : an application to the prediction of insolvency in Spanish non-life insurance companies

    OpenAIRE

    Díaz, Zuleyka; Segovia, María Jesús; Fernández, José

    2005-01-01

    Prediction of insurance companies insolvency has arisen as an important problem in the field of financial research. Most methods applied in the past to tackle this issue are traditional statistical techniques which use financial ratios as explicative variables. However, these variables often do not satisfy statistical assumptions, which complicates the application of the mentioned methods. In this paper, a comparative study of the performance of two non-parametric machine learning techniques ...

  2. Applying a machine learning model using a locally preserving projection based feature regeneration algorithm to predict breast cancer risk

    Science.gov (United States)

    Heidari, Morteza; Zargari Khuzani, Abolfazl; Danala, Gopichandh; Mirniaharikandehei, Seyedehnafiseh; Qian, Wei; Zheng, Bin

    2018-03-01

    Both conventional and deep machine learning has been used to develop decision-support tools applied in medical imaging informatics. In order to take advantages of both conventional and deep learning approach, this study aims to investigate feasibility of applying a locally preserving projection (LPP) based feature regeneration algorithm to build a new machine learning classifier model to predict short-term breast cancer risk. First, a computer-aided image processing scheme was used to segment and quantify breast fibro-glandular tissue volume. Next, initially computed 44 image features related to the bilateral mammographic tissue density asymmetry were extracted. Then, an LLP-based feature combination method was applied to regenerate a new operational feature vector using a maximal variance approach. Last, a k-nearest neighborhood (KNN) algorithm based machine learning classifier using the LPP-generated new feature vectors was developed to predict breast cancer risk. A testing dataset involving negative mammograms acquired from 500 women was used. Among them, 250 were positive and 250 remained negative in the next subsequent mammography screening. Applying to this dataset, LLP-generated feature vector reduced the number of features from 44 to 4. Using a leave-onecase-out validation method, area under ROC curve produced by the KNN classifier significantly increased from 0.62 to 0.68 (p breast cancer detected in the next subsequent mammography screening.

  3. A Machine Learning Ensemble Classifier for Early Prediction of Diabetic Retinopathy.

    Science.gov (United States)

    S K, Somasundaram; P, Alli

    2017-11-09

    The main complication of diabetes is Diabetic retinopathy (DR), retinal vascular disease and it leads to the blindness. Regular screening for early DR disease detection is considered as an intensive labor and resource oriented task. Therefore, automatic detection of DR diseases is performed only by using the computational technique is the great solution. An automatic method is more reliable to determine the presence of an abnormality in Fundus images (FI) but, the classification process is poorly performed. Recently, few research works have been designed for analyzing texture discrimination capacity in FI to distinguish the healthy images. However, the feature extraction (FE) process was not performed well, due to the high dimensionality. Therefore, to identify retinal features for DR disease diagnosis and early detection using Machine Learning and Ensemble Classification method, called, Machine Learning Bagging Ensemble Classifier (ML-BEC) is designed. The ML-BEC method comprises of two stages. The first stage in ML-BEC method comprises extraction of the candidate objects from Retinal Images (RI). The candidate objects or the features for DR disease diagnosis include blood vessels, optic nerve, neural tissue, neuroretinal rim, optic disc size, thickness and variance. These features are initially extracted by applying Machine Learning technique called, t-distributed Stochastic Neighbor Embedding (t-SNE). Besides, t-SNE generates a probability distribution across high-dimensional images where the images are separated into similar and dissimilar pairs. Then, t-SNE describes a similar probability distribution across the points in the low-dimensional map. This lessens the Kullback-Leibler divergence among two distributions regarding the locations of the points on the map. The second stage comprises of application of ensemble classifiers to the extracted features for providing accurate analysis of digital FI using machine learning. In this stage, an automatic detection

  4. A Hybrid Short-Term Traffic Flow Prediction Model Based on Singular Spectrum Analysis and Kernel Extreme Learning Machine.

    Directory of Open Access Journals (Sweden)

    Qiang Shang

    Full Text Available Short-term traffic flow prediction is one of the most important issues in the field of intelligent transport system (ITS. Because of the uncertainty and nonlinearity, short-term traffic flow prediction is a challenging task. In order to improve the accuracy of short-time traffic flow prediction, a hybrid model (SSA-KELM is proposed based on singular spectrum analysis (SSA and kernel extreme learning machine (KELM. SSA is used to filter out the noise of traffic flow time series. Then, the filtered traffic flow data is used to train KELM model, the optimal input form of the proposed model is determined by phase space reconstruction, and parameters of the model are optimized by gravitational search algorithm (GSA. Finally, case validation is carried out using the measured data of an expressway in Xiamen, China. And the SSA-KELM model is compared with several well-known prediction models, including support vector machine, extreme learning machine, and single KLEM model. The experimental results demonstrate that performance of the proposed model is superior to that of the comparison models. Apart from accuracy improvement, the proposed model is more robust.

  5. Feature selection in wind speed prediction systems based on a hybrid coral reefs optimization – Extreme learning machine approach

    International Nuclear Information System (INIS)

    Salcedo-Sanz, S.; Pastor-Sánchez, A.; Prieto, L.; Blanco-Aguilera, A.; García-Herrera, R.

    2014-01-01

    Highlights: • A novel approach for short-term wind speed prediction is presented. • The system is formed by a coral reefs optimization algorithm and an extreme learning machine. • Feature selection is carried out with the CRO to improve the ELM performance. • The method is tested in real wind farm data in USA, for the period 2007–2008. - Abstract: This paper presents a novel approach for short-term wind speed prediction based on a Coral Reefs Optimization algorithm (CRO) and an Extreme Learning Machine (ELM), using meteorological predictive variables from a physical model (the Weather Research and Forecast model, WRF). The approach is based on a Feature Selection Problem (FSP) carried out with the CRO, that must obtain a reduced number of predictive variables out of the total available from the WRF. This set of features will be the input of an ELM, that finally provides the wind speed prediction. The CRO is a novel bio-inspired approach, based on the simulation of reef formation and coral reproduction, able to obtain excellent results in optimization problems. On the other hand, the ELM is a new paradigm in neural networks’ training, that provides a robust and extremely fast training of the network. Together, these algorithms are able to successfully solve this problem of feature selection in short-term wind speed prediction. Experiments in a real wind farm in the USA show the excellent performance of the CRO–ELM approach in this FSP wind speed prediction problem

  6. Comparison of machine-learning algorithms to build a predictive model for detecting undiagnosed diabetes - ELSA-Brasil: accuracy study.

    Science.gov (United States)

    Olivera, André Rodrigues; Roesler, Valter; Iochpe, Cirano; Schmidt, Maria Inês; Vigo, Álvaro; Barreto, Sandhi Maria; Duncan, Bruce Bartholow

    2017-01-01

    Type 2 diabetes is a chronic disease associated with a wide range of serious health complications that have a major impact on overall health. The aims here were to develop and validate predictive models for detecting undiagnosed diabetes using data from the Longitudinal Study of Adult Health (ELSA-Brasil) and to compare the performance of different machine-learning algorithms in this task. Comparison of machine-learning algorithms to develop predictive models using data from ELSA-Brasil. After selecting a subset of 27 candidate variables from the literature, models were built and validated in four sequential steps: (i) parameter tuning with tenfold cross-validation, repeated three times; (ii) automatic variable selection using forward selection, a wrapper strategy with four different machine-learning algorithms and tenfold cross-validation (repeated three times), to evaluate each subset of variables; (iii) error estimation of model parameters with tenfold cross-validation, repeated ten times; and (iv) generalization testing on an independent dataset. The models were created with the following machine-learning algorithms: logistic regression, artificial neural network, naïve Bayes, K-nearest neighbor and random forest. The best models were created using artificial neural networks and logistic regression. -These achieved mean areas under the curve of, respectively, 75.24% and 74.98% in the error estimation step and 74.17% and 74.41% in the generalization testing step. Most of the predictive models produced similar results, and demonstrated the feasibility of identifying individuals with highest probability of having undiagnosed diabetes, through easily-obtained clinical data.

  7. Visible Machine Learning for Biomedicine.

    Science.gov (United States)

    Yu, Michael K; Ma, Jianzhu; Fisher, Jasmin; Kreisberg, Jason F; Raphael, Benjamin J; Ideker, Trey

    2018-06-14

    A major ambition of artificial intelligence lies in translating patient data to successful therapies. Machine learning models face particular challenges in biomedicine, however, including handling of extreme data heterogeneity and lack of mechanistic insight into predictions. Here, we argue for "visible" approaches that guide model structure with experimental biology. Copyright © 2018. Published by Elsevier Inc.

  8. Machine learning systems

    Energy Technology Data Exchange (ETDEWEB)

    Forsyth, R

    1984-05-01

    With the dramatic rise of expert systems has come a renewed interest in the fuel that drives them-knowledge. For it is specialist knowledge which gives expert systems their power. But extracting knowledge from human experts in symbolic form has proved arduous and labour-intensive. So the idea of machine learning is enjoying a renaissance. Machine learning is any automatic improvement in the performance of a computer system over time, as a result of experience. Thus a learning algorithm seeks to do one or more of the following: cover a wider range of problems, deliver more accurate solutions, obtain answers more cheaply, and simplify codified knowledge. 6 references.

  9. Gaussian processes for machine learning.

    Science.gov (United States)

    Seeger, Matthias

    2004-04-01

    Gaussian processes (GPs) are natural generalisations of multivariate Gaussian random variables to infinite (countably or continuous) index sets. GPs have been applied in a large number of fields to a diverse range of ends, and very many deep theoretical analyses of various properties are available. This paper gives an introduction to Gaussian processes on a fairly elementary level with special emphasis on characteristics relevant in machine learning. It draws explicit connections to branches such as spline smoothing models and support vector machines in which similar ideas have been investigated. Gaussian process models are routinely used to solve hard machine learning problems. They are attractive because of their flexible non-parametric nature and computational simplicity. Treated within a Bayesian framework, very powerful statistical methods can be implemented which offer valid estimates of uncertainties in our predictions and generic model selection procedures cast as nonlinear optimization problems. Their main drawback of heavy computational scaling has recently been alleviated by the introduction of generic sparse approximations.13,78,31 The mathematical literature on GPs is large and often uses deep concepts which are not required to fully understand most machine learning applications. In this tutorial paper, we aim to present characteristics of GPs relevant to machine learning and to show up precise connections to other "kernel machines" popular in the community. Our focus is on a simple presentation, but references to more detailed sources are provided.

  10. A review of machine learning methods to predict the solubility of overexpressed recombinant proteins in Escherichia coli.

    Science.gov (United States)

    Habibi, Narjeskhatoon; Mohd Hashim, Siti Z; Norouzi, Alireza; Samian, Mohammed Razip

    2014-05-08

    Over the last 20 years in biotechnology, the production of recombinant proteins has been a crucial bioprocess in both biopharmaceutical and research arena in terms of human health, scientific impact and economic volume. Although logical strategies of genetic engineering have been established, protein overexpression is still an art. In particular, heterologous expression is often hindered by low level of production and frequent fail due to opaque reasons. The problem is accentuated because there is no generic solution available to enhance heterologous overexpression. For a given protein, the extent of its solubility can indicate the quality of its function. Over 30% of synthesized proteins are not soluble. In certain experimental circumstances, including temperature, expression host, etc., protein solubility is a feature eventually defined by its sequence. Until now, numerous methods based on machine learning are proposed to predict the solubility of protein merely from its amino acid sequence. In spite of the 20 years of research on the matter, no comprehensive review is available on the published methods. This paper presents an extensive review of the existing models to predict protein solubility in Escherichia coli recombinant protein overexpression system. The models are investigated and compared regarding the datasets used, features, feature selection methods, machine learning techniques and accuracy of prediction. A discussion on the models is provided at the end. This study aims to investigate extensively the machine learning based methods to predict recombinant protein solubility, so as to offer a general as well as a detailed understanding for researches in the field. Some of the models present acceptable prediction performances and convenient user interfaces. These models can be considered as valuable tools to predict recombinant protein overexpression results before performing real laboratory experiments, thus saving labour, time and cost.

  11. Contaminant dispersion prediction and source estimation with integrated Gaussian-machine learning network model for point source emission in atmosphere

    Energy Technology Data Exchange (ETDEWEB)

    Ma, Denglong [Fuli School of Food Equipment Engineering and Science, Xi’an Jiaotong University, No.28 Xianning West Road, Xi’an 710049 (China); Zhang, Zaoxiao, E-mail: zhangzx@mail.xjtu.edu.cn [State Key Laboratory of Multiphase Flow in Power Engineering, Xi’an Jiaotong University, No.28 Xianning West Road, Xi’an 710049 (China); School of Chemical Engineering and Technology, Xi’an Jiaotong University, No.28 Xianning West Road, Xi’an 710049 (China)

    2016-07-05

    Highlights: • The intelligent network models were built to predict contaminant gas concentrations. • The improved network models coupled with Gaussian dispersion model were presented. • New model has high efficiency and accuracy for concentration prediction. • New model were applied to indentify the leakage source with satisfied results. - Abstract: Gas dispersion model is important for predicting the gas concentrations when contaminant gas leakage occurs. Intelligent network models such as radial basis function (RBF), back propagation (BP) neural network and support vector machine (SVM) model can be used for gas dispersion prediction. However, the prediction results from these network models with too many inputs based on original monitoring parameters are not in good agreement with the experimental data. Then, a new series of machine learning algorithms (MLA) models combined classic Gaussian model with MLA algorithm has been presented. The prediction results from new models are improved greatly. Among these models, Gaussian-SVM model performs best and its computation time is close to that of classic Gaussian dispersion model. Finally, Gaussian-MLA models were applied to identifying the emission source parameters with the particle swarm optimization (PSO) method. The estimation performance of PSO with Gaussian-MLA is better than that with Gaussian, Lagrangian stochastic (LS) dispersion model and network models based on original monitoring parameters. Hence, the new prediction model based on Gaussian-MLA is potentially a good method to predict contaminant gas dispersion as well as a good forward model in emission source parameters identification problem.

  12. Contaminant dispersion prediction and source estimation with integrated Gaussian-machine learning network model for point source emission in atmosphere

    International Nuclear Information System (INIS)

    Ma, Denglong; Zhang, Zaoxiao

    2016-01-01

    Highlights: • The intelligent network models were built to predict contaminant gas concentrations. • The improved network models coupled with Gaussian dispersion model were presented. • New model has high efficiency and accuracy for concentration prediction. • New model were applied to indentify the leakage source with satisfied results. - Abstract: Gas dispersion model is important for predicting the gas concentrations when contaminant gas leakage occurs. Intelligent network models such as radial basis function (RBF), back propagation (BP) neural network and support vector machine (SVM) model can be used for gas dispersion prediction. However, the prediction results from these network models with too many inputs based on original monitoring parameters are not in good agreement with the experimental data. Then, a new series of machine learning algorithms (MLA) models combined classic Gaussian model with MLA algorithm has been presented. The prediction results from new models are improved greatly. Among these models, Gaussian-SVM model performs best and its computation time is close to that of classic Gaussian dispersion model. Finally, Gaussian-MLA models were applied to identifying the emission source parameters with the particle swarm optimization (PSO) method. The estimation performance of PSO with Gaussian-MLA is better than that with Gaussian, Lagrangian stochastic (LS) dispersion model and network models based on original monitoring parameters. Hence, the new prediction model based on Gaussian-MLA is potentially a good method to predict contaminant gas dispersion as well as a good forward model in emission source parameters identification problem.

  13. Predicting diabetes mellitus using SMOTE and ensemble machine learning approach: The Henry Ford ExercIse Testing (FIT) project.

    Science.gov (United States)

    Alghamdi, Manal; Al-Mallah, Mouaz; Keteyian, Steven; Brawner, Clinton; Ehrman, Jonathan; Sakr, Sherif

    2017-01-01

    Machine learning is becoming a popular and important approach in the field of medical research. In this study, we investigate the relative performance of various machine learning methods such as Decision Tree, Naïve Bayes, Logistic Regression, Logistic Model Tree and Random Forests for predicting incident diabetes using medical records of cardiorespiratory fitness. In addition, we apply different techniques to uncover potential predictors of diabetes. This FIT project study used data of 32,555 patients who are free of any known coronary artery disease or heart failure who underwent clinician-referred exercise treadmill stress testing at Henry Ford Health Systems between 1991 and 2009 and had a complete 5-year follow-up. At the completion of the fifth year, 5,099 of those patients have developed diabetes. The dataset contained 62 attributes classified into four categories: demographic characteristics, disease history, medication use history, and stress test vital signs. We developed an Ensembling-based predictive model using 13 attributes that were selected based on their clinical importance, Multiple Linear Regression, and Information Gain Ranking methods. The negative effect of the imbalance class of the constructed model was handled by Synthetic Minority Oversampling Technique (SMOTE). The overall performance of the predictive model classifier was improved by the Ensemble machine learning approach using the Vote method with three Decision Trees (Naïve Bayes Tree, Random Forest, and Logistic Model Tree) and achieved high accuracy of prediction (AUC = 0.92). The study shows the potential of ensembling and SMOTE approaches for predicting incident diabetes using cardiorespiratory fitness data.

  14. Predicting diabetes mellitus using SMOTE and ensemble machine learning approach: The Henry Ford ExercIse Testing (FIT project.

    Directory of Open Access Journals (Sweden)

    Manal Alghamdi

    Full Text Available Machine learning is becoming a popular and important approach in the field of medical research. In this study, we investigate the relative performance of various machine learning methods such as Decision Tree, Naïve Bayes, Logistic Regression, Logistic Model Tree and Random Forests for predicting incident diabetes using medical records of cardiorespiratory fitness. In addition, we apply different techniques to uncover potential predictors of diabetes. This FIT project study used data of 32,555 patients who are free of any known coronary artery disease or heart failure who underwent clinician-referred exercise treadmill stress testing at Henry Ford Health Systems between 1991 and 2009 and had a complete 5-year follow-up. At the completion of the fifth year, 5,099 of those patients have developed diabetes. The dataset contained 62 attributes classified into four categories: demographic characteristics, disease history, medication use history, and stress test vital signs. We developed an Ensembling-based predictive model using 13 attributes that were selected based on their clinical importance, Multiple Linear Regression, and Information Gain Ranking methods. The negative effect of the imbalance class of the constructed model was handled by Synthetic Minority Oversampling Technique (SMOTE. The overall performance of the predictive model classifier was improved by the Ensemble machine learning approach using the Vote method with three Decision Trees (Naïve Bayes Tree, Random Forest, and Logistic Model Tree and achieved high accuracy of prediction (AUC = 0.92. The study shows the potential of ensembling and SMOTE approaches for predicting incident diabetes using cardiorespiratory fitness data.

  15. HUMAN DECISIONS AND MACHINE PREDICTIONS.

    Science.gov (United States)

    Kleinberg, Jon; Lakkaraju, Himabindu; Leskovec, Jure; Ludwig, Jens; Mullainathan, Sendhil

    2018-02-01

    Can machine learning improve human decision making? Bail decisions provide a good test case. Millions of times each year, judges make jail-or-release decisions that hinge on a prediction of what a defendant would do if released. The concreteness of the prediction task combined with the volume of data available makes this a promising machine-learning application. Yet comparing the algorithm to judges proves complicated. First, the available data are generated by prior judge decisions. We only observe crime outcomes for released defendants, not for those judges detained. This makes it hard to evaluate counterfactual decision rules based on algorithmic predictions. Second, judges may have a broader set of preferences than the variable the algorithm predicts; for instance, judges may care specifically about violent crimes or about racial inequities. We deal with these problems using different econometric strategies, such as quasi-random assignment of cases to judges. Even accounting for these concerns, our results suggest potentially large welfare gains: one policy simulation shows crime reductions up to 24.7% with no change in jailing rates, or jailing rate reductions up to 41.9% with no increase in crime rates. Moreover, all categories of crime, including violent crimes, show reductions; and these gains can be achieved while simultaneously reducing racial disparities. These results suggest that while machine learning can be valuable, realizing this value requires integrating these tools into an economic framework: being clear about the link between predictions and decisions; specifying the scope of payoff functions; and constructing unbiased decision counterfactuals. JEL Codes: C10 (Econometric and statistical methods and methodology), C55 (Large datasets: Modeling and analysis), K40 (Legal procedure, the legal system, and illegal behavior).

  16. Machine Learning in Medicine.

    Science.gov (United States)

    Deo, Rahul C

    2015-11-17

    Spurred by advances in processing power, memory, storage, and an unprecedented wealth of data, computers are being asked to tackle increasingly complex learning tasks, often with astonishing success. Computers have now mastered a popular variant of poker, learned the laws of physics from experimental data, and become experts in video games - tasks that would have been deemed impossible not too long ago. In parallel, the number of companies centered on applying complex data analysis to varying industries has exploded, and it is thus unsurprising that some analytic companies are turning attention to problems in health care. The purpose of this review is to explore what problems in medicine might benefit from such learning approaches and use examples from the literature to introduce basic concepts in machine learning. It is important to note that seemingly large enough medical data sets and adequate learning algorithms have been available for many decades, and yet, although there are thousands of papers applying machine learning algorithms to medical data, very few have contributed meaningfully to clinical care. This lack of impact stands in stark contrast to the enormous relevance of machine learning to many other industries. Thus, part of my effort will be to identify what obstacles there may be to changing the practice of medicine through statistical learning approaches, and discuss how these might be overcome. © 2015 American Heart Association, Inc.

  17. Machine Learning in Medicine

    Science.gov (United States)

    Deo, Rahul C.

    2015-01-01

    Spurred by advances in processing power, memory, storage, and an unprecedented wealth of data, computers are being asked to tackle increasingly complex learning tasks, often with astonishing success. Computers have now mastered a popular variant of poker, learned the laws of physics from experimental data, and become experts in video games – tasks which would have been deemed impossible not too long ago. In parallel, the number of companies centered on applying complex data analysis to varying industries has exploded, and it is thus unsurprising that some analytic companies are turning attention to problems in healthcare. The purpose of this review is to explore what problems in medicine might benefit from such learning approaches and use examples from the literature to introduce basic concepts in machine learning. It is important to note that seemingly large enough medical data sets and adequate learning algorithms have been available for many decades – and yet, although there are thousands of papers applying machine learning algorithms to medical data, very few have contributed meaningfully to clinical care. This lack of impact stands in stark contrast to the enormous relevance of machine learning to many other industries. Thus part of my effort will be to identify what obstacles there may be to changing the practice of medicine through statistical learning approaches, and discuss how these might be overcome. PMID:26572668

  18. Predicting ionizing radiation exposure using biochemically-inspired genomic machine learning [version 1; referees: 3 approved

    Directory of Open Access Journals (Sweden)

    Jonathan Z.L. Zhao

    2018-02-01

    Full Text Available Background: Gene signatures derived from transcriptomic data using machine learning methods have shown promise for biodosimetry testing. These signatures may not be sufficiently robust for large scale testing, as their performance has not been adequately validated on external, independent datasets. The present study develops human and murine signatures with biochemically-inspired machine learning that are strictly validated using k-fold and traditional approaches. Methods: Gene Expression Omnibus (GEO datasets of exposed human and murine lymphocytes were preprocessed via nearest neighbor imputation and expression of genes implicated in the literature to be responsive to radiation exposure (n=998 were then ranked by Minimum Redundancy Maximum Relevance (mRMR. Optimal signatures were derived by backward, complete, and forward sequential feature selection using Support Vector Machines (SVM, and validated using k-fold or traditional validation on independent datasets. Results: The best human signatures we derived exhibit k-fold validation accuracies of up to 98% (DDB2,  PRKDC, TPP2, PTPRE, and GADD45A when validated over 209 samples and traditional validation accuracies of up to 92% (DDB2,  CD8A,  TALDO1,  PCNA,  EIF4G2,  LCN2,  CDKN1A,  PRKCH,  ENO1,  and PPM1D when validated over 85 samples. Some human signatures are specific enough to differentiate between chemotherapy and radiotherapy. Certain multi-class murine signatures have sufficient granularity in dose estimation to inform eligibility for cytokine therapy (assuming these signatures could be translated to humans. We compiled a list of the most frequently appearing genes in the top 20 human and mouse signatures. More frequently appearing genes among an ensemble of signatures may indicate greater impact of these genes on the performance of individual signatures. Several genes in the signatures we derived are present in previously proposed signatures. Conclusions: Gene signatures

  19. Laser Direct Metal Deposition of 2024 Al Alloy: Trace Geometry Prediction via Machine Learning.

    Science.gov (United States)

    Caiazzo, Fabrizia; Caggiano, Alessandra

    2018-03-19

    Laser direct metal deposition is an advanced additive manufacturing technology suitably applicable in maintenance, repair, and overhaul of high-cost products, allowing for minimal distortion of the workpiece, reduced heat affected zones, and superior surface quality. Special interest is growing for the repair and coating of 2024 aluminum alloy parts, extensively utilized for a wide range of applications in the automotive, military, and aerospace sectors due to its excellent plasticity, corrosion resistance, electric conductivity, and strength-to-weight ratio. A critical issue in the laser direct metal deposition process is related to the geometrical parameters of the cross-section of the deposited metal trace that should be controlled to meet the part specifications. In this research, a machine learning approach based on artificial neural networks is developed to find the correlation between the laser metal deposition process parameters and the output geometrical parameters of the deposited metal trace produced by laser direct metal deposition on 5-mm-thick 2024 aluminum alloy plates. The results show that the neural network-based machine learning paradigm is able to accurately estimate the appropriate process parameters required to obtain a specified geometry for the deposited metal trace.

  20. Laser Direct Metal Deposition of 2024 Al Alloy: Trace Geometry Prediction via Machine Learning

    Directory of Open Access Journals (Sweden)

    Fabrizia Caiazzo

    2018-03-01

    Full Text Available Laser direct metal deposition is an advanced additive manufacturing technology suitably applicable in maintenance, repair, and overhaul of high-cost products, allowing for minimal distortion of the workpiece, reduced heat affected zones, and superior surface quality. Special interest is growing for the repair and coating of 2024 aluminum alloy parts, extensively utilized for a wide range of applications in the automotive, military, and aerospace sectors due to its excellent plasticity, corrosion resistance, electric conductivity, and strength-to-weight ratio. A critical issue in the laser direct metal deposition process is related to the geometrical parameters of the cross-section of the deposited metal trace that should be controlled to meet the part specifications. In this research, a machine learning approach based on artificial neural networks is developed to find the correlation between the laser metal deposition process parameters and the output geometrical parameters of the deposited metal trace produced by laser direct metal deposition on 5-mm-thick 2024 aluminum alloy plates. The results show that the neural network-based machine learning paradigm is able to accurately estimate the appropriate process parameters required to obtain a specified geometry for the deposited metal trace.

  1. Using Blood Indexes to Predict Overweight Statuses: An Extreme Learning Machine-Based Approach.

    Directory of Open Access Journals (Sweden)

    Huiling Chen

    Full Text Available The number of the overweight people continues to rise across the world. Studies have shown that being overweight can increase health risks, such as high blood pressure, diabetes mellitus, coronary heart disease, and certain forms of cancer. Therefore, identifying the overweight status in people is critical to prevent and decrease health risks. This study explores a new technique that uses blood and biochemical measurements to recognize the overweight condition. A new machine learning technique, an extreme learning machine, was developed to accurately detect the overweight status from a pool of 225 overweight and 251 healthy subjects. The group included 179 males and 297 females. The detection method was rigorously evaluated against the real-life dataset for accuracy, sensitivity, specificity, and AUC (area under the receiver operating characteristic (ROC curve criterion. Additionally, the feature selection was investigated to identify correlating factors for the overweight status. The results demonstrate that there are significant differences in blood and biochemical indexes between healthy and overweight people (p-value < 0.01. According to the feature selection, the most important correlated indexes are creatinine, hemoglobin, hematokrit, uric Acid, red blood cells, high density lipoprotein, alanine transaminase, triglyceride, and γ-glutamyl transpeptidase. These are consistent with the results of Spearman test analysis. The proposed method holds promise as a new, accurate method for identifying the overweight status in subjects.

  2. Using machine learning to predict snow water equivalent in the Sierra Nevada USA and Afghanistan

    Science.gov (United States)

    Bair, N.; Rittger, K.; Dozier, J.

    2017-12-01

    In many mountain regions, snowmelt provides most of the runoff. Ranges such as the Sierra Nevada USA benefit from hundreds of manual and automated snow measurement stations as well as basin-wide snow water equavalent (SWE) estimates from new platforms like the Airborne Snow Observatory. Thus, we have been able to use the Sierra Nevada as a testbed to validate an approach called SWE reconstruction, where the snowpack is built-up in reverse using downscaled energy balance forcings. Our past work has shown that SWE reconstruction produces some of the most accurate basin-wide SWE estimates, comparable in accuracy to a snow pillow/course interpolation, but requires no in situ measurements, which is its main advantage. The disadvantages are that reconstruction cannot be used for a forecast and is only valid during the ablation period. To address these shortcomings, we have used machine learning trained on reconstructed SWE in the Sierra Nevada and Afghanistan, where there are no accessible snowpack measurements. Predictors are physiographic and remotely-sensed variables, including brightness temperatures from a new enhanced resolution passive microwave dataset. Two machine learning techniques—bagged regression trees and feed-forward neural networks—were used. Results show little bias on average and training period, i.e. the ablation period.

  3. Clojure for machine learning

    CERN Document Server

    Wali, Akhil

    2014-01-01

    A book that brings out the strengths of Clojure programming that have to facilitate machine learning. Each topic is described in substantial detail, and examples and libraries in Clojure are also demonstrated.This book is intended for Clojure developers who want to explore the area of machine learning. Basic understanding of the Clojure programming language is required, but thorough acquaintance with the standard Clojure library or any libraries are not required. Familiarity with theoretical concepts and notation of mathematics and statistics would be an added advantage.

  4. Predicting Post-Translational Modifications from Local Sequence Fragments Using Machine Learning Algorithms: Overview and Best Practices.

    Science.gov (United States)

    Tatjewski, Marcin; Kierczak, Marcin; Plewczynski, Dariusz

    2017-01-01

    Here, we present two perspectives on the task of predicting post translational modifications (PTMs) from local sequence fragments using machine learning algorithms. The first is the description of the fundamental steps required to construct a PTM predictor from the very beginning. These steps include data gathering, feature extraction, or machine-learning classifier selection. The second part of our work contains the detailed discussion of more advanced problems which are encountered in PTM prediction task. Probably the most challenging issues which we have covered here are: (1) how to address the training data class imbalance problem (we also present statistics describing the problem); (2) how to properly set up cross-validation folds with an approach which takes into account the homology of protein data records, to address this problem we present our folds-over-clusters algorithm; and (3) how to efficiently reach for new sources of learning features. Presented techniques and notes resulted from intense studies in the field, performed by our and other groups, and can be useful both for researchers beginning in the field of PTM prediction and for those who want to extend the repertoire of their research techniques.

  5. SOLAR FLARE PREDICTION USING SDO/HMI VECTOR MAGNETIC FIELD DATA WITH A MACHINE-LEARNING ALGORITHM

    International Nuclear Information System (INIS)

    Bobra, M. G.; Couvidat, S.

    2015-01-01

    We attempt to forecast M- and X-class solar flares using a machine-learning algorithm, called support vector machine (SVM), and four years of data from the Solar Dynamics Observatory's Helioseismic and Magnetic Imager, the first instrument to continuously map the full-disk photospheric vector magnetic field from space. Most flare forecasting efforts described in the literature use either line-of-sight magnetograms or a relatively small number of ground-based vector magnetograms. This is the first time a large data set of vector magnetograms has been used to forecast solar flares. We build a catalog of flaring and non-flaring active regions sampled from a database of 2071 active regions, comprised of 1.5 million active region patches of vector magnetic field data, and characterize each active region by 25 parameters. We then train and test the machine-learning algorithm and we estimate its performances using forecast verification metrics with an emphasis on the true skill statistic (TSS). We obtain relatively high TSS scores and overall predictive abilities. We surmise that this is partly due to fine-tuning the SVM for this purpose and also to an advantageous set of features that can only be calculated from vector magnetic field data. We also apply a feature selection algorithm to determine which of our 25 features are useful for discriminating between flaring and non-flaring active regions and conclude that only a handful are needed for good predictive abilities

  6. Mastering machine learning with scikit-learn

    CERN Document Server

    Hackeling, Gavin

    2014-01-01

    If you are a software developer who wants to learn how machine learning models work and how to apply them effectively, this book is for you. Familiarity with machine learning fundamentals and Python will be helpful, but is not essential.

  7. High-Risk Breast Lesions: A Machine Learning Model to Predict Pathologic Upgrade and Reduce Unnecessary Surgical Excision.

    Science.gov (United States)

    Bahl, Manisha; Barzilay, Regina; Yedidia, Adam B; Locascio, Nicholas J; Yu, Lili; Lehman, Constance D

    2018-03-01

    Purpose To develop a machine learning model that allows high-risk breast lesions (HRLs) diagnosed with image-guided needle biopsy that require surgical excision to be distinguished from HRLs that are at low risk for upgrade to cancer at surgery and thus could be surveilled. Materials and Methods Consecutive patients with biopsy-proven HRLs who underwent surgery or at least 2 years of imaging follow-up from June 2006 to April 2015 were identified. A random forest machine learning model was developed to identify HRLs at low risk for upgrade to cancer. Traditional features such as age and HRL histologic results were used in the model, as were text features from the biopsy pathologic report. Results One thousand six HRLs were identified, with a cancer upgrade rate of 11.4% (115 of 1006). A machine learning random forest model was developed with 671 HRLs and tested with an independent set of 335 HRLs. Among the most important traditional features were age and HRL histologic results (eg, atypical ductal hyperplasia). An important text feature from the pathologic reports was "severely atypical." Instead of surgical excision of all HRLs, if those categorized with the model to be at low risk for upgrade were surveilled and the remainder were excised, then 97.4% (37 of 38) of malignancies would have been diagnosed at surgery, and 30.6% (91 of 297) of surgeries of benign lesions could have been avoided. Conclusion This study provides proof of concept that a machine learning model can be applied to predict the risk of upgrade of HRLs to cancer. Use of this model could decrease unnecessary surgery by nearly one-third and could help guide clinical decision making with regard to surveillance versus surgical excision of HRLs. © RSNA, 2017.

  8. A Comparison of a Machine Learning Model with EuroSCORE II in Predicting Mortality after Elective Cardiac Surgery: A Decision Curve Analysis.

    Science.gov (United States)

    Allyn, Jérôme; Allou, Nicolas; Augustin, Pascal; Philip, Ivan; Martinet, Olivier; Belghiti, Myriem; Provenchere, Sophie; Montravers, Philippe; Ferdynus, Cyril

    2017-01-01

    The benefits of cardiac surgery are sometimes difficult to predict and the decision to operate on a given individual is complex. Machine Learning and Decision Curve Analysis (DCA) are recent methods developed to create and evaluate prediction models. We conducted a retrospective cohort study using a prospective collected database from December 2005 to December 2012, from a cardiac surgical center at University Hospital. The different models of prediction of mortality in-hospital after elective cardiac surgery, including EuroSCORE II, a logistic regression model and a machine learning model, were compared by ROC and DCA. Of the 6,520 patients having elective cardiac surgery with cardiopulmonary bypass, 6.3% died. Mean age was 63.4 years old (standard deviation 14.4), and mean EuroSCORE II was 3.7 (4.8) %. The area under ROC curve (IC95%) for the machine learning model (0.795 (0.755-0.834)) was significantly higher than EuroSCORE II or the logistic regression model (respectively, 0.737 (0.691-0.783) and 0.742 (0.698-0.785), p machine learning model, in this monocentric study, has a greater benefit whatever the probability threshold. According to ROC and DCA, machine learning model is more accurate in predicting mortality after elective cardiac surgery than EuroSCORE II. These results confirm the use of machine learning methods in the field of medical prediction.

  9. Machine Learning for Security

    CERN Multimedia

    CERN. Geneva

    2015-01-01

    Applied statistics, aka ‘Machine Learning’, offers a wealth of techniques for answering security questions. It’s a much hyped topic in the big data world, with many companies now providing machine learning as a service. This talk will demystify these techniques, explain the math, and demonstrate their application to security problems. The presentation will include how-to’s on classifying malware, looking into encrypted tunnels, and finding botnets in DNS data. About the speaker Josiah is a security researcher with HP TippingPoint DVLabs Research Group. He has over 15 years of professional software development experience. Josiah used to do AI, with work focused on graph theory, search, and deductive inference on large knowledge bases. As rules only get you so far, he moved from AI to using machine learning techniques identifying failure modes in email traffic. There followed digressions into clustered data storage and later integrated control systems. Current ...

  10. Massively collaborative machine learning

    NARCIS (Netherlands)

    Rijn, van J.N.

    2016-01-01

    Many scientists are focussed on building models. We nearly process all information we perceive to a model. There are many techniques that enable computers to build models as well. The field of research that develops such techniques is called Machine Learning. Many research is devoted to develop

  11. Effluent composition prediction of a two-stage anaerobic digestion process: machine learning and stoichiometry techniques.

    Science.gov (United States)

    Alejo, Luz; Atkinson, John; Guzmán-Fierro, Víctor; Roeckel, Marlene

    2018-05-16

    Computational self-adapting methods (Support Vector Machines, SVM) are compared with an analytical method in effluent composition prediction of a two-stage anaerobic digestion (AD) process. Experimental data for the AD of poultry manure were used. The analytical method considers the protein as the only source of ammonia production in AD after degradation. Total ammonia nitrogen (TAN), total solids (TS), chemical oxygen demand (COD), and total volatile solids (TVS) were measured in the influent and effluent of the process. The TAN concentration in the effluent was predicted, this being the most inhibiting and polluting compound in AD. Despite the limited data available, the SVM-based model outperformed the analytical method for the TAN prediction, achieving a relative average error of 15.2% against 43% for the analytical method. Moreover, SVM showed higher prediction accuracy in comparison with Artificial Neural Networks. This result reveals the future promise of SVM for prediction in non-linear and dynamic AD processes. Graphical abstract ᅟ.

  12. Assessing the suitability of extreme learning machines (ELM for groundwater level prediction

    Directory of Open Access Journals (Sweden)

    Yadav Basant

    2017-03-01

    Full Text Available Fluctuation of groundwater levels around the world is an important theme in hydrological research. Rising water demand, faulty irrigation practices, mismanagement of soil and uncontrolled exploitation of aquifers are some of the reasons why groundwater levels are fluctuating. In order to effectively manage groundwater resources, it is important to have accurate readings and forecasts of groundwater levels. Due to the uncertain and complex nature of groundwater systems, the development of soft computing techniques (data-driven models in the field of hydrology has significant potential. This study employs two soft computing techniques, namely, extreme learning machine (ELM and support vector machine (SVM to forecast groundwater levels at two observation wells located in Canada. A monthly data set of eight years from 2006 to 2014 consisting of both hydrological and meteorological parameters (rainfall, temperature, evapotranspiration and groundwater level was used for the comparative study of the models. These variables were used in various combinations for univariate and multivariate analysis of the models. The study demonstrates that the proposed ELM model has better forecasting ability compared to the SVM model for monthly groundwater level forecasting.

  13. Issues on machine learning for prediction of classes among molecular sequences of plants and animals

    Science.gov (United States)

    Stehlik, Milan; Pant, Bhasker; Pant, Kumud; Pardasani, K. R.

    2012-09-01

    Nowadays major laboratories of the world are turning towards in-silico experimentation due to their ease, reproducibility and accuracy. The ethical issues concerning wet lab experimentations are also minimal in in-silico experimentations. But before we turn fully towards dry lab simulations it is necessary to understand the discrepancies and bottle necks involved with dry lab experimentations. It is necessary before reporting any result using dry lab simulations to perform in-depth statistical analysis of the data. Keeping same in mind here we are presenting a collaborative effort to correlate findings and results of various machine learning algorithms and checking underlying regressions and mutual dependencies so as to develop an optimal classifier and predictors.

  14. Predicting Essential Genes and Proteins Based on Machine Learning and Network Topological Features: A Comprehensive Review

    Science.gov (United States)

    Zhang, Xue; Acencio, Marcio Luis; Lemke, Ney

    2016-01-01

    Essential proteins/genes are indispensable to the survival or reproduction of an organism, and the deletion of such essential proteins will result in lethality or infertility. The identification of essential genes is very important not only for understanding the minimal requirements for survival of an organism, but also for finding human disease genes and new drug targets. Experimental methods for identifying essential genes are costly, time-consuming, and laborious. With the accumulation of sequenced genomes data and high-throughput experimental data, many computational methods for identifying essential proteins are proposed, which are useful complements to experimental methods. In this review, we show the state-of-the-art methods for identifying essential genes and proteins based on machine learning and network topological features, point out the progress and limitations of current methods, and discuss the challenges and directions for further research. PMID:27014079

  15. New Applications of Learning Machines

    DEFF Research Database (Denmark)

    Larsen, Jan

    * Machine learning framework for sound search * Genre classification * Music separation * MIMO channel estimation and symbol detection......* Machine learning framework for sound search * Genre classification * Music separation * MIMO channel estimation and symbol detection...

  16. Accurate prediction of stability changes in protein mutants by combining machine learning with structure based computational mutagenesis.

    Science.gov (United States)

    Masso, Majid; Vaisman, Iosif I

    2008-09-15

    Accurate predictive models for the impact of single amino acid substitutions on protein stability provide insight into protein structure and function. Such models are also valuable for the design and engineering of new proteins. Previously described methods have utilized properties of protein sequence or structure to predict the free energy change of mutants due to thermal (DeltaDeltaG) and denaturant (DeltaDeltaG(H2O)) denaturations, as well as mutant thermal stability (DeltaT(m)), through the application of either computational energy-based approaches or machine learning techniques. However, accuracy associated with applying these methods separately is frequently far from optimal. We detail a computational mutagenesis technique based on a four-body, knowledge-based, statistical contact potential. For any mutation due to a single amino acid replacement in a protein, the method provides an empirical normalized measure of the ensuing environmental perturbation occurring at every residue position. A feature vector is generated for the mutant by considering perturbations at the mutated position and it's ordered six nearest neighbors in the 3-dimensional (3D) protein structure. These predictors of stability change are evaluated by applying machine learning tools to large training sets of mutants derived from diverse proteins that have been experimentally studied and described. Predictive models based on our combined approach are either comparable to, or in many cases significantly outperform, previously published results. A web server with supporting documentation is available at http://proteins.gmu.edu/automute.

  17. Predicting ground contact events for a continuum of gait types: An application of targeted machine learning using principal component analysis.

    Science.gov (United States)

    Osis, Sean T; Hettinga, Blayne A; Ferber, Reed

    2016-05-01

    An ongoing challenge in the application of gait analysis to clinical settings is the standardized detection of temporal events, with unobtrusive and cost-effective equipment, for a wide range of gait types. The purpose of the current study was to investigate a targeted machine learning approach for the prediction of timing for foot strike (or initial contact) and toe-off, using only kinematics for walking, forefoot running, and heel-toe running. Data were categorized by gait type and split into a training set (∼30%) and a validation set (∼70%). A principal component analysis was performed, and separate linear models were trained and validated for foot strike and toe-off, using ground reaction force data as a gold-standard for event timing. Results indicate the model predicted both foot strike and toe-off timing to within 20ms of the gold-standard for more than 95% of cases in walking and running gaits. The machine learning approach continues to provide robust timing predictions for clinical use, and may offer a flexible methodology to handle new events and gait types. Copyright © 2016 Elsevier B.V. All rights reserved.

  18. Development of a Late-Life Dementia Prediction Index with Supervised Machine Learning in the Population-Based CAIDE Study.

    Science.gov (United States)

    Pekkala, Timo; Hall, Anette; Lötjönen, Jyrki; Mattila, Jussi; Soininen, Hilkka; Ngandu, Tiia; Laatikainen, Tiina; Kivipelto, Miia; Solomon, Alina

    2017-01-01

    This study aimed to develop a late-life dementia prediction model using a novel validated supervised machine learning method, the Disease State Index (DSI), in the Finnish population-based CAIDE study. The CAIDE study was based on previous population-based midlife surveys. CAIDE participants were re-examined twice in late-life, and the first late-life re-examination was used as baseline for the present study. The main study population included 709 cognitively normal subjects at first re-examination who returned to the second re-examination up to 10 years later (incident dementia n = 39). An extended population (n = 1009, incident dementia 151) included non-participants/non-survivors (national registers data). DSI was used to develop a dementia index based on first re-examination assessments. Performance in predicting dementia was assessed as area under the ROC curve (AUC). AUCs for DSI were 0.79 and 0.75 for main and extended populations. Included predictors were cognition, vascular factors, age, subjective memory complaints, and APOE genotype. The supervised machine learning method performed well in identifying comprehensive profiles for predicting dementia development up to 10 years later. DSI could thus be useful for identifying individuals who are most at risk and may benefit from dementia prevention interventions.

  19. Machine learning to predict the occurrence of bisphosphonate-related osteonecrosis of the jaw associated with dental extraction: A preliminary report.

    Science.gov (United States)

    Kim, Dong Wook; Kim, Hwiyoung; Nam, Woong; Kim, Hyung Jun; Cha, In-Ho

    2018-04-23

    The aim of this study was to build and validate five types of machine learning models that can predict the occurrence of BRONJ associated with dental extraction in patients taking bisphosphonates for the management of osteoporosis. A retrospective review of the medical records was conducted to obtain cases and controls for the study. Total 125 patients consisting of 41 cases and 84 controls were selected for the study. Five machine learning prediction algorithms including multivariable logistic regression model, decision tree, support vector machine, artificial neural network, and random forest were implemented. The outputs of these models were compared with each other and also with conventional methods, such as serum CTX level. Area under the receiver operating characteristic (ROC) curve (AUC) was used to compare the results. The performance of machine learning models was significantly superior to conventional statistical methods and single predictors. The random forest model yielded the best performance (AUC = 0.973), followed by artificial neural network (AUC = 0.915), support vector machine (AUC = 0.882), logistic regression (AUC = 0.844), decision tree (AUC = 0.821), drug holiday alone (AUC = 0.810), and CTX level alone (AUC = 0.630). Machine learning methods showed superior performance in predicting BRONJ associated with dental extraction compared to conventional statistical methods using drug holiday and serum CTX level. Machine learning can thus be applied in a wide range of clinical studies. Copyright © 2017. Published by Elsevier Inc.

  20. Machine learning a probabilistic perspective

    CERN Document Server

    Murphy, Kevin P

    2012-01-01

    Today's Web-enabled deluge of electronic data calls for automated methods of data analysis. Machine learning provides these, developing methods that can automatically detect patterns in data and then use the uncovered patterns to predict future data. This textbook offers a comprehensive and self-contained introduction to the field of machine learning, based on a unified, probabilistic approach. The coverage combines breadth and depth, offering necessary background material on such topics as probability, optimization, and linear algebra as well as discussion of recent developments in the field, including conditional random fields, L1 regularization, and deep learning. The book is written in an informal, accessible style, complete with pseudo-code for the most important algorithms. All topics are copiously illustrated with color images and worked examples drawn from such application domains as biology, text processing, computer vision, and robotics. Rather than providing a cookbook of different heuristic method...

  1. Prediction of Backbreak in Open-Pit Blasting Operations Using the Machine Learning Method

    Science.gov (United States)

    Khandelwal, Manoj; Monjezi, M.

    2013-03-01

    Backbreak is an undesirable phenomenon in blasting operations. It can cause instability of mine walls, falling down of machinery, improper fragmentation, reduced efficiency of drilling, etc. The existence of various effective parameters and their unknown relationships are the main reasons for inaccuracy of the empirical models. Presently, the application of new approaches such as artificial intelligence is highly recommended. In this paper, an attempt has been made to predict backbreak in blasting operations of Soungun iron mine, Iran, incorporating rock properties and blast design parameters using the support vector machine (SVM) method. To investigate the suitability of this approach, the predictions by SVM have been compared with multivariate regression analysis (MVRA). The coefficient of determination (CoD) and the mean absolute error (MAE) were taken as performance measures. It was found that the CoD between measured and predicted backbreak was 0.987 and 0.89 by SVM and MVRA, respectively, whereas the MAE was 0.29 and 1.07 by SVM and MVRA, respectively.

  2. Prediction of promiscuous p-glycoprotein inhibition using a novel machine learning scheme.

    Directory of Open Access Journals (Sweden)

    Max K Leong

    Full Text Available BACKGROUND: P-glycoprotein (P-gp is an ATP-dependent membrane transporter that plays a pivotal role in eliminating xenobiotics by active extrusion of xenobiotics from the cell. Multidrug resistance (MDR is highly associated with the over-expression of P-gp by cells, resulting in increased efflux of chemotherapeutical agents and reduction of intracellular drug accumulation. It is of clinical importance to develop a P-gp inhibition predictive model in the process of drug discovery and development. METHODOLOGY/PRINCIPAL FINDINGS: An in silico model was derived to predict the inhibition of P-gp using the newly invented pharmacophore ensemble/support vector machine (PhE/SVM scheme based on the data compiled from the literature. The predictions by the PhE/SVM model were found to be in good agreement with the observed values for those structurally diverse molecules in the training set (n = 31, r(2 = 0.89, q(2 = 0.86, RMSE = 0.40, s = 0.28, the test set (n = 88, r(2 = 0.87, RMSE = 0.39, s = 0.25 and the outlier set (n = 11, r(2 = 0.96, RMSE = 0.10, s = 0.05. The generated PhE/SVM model also showed high accuracy when subjected to those validation criteria generally adopted to gauge the predictivity of a theoretical model. CONCLUSIONS/SIGNIFICANCE: This accurate, fast and robust PhE/SVM model that can take into account the promiscuous nature of P-gp can be applied to predict the P-gp inhibition of structurally diverse compounds that otherwise cannot be done by any other methods in a high-throughput fashion to facilitate drug discovery and development by designing drug candidates with better metabolism profile.

  3. Can machine learning explain human learning?

    NARCIS (Netherlands)

    Vahdat, M.; Oneto, L.; Anguita, D.; Funk, M.; Rauterberg, G.W.M.

    2016-01-01

    Learning Analytics (LA) has a major interest in exploring and understanding the learning process of humans and, for this purpose, benefits from both Cognitive Science, which studies how humans learn, and Machine Learning, which studies how algorithms learn from data. Usually, Machine Learning is

  4. A Practical Framework Toward Prediction of Breaking Force and Disintegration of Tablet Formulations Using Machine Learning Tools.

    Science.gov (United States)

    Akseli, Ilgaz; Xie, Jingjin; Schultz, Leon; Ladyzhynsky, Nadia; Bramante, Tommasina; He, Xiaorong; Deanne, Rich; Horspool, Keith R; Schwabe, Robert

    2017-01-01

    Enabling the paradigm of quality by design requires the ability to quantitatively correlate material properties and process variables to measureable product performance attributes. Conventional, quality-by-test methods for determining tablet breaking force and disintegration time usually involve destructive tests, which consume significant amount of time and labor and provide limited information. Recent advances in material characterization, statistical analysis, and machine learning have provided multiple tools that have the potential to develop nondestructive, fast, and accurate approaches in drug product development. In this work, a methodology to predict the breaking force and disintegration time of tablet formulations using nondestructive ultrasonics and machine learning tools was developed. The input variables to the model include intrinsic properties of formulation and extrinsic process variables influencing the tablet during manufacturing. The model has been applied to predict breaking force and disintegration time using small quantities of active pharmaceutical ingredient and prototype formulation designs. The novel approach presented is a step forward toward rational design of a robust drug product based on insight into the performance of common materials during formulation and process development. It may also help expedite drug product development timeline and reduce active pharmaceutical ingredient usage while improving efficiency of the overall process. Copyright © 2016 American Pharmacists Association®. Published by Elsevier Inc. All rights reserved.

  5. Use of Machine-Learning Approaches to Predict Clinical Deterioration in Critically Ill Patients: A Systematic Review

    Directory of Open Access Journals (Sweden)

    Tadashi Kamio

    2017-06-01

    Full Text Available Introduction: Early identification of patients with unexpected clinical deterioration is a matter of serious concern. Previous studies have shown that early intervention on a patient whose health is deteriorating improves the patient outcome, and machine-learning-based approaches to predict clinical deterioration may contribute to precision improvement. To date, however, no systematic review in this area is available. Methods: We completed a search on PubMed on January 22, 2017 as well as a review of the articles identified by study authors involved in this area of research following the preferred reporting items for systematic reviews and meta-analyses (PRISMA guidelines for systematic reviews. Results: Twelve articles were selected for the current study from 273 articles initially obtained from the PubMed searches. Eleven of the 12 studies were retrospective studies, and no randomized controlled trials were performed. Although the artificial neural network techniques were the most frequently used and provided high precision and accuracy, we failed to identify articles that showed improvement in the patient outcome. Limitations were reported related to generalizability, complexity of models, and technical knowledge. Conclusions: This review shows that machine-learning approaches can improve prediction of clinical deterioration compared with traditional methods. However, these techniques will require further external validation before widespread clinical acceptance can be achieved.

  6. Prediction of hot spot residues at protein-protein interfaces by combining machine learning and energy-based methods

    Directory of Open Access Journals (Sweden)

    Pontil Massimiliano

    2009-10-01

    Full Text Available Abstract Background Alanine scanning mutagenesis is a powerful experimental methodology for investigating the structural and energetic characteristics of protein complexes. Individual amino-acids are systematically mutated to alanine and changes in free energy of binding (ΔΔG measured. Several experiments have shown that protein-protein interactions are critically dependent on just a few residues ("hot spots" at the interface. Hot spots make a dominant contribution to the free energy of binding and if mutated they can disrupt the interaction. As mutagenesis studies require significant experimental efforts, there is a need for accurate and reliable computational methods. Such methods would also add to our understanding of the determinants of affinity and specificity in protein-protein recognition. Results We present a novel computational strategy to identify hot spot residues, given the structure of a complex. We consider the basic energetic terms that contribute to hot spot interactions, i.e. van der Waals potentials, solvation energy, hydrogen bonds and Coulomb electrostatics. We treat them as input features and use machine learning algorithms such as Support Vector Machines and Gaussian Processes to optimally combine and integrate them, based on a set of training examples of alanine mutations. We show that our approach is effective in predicting hot spots and it compares favourably to other available methods. In particular we find the best performances using Transductive Support Vector Machines, a semi-supervised learning scheme. When hot spots are defined as those residues for which ΔΔG ≥ 2 kcal/mol, our method achieves a precision and a recall respectively of 56% and 65%. Conclusion We have developed an hybrid scheme in which energy terms are used as input features of machine learning models. This strategy combines the strengths of machine learning and energy-based methods. Although so far these two types of approaches have mainly been

  7. A machine learning approach for predicting CRISPR-Cas9 cleavage efficiencies and patterns underlying its mechanism of action.

    Science.gov (United States)

    Abadi, Shiran; Yan, Winston X; Amar, David; Mayrose, Itay

    2017-10-01

    The adaptation of the CRISPR-Cas9 system as a genome editing technique has generated much excitement in recent years owing to its ability to manipulate targeted genes and genomic regions that are complementary to a programmed single guide RNA (sgRNA). However, the efficacy of a specific sgRNA is not uniquely defined by exact sequence homology to the target site, thus unintended off-targets might additionally be cleaved. Current methods for sgRNA design are mainly concerned with predicting off-targets for a given sgRNA using basic sequence features and employ elementary rules for ranking possible sgRNAs. Here, we introduce CRISTA (CRISPR Target Assessment), a novel algorithm within the machine learning framework that determines the propensity of a genomic site to be cleaved by a given sgRNA. We show that the predictions made with CRISTA are more accurate than other available methodologies. We further demonstrate that the occurrence of bulges is not a rare phenomenon and should be accounted for in the prediction process. Beyond predicting cleavage efficiencies, the learning process provides inferences regarding patterns that underlie the mechanism of action of the CRISPR-Cas9 system. We discover that attributes that describe the spatial structure and rigidity of the entire genomic site as well as those surrounding the PAM region are a major component of the prediction capabilities.

  8. A machine learning approach for predicting CRISPR-Cas9 cleavage efficiencies and patterns underlying its mechanism of action.

    Directory of Open Access Journals (Sweden)

    Shiran Abadi

    2017-10-01

    Full Text Available The adaptation of the CRISPR-Cas9 system as a genome editing technique has generated much excitement in recent years owing to its ability to manipulate targeted genes and genomic regions that are complementary to a programmed single guide RNA (sgRNA. However, the efficacy of a specific sgRNA is not uniquely defined by exact sequence homology to the target site, thus unintended off-targets might additionally be cleaved. Current methods for sgRNA design are mainly concerned with predicting off-targets for a given sgRNA using basic sequence features and employ elementary rules for ranking possible sgRNAs. Here, we introduce CRISTA (CRISPR Target Assessment, a novel algorithm within the machine learning framework that determines the propensity of a genomic site to be cleaved by a given sgRNA. We show that the predictions made with CRISTA are more accurate than other available methodologies. We further demonstrate that the occurrence of bulges is not a rare phenomenon and should be accounted for in the prediction process. Beyond predicting cleavage efficiencies, the learning process provides inferences regarding patterns that underlie the mechanism of action of the CRISPR-Cas9 system. We discover that attributes that describe the spatial structure and rigidity of the entire genomic site as well as those surrounding the PAM region are a major component of the prediction capabilities.

  9. Prediction of persistent post-surgery pain by preoperative cold pain sensitivity: biomarker development with machine-learning-derived analysis.

    Science.gov (United States)

    Lötsch, J; Ultsch, A; Kalso, E

    2017-10-01

    To prevent persistent post-surgery pain, early identification of patients at high risk is a clinical need. Supervised machine-learning techniques were used to test how accurately the patients' performance in a preoperatively performed tonic cold pain test could predict persistent post-surgery pain. We analysed 763 patients from a cohort of 900 women who were treated for breast cancer, of whom 61 patients had developed signs of persistent pain during three yr of follow-up. Preoperatively, all patients underwent a cold pain test (immersion of the hand into a water bath at 2-4 °C). The patients rated the pain intensity using a numerical ratings scale (NRS) from 0 to 10. Supervised machine-learning techniques were used to construct a classifier that could predict patients at risk of persistent pain. Whether or not a patient rated the pain intensity at NRS=10 within less than 45 s during the cold water immersion test provided a negative predictive value of 94.4% to assign a patient to the "persistent pain" group. If NRS=10 was never reached during the cold test, the predictive value for not developing persistent pain was almost 97%. However, a low negative predictive value of 10% implied a high false positive rate. Results provide a robust exclusion of persistent pain in women with an accuracy of 94.4%. Moreover, results provide further support for the hypothesis that the endogenous pain inhibitory system may play an important role in the process of pain becoming persistent. © The Author 2017. Published by Oxford University Press on behalf of the British Journal of Anaesthesia.

  10. A machine learning approach using EEG data to predict response to SSRI treatment for major depressive disorder.

    Science.gov (United States)

    Khodayari-Rostamabad, Ahmad; Reilly, James P; Hasey, Gary M; de Bruin, Hubert; Maccrimmon, Duncan J

    2013-10-01

    The problem of identifying, in advance, the most effective treatment agent for various psychiatric conditions remains an elusive goal. To address this challenge, we investigate the performance of the proposed machine learning (ML) methodology (based on the pre-treatment electroencephalogram (EEG)) for prediction of response to treatment with a selective serotonin reuptake inhibitor (SSRI) medication in subjects suffering from major depressive disorder (MDD). A relatively small number of most discriminating features are selected from a large group of candidate features extracted from the subject's pre-treatment EEG, using a machine learning procedure for feature selection. The selected features are fed into a classifier, which was realized as a mixture of factor analysis (MFA) model, whose output is the predicted response in the form of a likelihood value. This likelihood indicates the extent to which the subject belongs to the responder vs. non-responder classes. The overall method was evaluated using a "leave-n-out" randomized permutation cross-validation procedure. A list of discriminating EEG biomarkers (features) was found. The specificity of the proposed method is 80.9% while sensitivity is 94.9%, for an overall prediction accuracy of 87.9%. There is a 98.76% confidence that the estimated prediction rate is within the interval [75%, 100%]. These results indicate that the proposed ML method holds considerable promise in predicting the efficacy of SSRI antidepressant therapy for MDD, based on a simple and cost-effective pre-treatment EEG. The proposed approach offers the potential to improve the treatment of major depression and to reduce health care costs. Copyright © 2013 International Federation of Clinical Neurophysiology. Published by Elsevier Ireland Ltd. All rights reserved.

  11. Refining Prediction in Treatment-Resistant Depression: Results of Machine Learning Analyses in the TRD III Sample.

    Science.gov (United States)

    Kautzky, Alexander; Dold, Markus; Bartova, Lucie; Spies, Marie; Vanicek, Thomas; Souery, Daniel; Montgomery, Stuart; Mendlewicz, Julien; Zohar, Joseph; Fabbri, Chiara; Serretti, Alessandro; Lanzenberger, Rupert; Kasper, Siegfried

    The study objective was to generate a prediction model for treatment-resistant depression (TRD) using machine learning featuring a large set of 47 clinical and sociodemographic predictors of treatment outcome. 552 Patients diagnosed with major depressive disorder (MDD) according to DSM-IV criteria were enrolled between 2011 and 2016. TRD was defined as failure to reach response to antidepressant treatment, characterized by a Montgomery-Asberg Depression Rating Scale (MADRS) score below 22 after at least 2 antidepressant trials of adequate length and dosage were administered. RandomForest (RF) was used for predicting treatment outcome phenotypes in a 10-fold cross-validation. The full model with 47 predictors yielded an accuracy of 75.0%. When the number of predictors was reduced to 15, accuracies between 67.6% and 71.0% were attained for different test sets. The most informative predictors of treatment outcome were baseline MADRS score for the current episode; impairment of family, social, and work life; the timespan between first and last depressive episode; severity; suicidal risk; age; body mass index; and the number of lifetime depressive episodes as well as lifetime duration of hospitalization. With the application of the machine learning algorithm RF, an efficient prediction model with an accuracy of 75.0% for forecasting treatment outcome could be generated, thus surpassing the predictive capabilities of clinical evaluation. We also supply a simplified algorithm of 15 easily collected clinical and sociodemographic predictors that can be obtained within approximately 10 minutes, which reached an accuracy of 70.6%. Thus, we are confident that our model will be validated within other samples to advance an accurate prediction model fit for clinical usage in TRD. © Copyright 2017 Physicians Postgraduate Press, Inc.

  12. Assessing the Influence of Spatio-Temporal Context for Next Place Prediction using Different Machine Learning Approaches

    Directory of Open Access Journals (Sweden)

    Jorim Urner

    2018-04-01

    Full Text Available For next place prediction, machine learning methods which incorporate contextual data are frequently used. However, previous studies often do not allow deriving generalizable methodological recommendations, since they use different datasets, methods for discretizing space, scales of prediction, prediction algorithms, and context data, and therefore lack comparability. Additionally, the cold start problem for new users is an issue. In this study, we predict next places based on one trajectory dataset but with systematically varying prediction algorithms, methods for space discretization, scales of prediction (based on a novel hierarchical approach, and incorporated context data. This allows to evaluate the relative influence of these factors on the overall prediction accuracy. Moreover, in order to tackle the cold start problem prevalent in recommender and prediction systems, we test the effect of training the predictor on all users instead of each individual one. We find that the prediction accuracy shows a varying dependency on the method of space discretization and the incorporated contextual factors at different spatial scales. Moreover, our user-independent approach reaches a prediction accuracy of around 75%, and is therefore an alternative to existing user-specific models. This research provides valuable insights into the individual and combinatory effects of model parameters and algorithms on the next place prediction accuracy. The results presented in this paper can be used to determine the influence of various contextual factors and to help researchers building more accurate prediction models. It is also a starting point for future work creating a comprehensive framework to guide the building of prediction models.

  13. Student Modeling and Machine Learning

    OpenAIRE

    Sison , Raymund; Shimura , Masamichi

    1998-01-01

    After identifying essential student modeling issues and machine learning approaches, this paper examines how machine learning techniques have been used to automate the construction of student models as well as the background knowledge necessary for student modeling. In the process, the paper sheds light on the difficulty, suitability and potential of using machine learning for student modeling processes, and, to a lesser extent, the potential of using student modeling techniques in machine le...

  14. Quantum Machine Learning

    Science.gov (United States)

    Biswas, Rupak

    2018-01-01

    Quantum computing promises an unprecedented ability to solve intractable problems by harnessing quantum mechanical effects such as tunneling, superposition, and entanglement. The Quantum Artificial Intelligence Laboratory (QuAIL) at NASA Ames Research Center is the space agency's primary facility for conducting research and development in quantum information sciences. QuAIL conducts fundamental research in quantum physics but also explores how best to exploit and apply this disruptive technology to enable NASA missions in aeronautics, Earth and space sciences, and space exploration. At the same time, machine learning has become a major focus in computer science and captured the imagination of the public as a panacea to myriad big data problems. In this talk, we will discuss how classical machine learning can take advantage of quantum computing to significantly improve its effectiveness. Although we illustrate this concept on a quantum annealer, other quantum platforms could be used as well. If explored fully and implemented efficiently, quantum machine learning could greatly accelerate a wide range of tasks leading to new technologies and discoveries that will significantly change the way we solve real-world problems.

  15. Water quality of Danube Delta systems: ecological status and prediction using machine-learning algorithms.

    Science.gov (United States)

    Stoica, C; Camejo, J; Banciu, A; Nita-Lazar, M; Paun, I; Cristofor, S; Pacheco, O R; Guevara, M

    2016-01-01

    Environmental issues have a worldwide impact on water bodies, including the Danube Delta, the largest European wetland. The Water Framework Directive (2000/60/EC) implementation operates toward solving environmental issues from European and national level. As a consequence, the water quality and the biocenosis structure was altered, especially the composition of the macro invertebrate community which is closely related to habitat and substrate heterogeneity. This study aims to assess the ecological status of Southern Branch of the Danube Delta, Saint Gheorghe, using benthic fauna and a computational method as an alternative for monitoring the water quality in real time. The analysis of spatial and temporal variability of unicriterial and multicriterial indices were used to assess the current status of aquatic systems. In addition, chemical status was characterized. Coliform bacteria and several chemical parameters were used to feed machine-learning (ML) algorithms to simulate a real-time classification method. Overall, the assessment of the water bodies indicated a moderate ecological status based on the biological quality elements or a good ecological status based on chemical and ML algorithms criteria.

  16. Quality prediction modeling for sintered ores based on mechanism models of sintering and extreme learning machine based error compensation

    Science.gov (United States)

    Tiebin, Wu; Yunlian, Liu; Xinjun, Li; Yi, Yu; Bin, Zhang

    2018-06-01

    Aiming at the difficulty in quality prediction of sintered ores, a hybrid prediction model is established based on mechanism models of sintering and time-weighted error compensation on the basis of the extreme learning machine (ELM). At first, mechanism models of drum index, total iron, and alkalinity are constructed according to the chemical reaction mechanism and conservation of matter in the sintering process. As the process is simplified in the mechanism models, these models are not able to describe high nonlinearity. Therefore, errors are inevitable. For this reason, the time-weighted ELM based error compensation model is established. Simulation results verify that the hybrid model has a high accuracy and can meet the requirement for industrial applications.

  17. The Development of a Machine Learning Inpatient Acute Kidney Injury Prediction Model.

    Science.gov (United States)

    Koyner, Jay L; Carey, Kyle A; Edelson, Dana P; Churpek, Matthew M

    2018-03-28

    To develop an acute kidney injury risk prediction model using electronic health record data for longitudinal use in hospitalized patients. Observational cohort study. Tertiary, urban, academic medical center from November 2008 to January 2016. All adult inpatients without pre-existing renal failure at admission, defined as first serum creatinine greater than or equal to 3.0 mg/dL, International Classification of Diseases, 9th Edition, code for chronic kidney disease stage 4 or higher or having received renal replacement therapy within 48 hours of first serum creatinine measurement. None. Demographics, vital signs, diagnostics, and interventions were used in a Gradient Boosting Machine algorithm to predict serum creatinine-based Kidney Disease Improving Global Outcomes stage 2 acute kidney injury, with 60% of the data used for derivation and 40% for validation. Area under the receiver operator characteristic curve (AUC) was calculated in the validation cohort, and subgroup analyses were conducted across admission serum creatinine, acute kidney injury severity, and hospital location. Among the 121,158 included patients, 17,482 (14.4%) developed any Kidney Disease Improving Global Outcomes acute kidney injury, with 4,251 (3.5%) developing stage 2. The AUC (95% CI) was 0.90 (0.90-0.90) for predicting stage 2 acute kidney injury within 24 hours and 0.87 (0.87-0.87) within 48 hours. The AUC was 0.96 (0.96-0.96) for receipt of renal replacement therapy (n = 821) in the next 48 hours. Accuracy was similar across hospital settings (ICU, wards, and emergency department) and admitting serum creatinine groupings. At a probability threshold of greater than or equal to 0.022, the algorithm had a sensitivity of 84% and a specificity of 85% for stage 2 acute kidney injury and predicted the development of stage 2 a median of 41 hours (interquartile range, 12-141 hr) prior to the development of stage 2 acute kidney injury. Readily available electronic health record data can be used

  18. Probabilistic machine learning and artificial intelligence.

    Science.gov (United States)

    Ghahramani, Zoubin

    2015-05-28

    How can a machine learn from experience? Probabilistic modelling provides a framework for understanding what learning is, and has therefore emerged as one of the principal theoretical and practical approaches for designing machines that learn from data acquired through experience. The probabilistic framework, which describes how to represent and manipulate uncertainty about models and predictions, has a central role in scientific data analysis, machine learning, robotics, cognitive science and artificial intelligence. This Review provides an introduction to this framework, and discusses some of the state-of-the-art advances in the field, namely, probabilistic programming, Bayesian optimization, data compression and automatic model discovery.

  19. Probabilistic machine learning and artificial intelligence

    Science.gov (United States)

    Ghahramani, Zoubin

    2015-05-01

    How can a machine learn from experience? Probabilistic modelling provides a framework for understanding what learning is, and has therefore emerged as one of the principal theoretical and practical approaches for designing machines that learn from data acquired through experience. The probabilistic framework, which describes how to represent and manipulate uncertainty about models and predictions, has a central role in scientific data analysis, machine learning, robotics, cognitive science and artificial intelligence. This Review provides an introduction to this framework, and discusses some of the state-of-the-art advances in the field, namely, probabilistic programming, Bayesian optimization, data compression and automatic model discovery.

  20. Machine Learning Algorithms for prediction of regions of high Reynolds Averaged Navier Stokes Uncertainty

    Science.gov (United States)

    Mishra, Aashwin; Iaccarino, Gianluca

    2017-11-01

    In spite of their deficiencies, RANS models represent the workhorse for industrial investigations into turbulent flows. In this context, it is essential to provide diagnostic measures to assess the quality of RANS predictions. To this end, the primary step is to identify feature importances amongst massive sets of potentially descriptive and discriminative flow features. This aids the physical interpretability of the resultant discrepancy model and its extensibility to similar problems. Recent investigations have utilized approaches such as Random Forests, Support Vector Machines and the Least Absolute Shrinkage and Selection Operator for feature selection. With examples, we exhibit how such methods may not be suitable for turbulent flow datasets. The underlying rationale, such as the correlation bias and the required conditions for the success of penalized algorithms, are discussed with illustrative examples. Finally, we provide alternate approaches using convex combinations of regularized regression approaches and randomized sub-sampling in combination with feature selection algorithms, to infer model structure from data. This research was supported by the Defense Advanced Research Projects Agency under the Enabling Quantification of Uncertainty in Physical Systems (EQUiPS) project (technical monitor: Dr Fariba Fahroo).

  1. Machine learning with R cookbook

    CERN Document Server

    Chiu, Yu-Wei

    2015-01-01

    If you want to learn how to use R for machine learning and gain insights from your data, then this book is ideal for you. Regardless of your level of experience, this book covers the basics of applying R to machine learning through to advanced techniques. While it is helpful if you are familiar with basic programming or machine learning concepts, you do not require prior experience to benefit from this book.

  2. Multiple machine learning based descriptive and predictive workflow for the identification of potential PTP1B inhibitors.

    Science.gov (United States)

    Chandra, Sharat; Pandey, Jyotsana; Tamrakar, Akhilesh Kumar; Siddiqi, Mohammad Imran

    2017-01-01

    In insulin and leptin signaling pathway, Protein-Tyrosine Phosphatase 1B (PTP1B) plays a crucial controlling role as a negative regulator, which makes it an attractive therapeutic target for both Type-2 Diabetes (T2D) and obesity. In this work, we have generated classification models by using the inhibition data set of known PTP1B inhibitors to identify new inhibitors of PTP1B utilizing multiple machine learning techniques like naïve Bayesian, random forest, support vector machine and k-nearest neighbors, along with structural fingerprints and selected molecular descriptors. Several models from each algorithm have been constructed and optimized, with the different combination of molecular descriptors and structural fingerprints. For the training and test sets, most of the predictive models showed more than 90% of overall prediction accuracies. The best model was obtained with support vector machine approach and has Matthews Correlation Coefficient of 0.82 for the external test set, which was further employed for the virtual screening of Maybridge small compound database. Five compounds were subsequently selected for experimental assay. Out of these two compounds were found to inhibit PTP1B with significant inhibitory activity in in-vitro inhibition assay. The structural fragments which are important for PTP1B inhibition were identified by naïve Bayesian method and can be further exploited to design new molecules around the identified scaffolds. The descriptive and predictive modeling strategy applied in this study is capable of identifying PTP1B inhibitors from the large compound libraries. Copyright © 2016 Elsevier Inc. All rights reserved.

  3. An Improved Bacterial-Foraging Optimization-Based Machine Learning Framework for Predicting the Severity of Somatization Disorder

    Directory of Open Access Journals (Sweden)

    Xinen Lv

    2018-02-01

    Full Text Available It is of great clinical significance to establish an accurate intelligent model to diagnose the somatization disorder of community correctional personnel. In this study, a novel machine learning framework is proposed to predict the severity of somatization disorder in community correction personnel. The core of this framework is to adopt the improved bacterial foraging optimization (IBFO to optimize two key parameters (penalty coefficient and the kernel width of a kernel extreme learning machine (KELM and build an IBFO-based KELM (IBFO-KELM for the diagnosis of somatization disorder patients. The main innovation point of the IBFO-KELM model is the introduction of opposition-based learning strategies in traditional bacteria foraging optimization, which increases the diversity of bacterial species, keeps a uniform distribution of individuals of initial population, and improves the convergence rate of the BFO optimization process as well as the probability of escaping from the local optimal solution. In order to verify the effectiveness of the method proposed in this study, a 10-fold cross-validation method based on data from a symptom self-assessment scale (SCL-90 is used to make comparison among IBFO-KELM, BFO-KELM (model based on the original bacterial foraging optimization model, GA-KELM (model based on genetic algorithm, PSO-KELM (model based on particle swarm optimization algorithm and Grid-KELM (model based on grid search method. The experimental results show that the proposed IBFO-KELM prediction model has better performance than other methods in terms of classification accuracy, Matthews correlation coefficient (MCC, sensitivity and specificity. It can distinguish very well between severe somatization disorder and mild somatization and assist the psychological doctor with clinical diagnosis.

  4. Machine learning approaches for integrating clinical and imaging features in late-life depression classification and response prediction.

    Science.gov (United States)

    Patel, Meenal J; Andreescu, Carmen; Price, Julie C; Edelman, Kathryn L; Reynolds, Charles F; Aizenstein, Howard J

    2015-10-01

    Currently, depression diagnosis relies primarily on behavioral symptoms and signs, and treatment is guided by trial and error instead of evaluating associated underlying brain characteristics. Unlike past studies, we attempted to estimate accurate prediction models for late-life depression diagnosis and treatment response using multiple machine learning methods with inputs of multi-modal imaging and non-imaging whole brain and network-based features. Late-life depression patients (medicated post-recruitment) (n = 33) and older non-depressed individuals (n = 35) were recruited. Their demographics and cognitive ability scores were recorded, and brain characteristics were acquired using multi-modal magnetic resonance imaging pretreatment. Linear and nonlinear learning methods were tested for estimating accurate prediction models. A learning method called alternating decision trees estimated the most accurate prediction models for late-life depression diagnosis (87.27% accuracy) and treatment response (89.47% accuracy). The diagnosis model included measures of age, Mini-mental state examination score, and structural imaging (e.g. whole brain atrophy and global white mater hyperintensity burden). The treatment response model included measures of structural and functional connectivity. Combinations of multi-modal imaging and/or non-imaging measures may help better predict late-life depression diagnosis and treatment response. As a preliminary observation, we speculate that the results may also suggest that different underlying brain characteristics defined by multi-modal imaging measures-rather than region-based differences-are associated with depression versus depression recovery because to our knowledge this is the first depression study to accurately predict both using the same approach. These findings may help better understand late-life depression and identify preliminary steps toward personalized late-life depression treatment. Copyright © 2015 John Wiley

  5. Predictive Maintenance of Power Substation Equipment by Infrared Thermography Using a Machine-Learning Approach

    Directory of Open Access Journals (Sweden)

    Irfan Ullah

    2017-12-01

    Full Text Available A variety of reasons, specifically contact issues, irregular loads, cracks in insulation, defective relays, terminal junctions and other similar issues, increase the internal temperature of electrical instruments. This results in unexpected disturbances and potential damage to power equipment. Therefore, the initial prevention measures of thermal anomalies in electrical tools are essential to prevent power-equipment failure. In this article, we address this initial prevention mechanism for power substations using a computer-vision approach by taking advantage of infrared thermal images. The thermal images are taken through infrared cameras without disturbing the working operations of power substations. Thus, this article augments the non-destructive approach to defect analysis in electrical power equipment using computer vision and machine learning. We use a total of 150 thermal pictures of different electrical equipment in 10 different substations in operating conditions, using 300 different hotspots. Our approach uses multi-layered perceptron (MLP to classify the thermal conditions of components of power substations into “defect” and “non-defect” classes. A total of eleven features, which are first-order and second-order statistical features, are calculated from the thermal sample images. The performance of MLP shows initial accuracy of 79.78%. We further augment the MLP with graph cut to increase accuracy to 84%. We argue that with the successful development and deployment of this new system, the Technology Department of Chongqing can arrange the recommended actions and thus save cost in repair and outages. This can play an important role in the quick and reliable inspection to potentially prevent power substation equipment from failure, which will save the whole system from breakdown. The increased 84% accuracy with the integration of the graph cut shows the efficacy of the proposed defect analysis approach.

  6. Spatiotemporal prediction of continuous daily PM2.5 concentrations across China using a spatially explicit machine learning algorithm

    Science.gov (United States)

    Zhan, Yu; Luo, Yuzhou; Deng, Xunfei; Chen, Huajin; Grieneisen, Michael L.; Shen, Xueyou; Zhu, Lizhong; Zhang, Minghua

    2017-04-01

    A high degree of uncertainty associated with the emission inventory for China tends to degrade the performance of chemical transport models in predicting PM2.5 concentrations especially on a daily basis. In this study a novel machine learning algorithm, Geographically-Weighted Gradient Boosting Machine (GW-GBM), was developed by improving GBM through building spatial smoothing kernels to weigh the loss function. This modification addressed the spatial nonstationarity of the relationships between PM2.5 concentrations and predictor variables such as aerosol optical depth (AOD) and meteorological conditions. GW-GBM also overcame the estimation bias of PM2.5 concentrations due to missing AOD retrievals, and thus potentially improved subsequent exposure analyses. GW-GBM showed good performance in predicting daily PM2.5 concentrations (R2 = 0.76, RMSE = 23.0 μg/m3) even with partially missing AOD data, which was better than the original GBM model (R2 = 0.71, RMSE = 25.3 μg/m3). On the basis of the continuous spatiotemporal prediction of PM2.5 concentrations, it was predicted that 95% of the population lived in areas where the estimated annual mean PM2.5 concentration was higher than 35 μg/m3, and 45% of the population was exposed to PM2.5 >75 μg/m3 for over 100 days in 2014. GW-GBM accurately predicted continuous daily PM2.5 concentrations in China for assessing acute human health effects.

  7. In silico prediction of Tetrahymena pyriformis toxicity for diverse industrial chemicals with substructure pattern recognition and machine learning methods.

    Science.gov (United States)

    Cheng, Feixiong; Shen, Jie; Yu, Yue; Li, Weihua; Liu, Guixia; Lee, Philip W; Tang, Yun

    2011-03-01

    There is an increasing need for the rapid safety assessment of chemicals by both industries and regulatory agencies throughout the world. In silico techniques are practical alternatives in the environmental hazard assessment. It is especially true to address the persistence, bioaccumulative and toxicity potentials of organic chemicals. Tetrahymena pyriformis toxicity is often used as a toxic endpoint. In this study, 1571 diverse unique chemicals were collected from the literature and composed of the largest diverse data set for T. pyriformis toxicity. Classification predictive models of T. pyriformis toxicity were developed by substructure pattern recognition and different machine learning methods, including support vector machine (SVM), C4.5 decision tree, k-nearest neighbors and random forest. The results of a 5-fold cross-validation showed that the SVM method performed better than other algorithms. The overall predictive accuracies of the SVM classification model with radial basis functions kernel was 92.2% for the 5-fold cross-validation and 92.6% for the external validation set, respectively. Furthermore, several representative substructure patterns for characterizing T. pyriformis toxicity were also identified via the information gain analysis methods. Copyright © 2010 Elsevier Ltd. All rights reserved.

  8. Spatial prediction of landslides using a hybrid machine learning approach based on Random Subspace and Classification and Regression Trees

    Science.gov (United States)

    Pham, Binh Thai; Prakash, Indra; Tien Bui, Dieu

    2018-02-01

    A hybrid machine learning approach of Random Subspace (RSS) and Classification And Regression Trees (CART) is proposed to develop a model named RSSCART for spatial prediction of landslides. This model is a combination of the RSS method which is known as an efficient ensemble technique and the CART which is a state of the art classifier. The Luc Yen district of Yen Bai province, a prominent landslide prone area of Viet Nam, was selected for the model development. Performance of the RSSCART model was evaluated through the Receiver Operating Characteristic (ROC) curve, statistical analysis methods, and the Chi Square test. Results were compared with other benchmark landslide models namely Support Vector Machines (SVM), single CART, Naïve Bayes Trees (NBT), and Logistic Regression (LR). In the development of model, ten important landslide affecting factors related with geomorphology, geology and geo-environment were considered namely slope angles, elevation, slope aspect, curvature, lithology, distance to faults, distance to rivers, distance to roads, and rainfall. Performance of the RSSCART model (AUC = 0.841) is the best compared with other popular landslide models namely SVM (0.835), single CART (0.822), NBT (0.821), and LR (0.723). These results indicate that performance of the RSSCART is a promising method for spatial landslide prediction.

  9. Data-Driven Machine-Learning Model in District Heating System for Heat Load Prediction: A Comparison Study

    Directory of Open Access Journals (Sweden)

    Fisnik Dalipi

    2016-01-01

    Full Text Available We present our data-driven supervised machine-learning (ML model to predict heat load for buildings in a district heating system (DHS. Even though ML has been used as an approach to heat load prediction in literature, it is hard to select an approach that will qualify as a solution for our case as existing solutions are quite problem specific. For that reason, we compared and evaluated three ML algorithms within a framework on operational data from a DH system in order to generate the required prediction model. The algorithms examined are Support Vector Regression (SVR, Partial Least Square (PLS, and random forest (RF. We use the data collected from buildings at several locations for a period of 29 weeks. Concerning the accuracy of predicting the heat load, we evaluate the performance of the proposed algorithms using mean absolute error (MAE, mean absolute percentage error (MAPE, and correlation coefficient. In order to determine which algorithm had the best accuracy, we conducted performance comparison among these ML algorithms. The comparison of the algorithms indicates that, for DH heat load prediction, SVR method presented in this paper is the most efficient one out of the three also compared to other methods found in the literature.

  10. Soft computing in machine learning

    CERN Document Server

    Park, Jooyoung; Inoue, Atsushi

    2014-01-01

    As users or consumers are now demanding smarter devices, intelligent systems are revolutionizing by utilizing machine learning. Machine learning as part of intelligent systems is already one of the most critical components in everyday tools ranging from search engines and credit card fraud detection to stock market analysis. You can train machines to perform some things, so that they can automatically detect, diagnose, and solve a variety of problems. The intelligent systems have made rapid progress in developing the state of the art in machine learning based on smart and deep perception. Using machine learning, the intelligent systems make widely applications in automated speech recognition, natural language processing, medical diagnosis, bioinformatics, and robot locomotion. This book aims at introducing how to treat a substantial amount of data, to teach machines and to improve decision making models. And this book specializes in the developments of advanced intelligent systems through machine learning. It...

  11. Utilizing Machine Learning and Automated Performance Metrics to Evaluate Robot-Assisted Radical Prostatectomy Performance and Predict Outcomes.

    Science.gov (United States)

    Hung, Andrew J; Chen, Jian; Che, Zhengping; Nilanon, Tanachat; Jarc, Anthony; Titus, Micha; Oh, Paul J; Gill, Inderbir S; Liu, Yan

    2018-05-01

    Surgical performance is critical for clinical outcomes. We present a novel machine learning (ML) method of processing automated performance metrics (APMs) to evaluate surgical performance and predict clinical outcomes after robot-assisted radical prostatectomy (RARP). We trained three ML algorithms utilizing APMs directly from robot system data (training material) and hospital length of stay (LOS; training label) (≤2 days and >2 days) from 78 RARP cases, and selected the algorithm with the best performance. The selected algorithm categorized the cases as "Predicted as expected LOS (pExp-LOS)" and "Predicted as extended LOS (pExt-LOS)." We compared postoperative outcomes of the two groups (Kruskal-Wallis/Fisher's exact tests). The algorithm then predicted individual clinical outcomes, which we compared with actual outcomes (Spearman's correlation/Fisher's exact tests). Finally, we identified five most relevant APMs adopted by the algorithm during predicting. The "Random Forest-50" (RF-50) algorithm had the best performance, reaching 87.2% accuracy in predicting LOS (73 cases as "pExp-LOS" and 5 cases as "pExt-LOS"). The "pExp-LOS" cases outperformed the "pExt-LOS" cases in surgery time (3.7 hours vs 4.6 hours, p = 0.007), LOS (2 days vs 4 days, p = 0.02), and Foley duration (9 days vs 14 days, p = 0.02). Patient outcomes predicted by the algorithm had significant association with the "ground truth" in surgery time (p algorithm in predicting, were largely related to camera manipulation. To our knowledge, ours is the first study to show that APMs and ML algorithms may help assess surgical RARP performance and predict clinical outcomes. With further accrual of clinical data (oncologic and functional data), this process will become increasingly relevant and valuable in surgical assessment and training.

  12. A hybrid machine learning model to predict and visualize nitrate concentration throughout the Central Valley aquifer, California, USA

    Science.gov (United States)

    Ransom, Katherine M.; Nolan, Bernard T.; Traum, Jonathan A.; Faunt, Claudia; Bell, Andrew M.; Gronberg, Jo Ann M.; Wheeler, David C.; Zamora, Celia; Jurgens, Bryant; Schwarz, Gregory E.; Belitz, Kenneth; Eberts, Sandra; Kourakos, George; Harter, Thomas

    2017-01-01

    Intense demand for water in the Central Valley of California and related increases in groundwater nitrate concentration threaten the sustainability of the groundwater resource. To assess contamination risk in the region, we developed a hybrid, non-linear, machine learning model within a statistical learning framework to predict nitrate contamination of groundwater to depths of approximately 500 m below ground surface. A database of 145 predictor variables representing well characteristics, historical and current field and landscape-scale nitrogen mass balances, historical and current land use, oxidation/reduction conditions, groundwater flow, climate, soil characteristics, depth to groundwater, and groundwater age were assigned to over 6000 private supply and public supply wells measured previously for nitrate and located throughout the study area. The boosted regression tree (BRT) method was used to screen and rank variables to predict nitrate concentration at the depths of domestic and public well supplies. The novel approach included as predictor variables outputs from existing physically based models of the Central Valley. The top five most important predictor variables included two oxidation/reduction variables (probability of manganese concentration to exceed 50 ppb and probability of dissolved oxygen concentration to be below 0.5 ppm), field-scale adjusted unsaturated zone nitrogen input for the 1975 time period, average difference between precipitation and evapotranspiration during the years 1971–2000, and 1992 total landscape nitrogen input. Twenty-five variables were selected for the final model for log-transformed nitrate. In general, increasing probability of anoxic conditions and increasing precipitation relative to potential evapotranspiration had a corresponding decrease in nitrate concentration predictions. Conversely, increasing 1975 unsaturated zone nitrogen leaching flux and 1992 total landscape nitrogen input had an increasing relative

  13. Predicting the hepatocarcinogenic potential of alkenylbenzene flavoring agents using toxicogenomics and machine learning

    International Nuclear Information System (INIS)

    Auerbach, Scott S.; Shah, Ruchir R.; Mav, Deepak; Smith, Cynthia S.; Walker, Nigel J.; Vallant, Molly K.; Boorman, Gary A.; Irwin, Richard D.

    2010-01-01

    Identification of carcinogenic activity is the primary goal of the 2-year bioassay. The expense of these studies limits the number of chemicals that can be studied and therefore chemicals need to be prioritized based on a variety of parameters. We have developed an ensemble of support vector machine classification models based on male F344 rat liver gene expression following 2, 14 or 90 days of exposure to a collection of hepatocarcinogens (aflatoxin B1, 1-amino-2,4-dibromoanthraquinone, N-nitrosodimethylamine, methyleugenol) and non-hepatocarcinogens (acetaminophen, ascorbic acid, tryptophan). Seven models were generated based on individual exposure durations (2, 14 or 90 days) or a combination of exposures (2 + 14, 2 + 90, 14 + 90 and 2 + 14 + 90 days). All sets of data, with the exception of one yielded models with 0% cross-validation error. Independent validation of the models was performed using expression data from the liver of rats exposed at 2 dose levels to a collection of alkenylbenzene flavoring agents. Depending on the model used and the exposure duration of the test data, independent validation error rates ranged from 47% to 10%. The variable with the most notable effect on independent validation accuracy was exposure duration of the alkenylbenzene test data. All models generally exhibited improved performance as the exposure duration of the alkenylbenzene data increased. The models differentiated between hepatocarcinogenic (estragole and safrole) and non-hepatocarcinogenic (anethole, eugenol and isoeugenol) alkenylbenzenes previously studied in a carcinogenicity bioassay. In the case of safrole the models correctly differentiated between carcinogenic and non-carcinogenic dose levels. The models predict that two alkenylbenzenes not previously assessed in a carcinogenicity bioassay, myristicin and isosafrole, would be weakly hepatocarcinogenic if studied at a dose level of 2 mmol/kg bw/day for 2 years in male F344 rats; therefore suggesting that these

  14. Machine Learning and Applied Linguistics

    OpenAIRE

    Vajjala, Sowmya

    2018-01-01

    This entry introduces the topic of machine learning and provides an overview of its relevance for applied linguistics and language learning. The discussion will focus on giving an introduction to the methods and applications of machine learning in applied linguistics, and will provide references for further study.

  15. Computational methods using weighed-extreme learning machine to predict protein self-interactions with protein evolutionary information.

    Science.gov (United States)

    An, Ji-Yong; Zhang, Lei; Zhou, Yong; Zhao, Yu-Jun; Wang, Da-Fu

    2017-08-18

    Self-interactions Proteins (SIPs) is important for their biological activity owing to the inherent interaction amongst their secondary structures or domains. However, due to the limitations of experimental Self-interactions detection, one major challenge in the study of prediction SIPs is how to exploit computational approaches for SIPs detection based on evolutionary information contained protein sequence. In the work, we presented a novel computational approach named WELM-LAG, which combined the Weighed-Extreme Learning Machine (WELM) classifier with Local Average Group (LAG) to predict SIPs based on protein sequence. The major improvement of our method lies in presenting an effective feature extraction method used to represent candidate Self-interactions proteins by exploring the evolutionary information embedded in PSI-BLAST-constructed position specific scoring matrix (PSSM); and then employing a reliable and robust WELM classifier to carry out classification. In addition, the Principal Component Analysis (PCA) approach is used to reduce the impact of noise. The WELM-LAG method gave very high average accuracies of 92.94 and 96.74% on yeast and human datasets, respectively. Meanwhile, we compared it with the state-of-the-art support vector machine (SVM) classifier and other existing methods on human and yeast datasets, respectively. Comparative results indicated that our approach is very promising and may provide a cost-effective alternative for predicting SIPs. In addition, we developed a freely available web server called WELM-LAG-SIPs to predict SIPs. The web server is available at http://219.219.62.123:8888/WELMLAG/ .

  16. Machine learning in healthcare informatics

    CERN Document Server

    Acharya, U; Dua, Prerna

    2014-01-01

    The book is a unique effort to represent a variety of techniques designed to represent, enhance, and empower multi-disciplinary and multi-institutional machine learning research in healthcare informatics. The book provides a unique compendium of current and emerging machine learning paradigms for healthcare informatics and reflects the diversity, complexity and the depth and breath of this multi-disciplinary area. The integrated, panoramic view of data and machine learning techniques can provide an opportunity for novel clinical insights and discoveries.

  17. Mirnacle: machine learning with SMOTE and random forest for improving selectivity in pre-miRNA ab initio prediction.

    Science.gov (United States)

    Marques, Yuri Bento; de Paiva Oliveira, Alcione; Ribeiro Vasconcelos, Ana Tereza; Cerqueira, Fabio Ribeiro

    2016-12-15

    MicroRNAs (miRNAs) are key gene expression regulators in plants and animals. Therefore, miRNAs are involved in several biological processes, making the study of these molecules one of the most relevant topics of molecular biology nowadays. However, characterizing miRNAs in vivo is still a complex task. As a consequence, in silico methods have been developed to predict miRNA loci. A common ab initio strategy to find miRNAs in genomic data is to search for sequences that can fold into the typical hairpin structure of miRNA precursors (pre-miRNAs). The current ab initio approaches, however, have selectivity issues, i.e., a high number of false positives is reported, which can lead to laborious and costly attempts to provide biological validation. This study presents an extension of the ab initio method miRNAFold, with the aim of improving selectivity through machine learning techniques, namely, random forest combined with the SMOTE procedure that copes with imbalance datasets. By comparing our method, termed Mirnacle, with other important approaches in the literature, we demonstrate that Mirnacle substantially improves selectivity without compromising sensitivity. For the three datasets used in our experiments, our method achieved at least 97% of sensitivity and could deliver a two-fold, 20-fold, and 6-fold increase in selectivity, respectively, compared with the best results of current computational tools. The extension of miRNAFold by the introduction of machine learning techniques, significantly increases selectivity in pre-miRNA ab initio prediction, which optimally contributes to advanced studies on miRNAs, as the need of biological validations is diminished. Hopefully, new research, such as studies of severe diseases caused by miRNA malfunction, will benefit from the proposed computational tool.

  18. Developing a Global Database of Historic Flood Events to Support Machine Learning Flood Prediction in Google Earth Engine

    Science.gov (United States)

    Tellman, B.; Sullivan, J.; Kettner, A.; Brakenridge, G. R.; Slayback, D. A.; Kuhn, C.; Doyle, C.

    2016-12-01

    There is an increasing need to understand flood vulnerability as the societal and economic effects of flooding increases. Risk models from insurance companies and flood models from hydrologists must be calibrated based on flood observations in order to make future predictions that can improve planning and help societies reduce future disasters. Specifically, to improve these models both traditional methods of flood prediction from physically based models as well as data-driven techniques, such as machine learning, require spatial flood observation to validate model outputs and quantify uncertainty. A key dataset that is missing for flood model validation is a global historical geo-database of flood event extents. Currently, the most advanced database of historical flood extent is hosted and maintained at the Dartmouth Flood Observatory (DFO) that has catalogued 4320 floods (1985-2015) but has only mapped 5% of these floods. We are addressing this data gap by mapping the inventory of floods in the DFO database to create a first-of- its-kind, comprehensive, global and historical geospatial database of flood events. To do so, we combine water detection algorithms on MODIS and Landsat 5,7 and 8 imagery in Google Earth Engine to map discrete flood events. The created database will be available in the Earth Engine Catalogue for download by country, region, or time period. This dataset can be leveraged for new data-driven hydrologic modeling using machine learning algorithms in Earth Engine's highly parallelized computing environment, and we will show examples for New York and Senegal.

  19. Is demography destiny? Application of machine learning techniques to accurately predict population health outcomes from a minimal demographic dataset.

    Directory of Open Access Journals (Sweden)

    Wei Luo

    Full Text Available For years, we have relied on population surveys to keep track of regional public health statistics, including the prevalence of non-communicable diseases. Because of the cost and limitations of such surveys, we often do not have the up-to-date data on health outcomes of a region. In this paper, we examined the feasibility of inferring regional health outcomes from socio-demographic data that are widely available and timely updated through national censuses and community surveys. Using data for 50 American states (excluding Washington DC from 2007 to 2012, we constructed a machine-learning model to predict the prevalence of six non-communicable disease (NCD outcomes (four NCDs and two major clinical risk factors, based on population socio-demographic characteristics from the American Community Survey. We found that regional prevalence estimates for non-communicable diseases can be reasonably predicted. The predictions were highly correlated with the observed data, in both the states included in the derivation model (median correlation 0.88 and those excluded from the development for use as a completely separated validation sample (median correlation 0.85, demonstrating that the model had sufficient external validity to make good predictions, based on demographics alone, for areas not included in the model development. This highlights both the utility of this sophisticated approach to model development, and the vital importance of simple socio-demographic characteristics as both indicators and determinants of chronic disease.

  20. Is demography destiny? Application of machine learning techniques to accurately predict population health outcomes from a minimal demographic dataset.

    Science.gov (United States)

    Luo, Wei; Nguyen, Thin; Nichols, Melanie; Tran, Truyen; Rana, Santu; Gupta, Sunil; Phung, Dinh; Venkatesh, Svetha; Allender, Steve

    2015-01-01

    For years, we have relied on population surveys to keep track of regional public health statistics, including the prevalence of non-communicable diseases. Because of the cost and limitations of such surveys, we often do not have the up-to-date data on health outcomes of a region. In this paper, we examined the feasibility of inferring regional health outcomes from socio-demographic data that are widely available and timely updated through national censuses and community surveys. Using data for 50 American states (excluding Washington DC) from 2007 to 2012, we constructed a machine-learning model to predict the prevalence of six non-communicable disease (NCD) outcomes (four NCDs and two major clinical risk factors), based on population socio-demographic characteristics from the American Community Survey. We found that regional prevalence estimates for non-communicable diseases can be reasonably predicted. The predictions were highly correlated with the observed data, in both the states included in the derivation model (median correlation 0.88) and those excluded from the development for use as a completely separated validation sample (median correlation 0.85), demonstrating that the model had sufficient external validity to make good predictions, based on demographics alone, for areas not included in the model development. This highlights both the utility of this sophisticated approach to model development, and the vital importance of simple socio-demographic characteristics as both indicators and determinants of chronic disease.

  1. An Internet of Things System for Underground Mine Air Quality Pollutant Prediction Based on Azure Machine Learning.

    Science.gov (United States)

    Jo, ByungWan; Khan, Rana Muhammad Asad

    2018-03-21

    The implementation of wireless sensor networks (WSNs) for monitoring the complex, dynamic, and harsh environment of underground coal mines (UCMs) is sought around the world to enhance safety. However, previously developed smart systems are limited to monitoring or, in a few cases, can report events. Therefore, this study introduces a reliable, efficient, and cost-effective internet of things (IoT) system for air quality monitoring with newly added features of assessment and pollutant prediction. This system is comprised of sensor modules, communication protocols, and a base station, running Azure Machine Learning (AML) Studio over it. Arduino-based sensor modules with eight different parameters were installed at separate locations of an operational UCM. Based on the sensed data, the proposed system assesses mine air quality in terms of the mine environment index (MEI). Principal component analysis (PCA) identified CH₄, CO, SO₂, and H₂S as the most influencing gases significantly affecting mine air quality. The results of PCA were fed into the ANN model in AML studio, which enabled the prediction of MEI. An optimum number of neurons were determined for both actual input and PCA-based input parameters. The results showed a better performance of the PCA-based ANN for MEI prediction, with R ² and RMSE values of 0.6654 and 0.2104, respectively. Therefore, the proposed Arduino and AML-based system enhances mine environmental safety by quickly assessing and predicting mine air quality.

  2. An Internet of Things System for Underground Mine Air Quality Pollutant Prediction Based on Azure Machine Learning

    Directory of Open Access Journals (Sweden)

    ByungWan Jo

    2018-03-01

    Full Text Available The implementation of wireless sensor networks (WSNs for monitoring the complex, dynamic, and harsh environment of underground coal mines (UCMs is sought around the world to enhance safety. However, previously developed smart systems are limited to monitoring or, in a few cases, can report events. Therefore, this study introduces a reliable, efficient, and cost-effective internet of things (IoT system for air quality monitoring with newly added features of assessment and pollutant prediction. This system is comprised of sensor modules, communication protocols, and a base station, running Azure Machine Learning (AML Studio over it. Arduino-based sensor modules with eight different parameters were installed at separate locations of an operational UCM. Based on the sensed data, the proposed system assesses mine air quality in terms of the mine environment index (MEI. Principal component analysis (PCA identified CH4, CO, SO2, and H2S as the most influencing gases significantly affecting mine air quality. The results of PCA were fed into the ANN model in AML studio, which enabled the prediction of MEI. An optimum number of neurons were determined for both actual input and PCA-based input parameters. The results showed a better performance of the PCA-based ANN for MEI prediction, with R2 and RMSE values of 0.6654 and 0.2104, respectively. Therefore, the proposed Arduino and AML-based system enhances mine environmental safety by quickly assessing and predicting mine air quality.

  3. Machine learning and medical imaging

    CERN Document Server

    Shen, Dinggang; Sabuncu, Mert

    2016-01-01

    Machine Learning and Medical Imaging presents state-of- the-art machine learning methods in medical image analysis. It first summarizes cutting-edge machine learning algorithms in medical imaging, including not only classical probabilistic modeling and learning methods, but also recent breakthroughs in deep learning, sparse representation/coding, and big data hashing. In the second part leading research groups around the world present a wide spectrum of machine learning methods with application to different medical imaging modalities, clinical domains, and organs. The biomedical imaging modalities include ultrasound, magnetic resonance imaging (MRI), computed tomography (CT), histology, and microscopy images. The targeted organs span the lung, liver, brain, and prostate, while there is also a treatment of examining genetic associations. Machine Learning and Medical Imaging is an ideal reference for medical imaging researchers, industry scientists and engineers, advanced undergraduate and graduate students, a...

  4. Machine Learning of Fault Friction

    Science.gov (United States)

    Johnson, P. A.; Rouet-Leduc, B.; Hulbert, C.; Marone, C.; Guyer, R. A.

    2017-12-01

    We are applying machine learning (ML) techniques to continuous acoustic emission (AE) data from laboratory earthquake experiments. Our goal is to apply explicit ML methods to this acoustic datathe AE in order to infer frictional properties of a laboratory fault. The experiment is a double direct shear apparatus comprised of fault blocks surrounding fault gouge comprised of glass beads or quartz powder. Fault characteristics are recorded, including shear stress, applied load (bulk friction = shear stress/normal load) and shear velocity. The raw acoustic signal is continuously recorded. We rely on explicit decision tree approaches (Random Forest and Gradient Boosted Trees) that allow us to identify important features linked to the fault friction. A training procedure that employs both the AE and the recorded shear stress from the experiment is first conducted. Then, testing takes place on data the algorithm has never seen before, using only the continuous AE signal. We find that these methods provide rich information regarding frictional processes during slip (Rouet-Leduc et al., 2017a; Hulbert et al., 2017). In addition, similar machine learning approaches predict failure times, as well as slip magnitudes in some cases. We find that these methods work for both stick slip and slow slip experiments, for periodic slip and for aperiodic slip. We also derive a fundamental relationship between the AE and the friction describing the frictional behavior of any earthquake slip cycle in a given experiment (Rouet-Leduc et al., 2017b). Our goal is to ultimately scale these approaches to Earth geophysical data to probe fault friction. References Rouet-Leduc, B., C. Hulbert, N. Lubbers, K. Barros, C. Humphreys and P. A. Johnson, Machine learning predicts laboratory earthquakes, in review (2017). https://arxiv.org/abs/1702.05774Rouet-LeDuc, B. et al., Friction Laws Derived From the Acoustic Emissions of a Laboratory Fault by Machine Learning (2017), AGU Fall Meeting Session S025

  5. Addressing uncertainty in atomistic machine learning

    DEFF Research Database (Denmark)

    Peterson, Andrew A.; Christensen, Rune; Khorshidi, Alireza

    2017-01-01

    Machine-learning regression has been demonstrated to precisely emulate the potential energy and forces that are output from more expensive electronic-structure calculations. However, to predict new regions of the potential energy surface, an assessment must be made of the credibility of the predi......Machine-learning regression has been demonstrated to precisely emulate the potential energy and forces that are output from more expensive electronic-structure calculations. However, to predict new regions of the potential energy surface, an assessment must be made of the credibility...... of the predictions. In this perspective, we address the types of errors that might arise in atomistic machine learning, the unique aspects of atomistic simulations that make machine-learning challenging, and highlight how uncertainty analysis can be used to assess the validity of machine-learning predictions. We...... suggest this will allow researchers to more fully use machine learning for the routine acceleration of large, high-accuracy, or extended-time simulations. In our demonstrations, we use a bootstrap ensemble of neural network-based calculators, and show that the width of the ensemble can provide an estimate...

  6. Predicting Ascospore Release of Monilinia vaccinii-corymbosi of Blueberry with Machine Learning.

    Science.gov (United States)

    Harteveld, Dalphy O C; Grant, Michael R; Pscheidt, Jay W; Peever, Tobin L

    2017-11-01

    Mummy berry, caused by Monilinia vaccinii-corymbosi, causes economic losses of highbush blueberry in the U.S. Pacific Northwest (PNW). Apothecia develop from mummified berries overwintering on soil surfaces and produce ascospores that infect tissue emerging from floral and vegetative buds. Disease control currently relies on fungicides applied on a calendar basis rather than inoculum availability. To establish a prediction model for ascospore release, apothecial development was tracked in three fields, one in western Oregon and two in northwestern Washington in 2015 and 2016. Air and soil temperature, precipitation, soil moisture, leaf wetness, relative humidity and solar radiation were monitored using in-field weather stations and Washington State University's AgWeatherNet stations. Four modeling approaches were compared: logistic regression, multivariate adaptive regression splines, artificial neural networks, and random forest. A supervised learning approach was used to train the models on two data sets: training (70%) and testing (30%). The importance of environmental factors was calculated for each model separately. Soil temperature, soil moisture, and solar radiation were identified as the most important factors influencing ascospore release. Random forest models, with 78% accuracy, showed the best performance compared with the other models. Results of this research helps PNW blueberry growers to optimize fungicide use and reduce production costs.

  7. Machine learning topological states

    Science.gov (United States)

    Deng, Dong-Ling; Li, Xiaopeng; Das Sarma, S.

    2017-11-01

    Artificial neural networks and machine learning have now reached a new era after several decades of improvement where applications are to explode in many fields of science, industry, and technology. Here, we use artificial neural networks to study an intriguing phenomenon in quantum physics—the topological phases of matter. We find that certain topological states, either symmetry-protected or with intrinsic topological order, can be represented with classical artificial neural networks. This is demonstrated by using three concrete spin systems, the one-dimensional (1D) symmetry-protected topological cluster state and the 2D and 3D toric code states with intrinsic topological orders. For all three cases, we show rigorously that the topological ground states can be represented by short-range neural networks in an exact and efficient fashion—the required number of hidden neurons is as small as the number of physical spins and the number of parameters scales only linearly with the system size. For the 2D toric-code model, we find that the proposed short-range neural networks can describe the excited states with Abelian anyons and their nontrivial mutual statistics as well. In addition, by using reinforcement learning we show that neural networks are capable of finding the topological ground states of nonintegrable Hamiltonians with strong interactions and studying their topological phase transitions. Our results demonstrate explicitly the exceptional power of neural networks in describing topological quantum states, and at the same time provide valuable guidance to machine learning of topological phases in generic lattice models.

  8. Adaptive Machine Aids to Learning.

    Science.gov (United States)

    Starkweather, John A.

    With emphasis on man-machine relationships and on machine evolution, computer-assisted instruction (CAI) is examined in this paper. The discussion includes the background of machine assistance to learning, the current status of CAI, directions of development, the development of criteria for successful instruction, meeting the needs of users,…

  9. Model-Agnostic Interpretability of Machine Learning

    OpenAIRE

    Ribeiro, Marco Tulio; Singh, Sameer; Guestrin, Carlos

    2016-01-01

    Understanding why machine learning models behave the way they do empowers both system designers and end-users in many ways: in model selection, feature engineering, in order to trust and act upon the predictions, and in more intuitive user interfaces. Thus, interpretability has become a vital concern in machine learning, and work in the area of interpretable models has found renewed interest. In some applications, such models are as accurate as non-interpretable ones, and thus are preferred f...

  10. Implementing Machine Learning in the PCWG Tool

    Energy Technology Data Exchange (ETDEWEB)

    Clifton, Andrew; Ding, Yu; Stuart, Peter

    2016-12-13

    The Power Curve Working Group (www.pcwg.org) is an ad-hoc industry-led group to investigate the performance of wind turbines in real-world conditions. As part of ongoing experience-sharing exercises, machine learning has been proposed as a possible way to predict turbine performance. This presentation provides some background information about machine learning and how it might be implemented in the PCWG exercises.

  11. ToxiM: A Toxicity Prediction Tool for Small Molecules Developed Using Machine Learning and Chemoinformatics Approaches

    Directory of Open Access Journals (Sweden)

    Ashok K. Sharma

    2017-11-01

    Full Text Available The experimental methods for the prediction of molecular toxicity are tedious and time-consuming tasks. Thus, the computational approaches could be used to develop alternative methods for toxicity prediction. We have developed a tool for the prediction of molecular toxicity along with the aqueous solubility and permeability of any molecule/metabolite. Using a comprehensive and curated set of toxin molecules as a training set, the different chemical and structural based features such as descriptors and fingerprints were exploited for feature selection, optimization and development of machine learning based classification and regression models. The compositional differences in the distribution of atoms were apparent between toxins and non-toxins, and hence, the molecular features were used for the classification and regression. On 10-fold cross-validation, the descriptor-based, fingerprint-based and hybrid-based classification models showed similar accuracy (93% and Matthews's correlation coefficient (0.84. The performances of all the three models were comparable (Matthews's correlation coefficient = 0.84–0.87 on the blind dataset. In addition, the regression-based models using descriptors as input features were also compared and evaluated on the blind dataset. Random forest based regression model for the prediction of solubility performed better (R2 = 0.84 than the multi-linear regression (MLR and partial least square regression (PLSR models, whereas, the partial least squares based regression model for the prediction of permeability (caco-2 performed better (R2 = 0.68 in comparison to the random forest and MLR based regression models. The performance of final classification and regression models was evaluated using the two validation datasets including the known toxins and commonly used constituents of health products, which attests to its accuracy. The ToxiM web server would be a highly useful and reliable tool for the prediction of toxicity

  12. System modeling based on machine learning for anomaly detection and predictive maintenance in industrial plants

    OpenAIRE

    Kroll, Björn; Schaffranek, David; Schriegel, Sebastian; Niggemann, Oliver

    2014-01-01

    Electricity, water or air are some Industrial energy carriers which are struggling under the prices of primary energy carriers. The European Union for example used more 20.000.000 GWh electricity in 2011 based on the IEA Report [1]. Cyber Physical Production Systems (CPPS) are able to reduce this amount, but they also help to increase the efficiency of machines above expectations which results in a more cost efficient production. Especially in the field of improving industrial plants, one of ...

  13. Machine Learning for Robotic Vision

    OpenAIRE

    Drummond, Tom

    2018-01-01

    Machine learning is a crucial enabling technology for robotics, in particular for unlocking the capabilities afforded by visual sensing. This talk will present research within Prof Drummond’s lab that explores how machine learning can be developed and used within the context of Robotic Vision.

  14. Machine Learning Algorithms Utilizing Quantitative CT Features May Predict Eventual Onset of Bronchiolitis Obliterans Syndrome After Lung Transplantation.

    Science.gov (United States)

    Barbosa, Eduardo J Mortani; Lanclus, Maarten; Vos, Wim; Van Holsbeke, Cedric; De Backer, William; De Backer, Jan; Lee, James

    2018-02-19

    Long-term survival after lung transplantation (LTx) is limited by bronchiolitis obliterans syndrome (BOS), defined as a sustained decline in forced expiratory volume in the first second (FEV 1 ) not explained by other causes. We assessed whether machine learning (ML) utilizing quantitative computed tomography (qCT) metrics can predict eventual development of BOS. Paired inspiratory-expiratory CT scans of 71 patients who underwent LTx were analyzed retrospectively (BOS [n = 41] versus non-BOS [n = 30]), using at least two different time points. The BOS cohort experienced a reduction in FEV 1 of >10% compared to baseline FEV 1 post LTx. Multifactor analysis correlated declining FEV 1 with qCT features linked to acute inflammation or BOS onset. Student t test and ML were applied on baseline qCT features to identify lung transplant patients at baseline that eventually developed BOS. The FEV 1 decline in the BOS cohort correlated with an increase in the lung volume (P = .027) and in the central airway volume at functional residual capacity (P = .018), not observed in non-BOS patients, whereas the non-BOS cohort experienced a decrease in the central airway volume at total lung capacity with declining FEV 1 (P = .039). Twenty-three baseline qCT parameters could significantly distinguish between non-BOS patients and eventual BOS developers (P machine), we could identify BOS developers at baseline with an accuracy of 85%, using only three qCT parameters. ML utilizing qCT could discern distinct mechanisms driving FEV 1 decline in BOS and non-BOS LTx patients and predict eventual onset of BOS. This approach may become useful to optimize management of LTx patients. Copyright © 2018 The Association of University Radiologists. Published by Elsevier Inc. All rights reserved.

  15. Prediction of Running Injuries from Training Load: a Machine Learning Approach.

    NARCIS (Netherlands)

    Dijkhuis, Talko; Otter, Ruby; Velthuijsen, H.; Lemmink, Koen A.P.M.

    2017-01-01

    The prediction of the running injuries based on selfreported training data on load is difficult. At present, coaches and researchers have no validated system to predict if a runner has an increased risk of injuries. We aim to develop an algorithm to predict the increase of the risk of a runner to

  16. Acoustic Log Prediction on the Basis of Kernel Extreme Learning Machine for Wells in GJH Survey, Erdos Basin

    Directory of Open Access Journals (Sweden)

    Jianhua Cao

    2017-01-01

    Full Text Available In petroleum exploration, the acoustic log (DT is popularly used as an estimator to calculate formation porosity, to carry out petrophysical studies, or to participate in geological analysis and research (e.g., to map abnormal pore-fluid pressure. But sometime it does not exist in those old wells drilled 20 years ago, either because of data loss or because of just being not recorded at that time. Thus synthesizing the DT log becomes the necessary task for the researchers. In this paper we propose using kernel extreme learning machine (KELM to predict missing sonic (DT logs when only common logs (e.g., natural gamma ray: GR, deep resistivity: REID, and bulk density: DEN are available. The common logs are set as predictors and the DT log is the target. By using KELM, a prediction model is firstly created based on the experimental data and then confirmed and validated by blind-testing the results in wells containing both the predictors and the target (DT values used in the supervised training. Finally the optimal model is set up as a predictor. A case study for wells in GJH survey from the Erdos Basin, about velocity inversion using the KELM-estimated DT values, is presented. The results are promising and encouraging.

  17. PRGPred: A platform for prediction of domains of resistance gene analogue (RGA in Arecaceae developed using machine learning algorithms

    Directory of Open Access Journals (Sweden)

    MATHODIYIL S. MANJULA

    2015-12-01

    Full Text Available Plant disease resistance genes (R-genes are responsible for initiation of defense mechanism against various phytopathogens. The majority of plant R-genes are members of very large multi-gene families, which encode structurally related proteins containing nucleotide binding site domains (NBS and C-terminal leucine rich repeats (LRR. Other classes possess' an extracellular LRR domain, a transmembrane domain and sometimes, an intracellular serine/threonine kinase domain. R-proteins work in pathogen perception and/or the activation of conserved defense signaling networks. In the present study, sequences representing resistance gene analogues (RGAs of coconut, arecanut, oil palm and date palm were collected from NCBI, sorted based on domains and assembled into a database. The sequences were analyzed in PRINTS database to find out the conserved domains and their motifs present in the RGAs. Based on these domains, we have also developed a tool to predict the domains of palm R-genes using various machine learning algorithms. The model files were selected based on the performance of the best classifier in training and testing. All these information is stored and made available in the online ‘PRGpred' database and prediction tool.

  18. What subject matter questions motivate the use of machine learning approaches compared to statistical models for probability prediction?

    Science.gov (United States)

    Binder, Harald

    2014-07-01

    This is a discussion of the following papers: "Probability estimation with machine learning methods for dichotomous and multicategory outcome: Theory" by Jochen Kruppa, Yufeng Liu, Gérard Biau, Michael Kohler, Inke R. König, James D. Malley, and Andreas Ziegler; and "Probability estimation with machine learning methods for dichotomous and multicategory outcome: Applications" by Jochen Kruppa, Yufeng Liu, Hans-Christian Diener, Theresa Holste, Christian Weimar, Inke R. König, and Andreas Ziegler. © 2014 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.

  19. Primary Sclerosing Cholangitis Risk Estimate Tool (PREsTo) Predicts Outcomes in PSC: A Derivation & Validation Study Using Machine Learning.

    Science.gov (United States)

    Eaton, John E; Vesterhus, Mette; McCauley, Bryan M; Atkinson, Elizabeth J; Schlicht, Erik M; Juran, Brian D; Gossard, Andrea A; LaRusso, Nicholas F; Gores, Gregory J; Karlsen, Tom H; Lazaridis, Konstantinos N

    2018-05-09

    Improved methods are needed to risk stratify and predict outcomes in patients with primary sclerosing cholangitis (PSC). Therefore, we sought to derive and validate a new prediction model and compare its performance to existing surrogate markers. The model was derived using 509 subjects from a multicenter North American cohort and validated in an international multicenter cohort (n=278). Gradient boosting, a machine based learning technique, was used to create the model. The endpoint was hepatic decompensation (ascites, variceal hemorrhage or encephalopathy). Subjects with advanced PSC or cholangiocarcinoma at baseline were excluded. The PSC risk estimate tool (PREsTo) consists of 9 variables: bilirubin, albumin, serum alkaline phosphatase (SAP) times the upper limit of normal (ULN), platelets, AST, hemoglobin, sodium, patient age and the number of years since PSC was diagnosed. Validation in an independent cohort confirms PREsTo accurately predicts decompensation (C statistic 0.90, 95% confidence interval (CI) 0.84-0.95) and performed well compared to MELD score (C statistic 0.72, 95% CI 0.57-0.84), Mayo PSC risk score (C statistic 0.85, 95% CI 0.77-0.92) and SAP statistic 0.65, 95% CI 0.55-0.73). PREsTo continued to be accurate among individuals with a bilirubin statistic 0.90, 95% CI 0.82-0.96) and when the score was re-applied at a later course in the disease (C statistic 0.82, 95% CI 0.64-0.95). PREsTo accurately predicts hepatic decompensation in PSC and exceeds the performance among other widely available, noninvasive prognostic scoring systems. This article is protected by copyright. All rights reserved. © 2018 by the American Association for the Study of Liver Diseases.

  20. Using Machine Learning to Predict Swine Movements within a Regional Program to Improve Control of Infectious Diseases in the US.

    Science.gov (United States)

    Valdes-Donoso, Pablo; VanderWaal, Kimberly; Jarvis, Lovell S; Wayne, Spencer R; Perez, Andres M

    2017-01-01

    Between-farm animal movement is one of the most important factors influencing the spread of infectious diseases in food animals, including in the US swine industry. Understanding the structural network of contacts in a food animal industry is prerequisite to planning for efficient production strategies and for effective disease control measures. Unfortunately, data regarding between-farm animal movements in the US are not systematically collected and thus, such information is often unavailable. In this paper, we develop a procedure to replicate the structure of a network, making use of partial data available, and subsequently use the model developed to predict animal movements among sites in 34 Minnesota counties. First, we summarized two networks of swine producing facilities in Minnesota, then we used a machine learning technique referred to as random forest, an ensemble of independent classification trees, to estimate the probability of pig movements between farms and/or markets sites located in two counties in Minnesota. The model was calibrated and tested by comparing predicted data and observed data in those two counties for which data were available. Finally, the model was used to predict animal movements in sites located across 34 Minnesota counties. Variables that were important in predicting pig movements included between-site distance, ownership, and production type of the sending and receiving farms and/or markets. Using a weighted-kernel approach to describe spatial variation in the centrality measures of the predicted network, we showed that the south-central region of the study area exhibited high aggregation of predicted pig movements. Our results show an overlap with the distribution of outbreaks of porcine reproductive and respiratory syndrome, which is believed to be transmitted, at least in part, though animal movements. While the correspondence of movements and disease is not a causal test, it suggests that the predicted network may approximate

  1. Online transfer learning with extreme learning machine

    Science.gov (United States)

    Yin, Haibo; Yang, Yun-an

    2017-05-01

    In this paper, we propose a new transfer learning algorithm for online training. The proposed algorithm, which is called Online Transfer Extreme Learning Machine (OTELM), is based on Online Sequential Extreme Learning Machine (OSELM) while it introduces Semi-Supervised Extreme Learning Machine (SSELM) to transfer knowledge from the source to the target domain. With the manifold regularization, SSELM picks out instances from the source domain that are less relevant to those in the target domain to initialize the online training, so as to improve the classification performance. Experimental results demonstrate that the proposed OTELM can effectively use instances in the source domain to enhance the learning performance.

  2. Machine Learning Techniques in Clinical Vision Sciences.

    Science.gov (United States)

    Caixinha, Miguel; Nunes, Sandrina

    2017-01-01

    This review presents and discusses the contribution of machine learning techniques for diagnosis and disease monitoring in the context of clinical vision science. Many ocular diseases leading to blindness can be halted or delayed when detected and treated at its earliest stages. With the recent developments in diagnostic devices, imaging and genomics, new sources of data for early disease detection and patients' management are now available. Machine learning techniques emerged in the biomedical sciences as clinical decision-support techniques to improve sensitivity and specificity of disease detection and monitoring, increasing objectively the clinical decision-making process. This manuscript presents a review in multimodal ocular disease diagnosis and monitoring based on machine learning approaches. In the first section, the technical issues related to the different machine learning approaches will be present. Machine learning techniques are used to automatically recognize complex patterns in a given dataset. These techniques allows creating homogeneous groups (unsupervised learning), or creating a classifier predicting group membership of new cases (supervised learning), when a group label is available for each case. To ensure a good performance of the machine learning techniques in a given dataset, all possible sources of bias should be removed or minimized. For that, the representativeness of the input dataset for the true population should be confirmed, the noise should be removed, the missing data should be treated and the data dimensionally (i.e., the number of parameters/features and the number of cases in the dataset) should be adjusted. The application of machine learning techniques in ocular disease diagnosis and monitoring will be presented and discussed in the second section of this manuscript. To show the clinical benefits of machine learning in clinical vision sciences, several examples will be presented in glaucoma, age-related macular degeneration

  3. A New Approach to Predict user Mobility Using Semantic Analysis and Machine Learning.

    Science.gov (United States)

    Fernandes, Roshan; D'Souza G L, Rio

    2017-10-19

    Mobility prediction is a technique in which the future location of a user is identified in a given network. Mobility prediction provides solutions to many day-to-day life problems. It helps in seamless handovers in wireless networks to provide better location based services and to recalculate paths in Mobile Ad hoc Networks (MANET). In the present study, a framework is presented which predicts user mobility in presence and absence of mobility history. Naïve Bayesian classification algorithm and Markov Model are used to predict user future location when user mobility history is available. An attempt is made to predict user future location by using Short Message Service (SMS) and instantaneous Geological coordinates in the absence of mobility patterns. The proposed technique compares the performance metrics with commonly used Markov Chain model. From the experimental results it is evident that the techniques used in this work gives better results when considering both spatial and temporal information. The proposed method predicts user's future location in the absence of mobility history quite fairly. The proposed work is applied to predict the mobility of medical rescue vehicles and social security systems.

  4. Prediction of Air Pollutants Concentration Based on an Extreme Learning Machine: The Case of Hong Kong

    OpenAIRE

    Zhang, Jiangshe; Ding, Weifu

    2017-01-01

    With the development of the economy and society all over the world, most metropolitan cities are experiencing elevated concentrations of ground-level air pollutants. It is urgent to predict and evaluate the concentration of air pollutants for some local environmental or health agencies. Feed-forward artificial neural networks have been widely used in the prediction of air pollutants concentration. However, there are some drawbacks, such as the low convergence rate and the local minimum. The e...

  5. Novel computational methods to predict drug–target interactions using graph mining and machine learning approaches

    KAUST Repository

    Olayan, Rawan S.

    2017-12-01

    Computational drug repurposing aims at finding new medical uses for existing drugs. The identification of novel drug-target interactions (DTIs) can be a useful part of such a task. Computational determination of DTIs is a convenient strategy for systematic screening of a large number of drugs in the attempt to identify new DTIs at low cost and with reasonable accuracy. This necessitates development of accurate computational methods that can help focus on the follow-up experimental validation on a smaller number of highly likely targets for a drug. Although many methods have been proposed for computational DTI prediction, they suffer the high false positive prediction rate or they do not predict the effect that drugs exert on targets in DTIs. In this report, first, we present a comprehensive review of the recent progress in the field of DTI prediction from data-centric and algorithm-centric perspectives. The aim is to provide a comprehensive review of computational methods for identifying DTIs, which could help in constructing more reliable methods. Then, we present DDR, an efficient method to predict the existence of DTIs. DDR achieves significantly more accurate results compared to the other state-of-theart methods. As supported by independent evidences, we verified as correct 22 out of the top 25 DDR DTIs predictions. This validation proves the practical utility of DDR, suggesting that DDR can be used as an efficient method to identify 5 correct DTIs. Finally, we present DDR-FE method that predicts the effect types of a drug on its target. On different representative datasets, under various test setups, and using different performance measures, we show that DDR-FE achieves extremely good performance. Using blind test data, we verified as correct 2,300 out of 3,076 DTIs effects predicted by DDR-FE. This suggests that DDR-FE can be used as an efficient method to identify correct effects of a drug on its target.

  6. Machine Learning Phases of Strongly Correlated Fermions

    Directory of Open Access Journals (Sweden)

    Kelvin Ch’ng

    2017-08-01

    Full Text Available Machine learning offers an unprecedented perspective for the problem of classifying phases in condensed matter physics. We employ neural-network machine learning techniques to distinguish finite-temperature phases of the strongly correlated fermions on cubic lattices. We show that a three-dimensional convolutional network trained on auxiliary field configurations produced by quantum Monte Carlo simulations of the Hubbard model can correctly predict the magnetic phase diagram of the model at the average density of one (half filling. We then use the network, trained at half filling, to explore the trend in the transition temperature as the system is doped away from half filling. This transfer learning approach predicts that the instability to the magnetic phase extends to at least 5% doping in this region. Our results pave the way for other machine learning applications in correlated quantum many-body systems.

  7. Comparison of machine learning techniques to predict all-cause mortality using fitness data: the Henry ford exercIse testing (FIT) project.

    Science.gov (United States)

    Sakr, Sherif; Elshawi, Radwa; Ahmed, Amjad M; Qureshi, Waqas T; Brawner, Clinton A; Keteyian, Steven J; Blaha, Michael J; Al-Mallah, Mouaz H

    2017-12-19

    Prior studies have demonstrated that cardiorespiratory fitness (CRF) is a strong marker of cardiovascular health. Machine learning (ML) can enhance the prediction of outcomes through classification techniques that classify the data into predetermined categories. The aim of this study is to present an evaluation and comparison of how machine learning techniques can be applied on medical records of cardiorespiratory fitness and how the various techniques differ in terms of capabilities of predicting medical outcomes (e.g. mortality). We use data of 34,212 patients free of known coronary artery disease or heart failure who underwent clinician-referred exercise treadmill stress testing at Henry Ford Health Systems Between 1991 and 2009 and had a complete 10-year follow-up. Seven machine learning classification techniques were evaluated: Decision Tree (DT), Support Vector Machine (SVM), Artificial Neural Networks (ANN), Naïve Bayesian Classifier (BC), Bayesian Network (BN), K-Nearest Neighbor (KNN) and Random Forest (RF). In order to handle the imbalanced dataset used, the Synthetic Minority Over-Sampling Technique (SMOTE) is used. Two set of experiments have been conducted with and without the SMOTE sampling technique. On average over different evaluation metrics, SVM Classifier has shown the lowest performance while other models like BN, BC and DT performed better. The RF classifier has shown the best performance (AUC = 0.97) among all models trained using the SMOTE sampling. The results show that various ML techniques can significantly vary in terms of its performance for the different evaluation metrics. It is also not necessarily that the more complex the ML model, the more prediction accuracy can be achieved. The prediction performance of all models trained with SMOTE is much better than the performance of models trained without SMOTE. The study shows the potential of machine learning methods for predicting all-cause mortality using cardiorespiratory fitness

  8. An Interpretable Machine Learning Model for Accurate Prediction of Sepsis in the ICU.

    Science.gov (United States)

    Nemati, Shamim; Holder, Andre; Razmi, Fereshteh; Stanley, Matthew D; Clifford, Gari D; Buchman, Timothy G

    2018-04-01

    Sepsis is among the leading causes of morbidity, mortality, and cost overruns in critically ill patients. Early intervention with antibiotics improves survival in septic patients. However, no clinically validated system exists for real-time prediction of sepsis onset. We aimed to develop and validate an Artificial Intelligence Sepsis Expert algorithm for early prediction of sepsis. Observational cohort study. Academic medical center from January 2013 to December 2015. Over 31,000 admissions to the ICUs at two Emory University hospitals (development cohort), in addition to over 52,000 ICU patients from the publicly available Medical Information Mart for Intensive Care-III ICU database (validation cohort). Patients who met the Third International Consensus Definitions for Sepsis (Sepsis-3) prior to or within 4 hours of their ICU admission were excluded, resulting in roughly 27,000 and 42,000 patients within our development and validation cohorts, respectively. None. High-resolution vital signs time series and electronic medical record data were extracted. A set of 65 features (variables) were calculated on hourly basis and passed to the Artificial Intelligence Sepsis Expert algorithm to predict onset of sepsis in the proceeding T hours (where T = 12, 8, 6, or 4). Artificial Intelligence Sepsis Expert was used to predict onset of sepsis in the proceeding T hours and to produce a list of the most significant contributing factors. For the 12-, 8-, 6-, and 4-hour ahead prediction of sepsis, Artificial Intelligence Sepsis Expert achieved area under the receiver operating characteristic in the range of 0.83-0.85. Performance of the Artificial Intelligence Sepsis Expert on the development and validation cohorts was indistinguishable. Using data available in the ICU in real-time, Artificial Intelligence Sepsis Expert can accurately predict the onset of sepsis in an ICU patient 4-12 hours prior to clinical recognition. A prospective study is necessary to determine the

  9. Automatic single- and multi-label enzymatic function prediction by machine learning

    Directory of Open Access Journals (Sweden)

    Shervine Amidi

    2017-03-01

    Full Text Available The number of protein structures in the PDB database has been increasing more than 15-fold since 1999. The creation of computational models predicting enzymatic function is of major importance since such models provide the means to better understand the behavior of newly discovered enzymes when catalyzing chemical reactions. Until now, single-label classification has been widely performed for predicting enzymatic function limiting the application to enzymes performing unique reactions and introducing errors when multi-functional enzymes are examined. Indeed, some enzymes may be performing different reactions and can hence be directly associated with multiple enzymatic functions. In the present work, we propose a multi-label enzymatic function classification scheme that combines structural and amino acid sequence information. We investigate two fusion approaches (in the feature level and decision level and assess the methodology for general enzymatic function prediction indicated by the first digit of the enzyme commission (EC code (six main classes on 40,034 enzymes from the PDB database. The proposed single-label and multi-label models predict correctly the actual functional activities in 97.8% and 95.5% (based on Hamming-loss of the cases, respectively. Also the multi-label model predicts all possible enzymatic reactions in 85.4% of the multi-labeled enzymes when the number of reactions is unknown. Code and datasets are available at https://figshare.com/s/a63e0bafa9b71fc7cbd7.

  10. Protein-RNA interface residue prediction using machine learning: an assessment of the state of the art.

    Science.gov (United States)

    Walia, Rasna R; Caragea, Cornelia; Lewis, Benjamin A; Towfic, Fadi; Terribilini, Michael; El-Manzalawy, Yasser; Dobbs, Drena; Honavar, Vasant

    2012-05-10

    RNA molecules play diverse functional and structural roles in cells. They function as messengers for transferring genetic information from DNA to proteins, as the primary genetic material in many viruses, as catalysts (ribozymes) important for protein synthesis and RNA processing, and as essential and ubiquitous regulators of gene expression in living organisms. Many of these functions depend on precisely orchestrated interactions between RNA molecules and specific proteins in cells. Understanding the molecular mechanisms by which proteins recognize and bind RNA is essential for comprehending the functional implications of these interactions, but the recognition 'code' that mediates interactions between proteins and RNA is not yet understood. Success in deciphering this code would dramatically impact the development of new therapeutic strategies for intervening in devastating diseases such as AIDS and cancer. Because of the high cost of experimental determination of protein-RNA interfaces, there is an increasing reliance on statistical machine learning methods for training predictors of RNA-binding residues in proteins. However, because of differences in the choice of datasets, performance measures, and data representations used, it has been difficult to obtain an accurate assessment of the current state of the art in protein-RNA interface prediction. We provide a review of published approaches for predicting RNA-binding residues in proteins and a systematic comparison and critical assessment of protein-RNA interface residue predictors trained using these approaches on three carefully curated non-redundant datasets. We directly compare two widely used machine learning algorithms (Naïve Bayes (NB) and Support Vector Machine (SVM)) using three different data representations in which features are encoded using either sequence- or structure-based windows. Our results show that (i) Sequence-based classifiers that use a position-specific scoring matrix (PSSM

  11. Prediction of seebeck coefficient for compounds without restriction to fixed stoichiometry: A machine learning approach.

    Science.gov (United States)

    Furmanchuk, Al'ona; Saal, James E; Doak, Jeff W; Olson, Gregory B; Choudhary, Alok; Agrawal, Ankit

    2018-02-05

    The regression model-based tool is developed for predicting the Seebeck coefficient of crystalline materials in the temperature range from 300 K to 1000 K. The tool accounts for the single crystal versus polycrystalline nature of the compound, the production method, and properties of the constituent elements in the chemical formula. We introduce new descriptive features of crystalline materials relevant for the prediction the Seebeck coefficient. To address off-stoichiometry in materials, the predictive tool is trained on a mix of stoichiometric and nonstoichiometric materials. The tool is implemented into a web application (http://info.eecs.northwestern.edu/SeebeckCoefficientPredictor) to assist field scientists in the discovery of novel thermoelectric materials. © 2017 Wiley Periodicals, Inc. © 2017 Wiley Periodicals, Inc.

  12. Prediction of Driver's Intention of Lane Change by Augmenting Sensor Information Using Machine Learning Techniques.

    Science.gov (United States)

    Kim, Il-Hwan; Bong, Jae-Hwan; Park, Jooyoung; Park, Shinsuk

    2017-06-10

    Driver assistance systems have become a major safety feature of modern passenger vehicles. The advanced driver assistance system (ADAS) is one of the active safety systems to improve the vehicle control performance and, thus, the safety of the driver and the passengers. To use the ADAS for lane change control, rapid and correct detection of the driver's intention is essential. This study proposes a novel preprocessing algorithm for the ADAS to improve the accuracy in classifying the driver's intention for lane change by augmenting basic measurements from conventional on-board sensors. The information on the vehicle states and the road surface condition is augmented by using an artificial neural network (ANN) models, and the augmented information is fed to a support vector machine (SVM) to detect the driver's intention with high accuracy. The feasibility of the developed algorithm was tested through driving simulator experiments. The results show that the classification accuracy for the driver's intention can be improved by providing an SVM model with sufficient driving information augmented by using ANN models of vehicle dynamics.

  13. Machine learning in virtual screening.

    Science.gov (United States)

    Melville, James L; Burke, Edmund K; Hirst, Jonathan D

    2009-05-01

    In this review, we highlight recent applications of machine learning to virtual screening, focusing on the use of supervised techniques to train statistical learning algorithms to prioritize databases of molecules as active against a particular protein target. Both ligand-based similarity searching and structure-based docking have benefited from machine learning algorithms, including naïve Bayesian classifiers, support vector machines, neural networks, and decision trees, as well as more traditional regression techniques. Effective application of these methodologies requires an appreciation of data preparation, validation, optimization, and search methodologies, and we also survey developments in these areas.

  14. Machine Learning an algorithmic perspective

    CERN Document Server

    Marsland, Stephen

    2009-01-01

    Traditional books on machine learning can be divided into two groups - those aimed at advanced undergraduates or early postgraduates with reasonable mathematical knowledge and those that are primers on how to code algorithms. The field is ready for a text that not only demonstrates how to use the algorithms that make up machine learning methods, but also provides the background needed to understand how and why these algorithms work. Machine Learning: An Algorithmic Perspective is that text.Theory Backed up by Practical ExamplesThe book covers neural networks, graphical models, reinforcement le

  15. Using Machine Learning Models to Predict Functional Use (Interagency Alternatives Assessment Workshop)

    Science.gov (United States)

    This presentation will outline how data was collected on how chemicals are used in products, models were built using this data to then predict how chemicals can be used in products, and, finally, how combining this information with Tox21 in vitro assays can be use to rapidly scre...

  16. Machine learning and hurdle models for improving regional predictions of stream water acid neutralizing capacity

    Science.gov (United States)

    Nicholas A. Povak; Paul F. Hessburg; Keith M. Reynolds; Timothy J. Sullivan; Todd C. McDonnell; R. Brion Salter

    2013-01-01

    In many industrialized regions of the world, atmospherically deposited sulfur derived from industrial, nonpoint air pollution sources reduces stream water quality and results in acidic conditions that threaten aquatic resources. Accurate maps of predicted stream water acidity are an essential aid to managers who must identify acid-sensitive streams, potentially...

  17. Prediction of Individual Response to Electroconvulsive Therapy via Machine Learning on Structural Magnetic Resonance Imaging Data.

    Science.gov (United States)

    Redlich, Ronny; Opel, Nils; Grotegerd, Dominik; Dohm, Katharina; Zaremba, Dario; Bürger, Christian; Münker, Sandra; Mühlmann, Lisa; Wahl, Patricia; Heindel, Walter; Arolt, Volker; Alferink, Judith; Zwanzger, Peter; Zavorotnyy, Maxim; Kugel, Harald; Dannlowski, Udo

    2016-06-01

    Electroconvulsive therapy (ECT) is one of the most effective treatments for severe depression. However, biomarkers that accurately predict a response to ECT remain unidentified. To investigate whether certain factors identified by structural magnetic resonance imaging (MRI) techniques are able to predict ECT response. In this nonrandomized prospective study, gray matter structure was assessed twice at approximately 6 weeks apart using 3-T MRI and voxel-based morphometry. Patients were recruited through the inpatient service of the Department of Psychiatry, University of Muenster, from March 11, 2010, to March 27, 2015. Two patient groups with acute major depressive disorder were included. One group received an ECT series in addition to antidepressants (n = 24); a comparison sample was treated solely with antidepressants (n = 23). Both groups were compared with a sample of healthy control participants (n = 21). Binary pattern classification was used to predict ECT response by structural MRI that was performed before treatment. In addition, univariate analysis was conducted to predict reduction of the Hamilton Depression Rating Scale score by pretreatment gray matter volumes and to investigate ECT-related structural changes. One participant in the ECT sample was excluded from the analysis, leaving 67 participants (27 men and 40 women; mean [SD] age, 43.7 [10.6] years). The binary pattern classification yielded a successful prediction of ECT response, with accuracy rates of 78.3% (18 of 23 patients in the ECT sample) and sensitivity rates of 100% (13 of 13 who responded to ECT). Furthermore, a support vector regression yielded a significant prediction of relative reduction in the Hamilton Depression Rating Scale score. The principal findings of the univariate model indicated a positive association between pretreatment subgenual cingulate volume and individual ECT response (Montreal Neurological Institute [MNI] coordinates x = 8, y = 21, z = -18

  18. Predicting the academic success of architecture students by pre-enrolment requirement: using machine-learning techniques

    Directory of Open Access Journals (Sweden)

    Ralph Olusola Aluko

    2016-12-01

    Full Text Available In recent years, there has been an increase in the number of applicants seeking admission into architecture programmes. As expected, prior academic performance (also referred to as pre-enrolment requirement is a major factor considered during the process of selecting applicants. In the present study, machine learning models were used to predict academic success of architecture students based on information provided in prior academic performance. Two modeling techniques, namely K-nearest neighbour (k-NN and linear discriminant analysis were applied in the study. It was found that K-nearest neighbour (k-NN outperforms the linear discriminant analysis model in terms of accuracy. In addition, grades obtained in mathematics (at ordinary level examinations had a significant impact on the academic success of undergraduate architecture students. This paper makes a modest contribution to the ongoing discussion on the relationship between prior academic performance and academic success of undergraduate students by evaluating this proposition. One of the issues that emerges from these findings is that prior academic performance can be used as a predictor of academic success in undergraduate architecture programmes. Overall, the developed k-NN model can serve as a valuable tool during the process of selecting new intakes into undergraduate architecture programmes in Nigeria.

  19. Prediction of Cancer Proteins by Integrating Protein Interaction, Domain Frequency, and Domain Interaction Data Using Machine Learning Algorithms

    Directory of Open Access Journals (Sweden)

    Chien-Hung Huang

    2015-01-01

    Full Text Available Many proteins are known to be associated with cancer diseases. It is quite often that their precise functional role in disease pathogenesis remains unclear. A strategy to gain a better understanding of the function of these proteins is to make use of a combination of different aspects of proteomics data types. In this study, we extended Aragues’s method by employing the protein-protein interaction (PPI data, domain-domain interaction (DDI data, weighted domain frequency score (DFS, and cancer linker degree (CLD data to predict cancer proteins. Performances were benchmarked based on three kinds of experiments as follows: (I using individual algorithm, (II combining algorithms, and (III combining the same classification types of algorithms. When compared with Aragues’s method, our proposed methods, that is, machine learning algorithm and voting with the majority, are significantly superior in all seven performance measures. We demonstrated the accuracy of the proposed method on two independent datasets. The best algorithm can achieve a hit ratio of 89.4% and 72.8% for lung cancer dataset and lung cancer microarray study, respectively. It is anticipated that the current research could help understand disease mechanisms and diagnosis.

  20. Applications of Machine learning in Prediction of Breast Cancer Incidence and Mortality

    International Nuclear Information System (INIS)

    Helal, N.; Sarwat, E.

    2012-01-01

    Breast cancer is one of the leading causes of cancer deaths for the female population in both developed and developing countries. In this work we have used the baseline descriptive data about the incidence (new cancer cases) of in situ breast cancer among Wisconsin females. The documented data were from the most recent 12-years period for which data are available. Wiscons in cancer incidence and mortality (deaths due to cancer) that occurred were also considered in this work. Artificial Neural network (ANN) have been successfully applied to problems in the prediction of the number of new cancer cases and mortality. Using artificial intelligence (AI) in this study, the numbers of new cancer cases and mortality that may occur are predicted.

  1. Predicting Health Care Utilization After Behavioral Health Referral Using Natural Language Processing and Machine Learning

    OpenAIRE

    Roysden, Nathaniel; Wright, Adam

    2015-01-01

    Mental health problems are an independent predictor of increased healthcare utilization. We created random forest classifiers for predicting two outcomes following a patient’s first behavioral health encounter: decreased utilization by any amount (AUROC 0.74) and ultra-high absolute utilization (AUROC 0.88). These models may be used for clinical decision support by referring providers, to automatically detect patients who may benefit from referral, for cost management, or for risk/protection ...

  2. Predicting Health Care Utilization After Behavioral Health Referral Using Natural Language Processing and Machine Learning.

    Science.gov (United States)

    Roysden, Nathaniel; Wright, Adam

    2015-01-01

    Mental health problems are an independent predictor of increased healthcare utilization. We created random forest classifiers for predicting two outcomes following a patient's first behavioral health encounter: decreased utilization by any amount (AUROC 0.74) and ultra-high absolute utilization (AUROC 0.88). These models may be used for clinical decision support by referring providers, to automatically detect patients who may benefit from referral, for cost management, or for risk/protection factor analysis.

  3. Improving the Spatial Prediction of Soil Organic Carbon Stocks in a Complex Tropical Mountain Landscape by Methodological Specifications in Machine Learning Approaches.

    Directory of Open Access Journals (Sweden)

    Mareike Ließ

    Full Text Available Tropical forests are significant carbon sinks and their soils' carbon storage potential is immense. However, little is known about the soil organic carbon (SOC stocks of tropical mountain areas whose complex soil-landscape and difficult accessibility pose a challenge to spatial analysis. The choice of methodology for spatial prediction is of high importance to improve the expected poor model results in case of low predictor-response correlations. Four aspects were considered to improve model performance in predicting SOC stocks of the organic layer of a tropical mountain forest landscape: Different spatial predictor settings, predictor selection strategies, various machine learning algorithms and model tuning. Five machine learning algorithms: random forests, artificial neural networks, multivariate adaptive regression splines, boosted regression trees and support vector machines were trained and tuned to predict SOC stocks from predictors derived from a digital elevation model and satellite image. Topographical predictors were calculated with a GIS search radius of 45 to 615 m. Finally, three predictor selection strategies were applied to the total set of 236 predictors. All machine learning algorithms-including the model tuning and predictor selection-were compared via five repetitions of a tenfold cross-validation. The boosted regression tree algorithm resulted in the overall best model. SOC stocks ranged between 0.2 to 17.7 kg m-2, displaying a huge variability with diffuse insolation and curvatures of different scale guiding the spatial pattern. Predictor selection and model tuning improved the models' predictive performance in all five machine learning algorithms. The rather low number of selected predictors favours forward compared to backward selection procedures. Choosing predictors due to their indiviual performance was vanquished by the two procedures which accounted for predictor interaction.

  4. Improving the Spatial Prediction of Soil Organic Carbon Stocks in a Complex Tropical Mountain Landscape by Methodological Specifications in Machine Learning Approaches.

    Science.gov (United States)

    Ließ, Mareike; Schmidt, Johannes; Glaser, Bruno

    2016-01-01

    Tropical forests are significant carbon sinks and their soils' carbon storage potential is immense. However, little is known about the soil organic carbon (SOC) stocks of tropical mountain areas whose complex soil-landscape and difficult accessibility pose a challenge to spatial analysis. The choice of methodology for spatial prediction is of high importance to improve the expected poor model results in case of low predictor-response correlations. Four aspects were considered to improve model performance in predicting SOC stocks of the organic layer of a tropical mountain forest landscape: Different spatial predictor settings, predictor selection strategies, various machine learning algorithms and model tuning. Five machine learning algorithms: random forests, artificial neural networks, multivariate adaptive regression splines, boosted regression trees and support vector machines were trained and tuned to predict SOC stocks from predictors derived from a digital elevation model and satellite image. Topographical predictors were calculated with a GIS search radius of 45 to 615 m. Finally, three predictor selection strategies were applied to the total set of 236 predictors. All machine learning algorithms-including the model tuning and predictor selection-were compared via five repetitions of a tenfold cross-validation. The boosted regression tree algorithm resulted in the overall best model. SOC stocks ranged between 0.2 to 17.7 kg m-2, displaying a huge variability with diffuse insolation and curvatures of different scale guiding the spatial pattern. Predictor selection and model tuning improved the models' predictive performance in all five machine learning algorithms. The rather low number of selected predictors favours forward compared to backward selection procedures. Choosing predictors due to their indiviual performance was vanquished by the two procedures which accounted for predictor interaction.

  5. The Impact of Protein Structure and Sequence Similarity on the Accuracy of Machine-Learning Scoring Functions for Binding Affinity Prediction.

    Science.gov (United States)

    Li, Hongjian; Peng, Jiangjun; Leung, Yee; Leung, Kwong-Sak; Wong, Man-Hon; Lu, Gang; Ballester, Pedro J

    2018-03-14

    It has recently been claimed that the outstanding performance of machine-learning scoring functions (SFs) is exclusively due to the presence of training complexes with highly similar proteins to those in the test set. Here, we revisit this question using 24 similarity-based training sets, a widely used test set, and four SFs. Three of these SFs employ machine learning instead of the classical linear regression approach of the fourth SF (X-Score which has the best test set performance out of 16 classical SFs). We have found that random forest (RF)-based RF-Score-v3 outperforms X-Score even when 68% of the most similar proteins are removed from the training set. In addition, unlike X-Score, RF-Score-v3 is able to keep learning with an increasing training set size, becoming substantially more predictive than X-Score when the full 1105 complexes are used for training. These results show that machine-learning SFs owe a substantial part of their performance to training on complexes with dissimilar proteins to those in the test set, against what has been previously concluded using the same data. Given that a growing amount of structural and interaction data will be available from academic and industrial sources, this performance gap between machine-learning SFs and classical SFs is expected to enlarge in the future.

  6. Model-based machine learning.

    Science.gov (United States)

    Bishop, Christopher M

    2013-02-13

    Several decades of research in the field of machine learning have resulted in a multitude of different algorithms for solving a broad range of problems. To tackle a new application, a researcher typically tries to map their problem onto one of these existing methods, often influenced by their familiarity with specific algorithms and by the availability of corresponding software implementations. In this study, we describe an alternative methodology for applying machine learning, in which a bespoke solution is formulated for each new application. The solution is expressed through a compact modelling language, and the corresponding custom machine learning code is then generated automatically. This model-based approach offers several major advantages, including the opportunity to create highly tailored models for specific scenarios, as well as rapid prototyping and comparison of a range of alternative models. Furthermore, newcomers to the field of machine learning do not have to learn about the huge range of traditional methods, but instead can focus their attention on understanding a single modelling environment. In this study, we show how probabilistic graphical models, coupled with efficient inference algorithms, provide a very flexible foundation for model-based machine learning, and we outline a large-scale commercial application of this framework involving tens of millions of users. We also describe the concept of probabilistic programming as a powerful software environment for model-based machine learning, and we discuss a specific probabilistic programming language called Infer.NET, which has been widely used in practical applications.

  7. Emerging Paradigms in Machine Learning

    CERN Document Server

    Jain, Lakhmi; Howlett, Robert

    2013-01-01

    This  book presents fundamental topics and algorithms that form the core of machine learning (ML) research, as well as emerging paradigms in intelligent system design. The  multidisciplinary nature of machine learning makes it a very fascinating and popular area for research.  The book is aiming at students, practitioners and researchers and captures the diversity and richness of the field of machine learning and intelligent systems.  Several chapters are devoted to computational learning models such as granular computing, rough sets and fuzzy sets An account of applications of well-known learning methods in biometrics, computational stylistics, multi-agent systems, spam classification including an extremely well-written survey on Bayesian networks shed light on the strengths and weaknesses of the methods. Practical studies yielding insight into challenging problems such as learning from incomplete and imbalanced data, pattern recognition of stochastic episodic events and on-line mining of non-stationary ...

  8. Application of machine learning in prediction of hydrotrope-enhanced solubilisation of indomethacin.

    Science.gov (United States)

    Damiati, Safa A; Martini, Luigi G; Smith, Norman W; Lawrence, M Jayne; Barlow, David J

    2017-09-15

    Systematic in-vitro studies have been conducted to determine the ability of a range of 10 potential hydrotropes to improve the apparent aqueous solubility of the poorly water soluble drug, indomethacin. Solubilisation of the drug in the presence of the hydrotropes was determined experimentally using high-performance liquid chromatography (HPLC) with ultraviolet (UV) detection. These experimental data, together with various known and computed physicochemical properties of the hydrotropes were thereafter used in silico to train an artificial neural network (ANN) to allow for predictions of indomethacin solubilisation. The trained ANN was found to give highly accurate predictions of indomethacin solubilisation in the presence of hydrotropes and was thus shown to provide a valuable means by which hydrotrope efficacy could be screened computationally. Interrogation of the network connection weights afforded a quantitative assessment of the relative importance of the various hydrotrope physicochemical properties in determining the extent of the enhancement in indomethacin solubilisation. It is concluded that in-silico screening of drug/hydrotrope systems using artificial neural networks offers significant potential to reduce the need for extensive laboratory testing of these systems, and could thus provide an economy in terms of reduced costs and time in drug formulation development. Crown Copyright © 2017. Published by Elsevier B.V. All rights reserved.

  9. Hit Dexter: A Machine-Learning Model for the Prediction of Frequent Hitters.

    Science.gov (United States)

    Stork, Conrad; Wagner, Johannes; Friedrich, Nils-Ole; de Bruyn Kops, Christina; Šícho, Martin; Kirchmair, Johannes

    2018-03-20

    False-positive assay readouts caused by badly behaving compounds-frequent hitters, pan-assay interference compounds (PAINS), aggregators, and others-continue to pose a major challenge to experimental screening. There are only a few in silico methods that allow the prediction of such problematic compounds. We report the development of Hit Dexter, two extremely randomized trees classifiers for the prediction of compounds likely to trigger positive assay readouts either by true promiscuity or by assay interference. The models were trained on a well-prepared dataset extracted from the PubChem Bioassay database, consisting of approximately 311 000 compounds tested for activity on at least 50 proteins. Hit Dexter reached MCC and AUC values of up to 0.67 and 0.96 on an independent test set, respectively. The models are expected to be of high value, in particular to medicinal chemists and biochemists who can use Hit Dexter to identify compounds for which extra caution should be exercised with positive assay readouts. Hit Dexter is available as a free web service at http://hitdexter.zbh. uni-hamburg.de. © 2018 Wiley-VCH Verlag GmbH & Co. KGaA, Weinheim.

  10. Machine learning for healthcare technologies

    CERN Document Server

    Clifton, David A

    2016-01-01

    This book brings together chapters on the state-of-the-art in machine learning (ML) as it applies to the development of patient-centred technologies, with a special emphasis on 'big data' and mobile data.

  11. Machine Learning via Mathematical Programming

    National Research Council Canada - National Science Library

    Mamgasarian, Olivi

    1999-01-01

    Mathematical programming approaches were applied to a variety of problems in machine learning in order to gain deeper understanding of the problems and to come up with new and more efficient computational algorithms...

  12. Machine Learning examples on Invenio

    CERN Document Server

    CERN. Geneva

    2017-01-01

    This talk will present the different Machine Learning tools that the INSPIRE is developing and integrating in order to automatize as much as possible content selection and curation in a subject based repository.

  13. Scikit-learn: Machine Learning in Python

    OpenAIRE

    Pedregosa, Fabian; Varoquaux, Gaël; Gramfort, Alexandre; Michel, Vincent; Thirion, Bertrand; Grisel, Olivier; Blondel, Mathieu; Prettenhofer, Peter; Weiss, Ron; Dubourg, Vincent; Vanderplas, Jake; Passos, Alexandre; Cournapeau, David; Brucher, Matthieu; Perrot, Matthieu

    2011-01-01

    International audience; Scikit-learn is a Python module integrating a wide range of state-of-the-art machine learning algorithms for medium-scale supervised and unsupervised problems. This package focuses on bringing machine learning to non-specialists using a general-purpose high-level language. Emphasis is put on ease of use, performance, documentation, and API consistency. It has minimal dependencies and is distributed under the simplified BSD license, encouraging its use in both academic ...

  14. Scikit-learn: Machine Learning in Python

    OpenAIRE

    Pedregosa, Fabian; Varoquaux, Gaël; Gramfort, Alexandre; Michel, Vincent; Thirion, Bertrand; Grisel, Olivier; Blondel, Mathieu; Louppe, Gilles; Prettenhofer, Peter; Weiss, Ron; Dubourg, Vincent; Vanderplas, Jake; Passos, Alexandre; Cournapeau, David; Brucher, Matthieu

    2012-01-01

    Scikit-learn is a Python module integrating a wide range of state-of-the-art machine learning algorithms for medium-scale supervised and unsupervised problems. This package focuses on bringing machine learning to non-specialists using a general-purpose high-level language. Emphasis is put on ease of use, performance, documentation, and API consistency. It has minimal dependencies and is distributed under the simplified BSD license, encouraging its use in both academic and commercial settings....

  15. Predictive models for anti-tubercular molecules using machine learning on high-throughput biological screening datasets.

    Science.gov (United States)

    Periwal, Vinita; Rajappan, Jinuraj K; Jaleel, Abdul Uc; Scaria, Vinod

    2011-11-18

    Tuberculosis is a contagious disease caused by Mycobacterium tuberculosis (Mtb), affecting more than two billion people around the globe and is one of the major causes of morbidity and mortality in the developing world. Recent reports suggest that Mtb has been developing resistance to the widely used anti-tubercular drugs resulting in the emergence and spread of multi drug-resistant (MDR) and extensively drug-resistant (XDR) strains throughout the world. In view of this global epidemic, there is an urgent need to facilitate fast and efficient lead identification methodologies. Target based screening of large compound libraries has been widely used as a fast and efficient approach for lead identification, but is restricted by the knowledge about the target structure. Whole organism screens on the other hand are target-agnostic and have been now widely employed as an alternative for lead identification but they are limited by the time and cost involved in running the screens for large compound libraries. This could be possibly be circumvented by using computational approaches to prioritize molecules for screening programmes. We utilized physicochemical properties of compounds to train four supervised classifiers (Naïve Bayes, Random Forest, J48 and SMO) on three publicly available bioassay screens of Mtb inhibitors and validated the robustness of the predictive models using various statistical measures. This study is a comprehensive analysis of high-throughput bioassay data for anti-tubercular activity and the application of machine learning approaches to create target-agnostic predictive models for anti-tubercular agents.

  16. Predictive models for anti-tubercular molecules using machine learning on high-throughput biological screening datasets

    Directory of Open Access Journals (Sweden)

    Periwal Vinita

    2011-11-01

    Full Text Available Abstract Background Tuberculosis is a contagious disease caused by Mycobacterium tuberculosis (Mtb, affecting more than two billion people around the globe and is one of the major causes of morbidity and mortality in the developing world. Recent reports suggest that Mtb has been developing resistance to the widely used anti-tubercular drugs resulting in the emergence and spread of multi drug-resistant (MDR and extensively drug-resistant (XDR strains throughout the world. In view of this global epidemic, there is an urgent need to facilitate fast and efficient lead identification methodologies. Target based screening of large compound libraries has been widely used as a fast and efficient approach for lead identification, but is restricted by the knowledge about the target structure. Whole organism screens on the other hand are target-agnostic and have been now widely employed as an alternative for lead identification but they are limited by the time and cost involved in running the screens for large compound libraries. This could be possibly be circumvented by using computational approaches to prioritize molecules for screening programmes. Results We utilized physicochemical properties of compounds to train four supervised classifiers (Naïve Bayes, Random Forest, J48 and SMO on three publicly available bioassay screens of Mtb inhibitors and validated the robustness of the predictive models using various statistical measures. Conclusions This study is a comprehensive analysis of high-throughput bioassay data for anti-tubercular activity and the application of machine learning approaches to create target-agnostic predictive models for anti-tubercular agents.

  17. Machine Learning of Musical Gestures

    OpenAIRE

    Caramiaux, Baptiste; Tanaka, Atau

    2013-01-01

    We present an overview of machine learning (ML) techniques and theirapplication in interactive music and new digital instruments design. We firstgive to the non-specialist reader an introduction to two ML tasks,classification and regression, that are particularly relevant for gesturalinteraction. We then present a review of the literature in current NIMEresearch that uses ML in musical gesture analysis and gestural sound control.We describe the ways in which machine learning is useful for cre...

  18. Machine learning methods for planning

    CERN Document Server

    Minton, Steven

    1993-01-01

    Machine Learning Methods for Planning provides information pertinent to learning methods for planning and scheduling. This book covers a wide variety of learning methods and learning architectures, including analogical, case-based, decision-tree, explanation-based, and reinforcement learning.Organized into 15 chapters, this book begins with an overview of planning and scheduling and describes some representative learning systems that have been developed for these tasks. This text then describes a learning apprentice for calendar management. Other chapters consider the problem of temporal credi

  19. BENCHMARKING MACHINE LEARNING TECHNIQUES FOR SOFTWARE DEFECT DETECTION

    OpenAIRE

    Saiqa Aleem; Luiz Fernando Capretz; Faheem Ahmed

    2015-01-01

    Machine Learning approaches are good in solving problems that have less information. In most cases, the software domain problems characterize as a process of learning that depend on the various circumstances and changes accordingly. A predictive model is constructed by using machine learning approaches and classified them into defective and non-defective modules. Machine learning techniques help developers to retrieve useful information after the classification and enable them to analyse data...

  20. Machine Learning Technologies Translates Vigilant Surveillance Satellite Big Data into Predictive Alerts for Environmental Stressors

    Science.gov (United States)

    Johnson, S. P.; Rohrer, M. E.

    2017-12-01

    , intervention, and prediction of emerging environmental threats.

  1. Learning with Support Vector Machines

    CERN Document Server

    Campbell, Colin

    2010-01-01

    Support Vectors Machines have become a well established tool within machine learning. They work well in practice and have now been used across a wide range of applications from recognizing hand-written digits, to face identification, text categorisation, bioinformatics, and database marketing. In this book we give an introductory overview of this subject. We start with a simple Support Vector Machine for performing binary classification before considering multi-class classification and learning in the presence of noise. We show that this framework can be extended to many other scenarios such a

  2. Machine Learning in Medical Imaging.

    Science.gov (United States)

    Giger, Maryellen L

    2018-03-01

    Advances in both imaging and computers have synergistically led to a rapid rise in the potential use of artificial intelligence in various radiological imaging tasks, such as risk assessment, detection, diagnosis, prognosis, and therapy response, as well as in multi-omics disease discovery. A brief overview of the field is given here, allowing the reader to recognize the terminology, the various subfields, and components of machine learning, as well as the clinical potential. Radiomics, an expansion of computer-aided diagnosis, has been defined as the conversion of images to minable data. The ultimate benefit of quantitative radiomics is to (1) yield predictive image-based phenotypes of disease for precision medicine or (2) yield quantitative image-based phenotypes for data mining with other -omics for discovery (ie, imaging genomics). For deep learning in radiology to succeed, note that well-annotated large data sets are needed since deep networks are complex, computer software and hardware are evolving constantly, and subtle differences in disease states are more difficult to perceive than differences in everyday objects. In the future, machine learning in radiology is expected to have a substantial clinical impact with imaging examinations being routinely obtained in clinical practice, providing an opportunity to improve decision support in medical image interpretation. The term of note is decision support, indicating that computers will augment human decision making, making it more effective and efficient. The clinical impact of having computers in the routine clinical practice may allow radiologists to further integrate their knowledge with their clinical colleagues in other medical specialties and allow for precision medicine. Copyright © 2018. Published by Elsevier Inc.

  3. Prediction of Machine Tool Condition Using Support Vector Machine

    International Nuclear Information System (INIS)

    Wang Peigong; Meng Qingfeng; Zhao Jian; Li Junjie; Wang Xiufeng

    2011-01-01

    Condition monitoring and predicting of CNC machine tools are investigated in this paper. Considering the CNC machine tools are often small numbers of samples, a condition predicting method for CNC machine tools based on support vector machines (SVMs) is proposed, then one-step and multi-step condition prediction models are constructed. The support vector machines prediction models are used to predict the trends of working condition of a certain type of CNC worm wheel and gear grinding machine by applying sequence data of vibration signal, which is collected during machine processing. And the relationship between different eigenvalue in CNC vibration signal and machining quality is discussed. The test result shows that the trend of vibration signal Peak-to-peak value in surface normal direction is most relevant to the trend of surface roughness value. In trends prediction of working condition, support vector machine has higher prediction accuracy both in the short term ('One-step') and long term (multi-step) prediction compared to autoregressive (AR) model and the RBF neural network. Experimental results show that it is feasible to apply support vector machine to CNC machine tool condition prediction.

  4. Using Machine Learning to Advance Personality Assessment and Theory.

    Science.gov (United States)

    Bleidorn, Wiebke; Hopwood, Christopher James

    2018-05-01

    Machine learning has led to important advances in society. One of the most exciting applications of machine learning in psychological science has been the development of assessment tools that can powerfully predict human behavior and personality traits. Thus far, machine learning approaches to personality assessment have focused on the associations between social media and other digital records with established personality measures. The goal of this article is to expand the potential of machine learning approaches to personality assessment by embedding it in a more comprehensive construct validation framework. We review recent applications of machine learning to personality assessment, place machine learning research in the broader context of fundamental principles of construct validation, and provide recommendations for how to use machine learning to advance our understanding of personality.

  5. An Evolutionary Machine Learning Framework for Big Data Sequence Mining

    Science.gov (United States)

    Kamath, Uday Krishna

    2014-01-01

    Sequence classification is an important problem in many real-world applications. Unlike other machine learning data, there are no "explicit" features or signals in sequence data that can help traditional machine learning algorithms learn and predict from the data. Sequence data exhibits inter-relationships in the elements that are…

  6. Performance of Machine Learning Algorithms for Qualitative and Quantitative Prediction Drug Blockade of hERG1 channel.

    Science.gov (United States)

    Wacker, Soren; Noskov, Sergei Yu

    2018-05-01

    Drug-induced abnormal heart rhythm known as Torsades de Pointes (TdP) is a potential lethal ventricular tachycardia found in many patients. Even newly released anti-arrhythmic drugs, like ivabradine with HCN channel as a primary target, block the hERG potassium current in overlapping concentration interval. Promiscuous drug block to hERG channel may potentially lead to perturbation of the action potential duration (APD) and TdP, especially when with combined with polypharmacy and/or electrolyte disturbances. The example of novel anti-arrhythmic ivabradine illustrates clinically important and ongoing deficit in drug design and warrants for better screening methods. There is an urgent need to develop new approaches for rapid and accurate assessment of how drugs with complex interactions and multiple subcellular targets can predispose or protect from drug-induced TdP. One of the unexpected outcomes of compulsory hERG screening implemented in USA and European Union resulted in large datasets of IC 50 values for various molecules entering the market. The abundant data allows now to construct predictive machine-learning (ML) models. Novel ML algorithms and techniques promise better accuracy in determining IC 50 values of hERG blockade that is comparable or surpassing that of the earlier QSAR or molecular modeling technique. To test the performance of modern ML techniques, we have developed a computational platform integrating various workflows for quantitative structure activity relationship (QSAR) models using data from the ChEMBL database. To establish predictive powers of ML-based algorithms we computed IC 50 values for large dataset of molecules and compared it to automated patch clamp system for a large dataset of hERG blocking and non-blocking drugs, an industry gold standard in studies of cardiotoxicity. The optimal protocol with high sensitivity and predictive power is based on the novel eXtreme gradient boosting (XGBoost) algorithm. The ML-platform with XGBoost

  7. Comparing deep neural network and other machine learning algorithms for stroke prediction in a large-scale population-based electronic medical claims database.

    Science.gov (United States)

    Chen-Ying Hung; Wei-Chen Chen; Po-Tsun Lai; Ching-Heng Lin; Chi-Chun Lee

    2017-07-01

    Electronic medical claims (EMCs) can be used to accurately predict the occurrence of a variety of diseases, which can contribute to precise medical interventions. While there is a growing interest in the application of machine learning (ML) techniques to address clinical problems, the use of deep-learning in healthcare have just gained attention recently. Deep learning, such as deep neural network (DNN), has achieved impressive results in the areas of speech recognition, computer vision, and natural language processing in recent years. However, deep learning is often difficult to comprehend due to the complexities in its framework. Furthermore, this method has not yet been demonstrated to achieve a better performance comparing to other conventional ML algorithms in disease prediction tasks using EMCs. In this study, we utilize a large population-based EMC database of around 800,000 patients to compare DNN with three other ML approaches for predicting 5-year stroke occurrence. The result shows that DNN and gradient boosting decision tree (GBDT) can result in similarly high prediction accuracies that are better compared to logistic regression (LR) and support vector machine (SVM) approaches. Meanwhile, DNN achieves optimal results by using lesser amounts of patient data when comparing to GBDT method.

  8. Machine learning for evolution strategies

    CERN Document Server

    Kramer, Oliver

    2016-01-01

    This book introduces numerous algorithmic hybridizations between both worlds that show how machine learning can improve and support evolution strategies. The set of methods comprises covariance matrix estimation, meta-modeling of fitness and constraint functions, dimensionality reduction for search and visualization of high-dimensional optimization processes, and clustering-based niching. After giving an introduction to evolution strategies and machine learning, the book builds the bridge between both worlds with an algorithmic and experimental perspective. Experiments mostly employ a (1+1)-ES and are implemented in Python using the machine learning library scikit-learn. The examples are conducted on typical benchmark problems illustrating algorithmic concepts and their experimental behavior. The book closes with a discussion of related lines of research.

  9. Assessment of genetic and nongenetic interactions for the prediction of depressive symptomatology: an analysis of the Wisconsin Longitudinal Study using machine learning algorithms.

    Science.gov (United States)

    Roetker, Nicholas S; Page, C David; Yonker, James A; Chang, Vicky; Roan, Carol L; Herd, Pamela; Hauser, Taissa S; Hauser, Robert M; Atwood, Craig S

    2013-10-01

    We examined depression within a multidimensional framework consisting of genetic, environmental, and sociobehavioral factors and, using machine learning algorithms, explored interactions among these factors that might better explain the etiology of depressive symptoms. We measured current depressive symptoms using the Center for Epidemiologic Studies Depression Scale (n = 6378 participants in the Wisconsin Longitudinal Study). Genetic factors were 78 single nucleotide polymorphisms (SNPs); environmental factors-13 stressful life events (SLEs), plus a composite proportion of SLEs index; and sociobehavioral factors-18 personality, intelligence, and other health or behavioral measures. We performed traditional SNP associations via logistic regression likelihood ratio testing and explored interactions with support vector machines and Bayesian networks. After correction for multiple testing, we found no significant single genotypic associations with depressive symptoms. Machine learning algorithms showed no evidence of interactions. Naïve Bayes produced the best models in both subsets and included only environmental and sociobehavioral factors. We found no single or interactive associations with genetic factors and depressive symptoms. Various environmental and sociobehavioral factors were more predictive of depressive symptoms, yet their impacts were independent of one another. A genome-wide analysis of genetic alterations using machine learning methodologies will provide a framework for identifying genetic-environmental-sociobehavioral interactions in depressive symptoms.

  10. Game-powered machine learning.

    Science.gov (United States)

    Barrington, Luke; Turnbull, Douglas; Lanckriet, Gert

    2012-04-24

    Searching for relevant content in a massive amount of multimedia information is facilitated by accurately annotating each image, video, or song with a large number of relevant semantic keywords, or tags. We introduce game-powered machine learning, an integrated approach to annotating multimedia content that combines the effectiveness of human computation, through online games, with the scalability of machine learning. We investigate this framework for labeling music. First, a socially-oriented music annotation game called Herd It collects reliable music annotations based on the "wisdom of the crowds." Second, these annotated examples are used to train a supervised machine learning system. Third, the machine learning system actively directs the annotation games to collect new data that will most benefit future model iterations. Once trained, the system can automatically annotate a corpus of music much larger than what could be labeled using human computation alone. Automatically annotated songs can be retrieved based on their semantic relevance to text-based queries (e.g., "funky jazz with saxophone," "spooky electronica," etc.). Based on the results presented in this paper, we find that actively coupling annotation games with machine learning provides a reliable and scalable approach to making searchable massive amounts of multimedia data.

  11. Machine learning in the string landscape

    Science.gov (United States)

    Carifio, Jonathan; Halverson, James; Krioukov, Dmitri; Nelson, Brent D.

    2017-09-01

    We utilize machine learning to study the string landscape. Deep data dives and conjecture generation are proposed as useful frameworks for utilizing machine learning in the landscape, and examples of each are presented. A decision tree accurately predicts the number of weak Fano toric threefolds arising from reflexive polytopes, each of which determines a smooth F-theory compactification, and linear regression generates a previously proven conjecture for the gauge group rank in an ensemble of 4/3× 2.96× {10}^{755} F-theory compactifications. Logistic regression generates a new conjecture for when E 6 arises in the large ensemble of F-theory compactifications, which is then rigorously proven. This result may be relevant for the appearance of visible sectors in the ensemble. Through conjecture generation, machine learning is useful not only for numerics, but also for rigorous results.

  12. Machine learning: novel bioinformatics approaches for combating antimicrobial resistance.

    Science.gov (United States)

    Macesic, Nenad; Polubriaginof, Fernanda; Tatonetti, Nicholas P

    2017-12-01

    Antimicrobial resistance (AMR) is a threat to global health and new approaches to combating AMR are needed. Use of machine learning in addressing AMR is in its infancy but has made promising steps. We reviewed the current literature on the use of machine learning for studying bacterial AMR. The advent of large-scale data sets provided by next-generation sequencing and electronic health records make applying machine learning to the study and treatment of AMR possible. To date, it has been used for antimicrobial susceptibility genotype/phenotype prediction, development of AMR clinical decision rules, novel antimicrobial agent discovery and antimicrobial therapy optimization. Application of machine learning to studying AMR is feasible but remains limited. Implementation of machine learning in clinical settings faces barriers to uptake with concerns regarding model interpretability and data quality.Future applications of machine learning to AMR are likely to be laboratory-based, such as antimicrobial susceptibility phenotype prediction.

  13. Evaluation of Machine Learning and Rules-Based Approaches for Predicting Antimicrobial Resistance Profiles in Gram-negative Bacilli from Whole Genome Sequence Data.

    Science.gov (United States)

    Pesesky, Mitchell W; Hussain, Tahir; Wallace, Meghan; Patel, Sanket; Andleeb, Saadia; Burnham, Carey-Ann D; Dantas, Gautam

    2016-01-01

    The time-to-result for culture-based microorganism recovery and phenotypic antimicrobial susceptibility testing necessitates initial use of empiric (frequently broad-spectrum) antimicrobial therapy. If the empiric therapy is not optimal, this can lead to adverse patient outcomes and contribute to increasing antibiotic resistance in pathogens. New, more rapid technologies are emerging to meet this need. Many of these are based on identifying resistance genes, rather than directly assaying resistance phenotypes, and thus require interpretation to translate the genotype into treatment recommendations. These interpretations, like other parts of clinical diagnostic workflows, are likely to be increasingly automated in the future. We set out to evaluate the two major approaches that could be amenable to automation pipelines: rules-based methods and machine learning methods. The rules-based algorithm makes predictions based upon current, curated knowledge of Enterobacteriaceae resistance genes. The machine-learning algorithm predicts resistance and susceptibility based on a model built from a training set of variably resistant isolates. As our test set, we used whole genome sequence data from 78 clinical Enterobacteriaceae isolates, previously identified to represent a variety of phenotypes, from fully-susceptible to pan-resistant strains for the antibiotics tested. We tested three antibiotic resistance determinant databases for their utility in identifying the complete resistome for each isolate. The predictions of the rules-based and machine learning algorithms for these isolates were compared to results of phenotype-based diagnostics. The rules based and machine-learning predictions achieved agreement with standard-of-care phenotypic diagnostics of 89.0 and 90.3%, respectively, across twelve antibiotic agents from six major antibiotic classes. Several sources of disagreement between the algorithms were identified. Novel variants of known resistance factors and

  14. Evaluation of Machine Learning and Rules-Based Approaches for Predicting Antimicrobial Resistance Profiles in Gram-negative Bacilli from Whole Genome Sequence Data

    Directory of Open Access Journals (Sweden)

    Mitchell Pesesky

    2016-11-01

    Full Text Available The time-to-result for culture-based microorganism recovery and phenotypic antimicrobial susceptibility testing necessitate initial use of empiric (frequently broad-spectrum antimicrobial therapy. If the empiric therapy is not optimal, this can lead to adverse patient outcomes and contribute to increasing antibiotic resistance in pathogens. New, more rapid technologies are emerging to meet this need. Many of these are based on identifying resistance genes, rather than directly assaying resistance phenotypes, and thus require interpretation to translate the genotype into treatment recommendations. These interpretations, like other parts of clinical diagnostic workflows, are likely to be increasingly automated in the future. We set out to evaluate the two major approaches that could be amenable to automation pipelines: rules-based methods and machine learning methods. The rules-based algorithm makes predictions based upon current, curated knowledge of Enterobacteriaceae resistance genes. The machine-learning algorithm predicts resistance and susceptibility based on a model built from a training set of variably resistant isolates. As our test set, we used whole genome sequence data from 78 clinical Enterobacteriaceae isolates, previously identified to represent a variety of phenotypes, from fully-susceptible to pan-resistant strains for the antibiotics tested. We tested three antibiotic resistance determinant databases for their utility in identifying the complete resistome for each isolate. The predictions of the rules-based and machine learning algorithms for these isolates were compared to results of phenotype-based diagnostics. The rules based and machine-learning predictions achieved agreement with standard-of-care phenotypic diagnostics of 89.0% and 90.3%, respectively, across twelve antibiotic agents from six major antibiotic classes. Several sources of disagreement between the algorithms were identified. Novel variants of known resistance

  15. Deep learning: Using machine learning to study biological vision

    OpenAIRE

    Majaj, Najib; Pelli, Denis

    2017-01-01

    Today most vision-science presentations mention machine learning. Many neuroscientists use machine learning to decode neural responses. Many perception scientists try to understand recognition by living organisms. To them, machine learning offers a reference of attainable performance based on learned stimuli. This brief overview of the use of machine learning in biological vision touches on its strengths, weaknesses, milestones, controversies, and current directions.

  16. Higgs Machine Learning Challenge 2014

    CERN Document Server

    Olivier, A-P; Bourdarios, C ; LAL / Orsay; Goldfarb, S ; University of Michigan

    2014-01-01

    High Energy Physics (HEP) has been using Machine Learning (ML) techniques such as boosted decision trees (paper) and neural nets since the 90s. These techniques are now routinely used for difficult tasks such as the Higgs boson search. Nevertheless, formal connections between the two research fields are rather scarce, with some exceptions such as the AppStat group at LAL, founded in 2006. In collaboration with INRIA, AppStat promotes interdisciplinary research on machine learning, computational statistics, and high-energy particle and astroparticle physics. We are now exploring new ways to improve the cross-fertilization of the two fields by setting up a data challenge, following the footsteps of, among others, the astrophysics community (dark matter and galaxy zoo challenges) and neurobiology (connectomics and decoding the human brain). The organization committee consists of ATLAS physicists and machine learning researchers. The Challenge will run from Monday 12th to September 2014.

  17. Neuromorphic Deep Learning Machines

    OpenAIRE

    Neftci, E; Augustine, C; Paul, S; Detorakis, G

    2017-01-01

    An ongoing challenge in neuromorphic computing is to devise general and computationally efficient models of inference and learning which are compatible with the spatial and temporal constraints of the brain. One increasingly popular and successful approach is to take inspiration from inference and learning algorithms used in deep neural networks. However, the workhorse of deep learning, the gradient descent Back Propagation (BP) rule, often relies on the immediate availability of network-wide...

  18. Application of Machine Learning Techniques in Aquaculture

    OpenAIRE

    Rahman, Akhlaqur; Tasnim, Sumaira

    2014-01-01

    In this paper we present applications of different machine learning algorithms in aquaculture. Machine learning algorithms learn models from historical data. In aquaculture historical data are obtained from farm practices, yields, and environmental data sources. Associations between these different variables can be obtained by applying machine learning algorithms to historical data. In this paper we present applications of different machine learning algorithms in aquaculture applications.

  19. Modelling tick abundance using machine learning techniques and satellite imagery

    DEFF Research Database (Denmark)

    Kjær, Lene Jung; Korslund, L.; Kjelland, V.

    satellite images to run Boosted Regression Tree machine learning algorithms to predict overall distribution (presence/absence of ticks) and relative tick abundance of nymphs and larvae in southern Scandinavia. For nymphs, the predicted abundance had a positive correlation with observed abundance...... the predicted distribution of larvae was mostly even throughout Denmark, it was primarily around the coastlines in Norway and Sweden. Abundance was fairly low overall except in some fragmented patches corresponding to forested habitats in the region. Machine learning techniques allow us to predict for larger...... the collected ticks for pathogens and using the same machine learning techniques to develop prevalence maps of the ScandTick region....

  20. SVM-Prot 2016: A Web-Server for Machine Learning Prediction of Protein Functional Families from Sequence Irrespective of Similarity.

    Science.gov (United States)

    Li, Ying Hong; Xu, Jing Yu; Tao, Lin; Li, Xiao Feng; Li, Shuang; Zeng, Xian; Chen, Shang Ying; Zhang, Peng; Qin, Chu; Zhang, Cheng; Chen, Zhe; Zhu, Feng; Chen, Yu Zong

    2016-01-01

    Knowledge of protein function is important for biological, medical and therapeutic studies, but many proteins are still unknown in function. There is a need for more improved functional prediction methods. Our SVM-Prot web-server employed a machine learning method for predicting protein functional families from protein sequences irrespective of similarity, which complemented those similarity-based and other methods in predicting diverse classes of proteins including the distantly-related proteins and homologous proteins of different functions. Since its publication in 2003, we made major improvements to SVM-Prot with (1) expanded coverage from 54 to 192 functional families, (2) more diverse protein descriptors protein representation, (3) improved predictive performances due to the use of more enriched training datasets and more variety of protein descriptors, (4) newly integrated BLAST analysis option for assessing proteins in the SVM-Prot predicted functional families that were similar in sequence to a query protein, and (5) newly added batch submission option for supporting the classification of multiple proteins. Moreover, 2 more machine learning approaches, K nearest neighbor and probabilistic neural networks, were added for facilitating collective assessment of protein functions by multiple methods. SVM-Prot can be accessed at http://bidd2.nus.edu.sg/cgi-bin/svmprot/svmprot.cgi.

  1. Machine learning for prediction of all-cause mortality in patients with suspected coronary artery disease: a 5-year multicentre prospective registry analysis.

    Science.gov (United States)

    Motwani, Manish; Dey, Damini; Berman, Daniel S; Germano, Guido; Achenbach, Stephan; Al-Mallah, Mouaz H; Andreini, Daniele; Budoff, Matthew J; Cademartiri, Filippo; Callister, Tracy Q; Chang, Hyuk-Jae; Chinnaiyan, Kavitha; Chow, Benjamin J W; Cury, Ricardo C; Delago, Augustin; Gomez, Millie; Gransar, Heidi; Hadamitzky, Martin; Hausleiter, Joerg; Hindoyan, Niree; Feuchtner, Gudrun; Kaufmann, Philipp A; Kim, Yong-Jin; Leipsic, Jonathon; Lin, Fay Y; Maffei, Erica; Marques, Hugo; Pontone, Gianluca; Raff, Gilbert; Rubinshtein, Ronen; Shaw, Leslee J; Stehli, Julia; Villines, Todd C; Dunning, Allison; Min, James K; Slomka, Piotr J

    2017-02-14

    Traditional prognostic risk assessment in patients undergoing non-invasive imaging is based upon a limited selection of clinical and imaging findings. Machine learning (ML) can consider a greater number and complexity of variables. Therefore, we investigated the feasibility and accuracy of ML to predict 5-year all-cause mortality (ACM) in patients undergoing coronary computed tomographic angiography (CCTA), and compared the performance to existing clinical or CCTA metrics. The analysis included 10 030 patients with suspected coronary artery disease and 5-year follow-up from the COronary CT Angiography EvaluatioN For Clinical Outcomes: An InteRnational Multicenter registry. All patients underwent CCTA as their standard of care. Twenty-five clinical and 44 CCTA parameters were evaluated, including segment stenosis score (SSS), segment involvement score (SIS), modified Duke index (DI), number of segments with non-calcified, mixed or calcified plaques, age, sex, gender, standard cardiovascular risk factors, and Framingham risk score (FRS). Machine learning involved automated feature selection by information gain ranking, model building with a boosted ensemble algorithm, and 10-fold stratified cross-validation. Seven hundred and forty-five patients died during 5-year follow-up. Machine learning exhibited a higher area-under-curve compared with the FRS or CCTA severity scores alone (SSS, SIS, DI) for predicting all-cause mortality (ML: 0.79 vs. FRS: 0.61, SSS: 0.64, SIS: 0.64, DI: 0.62; PMachine learning combining clinical and CCTA data was found to predict 5-year ACM significantly better than existing clinical or CCTA metrics alone. Published on behalf of the European Society of Cardiology. All rights reserved. © The Author 2016. For permissions please email: journals.permissions@oup.com.

  2. Machine Learning applications in CMS

    CERN Multimedia

    CERN. Geneva

    2017-01-01

    Machine Learning is used in many aspects of CMS data taking, monitoring, processing and analysis. We review a few of these use cases and the most recent developments, with an outlook to future applications in the LHC Run III and for the High-Luminosity phase.

  3. Attention: A Machine Learning Perspective

    DEFF Research Database (Denmark)

    Hansen, Lars Kai

    2012-01-01

    We review a statistical machine learning model of top-down task driven attention based on the notion of ‘gist’. In this framework we consider the task to be represented as a classification problem with two sets of features — a gist of coarse grained global features and a larger set of low...

  4. Book review: A first course in Machine Learning

    DEFF Research Database (Denmark)

    Ortiz-Arroyo, Daniel

    2016-01-01

    "The new edition of A First Course in Machine Learning by Rogers and Girolami is an excellent introduction to the use of statistical methods in machine learning. The book introduces concepts such as mathematical modeling, inference, and prediction, providing ‘just in time’ the essential background...... to change models and parameter values to make [it] easier to understand and apply these models in real applications. The authors [also] introduce more advanced, state-of-the-art machine learning methods, such as Gaussian process models and advanced mixture models, which are used across machine learning....... This makes the book interesting not only to students with little or no background in machine learning but also to more advanced graduate students interested in statistical approaches to machine learning." —Daniel Ortiz-Arroyo, Associate Professor, Aalborg University Esbjerg, Denmark...

  5. [A machine learning model using gut microbiome data for predicting changes of trimethylamine-N-oxide in healthy volunteers after choline consumption].

    Science.gov (United States)

    Lu, Jun-Qi; Wang, Shan; Yin, Jia; Wu, Shan; He, Yan; Zheng, Hui-Min; Sheng, Hua-Fang; Zhou, Hong-Wei

    2017-03-20

    To establish a machine learning model based on gut microbiota for predicting the level of trimethylamine N-oxide (TMAO) metabolism in vivo after choline intake to provide guidance of individualized precision diet and evidence for screening population at high risks of cardiovascular disease. We quantified plasma levels of TMAO in 18 healthy volunteers before and 8 h after a choline challenge (ingestion of two boiled eggs). The volunteers were divided into two groups with increased or decreased TMAO level following choline challenge. Fresh fecal samples were collected before taking fasting blood samples for amplifying 16S rRNA V4 tags, and the PCR products were sequenced using the platform of Illumina HiSeq 2000. The differences in gut microbiata between subjects with increased and decreased plasma TMAO were analyzed using QIIME. Based on the gut microbiota data and TMAO levels in the two groups, the prediction model was established using the machine learning random forest algorithm, and the validity of the model was tested using a verified dataset. An obvious difference was found in beta diversity of the gut microbota between the subjects with increased and decreased plasma TMAO level following choline challenge. The area under the curve (AUC) of the model was 86.39% (95% CI: 72.7%-100%). Using the verified dataset, the model showed a much higher probability for correctly predicting TMAO variation following choline challenge. The model is feasible and reliable for predicting the level of TMAO metabolism in vivo based on gut microbiota.

  6. Learning scikit-learn machine learning in Python

    CERN Document Server

    Garreta, Raúl

    2013-01-01

    The book adopts a tutorial-based approach to introduce the user to Scikit-learn.If you are a programmer who wants to explore machine learning and data-based methods to build intelligent applications and enhance your programming skills, this the book for you. No previous experience with machine-learning algorithms is required.

  7. Infinite ensemble of support vector machines for prediction of ...

    African Journals Online (AJOL)

    Many researchers have demonstrated the use of artificial neural networks (ANNs) to predict musculoskeletal disorders risk associated with occupational exposures. In order to improve the accuracy of LBDs risk classification, this paper proposes to use the support vector machines (SVMs), a machine learning algorithm used ...

  8. Genome-Wide Locations of Potential Epimutations Associated with Environmentally Induced Epigenetic Transgenerational Inheritance of Disease Using a Sequential Machine Learning Prediction Approach.

    Science.gov (United States)

    Haque, M Muksitul; Holder, Lawrence B; Skinner, Michael K

    2015-01-01

    Environmentally induced epigenetic transgenerational inheritance of disease and phenotypic variation involves germline transmitted epimutations. The primary epimutations identified involve altered differential DNA methylation regions (DMRs). Different environmental toxicants have been shown to promote exposure (i.e., toxicant) specific signatures of germline epimutations. Analysis of genomic features associated with these epimutations identified low-density CpG regions (machine learning computational approach to predict all potential epimutations in the genome. A number of previously identified sperm epimutations were used as training sets. A novel machine learning approach using a sequential combination of Active Learning and Imbalance Class Learner analysis was developed. The transgenerational sperm epimutation analysis identified approximately 50K individual sites with a 1 kb mean size and 3,233 regions that had a minimum of three adjacent sites with a mean size of 3.5 kb. A select number of the most relevant genomic features were identified with the low density CpG deserts being a critical genomic feature of the features selected. A similar independent analysis with transgenerational somatic cell epimutation training sets identified a smaller number of 1,503 regions of genome-wide predicted sites and differences in genomic feature contributions. The predicted genome-wide germline (sperm) epimutations were found to be distinct from the predicted somatic cell epimutations. Validation of the genome-wide germline predicted sites used two recently identified transgenerational sperm epimutation signature sets from the pesticides dichlorodiphenyltrichloroethane (DDT) and methoxychlor (MXC) exposure lineage F3 generation. Analysis of this positive validation data set showed a 100% prediction accuracy for all the DDT-MXC sperm epimutations. Observations further elucidate the genomic features associated with transgenerational germline epimutations and identify a genome

  9. Machine learning an artificial intelligence approach

    CERN Document Server

    Banerjee, R; Bradshaw, Gary; Carbonell, Jaime Guillermo; Mitchell, Tom Michael; Michalski, Ryszard Spencer

    1983-01-01

    Machine Learning: An Artificial Intelligence Approach contains tutorial overviews and research papers representative of trends in the area of machine learning