Muhammad Adi Khairul Anshary
Full Text Available Target market classification aims to focus marketing activities on the right targets. Target markets can be classified through data mining, using data from social media such as Twitter. The end result of data mining is a set of learning models that can classify new data. Ensemble methods can improve the accuracy of these models and therefore provide better results. In this study, target-market classification was conducted on a dataset of 3000 tweets from which features were extracted. Classification models were constructed by manipulating the training data with two ensemble methods (bagging and boosting). To investigate the effectiveness of the ensemble methods, this study used the CART (classification and regression tree) algorithm for comparison. Three categories of consumer goods (computers, mobile phones, and cameras) and three categories of sentiment (positive, negative, and neutral) were classified into three target-market categories. Machine learning was performed using Weka 3.6.9. The results on the test data showed that the bagging method improved the accuracy of CART by 1.9% (to 85.20%). For sentiment classification, on the other hand, the ensemble methods did not increase the accuracy of CART. The results of this study may be of interest to companies that approach their customers through social media, especially Twitter.
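Bagging's accuracy gain over a single tree, as reported above, comes from majority voting over base learners trained on bootstrap resamples of the training data. A minimal, self-contained sketch of that mechanism, using an invented toy 1-D dataset and 1-nearest-neighbour base learners as stand-ins for CART:

```python
import random
from collections import Counter

def bootstrap(data, rng):
    # resample the training set with replacement (same size as the original)
    return [rng.choice(data) for _ in data]

class OneNN:
    # 1-nearest-neighbour on a 1-D feature: a minimal stand-in base learner
    def fit(self, data):
        self.data = data
        return self

    def predict(self, x):
        return min(self.data, key=lambda p: abs(p[0] - x))[1]

def bagged_predict(models, x):
    # majority vote over the bagged base learners
    votes = Counter(m.predict(x) for m in models)
    return votes.most_common(1)[0][0]

rng = random.Random(0)
train = [(x, x >= 5) for x in range(10)]   # toy data: label is x >= 5
ensemble = [OneNN().fit(bootstrap(train, rng)) for _ in range(11)]
print(bagged_predict(ensemble, 8), bagged_predict(ensemble, 1))
```

An odd number of base learners (11 here) avoids tied votes in the binary case; real bagging would use full decision trees and out-of-bag estimates rather than this toy setup.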
Full Text Available Recently, more and more machine learning techniques have been applied to microarray data analysis. The aim of this study is to propose a new genetic programming (GP)-based ensemble system (named GPES) that can effectively classify different types of cancers. Decision trees are deployed as base classifiers in this ensemble framework with three operators: Min, Max, and Average. Each individual of the GP is an ensemble system, and these individuals become more and more accurate as the evolutionary process proceeds. Feature selection and balanced subsampling techniques are applied to increase the diversity within each ensemble system. The final ensemble committee is selected by a forward search algorithm, which is shown to be capable of fitting data automatically. The performance of GPES is evaluated using five binary-class and six multiclass microarray datasets, and the results show that the algorithm achieves better results in most cases compared with some other ensemble systems. By using more elaborate base classifiers or applying other sampling techniques, the performance of GPES may be further improved.
Abuassba, Adnan O M; Zhang, Dezheng; Luo, Xiong; Shaheryar, Ahmad; Ali, Hazrat
Extreme Learning Machine (ELM) is a fast-learning algorithm for a single-hidden-layer feedforward neural network (SLFN). It often has good generalization performance. However, it may overfit the training data due to having more hidden nodes than needed. To address generalization performance, we use a heterogeneous ensemble approach. We propose an Advanced ELM Ensemble (AELME) for classification, which includes Regularized-ELM, L2-norm-optimized ELM (ELML2), and Kernel-ELM. The ensemble is constructed by training a randomly chosen ELM classifier on a subset of the training data selected through random resampling. The proposed AELM-Ensemble is evolved by employing an objective function that increases diversity and accuracy within the final ensemble. Finally, the class label of unseen data is predicted using a majority-vote approach. Splitting the training data into subsets and incorporating heterogeneous ELM classifiers result in higher prediction accuracy, better generalization, and fewer base classifiers, as compared to other models (AdaBoost, Bagging, dynamic ELM ensemble, data-splitting ELM ensemble, and ELM ensemble). The validity of AELME is confirmed through classification on several real-world benchmark datasets.
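The core ELM step referenced above, random hidden-layer weights plus a closed-form least-squares solve for the output weights, can be sketched in a few lines. This is a generic single-output ELM illustration on synthetic XOR-like data, not the authors' AELME; the hidden-layer size and dataset are invented for the example:

```python
import numpy as np

def elm_fit(X, y, n_hidden, rng):
    # ELM: hidden weights are random and fixed; only the output weights
    # are learned, via the Moore-Penrose pseudo-inverse (least squares)
    W = rng.normal(size=(X.shape[1], n_hidden))
    b = rng.normal(size=n_hidden)
    H = np.tanh(X @ W + b)            # hidden-layer activations
    beta = np.linalg.pinv(H) @ y      # closed-form output weights
    return W, b, beta

def elm_predict(model, X):
    W, b, beta = model
    return np.tanh(X @ W + b) @ beta

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(float)   # XOR-like target
model = elm_fit(X, y, n_hidden=40, rng=rng)
acc = np.mean((elm_predict(model, X) > 0.5) == (y > 0.5))
print(acc)
```

AELME would train several such networks (with regularized, L2-optimized, and kernel variants) on random resamples and combine them by majority vote; only the single base learner is shown here.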
Chen, Puhua; Jiao, Licheng; Liu, Fang; Zhao, Jiaqi; Zhao, Zhiqiang
Hyperspectral data are the spectral responses of landcovers across different spectral bands, and different band sets can be treated as different views of the landcovers, each of which may contain different structure information. Therefore, multiview graphs ensemble-based graph embedding is proposed to improve the performance of graph embedding for hyperspectral image classification. By integrating multiview graphs, richer and more accurate structure information can be utilized in graph embedding to achieve better results than traditional graph embedding methods. In addition, multiview graphs ensemble-based graph embedding can be treated as a framework to be extended to different graph-based methods. Experimental results demonstrate that the proposed method can improve the performance of traditional graph embedding methods significantly.
Pourashraf, Payam; Tomuro, Noriko; Apostolova, Emilia
This paper presents an image classification model developed to classify images embedded in commercial real estate flyers. It is a component in a larger, multimodal system which uses the texts as well as the images in the flyers to automatically classify them by property type. The role of the image classifier in the system is to provide the genres of the embedded images (map, schematic drawing, aerial photo, etc.), which are then combined with the texts in the flyer for the overall classification. In this work, we used an ensemble learning approach and developed a model in which the outputs of an ensemble of support vector machines (SVMs) are combined by a k-nearest neighbor (KNN) classifier. In this model, the classifiers in the ensemble are strong classifiers, each of which is trained to predict a given/assigned genre. Not only is our model intuitive, taking advantage of the mutual distinctness of the image genres; it is also scalable. We tested the model using over 3000 images extracted from online real estate flyers. The results showed that our model outperformed the baseline classifiers by a large margin.
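The stacked design described above, an ensemble of per-genre classifiers whose output vector feeds a KNN meta-classifier, can be illustrated with a toy sketch. The per-genre scorers below are hypothetical stand-ins for the trained SVMs; only the combination scheme mirrors the idea:

```python
from collections import Counter

# Hypothetical per-genre scorers standing in for trained SVMs: each emits a
# confidence that an item belongs to its assigned genre (here, closeness to
# an invented genre "center" on a 1-D feature).
def make_scorer(center):
    return lambda x: 1.0 / (1.0 + abs(x - center))

scorers = [make_scorer(c) for c in (0.0, 5.0, 10.0)]   # one per genre

def to_meta_features(x):
    # the ensemble's score vector becomes the meta-classifier's input
    return [s(x) for s in scorers]

def knn_predict(train, x, k=3):
    # k-nearest-neighbour vote in the space of ensemble outputs
    feats = to_meta_features(x)
    dist = lambda f: sum((a - b) ** 2 for a, b in zip(f, feats))
    nearest = sorted(train, key=lambda t: dist(t[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

train = [(to_meta_features(x), genre)
         for genre, xs in enumerate([(0, 1, 2), (4, 5, 6), (9, 10, 11)])
         for x in xs]
print(knn_predict(train, 4.6))
```

The appeal of this layout, as the abstract notes, is scalability: adding a genre means adding one scorer and retraining only the cheap meta-classifier.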
Akbar, Shahid; Hayat, Maqsood; Iqbal, Muhammad; Jan, Mian Ahmad
Cancer is a fatal disease, responsible for one-quarter of all deaths in developed countries. Traditional anticancer therapies such as chemotherapy and radiation are highly expensive, susceptible to errors, and often ineffective, and they induce severe side effects on human cells. Due to the perilous impact of cancer, the development of an accurate and highly efficient intelligent computational model is desirable for the identification of anticancer peptides. In this paper, an evolutionary intelligent genetic-algorithm-based ensemble model, 'iACP-GAEnsC', is proposed for the identification of anticancer peptides. In this model, the protein sequences are formulated using three different discrete feature representation methods, i.e., amphiphilic pseudo amino acid composition, g-gap dipeptide composition, and reduced amino acid alphabet composition. The performance of the extracted feature spaces is investigated separately, and the spaces are then merged to exhibit the significance of hybridization. In addition, the predicted results of individual classifiers are combined using an optimized genetic algorithm and a simple majority technique in order to enhance the true classification rate. It is observed that genetic-algorithm-based ensemble classification outperforms both the individual classifiers and the simple majority-voting-based ensemble. The performance of genetic-algorithm-based ensemble classification is highest on the hybrid feature space, with an accuracy of 96.45%. In comparison to existing techniques, the 'iACP-GAEnsC' model has achieved remarkable improvement in terms of various performance metrics. Based on the simulation results, the 'iACP-GAEnsC' model might become a leading tool in the field of drug design and proteomics. Copyright © 2017 Elsevier B.V. All rights reserved.
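Optimizing ensemble vote weights with a genetic algorithm, as described above, can be sketched in miniature. The validation predictions below are fabricated toy data, and the GA (mutation-only, with elitist selection) is far simpler than a production implementation:

```python
import random

# Hypothetical base-classifier predictions on a validation set (stand-ins for
# real trained classifiers): rows = samples, columns = classifiers.
preds = [[1, 1, 0], [1, 0, 0], [0, 0, 1], [0, 1, 0], [1, 1, 1], [0, 0, 0]]
truth = [1, 1, 0, 0, 1, 0]

def weighted_vote(weights, row):
    # predict 1 when the weighted votes for 1 exceed half the total weight
    score = sum(w for w, p in zip(weights, row) if p == 1)
    return 1 if score * 2 > sum(weights) else 0

def fitness(weights):
    # number of validation samples the weighted ensemble gets right
    return sum(weighted_vote(weights, r) == t for r, t in zip(preds, truth))

def evolve(pop_size=30, gens=40, rng=random.Random(1)):
    pop = [[rng.random() for _ in preds[0]] for _ in range(pop_size)]
    for _ in range(gens):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]          # elitist selection
        children = [[max(w + rng.gauss(0, 0.1), 0.0)   # Gaussian mutation
                     for w in rng.choice(parents)]
                    for _ in range(pop_size - len(parents))]
        pop = parents + children
    return max(pop, key=fitness)

best = evolve()
print(fitness(best))
```

In this toy setup the GA can beat plain majority voting by up-weighting the most reliable classifier, which is exactly the advantage the abstract reports for the GA-based ensemble over simple majority voting.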
Full Text Available Gene microarray analysis and classification have proved an effective approach to the diagnosis of diseases and cancers. However, it has also been revealed that basic classification techniques have intrinsic drawbacks for achieving accurate gene classification and cancer diagnosis. On the other hand, classifier ensembles have received increasing attention in various applications. Here, we address the gene classification issue using the RotBoost ensemble methodology. This method combines the Rotation Forest and AdaBoost techniques, and in turn preserves both desirable features of an ensemble architecture: accuracy and diversity. To select a concise subset of informative genes, five different feature selection algorithms are considered. To assess the efficiency of RotBoost, other non-ensemble/ensemble techniques, including Decision Trees, Support Vector Machines, Rotation Forest, AdaBoost, and Bagging, are also deployed. Experimental results reveal that the combination of the fast correlation-based feature selection method with the ICA-based RotBoost ensemble is highly effective for gene classification. In fact, the proposed method can create ensemble classifiers which outperform not only the classifiers produced by conventional machine learning but also the classifiers generated by two widely used ensemble learning methods, Bagging and AdaBoost.
Fu, Bin; Wang, Zhihai; Pan, Rong
Ensemble pruning is an important issue in the field of ensemble learning. Diversity is a key criterion in determining how the pruning process is done and in measuring the result. However, there are few formal definitions of diversity yet. Hence, three important factors that should...
Bensema, Kevin; Gosink, Luke; Obermaier, Harald; Joy, Kenneth I.
Advances in computational power now enable domain scientists to address conceptual and parametric uncertainty by running simulations multiple times in order to sufficiently sample the uncertain input space. While this approach helps address conceptual and parametric uncertainties, the ensemble datasets produced by this technique present a special challenge to visualization researchers as the ensemble dataset records a distribution of possible values for each location in the domain. Contemporary visualization approaches that rely solely on summary statistics (e.g., mean and variance) cannot convey the detailed information encoded in ensemble distributions that are paramount to ensemble analysis; summary statistics provide no information about modality classification and modality persistence. To address this problem, we propose a novel technique that classifies high-variance locations based on the modality of the distribution of ensemble predictions. Additionally, we develop a set of confidence metrics to inform the end-user of the quality of fit between the distribution at a given location and its assigned class. We apply a similar method to time-varying ensembles to illustrate the relationship between peak variance and bimodal or multimodal behavior. These classification schemes enable a deeper understanding of the behavior of the ensemble members by distinguishing between distributions that can be described by a single tendency and distributions which reflect divergent trends in the ensemble.
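A crude version of the modality classification described above is counting peaks in a histogram of the ensemble's values at a single location; real analyses would use kernel density estimates or statistical dip tests, but the idea can be sketched as follows (the two value lists are invented toy ensembles):

```python
def histogram(values, bins):
    # fixed-width binning over the observed range
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0
    counts = [0] * bins
    for v in values:
        counts[min(int((v - lo) / width), bins - 1)] += 1
    return counts

def count_modes(values, bins=5):
    # a mode is a histogram bin strictly higher than both neighbours
    c = [0] + histogram(values, bins) + [0]
    return sum(1 for i in range(1, len(c) - 1) if c[i - 1] < c[i] > c[i + 1])

unimodal = [3, 4, 4, 5, 5, 5, 5, 6, 6, 7]   # ensemble agrees on one tendency
bimodal = [1, 2, 2, 3, 7, 8, 8, 9]          # divergent trends: two clusters
print(count_modes(unimodal), count_modes(bimodal))
```

A location whose ensemble values yield two or more modes would be flagged as reflecting divergent trends, exactly the information that mean and variance alone cannot convey.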
Jin, Lin-Peng; Dong, Jun
Ensemble learning has been proven to improve generalization ability effectively in both theory and practice. In this paper, we first briefly outline the current status of research on it. Then, a new deep neural network-based ensemble method that integrates filtering views, local views, distorted views, explicit training, implicit training, subview prediction, and Simple Average is proposed for biomedical time series classification. Finally, we validate its effectiveness on the Chinese Cardiovascular Disease Database, which contains a large number of electrocardiogram recordings. The experimental results show that the proposed method has certain advantages compared to some well-known ensemble methods, such as Bagging and AdaBoost.
Ebadi, Ashkan; Dalboni da Rocha, Josué L.; Nagaraju, Dushyanth B.; Tovar-Moll, Fernanda; Bramati, Ivanei; Coutinho, Gabriel; Sitaram, Ranganatha; Rashidi, Parisa
The human brain is a complex network of interacting regions. The gray matter regions of the brain are interconnected by white matter tracts, together forming one integrative complex network. In this article, we report our investigation of the potential of applying brain connectivity patterns as an aid in diagnosing Alzheimer's disease and Mild Cognitive Impairment (MCI). We performed pattern analysis of graph theoretical measures derived from Diffusion Tensor Imaging (DTI) data representing the structural brain networks of 45 subjects, consisting of 15 patients with Alzheimer's disease (AD), 15 patients with MCI, and 15 healthy subjects (CT). We considered pair-wise class combinations of subjects, defining three separate classification tasks, i.e., AD-CT, AD-MCI, and CT-MCI, and used an ensemble classification module to perform the classification tasks. Our ensemble framework with feature selection shows promising performance, with classification accuracies of 83.3% for AD vs. MCI, 80% for AD vs. CT, and 70% for MCI vs. CT. Moreover, our findings suggest that AD can be related to graph measure abnormalities at Brodmann areas in the sensorimotor cortex and piriform cortex. In particular, node redundancy coefficient and load centrality in the primary motor cortex were recognized as good indicators of AD in contrast to MCI. In general, load centrality, betweenness centrality, and closeness centrality were found to be the most relevant network measures, as they were the top identified features at different nodes. Due to the small and not well-defined groups of AD and MCI patients, the present study should be regarded as a "proof of concept" for a procedure classifying MRI markers between AD dementia, MCI, and normal old individuals. Future studies with larger samples of subjects and more sophisticated patient exclusion criteria are necessary for the development of a more precise technique for clinical diagnosis. PMID:28293162
Full Text Available Supervised learning is the process of data mining for deducing rules from training datasets. A broad array of supervised learning algorithms exists, each with its own advantages and drawbacks. There are some basic issues that affect the accuracy of a classifier when solving a supervised learning problem, such as the bias-variance tradeoff, the dimensionality of the input space, and noise in the input data. All these problems affect the accuracy of the classifier and are the reason that there is no globally optimal method for classification. Nor is there any generalized improvement method that can increase the accuracy of any classifier while addressing all the problems stated above. This paper proposes a global optimization ensemble model for classification methods (GMC) that can improve the overall accuracy for supervised learning problems. The experimental results on various public datasets showed that the proposed model improved the accuracy of the classification models by 1% to 30%, depending upon the algorithm complexity.
Nabhan Homsi, Masun; Warrick, Philip
Heart sound classification and analysis play an important role in the early diagnosis and prevention of cardiovascular disease. To this end, this paper introduces a novel method for the automatic classification of normal and abnormal heart sound recordings. Signals are first preprocessed to extract a total of 131 features in the time, frequency, wavelet, and statistical domains from the entire signal and from the timings of the states. Outlier signals are then detected and separated from those with a standard range using an interquartile-range algorithm. After that, feature extreme values are given special consideration, and finally the features are reduced to the most significant ones using a feature reduction technique. In the classification stage, the selected features, for either standard or outlier signals, are fed separately into an ensemble of 20 two-step classifiers. The first step of each classifier is a nested set of ensemble algorithms cross-validated on the training dataset provided by the PhysioNet Challenge 2016, while the second applies a voting rule on the class labels. The results show that this method is able to recognize heart sound recordings efficiently, achieving an overall score of 96.30% for standard signals and 90.18% for outlier signals in a cross-validated experiment using the available training data. Our proposed approach helped reduce overfitting and improved classification performance, achieving an overall score on the hidden test set of 80.1% (79.6% sensitivity and 80.6% specificity).
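The interquartile-range step used above to separate outlier signals from standard ones is a classical rule: anything beyond 1.5 × IQR from the quartiles is flagged. A minimal sketch, with invented scalar values standing in for per-signal summary features:

```python
def quartiles(xs):
    s = sorted(xs)
    def q(p):
        # linear interpolation between closest ranks
        k = (len(s) - 1) * p
        f = int(k)
        return s[f] + (k - f) * (s[min(f + 1, len(s) - 1)] - s[f])
    return q(0.25), q(0.75)

def split_by_iqr(xs, factor=1.5):
    # flag values outside [Q1 - factor*IQR, Q3 + factor*IQR] as outliers
    q1, q3 = quartiles(xs)
    iqr = q3 - q1
    lo, hi = q1 - factor * iqr, q3 + factor * iqr
    standard = [x for x in xs if lo <= x <= hi]
    outliers = [x for x in xs if x < lo or x > hi]
    return standard, outliers

signals = [10, 11, 12, 12, 13, 14, 15, 40]   # 40 is clearly atypical
standard, outliers = split_by_iqr(signals)
print(outliers)
```

Routing the two groups to separately trained classifier ensembles, as the paper does, lets each model specialize in its own value range.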
Cacha, L A; Parida, S; Dehuri, S; Cho, S-B; Poznanski, R R
The huge number of voxels in fMRI over time poses a major challenge for effective analysis. Fast, accurate, and reliable classifiers are required for estimating the decoding accuracy of brain activities. Although machine-learning classifiers seem promising, individual classifiers have their own limitations. To address this limitation, the present paper proposes a method based on an ensemble of neural networks to analyze fMRI data for cognitive state classification, applicable across multiple subjects. The fuzzy integral (FI) approach is employed as an efficient tool for combining the different classifiers. The FI approach led to the development of a classifier-ensemble technique that performs better than any single classifier by reducing misclassification, bias, and variance. The proposed method successfully classified the different cognitive states for multiple subjects with high classification accuracy. A comparison of the performance improvement of the ensemble neural network method vs. that of the individual neural networks strongly points toward the usefulness of the proposed method.
Simoes, Rita; Slump, Cornelis H.; Cappellen van Walsum, Anne-Marie van
Classification methods have been proposed to detect Alzheimer's disease (AD) using magnetic resonance images. Most rely on features such as the shape/volume of brain structures that need to be defined a priori. In this work, we propose a method that requires neither the segmentation of specific brain regions nor nonlinear alignment to a template. Besides classification, we also analyze which brain regions are discriminative between a group of normal controls and a group of AD patients. We perform 3D texture analysis using Local Binary Patterns computed at local image patches in the whole brain, combined in a classifier ensemble. We evaluate our method in a publicly available database including very mild-to-mild AD subjects and healthy elderly controls. For the subject cohort including only mild AD subjects, the best results are obtained using a combination of large (30 x 30 x 30 and 40 x 40 x 40 voxels) patches. A spatial analysis of the best performing patches shows that these are located in the medial-temporal lobe and in the periventricular regions. When very mild AD subjects are included in the dataset, the small (10 x 10 x 10 voxels) patches perform best, with the most discriminative ones being located near the left hippocampus. We show that our method is able not only to perform accurate classification, but also to localize discriminative brain regions, which are in accordance with the medical literature. This is achieved without the need to segment specific brain structures and without performing nonlinear registration to a template, indicating that the method may be suitable for a clinical implementation that can help to diagnose AD at an earlier stage.
Full Text Available Recently, ensemble learning methods have been widely used to improve classification performance in machine learning. In this paper, we present a novel ensemble learning method: argumentation-based multi-agent joint learning (AMAJL), which integrates ideas from multi-agent argumentation, ensemble learning, and association rule mining. In AMAJL, argumentation technology is introduced as an ensemble strategy to integrate multiple base classifiers and generate a high-performance ensemble classifier. We design an argumentation framework named Arena as a communication platform for knowledge integration. Through argumentation-based joint learning, high-quality individual knowledge can be extracted, and thus a refined global knowledge base can be generated and used independently for classification. We perform numerous experiments on multiple public datasets using AMAJL and other benchmark methods. The results demonstrate that our method can effectively extract high-quality knowledge for the ensemble classifier and improve the performance of classification.
Kim, Edward J.; Brunner, Robert J.; Carrasco Kind, Matias
There exist a variety of star-galaxy classification techniques, each with its own strengths and weaknesses. In this paper, we present a novel meta-classification framework that combines and fully exploits different techniques to produce a more robust star-galaxy classification. To demonstrate this hybrid, ensemble approach, we combine a purely morphological classifier, a supervised machine learning method based on random forest, an unsupervised machine learning method based on self-organizing maps, and a hierarchical Bayesian template-fitting method. Using data from the CFHTLenS survey (Canada-France-Hawaii Telescope Lensing Survey), we consider different scenarios: when a high-quality training set is available with spectroscopic labels from DEEP2 (Deep Extragalactic Evolutionary Probe Phase 2), SDSS (Sloan Digital Sky Survey), VIPERS (VIMOS Public Extragalactic Redshift Survey), and VVDS (VIMOS VLT Deep Survey), and when the demographics of sources in a low-quality training set do not match the demographics of objects in the test data set. We demonstrate that our Bayesian combination technique improves the overall performance over any individual classification method in these scenarios. Thus, strategies that combine the predictions of different classifiers may prove to be optimal in currently ongoing and forthcoming photometric surveys, such as the Dark Energy Survey and the Large Synoptic Survey Telescope.
Olesen, Alexander Neergaard; Christensen, Julie A E; Sorensen, Helge B D; Jennum, Poul J
Reducing the number of recording modalities for sleep staging research can benefit both researchers and patients, under the condition that the reduced set provides results as accurate as those of conventional systems. This paper investigates the possibility of exploiting the multisource nature of electrooculography (EOG) signals by presenting a method for automatic sleep staging using the complete ensemble empirical mode decomposition with adaptive noise algorithm and a random forest classifier. It achieves a high overall accuracy of 82% and a Cohen's kappa of 0.74, indicating substantial agreement between automatic and manual scoring.
Full Text Available Audio segmentation is a basis for multimedia content analysis, which is one of the most important and widely used applications today. An optimized audio classification and segmentation algorithm is presented in this paper that segments a superimposed audio stream on the basis of its content into four main audio types: pure speech, music, environment sound, and silence. The proposed algorithm preserves important audio content and reduces the misclassification rate without using a large amount of training data; it handles noise and is suitable for real-time applications. Noise in an audio stream is segmented out as environment sound. A hybrid classification approach is used: bagged support vector machines (SVMs) with artificial neural networks (ANNs). The audio stream is classified, firstly, into speech and nonspeech segments by using bagged SVMs; nonspeech segments are further classified into music and environment sound by using ANNs; and lastly, speech segments are classified into silence and pure-speech segments by a rule-based classifier. Minimal data is used for training the classifiers; ensemble methods are used to minimize the misclassification rate, and approximately 98% accurate segments are obtained. The resulting algorithm is fast and efficient and can be used with real-time multimedia applications.
Full Text Available Kernel-based methods and ensemble learning are two important paradigms for the classification of hyperspectral remote sensing images. However, they were developed in parallel with different principles. In this paper, we aim to combine the advantages of kernel and ensemble methods by proposing a kernel supervised ensemble classification method. The proposed method, namely RoF-KOPLS, combines the merits of ensemble feature learning (i.e., Rotation Forest (RoF)) and kernel supervised learning (i.e., Kernel Orthonormalized Partial Least Square (KOPLS)). Specifically, the feature space is randomly split into K disjoint subspaces, and KOPLS is applied to each subspace to produce a new feature set for training a decision tree classifier. The final classification result is assigned to the corresponding class by the majority-voting rule. Experimental results on two hyperspectral airborne images demonstrate that RoF-KOPLS with the radial basis function (RBF) kernel yields the best classification accuracies, due to its ability to improve the accuracy of the base classifiers and the diversity within the ensemble, especially for very limited training sets. Furthermore, the proposed method is insensitive to the number of subsets.
Full Text Available It is important to identify and prevent disease risk as early as possible through regular physical examinations. We formulate disease risk prediction as a multilabel classification problem. A novel Ensemble Label Power-set Pruned datasets Joint Decomposition (ELPPJD) method is proposed in this work. First, we transform the multilabel classification into a multiclass classification. Then, we propose the pruned datasets and joint decomposition methods to deal with the imbalanced learning problem. Two strategies, size balanced (SB) and label similarity (LS), are designed to decompose the training dataset. In the experiments, the dataset comes from real physical examination records. We contrast the performance of the ELPPJD method with the two different decomposition strategies. Moreover, a comparison between ELPPJD and the classic multilabel classification methods RAkEL and HOMER is carried out. The experimental results show that the ELPPJD method with the label similarity strategy has outstanding performance.
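The first ELPPJD step, turning a multilabel problem into a multiclass one, follows the label power-set idea: each distinct combination of labels becomes one class. A direct sketch (the record labels are invented, and the subsequent pruning and joint decomposition are not shown):

```python
# Label power-set transform: each distinct label set becomes one class id.
def power_set_transform(y_multilabel):
    classes = {}          # frozenset of labels -> class id (first-seen order)
    y_multiclass = []
    for labels in y_multilabel:
        key = frozenset(labels)
        if key not in classes:
            classes[key] = len(classes)
        y_multiclass.append(classes[key])
    return y_multiclass, classes

# toy physical-examination records, each tagged with a set of disease risks
records = [{"diabetes"}, {"diabetes", "hypertension"}, {"hypertension"},
           {"diabetes"}, {"diabetes", "hypertension"}]
y, mapping = power_set_transform(records)
print(y)
```

The weakness this creates, many rare combination classes and a badly imbalanced multiclass dataset, is precisely what the pruning and decomposition strategies (SB and LS) are designed to counter.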
Idil Isikli Esener
Full Text Available A new and effective feature ensemble with multistage classification is proposed for implementation in a computer-aided diagnosis (CAD) system for breast cancer diagnosis. A publicly available mammogram image dataset collected during the Image Retrieval in Medical Applications (IRMA) project is utilized to verify the suggested feature ensemble and multistage classification. In the CAD system, feature extraction is performed on mammogram region-of-interest (ROI) images, which are preprocessed by applying histogram equalization followed by nonlocal means filtering. The proposed feature ensemble is formed by concatenating the local-configuration-pattern-based, statistical, and frequency domain features. The classification of these features is implemented in three cases: a one-stage study, a two-stage study, and a three-stage study. Eight well-known classifiers are used in all cases of this multistage classification scheme. Additionally, the results of the classifiers that provide the top three performances are combined via a majority voting technique to improve the recognition accuracy in both the two- and three-stage studies. Maximum classification accuracies of 85.47%, 88.79%, and 93.52% are attained by the one-, two-, and three-stage studies, respectively. The proposed multistage classification scheme is more effective than single-stage classification for breast cancer diagnosis.
Budiman Putra Asmaur Rohman
Full Text Available Target detection is a mandatory task of a radar system, so radar system performance is mainly determined by its detection rate. Constant False Alarm Rate (CFAR) is a detection algorithm commonly used in radar systems. This method is divided into several approaches, which perform differently in different environments. Therefore, this paper proposes an ensemble neural-network-based classifier with a varying number of hidden neurons for classifying radar environments. The result of this research will support improved target detection in radar systems by enabling an adaptive CFAR. A multi-layer perceptron network (MLPN) with a single hidden layer is employed as the structure of the base classifiers. The first step of this research is the evaluation of the hidden-neuron number that gives the highest classification accuracy with the simplest computation. According to the result of this step, the three best structures are selected to build an ensemble classifier. In the ensemble structure, the outputs of those three MLPNs are collected, and the majority vote determines the final classification. The three radar environments investigated are homogeneous, multiple-target, and clutter-boundary. According to the simulation results, the ensemble MLPN provides a higher detection rate than the conventional single MLPNs. Moreover, in the multiple-target and clutter-boundary environments, the proposed method shows its highest performance.
Full Text Available Abstract Background Given a new biological sequence, detecting membership in a known family is a basic step in many bioinformatics analyses, with applications to protein structure and function prediction and metagenomic taxon identification and abundance profiling, among others. Yet family identification of sequences that are distantly related to sequences in public databases, or that are fragmentary, remains one of the more difficult analytical problems in bioinformatics. Results We present a new technique for family identification called HIPPI (Hierarchical Profile Hidden Markov Models for Protein family Identification). HIPPI uses a novel technique to represent a multiple sequence alignment for a given protein family or superfamily by an ensemble of profile hidden Markov models computed using HMMER. An evaluation of HIPPI on the Pfam database shows that HIPPI has better overall precision and recall than blastp, HMMER, and pipelines based on HHsearch, and maintains good accuracy even for fragmentary query sequences and for protein families with low average pairwise sequence identity, both conditions where other methods degrade in accuracy. Conclusion HIPPI provides accurate protein family identification and is robust to difficult model conditions. Our results, combined with observations from previous studies, show that ensembles of profile hidden Markov models can represent multiple sequence alignments better than a single profile hidden Markov model, and thus can improve downstream analyses for various bioinformatics tasks. Further research is needed to determine the best practices for building the ensemble of profile hidden Markov models. HIPPI is available on GitHub at https://github.com/smirarab/sepp.
Full Text Available In supervised learning-based classification, ensembles have been successfully employed in different application domains. In the literature, many researchers have proposed different ensembles by considering different combination methods, training datasets, base classifiers, and many other factors. Artificial-intelligence (AI)-based techniques play a prominent role in the development of ensembles for intrusion detection (ID) and have many benefits over other techniques. However, there is no comprehensive review of ensembles in general, or of AI-based ensembles for ID in particular, that examines their current research status with respect to the ID problem. Here, an updated review of ensembles and their taxonomies is presented in general. The paper also presents an updated review of various AI-based ensembles for ID, in particular during the last decade. The related studies of AI-based ensembles are compared using a set of evaluation metrics derived from (1) the architecture and approach followed; (2) the different methods utilized in different phases of ensemble learning; and (3) other measures used to evaluate the classification performance of the ensembles. The paper also provides future directions for research in this area. The paper will help readers better understand the different directions in which research on ensembles has been done in general and, specifically, in the field of intrusion detection systems (IDSs).
Schclar, Alon; Rokach, Lior; Amit, Amir
We present a novel approach for the construction of ensemble classifiers based on dimensionality reduction. Dimensionality reduction methods represent datasets using a small number of attributes while preserving the information conveyed by the original dataset. The ensemble members are trained based on dimension-reduced versions of the training set. These versions are obtained by applying dimensionality reduction to the original training set using different values of the input parameters. Thi...
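The construction described above can be sketched as follows. Since the abstract is truncated, random projections stand in for a generic dimensionality-reduction method run with different parameter values, and a nearest-centroid rule stands in for the trained ensemble members; all data and parameters are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-class dataset in 10 dimensions (illustrative stand-in for a real training set).
X0 = rng.normal(0.0, 0.5, size=(30, 10))
X1 = rng.normal(3.0, 0.5, size=(30, 10))
X = np.vstack([X0, X1])
y = np.array([0] * 30 + [1] * 30)

def fit_centroids(Xr, y):
    # One centroid per class in the reduced space.
    return {c: Xr[y == c].mean(axis=0) for c in np.unique(y)}

def predict_centroids(centroids, Xr):
    classes = sorted(centroids)
    d = np.stack([np.linalg.norm(Xr - centroids[c], axis=1) for c in classes])
    return np.array(classes)[d.argmin(axis=0)]

# Each ensemble member sees a different dimension-reduced view of the data;
# random projections stand in for a reduction applied with varied input parameters.
votes = []
for seed in range(5):
    P = np.random.default_rng(seed).normal(size=(10, 3)) / np.sqrt(3)
    Xr = X @ P
    votes.append(predict_centroids(fit_centroids(Xr, y), Xr))

# Majority vote over the two classes.
ensemble_pred = (np.mean(votes, axis=0) > 0.5).astype(int)
accuracy = float((ensemble_pred == y).mean())
```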
Full Text Available Abstract Background Uncovering subtypes of disease from microarray samples has important clinical implications such as survival time and sensitivity of individual patients to specific therapies. Unsupervised clustering methods have been used to classify this type of data. However, most existing methods focus on clusters with compact shapes and do not reflect the geometric complexity of the high dimensional microarray clusters, which limits their performance. Results We present a cluster-number-based ensemble clustering algorithm, called MULTI-K, for microarray sample classification, which demonstrates remarkable accuracy. The method amalgamates multiple k-means runs by varying the number of clusters and identifies clusters that manifest the most robust co-memberships of elements. In addition to the original algorithm, we also devised an entropy plot to control the separation of singletons or small clusters. MULTI-K, unlike the simple k-means or other widely used methods, was able to capture clusters with complex and high-dimensional structures accurately. MULTI-K outperformed other methods, including a recently developed ensemble clustering algorithm, in tests with five simulated and eight real gene-expression data sets. Conclusion The geometric complexity of clusters should be taken into account for accurate classification of microarray data, and ensemble clustering applied to the number of clusters tackles the problem very well. The C++ code and the data sets tested are available from the authors.
Kim, Eun-Youn; Kim, Seon-Young; Ashlock, Daniel; Nam, Dougu
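The co-membership idea behind MULTI-K can be sketched as follows. The three labelings are hand-written stand-ins for actual k-means runs with varying k, and the 2/3 robustness threshold is an assumption for illustration:

```python
import numpy as np

# Labelings of six samples from three clustering runs with different k
# (hand-written stand-ins for k-means runs on microarray samples).
runs = np.array([
    [0, 0, 0, 1, 1, 1],  # k = 2
    [0, 0, 1, 2, 2, 2],  # k = 3
    [0, 0, 1, 2, 2, 3],  # k = 4
])

n = runs.shape[1]
# Co-association matrix: fraction of runs in which samples i and j co-cluster.
coassoc = np.zeros((n, n))
for labels in runs:
    coassoc += labels[:, None] == labels[None, :]
coassoc /= len(runs)

# Keep only robust co-memberships and read clusters off the connected components.
robust = coassoc >= 2 / 3  # robustness threshold (an assumption for illustration)

def components(adj):
    seen, groups = set(), []
    for i in range(len(adj)):
        if i in seen:
            continue
        stack, group = [i], []
        while stack:
            j = stack.pop()
            if j in seen:
                continue
            seen.add(j)
            group.append(j)
            stack.extend(k for k in range(len(adj)) if adj[j][k] and k not in seen)
        groups.append(sorted(group))
    return groups

clusters = components(robust)  # -> [[0, 1], [2], [3, 4, 5]]
```

Sample 2 co-clusters with 0 and 1 in only one of three runs, so it drops out as a singleton; this is the kind of separation the entropy plot is meant to control.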
Kumar, Ashnil; Kim, Jinman; Lyndon, David; Fulham, Michael; Feng, Dagan
The availability of medical imaging data from clinical archives, research literature, and clinical manuals, coupled with recent advances in computer vision, offers the opportunity for image-based diagnosis, teaching, and biomedical research. However, the content and semantics of an image can vary depending on its modality, and as such the identification of image modality is an important preliminary step. The key challenge for automatically classifying the modality of a medical image lies in the visual characteristics of the different modalities: some are visually distinct while others may have only subtle differences. This challenge is compounded by variations in the appearance of images based on the diseases depicted and a lack of sufficient training data for some modalities. In this paper, we introduce a new method for classifying medical images that uses an ensemble of different convolutional neural network (CNN) architectures. CNNs are a state-of-the-art image classification technique that learns the optimal image features for a given classification task. We hypothesise that different CNN architectures learn different levels of semantic image representation, and thus an ensemble of CNNs will enable higher quality features to be extracted. Our method develops a new feature extractor by fine-tuning CNNs that have been initialized on a large dataset of natural images. The fine-tuning process leverages the generic image features from natural images that are fundamental for all images and optimizes them for the variety of medical imaging modalities. These features are used to train numerous multiclass classifiers whose posterior probabilities are fused to predict the modalities of unseen images. Our experiments on the ImageCLEF 2016 medical image public dataset (30 modalities; 6776 training images, and 4166 test images) show that our ensemble of fine-tuned CNNs achieves a higher accuracy than established CNNs. Our ensemble also achieves a higher accuracy than methods in
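The posterior-fusion step can be sketched as simple late fusion: average the class-probability matrices produced by the member networks and take the arg max per image. All numbers here are illustrative, not outputs of the paper's CNNs:

```python
import numpy as np

# Posterior probabilities from three hypothetical fine-tuned CNNs for two
# images over three modality classes (all values illustrative).
p_cnn1 = np.array([[0.7, 0.2, 0.1], [0.1, 0.5, 0.4]])
p_cnn2 = np.array([[0.6, 0.3, 0.1], [0.2, 0.3, 0.5]])
p_cnn3 = np.array([[0.5, 0.4, 0.1], [0.1, 0.2, 0.7]])

# Late fusion: average the posteriors, then take the arg max per image.
fused = (p_cnn1 + p_cnn2 + p_cnn3) / 3
modality = fused.argmax(axis=1)  # -> array([0, 2])
```

Averaging is only one fusion rule; weighted or product combinations are common alternatives when member networks differ in reliability.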
Chellasamy, M.; Ferré, T. P. A.; Humlekrog Greve, M.; Larsen, R.; Chinnasamy, U.
Change Detection (CD) methods based on post-classification comparison approaches are claimed to provide potentially reliable results. They are considered the most obvious quantitative method in the analysis of Land Use Land Cover (LULC) changes, as they provide 'from-to' change information. However, the performance of post-classification comparison approaches depends strongly on the accuracy of the classification of the individual images used for comparison. Hence, we present a classification approach that produces accurate classification results and thereby improves change detection. Machine learning is part of a broader framework in change detection, in which neural networks have drawn much attention. Neural network algorithms adaptively estimate continuous functions from input data without a mathematical representation of how the output depends on the input. A common practice for classification is to use a Multi-Layer Perceptron (MLP) neural network with the backpropagation learning algorithm for prediction. To increase the ability of learning and prediction, multiple inputs (spectral, texture, topography, and multi-temporal information) are generally stacked to incorporate diverse information. On the other hand, the literature reports that the backpropagation algorithm exhibits weak and unstable learning when multiple inputs are used with complex datasets characterized by mixed uncertainty levels. To address the problem of learning complex information, we propose an ensemble classification technique that incorporates multiple inputs for classification, unlike the traditional stacking of multiple input data. In this paper, we present an Endorsement Theory based ensemble classification that integrates multiple sources of information, in terms of prediction probabilities, to produce the final classification results. Three different input datasets are used in this study: spectral, texture and indices, from SPOT-4 multispectral imagery captured in 1998 and 2003. Each SPOT image is classified
Wong G William
Full Text Available Abstract Background Pancreatic cancer is the fourth leading cause of cancer death in the United States. Consequently, identification of clinically relevant biomarkers for the early detection of this cancer type is urgently needed. In recent years, proteomics profiling techniques combined with various data analysis methods have been successfully used to gain critical insights into processes and mechanisms underlying pathologic conditions, particularly as they relate to cancer. However, the high dimensionality of proteomics data combined with their relatively small sample sizes poses a significant challenge to current data mining methodology, where many of the standard methods cannot be applied directly. Here, we propose a novel machine learning framework in which decision-tree-based classifier ensembles, coupled with feature selection methods, are applied to proteomics data generated from premalignant pancreatic cancer. Results This study explores the utility of three different feature selection schemas (Student t test, Wilcoxon rank sum test and genetic algorithm) to reduce the high dimensionality of a pancreatic cancer proteomic dataset. Using the top features selected by each method, we compared the prediction performance of a single decision tree algorithm, C4.5, with six different decision-tree based classifier ensembles (Random forest, Stacked generalization, Bagging, Adaboost, Logitboost and Multiboost). We show that the ensemble classifiers always outperform the single decision tree classifier, giving greater accuracies and smaller prediction errors when applied to the pancreatic cancer proteomics dataset. Conclusion In our cross validation framework, classifier ensembles generally achieved better classification accuracies than a single decision tree when applied to the pancreatic cancer proteomic dataset, suggesting their utility in future proteomics data analysis. Additionally, the use of feature selection
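The t-test feature selection schema mentioned above amounts to ranking features by a two-sample t-statistic and keeping the top-scoring ones. A minimal sketch on synthetic data (group sizes, feature count, and the injected effect are all illustrative, not the paper's dataset):

```python
import numpy as np

def t_statistics(Xa, Xb):
    """Welch t-statistic per feature between two sample groups (rows = samples)."""
    ma, mb = Xa.mean(axis=0), Xb.mean(axis=0)
    va, vb = Xa.var(axis=0, ddof=1), Xb.var(axis=0, ddof=1)
    return (ma - mb) / np.sqrt(va / len(Xa) + vb / len(Xb))

rng = np.random.default_rng(1)
# 10 + 10 synthetic spectra with 50 features; only feature 7 is informative.
control = rng.normal(0.0, 1.0, size=(10, 50))
disease = rng.normal(0.0, 1.0, size=(10, 50))
disease[:, 7] += 4.0  # injected group difference

scores = np.abs(t_statistics(control, disease))
top_features = np.argsort(scores)[::-1][:5]  # keep the 5 highest-|t| features
```

The reduced feature matrix would then feed the tree ensembles compared in the study.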
Sørensen, Lauge; Nielsen, Mads
The International Challenge for Automated Prediction of MCI from MRI data offered independent, standardized comparison of machine learning algorithms for multi-class classification of normal control (NC), mild cognitive impairment (MCI), converting MCI (cMCI), and Alzheimer's disease (AD) using brain imaging and general cognition. We proposed to use an ensemble of support vector machines (SVMs) that combined bagging without replacement and feature selection. SVM is the most commonly used algorithm in multivariate classification of dementia, and it was therefore valuable to evaluate the potential benefit of ensembling this type of classifier. The ensemble SVM, using either a linear or a radial basis function (RBF) kernel, achieved multi-class classification accuracies of 55.6% and 55.0% in the challenge test set (60 NC, 60 MCI, 60 cMCI, 60 AD), resulting in a third place in the challenge. Similar feature subset sizes were obtained for both kernels, and the most frequently selected MRI features were the volumes of the two hippocampal subregions left presubiculum and right subiculum. Post-challenge analysis revealed that enforcing a minimum number of selected features and increasing the number of ensemble classifiers improved classification accuracy up to 59.1%. The ensemble SVM outperformed single SVM classifications consistently in the challenge test set. Ensemble methods using bagging and feature selection can improve the performance of the commonly applied SVM classifier in dementia classification. This resulted in competitive classification accuracies in the International Challenge for Automated Prediction of MCI from MRI data. Copyright © 2018 Elsevier B.V. All rights reserved.
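The combination of bagging without replacement and per-member feature selection can be sketched as follows. A nearest-centroid rule stands in for the SVM base learner (the real study used linear/RBF SVMs), and all data and sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic two-class data standing in for MRI-derived features (illustrative).
X = np.vstack([rng.normal(0.0, 1.0, size=(40, 20)),
               rng.normal(2.0, 1.0, size=(40, 20))])
y = np.array([0] * 40 + [1] * 40)

def train_member(X, y, rng, n_sub=50, n_feat=8):
    # Bagging *without* replacement plus a random feature subset per member.
    idx = rng.choice(len(X), size=n_sub, replace=False)
    feats = rng.choice(X.shape[1], size=n_feat, replace=False)
    Xs, ys = X[idx][:, feats], y[idx]
    # Nearest-centroid rule as a lightweight stand-in for the SVM base learner.
    centroids = np.stack([Xs[ys == c].mean(axis=0) for c in (0, 1)])
    return feats, centroids

def predict_member(member, X):
    feats, centroids = member
    diff = X[:, feats][:, None, :] - centroids[None, :, :]
    return (diff ** 2).sum(axis=2).argmin(axis=1)

members = [train_member(X, y, rng) for _ in range(11)]
votes = np.stack([predict_member(m, X) for m in members])
pred = (votes.mean(axis=0) > 0.5).astype(int)   # majority vote
accuracy = float((pred == y).mean())
```

An odd member count avoids voting ties in the two-class case; the post-challenge finding that more members helped is consistent with this variance-reduction view.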
Mkuzangwe, Nenekazi NP
Full Text Available This paper provides a performance bound of a network intrusion detection system (NIDS) that uses an ensemble of classifiers. Currently, researchers rely on implementing the ensemble-of-classifiers-based NIDS before they can determine its performance...
Park, Sang-Hoon; Lee, David; Lee, Sang-Goog
For the last few years, many feature extraction methods have been proposed based on biological signals. Among these, brain signals have the advantage that they can be obtained even from people with peripheral nervous system damage. Motor imagery electroencephalograms (EEG) are inexpensive to measure, offer a high temporal resolution, and are intuitive. Therefore, they have received a significant amount of attention in various fields, including signal processing, cognitive science, and medicine. The common spatial pattern (CSP) algorithm is a useful method for feature extraction from motor imagery EEG. However, performance degradation occurs in a small-sample setting (SSS), because the CSP depends on sample-based covariance. Since the active frequency range is different for each subject, it is also inconvenient to set the frequency range differently every time. In this paper, we propose a feature extraction method based on a filter bank to solve these problems. The proposed method consists of five steps. First, the motor imagery EEG is divided by using a filter bank. Second, the regularized CSP (R-CSP) is applied to the divided EEG. Third, we select the features according to mutual information based on the individual feature algorithm. Fourth, parameter sets are selected for the ensemble. Finally, we classify using an ensemble based on the selected features. The brain-computer interface competition III data set IVa is used to evaluate the performance of the proposed method. The proposed method improves the mean classification accuracy by 12.34%, 11.57%, 9%, 4.95%, and 4.47% compared with CSP, SR-CSP, R-CSP, filter bank CSP (FBCSP), and SR-FBCSP, respectively. Compared with the filter bank R-CSP ( , ), which is a parameter selection version of the proposed method, the classification accuracy is improved by 3.49%. In particular, the proposed method shows a large improvement in performance in the SSS.
DiFranco, Matthew D
We present a tile-based approach for producing clinically relevant probability maps of prostatic carcinoma in histological sections from radical prostatectomy. Our methodology incorporates ensemble learning for feature selection and classification on expert-annotated images. Random forest feature selection performed over varying training sets provides a subset of generalized CIEL*a*b* co-occurrence texture features, while sample selection strategies with minimal constraints reduce training data requirements to achieve reliable results. Ensembles of classifiers are built using expert-annotated tiles from training images, and scores for the probability of cancer presence are calculated from the responses of each classifier in the ensemble. Spatial filtering of tile-based texture features prior to classification results in increased heat-map coherence as well as AUC values of 95% using ensembles of either random forests or support vector machines. Our approach is designed for adaptation to different imaging modalities, image features, and histological decision domains.
Tan, C. M.; Lyon, R. J.; Stappers, B. W.; Cooper, S.; Hessels, J. W. T.; Kondratiev, V. I.; Michilli, D.; Sanidas, S.
One of the biggest challenges arising from modern large-scale pulsar surveys is the number of candidates generated. Here, we implemented several improvements to the machine learning (ML) classifier previously used by the LOFAR Tied-Array All-Sky Survey (LOTAAS) to look for new pulsars via filtering the candidates obtained during periodicity searches. To assist the ML algorithm, we have introduced new features which capture the frequency and time evolution of the signal and improved the signal-to-noise calculation accounting for broad profiles. We enhanced the ML classifier by including a third class characterizing RFI instances, allowing candidates arising from RFI to be isolated, reducing the false positive return rate. We also introduced a new training data set used by the ML algorithm that includes a large sample of pulsars misclassified by the previous classifier. Lastly, we developed an ensemble classifier comprised of five different Decision Trees. Taken together, these updates improve the pulsar recall rate by 2.5 per cent, while also improving the ability to identify pulsars with wide pulse profiles, often misclassified by the previous classifier. The new ensemble classifier is also able to reduce the percentage of false positive candidates identified from each LOTAAS pointing from 2.5 per cent (~500 candidates) to 1.1 per cent (~220 candidates).
Kukunda, Collins B.; Duque-Lazo, Joaquín; González-Ferreiro, Eduardo; Thaden, Hauke; Kleinn, Christoph
Distinguishing tree species is relevant in many contexts of remote sensing assisted forest inventory. Accurate tree species maps support management and conservation planning, pest and disease control and biomass estimation. This study evaluated the performance of applying ensemble techniques with the goal of automatically distinguishing Pinus sylvestris L. and Pinus uncinata Mill. ex Mirb. within a 1.3 km2 mountainous area in Barcelonnette (France). Three modelling schemes were examined, based on: (1) high-density LiDAR data (160 returns m-2), (2) Worldview-2 multispectral imagery, and (3) Worldview-2 and LiDAR in combination. Variables related to the crown structure and height of individual trees were extracted from the normalized LiDAR point cloud at individual-tree level, after performing individual tree crown (ITC) delineation. Vegetation indices and the Haralick texture indices were derived from Worldview-2 images and served as independent spectral variables. Selection of the best predictor subset was done after a comparison of three variable selection procedures: (1) Random Forests with cross validation (AUCRFcv), (2) Akaike Information Criterion (AIC) and (3) Bayesian Information Criterion (BIC). To classify the species, 9 regression techniques were combined using ensemble models. Predictions were evaluated using cross validation and an independent dataset. Integration of datasets and models improved individual tree species classification (True Skills Statistic, TSS; from 0.67 to 0.81) over individual techniques and maintained strong predictive power (Relative Operating Characteristic, ROC = 0.91). Assemblage of regression models and integration of the datasets provided more reliable species distribution maps and associated tree-scale mapping uncertainties. Our study highlights the potential of model and data assemblage at improving species classifications needed in present-day forest planning and management.
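The True Skill Statistic used to evaluate the species classification above is sensitivity plus specificity minus one, computed from a 2x2 confusion matrix. A minimal sketch with hypothetical counts (the numbers are not from the study):

```python
def true_skill_statistic(tp, fn, fp, tn):
    """TSS = sensitivity + specificity - 1, from a 2x2 confusion matrix."""
    sensitivity = tp / (tp + fn)   # recall on the positive species
    specificity = tn / (tn + fp)   # recall on the negative species
    return sensitivity + specificity - 1.0

# Hypothetical counts for P. sylvestris (positive) vs. P. uncinata (negative).
tss = true_skill_statistic(tp=45, fn=5, fp=8, tn=42)  # 0.90 + 0.84 - 1 = 0.74
```

Unlike raw accuracy, TSS is insensitive to class prevalence, which makes it a common choice for comparing species-classification models.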
G KISHOR KUMAR
large volumes of data by applying data analysis and discovery algorithms [1, 2]. Classification, a major data mining functionality, is a supervised learning method where the example set called the training set is used to classify the given query data item into one of the predefined classes, where a classifier derived from the ...
Full Text Available Image segmentation is the foundation of computer vision applications. In this paper, we propose a new cluster ensemble-based image segmentation algorithm, which overcomes several problems of traditional methods. We make two main contributions in this paper. First, we introduce the cluster ensemble concept to fuse the segmentation results from different types of visual features effectively, which can deliver a better final result and achieve a much more stable performance for broad categories of images. Second, we exploit the PageRank idea from Internet applications and apply it to the image segmentation task. This can improve the final segmentation results by combining the spatial information of the image and the semantic similarity of regions. Our experiments on four public image databases validate the superiority of our algorithm over conventional single type of feature or multiple types of features-based algorithms, since our algorithm can fuse multiple types of features effectively for better segmentation results. Moreover, our method is also proved to be very competitive in comparison with other state-of-the-art segmentation algorithms.
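The PageRank idea borrowed from Internet applications can be sketched by power iteration over a region-similarity graph. The four-node graph and its weights are illustrative, not derived from actual image regions:

```python
import numpy as np

def pagerank(adj, damping=0.85, tol=1e-10):
    """Power-iteration PageRank over a similarity graph (rows = out-links)."""
    n = len(adj)
    out = adj.sum(axis=1, keepdims=True)
    # Row-normalize; dangling rows fall back to a uniform distribution.
    M = np.divide(adj, out, out=np.full_like(adj, 1.0 / n), where=out > 0).T
    r = np.full(n, 1.0 / n)
    while True:
        r_new = (1 - damping) / n + damping * M @ r
        if np.abs(r_new - r).sum() < tol:
            return r_new
        r = r_new

# Toy similarity graph over four image regions (weights illustrative).
adj = np.array([[0, 1, 1, 0],
                [1, 0, 1, 0],
                [1, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)
ranks = pagerank(adj)  # region 2, the best-connected node, ranks highest
```

In a segmentation setting, edge weights would encode the semantic similarity of adjacent regions, so high-rank regions act as anchors when merging the ensemble's partitions.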
Benkeser, David; Ju, Cheng; Lendle, Sam; van der Laan, Mark
Online estimators update a current estimate with a new incoming batch of data without having to revisit past data, thereby providing streaming estimates that are scalable to big data. We develop flexible, ensemble-based online estimators of an infinite-dimensional target parameter, such as a regression function, in the setting where data are generated sequentially by a common conditional data distribution given summary measures of the past. This setting encompasses a wide range of time-series models and, as a special case, models for independent and identically distributed data. Our estimator considers a large library of candidate online estimators and uses online cross-validation to identify the algorithm with the best performance. We show that by basing estimates on the cross-validation-selected algorithm, we are asymptotically guaranteed to perform as well as the true, unknown best-performing algorithm. We provide extensions of this approach including online estimation of the optimal ensemble of candidate online estimators. We illustrate excellent performance of our methods using simulations and a real data example where we make streaming predictions of infectious disease incidence using data from a large database. Copyright © 2017 John Wiley & Sons, Ltd.
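The online cross-validation selector can be sketched as follows: each candidate is scored on every incoming batch before it is allowed to update on that batch, and the candidate with the lowest cumulative loss is selected. The two toy candidates and the data stream are illustrative stand-ins for a real library of online learners:

```python
import numpy as np

rng = np.random.default_rng(0)
# A stream of batches from a stationary series (stand-in for incidence data).
stream = [rng.normal(5.0, 1.0, size=20) for _ in range(30)]

class RunningMean:
    def __init__(self):
        self.s, self.n = 0.0, 0
    def predict(self):
        return self.s / self.n if self.n else 0.0
    def update(self, batch):
        self.s += float(batch.sum())
        self.n += batch.size

class LastValue:
    def __init__(self):
        self.last = 0.0
    def predict(self):
        return self.last
    def update(self, batch):
        self.last = float(batch[-1])

candidates = {"running_mean": RunningMean(), "last_value": LastValue()}
cv_loss = {name: 0.0 for name in candidates}

for batch in stream:
    # Online CV: score each candidate on the new batch *before* updating it,
    # so every batch serves as out-of-sample validation data.
    for name, est in candidates.items():
        cv_loss[name] += float(((batch - est.predict()) ** 2).sum())
        est.update(batch)

best = min(cv_loss, key=cv_loss.get)  # the discrete selector over the library
```

The paper's extension replaces this discrete pick with an optimal convex combination of the candidates.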
Full Text Available In this paper, we propose a supervised classification algorithm for Polarimetric Synthetic Aperture Radar (PolSAR) images using multiple-feature fusion and ensemble learning. First, we extract different polarimetric features, including the extended polarimetric feature space, Hoekman, Huynen, H/alpha/A, and four-component scattering features of PolSAR images. Next, we randomly select two types of features each time from all feature sets to guarantee the reliability and diversity of the later ensembles, and use a support vector machine as the basic classifier for predicting classification results. Finally, we concatenate all prediction probabilities of the basic classifiers as the final feature representation and employ the random forest method to obtain the final classification results. Experimental results at the pixel and region levels show the effectiveness of the proposed algorithm.
Man, Jun [Zhejiang Provincial Key Laboratory of Agricultural Resources and Environment, Institute of Soil and Water Resources and Environmental Science, College of Environmental and Resource Sciences, Zhejiang University, Hangzhou China; Zhang, Jiangjiang [Zhejiang Provincial Key Laboratory of Agricultural Resources and Environment, Institute of Soil and Water Resources and Environmental Science, College of Environmental and Resource Sciences, Zhejiang University, Hangzhou China; Li, Weixuan [Pacific Northwest National Laboratory, Richland Washington USA; Zeng, Lingzao [Zhejiang Provincial Key Laboratory of Agricultural Resources and Environment, Institute of Soil and Water Resources and Environmental Science, College of Environmental and Resource Sciences, Zhejiang University, Hangzhou China; Wu, Laosheng [Department of Environmental Sciences, University of California, Riverside California USA
The ensemble Kalman filter (EnKF) has been widely used in parameter estimation for hydrological models. The focus of most previous studies was to develop more efficient analysis (estimation) algorithms. On the other hand, it is intuitively understandable that a well-designed sampling (data-collection) strategy should provide more informative measurements and subsequently improve the parameter estimation. In this work, a Sequential Ensemble-based Optimal Design (SEOD) method, coupled with EnKF, information theory and sequential optimal design, is proposed to improve the performance of parameter estimation. Based on the first-order and second-order statistics, different information metrics including the Shannon entropy difference (SD), degrees of freedom for signal (DFS) and relative entropy (RE) are used to design the optimal sampling strategy, respectively. The effectiveness of the proposed method is illustrated by synthetic one-dimensional and two-dimensional unsaturated flow case studies. It is shown that the designed sampling strategies can provide more accurate parameter estimation and state prediction compared with conventional sampling strategies. Optimal sampling designs based on various information metrics perform similarly in our cases. The effect of ensemble size on the optimal design is also investigated. Overall, larger ensemble size improves the parameter estimation and convergence of optimal sampling strategy. Although the proposed method is applied to unsaturated flow problems in this study, it can be equally applied in any other hydrological problems.
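The EnKF analysis step underlying the approach above can be sketched for a scalar parameter. This is the common stochastic (perturbed-observation) variant, not necessarily the exact scheme of the study, and the prior, observation, and noise level are illustrative:

```python
import numpy as np

def enkf_update(ensemble, obs, obs_op, obs_var, rng):
    """One stochastic (perturbed-observation) EnKF analysis step.
    ensemble: (n_ens, n_par) array of parameter vectors."""
    n_ens = len(ensemble)
    y_pred = ensemble @ obs_op.T                       # predicted observations
    X = ensemble - ensemble.mean(axis=0)               # parameter deviations
    Y = y_pred - y_pred.mean(axis=0)                   # observation deviations
    C_xy = X.T @ Y / (n_ens - 1)                       # cross-covariance
    C_yy = Y.T @ Y / (n_ens - 1) + obs_var * np.eye(len(obs))
    K = C_xy @ np.linalg.inv(C_yy)                     # Kalman gain
    perturbed = obs + rng.normal(0.0, np.sqrt(obs_var), size=(n_ens, len(obs)))
    return ensemble + (perturbed - y_pred) @ K.T

rng = np.random.default_rng(0)
prior = rng.normal(0.0, 1.0, size=(500, 1))   # prior ensemble of one parameter
H = np.array([[1.0]])                          # direct observation of the parameter
posterior = enkf_update(prior, np.array([2.0]), H, obs_var=0.25, rng=rng)
```

With a N(0, 1) prior and an observation of 2.0 with variance 0.25, the posterior mean should land near the analytic value of 1.6, with a visibly reduced spread; the SEOD contribution is then to choose *where* such observations are collected.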
Full Text Available Photogrammetric UAVs are seeing a surge in use for high-resolution mapping, but their use to map terrain under dense vegetation cover remains challenging due to a lack of exposed ground surfaces. This paper presents a novel object-oriented classification ensemble algorithm to leverage height, texture and contextual information of UAV data to improve landscape classification and terrain estimation. Its implementation incorporates multiple heuristics, such as multi-input machine learning-based classification, object-oriented ensemble, and integration of UAV and GPS surveys for terrain correction. Experiments based on a densely vegetated wetland restoration site showed classification improvement from 83.98% to 96.12% in overall accuracy and from 0.7806 to 0.947 in kappa value. Use of standard and existing UAV terrain mapping algorithms and software produced a reliable digital terrain model only over exposed bare grounds (mean error = −0.019 m and RMSE = 0.035 m) but severely overestimated the terrain by ~80% of mean vegetation height in vegetated areas. The terrain correction method successfully reduced the mean error from 0.302 m to −0.002 m (RMSE from 0.342 m to 0.177 m) in low vegetation and from 1.305 m to 0.057 m (RMSE from 1.399 m to 0.550 m) in tall vegetation. Overall, this research validated a feasible solution to integrate UAV and RTK GPS for terrain mapping in densely vegetated environments.
Full Text Available This paper presents a review of ensemble-based data assimilation for strongly nonlinear problems in the characterization of heterogeneous reservoirs with different production histories. It concentrates on the ensemble Kalman filter (EnKF) and ensemble smoother (ES) as representative frameworks, discusses their pros and cons, and investigates recent progress to overcome their drawbacks. The typical weaknesses of ensemble-based methods are non-Gaussian parameters, improper prior ensembles and finite population size. Three categories of approaches to mitigate these limitations are reviewed with recent accomplishments: improvement of Kalman gains, add-on of transformation functions, and independent evaluation of observed data. The data assimilation in heterogeneous reservoirs, applying the improved ensemble methods, is discussed with respect to predicting unknown dynamic data in reservoir characterization.
Nielsen, Andreas Brinch; Hansen, Lars Kai; Kjems, U
A sound classification model is presented that can classify signals into music, noise and speech. The model extracts the pitch of the signal using the harmonic product spectrum. Based on the pitch estimate and a pitch error measure, features are created and used in a probabilistic model with a soft-max output function. Both linear and quadratic inputs are used. The model is trained on 2 hours of sound and tested on publicly available data. A test classification error below 0.05 with 1 s classification windows is achieved. Furthermore, it is shown that linear input performs as well as a quadratic...
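The harmonic product spectrum step can be sketched as follows: multiply the magnitude spectrum with its downsampled copies so that energy at the fundamental and its harmonics reinforces a single peak. The synthetic tone, sample rate, and harmonic count are illustrative, not the paper's data:

```python
import numpy as np

def hps_pitch(signal, fs, n_harmonics=4):
    """Estimate the fundamental frequency via the harmonic product spectrum:
    multiply the magnitude spectrum with its downsampled copies and pick
    the resulting peak."""
    spectrum = np.abs(np.fft.rfft(signal))
    hps = spectrum.copy()
    for h in range(2, n_harmonics + 1):
        hps[: len(spectrum) // h] *= spectrum[::h][: len(spectrum) // h]
    hps[:2] = 0.0  # suppress the DC region
    return np.fft.rfftfreq(len(signal), 1.0 / fs)[hps.argmax()]

fs = 8000
t = np.arange(fs) / fs  # one second of samples
# Synthetic 200 Hz tone with decaying harmonics (no real audio assumed).
sig = sum(np.sin(2 * np.pi * 200 * h * t) / h for h in range(1, 5))
pitch = hps_pitch(sig, fs)  # ≈ 200 Hz
```

The abstract's pitch error measure could then compare the peak's strength against its competitors to flag unreliable (noise-like) frames.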
Oong, Tatt Hee; Isa, Nor Ashidi Mat
This brief presents a new ordering algorithm for data presentation of fuzzy ARTMAP (FAM) ensembles. The proposed ordering algorithm manipulates the presentation order of the training data for each member of a FAM ensemble such that the categories created in each ensemble member are biased toward the vector of the chosen input feature. Diversity is created by varying the training presentation order based on the ascending order of the values from the most uncorrelated input features. Analysis shows that the categories created in two FAMs are compulsively diverse when the chosen input features used to determine the presentation order of the training data are uncorrelated. The proposed ordering algorithm was tested on 10 classification benchmark problems from the University of California, Irvine, machine learning repository and a cervical cancer problem as a case study. The experimental results show that the proposed method can produce a diverse, yet well generalized, FAM ensemble.
Ehrendorfer, M. [Dept. of Meteorology and Geophysics, The Univ. of Reading (United Kingdom)
Ensemble-based data assimilation methods related to the fundamental theory of Kalman filtering have been explored in a variety of mostly non-operational data assimilation contexts over the past decade with increasing intensity. While promising properties have been reported, a number of issues that arise in the development and application of ensemble-based data assimilation techniques, such as in the basic form of the ensemble Kalman filter (EnKF), still deserve particular attention. The necessity of employing an ensemble of small size represents a fundamental issue which in turn leads to several related points that must be carefully considered. In particular, the need to correct for sampling noise in the covariance structure estimated from the finite ensemble must be mentioned. Covariance inflation, localization through a Schur/Hadamard product, preventing the occurrence of filter divergence and inbreeding, as well as the loss of dynamical balances, are all issues directly related to the use of small ensemble sizes. Attempts to reduce effectively the sampling error due to small ensembles and at the same time maintaining an ensemble spread that realistically describes error structures have given rise to the development of variants of the basic form of the EnKF. These include, for example, the Ensemble Adjustment Kalman Filter (EAKF), the Ensemble Transform Kalman Filter (ETKF), the Ensemble Square-Root Filter (EnSRF), and the Local Ensemble Kalman Filter (LEKF). Further important considerations within ensemble-based Kalman filtering concern issues such as the treatment of model error, stochastic versus deterministic updating algorithms, the ease of implementation and computational cost, serial processing of observations, avoiding the appearance of undesired dynamic imbalances, and the treatment of non-Gaussianity and nonlinearity. The discussion of the above issues within ensemble-based Kalman filtering forms the central topic of this article, which starts out with a
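The two small-ensemble corrections named above, covariance inflation and Schur-product localization, can be sketched as follows. A Gaussian taper stands in for the Gaspari-Cohn function common in practice, and the ensemble size, state dimension, and length scale are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n_ens, n = 10, 40
ensemble = rng.normal(size=(n_ens, n))  # small ensemble of state vectors

# Multiplicative covariance inflation: scale deviations from the ensemble mean
# to counteract the spread lost to sampling error and inbreeding.
inflation = 1.1
mean = ensemble.mean(axis=0)
inflated = mean + inflation * (ensemble - mean)

# Localization: taper the sampled covariance with a distance-based correlation
# function through a Schur (element-wise) product. A Gaussian taper stands in
# for the compactly supported Gaspari-Cohn function used in practice.
P = np.cov(inflated, rowvar=False)
dist = np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
L = 5.0  # localization length scale (illustrative)
taper = np.exp(-(dist / L) ** 2)
P_loc = P * taper  # spurious long-range covariances are damped
```

The taper leaves variances (the diagonal) untouched while shrinking distant covariances toward zero, which is exactly the sampling noise a 10-member ensemble cannot resolve.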
Fonseca, R.M.; Stordahl, A.; Leeuwenburgh, O.; Van den Hof, P.M.J.; Jansen, J.D.
We consider robust ensemble-based multi-objective optimization using a hierarchical switching algorithm for combined long-term and short-term water flooding optimization. We apply a modified formulation of the ensemble gradient, which results in improved performance compared to earlier formulations.
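A minimal sketch of the general ensemble-gradient idea (not the authors' exact modified formulation): perturb the control variable with an ensemble, evaluate the objective for each member, and estimate the gradient by regressing objective values on the control perturbations. The objective function here is invented for illustration:

```python
import random
from statistics import mean

random.seed(0)

def objective(x):
    # Hypothetical objective; its true gradient everywhere is 3.0
    return 3.0 * x + 1.0

x0 = 2.0
# Control ensemble: small Gaussian perturbations around the current control
xs = [x0 + random.gauss(0.0, 0.1) for _ in range(30)]
js = [objective(x) for x in xs]

# Regression-based ensemble gradient estimate: cov(x, j) / var(x)
mx, mj = mean(xs), mean(js)
grad = (sum((x - mx) * (j - mj) for x, j in zip(xs, js))
        / sum((x - mx) ** 2 for x in xs))
print(grad)  # close to 3.0
```

For a linear objective the regression recovers the gradient exactly; for the nonlinear reservoir objectives in the paper it yields a stochastic approximation whose quality depends on ensemble size and perturbation variance.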
Peterson, Leif E; Coleman, Matthew A
Random Spherical Linear Oracles (RSLO) for DNA microarray gene expression data are proposed for classifier fusion. RSLO employs random hyperplane splits of samples in the principal component score space based on the first three principal components (X, Y, Z) of the input feature set. Hyperplane splits are used to assign training (testing) samples to separate logistic regression mini-classifiers, which increases the diversity of voting results since errors are not shared across mini-classifiers. We recommend using RSLO with 3-4 repetitions of 10-fold cross-validation (CV), re-partitioning the samples randomly every ten iterations prior to each 10-fold CV; this equates to a total of 30-40 iterations.
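The core splitting step can be sketched as follows, with invented principal-component scores: each sample is routed to one of two mini-classifiers according to the side of a random hyperplane (through the origin of the 3-D score space) on which it falls:

```python
import random

random.seed(1)

# Hypothetical PCA scores (X, Y, Z) for a handful of samples
scores = [(0.5, -1.2, 0.3), (-0.7, 0.4, 1.1), (2.0, 0.1, -0.5),
          (-1.5, -0.3, 0.2), (0.9, 1.8, -1.0)]

# A random hyperplane through the origin of the score space,
# defined by its normal vector w
w = [random.gauss(0.0, 1.0) for _ in range(3)]

def side(s):
    # The sign of the projection decides which mini-classifier gets the sample
    return 0 if sum(wi * si for wi, si in zip(w, s)) >= 0.0 else 1

groups = {0: [], 1: []}
for s in scores:
    groups[side(s)].append(s)
print(len(groups[0]), len(groups[1]))
```

Because each mini-classifier sees a disjoint subset, their errors are uncorrelated, which is the source of the voting diversity the abstract describes.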
Sert, Egemen; Ertekin, Seyda; Halici, Ugur
Human-level recall performance in detecting breast cancer from microcalcifications in mammograms lies between 74.5% and 92.3%. In this research, we approach the breast microcalcification classification problem using convolutional neural networks along with various preprocessing methods, such as contrast scaling, dilation and cropping, and decision fusion using an ensemble of networks. Various experiments on the Digital Database for Screening Mammography dataset showed that preprocessing has a great impact on classification performance. The stand-alone models using the dilation and cropping preprocessing techniques achieved the highest recall value of 91.3%. The ensembles of the stand-alone models surpass this value, achieving a recall of 97.3%. The ensemble with the highest F1 score (the harmonic mean of precision and recall), 94.5%, has a recall of 94.0% and a precision of 95.0%. This recall is still above human-level performance, and the models achieve competitive results in terms of accuracy, precision, recall and F1 score.
Full Text Available Ensemble data mining methods, also known as classifier combination, are often used to improve the performance of classification. Various classifier combination methods such as bagging, boosting, and random forest have been devised and have received considerable attention in the past. However, data dimensionality is increasing rapidly, and such methods are not directly applicable to high-dimensional datasets. In this paper, we propose an ensemble method for the classification of high-dimensional data, with each classifier constructed from a different set of features determined by a partitioning of redundant features. In our method, the redundancy of features is considered in dividing the original feature space. Then, each generated feature subset is trained by a support vector machine, and the results of the individual classifiers are combined by majority voting. The efficiency and effectiveness of our method are demonstrated through comparisons with other ensemble techniques, and the results show that our method outperforms the others.
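The scheme can be sketched as below. For brevity, nearest-centroid mini-classifiers stand in for the SVMs, the features are partitioned by index rather than by redundancy, and the data are synthetic:

```python
import random
from statistics import mean
from collections import Counter

random.seed(3)

# Toy data: class 0 clusters near 0, class 1 near 1, in a 6-dimensional space
def sample(label):
    return [label + random.gauss(0.0, 0.2) for _ in range(6)], label

train = [sample(l) for l in (0, 1) for _ in range(20)]

# Partition the feature space into disjoint subsets (here by index;
# the paper partitions by feature redundancy instead)
subsets = [(0, 1), (2, 3), (4, 5)]

def centroid_classifier(subset):
    # Nearest-centroid classifier trained only on this feature subset
    cent = {}
    for label in (0, 1):
        rows = [[x[i] for i in subset] for x, l in train if l == label]
        cent[label] = [mean(col) for col in zip(*rows)]
    def predict(x):
        proj = [x[i] for i in subset]
        return min((sum((a - b) ** 2 for a, b in zip(proj, cent[l])), l)
                   for l in (0, 1))[1]
    return predict

classifiers = [centroid_classifier(s) for s in subsets]

def ensemble_predict(x):
    # Combine the subset classifiers by majority voting
    votes = Counter(clf(x) for clf in classifiers)
    return votes.most_common(1)[0][0]

x_new, y_new = sample(1)
print(ensemble_predict(x_new))
```

Each base classifier only ever sees its own feature subset, so the ensemble scales to high-dimensional inputs by construction.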
Full Text Available Tool condition monitoring (TCM) plays an important role in improving machining efficiency and guaranteeing workpiece quality. In order to realize reliable recognition of the tool condition, a robust classifier needs to be constructed to depict the relationship between tool wear states and sensory information. However, because of the complexity of the machining process and the uncertainty of the tool wear evolution, it is hard for a single classifier to fit all the collected samples without sacrificing generalization ability. In this paper, heterogeneous ensemble learning is proposed to realize tool condition monitoring, in which the support vector machine (SVM), hidden Markov model (HMM) and radial basis function (RBF) are selected as base classifiers and a stacking ensemble strategy is further used to reflect the relationship between the outputs of these base classifiers and the tool wear states. Based on the heterogeneous ensemble learning classifier, an online monitoring system is constructed in which harmonic features are extracted from force signals and a minimal redundancy and maximal relevance (mRMR) algorithm is utilized to select the most prominent features. To verify the effectiveness of the proposed method, a titanium alloy milling experiment was carried out and samples with different tool wear states were collected to build the proposed heterogeneous ensemble learning classifier. Moreover, a homogeneous ensemble learning model and a majority voting strategy were also adopted for comparison. The analysis and comparison results show that the proposed heterogeneous ensemble learning classifier performs better in both classification accuracy and stability.
Jaspart, J.P.; Wald, F.; Weynand, K.; Gresnigt, A.M.
The influence of the rotational characteristics of the column bases on the structural frame response is discussed and specific design criteria for stiffness classification into semi-rigid and rigid joints are derived. The particular case of an industrial portal frame is then considered.
Nizamani, Sarwat; Memon, Nasrullah; Wiil, Uffe Kock
We propose a cluster-based classification model for suspicious email detection and other text classification tasks. The text classification tasks comprise many training examples that require a complex classification model. Using clusters for classification makes the model simpler and increases ... the classifier is trained on each cluster, which has reduced dimensionality and fewer examples. The experimental results show that the proposed model outperforms the existing classification models for the task of suspicious email detection and topic categorization on the Reuters-21578 and 20 Newsgroups ... datasets. Our model also outperforms the A Decision Cluster Classification (ADCC) and the Decision Cluster Forest Classification (DCFC) models on the Reuters-21578 dataset.
Lo, Siaw Ling; Chiong, Raymond; Cornforth, David
The vast amount and diversity of the content shared on social media can pose a challenge for any business wanting to use it to identify potential customers. In this paper, our aim is to investigate the use of both unsupervised and supervised learning methods for target audience classification on Twitter with minimal annotation efforts. Topic domains were automatically discovered from contents shared by followers of an account owner using Twitter Latent Dirichlet Allocation (LDA). A Support Vector Machine (SVM) ensemble was then trained using contents from different account owners of the various topic domains identified by Twitter LDA. Experimental results show that the methods presented are able to successfully identify a target audience with high accuracy. In addition, we show that using a statistical inference approach such as bootstrapping in over-sampling, instead of using random sampling, to construct training datasets can achieve a better classifier in an SVM ensemble. We conclude that such an ensemble system can take advantage of data diversity, which enables real-world applications for differentiating prospective customers from the general audience, leading to business advantage in the crowded social media space.
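The over-sampling idea can be sketched as follows, with synthetic labels standing in for labeled tweets: bootstrapping, i.e. sampling the minority class with replacement, is used to construct a balanced training set for each member of the SVM ensemble:

```python
import random
from collections import Counter

random.seed(7)

# Hypothetical imbalanced labels: few "target" (audience) positives
labels = ["target"] * 10 + ["general"] * 90
data = list(range(len(labels)))  # integer stand-ins for tweet records

positives = [d for d, l in zip(data, labels) if l == "target"]
negatives = [d for d, l in zip(data, labels) if l == "general"]

# Bootstrap the minority class (sample with replacement) up to the
# majority-class size to build a balanced training set
boot_pos = random.choices(positives, k=len(negatives))
balanced = ([(d, "target") for d in boot_pos]
            + [(d, "general") for d in negatives])
random.shuffle(balanced)

print(Counter(l for _, l in balanced))
```

Repeating this draw with different seeds yields diverse balanced training sets, one per base classifier in the ensemble.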
PANIGRAHY, P. S.
Full Text Available A data-driven approach for multi-class fault diagnosis of induction motors using MCSA at steady-state condition is a complex pattern classification problem. This investigation exploits the built-in ensemble process of non-iterative classifiers to resolve the most challenging issues in this area, including bearing and stator fault detection. Non-iterative techniques exhibit an average 15% increase in fault classification accuracy over their iterative counterparts. In particular, random forest (RF) showed outstanding performance even with a small number of training samples and a noisy feature space because of its distributive feature model. The robustness of the results, backed by experimental verification, shows that non-iterative individual classifiers like RF are the optimum choice in the area of automatic fault diagnosis of induction motors.
Xiao, Yawen; Wu, Jun; Lin, Zongli; Zhao, Xiaodong
Cancer is a complex worldwide health problem associated with high mortality. With the rapid development of the high-throughput sequencing technology and the application of various machine learning methods that have emerged in recent years, progress in cancer prediction has been increasingly made based on gene expression, providing insight into effective and accurate treatment decision making. Thus, developing machine learning methods, which can successfully distinguish cancer patients from healthy persons, is of great current interest. However, among the classification methods applied to cancer prediction so far, no one method outperforms all the others. In this paper, we demonstrate a new strategy, which applies deep learning to an ensemble approach that incorporates multiple different machine learning models. We supply informative gene data selected by differential gene expression analysis to five different classification models. Then, a deep learning method is employed to ensemble the outputs of the five classifiers. The proposed deep learning-based multi-model ensemble method was tested on three public RNA-seq data sets of three kinds of cancers, Lung Adenocarcinoma, Stomach Adenocarcinoma and Breast Invasive Carcinoma. The test results indicate that it increases the prediction accuracy of cancer for all the tested RNA-seq data sets as compared to using a single classifier or the majority voting algorithm. By taking full advantage of different classifiers, the proposed deep learning-based multi-model ensemble method is shown to be accurate and effective for cancer prediction. Copyright © 2017 Elsevier B.V. All rights reserved.
Tuarob, Suppawong; Tucker, Conrad S; Salathe, Marcel; Ram, Nilam
The role of social media as a source of timely and massive information has become more apparent since the era of Web 2.0. Multiple studies have illustrated the use of information in social media to discover biomedical and health-related knowledge. Most methods proposed in the literature employ traditional document classification techniques that represent a document as a bag of words. These techniques work well when documents are rich in text and conform to standard English; however, they are not optimal for social media data, where sparsity and noise are the norm. This paper aims to address the limitations posed by the traditional bag-of-words based methods, and proposes to use heterogeneous features in combination with ensemble machine learning techniques to discover health-related information, which could prove useful to multiple biomedical applications, especially those needing to discover health-related knowledge in large-scale social media data. Furthermore, the proposed methodology could be generalized to discover different types of information in various kinds of textual data. Social media data are characterized by an abundance of short social-oriented messages that do not conform to standard languages, either grammatically or syntactically. The problem of discovering health-related knowledge in social media data streams is then transformed into a text classification problem, where a text is identified as positive if it is health-related and negative otherwise. We first identify the limitations of the traditional methods, which train machines with N-gram word features, then propose to overcome these limitations by utilizing a collaboration of machine learning based classifiers, each of which is trained to learn a semantically different aspect of the data. The parameter analysis for tuning each classifier is also reported. Three data sets are used in this research. The first data set comprises approximately 5000 hand-labeled tweets, and is used for cross validation of the
Zhang, Jianhua; Yin, Zhong; Wang, Rubin
The identification of temporal variations in human operator cognitive task-load (CTL) is crucial for preventing possible accidents in human-machine collaborative systems. Recent literature has shown that the change of discrete CTL level during human-machine system operations can be objectively recognized using neurophysiological data and supervised learning techniques. The objective of this work is to design a subject-specific multi-class CTL classifier to reveal the complex unknown relationship between the operator's task performance and neurophysiological features by combining target class labeling, physiological feature reduction and selection, and ensemble classification techniques. The psychophysiological data acquisition experiments were performed under multiple human-machine process control tasks. Four or five target classes of CTL were determined by using a Gaussian mixture model and three human performance variables. By using a Laplacian eigenmap, a few salient EEG features were extracted and, together with heart rates, used as the input features of the CTL classifier. Then, multiple support vector machines were aggregated via majority voting to create an ensemble classifier for recognizing the CTL classes. Finally, the obtained CTL classification results were compared with those of several existing methods. The results showed that the proposed methods are capable of deriving a reasonable number of target classes and low-dimensional optimal EEG features for individual human operator subjects.
Choi, Jae Young; Kim, Dae Hoe; Ro, Yong Man; Plataniotis, Konstantinos N
We propose a novel computer-aided detection (CAD) framework for breast masses in mammography. To increase detection sensitivity for various types of mammographic masses, we propose the combined use of different detection algorithms. In particular, we develop a region-of-interest combination mechanism that integrates detection information gained from unsupervised and supervised detection algorithms. Also, to significantly reduce the number of false-positive (FP) detections, a new ensemble classification algorithm is developed. Extensive experiments have been conducted on a benchmark mammogram database. Results show that our combined detection approach can considerably improve detection sensitivity with a small loss in FP rate, compared to representative detection algorithms previously developed for mammographic CAD systems. The proposed ensemble classification solution also has a dramatic impact on the reduction of FP detections: as much as 70% (from 15 to 4.5 per image) at a cost of only 4.6% sensitivity loss (from 90.0% to 85.4%). Moreover, our proposed CAD method performs as well as or better than mammography CAD algorithms previously reported in the literature (70.7% and 80.0% at 1.5 and 3.5 FPs per image, respectively).
Full Text Available TIGGE (THORPEX International Grand Global Ensemble) was a major part of THORPEX (The Observing System Research and Predictability Experiment). It integrates ensemble precipitation products from all the major forecast centers in the world and provides systematic evaluation of the multimodel ensemble prediction system. Development of a meteorologic-hydrologic coupled flood forecasting and early warning model based on the TIGGE precipitation ensemble forecast can provide probabilistic flood forecasts, extend the lead time of the flood forecast, and gain more time for decision-makers to make the right decision. In this study, precipitation ensemble forecast products from ECMWF, NCEP, and CMA are used to drive the distributed hydrologic model TOPX. We focus on the Yi River catchment and aim to build a flood forecast and early warning system. The results show that the meteorologic-hydrologic coupled model can satisfactorily predict the flow process of four flood events. The predicted occurrence times of peak discharges are close to the observations. However, the magnitudes of the peak discharges differ significantly due to the varying performance of the ensemble prediction systems. The coupled forecasting model can accurately predict the occurrence of the peak time and the corresponding risk probability of peak discharge based on the probability distribution of peak time and flood warning, which provides users a strong theoretical foundation and valuable information for a promising new approach.
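In its simplest form, the probability forecast described here reduces to the fraction of ensemble members exceeding a warning threshold; the discharge values and threshold below are invented for illustration:

```python
# Hypothetical ensemble of peak-discharge forecasts (m^3/s) for one event
members = [820.0, 640.0, 910.0, 700.0, 1050.0,
           760.0, 880.0, 590.0, 930.0, 810.0]
warning_level = 800.0  # assumed flood-warning threshold

# Probability forecast: fraction of ensemble members above the threshold
p_exceed = sum(m > warning_level for m in members) / len(members)
print(p_exceed)  # -> 0.6
```

Operational systems refine this raw ensemble frequency (for example with calibration against past forecasts), but the risk probability communicated to decision-makers is built on this quantity.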
Full Text Available Recent research into improving the effectiveness of forest inventory management using airborne LiDAR data has focused on developing advanced theories in data analytics. Furthermore, supervised learning as a predictive model for classifying tree genera (and species, where possible) has been gaining popularity in order to minimize this labor-intensive task. However, bottlenecks remain that hinder the immediate adoption of supervised learning methods. With supervised classification, training samples are required for learning the parameters that govern the performance of a classifier, yet the selection of training data is often subjective and the quality of such samples is critically important. For LiDAR scanning in forest environments, the quantification of data quality is somewhat abstract, normally referring to some metric related to the completeness of individual tree crowns; however, this is not an issue that has received much attention in the literature. Intuitively, the choice of training samples having varying quality will affect classification accuracy. In this paper a Diversity Index (DI) is proposed that characterizes the diversity of data quality (Qi) among selected training samples required for constructing a classification model of tree genera. The training sample is diversified in terms of data quality as opposed to the number of samples per class. The diversified training sample allows the classifier to better learn the positive and negative instances and, therefore, has a higher classification accuracy in discriminating the “unknown” class samples from the “known” samples. Our algorithm is implemented within the Random Forests base classifiers with six derived geometric features from LiDAR data. The training sample contains three tree genera (pine, poplar, and maple) and the validation samples contain four labels (pine, poplar, maple, and “unknown”). Classification accuracy improved from 72.8% when training samples were
Liu, Fang; Cao, San-xing; Lu, Rui
This paper proposes a user credit assessment model based on clustering ensemble, aiming to solve the problem that users illegally spread pirated and pornographic media contents within user self-service oriented broadband network new media platforms. The idea is to assess new media users' credit by establishing an indices system based on user credit behaviors; illegal users can then be identified according to the assessment results, thereby curbing the bad videos and audios transmitted on the network. The proposed clustering ensemble model integrates the advantages of swarm intelligence clustering, which is well suited to user credit behavior analysis, and K-means clustering, which can eliminate the scattered users left in the swarm intelligence clustering result, thus classifying all users' credit automatically. Verification experiments were performed on a standard credit application dataset from the UCI machine learning repository, and the statistical results of a comparative experiment with a single swarm intelligence clustering model indicate that this clustering ensemble model has a stronger ability to distinguish creditworthiness, especially in finding the user clusters with the best and worst credit, which helps operators take incentive or punitive measures accurately. Besides, compared with the experimental results of a Logistic-regression-based model under the same conditions, this clustering ensemble model is robust and has better prediction accuracy.
Full Text Available Recognition of transportation modes can be used in different applications, including human behavior research, transport management and traffic control. Previous work on transportation mode recognition has often relied on using multiple sensors or matching Geographic Information System (GIS) information, which is not possible in many cases. In this paper, an approach based on ensemble learning is proposed to infer hybrid transportation modes using only Global Positioning System (GPS) data. First, in order to distinguish between different transportation modes, we used a statistical method to generate global features and extract several local features from sub-trajectories after trajectory segmentation, before these features were combined in the classification stage. Second, to obtain a better performance, we used tree-based ensemble models (Random Forest, Gradient Boosting Decision Tree, and XGBoost) instead of traditional methods (K-Nearest Neighbor, Decision Tree, and Support Vector Machines) to classify the different transportation modes. The experimental results on the latter have shown the efficacy of our proposed approach. Among them, the XGBoost model produced the best performance, with a classification accuracy of 90.77% obtained on the GEOLIFE dataset, and we used a tree-based ensemble method to ensure accurate feature selection and reduce model complexity.
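A minimal illustration of the tree-ensemble principle behind these models, using bagged decision stumps on synthetic one-dimensional data rather than full Random Forest/GBDT/XGBoost implementations:

```python
import random
from collections import Counter

random.seed(5)

# Toy 1-D dataset: true class is 1 beyond x = 0.5, with 10% label noise
train = [(x, int(x > 0.5) if random.random() > 0.1 else int(x <= 0.5))
         for x in (random.random() for _ in range(100))]

def fit_stump(samples):
    # Best single-threshold classifier (depth-1 tree) on this sample
    best = None
    for t, _ in samples:
        err = sum((x > t) != y for x, y in samples)
        if best is None or err < best[0]:
            best = (err, t)
    return best[1]

# Bagging: fit each stump on a bootstrap resample, then majority-vote
stumps = [fit_stump(random.choices(train, k=len(train))) for _ in range(25)]

def predict(x):
    votes = Counter(int(x > t) for t in stumps)
    return votes.most_common(1)[0][0]

# Accuracy against the noise-free true rule
acc = sum(predict(x) == int(x > 0.5) for x, _ in train) / len(train)
print(acc)
```

Boosting-based ensembles such as GBDT and XGBoost instead fit trees sequentially to the residual errors, but the same vote/sum-of-weak-learners structure applies.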
We consider the Bayesian filtering problem for data assimilation following the kernel-based ensemble Gaussian-mixture filtering (EnGMF) approach introduced by Anderson and Anderson (1999). In this approach, the posterior distribution of the system state is propagated with the model using the ensemble Monte Carlo method, providing a forecast ensemble that is then used to construct a prior Gaussian-mixture (GM) based on the kernel density estimator. This results in two update steps: a Kalman filter (KF)-like update of the ensemble members and a particle filter (PF)-like update of the weights, followed by a resampling step to start a new forecast cycle. After formulating EnGMF for any observational operator, we analyze the influence of the bandwidth parameter of the kernel function on the covariance of the posterior distribution. We then focus on two aspects: i) the efficient implementation of EnGMF with (relatively) small ensembles, where we propose a new deterministic resampling strategy preserving the first two moments of the posterior GM to limit the sampling error; and ii) the analysis of the effect of the bandwidth parameter on contributions of KF and PF updates and on the weights variance. Numerical results using the Lorenz-96 model are presented to assess the behavior of EnGMF with deterministic resampling, study its sensitivity to different parameters and settings, and evaluate its performance against ensemble KFs. The proposed EnGMF approach with deterministic resampling suggests improved estimates in all tested scenarios, and is shown to require less localization and to be less sensitive to the choice of filtering parameters.
Chen, Hui; Tan, Chao; Wu, Tong; Wang, Li; Zhu, Wanping
Chinese liquor is one of the famous distilled spirits, and counterfeit liquor is becoming a serious problem in the market. In particular, aged liquor is facing a crisis of confidence because it is difficult for consumers to verify the marked age, which prompts unscrupulous traders to pass off low-grade liquors as high-grade liquors. An ideal method for authenticity confirmation of liquors should be non-invasive, non-destructive and timely. The combination of near-infrared (NIR) spectroscopy with chemometrics proves to be a good way to meet these premises. A new strategy is proposed for classification and verification of the adulteration of liquors by using NIR spectroscopy and chemometric classification, i.e., ensemble support vector machines (SVM). Three measures, i.e., accuracy, sensitivity and specificity, were used for performance evaluation. The results confirmed that the strategy can serve as a screening tool for verifying adulteration of the liquor, that is, a prior step that subjects a sample to deeper analysis only when a positive result for adulteration is obtained by the proposed methodology.
Azami, Hamed; Escudero, Javier
Breast cancer is one of the most common types of cancer in women all over the world. Early diagnosis of this kind of cancer can significantly increase the chances of long-term survival. Since the diagnosis of breast cancer is a complex problem, neural network (NN) approaches have been used as a promising solution. Considering the low speed of the back-propagation (BP) algorithm in training a feed-forward NN, we consider a number of improved NN trainings for the Wisconsin breast cancer dataset: BP with momentum, BP with adaptive learning rate, BP with adaptive learning rate and momentum, the Polak-Ribière conjugate gradient algorithm (CGA), Fletcher-Reeves CGA, Powell-Beale CGA, scaled CGA, resilient BP (RBP), one-step secant and quasi-Newton methods. An NN ensemble, which is a learning paradigm to combine a number of NN outputs, is used to improve the accuracy of the classification task. Results demonstrate that NN ensemble-based classification methods have better performance than single-NN-based algorithms. The highest overall average accuracy is 97.68%, obtained by an NN ensemble trained by RBP under the 50%-50% training-test evaluation method.
Cong, Zihan; Zhang, Xingming; Wang, Haoxiang; Xu, Hongjie
Aiming at the problem of “information overload” in the human resources industry, this paper proposes a human resource recommendation algorithm based on ensemble learning. The algorithm considers the characteristics and behaviours of both job seekers and jobs in real business circumstances. Firstly, the algorithm uses two ensemble learning methods, Bagging and Boosting. The outputs from both learning methods are then merged to form a user interest model. Based on the user interest model, job recommendations can be generated for users. The algorithm is implemented as a parallelized recommendation system on Spark. A set of experiments has been done and analysed. The proposed algorithm achieves significant improvements in accuracy, recall rate and coverage, compared with recommendation algorithms such as UserCF and ItemCF.
Zhang, Song; Pruett, William Andrew; Hester, Robert
In an emergency situation such as hemorrhage, doctors need to predict which patients need immediate treatment and care. This task is difficult because of the diverse responses to hemorrhage in the human population. Ensemble physiological simulations provide a means to sample a diverse range of subjects and may have a better chance of containing the correct solution. However, revealing the patterns and trends in an ensemble simulation is a challenging task. We have developed a visualization framework for ensemble physiological simulations. The visualization helps users identify trends among ensemble members, classify ensemble members into subpopulations for analysis, and provide predictions of future events by matching a new patient's data to existing ensembles. We demonstrated the effectiveness of the visualization on simulated physiological data. The lessons learned here can be applied to clinically collected physiological data in the future.
A good classifier can correctly predict new data for which the class label is unknown, so it is important to construct a high-accuracy classifier. Hence, classification techniques are very useful in ubiquitous computing. Associative classification achieves higher classification accuracy than some traditional rule-based classification approaches. However, the approach also has two major deficiencies. First, it generates a very large number of association classification rules, especially when t...
Wang, Yanhua; Liu, Xiyu; Xiang, Laisheng
Ensemble clustering can improve the generalization ability of a single clustering algorithm and generate a more robust clustering result by integrating multiple base clusterings, so it has become the focus of current clustering research. Ensemble clustering aims at finding a consensus partition that agrees as much as possible with the base clusterings. The genetic algorithm is a highly parallel, stochastic, and adaptive search algorithm developed from the natural selection and evolutionary mechanisms of biology. In this paper, an improved genetic algorithm is designed by improving the coding of the chromosome. A new membrane evolutionary algorithm is constructed by using genetic mechanisms as evolution rules, combined with the communication mechanism of a cell-like P system. The proposed algorithm is used to optimize the base clusterings and find the optimal chromosome as the final ensemble clustering result. The global optimization ability of the genetic algorithm and the rapid convergence of the membrane system make the membrane evolutionary algorithm perform better than several state-of-the-art techniques on six real-world UCI data sets.
Full Text Available This paper illustrates the use of a combined neural network model based on the Stacked Generalization method for classification of electrocardiogram (ECG) beats. In the conventional Stacked Generalization method, the combiner learns to map the base classifiers' outputs to the target data. We claim that adding the input pattern to the base classifiers' outputs helps the combiner obtain knowledge about the input space and, as a result, perform better on the same task. Experimental results support our claim that the additional knowledge about the input space improves the performance of the proposed method, which is called Modified Stacked Generalization. In particular, for classification of 14966 ECG beats that were not previously seen during the training phase, the Modified Stacked Generalization method reduced the error rate by 12.41% in comparison with the best of ten popular classifier fusion methods, including Max, Min, Average, Product, Majority Voting, Borda Count, Decision Templates, Weighted Averaging based on Particle Swarm Optimization, and Stacked Generalization.
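The modification described above amounts to a change in the level-1 feature vector. A minimal stdlib-Python sketch — the function names are made up here, and the perceptron combiner is an illustrative stand-in, not the network the authors used:

```python
def stack_features(x, base_classifiers, include_input=True):
    """Build the combiner's input: base-classifier outputs, optionally
    concatenated with the raw input pattern (the 'modified' variant)."""
    z = [clf(x) for clf in base_classifiers]
    return z + list(x) if include_input else z

def train_perceptron(samples, labels, epochs=50, lr=0.1):
    """Plain perceptron used as the level-1 combiner (illustrative choice)."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
            err = y - pred
            w = [wi + lr * err * xi for wi, xi in zip(w, x)]
            b += lr * err
    return lambda x: 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
```

Switching `include_input` on or off reproduces the modified vs. conventional stacking comparison on any base-classifier pool.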
Zhang, Jianhua; Li, Sunan; Wang, Rubin
In this paper, we deal with the Mental Workload (MWL) classification problem based on the measured physiological data. First we discussed the optimal depth (i.e., the number of hidden layers) and parameter optimization algorithms for the Convolutional Neural Networks (CNN). The base CNNs designed were tested according to five classification performance indices, namely Accuracy, Precision, F-measure, G-mean, and required training time. Then we developed an Ensemble Convolutional Neural Network (ECNN) to enhance the accuracy and robustness of the individual CNN model. For the ECNN design, three model aggregation approaches (weighted averaging, majority voting and stacking) were examined and a resampling strategy was used to enhance the diversity of individual CNN models. The results of MWL classification performance comparison indicated that the proposed ECNN framework can effectively improve MWL classification performance and is featured by entirely automatic feature extraction and MWL classification, when compared with traditional machine learning methods.
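Two of the three aggregation approaches named for the ECNN (weighted averaging and majority voting) can be sketched in a few lines; stacking needs a trained combiner and is omitted. This is a generic stdlib-Python illustration with assumed function names, not the authors' code:

```python
def weighted_average(prob_vectors, weights):
    """Combine per-model class-probability vectors by weighted averaging,
    then return the argmax class of the fused vector."""
    total = sum(weights)
    n_classes = len(prob_vectors[0])
    fused = [sum(w * p[c] for w, p in zip(weights, prob_vectors)) / total
             for c in range(n_classes)]
    return max(range(n_classes), key=fused.__getitem__)

def majority_vote(prob_vectors):
    """Each model votes for its own argmax class; ties go to the lowest index."""
    votes = [max(range(len(p)), key=p.__getitem__) for p in prob_vectors]
    return max(set(votes), key=lambda c: (votes.count(c), -c))
```

Note the two rules can disagree: a single highly weighted model can overturn a majority, which is exactly the trade-off weighted averaging introduces.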
The most significant ocean acoustic propagation occurs over tens of kilometers, at scales small compared to basin scales and to most fine-scale ocean modeling. To address the increased emphasis on uncertainty quantification, for example transmission loss (TL) probability density functions (PDF) within some radius, a polynomial chaos (PC) based method is utilized. In order to capture uncertainty in ocean modeling, the Navy Coastal Ocean Model (NCOM) now includes ensembles distributed to reflect the ocean analysis statistics. Since the ensembles are included in the data assimilation for the new forecast ensembles, the acoustic modeling uses the ensemble predictions in a similar fashion for creating sound speed distributions over an acoustically relevant domain. Within an acoustic domain, singular value decomposition over the combined time-space structure of the sound speeds can be used to create Karhunen-Loève expansions of sound speed, subject to multivariate normality testing. These sound speed expansions serve as a basis for Hermite polynomial chaos expansions of derived quantities, in particular TL. The PC expansion coefficients result from so-called non-intrusive methods, involving evaluation of TL at multi-dimensional Gauss-Hermite quadrature collocation points. Traditional TL calculation from standard acoustic propagation modeling could be prohibitively time consuming at all multi-dimensional collocation points. This method employs Smolyak order and gridding methods to allow adaptive sub-sampling of the collocation points to determine only the most significant PC expansion coefficients to within a preset tolerance. Practically, the Smolyak order and grid sizes grow only polynomially in the number of Karhunen-Loève terms, alleviating the curse of dimensionality. The resulting TL PC coefficients allow the determination of TL PDF normality and its mean and standard deviation. In the non-normal case, PC Monte Carlo methods are used to rapidly establish the PDF. This work was …
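The core mechanics of a Hermite PC expansion can be shown in one dimension. As an assumption-laden sketch (Monte Carlo projection instead of the abstract's sparse Gauss-Hermite quadrature, truncation at order 2, invented function names): for a quantity f(X) with X ~ N(0,1), the coefficients are c_k = E[f(X) He_k(X)] / k!, the mean is c_0, and the variance is the weighted sum of squared higher coefficients:

```python
import random

def hermite_pc_coeffs(f, order=2, n_samples=200_000, seed=1):
    """Project f(X), X ~ N(0,1), onto probabilists' Hermite polynomials
    He0 = 1, He1 = x, He2 = x^2 - 1, estimating c_k = E[f(X) He_k(X)] / k!
    by plain Monte Carlo (the non-intrusive spirit, without sparse grids)."""
    rng = random.Random(seed)
    he = [lambda x: 1.0, lambda x: x, lambda x: x * x - 1.0]
    fact = [1, 1, 2]  # k! for k = 0, 1, 2
    sums = [0.0] * (order + 1)
    for _ in range(n_samples):
        x = rng.gauss(0.0, 1.0)
        fx = f(x)
        for k in range(order + 1):
            sums[k] += fx * he[k](x)
    return [s / n_samples / fact[k] for k, s in enumerate(sums)]

def pc_mean_var(coeffs):
    """Mean and variance implied by the PC coefficients:
    mean = c0, var = sum_{k>=1} k! * c_k^2 (orthogonality of the He_k)."""
    fact = [1, 1, 2]
    mean = coeffs[0]
    var = sum(fact[k] * c * c for k, c in enumerate(coeffs) if k >= 1)
    return mean, var
```

For f(x) = x² the exact coefficients are c0 = 1, c1 = 0, c2 = 1, giving mean 1 and variance 2, which the estimator recovers to Monte Carlo accuracy.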
Abou-Zleikha, Mohamed; Tan, Zheng-Hua; Christensen, Mads Græsbøll
… for a certain condition, the model becomes biased to the data used for training, limiting the model's generalisation ability. In this paper, we propose a BIC-based tuning-free approach for speaker segmentation through the use of ensemble-based learning. A forest of segmentation trees is constructed in which each … points. The proposed approach is tested on artificially created conversations from the TIMIT database. The proposed approach shows very accurate results, comparable to those achieved by state-of-the-art methods, with a 9% (absolute) higher F1 compared with the standard ΔBIC with optimally tuned penalty …
Nanni, Loris; Lumini, Alessandra; Zaffonato, Nicolò
Alzheimer's disease (AD) is the most common cause of neurodegenerative dementia in the elderly population. Scientific research is very active in the challenge of designing automated approaches to achieve an early and certain diagnosis. Recently an international competition among AD predictors has been organized: "A Machine learning neuroimaging challenge for automated diagnosis of Mild Cognitive Impairment" (MLNeCh). This competition is based on pre-processed sets of T1-weighted Magnetic Resonance Images (MRI) to be classified in four categories: stable AD, individuals with MCI who converted to AD, individuals with MCI who did not convert to AD, and healthy controls. In this work, we propose a method to perform early diagnosis of AD, which is evaluated on the MLNeCh dataset. Since the automatic classification of AD is based on the use of feature vectors of high dimensionality, different techniques of feature selection/reduction are compared in order to avoid the curse-of-dimensionality problem; the classification method is then obtained as the combination of Support Vector Machines trained using different clusters of data extracted from the whole training set. The multi-classifier approach proposed in this work outperforms all the stand-alone methods tested in our experiments. The final ensemble is based on a set of classifiers, each trained on a different cluster of the training data. The proposed ensemble has the great advantage of performing well using a very reduced version of the data (the reduction factor is more than 90%). The MATLAB code for the ensemble of classifiers will be publicly available to other researchers for future comparisons. Copyright © 2017 Elsevier B.V. All rights reserved.
Barzan I Khayatt
Full Text Available There is a growing interest in the Non-ribosomal peptide synthetases (NRPSs) and polyketide synthases (PKSs) of microbes, fungi and plants because they can produce bioactive peptides such as antibiotics. The ability to identify the substrate specificity of the enzyme's adenylation (A) and acyl-transferase (AT) domains is essential to rationally deduce or engineer new products. We here report on a Hidden Markov Model (HMM) based ensemble method to predict the substrate specificity at high quality. We collected a new reference set of experimentally validated sequences. An initial classification based on alignment and Neighbor Joining was performed in line with most of the previously published prediction methods. We then created and tested single substrate specific HMMs and found that their use improved the correct identification significantly for A as well as for AT domains. A major advantage of the use of HMMs is that it abolishes the dependency on multiple sequence alignment and residue selection that is hampering the alignment-based clustering methods. Using our models we obtained a high prediction quality for the substrate specificity of the A domains similar to two recently published tools that make use of HMMs or Support Vector Machines (NRPSsp and NRPSpredictor2, respectively). Moreover, replacement of the single substrate specific HMMs by ensembles of models caused a clear increase in prediction quality. We argue that the superiority of the ensemble over the single model is caused by the way substrate specificity evolves for the studied systems. It is likely that this also holds true for other protein domains. The ensemble predictor has been implemented in a simple web-based tool that is available at http://www.cmbi.ru.nl/NRPS-PKS-substrate-predictor/.
Stanescu, Ana; Caragea, Doina
Recent biochemical advances have led to inexpensive, time-efficient production of massive volumes of raw genomic data. Traditional machine learning approaches to genome annotation typically rely on large amounts of labeled data. The process of labeling data can be expensive, as it requires domain knowledge and expert involvement. Semi-supervised learning approaches that can make use of unlabeled data, in addition to small amounts of labeled data, can help reduce the costs associated with labeling. In this context, we focus on the problem of predicting splice sites in a genome using semi-supervised learning approaches. This is a challenging problem, due to the highly imbalanced distribution of the data, i.e., small number of splice sites as compared to the number of non-splice sites. To address this challenge, we propose to use ensembles of semi-supervised classifiers, specifically self-training and co-training classifiers. Our experiments on five highly imbalanced splice site datasets, with positive to negative ratios of 1-to-99, showed that the ensemble-based semi-supervised approaches represent a good choice, even when the amount of labeled data consists of less than 1% of all training data. In particular, we found that ensembles of co-training and self-training classifiers that dynamically balance the set of labeled instances during the semi-supervised iterations show improvements over the corresponding supervised ensemble baselines. In the presence of limited amounts of labeled data, ensemble-based semi-supervised approaches can successfully leverage the unlabeled data to enhance supervised ensembles learned from highly imbalanced data distributions. Given that such distributions are common for many biological sequence classification problems, our work can be seen as a stepping stone towards more sophisticated ensemble-based approaches to biological sequence annotation in a semi-supervised framework.
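The key idea above — self-training that dynamically balances the labeled set while growing it — can be sketched compactly. This is a stdlib-Python illustration under stated assumptions: a nearest-centroid base learner instead of the classifiers the study used, invented function names, and one classifier rather than an ensemble:

```python
def centroid(points):
    d = len(points[0])
    return [sum(p[i] for p in points) / len(points) for i in range(d)]

def dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def self_train(labeled, unlabeled, iterations=5, per_class=1):
    """Self-training with a nearest-centroid base learner: each round,
    pseudo-label the pool and promote the same (small) number of
    most-confident examples per class, keeping the labeled set balanced."""
    labeled = {0: list(labeled[0]), 1: list(labeled[1])}
    pool = list(unlabeled)
    for _ in range(iterations):
        if not pool:
            break
        c0, c1 = centroid(labeled[0]), centroid(labeled[1])
        # confidence = margin between the two centroid distances (>0 favors class 0)
        scored = [(dist(x, c1) - dist(x, c0), x) for x in pool]
        scored.sort(key=lambda s: s[0])
        for _ in range(per_class):            # most-confident class-1 candidates
            if scored and scored[0][0] < 0:
                _, x = scored.pop(0)
                labeled[1].append(x)
                pool.remove(x)
        for _ in range(per_class):            # most-confident class-0 candidates
            if scored and scored[-1][0] > 0:
                _, x = scored.pop()
                labeled[0].append(x)
                pool.remove(x)
    c0, c1 = centroid(labeled[0]), centroid(labeled[1])
    return lambda x: 0 if dist(x, c0) <= dist(x, c1) else 1
```

Promoting equal numbers per class is what keeps a 1-to-99 imbalance from flooding the labeled set with pseudo-labeled negatives.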
Pan, M; Zhang, J
Microarray gene expression technology provides a systematic approach to patient classification. However, microarray data pose a great computational challenge owing to their large dimensionality, small sample sizes, and potential correlations among genes. A recent study has shown that gene-gene correlations have a positive effect on the accuracy of classification models, in contrast to some previous results. In this study, a recently developed correlation-based classifier, the ensemble of random subspace (RS) Fisher linear discriminants (FLDs), was utilized. The impact of gene-gene correlations on the performance of this classifier and other classifiers was studied using simulated datasets and real datasets. A cross-validation framework was used to evaluate the performance of each classifier using the simulated datasets or real datasets, and misclassification rates (MRs) were computed. Using the simulated data, the average MRs of the correlation-based classifiers decreased as the correlations increased when there were more correlated genes. Using real data, the correlation-based classifiers outperformed the non-correlation-based classifiers, especially when the gene-gene correlations were high. The ensemble RS-FLD classifier is a potential state-of-the-art computational method. The correlation-based ensemble RS-FLD classifier was effective and benefited from gene-gene correlations, particularly when the correlations were high.
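A random-subspace ensemble like the RS-FLD can be sketched in stdlib Python. Assumptions are flagged loudly: the full Fisher discriminant's pooled covariance is simplified here to per-feature variances (a diagonal, "naive" discriminant), and all names and settings are illustrative, not the paper's method:

```python
import random

def rs_diag_fld(X, y, n_members=11, subspace=2, seed=0):
    """Random-subspace ensemble of diagonal ('naive') Fisher-style
    discriminants: each member sees a random feature subset and votes.
    The per-feature 'var' is a crude pooled spread around the class-mean
    midpoint, standing in for the FLD's within-class covariance."""
    rng = random.Random(seed)
    d = len(X[0])
    X0 = [x for x, t in zip(X, y) if t == 0]
    X1 = [x for x, t in zip(X, y) if t == 1]
    mu0 = [sum(x[f] for x in X0) / len(X0) for f in range(d)]
    mu1 = [sum(x[f] for x in X1) / len(X1) for f in range(d)]
    var = [max(1e-9, sum((x[f] - (mu0[f] + mu1[f]) / 2) ** 2 for x in X) / len(X))
           for f in range(d)]
    members = [rng.sample(range(d), subspace) for _ in range(n_members)]

    def predict(x):
        votes = 0
        for feats in members:
            score = sum(((x[f] - mu0[f]) ** 2 - (x[f] - mu1[f]) ** 2) / var[f]
                        for f in feats)
            votes += 1 if score > 0 else 0   # positive score: closer to class-1 mean
        return 1 if votes > n_members / 2 else 0
    return predict
```

The diagonal simplification deliberately ignores gene-gene correlations, which is exactly the information the abstract reports the full RS-FLD exploits; the sketch shows only the subspace-plus-voting scaffolding.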
Full Text Available The impacts of the assimilated observations on the 24-hour forecasts are estimated with the ensemble-based method proposed by Kalnay et al. using an ensemble Kalman filter (EnKF). This method estimates the relative impact of observations in data assimilation similar to the adjoint-based method proposed by Langland and Baker, but without using the adjoint model. It is implemented on the National Centers for Environmental Prediction Global Forecast System EnKF that has been used as part of the operational global data assimilation system at NCEP since May 2012. The result quantifies the overall positive impacts of the assimilated observations and the relative importance of the satellite radiance observations compared to other types of observations, especially for the moisture fields. A simple moving localisation based on the average wind, although not optimal, seems to work well. The method is also used to identify the cause of local forecast failure cases in the 24-hour forecasts. Data-denial experiments of the observations identified as producing a negative impact are performed, and forecast errors are reduced as estimated, thus validating the impact estimation.
Chellasamy, Menaka; Ferre, Ty; Greve, Mogens Humlekrog
… (field vectors containing crop information and the field boundary of each crop). Crop discrimination is performed by considering a set of conclusions derived from individual neural networks based on ET. Verification of the diversification rules is performed by incorporating pixel-based classification uncertainty … Beginning in 2015, Danish farmers are obliged to meet specific crop diversification rules based on total land area and number of crops cultivated to be eligible for new greening subsidies. Hence, there is a need for the Danish government to extend their subsidy control system to verify farmers' declarations to warrant greening payments under the new crop diversification rules. Remote Sensing (RS) technology has been used since 1992 to control farmers' subsidies in Denmark. However, a proper RS-based approach is yet to be finalised to validate new crop diversity requirements designed for assessing …
Chellasamy, Menaka; Ferre, Ty; Greve, Mogens Humlekrog
An approach to use the available agricultural parcel information to automatically select training samples for crop classification is investigated. Previous research addressed the multi-evidence crop classification approach using an ensemble classifier. This first produced confidence measures using … three Multi-Layer Perceptron (MLP) neural networks trained separately with spectral, texture and vegetation indices; classification labels were then assigned based on Endorsement Theory. The present study proposes an approach to feed this ensemble classifier with automatically selected training samples … The approach is named ECRA (Ensemble-based Cluster Refinement Approach). ECRA first automatically removes mislabeled samples and then selects the refined training samples in an iterative training-reclassification scheme. Mislabel removal is based on the expectation that mislabels in each class will be far …
Ali, Safdar; Majid, Abdul; Khan, Asifullah
Development of an accurate and reliable intelligent decision-making method for the construction of cancer diagnosis system is one of the fast growing research areas of health sciences. Such decision-making system can provide adequate information for cancer diagnosis and drug discovery. Descriptors derived from physicochemical properties of protein sequences are very useful for classifying cancerous proteins. Recently, several interesting research studies have been reported on breast cancer classification. To this end, we propose the exploitation of the physicochemical properties of amino acids in protein primary sequences such as hydrophobicity (Hd) and hydrophilicity (Hb) for breast cancer classification. Hd and Hb properties of amino acids, in recent literature, are reported to be quite effective in characterizing the constituent amino acids and are used to study protein foldings, interactions, structures, and sequence-order effects. Especially, using these physicochemical properties, we observed that proline, serine, tyrosine, cysteine, arginine, and asparagine amino acids offer high discrimination between cancerous and healthy proteins. In addition, unlike traditional ensemble classification approaches, the proposed 'IDM-PhyChm-Ens' method was developed by combining the decision spaces of a specific classifier trained on different feature spaces. The different feature spaces used were amino acid composition, split amino acid composition, and pseudo amino acid composition. Consequently, we have exploited different feature spaces using Hd and Hb properties of amino acids to develop an accurate method for classification of cancerous protein sequences. We developed ensemble classifiers using diverse learning algorithms such as random forest (RF), support vector machines (SVM), and K-nearest neighbor (KNN) trained on different feature spaces. We observed that ensemble-RF, in case of cancer classification, performed better than ensemble-SVM and ensemble-KNN. Our
Roberts, Eric; Smith, Spencer; Sindi, Suzanne; Smith, Kevin
Topological entropy is a measurement of orbit complexity in a dynamical system that can be estimated in 2D by embedding an initial material curve L0 in the fluid and estimating its growth under the evolution of the flow. This growth is given by L(t) = |L0| e^(h t), where L(t) is the length of the curve as a function of t and h is the topological entropy. In order to develop a method for computing this growth rate that will efficiently scale up in both system size and modeling time, one must be clever about extracting the maximum information from the limited trajectories available. The relative motion of trajectories through phase space encodes global information that is not contained in any individual trajectory. That is, extra information is ''hiding'' in an ensemble of classical trajectories, which is not exploited in a trajectory-by-trajectory approach. Using tools from computational geometry, we introduce a new algorithm designed to take advantage of such additional information that requires only potentially sparse sets of particle trajectories as input and no reliance on any detailed knowledge of the velocity field: the Ensemble-Based Topological Entropy Calculation, or E-tec.
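Given measured curve lengths, the entropy h in L(t) = |L0| e^(h t) is just the least-squares slope of log L(t) versus t. A small stdlib-Python sketch (the function name is an assumption; E-tec itself computes the lengths from sparse trajectories, which is not reproduced here):

```python
import math

def entropy_from_lengths(times, lengths):
    """Least-squares slope of log L(t) versus t, i.e. the topological
    entropy h in L(t) = |L0| * exp(h * t)."""
    logs = [math.log(L) for L in lengths]
    n = len(times)
    tbar = sum(times) / n
    lbar = sum(logs) / n
    num = sum((t - tbar) * (l - lbar) for t, l in zip(times, logs))
    den = sum((t - tbar) ** 2 for t in times)
    return num / den
```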
Full Text Available Calls from 14 species of bat were classified to genus and species using discriminant function analysis (DFA), support vector machines (SVM) and ensembles of neural networks (ENN). Both SVMs and ENNs outperformed DFA for every species, while ENNs (mean identification rate 97%) consistently outperformed SVMs (mean identification rate 87%). Correct classification rates produced by the ENNs varied from 91% to 100%; calls from six species were correctly identified with 100% accuracy. Calls from the five species of Myotis, a genus whose species are considered difficult to distinguish acoustically, had correct identification rates that varied from 91% to 100%. Five parameters were most important for classifying calls correctly while seven others contributed little to classification performance.
Full Text Available The degradation of photovoltaic (PV) modules remains a major concern on the control and the development of the photovoltaic field, particularly in regions with difficult climatic conditions. The main degradation modes of the PV modules are corrosion, discoloration, glass breaks, and cracks of cells. However, corrosion and discoloration remain the predominant degradation modes that still require further investigations. In this paper, a model-based PV corrosion prognostic approach, based on an ensemble Kalman filter (EnKF), is introduced to identify the PV corrosion parameters and then estimate the remaining useful life (RUL). Simulations have been conducted using a measured data set, and results are reported to show the efficiency of the proposed approach.
Doyle, Suzanne R.; Donovan, Dennis M.
Aims The purpose of this study was to explore the selection of predictor variables in the evaluation of drug treatment completion using an ensemble approach with classification trees. The basic methodology is reviewed and the subagging procedure of random subsampling is applied. Methods Among 234 individuals with stimulant use disorders randomized to a 12-Step facilitative intervention shown to increase stimulant use abstinence, 67.52% were classified as treatment completers. A total of 122 baseline variables were used to identify factors associated with completion. Findings The number of types of self-help activity involvement prior to treatment was the predominant predictor. Other effective predictors included better coping self-efficacy for substance use in high-risk situations, more days of prior meeting attendance, greater acceptance of the Disease model, higher confidence for not resuming use following discharge, lower ASI Drug and Alcohol composite scores, negative urine screens for cocaine or marijuana, and fewer employment problems. Conclusions The application of an ensemble subsampling regression tree method utilizes the fact that classification trees are unstable but, on average, produce an improved prediction of the completion of drug abuse treatment. The results support the notion that there are early indicators of treatment completion that may allow for modification of approaches more tailored to fitting the needs of individuals and potentially provide more successful treatment engagement and improved outcomes. PMID:25134038
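Subagging differs from bagging only in drawing subsamples without replacement. A stdlib-Python sketch under stated assumptions — a decision stump stands in for the full classification trees, and all names and settings are illustrative:

```python
import random

def fit_stump(X, y):
    """Best single-feature threshold split minimizing training misclassifications."""
    best = None
    for f in range(len(X[0])):
        for thr in sorted({x[f] for x in X}):
            for sign in (1, -1):                     # x > thr  or  x < thr
                pred = [1 if sign * (x[f] - thr) > 0 else 0 for x in X]
                err = sum(p != t for p, t in zip(pred, y))
                if best is None or err < best[0]:
                    best = (err, f, thr, sign)
    _, f, thr, sign = best
    return lambda x: 1 if sign * (x[f] - thr) > 0 else 0

def subagging(X, y, n_members=25, frac=0.5, seed=0):
    """Subagging: fit base learners on random subsamples drawn WITHOUT
    replacement, then aggregate by majority vote."""
    rng = random.Random(seed)
    m = max(2, int(frac * len(X)))
    stumps = []
    for _ in range(n_members):
        idx = rng.sample(range(len(X)), m)           # subsample, no replacement
        stumps.append(fit_stump([X[i] for i in idx], [y[i] for i in idx]))
    return lambda x: 1 if sum(s(x) for s in stumps) > n_members / 2 else 0
```

Averaging over subsamples is what tames the instability of individual trees that the Conclusions paragraph refers to.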
Aalstad, K.; Westermann, S.; Bertino, L.
Snow, with high albedo, low thermal conductivity and large water storing capacity strongly modulates the surface energy and water balance. At the same time, the estimation of the seasonal snow cycle at the kilometer scale is a major hydrometeorological challenge and thus represents a significant source of uncertainty in land surface schemes. To constrain this uncertainty we are developing a modular ensemble-based subgrid snow data assimilation framework (ESSDA) for satellite-era snow distribution (SD) reanalyses. ESSDA makes use of an ensemble Kalman smoother (EnKS) approach to assimilate MODIS and LandSat retrievals of snow cover fraction into the subgrid snow distribution submodel (SSNOWD). Our problem is particularly challenging as snow cover is doubly bounded in physical space, which violates the Gaussian error assumption in the EnKS and thus compromises the performance of the reanalysis. In an attempt to alleviate this issue we make use of the technique of analytical Gaussian anamorphosis (GA) and apply the EnKS in a transformed space. The potential of the framework is presented through both synthetic (twin) and real experiments. The latter are carried out for the Bayelva catchment near Ny-Ålesund (79°N, Svalbard, Norway) where independent and accurate ground-based observations of snow cover and snow depth distributions are available. Our ensuing evaluation demonstrates that ESSDA provides robust estimates of the evolution of the SD over almost a decade long validation period. The use of GA, and our emphasis on the subgrid SD, is what sets ESSDA apart from previously proposed snow reanalysis frameworks. ESSDA has been developed to aid in satellite-based permafrost mapping efforts. Nonetheless, being modular and given the improved error treatment through GA, the ESSDA approach may also prove valuable for broader hydrometeorological reanalysis and forecast initialization efforts.
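The Gaussian anamorphosis idea — update a doubly bounded variable in a transformed, unbounded space so the analysis respects the bounds — can be shown with a logit transform and a scalar ensemble Kalman update. This is a deliberately simplified sketch (scalar state, deterministic update, logit instead of the paper's analytical anamorphosis function, invented names; the observation error variance is taken in transformed space):

```python
import math

def logit(p, eps=1e-6):
    """Map a fraction in (0, 1) to the real line; eps guards the bounds."""
    p = min(1 - eps, max(eps, p))
    return math.log(p / (1 - p))

def inv_logit(z):
    return 1.0 / (1.0 + math.exp(-z))

def anamorphosed_update(ensemble, obs, obs_err_var):
    """Scalar ensemble Kalman update performed in logit space, so the
    analysis members stay inside [0, 1] when mapped back."""
    zs = [logit(x) for x in ensemble]
    zo = logit(obs)
    zbar = sum(zs) / len(zs)
    var = sum((z - zbar) ** 2 for z in zs) / (len(zs) - 1)
    gain = var / (var + obs_err_var)            # Kalman gain in transformed space
    return [inv_logit(z + gain * (zo - z)) for z in zs]
```

A plain Gaussian update on the raw fractions can push members outside [0, 1]; the transform makes that impossible by construction, which is the motivation stated in the abstract.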
Sangouard, Nicolas; Simon, Christoph; de Riedmatten, Hugues; Gisin, Nicolas
The distribution of quantum states over long distances is limited by photon loss. Straightforward amplification as in classical telecommunications is not an option in quantum communication because of the no-cloning theorem. This problem could be overcome by implementing quantum repeater protocols, which create long-distance entanglement from shorter-distance entanglement via entanglement swapping. Such protocols require the capacity to create entanglement in a heralded fashion, to store it in quantum memories, and to swap it. One attractive general strategy for realizing quantum repeaters is based on the use of atomic ensembles as quantum memories, in combination with linear optical techniques and photon counting to perform all required operations. Here the theoretical and experimental status quo of this very active field are reviewed. The potentials of different approaches are compared quantitatively, with a focus on the most immediate goal of outperforming the direct transmission of photons.
In this paper, we propose a new and accurate technique for uncertainty analysis and uncertainty visualization based on fiber orientation distribution function (ODF) glyphs, associated with high angular resolution diffusion imaging (HARDI). Our visualization applies volume rendering techniques to an ensemble of 3D ODF glyphs, which we call SIP functions of diffusion shapes, to capture their variability due to underlying uncertainty. This rendering elucidates the complex heteroscedastic structural variation in these shapes. Furthermore, we quantify the extent of this variation by measuring the fraction of the volume of these shapes, which is consistent across all noise levels, the certain volume ratio. Our uncertainty analysis and visualization framework is then applied to synthetic data, as well as to HARDI human-brain data, to study the impact of various image acquisition parameters and background noise levels on the diffusion shapes. © 2012 IEEE.
Kwon, Hyun-Han; So, Byung-Jin; Kim, Seong-Hyeon; Kim, Byung-Seop
The Smart Water Grid (SWG) concept has emerged globally over the last decade and has gained significant recognition in South Korea. In particular, there has been growing interest in water demand forecasting and optimal pump operation, and this has led to various studies regarding energy saving and improvement of water supply reliability. Existing water demand forecasting models are categorized into two groups in view of modeling and predicting their behavior in time series. One is to consider embedded patterns such as seasonality, periodicity and trends, and the other is an autoregressive model that uses short-memory Markovian processes (Emmanuel et al., 2012). The main disadvantage of the abovementioned models is that there is a limit to the predictability of water demands at about sub-daily scale because the system is nonlinear. In this regard, this study aims to develop a nonlinear ensemble model for hourly water demand forecasting which allows us to estimate uncertainties across different model classes. The proposed model consists of two parts. One is a multi-model scheme that is based on a combination of independent prediction models. The other is a cross-validation scheme named the Bagging approach, introduced by Breiman (1996), to derive weighting factors corresponding to individual models. The individual forecasting models used in this study are a linear regression analysis model, polynomial regression, multivariate adaptive regression splines (MARS), and support vector machines (SVM). The concepts are demonstrated through application to data observed at water plants at several locations in South Korea. Keywords: water demand, non-linear model, the ensemble forecasting model, uncertainty. Acknowledgements This subject is supported by Korea Ministry of Environment as "Projects for Developing Eco-Innovation Technologies (GT-11-G-02-001-6)".
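Once the individual models have been scored on held-out data, the multi-model combination itself is simple. A stdlib-Python sketch with an assumed weighting rule (weights inversely proportional to each member's validation mean-squared error — one common choice, not necessarily the weighting the study derives from its Bagging scheme):

```python
def ensemble_forecast(model_preds, val_errors):
    """Combine member forecasts (one list of predictions per model) with
    weights inversely proportional to each model's validation MSE."""
    inv = [1.0 / max(e, 1e-12) for e in val_errors]
    total = sum(inv)
    weights = [w / total for w in inv]
    return [sum(w * p[t] for w, p in zip(weights, model_preds))
            for t in range(len(model_preds[0]))]
```

With equal validation errors this reduces to a plain average; a model with a much larger error is effectively ignored.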
A new hybrid ensemble Kalman filter/four-dimensional variational data assimilation (EnKF/4D-VAR) approach is introduced to mitigate background covariance limitations in the EnKF. The work is based on the adaptive EnKF (AEnKF) method, which bears a strong resemblance to the hybrid EnKF/three-dimensional variational data assimilation (3D-VAR) method. In the AEnKF, the representativeness of the EnKF ensemble is regularly enhanced with new members generated after back projection of the EnKF analysis residuals to state space using a 3D-VAR [or optimal interpolation (OI)] scheme with a preselected background covariance matrix. The idea here is to reformulate the transformation of the residuals as a 4D-VAR problem, constraining the new member with model dynamics and the previous observations. This should provide more information for the estimation of the new member and reduce dependence of the AEnKF on the assumed stationary background covariance matrix. This is done by integrating the analysis residuals backward in time with the adjoint model. Numerical experiments are performed with the Lorenz-96 model under different scenarios to test the new approach and to evaluate its performance with respect to the EnKF and the hybrid EnKF/3D-VAR. The new method leads to the least root-mean-square estimation errors as long as the linear assumption guaranteeing the stability of the adjoint model holds. It is also found to be less sensitive to choices of the assimilation system inputs and parameters.
Pinson, Pierre; Madsen, Henrik
For management and trading purposes, information on short-term wind generation (from a few hours to a few days ahead) is even more crucial at large offshore wind farms, since they concentrate a large capacity at a single location. The most complete information that can be provided today consists … The obtained ensemble forecasts of wind power are then converted into predictive distributions with an original adaptive kernel dressing method. The shape of the kernels is driven by a mean-variance model, the parameters of which are recursively estimated in order to maximize the overall skill of obtained …
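Kernel dressing turns a finite ensemble into a continuous predictive distribution by placing a kernel on each member. A minimal stdlib-Python sketch with fixed-width Gaussian kernels — the adaptive mean-variance model that drives the kernel shape in the abstract is not reproduced, and the names are illustrative:

```python
import math

def dressed_density(ensemble, sigma):
    """Predictive density as an equal-weight mixture of Gaussian kernels
    of width sigma centered on the ensemble members ('kernel dressing')."""
    def pdf(x):
        return sum(math.exp(-0.5 * ((x - m) / sigma) ** 2) /
                   (sigma * math.sqrt(2 * math.pi)) for m in ensemble) / len(ensemble)
    return pdf
```

The recursive estimation mentioned above would tune sigma (and a possible mean shift) over time to maximize forecast skill, rather than fixing it as here.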
Re, Matteo; Valentini, Giorgio
… proposed to explain the characteristics and the successful application of ensembles to different application domains. For instance, Allwein, Schapire, and Singer interpreted the improved generalization capabilities of ensembles of learning machines in the framework of large margin classifiers [4,177], Kleinberg in the context of stochastic discrimination theory, and Breiman and Friedman in the light of the bias-variance analysis borrowed from classical statistics [21,70]. Empirical studies showed that both in classification and regression problems, ensembles improve on single learning machines, and moreover large experimental studies compared the effectiveness of different ensemble methods on benchmark data sets [10,11,49,188]. The interest in this research area is motivated also by the availability of very fast computers and networks of workstations at a relatively low cost that allow the implementation and the experimentation of complex ensemble methods using off-the-shelf computer platforms. However, as explained in Section 26.2, there are deeper reasons to use ensembles of learning machines, motivated by the intrinsic characteristics of the ensemble methods. The main aim of this chapter is to introduce ensemble methods and to provide an overview and a bibliography of the main areas of research, without pretending to be exhaustive or to explain the detailed characteristics of each ensemble method. The paper is organized as follows. In the next section, the main theoretical and practical reasons for combining multiple learners are introduced. Section 26.3 depicts the main taxonomies of ensemble methods proposed in the literature. In Sections 26.4 and 26.5, we present an overview of the main supervised ensemble methods reported in the literature, adopting a simple taxonomy originally proposed in Ref. . Applications of ensemble methods are only marginally considered, but a specific section on some relevant applications of ensemble methods in astronomy and
Dejanic, Sanda; Scheidegger, Andreas; Rieckermann, Jörg; Albert, Carlo
Sampling Bayesian posteriors of model parameters is often required for making model-based probabilistic predictions. For complex environmental models, standard Markov chain Monte Carlo (MCMC) methods are often infeasible because they require too many sequential model runs. Therefore, we focused on ensemble methods that use many Markov chains in parallel, since they can be run on modern cluster architectures. Little is known about how to choose the best performing sampler for a given application, and a poor choice can lead to an inappropriate representation of posterior knowledge. We assessed two different jump moves, the stretch and the differential evolution move, underlying, respectively, the software packages EMCEE and DREAM, which are popular in different scientific communities. For the assessment, we used analytical posteriors with features that often occur in real posteriors, namely high dimensionality, strong non-linear correlations, or multimodality. For posteriors with non-linear features, standard convergence diagnostics based on sample means can be insufficient. Therefore, we resorted to an entropy-based convergence measure. We assessed the samplers by means of their convergence speed, robustness, and effective sample sizes. For posteriors with strongly non-linear features, we found that the stretch move outperforms the differential evolution move with respect to all three aspects.
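As a concrete illustration of the stretch move assessed above, the following is a minimal sketch of the affine-invariant update of Goodman and Weare that underlies EMCEE. The standard-Gaussian target, the walker count, and the scale parameter a=2 are illustrative assumptions, not the study's configuration:

```python
import numpy as np

def log_post(x):
    # Illustrative target: standard Gaussian log-density (up to a constant).
    return -0.5 * np.sum(x ** 2)

def stretch_step(walkers, rng, a=2.0):
    """One serial sweep of the affine-invariant stretch move.

    Each walker k proposes a move along the line through itself and a
    randomly chosen complementary walker j, stretched by a factor z drawn
    from g(z) proportional to 1/sqrt(z) on [1/a, a].
    """
    n, d = walkers.shape
    for k in range(n):
        j = rng.choice([i for i in range(n) if i != k])   # complementary walker
        z = ((a - 1.0) * rng.random() + 1.0) ** 2 / a     # stretch factor ~ g(z)
        prop = walkers[j] + z * (walkers[k] - walkers[j])
        log_accept = (d - 1) * np.log(z) + log_post(prop) - log_post(walkers[k])
        if np.log(rng.random()) < log_accept:
            walkers[k] = prop                              # accept in place
    return walkers

rng = np.random.default_rng(0)
w = rng.normal(size=(20, 2))          # 20 walkers in 2 dimensions
for _ in range(500):
    w = stretch_step(w, rng)
print(w.shape)  # (20, 2)
```

In EMCEE the walkers are updated in parallel half-ensembles rather than serially, but the proposal and acceptance rule are the same.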
Ahmadi, Sepehr; El-Ella, Haitham A.R.; Hansen, Jørn B.
We demonstrate magnetic field sensing by recording the variation in the pump light absorption with a nitrogen-vacancy center ensemble. At a frequency of 10 mHz we obtain a noise floor of ~30 nT/√Hz.
Heart disease is one of the most common diseases in the world. The objective of this study is to aid the diagnosis of heart disease using a hybrid classification system based on the ReliefF and Rough Set (RFRS) method. The proposed system contains two subsystems: the RFRS feature selection system and a classification system with an ensemble classifier. The first system includes three stages: (i) data discretization, (ii) feature extraction using the ReliefF algorithm, and (iii) feature reduction using the heuristic Rough Set reduction algorithm that we developed. In the second system, an ensemble classifier is proposed based on the C4.5 classifier. The Statlog (Heart) dataset, obtained from the UCI database, was used for experiments. A maximum classification accuracy of 92.59% was achieved according to a jackknife cross-validation scheme. The results demonstrate that the performance of the proposed system is superior to the performances of previously reported classification techniques.
Theis, S.; Gebhardt, C.; Buchhold, M.; Ben Bouallègue, Z.; Ohl, R.; Paulat, M.; Peralta, C.
The numerical weather prediction model COSMO-DE is a configuration of the COSMO model with a horizontal grid size of 2.8 km. It has been running operationally at DWD since 2007, it covers the area of Germany and produces forecasts with a lead time of 0-21 hours. The model COSMO-DE is convection-permitting, which means that it dispenses with a parametrisation of deep convection and simulates deep convection explicitly. One aim is an improved forecast of convective heavy rain events. Convection-permitting models are in operational use at several weather services, but currently not in ensemble mode. It is expected that an ensemble system could reveal the advantages of a convection-permitting model even better. The probabilistic approach is necessary, because the explicit simulation of convective processes for more than a few hours can no longer be viewed as a deterministic forecast. This is due to the chaotic behaviour and short life cycle of the processes which are now simulated explicitly. In the framework of the project COSMO-DE-EPS, DWD is developing and implementing an ensemble prediction system (EPS) for the model COSMO-DE. The project COSMO-DE-EPS comprises the generation of ensemble members, as well as the verification and visualization of the ensemble forecasts and also statistical postprocessing. A pre-operational mode of the EPS with 20 ensemble members is foreseen to start in 2010. Operational use is envisaged to start in 2012, after an upgrade to 40 members and inclusion of statistical postprocessing. The presentation introduces the project COSMO-DE-EPS and describes the design of the ensemble as it is planned for the pre-operational mode. In particular, the currently implemented method for the generation of ensemble members will be explained and discussed. The method includes variations of initial conditions, lateral boundary conditions, and model physics. At present, pragmatic methods are applied which resemble the basic ideas of a multi-model approach
Chellasamy, Menaka; Ferre, Ty; Greve, Mogens Humlekrog
three Multi-Layer Perceptron (MLP) neural networks trained separately with spectral, texture and vegetation indices; classification labels were then assigned based on Endorsement Theory. The present study proposes an approach to feed this ensemble classifier with automatically selected training samples...
Kumpf, Alexander; Tost, Bianca; Baumgart, Marlene; Riemer, Michael; Westermann, Rudiger; Rautenhaus, Marc
In meteorology, cluster analysis is frequently used to determine representative trends in ensemble weather predictions in a selected spatio-temporal region, e.g., to reduce a set of ensemble members to simplify and improve their analysis. Identified clusters (i.e., groups of similar members), however, can be very sensitive to small changes of the selected region, so that clustering results can be misleading and bias subsequent analyses. In this article, we, a team of visualization scientists and meteorologists, deliver visual analytics solutions to analyze the sensitivity of clustering results with respect to changes of a selected region. We propose an interactive visual interface that enables simultaneous visualization of a) the variation in composition of identified clusters (i.e., their robustness), b) the variability in cluster membership for individual ensemble members, and c) the uncertainty in the spatial locations of identified trends. We demonstrate that our solution shows meteorologists how representative a clustering result is, and with respect to which changes in the selected region it becomes unstable. Furthermore, our solution helps to identify those ensemble members which stably belong to a given cluster and can thus be considered similar. In a real-world application case we show how our approach is used to analyze the clustering behavior of different regions in a forecast of "Tropical Cyclone Karl", guiding the user towards the cluster robustness information required for subsequent ensemble analysis.
Pinson, Pierre; Madsen, Henrik
forecasting methodology. In a first stage, ensemble forecasts of meteorological variables are converted to power through a suitable power curve model. This model employs local polynomial regression, and is adaptively estimated with an orthogonal fitting method. The obtained ensemble forecasts of wind power are then converted into predictive distributions with an original adaptive kernel dressing method. The shape of the kernels is driven by a mean-variance model, the parameters of which are recursively estimated in order to maximize the overall skill of the obtained predictive distributions. Such a methodology has...
Antal, Bálint; Hajdu, András
In this paper, an ensemble-based method for the screening of diabetic retinopathy (DR) is proposed. This approach is based on features extracted from the output of several retinal image processing algorithms, such as image-level (quality assessment, pre-screening, AM/FM), lesion-specific (microaneurysms, exudates) and anatomical (macula, optic disc) components. The actual decision about the presence of the disease is then made by an ensemble of machine learning classifiers. We have tested our...
Horton, D. E.; Mankin, J. S.; Singh, D.; Diffenbaugh, N. S.; Swain, D. L.
Synoptic- to regional-scale circulation patterns drive daily surface weather conditions, while the occurrence and persistence of particular circulation patterns can have an outsized influence on the likelihood of hot, cold, wet, and/or dry extremes. Recent theoretical work has posited that increasing atmospheric greenhouse gas concentrations may exert some influence on mid-latitude circulation pattern behavior, though detection of such changes in observational analyses is likely to be challenging due to the natural variability of the climate system. To assess the influence of altered levels of radiative forcing on the behavior of atmospheric circulation patterns in the context of internal variability, we utilize pre-industrial, historical, and future simulations from the CESM1 Large Ensemble (LENS) Community Project single-model, multi-realization framework. Using self-organizing map clustering analysis, we classify summer mid-atmospheric circulation patterns over select regional mid-latitude Northern Hemisphere domains in each LENS experiment. For each region in each LENS experiment we calculate and compare the full distribution of summer pattern occurrence and duration, in addition to high-impact extreme (maximum) pattern persistence for each modeled year. Our results indicate little to no change in the left-tails and mean occurrence and duration of circulation patterns as increasing levels of radiative forcing are assessed. However, robust changes in the extreme right-tails of the duration and maximum persistence distributions are prevalent. Patterns in which robust right-tail changes are identified are diverse, with cyclonic, anticyclonic, zonal, and dipole patterns all represented. Our results indicate that in CESM1 LENS circulation changes due to increasing radiative forcing exist, but only in the potentially high impact tails of the distribution.
Ensembles are a well-established machine learning paradigm, leading to accurate and robust models, predominantly applied to predictive modeling tasks. Ensemble models comprise a finite set of diverse predictive models whose combined output is expected to yield an improved predictive performance as compared to an individual model. In this paper, we propose a new method for learning ensembles of process-based models of dynamic systems. The process-based modeling paradigm employs domain-specific knowledge to automatically learn models of dynamic systems from time-series observational data. Previous work has shown that ensembles based on sampling observational data (i.e., bagging and boosting) significantly improve the predictive performance of process-based models. However, this improvement comes at the cost of a substantial increase in the computational time needed for learning. To address this problem, the paper proposes a method that aims at efficiently learning ensembles of process-based models, while maintaining their accurate long-term predictive performance. This is achieved by constructing ensembles that sample domain-specific knowledge instead of sampling data. We apply the proposed method to, and evaluate its performance on, a set of problems of automated predictive modeling in three lake ecosystems using a library of process-based knowledge for modeling population dynamics. The experimental results identify the optimal design decisions regarding the learning algorithm. The results also show that the proposed ensembles yield significantly more accurate predictions of population dynamics as compared to individual process-based models. Finally, while their predictive performance is comparable to that of ensembles obtained with the state-of-the-art methods of bagging and boosting, they are substantially more efficient.
Limbach, F; Hauswald, C; Lähnemann, J; Wölz, M; Brandt, O; Trampert, A; Hanke, M; Jahn, U; Calarco, R; Geelhaar, L; Riechert, H
Light emitting diodes (LEDs) have been fabricated using ensembles of free-standing (In, Ga)N/GaN nanowires (NWs) grown on Si substrates in the self-induced growth mode by molecular beam epitaxy. Electron-beam-induced current analysis, cathodoluminescence as well as biased μ-photoluminescence spectroscopy, transmission electron microscopy, and electrical measurements indicate that the electroluminescence of such LEDs is governed by the differences in the individual current densities of the single-NW LEDs operated in parallel, i.e. by the inhomogeneity of the current path in the ensemble LED. In addition, the optoelectronic characterization leads to the conclusion that these NWs exhibit N-polarity and that the (In, Ga)N quantum well states in the NWs are subject to a non-vanishing quantum confined Stark effect. (paper)
Qin, Xiwen; Li, Qiaoling; Dong, Xiaogang; Lv, Siqi
Accurate diagnosis of rolling bearing faults has a very important significance for the normal operation of machinery and equipment. A method combining Ensemble Empirical Mode Decomposition (EEMD) and Random Forest (RF) is proposed. Firstly, the original signal is decomposed into several intrinsic mode functions (IMFs) by EEMD, and the effective IMFs are selected. Then their energy entropy is calculated as the feature. Finally, the classification is performed by RF. In addition, the wavelet meth...
Liao, James C. [Univ. of California, Los Angeles, CA (United States)
Ensemble modeling of kinetic systems addresses the challenges of kinetic model construction, with respect to parameter value selection, and still allows for the rich insights possible from kinetic models. This project aimed to show that constructing, implementing, and analyzing such models is a useful tool for the metabolic engineering toolkit, and that they can result in actionable insights from models. Key concepts are developed and deliverable publications and results are presented.
The current study proposes an integrated uncertainty and ensemble-based data assimilation framework (ICEA) and evaluates its viability in providing operational streamflow predictions via assimilating snow water equivalent (SWE) data. This step-wise framework applies a parameter uncertainty analysis algorithm (ISURF) to identify the uncertainty structure of sensitive model parameters, which is subsequently formulated into an Ensemble Kalman Filter (EnKF) to generate updated snow states for streamflow prediction. The framework is coupled to the US National Weather Service (NWS) snow and rainfall-runoff models. Its applicability is demonstrated for an operational basin of a western River Forecast Center (RFC) of the NWS. Performance of the framework is evaluated against the existing operational baseline (RFC predictions), the stand-alone ISURF, and the stand-alone EnKF. Results indicate that the ensemble-mean prediction of ICEA considerably outperforms predictions from the other three scenarios investigated, particularly in the context of predicting high flows (top 5th percentile). The ICEA streamflow ensemble predictions capture the variability of the observed streamflow well; however, the ensemble is not wide enough to consistently contain the range of streamflow observations in the study basin. Our findings indicate that the ICEA has the potential to supplement the current operational (deterministic) forecasting method in terms of providing improved single-valued (e.g., ensemble mean) streamflow predictions as well as meaningful ensemble predictions.
Komori, N.; Enomoto, T.; Miyoshi, T.; Yamazaki, A.; Kuwano-Yoshida, A.; Taguchi, B.
To enhance the capability of the local ensemble transform Kalman filter (LETKF) with the Atmospheric general circulation model (GCM) for the Earth Simulator (AFES), a new system has been developed by replacing AFES with the Coupled atmosphere-ocean GCM for the Earth Simulator (CFES). An initial test of the prototype of the CFES-LETKF system has been completed successfully, assimilating atmospheric observational data (NCEP PREPBUFR, archived at UCAR) every 6 hours to update the atmospheric variables, whereas the oceanic variables are kept unchanged throughout the assimilation procedure. An experimental retrospective analysis-forecast cycle with the coupled system (CLERA-A) starts on August 1, 2008, and the atmospheric initial conditions (63 members) are taken from the second generation of the AFES-LETKF experimental ensemble reanalysis (ALERA2). The ALERA2 analyses are also used as forcing for stand-alone 63-member ensemble simulations with the Ocean GCM for the Earth Simulator (EnOFES), from which the oceanic initial conditions for CLERA-A are taken. The ensemble spread of SST is larger in CLERA-A than in EnOFES, suggesting positive feedback between the ocean and the atmosphere. Although SST in CLERA-A suffers from the biases common among many coupled GCMs, the ensemble spreads of air temperature and specific humidity in the lower troposphere are larger in CLERA-A than in ALERA2. Thus the replacement of AFES with CFES successfully contributes to mitigating an underestimation of the ensemble spread near the surface resulting from the single boundary condition for all ensemble members and the lack of atmosphere-ocean interaction. In addition, the basin-scale structure of surface atmospheric variables over the tropical Pacific is well reconstructed from the ensemble correlation in CLERA-A but not in ALERA2. This suggests that the use of a coupled GCM rather than an atmospheric GCM could be important even for atmospheric reanalysis with an ensemble-based data assimilation system.
Measuring stride variability and dynamics in children is useful for the quantitative study of gait maturation and neuromotor development in childhood and adolescence. In this paper, we computed the sample entropy (SampEn) and average stride interval (ASI) parameters to quantify the stride series of 50 gender-matched children participants in three age groups. We also normalized the SampEn and ASI values by leg length and body mass for each participant, respectively. Results show that the original and normalized SampEn values consistently decrease (significant at the p<0.01 level of the Mann-Whitney U test) in children of 3–14 years old, which indicates that stride irregularity is significantly ameliorated with body growth. The original and normalized ASI values also change significantly when comparing any two of the young (aged 3–5 years), middle (aged 6–8 years), and elder (aged 10–14 years) groups of children. Such results suggest that healthy children may better modulate their gait cadence rhythm with the development of their musculoskeletal and neurological systems. In addition, the AdaBoost.M2 and Bagging algorithms were used to effectively distinguish the children's gait patterns. These ensemble learning algorithms both provided excellent gait classification results in terms of overall accuracy (≥90%), recall (≥0.8), and precision (≥0.8077).
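The sample entropy measure used above can be sketched as follows. The defaults m=2 and r=0.2 follow common practice and are assumptions, as the paper's exact settings are not reproduced here:

```python
import numpy as np

def sampen(x, m=2, r=0.2):
    """Sample entropy: -log(A/B), where B counts pairs of length-m templates
    matching within tolerance r*std (Chebyshev distance), and A counts the
    same for templates of length m+1. Self-matches are excluded."""
    x = np.asarray(x, dtype=float)
    tol = r * x.std()
    def pairs(mm):
        templ = np.array([x[i:i + mm] for i in range(len(x) - mm)])
        c = 0
        for i in range(len(templ) - 1):
            dist = np.max(np.abs(templ[i + 1:] - templ[i]), axis=1)
            c += int(np.sum(dist <= tol))
        return c
    return -np.log(pairs(m + 1) / pairs(m))

rng = np.random.default_rng(42)
s = sampen(rng.normal(size=200))   # white noise is highly irregular
```

A strictly periodic series yields a SampEn near zero, while white noise yields a large value, which is the contrast exploited when comparing stride regularity across age groups.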
Wright, D.; Kirschbaum, D.; Yatheendradas, S.
The considerable uncertainties associated with quantitative precipitation estimates (QPE), whether from satellite platforms, ground-based weather radar, or numerical weather models, suggest that such QPE should be expressed as distributions or ensembles of possible values, rather than as single values. In this research, we borrow a framework from the weather forecast verification community to "correct" satellite precipitation and generate ensemble QPE. This approach is based on the censored shifted gamma distribution (CSGD). The probability of precipitation, the central tendency (i.e. mean), and the uncertainty can be captured by the three parameters of the CSGD. The CSGD can then be applied for simulation of rainfall ensembles using a flexible nonlinear regression framework, whereby the CSGD parameters can be conditioned on one or more reference rainfall datasets and on other time-varying covariates such as modeled or measured estimates of precipitable water and relative humidity. We present the framework and initial results by generating precipitation ensembles based on the Tropical Rainfall Measuring Mission Multi-satellite Precipitation Analysis (TMPA) dataset, using both NLDAS and PERSIANN-CDR precipitation datasets as references. We also incorporate a number of covariates from the MERRA2 reanalysis, including model-estimated precipitation, precipitable water, relative humidity, and lifting condensation level. We explore the prospects for applying the framework and other ensemble error models globally, including in regions where high-quality "ground truth" rainfall estimates are lacking. We compare the ensemble outputs against those of an independent rain gage-based ensemble rainfall dataset. "Pooling" of regional rainfall observations is explored as one option for improving ensemble estimates of rainfall extremes. The approach has potential applications in near-realtime, retrospective, and scenario modeling of rainfall-driven hazards such as floods and landslides.
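The censoring-and-shifting mechanism of the CSGD can be sketched as below. The shape, scale, and shift values are illustrative assumptions; in the actual framework these parameters are conditioned on reference datasets and covariates through the regression described above:

```python
import numpy as np

def csgd_ensemble(k, theta, delta, size, rng):
    """Draw an ensemble from a censored shifted gamma distribution:
    a Gamma(shape=k, scale=theta) variable is shifted by delta and
    censored at zero, so the mass below zero becomes the probability
    of zero precipitation."""
    g = rng.gamma(shape=k, scale=theta, size=size) + delta
    return np.maximum(g, 0.0)   # censor negative values to "no rain"

rng = np.random.default_rng(1)
ens = csgd_ensemble(k=1.2, theta=2.0, delta=-1.0, size=1000, rng=rng)
p_dry = np.mean(ens == 0.0)     # estimated probability of zero precipitation
```

The negative shift delta is what lets a single continuous distribution carry both the dry probability (the censored mass) and the wet-day intensity distribution.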
Random Coding Bounds for DNA Codes Based on Fibonacci Ensembles of DNA Sequences (report; dates covered 6 Jul 08 – 11 Jul 08). The codes are based on DNA sequences which are generalizations of the Fibonacci sequences. Subject terms: DNA Codes, Fibonacci Ensembles, DNA Computing, Code Optimization.
Schnass, Karin; Vandergheynst, Pierre
In this article we present a signal model for classification based on a low dimensional dictionary embedded into the high dimensional signal space. We develop an alternate projection algorithm to find the embedding and the dictionary and finally test the classification performance of our scheme in comparison to Fisher’s LDA.
Ensemble approaches have been shown to enhance classification by combining the outputs from a set of voting classifiers. Diversity in error patterns among base classifiers promotes ensemble performance. Multi-task learning is an important characteristic for Neural Network classifiers. Introducing a secondary output unit that receives different…
Liu, Li; Gao, Chao; Xuan, Weidong; Xu, Yue-Ping
Ensemble flood forecasts by hydrological models using numerical weather prediction products as forcing data are becoming more commonly used in operational flood forecasting applications. In this study, a hydrological ensemble flood forecasting system comprising an automatically calibrated Variable Infiltration Capacity model and quantitative precipitation forecasts from the TIGGE dataset is constructed for the Lanjiang Basin, Southeast China. The impacts of calibration strategies and ensemble methods on the performance of the system are then evaluated. The hydrological model is optimized by the parallel programmed ε-NSGA II multi-objective algorithm. According to the solutions by ε-NSGA II, two differently parameterized models are determined to simulate daily flows and peak flows at each of the three hydrological stations. Then a simple yet effective modular approach is proposed to combine these daily and peak flows at the same station into one composite series. Five ensemble methods and various evaluation metrics are adopted. The results show that ε-NSGA II can provide an objective determination of parameter estimation, and the parallel program permits a more efficient simulation. It is also demonstrated that the forecasts from ECMWF have more favorable skill scores than other Ensemble Prediction Systems. The multimodel ensembles have advantages over all the single-model ensembles, and the multimodel methods weighted on members and skill scores outperform other methods. Furthermore, the overall performance at the three stations can be satisfactory up to ten days; however, the hydrological errors can degrade the skill score by approximately 2 days, and the influence persists until a lead time of 10 days with a weakening trend. With respect to peak flows selected by the Peaks Over Threshold approach, the ensemble means from single models or multimodels are generally underestimated, indicating that the ensemble mean can bring overall improvement in the forecasting of flows. For
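A multimodel combination weighted on skill scores, as mentioned above, can be sketched as follows. The inverse-RMSE weighting rule and all numbers are assumptions for illustration; the study's exact weighting scheme is not reproduced here:

```python
import numpy as np

def skill_weighted_mean(forecasts, train_fc, train_obs):
    """Combine member forecasts with weights proportional to inverse RMSE
    over a training period.

    forecasts: (n_models, n_lead) array of forecasts to combine
    train_fc:  (n_models, n_train) array of past forecasts
    train_obs: (n_train,) array of matching observations
    """
    rmse = np.sqrt(np.mean((train_fc - train_obs) ** 2, axis=1))
    w = (1.0 / rmse) / np.sum(1.0 / rmse)   # normalized inverse-RMSE weights
    return w @ forecasts                     # weighted multimodel forecast

obs = np.array([1.0, 2.0, 3.0])
train = np.array([[1.1, 2.1, 3.1],           # model A: small past errors
                  [2.0, 3.0, 4.0]])          # model B: large past errors
fc = np.array([[4.0], [8.0]])
combined = skill_weighted_mean(fc, train, obs)
```

The combined value lies much closer to the historically skillful model A, which is the intended behavior of weighting on skill rather than averaging members equally.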
Ahmadi, Sepehr; El-Ella, Haitham A. R.; Wojciechowski, Adam M.
We demonstrate magnetic-field sensing using an ensemble of nitrogen-vacancy centers by recording the variation in the pump-light absorption due to the spin-polarization dependence of the total ground-state population. Using a 532 nm pump laser, we measure the absorption of native nitrogen-vacancy centers in a chemical-vapor-deposited diamond placed in a resonant optical cavity. For a laser pump power of 0.4 W and a cavity finesse of 45, we obtain a noise floor of ∼100 nT/√Hz spanning a bandwidth up to 125 Hz. We project a photon shot-noise-limited sensitivity of ∼1 pT/√Hz by optimizing...
Baraldi, P.; Compare, M.; Sauco, S.; Zio, E.
Particle Filtering (PF) is used in prognostics applications by reason of its capability of robustly predicting the future behavior of equipment and, on this basis, its Residual Useful Life (RUL). It is a model-driven approach, as it resorts to analytical models of both the degradation process and the measurement acquisition system. This prevents its applicability to the cases, very common in industry, in which reliable models are lacking. In this work, we propose an original method to extend PF to the case in which an analytical measurement model is not available but a dataset containing pairs «state-measurement» is. The dataset is used to train a bagged ensemble of Artificial Neural Networks (ANNs), which is then embedded in the PF as an empirical measurement model.
Elias D. Nino-Ruiz
In this paper, a matrix-free posterior ensemble Kalman filter implementation based on a modified Cholesky decomposition is proposed. The method works as follows: the precision matrix of the background error distribution is estimated based on a modified Cholesky decomposition. The resulting estimator can be expressed in terms of Cholesky factors, which can be updated based on a series of rank-one matrices in order to approximate the precision matrix of the analysis distribution. By using this matrix, the posterior ensemble can be built by either sampling from the posterior distribution or using synthetic observations. Furthermore, the computational effort of the proposed method is linear with regard to the model dimension and the number of observed components from the model domain. Experimental tests are performed making use of the Lorenz-96 model. The results reveal that the accuracy of the proposed implementation, in terms of root-mean-square error, is similar to, and in some cases better than, that of a well-known ensemble Kalman filter (EnKF) implementation: the local ensemble transform Kalman filter. In addition, the results are comparable to those obtained by the EnKF with large ensemble sizes.
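For orientation, the analysis step of a textbook stochastic (perturbed-observation) EnKF can be sketched as below. This is the standard covariance-based formulation that the matrix-free modified-Cholesky variant above avoids; dimensions and values are illustrative:

```python
import numpy as np

def enkf_update(X, H, y, R, rng):
    """Stochastic EnKF analysis step for a linear observation operator H.

    X: (n_state, n_ens) background ensemble
    H: (n_obs, n_state) observation operator
    y: (n_obs,) observation vector
    R: (n_obs, n_obs) observation-error covariance
    """
    n, m = X.shape
    A = X - X.mean(axis=1, keepdims=True)        # ensemble anomalies
    Pf = A @ A.T / (m - 1)                        # sample background covariance
    K = Pf @ H.T @ np.linalg.inv(H @ Pf @ H.T + R)  # Kalman gain
    # Perturb the observation once per member, then update each member.
    Y = y[:, None] + rng.multivariate_normal(np.zeros(len(y)), R, size=m).T
    return X + K @ (Y - H @ X)

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, size=(3, 50))   # background ensemble far from the obs
H = np.eye(3)
y = np.zeros(3)                          # observation at the origin
R = 0.1 * np.eye(3)
Xa = enkf_update(X, H, y, R, rng)        # analysis mean pulled toward y
```

Forming and inverting Pf explicitly is what becomes infeasible at high model dimension, motivating the linear-cost Cholesky-factor updates described in the paper.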
Xue, Y.; Liu, S.; Hu, Y.; Yang, J.; Chen, Q.
To improve the accuracy in prediction, Genetic Algorithm based Adaptive Neural Network Ensemble (GA-ANNE) is presented. Intersections are allowed between different training sets based on the fuzzy clustering analysis, which ensures the diversity as well as the accuracy of individual Neural Networks (NNs). Moreover, to improve the accuracy of the adaptive weights of individual NNs, GA is used to optimize the cluster centers. Empirical results in predicting carbon flux of Duke Forest reveal that GA-ANNE can predict the carbon flux more accurately than Radial Basis Function Neural Network (RBFNN), Bagging NN ensemble, and ANNE. © 2007 IEEE.
Liu, Richen; Guo, Hanqi; Zhang, Jiang; Yuan, Xiaoru
We propose a longest common subsequence (LCS) based approach to compute the distance among vector field ensembles. By measuring how many common blocks the ensemble pathlines pass through, the LCS distance defines the similarity among vector field ensembles by counting the number of shared domain data blocks. Compared to traditional methods (e.g. point-wise Euclidean distance or dynamic time warping distance), the proposed approach is robust to outliers, missing data, and the sampling rate of pathline timesteps. Taking advantage of smaller, reusable intermediate output, visualization based on the proposed LCS approach reveals temporal trends in the data at low storage cost and avoids tracing pathlines repeatedly. Finally, we evaluate our method on both synthetic data and simulation data, which demonstrates the robustness of the proposed approach.
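The LCS idea can be sketched by representing each pathline as the sequence of domain-block IDs it traverses (an assumed representation of the paper's block decomposition) and normalizing the common-subsequence length into a distance:

```python
def lcs_len(a, b):
    """Length of the longest common subsequence via dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1],
                                                               dp[i + 1][j])
    return dp[-1][-1]

def lcs_distance(a, b):
    """Distance in [0, 1]: 0 means one block sequence contains the other."""
    return 1.0 - lcs_len(a, b) / max(len(a), len(b))

# Two pathlines sharing blocks 1, 2, 4 but diverging at block 3:
print(lcs_distance([1, 2, 3, 4], [1, 2, 4]))  # 0.25
```

Because only block IDs are compared, a dropped or extra timestep shifts the sequence without destroying the shared subsequence, which is the robustness property claimed above.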
Accurate diagnosis of rolling bearing faults has a very important significance for the normal operation of machinery and equipment. A method combining Ensemble Empirical Mode Decomposition (EEMD) and Random Forest (RF) is proposed. Firstly, the original signal is decomposed into several intrinsic mode functions (IMFs) by EEMD, and the effective IMFs are selected. Then their energy entropy is calculated as the feature. Finally, the classification is performed by RF. In addition, the wavelet method is also used in the proposed process, in the same way as EEMD. The results of the comparison show that the EEMD method is more accurate than the wavelet method.
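The energy-entropy feature described above can be sketched as follows. The random surrogate components stand in for real EEMD output, which is an assumption made here purely for illustration:

```python
import numpy as np

def energy_entropy(imfs):
    """Shannon entropy of the normalized per-IMF energies: low when one IMF
    dominates the signal energy, high when energy is spread evenly."""
    e = np.array([np.sum(np.asarray(imf) ** 2) for imf in imfs])
    p = e / e.sum()
    return -np.sum(p * np.log(p))

rng = np.random.default_rng(0)
# Surrogate "IMFs" with decreasing amplitude, mimicking a typical decomposition.
imfs = [rng.normal(scale=s, size=1024) for s in (1.0, 0.5, 0.25)]
h = energy_entropy(imfs)   # scalar feature that would be fed to the RF
```

A fault signature that concentrates energy in a few IMFs lowers this entropy, which is what makes it a discriminative input for the Random Forest classifier.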
Lu, Wei; Li, Zhe; Chu, Jinghui
Breast cancer is a common cancer among women. With the development of modern medical science and information technology, medical imaging techniques have an increasingly important role in the early detection and diagnosis of breast cancer. In this paper, we propose an automated computer-aided diagnosis (CADx) framework for magnetic resonance imaging (MRI). The scheme consists of an ensemble of several machine learning-based techniques, including ensemble under-sampling (EUS) for imbalanced data processing, the Relief algorithm for feature selection, the subspace method for providing data diversity, and Adaboost for improving the performance of base classifiers. We extracted morphological, various texture, and Gabor features. To clarify the feature subsets' physical meaning, subspaces are built by combining morphological features with each kind of texture or Gabor feature. We tested our proposal using a manually segmented Region of Interest (ROI) data set, which contains 438 images of malignant tumors and 1898 images of normal tissues or benign tumors. Our proposal achieves an area under the ROC curve (AUC) value of 0.9617, which outperforms most other state-of-the-art breast MRI CADx systems. Compared with other methods, our proposal significantly reduces the false-positive classification rate. Copyright © 2017 Elsevier Ltd. All rights reserved.
Miguel, E.; Dykema, J.; Satyanath, S.; Anderson, J. G.
Accurate forecasts of regional climate, including temperature and precipitation, have significant implications for human activities, not just economically but socially. Sub-Saharan Africa is a region that has displayed an exceptional propensity for devastating civil wars. Recent research in political economy has revealed a strong statistical relationship between year-to-year fluctuations in precipitation and civil conflict in this region in the 1980s and 1990s. Investigating how climate change may modify the regional risk of civil conflict in the future requires a probabilistic regional forecast that explicitly accounts for the community's uncertainty in the evolution of rainfall under anthropogenic forcing. We approach the regional climate prediction aspect of this question through the application of a recently demonstrated method called generalized scalar prediction (Leroy et al. 2009), which predicts arbitrary scalar quantities of the climate system. This method can predict change in any variable or linear combination of variables of the climate system averaged over a wide range of spatial scales, from regional to hemispheric to global. Generalized scalar prediction utilizes an ensemble of model predictions to represent the community's uncertainty range in climate modeling, in combination with a time series of any type of observational data that exhibits sensitivity to the scalar of interest. It is not necessary to prioritize models in deriving the final prediction. We present the results of the application of generalized scalar prediction for regional forecasts of temperature and precipitation in Sub-Saharan Africa. We use the climate predictions along with the established statistical relationship between year-to-year rainfall variability and civil conflict in Sub-Saharan Africa to investigate the potential impact of climate change on civil conflict within that region.
Landsgesell, Jonas; Holm, Christian; Smiatek, Jens
We present a novel method for the study of weak polyelectrolytes and general acid-base reactions in molecular dynamics and Monte Carlo simulations. The approach combines the advantages of the reaction ensemble and the Wang-Landau sampling method. Deprotonation and protonation reactions are simulated explicitly with the help of the reaction ensemble method, while the accurate sampling of the corresponding phase space is achieved by the Wang-Landau approach. The combination of both techniques provides a sufficient statistical accuracy such that meaningful estimates for the density of states and the partition sum can be obtained. With regard to these estimates, several thermodynamic observables like the heat capacity or reaction free energies can be calculated. We demonstrate that the computation times for the calculation of titration curves with a high statistical accuracy can be significantly decreased when compared to the original reaction ensemble method. The applicability of our approach is validated by the study of weak polyelectrolytes and their thermodynamic properties.
Aakerberg, Andreas; Nasrollahi, Kamal; Heder, Thomas
Augmenting RGB images with depth information is a well-known method to significantly improve the recognition accuracy of object recognition models. Another method to improve the performance of visual recognition models is ensemble learning. However, this method has not been widely explored in combination with deep convolutional neural network based RGB-D object recognition models. Hence, in this paper, we form different ensembles of complementary deep convolutional neural network models, and show that this can be used to increase the recognition performance beyond existing limits. Experiments on the Washington RGB-D Object Dataset show that our best performing ensemble improves the recognition performance by 0.7% compared to using the baseline model alone.
Qi, Chengming; Wang, Yuping; Tian, Wenjie; Wang, Qun
The performance of kernel-based methods, such as the support vector machine (SVM), is greatly affected by the choice of kernel function. Multiple kernel learning (MKL) is a promising family of machine learning algorithms that has attracted much attention in recent years. MKL combines multiple sub-kernels to seek better results than single kernel learning. In order to improve the efficiency of SVM and MKL, in this paper the Kullback–Leibler kernel function is derived to develop the SVM. The proposed method employs an improved ensemble learning framework, named KLMKB, which applies Adaboost to learn multiple kernel-based classifiers. In the experiment on hyperspectral remote sensing image classification, we employ features selected through the Optimum Index Factor (OIF) to classify the satellite image. We extensively examine the performance of our approach in comparison to some relevant and state-of-the-art algorithms on a number of benchmark classification data sets and a hyperspectral remote sensing image data set. Experimental results show that our method has a stable behavior and a noticeable accuracy on different data sets.
Full Text Available The ideal state of continuous one-piece flow may never be achieved. Still, the logistics manager can improve the flow by carefully positioning inventory to buffer against variations. Strategies such as lean, postponement, mass customization, and outsourcing all rely on strategic positioning of decoupling points to separate forecast-driven from customer-order-driven flows. Planning and scheduling of the flow are also based on classification of decoupling points as master scheduled or not. A comprehensive classification scheme for these types of decoupling points is introduced. The approach rests on identification of flows as being either demand based or supply based. The demand or supply is then combined with exogenous factors, classified as independent, or endogenous factors, classified as dependent. As a result, eight types of strategic as well as tactical decoupling points are identified, resulting in a process-based framework for inventory classification that can be used for flow design.
Leeuwenburgh, O.; Brouwer, J.; Trani, M.
Assisted history matching methods are beginning to offer the possibility to use 4D seismic data in quantitative ways for reservoir characterization. We use the waveform data without any explicit inversion or interpretation step directly in an ensemble-based assisted history matching scheme with a 3D
Kosek, Anna Magdalena; Gehrke, Oliver
on an ensemble of non-linear artificial neural network DER models which detect and evaluate anomalies in DER operation. The proposed method is validated against measurement data which yields a precision of 0.947 and an accuracy of 0.976. This improves the precision and accuracy of a classic model-based anomaly...
Full Text Available Objective. This study aims to establish a model to analyze the clinical experience of veteran TCM doctors. We propose an ensemble learning based framework to analyze clinical records with ICD-10 label information for effective diagnosis and acupoint recommendation. Methods. We propose an ensemble learning framework for the analysis task. A set of base learners composed of decision trees (DT) and support vector machines (SVM) are trained by bootstrapping the training dataset. The base learners are sorted by accuracy and diversity through a nondominated sort (NDS) algorithm and combined through a deep ensemble learning strategy. Results. We evaluate the proposed method in comparison with two currently successful methods on a clinical diagnosis dataset with manually labeled ICD-10 information. ICD-10 label annotation and acupoint recommendation are evaluated for the three methods. The proposed method achieves an accuracy rate of 88.2% ± 2.8% measured by zero-one loss for the first evaluation session and 79.6% ± 3.6% measured by Hamming loss, which are superior to the other two methods. Conclusion. The proposed ensemble model can effectively model the implied knowledge and experience in historic clinical data records. The computational cost of training a set of base learners is relatively low.
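The bootstrapped DT/SVM base-learner scheme described in the Methods can be sketched roughly as below. The iris data, the accuracy-only ranking (a simplification of the NDS accuracy-and-diversity sort), and the plain majority vote are assumptions for illustration, not the paper's actual pipeline.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
learners = []
for i in range(10):
    # Bootstrap the training set and alternate DT / SVM base learners.
    idx = rng.choice(len(X_tr), size=len(X_tr), replace=True)
    base = DecisionTreeClassifier(random_state=i) if i % 2 == 0 else SVC()
    base.fit(X_tr[idx], y_tr[idx])
    learners.append((base.score(X_tr, y_tr), base))

# Keep the most accurate half (a stand-in for the NDS-based selection),
# then combine the kept learners by majority vote.
learners.sort(key=lambda t: t[0], reverse=True)
kept = [m for _, m in learners[:5]]
votes = np.stack([m.predict(X_te) for m in kept])
pred = np.apply_along_axis(lambda c: np.bincount(c).argmax(), 0, votes)
acc = float(np.mean(pred == y_te))
```

Bootstrapping plus heterogeneous base models is what supplies the diversity that the subsequent selection step can then trade off against accuracy.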
Full Text Available Knowledge on the contribution of observations to forecast accuracy is crucial for the refinement of observing and data assimilation systems. Several recent publications highlighted the benefits of efficiently approximating this observation impact using adjoint methods or ensembles. This study proposes a modification of an existing method for computing observation impact in an ensemble-based data assimilation and forecasting system and applies the method to a pre-operational, convective-scale regional modelling environment. Instead of the analysis, the modified approach uses observation-based verification metrics to mitigate the effect of correlation between the forecast and its verification norm. Furthermore, a peculiar property in the distribution of individual observation impact values is used to define a reliability indicator for the accuracy of the impact approximation. Applying this method to a 3-day test period shows that a well-defined observation impact value can be approximated for most observation types and the reliability indicator successfully depicts where results are not significant.
Yu, Lean; Wang, Shouyang; Lai, Kin Keung
In this study, an empirical mode decomposition (EMD) based neural network ensemble learning paradigm is proposed for world crude oil spot price forecasting. For this purpose, the original crude oil spot price series were first decomposed into a finite, and often small, number of intrinsic mode functions (IMFs). Then a three-layer feed-forward neural network (FNN) model was used to model each of the extracted IMFs, so that the tendencies of these IMFs could be accurately predicted. Finally, the prediction results of all IMFs are combined with an adaptive linear neural network (ALNN), to formulate an ensemble output for the original crude oil price series. For verification and testing, two main crude oil price series, West Texas Intermediate (WTI) crude oil spot price and Brent crude oil spot price, are used to test the effectiveness of the proposed EMD-based neural network ensemble learning methodology. Empirical results obtained demonstrate attractiveness of the proposed EMD-based neural network ensemble learning paradigm. (author)
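A minimal sketch of the decompose-model-combine paradigm, with two caveats: a moving-average split stands in for the actual EMD (a real implementation would extract IMFs with an EMD library), and the synthetic series, lag length, and network sizes are illustrative assumptions.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
t = np.arange(400, dtype=float)
# Synthetic "price" series: slow and fast oscillations plus noise.
price = (10 + 2 * np.sin(2 * np.pi * t / 30)
         + 0.5 * np.sin(2 * np.pi * t / 7) + rng.normal(0, 0.2, 400))

# Stand-in for EMD: split into a smooth component and a fast residual.
kernel = np.ones(15) / 15
smooth = np.convolve(price, kernel, mode="same")
components = [price - smooth, smooth]

def to_xy(series, lag):
    X = np.stack([series[i:i + lag] for i in range(len(series) - lag)])
    return X, series[lag:]

lag, split = 5, 300
comp_preds = []
for comp in components:
    X, y = to_xy(comp, lag)
    net = MLPRegressor(hidden_layer_sizes=(16,), max_iter=3000,
                       random_state=0)
    net.fit(X[:split], y[:split])  # one small feed-forward net per component
    comp_preds.append(net.predict(X))

# Adaptive linear combiner over the component forecasts
# (stand-in for the ALNN), fitted on the training segment only.
stacked = np.stack(comp_preds, axis=1)
target = price[lag:]
combiner = LinearRegression().fit(stacked[:split], target[:split])
forecast = combiner.predict(stacked[split:])
rmse = float(np.sqrt(np.mean((forecast - target[split:]) ** 2)))
```

The point of the decomposition is that each component is smoother and easier to model than the raw series, so the per-component errors that the combiner aggregates are small.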
Full Text Available Individual tree crown segmentation from Airborne Laser Scanning data is a central problem in forest remote sensing. Focusing on single-layered spruce and fir dominated coniferous forests, this article addresses the problem of directly estimating 3D segment shape uncertainty (i.e., without field/reference surveys) using a probabilistic approach. First, a coarse segmentation (marker-controlled watershed) is applied. Then, the 3D alpha hull and several descriptors are computed for each segment. Based on these descriptors, the alpha hulls are grouped to form ensembles (i.e., groups of similar tree shapes). By examining how frequently regions of a shape occur within an ensemble, it is possible to assign a shape probability to each point within a segment. The shape probability can subsequently be thresholded to obtain improved (filtered) tree segments. Results indicate this approach can be used to produce segmentation reliability maps. A comparison to manually segmented tree crowns also indicates that the approach is able to produce more reliable tree shapes than the initial (unfiltered) segmentation.
Jobez, Pierre; Laplane, Cyril; Timoney, Nuala; Gisin, Nicolas; Ferrier, Alban; Goldner, Philippe; Afzelius, Mikael
Long-lived quantum memories are essential components of a long-standing goal of remote distribution of entanglement in quantum networks. These can be realized by storing the quantum states of light as single-spin excitations in atomic ensembles. However, spin states are often subjected to different dephasing processes that limit the storage time, which in principle could be overcome using spin-echo techniques. Theoretical studies suggest this to be challenging due to unavoidable spontaneous emission noise in ensemble-based quantum memories. Here, we demonstrate spin-echo manipulation of a mean spin excitation of 1 in a large solid-state ensemble, generated through storage of a weak optical pulse. After a storage time of about 1 ms we optically read-out the spin excitation with a high signal-to-noise ratio. Our results pave the way for long-duration optical quantum storage using spin-echo techniques for any ensemble-based memory.
Full Text Available Alzheimer's is a neurodegenerative disease caused by the destruction and death of brain neurons, resulting in memory loss, impaired thinking ability, and certain behavioral changes. Alzheimer's disease is a major cause of dementia, and eventually death, all around the world. Early diagnosis of the disease is crucial, as it can help the victims maintain their level of independence for a comparatively longer time and live the best life possible. For early detection of Alzheimer's disease, we propose a novel approach based on the fusion of multiple types of features, including hemodynamic, volumetric and textural features of the brain. Our approach uses non-invasive fMRI with an ensemble of classifiers for the classification of normal controls and Alzheimer's patients. For performance evaluation, ten-fold cross validation is used. Individual feature sets and fusion of features have been investigated with ensemble classifiers for successful classification of Alzheimer's patients from normal controls. It is observed that fusion of features resulted in improved accuracy, specificity and sensitivity.
Avogadri, Roberto; Valentini, Giorgio
Two major problems related to the unsupervised analysis of gene expression data are the accuracy and reliability of the discovered clusters, and the biological fact that the boundaries between classes of patients or classes of functionally related genes are sometimes not clearly defined. The main goal of this work is the exploration of new strategies and the development of new clustering methods to improve the accuracy and robustness of clustering results, taking into account the uncertainty underlying the assignment of examples to clusters in the context of gene expression data analysis. We propose a fuzzy ensemble clustering approach both to improve the accuracy of clustering results and to take into account the inherent fuzziness of biological and bio-medical gene expression data. We applied random projections that obey the Johnson-Lindenstrauss lemma to obtain several instances of lower dimensional gene expression data from the original high-dimensional ones, approximately preserving the information and the metric structure of the original data. Then we adopt a double fuzzy approach to obtain a consensus ensemble clustering, by first applying a fuzzy k-means algorithm to the different instances of the projected low-dimensional data and then by using a fuzzy t-norm to combine the multiple clusterings. Several variants of the fuzzy ensemble clustering algorithms are proposed, according to different techniques to combine the base clusterings and to obtain the final consensus clustering. We applied our proposed fuzzy ensemble methods to the gene expression analysis of leukemia, lymphoma, adenocarcinoma and melanoma patients, and we compared the results with other state of the art ensemble methods. Results show that in some cases, taking into account the natural fuzziness of the data, we can improve the discovery of classes of patients defined at bio-molecular level. The reduction of the dimension of the data, achieved through random
Lee, Kyewon; Ahn, Hongshik; Moon, Hojin; Kodell, Ralph L; Chen, James J
This article proposes a method for multiclass classification problems using ensembles of multinomial logistic regression models. A multinomial logit model is used as a base classifier in ensembles from random partitions of predictors. The multinomial logit model can be applied to each mutually exclusive subset of the feature space without variable selection. By combining multiple models the proposed method can handle a huge database without a constraint needed for analyzing high-dimensional data, and the random partition can improve the prediction accuracy by reducing the correlation among base classifiers. The proposed method is implemented using R, and the performance including overall prediction accuracy, sensitivity, and specificity for each category is evaluated on two real data sets and simulation data sets. To investigate the quality of prediction in terms of sensitivity and specificity, the area under the receiver operating characteristic (ROC) curve (AUC) is also examined. The performance of the proposed model is compared to a single multinomial logit model and it shows a substantial improvement in overall prediction accuracy. The proposed method is also compared with other classification methods such as the random forest, support vector machines, and random multinomial logit model.
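The random-partition idea can be sketched with scikit-learn's multinomial logistic regression; the digits data, the number of partitions, and probability averaging as the combination rule are illustrative assumptions rather than the article's exact setup.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
n_parts = 4
perm = rng.permutation(X.shape[1])
blocks = np.array_split(perm, n_parts)  # mutually exclusive feature subsets

# One multinomial logit model per feature block, no variable selection;
# combine by averaging their predicted class probabilities.
models = [
    LogisticRegression(max_iter=2000).fit(X_tr[:, b], y_tr) for b in blocks
]
proba = np.mean(
    [m.predict_proba(X_te[:, b]) for m, b in zip(models, blocks)], axis=0
)
acc = float(np.mean(proba.argmax(axis=1) == y_te))
```

Because the feature blocks are disjoint, each base model is small regardless of the total dimensionality, and the blocks' predictions are only weakly correlated, which is what the averaging exploits.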
Raahul, A.; Sapthagiri, R.; Pankaj, K.; Vijayarajan, V.
Gender identification is one of the major problems in speech analysis today: the gender of a speaker can be traced from acoustic features such as pitch, median and frequency. Machine learning gives promising results for classification problems across research domains, and several performance metrics exist for evaluating algorithms. We propose a comparative model for evaluating five different machine learning algorithms, Linear Discriminant Analysis (LDA), K-Nearest Neighbour (KNN), Classification and Regression Trees (CART), Random Forest (RF) and Support Vector Machine (SVM), on gender classification from acoustic data against eight different metrics. The main criterion in evaluating any algorithm is its performance: in classification problems the misclassification rate must be low, which is to say the accuracy rate must be high. The location and gender of a person have become crucial in economic markets such as AdSense. With this comparative model we assess the different ML algorithms and find the best fit for gender classification of acoustic data.
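A hedged sketch of such a five-algorithm comparison using scikit-learn, with cross-validated accuracy as a single stand-in metric (the study uses eight metrics and real acoustic features; synthetic data stands in here):

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Stand-in for the acoustic gender data: a synthetic binary problem.
X, y = make_classification(n_samples=500, n_features=8, n_informative=5,
                           random_state=0)

models = {
    "LDA": LinearDiscriminantAnalysis(),
    "KNN": KNeighborsClassifier(),
    "CART": DecisionTreeClassifier(random_state=0),
    "RF": RandomForestClassifier(random_state=0),
    "SVM": SVC(),
}
# 5-fold cross-validated mean accuracy per algorithm.
scores = {name: cross_val_score(m, X, y, cv=5).mean()
          for name, m in models.items()}
best = max(scores, key=scores.get)
```

Extending this loop to the other metrics amounts to passing different `scoring` arguments (e.g. `"precision"`, `"recall"`, `"roc_auc"`) to `cross_val_score`.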
Blanco, A; Rodriguez, R; Martinez-Maranon, I
Mackerel is an undervalued fish captured by European fishing vessels. One way to add value to this species is to classify it according to its sex. Colour measurements were performed on gonads extracted from mackerel females and males (fresh and defrosted) to find differences between the sexes. Several linear and non-linear classifiers, such as Support Vector Machines (SVM), k-Nearest Neighbors (k-NN) or Diagonal Linear Discriminant Analysis (DLDA), can be applied to this problem. However, they are usually based on Euclidean distances, which fail to reflect the sample proximities accurately. Classifiers based on non-Euclidean dissimilarities misclassify a different set of patterns. We combine different kinds of dissimilarity-based classifiers, inducing diversity by considering a set of complementary dissimilarities for each model. The experimental results suggest that our algorithm helps to improve classifiers based on a single dissimilarity.
Yamada, Ikuya; Sato, Motoki; Shindo, Hiroyuki
This paper describes our approach for the triple scoring task at the WSDM Cup 2017. The task required participants to assign a relevance score for each pair of entities and their types in a knowledge base in order to enhance the ranking results in entity retrieval tasks. We propose an approach wherein the outputs of multiple neural network classifiers are combined using a supervised machine learning model. The experimental results showed that our proposed method achieved the best performance ...
Zheng, Jinde; Pan, Haiyang; Cheng, Junsheng
To detect incipient rolling bearing failures in time and locate faults accurately, a novel rolling bearing fault diagnosis method is proposed based on composite multiscale fuzzy entropy (CMFE) and ensemble support vector machines (ESVMs). Fuzzy entropy (FuzzyEn), an improvement of sample entropy (SampEn), is a new nonlinear method for measuring the complexity of time series. Since FuzzyEn (or SampEn) at a single scale cannot reflect the complexity effectively, multiscale fuzzy entropy (MFE) is developed by defining the FuzzyEns of coarse-grained time series, which represent the system dynamics at different scales. However, MFE values are affected by the data length, especially when the data are not long enough. By combining the information of multiple coarse-grained time series at the same scale, the CMFE algorithm proposed in this paper enhances MFE as well as FuzzyEn. Compared with MFE, as the scale factor increases, CMFE obtains much more stable and consistent values for short-term time series. In this paper CMFE is employed to measure the complexity of vibration signals of rolling bearings and to extract the nonlinear features hidden in the vibration signals. The physical reasons why CMFE is suitable for rolling bearing fault diagnosis are also explored. Based on these, to fulfill automatic fault diagnosis, an ensemble-SVM-based multi-classifier is constructed for the intelligent classification of fault features. Finally, the proposed fault diagnosis method is applied to experimental data, and the results indicate that it can effectively distinguish different fault categories and severities of rolling bearings.
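The CMFE computation described above can be sketched directly in NumPy. The parameter choices (m = 2, r = 0.15·SD, a squared-exponential membership function) follow common FuzzyEn conventions and may differ from the paper's exact settings; the composite step averages FuzzyEn over all shifted coarse-grainings at a scale.

```python
import numpy as np

def fuzzy_entropy(x, m=2, r=0.15, n=2):
    """FuzzyEn: similarity of embedded vectors under a fuzzy membership."""
    x = np.asarray(x, dtype=float)
    r = r * np.std(x)
    def phi(m):
        # Embed, then remove each vector's own mean (baseline removal).
        emb = np.stack([x[i:i + m] for i in range(len(x) - m)])
        emb = emb - emb.mean(axis=1, keepdims=True)
        d = np.max(np.abs(emb[:, None, :] - emb[None, :, :]), axis=2)
        sim = np.exp(-(d ** n) / r)       # fuzzy membership degree
        np.fill_diagonal(sim, 0.0)
        return sim.sum() / (len(emb) * (len(emb) - 1))
    return np.log(phi(m) / phi(m + 1))

def cmfe(x, scale, m=2, r=0.15):
    """Composite MFE: average FuzzyEn over all shifted coarse-grainings."""
    x = np.asarray(x, dtype=float)
    vals = []
    for k in range(scale):
        n_pts = (len(x) - k) // scale
        cg = x[k:k + n_pts * scale].reshape(n_pts, scale).mean(axis=1)
        vals.append(fuzzy_entropy(cg, m, r))
    return float(np.mean(vals))

rng = np.random.default_rng(0)
noise = rng.normal(size=600)
tone = np.sin(2 * np.pi * np.arange(600) / 25)
# White noise is more complex than a pure tone at every scale.
e_noise = cmfe(noise, scale=3)
e_tone = cmfe(tone, scale=3)
```

Averaging over the `scale` shifted coarse-grainings is precisely what stabilizes the entropy estimate for short records, since a single coarse-graining discards all but every `scale`-th starting offset.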
Räisänen, J.; Ruokolainen, L.
Probabilistic forecasts of near-term climate change are derived by using a multimodel ensemble of climate change simulations and a simple resampling technique that increases the number of realizations for the possible combination of anthropogenic climate change and internal climate variability. The technique is based on the assumption that the probability distribution of local climate changes is only a function of the all-model mean global average warming. Although this is unlikely to be exac...
Mons, Vincent; Wang, Qi; Zaki, Tamer
Reconstructing the characteristics of a scalar source from limited remote measurements in a turbulent flow is a problem of great interest for environmental monitoring, and is challenging due to several aspects. Firstly, the numerical estimation of the scalar dispersion in a turbulent flow requires significant computational resources. Secondly, in actual practice, only a limited number of observations are available, which generally makes the corresponding inverse problem ill-posed. Ensemble-based variational data assimilation techniques are adopted to solve the problem of scalar source localization in a turbulent channel flow at Reτ = 180 . This approach combines the components of variational data assimilation and ensemble Kalman filtering, and inherits the robustness of the former and the ease of implementation of the latter. An ensemble-based methodology for optimal sensor placement is also proposed in order to improve the conditioning of the inverse problem, which enhances the performance of the data assimilation scheme. This work has been partially funded by the Office of Naval Research (Grant N00014-16-1-2542) and by the National Science Foundation (Grant 1461870).
Mohammad S. Islam
Full Text Available Decoding neural activities related to voluntary and involuntary movements is fundamental to understanding human brain motor circuits and neuromotor disorders, and can lead to the development of neuromotor prosthetic devices for neurorehabilitation. This study explores using recorded deep brain local field potentials (LFPs) for robust movement decoding in Parkinson's disease (PD) and dystonia patients. The LFP data from voluntary movement activities, such as left- and right-hand index finger clicking, were recorded from patients who underwent surgery for implantation of deep brain stimulation electrodes. Movement-related LFP signal features were extracted by computing instantaneous power related to the motor response in different neural frequency bands. An innovative neural network ensemble classifier has been proposed and developed for accurate prediction of finger movement and its forthcoming laterality. The ensemble classifier contains three base neural network classifiers, namely, feedforward, radial basis, and probabilistic neural networks. The majority voting rule is used to fuse the decisions of the three base classifiers into the final decision of the ensemble classifier. The overall decoding performance reaches a level of agreement (kappa value) of about 0.729±0.16 for decoding movement from the resting state and about 0.671±0.14 for decoding left and right visually cued movements.
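The three-network majority vote can be approximated with scikit-learn stand-ins: an MLP for the feedforward net, an RBF-kernel SVM for the radial basis net, and Gaussian naive Bayes as a rough proxy for the probabilistic neural network. The data are synthetic; everything here is illustrative, not the study's actual decoder.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Synthetic stand-in for band-power LFP features (two movement classes).
X, y = make_classification(n_samples=600, n_features=12, n_informative=6,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

vote = VotingClassifier(
    estimators=[
        ("ff", MLPClassifier(hidden_layer_sizes=(32,), max_iter=1500,
                             random_state=0)),
        ("rbf", SVC(kernel="rbf")),
        ("pnn", GaussianNB()),
    ],
    voting="hard",  # majority voting rule, as in the study
)
vote.fit(X_tr, y_tr)
acc = float(vote.score(X_te, y_te))
```

With `voting="hard"`, each base classifier contributes one vote per sample and ties are broken by class order, mirroring a plain majority rule over three decoders.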
Full Text Available Correlated information between different views is useful for learning from multi-view data, and canonical correlation analysis (CCA) plays an important role in extracting it. However, CCA only extracts the correlated information between paired data and cannot preserve the correlated information between within-class samples. In this paper, we propose a two-view semi-supervised learning method called semi-supervised random correlation ensemble based on spectral clustering (SS_RCE). SS_RCE uses a multi-view method based on spectral clustering, which takes advantage of the discriminative information in multiple views to estimate the labels of unlabeled samples. In order to enhance the discriminative power of CCA features, we incorporate the label information of both unlabeled and labeled samples into CCA. Then, we use random correlation between within-class samples across views to extract diverse correlated features for training component classifiers. Furthermore, we extend a general model, named SSMV_RCE, to construct an ensemble method that tackles semi-supervised learning in the presence of multiple views. Finally, we compare the proposed methods with existing multi-view feature extraction methods using multi-view semi-supervised ensembles. Experimental results on various multi-view data sets demonstrate the effectiveness of the proposed methods.
Barnard, P. A.; Arellano, A. F.
Data assimilation has emerged as an integral part of numerical weather prediction (NWP). More recently, atmospheric chemistry processes have been incorporated into NWP models to provide forecasts and guidance on air quality. There is, however, a unique opportunity within this coupled system to investigate the additional benefit of constraining model dynamics and physics due to chemistry. Several studies have reported the strong interaction between chemistry and meteorology through radiation, transport, emission, and cloud processes. To examine its importance to NWP, we conduct an ensemble-based sensitivity analysis of meteorological fields to the chemical and aerosol fields within the Weather Research and Forecasting model coupled with Chemistry (WRF-Chem) and the Data Assimilation Research Testbed (DART) framework. In particular, we examine the sensitivity of the forecasts of surface temperature and related dynamical fields to the initial conditions of dust and aerosol concentrations in the model over the continental United States within the summer 2008 time period. We use an ensemble of meteorological and chemical/aerosol predictions within WRF-Chem/DART to calculate the sensitivities. This approach is similar to recent ensemble-based sensitivity studies in NWP. The use of an ensemble prediction is appealing because the analysis does not require the adjoint of the model, which to a certain extent becomes a limitation due to the rapidly evolving models and the increasing number of different observations. Here, we introduce this approach as applied to atmospheric chemistry. We also show our initial results of the calculated sensitivities from joint assimilation experiments using a combination of conventional meteorological observations from the National Centers for Environmental Prediction, retrievals of aerosol optical depth from NASA's Moderate Resolution Imaging Spectroradiometer, and retrievals of carbon monoxide from NASA's Measurements of Pollution in the
Higashi, Susan; Barreto, André da Motta Salles; Cantão, Maurício Egidio; de Vasconcelos, Ana Tereza Ribeiro
An essential step of a metagenomic study is the taxonomic classification, that is, the identification of the taxonomic lineage of the organisms in a given sample. The taxonomic classification process involves a series of decisions. Currently, in the context of metagenomics, such decisions are usually based on empirical studies that consider one specific type of classifier. In this study we propose a general framework for analyzing the impact that several decisions can have on the classification problem. Instead of focusing on any specific classifier, we define a generic score function that provides a measure of the difficulty of the classification task. Using this framework, we analyze the impact of the following parameters on the taxonomic classification problem: (i) the length of n-mers used to encode the metagenomic sequences, (ii) the similarity measure used to compare sequences, and (iii) the type of taxonomic classification, which can be conventional or hierarchical, depending on whether the classification process occurs in a single shot or in several steps according to the taxonomic tree. We defined a score function that measures the degree of separability of the taxonomic classes under a given configuration induced by the parameters above. We conducted an extensive computational experiment and found that reasonable values for the parameters of interest could be (i) intermediate values of n, the length of the n-mers; (ii) any similarity measure, because all of them resulted in similar scores; and (iii) the hierarchical strategy, which performed better in all of the cases. As expected, short n-mers generate lower configuration scores because they give rise to frequency vectors that represent distinct sequences in a similar way. On the other hand, large values for n result in sparse frequency vectors that represent similar metagenomic fragments differently, also leading to low configuration scores. Regarding the similarity measure, in
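The n-mer frequency encoding at the heart of this analysis can be sketched as follows; the fragment lengths, base compositions, and the use of cosine similarity are illustrative assumptions.

```python
from itertools import product

import numpy as np

def nmer_profile(seq, n):
    """Frequency vector of overlapping n-mers over the DNA alphabet."""
    kmers = ["".join(p) for p in product("ACGT", repeat=n)]
    index = {k: i for i, k in enumerate(kmers)}
    v = np.zeros(len(kmers))
    for i in range(len(seq) - n + 1):
        v[index[seq[i:i + n]]] += 1
    return v / max(v.sum(), 1)  # normalize to frequencies

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
# Two fragments from one (AT-rich) composition, one from a GC-rich one.
frag_a = "".join(rng.choice(list("ACGT"), 300, p=[0.4, 0.1, 0.1, 0.4]))
frag_b = "".join(rng.choice(list("ACGT"), 300, p=[0.4, 0.1, 0.1, 0.4]))
frag_c = "".join(rng.choice(list("ACGT"), 300, p=[0.1, 0.4, 0.4, 0.1]))

# Fragments drawn from the same composition are closer in n-mer space.
sim_same = cosine(nmer_profile(frag_a, 3), nmer_profile(frag_b, 3))
sim_diff = cosine(nmer_profile(frag_a, 3), nmer_profile(frag_c, 3))
```

The n-versus-sparsity trade-off discussed above is visible here: with 4^n dimensions, n = 3 already gives 64 bins for a 300-base fragment, and much larger n would leave most bins at zero.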
Orosco, Eugenio; Lopez, Natalia; Soria, Carlos; di Sciascio, Fernando
In this paper the bispectrum is used to classify human arm movements and to control a robotic arm based on surface electromyogram (sEMG) signals of the upper limb. We use the bispectrum based on the third-order cumulant to parameterize sEMG signals and to classify elbow flexion and extension, forearm pronation and supination, and rest states with an artificial neural network (ANN). Finally, a robotic manipulator is controlled based on the classification and the parameters extracted from the signals. The entire process runs in real time under the QNX® operating system.
Mechanism-based classification of drug exposure in pharmacoepidemiological studies. In pharmacoepidemiology and pharmacovigilance, the relation between drug exposure and clinical outcomes is crucial. Exposure classification in pharmacoepidemiological studies is traditionally based on
Li, Zan; Geng, Zhi-Rong; Zhang, Cui; Wang, Xiao-Bo; Wang, Zhi-Lin
Considering the significant role of plasma homocysteine in physiological processes, two ensembles (F465-Cu(2+) and F508-Cu(2+)) were constructed based on a BODIPY (4,4-difluoro-1,3,5,7-tetramethyl-4-bora-3a,4a-diaza-s-indacene) scaffold conjugated with an azamacrocyclic (1,4,7-triazacyclononane and 1,4,7,10-tetraazacyclododecane) Cu(2+) complex. The results of this effort demonstrated that the F465-Cu(2+) ensemble could be employed to detect homocysteine in the presence of other biologically relevant species, including cysteine and glutathione, under physiological conditions with high selectivity and sensitivity in the turn-on fluorescence mode, while the F508-Cu(2+) ensemble showed no fluorescence responses toward biothiols. A possible mechanism for this specificity toward homocysteine, involving the formation of a homocysteine-induced six-membered ring sandwich structure, was proposed and confirmed for the first time by time-dependent fluorescence spectra, ESI-MS and EPR. The detection limit of homocysteine in deproteinized human serum was calculated to be 241.4 nM with a linear range of 0-90.0 μM, and the detection limit of F465 for Cu(2+) is 74.7 nM with a linear range of 0-6.0 μM (F508, 80.2 nM, 0-7.0 μM). We have demonstrated the application of the F465-Cu(2+) ensemble for detecting homocysteine in human serum and monitoring the activity of cystathionine β-synthase in vitro. Copyright © 2015 Elsevier B.V. All rights reserved.
Nandagopalan, S; Suryanarayana, Adiga B; Sudarshan, T S B; Chandrashekar, Dhanalakshmi; Manjunath, C N
This paper proposes a novel method to analyze and classify cardiovascular ultrasound echocardiographic images using a Naïve-Bayesian model via database OLAP-SQL. Efficient data mining algorithms based on a tightly-coupled model are used to extract features. Three algorithms are proposed for classification, namely Naïve-Bayesian Classifier for Discrete variables (NBCD) with SQL, NBCD with OLAP-SQL, and Naïve-Bayesian Classifier for Continuous variables (NBCC) using OLAP-SQL. The proposed model is trained with 207 patient images containing normal and abnormal categories. Out of the three proposed algorithms, the highest classification accuracy of 96.59% was achieved by NBCC, which is better than earlier methods.
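A minimal sketch of the NBCC idea, a Naïve-Bayes classifier for continuous variables built from per-class mean/variance aggregates (exactly the statistics an OLAP-SQL GROUP BY with AVG/VARIANCE could supply); the data layout and names here are assumptions of this sketch:

```python
import math

def fit_gaussian_nb(rows, labels):
    """Estimate per-class prior and per-feature (mean, variance)."""
    model = {}
    for c in set(labels):
        cols = list(zip(*[r for r, l in zip(rows, labels) if l == c]))
        stats = []
        for col in cols:
            m = sum(col) / len(col)
            v = sum((x - m) ** 2 for x in col) / len(col) or 1e-9
            stats.append((m, v))
        model[c] = (labels.count(c) / len(labels), stats)
    return model

def predict(model, row):
    """Pick the class with the highest Gaussian log-posterior."""
    def log_post(prior, stats):
        lp = math.log(prior)
        for x, (m, v) in zip(row, stats):
            lp += -0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)
        return lp
    return max(model, key=lambda c: log_post(*model[c]))
```

The sufficient statistics could equally be computed server-side in SQL, which is the design choice the tightly-coupled model exploits.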
Lazar, Dora; Ihasz, Istvan
Short- and medium-range operational forecasts, warnings and alarms for severe weather are among the most important activities of the Hungarian Meteorological Service. Our study provides a comprehensive summary of newly developed methods based on ECMWF ensemble forecasts to assist successful prediction of convective weather situations. The first part of the study gives a brief overview of the components of atmospheric convection: the atmospheric lifting force, convergence and vertical wind shear. Atmospheric instability is often characterized by so-called instability indices; one of the most popular and frequently used is the convective available potential energy. Heavy convective events, like intensive storms, supercells and tornadoes, require vertical instability, adequate moisture and vertical wind shear. As a first step, statistical studies of these three parameters were performed, based on a nine-year time series of the 51-member ensemble forecasting model for the convective summer period. The relationship between the ratio of convective to total precipitation and the above three parameters was studied by different statistical methods. Four new visualization methods were applied to support successful forecasts of severe weather. Two of the four, the ensemble meteogram and the ensemble vertical profiles, had been available at the beginning of our work. Both methods show the probability of the meteorological parameters for a selected location. Additionally, two new methods have been developed. The first provides a probability map of the event exceeding predefined values, so the spatial uncertainty of the incident is well defined. Since convective weather events often occur sporadically in space, the event area can be selected so that the ensemble forecasts give very good support. Another new visualization tool shows time
Zhao, Xu-Jun; Cai, Jiang-Hui; Zhang, Ji-Fu; Yang, Hai-Feng; Ma, Yang
Frequent patterns, which appear frequently in a data set, play an important role in data mining. For stellar spectrum classification tasks, a classification rule mining method based on a classification pattern tree is presented on the basis of frequent patterns. The procedure is as follows. Firstly, a new tree structure, the classification pattern tree, is introduced, based on the different frequencies of stellar spectral attributes in the database and their different importance for classification. The related concepts and the construction method of the classification pattern tree are also described in this paper. Then, the characteristics of the stellar spectrum are mapped to the classification pattern tree. Two traversal modes, top-down and bottom-up, are used to traverse the classification pattern tree and extract the classification rules. Meanwhile, the concept of pattern capability is introduced to adjust the number of classification rules and improve the construction efficiency of the classification pattern tree. Finally, SDSS (Sloan Digital Sky Survey) stellar spectral data provided by the National Astronomical Observatory are used to verify the accuracy of the method. The results show that a higher classification accuracy is achieved.
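The frequency-ordered tree construction can be sketched as an FP-tree-style prefix tree; the node layout and the ordering tie-break below are illustrative assumptions, not the paper's exact structure:

```python
from collections import Counter

def build_pattern_tree(transactions):
    """Insert each attribute set into a prefix tree, ordering attributes
    by descending global frequency, so frequent attributes share prefixes
    near the root (the FP-tree-style idea a classification pattern tree
    builds on)."""
    freq = Counter(a for t in transactions for a in t)
    root = {}
    for t in transactions:
        items = sorted(t, key=lambda a: (-freq[a], a))  # tie-break by name
        node = root
        for a in items:
            child = node.setdefault(a, {"count": 0, "children": {}})
            child["count"] += 1
            node = child["children"]
    return root
```

Rule extraction would then traverse this tree top-down or bottom-up, reading off paths whose counts pass a support threshold.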
Edwin Jayasingh Mariarputham
Accurate classification of Pap smear images is a challenging task in medical image processing. It can be improved in two ways: by selecting suitable, well-defined specific features, and by selecting the best classifier. This paper presents a nominated texture based cervical cancer (NTCC) classification system which classifies Pap smear images into one of seven classes. This is achieved by extracting well-defined texture features and selecting the best classifier. Seven sets of texture features (24 features in total) are extracted, including relative size of nucleus and cytoplasm, dynamic range and first four moments of intensities of nucleus and cytoplasm, relative displacement of nucleus within the cytoplasm, gray level co-occurrence matrix, local binary pattern histogram, Tamura features, and edge orientation histogram. Several types of support vector machine (SVM) and neural network (NN) classifiers are used for the classification. The performance of the NTCC algorithm is tested and compared to other algorithms on the public image database of Herlev University Hospital, Denmark, with 917 Pap smear images. The output of the SVM is found to be the best for most of the classes, with good results for the remaining classes.
Object recognition has shown a tremendous increase in the field of image analysis. The required set of image objects is identified and retrieved on the basis of object recognition. In this paper, we propose a novel classification technique called substance based image classification (SIC) using a wavelet neural network. The foremost task of SIC is to remove the surrounding regions from an image, to reduce the misclassified portion and to effectively reflect the shape of an object. First, the image is segmented by the SIC system. Next, in order to attain more accurate information from the extracted set of regions, the wavelet transform is applied to extract the configured set of features. Finally, using the neural network classifier model, misclassification over the given natural images is reduced and background regions are removed from the natural image using LSEG segmentation. Moreover, to increase the accuracy of object classification, the SIC system involves the removal of the surrounding regions of the image. Performance evaluation reveals that the proposed SIC system reduces the occurrence of misclassification and reflects the exact shape of an object to within approximately 10–15%.
Singh, Harpreet; Rana, Prashant Singh; Singh, Urvinder
Drug synergy prediction plays a significant role in the medical field for inhibiting specific cancer agents. It can be developed as a pre-processing tool for therapeutic successes. Examination of different drug-drug interactions can be done via the drug synergy score. This requires efficient regression-based machine learning approaches to minimize the prediction errors. Numerous machine learning techniques such as neural networks, support vector machines, random forests, LASSO, Elastic Nets, etc., have been used in the past to meet the requirement mentioned above. However, these techniques individually do not provide significant accuracy in drug synergy score. Therefore, the primary objective of this paper is to design a neuro-fuzzy-based ensembling approach. To achieve this, nine well-known machine learning techniques have been implemented by considering the drug synergy data. Based on the accuracy of each model, four techniques with high accuracy are selected to develop the ensemble-based machine learning model. These models are Random Forest, Fuzzy Rules Using Genetic Cooperative-Competitive Learning (GFS.GCCL), Adaptive-Network-Based Fuzzy Inference System (ANFIS) and Dynamic Evolving Neural-Fuzzy Inference System (DENFIS). Ensembling is achieved by evaluating the biased weighted aggregation (i.e. adding more weight to the model with a higher prediction score) of the data predicted by the selected models. The proposed and existing machine learning techniques have been evaluated on drug synergy score data. The comparative analysis reveals that the proposed method outperforms others in terms of accuracy, root mean square error and coefficient of correlation.
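The biased weighted aggregation step can be sketched as a score-proportional weighting of model outputs; the exact weighting rule is an assumption of this sketch, not necessarily the paper's formula:

```python
def weighted_ensemble(predictions, scores):
    """Aggregate per-model regression outputs with weights proportional
    to each model's validation score, so stronger models contribute more
    (the 'biased weighted aggregation' idea).

    predictions[m][i]: model m's prediction for sample i.
    scores[m]: model m's validation score (higher is better).
    """
    total = sum(scores)
    weights = [s / total for s in scores]
    return [sum(w * p for w, p in zip(weights, sample))
            for sample in zip(*predictions)]
```

With scores `[1, 3]`, the second model's predictions get three times the weight of the first's.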
Morozov, Yu. V.; Spektor, A. A.
A method for classifying moving objects having a seismic effect on the ground surface is proposed which is based on statistical analysis of the envelopes of received signals. The values of the components of the amplitude spectrum of the envelopes obtained applying Hilbert and Fourier transforms are used as classification criteria. Examples illustrating the statistical properties of spectra and the operation of the seismic classifier are given for an ensemble of objects of four classes (person, group of people, large animal, vehicle). It is shown that the computational procedures for processing seismic signals are quite simple and can therefore be used in real-time systems with modest requirements for computational resources.
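A minimal sketch of the envelope-spectrum features: a rectify-and-smooth envelope stands in here for the Hilbert-transform envelope, followed by the amplitude spectrum of the envelope via a naive DFT; window length and normalization are illustrative assumptions:

```python
import cmath

def envelope(signal, window=5):
    """Rectify-and-smooth envelope (a simple stand-in for the Hilbert
    envelope used in the paper)."""
    rect = [abs(v) for v in signal]
    half = window // 2
    return [sum(rect[max(0, i - half):i + half + 1]) /
            len(rect[max(0, i - half):i + half + 1])
            for i in range(len(rect))]

def amplitude_spectrum(env):
    """Magnitudes of the DFT of the envelope: the classification features."""
    n = len(env)
    return [abs(sum(env[k] * cmath.exp(-2j * cmath.pi * j * k / n)
                    for k in range(n))) / n for j in range(n // 2)]
```

The resulting spectral components would then be compared against the per-class statistics (person, group, animal, vehicle); an FFT would replace the naive DFT in any real-time setting.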
Wang, Hao; Yang, Yang; Peng, Bo; Chen, Qin
The Thyroid Imaging Reporting and Data System (TI-RADS) is a valuable tool for differentiating benign and malignant thyroid nodules. In the clinic, doctors can determine the extent to which a nodule is benign or malignant in terms of different classes by using TI-RADS. The classification represents the degree of malignancy of thyroid nodules. TI-RADS as a classification standard can be used to guide the ultrasound doctor to examine thyroid nodules more accurately and reliably. In this paper, we aim to classify thyroid nodules with the help of TI-RADS. To this end, four ultrasound signs, i.e., cystic and solid composition, echo pattern, boundary feature and calcification of thyroid nodules, are extracted and converted into feature vectors. Then a semi-supervised fuzzy C-means ensemble (SS-FCME) model is applied to obtain the classification results. The experimental results demonstrate that the proposed method can help doctors diagnose thyroid nodules effectively.
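The membership-update step at the core of fuzzy C-means can be sketched as follows (1-D features for brevity; the semi-supervised and ensemble parts of SS-FCME are omitted, and the fuzzifier value is an assumption):

```python
def fcm_memberships(points, centers, m=2.0):
    """One membership-update step of fuzzy C-means: u[k][i] is inversely
    related to the distance from point k to center i, with fuzzifier m.
    Each row of memberships sums to 1."""
    u = []
    for x in points:
        d = [abs(x - c) or 1e-12 for c in centers]   # guard zero distance
        row = [1.0 / sum((d[i] / d[j]) ** (2 / (m - 1))
                         for j in range(len(centers)))
               for i in range(len(centers))]
        u.append(row)
    return u
```

A full FCM run alternates this step with recomputing centers as membership-weighted means until convergence.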
Dumbović, Mateja; Čalogović, Jaša; Vršnak, Bojan; Temmer, Manuela; Mays, M. Leila; Veronig, Astrid; Piantschitsch, Isabell
The drag-based model for heliospheric propagation of coronal mass ejections (CMEs) is a widely used analytical model that can predict CME arrival time and speed at a given heliospheric location. It is based on the assumption that the propagation of CMEs in interplanetary space is solely under the influence of magnetohydrodynamical drag, where CME propagation is determined based on CME initial properties as well as the properties of the ambient solar wind. We present an upgraded version, the drag-based ensemble model (DBEM), that covers ensemble modeling to produce a distribution of possible ICME arrival times and speeds. Multiple runs using uncertainty ranges for the input values can be performed in almost real-time, within a few minutes. This allows us to define the most likely ICME arrival times and speeds, quantify prediction uncertainties, and determine forecast confidence. The performance of the DBEM is evaluated and compared to that of ensemble WSA-ENLIL+Cone model (ENLIL) using the same sample of events. It is found that the mean error is ME = ‑9.7 hr, mean absolute error MAE = 14.3 hr, and root mean square error RMSE = 16.7 hr, which is somewhat higher than, but comparable to ENLIL errors (ME = ‑6.1 hr, MAE = 12.8 hr and RMSE = 14.4 hr). Overall, DBEM and ENLIL show a similar performance. Furthermore, we find that in both models fast CMEs are predicted to arrive earlier than observed, most likely owing to the physical limitations of models, but possibly also related to an overestimation of the CME initial speed for fast CMEs.
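The drag-based kinematics and the ensemble-of-perturbed-inputs idea can be sketched as follows; the input uncertainty ranges and starting distance are illustrative assumptions, not the DBEM defaults:

```python
import math, random

def dbm_distance(t, r0, v0, w, gamma):
    """Heliocentric distance under the drag-based model (decelerating
    case, v0 > w): r(t) = r0 + w*t + ln(1 + gamma*(v0 - w)*t) / gamma.
    Units: km, km/s, s; gamma in km^-1."""
    return r0 + w * t + math.log(1 + gamma * (v0 - w) * t) / gamma

def arrival_time(target, r0, v0, w, gamma, t_max=5e6):
    """Bisection on the monotone r(t) to find the arrival time at target."""
    lo, hi = 0.0, t_max
    for _ in range(100):
        mid = (lo + hi) / 2
        if dbm_distance(mid, r0, v0, w, gamma) < target:
            lo = mid
        else:
            hi = mid
    return lo

def dbem_arrivals(n=200, seed=1):
    """Ensemble of runs with perturbed inputs; the resulting spread of
    arrival times gives the forecast uncertainty."""
    rng = random.Random(seed)
    au = 1.496e8                        # 1 AU in km
    r0 = 20 * 6.96e5                    # start at 20 solar radii
    return [arrival_time(au, r0,
                         rng.gauss(900, 100),       # CME speed, km/s
                         rng.gauss(400, 30),        # solar wind, km/s
                         rng.uniform(1e-8, 3e-8))   # drag parameter
            for _ in range(n)]
```

The most likely arrival time and the forecast confidence would then be read off the distribution of the returned times.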
Hahnke, Richard L.; Meier-Kolthoff, Jan P.; García-López, Marina; Mukherjee, Supratim; Huntemann, Marcel; Ivanova, Natalia N.; Woyke, Tanja; Kyrpides, Nikos C.; Klenk, Hans-Peter; Göker, Markus
The bacterial phylum Bacteroidetes, characterized by a distinct gliding motility, occurs in a broad variety of ecosystems, habitats, life styles, and physiologies. Accordingly, taxonomic classification of the phylum, based on a limited number of features, proved difficult and controversial in the past, for example, when decisions were based on unresolved phylogenetic trees of the 16S rRNA gene sequence. Here we use a large collection of type-strain genomes from Bacteroidetes and closely related phyla for assessing their taxonomy based on the principles of phylogenetic classification and trees inferred from genome-scale data. No significant conflict between 16S rRNA gene and whole-genome phylogenetic analysis is found, whereas many but not all of the involved taxa are supported as monophyletic groups, particularly in the genome-scale trees. Phenotypic and phylogenomic features support the separation of Balneolaceae as new phylum Balneolaeota from Rhodothermaeota and of Saprospiraceae as new class Saprospiria from Chitinophagia. Epilithonimonas is nested within the older genus Chryseobacterium and without significant phenotypic differences; thus merging the two genera is proposed. Similarly, Vitellibacter is proposed to be included in Aequorivita. Flexibacter is confirmed as being heterogeneous and dissected, yielding six distinct genera. Hallella seregens is a later heterotypic synonym of Prevotella dentalis. Compared to values directly calculated from genome sequences, the G+C content mentioned in many species descriptions is too imprecise; moreover, corrected G+C content values have a significantly better fit to the phylogeny. Corresponding emendations of species descriptions are provided where necessary. Whereas most observed conflict with the current classification of Bacteroidetes is already visible in 16S rRNA gene trees, as expected whole-genome phylogenies are much better resolved. PMID:28066339
Ensemble prediction has become an essential part of numerical weather forecasting. In this paper we investigate the ability of ensemble forecasts to provide an a priori estimate of the expected forecast skill. Several quantities derived from the local ensemble distribution are investigated for a two-year data set of European Centre for Medium-Range Weather Forecasts (ECMWF) temperature and wind speed ensemble forecasts at 30 German stations. The results indicate that the population of the ensemble mode provides useful information on the uncertainty in temperature forecasts. The ensemble entropy is a similarly good measure. This is not true for the spread if it is simply calculated as the variance of the ensemble members with respect to the ensemble mean. The number of clusters in the C regions is almost unrelated to the local skill. For wind forecasts, the results are less promising.
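Two of the quantities investigated, the ensemble entropy and the population of the ensemble mode, can be sketched over binned members; the bin width is an assumption of this sketch:

```python
import math
from collections import Counter

def ensemble_entropy(members, bin_width=1.0):
    """Shannon entropy of the binned ensemble distribution; low entropy
    means the members agree, suggesting a more certain forecast."""
    bins = Counter(int(m // bin_width) for m in members)
    n = len(members)
    return -sum((c / n) * math.log(c / n) for c in bins.values())

def mode_population(members, bin_width=1.0):
    """Fraction of members falling into the most populated bin: high
    values indicate a sharp, confident ensemble."""
    bins = Counter(int(m // bin_width) for m in members)
    return max(bins.values()) / len(members)
```

A sharp ensemble (members clustered in one bin) has low entropy and a high mode population; a flat ensemble has the opposite.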
Edouard, Simon; Vincendon, Béatrice; Ducrocq, Véronique
Intense precipitation events in the Mediterranean often lead to devastating flash floods (FF). FF modelling is affected by several kinds of uncertainties, and Hydrological Ensemble Prediction Systems (HEPS) are designed to take those uncertainties into account. The major source of uncertainty comes from rainfall forcing, and convective-scale meteorological ensemble prediction systems can manage it for forecasting purposes. But other sources are related to the hydrological modelling part of the HEPS. This study focuses on the uncertainties arising from the hydrological model parameters and initial soil moisture, with the aim of designing an ensemble-based version of a hydrological model dedicated to simulations of Mediterranean fast-responding rivers, the ISBA-TOP coupled system. The first step consists in identifying the parameters that have the strongest influence on FF simulations, assuming perfect precipitation. A sensitivity study is carried out, first using a synthetic framework and then for several real events and several catchments. Perturbation methods varying the most sensitive parameters as well as the initial soil moisture allow the design of an ensemble-based version of ISBA-TOP. The first results of this system on some real events are presented. The direct perspective of this work is to drive this ensemble-based version with the members of a convective-scale meteorological ensemble prediction system, to design a complete HEPS for FF forecasting.
Haaf, Ezra; Barthel, Roland
When assessing hydrogeological conditions at the regional scale, the analyst is often confronted with uncertainty of structures, inputs and processes while having to base inference on scarce and patchy data. Haaf and Barthel (2015) proposed a concept for handling this predicament by developing a groundwater systems classification framework, where information is transferred from similar, but well-explored and better understood to poorly described systems. The concept is based on the central hypothesis that similar systems react similarly to the same inputs and vice versa. It is conceptually related to PUB (Prediction in ungauged basins) where organization of systems and processes by quantitative methods is intended and used to improve understanding and prediction. Furthermore, using the framework it is expected that regional conceptual and numerical models can be checked or enriched by ensemble generated data from neighborhood-based estimators. In a first step, groundwater hydrographs from a large dataset in Southern Germany are compared in an effort to identify structural similarity in groundwater dynamics. A number of approaches to group hydrographs, mostly based on a similarity measure - which have previously only been used in local-scale studies, can be found in the literature. These are tested alongside different global feature extraction techniques. The resulting classifications are then compared to a visual "expert assessment"-based classification which serves as a reference. A ranking of the classification methods is carried out and differences shown. Selected groups from the classifications are related to geological descriptors. Here we present the most promising results from a comparison of classifications based on series correlation, different series distances and series features, such as the coefficients of the discrete Fourier transform and the intrinsic mode functions of empirical mode decomposition. Additionally, we show examples of classes
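A correlation-based series distance and a nearest-neighbour transfer step, one of the similarity measures compared above, can be sketched as follows (function names are illustrative):

```python
def correlation_distance(a, b):
    """1 - Pearson correlation: a common series-similarity measure for
    grouping hydrographs by their dynamics (0 = identical dynamics)."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a) ** 0.5
    vb = sum((y - mb) ** 2 for y in b) ** 0.5
    return 1 - cov / (va * vb)

def nearest_hydrograph(query, catalogue):
    """Neighbourhood-based transfer: index of the most similar
    well-observed hydrograph, from which information is borrowed."""
    return min(range(len(catalogue)),
               key=lambda i: correlation_distance(query, catalogue[i]))
```

Feature-based variants would replace the raw series with, e.g., Fourier coefficients before computing distances.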
Eelsalu, Maris; Soomere, Tarmo
The risks and damages associated with coastal flooding, which naturally increase with the magnitude of extreme storm surges, are one of the largest concerns of countries with extensive low-lying nearshore areas. The relevant risks are even more pronounced for semi-enclosed water bodies such as the Baltic Sea, where subtidal (weekly-scale) variations in the water volume of the sea substantially contribute to the water level and lead to a large spread in projections of future extreme water levels. We explore the options for using large ensembles of projections to more reliably evaluate return periods of extreme water levels. Single projections of the ensemble are constructed by fitting several sets of block maxima with various extreme value distributions. The ensemble is based on two simulated data sets produced by the Swedish Meteorological and Hydrological Institute: a hindcast by the Rossby Centre Ocean model, sampled with a resolution of 6 h, and a similar hindcast by the circulation model NEMO with a resolution of 1 h. As the annual maxima of water levels in the Baltic Sea are not always uncorrelated, we employ maxima for calendar years and for stormy seasons. As the shape parameter of the Generalised Extreme Value distribution changes its sign and varies substantially in magnitude along the eastern coast of the Baltic Sea, the use of a single distribution for the entire coast is inappropriate. The ensemble involves projections based on the Generalised Extreme Value, Gumbel and Weibull distributions. The parameters of these distributions are evaluated in three different ways: the maximum likelihood method and the method of moments based on both biased and unbiased estimates. The total number of projections in the ensemble is 40. As some of the resulting estimates contain limited additional information, the members of pairs of projections that are highly correlated are assigned weights of 0.6. A comparison of the ensemble-based projection of
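One member of such an ensemble, a Gumbel fit by the method of moments together with its return-level curve, can be sketched as follows (GEV and Weibull members would be built analogously; 0.5772… is the Euler–Mascheroni constant):

```python
import math

def gumbel_fit_moments(maxima):
    """Method-of-moments Gumbel fit: scale beta = s * sqrt(6) / pi and
    location mu = mean - euler_gamma * beta, from the sample mean and
    (unbiased) standard deviation of block maxima."""
    n = len(maxima)
    mean = sum(maxima) / n
    s = (sum((x - mean) ** 2 for x in maxima) / (n - 1)) ** 0.5
    beta = s * math.sqrt(6) / math.pi
    mu = mean - 0.5772156649 * beta
    return mu, beta

def return_level(mu, beta, period_years):
    """Water level expected to be exceeded once per return period:
    the Gumbel quantile at non-exceedance probability 1 - 1/T."""
    p = 1 - 1 / period_years
    return mu - beta * math.log(-math.log(p))
```

Each of the 40 projections would contribute one such return-level curve, and the (weighted) ensemble of curves gives the final estimate.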
Accurate staging of hepatic cirrhosis is important in investigating the cause and slowing down the effects of cirrhosis. Computer-aided diagnosis (CAD) can provide doctors with an alternative second opinion and assist them in choosing a specific treatment based on an accurate cirrhosis stage. MRI has many advantages, including high resolution for soft tissue, no radiation, and multiparameter imaging modalities. So in this paper, multisequence MRIs, including T1-weighted, T2-weighted, arterial, portal venous, and equilibrium phases, are applied. However, CAD does not yet meet the clinical needs of cirrhosis assessment, and few researchers are concerned with it at present. Cirrhosis is characterized by the presence of widespread fibrosis and regenerative nodules in the liver, leading to different texture patterns at different stages. So, extracting texture features is the primary task. Compared with typical gray level co-occurrence matrix (GLCM) features, texture classification from random features provides an effective way, and we adopt it and propose CCTCRF for triple classification (normal, early, and middle and advanced stages). CCTCRF does not need strong assumptions except the sparse character of the image, contains sufficient texture information, follows a concise and effective process, and makes case decisions with high accuracy. Experimental results also illustrate the satisfying performance, and comparisons are made with a typical NN with GLCM.
Petrini, Paula Andreia; Lopes da Silva, Ricardo Magno; de Oliveira, Rafael Furlan; Merces, Leandro; Bufon, Carlos César Bof
Considerable advances in the field of molecular electronics have been achieved over the recent years. One persistent challenge, however, is the exploitation of the electronic properties of molecules fully integrated into devices. Typically, the molecular electronic properties are investigated using sophisticated techniques incompatible with a practical device technology, such as the scanning tunneling microscope (STM). The incorporation of molecular materials in devices is not a trivial task since the typical dimensions of electrical contacts are much larger than the molecular ones. To tackle this issue, we report on hybrid capacitors using mechanically-compliant nanomembranes to encapsulate ultrathin molecular ensembles for the investigation of molecular dielectric properties. As the prototype material, copper (II) phthalocyanine (CuPc) has been chosen as information on its dielectric constant (kCuPc) at the molecular scale is missing. Here, hybrid nanomembrane-based capacitors containing metallic nanomembranes, insulating Al2O3 layers, and the CuPc molecular ensemble have been fabricated and evaluated. The Al2O3 is used to prevent short circuits through the capacitor plates as the molecular layer is considerably thin (< 30 nm). From the electrical measurements of devices with molecular layers of different thicknesses, the CuPc dielectric constant has been reliably determined (kCuPc = 4.5 ± 0.5). These values suggest a mild contribution of molecular orientation in the CuPc dielectric properties. The reported nanomembrane-based capacitor is a viable strategy for the dielectric characterization of ultrathin molecular ensembles integrated into a practical, real device technology. © 2018 IOP Publishing Ltd.
Introduction: Dengue fever has been one of the most concerning endemic diseases of recent times. Every year, 50-100 million people get infected by the dengue virus across the world. Historically, it has been most prevalent in Southeast Asia and the Pacific Islands. In recent years, frequent dengue epidemics have started occurring in Latin America as well. This study focused on assessing the impact of different short- and long-term lagged climatic predictors on dengue cases. Additionally, it assessed the impact of building an ensemble model using multiple time series and regression models on prediction accuracy. Materials and Methods: Experimental data were based on two Latin American cities, viz. San Juan (Puerto Rico) and Iquitos (Peru). Due to weather and geographic differences, San Juan recorded higher dengue incidence than Iquitos. Using lagged cross-correlations, this study confirmed the impact of temperature and vegetation on the number of dengue cases for both cities, though to varying degrees and at varying time lags. An ensemble of multiple predictive models using an elaborate set of derived predictors was built and validated. Results: The proposed ensemble prediction achieved a mean absolute error of 21.55, 4.26 points lower than the 25.81 obtained by a standard negative binomial model. Changes in climatic conditions and urbanization were found to be strong predictors, as established empirically in other studies. Some of the predictors were new and informative, and have not been explored in any other relevant studies yet. Discussion and Conclusions: Two original contributions were made in this research. Firstly, a focused and extensive feature engineering aligned with the mosquito lifecycle. Secondly, a novel covariate pattern-matching based prediction approach using the past time series trend of the predictor variables. The increased accuracy of the proposed model over the benchmark model proved the appropriateness of the analytical approach.
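The lagged cross-correlation screening of climatic predictors can be sketched as follows; treating the series as weekly and scanning lags up to a fixed horizon are assumptions of this sketch:

```python
def lagged_correlation(cases, predictor, lag):
    """Pearson correlation between case counts and a climate predictor
    shifted back by `lag` steps (positive lag: predictor leads cases)."""
    x = predictor[:len(predictor) - lag]
    y = cases[lag:]
    n = len(y)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def best_lag(cases, predictor, max_lag=8):
    """Lag with the strongest absolute correlation: the candidate
    lead time for that predictor."""
    return max(range(max_lag + 1),
               key=lambda l: abs(lagged_correlation(cases, predictor, l)))
```

Predictors whose best lag shows a strong correlation would then enter the ensemble's feature set at that lag.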
Lin, Yuan; Yu, Hongzhi; Wan, Fucheng; Xu, Tao
Data mining has important application value in today's industry and academia, and text classification is a very important technology within it. At present, there are many mature algorithms for text classification; KNN, NB, AdaBoost, SVM, decision trees and other classification methods all show good classification performance. The support vector machine (SVM) is a good classifier in machine learning research. This paper studies the classification effect of the SVM method on Chinese text data and uses the support vector machine to classify Chinese text, aiming to combine academic research with practical application.
Huang, Meiyan; Yang, Wei; Jiang, Jun; Wu, Yao; Zhang, Yu; Chen, Wufan; Feng, Qianjin
Brain extraction is an important procedure in brain image analysis. Although numerous brain extraction methods have been presented, enhancing brain extraction methods remains challenging because brain MRI images exhibit complex characteristics, such as anatomical variability and intensity differences across different sequences and scanners. To address this problem, we present a Locally Linear Representation-based Classification (LLRC) method for brain extraction. A novel classification framework is derived by introducing the locally linear representation to the classical classification model. Under this classification framework, a common label fusion approach can be considered as a special case and thoroughly interpreted. Locality is important to calculate fusion weights for LLRC; this factor is also considered to determine that Local Anchor Embedding is more applicable in solving locally linear coefficients compared with other linear representation approaches. Moreover, LLRC supplies a way to learn the optimal classification scores of the training samples in the dictionary to obtain accurate classification. The International Consortium for Brain Mapping and the Alzheimer's Disease Neuroimaging Initiative databases were used to build a training dataset containing 70 scans. To evaluate the proposed method, we used four publicly available datasets (IBSR1, IBSR2, LPBA40, and ADNI3T, with a total of 241 scans). Experimental results demonstrate that the proposed method outperforms the four common brain extraction methods (BET, BSE, GCUT, and ROBEX), and is comparable to the performance of BEaST, while being more accurate on some datasets compared with BEaST. Copyright © 2014 Elsevier Inc. All rights reserved.
Delaney, C.; Mendoza, J.; Whitin, B.; Hartman, R. K.
Ensemble Forecast Operations (EFO) is a risk based approach of reservoir flood operations that incorporates ensemble streamflow predictions (ESPs) made by NOAA's California-Nevada River Forecast Center (CNRFC). With the EFO approach, each member of an ESP is individually modeled to forecast system conditions and calculate risk of reaching critical operational thresholds. Reservoir release decisions are computed which seek to manage forecasted risk to established risk tolerance levels. A water management model was developed for Lake Mendocino, a 111,000 acre-foot reservoir located near Ukiah, California, to evaluate the viability of the EFO alternative to improve water supply reliability but not increase downstream flood risk. Lake Mendocino is a dual use reservoir, which is owned and operated for flood control by the United States Army Corps of Engineers and is operated for water supply by the Sonoma County Water Agency. Due to recent changes in the operations of an upstream hydroelectric facility, this reservoir has suffered from water supply reliability issues since 2007. The EFO alternative was simulated using a 26-year (1985-2010) ESP hindcast generated by the CNRFC, which approximates flow forecasts for 61 ensemble members for a 15-day horizon. Model simulation results of the EFO alternative demonstrate a 36% increase in median end of water year (September 30) storage levels over existing operations. Additionally, model results show no increase in occurrence of flows above flood stage for points downstream of Lake Mendocino. This investigation demonstrates that the EFO alternative may be a viable approach for managing Lake Mendocino for multiple purposes (water supply, flood mitigation, ecosystems) and warrants further investigation through additional modeling and analysis.
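The risk computation at the heart of EFO, the fraction of ensemble members crossing a critical threshold at each forecast step, can be sketched as follows; the release decision rule is illustrative, not the operational logic:

```python
def forecast_risk(ensemble_storages, threshold):
    """EFO-style risk: for each forecast step, the fraction of ensemble
    members whose simulated reservoir storage exceeds the critical
    threshold. ensemble_storages[m][t] is member m's storage at step t."""
    n = len(ensemble_storages)
    horizon = len(ensemble_storages[0])
    return [sum(m[t] > threshold for m in ensemble_storages) / n
            for t in range(horizon)]

def release_needed(risk_profile, tolerance):
    """Flag the steps where forecasted risk exceeds the tolerance level,
    i.e. where a pre-release would be scheduled (illustrative rule)."""
    return [r > tolerance for r in risk_profile]
```

In practice each of the 61 ESP members would be routed through the full water-management model before this threshold check.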
Vosselman, George; Coenen, Maximilian; Rottensteiner, Franz
Classification of point clouds is needed as a first step in the extraction of various types of geo-information from point clouds. We present a new approach to contextual classification of segmented airborne laser scanning data. Potential advantages of segment-based classification are easily offset
Li, Dingfang; Luo, Longqiang; Zhang, Wen; Liu, Feng; Luo, Fei
Predicting piwi-interacting RNA (piRNA) is an important topic in the small non-coding RNAs, which provides clues for understanding the generation mechanism of gamete. To the best of our knowledge, several machine learning approaches have been proposed for the piRNA prediction, but there is still room for improvements. In this paper, we develop a genetic algorithm-based weighted ensemble method for predicting transposon-derived piRNAs. We construct datasets for three species: Human, Mouse and Drosophila. For each species, we compile the balanced dataset and imbalanced dataset, and thus obtain six datasets to build and evaluate prediction models. In the computational experiments, the genetic algorithm-based weighted ensemble method achieves 10-fold cross validation AUC of 0.932, 0.937 and 0.995 on the balanced Human dataset, Mouse dataset and Drosophila dataset, respectively, and achieves AUC of 0.935, 0.939 and 0.996 on the imbalanced datasets of three species. Further, we use the prediction models trained on the Mouse dataset to identify piRNAs of other species, and the models demonstrate the good performances in the cross-species prediction. Compared with other state-of-the-art methods, our method can lead to better performances. In conclusion, the proposed method is promising for the transposon-derived piRNA prediction. The source codes and datasets are available in https://github.com/zw9977129/piRNAPredictor .
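A toy version of a genetic algorithm searching for ensemble weights can be sketched as follows; the operators, hyperparameters and fitness function are illustrative assumptions, not those of the paper:

```python
import random

def ensemble_accuracy(weights, probs, labels):
    """Accuracy of a weighted ensemble: probs[m][i] is model m's score
    that sample i is a piRNA; the weighted mean is thresholded at 0.5."""
    total = sum(weights)
    preds = [sum(w * p[i] for w, p in zip(weights, probs)) / total > 0.5
             for i in range(len(labels))]
    return sum(p == bool(l) for p, l in zip(preds, labels)) / len(labels)

def ga_weights(probs, labels, pop=20, gens=30, seed=0):
    """Evolve the weight vector: keep the fitter half, then fill the
    population with averaged-crossover children plus Gaussian mutation."""
    rng = random.Random(seed)
    m = len(probs)
    population = [[rng.random() for _ in range(m)] for _ in range(pop)]
    for _ in range(gens):
        population.sort(key=lambda w: -ensemble_accuracy(w, probs, labels))
        parents = population[:pop // 2]
        children = []
        for _ in range(pop - len(parents)):
            a, b = rng.sample(parents, 2)
            child = [(x + y) / 2 for x, y in zip(a, b)]        # crossover
            j = rng.randrange(m)
            child[j] = min(1.0, max(0.0,
                           child[j] + rng.gauss(0, 0.1)))      # mutation
            children.append(child)
        population = parents + children
    return max(population, key=lambda w: ensemble_accuracy(w, probs, labels))
```

In the paper the fitness would be cross-validated AUC rather than raw accuracy, but the evolutionary loop has the same shape.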
Thielen-Del Pozo, J.; Pappenberger, F.; Bogner, K.; Kalas, M.; de Roo, A.
The hydrological community is looking increasingly at the use of ensemble prediction systems (EPS) instead of single forecasts to increase flood warning times. International initiatives and research projects such as THORPEX, HEPEX, PREVIEW, or MAP-DPHASE successfully foster the interdisciplinary dialogue between the meteorological and hydrological communities. The European Flood Alert System (EFAS) is a pre-operational example of an early flood warning system based on multiple EPS and poor man's ensemble weather inputs. EFAS research focuses on the exploration of EPS streamflow information, its visualisation for different end-user communities, and its application in risk-based decision-making. EFAS further provides a platform for research on flash floods, droughts and climate change. Here a case study of the Romanian floods of October 2007 is analysed with multiple EPS at different spatial resolutions and lead times. While monthly forecasts are explored as first indicators of potential floods, the early flood warning capacity of EFAS is drawn mainly from medium-range (15-day) EPS forecasts. In addition, information from the limited-area model EPS (COSMO-LEPS), with much higher spatial resolution but only 5 days lead time, is explored for better quantitative forecasts. An outlook contrasting the computational demands with the apparent benefit is given, together with a few thoughts on developments that should be addressed in the near future.
Ludwiczak, Jan; Jarmula, Adam; Dunin-Horkawicz, Stanislaw
Computational protein design is a set of procedures for computing amino acid sequences that will fold into a specified structure. Rosetta Design, a commonly used software package for protein design, allows for the effective identification of sequences compatible with a given backbone structure, while molecular dynamics (MD) simulations can thoroughly sample near-native conformations. We benchmarked a procedure in which Rosetta Design is started on MD-derived structural ensembles and showed that such a combined approach generates 20-30% more diverse sequences than currently available methods, with only a slight increase in computation time. Importantly, the increase in diversity is achieved without a loss in the quality of the designed sequences, as assessed by their resemblance to natural sequences. We demonstrate that the MD-based procedure is also applicable to de novo design tasks started from backbone structures without any sequence information. In addition, we implemented a protocol that can be used to assess the stability of designed models and to select the best candidates for experimental validation. In sum, our results demonstrate that MD ensemble-based flexible backbone design can be a viable method for protein design, especially for tasks that require a large pool of diverse sequences. Copyright © 2018 Elsevier Inc. All rights reserved.
Full Text Available Knowledge-based planning (KBP) utilizes experienced planners' knowledge embedded in prior plans to estimate the optimal achievable dose volume histogram (DVH) of new cases. In the regression-based KBP framework, previously planned patients' anatomical features and DVHs are extracted, and prior knowledge is summarized as the regression coefficients that transform features into organ-at-risk DVH predictions. We find that different regression methods work better in different settings. To improve the robustness of KBP models, we propose an ensemble method that combines the strengths of various linear regression models, including stepwise, lasso, elastic net, and ridge regression. In the ensemble approach, we first obtain individual model prediction metadata using in-training-set leave-one-out cross validation. A constrained optimization is subsequently performed to decide the individual model weights. The metadata is also used to filter out impactful training set outliers. We evaluate our method on a fresh set of retrospectively retrieved anonymized prostate intensity-modulated radiation therapy (IMRT) cases and head and neck IMRT cases. The proposed approach is more robust against small training set size, wrongly labeled cases, and dosimetrically inferior plans, compared with the individual models. In summary, we believe the improved robustness makes the proposed method more suitable for clinical settings than individual models.
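The weight-deciding step can be illustrated with a toy version of the constrained optimization: given leave-one-out prediction metadata from three models, search the simplex of convex weights for the combination minimizing squared error. The grid search and the data below are illustrative stand-ins for the paper's actual optimizer and clinical features:

```python
# Toy version of the ensemble-weighting step: leave-one-out predictions
# ("metadata") from three hypothetical regression models are blended with
# convex weights (non-negative, summing to 1); a coarse grid search over
# the simplex stands in for the paper's constrained optimization.

def sse(weights, model_preds, targets):
    """Sum of squared errors of the weighted blend of model predictions."""
    total = 0.0
    for i, y in enumerate(targets):
        blend = sum(w * preds[i] for w, preds in zip(weights, model_preds))
        total += (blend - y) ** 2
    return total

def best_convex_weights(model_preds, targets, step=0.05):
    """Grid search over the 3-model probability simplex."""
    best_err, best_w = float("inf"), None
    n = int(round(1 / step))
    for i in range(n + 1):
        for j in range(n + 1 - i):
            w = (i * step, j * step, 1.0 - (i + j) * step)
            err = sse(w, model_preds, targets)
            if err < best_err:
                best_err, best_w = err, w
    return best_w

# made-up LOO metadata: model A tracks the target, B is biased, C is constant
targets = [1.0, 2.0, 3.0, 4.0]
preds_a = [1.1, 2.0, 2.9, 4.1]
preds_b = [0.5, 1.0, 1.5, 2.0]
preds_c = [2.0, 2.0, 2.0, 2.0]
weights = best_convex_weights([preds_a, preds_b, preds_c], targets)
```

On this toy data the search concentrates all weight on the accurate model; with noisier, complementary models the optimum spreads across several of them, which is the robustness the paper is after.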
Hansen, Lars Kai; Salamon, Peter
We propose several means for improving the performance and training of neural networks for classification. We use cross-validation as a tool for optimizing network parameters and architecture. We show further that the remaining generalization error can be reduced by invoking ensembles of similar networks.
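The variance-reduction effect of ensembles can be demonstrated without training real networks. In the sketch below, each "network" is a stand-in predictor with its own bias and noise; for squared error, the ensemble average provably does at least as well as the average member (Jensen's inequality):

```python
# Demonstration of why averaging similar predictors reduces error: for
# squared loss, the ensemble mean's MSE never exceeds the mean MSE of its
# members. The "networks" are synthetic noisy predictors, not trained nets.
import random

random.seed(7)

xs = [x / 10 for x in range(20)]
truth = [2.0 * x + 1.0 for x in xs]

# Five "networks": each predicts the truth plus its own bias and noise.
members = []
for _ in range(5):
    bias = random.uniform(-0.5, 0.5)
    members.append([t + bias + random.gauss(0, 0.3) for t in truth])

def mse(pred, target):
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)

# pointwise average of the member predictions
ensemble = [sum(col) / len(col) for col in zip(*members)]

mean_member_mse = sum(mse(m, truth) for m in members) / len(members)
ensemble_mse = mse(ensemble, truth)
```

The inequality `ensemble_mse <= mean_member_mse` holds for any data and any members; the gap equals the average disagreement among members, which is why diverse-but-similar networks help most.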
Arnbjerg-Nielsen, Karsten; Funder, S. G.; Madsen, H.
Climate analogues, also denoted Space-For-Time, may be used to identify regions where the present climatic conditions resemble conditions of a past or future state of another location or region, based on robust climate variable statistics in combination with projections of how these statistics change over time. The study focuses on assessing climate analogues for Denmark based on current climate data set (E-OBS) observations as well as the ENSEMBLES database of future climates, with the aim of projecting future precipitation extremes. The local present precipitation extremes are assessed by means of intensity-duration-frequency curves for urban drainage design for the relevant locations, being France, the Netherlands, Belgium, Germany, the United Kingdom, and Denmark. Based on this approach, projected increases of extreme precipitation by 2100 of 9 and 21% are expected for 2 and 10 year...
Full Text Available In order to lower the dependence on textual annotations for image searches, content-based image retrieval (CBIR) has become a popular topic in computer vision. A wide range of CBIR applications consider classification techniques, such as artificial neural networks (ANN) and support vector machines (SVM), to understand the query image content and retrieve relevant output. However, in multi-class search environments, the retrieval results are far from optimal due to overlapping semantics amongst subjects of various classes. Classification through multiple classifiers generates better results, but as the number of negative examples increases due to highly correlated semantic classes, classification bias occurs towards the negative class; hence, the combination of classifiers becomes even more unstable, particularly in one-against-all classification scenarios. To resolve this issue, a genetic algorithm (GA) based classifier comity learning (GCCL) method is presented in this paper to generate stable classifiers by combining ANN with SVMs through asymmetric and symmetric bagging. The proposed approach resolves the classification disagreement amongst different classifiers and also resolves the class imbalance problem in CBIR. Once the stable classifiers are generated, the query image is presented to the trained model to understand the underlying semantic content of the query image for association with the precise semantic class. Afterwards, the feature similarity is computed within the obtained class to generate the semantic response of the system. The experiments reveal that the proposed method outperforms various state-of-the-art methods and significantly improves image retrieval performance.
Raposo, Letícia M; Nobre, Flavio F
Resistance to antiretrovirals (ARVs) is a major problem faced by HIV-infected individuals. Different rule-based algorithms have been developed to infer HIV-1 susceptibility to antiretrovirals from genotypic data. However, there is discordance between them, making clinical decisions about which treatment to use difficult. Here, we developed ensemble classifiers integrating three interpretation algorithms: Agence Nationale de Recherche sur le SIDA (ANRS), Rega, and the genotypic resistance interpretation system from the Stanford HIV Drug Resistance Database (HIVdb). Three approaches were applied to develop a classifier with a single resistance profile: stacked generalization, a simple plurality vote scheme, and selection of the interpretation system with the best performance. The strategies were compared with Friedman's test, and the performance of the classifiers was evaluated using the F-measure, sensitivity and specificity values. We found that the three strategies had similar performance for the selected antiretrovirals. In some cases, the stacking technique with naïve Bayes as the learning algorithm showed a statistically superior F-measure. This study demonstrates that ensemble classifiers can be an alternative tool for clinical decision-making, since they provide a single resistance profile from the most commonly used resistance interpretation systems.
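The simple plurality vote scheme can be sketched in a few lines. The resistance calls below are hypothetical, not actual ANRS/Rega/HIVdb outputs, and the tie-break rule (fall back to a designated system) is an illustrative choice:

```python
# Plurality-vote ensemble over rule-based interpretation systems.
# Each system labels a sequence "R" (resistant) or "S" (susceptible);
# the ensemble returns the majority label. With an even number of voters
# a tie is possible, so we fall back to a designated system (index 0).
from collections import Counter

def plurality_vote(labels, fallback_index=0):
    counts = Counter(labels).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return labels[fallback_index]          # tie-break rule
    return counts[0][0]

# hypothetical calls from three systems for four sequences
calls = [("R", "R", "S"), ("S", "S", "S"), ("R", "S", "R"), ("S", "R", "S")]
consensus = [plurality_vote(c) for c in calls]
```

Stacked generalization replaces this fixed rule with a meta-learner (naïve Bayes in the study) trained on the systems' outputs, which is why it can outperform plain voting when one system is systematically more reliable for a given drug.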
Full Text Available This study introduces the One-Class K-means with Randomly-projected features Algorithm (OCKRA). OCKRA is an ensemble of one-class classifiers built over multiple projections of a dataset according to random feature subsets. Ensembles of one-class classifiers have been satisfactorily applied across a wide range of applications in the literature; however, none is oriented to the area under study here: personal risk detection. OCKRA has been designed with the aim of improving detection performance on the problem posed by the Personal RIsk DEtection (PRIDE) dataset. PRIDE was built based on 23 test subjects, where the data for each user were captured using a set of sensors embedded in a wearable band. The performance of OCKRA was compared against a support vector machine and three versions of the Parzen window classifier. On average, experimental results show that OCKRA outperformed the other classifiers by at least 0.53% in area under the curve (AUC). In addition, OCKRA achieved an AUC above 90% for more than 57% of the users.
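A heavily reduced sketch of the OCKRA idea: train several one-class models, each on a random subset of the features, and average their similarity scores. Each "model" here is a single centroid (k-means with k = 1) over its feature subset, and the data are synthetic; the real algorithm uses k-means proper and dataset-specific thresholds:

```python
# Reduced OCKRA-style ensemble: one-class models over random feature
# subsets, scores averaged. A "model" is just the centroid of the training
# data restricted to its subset; similarity is exp(-distance). Synthetic data.
import math
import random

random.seed(1)

def train_ockra(train_rows, n_models=5, subset_size=2):
    models = []
    n_feats = len(train_rows[0])
    for _ in range(n_models):
        feats = random.sample(range(n_feats), subset_size)   # random projection
        centroid = [sum(r[f] for r in train_rows) / len(train_rows)
                    for f in feats]
        models.append((feats, centroid))
    return models

def score(models, row):
    """Average similarity to the centroids; higher = more 'normal'."""
    sims = []
    for feats, centroid in models:
        d = math.dist([row[f] for f in feats], centroid)
        sims.append(math.exp(-d))
    return sum(sims) / len(sims)

# "normal" behaviour clusters near (1, 1, 1, 1); an anomaly sits far away
normal = [[1 + random.gauss(0, 0.1) for _ in range(4)] for _ in range(50)]
models = train_ockra(normal)
normal_score = score(models, [1.0, 1.0, 1.0, 1.0])
anomaly_score = score(models, [5.0, 5.0, 5.0, 5.0])
```

The random subsets give the ensemble members different views of the sensor features, which is the source of diversity that makes the averaged score more stable than any single one-class model.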
Guo, Yang; Liu, Shuhui; Li, Zhanhuai; Shang, Xuequn
The classification of cancer subtypes is of great importance to cancer diagnosis and therapy. Many supervised learning approaches have been applied to cancer subtype classification in the past few years, especially deep learning based approaches. Recently, the deep forest model has been proposed as an alternative to deep neural networks for learning hyper-representations by using cascaded ensembles of decision trees. The deep forest model has been shown to have performance competitive with, or sometimes better than, deep neural networks. However, the standard deep forest model may face overfitting and ensemble diversity challenges when dealing with small sample sizes and high-dimensional biology data. In this paper, we propose a deep learning model, called BCDForest, to address cancer subtype classification on small-scale biology datasets, which can be viewed as a modification of the standard deep forest model. BCDForest is distinguished from the standard deep forest model by the following two main contributions: First, a multi-class-grained scanning method is proposed to train multiple binary classifiers to encourage ensemble diversity. Meanwhile, the fitting quality of each classifier is considered in representation learning. Second, we propose a boosting strategy to emphasize more important features in cascade forests, thereby propagating the benefits of discriminative features among cascade layers to improve classification performance. Systematic comparison experiments on both microarray and RNA-Seq gene expression datasets demonstrate that our method consistently outperforms the state-of-the-art methods in application of cancer subtype classification. The multi-class-grained scanning and boosting strategy in our model provide an effective solution to ease the overfitting challenge and improve the robustness of the deep forest model working on small-scale data. Our model provides a useful approach to the classification of cancer subtypes
Zahra Nazari; Dongshik Kang
Support Vector Machines (SVM) are among the most successful algorithms for classification problems. SVM learns the decision boundary from two classes (for binary classification) of training points. However, training sets sometimes contain less meaningful samples that are corrupted by noise or misplaced on the wrong side, called outliers. These outliers affect the margin and classification performance, and the machine would do better to discard them. SVM as a popular and widely used cl...
Zameer, Aneela; Arshad, Junaid; Khan, Asifullah; Raja, Muhammad Asif Zahoor
Highlights: • Genetic programming based ensemble of neural networks is employed for short-term wind power prediction. • The proposed predictor shows resilience against abrupt changes in weather. • Genetic programming evolves a nonlinear mapping between meteorological measures and wind power. • The proposed approach gives mathematical expressions of wind power in terms of its independent variables. • The proposed model shows relatively accurate and steady wind power prediction performance. - Abstract: The inherent instability of wind power production leads to critical problems for smooth power generation from wind turbines, which in turn requires an accurate forecast of wind power. In this study, an effective short-term wind power prediction methodology is presented, which uses an intelligent ensemble regressor that comprises artificial neural networks and genetic programming. In contrast to existing series-based combinations of wind power predictors, whereby the error or variation in the leading predictor is propagated downstream to the next predictors, the proposed intelligent ensemble predictor avoids this shortcoming by introducing a Genetic Programming based semi-stochastic combination of neural networks. It is observed that the decisions of the individual base regressors may vary due to the frequent and inherent fluctuations in atmospheric conditions and thus meteorological properties. The novelty of the reported work lies in creating an ensemble to generate an intelligent, collective and robust decision space, thereby avoiding large errors due to the sensitivity of the individual wind predictors. The proposed ensemble-based regressor, a Genetic Programming based ensemble of artificial neural networks, has been implemented and tested on data taken from five different wind farms located in Europe. Obtained numerical results of the proposed model in terms of various error measures are compared with recent artificial intelligence based strategies to demonstrate the
Zhang, Wen; Niu, Yanqing; Zou, Hua; Luo, Longqiang; Liu, Qianchao; Wu, Weijian
T-cell epitopes play an important role in the T-cell immune response, and they are critical components in epitope-based vaccine design. Immunogenicity is the ability to trigger an immune response. The accurate prediction of immunogenic T-cell epitopes is significant for designing useful vaccines and understanding the immune system. In this paper, we attempt to differentiate immunogenic epitopes from non-immunogenic epitopes based on their primary structures. First, we explore a variety of sequence-derived features and analyze their relationship with epitope immunogenicity. To effectively utilize the various features, a genetic algorithm (GA)-based ensemble method is proposed to determine the optimal feature subset and develop a high-accuracy ensemble model. In the GA optimization, a chromosome represents a feature subset in the search space. For each feature subset, the selected features are utilized to construct base predictors, and an ensemble model is developed by taking the average of the outputs of the base predictors. The objective of the GA is to search for the optimal feature subset, which leads to the ensemble model with the best cross-validation AUC (area under the ROC curve) on the training set. Two datasets named 'IMMA2' and 'PAAQD' are adopted as benchmark datasets. Compared with the state-of-the-art methods POPI, POPISK, PAAQD and our previous method, the GA-based ensemble method produces much better performance, achieving an AUC score of 0.846 on the IMMA2 dataset and an AUC score of 0.829 on the PAAQD dataset. The statistical analysis demonstrates that the performance improvements of the GA-based ensemble method are statistically significant. The proposed method is a promising tool for predicting immunogenic epitopes. The source codes and datasets are available in S1 File.
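The GA search over feature subsets can be sketched with a toy fitness function. Here a chromosome is a bit vector over features, as in the paper, but the fitness is a made-up stand-in for cross-validation AUC (features 0-2 are "informative" and each extra feature costs a small penalty), so the known optimum is the subset {0, 1, 2}:

```python
# Toy genetic algorithm for feature-subset search. A chromosome is a bit
# vector over features; the synthetic fitness stands in for the ensemble's
# cross-validation AUC, so no actual classifiers are trained here.
import random

random.seed(3)

N_FEATURES = 8
GAIN = [0.3, 0.25, 0.2, 0.0, 0.0, 0.0, 0.0, 0.0]  # hypothetical feature value
PENALTY = 0.02                                    # cost per selected feature

def fitness(chrom):
    return sum(g for g, bit in zip(GAIN, chrom) if bit) - PENALTY * sum(chrom)

def mutate(chrom, rate=0.1):
    return [1 - b if random.random() < rate else b for b in chrom]

def crossover(a, b):
    cut = random.randrange(1, len(a))
    return a[:cut] + b[cut:]

def evolve(pop_size=30, generations=60):
    pop = [[random.randint(0, 1) for _ in range(N_FEATURES)]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]            # elitist selection
        children = [mutate(crossover(random.choice(parents),
                                     random.choice(parents)))
                    for _ in range(pop_size - len(parents))]
        pop = parents + children
    return max(pop, key=fitness)

best = evolve()
```

In the paper's setting, evaluating `fitness` means building the base predictors on the selected features and averaging their outputs to get a cross-validation AUC, which is far more expensive than this synthetic surrogate but structurally identical.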
Suh, Jong Hwan
In recent years, the anonymous nature of the Internet has made it difficult to detect manipulated user reputations in social media, as well as to ensure the quality of users and their posts. To deal with this, this study designs and examines an automatic approach that adopts writing style features to estimate user reputations in social media. Under varying ways of defining Good and Bad classes of user reputations based on the collected data, it evaluates the classification performance of state-of-the-art methods: four writing style features (lexical, syntactic, structural, and content-specific) and eight classification techniques, namely four base learners (C4.5, Neural Network (NN), Support Vector Machine (SVM), and Naïve Bayes (NB)) and four Random Subspace (RS) ensemble methods based on the four base learners. With South Korea's Web forum, Daum Agora, as a test bed, the experimental results show that the configuration of the full feature set containing content-specific features with RS-SVM (combining RS and SVM) gives the best classification accuracy when the test bed's poster reputations are segmented strictly into Good and Bad classes by the portfolio approach. Pairwise t tests on accuracy confirm two expectations from the literature: first, the feature set including content-specific features outperforms the others; second, ensemble learning methods are more viable than base learners. Moreover, among the four ways of defining the classes of user reputations (like, dislike, sum, and portfolio), the results show that the portfolio approach gives the highest accuracy.
Song Chenxiu; Zhu Lixin
Research reactors differ in reactor type, use, power level, design principle, operation model and safety performance, and they also differ significantly in terms of nuclear safety regulation. This paper introduces a classification of research reactors and discusses approaches to safety regulation based on this classification. (authors)
He, Hui; Li, Zhigang; Yao, Chongchong; Zhang, Weizhe
With diverse online media emerging, the sentiment classification problem has attracted growing attention. At present, text sentiment classification mainly utilizes supervised machine learning methods, which exhibit a degree of domain dependency. On the basis of Markov logic networks (MLNs), this study proposes a cross-domain multi-task text sentiment classification method rooted in transfer learning. Through many-to-one knowledge transfer, labeled text sentiment classification knowledge was successfully transferred to other domains, and the precision of sentiment classification analysis in the text tendency domain was improved. The experimental results revealed the following: (1) the model based on an MLN demonstrated higher precision than the single individual learning plan model; (2) multi-task transfer learning based on Markov logic networks could acquire more knowledge than self-domain learning. The cross-domain text sentiment classification model could significantly improve the precision and efficiency of text sentiment classification.
A method frequently used in classification systems to improve classification accuracy is to combine the outputs of several classifiers. Among the various types of classifiers, fuzzy ones are tempting because they use intelligible fuzzy if-then rules. In this paper we build an AdaBoost ensemble of relational neuro-fuzzy classifiers. Relational fuzzy systems bond input and output fuzzy linguistic values by a binary relation; thus, compared to traditional fuzzy systems, fuzzy rules have additional weights: the elements of a fuzzy relation matrix. This makes the system more adjustable to the data during learning. An ensemble of relational fuzzy systems is proposed here. The problem is that such an ensemble contains separate rule bases which cannot be directly merged. As the systems are separate, we cannot treat fuzzy rules coming from different systems as rules from the same (single) system. The problem is addressed by a novel design of the fuzzy systems constituting the ensemble, resulting in normalization of the individual rule bases during learning. The method described in the paper is tested on several known benchmarks and compared with other machine learning solutions from the literature.
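The AdaBoost scheme itself is easy to sketch. Below, decision stumps stand in for the relational neuro-fuzzy base classifiers (which are far more elaborate); labels are +1/-1, and each boosting round reweights the samples the previous stump misclassified:

```python
# Minimal AdaBoost with decision stumps as weak learners. Each round picks
# the stump with the lowest weighted error, gives it a vote proportional to
# its accuracy, and upweights the samples it got wrong.
import math

def stump_predict(threshold, direction, x):
    return direction if x >= threshold else -direction

def best_stump(xs, ys, weights):
    best = None
    for threshold in xs:
        for direction in (1, -1):
            err = sum(w for x, y, w in zip(xs, ys, weights)
                      if stump_predict(threshold, direction, x) != y)
            if best is None or err < best[0]:
                best = (err, threshold, direction)
    return best

def adaboost(xs, ys, rounds=3):
    n = len(xs)
    weights = [1 / n] * n
    model = []
    for _ in range(rounds):
        err, threshold, direction = best_stump(xs, ys, weights)
        err = max(err, 1e-10)                    # avoid log(0) on perfect fit
        alpha = 0.5 * math.log((1 - err) / err)  # stump's voting weight
        model.append((alpha, threshold, direction))
        weights = [w * math.exp(-alpha * y * stump_predict(threshold, direction, x))
                   for x, y, w in zip(xs, ys, weights)]
        z = sum(weights)
        weights = [w / z for w in weights]       # renormalize
    return model

def predict(model, x):
    score = sum(alpha * stump_predict(t, d, x) for alpha, t, d in model)
    return 1 if score >= 0 else -1

xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [-1, -1, -1, 1, 1, 1, 1, 1]   # separable at x >= 4
model = adaboost(xs, ys)
preds = [predict(model, x) for x in xs]
```

In the paper the weak learners are whole relational fuzzy systems rather than stumps, but the reweighting loop and the weighted vote are exactly this schema.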
Chardon, J.; Mathevet, T.; Le Lay, M.; Gailhard, J.
In the context of a national energy company (EDF: Électricité de France), hydro-meteorological forecasts are necessary to ensure the safety and security of installations, meet environmental standards, and improve water resources management and decision making. Hydrological ensemble forecasts allow a better representation of meteorological and hydrological forecast uncertainties and improve the human expertise of hydrological forecasts, which is essential to synthesize available information coming from different meteorological and hydrological models and human experience. An operational hydrological ensemble forecasting chain has been developed at EDF since 2008 and has been used since 2010 on more than 30 watersheds in France. This ensemble forecasting chain is characterized by ensemble pre-processing (rainfall and temperature) and post-processing (streamflow), where a large amount of human expertise is solicited. The aim of this paper is to compare two hydrological ensemble post-processing methods developed at EDF in order to improve ensemble forecast reliability (similar to Montanari & Brath, 2004; Schaefli et al., 2007). The aim of the post-processing methods is to dress hydrological ensemble forecasts with hydrological model uncertainties, based on perfect forecasts. The first method (called the empirical approach) is based on statistical modeling of the empirical error of perfect forecasts, using streamflow sub-samples by quantile class and lead time. The second method (called the dynamical approach) is based on streamflow sub-samples by quantile class, streamflow variation, and lead time. On a set of 20 watersheds used for operational forecasts, results show that both approaches are necessary to ensure good post-processing of the hydrological ensemble, yielding a clear improvement in the reliability, skill and sharpness of the ensemble forecasts. The comparison of the empirical and dynamical approaches shows the limits of the empirical approach, which is not able to take into account hydrological
Binder, Moritz; Barthel, Thomas
Based on DMRG, strongly correlated quantum many-body systems at finite temperatures can be simulated by sampling over a certain class of pure matrix product states (MPS) called minimally entangled typical thermal states (METTS). Here, we show how symmetries of the system can be exploited to considerably reduce computation costs in the METTS algorithm. While this is straightforward for the canonical ensemble, we introduce a modification of the algorithm to efficiently simulate the grand-canonical ensemble under utilization of symmetries. In addition, we construct novel symmetry-conserving collapse bases for the transitions in the Markov chain of METTS that improve the speed of convergence of the algorithm by reducing autocorrelations.
An ensemble forecasting scheme for tsunami inundation is presented. The scheme consists of three elemental methods. The first is a hierarchical Bayesian inversion using Akaike's Bayesian Information Criterion (ABIC). The second is Monte Carlo sampling from a probability density function of a multidimensional normal distribution. The third is ensemble analysis of tsunami inundation simulations with multiple tsunami sources. Simulation-based validation of the model was conducted. A tsunami scenario of an M9.1 Nankai earthquake was chosen as the validation target. Tsunami inundation around Nagoya Port was estimated using synthetic tsunami waveforms at offshore GPS buoys. The error in the estimated tsunami inundation area was about 10% even when only ten minutes of observation data were used. The estimation accuracy of waveforms on/off land and the spatial distribution of maximum tsunami inundation depth are also demonstrated.
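The three-step scheme can be caricatured with toy numbers: a source estimate with uncertainty stands in for the Bayesian inversion, Monte Carlo draws from a normal distribution provide the source ensemble, and ensemble analysis of a trivially simple surrogate "inundation model" yields an exceedance probability. Every number and the linear surrogate below are invented for illustration:

```python
# Caricature of the ensemble tsunami-forecast pipeline:
# (1) a slip estimate with uncertainty (stand-in for the ABIC inversion),
# (2) Monte Carlo sampling of sources from a normal distribution,
# (3) ensemble analysis of a toy surrogate model -> exceedance probability.
import random

random.seed(42)

def inundation_depth(slip_m):
    """Toy surrogate: inundation depth grows linearly with fault slip."""
    return 0.4 * slip_m

mean_slip, slip_sd = 10.0, 2.0          # hypothetical inversion result (m)
samples = [random.gauss(mean_slip, slip_sd) for _ in range(5000)]
depths = [inundation_depth(s) for s in samples]

# ensemble statistics: probability that inundation depth exceeds 3 m
p_exceed_3m = sum(1 for d in depths if d > 3.0) / len(depths)
mean_depth = sum(depths) / len(depths)
```

In the actual scheme each sampled source drives a full inundation simulation, and the multidimensional normal is over the inverted source parameters rather than a single slip value; the ensemble statistics are computed the same way.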
Wang Yonggang; Chang Baosheng
Nuclear safety is the most critical target of nuclear power plant operation. Besides the rigid operating procedures established, evaluation of the suppliers working with plants is another important aspect. Supplier selection and evaluation can be approached through qualitative analysis and quantitative management. The indicators involved are coupled with each other in a very complicated manner, and therefore the relevant data are strongly non-linear. This article is based on research and analysis of the real conditions of operation management at the Daya Bay nuclear power plant. Through study and analysis of information from home and abroad, and with reference to neural network ensemble technology, the supplier evaluation system and model are established as illustrated in the paper, thereby increasing the objectivity of supplier selection. (authors)
Lei, Baiying; Hou, Wen; Zou, Wenbin; Li, Xia; Zhang, Cishen; Wang, Tianfu
Neuroimaging data has been widely used to predict clinical scores for automatic diagnosis of Alzheimer's disease (AD). For accurate clinical score prediction, one of the major challenges is high feature dimension of the imaging data. To address this issue, this paper presents an effective framework using a novel feature selection model via sparse learning. In contrast to previous approaches focusing on a single time point, this framework uses information at multiple time points. Specifically, a regularized correntropy with the spatial-temporal constraint is used to reduce the adverse effect of noise and outliers, and promote consistent and robust selection of features by exploring data characteristics. Furthermore, ensemble learning of support vector regression (SVR) is exploited to accurately predict AD scores based on the selected features. The proposed approach is extensively evaluated on the Alzheimer's disease neuroimaging initiative (ADNI) dataset. Our experiments demonstrate that the proposed approach not only achieves promising regression accuracy, but also successfully recognizes disease-related biomarkers.
Smith, Amy F.
© 2014 The Authors. Microcirculation published by John Wiley & Sons Ltd. Objective: Recent developments in high-resolution imaging techniques have enabled digital reconstruction of three-dimensional sections of microvascular networks down to the capillary scale. To better interpret these large data sets, our goal is to distinguish branching trees of arterioles and venules from capillaries. Methods: Two novel algorithms are presented for classifying vessels in microvascular anatomical data sets without requiring flow information. The algorithms are compared with a classification based on observed flow directions (considered the gold standard), and with an existing resistance-based method that relies only on structural data. Results: The first algorithm, developed for networks with one arteriolar and one venular tree, performs well in identifying arterioles and venules and is robust to parameter changes, but incorrectly labels a significant number of capillaries as arterioles or venules. The second algorithm, developed for networks with multiple inlets and outlets, correctly identifies more arterioles and venules, but is more sensitive to parameter changes. Conclusions: The algorithms presented here can be used to classify microvessels in large microvascular data sets lacking flow information. This provides a basis for analyzing the distinct geometrical properties and modelling the functional behavior of arterioles, capillaries, and venules.
Tervonen, Tommi; Linkov, Igor; Figueira, Jose Rui; Steevens, Jeffery; Chappell, Mark; Merad, Myriam
Various stakeholders are increasingly interested in the potential toxicity and other risks associated with nanomaterials throughout the different stages of a product's life cycle (e.g., development, production, use, disposal). Risk assessment methods and tools developed and applied to chemical and biological materials may not be readily adaptable for nanomaterials because of the current uncertainty in identifying the relevant physico-chemical and biological properties that adequately describe the materials. Such uncertainty is further driven by the substantial variations in the properties of the original material due to variable manufacturing processes employed in nanomaterial production. To guide scientists and engineers in nanomaterial research and application as well as to promote the safe handling and use of these materials, we propose a decision support system for classifying nanomaterials into different risk categories. The classification system is based on a set of performance metrics that measure both the toxicity and physico-chemical characteristics of the original materials, as well as the expected environmental impacts through the product life cycle. Stochastic multicriteria acceptability analysis (SMAA-TRI), a formal decision analysis method, was used as the foundation for this task. This method allowed us to cluster various nanomaterials in different ecological risk categories based on our current knowledge of nanomaterial physico-chemical characteristics, variation in produced material, and best professional judgments. SMAA-TRI uses Monte Carlo simulations to explore all feasible values for weights, criteria measurements, and other model parameters to assess the robustness of nanomaterial grouping for risk management purposes.
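The core SMAA computation, estimating rank acceptability by Monte Carlo sampling over feasible criterion weights, can be sketched as follows. The three "nanomaterials" and their criterion scores are invented; in this toy example nano_A dominates on every criterion, so it lands in the riskiest rank for every weight sample:

```python
# Sketch of the SMAA idea: when criterion weights are uncertain, sample many
# feasible weight vectors and record how often each alternative lands in each
# risk rank. Scores are made-up 0-1 measurements (higher = worse).
import random

random.seed(0)

scores = {                      # criteria: (toxicity, env_impact, variability)
    "nano_A": (0.9, 0.8, 0.7),
    "nano_B": (0.4, 0.5, 0.6),
    "nano_C": (0.2, 0.1, 0.3),
}

def random_weights(k):
    """Random point in the k-simplex (normalized exponential draws)."""
    raw = [random.expovariate(1.0) for _ in range(k)]
    total = sum(raw)
    return [r / total for r in raw]

rank_counts = {name: [0, 0, 0] for name in scores}
for _ in range(2000):
    w = random_weights(3)
    ranked = sorted(scores,
                    key=lambda n: -sum(wi * si for wi, si in zip(w, scores[n])))
    for rank, name in enumerate(ranked):
        rank_counts[name][rank] += 1

# rank acceptability: fraction of samples placing nano_A in rank 0 (riskiest)
acceptability_A = rank_counts["nano_A"][0] / 2000
```

SMAA-TRI proper sorts alternatives into predefined categories against profile thresholds rather than ranking them against each other, but the Monte Carlo exploration of weight space shown here is the shared mechanism.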
Chen, Jinying; Yu, Hong
Allowing patients to access their own electronic health record (EHR) notes through online patient portals has the potential to improve patient-centered care. However, EHR notes contain abundant medical jargon that can be difficult for patients to comprehend. One way to help patients is to reduce information overload and help them focus on the medical terms that matter most to them. Targeted education can then be developed to improve patient EHR comprehension and the quality of care. The aim of this work was to develop FIT (Finding Important Terms for patients), an unsupervised natural language processing (NLP) system that ranks medical terms in EHR notes based on their importance to patients. We built FIT on a new unsupervised ensemble ranking model derived from the biased random walk algorithm to combine heterogeneous information resources for ranking candidate terms from each EHR note. Specifically, FIT integrates four single views (rankers) for term importance: patient use of medical concepts, document-level term salience, word co-occurrence based term relatedness, and topic coherence. It also incorporates partial information of term importance as conveyed by terms' unfamiliarity levels and semantic types. We evaluated FIT on 90 expert-annotated EHR notes and used the four single-view rankers as baselines. In addition, we implemented three benchmark unsupervised ensemble ranking methods as strong baselines. FIT achieved 0.885 AUC-ROC for ranking candidate terms from EHR notes to identify important terms. When including term identification, the performance of FIT for identifying important terms from EHR notes was 0.813 AUC-ROC. Both performance scores significantly exceeded the corresponding scores from the four single rankers. FIT can thus help patients focus on the terms that matter most to them and may help develop future interventions to improve quality of care. By using unsupervised learning as well as a robust and flexible framework for information fusion, FIT can be readily applied to other domains and applications.
Rautenhaus, M.; Grams, C. M.; Schäfler, A.; Westermann, R.
We present the application of interactive 3-D visualization of ensemble weather predictions to forecasting warm conveyor belt situations during aircraft-based atmospheric research campaigns. Motivated by forecast requirements of the T-NAWDEX-Falcon 2012 campaign, a method to predict 3-D probabilities of the spatial occurrence of warm conveyor belts has been developed. Probabilities are derived from Lagrangian particle trajectories computed on the forecast wind fields of the ECMWF ensemble prediction system. Integration of the method into the 3-D ensemble visualization tool Met.3D, introduced in the first part of this study, facilitates interactive visualization of WCB features and derived probabilities in the context of the ECMWF ensemble forecast. We investigate the sensitivity of the method with respect to trajectory seeding and forecast wind field resolution. Furthermore, we propose a visual analysis method to quantitatively analyse the contribution of ensemble members to a probability region and, thus, to assist the forecaster in interpreting the obtained probabilities. A case study, revisiting a forecast case from T-NAWDEX-Falcon, illustrates the practical application of Met.3D and demonstrates the use of 3-D and uncertainty visualization for weather forecasting and for planning flight routes in the medium forecast range (three to seven days before take-off).
Rasmussen, Christoffer Bøgelund; Nasrollahi, Kamal; Moeslund, Thomas B.
detectors. Ensemble strategies explored were firstly data sampling and selection and secondly combination strategies. Data sampling and selection aimed to create different subsets of data with respect to object size and image quality such that expert R-FCN ensemble members could be trained. Two combination...
Full Text Available A weighted accuracy and diversity (WAD) method is presented: a novel measure used to evaluate the quality of a classifier ensemble, assisting in the ensemble selection task. The proposed measure is motivated by a commonly accepted hypothesis: a robust classifier ensemble should not only be accurate, each member should also be different from every other member. In fact, accuracy and diversity are mutually constraining factors: an ensemble with high accuracy may have low diversity, and an overly diverse ensemble may negatively affect accuracy. This study proposes a method to find the balance between accuracy and diversity that enhances the predictive ability of an ensemble on unknown data. The quality assessment for an ensemble is performed such that the final score is achieved by computing the harmonic mean of accuracy and diversity, where two weight parameters are used to balance them. The measure is compared to two representative measures, Kappa-Error and GenDiv, and to two threshold measures that consider only accuracy or diversity, using two heuristic search algorithms (genetic algorithm and forward hill-climbing) in ensemble selection tasks performed on 15 UCI benchmark datasets. The empirical results demonstrate that the WAD measure is superior to the others in most cases.
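The weighted harmonic-mean scoring described above can be sketched in a few lines. This is a minimal illustration, assuming both accuracy and diversity are scaled to [0, 1]; the function name `wad` and the default weights are assumptions, not the paper's notation:

```python
def wad(accuracy, diversity, w_acc=0.5, w_div=0.5):
    """Weighted accuracy-and-diversity score for a classifier ensemble.

    Combines ensemble accuracy and pairwise diversity (both assumed to
    lie in [0, 1]) via a weighted harmonic mean, so a low value on
    either axis drags the whole score down.
    """
    if accuracy <= 0.0 or diversity <= 0.0:
        return 0.0
    return (w_acc + w_div) / (w_acc / accuracy + w_div / diversity)

# The harmonic mean penalizes imbalance: an accurate but non-diverse
# ensemble scores well below the arithmetic mean of the two values.
balanced = wad(0.8, 0.8)   # -> 0.8
lopsided = wad(0.9, 0.3)   # -> 0.45, under (0.9 + 0.3) / 2 = 0.6
```

Raising `w_acc` relative to `w_div` biases the selection toward accurate ensembles, which is how the two weight parameters balance the two criteria.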
Friedel, Michael; Buscema, Massimo
A training scheme is proposed for the real-time classification of soil and vegetation (landscape) components in EO-1 Hyperion hyperspectral images. First, an auto-contractive map is used to compute connectivity of reflectance values for spectral bands (N=200) from independent laboratory spectral library components. Second, a minimum spanning tree is used to identify optimal grouping of training components from the connectivity values. Third, the reflectance values for optimal landscape component signatures are sorted. Fourth, empirical distribution functions (EDFs) are computed for each landscape component. Fifth, the Monte Carlo technique is used to generate realizations (N=30) for each landscape EDF. The correspondence of component realizations to original signatures validates the stochastic procedure. The realizations are presented to the self-organizing map (SOM) using three different map sizes: 14x10, 28x20, and 40x30. In each case, SOM training proceeds first with a rough phase (20 iterations using a Gaussian neighborhood with an initial radius of 11 units and a final radius of 3 units) and then a fine phase (400 iterations using a Gaussian neighborhood with an initial radius of 3 units and a final radius of 1 unit). The initial and final learning rates of 0.5 and 0.05 decay linearly down to 10^-5, and the Gaussian neighborhood function decreases exponentially (decay rate of 10^-3 iteration^-1), providing reasonable convergence. Following training of the three networks, each corresponding SOM is used to independently classify the original spectral library signatures. In comparing the different SOM networks, the 28x20 map size is chosen for independent reproducibility and processing speed. The corresponding unified distance matrix reveals separation of the seven component classes for this map size, thereby supporting its use as a Hyperion classifier.
Hu, Xiaohua; Park, E K; Zhang, Xiaodan
Generating high-quality gene clusters and identifying the underlying biological mechanisms of the gene clusters are important goals of clustering gene expression analysis. To get high-quality cluster results, most current approaches rely on choosing the best cluster algorithm, whose design biases and assumptions meet the underlying distribution of the dataset. There are two issues with this approach: 1) usually, the underlying data distribution of gene expression datasets is unknown, and 2) there are so many clustering algorithms available that it is very challenging to choose the proper one. To provide a textual summary of the gene clusters, the most explored approach is the extractive approach, which essentially builds upon techniques borrowed from information retrieval, in which the objective is to provide terms to be used for query expansion, not to act as a stand-alone summary for the entire document set. Another drawback is that clustering quality and cluster interpretation are treated as two isolated research problems and are studied separately. In this paper, we design and develop a unified system, Gene Expression Miner, to address these challenging issues in a principled and general manner by integrating cluster ensemble, text clustering, and multidocument summarization, and provide an environment for comprehensive gene expression data analysis. We present a novel cluster ensemble approach to generate high-quality gene clusters. In our text summarization module, given a gene cluster, our expectation-maximization based algorithm can automatically identify subtopics and extract the most probable terms for each topic. Then, the extracted top k topical terms from each subtopic are combined to form the biological explanation of each gene cluster. Experimental results demonstrate that our system can obtain high-quality clusters and provide informative key terms for the gene clusters.
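A common way to build a cluster ensemble of the kind described above is to accumulate a co-association matrix across base clusterings and then recluster that matrix. A minimal sketch, assuming each partition is given as a flat list of cluster ids per item (the function name and representation are illustrative, not the system's actual implementation):

```python
def co_association(partitions):
    """partitions: list of clusterings, each a list of cluster ids,
    one id per item. Returns an n x n matrix whose (i, j) entry is
    the fraction of base clusterings that put items i and j in the
    same cluster -- the consensus evidence a final clustering uses.
    """
    n = len(partitions[0])
    matrix = [[0.0] * n for _ in range(n)]
    for partition in partitions:
        for i in range(n):
            for j in range(n):
                if partition[i] == partition[j]:
                    matrix[i][j] += 1.0 / len(partitions)
    return matrix

# Two base clusterings of three genes that disagree about gene 1:
m = co_association([[0, 0, 1], [0, 1, 1]])
# m[0][1] and m[1][2] are 0.5 (co-clustered once), m[0][2] is 0.0
```

Any standard clustering algorithm can then be run on `1 - matrix` as a distance to obtain the consensus gene clusters.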
On the NYYD Ensemble duo Traksmann - Lukk and E.-S. Tüür's work "Symbiosis", which has also been recorded on the recently released NYYD Ensemble CD. On 2 March in the small hall of the Rakvere Theatre and on 3 March at the Rotermann Salt Storage; the programme includes Tüür, Kaumann, Berio, Reich, Yun, Hauta-aho and Buckinx.
Arjona, Cristian; Pentácolo, José; Gareis, Iván; Atum, Yanina; Gentiletti, Gerardo; Acevedo, Rubén; Rufiner, Leonardo
The Brain Computer Interface (BCI) translates brain activity into computer commands. To increase the performance of a BCI in decoding user intentions, it is necessary to improve the feature extraction and classification techniques. In this article, the performance of an ensemble of three linear discriminant analysis (LDA) classifiers is studied. A system based on an ensemble can theoretically achieve better classification results than its individual counterparts, depending on the algorithm used to generate the individual classifiers and the procedures for combining their outputs. Classic ensemble algorithms such as bagging and boosting are discussed here. For the application to BCI, it was concluded that the results generated using ER and AUC as performance indices do not give enough information to establish which configuration is better.
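Bagging, one of the classic ensemble algorithms discussed above, trains each member on a bootstrap resample of the data and combines the members by majority vote. A minimal pure-Python sketch, with a one-dimensional threshold stump standing in for the LDA base classifiers of the study (all names and the toy data are illustrative):

```python
import random

def train_stump(sample):
    """Base learner: threshold halfway between the two class means."""
    m0 = [x for x, y in sample if y == 0]
    m1 = [x for x, y in sample if y == 1]
    threshold = (sum(m0) / len(m0) + sum(m1) / len(m1)) / 2.0
    flip = sum(m1) / len(m1) < sum(m0) / len(m0)
    return lambda x: int((x > threshold) != flip)

def bagging(data, n_models=11, seed=0):
    """Train n_models stumps on bootstrap resamples of data
    (list of (x, label) pairs); predict by majority vote."""
    rng = random.Random(seed)
    def bootstrap():
        while True:  # resample until both classes are present
            s = [rng.choice(data) for _ in data]
            if len({y for _, y in s}) == 2:
                return s
    models = [train_stump(bootstrap()) for _ in range(n_models)]
    return lambda x: int(sum(m(x) for m in models) * 2 > n_models)

data = [(0.0, 0), (0.2, 0), (0.4, 0), (1.0, 1), (1.2, 1), (1.4, 1)]
predict = bagging(data)
```

Boosting differs in that resampling is not uniform: each round reweights the data toward the examples the previous members misclassified.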
Seetha, M.; Ravi, G.
Image classification is the process of assigning classes to the pixels in remotely sensed images and is important for GIS applications, since a classified image is much easier to incorporate than the original unclassified image. To resolve misclassification in traditional parametric classifiers like the Maximum Likelihood Classifier, a neural network classifier is implemented using the back propagation algorithm. The extra spectral and spatial knowledge acquired from ancillary information is required to improve the accuracy and remove spectral confusion. To build the knowledge base automatically, this paper explores a non-parametric decision tree classifier to extract knowledge from the spatial data in the form of classification rules. A new method is proposed using a data structure called the Peano Count Tree (P-tree) for decision tree classification. The Peano Count Tree is a spatial data organization that provides a lossless compressed representation of a spatial data set and facilitates more efficient classification than other data mining techniques. The accuracy is assessed using overall accuracy, user's accuracy, and producer's accuracy for the image classification methods of Maximum Likelihood Classification, neural network classification using back propagation, knowledge base classification, post-classification, and the P-tree classifier. The results reveal that the knowledge extracted from the decision tree classifier and the P-tree data structure in the proposed approach removes the problem of spectral confusion to a great extent. It is ascertained that the P-tree classifier surpasses the other classification techniques.
Full Text Available The representation based classification method (RBCM) has shown huge potential for face recognition since it first emerged. The linear regression classification (LRC) method and the collaborative representation classification (CRC) method are two well-known RBCMs. LRC and CRC exploit the training samples of each class and all the training samples, respectively, to represent the testing sample, and subsequently conduct classification on the basis of the representation residual. The LRC method can be viewed as a "locality representation" method because it uses only the training samples of each class to represent the testing sample, so it cannot embody the effectiveness of "globality representation." Conversely, the CRC method cannot enjoy the locality benefit of the general RBCM. Thus we propose to integrate CRC and LRC to perform more robust representation based classification. The experimental results on benchmark face databases substantially demonstrate that the proposed method achieves high classification accuracy.
Full Text Available Multi-camera networks have gained great interest in video-based surveillance systems for security monitoring, access control, etc. Person re-identification is an essential and challenging task in multi-camera networks, which aims to determine whether a given individual has already appeared over the camera network. Individual recognition often relies on faces and requires a large number of samples during the training phase. This is difficult to fulfill due to the limitations of the camera hardware system and the unconstrained image capturing conditions. Conventional face recognition algorithms often encounter the "small sample size" (SSS) problem arising from the small number of training samples compared to the high dimensionality of the sample space. To overcome this problem, interest in the combination of multiple base classifiers has sparked research efforts in ensemble methods. However, existing ensemble methods still leave two questions open: (1) how to define diverse base classifiers from the small data; (2) how to avoid the diversity/accuracy dilemma occurring during ensemble. To address these problems, this paper proposes a novel generic learning-based ensemble framework, which augments the small data by generating new samples based on a generic distribution and introduces a tailored 0-1 knapsack algorithm to alleviate the diversity/accuracy dilemma. More diverse base classifiers can be generated from the expanded face space, and more appropriate base classifiers are selected for the ensemble. Extensive experimental results on four benchmarks demonstrate the higher ability of our system to cope with the SSS problem compared to the state-of-the-art systems.
El Alem, A.; Chokmani, K.; Laurion, I.; El-Adlouni, S. E.
Because of the sensitivity of inland freshwaters to harmful algal bloom (HAB) development and the limited coverage of standard monitoring programs, remote sensing data have become increasingly used for monitoring HAB extension. Usually, HAB monitoring using remote sensing data is based on empirical and semi-empirical models. Development of such models requires a great number of continuous in situ measurements to reach an acceptable accuracy. However, ministries and water management organizations often use two thresholds, established by the World Health Organization, to determine water quality. Consequently, the available data are ordinal ("semi-qualitative") and mostly unexploited. Use of such databases with remote sensing data and statistical classification algorithms can produce hazard management maps linked to the presence of cyanobacteria. Unlike standard classification algorithms, which are generally unstable, classifiers based on ensemble systems are more general and stable. In the present study, an ensemble-based classifier was developed and compared to a standard classification method called CART (Classification and Regression Tree) in a context of HAB monitoring in freshwaters, using MODIS images downscaled to 250 m spatial resolution and ordinal in situ data. Calibration and validation data on cyanobacteria densities were collected by the Ministère du Développement durable, de l'Environnement et de la Lutte contre les changements climatiques on 22 water bodies between 2000 and 2010. These data comprise three density classes, from waters poorly loaded to waters highly loaded (above 100,000 cells mL-1) in cyanobacteria. The results were very interesting and highlighted that inland waters exhibit different spectral responses, allowing them to be classified into the three classes above for water quality monitoring. On the other hand, even if the accuracy (Kappa index = 0.86) of the proposed approach is slightly lower than that of the CART algorithm (Kappa index = 0.87), its robustness is higher.
Full Text Available Aiming at the problems of low categorization accuracy and uneven distribution in traditional text classification algorithms, a text classification algorithm based on deep learning is put forward. Deep belief networks have a very strong feature learning ability and can extract features from the high-dimensional original representation, which are then used to train the classification model. The TF-IDF formula is used to compute text feature values, and deep belief networks are used to construct the classifier. The experimental results show that, compared with commonly used classification algorithms such as support vector machines, neural networks and extreme learning machines, the algorithm has higher accuracy and practicability, and it opens up new ideas for research on text classification.
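The TF-IDF weighting used above for the text features can be computed directly. A minimal sketch assuming already-tokenized documents; note that several TF-IDF variants exist, and the tf = count/length, idf = log(N/df) form here is one common choice, not necessarily the paper's:

```python
import math

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per
    document, where tf = term count / document length and
    idf = log(number of documents / documents containing the term)."""
    n = len(docs)
    df = {}  # document frequency of each term
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    vectors = []
    for doc in docs:
        weights = {}
        for term in set(doc):
            tf = doc.count(term) / len(doc)
            weights[term] = tf * math.log(n / df[term])
        vectors.append(weights)
    return vectors

# A term that appears in every document gets idf = log(1) = 0,
# so only discriminative terms carry weight into the classifier.
vecs = tf_idf([["a", "b"], ["a", "c"]])
```

These sparse weight vectors (padded to a fixed vocabulary) are the kind of input a deep belief network classifier would be trained on.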
Full Text Available Forecasting the Indian summer monsoon is a challenging task due to its complex and nonlinear behavior. A large number of global climatic variables with varying interaction patterns over the years influence the monsoon. Various statistical and neural prediction models have been proposed for forecasting the monsoon, but many of them fail to capture variability over the years. The skill of the predictor variables of the monsoon also evolves over time. In this article, we propose a joint clustering of monsoon years and predictors for understanding and predicting the monsoon, achieved by a subspace clustering algorithm. It groups the years based on the prevailing global climatic conditions using a statistical clustering technique and subsequently, for each such group, identifies significant climatic predictor variables which assist in better prediction. A prediction model is designed for each individual cluster using a random forest of regression trees. Prediction of aggregate and regional monsoon is attempted. A mean absolute error of 5.2% is obtained for forecasting the aggregate Indian summer monsoon. Errors in predicting the regional monsoons are also comparable given the high variation of regional precipitation. The proposed joint-clustering based ensemble model is observed to be superior to existing monsoon prediction models, and it also surpasses general non-clustering based prediction models.
Yeh, Jia-Rong; Lin, Tzu-Yu; Chen, Yun; Sun, Wei-Zen; Abbod, Maysam F; Shieh, Jiann-Shing
The cardiovascular system is known to be nonlinear and nonstationary. Traditional linear algorithms for assessing arterial stiffness and systemic resistance of the cardiac system suffer from nonstationarity problems or are inconvenient in practical applications. In this pilot study, two new assessment methods were developed: the first is an ensemble empirical mode decomposition based reflection index (EEMD-RI), while the second is based on the phase shift between ECG and BP on cardiac oscillation. Both methods utilise the EEMD algorithm, which is suitable for nonlinear and nonstationary systems. These methods were used to investigate the properties of arterial stiffness and systemic resistance of a pig's cardiovascular system via ECG and blood pressure (BP). The experiment simulated a sequence of continuous changes of blood pressure, from a steady condition to high blood pressure obtained by clamping the artery, and the inverse by relaxing the artery. As hypothesized, the arterial stiffness and systemic resistance should vary with the blood pressure due to clamping and relaxing the artery. The results show statistically significant correlations between BP, the EEMD-based RI, and the phase shift between ECG and BP on cardiac oscillation. The two assessment results demonstrate the merits of EEMD for signal analysis.
Full Text Available Over the last few years, a collaborative work between CERFACS, LNHE (EDF R&D), SCHAPI and CEREMA resulted in the implementation of a Data Assimilation (DA) method on top of MASCARET in the framework of real-time forecasting. This prototype was based on a simplified Kalman filter where the description of the background error covariances is prescribed from an off-line climatology and held constant over time. This approach showed promising results on the Adour and Marne catchments, as it improves the forecast skill of the hydraulic model using water level and discharge in-situ observations. An ensemble-based DA algorithm has recently been implemented to improve the modelling of the background error covariance matrix used to distribute the correction to the water level and discharge states when observations are assimilated, from the observation points to the entire state. It was demonstrated that the flow-dependent description of the background error covariances with the EnKF algorithm leads to a more realistic correction of the hydraulic state, with significant impact of the hydraulic network characteristics.
Smith, Valoris Reid
This thesis discusses two ensembles, the study of which was dependent upon the controllable production of cold gas-phase samples using supersonic beams. The experiments on DNA bases and base clusters were carried out in Germany at the Max Born Institute. The experiments anticipating the construction of a molecular beam slower were carried out in the United States at the University of Texas at Austin. Femtosecond pump-probe techniques were employed to study the dynamics and electronic character of DNA bases, pairs and clusters in the gas phase. Experiments on DNA base monomers confirmed the dominance of a particular relaxation pathway, the nπ* state. Competition between this state and another proposed relaxation pathway was demonstrated through observations of the DNA base pairs and base-water clusters, settling a recent controversy. Further, it was determined that the excited state dynamics in base pairs is due to intramolecular processes rather than intermolecular processes. Finally, results from base-water clusters confirm that microsolvation permits comparison with biologically relevant liquid phase experiments and with ab initio calculations, bridging a long-standing gap. A purely mechanical technique that does not rely upon quantum or electronic properties to produce very cold, very slow atoms and molecules would be more generally applicable than current approaches. The approach described here uses supersonic beam methods to produce a very cold beam of particles and a rotating paddle-wheel, or rotor, to slow the cold beam. Initial experiments testing the possibility of elastic scattering from a single crystal surface were conducted and the implications of these experiments are discussed.
He, E.; Chen, Q.; Wang, H.; Liu, X.
As a key step in 3D scene analysis, point cloud classification has gained a great deal of attention in the past few years. Due to the uneven density, noise and missing data in point clouds, automatically classifying a point cloud with high precision is a very challenging task. The point cloud classification process typically includes the extraction of neighborhood-based statistical information and machine learning algorithms. However, the robustness of the neighborhood is limited by the density and curvature of the point cloud, which leads to label noise in the classification results. In this paper, we propose a curvature-based adaptive neighborhood for individual point cloud classification. Our main improvement is the curvature-based adaptive neighborhood method, which can derive an ideal 3D local point neighborhood and enhance the separability of features. The experimental results on the Oakland benchmark dataset show that the proposed method can effectively improve the classification accuracy of point clouds.
Full Text Available Breast cancer is an all too common disease in women, making how to effectively predict it an active research problem. A number of statistical and machine learning techniques have been employed to develop various breast cancer prediction models. Among them, support vector machines (SVM) have been shown to outperform many related techniques. To construct the SVM classifier, it is first necessary to decide the kernel function, and different kernel functions can result in different prediction performance. However, there have been very few studies focused on examining the prediction performances of SVM based on different kernel functions. Moreover, it is unknown whether SVM classifier ensembles, which have been proposed to improve the performance of single classifiers, can outperform single SVM classifiers in terms of breast cancer prediction. Therefore, the aim of this paper is to fully assess the prediction performance of SVM and SVM ensembles over small and large scale breast cancer datasets. The classification accuracy, ROC, F-measure, and computational times of training SVM and SVM ensembles are compared. The experimental results show that linear kernel based SVM ensembles based on the bagging method and RBF kernel based SVM ensembles with the boosting method can be the better choices for a small scale dataset, where feature selection should be performed in the data pre-processing stage. For a large scale dataset, RBF kernel based SVM ensembles based on boosting perform better than the other classifiers.
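The intuition for why SVM ensembles can beat a single SVM is the classical majority-vote argument: if the base classifiers erred independently, ensemble accuracy would follow a binomial tail. Real base classifiers make correlated errors, so this is an idealized upper bound that bagging and boosting only approximate, but it explains the expected direction of the improvement:

```python
from math import comb

def majority_vote_accuracy(p, n):
    """Probability that a majority of n independent classifiers, each
    correct with probability p, votes for the right class (n odd).

    This independence assumption is an idealization: it bounds the
    benefit of voting rather than guaranteeing it.
    """
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n // 2 + 1, n + 1))

single = majority_vote_accuracy(0.7, 1)    # one model: accuracy 0.7
ensemble = majority_vote_accuracy(0.7, 11) # eleven voters do better
```

The same formula shows the limit of voting: at p = 0.5 (no better than chance) the ensemble stays at 0.5 no matter how many members vote.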
Full Text Available Landform classification is a necessary task for various fields of landscape and regional planning, for example landscape evaluation, erosion studies, and hazard prediction. This study proposes an improved object-based classification for Chinese landform types using the factor importance analysis of random forest and the gray-level co-occurrence matrix (GLCM). Based on a 1 km DEM of China, the combination of terrain factors extracted from the DEM is selected by correlation analysis and Sheffield's entropy method. A random forest classification tree is applied to evaluate the importance of the terrain factors, which are used as multi-scale segmentation thresholds. Then the GLCM is computed for the knowledge base of classification. The classification result was checked against the 1:4,000,000 Chinese Geomorphological Map as reference. The overall classification accuracy of the proposed method is 5.7% higher than ISODATA unsupervised classification, and 15.7% higher than the traditional object-based classification method.
A proposed data base system for detection, classification and location of fault on electricity company of Ghana electrical distribution system. Isaac Owusu-Nyarko, Mensah-Ananoo Eugine. Keywords: database, classification of fault, power, distribution system, SCADA, ECG.
Full Text Available In the ensemble-based four-dimensional variational assimilation method (SVD-En4DVar), a singular value decomposition (SVD) technique is used to select the leading eigenvectors, and the analysis variables are expressed as an expansion in the orthogonal bases of the eigenvectors. Experiments with a two-dimensional shallow-water equation model and simulated observations show that the truncation error and the rejection of observed signals due to the reduced-dimensional reconstruction of the analysis variable are the major factors that damage the analysis when the ensemble size is not large enough. However, a larger ensemble imposes a daunting computational burden. Experiments with the shallow-water equation model also show that the forecast error covariances remain relatively constant over time. For that reason, we propose an approach that increases the members of the forecast ensemble while reducing the update frequency of the forecast error covariance, in order to increase analysis accuracy and reduce the computational cost. A series of experiments were conducted with the shallow-water equation model to test the efficiency of this approach, and the results indicate that it is promising. Further experiments with the WRF model show that this approach is also suitable for the real atmospheric data assimilation problem, but the update frequency of the forecast error covariances should not be too low.
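The flow-dependent background error covariance at the heart of such ensemble methods is simply the sample covariance of the forecast ensemble. A minimal sketch with plain lists (illustrative only; operational systems add localization, inflation, and work with far larger state vectors):

```python
def ensemble_covariance(members):
    """members: list of forecast state vectors (the ensemble).
    Returns the unbiased sample covariance matrix, the quantity that
    spreads an observation's correction across the model state."""
    n, d = len(members), len(members[0])
    mean = [sum(m[k] for m in members) / n for k in range(d)]
    cov = [[0.0] * d for _ in range(d)]
    for m in members:
        dev = [m[k] - mean[k] for k in range(d)]
        for i in range(d):
            for j in range(d):
                cov[i][j] += dev[i] * dev[j] / (n - 1)
    return cov

# Two members of a two-variable state; the off-diagonal terms are
# what let an observation of one variable correct the other.
cov = ensemble_covariance([[1.0, 2.0], [3.0, 4.0]])
```

Because this matrix is recomputed from each new forecast ensemble, the correction structure follows the flow, which is exactly the property contrasted above with the static climatological covariances.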
Zubia-Olaskoaga, Felix; Maraví-Poma, Enrique; Urreta-Barallobre, Iratxe; Ramírez-Puerta, María-Rosario; Mourelo-Fariña, Mónica; Marcos-Neira, María-Pilar
To compare the classification performance of the Revised Atlanta Classification, the Determinant-Based Classification, and a new modified Determinant-Based Classification according to observed mortality and morbidity, a prospective multicenter observational study was conducted over a 1-year period in forty-six international ICUs (Epidemiology of Acute Pancreatitis in Intensive Care Medicine study), enrolling patients admitted to an ICU with acute pancreatitis and at least one organ failure. The modified Determinant-Based Classification included four categories: group 1, patients with transient organ failure and without local complications; group 2, patients with transient organ failure and local complications; group 3, patients with persistent organ failure and without local complications; and group 4, patients with persistent organ failure and local complications. A total of 374 patients were included (mortality rate of 28.9%). When the modified Determinant-Based Classification was applied, patients in group 1 presented low mortality (2.26%) and morbidity (5.38%), patients in group 2 presented low mortality (6.67%) and high morbidity (60.71%), patients in group 3 presented high mortality (41.46%) and low morbidity (8.33%), and patients in group 4 presented high mortality (59.09%) and high morbidity (88.89%). The area under the receiver operating characteristic curve of the modified Determinant-Based Classification for mortality was 0.81 (95% CI, 0.77-0.85), with significant differences in comparison to the Revised Atlanta Classification (0.77; 95% CI, 0.73-0.81) and the Determinant-Based Classification (0.77; 95% CI, 0.72-0.81). For morbidity, the area under the curve of the modified Determinant-Based Classification was 0.80 (95% CI, 0.73-0.86), with significant differences in comparison to the Revised Atlanta Classification (0.63; 95% CI, 0.57-0.70), but not to the Determinant-Based Classification (0.81; 95% CI, 0.74-0.88; nonsignificant). The modified Determinant-Based Classification identified four groups with different clinical presentations in patients with acute pancreatitis.
Oza, Nikunj C.
A supervised learning task involves constructing a mapping from input data (normally described by several features) to the appropriate outputs. Within supervised learning, one type of task is a classification learning task, in which each output is one or more classes to which the input belongs. In supervised learning, a set of training examples---examples with known output values---is used by a learning algorithm to generate a model. This model is intended to approximate the mapping between the inputs and outputs. This model can be used to generate predicted outputs for inputs that have not been seen before. For example, we may have data consisting of observations of sunspots. In a classification learning task, our goal may be to learn to classify sunspots into one of several types. Each example may correspond to one candidate sunspot with various measurements or just an image. A learning algorithm would use the supplied examples to generate a model that approximates the mapping between each supplied set of measurements and the type of sunspot. This model can then be used to classify previously unseen sunspots based on the candidate's measurements. This chapter discusses methods to perform machine learning, with examples involving astronomy.
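The chapter's description of supervised classification — learn a model from labeled training examples, then predict labels for unseen inputs — can be made concrete with a tiny nearest-centroid classifier. This is a hedged illustration, not a method the chapter prescribes; the two-feature "measurement" vectors stand in for sunspot measurements:

```python
def train_nearest_centroid(examples):
    """examples: list of (feature_vector, label) training pairs.
    Learns one centroid (mean vector) per class and returns a predict
    function that assigns the label of the nearest centroid."""
    sums, counts = {}, {}
    for x, y in examples:
        counts[y] = counts.get(y, 0) + 1
        sums[y] = [a + b for a, b in zip(sums.get(y, [0.0] * len(x)), x)]
    centroids = {y: [v / counts[y] for v in s] for y, s in sums.items()}

    def predict(x):
        # squared Euclidean distance to each class centroid
        return min(centroids, key=lambda y: sum((a - b) ** 2
                   for a, b in zip(x, centroids[y])))
    return predict

# Train on labeled measurements, then classify an unseen input:
examples = [([0.0, 0.0], "A"), ([0.0, 1.0], "A"),
            ([5.0, 5.0], "B"), ([5.0, 6.0], "B")]
predict = train_nearest_centroid(examples)
```

The `predict` function is the learned model: an approximation of the mapping from measurements to class, applicable to inputs it has never seen.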
Diesing, Markus; Stephens, David
Seabed habitat mapping based on swath acoustic data and ground-truth samples is an emergent and active marine science discipline. Significant progress could be achieved by transferring techniques and approaches that have been successfully developed and employed in fields such as terrestrial land cover mapping. One such promising approach is the multiple classifier system, which aims at improving classification performance by combining the outputs of several classifiers. Here we present results of a multi-model ensemble applied to multibeam acoustic data covering more than 5000 km² of seabed in the North Sea, with the aim of deriving accurate spatial predictions of seabed substrate. A suite of six machine learning classifiers (k-Nearest Neighbour, Support Vector Machine, Classification Tree, Random Forest, Neural Network and Naïve Bayes) was trained with ground-truth sample data classified into seabed substrate classes, and their prediction accuracy was assessed with an independent set of samples. The three and five best-performing models were combined into classifier ensembles. Both ensembles led to increased prediction accuracy as compared to the best-performing single classifier. The improvements were, however, not statistically significant at the 5% level. Although the three-model ensemble did not perform significantly better than its individual component models, the five-model ensemble did perform significantly better than three of its five component models. A classifier ensemble might therefore be an effective strategy to improve classification performance. Another advantage is that the agreement in predicted substrate class between the individual models of the ensemble can be used as a measure of confidence. We propose a simple and spatially explicit measure of confidence that is based on model agreement and prediction accuracy.
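The core idea of the abstract above, combining several trained classifiers by majority vote and using inter-model agreement as a confidence measure, can be sketched as follows (synthetic data stands in for the multibeam acoustic features and substrate classes; the choice of three classifiers and the split are illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Stand-in for acoustic features with substrate-class labels.
X, y = make_classification(n_samples=400, n_features=6, n_classes=3,
                           n_informative=4, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

models = [KNeighborsClassifier(), SVC(), RandomForestClassifier(random_state=1)]
preds = np.array([m.fit(X_tr, y_tr).predict(X_te) for m in models])

# Majority vote across the ensemble members for each test sample...
vote = np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, preds)

# ...and per-sample confidence = fraction of models agreeing with the vote,
# mirroring the paper's model-agreement confidence measure.
agreement = (preds == vote).mean(axis=0)
print(vote[:5], agreement[:5])
```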
Davidson, Rachel A; Nozick, Linda K; Wachtendorf, Tricia; Blanton, Brian; Colle, Brian; Kolar, Randall L; DeYoung, Sarah; Dresback, Kendra M; Yi, Wenqi; Yang, Kun; Leonardo, Nicholas
This article introduces a new integrated scenario-based evacuation (ISE) framework to support hurricane evacuation decision making. It explicitly captures the dynamics, uncertainty, and human-natural system interactions that are fundamental to the challenge of hurricane evacuation, but have not been fully captured in previous formal evacuation models. The hazard is represented with an ensemble of probabilistic scenarios, population behavior with a dynamic decision model, and traffic with a dynamic user equilibrium model. The components are integrated in a multistage stochastic programming model that minimizes risk and travel times to provide a tree of evacuation order recommendations and an evaluation of the risk and travel time performance for that solution. The ISE framework recommendations offer an advance in the state of the art because they: (1) are based on an integrated hazard assessment (designed to ultimately include inland flooding), (2) explicitly balance the sometimes competing objectives of minimizing risk and minimizing travel time, (3) offer a well-hedged solution that is robust under the range of ways the hurricane might evolve, and (4) leverage the substantial value of increasing information (or decreasing degree of uncertainty) over the course of a hurricane event. A case study for Hurricane Isabel (2003) in eastern North Carolina is presented to demonstrate how the framework is applied, the type of results it can provide, and how it compares to available methods of a single scenario deterministic analysis and a two-stage stochastic program. © 2018 Society for Risk Analysis.
Naganathan, Athi N
Engineering the conformational stabilities of proteins through mutations has immense potential in biotechnological applications. It is, however, an inherently challenging problem given the weak noncovalent nature of the stabilizing interactions. In this regard, we present here a robust and fast strategy to engineer protein stabilities through mutations involving charged residues, using a structure-based statistical mechanical model that accounts for the ensemble nature of folding. We validate the method by predicting the absolute changes in stability for 138 experimental mutations from 16 different proteins and enzymes, with a correlation of 0.65 and, importantly, a success rate of 81%. Multiple point mutants are predicted with a higher success rate (90%), which is validated further by comparing mesophile-thermophile protein pairs. In parallel, we devise a methodology to rapidly engineer mutations in silico, which we benchmark against experimental mutations of ubiquitin (correlation of 0.95) and check for feasibility on a larger therapeutic protein, DNase I. We expect the method to be of importance as a first and rapid step to screen for protein mutants with specific stabilities in the biotechnology industry, in the construction of stability maps at the residue level (i.e., hot spots), and as a robust tool to probe for mutations that enhance the stability of protein-based drugs.
Full Text Available The explosive growth and increasing diversity of network traffic on the Internet have brought new and severe challenges to DDoS attack detection. To achieve a higher True Negative Rate (TNR), accuracy, and precision, and to guarantee the robustness, stability, and universality of the detection system, in this paper we propose a DDoS attack detection method based on hybrid heterogeneous multiclassifier ensemble learning and design a heuristic detection algorithm based on Singular Value Decomposition (SVD) to construct our detection system. Experimental results show that our detection method is excellent in TNR, accuracy, and precision; therefore, our algorithm has good detection performance for DDoS attacks. Comparisons with Random Forest, k-Nearest Neighbor (k-NN), and Bagging, the component classifiers when the three algorithms are used alone with and without SVD, show that our model is superior to state-of-the-art attack detection techniques in system generalization ability, detection stability, and overall detection performance.
Lu, Yin; Liu, Lili; Lu, Dong; Cai, Yudong; Zheng, Mingyue; Luo, Xiaomin; Jiang, Hualiang; Chen, Kaixian
Drug-induced liver injury (DILI) is a major cause of drug withdrawal. The chemical properties of the drug, especially its metabolites, play key roles in DILI. Our goal is to construct a QSAR model to predict drug hepatotoxicity based on drug metabolites. 64 hepatotoxic drug metabolites and 3,339 non-hepatotoxic drug metabolites were gathered from the MDL Metabolite Database. Considering the imbalance of the dataset, we randomly split the negative samples and combined each portion with all the positive samples to construct individually balanced datasets for training independent classifiers. We then adopted an ensemble approach to make predictions based on the results of all individual classifiers and applied the minimum Redundancy Maximum Relevance (mRMR) feature selection method to select the molecular descriptors. Eventually, for the drugs in the external test set, a Bayesian inference method was used to predict the hepatotoxicity of a drug based on its metabolites. The model showed an average balanced accuracy of 78.47%, sensitivity of 74.17%, and specificity of 82.77%. Five molecular descriptors characterizing molecular polarity, intramolecular bonding strength, and molecular frontier orbital energy were obtained. When predicting the hepatotoxicity of a drug based on all its metabolites, the sensitivity, specificity and balanced accuracy were 60.38%, 70.00%, and 65.19%, respectively, indicating that this method is useful for identifying the hepatotoxicity of drugs. We developed an in silico model to predict the hepatotoxicity of drug metabolites. Moreover, Bayesian inference was applied to predict the hepatotoxicity of a drug based on its metabolites, which yielded valuable sensitivity and specificity. Copyright© Bentham Science Publishers; For any queries, please email at email@example.com.
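The balanced-subsampling ensemble described above (splitting the majority class into chunks and pairing each chunk with all minority samples) can be sketched as follows; the data, classifier choice, and combination by averaged probabilities are illustrative, not the paper's actual QSAR setup:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Imbalanced synthetic data: few "positive" samples, many "negative" ones
# (a toy stand-in for 64 hepatotoxic vs 3,339 non-hepatotoxic metabolites).
X_pos = rng.normal(1.0, 1.0, size=(60, 5))
X_neg = rng.normal(-1.0, 1.0, size=(600, 5))

# Split the negatives into chunks the size of the positive set; pair each
# chunk with all positives to train one classifier per balanced subset.
chunks = np.array_split(rng.permutation(X_neg), len(X_neg) // len(X_pos))
ensemble = []
for chunk in chunks:
    X = np.vstack([X_pos, chunk])
    y = np.r_[np.ones(len(X_pos)), np.zeros(len(chunk))]
    ensemble.append(LogisticRegression(max_iter=1000).fit(X, y))

# Ensemble prediction: average the individual classifiers' probabilities.
X_new = rng.normal(0.0, 1.0, size=(10, 5))
p = np.mean([clf.predict_proba(X_new)[:, 1] for clf in ensemble], axis=0)
print(p.round(2))
```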
Full Text Available To improve the accuracy of change detection in urban areas using bi-temporal high-resolution remote sensing images, a novel object-based change detection scheme combining multiple features and ensemble learning is proposed in this paper. Image segmentation is conducted to determine the objects in bi-temporal images separately. Subsequently, three kinds of object features, i.e., spectral, shape and texture, are extracted. Using the image differencing process, a difference image is generated and used as the input for nonlinear supervised classifiers, including k-nearest neighbor, support vector machine, extreme learning machine and random forest. Finally, the results of multiple classifiers are integrated using an ensemble rule called weighted voting to generate the final change detection result. Experimental results of two pairs of real high-resolution remote sensing datasets demonstrate that the proposed approach outperforms the traditional methods in terms of overall accuracy and generates change detection maps with a higher number of homogeneous regions in urban areas. Moreover, the influences of segmentation scale and the feature selection strategy on the change detection performance are also analyzed and discussed.
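A weighted-voting combination of heterogeneous classifiers, as used in the change detection scheme above, can be sketched like this (synthetic features stand in for the spectral/shape/texture difference image, and using training accuracy as each voter's weight is an assumption, not necessarily the paper's weighting rule):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic difference-image features for "change"/"no change" objects.
X, y = make_classification(n_samples=300, n_features=8, random_state=2)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=2)

models = [KNeighborsClassifier(), SVC(), RandomForestClassifier(random_state=2)]
preds, weights = [], []
for m in models:
    m.fit(X_tr, y_tr)
    preds.append(m.predict(X_val))
    weights.append(m.score(X_tr, y_tr))  # illustrative weight per voter

# Weighted vote: add each model's weight to the class it predicts,
# then pick the class with the highest total weight per object.
scores = np.zeros((len(X_val), 2))
for p, w in zip(preds, weights):
    scores[np.arange(len(X_val)), p] += w
final = scores.argmax(axis=1)
print((final == y_val).mean())
```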
Chellasamy, Menaka; Ferre, Ty Paul; Greve, Mogens Humlekrog
operates based on two assumptions: 1) mislabels in each class will be far from their cluster centroid and 2) each crop class based on the available vector data has more correctly labeled samples than mislabeled samples. Three datasets, derived from bitemporal WorldView-2 multispectral imagery, are used...... the multievidence classification approach. The study is implemented with WorldView-2 imagery acquired for a study area in Denmark containing 15 crop classes. The multievidence classification approach with ECRA-based refinement is compared with the classification based on common training sample selection methods...... (manual and random). It is also compared with the winner-takes-all-based classification approach. ECRA achieves an overall classification accuracy of 92.8%, which is higher than existing common approaches....
Liu, Zhun-Ga; Pan, Quan; Mercier, Gregoire; Dezert, Jean
The classification of incomplete patterns is a very challenging task because an object (incomplete pattern) with different possible estimations of its missing values may yield distinct classification results. The uncertainty (ambiguity) of the classification is mainly caused by the lack of information in the missing data. A new prototype-based credal classification (PCC) method is proposed to deal with incomplete patterns, thanks to the belief function framework used classically in the evidential reasoning approach. The class prototypes obtained from training samples are respectively used to estimate the missing values. Typically, in a c-class problem, one has to deal with c prototypes, which yield c estimations of the missing values. The different edited patterns based on each possible estimation are then classified by a standard classifier, and we can get at most c distinct classification results for an incomplete pattern. Because all these distinct classification results are potentially admissible, we propose to combine them all together to obtain the final classification of the incomplete pattern. A new credal combination method is introduced for solving the classification problem, and it is able to characterize the inherent uncertainty due to the possibly conflicting results delivered by different estimations of the missing values. The incomplete patterns that are very difficult to classify in a specific class are reasonably and automatically committed to some proper meta-classes by the PCC method in order to reduce errors. The effectiveness of the PCC method has been tested through four experiments with artificial and real data sets.
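The prototype-based editing step described above, producing one filled-in version of an incomplete pattern per class prototype, can be sketched as follows (toy data; a nearest-prototype rule stands in for the "standard classifier", and the final credal combination with belief functions is not reproduced here):

```python
import numpy as np

# Toy 3-class problem: class prototypes are the means of training samples.
rng = np.random.default_rng(0)
train = {c: rng.normal(c * 2.0, 0.5, size=(30, 3)) for c in range(3)}
prototypes = {c: pts.mean(axis=0) for c, pts in train.items()}

# An incomplete pattern with one missing feature (np.nan).
x = np.array([2.1, np.nan, 1.9])
missing = np.isnan(x)

# One edited pattern per class: fill missing values from that class prototype.
edited = []
for c, proto in prototypes.items():
    xc = x.copy()
    xc[missing] = proto[missing]
    edited.append(xc)

# Each edited pattern is then classified separately, yielding up to c
# (possibly conflicting) results that PCC would combine.
labels = [min(prototypes, key=lambda c: np.linalg.norm(xe - prototypes[c]))
          for xe in edited]
print(labels)
```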
O'Mahony, Michael P.; Smyth, Barry
Many online stores encourage their users to submit product/service reviews in order to guide future purchasing decisions. These reviews are often listed alongside product recommendations but, to date, limited attention has been paid to how best to present these reviews to the end-user. In this paper, we describe a supervised classification approach that is designed to identify and recommend the most helpful product reviews. Using the TripAdvisor service as a case study, we compare the performance of several classification techniques using a range of features derived from hotel reviews. We then describe how these classifiers can be used as the basis for a practical recommender that automatically suggests the most helpful contrasting reviews to end-users. We present an empirical evaluation which shows that our approach achieves a statistically significant improvement over alternative review ranking schemes.
Adelman, Joshua L.; Grabe, Michael
We introduce an extension to the weighted ensemble (WE) path sampling method to restrict sampling to a one-dimensional path through a high-dimensional phase space. Our method, which is based on the finite-temperature string method, permits efficient sampling of both equilibrium and non-equilibrium systems. Sampling obtained from the WE method guides the adaptive refinement of a Voronoi tessellation of order parameter space, whose generating points, upon convergence, coincide with the principal reaction pathway. We demonstrate the application of this method to several simple, two-dimensional models of driven Brownian motion and to the conformational change of the nitrogen regulatory protein C receiver domain using an elastic network model. The simplicity of the two-dimensional models allows us to directly compare the efficiency of the WE method to conventional brute-force simulations and other path sampling algorithms, while the example of protein conformational change demonstrates how the method can be used to efficiently study transitions in the space of many collective variables. PMID:23387566
Full Text Available A major structural inconsistency of the traditional curve number (CN) model is its dependence on an unstable fixed initial abstraction, which normally results in sudden jumps in runoff estimation. Likewise, the lack of a pre-storm soil moisture accounting (PSMA) procedure is another inherent limitation of the model. To circumvent those problems, we used a variable initial abstraction after ensembling the traditional CN model and a French four-parameter model (GR4J) to better quantify direct runoff from ungauged watersheds. To mimic the natural rainfall-runoff transformation at the watershed scale, our new parameterization designates intrinsic parameters and uses a simple structure. It exhibited more accurate and consistent results than earlier methods in evaluating data from 39 forest-dominated watersheds, both for small and large watersheds. In addition, based on different performance evaluation indicators, the runoff reproduction results show that the proposed model produced more consistent results for dry, normal, and wet watershed conditions than the other models used in this study.
Full Text Available Solar energy generated from photovoltaic (PV) systems is one of the most promising types of renewable energy. However, it is highly variable, as it depends on the solar irradiance and other meteorological factors. This variability creates difficulties for the large-scale integration of PV power in the electricity grid and requires accurate forecasting of the electricity generated by PV systems. In this paper we consider 2D-interval forecasts, where the goal is to predict summary statistics for the distribution of the PV power values in a future time interval. 2D-interval forecasts have been recently introduced, and they are more suitable than point forecasts for applications where the predicted variable has high variability. We propose a method called NNE2D that combines variable selection based on mutual information and an ensemble of neural networks to compute 2D-interval forecasts, where the two interval boundaries are expressed in terms of percentiles. NNE2D was evaluated for univariate prediction of Australian solar PV power data over two years. The results show that it is a promising method, outperforming persistence baselines and other methods used for comparison in terms of accuracy and coverage probability.
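An ensemble-of-networks interval forecast can be sketched as below; note this is a simplification that derives percentile boundaries from the spread of bootstrap-trained ensemble members, whereas NNE2D predicts the interval boundaries directly, and the data here is synthetic:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(3)
# Synthetic PV-power-like target: noisy function of 4 input features.
X = rng.uniform(0, 1, size=(300, 4))
y = np.sin(np.pi * X[:, 0]) + 0.1 * rng.normal(size=300)

# Ensemble of small neural networks trained on bootstrap resamples.
ensemble = []
for seed in range(10):
    idx = rng.integers(0, len(X), len(X))
    net = MLPRegressor(hidden_layer_sizes=(16,), max_iter=500,
                       random_state=seed)
    ensemble.append(net.fit(X[idx], y[idx]))

# Interval forecast for a new input: lower/upper percentiles of the
# ensemble members' predictions (the 10/90 choice is illustrative).
x_new = rng.uniform(0, 1, size=(1, 4))
member_preds = np.array([net.predict(x_new)[0] for net in ensemble])
lo, hi = np.percentile(member_preds, [10, 90])
print(round(lo, 3), round(hi, 3))
```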
Ahmadalipour, Ali; Moradkhani, Hamid
Hydrologic modeling is one of the primary tools utilized for drought monitoring and drought early warning systems. Several sources of uncertainty in hydrologic modeling have been addressed in the literature. However, few studies have assessed the uncertainty of gridded observation datasets from a drought monitoring perspective. This study provides a hydrologic modeling oriented analysis of the gridded observation data uncertainties over the Pacific Northwest (PNW) and its implications on drought assessment. We utilized a recently developed 100-member ensemble-based observed forcing data to simulate hydrologic fluxes at 1/8° spatial resolution using Variable Infiltration Capacity (VIC) model, and compared the results with a deterministic observation. Meteorological and hydrological droughts are studied at multiple timescales over the basin, and seasonal long-term trends and variations of drought extent is investigated for each case. Results reveal large uncertainty of observed datasets at monthly timescale, with systematic differences for temperature records, mainly due to different lapse rates. The uncertainty eventuates in large disparities of drought characteristics. In general, an increasing trend is found for winter drought extent across the PNW. Furthermore, a ∼3% decrease per decade is detected for snow water equivalent (SWE) over the PNW, with the region being more susceptible to SWE variations of the northern Rockies than the western Cascades. The agricultural areas of southern Idaho demonstrate decreasing trend of natural soil moisture as a result of precipitation decline, which implies higher appeal for anthropogenic water storage and irrigation systems.
Novovičová, Jana; Malík, Antonín
Vol. 40, No. 3 (2004), pp. 293-304 ISSN 0023-5954 R&D Projects: GA AV ČR IAA2075302; GA ČR GA102/03/0049; GA AV ČR KSK1019101 Institutional research plan: CEZ:AV0Z1075907 Keywords: text classification * text categorization * multinomial mixture model Subject RIV: BB - Applied Statistics, Operational Research Impact factor: 0.224, year: 2004
Orlando, José Ignacio; Prokofyeva, Elena; Del Fresno, Mariana; Blaschko, Matthew B
Diabetic retinopathy (DR) is one of the leading causes of preventable blindness in the world. Its earliest signs are red lesions, a general term that groups both microaneurysms (MAs) and hemorrhages (HEs). In daily clinical practice, these lesions are manually detected by physicians using fundus photographs. However, this task is tedious and time consuming, and requires an intensive effort due to the small size of the lesions and their lack of contrast. Computer-assisted diagnosis of DR based on red lesion detection is being actively explored due to its improvement effects on both clinicians' consistency and accuracy. Moreover, it provides comprehensive feedback that is easy for physicians to assess. Several methods for detecting red lesions have been proposed in the literature, most of them based on characterizing lesion candidates using hand-crafted features and classifying them into true or false positive detections. Deep learning based approaches, by contrast, are scarce in this domain due to the high expense of annotating the lesions manually. In this paper we propose a novel method for red lesion detection based on combining both deep learned features and domain knowledge. Features learned by a convolutional neural network (CNN) are augmented by incorporating hand-crafted features. This ensemble vector of descriptors is used afterwards to identify true lesion candidates using a Random Forest classifier. We empirically observed that combining both sources of information significantly improves results with respect to using each approach separately. Furthermore, our method reported the highest performance on a per-lesion basis on DIARETDB1 and e-ophtha, and for screening and need for referral on MESSIDOR compared to a second human expert. Results highlight the fact that integrating manually engineered approaches with deep learned features is relevant to improve results when the networks are trained from lesion-level annotated data. An open source implementation of our
Yang, Ji-Jiang; Li, Jianqiang; Shen, Ruifang; Zeng, Yang; He, Jian; Bi, Jing; Li, Yong; Zhang, Qinyan; Peng, Lihui; Wang, Qing
Cataract is defined as a lenticular opacity presenting usually with poor visual acuity. It is one of the most common causes of visual impairment worldwide. Early diagnosis demands the expertise of trained healthcare professionals, which may present a barrier to early intervention due to underlying costs. To date, studies reported in the literature utilize a single learning model for retinal image classification in grading cataract severity. We present an ensemble learning based approach as a means to improving diagnostic accuracy. Three independent feature sets, i.e., wavelet-, sketch-, and texture-based features, are extracted from each fundus image. For each feature set, two base learning models, i.e., Support Vector Machine and Back Propagation Neural Network, are built. Then, the ensemble methods, majority voting and stacking, are investigated to combine the multiple base learning models for final fundus image classification. Empirical experiments are conducted for cataract detection (two-class task, i.e., cataract or non-cataractous) and cataract grading (four-class task, i.e., non-cataractous, mild, moderate or severe) tasks. The best performance of the ensemble classifier is 93.2% and 84.5% in terms of the correct classification rates for cataract detection and grading tasks, respectively. The results demonstrate that the ensemble classifier outperforms the single learning model significantly, which also illustrates the effectiveness of the proposed approach. Copyright © 2015 Elsevier Ireland Ltd. All rights reserved.
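The stacking variant of the ensemble described above can be sketched with scikit-learn (synthetic features stand in for the wavelet/sketch/texture descriptors of fundus images, and the logistic-regression meta-learner is an assumption, not the paper's exact configuration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Synthetic stand-in for fundus-image feature vectors (cataract detection,
# two-class task).
X, y = make_classification(n_samples=400, n_features=12, random_state=4)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=4)

# Base learners (an SVM and a back-propagation-style neural network),
# combined by a stacking meta-learner; the paper also evaluates plain
# majority voting over such base models.
stack = StackingClassifier(
    estimators=[("svm", SVC(probability=True)),
                ("bpnn", MLPClassifier(max_iter=1000, random_state=4))],
    final_estimator=LogisticRegression())
stack.fit(X_tr, y_tr)
score = stack.score(X_te, y_te)
print(score)
```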
Full Text Available Feature selection and description is a key factor in the classification of Earth observation data. In this paper a classification method based on tensor decomposition is proposed. First, multiple features are extracted from the raw LiDAR point cloud, and raster LiDAR images are derived by accumulating features or the “raw” data attributes. Then, the feature rasters of the LiDAR data are stored as a tensor, and tensor decomposition is used to select component features. This tensor representation keeps the initial spatial structure and ensures that the neighborhood is taken into account. Based on a small number of component features, a k-nearest-neighbour classification is applied.
…wavelet transform (WT) energy features by artificial neural network (ANN) and SVM classifiers. The proposed scheme utilizes wavelet-based feature extraction for the artificial neural networks used in the classification. Six different PQ events are considered in ...
Morabito, Marco; Pavlinic, Daniela Z; Crisci, Alfonso; Capecchi, Valerio; Orlandini, Simone; Mekjavic, Igor B
Military and civil defense personnel are often involved in complex activities in a variety of outdoor environments. The choice of appropriate clothing ensembles represents an important strategy to establish the success of a military mission. The main aim of this study was to compare the known clothing insulation of the garment ensembles worn by soldiers during two winter outdoor field trials (hike and guard duty) with the estimated optimal clothing thermal insulations recommended to maintain thermoneutrality, assessed by using two different biometeorological procedures. The overall aim was to assess the applicability of such biometeorological procedures to weather forecast systems, thereby developing a comprehensive biometeorological tool for military operational forecast purposes. Military trials were carried out during winter 2006 in Pokljuka (Slovenia) by Slovene Armed Forces personnel. Gastrointestinal temperature, heart rate and environmental parameters were measured with portable data acquisition systems. The thermal characteristics of the clothing ensembles worn by the soldiers, namely thermal resistance, were determined with a sweating thermal manikin. Results showed that the clothing ensemble worn by the military was appropriate during guard duty but generally inappropriate during the hike. A general under-estimation of the biometeorological forecast model in predicting the optimal clothing insulation value was observed and an additional post-processing calibration might further improve forecast accuracy. This study represents the first step in the development of a comprehensive personalized biometeorological forecast system aimed at improving recommendations regarding the optimal thermal insulation of military garment ensembles for winter activities.
Full Text Available Abstract It is difficult to select the most suitable and effective clustering algorithm and dataset configuration for a given set of gene expression data, because there are a huge number of possible approaches and a huge number of gene expressions. At present many researchers prefer to use hierarchical clustering in its different forms, but this is not always optimal. Cluster ensemble research can solve this type of problem by automatically merging multiple data partitions from a wide range of different clusterings, of any dimensions, to improve both the quality and robustness of the clustering result. However, many existing ensemble approaches use an association matrix to condense sample-cluster co-occurrence statistics, so relations within the ensemble are captured only at a coarse level while relations among clusters are discarded. Recovering these missing associations can greatly expand the capability of such ensemble methodologies for microarray data clustering. We propose a general k-means cluster ensemble approach for clustering general categorical data into a required number of partitions.
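A common way to realize such a k-means cluster ensemble is through a co-association matrix; the sketch below is one standard construction under that assumption, not necessarily the exact approach the abstract proposes:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for gene-expression samples.
X, _ = make_blobs(n_samples=90, centers=3, random_state=5)

# Co-association matrix: entry (i, j) is the fraction of k-means runs in
# which samples i and j fall into the same cluster.
n_runs, n = 20, len(X)
coassoc = np.zeros((n, n))
for seed in range(n_runs):
    labels = KMeans(n_clusters=3, n_init=5, random_state=seed).fit_predict(X)
    coassoc += labels[:, None] == labels[None, :]
coassoc /= n_runs

# Consensus partition: hierarchically cluster the co-association
# "distances" (1 - coassoc) into the required number of partitions.
Z = linkage(squareform(1.0 - coassoc), method="average")
final = fcluster(Z, t=3, criterion="maxclust")
print(np.bincount(final)[1:])
```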
Clary, Renee; Wandersee, James
In this article, Renee Clary and James Wandersee describe the beginnings of "Classification," which lies at the very heart of science and depends upon pattern recognition. Clary and Wandersee approach patterns by first telling the story of the "Linnaean classification system," introduced by Carl Linnaeus (1707-1778), who is…
Chen, Zhijia; Zhu, Yuanchang; Di, Yanqiang; Feng, Shaochong
In IaaS (infrastructure as a service) cloud environment, users are provisioned with virtual machines (VMs). To allocate resources for users dynamically and effectively, accurate resource demands predicting is essential. For this purpose, this paper proposes a self-adaptive prediction method using ensemble model and subtractive-fuzzy clustering based fuzzy neural network (ESFCFNN). We analyze the characters of user preferences and demands. Then the architecture of the prediction model is constructed. We adopt some base predictors to compose the ensemble model. Then the structure and learning algorithm of fuzzy neural network is researched. To obtain the number of fuzzy rules and the initial value of the premise and consequent parameters, this paper proposes the fuzzy c-means combined with subtractive clustering algorithm, that is, the subtractive-fuzzy clustering. Finally, we adopt different criteria to evaluate the proposed method. The experiment results show that the method is accurate and effective in predicting the resource demands.
Bedo, Marcos Vinicius Naves; Pereira Dos Santos, Davi; Ponciano-Silva, Marcelo; de Azevedo-Marques, Paulo Mazzoncini; Ferreira de Carvalho, André Ponce de León; Traina, Caetano
Content-based medical image retrieval (CBMIR) is a powerful resource to improve differential computer-aided diagnosis. The major problem with CBMIR applications is the semantic gap, a situation in which the system does not follow the users' sense of similarity. This gap can be bridged by the adequate modeling of similarity queries, which ultimately depends on the combination of feature extractor methods and distance functions. In this study, such combinations are referred to as perceptual parameters, as they impact how images are compared. In a CBMIR, the perceptual parameters must be manually set by the users, which imposes a heavy burden on the specialists; otherwise, the system will follow a predefined sense of similarity. This paper presents a novel approach to endow a CBMIR with a proper sense of similarity, in which the system defines the perceptual parameters depending on the query element. The method employs an ensemble strategy, where an extreme learning machine acts as a meta-learner and identifies the most suitable perceptual parameters according to a given query image. These parameters define the search space for the similarity query that retrieves the most similar images. An instance-based learning classifier labels the query image following the query result set. As a proof of concept, we integrated the approach into a mammogram CBMIR. For each query image, the resulting tool provided a complete second opinion, including lesion class, system certainty degree, and set of most similar images. Extensive experiments on a large mammogram dataset showed that our proposal achieved a hit ratio up to 10% higher than the traditional CBMIR approach, without requiring external parameters from the users. Our database-driven solution was also up to 25% faster than traditional content retrieval approaches.
Guo, Wei; Tse, Peter W.
Today, remote machine condition monitoring is popular due to the continuous advancement in wireless communication. Bearings are the most frequently and easily failed components in many rotating machines. To accurately identify the type of bearing fault, large amounts of vibration data need to be collected. However, the volume of transmitted data cannot be too high because the bandwidth of wireless communication is limited. To solve this problem, the data are usually compressed before transmitting to a remote maintenance center. This paper proposes a novel signal compression method that can substantially reduce the amount of data that need to be transmitted without sacrificing the accuracy of fault identification. The proposed signal compression method is based on ensemble empirical mode decomposition (EEMD), which is an effective method for adaptively decomposing the vibration signal into different bands of signal components, termed intrinsic mode functions (IMFs). An optimization method was designed to automatically select appropriate EEMD parameters for the analyzed signal, and in particular to select the appropriate level of the added white noise in the EEMD method. An index termed the relative root-mean-square error was used to evaluate the decomposition performances under different noise levels to find the optimal level. After applying the optimal EEMD method to a vibration signal, the IMF relating to the bearing fault can be extracted from the original vibration signal. Compressing this signal component obtains a much smaller proportion of data samples to be retained for transmission and further reconstruction. The proposed compression method was also compared with the popular wavelet compression method. Experimental results demonstrate that the optimization of EEMD parameters can automatically find appropriate EEMD parameters for the analyzed signals, and the IMF-based compression method provides a higher compression ratio, while retaining the bearing defect
Kilic, Niyazi; Hosgormez, Erkan
Ensemble learning methods are one of the most powerful tools for pattern classification problems. In this paper, the effects of ensemble learning methods and some physical bone densitometry parameters on osteoporotic fracture detection were investigated. Six feature set models were constructed including different physical parameters, and they were fed into the ensemble classifiers as input features. As ensemble learning techniques, bagging, gradient boosting and the random subspace method (RSM) were used. Instance-based learning (IBk) and random forest (RF) classifiers were applied to the six feature set models. The patients were classified into three groups: osteoporosis, osteopenia and control (healthy), using ensemble classifiers. Total classification accuracy and f-measure were also used to evaluate the diagnostic performance of the proposed ensemble classification system. The classification accuracy reached 98.85% with the combination of model 6 (five BMD + five T-score values) using the RSM-RF classifier. The findings of this paper suggest that patients can be warned before a bone fracture occurs, just by examining some physical parameters that can easily be measured without invasive operations.
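The bagging and random-subspace setups described above can be sketched with scikit-learn in place of Weka. This is a hedged illustration, not the paper's pipeline: the osteoporosis data are not available here, so a synthetic three-class dataset stands in for the BMD/T-score feature sets, k-nearest neighbours stands in for IBk, and `BaggingClassifier` with `max_features` implements the random subspace method.

```python
# Sketch of bagging and random-subspace (RSM) ensembles, assuming synthetic
# stand-in data; IBk ~ k-NN, RSM ~ bagging over feature subsets.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

# Synthetic 3-class data standing in for the osteoporosis feature set models.
X, y = make_classification(n_samples=300, n_features=10, n_informative=6,
                           n_classes=3, random_state=0)

# Bagging: bootstrap resampling of instances around an IBk-like base learner.
bag = BaggingClassifier(KNeighborsClassifier(n_neighbors=5),
                        n_estimators=25, random_state=0)

# RSM: the same wrapper, but sampling feature subsets instead of instances.
rsm = BaggingClassifier(RandomForestClassifier(n_estimators=10, random_state=0),
                        n_estimators=10, bootstrap=False, max_features=0.5,
                        random_state=0)

for name, clf in [("bagging-IBk", bag), ("RSM-RF", rsm)]:
    acc = cross_val_score(clf, X, y, cv=5).mean()
    print(f"{name}: cross-validated accuracy {acc:.3f}")
```

On real data, the reported 98.85% would of course depend on the actual BMD and T-score features; the point here is only the two ways of injecting diversity (instance resampling vs. feature subspaces).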
Broekmann, P.; Fluegel, A.; Emnet, C.; Arnold, M.; Roeger-Goepfert, C.; Wagner, A.; Hai, N.T.M.; Mayer, D.
Highlights:
→ Three fundamental types of suppressor additives for copper electroplating could be identified by means of potential transient measurements.
→ These suppressor additives differ in their synergistic and antagonistic interplay with anions that are chemisorbed on the metallic copper surface during electrodeposition.
→ In addition, these suppressor chemistries reveal different barrier properties with respect to cupric ions and plating additives (Cl, SPS).
Abstract: Three fundamental types of suppressor additives for copper electroplating could be identified by means of potential transient measurements. These suppressor additives differ in their synergistic and antagonistic interplay with anions that are chemisorbed on the metallic copper surface during electrodeposition. In addition, these suppressor chemistries reveal different barrier properties with respect to cupric ions and plating additives (Cl, SPS). While the type-I suppressor selectively forms efficient barriers for copper inter-diffusion on chloride-terminated electrode surfaces, we identified a type-II suppressor that interacts non-selectively with any kind of anions chemisorbed on copper (chloride, sulfate, sulfonate). Type-I suppressors are vital for the superconformal copper growth mode in Damascene processing and show an antagonistic interaction with SPS (Bis-Sodium-Sulfopropyl-Disulfide) which involves the deactivation of this suppressor chemistry. This suppressor deactivation is rationalized in terms of compositional changes in the layer of the chemisorbed anions due to the competition of chloride and MPS (Mercaptopropane Sulfonic Acid) for adsorption sites on the metallic copper surface. MPS is the product of the dissociative SPS adsorption within the preexisting chloride matrix on the copper surface.
The non-selectivity in the adsorption behavior of the type-II suppressor is rationalized in terms of anion/cation pairing effects of the poly-cationic suppressor and the anion-modified copper substrate. Atomic-scale insights into the competitive Cl/MPS adsorption are gained from in situ STM (Scanning Tunneling Microscopy) using single crystalline copper surfaces as model substrates. Type-III suppressors are a third class of suppressors. In the case of type-I and type-II suppressor chemistries, the resulting steady-state deposition conditions are completely independent of the particular succession of additive adsorption. In contrast, a strong dependence of the suppressing capabilities on the sequence of additive adsorption ('first come, first served' principle) is observed for the type-III suppressor. This behavior is explained by a suppressor barrier that impedes not only the copper inter-diffusion but also the transport of other additives (e.g. SPS) to the copper surface.
Stelios K. Mylonas
This paper proposes an object-based segmentation/classification scheme for remotely sensed images, based on a novel variant of the recently proposed Genetic Sequential Image Segmentation (GeneSIS) algorithm. GeneSIS segments the image in an iterative manner, whereby at each iteration a single object is extracted via a genetic-based object extraction algorithm. Contrary to the previous pixel-based GeneSIS, where the candidate objects to be extracted were evaluated through the fuzzy content of their included pixels, in the newly developed region-based GeneSIS algorithm, a watershed-driven fine segmentation map is initially obtained from the original image, which serves as the basis for the forthcoming GeneSIS segmentation. Furthermore, in order to enhance the spatial search capabilities, we introduce a more descriptive encoding scheme in the object extraction algorithm, where the structural search modules are represented by polygonal shapes. Our objectives in the new framework are posed as follows: enhance the flexibility of the algorithm in extracting more flexible object shapes, assure high level classification accuracies, and reduce the execution time of the segmentation, while at the same time preserving all the inherent attributes of the GeneSIS approach. Finally, exploiting the inherent attribute of GeneSIS to produce multiple segmentations, we also propose two segmentation fusion schemes that operate on the ensemble of segmentations generated by GeneSIS. Our approaches are tested on an urban and two agricultural images. The results show that region-based GeneSIS has considerably lower computational demands compared to the pixel-based one. Furthermore, the suggested methods achieve higher classification accuracies and good segmentation maps compared to a series of existing algorithms.
Zhijia Chen; Yuanchang Zhu; Yanqiang Di; Shaochong Feng
In an IaaS (infrastructure as a service) cloud environment, users are provisioned with virtual machines (VMs). To allocate resources for users dynamically and effectively, accurately predicting resource demands is essential. For this purpose, this paper proposes a self-adaptive prediction method using an ensemble model and a subtractive-fuzzy-clustering-based fuzzy neural network (ESFCFNN). We analyze the characteristics of user preferences and demands. Then the architecture of the prediction model is const...
We propose a biologically plausible architecture for unsupervised ensemble learning in a population of spiking neural network classifiers. A mixture-of-experts type organisation is shown to be effective, with the individual classifier outputs combined via a gating network whose operation is driven by input timing dependent plasticity (ITDP). The ITDP gating mechanism is based on recent experimental findings. An abstract, analytically tractable model of the ITDP-driven ensemble architecture is derived from a logical model based on the probabilities of neural firing events. A detailed analysis of this model provides insights that allow it to be extended into a full, biologically plausible, computational implementation of the architecture which is demonstrated on a visual classification task. The extended model makes use of a style of spiking network, first introduced as a model of cortical microcircuits, that is capable of Bayesian inference, effectively performing expectation maximization. The unsupervised ensemble learning mechanism, based around such spiking expectation maximization (SEM) networks whose combined outputs are mediated by ITDP, is shown to perform the visual classification task well and to generalize to unseen data. The combined ensemble performance is significantly better than that of the individual classifiers, validating the ensemble architecture and learning mechanisms. The properties of the full model are analysed in the light of extensive experiments with the classification task, including an investigation into the influence of different input feature selection schemes and a comparison with a hierarchical STDP-based ensemble architecture.
Hu, H P; Niu, Z J; Bai, Y P; Tan, X H
Based on gene expression, we have classified 53 colon cancer patients with UICC stage II into two groups: relapse and no relapse. Samples were taken from each patient, and gene information was extracted. From the 53 samples examined, 500 genes were considered informative through analyses by S-Kohonen, BP, and SVM neural networks. The classification accuracy obtained by the S-Kohonen neural network reaches 91%, more accurate than classification by BP and SVM neural networks. The results show that the S-Kohonen neural network is better suited for this classification task and has a certain feasibility and validity as compared with BP and SVM neural networks.
The complaint recognition system plays an important role in ensuring correct classification of hot complaints and improving the service quality of the telecommunications industry. Customer complaints in the telecommunications industry have the particularity that they must be handled within a limited time, which causes errors in the classification of hot complaints. This paper presents a model of intelligent hot-complaint classification based on text mining, which can classify hot complaints into the correct level of the complaint navigation. Examples show that the model can classify complaint texts efficiently.
O'Shea, Timothy James; Roy, Tamoghna; Clancy, T. Charles
We conduct an in-depth study on the performance of deep learning based radio signal classification for radio communications signals. We consider a rigorous baseline method using higher-order moments and strong boosted gradient tree classification, and compare performance between the two approaches across a range of configurations and channel impairments. We consider the effects of carrier frequency offset, symbol rate, and multi-path fading in simulation, and conduct over-the-air measurement of radio classification performance in the lab using software radios, comparing performance and training strategies for both. Finally we conclude with a discussion of remaining problems and design considerations for using such techniques.
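The baseline described above can be sketched as follows. This is a hedged illustration of the general idea only: the exact moment set, modulations and impairments of the paper are not reproduced, so a handful of illustrative higher-order moments of synthetic BPSK/QPSK bursts are fed to a gradient-boosted tree classifier.

```python
# Sketch: higher-order moment features of complex baseband signals, classified
# with boosted trees. Moment set and signal model are illustrative assumptions.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

def moments(x):
    """A few low/higher-order moments of a complex signal (illustrative set)."""
    return [np.mean(np.abs(x) ** 2), np.mean(np.abs(x) ** 4),
            np.abs(np.mean(x ** 2)), np.abs(np.mean(x ** 4))]

def sample(mod, n=256, snr_db=10):
    """One noisy burst of BPSK or QPSK symbols."""
    if mod == "bpsk":
        sym = rng.choice([-1.0, 1.0], n).astype(complex)
    else:  # qpsk
        sym = rng.choice([1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j], n) / np.sqrt(2)
    noise = (rng.normal(size=n) + 1j * rng.normal(size=n)) / np.sqrt(2)
    return sym + noise * 10 ** (-snr_db / 20)

X, y = [], []
for label, mod in enumerate(["bpsk", "qpsk"]):
    for _ in range(200):
        X.append(moments(sample(mod)))
        y.append(label)

Xtr, Xte, ytr, yte = train_test_split(np.array(X), np.array(y),
                                      random_state=0, stratify=y)
clf = GradientBoostingClassifier(random_state=0).fit(Xtr, ytr)
print(f"hold-out accuracy: {clf.score(Xte, yte):.2f}")
```

The second-order moment |E[x²]| alone already separates BPSK from QPSK (it is near 1 for BPSK and near 0 for QPSK), which is why even a small moment set works in this toy setting.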
This article presents and discusses definitions of the term “classification” and the related concepts “concept/conceptualization,” “categorization,” “ordering,” “taxonomy” and “typology.” It further presents and discusses theories of classification including the influences of Aristotle… and Wittgenstein. It presents different views on forming classes, including logical division, numerical taxonomy, historical classification, hermeneutical and pragmatic/critical views. Finally, issues related to artificial versus natural classification and taxonomic monism versus taxonomic pluralism are briefly…
Zhang, Tong; Kuo, C.-C. Jay
A hierarchical system for audio classification and retrieval based on audio content analysis is presented in this paper. The system consists of three stages. The audio recordings are first classified and segmented into speech, music, several types of environmental sounds, and silence, based on morphological and statistical analysis of temporal curves of the energy function, the average zero-crossing rate, and the fundamental frequency of audio signals. The first stage is called the coarse-level audio classification and segmentation. Then, environmental sounds are classified into finer classes such as applause, rain, birds' sound, etc., which is called the fine-level audio classification. The second stage is based on time-frequency analysis of audio signals and the use of the hidden Markov model (HMM) for classification. In the third stage, the query-by-example audio retrieval is implemented where similar sounds can be found according to the input sample audio. The way of modeling audio features with the hidden Markov model, the procedures of audio classification and retrieval, and the experimental results are described. It is shown that, with the proposed new system, audio recordings can be automatically segmented and classified into basic types in real time with an accuracy higher than 90%. Examples of audio fine classification and audio retrieval with the proposed HMM-based method are also provided.
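The coarse-level stage rests on two cheap frame features, short-time energy and zero-crossing rate. The sketch below shows how such features already separate silence, tonal content (speech/music) and noise-like environmental sound; the thresholds and the synthetic test signal are illustrative assumptions, not the paper's tuned values or its morphological analysis.

```python
# Sketch: short-time energy and zero-crossing rate with a simple threshold
# rule for coarse audio labelling. Thresholds are illustrative assumptions.
import numpy as np

def frame_features(x, frame=400):
    n = len(x) // frame
    frames = x[:n * frame].reshape(n, frame)
    energy = np.mean(frames ** 2, axis=1)                       # short-time energy
    zcr = np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)
    return energy, zcr

def coarse_label(energy, zcr, e_sil=1e-4, z_tone=0.1):
    out = []
    for e, z in zip(energy, zcr):
        if e < e_sil:
            out.append("silence")
        elif z < z_tone:
            out.append("tonal")       # e.g. voiced speech or music
        else:
            out.append("noise-like")  # e.g. environmental sound
    return out

fs = 8000
t = np.arange(fs) / fs
sig = np.concatenate([np.zeros(fs),                               # 1 s silence
                      np.sin(2 * np.pi * 220 * t),                # 1 s tone
                      np.random.default_rng(0).normal(size=fs)])  # 1 s noise
e, z = frame_features(sig)
labels = coarse_label(e, z)
print(labels[0], labels[len(labels) // 3 + 1], labels[-1])
```

A 220 Hz tone at 8 kHz sampling crosses zero roughly 22 times per 400-sample frame (ZCR ≈ 0.055), well below the noise-like regime (ZCR ≈ 0.5 for white noise), which is what the threshold exploits.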
Choi, Y.; Jeon, S. W.; Lim, C. H.; Ryu, J.
Biodiversity is rapidly declining globally and efforts are needed to mitigate this continually increasing loss of species. Clustering of areas with similar habitats can be used to prioritize protected areas and distribute resources for the conservation of species, selection of representative sample areas for research, and evaluation of impacts due to environmental changes. In this study, Northeast Asia (NEA) was classified into 14 bioclimatic zones using statistical techniques, namely correlation analysis and principal component analysis (PCA), and the iterative self-organizing data analysis technique algorithm (ISODATA). Based on this bioclimatic classification, we predicted shifts of bioclimatic zones due to climate change. The input variables include the current climatic data (1960-1990) and the future climatic data of the HadGEM2-AO model (RCP 4.5 (2050, 2070) and RCP 8.5 (2050, 2070)) provided by WorldClim. Using these data, multi-modeling methods including maximum likelihood classification, random forest, and species distribution modelling were used to project the impact of climate change on the spatial distribution of bioclimatic zones within NEA. The results of the various models were compared and analyzed by overlapping each result. As a result, significant changes in bioclimatic conditions can be expected throughout NEA by the 2050s and 2070s. Overall, the zones moved upward and some zones were predicted to disappear. This analysis provides a basis for understanding the potential impacts of climate change on biodiversity and ecosystems, and could be used to support decision making on climate change adaptation.
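The zoning pipeline above (correlation screening, PCA, iterative clustering) can be sketched as follows. This is a hedged stand-in: KMeans replaces ISODATA (which additionally splits and merges clusters), and synthetic grid-cell data replace the WorldClim variables.

```python
# Sketch of the bioclimatic zoning pipeline: correlation screening, PCA,
# then clustering. KMeans stands in for ISODATA; data are synthetic.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic grid cells x climatic variables (e.g. temperature, precipitation).
X = rng.normal(size=(500, 8))
X[:, 4] = X[:, 0] * 0.95 + rng.normal(scale=0.1, size=500)  # redundant variable

# Correlation screening: drop the later member of each highly correlated pair.
corr = np.corrcoef(X, rowvar=False)
keep = [i for i in range(X.shape[1])
        if not any(abs(corr[i, j]) > 0.9 for j in range(i))]
Xs = StandardScaler().fit_transform(X[:, keep])

# PCA to a few components, then cluster the cells into climatic zones.
Z = PCA(n_components=3).fit_transform(Xs)
zones = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(Z)
print("cells per zone:", np.bincount(zones))
```

Projecting future climate shifts would then amount to running the fitted PCA and cluster assignment on the future-scenario variables and comparing zone maps, as the abstract describes with the HadGEM2-AO scenarios.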
Gao, Yue; Ji, Rongrong; Cui, Peng; Dai, Qionghai; Hua, Gang
Hyperspectral image classification with a limited number of labeled pixels is a challenging task. In this paper, we propose a bilayer graph-based learning framework to address this problem. For graph-based classification, how to establish the neighboring relationship among the pixels from the high dimensional features is the key toward a successful classification. Our graph learning algorithm contains two layers. The first layer constructs a simple graph, where each vertex denotes one pixel and the edge weight encodes the similarity between two pixels. Unsupervised learning is then conducted to estimate the grouping relations among different pixels. These relations are subsequently fed into the second layer to form a hypergraph structure, on top of which semisupervised transductive learning is conducted to obtain the final classification results. Our experiments on three data sets demonstrate the merits of our proposed approach, which compares favorably with the state of the art.
Sheikh, Qamar M; Gatherer, Derek; Reche, Pedro A; Flower, Darren R
Influenza A viral heterogeneity remains a significant threat due to unpredictable antigenic drift in seasonal influenza and antigenic shifts caused by the emergence of novel subtypes. Annual review of multivalent influenza vaccines targets strains of influenza A and B likely to be predominant in future influenza seasons. This does not induce broad, cross protective immunity against emergent subtypes. Better strategies are needed to prevent future pandemics. Cross-protection can be achieved by activating CD8+ and CD4+ T cells against highly conserved regions of the influenza genome. We combine available experimental data with informatics-based immunological predictions to help design vaccines potentially able to induce cross-protective T-cells against multiple influenza subtypes. To exemplify our approach we designed two epitope ensemble vaccines comprising highly conserved and experimentally verified immunogenic influenza A epitopes as putative non-seasonal influenza vaccines; one specifically targets the US population and the other is a universal vaccine. The USA-specific vaccine comprised 6 CD8+ T cell epitopes (GILGFVFTL, FMYSDFHFI, GMDPRMCSL, SVKEKDMTK, FYIQMCTEL, DTVNRTHQY) and 3 CD4+ epitopes (KGILGFVFTLTVPSE, EYIMKGVYINTALLN, ILGFVFTLTVPSERG). The universal vaccine comprised 8 CD8+ epitopes (FMYSDFHFI, GILGFVFTL, ILRGSVAHK, FYIQMCTEL, ILKGKFQTA, YYLEKANKI, VSDGGPNLY, YSHGTGTGY) and the same 3 CD4+ epitopes. Our USA-specific vaccine has a population protection coverage (PPC, the portion of the population potentially responsive to one or more component epitopes of the vaccine) of over 96% and 95% coverage of observed influenza subtypes. The universal vaccine has a PPC value of over 97% and 88% coverage of observed subtypes. http://imed.med.ucm.es/Tools/episopt.html CONTACT: firstname.lastname@example.org. © The Author 2016. Published by Oxford University Press.
…introduced in geophysical fluid dynamics to quantify predictive information content in forecast ensembles (Kleeman 2002; Abramov et al. 2005). Here, we… National Ocean Partnership Program. References: Abramov, R., A. Majda, and R. Kleeman, 2005: Information theory and predictability for low-frequency…
Public school music education in the USA remains wedded to large ensemble performance. Instruction tends to be teacher directed, relies on styles from the Western canon and exhibits little concern for musical interests of students. The idea that a fundamental purpose of education is the creation of a just society is difficult for many music…
Miyoshi, Seiji; Hara, Kazuyuki; Okada, Masato
Ensemble learning of K nonlinear perceptrons, which determine their outputs by sign functions, is discussed within the framework of online learning and statistical mechanics. One purpose of statistical learning theory is to theoretically obtain the generalization error. This paper shows that ensemble generalization error can be calculated by using two order parameters, that is, the similarity between a teacher and a student, and the similarity among students. The differential equations that describe the dynamical behaviors of these order parameters are derived in the case of general learning rules. The concrete forms of these differential equations are derived analytically in the cases of three well-known rules: Hebbian learning, perceptron learning, and AdaTron (adaptive perceptron) learning. Ensemble generalization errors of these three rules are calculated by using the results determined by solving their differential equations. As a result, these three rules show different characteristics in their affinity for ensemble learning, that is “maintaining variety among students.” Results show that AdaTron learning is superior to the other two rules with respect to that affinity.
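The two order parameters named above can be measured directly in a small simulation. This sketch is an empirical stand-in for the statistical-mechanics analysis, not a reproduction of it: K perceptron students are trained online with the Hebbian rule on labels from a fixed teacher, and the teacher-student overlap R, the student-student overlap q, and the majority-vote generalization error are estimated on fresh inputs. The step counts and learning rate are illustrative assumptions.

```python
# Sketch: online Hebbian learning of K perceptron students against a common
# teacher; order parameters R (teacher-student) and q (student-student)
# measured empirically, plus single vs. ensemble generalization error.
import numpy as np

rng = np.random.default_rng(0)
N, K, T, eta = 500, 3, 2000, 0.1
teacher = rng.normal(size=N)
teacher /= np.linalg.norm(teacher)
students = rng.normal(size=(K, N)) * 0.001   # near-zero initial weights

# Each student sees its own stream of examples labelled by the teacher.
for _ in range(T):
    for k in range(K):
        x = rng.normal(size=N)
        students[k] += (eta / N) * np.sign(teacher @ x) * x   # Hebbian rule

def overlap(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

R = np.mean([overlap(w, teacher) for w in students])  # teacher-student similarity
q = overlap(students[0], students[1])                 # similarity among students

# Generalization error of one student vs. the majority vote of the ensemble.
Xtest = rng.normal(size=(5000, N))
ytrue = np.sign(Xtest @ teacher)
single = np.mean(np.sign(Xtest @ students[0]) != ytrue)
vote = np.mean(np.sign(np.sign(Xtest @ students.T).sum(axis=1)) != ytrue)
print(f"R={R:.2f} q={q:.2f} single-err={single:.3f} ensemble-err={vote:.3f}")
```

Because each student sees an independent example stream, q stays below R; that residual variety among students is exactly what lets the majority vote beat an individual student, in line with the abstract's point about affinity for ensemble learning.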
Liu, Leo; Bak, Claus Leth; Chen, Zhe
With the increasing penetration of renewable energy resources and other forms of dispersed generation, more and more uncertainties will be brought to the dynamic security assessment (DSA) of power systems. This paper proposes an approach that uses ensemble decision trees (EDT) for online DSA. Fed...
Shweta C. Dharmadhikari; Maya Ingle; Parag Kulkarni
Automatic classification of text documents has become an important research issue nowadays. Proper classification of text documents requires information retrieval, machine learning and natural language processing (NLP) techniques. Our aim is to focus on important approaches to automatic text classification based on machine learning techniques, viz. supervised, unsupervised and semi-supervised. In this paper we present a review of various text classification approaches under the machine learning paradigm...
Lam, Polo C.-H.; Abagyan, Ruben; Totrov, Maxim
Ligand docking to flexible protein molecules can be efficiently carried out through ensemble docking to multiple protein conformations, either from experimental X-ray structures or from in silico simulations. The success of ensemble docking often requires the careful selection of complementary protein conformations, through docking and scoring of known co-crystallized ligands. False positives, in which a ligand in a wrong pose achieves a better docking score than that of native pose, arise as additional protein conformations are added. In the current study, we developed a new ligand-biased ensemble receptor docking method and composite scoring function which combine the use of ligand-based atomic property field (APF) method with receptor structure-based docking. This method helps us to correctly dock 30 out of 36 ligands presented by the D3R docking challenge. For the six mis-docked ligands, the cognate receptor structures prove to be too different from the 40 available experimental Pocketome conformations used for docking and could be identified only by receptor sampling beyond experimentally explored conformational subspace.
Zhang, Tong; Kuo, C.-C. Jay
An on-line audio classification and segmentation system is presented in this research, where audio recordings are classified and segmented into speech, music, several types of environmental sounds and silence based on audio content analysis. This is the first step of our continuing work towards a general content-based audio classification and retrieval system. The extracted audio features include temporal curves of the energy function, the average zero-crossing rate, the fundamental frequency of audio signals, as well as statistical and morphological features of these curves. The classification result is achieved through a threshold-based heuristic procedure. The audio database that we have built, details of feature extraction, classification and segmentation procedures, and experimental results are described. It is shown that, with the proposed new system, audio recordings can be automatically segmented and classified into basic types in real time with an accuracy of over 90 percent. Outlines of further classification of audio into finer types and a query-by-example audio retrieval system on top of the coarse classification are also introduced.
Ozcift, Akin; Gulten, Arif
Improving the accuracy of machine learning algorithms is vital in designing high performance computer-aided diagnosis (CADx) systems. Research has shown that a base classifier's performance might be enhanced by ensemble classification strategies. In this study, we construct rotation forest (RF) ensemble classifiers of 30 machine learning algorithms to evaluate their classification performances using Parkinson's, diabetes and heart disease datasets from the literature. While making experiments, first the feature dimension of the three datasets is reduced using the correlation-based feature selection (CFS) algorithm. Second, classification performances of the 30 machine learning algorithms are calculated for the three datasets. Third, 30 classifier ensembles are constructed based on the RF algorithm to assess performances of the respective classifiers with the same disease data. All the experiments are carried out with a leave-one-out validation strategy and the performances of the 60 algorithms are evaluated using three metrics: classification accuracy (ACC), kappa error (KE) and area under the receiver operating characteristic (ROC) curve (AUC). Base classifiers achieved 72.15%, 77.52% and 84.43% average accuracies for the diabetes, heart and Parkinson's datasets, respectively. As for the RF classifier ensembles, they produced average accuracies of 74.47%, 80.49% and 87.13% for the respective diseases. RF, a newly proposed classifier ensemble algorithm, might be used to improve the accuracy of miscellaneous machine learning algorithms to design advanced CADx systems. Copyright © 2011 Elsevier Ireland Ltd. All rights reserved.
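The rotation forest idea can be sketched compactly: each tree sees the data rotated by PCA fitted on random feature subsets, which injects diversity while preserving all feature information. The sketch below is a simplified reading of the algorithm, not Rodriguez et al.'s exact procedure or the paper's implementation (e.g. class-subset bootstrapping before PCA is omitted), and breast cancer data stand in for the disease datasets.

```python
# Sketch of a rotation-forest-style ensemble: per-tree random feature groups,
# PCA per group fitted on a bootstrap sample, tree trained on rotated data.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X, y = load_breast_cancer(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

def fit_rotated_tree(Xtr, ytr, n_groups=3):
    # Random feature grouping; PCA per group fitted on a bootstrap sample.
    groups = np.array_split(rng.permutation(Xtr.shape[1]), n_groups)
    boot = Xtr[rng.choice(len(Xtr), len(Xtr))]
    pcas = [PCA().fit(boot[:, g]) for g in groups]

    def rotate(M):
        # Full-rank rotation: all principal components of every group are kept.
        return np.hstack([p.transform(M[:, g]) for p, g in zip(pcas, groups)])

    return DecisionTreeClassifier(random_state=0).fit(rotate(Xtr), ytr), rotate

forest = [fit_rotated_tree(Xtr, ytr) for _ in range(15)]
votes = np.mean([tree.predict(rot(Xte)) for tree, rot in forest], axis=0)
acc = np.mean((votes > 0.5) == yte)
print(f"rotation-forest-style accuracy: {acc:.3f}")
```

Because every group's full set of principal components is retained, each tree still sees all the information in the data; the diversity comes only from the random grouping and the bootstrap used to fit the PCAs.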
Polyakova, A.; Lipinskiy, L.
Some problems are difficult to solve by using a single intelligent information technology (IIT). An ensemble of various data mining (DM) techniques is a set of models, each of which is able to solve the problem by itself, but whose combination increases the efficiency of the system as a whole. Using IIT ensembles can improve the reliability and efficiency of the final decision, since the approach emphasizes the diversity of its components. A new method of intelligent information technology ensemble design is considered in this paper. It is based on fuzzy logic and is designed to solve classification and regression problems. The ensemble consists of several data mining algorithms: artificial neural network, support vector machine and decision trees. These algorithms and their ensemble have been tested on face recognition problems. Principal component analysis (PCA) is used for feature selection.
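An ensemble mixing the three model families named above can be sketched with scikit-learn. This is a hedged stand-in for the paper's design: soft probability averaging replaces the fuzzy-logic combination rule, and the digits dataset replaces the face-recognition data, with PCA used for feature reduction as in the abstract.

```python
# Sketch: heterogeneous ensemble of neural network, SVM and decision tree,
# combined by soft voting after PCA feature reduction. Fuzzy combination
# logic is replaced by probability averaging (an assumption of this sketch).
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

members = [
    ("ann", MLPClassifier(hidden_layer_sizes=(50,), max_iter=500, random_state=0)),
    ("svm", SVC(probability=True, random_state=0)),
    ("tree", DecisionTreeClassifier(random_state=0)),
]
model = make_pipeline(StandardScaler(), PCA(n_components=30),
                      VotingClassifier(members, voting="soft"))
model.fit(Xtr, ytr)
print(f"heterogeneous ensemble accuracy: {model.score(Xte, yte):.3f}")
```

Soft voting averages the members' class probabilities, so a confident SVM can outvote an uncertain tree; this is the simplest concrete realization of "combining diverse components" that the abstract argues for.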
Sang, Cheng Wei; Sun, Hong
Polarimetric SAR (PolSAR) image classification is one of the important applications of PolSAR remote sensing. It is a difficult high-dimensional nonlinear mapping problem; sparse representations based on learned overcomplete dictionaries have shown great potential for solving such problems. The overcomplete dictionary plays an important role in PolSAR image classification; however, in complex PolSAR scenes, features shared by different classes weaken the discrimination of the learned dictionary and degrade classification performance. In this paper, we propose a novel overcomplete dictionary learning model to enhance the discrimination of the dictionary. The overcomplete dictionary learned by the proposed model is more discriminative and very suitable for PolSAR classification.
Salvador J. Diaz-Cano
Any robust classification system depends on its purpose and must refer to accepted standards, its strength relying on predictive values and a careful consideration of known factors that can affect its reliability. In this context, a molecular classification of human cancer must refer to the current gold standard (histological classification) and try to improve it with key prognosticators for metastatic potential, staging and grading. Although organ-specific examples have been published based on proteomics, transcriptomics and genomics evaluations, the most popular approach uses gene expression analysis as a direct correlate of cellular differentiation, which represents the key feature of the histological classification. RNA is a labile molecule that varies significantly according to the preservation protocol, its transcription reflects the adaptation of the tumor cells to the microenvironment, it can be passed between cells through mechanisms of intercellular transfer of genetic information (exosomes), and it is exposed to epigenetic modifications. More robust classifications should be based on stable molecules, represented at the genetic level by DNA, to improve reliability, and their analysis must deal with the concept of intratumoral heterogeneity, which is at the origin of tumor progression and is the byproduct of the selection process during the clonal expansion and progression of neoplasms. The simultaneous analysis of multiple DNA targets and next generation sequencing offer the best practical approach for an analytical genomic classification of tumors.
Ben Bouallègue, Zied; Heppelmann, Tobias; Theis, Susanne E.
The new approach, which preserves the dynamical development of the ensemble members, is called dynamic ensemble copula coupling (d-ECC). The ensemble-based empirical copulas, ECC and d-ECC, are applied to wind forecasts from the high-resolution ensemble system COSMO-DE-EPS run operationally at the German…
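The standard ECC step underlying this work can be sketched in a few lines: postprocessed marginal samples are reordered at each time step to follow the rank structure of the raw ensemble, restoring a physically plausible temporal dependence. The d-ECC variant of the paper additionally adjusts this rank template dynamically, which is not reproduced here; the toy "calibrated" samples below are an assumption of the sketch.

```python
# Sketch of standard ensemble copula coupling (ECC): impose the raw
# ensemble's per-time-step rank order on calibrated marginal samples.
import numpy as np

rng = np.random.default_rng(0)
n_members, n_times = 5, 8
raw = np.cumsum(rng.normal(size=(n_members, n_times)), axis=1)  # raw wind ensemble

# Calibrated marginal samples per time step (toy stand-in for postprocessed
# quantiles), sorted ascending along the member axis.
calibrated = np.sort(raw * 1.2 + 0.5 + rng.normal(scale=0.3, size=raw.shape),
                     axis=0)

# ECC: member m receives the k-th smallest calibrated value at time t,
# where k is member m's rank in the raw ensemble at time t.
ranks = raw.argsort(axis=0).argsort(axis=0)
ecc = np.take_along_axis(calibrated, ranks, axis=0)
print("reordered ensemble shape:", ecc.shape)
```

After reordering, the ECC ensemble has exactly the raw ensemble's rank structure at every time step while keeping the calibrated marginal values, which is the property d-ECC then refines over time.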
Demargne, Julie; Javelle, Pierre; Organde, Didier; de Saint Aubin, Céline; Ramos, Maria-Helena
In order to better anticipate flash flood events and provide timely warnings to communities at risk, the French national service in charge of flood forecasting (SCHAPI) is implementing a national flash flood warning system for small-to-medium ungauged basins. Based on a discharge-threshold flood warning method called AIGA (Javelle et al. 2014), the current version of the system runs a simplified hourly distributed hydrologic model with operational radar-gauge QPE grids from Météo-France at a 1-km2 resolution every 15 minutes. This produces real-time peak discharge estimates along the river network, which are subsequently compared to regionalized flood frequency estimates to provide warnings according to the AIGA-estimated return period of the ongoing event. To further extend the effective warning lead time while accounting for hydrometeorological uncertainties, the flash flood warning system is being enhanced to include Météo-France's AROME-NWC high-resolution precipitation nowcasts as time-lagged ensembles and multiple sets of hydrological regionalized parameters. The operational deterministic precipitation forecasts, from the nowcasting version of the AROME convection-permitting model (Auger et al. 2015), were provided at a 2.5-km resolution for a 6-hr forecast horizon for 9 significant rain events from September 2014 to June 2016. The time-lagged approach is a practical choice for accounting for the atmospheric forecast uncertainty when no extensive forecast archive is available for statistical modelling. The evaluation on 781 French basins showed significant improvements in terms of flash flood event detection and effective warning lead-time, compared to warnings from the current AIGA setup (without any future precipitation). We also discuss how to effectively communicate verification information to help determine decision-relevant warning thresholds for flood magnitude and probability. Javelle, P., Demargne, J., Defrance, D., Arnaud, P., 2014. Evaluating
Wu Song; Jiang Qigang
A method for rapid classification and extraction of land-use types in agricultural areas is presented, based on SPOT-6 high-resolution remote sensing data and the good nonlinear classification ability of support vector machines. The results show that SPOT-6 high-resolution remote sensing data can realize land classification efficiently; the overall classification accuracy reached 88.79% and the Kappa coefficient is 0.8632, which means that the classif...
Fernández, Alberto; Carmona, Cristobal José; José Del Jesus, María; Herrera, Francisco
Imbalanced classification refers to problems with an uneven distribution among classes. In addition, when instances are located in overlapped areas, correctly modeling the problem becomes harder. Current solutions for both issues often focus on the binary case, as multi-class datasets require additional effort to be addressed. In this research, we overcome these problems by combining feature and instance selection. Feature selection simplifies the overlapping areas, easing the generation of rules to distinguish among the classes. Selection of instances from all classes addresses the imbalance itself by finding the most appropriate class distribution for the learning task, as well as possibly removing noise and difficult borderline examples. To obtain an optimal joint set of features and instances, we embed the search for both in a Multi-Objective Evolutionary Algorithm, using the C4.5 decision tree as the baseline classifier in this wrapper approach. The multi-objective scheme offers a double advantage: the search space becomes broader, and we can provide a set of different solutions with which to build an ensemble of classifiers. This proposal has been contrasted with several state-of-the-art solutions for imbalanced classification, showing excellent results in both binary and multi-class problems.
To develop a new ensemble learning method and construct highly predictive regression models in chemoinformatics and chemometrics, applicability domains (ADs) are introduced into the ensemble prediction process. When estimating values of an objective variable using subregression models, only the submodels whose ADs cover a query sample, i.e., for which the sample is inside the model's AD, are used. By constructing multiple submodels, each with a different list of selected explanatory variables, the union of the submodels' ADs, which defines the overall AD, becomes large, and the prediction performance is enhanced for diverse compounds. By analyzing a quantitative structure-activity relationship dataset and a quantitative structure-property relationship dataset, it is confirmed that the ADs can be enlarged and the estimation performance of the regression models is improved compared with traditional methods.
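The selective-prediction idea above can be illustrated with a minimal sketch (not the paper's implementation): here each submodel is a hypothetical 1-nearest-neighbour regressor over an assumed feature subset, and the AD is a simple distance-to-training-data threshold; the actual submodels and AD definition in the study may differ.

```python
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

class Submodel:
    """1-NN regressor over a feature subset, with a distance-based AD."""
    def __init__(self, X, y, feats, ad_radius):
        self.feats, self.ad_radius = feats, ad_radius
        self.X = [[row[i] for i in feats] for row in X]
        self.y = y

    def covers(self, x):
        # The query is inside the AD if it lies close enough to training data.
        q = [x[i] for i in self.feats]
        return min(dist(q, r) for r in self.X) <= self.ad_radius

    def predict(self, x):
        q = [x[i] for i in self.feats]
        return min(zip((dist(q, r) for r in self.X), self.y))[1]

def ad_ensemble_predict(submodels, x):
    """Average only the submodels whose AD covers the query sample."""
    preds = [m.predict(x) for m in submodels if m.covers(x)]
    return sum(preds) / len(preds) if preds else None  # None: outside overall AD
```

Because each submodel uses different explanatory variables, a query rejected by one submodel's AD may still be covered by another, enlarging the overall AD exactly as described above.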
Fsheikh, Ahmed H.
A nonlinear orthogonal matching pursuit (NOMP) for sparse calibration of reservoir models is presented. Sparse calibration is a challenging problem as the unknowns are both the non-zero components of the solution and their associated weights. NOMP is a greedy algorithm that discovers at each iteration the most correlated components of the basis functions with the residual. The discovered basis (aka support) is augmented across the nonlinear iterations. Once the basis functions are selected from the dictionary, the solution is obtained by applying Tikhonov regularization. The proposed algorithm relies on approximate gradient estimation using an iterative stochastic ensemble method (ISEM). ISEM utilizes an ensemble of directional derivatives to efficiently approximate gradients. In the current study, the search space is parameterized using an overcomplete dictionary of basis functions built using the K-SVD algorithm.
Ji, Peng; Wu, Changcheng; Xu, Xiaonong; Song, Aiguo; Li, Huijun
Posture recognition is an important modality for Human-Robot Interaction (HRI). To segment effective postures from an image, we propose an improved region-grow algorithm combined with the Single Gauss Color Model. Experiments show that the improved region-grow algorithm extracts more complete and accurate postures than the traditional Single Gauss Model and region-grow algorithm, while eliminating similar regions from the background. For posture recognition, we propose a CNN ensemble classifier to improve the recognition rate and, to reduce misjudgments during continuous gesture control, a vote filter applied to the sequence of recognition results. The proposed CNN ensemble classifier yields a 96.27% recognition rate, better than that of a single CNN classifier, and the proposed vote filter improves the recognition results and reduces misjudgments during consecutive gesture switches.
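A vote filter over a stream of per-frame labels can be sketched as a sliding-window majority vote; this is a plausible minimal form, and the window size and tie-breaking here are assumptions, not the authors' exact design.

```python
from collections import Counter, deque

def vote_filter(labels, window=5):
    """Majority vote over a sliding window of per-frame recognition
    labels, suppressing isolated misjudgments in a gesture stream."""
    buf = deque(maxlen=window)
    smoothed = []
    for lab in labels:
        buf.append(lab)
        # Counter ties break by insertion order; window size is a tunable.
        smoothed.append(Counter(buf).most_common(1)[0][0])
    return smoothed
```

A single mislabeled frame inside a run of consistent labels is outvoted by its neighbours, while a genuine gesture switch survives after a short delay of roughly half the window.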
Kikuchi, Ryota; Misaka, Takashi; Obayashi, Shigeru, E-mail: email@example.com [Institute of Fluid Science, Tohoku University, 2-1-1 Katahira, Aoba-ku, Sendai, Miyagi 980-8577 (Japan)
An integrated method of a proper orthogonal decomposition based reduced-order model (ROM) and data assimilation is proposed for the real-time prediction of an unsteady flow field. In this paper, a particle filter (PF) and an ensemble Kalman filter (EnKF) are compared for data assimilation and the difference in the predicted flow fields is evaluated focusing on the probability density function (PDF) of the model variables. The proposed method is demonstrated using identical twin experiments of an unsteady flow field around a circular cylinder at the Reynolds number of 1000. The PF and EnKF are employed to estimate temporal coefficients of the ROM based on the observed velocity components in the wake of the circular cylinder. The prediction accuracy of ROM-PF is significantly better than that of ROM-EnKF due to the flexibility of PF for representing a PDF compared to EnKF. Furthermore, the proposed method reproduces the unsteady flow field several orders of magnitude faster than the reference numerical simulation based on the Navier–Stokes equations. (paper)
Sørensen, Lauge; Lo, Pechin Chien Pau; Dirksen, Asger
A novel method for classification of abnormality in anatomical tree structures is presented. A tree is classified based on direct comparisons with other trees in a dissimilarity-based classification scheme. The pair-wise dissimilarity measure between two trees is based on a linear assignment between the branch feature vectors representing those trees. Hereby, localized information in the branches is collectively used in classification and variations in feature values across the tree are taken into account. An approximate anatomical correspondence between matched branches can be achieved by including anatomical features in the branch feature vectors. The proposed approach is applied to classify airway trees in computed tomography images of subjects with and without chronic obstructive pulmonary disease (COPD). Using the wall area percentage (WA%), a common measure of airway abnormality in COPD...
Kim, S.; Riazi, H.; Shin, C.; Seo, D.
Due to the large dimensionality of the state vector and sparsity of observations, the initial conditions (IC) of water quality models are subject to large uncertainties. To reduce the IC uncertainties in operational water quality forecasting, an ensemble data assimilation (DA) procedure for the Hydrologic Simulation Program - Fortran (HSPF) model has been developed and evaluated for the Kumho River Subcatchment of the Nakdong River Basin in Korea. The procedure, referred to herein as MLEF-HSPF, uses the maximum likelihood ensemble filter (MLEF), which combines the strengths of variational assimilation (VAR) and the ensemble Kalman filter (EnKF). The control variables involved in the DA procedure include the bias correction factors for mean areal precipitation and mean areal potential evaporation, the hydrologic state variables, and the water quality state variables such as water temperature, dissolved oxygen (DO), biochemical oxygen demand (BOD), ammonium (NH4), nitrate (NO3), phosphate (PO4) and chlorophyll a (CHL-a). Due to the very large dimensionality of the inverse problem, accurately specifying the parameters for the DA procedure is a challenge. Systematic sensitivity analysis is carried out to identify the optimal parameter settings. To evaluate the robustness of MLEF-HSPF, we use multiple subcatchments of the Nakdong River Basin. In evaluation, we focus on the performance of MLEF-HSPF on prediction of extreme water quality events.
Alexandre M. Ramos
Full Text Available A new classification method of the large-scale circulation characteristics for a specific target area (the NW Iberian Peninsula) is presented, based on the analysis of 90-h backward trajectories arriving in this area, calculated with the 3-D Lagrangian particle dispersion model FLEXPART. A cluster analysis is applied to separate the backward trajectories into up to five representative air streams for each day. Specific measures are then used to characterise the distinct air streams (e.g., curvature of the trajectories, cyclonic or anticyclonic flow, moisture evolution, origin and length of the trajectories). The robustness of the presented method is demonstrated in comparison with the Eulerian Lamb weather type classification. A case study of the 2003 heatwave is discussed in terms of the new Lagrangian circulation and the Lamb weather type classifications. It is shown that the new classification method adds valuable information about the pertinent meteorological conditions, which is missing in an Eulerian approach. The new method is climatologically evaluated for the five-year period from December 1999 to November 2004. The ability of the method to capture the inter-seasonal circulation variability in the target region is shown. Furthermore, the multi-dimensional character of the classification is briefly discussed, in particular with respect to inter-seasonal differences. Finally, the relationship between the new Lagrangian classification and the precipitation in the target area is studied.
Tamilselvan, Prasanna; Wang, Pingfeng
Effective health diagnosis provides multifarious benefits such as improved safety, improved reliability and reduced costs for the operation and maintenance of complex engineered systems. This paper presents a novel multi-sensor health diagnosis method using a deep belief network (DBN). The DBN has recently become a popular approach in machine learning for its promising advantages, such as fast inference and the ability to encode richer and higher-order network structures. The DBN employs a hierarchical structure with multiple stacked restricted Boltzmann machines and works through a layer-by-layer successive learning process. The proposed multi-sensor health diagnosis methodology using DBN-based state classification is structured in three consecutive stages: first, defining health states and preprocessing sensory data for DBN training and testing; second, developing DBN-based classification models for the diagnosis of predefined health states; third, validating the DBN classification models with a testing sensory dataset. Health diagnosis using the DBN-based health state classification technique is compared with four existing diagnosis techniques. Benchmark classification problems and two engineering health diagnosis applications, aircraft engine health diagnosis and electric power transformer health diagnosis, are employed to demonstrate the efficacy of the proposed approach.
Cerra, D.; Mueller, R.; Reinartz, P.
This paper presents a new classification methodology for hyperspectral data based on synergetics theory, which describes the spontaneous formation of patterns and structures in a system through self-organization. We introduce a representation for hyperspectral data in which a spectrum can be projected into a space spanned by a set of user-defined prototype vectors belonging to the classes of interest. Each test vector is attracted by a final state associated with a prototype, and can thus be classified. As typical synergetics-based systems have the drawback of a rigid training step, we modify it to allow the selection of user-defined training areas, which are used to weight the prototype vectors through attention parameters and to produce a more accurate classification map through majority voting of independent classifications. Results are comparable to state-of-the-art classification methodologies, both general and specific to hyperspectral data, and, as each classification is based on a single training sample per class, the proposed technique would be particularly effective in tasks where only a small training dataset is available.
Xia, Ping; Yan, Hua; Li, Jing; Sun, Jiande
This paper proposes a single-image super-resolution algorithm based on image patch classification and sparse representation, in which gradient information is used to classify image patches into three different classes so as to reflect the differences between the types of image patches. Compared with other classification algorithms, the gradient-based algorithm is simpler and more effective. Each class is learned to obtain a corresponding sub-dictionary, and a high-resolution image patch can be reconstructed from that dictionary and the sparse representation coefficients of the corresponding class of image patches. Experimental results demonstrate that the proposed algorithm performs better than the other algorithms.
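A gradient-based three-way patch split can be sketched as follows; the three class names and the thresholds `t_low`/`t_high` are illustrative assumptions, not values from the paper.

```python
def gradient_energy(patch):
    """Mean absolute forward-difference gradient of a 2-D patch."""
    h, w = len(patch), len(patch[0])
    total, count = 0.0, 0
    for i in range(h):
        for j in range(w):
            if j + 1 < w:
                total += abs(patch[i][j + 1] - patch[i][j]); count += 1
            if i + 1 < h:
                total += abs(patch[i + 1][j] - patch[i][j]); count += 1
    return total / count

def classify_patch(patch, t_low=2.0, t_high=10.0):
    """Route a patch to one of three sub-dictionaries by gradient strength."""
    g = gradient_energy(patch)
    if g < t_low:
        return "smooth"
    return "edge" if g >= t_high else "texture"
```

Each class would then be reconstructed with its own learned sub-dictionary, as the abstract describes.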
Hill, A.; Weiss, C.; Ancell, B. C.
The basic premise of observation targeting is that additional observations, when gathered and assimilated with a numerical weather prediction (NWP) model, will produce a more accurate forecast related to a specific phenomenon. Ensemble-sensitivity analysis (ESA; Ancell and Hakim 2007; Torn and Hakim 2008) is a tool capable of accurately estimating the proper location of targeted observations in areas that have initial model uncertainty and large error growth, as well as predicting the reduction of forecast variance due to the assimilated observation. ESA relates an ensemble of NWP model forecasts, specifically an ensemble of scalar forecast metrics, linearly to earlier model states. A thorough investigation is presented to determine how different factors of the forecast process impact our ability to successfully target new observations for mesoscale convection forecasts. Our primary goals for this work are to determine: (1) whether targeted observations hold more positive impact than non-targeted (i.e. randomly chosen) observations; (2) whether there are lead-time constraints to targeting for convection; (3) how inflation, localization, and the assimilation filter influence impact prediction and realized results; (4) whether there exist differences between targeted observations at the surface versus aloft; and (5) how physics errors and nonlinearity may augment observation impacts. Ten cases of dryline-initiated convection between 2011 and 2013 are simulated within a simplified OSSE framework and presented here. Ensemble simulations are produced from a cycling system that utilizes the Weather Research and Forecasting (WRF) model v3.8.1 within the Data Assimilation Research Testbed (DART). A "truth" (nature) simulation is produced by supplying a 3-km WRF run with GFS analyses and integrating the model forward 90 hours, from the beginning of ensemble initialization through the end of the forecast. Target locations for surface and radiosonde observations are computed 6, 12, and
Wimmer, Georg; Tamaki, Toru; Tischendorf, J J W; Häfner, Michael; Yoshida, Shigeto; Tanaka, Shinji; Uhl, Andreas
In this work, various wavelet-based methods like the discrete wavelet transform, the dual-tree complex wavelet transform, the Gabor wavelet transform, curvelets, contourlets and shearlets are applied for the automated classification of colonic polyps. The methods are tested on 8 HD-endoscopic image databases, where each database is acquired using different imaging modalities (Pentax's i-Scan technology combined with or without staining the mucosa), 2 NBI high-magnification databases and one database with chromoscopy high-magnification images. To evaluate the suitability of the wavelet-based methods with respect to the classification of colonic polyps, the classification performances of 3 wavelet transforms and the more recent curvelets, contourlets and shearlets are compared using a common framework. Wavelet transforms have already often been applied successfully to the classification of colonic polyps, whereas curvelets, contourlets and shearlets have not been used for this purpose so far. We apply different feature extraction techniques to extract the information of the subbands of the wavelet-based methods. Most of the 25 approaches in total have already been published in different texture classification contexts; thus, the aim is also to assess and compare their classification performance using a common framework. Three of the 25 approaches are novel: they extract Weibull features from the subbands of curvelets, contourlets and shearlets. Additionally, 5 state-of-the-art non-wavelet-based methods are applied to our databases so that we can compare their results with those of the wavelet-based methods. It turned out that extracting Weibull distribution parameters from the subband coefficients generally leads to high classification results, especially for the dual-tree complex wavelet transform, the Gabor wavelet transform and the shearlet transform. These three wavelet-based transforms in combination with Weibull features even outperform the state
The detection of emotions is a hot topic in the area of computer vision. Emotions are based on subtle changes in the face that are intuitively detected and interpreted by humans. Detecting these subtle changes, based on mathematical models, is a great challenge in the area of computer vision. In this thesis a new method is proposed to achieve state-of-the-art emotion detection performance. This method is based on facial feature points to monitor subtle changes in the face. Therefore the c...
Fihl, Preben; Moeslund, Thomas B.
This paper deals with classification of human gait types based on the notion that different gait types are in fact different types of locomotion, i.e., running is not simply walking done faster. We present the duty-factor, which is a descriptor based on this notion. The duty-factor is independent of the speed of the human, the camera setup, etc., and hence a robust descriptor for gait classification. The duty-factor is basically a matter of measuring the ground support of the feet with respect to the stride. We estimate this by comparing the incoming silhouettes to a database of silhouettes with known ground support. Silhouettes are extracted using the Codebook method and represented using Shape Contexts. The matching with database silhouettes is done using the Hungarian method. While manually estimated duty-factors show a clear classification, the presented system contains misclassifications due...
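The duty-factor itself reduces to a simple ratio; this sketch (with an assumed 0.5 decision threshold, which follows from running having an aerial phase, not necessarily the paper's trained boundary) shows the computation from per-frame ground-contact flags.

```python
def duty_factor(contact_flags):
    """Fraction of one stride during which the foot is on the ground,
    from per-frame ground-contact booleans for that foot."""
    return sum(1 for c in contact_flags if c) / len(contact_flags)

def classify_gait(df, threshold=0.5):
    """Walking keeps each foot grounded more than half the stride;
    running has an aerial phase, so the duty-factor drops below 0.5."""
    return "walk" if df > threshold else "run"
```

Because both numerator and denominator are measured in frames of the same stride, the ratio is independent of walking speed and frame rate, which is what makes it robust.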
Naghibi, Seyed Amir; Moghaddam, Davood Davoodi; Kalantar, Bahareh; Pradhan, Biswajeet; Kisi, Ozgur
In recent years, the application of ensemble models has increased tremendously in various types of natural hazard assessment, such as landslides and floods. However, the application of this kind of robust model in groundwater potential mapping is relatively new. This study applied four data mining algorithms, including AdaBoost, Bagging, generalized additive model (GAM), and Naive Bayes (NB) models, to map groundwater potential. Then, a novel frequency ratio data mining ensemble model (FREM) was introduced and evaluated. For this purpose, eleven groundwater conditioning factors (GCFs), including altitude, slope aspect, slope angle, plan curvature, stream power index (SPI), river density, distance from rivers, topographic wetness index (TWI), land use, normalized difference vegetation index (NDVI), and lithology, were mapped. A total of 281 well locations with high potential were selected and randomly partitioned into two classes for training the models (70%, or 197 wells) and validating them (30%, or 84 wells). The AdaBoost, Bagging, GAM, and NB algorithms were employed to produce groundwater potential maps (GPMs), which were categorized into potential classes using the natural breaks classification scheme. In the next stage, frequency ratio (FR) values were calculated for the outputs of the four aforementioned models and summed, and finally a GPM was produced using FREM. For validating the models, the area under the receiver operating characteristic (ROC) curve was calculated; for the prediction dataset it was 94.8, 93.5, 92.6, 92.0, and 84.4% for the FREM, Bagging, AdaBoost, GAM, and NB models, respectively. The results indicated that FREM had the best performance among all the models, which could be related to the reduction of overfitting and possible errors. Other models such as AdaBoost, Bagging, GAM, and NB also produced acceptable performance in groundwater modelling. The GPMs produced in the current study may facilitate groundwater exploitation
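The frequency ratio used to weight each model's potential classes is commonly computed as the class's share of observed wells over its share of map area; this sketch uses that standard definition, which the paper may normalize slightly differently.

```python
def frequency_ratio(class_cells, class_wells):
    """FR per potential class: the class's share of observed wells
    divided by its share of map area. FR > 1 marks classes where
    wells occur more often than area alone would predict."""
    total_cells = sum(class_cells.values())
    total_wells = sum(class_wells.values())
    return {c: (class_wells.get(c, 0) / total_wells)
               / (class_cells[c] / total_cells)
            for c in class_cells}
```

In a FREM-style ensemble, each cell's final score would then be the sum of the FR values the individual models assign to it.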
Stucki, G; Kostanjsek, N; Ustün, B; Cieza, A
If we aim towards a comprehensive understanding of human functioning and the development of comprehensive programs to optimize the functioning of individuals and populations, we need to develop suitable measures. The approval of the International Classification of Functioning, Disability and Health (ICF) in 2001 by the 54th World Health Assembly as the first universally shared model and classification of functioning, disability and health therefore marks an important step in the development of measurement instruments and ultimately in our understanding of functioning, disability and health. The acceptance and use of the ICF as a reference framework and classification has been facilitated by its development in a worldwide, comprehensive consensus process and by the increasing evidence regarding its validity. However, the broad acceptance and use of the ICF as a reference framework and classification will also depend on the resolution of conceptual and methodological challenges relevant to the classification and measurement of functioning. This paper therefore first describes how the ICF categories can serve as building blocks for the measurement of functioning, and then the current state of the development of ICF-based practical tools and international standards such as the ICF Core Sets. Finally, it illustrates how to map the world of measures to the ICF and vice versa, and the methodological principles relevant to the transformation of information obtained with a clinical test or a patient-oriented instrument to the ICF, as well as the development of ICF-based clinical and self-reported measurement instruments.
Gu, Chengwei; Wu, Ming; Zhang, Chuang
Sentence classification is one of the significant issues in Natural Language Processing (NLP), and feature extraction is often regarded as its key point. Traditional machine learning approaches, such as the naive Bayes model, cannot take high-level features into consideration. Neural networks for sentence classification can make use of contextual information to achieve better results. In this paper, we focus on classifying Chinese sentences and propose a novel Convolutional Neural Network (CNN) architecture for Chinese sentence classification. In particular, while most previous methods use a softmax classifier for prediction, we embed a linear support vector machine in place of softmax in the deep neural network model, minimizing a margin-based loss to obtain a better result, and we use tanh as the activation function instead of ReLU. The CNN model improves the results of Chinese sentence classification tasks. Experimental results on a Chinese news title database validate the effectiveness of our model.
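A margin-based loss of the kind that replaces softmax cross-entropy can be sketched as a multiclass hinge loss; this is one common formulation and is an assumption about, not a quote of, the paper's exact loss.

```python
def multiclass_hinge_loss(scores, correct, margin=1.0):
    """Margin-based (SVM-style) loss: every wrong class whose score
    comes within `margin` of the correct class's score contributes.
    `scores` are the network's raw class scores for one sentence."""
    s_y = scores[correct]
    return sum(max(0.0, margin + s - s_y)
               for k, s in enumerate(scores) if k != correct)
```

Unlike softmax cross-entropy, this loss is exactly zero once every wrong class trails the correct one by the full margin, so well-separated sentences stop contributing gradients.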
Full Text Available The authors propose to implement conditional non-linear optimal perturbation related to model parameters (CNOP-P) through an ensemble-based approach. The approach was first used in our earlier study and is improved here to be suitable for calculating CNOP-P. Idealised experiments using the Lorenz-63 model are conducted to evaluate the performance of the improved ensemble-based approach. The results show that the maximum prediction error after optimisation is many times larger than the initial-guess prediction error, and is extremely close to, or greater than, the maximum value of the exhaustive attack method (a million random samples). The calculation of CNOP-P by the ensemble-based approach is capable of maintaining a high accuracy over a long prediction time under different constraints and initial conditions. Further, the CNOP-P obtained by the approach is applied to sensitivity analysis of the Lorenz-63 model. The sensitivity analysis indicates that when the prediction time is set to 0.2 time units, the Lorenz-63 model becomes extremely insensitive to one parameter, which leaves the other two parameters to affect the uncertainty of the model. Finally, a series of parameter estimation experiments is performed to verify the sensitivity analysis. It is found that when the three parameters are estimated simultaneously, the insensitive parameter is estimated much worse, but the Lorenz-63 model can still generate a very good simulation thanks to the relatively accurate values of the other two parameters. When only the two sensitive parameters are estimated simultaneously and the insensitive parameter is left non-optimised, the outcome is better than when all three parameters are estimated simultaneously. With the increase of prediction time and observations, however, the model's sensitivity to the insensitive parameter increases accordingly and the insensitive parameter can also be estimated successfully.
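The "exhaustive attack" baseline mentioned above (random sampling of parameter perturbations to maximize prediction error) can be sketched for Lorenz-63 as follows; this is the random-sampling baseline only, not the ensemble-based CNOP-P algorithm itself, and the step size, horizon and constraint radius are illustrative assumptions.

```python
import random

SIGMA, RHO, BETA = 10.0, 28.0, 8.0 / 3.0  # default Lorenz-63 parameters

def lorenz63(state, sigma, rho, beta):
    x, y, z = state
    return (sigma * (y - x), x * (rho - z) - y, x * y - beta * z)

def rk4(state, params, dt, steps):
    """Integrate Lorenz-63 with classical 4th-order Runge-Kutta."""
    for _ in range(steps):
        k1 = lorenz63(state, *params)
        k2 = lorenz63(tuple(s + 0.5 * dt * k for s, k in zip(state, k1)), *params)
        k3 = lorenz63(tuple(s + 0.5 * dt * k for s, k in zip(state, k2)), *params)
        k4 = lorenz63(tuple(s + dt * k for s, k in zip(state, k3)), *params)
        state = tuple(s + dt / 6.0 * (a + 2 * b + 2 * c + d)
                      for s, a, b, c, d in zip(state, k1, k2, k3, k4))
    return state

def prediction_error(pert, x0=(1.0, 1.0, 1.0), dt=0.01, steps=20):
    """Forecast error induced by a parameter perturbation `pert`,
    relative to the run with the default parameters."""
    base = rk4(x0, (SIGMA, RHO, BETA), dt, steps)
    run = rk4(x0, (SIGMA + pert[0], RHO + pert[1], BETA + pert[2]), dt, steps)
    return sum((a - b) ** 2 for a, b in zip(base, run)) ** 0.5

def random_attack(n, radius, seed=0):
    """Random-sampling search: keep the in-constraint perturbation
    that maximizes the prediction error."""
    rng = random.Random(seed)
    best, best_err = (0.0, 0.0, 0.0), 0.0
    for _ in range(n):
        pert = tuple(rng.uniform(-radius, radius) for _ in range(3))
        err = prediction_error(pert)
        if err > best_err:
            best, best_err = pert, err
    return best, best_err
```

CNOP-P is then the perturbation this search approximates: the in-constraint parameter perturbation with the largest prediction error at the chosen horizon.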
El Gharamti, Mohamad
The ensemble Kalman filter (EnKF) recursively integrates field data into simulation models to obtain a better characterization of the model’s state and parameters. These are generally estimated following a state-parameters joint augmentation strategy. In this study, we introduce a new smoothing-based joint EnKF scheme, in which we introduce a one-step-ahead smoothing of the state before updating the parameters. Numerical experiments are performed with a two-dimensional synthetic subsurface contaminant transport model. The improved performance of the proposed joint EnKF scheme compared to the standard joint EnKF compensates for the modest increase in the computational cost.
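The EnKF analysis step underlying the scheme above can be sketched for a scalar observation; this is the standard stochastic (perturbed-observation) update, not the paper's one-step-ahead smoothing variant, and the observation operator `H` here is an assumed example.

```python
import random

def enkf_update(ensemble, obs, obs_var, H=lambda x: x[0]):
    """One stochastic EnKF analysis step for a scalar observation.
    ensemble: list of state vectors (lists); H maps a state to the
    observed quantity. Each member is pulled toward a perturbed copy
    of the observation by the sample Kalman gain."""
    n, d = len(ensemble), len(ensemble[0])
    hx = [H(x) for x in ensemble]
    hbar = sum(hx) / n
    xbar = [sum(x[i] for x in ensemble) / n for i in range(d)]
    # sample covariance between each state component and the predicted obs
    cov_xh = [sum((x[i] - xbar[i]) * (h - hbar)
                  for x, h in zip(ensemble, hx)) / (n - 1) for i in range(d)]
    var_h = sum((h - hbar) ** 2 for h in hx) / (n - 1)
    gain = [c / (var_h + obs_var) for c in cov_xh]
    return [[xi + g * (obs + random.gauss(0.0, obs_var ** 0.5) - h)
             for xi, g in zip(x, gain)]
            for x, h in zip(ensemble, hx)]
```

In a joint state-parameter setup, the state vector would be augmented with the parameters so that the same gain updates both; the smoothing variant above reorders when the state innovation is applied relative to the parameter update.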
Devi Singh
Jan 7, 2015 ... The present communication is aimed at finding the determinants of molecular marker based classification of rice (Oryza sativa L.) germplasm using the available data from an experiment conducted for the development of molecular fingerprints of diverse varieties of Basmati and non-Basmati rice adapted to.
Hogenboom, Alexander; Frasincar, Flavius; de Jong, Franciska M.G.; Kaymak, Uzay
The exploitation of structural aspects of content is becoming increasingly popular in rule-based polarity classification systems. Such systems typically weight the sentiment conveyed by text segments in accordance with these segments' roles in the structure of a text, as identified by deep
W.H.L.M. Pijls (Wim); R. Potharst (Rob)
In this technical report, two new algorithms based upon frequent patterns are proposed: one is a classification method, and the other is an algorithm for target group selection. In both algorithms, first of all, the collection of frequent patterns in the training set is
Li, N.; Pfeifer, N.; Liu, C.
The common statistical methods for supervised classification usually require a large amount of training data to achieve reasonable results, which is time-consuming and inefficient. This paper proposes a tensor sparse representation classification (SRC) method for airborne LiDAR points. The LiDAR points are represented as tensors to keep their attributes in the spatial space. Only a small amount of training data is then used for dictionary learning, and the sparse tensor is calculated based on the tensor OMP algorithm. The point label is determined by the minimal reconstruction residuals. Experiments are carried out on real LiDAR points, and the results show that objects can be successfully distinguished by this algorithm.
Rodolfo de Figueiredo Dalvi
Full Text Available Introduction: This paper presents a complete approach for the automatic classification of heartbeats to assist experts in the diagnosis of typical arrhythmias, such as right bundle branch block, left bundle branch block, premature ventricular beats, premature atrial beats and paced beats. Methods: A pre-processing step was performed on the electrocardiograms (ECGs) for baseline removal. Next, a QRS complex detection algorithm was implemented to detect the heartbeats, which contain the primary information that is employed in the classification approach. Then, ECG segmentation was performed, by which a set of features based on the RR interval and the beat waveform morphology were extracted from the ECG signal. The size of the feature vector was reduced by principal component analysis. Finally, the reduced feature vector was employed as the input to an artificial neural network. Results: Our approach was tested on the Massachusetts Institute of Technology arrhythmia database. The classification performance on a test set of 18 ECG records of 30 min each achieved an accuracy of 96.97%, a sensitivity of 95.05%, a specificity of 90.88%, a positive predictive value of 95.11%, and a negative predictive value of 92.7%. Conclusion: The proposed approach achieved high accuracy for classifying ECG heartbeats and could be used to assist cardiologists in telecardiology services. The main contribution of our classification strategy is in the feature selection step, which reduced classification complexity without major changes in performance.
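RR-interval features of the kind described can be sketched from detected R-peak times; the specific feature set here (pre-RR, post-RR, and a ratio against a local average of up to ten preceding intervals) is a common illustrative choice, not necessarily the paper's exact vector.

```python
def rr_features(r_peaks, i):
    """RR-interval features for the beat at index i (0 < i < len-1):
    the preceding and following RR intervals plus the ratio of the
    pre-RR to a local average of up to ten preceding intervals."""
    pre_rr = r_peaks[i] - r_peaks[i - 1]
    post_rr = r_peaks[i + 1] - r_peaks[i]
    lo = max(1, i - 9)
    local = [r_peaks[k] - r_peaks[k - 1] for k in range(lo, i + 1)]
    local_avg = sum(local) / len(local)
    return {"pre_rr": pre_rr, "post_rr": post_rr,
            "ratio": pre_rr / local_avg}
```

A premature beat shows up as a pre-RR markedly shorter than the local average (ratio well below 1), which is exactly the kind of timing cue the classifier combines with waveform morphology.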
Jochumsen, Lars Wurtz
The topic of this thesis is target classification of radar tracks from a 2D mechanically scanning coastal surveillance radar. The measurements provided by the radar are position data, and therefore the classification is mainly based on kinematic data deduced from the position. The target classes used in this work are those normal for coastal surveillance, e.g. ships, helicopters, birds, etc. The classifier must be recursive, as not all data of a track are present at any given moment. If all data were available, it would be too late to classify the track, as the track would have been terminated. Therefore, an update of the classification results must be made for each measurement of the target. The data for this work were collected throughout the PhD, both from radars and from other sensors such as GPS.
Gavrilovic, Zoran; Stefanovic, Milutin; Milovanovic, Irina; Cotric, Jelena; Milojevic, Mileta
A complex methodology for torrents and erosion and the associated calculations, the 'Erosion Potential Method', was developed during the second half of the twentieth century in Serbia. One of the modules of this complex method is focused on torrent classification. The module enables the identification of hydrographic, climate and erosion characteristics. The method makes it possible for each torrent, regardless of its magnitude, to be simply and recognizably described by the 'Formula of torrentiality'. The above torrent classification is the basis on which a set of optimisation calculations is developed for the required scope of erosion-control works and measures, the application of which enables the management of significantly larger erosion and torrential regions compared to the previous period. This paper presents the procedure and the method of torrent classification.
Full Text Available Shape is not only an important indicator for assessing the grade of an apple, but also an important factor in its value. In order to improve apple shape classification accuracy, an approach for apple shape sorting based on wavelet moments is proposed. The image is first normalized using its regular moments to obtain scale and translation invariance; rotation-invariant wavelet moment features are then extracted from the scale- and translation-normalized images, and cluster analysis is used to complete the shape classification. This method performs better than traditional approaches such as Fourier descriptors and Zernike moments because wavelet moments provide both a time-domain and a frequency-domain window, which was verified by experiments. With our method, the classification accuracy for normal fruit shape, mild deformity and severe deformity is 86.21%, 85.82% and 90.81%, respectively.
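The moment-based normalization step can be sketched for a binary shape image: raw moments give the centroid (translation) and the area m00 (scale), and the shape is mapped to a canonical frame before rotation-invariant wavelet moments are computed. The target area of 1 is an arbitrary illustrative choice.

```python
def raw_moment(img, p, q):
    """Raw image moment m_pq = sum over pixels of x^p * y^q * I(x, y)."""
    return sum((x ** p) * (y ** q) * v
               for y, row in enumerate(img)
               for x, v in enumerate(row))

def normalize_shape(img):
    """Translate the shape so its centroid sits at the origin and
    scale it so its area (m00) equals 1, yielding translation and
    scale invariance before rotation-invariant moments are computed."""
    m00 = raw_moment(img, 0, 0)
    cx = raw_moment(img, 1, 0) / m00   # centroid x = m10 / m00
    cy = raw_moment(img, 0, 1) / m00   # centroid y = m01 / m00
    s = (1.0 / m00) ** 0.5             # scale factor fixing the area
    return sorted(((x - cx) * s, (y - cy) * s)
                  for y, row in enumerate(img)
                  for x, v in enumerate(row) if v)
```

Any translated or uniformly rescaled copy of the same shape maps to the same normalized point set, which is exactly the invariance the classification pipeline relies on.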
Su, Hongjun; Sheng, Yehua; Du, Peijun; Chen, Chen; Liu, Kui
A novel approach using volumetric texture and reduced spectral features is presented for hyperspectral image classification. In this approach, the volumetric textural features were extracted by volumetric gray-level co-occurrence matrices (VGLCM). The spectral features were extracted by minimum estimated abundance covariance (MEAC) and linear prediction (LP)-based band selection, and by a semi-supervised k-means (SKM) clustering method with deletion of the worst cluster (SKMd) band-clustering algorithm. Moreover, four feature combination schemes were designed for hyperspectral image classification using spectral and textural features. It has been proven that the proposed method using VGLCM outperforms the gray-level co-occurrence matrices (GLCM) method, and the experimental results indicate that combining spectral information with volumetric textural features leads to improved classification performance in hyperspectral imagery.
Full Text Available Focused on the issue that conventional remote sensing image classification methods have run into an accuracy bottleneck, a new remote sensing image classification method inspired by deep learning is proposed, based on the Stacked Denoising Autoencoder. First, the deep network model is built from stacked layers of denoising autoencoders. Then, with noised input, an unsupervised greedy layer-wise training algorithm is used to train each layer in turn for more robust expression; characteristics are then refined by supervised learning with a Back Propagation (BP) neural network, and the whole network is fine-tuned by error back propagation. Finally, Gaofen-1 satellite (GF-1) remote sensing data are used for evaluation; the total accuracy and kappa accuracy reach 95.7% and 0.955, respectively, which are higher than those of the Support Vector Machine and the BP neural network. The experimental results show that the proposed method can effectively improve the accuracy of remote sensing image classification.
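The reported total accuracy and kappa accuracy are standard confusion-matrix statistics, and can be computed as below. The 2-class matrix here is made up for illustration, not the paper's GF-1 results:

```python
def accuracy_and_kappa(confusion):
    """Overall accuracy and Cohen's kappa from a square confusion matrix
    (rows = reference classes, columns = predicted classes). Kappa corrects
    the observed agreement for agreement expected by chance."""
    total = sum(sum(row) for row in confusion)
    observed = sum(confusion[i][i] for i in range(len(confusion))) / total
    # chance agreement: sum over classes of row-marginal * column-marginal
    expected = sum(
        (sum(confusion[i]) / total) * (sum(r[i] for r in confusion) / total)
        for i in range(len(confusion))
    )
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa

conf = [[45, 5], [10, 40]]   # invented 2-class example
acc, kap = accuracy_and_kappa(conf)
```

For this toy matrix the overall accuracy is 0.85 and kappa is 0.7; the paper's 95.7% / 0.955 pair would come from its own (multi-class) confusion matrix.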
Ciompi, Francesco; de Hoop, Bartjan; van Riel, Sarah J; Chung, Kaman; Scholten, Ernst Th; Oudkerk, Matthijs; de Jong, Pim A; Prokop, Mathias; van Ginneken, Bram
In this paper, we tackle the problem of automatic classification of pulmonary peri-fissural nodules (PFNs). The classification problem is formulated as a machine learning approach, where detected nodule candidates are classified as PFNs or non-PFNs. Supervised learning is used, where a classifier is
Teo, Jason; Hou, Chew Lin; Mountstephens, James
Electroencephalogram (EEG)-based emotion classification is rapidly becoming one of the most intensely studied areas of brain-computer interfacing (BCI). The ability to passively yet accurately correlate brainwaves with our immediate emotions opens up truly meaningful and previously unattainable human-computer interactions such as in forensic neuroscience, rehabilitative medicine, affective entertainment and neuro-marketing. One particularly useful yet rarely explored area of EEG-based emotion classification is preference recognition, which is simply the detection of like versus dislike. Within the limited investigations into preference classification, all reported studies were based on musically-induced stimuli except for a single study which used 2D images. The main objective of this study is to apply deep learning, which has been shown to produce state-of-the-art results in diverse hard problems such as computer vision, natural language processing and audio recognition, to 3D object preference classification over a larger group of test subjects. A cohort of 16 users was shown 60 bracelet-like objects as rotating visual stimuli on a computer display while their preferences and EEGs were recorded. After training a variety of machine learning approaches, which included deep neural networks, we then attempted to classify the users' preferences for the 3D visual stimuli based on their EEGs. Here, we show that deep learning outperforms a variety of other machine learning classifiers for this EEG-based preference classification task, particularly on a highly challenging dataset with large inter- and intra-subject variability.
Achieng, K. O.; Zhu, J.
There are a number of North American Regional Climate Change Assessment Program (NARCCAP) climate models that have been used to project surface runoff in the mid-21st century. Statistical model selection techniques are often used to select the model that best fits the data; however, model selection techniques often lead to different conclusions. In this study, ten models are averaged in a Bayesian paradigm to project runoff. Bayesian Model Averaging (BMA) is used to project runoff and to identify the effect of model uncertainty on future runoff projections. Baseflow separation using a two-parameter recursive digital filter, also called the Eckhardt filter, is used to separate USGS streamflow (total runoff) into two components: baseflow and surface runoff. We use this surface runoff as the a priori runoff when conducting BMA of runoff simulated from the ten RCM models. The primary objective of this study is to evaluate how well RCM multi-model ensembles simulate surface runoff in a Bayesian framework. Specifically, we investigate and discuss the following questions: How well does the ensemble of ten RCM models jointly simulate surface runoff when averaged using BMA, given a priori surface runoff? What are the effects of model uncertainty on surface runoff simulation?
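The BMA combination step can be sketched in a few lines: posterior model weights proportional to each model's likelihood against the a priori runoff (a uniform model prior is assumed here), then a weight-averaged projection whose between-model spread expresses model uncertainty. The projections and likelihoods below are invented numbers:

```python
def bma_combine(predictions, likelihoods):
    """Bayesian model averaging sketch: normalize likelihoods into posterior
    model weights (uniform prior assumed), then form the weighted-mean
    projection. The between-model variance term is one common summary of
    structural (model-choice) uncertainty."""
    z = sum(likelihoods)
    weights = [l / z for l in likelihoods]
    mean = sum(w * p for w, p in zip(weights, predictions))
    var_between = sum(w * (p - mean) ** 2 for w, p in zip(weights, predictions))
    return weights, mean, var_between

runoff = [100.0, 120.0, 90.0]   # illustrative surface runoff projections (mm)
lik = [0.2, 0.5, 0.3]           # illustrative likelihoods vs. the a priori runoff
w, mean, var = bma_combine(runoff, lik)
```

The full BMA in the study would fit the weights (e.g. by EM against observed surface runoff) rather than take them as given, but the averaging itself has this form.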
Karakus, P.; Karabork, H.
Classification is the most important method to determine the type of crop contained in a region for agricultural planning. There are two types of classification: the first is pixel-based and the other is object-based. While pixel-based classification methods are based on the information in each pixel, object-based classification is based on objects or image objects formed by combining information from a set of similar pixels. A multispectral image contains a higher degree of spectral resolution than a panchromatic image, while a panchromatic image has a higher spatial resolution than a multispectral image. Pan sharpening is a process of merging high-spatial-resolution panchromatic and high-spectral-resolution multispectral imagery to create a single high-resolution color image. The aim of the study was to compare the potential classification accuracy provided by pan-sharpened imagery. In this study, a SPOT 5 image dated April 2013 was used; the 5 m panchromatic image and the 10 m multispectral image were pan sharpened. Four different classification methods were investigated: maximum likelihood, decision tree and support vector machine at the pixel level, and an object-based classification method. The SPOT 5 pan-sharpened image was used to classify sunflower and corn in a study site located in the Kadirli region of Osmaniye, Turkey. The effects of the pan-sharpened image on classification results were also examined. Accuracy assessment showed that the object-based classification resulted in better overall accuracy values than the others. The results indicate that these classification methods can be used for identifying sunflower and corn and estimating crop areas.
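The abstract does not name the fusion algorithm used; the Brovey transform below is one widely used pan-sharpening scheme and serves only as an illustrative stand-in, with invented pixel values:

```python
def brovey_pansharpen(pan, ms):
    """Brovey-transform pan sharpening sketch: each multispectral band is
    rescaled by the ratio of the panchromatic intensity to the band sum,
    injecting the pan image's spatial detail into the color bands."""
    out = []
    for p, bands in zip(pan, ms):
        s = sum(bands)
        out.append(tuple(p * b / s for b in bands))
    return out

pan = [150.0, 180.0]                            # high-resolution pan pixels
ms = [(30.0, 40.0, 50.0), (60.0, 80.0, 60.0)]   # co-registered RGB pixels
sharp = brovey_pansharpen(pan, ms)
```

Note that band ratios are preserved while total intensity follows the pan channel, which is why Brovey output keeps color relationships but can distort absolute radiometry.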
Full Text Available Packet classification is a ubiquitous and key building block for many critical network devices. However, it remains one of the main bottlenecks faced when designing fast network devices. In this paper, we propose a novel Group Based Search packet classification Algorithm (GBSA) that is scalable, fast, and efficient. GBSA consumes an average of 0.4 megabytes of memory for a 10 k rule set. The worst-case classification time per packet is 2 microseconds, and the preprocessing speed is 3 M rules/second on a Xeon processor operating at 3.4 GHz. When compared with other state-of-the-art classification techniques, the results showed that GBSA outperforms the competition with respect to speed, memory usage, and processing time. Moreover, GBSA is amenable to implementation in hardware. Three different hardware implementations are also presented in this paper, including an Application Specific Instruction Set Processor (ASIP) implementation and two pure Register-Transfer Level (RTL) implementations based on Impulse-C and Handel-C flows, respectively. Speedups achieved with these hardware accelerators ranged from 9x to 18x compared with a pure software implementation running on a Xeon processor.
Frumuşanu, G.; Afteni, C.; Badea, N.; Epureanu, A.
EU Directive 92/75/EC established for the first time an energy consumption labelling scheme, further implemented by several other directives. As a consequence, many products today (e.g. home appliances, tyres, light bulbs, houses) carry an EU Energy Label when offered for sale or rent. Several energy consumption models of manufacturing equipment have also been developed. This paper proposes an energy-efficiency-based classification of the manufacturing workstation, aiming to characterize its energetic behaviour. The concept of energy efficiency of the manufacturing workstation is defined. On this basis, a classification methodology has been developed. It refers to specific criteria and their evaluation modalities, together with the definition and delimitation of energy efficiency classes. The position of an energy class is defined by the amount of energy needed by the workstation at the middle point of its operating domain, while its extension is determined by the value of the first coefficient from the Taylor series that approximates the dependence between the energy consumption and the chosen parameter of the working regime. The main domain of interest for this classification appears to be the optimization of manufacturing activity planning and programming. A case study regarding the classification of an actual lathe from the energy efficiency point of view, based on two different approaches (analytical and numerical), is also included.
Full Text Available The sparse representation based classifier (SRC) and its kernel version (KSRC) have been employed for hyperspectral image (HSI) classification. However, the state-of-the-art SRC often aims at extended surface objects with linear mixture in smooth scenes and assumes that the number of classes is given. Considering small targets with complex backgrounds, a sparse representation based binary hypothesis (SRBBH) model is established in this paper. In this model, a query pixel is represented in two ways: by the background dictionary and by the union dictionary. The background dictionary is composed of samples selected from the local dual concentric window centered at the query pixel. Thus, for each pixel the classification issue becomes an adaptive multiclass classification problem, where only the number of desired classes is required. Furthermore, the kernel method is employed to improve the interclass separability. In kernel space, the coding vector is obtained by using the kernel-based orthogonal matching pursuit (KOMP) algorithm. Then the query pixel can be labeled by the characteristics of the coding vectors. Instead of directly using the reconstruction residuals, the different impacts the background dictionary and union dictionary have on reconstruction are used for validation and classification. This enhances the discrimination and hence improves the performance.
Full Text Available Abstract Relating chemical features to bioactivities is critical in molecular design and is used extensively in the lead discovery and optimization process. A variety of techniques from statistics, data mining and machine learning have been applied to this process. In this study, we utilize a collection of methods, called associative classification mining (ACM), which are popular in the data mining community but so far have not been applied widely in cheminformatics. More specifically, classification based on predictive association rules (CPAR), classification based on multiple association rules (CMAR) and classification based on association rules (CBA) are employed on three datasets using various descriptor sets. Experimental evaluations on anti-tuberculosis (antiTB), mutagenicity and hERG (the human Ether-a-go-go-Related Gene) blocker datasets show that these three methods are computationally scalable and appropriate for high-speed mining. Additionally, they provide comparable accuracy and efficiency to the commonly used Bayesian and support vector machine (SVM) methods, and produce highly interpretable models.
Jeon, Young Jin; Chatellier, Ludovic; David, Laurent
A novel multi-frame particle image velocimetry (PIV) method, able to evaluate a fluid trajectory by means of an ensemble-averaged cross-correlation, is introduced. The method integrates the advantages of the state-of-the-art time-resolved PIV (TR-PIV) methods to further enhance both robustness and dynamic range. The fluid trajectory follows a polynomial model with a prescribed order. A set of polynomial coefficients, which maximizes the ensemble-averaged cross-correlation value across the frames, is regarded as the most appropriate solution. To achieve a convergence of the trajectory in terms of polynomial coefficients, an ensemble-averaged cross-correlation map is constructed by sampling cross-correlation values near the predictor trajectory with respect to an imposed change of each polynomial coefficient. A relation between the given change and the corresponding cross-correlation maps, which could be calculated from the ordinary cross-correlation, is derived. A disagreement between the computational domain and the corresponding physical domain is compensated by introducing the Jacobian matrix based on the image deformation scheme in accordance with the trajectory. An increased cost of the convergence calculation, associated with the nonlinearity of the fluid trajectory, is moderated by means of a V-cycle iteration. To validate the enhancements of the present method, quantitative comparisons with state-of-the-art TR-PIV methods, e.g., the adaptive temporal interval, the multi-frame pyramid correlation and the fluid trajectory correlation, were carried out by using synthetically generated particle image sequences. The performances of the tested methods are discussed in algorithmic terms. A high-rate TR-PIV experiment of a flow over an airfoil demonstrates the effectiveness of the present method. It is shown that the present method is capable of reducing random errors in both velocity and material acceleration while suppressing spurious temporal fluctuations due to measurement noise.
Full Text Available In this paper, considering the characteristics of Chinese text classification, the ICTCLAS (Chinese lexical analysis system of the Chinese Academy of Sciences) is used for document segmentation, stop words are filtered during data cleaning, and the information gain and document frequency feature selection algorithms are applied for document feature selection. On this basis, a text classifier is implemented based on the Naive Bayes algorithm, and experiments and analysis are carried out on the system using the Chinese corpus of Fudan University.
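The classifier stage described can be sketched as a minimal multinomial Naive Bayes with add-one smoothing. The toy pre-tokenized documents below stand in for ICTCLAS-segmented Chinese text, which this sketch omits:

```python
import math
from collections import Counter, defaultdict

class NaiveBayesText:
    """Minimal multinomial Naive Bayes text classifier with add-one
    (Laplace) smoothing over a bag-of-words representation."""

    def fit(self, docs, labels):
        self.vocab = set(w for d in docs for w in d)
        self.class_docs = Counter(labels)            # class priors from doc counts
        self.total_docs = len(docs)
        self.word_counts = defaultdict(Counter)      # per-class word frequencies
        for d, y in zip(docs, labels):
            self.word_counts[y].update(d)
        return self

    def predict(self, doc):
        best, best_lp = None, -math.inf
        for y, ndocs in self.class_docs.items():
            lp = math.log(ndocs / self.total_docs)   # log prior
            total = sum(self.word_counts[y].values())
            for w in doc:                            # log likelihood, smoothed
                lp += math.log((self.word_counts[y][w] + 1) /
                               (total + len(self.vocab)))
            if lp > best_lp:
                best, best_lp = y, lp
        return best

docs = [["stock", "market", "rise"], ["match", "goal", "team"],
        ["market", "trade"], ["team", "win"]]
labels = ["finance", "sports", "finance", "sports"]
clf = NaiveBayesText().fit(docs, labels)
```

In the paper's pipeline, feature selection (information gain / document frequency) would shrink the vocabulary before this step.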
Triantafyllou, George N.
An application of an ensemble-based robust filter for data assimilation into an ecosystem model of the Cretan Sea is presented and discussed. The ecosystem model comprises two on-line coupled sub-models: the Princeton Ocean Model (POM) and the European Regional Seas Ecosystem Model (ERSEM). The filtering scheme is based on the Singular Evolutive Interpolated Kalman (SEIK) filter, which is implemented with a time-local H∞ filtering strategy to enhance robustness and performance during periods of strong ecosystem variability. Assimilation experiments in the Cretan Sea indicate that robustness can be achieved in the SEIK filter by introducing an adaptive inflation scheme for the modes of the filter error covariance matrix. Twin experiments are performed to evaluate the performance of the assimilation system and to study the benefits of using robust filtering in an ensemble filtering framework. Pseudo-observations of surface chlorophyll, extracted from a model reference run, were assimilated every two days. Simulation results suggest that the adaptive inflation scheme significantly improves the behavior of the SEIK filter during periods of strong ecosystem variability. © 2012 Elsevier B.V.
Mansoori, Eghbal G; Zolghadri, Mansoor J; Katebi, Seraj D
In this paper, we have proposed a fuzzy rule-based classifier for assigning amino acid sequences to different superfamilies of proteins. While the most popular methods for protein classification rely on sequence alignment, our approach is alignment-free and thus more human-readable. It accounts for the distribution of contiguous patterns of n amino acids (n-grams) in the sequences as features, like other alignment-independent methods. Our approach first extracts a wealth of features from a set of training sequences, then selects only the best of them using a proposed feature ranking method. Thereafter, using these features, a novel steady-state genetic algorithm for extracting fuzzy classification rules from data is used to generate a compact set of interpretable fuzzy rules. The generated rules are simple and human-understandable, so biologists can utilize them for classification purposes, or incorporate their expertise to interpret or even modify them. To evaluate the performance of our fuzzy rule-based classifier, we have compared it with the conventional non-fuzzy C4.5 algorithm, besides some other fuzzy classifiers. This comparative study is conducted by classifying the protein sequences of five superfamily classes downloaded from a public-domain database. The obtained results show that the generated fuzzy rules are more interpretable, with an acceptable improvement in classification accuracy.
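The n-gram feature extraction and a toy ranking step can be sketched as follows. The scoring rule here (absolute count difference between two classes) is an invented stand-in for the paper's proposed ranking method, and the short sequences are fabricated examples:

```python
from collections import Counter

def ngram_counts(seq, n=2):
    """Contiguous n-gram (here bigram) counts of an amino acid sequence:
    the alignment-free features fed to the fuzzy classifier."""
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))

def rank_features(class_a, class_b, n=2):
    """Toy feature ranking: score each n-gram by the absolute difference of
    its total count between two classes, highest-scoring first."""
    ca, cb = Counter(), Counter()
    for s in class_a:
        ca.update(ngram_counts(s, n))
    for s in class_b:
        cb.update(ngram_counts(s, n))
    grams = set(ca) | set(cb)
    return sorted(grams, key=lambda g: abs(ca[g] - cb[g]), reverse=True)

# fabricated fragments standing in for two protein superfamilies
globins = ["MVLSPADKT", "MVLSGEDKS"]
kinases = ["GEGTYGVVY", "GEGSFGKVY"]
ranked = rank_features(globins, kinases)
```

The top-ranked n-grams are the ones most unevenly distributed between the classes, i.e. the most discriminative bigram features for rule induction.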
Rao, Mukund; Rao, Suryaprakash; Masser, Ian; Kasturirangan, K.
Information extraction from satellite images is the process of delineating entities in the image which pertain to some feature on the earth; by associating an attribute with these entities, a classification of the image is obtained. Classification is a common technique to extract information from remote sensing data and, by and large, the common classification techniques mainly exploit the spectral characteristics of remote sensing images and attempt to detect patterns in spectral information to classify images. These are based on a per-pixel analysis of the spectral information; 'clustering' or 'grouping' of pixels is done to generate meaningful thematic information. Most classification techniques apply statistical pattern recognition to image spectral vectors to 'label' each pixel with appropriate class information from a set of training information. Segmentation, on the other hand, is not new, but it is still seldom used in image processing of remotely sensed data. Although there has been much development in the segmentation of grey-tone images in this field and in others, like robotic vision, there has been little progress in the segmentation of colour or multi-band imagery. Especially within the last two years, many new segmentation algorithms as well as applications were developed, but not all of them lead to qualitatively convincing results while being robust and operational. One reason is that the segmentation of an image into a given number of regions is a problem with a huge number of possible solutions. Newer algorithms based on a fractal approach could eventually revolutionize image processing of remotely sensed data. The paper looks at applying spatial concepts to image processing, paving the way to algorithmically formulating some more advanced aspects of cognition and inference. In GIS-based spatial analysis, vector-based tools have already been able to support advanced tasks generating new knowledge. By identifying objects (as segmentation results) from
Begüm Terzi, Merve; Arikan, Feza; Arikan, Orhan; Karatay, Secil
Ionosphere is an anisotropic, inhomogeneous, time-varying and spatio-temporally dispersive medium whose parameters can almost always be estimated only by indirect measurements. Geomagnetic, gravitational, solar or seismic activities cause variations of the ionosphere at various spatial and temporal scales. This complex spatio-temporal variability is challenging to identify due to the extensive scales in period, duration, amplitude and frequency of disturbances. Since geomagnetic and solar indices such as Disturbance storm time (Dst), F10.7 solar flux, Sun Spot Number (SSN), Auroral Electrojet (AE), Kp and W-index provide information about variability on a global scale, the identification and classification of regional disturbances poses a challenge. The main aim of this study is to identify the regional effects of global geomagnetic storms and classify them according to their risk levels. For this purpose, Total Electron Content (TEC) estimated from GPS receivers, which is one of the major parameters of the ionosphere, will be used to model the regional and local variability that differs from global activity, along with solar and geomagnetic indices. In this work, for the automated classification of regional disturbances, a classification technique based on a robust machine learning technique that has found widespread use, the Support Vector Machine (SVM), is proposed. SVM is a supervised learning model used for classification, with an associated learning algorithm that analyzes the data and recognizes patterns. In addition to performing linear classification, SVM can efficiently perform nonlinear classification by embedding data into higher-dimensional feature spaces. The performance of the developed classification technique is demonstrated for the midlatitude ionosphere over Anatolia using TEC estimates generated from the GPS data provided by the Turkish National Permanent GPS Network (TNPGN-Active) for the solar maximum year of 2011. As a result of implementing the developed classification
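The linear core of an SVM can be sketched with Pegasos-style sub-gradient training (hinge loss plus L2 regularization); this is a simplified linear stand-in for the study's SVM, with the bias folded in as a constant feature, and the tiny 2-feature "quiet vs. disturbed" dataset below is invented:

```python
def train_linear_svm(xs, ys, lam=0.01, epochs=500):
    """Pegasos-style sub-gradient descent for a linear SVM.
    Labels must be +1 / -1; each x should include a constant 1.0 bias feature."""
    w = [0.0] * len(xs[0])
    t = 0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            t += 1
            eta = 1.0 / (lam * t)                    # decreasing step size
            margin = y * sum(wi * xi for wi, xi in zip(w, x))
            w = [wi * (1.0 - eta * lam) for wi in w]  # shrink (regularization)
            if margin < 1.0:                          # hinge-loss sub-gradient
                w = [wi + eta * y * xi for wi, xi in zip(w, x)]
    return w

def predict(w, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else -1

# invented feature vectors (e.g. TEC deviation, index proxy, bias term)
xs = [[0.1, 0.2, 1.0], [0.2, 0.1, 1.0], [2.0, 2.2, 1.0], [2.1, 1.9, 1.0]]
ys = [-1, -1, 1, 1]
w = train_linear_svm(xs, ys)
```

The nonlinear classification mentioned in the abstract would replace the inner products here with a kernel function.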
Liu, Zhao; Zhou, Yan-Lian; Ju, Wei-Min; Gao, Ping
By using an ensemble Kalman filter (EnKF) to assimilate observed soil moisture data, the modified boreal ecosystem productivity simulator (BEPS) model was adopted to simulate the dynamics of soil moisture in winter wheat root zones at Xuzhou Agro-meteorological Station, Jiangsu Province of China, during the growth seasons of 2000-2004. After the assimilation of observed data, the determination coefficient, root mean square error, and average absolute error of simulated soil moisture were in the ranges of 0.626-0.943, 0.018-0.042, and 0.021-0.041, respectively, with the simulation precision improved significantly compared with that before assimilation, indicating the applicability of data assimilation in improving the simulation of soil moisture. The experimental results at a single point showed that the errors in the forcing data and observations, as well as the frequency and soil depth of the assimilation of observed data, all had obvious effects on the simulated soil moisture.
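The EnKF analysis step can be sketched for a scalar soil-moisture state: the Kalman gain is estimated from ensemble statistics, and each forecast member is nudged toward a perturbed observation. The ensemble values and observation below are invented; the real study applies this within the BEPS model state:

```python
import random

def enkf_update(ensemble, obs, obs_err_std):
    """Scalar ensemble Kalman filter analysis step with perturbed
    observations: gain = forecast variance / (forecast variance + obs
    variance), applied member by member."""
    n = len(ensemble)
    mean = sum(ensemble) / n
    var = sum((x - mean) ** 2 for x in ensemble) / (n - 1)
    gain = var / (var + obs_err_std ** 2)
    return [x + gain * (obs + random.gauss(0.0, obs_err_std) - x)
            for x in ensemble]

random.seed(1)
forecast = [0.18, 0.22, 0.25, 0.20, 0.15]   # modeled volumetric soil moisture
analysis = enkf_update(forecast, obs=0.30, obs_err_std=0.02)
```

Because the forecast spread here is much larger than the observation error, the gain is close to 1 and the analysis ensemble moves most of the way toward the observation.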
Huang, Yong; Wang, Kehong; Zhou, Zhilan; Zhou, Xiaoxiao; Fang, Jimi
The arc of gas metal arc welding (GMAW) contains abundant information about its stability and droplet transition, which can be effectively characterized by extracting the arc electrical signals. In this study, ensemble empirical mode decomposition (EEMD) was used to evaluate the stability of electrical current signals. The welding electrical signals were first decomposed by EEMD, and then transformed to a Hilbert-Huang spectrum and a marginal spectrum. The marginal spectrum is an approximate distribution of amplitude with frequency of signals, and can be described by a marginal index. Analysis of various welding process parameters showed that the marginal index of current signals increased when the welding process was more stable, and vice versa. Thus EEMD combined with the marginal index can effectively uncover the stability and droplet transition of GMAW.
Ding, Dong-Sheng; Zhou, Zhi-Yuan; Shi, Bao-Sen; Guo, Guang-Can
A quantum memory is a key component for quantum networks, which will enable the distribution of quantum information. Its successful development requires storage of single-photon light. Encoding photons with spatial shape through higher-dimensional states significantly increases their information-carrying capability and network capacity. However, constructing such quantum memories is challenging. Here we report the first experimental realization of a true single-photon-carrying orbital angular momentum stored via electromagnetically induced transparency in a cold atomic ensemble. Our experiments show that the non-classical pair correlation between trigger photon and retrieved photon is retained, and the spatial structure of input and retrieved photons exhibits strong similarity. More importantly, we demonstrate that single-photon coherence is preserved during storage. The ability to store spatial structure at the single-photon level opens the possibility for high-dimensional quantum memories.
Fan, Rui; Huang, Zhenyu; Wang, Shaobu; Diao, Ruisheng; Meng, Da
With the growing interest in the application of wind energy, the doubly fed induction generator (DFIG) plays an essential role in the industry nowadays. To deal with the increasing stochastic variations introduced by intermittent wind resources and responsive loads, dynamic state estimation (DSE) is introduced in power systems associated with DFIGs. However, sometimes this dynamic analysis cannot work well because the parameters of DFIGs are not accurate enough. To solve this problem, an ensemble Kalman filter (EnKF) method is proposed for the state estimation and parameter calibration tasks. In this paper, a DFIG is modeled and implemented with the EnKF method. Sensitivity analysis is demonstrated regarding the measurement noise, initial state errors and parameter errors. The results indicate that this EnKF method has a robust performance on the state estimation and parameter calibration of DFIGs.
Hatzopoulos, Stavros Dimitris
Transiently Evoked Otoacoustic Emissions (TEOAE) are signals produced by the cochlea upon stimulation by an acoustic click. Within the context of this dissertation, it was hypothesized that the relationship between the TEOAEs and the functional status of the OHCs provided an opportunity for designing a TEOAE-based clinical procedure that could be used to assess cochlear function. To understand the nature of the TEOAE signals in the time and frequency domains, several different analyses were performed. Using normative input-output (IO) curves, short-time FFT analyses and cochlear computer simulations, it was found that optimization of the hearing loss classification requires the use of a complete 20 ms TEOAE segment. It was also determined that the various 2-D filtering methods (median and averaging filtering masks, LP-FFT) used to enhance the TEOAE S/N offered minimal improvement (less than 6 dB per stimulus level); higher S/N improvements resulted in TEOAE sequences that were over-smoothed. The final classification algorithm was based on a statistical analysis of raw FFT data and, when applied to a sample set of clinically obtained TEOAE recordings (from 56 normal and 66 hearing-loss subjects), correctly identified 94.3% of the normal and 90% of the hearing-loss subjects at the 80 dB SPL stimulus level. To enhance the discrimination between the conductive and sensorineural populations, data from the 68 dB SPL stimulus level were used, which yielded a normal classification of 90.2%, a hearing-loss classification of 87.5% and a conductive-sensorineural classification of 87%. Among the hearing-loss populations, the best discrimination was obtained in the group with otosclerosis and the worst in the group with acute acoustic trauma.
Full Text Available Scientific customer value segmentation (CVS) is the base of efficient customer relationship management; customer credit scoring, fraud detection, and churn prediction all belong to CVS. In real CVS, the customer data usually include many missing values, which may greatly affect the performance of the CVS model. This study proposes a one-step dynamic classifier ensemble model for missing values (ODCEM). On the one hand, ODCEM integrates the preprocessing of missing values and the classification modeling into one step; on the other hand, it utilizes multiple-classifier ensemble technology in constructing the classification models. The empirical results on the credit scoring dataset “German” from UCI and the real customer churn prediction dataset “China churn” show that ODCEM outperforms four commonly used “two-step” models and the ensemble-based model LMF, and can provide better decision support for market managers.
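A classifier ensemble that tolerates missing values can be sketched with majority voting where each base classifier abstains when its feature is absent. This only loosely mirrors ODCEM's one-step design; the base classifiers, thresholds and the credit record below are all invented for illustration:

```python
def vote_ensemble(classifiers, x):
    """Multiple-classifier ensemble by majority vote; a base classifier
    returns None (abstains) when the feature it needs is missing, so no
    separate imputation step is required."""
    votes = [clf(x) for clf in classifiers]
    votes = [v for v in votes if v is not None]
    return max(set(votes), key=votes.count)

# toy base classifiers over a 3-feature credit record; thresholds invented
clf0 = lambda x: None if x[0] is None else ("bad" if x[0] > 0.5 else "good")
clf1 = lambda x: None if x[1] is None else ("bad" if x[1] > 3 else "good")
clf2 = lambda x: None if x[2] is None else ("good" if x[2] > 1000 else "bad")

record = [0.7, None, 500.0]   # second feature missing
decision = vote_ensemble([clf0, clf1, clf2], record)
```

A "dynamic" ensemble in the paper's sense would additionally reweight or reselect the base classifiers per query; abstention-plus-voting is the simplest static version of that idea.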
Nway Nway Kyaw Win
Full Text Available Abstract Power quality disturbances (PQDs) cause serious problems for the reliability, safety and economy of the power system network. In order to improve electric power quality, PQDs must be detected and classified according to the type of transient fault. A software analysis based on the wavelet transform with a multiresolution analysis (MRA) algorithm and feed-forward neural networks (a probabilistic neural network and a multilayer feed-forward neural network) is presented for the automatic classification of eight types of PQ signals: flicker, harmonics, sag, swell, impulse, fluctuation, notch and oscillatory transient. The wavelet family Db4 is chosen in this system to calculate the detailed energy distributions as input features for classification because it performs well in detecting and localizing various types of PQ disturbances. The classifiers identify the disturbance type according to the energy distribution. The results show that the PNN can analyze different power disturbance types efficiently, and that the PNN has better classification accuracy than the MLFF network.
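The detail-energy feature extraction can be sketched with a multi-level wavelet decomposition; the Haar wavelet below is substituted for Db4 (which needs longer filters) to keep the sketch self-contained, and the synthetic 50 Hz signal with an added spike is invented test data:

```python
import math

def haar_dwt(signal):
    """One level of an (unnormalized) Haar DWT: pairwise averages form the
    approximation, pairwise half-differences form the detail."""
    approx = [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    return approx, detail

def detail_energies(signal, levels=3):
    """Per-level detail energy distribution: the MRA feature vector fed to
    the neural classifiers for disturbance recognition."""
    energies = []
    a = list(signal)
    for _ in range(levels):
        a, d = haar_dwt(a)
        energies.append(sum(x * x for x in d))
    return energies

clean = [math.sin(2 * math.pi * 50 * t / 1600) for t in range(64)]
notched = clean[:]
notched[20] += 1.0   # a notch/impulse-like transient
```

A transient disturbance concentrates extra energy in the finest detail level, which is why the energy distribution separates impulse-like events from slow phenomena such as sag or flicker.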
Full Text Available Abstract Background Recent years have seen an explosion in the availability of data in the chemistry domain. With this information explosion, however, retrieving relevant results from the available information, and organising those results, become even harder problems. Computational processing is essential to filter and organise the available resources so as to better facilitate the work of scientists. Ontologies encode expert domain knowledge in a hierarchically organised machine-processable format. One such ontology for the chemical domain is ChEBI. ChEBI provides a classification of chemicals based on their structural features and a role or activity-based classification. An example of a structure-based class is 'pentacyclic compound' (compounds containing five-ring structures), while an example of a role-based class is 'analgesic', since many different chemicals can act as analgesics without sharing structural features. Structure-based classification in chemistry exploits elegant regularities and symmetries in the underlying chemical domain. As yet, there has been neither a systematic analysis of the types of structural classification in use in chemistry nor a comparison to the capabilities of available technologies. Results We analyze the different categories of structural classes in chemistry, presenting a list of patterns for features found in class definitions. We compare these patterns of class definition to tools which allow for automation of hierarchy construction within cheminformatics and within logic-based ontology technology, going into detail in the latter case with respect to the expressive capabilities of the Web Ontology Language and recent extensions for modelling structured objects. Finally we discuss the relationships and interactions between cheminformatics approaches and logic-based approaches. Conclusion Systems that perform intelligent reasoning tasks on chemistry data require a diverse set of underlying computational
Zulvia, Ferani E.; Kuo, R. J.
This paper proposes a classification algorithm based on a support vector machine (SVM) and a gradient evolution (GE) algorithm. The SVM algorithm has been widely used in classification; however, its results are significantly influenced by its parameters. Therefore, this paper aims to propose an improved SVM algorithm that can find the best SVM parameters automatically. The proposed algorithm employs a GE algorithm to determine the SVM parameters: the GE algorithm acts as a global optimizer, finding the best parameters to be used by the SVM. The proposed GE-SVM algorithm is verified on several benchmark datasets and compared with other metaheuristic-based SVM algorithms. The experimental results show that the proposed GE-SVM algorithm obtains better results than the other algorithms tested in this paper.
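The metaheuristic-as-global-optimizer idea can be sketched with a stand-in objective. The following is a minimal sketch, assuming a hypothetical toy error surface over (log10 C, log10 gamma) in place of the paper's cross-validated SVM accuracy, and a differential-evolution-style update in place of the exact gradient evolution operators:

```python
import random

def toy_cv_error(params):
    # Hypothetical stand-in for the cross-validated SVM error surface over
    # (log10 C, log10 gamma); the true objective would train an SVM per call.
    c, g = params
    return (c - 1.0) ** 2 + (g + 2.0) ** 2

def evolve(objective, bounds, pop_size=20, iters=100, seed=0):
    """Population-based global search (a differential-evolution-style
    stand-in for the paper's gradient evolution operators)."""
    rng = random.Random(seed)
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    for _ in range(iters):
        for i in range(pop_size):
            # Build a trial point from three other population members.
            a, b, c = rng.sample([p for j, p in enumerate(pop) if j != i], 3)
            trial = [x + 0.8 * (y - z) for x, y, z in zip(a, b, c)]
            trial = [min(max(t, lo), hi) for t, (lo, hi) in zip(trial, bounds)]
            if objective(trial) < objective(pop[i]):  # greedy selection
                pop[i] = trial
    return min(pop, key=objective)

best = evolve(toy_cv_error, bounds=[(-3.0, 3.0), (-5.0, 1.0)])
```

The tuned parameters would then be handed to the SVM trainer; here `best` simply converges to the minimum of the toy surface.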
Full Text Available Today's content-based video retrieval technologies are still far from meeting human requirements. A fundamental reason is the lack of a content representation that can bridge the gap between visual features and semantic concepts in video. In this paper, we propose a motion pattern descriptor, motion texture, that characterizes motion in a generic way. With this representation, we design a semantic classification scheme to effectively map video clips to semantic categories. Support vector machines (SVMs) are used as the classifiers. In addition, this scheme also significantly improves the performance of motion-based shot retrieval, owing to the comprehensiveness and effectiveness of the motion pattern descriptor and to the semantic classification capability, as shown by experimental evaluations.
Lo, Pechin Chien Pau; Sporring, Jon; Ashraf, Haseem
This paper presents a method for improving airway tree segmentation using vessel orientation information. We use the fact that an airway branch is always accompanied by an artery, with both structures having similar orientations. This work is based on a voxel classification airway segmentation method proposed previously. The probability of a voxel belonging to the airway, from the voxel classification method, is augmented with an orientation similarity measure as a criterion for region growing. The orientation similarity measure of a voxel indicates how similar the orientation of the surroundings of the voxel, estimated based on a tube model, is to that of a neighboring vessel. The proposed method is tested on 20 CT images from different subjects selected randomly from a lung cancer screening study. Lengths of the airway branches obtained with the proposed method are significantly…
ASI Series F, Computer and Systems Sciences, 163:446–456, 1999. [5] D. Hall and J. Llinas. An introduction to multisensor data fusion. Proceedings of… advantages of information fusion based on sparsity models for multimodal classification. Among several sparsity models, tree-structured sparsity provides… algorithm is proposed to solve the optimization problem, which is an efficient tool for feature-level fusion among either homogeneous or heterogeneous
Zhao, Wei; Xiao, Shixiao; Zhang, Baocan; Huang, Xiaojing; You, Rongyi
Electrocardiogram (ECG) signals are susceptible to disturbance by 50 Hz power line interference (PLI) during acquisition and conversion. This paper therefore proposes a novel PLI removal algorithm based on morphological component analysis (MCA) and ensemble empirical mode decomposition (EEMD). Firstly, according to the morphological differences in ECG waveform characteristics, the noisy ECG signal was decomposed into a mutated component, a smooth component and a residual component by MCA. Secondly, the intrinsic mode functions (IMFs) containing PLI were filtered out. The noise suppression rate (NSR) and the signal distortion ratio (SDR) were used to evaluate the de-noising algorithm. Finally, the ECG signals were reconstructed. Experimental comparison showed that the proposed algorithm filtered better than the improved Levkov algorithm: it not only effectively removed the PLI, but also yielded a smaller SDR value.
Amin, Hafeez Ullah; Mumtaz, Wajid; Subhani, Ahmad Rauf; Saad, Mohamad Naufal Mohamad; Malik, Aamir Saeed
Feature extraction is an important step in the process of electroencephalogram (EEG) signal classification. The authors propose a "pattern recognition" approach that discriminates EEG signals recorded during different cognitive conditions. Wavelet-based features such as multi-resolution decompositions into detail and approximation coefficients, as well as relative wavelet energy, were computed. Extracted relative wavelet energy features were normalized to zero mean and unit variance and then optimized using Fisher's discriminant ratio (FDR) and principal component analysis (PCA). A high-density (128-channel) EEG dataset was used to validate the proposed method on two classes of recordings: (1) EEG signals recorded during complex cognitive tasks using the Raven's Advanced Progressive Matrices (RAPM) test; (2) EEG signals recorded during a baseline task (eyes open). Classifiers such as K-nearest neighbors (KNN), support vector machine (SVM), multi-layer perceptron (MLP), and naïve Bayes (NB) were then employed. Outcomes yielded 99.11% accuracy via the SVM classifier for the approximation coefficients (A5) of low frequencies ranging from 0 to 3.90 Hz. Accuracy rates for the detail coefficients (D5), derived from the 3.90-7.81 Hz sub-band, were 98.57% and 98.39% for SVM and KNN, respectively. Accuracy rates for the MLP and NB classifiers were comparable, at 97.11-89.63% and 91.60-81.07% for the A5 and D5 coefficients, respectively. In addition, the proposed approach was also applied to a public dataset for classification of two cognitive tasks and achieved comparable classification results, i.e., 93.33% accuracy with KNN. The proposed scheme yielded significantly higher classification performance using machine learning classifiers compared to extant quantitative feature extraction methods. These results suggest that the proposed feature extraction method reliably classifies EEG signals recorded during cognitive tasks with a higher degree of accuracy.
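The relative wavelet energy feature described above can be computed from any multi-level DWT. The following is a minimal sketch using the Haar wavelet on a toy signal (the study used deeper five-level decompositions, A5/D5); since the transform is orthonormal, the per-band energies normalised by the total form a valid feature vector:

```python
def haar_dwt(signal):
    """One level of the orthonormal Haar DWT: (approximation, detail)."""
    approx = [(signal[i] + signal[i + 1]) / 2 ** 0.5
              for i in range(0, len(signal) - 1, 2)]
    detail = [(signal[i] - signal[i + 1]) / 2 ** 0.5
              for i in range(0, len(signal) - 1, 2)]
    return approx, detail

def relative_wavelet_energy(signal, levels=2):
    """Energy per detail band plus the final approximation band,
    normalised by total energy, as a feature vector."""
    energies, approx = [], list(signal)
    for _ in range(levels):
        approx, detail = haar_dwt(approx)
        energies.append(sum(d * d for d in detail))
    energies.append(sum(a * a for a in approx))
    total = sum(energies)
    return [e / total for e in energies]

feats = relative_wavelet_energy([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
```

The resulting vector (one entry per band, summing to 1) is what would then be normalised and fed to FDR/PCA and the classifiers.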
The ensemble square root filter (EnSRF) [1, 2, 3, 4] is a popular method for data assimilation in high dimensional systems (e.g., geophysics models). Essentially the EnSRF is a Monte Carlo implementation of the conventional Kalman filter (KF) [5, 6]. It differs from the KF mainly at the prediction steps, where it is ensembles, rather than the means and covariance matrices, of the system state that are propagated forward. In doing this, the EnSRF is computationally more efficient than the KF, since propagating a covariance matrix forward in high dimensional systems is prohibitively expensive. In addition, the EnSRF is also very convenient to implement. By propagating the ensembles of the system state, the EnSRF can be applied directly to nonlinear systems without any change relative to the assimilation procedures for linear systems. However, by adopting the Monte Carlo method, the EnSRF also incurs certain sampling errors. One way to alleviate this problem is to introduce certain symmetry to the ensembles, which can reduce the sampling errors and spurious modes in the evaluation of the means and covariances of the ensembles. In this contribution, we present two methods to produce symmetric ensembles. One is based on the unscented transform [8, 9], which leads to the unscented Kalman filter (UKF) [8, 9] and its variant, the ensemble unscented Kalman filter (EnUKF). The other is based on Stirling's interpolation formula (SIF), which results in the divided difference filter (DDF). Here we propose a simplified divided difference filter (sDDF) in the context of ensemble filtering. The similarity and difference between the sDDF and the EnUKF will be discussed. Numerical experiments will also be conducted to investigate the performance of the sDDF and the EnUKF, and to compare them to a well-established EnSRF, the ensemble transform Kalman filter (ETKF).
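The benefit of a symmetric ensemble can be illustrated with unscented-transform sigma points, which reproduce a prescribed mean and covariance exactly rather than up to Monte Carlo error. The following is a sketch for the simple case of a diagonal covariance (not the full EnUKF):

```python
def sigma_point_ensemble(mean, var, kappa=0.0):
    """Symmetric sigma-point ensemble (unscented transform) for a state
    with independent components (diagonal covariance `var`)."""
    n = len(mean)
    scale = (n + kappa) ** 0.5
    points = [list(mean)]                 # centre point
    for i in range(n):
        step = scale * var[i] ** 0.5      # symmetric +/- perturbations
        plus, minus = list(mean), list(mean)
        plus[i] += step
        minus[i] -= step
        points.extend([plus, minus])
    w0 = kappa / (n + kappa)
    wi = 1.0 / (2.0 * (n + kappa))
    return points, [w0] + [wi] * (2 * n)

points, w = sigma_point_ensemble([1.0, -2.0], [4.0, 0.25], kappa=1.0)
# Weighted sample moments reproduce the prescribed mean and variance exactly:
emp_mean = [sum(wk * p[i] for wk, p in zip(w, points)) for i in range(2)]
emp_var = [sum(wk * (p[i] - emp_mean[i]) ** 2 for wk, p in zip(w, points))
           for i in range(2)]
```

A random ensemble of the same size would match these moments only approximately; the symmetry is what removes the sampling error.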
Zhang, Li; Ai, Haixin; Chen, Wen; Yin, Zimo; Hu, Huan; Zhu, Junfeng; Zhao, Jian; Zhao, Qi; Liu, Hongsheng
Carcinogenicity refers to a highly toxic end point of certain chemicals, and has become an important issue in the drug development process. In this study, three novel ensemble classification models, namely Ensemble SVM, Ensemble RF, and Ensemble XGBoost, were developed to predict carcinogenicity of chemicals using seven types of molecular fingerprints and three machine learning methods based on a dataset containing 1003 diverse compounds with rat carcinogenicity. Among these three models, Ensemble XGBoost is found to be the best, giving an average accuracy of 70.1 ± 2.9%, sensitivity of 67.0 ± 5.0%, and specificity of 73.1 ± 4.4% in five-fold cross-validation and an accuracy of 70.0%, sensitivity of 65.2%, and specificity of 76.5% in external validation. In comparison with some recent methods, the ensemble models outperform some machine learning-based approaches and yield equal accuracy and higher specificity but lower sensitivity than rule-based expert systems. It is also found that the ensemble models could be further improved if more data were available. As an application, the ensemble models are employed to discover potential carcinogens in the DrugBank database. The results indicate that the proposed models are helpful in predicting the carcinogenicity of chemicals. A web server called CarcinoPred-EL has been built for these models ( http://ccsipb.lnu.edu.cn/toxicity/CarcinoPred-EL/ ).
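The combination step of such an ensemble can be sketched as soft voting: average the base models' predicted class probabilities and threshold. The probabilities below are hypothetical, not taken from the study:

```python
def ensemble_predict(prob_lists, threshold=0.5):
    """Soft voting: average predicted class-1 probabilities from several
    base models and threshold to get the ensemble label."""
    n_models = len(prob_lists)
    n_samples = len(prob_lists[0])
    avg = [sum(p[i] for p in prob_lists) / n_models for i in range(n_samples)]
    return [1 if a >= threshold else 0 for a in avg], avg

# Three hypothetical base models scoring four compounds:
labels, avg = ensemble_predict([
    [0.9, 0.2, 0.6, 0.4],
    [0.8, 0.1, 0.4, 0.6],
    [0.7, 0.3, 0.2, 0.2],
])
```

In the study the base models would be fingerprint-specific classifiers; any set of probability outputs can be combined this way.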
Aalstad, Kristoffer; Westermann, Sebastian; Vikhamar Schuler, Thomas; Boike, Julia; Bertino, Laurent
With its high albedo, low thermal conductivity and large water storing capacity, snow strongly modulates the surface energy and water balance, which makes it a critical factor in mid- to high-latitude and mountain environments. However, estimating the snow water equivalent (SWE) is challenging in remote-sensing applications even at medium spatial resolutions of 1 km. We present an ensemble-based data assimilation framework that estimates the peak subgrid SWE distribution (SSD) at the 1 km scale by assimilating fractional snow-covered area (fSCA) satellite retrievals in a simple snow model forced by downscaled reanalysis data. The basic idea is to relate the timing of the snow cover depletion (accessible from satellite products) to the peak SSD. Peak subgrid SWE is assumed to be lognormally distributed, which can be translated to a modeled time series of fSCA through the snow model. Assimilation of satellite-derived fSCA facilitates the estimation of the peak SSD, while taking into account uncertainties in both the model and the assimilated data sets. As an extension to previous studies, our method makes use of the novel (to snow data assimilation) ensemble smoother with multiple data assimilation (ES-MDA) scheme combined with analytical Gaussian anamorphosis to assimilate time series of Moderate Resolution Imaging Spectroradiometer (MODIS) and Sentinel-2 fSCA retrievals. The scheme is applied to Arctic sites near Ny-Ålesund (79° N, Svalbard, Norway) where field measurements of fSCA and SWE distributions are available. The method is able to successfully recover accurate estimates of peak SSD on most of the occasions considered. Through the ES-MDA assimilation, the root-mean-square error (RMSE) for the fSCA, peak mean SWE and peak subgrid coefficient of variation is improved by around 75, 60 and 20 %, respectively, when compared to the prior, yielding RMSEs of 0.01, 0.09 m water equivalent (w.e.) and 0.13, respectively. The ES-MDA either outperforms or at least
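The ES-MDA update itself is compact. The following is a scalar sketch with a directly observed state (no snow model or anamorphosis), using the standard choice of inflation coefficients alpha = n_assim so that their inverses sum to one; the prior and observation values are illustrative:

```python
import random

def es_mda(ensemble, obs, obs_err_var, n_assim=4, seed=0):
    """ES-MDA for a scalar, directly observed state: the same observation
    is assimilated n_assim times with its error variance inflated by
    alpha = n_assim."""
    rng = random.Random(seed)
    alpha = float(n_assim)
    x = list(ensemble)
    for _ in range(n_assim):
        m = sum(x) / len(x)
        cxx = sum((xi - m) ** 2 for xi in x) / (len(x) - 1)
        gain = cxx / (cxx + alpha * obs_err_var)       # Kalman gain
        # Perturbed-observation update, noise scaled by sqrt(alpha):
        x = [xi + gain * (obs + rng.gauss(0.0, (alpha * obs_err_var) ** 0.5) - xi)
             for xi in x]
    return x

rng0 = random.Random(1)
prior = [rng0.gauss(0.0, 1.0) for _ in range(200)]    # N(0, 1) prior ensemble
posterior = es_mda(prior, obs=2.0, obs_err_var=0.25)
```

For this linear-Gaussian toy case the multiple small updates are equivalent (up to sampling noise) to a single Kalman update, pulling the ensemble mean toward the observation and shrinking its spread.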
Full Text Available Traditionally, tumors are classified by histopathological criteria, i.e., based on their specific morphological appearances. Consequently, current therapeutic decisions in oncology are strongly influenced by histology rather than underlying molecular or genomic aberrations. The increase of information on molecular changes however, enabled by the Human Genome Project and the International Cancer Genome Consortium as well as the manifold advances in molecular biology and high-throughput sequencing techniques, inaugurated the integration of genomic information into disease classification. Furthermore, in some cases it became evident that former classifications needed major revision and adaption. Such adaptations are often required by understanding the pathogenesis of a disease from a specific molecular alteration, using this molecular driver for targeted and highly effective therapies. Altogether, reclassifications should lead to higher information content of the underlying diagnoses, reflecting their molecular pathogenesis and resulting in optimized and individual therapeutic decisions. The objective of this article is to summarize some particularly important examples of genome-based classification approaches and associated therapeutic concepts. In addition to reviewing disease specific markers, we focus on potentially therapeutic or predictive markers and the relevance of molecular diagnostics in disease monitoring.
Hu, G. C.; Zhao, Q. H.
Enormous scientific and technical developments have been carried out over the decades to further improve remote sensing, particularly the polarimetric synthetic aperture radar (PolSAR) technique, so classification methods based on PolSAR images have received much attention from scholars and related departments around the world. The multilook polarimetric G0-Wishart model is a flexible model which describes homogeneous, heterogeneous and extremely heterogeneous regions in the image. Moreover, the polarimetric G0-Wishart distribution does not include the modified Bessel function of the second kind; it is a simple statistical distribution model with fewer parameters. To prove its feasibility, a classification process has been tested on fully polarized synthetic aperture radar (SAR) images using this method. First, multilook polarimetric SAR data processing and speckle filtering are applied to reduce the influence of speckle on the classification result. The image is initially classified into sixteen classes by H/A/α decomposition. The ICM algorithm is then used to classify features based on the G0-Wishart distance. Qualitative and quantitative results show that the proposed method can classify polarimetric SAR data effectively and efficiently.
Full Text Available The common statistical methods for supervised classification usually require a large amount of training data to achieve reasonable results, which is time consuming and inefficient. This paper proposes a tensor sparse representation classification (SRC) method for airborne LiDAR points. The LiDAR points are represented as tensors to keep their attributes in spatial space. Only a small amount of training data is then used for dictionary learning, and the sparse tensor is calculated with a tensor OMP algorithm. The point label is determined by the minimal reconstruction residual. Experiments carried out on real LiDAR points show that objects can be successfully distinguished by this algorithm.
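The greedy atom-selection idea behind OMP can be sketched with plain matching pursuit, a simplified, non-orthogonal cousin of the tensor OMP used here; the toy orthonormal dictionary below is hypothetical:

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matching_pursuit(signal, atoms, n_iter=2):
    """Greedy sparse coding: repeatedly pick the unit-norm atom most
    correlated with the residual and subtract its contribution."""
    residual = list(signal)
    coeffs = {}
    for _ in range(n_iter):
        best = max(range(len(atoms)),
                   key=lambda j: abs(dot(residual, atoms[j])))
        c = dot(residual, atoms[best])
        coeffs[best] = coeffs.get(best, 0.0) + c
        residual = [r - c * a for r, a in zip(residual, atoms[best])]
    return coeffs, residual

# Toy orthonormal dictionary; in SRC, the label is assigned by whichever
# class sub-dictionary yields the smallest reconstruction residual.
atoms = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
coeffs, residual = matching_pursuit([3.0, 0.0, 4.0], atoms, n_iter=2)
```

With an orthonormal dictionary, matching pursuit and OMP coincide; for general dictionaries OMP adds a least-squares refit over the selected atoms at each step.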
Saba, Tanzila; Alqahtani, Fatimah Ayidh
Data entry forms are employed in all types of enterprises to collect hundreds of customers' details on a daily basis. The information is filled in manually by the customers, so it is laborious and time consuming to use a human operator to transfer this customer information into computers. Additionally, it is expensive, and human errors might cause serious flaws. The automatic interpretation of scanned forms has facilitated many real applications, from the point of view of speed and accuracy, such as keyword spotting, sorting of postal addresses, script matching and writer identification. This research deals with different strategies to extract customers' information from these scanned forms, and with its interpretation and classification. Accordingly, extracted information is segmented into characters for classification and finally stored as records in databases for further processing. This paper presents a detailed discussion of these semantics-based analysis strategies for forms processing. Finally, new directions are also recommended for future research.
Kothari, Sonal; Phan, John H; Young, Andrew N; Wang, May D
Automatic cancer diagnostic systems based on histological image classification are important for improving therapeutic decisions. Previous studies propose textural and morphological features for such systems. These features capture patterns in histological images that are useful for both cancer grading and subtyping. However, because many of these features lack a clear biological interpretation, pathologists may be reluctant to adopt them for clinical diagnosis. We examine the utility of biologically interpretable shape-based features for the classification of histological renal tumor images. Using Fourier shape descriptors, we extract shape-based features that capture the distribution of stain-enhanced cellular and tissue structures in each image and evaluate these features using a multi-class prediction model. We compare the predictive performance of the shape-based diagnostic model to that of traditional models, i.e., those using textural, morphological and topological features. The shape-based model, with an average accuracy of 77%, outperforms or complements traditional models. We identify the most informative shapes for each renal tumor subtype from the top-selected features. Results suggest that these shapes are not only accurate diagnostic features, but also correlate with known biological characteristics of renal tumors. Shape-based analysis of histological renal tumor images accurately classifies disease subtypes and reveals biologically insightful discriminatory features. This method for shape-based analysis can be extended to other histological datasets to aid pathologists in diagnostic and therapeutic decisions.
Schuld, Maria; Petruccione, Francesco
Quantum machine learning is witnessing an increasing number of quantum algorithms for data-driven decision making, a problem with potential applications ranging from automated image recognition to medical diagnosis. Many of those algorithms are implementations of quantum classifiers, or models for the classification of data inputs with a quantum computer. Following the success of collective decision making with ensembles in classical machine learning, this paper introduces the concept of quantum ensembles of quantum classifiers. Creating the ensemble corresponds to a state preparation routine, after which the quantum classifiers are evaluated in parallel and their combined decision is accessed by a single-qubit measurement. This framework naturally allows for exponentially large ensembles in which, similar to Bayesian learning, the individual classifiers do not have to be trained. As an example, we analyse an exponentially large quantum ensemble in which each classifier is weighted according to its performance in classifying the training data, leading to new results for quantum as well as classical machine learning.
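The accuracy-weighted ensemble analysed here has a direct classical analogue, which the following sketch illustrates with hypothetical one-dimensional threshold classifiers (the quantum state preparation and parallel evaluation are, of course, not modelled):

```python
def weighted_ensemble(classifiers, train, test_x):
    """Accuracy-weighted vote: each classifier's +/-1 vote on the test
    point is weighted by its accuracy on the training set."""
    def accuracy(clf):
        return sum(1 for x, y in train if clf(x) == y) / len(train)
    score = sum(accuracy(clf) * clf(test_x) for clf in classifiers)
    return 1 if score >= 0 else -1

# Hypothetical 1-D threshold classifiers with labels in {-1, +1}:
classifiers = [lambda x, t=t: 1 if x > t else -1 for t in (0.0, 0.5, 2.0)]
train = [(-1.0, -1), (0.2, 1), (1.0, 1), (2.5, 1)]
decision = weighted_ensemble(classifiers, train, test_x=1.5)
```

The quantum version evaluates exponentially many such weighted members in superposition; classically, the cost grows linearly with ensemble size.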
Liu, Huanjun; Zhang, Xiaokang; Zhang, Xinle
Soil taxonomy plays an important role in soil utility and management, but China has only a coarse soil map, created from 1980s data. New technology, e.g. spectroscopy, could simplify soil classification. This study tries to classify soils based on the spectral characteristics of topsoil samples. 148 topsoil samples of typical soils, including Black soil, Chernozem, Blown soil and Meadow soil, were collected from the Songnen plain, Northeast China, and their laboratory spectral reflectance in the visible and near-infrared region (400-2500 nm) was processed with a weighted moving average, a resampling technique, and continuum removal. Spectral indices were extracted from the soil spectral characteristics, including the second absorption position of the spectral curve, the area of the first absorption valley, and the slope of the spectral curve at 500-600 nm and 1340-1360 nm. Then K-means clustering and a decision tree were used respectively to build soil classification models. The results indicated that 1) the second absorption positions of Black soil and Chernozem were located at 610 nm and 650 nm respectively; 2) the spectral curve of the Meadow soil is similar to that of its adjacent soil, which could be due to soil erosion; 3) the decision tree model showed higher classification accuracy, with accuracies for Black soil, Chernozem, Blown soil and Meadow soil of 100%, 88%, 97% and 50% respectively; the accuracy for Blown soil could be increased to 100% by adding one more spectral index (the area of the first two valleys) to the model, which showed that the model could be used for soil classification and soil mapping in the near future.
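The clustering step can be sketched with plain k-means on a single spectral index; the toy values below are illustrative, loosely inspired by the 610 nm vs. 650 nm absorption positions reported above:

```python
def kmeans_1d(values, k, iters=20):
    """Plain k-means on scalar features (e.g., one spectral index),
    with centers initialised evenly over the value range."""
    lo, hi = min(values), max(values)
    centers = [lo + (hi - lo) * (i + 0.5) / k for i in range(k)]
    assign = [0] * len(values)
    for _ in range(iters):
        # Assignment step: nearest center.
        assign = [min(range(k), key=lambda c: (v - centers[c]) ** 2)
                  for v in values]
        # Update step: move each center to its members' mean.
        for c in range(k):
            members = [v for v, a in zip(values, assign) if a == c]
            if members:
                centers[c] = sum(members) / len(members)
    return centers, assign

# Two well-separated groups of toy "second absorption position" values (nm):
centers, assign = kmeans_1d([608, 610, 612, 648, 650, 652], k=2)
```

In the study each sample would carry several indices rather than one; the same loop applies with a multi-dimensional distance.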
Vahie, S.; Zeigler, B.P.; Cho, H. [Univ. of Arizona, Tucson, AZ (United States)
This paper describes the structure of dynamic neuronal ensembles (DNEs). DNEs represent a new paradigm for learning, based on biological neural networks that use variable structures. We present a computational neural element that demonstrates biological neuron functionality such as neurotransmitter feedback, an absolute refractory period and multiple output potentials. More specifically, we develop a network of neural elements that have the ability to dynamically strengthen, weaken, add and remove interconnections. We demonstrate that the DNE is capable of performing dynamic modifications to neuron connections and exhibiting biological neuron functionality. In addition to its applications for learning, the DNE provides an excellent environment for the testing and analysis of biological neural systems. An example of habituation and hyper-sensitization in biological systems, using a neural circuit from a snail, is presented and discussed. This paper provides an insight into the DNE paradigm using models developed and simulated in DEVS.
Seaby, Lauren Paige; Refsgaard, J.C.; Sonnenborg, T.O.
An ensemble of 11 regional climate model (RCM) projections is analysed for Denmark from the perspective of hydrological modelling inputs. Two bias correction approaches are applied: a relatively simple monthly delta change (DC) method and a more complex daily distribution-based scaling (DBS) method. Differences in the strength and direction of climate change signals are compared across models and between bias correction methods, the statistical significance of climate change is tested as it evolves over the 21st century, and the impact of the choice of reference and change period lengths is analysed as it relates to assumptions of stationarity in the current climate and in change signals. Both the DC and DBS methods are able to capture mean monthly and seasonal climate characteristics in temperature (T), precipitation (P), and potential evapotranspiration (ETpot). For P, which is comparatively more variable by day…
Arslan, Evren; Yildiz, Sedat; Albayrak, Yalcin; Koklukaya, Etem
Fibromyalgia syndrome (FMS) is a chronic muscle and skeletal system disease, observed mostly in women, that manifests itself as widespread pain and impairs the individual's quality of life. FMS diagnosis is made based on the American College of Rheumatology (ACR) criteria. Recently, however, the employability and sufficiency of the ACR criteria have been under debate. In this context, several evaluation methods, including clinical evaluation methods, were proposed by researchers. Accordingly, ACR had to update its criteria, announced in 1990, 2010 and 2011. The proposed rule-based fuzzy logic method aims to evaluate FMS from a different angle as well. This method contains a rule base derived from the 1990 ACR criteria and the individual experiences of specialists. The study was conducted using data collected from 60 inpatients and 30 healthy volunteers. Several tests and a physical examination were administered to the participants. The fuzzy logic rule base was structured using the parameters of tender point count, chronic widespread pain period, pain severity, fatigue severity and sleep disturbance level, which were deemed important in FMS diagnosis. It was observed that the fuzzy predictor was generally 95.56% consistent with at least one of the specialists who was not a creator of the fuzzy rule base. Thus, in diagnosis classification, where the severity of FMS was classified as well, consistent findings were obtained from the comparison of the interpretations and experiences of specialists with the fuzzy logic approach. The study proposes a rule base that could eliminate the shortcomings of the 1990 ACR criteria during the FMS evaluation process. Furthermore, the proposed method presents a classification of the severity of the disease, which was not available with the ACR criteria. The study was not limited to disease classification; at the same time, the probability of occurrence and the severity were classified. In addition, those who were not suffering from FMS were
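A rule evaluation of this kind can be sketched with triangular membership functions and min/max (Mamdani-style) inference. The two inputs and all membership breakpoints below are illustrative, not the clinical values used in the study:

```python
def tri(x, a, b, c):
    """Triangular membership function rising from a, peaking at b,
    falling to c."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def fms_severity(tender_points, pain):
    """Toy two-input rule base: min acts as AND within a rule,
    max aggregates across rules."""
    tp_high = tri(tender_points, 8, 18, 28)
    pain_high = tri(pain, 4, 10, 16)
    tp_low = tri(tender_points, -10, 0, 12)
    pain_low = tri(pain, -6, 0, 6)
    severe = min(tp_high, pain_high)   # IF tender high AND pain high
    mild = min(tp_low, pain_low)       # IF tender low AND pain low
    return max(severe, mild), ("severe" if severe >= mild else "mild")

score, label = fms_severity(tender_points=16, pain=9)
```

The study's rule base uses five inputs and many more rules, but each rule reduces to the same membership-then-min/max pattern.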
Full Text Available In living organisms, production of reactive oxygen species (ROS) is counterbalanced by their elimination and/or prevention of formation, which in concert can typically maintain a steady-state (stationary) ROS level. However, this balance may be disturbed and lead to elevated ROS levels and enhanced damage to biomolecules. Since 1985, when H. Sies first introduced the definition of oxidative stress, this area has become one of the hot topics in biology and, to date, many details related to ROS-induced damage to cellular components, ROS-based signaling, cellular responses and adaptation have been disclosed. However, some basal oxidative damage always occurs under unstressed conditions, and in many experimental studies it is difficult to show definitely that oxidative stress is indeed induced by the stressor. Therefore, researchers usually experience substantial difficulties in the correct interpretation of oxidative stress development. For example, in many cases an increase or decrease in the activity of antioxidant and related enzymes is interpreted as evidence of oxidative stress. Careful selection of specific biomarkers (ROS-modified targets) may be very helpful. To avoid these sorts of problems, I propose several classifications of oxidative stress based on its time course and intensity. The time-course classification includes acute and chronic stresses. In the intensity-based classification, I propose to discriminate four zones of function in the relationship between “Dose/concentration of inducer” and the measured “Endpoint”: I – basal oxidative stress zone (BOS); II – low intensity oxidative stress (LOS); III – intermediate intensity oxidative stress (IOS); IV – high intensity oxidative stress (HOS). The proposed classifications may be helpful to describe experimental data where oxidative stress is induced and to systematize it based on its time course and intensity. Perspective directions of investigation in the field include
Sawacha, Zimi; Guarneri, Gabriella; Avogaro, Angelo; Cobelli, Claudio
The diabetic foot, one of the most serious complications of diabetes mellitus and a major risk factor for plantar ulceration, is determined mainly by peripheral neuropathy. Neuropathic patients exhibit decreased stability while standing as well as during dynamic conditions. A new methodology for diabetic gait pattern classification based on cluster analysis has been proposed that aims to identify groups of subjects with similar gait patterns and to verify whether three-dimensional gait data can distinguish diabetic gait patterns from those of control subjects. The gait of 20 nondiabetic individuals and 46 diabetes patients with and without peripheral neuropathy was analyzed [mean age 59.0 (2.9) and 61.1 (4.4) years, mean body mass index (BMI) 24.0 (2.8) and 26.3 (2.0)]. K-means cluster analysis was applied to classify the subjects' gait patterns through the analysis of their ground reaction forces and their joint and segment (trunk, hip, knee, ankle) angles and moments. Cluster analysis led to the definition of four well-separated clusters: one aggregating just neuropathic subjects, one aggregating both neuropathic and non-neuropathic subjects, one including only diabetes patients, and one including controls as well as diabetic and neuropathic subjects. Cluster analysis was useful in grouping subjects with similar gait patterns and provided evidence that there were subgroups that might otherwise not be observed if a group ensemble was presented for any specific variable. In particular, we observed the presence of neuropathic subjects with a gait similar to the controls, and of diabetes patients with a long disease duration with a gait as altered as the neuropathic one. © 2010 Diabetes Technology Society.
Rautenhaus, M.; Grams, C. M.; Schäfler, A.; Westermann, R.
We present the application of interactive three-dimensional (3-D) visualization of ensemble weather predictions to forecasting warm conveyor belt situations during aircraft-based atmospheric research campaigns. Motivated by forecast requirements of the T-NAWDEX-Falcon 2012 (THORPEX - North Atlantic Waveguide and Downstream Impact Experiment) campaign, a method to predict 3-D probabilities of the spatial occurrence of warm conveyor belts (WCBs) has been developed. Probabilities are derived from Lagrangian particle trajectories computed on the forecast wind fields of the European Centre for Medium Range Weather Forecasts (ECMWF) ensemble prediction system. Integration of the method into the 3-D ensemble visualization tool Met.3D, introduced in the first part of this study, facilitates interactive visualization of WCB features and derived probabilities in the context of the ECMWF ensemble forecast. We investigate the sensitivity of the method with respect to trajectory seeding and grid spacing of the forecast wind field. Furthermore, we propose a visual analysis method to quantitatively analyse the contribution of ensemble members to a probability region and, thus, to assist the forecaster in interpreting the obtained probabilities. A case study, revisiting a forecast case from T-NAWDEX-Falcon, illustrates the practical application of Met.3D and demonstrates the use of 3-D and uncertainty visualization for weather forecasting and for planning flight routes in the medium forecast range (3 to 7 days before take-off).
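The gridded WCB probability described above reduces, per grid cell, to the fraction of ensemble members whose trajectories meet the WCB criterion in that cell. A minimal sketch (array shapes and the flagging step are illustrative, not Met.3D's implementation):

```python
import numpy as np

def wcb_probability(member_flags):
    """Gridded WCB probability: fraction of ensemble members flagging each cell.

    member_flags[m, k, j, i] is True where member m's Lagrangian trajectories
    satisfy the WCB criterion in grid cell (k, j, i).
    """
    return member_flags.mean(axis=0)

# 50 hypothetical members on a tiny 3 x 4 x 4 grid; 10 members flag one cell
member_flags = np.zeros((50, 3, 4, 4), dtype=bool)
member_flags[:10, 1, 2, 2] = True
p = wcb_probability(member_flags)
```

The visual analysis method in the paper then asks the inverse question: which members contribute to a given probability region, which this per-member boolean field also answers directly.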
Mehlhorn, Alexander T.; Zwingmann, Joern; Hirschmueller, Anja; Suedkamp, Norbert P.; Schmal, Hagen [University of Freiburg Medical Center, Department of Orthopaedic Surgery, Freiburg (Germany)
Avulsion fractures of the fifth metatarsal base (MTB5) are common forefoot injuries. Based on a radiomorphometric analysis reflecting the risk of secondary displacement, a new classification was developed. A cohort of 95 healthy, sportive, young patients (age ≤ 50 years) with avulsion fractures of the MTB5 was included in the study and divided into groups with non-displaced, primary-displaced, and secondary-displaced fractures. Radiomorphometric data obtained using standard oblique and dorso-plantar views were analyzed in association with secondary displacement. Based on this, a classification was developed and checked for reproducibility. Fractures with a longer distance between the lateral edge of the styloid process and the lateral fracture step-off, and fractures with a more medial joint entry of the fracture line at the MTB5, are at higher risk of secondary displacement. Based on these findings, all fractures were divided into three types: type I with a fracture entry in the lateral third; type II in the middle third; and type III in the medial third of the MTB5. Additionally, the three types were subdivided into an A-type with a fracture displacement <2 mm and a B-type with a fracture displacement ≥2 mm. A substantial level of interobserver agreement was found in the assignment of all 95 fractures to the six fracture types (κ = 0.72). The secondary displacement of fractures was confirmed by all examiners in 100% of cases. Radiomorphometric data may identify MTB5 fractures at risk of secondary displacement. Based on this, a reliable classification was developed. (orig.)
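The resulting six-type scheme is simple enough to state as a rule; the sketch below encodes the thirds and the 2 mm displacement threshold exactly as given in the abstract:

```python
def classify_mtb5(entry_third, displacement_mm):
    """MTB5 avulsion-fracture type: I/II/III by the third of the joint surface
    where the fracture line enters (lateral/middle/medial), with subtype
    A (displacement < 2 mm) or B (displacement >= 2 mm)."""
    fracture_type = {'lateral': 'I', 'middle': 'II', 'medial': 'III'}[entry_third]
    subtype = 'A' if displacement_mm < 2 else 'B'
    return fracture_type + subtype
```

For example, a middle-third entry with 2.5 mm displacement is a type IIB fracture under this scheme.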
Sharma, Govind K; Kumar, Anish; Jayakumar, T; Purnachandra Rao, B; Mariyappa, N
A signal processing methodology is proposed in this paper for effective reconstruction of ultrasonic signals in coarse-grained, highly scattering austenitic stainless steel. The proposed methodology comprises Ensemble Empirical Mode Decomposition (EEMD) processing of ultrasonic signals and application of a signal minimisation algorithm to selected Intrinsic Mode Functions (IMFs) obtained by EEMD. The methodology is applied to ultrasonic signals obtained from austenitic stainless steel specimens of different grain sizes, with and without defects. The influence of probe frequency and signal data length on the EEMD decomposition is also investigated. For a particular sampling rate and probe frequency, the same range of IMFs can be used to reconstruct the ultrasonic signal, irrespective of grain size in the 30-210 μm range investigated in this study. The methodology is successfully employed for detection of defects in a 50 mm thick coarse-grained austenitic stainless steel specimen. A signal-to-noise ratio improvement of better than 15 dB is observed for the ultrasonic signal obtained from a 25 mm deep flat-bottom hole in a 200 μm grain size specimen. For ultrasonic signals obtained from defects at different depths, a minimum of 7 dB additional enhancement in SNR is achieved compared with the sum-of-selected-IMFs approach. The application of the minimisation algorithm to the EEMD-processed signal proves effective for adaptive signal reconstruction with improved signal-to-noise ratio. The methodology was further employed for successful imaging of defects in a B-scan. Copyright © 2014. Published by Elsevier B.V.
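The SNR figures quoted above follow the usual dB definition (peak defect-echo amplitude over the RMS of the grain noise), and the reconstruction step is a sum over a selected subset of IMFs. A minimal sketch of both pieces (the IMF selection itself is the paper's minimisation algorithm and is not reproduced here):

```python
import numpy as np

def snr_db(defect_echo, grain_noise):
    """SNR in dB: peak defect-echo amplitude over the RMS of the grain noise."""
    peak = np.max(np.abs(defect_echo))
    noise_rms = np.sqrt(np.mean(np.asarray(grain_noise, float) ** 2))
    return 20 * np.log10(peak / noise_rms)

def reconstruct(imfs, keep):
    """Adaptive denoising step: rebuild the A-scan from selected IMFs only."""
    return np.sum([imfs[i] for i in keep], axis=0)
```

With a unit-amplitude echo over grain noise of RMS 0.1, `snr_db` returns 20 dB; a 15 dB improvement would correspond to reducing the residual noise RMS by a factor of about 5.6.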
Kim, J.; Kim, H. M.; Cho, C. H.; Boo, K. O.
Estimation of the surface CO2 flux is crucial to understanding the mechanisms of surface carbon sources and sinks. In Asia, there are large uptake regions, such as forests in boreal and temperate regions. In this study, to diagnose the surface CO2 flux globally and in Asia, CO2 observations were assimilated in the CarbonTracker system developed by NOAA. CarbonTracker is an inverse modeling system that estimates the surface CO2 flux using an ensemble Kalman filter with atmospheric CO2 measurements as a constraint. First, the capability of CarbonTracker as an analysis tool for estimating the surface CO2 flux in Asia was investigated. In contrast to the NOAA configuration, a nesting domain centered on Asia was used, with additional observations in Asia. In addition, a diagnostic tool to calculate the effect of individual CO2 observations on the estimated surface CO2 flux was developed using the analysis sensitivity to observations and the information content in the CarbonTracker framework. The results showed that CarbonTracker works appropriately for estimating the surface CO2 flux. The nesting domain centered on Asia produces a detailed estimate of the surface CO2 fluxes and exhibited better agreement with the CO2 observations in Asia. Additional observations have a beneficial impact on the estimated surface CO2 flux in Asia and Europe. The analysis sensitivity showed seasonal variations, with greater sensitivities in summer and lower sensitivities in winter. A strong correlation exists between the information content and the optimized surface CO2 flux.
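The flux estimation rests on an ensemble Kalman filter update. The sketch below shows a generic stochastic EnKF analysis step in the spirit of that scheme; the scalar "flux scaling factor" state, ensemble size and error values are illustrative, not CarbonTracker's configuration:

```python
import numpy as np

def enkf_analysis(ens, obs, H, R, seed=0):
    """Stochastic EnKF analysis step.

    ens: (n_state, n_members) prior ensemble; obs: (n_obs,) observations;
    H: (n_obs, n_state) observation operator; R: (n_obs, n_obs) obs-error cov.
    """
    rng = np.random.default_rng(seed)
    n = ens.shape[1]
    X = ens - ens.mean(axis=1, keepdims=True)
    P = X @ X.T / (n - 1)                          # sample covariance
    K = P @ H.T @ np.linalg.inv(H @ P @ H.T + R)   # Kalman gain
    # perturbed observations, one draw per ensemble member
    obs_pert = obs[:, None] + rng.multivariate_normal(np.zeros(len(obs)), R, size=n).T
    return ens + K @ (obs_pert - H @ ens)

# hypothetical scalar example: a flux scaling factor (prior mean 1.0)
# is drawn toward a CO2-constrained "observation" of 2.0
prior = 1.0 + 0.5 * np.random.default_rng(1).standard_normal((1, 100))
posterior = enkf_analysis(prior, np.array([2.0]), np.eye(1), 0.01 * np.eye(1))
```

The analysis mean moves most of the way to the observation because the prior spread (0.5) is large relative to the observation error (0.1), and the posterior spread shrinks accordingly.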
Full Text Available Overfitting is an important problem in machine learning. Several algorithms, such as the extreme learning machine (ELM), suffer from this issue when facing high-dimensional sparse data, e.g., in text classification. One common issue is that the extent of overfitting is not well quantified. In this paper, we propose a quantitative measure of overfitting, referred to as the rate of overfitting (RO), and a novel model, named AdaBELM, to reduce overfitting. With RO, the overfitting problem can be quantitatively measured and identified. The newly proposed model can achieve high performance on multi-class text classification. To evaluate the generalizability of the new model, we designed experiments based on three datasets, i.e., the 20 Newsgroups, Reuters-21578, and BioMed corpora, which represent balanced, unbalanced, and real application data, respectively. Experiment results demonstrate that AdaBELM can reduce overfitting and outperform classical ELM, decision trees, random forests, and AdaBoost on all three text-classification datasets; for example, it achieves 62.2% higher accuracy than ELM. Therefore, the proposed model has good generalizability.
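The abstract does not give RO's formula, so the sketch below is only a simple train/test-gap proxy that conveys the idea of quantifying overfitting as a rate rather than a yes/no judgment; the paper's actual definition may differ:

```python
def overfit_rate(train_acc, test_acc):
    """Illustrative overfitting measure: the fraction of training accuracy
    that fails to generalize to held-out data. 0 means no measured gap;
    values near 1 mean almost nothing generalizes."""
    return (train_acc - test_acc) / train_acc

# a model at 99% train / 66% test accuracy overfits far more than one at 90% / 85%
severe = overfit_rate(0.99, 0.66)
mild = overfit_rate(0.90, 0.85)
```

A quantitative measure like this lets one compare models (or boosting rounds) by degree of overfitting instead of raw test accuracy alone.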
Full Text Available Image classification aims to group images into corresponding semantic categories. Due to the difficulties of interclass similarity and intraclass variability, it is a challenging issue in computer vision. In this paper, an unsupervised feature learning approach called convolutional denoising sparse autoencoder (CDSAE) is proposed based on the theory of visual attention mechanisms and deep learning methods. Firstly, a saliency detection method is utilized to obtain training samples for unsupervised feature learning. Next, these samples are sent to the denoising sparse autoencoder (DSAE), followed by a convolutional layer and a local contrast normalization layer. Generally, prior knowledge about a specific task is helpful for its solution. Therefore, a new pooling strategy, spatial pyramid pooling (SPP) fused with a center-bias prior, is introduced into our approach. Experimental results on two common image datasets (STL-10 and CIFAR-10) demonstrate that our approach is effective in image classification. They also demonstrate that none of the three components (local contrast normalization, SPP fused with the center-bias prior, and l2 vector normalization) can be excluded from our proposed approach: they jointly improve image representation and classification performance.
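The SPP component can be illustrated independently of the autoencoder: max-pooling one feature map over progressively finer grids and concatenating the results yields a fixed-length descriptor regardless of input size. A minimal sketch (the grid levels are illustrative, and the center-bias fusion from the paper is omitted):

```python
import numpy as np

def spatial_pyramid_pool(fmap, levels=(1, 2, 4)):
    """Max-pool a 2-D feature map over 1x1, 2x2 and 4x4 grids and concatenate
    the pooled values into one fixed-length descriptor."""
    pooled = []
    rows, cols = fmap.shape
    for n in levels:
        # split the map into an n x n grid of (roughly equal) cells
        for r in np.array_split(np.arange(rows), n):
            for c in np.array_split(np.arange(cols), n):
                pooled.append(fmap[np.ix_(r, c)].max())
    return np.array(pooled)

descriptor = spatial_pyramid_pool(np.arange(64.0).reshape(8, 8))
```

For levels (1, 2, 4) the descriptor always has 1 + 4 + 16 = 21 entries, which is what makes SPP useful ahead of a fixed-size classifier.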
Silva, J; Heim, W; Chau, T
We have previously proposed the use of "muscle sounds", or mechanomyography (MMG), as a reliable alternative measure of muscle activity, with the main objective of facilitating the use of more comfortable and functional soft silicone sockets with below-elbow externally powered prostheses. This work describes an integrated strategy in which data and sensor fusion algorithms are combined to provide MMG-based detection, estimation and classification of muscle activity. The proposed strategy represents the first attempt to generate multiple output signals for practical prosthesis control using an MMG multisensor array embedded distally within a soft silicone socket. The multisensor fusion strategy consists of two stages. The first is the detection stage, which determines the presence or absence of muscle contractions in the acquired signals. Upon detection of a contraction, the second stage, classification, specifies the nature of the contraction and determines the corresponding control output. Tests with amputees indicate that, with the simple detection and classification algorithms proposed, MMG is indeed comparable to EMG and may exceed it functionally.
Wu, D; Lu, B; Xue, H D; Lai, Y M; Qian, J M; Yang, H
Objective: To compare the performance of the revision of the Atlanta classification (RAC) and the determinant-based classification (DBC) in acute pancreatitis. Methods: Consecutive patients with acute pancreatitis admitted to a single center from January 2001 to January 2015 were retrospectively analyzed. Patients were classified into mild, moderately severe and severe categories based on RAC, and were simultaneously classified into mild, moderate, severe and critical grades according to DBC. Disease severity and clinical outcomes were compared between subgroups. Receiver operating characteristic (ROC) curves were used to compare the utility of RAC and DBC by calculating the area under the curve (AUC). Results: Among 1120 patients enrolled, organ failure occurred in 343 patients (30.6%) and infected necrosis in 74 patients (6.6%). A total of 63 patients (5.6%) died. Statistically significant differences in disease severity and outcomes were observed between all subgroups in both RAC and DBC. Critical acute pancreatitis (with both persistent organ failure and infected necrosis) had the most severe clinical course and the highest mortality (19/31, 61.3%). DBC had a larger AUC (0.73, 95% CI 0.69-0.78) than RAC (0.68, 95% CI 0.65-0.73) in classifying ICU admissions (P = 0.031), but the two were similar in predicting mortality (P = 0.372) and prolonged ICU stay (P = 0.266). Conclusions: DBC and RAC perform comparably well in categorizing patients with acute pancreatitis regarding disease severity and clinical outcome. DBC is slightly better than RAC in predicting prolonged hospital stay. Persistent organ failure and infected necrosis are risk factors for poor prognosis, and the presence of both is associated with the most dismal outcome.
Torkaman, Atefeh; Charkari, Nasrollah Moghaddam; Aghaeipour, Mahnaz
Hematological malignancies are the types of cancer that affect blood, bone marrow and lymph nodes. As these tissues are naturally connected through the immune system, a disease affecting one of them will often affect the others as well. The hematological malignancies include leukemia, lymphoma and multiple myeloma. Among them, leukemia is a serious malignancy that starts in blood-forming tissues, especially the bone marrow, where blood is made. Research shows that leukemia is one of the common cancers in the world, so an emphasis on diagnostic techniques and the best treatments can provide better prognosis and survival for patients. In this paper, an automatic diagnosis recommender system for classifying leukemia based on a cooperative game is presented. Throughout this research, we analyze flow cytometry data toward the classification of leukemia into eight classes. We work on a real data set covering different types of leukemia, collected at the Iran Blood Transfusion Organization (IBTO). The data set contains 400 samples taken from human leukemic bone marrow. This study uses a cooperative game for classification according to different weights assigned to the markers. The proposed method is versatile, as there are no constraints on what the input or output represent; it can be used to classify any population according to its members' contributions, and so applies equally to other groups of data. The experimental results show an accuracy rate of 93.12% for classification, compared to 90.16% for a decision tree (C4.5). The results demonstrate that the cooperative game approach is very promising for direct classification of leukemia as part of an active medical decision support system for interpretation of flow cytometry readouts. This system could assist clinical hematologists to properly recognize different kinds of leukemia by preparing suggestions, and this could improve the treatment of leukemic
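One standard way marker weights arise in a cooperative-game setting is as Shapley values of a "game" whose value function is the accuracy achieved by a subset of markers. The sketch below computes exact Shapley values by enumerating all join orders; the markers and accuracy numbers are hypothetical, not the IBTO data, and the paper's weighting scheme may differ:

```python
from itertools import permutations

def shapley_values(players, v):
    """Exact Shapley values: average each player's marginal contribution
    to the value function v over all possible join orders."""
    phi = {p: 0.0 for p in players}
    orders = list(permutations(players))
    for order in orders:
        coalition = frozenset()
        for p in order:
            phi[p] += v(coalition | {p}) - v(coalition)
            coalition = coalition | {p}
    return {p: total / len(orders) for p, total in phi.items()}

# hypothetical marker "game": v(S) = classification accuracy using marker set S
acc = {frozenset(): 0.0, frozenset('A'): 0.6, frozenset('B'): 0.5,
       frozenset('C'): 0.4, frozenset('AB'): 0.8, frozenset('AC'): 0.7,
       frozenset('BC'): 0.6, frozenset('ABC'): 0.9}
weights = shapley_values('ABC', lambda s: acc[frozenset(s)])
```

By the efficiency property, the weights sum to the accuracy of the full marker set, so they read as each marker's share of the overall classification performance.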
Nichola C Garbett
Full Text Available Plasma thermograms (thermal stability profiles) of blood plasma are being utilized as a new diagnostic approach for clinical assessment. In this study, we investigated the ability of plasma thermograms to classify systemic lupus erythematosus (SLE) patients versus non-SLE controls using a sample of 300 SLE and 300 control subjects from the Lupus Family Registry and Repository. Additionally, we evaluated the heterogeneity of thermograms along age, sex, ethnicity, concurrent health conditions and SLE diagnostic criteria. Thermograms were visualized graphically for important differences between covariates and summarized using various measures. A modified linear discriminant analysis was used to segregate SLE versus control subjects on the basis of the thermograms. Classification accuracy was measured based on multiple training/test splits of the data and compared to classification based on SLE serological markers. Median sensitivity, specificity, and overall accuracy based on classification using plasma thermograms were 86%, 83%, and 84%, compared to 78%, 95%, and 86% based on a combination of five antibody tests. Combining thermogram and serology information improved sensitivity from 78% to 86% and overall accuracy from 86% to 89% relative to serology alone. Predictive accuracy of thermograms for distinguishing SLE and osteoarthritis/rheumatoid arthritis patients was comparable. Both gender and anemia significantly interacted with disease status for plasma thermograms (p<0.001), with greater separation between SLE and control thermograms for females relative to males and for patients with anemia relative to patients without anemia. Plasma thermograms constitute an additional biomarker which may help improve diagnosis of SLE patients, particularly when coupled with standard diagnostic testing. Differences in thermograms according to patient sex, ethnicity, clinical and environmental factors are important considerations for application of
Vogelmann, A. M.; Zhang, D.; Kollias, P.; Endo, S.; Lamer, K.; Gustafson, W. I., Jr.; Romps, D. M.
Continental boundary layer clouds are important to simulations of weather and climate because of their impact on surface budgets and vertical transports of energy and moisture; however, model-parameterized boundary layer clouds do not agree well with observations, in part because small-scale turbulence and convection are not properly represented. To advance parameterization development and evaluation, observational constraints are needed on critical parameters such as cloud-base mass flux and its relationship to cloud cover and the sub-cloud boundary layer structure, including vertical velocity variance and skewness. In this study, these constraints are derived from Doppler lidar observations and ensemble large-eddy simulations (LES) from the U.S. Department of Energy Atmospheric Radiation Measurement (ARM) Facility Southern Great Plains (SGP) site in Oklahoma. The Doppler lidar analysis will extend the single-site, long-term analysis of Lamer and Kollias (2015) and augment this information with the short-term but unique 1-2 year period since five Doppler lidars began operation at the SGP, providing critical information on regional variability. These observations will be compared to the statistics obtained from ensemble, routine LES conducted by the LES ARM Symbiotic Simulation and Observation (LASSO) project (https://www.arm.gov/capabilities/modeling/lasso). An Observation System Simulation Experiment (OSSE) will be presented that uses the LASSO LES fields to determine criteria for which relationships from Doppler lidar observations are adequately sampled to yield convergence. Any systematic differences between the observed and simulated relationships will be examined to understand factors contributing to the differences. Lamer, K., and P. Kollias (2015), Observations of fair-weather cumuli over land: Dynamical factors controlling cloud size and cover, Geophys. Res. Lett., 42, 8693-8701, doi:10.1002/2015GL064534
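The sub-cloud constraints mentioned above, vertical velocity variance and skewness, are simple statistical moments of a Doppler-lidar velocity time series; a minimal sketch:

```python
import numpy as np

def velocity_moments(w):
    """Variance and skewness of a vertical-velocity time series w (m/s).
    Positive skewness indicates narrow, strong updrafts amid broad subsidence."""
    w = np.asarray(w, float)
    anomaly = w - w.mean()
    var = np.mean(anomaly ** 2)
    skew = np.mean(anomaly ** 3) / var ** 1.5
    return var, skew
```

Comparing these moments between the lidar record and each LES ensemble member is one concrete form the observation-model comparison described above can take.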
Aita, Takuyo; Nishigaki, Koichi
To visualize a bird's-eye view of an ensemble of mitochondrial genome sequences for various species, we recently developed a novel method of mapping a biological sequence ensemble into three-dimensional (3D) vector space. First, we represented a biological sequence of a species s by a word-composition vector x(s), where its length |x(s)| represents the sequence length, its unit vector x(s)/|x(s)| represents the relative composition of the K-tuple words through the sequence, and the dimension N = 4^K is the number of all possible words of length K. Second, we mapped the vector x(s) to the 3D position vector y(s), based on the two following simple principles: (1) |y(s)| = |x(s)|, and (2) the angle between y(s) and y(t) maximally correlates with the angle between x(s) and x(t). The mitochondrial genome sequences for 311 species, including 177 Animalia, 85 Fungi and 49 Green plants, were mapped into 3D space by using K = 7. The mapping was successful because the angles between vectors before and after the mapping highly correlated with each other (correlation coefficients were 0.92-0.97). Interestingly, the Animalia kingdom is distributed along a single arc belt (just like the Milky Way on a celestial globe), and the Fungi and Green plant kingdoms are distributed in a similar arc belt. These two arc belts intersect at their respective middle regions and form a cross structure, just like a jet aircraft fuselage and its wings. This new mapping method will allow researchers to intuitively interpret the visual information presented in the maps in a highly effective manner. Copyright © 2012 Elsevier Inc. All rights reserved.
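The word-composition vector x(s) can be sketched directly from its definition: count the 4^K possible K-tuples along the sequence and rescale so the vector's Euclidean length equals the sequence length (the 3D embedding step itself, which preserves |x(s)| and the pairwise angles, is not reproduced here):

```python
import numpy as np
from itertools import product

def composition_vector(seq, K):
    """x(s): K-tuple word counts in a 4^K-dimensional vector, rescaled so
    that |x(s)| equals the sequence length, as in the paper's definition."""
    index = {''.join(w): i for i, w in enumerate(product('ACGT', repeat=K))}
    counts = np.zeros(4 ** K)
    for i in range(len(seq) - K + 1):
        counts[index[seq[i:i + K]]] += 1
    return counts * (len(seq) / np.linalg.norm(counts))

x = composition_vector('ACGTACGTACGT', 2)
```

The direction of x then carries the relative K-tuple composition while its magnitude carries genome size, which is why the subsequent 3D mapping constrains both |y(s)| and the inter-vector angles.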
Christ, E.; Webster, P. J.; Collins, G.; Byrd, S.
Recent droughts and the continuing water wars between the states of Georgia, Alabama and Florida have made agricultural producers more aware of the importance of managing their irrigation systems more efficiently. Many southeastern states are beginning to consider laws that will require monitoring and regulation of water used for irrigation. Recently, Georgia suspended issuing irrigation permits in some areas of the southwestern portion of the state to try to limit the amount of water being used in irrigation. However, even in southern Georgia, which receives on average between 23 and 33 inches of rain during the growing season, irrigation can significantly impact crop yields. In fact, studies have shown that when fields do not receive rainfall at the most critical stages in the life of cotton, yields for irrigated fields can be up to twice those of non-irrigated cotton. This motivates this study, which aims to produce a forecast tool that will enable producers to make more efficient irrigation management decisions. We will use the ECMWF (European Centre for Medium-Range Weather Forecasts) VarEPS (Ensemble Prediction System) model precipitation forecasts for the grid points included in the 1° x 1° lat/lon square surrounding the point of interest. We will then apply quantile-to-quantile (q-to-q) bias corrections to the forecasts. Once we have applied the bias corrections, we will use the checkbook method of irrigation scheduling to determine the probability of receiving the required amount of rainfall for each week of the growing season. These forecasts will be used during a field trial conducted at the CM Stripling Irrigation Research Park in Camilla, Georgia. This research will compare differences in yield and water use among the standard checkbook method of irrigation, which uses no precipitation forecast knowledge, the weather.com forecast, a dry land plot, and the ensemble-based forecasts mentioned above.
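A minimal quantile-to-quantile (q-to-q) correction maps a forecast value to the observed climatological value at the same quantile; the climatologies below are synthetic, with the model biased wet by a factor of two, not actual ECMWF or station data:

```python
import numpy as np

def q_to_q(forecast_value, fcst_clim, obs_clim):
    """Map a forecast to the observed value at the same climatological
    quantile: find the forecast's rank in the model climatology, then read
    off the observed climatology at that quantile."""
    q = np.searchsorted(np.sort(fcst_clim), forecast_value) / len(fcst_clim)
    return float(np.quantile(obs_clim, min(q, 1.0)))

obs_clim = np.arange(1.0, 101.0)   # hypothetical observed weekly rainfall (mm)
fcst_clim = 2.0 * obs_clim         # model climatology biased wet by 2x
corrected = q_to_q(100.0, fcst_clim, obs_clim)
```

A raw forecast of 100 mm sits at the model's median, so the correction returns roughly the observed median of 50 mm; applying this to each ensemble member yields a bias-corrected distribution from which weekly rainfall probabilities can be read.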
Hsieh, Sheau-Ling; Hsieh, Sung-Huai; Cheng, Po-Hsun; Chen, Chi-Huang; Hsu, Kai-Ping; Lee, I-Shun; Wang, Zhenyu; Lai, Feipei
In this paper, we classify breast cancer from medical diagnostic data. Information gain has been adopted for feature selection. Neural fuzzy (NF), k-nearest neighbor (KNN) and quadratic classifier (QC) schemes, both as single models and as their associated ensembles, have been developed for classification. In addition, a combined ensemble model of all three schemes has been constructed for further validation. The experimental results indicate that ensemble learning performs better than the individual single models. Moreover, the combined ensemble model achieves the highest classification accuracy for breast cancer among all models.
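At its simplest, combining the NF, KNN and QC outputs for one case is a majority vote; the labels below are hypothetical, and the paper's exact combination rule may be weighted rather than this plain vote:

```python
from collections import Counter

def majority_vote(labels):
    """Combine one sample's predicted labels from several models; for ties,
    Counter.most_common returns the label that was inserted first."""
    return Counter(labels).most_common(1)[0][0]

# hypothetical per-model predictions (NF, KNN, QC) for one case
prediction = majority_vote(['malignant', 'benign', 'malignant'])
```

With three heterogeneous models, the vote corrects a single model's error whenever the other two agree, which is the basic mechanism behind the ensemble's accuracy gain.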
Lynnes, Christopher; Berrick, S.; Gopalan, A.; Hua, X.; Shen, S.; Smith, P.; Yang, K-Y.; Wheeler, K.; Curry, C.
The high volume of Earth Observing System data has proven to be challenging to manage for data centers and users alike. At the Goddard Earth Sciences Distributed Active Archive Center (GES DAAC), about 1 TB of new data are archived each day. Distribution to users is also about 1 TB/day. A substantial portion of this distribution is MODIS calibrated radiance data, which has a wide variety of uses. However, much of the data is not useful for a particular user's needs: for example, ocean color users typically need oceanic pixels that are free of cloud and sun-glint. The GES DAAC is using a simple Bayesian classification scheme to rapidly classify each pixel in the scene in order to support several experimental content-based data services for near-real-time MODIS calibrated radiance products (from Direct Readout stations). Content-based subsetting would allow distribution of, say, only clear pixels to the user if desired. Content-based subscriptions would distribute data to users only when they fit the user's usability criteria in their area of interest within the scene. Content-based cache management would retain more useful data on disk for easy online access. The classification may even be exploited in an automated quality assessment of the geolocation product. Though initially to be demonstrated at the GES DAAC, these techniques have applicability in other resource-limited environments, such as spaceborne data systems.
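A per-pixel Gaussian naive Bayes classifier of the kind described can be sketched in a few lines; the two-band "ocean vs. cloud" training pixels are illustrative toy values, not MODIS radiances or the GES DAAC's actual scheme:

```python
import numpy as np

class PixelNaiveBayes:
    """Gaussian naive Bayes over per-pixel features (e.g. band radiances):
    each class is modeled by independent per-feature normal distributions."""

    def fit(self, X, y):
        self.classes = np.unique(y)
        self.mu = np.array([X[y == c].mean(axis=0) for c in self.classes])
        self.var = np.array([X[y == c].var(axis=0) + 1e-9 for c in self.classes])
        self.logprior = np.log([np.mean(y == c) for c in self.classes])
        return self

    def predict(self, X):
        # log-likelihood of each pixel under each class, summed over features
        ll = -0.5 * (np.log(2 * np.pi * self.var[None]) +
                     (X[:, None] - self.mu[None]) ** 2 / self.var[None]).sum(axis=2)
        return self.classes[np.argmax(ll + self.logprior, axis=1)]

# hypothetical training pixels: [band-1 radiance, band-2 radiance]
X = np.array([[0.10, 0.20], [0.15, 0.25], [0.80, 0.90], [0.85, 0.95]])
y = np.array(['ocean', 'ocean', 'cloud', 'cloud'])
model = PixelNaiveBayes().fit(X, y)
```

Because the per-pixel evaluation is just a handful of arithmetic operations, a classifier like this is cheap enough to run over every pixel of a near-real-time granule, which is what the content-based subsetting and subscription services require.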
Liu, Shuming; Che, Han; Smith, Kate; Chang, Tian
Emergent contamination events have a significant impact on water systems. After contamination is detected, it is important to classify the type of contaminant quickly to support remediation attempts. Conventional methods generally rely either on laboratory-based analysis, which requires a long analysis time, or on multivariable-based geometry and sequence analysis, which is prone to being affected by the contaminant concentration. This paper proposes a new contaminant classification method that discriminates contaminants in real time, independent of the contaminant concentration. The proposed method quantifies the similarities or dissimilarities between sensors' responses to different types of contaminants. The performance of the proposed method was evaluated using data from contaminant injection experiments in a laboratory and compared with a Euclidean distance-based method. The robustness of the proposed method was evaluated using an uncertainty analysis. The results show that the proposed method performed better at identifying the type of contaminant than the Euclidean distance-based method and that it could classify the type of contaminant in minutes without significantly compromising the correct classification rate (CCR).
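One concentration-invariant similarity between multi-sensor response vectors is the cosine of the angle between them: scaling a response by a concentration factor leaves the score unchanged, unlike Euclidean distance. This is an illustrative choice to show the principle, not necessarily the paper's exact measure:

```python
import numpy as np

def response_similarity(r1, r2):
    """Cosine similarity of multi-sensor response vectors; multiplying either
    response by a positive concentration factor leaves the score unchanged."""
    r1, r2 = np.asarray(r1, float), np.asarray(r2, float)
    return float(r1 @ r2 / (np.linalg.norm(r1) * np.linalg.norm(r2)))

# hypothetical fingerprint library: per-contaminant response of 3 sensors
library = {'contaminant_A': [1.0, 0.2, 0.5], 'contaminant_B': [0.1, 1.0, 0.3]}
unknown = [2.0, 0.4, 1.0]   # contaminant A at roughly double concentration
best = max(library, key=lambda name: response_similarity(unknown, library[name]))
```

The unknown sample matches contaminant A despite its different concentration, which is exactly the property a Euclidean-distance comparison of the raw responses lacks.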
Khan, Abdul Rauf; Schiøler, Henrik; Knudsen, Torben
Classification is one of the major constituents of the data-mining toolkit. The well-known methods for classification are built on either the principle of logic or statistical/mathematical reasoning. In this article we propose: (1) a different strategy, which is based on the po...
Brian E. Bunker
Full Text Available Unsupervised classification or clustering of multi-decadal land surface phenology provides a spatio-temporal synopsis of natural and agricultural vegetation response to environmental variability and anthropogenic activities. Notwithstanding the detailed temporal information available in calibrated bi-monthly normalized difference vegetation index (NDVI) and comparable time series, typical pre-classification workflows average a pixel's bi-monthly index within the larger multi-decadal time series. While this process is one practical way to reduce the dimensionality of time series with many hundreds of image epochs, it effectively dampens temporal variation from both intra- and inter-annual observations related to land surface phenology. Through a novel application of object-based segmentation aimed at spatial (not temporal) dimensionality reduction, all 294 image epochs from a Moderate Resolution Imaging Spectroradiometer (MODIS) bi-monthly NDVI time series covering the northern Fertile Crescent were retained (in homogeneous landscape units) as unsupervised classification inputs. Given the inherent challenges of in situ or manual image interpretation of land surface phenology classes, a cluster validation approach based on transformed divergence enabled comparison between traditional and novel techniques. Improved intra-annual contrast was clearly manifest in rain-fed agriculture, and inter-annual trajectories showed increased cluster cohesion, reducing the overall number of classes identified in the Fertile Crescent study area from 24 to 10. Given careful segmentation parameters, this spatial dimensionality reduction technique augments the value of unsupervised learning to generate homogeneous land surface phenology units. By combining recent scalable computational approaches to image segmentation, future work can pursue new global land surface phenology products based on the high temporal resolution signatures of vegetation index time series.
Blyverket, Jostein; Hamer, Paul; Svendby, Tove; Lahoz, William
The Soil Moisture and Ocean Salinity (SMOS) satellite samples soil moisture at a spatial scale of ˜40 km and in the top ˜5 cm of the soil, depending on land cover and soil type. Remote sensing products have limited spatial and temporal coverage, with a revisit time of 3 days close to the Equator for SMOS. These factors make it difficult to monitor the hydrological cycle over, for example, northern areas with strong topography, a fractal coastline, and long periods of snow cover, all of which affect the SMOS soil moisture retrieval. Until now, simple 3-layer force-and-restore models have been used to close the spatial (vertical/horizontal) and temporal gaps of soil moisture from remote sensing platforms. In this study we have implemented the Ensemble Transform Kalman Filter (ETKF) in the Surfex land surface model, and used the ISBA diffusion scheme with 11 vertical layers. In contrast to the rapidly changing surface layer, the slower-changing root-zone soil moisture is important for long-term evapotranspiration and water supply. By combining a land surface model with satellite observations using data assimilation we can provide a better estimate of the root-zone soil moisture at regional scales. The Surfex model runs are done for a European domain, from 1 July 2012 to 1 August 2013. For validation of our model setup, we compare with in situ stations from the International Soil Moisture Network (ISMN) and the Norwegian Water and Energy Authorities (NVE); we also compare against the ESA CCI soil moisture product v02.2, which does not include SMOS soil moisture data. SMOS observations and open-loop model runs are shown to exhibit large biases; these are removed before assimilation by a linear rescaling technique. Information from the satellite is transferred into deeper layers of the model using data assimilation, improving the root-zone product when validated against in situ stations. The improved correlation between the assimilated product and the in situ values
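The linear rescaling used to remove bias before assimilation can be sketched as matching the satellite series' mean and standard deviation to the open-loop model climatology; the series below are synthetic stand-ins.

```python
import statistics

def linear_rescale(sat, model):
    """Rescale satellite soil moisture so its mean/std match the model climatology."""
    m_sat, s_sat = statistics.mean(sat), statistics.pstdev(sat)
    m_mod, s_mod = statistics.mean(model), statistics.pstdev(model)
    return [m_mod + (s_mod / s_sat) * (x - m_sat) for x in sat]

# Synthetic series: biased, over-dispersed satellite retrievals vs. model open loop.
sat   = [0.10, 0.30, 0.20, 0.40, 0.25]
model = [0.22, 0.28, 0.25, 0.31, 0.26]
rescaled = linear_rescale(sat, model)
```

After rescaling, the observation-minus-model innovations are (by construction) bias-free, which is the assumption the Kalman-type update relies on.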
Roberts, E S; Annibale, A; Coolen, A C C
Tailored graph ensembles are a developing bridge between biological networks and statistical mechanics. The aim is to use this concept to generate a suite of rigorous tools that can be used to quantify and compare the topology of cellular signalling networks, such as protein-protein interaction networks and gene regulation networks. We calculate exact and explicit formulae for the leading orders in the system size of the Shannon entropies of random graph ensembles constrained with degree distribution and degree-degree correlation. We also construct an ergodic detailed balance Markov chain with non-trivial acceptance probabilities which converges to a strictly uniform measure and is based on edge swaps that conserve all degrees. The acceptance probabilities can be generalized to define Markov chains that target any alternative desired measure on the space of directed or undirected graphs, in order to generate graphs with more sophisticated topological features.
Full Text Available This paper proposes a new adaptive image quality assessment (AIQA) method, which is based on distortion classification. AIQA contains two parts: distortion classification and image quality assessment. First, we analyze characteristics of the original and distorted images, including the distribution of wavelet coefficients and the ratio of edge energy to inner energy of the differential image block, and divide distorted images into white-noise distortion, JPEG compression distortion, and blur distortion. To evaluate the quality of the first two types of distorted images, we use a pixel-based structural similarity metric and a DCT-based structural similarity metric, respectively. For blurred images, we present a new wavelet-based structural similarity algorithm. According to the experimental results, AIQA takes advantage of the different structural similarity metrics and is able to simulate human visual perception effectively.
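A pixel-based structural similarity computation of the kind used for the first two distortion types can be sketched as follows; this is a global SSIM over a small grayscale block with the standard stabilizing constants, whereas practical implementations slide a local window over the image.

```python
def ssim(x, y, c1=6.5025, c2=58.5225):
    """Global SSIM between two equal-length grayscale pixel lists (0-255 range).
    c1 = (0.01*255)**2 and c2 = (0.03*255)**2 are the usual constants."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    vx = sum((a - mx) ** 2 for a in x) / n
    vy = sum((b - my) ** 2 for b in y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    return (((2 * mx * my + c1) * (2 * cov + c2))
            / ((mx * mx + my * my + c1) * (vx + vy + c2)))

# A made-up reference block and a slightly noise-corrupted version of it.
ref = [52, 55, 61, 59, 79, 61, 76, 41]
noisy = [v + d for v, d in zip(ref, [3, -4, 2, -1, 5, -3, 1, -2])]
```

SSIM of an image with itself is exactly 1; distortion pulls the score below 1, which is what the quality assessment stage thresholds on.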
Capper, David; Jones, David T.W.; Sill, Martin
Accurate pathological diagnosis is crucial for optimal management of patients with cancer. For the approximately 100 known tumour types of the central nervous system, standardization of the diagnostic process has been shown to be particularly challenging, with substantial inter-observer variability in the histopathological diagnosis of many tumour types. Here we present a comprehensive approach for the DNA methylation-based classification of central nervous system tumours across all entities and age groups, and demonstrate its application in a routine diagnostic setting. We show...
Wang, Jinliang; Gao, Yan; Wang, Xiaohua; Fu, Lei
Forest texture is an intrinsic characteristic and an important visual feature of a forest ecological system. Full utilization of forest texture will be a great help in increasing the accuracy of forest classification based on remotely sensed data. Taking Shangri-La as a study area, forest classification was carried out based on texture. The results show that: (1) judging from texture abundance, texture boundary, and entropy, as well as visual interpretation, the combination of the grayscale-gradient co-occurrence matrix and wavelet transformation is much better than either method alone for forest texture information extraction; (2) during forest texture information extraction, the size of the suitable texture window determined by the semi-variogram method depends on the forest type (evergreen broadleaf forest is 3×3, deciduous broadleaf forest is 5×5, etc.); (3) when classifying forest based on texture information, the texture factor assembly differs among forests: Variance, Heterogeneity, and Correlation should be selected when the window is between 3×3 and 5×5; Mean, Correlation, and Entropy should be used when the window is in the range of 7×7 to 19×19; and Correlation, Second Moment, and Variance should be used when the window is larger than 21×21.
Vaughn, Brandon K.; Wang, Qui
Many areas in educational and psychological research involve the use of classification statistical analysis. For example, school districts might be interested in attaining variables that provide optimal prediction of school dropouts. In psychology, a researcher might be interested in the classification of a subject into a particular psychological…
Wolski, Piotr; Jack, Christopher; Tadross, Mark; van Aardenne, Lisa; Lennard, Christopher
Self-Organizing Map (SOM) based classifications of synoptic circulation patterns are increasingly being used to interpret large-scale drivers of local climate variability, and as part of statistical downscaling methodologies. These applications rely on a basic premise of synoptic climatology, i.e. that local weather is conditioned by the large-scale circulation. While it is clear that this relationship holds in principle, the implications of its implementation through SOM-based classification, particularly at interannual and longer time scales, are not well recognized. Here we use a SOM to understand the interannual synoptic drivers of climate variability at two locations in the winter and summer rainfall regimes of South Africa. We quantify the portion of variance in seasonal rainfall totals that is explained by year-to-year differences in the synoptic circulation, as schematized by a SOM. We furthermore test how different spatial domain sizes and synoptic variables affect the ability of the SOM to capture the dominant synoptic drivers of interannual rainfall variability. Additionally, we identify systematic synoptic forcing that is not captured by the SOM classification. The results indicate that the frequency of synoptic states, as schematized by a relatively disaggregated SOM (7 × 9) of prognostic atmospheric variables, including specific humidity, air temperature and geostrophic winds, captures only 20-45% of interannual local rainfall variability, and that the residual variance contains a strong systematic component. A multivariate linear regression framework demonstrates that this residual variance can largely be explained using synoptic variables over a particular location, even though those variables are used in the development of the SOM; their influence, however, diminishes with the size of the SOM spatial domain. The influence of the SOM domain size, the choice of SOM atmospheric variables and grid-point explanatory variables on the levels of explained
Full Text Available Neuronal spike trains are used by the nervous system to encode and transmit information. Euclidean distance-based methods (EDBMs) have been applied to quantify the similarity between temporally-discretized spike trains and model responses. In this study, using the same discretization procedure, we developed and applied a joint probability-based method (JPBM) to classify individual spike trains of slowly adapting pulmonary stretch receptors (SARs). The activity of individual SARs was recorded in anaesthetized, paralysed adult male rabbits, which were artificially ventilated at a constant rate and one of three different volumes. Two-thirds of the responses to the 600 stimuli presented at each volume were used to construct three response models (one for each stimulus volume) consisting of a series of time bins, each with spike probabilities. The remaining one-third of the responses were used as test responses to be classified into one of the three model responses. This was done by computing, for a given model, the joint probability of observing the same series of events (spikes or no spikes) dictated by the test response, and determining which of the three probabilities was highest. The JPBM generally produced better classification accuracy than the EDBM, and both performed well above chance. Both methods were similarly affected by variations in discretization parameters, response epoch duration, and two different response alignment strategies. Increasing bin widths increased classification accuracy, which also improved with increased observation time, but primarily during periods of increasing lung inflation. Thus, the JPBM is a simple and effective method for spike train classification.
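The joint-probability classification step can be sketched as follows: each model is a series of per-bin spike probabilities, and a binarized test response is assigned to the model giving it the highest (log) joint probability. The probabilities and responses below are made up for illustration.

```python
import math

def log_joint_prob(response, model, eps=1e-6):
    """Log joint probability of a binary spike train under per-bin spike probs."""
    lp = 0.0
    for spike, p in zip(response, model):
        p = min(max(p, eps), 1.0 - eps)   # clip to avoid log(0)
        lp += math.log(p if spike else 1.0 - p)
    return lp

def classify(response, models):
    """Index of the model under which the response is most probable."""
    scores = [log_joint_prob(response, m) for m in models]
    return scores.index(max(scores))

# Three hypothetical volume models (per-bin spike probabilities).
models = [
    [0.9, 0.8, 0.1, 0.1],   # model 0: early spiking
    [0.1, 0.9, 0.9, 0.1],   # model 1: mid spiking
    [0.1, 0.1, 0.8, 0.9],   # model 2: late spiking
]
test_response = [1, 1, 0, 0]   # spikes in the first two bins only
```

Working in log space keeps the product of many per-bin probabilities numerically stable over long response epochs.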
Brochero, Darwin; Hajji, Islem; Pina, Jasson; Plana, Queralt; Sylvain, Jean-Daniel; Vergeynst, Jenna; Anctil, Francois
Theories about generalization error with ensembles are mainly based on the diversity concept, which promotes resorting to many members of different properties to support mutually agreeable decisions. Kuncheva (2004) proposed the Multi Level Diversity Model (MLDM) to promote diversity in model ensembles, combining different data subsets, input subsets, models, and parameters, and including a combiner level in order to optimize the final ensemble. This work tests the hypothesis that ensembles of Neural Network (NN) structures minimise the generalization error. We used the MLDM to evaluate two different scenarios: (i) ensembles from a single NN architecture, and (ii) a super-ensemble built by combining sub-ensembles of many NN architectures. The time series used correspond to the 12 basins of the MOdel Parameter Estimation eXperiment (MOPEX) project that were used by Duan et al. (2006) and Vos (2013) as a benchmark. Six architectures are evaluated: a FeedForward NN (FFNN) trained with the Levenberg-Marquardt algorithm (Hagan et al., 1996), an FFNN trained with SCE (Duan et al., 1993), a Recurrent NN trained with a complex method (Weins et al., 2008), a Dynamic NARX NN (Leontaritis and Billings, 1985), an Echo State Network (ESN), and a leak integrator neuron (L-ESN) (Lukosevicius and Jaeger, 2009). Each architecture separately performs an Input Variable Selection (IVS) according to a forward stepwise selection (Anctil et al., 2009) using mean square error as the objective function. Post-processing by Predictor Stepwise Selection (PSS) of the super-ensemble has been done following the method proposed by Brochero et al. (2011). IVS results showed that lagged stream flow, lagged precipitation, and the Standardized Precipitation Index (SPI) (McKee et al., 1993) were the most relevant variables; they were selected among the first three variables in 66, 45, and 28 of the 72 scenarios, respectively. A relationship between aridity index (Arora, 2002) and NN
Glaab, Enrico; Bacardit, Jaume; Garibaldi, Jonathan M; Krasnogor, Natalio
Microarray data analysis has been shown to provide an effective tool for studying cancer and genetic diseases. Although classical machine learning techniques have successfully been applied to find informative genes and to predict class labels for new samples, common restrictions of microarray analysis such as small sample sizes, a large attribute space and high noise levels still limit its scientific and clinical applications. Increasing the interpretability of prediction models while retaining a high accuracy would help to exploit the information content in microarray data more effectively. For this purpose, we evaluate our rule-based evolutionary machine learning systems, BioHEL and GAssist, on three public microarray cancer datasets, obtaining simple rule-based models for sample classification. A comparison with other benchmark microarray sample classifiers based on three diverse feature selection algorithms suggests that these evolutionary learning techniques can compete with state-of-the-art methods like support vector machines. The obtained models reach accuracies above 90% in two-level external cross-validation, with the added value of facilitating interpretation by using only combinations of simple if-then-else rules. As a further benefit, a literature mining analysis reveals that prioritizations of informative genes extracted from BioHEL's classification rule sets can outperform gene rankings obtained from a conventional ensemble feature selection in terms of the pointwise mutual information between relevant disease terms and the standardized names of top-ranked genes.
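The kind of interpretable if-then-else rule model such evolutionary learners produce can be sketched as below; the gene names and thresholds are invented for illustration, not extracted from the study's rule sets.

```python
def classify_sample(expr):
    """Toy rule set over a dict of (hypothetical) normalized gene expressions."""
    if expr["GENE_A"] > 0.8 and expr["GENE_B"] < 0.2:
        return "tumour"
    elif expr["GENE_C"] > 0.5:
        return "tumour"
    else:
        return "normal"

# Two hypothetical samples with made-up expression values.
samples = [
    {"GENE_A": 0.9, "GENE_B": 0.1, "GENE_C": 0.3},
    {"GENE_A": 0.2, "GENE_B": 0.7, "GENE_C": 0.1},
]
labels = [classify_sample(s) for s in samples]
```

Unlike a support vector machine's decision function, every prediction here can be traced to a named condition, which is the interpretability benefit the abstract emphasizes.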
Aghazadeh, Babak Seyed; Heris, Hossein Khadivi
In this paper, an efficient fuzzy wavelet packet (WP) based feature extraction method and a fuzzy logic based disorder assessment technique were used to investigate voice signals of patients suffering from unilateral vocal fold paralysis (UVFP). The tenth-order Daubechies mother wavelet (d10) was employed to decompose signals into 5 levels. Next, WP coefficients were used to measure energy and Shannon entropy features at different spectral sub-bands. Subsequently, using the fuzzy c-means method, signals were clustered into 2 classes. The degree of fuzzy membership of pathological and normal signals in their corresponding clusters was taken as a measure to quantify the discrimination ability of the features. A classification accuracy of 100 percent was achieved using an artificial neural network classifier. Finally, the fuzzy c-means clustering method was used as a means of voice pathology assessment. Accordingly, a fuzzy-membership-function-based health index is proposed.
Sinkov, Nikolai A; Sandercock, P Mark L; Harynuk, James J
Detection and identification of ignitable liquids (ILs) in arson debris is a critical part of arson investigations. The challenge of this task is due to the complex and unpredictable chemical nature of arson debris, which also contains pyrolysis products from the fire. ILs, most commonly gasoline, are complex chemical mixtures containing hundreds of compounds that will be consumed or otherwise weathered by the fire to varying extents, depending on factors such as temperature, air flow, the surface on which the IL was placed, etc. While methods such as ASTM E-1618 are effective, data interpretation can be a costly bottleneck in the analytical process for some laboratories. In this study, we address this issue through the application of chemometric tools. Prior to the application of chemometric tools such as PLS-DA and SIMCA, issues of chromatographic alignment and variable selection need to be addressed. Here we use an alignment strategy based on a ladder consisting of perdeuterated n-alkanes. Variable selection and model optimization were automated using a hybrid backward elimination (BE) and forward selection (FS) approach guided by the cluster resolution (CR) metric. In this work, we demonstrate the automated construction, optimization, and application of chemometric tools to casework arson data. The resulting PLS-DA and SIMCA classification models, trained with 165 training set samples, provided classification of 55 validation set samples based on gasoline content with 100% specificity and sensitivity. Copyright © 2013 Elsevier Ireland Ltd. All rights reserved.
Li, Xiang-Ru; Lu, Yu; Zhou, Jian-Ming; Wang, Yong-Jun
With the wide application of high-quality CCDs in celestial spectrum imaging and the implementation of many large sky survey programs (e.g., the Sloan Digital Sky Survey (SDSS), the Two-degree-Field Galaxy Redshift Survey (2dF), the Spectroscopic Survey Telescope (SST), the Large Sky Area Multi-Object Fiber Spectroscopic Telescope (LAMOST) program, and the Large Synoptic Survey Telescope (LSST) program), celestial observational data are pouring in like torrential rain. To utilize them effectively and fully, research on automated processing methods for celestial data is imperative. In the present work, we investigated how to recognize galaxies and quasars from spectra based on the nearest neighbor method. Galaxies and quasars are extragalactic objects; they are far away from Earth, and their spectra are usually contaminated by various noise. Recognizing these two types of spectra is therefore a typical problem in automatic spectra classification. Furthermore, the nearest neighbor method is one of the most typical, classic, and mature algorithms in pattern recognition and data mining, and is often used as a benchmark when developing novel algorithms. It is shown that the recognition ratio of the nearest neighbor (NN) method is comparable to the best results reported in the literature based on more complicated methods, and the superiority of NN is that it does not need to be trained, which is useful for incremental learning and parallel computation in mass spectral data processing. In conclusion, the results of this work are helpful for studying the classification of galaxy and quasar spectra.
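The nearest-neighbor step is simple enough to sketch directly: a test spectrum is assigned the label of its closest training spectrum under Euclidean distance, with no training phase, which is the property highlighted above. The spectra here are short made-up vectors.

```python
import math

def nn_classify(test, train_X, train_y):
    """Assign the label of the Euclidean-nearest training spectrum."""
    dists = [math.dist(test, x) for x in train_X]
    return train_y[dists.index(min(dists))]

# Toy "spectra" (three flux values each) with known labels.
train_X = [[1.0, 0.8, 0.2], [0.9, 0.7, 0.3], [0.1, 0.5, 1.0], [0.2, 0.4, 0.9]]
train_y = ["galaxy", "galaxy", "quasar", "quasar"]
label = nn_classify([0.15, 0.45, 0.95], train_X, train_y)
```

Because the "model" is just the labeled archive itself, new labeled spectra can be appended at any time (incremental learning), and the distance scan parallelizes trivially across the archive.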
Li, Zhenlong; Jin, Xue; Zhao, Xiaohua
This paper addresses the problem of detecting drunk driving based on classification of multivariate time series. First, driving performance measures were collected from a test in a driving simulator located in the Traffic Research Center, Beijing University of Technology. Lateral position and steering angle were used to detect drunk driving. Second, multivariate time series analysis was performed to extract features: a piecewise linear representation was used to represent the multivariate time series, a bottom-up algorithm was then employed to segment them, and the slope and time interval of each segment were extracted as features for classification. Third, a support vector machine classifier was used to classify the driver's state into two classes (normal or drunk) according to the extracted features. The proposed approach achieved an accuracy of 80.0%. Drunk driving detection based on the analysis of multivariate time series is thus feasible and effective. Copyright © 2015 Elsevier Ltd and National Safety Council. All rights reserved.
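The bottom-up piecewise-linear step can be sketched as follows: start from fine segments and repeatedly merge the adjacent pair whose joint least-squares line fit is cheapest, then read the slope of each remaining segment as a feature. This is a generic sketch of the technique, not the authors' code, and it assumes an even-length series.

```python
def fit(xs, ys):
    """Least-squares line fit; returns (slope, sum of squared errors)."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    icept = (sy - slope * sx) / n
    sse = sum((y - (icept + slope * x)) ** 2 for x, y in zip(xs, ys))
    return slope, sse

def bottom_up(y, k):
    """Greedily merge adjacent segments until k segments remain."""
    segs = [(i, i + 1) for i in range(0, len(y), 2)]   # initial 2-point segments
    while len(segs) > k:
        costs = []
        for i in range(len(segs) - 1):
            s, e = segs[i][0], segs[i + 1][1]
            costs.append(fit(list(range(s, e + 1)), y[s:e + 1])[1])
        j = costs.index(min(costs))                     # cheapest merge
        segs[j:j + 2] = [(segs[j][0], segs[j + 1][1])]
    return [(s, e, fit(list(range(s, e + 1)), y[s:e + 1])[0]) for s, e in segs]

# A synthetic lateral-position trace: rising then falling linearly.
trace = [0, 1, 2, 3, 4, 5, 5, 4, 3, 2, 1, 0]
segments = bottom_up(trace, 2)   # (start, end, slope) triples
```

The resulting (slope, duration) pairs per segment are exactly the kind of features the abstract feeds to the SVM.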
Wang, Xiang-Yang; Wu, Zhi-Fang; Chen, Liang; Zheng, Hong-Liang; Yang, Hong-Ying
Image segmentation remains an important, but hard-to-solve, problem since it appears to be application dependent, with usually no a priori information available regarding the image structure. In recent years, many image segmentation algorithms have been developed, but they are often very complex and some undesired results occur frequently. In this paper, we propose a pixel classification based color image segmentation using quaternion exponent moments. Firstly, the pixel-level image feature is extracted based on quaternion exponent moments (QEMs), which can effectively capture the image pixel content by considering the correlation between different color channels. Then, the pixel-level image feature is used as input to a twin support vector machines (TSVM) classifier, and the TSVM model is trained by selecting the training samples with Arimoto entropy thresholding. Finally, the color image is segmented with the trained TSVM model. The proposed scheme has the following advantages: (1) the effective QEMs are introduced to describe color image pixel content, considering the correlation between different color channels; (2) the TSVM classifier is utilized, which has lower computation time and higher classification accuracy. Experimental results show that our proposed method has very promising segmentation performance compared with the state-of-the-art segmentation approaches recently proposed in the literature. Copyright © 2015 Elsevier Ltd. All rights reserved.
Wang, Lei; Liu, Zhiwen; Miao, Qiang; Zhang, Xin
A time-frequency analysis method based on ensemble local mean decomposition (ELMD) and the fast kurtogram (FK) is proposed for rotating machinery fault diagnosis. Local mean decomposition (LMD), as an adaptive non-stationary and nonlinear signal processing method, provides the capability to decompose a multicomponent modulated signal into a series of demodulated mono-components. However, mode mixing is a serious drawback. To alleviate this, ELMD, a noise-assisted variant, was developed. Still, environmental noise present in the raw signal remains in the corresponding product function (PF) along with the component of interest. FK performs well for impulse detection when strong environmental noise exists, but it is susceptible to non-Gaussian noise. The proposed method combines the merits of ELMD and FK to detect faults in rotating machinery. First, by applying ELMD the raw signal is decomposed into a set of PFs. Then, the PF that best characterizes the fault information is selected according to the kurtosis index. Finally, the selected PF signal is further filtered by an optimal band-pass filter based on FK to extract the impulse signal. Fault identification can be deduced from the appearance of fault characteristic frequencies in the squared envelope spectrum of the filtered signal. The advantages of ELMD over LMD and EEMD are illustrated in simulation analyses. Furthermore, the efficiency of the proposed method in fault diagnosis for rotating machinery is demonstrated on gearbox and rolling bearing case analyses.
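The kurtosis-based selection of the most fault-indicative PF can be sketched as follows; the two synthetic components stand in for decomposition outputs (a smooth sine vs. an impulsive signal), since impulsive faults drive kurtosis up.

```python
import math

def kurtosis(x):
    """Kurtosis as m4 / m2^2 (equals 3 for a Gaussian, 1.5 for a pure sine)."""
    n = len(x)
    m = sum(x) / n
    m2 = sum((v - m) ** 2 for v in x) / n
    m4 = sum((v - m) ** 4 for v in x) / n
    return m4 / (m2 * m2)

def select_pf(pfs):
    """Index of the component with the highest kurtosis."""
    ks = [kurtosis(pf) for pf in pfs]
    return ks.index(max(ks))

# Synthetic stand-ins for PFs: a smooth tone and a periodic-impact train.
n = 256
smooth = [math.sin(2 * math.pi * 5 * t / n) for t in range(n)]
impulsive = [0.0] * n
for t in range(0, n, 64):          # periodic impacts, as from a bearing defect
    impulsive[t] = 1.0
best = select_pf([smooth, impulsive])
```

The selected PF would then be band-pass filtered at the kurtogram's optimal band before computing its squared envelope spectrum.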
Bayesian estimation/inversion is commonly used to quantify and reduce modeling uncertainties in coastal ocean models, especially in the framework of parameter estimation. Based on Bayes' rule, the posterior probability distribution function (pdf) of the estimated quantities is obtained conditioned on available data. It can be computed either directly, using a Markov chain Monte Carlo (MCMC) approach, or by sequentially processing the data following a data assimilation approach, which is heavily exploited in large-dimensional state estimation problems. The advantage of data assimilation schemes over MCMC-type methods arises from the ability to algorithmically accommodate a large number of uncertain quantities without a significant increase in computational requirements. However, only approximate estimates are generally obtained by this approach, due to the restricted Gaussian prior and noise assumptions generally imposed in these methods. This contribution aims at evaluating the effectiveness of an ensemble Kalman-based data assimilation method for parameter estimation of a coastal ocean model against an MCMC polynomial chaos (PC)-based scheme. We focus on quantifying the uncertainties of a coastal ocean ADvanced CIRCulation (ADCIRC) model with respect to the Manning's n coefficients. Based on a realistic framework of observing system simulation experiments (OSSEs), we apply an ensemble Kalman filter and the MCMC method employing a surrogate of ADCIRC constructed by a non-intrusive PC expansion for evaluating the likelihood, and test both approaches under identical scenarios. We study the sensitivity of the estimated posteriors with respect to the parameters of the inference methods, including ensemble size, inflation factor, and PC order. A full analysis of both methods, in the context of a coastal ocean model, suggests that an ensemble Kalman filter with appropriate ensemble size and well-tuned inflation provides reliable mean estimates and
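A stochastic (perturbed-observation) ensemble Kalman update for a single scalar parameter can be sketched as follows; the identity observation operator and all numbers are illustrative stand-ins for the ADCIRC/Manning's-n setting, not the study's configuration.

```python
import random
import statistics

def enkf_update(ens, y, r, rng):
    """One perturbed-observation EnKF analysis step with h(p) = p."""
    pred = list(ens)                       # identity observation operator
    var_p = statistics.pvariance(pred)
    gain = var_p / (var_p + r)             # scalar Kalman gain, in (0, 1)
    # Each member is nudged toward its own perturbed copy of the observation.
    return [p + gain * (y + rng.gauss(0.0, r ** 0.5) - hp)
            for p, hp in zip(ens, pred)]

rng = random.Random(0)
prior = [rng.gauss(10.0, 1.0) for _ in range(100)]   # prior parameter ensemble
posterior = enkf_update(prior, 0.0, 0.25, rng)       # assimilate observation y=0
```

The analysis pulls the ensemble mean toward the observation and shrinks the spread; multiplicative inflation (scaling deviations from the mean before the update) is the usual guard against the over-shrinkage the abstract alludes to.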
Lahmiri, S.; Boukadoum, M.
Accurate forecasting of stock market volatility is an important issue in portfolio risk management. In this paper, an ensemble system for stock market volatility forecasting is presented. It is composed of three different models that hybridize the exponential GARCH (EGARCH) process and an artificial neural network trained with the backpropagation algorithm (BPNN) to forecast stock market volatility under normal, t-Student, and generalized error distribution (GED) assumptions, respectively. The goal is to design an ensemble system in which each single hybrid model is capable of capturing normality, excess skewness, or excess kurtosis in the data, so as to achieve complementarity. The performance of each EGARCH-BPNN model and of the ensemble system is evaluated by the closeness of the volatility forecasts to realized volatility. Based on mean absolute error and mean squared error, the experimental results show that the proposed ensemble model, capturing normality, skewness, and kurtosis in the data, is more accurate than the individual EGARCH-BPNN models in forecasting S&P 500 intra-day volatility based on one- and five-minute horizon data.
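The complementarity idea, combining per-distribution models so the ensemble improves on its members, can be sketched with a simple forecast average and MAE scoring; the forecast and realized-volatility numbers are synthetic.

```python
def mae(forecast, actual):
    """Mean absolute error of a forecast series."""
    return sum(abs(f - a) for f, a in zip(forecast, actual)) / len(actual)

def ensemble_mean(forecasts):
    """Average the member forecasts at each time step."""
    return [sum(fs) / len(fs) for fs in zip(*forecasts)]

# Synthetic realized volatility and three member forecasts with offsetting errors.
actual = [1.2, 0.8, 1.5, 1.1, 0.9]
members = [
    [1.0, 1.0, 1.3, 1.2, 1.1],   # e.g. a normal-distribution model
    [1.4, 0.6, 1.8, 0.9, 0.7],   # e.g. a t-Student model
    [1.1, 0.9, 1.4, 1.3, 1.0],   # e.g. a GED model
]
ens = ensemble_mean(members)
```

By convexity of the absolute error, the averaged forecast's MAE never exceeds the mean of the members' MAEs; when member errors offset each other, as here, it beats every member.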
Full Text Available The simplest known plant pathogens are the viroids. Because of their non-coding single-stranded circular RNA genome, they depend on both their sequence and their structure for successful infection and replication. In recent years, important progress in the elucidation of their structures was achieved using an adaptation of the selective 2'-hydroxyl acylation analyzed by primer extension (SHAPE) protocol in order to probe viroid structures in solution. Previously, SHAPE has been adapted to elucidate the structures of all of the members of the family Avsunviroidae, as well as those of a few members of the family Pospiviroidae. In this study, with the goal of providing an entire compendium of the secondary structures of the various viroid species, a total of thirteen new Pospiviroidae members were probed in solution using the SHAPE protocol. More specifically, the secondary structures of eleven species for which the genus was previously known were initially elucidated. At this point, considering all of the SHAPE-elucidated secondary structures, a classification system for viroids into their respective genera was proposed. On the basis of the structural classification reported here, the probings of both the Grapevine latent viroid and the Dahlia latent viroid provide sound arguments for the determination of their respective genera, which appear to be Apscaviroid and Hostuviroid, respectively. More importantly, this study provides the complete repertoire of the secondary structures, mapped in solution, of all of the accepted viroid species reported thus far. In addition, a classification scheme based on structural hallmarks, an important tool for many biological studies, is proposed.
Li, Xiaohui; Yang, Huiying; Rabinowitz, Yaron S.
PURPOSE To determine in a longitudinal study whether there is a correlation between videokeratography and clinical signs of keratoconus that might be useful to practicing clinicians. SETTING Cornea-Genetic Eye Institute, Cedars-Sinai Medical Center, Los Angeles, California, USA. METHODS Eyes grouped as keratoconus, early keratoconus, keratoconus suspect, or normal based on clinical signs and videokeratography were examined at baseline and followed for 1 to 8 years. Differences in quantitative videokeratography indices and the progression rate were evaluated. The quantitative indices were central keratometry (K), the inferior–superior (I–S) value, and the keratoconus percentage index (KISA). Discriminant analysis was used to estimate the classification rate using the indices. RESULTS There were significant differences at baseline between the normal, keratoconus-suspect, and early keratoconus groups in all indices; the respective means were central K: 44.17 D, 45.13 D, and 45.97 D; I–S: 0.57, 1.20, and 4.44; log(KISA): 2.49, 2.94, and 5.71 (all P…). A proportion of the keratoconus-suspect group progressed to early keratoconus or keratoconus, and 75% in the early keratoconus group progressed to keratoconus. Using all 3 indices and age, 86.9% in the normal group, 75.3% in the early keratoconus group, and 44.6% in the keratoconus-suspect group could be classified, yielding a total classification rate of 68.9%. CONCLUSIONS Cross-sectional and longitudinal data showed significant differences between groups in the 3 indices. Use of this classification scheme might form a basis for detecting subclinical keratoconus. PMID:19683159
Full Text Available On the basis of the cluster validity function based on geometric probability proposed in literature [1, 2], we propose a cluster analysis method based on geometric probability to process large amounts of data in a rectangular area. The basic idea is top-down stepwise refinement: first into categories and then into subcategories. At every clustering level, the cluster validity function based on geometric probability is applied first to determine the clusters and the gathering direction, and then the cluster centers and cluster borders are determined. Through TM remote sensing image classification examples, the method is compared with the supervised and unsupervised classification in ERDAS and with the cluster analysis method based on geometric probability in a two-dimensional square proposed in literature [2]. Results show that the proposed method can significantly improve the classification accuracy.
Tuma, Matthias; Igel, Christian; Mialle, Pierrick
Infrasound monitoring is one of four remote sensing technologies continuously employed by the CTBTO Preparatory Commission. The CTBTO's infrasound network is designed to monitor the Earth for potential evidence of atmospheric or shallow underground nuclear explosions. Upon completion, it will comprise 60 infrasound array stations distributed around the globe, of which 47 were certified in January 2014. Three stages can be identified in CTBTO infrasound data processing: automated processing at the level of single array stations, automated processing at the level of the overall global network, and interactive review by human analysts. At station level, the cross-correlation-based PMCC algorithm is used for initial detection of coherent wavefronts. It produces estimates of the trace velocity and azimuth of incoming wavefronts, as well as other descriptive features characterizing a signal. Detected arrivals are then categorized into potentially treaty-relevant versus noise-type signals by a rule-based expert system. This corresponds to a binary classification task at the level of station processing. In addition, incoming signals may be grouped according to their travel path in the atmosphere. The present work investigates automatic classification of infrasound arrivals by kernel-based pattern recognition methods. It aims to explore the potential of state-of-the-art machine learning methods vis-a-vis the current rule-based and task-tailored expert system. To this end, we first address the compilation of a representative, labeled reference benchmark dataset as a prerequisite for both classifier training and evaluation. Data representation is based on features extracted by the CTBTO's PMCC algorithm. As classifiers, we employ support vector machines (SVMs) in a supervised learning setting. Different SVM kernel functions are used and adapted through different hyperparameter optimization routines. The resulting performance is compared to several baseline classifiers. All
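The station-level task described above is a supervised binary classification over PMCC-derived feature vectors, combining a kernel method with hyperparameter search. A rough, self-contained sketch of that pattern (numpy only; kernel ridge regression stands in for the SVMs actually used, and the two-blob data, grid values, and split are invented):

```python
import numpy as np

def rbf_kernel(A, B, gamma):
    """Gaussian (RBF) kernel matrix between row-vector sets A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def fit(X, y, gamma, lam):
    """Kernel ridge used as a classifier: solve (K + lam*I) alpha = y."""
    K = rbf_kernel(X, X, gamma)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def predict(X_train, alpha, X_new, gamma):
    return np.sign(rbf_kernel(X_new, X_train, gamma) @ alpha)

rng = np.random.default_rng(0)
# Toy two-class data standing in for noise-type vs. treaty-relevant arrivals.
X = np.vstack([rng.normal(0, 1, (40, 2)), rng.normal(3, 1, (40, 2))])
y = np.r_[np.full(40, -1.0), np.full(40, 1.0)]

# Hold-out grid search over the two hyperparameters.
idx = rng.permutation(80)
tr, va = idx[:60], idx[60:]
grid = [(g, lm) for g in (0.01, 0.1, 1.0) for lm in (0.1, 1.0)]
best = max(grid, key=lambda p: (predict(X[tr], fit(X[tr], y[tr], *p), X[va], p[0]) == y[va]).mean())
alpha = fit(X[tr], y[tr], *best)
acc = (predict(X[tr], alpha, X[va], best[0]) == y[va]).mean()
```

A real benchmark would of course tune hyperparameters on a separate validation fold and report accuracy on held-out test data.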
Full Text Available A new class of ensemble filters, called the Diffuse Ensemble Filter (DEnF), is proposed in this paper. The DEnF assumes that the forecast errors orthogonal to the first guess ensemble are uncorrelated with the latter ensemble and have infinite variance. The assumption of infinite variance corresponds to the limit of "complete lack of knowledge" and differs dramatically from the implicit assumption made in most other ensemble filters, which is that the forecast errors orthogonal to the first guess ensemble have vanishing errors. The DEnF is independent of the detailed covariances assumed in the space orthogonal to the ensemble space, and reduces to conventional ensemble square root filters when the ensemble size exceeds the model dimension. The DEnF is well defined only in data-rich regimes and involves the inversion of relatively large matrices, although this barrier might be circumvented by variational methods. Two algorithms for solving the DEnF, namely the Diffuse Ensemble Kalman Filter (DEnKF) and the Diffuse Ensemble Transform Kalman Filter (DETKF), are proposed and found to give comparable results. These filters generally converge to the traditional EnKF and ETKF, respectively, when the ensemble size exceeds the model dimension. Numerical experiments demonstrate that the DEnF eliminates filter collapse, which occurs in ensemble Kalman filters for small ensemble sizes. Also, the use of the DEnF to initialize a conventional square root filter dramatically accelerates the spin-up time for convergence. However, in a perfect model scenario, the DEnF produces larger errors than ensemble square root filters that have covariance localization and inflation. For imperfect forecast models, the DEnF produces smaller errors than the ensemble square root filter with inflation. These experiments suggest that the DEnF has some advantages relative to the ensemble square root filters in the regime of small ensemble size, imperfect model, and copious
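All of the filters above take the stochastic ensemble Kalman analysis step as their reference point. A minimal numpy sketch of a conventional perturbed-observation EnKF update (a baseline EnKF, not the DEnF itself; the state size, ensemble size, and noise levels are invented):

```python
import numpy as np

def enkf_update(X, y_obs, H, R, rng):
    """Stochastic EnKF analysis step.

    X     : (n, m) forecast ensemble, one member per column
    y_obs : (p,)   observation vector
    H     : (p, n) linear observation operator
    R     : (p, p) observation-error covariance
    """
    n, m = X.shape
    A = X - X.mean(axis=1, keepdims=True)            # ensemble anomalies
    Pf = A @ A.T / (m - 1)                           # sample forecast covariance
    K = Pf @ H.T @ np.linalg.inv(H @ Pf @ H.T + R)   # Kalman gain
    # Perturbed observations: one independent draw per ensemble member.
    Y = y_obs[:, None] + rng.multivariate_normal(np.zeros(len(y_obs)), R, size=m).T
    return X + K @ (Y - H @ X)

rng = np.random.default_rng(1)
truth = np.array([1.0, -2.0, 0.5])
H = np.eye(3)[:2]                                    # observe first two components
R = 0.05 * np.eye(2)
X = truth[:, None] + rng.normal(0, 1.0, (3, 50))     # 50-member forecast ensemble
y = H @ truth + rng.multivariate_normal(np.zeros(2), R)
Xa = enkf_update(X, y, H, R, rng)
```

The update shrinks the ensemble spread in the observed components, which is exactly the mechanism that collapses for very small ensembles and motivates the DEnF.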
Botto, Anna; Camporese, Matteo
Hydrological models allow scientists to predict the response of water systems under varying forcing conditions. In particular, many physically-based integrated models were recently developed in order to understand the fundamental hydrological processes occurring at the catchment scale. However, the use of this class of hydrological models is still relatively limited, as their prediction skill heavily depends on reliable parameter estimation, an operation that is never trivial, being normally affected by large uncertainty and requiring huge computational effort. The objective of this work is to test the potential of data assimilation to be used as an inverse modeling procedure for the broad class of integrated hydrological models. To pursue this goal, a Bayesian data assimilation (DA) algorithm based on a Monte Carlo approach, namely the ensemble Kalman filter (EnKF), is combined with the CATchment HYdrology (CATHY) model. In this approach, input variables (atmospheric forcing, soil parameters, initial conditions) are statistically perturbed, providing an ensemble of realizations aimed at taking into account the uncertainty involved in the process. Each realization is propagated forward by the CATHY hydrological model within a parallel R framework, developed to reduce the computational effort. When measurements are available, the EnKF is used to update both the system state and soil parameters. In particular, four different assimilation scenarios are applied to test the capability of the modeling framework: first only pressure head or water content is assimilated; then the combination of both; and finally both pressure head and water content together with the subsurface outflow. To demonstrate the effectiveness of the approach in a real-world scenario, an artificial hillslope was designed and built to provide real measurements for the DA analyses. The experimental facility, located in the Department of Civil, Environmental and Architectural Engineering of the
Jiang, He; Dong, Yao; Xiao, Ling
Highlights: • An ensemble learning system is proposed to forecast global solar radiation. • LASSO is utilized as the feature selection method for the subset models. • GSO is used to select the weight vector aggregating the responses of the subset models. • A simple and efficient algorithm is designed based on a thresholding function. • Theoretical analysis focusing on the error rate is provided. - Abstract: Forecasting of effective solar irradiation has attracted huge interest in recent decades, mainly due to its various applications in grid-connected photovoltaic installations. This paper develops and investigates an ensemble learning based multistage intelligent approach to forecast 5 days of global horizontal radiation at four given locations in India. A two-way interaction model is considered with the purpose of detecting the correlation between the features. The main structure of the novel method is ensemble learning which, based on the divide-and-conquer principle, is applied to enhance the forecasting accuracy and model stability. An efficient feature selection method, LASSO, is performed in the input space, with the regularization parameter selected by cross-validation. A weight vector which best represents the importance of each individual model in the ensemble system is provided by glowworm swarm optimization. The combination of feature selection and parameter selection is helpful in creating the diversity of the ensemble learning. In order to illustrate the validity of the proposed method, the datasets at four different locations in India are split into training and test datasets. The results of the real data experiments demonstrate the efficiency and efficacy of the proposed method compared with its competitors.
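The thresholding function mentioned in the highlights is the soft-thresholding operator that underlies LASSO solvers. A rough numpy sketch of LASSO fitted by iterative soft-thresholding (ISTA), with synthetic regressors standing in for the irradiation features (the penalty value, data sizes, and sparsity pattern are invented):

```python
import numpy as np

def soft_threshold(z, t):
    """Elementwise soft-thresholding: shrink toward zero by t, clip at zero."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=500):
    """LASSO via iterative soft-thresholding (ISTA).

    Minimizes 0.5/n * ||y - X w||^2 + lam * ||w||_1.
    """
    n, p = X.shape
    L = np.linalg.norm(X, 2) ** 2 / n        # Lipschitz constant of the gradient
    w = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y) / n
        w = soft_threshold(w - grad / L, lam / L)
    return w

rng = np.random.default_rng(0)
n, p = 200, 10
X = rng.normal(size=(n, p))
w_true = np.zeros(p)
w_true[:3] = [2.0, -1.5, 1.0]                # only 3 informative features
y = X @ w_true + 0.1 * rng.normal(size=n)
w = lasso_ista(X, y, lam=0.1)
selected = np.flatnonzero(np.abs(w) > 1e-6)  # indices the L1 penalty keeps
```

The zeroed coefficients are exactly the feature-selection effect exploited in the ensemble: each subset model sees only the features that survive the threshold.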
Wang, Huaqing; Li, Ruitong; Tang, Gang; Yuan, Hongfang; Zhao, Qingliang; Cao, Xi
A compound fault signal usually contains multiple characteristic signals and strong confusion noise, which makes it difficult to separate weak fault signals through conventional means such as FFT-based envelope detection, wavelet transform or empirical mode decomposition used individually. In order to improve the compound fault diagnosis of rolling bearings via signal separation, the present paper proposes a new method to identify compound faults from measured mixed signals, based on the ensemble empirical mode decomposition (EEMD) method and the independent component analysis (ICA) technique. With this approach, a vibration signal is first decomposed into intrinsic mode functions (IMFs) by the EEMD method to obtain multichannel signals. Then, according to a cross-correlation criterion, the corresponding IMFs are selected as the input matrix of ICA. Finally, the compound faults can be separated effectively by executing the ICA method, which makes the fault features more easily extracted and more clearly identified. Experimental results validate the effectiveness of the proposed method in compound fault separation, which works not only for the outer race defect, but also for the roller defect and the unbalance fault of the experimental system.
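The cross-correlation criterion used to choose which IMFs feed the ICA stage can be sketched in a few lines of numpy. Here the EEMD output is stubbed with known components; a real pipeline would obtain the IMFs from an EEMD implementation (e.g. the PyEMD package), and the 0.2 threshold is an assumption:

```python
import numpy as np

def select_imfs(signal, imfs, threshold=0.2):
    """Keep the IMFs whose normalized cross-correlation with the raw signal
    exceeds a threshold; the kept ones feed the ICA separation stage."""
    keep = []
    for i, imf in enumerate(imfs):
        c = np.corrcoef(signal, imf)[0, 1]
        if abs(c) >= threshold:
            keep.append(i)
    return keep

# Toy stand-in for an EEMD output: two oscillatory components plus noise.
t = np.linspace(0, 1, 1000, endpoint=False)
comp1 = np.sin(2 * np.pi * 50 * t)        # e.g. an outer-race fault band
comp2 = 0.5 * np.sin(2 * np.pi * 7 * t)   # e.g. an unbalance component
rng = np.random.default_rng(0)
noise = 0.05 * rng.normal(size=t.size)
signal = comp1 + comp2 + noise
imfs = [comp1, comp2, noise]              # pretend these came from EEMD
kept = select_imfs(signal, imfs)
```

On this toy input, the two oscillatory components correlate strongly with the mixture while the noise component falls below the threshold and is discarded.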
Full Text Available In order to guarantee the stable operation of shearers and promote the construction of an automatic coal mining working face, an online cutting pattern recognition method with high accuracy and speed based on Improved Ensemble Empirical Mode Decomposition (IEEMD) and a Probabilistic Neural Network (PNN) is proposed. An industrial microphone is installed on the shearer and the cutting sound is collected as the recognition criterion to overcome the disadvantages of giant size, contact measurement and low identification rate of traditional detectors. To avoid end-point effects and get rid of undesirable intrinsic mode function (IMF) components in the initial signal, IEEMD is conducted on the sound. The end-point continuation based on the practical storage data is performed first to overcome the end-point effect. Next the average correlation coefficient, which is calculated from the correlation of the first IMF with the others, is introduced to select essential IMFs. Then the energy and standard deviation of the remaining IMFs are extracted as features and PNN is applied to classify the cutting patterns. Finally, a simulation example, with an accuracy of 92.67%, and an industrial application prove the efficiency and correctness of the proposed method.
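The PNN used above is essentially a Parzen-window density classifier: each class score is an average of Gaussian kernels centered on that class's training samples, and the test point takes the class of highest density. A minimal numpy sketch (the smoothing parameter sigma and the toy two-dimensional features standing in for the energy/standard-deviation features are invented):

```python
import numpy as np

def pnn_predict(X_train, y_train, X_new, sigma=0.3):
    """Probabilistic Neural Network: Parzen-window class densities with a
    Gaussian kernel; each test point gets the class of highest density."""
    classes = np.unique(y_train)
    scores = []
    for c in classes:
        Xc = X_train[y_train == c]
        d2 = ((X_new[:, None, :] - Xc[None, :, :]) ** 2).sum(-1)
        scores.append(np.exp(-d2 / (2 * sigma**2)).mean(axis=1))
    return classes[np.argmax(np.column_stack(scores), axis=1)]

rng = np.random.default_rng(0)
# Toy 2-D feature vectors standing in for per-IMF (energy, std) features.
X0 = rng.normal([0, 0], 0.3, (30, 2))   # cutting pattern A
X1 = rng.normal([2, 2], 0.3, (30, 2))   # cutting pattern B
X_train = np.vstack([X0, X1])
y_train = np.r_[np.zeros(30, int), np.ones(30, int)]
X_test = np.array([[0.1, -0.1], [1.9, 2.2]])
pred = pnn_predict(X_train, y_train, X_test)
```

PNNs need no iterative training, which is what makes them attractive for the online recognition setting described above.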
Juan-Albarracín, Javier; Fuster-Garcia, Elies; Manjón, José V.; Robles, Montserrat; Aparici, F.; Martí-Bonmatí, L.; García-Gómez, Juan M.
Automatic brain tumour segmentation has become a key component for the future of brain tumour treatment. Currently, most brain tumour segmentation approaches arise from the supervised learning standpoint, which requires a labelled training dataset from which to infer the models of the classes. The performance of these models is directly determined by the size and quality of the training corpus, whose retrieval becomes a tedious and time-consuming task. On the other hand, unsupervised approaches avoid these limitations but often do not reach results comparable to those of the supervised methods. In this sense, we propose an automated unsupervised method for brain tumour segmentation based on anatomical Magnetic Resonance (MR) images. Four unsupervised classification algorithms, grouped by their structured or non-structured condition, were evaluated within our pipeline. As non-structured algorithms, we evaluated K-means, Fuzzy K-means and the Gaussian Mixture Model (GMM), whereas as structured classification algorithms we evaluated the Gaussian Hidden Markov Random Field (GHMRF). An automated postprocess based on a statistical approach supported by tissue probability maps is proposed to automatically identify the tumour classes after the segmentations. We evaluated our brain tumour segmentation method with the public BRAin Tumor Segmentation (BRATS) 2013 Test and Leaderboard datasets. Our approach based on the GMM model improves on the results obtained by most of the supervised methods evaluated with the Leaderboard set and reaches the second position in the ranking. Our variant based on the GHMRF achieves the first position in the Test ranking of the unsupervised approaches and the seventh position in the general Test ranking, which confirms the method as a viable alternative for brain tumour segmentation. PMID:25978453
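Of the non-structured algorithms evaluated above, K-means is the simplest to illustrate. A minimal numpy sketch of unsupervised intensity clustering in that spirit (the 1-D toy intensities and the deterministic sorted-quantile initialization are assumptions of this sketch, not the paper's setup):

```python
import numpy as np

def kmeans(X, k, n_iter=50):
    """Plain Lloyd's K-means with a deterministic quantile-style init
    (sorted by the first feature) so this toy example is reproducible."""
    order = np.argsort(X[:, 0])
    centers = X[order[np.linspace(0, len(X) - 1, k).astype(int)]].astype(float)
    for _ in range(n_iter):
        # Assign every point to its nearest center, then recompute centers.
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

rng = np.random.default_rng(1)
# 1-D toy 'intensities' drawn from three well-separated tissue-like modes.
X = np.concatenate([rng.normal(m, 0.05, 100) for m in (0.2, 0.5, 0.8)])[:, None]
labels, centers = kmeans(X, k=3)
```

The GMM variant replaces the hard assignment with soft responsibilities, and the GHMRF additionally couples neighbouring voxels, which is what "structured" means above.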
Hsieh, Yi-Ta; Chen, Chaur-Tzuhn; Chen, Jan-Chang
In general, considerable human and material resources are required to perform a forest inventory survey. Using remote sensing technologies to reduce forest inventory costs has thus become an important topic in forest inventory-related studies. Leica ADS-40 digital aerial photographs feature advantages such as high spatial resolution, high radiometric resolution, and a wealth of spectral information. As a result, they have been widely used to perform forest inventories. We classified ADS-40 digital aerial photographs according to the complex forest land cover types listed in the Fourth Forest Resource Survey in an effort to establish a classification method for categorizing ADS-40 digital aerial photographs. Subsequently, we classified the images using the knowledge-based classification method in combination with object-based analysis techniques, decision tree classification techniques, classification parameters such as object texture, shape, and spectral characteristics, a class-based classification method, and geographic information system mapping information. Finally, the results were compared with manually interpreted aerial photographs. Images were classified using a hierarchical classification method comprising four classification levels (levels 1 to 4). The classification overall accuracy (OA) of levels 1 to 4 lies within a range of 64.29% to 98.50%. The final result comparisons showed that the proposed classification method achieved an OA of 78.20% and a kappa coefficient of 0.7597. According to the image classification results, classification errors occurred mostly in images of sunlit crowns because the image values for individual trees varied. Such variance was caused by the crown structure and the incident angle of the sun. These errors lowered image classification accuracy and warrant further studies. This study corroborates the high feasibility of mapping complex forest land cover types using ADS-40 digital aerial photographs.
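The two agreement figures reported above, overall accuracy (OA) and the kappa coefficient, both derive from the confusion matrix; kappa additionally discounts the agreement expected by chance from the row and column marginals. A minimal numpy sketch with an invented 3-class example:

```python
import numpy as np

def confusion_matrix(y_true, y_pred, k):
    """k x k matrix M with M[i, j] = count of class-i samples predicted as j."""
    M = np.zeros((k, k), dtype=int)
    for t, p in zip(y_true, y_pred):
        M[t, p] += 1
    return M

def overall_accuracy_and_kappa(M):
    """OA = trace/total; kappa corrects OA for chance agreement estimated
    from the row/column marginals."""
    n = M.sum()
    po = np.trace(M) / n                        # observed agreement (OA)
    pe = (M.sum(0) * M.sum(1)).sum() / n**2     # chance agreement
    return po, (po - pe) / (1 - pe)

# Invented reference vs. predicted labels for three land cover classes.
y_true = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 2])
y_pred = np.array([0, 0, 1, 1, 1, 1, 2, 2, 2, 0])
M = confusion_matrix(y_true, y_pred, 3)
oa, kappa = overall_accuracy_and_kappa(M)
```

Here OA is 0.8 while kappa is about 0.70, mirroring the gap between the 78.20% OA and 0.7597 kappa reported above: kappa is always the stricter figure when the class distribution is unbalanced.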
Full Text Available Credit scoring methods are widely used for evaluating loan applications in financial and banking institutions. A credit score identifies whether applicant customers belong to a good-risk applicant group or a bad-risk applicant group. These decisions are based on the demographic data of the customers, the customer's overall business with the bank, and the loan payment history of the loan applicants. The advantages of using credit scoring models include reducing the cost of credit analysis, enabling faster credit decisions and diminishing possible risk. Many statistical and machine learning techniques such as logistic regression, support vector machines, neural networks and decision tree algorithms have been used independently and as hybrid credit scoring models. This paper proposes an ensemble-based technique combining seven individual models to increase the classification accuracy. Feature selection has also been used for selecting important attributes for classification. Cross-classification was conducted using three data partitions. The German credit dataset, with 1000 instances and 21 attributes, is used in the present study. The results of the experiments revealed that the ensemble model yielded very good accuracy when compared to the individual models. In all three partitions, the ensemble model was able to classify more than 80% of the loan customers as good creditors correctly. Also, for the 70:30 partition there was a good impact of feature selection on the accuracy of the classifiers. The results were improved for almost all individual models, including the ensemble model.
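The most common way to combine several individual scoring models into an ensemble is a majority vote over their binary good-risk/bad-risk predictions (the abstract does not state the exact combination rule, so treating it as a majority vote is an illustrative assumption). A minimal numpy sketch with seven invented prediction vectors:

```python
import numpy as np

def majority_vote(predictions):
    """Combine binary 0/1 predictions from several models by majority vote.
    predictions: (n_models, n_samples) array; n_models should be odd."""
    votes = predictions.sum(axis=0)
    return (votes * 2 > predictions.shape[0]).astype(int)

# Seven hypothetical scorecards voting on nine loan applicants (1 = good risk).
preds = np.array([
    [1, 0, 1, 1, 0, 0, 1, 1, 0],
    [1, 0, 0, 1, 0, 1, 1, 1, 0],
    [1, 1, 1, 1, 0, 0, 1, 0, 0],
    [0, 0, 1, 1, 0, 0, 1, 1, 1],
    [1, 0, 1, 0, 1, 0, 1, 1, 0],
    [1, 0, 1, 1, 0, 0, 0, 1, 0],
    [1, 0, 1, 1, 0, 0, 1, 1, 0],
])
combined = majority_vote(preds)
```

With an odd number of voters there are no ties, and uncorrelated individual errors tend to cancel, which is the core reason an ensemble can beat each of its seven members.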
Full Text Available The problem of classification in incomplete information systems is a hot issue in intelligent information processing. The hypergraph is a new intelligent method for machine learning. However, it is hard to process an incomplete information system with the traditional hypergraph, for two reasons: (1) the hyperedges are generated randomly in the traditional hypergraph model; (2) the existing methods are unsuitable for incomplete information systems because of their missing values. In this paper, we propose a novel classification algorithm for incomplete information systems based on the hypergraph model and rough set theory. First, we initialize the hypergraph. Second, we classify the training set by the neighborhood hypergraph. Third, under the guidance of rough sets, we replace the poor hyperedges. After that, we obtain a good classifier. The proposed approach is tested on 15 data sets from the UCI machine learning repository. Furthermore, it is compared with some existing methods, such as C4.5, SVM, Naive Bayes, and KNN. The experimental results show that the proposed algorithm has better performance in terms of precision, recall, AUC, and F-measure.
Ardha Aryaguna, Prama; Danoedoro, Projo
Full Text Available Advances in remote sensing analysis have followed the development of the underlying technology, especially in sensors and platforms. Many images now offer high spatial and radiometric resolution and therefore carry much more information. Analyses of vegetation objects, such as floristic composition, benefit greatly from these developments. Floristic composition can be interpreted using several methods, including pixel-based classification and object-based classification. A problem with pixel-based methods on high-spatial-resolution imagery is the salt-and-pepper effect that appears in the classification result. The purpose of this research is to compare the effectiveness of pixel-based classification and object-based classification for vegetation composition mapping on the high-resolution WorldView-2 image. The results show that pixel-based classification using a 5×5 majority kernel window gives the highest accuracy among the tested classifications. The highest accuracy is 73.32%, obtained from the WorldView-2 image radiometrically corrected to surface reflectance; however, in terms of accuracy for every individual class, the object-based method is the best among the tested methods. From the effectiveness standpoint, the pixel-based approach is more effective than the object-based approach for vegetation composition mapping in the Tidar forest.
Sewak, Mihir S.; Reddy, Narender P.; Duan, Zhong-Hui
Analysis of gene expression data provides an objective and efficient technique for sub-classification of leukemia. The purpose of the present study was to design committee neural network-based classification systems to subcategorize leukemia gene expression data. In the study, a binary classification system was considered to differentiate acute lymphoblastic leukemia from acute myeloid leukemia. A ternary classification system which classifies leukemia expression data into three subclasses...
Shindo, Kiyo; Shimizu, Akinobu; Kobatake, Hidefumi; Nawano, Shigeru; Shinozaki, Kenji
Organ segmentation learned by a conventional ensemble learning algorithm suffers from unnatural errors because each voxel is classified independently in the segmentation process. This paper proposes a novel ensemble learning algorithm that can take into account the global shape and location of organs. It estimates the shape and location of an organ from a given image by combining an intermediate segmentation result with a statistical shape model. Once the ensemble learning algorithm can no longer improve the segmentation performance in the iterative learning process, it estimates the shape and location by finding an optimal model parameter set with the maximum degree of correspondence between a statistical shape model and the intermediate segmentation result. Novel weak classifiers are generated based on a signed distance from the boundary of the estimated shape and a distance from the barycenter of the intermediate segmentation result. Subsequently it continues the learning process with the novel weak classifiers. This paper presents experimental results where the proposed ensemble learning algorithm generates a segmentation process that can extract a spleen from a 3D CT image more precisely than a conventional one. (author)
S. W. Loke
Full Text Available Context-aware smart things are capable of computational behaviour based on sensing the physical world, inferring context from the sensed data, and acting on the sensed context. A collection of such things can form what we call a thing-ensemble, when they have the ability to communicate with one another (over a short-range network such as Bluetooth, or over the Internet, i.e. the Internet of Things (IoT) concept), sense each other, and when each of them might play certain roles with respect to each other. Each smart thing in a thing-ensemble might have its own context-aware behaviours which, when integrated with other smart things, yield behaviours that are not straightforward to reason with. We present Sigma, a language of operators, inspired by modular logic programming, for specifying and reasoning with combined behaviours among smart things in a thing-ensemble. We show numerous examples of the use of Sigma for describing a range of behaviours over a diverse range of thing-ensembles, from sensor networks to smart digital frames, demonstrating the versatility of our approach. We contend that our operator approach abstracts away low-level communication and protocol details, and allows systems of context-aware things to be designed and built in a compositional and incremental manner.
Full Text Available Reliable climate change scenarios are critical for West Africa, whose economy relies mostly on agriculture, and, in this regard, multimodel ensembles are believed to provide the most robust climate change information. Toward this end, we analyze and intercompare the performance of a set of four regional climate models (RCMs) driven by two global climate models (GCMs), for a total of 4 different GCM-RCM pairs, in simulating present-day and future climate over West Africa. The results show that the individual RCM members, as well as their ensemble employing the same driving fields, exhibit different biases and show mixed results in terms of outperforming the GCM simulation of seasonal temperature and precipitation, indicating a substantial sensitivity of RCMs to regional and local processes. These biases are reduced and GCM simulations improved upon by averaging all four RCM simulations, suggesting that multi-model RCM ensembles based on different driving GCMs help to compensate for systematic errors from both the nested and the driving models. This confirms the importance of the multi-model approach for improving the robustness of climate change projections. Illustrative examples of such an ensemble reveal that the western Sahel undergoes substantial drying in future climate projections, mostly due to a decrease in peak monsoon rainfall.
Full Text Available PURPOSE: Demonstrate the radiological findings of 127 angiomyolipomas (AMLs) and propose a classification based on the radiological evidence of fat. MATERIALS AND METHODS: The imaging findings of 85 consecutive patients with AMLs: isolated (n = 73), multiple without tuberous sclerosis (TS) (n = 4) and multiple with TS (n = 8), were retrospectively reviewed. Eighteen AMLs (14%) presented with hemorrhage. All patients were submitted to a dedicated helical CT or magnetic resonance study. All hemorrhagic and non-hemorrhagic lesions were grouped together since our objective was to analyze the presence of detectable fat. Out of 85 patients, 53 were monitored and 32 were treated surgically due to a large perirenal component (n = 13), hemorrhage (n = 11) or the impossibility of an adequate preoperative characterization (n = 8). There was not a case of renal cell carcinoma (RCC) with a fat component in this group of patients. RESULTS: Based on the presence and amount of detectable fat within the lesion, AMLs were classified into 4 distinct radiological patterns: Pattern-I, predominantly fatty (usually less than 2 cm in diameter and intrarenal): 54%; Pattern-II, partially fatty (intrarenal or exophytic): 29%; Pattern-III, minimally fatty (mostly exophytic and perirenal): 11%; and Pattern-IV, without fat (mostly exophytic and perirenal): 6%. CONCLUSIONS: This proposed classification might be useful to understand the imaging manifestations of AMLs and their differential diagnosis, and to determine when further radiological evaluation would be necessary. Small (< 1.5 cm), pattern-I AMLs tend to be intrarenal, homogeneous and predominantly fatty. As they grow they tend to become partially or completely exophytic and heterogeneous (patterns II and III). The rare pattern-IV AMLs, however, can be small or large, intrarenal or exophytic, but always appear as a homogeneous and hyperdense mass. Since no renal cell carcinoma was found in our series, from an evidence-based practice, all renal mass with detectable
Arellano, A. F., Jr.; Raeder, K.; Anderson, J. L.; Hess, P. G.; Emmons, L. K.; Edwards, D. P.; Pfister, G. G.; Campos, T. L.; Sachse, G. W.
We present a global chemical data assimilation system using a global atmosphere model, the Community Atmosphere Model (CAM3) with simplified chemistry, and the Data Assimilation Research Testbed (DART) assimilation package. DART is a community software facility for assimilation studies using the ensemble Kalman filter approach. Here, we apply the assimilation system to constrain global tropospheric carbon monoxide (CO) by assimilating meteorological observations of temperature and horizontal wind velocity and satellite CO retrievals from the Measurement of Pollution in the Troposphere (MOPITT) satellite instrument. We verify the system performance using independent CO observations taken on board the NSF/NCAR C-130 and NASA DC-8 aircraft during the April 2006 part of the Intercontinental Chemical Transport Experiment (INTEX-B). Our evaluations show that MOPITT data assimilation provides significant improvements in terms of capturing the observed CO variability relative to no MOPITT assimilation (i.e. the correlation improves from 0.62 to 0.71, significant at 99% confidence). The assimilation provides evidence of a median CO loading of about 150 ppbv at 700 hPa over the NE Pacific during April 2006. This is marginally higher than the modeled CO with no MOPITT assimilation (~140 ppbv). Our ensemble-based estimates of model uncertainty also show model overprediction over the source region (i.e. China) and underprediction over the NE Pacific, suggesting model errors that cannot be readily explained by emissions alone. These results have important implications for improving regional chemical forecasts and for inverse modeling of CO sources, and further demonstrate the utility of the assimilation system in comparing non-coincident measurements, e.g. comparing satellite retrievals of CO with in-situ aircraft measurements. The work described above also brought to light several shortcomings of the data assimilation approach for CO profiles. Because of the limited vertical
Wang, Xiaolei; Qin, Li; Cao, Jingli
Since increasing acute pancreatitis (AP) severity is significantly associated with mortality, accurate and rapid determination of severity is crucial for effective clinical management. This study investigated the value of the revised Atlanta classification (RAC) and the determinant-based classification (DBC) systems in stratifying severity of acute pancreatitis. This retrospective observational cohort study included 480 AP patients. Patient demographics and clinical characteristics were recorded. The primary outcome was mortality, and secondary outcomes were admission to intensive care unit (ICU), duration of ICU stay, and duration of hospital stay. Based on the RAC classification, there were 295 patients with mild AP (MAP), 146 patients with moderate-to-severe AP (MSAP), and 39 patients with severe AP (SAP). Based on the DBC classification, there were 389 patients with MAP, 41 patients with MSAP, 32 patients with SAP, and 18 patients with critical AP (CAP). ROC curve analysis showed that the DBC system had a significantly higher accuracy at predicting organ failure compared to the RAC system (p < .001). Multivariate regression analysis showed that age and ICU stay were independent risk factors of mortality. The DBC system had a higher accuracy at predicting organ failure. Age and ICU stay were significantly associated with risk of death in AP patients. A classification of CAP by the DBC system should warrant close attention, and rapid implementation of effective measures to reduce mortality.
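The ROC curve analysis used above to compare the two classification systems reduces to the area under the ROC curve, which equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case (the rank-sum identity). A minimal numpy sketch with invented severity scores:

```python
import numpy as np

def roc_auc(y_true, scores):
    """AUC via the Mann-Whitney U identity: the probability that a random
    positive scores higher than a random negative; ties count one half."""
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    wins = (pos[:, None] > neg[None, :]).sum() + 0.5 * (pos[:, None] == neg[None, :]).sum()
    return wins / (len(pos) * len(neg))

# Toy severity scores vs. observed organ failure (1 = organ failure).
y = np.array([0, 0, 0, 0, 1, 0, 1, 1, 0, 1])
s = np.array([0.1, 0.2, 0.3, 0.35, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9])
auc = roc_auc(y, s)
```

An AUC of 0.5 is chance level and 1.0 is perfect ranking; the statement that DBC predicts organ failure with "significantly higher accuracy" corresponds to a significantly larger AUC in a comparison such as this one.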
Full Text Available In order to improve the surface ozone forecast over Beijing and surrounding regions, a data assimilation method integrated into a high-resolution regional air quality model and a regional air quality monitoring network are employed. Several advanced data assimilation strategies based on the ensemble Kalman filter are designed to adjust O3 initial conditions, NOx initial conditions and emissions, and VOCs initial conditions and emissions, separately or jointly, through assimilating ozone observations. As a result, adjusting precursor initial conditions demonstrates potential improvement of the 1-h ozone forecast almost as great as that shown by adjusting precursor emissions. Nevertheless, adjusting either precursor initial conditions or emissions shows deficiency in improving the short-term ozone forecast in suburban areas. Adjusting ozone initial values brings significant improvement to the 1-h ozone forecast, and its limitations lie in the difficulty of improving the 1-h forecast at some urban sites. A simultaneous adjustment of the above five variables is found to be able to reduce these limitations and displays an overall better performance in improving both the 1-h and 24-h ozone forecasts over these areas. The root mean square errors of the 1-h ozone forecast at urban sites and suburban sites decrease by 51% and 58%, respectively, compared with those in the free run. Through these experiments, we found that assimilating local ozone observations is decisive for the ozone forecast over the observational area, while assimilating remote ozone observations could reduce the uncertainty in regionally transported ozone.
Zamani, Reza; Akhond-Ali, Ali-Mohammad; Roozbahani, Abbas; Fattahi, Rouhollah
Water shortage and climate change are the most important issues of sustainable agricultural and water resources development. Given the importance of water availability in crop production, the present study focused on risk assessment of the climate change impact on agricultural water requirements in southwest Iran, under two emission scenarios (A2 and B1) for the future period (2025-2054). A multi-model ensemble framework based on the mean observed temperature-precipitation (MOTP) method and a combined probabilistic approach using the Long Ashton Research Station Weather Generator (LARS-WG) and change factors (CF) were used for downscaling, to manage the uncertainty of the outputs of 14 general circulation models (GCMs). The results showed increasing temperature in all months and irregular changes in precipitation (either increasing or decreasing) in the future period. In addition, the calculated annual net water requirement for all crops affected by climate change indicated an increase between 4 and 10%. Furthermore, the required water demand volume is also expected to increase. The largest and smallest expected increases in water demand volume are about 13 and 5% under the A2 and B1 scenarios, respectively. Considering these results and the limited water resources in the study area, water resources planning is crucial in order to reduce the negative effects of climate change. Therefore, climate change adaptation scenarios related to crop patterns and water consumption should be taken into account.
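The change factor (CF) downscaling mentioned above is commonly implemented as additive deltas for temperature and multiplicative ratios for precipitation, applied per calendar month. A minimal sketch with hypothetical monthly values (not the study's data):

```python
def change_factor_downscale(obs_temp, obs_prec, gcm_base_temp, gcm_fut_temp,
                            gcm_base_prec, gcm_fut_prec):
    """Change factor method, per calendar month:
    temperature: observed + (GCM future - GCM baseline)   [additive]
    precipitation: observed * (GCM future / GCM baseline) [multiplicative]
    """
    fut_temp = [o + (f - b) for o, f, b in zip(obs_temp, gcm_fut_temp, gcm_base_temp)]
    fut_prec = [o * (f / b) for o, f, b in zip(obs_prec, gcm_fut_prec, gcm_base_prec)]
    return fut_temp, fut_prec

# hypothetical January-March monthly means (temperature in C, precipitation in mm)
t, p = change_factor_downscale([10.0, 12.0, 16.0], [50.0, 40.0, 30.0],
                               [9.0, 11.0, 15.0], [11.0, 13.5, 17.0],
                               [48.0, 42.0, 33.0], [36.0, 42.0, 44.0])
```

The additive form for temperature and the ratio form for precipitation preserve the observed climatology while imposing the GCM-simulated change signal.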
Yi, Cai; Lin, Jianhui; Zhang, Weihua; Ding, Jianming
As train loads and travel speeds have increased over time, railway axle bearings have become critical elements that require more efficient non-destructive inspection and fault diagnostics methods. This paper presents a novel and adaptive procedure based on ensemble empirical mode decomposition (EEMD) and the Hilbert marginal spectrum for multi-fault diagnostics of axle bearings. EEMD overcomes the restrictive assumptions about the data, and the computational effort, that limit the application of many signal processing techniques. The outputs of this adaptive approach are the intrinsic mode functions (IMFs), which are treated with the Hilbert transform in order to obtain the Hilbert instantaneous frequency spectrum and the marginal spectrum. However, not all the IMFs obtained by the decomposition should be included in the Hilbert marginal spectrum. The IMF confidence-index algorithm proposed in this paper is fully autonomous, removing the need for selection by an experienced user and allowing the development of on-line tools. The effectiveness of the improvement is proven by the successful diagnosis of an axle bearing with a single fault or multiple composite faults, e.g., an outer ring fault, a cage fault, and a pin roller fault. PMID:25970256
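The Hilbert-spectral step applied to each IMF can be illustrated as follows. This generic sketch uses a pure sine as a stand-in for one IMF and SciPy's analytic-signal routine; the EEMD decomposition itself and the confidence-index selection are omitted:

```python
import numpy as np
from scipy.signal import hilbert

fs = 1000.0                                   # sampling rate, Hz
t = np.arange(0, 1.0, 1.0 / fs)
imf = np.sin(2 * np.pi * 50.0 * t)            # stand-in for one IMF at 50 Hz

# analytic signal -> instantaneous amplitude and frequency
analytic = hilbert(imf)
amp = np.abs(analytic)
phase = np.unwrap(np.angle(analytic))
inst_freq = np.diff(phase) * fs / (2 * np.pi)  # Hz, length N-1

# Hilbert marginal spectrum: total amplitude contributed per frequency bin,
# i.e. the instantaneous amplitude histogrammed over instantaneous frequency
bins = np.arange(0.0, 500.0, 5.0)
marginal, _ = np.histogram(inst_freq, bins=bins, weights=amp[:-1])
peak_freq = bins[np.argmax(marginal)]          # left edge of the dominant bin
```

For a multi-component signal, the same computation is run on each retained IMF and the marginal spectra are summed; bearing fault frequencies then show up as peaks.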
Pelosi, Anna; Battista Chirico, Giovanni; Van den Bergh, Joris; Vannitsem, Stephane
Forecasts from numerical weather prediction (NWP) models often suffer from both systematic and non-systematic errors. These are present in both deterministic and ensemble forecasts, and originate from various sources such as model error and subgrid variability. Statistical post-processing techniques can partly remove such errors, which is particularly important when NWP outputs concerning surface weather variables are employed for site-specific applications. Many different post-processing techniques have been developed. For deterministic forecasts, adaptive methods such as the Kalman filter are often used, which sequentially post-process the forecasts by continuously updating the correction parameters as new ground observations become available. These methods are especially valuable when long training data sets do not exist. For ensemble forecasts, well-known techniques are ensemble model output statistics (EMOS) and so-called "member-by-member" approaches (MBM). Here, we introduce a new adaptive post-processing technique for ensemble predictions. The proposed method is a sequential Kalman filtering technique that fully exploits the information content of the ensemble. One correction equation is retrieved and applied to all members; however, the parameters of the regression equations are estimated by exploiting the second-order statistics of the forecast ensemble. We compare our new method with two other techniques: a simple method that makes use of a running bias correction of the ensemble mean, and an MBM post-processing approach that rescales the ensemble mean and spread, based on minimization of the Continuous Ranked Probability Score (CRPS). We perform a verification study for the region of Campania in southern Italy. We use two years (2014-2015) of daily meteorological observations of 2-meter temperature and 10-meter wind speed from 18 ground-based automatic weather stations distributed across the region, comparing them with the corresponding COSMO
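The simple benchmark mentioned above, a running bias correction of the ensemble mean, can be sketched as a sequential exponential update. The smoothing factor and data below are hypothetical, not the COSMO configuration:

```python
def run_bias_correction(forecasts, observations, alpha=0.2):
    """Sequentially corrected forecasts.

    The bias estimate is updated only after each observation arrives,
    so every forecast is corrected with the bias known at issue time.
    """
    bias, corrected = 0.0, []
    for f, o in zip(forecasts, observations):
        corrected.append(f - bias)                     # correct first...
        bias = (1 - alpha) * bias + alpha * (f - o)    # ...then learn
    return corrected

# hypothetical ensemble-mean 2-m temperature forecasts with a constant +2 K bias
obs = [15.0, 16.0, 14.5, 15.5, 16.5, 15.0, 14.0, 15.2]
fcst = [o + 2.0 for o in obs]
corr = run_bias_correction(fcst, obs)
```

With a constant bias, the estimate converges geometrically (at rate 1 - alpha), so later corrected forecasts are much closer to the observations than early ones.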
Wang, Huiya; Zhang, Shanwen
Gene expression profiles have great potential for accurate tumor diagnosis: they are expected to enable precise and systematic diagnosis, but they also pose two challenges for machine learning, the curse of dimensionality and the small sample size problem. We propose a manifold learning based dimensionality reduction algorithm named orthogonal local discriminant embedding (O-LDE) and apply it to tumor classification. Compared with classical local discriminant embedding (LDE), O-LDE obtains an orthogonal linear projection matrix by solving an optimization problem. After being projected into a low-dimensional subspace by O-LDE, data points of the same class maintain their intrinsic neighbor relations, whereas neighboring points of different classes are kept far from each other. Experimental results on a public tumor dataset validate the effectiveness and feasibility of the proposed algorithm.
Abid Hasan, Md; Hasan, Md Kamrul; Abdul Mottalib, M
Predicting the class of gene expression profiles helps improve the diagnosis and treatment of diseases. Analysing huge gene expression datasets, otherwise known as microarray data, is complicated by their high dimensionality. Hence, traditional classifiers do not perform well when the number of features far exceeds the number of samples. A good set of features helps classifiers to classify the dataset efficiently. Moreover, a manageable set of features is also desirable for the biologist for further analysis. In this paper, we propose a linear regression-based feature selection method for selecting discriminative features. Our main focus is to classify the dataset more accurately using fewer features than other traditional feature selection methods. Our method has been compared with several other methods, and in almost every case it achieves higher classification accuracy with fewer features than the other popular feature selection methods.
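A linear regression-based feature score of the kind described can be sketched as the per-feature R² of a univariate fit of the class label on each feature, here on synthetic microarray-like data. The authors' exact scoring and selection procedure may differ; this is a generic illustration:

```python
import numpy as np

def r2_scores(X, y):
    """Per-feature R^2 of a univariate linear regression of y on the feature.

    For a single predictor, R^2 equals the squared Pearson correlation,
    so it can be computed in closed form without an explicit fit.
    """
    scores = []
    yc = y - y.mean()
    for j in range(X.shape[1]):
        x = X[:, j] - X[:, j].mean()
        denom = (x @ x) * (yc @ yc)
        scores.append(0.0 if denom == 0 else (x @ yc) ** 2 / denom)
    return np.array(scores)

rng = np.random.default_rng(1)
y = np.repeat([0.0, 1.0], 50)        # two classes, 50 samples each
X = rng.normal(size=(100, 20))       # 20 "genes", mostly noise
X[:, 3] += 3.0 * y                   # gene 3 is strongly class-informative
best = int(np.argmax(r2_scores(X, y)))
```

Ranking features by this score and keeping the top k gives a small, discriminative gene subset for a downstream classifier.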
Deans, Cameron; Griffin, Lewis D.; Marmugi, Luca; Renzoni, Ferruccio
We demonstrate identification of the position, material, orientation, and shape of objects imaged by an 85Rb atomic magnetometer performing electromagnetic induction imaging supported by machine learning. Machine learning maximizes the information extracted from the images created by the magnetometer, demonstrating the use of hidden data. Localization 2.6 times finer than the spatial resolution of the imaging system and successful classification of up to 97% are obtained. This circumvents the need to solve the inverse problem and demonstrates the extension of machine learning to diffusive systems, such as low-frequency electrodynamics in media. Automated collection of task-relevant information from quantum-based electromagnetic imaging will have a relevant impact in fields from biomedicine to security.
Shi, Wenhua; Zhang, Xiongwei; Zou, Xia; Han, Wei
In this paper, a speech enhancement method using noise classification and a Deep Neural Network (DNN) is proposed. A Gaussian mixture model (GMM) is employed to determine the noise type in speech-absent frames, and a DNN is used to model the relationship between the noisy observation and clean speech. Once the noise type is determined, the corresponding DNN model is applied to enhance the noisy speech. The GMM is trained on mel-frequency cepstrum coefficients (MFCC), with parameters estimated by an iterative expectation-maximization (EM) algorithm. The noise type is updated by spectrum entropy-based voice activity detection (VAD). Experimental results demonstrate that the proposed method achieves better objective speech quality and smaller distortion under both stationary and non-stationary conditions.
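The noise-type decision can be illustrated with a simplified stand-in: one diagonal Gaussian per noise type, selected by maximum log-likelihood. A full GMM fitted with EM, as in the paper, would replace `fit_gaussian`; all data here are synthetic MFCC-like vectors, not real features:

```python
import numpy as np

rng = np.random.default_rng(2)

def fit_gaussian(X):
    """Diagonal Gaussian per class: feature-wise mean and variance."""
    return X.mean(axis=0), X.var(axis=0) + 1e-6   # variance floor for stability

def log_lik(x, mean, var):
    """Log-likelihood of one feature vector under a diagonal Gaussian."""
    return float(-0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var))

# hypothetical 13-dimensional MFCC vectors for two noise types
babble = rng.normal(0.0, 1.0, size=(200, 13))
street = rng.normal(3.0, 1.0, size=(200, 13))
models = {"babble": fit_gaussian(babble), "street": fit_gaussian(street)}

def classify(x):
    """Pick the noise type whose model gives the highest log-likelihood."""
    return max(models, key=lambda k: log_lik(x, *models[k]))

pred = classify(np.full(13, 2.9))   # a frame resembling "street" noise
```

In the full system, the chosen label selects which noise-specific DNN is applied to the noisy frames.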
Jang, Junbong; Santamarina, J. Carlos
The 75-μm particle size is used to discriminate between fine and coarse grains. Further analysis of fine grains is typically based on the plasticity chart. Whereas pore-fluid-chemistry-dependent soil response is a salient and distinguishing characteristic of fine grains, pore-fluid chemistry is not addressed in current classification systems. Liquid limits obtained with electrically contrasting pore fluids (deionized water, 2-M NaCl brine, and kerosene) are combined to define the soil “electrical sensitivity.” Liquid limit and electrical sensitivity can be effectively used to classify fine grains according to their fluid-soil response into no-, low-, intermediate-, or high-plasticity fine grains of low, intermediate, or high electrical sensitivity. The proposed methodology benefits from the accumulated experience with liquid limit in the field and addresses the needs of a broader range of geotechnical engineering problems.
Chen, Jian-qing; Cui, Ping-yuan; Cui, Hui-tao
Crater detection and classification are critical elements for planetary mission preparations and landing site selection. This paper presents a methodology for the automated detection and matching of craters on images of planetary surfaces such as the Moon, Mars, and asteroids. Because craters are usually bowl-shaped depressions, they can be represented as circles or circular arcs during the landing phase. Based on the hypothesis that detected crater edges are related to craters in a template by translation, rotation, and scaling, the proposed matching method uses circles to fit crater edges and aligns circular arc edges from the image of the target body with circular features contained in a model. The approach includes edge detection, edge grouping, reference point detection, and geometric circle model matching. Finally, we simulate a planetary surface to test the reasonableness and effectiveness of the proposed method.
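The circle-fitting step can be sketched with the classical Kåsa least-squares fit, which linearizes the circle equation as x² + y² = 2ax + 2by + c. This is a generic method; the paper does not specify its exact fitting procedure:

```python
import numpy as np

def fit_circle(x, y):
    """Kasa least-squares circle fit.

    Writing the circle (x-a)^2 + (y-b)^2 = r^2 as x^2 + y^2 = 2ax + 2by + c
    with c = r^2 - a^2 - b^2 makes the problem linear in (a, b, c).
    """
    A = np.column_stack([2 * x, 2 * y, np.ones_like(x)])
    rhs = x ** 2 + y ** 2
    (a, b, c), *_ = np.linalg.lstsq(A, rhs, rcond=None)
    r = np.sqrt(c + a ** 2 + b ** 2)
    return a, b, r

# a partial arc, as a crater rim seen during descent would provide
theta = np.linspace(0.3, 2.0, 40)
x = 5.0 + 3.0 * np.cos(theta)     # true center (5, -2), radius 3
y = -2.0 + 3.0 * np.sin(theta)
cx, cy, r = fit_circle(x, y)
```

Because the formulation is linear, it works on partial arcs and is cheap enough to run on many grouped edge segments before the model-matching stage.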
Oliveira, E J; Oliveira Filho, O S; Santos, V S
We evaluated the genetic variation of cassava accessions based on qualitative (binomial and multicategorical) and quantitative (continuous) traits. We characterized 95 accessions obtained from the Cassava Germplasm Bank of Embrapa Mandioca e Fruticultura, evaluating them for 13 continuous, 10 binary, and 25 multicategorical traits. First, we analyzed the accessions based only on quantitative traits; next, we conducted a joint analysis (qualitative and quantitative traits) based on the Ward-MLM method, which performs clustering in two stages. According to the pseudo-F, pseudo-t2, and maximum likelihood criteria, we identified five and four groups based on the quantitative-trait and joint analyses, respectively. The smaller number of groups identified by joint analysis may be related to the nature of the data; quantitative data are more subject to environmental effects on phenotype expression, which can produce differences even in the absence of genetic ones, thereby contributing to greater apparent differentiation among accessions. For most of the accessions, the maximum probability of classification was >0.90, independent of the trait analyzed, indicating a good fit of the clustering method. Differences in clustering according to the type of data imply that analysis of quantitative and qualitative traits in cassava germplasm may explore different genomic regions. When joint analysis was used, the means and ranges of genetic distances were high, indicating that the Ward-MLM method is very useful for clustering genotypes when there are many phenotypic traits, as in the case of genetic resources and breeding programs.
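The first stage of the Ward-MLM method can be illustrated with plain Ward hierarchical clustering on quantitative traits. The MLM refinement stage and the pseudo-F/pseudo-t² group-number criteria are omitted, and the trait data below are hypothetical:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(3)

# hypothetical standardized quantitative traits for 20 accessions,
# drawn from two clearly distinct groups of 10
g1 = rng.normal(0.0, 0.3, size=(10, 5))
g2 = rng.normal(4.0, 0.3, size=(10, 5))
X = np.vstack([g1, g2])

# Ward linkage merges the pair of clusters that minimizes the increase
# in total within-cluster variance at each step
Z = linkage(X, method="ward")
labels = fcluster(Z, t=2, criterion="maxclust")   # cut the tree at 2 groups
```

In practice the number of groups is not fixed in advance; it is chosen by scanning cut levels and evaluating criteria such as pseudo-F, as the paper does.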
Liu, Lingqiao; Wang, Peng; Shen, Chunhua; Wang, Lei; Hengel, Anton van den; Wang, Chao; Shen, Heng Tao
Deriving from the gradient vector of a generative model of local features, Fisher vector coding (FVC) has been identified as an effective coding method for image classification. Most, if not all, FVC implementations employ the Gaussian mixture model (GMM) as the generative model for local features. However, the representative power of a GMM can be limited because it essentially assumes that local features can be characterized by a fixed number of feature prototypes, and the number of prototypes is usually small in FVC. To alleviate this limitation, in this work we break the convention which assumes that a local feature is drawn from one of a few Gaussian distributions. Instead, we adopt a compositional mechanism which assumes that a local feature is drawn from a Gaussian distribution whose mean vector is composed as a linear combination of multiple key components, and the combination weight is a latent random variable. In doing so, we greatly enhance the representative power of the generative model underlying FVC. To implement our idea, we design two particular generative models following this compositional approach. In our first model, the mean vector is sampled from the subspace spanned by a set of bases and the combination weight is drawn from a Laplace distribution. In our second model, we further assume that a local feature is composed of a discriminative part and a residual part. As a result, a local feature is generated by the linear combination of discriminative part bases and residual part bases. The decomposition of the discriminative and residual parts is achieved via the guidance of a pre-trained supervised coding method. By calculating the gradient vector of the proposed models, we derive two new Fisher vector coding strategies. The first is termed Sparse Coding-based Fisher Vector Coding (SCFVC) and can be used as a substitute for traditional GMM-based FVC. The second is termed Hybrid Sparse Coding-based Fisher vector coding (HSCFVC) since it
Full Text Available Existing material classification schemes were proposed to improve inventory management. However, different materials have different quality-related attributes, especially in the aircraft industry. In order to reduce cost without sacrificing quality, we propose a quality-oriented material classification system considering material quality character, quality cost, and quality influence. The Analytic Hierarchy Process helps to make feature selection and classification decisions. We use the improved Kraljic Portfolio Matrix to establish the three-dimensional classification model. The aircraft materials can be divided into eight types, including general type, key type, risk type, and leveraged type. Aiming to improve the classification accuracy for various materials, the Support Vector Machine (SVM) algorithm is introduced. Finally, we compare the SVM and a BP neural network in this application. The results show that the SVM algorithm is more efficient and accurate, and that the quality-oriented material classification is valuable.
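The Analytic Hierarchy Process step can be sketched as extracting priority weights from a pairwise comparison matrix via power iteration toward the principal eigenvector. The comparison values below (Saaty 1-9 scale, made perfectly consistent for clarity) are hypothetical:

```python
import numpy as np

def ahp_weights(M, iters=100):
    """AHP priority weights: principal eigenvector of the pairwise
    comparison matrix, found by power iteration and normalized to sum 1."""
    w = np.ones(M.shape[0]) / M.shape[0]
    for _ in range(iters):
        w = M @ w
        w /= w.sum()
    return w

# hypothetical pairwise comparisons of three criteria:
# quality character vs quality cost vs quality influence
M = np.array([[1.0,  2.0, 4.0],
              [0.5,  1.0, 2.0],
              [0.25, 0.5, 1.0]])
w = ahp_weights(M)   # expected ratios 4 : 2 : 1
```

For a perfectly consistent matrix the iteration converges immediately; real judgments are inconsistent, and AHP additionally checks a consistency ratio before accepting the weights.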
Saleh, F.; Ramaswamy, V.; Georgas, N.; Blumberg, A. F.; Wang, Y.
Advances in computational resources and modeling techniques are opening the path to effectively integrate existing complex models. In the context of flood prediction, recent extreme events have demonstrated the importance of integrating components of the hydrosystem to better represent the interactions amongst different physical processes and phenomena. As such, there is a pressing need to develop holistic and cross-disciplinary modeling frameworks that effectively integrate existing models and better represent the operative dynamics. This work presents a novel Hydrologic-Hydraulic-Hydrodynamic Ensemble (H3E) flood prediction framework that operationally integrates existing predictive models representing coastal (New York Harbor Observing and Prediction System, NYHOPS), hydrologic (US Army Corps of Engineers Hydrologic Modeling System, HEC-HMS) and hydraulic (2-dimensional River Analysis System, HEC-RAS) components. The state-of-the-art framework is forced with 125 ensemble meteorological inputs from numerical weather prediction models including the Global Ensemble Forecast System, the European Centre for Medium-Range Weather Forecasts (ECMWF), the Canadian Meteorological Centre (CMC), the Short Range Ensemble Forecast (SREF) and the North American Mesoscale Forecast System (NAM). The framework produces, within a 96-hour forecast horizon, on-the-fly Google Earth flood maps that provide critical information for decision makers and emergency preparedness managers. The utility of the framework was demonstrated by retrospectively forecasting an extreme flood event, hurricane Sandy in the Passaic and Hackensack watersheds (New Jersey, USA). Hurricane Sandy caused significant damage to a number of critical facilities in this area including the New Jersey Transit's main storage and maintenance facility. The results of this work demonstrate that ensemble based frameworks provide improved flood predictions and useful information about associated uncertainties, thus
Cisty, Milan; Hlavcova, Kamila
a prerequisite for solving some subsequent task, this bias is propagated to the subsequent modelling or other work. Therefore, for the sake of achieving more general and precise outputs while solving such tasks, the authors of the present paper propose a hybrid approach, which has the potential to obtain improved results. Although the authors continue to recommend the use of the mentioned parametric PSD models in the proposed methodology, the final prediction is made by an ensemble machine learning algorithm based on regression trees, the so-called Random Forest algorithm, which is built on top of the outputs of such models, which serve as ensemble members. An improvement in precision was demonstrated, and it is documented in the paper that the ensemble model worked better than any of its constituents. References: Nemes, A., Wosten, J.H.M., Lilly, A., Voshaar, J.H.O.: Evaluation of different procedures to interpolate particle-size distributions to achieve compatibility within soil databases. Geoderma 90, 187-202 (1999); Hwang, S.: Effect of texture on the performance of soil particle-size distribution models. Geoderma 123, 363-371 (2004); Botula, Y.D., Cornelis, W.M., Baert, G., Mafuka, P., Van Ranst, E.: Particle size distribution models for soils of the humid tropics. J Soils Sediments 13, 686-698 (2013)
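The idea of building an ensemble on top of parametric model outputs can be sketched as stacking: the members' predictions become inputs to a meta-model. The paper uses a Random Forest as the meta-model; a least-squares combiner stands in here to keep the sketch dependency-light, and all "member" predictions are synthetic:

```python
import numpy as np

rng = np.random.default_rng(4)
y = rng.normal(size=200)                    # target (e.g. a PSD fraction)
m1 = y + 0.5 + rng.normal(0, 0.4, 200)      # member A: biased parametric model
m2 = 0.7 * y + rng.normal(0, 0.3, 200)      # member B: attenuated parametric model

# stacking: fit intercept + weights over the members' outputs
A = np.column_stack([np.ones_like(y), m1, m2])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
stacked = A @ coef

def rmse(p):
    return float(np.sqrt(np.mean((p - y) ** 2)))

errs = (rmse(stacked), rmse(m1), rmse(m2))   # stacked error vs each member
```

Because each member is itself in the span of the combiner's inputs, the stacked in-sample error can never exceed that of any single member; a Random Forest meta-model additionally captures nonlinear interactions between members.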
Beegle, Amy C.
As instrumental world music ensembles such as steel pan, mariachi, gamelan and West African drums are becoming more the norm than the exception in North American school music programs, there are other world music ensembles just starting to gain popularity in particular parts of the United States. The kulintang ensemble, a drum and gong ensemble…
Amalisana, Birohmatin; Rokhmatullah; Hernina, Revi
The advantage of image classification is that it provides information on the earth's surface, such as land cover and its time-series changes. Nowadays, pixel-based image classification is commonly performed with a variety of algorithms, such as minimum distance, parallelepiped, maximum likelihood, and Mahalanobis distance. Land cover classification can also be obtained with object-based image classification, which uses image segmentation based on parameters such as scale, form, colour, smoothness, and compactness. This research aims to compare land cover classification results and change detection between the parallelepiped pixel-based and the object-based classification methods. The study area is Bogor, observed over a 20-year period from 1996 to 2016. This region is known for urban areas that change continuously due to rapid development, which makes its time-series land cover information particularly interesting.
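Among the pixel-based algorithms named above, minimum distance is the simplest: each pixel is assigned to the class whose spectral mean is nearest. A minimal sketch on a hypothetical 2-band image with two classes (the band values are invented for illustration):

```python
import numpy as np

def min_distance_classify(image, class_means):
    """Assign each pixel to the spectrally nearest class mean.

    image       : (rows, cols, bands) array of pixel spectra
    class_means : (n_classes, bands) array of training-class mean spectra
    returns     : (rows, cols) array of class indices
    """
    # broadcast to (rows, cols, n_classes, bands), then Euclidean distance
    d = np.linalg.norm(image[:, :, None, :] - class_means[None, None], axis=-1)
    return d.argmin(axis=-1)

# hypothetical class means for a 2-band sensor
means = np.array([[40.0, 90.0],    # class 0: vegetation
                  [80.0, 60.0]])   # class 1: built-up
img = np.array([[[42.0, 88.0], [79.0, 62.0]],
                [[38.0, 91.0], [83.0, 58.0]]])
labels = min_distance_classify(img, means)
```

Running the same classifier on images from two dates and differencing the label maps gives a simple pixel-based change detection.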
Jun Wan; Zehua Chen; Yingwu Chen; Zhidong Bai
Classification is an interesting problem in functional data analysis (FDA), because many science and application problems end up as classification problems, such as recognition, prediction, control, decision making, management, etc. Because of the high dimensionality and high correlation of functional data (FD), a key problem is to extract features from FD while preserving its global characteristics, on which classification efficiency and precision heavily depend. In this paper...
Mukhopadhyay, Anirban; Bandyopadhyay, Sanghamitra; Maulik, Ujjwal
With the advancement of microarray technology, it is now possible to study the expression profiles of thousands of genes across different experimental conditions or tissue samples simultaneously. Microarray cancer datasets, organized as samples versus genes fashion, are being used for classification of tissue samples into benign and malignant or their subtypes. They are also useful for identifying potential gene markers for each cancer subtype, which helps in successful diagnosis of particular cancer types. In this article, we have presented an unsupervised cancer classification technique based on multiobjective genetic clustering of the tissue samples. In this regard, a real-coded encoding of the cluster centers is used and cluster compactness and separation are simultaneously optimized. The resultant set of near-Pareto-optimal solutions contains a number of non-dominated solutions. A novel approach to combine the clustering information possessed by the non-dominated solutions through Support Vector Machine (SVM) classifier has been proposed. Final clustering is obtained by consensus among the clusterings yielded by different kernel functions. The performance of the proposed multiobjective clustering method has been compared with that of several other microarray clustering algorithms for three publicly available benchmark cancer datasets. Moreover, statistical significance tests have been conducted to establish the statistical superiority of the proposed clustering method. Furthermore, relevant gene markers have been identified using the clustering result produced by the proposed clustering method and demonstrated visually. Biological relationships among the gene markers are also studied based on gene ontology. The results obtained are found to be promising and can possibly have important impact in the area of unsupervised cancer classification as well as gene marker identification for multiple cancer subtypes.
Ceylan Koydemir, Hatice; Feng, Steve; Liang, Kyle; Nadkarni, Rohan; Tseng, Derek; Benien, Parul; Ozcan, Aydogan
Giardia lamblia causes a disease known as giardiasis, which results in diarrhea, abdominal cramps, and bloating. Although conventional pathogen detection methods used in water analysis laboratories offer high sensitivity and specificity, they are time-consuming and require experts to operate bulky equipment and analyze the samples. Here we present a field-portable and cost-effective smartphone-based waterborne pathogen detection platform that can automatically classify Giardia cysts using machine learning. Our platform enables the detection and quantification of Giardia cysts in one hour, including sample collection, labeling, filtration, and automated counting steps. We evaluated the performance of three prototypes using Giardia-spiked water samples from different sources (e.g., reagent-grade, tap, non-potable, and pond water samples). We populated a training database with >30,000 cysts and estimated our detection sensitivity and specificity using 20 different classifier models, including decision trees, nearest neighbor classifiers, support vector machines (SVMs), and ensemble classifiers, and compared their speed of training and classification, as well as predicted accuracies. Among them, cubic SVM, medium Gaussian SVM, and bagged trees were the most promising classifier types, with accuracies of 94.1%, 94.2%, and 95%, respectively; we selected the last as our preferred classifier for the detection and enumeration of Giardia cysts imaged using our mobile-phone fluorescence microscope. Without the need for any experts or microbiologists, this field-portable pathogen detection platform can be a useful tool for water quality monitoring in resource-limited settings.
Barzegar, Rahim; Fijani, Elham; Asghari Moghaddam, Asghar; Tziritis, Evangelos
Accurate prediction of groundwater level (GWL) fluctuations can play an important role in water resources management. The aims of this research are to evaluate the performance of different hybrid wavelet-group method of data handling (WA-GMDH) and wavelet-extreme learning machine (WA-ELM) models, and to combine different wavelet-based models for forecasting the GWL one, two, and three months ahead in the Maragheh-Bonab plain, NW Iran, as a case study. The research used a total of 367 monthly GWL (m) records (Sep 1985-Mar 2016), which were split into two subsets: the first 312 (85% of the total) were used for model development (training) and the remaining 55 (15% of the total) for model evaluation (testing). Stepwise selection was used to choose appropriate lag times as the inputs of the proposed models. Performance criteria such as the coefficient of determination (R2), root mean square error (RMSE) and Nash-Sutcliffe efficiency coefficient (NSC) were used to assess the efficiency of the models. The results indicated that the ELM models outperformed the GMDH models. To construct the hybrid wavelet-based models, the inputs and outputs were decomposed into sub-time series employing different maximal overlap discrete wavelet transform (MODWT) functions, namely Daubechies, Symlet, Haar and Dmeyer of different orders at level two. Subsequently, these sub-time series were fed to the GMDH and ELM models as input datasets to forecast the multi-step-ahead GWL. The wavelet-based models improved the performance of the GMDH and ELM models for multi-step-ahead GWL forecasting. To combine the advantages of different wavelets, a least squares boosting (LSBoost) algorithm was applied. The boosting multi-WA-neural network models provided the best performance for GWL forecasts in comparison with single WA-neural network-based models.
Jonsson, Leif; Borg, Markus; Broman, David; Sandahl, Kristian; Eldh, Sigrid; Runeson, Per
Bug report assignment is an important part of software maintenance. In particular, incorrect assignments of bug reports to development teams can be very expensive in large software development projects. Several studies propose automating bug assignment techniques using machine learning in open source software contexts, but no study exists for large-scale proprietary projects in industry. The goal of this study is to evaluate automated bug assignment techniques that are based on machine learni...
Full Text Available To improve crop model performance for regional crop yield estimates, a new four-dimensional variational algorithm (POD4DVar) merging the Monte Carlo and proper orthogonal decomposition techniques was introduced to develop a data assimilation strategy using the Crop Environment Resource Synthesis (CERES-Wheat) model. Two winter wheat yield estimation procedures were conducted at the field-plot and regional scales to test the feasibility and potential of the POD4DVar-based strategy. Winter wheat yield forecasts for the field plots showed a coefficient of determination (R2) of 0.73, a root mean square error (RMSE) of 319 kg/ha, and a relative error (RE) of 3.49%. An acceptable yield at the regional scale was estimated with an R2 of 0.997, an RMSE of 7346 tons, and an RE of 3.81%. The POD4DVar-based strategy was more accurate and efficient than the EnKF-based strategy. In addition to crop yield, other critical crop variables such as biomass, harvest index, evapotranspiration, and soil organic carbon may also be estimated. The present study thus introduces a promising approach for operationally monitoring regional crop growth and predicting yield. Successful application of this assimilation model at regional scales must focus on uncertainties derived from the crop model, model inputs, the data assimilation algorithm, and the assimilated observations.
Arna van Engelen
Full Text Available Background: Histology sections provide accurate information on atherosclerotic plaque composition, and are used in various applications. To our knowledge, no automated systems for plaque component segmentation in histology sections currently exist. Materials and Methods: We perform pixel-wise classification of fibrous, lipid, and necrotic tissue in Elastica van Gieson-stained histology sections, using features based on color channel intensity and local image texture and structure. We compare an approach where we train on independent data to an approach where we train on one or two sections per specimen in order to segment the remaining sections. We evaluate the results on segmentation accuracy in histology, and we use the obtained histology segmentations to train plaque component classification methods in ex vivo magnetic resonance imaging (MRI) and in vivo MRI and computed tomography (CT). Results: In leave-one-specimen-out experiments on 176 histology slices of 13 plaques, a pixel-wise accuracy of 75.7 ± 6.8% was obtained. This increased to 77.6 ± 6.5% when two manually annotated slices of the specimen to be segmented were used for training. Rank correlations of relative component volumes with manually annotated volumes were high in this situation (P = 0.82-0.98). Using the obtained histology segmentations to train plaque component classification methods in ex vivo MRI and in vivo MRI and CT resulted in similar image segmentations for training on the automated histology segmentations as for training on a fully manual ground truth. The size of the lipid-rich necrotic core was significantly smaller when training on fully automated histology segmentations than when manually annotated histology sections were used. This difference was reduced and not statistically significant when one or two slices per section were manually annotated for histology segmentation. Conclusions: Good histology segmentations can be obtained by automated segmentation
Background: Proteins have the fundamental ability to selectively bind to other molecules and perform specific functions through such interactions, such as protein-ligand binding. Accurate prediction of the protein residues that physically bind to ligands is important for drug design and protein docking studies. Most successful protein-ligand binding predictions have been based on known structures. However, structural information is largely unavailable in practice due to the huge gap between the number of known protein sequences and that of experimentally solved structures
Quantitative analysis of the risk of reservoir real-time operation is a hard task owing to the difficulty of accurately describing inflow uncertainties. Ensemble-based hydrologic forecasts depict not only the marginal distributions of the inflows but also their persistence via scenarios. This motivates us to analyze the reservoir real-time operating risk with ensemble-based hydrologic forecasts as inputs. A method is developed by using the forecast horizon point to divide the future time into two stages, the forecast lead-time and the unpredicted time. The risk within the forecast lead-time is computed by counting the number of forecast scenarios that fail, and the risk in the unpredicted time is estimated using reservoir routing with the design floods and the reservoir water levels at the forecast horizon point. As a result, a two-stage risk analysis method is set up to quantify the entire flood risk, defined as the ratio of the number of scenarios that exceed the critical value to the total number of scenarios. China's Three Gorges Reservoir (TGR) is selected as a case study, where parameter and precipitation uncertainties are implemented to produce ensemble-based hydrologic forecasts. Bayesian inference via Markov Chain Monte Carlo is used to account for the parameter uncertainty. Two reservoir operation schemes, the real operated and the scenario optimization, are evaluated for flood risks and hydropower profits. With the 2010 flood, it is found that improving the hydrologic forecast accuracy does not necessarily decrease the reservoir real-time operation risk, and most risks come from the forecast lead-time. It is therefore valuable to reduce the variance of ensemble-based hydrologic forecasts with less bias for reservoir operational purposes.
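The lead-time stage of the two-stage method reduces to counting scenarios that exceed the critical value. A minimal sketch of that counting step (function names, the toy ensemble, and the critical level are illustrative, not taken from the study):

```python
import numpy as np

def lead_time_risk(forecast_scenarios, critical_level):
    """Risk within the forecast lead-time: the fraction of ensemble
    scenarios whose peak inflow exceeds the critical value."""
    scenarios = np.asarray(forecast_scenarios, dtype=float)  # (n_scenarios, lead_steps)
    exceed = scenarios.max(axis=1) > critical_level          # one flag per scenario
    return exceed.mean()

# Toy ensemble: 4 inflow scenarios over a 3-step lead time
ens = [[100, 120, 110],
       [100, 180, 160],   # exceeds the critical level
       [ 90,  95, 100],
       [100, 200, 190]]   # exceeds the critical level
risk = lead_time_risk(ens, critical_level=150)   # 2 of 4 scenarios exceed
```

The unpredicted-time stage would instead require reservoir routing with design floods, which depends on site-specific rating curves and is not sketched here.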
Jiang, Jing; Lu, Chunming; Peng, Danling; Zhu, Chaozhe; Howell, Peter
Among the non-fluencies seen in speech, some are more typical (MT) of stuttering speakers, whereas others are less typical (LT) and are common to both stuttering and fluent speakers. No neuroimaging work has evaluated the neural basis for grouping these symptom types. Another long-debated issue is which type (LT, MT) whole-word repetitions (WWR) should be placed in. In this study, a sentence completion task was performed by twenty stuttering patients who were scanned using an event-related design. This task elicited stuttering in these patients. Each stuttered trial from each patient was sorted into the MT or LT types with WWR put aside. Pattern classification was employed to train a patient-specific single trial model to automatically classify each trial as MT or LT using the corresponding fMRI data. This model was then validated by using test data that were independent of the training data. In a subsequent analysis, the classification model, just established, was used to determine which type the WWR should be placed in. The results showed that the LT and the MT could be separated with high accuracy based on their brain activity. The brain regions that made the largest contribution to the separation of the types were: the left inferior frontal cortex and bilateral precuneus, both of which showed higher activity in the MT than in the LT; and the left putamen and right cerebellum which showed the opposite activity pattern. The results also showed that the brain activity for WWR was more similar to that of the LT and fluent speech than to that of the MT. These findings provide a neurological basis for separating the MT and the LT types, and support the widely-used MT/LT symptom grouping scheme. In addition, WWR play a similar role to the LT, and thus should be placed in the LT type. PMID:22761887
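As a rough illustration of the patient-specific single-trial classification with independent validation data, the sketch below trains a linear classifier on synthetic "trial" feature vectors and scores it on a held-out split. Everything here (feature counts, class separation, the choice of a linear SVM) is a stand-in for the study's real fMRI activity patterns and classifier:

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_trials, n_voxels = 80, 50
X_mt = rng.normal( 0.8, 1.0, (n_trials, n_voxels))  # synthetic MT trials
X_lt = rng.normal(-0.8, 1.0, (n_trials, n_voxels))  # synthetic LT trials
X = np.vstack([X_mt, X_lt])
y = np.array([1] * n_trials + [0] * n_trials)       # 1 = MT, 0 = LT

# Hold out a validation split that is independent of the training data
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
clf = LinearSVC().fit(X_tr, y_tr)
acc = clf.score(X_te, y_te)                          # accuracy on unseen trials
```

Once such a model is validated, new unlabeled trials (the WWR trials in the study) can be passed to `clf.predict` to see which class they fall into.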
Full Text Available Most existing methods for sequence-based classification use exhaustive feature generation, employing, for example, all k-mer patterns. The motivation behind such (enumerative) approaches is to minimize the potential for overlooking important features. However, there are shortcomings to this strategy. First, practical constraints limit the scope of exhaustive feature generation to patterns of length ≤ k, such that potentially important, longer (> k) predictors are not considered. Second, features so generated exhibit strong dependencies, which can complicate understanding of derived classification rules. Third, and most importantly, numerous irrelevant features are created. These concerns can compromise prediction and interpretation. While remedies have been proposed, they tend to be problem-specific and not broadly applicable. Here, we develop a generally applicable methodology, and an attendant software pipeline, that is predicated on discriminatory motif finding. In addition to the traditional training and validation partitions, our framework entails a third level of data partitioning, a discovery partition. A discriminatory motif finder is used on sequences and associated class labels in the discovery partition to yield a (small) set of features. These features are then used as inputs to a classifier in the training partition. Finally, performance assessment occurs on the validation partition. Important attributes of our approach are its modularity (any discriminatory motif finder and any classifier can be deployed) and its universality (all data, including sequences that are unaligned and/or of unequal length, can be accommodated). We illustrate our approach on two nucleosome occupancy datasets and a protein solubility dataset, previously analyzed using enumerative feature generation. Our method achieves excellent performance results, with and without optimization of classifier tuning parameters. A Python pipeline implementing the approach is
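The three-level data partitioning (discovery for motif finding, training for classifier fitting, validation for assessment) can be sketched as follows; the fractions and function name are illustrative, not taken from the pipeline itself:

```python
import numpy as np

def three_way_split(n, frac_discovery=0.3, frac_training=0.4, seed=0):
    """Partition n sample indices into discovery (motif finding),
    training (classifier fitting), and validation (assessment) sets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    n_disc = int(frac_discovery * n)
    n_train = int(frac_training * n)
    return (idx[:n_disc],
            idx[n_disc:n_disc + n_train],
            idx[n_disc + n_train:])

disc, train, valid = three_way_split(100)
```

The key property is that the motif finder never sees the training or validation samples, so features discovered on `disc` do not leak label information into the downstream accuracy estimate.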
Full Text Available We consider the problem of predicting sensitivity of cancer cell lines to new drugs based on supervised learning on genomic profiles. The genetic and epigenetic characterization of a cell line provides observations on various aspects of regulation including DNA copy number variations, gene expression, DNA methylation and protein abundance. To extract relevant information from the various data types, we applied a random forest based approach to generate sensitivity predictions from each type of data and combined the predictions in a linear regression model to generate the final drug sensitivity prediction. Our approach when applied to the NCI-DREAM drug sensitivity prediction challenge was a top performer among 47 teams and produced high accuracy predictions. Our results show that the incorporation of multiple genomic characterizations lowered the mean and variance of the estimated bootstrap prediction error. We also applied our approach to the Cancer Cell Line Encyclopedia database for sensitivity prediction and the ability to extract the top targets of an anti-cancer drug. The results illustrate the effectiveness of our approach in predicting drug sensitivity from heterogeneous genomic datasets.
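A hedged sketch of the two-stage combination: one random-forest learner per genomic data type, then a linear regression stacking the per-type predictions into the final sensitivity estimate. The data and data-type names are synthetic stand-ins, and for brevity the stage-1 predictions are made in-sample rather than out-of-bag as a careful pipeline would do:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n = 120
sensitivity = rng.normal(size=n)                        # synthetic drug response
data_types = {                                          # two stand-in genomic views
    "expression":  sensitivity[:, None] + rng.normal(0, 0.5, (n, 10)),
    "methylation": sensitivity[:, None] + rng.normal(0, 0.8, (n, 10)),
}

# Stage 1: one random forest per data type, each producing its own prediction
stage1 = np.column_stack([
    RandomForestRegressor(n_estimators=50, random_state=0)
        .fit(X, sensitivity).predict(X)
    for X in data_types.values()
])

# Stage 2: linear regression combines the per-type predictions
stacker = LinearRegression().fit(stage1, sensitivity)
r2 = stacker.score(stage1, sensitivity)
```

The linear stage lets more informative data types receive larger weights, which is consistent with the abstract's observation that combining characterizations lowered the bootstrap prediction error.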
Chakraborty, Sourav; Mondal, Snehasish; Bhowmick, Sourav; Ma, Jianqiu; Tan, Hongwei; Neogi, Subhadip; Das, Neeladri
Preparation and characterization of two new triptycene-based polytopic Pt(II) organometallic complexes are reported. These complexes have three trans-bromobis(trialkylphosphine)platinum(II) units directly attached to the central triptycene unit. These organoplatinum complexes were converted to the corresponding nitrate salts for subsequent use in self-assembly reactions. Characterization of these organometallic triptycene complexes by multinuclear NMR, FTIR, mass spectrometry and elemental analyses is described. The molecular structure of one of the organoplatinum triptycene tripods was determined by single-crystal X-ray crystallography. The potential utility of these organometallic tritopic acceptors as building blocks in the construction of metallasupramolecular cages containing the triptycene motif is explored. Additionally, for the first time, 3,3'-bipyridine has been used as a flexible donor tecton for self-assembly of discrete and finite metallacages using triptycene-based tritopic organometallic acceptor units. Triptycene motif containing supramolecules were characterized by multinuclear NMR (including (1)H DOSY), mass spectrometry and elemental analyses. Geometry of each supramolecular framework was optimized by employing the PM6 semiempirical molecular orbital method to predict its shape and size.
Jaiswal, Neeru; Kishtawal, C. M.; Bhomia, Swati
The southwest (SW) monsoon season (June, July, August and September) is the major period of rainfall over the Indian region. The present study focuses on the development of a new multi-model ensemble approach based on a similarity criterion (SMME) for the prediction of SW monsoon rainfall in the extended range. This approach is based on the assumption that training with similar types of conditions may provide better forecasts than the sequential training used in conventional MME approaches. In this approach, the training dataset was selected by matching the present-day condition to the archived dataset; the days with the most similar conditions were identified and used for training the model. The coefficients thus generated were used for the rainfall prediction. The precipitation forecasts from four general circulation models (GCMs), viz. European Centre for Medium-Range Weather Forecasts (ECMWF), United Kingdom Meteorological Office (UKMO), National Centre for Environment Prediction (NCEP) and China Meteorological Administration (CMA), have been used for developing the SMME forecasts. The forecasts of 1-5, 6-10 and 11-15 days were generated using the newly developed approach for each pentad of June-September during the years 2008-2013, and the skill of the model was analysed using verification scores, viz. equitable threat score (ETS), mean absolute error (MAE), Pearson's correlation coefficient and the Nash-Sutcliffe model efficiency index. Statistical analysis of the SMME forecasts shows superior forecast skill compared to the conventional MME and the individual models for all the pentads, viz. 1-5, 6-10 and 11-15 days.
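The core of the similarity criterion, selecting the archived days closest to the present-day condition as the training set, can be sketched as below. The use of Euclidean distance, the two-predictor state vectors, and all names are assumptions for illustration:

```python
import numpy as np

def select_similar_days(current_state, archive_states, k=5):
    """Pick the k archived days most similar (Euclidean distance)
    to the present-day condition, to serve as the training set."""
    d = np.linalg.norm(archive_states - current_state, axis=1)
    return np.argsort(d)[:k]

# 10 archived days described by 2 predictors each
archive = np.arange(20, dtype=float).reshape(10, 2)
today = np.array([6.5, 7.5])
train_idx = select_similar_days(today, archive, k=3)
```

Regression coefficients would then be fitted only on the rows `archive[train_idx]`, instead of on the most recent days as in sequential-training MME.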
Qi, Wei; Liu, Junguo; Yang, Hong; Sweetapple, Chris
Global precipitation products are very important datasets in flow simulations, especially in poorly gauged regions. Uncertainties resulting from precipitation products, hydrological models and their combinations vary with time and data magnitude, and undermine their application to flow simulations. However, previous studies have not quantified these uncertainties individually and explicitly. This study developed an ensemble-based dynamic Bayesian averaging approach (e-Bay) for deterministic discharge simulations using multiple global precipitation products and hydrological models. In this approach, the joint probability of precipitation products and hydrological models being correct is quantified based on uncertainties in maximum and mean estimation, posterior probability is quantified as functions of the magnitude and timing of discharges, and the law of total probability is implemented to calculate expected discharges. Six global fine-resolution precipitation products and two hydrological models of different complexities are included in an illustrative application. e-Bay can effectively quantify uncertainties and therefore generate better deterministic discharges than traditional approaches (weighted average methods with equal and varying weights and maximum likelihood approach). The mean Nash-Sutcliffe Efficiency values of e-Bay are up to 0.97 and 0.85 in training and validation periods respectively, which are at least 0.06 and 0.13 higher than traditional approaches. In addition, with increased training data, assessment criteria values of e-Bay show smaller fluctuations than traditional approaches and its performance becomes outstanding. The proposed e-Bay approach bridges the gap between global precipitation products and their pragmatic applications to discharge simulations, and is beneficial to water resources management in ungauged or poorly gauged regions across the world.
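The final step of the e-Bay approach, obtaining the expected discharge via the law of total probability, reduces to a posterior-weighted average over the precipitation-product/model combinations. A minimal sketch with illustrative numbers (the real posterior weights vary with discharge magnitude and timing, which is not modeled here):

```python
import numpy as np

def expected_discharge(member_discharges, posterior_probs):
    """Law of total probability: the expected discharge is the
    posterior-weighted average over product/model combinations."""
    p = np.asarray(posterior_probs, dtype=float)
    p = p / p.sum()                       # normalise the posterior weights
    return float(np.dot(p, member_discharges))

# Three hypothetical product/model combinations and their posteriors
q = expected_discharge([100.0, 140.0, 120.0], [0.2, 0.5, 0.3])
```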
Velazquez, Emmanuel Rios; Aerts, Hugo J. W. L.; Gu, Yuhua; Goldgof, Dmitry B.; De Ruysscher, Dirk; Dekker, Andre; Korn, René; Gillies, Robert J.; Lambin, Philippe
Purpose To assess the clinical relevance of a semiautomatic CT-based ensemble segmentation method, by comparing it to pathology and to CT/PET manual delineations by five independent radiation oncologists in non-small cell lung cancer (NSCLC). Materials and Methods For twenty NSCLC patients (stage Ib – IIIb) the primary tumor was delineated manually on CT/PET scans by five independent radiation oncologists and segmented using a CT based semi-automatic tool. Tumor volume and overlap fractions between manual and semiautomatic-segmented volumes were compared. All measurements were correlated with the maximal diameter on macroscopic examination of the surgical specimen. Imaging data is available on www.cancerdata.org. Results High overlap fractions were observed between the semi-automatically segmented volumes and the intersection (92.5 ± 9.0, mean ± SD) and union (94.2 ± 6.8) of the manual delineations. No statistically significant differences in tumor volume were observed between the semiautomatic segmentation (71.4 ± 83.2 cm3, mean ± SD) and manual delineations (81.9 ± 94.1 cm3; p = 0.57). The maximal tumor diameter of the semiautomatic-segmented tumor correlated strongly with the macroscopic diameter of the primary tumor (r = 0.96). Conclusion Semiautomatic segmentation of the primary tumor on CT demonstrated high agreement with CT/PET manual delineations and strongly correlated with the macroscopic diameter considered the “gold standard”. This method may be used routinely in clinical practice and could be employed as a starting point for treatment planning, target definition in multi-center clinical trials or for high throughput data mining research. This method is particularly suitable for peripherally located tumors. PMID:23157978
Hibbett, David; Abarenkov, Kessy; Kõljalg, Urmas; Öpik, Maarja; Chai, Benli; Cole, James; Wang, Qiong; Crous, Pedro; Robert, Vincent; Helgason, Thorunn; Herr, Joshua R; Kirk, Paul; Lueschow, Shiloh; O'Donnell, Kerry; Nilsson, R Henrik; Oono, Ryoko; Schoch, Conrad; Smyth, Christopher; Walker, Donald M; Porras-Alfaro, Andrea; Taylor, John W; Geiser, David M
Fungal taxonomy and ecology have been revolutionized by the application of molecular methods and both have increasing connections to genomics and functional biology. However, data streams from traditional specimen- and culture-based systematics are not yet fully integrated with those from metagenomic and metatranscriptomic studies, which limits understanding of the taxonomic diversity and metabolic properties of fungal communities. This article reviews current resources, needs, and opportunities for sequence-based classification and identification (SBCI) in fungi as well as related efforts in prokaryotes. To realize the full potential of fungal SBCI it will be necessary to make advances in multiple areas. Improvements in sequencing methods, including long-read and single-cell technologies, will empower fungal molecular ecologists to look beyond ITS and current shotgun metagenomics approaches. Data quality and accessibility will be enhanced by attention to data and metadata standards and rigorous enforcement of policies for deposition of data and workflows. Taxonomic communities will need to develop best practices for molecular characterization in their focal clades, while also contributing to globally useful datasets including ITS. Changes to nomenclatural rules are needed to enable valid publication of sequence-based taxon descriptions. Finally, cultural shifts are necessary to promote adoption of SBCI and to accord professional credit to individuals who contribute to community resources.
durability, availability and rational use of available resources. This would also help in avoiding ... have used different classifiers like MLP-BP-ANN, Pearson correlation, energy value, and SVM and achieved classification ..... best trade-off between classification accuracy and computational time. The RF classifier needs.
In this paper mathematical modeling method has been used in order to simulating of power quality events. In regarding to excellent neural network function in classification and pattern recognition works, multilayer neural network has been used for events classification. The STFT and Discrete Wavelet Transform are used for ...
Abril Valeria Uriarte-Arcia
Full Text Available The ever-increasing data generation confronts us with the problem of handling massive amounts of information online. One of the biggest challenges is how to extract valuable information from these massive continuous data streams during a single scan. In a data stream context, data arrive continuously at high speed; therefore the algorithms developed to address this context must be efficient regarding memory and time management, and capable of detecting changes over time in the underlying distribution that generated the data. This work describes a novel method for the task of pattern classification over a continuous data stream based on an associative model. The proposed method is based on the Gamma classifier, which is inspired by the Alpha-Beta associative memories; both are supervised pattern recognition models. The proposed method is capable of handling the space and time constraints inherent to data stream scenarios. The Data Streaming Gamma classifier (DS-Gamma classifier) implements a sliding window approach to provide concept drift detection and a forgetting mechanism. In order to test the classifier, several experiments were performed using different data stream scenarios with real and synthetic data streams. The experimental results show that the method exhibits competitive performance when compared to other state-of-the-art algorithms.
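A toy illustration of the sliding-window forgetting mechanism. This is not the Gamma classifier itself (an associative model); a nearest-neighbour rule on one-dimensional inputs stands in for it, and the class and window size are invented for the example:

```python
from collections import deque

class SlidingWindowNN:
    """Toy stream classifier: keeps only the last `window` labelled
    instances (the forgetting mechanism) and predicts by nearest
    neighbour, so old concepts are dropped as the stream drifts."""
    def __init__(self, window=100):
        self.buf = deque(maxlen=window)   # old items fall off automatically

    def learn(self, x, y):
        self.buf.append((x, y))

    def predict(self, x):
        _, label = min(self.buf, key=lambda item: abs(item[0] - x))
        return label

clf = SlidingWindowNN(window=3)
for x, y in [(0.0, "a"), (1.0, "a"), (10.0, "b"), (11.0, "b")]:
    clf.learn(x, y)
# window=3, so (0.0, "a") has been forgotten by now
pred = clf.predict(9.0)
```

Because `deque(maxlen=...)` discards the oldest entries, a distribution shift in the stream only influences predictions for as long as the window retains pre-shift examples.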
Lubis, A. S.; Muis, Z. A.; Hastuty, I. P.; Siregar, I. M.
Factors that must be considered in soil compaction works are the type of soil material, field control, maintenance, and the availability of funds. These problems raised the idea of how to estimate the density of the soil with an implementation system that is proper, fast, and economical. This study aims to estimate the compaction parameters, i.e. the maximum dry unit weight (γdmax) and the optimum water content (wopt), based on soil classification. Thirty samples were each tested for index properties and compaction. All data from the laboratory test results were used to estimate the compaction parameter values using linear regression and the Goswami Model. The soil types were A-4, A-6, and A-7 according to AASHTO, and SC, SC-SM, and CL based on USCS. By linear regression, the estimated maximum dry unit weight is (γdmax*) = 1.862 - 0.005*FINES - 0.003*LL, and the estimated optimum water content is (wopt*) = -0.607 + 0.362*FINES + 0.161*LL. By the Goswami Model (with equation Y = m*LogG + k), the maximum dry unit weight (γdmax*) is estimated with m = -0.376 and k = 2.482, and the optimum water content (wopt*) with m = 21.265 and k = -32.421. For both of these equations a 95% confidence interval was obtained.
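The fitted equations can be applied directly. The sketch below renders the decimal commas of the text as decimal points and assumes base-10 log in the Goswami model (Y = m*LogG + k); the input values are illustrative:

```python
import math

def gamma_dmax_linear(fines, ll):
    """Estimated maximum dry unit weight from the fitted linear regression."""
    return 1.862 - 0.005 * fines - 0.003 * ll

def w_opt_linear(fines, ll):
    """Estimated optimum water content from the fitted linear regression."""
    return -0.607 + 0.362 * fines + 0.161 * ll

def goswami(G, m, k):
    """Goswami model: Y = m * log10(G) + k."""
    return m * math.log10(G) + k

# Illustrative soil: 40% fines, liquid limit 30
g = gamma_dmax_linear(fines=40.0, ll=30.0)   # 1.862 - 0.200 - 0.090
w = w_opt_linear(fines=40.0, ll=30.0)
```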
There is a trend of growing interest and demand for greater access of unmanned aircraft (UA) to the National Airspace System (NAS) as the ongoing development of UA technology has created the potential for significant economic benefits. However, the lack of a comprehensive and efficient UA regulatory framework has constrained the number and kinds of UA operations that can be performed. This report presents initial results of a study aimed at defining a safety-risk-based UA classification as a plausible basis for a regulatory framework for UA operating in the NAS. Much of the study up to this point has been at a conceptual high level. The report includes a survey of contextual topics, analysis of safety risk considerations, and initial recommendations for a risk-based approach to safe UA operations in the NAS. The next phase of the study will develop and leverage deeper clarity and insight into practical engineering and regulatory considerations for ensuring that UA operations have an acceptable level of safety.
Boschetto, Davide; Grisan, Enrico
Chromoendoscopy (CH) is a gastroenterology imaging modality that involves the staining of tissues with methylene blue, which reacts with the internal walls of the gastrointestinal tract, improving the visual contrast in mucosal surfaces and thus enhancing a doctor's ability to screen precancerous lesions or early cancer. This technique helps identify areas that can be targeted for biopsy or treatment and in this work we will focus on gastric cancer detection. Gastric chromoendoscopy for cancer detection has several taxonomies available, one of which classifies CH images into three classes (normal, metaplasia, dysplasia) based on color, shape and regularity of pit patterns. Computer-assisted diagnosis is desirable to help us improve the reliability of the tissue classification and abnormalities detection. However, traditional computer vision methodologies, mainly segmentation, do not translate well to the specific visual characteristics of a gastroenterology imaging scenario. We propose the exploitation of a first unsupervised segmentation via superpixels, which groups pixels into perceptually meaningful atomic regions, used to replace the rigid structure of the pixel grid. For each superpixel, a set of features is extracted and then fed to a random forest based classifier, which computes a model used to predict the class of each superpixel. The average general accuracy of our model is 92.05% in the pixel domain (86.62% in the superpixel domain), while detection accuracies on the normal and abnormal class are respectively 85.71% and 95%. Eventually, the class of the whole image can be predicted through a majority vote over the superpixels' predicted classes.
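The final majority-vote step over the per-superpixel predictions is straightforward; a minimal sketch with invented labels:

```python
from collections import Counter

def image_class_by_majority(superpixel_labels):
    """Whole-image class = majority vote over per-superpixel predictions."""
    return Counter(superpixel_labels).most_common(1)[0][0]

# Hypothetical per-superpixel predictions for one CH image
label = image_class_by_majority(
    ["normal", "metaplasia", "metaplasia", "normal", "metaplasia"])
```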
Full Text Available Motivation. Independent Components Analysis (ICA) maximizes the statistical independence of the representational components of a training gene expression profile (GEP) ensemble, but it cannot distinguish relations between different factors, or different modes, and it is not applicable to high-order GEP data mining. In order to generalize ICA, we introduce Multilinear-ICA and apply it to tumor classification using high-order GEP. Firstly, we introduce the basic concepts and operations of tensors, and describe the Support Vector Machine (SVM) classifier and Multilinear-ICA. Secondly, the higher-scoring genes of the original high-order GEP are selected using t-statistics and tabulated as tensors. Thirdly, the tensors are processed by Multilinear-ICA. Finally, the SVM is used to classify the tumor subtypes. Results. To show the validity of the proposed method, we apply it to tumor classification using high-order GEP. Though we only use three datasets, the experimental results show that the method is effective and feasible. Through this survey, we hope to gain some insight into the problem of high-order GEP tumor classification, to aid the further development of more effective tumor classification algorithms.
Kim, Ok-Yeon; Chan, Johnny C. L.
This study aims to predict the seasonal TC track density over the South Pacific by combining the Asia-Pacific Economic Cooperation (APEC) Climate Center (APCC) multi-model ensemble (MME) dynamical prediction system with a statistical model. The hybrid dynamical-statistical model is developed for each of the three clusters that represent major groups of TC best tracks in the South Pacific. The cross validation result from the MME hybrid model demonstrates moderate but statistically significant skills to predict TC numbers across all TC clusters, with correlation coefficients of 0.4 to 0.6 between the hindcasts and observations for 1982/1983 to 2008/2009. The prediction skill in the area east of about 170°E is significantly influenced by strong El Niño, whereas the skill in the southwest Pacific region mainly comes from the linear trend of TC number. The prediction skill of TC track density is particularly high in the region where there is climatological high TC track density around the area 160°E-180° and 20°S. Since this area has a mixed response with respect to ENSO, the prediction skill of TC track density is higher in non-ENSO years compared to that in ENSO years. Even though the cross-validation prediction skill is higher in the area east of about 170°E compared to other areas, this region shows less skill for track density based on the categorical verification due to huge influences by strong El Niño years. While prediction skill of the developed methodology varies across the region, it is important that the model demonstrates skill in the area where TC activity is high. Such a result has an important practical implication—improving the accuracy of seasonal forecast and providing communities at risk with advanced information which could assist with preparedness and disaster risk reduction.
Parasuraman, Kamban; Elshorbagy, Amin
Uncertainty analysis is starting to be widely acknowledged as an integral part of hydrological modeling. The conventional treatment of uncertainty analysis in hydrologic modeling is to assume a deterministic model structure, and treat its associated parameters as imperfectly known, thereby neglecting the uncertainty associated with the model structure. In this paper, a modeling framework that can explicitly account for the effect of model structure uncertainty has been proposed. The modeling framework is based on initially generating different realizations of the original data set using a non-parametric bootstrap method, and then exploiting the ability of the self-organizing algorithms, namely genetic programming, to evolve their own model structure for each of the resampled data sets. The resulting ensemble of models is then used to quantify the uncertainty associated with the model structure. The performance of the proposed modeling framework is analyzed with regards to its ability in characterizing the evapotranspiration process at the Southwest Sand Storage facility, located near Ft. McMurray, Alberta. Eddy-covariance-measured actual evapotranspiration is modeled as a function of net radiation, air temperature, ground temperature, relative humidity, and wind speed. Investigating the relation between model complexity, prediction accuracy, and uncertainty, two sets of experiments were carried out by varying the level of mathematical operators that can be used to define the predictand-predictor relationship. While the first set uses just the additive operators, the second set uses both the additive and the multiplicative operators to define the predictand-predictor relationship. The results suggest that increasing the model complexity may lead to better prediction accuracy but at an expense of increasing uncertainty. Compared to the model parameter uncertainty, the relative contribution of model structure uncertainty to the predictive uncertainty of a model is
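The bootstrap-plus-ensemble idea can be illustrated on synthetic data. Here a randomly chosen polynomial degree stands in for the genetic-programming-evolved model structure, so the spread of the ensemble's predictions reflects both resampling and structure choice; all data and settings are invented for the sketch:

```python
import numpy as np

rng = np.random.default_rng(42)
x = np.linspace(0, 1, 60)
y = 2.0 * x + rng.normal(0, 0.1, x.size)       # synthetic "observations"

preds = []
for b in range(200):
    idx = rng.integers(0, x.size, x.size)      # non-parametric bootstrap resample
    deg = rng.choice([1, 2, 3])                # stand-in for an evolved structure
    coef = np.polyfit(x[idx], y[idx], deg)     # fit this realisation's model
    preds.append(np.polyval(coef, 0.5))        # predict at x = 0.5

ensemble_mean = float(np.mean(preds))
structure_uncertainty = float(np.std(preds))   # spread across the ensemble
```

The standard deviation across the ensemble is the quantity of interest: unlike parameter uncertainty under a fixed structure, it also captures disagreement between competing model structures.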
Musolino, M.; Meneghini, M.; Scarparo, L.; De Santi, C.; Tahraoui, A.; Geelhaar, L.; Zanoni, E.; Riechert, H.
III-N nanowires (NWs) are an attractive alternative to conventional planar layers as the basis for light-emitting diodes (LEDs). In fact, the NW geometry enables the growth of (In,Ga)N/GaN heterostructures with high indium content and without extended defects regardless of the substrate. Despite these conceptual advantages, the NW-LEDs so far reported often exhibit higher leakage currents and higher turn-on voltages than the planar LEDs. In this work, we investigate the mechanisms responsible for the unusually high leakage currents in (In,Ga)N/GaN LEDs based on self-induced NW ensembles grown by molecular beam epitaxy on Si substrates. The temperature-dependent current-voltage (I-V) characteristics, acquired between 83 and 403 K, reveal that temperatures higher than 240 K may activate a further conduction process, which is not present in the low temperature range. Deep level transient spectroscopy (DLTS) measurements show the presence of electron traps, which are activated in the same temperature interval. A detailed analysis of the DLTS signal reveals the presence of two distinct deep levels with apparent activation energies close to Ec-570 meV and Ec-840 meV, and capture cross sections of about 1.0x10-15 cm2 and 2x10-14 cm2, respectively. These results suggest that the leakage process might be related to trap-assisted tunneling, possibly produced by point defects located in the core and/or on the sidewalls of the NWs.
Boykin, P Oscar; Mor, Tal; Roychowdhury, Vwani; Vatan, Farrokh
In ensemble (or bulk) quantum computation, all computations are performed on an ensemble of computers rather than on a single computer. Measurements of qubits in an individual computer cannot be performed; instead, only expectation values (over the complete ensemble of computers) can be measured. As a result of this limitation on the model of computation, many algorithms cannot be processed directly on such computers and must be modified, as the common strategy of delaying the measurements usually does not resolve this ensemble-measurement problem. Here we present several new strategies for resolving this problem. Based on these strategies, we provide new versions of some of the most important quantum algorithms, versions that are suitable for implementation on ensemble quantum computers, e.g., on liquid NMR quantum computers. These algorithms are Shor's factorization algorithm, Grover's search algorithm (with several marked items), and an algorithm for quantum fault-tolerant computation. The first two algorithms are modified using randomizing and sorting strategies. For the last algorithm, we develop a classical-quantum hybrid strategy for removing measurements. We use it to present a novel quantum fault-tolerant scheme. More explicitly, we present schemes for fault-tolerant, measurement-free implementation of the Toffoli and σ_z^(1/4) gates, as these operations cannot be implemented "bitwise", and their standard fault-tolerant implementations require measurement.
In our previous study, we compared the performance of a number of widely used discrimination methods for classifying ovarian cancer using Matrix Assisted Laser Desorption Ionization (MALDI) mass spectrometry data on serum samples acquired in Reflectron mode. Our results demonstrated good performance with a random forest classifier. In this follow-up study, to improve the molecular classification power of the MALDI platform for ovarian cancer, we expanded the mass range of the MS data by adding data acquired in Linear mode and evaluated the resultant decrease in classification error. A general statistical framework is proposed to obtain unbiased classification error estimates and to analyze the effects of sample size and the number of selected m/z features on classification errors. We also emphasize the importance of combining biological knowledge and statistical analysis to obtain both biologically and statistically sound results. Our study shows improvement in classification accuracy upon expanding the mass range of the analysis. In order to obtain the best classification accuracies possible, we found that a relatively large training sample size is needed to obviate sample variations. For the ovarian MS dataset that is the focus of the current study, our results show that approximately 20-40 m/z features are needed to achieve the best classification accuracy from MALDI-MS analysis of sera. Supplementary information can be found at http://bioinformatics.med.yale.edu/proteomics/BioSupp2.html.
Ensemble clustering (EC) can arise in data assimilation with ensemble square root filters (EnSRFs) using non-linear models: an M-member ensemble splits into a single outlier and a cluster of M−1 members. The stochastic Ensemble Kalman Filter does not present this problem. Modifications to the EnSRFs by periodic resampling of the ensemble through random rotations have been proposed to address it. We introduce a metric to quantify the presence of EC and present evidence to dispel the notion that EC leads to filter failure. Starting from a univariate model, we show that EC is not a permanent but a transient phenomenon; it occurs intermittently in non-linear models. We perform a series of data assimilation experiments using a standard EnSRF and an EnSRF modified by resampling through random rotations. The modified EnSRF alleviates issues associated with EC, but at the cost of the traceability of individual ensemble trajectories, and it cannot use some of the algorithms that enhance the performance of the standard EnSRF. In the non-linear regimes of low-dimensional models, the analysis root mean square error of the standard EnSRF slowly grows with ensemble size if the size is larger than the dimension of the model state. However, we do not observe this problem in a more complex model that uses an ensemble size much smaller than the dimension of the model state, along with inflation and localisation. Overall, we find that transient EC does not handicap the performance of the standard EnSRF.
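A mean- and covariance-preserving random rotation of the ensemble perturbations, the kind of resampling the modified EnSRF relies on, can be sketched as follows. This is an illustrative construction, not the specific scheme used in the study: the rotation is built to leave the ones vector invariant, so the perturbations stay centred on the ensemble mean.

```python
import numpy as np

def resample_by_rotation(ensemble, rng):
    """Randomly rotate ensemble perturbations without changing the
    ensemble mean or the sample covariance.

    ensemble: (n, M) array holding M members of an n-dimensional state.
    """
    n, M = ensemble.shape
    mean = ensemble.mean(axis=1, keepdims=True)
    perts = ensemble - mean
    # Orthonormal basis of the subspace orthogonal to the ones vector.
    ones = np.ones((M, 1)) / np.sqrt(M)
    U = np.linalg.svd(np.eye(M) - ones @ ones.T)[0][:, : M - 1]
    # Random orthogonal (M-1)x(M-1) block via QR of a Gaussian matrix.
    q, _ = np.linalg.qr(rng.normal(size=(M - 1, M - 1)))
    Q = U @ q @ U.T + ones @ ones.T    # orthogonal, and Q @ ones = ones
    return mean + perts @ Q

rng = np.random.default_rng(7)
ens = rng.normal(size=(4, 10))         # 4-dimensional state, 10 members
rotated = resample_by_rotation(ens, rng)
```

Because the rotation acts only within the zero-mean subspace of the members, individual trajectories are shuffled (losing traceability, as noted above) while the filter statistics are untouched.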
The NCMRWF Global Forecast System with ETR (Ensemble Transform with Rescaling) based Global Ensemble Forecast System (GEFS) of resolution T-190L28 is investigated. The experiment is conducted for a period of one week in June 2013 and forecast ...
Brzezinski, Dariusz; Stefanowski, Jerzy
Data stream mining has been receiving increased attention due to its presence in a wide range of applications, such as sensor networks, banking, and telecommunication. One of the most important challenges in learning from data streams is reacting to concept drift, i.e., unforeseen changes of the stream's underlying data distribution. Several classification algorithms that cope with concept drift have been put forward; however, most of them specialize in one type of change. In this paper, we propose a new data stream classifier, called the Accuracy Updated Ensemble (AUE2), which aims at reacting equally well to different types of drift. AUE2 combines accuracy-based weighting mechanisms known from block-based ensembles with the incremental nature of Hoeffding Trees. The proposed algorithm is experimentally compared with 11 state-of-the-art stream methods, including single classifiers, block-based and online ensembles, and hybrid approaches, in different drift scenarios. Out of all the compared algorithms, AUE2 provided the best average classification accuracy while proving to be less memory-consuming than the other ensemble approaches. Experimental results show that AUE2 can be considered suitable for scenarios involving many types of drift as well as for static environments.
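A minimal sketch of the accuracy-based block weighting that AUE2 borrows from block-based ensembles is given below. It is a hypothetical simplification: members are simple class-prototype classifiers rather than Hoeffding Trees, and the weighting rule (newest member at full weight, older members weighted by accuracy on the latest block, weakest member dropped) only loosely mirrors AUE2's actual formulas.

```python
import numpy as np

class AccuracyWeightedEnsemble:
    """Toy block-based ensemble with accuracy-based member weighting."""

    def __init__(self, max_members=5):
        self.max_members = max_members
        self.members = []     # each member: {class_label: prototype vector}
        self.weights = []

    def _train_member(self, X, y):
        return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

    def _predict_member(self, member, X):
        classes = list(member)
        protos = np.array([member[c] for c in classes])
        d = ((X[:, None, :] - protos[None, :, :]) ** 2).sum(axis=-1)
        return np.array(classes)[d.argmin(axis=1)]

    def update(self, X, y):
        # Re-weight existing members by their accuracy on the newest block.
        self.weights = [max((self._predict_member(m, X) == y).mean(), 1e-3)
                        for m in self.members]
        self.members.append(self._train_member(X, y))
        self.weights.append(1.0)               # newest member: full weight
        if len(self.members) > self.max_members:
            drop = int(np.argmin(self.weights))  # discard the weakest member
            self.members.pop(drop)
            self.weights.pop(drop)

    def predict(self, X):
        votes = {}
        for m, w in zip(self.members, self.weights):
            pred = self._predict_member(m, X)
            for c in np.unique(pred):
                votes.setdefault(c, np.zeros(len(X)))
                votes[c] += w * (pred == c)
        classes = sorted(votes)
        stacked = np.array([votes[c] for c in classes])
        return np.array(classes)[stacked.argmax(axis=0)]

# Demo on a synthetic stream of blocks (no drift, for brevity).
rng = np.random.default_rng(0)

def make_block(shift):
    X = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)),
                   rng.normal(shift, 1.0, size=(50, 2))])
    y = np.array([0] * 50 + [1] * 50)
    return X, y

ens = AccuracyWeightedEnsemble(max_members=3)
for _ in range(3):
    Xb, yb = make_block(4.0)
    ens.update(Xb, yb)
Xt, yt = make_block(4.0)
accuracy = (ens.predict(Xt) == yt).mean()
```

The re-weighting on every new block is what lets such ensembles down-weight members trained before a drift, which is the behaviour AUE2 refines with its incremental Hoeffding Tree members.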
Elsawy, Amr S; Eldawlatly, Seif; Taher, Mohamed; Aly, Gamal M
The current trend to use Brain-Computer Interfaces (BCIs) with mobile devices mandates the development of efficient EEG data processing methods. In this paper, we demonstrate the performance of a Principal Component Analysis (PCA) ensemble classifier for P300-based spellers. We recorded EEG data from multiple subjects using the Emotiv neuroheadset in the context of a classical oddball P300 speller paradigm. We compare the performance of the proposed ensemble classifier to the performance of traditional feature extraction and classifier methods. Our results demonstrate the capability of the PCA ensemble classifier to classify P300 data recorded using the Emotiv neuroheadset with an average accuracy of 86.29% on cross-validation data. In addition, offline testing of the recorded data reveals an average classification accuracy of 73.3% that is significantly higher than that achieved using traditional methods. Finally, we demonstrate the effect of the parameters of the P300 speller paradigm on the performance of the method.
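One way to realize a PCA ensemble classifier is sketched below, under assumptions, since the paper's exact member construction is not reproduced: each member applies PCA to a bootstrap sample of feature vectors and classifies by distance to class prototypes in the reduced space, with member votes summed.

```python
import numpy as np

rng = np.random.default_rng(1)

def fit_pca(X, k):
    """Top-k principal components of X via SVD."""
    mu = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mu, full_matrices=False)
    return mu, vt[:k]

def project(X, mu, comps):
    return (X - mu) @ comps.T

def train_ensemble(X, y, n_members=7, k=4):
    """Each member: PCA on a bootstrap sample + class-prototype classifier.
    (Hypothetical member construction, for illustration only.)"""
    members = []
    for _ in range(n_members):
        idx = rng.integers(0, len(X), len(X))
        mu, comps = fit_pca(X[idx], k)
        F = project(X[idx], mu, comps)
        protos = {c: F[y[idx] == c].mean(axis=0) for c in np.unique(y[idx])}
        members.append((mu, comps, protos))
    return members

def predict(members, X):
    """Sum of member votes; classes assumed to be 0 (non-target) and 1."""
    score = np.zeros(len(X))
    for mu, comps, protos in members:
        F = project(X, mu, comps)
        d0 = np.linalg.norm(F - protos[0], axis=1)
        d1 = np.linalg.norm(F - protos[1], axis=1)
        score += np.sign(d0 - d1)       # +1 is a vote for class 1
    return (score > 0).astype(int)

# Demo on synthetic two-class feature vectors (not real EEG epochs).
X0 = rng.normal(0.0, 1.0, size=(100, 20))
X1 = rng.normal(0.0, 1.0, size=(100, 20))
X1[:, :3] += 3.0                        # class-discriminative dimensions
X = np.vstack([X0, X1])
y = np.array([0] * 100 + [1] * 100)
members = train_ensemble(X, y)
accuracy = (predict(members, X) == y).mean()
```

Averaging votes over members trained on different resamples is the standard way such an ensemble gains robustness to the trial-to-trial variability typical of consumer-grade headsets like the Emotiv.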
Saleh, F.; Ramaswamy, V.; Wang, Y.; Georgas, N.; Blumberg, A.; Pullen, J.
Estuarine regions can experience compound impacts from coastal storm surge and riverine flooding. The challenges in forecasting flooding in such areas are multi-faceted due to uncertainties associated with meteorological drivers and interactions between hydrological and coastal processes. The objective of this work is to evaluate how uncertainties from meteorological predictions propagate through an ensemble-based flood prediction framework and translate into uncertainties in simulated inundation extents. A multi-scale framework, consisting of hydrologic, coastal and hydrodynamic models, was used to simulate two extreme flood events at the confluence of the Passaic and Hackensack rivers and Newark Bay. The events were Hurricane Irene (2011), a combination of inland flooding and coastal storm surge, and Hurricane Sandy (2012) where coastal storm surge was the dominant component. The hydrodynamic component of the framework was first forced with measured streamflow and ocean water level data to establish baseline inundation extents with the best available forcing data. The coastal and hydrologic models were then forced with meteorological predictions from 21 ensemble members of the Global Ensemble Forecast System (GEFS) to retrospectively represent potential future conditions up to 96 hours prior to the events. Inundation extents produced by the hydrodynamic model, forced with the 95th percentile of the ensemble-based coastal and hydrologic boundary conditions, were in good agreement with baseline conditions for both events. The USGS reanalysis of Hurricane Sandy inundation extents was encapsulated between the 50th and 95th percentile of the forecasted inundation extents, and that of Hurricane Irene was similar but with caveats associated with data availability and reliability. This work highlights the importance of accounting for meteorological uncertainty to represent a range of possible future inundation extents at high resolution (∼m).
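Deriving percentile boundary conditions from an ensemble reduces to a per-time-step percentile across members. A toy illustration with synthetic hydrographs follows (the member count of 21 matches GEFS; all values are made up):

```python
import numpy as np

rng = np.random.default_rng(42)

# 21 ensemble members (as in GEFS), 96-hour synthetic river hydrographs.
n_members, n_hours = 21, 96
base = 200.0 + 50.0 * np.sin(np.linspace(0.0, np.pi, n_hours))   # m^3/s
members = base * rng.lognormal(mean=0.0, sigma=0.3, size=(n_members, 1))

# Percentile boundary conditions, computed per time step across members.
p50 = np.percentile(members, 50, axis=0)   # median forcing
p95 = np.percentile(members, 95, axis=0)   # upper-bound forcing
```

Forcing the hydrodynamic model with the 50th- and 95th-percentile series, as done above for Irene and Sandy, brackets the plausible inundation extents rather than committing to a single deterministic forecast.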
Personal recognition using palm-vein patterns has emerged as a promising alternative for human recognition because of its uniqueness, stability, live-body identification, flexibility, and resistance to spoofing. With the expanding application of palm-vein pattern recognition, the corresponding growth of the database has resulted in long response times. To shorten the response time of identification, this paper proposes a simple and useful classification for palm-vein identification based on principal direction features. In the registration process, the Gaussian-Radon transform is adopted to extract the orientation matrix and then compute the principal direction of a palm-vein image based on the orientation matrix. The database can be classified into six bins based on the value of the principal direction. In the identification process, the principal direction of the test sample is first extracted to ascertain the corresponding bin. One-by-one matching with the training samples is then performed within the bin. To improve recognition efficiency while maintaining good recognition accuracy, the two neighboring bins of the corresponding bin are additionally searched to identify the input palm-vein image. Evaluation experiments were conducted on three different databases, namely, PolyU, CASIA, and the database of this study. Experimental results show that the searching range of one test sample in the PolyU, CASIA, and our database can be reduced to 14.29%, 14.50%, and 14.28% by the proposed method, with retrieval accuracies of 96.67%, 96.00%, and 97.71%, respectively. With 10,000 training samples in the database, the execution time of the identification process by the traditional method is 18.56 s, while that by the proposed approach is 3.16 s. The experimental results confirm that the proposed approach is more efficient than the traditional method, especially for a large database.
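The bin-plus-neighbours search strategy can be sketched as follows. The uniform 6-bin partition of the principal direction and the wrap-around at 180° are assumptions for illustration; the paper's actual bin boundaries are not reproduced.

```python
N_BINS = 6
BIN_WIDTH = 180.0 / N_BINS    # principal directions assumed in [0, 180)

def assign_bin(direction_deg):
    """Map a principal direction to one of the six bins."""
    return int(direction_deg // BIN_WIDTH) % N_BINS

def candidate_bins(bin_idx):
    """The bin itself plus its two neighbours (wrapping at 180 degrees)."""
    return {(bin_idx - 1) % N_BINS, bin_idx, (bin_idx + 1) % N_BINS}

def identify(test_direction, database):
    """database: list of (sample_id, principal_direction_deg) pairs.
    Only samples in the candidate bins are matched one by one."""
    bins = candidate_bins(assign_bin(test_direction))
    return [sid for sid, d in database if assign_bin(d) in bins]

# Demo with a toy database (hypothetical directions).
db = [("s1", 10.0), ("s2", 100.0), ("s3", 170.0), ("s4", 95.0)]
hits = identify(12.0, db)     # searches bins {5, 0, 1} only
```

Restricting one-by-one matching to the candidate bins is what turns the full-database scan into the reduced search range the paper reports.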
Tamás, F. D.
A qualitative identification method to determine the origin (i.e. manufacturing factory) of Spanish clinkers is described. The classification of clinkers produced in different factories can be based on their trace element content. Fifteen clinker samples, collected from 11 Spanish cement factories, were analysed to determine their Mg, Sr, Ba, Mn, Ti, Zr, Zn and V content. An expert system formulated as a binary decision tree was designed based on the collected data. The performance of the obtained classifier was measured by ten-fold cross-validation. The results show that the proposed method yields an easy-to-use expert system that is able to determine the origin of a clinker based on its trace element content.
The present work describes the procedure for the qualitative identification of Spanish clinkers with the aim of determining their origin (factory). This classification of the clinkers is based on their trace element content. Fifteen different clinkers from 11 Spanish cement factories were analysed, determining their Mg, Sr, Ba, Mn, Ti, Zr, Zn and V contents. An expert system based on a binary decision tree was designed from the collected data. The resulting classification was evaluated by ten-fold cross-validation. The results show that the proposed model makes it possible to build, in a straightforward manner, an expert system capable of determining the origin of a clinker from its trace element content.
A method for improving radar-derived quantitative precipitation estimation is proposed. Tropical vertical profiles of reflectivity (VPRs) are first determined from multiple VPRs. Upon identifying a tropical VPR, the event can be further classified as either tropical-stratiform or tropical-convective rainfall by a fuzzy logic (FL) algorithm. Based on the precipitation-type fields, the reflectivity values are converted into rainfall rates using a Z-R relationship. In order to evaluate the performance of this rainfall classification scheme, three experiments were conducted using three months of data and two study cases. In Experiment I, the Weather Surveillance Radar-1988 Doppler (WSR-88D) default Z-R relationship was applied. In Experiment II, the precipitation regime was separated into convective and stratiform rainfall using the FL algorithm, and the corresponding Z-R relationships were used. In Experiment III, the precipitation regime was separated into convective, stratiform, and tropical rainfall, and the corresponding Z-R relationships were applied. The results show that the rainfall rates obtained from all three experiments match closely with the gauge observations, although Experiment II reduced the underestimation observed in Experiment I. Experiment III further reduced this underestimation and generated the most accurate radar estimates of rain rate among the three experiments.
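The Z-R conversion step is a simple power-law inversion. The coefficient pairs below are standard published relationships (the WSR-88D default Z = 300R^1.4 and the Rosenfeld tropical Z = 250R^1.2), which may differ from those tuned in the study; note that the tropical relationship yields a higher rain rate at the same reflectivity, consistent with correcting underestimation.

```python
def rain_rate(z_dbz, a, b):
    """Invert the power law Z = a * R**b for rain rate R in mm/h,
    where Z (mm^6 m^-3) = 10**(dBZ / 10)."""
    z_linear = 10.0 ** (z_dbz / 10.0)
    return (z_linear / a) ** (1.0 / b)

# The same 40 dBZ echo under two published relationships:
r_default = rain_rate(40.0, a=300.0, b=1.4)    # WSR-88D default Z = 300 R^1.4
r_tropical = rain_rate(40.0, a=250.0, b=1.2)   # Rosenfeld tropical Z = 250 R^1.2
```

Switching the (a, b) pair per precipitation-type field, as the FL classification allows, is exactly what distinguishes Experiments II and III from the single-relationship Experiment I.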
Cook, Jack A; Davit, Barbara M; Polli, James E
The Biopharmaceutics Classification System (BCS) is employed to waive in vivo bioequivalence testing (i.e. provide "biowaivers") for new and generic drugs that are BCS class I. Granting biowaivers under systems such as the BCS eliminates unnecessary drug exposures to healthy subjects and provides economic relief, while maintaining the high public health standard for therapeutic equivalence. International scientific consensus suggests class III drugs are also eligible for biowaivers. The objective of this study was to estimate the economic impact of class I BCS-based biowaivers, along with the economic impact of a potential expansion to BCS class III. Methods consider the distribution of drugs across the four BCS classes, numbers of in vivo bioequivalence studies performed from a five year period, and effects of highly variable drugs (HVDs). Results indicate that 26% of all drugs are class I non-HVDs, 7% are class I HVDs, 27% are class III non-HVDs, and 3% are class III HVDs. An estimated 66 to 76 million dollars can be saved each year in clinical study costs if all class I compounds were granted biowaivers. Between 21 and 24 million dollars of this savings is from HVDs. If BCS class III compounds were also granted waivers, an additional direct savings of 62 to 71 million dollars would be realized, with 9 to 10 million dollars coming from HVDs.
Skriver, Henning; Dierking, Wolfgang
Polarimetric SAR images acquired at C- and L-band over sea ice in the Greenland Sea, Baltic Sea, and Beaufort Sea have been analysed with respect to their potential for ice type classification. The polarimetric data were gathered by the Danish EMISAR and the US AIRSAR, which are both airborne systems. A hierarchical classification scheme was chosen for sea ice because our knowledge about magnitudes, variations, and dependences of sea ice signatures can be directly considered. The optimal sequence of