WorldWideScience

Sample records for univariate feature-selection methods

  1. EEG feature selection method based on decision tree.

    Duan, Lijuan; Ge, Hui; Ma, Wei; Miao, Jun

    2015-01-01

    This paper aims to solve the automated feature selection problem in brain-computer interfaces (BCI). To automate the feature selection process, we propose a novel EEG feature selection method based on a decision tree (DT). During electroencephalogram (EEG) signal processing, a feature extraction method based on principal component analysis (PCA) was used, and the selection process based on the decision tree was performed by searching the feature space and automatically selecting optimal features. Considering that EEG signals are a series of non-linear signals, a generalized linear classifier named support vector machine (SVM) was chosen. To test the validity of the proposed method, we applied the EEG feature selection method based on decision tree to BCI Competition II dataset Ia, and the experiment showed encouraging results.

  2. A Comparative Study of Feature Selection and Classification Methods for Gene Expression Data

    Abusamra, Heba

    2013-01-01

    Different experiments have been applied to compare the performance of the classification methods with and without performing feature selection. Results revealed the important role of feature selection in classifying gene expression data. By performing feature selection, the classification accuracy can be significantly boosted by using a small number of genes. The relationship of features selected in different feature selection methods is investigated and the most frequent features selected in each fold among all methods for both datasets are evaluated.

  3. Principal Feature Analysis: A Multivariate Feature Selection Method for fMRI Data

    Lijun Wang

    2013-01-01

    Brain decoding with functional magnetic resonance imaging (fMRI) requires analysis of complex, multivariate data. Multivoxel pattern analysis (MVPA) has been widely used in recent years. MVPA treats the activation of multiple voxels from fMRI data as a pattern and decodes brain states using pattern classification methods. Feature selection is a critical procedure of MVPA because it decides which features will be included in the classification analysis of fMRI data, thereby improving the performance of the classifier. Features can be selected by limiting the analysis to specific anatomical regions or by computing univariate (voxel-wise) or multivariate statistics. However, these methods either discard some informative features or select features with redundant information. This paper introduces the principal feature analysis as a novel multivariate feature selection method for fMRI data processing. This multivariate approach aims to remove features with redundant information, thereby selecting fewer features, while retaining the most information.

  4. Orthogonal feature selection method. [For preprocessing of mass spectral data]

    Kowalski, B R [Univ. of Washington, Seattle]; Bender, C F

    1976-01-01

    A new method of preprocessing spectral data for extraction of molecular structural information is desired. This SELECT method generates orthogonal features that are important for classification purposes and that also retain their identity to the original measurements. A brief introduction to chemical pattern recognition is presented. A brief description of the method and an application to mass spectral data analysis follow. (BLM)

  5. Toward optimal feature selection using ranking methods and classification algorithms

    Novaković Jasmina

    2011-01-01

    We presented a comparison between several feature ranking methods used on two real datasets. We considered six ranking methods that can be divided into two broad categories: statistical and entropy-based. Four supervised learning algorithms are adopted to build models, namely, IB1, Naive Bayes, C4.5 decision tree and the RBF network. We showed that the selection of ranking methods could be important for classification accuracy. In our experiments, ranking methods with different supervised learning algorithms give quite different results for balanced accuracy. Our cases confirm that, in order to be sure that a subset of features giving the highest accuracy has been selected, the use of many different indices is recommended.

  6. Linear feature selection in texture analysis - A PLS based method

    Marques, Joselene; Igel, Christian; Lillholm, Martin

    2013-01-01

    We present a texture analysis methodology that combined uncommitted machine-learning techniques and partial least square (PLS) in a fully automatic framework. Our approach introduces a robust PLS-based dimensionality reduction (DR) step to specifically address outliers and high-dimensional feature...... and considering all CV groups, the methods selected 36 % of the original features available. The diagnosis evaluation reached a generalization area-under-the-ROC curve of 0.92, which was higher than established cartilage-based markers known to relate to OA diagnosis....

  7. GAIN RATIO BASED FEATURE SELECTION METHOD FOR PRIVACY PRESERVATION

    R. Praveena Priyadarsini

    2011-04-01

    Privacy preservation is a step in data mining that tries to safeguard sensitive information from unsanctioned disclosure, thereby protecting individual data records and their privacy. There are various privacy preservation techniques, such as k-anonymity, l-diversity, t-closeness and data perturbation. In this paper, the k-anonymity privacy protection technique is applied to high-dimensional datasets such as adult and census. Since both datasets are high dimensional, a feature subset selection method, Gain Ratio, is applied: the attributes of the datasets are ranked, and low-ranking attributes are filtered out to form new reduced data subsets. The k-anonymization privacy preservation technique is then applied to the reduced datasets. The privacy-preserved reduced datasets and the original datasets are compared for their accuracy on two functionalities of data mining, namely classification and clustering, using the naïve Bayesian and k-means algorithms respectively. Experimental results show that classification and clustering accuracy are comparatively the same for the reduced k-anonymized datasets and the original datasets.
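
    The attribute-ranking step described in this record can be sketched as follows. This is a minimal illustration of the standard gain ratio (information gain divided by split information) on discrete attributes, not the authors' implementation; the attribute names and toy values are hypothetical.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def gain_ratio(feature, labels):
    """Gain ratio of one discrete feature: information gain / split information."""
    n = len(labels)
    conditional = 0.0  # H(labels | feature)
    for value in set(feature):
        subset = [y for x, y in zip(feature, labels) if x == value]
        conditional += len(subset) / n * entropy(subset)
    info_gain = entropy(labels) - conditional
    split_info = entropy(feature)
    return info_gain / split_info if split_info > 0 else 0.0


def rank_features(columns, labels):
    """Return (name, gain ratio) pairs sorted from most to least informative."""
    scores = {name: gain_ratio(col, labels) for name, col in columns.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy data: "education" separates the classes, "noise" does not.
columns = {
    "noise": ["a", "b", "a", "b"],
    "education": ["hi", "hi", "lo", "lo"],
}
labels = [1, 1, 0, 0]
ranking = rank_features(columns, labels)
print(ranking)
```

    Low-ranking attributes at the tail of `ranking` would then be dropped before anonymization; the cut-off is a design choice the abstract does not specify.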

  8. A Comparative Study of Feature Selection and Classification Methods for Gene Expression Data

    Abusamra, Heba

    2013-05-01

    Microarray technology has enriched the study of gene expression in such a way that scientists are now able to measure the expression levels of thousands of genes in a single experiment. Microarray gene expression data gained great importance in recent years due to its role in disease diagnoses and prognoses which help to choose the appropriate treatment plan for patients. This technology has shifted a new era in molecular classification. Interpreting gene expression data remains a difficult problem and an active research area due to their native nature of “high dimensional low sample size”. Such problems pose great challenges to existing classification methods. Thus, effective feature selection techniques are often needed in this case to aid to correctly classify different tumor types and consequently lead to a better understanding of genetic signatures as well as improve treatment strategies. This thesis aims on a comparative study of state-of-the-art feature selection methods, classification methods, and the combination of them, based on gene expression data. We compared the efficiency of three different classification methods including: support vector machines, k-nearest neighbor and random forest, and eight different feature selection methods, including: information gain, twoing rule, sum minority, max minority, gini index, sum of variances, t-statistics, and one-dimension support vector machine. Five-fold cross validation was used to evaluate the classification performance. Two publicly available gene expression data sets of glioma were used for this study. Different experiments have been applied to compare the performance of the classification methods with and without performing feature selection. Results revealed the important role of feature selection in classifying gene expression data. By performing feature selection, the classification accuracy can be significantly boosted by using a small number of genes. The relationship of features selected in

  9. A Comparative Study of Feature Selection and Classification Methods for Gene Expression Data of Glioma

    Abusamra, Heba

    2013-11-01

    Microarray gene expression data gained great importance in recent years due to its role in disease diagnoses and prognoses which help to choose the appropriate treatment plan for patients. This technology has shifted a new era in molecular classification. Interpreting gene expression data remains a difficult problem and an active research area due to their native nature of “high dimensional low sample size”. Such problems pose great challenges to existing classification methods. Thus, effective feature selection techniques are often needed in this case to aid to correctly classify different tumor types and consequently lead to a better understanding of genetic signatures as well as improve treatment strategies. This paper aims on a comparative study of state-of-the-art feature selection methods, classification methods, and the combination of them, based on gene expression data. We compared the efficiency of three different classification methods including: support vector machines, k-nearest neighbor and random forest, and eight different feature selection methods, including: information gain, twoing rule, sum minority, max minority, gini index, sum of variances, t-statistics, and one-dimension support vector machine. Five-fold cross validation was used to evaluate the classification performance. Two publicly available gene expression data sets of glioma were used in the experiments. Results revealed the important role of feature selection in classifying gene expression data. By performing feature selection, the classification accuracy can be significantly boosted by using a small number of genes. The relationship of features selected in different feature selection methods is investigated and the most frequent features selected in each fold among all methods for both datasets are evaluated.

  10. A Comparative Study of Feature Selection and Classification Methods for Gene Expression Data of Glioma

    Abusamra, Heba

    2013-01-01

    Microarray gene expression data gained great importance in recent years due to its role in disease diagnoses and prognoses which help to choose the appropriate treatment plan for patients. This technology has shifted a new era in molecular classification. Interpreting gene expression data remains a difficult problem and an active research area due to their native nature of “high dimensional low sample size”. Such problems pose great challenges to existing classification methods. Thus, effective feature selection techniques are often needed in this case to aid to correctly classify different tumor types and consequently lead to a better understanding of genetic signatures as well as improve treatment strategies. This paper aims on a comparative study of state-of-the-art feature selection methods, classification methods, and the combination of them, based on gene expression data. We compared the efficiency of three different classification methods including: support vector machines, k-nearest neighbor and random forest, and eight different feature selection methods, including: information gain, twoing rule, sum minority, max minority, gini index, sum of variances, t-statistics, and one-dimension support vector machine. Five-fold cross validation was used to evaluate the classification performance. Two publicly available gene expression data sets of glioma were used in the experiments. Results revealed the important role of feature selection in classifying gene expression data. By performing feature selection, the classification accuracy can be significantly boosted by using a small number of genes. The relationship of features selected in different feature selection methods is investigated and the most frequent features selected in each fold among all methods for both datasets are evaluated.

  11. Which DTW Method Applied to Marine Univariate Time Series Imputation

    Phan, Thi-Thu-Hong; Caillault, Émilie; Lefebvre, Alain; Bigand, André

    2017-01-01

    Missing data are ubiquitous in any domains of applied sciences. Processing datasets containing missing values can lead to a loss of efficiency and unreliable results, especially for large missing sub-sequence(s). Therefore, the aim of this paper is to build a framework for filling missing values in univariate time series and to perform a comparison of different similarity metrics used for the imputation task. This allows to suggest the most suitable methods for the imp...

  12. A Feature Selection Method for Large-Scale Network Traffic Classification Based on Spark

    Yong Wang

    2016-02-01

    Currently, with the rapid increase of data scales in network traffic classification, how to select traffic features efficiently is becoming a big challenge. Although a number of traditional feature selection methods using the Hadoop-MapReduce framework have been proposed, the execution time was still unsatisfactory because of the numerous iterative computations during processing. To address this issue, an efficient feature selection method for network traffic based on a new parallel computing framework called Spark is proposed in this paper. In our approach, the complete feature set is first preprocessed based on Fisher score, and a sequential forward search strategy is employed for subsets. The optimal feature subset is then selected using the continuous iterations of the Spark computing framework. The implementation demonstrates that, while maintaining classification accuracy, our method reduces the time cost of modeling and classification, and significantly improves the execution efficiency of feature selection.
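
    The Fisher-score preprocessing mentioned in this record can be illustrated with a small single-machine sketch. It assumes the common definition of the score (between-class scatter of the class means divided by the size-weighted within-class variance); the abstract does not state the exact formula or the Spark implementation, and the function names and toy data are hypothetical.

```python
from statistics import fmean, pvariance


def fisher_score(values, labels):
    """Fisher score of one feature: between-class scatter / within-class scatter."""
    overall_mean = fmean(values)
    between = 0.0
    within = 0.0
    for cls in set(labels):
        group = [v for v, y in zip(values, labels) if y == cls]
        between += len(group) * (fmean(group) - overall_mean) ** 2
        within += len(group) * pvariance(group)
    return between / within if within > 0 else float("inf")


def top_k_by_fisher(columns, labels, k):
    """Return the indices of the k highest-scoring features and all scores."""
    scores = [fisher_score(col, labels) for col in columns]
    order = sorted(range(len(columns)), key=lambda j: scores[j], reverse=True)
    return order[:k], scores


# Hypothetical toy data: feature 1 separates the two classes, feature 0 does not.
columns = [
    [1.0, 3.0, 1.2, 3.2],
    [1.0, 1.2, 3.0, 3.2],
]
labels = [0, 0, 1, 1]
kept, scores = top_k_by_fisher(columns, labels, k=1)
print(kept, scores)
```

    In a pipeline like the one described, the features kept by such a filter would then be passed to the sequential forward search.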

  13. An input feature selection method applied to fuzzy neural networks for signal estimation

    Na, Man Gyun; Sim, Young Rok

    2001-01-01

    It is well known that the performance of a fuzzy neural network strongly depends on the input features selected for its training. In its applications to sensor signal estimation, there are a large number of input variables related to an output. As the number of input variables increases, the required training time of fuzzy neural networks increases exponentially. Thus, it is essential to reduce the number of inputs to a fuzzy neural network and to select the optimum number of mutually independent inputs that are able to clearly define the input-output mapping. In this work, principal component analysis (PCA), genetic algorithms (GA) and probability theory are combined to select new important input features. The proposed feature selection method is applied to the signal estimation of the steam generator water level, the hot-leg flowrate, the pressurizer water level and the pressurizer pressure sensors in pressurized water reactors and compared with other input feature selection methods.

  14. Feature Selection Methods for Zero-Shot Learning of Neural Activity

    Carlos A. Caceres

    2017-06-01

    Dimensionality poses a serious challenge when making predictions from human neuroimaging data. Across imaging modalities, large pools of potential neural features (e.g., responses from particular voxels, electrodes, and temporal windows) have to be related to typically limited sets of stimuli and samples. In recent years, zero-shot prediction models have been introduced for mapping between neural signals and semantic attributes, which allows for classification of stimulus classes not explicitly included in the training set. While choices about feature selection can have a substantial impact when closed-set accuracy, open-set robustness, and runtime are competing design objectives, no systematic study of feature selection for these models has been reported. Instead, a relatively straightforward feature stability approach has been adopted and successfully applied across models and imaging modalities. To characterize the tradeoffs in feature selection for zero-shot learning, we compared correlation-based stability to several other feature selection techniques on comparable data sets from two distinct imaging modalities: functional Magnetic Resonance Imaging and Electrocorticography. While most of the feature selection methods resulted in similar zero-shot prediction accuracies and spatial/spectral patterns of selected features, there was one exception: a novel feature/attribute correlation approach was able to achieve those accuracies with far fewer features, suggesting the potential for simpler prediction models that yield high zero-shot classification accuracy.

  15. Feature selection for splice site prediction: A new method using EDA-based feature ranking

    Rouzé Pierre

    2004-05-01

    Background: The identification of relevant biological features in large and complex datasets is an important step towards gaining insight in the processes underlying the data. Other advantages of feature selection include the ability of the classification system to attain good or even better solutions using a restricted subset of features, and a faster classification. Thus, robust methods for fast feature selection are of key importance in extracting knowledge from complex biological data. Results: In this paper we present a novel method for feature subset selection applied to splice site prediction, based on estimation of distribution algorithms, a more general framework of genetic algorithms. From the estimated distribution of the algorithm, a feature ranking is derived. Afterwards this ranking is used to iteratively discard features. We apply this technique to the problem of splice site prediction, and show how it can be used to gain insight into the underlying biological process of splicing. Conclusion: We show that this technique proves to be more robust than the traditional use of estimation of distribution algorithms for feature selection: instead of returning a single best subset of features (as they normally do), this method provides a dynamical view of the feature selection process, like the traditional sequential wrapper methods. However, the method is faster than the traditional techniques, and scales better to datasets described by a large number of features.

  16. TEHRAN AIR POLLUTANTS PREDICTION BASED ON RANDOM FOREST FEATURE SELECTION METHOD

    A. Shamsoddini

    2017-09-01

    Air pollution, as one of the most serious forms of environmental pollution, poses a huge threat to human life. Air pollution leads to environmental instability, and has harmful and undesirable effects on the environment. Modern prediction methods of the pollutant concentration are able to improve decision making and provide appropriate solutions. This study examines the performance of the Random Forest feature selection in combination with multiple-linear regression and Multilayer Perceptron Artificial Neural Networks methods, in order to achieve an efficient model to estimate carbon monoxide, nitrogen dioxide, sulfur dioxide and PM2.5 contents in the air. The results indicated that Artificial Neural Networks fed by the attributes selected by the Random Forest feature selection method performed more accurately than other models for the modeling of all pollutants. The estimation accuracy of sulfur dioxide emissions was lower than that of the other air contaminants, whereas nitrogen dioxide was predicted more accurately than the other pollutants.

  17. Tehran Air Pollutants Prediction Based on Random Forest Feature Selection Method

    Shamsoddini, A.; Aboodi, M. R.; Karami, J.

    2017-09-01

    Air pollution, as one of the most serious forms of environmental pollution, poses a huge threat to human life. Air pollution leads to environmental instability, and has harmful and undesirable effects on the environment. Modern prediction methods of the pollutant concentration are able to improve decision making and provide appropriate solutions. This study examines the performance of the Random Forest feature selection in combination with multiple-linear regression and Multilayer Perceptron Artificial Neural Networks methods, in order to achieve an efficient model to estimate carbon monoxide, nitrogen dioxide, sulfur dioxide and PM2.5 contents in the air. The results indicated that Artificial Neural Networks fed by the attributes selected by the Random Forest feature selection method performed more accurately than other models for the modeling of all pollutants. The estimation accuracy of sulfur dioxide emissions was lower than that of the other air contaminants, whereas nitrogen dioxide was predicted more accurately than the other pollutants.

  18. New Hybrid Features Selection Method: A Case Study on Websites Phishing

    Khairan D. Rajab

    2017-01-01

    Phishing is one of the serious web threats that involves mimicking authenticated websites to deceive users in order to obtain their financial information. Phishing has caused financial damage to the different online stakeholders. It is massive in the magnitude of hundreds of millions; hence it is essential to minimize this risk. Classifying websites into “phishy” and legitimate types is a primary task in data mining that security experts and decision makers are hoping to improve, particularly with respect to the detection rate and reliability of the results. One way to ensure the reliability of the results and to enhance performance is to identify a set of related features early on so that the data dimensionality is reduced and irrelevant features are discarded. To increase the reliability of preprocessing, this article proposes a new feature selection method that combines the scores of multiple known methods to minimize discrepancies in feature selection results. The proposed method has been applied to the problem of website phishing classification to show its pros and cons in identifying relevant features. Results against a security dataset reveal that the proposed preprocessing method was able to derive new feature datasets which, when mined, generate highly competitive classifiers with reference to detection rate when compared to results obtained from other feature selection methods.

  19. Speech Emotion Feature Selection Method Based on Contribution Analysis Algorithm of Neural Network

    Wang Xiaojia; Mao Qirong; Zhan Yongzhao

    2008-01-01

    There are many emotion features. If all these features are employed to recognize emotions, redundant features may exist. Furthermore, the recognition result is unsatisfactory and the cost of feature extraction is high. In this paper, a method to select speech emotion features based on the contribution analysis algorithm of neural networks (NN) is presented. The emotion features are selected from the 95 extracted features by using the contribution analysis algorithm of NN. Cluster analysis is applied to analyze the effectiveness of the selected features, and the time of feature extraction is evaluated. Finally, the 24 selected emotion features are used to recognize six speech emotions. The experiments show that this method can improve the recognition rate and the time of feature extraction.

  20. Variable selection in near-infrared spectroscopy: Benchmarking of feature selection methods on biodiesel data

    Balabin, Roman M.; Smirnov, Sergey V.

    2011-01-01

    During the past several years, near-infrared (near-IR/NIR) spectroscopy has increasingly been adopted as an analytical tool in various fields from petroleum to biomedical sectors. The NIR spectrum (above 4000 cm⁻¹) of a sample is typically measured by modern instruments at a few hundred wavelengths. Recently, considerable effort has been directed towards developing procedures to identify variables (wavelengths) that contribute useful information. Variable selection (VS) or feature selection, also called frequency selection or wavelength selection, is a critical step in data analysis for vibrational spectroscopy (infrared, Raman, or NIRS). In this paper, we compare the performance of 16 different feature selection methods for the prediction of properties of biodiesel fuel, including density, viscosity, methanol content, and water concentration. The feature selection algorithms tested include stepwise multiple linear regression (MLR-step), interval partial least squares regression (iPLS), backward iPLS (BiPLS), forward iPLS (FiPLS), moving window partial least squares regression (MWPLS), (modified) changeable size moving window partial least squares (CSMWPLS/MCSMWPLSR), searching combination moving window partial least squares (SCMWPLS), successive projections algorithm (SPA), uninformative variable elimination (UVE, including UVE-SPA), simulated annealing (SA), back-propagation artificial neural networks (BP-ANN), Kohonen artificial neural network (K-ANN), and genetic algorithms (GAs, including GA-iPLS). Two linear techniques for calibration model building, namely multiple linear regression (MLR) and partial least squares regression/projection to latent structures (PLS/PLSR), are used for the evaluation of biofuel properties. A comparison with a non-linear calibration model, artificial neural networks (ANN-MLP), is also provided. Discussion of gasoline, ethanol-gasoline (bioethanol), and diesel fuel data is presented. The results of other spectroscopic

  1. A new and fast image feature selection method for developing an optimal mammographic mass detection scheme.

    Tan, Maxine; Pu, Jiantao; Zheng, Bin

    2014-08-01

    Selecting optimal features from a large image feature pool remains a major challenge in developing computer-aided detection (CAD) schemes of medical images. The objective of this study is to investigate a new approach to significantly improve efficacy of image feature selection and classifier optimization in developing a CAD scheme of mammographic masses. An image dataset including 1600 regions of interest (ROIs) in which 800 are positive (depicting malignant masses) and 800 are negative (depicting CAD-generated false positive regions) was used in this study. After segmentation of each suspicious lesion by a multilayer topographic region growth algorithm, 271 features were computed in different feature categories including shape, texture, contrast, isodensity, spiculation, local topological features, as well as the features related to the presence and location of fat and calcifications. Besides computing features from the original images, the authors also computed new texture features from the dilated lesion segments. In order to select optimal features from this initial feature pool and build a highly performing classifier, the authors examined and compared four feature selection methods to optimize an artificial neural network (ANN) based classifier, namely: (1) Phased Searching with NEAT in a Time-Scaled Framework, (2) A sequential floating forward selection (SFFS) method, (3) A genetic algorithm (GA), and (4) A sequential forward selection (SFS) method. Performances of the four approaches were assessed using a tenfold cross validation method. Among these four methods, SFFS has highest efficacy, which takes 3%-5% of computational time as compared to GA approach, and yields the highest performance level with the area under a receiver operating characteristic curve (AUC) = 0.864 ± 0.034. The results also demonstrated that except using GA, including the new texture features computed from the dilated mass segments improved the AUC results of the ANNs optimized
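
    Of the four search strategies compared in this record, plain sequential forward selection (SFS) is the simplest to sketch. The version below is a generic wrapper illustration, not the authors' CAD pipeline: the scoring function (leave-one-out accuracy of a 1-nearest-neighbour classifier) stands in for their ANN, and the toy data are hypothetical. SFFS would add a conditional backward-removal step after each addition, which is omitted here.

```python
def loo_1nn_accuracy(X, y, subset):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier on the chosen columns."""
    correct = 0
    for i, xi in enumerate(X):
        nearest = min(
            (j for j in range(len(X)) if j != i),
            key=lambda j: sum((xi[f] - X[j][f]) ** 2 for f in subset),
        )
        correct += y[nearest] == y[i]
    return correct / len(X)


def sequential_forward_selection(X, y, max_features, score=loo_1nn_accuracy):
    """Greedily add the feature that most improves the score; stop when none helps."""
    selected, best_score = [], 0.0
    remaining = set(range(len(X[0])))
    while remaining and len(selected) < max_features:
        # Evaluate every candidate extension of the current subset.
        candidate, candidate_score = max(
            ((f, score(X, y, selected + [f])) for f in sorted(remaining)),
            key=lambda pair: pair[1],
        )
        if candidate_score <= best_score:
            break  # no remaining feature improves the score
        selected.append(candidate)
        remaining.remove(candidate)
        best_score = candidate_score
    return selected, best_score


# Hypothetical toy data: only column 1 separates the two classes.
X = [
    [0.5, 0.0, 7.0],
    [0.1, 0.1, 3.0],
    [0.9, 0.2, 5.0],
    [0.3, 1.0, 4.0],
    [0.7, 1.1, 6.0],
    [0.2, 1.2, 2.0],
]
y = [0, 0, 0, 1, 1, 1]
selected, best = sequential_forward_selection(X, y, max_features=2)
print(selected, best)
```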

  2. A Feature Selection Method Based on Fisher's Discriminant Ratio for Text Sentiment Classification

    Wang, Suge; Li, Deyu; Wei, Yingjie; Li, Hongxia

    With the rapid growth of e-commerce, product reviews on the Web have become an important information source for customers' decision making when they intend to buy some product. As the reviews are often too many for customers to go through, how to automatically classify them into different sentiment orientation categories (i.e. positive/negative) has become a research problem. In this paper, based on Fisher's discriminant ratio, an effective feature selection method is proposed for product review text sentiment classification. In order to validate the validity of the proposed method, we compared it with other methods respectively based on information gain and mutual information while support vector machine is adopted as the classifier. In this paper, 6 subexperiments are conducted by combining different feature selection methods with 2 kinds of candidate feature sets. Under 1006 review documents of cars, the experimental results indicate that the Fisher's discriminant ratio based on word frequency estimation has the best performance with F value 83.3% while the candidate features are the words which appear in both positive and negative texts.

  3. A novel relational regularization feature selection method for joint regression and classification in AD diagnosis.

    Zhu, Xiaofeng; Suk, Heung-Il; Wang, Li; Lee, Seong-Whan; Shen, Dinggang

    2017-05-01

    In this paper, we focus on joint regression and classification for Alzheimer's disease diagnosis and propose a new feature selection method by embedding the relational information inherent in the observations into a sparse multi-task learning framework. Specifically, the relational information includes three kinds of relationships (such as feature-feature relation, response-response relation, and sample-sample relation), for preserving three kinds of the similarity, such as for the features, the response variables, and the samples, respectively. To conduct feature selection, we first formulate the objective function by imposing these three relational characteristics along with an ℓ2,1-norm regularization term, and further propose a computationally efficient algorithm to optimize the proposed objective function. With the dimension-reduced data, we train two support vector regression models to predict the clinical scores of ADAS-Cog and MMSE, respectively, and also a support vector classification model to determine the clinical label. We conducted extensive experiments on the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset to validate the effectiveness of the proposed method. Our experimental results showed the efficacy of the proposed method in enhancing the performances of both clinical scores prediction and disease status identification, compared to the state-of-the-art methods.
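
    The role of the ℓ2,1-norm regularizer in this record can be shown with a short sketch. It assumes the usual definition (the sum of the Euclidean norms of the rows of a weight matrix, with one row per feature and one column per task), which drives whole rows to zero and so discards features jointly across tasks. The weight matrix and tolerance below are hypothetical, and the relational terms of the authors' objective are not reproduced.

```python
import math


def l21_norm(W):
    """l2,1-norm of a matrix: sum over rows of each row's Euclidean norm."""
    return sum(math.sqrt(sum(w * w for w in row)) for row in W)


def selected_rows(W, tol=1e-8):
    """Indices of features whose weight row is not (numerically) zero."""
    return [i for i, row in enumerate(W) if math.sqrt(sum(w * w for w in row)) > tol]


# Hypothetical weights: 3 features (rows) x 2 tasks (columns).
W = [
    [3.0, 4.0],  # used by both tasks
    [0.0, 0.0],  # zeroed out, so feature 1 is discarded
    [1.0, 0.0],  # used by the first task only
]
norm_value = l21_norm(W)
kept = selected_rows(W)
print(norm_value, kept)
```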

  4. Feature Selection Methods for Robust Decoding of Finger Movements in a Non-human Primate

    Padmanaban, Subash; Baker, Justin; Greger, Bradley

    2018-01-01

    Objective: The performance of machine learning algorithms used for neural decoding of dexterous tasks may be impeded due to problems arising when dealing with high-dimensional data. The objective of feature selection algorithms is to choose a near-optimal subset of features from the original feature space to improve the performance of the decoding algorithm. The aim of our study was to compare the effects of four feature selection techniques, Wilcoxon signed-rank test, Relative Importance, Principal Component Analysis (PCA), and Mutual Information Maximization on SVM classification performance for a dexterous decoding task. Approach: A nonhuman primate (NHP) was trained to perform small coordinated movements—similar to typing. An array of microelectrodes was implanted in the hand area of the motor cortex of the NHP and used to record action potentials (AP) during finger movements. A Support Vector Machine (SVM) was used to classify which finger movement the NHP was making based upon AP firing rates. We used the SVM classification to examine the functional parameters of (i) robustness to simulated failure and (ii) longevity of classification. We also compared the effect of using isolated-neuron and multi-unit firing rates as the feature vector supplied to the SVM. Main results: The average decoding accuracy for multi-unit features and single-unit features using Mutual Information Maximization (MIM) across 47 sessions was 96.74 ± 3.5% and 97.65 ± 3.36% respectively. The reduction in decoding accuracy between using 100% of the features and 10% of features based on MIM was 45.56% (from 93.7 to 51.09%) and 4.75% (from 95.32 to 90.79%) for multi-unit and single-unit features respectively. MIM had best performance compared to other feature selection methods. Significance: These results suggest improved decoding performance can be achieved by using optimally selected features. The results based on clinically relevant performance metrics also suggest that the decoding

  5. A DYNAMIC FEATURE SELECTION METHOD FOR DOCUMENT RANKING WITH RELEVANCE FEEDBACK APPROACH

    K. Latha

    2010-07-01

    Full Text Available Ranking search results is essential for information retrieval and Web search. Search engines need to not only return highly relevant results, but also be fast to satisfy users. As a result, not all available features can be used for ranking, and in fact only a small percentage of these features can be used. Thus, it is crucial to have a feature selection mechanism that can find a subset of features that both meets latency requirements and achieves high relevance. In this paper we describe a 0/1 knapsack procedure for automatically selecting features to use within Generalization model for Document Ranking. We propose an approach for Relevance Feedback using Expectation Maximization method and evaluate the algorithm on the TREC Collection for describing classes of feedback textual information retrieval features. Experimental results, evaluated on standard TREC-9 part of the OHSUMED collections, show that our feature selection algorithm produces models that are either significantly more effective than, or equally effective as, models such as Markov Random Field model, Correlation Co-efficient and Count Difference method
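    The 0/1 knapsack idea mentioned in this abstract can be illustrated with a small sketch. It is not the authors' procedure: it assumes each feature carries a hypothetical integer cost (for example, latency in milliseconds) and a hypothetical relevance gain, and it picks the subset with the highest total gain that fits a latency budget. The name `knapsack_select` and all numbers are illustrative only.

```python
def knapsack_select(costs, gains, budget):
    """Classic 0/1 knapsack by dynamic programming (integer costs assumed)."""
    n = len(costs)
    # best[i][b] = highest total gain using the first i features within budget b
    best = [[0] * (budget + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost, gain = costs[i - 1], gains[i - 1]
        for b in range(budget + 1):
            best[i][b] = best[i - 1][b]  # option 1: skip feature i-1
            if cost <= b and best[i - 1][b - cost] + gain > best[i][b]:
                best[i][b] = best[i - 1][b - cost] + gain  # option 2: take it
    # Walk back through the table to recover which features were taken.
    chosen, b = [], budget
    for i in range(n, 0, -1):
        if best[i][b] != best[i - 1][b]:
            chosen.append(i - 1)
            b -= costs[i - 1]
    return best[n][budget], sorted(chosen)


# Hypothetical costs and gains for four candidate ranking features.
total_gain, features = knapsack_select([2, 3, 4, 5], [3, 4, 5, 6], budget=5)
print(total_gain, features)  # 7 [0, 1]
```

    Each feature is either used or not, which is what makes the problem a 0/1 knapsack rather than a fractional one.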

  6. A stereo remote sensing feature selection method based on artificial bee colony algorithm

    Yan, Yiming; Liu, Pigang; Zhang, Ye; Su, Nan; Tian, Shu; Gao, Fengjiao; Shen, Yi

    2014-05-01

    To improve the efficiency of stereo information for remote sensing classification, this paper proposes a stereo remote sensing feature selection method based on the artificial bee colony algorithm. Remote sensing stereo information can be described by a digital surface model (DSM) and an optical image, which contain information on the three-dimensional structure and the optical characteristics, respectively. Firstly, the three-dimensional structure characteristics can be analyzed by 3D-Zernike descriptors (3DZD). However, different parameters of 3DZD describe three-dimensional structures of different complexity, and they need to be selected appropriately for the various objects on the ground. Secondly, the features representing the optical characteristics also need to be optimized. If not properly handled, a stereo feature vector composed of 3DZD and image features contains a lot of redundant information, which may not improve the classification accuracy and may even have adverse effects. To reduce information redundancy while maintaining or improving the classification accuracy, an optimization framework for this stereo feature selection problem is created, and the artificial bee colony algorithm is introduced to solve the optimization problem. Experimental results show that the proposed method can effectively improve both the computational efficiency and the classification accuracy.

  7. A ROC-based feature selection method for computer-aided detection and diagnosis

    Wang, Songyuan; Zhang, Guopeng; Liao, Qimei; Zhang, Junying; Jiao, Chun; Lu, Hongbing

    2014-03-01

    Image-based computer-aided detection and diagnosis (CAD) has been a very active research topic aiming to assist physicians in detecting lesions and distinguishing benign from malignant ones. However, the datasets fed into a classifier usually suffer from a small number of samples, as well as significantly fewer samples available in one class (have a disease) than the other, resulting in the classifier's suboptimal performance. Identifying the most characterizing features of the observed data for lesion detection is critical to improving the sensitivity and minimizing the false positives of a CAD system. In this study, we propose a novel feature selection method, mR-FAST, that combines the minimal-redundancy-maximal-relevance (mRMR) framework with a selection metric FAST (feature assessment by sliding thresholds) based on the area under a ROC curve (AUC) generated on optimal simple linear discriminants. With three feature datasets extracted from CAD systems for colon polyps and bladder cancer, we show that the space of candidate features selected by mR-FAST is more characterizing for lesion detection with higher AUC, enabling a compact subset of superior features to be found at low cost.
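    As a rough illustration of AUC-based feature relevance, the sketch below scores each feature on its own by the area under its ROC curve, computed through the pairwise (Mann-Whitney) formulation. It is a simplified univariate stand-in, not mR-FAST: there is no redundancy term and no sliding-threshold discriminant. The names `auc` and `rank_by_auc` and the toy data are assumptions made for the example.

```python
def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    # A tie between a positive and a negative sample counts as half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def rank_by_auc(columns, labels):
    """Order feature names by how far their AUC is from chance (0.5)."""
    relevance = {}
    for name, values in columns.items():
        a = auc(values, labels)
        relevance[name] = max(a, 1 - a)  # direction of the feature does not matter
    return sorted(relevance, key=relevance.get, reverse=True)


labels = [0, 0, 1, 1]
columns = {
    "noise": [0.5, 0.2, 0.3, 0.4],   # hypothetical uninformative feature
    "size": [0.1, 0.4, 0.35, 0.8],   # hypothetical informative feature
}
print(auc(columns["size"], labels))  # 0.75
print(rank_by_auc(columns, labels))  # ['size', 'noise']
```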

  8. An Ensemble Method with Integration of Feature Selection and Classifier Selection to Detect the Landslides

    Zhongqin, G.; Chen, Y.

    2017-12-01

    Quickly and automatically identifying the spatial distribution of landslides is essential for the prevention, mitigation and assessment of landslide hazards. It is still a challenging job owing to the complicated characteristics and vague boundaries of landslide areas in images. High-resolution remote sensing images have multiple scales, complex spatial distributions and abundant features; object-oriented image classification methods can make full use of this information and thus effectively detect landslides after the hazard has happened. In this research we present a new semi-supervised workflow that takes advantage of recent object-oriented image analysis and machine learning algorithms to quickly locate landslides of different origins in some areas of southwest China. Besides a sequence of image segmentation, feature selection, object classification and error testing, this workflow combines feature selection and classifier selection in an ensemble. The features used in this study were the normalized difference vegetation index (NDVI) change, textural features derived from gray level co-occurrence matrices (GLCM), spectral features, etc. The results show that the algorithm removes some redundant features and that the classifiers are fully used. All these improvements lead to a higher accuracy in determining the shape of landslides in high-resolution remote sensing images, and in particular to flexibility with respect to different kinds of landslides.

  9. Pattern Recognition Methods and Features Selection for Speech Emotion Recognition System.

    Partila, Pavol; Voznak, Miroslav; Tovarek, Jaromir

    2015-01-01

    The impact of the classification method and feature selection on speech emotion recognition accuracy is discussed in this paper. Selecting the correct parameters in combination with the classifier is an important part of reducing the computational complexity of the system. This step is necessary especially for systems that will be deployed in real-time applications. The reason for the development and improvement of speech emotion recognition systems is their wide usability in today's automatic voice-controlled systems. The Berlin database of emotional recordings was used in this experiment. The classification accuracy of artificial neural networks, k-nearest neighbours, and Gaussian mixture models is measured considering the selection of prosodic, spectral, and voice quality features. The purpose was to find an optimal combination of methods and group of features for stress detection in human speech. The research contribution lies in the design of the speech emotion recognition system due to its accuracy and efficiency.

  10. Pattern Recognition Methods and Features Selection for Speech Emotion Recognition System

    Pavol Partila

    2015-01-01

    Full Text Available The impact of the classification method and feature selection on speech emotion recognition accuracy is discussed in this paper. Selecting the correct parameters in combination with the classifier is an important part of reducing the computational complexity of the system. This step is necessary especially for systems that will be deployed in real-time applications. The reason for the development and improvement of speech emotion recognition systems is their wide usability in today's automatic voice-controlled systems. The Berlin database of emotional recordings was used in this experiment. The classification accuracy of artificial neural networks, k-nearest neighbours, and Gaussian mixture models is measured considering the selection of prosodic, spectral, and voice quality features. The purpose was to find an optimal combination of methods and group of features for stress detection in human speech. The research contribution lies in the design of the speech emotion recognition system due to its accuracy and efficiency.

  11. FEATURE SELECTION METHODS BASED ON MUTUAL INFORMATION FOR CLASSIFYING HETEROGENEOUS FEATURES

    Ratri Enggar Pawening

    2016-06-01

    Full Text Available Datasets with heterogeneous features can lead to inappropriate feature selection results because it is difficult to evaluate heterogeneous features concurrently. Feature transformation (FT) is another way to handle heterogeneous feature subset selection. Transforming non-numerical features into numerical features may introduce redundancy with the original numerical features. In this paper, we propose a method to select a feature subset based on mutual information (MI) for classifying heterogeneous features. We use unsupervised feature transformation (UFT) methods and joint mutual information maximisation (JMIM) methods. UFT methods are used to transform non-numerical features into numerical features. JMIM methods are used to select a feature subset with consideration of the class label. The transformed and the original features are all combined, the feature subset is then determined using JMIM methods, and the selected features are classified using the support vector machine (SVM) algorithm. The classification accuracy is measured for each number of selected features and compared between the UFT-JMIM and Dummy-JMIM methods. The average classification accuracy over all experiments in this study is about 84.47% for the UFT-JMIM methods and about 84.24% for the Dummy-JMIM methods. This result shows that UFT-JMIM methods can minimize information loss between transformed and original features, and select a feature subset that avoids redundant and irrelevant features.
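    A minimal sketch of greedy selection by joint mutual information, assuming already-discretised features, may help make the JMIM idea concrete. It follows the commonly cited criterion (pick the candidate f that maximises the minimum of I(f, s; C) over the already selected features s), which is an assumption here since the abstract does not spell the criterion out. The UFT transformation step is not shown, and the names `mutual_information` and `jmim_select` and the toy data are hypothetical.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Empirical mutual information (in bits) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(c / n * log2(c * n / (px[x] * py[y])) for (x, y), c in pxy.items())


def jmim_select(features, labels, k):
    """Greedy selection: maximise the minimum joint MI with selected features."""
    remaining = dict(features)
    # Start with the single most informative feature.
    first = max(remaining, key=lambda f: mutual_information(remaining[f], labels))
    selected = [first]
    del remaining[first]
    while remaining and len(selected) < k:
        def score(f):
            # Joint variable (f, s) is represented as a sequence of value pairs.
            return min(
                mutual_information(list(zip(remaining[f], features[s])), labels)
                for s in selected
            )
        best = max(remaining, key=score)
        selected.append(best)
        del remaining[best]
    return selected


# Toy data: the class is determined by a and b together; a_copy is redundant.
features = {
    "a":      [0, 0, 0, 0, 1, 1, 1, 1],
    "a_copy": [0, 0, 0, 0, 1, 1, 1, 1],
    "b":      [0, 0, 1, 1, 0, 0, 1, 1],
    "noise":  [0, 1, 0, 1, 0, 1, 0, 1],
}
labels = [0, 0, 1, 1, 2, 2, 3, 3]
print(jmim_select(features, labels, 2))  # ['a', 'b']
```

    In the toy data, `a_copy` is as informative as `a` on its own, but it adds nothing once `a` is selected, so the joint criterion prefers `b`.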

  12. Enhancement web proxy cache performance using Wrapper Feature Selection methods with NB and J48

    Mahmoud Al-Qudah, Dua'a.; Funke Olanrewaju, Rashidah; Wong Azman, Amelia

    2017-11-01

    Web proxy caching reduces response time by storing a copy of pages between the client and server sides. If requested pages are cached in the proxy, there is no need to access the server. Due to the limited size and high cost of cache compared to other storage, a cache replacement algorithm is used to determine which page to evict when the cache is full. However, conventional replacement algorithms such as Least Recently Used (LRU), First In First Out (FIFO), Least Frequently Used (LFU), Randomized Policy, etc. may discard important pages just before they are used. Furthermore, conventional algorithms cannot be well optimized, since intelligently evicting a page requires some decision before replacement. Hence, most researchers propose integrating intelligent classifiers with the replacement algorithm to improve its performance. This research proposes using automated wrapper feature selection methods to choose the best subset of features that are relevant and influence the classifiers' prediction accuracy. The results show that the wrapper feature selection methods, namely Best First (BFS), Incremental Wrapper Subset Selection (IWSS) embedded NB, and particle swarm optimization (PSO), reduce the number of features and have a good impact on reducing computation time. Using PSO enhances NB classifier accuracy by 1.1%, 0.43% and 0.22% over using NB with all features, using BFS and using IWSS embedded NB, respectively. PSO raises J48 accuracy by 0.03%, 1.91 and 0.04% over using the J48 classifier with all features, using IWSS embedded NB and using BFS, respectively. IWSS embedded NB speeds up the NB and J48 classifiers much more than BFS and PSO: it reduces the computation time of NB by 0.1383 and that of J48 by 2.998.

  13. Clustering based gene expression feature selection method: A computational approach to enrich the classifier efficiency of differentially expressed genes

    Abusamra, Heba; Bajic, Vladimir B.

    2016-01-01

    decrease the computational time and cost, but also improve the classification performance. Among different approaches of feature selection methods, however most of them suffer from several problems such as lack of robustness, validation issues etc. Here, we

  14. A kernel-based multivariate feature selection method for microarray data classification.

    Shiquan Sun

    Full Text Available High dimensionality and small sample sizes, and their inherent risk of overfitting, pose great challenges for constructing efficient classifiers in microarray data classification. Therefore a feature selection technique should be applied prior to data classification to enhance prediction performance. In general, filter methods can be considered as a principal or auxiliary selection mechanism because of their simplicity, scalability, and low computational complexity. However, a series of trivial examples show that filter methods result in less accurate performance because they ignore the dependencies of features. The few publications that have devoted attention to revealing the relationships among features by multivariate-based methods describe these relationships only by linear methods, and simple linear combination relationships restrict the improvement in performance. In this paper, we used a kernel method to discover inherent nonlinear correlations among features as well as between feature and target. Moreover, the number of orthogonal components was determined by kernel Fisher's linear discriminant analysis (FLDA) in a self-adaptive manner rather than by manual parameter settings. In order to reveal the effectiveness of our method we performed several experiments and compared the results between our method and other competitive multivariate-based feature selectors. In our comparison, we used two classifiers (support vector machine, k-nearest neighbor) on two groups of datasets, namely two-class and multi-class datasets. Experimental results demonstrate that the performance of our method is better than others, especially on three hard-to-classify datasets, namely Wang's Breast Cancer, Gordon's Lung Adenocarcinoma and Pomeroy's Medulloblastoma.

  15. Value Added Methods: Moving from Univariate to Multivariate Criteria

    Newman, David; Newman, Isadore; Ridenour, Carolyn; Morales, Jennifer

    2014-01-01

    The authors describe five value-added methods (VAM) used in school assessment as the backdrop to their main thesis. Then they review the assumptions underlying measurement and evaluation, the foundation of all assessment systems, including value-added. They discuss the traditional criterion variable used in VAM: a standardized test score. Next,…

  16. A Comparative Study of Feature Selection Methods for the Discriminative Analysis of Temporal Lobe Epilepsy

    Chunren Lai

    2017-12-01

    Full Text Available It is crucial to differentiate patients with temporal lobe epilepsy (TLE) from the healthy population and determine abnormal brain regions in TLE. The cortical features and changes can reveal the unique anatomical patterns of brain regions from structural magnetic resonance (MR) images. In this study, structural MR images from 41 patients with left TLE, 34 patients with right TLE, and 58 normal controls (NC) were acquired, and four kinds of cortical measures, namely cortical thickness, cortical surface area, gray matter volume (GMV), and mean curvature, were explored for discriminative analysis. Three feature selection methods including the independent sample t-test filtering, the sparse-constrained dimensionality reduction model (SCDRM), and the support vector machine-recursive feature elimination (SVM-RFE) were investigated to extract dominant features among the compared groups for classification using the support vector machine (SVM) classifier. The results showed that the SVM-RFE achieved the highest performance (most classifications with more than 84% accuracy), followed by the SCDRM, and the t-test. Especially, the surface area and GMV exhibited prominent discriminative ability, and the performance of the SVM was improved significantly when the four cortical measures were combined. Additionally, the dominant regions with higher classification weights were mainly located in the temporal and the frontal lobe, including the entorhinal cortex, rostral middle frontal, parahippocampal cortex, superior frontal, insula, and cuneus. This study concluded that the cortical features provided effective information for the recognition of abnormal anatomical patterns and the proposed methods had the potential to improve the clinical diagnosis of TLE.
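    The simplest of the three methods named above, independent-sample t-test filtering, can be sketched as follows. This is a generic univariate filter under stated assumptions (pooled-variance Student's t, non-zero variance in every feature, rows as subjects and columns as features), not the study's pipeline; the function names and the toy values are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def t_statistic(group_a, group_b):
    """Two-sample Student's t with pooled variance (equal variances assumed)."""
    na, nb = len(group_a), len(group_b)
    pooled = ((na - 1) * variance(group_a) + (nb - 1) * variance(group_b)) / (na + nb - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled * (1 / na + 1 / nb))


def rank_features_by_t(patients, controls):
    """Return feature indices ordered from most to least discriminative."""
    n_features = len(patients[0])
    scored = []
    for j in range(n_features):
        t = t_statistic([row[j] for row in patients], [row[j] for row in controls])
        scored.append((abs(t), j))
    return [j for _, j in sorted(scored, reverse=True)]


# Toy data: feature 0 differs between the groups, feature 1 does not.
patients = [[1.0, 5.0], [2.0, 5.5], [3.0, 4.5]]
controls = [[4.0, 5.2], [5.0, 4.8], [6.0, 5.0]]
print(rank_features_by_t(patients, controls))  # [0, 1]
```

    Because each feature is scored in isolation, a filter like this cannot detect features that are only informative in combination, which is one reason wrapper and embedded methods such as SVM-RFE are often compared against it.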

  17. Absolute cosine-based SVM-RFE feature selection method for prostate histopathological grading.

    Sahran, Shahnorbanun; Albashish, Dheeb; Abdullah, Azizi; Shukor, Nordashima Abd; Hayati Md Pauzi, Suria

    2018-04-18

    Feature selection (FS) methods are widely used in grading and diagnosing prostate histopathological images. In this context, FS is based on the texture features obtained from the lumen, nuclei, cytoplasm and stroma, all of which are important tissue components. However, it is difficult to represent the high-dimensional textures of these tissue components. To solve this problem, we propose a new FS method that enables the selection of features with minimal redundancy in the tissue components. We categorise tissue images based on the texture of individual tissue components via the construction of a single classifier and also construct an ensemble learning model by merging the values obtained by each classifier. Another issue that arises is overfitting due to the high-dimensional texture of individual tissue components. We propose a new FS method, SVM-RFE(AC), that integrates a Support Vector Machine-Recursive Feature Elimination (SVM-RFE) embedded procedure with an absolute cosine (AC) filter method to prevent redundancy in the selected features of the SVM-RFE and an unoptimised classifier in the AC. We conducted experiments on H&E histopathological prostate and colon cancer images with respect to three prostate classifications, namely benign vs. grade 3, benign vs. grade 4 and grade 3 vs. grade 4. The colon benchmark dataset requires a distinction between grades 1 and 2, which are the most difficult cases to distinguish in the colon domain. The results obtained by both the single and ensemble classification models (which uses the product rule as its merging method) confirm that the proposed SVM-RFE(AC) is superior to the other SVM and SVM-RFE-based methods. We developed an FS method based on SVM-RFE and AC and successfully showed that its use enabled the identification of the most crucial texture feature of each tissue component. Thus, it makes possible the distinction between multiple Gleason grades (e.g. grade 3 vs. grade 4) and its performance is far superior to

  18. Influence of Feature Selection Methods on Classification Sensitivity Based on the Example of A Study of Polish Voivodship Tourist Attractiveness

    Bąk Iwona

    2014-07-01

    Full Text Available The purpose of this article is to determine the influence of various methods of selection of diagnostic features on the sensitivity of classification. Three options of feature selection are presented: a parametric feature selection method with a sum (option I), a median of the correlation coefficients matrix column elements (option II) and the method of a reversed matrix (option III). Efficiency of the groupings was verified by the indicators of homogeneity, heterogeneity and the correctness of grouping. In the assessment of group efficiency the approach with the Weber median was used. The undertaken problem was illustrated with a research into the tourist attractiveness of voivodships in Poland in 2011.

  19. Data Visualization and Feature Selection Methods in Gel-based Proteomics

    Silva, Tomé Santos; Richard, Nadege; Dias, Jorge P.

    2014-01-01

    -based proteomics, summarizing the current state of research within this field. Particular focus is given on discussing the usefulness of available multivariate analysis tools both for data visualization and feature selection purposes. Visual examples are given using a real gel-based proteomic dataset as basis....

  20. Advances in feature selection methods for hyperspectral image processing in food industry applications: a review.

    Dai, Qiong; Cheng, Jun-Hu; Sun, Da-Wen; Zeng, Xin-An

    2015-01-01

    There is an increased interest in the applications of hyperspectral imaging (HSI) for assessing food quality, safety, and authenticity. HSI provides an abundance of spatial and spectral information from foods by combining both spectroscopy and imaging, resulting in hundreds of contiguous wavebands for each spatial position of food samples, a situation known as the curse of dimensionality. It is desirable to employ feature selection algorithms for decreasing the computational burden and increasing prediction accuracy, which are especially relevant in the development of online applications. Recently, a variety of feature selection algorithms have been proposed that can be categorized into three groups based on the search strategy, namely complete search, heuristic search and random search. This review mainly introduces the fundamentals of each algorithm, illustrates its applications in hyperspectral data analysis in the food field, and discusses the advantages and disadvantages of these algorithms. It is hoped that this review will provide a guideline for feature selection and data processing in the future development of hyperspectral imaging techniques for foods.

  1. A comparison of random forest and its Gini importance with standard chemometric methods for the feature selection and classification of spectral data

    Himmelreich Uwe

    2009-07-01

    Full Text Available Abstract Background Regularized regression methods such as principal component or partial least squares regression perform well in learning tasks on high dimensional spectral data, but cannot explicitly eliminate irrelevant features. The random forest classifier with its associated Gini feature importance, on the other hand, allows for an explicit feature elimination, but may not be optimally adapted to spectral data due to the topology of its constituent classification trees which are based on orthogonal splits in feature space. Results We propose to combine the best of both approaches, and evaluated the joint use of a feature selection based on a recursive feature elimination using the Gini importance of random forests together with regularized classification methods on spectral data sets from medical diagnostics, chemotaxonomy, biomedical analytics, food science, and synthetically modified spectral data. Here, a feature selection using the Gini feature importance with a regularized classification by discriminant partial least squares regression performed as well as or better than a filtering according to different univariate statistical tests, or using regression coefficients in a backward feature elimination. It outperformed the direct application of the random forest classifier, or the direct application of the regularized classifiers on the full set of features. Conclusion The Gini importance of the random forest provided superior means for measuring feature relevance on spectral data, but – on an optimal subset of features – the regularized classifiers might be preferable over the random forest classifier, in spite of their limitation to model linear dependencies only. A feature selection based on Gini importance, however, may precede a regularized linear classification to identify this optimal subset of features, and to earn a double benefit of both dimensionality reduction and the elimination of noise from the classification task.

  2. Clustering based gene expression feature selection method: A computational approach to enrich the classifier efficiency of differentially expressed genes

    Abusamra, Heba

    2016-07-20

    The inherently high-dimensional, low-sample-size nature of gene expression data makes the classification task more challenging. Therefore, feature (gene) selection becomes an apparent need. Selecting meaningful and relevant genes for the classifier not only decreases the computational time and cost, but also improves the classification performance. Among the different approaches to feature selection, however, most suffer from several problems such as lack of robustness, validation issues, etc. Here, we present a new feature selection technique that takes advantage of clustering both samples and genes. Materials and methods We used a leukemia gene expression dataset [1]. The effectiveness of the selected features was evaluated by four different classification methods: support vector machines, k-nearest neighbor, random forest, and linear discriminant analysis. The method evaluates the importance and relevance of each gene cluster by summing the expression levels of the genes belonging to the cluster. A gene cluster is considered important if it satisfies conditions that depend on thresholds and percentage; otherwise it is eliminated. Results Initial analysis identified 7120 differentially expressed genes of leukemia (Fig. 15a); after applying our feature selection methodology we ended up with 1117 specific genes discriminating two classes of leukemia (Fig. 15b). Further applying the same method with more stringent higher positive and lower negative threshold conditions reduced the number to 58 genes, which were tested to evaluate the effectiveness of the method (Fig. 15c). The results of the four classification methods are summarized in Table 11. Conclusions The feature selection method gave good results with minimum classification error. Our heat-map result shows a distinct pattern of refined genes discriminating between two classes of leukemia.

  3. A consistency-based feature selection method allied with linear SVMs for HIV-1 protease cleavage site prediction.

    Orkun Oztürk

    Full Text Available BACKGROUND: Predicting type-1 Human Immunodeficiency Virus (HIV-1) protease cleavage site in protein molecules and determining its specificity is an important task which has attracted considerable attention in the research community. Achievements in this area are expected to result in effective drug design (especially for HIV-1 protease inhibitors) against this life-threatening virus. However, some drawbacks (like the shortage of the available training data and the high dimensionality of the feature space) turn this task into a difficult classification problem. Thus, various machine learning techniques, and specifically several classification methods have been proposed in order to increase the accuracy of the classification model. In addition, for several classification problems, which are characterized by having few samples and many features, selecting the most relevant features is a major factor for increasing classification accuracy. RESULTS: We propose for HIV-1 data a consistency-based feature selection approach in conjunction with recursive feature elimination of support vector machines (SVMs). We used various classifiers for evaluating the results obtained from the feature selection process. We further demonstrated the effectiveness of our proposed method by comparing it with a state-of-the-art feature selection method applied on HIV-1 data, and we evaluated the reported results based on attributes which have been selected from different combinations. CONCLUSION: Applying feature selection on training data before realizing the classification task seems to be a reasonable data-mining process when working with types of data similar to HIV-1. On HIV-1 data, some feature selection or extraction operations in conjunction with different classifiers have been tested and noteworthy outcomes have been reported. These facts motivate for the work presented in this paper. SOFTWARE AVAILABILITY: The software is available at http

  4. Recurrence predictive models for patients with hepatocellular carcinoma after radiofrequency ablation using support vector machines with feature selection methods.

    Liang, Ja-Der; Ping, Xiao-Ou; Tseng, Yi-Ju; Huang, Guan-Tarn; Lai, Feipei; Yang, Pei-Ming

    2014-12-01

    Recurrence of hepatocellular carcinoma (HCC) is an important issue despite effective treatments with tumor eradication. Identification of patients who are at high risk for recurrence may provide more efficacious screening and detection of tumor recurrence. The aim of this study was to develop recurrence predictive models for HCC patients who received radiofrequency ablation (RFA) treatment. From January 2007 to December 2009, 83 newly diagnosed HCC patients receiving RFA as their first treatment were enrolled. Five feature selection methods including genetic algorithm (GA), simulated annealing (SA) algorithm, random forests (RF) and hybrid methods (GA+RF and SA+RF) were utilized for selecting an important subset of features from a total of 16 clinical features. These feature selection methods were combined with support vector machine (SVM) for developing predictive models with better performance. Five-fold cross-validation was used to train and test SVM models. The developed SVM-based predictive models with hybrid feature selection methods and 5-fold cross-validation had averages of the sensitivity, specificity, accuracy, positive predictive value, negative predictive value, and area under the ROC curve as 67%, 86%, 82%, 69%, 90%, and 0.69, respectively. The SVM derived predictive model can provide suggestive high-risk recurrent patients, who should be closely followed up after complete RFA treatment. Copyright © 2014 Elsevier Ireland Ltd. All rights reserved.
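    The performance measures reported in this abstract all derive from the four cells of a binary confusion matrix. The sketch below shows the standard definitions; the counts in the usage example are made up for illustration and are not taken from the study, and the function name `diagnostic_metrics` is hypothetical.

```python
def diagnostic_metrics(tp, fp, tn, fn):
    """Standard binary classification metrics from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),            # recall on the positive class
        "specificity": tn / (tn + fp),            # recall on the negative class
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "ppv": tp / (tp + fp),                    # positive predictive value
        "npv": tn / (tn + fn),                    # negative predictive value
    }


# Hypothetical counts: 10 patients with recurrence, 20 without.
metrics = diagnostic_metrics(tp=8, fp=2, tn=18, fn=2)
print(metrics["sensitivity"], metrics["specificity"])  # 0.8 0.9
```

    With imbalanced classes, accuracy and NPV can look high even when sensitivity is modest, which is why such studies usually report all of these measures together.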

  5. Cellulose I crystallinity determination using FT-Raman spectroscopy : univariate and multivariate methods

    Umesh P. Agarwal; Richard S. Reiner; Sally A. Ralph

    2010-01-01

    Two new methods based on FT–Raman spectroscopy, one simple, based on band intensity ratio, and the other using a partial least squares (PLS) regression model, are proposed to determine cellulose I crystallinity. In the simple method, crystallinity in cellulose I samples was determined based on univariate regression that was first developed using the Raman band...

  6. Identification of Biomarkers for Esophageal Squamous Cell Carcinoma Using Feature Selection and Decision Tree Methods

    Chun-Wei Tung

    2013-01-01

    Full Text Available Esophageal squamous cell cancer (ESCC) is one of the most common fatal human cancers. The identification of biomarkers for early detection could be a promising strategy to decrease mortality. Previous studies utilized microarray techniques to identify more than one hundred genes; however, it is desirable to identify a small set of biomarkers for clinical use. This study proposes a sequential forward feature selection algorithm to design decision tree models for discriminating ESCC from normal tissues. Two potential biomarkers, RUVBL1 and CNIH, were identified and validated based on two publicly available microarray datasets. To test the discrimination ability of the two biomarkers, 17 pairs of expression profiles of ESCC and normal tissues from Taiwanese male patients were measured by using microarray techniques. The classification accuracies of the two biomarkers in all three datasets were higher than 90%. Interpretable decision tree models were constructed to analyze expression patterns of the two biomarkers. RUVBL1 was consistently overexpressed in all three datasets, although we found inconsistent CNIH expression possibly affected by the diverse major risk factors for ESCC across different areas.
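    Sequential forward selection, the search strategy named in this abstract, can be sketched generically: start from an empty set and repeatedly add the single feature that most improves a model score, stopping when no candidate helps. The study scores candidate sets with decision trees; to keep the example self-contained, the sketch below swaps in leave-one-out accuracy of a 1-nearest-neighbour classifier as the scorer. All names and the toy data are assumptions for illustration.

```python
def loo_1nn_accuracy(X, y, features):
    """Leave-one-out accuracy of 1-nearest-neighbour on the given feature subset."""
    correct = 0
    for i, xi in enumerate(X):
        nearest = min(
            (j for j in range(len(X)) if j != i),
            key=lambda j: sum((xi[f] - X[j][f]) ** 2 for f in features),
        )
        correct += y[nearest] == y[i]
    return correct / len(X)


def sequential_forward_selection(X, y, max_features):
    """Greedily add the feature that most improves the score; stop when none does."""
    selected, best = [], float("-inf")
    n_features = len(X[0])
    while len(selected) < max_features:
        candidates = [f for f in range(n_features) if f not in selected]
        if not candidates:
            break
        score, feature = max((loo_1nn_accuracy(X, y, selected + [f]), f) for f in candidates)
        if score <= best:
            break  # no candidate improves on the current subset
        selected.append(feature)
        best = score
    return selected, best


# Toy data: only feature 0 separates the two classes.
X = [[0.0, 5, 1], [0.1, 1, 9], [0.2, 7, 3], [1.0, 2, 8.4], [1.1, 6, 2], [1.2, 0, 7]]
y = [0, 0, 0, 1, 1, 1]
print(sequential_forward_selection(X, y, max_features=2))  # ([0], 1.0)
```

    The early stop is what keeps the selected set small, in the spirit of the two-biomarker result reported above.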

  7. Oral cancer prognosis based on clinicopathologic and genomic markers using a hybrid of feature selection and machine learning methods

    2013-01-01

    Background: Machine learning techniques are becoming useful as an alternative approach to conventional medical diagnosis or prognosis, as they handle noisy and incomplete data well and can attain significant results despite a small sample size. Traditionally, clinicians make prognostic decisions based on clinicopathologic markers. However, it is not easy for even the most skilful clinician to arrive at an accurate prognosis using these markers alone. Thus, there is a need to use genomic markers to improve the accuracy of prognosis. The main aim of this research is to apply a hybrid of feature selection and machine learning methods in oral cancer prognosis based on the parameters of the correlation of clinicopathologic and genomic markers. Results: In the first stage of this research, five feature selection methods were proposed and tested on the oral cancer prognosis dataset. In the second stage, the models with the features selected by each feature selection method were tested on the proposed classifiers. Four types of classifiers were chosen, namely ANFIS, artificial neural network, support vector machine and logistic regression. k-fold cross-validation was implemented on all types of classifiers due to the small sample size. The hybrid model of ReliefF-GA-ANFIS with 3 input features (drink, invasion and p63) achieved the best accuracy (accuracy = 93.81%; AUC = 0.90) for oral cancer prognosis. Conclusions: The results revealed that the prognosis is superior when both clinicopathologic and genomic markers are present. The selected features can be investigated further to validate their potential as a significant prognostic signature in oral cancer studies. PMID:23725313

  8. A general procedure to generate models for urban environmental-noise pollution using feature selection and machine learning methods.

    Torija, Antonio J; Ruiz, Diego P

    2015-02-01

    The prediction of environmental noise in urban environments requires the solution of a complex and non-linear problem, since there are complex relationships among the multitude of variables involved in the characterization and modelling of environmental noise and environmental-noise magnitudes. Moreover, the inclusion of the great spatial heterogeneity characteristic of urban environments seems to be essential in order to achieve an accurate environmental-noise prediction in cities. This problem is addressed in this paper, where a procedure based on feature-selection techniques and machine-learning regression methods is proposed and applied to this environmental problem. Three machine-learning regression methods, which are considered very robust in solving non-linear problems, are used to estimate the energy-equivalent sound-pressure level descriptor (LAeq). These three methods are: (i) multilayer perceptron (MLP), (ii) sequential minimal optimisation (SMO), and (iii) Gaussian processes for regression (GPR). In addition, because the high number of input variables involved in environmental-noise modelling and estimation in urban environments makes LAeq prediction models quite complex and costly in terms of time and resources for application to real situations, three different techniques are used to approach feature selection or data reduction. The feature-selection techniques used are (i) correlation-based feature-subset selection (CFS) and (ii) wrapper for feature-subset selection (WFS); the data-reduction technique is principal-component analysis (PCA). The subsequent analysis leads to a proposal of different schemes, depending on the needs regarding data collection and accuracy. The use of WFS as the feature-selection technique with the implementation of SMO or GPR as regression algorithm provides the best LAeq estimation (R² = 0.94 and mean absolute error (MAE) = 1.14-1.16 dB(A)). Copyright © 2014 Elsevier B.V. All rights reserved.
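
    The two quality measures quoted in this record, the coefficient of determination and the mean absolute error, follow standard definitions. The sketch below computes both for a regression output; the sound levels are invented for illustration and are not data from the study.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute deviation between measured and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical measured and predicted LAeq values in dB(A).
measured = [60.0, 65.0, 70.0, 75.0]
predicted = [61.0, 64.0, 71.0, 74.0]

mae = mean_absolute_error(measured, predicted)  # 1.0 dB(A)
r2 = r_squared(measured, predicted)             # 1 - 4/125 = 0.968
```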

  9. Validated univariate and multivariate spectrophotometric methods for the determination of pharmaceuticals mixture in complex wastewater

    Riad, Safaa M.; Salem, Hesham; Elbalkiny, Heba T.; Khattab, Fatma I.

    2015-04-01

    Five accurate, precise, and sensitive univariate and multivariate spectrophotometric methods were developed for the simultaneous determination of a ternary mixture containing Trimethoprim (TMP), Sulphamethoxazole (SMZ) and Oxytetracycline (OTC) in wastewater samples collected from different sites, either production wastewater or livestock wastewater, after their solid phase extraction using OASIS HLB cartridges. In the univariate methods, OTC was determined at its λmax of 355.7 nm (0D), while TMP and SMZ were determined by three different univariate methods. Method (A) is based on the successive spectrophotometric resolution technique (SSRT). The technique starts with the ratio subtraction method followed by the ratio difference method for determination of TMP and SMZ. Method (B) is the successive derivative ratio technique (SDR). Method (C) is mean centering of the ratio spectra (MCR). The developed multivariate methods are principal component regression (PCR) and partial least squares (PLS). The specificity of the developed methods is investigated by analyzing laboratory-prepared mixtures containing different ratios of the three drugs. The obtained results are statistically compared with those obtained by the official methods, showing no significant difference with respect to accuracy and precision at p = 0.05.

  10. R package imputeTestbench to compare imputations methods for univariate time series

    Bokde, Neeraj; Kulat, Kishore; Beck, Marcus W; Asencio-Cortés, Gualberto

    2016-01-01

    This paper describes the R package imputeTestbench that provides a testbench for comparing imputation methods for missing data in univariate time series. The imputeTestbench package can be used to simulate the amount and type of missing data in a complete dataset and compare filled data using different imputation methods. The user has the option to simulate missing data by removing observations completely at random or in blocks of different sizes. Several default imputation methods are includ...
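
    The testbench workflow described in this record (remove observations at random, fill them with different methods, compare against the complete series) can be sketched as follows. This is an illustrative Python sketch, not the imputeTestbench API: every function name here is hypothetical, and only two simple imputation methods are compared.

```python
import math
import random


def remove_at_random(series, fraction, seed=0):
    """Blank out a fraction of interior points completely at random."""
    rng = random.Random(seed)
    interior = list(range(1, len(series) - 1))  # endpoints stay observed
    n_missing = round(fraction * len(series))
    missing = set(rng.sample(interior, n_missing))
    return [None if i in missing else v for i, v in enumerate(series)]


def impute_mean(series):
    """Replace every gap with the mean of the observed values."""
    observed = [v for v in series if v is not None]
    m = sum(observed) / len(observed)
    return [m if v is None else v for v in series]


def impute_linear(series):
    """Fill gaps by linear interpolation between neighbouring observations."""
    filled = list(series)
    known = [i for i, v in enumerate(series) if v is not None]
    for left, right in zip(known, known[1:]):
        for i in range(left + 1, right):
            w = (i - left) / (right - left)
            filled[i] = series[left] + w * (series[right] - series[left])
    return filled


def rmse(truth, filled):
    """Root-mean-square error between the complete and the filled series."""
    return math.sqrt(sum((t - f) ** 2 for t, f in zip(truth, filled)) / len(truth))


# Hypothetical complete series with a pure linear trend.
truth = [float(t) for t in range(20)]
gappy = remove_at_random(truth, fraction=0.3)

rmse_mean = rmse(truth, impute_mean(gappy))
rmse_linear = rmse(truth, impute_linear(gappy))
```

    On a series with a strong trend, interpolation recovers the gaps almost exactly while mean imputation does not, which illustrates why the time ordering matters for univariate series.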

  11. Success/Failure Prediction of Noninvasive Mechanical Ventilation in Intensive Care Units. Using Multiclassifiers and Feature Selection Methods.

    Martín-González, Félix; González-Robledo, Javier; Sánchez-Hernández, Fernando; Moreno-García, María N

    2016-05-17

    This paper addresses the problem of decision-making in relation to the administration of noninvasive mechanical ventilation (NIMV) in intensive care units. Data mining methods were employed to find out the factors influencing the success/failure of NIMV and to predict its results in future patients. These artificial intelligence-based methods have not been applied in this field in spite of the good results obtained in other medical areas. Feature selection methods provided the most influential variables in the success/failure of NIMV, such as NIMV hours, PaCO2 at the start, PaO2 / FiO2 ratio at the start, hematocrit at the start or PaO2 / FiO2 ratio after two hours. These methods were also used in the preprocessing step with the aim of improving the results of the classifiers. The algorithms provided the best results when the dataset used as input was the one containing the attributes selected with the CFS method. Data mining methods can be successfully applied to determine the most influential factors in the success/failure of NIMV and also to predict NIMV results in future patients. The results provided by classifiers can be improved by preprocessing the data with feature selection techniques.

  12. Thermal load forecasting in district heating networks using deep learning and advanced feature selection methods

    Suryanarayana, Gowri; Lago Garcia, J.; Geysen, Davy; Aleksiejuk, Piotr; Johansson, Christian

    2018-01-01

    Recent research has seen several forecasting methods being applied for heat load forecasting of district heating networks. This paper presents two methods that gain significant improvements compared to the previous works. First, an automated way of handling non-linear dependencies in linear

  13. A Permutation Importance-Based Feature Selection Method for Short-Term Electricity Load Forecasting Using Random Forest

    Nantian Huang

    2016-09-01

    The prediction accuracy of short-term load forecast (STLF) depends on prediction model choice and feature selection result. In this paper, a novel random forest (RF)-based feature selection method for STLF is proposed. First, 243 related features were extracted from historical load data and the time information of prediction points to form the original feature set. Subsequently, the original feature set was used to train an RF as the original model. After the training process, the prediction error of the original model on the test set was recorded and the permutation importance (PI) value of each feature was obtained. Then, an improved sequential backward search method was used to select the optimal forecasting feature subset based on the PI value of each feature. Finally, the optimal forecasting feature subset was used to train a new RF model as the final prediction model. Experiments showed that the prediction accuracy of RF trained by the optimal forecasting feature subset was higher than that of the original model and comparative models based on support vector regression and artificial neural network.
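
    The permutation importance step in this record can be illustrated with a short sketch. It is a simplified stand-in under stated assumptions: the predictor is a fixed toy function rather than a trained random forest, the data are synthetic, and the function names are hypothetical.

```python
import random


def mse(model, X, y):
    """Mean squared prediction error of `model` on (X, y)."""
    return sum((model(row) - t) ** 2 for row, t in zip(X, y)) / len(y)


def permutation_importance(model, X, y, n_repeats=5, seed=0):
    """Mean increase in error when one feature column is shuffled.

    A large increase means the model relies on that feature.
    """
    rng = random.Random(seed)
    baseline = mse(model, X, y)
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the link between feature j and y
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            increases.append(mse(model, X_perm, y) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


# Synthetic data (assumption): the target depends on feature 0 only.
data_rng = random.Random(42)
X = [[data_rng.random(), data_rng.random()] for _ in range(50)]
y = [3.0 * row[0] for row in X]


def toy_model(row):
    """Fixed predictor standing in for a trained random forest."""
    return 3.0 * row[0]


importances = permutation_importance(toy_model, X, y)
```

    Feature 1 gets an importance of exactly zero because the toy predictor ignores it. A backward search like the one the record describes would drop such low-importance features first.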

  14. A novel peak-hopping stepwise feature selection method with application to Raman spectroscopy

    McShane, M.J.; Cameron, B.D.; Cote, G.L.; Motamedi, M.; Spiegelman, C.H.

    1999-01-01

    A new stepwise approach to variable selection for spectroscopy that includes chemical information and attempts to test several spectral regions producing high ranking coefficients has been developed to improve on currently available methods. Existing selection techniques can, in general, be placed into two groups: the first, time-consuming optimization approaches that ignore available information about sample chemistry and require considerable expertise to arrive at appropriate solutions (e.g. genetic algorithms), and the second, stepwise procedures that tend to select many variables in the same area containing redundant information. The algorithm described here is a fast stepwise procedure that uses multiple ranking chains to identify several spectral regions correlated with known sample properties. The multiple-chain approach allows the generation of a final ranking vector that moves quickly away from the initial selection point, testing several areas exhibiting correlation between spectra and composition early in the stepping procedure. Quantitative evidence of the success of this approach as applied to Raman spectroscopy is given in terms of processing speed, number of selected variables, and prediction error in comparison with other selection methods. In this respect, the procedure described here may be considered as a significant evolutionary step in variable selection algorithms. (Copyright (c) 1999 Elsevier Science B.V., Amsterdam. All rights reserved.)

  15. Feature Selection Method Based on Artificial Bee Colony Algorithm and Support Vector Machines for Medical Datasets Classification

    Mustafa Serter Uzer

    2013-01-01

    This paper offers a hybrid approach that uses the artificial bee colony (ABC) algorithm for feature selection and support vector machines for classification. The purpose of this paper is to test the effect of eliminating the unimportant and obsolete features of the datasets on the success of the classification, using the SVM classifier. The approach is developed for the diagnosis of liver diseases and diabetes, which are commonly observed and reduce the quality of life. For the diagnosis of these diseases, hepatitis, liver disorders and diabetes datasets from the UCI database were used, and the proposed system reached classification accuracies of 94.92%, 74.81%, and 79.29%, respectively. For these datasets, the classification accuracies were obtained with the 10-fold cross-validation method. The results show that the performance of the method is highly successful compared to other results attained and seems very promising for pattern recognition applications.

  16. Detection of biomarkers for Hepatocellular Carcinoma using a hybrid univariate gene selection methods

    Abdel Samee Nagwan M

    2012-08-01

    Background: Discovering new biomarkers has a great role in improving early diagnosis of Hepatocellular carcinoma (HCC). The experimental determination of biomarkers needs a lot of time and money. This motivates this work to use in-silico prediction of biomarkers to reduce the number of experiments required for detecting new ones. This is achieved by extracting the most representative genes in microarrays of HCC. Results: In this work, we provide a method for extracting the differentially expressed genes, specifically the up-regulated ones, that can be considered candidate biomarkers in high-throughput microarrays of HCC. We examine the power of several gene selection methods (such as Pearson’s correlation coefficient, Cosine coefficient, Euclidean distance, Mutual information and Entropy with different estimators) in selecting informative genes. A biological interpretation of the highly ranked genes is done using KEGG (Kyoto Encyclopedia of Genes and Genomes) pathways, ENTREZ and DAVID (Database for Annotation, Visualization, and Integrated Discovery) databases. The top ten genes selected using Pearson’s correlation coefficient and Cosine coefficient contained six genes that have been implicated in cancer (often multiple cancers) genesis in previous studies. Fewer genes were obtained by the other methods (4 genes using Mutual information, 3 genes using Euclidean distance and only one gene using Entropy). A better result was obtained by the utilization of a hybrid approach based on intersecting the highly ranked genes in the output of all investigated methods. This hybrid combination yielded seven genes (2 genes for HCC and 5 genes in different types of cancer) in the top ten genes of the list of intersected genes. Conclusions: To strengthen the effectiveness of the univariate selection methods, we propose a hybrid approach by intersecting several of these methods in a cascaded manner. This approach surpasses all of univariate selection methods when
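
    The hybrid idea in this record, keeping only genes that rank highly under several univariate criteria, can be sketched as a set intersection of top-k lists. This is a simplified illustration: the expression values and gene names are invented, only two of the listed criteria are used, and a plain intersection stands in for the cascaded scheme the authors describe.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length vectors."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (sa * sb)


def cosine(a, b):
    """Cosine coefficient between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(expr, labels, scorer, k):
    """Names of the k genes scoring highest against the class labels."""
    ranked = sorted(expr, key=lambda g: scorer(expr[g], labels), reverse=True)
    return ranked[:k]


def hybrid_selection(expr, labels, scorers, k):
    """Genes that appear in the top-k list of every scorer."""
    return set.intersection(*(set(top_k(expr, labels, s, k)) for s in scorers))


# Hypothetical expression profiles: 3 normal samples, then 3 tumor samples.
labels = [0, 0, 0, 1, 1, 1]
expr = {
    "gene_a": [1, 1, 1, 5, 5, 5],         # strongly up-regulated in tumor
    "gene_b": [10, 10, 10, 11, 11, 11],   # small shift on a high baseline
    "gene_c": [0, 0, 0, 6, 2, 2],         # up-regulated but noisy
    "gene_d": [4, 5, 4, 1, 1, 2],         # down-regulated
}

hybrid = hybrid_selection(expr, labels, [pearson, cosine], k=2)
```

    Here Pearson's coefficient ranks gene_a and gene_b highest, the cosine coefficient ranks gene_a and gene_c highest, and only gene_a survives the intersection.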

  17. Feature Selection by Reordering

    Jiřina, Marcel; Jiřina jr., M.

    2005-01-01

    Vol. 2, No. 1 (2005), pp. 155-161. ISSN 1738-6438. Institutional research plan: CEZ:AV0Z10300504. Keywords: feature selection * data reduction * ordering of features. Subject RIV: BA - General Mathematics

  18. Comparison of different Methods for Univariate Time Series Imputation in R

    Moritz, Steffen; Sardá, Alexis; Bartz-Beielstein, Thomas; Zaefferer, Martin; Stork, Jörg

    2015-01-01

    Missing values in datasets are a well-known problem, and there are quite a lot of R packages offering imputation functions. But while imputation in general is well covered within R, it is hard to find functions for imputation of univariate time series. The problem is that most standard imputation techniques cannot be applied directly. Most algorithms rely on inter-attribute correlations, while univariate time series imputation needs to employ time dependencies. This paper provides an overview of ...

  19. Comparison of multivariate and univariate statistical process control and monitoring methods

    Leger, R.P.; Garland, WM.J.; Macgregor, J.F.

    1996-01-01

    Work in recent years has led to the development of multivariate process monitoring schemes which use Principal Component Analysis (PCA). This research compares the performance of a univariate scheme and a multivariate PCA scheme used for monitoring a simple process with 11 measured variables. The multivariate PCA scheme was able to adequately represent the process using two principal components. This resulted in a PCA monitoring scheme which used two charts as opposed to 11 charts for the univariate scheme and therefore had distinct advantages in terms of data representation, presentation, and fault diagnosis capabilities. (author)

  20. Characteristics of genomic signatures derived using univariate methods and mechanistically anchored functional descriptors for predicting drug- and xenobiotic-induced nephrotoxicity.

    Shi, Weiwei; Bugrim, Andrej; Nikolsky, Yuri; Nikolskya, Tatiana; Brennan, Richard J

    2008-01-01

    ABSTRACT The ideal toxicity biomarker is composed of the properties of prediction (is detected prior to traditional pathological signs of injury), accuracy (high sensitivity and specificity), and mechanistic relationships to the endpoint measured (biological relevance). Gene expression-based toxicity biomarkers ("signatures") have shown good predictive power and accuracy, but are difficult to interpret biologically. We have compared different statistical methods of feature selection with knowledge-based approaches, using GeneGo's database of canonical pathway maps, to generate gene sets for the classification of renal tubule toxicity. The gene set selection algorithms include four univariate analyses: t-statistics, fold-change, B-statistics, and RankProd, and their combination and overlap for the identification of differentially expressed probes. Enrichment analysis following the results of the four univariate analyses, Hotelling T-square test, and, finally, out-of-bag selection, a variant of cross-validation, were used to identify canonical pathway maps (sets of genes coordinately involved in key biological processes) with classification power. Differentially expressed genes identified by the different statistical univariate analyses all generated reasonably performing classifiers of tubule toxicity. Maps identified by enrichment analysis or Hotelling T-square had lower classification power, but highlighted perturbed lipid homeostasis as a common discriminator of nephrotoxic treatments. The out-of-bag method yielded the best functionally integrated classifier. The map "ephrins signaling" performed comparably to a classifier derived using sparse linear programming, a machine learning algorithm, and represents a signaling network specifically involved in renal tubule development and integrity. Such functional descriptors of toxicity promise to better integrate predictive toxicogenomics with mechanistic analysis, facilitating the interpretation and risk assessment of

  1. Regression Is a Univariate General Linear Model Subsuming Other Parametric Methods as Special Cases.

    Vidal, Sherry

    Although the concept of the general linear model (GLM) has existed since the 1960s, other univariate analyses such as the t-test and the analysis of variance models have remained popular. The GLM produces an equation that minimizes the mean differences of independent variables as they are related to a dependent variable. From a computer printout…

  2. Genetic search feature selection for affective modeling

    Martínez, Héctor P.; Yannakakis, Georgios N.

    2010-01-01

    Automatic feature selection is a critical step towards the generation of successful computational models of affect. This paper presents a genetic search-based feature selection method which is developed as a global-search algorithm for improving the accuracy of the affective models built....... The method is tested and compared against sequential forward feature selection and random search in a dataset derived from a game survey experiment which contains bimodal input features (physiological and gameplay) and expressed pairwise preferences of affect. Results suggest that the proposed method...

  3. Diagnosing Autism Spectrum Disorder from Brain Resting-State Functional Connectivity Patterns Using a Deep Neural Network with a Novel Feature Selection Method.

    Guo, Xinyu; Dominick, Kelli C; Minai, Ali A; Li, Hailong; Erickson, Craig A; Lu, Long J

    2017-01-01

    The whole-brain functional connectivity (FC) patterns obtained from resting-state functional magnetic resonance imaging data are commonly applied to study neuropsychiatric conditions such as autism spectrum disorder (ASD) by using different machine learning models. Recent studies indicate that both hyper- and hypo- aberrant ASD-associated FCs were widely distributed throughout the entire brain rather than only in some specific brain regions. Deep neural networks (DNN) with multiple hidden layers have shown the ability to systematically extract lower-to-higher level information from high dimensional data across a series of neural hidden layers, significantly improving classification accuracy for such data. In this study, a DNN with a novel feature selection method (DNN-FS) is developed for the high dimensional whole-brain resting-state FC pattern classification of ASD patients vs. typical development (TD) controls. The feature selection method is able to help the DNN generate low dimensional high-quality representations of the whole-brain FC patterns by selecting features with high discriminating power from multiple trained sparse auto-encoders. For the comparison, a DNN without the feature selection method (DNN-woFS) is developed, and both of them are tested with different architectures (i.e., with different numbers of hidden layers/nodes). Results show that the best classification accuracy of 86.36% is generated by the DNN-FS approach with 3 hidden layers and 150 hidden nodes (3/150). Remarkably, DNN-FS outperforms DNN-woFS for all architectures studied. The most significant accuracy improvement was 9.09% with the 3/150 architecture. The method also outperforms other feature selection methods, e.g., two-sample t-test and elastic net. In addition to improving the classification accuracy, a Fisher's score-based biomarker identification method based on the DNN is also developed, and used to identify 32 FCs related to ASD. These FCs come from or cross different pre

  4. Diagnosing Autism Spectrum Disorder from Brain Resting-State Functional Connectivity Patterns Using a Deep Neural Network with a Novel Feature Selection Method

    Xinyu Guo

    2017-08-01

    The whole-brain functional connectivity (FC) patterns obtained from resting-state functional magnetic resonance imaging data are commonly applied to study neuropsychiatric conditions such as autism spectrum disorder (ASD) by using different machine learning models. Recent studies indicate that both hyper- and hypo- aberrant ASD-associated FCs were widely distributed throughout the entire brain rather than only in some specific brain regions. Deep neural networks (DNN) with multiple hidden layers have shown the ability to systematically extract lower-to-higher level information from high dimensional data across a series of neural hidden layers, significantly improving classification accuracy for such data. In this study, a DNN with a novel feature selection method (DNN-FS) is developed for the high dimensional whole-brain resting-state FC pattern classification of ASD patients vs. typical development (TD) controls. The feature selection method is able to help the DNN generate low dimensional high-quality representations of the whole-brain FC patterns by selecting features with high discriminating power from multiple trained sparse auto-encoders. For the comparison, a DNN without the feature selection method (DNN-woFS) is developed, and both of them are tested with different architectures (i.e., with different numbers of hidden layers/nodes). Results show that the best classification accuracy of 86.36% is generated by the DNN-FS approach with 3 hidden layers and 150 hidden nodes (3/150). Remarkably, DNN-FS outperforms DNN-woFS for all architectures studied. The most significant accuracy improvement was 9.09% with the 3/150 architecture. The method also outperforms other feature selection methods, e.g., two-sample t-test and elastic net. In addition to improving the classification accuracy, a Fisher's score-based biomarker identification method based on the DNN is also developed, and used to identify 32 FCs related to ASD. These FCs come from or cross

  5. Dysphonic Voice Pattern Analysis of Patients in Parkinson’s Disease Using Minimum Interclass Probability Risk Feature Selection and Bagging Ensemble Learning Methods

    Yunfeng Wu

    2017-01-01

    Analysis of quantified voice patterns is useful in the detection and assessment of dysphonia and related phonation disorders. In this paper, we first study the linear correlations between 22 voice parameters of fundamental frequency variability, amplitude variations, and nonlinear measures. The highly correlated vocal parameters are combined by using the linear discriminant analysis method. Based on the probability density functions estimated by the Parzen-window technique, we propose an interclass probability risk (ICPR) method to select the vocal parameters with small ICPR values as dominant features and compare with the modified Kullback-Leibler divergence (MKLD) feature selection approach. The experimental results show that the generalized logistic regression analysis (GLRA), support vector machine (SVM), and Bagging ensemble algorithm input with the ICPR features can provide better classification results than the same classifiers with the MKLD selected features. The SVM is much better at distinguishing normal vocal patterns with a specificity of 0.8542. Among the three classification methods, the Bagging ensemble algorithm with ICPR features can identify 90.77% vocal patterns, with the highest sensitivity of 0.9796 and largest area value of 0.9558 under the receiver operating characteristic curve. The classification results demonstrate the effectiveness of our feature selection and pattern analysis methods for dysphonic voice detection and measurement.

  6. Feature selection and data sampling methods for learning reputation dimensions: The University of Amsterdam at RepLab 2014

    Gârbacea, C.; Tsagkias, M.; de Rijke, M.

    2014-01-01

    We report on our participation in the reputation dimension task of the CLEF RepLab 2014 evaluation initiative, i.e., to classify social media updates into eight predefined categories. We address the task by using corpus-based methods to extract textual features from the labeled training data to

  7. Feature selection and nearest centroid classification for protein mass spectrometry

    Levner Ilya

    2005-03-01

    Background: The use of mass spectrometry as a proteomics tool is poised to revolutionize early disease diagnosis and biomarker identification. Unfortunately, before standard supervised classification algorithms can be employed, the "curse of dimensionality" needs to be solved. Due to the sheer amount of information contained within the mass spectra, most standard machine learning techniques cannot be directly applied. Instead, feature selection techniques are used to first reduce the dimensionality of the input space and thus enable the subsequent use of classification algorithms. This paper examines feature selection techniques for proteomic mass spectrometry. Results: This study examines the performance of the nearest centroid classifier coupled with the following feature selection algorithms. Student-t test, Kolmogorov-Smirnov test, and the P-test are univariate statistics used for filter-based feature ranking. From the wrapper approaches we tested sequential forward selection and a modified version of sequential backward selection. Embedded approaches included shrunken nearest centroid and a novel version of boosting based feature selection we developed. In addition, we tested several dimensionality reduction approaches, namely principal component analysis and principal component analysis coupled with linear discriminant analysis. To fairly assess each algorithm, evaluation was done using stratified cross validation with an internal leave-one-out cross-validation loop for automated feature selection. Comprehensive experiments, conducted on five popular cancer data sets, revealed that the less advocated sequential forward selection and boosted feature selection algorithms produce the most consistent results across all data sets. In contrast, the state-of-the-art performance reported on isolated data sets for several of the studied algorithms does not hold across all data sets. Conclusion: This study tested a number of popular feature
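
    Filter-based ranking with a univariate t-statistic, as mentioned in this record, can be sketched as follows. The data are invented, the function names are hypothetical, and the unequal-variance (Welch) form of the statistic is an assumption made here; the record only names the Student-t test.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Two-sample t-statistic with unequal variances (Welch form)."""
    return (mean(a) - mean(b)) / math.sqrt(
        variance(a) / len(a) + variance(b) / len(b)
    )


def rank_features(X, y):
    """Feature indices sorted by |t| between class 1 and class 0, best first."""
    scores = []
    for j in range(len(X[0])):
        pos = [row[j] for row, t in zip(X, y) if t == 1]
        neg = [row[j] for row, t in zip(X, y) if t == 0]
        scores.append(abs(welch_t(pos, neg)))
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)


# Hypothetical intensities: 8 spectra x 3 peaks, labels 0 (control) / 1 (case).
X = [
    [1.0, 0.5, 10.0],
    [1.4, 0.7, 10.2],
    [0.8, 0.4, 9.9],
    [1.3, 0.6, 10.1],
    [1.5, 0.6, 14.0],
    [1.1, 0.5, 14.2],
    [1.6, 0.4, 13.9],
    [1.2, 0.7, 14.1],
]
y = [0, 0, 0, 0, 1, 1, 1, 1]

ranking = rank_features(X, y)
```

    In a cross-validated evaluation such as the one the record describes, the ranking would be recomputed inside each training fold so that held-out samples do not influence which features are kept.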

  8. Univariate Time Series Prediction of Solar Power Using a Hybrid Wavelet-ARMA-NARX Prediction Method

    Nazaripouya, Hamidreza; Wang, Yubo; Chu, Chi-Cheng; Pota, Hemanshu; Gadh, Rajit

    2016-05-02

    This paper proposes a new hybrid method for super short-term solar power prediction. Solar output power usually has a complex, nonstationary, and nonlinear characteristic due to the intermittent and time-varying behavior of solar radiance. In addition, solar power dynamics are fast and inertialess. An accurate super short-term prediction is required to compensate for the fluctuations and reduce the impact of solar power penetration on the power system. The objective is to predict one-step-ahead solar power generation based only on historical solar power time series data. The proposed method incorporates discrete wavelet transform (DWT), Auto-Regressive Moving Average (ARMA) models, and Recurrent Neural Networks (RNN), while the RNN architecture is based on Nonlinear Auto-Regressive models with eXogenous inputs (NARX). The wavelet transform is utilized to decompose the solar power time series into a set of richer-behaved forming series for prediction. The ARMA model is employed as a linear predictor, while NARX is used as a nonlinear pattern recognition tool to estimate and compensate for the error of the wavelet-ARMA prediction. The proposed method is applied to the data captured from UCLA solar PV panels and the results are compared with some of the common and most recent solar power prediction methods. The results validate the effectiveness of the proposed approach and show a considerable improvement in the prediction precision.

  9. New theory of discriminant analysis after R. Fisher advanced research by the feature selection method for microarray data

    Shinmura, Shuichi

    2016-01-01

    This is the first book to compare eight LDFs by different types of datasets, such as Fisher’s iris data, medical data with collinearities, Swiss banknote data that is a linearly separable data (LSD), student pass/fail determination using student attributes, 18 pass/fail determinations using exam scores, Japanese automobile data, and six microarray datasets (the datasets) that are LSD. We developed the 100-fold cross-validation for the small sample method (Method 1) instead of the LOO method. We proposed a simple model selection procedure to choose the best model having minimum M2 and Revised IP-OLDF based on MNM criterion was found to be better than other M2s in the above datasets. We compared two statistical LDFs and six MP-based LDFs. Those were Fisher’s LDF, logistic regression, three SVMs, Revised IP-OLDF, and another two OLDFs. Only a hard-margin SVM (H-SVM) and Revised IP-OLDF could discriminate LSD theoretically (Problem 2). We solved the defect of the generalized inverse matrices (Problem 3). For ...

  10. Leukemia and colon tumor detection based on microarray data classification using momentum backpropagation and genetic algorithm as a feature selection method

    Wisesty, Untari N.; Warastri, Riris S.; Puspitasari, Shinta Y.

    2018-03-01

    Cancer is one of the major causes of morbidity and mortality worldwide. There is therefore a need for a system that can analyze microarray data derived from a patient’s deoxyribonucleic acid (DNA) and identify whether the person suffers from cancer. However, microarray data have thousands of attributes, which makes data processing challenging; this is often referred to as the curse of dimensionality. In this study, a system capable of detecting whether or not a patient has cancer was therefore built. The algorithm uses a Genetic Algorithm for feature selection and a Momentum Backpropagation Neural Network as the classification method, with data from the Kent Ridge Bio-medical Dataset. In system testing, the system detected leukemia and colon tumor with a best accuracy of 98.33% for the colon tumor data and 100% for the leukemia data. The Genetic Algorithm as feature selection algorithm improved system accuracy from 64.52% to 98.33% for the colon tumor data and from 65.28% to 100% for the leukemia data, and the use of momentum parameters accelerated the convergence of the Neural Network during training.

  11. Research on Feature Selection Method for Interval Sorting Decision

    宋鹏; 梁吉业; 钱宇华; 李常洪

    2017-01-01

    In the field of multiple attribute decision making, sorting decision has become an important kind of issue and has received wide attention in many practical application areas. In the process of making a sorting decision, rational and effective feature selection methods can extract informative and pertinent attributes, and thus improve the efficiency of decision making. In the existing literature, much valuable research addresses this problem for diverse data types, such as single-valued, null-valued and set-valued data. However, very few studies focus on sorting decisions for interval-valued data. The objective of this paper is to provide a new feature selection approach for interval sorting decision by using the interval outranking relation. By integrating the rough set model and information entropy theory, a new measure called complementary condition entropy, which investigates the complementary nature of the relevant sets, is proposed for feature evaluation by analyzing the inherent correlation between the considered attributes in the problem of interval sorting decision. Furthermore, on the basis of the differences in the values of complementary condition entropy, the representation of indispensable attributes and a measure of attribute importance are presented, and a heuristic feature selection algorithm is then proposed for interval sorting decision. Finally, two illustrative applications, namely venture investment and portfolio selection, are employed to demonstrate the validity of the proposed method. For the problem of multi-stage venture investment decision, by investigating the competitiveness, development capacity and financial capability of 16 investment projects, the corresponding probabilistic decision rules with better generalization capability are obtained, which can be used to determine whether to perform further investment. As to the issue of portfolio selection, 91 stocks coming

  12. Online feature selection with streaming features.

    Wu, Xindong; Yu, Kui; Ding, Wei; Wang, Hao; Zhu, Xingquan

    2013-05-01

    We propose a new online feature selection framework for applications with streaming features where the knowledge of the full feature space is unknown in advance. We define streaming features as features that flow in one by one over time whereas the number of training examples remains fixed. This is in contrast with traditional online learning methods that only deal with sequentially added observations, with little attention being paid to streaming features. The critical challenges for Online Streaming Feature Selection (OSFS) include 1) the continuous growth of feature volumes over time, 2) a large feature space, possibly of unknown or infinite size, and 3) the unavailability of the entire feature set before learning starts. In the paper, we present a novel Online Streaming Feature Selection method to select strongly relevant and nonredundant features on the fly. An efficient Fast-OSFS algorithm is proposed to improve feature selection performance. The proposed algorithms are evaluated extensively on high-dimensional datasets and also with a real-world case study on impact crater detection. Experimental results demonstrate that the algorithms achieve better compactness and higher prediction accuracy than existing streaming feature selection algorithms.

  13. Feature selection toolbox software package

    Pudil, Pavel; Novovičová, Jana; Somol, Petr

    2002-01-01

    Vol. 23, No. 4 (2002), pp. 487-492 ISSN 0167-8655 R&D Projects: GA ČR GA402/01/0981 Institutional research plan: CEZ:AV0Z1075907 Keywords: pattern recognition * feature selection * floating search algorithms Subject RIV: BB - Applied Statistics, Operational Research Impact factor: 0.409, year: 2002

  14. Comparing observer models and feature selection methods for a task-based statistical assessment of digital breast tomosynthesis in reconstruction space

    Park, Subok; Zhang, George Z.; Zeng, Rongping; Myers, Kyle J.

    2014-03-01

    A task-based assessment of image quality [1] for digital breast tomosynthesis (DBT) can be done in either the projected or reconstructed data space. As the choice of observer models and feature selection methods can vary depending on the type of task and data statistics, we previously investigated the performance of two channelized-Hotelling observer models in conjunction with 2D Laguerre-Gauss (LG) and two implementations of partial least squares (PLS) channels along with that of the Hotelling observer in binary detection tasks involving DBT projections [2, 3]. The difference in these observers lies in how the spatial correlation in DBT angular projections is incorporated in the observer's strategy to perform the given task. In the current work, we extend our method to the reconstructed data space of DBT. We investigate how various model observers including the aforementioned compare for performing the binary detection of a spherical signal embedded in structured breast phantoms with the use of DBT slices reconstructed via filtered back projection. We explore how well the model observers incorporate the spatial correlation between different numbers of reconstructed DBT slices while varying the number of projections. For this, relatively small and large scan angles (24° and 96°) are used for comparison. Our results indicate that 1) given a particular scan angle, the number of projections needed to achieve the best performance for each observer is similar across all observer/channel combinations, i.e., Np = 25 for scan angle 96° and Np = 13 for scan angle 24°, and 2) given these sufficient numbers of projections, the number of slices for each observer to achieve the best performance differs depending on the channel/observer types, which is more pronounced in the narrow scan angle case.

  15. Discriminative semi-supervised feature selection via manifold regularization.

    Xu, Zenglin; King, Irwin; Lyu, Michael Rung-Tsong; Jin, Rong

    2010-07-01

    Feature selection has attracted a huge amount of interest in both research and application communities of data mining. We consider the problem of semi-supervised feature selection, where we are given a small amount of labeled examples and a large amount of unlabeled examples. Since a small number of labeled samples are usually insufficient for identifying the relevant features, the critical problem arising from semi-supervised feature selection is how to take advantage of the information underneath the unlabeled data. To address this problem, we propose a novel discriminative semi-supervised feature selection method based on the idea of manifold regularization. The proposed approach selects features through maximizing the classification margin between different classes and simultaneously exploiting the geometry of the probability distribution that generates both labeled and unlabeled data. In comparison with previous semi-supervised feature selection algorithms, our proposed semi-supervised feature selection method is an embedded feature selection method and is able to find more discriminative features. We formulate the proposed feature selection method into a convex-concave optimization problem, where the saddle point corresponds to the optimal solution. To find the optimal solution, the level method, a fairly recent optimization method, is employed. We also present a theoretic proof of the convergence rate for the application of the level method to our problem. Empirical evaluation on several benchmark data sets demonstrates the effectiveness of the proposed semi-supervised feature selection method.

  16. Feature Selection via Chaotic Antlion Optimization.

    Hossam M Zawbaa

    Selecting a subset of relevant properties from a large set of features that describe a dataset is a challenging machine learning task. In biology, for instance, the advances in the available technologies enable the generation of a very large number of biomarkers that describe the data. Choosing the more informative markers along with performing a high-accuracy classification over the data can be a daunting task, particularly if the data are high dimensional. An often adopted approach is to formulate the feature selection problem as a biobjective optimization problem, with the aim of maximizing the performance of the data analysis model (the quality of the data training fitting) while minimizing the number of features used. We propose an optimization approach for the feature selection problem that considers a "chaotic" version of the antlion optimizer method, a nature-inspired algorithm that mimics the hunting mechanism of antlions in nature. The balance between exploration of the search space and exploitation of the best solutions is a challenge in multi-objective optimization. The exploration/exploitation rate is controlled by the parameter I that limits the random walk range of the ants/prey. This variable is increased iteratively in a quasi-linear manner to decrease the exploration rate as the optimization progresses. The quasi-linear change in the variable I may lead to premature convergence in some cases and trapping in local minima in other cases. The chaotic system proposed here attempts to improve the tradeoff between exploration and exploitation. The methodology is evaluated using different chaotic maps on a number of feature selection datasets. To ensure generality, we used ten biological datasets, but we also used other types of data from various sources. The results are compared with the particle swarm optimizer and with genetic algorithm variants for feature selection using a set of quality metrics.

  17. Novel feature selection method based on Stochastic Methods Coupled to Support Vector Machines using H-NMR data (data of olive and hazelnut oils)

    Oscar Eduardo Gualdron

    2014-12-01

    One of the principal difficulties in data analysis and information processing is the representation of the dataset. Normally, one encounters a high number of samples, each one with thousands of variables, and in many cases with irrelevant information and noise. Therefore, in order to represent findings in a clearer way, it is necessary to reduce the number of variables. In this paper, a novel variable selection technique for multivariable data analysis, inspired by stochastic methods and designed to work with support vector machines (SVM), is described. The approach is demonstrated in a food application involving the detection of adulteration of olive oil (more expensive) with hazelnut oil (cheaper). Fingerprinting by H NMR spectroscopy was used to analyze the different samples. Results show that it is possible to reduce the number of variables without affecting classification results.

  18. Survival Prediction and Feature Selection in Patients with Breast Cancer Using Support Vector Regression

    Shahrbanoo Goli

    2016-01-01

    The Support Vector Regression (SVR) model has been broadly used for response prediction. However, few researchers have used SVR for survival analysis. In this study, a new SVR model is proposed and SVR with different kernels and the traditional Cox model are trained. The models are compared based on different performance measures. We also select the best subset of features using three feature selection methods: combination of SVR and statistical tests, univariate feature selection based on concordance index, and recursive feature elimination. The evaluations are performed using available medical datasets and also a Breast Cancer (BC) dataset consisting of 573 patients who visited the Oncology Clinic of Hamadan province in Iran. Results show that, for the BC dataset, survival time can be predicted more accurately by linear SVR than nonlinear SVR. Based on the three feature selection methods, metastasis status, progesterone receptor status, and human epidermal growth factor receptor 2 status are the best features associated with survival. Also, according to the obtained results, performance of linear and nonlinear kernels is comparable. The proposed SVR model performs similar to or slightly better than other models. Also, SVR performs similar to or better than Cox when all features are included in the model.
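
    As a rough illustration of univariate feature selection based on the concordance index, the sketch below scores each feature on its own with a simplified Harrell's C-index and ranks the features by how far C lies from 0.5. The function names, the handling of tied times, and the toy data are assumptions made for illustration; they are not taken from the study.

```python
from itertools import combinations


def concordance_index(scores, times, events):
    """Simplified Harrell's C: share of comparable pairs in which the subject
    with the higher score has the shorter observed survival time."""
    concordant, comparable = 0.0, 0
    for i, j in combinations(range(len(times)), 2):
        if times[i] == times[j]:
            continue  # tied times are skipped in this simplified version
        early, late = (i, j) if times[i] < times[j] else (j, i)
        if not events[early]:
            continue  # earlier subject was censored: pair is not comparable
        comparable += 1
        if scores[early] > scores[late]:
            concordant += 1.0
        elif scores[early] == scores[late]:
            concordant += 0.5  # tied scores count as half
    return concordant / comparable if comparable else 0.5


def rank_features_by_concordance(rows, times, events):
    """Score each column separately; sort by |C - 0.5|, largest first."""
    ranking = []
    for k in range(len(rows[0])):
        column = [row[k] for row in rows]
        c = concordance_index(column, times, events)
        ranking.append((k, c, abs(c - 0.5)))
    ranking.sort(key=lambda item: item[2], reverse=True)
    return ranking


# Hypothetical toy data: 5 patients, 3 features.
rows = [
    [5.0, 1.0, 2.0],
    [4.0, 2.0, 2.0],
    [3.0, 1.0, 2.0],
    [2.0, 2.0, 2.0],
    [1.0, 1.0, 2.0],
]
times = [1.0, 2.0, 3.0, 4.0, 5.0]  # observed follow-up times
events = [1, 1, 0, 1, 1]           # 1 = event observed, 0 = censored
ranking = rank_features_by_concordance(rows, times, events)
for index, c, strength in ranking:
    print(f"feature {index}: C = {c:.4f}, |C - 0.5| = {strength:.4f}")
```

    Ranking by |C - 0.5| rather than by C itself treats a feature that is strongly protective the same as one that is strongly harmful, which is one possible design choice among several.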

  19. Classification Using Markov Blanket for Feature Selection

    Zeng, Yifeng; Luo, Jian

    2009-01-01

    Selecting relevant features is in demand when a large data set is of interest in a classification task. It produces a tractable number of features that are sufficient and possibly improve the classification performance. This paper studies a statistical method of Markov blanket induction algorithm...... for filtering features and then applies a classifier using the Markov blanket predictors. The Markov blanket contains a minimal subset of relevant features that yields optimal classification performance. We experimentally demonstrate the improved performance of several classifiers using a Markov blanket...... induction as a feature selection method. In addition, we point out an important assumption behind the Markov blanket induction algorithm and show its effect on the classification performance....

  1. Naive Bayes-Guided Bat Algorithm for Feature Selection

    Taha, Ahmed Majid; Mustapha, Aida; Chen, Soong-Der

    2013-01-01

    With the amount of data and information said to double every 20 months or so, feature selection has become highly important and beneficial. Further improvements in feature selection will positively affect a wide array of applications in fields such as pattern recognition, machine learning, or signal processing. A bio-inspired method, the Bat Algorithm hybridized with a Naive Bayes classifier (BANB), is presented in this work. The performance of the proposed feature selection algorithm was investigated using twelve benchmark datasets from different domains and was compared to three other well-known feature selection algorithms. Discussion focused on four perspectives: number of features, classification accuracy, stability, and feature generalization. The results showed that BANB significantly outperformed the other algorithms in selecting a lower number of features, hence removing irrelevant, redundant, or noisy features while maintaining the classification accuracy. BANB also proved to be more stable than the other methods and is capable of producing more general feature subsets. PMID:24396295

  2. A redundancy-removing feature selection algorithm for nominal data

    Zhihua Li

    2015-10-01

    No order correlation or similarity metric exists in nominal data, and there will always be more redundancy in a nominal dataset, which means that an efficient mutual information-based nominal-data feature selection method is relatively difficult to find. In this paper, a nominal-data feature selection method based on mutual information without data transformation, called the redundancy-removing more relevance less redundancy algorithm, is proposed. By forming several new information-related definitions and the corresponding computational methods, the proposed method can compute the information-related amount of nominal data directly. Furthermore, by creating a new evaluation function that considers both the relevance and the redundancy globally, the new feature selection method can evaluate the importance of each nominal-data feature. Although the presented feature selection method takes commonly used MIFS-like forms, it is capable of handling high-dimensional datasets without expensive computations. We perform extensive experimental comparisons of the proposed algorithm and other methods using three benchmarking nominal datasets with two different classifiers. The experimental results demonstrate the average advantage of the presented algorithm over the well-known NMIFS algorithm in terms of feature selection and classification accuracy, which indicates that the proposed method has promising performance.
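
    To make the "MIFS-like form" concrete, here is a minimal sketch of the generic MIFS criterion on nominal data: each candidate feature is scored by its mutual information with the class minus beta times its summed mutual information with the features already selected. This is the textbook greedy scheme, not the redundancy-removing algorithm proposed in the paper; the function names, the beta value, and the toy data are assumptions.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Mutual information (in bits) between two nominal sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )


def mifs_select(columns, labels, k, beta=1.0):
    """Greedy MIFS-style selection: relevance minus beta * summed redundancy."""
    relevance = [mutual_information(col, labels) for col in columns]
    selected = []
    remaining = list(range(len(columns)))
    while remaining and len(selected) < k:
        def score(f):
            redundancy = sum(
                mutual_information(columns[f], columns[s]) for s in selected
            )
            return relevance[f] - beta * redundancy
        best = max(remaining, key=score)  # ties go to the earliest feature
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical nominal data: feature 1 duplicates feature 0, and the class
# label is determined by features 0 and 2 together.
columns = [
    ["a", "a", "b", "b"],
    ["a", "a", "b", "b"],
    ["x", "y", "x", "y"],
]
labels = ["ax", "ay", "bx", "by"]
selected = mifs_select(columns, labels, k=2)
print(selected)
```

    In this toy case the duplicated feature is passed over because its redundancy with the first pick cancels its relevance, which is the behaviour the relevance-redundancy trade-off is meant to produce.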

  3. A comparison between univariate probabilistic and multivariate (logistic regression) methods for landslide susceptibility analysis: the example of the Febbraro valley (Northern Alps, Italy)

    Rossi, M.; Apuani, T.; Felletti, F.

    2009-04-01

    The aim of this paper is to compare the results of two statistical methods for landslide susceptibility analysis: 1) a univariate probabilistic method based on the landslide susceptibility index, and 2) a multivariate method (logistic regression). The study area is the Febbraro valley, located in the central Italian Alps, where different types of metamorphic rocks crop out. In the eastern part of the studied basin, a Quaternary cover of colluvial and, secondarily, glacial deposits is dominant. In this study 110 earth flows, mainly located toward the NE portion of the catchment, were analyzed. They involve only the colluvial deposits and their extent mainly ranges from 36 to 3173 m². Both statistical methods require a spatial database in which each landslide is described by several parameters, assigned using the values at the central point of the landslide's main scarp. The spatial database is constructed using a Geographical Information System (GIS). Based on a bibliographic review, a total of 15 predisposing factors were used. The width of the intervals into which the maps of the predisposing factors have to be reclassified was defined assuming constant intervals for: elevation (100 m), slope (5°), solar radiation (0.1 MJ/cm²/year), profile curvature (1.2 1/m), tangential curvature (2.2 1/m), drainage density (0.5), lineament density (0.00126). For the other parameters, the results of the probability-probability plot analysis and the statistical indexes of the landslide sites were used. In particular slope length (0 ÷ 2, 2 ÷ 5, 5 ÷ 10, 10 ÷ 20, 20 ÷ 35, 35 ÷ 260), accumulation flow (0 ÷ 1, 1 ÷ 2, 2 ÷ 5, 5 ÷ 12, 12 ÷ 60, 60 ÷ 27265), Topographic Wetness Index (0 ÷ 0.74, 0.74 ÷ 1.94, 1.94 ÷ 2.62, 2.62 ÷ 3.48, 3.48 ÷ 6.00, 6.00 ÷ 9.44), Stream Power Index (0 ÷ 0.64, 0.64 ÷ 1.28, 1.28 ÷ 1.81, 1.81 ÷ 4.20, 4.20 ÷ 9

  4. Penalized feature selection and classification in bioinformatics

    Ma, Shuangge; Huang, Jian

    2008-01-01

    In bioinformatics studies, supervised classification with high-dimensional input variables is frequently encountered. Examples routinely arise in genomic, epigenetic and proteomic studies. Feature selection can be employed along with classifier construction to avoid over-fitting, to generate more reliable classifier and to provide more insights into the underlying causal relationships. In this article, we provide a review of several recently developed penalized feature selection and classific...

  5. Evaluating statistical and clinical significance of intervention effects in single-case experimental designs: an SPSS method to analyze univariate data.

    Maric, Marija; de Haan, Else; Hogendoorn, Sanne M; Wolters, Lidewij H; Huizenga, Hilde M

    2015-03-01

    Single-case experimental designs are useful methods in clinical research practice to investigate individual client progress. Their proliferation might have been hampered by methodological challenges such as the difficulty applying existing statistical procedures. In this article, we describe a data-analytic method to analyze univariate (i.e., one symptom) single-case data using the common package SPSS. This method can help the clinical researcher to investigate whether an intervention works as compared with a baseline period or another intervention type, and to determine whether symptom improvement is clinically significant. First, we describe the statistical method in a conceptual way and show how it can be implemented in SPSS. Simulation studies were performed to determine the number of observation points required per intervention phase. Second, to illustrate this method and its implications, we present a case study of an adolescent with anxiety disorders treated with cognitive-behavioral therapy techniques in an outpatient psychotherapy clinic, whose symptoms were regularly assessed before each session. We provide a description of the data analyses and results of this case study. Finally, we discuss the advantages and shortcomings of the proposed method. Copyright © 2014. Published by Elsevier Ltd.

  6. Comparative study of the efficiency of computed univariate and multivariate methods for the estimation of the binary mixture of clotrimazole and dexamethasone using two different spectral regions

    Fayez, Yasmin Mohammed; Tawakkol, Shereen Mostafa; Fahmy, Nesma Mahmoud; Lotfy, Hayam Mahmoud; Shehata, Mostafa Abdel-Aty

    2018-04-01

    Three methods of analysis that require computational procedures in the Matlab® software are applied. The first is the univariate mean centering method, which eliminates the interfering signal of one component at a selected wavelength, leaving the measured amplitude to represent the component of interest only. The other two are multivariate methods, PLS and PCR, which depend on a large number of variables and so extract the maximum amount of information required to determine the component of interest in the presence of the other. Accurate and precise results are obtained from the three methods for determining clotrimazole in the linearity ranges 1-12 μg/mL and 75-550 μg/mL with dexamethasone acetate 2-20 μg/mL in synthetic mixtures and pharmaceutical formulation using two different spectral regions, 205-240 nm and 233-278 nm. The results obtained are compared statistically to each other and to the official methods.

  7. Hierarchical feature selection for erythema severity estimation

    Wang, Li; Shi, Chenbo; Shu, Chang

    2014-10-01

    At present, the PASI scoring system is used for evaluating erythema severity, which can help doctors to diagnose psoriasis [1-3]. The system relies on the subjective judgment of doctors, so accuracy and stability cannot be guaranteed [4]. This paper proposes a stable and precise algorithm for erythema severity estimation. Our contributions are twofold. On one hand, in order to extract the multi-scale redness of erythema, we design a hierarchical feature. Different from traditional methods, we not only utilize the color statistical features, but also divide the detection window into small windows and extract hierarchical features. Further, a feature re-ranking step is introduced, which can guarantee that extracted features are irrelevant to each other. On the other hand, an adaptive boosting classifier is applied for further feature selection. During training, the classifier seeks out the most valuable features for evaluating erythema severity, owing to its strong learning ability. Experimental results demonstrate the high precision and robustness of our algorithm. The accuracy is 80.1% on a dataset comprising images of 116 patients with various kinds of erythema. Our system has now been applied to the evaluation of erythema treatment efficacy at Union Hosp, China.

  8. Recursive SVM feature selection and sample classification for mass-spectrometry and microarray data

    Harris Lyndsay N

    2006-04-01

    Background: Like microarray-based investigations, high-throughput proteomics techniques require machine learning algorithms to identify biomarkers that are informative for biological classification problems. Feature selection and classification algorithms need to be robust to noise and outliers in the data. Results: We developed a recursive support vector machine (R-SVM) algorithm to select important genes/biomarkers for the classification of noisy data. We compared its performance to a similar, state-of-the-art method (SVM recursive feature elimination, or SVM-RFE), paying special attention to the ability to recover the true informative genes/biomarkers and the robustness to outliers in the data. Simulation experiments show that a 5% to 20% improvement over SVM-RFE can be achieved with regard to these properties. The SVM-based methods are also compared with a conventional univariate method and their respective strengths and weaknesses are discussed. R-SVM was applied to two sets of SELDI-TOF-MS proteomics data, one from a human breast cancer study and the other from a study on rat liver cirrhosis. Important biomarkers found by the algorithm were validated by follow-up biological experiments. Conclusion: The proposed R-SVM method is suitable for analyzing noisy high-throughput proteomics and microarray data and it outperforms SVM-RFE in the robustness to noise and in the ability to recover informative features. The multivariate SVM-based method outperforms the univariate method in classification performance, but univariate methods can reveal more of the differentially expressed features, especially when there are correlations between the features.
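
    The abstract does not say which univariate method served as the baseline. As one common possibility, the sketch below ranks features by the absolute value of Welch's t-statistic between two classes, testing one feature at a time. The choice of statistic, the function names, and the toy data are all assumptions for illustration.

```python
from math import copysign, sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t-statistic for two samples with possibly unequal variances."""
    diff = mean(a) - mean(b)
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    if se == 0.0:
        # both groups are constant: no difference, or perfect separation
        return 0.0 if diff == 0 else copysign(float("inf"), diff)
    return diff / se


def rank_features_univariate(rows, labels):
    """Rank features by |t| between the two classes, one feature at a time."""
    classes = sorted(set(labels))
    assert len(classes) == 2, "this sketch handles two classes only"
    scores = []
    for k in range(len(rows[0])):
        a = [r[k] for r, y in zip(rows, labels) if y == classes[0]]
        b = [r[k] for r, y in zip(rows, labels) if y == classes[1]]
        scores.append((k, abs(welch_t(a, b))))
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Hypothetical toy data: 6 samples, 3 features, two classes.
rows = [
    [1.0, 2.0, 1.0],
    [1.2, 1.0, 2.0],
    [0.8, 3.0, 3.0],
    [3.0, 2.0, 2.0],
    [3.1, 3.0, 3.0],
    [2.9, 1.0, 4.0],
]
labels = [0, 0, 0, 1, 1, 1]
t_ranking = rank_features_univariate(rows, labels)
for index, t_abs in t_ranking:
    print(f"feature {index}: |t| = {t_abs:.3f}")
```

    Because each feature is scored in isolation, correlated informative features all receive high scores, which is consistent with the abstract's remark that univariate methods can reveal more of the differentially expressed features.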

  9. Feature Selection Based on Mutual Correlation

    Haindl, Michal; Somol, Petr; Ververidis, D.; Kotropoulos, C.

    2006-01-01

    Vol. 19, No. 4225 (2006), pp. 569-577 ISSN 0302-9743. [Iberoamerican Congress on Pattern Recognition. CIARP 2006 /11./. Cancun, 14.11.2006-17.11.2006] R&D Projects: GA AV ČR 1ET400750407; GA MŠk 1M0572; GA AV ČR IAA2075302 EU Projects: European Commission(XE) 507752 - MUSCLE Institutional research plan: CEZ:AV0Z10750506 Keywords: feature selection Subject RIV: BD - Theory of Information Impact factor: 0.402, year: 2005 http://library.utia.cas.cz/separaty/historie/haindl-feature selection based on mutual correlation.pdf

  10. Pairwise Constraint-Guided Sparse Learning for Feature Selection.

    Liu, Mingxia; Zhang, Daoqiang

    2016-01-01

    Feature selection aims to identify the most informative features for a compact and accurate data representation. As typical supervised feature selection methods, Lasso and its variants using L1-norm-based regularization terms have received much attention in recent studies, most of which use class labels as supervised information. Besides class labels, there are other types of supervised information, e.g., pairwise constraints that specify whether a pair of data samples belong to the same class (must-link constraint) or different classes (cannot-link constraint). However, most existing L1-norm-based sparse learning methods do not take advantage of the pairwise constraints, which provide weaker and more general supervised information. To address this problem, we propose a pairwise constraint-guided sparse (CGS) learning method for feature selection, where the must-link and the cannot-link constraints are used as discriminative regularization terms that directly concentrate on the local discriminative structure of data. Furthermore, we develop two variants of CGS: 1) semi-supervised CGS that utilizes labeled data, pairwise constraints, and unlabeled data and 2) ensemble CGS that uses the ensemble of pairwise constraint sets. We conduct a series of experiments on a number of data sets from the University of California-Irvine machine learning repository, a gene expression data set, two real-world neuroimaging-based classification tasks, and two large-scale attribute classification tasks. Experimental results demonstrate the efficacy of our proposed methods, compared with several established feature selection methods.

  11. Embedded Incremental Feature Selection for Reinforcement Learning

    2012-05-01

    Prior to this work, feature selection for reinforcement learning has focused on linear value function approximation (Kolter and Ng, 2009; Parr et al...In Proceedings of the 23rd International Conference on Machine Learning, pages 449–456. Kolter, J. Z. and Ng, A. Y. (2009). Regularization and feature

  12. Joint Feature Selection and Classification for Multilabel Learning.

    Huang, Jun; Li, Guorong; Huang, Qingming; Wu, Xindong

    2018-03-01

    Multilabel learning deals with examples having multiple class labels simultaneously. It has been applied to a variety of applications, such as text categorization and image annotation. A large number of algorithms have been proposed for multilabel learning, most of which concentrate on multilabel classification problems and only a few of them are feature selection algorithms. Current multilabel classification models are mainly built on a single data representation composed of all the features which are shared by all the class labels. Since each class label might be decided by some specific features of its own, and the problems of classification and feature selection are often addressed independently, in this paper, we propose a novel method which can perform joint feature selection and classification for multilabel learning, named JFSC. Different from many existing methods, JFSC learns both shared features and label-specific features by considering pairwise label correlations, and builds the multilabel classifier on the learned low-dimensional data representations simultaneously. A comparative study with state-of-the-art approaches manifests a competitive performance of our proposed method both in classification and feature selection for multilabel learning.

  13. Tracing the breeding farm of domesticated pig using feature selection (

    Taehyung Kwon

    2017-11-01

    Objective: Increasing food safety demands in the animal product market have created a need for a system to trace the food distribution process, from the manufacturer to the retailer, and genetic traceability is an effective method to trace the origin of animal products. In this study, we successfully achieved the farm tracing of 6,018 multi-breed pigs, using single nucleotide polymorphism (SNP) markers strictly selected through least absolute shrinkage and selection operator (LASSO) feature selection. Methods: We performed farm tracing of domesticated pig (Sus scrofa) from SNP markers and selected the most relevant features for accurate prediction. Considering the multi-breed composition of our data, we performed feature selection using LASSO penalization on 4,002 SNPs that are shared between breeds, which also include 179 SNPs with small between-breed differences. The 100 highest-scored features were extracted from iterative simulations and then evaluated using machine-learning based classifiers. Results: We selected 1,341 SNPs from over 45,000 SNPs through iterative LASSO feature selection, to minimize between-breed differences. We subsequently selected the 100 highest-scored SNPs from iterative scoring, and observed high statistical measures in classification of breeding farms by cross-validation using only these SNPs. Conclusion: The study represents a successful application of LASSO feature selection on multi-breed pig SNP data to trace the farm information, which provides a valuable method and possibility for further research on genetic traceability.
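
    The general mechanism behind LASSO feature selection can be sketched as follows: an L1 penalty drives some coefficients exactly to zero, and the features with nonzero coefficients are kept. The example fits a plain least-squares LASSO by cyclic coordinate descent. It is a regression sketch on made-up data, not the study's classification pipeline or its iterative scoring; the function names, the alpha value, and the data are assumptions.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero; anything inside the band becomes exactly 0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(rows, y, alpha, n_iter=100):
    """Minimise (1/2n) * ||y - Xw||^2 + alpha * ||w||_1 by cyclic updates."""
    n, p = len(rows), len(rows[0])
    w = [0.0] * p
    residual = list(y)  # equals y - Xw, with w = 0 at the start
    for _ in range(n_iter):
        for j in range(p):
            col = [row[j] for row in rows]
            z = sum(x * x for x in col) / n
            if z == 0.0:
                continue  # an all-zero column cannot receive weight
            # correlation of column j with the residual excluding feature j
            rho = sum(x * (r + w[j] * x) for x, r in zip(col, residual)) / n
            new_w = soft_threshold(rho, alpha) / z
            residual = [r - (new_w - w[j]) * x for x, r in zip(col, residual)]
            w[j] = new_w
    return w


def selected_features(w, tol=1e-12):
    """Indices of features whose coefficient survived the L1 penalty."""
    return [j for j, coef in enumerate(w) if abs(coef) > tol]


# Hypothetical data: y depends only on the first of two orthogonal features.
rows = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
y = [2.0, 2.0, -2.0, -2.0]
w = lasso_coordinate_descent(rows, y, alpha=0.5)
print(w, selected_features(w))
```

    A larger alpha shrinks more coefficients to zero and so keeps fewer features; repeating the fit over resampled data and counting how often each feature survives is one way to obtain per-feature scores.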

  14. Feature Selection with the Boruta Package

    Kursa, Miron B.; Rudnicki, Witold R.

    2010-01-01

    This article describes an R package, Boruta, implementing a novel feature selection algorithm for finding all relevant variables. The algorithm is designed as a wrapper around a Random Forest classification algorithm. It iteratively removes the features which are proved by a statistical test to be less relevant than random probes. The Boruta package provides a convenient interface to the algorithm. A short description of the algorithm and examples of its application are presented.

  15. Feature Selection with the Boruta Package

    Miron B. Kursa

    2010-10-01

    This article describes an R package, Boruta, implementing a novel feature selection algorithm for finding all relevant variables. The algorithm is designed as a wrapper around a Random Forest classification algorithm. It iteratively removes the features which are proved by a statistical test to be less relevant than random probes. The Boruta package provides a convenient interface to the algorithm. A short description of the algorithm and examples of its application are presented.

  16. Adversarial Feature Selection Against Evasion Attacks.

    Zhang, Fei; Chan, Patrick P K; Biggio, Battista; Yeung, Daniel S; Roli, Fabio

    2016-03-01

    Pattern recognition and machine learning techniques have been increasingly adopted in adversarial settings such as spam, intrusion, and malware detection, although their security against well-crafted attacks that aim to evade detection by manipulating data at test time has not yet been thoroughly assessed. While previous work has been mainly focused on devising adversary-aware classification algorithms to counter evasion attempts, only few authors have considered the impact of using reduced feature sets on classifier security against the same attacks. An interesting, preliminary result is that classifier security to evasion may be even worsened by the application of feature selection. In this paper, we provide a more detailed investigation of this aspect, shedding some light on the security properties of feature selection against evasion attacks. Inspired by previous work on adversary-aware classifiers, we propose a novel adversary-aware feature selection model that can improve classifier security against evasion attacks, by incorporating specific assumptions on the adversary's data manipulation strategy. We focus on an efficient, wrapper-based implementation of our approach, and experimentally validate its soundness on different application examples, including spam and malware detection.

  17. NetProt: Complex-based Feature Selection.

    Goh, Wilson Wen Bin; Wong, Limsoon

    2017-08-04

    Protein complex-based feature selection (PCBFS) provides unparalleled reproducibility with high phenotypic relevance on proteomics data. Currently, there are five PCBFS paradigms, but not all representative methods have been implemented or made readily available. To allow general users to take advantage of these methods, we developed the R-package NetProt, which provides implementations of representative feature-selection methods. NetProt also provides methods for generating simulated differential data and generating pseudocomplexes for complex-based performance benchmarking. The NetProt open source R package is available for download from https://github.com/gohwils/NetProt/releases/ , and online documentation is available at http://rpubs.com/gohwils/204259 .

  18. Feature selection for high-dimensional integrated data

    Zheng, Charles; Schwartz, Scott; Chapkin, Robert S.; Carroll, Raymond J.; Ivanov, Ivan

    2012-01-01

    Motivated by the problem of identifying correlations between genes or features of two related biological systems, we propose a model of feature selection in which only a subset of the predictors Xt are dependent on the multidimensional variate Y, and the remainder of the predictors constitute a “noise set” Xu independent of Y. Using Monte Carlo simulations, we investigated the relative performance of two methods: thresholding and singular-value decomposition, in combination with stochastic optimization to determine “empirical bounds” on the small-sample accuracy of an asymptotic approximation. We demonstrate the utility of the thresholding and SVD feature selection methods with respect to a recent infant intestinal gene expression and metagenomics dataset.

  19. Feature selection for high-dimensional integrated data

    Zheng, Charles

    2012-04-26

    Motivated by the problem of identifying correlations between genes or features of two related biological systems, we propose a model of feature selection in which only a subset of the predictors Xt are dependent on the multidimensional variate Y, and the remainder of the predictors constitute a “noise set” Xu independent of Y. Using Monte Carlo simulations, we investigated the relative performance of two methods: thresholding and singular-value decomposition, in combination with stochastic optimization to determine “empirical bounds” on the small-sample accuracy of an asymptotic approximation. We demonstrate the utility of the thresholding and SVD feature selection methods with respect to a recent infant intestinal gene expression and metagenomics dataset.

  20. Novel Automatic Filter-Class Feature Selection for Machine Learning Regression

    Wollsen, Morten Gill; Hallam, John; Jørgensen, Bo Nørregaard

    2017-01-01

    With the increased focus on application of Big Data in all sectors of society, the performance of machine learning becomes essential. Efficient machine learning depends on efficient feature selection algorithms. Filter feature selection algorithms are model-free and therefore very fast, but require...... model in the feature selection process. PCA is often used in machine learning literature and can be considered the default feature selection method. RDESF outperformed PCA in both experiments in both prediction error and computational speed. RDESF is a new step into filter-based automatic feature...

  1. Feature Selection for Chemical Sensor Arrays Using Mutual Information

    Wang, X. Rosalind; Lizier, Joseph T.; Nowotny, Thomas; Berna, Amalia Z.; Prokopenko, Mikhail; Trowell, Stephen C.

    2014-01-01

    We address the problem of feature selection for classifying a diverse set of chemicals using an array of metal oxide sensors. Our aim is to evaluate a filter approach to feature selection with reference to previous work, which used a wrapper approach on the same data set, and established best features and upper bounds on classification performance. We selected feature sets that exhibit the maximal mutual information with the identity of the chemicals. The selected features closely match those found to perform well in the previous study using a wrapper approach to conduct an exhaustive search of all permitted feature combinations. By comparing the classification performance of support vector machines (using features selected by mutual information) with the performance observed in the previous study, we found that while our approach does not always give the maximum possible classification performance, it always selects features that achieve classification performance approaching the optimum obtained by exhaustive search. We performed further classification using the selected feature set with some common classifiers and found that, for the selected features, Bayesian Networks gave the best performance. Finally, we compared the observed classification performances with the performance of classifiers using randomly selected features. We found that the selected features consistently outperformed randomly selected features for all tested classifiers. The mutual information filter approach is therefore a computationally efficient method for selecting near optimal features for chemical sensor arrays. PMID:24595058
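
    The filter approach described above scores features by their mutual information with the class identity. A minimal sketch of that scoring step follows, assuming discretized features and scoring each feature on its own (the record selects feature sets, so this is a simplification); the function names and toy data are hypothetical.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Mutual information I(X;Y) in bits for two equal-length discrete sequences."""
    n = len(xs)
    px = Counter(xs)
    py = Counter(ys)
    pxy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), count in pxy.items():
        p_joint = count / n
        mi += p_joint * log2(p_joint / ((px[x] / n) * (py[y] / n)))
    return mi


def rank_features_by_mi(columns, labels):
    """Return (indices sorted from most to least informative, per-feature scores)."""
    scores = [mutual_information(col, labels) for col in columns]
    order = sorted(range(len(columns)), key=lambda j: scores[j], reverse=True)
    return order, scores


# Hypothetical toy data: column 0 determines the label, column 1 is unrelated.
labels = ["a", "a", "b", "b"]
columns = [[0, 0, 1, 1], [0, 1, 0, 1]]
order, scores = rank_features_by_mi(columns, labels)
```

    Because no classifier is trained during scoring, the cost grows only with the number of features and samples, which is the efficiency argument made for filter methods over exhaustive wrapper search.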

  2. Simultaneous feature selection and classification via Minimax Probability Machine

    Liming Yang

    2010-12-01

    This paper presents a novel method for simultaneous feature selection and classification by incorporating a robust L1-norm into the objective function of the Minimax Probability Machine (MPM). A fractional programming framework is derived by using a bound on the misclassification error involving the mean and covariance of the data. Furthermore, the problems are solved by the Quadratic Interpolation method. Experiments show that our methods can select fewer features to improve the generalization compared to MPM, which illustrates the effectiveness of the proposed algorithms.

  3. Feature Selection for Wheat Yield Prediction

    Ruß, Georg; Kruse, Rudolf

    Carrying out effective and sustainable agriculture has become an important issue in recent years. Agricultural production has to keep up with an ever-increasing population by taking advantage of a field’s heterogeneity. Nowadays, modern technology such as the global positioning system (GPS) and a multitude of developed sensors enable farmers to better measure their fields’ heterogeneities. For this small-scale, precise treatment the term precision agriculture has been coined. However, the large amounts of data that are (literally) harvested during the growing season have to be analysed. In particular, the farmer is interested in knowing whether a newly developed heterogeneity sensor is potentially advantageous or not. Since the sensor data are readily available, this issue should be seen from an artificial intelligence perspective. There it can be treated as a feature selection problem. The additional task of yield prediction can be treated as a multi-dimensional regression problem. This article aims to present an approach towards solving these two practically important problems using artificial intelligence and data mining ideas and methodologies.

  4. Multi-Objective Particle Swarm Optimization Approach for Cost-Based Feature Selection in Classification.

    Zhang, Yong; Gong, Dun-Wei; Cheng, Jian

    2017-01-01

    Feature selection is an important data-preprocessing technique in classification problems such as bioinformatics and signal processing. Generally, there are some situations where a user is interested in not only maximizing the classification performance but also minimizing the cost that may be associated with features. This kind of problem is called cost-based feature selection. However, most existing feature selection approaches treat this task as a single-objective optimization problem. This paper presents the first study of multi-objective particle swarm optimization (PSO) for cost-based feature selection problems. The task of this paper is to generate a Pareto front of nondominated solutions, that is, feature subsets, to meet different requirements of decision-makers in real-world applications. In order to enhance the search capability of the proposed algorithm, a probability-based encoding technology and an effective hybrid operator, together with the ideas of the crowding distance, the external archive, and the Pareto domination relationship, are applied to PSO. The proposed PSO-based multi-objective feature selection algorithm is compared with several multi-objective feature selection algorithms on five benchmark datasets. Experimental results show that the proposed algorithm can automatically evolve a set of nondominated solutions, and it is a highly competitive feature selection method for solving cost-based feature selection problems.

  5. Discrete Biogeography Based Optimization for Feature Selection in Molecular Signatures.

    Liu, Bo; Tian, Meihong; Zhang, Chunhua; Li, Xiangtao

    2015-04-01

    Biomarker discovery from high-dimensional data is a complex task in the development of efficient cancer diagnoses and classification. However, these data are usually redundant and noisy, and only a subset of them present distinct profiles for different classes of samples. Thus, selecting highly discriminative genes from gene expression data has become increasingly interesting in the field of bioinformatics. In this paper, a discrete biogeography based optimization is proposed to select a good subset of informative genes relevant to the classification. In the proposed algorithm, firstly, the Fisher-Markov selector is used to choose a fixed number of genes. Secondly, to make biogeography based optimization suitable for the feature selection problem, a discrete migration model and a discrete mutation model are proposed to balance the exploration and exploitation ability. Then, discrete biogeography based optimization, which we call DBBO, is proposed by integrating the discrete migration model and the discrete mutation model. Finally, the DBBO method is used for feature selection, and three classifiers are used with the 10-fold cross-validation method. In order to show the effectiveness and efficiency of the algorithm, the proposed algorithm is tested on four breast cancer dataset benchmarks. Compared with the genetic algorithm, particle swarm optimization, differential evolution algorithm and hybrid biogeography based optimization, experimental results demonstrate that the proposed method is better than or at least comparable with previous methods from the literature when considering the quality of the solutions obtained. © 2015 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.

  6. Feature selection using genetic algorithms for fetal heart rate analysis

    Xu, Liang; Redman, Christopher W G; Georgieva, Antoniya; Payne, Stephen J

    2014-01-01

    The fetal heart rate (FHR) is monitored on a paper strip (cardiotocogram) during labour to assess fetal health. If necessary, clinicians can intervene and assist with a prompt delivery of the baby. Data-driven computerized FHR analysis could help clinicians in the decision-making process. However, selecting the best computerized FHR features that relate to labour outcome is a pressing research problem. The objective of this study is to apply genetic algorithms (GA) as a feature selection method to select the best feature subset from 64 FHR features and to integrate these best features to recognize unfavourable FHR patterns. The GA was trained on 404 cases and tested on 106 cases (both balanced datasets) using three classifiers, respectively. Regularization methods and backward selection were used to optimize the GA. Reasonable classification performance is shown on the testing set for the best feature subset (Cohen's kappa values of 0.45 to 0.49 using different classifiers). This is, to our knowledge, the first time that a feature selection method for FHR analysis has been developed on a database of this size. This study indicates that different FHR features, when integrated, can show good performance in predicting labour outcome. It also gives the importance of each feature, which will be a valuable reference point for further studies. (paper)

  7. An opinion formation based binary optimization approach for feature selection

    Hamedmoghadam, Homayoun; Jalili, Mahdi; Yu, Xinghuo

    2018-02-01

    This paper proposed a novel optimization method based on opinion formation in complex network systems. The proposed optimization technique mimics human-human interaction mechanism based on a mathematical model derived from social sciences. Our method encodes a subset of selected features to the opinion of an artificial agent and simulates the opinion formation process among a population of agents to solve the feature selection problem. The agents interact using an underlying interaction network structure and get into consensus in their opinions, while finding better solutions to the problem. A number of mechanisms are employed to avoid getting trapped in local minima. We compare the performance of the proposed method with a number of classical population-based optimization methods and a state-of-the-art opinion formation based method. Our experiments on a number of high dimensional datasets reveal outperformance of the proposed algorithm over others.

  8. SIP-FS: a novel feature selection for data representation

    Yiyou Guo

    2018-02-01

    Multiple features are widely used to characterize real-world datasets. It is desirable to select leading features with stability and interpretability from a set of distinct features for a comprehensive data description. However, most existing feature selection methods focus on the predictability (e.g., prediction accuracy) of selected results yet neglect stability. To obtain a compact data representation, a novel feature selection method is proposed to improve stability and interpretability without sacrificing predictability (SIP-FS). Instead of mutual information, generalized correlation is adopted in minimal redundancy maximal relevance to measure the relation between different feature types. Several feature types (each containing a certain number of features) can then be selected and evaluated quantitatively to determine what types contribute to a specific class, thereby enhancing the so-called interpretability of features. Moreover, stability is introduced in the criterion of SIP-FS to obtain consistent results of ranking. We conduct experiments on three publicly available datasets using a one-versus-all strategy to select class-specific features. The experiments illustrate that SIP-FS achieves significant performance improvements in terms of stability and interpretability with desirable prediction accuracy and indicates advantages over several state-of-the-art approaches.

  9. Hybrid feature selection for supporting lightweight intrusion detection systems

    Song, Jianglong; Zhao, Wentao; Liu, Qiang; Wang, Xin

    2017-08-01

    Redundant and irrelevant features not only cause high resource consumption but also degrade the performance of Intrusion Detection Systems (IDS), especially when coping with big data. These features slow down the process of training and testing in network traffic classification. Therefore, a hybrid feature selection approach combining wrapper and filter selection is designed in this paper to build a lightweight intrusion detection system. Two main phases are involved in this method. The first phase conducts a preliminary search for an optimal subset of features, in which chi-square feature selection is utilized. The selected set of features from the first phase is further refined in the second phase in a wrapper manner, in which the Random Forest (RF) is used to guide the selection process and retain an optimized set of features. After that, we build an RF-based detection model and make a fair comparison with other approaches. The experimental results on NSL-KDD datasets show that our approach results in higher detection accuracy as well as faster training and testing processes.
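
    The first phase described above is a chi-square filter. The sketch below shows only that phase, computing the Pearson chi-square statistic between each categorical feature and the class and keeping the top `k`; the Random Forest wrapper phase is not shown, and the feature values, `k`, and function names are assumptions made for illustration rather than details taken from the record.

```python
from collections import Counter


def chi_square_score(feature, labels):
    """Pearson chi-square statistic between a categorical feature and class labels."""
    n = len(labels)
    f_counts = Counter(feature)
    c_counts = Counter(labels)
    observed = Counter(zip(feature, labels))
    stat = 0.0
    for f, nf in f_counts.items():
        for c, nc in c_counts.items():
            expected = nf * nc / n  # count expected if feature and class were independent
            stat += (observed[(f, c)] - expected) ** 2 / expected
    return stat


def chi_square_filter(columns, labels, k):
    """Return the indices (in ascending order) of the k highest-scoring features."""
    scores = [chi_square_score(col, labels) for col in columns]
    ranked = sorted(range(len(columns)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


# Hypothetical toy traffic records: three categorical features, two classes.
labels = ["normal", "normal", "attack", "attack"]
columns = [
    ["tcp", "tcp", "udp", "udp"],  # perfectly aligned with the class
    [0, 1, 0, 1],                  # independent of the class
    ["x", "x", "x", "y"],          # weakly associated with the class
]
kept = chi_square_filter(columns, labels, k=2)
```

    A cheap filter such as this one shrinks the candidate set before a more expensive wrapper search is run on what remains.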

  10. On the Feature Selection and Classification Based on Information Gain for Document Sentiment Analysis

    Asriyanti Indah Pratiwi

    2018-01-01

    Sentiment analysis of movie reviews is a need of today's lifestyle. Unfortunately, an enormous number of features makes sentiment analysis slow and less sensitive. Finding the optimum feature selection and classification is still a challenge. In order to handle an enormous number of features and provide better sentiment classification, an information-based feature selection and classification are proposed. The proposed method removes more than 90% of unnecessary features, while the proposed classification scheme achieves 96% accuracy of sentiment classification. From the experimental results, it can be concluded that the combination of the proposed feature selection and classification achieves the best performance so far.

  11. Effective Feature Selection for Classification of Promoter Sequences.

    Kouser K

    Exploring novel computational methods in making sense of biological data has not only been a necessity, but also productive. A part of this trend is the search for more efficient in silico methods/tools for analysis of promoters, which are parts of DNA sequences that are involved in regulation of expression of genes into other functional molecules. Promoter regions vary greatly in their function based on the sequence of nucleotides and the arrangement of protein-binding short-regions called motifs. In fact, the regulatory nature of the promoters seems to be largely driven by the selective presence and/or the arrangement of these motifs. Here, we explore computational classification of promoter sequences based on the pattern of motif distributions, as such classification can pave a new way of functional analysis of promoters and to discover the functionally crucial motifs. We make use of Position Specific Motif Matrix (PSMM) features for exploring the possibility of accurately classifying promoter sequences using some of the popular classification techniques. The classification results on the complete feature set are low, perhaps due to the huge number of features. We propose two ways of reducing features. Our test results show improvement in the classification output after the reduction of features. The results also show that decision trees outperform SVM (Support Vector Machine), KNN (K Nearest Neighbor) and ensemble classifier LibD3C, particularly with reduced features. The proposed feature selection methods outperform some of the popular feature transformation methods such as PCA and SVD. Also, the methods proposed are as accurate as MRMR (a feature selection method) but much faster than MRMR. Such methods could be useful to categorize new promoters and explore regulatory mechanisms of gene expressions in complex eukaryotic species.

  12. An Efficient Cost-Sensitive Feature Selection Using Chaos Genetic Algorithm for Class Imbalance Problem

    Jing Bian

    2016-01-01

    In the era of big data, feature selection is an essential process in machine learning. Although the class imbalance problem has recently attracted a great deal of attention, little effort has been undertaken to develop feature selection techniques. In addition, most applications involving feature selection focus on classification accuracy but not cost, although costs are important. To cope with imbalance problems, we developed a cost-sensitive feature selection algorithm that adds the cost-based evaluation function of a filter feature selection using a chaos genetic algorithm, referred to as CSFSG. The evaluation function considers both feature-acquiring costs (test costs) and misclassification costs in the field of network security, thereby weakening the influence of many instances from the majority of classes in large-scale datasets. The CSFSG algorithm reduces the total cost of feature selection and trades off both factors. The behavior of the CSFSG algorithm is tested on a large-scale dataset of network security, using two kinds of classifiers: C4.5 and k-nearest neighbor (KNN). The results of the experimental research show that the approach is efficient and able to effectively improve classification accuracy and to decrease classification time. In addition, the results of our method are more promising than the results of other cost-sensitive feature selection algorithms.

  13. Evaluating statistical and clinical significance of intervention effects in single-case experimental designs: An SPSS method to analyze univariate data

    Maric, M.; de Haan, M.; Hogendoorn, S.M.; Wolters, L.H.; Huizenga, H.M.

    2015-01-01

    Single-case experimental designs are useful methods in clinical research practice to investigate individual client progress. Their proliferation might have been hampered by methodological challenges such as the difficulty applying existing statistical procedures. In this article, we describe a

  14. Evaluating statistical and clinical significance of intervention effects in single-case experimental designs: an SPSS method to analyze univariate data

    Maric, Marija; de Haan, Else; Hogendoorn, Sanne M.; Wolters, Lidewij H.; Huizenga, Hilde M.

    2015-01-01

    Single-case experimental designs are useful methods in clinical research practice to investigate individual client progress. Their proliferation might have been hampered by methodological challenges such as the difficulty applying existing statistical procedures. In this article, we describe a

  15. Development of a Univariate Membrane-Based Mid-Infrared Method for Protein Quantitation and Total Lipid Content Analysis of Biological Samples

    Ivona Strug

    2014-01-01

    Biological samples present a range of complexities from homogeneous purified protein to multicomponent mixtures. Accurate qualification of such samples is paramount to downstream applications. We describe the development of an MIR spectroscopy-based analytical method offering simultaneous protein quantitation (0.25–5 mg/mL) and analysis of total lipid or detergent species, as well as the identification of other biomolecules present in biological samples. The method utilizes a hydrophilic PTFE membrane engineered for presentation of aqueous samples in a dried format compatible with fast infrared analysis. Unlike classical quantification techniques, the reported method is amino acid sequence independent and thus applicable to complex samples of unknown composition. By comparison to existing platforms, this MIR-based method enables direct quantification using minimal sample volume (2 µL); it is well-suited where repeat access and limited sample size are critical parameters. Further, accurate results can be derived without specialized training or knowledge of IR spectroscopy. Overall, the simplified application and analysis system provides a more cost-effective alternative to high-throughput IR systems for research laboratories with minimal throughput demands. In summary, the MIR-based system provides a viable alternative to current protein quantitation methods; it also uniquely offers simultaneous qualification of other components, notably lipids and detergents.

  16. [Electroencephalogram Feature Selection Based on Correlation Coefficient Analysis].

    Zhou, Jinzhi; Tang, Xiaofang

    2015-08-01

    In order to improve the accuracy of classification with a small amount of motor imagery training data in the development of brain-computer interface (BCI) systems, we proposed an analysis method that automatically selects the characteristic parameters based on correlation coefficient analysis. Across the five sample data sets of dataset IVa from the 2005 BCI Competition, we utilized the short-time Fourier transform (STFT) and correlation coefficient calculation to reduce the dimensionality of the raw electroencephalogram, then introduced feature extraction based on common spatial pattern (CSP) and classified by linear discriminant analysis (LDA). Simulation results showed that the average classification accuracy was higher with the correlation coefficient feature selection method than without it. Compared with a support vector machine (SVM) feature optimization algorithm, correlation coefficient analysis can yield better-selected parameters and improve the accuracy of classification.
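
    The record above does not spell out how the correlation coefficients are used, so the following is only one plausible reading: keep the features whose absolute Pearson correlation with the class label reaches a threshold. The threshold value, the numeric label coding, and the function names are all assumptions made for the example.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def select_by_correlation(columns, labels, threshold):
    """Keep indices of features with |r| >= threshold against the numeric labels."""
    return [j for j, col in enumerate(columns)
            if abs(pearson(col, labels)) >= threshold]


# Hypothetical toy data: two trials per class, labels coded as 0 and 1.
labels = [0, 0, 1, 1]
columns = [
    [1.0, 2.0, 3.0, 4.0],    # rises with the class label
    [1.0, -1.0, 1.0, -1.0],  # uncorrelated with the class label
]
kept = select_by_correlation(columns, labels, threshold=0.5)
```

    Pearson correlation captures only linear association, so a feature with a strong nonlinear relation to the class could be dropped by this kind of filter.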

  17. Features Selection for Skin Micro-Image Symptomatic Recognition

    HU Yue-li; CAO Jia-lin; ZHAO Qian; FENG Xu

    2004-01-01

    Automatic recognition of skin micro-image symptoms is important in skin diagnosis and treatment. Feature selection aims to improve the classification performance of skin micro-image symptoms. This paper proposes a hybrid approach based on the support vector machine (SVM) technique and genetic algorithm (GA) to select an optimum feature subset from the feature group extracted from the skin micro-images. An adaptive GA is introduced for maintaining the convergence rate. With the proposed method, the average cross validation accuracy is increased from 88.25% using all features to 96.92% using only selected features provided by a classifier for classification of 5 classes of skin symptoms. The experimental results are satisfactory.

  18. Features Selection for Skin Micro-Image Symptomatic Recognition

    HU Yue-li; CAO Jia-lin; ZHAO Qian; FENG Xu

    2004-01-01

    Automatic recognition of skin micro-image symptoms is important in skin diagnosis and treatment. Feature selection aims to improve the classification performance of skin micro-image symptoms. This paper proposes a hybrid approach based on the support vector machine (SVM) technique and genetic algorithm (GA) to select an optimum feature subset from the feature group extracted from the skin micro-images. An adaptive GA is introduced for maintaining the convergence rate. With the proposed method, the average cross validation accuracy is increased from 88.25% using all features to 96.92% using only selected features provided by a classifier for classification of 5 classes of skin symptoms. The experimental results are satisfactory.

  19. Feature Selection and ANN Solar Power Prediction

    O’Leary, Daniel; Kubby, Joel

    2017-01-01

    A novel method of solar power forecasting for individuals and small businesses is developed in this paper based on machine learning, image processing, and acoustic classification techniques. Increases in the production of solar power at the consumer level require automated forecasting systems to minimize loss, cost, and environmental impact for homes and businesses that produce and consume power (prosumers). These new participants in the energy market, prosumers, require new artificial neural...

  20. Feature Selection and ANN Solar Power Prediction

    Daniel O’Leary

    2017-01-01

    A novel method of solar power forecasting for individuals and small businesses is developed in this paper based on machine learning, image processing, and acoustic classification techniques. Increases in the production of solar power at the consumer level require automated forecasting systems to minimize loss, cost, and environmental impact for homes and businesses that produce and consume power (prosumers). These new participants in the energy market, prosumers, require new artificial neural network (ANN) performance tuning techniques to create accurate ANN forecasts. Input masking, an ANN tuning technique developed for acoustic signal classification and image edge detection, is applied to prosumer solar data to improve prosumer forecast accuracy over traditional macrogrid ANN performance tuning techniques. ANN inputs tailor time-of-day masking based on error clustering in the time domain. Results show an improvement in prediction to target correlation, the R² value, lowering inaccuracy of sample predictions by 14.4%, with corresponding drops in mean average error of 5.37% and root mean squared error of 6.83%.

  1. Simultaneous Channel and Feature Selection of Fused EEG Features Based on Sparse Group Lasso

    Jin-Jia Wang

    2015-01-01

    Feature extraction and classification of EEG signals are core parts of brain computer interfaces (BCIs). Due to the high dimension of the EEG feature vector, an effective feature selection algorithm has become an integral part of research studies. In this paper, we present a new method based on a wrapped Sparse Group Lasso for channel and feature selection of fused EEG signals. The high-dimensional fused features are first obtained, which include the power spectrum, time-domain statistics, AR model, and the wavelet coefficient features extracted from the preprocessed EEG signals. The wrapped channel and feature selection method is then applied, which uses the logistic regression model with a Sparse Group Lasso penalty function. The model is fitted on the training data, and parameter estimation is obtained by modified blockwise coordinate descent and coordinate gradient descent methods. The best parameters and feature subset are selected by using 10-fold cross-validation. Finally, the test data is classified using the trained model. Compared with existing channel and feature selection methods, results show that the proposed method is more suitable, more stable, and faster for high-dimensional feature fusion. It can simultaneously achieve channel and feature selection with a lower error rate. The test accuracy on the data used from international BCI Competition IV reached 84.72%.

  2. Feature selection gait-based gender classification under different circumstances

    Sabir, Azhin; Al-Jawad, Naseer; Jassim, Sabah

    2014-05-01

    This paper proposes a gender classification method based on human gait features and investigates two variations in addition to the normal gait sequence: clothing (wearing coats) and carrying a bag. The feature vectors in the proposed system are constructed after applying a wavelet transform. Three different feature sets are proposed. The first is the spatio-temporal distance, which captures the distances between different parts of the human body (such as feet, knees, hands, height, and shoulders) during one gait cycle. The second and third feature sets are constructed from the approximation and non-approximation coefficients of the human body, respectively. To extract these two feature sets, we divided the human body into upper and lower parts based on the golden ratio proportion. We adopted a statistical method for constructing the feature vector from the above sets. The dimension of the constructed feature vector is reduced using the Fisher score as a feature selection method to optimize discriminating significance. Finally, k-Nearest Neighbor is applied as the classification method. Experimental results demonstrate that our approach provides a more realistic scenario and relatively better performance compared with existing approaches.
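
    The Fisher-score step mentioned in this abstract can be illustrated with a short sketch. It assumes the standard definition (between-class scatter of a feature divided by its within-class scatter) and plain Python lists as input; the function names are hypothetical, and the paper's own variant may differ.

```python
from collections import defaultdict
from statistics import fmean, pvariance


def fisher_scores(X, y):
    """Return one Fisher score per feature column of X (a list of rows)."""
    rows_by_class = defaultdict(list)
    for row, label in zip(X, y):
        rows_by_class[label].append(row)

    scores = []
    for j in range(len(X[0])):
        overall_mean = fmean(row[j] for row in X)
        between = 0.0  # class-size-weighted spread of the class means
        within = 0.0   # class-size-weighted variance inside each class
        for rows in rows_by_class.values():
            column = [row[j] for row in rows]
            between += len(column) * (fmean(column) - overall_mean) ** 2
            within += len(column) * pvariance(column)
        if within > 0:
            scores.append(between / within)
        else:
            # Zero within-class variance: perfectly separating or constant.
            scores.append(float("inf") if between > 0 else 0.0)
    return scores


def top_k_features(X, y, k):
    """Indices of the k features with the highest Fisher score."""
    scores = fisher_scores(X, y)
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return order[:k]
```

    The reduced feature vector would then be passed to a k-Nearest Neighbor classifier, as the abstract describes.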

  3. Efficient Multi-Label Feature Selection Using Entropy-Based Label Selection

    Jaesung Lee

    2016-11-01

    Multi-label feature selection is designed to select a subset of features according to their importance to multiple labels. This task can be achieved by ranking the dependencies of features and selecting the features with the highest rankings. In a multi-label feature selection problem, the algorithm may be faced with a dataset containing a large number of labels. Because the computational cost of multi-label feature selection increases according to the number of labels, the algorithm may suffer from a degradation in performance when processing very large datasets. In this study, we propose an efficient multi-label feature selection method based on an information-theoretic label selection strategy. By identifying a subset of labels that significantly influence the importance of features, the proposed method efficiently outputs a feature subset. Experimental results demonstrate that the proposed method can identify a feature subset much faster than conventional multi-label feature selection methods for large multi-label datasets.

  4. Max-AUC feature selection in computer-aided detection of polyps in CT colonography.

    Xu, Jian-Wu; Suzuki, Kenji

    2014-03-01

    We propose a feature selection method based on a sequential forward floating selection (SFFS) procedure to improve the performance of a classifier in computerized detection of polyps in CT colonography (CTC). The feature selection method is coupled with a nonlinear support vector machine (SVM) classifier. Unlike the conventional linear method based on Wilks' lambda, the proposed method selected the most relevant features that would maximize the area under the receiver operating characteristic curve (AUC), which directly maximizes classification performance, evaluated based on AUC value, in the computer-aided detection (CADe) scheme. We presented two variants of the proposed method with different stopping criteria used in the SFFS procedure. The first variant searched all feature combinations allowed in the SFFS procedure and selected the subsets that maximize the AUC values. The second variant performed a statistical test at each step during the SFFS procedure, and it was terminated if the increase in the AUC value was not statistically significant. The advantage of the second variant is its lower computational cost. To test the performance of the proposed method, we compared it against the popular stepwise feature selection method based on Wilks' lambda for a colonic-polyp database (25 polyps and 2624 nonpolyps). We extracted 75 morphologic, gray-level-based, and texture features from the segmented lesion candidate regions. The two variants of the proposed feature selection method chose 29 and 7 features, respectively. Two SVM classifiers trained with these selected features yielded a 96% by-polyp sensitivity at false-positive (FP) rates of 4.1 and 6.5 per patient, respectively. Experiments showed a significant improvement in the performance of the classifier with the proposed feature selection method over that with the popular stepwise feature selection based on Wilks' lambda that yielded 18.0 FPs per patient at the same sensitivity level.

  5. Evolutionary Feature Selection for Big Data Classification: A MapReduce Approach

    Daniel Peralta

    2015-01-01

    Nowadays, many disciplines have to deal with big datasets that additionally involve a high number of features. Feature selection methods aim at eliminating noisy, redundant, or irrelevant features that may deteriorate the classification performance. However, traditional methods lack enough scalability to cope with datasets of millions of instances and extract successful results in a delimited time. This paper presents a feature selection algorithm based on evolutionary computation that uses the MapReduce paradigm to obtain subsets of features from big datasets. The algorithm decomposes the original dataset into blocks of instances to learn from them in the map phase; then, the reduce phase merges the obtained partial results into a final vector of feature weights, which allows a flexible application of the feature selection procedure using a threshold to determine the selected subset of features. The feature selection method is evaluated by using three well-known classifiers (SVM, Logistic Regression, and Naive Bayes) implemented within the Spark framework to address big data problems. In the experiments, datasets of up to 67 million instances and up to 2000 attributes have been managed, showing that this is a suitable framework to perform evolutionary feature selection, improving both the classification accuracy and its runtime when dealing with big data problems.
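
    The reduce-and-threshold step described above can be sketched as follows. The abstract does not say how partial results are merged, so this sketch assumes that each map task returns a binary feature mask for its block and that the reduce phase averages those masks into per-feature weights; both function names are hypothetical.

```python
def reduce_feature_votes(block_masks):
    """Average per-block binary masks into one weight per feature.

    block_masks: list of equal-length 0/1 lists, one per data block
    (assumed to be the output of the map phase).
    """
    n_blocks = len(block_masks)
    n_features = len(block_masks[0])
    return [
        sum(mask[j] for mask in block_masks) / n_blocks
        for j in range(n_features)
    ]


def select_by_threshold(weights, threshold):
    """Keep the indices of features whose weight reaches the threshold."""
    return [j for j, w in enumerate(weights) if w >= threshold]
```

    Because the threshold is applied only after the weights are merged, different subset sizes can be tried without rerunning the map phase.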

  6. Multi-task feature selection in microarray data by binary integer programming.

    Lan, Liang; Vucetic, Slobodan

    2013-12-20

    A major challenge in microarray classification is that the number of features is typically orders of magnitude larger than the number of examples. In this paper, we propose a novel feature filter algorithm to select the feature subset with maximal discriminative power and minimal redundancy by solving a quadratic objective function with binary integer constraints. To improve the computational efficiency, the binary integer constraints are relaxed and a low-rank approximation to the quadratic term is applied. The proposed feature selection algorithm was extended to solve multi-task microarray classification problems. We compared the single-task version of the proposed feature selection algorithm with 9 existing feature selection methods on 4 benchmark microarray data sets. The empirical results show that the proposed method achieved the most accurate predictions overall. We also evaluated the multi-task version of the proposed algorithm on 8 multi-task microarray datasets. The multi-task feature selection algorithm resulted in significantly higher accuracy than when using the single-task feature selection methods.

  7. Comparison of feature selection and classification for MALDI-MS data

    Yang Mary

    2009-07-01

    Introduction: In the classification of Mass Spectrometry (MS) proteomics data, peak detection, feature selection, and learning classifiers are critical to classification accuracy. To better understand which methods are more accurate when classifying data, some publicly available peak detection algorithms for Matrix assisted Laser Desorption Ionization Mass Spectrometry (MALDI-MS) data were recently compared; however, the issue of different feature selection methods and different classification models as they relate to classification performance has not been addressed. With the application of intelligent computing, much progress has been made in the development of feature selection methods and learning classifiers for the analysis of high-throughput biological data. The main objective of this paper is to compare the methods of feature selection and different learning classifiers when applied to MALDI-MS data and to provide a subsequent reference for the analysis of MS proteomics data. Results: We compared a well-known method of feature selection, Support Vector Machine Recursive Feature Elimination (SVMRFE), and a recently developed method, Gradient based Leave-one-out Gene Selection (GLGS), that effectively performs microarray data analysis. We also compared several learning classifiers including K-Nearest Neighbor Classifier (KNNC), Naïve Bayes Classifier (NBC), Nearest Mean Scaled Classifier (NMSC), uncorrelated normal based quadratic Bayes Classifier recorded as UDC, Support Vector Machines, and a distance metric learning for Large Margin Nearest Neighbor classifier (LMNN) based on Mahalanobis distance. To compare, we conducted a comprehensive experimental study using three types of MALDI-MS data. Conclusion: Regarding feature selection, SVMRFE outperformed GLGS in classification. As for the learning classifiers, when classification models derived from the best training were compared, SVMs performed the best with respect to the expected testing

  8. Comparison of spectrum normalization techniques for univariate ...

    Laser-induced breakdown spectroscopy; univariate study; normalization models; stainless steel; standard error of prediction. Abstract. Analytical performance of six different spectrum normalization techniques, namely internal normalization, normalization with total light, normalization with background along with their ...

  9. Online Feature Selection for Classifying Emphysema in HRCT Images

    M. Prasad

    2008-06-01

    Feature subset selection, applied as a preprocessing step to machine learning, is valuable in dimensionality reduction, eliminating irrelevant data and improving classifier performance. In the classic formulation of the feature selection problem, it is assumed that all the features are available at the beginning. However, in many real world problems, there are scenarios where not all features are present initially and must be integrated as they become available. In such scenarios, online feature selection provides an efficient way to sort through a large space of features. It is in this context that we introduce online feature selection for the classification of emphysema, a smoking related disease that appears as low attenuation regions in High Resolution Computer Tomography (HRCT) images. The technique was successfully evaluated on 61 HRCT scans and compared with different online feature selection approaches, including hill climbing, best first search, grafting, and correlation-based feature selection. The results were also compared against "density mask", a standard approach used for emphysema detection in medical image analysis.

  10. Feature Selection Criteria for Real Time EKF-SLAM Algorithm

    Fernando Auat Cheein

    2010-02-01

    This paper presents a selection procedure for environment features for the correction stage of a SLAM (Simultaneous Localization and Mapping) algorithm based on an Extended Kalman Filter (EKF). This approach decreases the computational time of the correction stage, which allows for real-time and constant-time implementations of the SLAM. The selection procedure consists in choosing the features to which the SLAM system state covariance is most sensitive. The entire system is implemented on a mobile robot equipped with a laser range sensor. The features extracted from the environment correspond to lines and corners. Experimental results of the real-time SLAM algorithm and an analysis of the processing time consumed by the SLAM with the proposed feature selection procedure are shown. A comparison between the proposed feature selection approach and the classical sequential EKF-SLAM, along with an entropy feature selection approach, is also performed.

  11. VC-dimension of univariate decision trees.

    Yildiz, Olcay Taner

    2015-02-01

    In this paper, we give and prove the lower bounds of the Vapnik-Chervonenkis (VC)-dimension of the univariate decision tree hypothesis class. The VC-dimension of the univariate decision tree depends on the VC-dimension values of its subtrees and the number of inputs. Via a search algorithm that calculates the VC-dimension of univariate decision trees exhaustively, we show that our VC-dimension bounds are tight for simple trees. To verify that the VC-dimension bounds are useful, we also use them to get VC-generalization bounds for complexity control using structural risk minimization in decision trees, i.e., pruning. Our simulation results show that structural risk minimization pruning using the VC-dimension bounds finds trees that are as accurate as those pruned using cross validation.

  12. Feature selection based on SVM significance maps for classification of dementia

    E.E. Bron (Esther); M. Smits (Marion); J.C. van Swieten (John); W.J. Niessen (Wiro); S. Klein (Stefan)

    2014-01-01

    Support vector machine significance maps (SVM p-maps) previously showed clusters of significantly different voxels in dementia-related brain regions. We propose a novel feature selection method for classification of dementia based on these p-maps. In our approach, the SVM p-maps are

  13. Relevant test set using feature selection algorithm for early detection ...

    The objective of feature selection is to find the most relevant features for classification. Thus, the dimensionality of the information will be reduced and may improve classification's accuracy. This paper proposed a minimum set of relevant questions that can be used for early detection of dyslexia. In this research, we ...

  14. Selective Audiovisual Semantic Integration Enabled by Feature-Selective Attention.

    Li, Yuanqing; Long, Jinyi; Huang, Biao; Yu, Tianyou; Wu, Wei; Li, Peijun; Fang, Fang; Sun, Pei

    2016-01-13

    An audiovisual object may contain multiple semantic features, such as the gender and emotional features of the speaker. Feature-selective attention and audiovisual semantic integration are two brain functions involved in the recognition of audiovisual objects. Humans often selectively attend to one or several features while ignoring the other features of an audiovisual object. Meanwhile, the human brain integrates semantic information from the visual and auditory modalities. However, how these two brain functions correlate with each other remains to be elucidated. In this functional magnetic resonance imaging (fMRI) study, we explored the neural mechanism by which feature-selective attention modulates audiovisual semantic integration. During the fMRI experiment, the subjects were presented with visual-only, auditory-only, or audiovisual dynamical facial stimuli and performed several feature-selective attention tasks. Our results revealed that a distribution of areas, including heteromodal areas and brain areas encoding attended features, may be involved in audiovisual semantic integration. Through feature-selective attention, the human brain may selectively integrate audiovisual semantic information from attended features by enhancing functional connectivity and thus regulating information flows from heteromodal areas to brain areas encoding the attended features.

  15. A Hybrid Feature Selection Approach for Arabic Documents Classification

    Habib, Mena Badieh; Sarhan, Ahmed A. E.; Salem, Abdel-Badeeh M.; Fayed, Zaki T.; Gharib, Tarek F.

    Text Categorization (classification) is the process of classifying documents into a predefined set of categories based on their content. Text categorization algorithms usually represent documents as bags of words and consequently have to deal with huge number of features. Feature selection tries to

  16. Robust Feature Selection from Microarray Data Based on Cooperative Game Theory and Qualitative Mutual Information

    Atiyeh Mortazavi

    2016-01-01

    High dimensionality of microarray data sets may lead to low efficiency and overfitting. In this paper, a multiphase cooperative game theoretic feature selection approach is proposed for microarray data classification. In the first phase, due to the high dimension of microarray data sets, the features are reduced using one of two filter-based feature selection methods, namely, mutual information and Fisher ratio. In the second phase, the Shapley index is used to evaluate the power of each feature. The main innovation of the proposed approach is to employ Qualitative Mutual Information (QMI) for this purpose. The idea of Qualitative Mutual Information gives the selected features more stability, and this stability helps to deal with the problem of data imbalance and scarcity. In the third phase, a forward selection scheme is applied which uses a scoring function to weight each feature. The performance of the proposed method is compared with other popular feature selection algorithms such as Fisher ratio, minimum redundancy maximum relevance, and previous works on cooperative game based feature selection. The average classification accuracy on eleven microarray data sets shows that the proposed method improves both average accuracy and average stability compared to other approaches.

  17. Univariate characterization of the German business cycle 1955-1994

    Weihs, Claus; Garczarek, Ursula

    2002-01-01

    We present a descriptive analysis of stylized facts for the German business cycle. We demonstrate that simple ad-hoc instructions for identifying univariate rules characterizing the German business cycle 1955-1994 lead to an error rate comparable to standard multivariate methods.

  18. Comparison of Imputation Methods for Handling Missing Categorical Data with Univariate Pattern|| Una comparación de métodos de imputación de variables categóricas con patrón univariado

    Torres Munguía, Juan Armando

    2014-06-01

    This paper examines the sample proportion estimates in the presence of univariate missing categorical data. A database about smoking habits (2011 National Addiction Survey of Mexico) was used to create simulated yet realistic datasets at rates of 5% and 15% missingness, each for MCAR, MAR and MNAR mechanisms. Then the performance of six methods for addressing missingness is evaluated: listwise deletion, mode imputation, random imputation, hot-deck, imputation by polytomous regression and random forests. Results showed that the most effective methods for dealing with missing categorical data in most of the scenarios assessed in this paper were the hot-deck and polytomous regression approaches.

  19. Multi-level gene/MiRNA feature selection using deep belief nets and active learning.

    Ibrahim, Rania; Yousri, Noha A; Ismail, Mohamed A; El-Makky, Nagwa M

    2014-01-01

    Selecting the most discriminative genes/miRNAs has been raised as an important task in bioinformatics to enhance disease classifiers and to mitigate the dimensionality curse problem. Original feature selection methods choose genes/miRNAs based on their individual features regardless of how they perform together. Considering group features instead of individual ones provides a better view for selecting the most informative genes/miRNAs. Recently, deep learning has proven its ability in representing the data in multiple levels of abstraction, allowing for better discrimination between different classes. However, the idea of using deep learning for feature selection is not widely used in the bioinformatics field yet. In this paper, a novel multi-level feature selection approach named MLFS is proposed for selecting genes/miRNAs based on expression profiles. The approach is based on both deep and active learning. Moreover, an extension to use the technique for miRNAs is presented by considering the biological relation between miRNAs and genes. Experimental results show that the approach was able to outperform classical feature selection methods in hepatocellular carcinoma (HCC) by 9%, lung cancer by 6% and breast cancer by around 10% in F1-measure. Results also show the enhancement in F1-measure of our approach over recently related work in [1] and [2].

  20. Fast Branch & Bound algorithms for optimal feature selection

    Somol, Petr; Pudil, Pavel; Kittler, J.

    2004-01-01

    Vol. 26, No. 7 (2004), pp. 900-912 ISSN 0162-8828 R&D Projects: GA ČR GA402/02/1271; GA ČR GA402/03/1310; GA AV ČR KSK1019101 Institutional research plan: CEZ:AV0Z1075907 Keywords: subset search * feature selection * search tree Subject RIV: BD - Theory of Information Impact factor: 4.352, year: 2004

  1. A Variance Minimization Criterion to Feature Selection Using Laplacian Regularization.

    He, Xiaofei; Ji, Ming; Zhang, Chiyuan; Bao, Hujun

    2011-10-01

    In many information processing tasks, one is often confronted with very high-dimensional data. Feature selection techniques are designed to find the meaningful feature subset of the original features which can facilitate clustering, classification, and retrieval. In this paper, we consider the feature selection problem in unsupervised learning scenarios, which is particularly difficult due to the absence of class labels that would guide the search for relevant information. Based on Laplacian regularized least squares, which finds a smooth function on the data manifold and minimizes the empirical loss, we propose two novel feature selection algorithms which aim to minimize the expected prediction error of the regularized regression model. Specifically, we select those features such that the size of the parameter covariance matrix of the regularized regression model is minimized. Motivated from experimental design, we use trace and determinant operators to measure the size of the covariance matrix. Efficient computational schemes are also introduced to solve the corresponding optimization problems. Extensive experimental results over various real-life data sets have demonstrated the superiority of the proposed algorithms.

  2. Infrared face recognition based on LBP histogram and KW feature selection

    Xie, Zhihua

    2014-07-01

    The conventional LBP-based feature as represented by the local binary pattern (LBP) histogram still has room for performance improvements. This paper focuses on the dimension reduction of LBP micro-patterns and proposes an improved infrared face recognition method based on LBP histogram representation. To extract the local robust features in infrared face images, LBP is chosen to get the composition of micro-patterns of sub-blocks. Based on statistical test theory, a Kruskal-Wallis (KW) feature selection method is proposed to get the LBP patterns which are suitable for infrared face recognition. The experimental results show that the combination of LBP and KW feature selection improves the performance of infrared face recognition, and that the proposed method outperforms the traditional methods based on LBP histogram, discrete cosine transform (DCT) or principal component analysis (PCA).
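
    The abstract does not spell out how the Kruskal-Wallis test is turned into a selection rule. A minimal sketch, assuming each feature (for instance one LBP histogram bin) is scored by the tie-corrected Kruskal-Wallis H statistic across classes and features are ranked by that score, is given below. All function names are hypothetical, and the input is assumed to contain at least two samples.

```python
from collections import defaultdict


def average_ranks(values):
    """1-based ranks of values, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def kruskal_wallis_h(values, labels):
    """Tie-corrected Kruskal-Wallis H statistic of one feature."""
    n = len(values)
    rank_sums = defaultdict(float)
    counts = defaultdict(int)
    for rank, label in zip(average_ranks(values), labels):
        rank_sums[label] += rank
        counts[label] += 1
    h = 12.0 / (n * (n + 1)) * sum(
        rank_sums[c] ** 2 / counts[c] for c in counts
    ) - 3 * (n + 1)
    # Standard correction for tied observations.
    tie_sizes = defaultdict(int)
    for v in values:
        tie_sizes[v] += 1
    correction = 1 - sum(t ** 3 - t for t in tie_sizes.values()) / (n ** 3 - n)
    return h / correction if correction > 0 else 0.0


def rank_features_by_kw(X, y):
    """Feature indices sorted from most to least class-discriminative."""
    scores = [
        kruskal_wallis_h([row[j] for row in X], y) for j in range(len(X[0]))
    ]
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
```

    Because the statistic depends only on ranks, it makes no normality assumption about the histogram values, which is one reason a rank-based test can be attractive for this kind of feature.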

  3. Improving Classification of Protein Interaction Articles Using Context Similarity-Based Feature Selection.

    Chen, Yifei; Sun, Yuxing; Han, Bing-Qing

    2015-01-01

    Protein interaction article classification is a text classification task in the biological domain to determine which articles describe protein-protein interactions. Since the feature space in text classification is high-dimensional, feature selection is widely used for reducing the dimensionality of features to speed up computation without sacrificing classification performance. Many existing feature selection methods are based on the statistical measure of document frequency and term frequency. One potential drawback of these methods is that they treat features separately. Hence, first we design a similarity measure between the context information to take word cooccurrences and phrase chunks around the features into account. Then we introduce the similarity of context information to the importance measure of the features to substitute the document and term frequency. Hence we propose new context similarity-based feature selection methods. Their performance is evaluated on two protein interaction article collections and compared against the frequency-based methods. The experimental results reveal that the context similarity-based methods perform better in terms of the F1 measure and the dimension reduction rate. Benefiting from the context information surrounding the features, the proposed methods can select distinctive features effectively for protein interaction article classification.

  4. The application of feature selection to the development of Gaussian process models for percutaneous absorption.

    Lam, Lun Tak; Sun, Yi; Davey, Neil; Adams, Rod; Prapopoulou, Maria; Brown, Marc B; Moss, Gary P

    2010-06-01

    The aim was to employ Gaussian processes to assess mathematically the nature of a skin permeability dataset and to employ these methods, particularly feature selection, to determine the key physicochemical descriptors which exert the most significant influence on percutaneous absorption, and to compare such models with established existing models. Gaussian processes, including automatic relevance detection (GPRARD) methods, were employed to develop models of percutaneous absorption that identified key physicochemical descriptors of percutaneous absorption. Using MATLAB software, the statistical performance of these models was compared with single linear networks (SLN) and quantitative structure-permeability relationships (QSPRs). Feature selection methods were used to examine in more detail the physicochemical parameters used in this study. A range of statistical measures to determine model quality were used. The inherently nonlinear nature of the skin data set was confirmed. The Gaussian process regression (GPR) methods yielded predictive models that offered statistically significant improvements over SLN and QSPR models with regard to predictivity (where the rank order was: GPR > SLN > QSPR). Feature selection analysis determined that the best GPR models were those that contained log P, melting point and the number of hydrogen bond donor groups as significant descriptors. Further statistical analysis also found that great synergy existed between certain parameters. It suggested that a number of the descriptors employed were effectively interchangeable, thus questioning the use of models where discrete variables are output, usually in the form of an equation. The use of a nonlinear GPR method produced models with significantly improved predictivity, compared with SLN or QSPR models. Feature selection methods were able to provide important mechanistic information. However, it was also shown that significant synergy existed between certain parameters, and as such it

  5. Univariate normalization of bispectrum using Hölder's inequality.

    Shahbazi, Forooz; Ewald, Arne; Nolte, Guido

    2014-08-15

    Considering that many biological systems including the brain are complex non-linear systems, suitable methods capable of detecting these non-linearities are required to study the dynamical properties of these systems. One of these tools is the third order cumulant or cross-bispectrum, which is a measure of interfrequency interactions between three signals. For convenient interpretation, interaction measures are most commonly normalized to be independent of constant scales of the signals such that their absolute values are bounded by one, with this limit reflecting perfect coupling. Although many different normalization factors for cross-bispectra were suggested in the literature, these either do not lead to bounded measures or are themselves dependent on the coupling and not only on the scale of the signals. In this paper we suggest a normalization factor which is univariate, i.e., dependent only on the amplitude of each signal and not on the interactions between signals. Using a generalization of Hölder's inequality it is proven that the absolute value of this univariate bicoherence is bounded by zero and one. We compared three widely used normalizations to the univariate normalization concerning the significance of bicoherence values gained from resampling tests. Bicoherence values are calculated from real EEG data recorded in an eyes closed experiment from 10 subjects. The results show slightly more significant values for the univariate normalization but in general, the differences are very small or even vanishing in some subjects. Therefore, we conclude that the normalization factor does not play an important role in the bicoherence values with regard to statistical power, although a univariate normalization is the only normalization factor which fulfills all the required conditions of a proper normalization. Copyright © 2014 Elsevier B.V. All rights reserved.

  6. Adaptive feature selection using v-shaped binary particle swarm optimization.

    Teng, Xuyang; Dong, Hongbin; Zhou, Xiurong

    2017-01-01

    Feature selection is an important preprocessing method in machine learning and data mining. This process can be used not only to reduce the amount of data to be analyzed but also to build models with stronger interpretability based on fewer features. Traditional feature selection methods evaluate the dependency and redundancy of features separately, which leads to a lack of measurement of their combined effect. Moreover, a greedy search considers only the optimization of the current round and thus cannot be a global search. To evaluate the combined effect of different subsets in the entire feature space, an adaptive feature selection method based on V-shaped binary particle swarm optimization is proposed. In this method, the fitness function is constructed using the correlation information entropy. Feature subsets are regarded as individuals in a population, and the feature space is searched using V-shaped binary particle swarm optimization. The above procedure overcomes the hard constraint on the number of features, enables the combined evaluation of each subset as a whole, and improves the search ability of conventional binary particle swarm optimization. The proposed algorithm is an adaptive method with respect to the number of feature subsets. The experimental results show the advantages of optimizing the feature subsets using the V-shaped transfer function and confirm the effectiveness and efficiency of the feature subsets obtained under different classifiers.
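
    The abstract does not state which V-shaped transfer function is used. As a minimal sketch, assuming the common choice |tanh(v)| and the usual V-shaped update rule (a bit is flipped, rather than set, with the transfer probability), one position update of a binary particle could look like this. The function names and the choice of |tanh| are assumptions for illustration only.

```python
import math
import random


def v_transfer(velocity):
    """Assumed V-shaped transfer function: maps a velocity to [0, 1]."""
    return abs(math.tanh(velocity))


def update_position(bits, velocities, rng):
    """Flip each feature bit with probability v_transfer(velocity).

    bits       : list of 0/1 flags, 1 meaning the feature is selected
    velocities : list of real-valued velocities, one per bit
    rng        : random.Random instance, passed in for reproducibility
    """
    return [
        1 - b if rng.random() < v_transfer(v) else b
        for b, v in zip(bits, velocities)
    ]
```

    Unlike an S-shaped (sigmoid) transfer function, a velocity near zero leaves the current bit unchanged here, so a particle keeps its feature subset until its velocity grows in either direction.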

  7. Notes on the evolution of feature selection methodology

    Somol, Petr; Novovičová, Jana; Pudil, Pavel

    2007-01-01

    Vol. 43, No. 5 (2007), pp. 713-730. ISSN 0023-5954. R&D Projects: GA ČR GA102/07/1594; GA MŠk 1M0572; GA AV ČR IAA2075302. EU Projects: European Commission(XE) 507752 - MUSCLE. Grant - others: GA MŠk(CZ) 2C06019. Institutional research plan: CEZ:AV0Z10750506. Keywords: feature selection * branch and bound * sequential search * mixture model. Subject RIV: IN - Informatics, Computer Science. Impact factor: 0.552, year: 2007

  8. Conditional Mutual Information Based Feature Selection for Classification Task

    Novovičová, Jana; Somol, Petr; Haindl, Michal; Pudil, Pavel

    2007-01-01

    Vol. 45, No. 4756 (2007), pp. 417-426. ISSN 0302-9743. R&D Projects: GA MŠk 1M0572; GA AV ČR IAA2075302. EU Projects: European Commission(XE) 507752 - MUSCLE. Grant - others: GA MŠk(CZ) 2C06019. Institutional research plan: CEZ:AV0Z10750506. Keywords: Pattern classification * feature selection * conditional mutual information * text categorization. Subject RIV: BB - Applied Statistics, Operational Research. Impact factor: 0.402, year: 2005

  9. Emotional textile image classification based on cross-domain convolutional sparse autoencoders with feature selection

    Li, Zuhe; Fan, Yangyu; Liu, Weihua; Yu, Zeqi; Wang, Fengqin

    2017-01-01

    We aim to apply sparse autoencoder-based unsupervised feature learning to emotional semantic analysis for textile images. To tackle the problem of limited training data, we present a cross-domain feature learning scheme for emotional textile image classification using convolutional autoencoders. We further propose a correlation-analysis-based feature selection method for the weights learned by sparse autoencoders to reduce the number of features extracted from large size images. First, we randomly collect image patches on an unlabeled image dataset in the source domain and learn local features with a sparse autoencoder. We then conduct feature selection according to the correlation between different weight vectors corresponding to the autoencoder's hidden units. We finally adopt a convolutional neural network including a pooling layer to obtain global feature activations of textile images in the target domain and send these global feature vectors into logistic regression models for emotional image classification. The cross-domain unsupervised feature learning method achieves 65% to 78% average accuracy in the cross-validation experiments corresponding to eight emotional categories and performs better than conventional methods. Feature selection can reduce the computational cost of global feature extraction by about 50% while improving classification performance.

  10. Feature selection in classification of eye movements using electrooculography for activity recognition.

    Mala, S; Latha, K

    2014-01-01

    Activity recognition is needed in a range of applications, for example, reconnaissance systems, patient monitoring, and human-computer interfaces. Feature selection plays an important role in activity recognition, data mining, and machine learning. To select a subset of features, Differential Evolution (DE), an efficient evolutionary optimizer, is used to find informative features from eye movements recorded using electrooculography (EOG). Many researchers use EOG signals in human-computer interaction with various computational intelligence methods to analyze eye movements. The proposed system involves analysis of EOG signals using clearness based features, minimum redundancy maximum relevance features, and Differential Evolution based features. This work concentrates on the DE-based feature selection algorithm in order to improve the classification for faultless activity recognition.

  11. Tensor-based Multi-view Feature Selection with Applications to Brain Diseases

    Cao, Bokai; He, Lifang; Kong, Xiangnan; Yu, Philip S.; Hao, Zhifeng; Ragin, Ann B.

    2015-01-01

    In the era of big data, we can easily access information from multiple views which may be obtained from different sources or feature subsets. Generally, different views provide complementary information for learning tasks. Thus, multi-view learning can facilitate the learning process and is prevalent in a wide range of application domains. For example, in medical science, measurements from a series of medical examinations are documented for each subject, including clinical, imaging, immunologic, serologic and cognitive measures which are obtained from multiple sources. Specifically, for brain diagnosis, we can have different quantitative analysis which can be seen as different feature subsets of a subject. It is desirable to combine all these features in an effective way for disease diagnosis. However, some measurements from less relevant medical examinations can introduce irrelevant information which can even be exaggerated after view combinations. Feature selection should therefore be incorporated in the process of multi-view learning. In this paper, we explore tensor product to bring different views together in a joint space, and present a dual method of tensor-based multi-view feature selection (dual-Tmfs) based on the idea of support vector machine recursive feature elimination. Experiments conducted on datasets derived from neurological disorder demonstrate the features selected by our proposed method yield better classification performance and are relevant to disease diagnosis. PMID:25937823

  12. Feature selection and multi-kernel learning for sparse representation on a manifold

    Wang, Jim Jing-Yan

    2014-03-01

    Sparse representation has been widely studied as a part-based data representation method and applied in many scientific and engineering fields, such as bioinformatics and medical imaging. It seeks to represent a data sample as a sparse linear combination of some basic items in a dictionary. Gao et al. (2013) recently proposed Laplacian sparse coding by regularizing the sparse codes with an affinity graph. However, due to the noisy features and nonlinear distribution of the data samples, the affinity graph constructed directly from the original feature space is not necessarily a reliable reflection of the intrinsic manifold of the data samples. To overcome this problem, we integrate feature selection and multiple kernel learning into the sparse coding on the manifold. To this end, unified objectives are defined for feature selection, multiple kernel learning, sparse coding, and graph regularization. By optimizing the objective functions iteratively, we develop novel data representation algorithms with feature selection and multiple kernel learning respectively. Experimental results on two challenging tasks, N-linked glycosylation prediction and mammogram retrieval, demonstrate that the proposed algorithms outperform the traditional sparse coding methods. © 2013 Elsevier Ltd.

  13. Feature selection and multi-kernel learning for sparse representation on a manifold.

    Wang, Jim Jing-Yan; Bensmail, Halima; Gao, Xin

    2014-03-01

    Sparse representation has been widely studied as a part-based data representation method and applied in many scientific and engineering fields, such as bioinformatics and medical imaging. It seeks to represent a data sample as a sparse linear combination of some basic items in a dictionary. Gao et al. (2013) recently proposed Laplacian sparse coding by regularizing the sparse codes with an affinity graph. However, due to the noisy features and nonlinear distribution of the data samples, the affinity graph constructed directly from the original feature space is not necessarily a reliable reflection of the intrinsic manifold of the data samples. To overcome this problem, we integrate feature selection and multiple kernel learning into the sparse coding on the manifold. To this end, unified objectives are defined for feature selection, multiple kernel learning, sparse coding, and graph regularization. By optimizing the objective functions iteratively, we develop novel data representation algorithms with feature selection and multiple kernel learning respectively. Experimental results on two challenging tasks, N-linked glycosylation prediction and mammogram retrieval, demonstrate that the proposed algorithms outperform the traditional sparse coding methods. Copyright © 2013 Elsevier Ltd. All rights reserved.

  14. Improving permafrost distribution modelling using feature selection algorithms

    Deluigi, Nicola; Lambiel, Christophe; Kanevski, Mikhail

    2016-04-01

    The availability of an increasing amount of spatial data on the occurrence of mountain permafrost allows the employment of machine learning (ML) classification algorithms for modelling the distribution of the phenomenon. One of the major problems when dealing with a high-dimensional dataset is the number of input features (variables) involved. Application of ML classification algorithms to this large number of variables leads to the risk of overfitting, with the consequence of poor generalization/prediction. For this reason, applying feature selection (FS) techniques helps reduce the number of factors required and improves knowledge of the adopted features and their relation to the studied phenomenon. Moreover, taking away irrelevant or redundant variables from the dataset effectively improves the quality of the ML prediction. This research deals with a comparative analysis of permafrost distribution models supported by FS variable importance assessment. The input dataset (dimension = 20-25, 10 m spatial resolution) was constructed using landcover maps, climate data and DEM derived variables (altitude, aspect, slope, terrain curvature, solar radiation, etc.). It was completed with permafrost evidence (geophysical and thermal data and rock glacier inventories) that serves as training permafrost data. The FS algorithms used identified variables that appeared less statistically important for permafrost presence/absence. Three different algorithms were compared: Information Gain (IG), Correlation-based Feature Selection (CFS) and Random Forest (RF). IG is a filter technique that evaluates the worth of a predictor by measuring the information gain with respect to the permafrost presence/absence. Conversely, CFS is a wrapper technique that evaluates the worth of a subset of predictors by considering the individual predictive ability of each variable along with the degree of redundancy between them. Finally, RF is a ML algorithm that performs FS as part of its

  15. Doubly sparse factor models for unifying feature transformation and feature selection

    Katahira, Kentaro; Okanoya, Kazuo; Okada, Masato; Matsumoto, Narihisa; Sugase-Miyamoto, Yasuko

    2010-01-01

    A number of unsupervised learning methods for high-dimensional data are largely divided into two groups based on their procedures, i.e., (1) feature selection, which discards irrelevant dimensions of the data, and (2) feature transformation, which constructs new variables by transforming and mixing over all dimensions. We propose a method that both selects and transforms features in a common Bayesian inference procedure. Our method imposes a doubly automatic relevance determination (ARD) prior on the factor loading matrix. We propose a variational Bayesian inference for our model and demonstrate the performance of our method on both synthetic and real data.

  16. Doubly sparse factor models for unifying feature transformation and feature selection

    Katahira, Kentaro; Okanoya, Kazuo; Okada, Masato [ERATO, Okanoya Emotional Information Project, Japan Science Technology Agency, Saitama (Japan); Matsumoto, Narihisa; Sugase-Miyamoto, Yasuko, E-mail: okada@k.u-tokyo.ac.j [Human Technology Research Institute, National Institute of Advanced Industrial Science and Technology, Ibaraki (Japan)

    2010-06-01

    A number of unsupervised learning methods for high-dimensional data are largely divided into two groups based on their procedures, i.e., (1) feature selection, which discards irrelevant dimensions of the data, and (2) feature transformation, which constructs new variables by transforming and mixing over all dimensions. We propose a method that both selects and transforms features in a common Bayesian inference procedure. Our method imposes a doubly automatic relevance determination (ARD) prior on the factor loading matrix. We propose a variational Bayesian inference for our model and demonstrate the performance of our method on both synthetic and real data.

  17. The impact of feature selection on one and two-class classification performance for plant microRNAs.

    Khalifa, Waleed; Yousef, Malik; Saçar Demirci, Müşerref Duygu; Allmer, Jens

    2016-01-01

    MicroRNAs (miRNAs) are short nucleotide sequences that form a typical hairpin structure which is recognized by a complex enzyme machinery. It ultimately leads to the incorporation of 18-24 nt long mature miRNAs into RISC where they act as recognition keys to aid in regulation of target mRNAs. It is involved to determine miRNAs experimentally and, therefore, machine learning is used to complement such endeavors. The success of machine learning mostly depends on proper input data and appropriate features for parameterization of the data. Although, in general, two-class classification (TCC) is used in the field; because negative examples are hard to come by, one-class classification (OCC) has been tried for pre-miRNA detection. Since both positive and negative examples are currently somewhat limited, feature selection can prove to be vital for furthering the field of pre-miRNA detection. In this study, we compare the performance of OCC and TCC using eight feature selection methods and seven different plant species providing positive pre-miRNA examples. Feature selection was very successful for OCC where the best feature selection method achieved an average accuracy of 95.6%, thereby being ∼29% better than the worst method which achieved 66.9% accuracy. While the performance is comparable to TCC, which performs up to 3% better than OCC, TCC is much less affected by feature selection and its largest performance gap is ∼13% which only occurs for two of the feature selection methodologies. We conclude that feature selection is crucially important for OCC and that it can perform on par with TCC given the proper set of features.

  18. The impact of feature selection on one and two-class classification performance for plant microRNAs

    Waleed Khalifa

    2016-06-01

    MicroRNAs (miRNAs) are short nucleotide sequences that form a typical hairpin structure which is recognized by a complex enzyme machinery. It ultimately leads to the incorporation of 18–24 nt long mature miRNAs into RISC where they act as recognition keys to aid in regulation of target mRNAs. It is involved to determine miRNAs experimentally and, therefore, machine learning is used to complement such endeavors. The success of machine learning mostly depends on proper input data and appropriate features for parameterization of the data. Although, in general, two-class classification (TCC) is used in the field; because negative examples are hard to come by, one-class classification (OCC) has been tried for pre-miRNA detection. Since both positive and negative examples are currently somewhat limited, feature selection can prove to be vital for furthering the field of pre-miRNA detection. In this study, we compare the performance of OCC and TCC using eight feature selection methods and seven different plant species providing positive pre-miRNA examples. Feature selection was very successful for OCC where the best feature selection method achieved an average accuracy of 95.6%, thereby being ∼29% better than the worst method which achieved 66.9% accuracy. While the performance is comparable to TCC, which performs up to 3% better than OCC, TCC is much less affected by feature selection and its largest performance gap is ∼13% which only occurs for two of the feature selection methodologies. We conclude that feature selection is crucially important for OCC and that it can perform on par with TCC given the proper set of features.

  19. Effects of changing canopy directional reflectance on feature selection

    Smith, J. A.; Oliver, R. E.; Kilpela, O. E.

    1973-01-01

    The use of a Monte Carlo model for generating sample directional reflectance data for two simplified target canopies at two different solar positions is reported. Successive iterations through the model permit the calculation of a mean vector and covariance matrix for canopy reflectance for varied sensor view angles. These data may then be used to calculate the divergence between the target distributions for various wavelength combinations and for these view angles. Results of a feature selection analysis indicate that different sets of wavelengths are optimum for target discrimination depending on sensor view angle and that the targets may be more easily discriminated for some scan angles than others. The time-varying behavior of these results is also pointed out.
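
    The divergence computed from a mean vector and covariance matrix per target can be written out explicitly. The abstract does not give the formula; the expression below is the standard divergence between two Gaussian class models, which is assumed here to be the measure meant. The symbols \mu_i and \Sigma_i (mean vector and covariance matrix of target i over the chosen wavelength subset) are notation introduced for this illustration.

```latex
J_{12} = \tfrac{1}{2}\,\mathrm{tr}\!\left[(\Sigma_1 - \Sigma_2)\,(\Sigma_2^{-1} - \Sigma_1^{-1})\right]
       + \tfrac{1}{2}\,\mathrm{tr}\!\left[(\Sigma_1^{-1} + \Sigma_2^{-1})\,(\mu_1 - \mu_2)(\mu_1 - \mu_2)^{\mathsf{T}}\right]
```

    Under this assumption, feature selection amounts to evaluating J_{12} for each candidate wavelength combination and view angle and keeping the combination with the largest value.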

  20. GMDH-Based Semi-Supervised Feature Selection for Electricity Load Classification Forecasting

    Lintao Yang

    2018-01-01

    With the development of smart power grids, communication network technology and sensor technology, there has been an exponential growth in complex electricity load data. Irregular electricity load fluctuations caused by the weather and holiday factors disrupt the daily operation of the power companies. To deal with these challenges, this paper investigates a day-ahead electricity peak load interval forecasting problem. It transforms the conventional continuous forecasting problem into a novel interval forecasting problem, and then further converts the interval forecasting problem into the classification forecasting problem. In addition, an indicator system influencing the electricity load is established from three dimensions, namely the load series, calendar data, and weather data. A semi-supervised feature selection algorithm is proposed to address an electricity load classification forecasting issue based on the group method of data handling (GMDH) technology. The proposed algorithm consists of three main stages: (1) training the basic classifier; (2) selectively marking the most suitable samples from the unclassified label data, and adding them to an initial training set; and (3) training the classification models on the final training set and classifying the test samples. An empirical analysis of electricity load dataset from four Chinese cities is conducted. Results show that the proposed model can address the electricity load classification forecasting problem more efficiently and effectively than the FW-Semi FS (forward semi-supervised feature selection) and GMDH-U (GMDH-based semi-supervised feature selection for customer classification) models.

  1. Electricity market price spike analysis by a hybrid data model and feature selection technique

    Amjady, Nima; Keynia, Farshid

    2010-01-01

    In a competitive electricity market, energy price forecasting is an important activity for both suppliers and consumers. For this reason, many techniques have been proposed to predict electricity market prices in recent years. However, electricity price is a complex volatile signal with many spikes. Most electricity price forecasting techniques focus on normal price prediction, while price spike forecasting is a different and more complex prediction process. Price spike forecasting has two main aspects: prediction of price spike occurrence and value. In this paper, a novel technique for price spike occurrence prediction is presented, composed of a new hybrid data model, a novel feature selection technique and an efficient forecast engine. The hybrid data model includes both wavelet and time domain variables as well as calendar indicators, comprising a large candidate input set. The set is refined by the proposed feature selection technique, which evaluates both relevancy and redundancy of the candidate inputs. The forecast engine is a probabilistic neural network, which is fed by the candidate inputs selected by the feature selection technique and predicts price spike occurrence. The efficiency of the whole proposed method for price spike occurrence forecasting is evaluated by means of real data from the Queensland and PJM electricity markets.

  2. [Feature extraction for breast cancer data based on geometric algebra theory and feature selection using differential evolution].

    Li, Jing; Hong, Wenxue

    2014-12-01

    Feature extraction and feature selection are important issues in pattern recognition. Based on the geometric algebra representation of vectors, a new feature extraction method using the blade coefficients of geometric algebra was proposed in this study. At the same time, an improved differential evolution (DE) feature selection method was proposed to address the resulting increase in dimensionality. Simple linear discriminant analysis was used as the classifier. The 10-fold cross-validation (10 CV) classification result on a public breast cancer biomedical dataset was more than 96% and proved superior to that of the original features and the traditional feature extraction method.

  3. Fatigue level estimation of monetary bills based on frequency band acoustic signals with feature selection by supervised SOM

    Teranishi, Masaru; Omatu, Sigeru; Kosaka, Toshihisa

    Fatigued monetary bills adversely affect the daily operation of automated teller machines (ATMs). In order to make the classification of fatigued bills more efficient, the development of an automatic fatigued monetary bill classification method is desirable. We propose a new method by which to estimate the fatigue level of monetary bills from the feature-selected frequency band acoustic energy pattern of banking machines. By using a supervised self-organizing map (SOM), we effectively estimate the fatigue level using only the feature-selected frequency band acoustic energy pattern. Furthermore, the feature-selected frequency band acoustic energy pattern improves the estimation accuracy of the fatigue level of monetary bills by adding frequency domain information to the acoustic energy pattern. The experimental results with real monetary bill samples reveal the effectiveness of the proposed method.

  4. Robust Ground Target Detection by SAR and IR Sensor Fusion Using Adaboost-Based Feature Selection

    Sungho Kim

    2016-07-01

    Long-range ground targets are difficult to detect in a noisy cluttered environment using either synthetic aperture radar (SAR) images or infrared (IR) images. SAR-based detectors can provide a high detection rate with a high false alarm rate to background scatter noise. IR-based approaches can detect hot targets but are affected strongly by the weather conditions. This paper proposes a novel target detection method by decision-level SAR and IR fusion using an Adaboost-based machine learning scheme to achieve a high detection rate and low false alarm rate. The proposed method consists of individual detection, registration, and fusion architecture. This paper presents a single framework of a SAR and IR target detection method using modified Boolean map visual theory (modBMVT) and feature-selection based fusion. Previous methods applied different algorithms to detect SAR and IR targets because of the different physical image characteristics. One method that is optimized for IR target detection produces unsuccessful results in SAR target detection. This study examined the image characteristics and proposed a unified SAR and IR target detection method by inserting a median local average filter (MLAF, pre-filter) and an asymmetric morphological closing filter (AMCF, post-filter) into the BMVT. The original BMVT was optimized to detect small infrared targets. The proposed modBMVT can remove the thermal and scatter noise by the MLAF and detect extended targets by attaching the AMCF after the BMVT. Heterogeneous SAR and IR images were registered automatically using the proposed RANdom SAmple Region Consensus (RANSARC)-based homography optimization after a brute-force correspondence search using the detected target centers and regions. The final targets were detected by feature-selection based sensor fusion using Adaboost. The proposed method showed good SAR and IR target detection performance through feature selection-based decision fusion on a synthetic

  5. Robust Ground Target Detection by SAR and IR Sensor Fusion Using Adaboost-Based Feature Selection

    Kim, Sungho; Song, Woo-Jin; Kim, So-Hyun

    2016-01-01

    Long-range ground targets are difficult to detect in a noisy cluttered environment using either synthetic aperture radar (SAR) images or infrared (IR) images. SAR-based detectors can provide a high detection rate with a high false alarm rate to background scatter noise. IR-based approaches can detect hot targets but are affected strongly by the weather conditions. This paper proposes a novel target detection method by decision-level SAR and IR fusion using an Adaboost-based machine learning scheme to achieve a high detection rate and low false alarm rate. The proposed method consists of individual detection, registration, and fusion architecture. This paper presents a single framework of a SAR and IR target detection method using modified Boolean map visual theory (modBMVT) and feature-selection based fusion. Previous methods applied different algorithms to detect SAR and IR targets because of the different physical image characteristics. One method that is optimized for IR target detection produces unsuccessful results in SAR target detection. This study examined the image characteristics and proposed a unified SAR and IR target detection method by inserting a median local average filter (MLAF, pre-filter) and an asymmetric morphological closing filter (AMCF, post-filter) into the BMVT. The original BMVT was optimized to detect small infrared targets. The proposed modBMVT can remove the thermal and scatter noise by the MLAF and detect extended targets by attaching the AMCF after the BMVT. Heterogeneous SAR and IR images were registered automatically using the proposed RANdom SAmple Region Consensus (RANSARC)-based homography optimization after a brute-force correspondence search using the detected target centers and regions. The final targets were detected by feature-selection based sensor fusion using Adaboost. The proposed method showed good SAR and IR target detection performance through feature selection-based decision fusion on a synthetic database generated

  6. Classification Influence of Features on Given Emotions and Its Application in Feature Selection

    Xing, Yin; Chen, Chuang; Liu, Li-Long

    2018-04-01

    In order to solve the problem that there is a large amount of redundant data in high-dimensional speech emotion features, we analyze deeply the extracted speech emotion features and select better features. Firstly, a given emotion is classified by each feature. Secondly, the recognition rate is ranked in descending order. Then, the optimal threshold of features is determined by rate criterion. Finally, the better features are obtained. When applied in Berlin and Chinese emotional data set, the experimental results show that the feature selection method outperforms the other traditional methods.
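
    The four steps listed in this abstract (classify with each feature alone, rank by recognition rate, apply a threshold, keep the survivors) can be sketched as follows. The abstract names neither the classifier nor the rate criterion, so the sketch assumes a simple nearest-class-mean rule scored on the training data and a fixed, hand-picked threshold. The function names and the tiny data set are hypothetical.

```python
def single_feature_accuracy(values, labels):
    """Training accuracy of a nearest-class-mean rule that uses one feature only."""
    classes = sorted(set(labels))
    means = {}
    for c in classes:
        vals = [v for v, y in zip(values, labels) if y == c]
        means[c] = sum(vals) / len(vals)
    correct = 0
    for v, y in zip(values, labels):
        # Predict the class whose mean is closest to this feature value.
        pred = min(classes, key=lambda c: abs(v - means[c]))
        correct += (pred == y)
    return correct / len(labels)


def select_by_rate(X, y, threshold):
    """Rank features by single-feature accuracy (descending), keep those >= threshold."""
    n_features = len(X[0])
    rates = [(single_feature_accuracy([row[j] for row in X], y), j)
             for j in range(n_features)]
    ranked = sorted(rates, key=lambda t: (-t[0], t[1]))
    return [j for rate, j in ranked if rate >= threshold], ranked


# Tiny made-up data set: feature 0 separates the classes, feature 1 does not.
X = [[0.1, 5.0], [0.2, 1.0], [0.3, 4.0], [0.9, 5.0], [1.0, 1.0], [1.1, 4.0]]
y = [0, 0, 0, 1, 1, 1]
selected, ranked = select_by_rate(X, y, threshold=0.8)
```

    In practice the rate would be estimated on held-out data rather than on the training set, and the threshold would come from whatever rate criterion the method defines.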

  7. A Local Asynchronous Distributed Privacy Preserving Feature Selection Algorithm for Large Peer-to-Peer Networks

    National Aeronautics and Space Administration — In this paper we develop a local distributed privacy preserving algorithm for feature selection in a large peer-to-peer environment. Feature selection is often used...

  8. The effect of destination linked feature selection in real-time network intrusion detection

    Mzila, P

    2013-07-01

    Full Text Available techniques in the network intrusion detection system (NIDS) is the feature selection technique. The ability of NIDS to accurately identify intrusion from the network traffic relies heavily on feature selection, which describes the pattern of the network...

  9. Effective traffic features selection algorithm for cyber-attacks samples

    Li, Yihong; Liu, Fangzheng; Du, Zhenyu

    2018-05-01

    By studying defense schemes for network attacks, this paper proposes an effective traffic feature selection algorithm based on k-means++ clustering to deal with the high dimensionality of traffic features extracted from cyber-attack samples. First, the algorithm divides the original feature set into an attack traffic feature set and a background traffic feature set by clustering. Then, it calculates the variation in clustering performance after removing a certain feature. Finally, it evaluates the degree of distinctiveness of the feature vector according to the result. An effective feature vector is one whose degree of distinctiveness exceeds the set threshold. The purpose of this paper is to select the effective features from the extracted original feature set, thereby reducing the dimensionality of the features and the space-time overhead of subsequent detection. The experimental results show that the proposed algorithm is feasible and has some advantages over other selection algorithms.

  10. Feature selection and multi-kernel learning for adaptive graph regularized nonnegative matrix factorization

    Wang, Jim Jing-Yan

    2014-09-20

    Nonnegative matrix factorization (NMF), a popular part-based representation technique, does not capture the intrinsic local geometric structure of the data space. Graph regularized NMF (GNMF) was recently proposed to avoid this limitation by regularizing NMF with a nearest neighbor graph constructed from the input data set. However, GNMF has two main bottlenecks. First, using the original feature space directly to construct the graph is not necessarily optimal because of the noisy and irrelevant features and nonlinear distributions of data samples. Second, one possible way to handle the nonlinear distribution of data samples is by kernel embedding. However, it is often difficult to choose the most suitable kernel. To solve these bottlenecks, we propose two novel graph-regularized NMF methods, AGNMFFS and AGNMFMK, by introducing feature selection and multiple-kernel learning to the graph regularized NMF, respectively. Instead of using a fixed graph as in GNMF, the two proposed methods learn the nearest neighbor graph that is adaptive to the selected features and learned multiple kernels, respectively. For each method, we propose a unified objective function to conduct feature selection/multi-kernel learning, NMF and adaptive graph regularization simultaneously. We further develop two iterative algorithms to solve the two optimization problems. Experimental results on two challenging pattern classification tasks demonstrate that the proposed methods significantly outperform state-of-the-art data representation methods.

  11. Integrative approaches to the prediction of protein functions based on the feature selection

    Lee Hyunju

    2009-12-01

    Background: Protein function prediction has been one of the most important issues in functional genomics. With the current availability of various genomic data sets, many researchers have attempted to develop integration models that combine all available genomic data for protein function prediction. These efforts have resulted in the improvement of prediction quality and the extension of prediction coverage. However, it has also been observed that integrating more data sources does not always increase the prediction quality. Therefore, selecting data sources that highly contribute to the protein function prediction has become an important issue. Results: We present systematic feature selection methods that assess the contribution of genome-wide data sets to predict protein functions and then investigate the relationship between genomic data sources and protein functions. In this study, we use ten different genomic data sources in Mus musculus, including: protein-domains, protein-protein interactions, gene expressions, phenotype ontology, phylogenetic profiles and disease data sources to predict protein functions that are labelled with Gene Ontology (GO) terms. We then apply two approaches to feature selection: exhaustive search feature selection using a kernel based logistic regression (KLR), and a kernel based L1-norm regularized logistic regression (KL1LR). In the first approach, we exhaustively measure the contribution of each data set for each function based on its prediction quality. In the second approach, we use the estimated coefficients of features as measures of contribution of data sources. Our results show that the proposed methods improve the prediction quality compared to the full integration of all data sources and other filter-based feature selection methods. We also show that contributing data sources can differ depending on the protein function. Furthermore, we observe that highly contributing data sets can be similar among

  12. HEART RATE VARIABILITY CLASSIFICATION USING SADE-ELM CLASSIFIER WITH BAT FEATURE SELECTION

    R Kavitha

    2017-07-01

    The electrical activity of the human heart is measured by a vital biomedical signal, the electrocardiogram (ECG). The ECG is a crucial source of diagnostic information about a patient’s cardiopathy, and cardiac disease is monitored by recording and processing ECG impulses. In recent years much research has aimed at developing enhanced methods to identify risk in a patient’s condition by processing and analysing the ECG signal. This analysis helps to find cardiac abnormalities, arrhythmias, and many other heart problems. The ECG signal is processed to detect variability in heart rhythm; heart rate variability (HRV) is calculated from the time interval between heart beats, that is, from the variation in the beat-to-beat interval. HRV is an essential aspect for diagnosing the properties of the heart. Recent developments enhance this potential with the aid of non-linear metrics combined with feature selection. In this paper, fundamental elements are taken from the ECG signal for the feature selection process, where the Bat algorithm is employed to select the best features, which are then presented to the classifier for accurate classification. The popular machine learning algorithm ELM is used for classification, integrated with an evolutionary algorithm to form the Self-Adaptive Differential Evolution Extreme Learning Machine (SADE-ELM), to improve the reliability of classification. It is combined with an Effective Fuzzy Kohonen clustering network (EFKCN) to increase the accuracy of HRV transmission classification. The experiments show that the SADE-ELM method improves precision and at the same time optimizes the computation time.
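
    As a rough illustration of how HRV is derived from beat-to-beat intervals, the sketch below computes two standard time-domain measures, SDNN and RMSSD, from a list of RR intervals in milliseconds. The function name, the sample intervals, and the choice of these two measures are assumptions made for illustration; the abstract does not say which HRV features the authors extract.

```python
from math import sqrt
from statistics import stdev


def hrv_time_domain(rr_ms):
    """Return (SDNN, RMSSD) in ms for a sequence of RR intervals in ms."""
    if len(rr_ms) < 3:
        raise ValueError("need at least three RR intervals")
    # SDNN: sample standard deviation of the RR intervals.
    sdnn = stdev(rr_ms)
    # RMSSD: root mean square of successive differences.
    diffs = [b - a for a, b in zip(rr_ms, rr_ms[1:])]
    rmssd = sqrt(sum(d * d for d in diffs) / len(diffs))
    return sdnn, rmssd


# Hypothetical RR intervals (ms), not taken from the paper.
sdnn, rmssd = hrv_time_domain([800, 810, 790, 800])
print(f"SDNN = {sdnn:.2f} ms, RMSSD = {rmssd:.2f} ms")
```

    Values like these would form part of the candidate feature pool from which a selection algorithm then picks a subset.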

  13. Feature Selection for Nonstationary Data: Application to Human Recognition Using Medical Biometrics.

    Komeili, Majid; Louis, Wael; Armanfard, Narges; Hatzinakos, Dimitrios

    2018-05-01

    Electrocardiogram (ECG) and transient evoked otoacoustic emission (TEOAE) are among the physiological signals that have attracted significant interest in the biometric community due to their inherent robustness to replay and falsification attacks. However, they are time-dependent signals, which makes them hard to deal with in across-session human recognition scenarios where only one session is available for enrollment. This paper presents a novel feature selection method to address this issue. It is based on an auxiliary dataset with multiple sessions, from which it selects a subset of features that are more persistent across different sessions. It uses local information in terms of sample margins while enforcing an across-session measure. This makes it a perfect fit for the aforementioned biometric recognition problem. Comprehensive experiments on ECG and TEOAE variability due to time lapse and body posture are done. Performance of the proposed method is compared against seven state-of-the-art feature selection algorithms as well as another six approaches in the area of ECG and TEOAE biometric recognition. Experimental results demonstrate that the proposed method performs noticeably better than other algorithms.

  14. A scale space approach for unsupervised feature selection in mass spectra classification for ovarian cancer detection.

    Ceccarelli, Michele; d'Acierno, Antonio; Facchiano, Angelo

    2009-10-15

    Mass spectrometry spectra, widely used in proteomics studies as a screening tool for protein profiling and to detect discriminatory signals, are high dimensional data. A large number of local maxima (a.k.a. peaks) have to be analyzed as part of computational pipelines aimed at the realization of efficient predictive and screening protocols. With data of this dimensionality and sample size, the risk of over-fitting and selection bias is pervasive. Therefore the development of bio-informatics methods based on unsupervised feature extraction can lead to general tools which can be applied to several fields of predictive proteomics. We propose a method for feature selection and extraction grounded on the theory of multi-scale spaces for high resolution spectra derived from analysis of serum. Then we use support vector machines for classification. In particular we use a database containing 216 sample spectra divided in 115 cancer and 91 control samples. The overall accuracy averaged over a large cross validation study is 98.18. The area under the ROC curve of the best selected model is 0.9962. We improved previously known results on the same data, with the advantage that the proposed method has an unsupervised feature selection phase. All the developed code, as MATLAB scripts, can be downloaded from http://medeaserver.isa.cnr.it/dacierno/spectracode.htm.

  15. A HYBRID FILTER AND WRAPPER FEATURE SELECTION APPROACH FOR DETECTING CONTAMINATION IN DRINKING WATER MANAGEMENT SYSTEM

    S. VISALAKSHI

    2017-07-01

    Feature selection is an important task in predictive modelling that helps to identify the irrelevant features in a high-dimensional dataset. For the water contamination detection dataset considered here, the standard wrapper algorithm cannot be applied on its own because of its computational complexity. To overcome this problem and lighten the computation, a filter-wrapper based algorithm is proposed. In this work, reducing the feature space is a significant component of water contamination detection. The main findings are as follows: (1) The main goal is to speed up the feature selection process, so the proposed filter-based feature pre-selection is applied; it makes it unlikely that useful data are discarded in the initial stage, as discussed briefly in this paper. (2) The resulting features are filtered again using a Genetic Algorithm coupled with a Support Vector Machine, which condenses the subset of features while keeping accuracy high and reducing cost. Experimental results show that the proposed methods trim down redundant features effectively and achieve better classification accuracy.
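
    The two-stage idea described above (a cheap filter first, a costlier wrapper on what survives) can be sketched as follows. This is a minimal sketch under stated assumptions: the feature names, the filter scores, and `toy_accuracy` are hypothetical, and the wrapper stage is a plain exhaustive search standing in for the Genetic Algorithm with SVM that the abstract mentions.

```python
from itertools import combinations


def filter_then_wrapper(filter_scores, evaluate_subset, keep_top):
    """Two-stage selection: cheap filter ranking, then wrapper search.

    filter_scores   -- dict mapping feature name to a univariate score
    evaluate_subset -- callable returning an accuracy-like number for a
                       tuple of feature names (stand-in for a classifier)
    keep_top        -- how many features survive the filter stage
    """
    # Stage 1 (filter): keep only the best-scoring features.
    ranked = sorted(filter_scores, key=filter_scores.get, reverse=True)
    survivors = ranked[:keep_top]

    # Stage 2 (wrapper): exhaustive search over the survivors.
    # Assumption: the paper uses a genetic algorithm with an SVM here.
    best_subset, best_score = (), float("-inf")
    for size in range(1, len(survivors) + 1):
        for subset in combinations(survivors, size):
            score = evaluate_subset(subset)
            if score > best_score:
                best_subset, best_score = subset, score
    return best_subset, best_score


# Hypothetical filter scores and a toy stand-in for classifier accuracy.
scores = {"ph": 0.9, "chlorine": 0.8, "turbidity": 0.7, "noise": 0.1}
useful = {"ph", "chlorine"}


def toy_accuracy(subset):
    # Reward useful features and charge a small cost per feature used.
    return len(useful & set(subset)) - 0.1 * len(subset)


subset, score = filter_then_wrapper(scores, toy_accuracy, keep_top=3)
print(subset, score)
```

    The filter stage shrinks the search space before the wrapper runs, which is where the speed-up comes from: the wrapper here evaluates 7 subsets instead of 15.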

  17. Kernel-based Joint Feature Selection and Max-Margin Classification for Early Diagnosis of Parkinson’s Disease

    Adeli, Ehsan; Wu, Guorong; Saghafi, Behrouz; An, Le; Shi, Feng; Shen, Dinggang

    2017-01-01

    Feature selection methods usually select the most compact and relevant set of features based on their contribution to a linear regression model. Thus, these features might not be the best for a non-linear classifier. This is especially crucial for tasks in which performance depends heavily on the feature selection technique, such as the diagnosis of neurodegenerative diseases. Parkinson’s disease (PD) is one of the most common neurodegenerative disorders; it progresses slowly while dramatically affecting quality of life. In this paper, we use multi-modal neuroimaging data to diagnose PD by investigating the brain regions known to be affected at the early stages. We propose a joint kernel-based feature selection and classification framework. Unlike conventional feature selection techniques that select features based on their performance in the original input feature space, we select features that best benefit the classification scheme in the kernel space. We further propose kernel functions specifically designed for our non-negative feature types. We use MRI and SPECT data of 538 subjects from the PPMI database, and obtain a diagnosis accuracy of 97.5%, which outperforms all baseline and state-of-the-art methods. PMID:28120883

  18. Differential privacy-based evaporative cooling feature selection and classification with relief-F and random forests.

    Le, Trang T; Simmons, W Kyle; Misaki, Masaya; Bodurka, Jerzy; White, Bill C; Savitz, Jonathan; McKinney, Brett A

    2017-09-15

    Classification of individuals into disease or clinical categories from high-dimensional biological data with low prediction error is an important challenge of statistical learning in bioinformatics. Feature selection can improve classification accuracy but must be incorporated carefully into cross-validation to avoid overfitting. Recently, feature selection methods based on differential privacy, such as differentially private random forests and reusable holdout sets, have been proposed. However, for domains such as bioinformatics, where the number of features is much larger than the number of observations (p≫n), these differential privacy methods are susceptible to overfitting. We introduce private Evaporative Cooling, a stochastic privacy-preserving machine learning algorithm that uses Relief-F for feature selection and random forest for privacy preserving classification that also prevents overfitting. We relate the privacy-preserving threshold mechanism to a thermodynamic Maxwell-Boltzmann distribution, where the temperature represents the privacy threshold. We use the thermal statistical physics concept of Evaporative Cooling of atomic gases to perform backward stepwise privacy-preserving feature selection. On simulated data with main effects and statistical interactions, we compare accuracies on holdout and validation sets for three privacy-preserving methods: the reusable holdout, reusable holdout with random forest, and private Evaporative Cooling, which uses Relief-F feature selection and random forest classification. In simulations where interactions exist between attributes, private Evaporative Cooling provides higher classification accuracy without overfitting based on an independent validation set. In simulations without interactions, thresholdout with random forest and private Evaporative Cooling give comparable accuracies. We also apply these privacy methods to human brain resting-state fMRI data from a study of major depressive disorder. Code

  19. Pattern Classification Using an Olfactory Model with PCA Feature Selection in Electronic Noses: Study and Application

    Junbao Zheng

    2012-03-01

    Biologically-inspired models and algorithms are considered promising sensor array signal processing methods for electronic noses. Feature selection is one of the most important issues for developing robust pattern recognition models in machine learning. This paper describes an investigation into the classification performance of a bionic olfactory model as the dimension of the input feature vector (outer factor) and the number of its parallel channels (inner factor) increase. The principal component analysis technique was applied for feature selection and dimension reduction. Two data sets, three classes of wine derived from different cultivars and five classes of green tea derived from five different provinces of China, were used for experiments. In the former case the results showed that the average correct classification rate increased as more principal components were put into the feature vector. In the latter case the results showed that sufficient parallel channels should be reserved in the model to avoid pattern space crowding. We concluded that 6~8 channels of the model, with principal component feature vector values covering at least 90% cumulative variance, are adequate for a classification task of 3~5 pattern classes considering the trade-off between time consumption and classification rate.
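
    The "at least 90% cumulative variance" criterion amounts to counting how many leading principal components are needed before their eigenvalues sum to the required share of the total. A minimal sketch follows, assuming the eigenvalues of the covariance matrix are already available; the function name and the sample eigenvalues are illustrative and not taken from the paper.

```python
def components_for_variance(eigenvalues, threshold=0.90):
    """Smallest number of leading principal components whose cumulative
    explained variance reaches `threshold` (a fraction between 0 and 1)."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    if total <= 0:
        raise ValueError("eigenvalues must have a positive sum")
    cumulative = 0.0
    for count, value in enumerate(ordered, start=1):
        cumulative += value
        if cumulative / total >= threshold:
            return count
    return len(ordered)


# Hypothetical covariance-matrix eigenvalues, not taken from the paper.
n_components = components_for_variance([4.0, 3.0, 2.0, 1.0], threshold=0.85)
print(n_components)
```

    With these made-up eigenvalues the cumulative shares are 40%, 70%, 90% and 100%, so an 85% threshold is first met at three components.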

  20. Feature-selective attention: evidence for a decline in old age.

    Quigley, Cliodhna; Andersen, Søren K; Schulze, Lars; Grunwald, Martin; Müller, Matthias M

    2010-04-19

    Although attention in older adults is an active research area, feature-selective aspects have not yet been explicitly studied. Here we report the results of an exploratory study involving directed changes in feature-selective attention. The stimuli used were two random dot kinematograms (RDKs) of different colours, superimposed and centrally presented. A colour cue with random onset after the beginning of each trial instructed young and older subjects to attend to one of the RDKs and detect short intervals of coherent motion while ignoring analogous motion events in the non-cued RDK. Behavioural data show that older adults could detect motion, but discriminated target from distracter motion less reliably than young adults. The method of frequency tagging allowed us to separate the EEG responses to the attended and ignored stimuli and directly compare steady-state visual evoked potential (SSVEP) amplitudes elicited by each stimulus before and after cue onset. We found that younger adults show a clear attentional enhancement of SSVEP amplitude in the post-cue interval, while older adults' SSVEP responses to attended and ignored stimuli do not differ. Thus, in situations where attentional selection cannot be spatially resolved, older adults show a deficit in selection that is not shared by young adults. Copyright 2010 Elsevier Ireland Ltd. All rights reserved.

  1. SVM-RFE based feature selection and Taguchi parameters optimization for multiclass SVM classifier.

    Huang, Mei-Ling; Hung, Yung-Hsiang; Lee, W M; Li, R K; Jiang, Bo-Ru

    2014-01-01

    Support vector machines (SVMs) have recently shown excellent performance in classification and prediction and are widely used for disease diagnosis and medical assistance. However, SVM only functions well on two-group classification problems. This study combines feature selection and SVM recursive feature elimination (SVM-RFE) to investigate the classification accuracy of multiclass problems for the Dermatology and Zoo databases. The Dermatology dataset contains 33 feature variables, 1 class variable, and 366 testing instances; the Zoo dataset contains 16 feature variables, 1 class variable, and 101 testing instances. The feature variables in the two datasets were sorted in descending order by explanatory power, and different feature sets were selected by SVM-RFE to explore classification accuracy. Meanwhile, the Taguchi method was combined with the SVM classifier to optimize parameters C and γ and so increase accuracy for multiclass classification. The experimental results show that the classification accuracy can exceed 95% after SVM-RFE feature selection and Taguchi parameter optimization for the Dermatology and Zoo databases.
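
    The elimination loop at the heart of recursive feature elimination can be sketched without any particular classifier. In the sketch below the ranking step is a caller-supplied function; `toy_ranker` and its weights are made up and merely stand in for refitting a linear SVM and reading off its squared weights, which is an assumption about how SVM-RFE is usually set up rather than a detail given in the abstract.

```python
def recursive_feature_elimination(features, rank_features, n_keep):
    """Repeatedly drop the lowest-ranked feature until n_keep remain.

    rank_features -- callable taking the list of active features and
                     returning a dict of importance scores for them.
    Returns (kept_features, elimination_order).
    """
    active = list(features)
    eliminated = []
    while len(active) > n_keep:
        importance = rank_features(active)   # re-rank on the current subset
        weakest = min(active, key=lambda name: importance[name])
        active.remove(weakest)
        eliminated.append(weakest)
    return active, eliminated


# Toy stand-in for a trained model: fixed, made-up weights.
made_up_weights = {"f1": 0.9, "f2": 0.05, "f3": 0.4, "f4": 0.2}


def toy_ranker(active):
    return {name: made_up_weights[name] ** 2 for name in active}


kept, order = recursive_feature_elimination(
    list(made_up_weights), toy_ranker, n_keep=2
)
print(kept, order)
```

    Reversing the elimination order gives a ranking of the discarded features from most to least important, which is how nested feature sets of different sizes can be compared.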

  2. Feature Selection and Parameter Optimization of Support Vector Machines Based on Modified Artificial Fish Swarm Algorithms

    Kuan-Cheng Lin

    2015-01-01

    Rapid advances in information and communication technology have made ubiquitous computing and the Internet of Things popular and practicable. These applications create enormous volumes of data, which are available for analysis and classification as an aid to decision-making. Among the classification methods used to deal with big data, feature selection has proven particularly effective. One common approach involves searching through a subset of the features that are the most relevant to the topic or represent the most accurate description of the dataset. Unfortunately, searching through this kind of subset is a combinatorial problem that can be very time consuming. Metaheuristic algorithms are commonly used to facilitate the selection of features. The artificial fish swarm algorithm (AFSA) employs the intelligence underlying fish swarming behavior as a means to overcome optimization of combinatorial problems. AFSA has proven highly successful in a diversity of applications; however, there remain shortcomings, such as the likelihood of falling into a local optimum and a lack of multiplicity. This study proposes a modified AFSA (MAFSA) to improve feature selection and parameter optimization for support vector machine classifiers. Experiment results demonstrate the superiority of MAFSA in classification accuracy using subsets with fewer features for given UCI datasets, compared to the original AFSA.

  3. An ant colony optimization based feature selection for web page classification.

    Saraç, Esra; Özel, Selma Ayşe

    2014-01-01

    The increased popularity of the web has led to a huge amount of information being added to it, and as a result of this explosive information growth, automated web page classification systems are needed to improve search engines' performance. Web pages have a large number of features such as HTML/XML tags, URLs, hyperlinks, and text contents that should be considered during an automated classification process. The aim of this study is to reduce the number of features to be used to improve runtime and accuracy of the classification of web pages. In this study, we used an ant colony optimization (ACO) algorithm to select the best features, and then we applied the well-known C4.5, naive Bayes, and k nearest neighbor classifiers to assign class labels to web pages. We used the WebKB and Conference datasets in our experiments, and we showed that using the ACO for feature selection improves both accuracy and runtime performance of classification. We also showed that the proposed ACO based algorithm can select better features with respect to the well-known information gain and chi square feature selection methods.
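
    For reference, the chi-square baseline named above scores one feature at a time by comparing observed feature/class co-occurrence counts with the counts expected under independence. The sketch below is a minimal version for categorical features; the page labels and term-presence vectors are invented for illustration.

```python
from collections import Counter


def chi_square_score(feature_values, labels):
    """Pearson chi-square statistic between a categorical feature and the
    class label; larger means stronger dependence."""
    n = len(labels)
    observed = Counter(zip(feature_values, labels))
    feature_totals = Counter(feature_values)
    label_totals = Counter(labels)
    statistic = 0.0
    for f, f_count in feature_totals.items():
        for y, y_count in label_totals.items():
            expected = f_count * y_count / n   # count under independence
            statistic += (observed[(f, y)] - expected) ** 2 / expected
    return statistic


# Made-up example: does a page contain a given term (1) or not (0)?
page_labels = ["course", "course", "faculty", "faculty"]
has_homework = [1, 1, 0, 0]   # tracks the class exactly
has_the = [1, 0, 1, 0]        # unrelated to the class

chi_homework = chi_square_score(has_homework, page_labels)
chi_the = chi_square_score(has_the, page_labels)
print(chi_homework, chi_the)
```

    A univariate filter of this kind ranks each term independently, which is cheap but blind to interactions between features; that is the gap subset-search methods such as ACO try to close.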

  4. Feature selection for neural network based defect classification of ceramic components using high frequency ultrasound.

    Kesharaju, Manasa; Nagarajah, Romesh

    2015-09-01

    The motivation for this research stems from a need for providing a non-destructive testing method capable of detecting and locating any defects and microstructural variations within armour ceramic components before issuing them to the soldiers who rely on them for their survival. The development of an automated ultrasonic inspection based classification system would make possible the checking of each ceramic component and immediately alert the operator about the presence of defects. Generally, in many classification problems a choice of features or dimensionality reduction is significant and simultaneously very difficult, as a substantial computational effort is required to evaluate possible feature subsets. In this research, a combination of artificial neural networks and genetic algorithms is used to optimize the feature subset used in classification of various defects in reaction-sintered silicon carbide ceramic components. Initially, wavelet based feature extraction is performed on the region of interest. An Artificial Neural Network classifier is employed to evaluate the performance of these features. Genetic Algorithm based feature selection is performed. Principal Component Analysis is a popular technique used for feature selection and is compared with the genetic algorithm based technique in terms of classification accuracy and selection of optimal number of features. The experimental results confirm that features identified by Principal Component Analysis lead to better classification performance (96%) than those selected by the genetic algorithm (94%). Copyright © 2015 Elsevier B.V. All rights reserved.
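
    Genetic-algorithm feature selection typically encodes each candidate subset as a bit mask and evolves a population of masks against a fitness function. The sketch below shows one simple variant (truncation selection, one-point crossover, bit-flip mutation, elitism); all parameter values and `toy_fitness` are assumptions for illustration, and in the study the fitness would come from the neural network classifier instead.

```python
import random


def ga_select(n_features, fitness, generations=30, pop_size=20,
              mutation_rate=0.05, seed=0):
    """Evolve bit masks (1 = feature kept).

    Returns (best_mask, history), where history holds the best fitness
    seen in each generation followed by the final best fitness.
    """
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_features)]
                  for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        history.append(fitness(population[0]))
        parents = population[:pop_size // 2]      # truncation selection
        children = [population[0][:]]             # elitism: keep the best
        while len(children) < pop_size:
            mother, father = rng.sample(parents, 2)
            cut = rng.randint(1, n_features - 1)  # one-point crossover
            child = mother[:cut] + father[cut:]
            child = [bit ^ 1 if rng.random() < mutation_rate else bit
                     for bit in child]            # bit-flip mutation
            children.append(child)
        population = children
    best = max(population, key=fitness)
    return best, history + [fitness(best)]


# Toy fitness: features 0, 2 and 5 are "informative" (made up), and every
# selected feature carries a small cost.
informative = {0, 2, 5}


def toy_fitness(mask):
    return sum(mask[i] for i in informative) - 0.2 * sum(mask)


best_mask, history = ga_select(8, toy_fitness)
print(best_mask, history[-1])
```

    Because the best mask is copied unchanged into the next generation, the best fitness never decreases from one generation to the next.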

  5. Game Theoretic Approach for Systematic Feature Selection; Application in False Alarm Detection in Intensive Care Units

    Fatemeh Afghah

    2018-03-01

    Intensive Care Units (ICUs) are equipped with many sophisticated sensors and monitoring devices to provide the highest quality of care for critically ill patients. However, these devices might generate false alarms that reduce the standard of care and desensitize caregivers to alarms. Therefore, reducing the number of false alarms is of great importance. Many approaches, such as signal processing, machine learning, and the design of more accurate sensors, have been developed for this purpose. However, the significant intrinsic correlation among the extracted features from different sensors has been mostly overlooked. A majority of current data mining techniques fail to capture such correlation among the signals collected from different sensors, which limits their alarm recognition capabilities. Here, we propose a novel information-theoretic predictive modeling technique based on the idea of coalition game theory to enhance the accuracy of false alarm detection in ICUs by accounting for the synergistic power of signal attributes in the feature selection stage. This approach brings together techniques from information theory and game theory to account for inter-feature mutual information in determining the most correlated predictors with respect to false alarm by calculating the Banzhaf power of each feature. The numerical results show that the proposed method can enhance classification accuracy and improve the area under the ROC (receiver operating characteristic) curve compared to other feature selection techniques, when integrated in classifiers such as Bayes-Net that consider inter-feature dependencies.
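
    To make the game-theoretic idea concrete, the sketch below computes the Banzhaf value of each player (feature) in a small coalition game: the player's marginal contribution averaged over all coalitions of the remaining players. The feature names and `toy_value` are hypothetical; the paper derives coalition worth from mutual information, which is not reproduced here.

```python
from itertools import combinations


def banzhaf_values(players, value):
    """Banzhaf value of each player: its marginal contribution averaged
    over all coalitions of the other players.

    value -- callable mapping a frozenset of players to a number.
    """
    players = list(players)
    result = {}
    for player in players:
        others = [p for p in players if p != player]
        total = 0.0
        for size in range(len(others) + 1):
            for coalition in combinations(others, size):
                without = frozenset(coalition)
                total += value(without | {player}) - value(without)
        result[player] = total / 2 ** len(others)
    return result


def toy_value(coalition):
    # Made-up worth: "hr" and "bp" are only useful together (synergy),
    # while "spo2" contributes a little on its own.
    worth = 2.0 if {"hr", "bp"} <= coalition else 0.0
    if "spo2" in coalition:
        worth += 0.5
    return worth


power = banzhaf_values(["hr", "bp", "spo2"], toy_value)
print(power)
```

    In this toy game neither "hr" nor "bp" is worth anything alone, yet each ends up with a higher Banzhaf value than "spo2" because of their joint contribution, which a purely univariate score would miss. The enumeration is exponential in the number of features, so it is only practical for small feature sets.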

  6. Sparse Bayesian classification and feature selection for biological expression data with high correlations.

    Xian Yang

    Classification models built on biological expression data are increasingly used to predict distinct disease subtypes. Selected features that separate sample groups can be candidate biomarkers, helping us to discover biological functions/pathways. However, three challenges are associated with building a robust classification and feature selection model: (1) the number of significant biomarkers is much smaller than that of measured features for which the search will be exhaustive; (2) current biological expression data are big in both sample size and feature size, which will worsen the scalability of any search algorithm; and (3) expression profiles of certain features are typically highly correlated, which may prevent the predominant features from being distinguished. Unfortunately, most of the existing algorithms address only some of these challenges rather than all of them together. In this paper, we propose a unified framework to address the above challenges. The classification and feature selection problem is first formulated as a nonconvex optimisation problem. Then the problem is relaxed and solved iteratively by a sequence of convex optimisation procedures which can be computed in a distributed manner and therefore allow efficient implementation on advanced infrastructures. To illustrate the competence of our method over others, we first analyse a randomly generated simulation dataset under various conditions. We then analyse a real gene expression dataset on embryonal tumour. Further downstream analyses, such as functional annotation and pathway analysis, are performed on the selected features and elucidate several biological findings.

  7. An Appraisal Model Based on a Synthetic Feature Selection Approach for Students’ Academic Achievement

    Ching-Hsue Cheng

    2017-11-01

    Obtaining necessary information (and even extracting hidden messages) from existing big data, and then transforming it into knowledge, is an important skill. Data mining technology has received increased attention in various fields in recent years because it can be used to find historical patterns and employ machine learning to aid in decision-making. When we find unexpected rules or patterns from the data, they are likely to be of high value. This paper proposes a synthetic feature selection approach (SFSA), which is combined with a support vector machine (SVM) to extract patterns and find the key features that influence students’ academic achievement. For verifying the proposed model, two databases, namely, “Student Profile” and “Tutorship Record”, were collected from an elementary school in Taiwan, and were concatenated into an integrated dataset based on students’ names as a research dataset. The results indicate the following: (1) the accuracy of the proposed feature selection approach is better than that of the Minimum-Redundancy-Maximum-Relevance (mRMR) approach; (2) the proposed model is better than the listed methods when the six least influential features have been deleted; and (3) the proposed model can enhance the accuracy and facilitate the interpretation of the pattern from a hybrid-type dataset of students’ academic achievement.

  8. Comparison of spectrum normalization techniques for univariate ...

    2016-02-29

    Feb 29, 2016 ... 1Fuel Chemistry Division, Bhabha Atomic Research Centre, Mumbai 400 085, India. 2Department of ... their three-point smoothing methods were studied using LIBS for quantification of Cr, Mn and Ni ... nique for the qualitative and quantitative analysis of the samples. .... SEP is a type of mean square error.

  9. Feature Selection using Multi-objective Genetic Algorithm: A Hybrid Approach

    Ahuja, Jyoti; GJUST - Guru Jambheshwar University of Science and Technology; Ratnoo, Saroj Dahiya; GJUST - Guru Jambheshwar University of Science and Technology

    2015-01-01

    Feature selection is an important pre-processing task for building accurate and comprehensible classification models. Several researchers have applied filter, wrapper or hybrid approaches using genetic algorithms which are good candidates for optimization problems that involve large search spaces like in the case of feature selection. Moreover, feature selection is an inherently multi-objective problem with many competing objectives involving size, predictive power and redundancy of the featu...

  10. Mutual information based feature selection for medical image retrieval

    Zhi, Lijia; Zhang, Shaomin; Li, Yan

    2018-04-01

    In this paper, the authors propose a mutual information based method for lung CT image retrieval. This method is designed to adapt to different datasets and different retrieval tasks. For practical applicability, it avoids using a large amount of training data. Instead, with a well-designed training process and robust fundamental features and measurements, the method achieves promising performance while keeping training computation economical. Experimental results show that the method has potential practical value for routine clinical application.

  11. Automatic Image Segmentation Using Active Contours with Univariate Marginal Distribution

    I. Cruz-Aceves

    2013-01-01

    This paper presents a novel automatic image segmentation method based on the theory of active contour models and estimation of distribution algorithms. The proposed method uses the univariate marginal distribution model to infer statistical dependencies between the control points on different active contours. These contours have been generated through an alignment process of reference shape priors, in order to increase the exploration and exploitation capabilities regarding different interactive segmentation techniques. This proposed method is applied in the segmentation of the hollow core in microscopic images of photonic crystal fibers and it is also used to segment the human heart and ventricular areas from datasets of computed tomography and magnetic resonance images, respectively. Moreover, to evaluate the performance of the medical image segmentations compared to regions outlined by experts, a set of similarity measures has been adopted. The experimental results suggest that the proposed image segmentation method outperforms the traditional active contour model and the interactive Tseng method in terms of segmentation accuracy and stability.

  12. Feature selection using genetic algorithm for breast cancer diagnosis: experiment on three different datasets

    Aalaei, Shokoufeh; Shahraki, Hadi; Rowhanimanesh, Alireza; Eslami, Saeid

    2016-01-01

    This study addresses feature selection for breast cancer diagnosis. The present process uses a wrapper approach with GA-based feature selection and a PS-classifier. The experimental results show that the proposed model is comparable to the other models on Wisconsin breast cancer datasets. To

  13. Feature selection is the ReliefF for multiple instance learning

    Zafra, A.; Pechenizkiy, M.; Ventura, S.

    2010-01-01

    Dimensionality reduction and feature selection in particular are known to be of a great help for making supervised learning more effective and efficient. Many different feature selection techniques have been proposed for the traditional settings, where each instance is expected to have a label. In

  14. Towards Feature Selection in Actor-Critic Algorithms

    Rohanimanesh, Khashayar; Roy, Nicholas; Tedrake, Russ

    2007-01-01

    .... They demonstrate that two popular representations for value methods -- the barycentric interpolators and the graph Laplacian proto-value functions -- can be used to represent the actor so as to satisfy these conditions...

  15. Human activity recognition based on feature selection in smart home using back-propagation algorithm.

    Fang, Hongqing; He, Lei; Si, Hao; Liu, Peng; Xie, Xiaolei

    2014-09-01

    In this paper, the back-propagation (BP) algorithm is used to train a feed-forward neural network for human activity recognition in smart home environments, and an inter-class distance method for feature selection of observed motion sensor events is discussed and tested. The human activity recognition performance of the neural network trained with the BP algorithm is then evaluated and compared with other probabilistic algorithms: the Naïve Bayes (NB) classifier and the Hidden Markov Model (HMM). The results show that different feature datasets yield different activity recognition accuracy. The selection of unsuitable feature datasets increases the computational complexity and degrades the activity recognition accuracy. Furthermore, the neural network using the BP algorithm has relatively better human activity recognition performance than the NB classifier and HMM. Copyright © 2014 ISA. Published by Elsevier Ltd. All rights reserved.
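
    The abstract does not define its inter-class distance measure. One common form, assumed here purely for illustration, scores each feature by how far apart the class means lie relative to the spread within each class (a Fisher-style ratio). The sensor readings and activity labels below are made up.

```python
from collections import defaultdict
from statistics import mean


def interclass_distance_scores(rows, labels):
    """Score each feature column by between-class scatter divided by
    within-class scatter; higher means better class separation."""
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)
    scores = []
    for j in range(len(rows[0])):
        overall = mean(row[j] for row in rows)
        between = within = 0.0
        for members in by_class.values():
            class_mean = mean(row[j] for row in members)
            between += len(members) * (class_mean - overall) ** 2
            within += sum((row[j] - class_mean) ** 2 for row in members)
        scores.append(between / within if within > 0 else float("inf"))
    return scores


# Made-up data: two features per event, two activities.
rows = [[1.0, 5.0], [2.0, 1.0], [9.0, 5.0], [10.0, 1.0]]
activities = ["sleep", "sleep", "cook", "cook"]
separation = interclass_distance_scores(rows, activities)
print(separation)
```

    In this toy data the first feature separates the two activities cleanly while the second has identical class means, so a ranking by this score would keep the first and drop the second.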

  16. Information Theory for Gabor Feature Selection for Face Recognition

    Shen Linlin

    2006-01-01

    A discriminative and robust feature, the kernel enhanced informative Gabor feature, is proposed in this paper for face recognition. Mutual information is applied to select a set of informative and nonredundant Gabor features, which are then further enhanced by kernel methods for recognition. Compared with one of the top performing methods in the 2004 Face Verification Competition (FVC2004), our methods demonstrate a clear advantage over existing methods in accuracy, computation efficiency, and memory cost. The proposed method has been fully tested on the FERET database using the FERET evaluation protocol. Significant improvements on three of the test data sets are observed. Compared with the classical Gabor wavelet-based approaches using a huge number of features, our method requires less than 4 milliseconds to retrieve a few hundred features. Due to the substantially reduced feature dimension, only 4 seconds are required to recognize 200 face images. The paper also unified different Gabor filter definitions and proposed a training sample generation algorithm to reduce the effects caused by unbalanced number of samples available in different classes.

  17. Information Theory for Gabor Feature Selection for Face Recognition

    Shen, Linlin; Bai, Li

    2006-12-01

    A discriminative and robust feature—kernel enhanced informative Gabor feature—is proposed in this paper for face recognition. Mutual information is applied to select a set of informative and nonredundant Gabor features, which are then further enhanced by kernel methods for recognition. Compared with one of the top performing methods in the 2004 Face Verification Competition (FVC2004), our methods demonstrate a clear advantage over existing methods in accuracy, computation efficiency, and memory cost. The proposed method has been fully tested on the FERET database using the FERET evaluation protocol. Significant improvements on three of the test data sets are observed. Compared with the classical Gabor wavelet-based approaches using a huge number of features, our method requires less than 4 milliseconds to retrieve a few hundreds of features. Due to the substantially reduced feature dimension, only 4 seconds are required to recognize 200 face images. The paper also unified different Gabor filter definitions and proposed a training sample generation algorithm to reduce the effects caused by unbalanced number of samples available in different classes.

  18. Emotion of Physiological Signals Classification Based on TS Feature Selection

    Wang Yujing; Mo Jianlin

    2015-01-01

    This paper proposes TS-MLP, a method for emotion recognition from physiological signals. It recognizes emotion by using Tabu search to select features of the physiological signals and a multilayer perceptron to classify the emotion. Simulation shows that it achieves good emotion classification performance.

  19. Sequence-based classification using discriminatory motif feature selection.

    Hao Xiong

    Most existing methods for sequence-based classification use exhaustive feature generation, employing, for example, all k-mer patterns. The motivation behind such (enumerative) approaches is to minimize the potential for overlooking important features. However, there are shortcomings to this strategy. First, practical constraints limit the scope of exhaustive feature generation to patterns of length ≤ k, such that potentially important, longer (> k) predictors are not considered. Second, features so generated exhibit strong dependencies, which can complicate understanding of derived classification rules. Third, and most importantly, numerous irrelevant features are created. These concerns can compromise prediction and interpretation. While remedies have been proposed, they tend to be problem-specific and not broadly applicable. Here, we develop a generally applicable methodology, and an attendant software pipeline, that is predicated on discriminatory motif finding. In addition to the traditional training and validation partitions, our framework entails a third level of data partitioning, a discovery partition. A discriminatory motif finder is used on sequences and associated class labels in the discovery partition to yield a (small) set of features. These features are then used as inputs to a classifier in the training partition. Finally, performance assessment occurs on the validation partition. Important attributes of our approach are its modularity (any discriminatory motif finder and any classifier can be deployed) and its universality (all data, including sequences that are unaligned and/or of unequal length, can be accommodated). We illustrate our approach on two nucleosome occupancy datasets and a protein solubility dataset, previously analyzed using enumerative feature generation. Our method achieves excellent performance results, with and without optimization of classifier tuning parameters. A Python pipeline implementing the approach is

  20. Day-ahead price forecasting of electricity markets by a new feature selection algorithm and cascaded neural network technique

    Amjady, Nima; Keynia, Farshid

    2009-01-01

    With the introduction of restructuring into the electric power industry, the price of electricity has become the focus of all activities in the power market. Electricity price forecast is key information for electricity market managers and participants. However, electricity price is a complex signal due to its non-linear, non-stationary, and time variant behavior. In spite of performed research in this area, more accurate and robust price forecast methods are still required. In this paper, a new forecast strategy is proposed for day-ahead price forecasting of electricity markets. Our forecast strategy is composed of a new two stage feature selection technique and cascaded neural networks. The proposed feature selection technique comprises modified Relief algorithm for the first stage and correlation analysis for the second stage. The modified Relief algorithm selects candidate inputs with maximum relevancy with the target variable. Then among the selected candidates, the correlation analysis eliminates redundant inputs. Selected features by the two stage feature selection technique are used for the forecast engine, which is composed of 24 consecutive forecasters. Each of these 24 forecasters is a neural network allocated to predict the price of 1 h of the next day. The whole proposed forecast strategy is examined on the Spanish and Australia's National Electricity Markets Management Company (NEMMCO) and compared with some of the most recent price forecast methods.
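
    The relevance-then-redundancy idea in the record above can be sketched as follows. This is a minimal illustration, not the authors' method: absolute Pearson correlation stands in for the modified Relief relevance measure, and the function names, thresholds, and toy data are all assumptions made for the example.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def two_stage_select(columns, target, relevance_min=0.5, redundancy_max=0.9):
    """columns maps a feature name to its list of values; returns selected names.

    Both thresholds are hypothetical defaults chosen for the toy data below.
    """
    # Stage 1 (relevance): keep candidates that correlate with the target,
    # ordered from most to least relevant.
    relevance = {name: abs(pearson(col, target)) for name, col in columns.items()}
    candidates = sorted(
        (name for name in columns if relevance[name] >= relevance_min),
        key=lambda name: -relevance[name],
    )
    # Stage 2 (redundancy): drop a candidate that is highly correlated
    # with any feature that has already been kept.
    selected = []
    for name in candidates:
        if all(abs(pearson(columns[name], columns[kept])) < redundancy_max
               for kept in selected):
            selected.append(name)
    return selected

# Toy data: "b" duplicates "a" up to scale, "c" is irrelevant, "d" is relevant.
target = [1, 2, 3, 4, 5, 6]
columns = {
    "a": [1.1, 1.9, 3.2, 3.9, 5.1, 6.0],
    "b": [2.2, 3.8, 6.4, 7.8, 10.2, 12.0],
    "c": [1, -1, 1, -1, 1, -1],
    "d": [2, 1, 4, 3, 6, 5],
}
selected = two_stage_select(columns, target)
```

    On the toy data, the irrelevant input is removed in the first stage and only one of the two perfectly correlated inputs survives the second stage.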

  1. Feature Selection for Motor Imagery EEG Classification Based on Firefly Algorithm and Learning Automata

    Aiming Liu

    2017-11-01

    Motor Imagery (MI) electroencephalography (EEG) is widely studied for its non-invasiveness, easy availability, portability, and high temporal resolution. As for MI EEG signal processing, the high dimensions of features represent a research challenge. It is necessary to eliminate redundant features, which not only create an additional overhead of managing the space complexity, but also might include outliers, thereby reducing classification accuracy. The firefly algorithm (FA) can adaptively select the best subset of features, and improve classification accuracy. However, the FA is easily entrapped in a local optimum. To solve this problem, this paper proposes a method of combining the firefly algorithm and learning automata (LA) to optimize feature selection for motor imagery EEG. We employed a method of combining common spatial pattern (CSP) and local characteristic-scale decomposition (LCD) algorithms to obtain a high dimensional feature set, and classified it by using the spectral regression discriminant analysis (SRDA) classifier. Both the fourth brain–computer interface competition data and real-time data acquired in our designed experiments were used to verify the validation of the proposed method. Compared with genetic and adaptive weight particle swarm optimization algorithms, the experimental results show that our proposed method effectively eliminates redundant features, and improves the classification accuracy of MI EEG signals. In addition, a real-time brain–computer interface system was implemented to verify the feasibility of our proposed methods being applied in practical brain–computer interface systems.

  2. Rough-fuzzy clustering and unsupervised feature selection for wavelet based MR image segmentation.

    Pradipta Maji

    Image segmentation is an indispensable process in the visualization of human tissues, particularly during clinical analysis of brain magnetic resonance (MR) images. For many human experts, manual segmentation is a difficult and time consuming task, which makes an automated brain MR image segmentation method desirable. In this regard, this paper presents a new segmentation method for brain MR images, integrating judiciously the merits of rough-fuzzy computing and multiresolution image analysis technique. The proposed method assumes that the major brain tissues, namely, gray matter, white matter, and cerebrospinal fluid from the MR images are considered to have different textural properties. The dyadic wavelet analysis is used to extract the scale-space feature vector for each pixel, while the rough-fuzzy clustering is used to address the uncertainty problem of brain MR image segmentation. An unsupervised feature selection method is introduced, based on maximum relevance-maximum significance criterion, to select relevant and significant textural features for segmentation problem, while the mathematical morphology based skull stripping preprocessing step is proposed to remove the non-cerebral tissues like skull. The performance of the proposed method, along with a comparison with related approaches, is demonstrated on a set of synthetic and real brain MR images using standard validity indices.

  3. Feature Selection for Motor Imagery EEG Classification Based on Firefly Algorithm and Learning Automata.

    Liu, Aiming; Chen, Kun; Liu, Quan; Ai, Qingsong; Xie, Yi; Chen, Anqi

    2017-11-08

    Motor Imagery (MI) electroencephalography (EEG) is widely studied for its non-invasiveness, easy availability, portability, and high temporal resolution. As for MI EEG signal processing, the high dimensions of features represent a research challenge. It is necessary to eliminate redundant features, which not only create an additional overhead of managing the space complexity, but also might include outliers, thereby reducing classification accuracy. The firefly algorithm (FA) can adaptively select the best subset of features, and improve classification accuracy. However, the FA is easily entrapped in a local optimum. To solve this problem, this paper proposes a method of combining the firefly algorithm and learning automata (LA) to optimize feature selection for motor imagery EEG. We employed a method of combining common spatial pattern (CSP) and local characteristic-scale decomposition (LCD) algorithms to obtain a high dimensional feature set, and classified it by using the spectral regression discriminant analysis (SRDA) classifier. Both the fourth brain-computer interface competition data and real-time data acquired in our designed experiments were used to verify the validation of the proposed method. Compared with genetic and adaptive weight particle swarm optimization algorithms, the experimental results show that our proposed method effectively eliminates redundant features, and improves the classification accuracy of MI EEG signals. In addition, a real-time brain-computer interface system was implemented to verify the feasibility of our proposed methods being applied in practical brain-computer interface systems.

  4. Feature Selection based on Machine Learning in MRIs for Hippocampal Segmentation

    Tangaro, Sabina; Amoroso, Nicola; Brescia, Massimo; Cavuoti, Stefano; Chincarini, Andrea; Errico, Rosangela; Paolo, Inglese; Longo, Giuseppe; Maglietta, Rosalia; Tateo, Andrea; Riccio, Giuseppe; Bellotti, Roberto

    2015-01-01

    Neurodegenerative diseases are frequently associated with structural changes in the brain. Magnetic resonance imaging (MRI) scans can show these variations and therefore can be used as a supportive feature for a number of neurodegenerative diseases. The hippocampus has been known to be a biomarker for Alzheimer disease and other neurological and psychiatric diseases. However, it requires accurate, robust, and reproducible delineation of hippocampal structures. Fully automatic methods usually follow a voxel-based approach: for each voxel, a number of local features are calculated. In this paper, we compared four different techniques for feature selection from a set of 315 features extracted for each voxel: (i) filter method based on the Kolmogorov-Smirnov test; two wrapper methods, respectively, (ii) sequential forward selection and (iii) sequential backward elimination; and (iv) embedded method based on the Random Forest Classifier on a set of 10 T1-weighted brain MRIs and tested on an independent set of 25 subjects. The resulting segmentations were compared with manual reference labelling. By using only 23 features for each voxel (sequential backward elimination) we obtained comparable state-of-the-art performances with respect to the standard tool FreeSurfer.
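
    Of the four techniques compared above, the wrapper method (ii), sequential forward selection, is the simplest to sketch. The version below is a generic illustration rather than the authors' implementation: the `score` callback stands in for whatever classifier-based evaluation a study would use, and the toy scoring function and its numbers are invented for the example.

```python
def sequential_forward_selection(n_features, score, k):
    """Greedily grow a feature subset until it holds k features.

    score(subset) returns a float (higher is better) for a tuple of
    feature indices; in a wrapper method it would typically be a
    cross-validated classifier accuracy.
    """
    selected = []
    remaining = list(range(n_features))
    while remaining and len(selected) < k:
        # Try each remaining feature and keep the one that helps most.
        best = max(remaining, key=lambda f: score(tuple(selected + [f])))
        selected.append(best)
        remaining.remove(best)
    return selected

# Hypothetical scoring function: each feature has a usefulness value,
# and features 1 and 2 are redundant with each other.
usefulness = [0.1, 0.9, 0.8, 0.3]

def toy_score(subset):
    total = sum(usefulness[f] for f in subset)
    if 1 in subset and 2 in subset:
        total -= 0.7  # penalty for carrying both redundant features
    return total

chosen = sequential_forward_selection(4, toy_score, 2)
```

    Sequential backward elimination follows the same pattern in reverse: it starts from the full set and repeatedly removes the feature whose removal hurts the score least.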

  5. A New Feature Selection Algorithm Based on the Mean Impact Variance

    Weidong Cheng

    2014-01-01

    The selection of fewer or more representative features from multidimensional features is important when the artificial neural network (ANN) algorithm is used as a classifier. In this paper, a new feature selection method called the mean impact variance (MIVAR) method is proposed to determine the feature that is more suitable for classification. Moreover, this method is constructed on the basis of the training process of the ANN algorithm. To verify the effectiveness of the proposed method, the MIVAR value is used to rank the multidimensional features of the bearing fault diagnosis. In detail, (1) 70-dimensional all waveform features are extracted from a rolling bearing vibration signal with four different operating states, (2) the corresponding MIVAR values of all 70-dimensional features are calculated to rank all features, (3) 14 groups of 10-dimensional features are separately generated according to the ranking results and the principal component analysis (PCA) algorithm and a back propagation (BP) network is constructed, and (4) the validity of the ranking result is proven by training this BP network with these seven groups of 10-dimensional features and by comparing the corresponding recognition rates. The results prove that the features with larger MIVAR value can lead to higher recognition rates.

  6. A Meta-Heuristic Regression-Based Feature Selection for Predictive Analytics

    Bharat Singh

    2014-11-01

    A high-dimensional feature selection having a very large number of features with an optimal feature subset is an NP-complete problem. Because conventional optimization techniques are unable to tackle large-scale feature selection problems, meta-heuristic algorithms are widely used. In this paper, we propose a particle swarm optimization technique while utilizing regression techniques for feature selection. We then use the selected features to classify the data. Classification accuracy is used as a criterion to evaluate classifier performance, and classification is accomplished through the use of k-nearest neighbour (KNN) and Bayesian techniques. Various high dimensional data sets are used to evaluate the usefulness of the proposed approach. Results show that our approach gives better results when compared with other conventional feature selection algorithms.

  7. Feature selection in wind speed prediction systems based on a hybrid coral reefs optimization – Extreme learning machine approach

    Salcedo-Sanz, S.; Pastor-Sánchez, A.; Prieto, L.; Blanco-Aguilera, A.; García-Herrera, R.

    2014-01-01

    Highlights: • A novel approach for short-term wind speed prediction is presented. • The system is formed by a coral reefs optimization algorithm and an extreme learning machine. • Feature selection is carried out with the CRO to improve the ELM performance. • The method is tested in real wind farm data in USA, for the period 2007–2008. - Abstract: This paper presents a novel approach for short-term wind speed prediction based on a Coral Reefs Optimization algorithm (CRO) and an Extreme Learning Machine (ELM), using meteorological predictive variables from a physical model (the Weather Research and Forecast model, WRF). The approach is based on a Feature Selection Problem (FSP) carried out with the CRO, that must obtain a reduced number of predictive variables out of the total available from the WRF. This set of features will be the input of an ELM, that finally provides the wind speed prediction. The CRO is a novel bio-inspired approach, based on the simulation of reef formation and coral reproduction, able to obtain excellent results in optimization problems. On the other hand, the ELM is a new paradigm in neural networks’ training, that provides a robust and extremely fast training of the network. Together, these algorithms are able to successfully solve this problem of feature selection in short-term wind speed prediction. Experiments in a real wind farm in the USA show the excellent performance of the CRO–ELM approach in this FSP wind speed prediction problem

  8. Feature Selection Has a Large Impact on One-Class Classification Accuracy for MicroRNAs in Plants.

    Yousef, Malik; Saçar Demirci, Müşerref Duygu; Khalifa, Waleed; Allmer, Jens

    2016-01-01

    MicroRNAs (miRNAs) are short RNA sequences involved in posttranscriptional gene regulation. Their experimental analysis is complicated and, therefore, needs to be supplemented with computational miRNA detection. Currently computational miRNA detection is mainly performed using machine learning and in particular two-class classification. For machine learning, the miRNAs need to be parametrized and more than 700 features have been described. Positive training examples for machine learning are readily available, but negative data is hard to come by. Therefore, it seems prerogative to use one-class classification instead of two-class classification. Previously, we were able to almost reach two-class classification accuracy using one-class classifiers. In this work, we employ feature selection procedures in conjunction with one-class classification and show that there is up to 36% difference in accuracy among these feature selection methods. The best feature set allowed the training of a one-class classifier which achieved an average accuracy of ~95.6% thereby outperforming previous two-class-based plant miRNA detection approaches by about 0.5%. We believe that this can be improved upon in the future by rigorous filtering of the positive training examples and by improving current feature clustering algorithms to better target pre-miRNA feature selection.

  9. Efficient Feature Selection and Classification of Protein Sequence Data in Bioinformatics

    Faye, Ibrahima; Samir, Brahim Belhaouari; Md Said, Abas

    2014-01-01

    Bioinformatics has been an emerging area of research for the last three decades. The ultimate aims of bioinformatics were to store and manage the biological data, and develop and analyze computational tools to enhance their understanding. The size of data accumulated under various sequencing projects is increasing exponentially, which presents difficulties for the experimental methods. To reduce the gap between newly sequenced protein and proteins with known functions, many computational techniques involving classification and clustering algorithms were proposed in the past. The classification of protein sequences into existing superfamilies is helpful in predicting the structure and function of large amount of newly discovered proteins. The existing classification results are unsatisfactory due to a huge size of features obtained through various feature encoding methods. In this work, a statistical metric-based feature selection technique has been proposed in order to reduce the size of the extracted feature vector. The proposed method of protein classification shows significant improvement in terms of performance measure metrics: accuracy, sensitivity, specificity, recall, F-measure, and so forth. PMID:25045727

  10. Feature Selection and Kernel Learning for Local Learning-Based Clustering.

    Zeng, Hong; Cheung, Yiu-ming

    2011-08-01

    The performance of most clustering algorithms relies heavily on the representation of data in the input space or the Hilbert space of kernel methods. This paper aims to obtain an appropriate data representation through feature selection or kernel learning within the framework of the Local Learning-Based Clustering (LLC) (Wu and Schölkopf 2006) method, which can outperform the global learning-based ones when dealing with high-dimensional data lying on a manifold. Specifically, we associate a weight to each feature or kernel and incorporate it into the built-in regularization of the LLC algorithm to take into account the relevance of each feature or kernel for the clustering. Accordingly, the weights are estimated iteratively in the clustering process. We show that the resulting weighted regularization with an additional constraint on the weights is equivalent to a known sparse-promoting penalty. Hence, the weights of those irrelevant features or kernels can be shrunk toward zero. Extensive experiments show the efficacy of the proposed methods on the benchmark data sets.

  11. A combined Fisher and Laplacian score for feature selection in QSAR based drug design using compounds with known and unknown activities.

    Valizade Hasanloei, Mohammad Amin; Sheikhpour, Razieh; Sarram, Mehdi Agha; Sheikhpour, Elnaz; Sharifi, Hamdollah

    2018-02-01

    Quantitative structure-activity relationship (QSAR) is an effective computational technique for drug design that relates the chemical structures of compounds to their biological activities. Feature selection is an important step in QSAR based drug design to select the most relevant descriptors. One of the most popular feature selection methods for classification problems is the Fisher score, whose aim is to minimize the within-class distance and maximize the between-class distance. In this study, the properties of the Fisher criterion were extended for QSAR models to define new distance metrics based on the continuous activity values of compounds with known activities. Then, a semi-supervised feature selection method was proposed based on the combination of Fisher and Laplacian criteria, which exploits compounds with both known and unknown activities to select the relevant descriptors. To demonstrate the efficiency of the proposed semi-supervised feature selection method in selecting the relevant descriptors, we applied the method and other feature selection methods to three QSAR data sets: serine/threonine-protein kinase PLK3 inhibitors, ROCK inhibitors and phenol compounds. The results demonstrated that the QSAR models built on the descriptors selected by the proposed semi-supervised method perform better than other models. This indicates the efficiency of the proposed method in selecting the relevant descriptors using the compounds with known and unknown activities. The results of this study showed that the compounds with known and unknown activities can be helpful to improve the performance of the combined Fisher and Laplacian based feature selection methods.
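
    For reference, the classical (supervised, classification) Fisher score that the record above builds on can be sketched as below. This is the textbook form, between-class scatter divided by within-class scatter for each feature, and not the extended or semi-supervised criterion proposed in the paper; the function name and toy data are illustrative assumptions.

```python
def fisher_scores(X, y):
    """Return one Fisher score per column of X (list of rows) given labels y.

    score_j = sum_c n_c * (mean_cj - mean_j)^2 / sum_c n_c * var_cj
    A larger score means the classes are better separated along feature j.
    """
    n_features = len(X[0])
    classes = sorted(set(y))
    scores = []
    for j in range(n_features):
        column = [row[j] for row in X]
        overall_mean = sum(column) / len(column)
        between = 0.0
        within = 0.0
        for c in classes:
            values = [v for v, label in zip(column, y) if label == c]
            class_mean = sum(values) / len(values)
            class_var = sum((v - class_mean) ** 2 for v in values) / len(values)
            between += len(values) * (class_mean - overall_mean) ** 2
            within += len(values) * class_var
        # A feature with zero within-class scatter is reported as infinite.
        scores.append(between / within if within > 0 else float("inf"))
    return scores

# Toy data: feature 0 separates the two classes, feature 1 barely does.
X = [[1.0, 5.0], [1.2, 1.0], [0.8, 3.0],
     [3.0, 4.0], [3.2, 2.0], [2.8, 6.0]]
y = [0, 0, 0, 1, 1, 1]
scores = fisher_scores(X, y)
```

    Ranking descriptors by this score and keeping the top few gives a simple univariate filter; it ignores redundancy between descriptors, which is one reason combined criteria are of interest.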

  12. Feature Selection as a Time and Cost-Saving Approach for Land Suitability Classification (Case Study of Shavur Plain, Iran)

    Saeid Hamzeh

    2016-10-01

    Land suitability classification is important in planning and managing sustainable land use. Most approaches to land suitability analysis combine a large number of land and soil parameters, and are time-consuming and costly. In this study, a potentially useful technique (a combined feature selection and fuzzy-AHP method) to increase the efficiency of land suitability analysis was presented. To this end, three different feature selection algorithms (random search, best search and genetic methods) were used to determine the most effective parameters for land suitability classification for the cultivation of barley in the Shavur Plain, southwest Iran. Next, land suitability classes were calculated for all methods by using the fuzzy-AHP approach. Salinity (electrical conductivity (EC)), alkalinity (exchangeable sodium percentage (ESP)), wetness and soil texture were selected using the random search method. Gypsum, EC, ESP, and soil texture were selected using both the best search and genetic methods. The results show a strong agreement between the standard fuzzy-AHP method and the methods presented in this study. The values of the Kappa coefficients were 0.82, 0.79 and 0.79 for the random search, best search and genetic methods, respectively, compared with the standard fuzzy-AHP method. Our results indicate that EC, ESP, soil texture and wetness are the most effective features for evaluating land suitability classification for the cultivation of barley in the study area, and that using these parameters, together with their appropriate weights as obtained from fuzzy-AHP, can give good results for land suitability classification. So, the combined feature selection and fuzzy-AHP approach presented here has the potential to save time and money for land suitability classification.

  13. Univariate and multivariate skewness and kurtosis for measuring nonnormality: Prevalence, influence and estimation.

    Cain, Meghan K; Zhang, Zhiyong; Yuan, Ke-Hai

    2017-10-01

    Nonnormality of univariate data has been extensively examined previously (Blanca et al., Methodology: European Journal of Research Methods for the Behavioral and Social Sciences, 9(2), 78-84, 2013; Micceri, Psychological Bulletin, 105(1), 156, 1989). However, less is known of the potential nonnormality of multivariate data although multivariate analysis is commonly used in psychological and educational research. Using univariate and multivariate skewness and kurtosis as measures of nonnormality, this study examined 1,567 univariate distributions and 254 multivariate distributions collected from authors of articles published in Psychological Science and the American Education Research Journal. We found that 74 % of univariate distributions and 68 % of multivariate distributions deviated from normal distributions. In a simulation study using typical values of skewness and kurtosis that we collected, we found that the resulting type I error rates were 17 % in a t-test and 30 % in a factor analysis under some conditions. Hence, we argue that it is time to routinely report skewness and kurtosis along with other summary statistics such as means and variances. To facilitate future report of skewness and kurtosis, we provide a tutorial on how to compute univariate and multivariate skewness and kurtosis by SAS, SPSS, R and a newly developed Web application.
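
    As a companion to the record above, univariate skewness and excess kurtosis can be computed from central moments as sketched here. This uses the plain moment estimators (dividing by n); SAS, SPSS, and R packages may apply bias corrections, so their output can differ slightly. The function name is an assumption made for the example, and multivariate (Mardia-type) measures are not covered.

```python
def skewness_kurtosis(data):
    """Return (skewness, excess kurtosis) using central moments divided by n.

    skewness        = m3 / m2**1.5
    excess kurtosis = m4 / m2**2 - 3   (0 for a normal distribution)
    """
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n
    m3 = sum((x - mean) ** 3 for x in data) / n
    m4 = sum((x - mean) ** 4 for x in data) / n
    if m2 == 0:
        raise ValueError("skewness and kurtosis are undefined for constant data")
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0

# A symmetric sample has zero skewness; a long right tail gives positive skewness.
symmetric = skewness_kurtosis([1, 2, 3, 4, 5])
right_tailed = skewness_kurtosis([1, 1, 1, 2, 10])
```

    For the symmetric sample the skewness is zero and the excess kurtosis is negative, reflecting a distribution flatter than the normal.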

  14. BLProt: Prediction of bioluminescent proteins based on support vector machine and relieff feature selection

    Kandaswamy, Krishna Kumar

    2011-08-17

    Background: Bioluminescence is a process in which light is emitted by a living organism. Most creatures that emit light are sea creatures, but some insects, plants, fungi etc, also emit light. The biotechnological application of bioluminescence has become routine and is considered essential for many medical and general technological advances. Identification of bioluminescent proteins is more challenging due to their poor similarity in sequence. So far, no specific method has been reported to identify bioluminescent proteins from primary sequence. Results: In this paper, we propose a novel predictive method that uses a Support Vector Machine (SVM) and physicochemical properties to predict bioluminescent proteins. BLProt was trained using a dataset consisting of 300 bioluminescent proteins and 300 non-bioluminescent proteins, and evaluated by an independent set of 141 bioluminescent proteins and 18202 non-bioluminescent proteins. To identify the most prominent features, we carried out feature selection with three different filter approaches, ReliefF, infogain, and mRMR. We selected five different feature subsets by decreasing the number of features, and the performance of each feature subset was evaluated. Conclusion: BLProt achieves 80% accuracy from training (5 fold cross-validations) and 80.06% accuracy from testing. The performance of BLProt was compared with BLAST and HMM. High prediction accuracy and successful prediction of hypothetical proteins suggests that BLProt can be a useful approach to identify bioluminescent proteins from sequence information, irrespective of their sequence similarity. 2011 Kandaswamy et al; licensee BioMed Central Ltd.

  15. Multiclass Classification of Cardiac Arrhythmia Using Improved Feature Selection and SVM Invariants.

    Mustaqeem, Anam; Anwar, Syed Muhammad; Majid, Muahammad

    2018-01-01

    Arrhythmia is considered a life-threatening disease causing serious health issues in patients, when left untreated. An early diagnosis of arrhythmias would be helpful in saving lives. This study is conducted to classify patients into one of the sixteen subclasses, among which one class represents absence of disease and the other fifteen classes represent electrocardiogram records of various subtypes of arrhythmias. The research is carried out on the dataset taken from the University of California at Irvine Machine Learning Data Repository. The dataset contains a large volume of feature dimensions which are reduced using wrapper based feature selection technique. For multiclass classification, support vector machine (SVM) based approaches including one-against-one (OAO), one-against-all (OAA), and error-correction code (ECC) are employed to detect the presence and absence of arrhythmias. The SVM method results are compared with other standard machine learning classifiers using varying parameters and the performance of the classifiers is evaluated using accuracy, kappa statistics, and root mean square error. The results show that OAO method of SVM outperforms all other classifiers by achieving an accuracy rate of 81.11% when used with 80/20 data split and 92.07% using 90/10 data split option.

  16. Raman spectral feature selection using ant colony optimization for breast cancer diagnosis.

    Fallahzadeh, Omid; Dehghani-Bidgoli, Zohreh; Assarian, Mohammad

    2018-06-04

    Pathology as a common diagnostic test of cancer is an invasive, time-consuming, and partially subjective method. Therefore, optical techniques, especially Raman spectroscopy, have attracted the attention of cancer diagnosis researchers. However, as Raman spectra contain numerous peaks involved in molecular bonds of the sample, finding the best features related to cancerous changes can improve the accuracy of diagnosis in this method. The present research attempted to improve the power of Raman-based cancer diagnosis by finding the best Raman features using the ant colony optimization (ACO) algorithm. In the present research, 49 spectra were measured from normal, benign, and cancerous breast tissue samples using a 785-nm micro-Raman system. After preprocessing for removal of noise and background fluorescence, the intensity of 12 important Raman bands of the biological samples was extracted as features of each spectrum. Then, the ACO algorithm was applied to find the optimum features for diagnosis. As the results demonstrated, by selecting five features, the classification accuracy of the normal, benign, and cancerous groups increased by 14% and reached 87.7%. ACO feature selection can improve the diagnostic accuracy of Raman-based diagnostic models. In the present study, features corresponding to ν(C-C) αhelix proline, valine (910-940), νs(C-C) skeletal lipids (1110-1130), and δ(CH2)/δ(CH3) proteins (1445-1460) were selected as the best features in cancer diagnosis.

  17. BLProt: Prediction of bioluminescent proteins based on support vector machine and relieff feature selection

    Kandaswamy, Krishna Kumar; Pugalenthi, Ganesan; Hazrati, Mehrnaz Khodam; Kalies, Kai-Uwe; Martinetz, Thomas

    2011-01-01

    Background: Bioluminescence is a process in which light is emitted by a living organism. Most creatures that emit light are sea creatures, but some insects, plants, fungi etc, also emit light. The biotechnological application of bioluminescence has become routine and is considered essential for many medical and general technological advances. Identification of bioluminescent proteins is more challenging due to their poor similarity in sequence. So far, no specific method has been reported to identify bioluminescent proteins from primary sequence. Results: In this paper, we propose a novel predictive method that uses a Support Vector Machine (SVM) and physicochemical properties to predict bioluminescent proteins. BLProt was trained using a dataset consisting of 300 bioluminescent proteins and 300 non-bioluminescent proteins, and evaluated by an independent set of 141 bioluminescent proteins and 18202 non-bioluminescent proteins. To identify the most prominent features, we carried out feature selection with three different filter approaches, ReliefF, infogain, and mRMR. We selected five different feature subsets by decreasing the number of features, and the performance of each feature subset was evaluated. Conclusion: BLProt achieves 80% accuracy from training (5 fold cross-validations) and 80.06% accuracy from testing. The performance of BLProt was compared with BLAST and HMM. High prediction accuracy and successful prediction of hypothetical proteins suggests that BLProt can be a useful approach to identify bioluminescent proteins from sequence information, irrespective of their sequence similarity. 2011 Kandaswamy et al; licensee BioMed Central Ltd.

  18. New Riemannian Priors on the Univariate Normal Model

    Salem Said

    2014-07-01

    The current paper introduces new prior distributions on the univariate normal model, with the aim of applying them to the classification of univariate normal populations. These new prior distributions are entirely based on the Riemannian geometry of the univariate normal model, so that they can be thought of as “Riemannian priors”. Precisely, if $\{p_\theta ; \theta \in \Theta\}$ is any parametrization of the univariate normal model, the paper considers prior distributions $G(\bar{\theta}, \gamma)$ with hyperparameters $\bar{\theta} \in \Theta$ and $\gamma > 0$, whose density with respect to Riemannian volume is proportional to $\exp(-d^2(\theta, \bar{\theta})/2\gamma^2)$, where $d^2(\theta, \bar{\theta})$ is the square of Rao’s Riemannian distance. The distributions $G(\bar{\theta}, \gamma)$ are termed Gaussian distributions on the univariate normal model. The motivation for considering a distribution $G(\bar{\theta}, \gamma)$ is that this distribution gives a geometric representation of a class or cluster of univariate normal populations. Indeed, $G(\bar{\theta}, \gamma)$ has a unique mode $\bar{\theta}$ (precisely, $\bar{\theta}$ is the unique Riemannian center of mass of $G(\bar{\theta}, \gamma)$, as shown in the paper), and its dispersion away from $\bar{\theta}$ is given by $\gamma$. Therefore, one thinks of members of the class represented by $G(\bar{\theta}, \gamma)$ as being centered around $\bar{\theta}$ and lying within a typical distance determined by $\gamma$. The paper defines rigorously the Gaussian distributions $G(\bar{\theta}, \gamma)$ and describes an algorithm for computing maximum likelihood estimates of their hyperparameters. Based on this algorithm and on the Laplace approximation, it describes how the distributions $G(\bar{\theta}, \gamma)$ can be used as prior distributions for Bayesian classification of large univariate normal populations. In a concrete application to texture image classification, it is shown that this leads to an improvement in performance over the use of conjugate priors.

  19. Fisher Information Based Meteorological Factors Introduction and Features Selection for Short-Term Load Forecasting

    Shuping Cai

    2018-03-01

    Weather information is an important factor in short-term load forecasting (STLF). However, for a long time, more importance has always been attached to forecasting models instead of other processes such as the introduction of weather factors or feature selection for STLF. The main aim of this paper is to develop a novel methodology based on Fisher information for meteorological variables introduction and variable selection in STLF. Fisher information computation for one-dimensional and multidimensional weather variables is first described, and then the introduction of meteorological factors and variables selection for STLF models are discussed in detail. On this basis, different forecasting models with the proposed methodology are established. The proposed methodology is implemented on real data obtained from Electric Power Utility of Zhenjiang, Jiangsu Province, in southeast China. The results show the advantages of the proposed methodology in comparison with other traditional ones regarding prediction accuracy, and it has very good practical significance. Therefore, it can be used as a unified method for introducing weather variables into STLF models, and selecting their features.

  20. Feature Selection and Blind Source Separation in an EEG-Based Brain-Computer Interface

    Michael H. Thaut

    2005-11-01

    Most EEG-based BCI systems make use of well-studied patterns of brain activity. However, those systems involve tasks that indirectly map to simple binary commands such as “yes” or “no” or require many weeks of biofeedback training. We hypothesized that signal processing and machine learning methods can be used to discriminate EEG in a direct “yes”/“no” BCI from a single session. Blind source separation (BSS) and spectral transformations of the EEG produced a 180-dimensional feature space. We used a modified genetic algorithm (GA) wrapped around a support vector machine (SVM) classifier to search the space of feature subsets. The GA-based search found feature subsets that outperform full feature sets and random feature subsets. Also, BSS transformations of the EEG outperformed the original time series, particularly in conjunction with a subset search of both spaces. The results suggest that BSS and feature selection can be used to improve the performance of even a “direct,” single-session BCI.
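
    The wrapper idea in this record, a genetic algorithm searching over feature subsets scored by a classifier, can be sketched roughly as follows. This is a plain GA, not the authors' modified one. A leave-one-out 1-nearest-neighbour classifier stands in for the SVM, and the size penalty, population settings, and toy data are all assumptions made for the example.

```python
import random


def loo_1nn_accuracy(X, y, features):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier on the given feature indices."""
    correct = 0
    for i in range(len(X)):
        best_dist, best_label = None, None
        for k in range(len(X)):
            if k == i:
                continue
            dist = sum((X[i][j] - X[k][j]) ** 2 for j in features)
            if best_dist is None or dist < best_dist:
                best_dist, best_label = dist, y[k]
        correct += best_label == y[i]
    return correct / len(X)


def fitness(X, y, mask):
    """Wrapper fitness: accuracy on the masked features minus a small per-feature penalty."""
    features = [j for j, bit in enumerate(mask) if bit]
    if not features:
        return 0.0
    return loo_1nn_accuracy(X, y, features) - 0.01 * len(features)


def ga_select(X, y, pop_size=8, generations=10, mutation_rate=0.2, seed=0):
    """Evolve 0/1 feature masks; needs at least two features."""
    rng = random.Random(seed)
    d = len(X[0])
    population = [[rng.randint(0, 1) for _ in range(d)] for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population, key=lambda m: fitness(X, y, m), reverse=True)
        parents = ranked[: pop_size // 2]  # truncation selection keeps the best masks
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randint(1, d - 1)  # one-point crossover
            child = a[:cut] + b[cut:]
            child = [bit ^ 1 if rng.random() < mutation_rate else bit for bit in child]
            children.append(child)
        population = parents + children
    return max(population, key=lambda m: fitness(X, y, m))


# Toy data (made up): only feature 2 separates the two classes.
X = [[0.1, 7.0, 1.0], [0.9, 2.0, 1.1], [0.5, 5.0, 0.9],
     [0.2, 6.0, 3.0], [0.8, 1.0, 3.2], [0.4, 4.0, 2.9]]
y = [0, 0, 0, 1, 1, 1]
best_mask = ga_select(X, y)
```

    Because the parents survive each generation unchanged, the best fitness in the population never decreases. In a real wrapper the fitness call would train and cross-validate the actual classifier, which is where most of the cost lies.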

  1. Prediction of Protein Structural Class Based on Gapped-Dipeptides and a Recursive Feature Selection Approach

    Taigang Liu

    2015-12-01

    The prior knowledge of protein structural class may offer useful clues on understanding its functionality as well as its tertiary structure. Though various significant efforts have been made to find a fast and effective computational approach to address this problem, it is still a challenging topic in the field of bioinformatics. The position-specific score matrix (PSSM) profile has been shown to provide a useful source of information for improving the prediction performance of protein structural class. However, this information has not been adequately explored. To this end, in this study, we present a feature extraction technique which is based on gapped-dipeptides composition computed directly from PSSM. Then, a careful feature selection technique is performed based on support vector machine-recursive feature elimination (SVM-RFE). These optimal features are selected to construct a final predictor. The results of jackknife tests on four working datasets show that our method obtains satisfactory prediction accuracies by extracting features solely based on PSSM and could serve as a very promising tool to predict protein structural class.
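
    The recursive elimination loop behind SVM-RFE can be sketched as below: train a linear SVM, drop the feature with the smallest squared weight, and repeat. This is a minimal illustration under stated assumptions, not the authors' implementation. The subgradient-descent trainer, its hyperparameters, and the toy data are hypothetical stand-ins for a proper SVM solver and real PSSM-derived features.

```python
def train_linear_svm(X, y, features, epochs=200, lr=0.01, reg=0.01):
    """Fit weights by subgradient descent on the L2-regularised hinge loss.
    Labels must be -1 or +1. Returns a dict mapping feature index to weight."""
    w = {j: 0.0 for j in features}
    b = 0.0
    for _ in range(epochs):
        for row, label in zip(X, y):
            margin = label * (sum(w[j] * row[j] for j in features) + b)
            for j in features:
                grad = reg * w[j] - (label * row[j] if margin < 1 else 0.0)
                w[j] -= lr * grad
            if margin < 1:
                b += lr * label
    return w


def svm_rfe(X, y, n_keep):
    """Repeatedly retrain, then remove the feature with the smallest squared weight."""
    features = list(range(len(X[0])))
    eliminated = []
    while len(features) > n_keep:
        w = train_linear_svm(X, y, features)
        worst = min(features, key=lambda j: w[j] ** 2)
        features.remove(worst)
        eliminated.append(worst)
    return features, eliminated


# Toy data (made up): feature 0 is informative, feature 1 is constant,
# feature 2 is balanced across the classes.
X = [[2.0, 0.0, 0.5], [1.5, 0.0, -0.5], [-2.0, 0.0, 0.5], [-1.5, 0.0, -0.5]]
y = [1, 1, -1, -1]
kept, eliminated = svm_rfe(X, y, n_keep=1)
```

    Retraining after every removal is what distinguishes RFE from a one-shot ranking: the weights of the surviving features can change once a correlated feature is gone.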

  2. DWFS: A Wrapper Feature Selection Tool Based on a Parallel Genetic Algorithm

    Soufan, Othman

    2015-02-26

    Many scientific problems can be formulated as classification tasks. Data that harbor relevant information are usually described by a large number of features. Frequently, many of these features are irrelevant for the class prediction. The efficient implementation of classification models requires identification of suitable combinations of features. The smaller number of features reduces the problem's dimensionality and may result in higher classification performance. We developed DWFS, a web-based tool that allows for efficient selection of features for a variety of problems. DWFS follows the wrapper paradigm and applies a search strategy based on Genetic Algorithms (GAs). A parallel GA implementation examines and evaluates simultaneously a large number of candidate collections of features. DWFS also integrates various filtering methods that may be applied as a pre-processing step in the feature selection process. Furthermore, weights and parameters in the fitness function of GA can be adjusted according to the application requirements. Experiments using heterogeneous datasets from different biomedical applications demonstrate that DWFS is fast and leads to a significant reduction of the number of features without sacrificing performance as compared to several widely used existing methods. DWFS can be accessed online at www.cbrc.kaust.edu.sa/dwfs.

  3. Feature Selection and Predictors of Falls with Foot Force Sensors Using KNN-Based Algorithms

    Shengyun Liang

    2015-11-01

    The aging process may lead to the degradation of lower extremity function in the elderly population, which can restrict their daily quality of life and gradually increase the fall risk. We aimed to determine whether objective measures of physical function could predict subsequent falls. Ground reaction force (GRF) data, which was quantified by sample entropy, was collected by foot force sensors. Thirty-eight subjects (23 fallers and 15 non-fallers) participated in functional movement tests, including walking and sit-to-stand (STS). A feature selection algorithm was used to select relevant features to classify the elderly into two groups: at risk and not at risk of falling down, for three KNN-based classifiers: local mean-based k-nearest neighbor (LMKNN), pseudo nearest neighbor (PNN), and local mean pseudo nearest neighbor (LMPNN) classification. We compared classification performances, and achieved the best results with LMPNN, with sensitivity, specificity and accuracy all 100%. Moreover, a subset of GRFs was significantly different between the two groups via Wilcoxon rank sum test, which is compatible with the classification results. This method could potentially be used by non-experts to monitor balance and the risk of falling down in the elderly population.

  4. DWFS: A Wrapper Feature Selection Tool Based on a Parallel Genetic Algorithm

    Soufan, Othman; Kleftogiannis, Dimitrios A.; Kalnis, Panos; Bajic, Vladimir B.

    2015-01-01

    Many scientific problems can be formulated as classification tasks. Data that harbor relevant information are usually described by a large number of features. Frequently, many of these features are irrelevant for the class prediction. The efficient implementation of classification models requires identification of suitable combinations of features. The smaller number of features reduces the problem's dimensionality and may result in higher classification performance. We developed DWFS, a web-based tool that allows for efficient selection of features for a variety of problems. DWFS follows the wrapper paradigm and applies a search strategy based on Genetic Algorithms (GAs). A parallel GA implementation examines and evaluates simultaneously a large number of candidate collections of features. DWFS also integrates various filtering methods that may be applied as a pre-processing step in the feature selection process. Furthermore, weights and parameters in the fitness function of GA can be adjusted according to the application requirements. Experiments using heterogeneous datasets from different biomedical applications demonstrate that DWFS is fast and leads to a significant reduction of the number of features without sacrificing performance as compared to several widely used existing methods. DWFS can be accessed online at www.cbrc.kaust.edu.sa/dwfs.

  5. Cost-Sensitive Feature Selection of Numeric Data with Measurement Errors

    Hong Zhao

    2013-01-01

    Feature selection is an essential process in data mining applications since it reduces a model’s complexity. However, feature selection with various types of costs is still a new research topic. In this paper, we study the cost-sensitive feature selection problem of numeric data with measurement errors. The major contributions of this paper are fourfold. First, a new data model is built to address test costs and misclassification costs as well as error boundaries. It is distinguished from the existing models mainly by the error boundaries. Second, a covering-based rough set model with normal distribution measurement errors is constructed. With this model, coverings are constructed from data rather than assigned by users. Third, a new cost-sensitive feature selection problem is defined on this model. It is more realistic than the existing feature selection problems. Fourth, both backtracking and heuristic algorithms are proposed to deal with the new problem. Experimental results show the efficiency of the pruning techniques for the backtracking algorithm and the effectiveness of the heuristic algorithm. This study is a step toward realistic applications of cost-sensitive learning.

  6. A study of metaheuristic algorithms for high dimensional feature selection on microarray data

    Dankolo, Muhammad Nasiru; Radzi, Nor Haizan Mohamed; Sallehuddin, Roselina; Mustaffa, Noorfa Haszlinna

    2017-11-01

    Microarray systems enable experts to examine gene profiles at the molecular level using machine learning algorithms, increasing the potential for classification and diagnosis of many diseases at the gene expression level. However, numerous difficulties may affect the efficiency of machine learning algorithms, including the vast number of gene features in the original data. Many of these features may be unrelated to the intended analysis. Therefore, feature selection needs to be performed during data pre-processing. Many feature selection algorithms have been developed and applied to microarray data, including metaheuristic optimization algorithms. This paper discusses the application of metaheuristic algorithms for feature selection on microarray datasets. The study reveals that the algorithms have yielded interesting results with limited resources, thereby saving the computational expense of machine learning algorithms.

  7. Analysis of Different Feature Selection Criteria Based on a Covariance Convergence Perspective for a SLAM Algorithm

    Auat Cheein, Fernando A.; Carelli, Ricardo

    2011-01-01

    This paper introduces several non-arbitrary feature selection techniques for a Simultaneous Localization and Mapping (SLAM) algorithm. The feature selection criteria are based on the determination of the most significant features from a SLAM convergence perspective. The SLAM algorithm implemented in this work is a sequential EKF (Extended Kalman filter) SLAM. The feature selection criteria are applied on the correction stage of the SLAM algorithm, restricting it to correct the SLAM algorithm with the most significant features. This restriction also decreases the processing time of the SLAM. Several experiments with a mobile robot are shown in this work. The experiments concern the map reconstruction and a comparison of the performance of the different proposed techniques. The experiments were carried out in an outdoor environment composed of trees, although the results shown herein are not restricted to a special type of feature. PMID:22346568

  8. Evaluation of droplet size distributions using univariate and multivariate approaches

    Gauno, M.H.; Larsen, C.C.; Vilhelmsen, T.

    2013-01-01

    of the distribution. The current study was aiming to compare univariate and multivariate approach in evaluating droplet size distributions. As a model system, the atomization of a coating solution from a two-fluid nozzle was investigated. The effect of three process parameters (concentration of ethyl cellulose...... in ethanol, atomizing air pressure, and flow rate of coating solution) on the droplet size and droplet size distribution using a full mixed factorial design was used. The droplet size produced by a two-fluid nozzle was measured by laser diffraction and reported as volume based size distribution....... Investigation of loading and score plots from principal component analysis (PCA) revealed additional information on the droplet size distributions and it was possible to identify univariate statistics (volume median droplet size), which were similar, however, originating from varying droplet size distributions...

  9. Univariate decision tree induction using maximum margin classification

    Yıldız, Olcay Taner

    2012-01-01

    In many pattern recognition applications, first decision trees are used due to their simplicity and easily interpretable nature. In this paper, we propose a new decision tree learning algorithm called univariate margin tree where, for each continuous attribute, the best split is found using convex optimization. Our simulation results on 47 data sets show that the novel margin tree classifier performs at least as good as C4.5 and linear discriminant tree (LDT) with a similar time complexity. F...

  10. Acceleration techniques in the univariate Lipschitz global optimization

    Sergeyev, Yaroslav D.; Kvasov, Dmitri E.; Mukhametzhanov, Marat S.; De Franco, Angela

    2016-10-01

    Univariate box-constrained Lipschitz global optimization problems are considered in this contribution. Geometric and information statistical approaches are presented. The novel powerful local tuning and local improvement techniques are described in the contribution as well as the traditional ways to estimate the Lipschitz constant. The advantages of the presented local tuning and local improvement techniques are demonstrated using the operational characteristics approach for comparing deterministic global optimization algorithms on the class of 100 widely used test functions.

  11. Evaluation of droplet size distributions using univariate and multivariate approaches.

    Gaunø, Mette Høg; Larsen, Crilles Casper; Vilhelmsen, Thomas; Møller-Sonnergaard, Jørn; Wittendorff, Jørgen; Rantanen, Jukka

    2013-01-01

    Pharmaceutically relevant material characteristics are often analyzed based on univariate descriptors instead of utilizing the whole information available in the full distribution. One example is droplet size distribution, which is often described by the median droplet size and the width of the distribution. The current study was aiming to compare univariate and multivariate approach in evaluating droplet size distributions. As a model system, the atomization of a coating solution from a two-fluid nozzle was investigated. The effect of three process parameters (concentration of ethyl cellulose in ethanol, atomizing air pressure, and flow rate of coating solution) on the droplet size and droplet size distribution using a full mixed factorial design was used. The droplet size produced by a two-fluid nozzle was measured by laser diffraction and reported as volume based size distribution. Investigation of loading and score plots from principal component analysis (PCA) revealed additional information on the droplet size distributions and it was possible to identify univariate statistics (volume median droplet size), which were similar, however, originating from varying droplet size distributions. The multivariate data analysis was proven to be an efficient tool for evaluating the full information contained in a distribution.

  12. Gene expression network reconstruction by convex feature selection when incorporating genetic perturbations.

    Benjamin A Logsdon

    Cellular gene expression measurements contain regulatory information that can be used to discover novel network relationships. Here, we present a new algorithm for network reconstruction powered by the adaptive lasso, a theoretically and empirically well-behaved method for selecting the regulatory features of a network. Any algorithms designed for network discovery that make use of directed probabilistic graphs require perturbations, produced by either experiments or naturally occurring genetic variation, to successfully infer unique regulatory relationships from gene expression data. Our approach makes use of appropriately selected cis-expression Quantitative Trait Loci (cis-eQTL), which provide a sufficient set of independent perturbations for maximum network resolution. We compare the performance of our network reconstruction algorithm to four other approaches: the PC-algorithm, QTLnet, the QDG algorithm, and the NEO algorithm, all of which have been used to reconstruct directed networks among phenotypes leveraging QTL. We show that the adaptive lasso can outperform these algorithms for networks of ten genes and ten cis-eQTL, and is competitive with the QDG algorithm for networks with thirty genes and thirty cis-eQTL, with rich topologies and hundreds of samples. Using this novel approach, we identify unique sets of directed relationships in Saccharomyces cerevisiae when analyzing genome-wide gene expression data for an intercross between a wild strain and a lab strain. We recover novel putative network relationships between a tyrosine biosynthesis gene (TYR1), and genes involved in endocytosis (RCY1), the spindle checkpoint (BUB2), sulfonate catabolism (JLP1), and cell-cell communication (PRM7). Our algorithm provides a synthesis of feature selection methods and graphical model theory that has the potential to reveal new directed regulatory relationships from the analysis of population level genetic and gene expression data.

  13. Mining for diagnostic information in body surface potential maps: A comparison of feature selection techniques

    McCullagh Paul J

    2005-09-01

    Background: In body surface potential mapping, increased spatial sampling is used to allow more accurate detection of a cardiac abnormality. Although diagnostically superior to more conventional electrocardiographic techniques, the perceived complexity of the Body Surface Potential Map (BSPM) acquisition process has prohibited its acceptance in clinical practice. For this reason there is an interest in striking a compromise between the minimum number of electrocardiographic recording sites required to sample the maximum electrocardiographic information. Methods: In the current study, several techniques widely used in the domains of data mining and knowledge discovery have been employed to mine for diagnostic information in 192 lead BSPMs. In particular, the Single Variable Classifier (SVC) based filter and Sequential Forward Selection (SFS) based wrapper approaches to feature selection have been implemented and evaluated. Using a set of recordings from 116 subjects, the diagnostic ability of subsets of 3, 6, 9, 12, 24 and 32 electrocardiographic recording sites has been evaluated based on their ability to correctly assess the presence or absence of Myocardial Infarction (MI). Results: It was observed that the wrapper approach, using sequential forward selection and a 5 nearest neighbour classifier, was capable of choosing a set of 24 recording sites that could correctly classify 82.8% of BSPMs. Although the filter method performed slightly less favourably, the performance was comparable with a classification accuracy of 79.3%. In addition, experiments were conducted to show how (a) features chosen using the wrapper approach were specific to the classifier used in the selection model, and (b) lead subsets chosen were not necessarily unique. Conclusion: It was concluded that both the filter and wrapper approaches adopted were suitable for guiding the choice of recording sites useful for determining the presence of MI. It should be noted however

  14. Evaluation of feature selection algorithms for classification in temporal lobe epilepsy based on MR images

    Lai, Chunren; Guo, Shengwen; Cheng, Lina; Wang, Wensheng; Wu, Kai

    2017-02-01

    It is very important to differentiate temporal lobe epilepsy (TLE) patients from healthy people and to localize the abnormal brain regions of TLE patients. Cortical features and their changes can reveal the unique anatomical patterns of brain regions from structural MR images. In this study, structural MR images from 28 normal controls (NC), 18 left TLE (LTLE), and 21 right TLE (RTLE) were acquired, and four types of cortical feature, namely cortical thickness (CTh), cortical surface area (CSA), gray matter volume (GMV), and mean curvature (MCu), were explored for discriminative analysis. Three feature selection methods, the independent sample t-test filtering, the sparse-constrained dimensionality reduction model (SCDRM), and the support vector machine-recursive feature elimination (SVM-RFE), were investigated to extract dominant regions with significant differences among the compared groups for classification using the SVM classifier. The results showed that the SVM-RFE achieved the highest performance (most classifications with more than 92% accuracy), followed by the SCDRM and the t-test. In particular, the surface area and gray matter volume exhibited prominent discriminative ability, and the performance of the SVM was improved significantly when the four cortical features were combined. Additionally, the dominant regions with higher classification weights were mainly located in the temporal and frontal lobes, including the inferior temporal, entorhinal cortex, fusiform, parahippocampal cortex, middle frontal and frontal pole. It was demonstrated that the cortical features provided effective information to determine the abnormal anatomical pattern and that the proposed method has the potential to improve the clinical diagnosis of TLE.
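
    The simplest of the three methods named above, independent-sample t-test filtering, is a univariate filter: each feature is scored on its own by how strongly it differs between two groups. The sketch below ranks features by the absolute Welch t-statistic. It is an illustration only; the function names, the choice of the Welch form, and the toy data are assumptions, and it presumes every feature has non-zero variance in at least one group.

```python
import math
import statistics


def welch_t(a, b):
    """Welch's t-statistic for two independent samples."""
    va, vb = statistics.variance(a), statistics.variance(b)
    se = math.sqrt(va / len(a) + vb / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / se


def rank_features(X, y):
    """Return feature indices sorted by decreasing |t| between class 0 and class 1."""
    n_features = len(X[0])
    scores = []
    for j in range(n_features):
        g0 = [row[j] for row, label in zip(X, y) if label == 0]
        g1 = [row[j] for row, label in zip(X, y) if label == 1]
        scores.append(abs(welch_t(g0, g1)))
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)


# Toy data (made up): feature 0 separates the classes, feature 1 is noise.
X = [[1.0, 5.0], [1.2, 3.0], [0.9, 4.0], [3.0, 4.5], [3.1, 3.5], [2.9, 4.2]]
y = [0, 0, 0, 1, 1, 1]
ranking = rank_features(X, y)
```

    A filter like this is cheap and classifier-independent, but it scores features one at a time and so cannot account for redundancy or interactions between them.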

  15. Classification of Pap Smear Cell Nuclei Based on Texture Analysis Using Correlation-Based Feature Selection with the C4.5 Algorithm

    Toni Arifin

    2014-09-01

    Abstract - The Pap smear is an early examination used to diagnose whether there is any indication of cervical cancer; the observation is carried out by examining Pap smear cells under a microscope. Much research has been done to differentiate between normal and abnormal cells. This research presents a classification of Pap smear cell nuclei based on texture analysis. It uses the Harlev image set of 280 images, of which 140 are used as training data and the other 140 as testing data. The texture analysis uses the Gray Level Co-occurrence Matrix (GLCM) method with 5 parameters: correlation, energy, homogeneity, and entropy, plus a computed brightness value. The correlation-based feature selection method is used to choose the best attributes, and the C4.5 algorithm is then used to produce the classification rules. The accuracy of classifying normal and abnormal cells with the C4.5 decision tree is 96.43%, with a prediction error of 3.57%. Keywords: classification, Pap smear cell image, texture analysis, correlation-based feature selection, C4.5 algorithm.

  16. Feature selection for wearable smartphone-based human activity recognition with able bodied, elderly, and stroke patients.

    Nicole A Capela

    Human activity recognition (HAR), using wearable sensors, is a growing area with the potential to provide valuable information on patient mobility to rehabilitation specialists. Smartphones with accelerometer and gyroscope sensors are a convenient, minimally invasive, and low cost approach for mobility monitoring. HAR systems typically pre-process raw signals, segment the signals, and then extract features to be used in a classifier. Feature selection is a crucial step in the process to reduce potentially large data dimensionality and provide viable parameters to enable activity classification. Most HAR systems are customized to an individual research group, including a unique data set, classes, algorithms, and signal features. These data sets are obtained predominantly from able-bodied participants. In this paper, smartphone accelerometer and gyroscope sensor data were collected from populations that can benefit from human activity recognition: able-bodied, elderly, and stroke patients. Data from a consecutive sequence of 41 mobility tasks (18 different tasks) were collected for a total of 44 participants. Seventy-six signal features were calculated and subsets of these features were selected using three filter-based, classifier-independent, feature selection methods (Relief-F, Correlation-based Feature Selection, Fast Correlation Based Filter). The feature subsets were then evaluated using three generic classifiers (Naïve Bayes, Support Vector Machine, j48 Decision Tree). Common features were identified for all three populations, although the stroke population subset had some differences from both able-bodied and elderly sets. Evaluation with the three classifiers showed that the feature subsets produced similar or better accuracies than classification with the entire feature set. Therefore, since these feature subsets are classifier-independent, they should be useful for developing and improving HAR systems across and within populations.

  17. Feature selection for wearable smartphone-based human activity recognition with able bodied, elderly, and stroke patients.

    Capela, Nicole A; Lemaire, Edward D; Baddour, Natalie

    2015-01-01

    Human activity recognition (HAR), using wearable sensors, is a growing area with the potential to provide valuable information on patient mobility to rehabilitation specialists. Smartphones with accelerometer and gyroscope sensors are a convenient, minimally invasive, and low cost approach for mobility monitoring. HAR systems typically pre-process raw signals, segment the signals, and then extract features to be used in a classifier. Feature selection is a crucial step in the process to reduce potentially large data dimensionality and provide viable parameters to enable activity classification. Most HAR systems are customized to an individual research group, including a unique data set, classes, algorithms, and signal features. These data sets are obtained predominantly from able-bodied participants. In this paper, smartphone accelerometer and gyroscope sensor data were collected from populations that can benefit from human activity recognition: able-bodied, elderly, and stroke patients. Data from a consecutive sequence of 41 mobility tasks (18 different tasks) were collected for a total of 44 participants. Seventy-six signal features were calculated and subsets of these features were selected using three filter-based, classifier-independent, feature selection methods (Relief-F, Correlation-based Feature Selection, Fast Correlation Based Filter). The feature subsets were then evaluated using three generic classifiers (Naïve Bayes, Support Vector Machine, j48 Decision Tree). Common features were identified for all three populations, although the stroke population subset had some differences from both able-bodied and elderly sets. Evaluation with the three classifiers showed that the feature subsets produced similar or better accuracies than classification with the entire feature set. Therefore, since these feature subsets are classifier-independent, they should be useful for developing and improving HAR systems across and within populations.

  18. Automatic Target Recognition: Statistical Feature Selection of Non-Gaussian Distributed Target Classes

    2011-06-01

    implementing, and evaluating many feature selection algorithms. Mucciardi and Gose compared seven different techniques for choosing subsets of pattern...122 THIS PAGE INTENTIONALLY LEFT BLANK 123 LIST OF REFERENCES [1] A. Mucciardi and E. Gose , “A comparison of seven techniques for

  19. Multi-Stage Recognition of Speech Emotion Using Sequential Forward Feature Selection

    Liogienė Tatjana

    2016-07-01

    The intensive research of speech emotion recognition introduced a huge collection of speech emotion features. Large feature sets complicate the speech emotion recognition task. Among various feature selection and transformation techniques for one-stage classification, multiple classifier systems were proposed. The main idea of multiple classifiers is to arrange the emotion classification process in stages. Besides parallel and serial cases, the hierarchical arrangement of multi-stage classification is most widely used for speech emotion recognition. In this paper, we present a sequential-forward-feature-selection-based multi-stage classification scheme. The Sequential Forward Selection (SFS) and Sequential Floating Forward Selection (SFFS) techniques were employed for every stage of the multi-stage classification scheme. Experimental testing of the proposed scheme was performed using the German and Lithuanian emotional speech datasets. Sequential-feature-selection-based multi-stage classification outperformed the single-stage scheme by 12–42% for different emotion sets. The multi-stage scheme has shown higher robustness to the growth of the emotion set. The decrease in recognition rate with the increase in emotion set for the multi-stage scheme was lower by 10–20% in comparison with the single-stage case. Differences in SFS and SFFS employment for feature selection were negligible.
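
    Sequential Forward Selection, as named in this record, is a greedy wrapper: start from an empty set and repeatedly add the single feature that most improves a classifier-based criterion. The sketch below covers plain SFS only, not the floating SFFS variant or the multi-stage scheme. The leave-one-out 1-nearest-neighbour criterion and the toy data are assumptions made for the example.

```python
def loo_1nn_accuracy(X, y, features):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier on the given feature indices."""
    correct = 0
    for i in range(len(X)):
        best_dist, best_label = None, None
        for k in range(len(X)):
            if k == i:
                continue
            dist = sum((X[i][j] - X[k][j]) ** 2 for j in features)
            if best_dist is None or dist < best_dist:
                best_dist, best_label = dist, y[k]
        correct += best_label == y[i]
    return correct / len(X)


def sequential_forward_selection(X, y, n_select):
    """Greedily add the feature that gives the best criterion value until n_select are chosen."""
    selected = []
    remaining = list(range(len(X[0])))
    while remaining and len(selected) < n_select:
        best = max(remaining, key=lambda j: loo_1nn_accuracy(X, y, selected + [j]))
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data (made up): only feature 2 separates the two classes.
X = [[0.1, 7.0, 1.0], [0.9, 2.0, 1.1], [0.5, 5.0, 0.9],
     [0.2, 6.0, 3.0], [0.8, 1.0, 3.2], [0.4, 4.0, 2.9]]
y = [0, 0, 0, 1, 1, 1]
selected = sequential_forward_selection(X, y, n_select=2)
```

    SFS never revisits a choice once a feature is added. SFFS relaxes this by allowing conditional removal steps after each addition, at the cost of more criterion evaluations.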

  20. Examining applying high performance genetic data feature selection and classification algorithms for colon cancer diagnosis.

    Al-Rajab, Murad; Lu, Joan; Xu, Qiang

    2017-07-01

    This paper examines the accuracy and efficiency (time complexity) of high performance genetic data feature selection and classification algorithms for colon cancer diagnosis. The need for this research derives from the urgent and increasing need for accurate and efficient algorithms. Colon cancer is a leading cause of death worldwide, hence it is vitally important for the cancer tissues to be expertly identified and classified in a rapid and timely manner, to assure both a fast detection of the disease and to expedite the drug discovery process. In this research, a three-phase approach was proposed and implemented: Phases One and Two examined the feature selection algorithms and classification algorithms employed separately, and Phase Three examined the performance of the combination of these. It was found from Phase One that the Particle Swarm Optimization (PSO) algorithm performed best on the colon dataset as a feature selection method (29 genes selected) and from Phase Two that the Support Vector Machine (SVM) algorithm outperformed the other classifiers, with an accuracy of almost 86%. It was also found from Phase Three that the combined use of PSO and SVM surpassed other algorithms in accuracy and performance, and was faster in terms of time analysis (94%). It is concluded that applying feature selection algorithms prior to classification algorithms results in better accuracy than when the latter are applied alone. This conclusion is important and significant to industry and society. Copyright © 2017 Elsevier B.V. All rights reserved.

  1. Heuristic algorithms for feature selection under Bayesian models with block-diagonal covariance structure.

    Foroughi Pour, Ali; Dalton, Lori A

    2018-03-21

    Many bioinformatics studies aim to identify markers, or features, that can be used to discriminate between distinct groups. In problems where strong individual markers are not available, or where interactions between gene products are of primary interest, it may be necessary to consider combinations of features as a marker family. To this end, recent work proposes a hierarchical Bayesian framework for feature selection that places a prior on the set of features we wish to select and on the label-conditioned feature distribution. While an analytical posterior under Gaussian models with block covariance structures is available, the optimal feature selection algorithm for this model remains intractable since it requires evaluating the posterior over the space of all possible covariance block structures and feature-block assignments. To address this computational barrier, in prior work we proposed a simple suboptimal algorithm, 2MNC-Robust, with robust performance across the space of block structures. Here, we present three new heuristic feature selection algorithms. The proposed algorithms outperform 2MNC-Robust and many other popular feature selection algorithms on synthetic data. In addition, enrichment analysis on real breast cancer, colon cancer, and Leukemia data indicates they also output many of the genes and pathways linked to the cancers under study. Bayesian feature selection is a promising framework for small-sample high-dimensional data, in particular biomarker discovery applications. When applied to cancer data these algorithms outputted many genes already shown to be involved in cancer as well as potentially new biomarkers. Furthermore, one of the proposed algorithms, SPM, outputs blocks of heavily correlated genes, particularly useful for studying gene interactions and gene networks.

  2. An enhanced PSO-DEFS based feature selection with biometric authentication for identification of diabetic retinopathy

    Umarani Balakrishnan

    2016-11-01

    Recently, automatic diagnosis of diabetic retinopathy (DR) from retinal images has been one of the most significant research topics in medical applications. Diabetic macular edema (DME) is the major reason for the loss of vision in patients suffering from DR. Early identification of DR helps prevent vision loss and encourages diabetic control activities. Many techniques have been developed to diagnose DR. The major drawbacks of the existing techniques are low accuracy and high time complexity. To overcome these issues, this paper proposes an enhanced particle swarm optimization-differential evolution feature selection (PSO-DEFS) based feature selection approach with biometric authentication for the identification of DR. Initially, a hybrid median filter (HMF) is used for pre-processing the input images. Then, the pre-processed images are embedded with each other by using the least significant bit (LSB) for authentication purposes. Simultaneously, the image features are extracted using convoluted local tetra pattern (CLTrP) and Tamura features. Feature selection is performed using PSO-DEFS and PSO-gravitational search algorithm (PSO-GSA) to reduce time complexity. Based on some performance metrics, PSO-DEFS is chosen as the better choice for feature selection. The feature selection is performed based on the fitness value. A multi-relevance vector machine (M-RVM) is introduced to classify the 13 normal and 62 abnormal images among 75 images from 60 patients. Finally, the DR patients are further classified by M-RVM. The experimental results show that the proposed approach achieves better accuracy, sensitivity, and specificity than the existing techniques.

  3. Effect of feature-selective attention on neuronal responses in macaque area MT

    Chen, X.; Hoffmann, K.-P.; Albright, T. D.

    2012-01-01

    Attention influences visual processing in striate and extrastriate cortex, which has been extensively studied for spatial-, object-, and feature-based attention. Most studies exploring neural signatures of feature-based attention have trained animals to attend to an object identified by a certain feature and ignore objects/displays identified by a different feature. Little is known about the effects of feature-selective attention, where subjects attend to one stimulus feature domain (e.g., color) of an object while features from different domains (e.g., direction of motion) of the same object are ignored. To study this type of feature-selective attention in area MT in the middle temporal sulcus, we trained macaque monkeys to either attend to and report the direction of motion of a moving sine wave grating (a feature for which MT neurons display strong selectivity) or attend to and report its color (a feature for which MT neurons have very limited selectivity). We hypothesized that neurons would upregulate their firing rate during attend-direction conditions compared with attend-color conditions. We found that feature-selective attention significantly affected 22% of MT neurons. Contrary to our hypothesis, these neurons did not necessarily increase firing rate when animals attended to direction of motion but fell into one of two classes. In one class, attention to color increased the gain of stimulus-induced responses compared with attend-direction conditions. The other class displayed the opposite effects. Feature-selective activity modulations occurred earlier in neurons modulated by attention to color compared with neurons modulated by attention to motion direction. Thus feature-selective attention influences neuronal processing in macaque area MT but often exhibited a mismatch between the preferred stimulus dimension (direction of motion) and the preferred attention dimension (attention to color). PMID:22170961

  4. Mining potential biomarkers associated with space flight in Caenorhabditis elegans experienced Shenzhou-8 mission with multiple feature selection techniques

    Zhao, Lei; Gao, Ying; Mi, Dong; Sun, Yeqing

    2016-01-01

    Highlights: • A combined algorithm is proposed to mine biomarkers of spaceflight in C. elegans. • This algorithm makes the feature selection more reliable and robust. • The algorithm is applied to predict 17 positive biomarkers of space environment stress. • The strategy can be used as a general method to select important features. Abstract: To identify the potential biomarkers associated with space flight, a combined algorithm, which integrates the feature selection techniques, was used to deal with the microarray datasets of Caenorhabditis elegans obtained in the Shenzhou-8 mission. Compared with the ground control treatment, a total of 86 differentially expressed (DE) genes in response to the space synthetic environment or space radiation environment were identified by two filter methods. The top 30 ranking genes were then selected by the random forest algorithm. Gene Ontology annotation and functional enrichment analyses showed that these genes were mainly associated with metabolic processes. Furthermore, clustering analysis showed that 17 genes among these are positive, including 9 for the space synthetic environment and 8 for the space radiation environment only. These genes could be used as biomarkers to reflect the space environment stresses. In addition, we also found that microgravity is the main stress factor changing the expression patterns of biomarkers for short-duration spaceflight.

  5. DNABP: Identification of DNA-Binding Proteins Based on Feature Selection Using a Random Forest and Predicting Binding Residues.

    Ma, Xin; Guo, Jing; Sun, Xiao

    2016-01-01

    DNA-binding proteins are fundamentally important in cellular processes. Several computational-based methods have been developed to improve the prediction of DNA-binding proteins in previous years. However, insufficient work has been done on the prediction of DNA-binding proteins from protein sequence information. In this paper, a novel predictor, DNABP (DNA-binding proteins), was designed to predict DNA-binding proteins using the random forest (RF) classifier with a hybrid feature. The hybrid feature contains two types of novel sequence features, which reflect information about the conservation of physicochemical properties of the amino acids, and the binding propensity of DNA-binding residues and non-binding propensities of non-binding residues. The comparisons with each feature demonstrated that these two novel features contributed most to the improvement in predictive ability. Furthermore, to improve the prediction performance of the DNABP model, feature selection using the minimum redundancy maximum relevance (mRMR) method combined with incremental feature selection (IFS) was carried out during the model construction. The results showed that the DNABP model could achieve 86.90% accuracy, 83.76% sensitivity, 90.03% specificity and a Matthews correlation coefficient of 0.727. High prediction accuracy and performance comparisons with previous research suggested that DNABP could be a useful approach to identify DNA-binding proteins from sequence information. The DNABP web server system is freely available at http://www.cbi.seu.edu.cn/DNABP/.
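
    The minimum redundancy maximum relevance (mRMR) ranking mentioned in this record can be sketched as a greedy loop that trades relevance to the target against redundancy with already chosen features. The sketch below is an assumption-laden illustration, not the DNABP pipeline: mRMR is normally defined with mutual information, whereas here absolute Pearson correlation is swapped in as a simple stand-in, and the names `pearson` and `mrmr_rank` are hypothetical.

```python
from math import sqrt
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0


def mrmr_rank(columns: List[Sequence[float]], target: Sequence[float], k: int) -> List[int]:
    """Greedy mRMR-style ranking using |correlation| instead of mutual information."""
    relevance = [abs(pearson(c, target)) for c in columns]
    selected: List[int] = []
    remaining = list(range(len(columns)))
    while remaining and len(selected) < k:

        def criterion(j: int) -> float:
            if not selected:
                return relevance[j]
            # Mean absolute correlation with the features already chosen.
            redundancy = sum(abs(pearson(columns[j], columns[s])) for s in selected) / len(selected)
            return relevance[j] - redundancy

        best = max(remaining, key=criterion)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data: column 1 is a rescaled copy of column 0 (fully redundant),
# column 2 is weaker on its own but uncorrelated with column 0.
f0 = [1.0, 2.0, 3.0, 4.0]
f1 = [2.0, 4.0, 6.0, 8.0]
f2 = [1.0, -1.0, -1.0, 1.0]
y = [3.0, 3.0, 5.0, 9.0]  # equals 2 * f0 + f2
ranking = mrmr_rank([f0, f1, f2], y, k=2)
print(ranking)
```

    Incremental feature selection (IFS) would then evaluate a classifier on the top 1, top 2, ... features of such a ranking and keep the prefix with the best validation performance.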

  6. Mining potential biomarkers associated with space flight in Caenorhabditis elegans experienced Shenzhou-8 mission with multiple feature selection techniques

    Zhao, Lei [Institute of Environmental Systems Biology, College of Environmental Science and Engineering, Dalian Maritime University, Dalian 116026 (China); Gao, Ying [Center of Medical Physics and Technology, Hefei Institutes of Physical Science, Chinese Academy of Sciences, Shushanhu Road 350, Hefei 230031 (China); Mi, Dong, E-mail: mid@dlmu.edu.cn [Department of Physics, Dalian Maritime University, Dalian 116026 (China); Sun, Yeqing, E-mail: yqsun@dlmu.edu.cn [Institute of Environmental Systems Biology, College of Environmental Science and Engineering, Dalian Maritime University, Dalian 116026 (China)

    2016-09-15

    Highlights: • A combined algorithm is proposed to mine biomarkers of spaceflight in C. elegans. • This algorithm makes the feature selection more reliable and robust. • The algorithm is applied to predict 17 positive biomarkers of space environment stress. • The strategy can be used as a general method to select important features. Abstract: To identify the potential biomarkers associated with space flight, a combined algorithm, which integrates the feature selection techniques, was used to deal with the microarray datasets of Caenorhabditis elegans obtained in the Shenzhou-8 mission. Compared with the ground control treatment, a total of 86 differentially expressed (DE) genes in response to the space synthetic environment or space radiation environment were identified by two filter methods. The top 30 ranking genes were then selected by the random forest algorithm. Gene Ontology annotation and functional enrichment analyses showed that these genes were mainly associated with metabolic processes. Furthermore, clustering analysis showed that 17 genes among these are positive, including 9 for the space synthetic environment and 8 for the space radiation environment only. These genes could be used as biomarkers to reflect the space environment stresses. In addition, we also found that microgravity is the main stress factor changing the expression patterns of biomarkers for short-duration spaceflight.

  7. Improving mass candidate detection in mammograms via feature maxima propagation and local feature selection.

    Melendez, Jaime; Sánchez, Clara I; van Ginneken, Bram; Karssemeijer, Nico

    2014-08-01

    Mass candidate detection is a crucial component of multistep computer-aided detection (CAD) systems. It is usually performed by combining several local features by means of a classifier. When these features are processed on a per-image-location basis (e.g., for each pixel), mismatching problems may arise while constructing feature vectors for classification, which is especially true when the behavior expected from the evaluated features is a peaked response due to the presence of a mass. In this study, two of these problems, consisting of maxima misalignment and differences of maxima spread, are identified and two solutions are proposed. The first proposed method, feature maxima propagation, reproduces feature maxima through their neighboring locations. The second method, local feature selection, combines different subsets of features for different feature vectors associated with image locations. Both methods are applied independently and together. The proposed methods are included in a mammogram-based CAD system intended for mass detection in screening. Experiments are carried out with a database of 382 digital cases. Sensitivity is assessed at two sets of operating points. The first one is the interval of 3.5-15 false positives per image (FPs/image), which is typical for mass candidate detection. The second one is 1 FP/image, which allows to estimate the quality of the mass candidate detector's output for use in subsequent steps of the CAD system. The best results are obtained when the proposed methods are applied together. In that case, the mean sensitivity in the interval of 3.5-15 FPs/image significantly increases from 0.926 to 0.958 (p < 0.0002). At the lower rate of 1 FP/image, the mean sensitivity improves from 0.628 to 0.734 (p < 0.0002). Given the improved detection performance, the authors believe that the strategies proposed in this paper can render mass candidate detection approaches based on image location classification more robust to feature

  8. Revealing metabolite biomarkers for acupuncture treatment by linear programming based feature selection.

    Wang, Yong; Wu, Qiao-Feng; Chen, Chen; Wu, Ling-Yun; Yan, Xian-Zhong; Yu, Shu-Guang; Zhang, Xiang-Sun; Liang, Fan-Rong

    2012-01-01

    Acupuncture has been practiced in China for thousands of years as part of Traditional Chinese Medicine (TCM) and has gradually been accepted in western countries as an alternative or complementary treatment. However, the underlying mechanism of acupuncture, especially whether there exists any difference between various acupoints, remains largely unknown, which hinders its widespread use. In this study, we develop a novel Linear Programming based Feature Selection method (LPFS) to understand the mechanism of the acupuncture effect, at the molecular level, by revealing the metabolite biomarkers for acupuncture treatment. Specifically, we generate and investigate the high-throughput metabolic profiles of acupuncture treatment at several acupoints in humans. To select the subsets of metabolites that best characterize the acupuncture effect for each meridian point, an optimization model is proposed to identify biomarkers from high-dimensional metabolic data from case and control samples. Importantly, we use nearest centroid as the prototype to simultaneously minimize the number of selected features and the leave-one-out cross validation error of the classifier. We compared the performance of LPFS to several state-of-the-art methods, such as SVM recursive feature elimination (SVM-RFE) and the sparse multinomial logistic regression approach (SMLR). We find that our LPFS method tends to reveal a small set of metabolites with small standard deviation and large shifts, which exactly serves our requirement for a good biomarker. Biologically, several metabolite biomarkers for acupuncture treatment are revealed and serve as candidates for further mechanism investigation. Also, biomarkers derived from five meridian points, Zusanli (ST36), Liangmen (ST21), Juliao (ST3), Yanglingquan (GB34), and Weizhong (BL40), are compared for their similarity and difference, which provides evidence for the specificity of acupoints. Our result demonstrates that metabolic profiling might be a promising method to

  9. A comprehensive analysis of earthquake damage patterns using high dimensional model representation feature selection

    Taşkin Kaya, Gülşen

    2013-10-01

    Recently, earthquake damage assessment using satellite images has been a very popular ongoing research direction. Especially with the availability of very high resolution (VHR) satellite images, quite detailed damage maps at building scale have been produced, and various studies have also been conducted in the literature. As the spatial resolution of satellite images increases, distinguishing damage patterns becomes more difficult, especially when only the spectral information is used during classification. In order to overcome this difficulty, textural information needs to be included in the classification to improve the visual quality and reliability of the damage map. There are many kinds of textural information which can be derived from VHR satellite images depending on the algorithm used. However, extracting and evaluating textural information has generally been a time consuming process, especially for the large areas affected by an earthquake, due to the size of the VHR image. Therefore, in order to provide a quick damage map, the most useful features describing damage patterns need to be known in advance, as well as the redundant features. In this study, a very high resolution satellite image acquired after the Bam, Iran earthquake was used to identify the earthquake damage. In addition to the spectral information, textural information was also used during the classification. For textural information, second order Haralick features were extracted from the panchromatic image for the area of interest using the gray level co-occurrence matrix with different window sizes and directions. In addition to using spatial features in classification, the most useful features representing the damage characteristics were selected with a novel feature selection method based on high dimensional model representation (HDMR), giving the sensitivity of each feature during classification. The method called HDMR was recently proposed as an efficient tool to capture the input

  10. Univaried models in the series of temperature of the air

    Leon Aristizabal Gloria esperanza

    2000-01-01

    The theoretical framework for the study of air temperature time series is the theory of stochastic processes, particularly those known as ARIMA, which make it possible to carry out a univariate analysis. ARIMA models are built in order to explain the structure of the monthly temperatures corresponding to the mean, absolute maximum, absolute minimum, mean maximum and mean minimum temperatures for four stations in Colombia. By means of those models, the possible evolution of these variables is estimated for predictive purposes. The application and utility of the models are discussed.

  11. Effect Sizes for Research Univariate and Multivariate Applications

    Grissom, Robert J

    2011-01-01

    Noted for its comprehensive coverage, this greatly expanded new edition now covers the use of univariate and multivariate effect sizes. Many measures and estimators are reviewed along with their application, interpretation, and limitations. Noted for its practical approach, the book features numerous examples using real data for a variety of variables and designs, to help readers apply the material to their own data. Tips on the use of SPSS, SAS, R, and S-Plus are provided. The book's broad disciplinary appeal results from its inclusion of a variety of examples from psychology, medicine, educa

  12. Handbook of univariate and multivariate data analysis with IBM SPSS

    Ho, Robert

    2013-01-01

    Using the same accessible, hands-on approach as its best-selling predecessor, the Handbook of Univariate and Multivariate Data Analysis with IBM SPSS, Second Edition explains how to apply statistical tests to experimental findings, identify the assumptions underlying the tests, and interpret the findings. This second edition now covers more topics and has been updated with the SPSS statistical package for Windows. New to the Second Edition: three new chapters on multiple discriminant analysis, logistic regression, and canonical correlation; a new section on how to deal with missing data; coverage of te

  13. The pathways for intelligible speech: multivariate and univariate perspectives.

    Evans, S; Kyong, J S; Rosen, S; Golestani, N; Warren, J E; McGettigan, C; Mourão-Miranda, J; Wise, R J S; Scott, S K

    2014-09-01

    An anterior pathway, concerned with extracting meaning from sound, has been identified in nonhuman primates. An analogous pathway has been suggested in humans, but controversy exists concerning the degree of lateralization and the precise location where responses to intelligible speech emerge. We have demonstrated that the left anterior superior temporal sulcus (STS) responds preferentially to intelligible speech (Scott SK, Blank CC, Rosen S, Wise RJS. 2000. Identification of a pathway for intelligible speech in the left temporal lobe. Brain. 123:2400-2406.). A functional magnetic resonance imaging study in Cerebral Cortex used equivalent stimuli and univariate and multivariate analyses to argue for the greater importance of bilateral posterior when compared with the left anterior STS in responding to intelligible speech (Okada K, Rong F, Venezia J, Matchin W, Hsieh IH, Saberi K, Serences JT,Hickok G. 2010. Hierarchical organization of human auditory cortex: evidence from acoustic invariance in the response to intelligible speech. 20: 2486-2495.). Here, we also replicate our original study, demonstrating that the left anterior STS exhibits the strongest univariate response and, in decoding using the bilateral temporal cortex, contains the most informative voxels showing an increased response to intelligible speech. In contrast, in classifications using local "searchlights" and a whole brain analysis, we find greater classification accuracy in posterior rather than anterior temporal regions. Thus, we show that the precise nature of the multivariate analysis used will emphasize different response profiles associated with complex sound to speech processing. © The Author 2013. Published by Oxford University Press.

  14. Feature selection and classification of MAQC-II breast cancer and multiple myeloma microarray gene expression data.

    Qingzhong Liu

    Microarray data has a high dimension of variables but available datasets usually have only a small number of samples, thereby making the study of such datasets interesting and challenging. In the task of analyzing microarray data for the purpose of, e.g., predicting gene-disease association, feature selection is very important because it provides a way to handle the high dimensionality by exploiting information redundancy induced by associations among genetic markers. Judicious feature selection in microarray data analysis can result in significant reduction of cost while maintaining or improving the classification or prediction accuracy of learning machines that are employed to sort out the datasets. In this paper, we propose a gene selection method called Recursive Feature Addition (RFA), which combines supervised learning and statistical similarity measures. We compare our method with the following gene selection methods: Support Vector Machine Recursive Feature Elimination (SVMRFE), Leave-One-Out Calculation Sequential Forward Selection (LOOCSFS), and Gradient based Leave-one-out Gene Selection (GLGS). To evaluate the performance of these gene selection methods, we employ several popular learning classifiers on the MicroArray Quality Control phase II on predictive modeling (MAQC-II) breast cancer dataset and the MAQC-II multiple myeloma dataset. Experimental results show that gene selection is strictly paired with learning classifier. Overall, our approach outperforms other compared methods. The biological functional analysis based on the MAQC-II breast cancer dataset convinced us to apply our method for phenotype prediction. Additionally, learning classifiers also play important roles in the classification of microarray data and our experimental results indicate that the Nearest Mean Scale Classifier (NMSC) is a good choice due to its prediction reliability and its stability across the three performance measurements: Testing accuracy, MCC values, and

  15. A proposed framework on hybrid feature selection techniques for handling high dimensional educational data

    Shahiri, Amirah Mohamed; Husain, Wahidah; Rashid, Nur'Aini Abd

    2017-10-01

    Huge amounts of data in educational datasets may make it difficult to produce quality data. Recently, data mining approaches have been increasingly used by educational data mining researchers for analyzing data patterns. However, many research studies have concentrated on selecting suitable learning algorithms instead of performing a feature selection process. As a result, these data suffer from computational complexity and require longer computational time for classification. The main objective of this research is to provide an overview of feature selection techniques that have been used to analyze the most significant features. Then, this research will propose a framework to improve the quality of students' datasets. The proposed framework uses filter and wrapper based techniques to support the prediction process in a future study.

  16. Improving Naive Bayes with Online Feature Selection for Quick Adaptation to Evolving Feature Usefulness

    Pon, R K; Cardenas, A F; Buttler, D J

    2007-09-19

    The definition of what makes an article interesting varies from user to user and continually evolves even for a single user. As a result, for news recommendation systems, useless document features cannot be determined a priori and all features are usually considered for interestingness classification. Consequently, the presence of currently useless features degrades classification performance [1], particularly over the initial set of news articles being classified. The initial set of documents is critical for a user when considering which particular news recommendation system to adopt. To address these problems, we introduce an improved version of the naive Bayes classifier with online feature selection. We use correlation to determine the utility of each feature and take advantage of the conditional independence assumption used by naive Bayes for online feature selection and classification. The augmented naive Bayes classifier performs 28% better than the traditional naive Bayes classifier in recommending news articles from the Yahoo! RSS feeds.

  17. Enhancing the Performance of LibSVM Classifier by Kernel F-Score Feature Selection

    Sarojini, Balakrishnan; Ramaraj, Narayanasamy; Nickolas, Savarimuthu

    Medical data mining is the search for relationships and patterns within medical datasets that could provide useful knowledge for effective clinical decisions. The inclusion of irrelevant, redundant and noisy features in the process model results in poor predictive accuracy. Much research work in data mining has gone into improving the predictive accuracy of classifiers by applying feature selection techniques. Feature selection in medical data mining is valuable because the diagnosis of the disease could then be done in this patient-care activity with a minimum number of significant features. The objective of this work is to show that selecting the more significant features would improve the performance of the classifier. We empirically evaluate the classification effectiveness of the LibSVM classifier on the reduced feature subset of a diabetes dataset. The evaluations suggest that the selected feature subset improves the predictive accuracy of the classifier and reduces false negatives and false positives.
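
    For orientation, a univariate F-score filter of the kind this title refers to can be sketched as follows. This is the plain two-class F-score (between-class separation of the feature means divided by the sum of within-class sample variances), not the kernel variant used in the record, whose details the abstract does not give; the names `f_score` and `rank_features` are hypothetical.

```python
from statistics import mean, variance
from typing import List, Sequence


def f_score(pos: Sequence[float], neg: Sequence[float]) -> float:
    """Two-class F-score of one feature from its positive and negative sample values."""
    overall = mean(list(pos) + list(neg))
    # Squared distance of each class mean from the overall mean.
    between = (mean(pos) - overall) ** 2 + (mean(neg) - overall) ** 2
    # Sample variances (n - 1 denominator); each class needs at least two values.
    within = variance(pos) + variance(neg)
    return between / within if within > 0 else float("inf")


def rank_features(x_pos: List[Sequence[float]], x_neg: List[Sequence[float]]) -> List[int]:
    """Return feature indices sorted from highest to lowest F-score."""
    n_features = len(x_pos[0])
    scores = [
        f_score([row[j] for row in x_pos], [row[j] for row in x_neg])
        for j in range(n_features)
    ]
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)


# Toy data: feature 0 separates the classes, feature 1 does not.
x_pos = [[1.0, 5.0], [2.0, 3.0], [3.0, 4.0]]
x_neg = [[7.0, 4.0], [8.0, 5.0], [9.0, 3.0]]
order = rank_features(x_pos, x_neg)
print(order)
```

    A filter of this kind scores each feature independently, so it cannot detect features that are only informative in combination; a threshold on the score or a fixed top-k cut is then used to form the reduced subset passed to the classifier.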

  18. Feature Selection of Network Intrusion Data using Genetic Algorithm and Particle Swarm Optimization

    Iwan Syarif

    2016-12-01

    This paper describes the advantages of using Evolutionary Algorithms (EA) for feature selection on network intrusion datasets. Most current Network Intrusion Detection Systems (NIDS) are unable to detect intrusions in real time because of the high dimensional data produced during daily operation. Extracting knowledge from huge data such as intrusion data requires a new approach. The more complex the datasets, the higher the computation time and the harder they are to interpret and analyze. This paper investigates the performance of feature selection algorithms on network intrusion data. We used Genetic Algorithms (GA) and Particle Swarm Optimization (PSO) as feature selection algorithms. When applied to network intrusion datasets, both GA and PSO significantly reduced the number of features. Our experiments show that GA successfully reduces the number of attributes from 41 to 15 while PSO reduces the number of attributes from 41 to 9. Using k Nearest Neighbour (k-NN) as a classifier, the GA-reduced dataset, which consists of 37% of the original attributes, shows an accuracy improvement from 99.28% to 99.70%, and its execution time is also 4.8 times faster than that of the original dataset. Using the same classifier, the PSO-reduced dataset, which consists of 22% of the original attributes, has the fastest execution time (7.2 times faster than that of the original dataset). However, its accuracy is slightly reduced, by 0.02%, from 99.28% to 99.26%. Overall, both GA and PSO are good solutions as feature selection techniques because they have shown very good performance in reducing the number of features significantly while still maintaining and sometimes improving the classification accuracy as well as reducing the computation time.
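
    A genetic algorithm for feature selection, as described in this record, typically encodes each candidate subset as a bit mask and evolves a population of masks under a fitness function. The sketch below is a minimal illustration under stated assumptions, not the authors' configuration: the function `ga_select`, its default parameters, and the toy fitness are hypothetical, and it assumes at least two features and a population of at least four masks.

```python
import random
from typing import Callable, List

Mask = List[int]


def ga_select(
    n_features: int,
    fitness: Callable[[Mask], float],
    pop_size: int = 20,
    generations: int = 30,
    mutation_rate: float = 0.05,
    seed: int = 0,
) -> Mask:
    """Evolve 0/1 feature masks; returns the fittest mask of the final population."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_features)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        elite = pop[: pop_size // 2]  # the better half survives unchanged
        children: List[Mask] = []
        while len(elite) + len(children) < pop_size:
            a, b = rng.sample(elite, 2)
            cut = rng.randrange(1, n_features)  # one-point crossover
            child = a[:cut] + b[cut:]
            # Flip each bit independently with probability `mutation_rate`.
            child = [bit ^ 1 if rng.random() < mutation_rate else bit for bit in child]
            children.append(child)
        pop = elite + children
    return max(pop, key=fitness)


def toy_fitness(mask: Mask) -> float:
    """Hypothetical fitness: reward features 0 and 3, charge 0.25 per selected feature."""
    relevant = {0, 3}
    return sum(bit for i, bit in enumerate(mask) if i in relevant) - 0.25 * sum(mask)


best_mask = ga_select(8, toy_fitness, seed=1)
print(best_mask, toy_fitness(best_mask))
```

    In a real setting the fitness would usually be a validation accuracy of a classifier (for example k-NN) trained on the masked features, often with a penalty on subset size as in the toy function above.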

  19. featsel: A framework for benchmarking of feature selection algorithms and cost functions

    Marcelo S. Reis; Gustavo Estrela; Carlos Eduardo Ferreira; Junior Barrera

    2017-01-01

    In this paper, we introduce featsel, a framework for benchmarking of feature selection algorithms and cost functions. This framework allows the user to deal with the search space as a Boolean lattice and has its core coded in C++ for computational efficiency purposes. Moreover, featsel includes Perl scripts to add new algorithms and/or cost functions, generate random instances, plot graphs and organize results into tables. Besides, this framework already comes with dozens of algorithms and co...

  20. Less is more: Avoiding the LIBS dimensionality curse through judicious feature selection for explosive detection

    Kumar Myakalwar, Ashwin; Spegazzini, Nicolas; Zhang, Chi; Kumar Anubham, Siva; Dasari, Ramachandra R.; Barman, Ishan; Kumar Gundawar, Manoj

    2015-01-01

    Despite its intrinsic advantages, translation of laser induced breakdown spectroscopy for material identification has often been impeded by the lack of robustness of developed classification models, often due to the presence of spurious correlations. While a number of classifiers exhibiting high discriminatory power have been reported, efforts in establishing the subset of relevant spectral features that enable a fundamental interpretation of the segmentation capability and avoid the ‘curse of dimensionality’ have been lacking. Using LIBS data acquired from a set of secondary explosives, we investigate judicious feature selection approaches and architect two different chemometrics classifiers, based on feature selection through prerequisite knowledge of the sample composition and genetic algorithm, respectively. While the full spectral input results in a classification rate of ca. 92%, selection of only the carbon to hydrogen spectral window results in near identical performance. Importantly, the genetic algorithm-derived classifier shows a statistically significant improvement to ca. 94% accuracy for prospective classification, even though the number of features used is an order of magnitude smaller. Our findings demonstrate the impact of rigorous feature selection in LIBS and also hint at the feasibility of using a discrete filter based detector, thereby enabling a cheaper and more compact system more amenable to field operations. PMID:26286630

  1. OPTIMASI KLASIFIKASI SEL TUNGGAL PAP SMEAR MENGGUNAKAN CORRELATION BASED FEATURES SELECTION (CFS) BERBASIS C4.5 DAN NAIVE BAYES

    Asti Herliana

    2016-11-01

    Full Text Available Abstract: Cervical cancer is a highly dangerous disease that generally affects women. Early detection through the Pap smear method is one way to prevent the disease from developing in the cervical canal. Pap smear results provide the single-cell data set now known as the Herlev data. Experts use this data as a reference for finding the best classification level for each class of cervical cancer. The C4.5 decision tree and Naïve Bayes have been shown to give the best results on a trial of 280 Herlev records when supported by the Correlation-based Features Selection (CFS) optimization method. The question raised in the present study is whether the CFS optimization method, combined with the C4.5 and Naïve Bayes classification methods, can increase accuracy when applied to the 917 Herlev records. The results show that the classification accuracy of CFS combined with either C4.5 or Naïve Bayes decreased compared with not using CFS. This indicates that CFS does not provide the best result when confronted with larger data sets. Keywords: optimization, classification, single cell of Pap Smear, Correlation based Features Selection, C4.5, Naïve Bayes

  2. Classification of Alzheimer's disease patients with hippocampal shape wrapper-based feature selection and support vector machine

    Young, Jonathan; Ridgway, Gerard; Leung, Kelvin; Ourselin, Sebastien

    2012-02-01

    It is well known that hippocampal atrophy is a marker of the onset of Alzheimer's disease (AD) and as a result hippocampal volumetry has been used in a number of studies to provide early diagnosis of AD and predict conversion of mild cognitive impairment (MCI) patients to AD. However, rates of atrophy are not uniform across the hippocampus, making shape analysis a potentially more accurate biomarker. This study examines the hippocampi from 226 healthy controls, 148 AD patients and 330 MCI patients obtained from T1 weighted structural MRI images from the ADNI database. The hippocampi are anatomically segmented using the MAPS multi-atlas segmentation method, and the resulting binary images are then processed with SPHARM software to decompose their shapes as a weighted sum of spherical harmonic basis functions. The resulting parameterizations are then used as feature vectors in Support Vector Machine (SVM) classification. A wrapper based feature selection method was used, as this considers the utility of features in discriminating classes in combination, fully exploiting the multivariate nature of the data and optimizing the selected set of features for the type of classifier that is used. The leave-one-out cross validated accuracy obtained on training data is 88.6% for classifying AD vs controls and 74% for classifying MCI-converters vs MCI-stable with very compact feature sets, showing that this is a highly promising method. There is currently a considerable fall in accuracy on unseen data, indicating that the feature selection is sensitive to the data used; however, feature ensemble methods may overcome this.

  3. Rolling Bearing Fault Diagnosis Using Modified Neighborhood Preserving Embedding and Maximal Overlap Discrete Wavelet Packet Transform with Sensitive Features Selection

    Fei Dong

    2018-01-01

    Full Text Available In order to enhance the performance of bearing fault diagnosis and classification, feature extraction and feature dimensionality reduction have become more important. The original statistical feature set was calculated from single branch reconstruction vibration signals obtained by using the maximal overlap discrete wavelet packet transform (MODWPT). In order to reduce redundant information in the original statistical feature set, feature selection by adjusted Rand index and sum of within-class mean deviations (FSASD) was proposed to select fault-sensitive features. Furthermore, a modified feature dimensionality reduction method, supervised neighborhood preserving embedding with label information (SNPEL), was proposed to realize low-dimensional representations of the high-dimensional feature space. Finally, vibration signals collected from two experimental test rigs were employed to evaluate the performance of the proposed procedure. The results demonstrate the effectiveness, adaptability, and superiority of the proposed procedure, which can serve as an intelligent bearing fault diagnosis system.

  4. Gene features selection for three-class disease classification via multiple orthogonal partial least square discriminant analysis and S-plot using microarray data.

    Yang, Mingxing; Li, Xiumin; Li, Zhibin; Ou, Zhimin; Liu, Ming; Liu, Suhuan; Li, Xuejun; Yang, Shuyu

    2013-01-01

    DNA microarray analysis is characterized by obtaining a large number of gene variables from a small number of observations. Cluster analysis is widely used to analyze DNA microarray data to make classification and diagnosis of disease. Because there are so many irrelevant and insignificant genes in a dataset, a feature selection approach must be employed in data analysis. The performance of cluster analysis of this high-throughput data depends on whether the feature selection approach chooses the most relevant genes associated with disease classes. Here we proposed a new method using multiple Orthogonal Partial Least Squares-Discriminant Analysis (mOPLS-DA) models and S-plots to select the most relevant genes to conduct three-class disease classification and prediction. We tested our method using Golub's leukemia microarray data. For three classes with subtypes, we proposed hierarchical orthogonal partial least squares-discriminant analysis (OPLS-DA) models and S-plots to select features for two main classes and their subtypes. For three classes in parallel, we employed three OPLS-DA models and S-plots to choose marker genes for each class. The power of feature selection to classify and predict three-class disease was evaluated using cluster analysis. Further, the general performance of our method was tested using four public datasets and compared with those of four other feature selection methods. The results revealed that our method effectively selected the most relevant features for disease classification and prediction, and its performance was better than that of the other methods.

  5. Improving feature selection process resistance to failures caused by curse-of-dimensionality effects

    Somol, Petr; Grim, Jiří; Novovičová, Jana; Pudil, P.

    2011-01-01

    Vol. 47, No. 3 (2011), pp. 401-425 ISSN 0023-5954 R&D Projects: GA MŠk 1M0572; GA ČR GA102/08/0593 Grant - others: GA MŠk(CZ) 2C06019 Institutional research plan: CEZ:AV0Z10750506 Keywords: feature selection * curse of dimensionality * over-fitting * stability * machine learning * dimensionality reduction Subject RIV: IN - Informatics, Computer Science Impact factor: 0.454, year: 2011 http://library.utia.cas.cz/separaty/2011/RO/somol-0368741.pdf

  6. Feature-Selective Attention Adaptively Shifts Noise Correlations in Primary Auditory Cortex.

    Downer, Joshua D; Rapone, Brittany; Verhein, Jessica; O'Connor, Kevin N; Sutter, Mitchell L

    2017-05-24

    Sensory environments often contain an overwhelming amount of information, with both relevant and irrelevant information competing for neural resources. Feature attention mediates this competition by selecting the sensory features needed to form a coherent percept. How attention affects the activity of populations of neurons to support this process is poorly understood because population coding is typically studied through simulations in which one sensory feature is encoded without competition. Therefore, to study the effects of feature attention on population-based neural coding, investigations must be extended to include stimuli with both relevant and irrelevant features. We measured noise correlations (r_noise) within small neural populations in primary auditory cortex while rhesus macaques performed a novel feature-selective attention task. We found that the effect of feature-selective attention on r_noise depended not only on the population tuning to the attended feature, but also on the tuning to the distractor feature. To attempt to explain how these observed effects might support enhanced perceptual performance, we propose an extension of a simple and influential model in which shifts in r_noise can simultaneously enhance the representation of the attended feature while suppressing the distractor. These findings present a novel mechanism by which attention modulates neural populations to support sensory processing in cluttered environments. SIGNIFICANCE STATEMENT Although feature-selective attention constitutes one of the building blocks of listening in natural environments, its neural bases remain obscure. To address this, we developed a novel auditory feature-selective attention task and measured noise correlations (r_noise) in rhesus macaque A1 during task performance. Unlike previous studies showing that the effect of attention on r_noise depends on population tuning to the attended feature, we show that the effect of attention depends on the tuning

  7. Normed kernel function-based fuzzy possibilistic C-means (NKFPCM) algorithm for high-dimensional breast cancer database classification with feature selection is based on Laplacian Score

    Lestari, A. W.; Rustam, Z.

    2017-07-01

    In the last decade, breast cancer has become the focus of world attention as this disease is one of the leading causes of death for women. Therefore, it is necessary to have the correct precautions and treatment. In previous studies, the Fuzzy Kernel K-Medoid algorithm has been used for multi-class data. This paper proposes an algorithm to classify high-dimensional breast cancer data using Fuzzy Possibilistic C-Means (FPCM) and a new method based on clustering analysis using Normed Kernel Function-Based Fuzzy Possibilistic C-Means (NKFPCM). The objective of this paper is to obtain the best accuracy in classification of breast cancer data. In order to improve the accuracy of the two methods, the candidate features are evaluated using feature selection, where the Laplacian Score is used. The results compare the accuracy and running time of FPCM and NKFPCM with and without feature selection.

  8. An improved strategy for skin lesion detection and classification using uniform segmentation and feature selection based approach.

    Nasir, Muhammad; Attique Khan, Muhammad; Sharif, Muhammad; Lali, Ikram Ullah; Saba, Tanzila; Iqbal, Tassawar

    2018-02-21

    Melanoma is the deadliest type of skin cancer, with the highest mortality rate. However, eradication at an early stage implies a high survival rate; it therefore demands early diagnosis. The customary diagnosis methods are costly and cumbersome due to the involvement of experienced experts as well as the requirement for a highly equipped environment. The recent advancements in computerized solutions for these diagnoses are highly promising, with improved accuracy and efficiency. In this article, we propose a method for the classification of melanoma and benign skin lesions. Our approach integrates preprocessing, lesion segmentation, feature extraction, feature selection, and classification. Preprocessing is executed in the context of hair removal by DullRazor, whereas lesion texture and color information are utilized to enhance the lesion contrast. In lesion segmentation, a hybrid technique has been implemented and results are fused using the additive law of probability. A serial-based method is subsequently applied that extracts and fuses traits such as color, texture, and HOG (shape). The fused features are then selected by implementing a novel Boltzman Entropy method. Finally, the selected features are classified by a Support Vector Machine. The proposed method is evaluated on the publicly available data set PH2. Our approach has provided promising results of sensitivity 97.7%, specificity 96.7%, accuracy 97.5%, and F-score 97.5%, which are significantly better than the results of existing methods available on the same data set. The proposed method detects and classifies melanoma significantly better than existing methods. © 2018 Wiley Periodicals, Inc.

  9. Gesture Recognition from Data Streams of Human Motion Sensor Using Accelerated PSO Swarm Search Feature Selection Algorithm

    Simon Fong

    2015-01-01

    Full Text Available Human motion sensing technology has gained tremendous popularity, with practical applications such as video surveillance for security, hand signing, smart homes and gaming. These applications capture human motions in real time from video sensors, and the data patterns are nonstationary and ever changing. While the hardware technology of such motion sensing devices as well as their data collection process have become relatively mature, the computational challenge lies in the real-time analysis of these live feeds. In this paper we argue that traditional data mining methods fall short of accurately analyzing the human activity patterns from the sensor data stream. The shortcoming is due to an algorithmic design that is not adaptive to the dynamic changes in gesture motions. The successor of these algorithms, known as data stream mining, is evaluated against traditional data mining through a case of gesture recognition over motion data using Microsoft Kinect sensors. Three different subjects were asked to read three comic strips and to tell the stories in front of the sensor. The data stream contains coordinates of articulation points and various positions of the parts of the human body corresponding to the actions that the user performs. In particular, a novel technique of feature selection using swarm search and accelerated PSO is proposed to enable fast preprocessing for inducing an improved classification model in real time. Superior results are shown in the experiment that runs on this empirical data stream. The contribution of this paper is a comparative study of traditional and data stream mining algorithms and the incorporation of the novel improved feature selection technique in a scenario where different gesture patterns are to be recognized from streaming sensor data.

  10. DYNAMIC FEATURE SELECTION FOR WEB USER IDENTIFICATION ON LINGUISTIC AND STYLISTIC FEATURES OF ONLINE TEXTS

    A. A. Vorobeva

    2017-01-01

    Full Text Available The paper deals with identification and authentication of web users participating in Internet information processes (based on features of online texts). In digital forensics, web user identification based on various linguistic features can be used to discover the identity of individuals, criminals or terrorists using the Internet to commit cybercrimes. The Internet can be used as a tool in different types of cybercrime (fraud and identity theft, harassment and anonymous threats, terrorist or extremist statements, distribution of illegal content and information warfare). Linguistic identification of web users is a kind of biometric identification; it can be used to narrow down the suspects, identify a criminal and prosecute him. The feature set includes various linguistic and stylistic features extracted from online texts. We propose dynamic feature selection for each web user identification task. Selection is based on calculating the Manhattan distance to the k nearest neighbors (Relief-f algorithm). This approach improves the identification accuracy and minimizes the number of features. Experiments were carried out on several datasets with different levels of class imbalance. Experimental results showed that feature relevance varies across different sets of web users (probable authors of some text); feature selection for each set of web users improves identification accuracy by 4% on average, which is approximately 1% higher than with the use of a static set of features. The proposed approach is most effective for a small number of training samples (messages) per user.
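
    The Relief-style weighting this abstract mentions can be sketched in a few lines. The following is a simplified illustration, not the authors' implementation: it visits every sample instead of drawing a random subset, pools all other-class neighbours as "misses" (full Relief-F weights misses per class by class prior), and runs on made-up toy data. All names are hypothetical.

```python
def manhattan(a, b):
    """Manhattan (L1) distance between two equal-length feature vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def relief_weights(X, y, k=1):
    """Simplified Relief-style feature weights using k nearest hits and misses."""
    n_samples, n_features = len(X), len(X[0])
    # Per-feature value range, used to normalise differences to [0, 1].
    spans = []
    for j in range(n_features):
        column = [row[j] for row in X]
        spans.append((max(column) - min(column)) or 1.0)

    weights = [0.0] * n_features
    for i in range(n_samples):
        # Rank all other samples by Manhattan distance to sample i.
        others = sorted(
            (idx for idx in range(n_samples) if idx != i),
            key=lambda idx: manhattan(X[i], X[idx]),
        )
        hits = [idx for idx in others if y[idx] == y[i]][:k]
        misses = [idx for idx in others if y[idx] != y[i]][:k]
        for j in range(n_features):
            hit_diff = sum(abs(X[i][j] - X[h][j]) for h in hits) / (spans[j] * max(len(hits), 1))
            miss_diff = sum(abs(X[i][j] - X[m][j]) for m in misses) / (spans[j] * max(len(misses), 1))
            # Reward features that differ across classes and agree within a class.
            weights[j] += (miss_diff - hit_diff) / n_samples
    return weights


# Toy data (made up): feature 0 separates the classes, feature 1 is noise.
X = [[0.0, 0.3], [0.1, 0.9], [0.2, 0.5], [0.8, 0.4], [0.9, 0.8], [1.0, 0.6]]
y = [0, 0, 0, 1, 1, 1]
weights = relief_weights(X, y, k=1)
print(weights)
```

    On this toy set the informative feature receives a positive weight and the noisy one a negative weight, so keeping only features above a threshold would retain feature 0.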

  11. Time course of spatial and feature selective attention for partly-occluded objects.

    Kasai, Tetsuko; Takeya, Ryuji

    2012-07-01

    Attention selects objects/groups as the most fundamental units, and this may be achieved by an attention-spreading mechanism. Previous event-related potential (ERP) studies have found that attention-spreading is reflected by a decrease in the N1 spatial attention effect. The present study tested whether the electrophysiological attention effect is associated with the perception of object unity or amodal completion through the use of partly-occluded objects. ERPs were recorded in 14 participants who were required to pay attention to their left or right visual field and to press a button for a target shape in the attended field. Bilateral stimuli were presented rapidly, and were separated, connected, or connected behind an occluder. Behavioral performance in the connected and occluded conditions was worse than that in the separated condition, indicating that attention spread over perceptual object representations after amodal completion. Consistently, the late N1 spatial attention effect (180-220 ms post-stimulus) and the early phase (230-280 ms) of feature selection effects (target N2) at contralateral sites decreased, equally for the occluded and connected conditions, while the attention effect in the early N1 latency (140-180 ms) shifted most positively for the occluded condition. These results suggest that perceptual organization processes for object recognition transiently modulate spatial and feature selection processes in the visual cortex. Copyright © 2012 Elsevier Ltd. All rights reserved.

  12. Feature selection model based on clustering and ranking in pipeline for microarray data

    Barnali Sahu

    2017-01-01

    Full Text Available Most of the available feature selection techniques in the literature are classifier bound. That is, a group of features is tied to the performance of a specific classifier, as in wrapper and hybrid approaches. Our objective in this study is to select a set of generic features not tied to any classifier, based on the proposed framework. This framework uses attribute clustering and feature ranking techniques in a pipeline in order to remove redundant features. On each uncovered cluster, signal-to-noise ratio, t-statistics and significance analysis of microarray are independently applied to select the top ranked features. Both filter and evolutionary wrapper approaches have been considered for feature selection, and the data set with the selected features is given to an ensemble of predefined statistically different classifiers. The class labels of the test data are determined using a majority voting technique. Moreover, with the aforesaid objectives, this paper focuses on obtaining a stable result out of various classification models. Further, a comparative analysis has been performed to study the classification accuracy and computational time of the current approach and evolutionary wrapper techniques. It gives better insight into the features and further enhances the classification accuracy with less computational time.
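
    A minimal sketch of the "rank within each cluster" step described in this abstract, under stated assumptions: the clusters of feature indices are taken as given (the clustering step itself is not shown), only the signal-to-noise ratio is used as the univariate score, and it is computed in the common form (mean_a - mean_b) / (sd_a + sd_b). The data and all names are hypothetical.

```python
import math
from statistics import mean, stdev


def signal_to_noise(values_a, values_b):
    """Two-class signal-to-noise ratio: (mean_a - mean_b) / (sd_a + sd_b)."""
    diff = mean(values_a) - mean(values_b)
    spread = stdev(values_a) + stdev(values_b)
    if spread == 0:
        # No within-class variation: either uninformative or perfectly separating.
        return 0.0 if diff == 0 else math.copysign(float("inf"), diff)
    return diff / spread


def feature_scores(X, y, positive_label=1):
    """Absolute signal-to-noise ratio of every feature (column) in X."""
    scores = []
    for j in range(len(X[0])):
        pos = [row[j] for row, label in zip(X, y) if label == positive_label]
        neg = [row[j] for row, label in zip(X, y) if label != positive_label]
        scores.append(abs(signal_to_noise(pos, neg)))
    return scores


def top_per_cluster(X, y, clusters):
    """Keep the best-scoring feature from each cluster of feature indices."""
    scores = feature_scores(X, y)
    return [max(cluster, key=lambda j: scores[j]) for cluster in clusters]


# Toy expression matrix (made up): rows are samples, columns are features.
X = [
    [5.0, 1.0, 2.0], [5.2, 1.4, 2.6], [4.8, 0.8, 2.2],   # class 1
    [1.0, 1.1, 1.0], [1.2, 0.9, 1.4], [0.8, 1.3, 0.9],   # class 0
]
y = [1, 1, 1, 0, 0, 0]
clusters = [[0, 1], [2]]   # assumed output of an earlier clustering step

scores = feature_scores(X, y)
ranking = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
selected = top_per_cluster(X, y, clusters)
print(ranking, selected)
```

    Picking one representative per cluster is what removes redundancy here: two strongly correlated features fall in the same cluster, so at most one of them survives.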

  13. Feature selection and classifier parameters estimation for EEG signals peak detection using particle swarm optimization.

    Adam, Asrul; Shapiai, Mohd Ibrahim; Tumari, Mohd Zaidi Mohd; Mohamad, Mohd Saberi; Mubin, Marizan

    2014-01-01

    Electroencephalogram (EEG) signal peak detection is widely used in clinical applications. The peak point can be detected using several approaches, including time, frequency, time-frequency, and nonlinear domains depending on various peak features from several models. However, there is no study that provides the importance of every peak feature in contributing to a good and generalized model. In this study, feature selection and classifier parameters estimation based on particle swarm optimization (PSO) are proposed as a framework for peak detection on EEG signals in time domain analysis. Two versions of PSO are used in the study: (1) standard PSO and (2) random asynchronous particle swarm optimization (RA-PSO). The proposed framework tries to find the best combination of all the available features that offers good peak detection and a high classification rate from the results in the conducted experiments. The evaluation results indicate that the accuracy of the peak detection can be improved up to 99.90% and 98.59% for training and testing, respectively, as compared to the framework without feature selection adaptation. Additionally, the proposed framework based on RA-PSO offers a better and reliable classification rate as compared to standard PSO as it produces low variance model.

  14. Self-Adaptive MOEA Feature Selection for Classification of Bankruptcy Prediction Data

    Gaspar-Cunha, A.; Recio, G.; Costa, L.; Estébanez, C.

    2014-01-01

    Bankruptcy prediction is a vast area of finance and accounting whose importance lies in its relevance for creditors and investors in evaluating the likelihood of a company going bankrupt. As companies become complex, they develop sophisticated schemes to hide their real situation. In turn, estimating the credit risks associated with counterparts or predicting bankruptcy becomes harder. Evolutionary algorithms have been shown to be an excellent tool to deal with complex problems in finance and economics where a large number of irrelevant features are involved. This paper provides a methodology for feature selection in classification of bankruptcy data sets using an evolutionary multiobjective approach that simultaneously minimises the number of features and maximises the classifier quality measure (e.g., accuracy). The proposed methodology makes use of self-adaptation by applying the feature selection algorithm while simultaneously optimising the parameters of the classifier used. The methodology was applied to four different sets of data. The obtained results showed the utility of using the self-adaptation of the classifier. PMID:24707201
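
    The two competing objectives named in this abstract (fewer features, higher accuracy) are usually compared through Pareto dominance. The sketch below illustrates only that comparison step, not the evolutionary search or the self-adaptation; the candidate values are invented for the example and all names are hypothetical.

```python
def dominates(a, b):
    """True if candidate a is at least as good as b on both objectives and
    strictly better on one. Candidates are (n_features, accuracy) pairs:
    the feature count is minimised, the accuracy is maximised."""
    no_worse = a[0] <= b[0] and a[1] >= b[1]
    strictly_better = a[0] < b[0] or a[1] > b[1]
    return no_worse and strictly_better


def pareto_front(candidates):
    """Candidates that no other candidate dominates."""
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates)]


# Hypothetical (n_features, accuracy) results for six candidate subsets.
candidates = [(30, 0.91), (12, 0.90), (12, 0.88), (5, 0.84), (8, 0.83), (20, 0.91)]
front = pareto_front(candidates)
print(sorted(front))
```

    A multiobjective method returns the whole front rather than a single subset, leaving the trade-off between compactness and accuracy to the user.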

  15. A Quantum Hybrid PSO Combined with Fuzzy k-NN Approach to Feature Selection and Cell Classification in Cervical Cancer Detection

    Abdullah M. Iliyasu

    2017-12-01

    Full Text Available A quantum hybrid (QH) intelligent approach that blends the adaptive search capability of the quantum-behaved particle swarm optimisation (QPSO) method with the intuitionistic rationality of the traditional fuzzy k-nearest neighbours (Fuzzy k-NN) algorithm (known simply as the Q-Fuzzy approach) is proposed for efficient feature selection and classification of cells in cervical smeared (CS) images. From an initial multitude of 17 features describing the geometry, colour, and texture of the CS images, the QPSO stage of our proposed technique is used to select the best subset features (i.e., global best particles) that represent a pruned down collection of seven features. Using a dataset of almost 1000 images, performance evaluation of our proposed Q-Fuzzy approach assesses the impact of our feature selection on classification accuracy by way of three experimental scenarios that are compared alongside two other approaches: the All-features (i.e., classification without prior feature selection) and another hybrid technique combining the standard PSO algorithm with the Fuzzy k-NN technique (P-Fuzzy approach). In the first and second scenarios, we further divided the assessment criteria in terms of classification accuracy based on the choice of best features and those in terms of the different categories of the cervical cells. In the third scenario, we introduced new QH hybrid techniques, i.e., QPSO combined with other supervised learning methods, and compared the classification accuracy alongside our proposed Q-Fuzzy approach. Furthermore, we employed statistical approaches to establish qualitative agreement with regards to the feature selection in the experimental scenarios 1 and 3. The synergy between the QPSO and Fuzzy k-NN in the proposed Q-Fuzzy approach improves classification accuracy as manifest in the reduction in the number of cell features, which is crucial for effective cervical cancer detection and diagnosis.

  16. A machine vision system for automated non-invasive assessment of cell viability via dark field microscopy, wavelet feature selection and classification

    Friehs Karl

    2008-10-01

    Full Text Available Abstract. Background: Cell viability is one of the basic properties indicating the physiological state of the cell; thus, it has long been one of the major considerations in biotechnological applications. Conventional methods for extracting information about cell viability usually need reagents to be applied on the targeted cells. These reagent-based techniques are reliable and versatile; however, some of them might be invasive and even toxic to the target cells. In support of automated noninvasive assessment of cell viability, a machine vision system has been developed. Results: This system is based on a supervised learning technique. It learns from images of certain kinds of cell populations and trains some classifiers. These trained classifiers are then employed to evaluate the images of given cell populations obtained via dark field microscopy. Wavelet decomposition is performed on the cell images. Energy and entropy are computed for each wavelet subimage as features. A feature selection algorithm is implemented to achieve better performance. Correlation between the results from the machine vision system and commonly accepted gold standards becomes stronger if wavelet features are utilized. The best performance is achieved with a selected subset of wavelet features. Conclusion: The machine vision system based on dark field microscopy in conjunction with supervised machine learning and wavelet feature selection automates the cell viability assessment, and yields comparable results to commonly accepted methods. Wavelet features are found to be suitable to describe the discriminative properties of the live and dead cells in viability classification. According to the analysis, live cells exhibit morphologically more details and are intracellularly more organized than dead ones, which display more homogeneous and diffuse gray values throughout the cells. Feature selection increases the system's performance. The reason lies in the fact that feature

  17. Sequence-Based Prediction of RNA-Binding Proteins Using Random Forest with Minimum Redundancy Maximum Relevance Feature Selection

    Xin Ma

    2015-01-01

    Full Text Available The prediction of RNA-binding proteins is one of the most challenging problems in computational biology. Although some studies have investigated this problem, the accuracy of prediction is still not sufficient. In this study, a highly accurate method was developed to predict RNA-binding proteins from amino acid sequences using random forests with the minimum redundancy maximum relevance (mRMR) method, followed by incremental feature selection (IFS). We incorporated conjoint triad features and three novel features: binding propensity (BP), nonbinding propensity (NBP), and evolutionary information combined with physicochemical properties (EIPP). The results showed that these novel features have important roles in improving the performance of the predictor. Using the mRMR-IFS method, our predictor achieved the best performance (86.62% accuracy and 0.737 Matthews correlation coefficient). High prediction accuracy and successful prediction performance suggested that our method can be a useful approach to identify RNA-binding proteins from sequence information.
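
    The mRMR ranking followed by incremental feature selection can be sketched as below. This is an illustration under several assumptions, not the paper's pipeline: features are discrete, mRMR uses the difference form (relevance minus mean redundancy), and the subset score is a simple lookup-table training accuracy standing in for the cross-validated random forest the abstract describes. The toy data and all names are hypothetical.

```python
from collections import Counter
from math import log2


def mutual_information(a, b):
    """Mutual information (in bits) between two discrete sequences."""
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum((c / n) * log2((c / n) / ((pa[x] / n) * (pb[v] / n)))
               for (x, v), c in pab.items())


def mrmr_ranking(columns, labels):
    """Greedy mRMR: repeatedly pick the feature with the highest
    relevance to the labels minus mean redundancy with those already picked."""
    relevance = [mutual_information(col, labels) for col in columns]
    remaining, ranked = list(range(len(columns))), []
    while remaining:
        def score(j):
            if not ranked:
                return relevance[j]
            redundancy = sum(mutual_information(columns[j], columns[s])
                             for s in ranked) / len(ranked)
            return relevance[j] - redundancy
        best = max(remaining, key=score)
        ranked.append(best)
        remaining.remove(best)
    return ranked


def incremental_feature_selection(ranked, evaluate):
    """Score the nested subsets ranked[:1], ranked[:2], ... and keep the
    smallest subset that reaches the best score."""
    best_subset, best_score = [], float("-inf")
    for size in range(1, len(ranked) + 1):
        current = evaluate(ranked[:size])
        if current > best_score:
            best_subset, best_score = ranked[:size], current
    return best_subset, best_score


def table_accuracy(subset, columns, labels):
    """Stand-in scorer (assumption): training accuracy of a lookup table that
    predicts the majority label for each observed feature pattern."""
    groups = {}
    for i, label in enumerate(labels):
        key = tuple(columns[j][i] for j in subset)
        groups.setdefault(key, Counter())[label] += 1
    return sum(max(c.values()) for c in groups.values()) / len(labels)


# Toy data (made up): feature 1 duplicates feature 0, feature 2 complements it.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
columns = [
    [0, 0, 0, 1, 1, 1, 1, 1],
    [0, 0, 0, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 1, 1, 1, 1],
]
ranked = mrmr_ranking(columns, labels)
subset, best = incremental_feature_selection(
    ranked, lambda s: table_accuracy(s, columns, labels))
print(ranked, subset, best)
```

    The duplicate feature is ranked last because its redundancy with an already selected feature cancels its relevance, and the incremental step then stops growing the subset once the score no longer improves.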

  18. Univariate/multivariate genome-wide association scans using data from families and unrelated samples.

    Lei Zhang

    2009-08-01

    Full Text Available As genome-wide association studies (GWAS) are becoming more popular, two approaches, among others, could be considered in order to improve statistical power for identifying genes contributing subtle to moderate effects to human diseases. The first approach is to increase sample size, which could be achieved by combining both unrelated and familial subjects together. The second approach is to jointly analyze multiple correlated traits. In this study, by extending generalized estimating equations (GEEs), we propose a simple approach for performing univariate or multivariate association tests for the combined data of unrelated subjects and nuclear families. In particular, we correct for population stratification by integrating principal component analysis and transmission disequilibrium test strategies. The proposed method allows for multiple siblings as well as missing parental information. Simulation studies show that the proposed test has improved power compared to two popular methods, EIGENSTRAT and FBAT, by analyzing the combined data, while correcting for population stratification. In addition, joint analysis of bivariate traits has improved power over univariate analysis when pleiotropic effects are present. Application to the Genetic Analysis Workshop 16 (GAW16) data sets attests to the feasibility and applicability of the proposed method.

  19. An application of locally linear model tree algorithm with combination of feature selection in credit scoring

    Siami, Mohammad; Gholamian, Mohammad Reza; Basiri, Javad

    2014-10-01

    Nowadays, credit scoring is one of the most important topics in the banking sector. Credit scoring models have been widely used to facilitate the process of credit assessment. In this paper, the locally linear model tree algorithm (LOLIMOT) was applied to evaluate the superiority of its performance in predicting customers' credit status. The algorithm is adapted to the credit scoring domain by means of data fusion and feature selection techniques. Two real world credit data sets - Australian and German - from the UCI machine learning database were selected to demonstrate the performance of our new classifier. The analytical results indicate that the improved LOLIMOT significantly increases the prediction accuracy.

  20. Tumor recognition in wireless capsule endoscopy images using textural features and SVM-based feature selection.

    Li, Baopu; Meng, Max Q-H

    2012-05-01

    Tumor in digestive tract is a common disease and wireless capsule endoscopy (WCE) is a relatively new technology to examine diseases for digestive tract especially for small intestine. This paper addresses the problem of automatic recognition of tumor for WCE images. Candidate color texture feature that integrates uniform local binary pattern and wavelet is proposed to characterize WCE images. The proposed features are invariant to illumination change and describe multiresolution characteristics of WCE images. Two feature selection approaches based on support vector machine, sequential forward floating selection and recursive feature elimination, are further employed to refine the proposed features for improving the detection accuracy. Extensive experiments validate that the proposed computer-aided diagnosis system achieves a promising tumor recognition accuracy of 92.4% in WCE images on our collected data.

  1. Fuzzy Mutual Information Based min-Redundancy and Max-Relevance Heterogeneous Feature Selection

    Daren Yu

    2011-08-01

    Feature selection is an important preprocessing step in pattern classification and machine learning, and mutual information is widely used to measure relevance between features and decision. However, it is difficult to directly calculate relevance between continuous or fuzzy features using mutual information. In this paper we introduce the fuzzy information entropy and fuzzy mutual information for computing relevance between numerical or fuzzy features and decision. The relationship between fuzzy information entropy and differential entropy is also discussed. Moreover, we combine fuzzy mutual information with "min-Redundancy-Max-Relevance", "Max-Dependency" and "min-Redundancy-Max-Dependency" algorithms. The performance and stability of the proposed algorithms are tested on benchmark data sets. Experimental results show the proposed algorithms are effective and stable.
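
    To make the min-Redundancy-Max-Relevance idea concrete, the following is a minimal sketch of greedy mRMR selection using ordinary (crisp) mutual information on discrete features. It does not implement the fuzzy mutual information introduced in the paper; the function names, the "relevance minus mean redundancy" scoring form, and the toy data are assumptions chosen for illustration.

```python
from collections import Counter
from math import log2


def mutual_information(x, y):
    """Mutual information (in bits) between two equally long discrete sequences."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    # I(X;Y) = sum_{a,b} p(a,b) * log2( p(a,b) / (p(a) p(b)) )
    return sum((c / n) * log2(c * n / (px[a] * py[b])) for (a, b), c in pxy.items())


def mrmr(features, target, k):
    """Greedy mRMR: repeatedly pick the feature maximising relevance minus mean redundancy.

    `features` maps a feature name to its list of discrete values.
    """
    relevance = {f: mutual_information(v, target) for f, v in features.items()}
    selected, remaining = [], list(features)
    while remaining and len(selected) < k:
        def score(f):
            if not selected:
                return relevance[f]
            redundancy = sum(
                mutual_information(features[f], features[s]) for s in selected
            ) / len(selected)
            return relevance[f] - redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical toy data: "a_dup" duplicates "a"; "b" is weakly relevant but not redundant.
features = {
    "a":     [0, 0, 0, 0, 1, 1, 1, 1],
    "a_dup": [0, 0, 0, 0, 1, 1, 1, 1],
    "b":     [0, 0, 1, 1, 0, 0, 1, 1],
}
target = [0, 0, 0, 1, 1, 1, 1, 1]
chosen = mrmr(features, target, 2)
print(chosen)
```

    Ranking by relevance alone would pick "a" and its duplicate; the redundancy penalty is what makes the second pick fall on "b" instead.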

  2. Feature-selective Attention in Frontoparietal Cortex: Multivoxel Codes Adjust to Prioritize Task-relevant Information.

    Jackson, Jade; Rich, Anina N; Williams, Mark A; Woolgar, Alexandra

    2017-02-01

    Human cognition is characterized by astounding flexibility, enabling us to select appropriate information according to the objectives of our current task. A circuit of frontal and parietal brain regions, often referred to as the frontoparietal attention network or multiple-demand (MD) regions, are believed to play a fundamental role in this flexibility. There is evidence that these regions dynamically adjust their responses to selectively process information that is currently relevant for behavior, as proposed by the "adaptive coding hypothesis" [Duncan, J. An adaptive coding model of neural function in prefrontal cortex. Nature Reviews Neuroscience, 2, 820-829, 2001]. Could this provide a neural mechanism for feature-selective attention, the process by which we preferentially process one feature of a stimulus over another? We used multivariate pattern analysis of fMRI data during a perceptually challenging categorization task to investigate whether the representation of visual object features in the MD regions flexibly adjusts according to task relevance. Participants were trained to categorize visually similar novel objects along two orthogonal stimulus dimensions (length/orientation) and performed short alternating blocks in which only one of these dimensions was relevant. We found that multivoxel patterns of activation in the MD regions encoded the task-relevant distinctions more strongly than the task-irrelevant distinctions: The MD regions discriminated between stimuli of different lengths when length was relevant and between the same objects according to orientation when orientation was relevant. The data suggest a flexible neural system that adjusts its representation of visual objects to preferentially encode stimulus features that are currently relevant for behavior, providing a neural mechanism for feature-selective attention.

  3. Feature-selective attention in healthy old age: a selective decline in selective attention?

    Quigley, Cliodhna; Müller, Matthias M

    2014-02-12

    Deficient selection against irrelevant information has been proposed to underlie age-related cognitive decline. We recently reported evidence for maintained early sensory selection when older and younger adults used spatial selective attention to perform a challenging task. Here we explored age-related differences when spatial selection is not possible and feature-selective attention must be deployed. We additionally compared the integrity of feedforward processing by exploiting the well established phenomenon of suppression of visual cortical responses attributable to interstimulus competition. Electroencephalogram was measured while older and younger human adults responded to brief occurrences of coherent motion in an attended stimulus composed of randomly moving, orientation-defined, flickering bars. Attention was directed to horizontal or vertical bars by a pretrial cue, after which two orthogonally oriented, overlapping stimuli or a single stimulus were presented. Horizontal and vertical bars flickered at different frequencies and thereby elicited separable steady-state visual-evoked potentials, which were used to examine the effect of feature-based selection and the competitive influence of a second stimulus on ongoing visual processing. Age differences were found in feature-selective attentional modulation of visual responses: older adults did not show consistent modulation of magnitude or phase. In contrast, the suppressive effect of a second stimulus was robust and comparable in magnitude across age groups, suggesting that bottom-up processing of the current stimuli is essentially unchanged in healthy old age. Thus, it seems that visual processing per se is unchanged, but top-down attentional control is compromised in older adults when space cannot be used to guide selection.

  4. Spectral characteristics and feature selection of satellite remote sensing data for climate and anthropogenic changes assessment in Bucharest area

    Zoran, Maria; Savastru, Roxana; Savastru, Dan; Tautan, Marina; Miclos, Sorin; Cristescu, Luminita; Carstea, Elfrida; Baschir, Laurentiu

    2010-05-01

    Urban systems play a vital role in social and economic development in all countries. Their environmental changes can be investigated on different spatial and temporal scales. Urban and peri-urban environment dynamics are of great interest for future planning and decision making as well as in the frame of local and regional changes. Changes in urban land cover include changes in biotic diversity, actual and potential primary productivity, soil quality, runoff, and sedimentation rates, and cannot be well understood without knowledge of the land use change that drives them. The study focuses on the assessment of changes in environmental features for the Bucharest metropolitan area, Romania, using satellite remote sensing and in-situ monitoring data. Rational feature selection from the variety of spectral channels in the optical wavelengths of the electromagnetic spectrum (VIS and NIR) is very important for effective analysis and information extraction from remote sensing data. Based on a comprehensive analysis of the spectral characteristics of remote sensing data, it is possible to derive environmental changes in urban areas. The information quantity contained in a band is an important parameter in evaluating the band. The deviation and entropy are often used to show the amount of information. Feature selection is one of the most important steps in recognition and classification of remote sensing images. Therefore, it is necessary to select features before classification. The optimal features are those that can be used to distinguish objects easily and correctly. Three factors, namely the information quantity of bands, the correlation between bands and the spectral characteristic (e.g. absorption specialty) of classified objects in the Bucharest test area, have been considered in our study. As the spectral characteristic of an object is influenced by many factors, making it difficult to define optimal feature parameters to distinguish all the objects in a whole area, a method of multi-level feature selection

  5. Genetic algorithm based feature selection combined with dual classification for the automated detection of proliferative diabetic retinopathy.

    Welikala, R A; Fraz, M M; Dehmeshki, J; Hoppe, A; Tah, V; Mann, S; Williamson, T H; Barman, S A

    2015-07-01

    Proliferative diabetic retinopathy (PDR) is a condition that carries a high risk of severe visual impairment. The hallmark of PDR is the growth of abnormal new vessels. In this paper, an automated method for the detection of new vessels from retinal images is presented. This method is based on a dual classification approach. Two vessel segmentation approaches are applied to create two separate binary vessel maps, each of which holds vital information. Local morphology features are measured from each binary vessel map to produce two separate 4-D feature vectors. Independent classification is performed for each feature vector using a support vector machine (SVM) classifier. The system then combines these individual outcomes to produce a final decision. This is followed by the creation of additional features to generate 21-D feature vectors, which feed into a genetic algorithm based feature selection approach with the objective of finding feature subsets that improve the performance of the classification. Sensitivity and specificity results using a dataset of 60 images are 0.9138 and 0.9600, respectively, on a per patch basis and 1.000 and 0.975, respectively, on a per image basis. Copyright © 2015 Elsevier Ltd. All rights reserved.

  6. A novel computer-aided diagnosis system for breast MRI based on feature selection and ensemble learning.

    Lu, Wei; Li, Zhe; Chu, Jinghui

    2017-04-01

    Breast cancer is a common cancer among women. With the development of modern medical science and information technology, medical imaging techniques have an increasingly important role in the early detection and diagnosis of breast cancer. In this paper, we propose an automated computer-aided diagnosis (CADx) framework for magnetic resonance imaging (MRI). The scheme consists of an ensemble of several machine learning-based techniques, including ensemble under-sampling (EUS) for imbalanced data processing, the Relief algorithm for feature selection, the subspace method for providing data diversity, and Adaboost for improving the performance of base classifiers. We extracted morphological, various texture, and Gabor features. To clarify the feature subsets' physical meaning, subspaces are built by combining morphological features with each kind of texture or Gabor feature. We tested our proposal using a manually segmented Region of Interest (ROI) data set, which contains 438 images of malignant tumors and 1898 images of normal tissues or benign tumors. Our proposal achieves an area under the ROC curve (AUC) value of 0.9617, which outperforms most other state-of-the-art breast MRI CADx systems. Compared with other methods, our proposal significantly reduces the false-positive classification rate. Copyright © 2017 Elsevier Ltd. All rights reserved.

  7. Robust assignment of cancer subtypes from expression data using a uni-variate gene expression average as classifier

    Lauss, Martin; Frigyesi, Attila; Ryden, Tobias; Höglund, Mattias

    2010-01-01

    Genome wide gene expression data is a rich source for the identification of gene signatures suitable for clinical purposes and a number of statistical algorithms have been described for both identification and evaluation of such signatures. Some employed algorithms are fairly complex and hence sensitive to over-fitting whereas others are more simple and straight forward. Here we present a new type of simple algorithm based on ROC analysis and the use of metagenes that we believe will be a good complement to existing algorithms. The basis for the proposed approach is the use of metagenes, instead of collections of individual genes, and a feature selection using AUC values obtained by ROC analysis. Each gene in a data set is assigned an AUC value relative to the tumor class under investigation and the genes are ranked according to these values. Metagenes are then formed by calculating the mean expression level for an increasing number of ranked genes, and the metagene expression value that optimally discriminates tumor classes in the training set is used for classification of new samples. The performance of the metagene is then evaluated using LOOCV and balanced accuracies. We show that the simple uni-variate gene expression average algorithm performs as well as several alternative algorithms such as discriminant analysis and the more complex approaches such as SVM and neural networks. The R package rocc is freely available at http://cran.r-project.org/web/packages/rocc/index.html
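
    The ranking-and-averaging procedure described in this abstract can be sketched in a few lines. The sketch below assigns each gene an AUC, ranks genes by it, and builds metagenes as the mean expression of the top-k genes, keeping the k whose metagene separates the classes best. It is not the rocc package: the function names and toy data are hypothetical, only genes that are higher in the positive class are handled, and the choice of a classification threshold and the LOOCV evaluation mentioned in the abstract are omitted.

```python
def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def rank_genes_by_auc(expr, labels):
    """Return gene names sorted by decreasing AUC, plus the per-gene AUC values."""
    aucs = {gene: auc(values, labels) for gene, values in expr.items()}
    return sorted(aucs, key=lambda g: aucs[g], reverse=True), aucs


def metagene(expr, genes):
    """Per-sample mean expression over the given genes."""
    n_samples = len(expr[genes[0]])
    return [sum(expr[g][i] for g in genes) / len(genes) for i in range(n_samples)]


def best_metagene_size(expr, labels, ranked):
    """Pick the number of top-ranked genes whose metagene has the highest AUC."""
    best_k, best_auc = 0, -1.0
    for k in range(1, len(ranked) + 1):
        a = auc(metagene(expr, ranked[:k]), labels)
        if a > best_auc:  # strict '>' prefers the smaller metagene among ties
            best_k, best_auc = k, a
    return best_k, best_auc


# Hypothetical toy expression data: six samples, the first three from the tumor class.
labels = [1, 1, 1, 0, 0, 0]
expr = {
    "g1": [5, 6, 7, 1, 2, 3],
    "g2": [4, 1, 5, 2, 3, 0],
    "g3": [1, 2, 3, 4, 5, 6],
}
ranked, aucs = rank_genes_by_auc(expr, labels)
best_k, best_auc = best_metagene_size(expr, labels, ranked)
print(ranked, best_k, best_auc)
```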

  8. A Guideline to Univariate Statistical Analysis for LC/MS-Based Untargeted Metabolomics-Derived Data

    Maria Vinaixa

    2012-10-01

    Several metabolomic software programs provide methods for peak picking, retention time alignment and quantification of metabolite features in LC/MS-based metabolomics. Statistical analysis, however, is needed in order to discover those features significantly altered between samples. By comparing the retention time and MS/MS data of a model compound to that from the altered feature of interest in the research sample, metabolites can then be unequivocally identified. This paper reports on a comprehensive overview of a workflow for statistical analysis to rank relevant metabolite features that will be selected for further MS/MS experiments. We focus on univariate data analysis applied in parallel on all detected features. Characteristics and challenges of this analysis are discussed and illustrated using four different real LC/MS untargeted metabolomic datasets. We demonstrate the influence of considering or violating the mathematical assumptions on which univariate statistical tests rely, using high-dimensional LC/MS datasets. Issues in data analysis such as determination of sample size, analytical variation, assumption of normality and homoscedasticity, or correction for multiple testing are discussed and illustrated in the context of our four untargeted LC/MS working examples.
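
    Because a univariate test is applied in parallel to every detected feature, the resulting p-values need a multiple-testing correction. The abstract does not say which correction the authors use; the sketch below shows the Benjamini-Hochberg false discovery rate procedure as one common choice, with hypothetical p-values and an assumed function name.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure.

    Returns a list of booleans, True where the corresponding feature is declared
    significant at false discovery rate `alpha`.
    """
    m = len(p_values)
    # Feature indices sorted by ascending p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Largest rank k such that p_(k) <= (k / m) * alpha.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff_rank = rank
    # Every feature ranked at or below the cutoff is declared significant.
    significant = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff_rank:
            significant[idx] = True
    return significant


# Hypothetical p-values, one per metabolite feature.
p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
flags = benjamini_hochberg(p, alpha=0.05)
print(flags)
```

    With these ten values, only the two smallest p-values pass, although five of them are below the uncorrected 0.05 level.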

  9. A comparison of bivariate and univariate QTL mapping in livestock populations

    Sorensen Daniel

    2003-11-01

    This study presents a multivariate, variance component-based QTL mapping model implemented via restricted maximum likelihood (REML). The method was applied to investigate bivariate and univariate QTL mapping analyses, using simulated data. Specifically, we report results on the statistical power to detect a QTL and on the precision of parameter estimates using univariate and bivariate approaches. The model and methodology were also applied to study the effectiveness of partitioning the overall genetic correlation between two traits into a component due to many genes of small effect, and one due to the QTL. It is shown that when the QTL has a pleiotropic effect on two traits, a bivariate analysis leads to a higher statistical power of detecting the QTL and to a more precise estimate of the QTL's map position, in particular in the case when the QTL has a small effect on the trait. The increase in power is most marked in cases where the contributions of the QTL and of the polygenic components to the genetic correlation have opposite signs. The bivariate REML analysis can successfully partition the two components contributing to the genetic correlation between traits.

  10. A Systematic Evaluation of Feature Selection and Classification Algorithms Using Simulated and Real miRNA Sequencing Data

    Sheng Yang

    2015-01-01

    Sequencing is widely used to discover associations between microRNAs (miRNAs) and diseases. However, the negative binomial distribution (NB) and high dimensionality of data obtained using sequencing can lead to low-power results and low reproducibility. Several statistical learning algorithms have been proposed to address sequencing data, and although evaluation of these methods is essential, such studies are relatively rare. The performance of seven feature selection (FS) algorithms, including baySeq, DESeq, edgeR, the rank sum test, lasso, particle swarm optimistic decision tree, and random forest (RF), was compared by simulation under different conditions based on the difference of the mean, the dispersion parameter of the NB, and the signal to noise ratio. Real data were used to evaluate the performance of RF, logistic regression, and support vector machine. Based on the simulation and real data, we discuss the behaviour of the FS and classification algorithms. The Apriori algorithm identified frequent item sets (mir-133a, mir-133b, mir-183, mir-937, and mir-96) from among the deregulated miRNAs of six datasets from The Cancer Genomics Atlas. Taking these findings altogether and considering computational memory requirements, we propose a strategy that combines edgeR and DESeq for large sample sizes.

  11. An unsupervised technique for optimal feature selection in attribute profiles for spectral-spatial classification of hyperspectral images

    Bhardwaj, Kaushal; Patra, Swarnajyoti

    2018-04-01

    Inclusion of spatial information along with spectral features plays a significant role in classification of remote sensing images. Attribute profiles have already proved their ability to represent spatial information. In order to incorporate proper spatial information, multiple attributes are required, and for each attribute large profiles need to be constructed by varying the filter parameter values within a wide range. Thus, the constructed profiles that represent spectral-spatial information of a hyperspectral image have a huge dimension, which leads to the Hughes phenomenon and increases the computational burden. To mitigate these problems, this work presents an unsupervised feature selection technique that selects a subset of filtered images from the constructed high dimensional multi-attribute profile which are sufficiently informative to discriminate well among classes. In this regard the proposed technique exploits genetic algorithms (GAs). The fitness function of the GAs is defined in an unsupervised way with the help of mutual information. The effectiveness of the proposed technique is assessed using a one-against-all support vector machine classifier. The experiments conducted on three hyperspectral data sets show the robustness of the proposed method in terms of computation time and classification accuracy.

  12. Feature Selection for Object-Based Classification of High-Resolution Remote Sensing Images Based on the Combination of a Genetic Algorithm and Tabu Search

    Lei Shi

    2018-01-01

    In object-based image analysis of high-resolution images, the number of features can reach hundreds, so it is necessary to perform feature reduction prior to classification. In this paper, a feature selection method based on the combination of a genetic algorithm (GA) and tabu search (TS) is presented. The proposed GATS method aims to reduce the premature convergence of the GA by the use of TS. A prematurity index is first defined to judge the convergence situation during the search. When premature convergence does take place, an improved mutation operator is executed, in which TS is performed on individuals with higher fitness values. As for the other individuals with lower fitness values, mutation with a higher probability is carried out. Experiments using the proposed GATS feature selection method and three other methods, a standard GA, the multistart TS method, and ReliefF, were conducted on WorldView-2 and QuickBird images. The experimental results showed that the proposed method outperforms the other methods in terms of the final classification accuracy.

  13. Feature Selection for Object-Based Classification of High-Resolution Remote Sensing Images Based on the Combination of a Genetic Algorithm and Tabu Search

    Shi, Lei; Wan, Youchuan; Gao, Xianjun

    2018-01-01

    In object-based image analysis of high-resolution images, the number of features can reach hundreds, so it is necessary to perform feature reduction prior to classification. In this paper, a feature selection method based on the combination of a genetic algorithm (GA) and tabu search (TS) is presented. The proposed GATS method aims to reduce the premature convergence of the GA by the use of TS. A prematurity index is first defined to judge the convergence situation during the search. When premature convergence does take place, an improved mutation operator is executed, in which TS is performed on individuals with higher fitness values. As for the other individuals with lower fitness values, mutation with a higher probability is carried out. Experiments using the proposed GATS feature selection method and three other methods, a standard GA, the multistart TS method, and ReliefF, were conducted on WorldView-2 and QuickBird images. The experimental results showed that the proposed method outperforms the other methods in terms of the final classification accuracy. PMID:29581721

  14. Novel Mahalanobis-based feature selection improves one-class classification of early hepatocellular carcinoma.

    Thomaz, Ricardo de Lima; Carneiro, Pedro Cunha; Bonin, João Eliton; Macedo, Túlio Augusto Alves; Patrocinio, Ana Claudia; Soares, Alcimar Barbosa

    2018-05-01

    Detection of early hepatocellular carcinoma (HCC) is responsible for increasing survival rates by up to 40%. One-class classifiers can be used for modeling early HCC in multidetector computed tomography (MDCT), but demand specific knowledge of the set of features that best describes the target class. Although the literature outlines several features for characterizing liver lesions, it is unclear which is most relevant for describing early HCC. In this paper, we introduce an unconstrained GA feature selection algorithm based on a multi-objective Mahalanobis fitness function to improve the classification performance for early HCC. We compared our approach to a constrained Mahalanobis function and two other unconstrained functions using Welch's t-test and Gaussian Data Descriptors. The performance of each fitness function was evaluated by cross-validating a one-class SVM. The results show that the proposed multi-objective Mahalanobis fitness function is capable of significantly reducing data dimensionality (96.4%) and improving one-class classification of early HCC (0.84 AUC). Furthermore, the results provide strong evidence that intensity features extracted at the arterial to portal and arterial to equilibrium phases are important for classifying early HCC.

  15. Audio-visual synchrony and feature-selective attention co-amplify early visual processing.

    Keitel, Christian; Müller, Matthias M

    2016-05-01

    Our brain relies on neural mechanisms of selective attention and converging sensory processing to efficiently cope with rich and unceasing multisensory inputs. One prominent assumption holds that audio-visual synchrony can act as a strong attractor for spatial attention. Here, we tested for a similar effect of audio-visual synchrony on feature-selective attention. We presented two superimposed Gabor patches that differed in colour and orientation. On each trial, participants were cued to selectively attend to one of the two patches. Over time, spatial frequencies of both patches varied sinusoidally at distinct rates (3.14 and 3.63 Hz), giving rise to pulse-like percepts. A simultaneously presented pure tone carried a frequency modulation at the pulse rate of one of the two visual stimuli to introduce audio-visual synchrony. Pulsed stimulation elicited distinct time-locked oscillatory electrophysiological brain responses. These steady-state responses were quantified in the spectral domain to examine individual stimulus processing under conditions of synchronous versus asynchronous tone presentation and when respective stimuli were attended versus unattended. We found that both, attending to the colour of a stimulus and its synchrony with the tone, enhanced its processing. Moreover, both gain effects combined linearly for attended in-sync stimuli. Our results suggest that audio-visual synchrony can attract attention to specific stimulus features when stimuli overlap in space.

  16. Inhibitory Control of Feature Selectivity in an Object Motion Sensitive Circuit of the Retina

    Tahnbee Kim

    2017-05-01

    Object motion sensitive (OMS) W3-retinal ganglion cells (W3-RGCs) in mice respond to local movements in a visual scene but remain silent during self-generated global image motion. The excitatory inputs that drive responses of W3-RGCs to local motion were recently characterized, but which inhibitory neurons suppress W3-RGCs’ responses to global motion, how these neurons encode motion information, and how their connections are organized along the excitatory circuit axis remains unknown. Here, we find that a genetically identified amacrine cell (AC) type, TH2-AC, exhibits fast responses to global motion and slow responses to local motion. Optogenetic stimulation shows that TH2-ACs provide strong GABAA receptor-mediated input to W3-RGCs but only weak input to upstream excitatory neurons. Cell-type-specific silencing reveals that temporally coded inhibition from TH2-ACs cancels W3-RGC spike responses to global but not local motion stimuli and, thus, controls the feature selectivity of OMS signals sent to the brain.

  17. Feature Import Vector Machine: A General Classifier with Flexible Feature Selection.

    Ghosh, Samiran; Wang, Yazhen

    2015-02-01

    The support vector machine (SVM) and other reproducing kernel Hilbert space (RKHS) based classifier systems have drawn much attention recently due to their robustness and generalization capability. The general theme here is to construct classifiers based on the training data in a high dimensional space by using all available dimensions. The SVM achieves huge data compression by selecting only the few observations which lie close to the boundary of the classifier function. However, when the number of observations is not very large (small n) but the number of dimensions/features is large (large p), it is not necessarily the case that all available features are of equal importance in the classification context. Selecting a useful fraction of the available features may result in huge data compression. In this paper we propose an algorithmic approach by means of which such an optimal set of features could be selected. In short, we reverse the traditional sequential observation selection strategy of SVM to that of sequential feature selection. To achieve this we have modified the solution proposed by Zhu and Hastie (2005) in the context of the import vector machine (IVM), to select an optimal sub-dimensional model to build the final classifier with sufficient accuracy.

  18. Feature Selection with Conjunctions of Decision Stumps and Learning from Microarray Data.

    Shah, M; Marchand, M; Corbeil, J

    2012-01-01

    One of the objectives of designing feature selection learning algorithms is to obtain classifiers that depend on a small number of attributes and have verifiable future performance guarantees. There are few, if any, approaches that successfully address the two goals simultaneously. To the best of our knowledge, such algorithms that give theoretical bounds on the future performance have not been proposed so far in the context of the classification of gene expression data. In this work, we investigate the premise of learning a conjunction (or disjunction) of decision stumps in Occam's Razor, Sample Compression, and PAC-Bayes learning settings for identifying a small subset of attributes that can be used to perform reliable classification tasks. We apply the proposed approaches for gene identification from DNA microarray data and compare our results to those of the well-known successful approaches proposed for the task. We show that our algorithm not only finds hypotheses with a much smaller number of genes while giving competitive classification accuracy but also has tight risk guarantees on future performance, unlike other approaches. The proposed approaches are general and extensible in terms of both designing novel algorithms and application to other domains.

  19. Genetic Particle Swarm Optimization–Based Feature Selection for Very-High-Resolution Remotely Sensed Imagery Object Change Detection

    Chen, Qiang; Chen, Yunhao; Jiang, Weiguo

    2016-01-01

    In the field of multiple features Object-Based Change Detection (OBCD) for very-high-resolution remotely sensed images, image objects have abundant features and feature selection affects the precision and efficiency of OBCD. Through object-based image analysis, this paper proposes a Genetic Particle Swarm Optimization (GPSO)-based feature selection algorithm to solve the optimization problem of feature selection in multiple features OBCD. We select the Ratio of Mean to Variance (RMV) as the fitness function of GPSO, and apply the proposed algorithm to the object-based hybrid multivariate alternative detection model. Two experiment cases on Worldview-2/3 images confirm that GPSO can significantly improve the speed of convergence, and effectively avoid the problem of premature convergence, relative to other feature selection algorithms. According to the accuracy evaluation of OBCD, GPSO is superior at overall accuracy (84.17% and 83.59%) and Kappa coefficient (0.6771 and 0.6314) than other algorithms. Moreover, the sensitivity analysis results show that the proposed algorithm is not easily influenced by the initial parameters, but the number of features to be selected and the size of the particle swarm would affect the algorithm. The comparison experiment results reveal that RMV is more suitable than other functions as the fitness function of GPSO-based feature selection algorithm. PMID:27483285

  20. Genetic Particle Swarm Optimization-Based Feature Selection for Very-High-Resolution Remotely Sensed Imagery Object Change Detection.

    Chen, Qiang; Chen, Yunhao; Jiang, Weiguo

    2016-07-30

    In the field of multiple features Object-Based Change Detection (OBCD) for very-high-resolution remotely sensed images, image objects have abundant features and feature selection affects the precision and efficiency of OBCD. Through object-based image analysis, this paper proposes a Genetic Particle Swarm Optimization (GPSO)-based feature selection algorithm to solve the optimization problem of feature selection in multiple features OBCD. We select the Ratio of Mean to Variance (RMV) as the fitness function of GPSO, and apply the proposed algorithm to the object-based hybrid multivariate alternative detection model. Two experiment cases on Worldview-2/3 images confirm that GPSO can significantly improve the speed of convergence, and effectively avoid the problem of premature convergence, relative to other feature selection algorithms. According to the accuracy evaluation of OBCD, GPSO is superior at overall accuracy (84.17% and 83.59%) and Kappa coefficient (0.6771 and 0.6314) than other algorithms. Moreover, the sensitivity analysis results show that the proposed algorithm is not easily influenced by the initial parameters, but the number of features to be selected and the size of the particle swarm would affect the algorithm. The comparison experiment results reveal that RMV is more suitable than other functions as the fitness function of GPSO-based feature selection algorithm.

  1. Reconstruction and analysis of transcription factor-miRNA co-regulatory feed-forward loops in human cancers using filter-wrapper feature selection.

    Chen Peng

    Full Text Available. BACKGROUND: As one of the most common types of co-regulatory motifs, feed-forward loops (FFLs) control many cell functions and play an important role in human cancers. Therefore, it is crucial to reconstruct and analyze cancer-related FFLs that are controlled by transcription factors (TFs) and microRNAs (miRNAs) simultaneously, in order to find out how miRNAs and TFs cooperate with each other in cancer cells and how they contribute to carcinogenesis. Current FFL studies rely on predicted regulation information and therefore suffer from false positives in their prediction results. More critically, FFLs generated by existing approaches cannot represent the dynamic and conditional regulation relationships under different experimental conditions. METHODOLOGY/PRINCIPAL FINDINGS: In this study, we proposed a novel filter-wrapper feature selection method to accurately identify co-regulatory mechanisms by incorporating prior information from predicted regulatory interactions with parallel miRNA/mRNA expression datasets. By applying this method, we reconstructed 208 and 110 TF-miRNA co-regulatory FFLs from human pan-cancer and prostate datasets, respectively. Further analysis of these cancer-related FFLs showed that the top-ranking TF STAT3 and miRNA hsa-let-7e are key regulators implicated in human cancers, whose regulated targets are significantly enriched in cellular process regulation and signaling pathways involved in carcinogenesis. CONCLUSIONS/SIGNIFICANCE: In this study, we introduced an efficient computational approach to reconstruct co-regulatory FFLs by accurately identifying gene co-regulatory interactions. The strength of the proposed feature selection method lies in the fact that it can precisely filter out false positives in predicted regulatory interactions by quantitatively modeling the complex co-regulation of target genes mediated by TFs and miRNAs simultaneously. Moreover, the proposed feature selection method can be generally applied to

  2. Univariate and Bivariate Empirical Mode Decomposition for Postural Stability Analysis

    Jacques Duchêne

    2008-05-01

    Full Text Available. The aim of this paper was to compare empirical mode decomposition (EMD) and two new extended methods of EMD named complex empirical mode decomposition (complex-EMD) and bivariate empirical mode decomposition (bivariate-EMD). All methods were used to analyze stabilogram center of pressure (COP) time series. The two new methods are suitable to be applied to complex time series to extract complex intrinsic mode functions (IMFs) before the Hilbert transform is subsequently applied on the IMFs. The trace of the analytic IMF in the complex plane has a circular form, with each IMF having its own rotation frequency. The area of the circle and the average rotation frequency of IMFs represent efficient indicators of the postural stability status of subjects. Experimental results show the effectiveness of these indicators to identify differences in standing posture between groups.

  3. Unsupervised Feature Selection for Interval Ordered Information Systems

    闫岳君; 代建华

    2017-01-01

    Many unsupervised feature selection methods have been proposed for single-valued information systems, but little research focuses on unsupervised feature selection for interval-valued information systems. In this paper, a fuzzy dominance relation is proposed for interval ordered information systems. Based on this relation, fuzzy rank information entropy and fuzzy rank mutual information are extended to evaluate the importance of features. An unsupervised feature selection method is then designed based on an unsupervised maximum information and minimum redundancy (UmImR) criterion, which takes both the amount of information and the redundancy into account. Experimental results demonstrate the effectiveness of the proposed method.

  4. Intelligent feature selection techniques for pattern classification of Lamb wave signals

    Hinders, Mark K.; Miller, Corey A.

    2014-01-01

    Lamb wave interaction with flaws is a complex, three-dimensional phenomenon, which often frustrates signal interpretation schemes based on mode arrival time shifts predicted by dispersion curves. As the flaw severity increases, scattering and mode conversion effects will often dominate the time-domain signals, obscuring available information about flaws because multiple modes may arrive on top of each other. Even for idealized flaw geometries the scattering and mode conversion behavior of Lamb waves is very complex. Here, multi-mode Lamb waves in a metal plate are propagated across a rectangular flat-bottom hole in a sequence of pitch-catch measurements corresponding to the double crosshole tomography geometry. The flaw is sequentially deepened, with the Lamb wave measurements repeated at each flaw depth. Lamb wave tomography reconstructions are used to identify which waveforms have interacted with the flaw and thereby carry information about its depth. Multiple features are extracted from each of the Lamb wave signals using wavelets, which are then fed to statistical pattern classification algorithms that identify flaw severity. In order to achieve the highest classification accuracy, an optimal feature space is required but it’s never known a priori which features are going to be best. For structural health monitoring we make use of the fact that physical flaws, such as corrosion, will only increase over time. This allows us to identify feature vectors which are topologically well-behaved by requiring that sequential classes “line up” in feature vector space. An intelligent feature selection routine is illustrated that identifies favorable class distributions in multi-dimensional feature spaces using computational homology theory. Betti numbers and formal classification accuracies are calculated for each feature space subset to establish a correlation between the topology of the class distribution and the corresponding classification accuracy

  5. Improved feature selection based on genetic algorithms for real time disruption prediction on JET

    Rattá, G.A.; Vega, J.; Murari, A.

    2012-01-01

    Highlights: ► A new signal selection methodology to improve disruption prediction is reported. ► The approach is based on Genetic Algorithms. ► An advanced predictor has been created with the new set of signals. ► The new system obtains considerably higher prediction rates. - Abstract: The early prediction of disruptions is an important aspect of the research in the field of Tokamak control. A very recent predictor, called “Advanced Predictor Of Disruptions” (APODIS), developed for the “Joint European Torus” (JET), implements the real time recognition of incoming disruptions with the best success rate ever achieved and an outstanding stability for long periods following training. In this article, a new methodology to select the set of the signals’ parameters in order to maximize the performance of the predictor is reported. The approach is based on “Genetic Algorithms” (GAs). With the feature selection derived from GAs, a new version of APODIS has been developed. The results are significantly better than those of the previous version, not only in terms of success rates but also in extending the interval before the disruption in which reliable predictions are achieved. Correct disruption predictions with a success rate in excess of 90% have been achieved 200 ms before the time of the disruption. The predictor response is compared with that of JET's Protection System (JPS) and the APODIS predictor is shown to be far superior. Both systems have been carefully tested with a large number of discharges to understand their relative merits and the most profitable directions of further improvements.

  6. Improved feature selection based on genetic algorithms for real time disruption prediction on JET

    Ratta, G.A., E-mail: garatta@gateme.unsj.edu.ar [GATEME, Facultad de Ingenieria, Universidad Nacional de San Juan, Avda. San Martin 1109 (O), 5400 San Juan (Argentina); JET EFDA, Culham Science Centre, OX14 3DB Abingdon (United Kingdom); Vega, J. [Asociacion EURATOM/CIEMAT para Fusion, Avda. Complutense, 40, 28040 Madrid (Spain); JET EFDA, Culham Science Centre, OX14 3DB Abingdon (United Kingdom); Murari, A. [Associazione EURATOM-ENEA per la Fusione, Consorzio RFX, 4-35127 Padova (Italy); JET EFDA, Culham Science Centre, OX14 3DB Abingdon (United Kingdom)

    2012-09-15

    Highlights: ► A new signal selection methodology to improve disruption prediction is reported. ► The approach is based on Genetic Algorithms. ► An advanced predictor has been created with the new set of signals. ► The new system obtains considerably higher prediction rates. - Abstract: The early prediction of disruptions is an important aspect of the research in the field of Tokamak control. A very recent predictor, called 'Advanced Predictor Of Disruptions' (APODIS), developed for the 'Joint European Torus' (JET), implements the real time recognition of incoming disruptions with the best success rate ever achieved and an outstanding stability for long periods following training. In this article, a new methodology to select the set of the signals' parameters in order to maximize the performance of the predictor is reported. The approach is based on 'Genetic Algorithms' (GAs). With the feature selection derived from GAs, a new version of APODIS has been developed. The results are significantly better than those of the previous version, not only in terms of success rates but also in extending the interval before the disruption in which reliable predictions are achieved. Correct disruption predictions with a success rate in excess of 90% have been achieved 200 ms before the time of the disruption. The predictor response is compared with that of JET's Protection System (JPS) and the APODIS predictor is shown to be far superior. Both systems have been carefully tested with a large number of discharges to understand their relative merits and the most profitable directions of further improvements.

  7. Intrusion detection model using fusion of chi-square feature selection and multi class SVM

    Ikram Sumaiya Thaseen

    2017-10-01

    Full Text Available. Intrusion detection is a promising area of security research, given the rapid development of the internet in everyday life. Many intrusion detection systems (IDS) employ a single classifier algorithm for classifying network traffic as normal or abnormal. Due to the large amount of data, these single-classifier models fail to achieve a high attack detection rate with a reduced false alarm rate. However, by applying dimensionality reduction, data can be efficiently reduced to an optimal set of attributes without loss of information and then classified accurately using a multi-class modeling technique for identifying the different network attacks. In this paper, we propose an intrusion detection model using chi-square feature selection and a multi-class support vector machine (SVM). A parameter tuning technique is adopted to optimize the two important parameters required for the SVM model: the Radial Basis Function kernel parameter gamma ('γ') and the overfitting constant 'C'. The main idea behind this model is to construct a multi-class SVM, which has not been adopted for IDS so far, to decrease the training and testing time and increase the individual classification accuracy of the network attacks. The experimental results on the NSL-KDD dataset, an enhanced version of the KDDCup 1999 dataset, show that our proposed approach results in a better detection rate and a reduced false alarm rate. An experiment on the computational time required for training and testing is also carried out for usage in time-critical applications.
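
The chi-square filter step named in this abstract can be sketched in plain Python. This is only an illustration: the function names `chi_square_score` and `select_k_best`, the toy columns, and the restriction to categorical features are assumptions, not details taken from the paper, and the SVM stage with its gamma and C tuning is not shown.

```python
from collections import Counter


def chi_square_score(feature, labels):
    """Pearson chi-square statistic between one categorical feature and the class labels."""
    n = len(labels)
    observed = Counter(zip(feature, labels))   # joint counts per (value, class) cell
    feature_totals = Counter(feature)          # row totals
    label_totals = Counter(labels)             # column totals
    score = 0.0
    for value, value_count in feature_totals.items():
        for cls, cls_count in label_totals.items():
            expected = value_count * cls_count / n
            score += (observed[(value, cls)] - expected) ** 2 / expected
    return score


def select_k_best(columns, labels, k):
    """Rank feature names by chi-square score (highest first) and keep the top k."""
    ranked = sorted(columns,
                    key=lambda name: chi_square_score(columns[name], labels),
                    reverse=True)
    return ranked[:k]


# Hypothetical toy data: 'protocol' tracks the label exactly, 'flag' is independent of it.
labels = ["normal", "normal", "attack", "attack"]
columns = {
    "flag": ["x", "y", "x", "y"],
    "protocol": ["tcp", "tcp", "udp", "udp"],
}
print(select_k_best(columns, labels, 1))
```

A higher score means a stronger association with the class, so a filter of this kind keeps the top-ranked attributes before the classifier is trained.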

  8. Univariate and multivariate forecasting of hourly solar radiation with artificial intelligence techniques

    Sfetsos, A. [7 Pirsou Str., Athens (Greece); Coonick, A.H. [Imperial Coll. of Science Technology and Medicine, Dept. of Electrical and Electronic Engineering, London (United Kingdom)

    2000-07-01

    This paper introduces a new approach for the forecasting of mean hourly global solar radiation received by a horizontal surface. In addition to the traditional linear methods, several artificial-intelligence-based techniques are studied. These include linear, feed-forward, recurrent Elman and Radial Basis neural networks alongside the adaptive neuro-fuzzy inference scheme. The problem is examined initially for the univariate case, and is extended to include additional meteorological parameters in the process of estimating the optimum model. The results indicate that the developed artificial intelligence models predict the solar radiation time series more effectively compared to the conventional procedures based on the clearness index. The forecasting ability of some models can be further enhanced with the use of additional meteorological parameters. (Author)

  9. Information-theoretical feature selection using data obtained by Scanning Electron Microscopy coupled with an Energy Dispersive X-ray spectrometer for the classification of glass traces

    Ramos, Daniel; Zadora, Grzegorz

    2011-01-01

    Highlights: → A selection of the best features for multivariate forensic glass classification using SEM-EDX was performed. → The feature selection process was carried out by means of an exhaustive search, with an Empirical Cross-Entropy objective function. → Results show remarkable accuracy of the best variables selected following the proposed procedure for the task of classifying glass fragments into windows or containers. - Abstract: In this work, a selection of the best features for multivariate forensic glass classification using Scanning Electron Microscopy coupled with an Energy Dispersive X-ray spectrometer (SEM-EDX) has been performed. This has been motivated by the fact that the databases available for forensic glass classification are sparse nowadays, and the acquisition of SEM-EDX data is both costly and time-consuming for forensic laboratories. The database used for this work consists of 278 glass objects for which 7 variables, based on their elemental compositions obtained with SEM-EDX, are available. Two categories are considered for the classification task, namely containers and car/building windows, both of them typical in forensic casework. A multivariate model is proposed for the computation of the likelihood ratios. The feature selection process is carried out by means of an exhaustive search, with an Empirical Cross-Entropy (ECE) objective function. The ECE metric takes into account not only the discriminating power of the model in use, but also its calibration, which indicates whether or not the likelihood ratios are interpretable in a probabilistic way. Thus, the proposed model is applied to all the 63 possible univariate, bivariate and trivariate combinations taken from the 7 variables in the database, and its performance is ranked by its ECE. Results show remarkable accuracy of the best variables selected following the proposed procedure for the task of classifying glass fragments into windows (from cars or buildings) or containers
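
The figure of 63 candidate models follows from counting the subsets of size 1, 2 and 3 drawn from 7 variables: 7 + 21 + 35 = 63. A minimal Python sketch of that enumeration is given here; the variable labels and the `exhaustive_search` helper are hypothetical, and the scoring function is a placeholder rather than the paper's ECE computation.

```python
from itertools import combinations
from math import comb

# Hypothetical labels; the abstract does not name the 7 variables.
variables = [f"v{i}" for i in range(1, 8)]

# Enumerate every univariate, bivariate and trivariate subset.
subsets = [c for k in (1, 2, 3) for c in combinations(variables, k)]

# C(7,1) + C(7,2) + C(7,3) = 7 + 21 + 35
assert len(subsets) == comb(7, 1) + comb(7, 2) + comb(7, 3)
print(len(subsets))


def exhaustive_search(candidates, score):
    """Return the candidate with the lowest score (lower is better, as with a cross-entropy)."""
    return min(candidates, key=score)
```

With so few candidates, scoring every subset is cheap, which is what makes an exhaustive search practical in this setting.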

  10. Trend and forecasting rate of cancer deaths at a public university hospital using univariate modeling

    Ismail, A.; Hassan, Noor I.

    2013-09-01

    Cancer is one of the principal causes of death in Malaysia. This study was performed to determine the pattern of the rate of cancer deaths at a public hospital in Malaysia over an 11-year period from 2001 to 2011, to determine the best-fitting model for forecasting the rate of cancer deaths using Univariate Modeling, and to forecast the rates for the next two years (2012 to 2013). The medical records of patients with cancer who died after admission to this hospital over the 11-year period were reviewed, with a total of 663 cases. The cancers were classified according to the 10th Revision of the International Classification of Diseases (ICD-10). Data collected include the socio-demographic background of patients, such as registration number, age, gender, ethnicity, ward and diagnosis. Data entry and analysis were accomplished using SPSS 19.0 and Minitab 16.0. The five Univariate Models used were Naïve with Trend Model, Average Percent Change Model (ACPM), Single Exponential Smoothing, Double Exponential Smoothing and Holt's Method. The overall 11-year rate of cancer deaths showed that at this hospital, Malay patients have the highest percentage (88.10%) compared to other ethnic groups, with males (51.30%) higher than females. Lung and breast cancers account for the highest numbers of cancer deaths by gender. About 29.60% of the patients who died due to cancer were aged 61 years and above. The best Univariate Model for forecasting the rate of cancer deaths is the Single Exponential Smoothing Technique with an alpha of 0.10. The forecast for the rate of cancer deaths is horizontal (flat): the forecasted mortality rate remains at 6.84% from January 2012 to December 2013. Government agencies, the private sector and non-governmental organizations need to highlight issues on cancer, especially lung and breast cancers, to the public through campaigns using mass media, electronic media, posters and pamphlets in an attempt to decrease the rate of cancer deaths in Malaysia.
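
The flat forecast is a property of single exponential smoothing itself: the method tracks one level, s_t = alpha * y_t + (1 - alpha) * s_(t-1), and every future period is forecast as the last level. A minimal sketch follows, assuming the level is initialised with the first observation (one common convention; statistical packages may initialise differently). The function names and the sample rates are illustrative and are not the study's data.

```python
def single_exponential_smoothing(series, alpha):
    """Return the final smoothed level of the series."""
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    level = series[0]                      # assumed initialisation: first observation
    for y in series[1:]:
        level = alpha * y + (1 - alpha) * level
    return level


def forecast(series, alpha, horizon):
    """Forecast `horizon` future periods; SES repeats the last level for all of them."""
    level = single_exponential_smoothing(series, alpha)
    return [level] * horizon


# Illustrative monthly rates in percent (made up for this sketch).
rates = [6.5, 7.2, 6.9, 6.4, 7.0, 6.8]
print(forecast(rates, alpha=0.10, horizon=3))
```

A small alpha such as 0.10 weights the history heavily, so the level changes slowly and the forecast line stays close to the long-run average.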

  11. Identification of Subtype-Specific Prognostic Genes for Early-Stage Lung Adenocarcinoma and Squamous Cell Carcinoma Patients Using an Embedded Feature Selection Algorithm.

    Suyan Tian

    Full Text Available. The existence of fundamental differences between lung adenocarcinoma (AC) and squamous cell carcinoma (SCC) in their underlying mechanisms motivated us to postulate that specific genes might exist that are relevant to the prognosis of each histology subtype. To test this hypothesis, we previously proposed a simple Cox-regression-model-based feature selection algorithm and successfully identified some subtype-specific prognostic genes when applying this method to real-world data. In this article, we continue our effort to identify subtype-specific prognostic genes for AC and SCC, and propose a novel embedded feature selection method by extending the Threshold Gradient Descent Regularization (TGDR) algorithm and minimizing a corresponding negative partial likelihood function. Using real-world and simulated datasets, we show that these two proposed methods have comparable performance, whereas the new proposal is superior in terms of model parsimony. Our analysis provides some evidence for the existence of such subtype-specific prognostic genes; more investigation is warranted.

  12. An improved chaotic fruit fly optimization based on a mutation strategy for simultaneous feature selection and parameter optimization for SVM and its applications.

    Ye, Fei; Lou, Xin Yuan; Sun, Lin Fu

    2017-01-01

    This paper proposes a new support vector machine (SVM) optimization scheme based on an improved chaotic fruit fly optimization algorithm (FOA) with a mutation strategy to simultaneously perform parameter tuning for the SVM and feature selection. In the improved FOA, the chaotic particle initializes the fruit fly swarm location and replaces the expression of distance for the fruit fly to find the food source. In addition, the proposed mutation strategy uses two distinct generative mechanisms for new food sources at the osphresis phase, allowing the algorithm to search for the optimal solution both in the whole solution space and within the local solution space containing the fruit fly swarm location. In an evaluation based on a group of ten benchmark problems, the proposed algorithm's performance is compared with that of other well-known algorithms, and the results support the superiority of the proposed algorithm. Moreover, this algorithm is successfully applied in an SVM to perform both parameter tuning for the SVM and feature selection to solve real-world classification problems. This method is called chaotic fruit fly optimization algorithm (CIFOA)-SVM and has been shown to be a more robust and effective optimization method than other well-known methods, particularly in terms of solving the medical diagnosis problem and the credit card problem.

  13. An improved chaotic fruit fly optimization based on a mutation strategy for simultaneous feature selection and parameter optimization for SVM and its applications

    Lou, Xin Yuan; Sun, Lin Fu

    2017-01-01

    This paper proposes a new support vector machine (SVM) optimization scheme based on an improved chaotic fruit fly optimization algorithm (FOA) with a mutation strategy to simultaneously perform parameter tuning for the SVM and feature selection. In the improved FOA, the chaotic particle initializes the fruit fly swarm location and replaces the expression of distance for the fruit fly to find the food source. In addition, the proposed mutation strategy uses two distinct generative mechanisms for new food sources at the osphresis phase, allowing the algorithm to search for the optimal solution both in the whole solution space and within the local solution space containing the fruit fly swarm location. In an evaluation based on a group of ten benchmark problems, the proposed algorithm’s performance is compared with that of other well-known algorithms, and the results support the superiority of the proposed algorithm. Moreover, this algorithm is successfully applied in an SVM to perform both parameter tuning for the SVM and feature selection to solve real-world classification problems. This method is called chaotic fruit fly optimization algorithm (CIFOA)-SVM and has been shown to be a more robust and effective optimization method than other well-known methods, particularly in terms of solving the medical diagnosis problem and the credit card problem. PMID:28369096

  14. Improving the performance of univariate control charts for abnormal detection and classification

    Yiakopoulos, Christos; Koutsoudaki, Maria; Gryllias, Konstantinos; Antoniadis, Ioannis

    2017-03-01

    Bearing failures in rotating machinery can cause machine breakdown and economical loss, if no effective actions are taken on time. Therefore, it is of prime importance to detect accurately the presence of faults, especially at their early stage, to prevent sequent damage and reduce costly downtime. The machinery fault diagnosis follows a roadmap of data acquisition, feature extraction and diagnostic decision making, in which mechanical vibration fault feature extraction is the foundation and the key to obtain an accurate diagnostic result. A challenge in this area is the selection of the most sensitive features for various types of fault, especially when the characteristics of failures are difficult to be extracted. Thus, a plethora of complex data-driven fault diagnosis methods are fed by prominent features, which are extracted and reduced through traditional or modern algorithms. Since most of the available datasets are captured during normal operating conditions, the last decade a number of novelty detection methods, able to work when only normal data are available, have been developed. In this study, a hybrid method combining univariate control charts and a feature extraction scheme is introduced focusing towards an abnormal change detection and classification, under the assumption that measurements under normal operating conditions of the machinery are available. The feature extraction method integrates the morphological operators and the Morlet wavelets. The effectiveness of the proposed methodology is validated on two different experimental cases with bearing faults, demonstrating that the proposed approach can improve the fault detection and classification performance of conventional control charts.

  15. [A SAS macro program for batch processing of univariate Cox regression analysis for large databases].

    Yang, Rendong; Xiong, Jie; Peng, Yangqin; Peng, Xiaoning; Zeng, Xiaomin

    2015-02-01

    To realize batch processing of univariate Cox regression analysis for large databases with a SAS macro program. We wrote a SAS macro program in SAS 9.2 that can filter, integrate, and export P values to Excel. The program was used to screen survival-correlated RNA molecules in ovarian cancer. The SAS macro program could complete the batch processing of univariate Cox regression analysis and the selection and export of the results. The SAS macro program has potential applications in reducing the workload of statistical analysis and providing a basis for batch processing of univariate Cox regression analysis.

  16. Segmentation of Coronary Angiograms Using Gabor Filters and Boltzmann Univariate Marginal Distribution Algorithm

    Fernando Cervantes-Sanchez

    2016-01-01

    Full Text Available. This paper presents a novel method for improving the training step of the single-scale Gabor filters by using the Boltzmann univariate marginal distribution algorithm (BUMDA) in X-ray angiograms. Since the single-scale Gabor filters (SSG) are governed by three parameters, the optimal selection of the SSG parameters is highly desirable in order to maximize the detection performance of coronary arteries while reducing the computational time. To obtain the best set of parameters for the SSG, the area (Az) under the receiver operating characteristic curve is used as the fitness function. Moreover, to classify vessel and nonvessel pixels from the Gabor filter response, the interclass variance thresholding method has been adopted. In the experiments, the proposed method obtained the highest detection rate, with Az=0.9502 over a training set of 40 images and Az=0.9583 over a test set of 40 images. In addition, the vessel segmentation results provided an accuracy of 0.944 on the test set of angiograms.
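
The area under the ROC curve used as the fitness function can be computed directly from filter responses and ground-truth labels. The sketch below uses the rank interpretation of the area (the probability that a randomly chosen vessel pixel scores higher than a randomly chosen nonvessel pixel, with ties counted as one half). The function name `roc_auc` and the toy responses are assumptions for illustration; the paper's own evaluation code is not described in the abstract.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve for binary labels (1 = vessel, 0 = nonvessel)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0      # positive ranked above negative
            elif p == q:
                wins += 0.5      # tie counts as half
    return wins / (len(positives) * len(negatives))


# Hypothetical filter responses for four pixels.
responses = [0.9, 0.2, 0.8, 0.1]
truth = [1, 1, 0, 0]
print(roc_auc(responses, truth))
```

An optimizer such as BUMDA would call a function like this once per candidate parameter set and keep the parameters that maximize the returned area. The pairwise loop is quadratic in the number of pixels, so a sort-based version would be preferable on full images.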

  17. Combinatorial bounds on the α-divergence of univariate mixture models

    Nielsen, Frank; Sun, Ke

    2017-01-01

    We derive lower- and upper-bounds of α-divergence between univariate mixture models with components in the exponential family. Three pairs of bounds are presented in order with increasing quality and increasing computational cost. They are verified

  18. Behavioral performance follows the time course of neural facilitation and suppression during cued shifts of feature-selective attention

    Andersen, S. K.; Müller, M. M.

    2010-01-01

    A central question in the field of attention is whether visual processing is a strictly limited resource, which must be allocated by selective attention. If this were the case, attentional enhancement of one stimulus should invariably lead to suppression of unattended distracter stimuli. Here we examine voluntary cued shifts of feature-selective attention to either one of two superimposed red or blue random dot kinematograms (RDKs) to test whether such a reciprocal relationship between enhanc...

  19. Support Vector Feature Selection for Early Detection of Anastomosis Leakage From Bag-of-Words in Electronic Health Records.

    Soguero-Ruiz, Cristina; Hindberg, Kristian; Rojo-Alvarez, Jose Luis; Skrovseth, Stein Olav; Godtliebsen, Fred; Mortensen, Kim; Revhaug, Arthur; Lindsetmo, Rolv-Ole; Augestad, Knut Magne; Jenssen, Robert

    2016-09-01

    The free text in electronic health records (EHRs) conveys a huge amount of clinical information about health state and patient history. Despite a rapidly growing literature on the use of machine learning techniques for extracting this information, little effort has been invested toward feature selection and the features' corresponding medical interpretation. In this study, we focus on the task of early detection of anastomosis leakage (AL), a severe complication after elective surgery for colorectal cancer (CRC), using free text extracted from EHRs. We use a bag-of-words model to investigate the potential for feature selection strategies. The purpose is earlier detection of AL and prediction of AL with data generated in the EHR before the actual complication occurs. Due to the high dimensionality of the data, we derive feature selection strategies using the robust support vector machine linear maximum margin classifier, by investigating: 1) a simple statistical criterion (leave-one-out-based test); 2) an intensive-computation statistical criterion (Bootstrap resampling); and 3) an advanced statistical criterion (kernel entropy). Results reveal a discriminatory power for early detection of complications after CRC surgery (sensitivity 100%; specificity 72%). These results can be used to develop prediction models, based on EHR data, that can support surgeons and patients in the preoperative decision making phase.

  20. Visual classification of very fine-grained sediments: Evaluation through univariate and multivariate statistics

    Hohn, M. Ed; Nuhfer, E.B.; Vinopal, R.J.; Klanderman, D.S.

    1980-01-01

    Classifying very fine-grained rocks through fabric elements provides information about depositional environments, but is subject to the biases of visual taxonomy. To evaluate the statistical significance of an empirical classification of very fine-grained rocks, samples from Devonian shales in four cored wells in West Virginia and Virginia were measured for 15 variables: quartz, illite, pyrite and expandable clays determined by X-ray diffraction; total sulfur, organic content, inorganic carbon, matrix density, bulk density, porosity, silt, as well as density, sonic travel time, resistivity, and γ-ray response measured from well logs. The four lithologic types comprised: (1) sharply banded shale, (2) thinly laminated shale, (3) lenticularly laminated shale, and (4) nonbanded shale. Univariate and multivariate analyses of variance showed that the lithologic classification reflects significant differences for the variables measured, differences that can be detected independently of stratigraphic effects. Little-known statistical methods found useful in this work included: the multivariate analysis of variance with more than one effect, simultaneous plotting of samples and variables on canonical variates, and the use of parametric ANOVA and MANOVA on ranked data. © 1980 Plenum Publishing Corporation.
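
The idea of running a parametric ANOVA on ranked data can be illustrated with a short sketch: pool the observations, replace each by its rank (average ranks for ties), and compute the usual one-way F statistic on the ranks. The function names and the sample values are hypothetical, the sketch covers only the one-way univariate case, and it assumes the ranks are not constant within every group (otherwise the within-group variance is zero).

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1          # positions are 0-based, ranks 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def one_way_f(groups):
    """Classical one-way ANOVA F statistic."""
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    return (ss_between / (len(groups) - 1)) / (ss_within / (n - len(groups)))


def f_on_ranks(groups):
    """Rank the pooled data, then apply the parametric F test to the ranks."""
    pooled = [x for g in groups for x in g]
    ranks = average_ranks(pooled)
    ranked_groups, start = [], 0
    for g in groups:
        ranked_groups.append(ranks[start:start + len(g)])
        start += len(g)
    return one_way_f(ranked_groups)


# Hypothetical measurements for two lithologic types; the outlier 100.0 only counts as rank 6.
print(f_on_ranks([[0.1, 0.5, 0.9], [1.2, 3.4, 100.0]]))
```

Because only the ordering of the observations enters the test, extreme values have limited influence, which is one reason the rank transform is attractive for skewed measurements.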

  1. Stress assessment based on EEG univariate features and functional connectivity measures.

    Alonso, J F; Romero, S; Ballester, M R; Antonijoan, R M; Mañanas, M A

    2015-07-01

    The biological response to stress originates in the brain but involves different biochemical and physiological effects. Many common clinical methods to assess stress are based on the presence of specific hormones and on features extracted from different signals, including electrocardiogram, blood pressure, skin temperature, or galvanic skin response. The aim of this paper was to assess stress using EEG-based variables obtained from univariate analysis and functional connectivity evaluation. Two different stressors, the Stroop test and sleep deprivation, were applied to 30 volunteers to find common EEG patterns related to stress effects. Results showed a decrease of the high alpha power (11 to 12 Hz), an increase in the high beta band (23 to 36 Hz, considered a busy brain indicator), and a decrease in the approximate entropy. Moreover, connectivity showed that the high beta coherence and the interhemispheric nonlinear couplings, measured by the cross mutual information function, increased significantly for both stressors, suggesting that useful stress indexes may be obtained from EEG-based features.

  2. Interactive prostate segmentation using atlas-guided semi-supervised learning and adaptive feature selection

    Park, Sang Hyun [Department of Radiology and BRIC, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina 27599 (United States); Gao, Yaozong, E-mail: yzgao@cs.unc.edu [Department of Computer Science, Department of Radiology, and BRIC, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina 27599 (United States); Shi, Yinghuan, E-mail: syh@nju.edu.cn [State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023 (China); Shen, Dinggang, E-mail: dgshen@med.unc.edu [Department of Radiology and BRIC, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina 27599 and Department of Brain and Cognitive Engineering, Korea University, Seoul 136-713 (Korea, Republic of)

    2014-11-01

    Purpose: Accurate prostate segmentation is necessary for maximizing the effectiveness of radiation therapy of prostate cancer. However, manual segmentation from 3D CT images is very time-consuming and often causes large intra- and interobserver variations across clinicians. Many segmentation methods have been proposed to automate this labor-intensive process, but tedious manual editing is still required due to the limited performance. In this paper, the authors propose a new interactive segmentation method that can (1) flexibly generate the editing result with a few scribbles or dots provided by a clinician, (2) fast deliver intermediate results to the clinician, and (3) sequentially correct the segmentations from any type of automatic or interactive segmentation methods. Methods: The authors formulate the editing problem as a semisupervised learning problem which can utilize a priori knowledge of training data and also the valuable information from user interactions. Specifically, from a region of interest near the given user interactions, the appropriate training labels, which are well matched with the user interactions, can be locally searched from a training set. With voting from the selected training labels, both confident prostate and background voxels, as well as unconfident voxels can be estimated. To reflect informative relationship between voxels, location-adaptive features are selected from the confident voxels by using regression forest and Fisher separation criterion. Then, the manifold configuration computed in the derived feature space is enforced into the semisupervised learning algorithm. The labels of unconfident voxels are then predicted by regularizing semisupervised learning algorithm. Results: The proposed interactive segmentation method was applied to correct automatic segmentation results of 30 challenging CT images. The correction was conducted three times with different user interactions performed at different time periods, in order to

  3. Interactive prostate segmentation using atlas-guided semi-supervised learning and adaptive feature selection

    Park, Sang Hyun; Gao, Yaozong; Shi, Yinghuan; Shen, Dinggang

    2014-01-01

    Purpose: Accurate prostate segmentation is necessary for maximizing the effectiveness of radiation therapy of prostate cancer. However, manual segmentation from 3D CT images is very time-consuming and often causes large intra- and interobserver variations across clinicians. Many segmentation methods have been proposed to automate this labor-intensive process, but tedious manual editing is still required due to the limited performance. In this paper, the authors propose a new interactive segmentation method that can (1) flexibly generate the editing result with a few scribbles or dots provided by a clinician, (2) fast deliver intermediate results to the clinician, and (3) sequentially correct the segmentations from any type of automatic or interactive segmentation methods. Methods: The authors formulate the editing problem as a semisupervised learning problem which can utilize a priori knowledge of training data and also the valuable information from user interactions. Specifically, from a region of interest near the given user interactions, the appropriate training labels, which are well matched with the user interactions, can be locally searched from a training set. With voting from the selected training labels, both confident prostate and background voxels, as well as unconfident voxels can be estimated. To reflect informative relationship between voxels, location-adaptive features are selected from the confident voxels by using regression forest and Fisher separation criterion. Then, the manifold configuration computed in the derived feature space is enforced into the semisupervised learning algorithm. The labels of unconfident voxels are then predicted by regularizing semisupervised learning algorithm. Results: The proposed interactive segmentation method was applied to correct automatic segmentation results of 30 challenging CT images. The correction was conducted three times with different user interactions performed at different time periods, in order to

  4. Interactive prostate segmentation using atlas-guided semi-supervised learning and adaptive feature selection.

    Park, Sang Hyun; Gao, Yaozong; Shi, Yinghuan; Shen, Dinggang

    2014-11-01

    Accurate prostate segmentation is necessary for maximizing the effectiveness of radiation therapy of prostate cancer. However, manual segmentation from 3D CT images is very time-consuming and often causes large intra- and interobserver variations across clinicians. Many segmentation methods have been proposed to automate this labor-intensive process, but tedious manual editing is still required due to the limited performance. In this paper, the authors propose a new interactive segmentation method that can (1) flexibly generate the editing result with a few scribbles or dots provided by a clinician, (2) fast deliver intermediate results to the clinician, and (3) sequentially correct the segmentations from any type of automatic or interactive segmentation methods. The authors formulate the editing problem as a semisupervised learning problem which can utilize a priori knowledge of training data and also the valuable information from user interactions. Specifically, from a region of interest near the given user interactions, the appropriate training labels, which are well matched with the user interactions, can be locally searched from a training set. With voting from the selected training labels, both confident prostate and background voxels, as well as unconfident voxels can be estimated. To reflect informative relationship between voxels, location-adaptive features are selected from the confident voxels by using regression forest and Fisher separation criterion. Then, the manifold configuration computed in the derived feature space is enforced into the semisupervised learning algorithm. The labels of unconfident voxels are then predicted by regularizing semisupervised learning algorithm. The proposed interactive segmentation method was applied to correct automatic segmentation results of 30 challenging CT images. The correction was conducted three times with different user interactions performed at different time periods, in order to evaluate both the efficiency

  5. Supervised Variational Relevance Learning, An Analytic Geometric Feature Selection with Applications to Omic Datasets.

    Boareto, Marcelo; Cesar, Jonatas; Leite, Vitor B P; Caticha, Nestor

    2015-01-01

    We introduce Supervised Variational Relevance Learning (Suvrel), a variational method to determine metric tensors to define distance based similarity in pattern classification, inspired in relevance learning. The variational method is applied to a cost function that penalizes large intraclass distances and favors small interclass distances. We find analytically the metric tensor that minimizes the cost function. Preprocessing the patterns by doing linear transformations using the metric tensor yields a dataset which can be more efficiently classified. We test our methods using publicly available datasets, for some standard classifiers. Among these datasets, two were tested by the MAQC-II project and, even without the use of further preprocessing, our results improve on their performance.

  6. A data-driven multi-model methodology with deep feature selection for short-term wind forecasting

    Feng, Cong; Cui, Mingjian; Hodge, Bri-Mathias; Zhang, Jie

    2017-01-01

    Highlights: • An ensemble model is developed to produce both deterministic and probabilistic wind forecasts. • A deep feature selection framework is developed to optimally determine the inputs to the forecasting methodology. • The developed ensemble methodology has improved the forecasting accuracy by up to 30%. - Abstract: With the growing wind penetration into the power system worldwide, improving wind power forecasting accuracy is becoming increasingly important to ensure continued economic and reliable power system operations. In this paper, a data-driven multi-model wind forecasting methodology is developed with a two-layer ensemble machine learning technique. The first layer is composed of multiple machine learning models that generate individual forecasts. A deep feature selection framework is developed to determine the most suitable inputs to the first layer machine learning models. Then, a blending algorithm is applied in the second layer to create an ensemble of the forecasts produced by first layer models and generate both deterministic and probabilistic forecasts. This two-layer model seeks to utilize the statistically different characteristics of each machine learning algorithm. A number of machine learning algorithms are selected and compared in both layers. This developed multi-model wind forecasting methodology is compared to several benchmarks. The effectiveness of the proposed methodology is evaluated to provide 1-hour-ahead wind speed forecasting at seven locations of the Surface Radiation network. Numerical results show that comparing to the single-algorithm models, the developed multi-model framework with deep feature selection procedure has improved the forecasting accuracy by up to 30%.

  7. A Modified Feature Selection and Artificial Neural Network-Based Day-Ahead Load Forecasting Model for a Smart Grid

    Ashfaq Ahmad

    2015-12-01

    In the operation of a smart grid (SG), day-ahead load forecasting (DLF) is an important task. The SG can enhance the management of its conventional and renewable resources with a more accurate DLF model. However, DLF model development is highly challenging due to the non-linear characteristics of load time series in SGs. In the literature, DLF models do exist; however, these models trade off between execution time and forecast accuracy. The newly-proposed DLF model will be able to accurately predict the load of the next day with a fair enough execution time. Our proposed model consists of three modules: the data preparation module, the feature selection module and the forecast module. The first module makes the historical load curve compatible with the feature selection module. The second module removes redundant and irrelevant features from the input data. The third module, which consists of an artificial neural network (ANN), predicts future load on the basis of selected features. Moreover, the forecast module uses a sigmoid function for activation and a multi-variate auto-regressive model for weight updating during the training process. Simulations are conducted in MATLAB to validate the performance of our newly-proposed DLF model in terms of accuracy and execution time. Results show that our proposed modified feature selection and modified ANN (m(FS + ANN))-based model for SGs is able to capture the non-linearity(ies) in the history load curve with 97.11% accuracy. Moreover, this accuracy is achieved at the cost of a fair enough execution time, i.e., we have decreased the average execution time of the existing FS + ANN-based model by 38.50%.

  8. Feature Selection and the Class Imbalance Problem in Predicting Protein Function from Sequence

    Al-Shahib, A.; Breitling, R.; Gilbert, D.

    2005-01-01

    Abstract: When the standard approach to predict protein function by sequence homology fails, other alternative methods can be used that require only the amino acid sequence for predicting function. One such approach uses machine learning to predict protein function directly from amino acid sequence

  9. A comparison of multivariate and univariate time series approaches to modelling and forecasting emergency department demand in Western Australia.

    Aboagye-Sarfo, Patrick; Mai, Qun; Sanfilippo, Frank M; Preen, David B; Stewart, Louise M; Fatovich, Daniel M

    2015-10-01

    To develop multivariate vector-ARMA (VARMA) forecast models for predicting emergency department (ED) demand in Western Australia (WA) and compare them to the benchmark univariate autoregressive moving average (ARMA) and Winters' models. Seven-year monthly WA state-wide public hospital ED presentation data from 2006/07 to 2012/13 were modelled. Graphical and VARMA modelling methods were used for descriptive analysis and model fitting. The VARMA models were compared to the benchmark univariate ARMA and Winters' models to determine their accuracy to predict ED demand. The best models were evaluated by using error correction methods for accuracy. Descriptive analysis of all the dependent variables showed an increasing pattern of ED use with seasonal trends over time. The VARMA models provided a more precise and accurate forecast with smaller confidence intervals and better measures of accuracy in predicting ED demand in WA than the ARMA and Winters' method. VARMA models are a reliable forecasting method to predict ED demand for strategic planning and resource allocation. While the ARMA models are a closely competing alternative, they under-estimated future ED demand. Copyright © 2015 Elsevier Inc. All rights reserved.

  10. Manifold regularized multi-task feature selection for multi-modality classification in Alzheimer's disease.

    Jie, Biao; Zhang, Daoqiang; Cheng, Bo; Shen, Dinggang

    2013-01-01

    Accurate diagnosis of Alzheimer's disease (AD), as well as its prodromal stage (i.e., mild cognitive impairment, MCI), is very important for possible delay and early treatment of the disease. Recently, multi-modality methods have been used for fusing information from multiple different and complementary imaging and non-imaging modalities. Although there are a number of existing multi-modality methods, few of them have addressed the problem of joint identification of disease-related brain regions from multi-modality data for classification. In this paper, we proposed a manifold regularized multi-task learning framework to jointly select features from multi-modality data. Specifically, we formulate the multi-modality classification as a multi-task learning framework, where each task focuses on the classification based on each modality. In order to capture the intrinsic relatedness among multiple tasks (i.e., modalities), we adopted a group sparsity regularizer, which ensures only a small number of features to be selected jointly. In addition, we introduced a new manifold based Laplacian regularization term to preserve the geometric distribution of original data from each task, which can lead to the selection of more discriminative features. Furthermore, we extend our method to the semi-supervised setting, which is very important since the acquisition of a large set of labeled data (i.e., diagnosis of disease) is usually expensive and time-consuming, while the collection of unlabeled data is relatively much easier. To validate our method, we have performed extensive evaluations on the baseline Magnetic resonance imaging (MRI) and fluorodeoxyglucose positron emission tomography (FDG-PET) data of Alzheimer's Disease Neuroimaging Initiative (ADNI) database. Our experimental results demonstrate the effectiveness of the proposed method.

  11. Simultaneous feature selection and parameter optimisation using an artificial ant colony: case study of melting point prediction

    Nigsch Florian

    2008-10-01

    Background: We present a novel feature selection algorithm, Winnowing Artificial Ant Colony (WAAC), that performs simultaneous feature selection and model parameter optimisation for the development of predictive quantitative structure-property relationship (QSPR) models. The WAAC algorithm is an extension of the modified ant colony algorithm of Shen et al. (J Chem Inf Model 2005, 45: 1024–1029). We test the ability of the algorithm to develop a predictive partial least squares model for the Karthikeyan dataset (J Chem Inf Model 2005, 45: 581–590) of melting point values. We also test its ability to perform feature selection on a support vector machine model for the same dataset. Results: Starting from an initial set of 203 descriptors, the WAAC algorithm selected a PLS model with 68 descriptors which has an RMSE on an external test set of 46.6°C and R2 of 0.51. The number of components chosen for the model was 49, which was close to optimal for this feature selection. The selected SVM model has 28 descriptors (cost of 5, ε of 0.21) and an RMSE of 45.1°C and R2 of 0.54. This model outperforms a kNN model (RMSE of 48.3°C, R2 of 0.47) for the same data and has similar performance to a Random Forest model (RMSE of 44.5°C, R2 of 0.55). However it is much less prone to bias at the extremes of the range of melting points as shown by the slope of the line through the residuals: -0.43 for WAAC/SVM, -0.53 for Random Forest. Conclusion: With a careful choice of objective function, the WAAC algorithm can be used to optimise machine learning and regression models that suffer from overfitting. Where model parameters also need to be tuned, as is the case with support vector machine and partial least squares models, it can optimise these simultaneously. The moving probabilities used by the algorithm are easily interpreted in terms of the best and current models of the ants, and the winnowing procedure promotes the removal of irrelevant descriptors.

  12. Simultaneous feature selection and parameter optimisation using an artificial ant colony: case study of melting point prediction.

    O'Boyle, Noel M; Palmer, David S; Nigsch, Florian; Mitchell, John Bo

    2008-10-29

    We present a novel feature selection algorithm, Winnowing Artificial Ant Colony (WAAC), that performs simultaneous feature selection and model parameter optimisation for the development of predictive quantitative structure-property relationship (QSPR) models. The WAAC algorithm is an extension of the modified ant colony algorithm of Shen et al. (J Chem Inf Model 2005, 45: 1024-1029). We test the ability of the algorithm to develop a predictive partial least squares model for the Karthikeyan dataset (J Chem Inf Model 2005, 45: 581-590) of melting point values. We also test its ability to perform feature selection on a support vector machine model for the same dataset. Starting from an initial set of 203 descriptors, the WAAC algorithm selected a PLS model with 68 descriptors which has an RMSE on an external test set of 46.6 degrees C and R2 of 0.51. The number of components chosen for the model was 49, which was close to optimal for this feature selection. The selected SVM model has 28 descriptors (cost of 5, epsilon of 0.21) and an RMSE of 45.1 degrees C and R2 of 0.54. This model outperforms a kNN model (RMSE of 48.3 degrees C, R2 of 0.47) for the same data and has similar performance to a Random Forest model (RMSE of 44.5 degrees C, R2 of 0.55). However it is much less prone to bias at the extremes of the range of melting points as shown by the slope of the line through the residuals: -0.43 for WAAC/SVM, -0.53 for Random Forest. With a careful choice of objective function, the WAAC algorithm can be used to optimise machine learning and regression models that suffer from overfitting. Where model parameters also need to be tuned, as is the case with support vector machine and partial least squares models, it can optimise these simultaneously. The moving probabilities used by the algorithm are easily interpreted in terms of the best and current models of the ants, and the winnowing procedure promotes the removal of irrelevant descriptors.

  13. Forecasting electricity spot-prices using linear univariate time-series models

    Cuaresma, Jesus Crespo; Hlouskova, Jaroslava; Kossmeier, Stephan; Obersteiner, Michael

    2004-01-01

    This paper studies the forecasting abilities of a battery of univariate models on hourly electricity spot prices, using data from the Leipzig Power Exchange. The specifications studied include autoregressive models, autoregressive-moving average models and unobserved component models. The results show that specifications, where each hour of the day is modelled separately present uniformly better forecasting properties than specifications for the whole time-series, and that the inclusion of simple probabilistic processes for the arrival of extreme price events can lead to improvements in the forecasting abilities of univariate models for electricity spot prices. (Author)

  14. Inference for feature selection using the Lasso with high-dimensional data

    Brink-Jensen, Kasper; Ekstrøm, Claus Thorn

    2014-01-01

    Penalized regression models such as the Lasso have proved useful for variable selection in many fields, especially for situations with high-dimensional data where the number of predictors far exceeds the number of observations. These methods identify and rank variables of importance but do not generally provide any inference of the selected variables. Thus, the variables selected might be the "most important" but need not be significant. We propose a significance test for the selection found by the Lasso. We introduce a procedure that computes inference and p-values for features chosen by the Lasso. This method rephrases the null hypothesis and uses a randomization approach which ensures that the error rate is controlled even for small samples. We demonstrate the ability of the algorithm to compute p-values of the expected magnitude with simulated data using a multitude of scenarios...

  15. A feature selection approach for identification of signature genes from SAGE data

    Silva Paulo JS

    2007-05-01

    Background: One goal of gene expression profiling is to identify signature genes that robustly distinguish different types or grades of tumors. Several tumor classifiers based on expression profiling have been proposed using microarray technique. Due to important differences in the probabilistic models of microarray and SAGE technologies, it is important to develop suitable techniques to select specific genes from SAGE measurements. Results: A new framework to select specific genes that distinguish different biological states based on the analysis of SAGE data is proposed. The new framework applies the bolstered error for the identification of strong genes that separate the biological states in a feature space defined by the gene expression of a training set. Credibility intervals defined from a probabilistic model of SAGE measurements are used to identify the genes that distinguish the different states with more reliability among all gene groups selected by the strong genes method. A score taking into account the credibility and the bolstered error values in order to rank the groups of considered genes is proposed. Results obtained using SAGE data from gliomas are presented, thus corroborating the introduced methodology. Conclusion: The model representing counting data, such as SAGE, provides additional statistical information that allows a more robust analysis. The additional statistical information provided by the probabilistic model is incorporated in the methodology described in the paper. The introduced method is suitable to identify signature genes that lead to a good separation of the biological states using SAGE and may be adapted for other counting methods such as Massive Parallel Signature Sequencing (MPSS) or the recent Sequencing-By-Synthesis (SBS) technique. Some of such genes identified by the proposed method may be useful to generate classifiers.

  16. Applications of random forest feature selection for fine-scale genetic population assignment.

    Sylvester, Emma V A; Bentzen, Paul; Bradbury, Ian R; Clément, Marie; Pearce, Jon; Horne, John; Beiko, Robert G

    2018-02-01

    Genetic population assignment used to inform wildlife management and conservation efforts requires panels of highly informative genetic markers and sensitive assignment tests. We explored the utility of machine-learning algorithms (random forest, regularized random forest and guided regularized random forest) compared with FST ranking for selection of single nucleotide polymorphisms (SNP) for fine-scale population assignment. We applied these methods to an unpublished SNP data set for Atlantic salmon (Salmo salar) and a published SNP data set for Alaskan Chinook salmon (Oncorhynchus tshawytscha). In each species, we identified the minimum panel size required to obtain a self-assignment accuracy of at least 90% using each method to create panels of 50-700 markers. Panels of SNPs identified using random forest-based methods performed up to 7.8 and 11.2 percentage points better than FST-selected panels of similar size for the Atlantic salmon and Chinook salmon data, respectively. Self-assignment accuracy ≥90% was obtained with panels of 670 and 384 SNPs for each data set, respectively, a level of accuracy never reached for these species using FST-selected panels. Our results demonstrate a role for machine-learning approaches in marker selection across large genomic data sets to improve assignment for management and conservation of exploited populations.

  17. Recursive Cluster Elimination (RCE) for classification and feature selection from gene expression data

    Showe Louise C

    2007-05-01

    Background: Classification studies using gene expression datasets are usually based on small numbers of samples and tens of thousands of genes. The selection of those genes that are important for distinguishing the different sample classes being compared poses a challenging problem in high dimensional data analysis. We describe a new procedure for selecting significant genes as recursive cluster elimination (RCE) rather than recursive feature elimination (RFE). We have tested this algorithm on six datasets and compared its performance with that of two related classification procedures with RFE. Results: We have developed a novel method for selecting significant genes in comparative gene expression studies. This method, which we refer to as SVM-RCE, combines K-means, a clustering method, to identify correlated gene clusters, and Support Vector Machines (SVMs), a supervised machine learning classification method, to identify and score (rank) those gene clusters for the purpose of classification. K-means is used initially to group genes into clusters. Recursive cluster elimination (RCE) is then applied to iteratively remove those clusters of genes that contribute the least to the classification performance. SVM-RCE identifies the clusters of correlated genes that are most significantly differentially expressed between the sample classes. Utilization of gene clusters, rather than individual genes, enhances the supervised classification accuracy of the same data as compared to the accuracy when either SVM or Penalized Discriminant Analysis (PDA) with recursive feature elimination (SVM-RFE and PDA-RFE) are used to remove genes based on their individual discriminant weights. Conclusion: SVM-RCE provides improved classification accuracy with complex microarray data sets when it is compared to the classification accuracy of the same datasets using either SVM-RFE or PDA-RFE. SVM-RCE identifies clusters of correlated genes that when considered together

  18. Prediction of cognitive and motor development in preterm children using exhaustive feature selection and cross-validation of near-term white matter microstructure.

    Schadl, Kornél; Vassar, Rachel; Cahill-Rowley, Katelyn; Yeom, Kristin W; Stevenson, David K; Rose, Jessica

    2018-01-01

    Advanced neuroimaging and computational methods offer opportunities for more accurate prognosis. We hypothesized that near-term regional white matter (WM) microstructure, assessed on diffusion tensor imaging (DTI), using exhaustive feature selection with cross-validation would predict neurodevelopment in preterm children. Near-term MRI and DTI obtained at 36.6 ± 1.8 weeks postmenstrual age in 66 very-low-birth-weight preterm neonates were assessed. 60/66 had follow-up neurodevelopmental evaluation with Bayley Scales of Infant-Toddler Development, 3rd-edition (BSID-III) at 18-22 months. Linear models with exhaustive feature selection and leave-one-out cross-validation computed based on DTI identified sets of three brain regions most predictive of cognitive and motor function; logistic regression models were computed to classify high-risk infants scoring one standard deviation below mean. Cognitive impairment was predicted (100% sensitivity, 100% specificity; AUC = 1) by near-term right middle-temporal gyrus MD, right cingulate-cingulum MD, left caudate MD. Motor impairment was predicted (90% sensitivity, 86% specificity; AUC = 0.912) by left precuneus FA, right superior occipital gyrus MD, right hippocampus FA. Cognitive score variance was explained (29.6%, cross-validated R^2 = 0.296) by left posterior-limb-of-internal-capsule MD, Genu RD, right fusiform gyrus AD. Motor score variance was explained (31.7%, cross-validated R^2 = 0.317) by left posterior-limb-of-internal-capsule MD, right parahippocampal gyrus AD, right middle-temporal gyrus AD. Search in large DTI feature space more accurately identified neonatal neuroimaging correlates of neurodevelopment.
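
    The combination of exhaustive feature selection with leave-one-out cross-validation described in this record can be illustrated with a small sketch. It is not the authors' code: the synthetic data, the function names (solve, fit, loo_error, best_subset) and the normal-equations least-squares solver are all assumptions made for illustration. Every subset of k feature columns is scored by the mean squared leave-one-out prediction error of a linear model, and the subset with the lowest error is returned.

```python
import random
from itertools import combinations

def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]  # fails if the system is singular
                M[r] = [a - f * v for a, v in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def fit(X, y):
    """Ordinary least squares with an intercept, via the normal equations."""
    Z = [[1.0] + list(row) for row in X]
    k = len(Z[0])
    A = [[sum(z[i] * z[j] for z in Z) for j in range(k)] for i in range(k)]
    b = [sum(z[i] * t for z, t in zip(Z, y)) for i in range(k)]
    return solve(A, b)

def loo_error(X, y, cols):
    """Mean squared leave-one-out prediction error using only columns `cols`."""
    total = 0.0
    for i in range(len(y)):
        X_train = [[row[c] for c in cols] for j, row in enumerate(X) if j != i]
        y_train = [t for j, t in enumerate(y) if j != i]
        w = fit(X_train, y_train)
        pred = w[0] + sum(wi * X[i][c] for wi, c in zip(w[1:], cols))
        total += (pred - y[i]) ** 2
    return total / len(y)

def best_subset(X, y, k):
    """Exhaustively score every k-column subset and return the best one."""
    return min(combinations(range(len(X[0])), k),
               key=lambda cols: loo_error(X, y, cols))

# Synthetic example (hypothetical data): y depends only on columns 0 and 2.
rng = random.Random(0)
X_demo = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(12)]
y_demo = [1.0 + 2.0 * row[0] - row[2] for row in X_demo]
print(best_subset(X_demo, y_demo, 2))
```

    The number of subsets grows combinatorially with the number of candidate features, so a search of this kind is only practical for small subset sizes such as the sets of three regions mentioned above.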

  19. Feature selection from a facial image for distinction of sasang constitution.

    Koo, Imhoi; Kim, Jong Yeol; Kim, Myoung Geun; Kim, Keun Ho

    2009-09-01

    Recently, oriental medicine has received attention for providing personalized medicine through consideration of the unique nature and constitution of individual patients. With the eventual goal of globalization, the current trend in oriental medicine research is the standardization by adopting western scientific methods, which could represent a scientific revolution. The purpose of this study is to establish methods for finding statistically significant features in a facial image with respect to distinguishing constitution and to show the meaning of those features. From facial photo images, facial elements are analyzed in terms of the distance, angle and the distance ratios, for which there are 1225, 61 250 and 749 700 features, respectively. Due to the very large number of facial features, it is quite difficult to determine truly meaningful features. We suggest a process for the efficient analysis of facial features including the removal of outliers, control for missing data to guarantee data confidence and calculation of statistical significance by applying ANOVA. We show the statistical properties of selected features according to different constitutions using the nine distances, 10 angles and 10 rates of distance features that are finally established. Additionally, the Sasang constitutional meaning of the selected features is shown here.

  20. Feature Selection from a Facial Image for Distinction of Sasang Constitution

    Imhoi Koo

    2009-01-01

    Recently, oriental medicine has received attention for providing personalized medicine through consideration of the unique nature and constitution of individual patients. With the eventual goal of globalization, the current trend in oriental medicine research is the standardization by adopting western scientific methods, which could represent a scientific revolution. The purpose of this study is to establish methods for finding statistically significant features in a facial image with respect to distinguishing constitution and to show the meaning of those features. From facial photo images, facial elements are analyzed in terms of the distance, angle and the distance ratios, for which there are 1225, 61 250 and 749 700 features, respectively. Due to the very large number of facial features, it is quite difficult to determine truly meaningful features. We suggest a process for the efficient analysis of facial features including the removal of outliers, control for missing data to guarantee data confidence and calculation of statistical significance by applying ANOVA. We show the statistical properties of selected features according to different constitutions using the nine distances, 10 angles and 10 rates of distance features that are finally established. Additionally, the Sasang constitutional meaning of the selected features is shown here.

  1. Feature Selection from a Facial Image for Distinction of Sasang Constitution

    Koo, Imhoi; Kim, Jong Yeol; Kim, Myoung Geun

    2009-01-01

    Recently, oriental medicine has received attention for providing personalized medicine through consideration of the unique nature and constitution of individual patients. With the eventual goal of globalization, the current trend in oriental medicine research is the standardization by adopting western scientific methods, which could represent a scientific revolution. The purpose of this study is to establish methods for finding statistically significant features in a facial image with respect to distinguishing constitution and to show the meaning of those features. From facial photo images, facial elements are analyzed in terms of the distance, angle and the distance ratios, for which there are 1225, 61 250 and 749 700 features, respectively. Due to the very large number of facial features, it is quite difficult to determine truly meaningful features. We suggest a process for the efficient analysis of facial features including the removal of outliers, control for missing data to guarantee data confidence and calculation of statistical significance by applying ANOVA. We show the statistical properties of selected features according to different constitutions using the nine distances, 10 angles and 10 rates of distance features that are finally established. Additionally, the Sasang constitutional meaning of the selected features is shown here. PMID:19745013
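
    The ANOVA-based screening described in this abstract can be illustrated with a minimal sketch, assuming each facial feature is a numeric column and each subject carries one constitution label. The function names `anova_f` and `rank_features` and the toy data are hypothetical, not taken from the paper. The sketch ranks features by the one-way F statistic only and leaves out the authors' outlier removal, missing-data control and p-value computation.

```python
from collections import defaultdict


def anova_f(values, labels):
    """One-way ANOVA F statistic for a single feature across label groups."""
    groups = defaultdict(list)
    for value, label in zip(values, labels):
        groups[label].append(value)

    n = len(values)          # total number of samples
    k = len(groups)          # number of groups
    grand_mean = sum(values) / n

    # Variation of the group means around the grand mean.
    ss_between = sum(
        len(vs) * (sum(vs) / len(vs) - grand_mean) ** 2 for vs in groups.values()
    )
    # Variation of the samples around their own group mean.
    ss_within = sum(
        (v - sum(vs) / len(vs)) ** 2 for vs in groups.values() for v in vs
    )
    if ss_within == 0:
        return float("inf") if ss_between > 0 else 0.0
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def rank_features(rows, labels):
    """Score every column independently and return (order, scores)."""
    n_features = len(rows[0])
    scores = [anova_f([row[j] for row in rows], labels) for j in range(n_features)]
    order = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return order, scores


# Hypothetical toy data: column 0 separates the three groups, column 1 does not.
labels = ["A", "A", "B", "B", "C", "C"]
rows = [
    [1.0, 5.0],
    [1.1, 1.0],
    [2.0, 4.0],
    [2.1, 2.0],
    [3.0, 5.0],
    [3.1, 1.0],
]
order, scores = rank_features(rows, labels)
print(order, scores)
```

    Because every feature is scored on its own, the cost grows linearly with the number of features, which is what makes this kind of filter practical for feature counts as large as those quoted above.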

  2. The Use of Univariate and Multivariate Analyses in the Geochemical Exploration, Ravanj Lead Mine, Delijan, Iran

    Mostafa Nejadhadad

    2017-11-01

    A geochemical exploration program was applied to recognize the anomalous geochemical haloes at the Ravanj lead mine, Delijan, Iran. Unweathered rocks were sampled across rock exposures on a 10 × 10 meter grid (n = 302) as well as in the accessible parts of underground mine A (n = 42). First, the threshold values of all elements were determined using the cut-off values used in the exploratory data analysis (EDA) method. Then, for further studies, elements with lognormal distributions (Pb, Zn, Ag, As, Cd, Co, Cu, Sb, S, Sr, Th, Ba, Bi, Fe, Ni and Mn) were selected. Robustness against outliers is achieved by applying the centered log-ratio transformation, which addresses the closure problem of compositional data, prior to principal component analysis (PCA). Results of these analyses show that, in the Ravanj deposit, Pb mineralization is characterized by a Pb-Ba-Ag-Sb ± Zn ± Cd association. The supra-mineralization haloes are characterized by barite and tetrahedrite in a Ba-Th-Ag-Cu-Sb-As-Sr association, and sub-mineralization haloes are comprised of pyrite and tetrahedrite, probably reflecting a Fe-Cu-As-Bi-Ni-Co-Mo-Mn association. Using univariate and multivariate geostatistical analyses (e.g., EDA and robust PCA), four anomalies were detected and mapped in Block A of the Ravanj deposit. Anomalies 1 and 2 are around the ancient orebodies. Anomaly 3 is located in a thin bedded limestone-shale intercalation unit that does not show significant mineralization. Drilling of the fourth anomaly suggested a low grade, non-economic Pb mineralization.
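
    The log-ratio step mentioned in this abstract can be sketched as follows, assuming each sample is a vector of strictly positive element concentrations. The function name `clr` and the example values are hypothetical; the abstract does not state how zeros or values below the detection limit were handled, so the sketch simply rejects them.

```python
import math


def clr(composition):
    """Centered log-ratio transform of one strictly positive composition."""
    if any(x <= 0 for x in composition):
        raise ValueError("CLR needs strictly positive parts")
    logs = [math.log(x) for x in composition]
    mean_log = sum(logs) / len(logs)  # log of the geometric mean
    return [value - mean_log for value in logs]


# Hypothetical three-element sample, expressed in two different units.
sample = [1.0, 2.0, 4.0]
rescaled = [10.0, 20.0, 40.0]
print(clr(sample))
print(clr(rescaled))
```

    The transformed values sum to zero and do not change when the whole sample is rescaled, which is why the transform is commonly applied before PCA on compositional data.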

  3. Multi-Stage Feature Selection by Using Genetic Algorithms for Fault Diagnosis in Gearboxes Based on Vibration Signal

    Mariela Cerrada

    2015-09-01

    There are growing demands for condition-based monitoring of gearboxes, and techniques to improve the reliability, effectiveness and accuracy for fault diagnosis are considered valuable contributions. Feature selection is still an important aspect in machine learning-based diagnosis in order to reach good performance in the diagnosis system. The main aim of this research is to propose a multi-stage feature selection mechanism for selecting the best set of condition parameters on the time, frequency and time-frequency domains, which are extracted from vibration signals for fault diagnosis purposes in gearboxes. The selection is based on genetic algorithms, proposing in each stage a new subset of the best features regarding the classifier performance in a supervised environment. The selected features are augmented at each stage and used as input for a neural network classifier in the next step, while a new subset of feature candidates is treated by the selection process. As a result, the inherent exploration and exploitation of the genetic algorithms for finding the best solutions of the selection problem are locally focused. The approach is tested on a dataset from a real test bed with several fault classes under different running conditions of load and velocity. The model performance for diagnosis is over 98%.

  4. Relevance feature selection of modal frequency-ambient condition pattern recognition in structural health assessment for reinforced concrete buildings

    He-Qing Mu

    2016-08-01

    Modal frequency is an important indicator for structural health assessment. Previous studies have shown that this indicator is substantially affected by the fluctuation of ambient conditions, such as temperature and humidity. Therefore, recognizing the pattern between modal frequency and ambient conditions is necessary for reliable long-term structural health assessment. In this article, a novel machine-learning algorithm is proposed to automatically select relevance features in modal frequency-ambient condition pattern recognition based on structural dynamic response and ambient condition measurement. In contrast to the traditional feature selection approaches, which examine a large number of combinations of extracted features, the proposed algorithm conducts continuous relevance feature selection by introducing a sophisticated hyperparameterization on the weight parameter vector controlling the relevancy of different features in the prediction model. The proposed algorithm is then utilized for structural health assessment for a reinforced concrete building based on 1-year daily measurements. It turns out that the optimal model class including the relevance features for each vibrational mode is capable of capturing the pattern between the corresponding modal frequency and the ambient conditions.

  5. Decoding auditory spatial and emotional information encoding using multivariate versus univariate techniques.

    Kryklywy, James H; Macpherson, Ewan A; Mitchell, Derek G V

    2018-04-01

    Emotion can have diverse effects on behaviour and perception, modulating function in some circumstances, and sometimes having little effect. Recently, it was identified that part of the heterogeneity of emotional effects could be due to a dissociable representation of emotion in dual pathway models of sensory processing. Our previous fMRI experiment using traditional univariate analyses showed that emotion modulated processing in the auditory 'what' but not 'where' processing pathway. The current study aims to further investigate this dissociation using a more recently emerging multi-voxel pattern analysis searchlight approach. While undergoing fMRI, participants localized sounds of varying emotional content. A searchlight multi-voxel pattern analysis was conducted to identify activity patterns predictive of sound location and/or emotion. Relative to the prior univariate analysis, MVPA indicated larger overlapping spatial and emotional representations of sound within early secondary regions associated with auditory localization. However, consistent with the univariate analysis, these two dimensions were increasingly segregated in late secondary and tertiary regions of the auditory processing streams. These results, while complementary to our original univariate analyses, highlight the utility of multiple analytic approaches for neuroimaging, particularly for neural processes with known representations dependent on population coding.

  6. Combinatorial bounds on the α-divergence of univariate mixture models

    Nielsen, Frank

    2017-06-20

    We derive lower- and upper-bounds of α-divergence between univariate mixture models with components in the exponential family. Three pairs of bounds are presented in order with increasing quality and increasing computational cost. They are verified empirically through simulated Gaussian mixture models. The presented methodology generalizes to other divergence families relying on Hellinger-type integrals.

  7. A comparison of univariate and multivariate methods for analyzing clinal variation in an invasive species

    Edwards, K.R.; Bastlová, D.; Edwards-Jonášová, Magda; Květ, J.

    2011-01-01

    Vol. 674, No. 1 (2011), pp. 119-131. ISSN 0018-8158. Institutional research plan: CEZ:AV0Z60870520. Keywords: common garden; life history traits; local adaptation; principal components analysis; purple loosestrife; redundancy analysis. Subject RIV: EH - Ecology, Behaviour. Impact factor: 1.784 (2011). http://www.springerlink.com/content/71r10n3367m98jxl/

  8. Feature Selection and Classification of Ulcerated Lesions Using Statistical Analysis for WCE Images

    Shipra Suman

    2017-10-01

    Wireless capsule endoscopy (WCE) is a technology developed to inspect the whole gastrointestinal tract (especially the small bowel area that is unreachable using the traditional endoscopy procedure) for various abnormalities in a non-invasive manner. However, visualization of a massive number of images is a very time-consuming and tedious task for physicians (prone to human error). Thus, an automatic scheme for lesion detection in WCE videos is a potential solution to alleviate this problem. In this work, a novel statistical approach was chosen for differentiating ulcer and non-ulcer pixels using various color spaces (or more specifically using relevant color bands). The chosen feature vector was used to compute the performance metrics using SVM with grid search method for maximum efficiency. The experimental results and analysis showed that the proposed algorithm was robust in detecting ulcers. The performance in terms of accuracy, sensitivity, and specificity is 97.89%, 96.22%, and 95.09%, respectively, which is promising.

  9. A Semidefinite Programming Based Search Strategy for Feature Selection with Mutual Information Measure.

    Naghibi, Tofigh; Hoffmann, Sarah; Pfister, Beat

    2015-08-01

    Feature subset selection, as a special case of the general subset selection problem, has been the topic of a considerable number of studies due to the growing importance of data-mining applications. In the feature subset selection problem there are two main issues that need to be addressed: (i) Finding an appropriate measure function that can be fairly fast and robustly computed for high-dimensional data. (ii) A search strategy to optimize the measure over the subset space in a reasonable amount of time. In this article mutual information between features and class labels is considered to be the measure function. Two series expansions for mutual information are proposed, and it is shown that most heuristic criteria suggested in the literature are truncated approximations of these expansions. It is well-known that searching the whole subset space is an NP-hard problem. Here, instead of the conventional sequential search algorithms, we suggest a parallel search strategy based on semidefinite programming (SDP) that can search through the subset space in polynomial time. By exploiting the similarities between the proposed algorithm and an instance of the maximum-cut problem in graph theory, the approximation ratio of this algorithm is derived and is compared with the approximation ratio of the backward elimination method. The experiments show that it can be misleading to judge the quality of a measure solely based on the classification accuracy, without taking the effect of the non-optimum search strategy into account.

  10. Feature selection using angle modulated simulated Kalman filter for peak classification of EEG signals.

    Adam, Asrul; Ibrahim, Zuwairie; Mokhtar, Norrima; Shapiai, Mohd Ibrahim; Mubin, Marizan; Saad, Ismail

    2016-01-01

    In existing research on electroencephalogram (EEG) signal peak classification, the existing models, such as the Dumpala, Acir, Liu, and Dingle peak models, employ different sets of features. However, these models may not offer good performance for all applications, and their performance is found to be problem dependent. Therefore, the objective of this study is to combine all the associated features from the existing models before selecting the best combination of features. A new optimization algorithm, namely the angle modulated simulated Kalman filter (AMSKF), is employed as the feature selector. Also, the neural network random weight method is utilized in the proposed AMSKF technique as a classifier. In the experiment, 11,781 peak candidate samples are used for validation. The samples are collected from three different peak event-related EEG signals of 30 healthy subjects: (1) single eye blink, (2) double eye blink, and (3) eye movement signals. The experimental results have shown that the proposed AMSKF feature selector is able to find the best combination of features and performs on par with the existing related studies of epileptic EEG events classification.

  11. Hybrid feature selection algorithm using symmetrical uncertainty and a harmony search algorithm

    Salameh Shreem, Salam; Abdullah, Salwani; Nazri, Mohd Zakree Ahmad

    2016-04-01

    Microarray technology can be used as an efficient diagnostic system to recognise diseases such as tumours or to discriminate between different types of cancers in normal tissues. This technology has received increasing attention from the bioinformatics community because of its potential in designing powerful decision-making tools for cancer diagnosis. However, the presence of thousands or tens of thousands of genes affects the predictive accuracy of this technology from the perspective of classification. Thus, a key issue in microarray data is identifying or selecting the smallest possible set of genes from the input data that can achieve good predictive accuracy for classification. In this work, we propose a two-stage selection algorithm for gene selection problems in microarray data-sets called the symmetrical uncertainty filter and harmony search algorithm wrapper (SU-HSA). Experimental results show that the SU-HSA is better than HSA in isolation for all data-sets in terms of the accuracy and achieves a lower number of genes on 6 out of 10 instances. Furthermore, the comparison with state-of-the-art methods shows that our proposed approach is able to obtain 5 (out of 10) new best results in terms of the number of selected genes and competitive results in terms of the classification accuracy.
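
    The filter stage named in this abstract relies on symmetrical uncertainty (SU), which in its usual definition is SU(X, Y) = 2 I(X; Y) / (H(X) + H(Y)). A minimal sketch of that score follows, assuming features and class labels are already discrete; real expression values would first need a discretization step that is not shown here. The function names and the toy data are hypothetical, and the harmony search wrapper stage is not covered.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def symmetrical_uncertainty(x, y):
    """SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)), a value between 0 and 1."""
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0:
        return 0.0  # both variables are constant
    hxy = entropy(list(zip(x, y)))   # joint entropy H(X, Y)
    mutual_information = hx + hy - hxy
    return 2.0 * mutual_information / (hx + hy)


# Hypothetical discretized features scored against a binary class label.
label = [0, 0, 1, 1]
informative_feature = [0, 0, 1, 1]   # identical to the label
unrelated_feature = [0, 1, 0, 1]     # statistically independent of the label
print(symmetrical_uncertainty(informative_feature, label))
print(symmetrical_uncertainty(unrelated_feature, label))
```

    A filter of this kind would typically keep only the highest-scoring features and pass them to the wrapper stage, so the more expensive search runs over a much smaller candidate set.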

  12. Modeling the potential risk factors of bovine viral diarrhea prevalence in Egypt using univariable and multivariable logistic regression analyses

    Abdelfattah M. Selim

    2018-03-01

    Aim: The present cross-sectional study was conducted to determine the seroprevalence and potential risk factors associated with bovine viral diarrhea virus (BVDV) disease in cattle and buffaloes in Egypt, to model the potential risk factors associated with the disease using logistic regression (LR) models, and to fit the best predictive model for the current data. Materials and Methods: A total of 740 blood samples were collected between November 2012 and March 2013 from animals aged between 6 months and 3 years. The potential risk factors studied were species, age, sex, and herd location. All serum samples were examined with an indirect ELISA test for antibody detection. Data were analyzed with different statistical approaches such as the Chi-square test, odds ratios (OR), and univariable and multivariable LR models. Results: Results revealed a non-significant association between being seropositive with BVDV and all risk factors, except for species of animal. Seroprevalence percentages were 40% and 23% for cattle and buffaloes, respectively. OR for all categories were close to one with the highest OR for cattle relative to buffaloes, which was 2.237. Likelihood ratio tests showed a significant drop of the -2LL from univariable LR to multivariable LR models. Conclusion: There was evidence of high seroprevalence of BVDV among cattle as compared with buffaloes with the possibility of infection in different age groups of animals. In addition, the multivariable LR model was shown to provide more information for association and prediction purposes relative to univariable LR models and Chi-square tests when there is more than one predictor.
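
    The odds ratio quoted in this abstract can be checked approximately from the two reported seroprevalence percentages. The sketch below assumes only those rounded figures (40% and 23%); the per-species sample counts are not given in the abstract, so the result is an approximation and the function name is hypothetical.

```python
def odds_ratio_from_prevalence(p_exposed, p_reference):
    """Odds ratio of two groups given the proportion of positives in each."""
    for p in (p_exposed, p_reference):
        if not 0.0 < p < 1.0:
            raise ValueError("proportions must lie strictly between 0 and 1")
    odds_exposed = p_exposed / (1.0 - p_exposed)
    odds_reference = p_reference / (1.0 - p_reference)
    return odds_exposed / odds_reference


# Rounded seroprevalence figures from the abstract: cattle 40%, buffaloes 23%.
or_cattle_vs_buffalo = odds_ratio_from_prevalence(0.40, 0.23)
print(round(or_cattle_vs_buffalo, 3))
```

    The value obtained this way is about 2.23, close to the reported 2.237; the small gap is presumably due to the percentages being rounded.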

  13. A hybrid feature selection and health indicator construction scheme for delay-time-based degradation modelling of rolling element bearings

    Zhang, Bin; Deng, Congying; Zhang, Yi

    2018-03-01

    Rolling element bearings are mechanical components used frequently in most rotating machinery and they are also vulnerable links representing the main source of failures in such systems. Thus, health condition monitoring and fault diagnosis of rolling element bearings have long been studied to improve operational reliability and maintenance efficiency of rotating machines. Over the past decade, prognosis that enables forewarning of failure and estimation of residual life has attracted increasing attention. To accurately and efficiently predict failure of the rolling element bearing, the degradation needs to be well represented and modelled. For this purpose, degradation of the rolling element bearing is analysed with the delay-time-based model in this paper. Also, a hybrid feature selection and health indicator construction scheme is proposed for extraction of the bearing health relevant information from condition monitoring sensor data. Effectiveness of the presented approach is validated through case studies on rolling element bearing run-to-failure experiments.

  14. Multiple-output support vector machine regression with feature selection for arousal/valence space emotion assessment.

    Torres-Valencia, Cristian A; Álvarez, Mauricio A; Orozco-Gutiérrez, Alvaro A

    2014-01-01

    Human emotion recognition (HER) allows the assessment of an affective state of a subject. Until recently, such emotional states were described in terms of discrete emotions, like happiness or contempt. In order to cover a high range of emotions, researchers in the field have introduced different dimensional spaces for emotion description that allow the characterization of affective states in terms of several variables or dimensions that measure distinct aspects of the emotion. One of the most common of such dimensional spaces is the bidimensional Arousal/Valence space. To the best of our knowledge, all HER systems so far have modelled the dimensions in these dimensional spaces independently. In this paper, we study the effect of modelling the output dimensions simultaneously and show experimentally the advantages of modelling them in this way. We consider a multimodal approach by including features from the Electroencephalogram and a few physiological signals. For modelling the multiple outputs, we employ a multiple output regressor based on support vector machines. We also include a stage of feature selection that is developed within an embedded approach known as Recursive Feature Elimination (RFE), proposed initially for SVM. The results show that several features can be eliminated using the multiple output support vector regressor with RFE without affecting the performance of the regressor. From the analysis of the features selected in smaller subsets via RFE, it can be observed that the signals that are more informative for discrimination in the arousal and valence space are the EEG, Electrooculogram/Electromyogram (EOG/EMG) and the Galvanic Skin Response (GSR).

  15. Optimum location of external markers using feature selection algorithms for real-time tumor tracking in external-beam radiotherapy: a virtual phantom study.

    Nankali, Saber; Torshabi, Ahmad Esmaili; Miandoab, Payam Samadi; Baghizadeh, Amin

    2016-01-08

    In external-beam radiotherapy, using external markers is one of the most reliable tools to predict tumor position, in clinical applications. The main challenge in this approach is tumor motion tracking with highest accuracy that depends heavily on external markers location, and this issue is the objective of this study. Four commercially available feature selection algorithms entitled 1) Correlation-based Feature Selection, 2) Classifier, 3) Principal Components, and 4) Relief were proposed to find optimum location of external markers in combination with two "Genetic" and "Ranker" searching procedures. The performance of these algorithms has been evaluated using four-dimensional extended cardiac-torso anthropomorphic phantom. Six tumors in lung, three tumors in liver, and 49 points on the thorax surface were taken into account to simulate internal and external motions, respectively. The root mean square error of an adaptive neuro-fuzzy inference system (ANFIS) as prediction model was considered as metric for quantitatively evaluating the performance of proposed feature selection algorithms. To do this, the thorax surface region was divided into nine smaller segments and predefined tumors motion was predicted by ANFIS using external motion data of given markers at each small segment, separately. Our comparative results showed that all feature selection algorithms can reasonably select specific external markers from those segments where the root mean square error of the ANFIS model is minimum. Moreover, the performance accuracy of proposed feature selection algorithms was compared, separately. For this, each tumor motion was predicted using motion data of those external markers selected by each feature selection algorithm. Duncan statistical test, followed by F-test, on final results reflected that all proposed feature selection algorithms have the same performance accuracy for lung tumors. But for liver tumors, a correlation-based feature selection algorithm, in

  16. Optimum location of external markers using feature selection algorithms for real‐time tumor tracking in external‐beam radiotherapy: a virtual phantom study

    Nankali, Saber; Miandoab, Payam Samadi; Baghizadeh, Amin

    2016-01-01

    In external‐beam radiotherapy, using external markers is one of the most reliable tools to predict tumor position, in clinical applications. The main challenge in this approach is tumor motion tracking with highest accuracy that depends heavily on external markers location, and this issue is the objective of this study. Four commercially available feature selection algorithms entitled 1) Correlation‐based Feature Selection, 2) Classifier, 3) Principal Components, and 4) Relief were proposed to find optimum location of external markers in combination with two “Genetic” and “Ranker” searching procedures. The performance of these algorithms has been evaluated using four‐dimensional extended cardiac‐torso anthropomorphic phantom. Six tumors in lung, three tumors in liver, and 49 points on the thorax surface were taken into account to simulate internal and external motions, respectively. The root mean square error of an adaptive neuro‐fuzzy inference system (ANFIS) as prediction model was considered as metric for quantitatively evaluating the performance of proposed feature selection algorithms. To do this, the thorax surface region was divided into nine smaller segments and predefined tumors motion was predicted by ANFIS using external motion data of given markers at each small segment, separately. Our comparative results showed that all feature selection algorithms can reasonably select specific external markers from those segments where the root mean square error of the ANFIS model is minimum. Moreover, the performance accuracy of proposed feature selection algorithms was compared, separately. For this, each tumor motion was predicted using motion data of those external markers selected by each feature selection algorithm. Duncan statistical test, followed by F‐test, on final results reflected that all proposed feature selection algorithms have the same performance accuracy for lung tumors. But for liver tumors, a correlation‐based feature

  17. Univariate time series modeling and an application to future claims amount in SOCSO's invalidity pension scheme

    Chek, Mohd Zaki Awang; Ahmad, Abu Bakar; Ridzwan, Ahmad Nur Azam Ahmad; Jelas, Imran Md.; Jamal, Nur Faezah; Ismail, Isma Liana; Zulkifli, Faiz; Noor, Syamsul Ikram Mohd

    2012-09-01

    The main objective of this study is to forecast the future claims amount of the Invalidity Pension Scheme (IPS). All data were derived from SOCSO annual reports for the years 1972-2010. These claims comprise all claim amounts from the 7 benefits offered by SOCSO: Invalidity Pension, Invalidity Grant, Survivors Pension, Constant Attendance Allowance, Rehabilitation, Funeral and Education. Prediction of future claims of the Invalidity Pension Scheme will be made using Univariate Forecasting Models to predict the future claims among the workforce in Malaysia.

  18. Guaranteed Bounds on Information-Theoretic Measures of Univariate Mixtures Using Piecewise Log-Sum-Exp Inequalities

    Nielsen, Frank

    2016-12-09

    Information-theoretic measures, such as the entropy, the cross-entropy and the Kullback-Leibler divergence between two mixture models, are core primitives in many signal processing tasks. Since the Kullback-Leibler divergence of mixtures provably does not admit a closed-form formula, it is in practice either estimated using costly Monte Carlo stochastic integration, approximated or bounded using various techniques. We present a fast and generic method that builds algorithmically closed-form lower and upper bounds on the entropy, the cross-entropy, the Kullback-Leibler and the α-divergences of mixtures. We illustrate the versatile method by reporting our experiments for approximating the Kullback-Leibler and the α-divergences between univariate exponential mixtures, Gaussian mixtures, Rayleigh mixtures and Gamma mixtures.

  19. Use of near-infrared spectroscopy and feature selection techniques for predicting the caffeine content and roasting color in roasted coffees.

    Pizarro, Consuelo; Esteban-Díez, Isabel; González-Sáiz, José-María; Forina, Michele

    2007-09-05

    Near-infrared spectroscopy (NIRS), combined with diverse feature selection techniques and multivariate calibration methods, has been used to develop robust and reliable reduced-spectrum regression models based on a few NIR filter sensors for determining two key parameters for the characterization of roasted coffees, which are extremely relevant from a quality assurance standpoint: roasting color and caffeine content. The application of the stepwise orthogonalization of predictors (an "old" technique recently revisited, known by the acronym SELECT) provided notably improved regression models for the two response variables modeled, with root-mean-square errors of the residuals in external prediction (RMSEP) equal to 3.68 and 1.46% for roasting color and caffeine content of roasted coffee samples, respectively. The improvement achieved by the application of the SELECT-OLS method was particularly remarkable when the very low complexities associated with the final models obtained for predicting both roasting color (only 9 selected wavelengths) and caffeine content (17 significant wavelengths) were taken into account. The simple and reliable calibration models proposed in the present study encourage the possibility of implementing them in online and routine applications to predict quality parameters of unknown coffee samples via their NIR spectra, thanks to the use of a NIR instrument equipped with a proper filter system, which would imply a considerable simplification with regard to the recording and interpretation of the spectra, as well as an important economic saving.

  20. A Genetic-Based Feature Selection Approach in the Identification of Left/Right Hand Motor Imagery for a Brain-Computer Interface.

    Yaacoub, Charles; Mhanna, Georges; Rihana, Sandy

    2017-01-23

    Electroencephalography is a non-invasive measure of the brain electrical activity generated by millions of neurons. Feature extraction in electroencephalography analysis is a core issue that may lead to accurate brain mental state classification. This paper presents a new feature selection method that improves left/right hand movement identification of a motor imagery brain-computer interface, based on genetic algorithms and artificial neural networks used as classifiers. Raw electroencephalography signals are first preprocessed using appropriate filtering. Feature extraction is carried out afterwards, based on spectral and temporal signal components, and thus a feature vector is constructed. As various features might be inaccurate and mislead the classifier, thus degrading the overall system performance, the proposed approach identifies a subset of features from a large feature space, such that the classifier error rate is reduced. Experimental results show that the proposed method is able to reduce the number of features to as low as 0.5% (i.e., the number of ignored features can reach 99.5%) while improving the accuracy, sensitivity, specificity, and precision of the classifier.

  1. A Genetic-Based Feature Selection Approach in the Identification of Left/Right Hand Motor Imagery for a Brain-Computer Interface

    Charles Yaacoub

    2017-01-01

    Electroencephalography is a non-invasive measure of the brain electrical activity generated by millions of neurons. Feature extraction in electroencephalography analysis is a core issue that may lead to accurate brain mental state classification. This paper presents a new feature selection method that improves left/right hand movement identification of a motor imagery brain-computer interface, based on genetic algorithms and artificial neural networks used as classifiers. Raw electroencephalography signals are first preprocessed using appropriate filtering. Feature extraction is carried out afterwards, based on spectral and temporal signal components, and thus a feature vector is constructed. As various features might be inaccurate and mislead the classifier, thus degrading the overall system performance, the proposed approach identifies a subset of features from a large feature space, such that the classifier error rate is reduced. Experimental results show that the proposed method is able to reduce the number of features to as low as 0.5% (i.e., the number of ignored features can reach 99.5%) while improving the accuracy, sensitivity, specificity, and precision of the classifier.

  2. A compound structure of ELM based on feature selection and parameter optimization using hybrid backtracking search algorithm for wind speed forecasting

    Zhang, Chu; Zhou, Jianzhong; Li, Chaoshun; Fu, Wenlong; Peng, Tian

    2017-01-01

    Highlights: • A novel hybrid approach is proposed for wind speed forecasting. • The variational mode decomposition (VMD) is optimized to decompose the original wind speed series. • The input matrix and parameters of ELM are optimized simultaneously by using a hybrid BSA. • Results show that OVMD-HBSA-ELM achieves better performance in terms of prediction accuracy. - Abstract: Reliable wind speed forecasting is essential for wind power integration in wind power generation systems. The purpose of this paper is to develop a novel hybrid model for short-term wind speed forecasting and to demonstrate its efficiency. In the proposed model, a compound structure of extreme learning machine (ELM) based on feature selection and parameter optimization using a hybrid backtracking search algorithm (HBSA) is employed as the predictor. The real-valued BSA (RBSA) is used to search for the optimal combination of weights and bias of the ELM, while the binary-valued BSA (BBSA) is used as a feature selection method applied to the candidate inputs predefined by partial autocorrelation function (PACF) values to reconstruct the input matrix. Due to the volatility and randomness of the wind speed signal, an optimized variational mode decomposition (OVMD) is employed to eliminate the redundant noise. The parameters of the proposed OVMD are determined according to the center frequencies of the decomposed modes and the residual evaluation index (REI). The wind speed signal is decomposed into a few modes via OVMD. The aggregation of the forecasting results of these modes constructs the final forecasting result of the proposed model. The proposed hybrid model has been applied to the mean half-hour wind speed observation data from two wind farms in Inner Mongolia, China, and 10-min wind speed data from the Sotavento Galicia wind farm are studied as an additional case. Parallel experiments have been designed to compare with the proposed model. Results obtained from this study indicate that the

  3. Nuisance forecasting. Univariate modelling and very-short-term forecasting of winter smog episodes; Immissionsprognose. Univariate Modellierung und Kuerzestfristvorhersage von Wintersmogsituationen

    Schlink, U.

    1996-12-31

    The work specifically evaluates the nuisance data provided by the measuring station in the centre of Leipzig during the period from 1980 to 1993, with the aim of developing an algorithm for making very-short-term forecasts of excessive nuisances. Forecasting was to be univariate, i.e., based exclusively on the half-hourly readings of SO2 concentrations taken in the past. As shown by Fourier analysis, there exist three main and mutually independent spectral regions: the high-frequency sector (period < 12 hours) of unstable irregularities, the seasonal sector with periods of 24 and 12 hours, and the low-frequency sector (period > 24 hours). After the measurement series is broken up into components, the low-frequency sector is termed the trend component, or trend for short. A Kalman filter is used to obtain the components. It was found that smog episodes are most adequately described by the trend component, which is therefore investigated more closely. The phase representation then shows characteristic trajectories of the trends. (orig./KW)
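
    Illustrative sketch (not taken from the report): the abstract states that a Kalman filter is used to extract the trend component but gives no state-space model. The Python sketch below assumes the simplest option, a local-level (random-walk) model, and the function name and the two variance parameters are hypothetical choices for illustration only.

```python
def local_level_filter(observations, process_var=0.01, obs_var=1.0):
    """Kalman filter for an assumed local-level model:
    level[t] = level[t-1] + w,  y[t] = level[t] + v.
    Returns the filtered level (a slowly varying trend estimate)."""
    level = observations[0]      # initialise the state at the first reading
    var = obs_var                # initial uncertainty of the state
    trend = [level]
    for y in observations[1:]:
        var += process_var                 # predict: uncertainty grows
        gain = var / (var + obs_var)       # Kalman gain, strictly between 0 and 1
        level += gain * (y - level)        # update towards the new reading
        var *= (1.0 - gain)                # uncertainty shrinks after the update
        trend.append(level)
    return trend


# Made-up half-hourly concentrations with a step increase.
readings = [0.0] * 5 + [10.0] * 5
trend = local_level_filter(readings)
```

    A small `process_var` relative to `obs_var` makes the estimate follow only slow changes, which is the role the abstract assigns to the trend component.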

  4. Sparse feature selection identifies H2A.Z as a novel, pattern-specific biomarker for asymmetrically self-renewing distributed stem cells

    Yang Hoon Huh

    2015-03-01

    Full Text Available There is a long-standing unmet clinical need for biomarkers with high specificity for distributed stem cells (DSCs) in tissues, or for use in diagnostic and therapeutic cell preparations (e.g., bone marrow). Although DSCs are essential for tissue maintenance and repair, accurate determination of their numbers for medical applications has been problematic. Previous searches for biomarkers expressed specifically in DSCs were hampered by difficulty obtaining pure DSCs and by the challenges in mining complex molecular expression data. To identify such useful and specific DSC biomarkers, we combined a novel sparse feature selection method with combinatorial molecular expression data focused on asymmetric self-renewal, a conspicuous property of DSCs. The analysis identified reduced expression of the histone H2A variant H2A.Z as a superior molecular discriminator for DSC asymmetric self-renewal. Subsequent molecular expression studies showed H2A.Z to be a novel “pattern-specific biomarker” for asymmetrically self-renewing cells, with sufficient specificity to count asymmetrically self-renewing DSCs in vitro and potentially in situ.

  5. Classification of breast masses in ultrasound images using self-adaptive differential evolution extreme learning machine and rough set feature selection.

    Prabusankarlal, Kadayanallur Mahadevan; Thirumoorthy, Palanisamy; Manavalan, Radhakrishnan

    2017-04-01

    A method for classifying breast masses is investigated that uses rough set feature selection and an extreme learning machine (ELM) whose learning strategy and hidden node parameters are optimized by a self-adaptive differential evolution (SaDE) algorithm. A pathologically proven database of 140 breast ultrasound images, including 80 benign and 60 malignant, is used for this study. A fast nonlocal means algorithm is applied for speckle noise removal, and multiresolution analysis of the undecimated discrete wavelet transform is used for accurate segmentation of breast lesions. A total of 34 features, including 29 textural and five morphological, are applied to a [Formula: see text]-fold cross-validation scheme, in which the more relevant features are selected by the quick-reduct algorithm, and the breast masses are classified as benign or malignant using the SaDE-ELM classifier. The diagnostic accuracy of the system is assessed using parameters such as accuracy (Ac), sensitivity (Se), specificity (Sp), positive predictive value (PPV), negative predictive value (NPV), Matthews correlation coefficient (MCC), and area ([Formula: see text]) under the receiver operating characteristic curve. The performance of the proposed system is also compared with other classifiers, such as support vector machine and ELM. The results indicated that the proposed SaDE algorithm has superior performance with [Formula: see text], [Formula: see text], [Formula: see text], [Formula: see text], [Formula: see text], [Formula: see text], and [Formula: see text] compared to other classifiers.
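
    Illustrative sketch (not taken from the paper): the count-based measures named in the abstract (Ac, Se, Sp, PPV, NPV, MCC) all follow from the four cells of a binary confusion matrix. The Python function below uses the standard definitions; the function name and the example counts are hypothetical and are not the paper's results, which are elided in the record above.

```python
from math import sqrt


def diagnostic_metrics(tp, fp, tn, fn):
    """Standard binary-classification measures from confusion-matrix counts.
    Assumes each denominator below is non-zero (both classes present and
    both classes predicted at least once), except for MCC, which is set
    to 0.0 when its denominator vanishes."""
    total = tp + fp + tn + fn
    mcc_den = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "Ac": (tp + tn) / total,     # accuracy
        "Se": tp / (tp + fn),        # sensitivity (recall on malignant)
        "Sp": tn / (tn + fp),        # specificity (recall on benign)
        "PPV": tp / (tp + fp),       # positive predictive value
        "NPV": tn / (tn + fn),       # negative predictive value
        "MCC": (tp * tn - fp * fn) / mcc_den if mcc_den else 0.0,
    }


# Hypothetical counts for 60 malignant and 80 benign cases (made up).
metrics = diagnostic_metrics(tp=54, fp=8, tn=72, fn=6)
```

    The area under the ROC curve is not included because it needs the classifier's continuous scores rather than the four counts.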

  6. Correlation Feature Selection and Mutual Information Theory Based Quantitative Research on Meteorological Impact Factors of Module Temperature for Solar Photovoltaic Systems

    Yujing Sun

    2016-12-01

    Full Text Available The module temperature is the most important parameter influencing the output power of solar photovoltaic (PV) systems, aside from solar irradiance. In this paper, we focus on interdisciplinary research that combines correlation analysis, mutual information (MI) and heat transfer theory, and aims to determine the correlative relations between different meteorological impact factors (MIFs) and PV module temperature from both qualitative and quantitative aspects. The identification and confirmation of the primary MIFs of PV module temperature are investigated as the first step of this research, from the perspective of physical meaning and mathematical analysis of the electrical performance and thermal characteristics of PV modules based on the PV effect and heat transfer theory. Furthermore, the quantitative description of the influence of the MIFs on PV module temperature is mathematically formulated as several indexes using correlation-based feature selection (CFS) and MI theory to explore the specific impact degrees under four different typical weather statuses named general weather classes (GWCs). Case studies for the proposed methods were conducted using actual measurement data from a 500 kW grid-connected solar PV plant in China. The results not only verified the knowledge about the main MIFs of PV module temperature but, more importantly, also provided the specific ratios of the quantitative impact degrees of these three MIFs through CFS- and MI-based measures under four different GWCs.

  7. Hybridization between multi-objective genetic algorithm and support vector machine for feature selection in walker-assisted gait.

    Martins, Maria; Costa, Lino; Frizera, Anselmo; Ceres, Ramón; Santos, Cristina

    2014-03-01

    Walker devices are often prescribed incorrectly to patients, leading to increased dissatisfaction and to problems such as discomfort and pain. Thus, it is necessary to objectively evaluate the effects that assisted gait can have on the gait patterns of walker users, compared with non-assisted gait. A gait analysis focusing on spatiotemporal and kinematic parameters is performed for this purpose. However, gait analysis yields redundant information that is often difficult to interpret. This study addresses the problem of selecting the most relevant gait features required to differentiate between assisted and non-assisted gait. For that purpose, an efficient approach is presented that combines evolutionary techniques, based on genetic algorithms, with support vector machine algorithms to discriminate between assisted and non-assisted gait with a walker with forearm supports. For comparison purposes, other classification algorithms are also evaluated. Results with healthy subjects show that the main differences are characterized by balance and joint excursion in the sagittal plane. These results, confirmed by clinical evidence, lead to the conclusion that this technique is an efficient feature selection approach. Copyright © 2013 Elsevier Ireland Ltd. All rights reserved.

  8. Genetic Fuzzy System (GFS) based wavelet co-occurrence feature selection in mammogram classification for breast cancer diagnosis

    Meenakshi M. Pawar

    2016-09-01

    Full Text Available Breast cancer is a significant health problem diagnosed mostly in women worldwide. Therefore, early detection of breast cancer is performed with the help of digital mammography, which can reduce the mortality rate. This paper presents a wrapper-based feature selection approach for wavelet co-occurrence features (WCF) using a Genetic Fuzzy System (GFS) in the mammogram classification problem. The performance of the GFS algorithm is demonstrated using the mini-MIAS database. WCF features are obtained from the detail wavelet coefficients at each level of decomposition of the mammogram image. At the first level of decomposition, 18 features are applied to the GFS algorithm, which selects 5 features with an average classification success rate of 39.64%. At the second level, it selects 9 features from 36 features and the classification success rate improves to 56.75%. At the third level, 16 features are selected from 54 features and the average success rate improves to 64.98%. Lastly, at the fourth level, 72 features are applied to GFS, which selects 16 features, increasing the average success rate to 89.47%. Hence, the GFS algorithm is an effective way of obtaining an optimal set of features in breast cancer diagnosis.

  9. Computer aided diagnosis system for Alzheimer disease using brain diffusion tensor imaging features selected by Pearson's correlation.

    Graña, M; Termenon, M; Savio, A; Gonzalez-Pinto, A; Echeveste, J; Pérez, J M; Besga, A

    2011-09-20

    The aim of this paper is to obtain discriminant features from two scalar measures of Diffusion Tensor Imaging (DTI) data, Fractional Anisotropy (FA) and Mean Diffusivity (MD), and to train and test classifiers able to discriminate Alzheimer's Disease (AD) patients from controls on the basis of features extracted from the FA or MD volumes. In this study, a support vector machine (SVM) classifier was trained and tested on FA and MD data. Feature selection is done by computing the Pearson's correlation between the FA or MD values at each voxel site across subjects and the indicator variable specifying the subject class. Voxel sites with high absolute correlation are selected for feature extraction. Results are obtained from an ongoing study at Hospital de Santiago Apostol collecting anatomical T1-weighted MRI volumes and DTI data from healthy control subjects and AD patients. FA features and a linear SVM classifier achieve perfect accuracy, sensitivity and specificity in several cross-validation studies, supporting the usefulness of DTI-derived features as an image marker for AD and the feasibility of building Computer Aided Diagnosis systems for AD based on them. Copyright © 2011 Elsevier Ireland Ltd. All rights reserved.
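
    Illustrative sketch (not taken from the paper): the univariate selection step described above, correlating each voxel's value across subjects with the class indicator and keeping the voxels with the highest absolute correlation, can be written in a few lines of Python. The function names, the toy data, and the choice of keeping a fixed number `k` of voxels are assumptions for illustration; the abstract does not say how the cut-off is chosen.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def select_voxels(volumes, labels, k):
    """volumes: one list of voxel values per subject; labels: 0/1 per subject.
    Returns the indices of the k voxels with the largest |correlation|
    and the per-voxel absolute correlations."""
    n_voxels = len(volumes[0])
    scores = [abs(pearson([v[j] for v in volumes], labels))
              for j in range(n_voxels)]
    ranked = sorted(range(n_voxels), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k]), scores


# Toy data: 4 subjects (2 controls, 2 patients), 4 voxels each (made up).
volumes = [
    [0.2, 0.5, 0.9, 0.9],
    [0.3, 0.5, 0.1, 0.8],
    [0.8, 0.5, 0.8, 0.2],
    [0.9, 0.5, 0.2, 0.1],
]
labels = [0, 0, 1, 1]
selected, scores = select_voxels(volumes, labels, k=2)
```

    When such a filter is combined with cross-validation, the correlations are normally computed on the training subjects of each fold only, so that the held-out subjects do not influence which voxels are kept.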

  10. Behavioral performance follows the time course of neural facilitation and suppression during cued shifts of feature-selective attention.

    Andersen, S K; Müller, M M

    2010-08-03

    A central question in the field of attention is whether visual processing is a strictly limited resource, which must be allocated by selective attention. If this were the case, attentional enhancement of one stimulus should invariably lead to suppression of unattended distracter stimuli. Here we examine voluntary cued shifts of feature-selective attention to either one of two superimposed red or blue random dot kinematograms (RDKs) to test whether such a reciprocal relationship between enhancement of an attended and suppression of an unattended stimulus can be observed. The steady-state visual evoked potential (SSVEP), an oscillatory brain response elicited by the flickering RDKs, was measured in human EEG. Supporting limited resources, we observed both an enhancement of the attended and a suppression of the unattended RDK, but this observed reciprocity did not occur concurrently: enhancement of the attended RDK started at 220 ms after cue onset and preceded suppression of the unattended RDK by about 130 ms. Furthermore, we found that behavior was significantly correlated with the SSVEP time course of a measure of selectivity (attended minus unattended) but not with a measure of total activity (attended plus unattended). The significant deviations from a temporally synchronized reciprocity between enhancement and suppression suggest that the enhancement of the attended stimulus may cause the suppression of the unattended stimulus in the present experiment.

  11. Prediction of protein modification sites of pyrrolidone carboxylic acid using mRMR feature selection and analysis.

    Lu-Lu Zheng

    Full Text Available Pyrrolidone carboxylic acid (PCA) is formed during a common post-translational modification (PTM) of extracellular and multi-pass membrane proteins. In this study, we developed a new predictor to predict the modification sites of PCA based on maximum relevance minimum redundancy (mRMR) and incremental feature selection (IFS). We incorporated 727 features that belonged to 7 kinds of protein properties to predict the modification sites, including sequence conservation, residual disorder, amino acid factor, secondary structure and solvent accessibility, gain/loss of amino acid during evolution, propensity of amino acid to be conserved at protein-protein interface and protein surface, and deviation of side chain carbon atom number. Among these 727 features, 244 features were selected by mRMR and IFS as the optimized features for the prediction, with which the prediction model achieved a maximum MCC of 0.7812. Feature analysis showed that all feature types contributed to the modification process. Further site-specific feature analysis showed that the features derived from PCA's surrounding sites contributed more to the determination of PCA sites than other sites. The detailed feature analysis in this paper might provide important clues for understanding the mechanism of PCA formation and guide relevant experimental validations.
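
    Illustrative sketch (not taken from the paper): mRMR ranks features greedily, at each step adding the feature whose relevance to the class is largest after subtracting its average redundancy with the features already chosen. The Python sketch below assumes discrete features and the common mutual-information difference criterion; the function names and toy data are hypothetical, and the incremental feature selection (IFS) stage, which evaluates a classifier on growing prefixes of the ranked list, is not shown.

```python
from collections import Counter
from math import log


def mutual_information(xs, ys):
    """Mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * log((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())


def mrmr_select(columns, labels, k):
    """Greedy mRMR: maximise relevance minus mean redundancy.
    columns: one list of discrete values per feature."""
    relevance = [mutual_information(col, labels) for col in columns]
    selected, remaining = [], list(range(len(columns)))
    while remaining and len(selected) < k:
        best, best_score = None, float("-inf")
        for j in remaining:
            redundancy = 0.0
            if selected:
                redundancy = sum(mutual_information(columns[j], columns[s])
                                 for s in selected) / len(selected)
            score = relevance[j] - redundancy
            if score > best_score:       # ties keep the lowest index
                best, best_score = j, score
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data (made up): feature 1 duplicates feature 0; the label is a AND b.
a = [0, 0, 1, 1, 0, 0, 1, 1]
b = [0, 1, 0, 1, 0, 1, 0, 1]
labels = [x & y for x, y in zip(a, b)]
ranking = mrmr_select([a, list(a), b], labels, k=2)
```

    In the toy data the duplicated feature is as relevant as the original, but it is passed over in the second step because it is fully redundant with the first pick.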

  12. QRS complex detection based on continuous density hidden Markov models using univariate observations

    Sotelo, S.; Arenas, W.; Altuve, M.

    2018-04-01

    In the electrocardiogram (ECG), the detection of QRS complexes is a fundamental step in the ECG signal processing chain since it allows the determination of other characteristic waves of the ECG and provides information about heart rate variability. In this work, an automatic QRS complex detector based on continuous density hidden Markov models (HMM) is proposed. HMMs were trained using univariate observation sequences taken either from QRS complexes or their derivatives. The detection approach is based on comparing the log-likelihood of the observation sequence with a fixed threshold. A sliding window was used to obtain the observation sequence to be evaluated by the model. The threshold was optimized using receiver operating characteristic curves. Sensitivity (Sen), specificity (Spc) and F1 score were used to evaluate the detection performance. The approach was validated using ECG recordings from the MIT-BIH Arrhythmia database. A 6-fold cross-validation shows that the best detection performance was achieved with a 2-state HMM trained with QRS complex sequences (Sen = 0.668, Spc = 0.360 and F1 = 0.309). We concluded that these univariate sequences provide enough information to characterize the QRS complex dynamics with an HMM. Future work is directed to the use of multivariate observations to increase the detection performance.

  13. Wind Speed Prediction Using a Univariate ARIMA Model and a Multivariate NARX Model

    Erasmo Cadenas

    2016-02-01

    Full Text Available Two one-step-ahead wind speed forecasting models were compared. A univariate model was developed using a linear autoregressive integrated moving average (ARIMA) model. This method’s performance is well studied for a large number of prediction problems. The other is a multivariate model developed using a nonlinear autoregressive exogenous artificial neural network (NARX). This uses the variables barometric pressure, air temperature, wind direction and solar radiation or relative humidity, as well as delayed wind speed. Both models were developed from two databases from two sites: an hourly average measurements database from La Mata, Oaxaca, Mexico, and a ten-minute average measurements database from Metepec, Hidalgo, Mexico. The main objective was to compare the impact of the various meteorological variables on the performance of the multivariate model of wind speed prediction with respect to the high-performance univariate linear model. The NARX model gave better results, with improvements on the ARIMA model of between 5.5% and 10.6% for the hourly database and of between 2.3% and 12.8% for the ten-minute database for mean absolute error and mean squared error, respectively.

  14. Degree of contribution (DoC) feature selection algorithm for structural brain MRI volumetric features in depression detection.

    Kipli, Kuryati; Kouzani, Abbas Z

    2015-07-01

    Accurate detection of depression at an individual level using structural magnetic resonance imaging (sMRI) remains a challenge. Brain volumetric changes at a structural level appear to have importance in depression biomarkers studies. An automated algorithm is developed to select brain sMRI volumetric features for the detection of depression. A feature selection (FS) algorithm called degree of contribution (DoC) is developed for selection of sMRI volumetric features. This algorithm uses an ensemble approach to determine the degree of contribution in detection of major depressive disorder. The DoC is the score of feature importance used for feature ranking. The algorithm involves four stages: feature ranking, subset generation, subset evaluation, and DoC analysis. The performance of DoC is evaluated on the Duke University Multi-site Imaging Research in the Analysis of Depression sMRI dataset. The dataset consists of 115 brain sMRI scans of 88 healthy controls and 27 depressed subjects. Forty-four sMRI volumetric features are used in the evaluation. The DoC score of forty-four features was determined as the accuracy threshold (Acc_Thresh) was varied. The DoC performance was compared with that of four existing FS algorithms. At all defined Acc_Threshs, DoC outperformed the four examined FS algorithms for the average classification score and the maximum classification score. DoC has a good ability to generate reduced-size subsets of important features that could yield high classification accuracy. Based on the DoC score, the most discriminant volumetric features are those from the left-brain region.

  15. Reciprocal Benefits of Mass-Univariate and Multivariate Modeling in Brain Mapping: Applications to Event-Related Functional MRI, H₂¹⁵O-, and FDG-PET

    James R. Moeller

    2006-01-01

    Full Text Available In brain mapping studies of sensory, cognitive, and motor operations, specific waveforms of dynamic neural activity are predicted based on theoretical models of human information processing. For example, in event-related functional MRI (fMRI), the general linear model (GLM) is employed in mass-univariate analyses to identify the regions whose dynamic activity closely matches the expected waveforms. By comparison, multivariate analyses based on PCA or ICA provide greater flexibility in detecting spatiotemporal properties of experimental data that may strongly support alternative neuroscientific explanations. We investigated conjoint multivariate and mass-univariate analyses that combine the capabilities to (1) verify activation of neural machinery we already understand and (2) discover reliable signatures of new neural machinery. We examined combinations of GLM and PCA that recover latent neural signals (waveforms and footprints) with greater accuracy than either method alone. Comparative results are illustrated with analyses of real fMRI data, adding to Monte Carlo simulation support.

  16. Forecasting electric vehicles sales with univariate and multivariate time series models: The case of China.

    Zhang, Yong; Zhong, Miner; Geng, Nana; Jiang, Yunjian

    2017-01-01

    The market demand for electric vehicles (EVs) has increased in recent years. Suitable models are necessary to understand and forecast EV sales. This study presents a singular spectrum analysis (SSA) as a univariate time-series model and vector autoregressive model (VAR) as a multivariate model. Empirical results suggest that SSA satisfactorily indicates the evolving trend and provides reasonable results. The VAR model, which comprised exogenous parameters related to the market on a monthly basis, can significantly improve the prediction accuracy. The EV sales in China, which are categorized into battery and plug-in EVs, are predicted in both short term (up to December 2017) and long term (up to 2020), as statistical proofs of the growth of the Chinese EV industry.

  17. Lower bounds on the run time of the univariate marginal distribution algorithm on OneMax

    Krejca, Martin S.; Witt, Carsten

    2017-01-01

    The Univariate Marginal Distribution Algorithm (UMDA), a popular estimation of distribution algorithm, is studied from a run time perspective. On the classical OneMax benchmark function, a lower bound of Ω(μ√n + n log n), where μ is the population size, on its expected run time is proved. This is the first direct lower bound on the run time of the UMDA. It is stronger than the bounds that follow from general black-box complexity theory and is matched by the run time of many evolutionary algorithms. The results are obtained through advanced analyses of the stochastic change of the frequencies of bit values maintained by the algorithm, including carefully designed potential functions. These techniques may prove useful in advancing the field of run time analysis for estimation of distribution algorithms in general.
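
    Illustrative sketch (not taken from the paper): the UMDA keeps one frequency per bit position, samples a population from these frequencies, and resets each frequency to the share of 1-bits among the best individuals. The Python sketch below runs it on OneMax. The function names, the parameter values, and the frequency borders 1/n and 1 - 1/n (a common convention in run time analyses) are assumptions, since the abstract does not spell out the variant analysed.

```python
import random


def onemax(bits):
    """OneMax fitness: the number of 1-bits."""
    return sum(bits)


def umda_onemax(n, mu, lam, max_generations, seed=0):
    """UMDA on OneMax with frequencies clamped to [1/n, 1 - 1/n].
    lam individuals are sampled per generation and the best mu are used
    to update the frequencies. Returns (best individual seen, final
    frequencies, number of generations run)."""
    rng = random.Random(seed)
    lower, upper = 1.0 / n, 1.0 - 1.0 / n
    freqs = [0.5] * n
    best, generations = None, 0
    while generations < max_generations:
        generations += 1
        population = [[1 if rng.random() < p else 0 for p in freqs]
                      for _ in range(lam)]
        population.sort(key=onemax, reverse=True)
        if best is None or onemax(population[0]) > onemax(best):
            best = population[0]
        if onemax(best) == n:            # optimum sampled, stop early
            break
        selected = population[:mu]
        freqs = [min(upper, max(lower, sum(ind[i] for ind in selected) / mu))
                 for i in range(n)]
    return best, freqs, generations


best, freqs, generations = umda_onemax(n=20, mu=25, lam=50,
                                       max_generations=200, seed=1)
```

    The run time studied in the abstract corresponds to the number of fitness evaluations, which in this sketch is `lam` times the number of generations until the optimum is first sampled.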

  18. Automatic Feature Selection and Weighting for the Formation of Homogeneous Groups for Regional Intensity-Duration-Frequency (IDF) Curve Estimation

    Yang, Z.; Burn, D. H.

    2017-12-01

    Extreme rainfall events can have devastating impacts on society. To quantify the associated risk, the IDF curve has been used to provide the essential rainfall-related information for urban planning. However, the recent changes in the rainfall climatology caused by climate change and urbanization have made the estimates provided by the traditional regional IDF approach increasingly inaccurate. This inaccuracy is mainly caused by two problems: 1) the ineffective choice of similarity indicators for the formation of a homogeneous group at different regions; and 2) an inadequate number of stations in the pooling group that does not adequately reflect the optimal balance between group size and group homogeneity or achieve the lowest uncertainty in the rainfall quantile estimates. For the first issue, to consider the temporal difference among different meteorological and topographic indicators, a three-layer design is proposed based on three stages in extreme rainfall formation: cloud formation, rainfall generation and change of rainfall intensity above the urban surface. During the process, the impacts from climate change and urbanization are considered through the inclusion of potentially relevant features at each layer. Then, to consider the spatial difference of similarity indicators for homogeneous group formation at various regions, an automatic feature selection and weighting algorithm, specifically a hybrid search algorithm combining Tabu search, Lagrange multipliers and fuzzy C-means clustering, is used to select the optimal combination of features for forming the potentially optimal homogeneous groups at a specific region. For the second issue, to compare the uncertainty of rainfall quantile estimates among potential groups, a sample ranking process based on the two-sample Kolmogorov-Smirnov test is used. During the process, linear programming is used to rank these groups based on the confidence intervals of the quantile estimates. The proposed methodology fills the gap

  19. Sequence based prediction of DNA-binding proteins based on hybrid feature selection using random forest and Gaussian naïve Bayes.

    Wangchao Lou

    Full Text Available Developing an efficient method for determining DNA-binding proteins is highly desirable because of their vital roles in gene regulation, and it would be invaluable for advancing our understanding of protein functions. In this study, we proposed a new method for the prediction of DNA-binding proteins that ranks features using random forest and performs wrapper-based feature selection using a forward best-first search strategy. The features comprise information from primary sequence, predicted secondary structure, predicted relative solvent accessibility, and position specific scoring matrix. The proposed method, called DBPPred, used Gaussian naïve Bayes as the underlying classifier since it outperformed five other classifiers, including decision tree, logistic regression, k-nearest neighbor, support vector machine with polynomial kernel, and support vector machine with radial basis function. As a result, the proposed DBPPred yields the highest average accuracy of 0.791 and average MCC of 0.583 according to five-fold cross validation with ten runs on the training benchmark dataset PDB594. Subsequently, blind tests on the independent dataset PDB186 were performed by the proposed model trained on the entire PDB594 dataset and by five other existing methods (including iDNA-Prot, DNA-Prot, DNAbinder, DNABIND and DBD-Threader); the proposed DBPPred yielded the highest accuracy of 0.769, MCC of 0.538, and AUC of 0.790. The independent tests performed by the proposed DBPPred on a large dataset consisting entirely of non-DNA-binding proteins and on two RNA-binding protein datasets also showed improved or comparable quality when compared with the relevant prediction methods. Moreover, we observed that, for the majority of the features selected by the proposed method, the mean feature values differ statistically significantly between the DNA-binding and the non-DNA-binding proteins. All of the experimental results indicate that

  20. Application of Raman Spectroscopy and Univariate Modelling As a Process Analytical Technology for Cell Therapy Bioprocessing

    Baradez, Marc-Olivier; Biziato, Daniela; Hassan, Enas; Marshall, Damian

    2018-01-01

    Cell therapies offer unquestionable promises for the treatment, and in some cases even the cure, of complex diseases. As we start to see more of these therapies gaining market authorization, attention is turning to the bioprocesses used for their manufacture, in particular the challenge of gaining higher levels of process control to help regulate cell behavior, manage process variability, and deliver product of a consistent quality. Many processes already incorporate the measurement of key markers such as nutrient consumption, metabolite production, and cell concentration, but these are often performed off-line and only at set time points in the process. Having the ability to monitor these markers in real-time using in-line sensors would offer significant advantages, allowing faster decision-making and a finer level of process control. In this study, we use Raman spectroscopy as an in-line optical sensor for bioprocess monitoring of an autologous T-cell immunotherapy model produced in a stirred tank bioreactor system. Using reference datasets generated on a standard bioanalyzer, we develop chemometric models from the Raman spectra for glucose, glutamine, lactate, and ammonia. These chemometric models can accurately monitor donor-specific increases in nutrient consumption and metabolite production as the primary T-cells transition from a recovery phase and begin proliferating. Using a univariate modeling approach, we then show how changes in peak intensity within the Raman spectra can be correlated with cell concentration and viability. These models, which act as surrogate markers, can be used to monitor cell behavior including cell proliferation rates, proliferative capacity, and transition of the cells to a quiescent phenotype. Finally, using the univariate models, we also demonstrate how Raman spectroscopy can be applied for real-time monitoring. The ability to measure these key parameters using an in-line Raman optical sensor makes it possible to have immediate

  1. Application of Raman Spectroscopy and Univariate Modelling As a Process Analytical Technology for Cell Therapy Bioprocessing.

    Baradez, Marc-Olivier; Biziato, Daniela; Hassan, Enas; Marshall, Damian

    2018-01-01

    Cell therapies offer unquestionable promises for the treatment, and in some cases even the cure, of complex diseases. As we start to see more of these therapies gaining market authorization, attention is turning to the bioprocesses used for their manufacture, in particular the challenge of gaining higher levels of process control to help regulate cell behavior, manage process variability, and deliver product of a consistent quality. Many processes already incorporate the measurement of key markers such as nutrient consumption, metabolite production, and cell concentration, but these are often performed off-line and only at set time points in the process. Having the ability to monitor these markers in real-time using in-line sensors would offer significant advantages, allowing faster decision-making and a finer level of process control. In this study, we use Raman spectroscopy as an in-line optical sensor for bioprocess monitoring of an autologous T-cell immunotherapy model produced in a stirred tank bioreactor system. Using reference datasets generated on a standard bioanalyzer, we develop chemometric models from the Raman spectra for glucose, glutamine, lactate, and ammonia. These chemometric models can accurately monitor donor-specific increases in nutrient consumption and metabolite production as the primary T-cells transition from a recovery phase and begin proliferating. Using a univariate modeling approach, we then show how changes in peak intensity within the Raman spectra can be correlated with cell concentration and viability. These models, which act as surrogate markers, can be used to monitor cell behavior including cell proliferation rates, proliferative capacity, and transition of the cells to a quiescent phenotype. Finally, using the univariate models, we also demonstrate how Raman spectroscopy can be applied for real-time monitoring. The ability to measure these key parameters using an in-line Raman optical sensor makes it possible to have immediate

  2. Application of Raman Spectroscopy and Univariate Modelling As a Process Analytical Technology for Cell Therapy Bioprocessing

    Marc-Olivier Baradez

    2018-03-01

    Full Text Available Cell therapies offer unquestionable promises for the treatment, and in some cases even the cure, of complex diseases. As we start to see more of these therapies gaining market authorization, attention is turning to the bioprocesses used for their manufacture, in particular the challenge of gaining higher levels of process control to help regulate cell behavior, manage process variability, and deliver product of a consistent quality. Many processes already incorporate the measurement of key markers such as nutrient consumption, metabolite production, and cell concentration, but these are often performed off-line and only at set time points in the process. Having the ability to monitor these markers in real-time using in-line sensors would offer significant advantages, allowing faster decision-making and a finer level of process control. In this study, we use Raman spectroscopy as an in-line optical sensor for bioprocess monitoring of an autologous T-cell immunotherapy model produced in a stirred tank bioreactor system. Using reference datasets generated on a standard bioanalyzer, we develop chemometric models from the Raman spectra for glucose, glutamine, lactate, and ammonia. These chemometric models can accurately monitor donor-specific increases in nutrient consumption and metabolite production as the primary T-cells transition from a recovery phase and begin proliferating. Using a univariate modeling approach, we then show how changes in peak intensity within the Raman spectra can be correlated with cell concentration and viability. These models, which act as surrogate markers, can be used to monitor cell behavior including cell proliferation rates, proliferative capacity, and transition of the cells to a quiescent phenotype. Finally, using the univariate models, we also demonstrate how Raman spectroscopy can be applied for real-time monitoring. The ability to measure these key parameters using an in-line Raman optical sensor makes it possible

  3. Effects of univariate and multivariate regression on the accuracy of hydrogen quantification with laser-induced breakdown spectroscopy

    Ytsma, Cai R.; Dyar, M. Darby

    2018-01-01

    Hydrogen (H) is a critical element to measure on the surface of Mars because its presence in mineral structures is indicative of past hydrous conditions. The Curiosity rover uses the laser-induced breakdown spectrometer (LIBS) on the ChemCam instrument to analyze rocks for their H emission signal at 656.6 nm, from which H can be quantified. Previous LIBS calibrations for H used small data sets measured on standards and/or manufactured mixtures of hydrous minerals and rocks and applied univariate regression to spectra normalized in a variety of ways. However, matrix effects common to LIBS make these calibrations of limited usefulness when applied to the broad range of compositions on the Martian surface. In this study, 198 naturally-occurring hydrous geological samples covering a broad range of bulk compositions with directly-measured H content are used to create more robust prediction models for measuring H in LIBS data acquired under Mars conditions. Both univariate and multivariate prediction models, including partial least square (PLS) and the least absolute shrinkage and selection operator (Lasso), are compared using several different methods for normalization of H peak intensities. Data from the ChemLIBS Mars-analog spectrometer at Mount Holyoke College are compared against spectra from the same samples acquired using a ChemCam-like instrument at Los Alamos National Laboratory and the ChemCam instrument on Mars. Results show that all current normalization and data preprocessing variations for quantifying H result in models with statistically indistinguishable prediction errors (accuracies) ca. ± 1.5 weight percent (wt%) H2O, limiting the applications of LIBS in these implementations for geological studies. This error is too large to allow distinctions among the most common hydrous phases (basalts, amphiboles, micas) to be made, though some clays (e.g., chlorites with ≈ 12 wt% H2O, smectites with 15-20 wt% H2O) and hydrated phases (e.g., gypsum with ≈ 20

  4. Univariate and multivariate analysis on processing tomato quality under different mulches

    Carmen Moreno

    2014-04-01

    Full Text Available The use of eco-friendly mulch materials as alternatives to the standard polyethylene (PE) has become increasingly prevalent worldwide. Consequently, a comparison of mulch materials from different origins is necessary to evaluate their feasibility. Several researchers have compared the effects of mulch materials on each crop variable through univariate analysis (ANOVA). However, it is important to focus on the effect of these materials on fruit quality, because this factor decisively influences the acceptance of the final product by consumers and the industrial sector. This study aimed to analyze the information supplied by a randomized complete block experiment combined over two seasons, a principal component analysis (PCA) and a cluster analysis (CA) when studying the effects of mulch materials on the quality of processing tomato (Lycopersicon esculentum Mill.). The study focused on the variability in the quality measurements and on the determination of mulch materials with a similar response to them. A comparison of the results from both types of analysis yielded complementary information. ANOVA showed the similarity of certain materials. However, considering the totality of the variables analyzed, the final interpretation was slightly complicated. PCA indicated that the juice color, the fruit firmness and the soluble solid content were the most influential factors in the total variability of a set of 12 juice and fruit variables, and CA allowed us to establish four categories of treatment: plastics (polyethylene - PE, oxo- and biodegradable materials), papers, manual weeding and barley (Hordeum vulgare L.) straw. Oxobiodegradable and PE were most closely related based on CA.

  5. Influence of microclimatic ammonia levels on productive performance of different broilers' breeds estimated with univariate and multivariate approaches.

    Soliman, Essam S; Moawed, Sherif A; Hassan, Rania A

    2017-08-01

    Bird litter contains unutilized nitrogen in the form of uric acid that is converted into ammonia, a fact that not only affects poultry performance but also has a negative effect on the health of people around the farm and contributes to environmental degradation. The influence of microclimatic ammonia emissions on Ross and Hubbard broilers reared in different housing systems in two consecutive seasons (fall and winter) was evaluated using a discriminant function analysis to differentiate between Ross and Hubbard breeds. A total of 400 air samples were collected and analyzed for ammonia levels during the experimental period. Data were analyzed using univariate and multivariate statistical methods. Ammonia levels were significantly higher (p0.05) were found between the two farms in body weight, body weight gain, feed intake, feed conversion ratio, and performance index (PI) of broilers. Body weight; weight gain and PI had increased values (pbroiler breed. Ammonia emissions were positively (although weakly) correlated with the ambient relative humidity (r=0.383; p0.05). The test of significance of the discriminant function analysis did not show a classification based on the studied traits, suggesting that they cannot be used as predictor variables. The percentage of correct classification was 52%, and it improved to 57% after deletion of highly correlated traits. The study revealed that broilers' growth was negatively affected by increased microclimatic ammonia concentrations and recommended analyzing broilers' growth performance data using multivariate discriminant function analysis.

  6. ReliefSeq: a gene-wise adaptive-K nearest-neighbor feature selection tool for finding gene-gene interactions and main effects in mRNA-Seq gene expression data.

    Brett A McKinney

    Full Text Available Relief-F is a nonparametric, nearest-neighbor machine learning method that has been successfully used to identify relevant variables that may interact in complex multivariate models to explain phenotypic variation. While several tools have been developed for assessing differential expression in sequence-based transcriptomics, the detection of statistical interactions between transcripts has received less attention in the area of RNA-seq analysis. We describe a new extension and assessment of Relief-F for feature selection in RNA-seq data. The ReliefSeq implementation adapts the number of nearest neighbors (k) for each gene to optimize the Relief-F test statistics (importance scores) for finding both main effects and interactions. We compare this gene-wise adaptive-k (gwak) Relief-F method with standard RNA-seq feature selection tools, such as DESeq and edgeR, and with the popular machine learning method Random Forests. We demonstrate performance on a panel of simulated data that have a range of distributional properties reflected in real mRNA-seq data including multiple transcripts with varying sizes of main effects and interaction effects. For simulated main effects, gwak-Relief-F feature selection performs comparably to standard tools DESeq and edgeR for ranking relevant transcripts. For gene-gene interactions, gwak-Relief-F outperforms all comparison methods at ranking relevant genes in all but the highest fold change/highest signal situations where it performs similarly. The gwak-Relief-F algorithm outperforms Random Forests for detecting relevant genes in all simulation experiments. In addition, Relief-F is comparable to the other methods based on computational time. We also apply ReliefSeq to an RNA-Seq study of smallpox vaccine to identify gene expression changes between vaccinia virus-stimulated and unstimulated samples. 
ReliefSeq is an attractive tool for inclusion in the suite of tools used for analysis of mRNA-Seq data; it has power to

  7. Study of Ecotype and Sowing Date Interaction in Cumin (Cuminum cyminum L.) using Different Univariate Stability Parameters

    J Ghanbari

    2017-06-01

    Full Text Available Introduction Cumin is one of the most important medicinal plants in Iran and today it ranks second in popularity among spices worldwide, after black pepper. Cumin is an aromatic plant used as a flavoring and seasoning agent in foods. Cumin seeds have been found to possess significant biological activity and have been used for treatment of toothache, dyspepsia, diarrhoea, epilepsy and jaundice. Knowledge of genotype-by-environment interaction (GEI) helps in obtaining a cultivar that gives consistently high yield across a broad range of environments and in increasing the efficiency of breeding programs and the selection of the best genotypes. A genotype that has stable trait expression across environments contributes little to GEI and its performance should be more predictable from the main effects. Several statistical methods have been proposed for stability analysis, with the aim of explaining the information contained in the GEI. The regression technique was proposed by Finlay and Wilkinson (1963) and was improved by Eberhart and Russell (1966). Generally, genotype stability is estimated by the slope of, and deviation from, the regression line for each genotype. This is a popular method in stability analysis and has been applied in many crops. Non-parametric methods (rank mean (R), standard deviation rank (SDR) and yield index ratio (YIR)), environmental variance (S2i) and genotypic variation coefficient (CVi), Wricke's ecovalence and Shukla's stability variance (Shukla, 1972) have been used to determine genotype-by-environment interaction in many studies. This study aimed to evaluate the ecotype × sowing date interaction in cumin and the genotypic response of cumin to different sowing dates using univariate stability parameters. Materials and Methods To study the ecotype × sowing date interaction, different cumin ecotypes: Semnan, Fars, Yazd, Golestan, Khorasan-Razavi, Khorasan-Shomali, Khorasan-Jonoubi, Isfahan and Kerman in 5 different sowing dates (26th December, 10th January

  8. An exercise in model validation: Comparing univariate statistics and Monte Carlo-based multivariate statistics

    Weathers, J.B.; Luck, R.; Weathers, J.W.

    2009-01-01

    The complexity of mathematical models used by practicing engineers is increasing due to the growing availability of sophisticated mathematical modeling tools and ever-improving computational power. For this reason, the need to define a well-structured process for validating these models against experimental results has become a pressing issue in the engineering community. This validation process is partially characterized by the uncertainties associated with the modeling effort as well as the experimental results. The net impact of the uncertainties on the validation effort is assessed through the 'noise level of the validation procedure', which can be defined as an estimate of the 95% confidence uncertainty bounds for the comparison error between actual experimental results and model-based predictions of the same quantities of interest. Although general descriptions associated with the construction of the noise level using multivariate statistics exists in the literature, a detailed procedure outlining how to account for the systematic and random uncertainties is not available. In this paper, the methodology used to derive the covariance matrix associated with the multivariate normal pdf based on random and systematic uncertainties is examined, and a procedure used to estimate this covariance matrix using Monte Carlo analysis is presented. The covariance matrices are then used to construct approximate 95% confidence constant probability contours associated with comparison error results for a practical example. In addition, the example is used to show the drawbacks of using a first-order sensitivity analysis when nonlinear local sensitivity coefficients exist. Finally, the example is used to show the connection between the noise level of the validation exercise calculated using multivariate and univariate statistics.
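
    The Monte Carlo step described in this abstract can be illustrated with a small sketch. It is not the authors' procedure: it assumes a two-quantity comparison error, independent zero-mean Gaussian random errors, and one Gaussian systematic error shared by both quantities. The function names and all numeric values are hypothetical.

```python
import numpy as np

def monte_carlo_covariance(n_draws, random_sd, systematic_sd, seed=0):
    """Estimate the covariance matrix of a comparison-error vector by sampling.

    random_sd     : standard deviation of the independent random error of each quantity
    systematic_sd : standard deviation of a single systematic error that is
                    shared by all quantities (an assumption of this sketch)
    """
    rng = np.random.default_rng(seed)
    random_sd = np.asarray(random_sd, dtype=float)
    k = random_sd.size
    # Independent random errors, one column per quantity of interest.
    random_part = rng.normal(0.0, random_sd, size=(n_draws, k))
    # One systematic draw per trial, added to every quantity, which
    # produces the off-diagonal (correlated) terms of the covariance.
    shared_bias = rng.normal(0.0, systematic_sd, size=(n_draws, 1))
    errors = random_part + shared_bias
    return np.cov(errors, rowvar=False)

def inside_95_contour(error, cov):
    """Check whether a 2-D comparison error lies inside the 95% constant
    probability contour of a zero-mean bivariate normal distribution."""
    error = np.asarray(error, dtype=float)
    d2 = error @ np.linalg.solve(cov, error)  # squared Mahalanobis distance
    # For two dimensions the 95% chi-square quantile is -2 ln(0.05), about 5.99.
    return bool(d2 <= -2.0 * np.log(0.05))

cov = monte_carlo_covariance(200_000, random_sd=[1.0, 2.0], systematic_sd=0.5)
print(cov)  # close to [[1.25, 0.25], [0.25, 4.25]] under the stated assumptions
```

    Under these assumptions the shared systematic error adds its variance to every entry of the matrix, which is why the off-diagonal terms are non-zero and why a purely univariate treatment of each quantity would ignore part of the picture.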

  9. An exercise in model validation: Comparing univariate statistics and Monte Carlo-based multivariate statistics

    Weathers, J.B. [Shock, Noise, and Vibration Group, Northrop Grumman Shipbuilding, P.O. Box 149, Pascagoula, MS 39568 (United States)], E-mail: James.Weathers@ngc.com; Luck, R. [Department of Mechanical Engineering, Mississippi State University, 210 Carpenter Engineering Building, P.O. Box ME, Mississippi State, MS 39762-5925 (United States)], E-mail: Luck@me.msstate.edu; Weathers, J.W. [Structural Analysis Group, Northrop Grumman Shipbuilding, P.O. Box 149, Pascagoula, MS 39568 (United States)], E-mail: Jeffrey.Weathers@ngc.com

    2009-11-15

    The complexity of mathematical models used by practicing engineers is increasing due to the growing availability of sophisticated mathematical modeling tools and ever-improving computational power. For this reason, the need to define a well-structured process for validating these models against experimental results has become a pressing issue in the engineering community. This validation process is partially characterized by the uncertainties associated with the modeling effort as well as the experimental results. The net impact of the uncertainties on the validation effort is assessed through the 'noise level of the validation procedure', which can be defined as an estimate of the 95% confidence uncertainty bounds for the comparison error between actual experimental results and model-based predictions of the same quantities of interest. Although general descriptions associated with the construction of the noise level using multivariate statistics exists in the literature, a detailed procedure outlining how to account for the systematic and random uncertainties is not available. In this paper, the methodology used to derive the covariance matrix associated with the multivariate normal pdf based on random and systematic uncertainties is examined, and a procedure used to estimate this covariance matrix using Monte Carlo analysis is presented. The covariance matrices are then used to construct approximate 95% confidence constant probability contours associated with comparison error results for a practical example. In addition, the example is used to show the drawbacks of using a first-order sensitivity analysis when nonlinear local sensitivity coefficients exist. Finally, the example is used to show the connection between the noise level of the validation exercise calculated using multivariate and univariate statistics.

  10. Comparison of univariate and multivariate calibration for the determination of micronutrients in pellets of plant materials by laser induced breakdown spectrometry

    Batista Braga, Jez Willian; Trevizan, Lilian Cristina; Nunes, Lidiane Cristina; Aparecida Rufini, Iolanda; Santos, Dario; Krug, Francisco Jose

    2010-01-01

    The application of laser induced breakdown spectrometry (LIBS) aiming the direct analysis of plant materials is a great challenge that still needs efforts for its development and validation. In this way, a series of experimental approaches has been carried out in order to show that LIBS can be used as an alternative method to wet acid digestions based methods for analysis of agricultural and environmental samples. The large amount of information provided by LIBS spectra for these complex samples increases the difficulties for selecting the most appropriated wavelengths for each analyte. Some applications have suggested that improvements in both accuracy and precision can be achieved by the application of multivariate calibration in LIBS data when compared to the univariate regression developed with line emission intensities. In the present work, the performance of univariate and multivariate calibration, based on partial least squares regression (PLSR), was compared for analysis of pellets of plant materials made from an appropriate mixture of cryogenically ground samples with cellulose as the binding agent. The development of a specific PLSR model for each analyte and the selection of spectral regions containing only lines of the analyte of interest were the best conditions for the analysis. In this particular application, these models showed a similar performance, but PLSR seemed to be more robust due to a lower occurrence of outliers in comparison to the univariate method. Data suggests that efforts dealing with sample presentation and fitness of standards for LIBS analysis must be done in order to fulfill the boundary conditions for matrix independent development and validation.
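
    As a rough illustration of the univariate side of this comparison, the sketch below fits a straight-line calibration between the intensity of one emission line and the analyte concentration, then inverts it to predict an unknown. It assumes a linear response and uses made-up numbers; it is not the calibration reported in the paper, and the function names are hypothetical.

```python
import numpy as np

def fit_univariate_calibration(concentrations, intensities):
    """Least-squares line: intensity = slope * concentration + intercept."""
    slope, intercept = np.polyfit(concentrations, intensities, deg=1)
    return slope, intercept

def predict_concentration(intensity, slope, intercept):
    """Invert the calibration line to estimate a concentration."""
    return (intensity - intercept) / slope

# Hypothetical calibration standards (concentration, line intensity).
conc = np.array([0.0, 1.0, 2.0, 3.0])
signal = np.array([1.0, 3.0, 5.0, 7.0])

slope, intercept = fit_univariate_calibration(conc, signal)
print(predict_concentration(4.0, slope, intercept))
```

    A multivariate calibration such as PLSR would replace the single intensity with a vector of intensities from a spectral region, which is the alternative the abstract compares against.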

  11. Influence of microclimatic ammonia levels on productive performance of different broilers’ breeds estimated with univariate and multivariate approaches

    Soliman, Essam S.; Moawed, Sherif A.; Hassan, Rania A.

    2017-01-01

    Background and Aim: Bird litter contains unutilized nitrogen in the form of uric acid that is converted into ammonia, a fact that not only affects poultry performance but also has a negative effect on the health of people around the farm and contributes to environmental degradation. The influence of microclimatic ammonia emissions on Ross and Hubbard broilers reared in different housing systems in two consecutive seasons (fall and winter) was evaluated using a discriminant function analysis to differentiate between Ross and Hubbard breeds. Materials and Methods: A total of 400 air samples were collected and analyzed for ammonia levels during the experimental period. Data were analyzed using univariate and multivariate statistical methods. Results: Ammonia levels were significantly higher (p0.05) were found between the two farms in body weight, body weight gain, feed intake, feed conversion ratio, and performance index (PI) of broilers. Body weight; weight gain and PI had increased values (pbroiler breed. Ammonia emissions were positively (although weakly) correlated with the ambient relative humidity (r=0.383; p0.05). The test of significance of the discriminant function analysis did not show a classification based on the studied traits, suggesting that they cannot be used as predictor variables. The percentage of correct classification was 52%, and it improved to 57% after deletion of highly correlated traits. Conclusion: The study revealed that broilers’ growth was negatively affected by increased microclimatic ammonia concentrations and recommended analyzing broilers’ growth performance data using multivariate discriminant function analysis. PMID:28919677

  12. The use of principal components and univariate charts to control multivariate processes

    Marcela A. G. Machado

    2008-04-01

    Full Text Available In this article, we evaluate the performance of the T² chart based on the principal components (PC chart) and the simultaneous univariate control charts based on the original variables (SU X charts) or based on the principal components (SUPC charts). The main reason to consider the PC chart lies in the dimensionality reduction. However, depending on the disturbance and on the way the original variables are related, the chart is very slow in signaling, except when all variables are negatively correlated and the principal component is wisely selected. Comparing the SU X, the SUPC and the T² charts, we conclude that the SU X charts (SUPC charts) have a better overall performance when the variables are positively (negatively) correlated. We also develop the expression to obtain the power of two S² charts designed for monitoring the covariance matrix. These joint S² charts are, in the majority of the cases, more efficient than the generalized variance chart.
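
    For readers unfamiliar with the statistic behind these charts, the following sketch computes Hotelling's T² for a single observation and shows that the same value is obtained by summing the squared principal-component scores, each divided by its eigenvalue. It assumes a known in-control mean vector and covariance matrix and individual observations; the chart designs and control limits studied in the article are not reproduced here, and the function names are hypothetical.

```python
import numpy as np

def hotelling_t2(x, mean, cov):
    """T² = (x - mean)' inv(cov) (x - mean) for one observation."""
    d = np.asarray(x, dtype=float) - np.asarray(mean, dtype=float)
    return float(d @ np.linalg.solve(cov, d))

def t2_from_principal_components(x, mean, cov):
    """Same statistic, written as a sum over principal components."""
    d = np.asarray(x, dtype=float) - np.asarray(mean, dtype=float)
    eigvals, eigvecs = np.linalg.eigh(np.asarray(cov, dtype=float))
    scores = eigvecs.T @ d  # projection of the deviation on each component
    return float(np.sum(scores**2 / eigvals))

mean = np.array([0.0, 0.0])
cov = np.array([[1.0, 0.5],
                [0.5, 1.0]])  # positively correlated pair (illustrative)
x = np.array([1.0, -1.0])     # shift that runs against the correlation

print(hotelling_t2(x, mean, cov))                  # 4.0 up to rounding
print(t2_from_principal_components(x, mean, cov))  # 4.0 up to rounding
```

    In this example the whole deviation lies along the component with the smaller eigenvalue, so a chart that monitors only the other component would not react to it. This is one way to read the abstract's remark that a chart built on a selected principal component can be slow to signal, depending on the disturbance.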

  13. Development of in Silico Models for Predicting P-Glycoprotein Inhibitors Based on a Two-Step Approach for Feature Selection and Its Application to Chinese Herbal Medicine Screening.

    Yang, Ming; Chen, Jialei; Shi, Xiufeng; Xu, Liwen; Xi, Zhijun; You, Lisha; An, Rui; Wang, Xinhong

    2015-10-05

    P-glycoprotein (P-gp) is regarded as an important factor in determining the ADMET (absorption, distribution, metabolism, elimination, and toxicity) characteristics of drugs and drug candidates. Successful prediction of P-gp inhibitors can thus lead to an improved understanding of the underlying mechanisms of both changes in the pharmacokinetics of drugs and drug-drug interactions. Therefore, there has been considerable interest in the development of in silico modeling of P-gp inhibitors in recent years. Considering that a large number of molecular descriptors are used to characterize diverse structural moleculars, efficient feature selection methods are required to extract the most informative predictors. In this work, we constructed an extensive available data set of 2428 molecules that includes 1518 P-gp inhibitors and 910 P-gp noninhibitors from multiple resources. Importantly, a two-step feature selection approach based on a genetic algorithm and a greedy forward-searching algorithm was employed to select the minimum set of the most informative descriptors that contribute to the prediction of P-gp inhibitors. To determine the best machine learning algorithm, 18 classifiers coupled with the feature selection method were compared. The top three best-performing models (flexible discriminant analysis, support vector machine, and random forest) and their ensemble model using respectively only 3, 9, 7, and 14 descriptors achieve an overall accuracy of 83.2%-86.7% for the training set containing 1040 compounds, an overall accuracy of 82.3%-85.5% for the test set containing 1039 compounds, and a prediction accuracy of 77.4%-79.9% for the external validation set containing 349 compounds. The models were further extensively validated by DrugBank database (1890 compounds). The proposed models are competitive with and in some cases better than other published models in terms of prediction accuracy and minimum number of descriptors. 
Applicability domain then was addressed

  14. Pap Smear Diagnosis Using a Hybrid Intelligent Scheme Focusing on Genetic Algorithm Based Feature Selection and Nearest Neighbor Classification

    Marinakis, Yannis; Dounias, Georgios; Jantzen, Jan

    2009-01-01

    The term pap-smear refers to samples of human cells stained by the so-called Papanicolaou method. The purpose of the Papanicolaou method is to diagnose pre-cancerous cell changes before they progress to invasive carcinoma. In this paper a metaheuristic algorithm is proposed in order to classify t...... other previously applied intelligent approaches....

  15. Toward better public health reporting using existing off the shelf approaches: A comparison of alternative cancer detection approaches using plaintext medical data and non-dictionary based feature selection.

    Kasthurirathne, Suranga N; Dixon, Brian E; Gichoya, Judy; Xu, Huiping; Xia, Yuni; Mamlin, Burke; Grannis, Shaun J

    2016-04-01

    Increased adoption of electronic health records has resulted in increased availability of free text clinical data for secondary use. A variety of approaches to obtain actionable information from unstructured free text data exist. These approaches are resource intensive, inherently complex and rely on structured clinical data and dictionary-based approaches. We sought to evaluate the potential to obtain actionable information from free text pathology reports using routinely available tools and approaches that do not depend on dictionary-based approaches. We obtained pathology reports from a large health information exchange and evaluated the capacity to detect cancer cases from these reports using 3 non-dictionary feature selection approaches, 4 feature subset sizes, and 5 clinical decision models: simple logistic regression, naïve bayes, k-nearest neighbor, random forest, and J48 decision tree. The performance of each decision model was evaluated using sensitivity, specificity, accuracy, positive predictive value, and area under the receiver operating characteristics (ROC) curve. Decision models parameterized using automated, informed, and manual feature selection approaches yielded similar results. Furthermore, non-dictionary classification approaches identified cancer cases present in free text reports with evaluation measures approaching and exceeding 80-90% for most metrics. Our methods are feasible and practical approaches for extracting substantial information value from free text medical data, and the results suggest that these methods can perform on par, if not better, than existing dictionary-based approaches. Given that public health agencies are often under-resourced and lack the technical capacity for more complex methodologies, these results represent potentially significant value to the public health field. Copyright © 2016 Elsevier Inc. All rights reserved.
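
    The evaluation measures named in this abstract follow directly from the four cells of a confusion matrix. The sketch below uses hypothetical counts, not the study's data, and omits the area under the ROC curve because that needs ranked scores rather than counts.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute common evaluation measures from confusion-matrix counts.

    tp, fp, tn, fn: true positives, false positives,
                    true negatives, false negatives.
    """
    return {
        "sensitivity": tp / (tp + fn),            # share of cancer cases detected
        "specificity": tn / (tn + fp),            # share of non-cases correctly rejected
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "ppv": tp / (tp + fp),                    # positive predictive value
    }

# Hypothetical counts for illustration only.
print(classification_metrics(tp=80, fp=10, tn=90, fn=20))
```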

  16. A swarm-trained k-nearest prototypes adaptive classifier with automatic feature selection for interval data.

    Silva Filho, Telmo M; Souza, Renata M C R; Prudêncio, Ricardo B C

    2016-08-01

    Some complex data types are capable of modeling data variability and imprecision. These data types are studied in the symbolic data analysis field. One such data type is interval data, which represents ranges of values and is more versatile than classic point data for many domains. This paper proposes a new prototype-based classifier for interval data, trained by a swarm optimization method. Our work has two main contributions: a swarm method which is capable of performing both automatic selection of features and pruning of unused prototypes and a generalized weighted squared Euclidean distance for interval data. By discarding unnecessary features and prototypes, the proposed algorithm deals with typical limitations of prototype-based methods, such as the problem of prototype initialization. The proposed distance is useful for learning classes in interval datasets with different shapes, sizes and structures. When compared to other prototype-based methods, the proposed method achieves lower error rates in both synthetic and real interval datasets. Copyright © 2016 Elsevier Ltd. All rights reserved.

  17. AHIMSA - Ad hoc histogram information measure sensing algorithm for feature selection in the context of histogram inspired clustering techniques

    Dasarathy, B. V.

    1976-01-01

    An algorithm is proposed for dimensionality reduction in the context of clustering techniques based on histogram analysis. The approach is based on an evaluation of the hills and valleys in the unidimensional histograms along the different features and provides an economical means of assessing the significance of the features in a nonparametric unsupervised data environment. The method has relevance to remote sensing applications.

  18. Manifold Regularized Multi-Task Feature Selection for Multi-Modality Classification in Alzheimer’s Disease

    Jie, Biao; Cheng, Bo

    2014-01-01

    Accurate diagnosis of Alzheimer’s disease (AD), as well as its pro-dromal stage (i.e., mild cognitive impairment, MCI), is very important for possible delay and early treatment of the disease. Recently, multi-modality methods have been used for fusing information from multiple different and complementary imaging and non-imaging modalities. Although there are a number of existing multi-modality methods, few of them have addressed the problem of joint identification of disease-related brain regions from multi-modality data for classification. In this paper, we proposed a manifold regularized multi-task learning framework to jointly select features from multi-modality data. Specifically, we formulate the multi-modality classification as a multi-task learning framework, where each task focuses on the classification based on each modality. In order to capture the intrinsic relatedness among multiple tasks (i.e., modalities), we adopted a group sparsity regularizer, which ensures only a small number of features to be selected jointly. In addition, we introduced a new manifold based Laplacian regularization term to preserve the geometric distribution of original data from each task, which can lead to the selection of more discriminative features. Furthermore, we extend our method to the semi-supervised setting, which is very important since the acquisition of a large set of labeled data (i.e., diagnosis of disease) is usually expensive and time-consuming, while the collection of unlabeled data is relatively much easier. To validate our method, we have performed extensive evaluations on the baseline Magnetic resonance imaging (MRI) and fluorodeoxyglucose positron emission tomography (FDG-PET) data of Alzheimer’s Disease Neuroimaging Initiative (ADNI) database. Our experimental results demonstrate the effectiveness of the proposed method. PMID:24505676
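
    The group sparsity idea mentioned in this abstract is commonly expressed with the L2,1 norm of a weight matrix whose rows are features and whose columns are tasks (here, modalities). The sketch below assumes that common formulation; the abstract does not give the exact regularizer or optimizer used by the authors, and the manifold (Laplacian) term is not included. Function names are hypothetical.

```python
import numpy as np

def l21_norm(W):
    """Sum of the Euclidean norms of the rows of W (features x tasks)."""
    return float(np.sum(np.linalg.norm(W, axis=1)))

def prox_l21(W, threshold):
    """Row-wise soft thresholding, the proximal operator of the L2,1 norm.

    Rows whose norm is below the threshold are set to zero for all tasks
    at once, which is what makes the selection joint across tasks.
    """
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    scale = np.maximum(0.0, 1.0 - threshold / np.maximum(norms, 1e-12))
    return W * scale

def jointly_selected(W, tol=1e-8):
    """Indices of features with a non-zero weight in at least one task."""
    return np.flatnonzero(np.linalg.norm(W, axis=1) > tol)

W = np.array([[3.0, 4.0],   # feature 0: strong in both tasks
              [0.0, 0.0],   # feature 1: unused
              [0.0, 1.0]])  # feature 2: weak, one task only

print(l21_norm(W))                             # 5 + 0 + 1 = 6
print(jointly_selected(prox_l21(W, 2.0)))      # only feature 0 survives
```

    Because the penalty acts on whole rows, a feature is either kept for every modality or dropped for every modality, which matches the abstract's description of a small number of features being selected jointly.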

  19. Feature Selection and Parameters Optimization of SVM Using Particle Swarm Optimization for Fault Classification in Power Distribution Systems.

    Cho, Ming-Yuan; Hoang, Thi Thom

    2017-01-01

    Fast and accurate fault classification is essential to power system operations. In this paper, in order to classify electrical faults in radial distribution systems, a particle swarm optimization (PSO) based support vector machine (SVM) classifier has been proposed. The proposed PSO based SVM classifier is able to select appropriate input features and optimize SVM parameters to increase classification accuracy. Further, a time-domain reflectometry (TDR) method with a pseudorandom binary sequence (PRBS) stimulus has been used to generate a dataset for purposes of classification. The proposed technique has been tested on a typical radial distribution network to identify ten different types of faults considering 12 given input features generated by using Simulink software and MATLAB Toolbox. The success rate of the SVM classifier is over 97%, which demonstrates the effectiveness and high efficiency of the developed method.

  20. Icing Forecasting of High Voltage Transmission Line Using Weighted Least Square Support Vector Machine with Fireworks Algorithm for Feature Selection

    Tiannan Ma

    2016-12-01

    Full Text Available Accurate forecasting of icing thickness has great significance for ensuring the security and stability of the power grid. In order to improve the forecasting accuracy, this paper proposes an icing forecasting system based on the fireworks algorithm and weighted least square support vector machine (W-LSSVM). The method of the fireworks algorithm is employed to select the proper input features with the purpose of eliminating redundant influence. In addition, the aim of the W-LSSVM model is to train and test the historical data-set with the selected features. The capability of this proposed icing forecasting model and framework is tested through simulation experiments using real-world icing data from the monitoring center of the key laboratory of anti-ice disaster, Hunan, South China. The results show that the proposed W-LSSVM-FA method has a higher prediction accuracy and it may be a promising alternative for icing thickness forecasting.

  1. Feature Selection and Parameters Optimization of SVM Using Particle Swarm Optimization for Fault Classification in Power Distribution Systems

    Ming-Yuan Cho

    2017-01-01

    Full Text Available Fast and accurate fault classification is essential to power system operations. In this paper, in order to classify electrical faults in radial distribution systems, a particle swarm optimization (PSO) based support vector machine (SVM) classifier has been proposed. The proposed PSO based SVM classifier is able to select appropriate input features and optimize SVM parameters to increase classification accuracy. Further, a time-domain reflectometry (TDR) method with a pseudorandom binary sequence (PRBS) stimulus has been used to generate a dataset for purposes of classification. The proposed technique has been tested on a typical radial distribution network to identify ten different types of faults considering 12 given input features generated by using Simulink software and MATLAB Toolbox. The success rate of the SVM classifier is over 97%, which demonstrates the effectiveness and high efficiency of the developed method.

  2. Feature selection and recognition from nonspecific volatile profiles for discrimination of apple juices according to variety and geographical origin.

    Guo, Jing; Yue, Tianli; Yuan, Yahong

    2012-10-01

    Apple juice is a complex mixture of volatile and nonvolatile components. To develop discrimination models on the basis of the volatile composition for an efficient classification of apple juices according to apple variety and geographical origin, chromatography volatile profiles of 50 apple juice samples belonging to 6 varieties and from 5 counties of Shaanxi (China) were obtained by headspace solid-phase microextraction coupled with gas chromatography. The volatile profiles were processed as continuous and nonspecific signals through multivariate analysis techniques. Different preprocessing methods were applied to raw chromatographic data. The blind chemometric analysis of the preprocessed chromatographic profiles was carried out. Stepwise linear discriminant analysis (SLDA) revealed satisfactory discrimination of apple juices according to variety and geographical origin, with prediction success rates of 100% and 89.8%, respectively. Finally, the discriminant volatile compounds selected by SLDA were identified by gas chromatography-mass spectrometry. The proposed strategy was able to verify the variety and geographical origin of apple juices using only a reduced number of discriminant retention times selected by the stepwise procedure. This result encourages considering similar procedures in quality control of apple juices. This work presented a method for an efficient discrimination of apple juices according to apple variety and geographical origin using HS-SPME-GC-MS together with chemometric tools. Discrimination models developed could help to achieve greater control over the quality of the juice and to detect possible adulteration of the product. © 2012 Institute of Food Technologists®

  3. Particle swarm optimization based feature enhancement and feature selection for improved emotion recognition in speech and glottal signals.

    Muthusamy, Hariharan; Polat, Kemal; Yaacob, Sazali

    2015-01-01

    In recent years, many research works have been published using speech-related features for speech emotion recognition; however, recent studies show that there is a strong correlation between emotional states and glottal features. In this work, Mel-frequency cepstral coefficients (MFCCs), linear predictive cepstral coefficients (LPCCs), perceptual linear predictive (PLP) features, gammatone filter outputs, timbral texture features, stationary wavelet transform based timbral texture features and relative wavelet packet energy and entropy features were extracted from the emotional speech (ES) signals and their glottal waveforms (GW). Particle swarm optimization based clustering (PSOC) and wrapper based particle swarm optimization (WPSO) were proposed to enhance the discerning ability of the features and to select the discriminating features, respectively. Three different emotional speech databases were utilized to gauge the proposed method. Extreme learning machine (ELM) was employed to classify the different types of emotions. Different experiments were conducted and the results show that the proposed method significantly improves the speech emotion recognition performance compared to previous works published in the literature.

  4. Different underlying mechanisms for face emotion and gender processing during feature-selective attention: Evidence from event-related potential studies.

    Wang, Hailing; Ip, Chengteng; Fu, Shimin; Sun, Pei

    2017-05-01

    Face recognition theories suggest that our brains process invariant (e.g., gender) and changeable (e.g., emotion) facial dimensions separately. To investigate whether these two dimensions are processed in different time courses, we analyzed the selection negativity (SN, an event-related potential component reflecting attentional modulation) elicited by face gender and emotion during a feature selective attention task. Participants were instructed to attend to a combination of face emotion and gender attributes in Experiment 1 (bi-dimensional task) and to either face emotion or gender in Experiment 2 (uni-dimensional task). The results revealed that face emotion did not elicit a substantial SN, whereas face gender consistently generated a substantial SN in both experiments. These results suggest that face gender is more sensitive to feature-selective attention and that face emotion is encoded relatively automatically on SN, implying the existence of different underlying processing mechanisms for invariant and changeable facial dimensions. Copyright © 2017 Elsevier Ltd. All rights reserved.

  5. Temperature profile retrieval in axisymmetric combustion plumes using multilayer perceptron modeling and spectral feature selection in the infrared CO2 emission band.

    García-Cuesta, Esteban; de Castro, Antonio J; Galván, Inés M; López, Fernando

    2014-01-01

    In this work, a methodology based on the combined use of a multilayer perceptron model fed using selected spectral information is presented to invert the radiative transfer equation (RTE) and to recover the spatial temperature profile inside an axisymmetric flame. The spectral information is provided by the measurement of the infrared CO2 emission band in the 3-5 μm spectral region. A guided spectral feature selection was carried out using a joint criterion of principal component analysis and a priori physical knowledge of the radiative problem. After applying this guided feature selection, a subset of 17 wavenumbers was selected. The proposed methodology was applied over synthetic scenarios. Also, an experimental validation was carried out by measuring the spectral emission of the exhaust hot gas plume in a microjet engine with a Fourier transform-based spectroradiometer. Temperatures retrieved using the proposed methodology were compared with classical thermocouple measurements, showing a good agreement between them. Results obtained using the proposed methodology are very promising and can encourage the use of sensor systems based on the spectral measurement of the CO2 emission band in the 3-5 μm spectral window to monitor combustion processes in a nonintrusive way.

  6. Classifying Pediatric Central Nervous System Tumors through near Optimal Feature Selection and Mutual Information: A Single Center Cohort

    Mohammad Faranoush

    2013-10-01

    Full Text Available Background: Labeling, gathering mutual information, clustering and classification of central nervous system tumors may assist in predicting not only distinct diagnoses based on tumor-specific features but also prognosis. This study evaluates the epidemiological features of central nervous system tumors in children who referred to Mahak's Pediatric Cancer Treatment and Research Center in Tehran, Iran. Methods: This cohort (convenience sample) study comprised 198 children (≤15 years old) with central nervous system tumors who referred to Mahak's Pediatric Cancer Treatment and Research Center from 2007 to 2010. In addition to the descriptive analyses on epidemiological features and mutual information, we used the Least Squares Support Vector Machines method in MATLAB software to propose a preliminary predictive model of pediatric central nervous system tumor feature-label analysis. Results: Of patients, there were 63.1% males and 36.9% females. Patients' mean±SD age was 6.11±3.65 years. Tumor location was as follows: supra-tentorial (30.3%), infra-tentorial (67.7%) and spinal (2%). The most frequent tumors registered were: high-grade glioma (supra-tentorial) in 36 (59.99%) patients and medulloblastoma (infra-tentorial) in 65 (48.51%) patients. The most prevalent clinical findings included vomiting, headache and impaired vision. Gender, age, ethnicity, tumor stage and the presence of metastasis were the features predictive of supra-tentorial tumor histology. Conclusion: Our data agreed with previous reports on the epidemiology of central nervous system tumors. Our feature-label analysis has shown how presenting features may partially predict diagnosis. Timely diagnosis and management of central nervous system tumors can lead to decreased disease burden and improved survival. This may be further facilitated through development of partitioning, risk prediction and prognostic models.

  7. UniFIeD Univariate Frequency-based Imputation for Time Series Data

    Friese, Martina; Stork, Jörg; Ramos Guerra, Ricardo; Bartz-Beielstein, Thomas; Thaker, Soham; Flasch, Oliver; Zaefferer, Martin

    2013-01-01

    This paper introduces UniFIeD, a new data preprocessing method for time series. UniFIeD can cope with large intervals of missing data. A scalable test function generator, which allows the simulation of time series with different gap sizes, is also presented. An experimental study demonstrates that (i) UniFIeD shows significantly better performance than simple imputation methods and (ii) UniFIeD is able to handle situations where advanced imputation methods fail. The results are indep...

  8. Application of all relevant feature selection for failure analysis of parameter-induced simulation crashes in climate models

    Paja, W.; Wrzesień, M.; Niemiec, R.; Rudnicki, W. R.

    2015-07-01

    Climate models are extremely complex pieces of software. They reflect the best knowledge on physical components of the climate; nevertheless, they contain several parameters which are too weakly constrained by observations and can potentially lead to a crash of the simulation. Recently a study by Lucas et al. (2013) has shown that machine learning methods can be used for predicting which combinations of parameters can lead to a crash of the simulation, and hence which processes described by these parameters need refined analyses. In the current study we reanalyse the dataset used in this research using a different methodology. We confirm the main conclusion of the original study concerning the suitability of machine learning for prediction of crashes. We show that only three of the eight parameters indicated in the original study as relevant for prediction of the crash are indeed strongly relevant, three others are relevant but redundant, and two are not relevant at all. We also show that the variance due to the split of data between training and validation sets has a large influence both on the accuracy of predictions and on the relative importance of variables; hence only a cross-validated approach can deliver a robust prediction of performance and relevance of variables.

  9. Application of all-relevant feature selection for the failure analysis of parameter-induced simulation crashes in climate models

    Paja, Wiesław; Wrzesien, Mariusz; Niemiec, Rafał; Rudnicki, Witold R.

    2016-03-01

    Climate models are extremely complex pieces of software. They reflect the best knowledge on the physical components of the climate; nevertheless, they contain several parameters, which are too weakly constrained by observations, and can potentially lead to a simulation crashing. Recently a study by Lucas et al. (2013) has shown that machine learning methods can be used for predicting which combinations of parameters can lead to the simulation crashing and hence which processes described by these parameters need refined analyses. In the current study we reanalyse the data set used in this research using different methodology. We confirm the main conclusion of the original study concerning the suitability of machine learning for the prediction of crashes. We show that only three of the eight parameters indicated in the original study as relevant for prediction of the crash are indeed strongly relevant, three others are relevant but redundant and two are not relevant at all. We also show that the variance due to the split of data between training and validation sets has a large influence both on the accuracy of predictions and on the relative importance of variables; hence only a cross-validated approach can deliver a robust prediction of performance and relevance of variables.

  10. Data quality assurance in monitoring of wastewater quality: Univariate on-line and off-line methods

    Alferes, J.; Poirier, P.; Lamaire-Chad, C.

    To make water quality monitoring networks useful for practice, the automation of data collection and data validation still represents an important challenge. Efficient monitoring depends on careful quality control and quality assessment. With a practical orientation a data quality assurance proce...

  11. Evaluation of genetic diversity among soybean (Glycine max) genotypes using univariate and multivariate analysis.

    Oliveira, M M; Sousa, L B; Reis, M C; Silva Junior, E G; Cardoso, D B O; Hamawaki, O T; Nogueira, A P O

    2017-05-31

    The study of genetic diversity is of paramount importance in breeding programs because it allows the selection of genetically divergent parents that have the agronomic traits desired by the breeder. This study aimed to characterize the genetic divergence between 24 soybean genotypes through their agronomic traits, using multivariate clustering methods to select potential genitors for promising hybrid combinations. The six agronomic traits evaluated were number of days to flowering and to maturity, plant height at flowering and at maturity, insertion height of the first pod, and yield. Genetic divergence was evaluated by multivariate analysis that first estimated the Mahalanobis' generalized distance (D²), then clustered the genotypes using Tocher's optimization method, and then the unweighted pair group method with arithmetic average (UPGMA). Tocher's optimization method and UPGMA largely agreed on the constitution of the groups, with eight distinct groups formed according to Tocher's method and seven distinct groups using UPGMA. The trait number of days to flowering (45.66%) contributed most to explaining dissimilarity between genotypes and should be one of the main traits considered by the breeder when choosing genitors in soybean-breeding programs. The genetic variability allowed the identification of dissimilar genotypes with superior performance. The hybridizations UFU 18 x UFUS CARAJÁS, UFU 15 x UFU 13, and UFU 13 x UFUS CARAJÁS are promising to obtain superior segregating populations, which enable the development of more productive genotypes.

  12. Feature selection for portfolio optimization

    Bjerring, Thomas Trier; Ross, Omri; Weissensteiner, Alex

    2016-01-01

    Most portfolio selection rules based on the sample mean and covariance matrix perform poorly out-of-sample. Moreover, there is a growing body of evidence that such optimization rules are not able to beat simple rules of thumb, such as 1/N. Parameter uncertainty has been identified as one major....... While most of the diversification benefits are preserved, the parameter estimation problem is alleviated. We conduct out-of-sample back-tests to show that in most cases different well-established portfolio selection rules applied on the reduced asset universe are able to improve alpha relative...

  13. Who uses nursing theory? A univariate descriptive analysis of five years' research articles.

    Bond, A Elaine; Eshah, Nidal Farid; Bani-Khaled, Mohammed; Hamad, Atef Omar; Habashneh, Samira; Kataua', Hussein; al-Jarrah, Imad; Abu Kamal, Andaleeb; Hamdan, Falastine Rafic; Maabreh, Roqia

    2011-06-01

    Since the early 1950s, nursing leaders have worked diligently to build the Scientific Discipline of Nursing, integrating Theory, Research and Practice. Recently, the role of theory has again come into question, with some scientists claiming nurses are not using theory to guide their research, with which to improve practice. The purposes of this descriptive study were to determine: (i) Were nursing scientists' research articles in leading nursing journals based on theory? (ii) If so, were the theories nursing theories or borrowed theories? (iii) Were the theories integrated into the studies, or were they used as organizing frameworks? Research articles from seven top ISI journals were analysed, excluding regularly featured columns, meta-analyses, secondary analysis, case studies and literature reviews. The authors used King's dynamic Interacting system and Goal Attainment Theory as an organizing framework. They developed consensus on how to identify the integration of theory, searching the Title, Abstract, Aims, Methods, Discussion and Conclusion sections of each research article, whether quantitative or qualitative. Of 2857 articles published in the seven journals from 2002 to, and including, 2006, 2184 (76%) were research articles. Of the 837 (38%) authors who used theories, 460 (55%) used nursing theories, 377 (45%) used other theories: 776 (93%) of those who used theory integrated it into their studies, including qualitative studies, while 51 (7%) reported they used theory as an organizing framework for their studies. Closer analysis revealed theory principles were implicitly implied, even in research reports that did not explicitly report theory usage. Increasing numbers of nursing research articles (though not percentagewise) continue to be guided by theory, and not always by nursing theory. Newer nursing research methods may not explicitly state the use of nursing theory, though it is implicitly implied. © 2010 The Authors. Scandinavian Journal of Caring

  14. Monitoring endemic livestock diseases using laboratory diagnostic data: A simulation study to evaluate the performance of univariate process monitoring control algorithms.

    Lopes Antunes, Ana Carolina; Dórea, Fernanda; Halasa, Tariq; Toft, Nils

    2016-05-01

    Surveillance systems are critical for accurate, timely monitoring and effective disease control. In this study, we investigated the performance of univariate process monitoring control algorithms in detecting changes in seroprevalence for endemic diseases. We also assessed the effect of sample size (number of sentinel herds tested in the surveillance system) on the performance of the algorithms. Three univariate process monitoring control algorithms were compared: Shewhart p Chart (PSHEW), Cumulative Sum (CUSUM) and Exponentially Weighted Moving Average (EWMA). Increases in seroprevalence were simulated from 0.10 to 0.15 and 0.20 over 4, 8, 24, 52 and 104 weeks. Each epidemic scenario was run with 2000 iterations. The cumulative sensitivity (CumSe) and timeliness were used to evaluate the algorithms' performance with a 1% false alarm rate. Using these performance evaluation criteria, it was possible to assess the accuracy and timeliness of the surveillance system working in real-time. The results showed that EWMA and PSHEW had higher CumSe (when compared with the CUSUM) from week 1 until the end of the period for all simulated scenarios. Changes in seroprevalence from 0.10 to 0.20 were more easily detected (higher CumSe) than changes from 0.10 to 0.15 for all three algorithms. Similar results were found with EWMA and PSHEW, based on the median time to detection. Changes in the seroprevalence were detected later with CUSUM, compared to EWMA and PSHEW for the different scenarios. Increasing the sample size 10 fold halved the time to detection (CumSe=1), whereas increasing the sample size 100 fold reduced the time to detection by a factor of 6. This study investigated the performance of three univariate process monitoring control algorithms in monitoring endemic diseases. It was shown that automated systems based on these detection methods identified changes in seroprevalence at different times. Increasing the number of tested herds would lead to faster
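
    The three chart types named in this record can be sketched in a few lines. This is a minimal illustration and not the authors' implementation: the sample size of 100 herds, the control-limit width of 3, the EWMA weight `lam`, and the CUSUM reference value `k` and threshold `h` are all assumed values chosen for the example, and the input is a noise-free step from 0.10 to 0.20 rather than simulated binomial data.

```python
import math


def shewhart_alarms(props, p0, sigma, width=3.0):
    """Weeks (1-based) where the observed proportion exceeds p0 + width * sigma."""
    return [t for t, p in enumerate(props, start=1) if p > p0 + width * sigma]


def cusum_alarms(props, p0, sigma, k=0.5, h=4.0):
    """One-sided upper CUSUM on standardized deviations (k, h are assumed values)."""
    s, alarms = 0.0, []
    for t, p in enumerate(props, start=1):
        s = max(0.0, s + (p - p0) / sigma - k)
        if s > h:
            alarms.append(t)
    return alarms


def ewma_alarms(props, p0, sigma, lam=0.2, width=3.0):
    """Upper EWMA chart with the exact time-varying control limit."""
    z, alarms = p0, []
    for t, p in enumerate(props, start=1):
        z = lam * p + (1 - lam) * z
        # Standard deviation of the EWMA statistic after t observations.
        sd = sigma * math.sqrt(lam / (2 - lam) * (1 - (1 - lam) ** (2 * t)))
        if z > p0 + width * sd:
            alarms.append(t)
    return alarms


n_herds = 100                                # assumed sample size per week
p0 = 0.10                                    # in-control seroprevalence
sigma = math.sqrt(p0 * (1 - p0) / n_herds)   # binomial SD of a weekly proportion
series = [0.10] * 10 + [0.20] * 10           # idealized step change at week 11

for name, chart in [("PSHEW", shewhart_alarms),
                    ("CUSUM", cusum_alarms),
                    ("EWMA", ewma_alarms)]:
    alarms = chart(series, p0, sigma)
    print(name, "first alarm in week", alarms[0] if alarms else None)
```

    Because the toy series has no sampling noise, it says nothing about the relative performance reported in the study; it only shows how each statistic accumulates evidence of a shift.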

  15. Real-time prediction of extreme ambient carbon monoxide concentrations due to vehicular exhaust emissions using univariate linear stochastic models

    Sharma, P.; Khare, M.

    2000-01-01

    Historical data of the time-series of carbon monoxide (CO) concentration was analysed using Box-Jenkins modelling approach. Univariate Linear Stochastic Models (ULSMs) were developed to examine the degree of prediction possible for situations where only a limited data set, restricted only to the past record of pollutant data are available. The developed models can be used to provide short-term, real-time forecast of extreme CO concentrations for an Air Quality Control Region (AQCR), comprising a major traffic intersection in a Central Business District of Delhi City, India. (author)

  16. Investigating univariate temporal patterns for intrinsic connectivity networks based on complexity and low-frequency oscillation: a test-retest reliability study.

    Wang, X; Jiao, Y; Tang, T; Wang, H; Lu, Z

    2013-12-19

    Intrinsic connectivity networks (ICNs) are composed of spatial components and time courses. The spatial components of ICNs were discovered with moderate-to-high reliability. So far as we know, few studies focused on the reliability of the temporal patterns for ICNs based on their individual time courses. The goals of this study were twofold: to investigate the test-retest reliability of temporal patterns for ICNs, and to analyze these informative univariate metrics. Additionally, a correlation analysis was performed to enhance interpretability. Our study included three datasets: (a) short- and long-term scans, (b) multi-band echo-planar imaging (mEPI), and (c) eyes open or closed. Using dual regression, we obtained the time courses of ICNs for each subject. To produce temporal patterns for ICNs, we applied two categories of univariate metrics: network-wise complexity and network-wise low-frequency oscillation. Furthermore, we validated the test-retest reliability for each metric. The network-wise temporal patterns for most ICNs (especially for default mode network, DMN) exhibited moderate-to-high reliability and reproducibility under different scan conditions. Network-wise complexity for DMN exhibited fair reliability (ICC<0.5) based on eyes-closed sessions. In particular, our results supported that mEPI could be a useful method with high reliability and reproducibility. In addition, these temporal patterns were physiologically meaningful, and certain temporal patterns were correlated to the node strength of the corresponding ICN. Overall, network-wise temporal patterns of ICNs were reliable and informative and could be complementary to spatial patterns of ICNs for further study. Copyright © 2013 IBRO. Published by Elsevier Ltd. All rights reserved.

  17. The issue of multiple univariate comparisons in the context of neuroelectric brain mapping: an application in a neuromarketing experiment.

    Vecchiato, G; De Vico Fallani, F; Astolfi, L; Toppi, J; Cincotti, F; Mattia, D; Salinari, S; Babiloni, F

    2010-08-30

    This paper presents some considerations about the use of adequate statistical techniques in the framework of neuroelectromagnetic brain mapping. With the use of advanced EEG/MEG recording setups involving hundreds of sensors, the issue of protection against the type I errors that could occur during the execution of hundreds of univariate statistical tests has gained interest. In the present experiment, we investigated the EEG signals from a mannequin acting as an experimental subject. Data were collected while performing a neuromarketing experiment and analyzed with state-of-the-art computational tools adopted in the specialized literature. Results showed that electric data from the mannequin's head present statistically significant differences in power spectra during the visualization of a commercial advertisement when compared to the power spectra gathered during a documentary, when no adjustments were made on the alpha level of the multiple univariate tests performed. The use of the Bonferroni or Bonferroni-Holm adjustments correctly returned no differences between the signals gathered from the mannequin in the two experimental conditions. A partial sample of recently published literature in different neuroscience journals suggested that at least 30% of the papers do not use statistical protection for type I errors. While the occurrence of type I errors could be easily managed with appropriate statistical techniques, the use of such techniques is still not widely adopted in the literature. Copyright (c) 2010 Elsevier B.V. All rights reserved.
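
    The two adjustments named in this record can be sketched as follows. This is a minimal illustration with made-up p-values, not data from the study; the function names and the significance level of 0.05 are assumptions of the example.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject H0 for each test whose p-value is at most alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def holm(p_values, alpha=0.05):
    """Bonferroni-Holm step-down procedure.

    Sort the p-values in ascending order and compare the one at position
    rank (0-based) with alpha / (m - rank); stop at the first failure.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # every larger p-value is retained as well
    return reject


# Hypothetical p-values from four univariate tests.
p_values = [0.001, 0.013, 0.02, 0.04]
print([p <= 0.05 for p in p_values])  # unadjusted
print(bonferroni(p_values))
print(holm(p_values))
```

    Both procedures control the family-wise type I error rate; Holm's step-down version never rejects fewer hypotheses than plain Bonferroni, which is why it is often preferred when many sensors are tested at once.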

  18. Univariate Lp and ℓp Averaging, 0 < p < 1, in Polynomial Time by Utilization of Statistical Structure

    John E. Lavery

    2012-10-01

    Full Text Available We present evidence that one can calculate generically combinatorially expensive Lp and lp averages, 0 < p < 1, in polynomial time by restricting the data to come from a wide class of statistical distributions. Our approach differs from the approaches in the previous literature, which are based on a priori sparsity requirements or on accepting a local minimum as a replacement for a global minimum. The functionals by which Lp averages are calculated are not convex but are radially monotonic and the functionals by which lp averages are calculated are nearly so, which are the keys to solvability in polynomial time. Analytical results for symmetric, radially monotonic univariate distributions are presented. An algorithm for univariate lp averaging is presented. Computational results for a Gaussian distribution, a class of symmetric heavy-tailed distributions and a class of asymmetric heavy-tailed distributions are presented. Many phenomena in human-based areas are increasingly known to be represented by data that have large numbers of outliers and belong to very heavy-tailed distributions. When tails of distributions are so heavy that even medians (L1 and l1 averages) do not exist, one needs to consider using lp minimization principles with 0 < p < 1.
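
    To make the notion of an lp average concrete, the sketch below minimizes the sum of |x_i - a|^p over a for a small sample. It is a brute-force illustration and not the algorithm of the paper: for 0 < p <= 1 each term is concave in a between consecutive data points, so a global minimizer can be found among the data points themselves, and the code simply evaluates the cost at every one of them. The function name and the sample values are assumptions of the example.

```python
def lp_average(data, p):
    """Return a data point minimizing sum(|x - a|**p), for 0 < p <= 1.

    Brute force: the cost is evaluated at every data point, so the run time
    is quadratic in the sample size.
    """
    if not 0 < p <= 1:
        raise ValueError("this sketch assumes 0 < p <= 1")

    def cost(a):
        return sum(abs(x - a) ** p for x in data)

    return min(data, key=cost)


# Hypothetical sample with one gross outlier.
sample = [0.0, 0.1, 0.2, 10.0]
print(lp_average(sample, 0.5))
```

    With a heavy outlier in the sample, the minimizer stays inside the main cluster of points, which is the robustness property that motivates lp averaging for heavy-tailed data.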

  19. Evaluation of the efficiency of continuous wavelet transform as processing and preprocessing algorithm for resolution of overlapped signals in univariate and multivariate regression analyses; an application to ternary and quaternary mixtures

    Hegazy, Maha A.; Lotfy, Hayam M.; Mowaka, Shereen; Mohamed, Ekram Hany

    2016-07-01

    Wavelets have been adapted for a vast number of signal-processing applications due to the amount of information that can be extracted from a signal. In this work, a comparative study on the efficiency of continuous wavelet transform (CWT) as a signal processing tool in univariate regression and a pre-processing tool in multivariate analysis using partial least square (CWT-PLS) was conducted. These were applied to complex spectral signals of ternary and quaternary mixtures. The CWT-PLS method succeeded in the simultaneous determination of a quaternary mixture of drotaverine (DRO), caffeine (CAF), paracetamol (PAR) and p-aminophenol (PAP, the major impurity of paracetamol). In contrast, the univariate CWT failed to simultaneously determine the quaternary mixture components and was able to determine only PAR and PAP and the ternary mixtures of DRO, CAF, and PAR and of CAF, PAR, and PAP. During the calculations of CWT, different wavelet families were tested. The univariate CWT method was validated according to the ICH guidelines. For the development of the CWT-PLS model, a calibration set was prepared by means of an orthogonal experimental design and the absorption spectra were recorded and processed by CWT. The CWT-PLS model was constructed by regression between the wavelet coefficients and concentration matrices, and validation was performed by both cross validation and external validation sets. Both methods were successfully applied for determination of the studied drugs in pharmaceutical formulations.

  20. Particle swarm optimization and genetic algorithm as feature selection techniques for the QSAR modeling of imidazo[1,5-a]pyrido[3,2-e]pyrazines, inhibitors of phosphodiesterase 10A.

    Goodarzi, Mohammad; Saeys, Wouter; Deeb, Omar; Pieters, Sigrid; Vander Heyden, Yvan

    2013-12-01

    Quantitative structure-activity relationship (QSAR) modeling was performed for imidazo[1,5-a]pyrido[3,2-e]pyrazines, which constitute a class of phosphodiesterase 10A inhibitors. Particle swarm optimization (PSO) and genetic algorithm (GA) were used as feature selection techniques to find the most reliable molecular descriptors from a large pool. Modeling of the relationship between the selected descriptors and the pIC50 activity data was achieved by linear [multiple linear regression (MLR)] and non-linear [locally weighted regression (LWR) based on both Euclidean (E) and Mahalanobis (M) distances] methods. In addition, a stepwise MLR model was built using only a limited number of quantum chemical descriptors, selected because of their correlation with the pIC50. The model was not found interesting. It was concluded that the LWR model, based on the Euclidean distance, applied on the descriptors selected by PSO has the best prediction ability. However, some other models behaved similarly. The root-mean-squared errors of prediction (RMSEP) for the test sets obtained by PSO/MLR, GA/MLR, PSO/LWRE, PSO/LWRM, GA/LWRE, and GA/LWRM models were 0.333, 0.394, 0.313, 0.333, 0.421, and 0.424, respectively. The PSO-selected descriptors resulted in the best prediction models, both linear and non-linear. © 2013 John Wiley & Sons A/S.
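
    The RMSEP figure used to compare the models is the square root of the mean squared difference between observed and predicted test-set values. A minimal sketch follows; the function name and the pIC50 values are hypothetical and are not taken from the study.

```python
import math


def rmsep(observed, predicted):
    """Root-mean-squared error of prediction over a test set."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("need two non-empty sequences of equal length")
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical observed and predicted pIC50 values for four test compounds.
observed = [7.2, 6.8, 8.1, 7.5]
predicted = [7.0, 7.1, 7.9, 7.6]
print(round(rmsep(observed, predicted), 3))
```

    Lower values indicate better predictions, and the result is expressed in the same units as the modeled activity.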

  1. Will initiatives to promote hydroelectricity consumption be effective? Evidence from univariate and panel LM unit root tests with structural breaks

    Lean, Hooi Hooi; Smyth, Russell

    2014-01-01

    This paper examines whether initiatives to promote hydroelectricity consumption are likely to be effective by applying univariate and panel Lagrange Multiplier (LM) unit root tests to hydroelectricity consumption in 55 countries over the period 1965–2011. We find that, for the panel as well as for about four-fifths of individual countries, hydroelectricity consumption is stationary. This result implies that shocks to hydroelectricity consumption in most countries will only result in temporary deviations from the long-run growth path. An important consequence of this finding is that initiatives designed to have permanent positive effects on hydroelectricity consumption, such as large-scale dam construction, are unlikely to be effective in increasing the share of hydroelectricity, relative to consumption of fossil fuels. - Highlights: • Applies unit root tests to hydroelectricity consumption. • Hydroelectricity consumption is stationary. • Shocks to hydroelectricity consumption result in temporary deviations from the long-run growth path

  2. Tile-based Fisher-ratio software for improved feature selection analysis of comprehensive two-dimensional gas chromatography-time-of-flight mass spectrometry data.

    Marney, Luke C; Siegler, W Christopher; Parsons, Brendon A; Hoggard, Jamin C; Wright, Bob W; Synovec, Robert E

    2013-10-15

    Comprehensive two-dimensional (2D) gas chromatography coupled with time-of-flight mass spectrometry (GC × GC-TOFMS) is a highly capable instrumental platform that produces complex and information-rich multi-dimensional chemical data. The data can be initially overwhelming, especially when many samples (of various sample classes) are analyzed with multiple injections for each sample. Thus, the data must be analyzed in such a way as to extract the most meaningful information. The pixel-based and peak table-based Fisher ratio algorithmic approaches have been used successfully in the past to reduce the multi-dimensional data down to those chemical compounds that are changing between the sample classes relative to those that are not changing (i.e., chemical feature selection). We report on the initial development of a computationally fast novel tile-based Fisher-ratio software that addresses the challenges due to 2D retention time misalignment without explicitly aligning the data, which is often a shortcoming for both pixel-based and peak table-based algorithmic approaches. Concurrently, the tile-based Fisher-ratio algorithm significantly improves the sensitivity contrast of true positives against a background of potential false positives and noise. In this study, eight compounds, plus one internal standard, were spiked into diesel at various concentrations. The tile-based F-ratio algorithmic approach was able to "discover" all spiked analytes, within the complex diesel sample matrix with thousands of potential false positives, in each possible concentration comparison, even at the lowest absolute spiked analyte concentration ratio of 1.06, the ratio between the concentrations in the spiked diesel sample to the native concentration in diesel. Copyright © 2013 Elsevier B.V. All rights reserved.
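    A Fisher ratio compares the variance between sample classes with the variance within them, so features that change between classes score high. The sketch below is a minimal, assumed formulation (the one-way ANOVA F-statistic computed per feature); it does not reproduce the tile-based computation or the retention-time handling described in the abstract, and the names `fisher_ratio` and `rank_features` are hypothetical.

```python
def fisher_ratio(groups):
    """Between-class over within-class variance for one feature.

    `groups` holds one list of measurements per sample class.
    Assumed form: the one-way ANOVA F-statistic.
    """
    k = len(groups)
    n = sum(len(g) for g in groups)
    if k < 2 or n <= k:
        raise ValueError("need at least two classes and more samples than classes")
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    between = sum(len(g) * (m - grand_mean) ** 2
                  for g, m in zip(groups, means)) / (k - 1)
    within = sum((x - m) ** 2
                 for g, m in zip(groups, means) for x in g) / (n - k)
    if within == 0:
        # No within-class scatter: infinite ratio if the class means differ.
        return float("inf") if between > 0 else 0.0
    return between / within

def rank_features(samples, labels):
    """Rank feature indices by descending Fisher ratio.

    `samples` is a list of equal-length feature vectors and
    `labels` gives the class of each sample.
    """
    classes = sorted(set(labels))
    ratios = []
    for j in range(len(samples[0])):
        groups = [[s[j] for s, y in zip(samples, labels) if y == c]
                  for c in classes]
        ratios.append(fisher_ratio(groups))
    order = sorted(range(len(ratios)), key=lambda j: ratios[j], reverse=True)
    return order, ratios
```

    In a chemical feature selection setting, each feature would be a signal (for example, a tile or peak intensity) and each class a sample type; features at the top of the ranking are the candidates for compounds that differ between classes.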

  3. [Retrospective statistical analysis of clinical factors of recurrence in chronic subdural hematoma: correlation between univariate and multivariate analysis].

    Takayama, Motoharu; Terui, Keita; Oiwa, Yoshitsugu

    2012-10-01

    Chronic subdural hematoma is common in elderly individuals and surgical procedures are simple. The recurrence rate of chronic subdural hematoma, however, varies from 9.2 to 26.5% after surgery. The authors studied factors of the recurrence using univariate and multivariate analyses in patients with chronic subdural hematoma. We retrospectively reviewed 239 consecutive cases of chronic subdural hematoma who received burr-hole surgery with irrigation and closed-system drainage. We analyzed the relationships between recurrence of chronic subdural hematoma and factors such as sex, age, laterality, bleeding tendency, other complicated diseases, density on CT, volume of the hematoma, residual air in the hematoma cavity, and use of artificial cerebrospinal fluid. Twenty-one patients (8.8%) experienced a recurrence of chronic subdural hematoma. Multiple logistic regression found that the recurrence rate was higher in patients with a large volume of residual air, and was lower in patients in whom artificial cerebrospinal fluid was used. No statistical differences were found in bleeding tendency. Techniques to reduce the air in the hematoma cavity are important for good outcome in surgery of chronic subdural hematoma. Also, the use of artificial cerebrospinal fluid reduces recurrence of chronic subdural hematoma. The surgical procedures can be the same for patients with bleeding tendencies.

  4. What do differences between multi-voxel and univariate analysis mean? How subject-, voxel-, and trial-level variance impact fMRI analysis.

    Davis, Tyler; LaRocque, Karen F; Mumford, Jeanette A; Norman, Kenneth A; Wagner, Anthony D; Poldrack, Russell A

    2014-08-15

    Multi-voxel pattern analysis (MVPA) has led to major changes in how fMRI data are analyzed and interpreted. Many studies now report both MVPA results and results from standard univariate voxel-wise analysis, often with the goal of drawing different conclusions from each. Because MVPA results can be sensitive to latent multidimensional representations and processes whereas univariate voxel-wise analysis cannot, one conclusion that is often drawn when MVPA and univariate results differ is that the activation patterns underlying MVPA results contain a multidimensional code. In the current study, we conducted simulations to formally test this assumption. Our findings reveal that MVPA tests are sensitive to the magnitude of voxel-level variability in the effect of a condition within subjects, even when the same linear relationship is coded in all voxels. We also find that MVPA is insensitive to subject-level variability in mean activation across an ROI, which is the primary variance component of interest in many standard univariate tests. Together, these results illustrate that differences between MVPA and univariate tests do not afford conclusions about the nature or dimensionality of the neural code. Instead, targeted tests of the informational content and/or dimensionality of activation patterns are critical for drawing strong conclusions about the representational codes that are indicated by significant MVPA results. Copyright © 2014 Elsevier Inc. All rights reserved.

  5. Characterizing Reinforcement Learning Methods through Parameterized Learning Problems

    2011-06-03

    extraneous. The agent could potentially adapt these representational aspects by applying methods from feature selection (Kolter and Ng, 2009; Petrik et al...611–616. AAAI Press. Kolter, J. Z. and Ng, A. Y. (2009). Regularization and feature selection in least-squares temporal difference learning. In A. P

  6. Guaranteed Bounds on Information-Theoretic Measures of Univariate Mixtures Using Piecewise Log-Sum-Exp Inequalities

    Nielsen, Frank; Sun, Ke

    2016-01-01

    does not admit a closed-form formula, it is in practice either estimated using costly Monte Carlo stochastic integration, approximated or bounded using various techniques. We present a fast and generic method that builds algorithmically closed

  7. A combined MRI and MRSI based multiclass system for brain tumour recognition using LS-SVMs with class probabilities and feature selection.

    Luts, J.; Heerschap, A.; Suykens, J.A.; Huffel, S. van

    2007-01-01

    OBJECTIVE: This study investigates the use of automated pattern recognition methods on magnetic resonance data with the ultimate goal to assist clinicians in the diagnosis of brain tumours. Recently, the combined use of magnetic resonance imaging (MRI) and magnetic resonance spectroscopic imaging

  8. Intelligent Fault Diagnosis for Plunger Pump Based on Features Selection and Support Vector Machines

    崔英; 杜文辽; 孙旺; 李彦明

    2013-01-01

    In truck cranes, the plunger pump is a key component, and its quality directly affects the performance of the whole mechanical system. A novel intelligent diagnosis method based on feature selection and support vector machines (SVMs) is proposed for the plunger pump of a truck crane. Wavelet packet decomposition is applied to the original vibration signal, and the wavelet packet energy is extracted to represent the condition of the equipment. The Fisher criterion is then used to select the features most suitable for diagnosis. Finally, two-class SVMs arranged in a binary tree architecture are trained to recognize the condition of the mechanism. The proposed method was applied to the diagnosis of a truck crane plunger pump. Six states (normal, bearing inner race fault, bearing roller fault, plunger fault, thrust plate wear fault, and swash plate wear fault) were used to test the classification performance of the proposed Fisher-SVMs model, which was compared with a classical model, the BP ANN, and a more recent one, the ANT ANN. The experimental results show that the Fisher-SVMs model is superior to the other two models and achieves promising results.

  9. The Short-Term Power Load Forecasting Based on Sperm Whale Algorithm and Wavelet Least Square Support Vector Machine with DWT-IR for Feature Selection

    Jin-peng Liu

    2017-07-01

    Short-term power load forecasting is an important basis for the operation of integrated energy systems, and the accuracy of load forecasting directly affects the economy of system operation. To improve the forecasting accuracy, this paper proposes a load forecasting system based on the wavelet least square support vector machine and the sperm whale algorithm. Firstly, the discrete wavelet transform and the inconsistency rate model (DWT-IR) are used to select the optimal features, which aims to reduce the redundancy of the input vectors. Secondly, the kernel function of the least square support vector machine (LSSVM) is replaced by a wavelet kernel function to improve the nonlinear mapping ability of the LSSVM. Lastly, the parameters of the W-LSSVM are optimized by the sperm whale algorithm, and the short-term load forecasting method W-LSSVM-SWA is established. The example verification results show that the proposed model outperforms the alternative methods and is effective and feasible for short-term power load forecasting.

  10. An efficient swarm intelligence approach to feature selection based on invasive weed optimization: Application to multivariate calibration and classification using spectroscopic data

    Sheykhizadeh, Saheleh; Naseri, Abdolhossein

    2018-04-01

    Variable selection plays a key role in classification and multivariate calibration. Variable selection methods aim to choose, from a large pool of available predictors, a set of variables relevant to estimating analyte concentrations or to achieving better classification results. Many variable selection techniques have been introduced; among them, those based on swarm intelligence optimization have gained particular attention over the last few decades, since they are mainly inspired by nature. In this work, a simple new variable selection algorithm is proposed based on the invasive weed optimization (IWO) concept. IWO is a bio-inspired metaheuristic that mimics the ecological behavior of weeds in colonizing and finding an appropriate place for growth and reproduction; it has been shown to be powerful and highly adaptive to environmental changes. In this paper, the first application of IWO, as a very simple and powerful method, to variable selection is reported using different experimental datasets, including FTIR and NIR data, for classification and multivariate calibration tasks. Accordingly, invasive weed optimization-linear discriminant analysis (IWO-LDA) and invasive weed optimization-partial least squares (IWO-PLS) are introduced for multivariate classification and calibration, respectively.

  11. Research and Application of Hybrid Forecasting Model Based on an Optimal Feature Selection System—A Case Study on Electrical Load Forecasting

    Yunxuan Dong

    2017-04-01

    The modernization of the smart grid markedly increases the complexity and uncertainty of power system scheduling and operation. For developing a more reliable, flexible, efficient and resilient grid, electrical load forecasting is important yet remains a difficult and challenging task. In this paper, a short-term electrical load forecasting model, which combines a feature learning unit named Pyramid System with recurrent neural networks, has been developed, and it can effectively promote the stability and security of the power grid. Nine types of feature learning methods are compared in this work to select the best one for the learning target, and two criteria have been employed to evaluate the accuracy of the prediction intervals. Furthermore, an electrical load forecasting method based on recurrent neural networks has been formed to capture the relationships in the historical data; specifically, the proposed techniques are applied to electrical load forecasting using data collected from New South Wales, Australia. The simulation results show that the proposed hybrid models not only approximate the actual values satisfactorily but can also serve as effective tools in the planning of smart grids.

  12. An optimal set of features for predicting type IV secretion system effector proteins for a subset of species based on a multi-level feature selection approach.

    Zhila Esna Ashari

    Type IV secretion systems (T4SS) are multi-protein complexes in a number of bacterial pathogens that can translocate proteins and DNA to the host. Most T4SSs function in conjugation and translocate DNA; however, approximately 13% function to secrete proteins, delivering effector proteins into the cytosol of eukaryotic host cells. Upon entry, these effectors manipulate the host cell's machinery for their own benefit, which can result in serious illness or death of the host. For this reason, recognition of T4SS effectors has become an important subject. Much previous work has focused on verifying effectors experimentally, a costly endeavor in terms of money, time, and effort. Having good predictions for effectors will help to focus experimental validations and decrease testing costs. In recent years, several scoring and machine learning-based methods have been suggested for the purpose of predicting T4SS effector proteins. These methods have used different sets of features for prediction, and their predictions have been inconsistent. In this paper, an optimal set of features is presented for predicting T4SS effector proteins using a statistical approach. A thorough literature search was performed to find features that have been proposed. Feature values were calculated for datasets of known effectors and non-effectors for T4SS-containing pathogens for four genera with a sufficient number of known effectors: Legionella pneumophila, Coxiella burnetii, Brucella spp., and Bartonella spp. The features were ranked, and less important features were filtered out. Correlations between remaining features were removed, and dimensional reduction was accomplished using principal component analysis and factor analysis. Finally, the optimal features for each pathogen were chosen by building logistic regression models and evaluating each model. The results based on evaluation of our logistic regression models confirm the effectiveness of our four optimal sets of

  13. A new model of flavonoids affinity towards P-glycoprotein: genetic algorithm-support vector machine with features selected by a modified particle swarm optimization algorithm.

    Cui, Ying; Chen, Qinggang; Li, Yaxiao; Tang, Ling

    2017-02-01

    Flavonoids exhibit a high affinity for the purified cytosolic NBD (C-terminal nucleotide-binding domain) of P-glycoprotein (P-gp). To explore the affinity of flavonoids for P-gp, quantitative structure-activity relationship (QSAR) models were developed using support vector machines (SVMs). A novel method, which couples a modified particle swarm optimization algorithm using a random mutation strategy with a genetic algorithm and SVM, was proposed for the first time to simultaneously optimize the kernel parameters of the SVM and determine the subset of optimized features. Using DRAGON descriptors to represent compounds for QSAR, three subsets (training, prediction and external validation set) derived from the dataset were employed to investigate QSAR. After excluding the outlier, the correlation coefficient (R²) of the whole training set (training and prediction) was 0.924, and the R² of the external validation set was 0.941. The root-mean-square error (RMSE) of the whole training set was 0.0588; the RMSE of the cross-validation of the external validation set was 0.0443. The mean Q² value of leave-many-out cross-validation was 0.824. Together with additional information from the randomization analysis and the applicability domain, the results indicate that the proposed model has good predictive ability and stability.

  14. OPTIMIZING PARAMETER DESIGN: THE UNIVARIATE FOREST-GENETIC METHOD

    Adriana Villa Murillo

    2016-06-01

    In the 1980s, Dr Genichi Taguchi developed a methodology for improving the parameter design of products and processes, known as the Taguchi methodology. Different proposals have emerged involving artificial intelligence techniques. Our proposal consists of a hybrid methodology that combines Random Forest (RF) and Genetic Algorithms (GA) in three phases: normalization, modeling and optimization. The first phase corresponds to the prior preparation of the data set by using normalization functions. In the modeling phase, the objective function is determined using strategies based on RF to predict the value of the response for a given set of parameters. Finally, in the optimization phase, the optimal combination of the parameter levels is obtained by integrating properties given by our modeling scheme into the corresponding GA. The results are compared numerically with contributions recently found in the literature. Our methodological proposal focuses on the most important variables resulting from the RF modeling process, which allows the new generations in the optimization phase to be developed and directed more efficiently and, consequently, achieves significant improvements in the quality objective considered.

  15. Comparison between the univariate and multivariate analysis on the partial characterization of the endoglucanase produced in the solid state fermentation by Aspergillus oryzae ATCC 10124.

    de Brito, Aila Riany; Santos Reis, Nadabe Dos; Silva, Tatielle Pereira; Ferreira Bonomo, Renata Cristina; Trovatti Uetanabaro, Ana Paula; de Assis, Sandra Aparecida; da Silva, Erik Galvão Paranhos; Aguiar-Oliveira, Elizama; Oliveira, Julieta Rangel; Franco, Marcelo

    2017-11-26

    Endoglucanase production by Aspergillus oryzae ATCC 10124 cultivated in rice husks or peanut shells was optimized by experimental design as a function of humidity, time, and temperature. The optimum temperature for the endoglucanase activity was estimated by a univariate analysis (one factor at a time) as 50°C (rice husks) and 60°C (peanut shells); however, a multivariate analysis (synergism of factors) determined a different temperature (56°C) for the endoglucanase from peanut shells. For the optimum pH, the values determined by univariate and multivariate analysis were 5 and 5.2 (rice husks) and 5 and 7.6 (peanut shells). In addition, the best half-lives were observed at 50°C: 22.8 hr (rice husks) and 7.3 hr (peanut shells). A residual activity of 80% was obtained between 30 and 50°C for both substrates, and the pH stability was improved at pH 5-7 (rice husks) and 6-9 (peanut shells). The two endoglucanases obtained presented different characteristics as a result of the versatility of fungi in different substrates.

  16. Feature Selection, Flaring Size and Time-to-Flare Prediction Using Support Vector Regression, and Automated Prediction of Flaring Behavior Based on Spatio-Temporal Measures Using Hidden Markov Models

    Al-Ghraibah, Amani

    Solar flares release stored magnetic energy in the form of radiation and can have significant detrimental effects on Earth, including damage to technological infrastructure. Recent work has considered methods to predict future flare activity on the basis of quantitative measures of the solar magnetic field. Accurate advance warning of solar flare occurrence is an area of increasing concern, and much research is ongoing in this area. Our previous work [111] utilized standard pattern recognition and classification techniques to determine (classify) whether a region is expected to flare within a predictive time window, using a Relevance Vector Machine (RVM) classification method. We extracted 38 features describing the complexity of the photospheric magnetic field; the resulting classification metrics provide the baseline against which we compare our new work. We find a true positive rate (TPR) of 0.8, a true negative rate (TNR) of 0.7, and a true skill score (TSS) of 0.49. This dissertation addresses three basic topics. The first is an extension of our previous work [111], in which we consider a feature selection method to determine an appropriate feature subset, with cross-validation classification based on a histogram analysis of the selected features. Classification using the top five features resulting from this analysis yields better classification accuracies across a large unbalanced dataset. In particular, the feature subsets provide better discrimination of the many regions that flare: we find a TPR of 0.85, a TNR of 0.65, slightly lower than in our previous work, and a TSS of 0.5, an improvement over our previous work. In the second topic, we study the prediction of solar flare size and time-to-flare using support vector regression (SVR). When we consider flaring regions only, we find an average error in estimating flare size of approximately half a GOES class. When we additionally consider non-flaring regions, we find an increased average
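    The true skill score quoted in this abstract is commonly defined as TSS = TPR + TNR - 1 (equivalently, TPR minus the false positive rate). The short sketch below applies that standard definition to the rates given in the abstract; the function names are hypothetical. With the rounded rates 0.8 and 0.7 the formula gives 0.5 rather than the reported 0.49, a gap that is presumably due to rounding of the published rates.

```python
def rates_from_counts(tp, fn, tn, fp):
    """Return (TPR, TNR) from confusion-matrix counts."""
    tpr = tp / (tp + fn)  # fraction of flaring regions correctly predicted
    tnr = tn / (tn + fp)  # fraction of non-flaring regions correctly predicted
    return tpr, tnr

def true_skill_score(tpr, tnr):
    """TSS = TPR + TNR - 1; ranges from -1 to 1, with 0 meaning no skill."""
    return tpr + tnr - 1.0

if __name__ == "__main__":
    # Rates as quoted in the abstract (rounded values).
    print(true_skill_score(0.80, 0.70))  # earlier RVM baseline
    print(true_skill_score(0.85, 0.65))  # top-five feature subset
```

    Because the TSS depends only on the two rates, it is not affected by the imbalance between flaring and non-flaring regions, which makes it a convenient summary for unbalanced datasets such as the one described here.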

  17. Feature selection based classifier combination approach for ...

    ved for the isolated English text, but for the handwritten Devanagari script it is not ... characters, lack of standard benchmarking and ground truth dataset, lack of ..... theory, proposed by Glen Shafer as a way to represent cognitive knowledge.

  18. Finger vein recognition with personalized feature selection.

    Xi, Xiaoming; Yang, Gongping; Yin, Yilong; Meng, Xianjing

    2013-08-22

    Finger veins are a promising biometric pattern for personalized identification in terms of their advantages over existing biometrics. Based on the spatial pyramid representation and the combination of more effective information such as gray, texture and shape, this paper proposes a simple but powerful feature, called Pyramid Histograms of Gray, Texture and Orientation Gradients (PHGTOG). For a finger vein image, PHGTOG can reflect the global spatial layout and local details of gray, texture and shape. To further improve the recognition performance and reduce the computational complexity, we select a personalized subset of features from PHGTOG for each subject by using a sparse weight vector trained with LASSO; the resulting feature is called PFS-PHGTOG. We conduct extensive experiments to demonstrate the promise of PHGTOG and PFS-PHGTOG. Experimental results on our databases show that PHGTOG outperforms the other existing features. Moreover, PFS-PHGTOG can further boost the performance in comparison with PHGTOG.

  19. Finger Vein Recognition with Personalized Feature Selection

    Xianjing Meng

    2013-08-01

    Finger veins are a promising biometric pattern for personalized identification in terms of their advantages over existing biometrics. Based on the spatial pyramid representation and the combination of more effective information such as gray, texture and shape, this paper proposes a simple but powerful feature, called Pyramid Histograms of Gray, Texture and Orientation Gradients (PHGTOG). For a finger vein image, PHGTOG can reflect the global spatial layout and local details of gray, texture and shape. To further improve the recognition performance and reduce the computational complexity, we select a personalized subset of features from PHGTOG for each subject by using a sparse weight vector trained with LASSO; the resulting feature is called PFS-PHGTOG. We conduct extensive experiments to demonstrate the promise of PHGTOG and PFS-PHGTOG. Experimental results on our databases show that PHGTOG outperforms the other existing features. Moreover, PFS-PHGTOG can further boost the performance in comparison with PHGTOG.

  20. Input significance analysis: feature selection through synaptic ...

    Connection Weights (CW) and Garson's Algorithm (GA) and the classifier selected ... from the UCI Machine Learning Repository and executed in an online ... connectionist systems; evolving fuzzy neural network; connection weights; Garson's

  1. Predictive Feature Selection for Genetic Policy Search

    2014-05-22

    limited manual intervention are becoming increasingly desirable as more complex tasks in dynamic and high-tempo environments are explored. Reinforcement...states in many domains causes features relevant to the reward variations to be overlooked, which hinders the policy search. 3.4 Parameter Selection PFS...the current feature subset. This local minimum may be “deceptive,” meaning that it does not clearly lead to the global optimal policy (Goldberg and

  2. MULTIVARIANT AND UNIVARIANT INTERGROUP DIFFERENCES IN THE INVESTIGATED SPECIFIC MOTOR SPACE BETWEEN RESPONDENTS JUNIORS AND SENIORS MEMBERS OF THE MACEDONIAN NATIONAL KARATE TEAM

    Kenan Asani

    2013-07-01

    The aim is to establish the multivariate and univariate intergroup differences in the investigated specific motor space between junior and senior members of the Macedonian karate team. The sample of 30 male karate respondents comprises juniors aged 16 and 17 and seniors over 18 years. Twenty specific motor tests were applied. Based on Graph 1, which presents the multivariate analysis of variance (MANOVA) and ANOVA, the junior and senior respondents, although not belonging to the same population, do not differ in the multivariate space under study. A Wilks' lambda of .19 and a Rao's R approximation of 1.91, with degrees of freedom df1 = 20 and df2 = 9, give a significance level of p = .16. The univariate analysis of each variable separately shows a statistically significant intergroup difference in seven of the twenty applied manifest variables: SMAEGERI (kick at the bag with the preferred leg, mae geri, for 10 sec.), SMAVASI (kick at the bag with the preferred foot, mavashi geri, for 10 sec.), SUSIRO (kick at the bag with the preferred leg, ushiro geri, for 10 sec.), SKIZAME (strike at the bag with the preferred hand, kizame cuki, for 10 sec.), STAPNSR (foot tapping in the sagittal plane for 15 sec.), SUDMNR (hitting a moving target with the weaker hand) and SUDMPN (hitting a moving target with the preferred foot). There are no multivariate intergroup differences in the investigated specific motor space between the junior and senior members of the Macedonian karate team.

  3. In situ calibration using univariate analyses based on the onboard ChemCam targets: first prediction of Martian rock and soil compositions

    Fabre, C.; Cousin, A.; Wiens, R.C.; Ollila, A.; Gasnault, O.; Maurice, S.; Sautter, V.; Forni, O.; Lasue, J.; Tokar, R.; Vaniman, D.; Melikechi, N.

    2014-01-01

    The Curiosity rover landed on August 6th, 2012 in Gale Crater, Mars, and it possesses unique analytical capabilities to investigate the chemistry and mineralogy of the Martian soil. In particular, the LIBS technique is being used for the first time on another planet with the ChemCam instrument, and more than 75,000 spectra have been returned in the first year on Mars. Curiosity carries body-mounted calibration targets specially designed for the ChemCam instrument, some of which are homogeneous glasses and others fine-grained glass-ceramics. We present direct calibrations, using these onboard standards to infer elements and element ratios by ratioing relative peak areas. As the laser spot size is around 300 μm, the LIBS technique provides measurements of the silicate glass compositions representing homogeneous material and measurements of the ceramic targets that are comparable to fine-grained rock or soil. The laser energy and the auto-focus are controlled for all sequences used for calibration. The univariate calibration curves present relatively good to very good correlation coefficients with low RSDs for major and ratio calibrations. Trace element calibration curves (Li, Sr, and Mn), down to several ppm, can be used as a rapid tool to draw attention to remarkable rocks and soils along the traverse. First comparisons to alpha-particle X-ray spectroscopy (APXS) data, on selected targets, show good agreement for most elements and for Mg# and Al/Si estimates. Univariate SiO2 estimates cannot yet be used. Na2O and K2O estimates are relevant for high alkali contents, but are probably underestimated due to the CCCT initial compositions. Very good results for CaO and Al2O3 estimates and satisfactory results for FeO are obtained. - Highlights: • In situ LIBS univariate calibrations are done using the Curiosity onboard standards. • Major and minor element contents can be rapidly obtained. • Trace element contents can be used as a rapid tool along the

  4. In situ calibration using univariate analyses based on the onboard ChemCam targets: first prediction of Martian rock and soil compositions

    Fabre, C. [GeoRessources lab, Université de Lorraine, Nancy (France); Cousin, A.; Wiens, R.C. [Los Alamos National Laboratory, Los Alamos, NM (United States); Ollila, A. [University of NM, Albuquerque (United States); Gasnault, O.; Maurice, S. [IRAP, Toulouse (France); Sautter, V. [Museum National d' Histoire Naturelle, Paris (France); Forni, O.; Lasue, J. [IRAP, Toulouse (France); Tokar, R.; Vaniman, D. [Planetary Science Institute, Tucson, AZ (United States); Melikechi, N. [Delaware State University (United States)

    2014-09-01

    The Curiosity rover landed on August 6th, 2012 in Gale Crater, Mars, and it possesses unique analytical capabilities to investigate the chemistry and mineralogy of the Martian soil. In particular, the LIBS technique is being used for the first time on another planet with the ChemCam instrument, and more than 75,000 spectra have been returned in the first year on Mars. Curiosity carries body-mounted calibration targets specially designed for the ChemCam instrument, some of which are homogeneous glasses and others fine-grained glass-ceramics. We present direct calibrations, using these onboard standards to infer elements and element ratios by ratioing relative peak areas. As the laser spot size is around 300 μm, the LIBS technique provides measurements of the silicate glass compositions representing homogeneous material and measurements of the ceramic targets that are comparable to fine-grained rock or soil. The laser energy and the auto-focus are controlled for all sequences used for calibration. The univariate calibration curves present relatively good to very good correlation coefficients with low RSDs for major and ratio calibrations. Trace element calibration curves (Li, Sr, and Mn), down to several ppm, can be used as a rapid tool to draw attention to remarkable rocks and soils along the traverse. First comparisons to alpha-particle X-ray spectroscopy (APXS) data, on selected targets, show good agreement for most elements and for Mg# and Al/Si estimates. Univariate SiO2 estimates cannot yet be used. Na2O and K2O estimates are relevant for high alkali contents, but are probably underestimated due to the CCCT initial compositions. Very good results for CaO and Al2O3 estimates and satisfactory results for FeO are obtained. - Highlights: • In situ LIBS univariate calibrations are done using the Curiosity onboard standards. • Major and minor element contents can be rapidly obtained. • Trace element contents can be used as a

  5. Evaluation of in-line Raman data for end-point determination of a coating process: Comparison of Science-Based Calibration, PLS-regression and univariate data analysis.

    Barimani, Shirin; Kleinebudde, Peter

    2017-10-01

    A multivariate analysis method, Science-Based Calibration (SBC), was used for the first time for endpoint determination of a tablet coating process using Raman data. Two types of tablet cores, placebo and caffeine cores, received a coating suspension comprising a polyvinyl alcohol-polyethylene glycol graft-copolymer and titanium dioxide to a maximum coating thickness of 80 µm. Raman spectroscopy was used as an in-line PAT tool. The spectra were acquired every minute and correlated to the amount of applied aqueous coating suspension. SBC was compared to another well-known multivariate analysis method, Partial Least Squares regression (PLS), and a simpler approach, Univariate Data Analysis (UVDA). All developed calibration models had coefficient of determination values (R²) higher than 0.99. The coating endpoints could be predicted with root mean square errors of prediction (RMSEP) of less than 3.1% of the applied coating suspensions. Compared to PLS and UVDA, SBC proved to be an alternative multivariate calibration method with high predictive power.
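    The univariate branch of such a comparison can be illustrated with a minimal sketch: fit a straight line from one spectral intensity to the applied coating amount, then report R² on the calibration set and RMSEP on held-out spectra. The intensity and mass values below are hypothetical, and the function names are illustrative only; the abstract does not describe the authors' actual preprocessing or band selection.

```python
import math

def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def rmsep(y_true, y_pred):
    """Root mean square error of prediction on held-out samples."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)

# Hypothetical calibration data: one Raman band intensity vs. applied coating mass (g).
cal_intensity = [0.10, 0.21, 0.29, 0.41, 0.50]
cal_mass = [100.0, 200.0, 300.0, 400.0, 500.0]
slope, intercept = fit_line(cal_intensity, cal_mass)
cal_pred = [slope * v + intercept for v in cal_intensity]

# Hypothetical held-out spectra used only for prediction.
test_intensity = [0.15, 0.35, 0.45]
test_mass = [150.0, 340.0, 455.0]
test_pred = [slope * v + intercept for v in test_intensity]

print("R^2 (calibration):", round(r_squared(cal_mass, cal_pred), 4))
print("RMSEP (test, g):", round(rmsep(test_mass, test_pred), 2))
```

    Reporting RMSEP as a percentage of the total applied suspension, as the abstract does, would only require dividing by the final coating mass.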

  6. Building Customer Churn Prediction Models in Fitness Industry with Machine Learning Methods

    Shan, Min

    2017-01-01

    With the rapid growth of digital systems, churn management has become a major focus within customer relationship management in many industries. Ample research has been conducted on churn prediction in different industries with various machine learning methods. This thesis aims to combine feature selection and supervised machine learning methods to define churn prediction models and apply them to the fitness industry. Forward selection is chosen as the feature selection method. Support Vector ...
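    Forward selection, as named in this abstract, is a greedy wrapper: start from an empty set and repeatedly add the single feature that most improves a model score, stopping when no candidate helps. The sketch below is generic; the feature names and the toy scoring function are hypothetical stand-ins, since the excerpt does not say which score or features the thesis used. In practice the score would typically be a cross-validated metric of the chosen classifier.

```python
def forward_selection(candidates, score, max_features=None, min_gain=0.0):
    """Greedy forward selection.

    candidates: iterable of feature names.
    score: callable taking a list of feature names and returning a number
           (higher is better).
    Stops when the best candidate improves the score by at most min_gain.
    """
    selected = []
    remaining = list(candidates)
    best_score = score(selected)
    while remaining and (max_features is None or len(selected) < max_features):
        # Score every one-feature extension of the current subset.
        trials = [(score(selected + [f]), f) for f in remaining]
        top_score, top_feature = max(trials, key=lambda t: t[0])
        if top_score - best_score <= min_gain:
            break
        selected.append(top_feature)
        remaining.remove(top_feature)
        best_score = top_score
    return selected, best_score

# Hypothetical per-feature contributions; the two "visits" features are redundant.
GAINS = {"visits_per_week": 0.20, "visits_per_month": 0.19,
         "contract_length": 0.10, "age": 0.02}

def toy_score(subset):
    """Stand-in for a cross-validated accuracy; penalises subset size."""
    gains = {f: GAINS[f] for f in subset}
    if "visits_per_week" in gains and "visits_per_month" in gains:
        del gains["visits_per_month"]  # redundant information adds nothing
    return 0.5 + sum(gains.values()) - 0.03 * len(subset)

chosen, final_score = forward_selection(GAINS, toy_score)
print(chosen, round(final_score, 2))
```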

  7. Univariate and multiple linear regression analyses for 23 single nucleotide polymorphisms in 14 genes predisposing to chronic glomerular diseases and IgA nephropathy in Han Chinese.

    Wang, Hui; Sui, Weiguo; Xue, Wen; Wu, Junyong; Chen, Jiejing; Dai, Yong

    2014-09-01

    Immunoglobulin A nephropathy (IgAN) is a complex trait regulated by the interaction among multiple physiologic regulatory systems and probably involving numerous genes, which leads to inconsistent findings in genetic studies. One possible reason for the failure to replicate some single-locus results is that the underlying genetics of IgAN is based on multiple genes with minor effects. To study the association between 23 single nucleotide polymorphisms (SNPs) in 14 genes predisposing to chronic glomerular diseases and IgAN in Han males, the genotypes of the 23 SNPs in 21 Han males were detected with a BaiO gene chip, and their associations were analyzed with univariate analysis and multiple linear regression analysis. The analysis showed that CTLA4 rs231726 and CR2 rs1048971 were significantly associated with IgAN. These findings support the multi-gene nature of the etiology of IgAN and propose a potential gene-gene interactive model for future studies.

  8. A Java-based fMRI processing pipeline evaluation system for assessment of univariate general linear model and multivariate canonical variate analysis-based pipelines.

    Zhang, Jing; Liang, Lichen; Anderson, Jon R; Gatewood, Lael; Rottenberg, David A; Strother, Stephen C

    2008-01-01

    As functional magnetic resonance imaging (fMRI) becomes widely used, the demand for evaluation of fMRI processing pipelines and validation of fMRI analysis results is increasing rapidly. The current NPAIRS package, an IDL-based fMRI processing pipeline evaluation framework, lacks system interoperability and the ability to evaluate general linear model (GLM)-based pipelines using prediction metrics. Thus, it cannot fully evaluate fMRI analytical software modules such as FSL.FEAT and NPAIRS.GLM. In order to overcome these limitations, a Java-based fMRI processing pipeline evaluation system was developed. It integrated YALE (a machine learning environment) into Fiswidgets (an fMRI software environment) to obtain system interoperability and applied an algorithm to measure GLM prediction accuracy. The results demonstrated that the system can evaluate fMRI processing pipelines with univariate GLM and multivariate canonical variates analysis (CVA)-based models on real fMRI data based on prediction accuracy (classification accuracy) and statistical parametric image (SPI) reproducibility. In addition, a preliminary study was performed where four fMRI processing pipelines with GLM and CVA modules such as FSL.FEAT and NPAIRS.CVA were evaluated with the system. The results indicated that (1) the system can compare different fMRI processing pipelines with heterogeneous models (NPAIRS.GLM, NPAIRS.CVA and FSL.FEAT) and rank their performance by automatic performance scoring, and (2) the rank of pipeline performance is highly dependent on the preprocessing operations. These results suggest that the system will be of value for the comparison, validation, standardization and optimization of functional neuroimaging software packages and fMRI processing pipelines.

  9. Novel 3D ultrasound image-based biomarkers based on a feature selection from a 2D standardized vessel wall thickness map: a tool for sensitive assessment of therapies for carotid atherosclerosis

    Chiu, Bernard; Bing, Li; Chow, Tommy W S, E-mail: bcychiu@cityu.edu.hk, E-mail: bingli5@student.cityu.edu.hk, E-mail: eetchow@cityu.edu.hk [Department of Electronic Engineering, City University of Hong Kong (Hong Kong)]

    2013-09-07

    With the advent of new therapies and management strategies for carotid atherosclerosis, there is a parallel need for measurement tools or biomarkers to evaluate the efficacy of these new strategies. 3D ultrasound has been shown to provide reproducible measurements of plaque area/volume and vessel wall volume. However, since carotid atherosclerosis is a focal disease that predominantly occurs at bifurcations, biomarkers based on local plaque change may be more sensitive than global volumetric measurements in demonstrating efficacy of new therapies. The ultimate goal of this paper is to develop a biomarker that is based on the local distribution of vessel-wall-plus-plaque thickness change (VWT-Change) that has occurred during the course of a clinical study. To allow comparison between different treatment groups, the VWT-Change distribution of each subject must first be mapped to a standardized domain. In this study, we developed a technique to map the 3D VWT-Change distribution to a 2D standardized template. We then applied a feature selection technique to identify regions on the 2D standardized map on which subjects in different treatment groups exhibit greater difference in VWT-Change. The proposed algorithm was applied to analyse the VWT-Change of 20 subjects in a placebo-controlled study of the effect of atorvastatin (Lipitor). The average VWT-Change for each subject was computed (i) over all points in the 2D map and (ii) over feature points only. For the average computed over all points, 97 subjects per group would be required to detect an effect size of 25% that of atorvastatin in a six-month study. The sample size is reduced to 25 subjects if the average were computed over feature points only. 
    The introduction of this sensitive quantification technique for carotid atherosclerosis progression/regression would allow many proof-of-principle studies to be performed before a more costly and longer study involving a larger population is held to confirm the treatment

  10. Novel 3D ultrasound image-based biomarkers based on a feature selection from a 2D standardized vessel wall thickness map: a tool for sensitive assessment of therapies for carotid atherosclerosis

    Chiu, Bernard; Li Bing; Chow, Tommy W S

    2013-01-01

    With the advent of new therapies and management strategies for carotid atherosclerosis, there is a parallel need for measurement tools or biomarkers to evaluate the efficacy of these new strategies. 3D ultrasound has been shown to provide reproducible measurements of plaque area/volume and vessel wall volume. However, since carotid atherosclerosis is a focal disease that predominantly occurs at bifurcations, biomarkers based on local plaque change may be more sensitive than global volumetric measurements in demonstrating efficacy of new therapies. The ultimate goal of this paper is to develop a biomarker that is based on the local distribution of vessel-wall-plus-plaque thickness change (VWT-Change) that has occurred during the course of a clinical study. To allow comparison between different treatment groups, the VWT-Change distribution of each subject must first be mapped to a standardized domain. In this study, we developed a technique to map the 3D VWT-Change distribution to a 2D standardized template. We then applied a feature selection technique to identify regions on the 2D standardized map on which subjects in different treatment groups exhibit greater difference in VWT-Change. The proposed algorithm was applied to analyse the VWT-Change of 20 subjects in a placebo-controlled study of the effect of atorvastatin (Lipitor). The average VWT-Change for each subject was computed (i) over all points in the 2D map and (ii) over feature points only. For the average computed over all points, 97 subjects per group would be required to detect an effect size of 25% that of atorvastatin in a six-month study. The sample size is reduced to 25 subjects if the average were computed over feature points only. 
    The introduction of this sensitive quantification technique for carotid atherosclerosis progression/regression would allow many proof-of-principle studies to be performed before a more costly and longer study involving a larger population is held to confirm the treatment

  11. Effects of Feature Extraction and Classification Methods on Cyberbully Detection

    Esra SARAÇ

    2016-12-01

    Cyberbullying is defined as an aggressive, intentional action against a defenseless person by using the Internet or other electronic content. Researchers have found that many bullying cases have tragically ended in suicide; hence automatic detection of cyberbullying has become important. In this study we show the effects of the feature extraction, feature selection, and classification methods used on the performance of automatic cyberbullying detection. The experiments use the FormSpring.me dataset and investigate the effects of preprocessing methods; several classifiers such as C4.5, Naïve Bayes, kNN, and SVM; and the information gain and chi-square feature selection methods. Experimental results indicate that the best classification results are obtained when alphabetic tokenization, no stemming, and no stopword removal are applied. Using feature selection also improves cyberbullying detection performance. When classifiers are compared, C4.5 performs best on the dataset used.
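    The chi-square selection mentioned above can be sketched for the common case of a binary "term present / term absent" feature against a binary class label, using the standard 2x2 statistic N(ad - bc)² / ((a+b)(c+d)(a+c)(b+d)). The four-message corpus and its tokens are hypothetical, and the study's own tooling and settings are not described in the abstract, so this is only an illustration of the scoring idea.

```python
def chi_square_2x2(a, b, c, d):
    """Chi-square statistic for a 2x2 contingency table.

    a: term present, positive class    b: term present, negative class
    c: term absent,  positive class    d: term absent,  negative class
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # a constant feature or a single class carries no signal
    return n * (a * d - b * c) ** 2 / denom

def rank_terms(docs, labels):
    """Score every term in the vocabulary; docs are sets of tokens, labels are 0/1."""
    vocab = set().union(*docs)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    scores = {}
    for term in vocab:
        a = sum(1 for doc, y in zip(docs, labels) if term in doc and y == 1)
        b = sum(1 for doc, y in zip(docs, labels) if term in doc and y == 0)
        scores[term] = chi_square_2x2(a, b, n_pos - a, n_neg - b)
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

# Hypothetical tokenised messages; label 1 marks a bullying message.
docs = [{"you", "loser"}, {"such", "a", "loser"}, {"you", "rock"}, {"nice", "work"}]
labels = [1, 1, 0, 0]

for term, score in rank_terms(docs, labels)[:3]:
    print(term, round(score, 3))
```

    Keeping only the top-scoring terms before training a classifier is the selection step; the cut-off is a tuning choice.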

  12. Design and Selection of Machine Learning Methods Using Radiomics and Dosiomics for Normal Tissue Complication Probability Modeling of Xerostomia

    Hubert S. Gabryś

    2018-03-01

    Purpose: The purpose of this study is to investigate whether machine learning with dosiomic, radiomic, and demographic features allows for xerostomia risk assessment more precise than normal tissue complication probability (NTCP) models based on the mean radiation dose to parotid glands. Material and methods: A cohort of 153 head-and-neck cancer patients was used to model xerostomia at 0–6 months (early), 6–15 months (late), 15–24 months (long-term), and at any time (a longitudinal model) after radiotherapy. Predictive power of the features was evaluated by the area under the receiver operating characteristic curve (AUC) of univariate logistic regression models. The multivariate NTCP models were tuned and tested with single and nested cross-validation, respectively. We compared predictive performance of seven classification algorithms, six feature selection methods, and ten data cleaning/class balancing techniques using the Friedman test and the Nemenyi post hoc analysis. Results: NTCP models based on the parotid mean dose failed to predict xerostomia (AUCs < 0.60). The most informative predictors were found for late and long-term xerostomia. Late xerostomia correlated with the contralateral dose gradient in the anterior–posterior (AUC = 0.72) and the right–left (AUC = 0.68) direction, whereas long-term xerostomia was associated with parotid volumes (AUCs > 0.85), dose gradients in the right–left (AUCs > 0.78), and the anterior–posterior (AUCs > 0.72) direction. Multivariate models of long-term xerostomia were typically based on the parotid volume, the parotid eccentricity, and the dose–volume histogram (DVH) spread with the generalization AUCs ranging from 0.74 to 0.88. On average, support vector machines and extra-trees were the top performing classifiers, whereas the algorithms based on logistic regression were the best choice for feature selection. We found no advantage in using data cleaning or class balancing

  13. Comparative study between univariate spectrophotometry and multivariate calibration as analytical tools for quantitation of Benazepril alone and in combination with Amlodipine.

    Farouk, M; Elaziz, Omar Abd; Tawakkol, Shereen M; Hemdan, A; Shehata, Mostafa A

    2014-04-05

    Four simple, accurate, reproducible, and selective methods have been developed and subsequently validated for the determination of Benazepril (BENZ) alone and in combination with Amlodipine (AML) in pharmaceutical dosage form. The first method is pH-induced difference spectrophotometry, where BENZ can be measured in the presence of AML as it showed maximum absorption at 237 nm and 241 nm in 0.1 N HCl and 0.1 N NaOH, respectively, while AML shows no wavelength shift in either solvent. The second method is the new Extended Ratio Subtraction Method (EXRSM) coupled to the Ratio Subtraction Method (RSM) for determination of both drugs in commercial dosage form. The third and fourth methods are multivariate calibration methods, namely Principal Component Regression (PCR) and Partial Least Squares (PLS). A detailed validation of the methods was performed following the ICH guidelines, and the standard curves were found to be linear in the range of 2-30 μg/mL for BENZ in the difference and extended ratio subtraction spectrophotometric methods, and 5-30 for AML in the EXRSM method, with a well-accepted mean correlation coefficient for each analyte. The intra-day and inter-day precision and accuracy results were well within the acceptable limits.

  14. Design and Selection of Machine Learning Methods Using Radiomics and Dosiomics for Normal Tissue Complication Probability Modeling of Xerostomia.

    Gabryś, Hubert S; Buettner, Florian; Sterzing, Florian; Hauswald, Henrik; Bangert, Mark

    2018-01-01

    The purpose of this study is to investigate whether machine learning with dosiomic, radiomic, and demographic features allows for xerostomia risk assessment more precise than normal tissue complication probability (NTCP) models based on the mean radiation dose to parotid glands. A cohort of 153 head-and-neck cancer patients was used to model xerostomia at 0-6 months (early), 6-15 months (late), 15-24 months (long-term), and at any time (a longitudinal model) after radiotherapy. Predictive power of the features was evaluated by the area under the receiver operating characteristic curve (AUC) of univariate logistic regression models. The multivariate NTCP models were tuned and tested with single and nested cross-validation, respectively. We compared predictive performance of seven classification algorithms, six feature selection methods, and ten data cleaning/class balancing techniques using the Friedman test and the Nemenyi post hoc analysis. NTCP models based on the parotid mean dose failed to predict xerostomia (AUCs < 0.60). The most informative predictors were found for late and long-term xerostomia. Late xerostomia correlated with the contralateral dose gradient in the anterior-posterior (AUC = 0.72) and the right-left (AUC = 0.68) direction, whereas long-term xerostomia was associated with parotid volumes (AUCs > 0.85), dose gradients in the right-left (AUCs > 0.78), and the anterior-posterior (AUCs > 0.72) direction. Multivariate models of long-term xerostomia were typically based on the parotid volume, the parotid eccentricity, and the dose-volume histogram (DVH) spread with the generalization AUCs ranging from 0.74 to 0.88. On average, support vector machines and extra-trees were the top performing classifiers, whereas the algorithms based on logistic regression were the best choice for feature selection. We found no advantage in using data cleaning or class balancing methods. We demonstrated that incorporation of organ- and dose-shape descriptors is beneficial for xerostomia prediction in highly conformal radiotherapy treatments. Due to strong reliance on patient-specific, dose-independent factors, our results underscore the need for development of personalized data-driven risk profiles for NTCP models of xerostomia. The facilitated
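    Ranking features by univariate AUC, as described here, can be approximated without fitting any model. A single-predictor logistic regression is a monotone transform of that predictor, so on the data it is evaluated on, its AUC equals the AUC of the raw feature or one minus it, depending on the sign of the coefficient. The sketch below uses that simplification; it is an assumption made for illustration rather than the authors' procedure, and the feature names and values are hypothetical.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (ties count one half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def rank_features_by_auc(columns, labels):
    """Rank features by direction-free univariate AUC, best first.

    columns: dict mapping feature name -> list of values.
    max(a, 1 - a) mimics a fitted coefficient of either sign.
    """
    ranked = []
    for name, values in columns.items():
        a = auc(values, labels)
        ranked.append((name, max(a, 1.0 - a)))
    return sorted(ranked, key=lambda t: -t[1])

# Hypothetical data: 1 = xerostomia observed, 0 = not observed.
labels = [1, 1, 1, 0, 0, 0]
columns = {
    "parotid_volume": [10, 12, 11, 20, 22, 21],
    "mean_dose": [30, 25, 28, 26, 31, 27],
}
for name, value in rank_features_by_auc(columns, labels):
    print(name, round(value, 3))
```

    This pairwise form is quadratic in the sample count, which is acceptable for a cohort of a few hundred patients; a rank-based computation would scale better.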

  15. Prognostic factors in children and adolescents with acute myeloid leukemia (excluding children with Down syndrome and acute promyelocytic leukemia): univariate and recursive partitioning analysis of patients treated on Pediatric Oncology Group (POG) Study 8821.

    Chang, M; Raimondi, S C; Ravindranath, Y; Carroll, A J; Camitta, B; Gresik, M V; Steuber, C P; Weinstein, H

    2000-07-01

    The purpose of the paper was to define clinical or biological features associated with the risk for treatment failure for children with acute myeloid leukemia. Data from 560 children and adolescents with newly diagnosed acute myeloid leukemia who entered the Pediatric Oncology Group Study 8821 from June 1988 to March 1993 were analyzed by univariate and recursive partitioning methods. Children with Down syndrome or acute promyelocytic leukemia were excluded from the study. Factors examined included age, number of leukocytes, sex, FAB morphologic subtype, cytogenetic findings, and extramedullary disease at the time of diagnosis. The overall event-free survival (EFS) rate at 4 years was 32.7% (s.e. = 2.2%). Age ≥2 years, fewer than 50 × 10^9/L leukocytes, t(8;21) or inv(16), and normal chromosomes were associated with higher rates of EFS (P value = 0.003, 0.049, 0.0003, 0.031, respectively), whereas the M5 subtype of AML (P value = 0.0003) and chromosome abnormalities other than t(8;21) and inv(16) were associated with lower rates of EFS (P value = 0.0001). Recursive partitioning analysis defined three groups of patients with widely varied prognoses: female patients with t(8;21), inv(16), or a normal karyotype (n = 89) had the best prognosis (4-year EFS = 55.1%, s.e. = 5.7%); male patients with t(8;21), inv(16) or normal chromosomes (n = 106) had an intermediate prognosis (4-year EFS = 38.1%, s.e. = 5.3%); patients with chromosome abnormalities other than t(8;21) and inv(16) (n = 233) had the worst prognosis (4-year EFS = 27.0%, s.e. = 3.2%). One hundred and thirty-two patients (24%) could not be grouped because of missing cytogenetic data, mainly due to inadequate marrow samples. The results suggest that pediatric patients with acute myeloid leukemia can be categorized into three potential risk groups for prognosis and that differences in sex and chromosomal abnormalities are associated with differences in estimates of EFS. These results are tentative and

  16. Comparative Study of Univariate Spectrophotometry and Multivariate Calibration for the Determination of Levamisole Hydrochloride and Closantel Sodium in a Binary Mixture.

    Abdel-Aziz, Omar; Hussien, Emad M; El Kosasy, Amira M; Ahmed, Neven

    2016-07-01

    Six simple, accurate, reproducible, and selective derivative spectrophotometric and chemometric methods have been developed and validated for the determination of levamisole HCl (Lev) either alone or in combination with closantel sodium (Clo) in the pharmaceutical dosage form. Lev was determined by first-derivative, first-derivative ratio, and mean-centering methods by measuring the peak amplitude at 220.8, 243.8, and 210.4 nm, respectively. The methods were linear over the concentration range 2.0-10.0 μg/mL Lev. The methods exhibited a high accuracy, with recovery data within ±1.9% and RSD <1.3% (n = 9) for the determination of Lev in the presence of Clo. Fortunately, Lev showed no significant UV absorbance at 370.6 nm, which allowed the determination of Clo over the concentration range 16.0-80.0 μg/mL using zero-order spectra, with a high precision (RSD <1.5%, n = 9). Furthermore, principal component regression and partial least-squares with optimized parameters were used for the determination of Lev in the presence of Clo. The recovery was within ±1%, with RSD <1.0% (n = 9) and root mean square error of prediction ≤1.0. The proposed methods were validated according to the International Conference on Harmonization guidelines. The proposed methods were used in the determination of Lev and Clo in a binary mixture and a pharmaceutical formulation, with high accuracy and precision.

  17. Method

    Ling Fiona W.M.

    2017-01-01

    Rapid prototyping of microchannels has gained considerable attention from researchers along with the rapid development of microfluidic technology. Conventional methods have several disadvantages: they are costly and time-consuming, require high operating pressure and temperature, and demand expertise in operating the equipment. In this work, a new method adapted from xurography is introduced to replace the conventional methods of microchannel fabrication. The novelty of this study is replacing the adhesive film with a clear plastic film, which is used to cut the microchannel design, as the material is more suitable for fabricating more complex microchannel designs. The microchannel was then molded using polydimethylsiloxane (PDMS) and bonded to clean glass to produce a closed microchannel. The microchannel produced had a clean edge, indicating that a good master mold was produced using the cutting plotter, and the bonding between the PDMS and glass was good, with no leakage observed. The materials used in this method are cheap and the total time consumed is less than 5 hours, making the method suitable for rapid prototyping of microchannels.

  18. method

    L. M. Kimball

    2002-01-01

    This paper presents an interior point algorithm to solve the multiperiod hydrothermal economic dispatch (HTED). The multiperiod HTED is a large scale nonlinear programming problem. Various optimization methods have been applied to the multiperiod HTED, but most neglect important network characteristics or require decomposition into thermal and hydro subproblems. The algorithm described here exploits the special bordered block diagonal structure and sparsity of the Newton system for the first order necessary conditions to result in a fast efficient algorithm that can account for all network aspects. Applying this new algorithm challenges a conventional method for the use of available hydro resources known as the peak shaving heuristic.

  19. An entropy-based improved k-top scoring pairs (TSP) method for ...

    An entropy-based improved k-top scoring pairs (TSP) (Ik-TSP) method was presented in this study for the classification and prediction of human cancers based on gene-expression data. We compared Ik-TSP classifiers with 5 different machine learning methods and the k-TSP method based on 3 different feature selection ...

  20. Sampling Methods for Wallenius' and Fisher's Noncentral Hypergeometric Distributions

    Fog, Agner

    2008-01-01

    Several methods for generating variates with univariate and multivariate Wallenius' and Fisher's noncentral hypergeometric distributions are developed. Methods for the univariate distributions include: simulation of urn experiments, inversion by binary search, inversion by chop-down search from the mode, ratio-of-uniforms rejection method, and rejection by sampling in the tau domain. Methods for the multivariate distributions include: simulation of urn experiments, conditional method, Gibbs sampling, and Metropolis-Hastings sampling. These methods are useful for Monte Carlo simulation of models of biased sampling and models of evolution and for calculating moments and quantiles of the distributions.
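    Of the methods listed, simulation of the urn experiment is the most direct to sketch for the univariate Wallenius' distribution: balls of two colours are drawn one at a time without replacement, and at each draw a colour is chosen with probability proportional to the number of remaining balls of that colour times its weight. The function below is an illustrative sketch under that textbook definition, with hypothetical parameter names; it is not the author's implementation and does not cover Fisher's distribution or the faster inversion and rejection methods.

```python
import random

def wallenius_urn_draw(m1, m2, n, odds, rng):
    """Draw one variate from Wallenius' noncentral hypergeometric distribution.

    m1: number of red balls      m2: number of white balls
    n:  number of balls drawn    odds: weight of a red ball relative to a white one
    Returns the number of red balls among the n drawn.
    """
    if odds <= 0:
        raise ValueError("odds must be positive")
    if not 0 <= n <= m1 + m2:
        raise ValueError("n must lie between 0 and m1 + m2")
    red_left, white_left, red_taken = m1, m2, 0
    for _ in range(n):
        # Probability that the next ball is red, given what remains in the urn.
        p_red = red_left * odds / (red_left * odds + white_left)
        if rng.random() < p_red:
            red_left -= 1
            red_taken += 1
        else:
            white_left -= 1
    return red_taken

rng = random.Random(12345)  # seeded only to make the demonstration repeatable
sample = [wallenius_urn_draw(10, 10, 5, 3.0, rng) for _ in range(10000)]
print("sample mean with odds 3:", sum(sample) / len(sample))
```

    The cost is proportional to n per variate, which is why faster alternatives are worth having when n is large.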

  1. Electronic nose with a new feature reduction method and a multi-linear classifier for Chinese liquor classification

    Jing, Yaqi; Meng, Qinghao, E-mail: qh-meng@tju.edu.cn; Qi, Peifeng; Zeng, Ming; Li, Wei; Ma, Shugen [Tianjin Key Laboratory of Process Measurement and Control, Institute of Robotics and Autonomous Systems, School of Electrical Engineering and Automation, Tianjin University, Tianjin 300072 (China)

    2014-05-15

    An electronic nose (e-nose) was designed to classify Chinese liquors of the same aroma style. A new method of feature reduction which combined feature selection with feature extraction was proposed. The feature selection method used 8 feature-selection algorithms based on information theory and reduced the dimension of the feature space to 41. Kernel entropy component analysis was introduced into the e-nose system as a feature extraction method and the dimension of the feature space was reduced to 12. Classification of Chinese liquors was performed by using a back propagation artificial neural network (BP-ANN), linear discriminant analysis (LDA), and a multi-linear classifier. The classification rate of the multi-linear classifier was 97.22%, which was higher than those of LDA and BP-ANN. Finally, the classification of Chinese liquors according to their raw materials and geographical origins was performed using the proposed multi-linear classifier, and the classification rates were 98.75% and 100%, respectively.

  2. Electronic nose with a new feature reduction method and a multi-linear classifier for Chinese liquor classification

    Jing, Yaqi; Meng, Qinghao; Qi, Peifeng; Zeng, Ming; Li, Wei; Ma, Shugen

    2014-01-01

    An electronic nose (e-nose) was designed to classify Chinese liquors of the same aroma style. A new method of feature reduction which combined feature selection with feature extraction was proposed. The feature selection method used 8 feature-selection algorithms based on information theory and reduced the dimension of the feature space to 41. Kernel entropy component analysis was introduced into the e-nose system as a feature extraction method and the dimension of the feature space was reduced to 12. Classification of Chinese liquors was performed by using a back propagation artificial neural network (BP-ANN), linear discriminant analysis (LDA), and a multi-linear classifier. The classification rate of the multi-linear classifier was 97.22%, which was higher than those of LDA and BP-ANN. Finally, the classification of Chinese liquors according to their raw materials and geographical origins was performed using the proposed multi-linear classifier, and the classification rates were 98.75% and 100%, respectively.

  3. Graph-based unsupervised feature selection and multiview ...

    2015-09-28

    Sep 28, 2015 ... is presented by Yu et al. (2010) to retrieve biomedical ... dimensional microarray data, still require further research to be done in this topic .... relatively aggressive and require therapy soon after diagnosis or else patient dies ...

  4. Feature selection for domain knowledge representation through multitask learning

    Rosman, Benjamin S

    2014-10-01

    represent stimuli of interest, and rich feature sets which increase the dimensionality of the space and thus the difficulty of the learning problem. We focus on a multitask reinforcement learning setting, where the agent is learning domain knowledge...

  5. Spectrally Queued Feature Selection for Robotic Visual Odometery

    2010-11-23

    in these systems has yet to be defined. 1. INTRODUCTION 1.1 Uses of Autonomous Vehicles Autonomous vehicles have a wide range of possible...applications. In military situations, autonomous vehicles are valued for their ability to keep Soldiers far away from danger. A robot can inspect and disarm...just a glimpse of what engineers are hoping for in the future. 1.2 Biological Influence Autonomous vehicles are becoming more of a possibility in

  6. Feature selection using feature dissimilarity measure and density ...

    2015-09-28

    Sep 28, 2015 ... that the value of λ2 tends to 0 as the absolute value of. ϱ(x,y) increases. ..... ison of speech recognition algorithms; in Acoustics, Speech, and ... single-cell rna-seq for marker-free decomposition of tissues into cell types.

  7. Feature Selection and Classifier Development for Radio Frequency Device Identification

    2015-12-01

    Table C-5: p-values vs Test Statistic for Wine Quality...Table C-6...p-values vs Test Statistic for Wisconsin Breast Cancer...Table C-7: p-values vs Test Statistic for Wine ...electronic devices, such as RF-DNA. A brief review of the various approaches is considered to illustrate the benefits of the RF-DNA approach.

  8. A harmony search algorithm for clustering with feature selection

    Carlos Cobos

    2010-01-01

    This article presents a new clustering algorithm called IHSK, which is able to select features with linear-order complexity. The algorithm is inspired by the combination of the harmony search and K-means algorithms. For feature selection, it uses the concept of variability and a heuristic method that penalizes the presence of dimensions with a low probability of contributing to the current solution. The algorithm was tested with synthetic and real datasets, obtaining promising results.

  9. Soft computing based feature selection for environmental sound classification

    Shakoor, A.; May, T.M.; Van Schijndel, N.H.

    2010-01-01

    Environmental sound classification has a wide range of applications, like hearing aids, mobile communication devices, portable media players, and auditory protection devices. Sound classification systems typically extract features from the input sound. Using too many features increases complexity

  10. Feature Selection for Audio Surveillance in Urban Environment

    KIKTOVA Eva

    2014-05-01

    This paper presents the work leading to an acoustic event detection system designed to recognize two types of acoustic events (shots and breaking glass) in an urban environment. For this purpose, extensive front-end processing was performed to obtain an effective parametric representation of the input sound. MFCC features and features computed during their extraction (MELSPEC and FBANK), as well as MPEG-7 audio descriptors and other temporal and spectral characteristics, were extracted. High-dimensional feature sets were created and, in the next phase, reduced by mutual-information-based selection algorithms. A Hidden Markov Model based classifier was applied and evaluated with the Viterbi decoding algorithm. In this way, very effective feature sets were identified and the less important features were also found.
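    The simplest mutual-information-based selection scores each feature independently by its mutual information with the class label and keeps the top k. The sketch below assumes features that have already been discretised into bins; the feature names and values are hypothetical. The paper's own selection algorithms are not named in the abstract and may also account for redundancy between features, which this sketch does not.

```python
import math
from collections import Counter

def mutual_information(feature, labels):
    """Mutual information I(X; Y) in bits between two discrete sequences."""
    n = len(feature)
    joint = Counter(zip(feature, labels))
    px = Counter(feature)
    py = Counter(labels)
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x, y) * log2( p(x, y) / (p(x) * p(y)) ), written with raw counts
        mi += (c / n) * math.log2(c * n / (px[x] * py[y]))
    return mi

def select_top_k(columns, labels, k):
    """Return the k feature names with the highest mutual information."""
    scored = sorted(columns, key=lambda name: -mutual_information(columns[name], labels))
    return scored[:k]

# Hypothetical discretised audio features for six sound clips.
labels = ["shot", "shot", "shot", "glass", "glass", "glass"]
columns = {
    "mfcc_1_bin": [0, 0, 0, 1, 1, 1],         # separates the classes perfectly
    "zero_cross_bin": [0, 1, 0, 1, 0, 1],     # weakly informative
    "frame_energy_bin": [1, 1, 1, 1, 1, 1],   # constant, carries no information
}
print(select_top_k(columns, labels, 2))
```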

  11. Feature Selection Using Adaboost for Face Expression Recognition

    Silapachote, Piyanuch; Karuppiah, Deepak R; Hanson, Allen R

    2005-01-01

    We propose a classification technique for face expression recognition using AdaBoost that learns by selecting the relevant global and local appearance features with the most discriminating information...
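    One common way AdaBoost acts as a feature selector is to restrict each weak learner to a single-feature threshold test (a decision stump), so that every boosting round picks one feature. The sketch below shows that idea on a tiny hypothetical dataset with labels in {-1, +1}. Since the abstract is truncated, the weak learners and features the authors actually used are not known here, and this should be read as a generic illustration.

```python
import math

def best_stump(X, y, w):
    """Find the single-feature threshold rule with the lowest weighted error.

    Returns (error, feature_index, threshold, polarity); the rule predicts
    `polarity` when row[feature_index] >= threshold and `-polarity` otherwise.
    """
    best = None
    for j in range(len(X[0])):
        for thr in sorted({row[j] for row in X}):
            for polarity in (1, -1):
                err = 0.0
                for row, label, weight in zip(X, y, w):
                    pred = polarity if row[j] >= thr else -polarity
                    if pred != label:
                        err += weight
                if best is None or err < best[0]:
                    best = (err, j, thr, polarity)
    return best

def adaboost_select(X, y, rounds):
    """Run AdaBoost with stumps; return the chosen (feature, threshold, polarity, alpha)."""
    n = len(X)
    w = [1.0 / n] * n
    chosen = []
    for _ in range(rounds):
        err, j, thr, pol = best_stump(X, y, w)
        err = min(max(err, 1e-12), 1.0 - 1e-12)  # keep the logarithm finite
        alpha = 0.5 * math.log((1.0 - err) / err)
        chosen.append((j, thr, pol, alpha))
        # Increase the weight of misclassified samples, decrease the rest.
        w = [wi * math.exp(-alpha * yi * (pol if row[j] >= thr else -pol))
             for wi, yi, row in zip(w, y, X)]
        total = sum(w)
        w = [wi / total for wi in w]
        if err <= 1e-12:
            break  # a perfect stump leaves nothing more to learn
    return chosen

# Hypothetical data: three appearance features per face, two expression classes.
X = [[0.3, 1.0, 5.0], [0.9, 2.0, 1.0], [0.1, 8.0, 4.0], [0.7, 9.0, 2.0]]
y = [-1, -1, 1, 1]
stumps = adaboost_select(X, y, rounds=5)
print("selected feature indices:", sorted({j for j, _, _, _ in stumps}))
```

    The set of feature indices appearing in the returned stumps is the selected subset; the alpha values indicate how much each stump contributes to the final vote.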

  12. Biometrics Theory, Methods, and Applications

    Boulgouris, N V; Micheli-Tzanakou, Evangelia

    2009-01-01

    An in-depth examination of the cutting edge of biometrics. This book fills a gap in the literature by detailing the recent advances and emerging theories, methods, and applications of biometric systems in a variety of infrastructures. Edited by a panel of experts, it provides comprehensive coverage of: multilinear discriminant analysis for biometric signal recognition; biometric identity authentication techniques based on neural networks; multimodal biometrics and design of classifiers for biometric fusion; feature selection and facial aging modeling for face recognition; geometrical and

  13. A new cascade NN based method to short-term load forecast in deregulated electricity market

    Kouhi, Sajjad; Keynia, Farshid

    2013-01-01

    Highlights: • We propose a new hybrid method, based on cascaded NNs and WT, for short-term load forecasting in a deregulated electricity market. • An efficient preprocessor consisting of normalization and shuffling of signals is presented. • To select the best inputs, a two-stage feature selection is presented. • A new structure consisting of three cascaded NNs is used as the forecaster. - Abstract: Short-term load forecasting (STLF) is a major topic in the efficient operation of power systems. The electricity load is a nonlinear signal with time-dependent behavior. Electricity load forecasting still needs more accurate and stable forecast algorithms. To improve prediction accuracy, a new hybrid forecast strategy based on a cascaded neural network is proposed for STLF. The method consists of a wavelet transform, an intelligent two-stage feature selection, and a cascaded neural network. The feature selection removes irrelevant and redundant inputs. The forecast engine is composed of three cascaded neural networks (CNN). This cascaded structure can efficiently extract the input/output mapping function of the nonlinear electricity load data. The adjustable parameters of the intelligent feature selection and the CNN are fine-tuned by a cross-validation technique. The proposed STLF method is tested on the PJM and New York electricity markets. The results indicate that the proposed algorithm is a robust forecast method.

  14. A comparison of multivariate genome-wide association methods

    Galesloot, Tessel E; Van Steen, Kristel; Kiemeney, Lambertus A L M

    2014-01-01

    Joint association analysis of multiple traits in a genome-wide association study (GWAS), i.e. a multivariate GWAS, offers several advantages over analyzing each trait in a separate GWAS. In this study we directly compared a number of multivariate GWAS methods using simulated data. We focused on six...... methods that are implemented in the software packages PLINK, SNPTEST, MultiPhen, BIMBAM, PCHAT and TATES, and also compared them to standard univariate GWAS, analysis of the first principal component of the traits, and meta-analysis of univariate results. We simulated data (N = 1000) for three...... for scenarios with an opposite sign of genetic and residual correlation. All multivariate analyses resulted in a higher power than univariate analyses, even when only one of the traits was associated with the QTL. Hence, use of multivariate GWAS methods can be recommended, even when genetic correlations between...

  15. A New Variable Selection Method Based on Mutual Information Maximization by Replacing Collinear Variables for Nonlinear Quantitative Structure-Property Relationship Models

    Ghasemi, Jahan B.; Zolfonoun, Ehsan [Toosi University of Technology, Tehran (Korea, Republic of)

    2012-05-15

    Selection of the most informative molecular descriptors from the original data set is a key step in the development of quantitative structure activity/property relationship models. Recently, mutual information (MI) has gained increasing attention in feature selection problems. This paper presents an effective mutual information-based feature selection approach, named mutual information maximization by replacing collinear variables (MIMRCV), for nonlinear quantitative structure-property relationship models. The proposed variable selection method was applied to three different QSPR datasets: soil degradation half-life of 47 organophosphorus pesticides, GC-MS retention times of 85 volatile organic compounds, and water-to-micellar cetyltrimethylammonium bromide partition coefficients of 62 organic compounds. The results revealed that using MIMRCV as the feature selection method improves the predictive quality of the developed models compared to conventional MI-based variable selection algorithms.

  16. A New Variable Selection Method Based on Mutual Information Maximization by Replacing Collinear Variables for Nonlinear Quantitative Structure-Property Relationship Models

    Ghasemi, Jahan B.; Zolfonoun, Ehsan

    2012-01-01

    Selection of the most informative molecular descriptors from the original data set is a key step in the development of quantitative structure activity/property relationship models. Recently, mutual information (MI) has gained increasing attention in feature selection problems. This paper presents an effective mutual information-based feature selection approach, named mutual information maximization by replacing collinear variables (MIMRCV), for nonlinear quantitative structure-property relationship models. The proposed variable selection method was applied to three different QSPR datasets: soil degradation half-life of 47 organophosphorus pesticides, GC-MS retention times of 85 volatile organic compounds, and water-to-micellar cetyltrimethylammonium bromide partition coefficients of 62 organic compounds. The results revealed that using MIMRCV as the feature selection method improves the predictive quality of the developed models compared to conventional MI-based variable selection algorithms.

  17. Effects of Feature Extraction and Classification Methods on Cyberbully Detection

    ÖZEL, Selma Ayşe; SARAÇ, Esra

    2016-01-01

    Cyberbullying is defined as an aggressive, intentional action against a defenseless person by using the Internet, or other electronic contents. Researchers have found that many of the bullying cases have tragically ended in suicides; hence automatic detection of cyberbullying has become important. In this study we show the effects of feature extraction, feature selection, and classification methods that are used, on the performance of automatic detection of cyberbullying. To perform the exper...

  18. Seminal quality prediction using data mining methods.

    Sahoo, Anoop J; Kumar, Yugal

    2014-01-01

    Nowadays, new classes of diseases known as lifestyle diseases have emerged. The main causes of these diseases are lifestyle changes such as alcohol consumption, smoking, and food habits. A review of the various lifestyle diseases shows that fertility rates (sperm quantity) in men have decreased considerably over the last two decades. Lifestyle factors as well as environmental factors are mainly responsible for the change in semen quality. The objective of this paper is to identify the lifestyle and environmental features that affect seminal quality and the fertility rate in men using data mining methods. Five artificial intelligence techniques, namely Multilayer Perceptron (MLP), Decision Tree (DT), Naive Bayes (Kernel), Support Vector Machine + Particle Swarm Optimization (SVM+PSO), and Support Vector Machine (SVM), were applied to a fertility dataset to evaluate seminal quality and to predict whether a person is normal or has an altered fertility rate. Eight feature selection techniques, namely support vector machine (SVM), neural network (NN), evolutionary logistic regression (LR), support vector machine plus particle swarm optimization (SVM+PSO), principal component analysis (PCA), chi-square test, correlation, and t-test, were used to identify the more relevant features affecting seminal quality. These techniques were applied to a fertility dataset containing 100 instances with nine attributes and two classes. The experimental results show that SVM+PSO provides higher accuracy and area under the curve (AUC) (94% & 0.932) than the multilayer perceptron (MLP) (92% & 0.728), Support Vector Machine (91% & 0.758), Naive Bayes (Kernel) (89% & 0.850), and Decision Tree (89% & 0.735) for some of the seminal parameters. This paper also focuses on the feature selection process, i.e. how to select the features which are more important for prediction of

  19. A comparative study of machine learning methods for time-to-event survival data for radiomics risk modelling.

    Leger, Stefan; Zwanenburg, Alex; Pilz, Karoline; Lohaus, Fabian; Linge, Annett; Zöphel, Klaus; Kotzerke, Jörg; Schreiber, Andreas; Tinhofer, Inge; Budach, Volker; Sak, Ali; Stuschke, Martin; Balermpas, Panagiotis; Rödel, Claus; Ganswindt, Ute; Belka, Claus; Pigorsch, Steffi; Combs, Stephanie E; Mönnich, David; Zips, Daniel; Krause, Mechthild; Baumann, Michael; Troost, Esther G C; Löck, Steffen; Richter, Christian

    2017-10-16

    Radiomics applies machine learning algorithms to quantitative imaging data to characterise the tumour phenotype and predict clinical outcome. For the development of radiomics risk models, a variety of different algorithms is available and it is not clear which one gives optimal results. Therefore, we assessed the performance of 11 machine learning algorithms combined with 12 feature selection methods by the concordance index (C-Index), to predict loco-regional tumour control (LRC) and overall survival for patients with head and neck squamous cell carcinoma. The considered algorithms are able to deal with continuous time-to-event survival data. Feature selection and model building were performed on a multicentre cohort (213 patients) and validated using an independent cohort (80 patients). We found several combinations of machine learning algorithms and feature selection methods which achieve similar results, e.g. C-Index = 0.71 and BT-COX: C-Index = 0.70 in combination with Spearman feature selection. Using the best performing models, patients were stratified into groups of low and high risk of recurrence. Significant differences in LRC were obtained between both groups on the validation cohort. Based on the presented analysis, we identified a subset of algorithms which should be considered in future radiomics studies to develop stable and clinically relevant predictive models for time-to-event endpoints.
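    The concordance index used as the evaluation metric above can be sketched as follows. This is a minimal version of Harrell's C for right-censored data, assuming that a higher risk score means an earlier expected event; the function name and argument layout are illustrative and are not taken from the study.

```python
def concordance_index(times, events, risks):
    """Harrell's C: share of comparable pairs whose risk ordering matches outcome.

    times  -- observed follow-up time per patient
    events -- 1 if the event was observed, 0 if the patient was censored
    risks  -- predicted risk score (higher means earlier expected event)
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when patient i had an observed event
            # strictly before patient j's observed time.
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable
```

    A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering, which gives context for the reported values around 0.70.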

  20. Elastic SCAD as a novel penalization method for SVM classification tasks in high-dimensional data.

    Becker, Natalia; Toedt, Grischa; Lichter, Peter; Benner, Axel

    2011-05-09

    Classification and variable selection play an important role in knowledge discovery in high-dimensional data. Although Support Vector Machine (SVM) algorithms are among the most powerful classification and prediction methods with a wide range of scientific applications, the SVM does not include automatic feature selection and therefore a number of feature selection procedures have been developed. Regularisation approaches extend SVM to a feature selection method in a flexible way using penalty functions like LASSO, SCAD and Elastic Net. We propose a novel penalty function for SVM classification tasks, Elastic SCAD, a combination of SCAD and ridge penalties which overcomes the limitations of each penalty alone. Since SVM models are extremely sensitive to the choice of tuning parameters, we adopted an interval search algorithm, which in comparison to a fixed grid search finds rapidly and more precisely a global optimal solution. Feature selection methods with combined penalties (Elastic Net and Elastic SCAD SVMs) are more robust to a change of the model complexity than methods using single penalties. Our simulation study showed that Elastic SCAD SVM outperformed LASSO (L1) and SCAD SVMs. Moreover, Elastic SCAD SVM provided sparser classifiers in terms of median number of features selected than Elastic Net SVM and often better predicted than Elastic Net in terms of misclassification error. Finally, we applied the penalization methods described above on four publicly available breast cancer data sets. Elastic SCAD SVM was the only method providing robust classifiers in sparse and non-sparse situations. The proposed Elastic SCAD SVM algorithm provides the advantages of the SCAD penalty and at the same time avoids sparsity limitations for non-sparse data. We were first to demonstrate that the integration of the interval search algorithm and penalized SVM classification techniques provides fast solutions on the optimization of tuning parameters. The penalized SVM
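    To make the "SCAD plus ridge" combination concrete, the sketch below evaluates the standard piecewise SCAD penalty and adds a squared (ridge) term per weight. The default a = 3.7, the parameter names, and the exact scaling of the ridge term are assumptions for illustration; the paper's own formulation may differ in constants.

```python
def scad_penalty(w, lam, a=3.7):
    """Standard SCAD penalty for one coefficient (requires a > 2, lam > 0)."""
    aw = abs(w)
    if aw <= lam:
        return lam * aw  # LASSO-like near zero
    if aw <= a * lam:
        # quadratic transition between the linear part and the plateau
        return -(aw * aw - 2 * a * lam * aw + lam * lam) / (2 * (a - 1))
    return (a + 1) * lam * lam / 2  # constant: large weights are not shrunk further

def elastic_scad_penalty(weights, lam1, lam2, a=3.7):
    """SCAD penalty plus a ridge term lam2 * w**2, summed over all weights."""
    return sum(scad_penalty(w, lam1, a) + lam2 * w * w for w in weights)
```

    With lam2 = 0 this reduces to plain SCAD. The ridge term keeps some shrinkage on large coefficients, which is how a combined penalty can behave well when the true model is not sparse.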

  1. A New Hybrid Method for Improving the Performance of Myocardial Infarction Prediction

    Hojatollah Hamidi

    2016-06-01

    Introduction: Myocardial infarction, also known as heart attack, normally occurs due to causes such as smoking, family history, and diabetes. It is recognized as one of the leading causes of death in the world. Therefore, the present study aimed to evaluate the performance of classification models for predicting myocardial infarction, using a feature selection method that combines Forward Selection and a Genetic Algorithm. Materials & Methods: The myocardial infarction data set used in this study contains information on 519 visitors to Shahid Madani Specialized Hospital of Khorramabad, Iran, and includes 33 features. The proposed method uses hybrid feature selection to enhance the performance of classification algorithms. In the first step, features are selected using Forward Selection. In the second step, the selected features are passed to a genetic algorithm, which selects the best features. The classification algorithms AdaBoost, Naïve Bayes, J48 decision tree, and simpleCART are then applied to the data set with the selected features to predict myocardial infarction. Results: The best results were achieved after applying the proposed feature selection method, with the simpleCART and J48 algorithms reaching accuracies of 96.53% and 96.34%, respectively. Conclusion: Based on the results, the performance of the classification algorithms improved. Applying the proposed feature selection method together with classification algorithms therefore appears to be a reliable approach to predicting myocardial infarction.
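    The first stage of such a hybrid scheme, forward selection, can be sketched as a greedy loop. The code below is a generic illustration only: the scoring callback, the feature names, and the toy scores are hypothetical, and the genetic-algorithm second stage is not reproduced.

```python
def forward_selection(features, score, max_features=None):
    """Greedy forward selection.

    features -- iterable of candidate feature names
    score    -- callable mapping a list of names to a number (higher is better),
                e.g. cross-validated accuracy of a classifier on those features
    """
    selected = []
    best = float("-inf")
    remaining = list(features)
    while remaining and (max_features is None or len(selected) < max_features):
        # Evaluate every one-feature extension of the current subset.
        candidate, cand_score = max(
            ((f, score(selected + [f])) for f in remaining),
            key=lambda pair: pair[1],
        )
        if cand_score <= best:
            break  # no extension improves the score, so stop
        selected.append(candidate)
        remaining.remove(candidate)
        best = cand_score
    return selected, best

# Toy scoring function (hypothetical): each useful feature adds points,
# and every selected feature costs 5 points.
gain = {"smoking": 50, "family_history": 30, "diabetes": 10, "noise": 0}

def toy_score(subset):
    return sum(gain[f] for f in subset) - 5 * len(subset)

chosen, final_score = forward_selection(gain, toy_score)
```

    In practice the score would come from a classifier evaluated on held-out data, so that the selection does not simply overfit the training set.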

  2. Development of headspace solid-phase microextraction method for ...

    A headspace solid-phase microextraction (HS-SPME) method was developed as a preliminary investigation using univariate approach for the analysis of 14 multiclass pesticide residues in fruits and vegetable samples. The gas chromatography mass spectrometry parameters (desorption temperature and time, column flow ...

  3. A Multifeatures Fusion and Discrete Firefly Optimization Method for Prediction of Protein Tyrosine Sulfation Residues.

    Guo, Song; Liu, Chunhua; Zhou, Peng; Li, Yanling

    2016-01-01

    Tyrosine sulfation is one of the ubiquitous protein posttranslational modifications, in which sulfate groups are added to tyrosine residues. It plays significant roles in various physiological processes in eukaryotic cells. To explore the molecular mechanism of tyrosine sulfation, one of the prerequisites is to correctly identify possible protein tyrosine sulfation residues. In this paper, a novel method is presented to predict protein tyrosine sulfation residues from primary sequences. By means of informative feature construction and an elaborate feature selection and parameter optimization scheme, the proposed predictor achieved promising results and outperformed many other state-of-the-art predictors. Using the optimal feature subset, the proposed method achieved a mean MCC of 94.41% on the benchmark dataset and an MCC of 90.09% on the independent dataset. The experimental performance indicated that the proposed method could be effective in identifying important protein posttranslational modifications and that the feature selection scheme could be powerful in research on predicting protein functional residues.
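    For reference, the Matthews correlation coefficient (MCC) reported above is computed from the four cells of a binary confusion matrix. The helper below is a generic sketch; returning 0 when the denominator vanishes is a common convention and an assumption here.

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # convention when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denom
```

    The coefficient ranges from -1 to 1, so a figure such as 94.41% corresponds to a value of 0.9441 on that scale.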

  4. Statistical and optimization methods to expedite neural network training for transient identification

    Reifman, J.; Vitela, E.J.; Lee, J.C.

    1993-01-01

    Two complementary methods, statistical feature selection and nonlinear optimization through conjugate gradients, are used to expedite feedforward neural network training. Statistical feature selection techniques in the form of linear correlation coefficients and information-theoretic entropy are used to eliminate redundant and non-informative plant parameters to reduce the size of the network. The method of conjugate gradients is used to accelerate the network training convergence and to systematically calculate the learning and momentum constants at each iteration. The proposed techniques are compared with the backpropagation algorithm using the entire set of plant parameters in the training of neural networks to identify transients simulated with the Midland Nuclear Power Plant Unit 2 simulator. By using 25% of the plant parameters and the conjugate gradients, a 30-fold reduction in CPU time was obtained without degrading the diagnostic ability of the network.
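    One simple way to use linear correlation coefficients for removing redundant inputs is to drop any parameter that is almost perfectly correlated with one already kept. The sketch below assumes a 0.95 threshold, non-constant columns, and hypothetical parameter names; it is not the procedure from the paper.

```python
import math

def pearson(x, y):
    """Pearson linear correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)  # assumes neither column is constant

def drop_redundant(columns, threshold=0.95):
    """Keep a column only if |r| with every previously kept column is below threshold."""
    kept = []
    for name, col in columns.items():
        if all(abs(pearson(col, columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept

# Toy plant parameters (hypothetical names and values).
signals = {
    "pressure": [1.0, 2.0, 3.0, 4.0],
    "pressure_scaled": [2.0, 4.0, 6.0, 8.0],  # exact linear copy of pressure
    "flow": [1.0, -1.0, 1.0, -1.0],
}
kept = drop_redundant(signals)
```

    The result depends on column order, because the first member of each correlated group is the one that survives.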

  5. Improving lung cancer prognosis assessment by incorporating synthetic minority oversampling technique and score fusion method

    Yan, Shiju; Qian, Wei; Guan, Yubao; Zheng, Bin

    2016-01-01

    Purpose: This study aims to investigate the potential to improve lung cancer recurrence risk prediction performance for stage I NSCLC patients by integrating oversampling, feature selection, and score fusion techniques and to develop an optimal prediction model. Methods: A dataset involving 94 early stage lung cancer patients was retrospectively assembled, which includes CT images, nine clinical and biological (CB) markers, and outcome of 3-yr disease-free survival (DFS) after surgery. Among the 94 patients, 74 remained DFS and 20 had cancer recurrence. Applying a computer-aided detection scheme, tumors were segmented from the CT images and 35 quantitative image (QI) features were initially computed. Two normalized Gaussian radial basis function network (RBFN) based classifiers were built based on QI features and CB markers separately. To improve prediction performance, the authors applied a synthetic minority oversampling technique (SMOTE) and a BestFirst based feature selection method to optimize the classifiers and also tested fusion methods to combine QI and CB based prediction results. Results: Using a leave-one-case-out cross-validation (K-fold cross-validation) method, the computed areas under a receiver operating characteristic curve (AUCs) were 0.716 ± 0.071 and 0.642 ± 0.061, when using the QI and CB based classifiers, respectively. By fusion of the scores generated by the two classifiers, AUC significantly increased to 0.859 ± 0.052 (p < 0.05) with an overall prediction accuracy of 89.4%. Conclusions: This study demonstrated the feasibility of improving prediction performance by integrating SMOTE, feature selection, and score fusion techniques. Combining QI features and CB markers and performing SMOTE prior to feature selection in classifier training enabled the RBFN based classifier to yield improved prediction accuracy.
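    The core idea of SMOTE, creating synthetic minority samples by interpolating between a minority case and one of its nearest minority neighbours, can be sketched as below. The function name, the choice k = 3, the fixed seed, and the toy points are assumptions for illustration; the count of 54 is simply 74 minus 20 and is not a figure stated in the abstract.

```python
import random

def smote(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic points from a list of minority-class tuples."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of base (squared Euclidean distance)
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(p, base)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic

# Toy minority class (hypothetical two-feature points).
recurrence_cases = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_cases = smote(recurrence_cases, n_new=54)
```

    Oversampling of this kind should be applied only to training folds; synthesizing points before splitting would leak information into the held-out cases.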

  6. Improving lung cancer prognosis assessment by incorporating synthetic minority oversampling technique and score fusion method

    Yan, Shiju [School of Medical Instrument and Food Engineering, University of Shanghai for Science and Technology, Shanghai 200093, China and School of Electrical and Computer Engineering, University of Oklahoma, Norman, Oklahoma 73019 (United States); Qian, Wei [Department of Electrical and Computer Engineering, University of Texas, El Paso, Texas 79968 and Sino-Dutch Biomedical and Information Engineering School, Northeastern University, Shenyang 110819 (China); Guan, Yubao [Department of Radiology, Guangzhou Medical University, Guangzhou 510182 (China); Zheng, Bin, E-mail: Bin.Zheng-1@ou.edu [School of Electrical and Computer Engineering, University of Oklahoma, Norman, Oklahoma 73019 (United States)

    2016-06-15

    Purpose: This study aims to investigate the potential to improve lung cancer recurrence risk prediction performance for stage I NSCLC patients by integrating oversampling, feature selection, and score fusion techniques and to develop an optimal prediction model. Methods: A dataset involving 94 early stage lung cancer patients was retrospectively assembled, which includes CT images, nine clinical and biological (CB) markers, and outcome of 3-yr disease-free survival (DFS) after surgery. Among the 94 patients, 74 remained DFS and 20 had cancer recurrence. Applying a computer-aided detection scheme, tumors were segmented from the CT images and 35 quantitative image (QI) features were initially computed. Two normalized Gaussian radial basis function network (RBFN) based classifiers were built based on QI features and CB markers separately. To improve prediction performance, the authors applied a synthetic minority oversampling technique (SMOTE) and a BestFirst based feature selection method to optimize the classifiers and also tested fusion methods to combine QI and CB based prediction results. Results: Using a leave-one-case-out cross-validation (K-fold cross-validation) method, the computed areas under a receiver operating characteristic curve (AUCs) were 0.716 ± 0.071 and 0.642 ± 0.061, when using the QI and CB based classifiers, respectively. By fusion of the scores generated by the two classifiers, AUC significantly increased to 0.859 ± 0.052 (p < 0.05) with an overall prediction accuracy of 89.4%. Conclusions: This study demonstrated the feasibility of improving prediction performance by integrating SMOTE, feature selection, and score fusion techniques. Combining QI features and CB markers and performing SMOTE prior to feature selection in classifier training enabled the RBFN based classifier to yield improved prediction accuracy.

  7. Comparison between ARIMA and DES Methods of Forecasting Population for Housing Demand in Johor

    Alias Ahmad Rizal; Zainun Noor Yasmin; Abdul Rahman Ismail

    2016-01-01

    Forecasting accuracy is a primary criterion in selecting an appropriate method of prediction. Although there are various forecasting methods, not all of them are able to predict with good accuracy. This paper presents an evaluation of two methods of population forecasting for housing demand: Autoregressive Integrated Moving Average (ARIMA) and Double Exponential Smoothing (DES). Both methods principally adopt univariate time series analysis w...
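    Double exponential smoothing can be sketched with Holt's linear-trend recursion, which tracks a level and a trend and extrapolates both. Whether the paper uses Holt's or Brown's variant is not stated in the excerpt, so the variant, the initialization, and the smoothing constants below are assumptions.

```python
def double_exponential_smoothing(series, alpha, beta, horizon):
    """Holt's linear method: return forecasts for 1..horizon steps ahead."""
    if len(series) < 2:
        raise ValueError("need at least two observations")
    level = series[0]
    trend = series[1] - series[0]  # simple initial trend estimate
    for y in series[1:]:
        prev_level = level
        # smooth the level toward the new observation
        level = alpha * y + (1 - alpha) * (prev_level + trend)
        # smooth the trend toward the latest change in level
        trend = beta * (level - prev_level) + (1 - beta) * trend
    return [level + h * trend for h in range(1, horizon + 1)]

# Toy series (hypothetical): a perfectly linear population count.
population = [10.0, 12.0, 14.0, 16.0, 18.0]
forecast = double_exponential_smoothing(population, alpha=0.5, beta=0.5, horizon=2)
```

    On a perfectly linear series the method reproduces the line, which is a quick sanity check before fitting alpha and beta to real data.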

  8. Extensão bivariada do índice de confiabilidade univariado para avaliação da estabilidade fenotípica Bivariate extension of univariate reliability index for evaluating phenotypic stability

    Suzankelly Cunha Arruda de Abreu

    2004-10-01

    The objective of this work was to obtain the theoretical derivation of the bivariate extension of the methods proposed by Annicchiarico (1992) and Annicchiarico et al. (1995) for studying phenotypic stability. Considering assays with genotypes in environments and measurements of two variates, the response of every genotype was standardized for each variate (k = 1, 2). This standardization was made relative to the environment mean as follows: $W_{ijk} = (Y_{ijk} / \bar{Y}_{.jk}) \times 100$, where $W_{ijk}$ is the standardized value of the ith genotype in the jth environment for the kth variate, $Y_{ijk}$ is the observed mean of the ith genotype in the jth environment for the kth variate, and $\bar{Y}_{.jk}$ is the mean of all genotypes for the jth environment and kth variate. From the standardized values, the mean vector and the covariance matrix of each genotype were estimated. The theoretical derivation of the bivariate extension of Annicchiarico's risk index (Ii) was obtained successfully, and a second risk index based on bivariate probabilities (Prb i) was proposed; the two indices showed close agreement in the results obtained in an illustrative example with melon genotypes.
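    The standardization step, and the univariate index that the paper extends, can be sketched as follows for a single variate. The form of the index (genotype mean of the standardized values minus a normal quantile times their standard deviation) and the default alpha = 0.25 are assumptions based on the usual presentation of Annicchiarico's method; the bivariate extension itself is not reproduced, and the yield figures are invented.

```python
from statistics import NormalDist, mean, stdev

def standardize(yields):
    """Express each genotype's yield as a percentage of the environment mean.

    yields -- {genotype: [value in environment 0, environment 1, ...]}
    """
    n_env = len(next(iter(yields.values())))
    env_means = [mean(y[j] for y in yields.values()) for j in range(n_env)]
    return {
        g: [100.0 * y[j] / env_means[j] for j in range(n_env)]
        for g, y in yields.items()
    }

def reliability_index(w_values, alpha=0.25):
    """Univariate index: mean of W minus z_(1-alpha) times its standard deviation."""
    z = NormalDist().inv_cdf(1 - alpha)
    return mean(w_values) - z * stdev(w_values)

# Toy trial (hypothetical): two genotypes in two environments.
trial = {"G1": [10.0, 20.0], "G2": [30.0, 40.0]}
w = standardize(trial)
indices = {g: reliability_index(vals) for g, vals in w.items()}
```

    A genotype with a high mean but erratic performance across environments is penalized by the standard-deviation term, which is why the index is read as a measure of risk.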