Bayesian Variable Selection in Spatial Autoregressive Models
Jesus Crespo Cuaresma; Philipp Piribauer
2015-01-01
This paper compares the performance of Bayesian variable selection approaches for spatial autoregressive models. We present two alternative approaches that can be implemented straightforwardly using Gibbs sampling and that allow us to deal with model uncertainty in spatial autoregressive models in a flexible and computationally efficient way. In a simulation study we show that the variable selection approaches tend to outperform existing Bayesian model averaging tech...
Bayesian variable selection with spherically symmetric priors
De Kock, M B
2014-01-01
We propose that Bayesian variable selection for linear parametrisations with Gaussian iid likelihoods be based on the spherical symmetry of the diagonalised parameter space. This reduces the multidimensional parameter space problem to one dimension without the need for conjugate priors. Combining this likelihood with what we call the r-prior results in a framework in which we can derive closed forms for the evidence, posterior and characteristic function for four different r-priors, including the hyper-g prior and the Zellner-Siow prior, which are shown to be special cases of our r-prior. Two scenarios of a single variable dispersion parameter and of fixed dispersion are studied separately, and asymptotic forms comparable to the traditional information criteria are derived. In a simple simulation exercise, we find that model comparison based on our uniform r-prior appears to fare better than the current model comparison schemes.
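The closed-form evidence mentioned above can be illustrated for the simplest special case the entry cites, Zellner's g-prior (a sketch under that assumption; the paper's r-prior framework is more general). With a flat prior on the intercept and scale, the log marginal likelihood of a model with k centered predictors depends only on its R²:

```python
import numpy as np

def log_evidence_g(y, X, g):
    """Log marginal likelihood, up to a model-independent constant, of a
    linear model under Zellner's g-prior (flat priors on intercept and
    log sigma): (n-1-k)/2 * log(1+g) - (n-1)/2 * log(1 + g*(1 - R^2))."""
    n = len(y)
    yc = y - y.mean()
    if X is None or X.shape[1] == 0:
        r2, k = 0.0, 0                      # null model
    else:
        Xc = X - X.mean(axis=0)
        beta, *_ = np.linalg.lstsq(Xc, yc, rcond=None)
        resid = yc - Xc @ beta
        r2 = 1.0 - (resid @ resid) / (yc @ yc)
        k = X.shape[1]
    return 0.5*(n - 1 - k)*np.log(1 + g) - 0.5*(n - 1)*np.log(1 + g*(1 - r2))

rng = np.random.default_rng(0)
n = 100
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 2.0*x1 + rng.normal(size=n)             # only x1 carries signal
g = float(n)                                # unit-information choice of g
lml_signal = log_evidence_g(y, x1[:, None], g)
lml_noise = log_evidence_g(y, x2[:, None], g)
print(lml_signal > lml_noise)               # True: the evidence picks x1
```

The hyper-g and Zellner-Siow priors mix over g rather than fixing it; the formula above is the conditional evidence those mixtures integrate.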
A Bayesian variable selection procedure for ranking overlapping gene sets
Skarman, Axel; Mahdi Shariati, Mohammad; Janss, Luc;
2012-01-01
... described. In many cases, these methods test one gene set at a time, and therefore do not consider overlaps among the pathways. Here, we present a Bayesian variable selection method to prioritize gene sets that overcomes this limitation by considering all gene sets simultaneously. We applied Bayesian variable selection to differential expression to prioritize the molecular and genetic pathways involved in the responses to Escherichia coli infection in Danish Holstein cows. Results: We used a Bayesian variable selection method to prioritize Kyoto Encyclopedia of Genes and Genomes pathways. We used our ... data to study how the variable selection method was affected by overlaps among the pathways. In addition, we compared our approach to another that ignores the overlaps, and studied the differences in the prioritization. The variable selection method was robust to a change in prior probability...
Bayesian Variable Selection in Cost-Effectiveness Analysis
Miguel A. Negrín
2010-04-01
Linear regression models are often used to represent the cost and effectiveness of medical treatment. The covariates used may include sociodemographic variables, such as age, gender or race; clinical variables, such as initial health status, years of treatment or the existence of concomitant illnesses; and a binary variable indicating the treatment received. However, most studies estimate only one model, which usually includes all the covariates. This procedure ignores the question of uncertainty in model selection. In this paper, we examine four alternative Bayesian variable selection methods that have been proposed. In this analysis, we estimate the probability that each covariate is included in the true model, conditional on the data. Variable selection can be useful for estimating incremental effectiveness and incremental cost, through Bayesian model averaging, as well as for subgroup analysis.
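Posterior inclusion probabilities of the kind estimated above can be sketched by enumerating all candidate models and weighting them; here BIC weights stand in for the paper's fully Bayesian schemes (an assumption made for brevity):

```python
import itertools
import numpy as np

def bic_linear(y, X):
    """BIC of an OLS fit with intercept (Gaussian likelihood, MLE variance)."""
    n = len(y)
    if X.shape[1] == 0:
        rss = ((y - y.mean())**2).sum(); k = 1
    else:
        X1 = np.column_stack([np.ones(n), X])
        beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
        rss = ((y - X1 @ beta)**2).sum(); k = X1.shape[1]
    return n*np.log(rss/n) + k*np.log(n)

def inclusion_probs(y, X):
    """P(covariate j in model | data), averaging over all 2^p subsets
    with weights proportional to exp(-BIC/2)."""
    p = X.shape[1]
    models = list(itertools.product([0, 1], repeat=p))
    logw = np.array([-0.5*bic_linear(y, X[:, [j for j in range(p) if m[j]]])
                     for m in models])
    w = np.exp(logw - logw.max()); w /= w.sum()
    return np.array([sum(w[i] for i, m in enumerate(models) if m[j])
                     for j in range(p)])

rng = np.random.default_rng(1)
n = 200
X = rng.normal(size=(n, 4))
y = 1.5*X[:, 0] - 1.0*X[:, 1] + rng.normal(size=n)   # x3, x4 irrelevant
pip = inclusion_probs(y, X)
print(np.round(pip, 2))
```

Enumeration is feasible only for small p; the Gibbs-sampling approaches the surveyed papers use scale this idea to larger covariate sets.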
Bayesian variable selection for detecting adaptive genomic differences among populations.
Riebler, Andrea; Held, Leonhard; Stephan, Wolfgang
2008-03-01
We extend an F(st)-based Bayesian hierarchical model, implemented via Markov chain Monte Carlo, for the detection of loci that might be subject to positive selection. This model divides the F(st)-influencing factors into locus-specific effects, population-specific effects, and effects that are specific for the locus in combination with the population. We introduce a Bayesian auxiliary variable for each locus effect to automatically select nonneutral locus effects. As a by-product, the efficiency of the original approach is improved by using a reparameterization of the model. The statistical power of the extended algorithm is assessed with simulated data sets from a Wright-Fisher model with migration. We find that the inclusion of model selection suggests a clear improvement in discrimination as measured by the area under the receiver operating characteristic (ROC) curve. Additionally, we illustrate and discuss the quality of the newly developed method on the basis of an allozyme data set of the fruit fly Drosophila melanogaster and a sequence data set of the wild tomato Solanum chilense. For data sets with small sample sizes, high mutation rates, and/or long sequences, however, methods based on nucleotide statistics should be preferred. PMID:18245358
Bayesian Biclustering on Discrete Data: Variable Selection Methods
Guo, Lei
2013-01-01
Biclustering is a technique for clustering rows and columns of a data matrix simultaneously. Over the past few years, we have seen its applications in biology-related fields, as well as in many data mining projects. As opposed to classical clustering methods, biclustering groups objects that are similar only on a subset of variables. Many biclustering algorithms on continuous data have emerged over the last decade. In this dissertation, we will focus on two Bayesian biclustering algorithms we...
Bayesian Variable Selection for Logistic Models Using Auxiliary Mixture Sampling
Tüchler, Regina
2006-01-01
The paper presents a Markov chain Monte Carlo algorithm for both variable and covariance selection in the context of logistic mixed effects models. This algorithm allows us to sample solely from standard densities, with no additional tuning being needed. We apply a stochastic search variable selection approach to select explanatory variables as well as to determine the structure of the random effects covariance matrix. For logistic mixed effects models prior determination of explanatory variables and ...
Steady-state priors and Bayesian variable selection in VAR forecasting
Louzis, Dimitrios P.
2015-01-01
This study proposes methods for estimating Bayesian vector autoregressions (VARs) with an automatic variable selection and an informative prior on the unconditional mean or steady-state of the system. We show that extant Gibbs sampling methods for Bayesian variable selection can be efficiently extended to incorporate prior beliefs on the steady-state of the economy. Empirical analysis, based on three major US macroeconomic time series, indicates that the out-of-sample forecasting accuracy of ...
Bayesian variable selection and data integration for biological regulatory networks
Jensen, Shane T; Chen, Guang; Stoeckert, Jr, Christian J.
2007-01-01
A substantial focus of research in molecular biology is gene regulatory networks: the set of transcription factors and target genes that control the involvement of different biological processes in living cells. Previous statistical approaches for identifying gene regulatory networks have used gene expression data, ChIP binding data, or promoter sequence data, but each of these resources provides only partial information. We present a Bayesian hierarchical model that integrates all three dat...
Lu, Zhaohua; Zhu, Hongtu; Knickmeyer, Rebecca C.; Sullivan, Patrick F.; Williams, Stephanie N.; Zou, Fei
2015-01-01
The power of genome-wide association studies (GWAS) for mapping complex traits with single SNP analysis may be undermined by modest SNP effect sizes, unobserved causal SNPs, correlation among adjacent SNPs, and SNP-SNP interactions. Alternative approaches for testing the association between a single SNP-set and individual phenotypes have been shown to be promising for improving the power of GWAS. We propose a Bayesian latent variable selection (BLVS) method to simultaneously model the joint a...
Bayesian Factor Analysis as a Variable-Selection Problem: Alternative Priors and Consequences.
Lu, Zhao-Hua; Chow, Sy-Miin; Loken, Eric
2016-01-01
Factor analysis is a popular statistical technique for multivariate data analysis. Developments in the structural equation modeling framework have enabled the use of hybrid confirmatory/exploratory approaches in which factor-loading structures can be explored relatively flexibly within a confirmatory factor analysis (CFA) framework. Recently, Muthén & Asparouhov proposed a Bayesian structural equation modeling (BSEM) approach to explore the presence of cross loadings in CFA models. We show that the issue of determining factor-loading patterns may be formulated as a Bayesian variable selection problem in which Muthén and Asparouhov's approach can be regarded as a BSEM approach with ridge regression prior (BSEM-RP). We propose another Bayesian approach, denoted herein as the Bayesian structural equation modeling with spike-and-slab prior (BSEM-SSP), which serves as a one-stage alternative to the BSEM-RP. We review the theoretical advantages and disadvantages of both approaches and compare their empirical performance relative to two modification indices-based approaches and exploratory factor analysis with target rotation. A teacher stress scale data set is used to demonstrate our approach. PMID:27314566
Locating disease genes using Bayesian variable selection with the Haseman-Elston method
He Qimei
2003-12-01
Background: We applied stochastic search variable selection (SSVS), a Bayesian model selection method, to the simulated data of Genetic Analysis Workshop 13. We used SSVS with the revisited Haseman-Elston method to find the markers linked to the loci determining change in cholesterol over time. To study gene-gene interaction (epistasis) and gene-environment interaction, we adopted prior structures that incorporate the relationship among the predictors. This allows SSVS to search the model space more efficiently and avoid the less likely models. Results: In applying SSVS, instead of looking at the posterior distribution of each of the candidate models, which is sensitive to the setting of the prior, we ranked the candidate variables (markers) according to their marginal posterior probability, which was shown to be more robust to the prior. Compared with traditional methods that consider one marker at a time, our method considers all markers simultaneously and obtains more favorable results. Conclusions: We showed that SSVS is a powerful method for identifying linked markers using the Haseman-Elston method, even for weak effects. SSVS is very effective because it does a smart search over the entire model space.
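The SSVS machinery used above can be sketched for a plain linear model with a point-mass spike and Gaussian slab (a simplification: noise variance is held fixed here, and the paper pairs SSVS with Haseman-Elston regression rather than ordinary regression):

```python
import numpy as np

def ssvs_gibbs(y, X, tau2=10.0, pi=0.5, sigma2=1.0, iters=2000, seed=0):
    """Spike-and-slab Gibbs sampler: beta_j = gamma_j * b_j with
    b_j ~ N(0, tau2) and gamma_j ~ Bernoulli(pi); noise variance known.
    Returns posterior inclusion probabilities from the second half."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    beta = np.zeros(p)
    gamma = np.zeros(p, dtype=int)
    incl = np.zeros(p)
    xtx = (X**2).sum(axis=0)
    for it in range(iters):
        for j in range(p):
            r = y - X @ beta + X[:, j]*beta[j]       # residual excluding j
            v = 1.0/(xtx[j]/sigma2 + 1.0/tau2)
            m = v * (X[:, j] @ r)/sigma2
            log_bf = 0.5*np.log(v/tau2) + 0.5*m*m/v  # Bayes factor for inclusion
            prob = 1.0/(1.0 + (1 - pi)/pi*np.exp(-log_bf))
            gamma[j] = rng.random() < prob
            beta[j] = rng.normal(m, np.sqrt(v)) if gamma[j] else 0.0
        if it >= iters//2:
            incl += gamma
    return incl/(iters - iters//2)

rng = np.random.default_rng(42)
X = rng.normal(size=(150, 6))
y = 2.0*X[:, 0] - 1.5*X[:, 1] + rng.normal(size=150)
pip = ssvs_gibbs(y, X)
print(np.round(pip, 2))   # high for the first two markers, low otherwise
```

Ranking variables by these marginal inclusion probabilities, rather than by whole-model posteriors, is exactly the robustness point the abstract makes.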
A bayesian integrative model for genetical genomics with spatially informed variable selection.
Cassese, Alberto; Guindani, Michele; Vannucci, Marina
2014-01-01
We consider a Bayesian hierarchical model for the integration of gene expression levels with comparative genomic hybridization (CGH) array measurements collected on the same subjects. The approach defines a measurement error model that relates the gene expression levels to latent copy number states. In turn, the latent states are related to the observed surrogate CGH measurements via a hidden Markov model. The model further incorporates variable selection with a spatial prior based on a probit link that exploits dependencies across adjacent DNA segments. Posterior inference is carried out via Markov chain Monte Carlo stochastic search techniques. We study the performance of the model in simulations and show better results than those achieved with recently proposed alternative priors. We also show an application to data from a genomic study on lung squamous cell carcinoma, where we identify potential candidates of associations between copy number variants and the transcriptional activity of target genes. Gene ontology (GO) analyses of our findings reveal enrichments in genes that code for proteins involved in cancer. Our model also identifies a number of potential candidate biomarkers for further experimental validation. PMID:25288877
On the use of pseudo-likelihoods in Bayesian variable selection.
Racugno, Walter; Salvan, Alessandra; Ventura, Laura
2005-01-01
In the presence of nuisance parameters, we discuss a one-parameter Bayesian analysis based on a pseudo-likelihood assuming a default prior distribution for the parameter of interest only. Although this way to proceed cannot always be considered as orthodox in the Bayesian perspective, it is of interest to evaluate whether the use of suitable pseudo-likelihoods may be proposed for Bayesian inference. Attention is focused in the context of regression models, in particular on inference about a s...
Bhadra, Anindya; Mallick, Bani K
2013-06-01
We describe a Bayesian technique to (a) perform a sparse joint selection of significant predictor variables and significant inverse covariance matrix elements of the response variables in a high-dimensional linear Gaussian sparse seemingly unrelated regression (SSUR) setting and (b) perform an association analysis between the high-dimensional sets of predictors and responses in such a setting. To search the high-dimensional model space, where both the number of predictors and the number of possibly correlated responses can be larger than the sample size, we demonstrate that a marginalization-based collapsed Gibbs sampler, in combination with spike and slab type of priors, offers a computationally feasible and efficient solution. As an example, we apply our method to an expression quantitative trait loci (eQTL) analysis on publicly available single nucleotide polymorphism (SNP) and gene expression data for humans where the primary interest lies in finding the significant associations between the sets of SNPs and possibly correlated genetic transcripts. Our method also allows for inference on the sparse interaction network of the transcripts (response variables) after accounting for the effect of the SNPs (predictor variables). We exploit properties of Gaussian graphical models to make statements concerning conditional independence of the responses. Our method compares favorably to existing Bayesian approaches developed for this purpose. PMID:23607608
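The conditional-independence statements the entry above draws from Gaussian graphical models rest on a standard fact: responses i and j are conditionally independent given the rest exactly when the (i, j) entry of the precision (inverse covariance) matrix is zero. A minimal numerical check of that fact:

```python
import numpy as np

# Chain structure 0 -- 1 -- 2: no direct 0-2 edge, so Omega[0, 2] = 0,
# even though variables 0 and 2 are marginally correlated through 1.
Omega = np.array([[2.0, 0.8, 0.0],
                  [0.8, 2.0, 0.7],
                  [0.0, 0.7, 2.0]])
Sigma = np.linalg.inv(Omega)

rng = np.random.default_rng(3)
Z = rng.multivariate_normal(np.zeros(3), Sigma, size=50000)

S_hat = np.cov(Z.T)                     # marginal covariance: dense
Omega_hat = np.linalg.inv(S_hat)        # precision: recovers the zero
print(np.round(Omega_hat, 1))
```

A sparse precision estimate therefore yields the "sparse interaction network of the transcripts" directly; the paper's spike-and-slab priors put the selection on those precision entries.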
Zhang, Linlin; Guindani, Michele; Versace, Francesco; Vannucci, Marina
2014-07-15
In this paper we present a novel wavelet-based Bayesian nonparametric regression model for the analysis of functional magnetic resonance imaging (fMRI) data. Our goal is to provide a joint analytical framework that allows us to detect regions of the brain that exhibit neuronal activity in response to a stimulus and, simultaneously, infer the association, or clustering, of spatially remote voxels that exhibit fMRI time series with similar characteristics. We start by modeling the data with a hemodynamic response function (HRF) with a voxel-dependent shape parameter. We detect regions of the brain activated in response to a given stimulus by using mixture priors with a spike at zero on the coefficients of the regression model. We account for the complex spatial correlation structure of the brain by using a Markov random field (MRF) prior on the parameters guiding the selection of the activated voxels, therefore capturing correlation among nearby voxels. In order to infer association of the voxel time courses, we assume correlated errors, in particular long memory, and exploit the whitening properties of discrete wavelet transforms. Furthermore, we achieve clustering of the voxels by imposing a Dirichlet process (DP) prior on the parameters of the long memory process. For inference, we use Markov Chain Monte Carlo (MCMC) sampling techniques that combine Metropolis-Hastings schemes employed in Bayesian variable selection with sampling algorithms for nonparametric DP models. We explore the performance of the proposed model on simulated data, with both block- and event-related design, and on real fMRI data. PMID:24650600
Berg, van den S.; Calus, M.P.L.; Meuwissen, T.H.E.; Wientjes, Y.C.J.
2015-01-01
Background: The use of information across populations is an attractive approach to increase the accuracy of genomic prediction for numerically small populations. However, accuracies of across population genomic prediction, in which reference and selection individuals are from different population
Bayesian variable order Markov models: Towards Bayesian predictive state representations
C. Dimitrakakis
2009-01-01
We present a Bayesian variable order Markov model that shares many similarities with predictive state representations. The resulting models are compact and much easier to specify and learn than classical predictive state representations. Moreover, we show that they significantly outperform a more st
Integer variables estimation problems: the Bayesian approach
G. Venuti
1997-06-01
In geodesy as well as in geophysics there are a number of examples where some of the unknown parameters are constrained to be integers, while other parameters have a continuous range of possible values. In all such situations the ordinary least-squares principle, with integer variates fixed to the most probable integer value, can lead to paradoxical results, due to the strong non-linearity of the manifold of admissible values. On the contrary, an overall estimation procedure that assigns a posterior distribution to all variables, discrete and continuous, conditional on the observed quantities (the so-called Bayesian approach) has the advantage of correctly weighting the possible errors in choosing different sets of integer values, thus providing a more realistic and stable estimate even of the continuous parameters. In this paper, after a short recall of the basics of Bayesian theory in section 2, we present the natural Bayesian solution to the problem of assessing the estimable signal from noisy observations in section 3 and the Bayesian solution to cycle slip detection and repair for a stream of GPS measurements in section 4. An elementary synthetic example is discussed in section 3 to illustrate the theory presented, and more elaborate, though synthetic, examples are discussed in section 4, where realistic streams of GPS observations, with cycle slips, are simulated and then back-processed.
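The advantage of weighting over integer candidates instead of fixing the most probable one can be sketched with a toy GPS-like setup (all numbers hypothetical): a noisy code observation y_c = x + e_c and a precise phase observation y_p = x + z + e_p with an unknown integer z. The posterior over z is computed in closed form, and the continuous parameter x is estimated as a mixture over integer candidates:

```python
import math

sigma_c, sigma_p = 0.8, 0.05      # code noisy, phase precise
y_c, y_p = 10.3, 12.71            # hypothetical measurements

# With a flat prior on x, the difference d = y_p - y_c ~ N(z, s2)
d = y_p - y_c
s2 = sigma_c**2 + sigma_p**2
cands = range(int(d) - 3, int(d) + 4)
w = {z: math.exp(-(d - z)**2/(2*s2)) for z in cands}
tot = sum(w.values())
post = {z: wz/tot for z, wz in w.items()}          # posterior over z

def xhat(z):
    """Precision-weighted estimate of x conditional on the integer z."""
    p_c, p_p = 1/sigma_c**2, 1/sigma_p**2
    return (p_c*y_c + p_p*(y_p - z))/(p_c + p_p)

# Bayesian estimate of x: mixture over integers, not fix-to-nearest
x_bayes = sum(post[z]*xhat(z) for z in cands)
z_map = max(post, key=post.get)
print(z_map, round(post[z_map], 3), round(x_bayes, 3))
```

When several integers carry comparable posterior mass, the mixture estimate of x stays close to the code solution instead of committing to one possibly wrong fix, which is the stability the abstract describes.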
Larson, Nicholas B; McDonnell, Shannon; Albright, Lisa Cannon; Teerlink, Craig; Stanford, Janet; Ostrander, Elaine A; Isaacs, William B; Xu, Jianfeng; Cooney, Kathleen A; Lange, Ethan; Schleutker, Johanna; Carpten, John D; Powell, Isaac; Bailey-Wilson, Joan; Cussenot, Olivier; Cancel-Tassin, Geraldine; Giles, Graham; MacInnis, Robert; Maier, Christiane; Whittemore, Alice S; Hsieh, Chih-Lin; Wiklund, Fredrik; Catolona, William J; Foulkes, William; Mandal, Diptasri; Eeles, Rosalind; Kote-Jarai, Zsofia; Ackerman, Michael J; Olson, Timothy M; Klein, Christopher J; Thibodeau, Stephen N; Schaid, Daniel J
2016-09-01
Rare variants (RVs) have been shown to be significant contributors to complex disease risk. By definition, these variants have very low minor allele frequencies and traditional single-marker methods for statistical analysis are underpowered for typical sequencing study sample sizes. Multimarker burden-type approaches attempt to identify aggregation of RVs across case-control status by analyzing relatively small partitions of the genome, such as genes. However, it is generally the case that the aggregative measure would be a mixture of causal and neutral variants, and these omnibus tests do not directly provide any indication of which RVs may be driving a given association. Recently, Bayesian variable selection approaches have been proposed to identify RV associations from a large set of RVs under consideration. Although these approaches have been shown to be powerful at detecting associations at the RV level, there are often computational limitations on the total quantity of RVs under consideration and compromises are necessary for large-scale application. Here, we propose a computationally efficient alternative formulation of this method using a probit regression approach specifically capable of simultaneously analyzing hundreds to thousands of RVs. We evaluate our approach to detect causal variation on simulated data and examine sensitivity and specificity in instances of high RV dimensionality as well as apply it to pathway-level RV analysis results from a prostate cancer (PC) risk case-control sequencing study. Finally, we discuss potential extensions and future directions of this work. PMID:27312771
Bayesian Model Averaging in the Instrumental Variable Regression Model
Gary Koop; Robert Leon Gonzalez; Rodney Strachan
2011-01-01
This paper considers the instrumental variable regression model when there is uncertainty about the set of instruments, exogeneity restrictions, the validity of identifying restrictions and the set of exogenous regressors. This uncertainty can result in a huge number of models. To avoid statistical problems associated with standard model selection procedures, we develop a reversible jump Markov chain Monte Carlo algorithm that allows us to do Bayesian model averaging. The algorithm is very fl...
Bayesian auxiliary variable models for binary and multinomial regression
Holmes, C. C.; Held, L.
2006-01-01
In this paper we discuss auxiliary variable approaches to Bayesian binary and multinomial regression. These approaches are ideally suited to automated Markov chain Monte Carlo simulation. In the first part we describe a simple technique using joint updating that improves the performance of the conventional probit regression algorithm. In the second part we discuss auxiliary variable methods for inference in Bayesian logistic regression, including covariate set uncertainty. Fina...
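The probit half of the auxiliary-variable idea can be sketched with the classic Albert-Chib construction (a probit-only sketch with a flat prior; the paper's joint updating and logistic/multinomial extensions go further): introduce latents z_i ~ N(x_i'beta, 1) observed only through their sign, so both Gibbs steps are standard draws.

```python
import numpy as np

def probit_gibbs(y, X, iters=1000, seed=0):
    """Auxiliary-variable Gibbs sampler for probit regression, flat prior
    on beta: z_i ~ N(x_i'beta, 1) truncated to the side given by y_i,
    then beta | z is Gaussian. Returns the posterior mean of beta."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    L = np.linalg.cholesky(XtX_inv)
    beta = np.zeros(p)
    draws = []
    for it in range(iters):
        mu = X @ beta
        z = np.empty(n)
        for i in range(n):                    # truncated normal by rejection
            while True:
                cand = rng.normal(mu[i], 1.0)
                if (cand > 0) == bool(y[i]):
                    z[i] = cand
                    break
        m = XtX_inv @ (X.T @ z)               # beta | z ~ N(m, XtX_inv)
        beta = m + L @ rng.normal(size=p)
        if it >= iters // 2:
            draws.append(beta.copy())
    return np.mean(draws, axis=0)

rng = np.random.default_rng(7)
n = 400
x = rng.uniform(-2, 2, size=n)
X = np.column_stack([np.ones(n), x])
y = (X @ np.array([-0.3, 1.0]) + rng.normal(size=n) > 0).astype(int)
est = probit_gibbs(y, X)
print(np.round(est, 2))                       # close to (-0.3, 1.0)
```

Rejection sampling of the truncated normals is the simplest choice and is adequate here; inverse-CDF samplers are preferable when the truncation region has small probability.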
Bayesian site selection for fast Gaussian process regression
Pourhabib, Arash
2014-02-05
Gaussian Process (GP) regression is a popular method in the field of machine learning and computer experiment designs; however, its ability to handle large data sets is hindered by the computational difficulty in inverting a large covariance matrix. Likelihood approximation methods were developed as a fast GP approximation, thereby reducing the computation cost of GP regression by utilizing a much smaller set of unobserved latent variables called pseudo points. This article reports a further improvement to the likelihood approximation methods by simultaneously deciding both the number and locations of the pseudo points. The proposed approach is a Bayesian site selection method where both the number and locations of the pseudo inputs are parameters in the model, and the Bayesian model is solved using a reversible jump Markov chain Monte Carlo technique. Through a number of simulated and real data sets, it is demonstrated that with appropriate priors chosen, the Bayesian site selection method can produce a good balance between computation time and prediction accuracy: it is fast enough to handle large data sets that a full GP is unable to handle, and it improves, quite often remarkably, the prediction accuracy, compared with the existing likelihood approximations. © 2014 Taylor and Francis Group, LLC.
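The likelihood-approximation family the article improves on can be illustrated with its simplest member, the subset-of-regressors (Nystrom-type) approximation, where prediction is routed through m pseudo inputs instead of all n data points (here the pseudo inputs are fixed on a grid; the article's contribution is to infer their number and locations):

```python
import numpy as np

def rbf(a, b, ell=1.0):
    """Squared-exponential kernel matrix between 1-D input vectors."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5*(d/ell)**2)

def sor_predict(x, y, u, xs, ell=1.0, noise=0.1):
    """Subset-of-regressors GP predictive mean at xs, conditioning on the
    n training points only through the m pseudo inputs u: cost is
    O(n m^2) instead of the O(n^3) of a full GP."""
    Kuu = rbf(u, u, ell) + 1e-8*np.eye(len(u))
    Kuf = rbf(u, x, ell)
    Ksu = rbf(xs, u, ell)
    A = Kuu + Kuf @ Kuf.T / noise**2
    return Ksu @ np.linalg.solve(A, Kuf @ y) / noise**2

rng = np.random.default_rng(5)
x = rng.uniform(0, 10, size=500)
y = np.sin(x) + 0.1*rng.normal(size=500)
u = np.linspace(0, 10, 15)                 # 15 pseudo inputs vs 500 data
xs = np.linspace(0.5, 9.5, 50)
mean = sor_predict(x, y, u, xs)
rmse = np.sqrt(np.mean((mean - np.sin(xs))**2))
print(round(float(rmse), 3))
```

With well-placed pseudo inputs the approximation tracks the underlying function closely; choosing their number and locations badly is exactly the failure mode the Bayesian site-selection method addresses by treating both as model parameters.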
Improving randomness characterization through Bayesian model selection
Díaz-Hernández R., Rafael; Angulo Martínez, Alí M.; U'Ren, Alfred B.; Hirsch, Jorge G.; Marsili, Matteo; Pérez Castillo, Isaac
2016-01-01
Nowadays random number generation plays an essential role in technology with important applications in areas ranging from cryptography, which lies at the core of current communication protocols, to Monte Carlo methods, and other probabilistic algorithms. In this context, a crucial scientific endeavour is to develop effective methods that allow the characterization of random number generators. However, commonly employed methods either lack formality (e.g. the NIST test suite), or are inapplicable in principle (e.g. the characterization derived from the Algorithmic Theory of Information (ATI)). In this letter we present a novel method based on Bayesian model selection, which is both rigorous and effective, for characterizing randomness in a bit sequence. We derive analytic expressions for a model's likelihood which is then used to compute its posterior probability distribution. Our method proves to be more rigorous than NIST's suite and the Borel-Normality criterion and its implementation is straightforward. We...
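The analytic model likelihoods the letter relies on can be sketched for the two simplest models of a bit sequence, iid Bernoulli versus order-1 Markov, both with uniform Beta(1,1) priors, so each evidence is a ratio of factorials (this is an illustrative pair of models, not necessarily the letter's exact model class):

```python
from math import lgamma, log
import random

def log_ev_iid(bits):
    """Log evidence of an iid Bernoulli model, uniform Beta(1,1) prior:
    p(x) = n1! * n0! / (n+1)!, computed in log space."""
    n1 = sum(bits); n0 = len(bits) - n1
    return lgamma(n1 + 1) + lgamma(n0 + 1) - lgamma(n1 + n0 + 2)

def log_ev_markov1(bits):
    """Log evidence of an order-1 Markov model: uniform prior on the first
    bit, independent Beta(1,1) priors on the two transition rows."""
    counts = {(a, b): 0 for a in (0, 1) for b in (0, 1)}
    for a, b in zip(bits, bits[1:]):
        counts[(a, b)] += 1
    le = -log(2)                                  # the first bit
    for a in (0, 1):
        n1, n0 = counts[(a, 1)], counts[(a, 0)]
        le += lgamma(n1 + 1) + lgamma(n0 + 1) - lgamma(n1 + n0 + 2)
    return le

# A perfectly 'balanced' but obviously non-random sequence: 0101...
alternating = [i % 2 for i in range(200)]
print(log_ev_markov1(alternating) > log_ev_iid(alternating))  # True

random.seed(0)
fair = [random.getrandbits(1) for _ in range(200)]
print(log_ev_iid(fair), log_ev_markov1(fair))    # simpler model usually wins
```

A frequency test alone cannot reject the alternating sequence, while Bayesian model comparison flags its structure immediately; for genuinely random bits the extra Markov parameters are typically penalized away.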
Bayesian item selection in constrained adaptive testing using shadow tests
Bernard P. Veldkamp
2010-01-01
Application of Bayesian item selection criteria in computerized adaptive testing might result in improvement of bias and MSE of the ability estimates. The question remains how to apply Bayesian item selection criteria in the context of constrained adaptive testing, where large numbers of specifications have to be taken into account in the item selection process. The Shadow Test Approach is a general purpose algorithm for administering constrained CAT. In this paper it is shown how the approac...
Species selection on variability.
Lloyd, E. A.; Gould, S. J.
1993-01-01
Most analyses of species selection require emergent, as opposed to aggregate, characters at the species level. This "emergent character" approach tends to focus on the search for adaptations at the species level. Such an approach seems to banish the most potent evolutionary property of populations--variability itself--from arguments about species selection (for variation is an aggregate character). We wish, instead, to extend the legitimate domain of species selection to aggregate characters....
Lin, Lin; Chan, Cliburn; West, Mike
2016-01-01
We discuss the evaluation of subsets of variables for the discriminative evidence they provide in multivariate mixture modeling for classification. The novel development of Bayesian classification analysis presented is partly motivated by problems of design and selection of variables in biomolecular studies, particularly involving widely used assays of large-scale single-cell data generated using flow cytometry technology. For such studies and for mixture modeling generally, we define discriminative analysis that overlays fitted mixture models using a natural measure of concordance between mixture component densities, and define an effective and computationally feasible method for assessing and prioritizing subsets of variables according to their roles in discrimination of one or more mixture components. We relate the new discriminative information measures to Bayesian classification probabilities and error rates, and exemplify their use in Bayesian analysis of Dirichlet process mixture models fitted via Markov chain Monte Carlo methods as well as using a novel Bayesian expectation-maximization algorithm. We present a series of theoretical and simulated data examples to fix concepts and exhibit the utility of the approach, and compare with prior approaches. We demonstrate application in the context of automatic classification and discriminative variable selection in high-throughput systems biology using large flow cytometry datasets. PMID:26040910
Dissecting Magnetar Variability with Bayesian Hierarchical Models
Huppenkothen, Daniela; Brewer, Brendon J.; Hogg, David W.; Murray, Iain; Frean, Marcus; Elenbaas, Chris; Watts, Anna L.; Levin, Yuri; van der Horst, Alexander J.; Kouveliotou, Chryssa
2015-09-01
Neutron stars are a prime laboratory for testing physical processes under conditions of strong gravity, high density, and extreme magnetic fields. Among the zoo of neutron star phenomena, magnetars stand out for their bursting behavior, ranging from extremely bright, rare giant flares to numerous, less energetic recurrent bursts. The exact trigger and emission mechanisms for these bursts are not known; favored models involve either a crust fracture and subsequent energy release into the magnetosphere, or explosive reconnection of magnetic field lines. In the absence of a predictive model, understanding the physical processes responsible for magnetar burst variability is difficult. Here, we develop an empirical model that decomposes magnetar bursts into a superposition of small spike-like features with a simple functional form, where the number of model components is itself part of the inference problem. The cascades of spikes that we model might be formed by avalanches of reconnection, or crust rupture aftershocks. Using Markov Chain Monte Carlo sampling augmented with reversible jumps between models with different numbers of parameters, we characterize the posterior distributions of the model parameters and the number of components per burst. We relate these model parameters to physical quantities in the system, and show for the first time that the variability within a burst does not conform to predictions from ideas of self-organized criticality. We also examine how well the properties of the spikes fit the predictions of simplified cascade models for the different trigger mechanisms.
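The forward model described above, a burst as background plus a superposition of spike-like features, can be sketched with one plausible choice of spike shape, an asymmetric double exponential (the paper's exact parametrization may differ; the number of components is what its reversible-jump sampler infers):

```python
import numpy as np

def spike(t, t0, amp, rise, fall):
    """Spike with exponential rise before t0 and exponential decay after,
    one simple 'spike-like feature with a simple functional form'."""
    return np.where(t < t0,
                    amp*np.exp(-(t0 - t)/rise),
                    amp*np.exp(-(t - t0)/fall))

def burst(t, components, background=1.0):
    """A burst: constant background plus a superposition of spikes."""
    out = np.full_like(t, background, dtype=float)
    for t0, amp, rise, fall in components:
        out += spike(t, t0, amp, rise, fall)
    return out

t = np.linspace(0.0, 1.0, 1001)
comps = [(0.20, 30.0, 0.005, 0.02),
         (0.24, 12.0, 0.004, 0.03),   # an 'aftershock' riding on the tail
         (0.50,  8.0, 0.010, 0.05)]
rate = burst(t, comps)
peak = float(rate[200])               # t[200] == 0.2, the first spike's peak
print(round(peak, 1))                 # 31.0 = background + first amplitude
```

In the inference problem this runs in reverse: given photon counts drawn from such a rate, reversible-jump MCMC explores how many components, and with what parameters, best explain the burst.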
Optimizing the Amount of Models Taken into Consideration During Model Selection in Bayesian Networks
Castelo, J.R.; Siebes, Arno
1999-01-01
Graphical model selection from data embodies several difficulties. Among them, the size of the space of models over which one must carry out model selection is especially challenging, even for a modest number of variables. This becomes more severe for graphical models in which some variables may be responses to others. This is the case for Bayesian networks, which are modeled by acyclic digraphs. In this paper we try to reduce the amount of models taken into...
Two-Stage Bayesian Model Averaging in Endogenous Variable Models.
Lenkoski, Alex; Eicher, Theo S; Raftery, Adrian E
2014-01-01
Economic modeling in the presence of endogeneity is subject to model uncertainty at both the instrument and covariate level. We propose a Two-Stage Bayesian Model Averaging (2SBMA) methodology that extends the Two-Stage Least Squares (2SLS) estimator. By constructing a Two-Stage Unit Information Prior in the endogenous variable model, we are able to efficiently combine established methods for addressing model uncertainty in regression models with the classic technique of 2SLS. To assess the validity of instruments in the 2SBMA context, we develop Bayesian tests of the identification restriction that are based on model averaged posterior predictive p-values. A simulation study showed that 2SBMA has the ability to recover structure in both the instrument and covariate set, and substantially improves the sharpness of resulting coefficient estimates in comparison to 2SLS using the full specification in an automatic fashion. Due to the increased parsimony of the 2SBMA estimate, the Bayesian Sargan test had a power of 50 percent in detecting a violation of the exogeneity assumption, while the method based on 2SLS using the full specification had negligible power. We apply our approach to the problem of development accounting, and find support not only for institutions, but also for geography and integration as development determinants, once both model uncertainty and endogeneity have been jointly addressed. PMID:24223471
Bayesian Model Selection for LISA Pathfinder
Karnesis, Nikolaos; Sopuerta, Carlos F; Gibert, Ferran; Armano, Michele; Audley, Heather; Congedo, Giuseppe; Diepholz, Ingo; Ferraioli, Luigi; Hewitson, Martin; Hueller, Mauro; Korsakova, Natalia; Plagnol, Eric; Vitale, Stefano
2013-01-01
The main goal of the LISA Pathfinder (LPF) mission is to fully characterize the acceleration noise models and to test key technologies for future space-based gravitational-wave observatories similar to the LISA/eLISA concept. The Data Analysis (DA) team has developed complex three-dimensional models of the LISA Technology Package (LTP) experiment on-board LPF. These models are used for simulations, but more importantly, they will be used for parameter estimation purposes during flight operations. One of the tasks of the DA team is to identify the physical effects that contribute significantly to the properties of the instrument noise. A way of approaching this problem is to recover the essential parameters of the LTP which describe the data. Thus, we want to define the simplest model that efficiently explains the observations. To do so, adopting a Bayesian framework, one has to estimate the so-called Bayes Factor between two competing models. In our analysis, we use three different methods to estimate...
Evaluating variable selection methods for diagnosis of myocardial infarction.
Dreiseitl, S; Ohno-Machado, L; Vinterbo, S
1999-01-01
This paper evaluates the variable selection performed by several machine-learning techniques on a myocardial infarction data set. The focus of this work is to determine which of 43 input variables are considered relevant for prediction of myocardial infarction. The algorithms investigated were logistic regression (with stepwise, forward, and backward selection), backpropagation for multilayer perceptrons (input relevance determination), Bayesian neural networks (automatic relevance determination), and rough sets. An independent method (self-organizing maps) was then used to evaluate and visualize the different subsets of predictor variables. Results show good agreement on some predictors, but also variability among different methods; only one variable was selected by all models. PMID:10566358
Optimal speech motor control and token-to-token variability: a Bayesian modeling approach.
Patri, Jean-François; Diard, Julien; Perrier, Pascal
2015-12-01
The remarkable capacity of the speech motor system to adapt to various speech conditions is due to an excess of degrees of freedom, which enables producing similar acoustical properties with different sets of control strategies. To explain how the central nervous system selects one of the possible strategies, a common approach, in line with optimal motor control theories, is to model speech motor planning as the solution of an optimality problem based on cost functions. Despite the success of this approach, one of its drawbacks is the intrinsic contradiction between the concept of optimality and the observed experimental intra-speaker token-to-token variability. The present paper proposes an alternative approach by formulating feedforward optimal control in a probabilistic Bayesian modeling framework. This is illustrated by controlling a biomechanical model of the vocal tract for speech production and by comparing it with an existing optimal control model (GEPPETO). The essential elements of this optimal control model are presented first. From them the Bayesian model is constructed in a progressive way. Performance of the Bayesian model is evaluated based on computer simulations and compared to the optimal control model. This approach is shown to be appropriate for solving the speech planning problem while accounting for variability in a principled way. PMID:26497359
Towards Distributed Bayesian Estimation A Short Note on Selected Aspects
Dedecius, Kamil; Sečkárová, Vladimíra
Prague: Institute of Information Theory and Automation, 2011, s. 67-72. ISBN 978-80-903834-6-3. [The 2nd International Workshop on Decision Making with Multiple Imperfect Decision Makers. Held in Conjunction with the 25th Annual Conference on Neural Information Processing Systems (NIPS 2011). Sierra Nevada (ES), 16.12.2011-16.12.2011] R&D Projects: GA ČR GA102/08/0567 Institutional research plan: CEZ:AV0Z10750506 Keywords: efficient estimation * a linear or nonlinear model * distributed estimation * Bayesian decision making Subject RIV: BB - Applied Statistics, Operational Research http://library.utia.cas.cz/separaty/2011/AS/dedecius-towards distributed bayesian estimation a short note on selected aspects.pdf
Bayesian predictive modeling for genomic based personalized treatment selection.
Ma, Junsheng; Stingo, Francesco C; Hobbs, Brian P
2016-06-01
Efforts to personalize medicine in oncology have been limited by reductive characterizations of the intrinsically complex underlying biological phenomena. Future advances in personalized medicine will rely on molecular signatures that derive from synthesis of multifarious interdependent molecular quantities requiring robust quantitative methods. However, highly parameterized statistical models when applied in these settings often require a prohibitively large database and are sensitive to proper characterizations of the treatment-by-covariate interactions, which in practice are difficult to specify and may be limited by generalized linear models. In this article, we present a Bayesian predictive framework that enables the integration of a high-dimensional set of genomic features with clinical responses and treatment histories of historical patients, providing a probabilistic basis for using the clinical and molecular information to personalize therapy for future patients. Our work represents one of the first attempts to define personalized treatment assignment rules based on large-scale genomic data. We use actual gene expression data acquired from The Cancer Genome Atlas in the settings of leukemia and glioma to explore the statistical properties of our proposed Bayesian approach for personalizing treatment selection. The method is shown to yield considerable improvements in predictive accuracy when compared to penalized regression approaches. PMID:26575856
Dynamic sensor action selection with Bayesian decision analysis
Kristensen, Steen; Hansen, Volker; Kondak, Konstantin
1998-10-01
The aim of this work is to create a framework for the dynamic planning of sensor actions for an autonomous mobile robot. The framework uses Bayesian decision analysis, i.e., a decision-theoretic method, to evaluate possible sensor actions and select the most appropriate ones given the available sensors and what is currently known about the state of the world. Since sensing changes the knowledge of the system, and since the current state of the robot (task, position, etc.) determines what knowledge is relevant, the evaluation and selection of sensing actions is an on-going process that effectively determines the behavior of the robot. The framework has been implemented on a real mobile robot and has been proven to be able to control the sensor actions of the system in real time. In current work we are investigating methods to reduce or automatically generate the necessary model information needed by the decision-theoretic method to select the appropriate sensor actions.
Bayes and empirical-Bayes multiplicity adjustment in the variable-selection problem
Scott, James G. (doi:10.1214/10-AOS792)
2010-01-01
This paper studies the multiplicity-correction effect of standard Bayesian variable-selection priors in linear regression. Our first goal is to clarify when, and how, multiplicity correction happens automatically in Bayesian analysis, and to distinguish this correction from the Bayesian Ockham's-razor effect. Our second goal is to contrast empirical-Bayes and fully Bayesian approaches to variable selection through examples, theoretical results and simulations. Considerable differences between the two approaches are found. In particular, we prove a theorem that characterizes a surprising asymptotic discrepancy between fully Bayes and empirical Bayes. This discrepancy arises from a different source than the failure to account for hyperparameter uncertainty in the empirical-Bayes estimate. Indeed, even at the extreme, when the empirical-Bayes estimate converges asymptotically to the true variable-inclusion probability, the potential for a serious difference remains.
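The automatic multiplicity correction discussed in this abstract comes from integrating out a common prior inclusion probability. A minimal sketch, assuming the uniform Beta(1, 1) hyperprior (the notation is mine, not the paper's):

```python
from math import exp, lgamma

def log_beta(a, b):
    # log of the Beta function via log-gamma
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def log_model_prior(k, p, a=1.0, b=1.0):
    """Fully Bayes log-prior of a model that includes k of p candidate
    variables, after integrating a Beta(a, b) hyperprior on the common
    inclusion probability; this is the device behind automatic
    multiplicity correction (a = b = 1 gives the uniform case)."""
    return log_beta(a + k, b + p - k) - log_beta(a, b)

# Prior odds of any single one-variable model against the null model
# fall as the number of candidate variables p grows (here exactly 1/p):
for p in (10, 100):
    odds = exp(log_model_prior(1, p) - log_model_prior(0, p))
    print(p, odds)
```

Adding noise covariates thus automatically deflates each individual model's prior weight, which is exactly the penalty a fixed inclusion probability (the empirical-Bayes plug-in at its simplest) does not adapt in the same way.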
Bayesian modeling of ChIP-chip data using latent variables.
Wu, Mingqi
2009-10-26
BACKGROUND: The ChIP-chip technology has been used in a wide range of biomedical studies, such as identification of human transcription factor binding sites, investigation of DNA methylation, and investigation of histone modifications in animals and plants. Various methods have been proposed in the literature for analyzing ChIP-chip data, such as sliding window methods, hidden Markov model-based methods, and Bayesian methods. Although Bayesian methods can potentially work better than the other two classes of methods, owing to their integrated treatment of the uncertainty in the models and model parameters, the existing Bayesian methods do not perform satisfactorily. They usually require multiple replicates or some extra experimental information to parametrize the model, and long CPU times due to the MCMC simulations involved. RESULTS: In this paper, we propose a Bayesian latent model for ChIP-chip data. The new model mainly differs from the existing Bayesian models, such as the joint deconvolution model, the hierarchical gamma mixture model, and the Bayesian hierarchical model, in two respects. Firstly, it works on the difference between the averaged treatment and control samples. This enables the use of a simple model for the data, which avoids the probe-specific effect and the sample (control/treatment) effect. As a consequence, this enables an efficient MCMC simulation of the posterior distribution of the model, and also makes the model more robust to outliers. Secondly, it models the neighboring dependence of probes by introducing a latent indicator vector. A truncated Poisson prior distribution is assumed for the latent indicator variable, with the rationale being justified at length. CONCLUSION: The Bayesian latent method is successfully applied to real and ten simulated datasets, with comparisons to some of the existing Bayesian methods, hidden Markov model methods, and sliding window methods. The numerical results indicate that the
Feature Selection for Bayesian Evaluation of Trauma Death Risk
Jakaite, L
2008-01-01
In the last year, more than 70,000 people were brought to UK hospitals with serious injuries. Each time, a clinician has to take the patient through an urgent screening procedure to make a reliable decision on the trauma treatment. Typically, such a procedure comprises around 20 tests; however, the condition of a trauma patient remains very difficult to test properly. What happens if these tests are ambiguously interpreted and the information about the severity of the injury is misleading? The mistake in a decision can be fatal: a treatment that is too mild can put a patient at risk of dying from posttraumatic shock, while overtreatment can also cause death. How can we reduce the risk of death caused by unreliable decisions? It has been shown that probabilistic reasoning, based on the Bayesian methodology of averaging over decision models, allows clinicians to evaluate the uncertainty in decision making. Based on this methodology, in this paper we aim at selecting the most important screeni...
Family Background Variables as Instruments for Education in Income Regressions: A Bayesian Analysis
Hoogerheide, Lennart; Block, Joern H.; Thurik, Roy
2012-01-01
The validity of family background variables instrumenting education in income regressions has been much criticized. In this paper, we use data from the 2004 German Socio-Economic Panel and Bayesian analysis to analyze to what degree violations of the strict validity assumption affect the estimation results. We show that, in case of moderate direct…
Errata: A survey of Bayesian predictive methods for model assessment, selection and comparison
Aki Vehtari
2014-03-01
Errata for "A survey of Bayesian predictive methods for model assessment, selection and comparison" by A. Vehtari and J. Ojanen, Statistics Surveys, 6 (2012), 142–228. doi:10.1214/12-SS102.
Mixed Bayesian Networks with Auxiliary Variables for Automatic Speech Recognition
Stephenson, Todd Andrew; Magimai.-Doss, Mathew; Bourlard, Hervé
2001-01-01
Standard hidden Markov models (HMMs), as used in automatic speech recognition (ASR), calculate their emission probabilities by an artificial neural network (ANN) or a Gaussian distribution conditioned on the hidden state variable, considering the emissions independent of any other variable in the model. Recent work showed the benefit of conditioning the emission distributions on a discrete auxiliary variable, which is observed in training and hidden in recognition. Related work has shown the ...
Guillaume Marrelec
The use of mutual information as a similarity measure in agglomerative hierarchical clustering (AHC) raises an important issue: some correction needs to be applied for the dimensionality of variables. In this work, we formulate the decision of merging dependent multivariate normal variables in an AHC procedure as a Bayesian model comparison. We found that the Bayesian formulation naturally shrinks the empirical covariance matrix towards a matrix set a priori (e.g., the identity), provides an automated stopping rule, and corrects for dimensionality using a term that scales up the measure as a function of the dimensionality of the variables. Also, the resulting log Bayes factor is asymptotically proportional to the plug-in estimate of mutual information, with an additive correction for dimensionality in agreement with the Bayesian information criterion. We investigated the behavior of these Bayesian alternatives (in exact and asymptotic forms) to mutual information on simulated and real data. An encouraging result was first derived on simulations: the hierarchical clustering based on the log Bayes factor outperformed off-the-shelf clustering techniques as well as raw and normalized mutual information in terms of classification accuracy. On a toy example, we found that the Bayesian approaches led to results that were similar to those of mutual information clustering techniques, with the advantage of automated thresholding. On real functional magnetic resonance imaging (fMRI) datasets measuring brain activity, it identified clusters consistent with the established outcome of standard procedures. On this application, normalized mutual information had a highly atypical behavior, in the sense that it systematically favored very large clusters. These initial experiments suggest that the proposed Bayesian alternatives to mutual information are a useful new tool for hierarchical clustering.
Kim, Junhan; Chan, Chi-kwan; Medeiros, Lia; Ozel, Feryal; Psaltis, Dimitrios
2016-01-01
The Event Horizon Telescope (EHT) is a millimeter-wavelength, very-long baseline interferometer (VLBI) that is capable of observing black holes with horizon-scale resolution. Early observations have revealed variable horizon-scale emission in the Galactic Center black hole, Sagittarius A* (Sgr A*). Comparing such observations to time-dependent general relativistic magnetohydrodynamic (GRMHD) simulations requires statistical tools that explicitly consider the variability in both the data and the models. We develop here a Bayesian method to compare time-resolved simulation images to variable VLBI data, in order to infer model parameters and perform model comparisons. We use mock EHT data based on GRMHD simulations to explore the robustness of this Bayesian method and contrast it to approaches that do not consider the effects of variability. We find that time-independent models lead to offset values of the inferred parameters with artificially reduced uncertainties. We also apply our method to the early EHT data...
Multi-variable Echo State Network Optimized by Bayesian Regulation for Daily Peak Load Forecasting
Dongxiao Niu
2012-11-01
In this paper, a multi-variable echo state network trained with Bayesian regulation has been developed for short-term load forecasting. In this study, we focus on the generalization ability of a new recurrent network. Therefore, Bayesian regulation and the Levenberg-Marquardt algorithm are adopted to modify the output weights. The model is verified on data from a local power company in south China and its performance is rather satisfactory. Besides, traditional methods are also applied to the same task for comparison. The simulation results lead to the conclusion that the proposed scheme is feasible and has great robustness and a satisfactory capacity for generalization.
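As a sketch of the model class (not the authors' implementation), a minimal echo state network with a regularized linear readout can be written as follows; the reservoir size, the scalings, and the ridge penalty standing in for Bayesian regulation are all illustrative choices:

```python
import numpy as np

def esn_readout(u, y, n_res=50, alpha=1.0, seed=0):
    """Minimal echo state network: fixed random reservoir, linear readout
    fit by ridge regression. The ridge penalty alpha stands in for the
    paper's Bayesian regulation of the output weights; reservoir size,
    scalings and seed are illustrative, not the authors' settings."""
    rng = np.random.default_rng(seed)
    w_in = rng.uniform(-0.5, 0.5, (n_res, u.shape[1]))
    w = rng.uniform(-0.5, 0.5, (n_res, n_res))
    w *= 0.9 / max(abs(np.linalg.eigvals(w)))  # echo state property
    x, states = np.zeros(n_res), []
    for ut in u:                               # drive the reservoir
        x = np.tanh(w_in @ ut + w @ x)
        states.append(x)
    X = np.array(states)
    # MAP readout under a Gaussian prior on the weights (= ridge solution)
    w_out = np.linalg.solve(X.T @ X + alpha * np.eye(n_res), X.T @ y)
    return X @ w_out
```

Only the readout weights are trained; the reservoir stays fixed, which is what makes the training a single regularized least-squares solve rather than full backpropagation through time.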
Modelling of Traffic Flow with Bayesian Autoregressive Model with Variable Partial Forgetting
Dedecius, Kamil; Nagy, Ivan; Hofman, Radek
Praha : ČVUT v Praze, 2011, s. 1-11. [CTU Workshop 2011. Praha (CZ), 01.02.2011-01.02.2011] Grant ostatní: ČVUT v Praze(CZ) SGS 10/099/OHK3/1T/16 Institutional research plan: CEZ:AV0Z10750506 Keywords : Bayesian modelling * traffic modelling Subject RIV: BB - Applied Statistics, Operational Research http://library.utia.cas.cz/separaty/2011/AS/dedecius-modelling of traffic flow with bayesian autoregressive model with variable partial forgetting.pdf
Bayesian approach to inverse problems for functions with a variable-index Besov prior
Jia, Junxiong; Peng, Jigen; Gao, Jinghuai
2016-08-01
The Bayesian approach has been adopted to solve inverse problems that reconstruct a function from noisy observations. Prior measures play a key role in the Bayesian method. Hence, many probability measures have been proposed, among which total variation (TV) is a well-known prior measure that can preserve sharp edges. However, it has two drawbacks, the staircasing effect and a lack of the discretization-invariant property. The variable-index TV prior has been proposed and analyzed in the area of image analysis for the former, and the Besov prior has been employed recently for the latter. To overcome both issues together, in this paper, we present a variable-index Besov prior measure, which is a non-Gaussian measure. Some useful properties of this new prior measure have been proven for functions defined on a torus. We have also generalized Bayesian inverse theory in infinite dimensions for our new setting. Finally, this theory has been applied to integer- and fractional-order backward diffusion problems. To the best of our knowledge, this is the first time that the Bayesian approach has been used for the fractional-order backward diffusion problem, which provides an opportunity to quantify its uncertainties.
Bayesian analysis of variable-order, reversible Markov chains
Bacallado, Sergio
2011-01-01
We define a conjugate prior for the reversible Markov chain of order $r$. The prior arises from a partially exchangeable reinforced random walk, in the same way that the Beta distribution arises from the exchangeable Pólya urn. An extension to variable-order Markov chains is also derived. We show the utility of this prior in testing the order and estimating the parameters of a reversible Markov model.
Characterizing the Aperiodic Variability of 3XMM Sources using Bayesian Blocks
Salvetti, D.; De Luca, A.; Belfiore, A.; Marelli, M.
2016-06-01
I will present the Bayesian blocks algorithm and its application to XMM sources, the statistical properties of the entire 3XMM sample, and a few interesting cases. While XMM-Newton is the instrument best suited for the characterization of X-ray source variability, its most recent catalogue (3XMM) reports light curves only for the brightest sources and excludes periods of background flares from its analysis. One aim of the EXTraS ("Exploring the X-ray Transient and variable Sky") project is the characterization of the aperiodic variability of as many 3XMM sources as possible on time scales shorter than the XMM observation. We adapted the original Bayesian blocks algorithm to account for background contamination, including soft proton flares. In addition, we characterized the short-term aperiodic variability by performing a number of statistical tests on all the Bayesian blocks light curves. The EXTraS catalogue and products will be released to the community in 2017, together with tools that will allow the user to replicate EXTraS results and extend them through the next decade.
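The unmodified Bayesian blocks algorithm that EXTraS builds on can be sketched as a dynamic program over change points, following Scargle et al. (2013). This minimal version handles event (time-tag) data only and, unlike the adaptation described above, does not model background contamination:

```python
import numpy as np

def bayesian_blocks(t, p0=0.05):
    """Minimal Bayesian blocks change-point finder for event data,
    following Scargle et al. (2013); no background treatment."""
    t = np.sort(np.asarray(t, dtype=float))
    n = len(t)
    # Candidate block edges: data range plus midpoints between events
    edges = np.concatenate([[t[0]], 0.5 * (t[1:] + t[:-1]), [t[-1]]])
    block_length = t[-1] - edges
    # Prior penalty per extra block (Scargle's calibration for rate p0)
    ncp_prior = 4.0 - np.log(73.53 * p0 * n ** -0.478)
    best = np.zeros(n)
    last = np.zeros(n, dtype=int)
    for k in range(n):
        width = block_length[:k + 1] - block_length[k + 1]
        count = np.arange(k + 1, 0, -1)       # events per candidate block
        fit = count * (np.log(count) - np.log(width)) - ncp_prior
        fit[1:] += best[:k]
        last[k] = np.argmax(fit)
        best[k] = fit[last[k]]
    # Trace the optimal change points back from the final cell
    cps, i = [], n
    while i > 0:
        cps.append(last[i - 1])
        i = last[i - 1]
    return edges[np.array(cps[::-1])]
```

On event data with a sharp rate change, the recovered block edge sits near the true change time; the EXTraS variant adds background terms to this per-block fitness function.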
Vrieze, Scott I.
2012-01-01
This article reviews the Akaike information criterion (AIC) and the Bayesian information criterion (BIC) in model selection and the appraisal of psychological theory. The focus is on latent variable models, given their growing use in theory testing and construction. Theoretical statistical results in regression are discussed, and more important…
Bayesian natural selection and the evolution of perceptual systems.
Geisler, Wilson S.; Diehl, Randy L.
2002-01-01
In recent years, there has been much interest in characterizing statistical properties of natural stimuli in order to better understand the design of perceptual systems. A fruitful approach has been to compare the processing of natural stimuli in real perceptual systems with that of ideal observers derived within the framework of Bayesian statistical decision theory. While this form of optimization theory has provided a deeper understanding of the information contained in natural stimuli as w...
Implementation of upper limit calculation for a Poisson variable by Bayesian approach
ZHU Yong-Sheng
2008-01-01
The calculation of a Bayesian confidence upper limit for a Poisson variable including both signal and background, with and without systematic uncertainties, has been formulated. A Fortran 77 routine, BPULE, has been developed to implement the calculation. The routine can account for systematic uncertainties in the background expectation and signal efficiency. The systematic uncertainties may be separately parameterized by a Gaussian, log-Gaussian or flat probability density function (pdf). Some technical details of BPULE are discussed.
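For the simplest case this covers, a known mean background, no systematic uncertainties, and a flat prior on the signal, the upper limit can be reproduced numerically; this sketch is not the routine's own algorithm:

```python
from math import exp, factorial

def upper_limit(n_obs, b, cl=0.90):
    """Bayesian upper limit on a Poisson signal s with known mean
    background b, observed count n_obs, and a flat prior on s >= 0
    (the no-systematics special case of what BPULE computes)."""
    def tail(s):
        # Posterior P(S > s | n_obs), via Helene's closed form
        num = sum((s + b) ** k / factorial(k) for k in range(n_obs + 1))
        den = sum(b ** k / factorial(k) for k in range(n_obs + 1))
        return exp(-s) * num / den
    lo, hi = 0.0, 100.0            # bisect the monotone tail probability
    while hi - lo > 1e-9:
        mid = 0.5 * (lo + hi)
        if tail(mid) > 1.0 - cl:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

print(round(upper_limit(0, 0.0), 3))  # 2.303, i.e. -ln(0.10)
```

Raising the known background lowers the limit on the signal, as expected from the posterior being conditioned on the total observed count.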
D. Das
2014-04-01
Climate projections simulated by Global Climate Models (GCMs) are often used for assessing the impacts of climate change. However, the relatively coarse resolution of GCM outputs often precludes their application to accurately assessing the effects of climate change on finer, regional-scale phenomena. Statistical downscaling of climate variables from coarser to finer regional scales is therefore often performed for regional climate projections. Statistical downscaling (SD) is based on the understanding that the regional climate is influenced by two factors: the large-scale climatic state and regional or local features. A transfer-function approach to SD involves learning a regression model which relates these features (predictors) to a climatic variable of interest (the predictand), based on past observations. However, a single regression model is often not sufficient to describe complex dynamic relationships between the predictors and the predictand. We focus on the covariate selection part of the transfer-function approach and propose a nonparametric Bayesian mixture of sparse regression models based on the Dirichlet Process (DP), for simultaneous clustering and discovery of covariates within the clusters while automatically finding the number of clusters. Sparse linear models are parsimonious and hence relatively more generalizable than non-sparse alternatives, and lend themselves to domain-relevant interpretation. Applications to synthetic data demonstrate the value of the new approach, and preliminary results on feature selection for statistical downscaling show that our method can lead to new insights.
Bayesian data fusion for spatial prediction of categorical variables in environmental sciences
Gengler, Sarah; Bogaert, Patrick [Earth and Life Institute, Environmental Sciences, Université catholique de Louvain, Croix du Sud 2/L7.05.16, B-1348 Louvain-la-Neuve (Belgium)]
2014-12-05
First developed to predict continuous variables, Bayesian Maximum Entropy (BME) has become a complete framework in the context of space-time prediction since it has been extended to predict categorical variables and mixed random fields. This method proposes solutions to combine several sources of data whatever the nature of the information. However, the various attempts that were made for adapting the BME methodology to categorical variables and mixed random fields faced some limitations, such as a high computational burden. The main objective of this paper is to overcome this limitation by generalizing the Bayesian Data Fusion (BDF) theoretical framework to categorical variables, which is somehow a simplification of the BME method through the convenient conditional independence hypothesis. The BDF methodology for categorical variables is first described and then applied to a practical case study: the estimation of soil drainage classes using a soil map and point observations in the sandy area of Flanders around the city of Mechelen (Belgium). The BDF approach is compared to BME along with more classical approaches, such as Indicator CoKriging (ICK) and logistic regression. Estimators are compared using various indicators, namely the Percentage of Correctly Classified locations (PCC) and the Average Highest Probability (AHP). Although the BDF methodology for categorical variables is somehow a simplification of the BME approach, both methods lead to similar results and have strong advantages compared to ICK and logistic regression.
Using Bayesian Model Selection to Characterize Neonatal Eeg Recordings
Mitchell, Timothy J.
2009-12-01
The brains of premature infants must undergo significant maturation outside of the womb and are thus particularly susceptible to injury. Electroencephalographic (EEG) recordings are an important diagnostic tool in determining if a newborn's brain is functioning normally or if injury has occurred. However, interpreting the recordings is difficult and requires the skills of a trained electroencephalographer. Because these EEG specialists are rare, an automated interpretation of newborn EEG recordings would increase access to an important diagnostic tool for physicians. To automate this procedure, we employ Bayesian probability theory to compute the posterior probability for the EEG features of interest and use the results in a program designed to mimic EEG specialists. Specifically, we will be identifying waveforms of varying frequency and amplitude, as well as periods of flat recordings where brain activity is minimal.
Influences of variables on ship collision probability in a Bayesian belief network model
The influences of the variables in a Bayesian belief network model for estimating the role of human factors on ship collision probability in the Gulf of Finland are studied for discovering the variables with the largest influences and for examining the validity of the network. The change in the so-called causation probability is examined while observing each state of the network variables and by utilizing sensitivity and mutual information analyses. Changing course in an encounter situation is the most influential variable in the model, followed by variables such as the Officer of the Watch's action, situation assessment, danger detection, personal condition and incapacitation. The least influential variables are the other distractions on bridge, the bridge view, maintenance routines and the officer's fatigue. In general, the methods are found to agree on the order of the model variables although some disagreements arise due to slightly dissimilar approaches to the concept of variable influence. The relative values and the ranking of variables based on the values are discovered to be more valuable than the actual numerical values themselves. Although the most influential variables seem to be plausible, there are some discrepancies between the indicated influences in the model and literature. Thus, improvements are suggested to the network.
Bayesian estimation in IRT models with missing values in background variables
Christian Aßmann
2015-12-01
Large scale assessment studies typically aim at investigating the relationship between persons' competencies and explanatory variables. Individual competencies are often estimated by explicitly including explanatory background variables in corresponding Item Response Theory models. Since missing values in background variables inevitably occur, strategies to handle the uncertainty related to missing values in parameter estimation are required. We propose to adapt a Bayesian estimation strategy based on Markov Chain Monte Carlo techniques. Sampling from the posterior distribution of parameters is thereby enriched by sampling from the full conditional distribution of the missing values. We consider non-parametric as well as parametric approximations for the full conditional distributions of missing values, thus allowing for a flexible incorporation of metric as well as categorical background variables. We evaluate the validity of our approach with respect to statistical accuracy by a simulation study controlling the missing values generating mechanism. We show that the proposed Bayesian strategy allows for effective comparison of nested model specifications via gauging highest posterior density intervals of all involved model parameters. An illustration of the suggested approach uses data from the National Educational Panel Study on mathematical competencies of fifth grade students.
A survey of Bayesian predictive methods for model assessment, selection and comparison
Aki Vehtari
2012-01-01
To date, several methods exist in the statistical literature for model assessment, which purport themselves specifically as Bayesian predictive methods. The decision theoretic assumptions on which these methods are based are not always clearly stated in the original articles, however. The aim of this survey is to provide a unified review of Bayesian predictive model assessment and selection methods, and of methods closely related to them. We review the various assumptions that are made in this context and discuss the connections between different approaches, with an emphasis on how each method approximates the expected utility of using a Bayesian model for the purpose of predicting future data.
Burgess, Stephen; Thompson, Simon G; Andrews, G;
2010-01-01
Genetic markers can be used as instrumental variables, in an analogous way to randomization in a clinical trial, to estimate the causal relationship between a phenotype and an outcome variable. Our purpose is to extend the existing methods for such Mendelian randomization studies to the context of...... multiple genetic markers measured in multiple studies, based on the analysis of individual participant data. First, for a single genetic marker in one study, we show that the usual ratio of coefficients approach can be reformulated as a regression with heterogeneous error in the explanatory variable. This...... can be implemented using a Bayesian approach, which is next extended to include multiple genetic markers. We then propose a hierarchical model for undertaking a meta-analysis of multiple studies, in which it is not necessary that the same genetic markers are measured in each study. This provides an...
Finding the Most Distant Quasars Using Bayesian Selection Methods
Mortlock, Daniel
2014-01-01
Quasars, the brightly glowing disks of material that can form around the super-massive black holes at the centres of large galaxies, are amongst the most luminous astronomical objects known and so can be seen at great distances. The most distant known quasars are seen as they were when the Universe was less than a billion years old (i.e., ∼7% of its current age). Such distant quasars are, however, very rare, and so are difficult to distinguish from the billions of other comparably-bright sources in the night sky. In searching for the most distant quasars in a recent astronomical sky survey (the UKIRT Infrared Deep Sky Survey, UKIDSS), there were ∼10³ apparently plausible candidates for each expected quasar, far too many to reobserve with other telescopes. The solution to this problem was to apply Bayesian model comparison, making models of the quasar population and the dominant contaminating population (Galactic stars) to utilise the information content in the survey measurements. The result wa...
Applications of Bayesian Model Selection to Cosmological Parameters
Trotta, R
2005-01-01
Bayesian evidence is a tool for model comparison which can be used to decide whether the introduction of a new parameter is warranted by data. I show that the usual sampling statistic rejection tests for a null hypothesis can be misleading, since they do not take into account the information content of the data. I review the Laplace approximation and the Savage-Dickey density ratio to compute Bayes factors, which avoid the need of carrying out a computationally demanding multi-dimensional integration. I present a new procedure to forecast the Bayes factor of a future observation by computing the Expected Posterior Odds (ExPO). As an illustration, I consider three key parameters for our understanding of the cosmological concordance model: the spectral tilt of scalar perturbations, the spatial curvature of the Universe and a CDM isocurvature component to the initial conditions which is totally (anti)correlated with the adiabatic mode. I find that current data are not informative enough to draw a conclusion on t...
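The Savage-Dickey density ratio reviewed in this abstract can be illustrated on a toy conjugate-normal problem, where the Bayes factor for a nested model reduces to a ratio of two densities (a hedged sketch; the data, prior scale, and effect size below are illustrative assumptions, not cosmological quantities):

```python
import numpy as np
from scipy import stats

# Savage-Dickey density ratio for a nested model comparison (sketch).
# M1: y_i ~ N(theta, sigma^2) with prior theta ~ N(0, tau^2); M0: theta = 0.
# BF_01 = p(theta = 0 | y) / p(theta = 0), posterior over prior at the point.
rng = np.random.default_rng(0)
sigma, tau, n = 1.0, 1.0, 20
y = rng.normal(0.3, sigma, size=n)          # data with a weak true effect

# Conjugate normal posterior for theta
post_var = 1.0 / (1.0 / tau**2 + n / sigma**2)
post_mean = post_var * (y.sum() / sigma**2)

bf01 = (stats.norm.pdf(0.0, post_mean, np.sqrt(post_var))
        / stats.norm.pdf(0.0, 0.0, tau))
print(f"BF_01 = {bf01:.3f}")                # > 1 favours the simpler model
```

Because prior and posterior are both available in closed form here, no multi-dimensional integration is needed, which is exactly the appeal of the ratio over a direct evidence computation.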
Placek, Ben; Knuth, Kevin H. [Physics Department, University at Albany (SUNY), Albany, NY 12222 (United States); Angerhausen, Daniel, E-mail: bplacek@albany.edu, E-mail: kknuth@albany.edu, E-mail: daniel.angerhausen@gmail.com [Department of Physics, Applied Physics, and Astronomy, Rensselear Polytechnic Institute, Troy, NY 12180 (United States)
2014-11-10
EXONEST is an algorithm dedicated to detecting and characterizing the photometric signatures of exoplanets, which include reflection and thermal emission, Doppler boosting, and ellipsoidal variations. Using Bayesian inference, we can test between competing models that describe the data as well as estimate model parameters. We demonstrate this approach by testing circular versus eccentric planetary orbital models, as well as testing for the presence or absence of four photometric effects. In addition to using Bayesian model selection, a unique aspect of EXONEST is the potential capability to distinguish between reflective and thermal contributions to the light curve. A case study is presented using Kepler data recorded from the transiting planet KOI-13b. By considering only the nontransiting portions of the light curve, we demonstrate that it is possible to estimate the photometrically relevant model parameters of KOI-13b. Furthermore, Bayesian model testing confirms that the orbit of KOI-13b has a detectable eccentricity.
Elsheikh, Ahmed H.
2014-02-01
A Hybrid Nested Sampling (HNS) algorithm is proposed for efficient Bayesian model calibration and prior model selection. The proposed algorithm combines, Nested Sampling (NS) algorithm, Hybrid Monte Carlo (HMC) sampling and gradient estimation using Stochastic Ensemble Method (SEM). NS is an efficient sampling algorithm that can be used for Bayesian calibration and estimating the Bayesian evidence for prior model selection. Nested sampling has the advantage of computational feasibility. Within the nested sampling algorithm, a constrained sampling step is performed. For this step, we utilize HMC to reduce the correlation between successive sampled states. HMC relies on the gradient of the logarithm of the posterior distribution, which we estimate using a stochastic ensemble method based on an ensemble of directional derivatives. SEM only requires forward model runs and the simulator is then used as a black box and no adjoint code is needed. The developed HNS algorithm is successfully applied for Bayesian calibration and prior model selection of several nonlinear subsurface flow problems. © 2013 Elsevier Inc.
Variable selection by lasso-type methods
Sohail Chand
2011-09-01
Variable selection is an important property of shrinkage methods. The adaptive lasso is an oracle procedure and can do consistent variable selection. In this paper, we explain how the use of adaptive weights makes it possible for the adaptive lasso to satisfy the necessary and almost sufficient condition for consistent variable selection. We suggest a novel algorithm and give an important result: if, for the adaptive lasso, predictors are normalised after the introduction of adaptive weights, the adaptive lasso's performance becomes identical to that of the lasso.
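The two-stage adaptive lasso discussed in this abstract can be sketched via feature rescaling, so that an ordinary lasso solver applies a weighted penalty (an illustrative implementation; the data, weights, and tuning constant are assumptions, not the paper's algorithm):

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

# Adaptive lasso sketch: weights from an initial OLS fit, then a lasso on
# rescaled predictors X_j / w_j, which penalizes coefficient j by alpha * w_j.
rng = np.random.default_rng(1)
n, p = 200, 8
X = rng.normal(size=(n, p))
beta = np.array([3.0, 0.0, 0.0, 1.5, 0.0, 0.0, 0.0, 2.0])
y = X @ beta + rng.normal(size=n)

ols = LinearRegression().fit(X, y)
w = 1.0 / np.abs(ols.coef_)            # adaptive weights: large for weak signals
Xw = X / w                             # rescale so the lasso penalty is weighted
lasso = Lasso(alpha=0.1).fit(Xw, y)
beta_hat = lasso.coef_ / w             # map coefficients back to original scale

selected = np.nonzero(np.abs(beta_hat) > 1e-8)[0]
print("selected predictors:", selected)
```

The rescaling trick is the standard reduction of a weighted L1 penalty to an unweighted one; truly relevant predictors get small weights and are barely shrunk, while noise predictors get large weights and are driven to zero.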
A Gene Selection Algorithm using Bayesian Classification Approach
Alok Sharma; Kuldip K. Paliwal
2012-01-01
In this study, we propose a new feature (or gene) selection algorithm using a Bayes classification approach. The algorithm can find a gene subset crucial for the cancer classification problem. Problem statement: Gene identification plays an important role in the human cancer classification problem. Several feature selection algorithms have been proposed for analyzing and understanding influential genes using gene expression profiles. Approach: The feature selection algorithms aim to explore genes that are c...
Selecting AGN through variability in SN datasets
Boutsia, K.; Leibundgut, B.; Trevese, D.; Vagnetti, F.
2010-01-01
Variability is a main property of active galactic nuclei (AGN) and it was adopted as a selection criterion using multi epoch surveys conducted for the detection of supernovae (SNe). We have used two SN datasets. First we selected the AXAF field of the STRESS project, centered in the Chandra Deep Field South where, besides the deep X-ray surveys also various optical catalogs exist. Our method yielded 132 variable AGN candidates. We then extended our method including the dataset of the ESSENCE ...
Stochastic search variable selection for identifying multiple quantitative trait loci.
Yi, Nengjun; George, Varghese; Allison, David B
2003-07-01
In this article, we utilize stochastic search variable selection methodology to develop a Bayesian method for identifying multiple quantitative trait loci (QTL) for complex traits in experimental designs. The proposed procedure entails embedding multiple regression in a hierarchical normal mixture model, where latent indicators for all markers are used to identify the multiple markers. The markers with significant effects can be identified as those with higher posterior probability included in the model. A simple and easy-to-use Gibbs sampler is employed to generate samples from the joint posterior distribution of all unknowns including the latent indicators, genetic effects for all markers, and other model parameters. The proposed method was evaluated using simulated data and illustrated using a real data set. The results demonstrate that the proposed method works well under typical situations of most QTL studies in terms of number of markers and marker density. PMID:12871920
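The latent-indicator Gibbs sampler described in this abstract can be sketched as a minimal spike-and-slab normal mixture model (an illustrative reimplementation; the prior scales, noise variance, and synthetic data are assumptions, not the paper's QTL setup):

```python
import numpy as np

# Minimal stochastic search variable selection (SSVS) Gibbs sampler:
# beta_j | gamma_j ~ N(0, tau0^2) (spike) or N(0, tau1^2) (slab).
rng = np.random.default_rng(2)
n, p = 150, 6
X = rng.normal(size=(n, p))
beta_true = np.array([2.0, 0.0, -1.5, 0.0, 0.0, 1.0])
y = X @ beta_true + rng.normal(size=n)

tau0, tau1, sigma2 = 0.01, 10.0, 1.0   # spike sd, slab sd, known noise variance
n_iter, burn = 2000, 500
gamma = np.ones(p, dtype=int)
incl = np.zeros(p)

for it in range(n_iter):
    # beta | gamma, y : conjugate multivariate normal
    D_inv = np.diag(1.0 / np.where(gamma == 1, tau1**2, tau0**2))
    V = np.linalg.inv(X.T @ X / sigma2 + D_inv)
    m = V @ (X.T @ y / sigma2)
    beta = rng.multivariate_normal(m, V)
    # gamma_j | beta_j : Bernoulli with odds given by spike vs slab densities
    for j in range(p):
        slab = np.exp(-beta[j]**2 / (2 * tau1**2)) / tau1
        spike = np.exp(-beta[j]**2 / (2 * tau0**2)) / tau0
        gamma[j] = rng.random() < slab / (slab + spike)
    if it >= burn:
        incl += gamma

incl /= (n_iter - burn)
print("posterior inclusion probabilities:", np.round(incl, 2))
```

Markers with significant effects show up, as the abstract says, as those with high posterior inclusion probability; the null predictors hover near zero.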
Variable selection: Current practice in epidemiological studies
S. Walter (Stefan); H.W. Tiemeier (Henning)
2009-01-01
Selection of covariates is among the most controversial and difficult tasks in epidemiologic analysis. Correct variable selection addresses the problem of confounding in etiologic research and allows unbiased estimation of probabilities in prognostic studies. The aim of this commentary i
Purposeful selection of variables in logistic regression
Williams David Keith
2008-12-01
Abstract Background The main problem in many model-building situations is to choose from a large set of covariates those that should be included in the "best" model. A decision to keep a variable in the model might be based on the clinical or statistical significance. There are several variable selection algorithms in existence. Those methods are mechanical and as such carry some limitations. Hosmer and Lemeshow describe a purposeful selection of covariates within which an analyst makes a variable selection decision at each step of the modeling process. Methods In this paper we introduce an algorithm which automates that process. We conduct a simulation study to compare the performance of this algorithm with three well documented variable selection procedures in SAS PROC LOGISTIC: FORWARD, BACKWARD, and STEPWISE. Results We show that the advantage of this approach is when the analyst is interested in risk factor modeling and not just prediction. In addition to significant covariates, this variable selection procedure has the capability of retaining important confounding variables, resulting potentially in a slightly richer model. Application of the macro is further illustrated with the Hosmer and Lemeshow Worcester Heart Attack Study (WHAS) data. Conclusion If an analyst is in need of an algorithm that will help guide the retention of significant covariates as well as confounding ones they should consider this macro as an alternative tool.
Variable Selection in Logistic Regression Model
ZHANG Shangli; ZHANG Lili; QIU Kuanmin; LU Ying; CAI Baigen
2015-01-01
Variable selection is one of the most important problems in pattern recognition. In the linear regression model, many methods can solve this problem, such as the Least absolute shrinkage and selection operator (LASSO) and many improved LASSO methods, but there are few variable selection methods in generalized linear models. We study the variable selection problem in the logistic regression model. We propose a new variable selection method, the logistic elastic net, and prove that it has a grouping effect, which means that strongly correlated predictors tend to be in or out of the model together. The logistic elastic net is particularly useful when the number of predictors (p) is much bigger than the number of observations (n). By contrast, the LASSO is not a very satisfactory variable selection method in the case when p is much larger than n. The advantage and effectiveness of this method are demonstrated by real leukemia data and a simulation study.
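The grouping effect described in this abstract can be sketched with scikit-learn's elastic-net-penalized logistic regression as a stand-in for the authors' estimator (the data, penalty strength, and mixing ratio are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Logistic elastic net sketch: a strongly correlated pair of relevant
# predictors should be kept in (or dropped from) the model together.
rng = np.random.default_rng(3)
n, p = 120, 30
X = rng.normal(size=(n, p))
X[:, 1] = X[:, 0] + 0.05 * rng.normal(size=n)      # strongly correlated pair
logits = 2.0 * X[:, 0] + 2.0 * X[:, 1] - 1.5 * X[:, 5]
y = (rng.random(n) < 1 / (1 + np.exp(-logits))).astype(int)

clf = LogisticRegression(penalty="elasticnet", solver="saga",
                         l1_ratio=0.5, C=0.5, max_iter=5000).fit(X, y)
coef = clf.coef_.ravel()
print("correlated pair kept together:", coef[0] != 0 and coef[1] != 0)
```

A pure L1 penalty would typically keep only one of the two correlated columns; the L2 component of the elastic net is what spreads the weight across both.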
Brentani Helena
2004-08-01
Abstract Background An important challenge for transcript counting methods such as Serial Analysis of Gene Expression (SAGE), "Digital Northern" or Massively Parallel Signature Sequencing (MPSS), is to carry out statistical analyses that account for the within-class variability, i.e., variability due to the intrinsic biological differences among sampled individuals of the same class, and not only variability due to technical sampling error. Results We introduce a Bayesian model that accounts for the within-class variability by means of a mixture distribution. We show that the previously available approaches of aggregation in pools ("pseudo-libraries") and the Beta-Binomial model are particular cases of the mixture model. We illustrate our method with a brain tumor vs. normal comparison using SAGE data from public databases. We show examples of tags regarded as differentially expressed with high significance if the within-class variability is ignored, but clearly not so significant if one accounts for it. Conclusion Using available information about biological replicates, one can transform a list of candidate transcripts showing differential expression to a more reliable one. Our method is freely available, under GPL/GNU copyleft, through a user-friendly web-based on-line tool or as R language scripts at a supplemental web-site.
Emery, A. F.; Valenti, E.; Bardot, D.
2007-01-01
Parameter estimation is generally based upon the maximum likelihood approach and often involves regularization. Typically it is desired that the results be unbiased and of minimum variance. However, it is often better to accept biased estimates that have minimum mean square error. Bayesian inference is an attractive approach that achieves this goal and incorporates regularization automatically. More importantly, it permits us to analyse experiments in which both the system response and the independent variables (time, sensor position, experimental conditions, etc) are corrupted by noise and in which the model includes nuisance variables. This paper describes the use of Bayesian inference for an apparently simple experiment which is, in fact, fundamentally difficult and is compounded by a nuisance variable. By presenting this analysis we hope that members of the inverse community will see the value of applying Bayesian inference.
Within-subject consistency and between-subject variability in Bayesian reasoning strategies.
Cohen, Andrew L; Staub, Adrian
2015-09-01
It is well known that people tend to perform poorly when asked to determine a posterior probability on the basis of a base rate, true positive rate, and false positive rate. The present experiments assessed the extent to which individual participants nevertheless adopt consistent strategies in these Bayesian reasoning problems, and investigated the nature of these strategies. In two experiments, one laboratory-based and one internet-based, each participant completed 36 problems with factorially manipulated probabilities. Many participants applied consistent strategies involving use of only one of the three probabilities provided in the problem, or additive combination of two of the probabilities. There was, however, substantial variability across participants in which probabilities were taken into account. In the laboratory experiment, participants' eye movements were tracked as they read the problems. There was evidence of a relationship between information use and attention to a source of information. Participants' self-assessments of their performance, however, revealed little confidence that the strategies they applied were actually correct. These results suggest that the hypothesis of base rate neglect actually underestimates people's difficulty with Bayesian reasoning, but also suggest that participants are aware of their ignorance. PMID:26354671
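For reference, the normatively correct answer to such problems combines all three quantities via Bayes' rule; a worked example with illustrative numbers (not the experiments' actual stimuli):

```python
# Posterior probability from base rate, true positive rate, and false
# positive rate -- the computation participants are asked to approximate.
base_rate = 0.01      # P(disease)
tpr = 0.80            # P(positive | disease)
fpr = 0.096           # P(positive | no disease)

p_pos = tpr * base_rate + fpr * (1 - base_rate)   # total probability of a positive
posterior = tpr * base_rate / p_pos               # Bayes' rule
print(f"P(disease | positive) = {posterior:.3f}")
```

Note how strongly the small base rate pulls the posterior below the 80% true positive rate; strategies that use only one or two of the three inputs, as many participants did, cannot reproduce this.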
Barroso, L M A; Teodoro, P E; Nascimento, M; Torres, F E; Dos Santos, A; Corrêa, A M; Sagrilo, E; Corrêa, C C G; Silva, F A; Ceccon, G
2016-01-01
This study aimed to verify that a Bayesian approach could be used for the selection of upright cowpea genotypes with high adaptability and phenotypic stability, and the study also evaluated the efficiency of using informative and minimally informative a priori distributions. Six trials were conducted in randomized blocks, and the grain yield of 17 upright cowpea genotypes was assessed. To represent the minimally informative a priori distributions, a probability distribution with high variance was used, and a meta-analysis concept was adopted to represent the informative a priori distributions. Bayes factors were used to conduct comparisons between the a priori distributions. The Bayesian approach was effective for selection of upright cowpea genotypes with high adaptability and phenotypic stability using the Eberhart and Russell method. Bayes factors indicated that the use of informative a priori distributions provided more accurate results than minimally informative a priori distributions. PMID:26985961
Bayesian Model Selection and Prediction with Empirical Applications
Phillips, Peter C.B.
1992-01-01
This paper builds on some recent work by the author and Werner Ploberger (1991, 1994) on the development of "Bayes models" for time series and on the authors' model selection criterion "PIC." The PIC criterion is used in this paper to determine the lag order, the trend degree, and the presence or absence of a unit root in an autoregression with deterministic trend. A new forecast encompassing test for Bayes models is developed which allows one Bayes model to be compared with another on the ba...
QUASAR SELECTION BASED ON PHOTOMETRIC VARIABILITY
We develop a method for separating quasars from other variable point sources using Sloan Digital Sky Survey (SDSS) Stripe 82 light-curve data for ∼ 10,000 variable objects. To statistically describe quasar variability, we use a damped random walk model parametrized by a damping timescale, τ, and an asymptotic amplitude (structure function), SF∞. With the aid of an SDSS spectroscopically confirmed quasar sample, we demonstrate that variability selection in typical extragalactic fields with low stellar density can deliver complete samples with reasonable purity (or efficiency, E). Compared to a selection method based solely on the slope of the structure function, the inclusion of the τ information boosts E from 60% to 75% while maintaining a highly complete sample (98%) even in the absence of color information. For a completeness of C = 90%, E is boosted from 80% to 85%. Conversely, C improves from 90% to 97% while maintaining E = 80% when imposing a lower limit on τ. With the aid of color selection, the purity can be further boosted to 96%, with C = 93%. Hence, selection methods based on variability will play an important role in the selection of quasars with data provided by upcoming large sky surveys, such as Pan-STARRS and the Large Synoptic Survey Telescope (LSST). For a typical (simulated) LSST cadence over 10 years and a photometric accuracy of 0.03 mag (achieved at i ∼ 22), C is expected to be 88% for a simple sample selection criterion of τ > 100 days. In summary, given an adequate survey cadence, photometric variability provides an even better method than color selection for separating quasars from stars.
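The damped random walk parametrization (τ, SF∞) can be illustrated by simulating a light curve and checking that its structure function saturates near SF∞ for lags well above τ (a sketch with assumed, not fitted, parameters):

```python
import numpy as np

# Damped random walk (Ornstein-Uhlenbeck) light curve with damping
# timescale tau and asymptotic structure-function amplitude SF_inf.
rng = np.random.default_rng(4)
tau, sf_inf, dt, n = 200.0, 0.2, 1.0, 20000     # days, mag, days, samples
x = np.zeros(n)
step_sd = sf_inf / np.sqrt(2) * np.sqrt(1 - np.exp(-2 * dt / tau))
for i in range(1, n):
    # exact one-step OU update: exponential decay plus stationary noise
    x[i] = x[i - 1] * np.exp(-dt / tau) + rng.normal(0.0, step_sd)

def structure_function(x, lag):
    """RMS magnitude difference at a given lag (in samples)."""
    d = x[lag:] - x[:-lag]
    return np.sqrt(np.mean(d**2))

print("SF(10 d)   =", round(structure_function(x, 10), 3))    # well below SF_inf
print("SF(2000 d) =", round(structure_function(x, 2000), 3))  # near SF_inf
```

The short-lag structure function rises with lag while the long-lag one flattens, which is the behaviour that lets τ separate quasars from stars whose variability lacks this characteristic damping.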
Variable and subset selection in PLS regression
Høskuldsson, Agnar
2001-01-01
The purpose of this paper is to present some useful methods for introductory analysis of variables and subsets in relation to PLS regression. We present here methods that are efficient in finding the appropriate variables or subset to use in the PLS regression. The general conclusion...... is that variable selection is important for successful analysis of chemometric data. An important aspect of the results presented is that lack of variable selection can spoil the PLS regression, and that cross-validation measures using a test set can show larger variation, when we use different subsets of X, than...... obtained by different methods. We also present an approach to orthogonal scatter correction. The procedures and comparisons are applied to industrial data. (C) 2001 Elsevier Science B.V. All rights reserved....
Heart rate variability estimation in photoplethysmography signals using Bayesian learning approach.
Alqaraawi, Ahmed; Alwosheel, Ahmad; Alasaad, Amr
2016-06-01
Heart rate variability (HRV) has become a marker for various health and disease conditions. Photoplethysmography (PPG) sensors integrated in wearable devices such as smart watches and phones are widely used to measure heart activities. HRV requires accurate estimation of the time interval between consecutive peaks in the PPG signal. However, the PPG signal is very sensitive to motion artefacts, which may lead to poor HRV estimation if false peaks are detected. In this Letter, the authors propose a probabilistic approach based on Bayesian learning to better estimate HRV from PPG signals recorded by wearable devices and enhance the performance of the automatic multiscale-based peak detection (AMPD) algorithm used for peak detection. The authors' experiments show that their approach enhances the performance of the AMPD algorithm in terms of a number of HRV-related metrics such as sensitivity, positive predictive value, and average temporal resolution. PMID:27382483
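A minimal peak-to-peak HRV computation can be sketched on a synthetic PPG-like signal, using scipy's generic find_peaks as a simple stand-in for the AMPD detector discussed in the abstract (the sampling rate, beat rate, and pulse shape are assumptions):

```python
import numpy as np
from scipy.signal import find_peaks

fs = 100.0                        # Hz, assumed sampling rate
t = np.arange(0, 60, 1 / fs)      # 60 s of signal
hr = 1.2 + 0.05 * np.sin(2 * np.pi * 0.1 * t)   # slowly varying beat rate (Hz)
phase = np.cumsum(hr) / fs
ppg = np.sin(2 * np.pi * phase) ** 3            # peaky, pulse-like waveform

# Detect one peak per beat; the refractory 'distance' suppresses false peaks
peaks, _ = find_peaks(ppg, height=0.5, distance=int(0.4 * fs))
ibi = np.diff(peaks) / fs * 1000.0              # inter-beat intervals, ms
sdnn = ibi.std()                                # a basic time-domain HRV metric
print(f"{len(peaks)} beats, mean IBI {ibi.mean():.0f} ms, SDNN {sdnn:.1f} ms")
```

A single missed or spurious peak splits or merges an inter-beat interval and can inflate SDNN dramatically, which is why robust peak detection matters so much for PPG-based HRV.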
The selective bleed variable cycle engine
Nascimento, M. A. R.
1992-01-01
A new concept in aircraft propulsion is described in this work. In particular, a variable cycle jet engine is investigated for supersonic ASTOVL aircraft. This engine is a Selective Bleed Variable Cycle, twin shaft turbofan. At low flight speeds the engine operates as a medium bypass turbofan. At supersonic cruise it operates as a low bypass turbofan without reheat. The performance of the engine and its components is analyzed using a novel matching procedure. Off-design engine performance characterist...
A new variable selection method for classification
Nuñez Letamendia,Laura
2007-01-01
This work proposes an "ad hoc" new method for variable selection in classification, specifically in Discriminant Analysis. This new method is based on the metaheuristic strategy Tabu Search. From a computational point of view, variable selection is an NP-Hard problem and therefore there is no guarantee of finding the optimum solution (NP = Nondeterministic Polynomial Time). This means that when the size of the problem is large, finding an optimum solution in practice is unfeasible. As found in other optimization problems, metaheuristic techniques have proved to be good at solving this type of problem. Although there are many references in the literature regarding selecting variables for their use in classification, there are very few key references on the selection of variables for their use in Discriminant Analysis. In fact, the most well-known statistical packages continue to use classic selection methods such as Stepwise, Backward or Forward. After performing some tests, it is found that Tabu Search obtains significantly better results than the Stepwise, Backward or Forward methods used by classic statistical packages.
Bayesian model selection applied to artificial neural networks used for water resources modeling
Kingston, Greer B.; Maier, Holger R.; Lambert, Martin F.
2008-04-01
Artificial neural networks (ANNs) have proven to be extremely valuable tools in the field of water resources engineering. However, one of the most difficult tasks in developing an ANN is determining the optimum level of complexity required to model a given problem, as there is no formal systematic model selection method. This paper presents a Bayesian model selection (BMS) method for ANNs that provides an objective approach for comparing models of varying complexity in order to select the most appropriate ANN structure. The approach uses Markov Chain Monte Carlo posterior simulations to estimate the evidence in favor of competing models and, in this study, three known methods for doing this are compared in terms of their suitability for being incorporated into the proposed BMS framework for ANNs. However, it is acknowledged that it can be particularly difficult to accurately estimate the evidence of ANN models. Therefore, the proposed BMS approach for ANNs incorporates a further check of the evidence results by inspecting the marginal posterior distributions of the hidden-to-output layer weights, which unambiguously indicate any redundancies in the hidden layer nodes. The fact that this check is available is one of the greatest advantages of the proposed approach over conventional model selection methods, which do not provide such a test and instead rely on the modeler's subjective choice of selection criterion. The advantages of a total Bayesian approach to ANN development, including training and model selection, are demonstrated on two synthetic and one real world water resources case study.
Schöniger, Anneli; Wöhling, Thomas; Samaniego, Luis; Nowak, Wolfgang
2014-12-01
Bayesian model selection or averaging objectively ranks a number of plausible, competing conceptual models based on Bayes' theorem. It implicitly performs an optimal trade-off between performance in fitting available data and minimum model complexity. The procedure requires determining Bayesian model evidence (BME), which is the likelihood of the observed data integrated over each model's parameter space. The computation of this integral is highly challenging because it is as high-dimensional as the number of model parameters. Three classes of techniques to compute BME are available, each with its own challenges and limitations: (1) Exact and fast analytical solutions are limited by strong assumptions. (2) Numerical evaluation quickly becomes unfeasible for expensive models. (3) Approximations known as information criteria (ICs) such as the AIC, BIC, or KIC (Akaike, Bayesian, or Kashyap information criterion, respectively) yield contradicting results with regard to model ranking. Our study features a theory-based intercomparison of these techniques. We further assess their accuracy in a simplistic synthetic example where for some scenarios an exact analytical solution exists. In more challenging scenarios, we use a brute-force Monte Carlo integration method as reference. We continue this analysis with a real-world application of hydrological model selection. This is a first-time benchmarking of the various methods for BME evaluation against true solutions. Results show that BME values from ICs are often heavily biased and that the choice of approximation method substantially influences the accuracy of model ranking. For reliable model selection, bias-free numerical methods should be preferred over ICs whenever computationally feasible.
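The brute-force Monte Carlo reference method mentioned here, averaging the likelihood over draws from the prior, can be checked against an exact analytical BME in a conjugate toy problem (illustrative; not one of the study's hydrological models):

```python
import numpy as np
from scipy import stats

# Brute-force Monte Carlo estimate of Bayesian model evidence (BME):
# BME = integral of p(y | theta) p(theta) d(theta) ~ mean likelihood
# over prior samples. Model: y_i ~ N(theta, 1), theta ~ N(0, 2^2).
rng = np.random.default_rng(5)
y = rng.normal(0.5, 1.0, size=30)              # observed data
prior = stats.norm(0.0, 2.0)                   # prior on the mean

theta = prior.rvs(size=200_000, random_state=rng)
loglik = stats.norm.logpdf(y[:, None], theta[None, :], 1.0).sum(axis=0)
bme_mc = np.exp(loglik).mean()

# Analytic check: marginally, y is multivariate normal with cov I + tau^2 * J
n = len(y)
cov = np.eye(n) + prior.var() * np.ones((n, n))
bme_exact = stats.multivariate_normal(np.zeros(n), cov).pdf(y)
print(f"MC BME {bme_mc:.3e} vs exact {bme_exact:.3e}")
```

Even in this one-parameter problem the integrand is sharply peaked relative to the prior, hinting at why the numerical route becomes unfeasible for expensive, high-dimensional models and why biased IC approximations are tempting.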
Variable Selection for Latent Dirichlet Allocation
Kim, Dongwoo; Oh, Alice
2012-01-01
In latent Dirichlet allocation (LDA), topics are multinomial distributions over the entire vocabulary. However, the vocabulary usually contains many words that are not relevant in forming the topics. We adopt a variable selection method widely used in statistical modeling as a dimension reduction tool and combine it with LDA. In this variable selection model for LDA (vsLDA), topics are multinomial distributions over a subset of the vocabulary, and by excluding words that are not informative for finding the latent topic structure of the corpus, vsLDA finds topics that are more robust and discriminative. We compare three models, vsLDA, LDA with symmetric priors, and LDA with asymmetric priors, on heldout likelihood, MCMC chain consistency, and document classification. The performance of vsLDA is better than symmetric LDA for likelihood and classification, better than asymmetric LDA for consistency and classification, and about the same in the other comparisons.
Coping with Trial-to-Trial Variability of Event Related Signals: A Bayesian Inference Approach
Ding, Mingzhou; Chen, Yonghong; Knuth, Kevin H.; Bressler, Steven L.; Schroeder, Charles E.
2005-01-01
In electro-neurophysiology, single-trial brain responses to a sensory stimulus or a motor act are commonly assumed to result from the linear superposition of a stereotypic event-related signal (e.g. the event-related potential or ERP) that is invariant across trials and some ongoing brain activity often referred to as noise. To extract the signal, one performs an ensemble average of the brain responses over many identical trials to attenuate the noise. To date, this simple signal-plus-noise (SPN) model has been the dominant approach in cognitive neuroscience. Mounting empirical evidence has shown that the assumptions underlying this model may be overly simplistic. More realistic models have been proposed that account for the trial-to-trial variability of the event-related signal as well as the possibility of multiple differentially varying components within a given ERP waveform. The variable-signal-plus-noise (VSPN) model, which has been demonstrated to provide the foundation for separation and characterization of multiple differentially varying components, has the potential to provide a rich source of information for questions related to neural functions that complement the SPN model. Thus, being able to estimate the amplitude and latency of each ERP component on a trial-by-trial basis provides a critical link between the perceived benefits of the VSPN model and its many concrete applications. In this paper we describe a Bayesian approach to deal with this issue and the resulting strategy is referred to as the differentially Variable Component Analysis (dVCA). We compare the performance of dVCA on simulated data with Independent Component Analysis (ICA) and analyze neurobiological recordings from monkeys performing cognitive tasks.
In making low-level radioactivity measurements of populations, it is commonly observed that a substantial portion of net results are negative. Furthermore, the observed variance of the measurement results arises from a combination of measurement uncertainty and population variability. This paper presents a method for disaggregating measurement uncertainty from population variability to produce a probability density function (PDF) of possibly true results. To do this, simple, justifiable, and reasonable assumptions are made about the relationship of the measurements to the measurands (the 'true values'). The measurements are assumed to be unbiased, that is, that their average value is the average of the measurands. Using traditional estimates of each measurement's uncertainty to disaggregate population variability from measurement uncertainty, a PDF of measurands for the population is produced. Then, using Bayes's theorem, the same assumptions, and all the data from the population of individuals, a posterior PDF is computed for each individual's measurand. These PDFs are non-negative, and their average is equal to the average of the measurement results for the population. The uncertainty in these Bayesian posterior PDFs is all Berkson with no remaining classical component. The methods are applied to baseline bioassay data from the Hanford site. The data include 90Sr urinalysis measurements on 128 people, 137Cs in vivo measurements on 5,337 people, and 239Pu urinalysis measurements on 3,270 people. The method produces excellent results for the 90Sr and 137Cs measurements, since there are nonzero concentrations of these global fallout radionuclides in people who have not been occupationally exposed. The method does not work for the 239Pu measurements in non-occupationally exposed people because the population average is essentially zero.
Seidou, O.; Asselin, J. J.; Ouarda, T. B. M. J.
2007-08-01
Multivariate linear regression is one of the most popular modeling tools in hydrology and climate sciences for explaining the link between key variables. A single linear regression is not always appropriate, since the relationship may experience sudden changes due to climatic, environmental, or anthropogenic perturbations. To address this issue, a practical and general approach to the Bayesian analysis of the multivariate regression model is presented. The approach allows simultaneous single change point detection in a multivariate sample and can account for missing data in the response variables and/or in the explanatory variables. It also improves on recently published change point detection methodologies by allowing a more flexible and thus more realistic prior specification for the existence of a change and the date of change as well as for the regression parameters. The estimation of all unknown parameters is achieved by Markov chain Monte Carlo simulations. It is shown that the developed approach is able to reproduce the results of Rasmussen (2001) as well as those of Perreault et al. (2000a, 2000b). Furthermore, two of the examples provided in the paper show that the proposed methodology can readily be applied to some problems that cannot be addressed by any of the above-mentioned approaches because of limiting model structure and/or restrictive prior assumptions. The first of these examples deals with single change point detection in the multivariate linear relationship between mean basin-scale precipitation at different periods of the year and the summer-autumn flood peaks of the Broadback River located in northern Quebec, Canada. The second one addresses the problem of missing data estimation with uncertainty assessment in multisite streamflow records with a possible simultaneous shift in mean streamflow values that occurred at an unknown date.
Variable Selection in Model-based Clustering: A General Variable Role Modeling
Maugis, Cathy; Celeux, Gilles; Martin-Magniette, Marie-Laure
2008-01-01
The currently available variable selection procedures in model-based clustering assume that the irrelevant clustering variables are all independent or are all linked with the relevant clustering variables. We propose a more versatile variable selection model which describes three possible roles for each variable: The relevant clustering variables, the irrelevant clustering variables dependent on a part of the relevant clustering variables, and the irrelevant clustering variables totally independent of the relevant clustering variables.
Bayesian model selection without evidences: application to the dark energy equation-of-state
Hee, Sonke; Hobson, Mike P; Lasenby, Anthony N
2015-01-01
A method is presented for Bayesian model selection without explicitly computing evidences, by using a combined likelihood and introducing an integer model selection parameter $n$ so that Bayes factors, or more generally posterior odds ratios, may be read off directly from the posterior of $n$. If the total number of models under consideration is specified a priori, the full joint parameter space $(\theta, n)$ of the models is of fixed dimensionality and can be explored using standard MCMC or nested sampling methods, without the need for reversible jump MCMC techniques. The posterior on $n$ is then obtained by straightforward marginalisation. We demonstrate the efficacy of our approach by application to several toy models. We then apply it to constraining the dark energy equation-of-state using a free-form reconstruction technique. We show that $\Lambda$CDM is significantly favoured over all extensions, including the simple $w(z) = \mathrm{constant}$ model.
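The product-space idea above can be sketched on a pair of toy models with a flat prior on the model index $n$. The two models, the data, and the conjugate Gibbs sampler below are all illustrative stand-ins for the paper's general MCMC/nested-sampling machinery; when $n$ points at the model that ignores a parameter, that parameter is refreshed from its prior (a pseudo-prior), which keeps the joint space fixed-dimensional.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
y = rng.normal(0.2, 1.0, size=30)
m = y.size

# Joint space (n, mu): n = 0 -> y_i ~ N(0, 1); n = 1 -> y_i ~ N(mu, 1)
# with prior mu ~ N(0, 1) and a flat prior on n.
l0 = stats.norm.logpdf(y, 0.0, 1.0).sum()   # model-0 log likelihood (fixed)
n_samples = []
n = 0
for _ in range(20_000):
    if n == 1:
        prec = m + 1.0                       # conjugate Gaussian update of mu
        mu = rng.normal(y.sum() / prec, 1.0 / np.sqrt(prec))
    else:
        mu = rng.normal(0.0, 1.0)            # pseudo-prior draw keeps chain mobile
    l1 = stats.norm.logpdf(y, mu, 1.0).sum()
    p1 = np.exp(l1 - np.logaddexp(l0, l1))   # Gibbs step for the model index n
    n = int(rng.random() < p1)
    n_samples.append(n)

p_hat = np.mean(n_samples)
post_odds = p_hat / (1.0 - p_hat)            # posterior odds read off from n

# Analytic Bayes factor of model 1 vs model 0 for validation
log_bf = (stats.multivariate_normal.logpdf(y, np.zeros(m),
                                           np.eye(m) + np.ones((m, m))) - l0)
print(post_odds, np.exp(log_bf))
```

With equal model priors, the marginal posterior frequency of $n = 1$ directly recovers the Bayes factor, which is the mechanism the abstract describes.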
Markus Krauss
Interindividual variability in anatomical and physiological properties results in significant differences in drug pharmacokinetics. The consideration of such pharmacokinetic variability supports optimal drug efficacy and safety for each single individual, e.g. by identification of individual-specific dosings. One clear objective in clinical drug development is therefore a thorough characterization of the physiological sources of interindividual variability. In this work, we present a Bayesian population physiologically-based pharmacokinetic (PBPK) approach for the mechanistically and physiologically realistic identification of interindividual variability. The consideration of a generic and highly detailed mechanistic PBPK model structure enables the integration of large amounts of prior physiological knowledge, which is then updated with new experimental data in a Bayesian framework. A covariate model integrates known relationships of physiological parameters to age, gender and body height. We further provide a framework for estimation of the a posteriori parameter dependency structure at the population level. The approach is demonstrated considering a cohort of healthy individuals and theophylline as an application example. The variability and co-variability of physiological parameters are quantified within the population. Significant correlations are identified between population parameters and are applied for individual- and population-specific visual predictive checks of the pharmacokinetic behavior, which leads to improved results compared to present population approaches. In the future, the integration of a generic PBPK model into a hierarchical approach allows for extrapolations to other populations or drugs, while the Bayesian paradigm allows for an iterative application of the approach and thereby a continuous updating of physiological knowledge with new data. This will facilitate decision making, e.g. from preclinical to clinical development.
Jensen, Finn Verner; Nielsen, Thomas Dyhre
2016-01-01
Mathematically, a Bayesian graphical model is a compact representation of the joint probability distribution for a set of variables. The most frequently used type of Bayesian graphical model is the Bayesian network. The structural part of a Bayesian graphical model is a graph consisting of nodes and...... largely due to the availability of efficient inference algorithms for answering probabilistic queries about the states of the variables in the network. Furthermore, to support the construction of Bayesian network models, learning algorithms are also available. We give an overview of the Bayesian network...
A Bayesian Network Approach for Offshore Risk Analysis Through Linguistic Variables
(no author listed)
2007-01-01
This paper presents a new approach for offshore risk analysis that is capable of dealing with linguistic probabilities in Bayesian networks (BNs). In this paper, linguistic probabilities are used to describe occurrence likelihood of hazardous events that may cause possible accidents in offshore operations. In order to use fuzzy information, an f-weighted valuation function is proposed to transform linguistic judgements into crisp probability distributions which can be easily put into a BN to model causal relationships among risk factors. The use of linguistic variables makes it easier for human experts to express their knowledge, and the transformation of linguistic judgements into crisp probabilities can significantly save the cost of computing, modifying and maintaining a BN model. The flexibility of the method allows for multiple forms of information to be used to quantify model relationships, including formally assessed expert opinion when quantitative data are lacking, or when only qualitative or vague statements can be made. The model is a modular representation of uncertain knowledge caused by randomness, vagueness and ignorance. This makes the risk analysis of offshore engineering systems more functional and easier in many assessment contexts. Specifically, the proposed f-weighted valuation function takes into account not only the dominating values, but also the α-level values that are ignored by conventional valuation methods. A case study of the collision risk between a Floating Production, Storage and Off-loading (FPSO) unit and the authorised vessels due to human elements during operation is used to illustrate the application of the proposed model.
Gamma prior distribution selection for Bayesian analysis of failure rate and reliability
Waller, R.A.; Johnson, M.M.; Waterman, M.S.; Martz, H.F. Jr.
1976-07-01
It is assumed that the phenomenon under study is such that the time-to-failure may be modeled by an exponential distribution with failure rate lambda. For Bayesian analyses of the assumed model, the family of gamma distributions provides conjugate prior models for lambda. Thus, an experimenter needs to select a particular gamma model to conduct a Bayesian reliability analysis. The purpose of this report is to present a methodology that can be used to translate engineering information, experience, and judgment into a choice of a gamma prior distribution. The proposed methodology assumes that the practicing engineer can provide percentile data relating to either the failure rate or the reliability of the phenomenon being investigated. For example, the methodology will select the gamma prior distribution which conveys an engineer's belief that the failure rate lambda simultaneously satisfies the probability statements, P(lambda less than 1.0 x 10^-3) equals 0.50 and P(lambda less than 1.0 x 10^-5) equals 0.05. That is, two percentiles provided by an engineer are used to determine a gamma prior model which agrees with the specified percentiles. For those engineers who prefer to specify reliability percentiles rather than the failure rate percentiles illustrated above, it is possible to use the induced negative-log gamma prior distribution which satisfies the probability statements, P(R(t_0) less than 0.99) equals 0.50 and P(R(t_0) less than 0.99999) equals 0.95, for some operating time t_0. The report also includes graphs for selected percentiles which assist an engineer in applying the procedure. 28 figures, 16 tables.
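The percentile-matching step described in the abstract can be sketched numerically. The key observation is that the ratio of two quantiles of a gamma distribution depends only on its shape parameter, so the elicitation reduces to a one-dimensional root find for the shape, after which the scale follows from either percentile. SciPy is assumed; the solver bracket is an illustrative choice.

```python
import numpy as np
from scipy import stats, optimize

# Elicited engineer percentiles for the failure rate lambda:
#   P(lambda < 1e-3) = 0.50   and   P(lambda < 1e-5) = 0.05
q50, q05 = 1.0e-3, 1.0e-5

def ratio_gap(a):
    # Quantile ratio of a standard gamma depends only on the shape a.
    return stats.gamma.ppf(0.50, a) / stats.gamma.ppf(0.05, a) - q50 / q05

a = optimize.brentq(ratio_gap, 0.01, 10.0)   # solve for the shape
s = q50 / stats.gamma.ppf(0.50, a)           # scale from the median statement

# Verify the fitted prior reproduces both elicited probability statements
print("shape:", a, "scale:", s)
print(stats.gamma.cdf(q50, a, scale=s))      # should be close to 0.50
print(stats.gamma.cdf(q05, a, scale=s))      # should be close to 0.05
```

This mirrors the report's goal of turning two engineer-supplied percentiles into a specific gamma prior, replacing its graphical look-up procedure with a direct numerical solve.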
Maximum Likelihood Bayesian Averaging of Spatial Variability Models in Unsaturated Fractured Tuff
Hydrologic analyses typically rely on a single conceptual-mathematical model. Yet hydrologic environments are open and complex, rendering them prone to multiple interpretations and mathematical descriptions. Adopting only one of these may lead to statistical bias and underestimation of uncertainty. Bayesian Model Averaging (BMA) provides an optimal way to combine the predictions of several competing models and to assess their joint predictive uncertainty. However, it tends to be computationally demanding and relies heavily on prior information about model parameters. We apply a maximum likelihood (ML) version of BMA (MLBMA) to seven alternative variogram models of log air permeability data from single-hole pneumatic injection tests in six boreholes at the Apache Leap Research Site (ALRS) in central Arizona. Unbiased ML estimates of variogram and drift parameters are obtained using Adjoint State Maximum Likelihood Cross Validation in conjunction with Universal Kriging and Generalized Least Squares. Standard information criteria provide an ambiguous ranking of the models, which does not justify selecting one of them and discarding all others as is commonly done in practice. Instead, we eliminate some of the models based on their negligibly small posterior probabilities and use the rest to project the measured log permeabilities by kriging onto a rock volume containing the six boreholes. We then average these four projections, and associated kriging variances, using the posterior probability of each model as weight. Finally, we cross-validate the results by eliminating from consideration all data from one borehole at a time, repeating the above process, and comparing the predictive capability of MLBMA with that of each individual model. We find that MLBMA is superior to any individual geostatistical model of log permeability among those we consider at the ALRS.
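The posterior-probability-weighted combination of kriging predictions and variances can be sketched with hypothetical numbers (the weights, means, and variances below are illustrative and are not the ALRS values). The total predictive variance splits into a within-model part and a between-model part, which is why BMA reports larger, more honest uncertainty than any single model.

```python
import numpy as np

# Hypothetical posterior probabilities of four retained variogram models
# and their kriging predictions/variances at one target location.
w = np.array([0.45, 0.30, 0.15, 0.10])       # posterior model weights (sum to 1)
m = np.array([-13.2, -13.6, -12.9, -13.4])   # kriged log-permeability per model
v = np.array([0.40, 0.55, 0.35, 0.60])       # kriging variances per model

mean_bma = np.sum(w * m)
# Law of total variance: within-model variance + between-model spread
var_bma = np.sum(w * (v + (m - mean_bma) ** 2))
print(mean_bma, var_bma)
```

Here the between-model term adds roughly 0.06 on top of the within-model average of about 0.46, so ignoring model uncertainty would understate the predictive variance.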
Amene, E; Hanson, L A; Zahn, E A; Wild, S R; Döpfer, D
2016-07-01
The purpose of this study was to apply a novel statistical method for variable selection and a model-based approach for filling data gaps in mortality rates associated with foodborne diseases using the WHO Vital Registration mortality dataset. Correlation analysis and elastic net regularization methods were applied to drop redundant variables and to select the most meaningful subset of predictors. Whenever predictor data were missing, multiple imputation was used to fill in plausible values. Cluster analysis was applied to identify similar groups of countries based on the values of the predictors. Finally, a Bayesian hierarchical regression model was fit to the final dataset for predicting mortality rates. From 113 potential predictors, 32 were retained after correlation analysis. Out of these 32 predictors, eight with non-zero coefficients were selected using the elastic net regularization method. Based on the values of these variables, four clusters of countries were identified. The uncertainty of predictions was large for countries within clusters lacking mortality rates, and it was low for a cluster that had mortality rate information. Our results demonstrated that, using Bayesian hierarchical regression models, a data-driven clustering of countries and a meaningful subset of predictors can be used to fill data gaps in foodborne disease mortality. PMID:26785774
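The elastic-net screening step described above can be sketched on synthetic data; the predictor count, coefficients, and noise level below are illustrative, and scikit-learn's `ElasticNetCV` stands in for the authors' implementation.

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV

rng = np.random.default_rng(5)

# Hypothetical country-level predictors: only the first 3 of 12 truly
# drive the (synthetic) mortality-rate response.
X = rng.normal(size=(80, 12))
beta = np.zeros(12)
beta[:3] = [2.0, -1.5, 1.0]
y = X @ beta + rng.normal(0.0, 0.5, size=80)

# Elastic net with cross-validated penalty; predictors whose coefficients
# are shrunk to zero are dropped from the final model.
enet = ElasticNetCV(l1_ratio=0.5, cv=5, random_state=0).fit(X, y)
selected = np.flatnonzero(np.abs(enet.coef_) > 1e-6)
print("selected predictors:", selected)
```

The regularization drives redundant coefficients toward zero, retaining a small meaningful subset, which is the same mechanism the study uses to go from 32 candidate predictors to 8.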
Schöniger, A.; Nowak, W.; Wöhling, T.
2013-12-01
Bayesian model averaging (BMA) combines the predictive capabilities of alternative conceptual models into a robust best estimate and allows the quantification of conceptual uncertainty. The individual models are weighted with their posterior probability according to Bayes' theorem. Despite this rigorous procedure, we see four obstacles to robust model ranking: (1) The weights inherit uncertainty related to measurement noise in the calibration data set, which may compromise the reliability of model ranking. (2) Posterior weights rank the models only relative to each other, but do not contain information about the absolute model performance. (3) There is a lack of objective methods to assess whether the suggested models are practically distinguishable or very similar to each other, i.e., whether the individual models explore different regions of the model space. (4) No theory for optimal design (OD) of experiments exists that explicitly aims at maximum-confidence model discrimination. The goal of our study is to overcome these four shortcomings. We determine the robustness of weights against measurement noise (1) by repeatedly perturbing the observed data with random measurement errors and analyzing the variability in the obtained weights. Realizing that model weights have a probability distribution of their own, we introduce an additional term into the overall prediction uncertainty analysis scheme which we call 'weighting uncertainty'. We further assess an 'absolute distance' in performance of the model set from the truth (2) as seen through the eyes of the data by interpreting statistics of Bayesian model evidence. This analysis is of great value for modellers to decide, if the modelling task can be satisfactorily carried out with the model(s) at hand, or if more effort should be invested in extending the set with better performing models. As a further prerequisite for robust model selection, we scrutinize the ability of BMA to distinguish between the models in
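Obstacle (1) and the proposed 'weighting uncertainty' can be sketched as follows. The observations, the two models' predictions, and the error level are all hypothetical; the point is only the mechanism of repeatedly perturbing the calibration data with random measurement errors and recording the spread of the resulting BMA weights.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Hypothetical calibration data and two competing models' predictions
obs = np.array([1.0, 1.4, 1.9, 2.6, 3.1])
pred = {"A": np.array([1.1, 1.5, 2.0, 2.4, 3.0]),
        "B": np.array([0.8, 1.5, 2.2, 2.5, 3.3])}
sigma = 0.2  # assumed measurement-error standard deviation

def bma_weights(y):
    # Gaussian likelihood per model, normalized to posterior model weights
    logl = np.array([stats.norm.logpdf(y, p, sigma).sum()
                     for p in pred.values()])
    w = np.exp(logl - logl.max())
    return w / w.sum()

# Perturb the observations with random measurement errors many times and
# look at the distribution of the weights ('weighting uncertainty')
ws = np.array([bma_weights(obs + rng.normal(0.0, sigma, obs.size))
               for _ in range(2000)])
print("mean weights:", ws.mean(axis=0))
print("std of weights:", ws.std(axis=0))
```

A nonzero spread of the weights across perturbations signals that the model ranking is not robust against measurement noise, which is exactly the diagnostic the abstract proposes to feed back into the overall prediction uncertainty.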
Bayesian Nonparametric Graph Clustering
Banerjee, Sayantan; Akbani, Rehan; Baladandayuthapani, Veerabhadran
2015-01-01
We present clustering methods for multivariate data exploiting the underlying geometry of the graphical structure between variables. As opposed to standard approaches that assume known graph structures, we first estimate the edge structure of the unknown graph using Bayesian neighborhood selection approaches, wherein we account for the uncertainty of graphical structure learning through model-averaged estimates of the suitable parameters. Subsequently, we develop a nonparametric graph clustering method.
Variables influencing victim selection in genocide.
Komar, Debra A
2008-01-01
While victims of racially motivated violence may be identified through observation of morphological features, those targeted because of their ethnic, religious, or national identity are not easily recognized. This study examines how perpetrators of genocide recognize their victims. Court documents, including indictments, witness statements, and testimony from the International Criminal Tribunals for Rwanda and the former Yugoslavia (FY) detail the interactions between victim and assailant. A total of 6012 decedents were included in the study; only 20.8% had been positively identified. Variables influencing victim selection in Rwanda included location, segregation, incitement, and prior relationship, while significant factors in FY were segregation, location, age/gender, and social data. Additional contributing factors in both countries included self-identification, victim behavior, linguistic or clothing evidence, and morphological features. Understanding the system of recognition used by perpetrators aids investigators tasked with establishing victim identity in such prosecutions. PMID:18005010
Selection of Trusted Service Providers by Enforcing Bayesian Analysis in iVCE
GU Bao-jun; LI Xiao-yong; WANG Wei-nong
2008-01-01
The initiative of internet-based virtual computing environment (iVCE) aims to provide the end users and applications with a harmonious, trustworthy and transparent integrated computing environment which will facilitate the sharing and collaborative use of network resources between applications. Trust management is an elementary component for iVCE. The uncertain and dynamic characteristics of iVCE necessitate the requirement for the trust management to be subjective, historical evidence based and context dependent. This paper presents a Bayesian analysis-based trust model, which aims to secure the active agents for selecting appropriate trusted services in iVCE. Simulations are made to analyze the properties of the trust model, which show that the subjective prior information influences trust evaluation a lot and that the model stimulates positive interactions.
Kok-Chin Khor
2009-01-01
Problem statement: Implementing a single or multiple classifiers that involve a Bayesian Network (BN) is a rising research interest in the network intrusion detection domain. Approach: However, little attention has been given to evaluating the performance of BN classifiers before they are implemented in a real system. In this research, we proposed a novel approach to select important features by utilizing two feature selection algorithms based on the filter approach. Results: The selected features were further validated by domain experts, where extra features were added into the final proposed feature set. We then constructed three types of BN, namely Naive Bayes Classifiers (NBC), a learned BN and an expert-elicited BN, by utilizing a standard network intrusion dataset. The performance of each classifier was recorded. We found that there was no difference in the overall performance of the BNs and therefore concluded that the BNs performed equivalently well in detecting network attacks. Conclusion/Recommendations: The results of the study indicated that the BN built using the proposed feature set has fewer features but its performance was comparable to BNs built using other feature sets generated by the two algorithms.
Bayesian model selection validates a biokinetic model for zirconium processing in humans
Schmidl Daniel
2012-08-01
Background In radiation protection, biokinetic models for zirconium processing are of crucial importance in dose estimation and further risk analysis for humans exposed to this radioactive substance. They provide limiting values of detrimental effects and build the basis for applications in internal dosimetry, the prediction for radioactive zirconium retention in various organs as well as retrospective dosimetry. Multi-compartmental models are the tool of choice for simulating the processing of zirconium. Although easily interpretable, determining the exact compartment structure and interaction mechanisms is generally daunting. In the context of observing the dynamics of multiple compartments, Bayesian methods provide efficient tools for model inference and selection. Results We are the first to apply a Markov chain Monte Carlo approach to compute Bayes factors for the evaluation of two competing models for zirconium processing in the human body after ingestion. Based on in vivo measurements of human plasma and urine levels we were able to show that a recently published model is superior to the standard model of the International Commission on Radiological Protection. The Bayes factors were estimated by means of the numerically stable thermodynamic integration in combination with a recently developed copula-based Metropolis-Hastings sampler. Conclusions In contrast to the standard model the novel model predicts lower accretion of zirconium in bones. This results in lower levels of noxious doses for exposed individuals. Moreover, the Bayesian approach allows for retrospective dose assessment, including credible intervals for the initially ingested zirconium, in a significantly more reliable fashion than previously possible. All methods presented here are readily applicable to many modeling tasks in systems biology.
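The thermodynamic-integration estimate of the log evidence can be sketched on a conjugate toy model where the power posterior can be sampled exactly and the true answer is known in closed form. The model and temperature ladder below are illustrative; in the paper's setting the exact draws would be replaced by MCMC (e.g. its copula-based Metropolis-Hastings sampler).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
y = rng.normal(0.5, 1.0, size=25)
m = y.size

# Power posterior p_t proportional to prior * likelihood^t for the model
# y_i ~ N(mu, 1), mu ~ N(0, 1). Thermodynamic integration uses
#   log Z = integral_0^1 E_{p_t}[log L] dt.
# For this conjugate model, p_t is Gaussian and can be sampled exactly.
ts = np.linspace(0.0, 1.0, 21) ** 3          # ladder denser near t = 0
e_loglik = []
for t in ts:
    prec = 1.0 + t * m
    mu = rng.normal(t * y.sum() / prec, 1.0 / np.sqrt(prec), size=5000)
    ll = stats.norm.logpdf(y[:, None], mu, 1.0).sum(axis=0)
    e_loglik.append(ll.mean())               # Monte Carlo mean log likelihood
e_loglik = np.array(e_loglik)

# Trapezoidal rule over the temperature ladder
log_z_ti = np.sum(np.diff(ts) * (e_loglik[1:] + e_loglik[:-1]) / 2.0)

# Exact log evidence for validation (marginal of y is Gaussian)
log_z_exact = stats.multivariate_normal.logpdf(y, np.zeros(m),
                                               np.eye(m) + np.ones((m, m)))
print(log_z_ti, log_z_exact)
```

A Bayes factor then follows by running the same procedure for each competing model and differencing the two log-evidence estimates.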
Variable selection in model-based discriminant analysis
Maugis, Cathy; Celeux, Gilles; Martin-Magniette, Marie-Laure
2010-01-01
A general methodology for selecting predictors for Gaussian generative classification models is presented. The problem is regarded as a model selection problem. Three different roles for each possible predictor are considered: a variable can be a relevant classification predictor or not, and the irrelevant classification variables can be linearly dependent on a part of the relevant predictors or independent variables. This variable selection model was inspired by the model-based clustering model of the same authors.
A general adaptive modeling algorithm for selection and validation of coarse-grained models of atomistic systems is presented. A Bayesian framework is developed to address uncertainties in parameters, data, and model selection. Algorithms for computing output sensitivities to parameter variances, model evidence and posterior model plausibilities for given data, and for computing what are referred to as Occam Categories in reference to a rough measure of model simplicity, make up components of the overall approach. Computational results are provided for representative applications.
Comparison of Two Gas Selection Methodologies: An Application of Bayesian Model Averaging
Renholds, Andrea S.; Thompson, Sandra E.; Anderson, Kevin K.; Chilton, Lawrence K.
2006-03-31
One goal of hyperspectral imagery analysis is the detection and characterization of plumes. Characterization includes identifying the gases in the plumes, which is a model selection problem. Two gas selection methods compared in this report are Bayesian model averaging (BMA) and minimum Akaike information criterion (AIC) stepwise regression (SR). Simulated spectral data from a three-layer radiance transfer model were used to compare the two methods. Test gases were chosen to span the types of spectra observed, which exhibit peaks ranging from broad to sharp. The size and complexity of the search libraries were varied. Background materials were chosen to either replicate a remote area of eastern Washington or feature many common background materials. For many cases, BMA and SR performed the detection task comparably in terms of the receiver operating characteristic curves. For some gases, BMA performed better than SR when the size and complexity of the search library increased. This is encouraging because we expect improved BMA performance upon incorporation of prior information on background materials and gases.
A Bayesian Approach to Service Selection for Secondary Users in Cognitive Radio Networks
Elaheh Homayounvala
2015-10-01
In cognitive radio networks, where secondary users (SUs) use the time-frequency gaps of primary users' (PUs) licensed spectrum opportunistically, the experienced throughput of SUs depends not only on the traffic load of the PUs but also on the PUs' service type. Each service has its own pattern of channel usage, and if the SUs know the dominant pattern of primary channel usage, then they can make a better decision on choosing which service is better to be used at a specific time to get the best advantage of the primary channel, in terms of higher achievable throughput. However, it is difficult to directly inform SUs of PUs' dominant used services in each area, for practical reasons. This paper proposes a learning mechanism embedded in SUs to sense the primary channel for a specific length of time. This algorithm recommends that SUs, upon sensing a free primary channel, choose the best service in order to get the best performance, in terms of maximum achieved throughput and minimum experienced delay. The proposed learning mechanism is based on a Bayesian approach that can predict the performance of a requested service for a given SU. Simulation results show that this service selection method significantly outperforms blind opportunistic SU service selection.
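A Bayesian learn-then-select loop of the kind the abstract describes can be sketched as a Beta-Bernoulli scheme with Thompson sampling. The service names, success probabilities, and round count below are hypothetical, and this is a deliberate simplification of the paper's mechanism, not its actual algorithm.

```python
import numpy as np

rng = np.random.default_rng(11)

# Hypothetical setup: the SU chooses between two PU service types each
# round; a 'success' means it transmitted in a sensed gap without collision.
# The true probabilities are unknown to the SU; they only drive the simulator.
true_p = {"voice": 0.35, "bulk_data": 0.60}
alpha = {s: 1.0 for s in true_p}             # Beta(1, 1) priors per service
beta = {s: 1.0 for s in true_p}

for _ in range(500):
    # Thompson sampling: draw from each Beta posterior, pick the best draw
    s = max(true_p, key=lambda k: rng.beta(alpha[k], beta[k]))
    success = rng.random() < true_p[s]       # simulated channel feedback
    alpha[s] += success
    beta[s] += 1 - success

post_mean = {s: alpha[s] / (alpha[s] + beta[s]) for s in true_p}
best = max(post_mean, key=post_mean.get)
print(best, post_mean)
```

Over the rounds the posterior concentrates on the service with the better channel-usage pattern, so the SU's choices converge away from blind selection.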
Highlights: • Deduce secondary structure content of intrinsically disordered proteins from IR spectra. • Bayesian analysis to infer conformations of disordered regions of proteins from IR. • Comparison of measured and calculated IR spectra to obtain thermodynamic weights. - Abstract: As it remains practically impossible to generate ergodic ensembles for large intrinsically disordered proteins (IDP) with molecular dynamics (MD) simulations, it becomes critical to compare spectroscopic characteristics of the theoretically generated ensembles to corresponding measurements. We develop a Bayesian framework to infer the ensemble properties of an IDP using a combination of conformations generated by MD simulations and its measured infrared spectrum. We performed 100 different MD simulations totaling more than 10 μs to characterize the conformational ensemble of α-synuclein, a prototypical IDP, in water. These conformations are clustered based on solvent accessibility and helical content. We compute the amide-I band for these clusters and predict the thermodynamic weights of each cluster given the measured amide-I band. Bayesian analysis produces a reproducible and non-redundant set of thermodynamic weights for each cluster, which can then be used to calculate the ensemble properties. In a rigorous validation, these weights reproduce measured chemical shifts
Bayesian model selection for testing the no-hair theorem with black hole ringdowns
Gossan, S; Sathyaprakash, B S
2011-01-01
General relativity predicts that a black hole that results from the merger of two compact stars (either black holes or neutron stars) is initially highly deformed but soon settles down to a quiescent state by emitting a superposition of quasi-normal modes (QNMs). The QNMs are damped sinusoids with characteristic frequencies and decay times that depend only on the mass and spin of the black hole and no other parameter - a statement of the no-hair theorem. In this paper we have examined the extent to which QNMs could be used to test the no-hair theorem with future ground- and space-based gravitational-wave detectors. We model departures from general relativity (GR) by introducing extra parameters which change the mode frequencies or decay times from their general relativistic values. With the aid of numerical simulations and Bayesian model selection, we assess the extent to which the presence of such a parameter could be inferred, and its value estimated. We find that it is harder to decipher the departure of d...
Schöniger, Anneli; Illman, Walter A.; Wöhling, Thomas; Nowak, Wolfgang
2015-12-01
Groundwater modelers face the challenge of how to assign representative parameter values to the studied aquifer. Several approaches are available to parameterize spatial heterogeneity in aquifer parameters. They differ in their conceptualization and complexity, ranging from homogeneous models to heterogeneous random fields. While it is common practice to invest more effort into data collection for models with a finer resolution of heterogeneities, there is little guidance on how much data is required to justify a certain level of model complexity. In this study, we propose to use concepts related to Bayesian model selection to identify this balance. We demonstrate our approach on the characterization of a heterogeneous aquifer via hydraulic tomography in a sandbox experiment (Illman et al., 2010). We consider four increasingly complex parameterizations of hydraulic conductivity: (1) effective homogeneous medium, (2) geology-based zonation, (3) interpolation by pilot points, and (4) geostatistical random fields. First, we investigate the shift in justified complexity with an increasing amount of available data by constructing a model confusion matrix. This matrix indicates the maximum level of complexity that can be justified given a specific experimental setup. Second, we determine which parameterization is most adequate given the observed drawdown data. Third, we test how the different parameterizations perform in a validation setup. The results of our test case indicate that aquifer characterization via hydraulic tomography does not necessarily require (or justify) a geostatistical description. Instead, a zonation-based model might be a more robust choice, but only if the zonation is geologically adequate.
Bayesian Credit Ratings (new version)
Paola Cerchiello; Paolo Giudici
2013-01-01
In this contribution we aim at improving ordinal variable selection in the context of causal models. In this regard, we propose an approach that provides a formal inferential tool to compare the explanatory power of each covariate, and, therefore, to select an effective model for classification purposes. Our proposed model is Bayesian nonparametric, and, thus, keeps the amount of model specification to a minimum. We consider the case in which information from the covariates is at the ordinal ...
ANALYSIS OF BAYESIAN CLASSIFIER ACCURACY
Felipe Schneider Costa
2013-01-01
The naïve Bayes classifier is considered one of the most effective classification algorithms today, competing with more modern and sophisticated classifiers. Despite being based on the unrealistic (naïve) assumption that all variables are independent given the output class, the classifier provides proper results. However, depending on the scenario (network structure, number of samples or training cases, number of variables), the network may not provide appropriate results. This study uses a variable selection process based on the chi-squared test to verify the existence of dependence between variables in the data model, in order to identify the reasons that prevent a Bayesian network from providing good performance. A detailed analysis of the data is also proposed, unlike in other existing work, as well as adjustments for limit values between two adjacent classes. Furthermore, variable weights, calculated with the mutual information function, are used in the calculation of the a posteriori probabilities. Tests were applied to both a naïve Bayesian network and a hierarchical Bayesian network. After testing, a significant reduction in error rate was observed: the naïve Bayesian network's error rate dropped from twenty-five percent to five percent relative to the initial classification results, while in the hierarchical network the error rate not only dropped from fifteen percent but ultimately reached zero.
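The dependence check described above can be sketched with a Pearson chi-squared test of independence on a contingency table of two features; this is an illustrative stand-in, not the study's code, and the table values are made up.

```python
# Illustrative sketch: chi-squared test of independence between two
# categorical variables, as used to detect dependent feature pairs that
# violate the naive Bayes independence assumption.

def chi_squared_statistic(table):
    """Pearson chi-squared statistic for an r x c contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat

# Two features cross-tabulated over training cases; a statistic above the
# critical value (3.84 at alpha = 0.05 for 1 degree of freedom) suggests
# the pair is dependent.
table = [[30, 10],
         [10, 30]]
print(chi_squared_statistic(table))  # 20.0
```

A statistic of 20.0 here far exceeds the 1-df critical value, so this feature pair would be flagged as dependent.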
Hug, Sabine Carolin
2015-01-01
In this thesis we use differential equations for mathematically representing biological processes. For this we have to infer the associated parameters for fitting the differential equations to measurement data. If the structure of the ODE itself is uncertain, model selection methods have to be applied. We refine several existing Bayesian methods, ranging from an adaptive scheme for the computation of high-dimensional integrals to multi-chain Metropolis-Hastings algorithms for high-dimensional...
Schoelzel, C. [Bonn Univ. (Germany). Meteorologisches Inst.]
2006-07-01
This thesis presents the development of statistical climatological-botanical transfer functions in order to provide reconstructions of Holocene climate variability in the Near East region. Two classical concepts, the biomisation as well as the indicator taxa approach, are translated into a Bayesian network. Fossil pollen spectra of laminated sediments from the Ein Gedi location at the western shoreline of the Dead Sea and from the crater lake Birkat Ram in the northern Golan serve as proxy data, covering the past 10000 and 6500 years, respectively. The climatological variables are winter temperature, summer temperature, and annual precipitation, obtained from the 0.5 x 0.5 degree climatology CRU TS 1.0. The Bayesian biome model is based on the three main vegetation territories, the Mediterranean, the Irano-Turanian, and the Saharo-Arabian territory, which are digitized on the same grid as the climate data. From their spatial extent, a classification in the phase space is described by estimating the conditional probability for the existence of a certain biome given the climate. These biome-specific likelihood functions are modelled by a generalised linear model, including second order monomials of the climate variables. A statistical mixture model is applied to the biome probabilities as estimated from the Ein Gedi data, resulting in a posterior probability density function for the three-dimensional climate state vector. The indicator taxa model is based on the distribution of 15 Mediterranean taxa. Their spatial extent allows estimation of the taxon-specific likelihood functions. In this case, these are conditional probability density functions for the climate state vector given the existence of a certain taxon. In order to address the general problem of multivariate non-normally distributed populations, multivariate normal copulas are used, which make it possible to create distribution functions with gamma as well as normal marginal distributions. Applying the model to the Birkat
Using Instrumental Variables Properly to Account for Selection Effects
Porter, Stephen R.
2012-01-01
Selection bias is problematic when evaluating the effects of postsecondary interventions on college students, and can lead to biased estimates of program effects. While instrumental variables can be used to account for endogeneity due to self-selection, current practice requires that all five assumptions of instrumental variables be met in order…
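The core idea behind instrumental variables can be sketched with the simplest just-identified case, the Wald/ratio estimator, where a single instrument z shifts the endogenous regressor x but affects the outcome y only through x. This is a generic illustration, not the article's procedure, and the toy numbers are made up.

```python
# Wald/ratio instrumental-variables estimator: beta_IV = cov(z, y) / cov(z, x).

def cov(a, b):
    """Sample covariance of two equal-length sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / (len(a) - 1)

def iv_estimate(z, x, y):
    """Causal slope of y on x, identified through the instrument z."""
    return cov(z, y) / cov(z, x)

# Toy data where y = 2x exactly, so the causal slope is 2.
z = [0, 0, 1, 1]
x = [1, 2, 3, 4]
y = [2, 4, 6, 8]
print(iv_estimate(z, x, y))  # 2.0
```

The assumptions the abstract alludes to (instrument relevance, exclusion, and independence) are exactly what make dividing these two covariances a valid causal estimate.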
W David Walter
Bovine tuberculosis is a bacterial disease caused by Mycobacterium bovis in livestock and wildlife, with hosts that include Eurasian badgers (Meles meles), brushtail possums (Trichosurus vulpecula), and white-tailed deer (Odocoileus virginianus). Risk-assessment efforts in Michigan have been initiated on farms to minimize interactions of cattle with wildlife hosts, but research on M. bovis on cattle farms has not investigated the spatial context of disease epidemiology. To incorporate spatially explicit data, initial likelihood-of-infection probabilities for cattle farms tested for M. bovis, the prevalence of M. bovis in white-tailed deer, deer density, and environmental variables for each farm were modeled in a Bayesian hierarchical framework. We used geo-referenced locations of 762 cattle farms that had been tested for M. bovis, white-tailed deer prevalence, and several environmental variables that may support long-term survival and viability of M. bovis on farms and in surrounding habitats (i.e., soil type, habitat type). Bayesian hierarchical analyses identified deer prevalence and the proportion of sandy soil within our sampling grid as the most supported model. Analysis of the tested cattle farms indicated that every 1% increase in sandy soil raised the odds of infection by 4%. Our analysis revealed that the influence of M. bovis prevalence in white-tailed deer remained a concern even after considerable efforts to prevent cattle interactions with white-tailed deer through on-farm mitigation and reduction of the deer population. Cattle farms test positive for M. bovis annually in our study area, suggesting that an environmental source, either on farms or in the surrounding landscape, may be contributing to new infections or re-infections with M. bovis. Our research provides an initial assessment of potential environmental factors that could be incorporated into additional modeling efforts as more knowledge of deer herd
Furtado-Junior, I; Abrunhosa, F A; Holanda, F C A F; Tavares, M C S
2016-06-01
Fishing selectivity of the mangrove crab Ucides cordatus on the north coast of Brazil can be defined as the fisherman's ability to capture and select individuals of a certain size or sex (or a combination of these factors), which suggests an empirical selectivity. Considering this hypothesis, we calculated selectivity curves for male and female crabs using the logit formulation of the logistic model. Bayesian inference consisted of obtaining the posterior distribution by applying the Markov chain Monte Carlo (MCMC) method in the R software using the OpenBUGS, BRugs, and R2WinBUGS libraries. The estimated mean carapace width at selection for males and females, compared with previous studies reporting the mean carapace width at sexual maturity, allows us to confirm the hypothesis that most mature individuals do not suffer fishing pressure, thus ensuring their sustainability. PMID:26934154
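The logistic selection curve underlying the study can be sketched as below; the intercept and slope values here are hypothetical point values (the paper estimates their posteriors by MCMC in R/OpenBUGS).

```python
import math

def selectivity(width, a, b):
    """Logistic selection curve: probability that a crab of the given
    carapace width is retained by the fishery."""
    return 1.0 / (1.0 + math.exp(-(a + b * width)))

def width_at_50(a, b):
    """Carapace width at 50% selection, the point where a + b*width = 0."""
    return -a / b

a, b = -6.0, 0.1                  # hypothetical intercept and slope
print(width_at_50(a, b))          # 60.0
print(selectivity(60.0, a, b))    # 0.5
```

Comparing the width at 50% selection against the width at sexual maturity is exactly the comparison the abstract uses to argue that mature crabs escape fishing pressure.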
Cha, YoonKyung; Soon Park, Seok; Won Lee, Hye; Stow, Craig A.
2016-01-01
Modeling to accurately predict river phytoplankton distribution and abundance is important in water quality and resource management. Nevertheless, the complex nature of eutrophication processes in highly connected river systems makes the task challenging. To model dynamics of river phytoplankton, represented by chlorophyll a (Chl a) concentration, we propose a Bayesian hierarchical model that explicitly accommodates seasonality and upstream-downstream spatial gradient in the structure. The utility of our model is demonstrated with an application to the Nakdong River (South Korea), which is a eutrophic, intensively regulated river, but functions as an irreplaceable water source for more than 13 million people. Chl a is modeled with two manageable factors, river flow, and total phosphorus (TP) concentration. Our model results highlight the importance of taking seasonal and spatial context into account when describing flow regimes and phosphorus delivery in rivers. A contrasting positive Chl a-flow relationship across stations versus negative Chl a-flow slopes that arose when Chl a was modeled on a station-month basis is an illustration of Simpson's paradox, which necessitates modeling Chl a-flow relationships decomposed into seasonal and spatial components. Similar Chl a-TP slopes among stations and months suggest that, with the flow effect removed, positive TP effects on Chl a are uniform regardless of the season and station in the river. Our model prediction successfully captured the shift in the spatial and monthly patterns of Chl a.
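The Simpson's paradox the authors describe, a positive pooled Chl a-flow relationship coexisting with negative within-group slopes, can be reproduced with a tiny synthetic example (the numbers below are invented for illustration, not Nakdong River data).

```python
# Demonstration of Simpson's paradox: each station shows a negative slope,
# yet pooling the stations yields a positive slope.

def slope(x, y):
    """Ordinary least-squares slope of y on x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

stations = {
    "upstream":   ([1, 2, 3], [10, 9, 8]),    # (flow, Chl a), illustrative
    "downstream": ([6, 7, 8], [15, 14, 13]),
}
within = {name: slope(x, y) for name, (x, y) in stations.items()}
pooled_x = [v for x, _ in stations.values() for v in x]
pooled_y = [v for _, y in stations.values() for v in y]

print(within)                                # both slopes are -1.0
print(round(slope(pooled_x, pooled_y), 2))   # 0.81 (positive)
```

This is why the model decomposes the Chl a-flow relationship into seasonal and spatial components: the pooled slope alone points in the wrong direction.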
Research on Some Questions About Selection of Independent Variables
TAO Jing-xuan
2002-01-01
The paper studies four methods for selecting independent variables in multivariate analysis. In general, forward selection and backward elimination cannot guarantee the best subset of independent variables; the outcome may be affected by the order of the variables or by associations among them. When multicollinearity is present in a set of explanatory variables (an abnormal state), these methods are not effective, even though stepwise regression and all-subsets selection are widely used. For this case, the paper proposes a new method that combines variable deletion with component analysis and applies it in research and scientific practice. An important characteristic of the paper is that it supports each conclusion with examples.
A Variable-Selection Heuristic for K-Means Clustering.
Brusco, Michael J.; Cradit, J. Dennis
2001-01-01
Presents a variable selection heuristic for nonhierarchical (K-means) cluster analysis based on the adjusted Rand index for measuring cluster recovery. Subjected the heuristic to Monte Carlo testing across more than 2,200 datasets. Results indicate that the heuristic is extremely effective at eliminating masking variables. (SLD)
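The adjusted Rand index used above to measure cluster recovery can be computed directly from the contingency table of two partitions; this is a generic sketch of the standard formula, not the heuristic itself.

```python
from math import comb
from collections import Counter

def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two labelings: 1 for identical
    partitions, around 0 for chance-level agreement."""
    n = len(labels_a)
    cells = Counter(zip(labels_a, labels_b))      # contingency table
    a_sizes = Counter(labels_a)
    b_sizes = Counter(labels_b)
    index = sum(comb(c, 2) for c in cells.values())
    sum_a = sum(comb(c, 2) for c in a_sizes.values())
    sum_b = sum(comb(c, 2) for c in b_sizes.values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    return (index - expected) / (max_index - expected)

# Identical partitions (up to relabeling) score 1; a crossed partition
# scores below chance.
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))          # 1.0
print(round(adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1]), 2))  # -0.5
```

Because the index is corrected for chance, it is a natural objective for deciding whether dropping a candidate masking variable improves K-means recovery.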
A numeric comparison of variable selection algorithms for supervised learning
Datasets in modern High Energy Physics (HEP) experiments are often described by dozens or even hundreds of input variables. Reducing a full variable set to a subset that most completely represents information about data is therefore an important task in analysis of HEP data. We compare various variable selection algorithms for supervised learning using several datasets such as the imaging gamma-ray Cherenkov telescope (MAGIC) data found at the UCI repository. We use classifiers and variable selection methods implemented in the statistical package StatPatternRecognition (SPR), a free open-source C++ package developed in the HEP community (http://sourceforge.net/projects/statpatrec/). For each dataset, we select a powerful classifier and estimate its learning accuracy on variable subsets obtained by various selection algorithms. When possible, we also estimate the CPU time needed for the variable subset selection. The results of this analysis are compared with those published previously for these datasets using other statistical packages such as R and Weka. We show that the most accurate, yet slowest, method is a wrapper algorithm known as generalized sequential forward selection ('Add N Remove R') implemented in SPR.
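Plain sequential forward selection, the simplest member of the wrapper family that 'Add N Remove R' generalizes, can be sketched as below. The score function is a stand-in for a classifier's estimated learning accuracy on a subset; the variable names and utilities are invented.

```python
# Greedy wrapper: repeatedly add the variable that most improves the score.

def forward_selection(variables, score, n_select):
    """Return n_select variables chosen by greedy forward inclusion."""
    selected, remaining = [], list(variables)
    while remaining and len(selected) < n_select:
        best = max(remaining, key=lambda v: score(selected + [v]))
        selected.append(best)
        remaining.remove(best)
    return selected

# Hypothetical additive utilities standing in for classifier accuracy.
utility = {"x1": 0.9, "x2": 0.1, "x3": 0.5}
score = lambda subset: sum(utility[v] for v in subset)
print(forward_selection(["x1", "x2", "x3"], score, 2))  # ['x1', 'x3']
```

The wrapper's cost is visible even in this sketch: every candidate addition re-evaluates the scorer, which is why the abstract reports it as the most accurate but slowest method.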
Impact of Frequentist and Bayesian Methods on Survey Sampling Practice: A Selective Appraisal
Rao, J. N. K.
2011-01-01
According to Hansen, Madow and Tepping [J. Amer. Statist. Assoc. 78 (1983) 776--793], "Probability sampling designs and randomization inference are widely accepted as the standard approach in sample surveys." In this article, reasons are advanced for the wide use of this design-based approach, particularly by federal agencies and other survey organizations conducting complex large scale surveys on topics related to public policy. Impact of Bayesian methods in survey sampling is also discussed...
Rodríguez-Prieto Víctor
2012-08-01
Background: Bovine tuberculosis (bTB) is a chronic infectious disease mainly caused by Mycobacterium bovis. Although eradication is a priority for the European authorities, bTB remains active or is even increasing in many countries, causing significant economic losses. The integral consideration of epidemiological factors is crucial to allocating control measures more cost-effectively. The aim of this study was to identify the nature and extent of the association between TB distribution and a list of potential risk factors regarding cattle, wild ungulates and environmental aspects in Ciudad Real, a Spanish province with one of the highest TB herd prevalences. Results: We used a Bayesian mixed effects multivariable logistic regression model to predict TB occurrence in either domestic or wild mammals per municipality in 2007 by using information from the previous year. The municipal TB distribution and endemicity were clustered in the western part of the region and clearly overlapped with the explanatory variables identified in the final model: (1) incident cattle farms, (2) number of years of veterinary inspection of big game hunting events, (3) prevalence in wild boar, (4) number of sampled cattle, (5) persistent bTB-infected cattle farms, (6) prevalence in red deer, (7) proportion of beef farms, and (8) farms devoted to bullfighting cattle. Conclusions: The combination of these eight variables in the final model highlights the importance of the persistence of the infection in the hosts, surveillance efforts and some cattle management choices in the circulation of M. bovis in the region. The spatial distribution of these variables, together with particular Mediterranean features that favour the wildlife-livestock interface, may explain the persistence of M. bovis in this region. Sanitary authorities should allocate efforts towards specific areas and epidemiological situations where the wildlife-livestock interface seems to critically hamper the definitive b
Variable selection and estimation for longitudinal survey data
Wang, Li
2014-09-01
There is wide interest in studying longitudinal surveys where sample subjects are observed successively over time. Longitudinal surveys have been used in many areas today, for example, in the health and social sciences, to explore relationships or to identify significant variables in regression settings. This paper develops a general strategy for the model selection problem in longitudinal sample surveys. A survey weighted penalized estimating equation approach is proposed to select significant variables and estimate the coefficients simultaneously. The proposed estimators are design consistent and perform as well as the oracle procedure when the correct submodel was known. The estimating function bootstrap is applied to obtain the standard errors of the estimated parameters with good accuracy. A fast and efficient variable selection algorithm is developed to identify significant variables for complex longitudinal survey data. Simulated examples are illustrated to show the usefulness of the proposed methodology under various model settings and sampling designs. © 2014 Elsevier Inc.
2010-01-01
Genetic markers can be used as instrumental variables, in an analogous way to randomization in a clinical trial, to estimate the causal relationship between a phenotype and an outcome variable. Our purpose is to extend the existing methods for such Mendelian randomization studies to the context … an overall estimate of the causal relationship between the phenotype and the outcome, and an assessment of its heterogeneity across studies. As an example, we estimate the causal relationship of blood concentrations of C-reactive protein on fibrinogen levels using data from 11 studies. These methods provide … a flexible framework for efficient estimation of causal relationships derived from multiple studies. Issues discussed include weak instrument bias, analysis of binary outcome data such as disease risk, missing genetic data, and the use of haplotypes…
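Combining per-study causal estimates into an overall estimate, as described above, is commonly done by fixed-effect inverse-variance weighting; this is a generic sketch of that step, not the paper's full multi-study framework, and the numbers are illustrative.

```python
# Fixed-effect inverse-variance weighted (IVW) meta-analysis of study-level
# causal estimates; each study is weighted by the inverse of its variance.

def inverse_variance_weighted(estimates, std_errors):
    """Return (pooled estimate, pooled standard error)."""
    weights = [1.0 / se ** 2 for se in std_errors]
    pooled = sum(w * b for w, b in zip(weights, estimates)) / sum(weights)
    pooled_se = (1.0 / sum(weights)) ** 0.5
    return pooled, pooled_se

# Two hypothetical studies with equal precision: the pooled estimate is
# their simple average, and the pooled SE shrinks below either study's SE.
est, se = inverse_variance_weighted([1.0, 2.0], [1.0, 1.0])
print(est)  # 1.5
```

Comparing the individual estimates against the pooled value is also the starting point for the heterogeneity assessment the abstract mentions.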
Deng, Bai-chuan; Yun, Yong-huan; Liang, Yi-zeng; Yi, Lun-zhao
2014-10-01
In this study, a new optimization algorithm called the Variable Iterative Space Shrinkage Approach (VISSA) that is based on the idea of model population analysis (MPA) is proposed for variable selection. Unlike most of the existing optimization methods for variable selection, VISSA statistically evaluates the performance of variable space in each step of optimization. Weighted binary matrix sampling (WBMS) is proposed to generate sub-models that span the variable subspace. Two rules are highlighted during the optimization procedure. First, the variable space shrinks in each step. Second, the new variable space outperforms the previous one. The second rule, which is rarely satisfied in most of the existing methods, is the core of the VISSA strategy. Compared with some promising variable selection methods such as competitive adaptive reweighted sampling (CARS), Monte Carlo uninformative variable elimination (MCUVE) and iteratively retaining informative variables (IRIV), VISSA showed better prediction ability for the calibration of NIR data. In addition, VISSA is user-friendly; only a few insensitive parameters are needed, and the program terminates automatically without any additional conditions. The Matlab codes for implementing VISSA are freely available on the website: https://sourceforge.net/projects/multivariateanalysis/files/VISSA/. PMID:25083512
Ball, Jessica Lynne
Light Detection and Ranging (LiDAR) data has shown great potential to estimate spatially explicit forest variables, including above-ground biomass, stem density, tree height, and more. Due to its ability to garner information about the vertical and horizontal structure of forest canopies effectively and efficiently, LiDAR sensors have played a key role in the development of operational air and space-borne instruments capable of gathering information about forest structure at regional, continental, and global scales. Combining LiDAR datasets with field-based validation measurements to build predictive models is becoming an attractive solution to the problem of quantifying and mapping forest structure for private forest land owners and local, state, and federal government entities alike. As with any statistical model using spatially indexed data, the potential to violate modeling assumptions resulting from spatial correlation is high. This thesis explores several different modeling frameworks that aim to accommodate correlation structures within model residuals. The development is motivated using LiDAR and forest inventory datasets. Special attention is paid to estimation and propagation of parameter and model uncertainty through to prediction units. Inference follows a Bayesian statistical paradigm. Results suggest the proposed frameworks help ensure model assumptions are met and prediction performance can be improved by pursuing spatially enabled models.
Bayesian feature weighting for unsupervised learning, with application to object recognition
Carbonetto, Peter; De Freitas, Nando; Gustafson, Paul; Thompson, Natalie
2003-01-01
We present a method for variable selection/weighting in an unsupervised learning context using Bayesian shrinkage. The model parameters and cluster assignments can be computed simultaneously using an efficient EM algorithm. Applying our Bayesian shrinkage model to a complex problem in object recognition (Duygulu, Barnard, de Freitas and Forsyth 2002), our experiments yielded good results.
Optimal speech motor control and token-to-token variability: a Bayesian modeling approach
Patri, Jean-François; Diard, Julien; Perrier, Pascal
2015-01-01
The remarkable capacity of the speech motor system to adapt to various speech conditions is due to an excess of degrees of freedom, which enables producing similar acoustical properties with different sets of control strategies. To explain how the Central Nervous System selects one of the possible strategies, a common approach, in line with optimal motor control theories, is to model speech motor planning as the solution of an optimality problem based on cost functions. Despite the success of...
Mabaso Musawenkosi LH
2007-09-01
Background: Several malaria risk maps have been developed in recent years, many from the prevalence of infection data collated by the MARA (Mapping Malaria Risk in Africa) project and using various environmental data sets as predictors. Variable selection is a major obstacle due to analytical problems caused by over-fitting, confounding and non-independence in the data. Testing and comparing every combination of explanatory variables in a Bayesian spatial framework remains unfeasible for most researchers. The aim of this study was to develop a malaria risk map using a systematic and practicable variable selection process for spatial analysis and mapping of historical malaria risk in Botswana. Results: Of 50 potential explanatory variables from eight environmental data themes, 42 were significantly associated with malaria prevalence in univariate logistic regression and were ranked by the Akaike Information Criterion. Variables correlated with higher-ranking relatives of the same environmental theme were temporarily excluded. The remaining 14 candidates were ranked by selection frequency after running automated step-wise selection procedures on 1000 bootstrap samples drawn from the data. A non-spatial multiple-variable model was developed through step-wise inclusion in order of selection frequency. Previously excluded variables were then re-evaluated for inclusion, using further step-wise bootstrap procedures, resulting in the exclusion of another variable. Finally, a Bayesian geo-statistical model using Markov chain Monte Carlo simulation was fitted to the data, resulting in a final model of three predictor variables, namely summer rainfall, mean annual temperature and altitude. Each was independently and significantly associated with malaria prevalence after allowing for spatial correlation. This model was used to predict malaria prevalence at unobserved locations, producing a smooth risk map for the whole country. Conclusion: We have
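The bootstrap ranking step above can be sketched as follows: run a selection procedure on many bootstrap resamples and rank candidates by how often they are retained. The univariate correlation screen below is a simple placeholder for the paper's step-wise procedures, and the data are invented.

```python
import random

def corr(x, y):
    """Pearson correlation, returning 0.0 for degenerate samples."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return 0.0 if sxx == 0 or syy == 0 else sxy / (sxx * syy) ** 0.5

def screen(sample, y, cut=0.5):
    """Placeholder selector: keep variables with |correlation| above cut."""
    return [v for v, vals in sample.items() if abs(corr(vals, y)) > cut]

def selection_frequency(data, outcome, select, n_boot=200, seed=1):
    """Rank variables by how often `select` retains them across bootstraps."""
    rng = random.Random(seed)
    n, counts = len(outcome), {v: 0 for v in data}
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        sample = {v: [vals[i] for i in idx] for v, vals in data.items()}
        y = [outcome[i] for i in idx]
        for v in select(sample, y):
            counts[v] += 1
    return sorted(counts.items(), key=lambda kv: -kv[1])

# Toy data: 'rain' drives the outcome exactly, 'temp' is noise.
data = {"rain": [1, 2, 3, 4, 5, 6, 7, 8],
        "temp": [5, 1, 4, 2, 8, 3, 7, 6]}
outcome = [2, 4, 6, 8, 10, 12, 14, 16]
ranked = selection_frequency(data, outcome, screen)
print(ranked[0][0])  # rain
```

Ranking by resampling frequency, rather than by a single fit, is what makes the procedure robust to the over-fitting and non-independence problems the abstract describes.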
Variable Selection in the Partially Linear Errors-in-Variables Models for Longitudinal Data
Yi-ping YANG; Liu-gen XUE; Wei-hu CHENG
2012-01-01
This paper proposes a new approach for variable selection in partially linear errors-in-variables (EV) models for longitudinal data by penalizing appropriate estimating functions. We apply the SCAD penalty to simultaneously select significant variables and estimate unknown parameters. The rate of convergence and the asymptotic normality of the resulting estimators are established. Furthermore, with proper choice of regularization parameters, we show that the proposed estimators perform as well as the oracle procedure. A new algorithm is proposed for solving the penalized estimating equation. The asymptotic results are augmented by a simulation study.
Sparse covariance thresholding for high-dimensional variable selection
Daye, X. Jessie Jeng And Z. John
2010-01-01
In high dimensions, many variable selection methods, such as the lasso, are often limited by excessive variability and rank deficiency of the sample covariance matrix. Covariance sparsity is a natural phenomenon in high-dimensional applications, such as microarray analysis, image processing, etc., in which a large number of predictors are independent or weakly correlated. In this paper, we propose the covariance-thresholded lasso, a new class of regression methods that can utilize covariance ...
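The thresholding operation at the core of this idea can be sketched in a few lines: small off-diagonal entries of the sample covariance are set to zero before the matrix is used in regression. The cutoff and matrix below are arbitrary illustrations, not the paper's tuning rule.

```python
# Hard-threshold a covariance matrix: zero out weak off-diagonal entries
# while keeping the diagonal (variances) intact.

def threshold_covariance(S, cutoff):
    p = len(S)
    return [[S[i][j] if i == j or abs(S[i][j]) >= cutoff else 0.0
             for j in range(p)] for i in range(p)]

S = [[1.00, 0.05, 0.60],
     [0.05, 1.00, 0.02],
     [0.60, 0.02, 1.00]]
T = threshold_covariance(S, 0.1)
print(T[0][1], T[0][2])  # 0.0 0.6
```

Zeroing spurious small correlations stabilizes the covariance estimate, which is exactly the rank-deficiency problem the abstract says limits the plain lasso.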
Bayesian model selection framework for identifying growth patterns in filamentous fungi.
Lin, Xiao; Terejanu, Gabriel; Shrestha, Sajan; Banerjee, Sourav; Chanda, Anindya
2016-06-01
This paper describes a rigorous methodology for quantification of model errors in fungal growth models. This is essential to choose the model that best describes the data and guide modeling efforts. Mathematical modeling of growth of filamentous fungi is necessary in fungal biology for gaining systems level understanding on hyphal and colony behaviors in different environments. A critical challenge in the development of these mathematical models arises from the indeterminate nature of their colony architecture, which is a result of processing diverse intracellular signals induced in response to a heterogeneous set of physical and nutritional factors. There exists a practical gap in connecting fungal growth models with measurement data. Here, we address this gap by introducing the first unified computational framework based on Bayesian inference that can quantify individual model errors and rank the statistical models based on their descriptive power against data. We show that this Bayesian model comparison is just a natural formalization of Occam's razor. The application of this framework is discussed in comparing three models in the context of synthetic data generated from a known true fungal growth model. This framework of model comparison achieves a trade-off between data fitness and model complexity and the quantified model error not only helps in calibrating and comparing the models, but also in making better predictions and guiding model refinements. PMID:27000772
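The data-fit versus complexity trade-off that the authors formalize can be illustrated with BIC, a classical asymptotic approximation to the Bayesian model evidence (a generic sketch, not the paper's framework; the residual sums of squares below are invented).

```python
import math

def bic_gaussian(rss, n, k):
    """BIC (up to an additive constant) for a least-squares model with
    n data points, residual sum of squares rss, and k free parameters."""
    return n * math.log(rss / n) + k * math.log(n)

# A 5-parameter model fits slightly better than a 2-parameter one, but the
# complexity penalty makes the simpler model preferable: Occam's razor.
simple_bic = bic_gaussian(rss=10.0, n=50, k=2)
complex_bic = bic_gaussian(rss=9.5, n=50, k=5)
print(simple_bic < complex_bic)  # True
```

The fully Bayesian evidence used in the paper applies the same penalty automatically by integrating over parameter space rather than counting parameters.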
A New Statistic for Variable Selection in Questionnaire Analysis
ZHANG Jun-hua; FANG Wei-wu
2001-01-01
In this paper, a new statistic is proposed for variable selection, which is one of the important problems in the analysis of questionnaire data. In contrast to other methods, the approach introduced here can be used not only for two groups of samples but can also be easily generalized to the multi-group case.
Mark I Rowley
We present novel Bayesian methods for the analysis of exponential decay data that exploit the evidence carried by every detected decay event and enable robust extension to advanced processing. Our algorithms are presented in the context of fluorescence lifetime imaging microscopy (FLIM), and particular attention has been paid to modelling the time-domain system (based on time-correlated single photon counting) with unprecedented accuracy. We present estimates of decay parameters for mono- and bi-exponential systems, offering up to a factor of two improvement in accuracy compared to previous popular techniques. Results of the analysis of synthetic and experimental data are presented, and areas where the superior precision of our techniques can be exploited in Förster Resonance Energy Transfer (FRET) experiments are described. Furthermore, we demonstrate two advanced processing methods: decay model selection to choose between differing models, such as mono- and bi-exponential, and the simultaneous estimation of instrument and decay parameters.
Pierre Berthet
2012-10-01
Several studies have shown a strong involvement of the basal ganglia (BG) in action selection and dopamine dependent learning. The dopaminergic signal to striatum, the input stage of the BG, has been commonly described as coding a reward prediction error (RPE), i.e. the difference between the predicted and actual reward. The RPE has been hypothesized to be critical in the modulation of the synaptic plasticity in cortico-striatal synapses in the direct and indirect pathway. We developed an abstract computational model of the BG, with a dual pathway structure functionally corresponding to the direct and indirect pathways, and compared its behaviour to biological data as well as other reinforcement learning models. The computations in our model are inspired by Bayesian inference, and the synaptic plasticity changes depend on a three factor Hebbian-Bayesian learning rule based on co-activation of pre- and post-synaptic units and on the value of the RPE. The model builds on a modified Actor-Critic architecture and implements the direct (Go) and the indirect (NoGo) pathway, as well as the reward prediction (RP) system, acting in a complementary fashion. We investigated the performance of the model system when different configurations of the Go, NoGo and RP system were utilized, e.g. using only the Go, NoGo, or RP system, or combinations of those. Learning performance was investigated in several types of learning paradigms, such as learning-relearning, successive learning, stochastic learning, reversal learning and a two-choice task. The RPE and the activity of the model during learning were similar to monkey electrophysiological and behavioural data. Our results, however, show that there is not a unique best way to configure this BG model to handle well all the learning paradigms tested. We thus suggest that an agent might dynamically configure its action selection mode, possibly depending on task characteristics and also on how much time is available.
Berthet, Pierre; Hellgren-Kotaleski, Jeanette; Lansner, Anders
2012-01-01
Several studies have shown a strong involvement of the basal ganglia (BG) in action selection and dopamine dependent learning. The dopaminergic signal to striatum, the input stage of the BG, has been commonly described as coding a reward prediction error (RPE), i.e., the difference between the predicted and actual reward. The RPE has been hypothesized to be critical in the modulation of the synaptic plasticity in cortico-striatal synapses in the direct and indirect pathway. We developed an abstract computational model of the BG, with a dual pathway structure functionally corresponding to the direct and indirect pathways, and compared its behavior to biological data as well as other reinforcement learning models. The computations in our model are inspired by Bayesian inference, and the synaptic plasticity changes depend on a three factor Hebbian-Bayesian learning rule based on co-activation of pre- and post-synaptic units and on the value of the RPE. The model builds on a modified Actor-Critic architecture and implements the direct (Go) and the indirect (NoGo) pathway, as well as the reward prediction (RP) system, acting in a complementary fashion. We investigated the performance of the model system when different configurations of the Go, NoGo, and RP system were utilized, e.g., using only the Go, NoGo, or RP system, or combinations of those. Learning performance was investigated in several types of learning paradigms, such as learning-relearning, successive learning, stochastic learning, reversal learning and a two-choice task. The RPE and the activity of the model during learning were similar to monkey electrophysiological and behavioral data. Our results, however, show that there is not a unique best way to configure this BG model to handle well all the learning paradigms tested. We thus suggest that an agent might dynamically configure its action selection mode, possibly depending on task characteristics and also on how much time is available. PMID:23060764
We investigate the use of optical photometric variability to select and identify blazars in large-scale time-domain surveys, in part to aid in the identification of blazar counterparts to the ∼30% of γ-ray sources in the Fermi 2FGL catalog still lacking reliable associations. Using data from the optical LINEAR asteroid survey, we characterize the optical variability of blazars by fitting a damped random walk model to individual light curves with two main model parameters, the characteristic timescales of variability τ, and driving amplitudes on short timescales σ̂. Imposing cuts on minimum τ and σ̂ allows for blazar selection with high efficiency E and completeness C. To test the efficacy of this approach, we apply this method to optically variable LINEAR objects that fall within the several-arcminute error ellipses of γ-ray sources in the Fermi 2FGL catalog. Despite the extreme stellar contamination at the shallow depth of the LINEAR survey, we are able to recover previously associated optical counterparts to Fermi active galactic nuclei with E ≥ 88% and C = 88% in Fermi 95% confidence error ellipses having semimajor axis r < 8'. We find that the suggested radio counterpart to Fermi source 2FGL J1649.6+5238 has optical variability consistent with other γ-ray blazars and is likely to be the γ-ray source. Our results suggest that the variability of the non-thermal jet emission in blazars is stochastic in nature, with unique variability properties due to the effects of relativistic beaming. After correcting for beaming, we estimate that the characteristic timescale of blazar variability is ∼3 years in the rest frame of the jet, in contrast with the ∼320 day disk flux timescale observed in quasars. The variability-based selection method presented will be useful for blazar identification in time-domain optical surveys and is also a probe of jet physics.
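The cut-based selection described above can be sketched in a few lines. Everything here is hypothetical: the lognormal (τ, σ̂) distributions and the cut values are invented for illustration, not taken from the LINEAR analysis; only the definitions of efficiency E (purity of the selected sample) and completeness C (fraction of blazars recovered) follow the abstract.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical damped-random-walk fit results for a mixed sample: blazars
# tend to have longer characteristic timescales tau and larger driving
# amplitudes sigma_hat than ordinary variable stars (assumed distributions).
n_blazar, n_star = 200, 2000
tau = np.concatenate([rng.lognormal(5.0, 0.5, n_blazar),    # blazars
                      rng.lognormal(2.0, 0.5, n_star)])     # stellar contaminants
sigma_hat = np.concatenate([rng.lognormal(-1.0, 0.3, n_blazar),
                            rng.lognormal(-3.0, 0.5, n_star)])
is_blazar = np.concatenate([np.ones(n_blazar, bool), np.zeros(n_star, bool)])

# Selection: impose minimum cuts on tau and sigma_hat (cut values invented).
selected = (tau > np.exp(3.5)) & (sigma_hat > np.exp(-2.0))

efficiency = is_blazar[selected].mean()     # fraction of selected that are blazars
completeness = selected[is_blazar].mean()   # fraction of blazars recovered
print(f"E = {efficiency:.2f}, C = {completeness:.2f}")
```

With well-separated populations, as assumed here, both E and C come out close to 1; the interesting regime in real surveys is tuning the cuts to trade one against the other.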
Barcella, William; Iorio, Maria De; Baio, Gianluca; Malone-Lee, James
2016-04-15
Lower urinary tract symptoms can indicate the presence of urinary tract infection (UTI), a condition that, if it becomes chronic, requires expensive and time-consuming care and leads to reduced quality of life. Detecting the presence and gravity of an infection from the earliest symptoms is therefore highly valuable. Typically, white blood cell (WBC) count measured in a sample of urine is used to assess UTI. We consider clinical data from 1341 patients at their first visit in which UTI (i.e. WBC ≥ 1) is diagnosed. In addition, for each patient, a clinical profile of 34 symptoms was recorded. In this paper, we propose a Bayesian nonparametric regression model based on the Dirichlet process prior aimed at providing the clinicians with a meaningful clustering of the patients based on both the WBC (response variable) and possible patterns within the symptoms profiles (covariates). This is achieved by assuming a probability model for the symptoms as well as for the response variable. To identify the symptoms most associated with UTI, we specify a spike and slab base measure for the regression coefficients: this induces dependence of symptoms selection on cluster assignment. Posterior inference is performed through Markov Chain Monte Carlo methods. PMID:26536840
Portfolio Selection Based on Distance between Fuzzy Variables
Weiyi Qian
2014-01-01
This paper studies the portfolio selection problem in a fuzzy environment. We introduce a new simple method in which the distance between fuzzy variables is used to measure the divergence of fuzzy investment return from a prior one. Firstly, two new mathematical models are proposed by expressing divergence as distance, investment return as expected value, and risk as variance and semivariance, respectively. Secondly, the crisp forms of the new models are also provided for different types of fuzzy variables. Finally, several numerical examples are given to illustrate the effectiveness of the proposed approach.
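A minimal sketch of the "crisp form" idea: represent each asset's return as a symmetric triangular fuzzy variable with center b and spread α, and use the standard credibilistic results E[ξ] = b and V[ξ] = α²/6 (treated here as assumptions, together with the rule that spreads add under fuzzy addition). The asset numbers and the grid-search optimizer are invented for illustration and are not the paper's models.

```python
import itertools
import numpy as np

# Hypothetical symmetric triangular fuzzy returns (center b, spread alpha).
centers = np.array([0.10, 0.07, 0.04])
spreads = np.array([0.30, 0.15, 0.05])

def portfolio_score(w, risk_aversion=2.0):
    # Fuzzy addition: centers combine linearly and spreads add linearly,
    # so the portfolio return is triangular with center w@centers and
    # spread w@spreads; credibilistic variance is spread**2 / 6 (assumed).
    exp_ret = w @ centers
    var = (w @ spreads) ** 2 / 6.0
    return exp_ret - risk_aversion * var

# Coarse grid search over the probability simplex for three assets.
grid = np.linspace(0, 1, 21)
best_w, best_s = None, -np.inf
for w1, w2 in itertools.product(grid, repeat=2):
    if w1 + w2 <= 1:
        w = np.array([w1, w2, 1 - w1 - w2])
        s = portfolio_score(w)
        if s > best_s:
            best_w, best_s = w, s

print("weights:", best_w, "score:", round(best_s, 4))
```

With these toy numbers the high-return asset dominates despite its large spread; raising `risk_aversion` shifts weight toward the low-spread assets.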
A Bayesian Optimisation Algorithm for the Nurse Scheduling Problem
Jingpeng, Li
2008-01-01
A Bayesian optimization algorithm for the nurse scheduling problem is presented, which involves choosing a suitable scheduling rule from a set for each nurse's assignment. Unlike our previous work that used GAs to implement implicit learning, the learning in the proposed algorithm is explicit, i.e., eventually, we will be able to identify and mix building blocks directly. The Bayesian optimization algorithm is applied to implement such explicit learning by building a Bayesian network of the joint distribution of solutions. The conditional probability of each variable in the network is computed according to an initial set of promising solutions. Subsequently, each new instance for each variable is generated, i.e., in our case, a new rule string is obtained. Another set of rule strings will be generated in this way, some of which will replace previous strings based on fitness selection. If stopping conditions are not met, the conditional probabilities for all nodes in the Bayesian network are updated again usin...
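The estimate-sample-replace loop above can be sketched with the simplest possible Bayesian-network structure (independent marginals per position, i.e. a UMDA-style model rather than a learned network). The rule alphabet, the stand-in fitness function, and all sizes are hypothetical, chosen only to make the loop runnable.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy analogue: each solution is a string of scheduling-rule indices, one per
# assignment. Fitness here is agreement with a hidden "best" string -- a
# hypothetical stand-in for evaluating a real nurse schedule.
n_rules, n_assignments, pop_size = 4, 10, 60
target = rng.integers(n_rules, size=n_assignments)

def fitness(pop):
    return (pop == target).sum(axis=1)          # higher = better schedule

pop = rng.integers(n_rules, size=(pop_size, n_assignments))
for _ in range(30):
    # 1) Estimate probabilities from a set of promising solutions
    #    (independent marginals: the simplest Bayesian-network structure).
    elite = pop[np.argsort(fitness(pop))[-20:]]
    probs = np.stack([(elite == r).mean(axis=0) for r in range(n_rules)])
    probs = (probs + 0.01) / (probs + 0.01).sum(axis=0)   # smooth, renormalise
    # 2) Sample a new rule string for each population member.
    new = np.array([[rng.choice(n_rules, p=probs[:, j])
                     for j in range(n_assignments)] for _ in range(pop_size)])
    # 3) Fitness-based replacement: keep the better of old and new strings.
    pop = np.where((fitness(new) >= fitness(pop))[:, None], new, pop)

print("best fitness:", fitness(pop).max(), "of", n_assignments)
```

The full algorithm in the abstract learns conditional probabilities in a Bayesian network rather than independent marginals; this sketch only shows the shape of the explicit-learning loop.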
Henry de-Graft Acquah
2013-03-01
Alternative formulations of the Bayesian Information Criteria provide a basis for choosing between competing methods for detecting price asymmetry. However, very little is understood about their performance in the asymmetric price transmission modelling framework. In addressing this issue, this paper introduces and applies parametric bootstrap techniques to evaluate the ability of the Bayesian Information Criteria (BIC) and Draper's Information Criteria (DIC) in discriminating between alternative asymmetric price transmission models under various error and sample size conditions. The results of the bootstrap simulations indicate that model selection performance depends on bootstrap sample size and the amount of noise in the data generating process. The Bayesian criterion clearly identifies the true asymmetric model out of different competing models in the presence of bootstrap samples. Draper's Information Criteria (DIC; Draper, 1995) outperforms BIC at either larger bootstrap sample size or lower noise level.
The selection and application of variable order differential operators
Ramirez, Lynnette E. S.
2009-01-01
This work demonstrates the practicality of using variable order (VO) derivative operators for modeling the dynamics of complex systems. First we review the various candidate VO integral and derivative operator definitions proposed in the literature. We select a definition that is appropriate for physical modeling based on the following criteria: the VO operator must be able to return all intermediate values between 0 and 1 that correspond to the argument of the order of differe...
Variable Selection for Marginal Longitudinal Generalized Linear Models
Eva Cantoni; Joanna Mills Flemming; Elvezio Ronchetti
2003-01-01
Variable selection is an essential part of any statistical analysis and yet has been somewhat neglected in the context of longitudinal data analysis. In this paper we propose a generalized version of Mallows's Cp (GCp) suitable for use with both parametric and nonparametric models. GCp provides an estimate of a measure of a model's adequacy for prediction. We examine its performance with popular marginal longitudinal models (fitted using GEE) and contrast results with what is typically done in ...
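For orientation, the classical Mallows Cp that GCp generalizes is Cp = RSS_p/σ̂² − n + 2p, with σ̂² usually estimated from the full model; subsets whose Cp is small (near p) are preferred. A minimal sketch on simulated independent data (not a GEE/longitudinal setting, and all sizes hypothetical):

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(3)

# Simulated data: only the first two of five candidate covariates matter.
n, p_full = 80, 5
X = rng.normal(size=(n, p_full))
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(size=n)

def rss(cols):
    Xd = np.column_stack([np.ones(n), X[:, cols]])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    return float(((y - Xd @ beta) ** 2).sum())

# sigma^2 is estimated from the full model, as is conventional for Cp.
sigma2 = rss(list(range(p_full))) / (n - p_full - 1)

def mallows_cp(cols):
    p = len(cols) + 1                  # +1 for the intercept
    return rss(cols) / sigma2 - n + 2 * p

best = min((mallows_cp(list(c)), c)
           for k in range(1, p_full + 1)
           for c in combinations(range(p_full), k))
print("best subset by Cp:", best[1], "Cp =", round(best[0], 2))
```

The GCp of the paper replaces the least-squares RSS with a measure of adequacy appropriate to marginal models fitted by GEE; this sketch only shows the subset-scoring mechanics.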
Filament winding cylinders. III - Selection of the process variables
Lee, Soo-Yong; Springer, George S.
1990-01-01
Using the Lee-Springer filament winding model, temperatures, degrees of cure, viscosities, stresses, strains, fiber tensions, fiber motions, and void diameters were calculated in graphite-epoxy composite cylinders during the winding and subsequent curing. The results demonstrate the type of information which can be generated by the model. It is shown, in reference to these results, how the model, and the corresponding WINDTHICK code, can be used to select the appropriate process variables.
Mahalanobis distance and variable selection to optimize dose response
A battery of statistical techniques is combined to improve detection of low-level dose response. First, Mahalanobis distances are used to classify objects as normal or abnormal. Then the proportion classified abnormal is regressed on dose. Finally, a subset of regressor variables is selected which maximizes the slope of the dose-response line. Use of the techniques is illustrated by application to mouse sperm damaged by low doses of x-rays.
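The three-step pipeline (classify by Mahalanobis distance, regress the abnormal proportion on dose, select the subset maximizing the slope) can be sketched directly. The four morphology variables, the dose levels, and the size of the dose effect are all invented for illustration; only the pipeline structure follows the abstract.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(4)

# Hypothetical data: 4 measurements per object at several dose levels;
# damage shifts the first two variables in proportion to dose (assumed).
doses = np.repeat([0.0, 0.5, 1.0, 2.0], 100)
n = len(doses)
X = rng.normal(size=(n, 4))
X[:, :2] += 0.8 * doses[:, None]

ref = X[doses == 0]                    # reference ("normal") population

def dose_slope(cols):
    mu = ref[:, cols].mean(axis=0)
    cov = np.atleast_2d(np.cov(ref[:, cols], rowvar=False))
    inv = np.linalg.inv(cov)
    d = X[:, cols] - mu
    md2 = np.einsum("ij,jk,ik->i", d, inv, d)   # squared Mahalanobis distance
    # Classify abnormal relative to the 95th percentile of the controls.
    flags = md2 > np.quantile(md2[doses == 0], 0.95)
    # Regress the abnormal proportion on dose; return the slope.
    levels = np.unique(doses)
    props = np.array([flags[doses == lv].mean() for lv in levels])
    return np.polyfit(levels, props, 1)[0]

# Select the variable subset that maximizes the dose-response slope.
subsets = [list(c) for k in (1, 2, 3, 4) for c in combinations(range(4), k)]
best_cols = max(subsets, key=dose_slope)
print("selected variables:", best_cols, "slope:", round(dose_slope(best_cols), 3))
```

Because only the first two variables carry the (assumed) dose effect, subsets containing them yield the steepest dose-response line.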
Multi-scale inference of interaction rules in animal groups using Bayesian model selection.
Richard P Mann
2012-01-01
Inference of interaction rules of animals moving in groups usually relies on an analysis of large scale system behaviour. Models are tuned through repeated simulation until they match the observed behaviour. More recent work has used the fine scale motions of animals to validate and fit the rules of interaction of animals in groups. Here, we use a Bayesian methodology to compare a variety of models to the collective motion of glass prawns (Paratya australiensis). We show that these exhibit a stereotypical 'phase transition', whereby an increase in density leads to the onset of collective motion in one direction. We fit models to this data, which range from: a mean-field model where all prawns interact globally; to a spatial Markovian model where prawns are self-propelled particles influenced only by the current positions and directions of their neighbours; up to non-Markovian models where prawns have 'memory' of previous interactions, integrating their experiences over time when deciding to change behaviour. We show that the mean-field model fits the large scale behaviour of the system, but does not capture fine scale rules of interaction, which are primarily mediated by physical contact. Conversely, the Markovian self-propelled particle model captures the fine scale rules of interaction but fails to reproduce global dynamics. The most sophisticated model, the non-Markovian model, provides a good match to the data at both the fine scale and in terms of reproducing global dynamics. We conclude that prawns' movements are influenced by not just the current direction of nearby conspecifics, but also those encountered in the recent past. Given the simplicity of prawns as a study system our research suggests that self-propelled particle models of collective motion should, if they are to be realistic at multiple biological scales, include memory of previous interactions and other non-Markovian effects.
Martínez, Isabel; Wiegand, Thorsten; Camarero, J Julio; Batllori, Enric; Gutiérrez, Emilia
2011-05-01
Alpine tree-line ecotones are characterized by marked changes at small spatial scales that may result in a variety of physiognomies. A set of alternative individual-based models was tested with data from four contrasting Pinus uncinata ecotones in the central Spanish Pyrenees to reveal the minimal subset of processes required for tree-line formation. A Bayesian approach combined with Markov chain Monte Carlo methods was employed to obtain the posterior distribution of model parameters, allowing the use of model selection procedures. The main features of real tree lines emerged only in models considering nonlinear responses in individual rates of growth or mortality with respect to the altitudinal gradient. Variation in tree-line physiognomy reflected mainly changes in the relative importance of these nonlinear responses, while other processes, such as dispersal limitation and facilitation, played a secondary role. Different nonlinear responses also determined the presence or absence of krummholz, in agreement with recent findings highlighting a different response of diffuse and abrupt or krummholz tree lines to climate change. The method presented here can be widely applied in individual-based simulation models and will turn model selection and evaluation in this type of models into a more transparent, effective, and efficient exercise. PMID:21508601
Bayesian Variable Selection to identify QTL affecting a simulated quantitative trait
Schurink, A.; Janss, L.L.G.; Heuven, H.C.M.
2012-01-01
Background Recent developments in genetic technology and methodology enable accurate detection of QTL and estimation of breeding values, even in individuals without phenotypes. The QTL-MAS workshop offers the opportunity to test different methods to perform a genome-wide association study on simulat
Embryologic changes in rabbit lines selected for litter size variability.
García, M L; Blasco, A; Argente, M J
2016-09-15
A divergent selection experiment on litter size variability was carried out. Correlated response in early embryo survival, embryonic development, size of embryos, and size of embryonic coats after four generations of selection was estimated. A total of 429 embryos from 51 high-line females and 648 embryos from 80 low-line females were used in the experiment. The traits studied were percentage of normal embryos, embryo diameter, zona pellucida thickness, and mucin coat thickness. Traits were measured at 24, 48, and 72 hours postcoitum (hpc); mucin coat thickness was only measured at 48 and 72 hpc. The embryos were classified as zygotes or two-cell embryos at 24 hpc; 16-cell embryos or early morulae at 48 hpc; and early morulae, compacted morulae, or blastocyst at 72 hpc. At 24 hpc, the percentage of normal embryos in the high line was lower than in the low line (-2.5%), and embryos in the high line showed 10% higher zona pellucida thickness than those of the low line. No differences in percentage of zygotes or two-cell embryos were found. At 48 hpc, the high-line embryos were less developed, with a higher percentage of 16-cell embryos (23.4%) and a lower percentage of early morulae (-23.4%). At 72 hpc, high-line embryos continued to be less developed, showing higher percentages of early morulae and compact morulae and lower percentages of blastocyst (-1.8%). No differences in embryo diameter or mucin coat thickness were found at any time. In conclusion, selection for litter size variability has consequences on early embryonic survival and development, with embryos presenting a lower state of development and a lower percentage of normal embryos in the line selected for higher variability. PMID:27207473
Isoenzymatic variability in tropical maize populations under reciprocal recurrent selection
Pinto Luciana Rossini
2003-01-01
Maize (Zea mays L.) is one of the crops in which the genetic variability has been extensively studied at isoenzymatic loci. The genetic variability of the maize populations BR-105 and BR-106, and the synthetics IG-3 and IG-4, obtained after one cycle of a high-intensity reciprocal recurrent selection (RRS), was investigated at seven isoenzymatic loci. A total of twenty alleles were identified, and most of the private alleles were found in the BR-106 population. One cycle of reciprocal recurrent selection (RRS) caused reductions of 12% in the number of alleles in both populations. Changes in allele frequencies were also observed between populations and synthetics, mainly for the Est 2 locus. Populations presented similar values for the number of alleles per locus, percentage of polymorphic loci, and observed and expected heterozygosities. A decrease of the genetic variation values was observed for the synthetics as a consequence of genetic drift effects and reduction of the effective population sizes. The distribution of the genetic diversity within and between populations revealed that most of the diversity was maintained within them, i.e. BR-105 x BR-106 (G_ST = 3.5%) and IG-3 x IG-4 (G_ST = 4.0%). The genetic distances between populations and synthetics increased approximately 21%. An increase in the genetic divergence between the populations occurred without limiting new selection procedures.
A Selective Overview of Variable Selection in High Dimensional Feature Space.
Fan, Jianqing; Lv, Jinchi
2010-01-01
High dimensional statistical problems arise from diverse fields of scientific research and technological development. Variable selection plays a pivotal role in contemporary statistical learning and scientific discoveries. The traditional idea of best subset selection methods, which can be regarded as a specific form of penalized likelihood, is computationally too expensive for many modern statistical applications. Other forms of penalized likelihood methods have been successfully developed over the last decade to cope with high dimensionality. They have been widely applied for simultaneously selecting important variables and estimating their effects in high dimensional statistical inference. In this article, we present a brief account of the recent developments of theory, methods, and implementations for high dimensional variable selection. What limits of the dimensionality such methods can handle, what the role of penalty functions is, and what the statistical properties are rapidly drive the advances of the field. The properties of non-concave penalized likelihood and its roles in high dimensional statistical modeling are emphasized. We also review some recent advances in ultra-high dimensional variable selection, with emphasis on independence screening and two-scale methods. PMID:21572976
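The penalized-likelihood idea the overview surveys is easy to see in its simplest convex form: L1-penalized least squares (the lasso), solved by coordinate descent with soft-thresholding. This is a stand-in for the nonconcave penalties (e.g. SCAD) the paper emphasizes, and the problem sizes and coefficients below are invented.

```python
import numpy as np

rng = np.random.default_rng(5)

# Sparse truth: 10 candidate covariates, only 3 truly active.
n, p = 100, 10
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[[0, 3, 7]] = [4.0, -3.0, 2.0]
y = X @ beta_true + 0.1 * rng.normal(size=n)

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

# Coordinate descent for (1/2)||y - X b||^2 + lam * ||b||_1.
lam = 10.0
beta = np.zeros(p)
col_sq = (X ** 2).sum(axis=0)
for _ in range(100):
    for j in range(p):
        r = y - X @ beta + X[:, j] * beta[j]            # partial residual
        beta[j] = soft_threshold(X[:, j] @ r, lam) / col_sq[j]

support = set(np.nonzero(np.abs(beta) > 1e-8)[0].tolist())
print("estimated support:", sorted(support))
```

The lasso simultaneously selects variables (zeros in `beta`) and estimates effects, which is the behavior the overview attributes to penalized likelihood generally; nonconcave penalties like SCAD reduce the shrinkage bias on the large nonzero coefficients that is visible here.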
Kaski Kimmo
2007-05-01
Background: A key challenge in metabonomics is to uncover quantitative associations between multidimensional spectroscopic data and biochemical measures used for disease risk assessment and diagnostics. Here we focus on clinically relevant estimation of lipoprotein lipids by 1H NMR spectroscopy of serum. Results: A Bayesian methodology, with a biochemical motivation, is presented for a real 1H NMR metabonomics data set of 75 serum samples. Lipoprotein lipid concentrations were independently obtained for these samples via ultracentrifugation and specific biochemical assays. The Bayesian models were constructed by Markov chain Monte Carlo (MCMC) and they showed remarkably good quantitative performance, the predictive R-values being 0.985 for the very low density lipoprotein triglycerides (VLDL-TG), 0.787 for the intermediate, 0.943 for the low, and 0.933 for the high density lipoprotein cholesterol (IDL-C, LDL-C, and HDL-C), respectively. The modelling produced a kernel-based reformulation of the data, the parameters of which coincided with the well-known biochemical characteristics of the 1H NMR spectra; particularly for VLDL-TG and HDL-C the Bayesian methodology was able to clearly identify the most characteristic resonances within the heavily overlapping information in the spectra. For IDL-C and LDL-C the resulting model kernels were more complex than those for VLDL-TG and HDL-C, probably reflecting the severe overlap of the IDL and LDL resonances in the 1H NMR spectra. Conclusion: The systematic use of Bayesian MCMC analysis is computationally demanding. Nevertheless, the combination of high-quality quantification and the biochemical rationale of the resulting models is expected to be useful in the field of metabonomics.
Estimation and variable selection for generalized additive partial linear models
Wang, Li
2011-08-01
We study generalized additive partial linear models, proposing the use of polynomial spline smoothing for estimation of nonparametric functions, and deriving quasi-likelihood based estimators for the linear parameters. We establish asymptotic normality for the estimators of the parametric components. The procedure avoids solving large systems of equations as in kernel-based procedures and thus results in gains in computational simplicity. We further develop a class of variable selection procedures for the linear parameters by employing a nonconcave penalized quasi-likelihood, which is shown to have an asymptotic oracle property. Monte Carlo simulations and an empirical example are presented for illustration. © Institute of Mathematical Statistics, 2011.
Penalized maximum likelihood estimation and variable selection in geostatistics
Chu, Tingjin; Wang, Haonan; 10.1214/11-AOS919
2012-01-01
We consider the problem of selecting covariates in spatial linear models with Gaussian process errors. Penalized maximum likelihood estimation (PMLE) that enables simultaneous variable selection and parameter estimation is developed and, for ease of computation, PMLE is approximated by one-step sparse estimation (OSE). To further improve computational efficiency, particularly with large sample sizes, we propose penalized maximum covariance-tapered likelihood estimation (PMLE$_{\\mathrm{T}}$) and its one-step sparse estimation (OSE$_{\\mathrm{T}}$). General forms of penalty functions with an emphasis on smoothly clipped absolute deviation are used for penalized maximum likelihood. Theoretical properties of PMLE and OSE, as well as their approximations PMLE$_{\\mathrm{T}}$ and OSE$_{\\mathrm{T}}$ using covariance tapering, are derived, including consistency, sparsity, asymptotic normality and the oracle properties. For covariance tapering, a by-product of our theoretical results is consistency and asymptotic normal...
Secondary eclipses in the CoRoT light curves: A homogeneous search based on Bayesian model selection
Parviainen, Hannu; Belmonte, Juan Antonio
2012-01-01
We aim to identify and characterize secondary eclipses in the original light curves of all published CoRoT planets using uniform detection and evaluation criteria. Our analysis is based on a Bayesian model selection between two competing models: one with and one without an eclipse signal. The search is carried out by mapping the Bayes factor in favor of the eclipse model as a function of the eclipse center time, after which the characterization of plausible eclipse candidates is done by estimating the posterior distributions of the eclipse model parameters using Markov Chain Monte Carlo. We discover statistically significant eclipse events for two planets, CoRoT-6b and CoRoT-11b, and for one brown dwarf, CoRoT-15b. We also find marginally significant eclipse events passing our plausibility criteria for CoRoT-3b, 13b, 18b, and 21b. The previously published CoRoT-1b and CoRoT-2b eclipses are also confirmed.
The Tabu Search Procedure: An Alternative to the Variable Selection Methods
Mills, Jamie, D.; Olejnik, Stephen, F.; Marcoulides, George, A.
2005-01-01
The effectiveness of the Tabu variable selection algorithm in identifying predictor variables related to a criterion variable is compared with the stepwise variable selection method and the all-possible-regressions approach. Considering results obtained from previous research, Tabu is more successful in identifying relevant variables than the…
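A basic tabu search over variable subsets can be sketched as follows: flip one variable per move, forbid re-flipping it for a short tenure, and keep the best subset seen. The objective (adjusted R²), the simulated data, and the tenure value are all illustrative choices, not the settings of the study above.

```python
import numpy as np

rng = np.random.default_rng(6)

# Simulated regression where only variables 0, 2, and 5 predict the criterion.
n, p = 120, 8
X = rng.normal(size=(n, p))
y = X[:, 0] + 0.8 * X[:, 2] - 0.6 * X[:, 5] + rng.normal(scale=0.5, size=n)

def adj_r2(mask):
    k = int(mask.sum())
    if k == 0:
        return -np.inf
    Xd = np.column_stack([np.ones(n), X[:, mask]])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    ss_res = ((y - Xd @ beta) ** 2).sum()
    ss_tot = ((y - y.mean()) ** 2).sum()
    return 1 - (ss_res / (n - k - 1)) / (ss_tot / (n - 1))

mask = np.zeros(p, bool)
mask[rng.choice(p, 3, replace=False)] = True       # random starting subset
tabu = np.zeros(p, int)
best_mask, best_score = mask.copy(), adj_r2(mask)
for it in range(60):
    moves = [j for j in range(p) if tabu[j] <= it]  # non-tabu flips only
    scores = []
    for j in moves:
        cand = mask.copy()
        cand[j] = ~cand[j]
        scores.append(adj_r2(cand))
    j = moves[int(np.argmax(scores))]               # best admissible move
    mask[j] = ~mask[j]
    tabu[j] = it + 5                                # tabu tenure of 5 iterations
    if adj_r2(mask) > best_score:
        best_mask, best_score = mask.copy(), adj_r2(mask)

print("selected:", np.nonzero(best_mask)[0], "adjusted R^2:", round(best_score, 3))
```

Unlike stepwise selection, the tabu list forces the search to accept temporarily worse subsets, which is what lets it escape the local optima that trap greedy methods.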
Villforth, Carolin; Koekemoer, Anton M.; Grogin, Norman A.
2010-01-01
Variability is a property shared by practically all AGN. This makes variability selection a possible technique for identifying AGN. Given that variability selection makes no prior assumption about spectral properties, it is a powerful technique for detecting both low-luminosity AGN in which the host galaxy emission is dominating and AGN with unusual spectral properties. In this paper, we will discuss and test different statistical methods for the detection of variability in sparsely sampled d...
Identifying relevant positions in proteins by Critical Variable Selection.
Grigolon, Silvia; Franz, Silvio; Marsili, Matteo
2016-06-21
Evolution in its course has found a variety of solutions to the same optimisation problem. The advent of high-throughput genomic sequencing has made available extensive data from which, in principle, one can infer the underlying structure on which biological functions rely. In this paper, we present a new method aimed at the extraction of sites encoding structural and functional properties from a set of protein primary sequences, namely a multiple sequence alignment. The method, called critical variable selection, is based on the idea that subsets of relevant sites correspond to subsequences that occur with a particularly broad frequency distribution in the dataset. By applying this algorithm to in silico sequences, to the response regulator receiver and to the voltage sensor domain of ion channels, we show that this procedure recovers not only the information encoded in single site statistics and pairwise correlations but also captures dependencies going beyond pairwise correlations. The method proposed here is complementary to statistical coupling analysis, in that the most relevant sites predicted by the two methods differ markedly. We find robust and consistent results for datasets as small as few hundred sequences that reveal a hidden hierarchy of sites that are consistent with the present knowledge on biologically relevant sites and evolutionary dynamics. This suggests that critical variable selection is capable of identifying a core of sites encoding functional and structural information in a multiple sequence alignment. PMID:26974515
Sudre, Carole H; Cardoso, M Jorge; Bouvy, Willem H; Biessels, Geert Jan; Barnes, Josephine; Ourselin, Sebastien
2015-10-01
In neuroimaging studies, pathologies can present themselves as abnormal intensity patterns. Thus, solutions for detecting abnormal intensities are currently under investigation. As each patient is unique, an unbiased and biologically plausible model of pathological data would have to be able to adapt to the subject's individual presentation. Such a model would provide the means for a better understanding of the underlying biological processes and improve one's ability to define pathologically meaningful imaging biomarkers. With this aim in mind, this work proposes a hierarchical fully unsupervised model selection framework for neuroimaging data which enables the distinction between different types of abnormal image patterns without pathological a priori knowledge. Its application on simulated and clinical data demonstrated the ability to detect abnormal intensity clusters, resulting in a competitive to improved behavior in white matter lesion segmentation when compared to three other freely-available automated methods. PMID:25850086
During the past several years, near-infrared (near-IR/NIR) spectroscopy has increasingly been adopted as an analytical tool in various fields from petroleum to biomedical sectors. The NIR spectrum (above 4000 cm-1) of a sample is typically measured by modern instruments at a few hundred wavelengths. Recently, considerable effort has been directed towards developing procedures to identify variables (wavelengths) that contribute useful information. Variable selection (VS) or feature selection, also called frequency selection or wavelength selection, is a critical step in data analysis for vibrational spectroscopy (infrared, Raman, or NIRS). In this paper, we compare the performance of 16 different feature selection methods for the prediction of properties of biodiesel fuel, including density, viscosity, methanol content, and water concentration. The feature selection algorithms tested include stepwise multiple linear regression (MLR-step), interval partial least squares regression (iPLS), backward iPLS (BiPLS), forward iPLS (FiPLS), moving window partial least squares regression (MWPLS), (modified) changeable size moving window partial least squares (CSMWPLS/MCSMWPLSR), searching combination moving window partial least squares (SCMWPLS), successive projections algorithm (SPA), uninformative variable elimination (UVE, including UVE-SPA), simulated annealing (SA), back-propagation artificial neural networks (BP-ANN), Kohonen artificial neural network (K-ANN), and genetic algorithms (GAs, including GA-iPLS). Two linear techniques for calibration model building, namely multiple linear regression (MLR) and partial least squares regression/projection to latent structures (PLS/PLSR), are used for the evaluation of biofuel properties. A comparison with a non-linear calibration model, artificial neural networks (ANN-MLP), is also provided. Discussion of gasoline, ethanol-gasoline (bioethanol), and diesel fuel data is presented. The results of other spectroscopic techniques
Finley, A. O.; Banerjee, S.; Cook, B. D.
2010-12-01
Recent advances in remote sensing, specifically waveform Light Detection and Ranging (LiDAR) sensors, provide the data needed to quantify forest variables at a fine spatial resolution over large domains. Of particular interest is LiDAR data from NASA's Laser Vegetation Imaging Sensor (LVIS), the upcoming Deformation, Ecosystem Structure, and Dynamics of Ice (DESDynI) missions, and NSF's National Ecological Observatory Network planned Airborne Observation Platform. A central challenge to using these data is to couple field measurements of forest variables (e.g., species, indices of structural complexity, light competition, or drought stress) with the high-dimensional LiDAR signal through a model, which allows prediction of the tree-level variables at locations where only the remotely sensed data are available. It is common to model the high-dimensional signal vector as a mixture of a relatively small number of Gaussian distributions. The parameters from these Gaussian distributions, or indices derived from the parameters, can then be used as regressors in a regression model. These approaches retain only a small amount of the information contained in the signal. Further, it is not known a priori which features of the signal explain the most variability in the response variables. It is possible to fully exploit the information in the signal by treating it as an object; thus, we define a framework to couple a spatial latent factor model with forest variables using a fully Bayesian functional spatial data analysis. Our proposed modeling framework explicitly: 1) reduces the dimensionality of signals in an optimal way (i.e., preserves the information that describes the maximum variability in the response variable); 2) propagates uncertainty in data and parameters through to prediction; and 3) acknowledges and leverages spatial dependence among the regressors and model residuals to meet statistical assumptions and improve prediction. The proposed modeling framework is
Wöhling, T.; Schöniger, A.; Geiges, A.; Nowak, W.; Gayler, S.
2013-12-01
The objective selection of appropriate models for realistic simulations of coupled soil-plant processes is a challenging task since the processes are complex, not fully understood at larger scales, and highly non-linear. Also, comprehensive data sets are scarce, and measurements are uncertain. In the past decades, a variety of different models have been developed that exhibit a wide range of complexity regarding their approximation of processes in the coupled model compartments. We present a method for evaluating experimental design for maximum confidence in the model selection task. The method considers uncertainty in parameters, measurements and model structures. Advancing the ideas behind Bayesian Model Averaging (BMA), we analyze the changes in posterior model weights and posterior model choice uncertainty when more data are made available. This allows assessing the power of different data types, data densities and data locations in identifying the best model structure from among a suite of plausible models. The models considered in this study are the crop models CERES, SUCROS, GECROS and SPASS, which are coupled to identical routines for simulating soil processes within the modelling framework Expert-N. The four models considerably differ in the degree of detail at which crop growth and root water uptake are represented. Monte-Carlo simulations were conducted for each of these models considering their uncertainty in soil hydraulic properties and selected crop model parameters. Using a Bootstrap Filter (BF), the models were then conditioned on field measurements of soil moisture, matric potential, leaf-area index, and evapotranspiration rates (from eddy-covariance measurements) during a vegetation period of winter wheat at a field site at the Swabian Alb in Southwestern Germany. Following our new method, we derived model weights when using all data or different subsets thereof. We discuss to which degree the posterior mean outperforms the prior mean and all
Bayesian model averaging to explore the worth of data for soil-plant model selection and prediction
Wöhling, Thomas; Schöniger, Anneli; Gayler, Sebastian; Nowak, Wolfgang
2015-04-01
A Bayesian model averaging (BMA) framework is presented to evaluate the worth of different observation types and experimental design options for (1) increased confidence in model selection and (2) increased predictive reliability. These two modeling tasks are handled separately because model selection aims at identifying the most appropriate model with respect to a given calibration data set, while predictive reliability aims at reducing uncertainty in model predictions by constraining the plausible range of both models and model parameters. For that purpose, we pursue an optimal design of measurement framework that is based on BMA and that considers uncertainty in parameters, measurements, and model structures. We apply this framework to select between four crop models (the vegetation components of CERES, SUCROS, GECROS, and SPASS), which are coupled to identical routines for simulating soil carbon and nitrogen turnover, soil heat and nitrogen transport, and soil water movement. An ensemble of parameter realizations was generated for each model using Monte-Carlo simulation. We assess each model's plausibility by determining its posterior weight, which signifies the probability that the model generated a given experimental data set. Several BMA analyses were conducted for different data packages with measurements of soil moisture, evapotranspiration (ETa), and leaf area index (LAI). The posterior weights resulting from the different BMA runs were compared to the weight distribution of a reference run with all data types to investigate the utility of different data packages and monitoring design options in identifying the most appropriate model in the ensemble. We found that different (combinations of) data types support different models and none of the four crop models outperforms all others under all data scenarios. The best model discrimination was observed for those data where the competing models disagree the most. The data worth for reducing prediction
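The posterior model weights at the heart of this BMA framework follow directly from Bayes' theorem applied over the model set: each model's weight is its prior probability times its evidence for the data, normalized. A minimal sketch with illustrative log-evidence values for four models (the numbers are invented, not from the study):

```python
# Posterior model weights for Bayesian model averaging from log-evidences.
import numpy as np
from scipy.special import logsumexp

log_evidence = np.array([-120.4, -118.9, -119.6, -125.2])  # four models, illustrative
log_prior = np.log(np.full(4, 0.25))                        # uniform prior over models

# Work in log space to avoid underflow, then normalize.
log_post = log_prior + log_evidence
weights = np.exp(log_post - logsumexp(log_post))
print(weights)
```

Comparing such weight vectors computed under different data packages is exactly the kind of data-worth comparison the abstract describes.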
Peixin ZHAO
2013-01-01
In this paper, we consider variable selection for the parametric components of varying coefficient partially linear models with censored data. By ingeniously constructing a penalized auxiliary vector, we propose an empirical likelihood based variable selection procedure and show that it is consistent and satisfies the sparsity property. Simulation studies show that the proposed variable selection method is workable.
Noncausal Bayesian Vector Autoregression
Lanne, Markku; Luoto, Jani
We propose a Bayesian inferential procedure for the noncausal vector autoregressive (VAR) model that is capable of capturing nonlinearities and incorporating effects of missing variables. In particular, we devise a fast and reliable posterior simulator that yields the predictive distribution as a...
Zheng-yan Lin; Yu-ze Yuan
2012-01-01
Semiparametric models with a diverging number of predictors arise in many contemporary scientific areas. Variable selection for these models consists of two components: model selection for the nonparametric components and selection of significant variables for the parametric portion. In this paper, we consider a variable selection procedure that combines basis function approximation with the SCAD penalty. The proposed procedure simultaneously selects significant variables in the parametric components and the nonparametric components. With appropriate selection of tuning parameters, we establish the consistency and sparseness of this procedure.
Birth order and selected work-related personality variables.
Phillips, A S; Bedeian, A G; Mossholder, K W; Touliatos, J
1988-12-01
A possible link between birth order and various individual characteristics (e.g., intelligence, potential eminence, need for achievement, sociability) has been suggested by personality theorists such as Adler for over a century. The present study examines whether birth order is associated with selected personality variables that may be related to various work outcomes. Three of seven hypotheses were supported, and the effect sizes for these were small. Firstborns scored significantly higher than later borns on measures of dominance, good impression, and achievement via conformity. No differences between firstborns and later borns were found in managerial potential, work orientation, achievement via independence, and sociability. The study's sample consisted of 835 public, government, and industrial accountants responding to a national US survey of accounting professionals. The nature of the sample may have been partially responsible for the results obtained: its homogeneity may have caused any birth order effects to wash out. It can be argued that successful membership in the accountancy profession requires internalization of a set of prescribed rules and standards; it may be that accountants as a group are locked into a behavioral framework, so that any differentiation would result from spurious interpersonal differences, not from predictable birth-order related characteristics. A final interpretation is that birth order effects are nonexistent or statistical artifacts. Given the present data and particularistic sample, however, the authors have insufficient information from which to draw such a conclusion. PMID:12281942
Flood quantile estimation at ungauged sites by Bayesian networks
Mediero, L.; Santillán, D.; Garrote, L.
2012-04-01
Estimating flood quantiles at a site for which no observed measurements are available is essential for water resources planning and management. Ungauged sites have no observations about the magnitude of floods, but some site and basin characteristics are known. The most common technique used is the multiple regression analysis, which relates physical and climatic basin characteristic to flood quantiles. Regression equations are fitted from flood frequency data and basin characteristics at gauged sites. Regression equations are a rigid technique that assumes linear relationships between variables and cannot take the measurement errors into account. In addition, the prediction intervals are estimated in a very simplistic way from the variance of the residuals in the estimated model. Bayesian networks are a probabilistic computational structure taken from the field of Artificial Intelligence, which have been widely and successfully applied to many scientific fields like medicine and informatics, but application to the field of hydrology is recent. Bayesian networks infer the joint probability distribution of several related variables from observations through nodes, which represent random variables, and links, which represent causal dependencies between them. A Bayesian network is more flexible than regression equations, as they capture non-linear relationships between variables. In addition, the probabilistic nature of Bayesian networks allows taking the different sources of estimation uncertainty into account, as they give a probability distribution as result. A homogeneous region in the Tagus Basin was selected as case study. A regression equation was fitted taking the basin area, the annual maximum 24-hour rainfall for a given recurrence interval and the mean height as explanatory variables. Flood quantiles at ungauged sites were estimated by Bayesian networks. Bayesian networks need to be learnt from a huge enough data set. As observational data are reduced, a
The SEDs, Host Galaxies and Environments of Variability Selected AGN in GOODS-S
Villforth, Carolin; Sarajedini, Vicki; Koekemoer, Anton
2012-01-01
Variability selection has been proposed as a powerful tool for identifying both low-luminosity AGN and those with unusual SEDs. However, a systematic study of sources selected in such a way has been lacking. In this paper, we present the multi-wavelength properties of the variability selected AGN in GOODS South. We demonstrate that variability selection indeed reliably identifies AGN, predominantly of low luminosity. We find contamination from stars as well as a very small sample of sources t...
Variable Selection for Varying-Coefficient Models with Missing Response at Random
Pei Xin ZHAO; Liu Gen XUE
2011-01-01
In this paper, we present a variable selection procedure by combining basis function approximations with penalized estimating equations for varying-coefficient models with missing response at random. With appropriate selection of the tuning parameters, we establish the consistency of the variable selection procedure and the optimal convergence rate of the regularized estimators. A simulation study is undertaken to assess the finite sample performance of the proposed variable selection procedure.
Bayesian Analysis Made Simple An Excel GUI for WinBUGS
Woodward, Philip
2011-01-01
From simple NLMs to complex GLMMs, this book describes how to use the GUI for WinBUGS - BugsXLA - an Excel add-in written by the author that allows a range of Bayesian models to be easily specified. With case studies throughout, the text shows how to routinely apply even the more complex aspects of model specification, such as GLMMs, outlier robust models, random effects Emax models, auto-regressive errors, and Bayesian variable selection. It provides brief, up-to-date discussions of current issues in the practical application of Bayesian methods. The author also explains how to obtain free so
Lesaffre, Emmanuel
2012-01-01
The growth of biostatistics has been phenomenal in recent years and has been marked by considerable technical innovation in both methodology and computational practicality. One area that has experienced significant growth is Bayesian methods. The growing use of Bayesian methodology has taken place partly due to an increasing number of practitioners valuing the Bayesian paradigm as matching that of scientific discovery. In addition, computational advances have allowed for more complex models to be fitted routinely to realistic data sets. Through examples, exercises and a combination of introd
Bayesian modeling using WinBUGS
Ntzoufras, Ioannis
2009-01-01
A hands-on introduction to the principles of Bayesian modeling using WinBUGS Bayesian Modeling Using WinBUGS provides an easily accessible introduction to the use of WinBUGS programming techniques in a variety of Bayesian modeling settings. The author provides an accessible treatment of the topic, offering readers a smooth introduction to the principles of Bayesian modeling with detailed guidance on the practical implementation of key principles. The book begins with a basic introduction to Bayesian inference and the WinBUGS software and goes on to cover key topics, including: Markov Chain Monte Carlo algorithms in Bayesian inference Generalized linear models Bayesian hierarchical models Predictive distribution and model checking Bayesian model and variable evaluation Computational notes and screen captures illustrate the use of both WinBUGS as well as R software to apply the discussed techniques. Exercises at the end of each chapter allow readers to test their understanding of the presented concepts and all ...
Widyas, Nuzul; Jensen, Just; Nielsen, Vivi Hunnicke
selected downwards and three lines were kept as controls. Bayesian statistical methods are used to estimate the genetic variance components. Mixed model analysis is modified including mutation effect following the methods by Wray (1990). DIC was used to compare the model. Models including mutation effect...... have better fit compared to the model with only additive effect. Mutation as direct effect contributes 3.18% of the total phenotypic variance. While in the model with interactions between additive and mutation, it contributes 1.43% as direct effect and 1.36% as interaction effect of the total variance...
Pei Xin ZHAO; Liu Gen XUE
2011-01-01
In this paper, we present a variable selection procedure by combining basis function approximations with penalized estimating equations for semiparametric varying-coefficient partially linear models with missing response at random. The proposed procedure simultaneously selects significant variables in the parametric components and the nonparametric components. With appropriate selection of the tuning parameters, we establish the consistency of the variable selection procedure and the convergence rate of the regularized estimators. A simulation study is undertaken to assess the finite sample performance of the proposed variable selection procedure.
Draper, D.
2001-01-01
© 2012 Springer Science+Business Media, LLC. All rights reserved. Article Outline: Glossary Definition of the Subject and Introduction The Bayesian Statistical Paradigm Three Examples Comparison with the Frequentist Statistical Paradigm Future Directions Bibliography
Gilkey, Kelly M.; Myers, Jerry G.; McRae, Michael P.; Griffin, Elise A.; Kallrui, Aditya S.
2012-01-01
The Exploration Medical Capability project is creating a catalog of risk assessments using the Integrated Medical Model (IMM). The IMM is a software-based system intended to assist mission planners in preparing for spaceflight missions by helping them to make informed decisions about medical preparations and supplies needed for combating and treating various medical events using Probabilistic Risk Assessment. The objective is to use statistical analyses to inform the IMM decision tool with estimated probabilities of medical events occurring during an exploration mission. Because data regarding astronaut health are limited, Bayesian statistical analysis is used. Bayesian inference combines prior knowledge, such as data from the general U.S. population, the U.S. Submarine Force, or the analog astronaut population located at the NASA Johnson Space Center, with observed data for the medical condition of interest. The posterior results reflect the best evidence for specific medical events occurring in flight. Bayes theorem provides a formal mechanism for combining available observed data with data from similar studies to support the quantification process. The IMM team performed Bayesian updates on the following medical events: angina, appendicitis, atrial fibrillation, atrial flutter, dental abscess, dental caries, dental periodontal disease, gallstone disease, herpes zoster, renal stones, seizure, and stroke.
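The conjugate Beta-Binomial form is the standard mechanism for the kind of update described above: prior evidence from an analog population is encoded as a Beta distribution and combined with observed event counts. A hedged sketch with invented numbers (these are illustrative, not IMM values):

```python
# Conjugate Beta-Binomial update: analog-population prior + observed counts.
from scipy import stats

# Prior: 3 events in 1000 person-exposures from an analog population -> Beta(3, 997).
a0, b0 = 3.0, 997.0
events, trials = 1, 150          # hypothetical observed flight data
a1, b1 = a0 + events, b0 + (trials - events)

posterior_mean = a1 / (a1 + b1)
lo, hi = stats.beta.ppf([0.025, 0.975], a1, b1)
print(posterior_mean, (lo, hi))
```

The posterior mean shifts from the prior rate toward the observed rate, weighted by the relative sample sizes, which is the "formal mechanism for combining available observed data with data from similar studies" the abstract refers to.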
Relationships of Selected Personal and Social Variables in Conforming Judgment
Long, Huey B.
1970-01-01
To help determine the relationships between certain personality variables and conforming judgment, and differences in conforming judgment among differently structured groups, prison inmates were studied for the personality variables of IQ (California Capacity Questionnaire), agreement response set (Couch and Kenniston Scale), and dogmatism (Form E,…
Bayesian Peak Picking for NMR Spectra
Cheng, Yichen
2014-02-01
Protein structure determination is a very important topic in structural genomics, which helps people to understand a variety of biological functions such as protein-protein interactions and protein-DNA interactions. Nowadays, nuclear magnetic resonance (NMR) is often used to determine the three-dimensional structures of proteins in vivo. This study aims to automate the peak picking step, the most important and tricky step in NMR structure determination. We propose to model the NMR spectrum by a mixture of bivariate Gaussian densities and use the stochastic approximation Monte Carlo algorithm as the computational tool to solve the problem. Under the Bayesian framework, the peak picking problem is cast as a variable selection problem. The proposed method can automatically distinguish true peaks from false ones without preprocessing the data. To the best of our knowledge, this is the first effort in the literature that tackles the peak picking problem for NMR spectrum data using a Bayesian method.
Ciarleglio, Maria M; Arendt, Christopher D; Makuch, Robert W; Peduzzi, Peter N
2015-03-01
Specification of the treatment effect that a clinical trial is designed to detect (θA) plays a critical role in sample size and power calculations. However, no formal method exists for using prior information to guide the choice of θA. This paper presents a hybrid classical and Bayesian procedure for choosing an estimate of the treatment effect to be detected in a clinical trial that formally integrates prior information into this aspect of trial design. The value of θA is found that equates the pre-specified frequentist power and the conditional expected power of the trial. The conditional expected power averages the traditional frequentist power curve using the conditional prior distribution of the true unknown treatment effect θ as the averaging weight. The Bayesian prior distribution summarizes current knowledge of both the magnitude of the treatment effect and the strength of the prior information through the assumed spread of the distribution. By using a hybrid classical and Bayesian approach, we are able to formally integrate prior information on the uncertainty and variability of the treatment effect into the design of the study, mitigating the risk that the power calculation will be overly optimistic while maintaining a frequentist framework for the final analysis. The value of θA found using this method may be written as a function of the prior mean μ0 and standard deviation τ0, with a unique relationship for a given ratio of μ0/τ0. Results are presented for Normal, Uniform, and Gamma priors for θ. PMID:25583273
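The hybrid rule above can be sketched numerically: compute the conditional expected power by averaging the frequentist power curve over the prior truncated to positive effects, then invert the power curve to find the θA that attains it. The sketch below assumes a normal outcome, a two-sample design with a one-sided rejection rule at two-sided α = 0.05, and illustrative prior values μ0 = 0.3, τ0 = 0.2 (not taken from the paper):

```python
# Find theta_A whose frequentist power equals the conditional expected power.
import numpy as np
from scipy import stats
from scipy.integrate import quad

n_per_arm, sigma = 100, 1.0
se = sigma * np.sqrt(2.0 / n_per_arm)
z = stats.norm.ppf(0.975)

def power(theta):
    """Approximate power of the two-sample z-test at true effect theta."""
    return stats.norm.cdf(theta / se - z)

mu0, tau0 = 0.3, 0.2                 # prior on the true effect theta
prior = stats.norm(mu0, tau0)
norm_const = 1.0 - prior.cdf(0.0)    # condition on theta > 0
cep, _ = quad(lambda t: power(t) * prior.pdf(t) / norm_const, 0.0, np.inf)

theta_A = (stats.norm.ppf(cep) + z) * se   # invert the power curve
print(cep, theta_A)
```

Because the power curve is strictly increasing in θ, the inversion is closed-form here; with other test statistics a root-finder would replace the last line.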
Random Forests for Ordinal Response Data: Prediction and Variable Selection
Janitza, Silke; Tutz, Gerhard; Boulesteix, Anne-Laure
2014-01-01
The random forest method is a commonly used tool for classification with high-dimensional data that is able to rank candidate predictors through its inbuilt variable importance measures (VIMs). It can be applied to various kinds of regression problems including nominal, metric and survival response variables. While classification and regression problems using random forest methodology have been extensively investigated in the past, there seems to be a lack of literature on handling ordinal re...
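One baseline the paper's ordinal-aware proposals are measured against is simply treating the ordered response as nominal classes and reading off the forest's built-in variable importance measures. A minimal sketch on synthetic data (the three-level ordinal response, coefficients, and thresholds are all illustrative):

```python
# Random-forest variable ranking via impurity-based importances,
# treating an ordinal response as nominal classes.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
n = 600
X = rng.normal(size=(n, 6))
# Ordinal response (3 levels) driven by the first two predictors only.
latent = 2.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(scale=0.5, size=n)
y = np.digitize(latent, [-1.0, 1.0])   # classes 0, 1, 2

rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, y)
ranking = np.argsort(rf.feature_importances_)[::-1]
print(ranking, rf.feature_importances_.round(3))
```

The informative predictors rise to the top of the ranking; what this treatment discards is the ordering information among the classes, which is the gap the paper addresses.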
Regression Analysis with Block Missing Values and Variables Selection
Chien-Pai Han
2011-07-01
We consider a regression model when a block of observations is missing, i.e., there is a group of observations with all the explanatory variables or covariates observed and another set of observations with only a block of the variables observed. We propose an estimator of the regression coefficients that is a combination of two estimators: one based on the observations with no missing variables, and the other based on all observations after deleting the block of variables with missing values. The proposed combined estimator is compared with the uncombined estimators. If the experimenter suspects that the variables with missing values may be deleted, a preliminary test can be performed to resolve the uncertainty. If the preliminary test of the null hypothesis that the regression coefficients of the variables with missing values equal zero is accepted, then only the data with no missing values are used for estimating the regression coefficients; otherwise the combined estimator is used. This gives a preliminary test estimator. The properties of the preliminary test estimator and comparisons of the estimators are studied by a Monte Carlo study.
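The preliminary-test step described above can be sketched as an F-test on the block of coefficients, computed on the complete cases, that switches between a full-model fit and a reduced-model fit. This is a simplified sketch of the idea, not the paper's exact estimator; the data sizes, coefficients, and α = 0.05 threshold are illustrative:

```python
# Preliminary-test choice between a full model (complete cases only)
# and a reduced model with the missing block of covariates dropped.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n, n_complete = 200, 120
X1 = rng.normal(size=(n, 2))
X2 = rng.normal(size=(n, 2))     # observed only for the first n_complete rows
y = X1 @ np.array([1.0, -0.5]) + X2 @ np.array([0.8, 0.0]) + rng.normal(size=n)

def ols(X, y):
    coef = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ coef
    return coef, resid @ resid

# Full and reduced models on the complete cases, for the block F-test.
Xf = np.column_stack([np.ones(n_complete), X1[:n_complete], X2[:n_complete]])
bf, rss_full = ols(Xf, y[:n_complete])
Xr = np.column_stack([np.ones(n_complete), X1[:n_complete]])
br, rss_red = ols(Xr, y[:n_complete])

q, dof = 2, n_complete - Xf.shape[1]
F = ((rss_red - rss_full) / q) / (rss_full / dof)
p_value = stats.f.sf(F, q, dof)

if p_value < 0.05:
    coef = bf    # block matters: keep X2, use the complete-case full model
else:
    coef = ols(np.column_stack([np.ones(n), X1]), y)[0]   # drop X2, use all rows
print(F, p_value < 0.05)
```

With one block coefficient at 0.8, the test rejects and the full model is kept; with a truly null block the reduced fit on all n observations would be used instead.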
Finley, Andrew O.; Banerjee, Sudipto; Cook, Bruce D.; Bradford, John B.
2013-01-01
In this paper we detail a multivariate spatial regression model that couples LiDAR, hyperspectral and forest inventory data to predict forest outcome variables at a high spatial resolution. The proposed model is used to analyze forest inventory data collected on the US Forest Service Penobscot Experimental Forest (PEF), ME, USA. In addition to helping meet the regression model's assumptions, results from the PEF analysis suggest that the addition of multivariate spatial random effects improves model fit and predictive ability, compared with two commonly applied modeling approaches. This improvement results from explicitly modeling the covariation among forest outcome variables and spatial dependence among observations through the random effects. Direct application of such multivariate models to even moderately large datasets is often computationally infeasible because of cubic order matrix algorithms involved in estimation. We apply a spatial dimension reduction technique to help overcome this computational hurdle without sacrificing richness in modeling.
Bayesian phylogeography finds its roots.
Philippe Lemey
2009-09-01
As a key factor in endemic and epidemic dynamics, the geographical distribution of viruses has been frequently interpreted in the light of their genetic histories. Unfortunately, inference of historical dispersal or migration patterns of viruses has mainly been restricted to model-free heuristic approaches that provide little insight into the temporal setting of the spatial dynamics. The introduction of probabilistic models of evolution, however, offers unique opportunities to engage in this statistical endeavor. Here we introduce a Bayesian framework for inference, visualization and hypothesis testing of phylogeographic history. By implementing character mapping in Bayesian software that samples time-scaled phylogenies, we enable the reconstruction of timed viral dispersal patterns while accommodating phylogenetic uncertainty. Standard Markov model inference is extended with a stochastic search variable selection procedure that identifies parsimonious descriptions of the diffusion process. In addition, we propose priors that can incorporate geographical sampling distributions or characterize alternative hypotheses about the spatial dynamics. To visualize the spatial and temporal information, we summarize inferences using virtual globe software. We describe how Bayesian phylogeography compares with previous parsimony analysis in the investigation of the influenza A H5N1 origin and H5N1 epidemiological linkage among sampling localities. Analysis of rabies in West African dog populations reveals how virus diffusion may enable endemic maintenance through continuous epidemic cycles. From these analyses, we conclude that our phylogeographic framework will be an important asset in molecular epidemiology that can be easily generalized to infer biogeography from genetic data for many organisms.
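The stochastic search variable selection (SSVS) step mentioned above is easiest to see in a simpler setting than the phylogeographic diffusion model: a linear regression with spike-and-slab priors, where Gibbs sampling alternates between coefficients and binary inclusion indicators. A minimal sketch, with all data and prior scales (spike 0.1, slab 3.0) chosen purely for illustration:

```python
# Minimal stochastic search variable selection (SSVS) Gibbs sampler
# for a linear model with spike-and-slab priors on the coefficients.
import numpy as np

rng = np.random.default_rng(2)
n, p = 120, 5
X = rng.normal(size=(n, p))
beta_true = np.array([2.0, 0.0, 0.0, -1.5, 0.0])
sigma = 1.0
y = X @ beta_true + sigma * rng.normal(size=n)

spike, slab = 0.1, 3.0        # prior sds under gamma_j = 0 / gamma_j = 1
prior_inc = 0.5               # prior inclusion probability
gamma = np.ones(p, dtype=int)
inclusion = np.zeros(p)
n_iter, burn = 2000, 500

for it in range(n_iter):
    # Sample beta | gamma, y (conjugate multivariate normal).
    D_inv = np.diag(1.0 / np.where(gamma == 1, slab, spike) ** 2)
    V = np.linalg.inv(X.T @ X / sigma**2 + D_inv)
    m = V @ (X.T @ y) / sigma**2
    beta = rng.multivariate_normal(m, V)
    # Sample gamma_j | beta_j (Bernoulli from the spike/slab density ratio).
    for j in range(p):
        d1 = prior_inc * np.exp(-0.5 * (beta[j] / slab) ** 2) / slab
        d0 = (1 - prior_inc) * np.exp(-0.5 * (beta[j] / spike) ** 2) / spike
        gamma[j] = rng.random() < d1 / (d1 + d0)
    if it >= burn:
        inclusion += gamma

inclusion /= (n_iter - burn)
print(inclusion.round(2))    # posterior inclusion probabilities
```

The sampler's posterior inclusion probabilities concentrate on the truly nonzero coefficients; in the phylogeographic setting the same indicator machinery switches diffusion rates between locations on and off.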
A Simple Method for Variable Selection in Regression with Respect to Treatment Selection
Lacey Gunter
2011-09-01
In this paper, we compare the method of Gunter et al. (2011) for variable selection in treatment comparison analysis (an approach to regression analysis where treatment-covariate interactions are deemed important) with a simple stepwise selection method that we introduce. The stepwise method has several advantages, most notably its generalization to regression models that are not necessarily linear, its simplicity, and its intuitive nature. We show that the new simple method works surprisingly well compared to the more complex method when compared in the linear regression framework. We use four generative models (explicitly detailed in the paper) for the simulations and compare spuriously identified interactions and, where applicable (generative models 3 and 4), correctly identified interactions. We also apply the new method to logistic regression and Poisson regression and illustrate its performance in Table 2 in the paper. The simple method can be applied to other types of regression models including various other generalized linear models, Cox proportional hazard models and nonlinear models.
Adaptive Dynamic Bayesian Networks
Ng, B M
2007-10-26
A discrete-time Markov process can be compactly modeled as a dynamic Bayesian network (DBN)--a graphical model with nodes representing random variables and directed edges indicating causality between variables. Each node has a probability distribution, conditional on the variables represented by the parent nodes. A DBN's graphical structure encodes fixed conditional dependencies between variables. But in real-world systems, conditional dependencies between variables may be unknown a priori or may vary over time. Model errors can result if the DBN fails to capture all possible interactions between variables. Thus, we explore the representational framework of adaptive DBNs, whose structure and parameters can change from one time step to the next: a distribution's parameters and its set of conditional variables are dynamic. This work builds on recent work in nonparametric Bayesian modeling, such as hierarchical Dirichlet processes, infinite-state hidden Markov networks and structured priors for Bayes net learning. In this paper, we will explain the motivation for our interest in adaptive DBNs, show how popular nonparametric methods are combined to formulate the foundations for adaptive DBNs, and present preliminary results.
Tibbetts, Elizabeth A
2004-01-01
The ability to recognize individuals is common in animals; however, we know little about why the phenotypic variability necessary for individual recognition has evolved in some animals but not others. One possibility is that natural selection favours variability in some social contexts but not in others. Polistes fuscatus wasps have variable facial and abdominal markings used for individual recognition within their complex societies. Here, I explore whether social behaviour can select for var...
VARIABLE SELECTION BY PSEUDO WAVELETS IN HETEROSCEDASTIC REGRESSION MODELS INVOLVING TIME SERIES
[No author listed]
2006-01-01
A simple but efficient method is proposed to select variables in heteroscedastic regression models. It is shown that the pseudo empirical wavelet coefficients corresponding to the significant explanatory variables in the regression models are clearly larger than those of the nonsignificant ones; on this basis, a procedure is developed to select variables in regression models. The coefficients of the models are also estimated. All estimators are proved to be consistent.
Variable selection in functional data classification: a maxima-hunting proposal
Berrendero, José R.; Cuevas, Antonio; Torrecilla, José L.
2013-01-01
Variable selection is considered in the setting of supervised binary classification with functional data $\{X(t),\ t\in[0,1]\}$. By "variable selection" we mean any dimension-reduction method which replaces the whole trajectory $\{X(t),\ t\in[0,1]\}$ with a low-dimensional vector $(X(t_1),\ldots,X(t_k))$ while still keeping a similar classification error. Our proposal for variable selection is based on the idea of selecting the local maxima $(t_1,\ldots,t_k)$ of the function ${\mathcal V}_...
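The maxima-hunting idea above can be sketched in a few lines: compute a pointwise relevance score between $X(t)$ and the class label, then keep the local maxima of that score. The sketch below uses squared Pearson correlation as a stand-in for the paper's relevance functional, on synthetic curves whose class signal sits at $t = 0.3$ and $t = 0.7$ (all choices illustrative):

```python
# Maxima-hunting variable selection for functional binary classification.
import numpy as np
from scipy.signal import argrelmax

rng = np.random.default_rng(3)
n, T = 300, 100
t = np.linspace(0.0, 1.0, T)
y = rng.integers(0, 2, size=n)             # binary class labels
X = rng.normal(scale=0.5, size=(n, T))     # iid noise curves, for simplicity
# Class signal concentrated around t = 0.3 and t = 0.7.
bump = np.exp(-(t - 0.3) ** 2 / 0.002) + np.exp(-(t - 0.7) ** 2 / 0.002)
X += np.outer(y, bump)

# Pointwise relevance score and its local maxima.
relevance = np.array([np.corrcoef(X[:, j], y)[0, 1] ** 2 for j in range(T)])
peaks = argrelmax(relevance, order=10)[0]
selected = np.sort(peaks[np.argsort(relevance[peaks])[::-1][:2]])
print(t[selected].round(2))
```

Only the selected time points $(X(t_1),\ldots,X(t_k))$ are passed to a standard finite-dimensional classifier, which is the dimension reduction the abstract describes.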
Gil Luiz HS
2010-11-01
Background: Plasmodium vivax malaria is a major public health challenge in Latin America, Asia and Oceania, with 130-435 million clinical cases per year worldwide. Invasion of host blood cells by P. vivax mainly depends on a type I membrane protein called Duffy binding protein (PvDBP). The erythrocyte-binding motif of PvDBP is a 170 amino-acid stretch located in its cysteine-rich region II (PvDBPII), which is the most variable segment of the protein. Methods: To test whether diversifying natural selection has shaped the nucleotide diversity of PvDBPII in Brazilian populations, this region was sequenced in 122 isolates from six different geographic areas. A Bayesian method was applied to test for the action of natural selection under a population genetic model that incorporates recombination. The analysis was integrated with a structural model of PvDBPII, and T- and B-cell epitopes were localized on the 3-D structure. Results: The results suggest that (i) recombination plays an important role in determining the haplotype structure of PvDBPII, and (ii) PvDBPII appears to contain neutrally evolving codons as well as codons evolving under natural selection. Diversifying selection preferentially acts on sites identified as epitopes, particularly on amino acid residues 417, 419, and 424, which show strong linkage disequilibrium. Conclusions: This study shows that some polymorphisms of PvDBPII are present near the erythrocyte-binding domain and might serve to elude antibodies that inhibit cell invasion. Therefore, these polymorphisms should be taken into account when designing vaccines aimed at eliciting antibodies to inhibit erythrocyte invasion.
High-Dimensional Non-Linear Variable Selection through Hierarchical Kernel Learning
Bach, Francis
2009-01-01
We consider the problem of high-dimensional non-linear variable selection for supervised learning. Our approach is based on performing linear selection among exponentially many appropriately defined positive definite kernels that characterize non-linear interactions between the original variables. To select efficiently from these many kernels, we use the natural hierarchical structure of the problem to extend the multiple kernel learning framework to kernels that can be embedded in a directed acyclic graph; we show that it is then possible to perform kernel selection through a graph-adapted sparsity-inducing norm, in polynomial time in the number of selected kernels. Moreover, we study the consistency of variable selection in high-dimensional settings, showing that under certain assumptions, our regularization framework allows a number of irrelevant variables which is exponential in the number of observations. Our simulations on synthetic datasets and datasets from the UCI repository show state-of-the-art pre...
Optical variability of X-ray-selected QSOs
Photometric data for ten X-ray-selected quasistellar objects have been obtained from archival records of the Rosemary Hill Observatory. Reliable magnitudes were obtained for seven of the ten sources and six displayed optical variations significant at the 95 percent confidence level or greater. One source appeared to exhibit optically violent behavior. Light curves and photographic magnitudes are presented and discussed.
GOUR BANDYOPADHYAY
2013-01-01
The study uses correlation and regression analysis to examine the impact of Non Performing Assets (NPA) on selected banking variables in two Public Sector Banks (PSBs) in India. Initially to examine degree of association between the strategic banking variables identified, simple correlation co-efficients have been computed and their significance examined. For the purpose of examining impact of NPA on the profitability and other strategic banking variables including time variable, simple li...
Bayesian Methods and Universal Darwinism
Campbell, John
2010-01-01
Bayesian methods since the time of Laplace have been understood by their practitioners as closely aligned to the scientific method. Indeed a recent champion of Bayesian methods, E. T. Jaynes, titled his textbook on the subject Probability Theory: the Logic of Science. Many philosophers of science including Karl Popper and Donald Campbell have interpreted the evolution of Science as a Darwinian process consisting of a 'copy with selective retention' algorithm abstracted from Darwin's theory of...
Cameron, Ewan
2013-01-01
In the second paper of this series we extend our Bayesian reanalysis of the evidence for a cosmic variation of the fine structure constant to the semi-parametric modelling regime. By adopting a mixture of Dirichlet processes prior for the unexplained errors in each instrumental subgroup of the benchmark quasar dataset we go some way towards freeing our model selection procedure from the apparent subjectivity of a fixed distributional form. Despite the infinite-dimensional domain of the error hierarchy so constructed we are able to demonstrate a recursive scheme for marginal likelihood estimation with prior-sensitivity analysis directly analogous to that presented in Paper I, thereby allowing the robustness of our posterior Bayes factors to hyper-parameter choice and model specification to be readily verified. In the course of this work we elucidate various similarities between unexplained error problems in the seemingly disparate fields of astronomy and clinical meta-analysis, and we highlight a number of sop...
Estimating a positive false discovery rate for variable selection in pharmacogenetic studies.
Li, Lang; Hui, Siu; Pennello, Gene; Desta, Zeruesenay; Todd, Skaar; Nguyen, Anne; Flockhart, David
2007-01-01
Selecting predictors to optimize outcome prediction is an important statistical task. However, it usually ignores the false positives among the selected predictors. In this paper, we develop a positive false discovery rate (pFDR) estimate for a conventional step-wise forward variable selection procedure. We propose two views of a variable selection process, an overall and an individual test. An interesting feature of the overall test is that its power of selecting non-null predictors increases with the proportion of non-null predictors among all candidate predictors. Data analysis is illustrated with a pharmacogenetics example. PMID:17885872
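The step-wise forward selection procedure that the pFDR estimate above is built around can be sketched roughly as follows. This is an illustrative greedy version that scores candidates by squared correlation with the current residual, not the authors' implementation; all names are ours, and a real procedure would use F-tests or p-values at each step.

```python
# Greedy forward variable selection: at each step, add the candidate
# column whose squared correlation with the current residual is largest,
# then regress the residual on that column.

def _corr(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5 if sxx and syy else 0.0

def forward_select(X, y, k):
    """X: list of columns (each a list of n values); y: response; k: steps."""
    selected, residual = [], y[:]
    for _ in range(k):
        best, best_r2 = None, -1.0
        for j, col in enumerate(X):
            if j in selected:
                continue
            r2 = _corr(col, residual) ** 2
            if r2 > best_r2:
                best, best_r2 = j, r2
        selected.append(best)
        # Univariate update of the residual on the chosen column
        # (a full method would refit all selected variables jointly).
        n = len(residual)
        mx = sum(X[best]) / n
        my = sum(residual) / n
        beta = sum((a - mx) * (b - my) for a, b in zip(X[best], residual)) / \
               sum((a - mx) ** 2 for a in X[best])
        residual = [b - beta * (a - mx) - my for a, b in zip(X[best], residual)]
    return selected
```

The pFDR machinery of the paper would then sit on top of this loop, estimating how many of the `k` selected indices are false positives.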
Ödman, Peter; Johansen, C.L.; Olsson, L.;
2010-01-01
... of biomass and substrate (casamino acids) concentrations, respectively. The effect of combination of fluorescence and gas analyzer data as well as of different variable selection methods was investigated. Improved prediction models were obtained by combination of data from the two sensors and by variable selection using a genetic algorithm, interval PLS, and the principal variables method, respectively. A stepwise variable elimination method was applied to the three-way fluorescence data, resulting in simpler and more accurate N-PLS models. The prediction models were validated using leave...
Kirstein, Roland
2005-01-01
This paper presents a modification of the inspection game: the 'Bayesian Monitoring' model rests on the assumption that judges are interested in enforcing compliant behavior and making correct decisions. They may base their judgements on an informative but imperfect signal which can be generated costlessly. In the original inspection game, monitoring is costly and generates a perfectly informative signal. While the inspection game has only one mixed strategy equilibrium, three Perfect Bayesia...
Variability-based active galactic nucleus selection using image subtraction in the SDSS and LSST era
Choi, Yumi; Gibson, Robert R.; Becker, Andrew C.; Ivezić, Željko; Connolly, Andrew J.; Ruan, John J.; Anderson, Scott F. [Department of Astronomy, University of Washington, Box 351580, Seattle, WA 98195 (United States); MacLeod, Chelsea L., E-mail: ymchoi@astro.washington.edu [Physics Department, U.S. Naval Academy, 572 Holloway Road, Annapolis, MD 21402 (United States)
2014-02-10
With upcoming all-sky surveys such as LSST poised to generate a deep digital movie of the optical sky, variability-based active galactic nucleus (AGN) selection will enable the construction of highly complete catalogs with minimum contamination. In this study, we generate g-band difference images and construct light curves (LCs) for QSO/AGN candidates listed in Sloan Digital Sky Survey Stripe 82 public catalogs compiled from different methods, including spectroscopy, optical colors, variability, and X-ray detection. Image differencing excels at identifying variable sources embedded in complex or blended emission regions such as Type II AGNs and other low-luminosity AGNs that may be omitted from traditional photometric or spectroscopic catalogs. To separate QSOs/AGNs from other sources using our difference image LCs, we explore several LC statistics and parameterize optical variability by the characteristic damping timescale (τ) and variability amplitude. By virtue of distinguishable variability parameters of AGNs, we are able to select them with high completeness of 93.4% and efficiency (i.e., purity) of 71.3%. Based on optical variability, we also select highly variable blazar candidates, whose infrared colors are consistent with known blazars. One-third of them are also radio detected. With the X-ray selected AGN candidates, we probe the optical variability of X-ray detected optically extended sources using their difference image LCs for the first time. A combination of optical variability and X-ray detection enables us to select various types of host-dominated AGNs. Contrary to the AGN unification model prediction, two Type II AGN candidates (out of six) show detectable variability on long-term timescales like typical Type I AGNs. This study will provide a baseline for future optical variability studies of extended sources.
Resting high frequency heart rate variability selectively predicts cooperative behavior.
Beffara, Brice; Bret, Amélie G; Vermeulen, Nicolas; Mermillod, Martial
2016-10-01
This study explores whether the vagal connection between the heart and the brain is involved in prosocial behaviors. The Polyvagal Theory postulates that vagal activity underlies prosocial tendencies. Even if several results suggest that vagal activity is associated with prosocial behaviors, none of them used behavioral measures of prosociality to establish this relationship. We recorded the resting state vagal activity (reflected by High Frequency Heart Rate Variability, HF-HRV) of 48 (42 suitable for analysis) healthy human adults and measured their level of cooperation during a hawk-dove game. We also manipulated the consequence of mutual defection in the hawk-dove game (severe vs. moderate). Results show that HF-HRV is positively and linearly related to cooperation level, but only when the consequence of mutual defection is severe (compared to moderate). This supports that (i) prosocial behaviors are likely to be underpinned by vagal functioning and (ii) physiological disposition to cooperate interacts with environmental context. We discuss these results within the theoretical framework of the Polyvagal Theory. PMID:27343804
3D Bayesian contextual classifiers
Larsen, Rasmus
2000-01-01
We extend a series of multivariate Bayesian 2-D contextual classifiers to 3-D by specifying a simultaneous Gaussian distribution for the feature vectors as well as a prior distribution of the class variables of a pixel and its 6 nearest 3-D neighbours.
Galea, J. M.; Ruge, D.; Buijink, A.; Bestmann, S.; Rothwell, J. C.
2013-01-01
Action selection describes the high-level process which selects between competing movements. In animals, behavioural variability is critical for the motor exploration required to select the action which optimizes reward and minimizes cost/punishment, and is guided by dopamine (DA). The aim of this study was to test in humans whether low-level movement parameters are affected by punishment and reward in ways similar to high-level action selection. Moreover, we addressed the proposed dependence...
Variable selection methods in PLS regression - a comparison study on metabolomics data
Karaman, İbrahim; Hedemann, Mette Skou; Knudsen, Knud Erik Bach;
Partial least squares regression (PLSR) has been applied to various fields such as psychometrics, consumer science, econometrics and process control. Recently it has been applied to metabolomics based data sets (GC/LC-MS, NMR) and proven to be very powerful in situations with many variables ... integrated approach. Due to the high number of variables in data sets (both raw data and after peak picking) the selection of important variables in an explorative analysis is difficult, especially when different data sets of metabolomics data need to be related. Variable selection (or removal of irrelevant variables) ... The aim of the metabolomics study was to investigate the metabolic profile in pigs fed various cereal fractions with special attention to the metabolism of lignans using an LC-MS based metabolomics approach. References 1. Lê Cao KA, Rossouw D, Robert-Granié C, Besse P: A Sparse PLS for Variable Selection when...
Discriminative variable selection for clustering with the sparse Fisher-EM algorithm
Bouveyron, Charles
2012-01-01
The interest in variable selection for clustering has increased recently due to the growing need for clustering high-dimensional data. Variable selection allows in particular to ease both the clustering and the interpretation of the results. Existing approaches have demonstrated the efficiency of variable selection for clustering but turn out to be either very time consuming or not sparse enough in high-dimensional spaces. This work proposes to perform a selection of the discriminative variables by introducing sparsity in the loading matrix of the Fisher-EM algorithm. This clustering method has been recently proposed for the simultaneous visualization and clustering of high-dimensional data. It is based on a latent mixture model which fits the data into a low-dimensional discriminative subspace. Three different approaches are proposed in this work to introduce sparsity in the orientation matrix of the discriminative subspace through $\ell_{1}$-type penalizations. Experimental comparisons with existing approach...
Self-selection for personality variables among healthy volunteers.
Pieters, M S; Jennekens-Schinkel, A; Schoemaker, H C; Cohen, A F
1992-01-01
1. Healthy student volunteers (n = 103) participating in ongoing clinical pharmacological research completed the Dutch Personality Inventory (DPI), the Dutch version of the Spielberger State-Trait Anxiety Inventory (STAI-DY) and the Dutch version of the Sensation Seeking Scale (SSS). 2. The volunteers were more extrovert (P less than 0.001), more flexible (P less than 0.001), more tolerant or less impulsive (P less than 0.001), had more self-confidence and initiative (P less than 0.001), and were more satisfied and optimistic (P less than 0.01) when compared with the general norm. When compared with a student norm, volunteers had lower levels of state (P less than 0.001) and trait (P less than 0.05) anxiety. The general sensation seeking tendency of volunteers was higher than in the student norm group (P less than 0.001). The volunteers had a greater tendency to thrill-and-adventure-seeking (P less than 0.001) and to disinhibition (P less than 0.01). 3. Hence, volunteers were a selected sample of the total population of students. This may influence the interpretation of pharmacokinetic and pharmacodynamic parameters. 4. Personality screening should be added to the screening procedures for volunteers. PMID:1540478
EFFECT OF ASANA PRACTICES AND BRISK WALKING ON SELECTED PSYCHOLOGICAL VARIABLES AMONG DIABETIC WOMEN
Sabarinathan, J; D. Sakthignanavel
2015-01-01
The purpose of the study was to find out the effect of asana practices and brisk walking on selected psychological variables among diabetic women. The study was conducted on sixty diabetic women in three groups, namely a control group and experimental groups I and II, each consisting of 20 diabetic women; the experimental groups underwent eight weeks of practice in selected asana practices and brisk walking, whereas the control group did not undergo any type of training. The psychological variables in anxiety, se...
Stock market reaction to selected macroeconomic variables in the Nigerian economy
Abraham, Terfa Williams
2011-01-01
This study examines the relationship between the stock market and selected macroeconomic variables in Nigeria. The all share index was used as a proxy for the stock market while inflation, interest and exchange rates were the macroeconomic variables selected. Employing error correction model, it was found that a significant negative short run relationship exists between the stock market and the minimum rediscounting rate (MRR) implying that, a decrease in the MRR, would improve the performanc...
Predictive modeling with high-dimensional data streams: an on-line variable selection approach
McWilliams, Brian; Montana, Giovanni
2009-01-01
In this paper we propose a computationally efficient algorithm for on-line variable selection in multivariate regression problems involving high dimensional data streams. The algorithm recursively extracts all the latent factors of a partial least squares solution and selects the most important variables for each factor. This is achieved by means of only one sparse singular value decomposition which can be efficiently updated on-line and in an adaptive fashion. Simul...
Sparse partial least squares for on-line variable selection in multivariate data streams
McWilliams, Brian; Montana, Giovanni
2009-01-01
In this paper we propose a computationally efficient algorithm for on-line variable selection in multivariate regression problems involving high dimensional data streams. The algorithm recursively extracts all the latent factors of a partial least squares solution and selects the most important variables for each factor. This is achieved by means of only one sparse singular value decomposition which can be efficiently updated on-line and in an adaptive fashion. Simulation results based on art...
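The sparse singular value decomposition at the heart of the two algorithms above can be sketched as a rank-one power iteration with soft-thresholding on one singular vector. This is an illustrative sketch under our own naming (`lam`, `iters` are assumed parameters), not the authors' implementation:

```python
# Sparse rank-1 SVD by alternating power iterations, with soft-thresholding
# applied to the left singular vector to induce sparsity.

def soft(v, lam):
    """Soft-threshold each entry of v by lam."""
    return [max(abs(x) - lam, 0.0) * (1.0 if x >= 0 else -1.0) for x in v]

def norm(v):
    return sum(x * x for x in v) ** 0.5

def sparse_rank1(M, lam=0.1, iters=50):
    """M: matrix as a list of rows. Returns (u, v), u sparse, both unit-norm."""
    m, n = len(M), len(M[0])
    v = [1.0 / n ** 0.5] * n
    u = [0.0] * m
    for _ in range(iters):
        u = [sum(M[i][j] * v[j] for j in range(n)) for i in range(m)]
        u = soft(u, lam)                 # sparsity step on u
        nu = norm(u)
        if nu == 0:
            break                        # threshold killed everything
        u = [x / nu for x in u]
        v = [sum(M[i][j] * u[i] for i in range(m)) for j in range(n)]
        nv = norm(v)
        v = [x / nv for x in v]
    return u, v
```

In an on-line setting, the papers update this decomposition recursively as new observations arrive rather than recomputing it from scratch.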
Variability-selected low luminosity AGNs in the SA57 and in the CDFS
Vagnetti, F; Trevese, D
2009-01-01
Low Luminosity Active Galactic Nuclei (LLAGNs) are contaminated by the light of their host galaxies, thus they cannot be detected by the usual colour techniques. For this reason their evolution in cosmic time is poorly known. Variability is a property shared by virtually all active galactic nuclei, and it has been adopted as a criterion to select them using multi-epoch surveys. Here we report on two variability surveys in different sky areas, the Selected Area 57 and the Chandra Deep Field South.
M. Dhanalakshmi; Grace Helina; Senthilkumar
2015-01-01
The aim of this study was to find out the comparative effects of aerobics and resistance exercises on selected physiological variables among obese children. To achieve the purpose, 60 obese children whose BMI was greater than 30 kg/m2 were randomly selected and assigned into three groups: an aerobics exercises group (AEG), a resistance training group (RTG) and a control group (CG), consisting of 20 each. After assessing the physiological variables, forced vital capacity and resting heart rate init...
Variable selection in multiple linear regression: The influence of individual cases
SJ Steel
2007-12-01
The influence of individual cases in a data set is studied when variable selection is applied in multiple linear regression. Two different influence measures, based on the C_p criterion and Akaike's information criterion, are introduced. The relative change in the selection criterion when an individual case is omitted is proposed as the selection influence of the specific omitted case. Four standard examples from the literature are considered and the selection influence of the cases is calculated. It is argued that the selection procedure may be improved by taking the selection influence of individual data cases into account.
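The selection-influence idea above, the relative change in a criterion when one case is omitted, can be illustrated for AIC in a one-predictor regression. This is a sketch with our own naming; the paper also works with the C_p criterion:

```python
# Selection influence sketch: for each case i, refit a simple
# one-predictor regression without that case and report the relative
# change in AIC versus the full-data fit.
import math

def aic_simple_reg(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    beta = sum((a - mx) * (b - my) for a, b in zip(x, y)) / \
           sum((a - mx) ** 2 for a in x)
    rss = sum((b - my - beta * (a - mx)) ** 2 for a, b in zip(x, y))
    return n * math.log(rss / n) + 2 * 2   # two parameters: intercept, slope

def selection_influence(x, y):
    """Relative AIC change when each case is deleted in turn."""
    full = aic_simple_reg(x, y)
    out = []
    for i in range(len(x)):
        xi = x[:i] + x[i + 1:]
        yi = y[:i] + y[i + 1:]
        out.append((aic_simple_reg(xi, yi) - full) / abs(full))
    return out
```

A case whose deletion changes the criterion sharply (an outlier, say) is flagged as highly influential for the selection decision.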
Hill Steven M; Neve Richard M; Bayani Nora; Kuo Wen-Lin; Ziyad Safiyyah; Spellman Paul T; Gray Joe W; Mukherjee Sach
2012-01-01
Abstract Background An important question in the analysis of biochemical data is that of identifying subsets of molecular variables that may jointly influence a biological response. Statistical variable selection methods have been widely used for this purpose. In many settings, it may be important to incorporate ancillary biological information concerning the variables of interest. Pathway and network maps are one example of a source of such information. However, although ancillary informatio...
Zhu, Xiang-Wei; Xin, Yan-Jun; Ge, Hui-Lin
2015-04-27
Variable selection is of crucial significance in QSAR modeling since it increases the model's predictive ability and reduces noise. The selection of the right variables is far more complicated than the development of predictive models. In this study, eight continuous and categorical data sets were employed to explore the applicability of two distinct variable selection methods: random forests (RF) and the least absolute shrinkage and selection operator (LASSO). Variable selection was performed: (1) by using recursive random forests to rule out a quarter of the least important descriptors at each iteration and (2) by using LASSO modeling with 10-fold inner cross-validation to tune its penalty λ for each data set. Along with regular statistical parameters of model performance, we proposed the highest pairwise correlation rate, average pairwise Pearson's correlation coefficient, and Tanimoto coefficient to evaluate the optimal variables selected by RF and LASSO in an extensive way. Results showed that variable selection could allow a tremendous reduction of noisy descriptors (at most 96% with the RF method in this study) and apparently enhance the model's predictive performance as well. Furthermore, random forests showed the property of gathering important predictors without restricting their pairwise correlation, which is contrary to LASSO. The mutual exclusion of highly correlated variables in LASSO modeling tends to skip important variables that are highly related to response endpoints and thus undermine the model's predictive performance. The optimal variables selected by RF share low similarity with those by LASSO (e.g., the Tanimoto coefficients were smaller than 0.20 in seven out of eight data sets). We found that the differences between RF and LASSO predictive performances mainly resulted from the variables selected by different strategies rather than the learning algorithms. Our study showed that the right selection of variables is more important than the learning algorithm for modeling. We hope
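For reference, the Tanimoto coefficient used above to compare RF-selected and LASSO-selected descriptor sets is simply the Jaccard index over the two index sets. A minimal helper:

```python
# Tanimoto (Jaccard) coefficient between two selected-variable sets:
# |intersection| / |union|. Two identical sets give 1.0, disjoint sets 0.0.

def tanimoto(a, b):
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0
```

A value below 0.20, as reported above for seven of the eight data sets, means the two methods agree on fewer than one in five of the variables they jointly select.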
Probability and Bayesian statistics
1987-01-01
This book contains selected and refereed contributions to the "International Symposium on Probability and Bayesian Statistics" which was organized to celebrate the 80th birthday of Professor Bruno de Finetti at his birthplace Innsbruck in Austria. Since Professor de Finetti died in 1985 the symposium was dedicated to the memory of Bruno de Finetti and took place at Igls near Innsbruck from 23 to 26 September 1986. Some of the papers are published especially because of their relationship to Bruno de Finetti's scientific work. The evolution of stochastics shows the growing importance of probability as a coherent assessment of numerical values as degrees of belief in certain events. This is the basis for Bayesian inference in the sense of modern statistics. The contributions in this volume cover a broad spectrum ranging from foundations of probability across psychological aspects of formulating subjective probability statements, abstract measure theoretical considerations, contributions to theoretical statistics an...
Bessiere, Pierre; Ahuactzin, Juan Manuel; Mekhnacha, Kamel
2013-01-01
Probability as an Alternative to Boolean Logic. While logic is the mathematical foundation of rational reasoning and the fundamental principle of computing, it is restricted to problems where information is both complete and certain. However, many real-world problems, from financial investments to email filtering, are incomplete or uncertain in nature. Probability theory and Bayesian computing together provide an alternative framework to deal with incomplete and uncertain data. Decision-Making Tools and Methods for Incomplete and Uncertain Data. Emphasizing probability as an alternative to Boolean
Henry de-Graft Acquah; Joseph Acquah
2013-01-01
Alternative formulations of the Bayesian Information Criteria provide a basis for choosing between competing methods for detecting price asymmetry. However, very little is understood about their performance in the asymmetric price transmission modelling framework. In addressing this issue, this paper introduces and applies parametric bootstrap techniques to evaluate the ability of Bayesian Information Criteria (BIC) and Draper's Information Criteria (DIC) in discriminating between alternative...
The Time Domain Spectroscopic Survey: Variable Object Selection and Anticipated Results
Morganson, Eric; Anderson, Scott F; Ruan, John J; Myers, Adam D; Eracleous, Michael; Kelly, Brandon; Badenes, Carlos; Banados, Eduardo; Blanton, Michael R; Bershady, Matthew A; Borissova, Jura; Brandt, William Nielsen; Burgett, William S; Chambers, Kenneth; Draper, Peter W; Davenport, James R A; Flewelling, Heather; Garnavich, Peter; Hawley, Suzanne L; Hodapp, Klaus W; Isler, Jedidah C; Kaiser, Nick; Kinemuchi, Karen; Kudritzki, Rolf P; Metcalfe, Nigel; Morgan, Jeffrey S; Paris, Isabelle; Parvizi, Mahmoud; Poleski, Radoslaw; Price, Paul A; Salvato, Mara; Shanks, Tom; Schlafly, Eddie F; Schneider, Donald P; Shen, Yue; Stassun, Keivan; Tonry, John T; Walter, Fabian; Waters, Chris Z
2015-01-01
We present the selection algorithm and anticipated results for the Time Domain Spectroscopic Survey (TDSS). TDSS is an SDSS-IV eBOSS subproject that will provide initial identification spectra of approximately 220,000 luminosity-variable objects (variable stars and AGN) across 7,500 square degrees selected from a combination of SDSS and multi-epoch Pan-STARRS1 photometry. TDSS will be the largest spectroscopic survey to explicitly target variable objects, avoiding pre-selection on the basis of colors or detailed modeling of specific variability characteristics. Kernel Density Estimate (KDE) analysis of our target population performed on SDSS Stripe 82 data suggests our target sample will be 95% pure (meaning 95% of objects we select have genuine luminosity variability of a few magnitudes or more). Our final spectroscopic sample will contain roughly 135,000 quasars and 85,000 stellar variables, approximately 4,000 of which will be RR Lyrae stars which may be used as outer Milky Way probes. The variability-sele...
Comparison of Sparse and Jack-knife partial least squares regression methods for variable selection
Karaman, Ibrahim; Qannari, El Mostafa; Martens, Harald;
2013-01-01
The objective of this study was to compare two different techniques of variable selection, Sparse PLSR and Jack-knife PLSR, with respect to their predictive ability and their ability to identify relevant variables. Sparse PLSR is a method that is frequently used in genomics, whereas Jack-knife PLSR is often used by chemometricians. In order to evaluate the predictive ability of both methods, cross model validation was implemented. The performance of both methods was assessed using FTIR spectroscopic data, on the one hand, and a set of simulated data. The stability of the variable selection procedures was highlighted by the frequency of the selection of each variable in the cross model validation segments. Computationally, Jack-knife PLSR was much faster than Sparse PLSR. But while it was found that both methods have more or less the same predictive ability, Sparse PLSR turned out to be generally very stable.
A bootstrapping soft shrinkage approach for variable selection in chemical modeling.
Deng, Bai-Chuan; Yun, Yong-Huan; Cao, Dong-Sheng; Yin, Yu-Long; Wang, Wei-Ting; Lu, Hong-Mei; Luo, Qian-Yi; Liang, Yi-Zeng
2016-02-18
In this study, a new variable selection method called the bootstrapping soft shrinkage (BOSS) method is developed. It is derived from the idea of weighted bootstrap sampling (WBS) and model population analysis (MPA). The weights of variables are determined based on the absolute values of regression coefficients. WBS is applied according to the weights to generate sub-models and MPA is used to analyze the sub-models to update weights for variables. The optimization procedure follows the rule of soft shrinkage, in which less important variables are not eliminated directly but are assigned smaller weights. The algorithm runs iteratively and terminates when the number of variables reaches one. The optimal variable set with the lowest root mean squared error of cross-validation (RMSECV) is selected. The method was tested on three groups of near infrared (NIR) spectroscopic datasets, i.e. corn datasets, diesel fuels datasets and soy datasets. Three high performing variable selection methods, i.e. Monte Carlo uninformative variable elimination (MCUVE), competitive adaptive reweighted sampling (CARS) and genetic algorithm partial least squares (GA-PLS), are used for comparison. The results show that BOSS is promising with improved prediction performance. The Matlab codes for implementing BOSS are freely available on the website: http://www.mathworks.com/matlabcentral/fileexchange/52770-boss. PMID:26826688
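One iteration of the BOSS loop described above, weighted bootstrap sampling of variables, scoring of sub-models, and a soft-shrinkage weight update, can be sketched as follows. The model-fitting step is stubbed out as a callback, and all names are ours, not the authors' Matlab code:

```python
# Skeleton of one BOSS-style iteration: sample sub-models of variables
# according to the current weights, accumulate |regression coefficients|,
# and return renormalized weights. Soft shrinkage: unimportant variables
# are down-weighted, never deleted outright.
import random

def weighted_sample(weights, size, rng):
    """Draw variable indices with replacement, proportional to weights."""
    return rng.choices(range(len(weights)), weights=weights, k=size)

def boss_iteration(weights, fit_coefs, n_submodels=100, size=10, rng=None):
    """weights: per-variable weights; fit_coefs(vars) -> {var: coefficient}."""
    rng = rng or random.Random(0)
    accum = [0.0] * len(weights)
    for _ in range(n_submodels):
        variables = sorted(set(weighted_sample(weights, size, rng)))
        for var, coef in fit_coefs(variables).items():
            accum[var] += abs(coef)
    total = sum(accum) or 1.0
    return [a / total for a in accum]
```

The full algorithm would repeat this until one variable remains, fitting a PLS model in `fit_coefs` and keeping the sub-model with the lowest RMSECV along the way.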
GOUR BANDYOPADHYAY
2013-12-01
The study uses correlation and regression analysis to examine the impact of Non Performing Assets (NPA) on selected banking variables in two Public Sector Banks (PSBs) in India. Initially to examine degree of association between the strategic banking variables identified, simple correlation co-efficients have been computed and their significance examined. For the purpose of examining impact of NPA on the profitability and other strategic banking variables including time variable, simple linear regression and multiple regression (as appropriate) have been attempted. To diagnose the problem of multi co-linearity in multiple regressions, the value of tolerance factor (TOL) along with variance inflating factor (VIF) have also been computed and compared with standard. The study reveals that NPA has statistically significant negative impact on profitability and statistically significant impact on few strategic banking variables in respect of two selected PSBs.
De Smet, Tom; Struys, Michel M. R. F.; Neckebroek, Martine M.; Van den Hauwe, Kristof; Bonte, Sjoert; Mortier, Eric P.
2008-01-01
BACKGROUND: Closed-loop control of the hypnotic component of anesthesia has been proposed in an attempt to optimize drug delivery. Here, we introduce a newly developed Bayesian-based, patient-individualized, model-based, adaptive control method for bispectral index (BIS) guided propofol infusion int
Refining gene signatures: a Bayesian approach
Labbe Aurélie
2009-12-01
Abstract Background In high density arrays, the identification of relevant genes for disease classification is complicated by not only the curse of dimensionality but also the highly correlated nature of the array data. In this paper, we are interested in the question of how many and which genes should be selected for a disease class prediction. Our work consists of a Bayesian supervised statistical learning approach to refine gene signatures with a regularization which penalizes for the correlation between the variables selected. Results Our simulation results show that we can most often recover the correct subset of genes that predict the class as compared to other methods, even when accuracy and subset size remain the same. On real microarray datasets, we show that our approach can refine gene signatures to obtain either the same or better predictive performance than other existing methods with a smaller number of genes. Conclusions Our novel Bayesian approach includes a prior which penalizes highly correlated features in model selection and is able to extract key genes in the highly correlated context of microarray data. The methodology in the paper is described in the context of microarray data, but can be applied to any array data (such as micro RNA data, for example) as a first step towards predictive modeling of cancer pathways. A user-friendly software implementation of the method is available.
COMPARISON OF SELECTED PHYSIOLOGICAL VARIABLES OF PLAYERS BELONGING TO VARIOUS DISTANCE RUNNERS
Satpal Yadav; Arvind S. Sajwan; Ankan Sinha
2009-01-01
The purpose of the study was to compare selected physiological variables, namely maximum oxygen consumption, vital capacity, resting heart rate and hemoglobin content, among various distance runners. The subjects were selected from the male athletes of Gwalior district competing in various distance events, i.e. short-, middle- and long-distance running. Ten (10) male athletes from each of the groups, namely the short-, middle- and long-distance groups, were selected as subjects for the study. Selec...
EFFECT OF PLYOMETRIC TRAINING ON SELECTED SKILL PERFORMANCE VARIABLES AMONG FEMALE HOCKEY PLAYERS
G. VASANTHI; P. Y. Sivachandran
2014-01-01
The purpose of the study was to find out the effect of plyometric training on selected skill performance variables among female hockey players. To achieve the purpose of the present study, thirty female hockey players were randomly selected from PKR Women College of Arts and Science and Gopi Arts and Science College, Erode district, Tamilnadu, India, and their ages ranged from 18 to 21 years. The selected subjects were divided into two groups of fifteen subjects each. Group I ...
A survey of variable selection methods in two Chinese epidemiology journals
Lynn Henry S
2010-09-01
Abstract Background: Although much has been written on developing better procedures for variable selection, there is little research on how it is practiced in actual studies. This review surveys the variable selection methods reported in two high-ranking Chinese epidemiology journals. Methods: Articles published in 2004, 2006, and 2008 in the Chinese Journal of Epidemiology and the Chinese Journal of Preventive Medicine were reviewed. Five categories of methods were identified whereby variables were selected using: A - bivariate analyses; B - multivariable analysis, e.g. stepwise or individual significance testing of model coefficients; C - first bivariate analyses, followed by multivariable analysis; D - bivariate analyses or multivariable analysis; and E - other criteria like prior knowledge or personal judgment. Results: Among the 287 articles that reported using variable selection methods, 6%, 26%, 30%, 21%, and 17% were in categories A through E, respectively. One hundred sixty-three studies selected variables using bivariate analyses, 80% (130/163) via multiple significance testing at the 5% alpha-level. Of the 219 multivariable analyses, 97 (44%) used stepwise procedures, 89 (41%) tested individual regression coefficients, but 33 (15%) did not mention how variables were selected. Sixty percent (58/97) of the stepwise routines also did not specify the algorithm and/or significance levels. Conclusions: The variable selection methods reported in the two journals were limited in variety, and details were often missing. Many studies still relied on problematic techniques like stepwise procedures and/or multiple testing of bivariate associations at the 0.05 alpha-level. These deficiencies should be rectified to safeguard the scientific validity of articles published in Chinese epidemiology journals.
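As a concrete illustration of the stepwise routines this survey found to be so common, a minimal forward-selection sketch follows. The RSS-improvement stopping rule and the synthetic data are illustrative assumptions, not any surveyed study's procedure:

```python
import numpy as np

def forward_stepwise(X, y, min_improvement=0.05):
    # Greedy forward selection: repeatedly add the predictor that most
    # reduces the residual sum of squares, stopping when the relative
    # improvement falls below a threshold (a simple stand-in for the
    # formal F-tests used in classical stepwise procedures).
    n, p = X.shape

    def rss(cols):
        Z = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
        beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
        r = y - Z @ beta
        return float(r @ r)

    selected, current = [], rss([])
    while len(selected) < p:
        new, j = min((rss(selected + [j]), j)
                     for j in range(p) if j not in selected)
        if (current - new) / current < min_improvement:
            break
        selected.append(j)
        current = new
    return selected

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)
print(sorted(forward_stepwise(X, y)))  # expected: [0, 1]
```

Note that such greedy procedures inherit exactly the multiple-testing problems the survey criticizes; the sketch is descriptive, not a recommendation.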
Wu, Rui-mei; Zhao, Jie-wen; Chen, Quan-sheng; Huang, Xing-yi
2011-07-01
The present paper attempts to study the feasibility of determining the taste quality of green tea using FT-NIR spectroscopy combined with variable selection methods. Chemistry evaluation, as the reference measurement, was used to measure the total taste scores of green tea infusions. First, synergy interval PLS (siPLS) was implemented to select efficient spectral regions from SNV-preprocessed spectra; then, optimal variables were selected from these spectral regions using a genetic algorithm (GA), and the optimal model was achieved with Rp = 0.8908 and RMSEP = 4.66 in the prediction set when 38 variables and 6 PLS factors were included. Experimental results showed that the performance of the siPLS-GA model was superior to that of the others. This study demonstrated that NIR spectra can be used successfully to measure the taste quality of green tea and that the siPLS-GA algorithm is superior to other algorithms for developing NIR spectral regression models. PMID:21942023
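The GA variable-selection step described above can be sketched generically as follows. This is a plain genetic algorithm on synthetic data with an assumed least-squares fitness and illustrative population size, generation count and mutation rate, not the paper's siPLS-GA configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

def fitness(mask, X, y):
    # Negative training RMSE of a least-squares fit on the selected
    # columns, lightly penalised for each variable kept (the 0.01
    # penalty is an illustrative choice encouraging sparse subsets).
    cols = np.flatnonzero(mask)
    if cols.size == 0:
        return -np.inf
    Z = np.column_stack([np.ones(len(y)), X[:, cols]])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    rmse = np.sqrt(np.mean((y - Z @ beta) ** 2))
    return -rmse - 0.01 * cols.size

def ga_select(X, y, pop=30, gens=40, p_mut=0.1):
    n_var = X.shape[1]
    population = rng.integers(0, 2, size=(pop, n_var))  # bit-mask genomes
    for _ in range(gens):
        scores = np.array([fitness(m, X, y) for m in population])
        parents = population[np.argsort(scores)[::-1][: pop // 2]]
        children = []
        for _ in range(pop - len(parents)):
            a, b = parents[rng.integers(len(parents), size=2)]
            cut = rng.integers(1, n_var)           # one-point crossover
            child = np.concatenate([a[:cut], b[cut:]])
            flip = rng.random(n_var) < p_mut       # bit-flip mutation
            children.append(np.where(flip, 1 - child, child))
        population = np.vstack([parents, children])
    scores = np.array([fitness(m, X, y) for m in population])
    return population[np.argmax(scores)]

X = rng.normal(size=(150, 8))
y = 2.0 * X[:, 0] + 1.5 * X[:, 3] + rng.normal(scale=0.3, size=150)
best = ga_select(X, y)
print(np.flatnonzero(best))
```

In the paper's pipeline the bit mask would index siPLS-selected spectral variables and the fitness would be a cross-validated PLS error rather than this simple penalised RMSE.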
Woosik Jang
2015-01-01
Since the 1970s, revenues generated by Korean contractors in international construction have increased rapidly, exceeding USD 70 billion per year in recent years. However, Korean contractors face significant risks from market uncertainty, sensitivity to economic volatility, and technical difficulties. As the volatility of these risks threatens project profitability, approximately 15% of bad projects were found to account for 74% of the losses in the same international construction sector. Anticipating bad projects via preemptive risk management can better prevent losses, so that contractors can enhance the efficiency of bidding decisions during the early stages of a project cycle. In line with these objectives, this paper examines the effect of such factors on the degree of project profitability. The Naïve Bayesian classifier is applied to identify a good project screening tool, which increases practical applicability by using binomial variables with the limited information obtainable in the early stages. The proposed model produced superior classification results that adequately reflect contractor views of risk. It is anticipated that when users apply the proposed model based on their own knowledge and expertise, overall firm profit rates will increase as a result of the early abandonment of bad projects as well as the prioritization of good projects before final bidding decisions are made.
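A minimal sketch of the kind of binomial-variable Naïve Bayes screening described above, using hypothetical binary risk indicators rather than the paper's actual project variables:

```python
import numpy as np

def fit_bernoulli_nb(X, y):
    # Bernoulli Naive Bayes with Laplace smoothing: estimate each class's
    # prior and per-feature "indicator on" probability from binary data.
    classes = np.unique(y)
    priors = {c: np.mean(y == c) for c in classes}
    probs = {c: (X[y == c].sum(axis=0) + 1.0) / ((y == c).sum() + 2.0)
             for c in classes}
    return classes, priors, probs

def predict(model, x):
    # Pick the class with the highest log posterior under the
    # conditional-independence (naive) assumption.
    classes, priors, probs = model
    def log_post(c):
        p = probs[c]
        return np.log(priors[c]) + np.sum(
            x * np.log(p) + (1 - x) * np.log(1 - p))
    return max(classes, key=log_post)

# Toy data: y = 1 marks a "bad project" flagged by two binary risk
# indicators (both indicators and labels are made up for illustration).
X = np.array([[1, 1], [1, 0], [0, 0], [0, 1], [1, 1], [0, 0]])
y = np.array([1, 1, 0, 0, 1, 0])
model = fit_bernoulli_nb(X, y)
print(predict(model, np.array([1, 1])))  # expected: 1
```

The appeal of this family of classifiers for early-stage screening is exactly what the abstract notes: it needs only class-conditional frequencies of binary indicators, which remain estimable when information is limited.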
Variable selectivity and the role of nutritional quality in food selection by a planktonic rotifer
To investigate the potential for selective feeding to enhance fitness, I test the hypothesis that an herbivorous zooplankter selects those food items that best support its reproduction. Under this hypothesis, growth and reproduction on selected food items should be higher than on less preferred items. The hypothesis is not supported. In situ selectivity by the rotifer Keratella taurocephala for Cryptomonas relative to Chlamydomonas goes through a seasonal cycle, in apparent response to fluctuating Cryptomonas populations. However, reproduction on a unialgal diet of Cryptomonas is consistently high and similar to that on Chlamydomonas. Oocystis, which also supports reproduction equivalent to that supported by Chlamydomonas, is sometimes rejected by K. taurocephala. In addition, K. taurocephala does not discriminate between Merismopedia and Chlamydomonas even though Merismopedia supports virtually no reproduction by the rotifer. Selection by K. taurocephala does not simply maximize the intake of food items that yield high reproduction. Selectivity is a complex, dynamic process, one function of which may be the exploitation of locally or seasonally abundant foods. (author)
Bayesian Approach to Handling Informative Sampling
Sikov, Anna
2015-01-01
In the case of informative sampling, the sampling scheme depends explicitly or implicitly on the response variable. As a result, the sample distribution of the response variable cannot be used for making inference about the population. In this research I investigate the problem of informative sampling from the Bayesian perspective. Application of the Bayesian approach permits solving the problems that arise due to the complexity of the models used for handling informative sampling. The main...
Kuiper, Rebecca M; Nederhoff, Tim; Klugkist, Irene
2015-05-01
In this paper, the performance of six types of techniques for comparisons of means is examined. These six emerge from the distinction between the method employed (hypothesis testing, model selection using information criteria, or Bayesian model selection) and the set of hypotheses that is investigated (a classical, exploration-based set of hypotheses containing equality constraints on the means, or a theory-based limited set of hypotheses with equality and/or order restrictions). A simulation study is conducted to examine the performance of these techniques. We demonstrate that, if one has specific, a priori specified hypotheses, confirmation (i.e., investigating theory-based hypotheses) has advantages over exploration (i.e., examining all possible equality-constrained hypotheses). Furthermore, examining reasonable order-restricted hypotheses has more power to detect the true effect/non-null hypothesis than evaluating only equality restrictions. Additionally, when investigating more than one theory-based hypothesis, model selection is preferred over hypothesis testing. Because of the first two results, we further examine the techniques that are able to evaluate order restrictions in a confirmatory fashion by examining their performance when the homogeneity of variance assumption is violated. Results show that the techniques are robust to heterogeneity when the sample sizes are equal. When the sample sizes are unequal, the performance is affected by heterogeneity. The size and direction of the deviations from the baseline, where there is no heterogeneity, depend on the effect size (of the means) and on the trend in the group variances with respect to the ordering of the group sizes. Importantly, the deviations are less pronounced when the group variances and sizes exhibit the same trend (e.g., are both increasing with group number). PMID:24975402
Universal Darwinism as a process of Bayesian inference
Campbell, John O
2016-01-01
Many of the mathematical frameworks describing natural selection are equivalent to Bayes' Theorem, also known as Bayesian updating. By definition, a process of Bayesian inference is one which involves a Bayesian update, so we may conclude that these frameworks describe natural selection as a process of Bayesian inference. Thus natural selection serves as a counterexample to a widely held interpretation that restricts Bayesian inference to human mental processes (including the endeavors of statisticians). As Bayesian inference can always be cast in terms of (variational) free energy minimization, natural selection can be viewed as comprising two components: a generative model of an "experiment" in the external world environment, and the results of that "experiment" or the "surprise" entailed by predicted and actual outcomes of the "experiment". Minimization of free energy implies that the implicit measure of "surprise" experienced serves to update the generative model in a Bayesian manner. This description clo...
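The claimed equivalence fits in a few lines: a discrete replicator update of type frequencies is formally identical to a Bayesian update, with frequencies playing the role of the prior and fitnesses the role of likelihoods (the numbers below are purely illustrative):

```python
import numpy as np

# Discrete replicator dynamics: the new frequency of type i is
# proportional to (current frequency) x (fitness of type i) -- the same
# arithmetic as Bayes' rule, posterior ∝ prior x likelihood.
freqs = np.array([0.5, 0.3, 0.2])    # "prior": type frequencies
fitness = np.array([1.0, 2.0, 4.0])  # "likelihood": relative fitnesses

posterior = freqs * fitness
posterior /= posterior.sum()          # normalise, as in Bayes' rule
print(posterior.round(3))             # expected: [0.263 0.316 0.421]
```

The fittest type gains frequency exactly as the hypothesis best predicting the data gains posterior probability; iterating the update is iterated Bayesian conditioning.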
Bayesian Methods and Universal Darwinism
Campbell, John
2010-01-01
Bayesian methods since the time of Laplace have been understood by their practitioners as closely aligned to the scientific method. Indeed a recent champion of Bayesian methods, E. T. Jaynes, titled his textbook on the subject Probability Theory: the Logic of Science. Many philosophers of science including Karl Popper and Donald Campbell have interpreted the evolution of Science as a Darwinian process consisting of a 'copy with selective retention' algorithm abstracted from Darwin's theory of Natural Selection. Arguments are presented for an isomorphism between Bayesian Methods and Darwinian processes. Universal Darwinism, as the term has been developed by Richard Dawkins, Daniel Dennett and Susan Blackmore, is the collection of scientific theories which explain the creation and evolution of their subject matter as due to the operation of Darwinian processes. These subject matters span the fields of atomic physics, chemistry, biology and the social sciences. The principle of Maximum Entropy states that system...
Creaco, E.; Berardi, L.; Sun, Siao; Giustolisi, O.; Savic, D.
2016-04-01
The growing availability of field data, from information and communication technologies (ICTs) in "smart" urban infrastructures, allows data modeling to understand complex phenomena and to support management decisions. Among the analyzed phenomena, those related to storm water quality modeling have recently been gaining interest in the scientific literature. Nonetheless, the large amount of available data poses the problem of selecting relevant variables to describe a phenomenon and enable robust data modeling. This paper presents a procedure for the selection of relevant input variables using the multiobjective evolutionary polynomial regression (EPR-MOGA) paradigm. The procedure is based on scrutinizing the explanatory variables that appear inside the set of EPR-MOGA symbolic model expressions of increasing complexity and goodness of fit to target output. The strategy also enables the selection to be validated by engineering judgement. In such context, the multiple case study extension of EPR-MOGA, called MCS-EPR-MOGA, is adopted. The application of the proposed procedure to modeling storm water quality parameters in two French catchments shows that it was able to significantly reduce the number of explanatory variables for successive analyses. Finally, the EPR-MOGA models obtained after the input selection are compared with those obtained by using the same technique without benefitting from input selection and with those obtained in previous works where other data-modeling techniques were used on the same data. The comparison highlights the effectiveness of both EPR-MOGA and the input selection procedure.
Ross, Cody T; Strimling, Pontus; Ericksen, Karen Paige; Lindenfors, Patrik; Mulder, Monique Borgerhoff
2016-06-01
We present formal evolutionary models for the origins and persistence of the practice of Female Genital Modification (FGMo). We then test the implications of these models using normative cross-cultural data on FGMo in Africa and Bayesian phylogenetic methods that explicitly model adaptive evolution. Empirical evidence provides some support for the findings of our evolutionary models that the de novo origins of the FGMo practice should be associated with social stratification, and that social stratification should place selective pressures on the adoption of FGMo; these results, however, are tempered by the finding that FGMo has arisen in many cultures that have no social stratification, and that forces operating orthogonally to stratification appear to play a more important role in the cross-cultural distribution of FGMo. To explain these cases, one must consider cultural evolutionary explanations in conjunction with behavioral ecological ones. We conclude with a discussion of the implications of our study for policies designed to end the practice of FGMo. PMID:26846688
Definition of Valid Proteomic Biomarkers: A Bayesian Solution
Harris, Keith; Girolami, Mark; Mischak, Harald
Clinical proteomics is suffering from high hopes generated by reports on apparent biomarkers, most of which could not be later substantiated via validation. This has brought into focus the need for improved methods of finding a panel of clearly defined biomarkers. To examine this problem, urinary proteome data was collected from healthy adult males and females, and analysed to find biomarkers that differentiated between genders. We believe that models that incorporate sparsity in terms of variables are desirable for biomarker selection, as proteomics data typically contains a huge number of variables (peptides) and few samples making the selection process potentially unstable. This suggests the application of a two-level hierarchical Bayesian probit regression model for variable selection which assumes a prior that favours sparseness. The classification performance of this method is shown to improve that of the Probabilistic K-Nearest Neighbour model.
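A brute-force sketch of the general kind of sparse Bayesian variable selection discussed above: posterior inclusion probabilities computed by enumerating all 2^p linear models under a Zellner g-prior. This is an assumed stand-in for the paper's hierarchical probit sampler, feasible only for small p:

```python
import itertools
import numpy as np

def log_bf(Xc, yc, cols, g=100.0):
    # Log Bayes factor of the model using columns `cols` against the
    # intercept-only null model under Zellner's g-prior (centred data).
    n, k = len(yc), len(cols)
    if k == 0:
        return 0.0
    Z = Xc[:, cols]
    beta, *_ = np.linalg.lstsq(Z, yc, rcond=None)
    resid = yc - Z @ beta
    r2 = 1.0 - (resid @ resid) / (yc @ yc)
    return (0.5 * (n - 1 - k) * np.log(1 + g)
            - 0.5 * (n - 1) * np.log(1 + g * (1 - r2)))

def inclusion_probs(X, y, prior_incl=0.5):
    # Posterior inclusion probability of each variable via exhaustive
    # enumeration of all 2^p models; a sparseness-favouring prior would
    # simply lower prior_incl.
    Xc, yc = X - X.mean(axis=0), y - y.mean()
    p = X.shape[1]
    models = [list(c) for k in range(p + 1)
              for c in itertools.combinations(range(p), k)]
    logs = np.array([log_bf(Xc, yc, m)
                     + len(m) * np.log(prior_incl)
                     + (p - len(m)) * np.log(1 - prior_incl)
                     for m in models])
    w = np.exp(logs - logs.max())
    w /= w.sum()
    probs = np.zeros(p)
    for wi, m in zip(w, models):
        probs[m] += wi
    return probs

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))
y = 2.0 * X[:, 0] + rng.normal(scale=0.5, size=100)
print(inclusion_probs(X, y).round(2))
```

With thousands of peptides, enumeration is impossible, which is why the paper relies on MCMC over a hierarchical probit model; the quantity being approximated, however, is the same posterior inclusion probability.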
Variable selection in the explorative analysis of several data blocks in metabolomics
Karaman, İbrahim; Nørskov, Natalja; Yde, Christian Clement;
… highly correlated data sets in one integrated approach. Due to the high number of variables in data sets from metabolomics (both raw data and after peak picking), the selection of important variables in an explorative analysis is difficult, especially when different metabolomics data sets need to be related. … Tools for handling mental overflow, minimising false discovery rates by using both statistical and biological validation in an integrative approach, are needed. In this paper different strategies for variable selection were considered with respect to false discovery and the possibility … with many variables, for the purpose of reducing over-fitting problems and providing useful interpretation tools. These tools have excellent possibilities for giving a graphical overview of sample and variation patterns. They handle co-linearity in an efficient way and make it possible to use different …
Variable selection in PLSR and extensions to a multi-block setting for metabolomics data
Karaman, İbrahim; Hedemann, Mette Skou; Knudsen, Knud Erik Bach;
When applying LC-MS or NMR spectroscopy in metabolomics studies, high-dimensional data are generated and effective tools for variable selection are needed in order to detect the important metabolites. Methods based on sparsity combined with PLSR have recently attracted attention in the field of … genomics [1]. They quickly became well established in the field of statistics because a close relationship to the elastic net has been established. In sparse variable selection combined with PLSR, a soft thresholding is applied on each loading weight separately. In the field of chemometrics, Jack-knifing has … multi-block situation. Thereby the close relationship to the elastic net remains established. [1] K. A. Lê Cao, D. Rossouw, C. Robert-Granié, and P. Besse, A sparse PLS for variable selection when integrating omics data, Statistical Applications in Genetics and Molecular Biology, 7 (2008). [2] F. Westad and H. Martens …
Variability-based AGN selection using image subtraction in the SDSS and LSST era
Choi, Yumi; Gibson, Robert R.; Becker, Andrew C.; Ivezić, Željko; Connolly, Andrew J.; MacLeod, Chelsea L.; Ruan, John J.; Anderson, Scott F.
2013-01-01
With upcoming all sky surveys such as LSST poised to generate a deep digital movie of the optical sky, variability-based AGN selection will enable the construction of highly-complete catalogs with minimum contamination. In this study, we generate $g$-band difference images and construct light curves for QSO/AGN candidates listed in SDSS Stripe 82 public catalogs compiled from different methods, including spectroscopy, optical colors, variability, and X-ray detection. Image differencing excels...
Geographic Elements Selection Algorithm Based on Quadtree in Variable-scale Visualization
Hao Guo; Feixiang Chen; Junjie Peng
2013-01-01
In order to balance the demand between local and global visualization in data acquisition, this paper adopts variable-scale visualization technology and uses a quadrangular frustum pyramid projection to show geographic information continuously on a mobile device. In addition, the geographic elements in the variable-scale transition region become crowded because of the unceasing scale change. In order to solve this problem, this paper presents a quadtree-based geographic elements selection al...
Zuber, Verena
2012-01-01
In this thesis, we address the identification of biomarkers in high-dimensional omics data. The identification of valid biomarkers is especially relevant for personalized medicine that depends on accurate prediction rules. Moreover, biomarkers elucidate the provenance of disease, or molecular changes related to disease. From a statistical point of view the identification of biomarkers is best cast as variable selection. In particular, we refer to variables as the molecular attributes under in...