Directory of Open Access Journals (Sweden)
Ying Wang
Full Text Available In this study, we used high-dimensional pattern regression methods based on structural (gray and white matter; GM and WM) and functional (positron emission tomography of regional cerebral blood flow; PET) brain data to identify cross-sectional imaging biomarkers of cognitive performance in cognitively normal older adults from the Baltimore Longitudinal Study of Aging (BLSA). We focused on specific components of the executive and memory domains known to decline with aging, including manipulation, semantic retrieval, long-term memory (LTM), and short-term memory (STM). For each imaging modality, brain regions associated with each cognitive domain were generated by adaptive regional clustering. A relevance vector machine was adopted to model the nonlinear continuous relationship between brain regions and cognitive performance, with cross-validation to select the most informative brain regions (using recursive feature elimination) as imaging biomarkers and to optimize model parameters. Predicted cognitive scores from our regression algorithm based on the resulting brain regions correlated well with actual performance. Moreover, regression models obtained using combined GM, WM, and PET imaging modalities outperformed models based on single modalities. Imaging biomarkers related to memory performance included the orbito-frontal and medial temporal cortical regions, with LTM showing a stronger correlation with the temporal lobe than STM. Brain regions predicting executive performance included orbito-frontal and occipito-temporal areas. The PET modality contributed most to the majority of cognitive domains except manipulation, which had a higher WM contribution from the superior longitudinal fasciculus and the genu of the corpus callosum. These findings based on machine-learning methods demonstrate the importance of combining structural and functional imaging data in understanding complex cognitive mechanisms and also their potential usage as biomarkers that predict cognitive
Efficient Smoothed Concomitant Lasso Estimation for High Dimensional Regression
Ndiaye, Eugene; Fercoq, Olivier; Gramfort, Alexandre; Leclère, Vincent; Salmon, Joseph
2017-10-01
In high-dimensional settings, sparse structures are crucial for efficiency, in terms of memory, computation, and performance. It is customary to use an ℓ1 penalty to enforce sparsity in such scenarios. Sparsity-enforcing methods, the Lasso being a canonical example, are popular candidates to address high dimension. For efficiency, they rely on tuning a parameter that trades data fitting against sparsity. For Lasso theory to hold, this tuning parameter should be proportional to the noise level, yet the latter is often unknown in practice. A possible remedy is to optimize jointly over the regression parameter and the noise level. This has been considered under several names in the literature: Scaled Lasso, Square-root Lasso, and Concomitant Lasso estimation, for instance, and could be of interest for uncertainty quantification. In this work, after illustrating numerical difficulties with the Concomitant Lasso formulation, we propose a modification, which we call the Smoothed Concomitant Lasso, aimed at increasing numerical stability. We propose an efficient and accurate solver with a computational cost no higher than that of the Lasso. We leverage standard ingredients behind the success of fast Lasso solvers: a coordinate descent algorithm combined with safe screening rules, which achieve speed by eliminating irrelevant features early.
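The alternating structure behind such joint estimation can be illustrated with a short sketch. This is not the authors' solver: it reuses scikit-learn's `Lasso` for the coefficient step and simply clips the noise estimate at a floor `sigma0`, which conveys the smoothing idea; all names are chosen here for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso

def smoothed_concomitant_lasso(X, y, lam=0.1, sigma0=1e-2, n_iter=20):
    """Alternately minimize the (smoothed) concomitant objective
        ||y - X b||^2 / (2 n sigma) + sigma / 2 + lam ||b||_1,
    with the noise estimate clipped below at sigma0 for stability."""
    n = X.shape[0]
    sigma = max(np.std(y), sigma0)
    b = np.zeros(X.shape[1])
    for _ in range(n_iter):
        # For fixed sigma, the b-step is a Lasso with penalty lam * sigma.
        lasso = Lasso(alpha=lam * sigma, fit_intercept=False, max_iter=10000)
        lasso.fit(X, y)
        b = lasso.coef_
        # For fixed b, the sigma-step has a closed form, clipped at sigma0.
        sigma = max(np.linalg.norm(y - X @ b) / np.sqrt(n), sigma0)
    return b, sigma
```

The clipping is what distinguishes the smoothed variant from the plain concomitant formulation, where the noise estimate can collapse toward zero and destabilize the iterations.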
Multivariate linear regression of high-dimensional fMRI data with multiple target variables.
Valente, Giancarlo; Castellanos, Agustin Lage; Vanacore, Gianluca; Formisano, Elia
2014-05-01
Multivariate regression is increasingly used to study the relation between fMRI spatial activation patterns and experimental stimuli or behavioral ratings. With linear models, informative brain locations are identified by mapping the model coefficients. This is a central aspect in neuroimaging, as it provides the sought-after link between the activity of neuronal populations and a subject's perception, cognition, or behavior. Here, we show that mapping informative brain locations using multivariate linear regression (MLR) may lead to incorrect conclusions and interpretations. MLR algorithms for high-dimensional data are designed to deal with targets (stimuli or behavioral ratings, in fMRI) separately, and the predictive map of a model integrates information deriving from both neural activity patterns and experimental design. Not accounting explicitly for the presence of other targets whose associated activity spatially overlaps with that of interest may lead to predictive maps that are difficult to interpret. We propose a new model that can correctly identify the spatial patterns associated with a target while achieving good generalization. For each target, the training is based on an augmented dataset, which includes all remaining targets. The estimation on such datasets produces both maps and interaction coefficients, which are then used to generalize. The proposed formulation is independent of the regression algorithm employed. We validate this model on simulated fMRI data and on a publicly available dataset. Results indicate that our method achieves high spatial sensitivity and good generalization and that it helps disentangle specific neural effects from interaction with predictive maps associated with other targets. Copyright © 2013 Wiley Periodicals, Inc.
Penalized estimation for competing risks regression with applications to high-dimensional covariates
DEFF Research Database (Denmark)
Ambrogi, Federico; Scheike, Thomas H.
2016-01-01
High-dimensional regression has become an increasingly important topic for many research fields. For example, biomedical research generates an increasing amount of data to characterize patients' bio-profiles (e.g. from a genomic high-throughput assay). The increasing complexity in the characterization of patients' bio-profiles is added to the complexity related to the prolonged follow-up of patients, with the registration of the occurrence of possible adverse events. This information may offer useful insight into disease dynamics and into identifying subsets of patients with worse prognosis and better response to therapy. Although in recent years the number of contributions for coping with high- and ultra-high-dimensional data in standard survival analysis has increased (Witten and Tibshirani, 2010. Survival analysis with high-dimensional covariates. Statistical Methods in Medical...
Energy Technology Data Exchange (ETDEWEB)
Li, Weixuan; Lin, Guang; Li, Bing
2016-09-01
A well-known challenge in uncertainty quantification (UQ) is the "curse of dimensionality". However, many high-dimensional UQ problems are essentially low-dimensional, because the randomness of the quantity of interest (QoI) is caused only by uncertain parameters varying within a low-dimensional subspace, known as the sufficient dimension reduction (SDR) subspace. Motivated by this observation, we propose and demonstrate in this paper an inverse regression-based UQ approach (IRUQ) for high-dimensional problems. Specifically, we use an inverse regression procedure to estimate the SDR subspace and then convert the original problem to a low-dimensional one, which can be efficiently solved by building a response surface model such as a polynomial chaos expansion. The novelty and advantages of the proposed approach are seen in its computational efficiency and practicality. Compared with Monte Carlo, the traditionally preferred approach for high-dimensional UQ, IRUQ at a comparable cost generally gives much more accurate solutions even for high-dimensional problems, and even when the dimension reduction is not exactly sufficient. Theoretically, IRUQ is proved to converge twice as fast as the approach it uses to seek the SDR subspace. For example, while a sliced inverse regression method converges to the SDR subspace at the rate of $O(n^{-1/2})$, the corresponding IRUQ converges at $O(n^{-1})$. IRUQ also provides several desired conveniences in practice. It is non-intrusive, requiring only a simulator to generate realizations of the QoI, and there is no need to compute the high-dimensional gradient of the QoI. Finally, error bars can be derived for the estimation results reported by IRUQ.
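The inverse-regression step can be illustrated with a minimal sliced-inverse-regression (SIR) sketch; the whitening and slicing scheme below are standard textbook choices, not necessarily those used in the paper:

```python
import numpy as np

def sir_directions(X, y, n_slices=10, n_dirs=1):
    """Sliced inverse regression: estimate SDR directions from the
    eigenvectors of Cov(E[X | y]) computed in whitened coordinates."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    # Whitening transform cov^{-1/2}.
    cov = Xc.T @ Xc / n
    evals, evecs = np.linalg.eigh(cov)
    W = evecs @ np.diag(evals ** -0.5) @ evecs.T
    Z = Xc @ W
    # Slice samples by sorted response and average Z within each slice.
    order = np.argsort(y)
    M = np.zeros((p, p))
    for idx in np.array_split(order, n_slices):
        m = Z[idx].mean(axis=0)
        M += len(idx) / n * np.outer(m, m)
    # Leading eigenvectors of M, mapped back to the original coordinates.
    _, v = np.linalg.eigh(M)
    dirs = W @ v[:, ::-1][:, :n_dirs]
    return dirs / np.linalg.norm(dirs, axis=0)
```

Once the directions are estimated, the original inputs can be projected onto them and a low-dimensional response surface fitted on the projections.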
Zhao, Lue Ping; Bolouri, Hamid
2016-04-01
Maturing omics technologies enable researchers to generate high-dimensional omics data (HDOD) routinely in translational clinical studies. In the field of oncology, The Cancer Genome Atlas (TCGA) provided funding support to researchers to generate different types of omics data on a common set of biospecimens with accompanying clinical data and has made the data available for the research community to mine. One important application, and the focus of this manuscript, is to build predictive models for prognostic outcomes based on HDOD. To complement prevailing regression-based approaches, we propose to use an object-oriented regression (OOR) methodology to identify exemplars specified by HDOD patterns and to assess their associations with prognostic outcome. By computing patients' similarities to these exemplars, the OOR-based predictive model produces a risk estimate from a patient's HDOD. The primary advantages of OOR are twofold: reducing the penalty of high dimensionality and retaining interpretability for clinical practitioners. To illustrate its utility, we apply OOR to gene expression data from non-small cell lung cancer patients in TCGA and build a predictive model for prognostic survivorship among stage I patients, i.e., we stratify these patients by their prognostic survival risks beyond histological classifications. Identification of these high-risk patients helps oncologists to develop effective treatment protocols and post-treatment disease management plans. Using the TCGA data, the total sample is divided into training and validation data sets. After building a predictive model in the training set, we compute risk scores from the predictive model and validate associations of risk scores with prognostic outcome in the validation data (P-value=0.015). Copyright © 2016 Elsevier Inc. All rights reserved.
Non-Asymptotic Oracle Inequalities for the High-Dimensional Cox Regression via Lasso.
Kong, Shengchun; Nan, Bin
2014-01-01
We consider finite sample properties of the regularized high-dimensional Cox regression via lasso. Existing literature focuses on linear models or generalized linear models with Lipschitz loss functions, where the empirical risk functions are the summations of independent and identically distributed (iid) losses. The summands in the negative log partial likelihood function for censored survival data, however, are neither iid nor Lipschitz. We first approximate the negative log partial likelihood function by a sum of iid non-Lipschitz terms, then derive the non-asymptotic oracle inequalities for the lasso penalized Cox regression using pointwise arguments to tackle the difficulties caused by the lack of iid Lipschitz losses.
Xu, Chao; Fang, Jian; Shen, Hui; Wang, Yu-Ping; Deng, Hong-Wen
2018-01-25
Extreme phenotype sampling (EPS) is a broadly used design to identify candidate genetic factors contributing to the variation of quantitative traits. By enriching the signals in extreme phenotypic samples, EPS can boost the association power compared to random sampling. Most existing statistical methods for EPS examine the genetic factors individually, even though many quantitative traits have multiple genetic factors underlying their variation. It is desirable to model the joint effects of genetic factors, which may increase the power and identify novel quantitative trait loci under EPS. The joint analysis of genetic data in high-dimensional situations requires specialized techniques, e.g., the least absolute shrinkage and selection operator (LASSO). Although there is extensive research and application related to LASSO, the statistical inference and testing for the sparse model under EPS remain unknown. We propose a novel sparse model (EPS-LASSO) with a hypothesis test for high-dimensional regression under EPS based on a decorrelated score function. A comprehensive simulation shows EPS-LASSO outperforms existing methods with stable type I error and FDR control. EPS-LASSO provides consistent power for both low- and high-dimensional situations compared with the other methods dealing with high-dimensional situations. The power of EPS-LASSO is close to other low-dimensional methods when the causal effect sizes are small and is superior when the effects are large. Applying EPS-LASSO to a transcriptome-wide gene expression study for obesity reveals 10 significant body mass index associated genes. Our results indicate that EPS-LASSO is an effective method for EPS data analysis, which can account for correlated predictors. The source code is available at https://github.com/xu1912/EPSLASSO. hdeng2@tulane.edu. Supplementary data are available at Bioinformatics online.
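The sampling design itself is easy to sketch. The snippet below only illustrates EPS followed by a plain cross-validated LASSO; it does not implement the decorrelated-score inference that EPS-LASSO adds on top, and the quantile cutoffs are arbitrary illustrative choices:

```python
import numpy as np
from sklearn.linear_model import LassoCV

def extreme_phenotype_lasso(X, y, lower_q=0.25, upper_q=0.75):
    """Keep only samples in the phenotype tails, then fit a
    cross-validated LASSO on the enriched subsample."""
    lo, hi = np.quantile(y, [lower_q, upper_q])
    keep = (y <= lo) | (y >= hi)          # extreme-phenotype subsample
    model = LassoCV(cv=5).fit(X[keep], y[keep])
    return model, keep
```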
Shiqing Wang; Limin Su
2013-01-01
Sauerbrei, Willi; Boulesteix, Anne-Laure; Binder, Harald
2011-11-01
Multivariable regression models can link a potentially large number of variables to various kinds of outcomes, such as continuous, binary, or time-to-event endpoints. Selection of important variables and selection of the functional form for continuous covariates are key parts of building such models but are notoriously difficult for several reasons. Owing to multicollinearity between predictors and a limited amount of information in the data, (in)stability can be a serious issue for selected models. For applications with a moderate number of variables, resampling-based techniques have been developed for diagnosing and improving multivariable regression models. Deriving models for high-dimensional molecular data has led to the need to adapt these techniques to settings where the number of variables is much larger than the number of observations. Three studies with a time-to-event outcome, of which one has high-dimensional data, are used to illustrate several techniques. Investigations at the covariate level and at the predictor level are seen to provide considerable insight into model stability and performance. While some areas are indicated where resampling techniques for model building still need further refinement, our case studies illustrate that these techniques can already be recommended for wider use.
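One such resampling diagnostic, inclusion frequencies over bootstrap refits, can be sketched as follows (an illustration with a fixed lasso penalty, not the specific procedures of the cited studies):

```python
import numpy as np
from sklearn.linear_model import Lasso

def selection_frequencies(X, y, alpha=0.1, n_boot=50, rng=None):
    """Refit a sparse model on bootstrap resamples and report how often
    each variable is selected -- a simple model-stability diagnostic."""
    rng = np.random.default_rng(rng)
    n, p = X.shape
    counts = np.zeros(p)
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)       # bootstrap resample with replacement
        coef = Lasso(alpha=alpha, max_iter=5000).fit(X[idx], y[idx]).coef_
        counts += coef != 0
    return counts / n_boot                # per-variable inclusion frequency
```

Variables selected in nearly every resample are stable candidates; variables that appear only sporadically are likely artifacts of the particular sample.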
Xia, Yin; Cai, Tianxi; Cai, T Tony
2018-01-01
Motivated by applications in genomics, we consider in this paper global and multiple testing for the comparisons of two high-dimensional linear regression models. A procedure for testing the equality of the two regression vectors globally is proposed and shown to be particularly powerful against sparse alternatives. We then introduce a multiple testing procedure for identifying unequal coordinates while controlling the false discovery rate and false discovery proportion. Theoretical justifications are provided to guarantee the validity of the proposed tests and optimality results are established under sparsity assumptions on the regression coefficients. The proposed testing procedures are easy to implement. Numerical properties of the procedures are investigated through simulation and data analysis. The results show that the proposed tests maintain the desired error rates under the null and have good power under the alternative at moderate sample sizes. The procedures are applied to the Framingham Offspring study to investigate the interactions between smoking and cardiovascular related genetic mutations important for an inflammation marker.
Directory of Open Access Journals (Sweden)
Shiqing Wang
2013-01-01
Full Text Available During the last few years, a great deal of attention has been focused on the Lasso and the Dantzig selector in high-dimensional linear regression when the number of variables can be much larger than the sample size. Under a sparsity scenario, the authors (see, e.g., Bickel et al., 2009, Bunea et al., 2007, Candes and Tao, 2007, Candès and Tao, 2007, Donoho et al., 2006, Koltchinskii, 2009, Koltchinskii, 2009, Meinshausen and Yu, 2009, Rosenbaum and Tsybakov, 2010, Tsybakov, 2006, van de Geer, 2008, and Zhang and Huang, 2008) discussed the relations between the Lasso and the Dantzig selector and derived sparsity oracle inequalities for the prediction risk and bounds on the estimation loss. In this paper, we point out that some of the authors overemphasize the role of certain sparsity conditions, and assumptions based on these sparsity conditions may lead to poor results. We give weaker assumptions and methods that avoid relying on the sparsity condition. As a comparison with the results of Bickel et al., 2009, more precise oracle inequalities for the prediction risk and bounds on the estimation loss are derived when the number of variables can be much larger than the sample size.
High-dimensional, massive sample-size Cox proportional hazards regression for survival analysis.
Mittal, Sushil; Madigan, David; Burd, Randall S; Suchard, Marc A
2014-04-01
Survival analysis endures as an old, yet active research field with applications that spread across many domains. Continuing improvements in data acquisition techniques pose constant challenges in applying existing survival analysis methods to these emerging data sets. In this paper, we present tools for fitting regularized Cox survival analysis models on high-dimensional, massive sample-size (HDMSS) data using a variant of the cyclic coordinate descent optimization technique tailored for the sparsity that HDMSS data often present. Experiments on two real data examples demonstrate that efficient analyses of HDMSS data using these tools result in improved predictive performance and calibration.
Privacy-Preserving Distributed Linear Regression on High-Dimensional Data
Directory of Open Access Journals (Sweden)
Gascón Adrià
2017-10-01
Full Text Available We propose privacy-preserving protocols for computing linear regression models, in the setting where the training dataset is vertically distributed among several parties. Our main contribution is a hybrid multi-party computation protocol that combines Yao’s garbled circuits with tailored protocols for computing inner products. Like many machine learning tasks, building a linear regression model involves solving a system of linear equations. We conduct a comprehensive evaluation and comparison of different techniques for securely performing this task, including a new Conjugate Gradient Descent (CGD) algorithm. This algorithm is suitable for secure computation because it uses an efficient fixed-point representation of real numbers while maintaining accuracy and convergence rates comparable to what can be obtained with a classical solution using floating point numbers. Our technique improves on the method of Nikolaenko et al. for privacy-preserving ridge regression (S&P 2013), and can be used as a building block in other analyses. We implement a complete system and demonstrate that our approach is highly scalable, solving data analysis problems with one million records and one hundred features in less than one hour of total running time.
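The linear-system core the protocol secures can be shown in plaintext. This sketch runs conjugate gradient on the normal equations in ordinary floating point; the paper's contribution is evaluating such iterations under multi-party computation with fixed-point arithmetic, which is not reproduced here:

```python
import numpy as np

def cgd_normal_equations(X, y, n_iter=25, tol=1e-10):
    """Solve the normal equations X^T X b = X^T y by conjugate gradient --
    the same linear-system step a secure protocol would evaluate, shown
    here in plaintext floating point for illustration."""
    A, c = X.T @ X, X.T @ y
    b = np.zeros_like(c)
    r = c - A @ b                 # residual
    p = r.copy()                  # search direction
    rs = r @ r
    for _ in range(n_iter):
        Ap = A @ p
        step = rs / (p @ Ap)
        b += step * p
        r -= step * Ap
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return b
```

CGD is attractive in the secure setting because each iteration uses only matrix-vector products and scalar divisions, operations that map well onto fixed-point circuits.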
Fisher, Charles K; Mehta, Pankaj
2015-06-01
Feature selection, identifying a subset of variables that are relevant for predicting a response, is an important and challenging component of many methods in statistics and machine learning. Feature selection is especially difficult and computationally intensive when the number of variables approaches or exceeds the number of samples, as is often the case for many genomic datasets. Here, we introduce a new approach, the Bayesian Ising Approximation (BIA), to rapidly calculate posterior probabilities for feature relevance in L2-penalized linear regression. In the regime where the regression problem is strongly regularized by the prior, we show that computing the marginal posterior probabilities for features is equivalent to computing the magnetizations of an Ising model with weak couplings. Using a mean field approximation, we show it is possible to rapidly compute the feature selection path described by the posterior probabilities as a function of the L2 penalty. We present simulations and analytical results illustrating the accuracy of the BIA on some simple regression problems. Finally, we demonstrate the applicability of the BIA to high-dimensional regression by analyzing a gene expression dataset with nearly 30,000 features. These results also highlight the impact of correlations between features on Bayesian feature selection. An implementation of the BIA in C++, along with data for reproducing our gene expression analyses, is freely available at http://physics.bu.edu/∼pankajm/BIACode. © The Author 2015. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com.
Directory of Open Access Journals (Sweden)
Sergio Rojas-Galeano
2008-03-01
Full Text Available The analysis of complex proteomic and genomic profiles involves the identification of significant markers within a set of hundreds or even thousands of variables that represent a high-dimensional problem space. The occurrence of noise, redundancy, or combinatorial interactions in the profile makes the selection of relevant variables harder. Here we propose a method to select variables based on estimated relevance to hidden patterns. Our method combines a weighted-kernel discriminant with an iterative stochastic probability estimation algorithm to discover the relevance distribution over the set of variables. We verified the ability of our method to select predefined relevant variables in synthetic proteome-like data and then assessed its performance on biological high-dimensional problems. Experiments were run on serum proteomic datasets of infectious diseases. The resulting variable subsets achieved classification accuracies of 99% on Human African Trypanosomiasis, 91% on Tuberculosis, and 91% on Malaria serum proteomic profiles with fewer than 20% of variables selected. Our method scaled up to dimensionalities of much higher orders of magnitude, as shown with gene expression microarray datasets in which we obtained classification accuracies close to 90% with fewer than 1% of the total number of variables. Our method consistently found relevant variables attaining high classification accuracies across synthetic and biological datasets. Notably, it yielded very compact subsets compared to the original number of variables, which should simplify downstream biological experimentation.
Evolutionary fields can explain patterns of high-dimensional complexity in ecology.
Wilsenach, James; Landi, Pietro; Hui, Cang
2017-04-01
One of the properties that make ecological systems so unique is the range of complex behavioral patterns that can be exhibited by even the simplest communities with only a few species. Much of this complexity is commonly attributed to stochastic factors that have very high-degrees of freedom. Orthodox study of the evolution of these simple networks has generally been limited in its ability to explain complexity, since it restricts evolutionary adaptation to an inertia-free process with few degrees of freedom in which only gradual, moderately complex behaviors are possible. We propose a model inspired by particle-mediated field phenomena in classical physics in combination with fundamental concepts in adaptation, which suggests that small but high-dimensional chaotic dynamics near to the adaptive trait optimum could help explain complex properties shared by most ecological datasets, such as aperiodicity and pink, fractal noise spectra. By examining a simple predator-prey model and appealing to real ecological data, we show that this type of complexity could be easily confused for or confounded by stochasticity, especially when spurred on or amplified by stochastic factors that share variational and spectral properties with the underlying dynamics.
Patterns of Regression in Rett Syndrome
Directory of Open Access Journals (Sweden)
J Gordon Millichap
2002-10-01
Full Text Available Patterns and features of regression in a case series of 53 girls and women with Rett syndrome were studied at the Institute of Child Health and Great Ormond Street Children’s Hospital, London, UK.
Ternès, Nils; Rotolo, Federico; Michiels, Stefan
2016-07-10
Correct selection of prognostic biomarkers among multiple candidates is becoming increasingly challenging as the dimensionality of biological data becomes higher. Therefore, minimizing the false discovery rate (FDR) is of primary importance, while a low false negative rate (FNR) is a complementary measure. The lasso is a popular selection method in Cox regression, but its results depend heavily on the penalty parameter λ. Usually, λ is chosen using maximum cross-validated log-likelihood (max-cvl). However, this method often has a very high FDR. We review methods for a more conservative choice of λ. We propose an empirical extension of the cvl by adding a penalization term, which trades off between the goodness-of-fit and the parsimony of the model, leading to the selection of fewer biomarkers and, as we show, to the reduction of the FDR without a large increase in FNR. We conducted a simulation study considering null and moderately sparse alternative scenarios and compared our approach with the standard lasso and 10 other competitors: Akaike information criterion (AIC), corrected AIC, Bayesian information criterion (BIC), extended BIC, Hannan and Quinn information criterion (HQIC), risk information criterion (RIC), one-standard-error rule, adaptive lasso, stability selection, and percentile lasso. Our extension achieved the best compromise across all the scenarios between a reduction of the FDR and a limited rise in the FNR, followed by the AIC, the RIC, and the adaptive lasso, which performed well in some settings. We illustrate the methods using gene expression data of 523 breast cancer patients. In conclusion, we propose to apply our extension to the lasso whenever a stringent FDR with a limited FNR is targeted. Copyright © 2016 John Wiley & Sons, Ltd.
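The trade-off is easy to demonstrate in the simpler linear-model setting (not Cox regression, and not the authors' cvl extension): an information-criterion choice of λ is typically more conservative than cross-validation, selecting fewer variables.

```python
import numpy as np
from sklearn.linear_model import LassoCV, LassoLarsIC

def compare_selection(X, y):
    """Count variables selected by cross-validated lasso versus a
    BIC-penalized choice of lambda (illustrative, linear model only)."""
    cv_model = LassoCV(cv=5).fit(X, y)
    bic_model = LassoLarsIC(criterion="bic").fit(X, y)
    return int(np.sum(cv_model.coef_ != 0)), int(np.sum(bic_model.coef_ != 0))
```

On sparse simulated data the BIC choice usually retains the strong signals while dropping more of the noise variables, mirroring the FDR/FNR compromise discussed above.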
Classifying High-Dimensional Patterns Using a Fuzzy Logic Discriminant Network
Directory of Open Access Journals (Sweden)
Nick J. Pizzi
2012-01-01
Full Text Available Although many classification techniques exist to analyze patterns possessing straightforward characteristics, they tend to fail when the ratio of features to patterns is very large. This “curse of dimensionality” is especially prevalent in many complex, voluminous biomedical datasets acquired using the latest spectroscopic modalities. To address this pattern classification issue, we present a technique using an adaptive network of fuzzy logic connectives to combine class boundaries generated by sets of discriminant functions. We empirically evaluate the effectiveness of this classification technique by comparing it against two conventional benchmark approaches, both of which use feature averaging as a preprocessing phase.
Radiation regression patterns after cobalt plaque insertion for retinoblastoma
Energy Technology Data Exchange (ETDEWEB)
Buys, R.J.; Abramson, D.H.; Ellsworth, R.M.; Haik, B.
1983-08-01
An analysis of 31 eyes of 30 patients who had been treated with cobalt plaques for retinoblastoma disclosed that a type I radiation regression pattern developed in 15 patients; type II, in one patient; and type III, in five patients. Nine patients had a regression pattern characterized by complete destruction of the tumor, the surrounding choroid, and all of the vessels in the area into which the plaque was inserted. This resulting white scar, corresponding to the sclera only, was classified as a type IV radiation regression pattern. There was no evidence of tumor recurrence in patients with type IV regression patterns, with an average follow-up of 6.5 years after receiving cobalt plaque therapy. Twenty-nine of these 30 patients had been unsuccessfully treated with at least one other modality (i.e., light coagulation, cryotherapy, external beam radiation, or chemotherapy).
Clustering high dimensional data
DEFF Research Database (Denmark)
Assent, Ira
2012-01-01
High-dimensional data, i.e., data described by a large number of attributes, pose specific challenges to clustering. The so-called ‘curse of dimensionality’, coined originally to describe the general increase in complexity of various computational problems as dimensionality increases, is known to affect clustering as well, so that specialized methods for clustering are required. Consequently, recent research has focused on developing techniques and clustering algorithms specifically for high-dimensional data. Still, open research issues remain. Clustering is a data mining task devoted to the automatic grouping of data based on mutual similarity. Each cluster groups objects that are similar to one another, whereas dissimilar objects are assigned to different clusters, possibly separating out noise. In this manner, clusters describe the data structure in an unsupervised manner, i.e., without the need for class labels. A number of clustering paradigms exist...
Contiguous Uniform Deviation for Multiple Linear Regression in Pattern Recognition
Andriana, A. S.; Prihatmanto, D.; Hidaya, E. M. I.; Supriana, I.; Machbub, C.
2017-01-01
Understanding images by recognizing their objects is still a challenging task. Face-element detection has been developed by researchers but does not yet provide enough information (its information resolution is low) for recognizing objects. Available face recognition methods still make classification errors and need a huge number of examples, which may still be incomplete. Another approach, still rare in image understanding, uses pattern structures or syntactic grammars describing detailed shape features. Image pixel values are also processed as signal patterns, which are approximated by mathematical curve fitting. This paper adds a contiguous uniform deviation method to the curve-fitting algorithm to increase applicability in image recognition systems related to object movement. The combination of multiple linear regression and the contiguous uniform deviation method is applied to the image pixel-value function, and yields a higher-resolution (more informative) description of visual object detail in object movement.
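A minimal version of curve fitting with a uniform deviation band might look like this; it is an illustrative reading of the idea, with function and variable names chosen here rather than taken from the paper:

```python
import numpy as np

def fit_with_deviation_band(x, values, degree=3):
    """Least-squares polynomial fit to a 1-D pixel-value signal, plus the
    maximum absolute deviation of the fit -- a uniform band that can flag
    frames where the signal (e.g. a moving object's profile) has changed."""
    coeffs = np.polyfit(x, values, degree)
    fitted = np.polyval(coeffs, x)
    band = np.max(np.abs(values - fitted))
    return coeffs, band
```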
CSIR Research Space (South Africa)
Mc
2012-07-01
Full Text Available High dimensional entanglement. M. McLAREN 1,2, F.S. ROUX 1 & A. FORBES 1,2,3. 1. CSIR National Laser Centre, PO Box 395, Pretoria 0001; 2. School of Physics, University of Stellenbosch, Private Bag X1, 7602, Matieland; 3. School of Physics, University of KwaZulu...
Autoregressive logistic regression applied to atmospheric circulation patterns
Guanche, Y.; Mínguez, R.; Méndez, F. J.
2014-01-01
Autoregressive logistic regression models have been successfully applied in medical and pharmacology research, and in simple models to analyze weather types. The main purpose of this paper is to introduce a general framework to study atmospheric circulation patterns capable of dealing simultaneously with seasonality, interannual variability, long-term trends, and autocorrelation of different orders. To show its effectiveness in modeling performance, daily atmospheric circulation patterns identified from observed sea-level pressure fields over the Northeastern Atlantic have been analyzed using this framework. Model predictions are compared with probabilities from the historical database, showing very good fitting diagnostics. In addition, the fitted model is used to simulate the evolution over time of atmospheric circulation patterns using the Monte Carlo method. Simulation results are statistically consistent with the historical sequence in terms of (1) probability of occurrence of the different weather types, (2) transition probabilities, and (3) persistence. The proposed model constitutes an easy-to-use and powerful tool for a better understanding of the climate system.
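A first-order autoregressive multinomial logistic model of weather types can be sketched as follows. This is a toy illustration of the model class, not the authors' framework: a three-state Markov chain stands in for daily circulation patterns, and the design matrix combines a one-hot encoding of the previous day's type (the autoregressive term) with annual harmonics for seasonality. The transition matrix and all parameter choices are invented for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, k = 2000, 3                                  # days, number of weather types
P = np.array([[0.8, 0.1, 0.1],                  # toy transition matrix
              [0.2, 0.7, 0.1],
              [0.1, 0.2, 0.7]])
state = np.zeros(n, dtype=int)
for t in range(1, n):                           # simulate a persistent chain
    state[t] = rng.choice(k, p=P[state[t - 1]])

day = np.arange(n)
lagged = np.eye(k)[state[:-1]]                  # one-hot of yesterday's type (AR order 1)
season = np.column_stack([np.sin(2 * np.pi * day[1:] / 365.25),
                          np.cos(2 * np.pi * day[1:] / 365.25)])
X = np.hstack([lagged, season])
y = state[1:]

model = LogisticRegression(max_iter=1000).fit(X, y)
probs = model.predict_proba(X)                  # P(today's type | yesterday, season)
```

Higher autoregressive orders, trends, and interannual terms would simply add columns to `X`, which is what makes the framework described above convenient.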
Electricity demand forecasting using regression, scenarios and pattern analysis
CSIR Research Space (South Africa)
Khuluse, S
2009-02-01
Full Text Available The objective of the study is to forecast national electricity demand patterns over a twenty-year period: total annual consumption and seasonal effects. No constraint on the supply of electricity was assumed...
Understanding high-dimensional spaces
Skillicorn, David B
2012-01-01
High-dimensional spaces arise as a way of modelling datasets with many attributes. Such a dataset can be directly represented in a space spanned by its attributes, with each record represented as a point in the space with its position depending on its attribute values. Such spaces are not easy to work with because of their high dimensionality: our intuition about space is not reliable, and measures such as distance do not provide as clear information as we might expect. There are three main areas where complex high dimensionality and large datasets arise naturally: data collected by online ret
Bayesian Analysis of High Dimensional Classification
Mukhopadhyay, Subhadeep; Liang, Faming
2009-12-01
Modern data mining and bioinformatics have presented an important playground for statistical learning techniques, where the number of input variables may be much larger than the sample size of the training data. In supervised learning, logistic regression or probit regression can be used to model a binary output and to form perceptron classification rules based on Bayesian inference. In these cases, there is considerable interest in searching for sparse models in the high-dimensional regression/classification setup. We first discuss two common challenges in analyzing high-dimensional data. The first is the curse of dimensionality: the complexity of many existing algorithms scales exponentially with the dimensionality of the space, so the algorithms soon become computationally intractable and therefore inapplicable in many real applications. The second is multicollinearity among the predictors, which severely slows down the algorithms. To make Bayesian analysis operational in high dimensions, we propose a novel Hierarchical Stochastic Approximation Monte Carlo (HSAMC) algorithm, which overcomes the curse of dimensionality and the multicollinearity of predictors in high dimensions, and which possesses a self-adjusting mechanism to avoid local minima separated by high energy barriers. Models and methods are illustrated by simulations inspired by the field of genomics. Numerical results indicate that HSAMC can work as a general model-selection sampler in high-dimensional complex model spaces.
High-dimensional covariance estimation with high-dimensional data
Pourahmadi, Mohsen
2013-01-01
Methods for estimating sparse and large covariance matrices Covariance and correlation matrices play fundamental roles in every aspect of the analysis of multivariate data collected from a variety of fields including business and economics, health care, engineering, and environmental and physical sciences. High-Dimensional Covariance Estimation provides accessible and comprehensive coverage of the classical and modern approaches for estimating covariance matrices as well as their applications to the rapidly developing areas lying at the intersection of statistics and mac
Retinoblastoma regression patterns following chemoreduction and adjuvant therapy in 557 tumors.
Shields, Carol L; Palamar, Melis; Sharma, Pooja; Ramasubramanian, Aparna; Leahey, Ann; Meadows, Anna T; Shields, Jerry A
2009-03-01
To evaluate retinoblastoma regression patterns following chemoreduction and adjuvant therapy. A total of 557 retinoblastomas. A retrospective medical record review following 6 cycles of chemoreduction and tumor consolidation (thermotherapy or cryotherapy). Regression patterns included type 0 (no remnant), type 1 (calcified remnant), type 2 (noncalcified remnant), type 3 (partially calcified remnant), and type 4 (flat scar). Regression pattern. Retinoblastoma regressions were type 0 (n = 10), type 1 (n = 75), type 2 (n = 28), type 3 (n = 127), and type 4 (n = 317). Tumors with an initial thickness of 3 mm or less regressed most often to type 4 (92%), those 3 to 8 mm regressed to type 3 (34%) or type 4 (40%), and those thicker than 8 mm regressed to type 1 (40%) or type 3 (49%). Factors predictive of type 1 regression included larger tumor base and closer foveolar proximity. Factors predictive of type 3 included older age, larger tumor base, macular location, closer foveolar proximity, and lack of consolidation. Factors predictive of type 4 included familial hereditary pattern, smaller tumor base, greater foveolar distance, and tumor consolidation. Following chemoreduction, most small retinoblastomas result in a flat scar, intermediate tumors in a flat or partially calcified remnant, and large tumors in a more completely calcified remnant.
Gao, Xiangyun; An, Haizhong; Fang, Wei; Huang, Xuan; Li, Huajiao; Zhong, Weiqiong; Ding, Yinghui
2014-07-01
The linear regression parameters between two time series can differ under different lengths of the observation period. If we study the whole period through a sliding window of a short period, the change of the linear regression parameters is a process of dynamic transmission over time. We present a simple and efficient computational scheme: a linear regression patterns transmission algorithm, which transforms linear regression patterns into directed and weighted networks. The linear regression patterns (nodes) are defined by the combination of intervals of the linear regression parameters and the results of significance testing under different sizes of the sliding window. The transmissions between adjacent patterns are defined as edges, and the weights of the edges are the frequencies of the transmissions. The major patterns, the distance, and the medium in the process of the transmission can be captured. The statistical results of weighted out-degree and betweenness centrality are mapped on timelines, which show the features of the distribution of the results. Many measurements in different areas that involve two related time series variables could take advantage of this algorithm to characterize the dynamic relationships between the time series from a new perspective.
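The core of such an algorithm can be sketched in a few lines. This is a simplified reading, not the authors' implementation: each sliding window is classified only by the sign of its OLS slope (the paper uses intervals of both regression parameters plus significance tests), and transitions between adjacent windows are accumulated as weighted directed edges.

```python
import numpy as np
from collections import Counter

def regression_pattern_network(x, y, window=30):
    """Slide a window over paired series, label each window by the sign of
    its OLS slope (pattern node), and count transitions between adjacent
    windows as weighted directed edges."""
    patterns = []
    for start in range(len(x) - window + 1):
        xs, ys = x[start:start + window], y[start:start + window]
        slope = np.polyfit(xs, ys, 1)[0]          # OLS slope in this window
        patterns.append(1 if slope >= 0 else 0)   # 0: falling, 1: rising
    edges = Counter(zip(patterns[:-1], patterns[1:]))  # (from, to) -> weight
    return patterns, edges

t = np.arange(200, dtype=float)
series = np.sin(t / 20.0) + 0.05 * np.random.default_rng(1).standard_normal(200)
patterns, edges = regression_pattern_network(t, series)
```

Weighted out-degree and betweenness centrality, as used above, would then be computed on the resulting directed graph (e.g. with a graph library), with the window's end time giving the timeline position of each transition.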
Regression patterns in treated retinoblastoma with chemotherapy plus focal adjuvant therapy.
Ghassemi, Fariba; Rahmanikhah, Elham; Roohipoor, Ramak; Karkhaneh, Reza; Faegh, Adeleh
2013-04-01
The aim of this study was evaluation of the regression patterns after 3, 6, and 8 months of treatment. A total of 100 retinoblastoma tumors (57 eyes of 35 patients) were treated with 6 (n = 8) or 8 (n = 92) cycles of systemic chemoreduction and tumor consolidation (transpupillary thermotherapy [TTT] or cryotherapy) during this prospective study. After 3 months of treatment, type 3 was the predominant pattern (n = 57, 57%), while after 6 and 8 months of treatment the tumors regressed to type 4 most often (44% and 52%, respectively). Smaller tumors and peripheral tumors were likely to regress to type 4, whereas larger tumors and those nearer to the fovea were more likely to become type 1. Tumors consolidated with cryotherapy mostly showed type 4 regression (3rd month: 40%, 6th month: 90%, and 8th month: 87.5%), whereas those treated with TTT regressed to type 3 after 3 months (57.9%) and to type 4 after 6 and 8 months of treatment (51.4% and 59.5%, respectively). Recurrence of the tumor was 40% in our cases, with a defined correlation with tumor location, size, and subretinal seeds. We conclude that regression patterns of tumors in patients undergoing systemic chemoreduction with focal adjuvant treatments predominantly changed over time, and that these changes depend on tumor size, location, and type of treatment. It appears that subretinal seeds, tumor size, and tumor location are the most important factors predicting tumor recurrence. Copyright © 2012 Wiley Periodicals, Inc.
Snow, David P
2017-01-01
This study investigates infants' transition from nonverbal to verbal communication using evidence from regression patterns. As an example of regressions, prelinguistic infants learning American Sign Language (ASL) use pointing gestures to communicate. At the onset of single signs, however, these gestures disappear. Petitto (1987) attributed the regression to the children's discovery that pointing has two functions, namely, deixis and linguistic pronouns. The 1:2 relation (1 form, 2 functions) violates the simple 1:1 pattern that infants are believed to expect. This kind of conflict, Petitto argued, explains the regression. Based on the additional observation that the regression coincided with the boundary between prelinguistic and linguistic communication, Petitto concluded that the prelinguistic and linguistic periods are autonomous. The purpose of the present study was to evaluate the 1:1 model and to determine whether it explains a previously reported regression of intonation in English. Background research showed that gestures and intonation have different forms but the same pragmatic meanings, a 2:1 form-function pattern that plausibly precipitates the regression. The hypothesis of the study was that gestures and intonation are closely related. Moreover, because gestures and intonation change in the opposite direction, the negative correlation between them indicates a robust inverse relationship. To test this prediction, speech samples of 29 infants (8 to 16 months) were analyzed acoustically and compared to parent-report data on several verbal and gestural scales. In support of the hypothesis, gestures alone were inversely correlated with intonation. In addition, the regression model explains nonlinearities stemming from different form-function configurations. However, the results failed to support the claim that regressions linked to early words or signs reflect autonomy. The discussion ends with a focus on the special role of intonation in children
Linking teleconnection patterns to European temperature – a multiple linear regression model
Henning W. Rust; Andy Richling; Peter Bissolli; Uwe Ulbrich
2015-01-01
The link between the indices of twelve atmospheric teleconnection patterns (mostly Northern Hemispheric) and gridded European temperature data is investigated by means of multiple linear regression models for each grid cell and month. Furthermore, index-specific signals are calculated to estimate the contribution to temperature anomalies caused by each individual teleconnection pattern. To this extent, an observational product of monthly mean temperature (E-OBS), as well as monthly time series...
Recursive bias estimation for high dimensional regression smoothers
Energy Technology Data Exchange (ETDEWEB)
Hengartner, Nicolas W [Los Alamos National Laboratory; Cornillon, Pierre - Andre [AGROSUP, FRANCE; Matzner - Lober, Eric [UNIV OF RENNES, FRANCE
2009-01-01
In multivariate nonparametric analysis, sparseness of the covariates, also called the curse of dimensionality, forces one to use large smoothing parameters. This leads to a biased smoother. Instead of focusing on optimally selecting the smoothing parameter, we fix it to some reasonably large value to ensure an over-smoothing of the data. The resulting smoother has a small variance but a substantial bias. In this paper, we propose to iteratively correct the bias of the initial estimator by an estimate obtained from smoothing the residuals. We examine in detail the convergence of the iterated procedure for classical smoothers and relate our procedure to L2-Boosting. For multivariate thin-plate spline smoothers, we prove that our procedure adapts to the correct and unknown order of smoothness for estimating an unknown function m belonging to the Sobolev space H(ν), where ν > d/2. We apply our method to simulated and real data and show that it compares favorably with existing procedures.
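The iterative bias correction can be sketched with a Nadaraya-Watson smoother: over-smooth once, then repeatedly smooth the residuals and add the result back (the "twicing"/L2-Boosting connection mentioned above). A minimal sketch with invented data; the bandwidth and iteration count are arbitrary choices, and the paper's data-driven stopping rule is not implemented.

```python
import numpy as np

def nw_smooth(x, y, bandwidth):
    """Nadaraya-Watson kernel smoother (Gaussian kernel), evaluated at x."""
    w = np.exp(-0.5 * ((x[:, None] - x[None, :]) / bandwidth) ** 2)
    return (w * y[None, :]).sum(axis=1) / w.sum(axis=1)

def bias_corrected_smoother(x, y, bandwidth, iterations):
    """Over-smooth, then iteratively smooth the residuals and add them back,
    reducing the bias of the initial over-smoothed estimator."""
    fit = nw_smooth(x, y, bandwidth)
    for _ in range(iterations):
        fit = fit + nw_smooth(x, y - fit, bandwidth)
    return fit

rng = np.random.default_rng(2)
x = np.sort(rng.uniform(0.0, 2.0 * np.pi, 200))
y = np.sin(x) + 0.1 * rng.standard_normal(200)
fit0 = nw_smooth(x, y, 0.5)                       # deliberately over-smoothed
fit = bias_corrected_smoother(x, y, 0.5, 5)       # bias-corrected
```

With a deliberately large bandwidth, the initial fit flattens the sine wave; a handful of residual-smoothing iterations recovers most of the lost signal while keeping the variance low.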
Directory of Open Access Journals (Sweden)
Aurélien Miralles
Full Text Available Scincine lizards in Madagascar form an endemic clade of about 60 species exhibiting a variety of ecomorphological adaptations. Several subclades have adapted to burrowing and convergently regressed their limbs and eyes, resulting in a variety of partial and completely limbless morphologies among extant taxa. However, patterns of limb regression in these taxa have not been studied in detail. Here we fill this gap in knowledge by providing a phylogenetic analysis of DNA sequences of three mitochondrial and four nuclear gene fragments in an extended sampling of Malagasy skinks, and microtomographic analyses of osteology of various burrowing taxa adapted to sand substrate. Based on our data we propose to (i) consider Sirenoscincus Sakata & Hikida, 2003, as junior synonym of Voeltzkowia Boettger, 1893; (ii) resurrect the genus name Grandidierina Mocquard, 1894, for four species previously included in Voeltzkowia; and (iii) consider Androngo Brygoo, 1982, as junior synonym of Pygomeles Grandidier, 1867. By supporting the clade consisting of the limbless Voeltzkowia mira and the forelimb-only taxa V. mobydick and V. yamagishii, our data indicate that full regression of limbs and eyes occurred in parallel twice in the genus Voeltzkowia (as hitherto defined), which we consider a sand-swimming ecomorph: in the Voeltzkowia clade sensu stricto the regression first affected the hindlimbs and subsequently the forelimbs, whereas the Grandidierina clade first regressed the forelimbs and subsequently the hindlimbs, following the pattern prevalent in squamates. Timetree reconstructions for the Malagasy Scincidae contain a substantial amount of uncertainty due to the absence of suitable primary fossil calibrations. However, our preliminary reconstructions suggest rapid limb regression in Malagasy scincids, with an estimated maximal duration of 6 MYr for a complete regression in Paracontias, and 4 and 8 MYr respectively for complete regression of forelimbs in
Miralles, Aurélien; Hipsley, Christy A; Erens, Jesse; Gehara, Marcelo; Rakotoarison, Andolalao; Glaw, Frank; Müller, Johannes; Vences, Miguel
2015-01-01
Scincine lizards in Madagascar form an endemic clade of about 60 species exhibiting a variety of ecomorphological adaptations. Several subclades have adapted to burrowing and convergently regressed their limbs and eyes, resulting in a variety of partial and completely limbless morphologies among extant taxa. However, patterns of limb regression in these taxa have not been studied in detail. Here we fill this gap in knowledge by providing a phylogenetic analysis of DNA sequences of three mitochondrial and four nuclear gene fragments in an extended sampling of Malagasy skinks, and microtomographic analyses of osteology of various burrowing taxa adapted to sand substrate. Based on our data we propose to (i) consider Sirenoscincus Sakata & Hikida, 2003, as junior synonym of Voeltzkowia Boettger, 1893; (ii) resurrect the genus name Grandidierina Mocquard, 1894, for four species previously included in Voeltzkowia; and (iii) consider Androngo Brygoo, 1982, as junior synonym of Pygomeles Grandidier, 1867. By supporting the clade consisting of the limbless Voeltzkowia mira and the forelimb-only taxa V. mobydick and V. yamagishii, our data indicate that full regression of limbs and eyes occurred in parallel twice in the genus Voeltzkowia (as hitherto defined) that we consider as a sand-swimming ecomorph: in the Voeltzkowia clade sensu stricto the regression first affected the hindlimbs and subsequently the forelimbs, whereas the Grandidierina clade first regressed the forelimbs and subsequently the hindlimbs following the pattern prevalent in squamates. Timetree reconstructions for the Malagasy Scincidae contain a substantial amount of uncertainty due to the absence of suitable primary fossil calibrations. However, our preliminary reconstructions suggest rapid limb regression in Malagasy scincids with an estimated maximal duration of 6 MYr for a complete regression in Paracontias, and 4 and 8 MYr respectively for complete regression of forelimbs in Grandidierina and
High-dimensional model estimation and model selection
CERN. Geneva
2015-01-01
I will review concepts and algorithms from high-dimensional statistics for linear model estimation and model selection. I will particularly focus on the so-called p>>n setting where the number of variables p is much larger than the number of samples n. I will focus mostly on regularized statistical estimators that produce sparse models. Important examples include the LASSO and its matrix extension, the Graphical LASSO, and more recent non-convex methods such as the TREX. I will show the applicability of these estimators in a diverse range of scientific applications, such as sparse interaction graph recovery and high-dimensional classification and regression problems in genomics.
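As a concrete instance of the p >> n regime described above, the LASSO recovering a sparse linear model can be sketched as follows. This is a toy illustration using scikit-learn; the design, sparsity level, and penalty strength are invented for the example.

```python
import numpy as np
from sklearn.linear_model import Lasso

# p >> n toy problem: 50 samples, 200 variables, only 5 truly active
rng = np.random.default_rng(3)
n, p = 50, 200
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:5] = [3.0, -2.0, 1.5, 2.5, -1.0]          # the sparse true signal
y = X @ beta + 0.1 * rng.standard_normal(n)

lasso = Lasso(alpha=0.1).fit(X, y)
support = np.flatnonzero(lasso.coef_)            # indices of selected variables
```

Despite having four times more variables than samples, the l1 penalty drives most coefficients exactly to zero, which is what makes these estimators usable for sparse interaction recovery and high-dimensional regression in genomics.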
Schmidtmann, I; Elsäßer, A; Weinmann, A; Binder, H
2014-12-30
For determining a manageable set of covariates potentially influential with respect to a time-to-event endpoint, Cox proportional hazards models can be combined with variable selection techniques, such as stepwise forward selection or backward elimination based on p-values, or regularized regression techniques such as component-wise boosting. Cox regression models have also been adapted for dealing with more complex event patterns, for example, for competing risks settings with separate, cause-specific hazard models for each event type, or for determining the prognostic effect pattern of a variable over different landmark times, with one conditional survival model for each landmark. Motivated by a clinical cancer registry application, where complex event patterns have to be dealt with and variable selection is needed at the same time, we propose a general approach for linking variable selection between several Cox models. Specifically, we combine score statistics for each covariate across models by Fisher's method as a basis for variable selection. This principle is implemented for a stepwise forward selection approach as well as for a regularized regression technique. In an application to data from hepatocellular carcinoma patients, the coupled stepwise approach is seen to facilitate joint interpretation of the different cause-specific Cox models. In conditional survival models at landmark times, which address updates of prediction as time progresses and both treatment and other potential explanatory variables may change, the coupled regularized regression approach identifies potentially important, stably selected covariates together with their effect time pattern, despite having only a small number of events. These results highlight the promise of the proposed approach for coupling variable selection between Cox models, which is particularly relevant for modeling for clinical cancer registries with their complex event patterns. Copyright © 2014 John Wiley & Sons
Sparse High Dimensional Models in Economics.
Fan, Jianqing; Lv, Jinchi; Qi, Lei
2011-09-01
This paper reviews the literature on sparse high dimensional models and discusses some applications in economics and finance. Recent developments of theory, methods, and implementations in penalized least squares and penalized likelihood methods are highlighted. These variable selection methods are proved to be effective in high dimensional sparse modeling. The limits of dimensionality that regularization methods can handle, the role of penalty functions, and their statistical properties are detailed. Some recent advances in ultra-high dimensional sparse modeling are also briefly discussed.
Linking teleconnection patterns to European temperature – a multiple linear regression model
Directory of Open Access Journals (Sweden)
Henning W. Rust
2015-04-01
Full Text Available The link between the indices of twelve atmospheric teleconnection patterns (mostly Northern Hemispheric) and gridded European temperature data is investigated by means of multiple linear regression models for each grid cell and month. Furthermore, index-specific signals are calculated to estimate the contribution to temperature anomalies caused by each individual teleconnection pattern. To this extent, an observational product of monthly mean temperature (E-OBS), as well as monthly time series of teleconnection indices (CPC, NOAA), for the period 1951–2010 are evaluated. The stepwise regression approach is used to build grid-cell-based models for each month on the basis of the five most important teleconnection indices (NAO, EA, EAWR, SCAND, POLEUR), which are motivated by an exploratory correlation analysis. The temperature links are dominated by NAO and EA in Northern, Western, Central and South Western Europe, by EAWR during summer/autumn in Russia/Fenno-Scandia, and by SCAND in Russia/Northern Europe; POLEUR shows minor effects only. In comparison to the climatological forecast, the presented linear regression models improve the temperature modelling by 30–40%, with better results in winter and spring. They can be used to model the spatial distribution and structure of observed temperature anomalies, where two to three patterns are the main contributors. As an example, the estimated temperature signals induced by the teleconnection indices are shown for February 2010.
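The stepwise (forward) regression used to pick the most important indices per grid cell can be sketched as a greedy loop that repeatedly adds the predictor reducing the residual sum of squares the most. This is a generic forward-selection sketch with invented stand-in data, not the E-OBS/CPC analysis: two of five toy "indices" actually drive the simulated temperature.

```python
import numpy as np

def forward_select(X, y, k):
    """Greedy stepwise forward selection: at each step, add the predictor
    that most reduces the residual sum of squares of the OLS fit."""
    selected = []
    for _ in range(k):
        best, best_rss = None, np.inf
        for j in range(X.shape[1]):
            if j in selected:
                continue
            A = np.column_stack([np.ones(len(y)), X[:, selected + [j]]])
            coef, *_ = np.linalg.lstsq(A, y, rcond=None)
            rss = np.sum((y - A @ coef) ** 2)
            if rss < best_rss:
                best, best_rss = j, rss
        selected.append(best)
    return selected

# Toy stand-in: 120 months of 5 'teleconnection indices' at one grid cell,
# with temperature driven by indices 0 and 1 only
rng = np.random.default_rng(4)
idx = rng.standard_normal((120, 5))
temp = 2.0 * idx[:, 0] - 1.0 * idx[:, 1] + 0.3 * rng.standard_normal(120)
chosen = forward_select(idx, temp, k=2)
```

In the study this selection is run per grid cell and per month; a p-value-based stopping criterion would replace the fixed `k` used here for brevity.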
Modeling High-Dimensional Multichannel Brain Signals
Hu, Lechuan
2017-12-12
Our goal is to model and measure functional and effective (directional) connectivity in multichannel brain physiological signals (e.g., electroencephalograms, local field potentials). The difficulties from analyzing these data mainly come from two aspects: first, there are major statistical and computational challenges for modeling and analyzing high-dimensional multichannel brain signals; second, there is no set of universally agreed measures for characterizing connectivity. To model multichannel brain signals, our approach is to fit a vector autoregressive (VAR) model with potentially high lag order so that complex lead-lag temporal dynamics between the channels can be captured. Estimates of the VAR model will be obtained by our proposed hybrid LASSLE (LASSO + LSE) method which combines regularization (to control for sparsity) and least squares estimation (to improve bias and mean-squared error). Then we employ some measures of connectivity but put an emphasis on partial directed coherence (PDC) which can capture the directional connectivity between channels. PDC is a frequency-specific measure that explains the extent to which the present oscillatory activity in a sender channel influences the future oscillatory activity in a specific receiver channel relative to all possible receivers in the network. The proposed modeling approach provided key insights into potential functional relationships among simultaneously recorded sites during performance of a complex memory task. Specifically, this novel method was successful in quantifying patterns of effective connectivity across electrode locations, and in capturing how these patterns varied across trial epochs and trial types.
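The hybrid LASSLE idea (LASSO for support selection, least squares to de-bias the retained coefficients) can be sketched on a toy two-channel VAR(1). The function name `lassle`, the penalty level, and the toy coefficient matrix are my own choices for illustration, and the PDC step is omitted.

```python
import numpy as np
from sklearn.linear_model import Lasso

def lassle(X, y, alpha=0.05):
    """LASSO for sparsity, then OLS refit on the selected support
    (the hybrid LASSO + LSE scheme described above)."""
    support = np.flatnonzero(Lasso(alpha=alpha).fit(X, y).coef_)
    coef = np.zeros(X.shape[1])
    if support.size:
        sol, *_ = np.linalg.lstsq(X[:, support], y, rcond=None)
        coef[support] = sol
    return coef

# Toy 2-channel VAR(1): x_t = A x_{t-1} + noise, with one true zero entry
# standing in for an absent directed connection between channels
rng = np.random.default_rng(5)
A = np.array([[0.5, 0.0],
              [0.3, 0.4]])
x = np.zeros((500, 2))
for t in range(1, 500):
    x[t] = A @ x[t - 1] + rng.standard_normal(2)
X_lag, Y = x[:-1], x[1:]
A_hat = np.vstack([lassle(X_lag, Y[:, i]) for i in range(2)])
```

With many channels and a high lag order, the lagged design becomes very wide, which is where the sparsity step matters; the OLS refit then removes the shrinkage bias on the surviving coefficients before connectivity measures such as PDC are computed.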
Directory of Open Access Journals (Sweden)
Dayeon Shin
2018-01-01
Full Text Available Diet plays a crucial role in cognitive function. Few studies have examined the relationship between dietary patterns and cognitive functions of older adults in the Korean population. This study aimed to identify the effect of dietary patterns on the risk of mild cognitive impairment. A total of 239 participants, including 88 men and 151 women, aged 65 years and older were selected from health centers in the district of Seoul, Gyeonggi province, and Incheon, in Korea. Dietary patterns were determined using Reduced Rank Regression (RRR) methods with responses regarding vitamin B6, vitamin C, and iron intakes, based on both a one-day 24-h recall and a food frequency questionnaire. Cognitive function was assessed using the Korean Mini-Mental State Examination (K-MMSE). Multivariable logistic regression models were used to estimate the association between dietary pattern score and the risk of mild cognitive impairment. A total of 20 (8%) of the 239 participants had mild cognitive impairment. Three dietary patterns were identified: seafood and vegetables; high meat; and bread, ham, and alcohol. Among the three dietary patterns, the older adults who adhered to the seafood and vegetables pattern, characterized by high intake of seafood, vegetables, fruits, bread, snacks, soy products, beans, chicken, pork, ham, egg, and milk, had a decreased risk of mild cognitive impairment compared to those who did not (adjusted odds ratio 0.06, 95% confidence interval 0.01–0.72) after controlling for gender, supplementation, education, history of dementia, physical activity, body mass index (BMI), and duration of sleep. The other two dietary patterns were not significantly associated with the risk of mild cognitive impairment. In conclusion, high consumption of fruits, vegetables, seafood, and protein foods was significantly associated with reduced mild cognitive impairment in older Korean adults. These results can contribute to the establishment of
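Reduced Rank Regression itself is compact enough to sketch: fit OLS, then project the coefficient matrix onto the leading principal directions of the fitted responses (Izenman's classical construction). The toy data below merely mimics the multi-response setting of foods predicting nutrient intakes; it is not the study's data.

```python
import numpy as np

def reduced_rank_regression(X, Y, rank):
    """Classical RRR: OLS fit, then rank-r truncation via the leading
    right singular vectors of the fitted values."""
    B_ols, *_ = np.linalg.lstsq(X, Y, rcond=None)
    _, _, Vt = np.linalg.svd(X @ B_ols, full_matrices=False)
    V_r = Vt[:rank].T                     # top response-space directions
    return B_ols @ V_r @ V_r.T

rng = np.random.default_rng(7)
X = rng.standard_normal((200, 6))                     # 6 food-group intakes (toy)
B_true = np.outer([1.0, -0.5, 0.8, 0.0, 0.3, -0.2],   # rank-1 coefficient matrix
                  [2.0, 1.0, -1.0])
Y = X @ B_true + 0.1 * rng.standard_normal((200, 3))  # 3 nutrient responses (toy)
B_hat = reduced_rank_regression(X, Y, rank=1)
```

Each retained rank corresponds to one dietary-pattern score (a linear combination of foods) that best explains the chosen nutrient responses; the scores are then carried into the logistic regression stage.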
HSM: Heterogeneous Subspace Mining in High Dimensional Data
DEFF Research Database (Denmark)
Müller, Emmanuel; Assent, Ira; Seidl, Thomas
2009-01-01
Heterogeneous data, i.e. data with both categorical and continuous values, is common in many databases. However, most data mining algorithms assume either continuous or categorical attributes, but not both. In high dimensional data, phenomena due to the "curse of dimensionality" pose additional challenges. Usually, due to locally varying relevance of attributes, patterns do not show across the full set of attributes. In this paper we propose HSM, which defines a new pattern model for heterogeneous high dimensional data. It allows data mining in arbitrary subsets of the attributes that are relevant for the respective patterns. Based on this model we propose an efficient algorithm, which is aware of the heterogeneity of the attributes. We extend an indexing structure for continuous attributes such that HSM indexing adapts to different attribute types. In our experiments we show that HSM efficiently mines …
Quantifying Photonic High-Dimensional Entanglement
Martin, Anthony; Guerreiro, Thiago; Tiranov, Alexey; Designolle, Sébastien; Fröwis, Florian; Brunner, Nicolas; Huber, Marcus; Gisin, Nicolas
2017-03-01
High-dimensional entanglement offers promising perspectives in quantum information science. In practice, however, the main challenge is to devise efficient methods to characterize high-dimensional entanglement, based on the available experimental data which is usually rather limited. Here we report the characterization and certification of high-dimensional entanglement in photon pairs, encoded in temporal modes. Building upon recently developed theoretical methods, we certify an entanglement of formation of 2.09(7) ebits in a time-bin implementation, and 4.1(1) ebits in an energy-time implementation. These results are based on very limited sets of local measurements, which illustrates the practical relevance of these methods.
Asymptotically Honest Confidence Regions for High Dimensional
DEFF Research Database (Denmark)
Caner, Mehmet; Kock, Anders Bredahl
While variable selection and oracle inequalities for the estimation and prediction error have received considerable attention in the literature on high-dimensional models, very little work has been done in the area of testing and construction of confidence bands in high-dimensional models. However, … of the asymptotic covariance matrix of an increasing number of parameters, which is robust against conditional heteroskedasticity. To our knowledge we are the first to do so. Next, we show that our confidence bands are honest over sparse high-dimensional subvectors of the parameter space and that they contract at the optimal rate. All our results are valid in high-dimensional models. Our simulations reveal that the desparsified conservative Lasso estimates the parameters much more precisely than the desparsified Lasso, has much better size properties, and produces confidence bands with markedly superior coverage rates.
The additive hazards model with high-dimensional regressors
DEFF Research Database (Denmark)
Martinussen, Torben; Scheike, Thomas
2009-01-01
This paper considers estimation and prediction in the Aalen additive hazards model in the case where the covariate vector is high-dimensional, such as gene expression measurements. Some form of dimension reduction of the covariate space is needed to obtain useful statistical analyses. We study the partial least squares regression method. It turns out that it is naturally adapted to this setting via the so-called Krylov sequence. The resulting PLS estimator is shown to be consistent provided that the number of terms included is taken to be equal to the number of relevant components in the regression …
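The Krylov-sequence characterization of partial least squares mentioned in the abstract can be sketched numerically: the PLS estimator with m components is the least-squares solution constrained to the subspace spanned by s, Cs, ..., C^(m-1)s, where C = X'X and s = X'y. The following is an illustrative sketch of that equivalence, not the estimator studied in the paper:

```python
import numpy as np

def pls_krylov(X, y, m):
    """PLS coefficient vector via its Krylov-subspace characterization:
    restrict least squares to span{s, Cs, ..., C^(m-1)s}, C = X'X, s = X'y.
    Illustrative sketch only."""
    C, s = X.T @ X, X.T @ y
    basis, v = [], s
    for _ in range(m):
        v = v / np.linalg.norm(v)   # normalize for conditioning (span unchanged)
        basis.append(v)
        v = C @ v
    K = np.column_stack(basis)      # p x m Krylov basis
    a, *_ = np.linalg.lstsq(X @ K, y, rcond=None)
    return K @ a                    # coefficients in the original covariate space
```

With m equal to the number of covariates, the Krylov subspace is (generically) the whole space, so the PLS fit coincides with ordinary least squares; smaller m gives the dimension-reduced estimator.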
High dimensional multiclass classification with applications to cancer diagnosis
DEFF Research Database (Denmark)
Vincent, Martin
Probabilistic classifiers are introduced, and it is shown that the only regular linear probabilistic classifier with convex risk is multinomial regression. Penalized empirical risk minimization is introduced and used to construct supervised learning methods for probabilistic classifiers. A sparse group lasso penalized approach to high dimensional multinomial classification is presented. On different real data examples it is found that this approach clearly outperforms multinomial lasso in terms of error rate and features included in the model. An efficient coordinate descent algorithm is developed and its convergence is established. This algorithm is implemented in the msgl R package. Examples of high dimensional multiclass problems are studied, in particular examples of multiclass classification based on gene expression measurements. One such example is the clinically important problem …
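The base classifier here, multinomial (softmax) regression, has a convex negative log-likelihood and can be fitted by plain gradient descent. A minimal sketch of that building block — the sparse group lasso penalty and the msgl coordinate descent algorithm are deliberately omitted:

```python
import numpy as np

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)        # for numerical stability
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def fit_multinomial(X, y, n_classes, lr=0.5, n_iter=2000):
    """Unpenalized multinomial regression by gradient descent on the convex
    negative log-likelihood. X: n x p (include an intercept column), y: labels."""
    n, p = X.shape
    W = np.zeros((p, n_classes))
    Y = np.eye(n_classes)[y]                    # one-hot targets
    for _ in range(n_iter):
        W -= lr * X.T @ (softmax(X @ W) - Y) / n
    return W

def predict(X, W):
    return np.argmax(X @ W, axis=1)
```

On well-separated classes this recovers near-perfect training accuracy; a penalized version would add a group-sparse term to the gradient step.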
Directory of Open Access Journals (Sweden)
Bhavna Chawla
2016-01-01
Full Text Available Purpose: To prospectively study the clinical outcome and regression patterns of early retinoblastoma (Groups A and B) after systemic chemotherapy and focal consolidation in Indian children. Materials and Methods: Group A eyes were treated with focal therapy (transpupillary thermotherapy/cryotherapy) and Group B with systemic chemoreduction and focal therapy. Outcome measures were efficacy and safety of treatment, risk factors for treatment failure, regression patterns, and factors predictive of regression patterns. Results: Of 119 eyes (216 tumors), 14 (11.8%) were Group A and 105 (88.2%) were Group B eyes. The mean follow-up was 22.6 months. Tumor control was achieved in 111/119 eyes (93.3% overall; 100% Group A, 92.4% Group B). Eight Group B eyes (6.7%) had treatment failure. No serious systemic side-effects were noted. Risk factors for failure included larger tumors (P = 0.001) and proximity to the posterior pole (P = 0.014). Regression patterns were Type 4 (50.2%), Type 3 (31.7%), Type 1 (11.1%), and Type 2 (7%). Factors predictive of Type 4 regression were smaller tumors, anterior location, and younger age; Type 3 regression was associated with larger tumors, macular location, and older age. Conclusions: Systemic chemoreduction and focal therapy provided effective tumor control in Indian children. Factors predictive of regression patterns included age, tumor size and its location, and the modality of treatment.
Chawla, Bhavna; Jain, Amit; Seth, Rachna; Azad, Rajvardhan; Mohan, V K; Pushker, Neelam; Ghose, Supriyo
2016-07-01
To prospectively study the clinical outcome and regression patterns of early retinoblastoma (Groups A and B) after systemic chemotherapy and focal consolidation in Indian children. Group A eyes were treated with focal therapy (transpupillary thermotherapy/cryotherapy) and Group B with systemic chemoreduction and focal therapy. Outcome measures were efficacy and safety of treatment, risk factors for treatment failure, regression patterns, and factors predictive of regression patterns. Of 119 eyes (216 tumors), 14 (11.8%) were Group A and 105 (88.2%) were Group B eyes. The mean follow-up was 22.6 months. Tumor control was achieved in 111/119 eyes (93.3% overall, 100% Group A, 92.4% Group B). Eight Group B eyes (6.7%) had treatment failure. No serious systemic side-effects were noted. Risk factors for failure included larger tumors (P = 0.001) and proximity to posterior pole (P = 0.014). Regression patterns were Type 4 (50.2%), Type 3 (31.7%), Type 1 (11.1%), and Type 2 (7%). Factors predictive of Type 4 regression were smaller tumors, anterior location, younger age; Type 3 regression was associated with larger tumors, macular location, and older age. Systemic chemoreduction and focal therapy provided effective tumor control in Indian children. Factors predictive of regression patterns included age, tumor size and its location, and the modality of treatment.
Describing Growth Pattern of Bali Cows Using Non-linear Regression Models
Directory of Open Access Journals (Sweden)
Mohd. Hafiz A.W
2016-12-01
Full Text Available The objective of this study was to evaluate the best-fit non-linear regression model to describe the growth pattern of Bali cows. Estimates of asymptotic mature weight, rate of maturing, and constant of integration were derived from Brody, von Bertalanffy, Gompertz, and Logistic models, which were fitted to cross-sectional data of body weight taken from 74 Bali cows raised at MARDI Research Station Muadzam Shah, Pahang. Coefficient of determination (R2) and residual mean squares (MSE) were used to determine the best-fit model in describing the growth pattern of Bali cows. The von Bertalanffy model was the best among the four growth functions evaluated to determine the mature weight of Bali cattle, as shown by the highest R2 and lowest MSE values (0.973 and 601.9, respectively), followed by the Gompertz (0.972 and 621.2, respectively), Logistic (0.971 and 648.4, respectively), and Brody (0.932 and 660.5, respectively) models. The correlation between rate of maturing and mature weight was found to be negative, in the range of -0.170 to -0.929 for all models, indicating that animals of heavier mature weight had a lower rate of maturing. The use of a non-linear model could summarize the weight-age relationship into several biologically interpretable parameters, compared to the entire lifespan of weight-age data points, which is difficult and time consuming to interpret.
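A fit of this kind can be reproduced with standard nonlinear least squares. The sketch below is illustrative (synthetic data, not the MARDI records): it uses the common three-parameter von Bertalanffy form W(t) = A(1 − b·e^(−kt))³ and reports R² and residual mean square as in the study:

```python
import numpy as np
from scipy.optimize import curve_fit

def von_bertalanffy(t, A, b, k):
    """A = asymptotic mature weight, b = constant of integration, k = rate of maturing."""
    return A * (1.0 - b * np.exp(-k * t)) ** 3

def fit_growth(t, w, p0=(400.0, 0.5, 0.05)):
    """Fit the model by nonlinear least squares; return parameters, R2, and MSE."""
    params, _ = curve_fit(von_bertalanffy, t, w, p0=p0, maxfev=10000)
    pred = von_bertalanffy(t, *params)
    ss_res = float(np.sum((w - pred) ** 2))
    r2 = 1.0 - ss_res / float(np.sum((w - w.mean()) ** 2))
    mse = ss_res / (len(w) - len(params))   # residual mean square
    return params, r2, mse
```

Swapping in the Brody, Gompertz, or Logistic mean functions and comparing R2/MSE across fits mirrors the model-selection step of the study.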
High dimensional neurocomputing growth, appraisal and applications
Tripathi, Bipin Kumar
2015-01-01
The book presents a coherent understanding of computational intelligence from the perspective of what is known as "intelligent computing" with high-dimensional parameters. It critically discusses the central issues of high-dimensional neurocomputing, such as quantitative representation of signals, extending the dimensionality of neurons, supervised and unsupervised learning, and the design of higher-order neurons. The strong point of the book is its clarity and the ability of the underlying theory to unify our understanding of high-dimensional computing where conventional methods fail. Plenty of application-oriented problems are presented for evaluating, monitoring, and maintaining the stability of adaptive learning machines. The author has taken care to cover the breadth and depth of the subject, both qualitatively and quantitatively. The book is intended to enlighten the scientific community, ranging from advanced undergraduates to engineers, scientists, and seasoned researchers in computational intelligence…
Livingstone, Katherine M; McNaughton, Sarah A
2017-01-01
Evidence linking dietary patterns (DP) and obesity and hypertension prevalence is inconsistent. We aimed to identify DP derived from energy density, fibre and sugar intakes, as well as Na, K, fibre, SFA and PUFA, and investigate associations with obesity and hypertension. Adults (n 4908) were included from the cross-sectional Australian Health Survey 2011-2013. Two 24-h dietary recalls estimated food and nutrient intakes. Reduced rank regression derived DP with dietary energy density (DED), fibre density and total sugar intake as response variables for obesity and Na:K, SFA:PUFA and fibre density as variables for hypertension. Poisson regression investigated relationships between DP and prevalence ratios (PR) of overweight/obesity (BMI≥25 kg/m2) and hypertension (blood pressure≥140/90 mmHg). Obesity-DP1 was positively correlated with fibre density and sugars and inversely with DED. Obesity-DP2 was positively correlated with sugars and inversely with fibre density. Individuals in the highest tertile of Obesity-DP1 and Obesity-DP2, compared with the lowest, had lower (PR 0·88; 95 % CI 0·81, 0·95) and higher (PR 1·09; 95 % CI 1·01, 1·18) prevalence of obesity, respectively. Na:K and SFA:PUFA were positively correlated with Hypertension-DP1 and inversely correlated with Hypertension-DP2, respectively. There was a trend towards higher hypertension prevalence in the highest tertile of Hypertension-DP1 compared with the lowest (PR 1·18; 95 % CI 0·99, 1·41). Hypertension-DP2 was not associated with hypertension. Obesity prevalence was inversely associated with low-DED, high-fibre and high-sugar (natural sugars) diets and positively associated with low-fibre and high-sugar (added sugars) diets. Hypertension prevalence was higher on low-fibre and high-Na and SFA diets.
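Reduced rank regression as used here can be sketched in a few lines: ordinary least squares of the response variables on the dietary predictors, followed by an SVD of the fitted values; pattern scores are projections of the centered predictors onto the leading directions. The data and variable names below are illustrative, not the Australian Health Survey:

```python
import numpy as np

def rrr_pattern_scores(X, Y, n_patterns=1):
    """Reduced rank regression pattern scores via SVD of the OLS-fitted responses.
    X: subjects x predictors (e.g. food intakes), Y: subjects x responses
    (e.g. DED, fibre density, sugars). Sketch only."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    B, *_ = np.linalg.lstsq(Xc, Yc, rcond=None)     # OLS coefficient matrix
    _, _, Vt = np.linalg.svd(Xc @ B, full_matrices=False)
    return Xc @ (B @ Vt[:n_patterns].T)             # subjects x n_patterns
```

In the study design, the resulting scores would then be split into tertiles and entered into the Poisson regressions for prevalence ratios.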
High dimensional classifiers in the imbalanced case
DEFF Research Database (Denmark)
Bak, Britta Anker; Jensen, Jens Ledet
We consider the binary classification problem in the imbalanced case, where the numbers of samples from the two groups differ. The classification problem is considered in the high dimensional case, where the number of variables is much larger than the number of samples, and where the imbalance leads …
Directory of Open Access Journals (Sweden)
Carolina Machuca
2017-08-01
Full Text Available Abstract Background Dentine hypersensitivity (DH) affects people's quality of life (QoL). However, changes in the internal meaning of QoL, known as response shift (RS), may undermine longitudinal assessment of QoL. This study aimed to describe patterns of RS in people with DH using Classification and Regression Trees (CRT) and to explore the convergent validity of CRT with the then-test and ideals approaches. Methods Data from an 8-week clinical trial of mouthwashes for dentine hypersensitivity (n = 75), using the Dentine Hypersensitivity Experience Questionnaire (DHEQ) as the outcome measure, were analysed. CRT was used to examine 8-week changes in DHEQ total score as a dependent variable, with clinical status for DH and each DHEQ subscale score (restrictions, coping, social, emotional and identity) as independent variables. Recalibration was inferred when the clinical change was not consistent with the DHEQ change score, using a minimally important difference for DHEQ of 22 points. Reprioritization was inferred from changes in the relative importance of each subscale to the model over time. Results Overall, 50.7% of participants experienced a clinical improvement in their DH after treatment and 22.7% experienced an important improvement in their quality of life. Thirty-six per cent shifted their internal standards downward and 14.7% upwards, suggesting recalibration. Reprioritization occurred over time among the social and emotional impacts of DH. Conclusions CRT was a useful method to reveal both the types and the nature of RS in people with a mild health condition, and demonstrated convergent validity with design-based approaches to detect RS.
7th High Dimensional Probability Meeting
Mason, David; Reynaud-Bouret, Patricia; Rosinski, Jan
2016-01-01
This volume collects selected papers from the 7th High Dimensional Probability meeting held at the Institut d'Études Scientifiques de Cargèse (IESC) in Corsica, France. High Dimensional Probability (HDP) is an area of mathematics that includes the study of probability distributions and limit theorems in infinite-dimensional spaces such as Hilbert spaces and Banach spaces. The most remarkable feature of this area is that it has resulted in the creation of powerful new tools and perspectives, whose range of application has led to interactions with other subfields of mathematics, statistics, and computer science. These include random matrices, nonparametric statistics, empirical processes, statistical learning theory, concentration of measure phenomena, strong and weak approximations, functional estimation, combinatorial optimization, and random graphs. The contributions in this volume show that HDP theory continues to thrive and develop new tools, methods, techniques and perspectives to analyze random phenomena…
Directory of Open Access Journals (Sweden)
Getchell Thomas V
2005-04-01
Full Text Available Abstract Background Cluster analyses are used to analyze microarray time-course data for gene discovery and pattern recognition. However, in general, these methods do not take advantage of the fact that time is a continuous variable, and existing clustering methods often group biologically unrelated genes together. Results We propose a quadratic regression method for identification of differentially expressed genes and classification of genes based on their temporal expression profiles, for non-cyclic short time-course microarray data. This method treats time as a continuous variable and therefore preserves actual time information. We applied this method to a microarray time-course study of gene expression at short time intervals following deafferentation of olfactory receptor neurons. Nine regression patterns were identified and shown to fit gene expression profiles better than k-means clusters. EASE analysis identified over-represented functional groups in each regression pattern and each k-means cluster, which further demonstrated that the regression method provided more biologically meaningful classifications of gene expression profiles than the k-means clustering method. Comparison with Peddada et al.'s order-restricted inference method showed that our method provides a different perspective on the temporal gene profiles. A reliability study indicates that the regression patterns have the highest reliabilities. Conclusion Our results demonstrate that the proposed quadratic regression method improves gene discovery and pattern recognition for non-cyclic short time-course microarray data. With a freely accessible Excel macro, investigators can readily apply this method to their microarray data.
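The core idea — fitting a quadratic in continuous time per gene and classifying the profile by coefficient signs — can be illustrated in a simplified form. This sketch omits the paper's significance testing and exact pattern definitions:

```python
import numpy as np

def quadratic_pattern(t, y, tol=1e-8):
    """Fit y = b0 + b1*t + b2*t^2 by least squares and classify the temporal
    profile by the signs of the linear and quadratic coefficients.
    `tol` is an illustrative threshold for treating a coefficient as zero."""
    T = np.column_stack([np.ones_like(t, dtype=float), t, t * t])
    _, b1, b2 = np.linalg.lstsq(T, y, rcond=None)[0]
    sign = lambda c: 0 if abs(c) < tol else (1 if c > 0 else -1)
    return sign(b1), sign(b2)   # e.g. (1, -1): rises early, then decelerates
```

Genes sharing a sign pair fall into the same regression pattern, which is how continuous-time information survives into the classification, unlike in distance-based clustering.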
Vermeulen, E.; Stronks, K.; Visser, M.; Brouwer, I.A.; Snijder, M.B.; Mocking, R.J.T.; Derks, E.M.A.; Schene, A.H.; Nicolaou, M.
2017-01-01
BACKGROUND/OBJECTIVES: To investigate the association of dietary patterns derived by reduced rank regression (RRR) with depressive symptoms in a multi-ethnic population. SUBJECTS/METHODS: Cross-sectional data from the HELIUS study were used. In total, 4967 men and women (18-70 years) of Dutch,
Introduction to high-dimensional statistics
Giraud, Christophe
2015-01-01
Ever-greater computing technologies have given rise to an exponentially growing volume of data. Today massive data sets (with potentially thousands of variables) play an important role in almost every branch of modern human activity, including networks, finance, and genetics. However, analyzing such data has presented a challenge for statisticians and data analysts and has required the development of new statistical methods capable of separating the signal from the noise.Introduction to High-Dimensional Statistics is a concise guide to state-of-the-art models, techniques, and approaches for ha
Batis, Carolina; Mendez, Michelle A; Gordon-Larsen, Penny; Sotres-Alvarez, Daniela; Adair, Linda; Popkin, Barry
2016-02-01
We examined the association between dietary patterns and diabetes using the strengths of two methods: principal component analysis (PCA) to identify the eating patterns of the population and reduced rank regression (RRR) to derive a pattern that explains the variation in glycated Hb (HbA1c), homeostasis model assessment of insulin resistance (HOMA-IR) and fasting glucose. We measured diet over a 3 d period with 24 h recalls and a household food inventory in 2006 and used it to derive PCA and RRR dietary patterns. The outcomes were measured in 2009. Adults (n 4316) from the China Health and Nutrition Survey. The adjusted odds ratio for diabetes prevalence (HbA1c≥6·5 %), comparing the highest dietary pattern score quartile with the lowest, was 1·26 (95 % CI 0·76, 2·08) for a modern high-wheat pattern (PCA; wheat products, fruits, eggs, milk, instant noodles and frozen dumplings), 0·76 (95 % CI 0·49, 1·17) for a traditional southern pattern (PCA; rice, meat, poultry and fish) and 2·37 (95 % CI 1·56, 3·60) for the pattern derived with RRR. By comparing the dietary pattern structures of RRR and PCA, we found that the RRR pattern was also behaviourally meaningful. It combined the deleterious effects of the modern high-wheat pattern (high intakes of wheat buns and breads, deep-fried wheat and soya milk) with the deleterious effects of consuming the opposite of the traditional southern pattern (low intakes of rice, poultry and game, fish and seafood). Our findings suggest that using both PCA and RRR provided useful insights when studying the association of dietary patterns with diabetes.
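The PCA side of this comparison is an eigen-decomposition of the standardized food-group intake matrix; pattern scores are projections onto the leading components. A generic sketch with synthetic intake data (not the China Health and Nutrition Survey):

```python
import numpy as np

def pca_pattern_scores(F, n_patterns=2):
    """Principal-component dietary pattern scores from an intake matrix F
    (rows = subjects, columns = food groups). Columns are standardized first;
    rows of Vt are the pattern loadings."""
    Z = (F - F.mean(axis=0)) / F.std(axis=0)
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    return Z @ Vt[:n_patterns].T
```

RRR differs only in the direction-finding step: PCA picks directions of maximal intake variance, whereas RRR picks directions that best explain the biomarker responses (HbA1c, HOMA-IR, fasting glucose in this study), which is why the two can yield differently predictive patterns from the same foods.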
Modeling high dimensional multichannel brain signals
Hu, Lechuan
2017-03-27
In this paper, our goal is to model functional and effective (directional) connectivity in networks of multichannel brain physiological signals (e.g., electroencephalograms, local field potentials). The primary challenges here are twofold: first, there are major statistical and computational difficulties in modeling and analyzing high dimensional multichannel brain signals; second, there is no set of universally agreed measures for characterizing connectivity. To model multichannel brain signals, our approach is to fit a vector autoregressive (VAR) model with sufficiently high order so that complex lead-lag temporal dynamics between the channels can be accurately characterized. However, such a model contains a large number of parameters. Thus, we estimate the high dimensional VAR parameter space by our proposed hybrid LASSLE method (LASSO+LSE), which imposes regularization in the first step (to control for sparsity) and constrained least squares estimation in the second step (to improve the bias and mean-squared error of the estimator). Then, to characterize connectivity between channels in a brain network, we use various measures but put an emphasis on partial directed coherence (PDC) in order to capture directional connectivity between channels. PDC is a directed frequency-specific measure that explains the extent to which the present oscillatory activity in a sender channel influences the future oscillatory activity in a specific receiver channel, relative to all possible receivers in the network. Using the proposed modeling approach, we have achieved some insights into learning in a rat engaged in a non-spatial memory task.
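The two-step idea — an ℓ1 penalty for support selection, then a least-squares refit on the selected coordinates to reduce shrinkage bias — can be sketched for a single regression equation (each equation of a VAR is one such regression, with lagged channels as predictors). This is a generic illustration, not the authors' implementation:

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=300):
    """Cyclic coordinate descent for the lasso: 0.5*||y - Xb||^2 + lam*||b||_1."""
    beta = np.zeros(X.shape[1])
    col_ss = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            r = y - X @ beta + X[:, j] * beta[j]    # partial residual
            z = X[:, j] @ r
            beta[j] = np.sign(z) * max(abs(z) - lam, 0.0) / col_ss[j]
    return beta

def lassle(X, y, lam):
    """Hybrid regularize-then-refit: lasso selects the support, then an
    ordinary least squares refit on the selected coordinates reduces bias."""
    support = np.flatnonzero(np.abs(lasso_cd(X, y, lam)) > 1e-10)
    beta = np.zeros(X.shape[1])
    if support.size:
        sol, *_ = np.linalg.lstsq(X[:, support], y, rcond=None)
        beta[support] = sol
    return beta
```

Stacking one such fit per channel gives the sparse VAR coefficient matrices from which frequency-domain measures such as PDC are then computed.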
Miralles, A.; Hipsley, C.A.; Erens, J.; Gehara, M.; Rakotoarison, A.; Glaw, F.; Müller, J.; Vences, M.
2015-01-01
Scincine lizards in Madagascar form an endemic clade of about 60 species exhibiting a variety of ecomorphological adaptations. Several subclades have adapted to burrowing and convergently regressed their limbs and eyes, resulting in a variety of partial and completely limbless morphologies among
Directory of Open Access Journals (Sweden)
Laura K. Frank
2015-07-01
Full Text Available Reduced rank regression (RRR) is an innovative technique to establish dietary patterns related to biochemical risk factors for type 2 diabetes, but it has not been applied in sub-Saharan Africa. In a hospital-based case-control study for type 2 diabetes in Kumasi (diabetes cases, 538; controls, 668), dietary intake was assessed by a specific food frequency questionnaire. After a random split of our study population, we derived a dietary pattern in the training set using RRR, with adiponectin, HDL-cholesterol and triglycerides as responses and 35 food items as predictors. This pattern score was applied to the validation set, and its association with type 2 diabetes was examined by logistic regression. The dietary pattern was characterized by a high consumption of plantain, cassava, and garden egg, and a low intake of rice, juice, vegetable oil, eggs, chocolate drink, sweets, and red meat; the score correlated positively with serum triglycerides and negatively with adiponectin. The multivariate-adjusted odds ratio of type 2 diabetes for the highest quintile compared to the lowest was 4.43 (95% confidence interval: 1.87–10.50; p for trend < 0.001). The identified dietary pattern increases the odds of type 2 diabetes in urban Ghanaians, which is mainly attributed to increased serum triglycerides.
Mapping morphological shape as a high-dimensional functional curve.
Fu, Guifang; Huang, Mian; Bo, Wenhao; Hao, Han; Wu, Rongling
2017-01-06
Detecting how genes regulate biological shape has become a multidisciplinary research interest because of its wide application in many disciplines. Despite its fundamental importance, the challenges of accurately extracting information from an image, statistically modeling the high-dimensional shape, and meticulously locating shape quantitative trait loci (QTL) affect the progress of this research. In this article, we propose a novel integrated framework that incorporates shape analysis, statistical curve modeling and genetic mapping to detect significant QTLs regulating variation of biological shape traits. After quantifying morphological shape via a radius centroid contour approach, each shape, as a phenotype, was characterized as a high-dimensional curve, varying as angle θ runs clockwise with the first point starting from angle zero. We then modeled the dynamic trajectories of three mean curves and variation patterns as functions of θ. Our framework led to the detection of a few significant QTLs regulating the variation of leaf shape collected from a natural population of poplar, Populus szechuanica var. tibetica. This population, distributed at altitudes of 2000-4500 m above sea level, is an evolutionarily important plant species. This is the first work in the quantitative genetic shape mapping area that emphasizes a sense of 'function' instead of decomposing the shape into a few discrete principal components, as the majority of shape studies do.
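The radius centroid contour representation — turning an outline into a function r(θ) around the centroid — can be sketched as follows. This is an illustrative re-implementation of the general idea, not the paper's code:

```python
import numpy as np

def radius_contour(points, n_angles=360):
    """Map a closed 2-D contour (n x 2 array of boundary points) to a sampled
    radius-vs-angle curve r(theta) measured from the centroid."""
    centroid = points.mean(axis=0)
    d = points - centroid
    theta = np.mod(np.arctan2(d[:, 1], d[:, 0]), 2.0 * np.pi)
    r = np.hypot(d[:, 0], d[:, 1])
    order = np.argsort(theta)
    grid = np.linspace(0.0, 2.0 * np.pi, n_angles, endpoint=False)
    return grid, np.interp(grid, theta[order], r[order], period=2.0 * np.pi)
```

Each leaf outline then becomes one high-dimensional curve sampled on a common angle grid, which is the object the functional curve modeling and QTL mapping operate on.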
Energy Technology Data Exchange (ETDEWEB)
Hans, F.J.; Reinges, M.H.T. [Department of Neurosurgery, University Hospital of the Technical University of Aachen, Aachen (Germany); Krings, T.; Mull, M. [Department of Neuroradiology, University Hospital of the Technical University of Aachen, Pauwelsstr. 30, 52057, Aachen (Germany)
2004-06-01
We report on a patient with fibromuscular dysplasia who presented with a right-sided giant calcified cavernous internal carotid artery (ICA) aneurysm and two additional supraophthalmic ICA aneurysms. Endovascular closure of the right ICA using detachable balloons was performed with collateralisation of the right hemisphere via the right-sided posterior communicating and the anterior communicating arteries. Repeat angiography after 6 months demonstrated spontaneous complete regression of the two supraophthalmic aneurysms, although the parent vessel was still perfused. In comparison to the former angiography, the flow within the parent vessel was reversed due to the proximal ICA balloon occlusion. MRI demonstrated that the aneurysms were not obliterated by thrombosis alone, but showed a real regression in size. This case report demonstrates that changes in cerebral hemodynamics potentially lead to plastic changes in the vessel architecture in adults and that aneurysms can be flow-related, even if not associated with high flow fistulas or arteriovenous malformations, especially in cases with an arterial wall disease. (orig.)
Perinetti, Giuseppe; Rosso, Luigi; Riatti, Riccardo; Contardo, Luca
2016-01-01
The knowledge of the associations between the timing of skeletal maturation and craniofacial growth is of primary importance when planning a functional treatment for most skeletal malocclusions. This cross-sectional study was thus aimed at evaluating whether sagittal and vertical craniofacial growth is associated with the timing of circumpubertal skeletal maturation. A total of 320 subjects (160 females and 160 males) were included in the study (mean age, 12.3 ± 1.7 years; range, 7.6-16.7 years). These subjects were equally distributed across the circumpubertal cervical vertebral maturation (CVM) stages 2 to 5. Each CVM stage group also had an equal number of females and males. Multiple regression models were run for each CVM stage group to assess the significance of the association of cephalometric parameters (ANB, SN/MP, and NSBa angles) with the age of attainment of the corresponding CVM stage (in months). Significant associations were seen only for stage 3, where the SN/MP angle was negatively associated with age (β coefficient, -0.7). These results show that hyperdivergent and hypodivergent subjects may have an anticipated and a delayed attainment of the pubertal CVM stage 3, respectively. However, such an association remains weak and would become clinically relevant only in extreme cases.
Chen, L. X.; Wu, Q. P.
2012-10-01
Recently, Dada et al. reported on the experimental entanglement concentration and violation of generalized Bell inequalities with orbital angular momentum (OAM) [Nat. Phys. 7, 677 (2011)]. Here we demonstrate that the high-dimensional entanglement concentration can be performed in arbitrary OAM subspaces with selectivity. Instead of violating the generalized Bell inequalities, the working principle of present entanglement concentration is visualized by the biphoton OAM Klyshko picture, and its good performance is confirmed and quantified through the experimental Shannon dimensionalities after concentration.
Vermeulen, E; Stronks, K; Visser, M; Brouwer, I A; Snijder, M B; Mocking, R J T; Derks, E M; Schene, A H; Nicolaou, M
2017-08-01
To investigate the association of dietary patterns derived by reduced rank regression (RRR) with depressive symptoms in a multi-ethnic population. Cross-sectional data from the HELIUS study were used. In total, 4967 men and women (18-70 years) of Dutch, South-Asian Surinamese, African Surinamese, Turkish and Moroccan origin living in the Netherlands were included. Diet was measured using ethnic-specific food frequency questionnaires. Depressive symptoms were measured with the nine-item patient health questionnaire. By performing RRR in the whole population and per ethnic group, comparable dietary patterns were identified, and therefore the dietary pattern for the whole population was used for subsequent analyses. We identified a dietary pattern that was strongly related to eicosapentaenoic acid + docosahexaenoic acid, folate, magnesium and zinc (response variables) and that was characterized by milk products, cheese, whole grains, vegetables, legumes, nuts, potatoes and red meat. After adjustment for confounders, a statistically significant inverse association was observed in the whole population (B: -0.03, 95% CI: -0.06, -0.00, P=0.046) and among Moroccan (B: -0.09, 95% CI: -0.13, -0.04, P=0.027) and South-Asian Surinamese participants (B: -0.05, 95% CI: -0.09, -0.01), but not in the other ethnic groups. No statistically significant associations were found between the dietary pattern and significant depressed mood in any of the ethnic groups. No consistent evidence was found that consumption of a dietary pattern high in nutrients that are hypothesized to protect against depression was associated with lower depressive symptoms across different ethnic groups.
High dimensional data driven statistical mechanics.
Adachi, Yoshitaka; Sadamatsu, Sunao
2014-11-01
In "3D4D materials science", there are five categories: (a) image acquisition, (b) processing, (c) analysis, (d) modelling, and (e) data sharing. This presentation highlights the core of these categories [1]. Analysis and modelling: A three-dimensional (3D) microstructure image contains topological features, such as connectivity, in addition to metric features. Such additional microstructural information seems to be useful for more precise property prediction. There are two ways for microstructure-based property prediction (Fig. 1A). One is 3D image-data-based modelling, such as micromechanics or the crystal plasticity finite element method. The other is a machine learning approach driven by numerical microstructural features, such as an artificial neural network or a Bayesian estimation method. The key is to convert the 3D image data into numerical values in order to apply the dataset to property prediction. As numerical features of microstructures, grain size, number density of particles, connectivity of particles, grain boundary connectivity, stacking degree, clustering, etc. should be taken into consideration. These microstructural features are the so-called "materials genome". Among the materials genome, we have to find out the dominant factors that determine a focused property. The dominant factors are defined as "descriptor(s)" in high dimensional data driven statistical mechanics. Fig. 1. (a) A concept of 3D4D materials science. (b) Fully-automated serial sectioning 3D microscope "Genus_3D". (c) Materials Genome Archive (JSPS). Image acquisition: It is important for researchers to choose a 3D microscope from among various microscopes, depending on the length scale of the focused microstructure. There is a long-term request to acquire a 3D microstructure image more conveniently. Therefore, a fully automated serial sectioning 3D optical microscope, "Genus_3D" (Fig. 1B), has been developed, and nowadays it is commercially available. A user can get a good …
Shi, Yuan; Katzschner, Lutz; Ng, Edward
2017-10-30
The urban heat island (UHI) effect significantly raises the health burden and building energy consumption in the high-density urban environment of Hong Kong. A better understanding of the spatiotemporal pattern of UHI is essential to health risk assessments and energy consumption management, but challenging in a high-density environment due to the sparsely distributed meteorological stations and the highly diverse urban features. In this study, we modelled the spatiotemporal pattern of the UHI effect using the land use regression (LUR) approach in a geographic information system, with meteorological records of the recent 4 years (2013-2016), sounding data, and geographic predictors in Hong Kong. A total of 224 predictor variables were calculated and involved in model development. As a result, a total of 10 models were developed (daytime and nighttime, four seasons, and annual average). As expected, meteorological records (CLD, Spd, MSLP) and sounding indices (KINX, CAPV and SHOW) are temporally correlated with UHI at high significance levels. On top of the resultant LUR models, the influential spatial predictors of UHI, with their regression coefficients and critical buffer widths, were also identified for the high-density urban scenario of Hong Kong. The study results indicate that the spatial pattern of UHI is largely determined by the LU/LC (RES1500, FVC500) and urban geomorphometry (h¯, BVD, λ¯F, Ψsky and z0) in a high-density built environment, especially during nighttime. The resultant models could be adopted to enrich current urban design guidelines and help with UHI mitigation.
Dessens, Jos A. G.; Jansen, Wim; Ganzeboom, Harry B. G.; Heijden, Peter G. M. van der
2003-01-01
This paper brings together the virtues of linear regression models for status attainment models formulated by second-generation social mobility researchers and the strengths of log-linear models formulated by third-generation researchers, into fourth-generation social mobility models, by using
Bayesian Subset Modeling for High-Dimensional Generalized Linear Models
Liang, Faming
2013-06-01
This article presents a new prior setting for high-dimensional generalized linear models, which leads to a Bayesian subset regression (BSR) with the maximum a posteriori model approximately equivalent to the minimum extended Bayesian information criterion model. The consistency of the resulting posterior is established under mild conditions. Further, a variable screening procedure is proposed based on the marginal inclusion probability, which shares the same properties of sure screening and consistency with the existing sure independence screening (SIS) and iterative sure independence screening (ISIS) procedures. However, since the proposed procedure makes use of joint information from all predictors, it generally outperforms SIS and ISIS in real applications. This article also makes extensive comparisons of BSR with the popular penalized likelihood methods, including Lasso, elastic net, SIS, and ISIS. The numerical results indicate that BSR can generally outperform the penalized likelihood methods. The models selected by BSR tend to be sparser and, more importantly, of higher prediction ability. In addition, the performance of the penalized likelihood methods tends to deteriorate as the number of predictors increases, while this is not significant for BSR. Supplementary materials for this article are available online. © 2013 American Statistical Association.
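For context, the penalized-likelihood baselines the article compares against can be run in a few lines; a minimal Lasso sketch on synthetic sparse high-dimensional data (all settings are illustrative, not the article's experiments):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(3)

n, p = 100, 200                 # high-dimensional: more predictors than samples
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:3] = [3.0, -2.0, 1.5]     # sparse ground truth
y = X @ beta + rng.normal(0, 0.5, n)

# L1 penalization shrinks most coefficients exactly to zero.
fit = Lasso(alpha=0.1).fit(X, y)
selected = np.flatnonzero(fit.coef_)
print(len(selected))            # a sparse subset of the 200 predictors
```

The article's point is that BSR tends to return even sparser models than such penalized fits, with better prediction as p grows.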
Hierarchical low-rank approximation for high dimensional approximation
Nouy, Anthony
2016-01-07
Tensor methods are among the most prominent tools for the numerical solution of high-dimensional problems where functions of multiple variables have to be approximated. Such high-dimensional approximation problems naturally arise in stochastic analysis and uncertainty quantification. In many practical situations, the approximation of high-dimensional functions is made computationally tractable by using rank-structured approximations. In this talk, we present algorithms for approximation in the hierarchical tensor format using statistical methods. Sparse representations in a given tensor format are obtained with adaptive or convex relaxation methods, with parameters selected using cross-validation methods.
Multivariate statistics high-dimensional and large-sample approximations
Fujikoshi, Yasunori; Shimizu, Ryoichi
2010-01-01
A comprehensive examination of high-dimensional analysis of multivariate methods and their real-world applications. Multivariate Statistics: High-Dimensional and Large-Sample Approximations is the first book of its kind to explore how classical multivariate methods can be revised and used in place of conventional statistical tools. Written by prominent researchers in the field, the book focuses on high-dimensional and large-sample approximations and details the many basic multivariate methods used to achieve high levels of accuracy. The authors begin with a fundamental presentation of the basic
Bayesian Variable Selection in High-dimensional Applications
V. Rockova (Veronika)
2013-01-01
Advances in research technologies over the past few decades have encouraged the proliferation of massive datasets, revolutionizing statistical perspectives on high-dimensionality. High-throughput technologies have become pervasive in diverse scientific disciplines
Effects of dependence in high-dimensional multiple testing problems
Kim, K.I.; van de Wiel, M.A.
2008-01-01
Background: We consider effects of dependence among variables of high-dimensional data in multiple hypothesis testing problems, in particular the False Discovery Rate (FDR) control procedures. Recent simulation studies consider only simple correlation structures among variables, which is hardly
SNP interaction detection with Random Forests in high-dimensional genetic data.
Winham, Stacey J; Colby, Colin L; Freimuth, Robert R; Wang, Xin; de Andrade, Mariza; Huebner, Marianne; Biernacka, Joanna M
2012-07-15
Identifying variants associated with complex human traits in high-dimensional data is a central goal of genome-wide association studies. However, complicated etiologies such as gene-gene interactions are ignored by the univariate analysis usually applied in these studies. Random Forests (RF) are a popular data-mining technique that can accommodate a large number of predictor variables and allow for complex models with interactions. RF analysis produces measures of variable importance that can be used to rank the predictor variables. Thus, single nucleotide polymorphism (SNP) analysis using RFs is gaining popularity as a potential filter approach that considers interactions in high-dimensional data. However, the impact of data dimensionality on the power of RF to identify interactions has not been thoroughly explored. We investigate the ability of rankings from variable importance measures to detect gene-gene interaction effects and their potential effectiveness as filters compared to p-values from univariate logistic regression, particularly as the data becomes increasingly high-dimensional. RF effectively identifies interactions in low dimensional data. As the total number of predictor variables increases, probability of detection declines more rapidly for interacting SNPs than for non-interacting SNPs, indicating that in high-dimensional data the RF variable importance measures are capturing marginal effects rather than capturing the effects of interactions. While RF remains a promising data-mining technique that extends univariate methods to condition on multiple variables simultaneously, RF variable importance measures fail to detect interaction effects in high-dimensional data in the absence of a strong marginal component, and therefore may not be useful as a filter technique that allows for interaction effects in genome-wide data.
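The filtering idea can be illustrated with a small simulation: rank SNPs by random forest variable importance when the phenotype mixes a marginal effect with a product interaction (synthetic data; dimensions and coefficients are arbitrary, not from the study):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)

# Synthetic genotypes: 300 samples x 50 SNPs coded 0/1/2.
n, p = 300, 50
X = rng.integers(0, 3, size=(n, p)).astype(float)

# Phenotype: marginal effect of SNP 0 plus an interaction of SNPs 1 and 2.
signal = 2.0 * X[:, 0] + 1.5 * X[:, 1] * X[:, 2]
y = (signal + rng.normal(0, 1.0, n) > np.median(signal)).astype(int)

rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, y)
ranking = np.argsort(rf.feature_importances_)[::-1]  # most important first
print(ranking[:3])
```

As the abstract cautions, the interacting SNPs here are detectable partly because the product term induces marginal effects; with many more noise SNPs, the pure interaction signal is the first to fade from the importance ranking.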
Inferring gene regulatory relationships with a high-dimensional robust approach.
Zang, Yangguang; Zhao, Qing; Zhang, Qingzhao; Li, Yang; Zhang, Sanguo; Ma, Shuangge
2017-07-01
Gene expression (GE) levels have important biological and clinical implications. They are regulated by copy number alterations (CNAs). Modeling the regulatory relationships between GEs and CNAs facilitates understanding disease biology and can also have values in translational medicine. The expression level of a gene can be regulated by its cis-acting as well as trans-acting CNAs, and the set of trans-acting CNAs is usually not known, which poses a high-dimensional selection and estimation problem. Most of the existing studies share a common limitation in that they cannot accommodate long-tailed distributions or contamination of GE data. In this study, we develop a high-dimensional robust regression approach to infer the regulatory relationships between GEs and CNAs. A high-dimensional regression model is used to accommodate the effects of both cis-acting and trans-acting CNAs. A density power divergence loss function is used to accommodate long-tailed GE distributions and contamination. Penalization is adopted for regularized estimation and selection of relevant CNAs. The proposed approach is effectively realized using a coordinate descent algorithm. Simulation shows that it has competitive performance compared to the nonrobust benchmark and the robust LAD (least absolute deviation) approach. We analyze TCGA (The Cancer Genome Atlas) data on cutaneous melanoma and study GE-CNA regulations in the RAP (regulation of apoptosis) pathway, which further demonstrates the satisfactory performance of the proposed approach. © 2017 WILEY PERIODICALS, INC.
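The key ingredient, a density-power-divergence-type loss, exponentially down-weights large residuals. A stripped-down sketch with a fixed scale, no penalization, and made-up data (the paper's actual estimator adds a penalty and a coordinate descent solver):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)

n = 200
X = rng.normal(size=(n, 3))
beta_true = np.array([2.0, -1.0, 0.5])
y = X @ beta_true + rng.normal(0, 0.2, n)
y[:10] += 15.0                       # contaminate 5% of the responses

def dpd_loss(beta, alpha=0.5):
    # Residuals of large magnitude receive exponentially small weight,
    # which is what makes the fit robust to the contaminated points.
    r = y - X @ beta
    return -np.mean(np.exp(-alpha * r**2 / 2))

beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]   # pulled off by outliers
beta_dpd = minimize(dpd_loss, beta_ols).x          # robust refinement
print(np.round(beta_dpd, 2))
```

Setting alpha near 0 approaches the (non-robust) likelihood fit; larger alpha trades efficiency for robustness.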
Non-intrusive low-rank separated approximation of high-dimensional stochastic models
Doostan, Alireza
2013-08-01
This work proposes a sampling-based (non-intrusive) approach within the context of low-rank separated representations to tackle the issue of the curse of dimensionality associated with the solution of models, e.g., PDEs/ODEs, with high-dimensional random inputs. Under some conditions discussed in detail, the number of random realizations of the solution required for a successful approximation grows linearly with respect to the number of random inputs. The construction of the separated representation is achieved via a regularized alternating least-squares regression, together with an error indicator to estimate model parameters. The computational complexity of such a construction is quadratic in the number of random inputs. The performance of the method is investigated through its application to three numerical examples including two ODE problems with high-dimensional random inputs. © 2013 Elsevier B.V.
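The alternating least-squares idea behind separated representations can be shown in miniature: fit a rank-one separated form u(x)v(y) to samples of a separable function by alternating the two linear updates (a toy sketch, not the paper's regularized solver):

```python
import numpy as np

# Rank-1 separated approximation f(x, y) ~ u(x) * v(y), fitted by
# alternating least squares on a matrix of function samples.
x = np.linspace(0, 1, 30)
y = np.linspace(0, 1, 30)
F = np.exp(x[:, None] + y[None, :])    # separable target: exp(x) * exp(y)

u = np.ones(len(x))
v = np.ones(len(y))
for _ in range(20):                     # alternate the two least-squares updates
    u = F @ v / (v @ v)
    v = F.T @ u / (u @ u)

rel_err = np.linalg.norm(F - np.outer(u, v)) / np.linalg.norm(F)
print(rel_err)  # essentially zero: the target is exactly rank one
```

For genuinely high-dimensional inputs the same alternation runs over one factor per dimension, which is what keeps the sample and computational costs from growing exponentially.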
Statistical Analysis for High-Dimensional Data : The Abel Symposium 2014
Bühlmann, Peter; Glad, Ingrid; Langaas, Mette; Richardson, Sylvia; Vannucci, Marina
2016-01-01
This book features research contributions from The Abel Symposium on Statistical Analysis for High Dimensional Data, held in Nyvågar, Lofoten, Norway, in May 2014. The focus of the symposium was on statistical and machine learning methodologies specifically developed for inference in “big data” situations, with particular reference to genomic applications. The contributors, who are among the most prominent researchers on the theory of statistics for high dimensional inference, present new theories and methods, as well as challenging applications and computational solutions. Specific themes include, among others, variable selection and screening, penalised regression, sparsity, thresholding, low dimensional structures, computational challenges, non-convex situations, learning graphical models, sparse covariance and precision matrices, semi- and non-parametric formulations, multiple testing, classification, factor models, clustering, and preselection. Highlighting cutting-edge research and casting light on...
Estimating the effect of a variable in a high-dimensional regression model
DEFF Research Database (Denmark)
Jensen, Peter Sandholt; Wurtz, Allan
A problem encountered in some empirical research, e.g. growth empirics, is that the potential number of explanatory variables is large compared to the number of observations. This makes it infeasible to condition on all variables in order to determine whether a particular variable has an effect. We...
High Dimensional Modulation and MIMO Techniques for Access Networks
DEFF Research Database (Denmark)
Binti Othman, Maisara
Exploration of advanced modulation formats and multiplexing techniques for next generation optical access networks is of interest as a promising solution for delivering multiple services to end-users. This thesis addresses this from two different angles: high dimensionality carrierless amplitude phase (CAP) modulation … and MIMO techniques for wired-wireless access networks … the capacity per wavelength of the femto-cell network. Bit rates up to 1.59 Gbps with fiber-wireless transmission over a 1 m air distance are demonstrated. The results presented in this thesis demonstrate the feasibility of high dimensionality CAP in increasing the number of dimensions, and their potential … to be utilized for multiple service allocation to different users. MIMO multiplexing techniques with OFDM provide the scalability to increase spectral efficiency and bit rates for RoF systems. High dimensional CAP and MIMO multiplexing techniques are two promising solutions for supporting wired and hybrid wired-wireless access networks.
Two representations of a high-dimensional perceptual space.
Victor, Jonathan D; Rizvi, Syed M; Conte, Mary M
2017-08-01
A perceptual space is a mental workspace of points in a sensory domain that supports similarity and difference judgments and enables further processing such as classification and naming. Perceptual spaces are present across sensory modalities; examples include colors, faces, auditory textures, and odors. Color is perhaps the best-studied perceptual space, but it is atypical in two respects. First, the dimensions of color space are directly linked to the three cone absorption spectra, but the dimensions of generic perceptual spaces are not as readily traceable to single-neuron properties. Second, generic perceptual spaces have more than three dimensions. This is important because representing each distinguishable point in a high-dimensional space by a separate neuron or population is unwieldy; combinatorial strategies may be needed to overcome this hurdle. To study the representation of a complex perceptual space, we focused on a well-characterized 10-dimensional domain of visual textures. Within this domain, we determine perceptual distances in a threshold task (segmentation) and a suprathreshold task (border salience comparison). In N=4 human observers, we find both quantitative and qualitative differences between these sets of measurements. Quantitatively, observers' segmentation thresholds were inconsistent with their uncertainty determined from border salience comparisons. Qualitatively, segmentation thresholds suggested that distances are determined by a coordinate representation with Euclidean geometry. Border salience comparisons, in contrast, indicated a global curvature of the space, and that distances are determined by activity patterns across broadly tuned elements. Thus, our results indicate two representations of this perceptual space, and suggest that they use differing combinatorial strategies. To move from sensory signals to decisions and actions, the brain carries out a sequence of transformations. An important stage in this process is the
Energy Technology Data Exchange (ETDEWEB)
Schmid, Maximilian P.; Fidarova, Elena [Dept. of Radiotherapy, Comprehensive Cancer Center, Medical Univ. of Vienna, Vienna (Austria)], e-mail: maximilian.schmid@akhwien.at; Poetter, Richard [Dept. of Radiotherapy, Comprehensive Cancer Center, Medical Univ. of Vienna, Vienna (Austria); Christian Doppler Lab. for Medical Radiation Research for Radiation Oncology, Medical Univ. of Vienna (Austria)]; and others
2013-10-15
Purpose: To investigate the impact of magnetic resonance imaging (MRI)-morphologic differences in parametrial infiltration on tumour response during primary radio chemotherapy in cervical cancer. Material and methods: Eighty-five consecutive cervical cancer patients with FIGO stages IIB (n = 59) and IIIB (n = 26), treated by external beam radiotherapy (±chemotherapy) and image-guided adaptive brachytherapy, underwent T2-weighted MRI at the time of diagnosis and at the time of brachytherapy. MRI patterns of parametrial tumour infiltration at the time of diagnosis were assessed with regard to predominant morphology and maximum extent of parametrial tumour infiltration and were stratified into five tumour groups (TG): 1) expansive with spiculae; 2) expansive with spiculae and infiltrating parts; 3) infiltrative into the inner third of the parametrial space (PM); 4) infiltrative into the middle third of the PM; and 5) infiltrative into the outer third of the PM. MRI at the time of brachytherapy was used for identifying presence (residual vs. no residual disease) and signal intensity (high vs. intermediate) of residual disease within the PM. Left and right PM of each patient were evaluated separately at both time points. The impact of the TG on tumour remission status within the PM was analysed using the χ²-test and logistic regression analysis. Results: In total, 170 PM were analysed. The TG 1, 2, 3, 4, 5 were present in 12%, 11%, 35%, 25% and 12% of the cases, respectively. Five percent of the PM were tumour-free. Residual tumour in the PM was identified in 19%, 68%, 88%, 90% and 85% of the PM for the TG 1, 2, 3, 4, and 5, respectively. The TG 3-5 had significantly higher rates of residual tumour in the PM in comparison to TG 1 + 2 (88% vs. 43%, p < 0.01). Conclusion: MRI-morphologic features of PM infiltration appear to allow for prediction of tumour response during external beam radiotherapy and chemotherapy. A predominantly infiltrative tumour spread at the
Bit-Table Based Biclustering and Frequent Closed Itemset Mining in High-Dimensional Binary Data
Directory of Open Access Journals (Sweden)
András Király
2014-01-01
Full Text Available During the last decade various algorithms have been developed and proposed for discovering overlapping clusters in high-dimensional data. The two most prominent application fields of this research, proposed independently, are frequent itemset mining (developed for market basket data) and biclustering (applied to gene expression data analysis). The common limitation of both methodologies is their limited applicability to very large binary data sets. In this paper we propose a novel and efficient method to find both frequent closed itemsets and biclusters in high-dimensional binary data. The method is based on simple but very powerful matrix and vector multiplication approaches that ensure that all patterns can be discovered in a fast manner. The proposed algorithm has been implemented in the commonly used MATLAB environment and is freely available to researchers.
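The bit-table idea is that, once transactions are stored as rows of a binary matrix, the support of any itemset reduces to a vectorized column-wise AND; a minimal sketch on toy data (the paper's MATLAB implementation is not reproduced here):

```python
import numpy as np

# Transactions x items as a bit table; itemset support is a column-wise AND.
T = np.array([[1, 1, 0, 1],
              [1, 1, 1, 0],
              [0, 1, 1, 0],
              [1, 1, 0, 1]], dtype=bool)

def support(bit_table, items):
    """Fraction of transactions (rows) containing every item in `items`."""
    return bit_table[:, items].all(axis=1).mean()

print(support(T, [0, 1]))  # 0.75: items 0 and 1 co-occur in 3 of 4 rows
```

Frequent closed itemsets are then obtained by keeping only frequent itemsets for which no superset has the same support.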
Vermeulen, E.; Stronks, K.; Visser, M de; Brouwer, I.A.; Schene, A.H.; Mocking, R.J.; Colpo, M.; Bandinelli, S.; Ferrucci, L.; Nicolaou, M.
2016-01-01
This study aimed to identify dietary patterns using reduced rank regression (RRR) and to explore their associations with depressive symptoms over 9 years in the Invecchiare in Chianti study. At baseline, 1362 participants (55.4 % women) aged 18-102 years (mean age 68 (sd 15.5) years) were included
High-dimensional quantum channel estimation using classical light
CSIR Research Space (South Africa)
Mabena, Chemist M
2017-11-01
Full Text Available A method is proposed to characterize a high-dimensional quantum channel with the aid of classical light. It uses a single nonseparable input optical field that contains correlations between spatial modes and wavelength to determine the effect...
A hybridized K-means clustering approach for high dimensional ...
African Journals Online (AJOL)
Due to the incredible growth of high dimensional datasets, conventional database querying methods are inadequate for extracting useful information, so researchers are now forced to develop new techniques to meet the raised requirements. Such large expression data give rise to a number of new computational challenges ...
Inference in High-dimensional Dynamic Panel Data Models
DEFF Research Database (Denmark)
Kock, Anders Bredahl; Tang, Haihan
error variance may be non-constant over time and depend on the covariates. Furthermore, our procedure allows for inference on high-dimensional subsets of the parameter vector of an increasing cardinality. We show that the confidence bands resulting from our procedure are asymptotically honest...
High-Dimensional Statistical Learning: Roots, Justifications, and Potential Machineries.
Zollanvari, Amin
2015-01-01
High-dimensional data generally refer to data in which the number of variables is larger than the sample size. Analyzing such datasets poses great challenges for classical statistical learning because the finite-sample performance of methods developed within classical statistical learning does not live up to classical asymptotic premises in which the sample size unboundedly grows for a fixed dimensionality of observations. Much work has been done in developing mathematical-statistical techniques for analyzing high-dimensional data. Despite remarkable progress in this field, many practitioners still utilize classical methods for analyzing such datasets. This state of affairs can be attributed, in part, to a lack of knowledge and, in part, to the ready-to-use computational and statistical software packages that are well developed for classical techniques. Moreover, many scientists working in a specific field of high-dimensional statistical learning are either not aware of other existing machineries in the field or are not willing to try them out. The primary goal in this work is to bring together various machineries of high-dimensional analysis, give an overview of the important results, and present the operating conditions upon which they are grounded. When appropriate, readers are referred to relevant review articles for more information on a specific subject.
The additive hazards model with high-dimensional regressors
DEFF Research Database (Denmark)
Martinussen, Torben
2009-01-01
This paper considers estimation and prediction in the Aalen additive hazards model in the case where the covariate vector is high-dimensional such as gene expression measurements. Some form of dimension reduction of the covariate space is needed to obtain useful statistical analyses. We study...
Likelihood ratio based verification in high dimensional spaces
Hendrikse, A.J.; Veldhuis, Raymond N.J.; Spreeuwers, Lieuwe Jan
The increase of the dimensionality of data sets often leads to problems during estimation, collectively denoted as the curse of dimensionality. One of the problems of Second Order Statistics (SOS) estimation in high dimensional data is that the resulting covariance matrices are not full rank, so their
Irregular grid methods for pricing high-dimensional American options
Berridge, S.J.
2004-01-01
This thesis proposes and studies numerical methods for pricing high-dimensional American options; important examples being basket options, Bermudan swaptions and real options. Four new methods are presented and analysed, both in terms of their application to various test problems, and in terms of
EigenPrism: inference for high dimensional signal-to-noise ratios.
Janson, Lucas; Barber, Rina Foygel; Candès, Emmanuel
2017-09-01
Consider the following three important problems in statistical inference, namely, constructing confidence intervals for (1) the error of a high-dimensional (p > n) regression estimator, (2) the linear regression noise level, and (3) the genetic signal-to-noise ratio of a continuous-valued trait (related to the heritability). All three problems turn out to be closely related to the little-studied problem of performing inference on the ℓ2-norm of the signal in high-dimensional linear regression. We derive a novel procedure for this, which is asymptotically correct when the covariates are multivariate Gaussian and produces valid confidence intervals in finite samples as well. The procedure, called EigenPrism, is computationally fast and makes no assumptions on coefficient sparsity or knowledge of the noise level. We investigate the width of the EigenPrism confidence intervals, including a comparison with a Bayesian setting in which our interval is just 5% wider than the Bayes credible interval. We are then able to unify the three aforementioned problems by showing that the EigenPrism procedure with only minor modifications can make important contributions to all three. We also investigate the robustness of coverage and find that the method applies in practice and in finite samples much more widely than just the case of multivariate Gaussian covariates. Finally, we apply EigenPrism to a genetic dataset to estimate the genetic signal-to-noise ratio for a number of continuous phenotypes.
High-dimensional multispectral image fusion: classification by neural network
He, Mingyi; Xia, Jiantao
2003-06-01
Advances in sensor technology for Earth observation make it possible to collect multispectral data of much higher dimensionality. Such high dimensional data make it possible to classify more classes. However, they also have several impacts on processing technology. First, because of the huge data volume, more processing power is needed to process such high dimensional data. Second, because of the high dimensionality and the limited number of training samples, it is very difficult for the Bayes method to estimate the parameters accurately, so the classification accuracy cannot be high enough. A neural network is an intelligent signal processing method. An MLFNN (Multi-Layer Feedforward Neural Network) learns directly from training samples and the probability model need not be estimated, so classification may be conducted through neural network fusion of multispectral images. The latent information about different classes can be extracted from training samples by the MLFNN. However, because of the huge data volume and high dimensionality, the MLFNN faces some serious difficulties: (1) there are many local minima in the error surface of the MLFNN; (2) over-fitting. These two difficulties depress the classification accuracy and generalization performance of the MLFNN. To overcome these difficulties, the author proposed DPFNN (Double Parallel Feedforward Neural Networks) to classify high dimensional multispectral images. The model and learning algorithm of DPFNN with strong generalization performance are proposed, with emphasis on the regularization of output weights and improvement of the generalization performance of DPFNN. As DPFNN is composed of an MLFNN and an SLFNN (Single-Layer Feedforward Neural Network), it has the advantages of both: (1) good nonlinear mapping capability; (2) high learning speed for linear-like problems. Experimental results with generated data, 64-band practical multispectral images and 220-band multispectral images show that the new
Ballesio, Laura; Gigli, Silvia; Di Pastena, Francesca; Giraldi, Guglielmo; Manganaro, Lucia; Anastasi, Emanuela; Catalano, Carlo
2017-03-01
The objective of this study is to analyze magnetic resonance imaging shrinkage pattern of tumor regression after neoadjuvant chemotherapy and to evaluate its relationship with biological subtypes and pathological response. We reviewed the magnetic resonance imaging studies of 51 patients with single mass-enhancing lesions (performed at time 0 and at the II and last cycles of neoadjuvant chemotherapy). Tumors were classified as Luminal A, Luminal B, HER2+, and Triple Negative based on biological and immunohistochemical analysis after core needle biopsy. We classified shrinkage pattern, based on tumor regression morphology on magnetic resonance imaging at the II cycle, as concentric, nodular, and mixed. We assigned a numeric score (0: none; 1: low; 2: medium; 3: high) to the enhancement intensity decrease. Pathological response on the surgical specimen was classified as complete (grade 5), partial (grades 4-3), and non-response (grades 1-2) according to Miller and Payne system. Fisher test was used to relate shrinkage pattern with biological subtypes and final pathological response. Seventeen patients achieved complete response, 25 partial response, and 9 non-response. A total of 13 lesions showed nodular pattern, 20 concentric, and 18 mixed. We found an association between concentric pattern and HER2+ (p < 0.001) and mixed pattern and Luminal A lesions (p < 0.001). We observed a statistical significant correlation between concentric pattern and complete response (p < 0.001) and between mixed pattern and non-response (p = 0.005). Enhancement intensity decrease 3 was associated with complete response (p < 0.001). Shrinkage pattern and enhancement intensity decrease may serve as early response indicators after neoadjuvant chemotherapy. Shrinkage pattern correlates with tumor biological subtypes.
Lee, Jenny Hyunjung; McDonnell, Kevin T; Zelenyuk, Alla; Imre, Dan; Mueller, Klaus
2014-03-01
Although the Euclidean distance does well in measuring data distances within high-dimensional clusters, it does poorly when it comes to gauging intercluster distances. This significantly impacts the quality of global, low-dimensional space embedding procedures such as the popular multidimensional scaling (MDS), where one can often observe nonintuitive layouts. We were inspired by the perceptual processes evoked in the method of parallel coordinates, which enables users to visually aggregate the data by the patterns the polylines exhibit across the dimension axes. We call the path of such a polyline its structure and suggest a metric that captures this structure directly in high-dimensional space. This allows us to better gauge the distances of spatially distant data constellations and so achieve data aggregations in MDS plots that are more cognizant of existing high-dimensional structure similarities. Our biscale framework distinguishes far-distances from near-distances. The coarser scale uses the structural similarity metric to separate data aggregates obtained by prior classification or clustering, while the finer scale employs the appropriate Euclidean distance.
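The notion of a polyline's "structure" can be made concrete in a simple form: compare the slope profile of two parallel-coordinates polylines rather than their pointwise positions (one possible instantiation for illustration, not the authors' exact metric):

```python
import numpy as np

def structural_distance(a, b):
    """Compare the 'shape' of two parallel-coordinates polylines via the
    profile of segment slopes (first differences) across the axes,
    rather than the pointwise Euclidean distance."""
    return np.linalg.norm(np.diff(a) - np.diff(b))

a = np.array([0.0, 1.0, 0.0, 1.0])
b = a + 2.0                          # same zig-zag shape, shifted upward
c = np.array([1.0, 0.0, 1.0, 0.0])   # opposite shape

print(structural_distance(a, b))  # 0.0: identical structure despite the shift
print(structural_distance(a, c))  # large: the shapes disagree on every axis
```

Under such a metric, two distant but similarly shaped profiles can be judged close, which is the property the biscale MDS embedding exploits at its coarser scale.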
Structural analysis of high-dimensional basins of attraction
Martiniani, Stefano; Schrenk, K. Julian; Stevenson, Jacob D.; Wales, David J.; Frenkel, Daan
2016-09-01
We propose an efficient Monte Carlo method for the computation of the volumes of high-dimensional bodies with arbitrary shape. We start with a region of known volume within the interior of the manifold and then use the multistate Bennett acceptance-ratio method to compute the dimensionless free-energy difference between a series of equilibrium simulations performed within this object. The method produces results that are in excellent agreement with thermodynamic integration, as well as a direct estimate of the associated statistical uncertainties. The histogram method also allows us to directly obtain an estimate of the interior radial probability density profile, thus yielding useful insight into the structural properties of such a high-dimensional body. We illustrate the method by analyzing the effect of structural disorder on the basins of attraction of mechanically stable packings of soft repulsive spheres.
Dimensionality reduction for registration of high-dimensional data sets.
Xu, Min; Chen, Hao; Varshney, Pramod K
2013-08-01
Registration of two high-dimensional data sets often involves dimensionality reduction to yield a single-band image from each data set, followed by pairwise image registration. We develop a new application-specific algorithm for dimensionality reduction of high-dimensional data sets such that the weighted harmonic mean of Cramér-Rao lower bounds for the estimation of the transformation parameters for registration is minimized. The performance of the proposed dimensionality reduction algorithm is evaluated using three remote sensing data sets. The experimental results using a mutual information-based pairwise registration technique demonstrate that our proposed dimensionality reduction algorithm combines the original data sets to obtain the image pair with more texture, resulting in improved image registration.
Analysis of chaos in high-dimensional wind power system
Wang, Cong; Zhang, Hongli; Fan, Wenhui; Ma, Ping
2018-01-01
A comprehensive analysis of the chaos of a high-dimensional wind power system is performed in this study. A high-dimensional wind power system is more complex than most power systems. An 11-dimensional wind power system proposed by Huang, which has not been analyzed in previous studies, is investigated. When the system is affected by external disturbances, including single-parameter and periodic disturbances, or when its parameters change, the chaotic dynamics of the wind power system are analyzed and chaotic parameter ranges are obtained. The existence of chaos is confirmed by calculation and analysis of the Lyapunov exponents of all state variables and the state variable sequence diagram. Theoretical analysis and numerical simulations show that chaos will occur in the wind power system when parameter variations and external disturbances change to a certain degree.
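The chaos criterion used here, a positive largest Lyapunov exponent, is easy to demonstrate on a textbook one-dimensional system (the logistic map, not the 11-dimensional wind power model):

```python
import math

# Largest Lyapunov exponent of the logistic map x -> r x (1 - x), estimated
# as the average log-derivative along an orbit; lambda > 0 signals chaos.
def lyapunov(r, x0=0.3, n=100000, burn=1000):
    x = x0
    for _ in range(burn):              # discard the transient
        x = r * x * (1 - x)
    acc = 0.0
    for _ in range(n):
        x = r * x * (1 - x)
        acc += math.log(abs(r * (1 - 2 * x)))  # log |f'(x)| along the orbit
    return acc / n

print(lyapunov(4.0))  # positive (close to ln 2): chaotic regime
print(lyapunov(2.5))  # negative: stable fixed point, no chaos
```

For the multidimensional wind power system the same idea applies per state variable, with the Jacobian of the flow replacing the scalar derivative.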
Machine-learned cluster identification in high-dimensional data.
Ultsch, Alfred; Lötsch, Jörn
2017-02-01
High-dimensional biomedical data are frequently clustered to identify subgroup structures pointing at distinct disease subtypes. It is crucial that the clustering algorithm used works correctly. However, by imposing a predefined shape on the clusters, classical algorithms occasionally suggest a cluster structure in homogeneously distributed data or assign data points to incorrect clusters. We analyzed whether this can be avoided by using emergent self-organizing feature maps (ESOM). Data sets with different degrees of complexity were submitted to ESOM analysis with large numbers of neurons, using an interactive R-based bioinformatics tool. On top of the trained ESOM, the distance structure in the high-dimensional feature space was visualized in the form of a so-called U-matrix. Clustering results were compared with those provided by classical common cluster algorithms, including single linkage, Ward and k-means. Ward clustering imposed cluster structures on cluster-less "golf ball", "cuboid" and "S-shaped" data sets that contained no structure at all (random data). Ward clustering also imposed structures on permuted real-world data sets. By contrast, the ESOM/U-matrix approach correctly found that these data contain no cluster structure. However, ESOM/U-matrix was correct in identifying clusters in biomedical data truly containing subgroups. It was always correct in cluster structure identification in further canonical artificial data. Using intentionally simple data sets, it is shown that popular clustering algorithms typically used for biomedical data sets may fail to cluster data correctly, suggesting that they are also likely to perform erroneously on high-dimensional biomedical data. The present analyses emphasized that generally established classical hierarchical clustering algorithms carry a considerable tendency to produce erroneous results. By contrast, unsupervised machine-learned analysis of cluster structures, applied using the ESOM/U-matrix method, is a
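The abstract's central caveat, that classical algorithms impose a predefined cluster structure even where none exists, is easy to demonstrate. The toy sketch below (plain Lloyd's k-means, not the ESOM/U-matrix method) partitions uniformly random points into exactly k "clusters" regardless of the data containing any structure:

```python
import random

random.seed(1)

def kmeans(points, k, n_iter=50):
    """Plain Lloyd's algorithm for 2-D points; returns one label per point."""
    centers = random.sample(points, k)
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point joins its closest center.
        for i, (x, y) in enumerate(points):
            labels[i] = min(range(k),
                            key=lambda j: (x - centers[j][0]) ** 2
                                          + (y - centers[j][1]) ** 2)
        # Update step: move each center to the mean of its members.
        for j in range(k):
            members = [points[i] for i, lab in enumerate(labels) if lab == j]
            if members:
                centers[j] = (sum(p[0] for p in members) / len(members),
                              sum(p[1] for p in members) / len(members))
    return labels

# Uniformly random points: no cluster structure at all.
data = [(random.random(), random.random()) for _ in range(300)]
labels = kmeans(data, k=3)
print(len(set(labels)))
```

The algorithm dutifully reports three non-empty clusters in structureless data; a validity check such as a U-matrix visualization is needed to reveal that the partition is an artifact.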
Parsimonious description for predicting high-dimensional dynamics
Yoshito Hirata; Tomoya Takeuchi; Shunsuke Horai; Hideyuki Suzuki; Kazuyuki Aihara
2015-01-01
When we observe a system, we often cannot observe all of its variables and may have only a limited set of measurements. Under such circumstances, delay coordinates, vectors made of successive measurements, are useful for reconstructing the state of the whole system. Although the method of delay coordinates is theoretically supported for high-dimensional dynamical systems, there is a practical limitation because the calculation for higher-dimensional delay coordinates becomes more expensive. Here, ...
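The delay-coordinate construction the abstract relies on is compact enough to sketch directly. Given a scalar measurement series, each reconstructed state vector stacks the current value with lagged copies (dimension and lag are illustrative choices here):

```python
import math

def delay_embed(series, dim, tau):
    """Return delay-coordinate vectors [x(t), x(t-tau), ..., x(t-(dim-1)*tau)]."""
    start = (dim - 1) * tau
    return [[series[t - j * tau] for j in range(dim)]
            for t in range(start, len(series))]

# Only a single scalar observable of the underlying system is measured.
series = [math.sin(0.1 * t) for t in range(1000)]
vectors = delay_embed(series, dim=3, tau=5)
print(len(vectors), len(vectors[0]))
```

By Takens-type embedding results, such vectors can reconstruct the full state, but the abstract's point stands: the cost of working with these vectors grows with the embedding dimension.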
Bayesian Visual Analytics: Interactive Visualization for High Dimensional Data
Han, Chao
2012-01-01
In light of advancements made in data collection techniques over the past two decades, data mining has become common practice to summarize large, high dimensional datasets, in hopes of discovering noteworthy data structures. However, one concern is that most data mining approaches rely upon strict criteria that may mask information in data that analysts may find useful. We propose a new approach called Bayesian Visual Analytics (BaVA) which merges Bayesian Statistics with Visual Analytics to ...
RRT+ : Fast Planning for High-Dimensional Configuration Spaces
Xanthidis, Marios; Rekleitis, Ioannis; O'Kane, Jason M.
2016-01-01
In this paper we propose a new family of RRT based algorithms, named RRT+ , that are able to find faster solutions in high-dimensional configuration spaces compared to other existing RRT variants by finding paths in lower dimensional subspaces of the configuration space. The method can be easily applied to complex hyper-redundant systems and can be adapted by other RRT based planners. We introduce RRT+ and develop some variants, called PrioritizedRRT+ , PrioritizedRRT+-Connect, and Prioritize...
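For readers unfamiliar with the RRT family this paper extends, a minimal baseline RRT is sketched below. This is the standard algorithm in a 2-D obstacle-free workspace, not the RRT+ variants; the step size, goal bias and tolerances are arbitrary illustrative choices.

```python
import math
import random

random.seed(2)

def rrt(start, goal, step=0.5, goal_bias=0.1, goal_tol=0.5, max_iters=5000):
    """Minimal RRT in an obstacle-free 10x10 workspace.

    Grows a tree from `start` by repeatedly steering its nearest node
    towards a random sample (or, with probability `goal_bias`, the goal).
    """
    nodes = [start]
    parent = {0: None}
    for _ in range(max_iters):
        sample = goal if random.random() < goal_bias else \
            (random.uniform(0.0, 10.0), random.uniform(0.0, 10.0))
        # Find the tree node nearest to the sample.
        i = min(range(len(nodes)),
                key=lambda k: (nodes[k][0] - sample[0]) ** 2
                              + (nodes[k][1] - sample[1]) ** 2)
        nx, ny = nodes[i]
        d = math.hypot(sample[0] - nx, sample[1] - ny)
        # Steer at most `step` towards the sample.
        new = sample if d <= step else (nx + step * (sample[0] - nx) / d,
                                        ny + step * (sample[1] - ny) / d)
        parent[len(nodes)] = i
        nodes.append(new)
        if math.hypot(new[0] - goal[0], new[1] - goal[1]) < goal_tol:
            path, k = [], len(nodes) - 1
            while k is not None:        # walk parents back to the root
                path.append(nodes[k])
                k = parent[k]
            return list(reversed(path))
    return None

path = rrt((1.0, 1.0), (9.0, 9.0))
print(len(path))
```

In high-dimensional configuration spaces the nearest-neighbor search and the volume to be covered both explode, which is the bottleneck RRT+ attacks by planning in lower-dimensional subspaces first.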
Evaluating Clustering in Subspace Projections of High Dimensional Data
DEFF Research Database (Denmark)
Müller, Emmanuel; Günnemann, Stephan; Assent, Ira
2009-01-01
Clustering high dimensional data is an emerging research field. Subspace clustering or projected clustering group similar objects in subspaces, i.e. projections, of the full space. In the past decade, several clustering paradigms have been developed in parallel, without thorough evaluation and co...... and create a common baseline for future developments and comparable evaluations in the field. For repeatability, all implementations, data sets and evaluation measures are available on our website....
Oracle inequalities for high-dimensional panel data models
DEFF Research Database (Denmark)
Kock, Anders Bredahl
This paper is concerned with high-dimensional panel data models where the number of regressors can be much larger than the sample size. Under the assumption that the true parameter vector is sparse we establish finite sample upper bounds on the estimation error of the Lasso under two different se...... results by simulations and apply the methods to search for covariates explaining growth in the G8 countries....
High-dimensional quantum cloning and applications to quantum hacking.
Bouchard, Frédéric; Fickler, Robert; Boyd, Robert W; Karimi, Ebrahim
2017-02-01
Attempts at cloning a quantum system result in the introduction of imperfections in the state of the copies. This is a consequence of the no-cloning theorem, which is a fundamental law of quantum physics and the backbone of security for quantum communications. Although perfect copies are prohibited, a quantum state may be copied with maximal accuracy via various optimal cloning schemes. Optimal quantum cloning, which lies at the border of the physical limit imposed by the no-signaling theorem and the Heisenberg uncertainty principle, has been experimentally realized for low-dimensional photonic states. However, an increase in the dimensionality of quantum systems is greatly beneficial to quantum computation and communication protocols. Nonetheless, no experimental demonstration of optimal cloning machines has hitherto been shown for high-dimensional quantum systems. We perform optimal cloning of high-dimensional photonic states by means of the symmetrization method. We show the universality of our technique by conducting cloning of numerous arbitrary input states and fully characterize our cloning machine by performing quantum state tomography on cloned photons. In addition, a cloning attack on a Bennett and Brassard (BB84) quantum key distribution protocol is experimentally demonstrated to reveal the robustness of high-dimensional states in quantum cryptography.
Optimal Feature Selection in High-Dimensional Discriminant Analysis.
Kolar, Mladen; Liu, Han
2015-02-01
We consider the high-dimensional discriminant analysis problem. For this problem, different methods have been proposed and justified by establishing exact convergence rates for the classification risk, as well as ℓ2 convergence results for the discriminative rule. However, a sharp theoretical analysis of the variable selection performance of these procedures has not been established, even though model interpretation is of fundamental importance in scientific data analysis. This paper bridges the gap by providing sharp sufficient conditions for consistent variable selection using sparse discriminant analysis (Mai et al., 2012). Through careful analysis, we establish rates of convergence that are significantly faster than the best known results and admit an optimal scaling of the sample size n, dimensionality p, and sparsity level s in the high-dimensional setting. The sufficient conditions are complemented by necessary information-theoretic limits on the variable selection problem in the context of high-dimensional discriminant analysis. Exploiting a numerical equivalence result, our method also establishes the optimal results for the ROAD estimator (Fan et al., 2012) and the sparse optimal scaling estimator (Clemmensen et al., 2011). Furthermore, we analyze an exhaustive search procedure, whose performance serves as a benchmark, and show that it is variable selection consistent under weaker conditions. Extensive simulations demonstrating the sharpness of the bounds are also provided.
Smart sampling and incremental function learning for very large high dimensional data.
Loyola R, Diego G; Pedergnana, Mattia; Gimeno García, Sebastián
2016-06-01
Very large high dimensional data are common nowadays and impose new challenges for data-driven and data-intensive algorithms. Computational intelligence techniques have the potential to provide powerful tools for addressing these challenges, but the current literature focuses mainly on handling scalability issues related to data volume in terms of sample size for classification tasks. This work presents a systematic and comprehensive approach for optimally handling regression tasks with very large high dimensional data. The proposed approach is based on smart sampling techniques for minimizing the number of samples to be generated by using an iterative approach that creates new sample sets until the input and output space of the function to be approximated are optimally covered. Incremental function learning takes place in each sampling iteration; the new samples are used to fine-tune the regression results of the function learning algorithm. The accuracy and confidence levels of the resulting approximation function are assessed using the probably approximately correct computation framework. The smart sampling and incremental function learning techniques can be easily used in practical applications and scale well in the case of extremely large data. The feasibility and good results of the proposed techniques are demonstrated using benchmark functions as well as functions from real-world problems. Copyright © 2015 The Authors. Published by Elsevier Ltd. All rights reserved.
Multiple Group Testing Procedures for Analysis of High-Dimensional Genomic Data
Directory of Open Access Journals (Sweden)
Hyoseok Ko
2016-12-01
Full Text Available In genetic association studies with high-dimensional genomic data, multiple group testing procedures are often required in order to identify disease/trait-related genes or genetic regions, where multiple genetic sites or variants are located within the same gene or genetic region. However, statistical testing procedures based on individual tests suffer from multiple testing issues such as the control of the family-wise error rate and dependent tests. Moreover, detecting only a few genes associated with a phenotype outcome among tens of thousands of genes is of main interest in genetic association studies. For this reason, regularization procedures, in which a phenotype outcome is regressed on all genomic markers and the regression coefficients are then estimated based on a penalized likelihood, have been considered a good alternative approach to the analysis of high-dimensional genomic data. However, the selection performance of regularization procedures has rarely been compared with that of statistical group testing procedures. In this article, we performed extensive simulation studies in which commonly used group testing procedures such as principal component analysis, Hotelling's T2 test, and the permutation test are compared with the group lasso (least absolute selection and shrinkage operator) in terms of true positive selection. We also applied all methods considered in the simulation studies to identify genes associated with ovarian cancer from over 20,000 genetic sites generated from the Illumina Infinium HumanMethylation27K Beadchip. We found a large discrepancy between the genes selected by the multiple group testing procedures and those selected by the group lasso.
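Among the group testing procedures compared above, the permutation test is the simplest to make concrete. The sketch below (a generic two-sample version, not the genomic group test of the study) computes a p-value by shuffling group labels; the group sizes and effect size are illustrative.

```python
import random
import statistics

random.seed(3)

def permutation_test(a, b, n_perm=5000):
    """Two-sample permutation test for a difference in means: the p-value is
    the fraction of label shufflings whose absolute mean difference is at
    least as large as the observed one."""
    observed = abs(statistics.mean(a) - statistics.mean(b))
    pooled = list(a) + list(b)
    n_a = len(a)
    count = 0
    for _ in range(n_perm):
        random.shuffle(pooled)
        diff = abs(statistics.mean(pooled[:n_a]) - statistics.mean(pooled[n_a:]))
        if diff >= observed:
            count += 1
    return (count + 1) / (n_perm + 1)   # add-one correction avoids p = 0

# Two groups whose true means differ by two standard deviations.
group_a = [random.gauss(0.0, 1.0) for _ in range(30)]
group_b = [random.gauss(2.0, 1.0) for _ in range(30)]
p_value = permutation_test(group_a, group_b)
print(p_value)
```

Because the null distribution is built from the data itself, the test needs no parametric assumptions, which is why it is a common baseline in genomic group testing.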
Wang, Xiaoping; Zhang, Fei
2017-12-22
Water quality is highly dependent on landscape characteristics. This study explored the relationships between landscape patterns and water quality in the Ebinur Lake oasis in China. The water quality index (WQI) has been used to identify threats to water quality and contribute to better water resource management. This study established the WQI and analyzed the influence of landscapes on the WQI based on a stepwise linear regression (SLR) model and geographically weighted regression (GWR) models. The results showed that the WQI was between 56.61 and 2886.51. The map of the WQI showed poor water quality. Both positive and negative relationships between certain land use and land cover (LULC) types and the WQI were observed for different buffers. This relationship is most significant for the 400-m buffer. There is a significant relationship between the water quality index and the landscape indices (i.e., PLAND, DIVISION, aggregation index (AI), COHESION, landscape shape index (LSI), and largest patch index (LPI)), demonstrated by using stepwise multiple linear regressions at the 400-m scale, which resulted in an adjusted R² between 0.63 and 0.88. The local R² between the LPI and LSI for forest grasslands and the WQI is high in the Akeqisu and Kuitun rivers and low in the Bortala River, ranging from 0.57 to 1.86. The local R² between the LSI for croplands and the WQI is 0.44. The local R² values between the LPI for saline lands and the WQI are high in the Jing River and low in the Bo, Akeqisu, and Kuitun rivers, ranging from 0.57 to 1.86.
Lamichhane, A P; Liese, A D; Urbina, E M; Crandell, J L; Jaacks, L M; Dabelea, D; Black, M H; Merchant, A T; Mayer-Davis, E J
2014-12-01
Youth with type 1 diabetes (T1DM) are at substantially increased risk for adverse vascular outcomes, but little is known about the influence of dietary behavior on cardiovascular disease (CVD) risk profile. We aimed to identify dietary intake patterns associated with CVD risk factors and evaluate their impact on arterial stiffness (AS) measures collected thereafter in a cohort of youth with T1DM. Baseline diet data from a food frequency questionnaire and CVD risk factors (triglycerides, low density lipoprotein-cholesterol, systolic blood pressure, hemoglobin A1c, C-reactive protein and waist circumference) were available for 1153 youth aged ⩾10 years with T1DM from the SEARCH for Diabetes in Youth Study. A dietary intake pattern was identified using 33 food groups as predictors and six CVD risk factors as responses in reduced rank regression (RRR) analysis. Associations of this RRR-derived dietary pattern with AS measures (augmentation index (AIx75), n=229; pulse wave velocity, n=237; and brachial distensibility, n=228) were then assessed using linear regression. The RRR-derived pattern was characterized by high intakes of sugar-sweetened beverages (SSB) and diet soda, eggs, potatoes and high-fat meats and low intakes of sweets/desserts and low-fat dairy; major contributors were SSB and diet soda. This pattern captured the largest variability in adverse CVD risk profile and was subsequently associated with AIx75 (β=0.47; P<0.01). The mean difference in AIx75 between the highest and the lowest dietary pattern quartiles was 4.3% in the fully adjusted model. Intervention strategies to reduce consumption of unhealthy foods and beverages among youth with T1DM may significantly improve CVD risk profile and ultimately reduce the risk for AS.
Inverse Regression for the Wiener Class of Systems
Lyzell, Christian; Enqvist, Martin
2011-01-01
The concept of inverse regression has turned out to be quite useful for dimension reduction in regression analysis problems. Using methods like sliced inverse regression (SIR) and directional regression (DR), some high-dimensional nonlinear regression problems can be turned into more tractable low-dimensional problems. Here, the usefulness of inverse regression for identification of nonlinear dynamical systems will be discussed. In particular, it will be shown that the inverse regression meth...
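The SIR step described above is short enough to sketch end to end. The toy below (a generic SIR implementation, not the Wiener-system variant of the paper; all dimensions and the single-index model are illustrative) standardizes the predictors, averages them within slices of the sorted response, and reads the effective direction off the top eigenvector of the slice-mean covariance:

```python
import numpy as np

rng = np.random.default_rng(4)

def sir_first_direction(X, y, n_slices=10):
    """Sliced inverse regression: estimate the leading e.d.r. direction."""
    n, p = X.shape
    mean = X.mean(axis=0)
    cov = np.cov(X, rowvar=False)
    # Whitening transform Sigma^{-1/2}.
    evals, evecs = np.linalg.eigh(cov)
    inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T
    Z = (X - mean) @ inv_sqrt
    order = np.argsort(y)
    M = np.zeros((p, p))
    for idx in np.array_split(order, n_slices):
        m = Z[idx].mean(axis=0)             # slice mean of whitened X
        M += (len(idx) / n) * np.outer(m, m)
    top = np.linalg.eigh(M)[1][:, -1]       # top eigenvector of Cov(E[Z|y])
    direction = inv_sqrt @ top              # map back to the original scale
    return direction / np.linalg.norm(direction)

# Single-index model: y depends on X only through one direction beta.
beta = np.array([1.0, 2.0, 0.0, 0.0, -1.0])
beta /= np.linalg.norm(beta)
X = rng.standard_normal((2000, 5))
y = (X @ beta) ** 3 + 0.1 * rng.standard_normal(2000)
est = sir_first_direction(X, y)
print(np.abs(est @ beta))   # near 1 when the direction is recovered
```

Even though the link function is nonlinear, SIR recovers the single index, turning a 5-dimensional regression into a 1-dimensional one, which is the dimension-reduction idea the paper carries over to dynamical system identification.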
Variable kernel density estimation in high-dimensional feature spaces
CSIR Research Space (South Africa)
Van der Walt, Christiaan M
2017-02-01
Full Text Available with the KDE is non-parametric, since no parametric distribution is imposed on the estimate; instead the estimated distribution is defined by the sum of the kernel functions centred on the data points. KDEs thus require the selection of two design parameters... has become feasible – understanding and modelling high-dimensional data has thus become a crucial activity, especially in the field of machine learning. Since non-parametric density estimators are data-driven and do not require or impose a pre...
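The fixed-bandwidth KDE that variable-kernel methods generalize can be sketched in a few lines. This is the basic estimator (one global bandwidth from Silverman's rule of thumb for roughly unit-variance data), not the variable-kernel scheme of the abstract:

```python
import math
import random

random.seed(5)

def gaussian_kde(samples, x, bandwidth):
    """Fixed-bandwidth Gaussian kernel density estimate at point x:
    the average of Gaussian bumps centred on the data points."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))
    return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2)
                      for s in samples)

samples = [random.gauss(0.0, 1.0) for _ in range(5000)]
h = 1.06 * len(samples) ** (-1 / 5)     # Silverman's rule (unit-variance data)
density_at_zero = gaussian_kde(samples, 0.0, h)
print(density_at_zero)   # true N(0,1) density at 0 is 1/sqrt(2*pi) ~ 0.3989
```

A single global bandwidth over-smooths dense regions and under-smooths sparse ones; variable-kernel KDEs adapt the bandwidth per point, which matters increasingly in high-dimensional feature spaces.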
On spectral distribution of high dimensional covariation matrices
DEFF Research Database (Denmark)
Heinrich, Claudio; Podolskij, Mark
In this paper we present the asymptotic theory for spectral distributions of high-dimensional covariation matrices of Brownian diffusions. More specifically, we consider N-dimensional Itô integrals with time-varying matrix-valued integrands. We observe n equidistant high-frequency data points...... of the underlying Brownian diffusion and we assume that N/n -> c in (0,∞). We show that under a certain mixed spectral moment condition the spectral distribution of the empirical covariation matrix converges in distribution almost surely. Our proof relies on the method of moments and applications of graph theory.
Directory of Open Access Journals (Sweden)
Fassnacht, S. R.
2012-05-01
Full Text Available The relation between snow water equivalent (SWE) and 28 variables (27 topographic variables and canopy density) for the Colorado River Basin, USA, was explored through a multivariate regression. These variables include location, slope and aspect at different scales, derived variables that indicate the distance to sources of moisture and the proximity to and characteristics of obstacles between these moisture sources and areas of snow accumulation, and canopy density. A weekly time step of snow telemetry (SNOTEL) SWE data from 1990 through 1999 was used. The most important variables were elevation and regional-scale (81 km²) slope. Since the seasonal and inter-annual variability is high, a regression relationship should be formulated for each time step. The inter-annual variation in the relation between SWE and topographic variables partially corresponded with the amount of snow accumulated over the season and the El Niño Southern Oscillation cycle.
Wu, Shan-Shan; Yang, Hao; Guo, Fei; Han, Rui-Ming
2017-02-15
Multivariate statistical analyses combined with geographically weighted regression (GWR) were used to identify spatial variations of heavy metals in sediments and to examine relationships between metal pollution and land use practices in watersheds, including urban land, agricultural land, forest and water bodies. Seven metals (Cu, Zn, Pb, Cr, Ni, Mn and Fe) were measured in sediments at 31 sampling sites in the Sheyang River. Most metals showed a certain degree of enrichment based on the enrichment factors. Cluster analysis grouped all sites into four statistically significant clusters; severely contaminated areas were concentrated in areas with intensive human activities. Correlation analysis and PCA indicated that Cu, Zn and Pb were derived from anthropogenic activities, while the sources of Cr and Ni were complicated. However, Fe and Mn originated from natural sources. According to the results of GWR, there is a stronger association between metal pollution and urban land than between metal pollution and agricultural land or forest. Moreover, the relationships between land use and metal concentrations were affected by the urbanization level of the watersheds. Agricultural land was weakly associated with heavy metal pollution, and the relationships might be stronger in less-urbanized watersheds. This study provides useful information for the assessment and management of heavy metal hazards in the studied area. Copyright © 2016 Elsevier B.V. All rights reserved.
Reboussin, Beth A; Hubbard, Scott; Ialongo, Nicholas S
2007-09-06
The aim of this paper was to describe patterns of marijuana involvement during the middle-school years from the first chance to try marijuana down through the early stages of experiencing health and social problems from marijuana use in a sample of African-American adolescents. A total of 488 urban-dwelling African-American middle-school students were interviewed in sixth, seventh and eighth grades as part of a longitudinal field study. Longitudinal latent class models were used to identify subgroups (classes) of adolescents with similar patterns of marijuana involvement. Three classes were identified: little or no involvement (prevalence 85%, 71%, 55% in sixth, seventh and eighth grade, respectively), marijuana exposure opportunity (12%, 19% and 26%), and marijuana use and problems (2%, 9% and 19%). High levels of aggressive/disruptive behavior exhibited as early as first grade and moderate to high levels of deviant peer affiliation were associated with an increased risk of marijuana exposure opportunities in middle-school. Moderate to high levels of aggressive/disruptive behavior and deviant peer affiliation, moderate to low levels of parent monitoring and high levels of perceived neighborhood disadvantage were associated with an increased risk of marijuana use and problems. Significant interactions with grade provided evidence that the influences of parent monitoring and neighborhood disadvantage decrease through the middle-school years. Although not statistically significant, the magnitude of the effects of deviant peer affiliation on marijuana use and problems increased two-fold from sixth to eighth grade. These findings highlight the importance of marijuana exposure opportunities in the pathway to marijuana use and problems and the potential to intervene on behaviors exhibited as early as first grade. It also underscores the importance of developing interventions that are sensitive to the strong influence of parents at entry into middle-school and the shift
Sample size requirements for training high-dimensional risk predictors.
Dobbin, Kevin K; Song, Xiao
2013-09-01
A common objective of biomarker studies is to develop a predictor of patient survival outcome. Determining the number of samples required to train a predictor from survival data is important for designing such studies. Existing sample size methods for training studies use parametric models for the high-dimensional data and cannot handle a right-censored dependent variable. We present a new training sample size method that is non-parametric with respect to the high-dimensional vectors, and is developed for a right-censored response. The method can be applied to any prediction algorithm that satisfies a set of conditions. The sample size is chosen so that the expected performance of the predictor is within a user-defined tolerance of optimal. The central method is based on a pilot dataset. To quantify uncertainty, a method to construct a confidence interval for the tolerance is developed. Adequacy of the size of the pilot dataset is discussed. An alternative model-based version of our method for estimating the tolerance when no adequate pilot dataset is available is presented. The model-based method requires a covariance matrix be specified, but we show that the identity covariance matrix provides adequate sample size when the user specifies three key quantities. Application of the sample size method to two microarray datasets is discussed.
Scalable Nearest Neighbor Algorithms for High Dimensional Data.
Muja, Marius; Lowe, David G
2014-11-01
For many computer vision and machine learning problems, large training sets are key for good performance. However, the most computationally expensive part of many computer vision and machine learning algorithms consists of finding nearest neighbor matches to high dimensional vectors that represent the training data. We propose new algorithms for approximate nearest neighbor matching and evaluate and compare them with previous algorithms. For matching high dimensional features, we find two algorithms to be the most efficient: the randomized k-d forest and a new algorithm proposed in this paper, the priority search k-means tree. We also propose a new algorithm for matching binary features by searching multiple hierarchical clustering trees and show it outperforms methods typically used in the literature. We show that the optimal nearest neighbor algorithm and its parameters depend on the data set characteristics and describe an automated configuration procedure for finding the best algorithm to search a particular data set. In order to scale to very large data sets that would otherwise not fit in the memory of a single machine, we propose a distributed nearest neighbor matching framework that can be used with any of the algorithms described in the paper. All this research has been released as an open source library called fast library for approximate nearest neighbors (FLANN), which has been incorporated into OpenCV and is now one of the most popular libraries for nearest neighbor matching.
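The k-d tree data structure underlying the randomized k-d forest above can be sketched compactly. Below is a plain single-tree exact nearest-neighbor search (FLANN's contribution is the approximate, randomized, multi-tree machinery on top of this idea), verified against brute force:

```python
import random

random.seed(6)

def build_kdtree(points, depth=0):
    """Recursively build a k-d tree; nodes are (point, left, right)."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return (points[mid],
            build_kdtree(points[:mid], depth + 1),
            build_kdtree(points[mid + 1:], depth + 1))

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(node, query, depth=0, best=None):
    """Exact nearest-neighbour search with branch pruning."""
    if node is None:
        return best
    point, left, right = node
    if best is None or sq_dist(query, point) < sq_dist(query, best):
        best = point
    axis = depth % len(query)
    diff = query[axis] - point[axis]
    near, far = (left, right) if diff < 0 else (right, left)
    best = nearest(near, query, depth + 1, best)
    # Only descend the far subtree if the splitting plane is closer
    # than the current best match.
    if diff ** 2 <= sq_dist(query, best):
        best = nearest(far, query, depth + 1, best)
    return best

points = [tuple(random.random() for _ in range(3)) for _ in range(500)]
tree = build_kdtree(points)
query = (0.5, 0.5, 0.5)
kd_best = nearest(tree, query)
brute_best = min(points, key=lambda p: sq_dist(query, p))
print(kd_best == brute_best)
```

The pruning test is what degrades in high dimensions: the splitting plane is almost always closer than the best match, so both subtrees get searched, which motivates the approximate and randomized strategies the paper benchmarks.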
High-dimensional camera shake removal with given depth map.
Yue, Tao; Suo, Jinli; Dai, Qionghai
2014-06-01
Camera motion blur is drastically nonuniform for large depth-range scenes, and the nonuniformity caused by camera translation is depth dependent, which is not the case for camera rotations. To restore blurry images of large-depth-range scenes deteriorated by arbitrary camera motion, we build an image blur model considering the 6 degrees of freedom (DoF) of camera motion with a given scene depth map. To make this 6D depth-aware model tractable, we propose a novel parametrization strategy to reduce the number of variables, as well as an effective method to estimate the high-dimensional camera motion. The number of variables is reduced by a temporal sampling motion function, which describes the 6-DoF camera motion by sampling the camera trajectory uniformly in the time domain. To effectively estimate the high-dimensional camera motion parameters, we construct a probabilistic motion density function (PMDF) to describe the probability distribution of camera poses during exposure, and apply it as a unified constraint to guide the convergence of the iterative deblurring algorithm. Specifically, the PMDF is computed through a back projection from 2D local blur kernels to the 6D camera motion parameter space and robust voting. We conduct a series of experiments on both synthetic and real captured data, and validate that our method achieves better performance than existing uniform and nonuniform methods on large-depth-range scenes.
Elucidating high-dimensional cancer hallmark annotation via enriched ontology.
Yan, Shankai; Wong, Ka-Chun
2017-09-01
Cancer hallmark annotation is a promising technique that could discover novel knowledge about cancer from the biomedical literature. The automated annotation of cancer hallmarks could reveal relevant cancer transformation processes in the literature or extract the articles that correspond to the cancer hallmark of interest. It acts as a complementary approach that can retrieve knowledge from massive text information, advancing numerous focused studies in cancer research. Nonetheless, the high-dimensional nature of cancer hallmark annotation imposes a unique challenge. To address the curse of dimensionality, we compared multiple cancer hallmark annotation methods on 1580 PubMed abstracts. Based on the insights, a novel approach, UDT-RF, which makes use of ontological features is proposed. It expands the feature space via the Medical Subject Headings (MeSH) ontology graph and utilizes novel feature selections for elucidating the high-dimensional cancer hallmark annotation space. To demonstrate its effectiveness, state-of-the-art methods are compared and evaluated by a multitude of performance metrics, revealing the full performance spectrum on the full set of cancer hallmarks. Several case studies are conducted, demonstrating how the proposed approach could reveal novel insights into cancers. https://github.com/cskyan/chmannot. Copyright © 2017 Elsevier Inc. All rights reserved.
Prediction of Incident Diabetes in the Jackson Heart Study Using High-Dimensional Machine Learning.
Directory of Open Access Journals (Sweden)
Ramon Casanova
Full Text Available Statistical models to predict incident diabetes are often based on limited variables. Here we pursued two main goals: (1) investigate the relative performance of a machine learning method such as Random Forests (RF) for detecting incident diabetes in a high-dimensional setting defined by a large set of observational data, and (2) uncover potential predictors of diabetes. The Jackson Heart Study collected data at baseline and in two follow-up visits from 5,301 African Americans. We excluded those with baseline diabetes and no follow-up, leaving 3,633 individuals for analyses. Over a mean 8-year follow-up, 584 participants developed diabetes. The full RF model evaluated 93 variables including demographic, anthropometric, blood biomarker, medical history, and echocardiogram data. We also used RF metrics of variable importance to rank variables according to their contribution to diabetes prediction. We implemented other models based on logistic regression and RF where features were preselected. The performance of the full RF model (AUC = 0.82) was similar to that of the more parsimonious models. The top-ranked variables according to RF included hemoglobin A1C, fasting plasma glucose, waist circumference, adiponectin, C-reactive protein, triglycerides, leptin, left ventricular mass, high-density lipoprotein cholesterol, and aldosterone. This work shows the potential of RF for incident diabetes prediction while dealing with high-dimensional data.
Class prediction for high-dimensional class-imbalanced data
Directory of Open Access Journals (Sweden)
Lusa Lara
2010-10-01
Full Text Available Abstract Background The goal of class prediction studies is to develop rules to accurately predict the class membership of new samples. The rules are derived using the values of the variables available for each subject: the main characteristic of high-dimensional data is that the number of variables greatly exceeds the number of samples. Frequently the classifiers are developed using class-imbalanced data, i.e., data sets where the number of samples in each class is not equal. Standard classification methods used on class-imbalanced data often produce classifiers that do not accurately predict the minority class; the prediction is biased towards the majority class. In this paper we investigate if the high-dimensionality poses additional challenges when dealing with class-imbalanced prediction. We evaluate the performance of six types of classifiers on class-imbalanced data, using simulated data and a publicly available data set from a breast cancer gene-expression microarray study. We also investigate the effectiveness of some strategies that are available to overcome the effect of class imbalance. Results Our results show that the evaluated classifiers are highly sensitive to class imbalance and that variable selection introduces an additional bias towards classification into the majority class. Most new samples are assigned to the majority class from the training set, unless the difference between the classes is very large. As a consequence, the class-specific predictive accuracies differ considerably. When the class imbalance is not too severe, down-sizing and asymmetric bagging embedding variable selection work well, while over-sampling does not. Variable normalization can further worsen the performance of the classifiers. Conclusions Our results show that matching the prevalence of the classes in training and test set does not guarantee good performance of classifiers and that the problems related to classification with class
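One of the strategies evaluated above, down-sizing (randomly discarding majority-class samples to balance the training set), can be sketched as follows. The data, the k-NN classifier, and the effect size are illustrative assumptions, not the paper's simulation design:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(1)
p = 100
# Imbalanced high-dimensional training set: 15 minority vs 135 majority samples.
X_min = rng.normal(0.4, 1.0, (15, p))    # minority class, shifted mean
X_maj = rng.normal(0.0, 1.0, (135, p))   # majority class
X = np.vstack([X_min, X_maj])
y = np.array([1] * 15 + [0] * 135)

# Down-sizing: keep a random majority subsample of the minority-class size.
keep = rng.choice(135, size=15, replace=False)
X_bal = np.vstack([X_min, X_maj[keep]])
y_bal = np.array([1] * 15 + [0] * 15)

# Minority-class test points: class-specific accuracy (recall) on them.
X_test = rng.normal(0.4, 1.0, (50, p))
recall_imb = KNeighborsClassifier(n_neighbors=5).fit(X, y).predict(X_test).mean()
recall_bal = KNeighborsClassifier(n_neighbors=5).fit(X_bal, y_bal).predict(X_test).mean()
print(recall_imb, recall_bal)   # minority recall typically improves after down-sizing
```

This reproduces, in miniature, the paper's observation that predictions on imbalanced training data are biased toward the majority class.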
Inference for feature selection using the Lasso with high-dimensional data
DEFF Research Database (Denmark)
Brink-Jensen, Kasper; Ekstrøm, Claus Thorn
2014-01-01
Penalized regression models such as the Lasso have proved useful for variable selection in many fields - especially for situations with high-dimensional data where the number of predictors far exceeds the number of observations. These methods identify and rank variables of importance but do not generally provide any inference about the selected variables. Thus, the variables selected might be the "most important" but need not be significant. We propose a significance test for the selection found by the Lasso. We introduce a procedure that computes inference and p-values for features chosen … that involve various effect strengths and correlation between predictors. The algorithm is also applied to a prostate cancer dataset that has been analyzed in recent papers on the subject. The proposed method is found to provide a powerful way to make inference for feature selection even for small samples.
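The selection step that the proposed significance test builds on can be sketched with scikit-learn's Lasso. The data and the penalty level are illustrative assumptions; the point is that the selected set mixes true signals with noise variables, which is exactly why post-selection inference is needed:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
n, p = 100, 200                       # far more predictors than observations
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[[0, 1, 2]] = [3.0, -2.5, 2.0]    # three true signals among 200 predictors
y = X @ beta + rng.normal(size=n)

lasso = Lasso(alpha=0.2, max_iter=5000).fit(X, y)
selected = np.flatnonzero(lasso.coef_)
print(len(selected))                  # selected set: true signals plus possible noise picks
```

The Lasso ranks and selects, but assigns no p-values to `selected`; the paper's contribution is a test of which of these survive as significant.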
Bilionis, Ilias; Gonzalez, Marcial
2016-01-01
The prohibitive cost of performing Uncertainty Quantification (UQ) tasks with a very large number of input parameters can be addressed, if the response exhibits some special structure that can be discovered and exploited. Several physical responses exhibit a special structure known as an active subspace (AS), a linear manifold of the stochastic space characterized by maximal response variation. The idea is that one should first identify this low dimensional manifold, project the high-dimensional input onto it, and then link the projection to the output. In this work, we develop a probabilistic version of AS which is gradient-free and robust to observational noise. Our approach relies on a novel Gaussian process regression with built-in dimensionality reduction with the AS represented as an orthogonal projection matrix that serves as yet another covariance function hyper-parameter to be estimated from the data. To train the model, we design a two-step maximum likelihood optimization procedure that ensures the ...
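The classical, gradient-based construction that this work makes gradient-free can be sketched in a few lines: eigendecompose the empirical covariance of response gradients, and the dominant eigenvector spans the active subspace. The test function and dimensions are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
d = 10
w = np.zeros(d)
w[:2] = [0.8, 0.6]                     # hidden unit-norm active direction

def grad_f(x):
    # f(x) = sin(w @ x)  =>  grad f(x) = cos(w @ x) * w
    return np.cos(w @ x) * w

# Gradient-based AS discovery: eigendecompose the average outer product of
# gradients. (The paper's method is a probabilistic, gradient-free analogue
# that treats the projection as a GP covariance hyper-parameter.)
X = rng.normal(size=(500, d))
G = np.array([grad_f(x) for x in X])   # 500 gradient samples
C = G.T @ G / len(G)
eigvals, eigvecs = np.linalg.eigh(C)
top = eigvecs[:, -1]                   # dominant eigenvector
print(round(abs(top @ w), 3))          # → 1.0 (perfect alignment with the active direction)
```

Projecting the 10-dimensional input onto `top` reduces the regression to one dimension, which is the "identify, project, then link" idea the abstract describes.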
Applying recursive numerical integration techniques for solving high dimensional integrals
Energy Technology Data Exchange (ETDEWEB)
Ammon, Andreas [IVU Traffic Technologies AG, Berlin (Germany)]; Genz, Alan [Washington State Univ., Pullman, WA (United States). Dept. of Mathematics]; Hartung, Tobias [King's College, London (United Kingdom). Dept. of Mathematics]; Jansen, Karl; Volmer, Julia [Deutsches Elektronen-Synchrotron (DESY), Zeuthen (Germany). John von Neumann-Inst. fuer Computing NIC]; Leoevey, Hernan [Humboldt Univ. Berlin (Germany). Inst. fuer Mathematik]
2016-11-15
The error scaling for Markov-Chain Monte Carlo techniques (MCMC) with N samples behaves like 1/√(N). This scaling makes it often very time intensive to reduce the error of computed observables, in particular for applications in lattice QCD. It is therefore highly desirable to have alternative methods at hand which show an improved error scaling. One candidate for such an alternative integration technique is the method of recursive numerical integration (RNI). The basic idea of this method is to use an efficient low-dimensional quadrature rule (usually of Gaussian type) and apply it iteratively to integrate over high-dimensional observables and Boltzmann weights. We present the application of such an algorithm to the topological rotor and the anharmonic oscillator and compare the error scaling to MCMC results. In particular, we demonstrate that the RNI technique shows an error scaling in the number of integration points m that is at least exponential.
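The core idea, applying an efficient 1D Gaussian quadrature rule coordinate by coordinate, can be sketched as below. This naive recursion costs m^dim evaluations; the paper's actual gains for lattice models come from factorized Boltzmann weights that collapse the recursion into matrix products, which this sketch does not implement. The test integrand is an illustrative choice with a known value:

```python
import numpy as np

def gauss01(m):
    """m-point Gauss-Legendre rule shifted from [-1, 1] to [0, 1]."""
    x, w = np.polynomial.legendre.leggauss(m)
    return 0.5 * (x + 1.0), 0.5 * w

def recursive_integrate(f, dim, m):
    """Recursively apply the 1D rule over each of `dim` coordinates."""
    x, w = gauss01(m)
    def rec(prefix):
        if len(prefix) == dim:
            return f(np.array(prefix))
        return sum(wi * rec(prefix + [xi]) for xi, wi in zip(x, w))
    return rec([])

# Smooth integrand with a known value: ∫_[0,1]^3 exp(x+y+z) dV = (e - 1)^3
approx = recursive_integrate(lambda v: np.exp(v.sum()), dim=3, m=8)
print(abs(approx - (np.e - 1.0) ** 3))
```

Even with only m = 8 points per dimension the error is near machine precision, illustrating the exponential-in-m convergence the abstract contrasts with the 1/√N Monte Carlo rate.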
Testing the mean matrix in high-dimensional transposable data.
Touloumis, Anestis; Tavaré, Simon; Marioni, John C
2015-03-01
The structural information in high-dimensional transposable data allows us to write the data recorded for each subject in a matrix such that both the rows and the columns correspond to variables of interest. One important problem is to test the null hypothesis that the mean matrix has a particular structure without ignoring the dependence structure among and/or between the row and column variables. To address this, we develop a generic and computationally inexpensive nonparametric testing procedure to assess the hypothesis that, in each predefined subset of columns (rows), the column (row) mean vector remains constant. In simulation studies, the proposed testing procedure seems to have good performance and, unlike simple practical approaches, it preserves the nominal size and remains powerful even if the row and/or column variables are not independent. Finally, we illustrate the use of the proposed methodology via two empirical examples from gene expression microarrays. © 2015, The International Biometric Society.
Building high dimensional imaging database for content based image search
Sun, Qinpei; Sun, Jianyong; Ling, Tonghui; Wang, Mingqing; Yang, Yuanyuan; Zhang, Jianguo
2016-03-01
In medical imaging informatics, content-based image retrieval (CBIR) techniques are employed to aid radiologists in the retrieval of images with similar image contents. CBIR uses visual contents, normally called image features, to search images from large-scale image databases according to users' requests in the form of a query image. However, most current CBIR systems require computing distances between image feature vectors to perform a query, and these distance computations become time-consuming as the number of image features grows large, which limits the usability of such systems. In this presentation, we propose a novel framework which uses a high dimensional database to index the image features to improve the accuracy and retrieval speed of CBIR in an integrated RIS/PACS.
Experimental High-Dimensional Einstein-Podolsky-Rosen Steering
Zeng, Qiang; Wang, Bo; Li, Pengyun; Zhang, Xiangdong
2018-01-01
Steering nonlocality is a fundamental property of quantum mechanics that has been widely demonstrated in qubit systems. Recently, theoretical works have shown that the high-dimensional (HD) steering effect exhibits novel and important features, such as noise suppression, which appear promising for potential applications in quantum information processing (QIP). However, experimental observation of these HD properties has remained a great challenge to date. In this work, we demonstrate the HD steering effect by encoding with orbital angular momentum photons for the first time. More importantly, we quantitatively certify the noise-suppression phenomenon in the HD steering effect by introducing a tunable isotropic noise. We believe our results represent a significant advance in the study of nonlocal steering and have direct benefits for QIP applications with superior capacity and reliability.
Technical Report: Scalable Parallel Algorithms for High Dimensional Numerical Integration
Energy Technology Data Exchange (ETDEWEB)
Masalma, Yahya [Universidad del Turabo]; Jiao, Yu [ORNL]
2010-10-01
We implemented a scalable parallel quasi-Monte Carlo algorithm for high-dimensional numerical integration over tera-scale data points. The implemented algorithm uses Sobol's quasi-sequences to generate random samples. The Sobol sequence was used to avoid clustering effects in the generated samples and to produce low-discrepancy samples that cover the entire integration domain. The performance of the algorithm was tested, and the obtained results demonstrate the scalability and accuracy of the implemented algorithms. The implemented algorithm could be used in different applications where a huge data volume is generated and numerical integration is required. We suggest using a hybrid MPI and OpenMP programming model to improve the performance of the algorithms. If the mixed model is used, attention should be paid to scalability and accuracy.
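The serial core of such an algorithm, Sobol points replacing pseudo-random samples, can be sketched with SciPy's `qmc` module. The integrand below is an illustrative choice with known integral 1 over the unit cube, not the application the report targets:

```python
import numpy as np
from scipy.stats import qmc

d = 6
sobol = qmc.Sobol(d=d, scramble=False)    # deterministic low-discrepancy sequence
pts = sobol.random_base2(m=12)            # 2**12 = 4096 points in [0, 1]^d

# Integrand with known integral over the unit cube: prod_i (2 * x_i) -> 1.
estimate = np.prod(2.0 * pts, axis=1).mean()
print(abs(estimate - 1.0))                # small error from only 4096 points
```

Because each point's contribution is independent, the evaluation loop parallelizes naturally, which is where the MPI/OpenMP hybrid model the report recommends would enter.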
Interactive Java Tools for Exploring High-dimensional Data
Directory of Open Access Journals (Sweden)
James W. Bradley
2000-12-01
Full Text Available The World Wide Web (WWW is a new mechanism for providing information. At this point, the majority of the information on the WWW is static, which means it is incapable of responding to user input. Text, images, and video are examples of static information that can easily be included in a WWW page. With the advent of the Java programming language, it is now possible to embed dynamic information in the form of interactive programs called applets. Therefore, it is not only possible to transfer raw data over the WWW, but we can also now provide interactive graphics for displaying and exploring data in the context of a WWW page. In this paper, we will describe the use of Java applets that have been developed for the interactive display of high dimensional data on the WWW.
Nguyen, Lan Huong; Holmes, Susan
2017-09-13
Detecting patterns in high-dimensional multivariate datasets is non-trivial. Clustering and dimensionality reduction techniques often help in discerning inherent structures. In biological datasets such as microbial community composition or gene expression data, observations can be generated from a continuous process, often unknown. Estimating data points' 'natural ordering' and their corresponding uncertainties can help researchers draw insights about the mechanisms involved. We introduce a Bayesian Unidimensional Scaling (BUDS) technique which extracts dominant sources of variation in high dimensional datasets and produces their visual data summaries, facilitating the exploration of a hidden continuum. The method maps multivariate data points to latent one-dimensional coordinates along their underlying trajectory, and provides estimated uncertainty bounds. By statistically modeling dissimilarities and applying a DiSTATIS registration method to their posterior samples, we are able to incorporate visualizations of uncertainties in the estimated data trajectory across different regions using confidence contours for individual data points. We also illustrate the estimated overall data density across different areas by including density clouds. One-dimensional coordinates recovered by BUDS help researchers discover sample attributes or covariates that are factors driving the main variability in a dataset. We demonstrated the usefulness and accuracy of BUDS on published microbiome 16S, RNA-seq, and roll-call data sets. Our method effectively recovers and visualizes natural orderings present in datasets. Automatic visualization tools for data exploration and analysis are available at: https://nlhuong.shinyapps.io/visTrajectory/.
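The core task, recovering a latent one-dimensional ordering from multivariate data, can be illustrated with a non-Bayesian stand-in: a 1D manifold embedding recovers the ordering, though without the posterior uncertainty bounds that BUDS provides. The synthetic trajectory below is an illustrative assumption:

```python
import numpy as np
from sklearn.manifold import Isomap

rng = np.random.default_rng(5)
t = np.sort(rng.random(80))               # hidden 1D "natural ordering"
# Observations along a noisy curve in 3-D driven by t.
X = np.column_stack([t, t ** 2, np.sin(2.0 * t)])
X += rng.normal(scale=0.005, size=X.shape)

# 1D embedding of pairwise dissimilarities (stand-in for BUDS's latent coordinates).
coord = Isomap(n_neighbors=8, n_components=1).fit_transform(X).ravel()

# Rank correlation between the recovered coordinate and the true ordering
# (sign is arbitrary, so take the absolute value).
r_coord = np.argsort(np.argsort(coord)).astype(float)
r_t = np.argsort(np.argsort(t)).astype(float)
spearman = np.corrcoef(r_coord, r_t)[0, 1]
print(round(abs(spearman), 3))
```

BUDS additionally models the dissimilarities statistically, so each latent coordinate comes with credible bounds rather than a single point estimate.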
The Subspace Voyager: Exploring High-Dimensional Data along a Continuum of Salient 3D Subspaces.
Wang, Bing; Mueller, Klaus
2018-02-01
Analyzing high-dimensional data and finding hidden patterns is a difficult problem and has attracted numerous research efforts. Automated methods can be useful to some extent but bringing the data analyst into the loop via interactive visual tools can help the discovery process tremendously. An inherent problem in this effort is that humans lack the mental capacity to truly understand spaces exceeding three spatial dimensions. To keep within this limitation, we describe a framework that decomposes a high-dimensional data space into a continuum of generalized 3D subspaces. Analysts can then explore these 3D subspaces individually via the familiar trackball interface while using additional facilities to smoothly transition to adjacent subspaces for expanded space comprehension. Since the number of such subspaces suffers from combinatorial explosion, we provide a set of data-driven subspace selection and navigation tools which can guide users to interesting subspaces and views. A subspace trail map allows users to manage the explored subspaces, keep their bearings, and return to interesting subspaces and views. Both trackball and trail map are each embedded into a word cloud of attribute labels which aid in navigation. We demonstrate our system via several use cases in a diverse set of application areas-cluster analysis and refinement, information discovery, and supervised training of classifiers. We also report on a user study that evaluates the usability of the various interactions our system provides.
Quality metrics in high-dimensional data visualization: an overview and systematization.
Bertini, Enrico; Tatu, Andrada; Keim, Daniel
2011-12-01
In this paper, we present a systematization of techniques that use quality metrics to help in the visual exploration of meaningful patterns in high-dimensional data. In a number of recent papers, different quality metrics are proposed to automate the demanding search through large spaces of alternative visualizations (e.g., alternative projections or ordering), allowing the user to concentrate on the most promising visualizations suggested by the quality metrics. Over the last decade, this approach has witnessed a remarkable development but few reflections exist on how these methods are related to each other and how the approach can be developed further. For this purpose, we provide an overview of approaches that use quality metrics in high-dimensional data visualization and propose a systematization based on a thorough literature review. We carefully analyze the papers and derive a set of factors for discriminating the quality metrics, visualization techniques, and the process itself. The process is described through a reworked version of the well-known information visualization pipeline. We demonstrate the usefulness of our model by applying it to several existing approaches that use quality metrics, and we provide reflections on implications of our model for future research. © 2010 IEEE
Directory of Open Access Journals (Sweden)
Fabian Horst
Full Text Available Traditionally, gait analysis has been centered on the idea of average behavior and normality. On one hand, clinical diagnoses and therapeutic interventions typically assume that average gait patterns remain constant over time. On the other hand, it is well known that all our movements are accompanied by a certain amount of variability, which does not allow us to make two identical steps. The purpose of this study was to examine changes in the intra-individual gait patterns across different time-scales (i.e., tens-of-mins, tens-of-hours). Nine healthy subjects performed 15 gait trials at a self-selected speed in 6 sessions within one day (duration between two subsequent sessions from 10 to 90 mins). For each trial, time-continuous ground reaction forces and lower body joint angles were measured. A supervised learning model using a kernel-based discriminant regression was applied for classifying sessions within individual gait patterns. Discernable characteristics of intra-individual gait patterns could be distinguished between repeated sessions by classification rates of 67.8 ± 8.8% and 86.3 ± 7.9% for the six-session-classification of ground reaction forces and lower body joint angles, respectively. Furthermore, the one-on-one-classification showed that increasing classification rates go along with increasing time durations between two sessions and indicate that changes of gait patterns appear at different time-scales. Discernable characteristics between repeated sessions indicate continuous intrinsic changes in intra-individual gait patterns and suggest a predominant role of deterministic processes in human motor control and learning. Natural changes of gait patterns without any externally induced injury or intervention may reflect continuous adaptations of the motor system over several time-scales. Accordingly, the modelling of walking by means of average gait patterns that are assumed to be near constant over time needs to be reconsidered in the
Macroeconomic Forecasting Using Penalized Regression Methods
Smeekes, Stephan; Wijler, Etiënne
2016-01-01
We study the suitability of lasso-type penalized regression techniques when applied to macroeconomic forecasting with high-dimensional datasets. We consider performance of the lasso-type methods when the true DGP is a factor model, contradicting the sparsity assumption underlying penalized
A novel algorithm for simultaneous SNP selection in high-dimensional genome-wide association studies
Directory of Open Access Journals (Sweden)
Zuber Verena
2012-10-01
Full Text Available Abstract Background Identification of causal SNPs in most genome-wide association studies relies on approaches that consider each SNP individually. However, there is a strong correlation structure among SNPs that needs to be taken into account. Hence, increasingly, modern computationally expensive regression methods are employed for SNP selection that consider all markers simultaneously and thus incorporate dependencies among SNPs. Results We develop a novel multivariate algorithm for large scale SNP selection using CAR score regression, a promising new approach for prioritizing biomarkers. Specifically, we propose a computationally efficient procedure for shrinkage estimation of CAR scores from high-dimensional data. Subsequently, we conduct a comprehensive comparison study including five advanced regression approaches (boosting, lasso, NEG, MCP, and CAR score) and a univariate approach (marginal correlation) to determine their effectiveness in finding true causal SNPs. Conclusions Simultaneous SNP selection is a challenging task. We demonstrate that our CAR score-based algorithm consistently outperforms all competing approaches, both uni- and multivariate, in terms of correctly recovered causal SNPs and SNP ranking. An R package implementing the approach as well as R code to reproduce the complete study presented here is available from http://strimmerlab.org/software/care/.
Biomarker identification and effect estimation on schizophrenia – a high dimensional data analysis
Directory of Open Access Journals (Sweden)
Yuanzhang eLi
2015-05-01
Full Text Available Biomarkers have been examined in schizophrenia research for decades. Medical morbidity and mortality rates, as well as personal and societal costs, are associated with schizophrenia patients. The identification of biomarkers and alleles, which often have a small effect individually, may help to develop new diagnostic tests for early identification and treatment. Currently, there is not a commonly accepted statistical approach to identify predictive biomarkers from high dimensional data. We used the space Decomposition-Gradient-Regression (DGR) method to select biomarkers which are associated with the risk of schizophrenia. Then, we used the gradient scores, generated from the selected biomarkers, as the prediction factor in regression to estimate their effects. We also used an alternative approach, classification and regression tree (CART), to compare with the biomarkers selected by DGR and found that about 70% of the selected biomarkers were the same. However, the advantage of DGR is that it can evaluate individual effects for each biomarker from their combined effect. In DGR analysis of serum specimens of US military service members with a diagnosis of schizophrenia from 1992 to 2005 and their controls, Alpha-1-Antitrypsin (AAT), Interleukin-6 receptor (IL-6r), and Connective Tissue Growth Factor (CTGF) were selected to identify schizophrenia for males; and Alpha-1-Antitrypsin (AAT), Apolipoprotein B (Apo B), and Sortilin were selected for females. If these findings from military subjects are replicated by other studies, they suggest the possibility of a novel biomarker panel as an adjunct to earlier diagnosis and initiation of treatment.
A qualitative numerical study of high dimensional dynamical systems
Albers, David James
Since Poincare, the father of modern mathematical dynamical systems, much effort has been exerted to achieve a qualitative understanding of the physical world via a qualitative understanding of the functions we use to model the physical world. In this thesis, we construct a numerical framework suitable for a qualitative, statistical study of dynamical systems using the space of artificial neural networks. We analyze the dynamics along intervals in parameter space, separating the set of neural networks into roughly four regions: the fixed point to the first bifurcation; the route to chaos; the chaotic region; and a transition region between chaos and finite-state neural networks. The study is primarily with respect to high-dimensional dynamical systems. We make the following general conclusions as the dimension of the dynamical system is increased: the probability of the first bifurcation being of type Neimark-Sacker is greater than ninety-percent; the most probable route to chaos is via a cascade of bifurcations of high-period periodic orbits, quasi-periodic orbits, and 2-tori; there exists an interval of parameter space such that hyperbolicity is violated on a countable, Lebesgue measure 0, "increasingly dense" subset; chaos is much more likely to persist with respect to parameter perturbation in the chaotic region of parameter space as the dimension is increased; moreover, as the number of positive Lyapunov exponents is increased, the likelihood that any significant portion of these positive exponents can be perturbed away decreases with increasing dimension. The maximum Kaplan-Yorke dimension and the maximum number of positive Lyapunov exponents increases linearly with dimension. The probability of a dynamical system being chaotic increases exponentially with dimension. The results with respect to the first bifurcation and the route to chaos comment on previous results of Newhouse, Ruelle, Takens, Broer, Chenciner, and Iooss. 
Moreover, results regarding the high-dimensional
Robust Hessian Locally Linear Embedding Techniques for High-Dimensional Data
Directory of Open Access Journals (Sweden)
Xianglei Xing
2016-05-01
Full Text Available Recently, manifold learning has received extensive interest in the pattern recognition community. Despite their appealing properties, most manifold learning algorithms are not robust in practical applications. In this paper, we address this problem in the context of the Hessian locally linear embedding (HLLE) algorithm and propose a more robust method, called RHLLE, which aims to be robust against both outliers and noise in the data. Specifically, we first propose a fast outlier detection method for high-dimensional datasets. Then, we employ a local smoothing method to reduce noise. Furthermore, we reformulate the original HLLE algorithm by using the truncation function from differentiable manifolds. In the reformulated framework, we explicitly introduce a weighted global functional to further reduce the undesirable effect of outliers and noise on the embedding result. Experiments on synthetic as well as real datasets demonstrate the effectiveness of our proposed algorithm.
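RHLLE itself is not in standard libraries, but the baseline it hardens, plain Hessian LLE, is available in scikit-learn and can be run on the classic swiss-roll manifold:

```python
import numpy as np
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

X, t = make_swiss_roll(n_samples=800, random_state=0)

# Plain (non-robust) Hessian LLE; method='hessian' requires
# n_neighbors > n_components * (n_components + 3) / 2.
hlle = LocallyLinearEmbedding(n_neighbors=12, n_components=2, method="hessian")
Y = hlle.fit_transform(X)
print(Y.shape)   # → (800, 2)
```

On clean data this unrolls the manifold well; RHLLE's outlier detection and local smoothing address the cases where such noise-free behavior breaks down.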
Experimental Design of Formulations Utilizing High Dimensional Model Representation.
Li, Genyuan; Bastian, Caleb; Welsh, William; Rabitz, Herschel
2015-07-23
Many applications involve formulations or mixtures where large numbers of components are possible to choose from, but a final composition with only a few components is sought. Finding suitable binary or ternary mixtures from all the permissible components often relies on simplex-lattice sampling in traditional design of experiments (DoE), which requires performing a large number of experiments even for just tens of permissible components. The effect rises very rapidly with increasing numbers of components and can readily become impractical. This paper proposes constructing a single model for a mixture containing all permissible components from just a modest number of experiments. Yet the model is capable of satisfactorily predicting the performance for full as well as all possible binary and ternary component mixtures. To achieve this goal, we utilize biased random sampling combined with high dimensional model representation (HDMR) to replace DoE simplex-lattice design. Compared with DoE, the required number of experiments is significantly reduced, especially when the number of permissible components is large. This study is illustrated with a solubility model for solvent mixture screening.
The literary uses of high-dimensional space
Directory of Open Access Journals (Sweden)
Ted Underwood
2015-12-01
Full Text Available Debates over “Big Data” shed more heat than light in the humanities, because the term ascribes new importance to statistical methods without explaining how those methods have changed. What we badly need instead is a conversation about the substantive innovations that have made statistical modeling useful for disciplines where, in the past, it truly wasn’t. These innovations are partly technical, but more fundamentally expressed in what Leo Breiman calls a new “culture” of statistical modeling. Where 20th-century methods often required humanists to squeeze our unstructured texts, sounds, or images into some special-purpose data model, new methods can handle unstructured evidence more directly by modeling it in a high-dimensional space. This opens a range of research opportunities that humanists have barely begun to discuss. To date, topic modeling has received most attention, but in the long run, supervised predictive models may be even more important. I sketch their potential by describing how Jordan Sellers and I have begun to model poetic distinction in the long 19th century—revealing an arc of gradual change much longer than received literary histories would lead us to expect.
Progress in high-dimensional percolation and random graphs
Heydenreich, Markus
2017-01-01
This text presents an engaging exposition of the active field of high-dimensional percolation that will likely provide an impetus for future work. With over 90 exercises designed to enhance the reader's understanding of the material, as well as many open problems, the book is aimed at graduate students and researchers who wish to enter the world of this rich topic. The text may also be useful in advanced courses and seminars, as well as for reference and individual study. Part I, consisting of 3 chapters, presents a general introduction to percolation, stating the main results, defining the central objects, and proving its main properties. No prior knowledge of percolation is assumed. Part II, consisting of Chapters 4–9, discusses mean-field critical behavior by describing the two main techniques used, namely, differential inequalities and the lace expansion. In Parts I and II, all results are proved, making this the first self-contained text discussing high-dimensional percolation. Part III, consist...
Inference for High-dimensional Differential Correlation Matrices.
Cai, T Tony; Zhang, Anru
2016-01-01
Motivated by differential co-expression analysis in genomics, we consider in this paper estimation and testing of high-dimensional differential correlation matrices. An adaptive thresholding procedure is introduced and theoretical guarantees are given. Minimax rate of convergence is established and the proposed estimator is shown to be adaptively rate-optimal over collections of paired correlation matrices with approximately sparse differences. Simulation results show that the procedure significantly outperforms two other natural methods that are based on separate estimation of the individual correlation matrices. The procedure is also illustrated through an analysis of a breast cancer dataset, which provides evidence at the gene co-expression level that several genes, of which a subset has been previously verified, are associated with the breast cancer. Hypothesis testing on the differential correlation matrices is also considered. A test, which is particularly well suited for testing against sparse alternatives, is introduced. In addition, other related problems, including estimation of a single sparse correlation matrix, estimation of the differential covariance matrices, and estimation of the differential cross-correlation matrices, are also discussed.
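The thresholding idea can be illustrated in a simplified form: estimate the difference of two sample correlation matrices and zero out entries below a threshold. A universal threshold is used here as an assumption-laden simplification of the paper's entry-adaptive rule, and the simulated populations are illustrative, not the breast cancer data:

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 200, 8
# Two populations differing only in the correlation of variables 0 and 1.
S1 = np.eye(p)
S1[0, 1] = S1[1, 0] = 0.7
X1 = rng.multivariate_normal(np.zeros(p), S1, size=n)
X2 = rng.multivariate_normal(np.zeros(p), np.eye(p), size=n)

# Differential correlation matrix and universal soft thresholding.
D = np.corrcoef(X1, rowvar=False) - np.corrcoef(X2, rowvar=False)
lam = 2.0 * np.sqrt(np.log(p) / n)
D_hat = np.sign(D) * np.maximum(np.abs(D) - lam, 0.0)
print(abs(D_hat[0, 1]))   # only the truly differential pair survives clearly
```

The paper's adaptive procedure instead scales the threshold entry by entry using estimated variances, which is what yields minimax-rate optimality under approximate sparsity.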
Effects of dependence in high-dimensional multiple testing problems
Directory of Open Access Journals (Sweden)
van de Wiel Mark A
2008-02-01
Full Text Available Abstract Background We consider effects of dependence among variables of high-dimensional data in multiple hypothesis testing problems, in particular the False Discovery Rate (FDR) control procedures. Recent simulation studies consider only simple correlation structures among variables, which are hardly inspired by real data features. Our aim is to systematically study effects of several network features like sparsity and correlation strength by imposing dependence structures among variables using random correlation matrices. Results We study the robustness against dependence of several FDR procedures that are popular in microarray studies, such as the Benjamini-Hochberg FDR, Storey's q-value, SAM, and resampling-based FDR procedures. False Non-discovery Rates and estimates of the number of null hypotheses are computed from those methods and compared. Our simulation study shows that methods such as SAM and the q-value do not adequately control the FDR to the level claimed under dependence conditions. On the other hand, the adaptive Benjamini-Hochberg procedure seems to be most robust while remaining conservative. Finally, the estimates of the number of true null hypotheses under various dependence conditions are variable. Conclusion We discuss a new method for efficient guided simulation of dependent data, which satisfy imposed network constraints as conditional independence structures. Our simulation set-up allows for a structural study of the effect of dependencies on multiple testing criteria and is useful for testing a potentially new method on π0 or FDR estimation in a dependency context.
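The plain (non-adaptive) Benjamini-Hochberg step-up procedure at the center of these comparisons fits in a few lines of numpy; the adaptive variant the paper finds most robust additionally estimates the proportion of true nulls, which this sketch omits. The p-values are illustrative:

```python
import numpy as np

def benjamini_hochberg(pvals, q=0.05):
    """Return a boolean mask of hypotheses rejected at FDR level q."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)
    thresh = q * np.arange(1, m + 1) / m          # step-up thresholds k*q/m
    below = p[order] <= thresh
    k = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0
    reject = np.zeros(m, dtype=bool)
    reject[order[:k]] = True                      # reject the k smallest p-values
    return reject

pvals = [0.001, 0.008, 0.039, 0.041, 0.09, 0.2, 0.6, 0.9]
print(benjamini_hochberg(pvals).sum())            # → 2
```

Under independence this controls the FDR at level q; the paper's contribution is to quantify how such guarantees degrade, or survive, under realistic dependence structures.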
Online Active Linear Regression via Thresholding
Riquelme, Carlos; Johari, Ramesh; Zhang, Baosen
2016-01-01
We consider the problem of online active learning to collect data for regression modeling. Specifically, we consider a decision maker with a limited experimentation budget who must efficiently learn an underlying linear population model. Our main contribution is a novel threshold-based algorithm for selection of most informative observations; we characterize its performance and fundamental lower bounds. We extend the algorithm and its guarantees to sparse linear regression in high-dimensional...
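The thresholding idea can be illustrated with a toy sketch: spend the label budget only on covariate vectors whose norm exceeds a cutoff, then fit least squares on the labelled subset. The cutoff value and the data model here are illustrative assumptions, not the paper's algorithm or its guarantees:

```python
import numpy as np

rng = np.random.default_rng(1)
d, budget = 3, 50
beta = np.array([1.0, -2.0, 0.5])   # unknown population model (toy)

# Stream of candidate covariate vectors; pay for a label only when the
# observation looks informative (large norm, a crude proxy for leverage).
threshold = np.sqrt(d)              # assumed cutoff, for illustration only
X_sel, y_sel = [], []
while len(X_sel) < budget:
    x = rng.standard_normal(d)
    if np.linalg.norm(x) > threshold:            # query the label
        y = x @ beta + 0.1 * rng.standard_normal()
        X_sel.append(x)
        y_sel.append(y)

X, y = np.array(X_sel), np.array(y_sel)
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]  # OLS on the selected points
print(np.round(beta_hat, 2))
```

Selecting high-leverage observations does not bias OLS here (the linear model holds for every x), but it shrinks the estimator's variance per labelled point, which is the intuition the paper's threshold rule formalizes.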
Kahane, Leo H
2007-01-01
Using a friendly, nontechnical approach, the Second Edition of Regression Basics introduces readers to the fundamentals of regression. Accessible to anyone with an introductory statistics background, this book builds from a simple two-variable model to a model of greater complexity. Author Leo H. Kahane weaves four engaging examples throughout the text to illustrate not only the techniques of regression but also how this empirical tool can be applied in creative ways to consider a broad array of topics. New to the Second Edition Offers greater coverage of simple panel-data estimation:
Bhadra, Anindya
2013-04-22
We describe a Bayesian technique to (a) perform a sparse joint selection of significant predictor variables and significant inverse covariance matrix elements of the response variables in a high-dimensional linear Gaussian sparse seemingly unrelated regression (SSUR) setting and (b) perform an association analysis between the high-dimensional sets of predictors and responses in such a setting. To search the high-dimensional model space, where both the number of predictors and the number of possibly correlated responses can be larger than the sample size, we demonstrate that a marginalization-based collapsed Gibbs sampler, in combination with spike and slab type of priors, offers a computationally feasible and efficient solution. As an example, we apply our method to an expression quantitative trait loci (eQTL) analysis on publicly available single nucleotide polymorphism (SNP) and gene expression data for humans where the primary interest lies in finding the significant associations between the sets of SNPs and possibly correlated genetic transcripts. Our method also allows for inference on the sparse interaction network of the transcripts (response variables) after accounting for the effect of the SNPs (predictor variables). We exploit properties of Gaussian graphical models to make statements concerning conditional independence of the responses. Our method compares favorably to existing Bayesian approaches developed for this purpose. © 2013, The International Biometric Society.
Approximation of High-Dimensional Rank One Tensors
Bachmayr, Markus
2013-11-12
Many real world problems are high-dimensional in that their solution is a function which depends on many variables or parameters. This presents a computational challenge since traditional numerical techniques are built on model classes for functions based solely on smoothness. It is known that the approximation of smoothness classes of functions suffers from the so-called 'curse of dimensionality'. Avoiding this curse requires new model classes for real world functions that match applications. This has led to the introduction of notions such as sparsity, variable reduction, and reduced modeling. One theme that is particularly common is to assume a tensor structure for the target function. This paper investigates how well a rank one function f(x_1,...,x_d) = f_1(x_1)⋯f_d(x_d), defined on Ω = [0,1]^d, can be captured through point queries. It is shown that such a rank one function with component functions f_j in W^r_∞([0,1]) can be captured (in L_∞) to accuracy O(C(d,r)N^{-r}) from N well-chosen point evaluations. The constant C(d,r) scales like d^{dr}. The queries in our algorithms have two ingredients, a set of points built on the results from discrepancy theory and a second adaptive set of queries dependent on the information drawn from the first set. Under the assumption that a point z ∈ Ω with nonvanishing f(z) is known, the accuracy improves to O(dN^{-r}). © 2013 Springer Science+Business Media New York.
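The role of the known point z with nonvanishing f(z) can be seen in an exact identity for noiseless rank-one functions: axis-aligned queries through z determine f everywhere. The sketch below checks this identity numerically; it illustrates the query structure only, not the paper's discrepancy-based sampling or its error rates:

```python
import numpy as np

# A rank-one function on [0,1]^3: f(x) = f1(x1) f2(x2) f3(x3).
fs = [np.sin, np.exp, lambda t: 1.0 + t ** 2]

def f(x):
    out = 1.0
    for fj, xj in zip(fs, x):
        out *= fj(xj)
    return out

d = 3
z = np.array([0.5, 0.5, 0.5])   # assumed known point with f(z) != 0
fz = f(z)

def f_from_queries(x):
    # Exact identity for rank-one functions:
    # f(x) = prod_j f(z with coordinate j replaced by x_j) / f(z)^(d-1)
    prod = 1.0
    for j in range(d):
        zj = z.copy()
        zj[j] = x[j]
        prod *= f(zj)
    return prod / fz ** (d - 1)

x = np.array([0.1, 0.9, 0.3])
print(abs(f(x) - f_from_queries(x)))  # zero up to floating-point error
```

Each query f(z with x_j substituted) equals f_j(x_j) times a fixed constant, so the product over j reproduces f(x) after dividing out f(z)^{d-1}; the paper's algorithm combines this structure with well-chosen query points to handle general smooth factors.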
Quality and efficiency in high dimensional Nearest neighbor search
Tao, Yufei
2009-01-01
Nearest neighbor (NN) search in high dimensional space is an important problem in many applications. Ideally, a practical solution (i) should be implementable in a relational database, and (ii) its query cost should grow sub-linearly with the dataset size, regardless of the data and query distributions. Despite the bulk of NN literature, no solution fulfills both requirements, except locality sensitive hashing (LSH). The existing LSH implementations are either rigorous or adhoc. Rigorous-LSH ensures good quality of query results, but requires expensive space and query cost. Although adhoc-LSH is more efficient, it abandons quality control, i.e., the neighbor it outputs can be arbitrarily bad. As a result, currently no method is able to ensure both quality and efficiency simultaneously in practice. Motivated by this, we propose a new access method called the locality sensitive B-tree (LSB-tree) that enables fast high-dimensional NN search with excellent quality. The combination of several LSB-trees leads to a structure called the LSB-forest that ensures the same result quality as rigorous-LSH, but reduces its space and query cost dramatically. The LSB-forest also outperforms adhoc-LSH, even though the latter has no quality guarantee. Besides its appealing theoretical properties, the LSB-tree itself also serves as an effective index that consumes linear space, and supports efficient updates. Our extensive experiments confirm that the LSB-tree is faster than (i) the state of the art of exact NN search by two orders of magnitude, and (ii) the best (linear-space) method of approximate retrieval by an order of magnitude, and at the same time, returns neighbors with much better quality. © 2009 ACM.
Optimally splitting cases for training and testing high dimensional classifiers
Directory of Open Access Journals (Sweden)
Simon Richard M
2011-04-01
Full Text Available Abstract Background We consider the problem of designing a study to develop a predictive classifier from high dimensional data. A common study design is to split the sample into a training set and an independent test set, where the former is used to develop the classifier and the latter to evaluate its performance. In this paper we address the question of what proportion of the samples should be devoted to the training set. How does this proportion impact the mean squared error (MSE) of the prediction accuracy estimate? Results We develop a non-parametric algorithm for determining an optimal splitting proportion that can be applied with a specific dataset and classifier algorithm. We also perform a broad simulation study for the purpose of better understanding the factors that determine the best split proportions and to evaluate commonly used splitting strategies (1/2 training or 2/3 training) under a wide variety of conditions. These methods are based on a decomposition of the MSE into three intuitive component parts. Conclusions By applying these approaches to a number of synthetic and real microarray datasets we show that for linear classifiers the optimal proportion depends on the overall number of samples available and the degree of differential expression between the classes. The optimal proportion was found to depend on the full dataset size (n) and classification accuracy, with higher accuracy and smaller n resulting in more assigned to the training set. The commonly used strategy of allocating 2/3 of cases for training was close to optimal for reasonably sized datasets (n ≥ 100) with strong signals (i.e. 85% or greater full dataset accuracy). In general, we recommend use of our nonparametric resampling approach for determining the optimal split. This approach can be applied to any dataset, using any predictor development method, to determine the best split.
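The kind of resampling study the authors propose can be sketched by repeatedly splitting a synthetic dataset at several proportions and tracking the mean and spread of the resulting accuracy estimates. The classifier and data below are simplified assumptions (a nearest-centroid rule on Gaussian data), not the paper's algorithm:

```python
import numpy as np

rng = np.random.default_rng(2)

def nearest_centroid_fit_predict(Xtr, ytr, Xte):
    # Minimal linear classifier: assign each test point to the nearest class centroid.
    c0 = Xtr[ytr == 0].mean(axis=0)
    c1 = Xtr[ytr == 1].mean(axis=0)
    d0 = ((Xte - c0) ** 2).sum(axis=1)
    d1 = ((Xte - c1) ** 2).sum(axis=1)
    return (d1 < d0).astype(int)

# Synthetic "microarray" data: n samples, p genes, 3 informative genes.
n, p = 100, 20
y = np.repeat([0, 1], n // 2)
X = rng.standard_normal((n, p))
X[y == 1, :3] += 1.0

def split_summary(train_frac, reps=200):
    # Repeatedly split, train on one part, estimate accuracy on the other.
    accs = []
    for _ in range(reps):
        idx = rng.permutation(n)
        ntr = int(train_frac * n)
        tr, te = idx[:ntr], idx[ntr:]
        pred = nearest_centroid_fit_predict(X[tr], y[tr], X[te])
        accs.append((pred == y[te]).mean())
    accs = np.array(accs)
    return accs.mean(), accs.std()

for frac in (0.3, 0.5, 0.8):
    m, s = split_summary(frac)
    print(f"train={frac:.1f}  mean accuracy={m:.3f}  sd={s:.3f}")
```

A small training fraction depresses the mean (the classifier is undertrained), while a large one inflates the spread (the test set shrinks); the paper's MSE decomposition makes this bias-variance trade-off explicit.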
An Effective Parameter Screening Strategy for High Dimensional Watershed Models
Khare, Y. P.; Martinez, C. J.; Munoz-Carpena, R.
2014-12-01
Watershed simulation models can assess the impacts of natural and anthropogenic disturbances on natural systems. These models have become important tools for tackling a range of water resources problems through their implementation in the formulation and evaluation of Best Management Practices, Total Maximum Daily Loads, and Basin Management Action Plans. For accurate application, watershed models need to be thoroughly evaluated through global uncertainty and sensitivity analyses (UA/SA). However, due to the high dimensionality of these models such evaluation becomes extremely time- and resource-consuming. Parameter screening, the qualitative separation of important parameters, has been suggested as an essential step before applying rigorous evaluation techniques such as the Sobol' and Fourier Amplitude Sensitivity Test (FAST) methods in the UA/SA framework. The method of elementary effects (EE) (Morris, 1991) is one of the most widely used screening methodologies. Some of the common parameter sampling strategies for EE, e.g. Optimized Trajectories [OT] (Campolongo et al., 2007) and Modified Optimized Trajectories [MOT] (Ruano et al., 2012), suffer from inconsistencies in the generated parameter distributions, infeasible sample generation time, etc. In this work, we have formulated a new parameter sampling strategy - Sampling for Uniformity (SU) - for parameter screening which is based on the principles of the uniformity of the generated parameter distributions and the spread of the parameter sample. A rigorous multi-criteria evaluation (time, distribution, spread and screening efficiency) of OT, MOT, and SU indicated that SU is superior to the other sampling strategies. Comparison of the EE-based parameter importance rankings with those of Sobol' helped to quantify the qualitativeness of the EE parameter screening approach, reinforcing the fact that one should use EE only to reduce the resource burden required by FAST/Sobol' analyses but not to replace them.
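The elementary effects method underlying this work can be sketched in a few lines. The trajectory design below is deliberately bare-bones (not the OT/MOT/SU sampling strategies compared in the study), applied to an assumed toy model with one strong, one weak and one inactive factor:

```python
import numpy as np

def elementary_effects(f, k, delta=0.25, rng=None):
    # One Morris trajectory: k+1 model runs give one elementary effect per
    # factor. Real designs use OT/MOT/SU-style sampling; this is bare-bones.
    if rng is None:
        rng = np.random.default_rng()
    x = rng.integers(0, 3, size=k) / 4.0        # start on a coarse grid in [0,1]^k
    ee = np.empty(k)
    y_prev = f(x)
    for i in rng.permutation(k):                # perturb one factor at a time
        x = x.copy()
        x[i] += delta
        y_new = f(x)
        ee[i] = (y_new - y_prev) / delta
        y_prev = y_new
    return ee

# Toy "watershed model": factor 0 strong, factor 1 weak, factor 2 inactive.
def model(x):
    return 10 * x[0] + x[1] + 0 * x[2] + 2 * x[0] * x[1]

rng = np.random.default_rng(3)
EE = np.array([elementary_effects(model, 3, rng=rng) for _ in range(20)])
mu_star = np.abs(EE).mean(axis=0)   # Morris mu*: the usual screening measure
print(np.round(mu_star, 2))         # factor 0 dominates, factor 2 is inert
```

Ranking factors by mu* (the mean absolute elementary effect) is the qualitative screen the abstract describes; the cost is only r(k+1) model runs for r trajectories, versus thousands for FAST/Sobol'.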
A Novel High Dimensional and High Speed Data Streams Algorithm: HSDStream
Irshad Ahmed; Irfan Ahmed; Waseem Shahzad
2016-01-01
This paper presents a novel high speed clustering scheme for high-dimensional data streams. Data stream clustering has gained importance in different applications, for example, network monitoring, intrusion detection, and real-time sensing. High-dimensional stream data is inherently more complex when used for clustering because the evolving nature of the stream data and high dimensionality make it non-trivial. In order to tackle this problem, projected subspace within the high dimensions and l...
Huybrechts, Inge; Lioret, Sandrine; Mouratidou, Theodora; Gunter, Marc J; Manios, Yannis; Kersting, Mathilde; Gottrand, Frederic; Kafatos, Anthony; De Henauw, Stefaan; Cuenca-García, Magdalena; Widhalm, Kurt; Gonzales-Gross, Marcela; Molnar, Denes; Moreno, Luis A; McNaughton, Sarah A
2017-01-01
This study aims to examine the repeatability of reduced rank regression (RRR) methods in calculating dietary patterns (DP) and cross-sectional associations with overweight (OW)/obesity across European and Australian samples of adolescents. Data from two cross-sectional surveys in Europe (2006/2007 Healthy Lifestyle in Europe by Nutrition in Adolescence study, including 1954 adolescents, 12-17 years) and Australia (2007 National Children's Nutrition and Physical Activity Survey, including 1498 adolescents, 12-16 years) were used. Dietary intake was measured using two non-consecutive, 24-h recalls. RRR was used to identify DP using dietary energy density, fibre density and percentage of energy intake from fat as the intermediate variables. Associations between DP scores and body mass/fat were examined using multivariable linear and logistic regression as appropriate, stratified by sex. The first DP extracted (labelled 'energy dense, high fat, low fibre') explained 47 and 31 % of the response variation in Australian and European adolescents, respectively. It was similar for European and Australian adolescents and characterised by higher consumption of biscuits/cakes, chocolate/confectionery, crisps/savoury snacks, sugar-sweetened beverages, and lower consumption of yogurt, high-fibre bread, vegetables and fresh fruit. DP scores were inversely associated with BMI z-scores in Australian adolescent boys, with a borderline inverse association in European adolescent boys (likewise for percentage body fat, %BF). Similarly, a lower likelihood of OW in boys was observed with higher DP scores in both surveys. No such relationships were observed in adolescent girls. In conclusion, the DP identified in this cross-country study was comparable for European and Australian adolescents, demonstrating the robustness of the RRR method in calculating DP among populations. However, longitudinal designs are more relevant when studying diet-obesity associations, to prevent reverse causality.
Robust and sparse correlation matrix estimation for the analysis of high-dimensional genomics data.
Serra, Angela; Coretto, Pietro; Fratello, Michele; Tagliaferri, Roberto
2017-10-12
Microarray technology can be used to study the expression of thousands of genes across a number of different experimental conditions, usually hundreds. The underlying principle is that genes sharing similar expression patterns, across different samples, can be part of the same co-expression system, or they may share the same biological functions. Groups of genes are usually identified based on cluster analysis. Clustering methods rely on the similarity matrix between genes. A common choice to measure similarity is to compute the sample correlation matrix. Dimensionality reduction is another popular data analysis task which is also based on covariance/correlation matrix estimates. Unfortunately, covariance/correlation matrix estimation suffers from the intrinsic noise present in high-dimensional data. Sources of noise are: sampling variations, the presence of outlying sample units, and the fact that in most cases the number of genes is much larger than the number of units. In this paper we propose a robust correlation matrix estimator that is regularized based on adaptive thresholding. The resulting method jointly tames the effects of high-dimensionality and data contamination. Computations are easy to implement and do not require hand tuning. Both simulated and real data are analysed. A Monte Carlo experiment shows that the proposed method is capable of remarkable performance. Our correlation metric is more robust to outliers compared with the existing alternatives in two gene expression data sets. It is also shown how the regularization allows one to automatically detect and filter spurious correlations. The same regularization is also extended to other less robust correlation measures. Finally, we apply the ARACNE algorithm on the SyNTreN gene expression data. Sensitivity and specificity of the reconstructed network is compared with the gold standard. We show that ARACNE performs better when it takes the proposed correlation matrix estimator as input. The R
Lee, Jenny Hyunjung; McDonnell, Kevin T; Zelenyuk, Alla; Imre, Dan; Mueller, Klaus
2013-07-11
Although the Euclidean distance does well in measuring data distances within high-dimensional clusters, it does poorly when it comes to gauging inter-cluster distances. This significantly impacts the quality of global, low-dimensional space embedding procedures such as the popular multi-dimensional scaling (MDS) where one can often observe non-intuitive layouts. We were inspired by the perceptual processes evoked in the method of parallel coordinates which enables users to visually aggregate the data by the patterns the polylines exhibit across the dimension axes. We call the path of such a polyline its structure and suggest a metric that captures this structure directly in high-dimensional space. This allows us to better gauge the distances of spatially distant data constellations and so achieve data aggregations in MDS plots that are more cognizant of existing high-dimensional structure similarities. Our bi-scale framework distinguishes far-distances from near-distances. The coarser scale uses the structural similarity metric to separate data aggregates obtained by prior classification or clustering, while the finer scale employs the appropriate Euclidean distance.
Data analysis in high-dimensional sparse spaces
DEFF Research Database (Denmark)
Clemmensen, Line Katrine Harder
are applied to classifications of fish species, ear canal impressions used in the hearing aid industry, microbiological fungi species, and various cancerous tissues and healthy tissues. In addition, novel applications of sparse regressions (also called the elastic net) to the medical, concrete, and food...... industries via multi-spectral images for objective and automated systems are presented....
Identifying surfaces of low dimensions in high dimensional data
DEFF Research Database (Denmark)
Høskuldsson, Agnar
1996-01-01
Methods are presented that find a nonlinear subspace in low dimensions that describe data given by many variables. The methods include nonlinear extensions of Principal Component Analysis and extensions of linear regression analysis. It is shown by examples that these methods give more reliable...
Reduced nonlinear prognostic model construction from high-dimensional data
Gavrilov, Andrey; Mukhin, Dmitry; Loskutov, Evgeny; Feigin, Alexander
2017-04-01
Construction of a data-driven model of evolution operator using universal approximating functions can only be statistically justified when the dimension of its phase space is small enough, especially in the case of short time series. At the same time in many applications real-measured data is high-dimensional, e.g. it is space-distributed and multivariate in climate science. Therefore it is necessary to use efficient dimensionality reduction methods which are also able to capture key dynamical properties of the system from observed data. To address this problem we present a Bayesian approach to an evolution operator construction which incorporates two key reduction steps. First, the data is decomposed into a set of certain empirical modes, such as standard empirical orthogonal functions or recently suggested nonlinear dynamical modes (NDMs) [1], and the reduced space of corresponding principal components (PCs) is obtained. Then, the model of evolution operator for PCs is constructed which maps a number of states in the past to the current state. The second step is to reduce this time-extended space in the past using appropriate decomposition methods. Such a reduction allows us to capture only the most significant spatio-temporal couplings. The functional form of the evolution operator includes separately linear, nonlinear (based on artificial neural networks) and stochastic terms. Explicit separation of the linear term from the nonlinear one allows us to more easily interpret degree of nonlinearity as well as to deal better with smooth PCs which can naturally occur in the decompositions like NDM, as they provide a time scale separation. Results of application of the proposed method to climate data are demonstrated and discussed. The study is supported by Government of Russian Federation (agreement #14.Z50.31.0033 with the Institute of Applied Physics of RAS). 1. Mukhin, D., Gavrilov, A., Feigin, A., Loskutov, E., & Kurths, J. (2015). Principal nonlinear dynamical
Chapman, Benjamin P; Weiss, Alexander; Duberstein, Paul R
2016-12-01
Statistical learning theory (SLT) is the statistical formulation of machine learning theory, a body of analytic methods common in "big data" problems. Regression-based SLT algorithms seek to maximize predictive accuracy for some outcome, given a large pool of potential predictors, without overfitting the sample. Research goals in psychology may sometimes call for high dimensional regression. One example is criterion-keyed scale construction, where a scale with maximal predictive validity must be built from a large item pool. Using this as a working example, we first introduce a core principle of SLT methods: minimization of expected prediction error (EPE). Minimizing EPE is fundamentally different than maximizing the within-sample likelihood, and hinges on building a predictive model of sufficient complexity to predict the outcome well, without undue complexity leading to overfitting. We describe how such models are built and refined via cross-validation. We then illustrate how 3 common SLT algorithms-supervised principal components, regularization, and boosting-can be used to construct a criterion-keyed scale predicting all-cause mortality, using a large personality item pool within a population cohort. Each algorithm illustrates a different approach to minimizing EPE. Finally, we consider broader applications of SLT predictive algorithms, both as supportive analytic tools for conventional methods, and as primary analytic tools in discovery phase research. We conclude that despite their differences from the classic null-hypothesis testing approach-or perhaps because of them-SLT methods may hold value as a statistically rigorous approach to exploratory regression. (PsycINFO Database Record (c) 2016 APA, all rights reserved).
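The EPE-minimization principle can be illustrated with K-fold cross-validation over a ridge penalty on a toy "item pool" (regularization is one of the three algorithm families the article covers). The data, penalty grid and fold count below are illustrative assumptions, not the article's scale-construction pipeline:

```python
import numpy as np

rng = np.random.default_rng(4)

# "Item pool" scenario: 60 candidate predictors, only 5 carry signal.
n, p = 80, 60
beta = np.zeros(p)
beta[:5] = 1.0
X = rng.standard_normal((n, p))
y = X @ beta + rng.standard_normal(n)

def ridge_fit(Xtr, ytr, lam):
    # Ridge regression: penalized least squares in closed form.
    return np.linalg.solve(Xtr.T @ Xtr + lam * np.eye(Xtr.shape[1]), Xtr.T @ ytr)

folds = np.array_split(rng.permutation(n), 5)

def cv_error(lam):
    # K-fold cross-validation estimate of expected prediction error (EPE).
    err = 0.0
    for te in folds:
        tr = np.setdiff1d(np.arange(n), te)
        b = ridge_fit(X[tr], y[tr], lam)
        err += ((y[te] - X[te] @ b) ** 2).sum()
    return err / n

lams = [0.01, 1.0, 10.0, 100.0]
errs = [cv_error(l) for l in lams]
best = lams[int(np.argmin(errs))]
print("selected lambda:", best)
```

Minimizing within-sample likelihood would always prefer the smallest penalty; cross-validated EPE instead selects a moderate penalty that trades a little bias for a large variance reduction, which is the overfitting control the article emphasizes.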
Olive, David J
2017-01-01
This text covers both multiple linear regression and some experimental design models. The text uses the response plot to visualize the model and to detect outliers, does not assume that the error distribution has a known parametric distribution, develops prediction intervals that work when the error distribution is unknown, suggests bootstrap hypothesis tests that may be useful for inference after variable selection, and develops prediction regions and large sample theory for the multivariate linear regression model that has m response variables. A relationship between multivariate prediction regions and confidence regions provides a simple way to bootstrap confidence regions. These confidence regions often provide a practical method for testing hypotheses. There is also a chapter on generalized linear models and generalized additive models. There are many R functions to produce response and residual plots, to simulate prediction intervals and hypothesis tests, to detect outliers, and to choose response trans...
Directory of Open Access Journals (Sweden)
Igor K. Kochanenko
2013-01-01
Full Text Available Procedures for constructing a regression curve by the criterion of least fractals, that is, the greatest probability of the sums of powers of the least deviations of the measured intensities from their model values, are justified. The exponent is defined as the fractal dimension of the time series. The difference between the results of the justified method and the method of least squares is quantitatively estimated.
Nagarajan, Mahesh B; Coan, Paola; Huber, Markus B; Diemoz, Paul C; Glaser, Christian; Wismüller, Axel
2014-02-01
Phase-contrast computed tomography (PCI-CT) has shown tremendous potential as an imaging modality for visualizing human cartilage with high spatial resolution. Previous studies have demonstrated the ability of PCI-CT to visualize (1) structural details of the human patellar cartilage matrix and (2) changes to chondrocyte organization induced by osteoarthritis. This study investigates the use of high-dimensional geometric features in characterizing such chondrocyte patterns in the presence or absence of osteoarthritic damage. Geometrical features derived from the scaling index method (SIM) and statistical features derived from gray-level co-occurrence matrices were extracted from 842 regions of interest (ROI) annotated on PCI-CT images of ex vivo human patellar cartilage specimens. These features were subsequently used in a machine learning task with support vector regression to classify ROIs as healthy or osteoarthritic; classification performance was evaluated using the area under the receiver-operating characteristic curve (AUC). SIM-derived geometrical features exhibited the best classification performance (AUC, 0.95 ± 0.06) and were most robust to changes in ROI size. These results suggest that such geometrical features can provide a detailed characterization of the chondrocyte organization in the cartilage matrix in an automated and non-subjective manner, while also enabling classification of cartilage as healthy or osteoarthritic with high accuracy. Such features could potentially serve as imaging markers for evaluating osteoarthritis progression and its response to different therapeutic intervention strategies.
Städler, Nicolas; Dondelinger, Frank; Hill, Steven M; Akbani, Rehan; Lu, Yiling; Mills, Gordon B; Mukherjee, Sach
2017-09-15
Molecular pathways and networks play a key role in basic and disease biology. An emerging notion is that networks encoding patterns of molecular interplay may themselves differ between contexts, such as cell type, tissue or disease (sub)type. However, while statistical testing of differences in mean expression levels has been extensively studied, testing of network differences remains challenging. Furthermore, since network differences could provide important and biologically interpretable information to identify molecular subgroups, there is a need to consider the unsupervised task of learning subgroups and networks that define them. This is a nontrivial clustering problem, with neither subgroups nor subgroup-specific networks known at the outset. We leverage recent ideas from high-dimensional statistics for testing and clustering in the network biology setting. The methods we describe can be applied directly to most continuous molecular measurements and networks do not need to be specified beforehand. We illustrate the ideas and methods in a case study using protein data from The Cancer Genome Atlas (TCGA). This provides evidence that patterns of interplay between signalling proteins differ significantly between cancer types. Furthermore, we show how the proposed approaches can be used to learn subtypes and the molecular networks that define them. The methods are available as the Bioconductor package nethet. Contact: staedler.n@gmail.com or sach.mukherjee@dzne.de. Supplementary data are available at Bioinformatics online.
Mwangi, Benson; Soares, Jair C; Hasan, Khader M
2014-10-30
Neuroimaging machine learning studies have largely utilized supervised algorithms - meaning they require both neuroimaging scan data and corresponding target variables (e.g. healthy vs. diseased) to be successfully 'trained' for a prediction task. Noticeably, this approach may not be optimal or possible when the global structure of the data is not well known and the researcher does not have an a priori model to fit the data. We set out to investigate the utility of an unsupervised machine learning technique; t-distributed stochastic neighbour embedding (t-SNE) in identifying 'unseen' sample population patterns that may exist in high-dimensional neuroimaging data. Multimodal neuroimaging scans from 92 healthy subjects were pre-processed using atlas-based methods, integrated and input into the t-SNE algorithm. Patterns and clusters discovered by the algorithm were visualized using a 2D scatter plot and further analyzed using the K-means clustering algorithm. t-SNE was evaluated against classical principal component analysis. Remarkably, based on unlabelled multimodal scan data, t-SNE separated study subjects into two very distinct clusters which corresponded to subjects' gender labels (cluster silhouette index value=0.79). The resulting clusters were used to develop an unsupervised minimum distance clustering model which identified 93.5% of subjects' gender. Notably, from a neuropsychiatric perspective this method may allow discovery of data-driven disease phenotypes or sub-types of treatment responders. Copyright © 2014 Elsevier B.V. All rights reserved.
Visualizing High-Dimensional Structures by Dimension Ordering and Filtering using Subspace Analysis
Ferdosi, Bilkis J.; Roerdink, Jos B. T. M.
2011-01-01
High-dimensional data visualization is receiving increasing interest because of the growing abundance of high-dimensional datasets. To understand such datasets, visualization of the structures present in the data, such as clusters, can be an invaluable tool. Structures may be present in the full
Engineering two-photon high-dimensional states through quantum interference
CSIR Research Space (South Africa)
Zhang, YI
2016-02-01
Full Text Available problems. An alternative strategy is to consider a lesser number of particles that exist in high-dimensional states. The spatial modes of light are one such candidate that provides access to high-dimensional quantum states, and thus they increase...
Huang, Yen-Tsung; Pan, Wen-Chi
2016-06-01
Causal mediation modeling has become a popular approach for studying the effect of an exposure on an outcome through a mediator. However, current methods are not applicable to the setting with a large number of mediators. We propose a testing procedure for mediation effects of high-dimensional continuous mediators. We characterize the marginal mediation effect, the multivariate component-wise mediation effects, and the L2 norm of the component-wise effects, and develop a Monte-Carlo procedure for evaluating their statistical significance. To accommodate the setting with a large number of mediators and a small sample size, we further propose a transformation model using the spectral decomposition. Under the transformation model, mediation effects can be estimated using a series of regression models with a univariate transformed mediator, and examined by our proposed testing procedure. Extensive simulation studies are conducted to assess the performance of our methods for continuous and dichotomous outcomes. We apply the methods to analyze genomic data investigating the effect of microRNA miR-223 on a dichotomous survival status of patients with glioblastoma multiforme (GBM). We identify nine gene ontology sets with expression values that significantly mediate the effect of miR-223 on GBM survival. © 2015, The International Biometric Society.
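The Monte-Carlo testing idea can be sketched for a single continuous mediator; the paper's spectral transformation for high-dimensional mediators is not reproduced here, and the coefficients and sample size are toy assumptions:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 500

# Single-mediator toy version of the pathway X -> M -> Y.
X = rng.standard_normal(n)
M = 0.5 * X + rng.standard_normal(n)    # alpha = 0.5
Y = 0.4 * M + rng.standard_normal(n)    # beta = 0.4 (no direct effect)

# Estimate alpha (X -> M) and beta (M -> Y, adjusting for X).
a1, a0 = np.polyfit(X, M, 1)
Z = np.column_stack([M, X, np.ones(n)])
coef, *_ = np.linalg.lstsq(Z, Y, rcond=None)
b1 = coef[0]
est = a1 * b1                           # indirect (mediation) effect estimate

# Standard errors of the two regression coefficients.
se_a = np.sqrt((M - (a1 * X + a0)).var() / (n * X.var()))
se_b = np.sqrt((Y - Z @ coef).var() * np.linalg.inv(Z.T @ Z)[0, 0])

# Monte-Carlo test of the product: sample from the coefficients'
# approximate normal distributions and read off a 95% interval.
draws = rng.normal(a1, se_a, 5000) * rng.normal(b1, se_b, 5000)
lo, hi = np.percentile(draws, [2.5, 97.5])
print(round(est, 3), "95% CI:", (round(lo, 3), round(hi, 3)))
```

Simulating the product of the two coefficient estimates sidesteps the non-normality of the indirect effect's sampling distribution, which is the motivation for Monte-Carlo evaluation of significance in the paper's setting.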
Awate, Suyash P; Yushkevich, Paul; Song, Zhuang; Licht, Daniel; Gee, James C
2009-01-01
The paper presents a novel statistical framework for cortical folding pattern analysis that relies on a rich multivariate descriptor of folding patterns in a region of interest (ROI). The ROI-based approach avoids problems faced by spatial-normalization-based approaches stemming from the severe deficiency of homologous features between typical human cerebral cortices. Unlike typical ROI-based methods that summarize folding complexity or shape by a single number, the proposed descriptor unifies complexity and shape of the surface in a high-dimensional space. In this way, the proposed framework couples the reliability of ROI-based analysis with the richness of the novel cortical folding pattern descriptor. Furthermore, the descriptor can easily incorporate additional variables, e.g. cortical thickness. The paper proposes a novel application of a nonparametric permutation-based approach for statistical hypothesis testing for any multivariate high-dimensional descriptor. While the proposed framework has a rigorous theoretical underpinning, it is straightforward to implement. The framework is validated via simulated and clinical data. The paper is the first to quantitatively evaluate cortical folding in neonates with complex congenital heart disease.
Complex Environmental Data Modelling Using Adaptive General Regression Neural Networks
Kanevski, Mikhail
2015-04-01
The research deals with an adaptation and application of Adaptive General Regression Neural Networks (GRNN) to high dimensional environmental data. GRNN [1,2,3] are efficient modelling tools both for spatial and temporal data and are based on nonparametric kernel methods closely related to the classical Nadaraya-Watson estimator. Adaptive GRNN, using anisotropic kernels, can also be applied for feature selection tasks when working with high dimensional data [1,3]. In the present research Adaptive GRNN are used to study geospatial data predictability and relevant feature selection using both simulated and real data case studies. The original raw data were either three dimensional monthly precipitation data or monthly wind speeds embedded into a 13 dimensional space constructed by geographical coordinates and geo-features calculated from a digital elevation model. GRNN were applied in two different ways: 1) adaptive GRNN with the resulting list of features ordered according to their relevancy; and 2) adaptive GRNN applied to evaluate all N possible models [in the case of wind fields, N = 2^13 - 1 = 8191] and rank them according to the cross-validation error. In both cases training was carried out using a leave-one-out procedure. An important result of the study is that the set of the most relevant features depends on the month (strong seasonal effect) and year. The predictabilities of precipitation and wind field patterns, estimated using the cross-validation and testing errors of raw and shuffled data, were studied in detail. The results of both approaches were qualitatively and quantitatively compared. In conclusion, Adaptive GRNN, with their ability to select features and to model complex high dimensional data efficiently, can be widely used in automatic/on-line mapping and as an integrated part of environmental decision support systems. 1. Kanevski M., Pozdnoukhov A., Timonin V. Machine Learning for Spatial Environmental Data. Theory, applications and software. EPFL Press
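The anisotropic-kernel mechanism behind adaptive GRNN can be sketched as Nadaraya-Watson regression with per-feature bandwidths; the data and bandwidth values below are toy assumptions, with an inflated bandwidth playing the role of "switching off" irrelevant features:

```python
import numpy as np

def grnn_predict(Xtr, ytr, Xte, sigmas):
    # GRNN is Nadaraya-Watson kernel regression; per-feature bandwidths
    # give an anisotropic kernel. A very large sigma_j effectively
    # switches feature j off, which is how feature relevance is encoded.
    inv2s2 = 1.0 / (2.0 * np.asarray(sigmas, dtype=float) ** 2)
    preds = np.empty(len(Xte))
    for i, x in enumerate(Xte):
        d2 = ((Xtr - x) ** 2 * inv2s2).sum(axis=1)  # anisotropic squared distance
        w = np.exp(-d2)
        preds[i] = w @ ytr / w.sum()
    return preds

rng = np.random.default_rng(5)
X = rng.uniform(-1, 1, size=(500, 5))
y = np.sin(3 * X[:, 0]) + 0.3 * rng.standard_normal(500)  # only feature 0 matters
Xte = rng.uniform(-1, 1, size=(100, 5))
yte = np.sin(3 * Xte[:, 0])

mse_iso = np.mean((grnn_predict(X, y, Xte, [0.2] * 5) - yte) ** 2)
mse_aniso = np.mean((grnn_predict(X, y, Xte, [0.2, 1e6, 1e6, 1e6, 1e6]) - yte) ** 2)
print(round(mse_iso, 4), round(mse_aniso, 4))
```

Down-weighting the four irrelevant features recovers the one-dimensional smoothing problem and sharply reduces the test error, which is the feature-selection behaviour the study exploits for the 13-dimensional wind data.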
Yiheng Tu; Ao Tan; Zening Fu; Yeung Sam Hung; Li Hu; Zhiguo Zhang
2015-08-01
Dimension reduction is essential for identifying a small set of discriminative features that are predictive of behavior or cognition from high-dimensional functional magnetic resonance imaging (fMRI) data. However, conventional linear dimension reduction techniques cannot reduce the dimension effectively if the relationship between imaging data and behavioral parameters is nonlinear. In this paper, we propose a novel supervised dimension reduction technique, named PC-SIR (Principal Component - Sliced Inverse Regression), for analyzing high-dimensional fMRI data. The PC-SIR method is an important extension of the renowned SIR method, which can recover the effective dimension reduction (e.d.r.) directions even when the relationship between class labels and predictors is nonlinear, but which is unable to handle high-dimensional data. By using PCA prior to SIR to orthogonalize and reduce the predictors, PC-SIR overcomes this limitation of SIR and thus can be applied to fMRI data. Simulations showed that PC-SIR can result in more accurate identification of brain activation, as well as better prediction, than support vector regression (SVR) and partial least squares regression (PLSR). We then applied PC-SIR to real fMRI data recorded in a pain stimulation experiment to identify pain-related brain regions and predict pain perception. Results on 32 subjects showed that PC-SIR leads to significantly higher prediction accuracy than SVR and PLSR. Therefore, PC-SIR could be a promising dimension reduction technique for multivariate pattern analysis of fMRI data.
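The two-stage idea above can be sketched compactly: PCA orthogonalizes and truncates the predictors, then SIR slices the response, averages the whitened scores within slices, and takes the leading eigenvector of the between-slice covariance. This is a simplified reading of the abstract (slice counts, PC counts, and names are ours):

```python
import numpy as np

def pc_sir(X, y, n_pcs=10, n_slices=8):
    # --- PCA step: orthogonalize and reduce the predictors ---
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    T = U[:, :n_pcs] * s[:n_pcs]              # PC scores (uncorrelated)
    sd = T.std(axis=0)
    Z = T / sd                                # whitened scores
    # --- SIR step: slice the response, average Z within slices ---
    order = np.argsort(y)
    n = len(y)
    M = np.zeros((n_pcs, n_pcs))
    for idx in np.array_split(order, n_slices):
        m = Z[idx].mean(axis=0)
        M += (len(idx) / n) * np.outer(m, m)
    # Leading eigenvector of M spans the estimated e.d.r. direction.
    vals, vecs = np.linalg.eigh(M)
    v = vecs[:, -1]
    beta = Vt[:n_pcs].T @ (v / sd)            # map back to predictor space
    return beta / np.linalg.norm(beta)

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 30))
X[:, :5] *= 3.0                               # signal lives in a high-variance block
true = np.zeros(30); true[0] = 1.0
y = (X @ true) ** 3 + rng.normal(size=500)    # nonlinear monotone link
b = pc_sir(X, y)
```

Note that PCA truncation only helps when the informative direction sits in a high-variance subspace, as arranged in this toy example; real fMRI data would need the number of retained components chosen with care.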
Self-dissimilarity as a High Dimensional Complexity Measure
Wolpert, David H.; Macready, William
2005-01-01
For many systems characterized as "complex" the patterns exhibited on different scales differ markedly from one another. For example the biomass distribution in a human body "looks very different" depending on the scale at which one examines it. Conversely, the patterns at different scales in "simple" systems (e.g., gases, mountains, crystals) vary little from one scale to another. Accordingly, the degrees of self-dissimilarity between the patterns of a system at various scales constitute a complexity "signature" of that system. Here we present a novel quantification of self-dissimilarity. This signature can, if desired, incorporate a novel information-theoretic measure of the distance between probability distributions that we derive here. Whatever distance measure is chosen, our quantification of self-dissimilarity can be measured for many kinds of real-world data. This allows comparisons of the complexity signatures of wholly different kinds of systems (e.g., systems involving information density in a digital computer vs. species densities in a rain-forest vs. capital density in an economy, etc.). Moreover, in contrast to many other suggested complexity measures, evaluating the self-dissimilarity of a system does not require one to already have a model of the system. These facts may allow self-dissimilarity signatures to be used as the underlying observational variables of an eventual overarching theory relating all complex systems. To illustrate self-dissimilarity we present several numerical experiments. In particular, we show that the underlying structure of the logistic map is picked out by the self-dissimilarity signature of time series produced by that map.
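One crude way to make the idea concrete: compare the value distribution of a series at its native scale against the distribution of its block averages, using some distance between distributions. The Jensen-Shannon divergence below is just one admissible distance, and block averaging is a stand-in for the paper's more general multi-scale patterns; everything here is an illustrative assumption, not the authors' construction:

```python
import numpy as np

def js_divergence(p, q):
    # Jensen-Shannon divergence between two discrete distributions.
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    def kl(a, b):
        mask = a > 0
        return np.sum(a[mask] * np.log(a[mask] / b[mask]))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def self_dissimilarity(series, block=4, bins=20):
    """Distance between the histogram of a series and the histogram of its
    non-overlapping block averages: zero when all scales look alike."""
    coarse = series[: len(series) // block * block].reshape(-1, block).mean(axis=1)
    lo, hi = series.min(), series.max()
    h1, _ = np.histogram(series, bins=bins, range=(lo, hi))
    h2, _ = np.histogram(coarse, bins=bins, range=(lo, hi))
    return js_divergence(h1.astype(float), h2.astype(float))

rng = np.random.default_rng(2)
noise = rng.uniform(0, 1, 4000)
flat = np.full(4000, 0.5)
d_noise = self_dissimilarity(noise)   # averaging reshapes the distribution
d_flat = self_dissimilarity(flat)     # identical at every scale
```

Even iid noise shows nonzero self-dissimilarity under this measure (block averages concentrate by the central limit theorem), which is why the paper's signature is a whole profile across scales rather than a single number.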
High-Dimensional Neural Network Potentials for Organic Reactions and an Improved Training Algorithm.
Gastegger, Michael; Marquetand, Philipp
2015-05-12
Artificial neural networks (NNs) represent a relatively recent approach for the prediction of molecular potential energies, suitable for simulations of large molecules and long time scales. By using NNs to fit electronic structure data, it is possible to obtain empirical potentials of high accuracy combined with the computational efficiency of conventional force fields. However, as opposed to the latter, changing bonding patterns and unusual coordination geometries can be described due to the underlying flexible functional form of the NNs. One of the most promising approaches in this field is the high-dimensional neural network (HDNN) method, which is especially adapted to the prediction of molecular properties. While HDNNs have been mostly used to model solid state systems and surface interactions, we present here the first application of the HDNN approach to an organic reaction, the Claisen rearrangement of allyl vinyl ether to 4-pentenal. To construct the corresponding HDNN potential, a new training algorithm is introduced. This algorithm is termed "element-decoupled" global extended Kalman filter (ED-GEKF) and is based on the decoupled Kalman filter. Using a metadynamics trajectory computed with density functional theory as reference data, we show that the ED-GEKF exhibits superior performance - both in terms of accuracy and training speed - compared to other variants of the Kalman filter hitherto employed in HDNN training. In addition, the effect of including forces during ED-GEKF training on the resulting potentials was studied.
Relating high dimensional stochastic complex systems to low-dimensional intermittency
Diaz-Ruelas, Alvaro; Jensen, Henrik Jeldtoft; Piovani, Duccio; Robledo, Alberto
2017-02-01
We evaluate the implication and outlook of an unanticipated simplification in the macroscopic behavior of two high-dimensional stochastic models: the Replicator Model with Mutations and the Tangled Nature Model (TaNa) of evolutionary ecology. This simplification consists of the apparent display of low-dimensional dynamics in the non-stationary intermittent time evolution of the model on a coarse-grained scale. Evolution on this time scale spans generations of individuals, rather than single reproduction, death or mutation events. While a local one-dimensional map close to a tangent bifurcation can be derived from a mean-field version of the TaNa model, a nonlinear dynamical model consisting of successive tangent bifurcations generates time evolution patterns resembling those of the full TaNa model. To advance the interpretation of this finding, here we consider parallel results on a game-theoretic version of the TaNa model that in discrete time yields a coupled map lattice. This in turn is represented, a la Langevin, by a one-dimensional nonlinear map. Among various kinds of behaviours we obtain intermittent evolution associated with tangent bifurcations. We discuss our results.
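The one-dimensional map near a tangent bifurcation that the abstract invokes has a standard normal form, x -> x + eps + x^2, whose iterates linger in a narrow "channel" (laminar phases) before bursting away; the mean laminar length grows like 1/sqrt(eps) as the bifurcation is approached. A minimal illustration (the reinjection rule and thresholds are our choices, not the paper's):

```python
import numpy as np

def laminar_lengths(eps, n_steps=200000, exit_level=1.0, seed=2):
    """Iterate x -> x + eps + x**2 (type-I intermittency normal form near a
    tangent bifurcation) with random reinjection after each escape; return
    the lengths of the laminar (slow, near-tangency) episodes."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-0.5, 0.0)
    lengths, count = [], 0
    for _ in range(n_steps):
        x = x + eps + x * x
        count += 1
        if x > exit_level:                 # burst: the orbit escapes the channel
            lengths.append(count)
            count = 0
            x = rng.uniform(-0.5, 0.0)     # reinjection behind the channel
    return np.array(lengths)

short = laminar_lengths(1e-2).mean()       # far from tangency: short episodes
long_ = laminar_lengths(1e-4).mean()       # near tangency: much longer episodes
```

The order-of-magnitude growth of the laminar episodes as eps shrinks is the signature of type-I intermittency that the coarse-grained TaNa dynamics is said to resemble.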
Multitask Quantile Regression under the Transnormal Model.
Fan, Jianqing; Xue, Lingzhou; Zou, Hui
2016-01-01
We consider estimating multi-task quantile regression under the transnormal model, with a focus on the high-dimensional setting. We derive a surprisingly simple closed-form solution through rank-based covariance regularization. In particular, we propose rank-based ℓ1 penalization with positive definite constraints for estimating sparse covariance matrices, and rank-based banded Cholesky decomposition regularization for estimating banded precision matrices. By taking advantage of the alternating direction method of multipliers, a nearest correlation matrix projection is introduced that inherits the sampling properties of the unprojected estimator. Our work combines the strengths of quantile regression and rank-based covariance regularization to simultaneously deal with nonlinearity and nonnormality in high-dimensional regression. Furthermore, the proposed method strikes a good balance between robustness and efficiency, achieves an "oracle"-like convergence rate, and provides provable prediction intervals under the high-dimensional setting. The finite-sample performance of the proposed method is also examined. The performance of our proposed rank-based method is demonstrated in a real application analyzing protein mass spectroscopy data.
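The rank-based ingredient above rests on a classical fact about transnormal (Gaussian copula) models: the latent normal correlation can be recovered from Kendall's tau via sin(pi/2 * tau), which is invariant to monotone marginal transforms, unlike the Pearson correlation. A small numpy-only sketch of that building block (the paper's full penalized estimators are not reproduced here):

```python
import numpy as np

def latent_corr(x, y):
    """Rank-based estimate of the latent correlation under a transnormal
    (Gaussian copula) model: sin(pi/2 * Kendall's tau)."""
    dx = np.sign(x[:, None] - x[None, :])
    dy = np.sign(y[:, None] - y[None, :])
    n = len(x)
    tau = (dx * dy).sum() / (n * (n - 1))    # Kendall's tau over ordered pairs
    return np.sin(np.pi / 2 * tau)

rng = np.random.default_rng(3)
rho = 0.6
z = rng.multivariate_normal([0, 0], [[1, rho], [rho, 1]], size=1500)
x = np.exp(z[:, 0])            # monotone transforms distort Pearson correlation,
y = z[:, 1] ** 3               # but leave the ranks, hence tau, untouched
est = latent_corr(x, y)
pearson = np.corrcoef(x, y)[0, 1]
```

Applying this transform entrywise to a tau matrix gives the raw rank-based covariance input that the paper's ℓ1 and Cholesky regularizers then clean up.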
Nakamura, Ryota; Suhrcke, Marc; Jebb, Susan A; Pechey, Rachel; Almiron-Roig, Eva; Marteau, Theresa M
2015-04-01
There is a growing concern, but limited evidence, that price promotions contribute to a poor diet and the social patterning of diet-related disease. We examined the following questions: 1) Are less-healthy foods more likely to be promoted than healthier foods? 2) Are consumers more responsive to promotions on less-healthy products? 3) Are there socioeconomic differences in food purchases in response to price promotions? With the use of hierarchical regression, we analyzed data on purchases of 11,323 products within 135 food and beverage categories from 26,986 households in Great Britain during 2010. Major supermarkets operated the same price promotions in all branches. The number of stores that offered price promotions on each product for each week was used to measure the frequency of price promotions. We assessed the healthiness of each product by using a nutrient profiling (NP) model. A total of 6788 products (60%) were in healthier categories and 4535 products (40%) were in less-healthy categories. There was no significant gap in the frequency of promotion by the healthiness of products either within or between categories. However, after we controlled for the reference price, price discount rate, and brand-specific effects, the sales uplift arising from price promotions was larger in less-healthy than in healthier categories; a 1-SD point increase in the category mean NP score, implying the category becomes less healthy, was associated with an additional 7.7-percentage point increase in sales (from 27.3% to 35.0%; P promotions was larger for higher-socioeconomic status (SES) groups than for lower ones (34.6% for the high-SES group, 28.1% for the middle-SES group, and 23.1% for the low-SES group). Finally, there was no significant SES gap in the absolute volume of purchases of less-healthy foods made on promotion. Attempts to limit promotions on less-healthy foods could improve the population diet but would be unlikely to reduce health inequalities arising from
Software Tools for Robust Analysis of High-Dimensional Data
Directory of Open Access Journals (Sweden)
Valentin Todorov
2014-06-01
Full Text Available The present work discusses robust multivariate methods specifically designed for high dimensions. Their implementation in R is presented and their application is illustrated on examples. The first group are algorithms for outlier detection, already introduced elsewhere and implemented in other packages. The value added of the new package is that all methods follow the same design pattern and thus can use the same graphical and diagnostic tools. The next topic covered is sparse principal components, including an object-oriented interface to the standard method proposed by Zou, Hastie, and Tibshirani (2006) and the robust one proposed by Croux, Filzmoser, and Fritz (2013). Robust partial least squares (see Hubert and Vanden Branden 2003) as well as partial least squares for discriminant analysis conclude the scope of the new package.
Mapping the human DC lineage through the integration of high-dimensional techniques
See, Peter; Dutertre, Charles-Antoine; Chen, Jinmiao; Günther, Patrick; McGovern, Naomi; Irac, Sergio Erdal; Gunawan, Merry; Beyer, Marc; Händler, Kristian; Duan, Kaibo; Sumatoh, Hermi Rizal Bin; Ruffin, Nicolas; Jouve, Mabel; Gea-Mallorquí, Ester; Hennekam, Raoul C. M.; Lim, Tony; Yip, Chan Chung; Wen, Ming; Malleret, Benoit; Low, Ivy; Shadan, Nurhidaya Binte; Fen, Charlene Foong Shu; Tay, Alicia; Lum, Josephine; Zolezzi, Francesca; Larbi, Anis; Poidinger, Michael; Chan, Jerry K. Y.; Chen, Qingfeng; Rénia, Laurent; Haniffa, Muzlifah; Benaroch, Philippe; Schlitzer, Andreas; Schultze, Joachim L.; Newell, Evan W.; Ginhoux, Florent
2017-01-01
Dendritic cells (DC) are professional antigen-presenting cells that orchestrate immune responses. The human DC population comprises two main functionally specialized lineages, whose origins and differentiation pathways remain incompletely defined. Here, we combine two high-dimensional
Approximating high-dimensional dynamics by barycentric coordinates with linear programming.
Hirata, Yoshito; Shiro, Masanori; Takahashi, Nozomu; Aihara, Kazuyuki; Suzuki, Hideyuki; Mas, Paloma
2015-01-01
The increasing development of novel methods and techniques facilitates the measurement of high-dimensional time series but challenges our ability to model and predict them accurately. A general mathematical model requires many parameters, which are difficult to fit from the relatively short high-dimensional time series observed. Here, we propose a novel method to accurately model a high-dimensional time series. Our method extends barycentric coordinates to high-dimensional phase space by employing linear programming, allowing for the approximation errors explicitly. The extension helps to produce free-running time-series predictions that preserve typical topological, dynamical, and/or geometric characteristics of the underlying attractors more accurately than the widely used radial basis function model. The method can be broadly applied, from helping to improve weather forecasting, to creating electronic instruments that sound more natural, and to comprehensively understanding complex biological data.
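The core step can be phrased as a small linear program: find nonnegative weights summing to one that reconstruct the query state from library states, with explicit error slacks, then predict by applying the same weights to the library states' successors. The sketch below restricts the library to nearest neighbors and uses a logistic-map delay embedding as a toy system; the exact LP formulation in the paper may differ:

```python
import numpy as np
from scipy.optimize import linprog

def barycentric_weights(library, query):
    """Nonnegative weights w with sum(w) = 1 minimizing the L1 error
    |library.T @ w - query|, found by linear programming."""
    n, d = library.shape
    c = np.concatenate([np.zeros(n), np.ones(d)])     # minimize error slacks e
    A_ub = np.block([[library.T, -np.eye(d)],
                     [-library.T, -np.eye(d)]])       # |L.T w - q| <= e
    b_ub = np.concatenate([query, -query])
    A_eq = np.concatenate([np.ones(n), np.zeros(d)])[None, :]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0], method="highs")
    return res.x[:n]                                  # default bounds give w, e >= 0

# Delay-embedded logistic-map series as a stand-in dynamical system.
x = np.empty(300); x[0] = 0.2
for t in range(299):
    x[t + 1] = 4.0 * x[t] * (1.0 - x[t])
states = np.column_stack([x[:-2], x[1:-1]])           # (x_t, x_{t+1})
succ = x[2:]                                          # target: x_{t+2}
query = states[260]
near = np.argsort(((states[:250] - query) ** 2).sum(axis=1))[:20]
w = barycentric_weights(states[near], query)
pred = w @ succ[near]                                 # weighted successors
```

Because the weights are convex, the free-running forecast stays inside the convex hull of observed states, which is one reason such predictions tend to respect the geometry of the attractor.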
Experimental ladder proof of Hardy's nonlocality for high-dimensional quantum systems
Chen, Lixiang; Zhang, Wuhong; Wu, Ziwen; Wang, Jikang; Fickler, Robert; Karimi, Ebrahim
2017-08-01
Recent years have witnessed a rapidly growing interest in high-dimensional quantum entanglement for fundamental studies as well as towards novel applications. Therefore, the ability to verify entanglement between physical qudits, d -dimensional quantum systems, is of crucial importance. To show nonclassicality, Hardy's paradox represents "the best version of Bell's theorem" without using inequalities. However, so far it has only been tested experimentally for bidimensional vector spaces. Here, we formulate a theoretical framework to demonstrate the ladder proof of Hardy's paradox for arbitrary high-dimensional systems. Furthermore, we experimentally demonstrate the ladder proof by taking advantage of the orbital angular momentum of high-dimensionally entangled photon pairs. We perform the ladder proof of Hardy's paradox for dimensions 3 and 4, both with the ladder up to the third step. Our paper paves the way towards a deeper understanding of the nature of high-dimensionally entangled quantum states and may find applications in quantum information science.
Characterization of high-dimensional entangled systems via mutually unbiased measurements
CSIR Research Space (South Africa)
Giovannini, D
2013-04-01
Full Text Available Mutually unbiased bases (MUBs) play a key role in many protocols in quantum science, such as quantum key distribution. However, defining MUBs for arbitrary high-dimensional systems is theoretically difficult, and measurements in such bases can...
Mitigating the Insider Threat Using High-Dimensional Search and Modeling
National Research Council Canada - National Science Library
Van Den Berg, Eric; Uphadyaya, Shambhu; Ngo, Phi H; Muthukrishnan, Muthu; Palan, Rajago
2006-01-01
In this project a system was built aimed at mitigating insider attacks centered around a high-dimensional search engine for correlating the large number of monitoring streams necessary for detecting insider attacks...
Projection Bank: From High-dimensional Data to Medium-length Binary Codes
Liu, Li; Yu, Mengyang; Shao, Ling
2015-01-01
Recently, very high-dimensional feature representations, e.g., Fisher Vector, have achieved excellent performance for visual recognition and retrieval. However, these lengthy representations always cause extremely heavy computational and storage costs and even become unfeasible in some large-scale applications. A few existing techniques can transfer very high-dimensional data into binary codes, but they still require the reduced code length to be relatively long to maintain acceptable accurac...
Jiang, Tiefeng; Yang, Fan
2013-01-01
For random samples of size $n$ obtained from $p$-variate normal distributions, we consider the classical likelihood ratio tests (LRT) for their means and covariance matrices in the high-dimensional setting. These test statistics have been extensively studied in multivariate analysis, and their limiting distributions under the null hypothesis were proved to be chi-square distributions as $n$ goes to infinity and $p$ remains fixed. In this paper, we consider the high-dimensional case where both...
Zhu, Yinchu
2017-01-01
Economic modeling in a data-rich environment is often challenging. To allow for enough flexibility and to model heterogeneity, models might have parameters with dimensionality growing with (or even much larger than) the sample size of the data. Learning these high-dimensional parameters requires new methodologies and theories. We consider three important high-dimensional models and propose novel methods for estimation and inference. Empirical applications in economics and finance are also stu...
The feature selection bias problem in relation to high-dimensional gene data.
Krawczuk, Jerzy; Łukaszuk, Tomasz
2016-01-01
Feature selection is a technique widely used in data mining. The aim is to select the best subset of features relevant to the problem being considered. In this paper, we consider feature selection for the classification of gene datasets. Gene data is usually composed of just a few dozen objects described by thousands of features. For this kind of data, it is easy to find a model that fits the learning data. However, it is not easy to find one that will also evaluate new data as well as it does the learning data. This overfitting issue is well known as regards classification and regression, but it also applies to feature selection. We address this problem and investigate its importance in an empirical study of four feature selection methods applied to seven high-dimensional gene datasets. We chose datasets that are well studied in the literature: colon cancer, leukemia and breast cancer. All the datasets are characterized by a significant number of features and the presence of exactly two decision classes. The feature selection methods used are ReliefF, minimum redundancy maximum relevance, support vector machine-recursive feature elimination and relaxed linear separability. Our main result reveals the existence of positive feature selection bias in all 28 experiments (7 datasets and 4 feature selection methods). Bias was calculated as the difference between validation and test accuracies and ranges from 2.6% to as much as 41.67%. The validation accuracy (biased accuracy) was calculated on the same dataset on which the feature selection was performed. The test accuracy was calculated for data that was not used for feature selection (by so-called external cross-validation). This work provides evidence that using the same dataset for feature selection and learning is not appropriate. We recommend using cross-validation for feature selection in order to reduce selection bias. Copyright © 2015 Elsevier B.V. All rights reserved.
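The selection bias described above is easy to reproduce in simulation: with many features and few samples, selecting features on the full dataset and then cross-validating yields optimistic accuracy even when labels are pure noise, whereas re-selecting features inside each fold (external cross-validation) does not. A self-contained sketch with a simple nearest-centroid classifier (our choice, not the paper's methods):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, k = 60, 2000, 10
X = rng.normal(size=(n, p))
y = np.repeat([0, 1], n // 2)            # labels independent of X: no real signal

def top_features(Xtr, ytr, k):
    # Rank features by absolute difference of class means (a simple filter).
    diff = np.abs(Xtr[ytr == 0].mean(axis=0) - Xtr[ytr == 1].mean(axis=0))
    return np.argsort(diff)[-k:]

def centroid_loo(X, y, select_inside):
    correct = 0
    for i in range(len(y)):              # leave-one-out cross-validation
        tr = np.arange(len(y)) != i
        feats = top_features(X[tr], y[tr], k) if select_inside else FEATS
        Xtr, xte = X[tr][:, feats], X[i, feats]
        c0 = Xtr[y[tr] == 0].mean(axis=0)
        c1 = Xtr[y[tr] == 1].mean(axis=0)
        pred = int(np.sum((xte - c1) ** 2) < np.sum((xte - c0) ** 2))
        correct += (pred == y[i])
    return correct / len(y)

FEATS = top_features(X, y, k)            # selection on ALL data: leaks the test point
biased = centroid_loo(X, y, select_inside=False)
unbiased = centroid_loo(X, y, select_inside=True)
```

On pure-noise labels, `biased` lands far above chance while `unbiased` hovers around 0.5, mirroring the 2.6% to 41.67% gaps the paper reports on real gene data.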
Murphy, Thomas Brendan; Dean, Nema; Raftery, Adrian E.
2010-01-01
Food authenticity studies are concerned with determining if food samples have been correctly labelled or not. Discriminant analysis methods are an integral part of the methodology for food authentication. Motivated by food authenticity applications, a model-based discriminant analysis method that includes variable selection is presented. The discriminant analysis model is fitted in a semi-supervised manner using both labeled and unlabeled data. The method is shown to give excellent classification performance on several high-dimensional multiclass food authenticity datasets with more variables than observations. The variables selected by the proposed method provide information about which variables are meaningful for classification purposes. A headlong search strategy for variable selection is shown to be efficient in terms of computation and achieves excellent classification performance. In applications to several food authenticity datasets, our proposed method outperformed default implementations of Random Forests, AdaBoost, transductive SVMs and Bayesian Multinomial Regression by substantial margins. PMID:20936055
Sun, Hokeun; Wang, Shuang
2013-05-30
Matched case-control designs are commonly used to control for potential confounding factors in genetic epidemiology studies, especially in epigenetic studies with DNA methylation. Compared with unmatched case-control studies with high-dimensional genomic or epigenetic data, there have been few variable selection methods for matched sets. In an earlier paper, we proposed a penalized logistic regression model for the analysis of unmatched DNA methylation data using a network-based penalty. However, for the matched designs popularly applied in epigenetic studies, which compare DNA methylation between tumor and adjacent non-tumor tissues or between pre-treatment and post-treatment conditions, applying ordinary logistic regression while ignoring the matching is known to introduce serious estimation bias. In this paper, we developed a penalized conditional logistic model using the network-based penalty that encourages a grouping effect of (1) linked Cytosine-phosphate-Guanine (CpG) sites within a gene or (2) linked genes within a genetic pathway for the analysis of matched DNA methylation data. In our simulation studies, we demonstrated the superiority of the conditional logistic model over the unconditional logistic model in high-dimensional variable selection problems for matched case-control data. We further investigated the benefits of utilizing biological group or graph information for matched case-control data. We applied the proposed method to a genome-wide DNA methylation study on hepatocellular carcinoma (HCC), in which we investigated the DNA methylation levels of tumor and adjacent non-tumor tissues from HCC patients using the Illumina Infinium HumanMethylation27 BeadChip. Several new CpG sites and genes known to be related to HCC were identified but had been missed by the standard method in the original paper. Copyright © 2012 John Wiley & Sons, Ltd.
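The conditional-likelihood machinery behind the model above is simplest in the 1:1 matched case: the pair-level strata cancel, and the conditional likelihood reduces to an intercept-free logistic model on within-pair covariate differences. A small unpenalized sketch (the paper adds the network-based penalty on top of this; fitting by plain gradient ascent is our simplification):

```python
import numpy as np

def fit_conditional_logistic(Xcase, Xctrl, iters=300, lr=0.5):
    """1:1 matched case-control: the conditional likelihood is a logistic
    model, without intercept, on within-pair differences d_i = x_case - x_ctrl."""
    D = Xcase - Xctrl
    beta = np.zeros(D.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(D @ beta)))
        beta += lr * D.T @ (1.0 - p) / len(D)  # gradient of conditional log-likelihood
    return beta

rng = np.random.default_rng(5)
n, d = 2000, 3
true = np.array([1.0, -0.5, 0.0])
strata = rng.normal(size=(n, 1))               # pair-level confounder (cancels out)
Xa = rng.normal(size=(n, d)) + strata
Xb = rng.normal(size=(n, d)) + strata
# Within a pair, member a becomes the case with odds exp(true . (xa - xb)).
p_a = 1.0 / (1.0 + np.exp(-((Xa - Xb) @ true)))
a_is_case = rng.random(n) < p_a
Xcase = np.where(a_is_case[:, None], Xa, Xb)
Xctrl = np.where(a_is_case[:, None], Xb, Xa)
est = fit_conditional_logistic(Xcase, Xctrl)
```

Note that an ordinary (unconditional) logistic fit on the pooled data would have to model the strata explicitly; ignoring them is the source of the bias the abstract warns about.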
Forecasting with Dynamic Regression Models
Pankratz, Alan
2012-01-01
One of the most widely used tools in statistical forecasting, single equation regression models is examined here. A companion to the author's earlier work, Forecasting with Univariate Box-Jenkins Models: Concepts and Cases, the present text pulls together recent time series ideas and gives special attention to possible intertemporal patterns, distributed lag responses of output to input series and the auto correlation patterns of regression disturbance. It also includes six case studies.
Energy Technology Data Exchange (ETDEWEB)
Tripathy, Rohit, E-mail: rtripath@purdue.edu; Bilionis, Ilias, E-mail: ibilion@purdue.edu; Gonzalez, Marcial, E-mail: marcial-gonzalez@purdue.edu
2016-09-15
Uncertainty quantification (UQ) tasks, such as model calibration, uncertainty propagation, and optimization under uncertainty, typically require several thousand evaluations of the underlying computer codes. To cope with the cost of simulations, one replaces the real response surface with a cheap surrogate based, e.g., on polynomial chaos expansions, neural networks, support vector machines, or Gaussian processes (GP). However, the number of simulations required to learn a generic multivariate response grows exponentially as the input dimension increases. This curse of dimensionality can only be addressed if the response exhibits some special structure that can be discovered and exploited. A wide range of physical responses exhibit a special structure known as an active subspace (AS). An AS is a linear manifold of the stochastic space characterized by maximal response variation. The idea is that one should first identify this low-dimensional manifold, project the high-dimensional input onto it, and then link the projection to the output. If the dimensionality of the AS is low enough, then learning the link function is a much easier problem than the original problem of learning a high-dimensional function. The classic approach to discovering the AS requires gradient information, a fact that severely limits its applicability. Furthermore, and partly because of its reliance on gradients, it is not able to handle noisy observations. The latter is an essential trait if one wants to be able to propagate uncertainty through stochastic simulators, e.g., through molecular dynamics codes. In this work, we develop a probabilistic version of AS which is gradient-free and robust to observational noise. Our approach relies on a novel Gaussian process regression with built-in dimensionality reduction. In particular, the AS is represented as an orthogonal projection matrix that serves as yet another covariance function hyper-parameter to be estimated from the data. To train the
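For context, the classic gradient-based active subspace that the abstract says it improves upon is the span of the leading eigenvectors of C = E[grad f grad f^T], estimated from sampled gradients. A minimal sketch of that classic construction only (the paper's gradient-free GP version is not reproduced here):

```python
import numpy as np

def active_subspace(grads, k=1):
    """Classic gradient-based active subspace: leading eigenvectors of the
    average outer product of sampled gradients."""
    C = grads.T @ grads / len(grads)
    vals, vecs = np.linalg.eigh(C)             # eigh returns ascending order
    return vecs[:, ::-1][:, :k], vals[::-1]    # reorder to descending

# f(x) = sin(w . x): a 10-D response that varies along a single direction.
rng = np.random.default_rng(6)
w = np.array([3.0, 1.0] + [0.0] * 8)
w /= np.linalg.norm(w)
X = rng.normal(size=(500, 10))
grads = np.cos(X @ w)[:, None] * w[None, :]    # analytic gradient of sin(w . x)
W, vals = active_subspace(grads, k=1)
```

Here every sampled gradient is parallel to `w`, so C is rank one and the recovered subspace matches the true direction up to sign; the steep eigenvalue drop after the first component is the usual diagnostic for choosing the AS dimension.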
Engineering two-photon high-dimensional states through quantum interference.
Zhang, Yingwen; Roux, Filippus S; Konrad, Thomas; Agnew, Megan; Leach, Jonathan; Forbes, Andrew
2016-02-01
Many protocols in quantum science, for example, linear optical quantum computing, require access to large-scale entangled quantum states. Such systems can be realized through many-particle qubits, but this approach often suffers from scalability problems. An alternative strategy is to consider a lesser number of particles that exist in high-dimensional states. The spatial modes of light are one such candidate that provides access to high-dimensional quantum states, and thus they increase the storage and processing potential of quantum information systems. We demonstrate the controlled engineering of two-photon high-dimensional states entangled in their orbital angular momentum through Hong-Ou-Mandel interference. We prepare a large range of high-dimensional entangled states and implement precise quantum state filtering. We characterize the full quantum state before and after the filter, and are thus able to determine that only the antisymmetric component of the initial state remains. This work paves the way for high-dimensional processing and communication of multiphoton quantum states, for example, in teleportation beyond qubits.
A Comparison of Methods for Estimating the Determinant of High-Dimensional Covariance Matrix.
Hu, Zongliang; Dong, Kai; Dai, Wenlin; Tong, Tiejun
2017-09-21
The determinant of the covariance matrix for high-dimensional data plays an important role in statistical inference and decision. It has many real applications including statistical tests and information theory. Due to the statistical and computational challenges with high dimensionality, little work has been proposed in the literature for estimating the determinant of high-dimensional covariance matrix. In this paper, we estimate the determinant of the covariance matrix using some recent proposals for estimating high-dimensional covariance matrix. Specifically, we consider a total of eight covariance matrix estimation methods for comparison. Through extensive simulation studies, we explore and summarize some interesting comparison results among all compared methods. We also provide practical guidelines based on the sample size, the dimension, and the correlation of the data set for estimating the determinant of high-dimensional covariance matrix. Finally, from a perspective of the loss function, the comparison study in this paper may also serve as a proxy to assess the performance of the covariance matrix estimation.
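The basic difficulty the abstract points to is easy to see: when p >= n the sample covariance is singular, so its log-determinant is minus infinity and regularized estimators are needed. One of the simplest fixes is linear shrinkage toward a scaled identity; the sketch below uses a fixed shrinkage weight (the methods compared in the paper choose it in data-driven ways):

```python
import numpy as np

def shrinkage_logdet(X, alpha=0.2):
    """Log-determinant of the linearly shrunk covariance estimate
    S_a = (1 - alpha) * S + alpha * (tr(S) / p) * I. For p >= n the sample
    covariance S is singular, while S_a stays positive definite."""
    p = X.shape[1]
    S = np.cov(X, rowvar=False)
    S_a = (1.0 - alpha) * S + alpha * (np.trace(S) / p) * np.eye(p)
    sign, logdet = np.linalg.slogdet(S_a)      # stable: avoids under/overflow
    return logdet if sign > 0 else -np.inf

rng = np.random.default_rng(7)
X = rng.normal(size=(50, 100))     # p = 100 > n = 50: sample covariance singular
sample_rank = np.linalg.matrix_rank(np.cov(X, rowvar=False))
ld = shrinkage_logdet(X)
```

Using `slogdet` (or a Cholesky factorization) rather than `det` matters in high dimensions, where the determinant itself routinely under- or overflows floating point.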
A Hybrid Semi-Supervised Anomaly Detection Model for High-Dimensional Data
Directory of Open Access Journals (Sweden)
Hongchao Song
2017-01-01
Full Text Available Anomaly detection, which aims to identify observations that deviate from a nominal sample, is a challenging task for high-dimensional data. Traditional distance-based anomaly detection methods compute the neighborhood distance for each observation and suffer from the curse of dimensionality in high-dimensional space; for example, the distances between any pair of samples become similar and each sample may perform like an outlier. In this paper, we propose a hybrid semi-supervised anomaly detection model for high-dimensional data that consists of two parts: a deep autoencoder (DAE) and an ensemble k-nearest neighbor graphs (K-NNG)-based anomaly detector. Benefiting from its ability for nonlinear mapping, the DAE is first trained to learn the intrinsic features of a high-dimensional dataset in order to represent the high-dimensional data in a more compact subspace. Several nonparametric KNN-based anomaly detectors are then built from different subsets that are randomly sampled from the whole dataset. The final prediction is made by all the anomaly detectors. The performance of the proposed method is evaluated on several real-life datasets, and the results confirm that the proposed hybrid model improves the detection accuracy and reduces the computational complexity.
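The two-stage pipeline above (compress, then score by neighbor distances over an ensemble of random subsets) can be sketched compactly. To keep the sketch dependency-free, PCA stands in for the deep autoencoder as the compression step; the subset ensemble and KNN scoring follow the abstract's outline, with all parameter choices ours:

```python
import numpy as np

def knn_scores(train, queries, k=5):
    # Anomaly score: mean distance to the k nearest training points.
    d = np.sqrt(((queries[:, None, :] - train[None, :, :]) ** 2).sum(axis=-1))
    return np.sort(d, axis=1)[:, :k].mean(axis=1)

def ensemble_knn(train, queries, n_detectors=5, frac=0.5, seed=0):
    # Ensemble of KNN detectors on random subsets, scores averaged
    # (a simplified reading of the abstract's K-NNG ensemble).
    rng = np.random.default_rng(seed)
    m = int(frac * len(train))
    scores = [knn_scores(train[rng.choice(len(train), m, replace=False)], queries)
              for _ in range(n_detectors)]
    return np.mean(scores, axis=0)

rng = np.random.default_rng(8)
# 3 informative dimensions plus 97 low-variance noise dimensions.
normal = np.hstack([rng.normal(0, 1, (300, 3)), rng.normal(0, 0.1, (300, 97))])
anom = np.hstack([rng.normal(5, 1, (20, 3)), rng.normal(0, 0.1, (20, 97))])
# PCA stands in for the deep autoencoder as the compression step.
mu = normal.mean(axis=0)
_, _, Vt = np.linalg.svd(normal - mu, full_matrices=False)
compress = lambda Z: (Z - mu) @ Vt[:3].T
s_norm = ensemble_knn(compress(normal), compress(normal[:100]))
s_anom = ensemble_knn(compress(normal), compress(anom))
```

Working in the compact subspace is what keeps the neighbor distances discriminative; the same KNN scores computed in the raw 100-dimensional space would be dominated by the noise dimensions.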
Clifford support vector machines for classification, regression, and recurrence.
Bayro-Corrochano, Eduardo Jose; Arana-Daniel, Nancy
2010-11-01
This paper introduces the Clifford support vector machines (CSVM) as a generalization of the real and complex-valued support vector machines using the Clifford geometric algebra. In this framework, we handle the design of kernels involving the Clifford or geometric product. In this approach, one redefines the optimization variables as multivectors. This allows us to have a multivector as output. Therefore, we can represent multiple classes according to the dimension of the geometric algebra in which we work. We show that one can apply CSVM for classification and regression and also to build a recurrent CSVM. The CSVM is an attractive approach for the multiple input multiple output processing of high-dimensional geometric entities. We carried out comparisons between CSVM and the current approaches to solve multiclass classification and regression. We also study the performance of the recurrent CSVM with experiments involving time series. The authors believe that this paper can be of great use for researchers and practitioners interested in multiclass hypercomplex computing, particularly for applications in complex and quaternion signal and image processing, satellite control, neurocomputation, pattern recognition, computer vision, augmented virtual reality, robotics, and humanoids.
Nakamura, Ryota; Suhrcke, Marc; Jebb, Susan A.; Pechey, Rachel; Almiron-Roig, Eva; Marteau, Theresa M.
2015-01-01
Background: There is a growing concern, but limited evidence, that price promotions contribute to a poor diet and the social patterning of diet-related disease. Objective: We examined the following questions: 1) Are less-healthy foods more likely to be promoted than healthier foods? 2) Are consumers more responsive to promotions on less-healthy products? 3) Are there socioeconomic differences in food purchases in response to price promotions? Design: With the use of hierarchical regression, w...
Fickler, Robert; Lapkiewicz, Radek; Huber, Marcus; Lavery, Martin P J; Padgett, Miles J; Zeilinger, Anton
2014-07-30
Photonics has become a mature field of quantum information science, where integrated optical circuits offer a way to scale the complexity of the set-up as well as the dimensionality of the quantum state. On photonic chips, paths are the natural way to encode information. To distribute those high-dimensional quantum states over large distances, transverse spatial modes, like orbital angular momentum possessing Laguerre Gauss modes, are favourable as flying information carriers. Here we demonstrate a quantum interface between these two vibrant photonic fields. We create three-dimensional path entanglement between two photons in a nonlinear crystal and use a mode sorter as the quantum interface to transfer the entanglement to the orbital angular momentum degree of freedom. Thus our results show a flexible way to create high-dimensional spatial mode entanglement. Moreover, they pave the way to implement broad complex quantum networks where high-dimensionally entangled states could be distributed over distant photonic chips.
The validation and assessment of machine learning: a game of prediction from high-dimensional data
DEFF Research Database (Denmark)
Pers, Tune Hannes; Albrechtsen, A; Holst, C
2009-01-01
In applied statistics, tools from machine learning are popular for analyzing complex and high-dimensional data. However, few theoretical results are available that could guide to the appropriate machine learning tool in a new application. Initial development of an overall strategy thus often...... the ideas, the game is applied to data from the Nugenob Study where the aim is to predict the fat oxidation capacity based on conventional factors and high-dimensional metabolomics data. Three players have chosen to use support vector machines, LASSO, and random forests, respectively....
Reduced basis ANOVA methods for partial differential equations with high-dimensional random inputs
Energy Technology Data Exchange (ETDEWEB)
Liao, Qifeng, E-mail: liaoqf@shanghaitech.edu.cn [School of Information Science and Technology, ShanghaiTech University, Shanghai 200031 (China); Lin, Guang, E-mail: guanglin@purdue.edu [Department of Mathematics & School of Mechanical Engineering, Purdue University, West Lafayette, IN 47907 (United States)
2016-07-15
In this paper we present a reduced basis ANOVA approach for partial differential equations (PDEs) with random inputs. The ANOVA method, combined with stochastic collocation, provides model reduction in high-dimensional parameter space by decomposing high-dimensional inputs into unions of low-dimensional inputs. In this work, to further reduce the computational cost, we investigate spatial low-rank structures in the ANOVA-collocation method and develop efficient spatial model reduction techniques using hierarchically generated reduced bases. We present a general mathematical framework for the methodology, validate its accuracy, and demonstrate its efficiency with numerical experiments.
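The core ANOVA idea of the abstract, replacing a high-dimensional input dependence with a union of low-dimensional terms, can be sketched with a first-order anchored (cut-HDMR) decomposition on a toy function; this illustrates only the input decomposition, not the paper's reduced basis or collocation machinery, and the function and anchor point are invented for the example.

```python
import numpy as np

def f(x):
    # toy "model" with 8 random inputs; nearly additive, so first-order
    # ANOVA terms capture it well
    return np.sum(np.sin(x)) + 0.1 * x[0] * x[1]

def anova_first_order(f, x, anchor):
    """First-order anchored-ANOVA approximation:
    f(x) ~ f(a) + sum_i [ f(a with i-th entry from x) - f(a) ].
    Each term only requires evaluating f on a ONE-dimensional slice."""
    f_a = f(anchor)
    approx = f_a
    for i in range(len(x)):
        xi = anchor.copy()
        xi[i] = x[i]
        approx += f(xi) - f_a
    return approx

d = 8
anchor = np.zeros(d)               # anchor (cut) point
rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, d)

err = abs(f(x) - anova_first_order(f, x, anchor))
print(err)   # small: only the weak x0*x1 interaction is missed
```

The residual is exactly the second-order interaction 0.1·x0·x1, bounded by 0.1 on this domain; in the paper such low-dimensional terms are each approximated with stochastic collocation and shared reduced bases.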
Bayesian ARTMAP for regression.
Sasu, L M; Andonie, R
2013-10-01
Bayesian ARTMAP (BA) is a recently introduced neural architecture which uses a combination of Fuzzy ARTMAP competitive learning and Bayesian learning. Training is generally performed online, in a single-epoch. During training, BA creates input data clusters as Gaussian categories, and also infers the conditional probabilities between input patterns and categories, and between categories and classes. During prediction, BA uses Bayesian posterior probability estimation. So far, BA was used only for classification. The goal of this paper is to analyze the efficiency of BA for regression problems. Our contributions are: (i) we generalize the BA algorithm using the clustering functionality of both ART modules, and name it BA for Regression (BAR); (ii) we prove that BAR is a universal approximator with the best approximation property. In other words, BAR approximates arbitrarily well any continuous function (universal approximation) and, for every given continuous function, there is one in the set of BAR approximators situated at minimum distance (best approximation); (iii) we experimentally compare the online trained BAR with several neural models, on the following standard regression benchmarks: CPU Computer Hardware, Boston Housing, Wisconsin Breast Cancer, and Communities and Crime. Our results show that BAR is an appropriate tool for regression tasks, both for theoretical and practical reasons. Copyright © 2013 Elsevier Ltd. All rights reserved.
Zhang, Jiangjiang; Li, Weixuan; Lin, Guang; Zeng, Lingzao; Wu, Laosheng
2017-03-01
In decision-making for groundwater management and contamination remediation, it is important to accurately evaluate the probability of the occurrence of a failure event. For small failure probability analysis, a large number of model evaluations are needed in the Monte Carlo (MC) simulation, which is impractical for CPU-demanding models. One approach to alleviate the computational cost caused by the model evaluations is to construct a computationally inexpensive surrogate model instead. However, using a surrogate approximation can cause an extra error in the failure probability analysis. Moreover, constructing accurate surrogates is challenging for high-dimensional models, i.e., models containing many uncertain input parameters. To address these issues, we propose an efficient two-stage MC approach for small failure probability analysis in high-dimensional groundwater contaminant transport modeling. In the first stage, a low-dimensional representation of the original high-dimensional model is sought with Karhunen-Loève expansion and sliced inverse regression jointly, which allows for the easy construction of a surrogate with polynomial chaos expansion. Then a surrogate-based MC simulation is implemented. In the second stage, the small number of samples that are close to the failure boundary are re-evaluated with the original model, which corrects the bias introduced by the surrogate approximation. The proposed approach is tested with a numerical case study and is shown to be 100 times faster than the traditional MC approach in achieving the same level of estimation accuracy.
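The two-stage idea of the abstract can be sketched in one dimension with invented stand-ins: a cubic "expensive model", a polynomial "surrogate" (playing the role of the paper's polynomial chaos expansion), and a hand-picked failure threshold and re-check band. Stage 1 runs the cheap surrogate on all Monte Carlo samples; stage 2 re-evaluates only the samples near the failure boundary with the true model, correcting the surrogate's bias at a tiny fraction of the cost.

```python
import numpy as np

rng = np.random.default_rng(0)

def expensive_model(x):
    # stand-in for a CPU-demanding transport model
    return x**3 + 0.05 * np.sin(5 * x)

def surrogate(x):
    # cheap approximation (in the paper: a polynomial chaos expansion
    # built on a low-dimensional representation of the inputs)
    return x**3

threshold = 8.0                    # failure event: model output > threshold
N = 100_000
x = rng.normal(0.0, 1.0, N)

# Stage 1: surrogate-based MC; flag samples near the failure boundary.
# The band must cover the surrogate's approximation error (here <= 0.05).
g = surrogate(x)
band = 0.2
sure_fail = g > threshold + band
uncertain = np.abs(g - threshold) <= band

# Stage 2: re-evaluate only the uncertain samples with the true model
g_true = expensive_model(x[uncertain])
p_fail = (sure_fail.sum() + (g_true > threshold).sum()) / N

print(p_fail, uncertain.sum())     # small probability, few expensive calls
```

Only a few hundred of the 100,000 samples trigger an expensive evaluation, which is the source of the speedup the abstract reports.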
Energy Technology Data Exchange (ETDEWEB)
Zhang, Jiangjiang [Zhejiang Provincial Key Laboratory of Agricultural Resources and Environment, Institute of Soil and Water Resources and Environmental Science, College of Environmental and Resource Sciences, Zhejiang University, Hangzhou China; Li, Weixuan [Pacific Northwest National Laboratory, Richland Washington USA; Lin, Guang [Department of Mathematics and School of Mechanical Engineering, Purdue University, West Lafayette Indiana USA; Zeng, Lingzao [Zhejiang Provincial Key Laboratory of Agricultural Resources and Environment, Institute of Soil and Water Resources and Environmental Science, College of Environmental and Resource Sciences, Zhejiang University, Hangzhou China; Wu, Laosheng [Department of Environmental Sciences, University of California, Riverside California USA
2017-03-01
In decision-making for groundwater management and contamination remediation, it is important to accurately evaluate the probability of the occurrence of a failure event. For small failure probability analysis, a large number of model evaluations are needed in the Monte Carlo (MC) simulation, which is impractical for CPU-demanding models. One approach to alleviate the computational cost caused by the model evaluations is to construct a computationally inexpensive surrogate model instead. However, using a surrogate approximation can cause an extra error in the failure probability analysis. Moreover, constructing accurate surrogates is challenging for high-dimensional models, i.e., models containing many uncertain input parameters. To address these issues, we propose an efficient two-stage MC approach for small failure probability analysis in high-dimensional groundwater contaminant transport modeling. In the first stage, a low-dimensional representation of the original high-dimensional model is sought with Karhunen–Loève expansion and sliced inverse regression jointly, which allows for the easy construction of a surrogate with polynomial chaos expansion. Then a surrogate-based MC simulation is implemented. In the second stage, the small number of samples that are close to the failure boundary are re-evaluated with the original model, which corrects the bias introduced by the surrogate approximation. The proposed approach is tested with a numerical case study and is shown to be 100 times faster than the traditional MC approach in achieving the same level of estimation accuracy.
Adaptive metric kernel regression
DEFF Research Database (Denmark)
Goutte, Cyril; Larsen, Jan
2000-01-01
Kernel smoothing is a widely used non-parametric pattern recognition technique. By nature, it suffers from the curse of dimensionality and is usually difficult to apply to high input dimensions. In this contribution, we propose an algorithm that adapts the input metric used in multivariate...... regression by minimising a cross-validation estimate of the generalisation error. This allows one to automatically adjust the importance of different dimensions. The improvement in terms of modelling performance is illustrated on a variable selection task where the adaptive metric kernel clearly outperforms...
Adaptive Metric Kernel Regression
DEFF Research Database (Denmark)
Goutte, Cyril; Larsen, Jan
1998-01-01
Kernel smoothing is a widely used nonparametric pattern recognition technique. By nature, it suffers from the curse of dimensionality and is usually difficult to apply to high input dimensions. In this paper, we propose an algorithm that adapts the input metric used in multivariate regression...... by minimising a cross-validation estimate of the generalisation error. This allows one to automatically adjust the importance of different dimensions. The improvement in terms of modelling performance is illustrated on a variable selection task where the adaptive metric kernel clearly outperforms the standard...
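The adaptive-metric idea of the two abstracts above can be sketched with a Nadaraya-Watson regressor whose Gaussian kernel carries one bandwidth per input dimension, tuned against a leave-one-out estimate of the generalisation error. The toy data and the crude coordinate-wise grid search are assumptions of this sketch (the papers minimise the cross-validation estimate directly, e.g. by gradient-based optimisation); the point is that irrelevant dimensions receive large bandwidths and are effectively switched off.

```python
import numpy as np

def loo_error(X, y, h):
    """Leave-one-out squared error of a Nadaraya-Watson fit whose
    Gaussian kernel uses one bandwidth per input dimension (vector h)."""
    d2 = ((X[:, None, :] - X[None, :, :]) / h) ** 2
    w = np.exp(-0.5 * d2.sum(axis=2))
    np.fill_diagonal(w, 0.0)            # leave each point out of its own fit
    pred = (w @ y) / w.sum(axis=1)
    return np.mean((pred - y) ** 2)

rng = np.random.default_rng(0)
n, d = 200, 3
X = rng.uniform(-1, 1, (n, d))
y = np.sin(3 * X[:, 0]) + 0.05 * rng.standard_normal(n)  # only dim 0 matters

# Crude metric adaptation: coordinate-wise grid search on the LOO error
h = np.ones(d)
for i in range(d):
    for c in [0.1, 0.3, 1.0, 3.0, 10.0]:
        trial = h.copy()
        trial[i] = c
        if loo_error(X, y, trial) < loo_error(X, y, h):
            h = trial

print(h)   # large bandwidths on dims 1 and 2 down-weight them
```

After adaptation the bandwidth on the informative dimension is much smaller than on the two noise dimensions, which is the automatic variable selection behaviour the abstracts describe.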
A Cure for Variance Inflation in High Dimensional Kernel Principal Component Analysis
DEFF Research Database (Denmark)
Abrahamsen, Trine Julie; Hansen, Lars Kai
2011-01-01
Small sample high-dimensional principal component analysis (PCA) suffers from variance inflation and lack of generalizability. It has earlier been pointed out that a simple leave-one-out variance renormalization scheme can cure the problem. In this paper we generalize the cure in two directions...
Global communication schemes for the numerical solution of high-dimensional PDEs
DEFF Research Database (Denmark)
Hupp, Philipp; Heene, Mario; Jacob, Riko
2016-01-01
The numerical treatment of high-dimensional partial differential equations is among the most compute-hungry problems and is in urgent need of current and future high-performance computing (HPC) systems. It is thus also facing the grand challenges of exascale computing such as the requirement to red...
Generation of high-dimensional energy-time-entangled photon pairs
Zhang, Da; Zhang, Yiqi; Li, Xinghua; Zhang, Dan; Cheng, Lin; Li, Changbiao; Zhang, Yanpeng
2017-11-01
High-dimensional entangled photon pairs have many excellent properties compared to two-dimensional entangled two-photon states, such as greater information capacity, stronger nonlocality, and higher security. Traditionally, the degrees of freedom used to produce high-dimensional entanglement have mainly been angular momentum and energy-time. In this paper, we propose a type of high-dimensional energy-time-entangled qudit that differs from the traditional model with an extended propagation path. In addition, our method focuses mainly on generation with multiple frequency modes, and two- and three-dimensional frequency-entangled qudits are examined in detail as examples, through the linear or nonlinear optical response of the medium. The generation of high-dimensional energy-time-entangled states can be verified by coincidence counts in the damped Rabi oscillation regime, where the paired Stokes-anti-Stokes wave packet is determined by the structure of resonances in the third-order nonlinearity. Finally, we extend the dimension to N in the sequential-cascade mode. Our results have potential applications in quantum communication and quantum computation.
Characterization of differentially expressed genes using high-dimensional co-expression networks
DEFF Research Database (Denmark)
Coelho Goncalves de Abreu, Gabriel; Labouriau, Rodrigo S.
2010-01-01
We present a technique to characterize differentially expressed genes in terms of their position in a high-dimensional co-expression network. The set-up of Gaussian graphical models is used to construct representations of the co-expression network in such a way that redundancy and the propagation...
Ferdosi, Bilkis J.; Buddelmeijer, Hugo; Trager, Scott; Wilkinson, Michael H.F.; Roerdink, Jos B.T.M.
2010-01-01
Data sets in astronomy are growing to enormous sizes. Modern astronomical surveys provide not only image data but also catalogues of millions of objects (stars, galaxies), each object with hundreds of associated parameters. Exploration of this very high-dimensional data space poses a huge challenge.
High-Dimensional Exploratory Item Factor Analysis by a Metropolis-Hastings Robbins-Monro Algorithm
Cai, Li
2010-01-01
A Metropolis-Hastings Robbins-Monro (MH-RM) algorithm for high-dimensional maximum marginal likelihood exploratory item factor analysis is proposed. The sequence of estimates from the MH-RM algorithm converges with probability one to the maximum likelihood solution. Details on the computer implementation of this algorithm are provided. The…
Storm, Emma; Weniger, Christoph; Calore, Francesca
2017-08-01
We present SkyFACT (Sky Factorization with Adaptive Constrained Templates), a new approach for studying, modeling and decomposing diffuse gamma-ray emission. Like most previous analyses, the approach relies on predictions from cosmic-ray propagation codes like GALPROP and DRAGON. However, in contrast to previous approaches, we account for the fact that models are not perfect and allow for a very large number (≳ 10^5) of nuisance parameters to parameterize these imperfections. We combine methods of image reconstruction and adaptive spatio-spectral template regression in one coherent hybrid approach. To this end, we use penalized Poisson likelihood regression, with regularization functions that are motivated by the maximum entropy method. We introduce methods to efficiently handle the high dimensionality of the convex optimization problem as well as the associated semi-sparse covariance matrix, using the L-BFGS-B algorithm and Cholesky factorization. We test the method both on synthetic data as well as on gamma-ray emission from the inner Galaxy, |l| < 90° and |b| < 20°, as observed by the Fermi Large Area Telescope. We finally define a simple reference model that removes most of the residual emission from the inner Galaxy, based on conventional diffuse emission components as well as components for the Fermi bubbles, the Fermi Galactic center excess, and extended sources along the Galactic disk. Variants of this reference model can serve as basis for future studies of diffuse emission in and outside the Galactic disk.
Comparison of clustering methods for high-dimensional single-cell flow and mass cytometry data.
Weber, Lukas M; Robinson, Mark D
2016-12-01
Recent technological developments in high-dimensional flow cytometry and mass cytometry (CyTOF) have made it possible to detect expression levels of dozens of protein markers in thousands of cells per second, allowing cell populations to be characterized in unprecedented detail. Traditional data analysis by "manual gating" can be inefficient and unreliable in these high-dimensional settings, which has led to the development of a large number of automated analysis methods. Methods designed for unsupervised analysis use specialized clustering algorithms to detect and define cell populations for further downstream analysis. Here, we have performed an up-to-date, extensible performance comparison of clustering methods for high-dimensional flow and mass cytometry data. We evaluated methods using several publicly available data sets from experiments in immunology, containing both major and rare cell populations, with cell population identities from expert manual gating as the reference standard. Several methods performed well, including FlowSOM, X-shift, PhenoGraph, Rclusterpp, and flowMeans. Among these, FlowSOM had extremely fast runtimes, making this method well-suited for interactive, exploratory analysis of large, high-dimensional data sets on a standard laptop or desktop computer. These results extend previously published comparisons by focusing on high-dimensional data and including new methods developed for CyTOF data. R scripts to reproduce all analyses are available from GitHub (https://github.com/lmweber/cytometry-clustering-comparison), and pre-processed data files are available from FlowRepository (FR-FCM-ZZPH), allowing our comparisons to be extended to include new clustering methods and reference data sets. © 2016 The Authors. Cytometry Part A published by Wiley Periodicals, Inc. on behalf of ISAC.
Differentiating regressed melanoma from regressed lichenoid keratosis.
Chan, Aegean H; Shulman, Kenneth J; Lee, Bonnie A
2017-04-01
Distinguishing regressed lichen planus-like keratosis (LPLK) from regressed melanoma can be difficult on histopathologic examination, potentially resulting in mismanagement of patients. We aimed to identify histopathologic features by which regressed melanoma can be differentiated from regressed LPLK. Twenty actively inflamed LPLK, 12 LPLK with regression and 15 melanomas with regression were compared and evaluated by hematoxylin and eosin staining as well as Melan-A, microphthalmia transcription factor (MiTF) and cytokeratin (AE1/AE3) immunostaining. (1) A total of 40% of regressed melanomas showed complete or near-complete loss of melanocytes within the epidermis with Melan-A and MiTF immunostaining, while 8% of regressed LPLK exhibited this finding. (2) Necrotic keratinocytes were seen in the epidermis in 33% of regressed melanomas as opposed to all of the regressed LPLK. (3) A dense infiltrate of melanophages in the papillary dermis was seen in 40% of regressed melanomas, a feature not seen in regressed LPLK. In summary, our findings suggest that a complete or near-complete loss of melanocytes within the epidermis strongly favors a regressed melanoma over a regressed LPLK. In addition, necrotic epidermal keratinocytes and the presence of a dense band-like distribution of dermal melanophages can be helpful in differentiating these lesions. © 2016 John Wiley & Sons A/S. Published by John Wiley & Sons Ltd.
Faustini, J. M.; Jones, J. A.
2001-12-01
This study used an empirical modeling approach to explore landscape controls on spatial variations in reach-scale channel response to peak flows in a mountain watershed. We used historical cross-section surveys spanning 20 years at five sites on 2nd to 5th-order channels and stream gaging records spanning up to 50 years. We related the observed proportion of cross-sections at a site exhibiting detectable change between consecutive surveys to the recurrence interval of the largest peak flow during the corresponding period using a quasi-likelihood logistic regression model. Stream channel response was linearly related to flood size or return period through the logit function, but the shape of the response function varied according to basin size, bed material, and the presence or absence of large wood. At the watershed scale, we hypothesized that the spatial scale and frequency of channel adjustment should increase in the downstream direction as sediment supply increases relative to transport capacity, resulting in more transportable sediment in the channel and hence increased bed mobility. Consistent with this hypothesis, cross sections from the 4th and 5th-order main stem channels exhibit more frequent detectable changes than those at two steep third-order tributary sites. Peak flows able to mobilize bed material sufficiently to cause detectable changes in 50% of cross-section profiles had an estimated recurrence interval of 3 years for the 4th and 5th-order channels and 4 to 6 years for the 3rd-order sites. This difference increased for larger magnitude channel changes; peak flows with recurrence intervals of about 7 years produced changes in 90% of cross sections at the main stem sites, but flows able to produce the same level of response at tributary sites were three times less frequent. At finer scales, this trend of increasing bed mobility in the downstream direction is modified by variations in the degree of channel confinement by bedrock and landforms, the
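The relationship the study fits, a logistic (logit-linear) dependence of the proportion of changed cross-sections on flood size or return period, can be sketched with entirely hypothetical numbers: the survey counts, the recurrence intervals, and the plain binomial logistic fit below (via iteratively reweighted least squares) are illustrative stand-ins for the study's actual quasi-likelihood regression on its historical survey data.

```python
import numpy as np

# Hypothetical data: peak-flow recurrence interval (years) between
# consecutive surveys, and the fraction of cross-sections that changed
T = np.array([1.0, 2.0, 3.0, 5.0, 7.0, 10.0])
frac = np.array([0.10, 0.35, 0.50, 0.75, 0.90, 0.95])
n_sections = 20                      # cross-sections surveyed per period

# Binomial logistic regression of change counts on log recurrence
# interval, fitted by Newton's method (IRLS)
X = np.column_stack([np.ones_like(T), np.log(T)])
yc = np.round(frac * n_sections)     # number of changed cross-sections
beta = np.zeros(2)
for _ in range(25):
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    W = n_sections * p * (1 - p)               # IRLS weights
    grad = X.T @ (yc - n_sections * p)
    H = X.T @ (W[:, None] * X)
    beta += np.linalg.solve(H, grad)

# Recurrence interval at which half the cross-sections change:
# the logit is zero, so log T50 = -beta0 / beta1
T50 = np.exp(-beta[0] / beta[1])
print(T50)
```

With these made-up numbers the fitted half-response flood has a roughly 3-year recurrence interval, mirroring the kind of site-level summary statistic reported in the abstract.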
Directory of Open Access Journals (Sweden)
Yutao LIU
2011-10-01
Full Text Available Background and objective Erlotinib is a targeted therapy drug for non-small cell lung cancer (NSCLC). There is evidence of varying survival benefits derived from erlotinib in patients with different clinical features, but the results are conflicting. The aim of this study is to identify novel predictive factors and explore the interactions between clinical variables as well as their impact on the survival of Chinese patients with advanced NSCLC heavily treated with erlotinib. Methods The clinical and follow-up data of 105 Chinese NSCLC patients referred to the Cancer Hospital and Institute, Chinese Academy of Medical Sciences from September 2006 to September 2009 were analyzed. Multivariate analysis of progression-free survival (PFS) was performed using recursive partitioning, referred to as classification and regression tree (CART) analysis. Results The median PFS of 105 eligible consecutive Chinese NSCLC patients was 5.0 months (95%CI: 2.9-7.1). CART analysis selected, for the initial, second, and third splits, lymph node involvement, the timing of erlotinib administration, and smoking history. Four terminal subgroups were formed. The longer values for the median PFS were 11.0 months (95%CI: 8.9-13.1) for the subgroup with no lymph node metastasis and 10.0 months (95%CI: 7.9-12.1) for the subgroup with lymph node involvement, but not over the second-line erlotinib treatment with a smoking history ≤35 packs per year. The shorter values for the median PFS were 2.3 months (95%CI: 1.6-3.0) for the subgroup with lymph node metastasis and over the second-line erlotinib treatment, and 1.3 months (95%CI: 0.5-2.1) for the subgroup with lymph node metastasis, but not over the second-line erlotinib treatment with a smoking history >35 packs per year. Conclusion Lymph node metastasis, the timing of erlotinib administration, and smoking history are closely correlated with the survival of advanced NSCLC patients with first- to
Compressively Characterizing High-Dimensional Entangled States with Complementary, Random Filtering
Directory of Open Access Journals (Sweden)
Gregory A. Howland
2016-05-01
Full Text Available The resources needed to conventionally characterize a quantum system are overwhelmingly large for high-dimensional systems. This obstacle may be overcome by abandoning traditional cornerstones of quantum measurement, such as general quantum states, strong projective measurement, and assumption-free characterization. Following this reasoning, we demonstrate an efficient technique for characterizing high-dimensional, spatial entanglement with one set of measurements. We recover sharp distributions with local, random filtering of the same ensemble in momentum followed by position—something the uncertainty principle forbids for projective measurements. Exploiting the expectation that entangled signals are highly correlated, we use fewer than 5000 measurements to characterize a 65,536-dimensional state. Finally, we use entropic inequalities to witness entanglement without a density matrix. Our method represents the sea change unfolding in quantum measurement, where methods influenced by the information theory and signal-processing communities replace unscalable, brute-force techniques—a progression previously followed by classical sensing.
Distribution of high-dimensional entanglement via an intra-city free-space link
Steinlechner, Fabian; Ecker, Sebastian; Fink, Matthias; Liu, Bo; Bavaresco, Jessica; Huber, Marcus; Scheidl, Thomas; Ursin, Rupert
2017-07-01
Quantum entanglement is a fundamental resource in quantum information processing and its distribution between distant parties is a key challenge in quantum communications. Increasing the dimensionality of entanglement has been shown to improve robustness and channel capacities in secure quantum communications. Here we report on the distribution of genuine high-dimensional entanglement via a 1.2-km-long free-space link across Vienna. We exploit hyperentanglement, that is, simultaneous entanglement in polarization and energy-time bases, to encode quantum information, and observe high-visibility interference for successive correlation measurements in each degree of freedom. These visibilities impose lower bounds on entanglement in each subspace individually and certify four-dimensional entanglement for the hyperentangled system. The high-fidelity transmission of high-dimensional entanglement under real-world atmospheric link conditions represents an important step towards long-distance quantum communications with more complex quantum systems and the implementation of advanced quantum experiments with satellite links.
High-dimensional quantum state transfer through a quantum spin chain
Qin, Wei; Wang, Chuan; Long, Gui Lu
2013-01-01
In this paper, we investigate a high-dimensional quantum state transfer protocol. An arbitrary unknown high-dimensional state can be transferred with high fidelity between two remote registers through an XX coupling spin chain of arbitrary length. The evolution of the state transfer is determined by the natural dynamics of the chain without external modulation and coupling strength engineering. As a consequence, entanglement distribution with a high efficiency can be achieved. Also the strong field and high spin quantum number can partly counteract the effect of finite temperature to ensure the high fidelity of the protocol when the quantum data bus is in the thermal equilibrium state under an external magnetic field.
Distribution of high-dimensional entanglement via an intra-city free-space link.
Steinlechner, Fabian; Ecker, Sebastian; Fink, Matthias; Liu, Bo; Bavaresco, Jessica; Huber, Marcus; Scheidl, Thomas; Ursin, Rupert
2017-07-24
Quantum entanglement is a fundamental resource in quantum information processing and its distribution between distant parties is a key challenge in quantum communications. Increasing the dimensionality of entanglement has been shown to improve robustness and channel capacities in secure quantum communications. Here we report on the distribution of genuine high-dimensional entanglement via a 1.2-km-long free-space link across Vienna. We exploit hyperentanglement, that is, simultaneous entanglement in polarization and energy-time bases, to encode quantum information, and observe high-visibility interference for successive correlation measurements in each degree of freedom. These visibilities impose lower bounds on entanglement in each subspace individually and certify four-dimensional entanglement for the hyperentangled system. The high-fidelity transmission of high-dimensional entanglement under real-world atmospheric link conditions represents an important step towards long-distance quantum communications with more complex quantum systems and the implementation of advanced quantum experiments with satellite links.
Su, Yapeng; Shi, Qihui; Wei, Wei
2017-02-01
New insights on cellular heterogeneity in the last decade provoke the development of a variety of single cell omics tools at a lightning pace. The resultant high-dimensional single cell data generated by these tools require new theoretical approaches and analytical algorithms for effective visualization and interpretation. In this review, we briefly survey the state-of-the-art single cell proteomic tools with a particular focus on data acquisition and quantification, followed by an elaboration of a number of statistical and computational approaches developed to date for dissecting the high-dimensional single cell data. The underlying assumptions, unique features, and limitations of the analytical methods with the designated biological questions they seek to answer will be discussed. Particular attention will be given to those information theoretical approaches that are anchored in a set of first principles of physics and can yield detailed (and often surprising) predictions. © 2017 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.
Estimating and testing high-dimensional mediation effects in epigenetic studies.
Zhang, Haixiang; Zheng, Yinan; Zhang, Zhou; Gao, Tao; Joyce, Brian; Yoon, Grace; Zhang, Wei; Schwartz, Joel; Just, Allan; Colicino, Elena; Vokonas, Pantel; Zhao, Lihui; Lv, Jinchi; Baccarelli, Andrea; Hou, Lifang; Liu, Lei
2016-10-15
High-dimensional DNA methylation markers may mediate pathways linking environmental exposures with health outcomes. However, there is a lack of analytical methods to identify significant mediators for high-dimensional mediation analysis. Based on sure independence screening and minimax concave penalty techniques, we use a joint significance test for the mediation effect. We demonstrate its practical performance using Monte Carlo simulation studies and apply this method to investigate the extent to which DNA methylation markers mediate the causal pathway from smoking to reduced lung function in the Normative Aging Study. We identify 2 CpGs with significant mediation effects. R package, source code, and simulation study are available at https://github.com/YinanZheng/HIMA CONTACT: lei.liu@northwestern.edu. © The Author 2016. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.
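A stripped-down sketch of the screen-then-test strategy on simulated data: marginal screening in the spirit of sure independence screening, followed by a joint-significance statistic (a mediator must show both a strong exposure-to-mediator and mediator-to-outcome association, so the weaker of the two z-statistics governs). The correlation-based z-statistics and the simulated effect sizes are assumptions of this sketch; the actual HIMA method additionally uses minimax concave penalized regression and proper p-values.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 500, 1000                   # many candidate mediators (e.g. CpGs)

exposure = rng.standard_normal(n)                    # e.g. smoking
M = rng.standard_normal((n, p))                      # methylation markers
M[:, 0] += 0.8 * exposure                            # only marker 0 mediates
outcome = 0.8 * M[:, 0] + rng.standard_normal(n)     # e.g. lung function

def zcorr(a, b):
    """Approximate z-statistic for the correlation of a and b."""
    r = np.corrcoef(a, b)[0, 1]
    return abs(r) * np.sqrt(len(a))

# Step 1 (screening): keep the d mediators most associated with the outcome
d = 10
strength = np.array([zcorr(M[:, j], outcome) for j in range(p)])
kept = np.argsort(strength)[-d:]

# Step 2 (joint significance): the joint statistic is the weaker of the
# exposure->mediator and mediator->outcome associations
joint = {j: min(zcorr(exposure, M[:, j]), zcorr(M[:, j], outcome))
         for j in kept}
best = max(joint, key=joint.get)
print(best, joint[best])
```

Only the truly mediating marker survives both legs of the test; null markers that slip through screening fail on the exposure-to-mediator leg.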
Distributed Computation of the knn Graph for Large High-Dimensional Point Sets.
Plaku, Erion; Kavraki, Lydia E
2007-03-01
High-dimensional problems arising from robot motion planning, biology, data mining, and geographic information systems often require the computation of k nearest neighbor (knn) graphs. The knn graph of a data set is obtained by connecting each point to its k closest points. As the research in the above-mentioned fields progressively addresses problems of unprecedented complexity, the demand for computing knn graphs based on arbitrary distance metrics and large high-dimensional data sets increases, exceeding resources available to a single machine. In this work we efficiently distribute the computation of knn graphs for clusters of processors with message passing. Extensions to our distributed framework include the computation of graphs based on other proximity queries, such as approximate knn or range queries. Our experiments show nearly linear speedup with over one hundred processors and indicate that similar speedup can be obtained with several hundred processors.
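For reference, the object being distributed is simple to state: the knn graph connects each point to its k closest points. Below is a minimal single-machine brute-force version (the paper's contribution, partitioning this computation across processors with message passing, is not shown; the data are random for illustration).

```python
import numpy as np

def knn_graph(X, k):
    """Brute-force knn graph: row i lists the indices of the k points
    nearest to point i under the Euclidean metric."""
    # pairwise squared Euclidean distances via the expansion
    # |a - b|^2 = |a|^2 + |b|^2 - 2 a.b
    sq = (X**2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    np.fill_diagonal(d2, np.inf)   # a point is not its own neighbour
    return np.argsort(d2, axis=1)[:, :k]

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 16))  # 100 points in 16 dimensions
G = knn_graph(X, k=5)
print(G.shape)
```

The quadratic memory and time of this version is exactly what exceeds a single machine for the large, high-dimensional data sets the abstract targets, motivating the distributed framework.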
Efficient uncertainty quantification methodologies for high-dimensional climate land models
Energy Technology Data Exchange (ETDEWEB)
Sargsyan, Khachik [Sandia National Lab. (SNL-CA), Livermore, CA (United States); Safta, Cosmin [Sandia National Lab. (SNL-CA), Livermore, CA (United States); Berry, Robert Dan [Sandia National Lab. (SNL-CA), Livermore, CA (United States); Ray, Jaideep [Sandia National Lab. (SNL-CA), Livermore, CA (United States); Debusschere, Bert J. [Sandia National Lab. (SNL-CA), Livermore, CA (United States); Najm, Habib N. [Sandia National Lab. (SNL-CA), Livermore, CA (United States)
2011-11-01
In this report, we propose, examine, and implement approaches for performing efficient uncertainty quantification (UQ) in climate land models. Specifically, we applied a Bayesian compressive sensing framework to polynomial chaos spectral expansions, enhanced it with an iterative basis-reduction algorithm, and investigated the results on test models as well as on the Community Land Model (CLM). Furthermore, we discuss the construction of efficient quadrature rules for forward propagation of uncertainties from a high-dimensional, constrained input space to output quantities of interest. The work lays the groundwork for efficient forward UQ for high-dimensional, strongly nonlinear, and computationally costly climate models. Moreover, to investigate parameter inference approaches, we applied two variants of the Markov chain Monte Carlo (MCMC) method to a soil moisture dynamics submodel of the CLM. The evaluation of these algorithms gives us a good foundation for further building out the Bayesian calibration framework towards the goal of robust component-wise calibration.
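On the inference side, the simplest MCMC variant, random-walk Metropolis, fits in a few lines. The Gaussian-mean target below is a hypothetical stand-in for a cheap submodel likelihood, not the CLM itself:

```python
import numpy as np

def log_post(theta, data, sigma=1.0):
    """Log-posterior for the mean of a Gaussian with known sigma and a
    flat prior; a stand-in for a cheap submodel likelihood."""
    return -0.5 * np.sum((data - theta) ** 2) / sigma ** 2

def metropolis(data, n_steps=5000, step=0.2, seed=1):
    """Random-walk Metropolis: propose theta' ~ N(theta, step^2) and
    accept with probability min(1, posterior ratio)."""
    rng = np.random.default_rng(seed)
    theta, lp = 0.0, log_post(0.0, data)
    chain = np.empty(n_steps)
    for t in range(n_steps):
        prop = theta + step * rng.normal()
        lp_prop = log_post(prop, data)
        if np.log(rng.uniform()) < lp_prop - lp:   # Metropolis acceptance
            theta, lp = prop, lp_prop
        chain[t] = theta
    return chain

rng = np.random.default_rng(0)
data = rng.normal(2.0, 1.0, size=200)   # synthetic observations, true mean 2
chain = metropolis(data)
posterior_mean = chain[1000:].mean()    # discard burn-in
```

The chain concentrates near the sample mean of the data; for an expensive climate model every `log_post` call would be a forward simulation, which is exactly why the surrogate-building (polynomial chaos) half of the report matters.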
Extreme Learning Machines on High Dimensional and Large Data Applications: A Survey
Directory of Open Access Journals (Sweden)
Jiuwen Cao
2015-01-01
Full Text Available Extreme learning machine (ELM) has been developed for single hidden layer feedforward neural networks (SLFNs). In the ELM algorithm, the connections between the input layer and the hidden neurons are randomly assigned and remain unchanged during the learning process. The output connections are then tuned by minimizing a cost function through a linear system, so the computational burden of ELM is significantly reduced: the only cost is solving a linear system. This low computational complexity has attracted a great deal of attention from the research community, especially for high dimensional and large data applications. This paper provides an up-to-date survey of recent developments of ELM and its applications to high dimensional and large data. Comprehensive reviews of ELM for image processing, video processing, medical signal processing, and other popular large data applications are presented.
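The two-stage recipe — random, frozen input weights followed by a single linear solve for the output weights — is short enough to write out directly. A hedged numpy sketch with illustrative hyperparameters:

```python
import numpy as np

def elm_train(X, y, n_hidden=100, seed=0):
    """Extreme learning machine: input-to-hidden weights are drawn at
    random and frozen; only the hidden-to-output weights are fitted,
    by one linear least-squares solve."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(X.shape[1], n_hidden))
    b = rng.normal(size=n_hidden)
    H = np.tanh(X @ W + b)                        # random hidden layer
    beta, *_ = np.linalg.lstsq(H, y, rcond=None)  # the only "training" step
    return W, b, beta

def elm_predict(X, W, b, beta):
    return np.tanh(X @ W + b) @ beta

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(300, 2))
y = np.sin(3 * X[:, 0]) + X[:, 1] ** 2            # nonlinear target
W, b, beta = elm_train(X, y)
mse = np.mean((elm_predict(X, W, b, beta) - y) ** 2)
```

Because the only tuned parameters come from one `lstsq` call, training cost scales with solving a single linear system, which is the source of the speed advantage the survey highlights.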
Atom-centered symmetry functions for constructing high-dimensional neural network potentials
Behler, Jörg
2011-02-01
Neural networks offer an unbiased and numerically very accurate approach to represent high-dimensional ab initio potential-energy surfaces. Once constructed, neural network potentials can provide the energies and forces many orders of magnitude faster than electronic structure calculations, and thus enable molecular dynamics simulations of large systems. However, Cartesian coordinates are not a good choice to represent the atomic positions, and a transformation to symmetry functions is required. Using simple benchmark systems, the properties of several types of symmetry functions suitable for the construction of high-dimensional neural network potential-energy surfaces are discussed in detail. The symmetry functions are general and can be applied to all types of systems such as molecules, crystalline and amorphous solids, and liquids.
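For concreteness, the simplest member of the family, a radial (two-body) symmetry function, might be implemented as below. The invariance under rotation is exactly what makes such descriptors better network inputs than raw Cartesian coordinates. Parameter values are illustrative:

```python
import numpy as np

def cutoff(r, r_c):
    """Cosine cutoff function: decays smoothly to zero at the cutoff radius,
    so distant atoms do not contribute."""
    return np.where(r < r_c, 0.5 * (np.cos(np.pi * r / r_c) + 1.0), 0.0)

def radial_symmetry(positions, i, eta=1.0, r_s=0.0, r_c=6.0):
    """Radial symmetry function for atom i: a sum of Gaussians over all
    neighbor distances, damped by the cutoff. The value is invariant under
    translation, rotation, and permutation of like atoms."""
    r = np.linalg.norm(positions - positions[i], axis=1)
    r = np.delete(r, i)                      # exclude the atom itself
    return float(np.sum(np.exp(-eta * (r - r_s) ** 2) * cutoff(r, r_c)))

# a 3-atom toy "molecule"
pos = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
g_original = radial_symmetry(pos, 0)

# rotating the whole structure must leave the descriptor unchanged
theta = 0.7
R = np.array([[np.cos(theta), -np.sin(theta), 0.0],
              [np.sin(theta),  np.cos(theta), 0.0],
              [0.0, 0.0, 1.0]])
g_rotated = radial_symmetry(pos @ R.T, 0)
```

In practice each atom is described by a whole vector of such functions with different `eta`, `r_s` values (plus angular terms), and that vector is the input to the atomic neural network.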
An Unbiased Distance-based Outlier Detection Approach for High-dimensional Data
DEFF Research Database (Denmark)
Nguyen, Hoang Vu; Gopalkrishnan, Vivekanand; Assent, Ira
2011-01-01
…than a global property. Different from existing approaches, it is not grid-based and is dimensionality-unbiased. Thus, its performance is impervious to grid resolution as well as to the curse of dimensionality. In addition, our approach ranks the outliers, allowing users to select the number of desired outliers, thus mitigating the issue of a high false alarm rate. Extensive empirical studies on real datasets show that our approach efficiently and effectively detects outliers, even in high-dimensional spaces.
Nikooienejad, Amir; Wang, Wenyi; Johnson, Valen E.
2017-01-01
Variable selection in high dimensional cancer genomic studies has become very popular in the past decade, due to the interest in discovering significant genes pertinent to a specific cancer type. Censored survival data is the main data structure in such studies, and performing variable selection for this data type requires appropriate methodology. With recent developments in computational power, Bayesian methods have become more attractive in the context of variable selection. In this article we i…
Transformation of a high-dimensional color space for material classification.
Liu, Huajian; Lee, Sang-Heon; Chahl, Javaan Singh
2017-04-01
Images in red-green-blue (RGB) color space need to be transformed to other color spaces for image processing or analysis. For example, the well-known hue-saturation-intensity (HSI) color space, which separates hue from saturation and intensity and is similar to the color perception of humans, can aid many computer vision applications. For high-dimensional images, such as multispectral or hyperspectral images, transforming images to a color space that can separate hue from saturation and intensity would be useful; however, related work is limited. Some methods can interpret a set of high-dimensional images in terms of hue, saturation, and intensity, but these methods need to reduce the original images to three images and then map them to the trichromatic RGB color space. Generally, dimension reduction can cause loss or distortion of the original data, and, therefore, the transformed color spaces may not be suitable for material classification in critical conditions. This paper describes a method that can transform high-dimensional images to a color space called hyper-hue-saturation-intensity (HHSI), which is analogous to HSI in high dimensions. The transformation does not need dimension reduction and therefore preserves the original information. Experimental results indicate that the hyper-hue is independent of saturation and intensity and is more suitable for material classification of proximal or remote sensing images captured in a natural environment where illumination usually cannot be controlled.
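One way to read the HHSI idea (a hedged sketch, not necessarily the paper's exact construction): take the all-equal "gray" axis as the achromatic direction, intensity as the component along it, saturation as the distance from it, and hyper-hue as the unit direction of the remainder. A multiplicative illumination change then rescales intensity and saturation but leaves the hyper-hue untouched:

```python
import numpy as np

def hhsi(pixel):
    """Split an n-band pixel into hyper-hue (unit chromatic direction),
    saturation (distance from the achromatic axis), and intensity
    (component along the achromatic axis)."""
    pixel = np.asarray(pixel, dtype=float)
    n = pixel.size
    gray = np.ones(n) / np.sqrt(n)          # achromatic (all-equal) axis
    intensity = pixel @ gray
    chroma = pixel - intensity * gray       # component orthogonal to gray
    saturation = np.linalg.norm(chroma)
    hue = chroma / saturation if saturation > 0 else np.zeros(n)
    return hue, saturation, intensity

p = np.array([0.2, 0.5, 0.9, 0.4])          # a 4-band "hyperspectral" pixel
hue1, sat1, int1 = hhsi(p)
hue2, sat2, int2 = hhsi(2.0 * p)            # same material, twice as bright
```

The brightness-doubled pixel keeps exactly the same hyper-hue, which is the property that makes the representation attractive for material classification under uncontrolled illumination.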
Designing Progressive and Interactive Analytics Processes for High-Dimensional Data Analysis.
Turkay, Cagatay; Kaya, Erdem; Balcisoy, Selim; Hauser, Helwig
2017-01-01
In interactive data analysis processes, the dialogue between the human and the computer is the enabling mechanism that can lead to actionable observations about the phenomena being investigated. It is of paramount importance that this dialogue is not interrupted by slow computational mechanisms that ignore known temporal characteristics of human-computer interaction and the perceptual and cognitive capabilities of the users. In cases where the analysis involves an integrated computational method, for instance to reduce the dimensionality of the data or to perform clustering, such interruptions are likely. To remedy this, progressive computations, where results are iteratively improved, are receiving increasing interest in visual analytics. In this paper, we present techniques and design considerations to incorporate progressive methods within interactive analysis processes that involve high-dimensional data. We define methodologies to facilitate processes that adhere to the perceptual characteristics of users and describe how online algorithms can be incorporated within these. A set of design recommendations and accompanying methods to support analysts in accomplishing high-dimensional data analysis tasks are then presented. Our arguments and decisions here are informed by observations gathered over a series of analysis sessions with analysts from finance. We document observations and recommendations from this study and present evidence on how our approach contributes to the efficiency and productivity of interactive visual analysis sessions involving high-dimensional data.
High-Dimensional Function Approximation With Neural Networks for Large Volumes of Data.
Andras, Peter
2018-02-01
Approximation of high-dimensional functions is a challenge for neural networks due to the curse of dimensionality. Often the data for which the approximated function is defined resides on a low-dimensional manifold, and in principle approximating the function over this manifold should improve approximation performance. It has been shown that projecting the data manifold into a lower-dimensional space, followed by neural network approximation of the function over this space, provides a more precise approximation than approximating the function with neural networks in the original data space. However, if the data volume is very large, the projection into the low-dimensional space has to be based on a limited sample of the data. Here, we investigate the nature of the approximation error of neural networks trained over the projection space. We show that such neural networks should have better approximation performance than neural networks trained on high-dimensional data, even if the projection is based on a relatively sparse sample of the data manifold. We also find that it is preferable to use a uniformly distributed sparse sample of the data for generating the low-dimensional projection. We illustrate these results by considering the practical neural network approximation of a set of functions defined on high-dimensional data, including real-world data.
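The key mechanics — estimate a projection from a sparse sample, then approximate the function in the projected space — can be checked on a toy linear manifold. For brevity a quadratic least-squares fit stands in for the neural network; the setup is entirely synthetic:

```python
import numpy as np

rng = np.random.default_rng(0)

# 10-D data that actually lives on a 2-D linear manifold
basis = np.linalg.qr(rng.normal(size=(10, 2)))[0]   # orthonormal 10x2 basis
coords = rng.uniform(-1, 1, size=(2000, 2))         # manifold coordinates
X = coords @ basis.T                                # embedded points
y = np.sin(coords[:, 0]) + coords[:, 1] ** 2        # target function

# the projection is estimated from a sparse sample only (100 of 2000 points)
sample = X[rng.choice(len(X), size=100, replace=False)]
U, _, _ = np.linalg.svd(sample.T @ sample)
P = U[:, :2]                                        # top-2 principal directions

def fit_quadratic(Z, y):
    """Least-squares fit with full quadratic features; a cheap stand-in
    for the neural-network approximator discussed in the text."""
    F = np.column_stack([np.ones(len(Z)), Z, Z ** 2, Z[:, :1] * Z[:, 1:]])
    w, *_ = np.linalg.lstsq(F, y, rcond=None)
    return F @ w

mse_proj = np.mean((fit_quadratic(X @ P, y) - y) ** 2)
```

Even though `P` is estimated from a 5% sample, it recovers the manifold exactly here, so the approximator works with 2 inputs instead of 10. Real (curved, noisy) manifolds make both steps harder, which is what the paper analyzes.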
Mixture drug-count response model for the high-dimensional drug combinatory effect on myopathy.
Wang, Xueying; Zhang, Pengyue; Chiang, Chien-Wei; Wu, Hengyi; Shen, Li; Ning, Xia; Zeng, Donglin; Wang, Lei; Quinney, Sara K; Feng, Weixing; Li, Lang
2018-02-20
Drug-drug interactions (DDIs) are a common cause of adverse drug events (ADEs). The electronic medical record (EMR) database and the FDA's adverse event reporting system (FAERS) database are the major data sources for mining and testing ADE-associated DDI signals. Most DDI data mining methods focus on pair-wise drug interactions, and methods to detect high-dimensional DDIs in medical databases are lacking. In this paper, we propose 2 novel mixture drug-count response models for detecting high-dimensional drug combinations that induce myopathy. The "count" indicates the number of drugs in a combination. One model is called the fixed probability mixture drug-count response model with a maximum risk threshold (FMDRM-MRT). The other is called the count-dependent probability mixture drug-count response model with a maximum risk threshold (CMDRM-MRT), in which the mixture probability is count dependent. Compared with the previous mixture drug-count response model (MDRM) developed by our group, these 2 new models show a better likelihood in detecting high-dimensional drug combinatory effects on myopathy. CMDRM-MRT identified and validated (54; 374; 637; 442; 131) 2-way to 6-way drug interactions, respectively, which induce myopathy in both the EMR and FAERS databases. We further demonstrate that FAERS data capture a much higher maximum myopathy risk than EMR data do. The consistency of the 2 mixture models' parameters and local false discovery rate estimates is evaluated through statistical simulation studies.
High-dimensional decoy-state quantum key distribution over multicore telecommunication fibers
Cañas, G.; Vera, N.; Cariñe, J.; González, P.; Cardenas, J.; Connolly, P. W. R.; Przysiezna, A.; Gómez, E. S.; Figueroa, M.; Vallone, G.; Villoresi, P.; da Silva, T. Ferreira; Xavier, G. B.; Lima, G.
2017-08-01
Multiplexing is a strategy to augment the transmission capacity of a communication system. It consists of combining multiple signals over the same data channel and it has been very successful in classical communications. However, the use of enhanced channels has only reached limited practicality in quantum communications (QC) as it requires the manipulation of quantum systems of higher dimensions. Considerable effort is being made towards QC using high-dimensional quantum systems encoded into the transverse momentum of single photons, but so far no approach has been proven to be fully compatible with the existing telecommunication fibers. Here we overcome such a challenge and demonstrate a secure high-dimensional decoy-state quantum key distribution session over a 300-m-long multicore optical fiber. The high-dimensional quantum states are defined in terms of the transverse core modes available for the photon transmission over the fiber, and theoretical analyses show that positive secret key rates can be achieved through metropolitan distances.
Directory of Open Access Journals (Sweden)
Malgorzata Nowicka
2017-05-01
Full Text Available High dimensional mass and flow cytometry (HDCyto) experiments have become a method of choice for high throughput interrogation and characterization of cell populations. Here, we present an R-based pipeline for differential analyses of HDCyto data, largely based on Bioconductor packages. We computationally define cell populations using FlowSOM clustering, and facilitate an optional but reproducible strategy for manual merging of algorithm-generated clusters. Our workflow offers different analysis paths, including association of cell type abundance with a phenotype or changes in signaling markers within specific subpopulations, or differential analyses of aggregated signals. Importantly, the differential analyses we show are based on regression frameworks where the HDCyto data is the response; thus, we are able to model arbitrary experimental designs, such as those with batch effects, paired designs and so on. In particular, we apply generalized linear mixed models to analyses of cell population abundance or cell-population-specific analyses of signaling markers, allowing overdispersion in cell count or aggregated signals across samples to be appropriately modeled. To support the formal statistical analyses, we encourage exploratory data analysis at every step, including quality control (e.g. multi-dimensional scaling plots), reporting of clustering results (dimensionality reduction, heatmaps with dendrograms) and differential analyses (e.g. plots of aggregated signals).
Particle Swarm Optimization and regression analysis I
Mohanty, Souyma D.
2012-04-01
Particle Swarm Optimization (PSO) is now widely used in many problems that require global optimization of high-dimensional and highly multi-modal functions. However, PSO has not yet seen widespread use in astronomical data analysis even though optimization problems in this field have become increasingly complex. In this two-part article, we first provide an overview of the PSO method in the concrete context of a ubiquitous problem in astronomy, namely, regression analysis. In particular, we consider the problem of optimizing the placement of knots in regression based on cubic splines (spline smoothing). The second part will describe an in-depth investigation of PSO in some realistic data analysis challenges.
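Before applying PSO to knot placement, it helps to see the bare algorithm on a standard highly multi-modal test function. A minimal sketch (the inertia and acceleration coefficients are common textbook choices, not the article's settings):

```python
import numpy as np

def pso(f, dim, n_particles=30, n_iter=200, seed=0):
    """Minimal particle swarm optimizer: each particle is pulled toward its
    own best position (pbest) and the swarm's best (gbest), with an
    inertia term keeping it exploring."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-5, 5, size=(n_particles, dim))
    v = np.zeros_like(x)
    pbest = x.copy()
    pbest_val = np.array([f(p) for p in x])
    g = pbest[pbest_val.argmin()].copy()
    for _ in range(n_iter):
        r1, r2 = rng.uniform(size=(2, n_particles, dim))
        v = 0.7 * v + 1.5 * r1 * (pbest - x) + 1.5 * r2 * (g - x)
        x = x + v
        vals = np.array([f(p) for p in x])
        improved = vals < pbest_val
        pbest[improved], pbest_val[improved] = x[improved], vals[improved]
        g = pbest[pbest_val.argmin()].copy()
    return g, pbest_val.min()

def rastrigin(x):
    """Highly multi-modal benchmark; global minimum 0 at the origin."""
    return float(10 * len(x) + np.sum(x ** 2 - 10 * np.cos(2 * np.pi * x)))

best_x, best_val = pso(rastrigin, dim=2)
```

For spline smoothing, `f` would instead be the regression residual as a function of the knot positions, and each particle would encode one candidate knot placement.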
Regression analysis by example
National Research Council Canada - National Science Library
Chatterjee, Samprit; Hadi, Ali S
2012-01-01
The emphasis continues to be on exploratory data analysis rather than statistical theory. The coverage offers in-depth treatment of regression diagnostics, transformation, multicollinearity, logistic regression, and robust regression…
Weyer, Veronika; Binder, Harald
2015-09-15
High-dimensional molecular measurements, e.g. gene expression data, can be linked to clinical time-to-event endpoints by Cox regression models and regularized estimation approaches, such as componentwise boosting, and can incorporate a large number of covariates as well as provide variable selection. If there is heterogeneity due to known patient subgroups, a stratified Cox model allows for separate baseline hazards in each subgroup. Variable selection will still depend on the relative stratum sizes in the data, which might be a convenience sample and not representative for future applications. Such effects need to be systematically investigated and could even help to more reliably identify components of risk prediction signatures. Correspondingly, we propose a weighted regression approach based on componentwise likelihood-based boosting which is implemented in the R package CoxBoost (https://github.com/binderh/CoxBoost). This approach focuses on building a risk prediction signature for a specific stratum by down-weighting the observations from the other strata using a range of weights. Stability of selection for specific covariates as a function of the weights is investigated by resampling inclusion frequencies, and two types of corresponding visualizations are suggested. This is illustrated for two applications with methylation and gene expression measurements from cancer patients. The proposed approach is meant to point out components of risk prediction signatures that are specific to the stratum of interest and components that are also important to other strata. Performance is mostly improved by incorporating down-weighted information from the other strata. This suggests more general usefulness for risk prediction signature development in data with heterogeneity due to known subgroups.
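The core mechanism — componentwise boosting in which observations from the other strata enter with reduced weights — can be sketched for a plain linear model. (The package works with Cox models for survival endpoints; this simplified Gaussian version, and all data and weights in it, are illustrative only.)

```python
import numpy as np

def weighted_boost(X, y, w, n_steps=300, nu=0.1):
    """Componentwise L2 boosting with observation weights w: each step fits
    every single covariate to the current residual by weighted least
    squares and updates only the best one by a small step nu."""
    n, p = X.shape
    beta = np.zeros(p)
    resid = y.astype(float).copy()
    for _ in range(n_steps):
        num = X.T @ (w * resid)        # weighted covariate-residual products
        den = (X ** 2).T @ w
        gain = num ** 2 / den          # weighted RSS reduction per covariate
        j = int(np.argmax(gain))
        beta[j] += nu * num[j] / den[j]
        resid -= nu * (num[j] / den[j]) * X[:, j]
    return beta

rng = np.random.default_rng(0)
n = 400
stratum = np.repeat([0, 1], n // 2)
X = rng.normal(size=(n, 10))
# covariate 0 matters only in stratum 0, covariate 1 only in stratum 1
y = np.where(stratum == 0, X[:, 0], X[:, 1]) + 0.1 * rng.normal(size=n)
weights = np.where(stratum == 0, 1.0, 0.1)   # down-weight the other stratum
beta = weighted_boost(X, y, weights)
```

With stratum 1 down-weighted, the selected signature concentrates on covariate 0, the effect specific to the stratum of interest, while the other stratum still contributes a little information, mirroring the weighting strategy of the paper.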
Nonlinear Forecasting With Many Predictors Using Kernel Ridge Regression
DEFF Research Database (Denmark)
Exterkate, Peter; Groenen, Patrick J.F.; Heij, Christiaan
This paper puts forward kernel ridge regression as an approach for forecasting with many predictors that are related nonlinearly to the target variable. In kernel ridge regression, the observed predictor variables are mapped nonlinearly into a high-dimensional space, where estimation of the predictive regression model is based on a shrinkage estimator to avoid overfitting. We extend the kernel ridge regression methodology to enable its use for economic time-series forecasting, by including lags of the dependent variable or other individual variables as predictors, as typically desired in macroeconomic and financial applications. Monte Carlo simulations as well as an empirical application to various key measures of real economic activity confirm that kernel ridge regression can produce more accurate forecasts than traditional linear and nonlinear methods for dealing with many predictors.
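A compact numpy sketch of kernel ridge regression applied to one-step-ahead forecasting from a single lag. The toy nonlinear autoregression below is illustrative, not the paper's economic data, and the naive random-walk forecast serves as the linear baseline:

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """Gaussian (RBF) kernel matrix: the nonlinear map into the
    high-dimensional feature space is never formed explicitly."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * d2)

def krr_fit(X, y, lam=0.1, gamma=1.0):
    """Kernel ridge regression: the ridge term lam provides the shrinkage
    that guards against overfitting in the implicit feature space."""
    K = rbf_kernel(X, X, gamma)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def krr_predict(X_train, alpha, X_new, gamma=1.0):
    return rbf_kernel(X_new, X_train, gamma) @ alpha

# a toy nonlinear autoregression: next value is a nonlinear map of the last
rng = np.random.default_rng(0)
z = np.zeros(300)
for t in range(1, 300):
    z[t] = np.sin(2 * z[t - 1]) + 0.1 * rng.normal()
X, y = z[:-1, None], z[1:]                 # lagged value -> next value
alpha = krr_fit(X[:250], y[:250])
pred = krr_predict(X[:250], alpha, X[250:])
mse_krr = np.mean((pred - y[250:]) ** 2)
mse_naive = np.mean((X[250:, 0] - y[250:]) ** 2)   # random-walk forecast
```

With many predictors, `X` would simply gain columns (further lags, other series); the kernel trick keeps the computation a function of the number of observations, not of the number of predictors.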
On-chip generation of high-dimensional entangled quantum states and their coherent control.
Kues, Michael; Reimer, Christian; Roztocki, Piotr; Cortés, Luis Romero; Sciara, Stefania; Wetzel, Benjamin; Zhang, Yanbing; Cino, Alfonso; Chu, Sai T; Little, Brent E; Moss, David J; Caspani, Lucia; Azaña, José; Morandotti, Roberto
2017-06-28
Optical quantum states based on entangled photons are essential for solving questions in fundamental physics and are at the heart of quantum information science. Specifically, the realization of high-dimensional states (D-level quantum systems, that is, qudits, with D > 2) and their control are necessary for fundamental investigations of quantum mechanics, for increasing the sensitivity of quantum imaging schemes, for improving the robustness and key rate of quantum communication protocols, for enabling a richer variety of quantum simulations, and for achieving more efficient and error-tolerant quantum computation. Integrated photonics has recently become a leading platform for the compact, cost-efficient, and stable generation and processing of non-classical optical states. However, so far, integrated entangled quantum sources have been limited to qubits (D = 2). Here we demonstrate on-chip generation of entangled qudit states, where the photons are created in a coherent superposition of multiple high-purity frequency modes. In particular, we confirm the realization of a quantum system with at least one hundred dimensions, formed by two entangled qudits with D = 10. Furthermore, using state-of-the-art, yet off-the-shelf telecommunications components, we introduce a coherent manipulation platform with which to control frequency-entangled states, capable of performing deterministic high-dimensional gate operations. We validate this platform by measuring Bell inequality violations and performing quantum state tomography. Our work enables the generation and processing of high-dimensional quantum states in a single spatial mode.
Sparse redundancy analysis of high-dimensional genetic and genomic data.
Csala, Attila; Voorbraak, Frans P J M; Zwinderman, Aeilko H; Hof, Michel H
2017-10-15
Recent technological developments have enabled genetic and genomic integrated data analysis approaches, where multiple omics datasets from various biological levels are combined and used to describe (disease) phenotypic variations. The main goal is to explain and ultimately predict phenotypic variations by understanding their genetic basis and the interaction of the associated genetic factors. Therefore, understanding the underlying genetic mechanisms of phenotypic variations is an ever increasing research interest in biomedical sciences. In many situations, we have a set of variables that can be considered to be the outcome variables and a set that can be considered to be explanatory variables. Redundancy analysis (RDA) is an analytic method to deal with this type of directionality. Unfortunately, current implementations of RDA cannot deal optimally with the high dimensionality of omics data (p≫n). The existing theoretical framework, based on Ridge penalization, is suboptimal, since it includes all variables in the analysis. As a solution, we propose to use Elastic Net penalization in an iterative RDA framework to obtain a sparse solution: sparse redundancy analysis (sRDA) for high dimensional omics data analysis. We conducted simulation studies with our software implementation of sRDA to assess its reliability. Both the analysis of simulated data and the analysis of 485,512 methylation markers and 18,424 gene-expression values measured in a set of 55 patients with Marfan syndrome show that sRDA is able to deal with the usual high dimensionality of omics data. The software is available at http://uva.csala.me/rda.
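The flavor of a sparse latent-variable extraction can be conveyed with a soft-thresholded power iteration on the predictor-outcome cross-covariance. This is a deliberately simplified stand-in for sRDA's Elastic Net iterations, with made-up data in which only two predictors drive the outcome set:

```python
import numpy as np

def soft(x, t):
    """Soft-thresholding operator, the proximal map of the L1 penalty."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def sparse_latent(X, Y, l1=0.4, n_iter=100):
    """Alternate between outcome loadings v and L1-penalized predictor
    weights w on the cross-covariance X'Y, so only a few predictors
    enter the explanatory latent variable."""
    C = X.T @ Y                                 # cross-covariance
    v = np.ones(Y.shape[1]) / np.sqrt(Y.shape[1])
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        s = C @ v
        w = soft(s, l1 * np.abs(s).max())       # sparse predictor weights
        w /= np.linalg.norm(w)
        v = C.T @ w
        v /= np.linalg.norm(v)
    return w, v

rng = np.random.default_rng(0)
n, p = 200, 20
X = rng.normal(size=(n, p))
u = X[:, 0] + X[:, 1]                           # only two predictors matter
Y = np.outer(u, [1.0, 0.5, -0.5]) + 0.5 * rng.normal(size=(n, 3))
w, v = sparse_latent(X, Y)
```

The recovered weight vector `w` is supported on the two informative predictors, which is the behavior that lets sRDA pick a handful of methylation markers out of hundreds of thousands.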
Efficient Estimation of first Passage Probability of high-Dimensional Nonlinear Systems
DEFF Research Database (Denmark)
Sichani, Mahdi Teimouri; Nielsen, Søren R.K.; Bucher, Christian
2011-01-01
An efficient method for estimating low first passage probabilities of high-dimensional nonlinear systems based on asymptotic estimation of low probabilities is presented. The method does not require any a priori knowledge of the system, i.e. it is a black-box method, and has very low requirements… The failure probabilities of three well-known nonlinear systems are estimated. Next, a reduced degree-of-freedom model of a wind turbine is developed and is exposed to a turbulent wind field. The model incorporates very high dimensions and strong nonlinearities simultaneously. The failure probability…
CSIR Research Space (South Africa)
Giovannini, D
2013-06-01
Full Text Available: QELS_Fundamental Science, San Jose, California, United States, 9-14 June 2013. "Reconstruction of High-Dimensional States Entangled in Orbital Angular Momentum Using Mutually Unbiased Measurements", D. Giovannini, J. Romero, J. Leach, A. Dudley, A. Forbes and M. J. Padgett. Affiliations: School of Physics and Astronomy, SUPA, University of Glasgow, Glasgow G12 8QQ, United Kingdom; Department of Physics, SUPA, University of Strathclyde, Glasgow G4 0NG, United Kingdom; School of Engineering…
Computing and visualizing time-varying merge trees for high-dimensional data
Energy Technology Data Exchange (ETDEWEB)
Oesterling, Patrick [Univ. of Leipzig (Germany); Heine, Christian [Univ. of Kaiserslautern (Germany); Weber, Gunther H. [Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States); Morozov, Dmitry [Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States); Scheuermann, Gerik [Univ. of Leipzig (Germany)
2017-06-03
We introduce a new method that identifies and tracks features in arbitrary dimensions using the merge tree -- a structure for identifying topological features based on thresholding in scalar fields. This method analyzes the evolution of features of the function by tracking changes in the merge tree and relates features by matching subtrees between consecutive time steps. Using the time-varying merge tree, we present a structural visualization of the changing function that illustrates both features and their temporal evolution. We demonstrate the utility of our approach by applying it to temporal cluster analysis of high-dimensional point clouds.
High-dimensional nonlinear diffusion stochastic processes modelling for engineering applications
Mamontov, Yevgeny
2001-01-01
This book is the first one devoted to high-dimensional (or large-scale) diffusion stochastic processes (DSPs) with nonlinear coefficients. These processes are closely associated with nonlinear Ito's stochastic ordinary differential equations (ISODEs) and with the space-discretized versions of nonlinear Ito's stochastic partial integro-differential equations. The latter models include Ito's stochastic partial differential equations (ISPDEs). The book presents a new analytical treatment which can serve as the basis of a combined, analytical-numerical approach to greater computational efficiency.
Efficient and accurate nearest neighbor and closest pair search in high-dimensional space
Tao, Yufei
2010-07-01
Nearest Neighbor (NN) search in high-dimensional space is an important problem in many applications. From the database perspective, a good solution needs to have two properties: (i) it can be easily incorporated in a relational database, and (ii) its query cost should increase sublinearly with the dataset size, regardless of the data and query distributions. Locality-Sensitive Hashing (LSH) is a well-known methodology fulfilling both requirements, but its current implementations either incur expensive space and query cost, or abandon its theoretical guarantee on the quality of query results. Motivated by this, we improve LSH by proposing an access method called the Locality-Sensitive B-tree (LSB-tree) to enable fast, accurate, high-dimensional NN search in relational databases. The combination of several LSB-trees forms an LSB-forest that has strong quality guarantees, but improves dramatically the efficiency of the previous LSH implementation having the same guarantees. In practice, the LSB-tree itself is also an effective index which consumes linear space, supports efficient updates, and provides accurate query results. In our experiments, the LSB-tree was faster than: (i) iDistance (a famous technique for exact NN search) by two orders of magnitude, and (ii) MedRank (a recent approximate method with nontrivial quality guarantees) by one order of magnitude, and meanwhile returned much better results. As a second step, we extend our LSB technique to solve another classic problem, called Closest Pair (CP) search, in high-dimensional space. The long-term challenge for this problem has been to achieve subquadratic running time at very high dimensionalities, which most of the existing solutions fail to do. We show that, using an LSB-forest, CP search can be accomplished in (worst-case) time significantly lower than the quadratic complexity, yet still ensuring very good quality. In practice, accurate answers can be found using just two LSB-trees, thus giving a substantial…
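The classical LSH scheme that the LSB-tree builds on uses random-projection hash functions h(x) = floor((a·x + b)/w), so that nearby points collide in at least one hash table with high probability. A minimal in-memory sketch (not the LSB-tree itself; table counts and bucket width are illustrative):

```python
import numpy as np

class LshIndex:
    """Toy E2LSH-style index: each table hashes a point with several
    random-projection hash functions; a query collects candidates from
    colliding buckets and ranks them exactly."""

    def __init__(self, dim, n_tables=8, n_projs=4, w=1.0, seed=0):
        rng = np.random.default_rng(seed)
        self.a = rng.normal(size=(n_tables, n_projs, dim))
        self.b = rng.uniform(0, w, size=(n_tables, n_projs))
        self.w = w
        self.tables = [dict() for _ in range(n_tables)]
        self.points = None

    def _keys(self, x):
        h = np.floor((self.a @ x + self.b) / self.w).astype(int)
        return [tuple(row) for row in h]

    def build(self, points):
        self.points = points
        for i, x in enumerate(points):
            for table, key in zip(self.tables, self._keys(x)):
                table.setdefault(key, []).append(i)

    def query(self, q):
        """Return the index of the nearest candidate found in any table."""
        cand = set()
        for table, key in zip(self.tables, self._keys(q)):
            cand.update(table.get(key, []))
        if not cand:
            return None
        cand = list(cand)
        d = np.linalg.norm(self.points[cand] - q, axis=1)
        return cand[int(np.argmin(d))]

rng = np.random.default_rng(0)
pts = rng.uniform(-10, 10, size=(200, 8))
index = LshIndex(dim=8)
index.build(pts)
nearest = index.query(pts[3] + 0.001)   # a query very close to point 3
```

The LSB-tree's contribution is to organize such hash values so that a relational database's B-tree machinery can serve the buckets with guaranteed quality, which this flat dictionary version does not attempt.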
Stochastic Neural Network Approach for Learning High-Dimensional Free Energy Surfaces
Schneider, Elia; Dai, Luke; Topper, Robert Q.; Drechsel-Grau, Christof; Tuckerman, Mark E.
2017-10-01
The generation of free energy landscapes corresponding to conformational equilibria in complex molecular systems remains a significant computational challenge. Adding to this challenge is the need to represent, store, and manipulate the often high-dimensional surfaces that result from rare-event sampling approaches employed to compute them. In this Letter, we propose the use of artificial neural networks as a solution to these issues. Using specific examples, we discuss network training using enhanced-sampling methods and the use of the networks in the calculation of ensemble averages.
Inferring biological tasks using Pareto analysis of high-dimensional data.
Hart, Yuval; Sheftel, Hila; Hausser, Jean; Szekely, Pablo; Ben-Moshe, Noa Bossel; Korem, Yael; Tendler, Avichai; Mayo, Avraham E; Alon, Uri
2015-03-01
We present the Pareto task inference method (ParTI; http://www.weizmann.ac.il/mcb/UriAlon/download/ParTI) for inferring biological tasks from high-dimensional biological data. Data are described as a polytope, and features maximally enriched closest to the vertices (or archetypes) allow identification of the tasks the vertices represent. We demonstrate that human breast tumors and mouse tissues are well described by tetrahedrons in gene expression space, with specific tumor types and biological functions enriched at each of the vertices, suggesting four key tasks.
Ceotto, Michele; Di Liberto, Giovanni; Conte, Riccardo
2017-07-01
A new semiclassical "divide-and-conquer" method is presented with the aim of demonstrating that quantum dynamics simulations of high dimensional molecular systems are doable. The method is first tested by calculating the quantum vibrational power spectra of water, methane, and benzene—three molecules of increasing dimensionality for which benchmark quantum results are available—and then applied to C60, a system characterized by 174 vibrational degrees of freedom. Results show that the approach can accurately account for quantum anharmonicities, purely quantum features like overtones, and the removal of degeneracy when the molecular symmetry is broken.
DEFF Research Database (Denmark)
Johansen, Søren
2008-01-01
The reduced rank regression model is a multivariate regression model whose coefficient matrix has reduced rank. The reduced rank regression algorithm is an estimation procedure for this model. It is related to canonical correlations and involves calculating eigenvalues and eigenvectors. We give a number of different applications to regression and time series analysis, and show how the reduced rank regression estimator can be derived as a Gaussian maximum likelihood estimator. We briefly mention asymptotic results.
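The reduced rank regression estimator itself is short to state in code: fit ordinary least squares, then project the fitted values onto their top-r principal directions. The numpy sketch below assumes identity error weighting and synthetic data; it illustrates the estimator, not the paper's maximum-likelihood derivation.

```python
import numpy as np

def reduced_rank_regression(X, Y, r):
    """Rank-r least-squares estimate: project the OLS fit onto the
    top-r right singular directions of the fitted values X @ B_ols."""
    B_ols, *_ = np.linalg.lstsq(X, Y, rcond=None)
    _, _, Vt = np.linalg.svd(X @ B_ols, full_matrices=False)
    P = Vt[:r].T @ Vt[:r]          # rank-r projection in response space
    return B_ols @ P

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
B_true = rng.normal(size=(5, 1)) @ rng.normal(size=(1, 4))  # true rank 1
Y = X @ B_true + 0.01 * rng.normal(size=(100, 4))
B_hat = reduced_rank_regression(X, Y, r=1)
```

With low noise, the rank-1 estimate recovers the true rank-1 coefficient matrix closely.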
Nosofsky, Robert M; Sanders, Craig A; McDaniel, Mark A
2017-10-23
Experiments were conducted in which novice participants learned to classify pictures of rocks into real-world, scientifically defined categories. The experiments manipulated the distribution of training instances during an initial study phase, and then tested for correct classification and generalization performance during a transfer phase. The similarity structure of the to-be-learned categories was also manipulated across the experiments. A low-parameter version of an exemplar-memory model, used in combination with a high-dimensional feature-space representation for the rock stimuli, provided good overall accounts of the categorization data. The successful accounts included (a) predicting how performance on individual item types within the categories varied with the distributions of training examples, (b) predicting the overall levels of classification accuracy across the different rock categories, and (c) predicting the patterns of between-category confusions that arose when classification errors were made. The work represents a promising initial step in scaling up the application of formal models of perceptual classification learning to complex natural-category domains. We discuss further steps for making use of the model and its associated feature-space representation to search for effective techniques of teaching categories in the science classroom. (PsycINFO Database Record (c) 2017 APA, all rights reserved).
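The exemplar-memory model referred to here is in the spirit of the Generalized Context Model: classification probabilities come from summed, exponentially decaying similarity to stored category exemplars. A minimal sketch with invented 2-D "rock" feature coordinates (the sensitivity parameter c and all coordinates are illustrative):

```python
import math

def gcm_prob(probe, exemplars, c=2.0):
    """Category probabilities under an exemplar model: summed
    exponential-decay similarity to each category's stored exemplars."""
    sim = lambda a, b: math.exp(-c * math.dist(a, b))
    s = {cat: sum(sim(probe, e) for e in exs) for cat, exs in exemplars.items()}
    total = sum(s.values())
    return {cat: v / total for cat, v in s.items()}

# Hypothetical 2-D feature coordinates for two rock categories.
exemplars = {"igneous": [(0.0, 0.0), (0.2, 0.1)],
             "sedimentary": [(1.0, 1.0), (0.9, 0.8)]}
p = gcm_prob((0.1, 0.1), exemplars)
```

A probe near the igneous exemplars receives most of the probability mass, mirroring how training-item distributions shape generalization.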
Directory of Open Access Journals (Sweden)
Paul Brent Ferrell
Full Text Available The plasticity of AML drives poor clinical outcomes and confounds its longitudinal detection. However, the immediate impact of treatment on the leukemic and non-leukemic cells of the bone marrow and blood remains relatively understudied. Here, we conducted a pilot study of high dimensional longitudinal monitoring of immunophenotype in AML. To characterize changes in cell phenotype before, during, and immediately after induction treatment, we developed a 27-antibody panel for mass cytometry focused on surface diagnostic markers and applied it to 46 samples of blood or bone marrow tissue collected over time from 5 AML patients. Central goals were to determine whether changes in AML phenotype would be captured effectively by cytomic tools and to implement methods for describing the evolving phenotypes of AML cell subsets. Mass cytometry data were analyzed using established computational techniques. Within this pilot study, longitudinal immune monitoring with mass cytometry revealed fundamental changes in leukemia phenotypes that occurred over time during and after induction in the refractory disease setting. Persisting AML blasts became more phenotypically distinct from stem and progenitor cells due to expression of novel marker patterns that differed from pre-treatment AML cells and from all cell types observed in healthy bone marrow. This pilot study of single cell immune monitoring in AML represents a powerful tool for precision characterization and targeting of resistant disease.
Tikhonov, Mikhail; Monasson, Remi
2018-01-01
Much of our understanding of ecological and evolutionary mechanisms derives from analysis of low-dimensional models: with few interacting species, or few axes defining "fitness". It is not always clear to what extent the intuition derived from low-dimensional models applies to the complex, high-dimensional reality. For instance, most naturally occurring microbial communities are strikingly diverse, harboring a large number of coexisting species, each of which contributes to shaping the environment of others. Understanding the eco-evolutionary interplay in these systems is an important challenge, and an exciting new domain for statistical physics. Recent work identified a promising new platform for investigating highly diverse ecosystems, based on the classic resource competition model of MacArthur. Here, we describe how the same analytical framework can be used to study evolutionary questions. Our analysis illustrates how, at high dimension, the intuition promoted by a one-dimensional (scalar) notion of fitness can become misleading. Specifically, while the low-dimensional picture emphasizes organism cost or efficiency, we exhibit a regime where cost becomes irrelevant for survival, and link this observation to generic properties of high-dimensional geometry.
Selecting Optimal Feature Set in High-Dimensional Data by Swarm Search
Directory of Open Access Journals (Sweden)
Simon Fong
2013-01-01
Full Text Available Selecting the right set of features from data of high dimensionality for inducing an accurate classification model is a tough computational challenge. It is almost an NP-hard problem, as the combinations of features escalate exponentially as the number of features increases. Unfortunately in data mining, as well as other engineering applications and bioinformatics, some data are described by a long array of features. Many feature subset selection algorithms have been proposed in the past, but not all of them are effective. Since exhaustively trying every possible combination of features by brute force takes prohibitively long, stochastic optimization may be a solution. In this paper, we propose a new feature selection scheme called Swarm Search to find an optimal feature set by using metaheuristics. The advantage of Swarm Search is its flexibility in integrating any classifier into its fitness function and plugging in any metaheuristic algorithm to facilitate heuristic search. Simulation experiments are carried out by testing the Swarm Search over some high-dimensional datasets, with different classification algorithms and various metaheuristic algorithms. The comparative experiment results show that Swarm Search is able to attain relatively low error rates in classification without shrinking the size of the feature subset to its minimum.
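The wrapper design described, any metaheuristic driving any classifier-based fitness function, can be sketched as a tiny stochastic subset search. This toy stands in for the paper's actual metaheuristics; the drift and flip probabilities, particle count, and fitness function are all invented for illustration.

```python
import random

def swarm_search(n_features, fitness, n_particles=8, iters=30, seed=1):
    """Stochastic subset search: each particle's feature mask drifts toward
    the best-known mask, with occasional random exploration flips.
    `fitness` is pluggable, mirroring the paper's wrapper design."""
    rng = random.Random(seed)
    particles = [[rng.random() < 0.5 for _ in range(n_features)]
                 for _ in range(n_particles)]
    best = list(max(particles, key=fitness))
    for _ in range(iters):
        for p in particles:
            for j in range(n_features):
                if rng.random() < 0.3:        # drift toward global best
                    p[j] = best[j]
                elif rng.random() < 0.1:      # random exploration flip
                    p[j] = not p[j]
            if fitness(p) > fitness(best):
                best = list(p)
    return best

# Toy fitness: features 0-2 are informative, the rest only add cost.
def fitness(mask):
    return sum(mask[:3]) - 0.2 * sum(mask[3:])

best = swarm_search(10, fitness)
```

In a real wrapper, `fitness` would run cross-validated classification accuracy on the masked feature set.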
Using High-Dimensional Image Models to Perform Highly Undetectable Steganography
Pevný, Tomáš; Filler, Tomáš; Bas, Patrick
This paper presents a complete methodology for designing practical and highly undetectable stegosystems for real digital media. The main design principle is to minimize a suitably defined distortion by means of an efficient coding algorithm. The distortion is defined as a weighted difference of extended state-of-the-art feature vectors already used in steganalysis. This allows us to "preserve" the model used by the steganalyst and thus remain undetectable even for large payloads. This framework can be efficiently implemented even when the dimensionality of the feature set used by the embedder is larger than 10^7. The high-dimensional model is necessary to avoid known security weaknesses. Although high-dimensional models might be a problem in steganalysis, we explain why they are acceptable in steganography. As an example, we introduce HUGO, a new embedding algorithm for spatial-domain digital images, and we contrast its performance with LSB matching. On the BOWS2 image database and in contrast with LSB matching, HUGO allows the embedder to hide a 7× longer message at the same level of security.
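LSB matching, the baseline HUGO is compared against, is easy to state: when a cover pixel's least significant bit disagrees with the message bit, randomly add or subtract 1 rather than overwriting the bit. A self-contained sketch (pixel values and the message are invented; this is the baseline, not HUGO):

```python
import random

def lsb_match_embed(pixels, bits, seed=7):
    """LSB matching: where a pixel's LSB disagrees with the message bit,
    perturb the pixel by +/-1 at random (unlike LSB replacement, which
    forces the bit and leaves a detectable statistical fingerprint)."""
    rng = random.Random(seed)
    out = list(pixels)
    for i, bit in enumerate(bits):
        if out[i] % 2 != bit:
            step = rng.choice((-1, 1))
            if out[i] == 0:
                step = 1          # stay inside the valid range [0, 255]
            if out[i] == 255:
                step = -1
            out[i] += step
    return out

def extract(pixels, n):
    """The receiver simply reads the first n LSBs."""
    return [p % 2 for p in pixels[:n]]

cover = [120, 121, 0, 255, 42, 43]
msg = [1, 1, 1, 1, 0, 0]
stego = lsb_match_embed(cover, msg)
```

Every stego pixel stays within one grey level of its cover value, yet the message extracts exactly.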
A New Ensemble Method with Feature Space Partitioning for High-Dimensional Data Classification
Directory of Open Access Journals (Sweden)
Yongjun Piao
2015-01-01
Full Text Available Ensemble data mining methods, also known as classifier combination, are often used to improve the performance of classification. Various classifier combination methods such as bagging, boosting, and random forest have been devised and have received considerable attention in the past. However, data dimensionality is increasing rapidly, a trend that poses various challenges because these methods cannot be directly applied to high-dimensional datasets. In this paper, we propose an ensemble method for classification of high-dimensional data, with each classifier constructed from a different set of features determined by partitioning of redundant features. In our method, the redundancy of features is considered to divide the original feature space. Then, each generated feature subset is trained by a support vector machine, and the results of each classifier are combined by majority voting. The efficiency and effectiveness of our method are demonstrated through comparisons with other ensemble techniques, and the results show that our method outperforms other methods.
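The pipeline, partition the features, train one classifier per subset, combine by majority vote, can be sketched with a nearest-centroid model standing in for the per-subset SVM. The partition below is given by hand rather than computed from feature redundancy, and all data are invented.

```python
from collections import Counter, defaultdict

def train_subset(X, y, cols):
    """Nearest-centroid model on one feature subset
    (a tiny stand-in for the paper's per-subset SVM)."""
    sums, counts = defaultdict(lambda: [0.0] * len(cols)), Counter(y)
    for row, label in zip(X, y):
        for j, c in enumerate(cols):
            sums[label][j] += row[c]
    cents = {lab: [s / counts[lab] for s in vec] for lab, vec in sums.items()}
    def predict(row):
        d2 = lambda v: sum((row[c] - v[j]) ** 2 for j, c in enumerate(cols))
        return min(cents, key=lambda lab: d2(cents[lab]))
    return predict

def ensemble_predict(models, row):
    """Majority vote over the per-subset classifiers."""
    return Counter(m(row) for m in models).most_common(1)[0][0]

# Two feature subsets (e.g., produced by redundancy-based partitioning).
X = [[0, 0, 5, 5], [1, 0, 4, 5], [9, 9, 0, 1], [8, 9, 1, 0]]
y = ["a", "a", "b", "b"]
models = [train_subset(X, y, cols) for cols in ([0, 1], [2, 3])]
pred = ensemble_predict(models, [0.5, 0.2, 4.5, 5.0])
```

Each subset classifier sees only its own columns, so the vote aggregates complementary views of the feature space.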
An Efficient High Dimensional Cluster Method and its Application in Global Climate Sets
Directory of Open Access Journals (Sweden)
Ke Li
2007-10-01
Full Text Available Because of the development of modern-day satellites and other data acquisition systems, global climate research often involves an overwhelming volume and complexity of high dimensional datasets. As a data preprocessing and analysis method, clustering is playing an increasingly important role in this research. In this paper, we propose a spatial clustering algorithm that, to some extent, alleviates the curse of dimensionality in high dimensional clustering. The similarity measure of our algorithm is based on the number of top-k nearest neighbors that two grids share. The neighbors of each grid are computed based on the time series associated with each grid, and computing the nearest neighbors of an object is the most time-consuming step. Following Tobler's "First Law of Geography," we add a spatial window constraint upon each grid to restrict the number of grids considered, which greatly improves the efficiency of our algorithm. We apply this algorithm to a 100-year global climate dataset and partition the global surface into subareas under various spatial granularities. Experiments indicate that our spatial clustering algorithm works well.
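The shared-nearest-neighbor similarity at the heart of the algorithm is compact to express. The sketch below omits the spatial-window constraint and uses tiny invented time series; it only illustrates the similarity measure.

```python
def topk_neighbors(series, k):
    """For each grid cell, the k nearest cells by squared Euclidean
    distance between their associated time series."""
    n = len(series)
    d2 = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    return [sorted((j for j in range(n) if j != i),
                   key=lambda j: d2(series[i], series[j]))[:k]
            for i in range(n)]

def snn_similarity(nbrs, i, j):
    """Similarity = number of shared top-k nearest neighbors."""
    return len(set(nbrs[i]) & set(nbrs[j]))

# Six hypothetical grid-cell time series: two climate regimes of three cells.
series = [[1, 1, 1], [1, 2, 1], [2, 1, 1],
          [9, 9, 9], [9, 8, 9], [8, 9, 9]]
nbrs = topk_neighbors(series, k=2)
```

Cells from the same regime share neighbors; cells from different regimes share none, which is the signal the clustering exploits.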
High-Dimensional Single-Photon Quantum Gates: Concepts and Experiments
Babazadeh, Amin; Erhard, Manuel; Wang, Feiran; Malik, Mehul; Nouroozi, Rahman; Krenn, Mario; Zeilinger, Anton
2017-11-01
Transformations on quantum states form a basic building block of every quantum information system. From photonic polarization to two-level atoms, complete sets of quantum gates for a variety of qubit systems are well known. For multilevel quantum systems beyond qubits, the situation is more challenging. The orbital angular momentum modes of photons comprise one such high-dimensional system for which generation and measurement techniques are well studied. However, arbitrary transformations for such quantum states are not known. Here we experimentally demonstrate a four-dimensional generalization of the Pauli X gate and all of its integer powers on single photons carrying orbital angular momentum. Together with the well-known Z gate, this forms the first complete set of high-dimensional quantum gates implemented experimentally. The concept of the X gate is based on independent access to quantum states with different parities and can thus be generalized to other photonic degrees of freedom and potentially also to other quantum systems.
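The d-dimensional generalization of the Pauli X gate is the cyclic shift on basis states, |j⟩ → |j+1 mod d⟩, which satisfies X^d = I. A quick numeric check of both properties for d = 4 (plain matrix arithmetic, purely illustrative of the gate's algebra, not the optical implementation):

```python
def x_gate(d):
    """Generalized Pauli X in d dimensions: the cyclic shift |j> -> |j+1 mod d>."""
    return [[1 if row == (col + 1) % d else 0 for col in range(d)]
            for row in range(d)]

def matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def apply(M, v):
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(v))]

X = x_gate(4)
X4 = X
for _ in range(3):          # X^4
    X4 = matmul(X4, X)
identity = [[1 if i == j else 0 for j in range(4)] for i in range(4)]
```

Together with the diagonal Z gate (phases ω^j on the basis states), powers of this X generate the full set of generalized Pauli operations.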
High-dimensional quantum key distribution with the entangled single-photon-added coherent state
Energy Technology Data Exchange (ETDEWEB)
Wang, Yang [Zhengzhou Information Science and Technology Institute, Zhengzhou, 450001 (China); Synergetic Innovation Center of Quantum Information and Quantum Physics, University of Science and Technology of China, Hefei, Anhui 230026 (China); Bao, Wan-Su, E-mail: 2010thzz@sina.com [Zhengzhou Information Science and Technology Institute, Zhengzhou, 450001 (China); Synergetic Innovation Center of Quantum Information and Quantum Physics, University of Science and Technology of China, Hefei, Anhui 230026 (China); Bao, Hai-Ze; Zhou, Chun; Jiang, Mu-Sheng; Li, Hong-Wei [Zhengzhou Information Science and Technology Institute, Zhengzhou, 450001 (China); Synergetic Innovation Center of Quantum Information and Quantum Physics, University of Science and Technology of China, Hefei, Anhui 230026 (China)
2017-04-25
High-dimensional quantum key distribution (HD-QKD) can generate more secure bits per detection event, so it can achieve long-distance key distribution with a high secret key capacity. In this Letter, we present a decoy state HD-QKD scheme with the entangled single-photon-added coherent state (ESPACS) source. We present two tight formulas to estimate the single-photon fraction of postselected events and Eve's Holevo information, and derive lower bounds on the secret key capacity and the secret key rate of our protocol. We also present a finite-key analysis for our protocol by using the Chernoff bound. Our numerical results show that our protocol using one decoy state performs better than the previous HD-QKD protocol with spontaneous parametric down conversion (SPDC) using two decoy states. Moreover, when considering finite resources, the advantage is more obvious. Highlights: • Implements the single-photon-added coherent state source in high-dimensional quantum key distribution. • Enhances both the secret key capacity and the secret key rate compared with previous schemes. • Shows excellent performance in view of statistical fluctuations.
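The "more secure bits per detection event" claim is quantified by the log2(d) bits carried by a d-dimensional state, versus 1 bit for a qubit; the d-ary entropy below is the kind of error term that enters HD-QKD secret-fraction bounds. This is a generic illustration, not this paper's exact rate formula.

```python
import math

def ideal_capacity_bits(d):
    """Ideal information per detected photon for d-dimensional encoding
    (reduces to 1 bit/photon for the binary case d = 2)."""
    return math.log2(d)

def d_ary_entropy(q, d):
    """Entropy of a symmetric error of rate q spread uniformly over the
    d - 1 wrong outcomes; a typical ingredient of HD-QKD key-rate bounds."""
    if q == 0:
        return 0.0
    return -q * math.log2(q / (d - 1)) - (1 - q) * math.log2(1 - q)

cap4 = ideal_capacity_bits(4)    # four-dimensional states carry 2 bits/photon
```

For d = 2 the entropy function reduces to the familiar binary entropy, with its maximum of 1 bit at q = 0.5.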
Energy Efficient MAC Scheme for Wireless Sensor Networks with High-Dimensional Data Aggregate
Directory of Open Access Journals (Sweden)
Seokhoon Kim
2015-01-01
Full Text Available This paper presents a novel and sustainable medium access control (MAC) scheme for wireless sensor network (WSN) systems that process high-dimensional aggregated data. Based on a preamble signal and buffer threshold analysis, it maximizes the energy efficiency of wireless sensor devices, which have limited energy resources. The proposed group management MAC (GM-MAC) approach not only sets the buffer threshold value of a sensor device to be reciprocal to the preamble signal but also assigns a transmittable group value to each sensor device by using the preamble signal of the sink node. The primary difference between previous approaches and the proposed one is that existing state-of-the-art schemes use duty cycles and sleep modes to reduce the energy consumption of individual sensor devices, whereas the proposed scheme employs group management of sensor devices to maximize the overall energy efficiency of the whole WSN system by minimizing the energy consumption of sensor devices located near the sink node. Performance evaluations show that the proposed scheme outperforms previous schemes in terms of active time of sensor devices, transmission delay, control overhead, and energy consumption. Therefore, the proposed scheme is suitable for sensor devices in a variety of wireless sensor networking environments with high-dimensional data aggregation.
Pang, Herbert; Jung, Sin-Ho
2013-01-01
A variety of prediction methods are used to relate high-dimensional genome data with a clinical outcome using a prediction model. Once a prediction model is developed from a data set, it should be validated using a resampling method or an independent data set. Although the existing prediction methods have been intensively evaluated by many investigators, there has not been a comprehensive study investigating the performance of the validation methods, especially with a survival clinical outcome. Understanding the properties of the various validation methods can allow researchers to perform more powerful validations while controlling for type I error. In addition, sample size calculation strategy based on these validation methods is lacking. We conduct extensive simulations to examine the statistical properties of these validation strategies. In both simulations and a real data example, we have found that 10-fold cross-validation with permutation gave the best power while controlling type I error close to the nominal level. Based on this, we have also developed a sample size calculation method that will be used to design a validation study with a user-chosen combination of prediction. Microarray and genome-wide association studies data are used as illustrations. The power calculation method in this presentation can be used for the design of any biomedical studies involving high-dimensional data and survival outcomes. PMID:23471879
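The permutation component of the recommended validation can be illustrated outside the survival setting: shuffle the outcome labels many times and ask how often the shuffled score/outcome association is at least as strong as the observed one. This simplified binary-outcome sketch is not the authors' 10-fold CV procedure; the scores, labels, and test statistic are invented.

```python
import random
import statistics

def permutation_pvalue(scores, labels, n_perm=2000, seed=3):
    """Permutation test for a fitted risk score: p-value is the fraction of
    label shuffles whose association is at least as strong as observed
    (with the usual +1 correction)."""
    def assoc(lab):
        g1 = [s for s, l in zip(scores, lab) if l == 1]
        g0 = [s for s, l in zip(scores, lab) if l == 0]
        return abs(statistics.mean(g1) - statistics.mean(g0))
    rng = random.Random(seed)
    observed = assoc(labels)
    hits = 0
    for _ in range(n_perm):
        shuffled = labels[:]
        rng.shuffle(shuffled)
        hits += assoc(shuffled) >= observed
    return (hits + 1) / (n_perm + 1)

scores = [0.9, 0.8, 0.85, 0.7, 0.2, 0.1, 0.3, 0.15]
labels = [1, 1, 1, 1, 0, 0, 0, 0]
p = permutation_pvalue(scores, labels)
```

Because the score separates the two outcome groups maximally here, only the two extreme label arrangements match it, and the p-value lands well below 0.05.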
Yue, Mu; Li, Jialiang
2017-05-18
Motivated by risk prediction studies with ultra-high dimensional biomarkers, we propose a novel improvement screening methodology. Accurate risk prediction can be quite useful for patient treatment selection, prevention strategy, or disease management in evidence-based medicine. The question of how to choose new markers in addition to the conventional ones is especially important. In the past decade, a number of new measures for quantifying the added value from new markers were proposed, among which the integrated discrimination improvement (IDI) and net reclassification improvement (NRI) stand out. Meanwhile, C-statistics are routinely used to quantify the capacity of the estimated risk score in discriminating among subjects with different event times. In this paper, we examine these improvement statistics as well as the norm-based approach for evaluating the incremental value of new markers and compare these four measures by analyzing ultra-high dimensional censored survival data. In particular, we consider Cox proportional hazards models with varying coefficients. All measures perform very well in simulations, and we illustrate our methods in an application to a lung cancer study.
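IDI and NRI have simple closed forms for binary outcomes, which is the easiest place to see what they measure; the survival and varying-coefficient versions in the paper are more involved. The risk scores below are invented for illustration.

```python
def idi(old, new, labels):
    """Integrated discrimination improvement: the gain in mean-risk
    separation between events and non-events when moving from the old
    risk score to the new one."""
    mean = lambda xs: sum(xs) / len(xs)
    sep = lambda risk: (mean([r for r, y in zip(risk, labels) if y == 1])
                        - mean([r for r, y in zip(risk, labels) if y == 0]))
    return sep(new) - sep(old)

def nri(old, new, labels):
    """Continuous net reclassification improvement: events should be
    reclassified upward, non-events downward."""
    sign = lambda x: (x > 0) - (x < 0)
    ev = [sign(n - o) for o, n, y in zip(old, new, labels) if y == 1]
    ne = [sign(n - o) for o, n, y in zip(old, new, labels) if y == 0]
    return sum(ev) / len(ev) - sum(ne) / len(ne)

old = [0.5, 0.6, 0.5, 0.4]     # conventional-marker risk scores
new = [0.8, 0.9, 0.2, 0.1]     # scores after adding the new marker
labels = [1, 1, 0, 0]
```

Here the new marker pushes both events up and both non-events down, so the continuous NRI attains its maximum of 2.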
Hierarchical classification of microorganisms based on high-dimensional phenotypic data.
Tafintseva, Valeria; Vigneau, Evelyne; Shapaval, Volha; Cariou, Véronique; Qannari, El Mostafa; Kohler, Achim
2017-11-09
The classification of microorganisms by high-dimensional phenotyping methods such as FTIR spectroscopy is often a complicated process due to the complexity of microbial phylogenetic taxonomy. A hierarchical structure developed for such data can often facilitate the classification analysis. The hierarchical tree structure can either be imposed to a given set of phenotypic data by integrating the phylogenetic taxonomic structure or set up by revealing the inherent clusters in the phenotypic data. In this study, we wanted to compare different approaches to hierarchical classification of microorganisms based on high-dimensional phenotypic data. A set of 19 different species of moulds (filamentous fungi) obtained from the mycological strain collection of the Norwegian Veterinary Institute (Oslo, Norway) is used for the study. Hierarchical cluster analysis is performed for setting up the classification trees. Classification algorithms such as Artificial Neural Networks (ANN), Partial Least Squared Discriminant Analysis (PLSDA), and Random Forest (RF) are used and compared. The two methods ANN and RF outperformed all the other approaches even though they did not utilize predefined hierarchical structure. To our knowledge, the Random Forest approach is used here for the first time to classify microorganisms by FTIR spectroscopy. This article is protected by copyright. All rights reserved.
Regression analysis by example
Chatterjee, Samprit
2012-01-01
Praise for the Fourth Edition: ""This book is . . . an excellent source of examples for regression analysis. It has been and still is readily readable and understandable."" -Journal of the American Statistical Association Regression analysis is a conceptually simple method for investigating relationships among variables. Carrying out a successful application of regression analysis, however, requires a balance of theoretical results, empirical rules, and subjective judgment. Regression Analysis by Example, Fifth Edition has been expanded
Flexible survival regression modelling
DEFF Research Database (Denmark)
Cortese, Giuliana; Scheike, Thomas H; Martinussen, Torben
2009-01-01
Regression analysis of survival data, and more generally event history data, is typically based on Cox's regression model. We here review some recent methodology, focusing on the limitations of Cox's regression model. The key limitation is that the model is not well suited to represent time-varying...
High dimensional biological data retrieval optimization with NoSQL technology.
Wang, Shicai; Pandis, Ioannis; Wu, Chao; He, Sijin; Johnson, David; Emam, Ibrahim; Guitton, Florian; Guo, Yike
2014-01-01
High-throughput transcriptomic data generated by microarray experiments is the most abundant and frequently stored kind of data currently used in translational medicine studies. Although microarray data is supported in data warehouses such as tranSMART, queries over relational databases for hundreds of different patient gene expression records are slow. Non-relational data models, such as the key-value model implemented in NoSQL databases, promise to be more performant solutions. Our motivation is to improve the performance of the tranSMART data warehouse with a view to supporting Next Generation Sequencing data. In this paper we introduce a new data model better suited for high-dimensional data storage and querying, optimized for database scalability and performance. We have designed a key-value pair data model to support faster queries over large-scale microarray data and implemented the model using HBase, an implementation of Google's BigTable storage system. An experimental performance comparison was carried out against the traditional relational data model implemented in both MySQL Cluster and MongoDB, using a large publicly available transcriptomic data set taken from NCBI GEO concerning Multiple Myeloma. Our new key-value data model implemented on HBase exhibits an average 5.24-fold increase in high-dimensional biological data query performance compared to the relational model implemented on MySQL Cluster, and an average 6.47-fold increase in query performance on MongoDB. The performance evaluation found that the new key-value data model, in particular its implementation in HBase, outperforms the relational model currently implemented in tranSMART. We propose that NoSQL technology holds great promise for large-scale data management, in particular for high-dimensional biological data such as that demonstrated in the performance evaluation described in this paper. We aim to use this new data model as a basis for migrating
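The key-value modelling idea, composing the row key from patient and gene so that point lookups replace relational scans, can be mimicked with a plain dictionary. This illustrates the schema shape only, not the HBase implementation; the key format, patient IDs, and gene names are invented.

```python
def to_key_value(expression_rows):
    """Flatten (patient, gene, value) rows into one key-value map,
    mimicking an HBase-style composite row key of 'patient:gene'."""
    return {f"{p}:{g}": v for p, g, v in expression_rows}

def fetch_gene(store, patients, gene):
    """Point lookups by composed key instead of a relational scan/join;
    missing patients are simply skipped."""
    return {p: store[f"{p}:{gene}"] for p in patients
            if f"{p}:{gene}" in store}

rows = [("pt1", "TP53", 7.2), ("pt1", "BRCA1", 3.1),
        ("pt2", "TP53", 6.8), ("pt2", "BRCA1", 2.9)]
store = to_key_value(rows)
tp53 = fetch_gene(store, ["pt1", "pt2", "pt3"], "TP53")
```

In HBase the same composite key buys sorted, contiguous storage per patient, which is what makes large-scale range reads cheap.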
Decorrelation of the True and Estimated Classifier Errors in High-Dimensional Settings
Directory of Open Access Journals (Sweden)
Hua Jianping
2007-01-01
Full Text Available The aim of many microarray experiments is to build discriminatory diagnosis and prognosis models. Given the huge number of features and the small number of examples, model validity, which refers to the precision of error estimation, is a critical issue. Previous studies have addressed this issue via the deviation distribution (estimated error minus true error), in particular, the deterioration of cross-validation precision in high-dimensional settings where feature selection is used to mitigate the peaking phenomenon (overfitting). Because classifier design is based upon random samples, both the true and estimated errors are sample-dependent random variables, and one would expect a loss of precision if the estimated and true errors are not well correlated, so that natural questions arise as to the degree of correlation and the manner in which lack of correlation impacts error estimation. We demonstrate the effect of correlation on error precision via a decomposition of the variance of the deviation distribution, observe that the correlation is often severely decreased in high-dimensional settings, and show that the effect of high dimensionality on error estimation tends to result more from its decorrelating effects than from its impact on the variance of the estimated error. We consider the correlation between the true and estimated errors under different experimental conditions using both synthetic and real data, several feature-selection methods, different classification rules, and three commonly used error estimators (leave-one-out cross-validation, k-fold cross-validation, and the .632 bootstrap). Moreover, three scenarios are considered: (1) feature selection, (2) known feature set, and (3) all features. Only the first is of practical interest; however, the other two are needed for comparison purposes. We will observe that the true and estimated errors tend to be much more correlated in the case of a known feature set than with either feature selection
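The variance decomposition driving the argument is Var(est − true) = Var(est) + Var(true) − 2ρ·σ_est·σ_true, so a drop in the correlation ρ directly inflates the deviation variance even when the two marginal variances are unchanged:

```python
import math

def deviation_variance(var_est, var_true, corr):
    """Variance of (estimated - true) error via the standard identity
    Var(A - B) = Var(A) + Var(B) - 2*corr*sd(A)*sd(B).
    Decorrelation (small corr) inflates the deviation variance."""
    return var_est + var_true - 2 * corr * math.sqrt(var_est * var_true)

# Same marginal variances; only the correlation differs.
well = deviation_variance(0.01, 0.01, 0.9)   # well-correlated setting
poor = deviation_variance(0.01, 0.01, 0.1)   # decorrelated (high-dim) setting
```

With equal marginal variances, dropping ρ from 0.9 to 0.1 inflates the deviation variance ninefold, which is exactly the high-dimensional effect the paper documents.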
An unsupervised feature extraction method for high dimensional image data compaction
Ghassemian, Hassan; Landgrebe, David
1987-01-01
A new on-line unsupervised feature extraction method for high-dimensional remotely sensed image data compaction is presented. This method can be utilized to solve the problem of data redundancy in scene representation by satellite-borne high resolution multispectral sensors. The algorithm first partitions the observation space into an exhaustive set of disjoint objects. Then, pixels that belong to an object are characterized by an object feature. Finally, the set of object features is used for data transmission and classification. The example results show that the performance with the compacted features provides a slight improvement in classification accuracy rather than any degradation. Also, the information extraction method does not need to be preceded by a data decompaction.
PyDREAM: High-dimensional parameter inference for biological models in Python.
Shockley, Erin M; Vrugt, Jasper A; Lopez, Carlos F
2017-10-04
Biological models contain many parameters whose values are difficult to measure directly via experimentation and therefore require calibration against experimental data. Markov chain Monte Carlo (MCMC) methods are suitable to estimate multivariate posterior model parameter distributions, but these methods may exhibit slow or premature convergence in high-dimensional search spaces. Here, we present PyDREAM, a Python implementation of the (Multiple-Try) Differential Evolution Adaptive Metropolis (DREAM(ZS)) algorithm developed by Vrugt and ter Braak (2008) and Laloy and Vrugt (2012). PyDREAM achieves excellent performance for complex, parameter-rich models and takes full advantage of distributed computing resources, facilitating parameter inference and uncertainty estimation of CPU-intensive biological models. PyDREAM is freely available under the GNU GPLv3 license from the Lopez lab GitHub repository at http://github.com/LoLab-VU/PyDREAM. c.lopez@vanderbilt.edu. Supplementary data are available at Bioinformatics online.
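The MCMC machinery PyDREAM builds on can be illustrated with plain random-walk Metropolis; DREAM(ZS) adds adaptive, multi-chain differential-evolution proposals on top of this idea. The 1-D standard-normal log-posterior below is a stand-in for a real biological model, and all tuning values are arbitrary.

```python
import math
import random

def metropolis(log_post, x0, n_steps=5000, step=0.5, seed=11):
    """Plain random-walk Metropolis sampler: propose a Gaussian step,
    accept with probability min(1, exp(new_logpost - old_logpost))."""
    rng = random.Random(seed)
    x, lp = x0, log_post(x0)
    samples = []
    for _ in range(n_steps):
        cand = x + rng.gauss(0, step)
        lp_cand = log_post(cand)
        if rng.random() < math.exp(min(0.0, lp_cand - lp)):
            x, lp = cand, lp_cand
        samples.append(x)
    return samples

# Hypothetical 1-D posterior: standard normal log-density (up to a constant).
samples = metropolis(lambda x: -0.5 * x * x, x0=0.0)
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
```

The sample mean and variance approach the target's 0 and 1; in high-dimensional parameter spaces this naive proposal mixes poorly, which is precisely the regime DREAM-style adaptive samplers address.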
DEFF Research Database (Denmark)
Ding, Yunhong; Bacco, Davide; Dalgaard, Kjeld
2017-01-01
Quantum key distribution provides an efficient means to exchange information in an unconditionally secure way. Historically, quantum key distribution protocols have been based on binary signal formats, such as two polarization states, and the transmitted information efficiency of the quantum key is intrinsically limited to 1 bit/photon. Here we propose and experimentally demonstrate, for the first time, a high-dimensional quantum key distribution protocol based on space division multiplexing in multicore fiber using silicon photonic integrated lightwave circuits. The protocol encodes information in high-dimensional quantum states, and enables breaking the information efficiency limit of traditional quantum key distribution protocols. In addition, the silicon photonic circuits used in our work integrate variable optical attenuators, highly efficient multicore fiber couplers, and Mach-Zehnder interferometers. We successfully realized three mutually...
High-dimensional single-cell analysis reveals the immune signature of narcolepsy.
Hartmann, Felix J; Bernard-Valnet, Raphaël; Quériault, Clémence; Mrdjen, Dunja; Weber, Lukas M; Galli, Edoardo; Krieg, Carsten; Robinson, Mark D; Nguyen, Xuan-Hung; Dauvilliers, Yves; Liblau, Roland S; Becher, Burkhard
2016-11-14
Narcolepsy type 1 is a devastating neurological sleep disorder resulting from the destruction of orexin-producing neurons in the central nervous system (CNS). Despite its striking association with the HLA-DQB1*06:02 allele, the autoimmune etiology of narcolepsy has remained largely hypothetical. Here, we compared peripheral mononucleated cells from narcolepsy patients with HLA-DQB1*06:02-matched healthy controls using high-dimensional mass cytometry in combination with algorithm-guided data analysis. Narcolepsy patients displayed multifaceted immune activation in CD4+ and CD8+ T cells dominated by elevated levels of B cell-supporting cytokines. Additionally, T cells from narcolepsy patients showed increased production of the proinflammatory cytokines IL-2 and TNF. Although it remains to be established whether these changes are primary to an autoimmune process in narcolepsy or secondary to orexin deficiency, these findings are indicative of inflammatory processes in the pathogenesis of this enigmatic disease. © 2016 Hartmann et al.
High-Dimensional Disorder-Driven Phenomena in Weyl Semimetals, Semiconductors and Related Systems
Syzranov, S V
2016-01-01
It is commonly believed that a non-interacting disordered electronic system can undergo only the Anderson metal-insulator transition. It has been suggested, however, that a broad class of systems can display disorder-driven transitions distinct from Anderson localisation that have manifestations in the disorder-averaged density of states, conductivity and other observables. Such transitions have received particular attention in the context of recently discovered 3D Weyl and Dirac materials but have also been predicted in cold-atom systems with long-range interactions, quantum kicked rotors and all sufficiently high-dimensional systems. Moreover, such systems exhibit unconventional behaviour of Lifshitz tails, energy-level statistics and ballistic-transport properties. Here we review recent progress and the status of results on non-Anderson disorder-driven transitions and related phenomena.
The Effects of Feature Optimization on High-Dimensional Essay Data
Directory of Open Access Journals (Sweden)
Bong-Jun Yi
2015-01-01
Full Text Available Current machine learning (ML)-based automated essay scoring (AES) systems have employed various and vast numbers of features, which have been proven useful in improving the performance of the AES. However, the high-dimensional feature space is not properly represented, due to the large volume of features extracted from the limited training data. As a result, this problem gives rise to poor performance and increased training time for the system. In this paper, we experiment with and analyze the effects of feature optimization, including normalization, discretization, and feature selection techniques for different ML algorithms, while taking into consideration the size of the feature space and the performance of the AES. Accordingly, we show that appropriate feature optimization techniques can reduce the dimensions of the features, thus contributing to efficient training and performance improvement of the AES.
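The normalization and feature-selection steps discussed above can be sketched as follows; this is a minimal illustration on hypothetical data (min-max scaling plus variance-based selection), not the paper's AES pipeline:

```python
import numpy as np

def minmax_normalize(X):
    """Scale each feature (column) to [0, 1]; constant columns map to 0."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)
    return (X - lo) / span

def select_by_variance(X, k):
    """Keep the k highest-variance features; returns (reduced X, kept indices)."""
    idx = np.argsort(X.var(axis=0))[::-1][:k]
    idx = np.sort(idx)
    return X[:, idx], idx

X = np.array([[1.0, 5.0, 0.0],
              [2.0, 5.0, 10.0],
              [3.0, 5.0, 20.0]])
Xn = minmax_normalize(X)
Xr, kept = select_by_variance(X, 2)   # the constant middle column is dropped
```

Real AES feature spaces would of course use task-aware selection criteria, but the mechanics of shrinking the dimensionality before training are the same.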
Parabolic Theory as a High-Dimensional Limit of Elliptic Theory
Davey, Blair
2017-10-01
The aim of this article is to show how certain parabolic theorems follow from their elliptic counterparts. This technique is demonstrated through new proofs of five important theorems in parabolic unique continuation and the regularity theory of parabolic equations and geometric flows. Specifically, we give new proofs of an L^2 Carleman estimate for the heat operator, and the monotonicity formulas for the frequency function associated to the heat operator, the two-phase free boundary problem, the flow of harmonic maps, and the mean curvature flow. The proofs rely only on the underlying elliptic theorems and limiting procedures belonging essentially to probability theory. In particular, each parabolic theorem is proved by taking a high-dimensional limit of the related elliptic result.
Wang, Zhiping; Chen, Jinyu; Yu, Benli
2017-02-20
We investigate the two-dimensional (2D) and three-dimensional (3D) atom localization behaviors via spontaneously generated coherence in a microwave-driven four-level atomic system. Owing to the space-dependent atom-field interaction, it is found that the detecting probability and precision of 2D and 3D atom localization behaviors can be significantly improved via adjusting the system parameters, the phase, amplitude, and initial population distribution. Interestingly, the atom can be localized in volumes that are substantially smaller than a cubic optical wavelength. Our scheme opens a promising way to achieve high-precision and high-efficiency atom localization, which provides some potential applications in high-dimensional atom nanolithography.
GD-RDA: A New Regularized Discriminant Analysis for High-Dimensional Data.
Zhou, Yan; Zhang, Baoxue; Li, Gaorong; Tong, Tiejun; Wan, Xiang
2017-11-01
High-throughput techniques bring novel tools and also statistical challenges to genomic research. Identifying which type of disease a new patient has is recognized as an important problem. For high-dimensional, small-sample-size data, the classical discriminant methods suffer from the singularity problem and are therefore no longer applicable in practice. In this article, we propose a geometric diagonalization method for regularized discriminant analysis. We then consider a bias correction to further improve the proposed method. Simulation studies show that the proposed method performs better than, or at least as well as, the existing methods in a wide range of settings. A microarray dataset and an RNA-seq dataset are also analyzed and they demonstrate the superiority of the proposed method over the existing competitors, especially when the number of samples is small or the number of genes is large. Finally, we have developed an R package called "GDRDA" which is available upon request.
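The singularity problem and the shrinkage idea behind regularized discriminant analysis can be sketched as follows. This is a generic RDA-style regularization toward the diagonal of the pooled covariance, not the paper's geometric diagonalization or bias correction:

```python
import numpy as np

def rda_fit(X, y, alpha=0.5):
    """Shrink the pooled covariance toward its diagonal so it stays
    invertible even when p approaches n (generic RDA-style sketch)."""
    classes = np.unique(y)
    means = {c: X[y == c].mean(axis=0) for c in classes}
    Xc = np.vstack([X[y == c] - means[c] for c in classes])
    S = Xc.T @ Xc / (len(X) - len(classes))          # pooled covariance
    S_reg = (1 - alpha) * S + alpha * np.diag(np.diag(S))
    return classes, means, np.linalg.inv(S_reg)

def rda_predict(model, X):
    classes, means, P = model
    # Assign each sample to the class with the smallest Mahalanobis-type
    # distance under the regularized covariance.
    d = np.stack([np.einsum('ij,jk,ik->i', X - means[c], P, X - means[c])
                  for c in classes])
    return classes[d.argmin(axis=0)]

X = np.array([[0.0, 0.0], [0.2, 0.1], [3.0, 3.0], [3.1, 2.9]])
y = np.array([0, 0, 1, 1])
pred = rda_predict(rda_fit(X, y), X)
```

With alpha = 1 this reduces to a diagonal discriminant rule; intermediate alpha trades bias against the singularity of the sample covariance.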
Rupp, Matthias; Schneider, Petra; Schneider, Gisbert
2009-11-15
Measuring the (dis)similarity of molecules is important for many cheminformatics applications like compound ranking, clustering, and property prediction. In this work, we focus on real-valued vector representations of molecules (as opposed to the binary spaces of fingerprints). We demonstrate the influence that the choice of (dis)similarity measure can have on results, and provide recommendations for such choices. We review the mathematical concepts used to measure (dis)similarity in vector spaces, namely norms, metrics, inner products, and similarity coefficients, as well as the relationships between them, employing (dis)similarity measures commonly used in cheminformatics as examples. We present several phenomena (empty space phenomenon, sphere volume related phenomena, distance concentration) in high-dimensional descriptor spaces which are not encountered in two and three dimensions. These phenomena are theoretically characterized and illustrated on both artificial and real (bioactivity) data. © 2009 Wiley Periodicals, Inc.
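The distance-concentration phenomenon mentioned above is easy to reproduce numerically. The sketch below (sample sizes are arbitrary) measures the relative contrast of distances from the origin, which shrinks as the descriptor dimension grows:

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_contrast(dim, n=500):
    """(max - min) / min over the distances from the origin to n random
    points in [-1, 1]^dim; distance concentration makes this shrink
    as dim increases."""
    d = np.linalg.norm(rng.uniform(-1, 1, size=(n, dim)), axis=1)
    return (d.max() - d.min()) / d.min()

low_dim = relative_contrast(2)       # distances spread widely
high_dim = relative_contrast(1000)   # distances concentrate around their mean
```

In high dimensions nearest and farthest neighbours become nearly equidistant, which is exactly why the choice of (dis)similarity measure matters for descriptor spaces.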
High-dimensional neural-network potentials for multicomponent systems: Applications to zinc oxide
Artrith, Nongnuch; Morawietz, Tobias; Behler, Jörg
2011-04-01
Artificial neural networks represent an accurate and efficient tool to construct high-dimensional potential-energy surfaces based on first-principles data. However, so far the main drawback of this method has been the limitation to a single atomic species. We present a generalization to compounds of arbitrary chemical composition, which now enables simulations of a wide range of systems containing large numbers of atoms. The required incorporation of long-range interactions is achieved by combining the numerical accuracy of neural networks with an electrostatic term based on environment-dependent charges. Using zinc oxide as a benchmark system we show that the neural network potential-energy surface is in excellent agreement with density-functional theory reference calculations, while the evaluation is many orders of magnitude faster.
High-dimensional neural network potentials for metal surfaces: A prototype study for copper
Artrith, Nongnuch; Behler, Jörg
2012-01-01
The atomic environments at metal surfaces differ strongly from the bulk, and, in particular, in case of reconstructions or imperfections at “real surfaces,” very complicated atomic configurations can be present. This structural complexity poses a significant challenge for the development of accurate interatomic potentials suitable for large-scale molecular dynamics simulations. In recent years, artificial neural networks (NN) have become a promising new method for the construction of potential-energy surfaces for difficult systems. In the present work, we explore the applicability of such high-dimensional NN potentials to metal surfaces using copper as a benchmark system. A detailed analysis of the properties of bulk copper and of a wide range of surface structures shows that NN potentials can provide results of almost density functional theory (DFT) quality at a small fraction of the computational costs.
A fast PC algorithm for high dimensional causal discovery with multi-core PCs.
Le, Thuc; Hoang, Tao; Li, Jiuyong; Liu, Lin; Liu, Huawen; Hu, Shu
2016-07-14
Discovering causal relationships from observational data is a crucial problem with applications in many research areas. The PC algorithm is the state-of-the-art constraint-based method for causal discovery. However, the runtime of the PC algorithm is, in the worst case, exponential in the number of nodes (variables), and thus it is inefficient when applied to high dimensional data, e.g. gene expression datasets. On another note, the advancement of computer hardware in the last decade has resulted in the widespread availability of multi-core personal computers. There is a significant motivation for designing a parallelised PC algorithm that is suitable for personal computers and does not require end users' parallel computing knowledge beyond their competency in using the PC algorithm. In this paper, we develop parallel-PC, a fast and memory efficient PC algorithm using the parallel computing technique. We apply our method to a range of synthetic and real-world high dimensional datasets. Experimental results on a dataset from the DREAM 5 challenge show that the original PC algorithm could not produce any results after running for more than 24 hours; meanwhile, our parallel-PC algorithm managed to finish within around 12 hours with a 4-core CPU computer, and less than 6 hours with an 8-core CPU computer. Furthermore, we integrate parallel-PC into a causal inference method for inferring miRNA-mRNA regulatory relationships. The experimental results show that parallel-PC helps improve both the efficiency and accuracy of the causal inference algorithm.
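The observation that makes the PC algorithm parallelisable is that the conditional independence tests at each level do not depend on one another. A minimal sketch of the order-0 level, using correlation thresholding as a stand-in for a real CI test and a thread pool for the fan-out (not the authors' implementation):

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations

def level0_pc_parallel(data, threshold=0.1, workers=4):
    """Parallel sketch of the PC algorithm's first pruning level: all
    pairwise (order-0) independence tests are mutually independent, so
    they can be distributed across workers. Higher levels would condition
    on neighbour subsets in the same embarrassingly parallel way."""
    corr = np.corrcoef(data, rowvar=False)
    pairs = list(combinations(range(data.shape[1]), 2))

    def test(pair):
        i, j = pair
        return pair, abs(corr[i, j]) > threshold   # keep edge if dependent

    with ThreadPoolExecutor(max_workers=workers) as ex:
        results = list(ex.map(test, pairs))
    return {p for p, keep in results if keep}

rng = np.random.default_rng(1)
x = rng.normal(size=2000)
z = rng.normal(size=2000)              # independent of x
y = x + 0.1 * rng.normal(size=2000)    # strongly depends on x
edges = level0_pc_parallel(np.column_stack([x, y, z]))
```

For CPU-bound statistical tests on large gene expression matrices, a process pool would replace the thread pool, but the partitioning of the workload is identical.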
An Adaptive ANOVA-based PCKF for High-Dimensional Nonlinear Inverse Modeling
Energy Technology Data Exchange (ETDEWEB)
LI, Weixuan; Lin, Guang; Zhang, Dongxiao
2014-02-01
The probabilistic collocation-based Kalman filter (PCKF) is a recently developed approach for solving inverse problems. It resembles the ensemble Kalman filter (EnKF) in every aspect—except that it represents and propagates model uncertainty by polynomial chaos expansion (PCE) instead of an ensemble of model realizations. Previous studies have shown PCKF is a more efficient alternative to EnKF for many data assimilation problems. However, the accuracy and efficiency of PCKF depends on an appropriate truncation of the PCE series. Having more polynomial chaos bases in the expansion helps to capture uncertainty more accurately but increases computational cost. Bases selection is particularly important for high-dimensional stochastic problems because the number of polynomial chaos bases required to represent model uncertainty grows dramatically as the number of input parameters (random dimensions) increases. In classic PCKF algorithms, the PCE bases are pre-set based on users’ experience. Also, for sequential data assimilation problems, the bases kept in PCE expression remain unchanged in different Kalman filter loops, which could limit the accuracy and computational efficiency of classic PCKF algorithms. To address this issue, we present a new algorithm that adaptively selects PCE bases for different problems and automatically adjusts the number of bases in different Kalman filter loops. The algorithm is based on adaptive functional ANOVA (analysis of variance) decomposition, which approximates a high-dimensional function with the summation of a set of low-dimensional functions. Thus, instead of expanding the original model into PCE, we implement the PCE expansion on these low-dimensional functions, which is much less costly. We also propose a new adaptive criterion for ANOVA that is more suited for solving inverse problems. The new algorithm is tested with different examples and demonstrated great effectiveness in comparison with non-adaptive PCKF and En
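The combinatorial growth of the PCE basis that motivates adaptive selection is easy to quantify: a total-degree-p expansion in N random dimensions has C(N + p, p) basis terms. A small illustration:

```python
from math import comb

def n_pce_bases(n_dims, degree):
    """Number of polynomial chaos basis terms of total degree <= degree
    in n_dims random dimensions: C(n_dims + degree, degree)."""
    return comb(n_dims + degree, degree)

# The basis count explodes with the number of random dimensions, which
# is why adaptive basis selection matters for high-dimensional PCKF.
few = n_pce_bases(2, 3)     # degree-3 PCE in 2 dimensions
many = n_pce_bases(50, 3)   # degree-3 PCE in 50 dimensions
```

The ANOVA decomposition sidesteps this growth by expanding only low-dimensional component functions, each of which needs a far smaller basis.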
2D-EM clustering approach for high-dimensional data through folding feature vectors.
Sharma, Alok; Kamola, Piotr J; Tsunoda, Tatsuhiko
2017-12-28
Clustering methods are becoming widely utilized in biomedical research, where the volume and complexity of data is rapidly increasing. Unsupervised clustering of patient information can reveal distinct phenotype groups with different underlying mechanisms, risk prognosis and treatment response. However, biological datasets are usually characterized by a combination of low sample number and very high dimensionality, something that is not adequately addressed by current algorithms. While the performance of the methods is satisfactory for low dimensional data, an increasing number of features results in either deterioration of accuracy or inability to cluster. To tackle these challenges, new methodologies designed specifically for such data are needed. We present 2D-EM, a clustering algorithm designed for small-sample-size, high-dimensional datasets. To employ information corresponding to the data distribution and facilitate visualization, each sample is folded into its two-dimensional (2D) matrix form (or feature matrix). The maximum likelihood estimate is then computed using a modified expectation-maximization (EM) algorithm. The 2D-EM methodology was benchmarked against several existing clustering methods using 6 medically-relevant transcriptome datasets. The percentage improvement of Rand score and adjusted Rand index compared to the best performing alternative method is up to 21.9% and 155.6%, respectively. To demonstrate the general utility of the 2D-EM method we also employed 2 methylome datasets, again showing superior performance relative to established methods. The 2D-EM algorithm was able to reproduce the groups in transcriptome and methylome data with high accuracy. This builds confidence in the method's ability to uncover novel disease subtypes in new datasets. The design of the 2D-EM algorithm enables it to handle a diverse set of challenging biomedical datasets and cluster with higher accuracy than established methods. MATLAB implementation of the tool can be
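The folding step at the heart of 2D-EM can be sketched as follows. This is a simplified reading of the idea (zero-padding when the feature count is not an exact multiple is an assumption, and the modified EM step itself is omitted):

```python
import numpy as np

def fold(sample, shape):
    """Fold a 1-D feature vector into its 2-D feature-matrix form,
    zero-padding if the length is not an exact multiple of rows*cols."""
    rows, cols = shape
    padded = np.zeros(rows * cols)
    padded[:len(sample)] = sample
    return padded.reshape(rows, cols)

x = np.arange(10.0)        # a sample with 10 features
M = fold(x, (3, 4))        # 3x4 feature matrix; last two cells zero-padded
```

The matrix form lets row and column structure stand in for the data distribution that a full covariance estimate could not support at such small sample sizes.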
Representing potential energy surfaces by high-dimensional neural network potentials.
Behler, J
2014-05-07
The development of interatomic potentials employing artificial neural networks has seen tremendous progress in recent years. While until recently the applicability of neural network potentials (NNPs) has been restricted to low-dimensional systems, this limitation has now been overcome and high-dimensional NNPs can be used in large-scale molecular dynamics simulations of thousands of atoms. NNPs are constructed by adjusting a set of parameters using data from electronic structure calculations, and in many cases energies and forces can be obtained with very high accuracy. Therefore, NNP-based simulation results are often very close to those gained by a direct application of first-principles methods. In this review, the basic methodology of high-dimensional NNPs will be presented with a special focus on the scope and the remaining limitations of this approach. The development of NNPs requires substantial computational effort as typically thousands of reference calculations are required. Still, if the problem to be studied involves very large systems or long simulation times this overhead is regained quickly. Further, the method is still limited to systems containing about three or four chemical elements due to the rapidly increasing complexity of the configuration space, although many atoms of each species can be present. Due to the ability of NNPs to describe even extremely complex atomic configurations with excellent accuracy irrespective of the nature of the atomic interactions, they represent a general and therefore widely applicable technique, e.g. for addressing problems in materials science, for investigating properties of interfaces, and for studying solvation processes.
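The defining structure of a high-dimensional NNP, a total energy written as a sum of element-specific atomic network outputs, can be sketched as follows. Random weights stand in for trained parameters, and the symmetry-function vectors describing each atomic environment are taken as given:

```python
import numpy as np

rng = np.random.default_rng(0)

def atomic_energy(G, W1, b1, w2, b2):
    """One atom's energy from its symmetry-function vector G via a tiny MLP."""
    h = np.tanh(G @ W1 + b1)
    return h @ w2 + b2

def total_energy(G_all, params_by_element, elements):
    """High-dimensional NNP ansatz: the total energy is the sum of atomic
    contributions, each evaluated by the network of that atom's element."""
    return sum(atomic_energy(G, *params_by_element[el])
               for G, el in zip(G_all, elements))

def make_params(n_in=4, n_h=5):
    """Random stand-ins for trained weights of one element's network."""
    return (rng.normal(size=(n_in, n_h)), rng.normal(size=n_h),
            rng.normal(size=n_h), 0.0)

params = {"Zn": make_params(), "O": make_params()}
G_all = rng.normal(size=(6, 4))              # 6 atoms, 4 symmetry functions
elements = ["Zn", "O", "Zn", "O", "Zn", "O"]
E = total_energy(G_all, params, elements)
# Reordering the atoms (keeping each atom's G and element paired) must
# leave the total energy unchanged -- permutation invariance by design.
E2 = total_energy(G_all[::-1], params, list(reversed(elements)))
```

Because each atom contributes through its own environment descriptor, the same trained networks transfer to systems with arbitrarily many atoms of each species.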
Visualisation of Regression Trees
Brunsdon, Chris
2007-01-01
The regression tree [1] has been used as a tool for exploring multivariate data sets for some time. As in multiple linear regression, the technique is applied to a data set consisting of a continuous response variable y and a set of predictor variables {x1, x2, ..., xk} which may be continuous or categorical. However, instead of modelling y as a linear function of the predictors, regression trees model y as a series of ...
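A single node of such a tree chooses the split on a predictor that minimises the summed squared error of the two resulting leaf means; a minimal sketch on toy data:

```python
def best_split(x, y):
    """One node of a regression tree: pick the threshold on x minimising
    the summed squared error around the two leaf means."""
    best = (float("inf"), None)
    xs = sorted(set(x))
    for i in range(1, len(xs)):
        t = (xs[i - 1] + xs[i]) / 2          # candidate split point
        left = [yi for xi, yi in zip(x, y) if xi <= t]
        right = [yi for xi, yi in zip(x, y) if xi > t]
        sse = sum((yi - sum(left) / len(left)) ** 2 for yi in left) \
            + sum((yi - sum(right) / len(right)) ** 2 for yi in right)
        if sse < best[0]:
            best = (sse, t)
    return best[1]

x = [1, 2, 3, 10, 11, 12]
y = [5.0, 5.1, 4.9, 20.0, 20.2, 19.8]
t = best_split(x, y)   # splits between the two regimes
```

Recursing on each side of the chosen split, down to a minimum leaf size, yields the full tree.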
Dabrowska, Dorota M.
1997-01-01
Nonparametric regression was shown by Beran, and by McKeague and Utikal, to provide a flexible method for analysis of censored failure times and more general counting processes models in the presence of covariates. We discuss application of kernel smoothing towards estimation in a generalized Cox regression model with baseline intensity dependent on a covariate. Under regularity conditions we show that estimates of the regression parameters are asymptotically normal at rate root-n, and we also dis...
Introduction to regression graphics
Cook, R Dennis
2009-01-01
Covers the use of dynamic and interactive computer graphics in linear regression analysis, focusing on analytical graphics. Features new techniques like plot rotation. The authors have composed their own regression code, called R-code, written in the Xlisp-Stat language; it is a nearly complete system for linear regression analysis and can be utilized as the main computer program in a linear regression course. The accompanying disks, for both Macintosh and Windows computers, contain the R-code and Xlisp-Stat. An Instructor's Manual presenting detailed solutions to all the problems in the book is ava
Alternative Methods of Regression
Birkes, David
2011-01-01
Of related interest. Nonlinear Regression Analysis and its Applications, Douglas M. Bates and Donald G. Watts. "…an extraordinary presentation of concepts and methods concerning the use and analysis of nonlinear regression models…highly recommend[ed]…for anyone needing to use and/or understand issues concerning the analysis of nonlinear regression models." --Technometrics. This book provides a balance between theory and practice supported by extensive displays of instructive geometrical constructs. Numerous in-depth case studies illustrate the use of nonlinear regression analysis--with all data s
Energy Technology Data Exchange (ETDEWEB)
Gerber, Samuel [Univ. of Utah, Salt Lake City, UT (United States); Rubel, Oliver [Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States); Bremer, Peer -Timo [Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States); Pascucci, Valerio [Univ. of Utah, Salt Lake City, UT (United States); Whitaker, Ross T. [Univ. of Utah, Salt Lake City, UT (United States)
2012-01-19
This paper introduces a novel partition-based regression approach that incorporates topological information. Partition-based regression typically introduces a quality-of-fit-driven decomposition of the domain. The emphasis in this work is on a topologically meaningful segmentation. Thus, the proposed regression approach is based on a segmentation induced by a discrete approximation of the Morse–Smale complex. This yields a segmentation with partitions corresponding to regions of the function with a single minimum and maximum that are often well approximated by a linear model. This approach yields regression models that are amenable to interpretation and have good predictive capacity. Typically, regression estimates are quantified by their geometrical accuracy. For the proposed regression, an important aspect is the quality of the segmentation itself. Thus, this article introduces a new criterion that measures the topological accuracy of the estimate. The topological accuracy provides a complementary measure to the classical geometrical error measures and is very sensitive to overfitting. The Morse–Smale regression is compared to state-of-the-art approaches in terms of geometry and topology and yields comparable or improved fits in many cases. Finally, a detailed study on climate-simulation data demonstrates the application of the Morse–Smale regression. Supplementary Materials are available online and contain an implementation of the proposed approach in the R package msr, an analysis and simulations on the stability of the Morse–Smale complex approximation, and additional tables for the climate-simulation study.
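A one-dimensional analogue of the Morse–Smale segmentation illustrates the idea: split the data into monotone pieces, each running from a local minimum to a local maximum, and fit a linear model on each. This is a sketch of the partition-and-fit principle, not the paper's discrete Morse–Smale complex:

```python
import numpy as np

def monotone_partitions(x, y):
    """1-D analogue of a Morse-Smale segmentation: cut the sorted data
    wherever y switches between increasing and decreasing, so each
    partition is a monotone run from a local minimum to a local maximum."""
    order = np.argsort(x)
    xs, ys = x[order], y[order]
    sign = np.sign(np.diff(ys))
    cuts = [i + 1 for i in range(1, len(sign)) if sign[i] != sign[i - 1]]
    bounds = [0] + cuts + [len(xs)]
    return [(xs[a:b], ys[a:b]) for a, b in zip(bounds[:-1], bounds[1:])]

def fit_per_partition(parts):
    """Least-squares line per partition, as in partition-based regression."""
    return [np.polyfit(px, py, 1) for px, py in parts if len(px) > 1]

x = np.linspace(0, 2 * np.pi, 40)
y = np.sin(x)                        # one interior max and one interior min
parts = monotone_partitions(x, y)    # three monotone partitions
models = fit_per_partition(parts)    # [slope, intercept] per partition
```

Within each monotone partition a linear model is often a good approximation, which is what makes the resulting piecewise fit both accurate and interpretable.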
DEFF Research Database (Denmark)
Fitzenberger, Bernd; Wilke, Ralf Andreas
2015-01-01
Quantile regression is emerging as a popular statistical approach, which complements the estimation of conditional mean models. While the latter only focuses on one aspect of the conditional distribution of the dependent variable, the mean, quantile regression provides more detailed insights...
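The pinball (check) loss that defines quantile regression can be illustrated with a constant-only model, where minimising it recovers the empirical quantile, just as minimising squared error recovers the mean (toy data):

```python
import numpy as np

def pinball_loss(tau, y, q):
    """Check loss underlying quantile regression: residuals above q are
    weighted by tau, residuals below q by (1 - tau)."""
    r = y - q
    return np.mean(np.maximum(tau * r, (tau - 1) * r))

# Minimising the pinball loss over a constant recovers the empirical
# tau-quantile; note how the 0.5- and 0.9-quantiles react differently
# to the outlying observation, unlike a single conditional mean.
y = np.array([1.0, 2.0, 3.0, 4.0, 100.0])
grid = np.linspace(0, 100, 10001)
q50 = grid[np.argmin([pinball_loss(0.5, y, q) for q in grid])]
q90 = grid[np.argmin([pinball_loss(0.9, y, q) for q in grid])]
```

In full quantile regression the constant q is replaced by a linear predictor x'beta, and the same loss is minimised over beta for each tau of interest.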
A simple new filter for nonlinear high-dimensional data assimilation
Tödter, Julian; Kirchgessner, Paul; Ahrens, Bodo
2015-04-01
performance with a realistic ensemble size. The results confirm that, in principle, it can be applied successfully and as simple as the ETKF in high-dimensional problems without further modifications of the algorithm, even though it is only based on the particle weights. This proves that the suggested method constitutes a useful filter for nonlinear, high-dimensional data assimilation, and is able to overcome the curse of dimensionality even in deterministic systems.
DEFF Research Database (Denmark)
Bordacconi, Mats Joe; Larsen, Martin Vinæs
2014-01-01
Humans are fundamentally primed for making causal attributions based on correlations. This implies that researchers must be careful to present their results in a manner that inhibits unwarranted causal attribution. In this paper, we present the results of an experiment that suggests regression models – one of the primary vehicles for analyzing statistical results in political science – encourage causal interpretation. Specifically, we demonstrate that presenting observational results in a regression model, rather than as a simple comparison of means, makes causal interpretation of the results … of equivalent results presented as either regression models or as a test of two sample means. Our experiment shows that the subjects who were presented with results as estimates from a regression model were more inclined to interpret these results causally. Our experiment implies that scholars using regression …
Mapping the human DC lineage through the integration of high-dimensional techniques.
See, Peter; Dutertre, Charles-Antoine; Chen, Jinmiao; Günther, Patrick; McGovern, Naomi; Irac, Sergio Erdal; Gunawan, Merry; Beyer, Marc; Händler, Kristian; Duan, Kaibo; Sumatoh, Hermi Rizal Bin; Ruffin, Nicolas; Jouve, Mabel; Gea-Mallorquí, Ester; Hennekam, Raoul C M; Lim, Tony; Yip, Chan Chung; Wen, Ming; Malleret, Benoit; Low, Ivy; Shadan, Nurhidaya Binte; Fen, Charlene Foong Shu; Tay, Alicia; Lum, Josephine; Zolezzi, Francesca; Larbi, Anis; Poidinger, Michael; Chan, Jerry K Y; Chen, Qingfeng; Rénia, Laurent; Haniffa, Muzlifah; Benaroch, Philippe; Schlitzer, Andreas; Schultze, Joachim L; Newell, Evan W; Ginhoux, Florent
2017-06-09
Dendritic cells (DC) are professional antigen-presenting cells that orchestrate immune responses. The human DC population comprises two main functionally specialized lineages, whose origins and differentiation pathways remain incompletely defined. Here, we combine two high-dimensional technologies-single-cell messenger RNA sequencing (scmRNAseq) and cytometry by time-of-flight (CyTOF)-to identify human blood CD123+CD33+CD45RA+ DC precursors (pre-DC). Pre-DC share surface markers with plasmacytoid DC (pDC) but have distinct functional properties that were previously attributed to pDC. Tracing the differentiation of DC from the bone marrow to the peripheral blood revealed that the pre-DC compartment contains distinct lineage-committed subpopulations, including one early uncommitted CD123high pre-DC subset and two CD45RA+CD123low lineage-committed subsets exhibiting functional differences. The discovery of multiple committed pre-DC populations opens promising new avenues for the therapeutic exploitation of DC subset-specific targeting. Copyright © 2017, American Association for the Advancement of Science.
Multi-Scale Factor Analysis of High-Dimensional Brain Signals
Ting, Chee-Ming
2017-05-18
In this paper, we develop an approach to modeling high-dimensional networks with a large number of nodes arranged in a hierarchical and modular structure. We propose a novel multi-scale factor analysis (MSFA) model which partitions the massive spatio-temporal data defined over the complex networks into a finite set of regional clusters. To achieve further dimension reduction, we represent the signals in each cluster by a small number of latent factors. The correlation matrix for all nodes in the network is approximated by lower-dimensional sub-structures derived from the cluster-specific factors. To estimate regional connectivity between numerous nodes (within each cluster), we apply principal components analysis (PCA) to produce factors which are derived as the optimal reconstruction of the observed signals under the squared loss. Then, we estimate global connectivity (between clusters or sub-networks) based on the factors across regions using the RV-coefficient as the cross-dependence measure. This gives a reliable and computationally efficient multi-scale analysis of both regional and global dependencies of the large networks. The proposed novel approach is applied to estimate brain connectivity networks using functional magnetic resonance imaging (fMRI) data. Results on resting-state fMRI reveal interesting modular and hierarchical organization of human brain networks during rest.
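The RV coefficient used above as the cross-dependence measure between clusters can be computed directly (the factor-analysis machinery around it is omitted in this sketch):

```python
import numpy as np

def rv_coefficient(X, Y):
    """RV coefficient: a matrix correlation between two centred data
    blocks; 1 means one block is a rotation/scaling of the other."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    Sxy = Xc.T @ Yc
    Sxx = Xc.T @ Xc
    Syy = Yc.T @ Yc
    num = np.trace(Sxy @ Sxy.T)
    den = np.sqrt(np.trace(Sxx @ Sxx) * np.trace(Syy @ Syy))
    return num / den

rng = np.random.default_rng(0)
A = rng.normal(size=(100, 3))            # factors of one cluster
B = A @ rng.normal(size=(3, 2))          # a cluster driven by A's factors
C = rng.normal(size=(100, 2))            # an unrelated cluster
rv_aa = rv_coefficient(A, A)             # exactly 1 by construction
rv_ab = rv_coefficient(A, B)             # high: shared latent structure
rv_ac = rv_coefficient(A, C)             # low: independent blocks
```

In the MSFA setting X and Y would be the per-cluster factor series, so this single scalar summarises the dependence between two whole sub-networks.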
Spanning high-dimensional expression space using ribosome-binding site combinatorics.
Zelcbuch, Lior; Antonovsky, Niv; Bar-Even, Arren; Levin-Karp, Ayelet; Barenholz, Uri; Dayagi, Michal; Liebermeister, Wolfram; Flamholz, Avi; Noor, Elad; Amram, Shira; Brandis, Alexander; Bareia, Tasneem; Yofe, Ido; Jubran, Halim; Milo, Ron
2013-05-01
Protein levels are a dominant factor shaping natural and synthetic biological systems. Although proper functioning of metabolic pathways relies on precise control of enzyme levels, the experimental ability to balance the levels of many genes in parallel is a major outstanding challenge. Here, we introduce a rapid and modular method to span the expression space of several proteins in parallel. By combinatorially pairing genes with a compact set of ribosome-binding sites, we modulate protein abundance by several orders of magnitude. We demonstrate our strategy by using a synthetic operon containing fluorescent proteins to span a 3D color space. Using the same approach, we modulate a recombinant carotenoid biosynthesis pathway in Escherichia coli to reveal a diversity of phenotypes, each characterized by a distinct carotenoid accumulation profile. In a single combinatorial assembly, we achieve a yield of the industrially valuable compound astaxanthin 4-fold higher than previously reported. The methodology presented here provides an efficient tool for exploring a high-dimensional expression space to locate desirable phenotypes.
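The size of the spanned expression space follows directly from the combinatorics: r ribosome-binding-site variants paired with each of g genes give r^g distinct designs. A toy count (the RBS strength labels and gene names below are illustrative, not the paper's actual parts):

```python
from itertools import product

# Hypothetical 3-member RBS set and a 4-gene operon: combinatorial
# pairing spans 3**4 = 81 expression states in a single assembly.
rbs = ["weak", "medium", "strong"]
genes = ["crtE", "crtB", "crtI", "crtY"]
designs = list(product(rbs, repeat=len(genes)))
n_designs = len(designs)
```

Even a compact RBS set therefore covers orders of magnitude in relative protein abundance across the pathway, which is what makes the screen for phenotypes such as astaxanthin yield feasible.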
Simulation-based hypothesis testing of high dimensional means under covariance heterogeneity.
Chang, Jinyuan; Zheng, Chao; Zhou, Wen-Xin; Zhou, Wen
2017-12-01
In this article, we study the problem of testing the mean vectors of high dimensional data in both one-sample and two-sample cases. The proposed testing procedures employ maximum-type statistics and parametric bootstrap techniques to compute the critical values. Different from the existing tests that heavily rely on structural conditions on the unknown covariance matrices, the proposed tests allow general covariance structures of the data and therefore enjoy a wide scope of applicability in practice. To enhance the powers of the tests against sparse alternatives, we further propose two-step procedures with a preliminary feature screening step. Theoretical properties of the proposed tests are investigated. Through extensive numerical experiments on synthetic data sets and a human acute lymphoblastic leukemia gene expression data set, we illustrate the performance of the new tests and how they may provide assistance on detecting disease-associated gene-sets. The proposed methods have been implemented in an R-package HDtest and are available on CRAN. © 2017, The International Biometric Society.
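A one-sample max-type statistic with a parametric bootstrap critical value can be sketched as follows. This is a Gaussian bootstrap from the sample covariance, with the paper's screening step omitted:

```python
import numpy as np

def max_test(X, mu0, n_boot=500, alpha=0.05, seed=0):
    """Max-type one-sample test sketch: the statistic is the largest
    marginal |t|-ratio; its critical value comes from a Gaussian
    parametric bootstrap under the sample covariance, so no structural
    assumption on the covariance is imposed."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    se = X.std(axis=0, ddof=1) / np.sqrt(n)
    stat = np.max(np.abs((X.mean(axis=0) - mu0) / se))
    cov = np.cov(X, rowvar=False)
    boot = np.empty(n_boot)
    for b in range(n_boot):
        Z = rng.multivariate_normal(np.zeros(p), cov, size=n)
        t = Z.mean(axis=0) / (Z.std(axis=0, ddof=1) / np.sqrt(n))
        boot[b] = np.max(np.abs(t))
    return stat, np.quantile(boot, 1 - alpha)

rng = np.random.default_rng(42)
X0 = rng.normal(size=(50, 10))                  # null: mean is zero
stat0, crit0 = max_test(X0, np.zeros(10))
X1 = X0 + np.r_[1.5, np.zeros(9)]               # sparse one-coordinate shift
stat1, crit1 = max_test(X1, np.zeros(10))       # the shift drives stat1 > crit1
```

The maximum over coordinates is what gives the test its power against sparse alternatives: a single shifted coordinate is enough to push the statistic past the bootstrap critical value.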
A Comparison of Machine Learning Methods in a High-Dimensional Classification Problem
Directory of Open Access Journals (Sweden)
Zekić-Sušac Marijana
2014-09-01
Full Text Available Background: Large-dimensional data modelling often relies on variable reduction methods in the pre-processing and in the post-processing stage. However, such a reduction usually provides less information and yields a lower accuracy of the model. Objectives: The aim of this paper is to assess the high-dimensional classification problem of recognizing entrepreneurial intentions of students by machine learning methods. Methods/Approach: Four methods were tested: artificial neural networks, CART classification trees, support vector machines, and k-nearest neighbour, on the same dataset in order to compare their efficiency in terms of classification accuracy. The performance of each method was compared on ten subsamples in a 10-fold cross-validation procedure, and the sensitivity and specificity of each model were computed. Results: The artificial neural network model based on a multilayer perceptron yielded a higher classification rate than the models produced by the other methods. The pairwise t-test showed a statistically significant difference between the artificial neural network and the k-nearest neighbour model, while the differences among the other methods were not statistically significant. Conclusions: Tested machine learning methods are able to learn fast and achieve high classification accuracy. However, further advancement can be assured by testing a few additional methodological refinements in machine learning methods.
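The 10-fold cross-validation protocol used for the comparison is pure index bookkeeping and can be sketched as follows (the four classifiers themselves are omitted):

```python
import numpy as np

def kfold_indices(n, k=10, seed=0):
    """Shuffled k-fold cross-validation splits: each of the k folds
    serves once as the held-out test set, the rest as training data."""
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, k)
    return [(np.concatenate(folds[:i] + folds[i + 1:]), folds[i])
            for i in range(k)]

# Each classifier would be fitted on the train indices and scored on the
# test indices of every split, giving ten per-model accuracy estimates
# that a pairwise t-test can then compare.
splits = kfold_indices(50, k=10)
```

Comparing models on identical splits is what justifies the pairwise t-test on the resulting accuracy differences.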
Diagonal Likelihood Ratio Test for Equality of Mean Vectors in High-Dimensional Data
Hu, Zongliang
2017-10-27
We propose a likelihood ratio test framework for testing normal mean vectors in high-dimensional data under two common scenarios: the one-sample test and the two-sample test with equal covariance matrices. We derive the test statistics under the assumption that the covariance matrices follow a diagonal matrix structure. In comparison with the diagonal Hotelling's tests, our proposed test statistics display some interesting characteristics. In particular, they are a summation of the log-transformed squared t-statistics rather than a direct summation of those components. More importantly, to derive the asymptotic normality of our test statistics under the null and local alternative hypotheses, we do not require the assumption that the covariance matrix follows a diagonal matrix structure. As a consequence, our proposed test methods are very flexible and can be widely applied in practice. Finally, simulation studies and a real data analysis are also conducted to demonstrate the advantages of our likelihood ratio test method.
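The "summation of log-transformed squared t-statistics" structure can be sketched for the one-sample case. The exact form and normalising constants in the paper may differ; this only illustrates the shape of the statistic:

```python
import numpy as np

def diag_lrt_stat(X, mu0):
    """One-sample statistic in the spirit of the diagonal likelihood
    ratio test: per-coordinate squared t-statistics are log-transformed
    and summed, rather than summed directly as in diagonal Hotelling
    tests (illustrative form, without the paper's centring constants)."""
    n = X.shape[0]
    t2 = n * (X.mean(axis=0) - mu0) ** 2 / X.var(axis=0, ddof=1)
    return np.sum(np.log1p(t2 / (n - 1)))

rng = np.random.default_rng(7)
null_X = rng.normal(size=(30, 200))        # 30 samples, 200 coordinates
shift_X = null_X + 0.5                     # every coordinate shifted
s_null = diag_lrt_stat(null_X, np.zeros(200))
s_shift = diag_lrt_stat(shift_X, np.zeros(200))
```

The log transform damps the influence of any single extreme coordinate, which is one source of the robustness the authors highlight relative to direct summation.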
SPRING: a kinetic interface for visualizing high dimensional single-cell expression data.
Weinreb, Caleb; Wolock, Samuel; Klein, Allon
2017-12-07
Single-cell gene expression profiling technologies can map the cell states in a tissue or organism. As these technologies become more common, there is a need for computational tools to explore the data they produce. In particular, visualizing continuous gene expression topologies can be improved, since current tools tend to fragment gene expression continua or capture only limited features of complex population topologies. Force-directed layouts of k-nearest-neighbor graphs can visualize continuous gene expression topologies in a manner that preserves high-dimensional relationships and captures complex population topologies. We describe SPRING, a pipeline for data filtering, normalization and visualization using force-directed layouts, and show that it reveals more detailed biological relationships than existing approaches when applied to branching gene expression trajectories from hematopoietic progenitor cells and cells of the upper airway epithelium. Visualizations from SPRING are also more reproducible than those of stochastic visualization methods such as tSNE, a state-of-the-art tool. We provide SPRING as an interactive web-tool with an easy-to-use GUI. Availability: https://kleintools.hms.harvard.edu/tools/spring.html, https://github.com/AllonKleinLab/SPRING/. Contact: calebsw@gmail.com, allon_klein@hms.harvard.edu.
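The core of a SPRING-style view, a force-directed layout of a k-nearest-neighbor graph, can be sketched as follows (synthetic expression values stand in for real single-cell data, and SPRING's filtering and normalization steps are omitted):

```python
# Build a kNN graph over "cells" and lay it out with a force-directed
# algorithm, the central idea of the SPRING visualization.
import numpy as np
import networkx as nx
from sklearn.neighbors import kneighbors_graph

rng = np.random.default_rng(0)
cells = rng.normal(size=(120, 30))            # 120 cells x 30 genes (toy data)

adj = kneighbors_graph(cells, n_neighbors=5, mode="connectivity")
G = nx.from_scipy_sparse_array(adj)           # undirected kNN graph
pos = nx.spring_layout(G, seed=0)             # Fruchterman-Reingold layout
print(len(pos))
```

Each node position in `pos` would then be drawn as one cell; neighboring cells in expression space land near each other on screen.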
Yuan, Xiaoru; Ren, Donghao; Wang, Zuchao; Guo, Cong
2013-12-01
For high-dimensional data, this work proposes two novel visual exploration methods to gain insights into the data aspect and the dimension aspect of the data. The first is a Dimension Projection Matrix, as an extension of a scatterplot matrix. In the matrix, each row or column represents a group of dimensions, and each cell shows a dimension projection (such as MDS) of the data with the corresponding dimensions. The second is a Dimension Projection Tree, where every node is either a dimension projection plot or a Dimension Projection Matrix. Nodes are connected with links and each child node in the tree covers a subset of the parent node's dimensions or a subset of the parent node's data items. While the tree nodes visualize the subspaces of dimensions or subsets of the data items under exploration, the matrix nodes enable cross-comparison between different combinations of subspaces. Both the Dimension Projection Matrix and the Dimension Projection Tree can be constructed algorithmically through automation, or manually through user interaction. Our implementation enables interactions such as drilling down to explore different levels of the data, merging or splitting the subspaces to adjust the matrix, and applying brushing to select data clusters. Our method enables simultaneous exploration of data correlation and dimension correlation for data with high dimensions.
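One cell of a Dimension Projection Matrix can be sketched as an MDS projection restricted to a group of dimensions; taking the union of the row group and column group for each cell is an assumption made here for illustration:

```python
# Compute the MDS projections that would fill a 2x2 Dimension Projection
# Matrix: each cell projects the data using a combination of dimension groups.
import numpy as np
from sklearn.manifold import MDS

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 12))                     # 80 items x 12 dimensions
groups = [list(range(0, 6)), list(range(6, 12))]  # two dimension groups

cells = {}
for i, gi in enumerate(groups):
    for j, gj in enumerate(groups):
        dims = sorted(set(gi) | set(gj))          # union of the two groups
        cells[(i, j)] = MDS(n_components=2, random_state=0).fit_transform(X[:, dims])
print(cells[(0, 1)].shape)
```

Rendering each `cells[(i, j)]` as a scatterplot in a grid reproduces the matrix layout the paper describes.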
Energy Technology Data Exchange (ETDEWEB)
Snyder, Abigail C. [University of Pittsburgh]; Jiao, Yu [ORNL]
2010-10-01
Neutron experiments at the Spallation Neutron Source (SNS) at Oak Ridge National Laboratory (ORNL) frequently generate large amounts of data (on the order of 10⁶–10¹² data points). Hence, traditional data analysis tools run on a single CPU take too long to be practical and scientists are unable to efficiently analyze all data generated by experiments. Our goal is to develop a scalable algorithm to efficiently compute high-dimensional integrals of arbitrary functions. This algorithm can then be used to evaluate the four-dimensional integrals that arise as part of modeling intensity from the experiments at the SNS. Here, three different one-dimensional numerical integration solvers from the GNU Scientific Library were modified and implemented to solve four-dimensional integrals. The results of these solvers on a final integrand provided by scientists at the SNS can be compared to the results of other methods, such as quasi-Monte Carlo methods, computing the same integral. A parallelized version of the most efficient method can allow scientists the opportunity to more effectively analyze all experimental data.
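The idea of composing one-dimensional quadrature into a four-dimensional integral can be sketched with SciPy's nquad (which nests 1-D adaptive quadrature) in place of the modified GSL solvers; the Gaussian integrand is illustrative, not the SNS intensity model:

```python
# Evaluate a 4-D integral by nesting one-dimensional adaptive quadrature,
# the same composition strategy described above.
import numpy as np
from scipy import integrate

def f(x, y, z, w):
    # Separable Gaussian: the exact value is (sqrt(pi) * erf(2))**4.
    return np.exp(-(x**2 + y**2 + z**2 + w**2))

val, err = integrate.nquad(f, [[-2, 2]] * 4)
print(round(val, 6))
```

For less smooth or higher-dimensional integrands, quasi-Monte Carlo (as the abstract mentions) usually scales better than nested quadrature.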
A multistage mathematical approach to automated clustering of high-dimensional noisy data
Friedman, Alexander; Keselman, Michael D.; Gibb, Leif G.; Graybiel, Ann M.
2015-01-01
A critical problem faced in many scientific fields is the adequate separation of data derived from individual sources. Often, such datasets require analysis of multiple features in a highly multidimensional space, with overlap of features and sources. The datasets generated by simultaneous recording from hundreds of neurons emitting phasic action potentials have produced the challenge of separating the recorded signals into independent data subsets (clusters) corresponding to individual signal-generating neurons. Mathematical methods have been developed over the past three decades to achieve such spike clustering, but a complete solution with fully automated cluster identification has not been achieved. We propose here a fully automated mathematical approach that identifies clusters in multidimensional space through recursion, which combats the multidimensionality of the data. Recursion is paired with an approach to dimensional evaluation, in which each dimension of a dataset is examined for its informational importance for clustering. The dimensions offering greater informational importance are given added weight during recursive clustering. To combat strong background activity, our algorithm takes an iterative approach of data filtering according to a signal-to-noise ratio metric. The algorithm finds cluster cores, which are thereafter expanded to include complete clusters. This mathematical approach can be extended from its prototype context of spike sorting to other datasets that suffer from high dimensionality and background activity. PMID:25831512
Schran, Christoph; Uhl, Felix; Behler, Jörg; Marx, Dominik
2018-03-01
The design of accurate helium-solute interaction potentials for the simulation of chemically complex molecules solvated in superfluid helium has long been a cumbersome task due to the rather weak but strongly anisotropic nature of the interactions. We show that this challenge can be met by using a combination of an effective pair potential for the He-He interactions and a flexible high-dimensional neural network potential (NNP) for describing the complex interaction between helium and the solute in a pairwise additive manner. This approach yields an excellent agreement with a mean absolute deviation as small as 0.04 kJ mol⁻¹ for the interaction energy between helium and both hydronium and Zundel cations compared with coupled cluster reference calculations with an energetically converged basis set. The construction and improvement of the potential can be performed in a highly automated way, which opens the door for applications to a variety of reactive molecules to study the effect of solvation on the solute as well as the solute-induced structuring of the solvent. Furthermore, we show that this NNP approach yields very convincing agreement with the coupled cluster reference for properties like many-body spatial and radial distribution functions. This holds for the microsolvation of the protonated water monomer and dimer by a few helium atoms up to their solvation in bulk helium as obtained from path integral simulations at about 1 K.
Anomaly Detection in Large Sets of High-Dimensional Symbol Sequences
Budalakoti, Suratna; Srivastava, Ashok N.; Akella, Ram; Turkov, Eugene
2006-01-01
This paper addresses the problem of detecting and describing anomalies in large sets of high-dimensional symbol sequences. The approach taken uses unsupervised clustering of sequences using the normalized longest common subsequence (LCS) as a similarity measure, followed by detailed analysis of outliers to detect anomalies. As the LCS measure is expensive to compute, the first part of the paper discusses existing algorithms, such as the Hunt-Szymanski algorithm, that have low time-complexity. We then discuss why these algorithms often do not work well in practice and present a new hybrid algorithm for computing the LCS that, in our tests, outperforms the Hunt-Szymanski algorithm by a factor of five. The second part of the paper presents new algorithms for outlier analysis that provide comprehensible indicators as to why a particular sequence was deemed to be an outlier. The algorithms provide a coherent description to an analyst of the anomalies in the sequence, compared to more normal sequences. The algorithms we present are general and domain-independent, so we discuss applications in related areas such as anomaly detection.
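The normalized longest-common-subsequence similarity used above can be sketched with the classic O(mn) dynamic program; the paper's Hunt-Szymanski and hybrid algorithms change only the speed, not the value, and normalizing by the longer sequence length is an assumption about the exact normalization:

```python
# Normalized LCS similarity between two symbol sequences, computed with
# the standard dynamic program (quadratic time; fine for short sequences).
def lcs_length(a, b):
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]

def normalized_lcs(a, b):
    # One common normalization: LCS length over the longer sequence length.
    return lcs_length(a, b) / max(len(a), len(b)) if a or b else 1.0

print(normalized_lcs("ABCBDAB", "BDCABA"))  # LCS has length 4 here
```

Sequences whose normalized LCS to every cluster is low are the outlier candidates the paper then explains to the analyst.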
CMA – a comprehensive Bioconductor package for supervised classification with high dimensional data
Slawski, M; Daumer, M; Boulesteix, A-L
2008-01-01
Background For the last eight years, microarray-based classification has been a major topic in statistics, bioinformatics and biomedicine research. Traditional methods often yield unsatisfactory results or may even be inapplicable in the so-called "p ≫ n" setting where the number of predictors p by far exceeds the number of observations n, hence the term "ill-posed problem". Careful model selection and evaluation satisfying accepted good-practice standards is a very complex task for statisticians without experience in this area or for scientists with limited statistical background. The multiplicity of available methods for class prediction based on high-dimensional data is an additional practical challenge for inexperienced researchers. Results In this article, we introduce a new Bioconductor package called CMA (standing for "Classification for MicroArrays") for automatically performing variable selection, parameter tuning, classifier construction, and unbiased evaluation of the constructed classifiers using a large number of usual methods. Without much time and effort, users are provided with an overview of the unbiased accuracy of most top-performing classifiers. Furthermore, the standardized evaluation framework underlying CMA can also be beneficial in statistical research for comparison purposes, for instance if a new classifier has to be compared to existing approaches. Conclusion CMA is a user-friendly comprehensive package for classifier construction and evaluation implementing most usual approaches. It is freely available from the Bioconductor website. PMID:18925941
Hou, Jiayi; Archer, Kellie J
2015-02-01
An ordinal scale is commonly used to measure health status and disease-related outcomes in hospital settings as well as in translational medical research. In addition, repeated measurements are common in clinical practice for tracking and monitoring the progression of complex diseases. Classical methodology based on statistical inference, in particular ordinal modeling, has contributed to the analysis of data in which the response categories are ordered and the number of covariates (p) remains smaller than the sample size (n). With genomic technologies being increasingly applied for more accurate diagnosis and prognosis, high-dimensional data, where the number of covariates (p) is much larger than the number of samples (n), are generated. To meet these emerging needs, we introduce our proposed model, a two-stage algorithm: (1) extend the generalized monotone incremental forward stagewise (GMIFS) method to the cumulative logit ordinal model; and (2) combine the GMIFS procedure with the classical mixed-effects model for classifying disease status over the course of disease progression. We demonstrate the efficiency and accuracy of the proposed models in classification using a time-course microarray dataset collected from the Inflammation and the Host Response to Injury study.
PCA leverage: outlier detection for high-dimensional functional magnetic resonance imaging data.
Mejia, Amanda F; Nebel, Mary Beth; Eloyan, Ani; Caffo, Brian; Lindquist, Martin A
2017-07-01
Outlier detection for high-dimensional (HD) data is a popular topic in modern statistical research. However, one source of HD data that has received relatively little attention is functional magnetic resonance images (fMRI), which consist of hundreds of thousands of measurements sampled at hundreds of time points. At a time when the availability of fMRI data is rapidly growing (primarily through large, publicly available grassroots datasets), automated quality control and outlier detection methods are greatly needed. We propose principal components analysis (PCA) leverage and demonstrate how it can be used to identify outlying time points in an fMRI run. Furthermore, PCA leverage is a measure of the influence of each observation on the estimation of principal components, which are often of interest in fMRI data. We also propose an alternative measure, PCA robust distance, which is less sensitive to outliers and has controllable statistical properties. The proposed methods are validated through simulation studies and are shown to be highly accurate. We also conduct a reliability study using resting-state fMRI data from the Autism Brain Imaging Data Exchange and find that removal of outliers using the proposed methods results in more reliable estimation of subject-level resting-state networks using independent components analysis.
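PCA leverage as described, the influence of each time point on the estimated principal components, can be sketched from the left singular vectors of the centered data matrix (the number of retained components k is a user choice, and the data here are synthetic):

```python
# PCA leverage: diagonal of U_k U_k^T, where U_k holds the first k left
# singular vectors of the centered data. Large leverage flags time points
# that dominate the principal components.
import numpy as np

def pca_leverage(X, k):
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return (U[:, :k] ** 2).sum(axis=1)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 500))   # 100 time points x 500 voxels (toy fMRI run)
X[17] += 5                        # inject one outlying time point
lev = pca_leverage(X, k=5)
print(int(np.argmax(lev)))
```

Because the columns of U are orthonormal, the leverages always sum to k, so they can be compared against the "average" leverage k/n to set a threshold.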
Defining and evaluating classification algorithm for high-dimensional data based on latent topics.
Directory of Open Access Journals (Sweden)
Le Luo
Full Text Available Automatic text categorization is one of the key techniques in information retrieval and the data mining field. The classification is usually time-consuming when the training dataset is large and high-dimensional. Many methods have been proposed to solve this problem, but few can achieve satisfactory efficiency. In this paper, we present a method which combines the Latent Dirichlet Allocation (LDA) algorithm and the Support Vector Machine (SVM). LDA is first used to generate a reduced-dimensional representation of topics as features in the vector space model (VSM). It reduces the feature set dramatically but keeps the necessary semantic information. The Support Vector Machine (SVM) is then employed to classify the data based on the generated features. We evaluate the algorithm on the 20 Newsgroups and Reuters-21578 datasets, respectively. The experimental results show that the classification based on our proposed LDA+SVM model achieves high performance in terms of precision, recall and F1 measure. Further, it can achieve this within a much shorter time-frame. Our process improves greatly upon the previous work in this field and displays strong potential to achieve a streamlined classification process for a wide range of applications.
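The LDA+SVM pipeline can be sketched directly in scikit-learn, with LDA compressing bag-of-words counts into topic proportions before a linear SVM; the toy corpus below is an assumption standing in for 20 Newsgroups/Reuters-21578:

```python
# Text classification on LDA topic features: counts -> topic proportions -> SVM.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

docs = ["stock market trading price", "market shares price rise",
        "goal match team player", "team player scores goal",
        "trading price market stock", "match goal player team"]
labels = [0, 0, 1, 1, 0, 1]

clf = make_pipeline(
    CountVectorizer(),
    LatentDirichletAllocation(n_components=2, random_state=0),  # topic features
    LinearSVC(),
)
clf.fit(docs, labels)
print(clf.predict(["price of market shares", "player scores a goal"]))
```

The dimensionality reduction happens in the middle step: the SVM sees only the topic proportions (here 2 numbers per document) instead of the full vocabulary.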
A hyper-spherical adaptive sparse-grid method for high-dimensional discontinuity detection
Energy Technology Data Exchange (ETDEWEB)
Zhang, Guannan [Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States); Webster, Clayton G. [Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States); Gunzburger, Max D. [Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States); Burkardt, John V. [Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
2014-03-01
This work proposes and analyzes a hyper-spherical adaptive hierarchical sparse-grid method for detecting jump discontinuities of functions in high-dimensional spaces. The method is motivated by the theoretical and computational inefficiencies of well-known adaptive sparse-grid methods for discontinuity detection. Our novel approach constructs a function representation of the discontinuity hyper-surface of an N-dimensional discontinuous quantity of interest, by virtue of a hyper-spherical transformation. Then, a sparse-grid approximation of the transformed function is built in the hyper-spherical coordinate system, whose value at each point is estimated by solving a one-dimensional discontinuity detection problem. Due to the smoothness of the hyper-surface, the new technique can identify jump discontinuities with significantly reduced computational cost, compared to existing methods. Moreover, hierarchical acceleration techniques are also incorporated to further reduce the overall complexity. Rigorous error estimates and complexity analyses of the new method are provided, as are several numerical examples that illustrate the effectiveness of the approach.
Free-energy calculations along a high-dimensional fragmented path with constrained dynamics.
Chen, Changjun; Huang, Yanzhao; Xiao, Yi
2012-09-01
Free-energy calculations for high-dimensional systems, such as peptides or proteins, always suffer from a serious sampling problem in a huge conformational space. For such systems, path-based free-energy methods, such as thermodynamic integration or free-energy perturbation, are good choices. However, both of them need sufficient sampling along a predefined transition path, which can only be controlled using restrained or constrained dynamics. Constrained simulations produce more reasonable free-energy profiles than restrained simulations. But calculations of standard constrained dynamics require an explicit expression of reaction coordinates as a function of Cartesian coordinates of all related atoms, which may be difficult to find for the complex transition of biomolecules. In this paper, we propose a practical solution: (1) We use restrained dynamics to define an optimized transition path, divide it into small fragments, and define a virtual reaction coordinate to denote a position along the path. (2) We use constrained dynamics to perform a formal free-energy calculation for each fragment and collect the values together to provide the entire free-energy profile. This method avoids the requirement to explicitly define reaction coordinates in Cartesian coordinates and provides a novel strategy to perform free-energy calculations for biomolecules along any complex transition path.
Construction of high-dimensional neural network potentials using environment-dependent atom pairs.
Jose, K V Jovan; Artrith, Nongnuch; Behler, Jörg
2012-05-21
An accurate determination of the potential energy is the crucial step in computer simulations of chemical processes, but using electronic structure methods on-the-fly in molecular dynamics (MD) is computationally too demanding for many systems. Constructing more efficient interatomic potentials becomes intricate with increasing dimensionality of the potential-energy surface (PES), and for numerous systems the accuracy that can be achieved is still not satisfying and far from the reliability of first-principles calculations. Feed-forward neural networks (NNs) have a very flexible functional form, and in recent years they have been shown to be an accurate tool to construct efficient PESs. High-dimensional NN potentials based on environment-dependent atomic energy contributions have been presented for a number of materials. Still, these potentials may be improved by a more detailed structural description, e.g., in form of atom pairs, which directly reflect the atomic interactions and take the chemical environment into account. We present an implementation of an NN method based on atom pairs, and its accuracy and performance are compared to the atom-based NN approach using two very different systems, the methanol molecule and metallic copper. We find that both types of NN potentials provide an excellent description of both PESs, with the pair-based method yielding a slightly higher accuracy making it a competitive alternative for addressing complex systems in MD simulations.
AUTISTIC EPILEPTIFORM REGRESSION (A REVIEW)
Directory of Open Access Journals (Sweden)
L. Yu. Glukhova
2012-01-01
Full Text Available The author presents a review of current scientific literature devoted to autistic epileptiform regression, a special form of autistic disorder characterized by the development of severe communicative disorders in children as a result of continuous prolonged epileptiform activity on EEG. This condition was described by R.F. Tuchman and I. Rapin in 1997. The author describes aspects of the pathogenesis, clinical picture, and diagnostics of this disorder, including the characteristic anomalies on EEG (benign epileptiform patterns of childhood) with a high index of epileptiform activity, especially during sleep. Particular attention is given to approaches to the treatment of autistic epileptiform regression. The efficacy of valproates, corticosteroid hormones, and antiepileptic drugs of other groups is considered.
Weisberg, Sanford
2013-01-01
Praise for the Third Edition: "...this is an excellent book which could easily be used as a course text..." -International Statistical Institute. The Fourth Edition of Applied Linear Regression provides a thorough update of the basic theory and methodology of linear regression modeling. Demonstrating the practical applications of linear regression analysis techniques, the Fourth Edition uses interesting, real-world exercises and examples. Stressing central concepts such as model building, understanding parameters, assessing fit and reliability, and drawing conclusions, the new edition illus
Hosmer, David W; Sturdivant, Rodney X
2013-01-01
A new edition of the definitive guide to logistic regression modeling for health science and other applications This thoroughly expanded Third Edition provides an easily accessible introduction to the logistic regression (LR) model and highlights the power of this model by examining the relationship between a dichotomous outcome and a set of covariables. Applied Logistic Regression, Third Edition emphasizes applications in the health sciences and handpicks topics that best suit the use of modern statistical software. The book provides readers with state-of-
Directory of Open Access Journals (Sweden)
Nils Ternès
2017-05-01
Full Text Available Background Thanks to advances in genomics and targeted treatments, more and more prediction models based on biomarkers are being developed to predict potential benefit from treatments in a randomized clinical trial. Although the methodological framework for the development and validation of prediction models in a high-dimensional setting is becoming increasingly established, no clear guidance yet exists on how to estimate expected survival probabilities in a penalized model with biomarker-by-treatment interactions. Methods Based on a parsimonious biomarker selection in a penalized high-dimensional Cox model (lasso or adaptive lasso), we propose a unified framework to: estimate internally the predictive accuracy metrics of the developed model (using double cross-validation); estimate the individual survival probabilities at a given timepoint; construct confidence intervals thereof (analytical or bootstrap); and visualize them graphically (pointwise or smoothed with splines). We compared these strategies through a simulation study covering scenarios with or without biomarker effects. We applied the strategies to a large randomized phase III clinical trial that evaluated the effect of adding trastuzumab to chemotherapy in 1574 early breast cancer patients, for whom the expression of 462 genes was measured. Results In our simulations, penalized regression models using the adaptive lasso estimated the survival probability of new patients with low bias and standard error; bootstrapped confidence intervals had empirical coverage probability close to the nominal level across very different scenarios. The double cross-validation performed on the training data set closely mimicked the predictive accuracy of the selected models in external validation data. We also propose a useful visual representation of the expected survival probabilities using splines. In the breast cancer trial, the adaptive lasso penalty selected a prediction model with 4
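The adaptive-lasso idea, penalty weights inversely proportional to an initial estimate, can be sketched in its simplest linear (non-Cox) form via column rescaling; the paper's penalized Cox model and survival machinery are not reproduced here:

```python
# Adaptive lasso via column rescaling: penalizing gamma_j = w_j * beta_j with
# a plain L1 penalty is equivalent to penalizing beta_j with weight w_j.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
n, p = 100, 50
X = rng.normal(size=(n, p))
beta = np.zeros(p); beta[:3] = [3.0, -2.0, 1.5]   # three true predictors
y = X @ beta + rng.normal(size=n)

init = Ridge(alpha=1.0).fit(X, y).coef_           # initial estimate
w = 1.0 / (np.abs(init) + 1e-8)                   # adaptive penalty weights
lasso = Lasso(alpha=0.1).fit(X / w, y)            # rescale columns by 1/w_j
coef = lasso.coef_ / w                            # map back to original scale
print(np.flatnonzero(np.abs(coef) > 1e-8))
```

Strong initial coefficients get small penalties and survive selection; noise variables get large penalties and are driven to zero, which is what makes the selection "parsimonious".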
Tighe, Patrick J; Harle, Christopher A; Hurley, Robert W; Aytug, Haldun; Boezaart, Andre P; Fillingim, Roger B
2015-07-01
Given their ability to process high-dimensional datasets with hundreds of variables, machine learning algorithms may offer one solution to the vexing challenge of predicting postoperative pain. Here, we report on the application of machine learning algorithms to predict postoperative pain outcomes in a retrospective cohort of 8,071 surgical patients using 796 clinical variables. Five algorithms were compared in terms of their ability to forecast moderate to severe postoperative pain: Least Absolute Shrinkage and Selection Operator (LASSO), gradient-boosted decision tree, support vector machine, neural network, and k-nearest neighbor (k-NN), with logistic regression included for baseline comparison. In forecasting moderate to severe postoperative pain for postoperative day (POD) 1, the LASSO algorithm, using all 796 variables, had the highest accuracy with an area under the receiver-operating curve (ROC) of 0.704. Next, the gradient-boosted decision tree had an ROC of 0.665 and the k-NN algorithm had an ROC of 0.643. For POD 3, the LASSO algorithm, using all variables, again had the highest accuracy, with an ROC of 0.727. Logistic regression had a lower ROC of 0.5 for predicting pain outcomes on POD 1 and 3. Machine learning algorithms, when combined with complex and heterogeneous data from electronic medical record systems, can forecast acute postoperative pain outcomes with accuracies similar to methods that rely only on variables specifically collected for pain outcome prediction.
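Scoring an L1-penalized (LASSO-style) classifier by area under the ROC curve, the metric reported above, can be sketched as follows; the synthetic data stands in for the 796 clinical variables:

```python
# Fit an L1-penalized logistic regression and report test-set ROC AUC,
# mirroring the evaluation above on synthetic stand-in data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=2000, n_features=100, n_informative=10,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
lasso.fit(X_tr, y_tr)
auc = roc_auc_score(y_te, lasso.predict_proba(X_te)[:, 1])
print(round(auc, 3))
```

The L1 penalty zeroes out uninformative variables, which is why LASSO can remain competitive with hundreds of heterogeneous predictors.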
Efficient characterization of high-dimensional parameter spaces for systems biology
Directory of Open Access Journals (Sweden)
Hafner Marc
2011-09-01
Full Text Available Abstract Background A biological system's robustness to mutations and its evolution are influenced by the structure of its viable space, the region of its space of biochemical parameters where it can exert its function. In systems with a large number of biochemical parameters, viable regions with potentially complex geometries fill a tiny fraction of the whole parameter space. This hampers explorations of the viable space based on "brute force" or Gaussian sampling. Results We here propose a novel algorithm to characterize viable spaces efficiently. The algorithm combines global and local explorations of a parameter space. The global exploration involves an out-of-equilibrium adaptive Metropolis Monte Carlo method aimed at identifying poorly connected viable regions. The local exploration then samples these regions in detail by a method we call multiple ellipsoid-based sampling. Our algorithm explores efficiently nonconvex and poorly connected viable regions of different test-problems. Most importantly, its computational effort scales linearly with the number of dimensions, in contrast to "brute force" sampling that shows an exponential dependence on the number of dimensions. We also apply this algorithm to a simplified model of a biochemical oscillator with positive and negative feedback loops. A detailed characterization of the model's viable space captures well known structural properties of circadian oscillators. Concretely, we find that model topologies with an essential negative feedback loop and a nonessential positive feedback loop provide the most robust fixed period oscillations. Moreover, the connectedness of the model's viable space suggests that biochemical oscillators with varying topologies can evolve from one another. Conclusions Our algorithm permits an efficient analysis of high-dimensional, nonconvex, and poorly connected viable spaces characteristic of complex biological circuitry. It allows a systematic use of robustness as
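A stripped-down version of the global exploration step, a Metropolis-style random walk that seeks and then samples a viable region, might look like this; the shell-shaped viability criterion and the greedy acceptance rule are illustrative simplifications of the paper's adaptive, out-of-equilibrium scheme:

```python
# Random-walk exploration of a viable region in a 10-dimensional
# parameter space (toy viability: a thin spherical shell).
import numpy as np

def viable(theta):
    r = np.linalg.norm(theta)
    return 0.9 < r < 1.1

rng = np.random.default_rng(0)
theta = rng.normal(size=10)
samples = []
for _ in range(5000):
    prop = theta + 0.1 * rng.normal(size=10)
    # Accept viable proposals; otherwise accept moves toward the shell.
    if viable(prop) or abs(np.linalg.norm(prop) - 1) < abs(np.linalg.norm(theta) - 1):
        theta = prop
    if viable(theta):
        samples.append(theta.copy())
print(len(samples))
```

The paper's contribution is to make such exploration scale linearly with dimension and to fill poorly connected viable regions via ellipsoid-based local sampling, which this sketch does not attempt.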
Competing Risks Data Analysis with High-dimensional Covariates: An Application in Bladder Cancer
Directory of Open Access Journals (Sweden)
Leili Tapak
2015-06-01
Full Text Available Analysis of microarray data is associated with the methodological problems of high dimension and small sample size. Various methods have been used for variable selection in high-dimension, small-sample-size cases with a single survival endpoint. However, little effort has been directed toward addressing competing risks, where there is more than one failure risk. This study compared three typical variable selection techniques, Lasso, elastic net, and likelihood-based boosting, for high-dimensional time-to-event data with competing risks. The performance of these methods was evaluated via a simulation study and by analyzing a real dataset related to bladder cancer patients, using time-dependent receiver operating characteristic (ROC) curves and bootstrap .632+ prediction error curves. The elastic net penalization method was shown to outperform Lasso and boosting. Based on the elastic net, 33 genes out of 1381 genes related to bladder cancer were selected. By fitting to the Fine and Gray model, eight genes were highly significant (P < 0.001). Among them, expression of RTN4, SON, IGF1R, SNRPE, PTGR1, PLEK, and ETFDH was associated with a decrease in survival time, whereas SMARCAD1 expression was associated with an increase in survival time. This study indicates that the elastic net has a higher capacity than the Lasso and boosting for the prediction of survival time in bladder cancer patients. Moreover, genes selected by all methods improved the predictive power of the model based on only clinical variables, indicating the value of information contained in the microarray features.
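Elastic-net variable selection in a p >> n setting, the idea behind the gene selection above, can be sketched with a plain linear elastic net (the Fine and Gray competing-risks model itself would need a survival package):

```python
# Elastic-net selection with far more predictors than samples, as with
# the 1381 candidate genes; data are synthetic.
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
n, p = 60, 1381                          # p >> n
X = rng.normal(size=(n, p))
beta = np.zeros(p); beta[:8] = 2.0       # 8 truly relevant "genes"
y = X @ beta + rng.normal(size=n)

enet = ElasticNet(alpha=0.3, l1_ratio=0.5).fit(X, y)
selected = np.flatnonzero(enet.coef_)
print(len(selected))
```

The L1 component gives sparsity while the L2 component stabilizes selection among correlated predictors, which is why the elastic net tends to beat the pure Lasso on microarray data.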
From Ambiguities to Insights: Query-based Comparisons of High-Dimensional Data
Kowalski, Jeanne; Talbot, Conover; Tsai, Hua L.; Prasad, Nijaguna; Umbricht, Christopher; Zeiger, Martha A.
2007-11-01
Genomic technologies will revolutionize drug discovery and development; that much is universally agreed upon. The high dimension of data from such technologies has challenged available data analytic methods; that much is apparent. To date, large-scale data repositories have not been utilized in ways that permit their wealth of information to be efficiently processed for knowledge, presumably due in large part to inadequate analytical tools to address numerous comparisons of high-dimensional data. In candidate gene discovery, expression comparisons are often made between two features (e.g., cancerous versus normal), such that the enumeration of outcomes is manageable. With multiple features, the setting becomes more complex, in terms of comparing expression levels of tens of thousands of transcripts across hundreds of features. In this case, the number of outcomes, while enumerable, becomes rapidly large and unmanageable, and scientific inquiries become more abstract, such as "which one of these (compounds, stimuli, etc.) is not like the others?" We develop analytical tools that promote more extensive, efficient, and rigorous utilization of the public data resources generated by the massive support of genomic studies. Our work innovates by enabling access to such metadata with logically formulated scientific inquiries that define, compare and integrate query-comparison pair relations for analysis. We demonstrate our computational tool's potential to address an outstanding biomedical informatics issue of identifying reliable molecular markers in thyroid cancer. Our proposed query-based comparison (QBC) facilitates access to and efficient utilization of metadata through logically formed inquiries expressed as query-based comparisons by organizing and comparing results from biotechnologies to address applications in biomedicine.
Franck, I. M.; Koutsourelakis, P. S.
2017-01-01
This paper is concerned with the numerical solution of model-based, Bayesian inverse problems. We are particularly interested in cases where the cost of each likelihood evaluation (forward-model call) is expensive and the number of unknown (latent) variables is high. This is the setting in many problems in computational physics where forward models with nonlinear PDEs are used and the parameters to be calibrated involve spatio-temporally varying coefficients, which upon discretization give rise to a high-dimensional vector of unknowns. One of the consequences of the well-documented ill-posedness of inverse problems is the possibility of multiple solutions. While such information is contained in the posterior density in Bayesian formulations, the discovery of a single mode, let alone multiple, poses a formidable computational task. The goal of the present paper is two-fold. On one hand, we propose approximate, adaptive inference strategies using mixture densities to capture multi-modal posteriors. On the other, we extend our work in [1] with regard to effective dimensionality reduction techniques that reveal low-dimensional subspaces where the posterior variance is mostly concentrated. We validate the proposed model by employing Importance Sampling, which confirms that the bias introduced is small and can be efficiently corrected if the analyst wishes to do so. We demonstrate the performance of the proposed strategy in nonlinear elastography, where the identification of the mechanical properties of biological materials can inform non-invasive medical diagnosis. The discovery of multiple modes (solutions) in such problems is critical in achieving the diagnostic objectives.
Platon, Ludovic; Pejoski, David; Gautreau, Guillaume; Targat, Brice; Le Grand, Roger; Beignon, Anne-Sophie; Tchitchek, Nicolas
2018-01-01
Cytometry is an experimental technique used to measure molecules expressed by cells at single-cell resolution. Recently, several technological improvements have made it possible to greatly increase the number of cell markers that can be simultaneously measured. Many computational methods have been proposed to identify clusters of cells having similar phenotypes. Nevertheless, only a limited number of computational methods permit comparison of the phenotypes of the cell clusters identified by different clustering approaches. These phenotypic comparisons are necessary to choose the appropriate clustering methods and settings. Because of this lack of tools, comparisons of cell cluster phenotypes are often performed manually, a highly biased and time-consuming process. We designed CytoCompare, an R package that compares the phenotypes of cell clusters, based on the distribution of marker expressions, with the purpose of identifying similar and different ones. For each comparison of two cell clusters, CytoCompare provides a distance measure as well as a p-value asserting the statistical significance of the difference. CytoCompare can import clustering results from various algorithms, including SPADE, viSNE/ACCENSE, and Citrus, currently the most widely used ones. Additionally, CytoCompare can generate parallel-coordinates, parallel-heatmap, multidimensional-scaling, or circular-graph representations to easily visualize cell cluster phenotypes and the comparison results. CytoCompare is a flexible analysis pipeline for comparing the phenotypes of cell clusters identified by automatic gating algorithms in high-dimensional cytometry data. This R package is ideal for benchmarking different clustering algorithms and associated parameters. CytoCompare is freely distributed under the GPL-3 license and is available on https://github.com/tchitchek-lab/CytoCompare. Copyright © 2017 The Authors. Published by Elsevier Inc. All rights reserved.
Local two-sample testing: a new tool for analysing high-dimensional astronomical data
Freeman, P. E.; Kim, I.; Lee, A. B.
2017-11-01
Modern surveys have provided the astronomical community with a flood of high-dimensional data, but analyses of these data often occur after their projection to lower dimensional spaces. In this work, we introduce a local two-sample hypothesis test framework that an analyst may directly apply to data in their native space. In this framework, the analyst defines two classes based on a response variable of interest (e.g. higher mass galaxies versus lower mass galaxies) and determines at arbitrary points in predictor space whether the local proportions of objects that belong to the two classes significantly differ from the global proportion. Our framework has a myriad of potential uses throughout astronomy; here, we demonstrate its efficacy by applying it to a sample of 2487 I-band-selected galaxies observed by the HST-ACS in four of the CANDELS programme fields. For each galaxy, we have seven morphological summary statistics along with an estimated stellar mass and star formation rate (SFR). We perform two studies: one in which we determine regions of the seven-dimensional space of morphological statistics where high-mass galaxies are significantly more numerous than low-mass galaxies, and vice versa, and another study where we use SFR in place of mass. We find that we are able to identify such regions, and show how high-mass/low-SFR regions are associated with concentrated and undisturbed galaxies, while galaxies in low-mass/high-SFR regions appear more extended and/or disturbed than their high-mass/low-SFR counterparts.
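The core of this framework can be illustrated with a small sketch (hypothetical data and function names of my own, not the authors' implementation): at a query point in predictor space, compare the class proportion among the k nearest neighbours with the global proportion using an exact binomial test.

```python
import math

def binom_two_sided(k, n, p):
    """Exact two-sided binomial p-value: total probability of all outcomes
    no more likely than the observed count k under Binomial(n, p)."""
    pmf = lambda i: math.comb(n, i) * p ** i * (1 - p) ** (n - i)
    obs = pmf(k)
    return sum(pmf(i) for i in range(n + 1) if pmf(i) <= obs + 1e-12)

def local_test(points, labels, query, k):
    """Compare the class-1 proportion among the k nearest neighbours of
    `query` with the global class-1 proportion; return the p-value."""
    dist2 = lambda a, b: sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    order = sorted(range(len(points)), key=lambda i: dist2(points[i], query))
    local_count = sum(labels[i] for i in order[:k])
    p_global = sum(labels) / len(labels)
    return binom_two_sided(local_count, k, p_global)

# Toy sample: a cluster of class-1 objects near the origin and a cluster of
# class-0 objects far away; globally the two classes are balanced.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5),
          (10, 10), (10, 11), (11, 10), (11, 11), (10.5, 10.5)]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
pval = local_test(points, labels, query=(0, 0), k=5)
```

Near the origin all five neighbours belong to class 1 while the global proportion is 0.5, so the test reports a small p-value: the local proportion differs significantly from the global one.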
Nakamura, Ryota; Suhrcke, Marc; Jebb, Susan A; Pechey, Rachel; Almiron-Roig, Eva; Marteau, Theresa M
2015-01-01
Background: There is a growing concern, but limited evidence, that price promotions contribute to a poor diet and the social patterning of diet-related disease. Objective: We examined the following questions: 1) Are less-healthy foods more likely to be promoted than healthier foods? 2) Are consumers more responsive to promotions on less-healthy products? 3) Are there socioeconomic differences in food purchases in response to price promotions? Design: With the use of hierarchical regression, we analyzed data on purchases of 11,323 products within 135 food and beverage categories from 26,986 households in Great Britain during 2010. Major supermarkets operated the same price promotions in all branches. The number of stores that offered price promotions on each product for each week was used to measure the frequency of price promotions. We assessed the healthiness of each product by using a nutrient profiling (NP) model. Results: A total of 6788 products (60%) were in healthier categories and 4535 products (40%) were in less-healthy categories. There was no significant gap in the frequency of promotion by the healthiness of products, either within or between categories. However, after we controlled for the reference price, price discount rate, and brand-specific effects, the sales uplift arising from price promotions was larger in less-healthy than in healthier categories; a 1-SD increase in the category mean NP score, implying that the category becomes less healthy, was associated with an additional 7.7-percentage-point increase in sales (from 27.3% to 35.0%). The response to promotions was also larger for higher-socioeconomic-status (SES) groups than for lower ones (34.6% for the high-SES group, 28.1% for the middle-SES group, and 23.1% for the low-SES group). Finally, there was no significant SES gap in the absolute volume of purchases of less-healthy foods made on promotion. Conclusion: Attempts to limit promotions on less-healthy foods could improve the population diet but
Semiparametric Regression Pursuit.
Huang, Jian; Wei, Fengrong; Ma, Shuangge
2012-10-01
The semiparametric partially linear model allows flexible modeling of covariate effects on the response variable in regression. It combines the flexibility of nonparametric regression with the parsimony of linear regression. The most important assumption in existing estimation methods for this model is that it is known a priori which covariates have a linear effect and which do not. In applied work, however, this is rarely known in advance. We consider the problem of estimation in partially linear models without assuming a priori which covariates have linear effects. We propose a semiparametric regression pursuit method for identifying the covariates with a linear effect. Our proposed method is a penalized regression approach using a group minimax concave penalty. Under suitable conditions we show that the proposed approach is model-pursuit consistent, meaning that with high probability it can correctly determine which covariates have a linear effect and which do not. The performance of the proposed method is evaluated using simulation studies, which support our theoretical results. A real data example is used to illustrate the application of the proposed method.
[Understanding logistic regression].
El Sanharawi, M; Naudet, F
2013-10-01
Logistic regression is one of the most common multivariate analysis models used in epidemiology. It allows measurement of the association between the occurrence of an event (the qualitative dependent variable) and factors that may influence it (the explanatory variables). The choice of explanatory variables to include in the logistic regression model is based on prior knowledge of the disease's physiopathology and on the statistical association between each variable and the event, as measured by the odds ratio. The main steps of the procedure, the conditions of application, and the essential tools for its interpretation are discussed concisely. We also discuss the importance of the choice of variables that must be included and retained in the regression model in order to avoid the omission of important confounding factors. Finally, by way of illustration, we provide an example from the literature, which should help the reader test his or her knowledge. Copyright © 2013 Elsevier Masson SAS. All rights reserved.
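The central quantities here (the fitted coefficient and the odds ratio as its exponential) can be sketched in a few lines of code. This is a toy illustration with made-up data and function names, not material from the article: a one-predictor logistic model fitted by gradient ascent on the (concave) log-likelihood.

```python
import math

def fit_logistic(xs, ys, lr=1.0, iters=10000):
    """Fit P(y = 1 | x) = 1 / (1 + exp(-(b0 + b1 * x))) by gradient ascent
    on the log-likelihood; the likelihood is concave, so a modest fixed
    step size converges."""
    b0 = b1 = 0.0
    n = len(xs)
    for _ in range(iters):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            g0 += y - p
            g1 += (y - p) * x
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1

# Toy cohort: x = exposure (0/1), y = occurrence of the event.
xs = [0, 0, 0, 0, 1, 1, 1, 1]
ys = [0, 0, 0, 1, 0, 1, 1, 1]
b0, b1 = fit_logistic(xs, ys)
odds_ratio = math.exp(b1)  # odds(event | exposed) / odds(event | unexposed)
```

With a single binary predictor the fitted odds ratio reproduces the 2x2-table value: odds of 3 among the exposed versus 1/3 among the unexposed, an odds ratio of 9.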
Sparse Regression by Projection and Sparse Discriminant Analysis
Qi, Xin
2015-04-03
© 2015, © American Statistical Association, Institute of Mathematical Statistics, and Interface Foundation of North America. Recent years have seen active developments of various penalized regression methods, such as LASSO and elastic net, to analyze high-dimensional data. In these approaches, the direction and length of the regression coefficients are determined simultaneously. Due to the introduction of penalties, the length of the estimates can be far from being optimal for accurate predictions. We introduce a new framework, regression by projection, and its sparse version to analyze high-dimensional data. The unique nature of this framework is that the directions of the regression coefficients are inferred first, and the lengths and the tuning parameters are determined by a cross-validation procedure to achieve the largest prediction accuracy. We provide a theoretical result for simultaneous model selection consistency and parameter estimation consistency of our method in high dimension. This new framework is then generalized such that it can be applied to principal components analysis, partial least squares, and canonical correlation analysis. We also adapt this framework for discriminant analysis. Compared with the existing methods, where there is relatively little control of the dependency among the sparse components, our method can control the relationships among the components. We present efficient algorithms and related theory for solving the sparse regression by projection problem. Based on extensive simulations and real data analysis, we demonstrate that our method achieves good predictive performance and variable selection in the regression setting, and the ability to control relationships between the sparse components leads to more accurate classification. In supplementary materials available online, the details of the algorithms and theoretical proofs, and R codes for all simulation studies are provided.
Simultaneous Inference in Regression
Liu, Wei
2010-01-01
The use of simultaneous confidence bands in linear regression is a vibrant area of research. This book presents an overview of the methodology and applications, including necessary background material on linear models. A special chapter on logistic regression gives readers a glimpse into how these methods can be used for generalized linear models. The appendices provide computational tools for simulating confidence bands. The author also includes MATLAB® programs for all examples on the web. With many numerical examples and software implementation, this text serves the needs of rese
Xuan Chi; Barry Goodwin
2012-01-01
Spatial and temporal relationships among agricultural prices have been an important topic of applied research for many years. Such research is used to investigate the performance of markets and to examine linkages up and down the marketing chain. This research has empirically evaluated price linkages by using correlation and regression models and, later, linear and...
Ritz, Christian; Parmigiani, Giovanni
2009-01-01
R is a rapidly evolving lingua franca of graphical display and statistical analysis of experiments from the applied sciences. This book provides a coherent treatment of nonlinear regression with R by means of examples from a diversity of applied sciences such as biology, chemistry, engineering, medicine and toxicology.
African Journals Online (AJOL)
zlukovi
modelled as a quadratic regression, nested within parity. The previous lactation length was ... This proportion was mainly covered by linear and quadratic coefficients. Results suggest that RRM could ... The multiple trait models in scalar notation are presented by equations (1, 2), while equation (3) represents the random ...
Modern Regression Discontinuity Analysis
Bloom, Howard S.
2012-01-01
This article provides a detailed discussion of the theory and practice of modern regression discontinuity (RD) analysis for estimating the effects of interventions or treatments. Part 1 briefly chronicles the history of RD analysis and summarizes its past applications. Part 2 explains how in theory an RD analysis can identify an average effect of…
Multiple linear regression analysis
Edwards, T. R.
1980-01-01
Program rapidly selects best-suited set of coefficients. User supplies only vectors of independent and dependent data and specifies confidence level required. Program uses stepwise statistical procedure for relating minimal set of variables to set of observations; final regression contains only most statistically significant coefficients. Program is written in FORTRAN IV for batch execution and has been implemented on NOVA 1200.
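The stepwise idea this program implements can be sketched in modern code. The following is an illustrative forward-selection variant of my own (an RSS-improvement threshold stands in for the program's formal significance tests, and the data are invented), not the NOVA 1200 FORTRAN IV implementation:

```python
def solve(A, b):
    """Solve the small dense linear system A x = b by Gaussian elimination
    with partial pivoting (A and b are nested lists)."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def rss(X, y, cols):
    """Residual sum of squares of the least-squares fit on the chosen
    columns (an intercept is always included), via the normal equations."""
    Z = [[1.0] + [row[j] for j in cols] for row in X]
    p = len(Z[0])
    A = [[sum(z[a] * z[b] for z in Z) for b in range(p)] for a in range(p)]
    c = [sum(z[a] * yi for z, yi in zip(Z, y)) for a in range(p)]
    beta = solve(A, c)
    return sum((yi - sum(bb * zz for bb, zz in zip(beta, z))) ** 2
               for z, yi in zip(Z, y))

def forward_stepwise(X, y, tol=1e-6):
    """Greedily add the predictor giving the largest drop in RSS; stop when
    no remaining predictor improves the fit by more than tol."""
    remaining, chosen = set(range(len(X[0]))), []
    best = rss(X, y, chosen)
    while remaining:
        j, r = min(((j, rss(X, y, chosen + [j])) for j in remaining),
                   key=lambda t: t[1])
        if best - r < tol:
            break
        chosen.append(j)
        remaining.remove(j)
        best = r
    return chosen

# Column 1 is irrelevant: y depends only on columns 0 and 2.
X = [[1, 5, 2], [2, 3, 1], [3, 8, 4], [4, 1, 0], [5, 6, 3], [6, 2, 5]]
y = [2 * row[0] - row[2] for row in X]
selected = forward_stepwise(X, y)  # keeps columns 0 and 2, skips column 1
```

As in the described program, the final regression contains only the predictors that contribute materially to the fit; the irrelevant column is never selected.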
Seber, George A F
2012-01-01
Concise, mathematically clear, and comprehensive treatment of the subject. Expanded coverage of diagnostics and methods of model fitting. Requires no specialized knowledge beyond a good grasp of matrix algebra and some acquaintance with straight-line regression and simple analysis of variance models. More than 200 problems throughout the book, plus outline solutions for the exercises. This revision has been extensively class-tested.
Bounded Gaussian process regression
DEFF Research Database (Denmark)
Jensen, Bjørn Sand; Nielsen, Jens Brehm; Larsen, Jan
2013-01-01
We extend the Gaussian process (GP) framework for bounded regression by introducing two bounded likelihood functions that model the noise on the dependent variable explicitly. This is fundamentally different from the implicit noise assumption in the previously suggested warped GP framework. We...
Mechanisms of neuroblastoma regression
Brodeur, Garrett M.; Bagatell, Rochelle
2014-01-01
Recent genomic and biological studies of neuroblastoma have shed light on the dramatic heterogeneity in the clinical behaviour of this disease, which spans from spontaneous regression or differentiation in some patients, to relentless disease progression in others, despite intensive multimodality therapy. This evidence also suggests several possible mechanisms to explain the phenomena of spontaneous regression in neuroblastomas, including neurotrophin deprivation, humoral or cellular immunity, loss of telomerase activity and alterations in epigenetic regulation. A better understanding of the mechanisms of spontaneous regression might help to identify optimal therapeutic approaches for patients with these tumours. Currently, the most druggable mechanism is the delayed activation of developmentally programmed cell death regulated by the tropomyosin receptor kinase A pathway. Indeed, targeted therapy aimed at inhibiting neurotrophin receptors might be used in lieu of conventional chemotherapy or radiation in infants with biologically favourable tumours that require treatment. Alternative approaches consist of breaking immune tolerance to tumour antigens or activating neurotrophin receptor pathways to induce neuronal differentiation. These approaches are likely to be most effective against biologically favourable tumours, but they might also provide insights into treatment of biologically unfavourable tumours. We describe the different mechanisms of spontaneous neuroblastoma regression and the consequent therapeutic approaches. PMID:25331179
Kavakiotis, Ioannis; Samaras, Patroklos; Triantafyllidis, Alexandros; Vlahavas, Ioannis
2017-11-01
Single nucleotide polymorphisms (SNPs) are nowadays becoming the marker of choice for biological analyses involving a wide range of applications of great medical, biological, economic, and environmental interest. Classification tasks, i.e., the assignment of individuals to groups of origin based on their (multi-locus) genotypes, are performed in many fields such as forensic investigations and discrimination between wild and/or farmed populations, among others. These tasks should be performed with a small number of loci, for computational as well as biological reasons. Thus, feature selection should precede classification tasks, especially for single nucleotide polymorphism (SNP) datasets, where the number of features can amount to hundreds of thousands or millions. In this paper, we present a novel data mining approach, called FIFS (Frequent Item Feature Selection), based on the use of frequent items for selection of the most informative markers from population genomic data. It is a modular method consisting of two main components. The first identifies the most frequent and unique genotypes for each sampled population. The second selects the most appropriate among them in order to create the informative SNP subsets to be returned. The proposed method (FIFS) was tested on a real dataset comprising a comprehensive coverage of pig breed types present in Britain. This dataset consisted of 446 individuals divided into 14 sub-populations, genotyped at 59,436 SNPs. Our method outperforms the state-of-the-art and baseline methods in every case. More specifically, our method surpassed the assignment accuracy threshold of 95% needing only half the number of SNPs selected by other methods (FIFS: 28 SNPs; Delta: 70 SNPs; pairwise FST: 70 SNPs; In: 100 SNPs). Conclusion: Our approach successfully deals with the problem of informative marker selection in high-dimensional genomic datasets. It offers better results compared to existing approaches and can aid biologists
Spontaneous regression of metastatic Merkel cell carcinoma.
LENUS (Irish Health Repository)
Hassan, S J
2010-01-01
Merkel cell carcinoma is a rare aggressive neuroendocrine carcinoma of the skin predominantly affecting elderly Caucasians. It has a high rate of local recurrence and regional lymph node metastases. It is associated with a poor prognosis. Complete spontaneous regression of Merkel cell carcinoma has been reported but is a poorly understood phenomenon. Here we present a case of complete spontaneous regression of metastatic Merkel cell carcinoma demonstrating a markedly different pattern of events from those previously published.
Efficient Regularized Regression with L0 Penalty for Variable Selection and Network Construction
Directory of Open Access Journals (Sweden)
Zhenqiu Liu
2016-01-01
Full Text Available Variable selection for regression with high-dimensional big data has found many applications in bioinformatics and computational biology. One appealing approach is L0 regularized regression, which penalizes the number of nonzero features in the model directly. However, it is well known that L0 optimization is NP-hard and computationally challenging. In this paper, we propose efficient EM (L0EM) and dual L0EM (DL0EM) algorithms that directly approximate the L0 optimization problem. While L0EM is efficient with large sample sizes, DL0EM is efficient with high-dimensional (n≪m) data. They also provide a natural solution to all Lp, p∈[0,2], problems, including the lasso with p=1 and the elastic net with p∈[1,2]. The regularization parameter λ can be determined through cross-validation or AIC and BIC. We demonstrate our methods through simulation and high-dimensional genomic data. The results indicate that L0 has better performance than the lasso, SCAD, and MC+, and that L0 with AIC or BIC performs similarly to computationally intensive cross-validation. The proposed algorithms are efficient in identifying the nonzero variables with less bias and in constructing biologically important networks with high-dimensional big data.
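The L0EM algorithm itself is given in the paper; as a generic illustration of what an L0 constraint does, here is a sketch of iterative hard thresholding, a standard approximation to L0-constrained least squares. This is not the authors' L0EM, and the data are made up:

```python
def iht(X, y, k, lr=0.2, iters=1000):
    """Iterative hard thresholding: take a gradient step on the least-squares
    loss, then keep only the k largest-magnitude coefficients, enforcing the
    L0 constraint ||beta||_0 <= k directly."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(iters):
        resid = [yi - sum(b * xij for b, xij in zip(beta, xi))
                 for xi, yi in zip(X, y)]
        grad = [-2.0 / n * sum(r * xi[j] for r, xi in zip(resid, X))
                for j in range(p)]
        beta = [b - lr * g for b, g in zip(beta, grad)]
        keep = set(sorted(range(p), key=lambda j: abs(beta[j]),
                          reverse=True)[:k])
        beta = [beta[j] if j in keep else 0.0 for j in range(p)]
    return beta

# Sparse ground truth: y = 3*x0 - 2*x3; three of the five features are noise.
X = [[1, 0, 0, 0, 0], [0, 1, 0, 0, 0], [0, 0, 1, 0, 0],
     [0, 0, 0, 1, 0], [0, 0, 0, 0, 1], [1, 1, 1, 1, 1]]
y = [3, 0, 0, -2, 0, 1]
beta = iht(X, y, k=2)  # recovers the support {0, 3}
```

The thresholding step is what makes the penalty L0-like: it counts (and caps) nonzero coefficients rather than shrinking them all, which is why L0-type methods can exhibit less bias than the lasso on the selected variables.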
Hua, Yi-Lin; Zhou, Zong-Quan; Liu, Xiao; Yang, Tian-Shu; Li, Zong-Feng; Li, Pei-Yun; Chen, Geng; Xu, Xiao-Ye; Tang, Jian-Shun; Xu, Jin-Shi; Li, Chuan-Feng; Guo, Guang-Can
2018-01-01
A photon pair can be entangled in many degrees of freedom such as polarization, time bins, and orbital angular momentum (OAM). Among them, the OAM of photons can be entangled in an infinite-dimensional Hilbert space which enhances the channel capacity of sharing information in a network. Twisted photons generated by spontaneous parametric down-conversion offer an opportunity to create this high-dimensional entanglement, but a photon pair generated by this process is typically wideband, which makes it difficult to interface with the quantum memories in a network. Here we propose an annual-ring-type quasi-phase-matching (QPM) crystal for generation of the narrowband high-dimensional entanglement. The structure of the QPM crystal is designed by tracking the geometric divergences of the OAM modes that comprise the entangled state. The dimensionality and the quality of the entanglement can be greatly enhanced with the annual-ring-type QPM crystal.
The xyz algorithm for fast interaction search in high-dimensional data
Thanei, Gian-Andrea; Meinshausen, Nicolai; Shah, Rajen D.
2016-01-01
When performing regression on a dataset with $p$ variables, it is often of interest to go beyond using main linear effects and include interactions as products between individual variables. For small-scale problems, these interactions can be computed explicitly, but this leads to a computational complexity of at least $\mathcal{O}(p^2)$ if done naively. This cost can be prohibitive if $p$ is very large. We introduce a new randomised algorithm that is able to discover interactions with high pro...
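The naive baseline that the xyz algorithm avoids can be written directly (illustrative data and an unnormalised covariance score of my own choosing; the paper's randomised algorithm is precisely a way to skip this exhaustive loop): score every product feature x_j * x_k against the response, at O(p^2) cost.

```python
from itertools import combinations

def strongest_interaction(X, y):
    """Exhaustive O(p^2) interaction screen: score each pairwise product
    feature x_j * x_k by its absolute (unnormalised) covariance with y,
    and return the best pair of column indices."""
    n = len(X)
    p = len(X[0])
    ybar = sum(y) / n
    yc = [yi - ybar for yi in y]
    best_score, best_pair = -1.0, None
    for j, k in combinations(range(p), 2):
        z = [xi[j] * xi[k] for xi in X]
        zbar = sum(z) / n
        score = abs(sum((zi - zbar) * yci for zi, yci in zip(z, yc)))
        if score > best_score:
            best_score, best_pair = score, (j, k)
    return best_pair

# The response is exactly the product of columns 0 and 2.
X = [[1, 2, 3], [2, 1, 0], [3, 3, 1], [0, 2, 2], [2, 2, 2], [1, 0, 4]]
y = [row[0] * row[2] for row in X]
pair = strongest_interaction(X, y)
```

For p in the thousands this double loop is already millions of candidate products per pass, which is the cost the randomised search is designed to beat.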
Principled sure independence screening for Cox models with ultra-high-dimensional covariates
Zhao, Sihai Dave; Li, Yi
2012-01-01
It is rather challenging for current variable selectors to handle situations where the number of covariates under consideration is ultra-high. Consider a motivating clinical trial of the drug bortezomib for the treatment of multiple myeloma, where overall survival and expression levels of 44760 probesets were measured for each of 80 patients with the goal of identifying genes that predict survival after treatment. This dataset defies analysis even with regularized regression. Some remedies ha...
Subset selection in regression
Miller, Alan
2002-01-01
Originally published in 1990, the first edition of Subset Selection in Regression filled a significant gap in the literature, and its critical and popular success has continued for more than a decade. Thoroughly revised to reflect progress in theory, methods, and computing power, the second edition promises to continue that tradition. The author has thoroughly updated each chapter, incorporated new material on recent developments, and included more examples and references. New in the Second Edition:A separate chapter on Bayesian methodsComplete revision of the chapter on estimationA major example from the field of near infrared spectroscopyMore emphasis on cross-validationGreater focus on bootstrappingStochastic algorithms for finding good subsets from large numbers of predictors when an exhaustive search is not feasible Software available on the Internet for implementing many of the algorithms presentedMore examplesSubset Selection in Regression, Second Edition remains dedicated to the techniques for fitting...
Classification and regression trees
Breiman, Leo; Olshen, Richard A; Stone, Charles J
1984-01-01
The methodology used to construct tree structured rules is the focus of this monograph. Unlike many other statistical procedures, which moved from pencil and paper to calculators, this text's use of trees was unthinkable before computers. Both the practical and theoretical sides have been developed in the authors' study of tree methods. Classification and Regression Trees reflects these two sides, covering the use of trees as a data analysis method, and in a more mathematical framework, proving some of their fundamental properties.
Better Autologistic Regression
Directory of Open Access Journals (Sweden)
Mark A. Wolters
2017-11-01
Full Text Available Autologistic regression is an important probability model for dichotomous random variables observed along with covariate information. It has been used in various fields for analyzing binary data possessing spatial or network structure. The model can be viewed as an extension of the autologistic model (also known as the Ising model, quadratic exponential binary distribution, or Boltzmann machine) to include covariates. It can also be viewed as an extension of logistic regression to handle responses that are not independent. Not all authors use exactly the same form of the autologistic regression model. Variations of the model differ in two respects. First, the variable coding—the two numbers used to represent the two possible states of the variables—might differ. Common coding choices are (zero, one) and (minus one, plus one). Second, the model might appear in either of two algebraic forms: a standard form, or a recently proposed centered form. Little attention has been paid to the effect of these differences, and the literature shows ambiguity about their importance. It is shown here that changes to either coding or centering in fact produce distinct, non-nested probability models. Theoretical results, numerical studies, and analysis of an ecological data set all show that the differences among the models can be large and practically significant. Understanding the nature of the differences and making appropriate modeling choices can lead to significantly improved autologistic regression analyses. The results strongly suggest that the standard model with plus/minus coding, which we call the symmetric autologistic model, is the most natural choice among the autologistic variants.
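The coding issue can be made concrete with a two-variable toy model (an enumeration of my own with arbitrary parameter values, not the paper's examples): holding the parameters fixed, the zero/one and plus/minus codings assign different probabilities to the same binary outcomes.

```python
import math
from itertools import product

def joint(coding, alpha=0.3, lam=0.8):
    """Joint distribution of a two-variable autologistic (Ising-type) model,
    P(z) proportional to exp(alpha*(z1 + z2) + lam*z1*z2), under the
    given variable coding."""
    states = list(product(coding, repeat=2))
    w = [math.exp(alpha * (z1 + z2) + lam * z1 * z2) for z1, z2 in states]
    total = sum(w)
    return {s: wi / total for s, wi in zip(states, w)}

p01 = joint((0, 1))    # zero/one coding
ppm = joint((-1, 1))   # minus-one/plus-one coding

# Same parameters, same "both variables high" event, different probabilities:
prob_high_01 = p01[(1, 1)]
prob_high_pm = ppm[(1, 1)]
```

With these parameter values the "both high" probability is about 0.52 under zero/one coding but about 0.66 under plus/minus coding, consistent with the paper's point that the coding variants are distinct probability models rather than reparameterizations of one another.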
DEFF Research Database (Denmark)
Hansen, Henrik; Tarp, Finn
2001-01-01
There are, however, decreasing returns to aid, and the estimated effectiveness of aid is highly sensitive to the choice of estimator and the set of control variables. When investment and human capital are controlled for, no positive effect of aid is found. Yet, aid continues to impact on growth via investment. We conclude by stressing the need for more theoretical work before this kind of cross-country regression is used for policy purposes.
Hilbe, Joseph M
2009-01-01
This book really does cover everything you ever wanted to know about logistic regression … with updates available on the author's website. Hilbe, a former national athletics champion, philosopher, and expert in astronomy, is a master at explaining statistical concepts and methods. Readers familiar with his other expository work will know what to expect: great clarity. The book provides considerable detail about all facets of logistic regression. No step of an argument is omitted, so the book will meet the needs of the reader who likes to see everything spelt out, while a person familiar with some of the topics has the option to skip "obvious" sections. The material has been thoroughly road-tested through classroom and web-based teaching. … The focus is on helping the reader to learn and understand logistic regression. The audience is not just students meeting the topic for the first time, but also experienced users. I believe the book really does meet the author's goal … -Annette J. Dobson, Biometric...
Nakano, Takashi; Otsuka, Makoto; Yoshimoto, Junichiro; Doya, Kenji
2015-01-01
A theoretical framework of reinforcement learning plays an important role in understanding action selection in animals. Spiking neural networks provide a theoretically grounded means to test computational hypotheses on neurally plausible algorithms of reinforcement learning through numerical simulation. However, most of these models cannot handle observations which are noisy, or occurred in the past, even though these are inevitable and constraining features of learning in real environments. This class of problem is formally known as partially observable reinforcement learning (PORL) problems. It provides a generalization of reinforcement learning to partially observable domains. In addition, observations in the real world tend to be rich and high-dimensional. In this work, we use a spiking neural network model to approximate the free energy of a restricted Boltzmann machine and apply it to the solution of PORL problems with high-dimensional observations. Our spiking network model solves maze tasks with perceptually ambiguous high-dimensional observations without knowledge of the true environment. An extended model with working memory also solves history-dependent tasks. The way spiking neural networks handle PORL problems may provide a glimpse into the underlying laws of neural information processing which can only be discovered through such a top-down approach.
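The quantity the spiking network approximates can be stated compactly: for a binary RBM with visible biases a, hidden biases b, and weights W, the free energy of a visible vector v is F(v) = -a·v - Σ_j log(1 + exp(b_j + (Wᵀv)_j)), and exp(-F(v)) is proportional to the marginal probability of v. A small sketch with toy parameters of my own (not the paper's network):

```python
import math
from itertools import product

def rbm_free_energy(v, a, b, W):
    """Free energy of a binary RBM: exp(-F(v)) is proportional to the
    marginal probability of the visible vector v (hidden units summed out)."""
    visible = sum(ai * vi for ai, vi in zip(a, v))
    hidden = sum(math.log1p(math.exp(bj + sum(W[i][j] * v[i]
                                              for i in range(len(v)))))
                 for j, bj in enumerate(b))
    return -visible - hidden

# Toy RBM: two visible units, one hidden unit.
a, b, W = [0.5, -0.2], [0.1], [[0.3], [0.4]]

# Sanity check: summing exp(-F(v)) over all visible states reproduces the
# partition function obtained by brute-force enumeration over (v, h).
Z_free = sum(math.exp(-rbm_free_energy(v, a, b, W))
             for v in product((0, 1), repeat=2))
Z_brute = sum(math.exp(sum(ai * vi for ai, vi in zip(a, v))
                       + sum(bj * hj for bj, hj in zip(b, h))
                       + sum(v[i] * W[i][j] * h[j]
                             for i in range(2) for j in range(1)))
              for v in product((0, 1), repeat=2)
              for h in product((0, 1), repeat=1))
```

The closed-form hidden-unit sum is what makes the free energy cheap to evaluate, which is why it is a natural target for a spiking-network approximation in value estimation.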
Ma, Yun-Ming; Wang, Tie-Jun
2017-10-01
Higher-dimensional quantum systems are of great interest owing to the outstanding features they exhibit in the implementation of novel fundamental tests of nature and their application in various quantum information tasks. High-dimensional quantum logic gates are a key element in scalable quantum computation and quantum communication. In this paper, we propose a scheme to implement a controlled-phase gate between a 2N-dimensional photon and N three-level artificial atoms. This high-dimensional controlled-phase gate can serve as a crucial component of high-capacity, long-distance quantum communication. We use high-dimensional Bell state analysis as an example to show the application of this device. Estimates of the system requirements indicate that our protocol is realizable with existing or near-future technologies. This scheme is ideally suited to solid-state integrated optical approaches to quantum information processing, and it can be applied to various systems, such as superconducting qubits coupled to a resonator or nitrogen-vacancy centers coupled to photonic-band-gap structures.
Descriptor Learning via Supervised Manifold Regularization for Multioutput Regression.
Zhen, Xiantong; Yu, Mengyang; Islam, Ali; Bhaduri, Mousumi; Chan, Ian; Li, Shuo
2017-09-01
Multioutput regression has recently shown great ability to solve challenging problems in both computer vision and medical image analysis. However, due to the huge image variability and ambiguity, it is fundamentally challenging to handle the highly complex input-target relationship of multioutput regression, especially with indiscriminate high-dimensional representations. In this paper, we propose a novel supervised descriptor learning (SDL) algorithm for multioutput regression, which can establish discriminative and compact feature representations to improve the multivariate estimation performance. The SDL is formulated as generalized low-rank approximations of matrices with a supervised manifold regularization. The SDL is able to simultaneously extract discriminative features closely related to multivariate targets and remove irrelevant and redundant information by transforming raw features into a new low-dimensional space aligned to targets. The resulting discriminative yet compact descriptor largely reduces the variability and ambiguity for multioutput regression, which enables more accurate and efficient multivariate estimation. We conduct extensive evaluation of the proposed SDL on both synthetic data and real-world multioutput regression tasks for both computer vision and medical image analysis. Experimental results show that the proposed SDL achieves high multivariate estimation accuracy on all tasks and largely outperforms state-of-the-art algorithms. Our method establishes a novel SDL framework for multioutput regression, which can be widely used to boost performance in different applications.
A logistic regression estimating function for spatial Gibbs point processes
DEFF Research Database (Denmark)
Baddeley, Adrian; Coeurjolly, Jean-François; Rubak, Ege
We propose a computationally efficient logistic regression estimating function for spatial Gibbs point processes. The sample points for the logistic regression consist of the observed point pattern together with a random pattern of dummy points. The estimating function is closely related...
Predicting Future High-Cost Schizophrenia Patients Using High-Dimensional Administrative Data
Directory of Open Access Journals (Sweden)
Yajuan Wang
2017-06-01
Full Text Available Background: The burden of serious and persistent mental illness such as schizophrenia is substantial and requires health-care organizations to have adequate risk adjustment models to effectively allocate their resources to managing patients who are at the greatest risk. Currently available models underestimate health-care costs for those with mental or behavioral health conditions. Objectives: The study aimed to develop and evaluate predictive models for identification of future high-cost schizophrenia patients using advanced supervised machine learning methods. Methods: This was a retrospective study using a payer administrative database. The study cohort consisted of 97,862 patients diagnosed with schizophrenia (ICD9 code 295.*) from January 2009 to June 2014. Training (n = 34,510) and study evaluation (n = 30,077) cohorts were derived based on 12-month observation and prediction windows (PWs). The target was average total cost/patient/month in the PW. Three models (baseline, intermediate, final) were developed to assess the value of different variable categories for cost prediction (demographics, coverage, cost, health-care utilization, antipsychotic medication usage, and clinical conditions). Scalable orthogonal regression, the significant attribute selection in high dimensions method, and random forests regression were used to develop the models. The trained models were assessed in the evaluation cohort using the regression R2, patient classification accuracy (PCA), and cost accuracy (CA). The model performance was compared to the Centers for Medicare & Medicaid Services Hierarchical Condition Categories (CMS-HCC) model. Results: At the top 10% cost cutoff, the final model achieved 0.23 R2, 43% PCA, and 63% CA; in contrast, the CMS-HCC model achieved 0.09 R2 and 27% PCA with 45% CA. The final model and the CMS-HCC model identified 33% and 22%, respectively, of total cost at the top 10% cost cutoff. Conclusion: Using advanced feature selection leveraging detailed
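As a rough, self-contained illustration of the study's top-decile evaluation metric, the patient classification accuracy (PCA) at a top-10% cost cutoff can be computed for a random-forest cost model as below. The data here are synthetic stand-ins, not the payer database used in the study, and the feature names are purely illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(5)
# Synthetic stand-in for claims features (prior cost, utilization, age, ...).
X = rng.normal(size=(2000, 10))
# Right-skewed costs driven mainly by the first two features.
cost = np.exp(1.0 + 1.5 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=2000))

train, test = slice(0, 1500), slice(1500, 2000)
rf = RandomForestRegressor(n_estimators=100, random_state=0)
rf.fit(X[train], cost[train])
pred = rf.predict(X[test])

# PCA at the top-10% cost cutoff: fraction of truly top-decile patients
# that the model also ranks in the top decile.
k = len(pred) // 10
top_true = set(np.argsort(cost[test])[-k:])
top_pred = set(np.argsort(pred)[-k:])
pca = len(top_true & top_pred) / k
print(round(pca, 2))
```

Against a random ranking, the expected PCA at a 10% cutoff is 0.10, so values well above that indicate real predictive signal.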
Li, Yanming; Nan, Bin; Zhu, Ji
2015-06-01
We propose a multivariate sparse group lasso variable selection and estimation method for data with high-dimensional predictors as well as high-dimensional response variables. The method is carried out through a penalized multivariate multiple linear regression model with an arbitrary group structure for the regression coefficient matrix. It suits many biology studies well in detecting associations between multiple traits and multiple predictors, with each trait and each predictor embedded in some biological functional groups such as genes, pathways or brain regions. The method is able to effectively remove unimportant groups as well as unimportant individual coefficients within important groups, particularly for large p small n problems, and is flexible in handling various complex group structures such as overlapping or nested or multilevel hierarchical structures. The method is evaluated through extensive simulations with comparisons to the conventional lasso and group lasso methods, and is applied to an eQTL association study. © 2015, The International Biometric Society.
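The paper's multivariate sparse group lasso with arbitrary (possibly overlapping) group structure is not available in scikit-learn, but its simplest special case, a row-wise group penalty that zeroes each predictor across all responses simultaneously, corresponds to scikit-learn's MultiTaskLasso. A hedged sketch on synthetic data (all dimensions and parameters are illustrative, not from the paper):

```python
import numpy as np
from sklearn.linear_model import MultiTaskLasso

rng = np.random.default_rng(4)
n, p, q = 100, 50, 4                     # samples, predictors, responses
B = np.zeros((p, q))
B[:3, :] = rng.normal(size=(3, q))       # only the first 3 predictors matter
X = rng.normal(size=(n, p))
Y = X @ B + 0.1 * rng.normal(size=(n, q))

# MultiTaskLasso penalizes the L2 norm of each coefficient row, so whole
# predictors are removed across all responses -- the simplest "group".
fit = MultiTaskLasso(alpha=0.1).fit(X, Y)
active = np.where(np.abs(fit.coef_).sum(axis=0) > 1e-8)[0]
print(active)
```

The recovered active set should contain the three truly informative predictors; the paper's method generalizes this idea to arbitrary, overlapping, or nested groups of coefficients.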
Verhein, Florian
2010-01-01
Knowledge Discovery in Databases (KDD) is the non-trivial process of extracting valid, novel, potentially useful, and ultimately understandable knowledge from large datasets. The most important step in the KDD process is the application of efficient data mining (DM) algorithms to find interesting patterns in datasets. This dissertation addresses three related topics: generalised interaction and rule mining, the integration of statistical methods ...
An Introduction to Logistic Regression.
Cizek, Gregory J.; Fitzgerald, Shawn M.
1999-01-01
Where linearity cannot be assumed, logistic regression may be appropriate. This article describes conditions and tests for using logistic regression, introduces the logistic regression model and logistic regression software, and surveys applications in the published literature. Univariate and multiple independent-variable conditions and…
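A minimal illustration of the model the article introduces, fitted with scikit-learn on synthetic data (the coefficients and sample size are chosen for the example, not taken from the article):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# One continuous predictor and a binary outcome generated through
# the logistic link: P(y=1|x) = 1 / (1 + exp(-(0.5 + 2.0 x))).
x = rng.normal(size=(500, 1))
p = 1.0 / (1.0 + np.exp(-(0.5 + 2.0 * x[:, 0])))
y = (rng.uniform(size=500) < p).astype(int)

model = LogisticRegression().fit(x, y)
print(model.intercept_, model.coef_)  # estimates near (0.5, 2.0)
```

The fitted intercept and slope recover the generating parameters, up to sampling noise and the mild default L2 regularization.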
Reciprocal Causation in Regression Analysis.
Wolfle, Lee M.
1979-01-01
With even the simplest bivariate regression, least-squares solutions are inappropriate unless one assumes a priori that reciprocal effects are absent, or at least implausible. While this discussion is limited to bivariate regression, the issues apply equally to multivariate regression, including stepwise regression. (Author/CTM)
Directory of Open Access Journals (Sweden)
L.V. Arun Shalin
2016-01-01
Full Text Available Clustering is a process of grouping elements so that the data points assigned to a cluster are more similar to each other than to the remaining data points. Clustering high-dimensional data presents pervasive, well-known difficulties. Previous work that applied anonymization methods to high-dimensional data spaces failed to address dimensionality reduction when non-binary databases are included. In this work we study methods for dimensionality reduction for non-binary databases; analyzing the behavior of dimensionality reduction for such databases yields a performance improvement with the help of tag-based features. An effective multi-clustering anonymization approach called Discrete Component Task Specific Multi-Clustering (DCTSM) is presented for dimensionality reduction on non-binary databases. We begin with an analysis of the attributes in the non-binary database, where cluster projection identifies the sparseness degree of the dimensions. Additionally, with the quantum distribution on the multi-cluster dimension, a solution for attribute relevancy and redundancy on non-binary data spaces is provided, resulting in a performance improvement based on tag-based features. Multi-clustering tag-based feature reduction extracts individual features, which are then replaced by the equivalent feature clusters (i.e., tag clusters). During training, the DCTSM approach uses multi-clusters instead of individual tag features, and during decoding individual features are replaced by the corresponding multi-clusters. To measure the effectiveness of the method, experiments are conducted on an existing anonymization method for high-dimensional data spaces and compared with the DCTSM approach using the Statlog German Credit Data Set. Improved tag feature extraction and minimum error rate compared to conventional anonymization
Ren, Chao; He, Xiaohai; Nguyen, Truong Q
2017-01-01
Single image super-resolution (SR) is very important in many computer vision systems. However, as a highly ill-posed problem, its performance mainly relies on prior knowledge. Among these priors, the non-local total variation (NLTV) prior is very popular and has been thoroughly studied in recent years. Nevertheless, technical challenges remain. Because NLTV only exploits a fixed non-shifted target patch in the patch search process, a lack of similar patches is inevitable in some cases. Thus, the non-local similarity cannot be fully characterized, and the effectiveness of NLTV cannot be ensured. Based on the motivation that more accurate non-local similar patches can be found by using shifted target patches, a novel multishifted similar-patch search (MSPS) strategy is proposed. With this strategy, NLTV is extended into a newly proposed super-high-dimensional NLTV (SHNLTV) prior to fully exploit the underlying non-local similarity. However, as SHNLTV is very high-dimensional, applying it directly to SR is very difficult. To solve this problem, a novel statistics-based dimension reduction strategy is proposed and then applied to SHNLTV. Thus, SHNLTV becomes a more computationally effective prior that we call adaptive high-dimensional non-local total variation (AHNLTV). In AHNLTV, a novel joint weight strategy that fully exploits the potential of the MSPS-based non-local similarity is proposed. To further boost the performance of AHNLTV, the adaptive geometric duality (AGD) prior is also incorporated. Finally, an efficient split Bregman iteration-based algorithm is developed to solve the AHNLTV-AGD-driven minimization problem. Extensive experiments validate that the proposed method achieves better results than many state-of-the-art SR methods in terms of both objective and subjective qualities.
DEFF Research Database (Denmark)
Pham, Ninh Dang; Pagh, Rasmus
2012-01-01
projection-based technique that is able to estimate the angle-based outlier factor for all data points in time near-linear in the size of the data. Also, our approach is suitable to be performed in parallel environment to achieve a parallel speedup. We introduce a theoretical analysis of the quality...... of approximation to guarantee the reliability of our estimation algorithm. The empirical experiments on synthetic and real world data sets demonstrate that our approach is efficient and scalable to very large high-dimensional data sets....
Benediktsson, J. A.; Swain, P. H.; Ersoy, O. K.
1993-01-01
Application of neural networks to classification of remote sensing data is discussed. Conventional two-layer backpropagation is found to give good results in classification of remote sensing data but is not efficient in training. A more efficient variant, based on conjugate-gradient optimization, is used for classification of multisource remote sensing and geographic data and very-high-dimensional data. The conjugate-gradient neural networks give excellent performance in classification of multisource data, but do not compare as well with statistical methods in classification of very-high-dimensional data.
2016-02-01
Modified Cheeger and Ratio Cut Methods Using the Ginzburg-Landau Functional for Classification of High-Dimensional Data
Merkurjev, Ekaterina; Andrea …
One choice for the weight function is the Zelnik-Manor and Perona function [31] for sparse matrices: w(x, y) = exp(−M(x, y)² / √(τ(x)τ(y))) (Eq. 49), using τ(x)… The related Ginzburg-Landau functional is used in the derivation of the methods. The graph framework discussed in this paper is undirected. The resulting
Behler, Jörg; Martoňák, Roman; Donadio, Davide; Parrinello, Michele
2008-05-01
We study in a systematic way the complex sequence of the high-pressure phases of silicon obtained upon compression by combining an accurate high-dimensional neural network representation of the density-functional theory potential-energy surface with the metadynamics scheme. Starting from the thermodynamically stable diamond structure at ambient conditions we are able to identify all structural phase transitions up to the highest-pressure fcc phase at about 100 GPa. The results are in excellent agreement with experiment. The method developed promises to be of great value in the study of inorganic solids, including those having metallic phases.
Modified Regression Correlation Coefficient for Poisson Regression Model
Kaengthong, Nattacha; Domthong, Uthumporn
2017-09-01
This study considers indicators of the predictive power of the widely used Generalized Linear Model (GLM), which often carry some restrictions. We are interested in the regression correlation coefficient for a Poisson regression model. This measure of predictive power is defined by the relationship between the dependent variable (Y) and the expected value of the dependent variable given the independent variables [E(Y|X)] for the Poisson regression model, where the dependent variable follows a Poisson distribution. The purpose of this research was to modify the regression correlation coefficient for the Poisson regression model. We also compare the proposed modified regression correlation coefficient with the traditional regression correlation coefficient in the case of two or more independent variables and in the presence of multicollinearity among the independent variables. The results show that the proposed regression correlation coefficient is better than the traditional one in terms of bias and root mean square error (RMSE).
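The traditional regression correlation coefficient that the authors start from, corr(Y, E[Y|X]), can be sketched for a Poisson regression on synthetic data; the paper's modified coefficient is not reproduced here, and all parameters below are illustrative.

```python
import numpy as np
from sklearn.linear_model import PoissonRegressor

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 2))
mu = np.exp(0.3 + 0.6 * X[:, 0] - 0.4 * X[:, 1])   # true E[Y|X]
y = rng.poisson(mu)                                 # Poisson-distributed Y

# Unpenalized Poisson GLM with log link.
glm = PoissonRegressor(alpha=0.0).fit(X, y)

# Traditional regression correlation coefficient: corr(Y, fitted E[Y|X]).
r = np.corrcoef(y, glm.predict(X))[0, 1]
print(round(r, 3))
```

With counts this small, Poisson noise keeps the coefficient well below 1 even when the model is correctly specified, which is one reason modified versions of the measure are of interest.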
Mah, Yee-Haur; Jager, Rolf; Kennard, Christopher; Husain, Masud; Nachev, Parashkev
2014-07-01
Making robust inferences about the functional neuroanatomy of the brain is critically dependent on experimental techniques that examine the consequences of focal loss of brain function. Unfortunately, the use of the most comprehensive such technique, lesion-function mapping, is complicated by the need for time-consuming and subjective manual delineation of the lesions, greatly limiting the practicability of the approach. Here we exploit a recently-described general measure of statistical anomaly, zeta, to devise a fully-automated, high-dimensional algorithm for identifying the parameters of lesions within a brain image given a reference set of normal brain images. We proceed to evaluate such an algorithm in the context of diffusion-weighted imaging of the commonest type of lesion used in neuroanatomical research: ischaemic damage. Summary performance metrics exceed those previously published for diffusion-weighted imaging and approach the current gold standard, manual segmentation, sufficiently closely for fully-automated lesion-mapping studies to become a possibility. We apply the new method to 435 unselected images of patients with ischaemic stroke to derive a probabilistic map of the pattern of damage in lesions involving the occipital lobe, demonstrating the variation of anatomical resolvability of occipital areas so as to guide future lesion-function studies of the region. Copyright © 2012 Elsevier Ltd. All rights reserved.
Diaz-Ruelas, Alvaro; Jeldtoft Jensen, Henrik; Piovani, Duccio; Robledo, Alberto
2016-12-01
It is well known that low-dimensional nonlinear deterministic maps close to a tangent bifurcation exhibit intermittency and this circumstance has been exploited, e.g., by Procaccia and Schuster [Phys. Rev. A 28, 1210 (1983)], to develop a general theory of 1/f spectra. This suggests it is interesting to study the extent to which the behavior of a high-dimensional stochastic system can be described by such tangent maps. The Tangled Nature (TaNa) Model of evolutionary ecology is an ideal candidate for such a study, a significant model as it is capable of reproducing a broad range of the phenomenology of macroevolution and ecosystems. The TaNa model exhibits strong intermittency reminiscent of punctuated equilibrium and, like the fossil record of mass extinction, the intermittency in the model is found to be non-stationary, a feature typical of many complex systems. We derive a mean-field version for the evolution of the likelihood function controlling the reproduction of species and find a local map close to tangency. This mean-field map, by our own local approximation, is able to describe qualitatively only one episode of the intermittent dynamics of the full TaNa model. To complement this result, we construct a complete nonlinear dynamical system model consisting of successive tangent bifurcations that generates time evolution patterns resembling those of the full TaNa model in macroscopic scales. The switch from one tangent bifurcation to the next in the sequences produced in this model is stochastic in nature, based on criteria obtained from the local mean-field approximation, and capable of imitating the changing set of types of species and total population in the TaNa model. The model combines full deterministic dynamics with instantaneous parameter random jumps at stochastically drawn times. In spite of the limitations of our approach, which entails a drastic collapse of degrees of freedom, the description of a high-dimensional model system in terms of a low
Shahiri, Amirah Mohamed; Husain, Wahidah; Rashid, Nur'Aini Abd
2017-10-01
Huge amounts of data in educational datasets may cause problems in producing quality data. Recently, data mining approaches have increasingly been used by educational data mining researchers for analyzing data patterns. However, many research studies have concentrated on selecting suitable learning algorithms instead of performing a feature selection process. As a result, these data suffer from computational complexity and require longer computational time for classification. The main objective of this research is to provide an overview of feature selection techniques that have been used to identify the most significant features. This research then proposes a framework to improve the quality of students' datasets. The proposed framework uses filter- and wrapper-based techniques to support the prediction process in a future study.
Prospective Validation of a High Dimensional Shape Model for Organ Motion in Intact Cervical Cancer
Energy Technology Data Exchange (ETDEWEB)
Williamson, Casey W.; Green, Garrett; Noticewala, Sonal S.; Li, Nan; Shen, Hanjie [Department of Radiation Medicine and Applied Sciences, University of California, San Diego, La Jolla, California (United States); Vaida, Florin [Division of Biostatistics and Bioinformatics, Department of Family Medicine and Public Health, University of California, San Diego, La Jolla, California (United States); Mell, Loren K., E-mail: lmell@ucsd.edu [Department of Radiation Medicine and Applied Sciences, University of California, San Diego, La Jolla, California (United States)
2016-11-15
Purpose: Validated models are needed to justify strategies to define planning target volumes (PTVs) for intact cervical cancer used in clinical practice. Our objective was to independently validate a previously published shape model, using data collected prospectively from clinical trials. Methods and Materials: We analyzed 42 patients with intact cervical cancer treated with daily fractionated pelvic intensity modulated radiation therapy and concurrent chemotherapy in one of 2 prospective clinical trials. We collected online cone beam computed tomography (CBCT) scans before each fraction. Clinical target volume (CTV) structures from the planning computed tomography scan were cast onto each CBCT scan after rigid registration and manually redrawn to account for organ motion and deformation. We applied the 95% isodose cloud from the planning computed tomography scan to each CBCT scan and computed any CTV outside the 95% isodose cloud. The primary aim was to determine the proportion of CTVs that were encompassed within the 95% isodose volume. A 1-sample t test was used to test the hypothesis that the probability of complete coverage was different from 95%. We used mixed-effects logistic regression to assess effects of time and patient variability. Results: The 95% isodose line completely encompassed 92.3% of all CTVs (95% confidence interval, 88.3%-96.4%), not significantly different from the 95% probability anticipated a priori (P=.19). The overall proportion of missed CTVs was small: the grand mean of covered CTVs was 99.9%, and 95.2% of misses were located in the anterior body of the uterus. Time did not affect coverage probability (P=.71). Conclusions: With the clinical implementation of a previously proposed PTV definition strategy based on a shape model for intact cervical cancer, the probability of CTV coverage was high and the volume of CTV missed was low. This PTV expansion strategy is acceptable for clinical trials and practice; however, we recommend daily
Lejsek, Herwig; Asmundsson, Fridrik Heidar; Jónsson, Björn thór; Amsaleg, Laurent
2009-05-01
Over the last two decades, much research effort has been spent on nearest neighbor search in high-dimensional data sets. Most of the approaches published thus far have, however, only been tested on rather small collections. When large collections have been considered, high-performance environments have been used, in particular systems with a large main memory. Accessing data on disk has largely been avoided because disk operations are considered to be too slow. It has been shown, however, that using large amounts of memory is generally not an economic choice. Therefore, we propose the NV-tree, which is a very efficient disk-based data structure that can give good approximate answers to nearest neighbor queries with a single disk operation, even for very large collections of high-dimensional data. Using a single NV-tree, the returned results have high recall but contain a number of false positives. By combining two or three NV-trees, most of those false positives can be avoided while retaining the high recall. Finally, we compare the NV-tree to Locality Sensitive Hashing, a popular method for epsilon-distance search. We show that they return results of similar quality, but the NV-tree uses many fewer disk reads.
Energy Technology Data Exchange (ETDEWEB)
Zawadzka-Kazimierczuk, Anna; Kozminski, Wiktor [University of Warsaw, Faculty of Chemistry (Poland); Billeter, Martin, E-mail: martin.billeter@chem.gu.se [University of Gothenburg, Biophysics Group, Department of Chemistry and Molecular Biology (Sweden)
2012-09-15
While NMR studies of proteins typically aim at structure, dynamics or interactions, resonance assignments represent in almost all cases the initial step of the analysis. With increasing complexity of the NMR spectra, for example due to decreasing extent of ordered structure, this task often becomes both difficult and time-consuming, and the recording of high-dimensional data with high resolution may be essential. Random sampling of the evolution time space, combined with sparse multidimensional Fourier transform (SMFT), allows for efficient recording of very high dimensional spectra (≥4 dimensions) while maintaining high resolution. However, the nature of these data demands automation of the assignment process. Here we present the program TSAR (Tool for SMFT-based Assignment of Resonances), which exploits all advantages of SMFT input. Moreover, its flexibility allows it to process data from any type of experiment that provides sequential connectivities. The algorithm was tested on several protein samples, including a disordered 81-residue fragment of the δ subunit of RNA polymerase from Bacillus subtilis containing various repetitive sequences. For our test examples, TSAR achieves a high percentage of assigned residues without any erroneous assignments.
Das, Alaka; Gupte, Neelima
2013-04-01
The phenomenon of crisis in systems evolving in high-dimensional phase space can show unexpected and interesting features. We study this phenomenon in the context of a system of coupled sine circle maps. We establish that the origins of this crisis lie in a tangent bifurcation in high dimensions, and identify the routes that lead to the crisis. Interestingly, multiple routes to crisis are seen depending on the initial conditions of the system, due to the high dimensionality of the space in which the system evolves. The statistical behavior seen in the phase diagram of the system is also seen to change due to the dynamical phenomenon of crisis, which leads to transitions from nonspreading to spreading behavior across an infection line in the phase diagram. Unstable dimension variability is seen in the neighborhood of the infection line. We characterize this crisis and unstable dimension variability using dynamical characterizers, such as finite-time Lyapunov exponents and their distributions. The phase diagram also contains regimes of spatiotemporal intermittency and spatial intermittency, where the statistical quantities scale as power laws. We discuss the signatures of these regimes in the dynamic characterizers, and correlate them with the statistical characterizers and bifurcation behavior. We find that it is necessary to look at both types of correlators together to build up an accurate picture of the behavior of the system.
Cowley, Benjamin R.; Kaufman, Matthew T.; Butler, Zachary S.; Churchland, Mark M.; Ryu, Stephen I.; Shenoy, Krishna V.; Yu, Byron M.
2014-01-01
Objective: Analyzing and interpreting the activity of a heterogeneous population of neurons can be challenging, especially as the number of neurons, experimental trials, and experimental conditions increases. One approach is to extract a set of latent variables that succinctly captures the prominent co-fluctuation patterns across the neural population. A key problem is that the number of latent variables needed to adequately describe the population activity is often greater than three, thereby preventing direct visualization of the latent space. By visualizing a small number of 2-d projections of the latent space or each latent variable individually, it is easy to miss salient features of the population activity. Approach: To address this limitation, we developed a Matlab graphical user interface (called DataHigh) that allows the user to quickly and smoothly navigate through a continuum of different 2-d projections of the latent space. We also implemented a suite of additional visualization tools (including playing out population activity timecourses as a movie and displaying summary statistics, such as covariance ellipses and average timecourses) and an optional tool for performing dimensionality reduction. Main results: To demonstrate the utility and versatility of DataHigh, we used it to analyze single-trial spike count and single-trial timecourse population activity recorded using a multi-electrode array, as well as trial-averaged population activity recorded using single electrodes. Significance: DataHigh was developed to fulfill a need for visualization in exploratory neural data analysis, which can provide intuition that is critical for building scientific hypotheses and models of population activity. PMID:24216250
Clark, James S.; Soltoff, Benjamin D.; Powell, Amanda S.; Read, Quentin D.
2012-01-01
Background: For competing species to coexist, individuals must compete more with others of the same species than with those of other species. Ecologists search for tradeoffs in how species might partition the environment. The negative correlations among competing species that would be indicative of tradeoffs are rarely observed. A recent analysis showed that evidence for partitioning the environment is available when responses are disaggregated to the individual scale, in terms of the covariance structure of responses to environmental variation. That study did not relate that variation to the variables to which individuals were responding. To understand how this pattern of variation is related to niche variables, we analyzed responses to canopy gaps, long viewed as a key variable responsible for species coexistence. Methodology/Principal Findings: A longitudinal intervention analysis of individual responses to experimental canopy gaps with 12 yr of pre-treatment and 8 yr post-treatment responses showed that species-level responses are positively correlated – species that grow fast on average in the understory also grow fast on average in response to gap formation. In other words, there is no tradeoff. However, the joint distribution of individual responses to understory and gap showed a negative correlation – species having individuals that respond most to gaps when previously growing slowly also have individuals that respond least to gaps when previously growing rapidly (e.g., Morus rubra), and vice versa (e.g., Quercus prinus). Conclusions/Significance: Because competition occurs at the individual scale, not the species scale, aggregated species-level parameters and correlations hide the species-level differences needed for coexistence. By disaggregating models to the scale at which the interaction occurs we show that individual variation provides insight for species differences. PMID:22393349
Dipasquale, Ottavia; Griffanti, Ludovica; Clerici, Mario; Nemni, Raffaello; Baselli, Giuseppe; Baglio, Francesca
2015-01-01
High-dimensional independent component analysis (ICA), compared to low-dimensional ICA, allows a detailed parcellation of the resting-state networks. The purpose of this study was to give further insight into functional connectivity (FC) in Alzheimer's disease (AD) using high-dimensional ICA. For this reason, we performed both low- and high-dimensional ICA analyses of resting-state fMRI data of 20 healthy controls and 21 patients with AD, focusing on the primarily altered default-mode network (DMN) and exploring the sensory-motor network. As expected, results obtained at low dimensionality were in line with previous literature. Moreover, the high-dimensional results allowed us to observe both within-network disconnections and FC damage confined to some of the resting-state subnetworks. Due to the higher sensitivity of the high-dimensional ICA analysis, our results suggest that high-dimensional decomposition into subnetworks is very promising for better localizing FC alterations in AD, and that FC damage is not confined to the DMN.
He, Ling Yan; Wang, Tie-Jun; Wang, Chuan
2016-07-11
High-dimensional quantum systems provide a higher capacity of quantum channel, which exhibits potential applications in quantum information processing. However, high-dimensional universal quantum logic gates are difficult to achieve directly with only high-dimensional interaction between two quantum systems, and a large number of two-dimensional gates is required to build even a small high-dimensional quantum circuit. In this paper, we propose a scheme to implement a general controlled-flip (CF) gate where a high-dimensional single photon serves as the target qudit and stationary qubits work as the control logic qudit, by employing a three-level Λ-type system coupled with a whispering-gallery-mode microresonator. In our scheme, the required number of interactions between the photon and the solid-state system is greatly reduced compared with the traditional method, which decomposes the high-dimensional Hilbert space into 2-dimensional quantum spaces, and the experimental realization operates on a shorter temporal scale. Moreover, we discuss the performance and feasibility of our hybrid CF gate, concluding that it can be easily extended to the 2n-dimensional case and that it is feasible with current technology.
Directory of Open Access Journals (Sweden)
Ziping WANG
2011-09-01
Full Text Available Background and objective: It has been proven that gefitinib produces only 10%-20% tumor regression in heavily pretreated, unselected non-small cell lung cancer (NSCLC) patients in the second- and third-line settings. Asian ethnicity, female sex, never-smoking status, and adenocarcinoma are favorable factors; however, it is difficult to find a patient satisfying all of these clinical characteristics. The aim of this study is to identify novel predictive factors and to explore the interactions between clinical variables and their impact on the survival of Chinese patients with advanced NSCLC who were heavily treated with gefitinib in the second- or third-line setting. Methods: The clinical and follow-up data of 127 advanced NSCLC patients referred to the Cancer Hospital & Institute, Chinese Academy of Medical Sciences from March 2005 to March 2010 were analyzed. Multivariate analysis of progression-free survival (PFS) was performed using recursive partitioning, which is referred to as classification and regression tree (CART) analysis. Results: The median PFS of the 127 eligible consecutive advanced NSCLC patients was 8.0 months (95%CI: 5.8-10.2). CART was performed with an initial split on first-line chemotherapy outcome and a second split on patients' age, forming three terminal subgroups. The median PFS of the three subsets ranged from 1.0 month (95%CI: 0.8-1.2) for those with progressive disease after first-line chemotherapy, to 10 months (95%CI: 7.0-13.0) for patients with a partial response or stable disease on first-line chemotherapy and age <70, and 22.0 months (95%CI: 3.8-40.1) for patients obtaining a partial response or stable disease on first-line chemotherapy at age 70-81. Conclusion: Partial response or stable disease on first-line chemotherapy and age ≥70 are closely correlated with long-term survival on gefitinib as a second- or third-line setting in advanced NSCLC. CART can be used to identify previously unappreciated patient
Combining Alphas via Bounded Regression
Directory of Open Access Journals (Sweden)
Zura Kakushadze
2015-11-01
Full Text Available We give an explicit algorithm and source code for combining alpha streams via bounded regression. In practical applications, typically, there is insufficient history to compute a sample covariance matrix (SCM) for a large number of alphas. To compute alpha allocation weights, one then resorts to (weighted) regression over SCM principal components. Regression often produces alpha weights with insufficient diversification and/or a skewed distribution against, e.g., turnover. This can be rectified by imposing bounds on alpha weights within the regression procedure. Bounded regression can also be applied to stock and other asset portfolio construction. We discuss illustrative examples.
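The idea of imposing bounds on regression weights can be sketched with a projected-gradient bounded least-squares solver. This is not the authors' published source code; the alpha-stream matrix and the bound of 0.4 below are synthetic assumptions for illustration.

```python
import numpy as np

def bounded_regression(A, b, lb, ub, n_iter=5000):
    """Solve min ||A w - b||^2 subject to lb <= w <= ub by projected
    gradient descent: gradient step, then clip back into the box."""
    lr = 1.0 / np.linalg.norm(A, 2) ** 2      # step from the Lipschitz constant of the gradient
    w = np.clip(np.zeros(A.shape[1]), lb, ub)
    for _ in range(n_iter):
        grad = A.T @ (A @ w - b)
        w = np.clip(w - lr * grad, lb, ub)    # projection enforces the bounds
    return w

rng = np.random.default_rng(0)
A = rng.normal(size=(100, 5))                 # 100 observations of 5 synthetic alpha streams
w_true = np.array([0.5, 0.3, 0.15, 0.05, 0.0])
b = A @ w_true + 0.01 * rng.normal(size=100)
w = bounded_regression(A, b, lb=0.0, ub=0.4)  # cap any single alpha weight at 0.4
```

Because the true leading weight (0.5) violates the cap, the bounded solution pins it at the upper bound and redistributes, which is exactly the diversification effect the abstract describes.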
Linear regression-based feature selection for microarray data classification.
Abid Hasan, Md; Hasan, Md Kamrul; Abdul Mottalib, M
2015-01-01
Predicting the class of gene expression profiles helps improve the diagnosis and treatment of diseases. Analysing huge gene expression data sets, otherwise known as microarray data, is complicated by their high dimensionality. Hence traditional classifiers do not perform well when the number of features far exceeds the number of samples. A good set of features helps classifiers to classify the dataset efficiently. Moreover, a manageable set of features is also desirable for the biologist for further analysis. In this paper, we propose a linear regression-based feature selection method for selecting discriminative features. Our main focus is to classify the dataset more accurately using fewer features than other traditional feature selection methods. Our method has been compared with several other methods, and in almost every case the classification accuracy is higher while using fewer features than the other popular feature selection methods.
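A simple stand-in for this kind of regression-based filter is to score each feature by the strength of its univariate linear fit to the response and keep the top-scoring ones. The paper's exact method is not reproduced here; the synthetic "microarray" below is an assumption for illustration.

```python
import numpy as np

def select_features(X, y, k):
    """Rank features by variance explained in a per-feature simple
    linear regression y ~ x_j, and return the indices of the top k."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    slopes = (Xc * yc[:, None]).sum(axis=0) / (Xc ** 2).sum(axis=0)
    scores = slopes ** 2 * Xc.var(axis=0)     # slope^2 * var(x_j)
    return np.argsort(scores)[::-1][:k]

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 200))               # 100 samples, 200 synthetic "genes"
y = 3 * X[:, 7] - 2 * X[:, 42] + 0.1 * rng.normal(size=100)
top = select_features(X, y, k=2)              # recovers the two informative genes
```

On this toy data the two planted informative features dominate the scores, illustrating how a regression-based filter shrinks thousands of genes down to a manageable set before classification.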
Miao, Yan-Gang
2016-01-01
Considering non-Gaussian smeared matter distributions, we investigate thermodynamic behaviors of the noncommutative high-dimensional Schwarzschild-Tangherlini anti-de Sitter black hole, and obtain the condition for the existence of extreme black holes. We indicate that the Gaussian smeared matter distribution, which is a special case of non-Gaussian smeared matter distributions, is not applicable for the 6- and higher-dimensional black holes due to the hoop conjecture. In particular, the phase transition is analyzed in detail. Moreover, we point out that the Maxwell equal area law holds for the noncommutative black hole when the Hawking temperature is within a specific range, but fails when it lies beyond this range.
Energy Technology Data Exchange (ETDEWEB)
Miao, Yan-Gang [Nankai University, School of Physics, Tianjin (China); Chinese Academy of Sciences, State Key Laboratory of Theoretical Physics, Institute of Theoretical Physics, P.O. Box 2735, Beijing (China); CERN, PH-TH Division, Geneva 23 (Switzerland); Xu, Zhen-Ming [Nankai University, School of Physics, Tianjin (China)
2016-04-15
Considering non-Gaussian smeared matter distributions, we investigate the thermodynamic behaviors of the noncommutative high-dimensional Schwarzschild-Tangherlini anti-de Sitter black hole, and we obtain the condition for the existence of extreme black holes. We indicate that the Gaussian smeared matter distribution, which is a special case of non-Gaussian smeared matter distributions, is not applicable for the six- and higher-dimensional black holes due to the hoop conjecture. In particular, the phase transition is analyzed in detail. Moreover, we point out that the Maxwell equal area law holds for the noncommutative black hole whose Hawking temperature is within a specific range, but fails for one whose Hawking temperature is beyond this range. (orig.)
Wang, Chamont; Chen, Chaur-Chin; Auslender, Leonardo
2012-01-01
Over the past decades, statisticians and machine-learning researchers have developed literally thousands of new tools for the reduction of high-dimensional data in order to identify the variables most responsible for a particular trait. These tools have applications in a plethora of settings, including data analysis in the fields of business, education, forensics, and biology (such as microarray, proteomics, brain imaging), to name a few. In the present work, we focus our investigation on the limitations and potential misuses of certain tools in the analysis of the benchmark colon cancer data (2,000 variables; Alon et al., 1999) and the prostate cancer data (6,033 variables; Efron, 2010, 2008). Our analysis demonstrates that models that produce 100% accuracy measures often select different sets of genes and cannot stand the scrutiny of parameter estimates and model stability. Furthermore, we created a host of simulation datasets and "artificial diseases" to evaluate the reliability of commonly used statistica...
Directory of Open Access Journals (Sweden)
F. C. Cooper
2013-04-01
Full Text Available The fluctuation-dissipation theorem (FDT) has been proposed as a method of calculating the response of the Earth's atmosphere to a forcing. For this problem the high dimensionality of the relevant data sets makes truncation necessary. Here we propose a method of truncation based upon the assumption that the response to a localised forcing is spatially localised, as an alternative to the standard method of choosing a number of the leading empirical orthogonal functions. For systems where this assumption holds, the response to any sufficiently small non-localised forcing may be estimated using a set of truncations that are chosen algorithmically. We test our algorithm using 36- and 72-variable versions of a stochastic Lorenz 95 system of ordinary differential equations. We find that, for long integrations, the bias in the response estimated by the FDT is reduced from ~75% of the true response to ~30%.
Directory of Open Access Journals (Sweden)
Ali Dashti
Full Text Available This paper presents an implementation of brute-force exact k-Nearest Neighbor Graph (k-NNG) construction for ultra-large high-dimensional data clouds. The proposed method uses Graphics Processing Units (GPUs) and is scalable with multiple levels of parallelism (between nodes of a cluster, between different GPUs on a single node, and within a GPU). The method is applicable to homogeneous computing clusters with a varying number of nodes and GPUs per node. We achieve a 6-fold speedup in data processing as compared with an optimized method running on a cluster of CPUs, and bring a hitherto impossible k-NNG generation for a dataset of twenty million images with a dimensionality of 15,000 into the realm of practical possibility.
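The brute-force exact k-NNG the paper parallelizes reduces to a full pairwise distance computation followed by a per-row partial sort. A minimal CPU sketch of that computation (the paper's multi-GPU implementation is not shown here):

```python
import numpy as np

def knn_graph(X, k):
    """Brute-force exact k-nearest-neighbor graph from a full pairwise
    squared-distance matrix; argpartition selects each row's k smallest."""
    sq = (X ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T   # squared Euclidean distances
    np.fill_diagonal(d2, np.inf)                   # a point is not its own neighbor
    return np.argpartition(d2, k, axis=1)[:, :k]   # indices of the k nearest per point

# two tight pairs of 2-D points; each point's nearest neighbor is its partner
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
g = knn_graph(X, k=1)
```

The O(n^2 d) distance matrix is exactly the part that becomes prohibitive at twenty million points, which is why the paper distributes it across cluster nodes and GPUs.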
McParland, D; Phillips, C M; Brennan, L; Roche, H M; Gormley, I C
2017-12-10
The LIPGENE-SU.VI.MAX study, like many others, recorded high-dimensional continuous phenotypic data and categorical genotypic data. LIPGENE-SU.VI.MAX focuses on the need to account for both phenotypic and genetic factors when studying the metabolic syndrome (MetS), a complex disorder that can lead to higher risk of type 2 diabetes and cardiovascular disease. Interest lies in clustering the LIPGENE-SU.VI.MAX participants into homogeneous groups or sub-phenotypes, by jointly considering their phenotypic and genotypic data, and in determining which variables are discriminatory. A novel latent variable model that elegantly accommodates high dimensional, mixed data is developed to cluster LIPGENE-SU.VI.MAX participants using a Bayesian finite mixture model. A computationally efficient variable selection algorithm is incorporated, estimation is via a Gibbs sampling algorithm and an approximate BIC-MCMC criterion is developed to select the optimal model. Two clusters or sub-phenotypes ('healthy' and 'at risk') are uncovered. A small subset of variables is deemed discriminatory, which notably includes phenotypic and genotypic variables, highlighting the need to jointly consider both factors. Further, 7 years after the LIPGENE-SU.VI.MAX data were collected, participants underwent further analysis to diagnose presence or absence of the MetS. The two uncovered sub-phenotypes strongly correspond to the 7-year follow-up disease classification, highlighting the role of phenotypic and genotypic factors in the MetS and emphasising the potential utility of the clustering approach in early screening. Additionally, the ability of the proposed approach to define the uncertainty in sub-phenotype membership at the participant level aligns with the concepts of precision medicine and nutrition. Copyright © 2017 John Wiley & Sons, Ltd.
Sariyar, Murat; Hoffmann, Isabell; Binder, Harald
2014-02-26
Molecular data, e.g. arising from microarray technology, are often used for predicting survival probabilities of patients. For multivariate risk prediction models on such high-dimensional data, there are established techniques that combine parameter estimation and variable selection. One big challenge is to incorporate interactions into such prediction models. In this feasibility study, we present building blocks for evaluating and incorporating interaction terms in high-dimensional time-to-event settings, especially for settings in which it is computationally too expensive to check all possible interactions. We use a boosting technique for estimation of effects and the following building blocks for pre-selecting interactions: (1) resampling, (2) random forests and (3) orthogonalization as a data pre-processing step. In a simulation study, the strategy that uses all building blocks is able to detect true main effects and interactions with high sensitivity in different kinds of scenarios. The main challenge is interactions composed of variables that do not represent main effects, but our findings are also promising in this regard. Results on real world data illustrate that effect sizes of interactions frequently may not be large enough to improve prediction performance, even though the interactions are potentially of biological relevance. Screening interactions through random forests is feasible and useful, when one is interested in finding relevant two-way interactions. The other building blocks also contribute considerably to an enhanced pre-selection of interactions. We determined the limits of interaction detection in terms of necessary effect sizes. Our study emphasizes the importance of making full use of existing methods in addition to establishing new ones.
Foote, Celine; Clayton, Philip A; Johnson, David W; Jardine, Meg; Snelling, Paul; Cass, Alan
2014-09-01
Late referral for renal replacement therapy (RRT) leads to worse outcomes. In 2005, estimated glomerular filtration rate (eGFR) reporting began in Australasia, with an aim of substantially increasing earlier disease detection. Observational cohort study using the Australia and New Zealand Dialysis and Transplant Registry (ANZDATA) data. All patients commencing RRT in Australasia between January 1, 1999, and December 31, 2010. We excluded the period between December 31, 2004, and January 1, 2007, to allow for practice change. Introduction of eGFR reporting. Primary outcome was late referral defined as commencing RRT within 3 months of nephrology referral. Secondary outcomes included initial RRT modality and prepared access at hemodialysis therapy initiation. Late referral rates per era were determined and multilevel logistic regression was used to identify late referral predictors. We included 25,009 patients. Overall, 3,433 (25.3%) patients were referred late in the pre-eGFR era compared with 2,464 (21.6%) in the post-eGFR era, for an absolute reduction of 3.7% (95% CI, 2.7%-4.8%; P<0.001). After adjustments for age, body mass index, race, comorbid conditions, and primary kidney disease, adjusted late referral rates were 25.8% (95% CI, 23.3%-28.3%) and 21.8% (95% CI, 19.2%-24.4%) in the pre- and post-eGFR eras, respectively, for a difference of 4.0% (95% CI, 1.2%-6.8%; P=0.005). Late referral risk was attenuated significantly post-eGFR reporting (OR, 1.30; 95% CI, 1.12-1.51) compared to pre-eGFR reporting (OR, 2.15; 95% CI, 1.88-2.46) for indigenous patients. Late referral rates decreased for older patients but increased slightly for younger patients (P=0.001 for interaction between age and era). There was no impact on initial RRT modality or prepared access rates at hemodialysis therapy initiation between eras. Residual confounding could not be excluded. eGFR reporting was associated with small reductions in late referral, but more than 1 in 5 patients are still
A Meta-Heuristic Regression-Based Feature Selection for Predictive Analytics
Directory of Open Access Journals (Sweden)
Bharat Singh
2014-11-01
Full Text Available Selecting an optimal feature subset from a very large number of features in high-dimensional data is an NP-complete problem. Because conventional optimization techniques are unable to tackle large-scale feature selection problems, meta-heuristic algorithms are widely used. In this paper, we propose a particle swarm optimization technique that utilizes regression techniques for feature selection. We then use the selected features to classify the data. Classification accuracy is used as a criterion to evaluate classifier performance, and classification is accomplished through the use of k-nearest neighbour (KNN) and Bayesian techniques. Various high dimensional data sets are used to evaluate the usefulness of the proposed approach. Results show that our approach gives better results when compared with other conventional feature selection algorithms.
Time-adaptive quantile regression
DEFF Research Database (Denmark)
Møller, Jan Kloppenborg; Nielsen, Henrik Aalborg; Madsen, Henrik
2008-01-01
An algorithm for time-adaptive quantile regression is presented. The algorithm is based on the simplex algorithm, and the linear optimization formulation of the quantile regression problem is given. The observations have been split to allow a direct use of the simplex algorithm. The simplex method...... and an updating procedure are combined into a new algorithm for time-adaptive quantile regression, which generates new solutions on the basis of the old solution, leading to savings in computation time. The suggested algorithm is tested against a static quantile regression model on a data set with wind power...... production, where the models combine splines and quantile regression. The comparison indicates superior performance for the time-adaptive quantile regression in all the performance parameters considered....
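The objective behind any quantile regression, time-adaptive or not, is the pinball (check) loss. The paper solves it as a linear program with a simplex method; the sketch below shares only the objective, minimizing it by subgradient descent on a synthetic stand-in for the wind-power data.

```python
import numpy as np

def quantile_fit(x, y, tau, lr=0.01, n_iter=20000):
    """Fit y ~ a + b*x at quantile tau by subgradient descent on the
    pinball loss rho_tau(r) = r * (tau - 1{r < 0}). Illustrative only;
    the paper uses a simplex-based linear-programming formulation."""
    a, b = 0.0, 0.0
    for _ in range(n_iter):
        r = y - (a + b * x)
        g = np.where(r < 0, 1.0 - tau, -tau)   # subgradient of the check loss
        a -= lr * g.mean()
        b -= lr * (g * x).mean()
    return a, b

rng = np.random.default_rng(2)
x = rng.uniform(0, 1, 200)
y = 2 + 3 * x + rng.normal(0, 0.1, 200)        # synthetic data, not wind power
a, b = quantile_fit(x, y, tau=0.5)             # tau = 0.5 gives median regression
```

Re-fitting as new observations arrive, instead of restarting from scratch, is the "time-adaptive" saving the abstract describes; the simplex update reuses the old basis rather than the old iterate.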
Optical proximity correction with principal component regression
Gao, Peiran; Gu, Allan; Zakhor, Avideh
2008-03-01
An important step in today's Integrated Circuit (IC) manufacturing is optical proximity correction (OPC). In model based OPC, masks are systematically modified to compensate for the non-ideal optical and process effects of the optical lithography system. The polygons in the layout are fragmented, and simulations are performed to determine the image intensity pattern on the wafer. Then the mask is perturbed by moving the fragments to match the desired wafer pattern. This iterative process continues until the pattern on the wafer matches the desired one. Although OPC increases the fidelity of pattern transfer to the wafer, it is quite CPU intensive; OPC for modern IC designs can take days to complete on computer clusters with thousands of CPUs. In this paper, techniques from statistical machine learning are used to predict the fragment movements. The goal is to reduce the number of iterations required in model based OPC by using a fast and efficient solution as the initial guess for model based OPC. To determine the best model, we train and evaluate several principal component regression models based on prediction error. Experimental results show that fragment movement predictions via the regression model significantly decrease the number of iterations required in model based OPC.
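Principal component regression itself is a generic two-step recipe: project the predictors onto their top principal components, then run ordinary least squares in that reduced space. A minimal sketch (the low-rank synthetic data below stands in for fragment features, which are not described in enough detail here to reproduce):

```python
import numpy as np

def pcr_fit(X, y, n_comp):
    """Principal component regression: SVD gives the principal directions,
    then least squares is solved on the component scores."""
    mu, ym = X.mean(axis=0), y.mean()
    Xc = X - mu
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    V = Vt[:n_comp].T                    # top principal directions
    Z = Xc @ V                           # scores on those components
    beta, *_ = np.linalg.lstsq(Z, y - ym, rcond=None)
    return mu, ym, V, beta

def pcr_predict(model, X):
    mu, ym, V, beta = model
    return ym + (X - mu) @ V @ beta

rng = np.random.default_rng(4)
F = rng.normal(size=(50, 3))             # 3 latent factors drive 10 observed features
X = F @ rng.normal(size=(3, 10))
y = F @ np.array([1.0, -2.0, 0.5])
model = pcr_fit(X, y, n_comp=3)          # 3 components capture the low-rank structure
pred = pcr_predict(model, X)
```

With fewer components than features, PCR discards noisy directions, which is what makes the predicted fragment movements a cheap but useful initial guess for the expensive OPC iteration.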
Evaluating Differential Effects Using Regression Interactions and Regression Mixture Models
Van Horn, M. Lee; Jaki, Thomas; Masyn, Katherine; Howe, George; Feaster, Daniel J.; Lamont, Andrea E.; George, Melissa R. W.; Kim, Minjung
2015-01-01
Research increasingly emphasizes understanding differential effects. This article focuses on understanding regression mixture models, which are relatively new statistical methods for assessing differential effects, by comparing results to those obtained using an interaction term in linear regression. The research questions which each model answers, their…
Bias-corrected quantile regression estimation of censored regression models
Cizek, Pavel; Sadikoglu, Serhan
2018-01-01
In this paper, an extension of the indirect inference methodology to semiparametric estimation is explored in the context of censored regression. Motivated by weak small-sample performance of the censored regression quantile estimator proposed by Powell (J Econom 32:143–155, 1986a), two- and
Quantum assisted Gaussian process regression
Zhao, Zhikuan; Fitzsimons, Jack K.; Fitzsimons, Joseph F.
2015-01-01
Gaussian processes (GP) are a widely used model for regression problems in supervised machine learning. Implementation of GP regression typically requires $O(n^3)$ logic gates. We show that the quantum linear systems algorithm [Harrow et al., Phys. Rev. Lett. 103, 150502 (2009)] can be applied to Gaussian process regression (GPR), leading to an exponential reduction in computation time in some instances. We show that even in some cases not ideally suited to the quantum linear systems algorith...
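The O(n^3) cost the abstract refers to comes from factorizing the n x n kernel matrix. A minimal numpy sketch of standard GPR with an RBF kernel (the quantum speedup itself is not reproducible here; hyperparameters below are illustrative):

```python
import numpy as np

def gp_posterior_mean(X, y, Xs, length=0.2, noise=1e-4):
    """GP regression posterior mean with an RBF kernel. The Cholesky
    factorization of the n x n kernel matrix is the O(n^3) step."""
    def k(a, b):
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-0.5 * d2 / length ** 2)
    K = k(X, X) + noise * np.eye(len(X))
    L = np.linalg.cholesky(K)                          # O(n^3) factorization
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return k(Xs, X) @ alpha                            # posterior mean at Xs

X = np.linspace(0, 1, 20)[:, None]
y = np.sin(2 * np.pi * X[:, 0])
mean = gp_posterior_mean(X, y, X)    # nearly interpolates the training targets
```

Replacing the linear solve `K alpha = y` is precisely where the quantum linear systems algorithm would enter, since everything else here is cheap by comparison.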
Quantile regression theory and applications
Davino, Cristina; Vistocco, Domenico
2013-01-01
A guide to the implementation and interpretation of quantile regression models. This book explores the theory and numerous applications of quantile regression, offering empirical data analysis as well as the software tools to implement the methods. The main focus of this book is to provide the reader with a comprehensive description of the main issues concerning quantile regression; these include basic modeling, geometrical interpretation, estimation and inference for quantile regression, as well as issues concerning the validity of the model and diagnostic tools. Each methodological aspect is explored and
Testing discontinuities in nonparametric regression
Dai, Wenlin
2017-01-19
In nonparametric regression, it is often necessary to detect whether there are jump discontinuities in the mean function. In this paper, we revisit the difference-based method in [13] (H.-G. Müller and U. Stadtmüller, Discontinuous versus smooth regression, Ann. Stat. 27 (1999), pp. 299–337. doi: 10.1214/aos/1018031100).
Logistic Regression: Concept and Application
Cokluk, Omay
2010-01-01
The main focus of logistic regression analysis is classification of individuals in different groups. The aim of the present study is to explain basic concepts and processes of binary logistic regression analysis intended to determine the combination of independent variables which best explain the membership in certain groups called dichotomous…
Panel Smooth Transition Regression Models
DEFF Research Database (Denmark)
González, Andrés; Terasvirta, Timo; Dijk, Dick van
We introduce the panel smooth transition regression model. This new model is intended for characterizing heterogeneous panels, allowing the regression coefficients to vary both across individuals and over time. Specifically, heterogeneity is allowed for by assuming that these coefficients are bou...
Regression analysis with categorized regression calibrated exposure: some interesting findings
Directory of Open Access Journals (Sweden)
Hjartåker Anette
2006-07-01
Full Text Available Background Regression calibration as a method for handling measurement error is becoming increasingly well-known and used in epidemiologic research. However, the standard version of the method is not appropriate for exposure analyzed on a categorical (e.g. quintile) scale, an approach commonly used in epidemiologic studies. A tempting solution could then be to use the predicted continuous exposure obtained through the regression calibration method and treat it as an approximation to the true exposure, that is, include the categorized calibrated exposure in the main regression analysis. Methods We use semi-analytical calculations and simulations to evaluate the performance of the proposed approach compared to the naive approach of not correcting for measurement error, in situations where analyses are performed on the quintile scale and when incorporating the original scale into the categorical variables, respectively. We also present analyses of real data, containing measures of folate intake and depression, from the Norwegian Women and Cancer study (NOWAC). Results In cases where extra information is available through replicated measurements and not validation data, regression calibration does not maintain important qualities of the true exposure distribution, thus estimates of variance and percentiles can be severely biased. We show that the outlined approach maintains much, in some cases all, of the misclassification found in the observed exposure. For that reason, regression analysis with the corrected variable included on a categorical scale is still biased. In some cases the corrected estimates are analytically equal to those obtained by the naive approach. Regression calibration is however vastly superior to the naive method when applying the medians of each category in the analysis. Conclusion Regression calibration in its most well-known form is not appropriate for measurement error correction when the exposure is analyzed on a
Advanced statistics: linear regression, part II: multiple linear regression.
Marill, Keith A
2004-01-01
The applications of simple linear regression in medical research are limited, because in most situations, there are multiple relevant predictor variables. Univariate statistical techniques such as simple linear regression use a single predictor variable, and they often may be mathematically correct but clinically misleading. Multiple linear regression is a mathematical technique used to model the relationship between multiple independent predictor variables and a single dependent outcome variable. It is used in medical research to model observational data, as well as in diagnostic and therapeutic studies in which the outcome is dependent on more than one factor. Although the technique generally is limited to data that can be expressed with a linear function, it benefits from a well-developed mathematical framework that yields unique solutions and exact confidence intervals for regression coefficients. Building on Part I of this series, this article acquaints the reader with some of the important concepts in multiple regression analysis. These include multicollinearity, interaction effects, and an expansion of the discussion of inference testing, leverage, and variable transformations to multivariate models. Examples from the first article in this series are expanded on using a primarily graphic, rather than mathematical, approach. The importance of the relationships among the predictor variables and the dependence of the multivariate model coefficients on the choice of these variables are stressed. Finally, concepts in regression model building are discussed.
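Two of the concepts this article stresses, coefficient estimation with several predictors and multicollinearity, can be made concrete with a short sketch. The data are synthetic and the variance inflation factor (VIF) diagnostic is a standard tool, not something specific to this article.

```python
import numpy as np

def fit_ols(X, y):
    """Multiple linear regression by least squares; returns
    [intercept, b1, b2, ...]."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def vif(X):
    """Variance inflation factor per predictor: 1 / (1 - R^2) from
    regressing that predictor on the others; large values flag
    multicollinearity."""
    out = []
    for j in range(X.shape[1]):
        A = np.column_stack([np.ones(len(X)), np.delete(X, j, axis=1)])
        pred = A @ np.linalg.lstsq(A, X[:, j], rcond=None)[0]
        r2 = 1 - ((X[:, j] - pred) ** 2).sum() / ((X[:, j] - X[:, j].mean()) ** 2).sum()
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

rng = np.random.default_rng(5)
x1, x2 = rng.normal(size=200), rng.normal(size=200)
y = 1 + 2 * x1 - x2 + rng.normal(0, 0.1, 200)
coef = fit_ols(np.column_stack([x1, x2]), y)          # recovers [1, 2, -1]
v_ind = vif(np.column_stack([x1, x2]))                # near 1: no collinearity
v_col = vif(np.column_stack([x1, x2, x1 + 0.01 * rng.normal(size=200)]))
```

With the near-duplicate third predictor, the VIFs for the collinear pair explode, which is the warning sign the article describes: the individual coefficients become unstable even though the overall fit is unchanged.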
Di Liberto, Giovanni; Conte, Riccardo; Ceotto, Michele
2018-01-07
We extensively describe our recently established "divide-and-conquer" semiclassical method [M. Ceotto, G. Di Liberto, and R. Conte, Phys. Rev. Lett. 119, 010401 (2017)] and propose a new implementation of it to increase the accuracy of results. The technique permits us to perform spectroscopic calculations of high-dimensional systems by dividing the full-dimensional problem into a set of smaller dimensional ones. The partition procedure, originally based on a dynamical analysis of the Hessian matrix, is here more rigorously achieved through a hierarchical subspace-separation criterion based on Liouville's theorem. Comparisons of calculated vibrational frequencies to exact quantum ones for a set of molecules including benzene show that the new implementation performs better than the original one and that, on average, the loss in accuracy with respect to full-dimensional semiclassical calculations is reduced to only 10 wavenumbers. Furthermore, by investigating the challenging Zundel cation, we also demonstrate that the "divide-and-conquer" approach allows us to deal with complex strongly anharmonic molecular systems. Overall the method very much helps the assignment and physical interpretation of experimental IR spectra by providing accurate vibrational fundamentals and overtones decomposed into reduced dimensionality spectra.
Awale, Mahendra; Reymond, Jean-Louis
2015-08-24
An Internet portal accessible at www.gdb.unibe.ch has been set up to automatically generate color-coded similarity maps of the ChEMBL database in relation to up to two sets of active compounds taken from the enhanced Directory of Useful Decoys (eDUD), a random set of molecules, or up to two sets of user-defined reference molecules. These maps visualize the relationships between the selected compounds and ChEMBL in six different high dimensional chemical spaces, namely MQN (42-D molecular quantum numbers), SMIfp (34-D SMILES fingerprint), APfp (20-D shape fingerprint), Xfp (55-D pharmacophore fingerprint), Sfp (1024-bit substructure fingerprint), and ECfp4 (1024-bit extended connectivity fingerprint). The maps are supplied in the form of Java-based desktop applications called "similarity mapplets" allowing interactive content browsing and linked to a "Multifingerprint Browser for ChEMBL" (also accessible directly at www.gdb.unibe.ch) to perform nearest neighbor searches. One can obtain six similarity mapplets of ChEMBL relative to random reference compounds, 606 similarity mapplets relative to single eDUD active sets, 30,300 similarity mapplets relative to pairs of eDUD active sets, and any number of similarity mapplets relative to user-defined reference sets to help visualize the structural diversity of compound series in drug optimization projects and their relationship to other known bioactive compounds.
Chiu, Mei Choi; Pun, Chi Seng; Wong, Hoi Ying
2017-08-01
Investors interested in the global financial market must analyze financial securities internationally. Making an optimal global investment decision involves processing a huge amount of data for a high-dimensional portfolio. This article investigates the big data challenges of two mean-variance optimal portfolios: continuous-time precommitment and constant-rebalancing strategies. We show that both optimized portfolios implemented with the traditional sample estimates converge to the worst performing portfolio when the portfolio size becomes large. The crux of the problem is the estimation error accumulated from the huge dimension of stock data. We then propose a linear programming optimal (LPO) portfolio framework, which applies a constrained ℓ1 minimization to the theoretical optimal control to mitigate the risk associated with the dimensionality issue. The resulting portfolio becomes a sparse portfolio that selects stocks with a data-driven procedure and hence offers a stable mean-variance portfolio in practice. When the number of observations becomes large, the LPO portfolio converges to the oracle optimal portfolio, which is free of estimation error, even though the number of stocks grows faster than the number of observations. Our numerical and empirical studies demonstrate the superiority of the proposed approach. © 2017 Society for Risk Analysis.
Logic regression and its extensions.
Schwender, Holger; Ruczinski, Ingo
2010-01-01
Logic regression is an adaptive classification and regression procedure, initially developed to reveal interacting single nucleotide polymorphisms (SNPs) in genetic association studies. In general, this approach can be used in any setting with binary predictors, when the interaction of these covariates is of primary interest. Logic regression searches for Boolean (logic) combinations of binary variables that best explain the variability in the outcome variable, and thus, reveals variables and interactions that are associated with the response and/or have predictive capabilities. The logic expressions are embedded in a generalized linear regression framework, and thus, logic regression can handle a variety of outcome types, such as binary responses in case-control studies, numeric responses, and time-to-event data. In this chapter, we provide an introduction to the logic regression methodology, list some applications in public health and medicine, and summarize some of the direct extensions and modifications of logic regression that have been proposed in the literature. Copyright © 2010 Elsevier Inc. All rights reserved.
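The flavor of the Boolean expressions logic regression searches for can be shown with a deliberately tiny brute-force version. Real logic regression grows larger logic trees with simulated annealing inside a generalized linear model; the pair search below, on a planted rule, only illustrates the kind of expression sought.

```python
from itertools import combinations

def best_logic_rule(X, y):
    """Exhaustively score AND/OR combinations of predictor pairs by
    agreement with a binary outcome; return the best accuracy and rule."""
    n_pred = len(X[0])
    best_acc, best_rule = 0.0, None
    for i, j in combinations(range(n_pred), 2):
        for name, op in (("AND", lambda a, b: a & b), ("OR", lambda a, b: a | b)):
            pred = [op(row[i], row[j]) for row in X]
            acc = sum(int(p == t) for p, t in zip(pred, y)) / len(y)
            if acc > best_acc:
                best_acc, best_rule = acc, (i, name, j)
    return best_acc, best_rule

# full truth table over 4 binary predictors (e.g. SNP indicators),
# with the outcome generated by the planted rule x0 AND x2
X = [[(r >> k) & 1 for k in range(4)] for r in range(16)]
y = [row[0] & row[2] for row in X]
acc, rule = best_logic_rule(X, y)
```

The search recovers the planted interaction exactly, which is the point of the method: the fitted expression names the interacting variables directly instead of hiding them in coefficients.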
Practical Session: Simple Linear Regression
Clausel, M.; Grégoire, G.
2014-12-01
Two exercises are proposed to illustrate simple linear regression. The first one is based on the famous Galton data set on heredity. We use the lm R command and get coefficient estimates, the standard error of the error, R2, residuals… In the second example, devoted to data related to the vapor tension of mercury, we fit a simple linear regression, predict values, and anticipate multiple linear regression. This practical session is an excerpt from practical exercises proposed by A. Dalalyan at ENPC (see Exercises 1 and 2 of http://certis.enpc.fr/~dalalyan/Download/TP_ENPC_4.pdf).
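The first exercise's fit has a closed form worth seeing outside R. A Python sketch with synthetic parent/child heights (not Galton's actual data, and the coefficients 24 and 0.65 below are invented):

```python
import numpy as np

def simple_ols(x, y):
    """Closed-form simple linear regression: slope = cov(x, y) / var(x),
    intercept from the means. Matches the coefficients R's lm() reports."""
    b = np.cov(x, y, bias=True)[0, 1] / np.var(x)
    a = y.mean() - b * x.mean()
    return a, b

rng = np.random.default_rng(3)
parent = rng.normal(68, 1.8, size=500)                      # synthetic heights (inches)
child = 24 + 0.65 * parent + rng.normal(0, 1.5, size=500)   # slope < 1: regression to the mean
a, b = simple_ols(parent, child)
```

In R, `lm(child ~ parent)` returns the same two coefficients; the slope below 1 is the "regression toward the mean" effect that gave Galton's data set, and the method, their name.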
Ogutu, Joseph O; Schulz-Streeck, Torben; Piepho, Hans-Peter
2012-05-21
similar accuracies but outperformed ridge regression and ridge regression BLUP in terms of the Pearson correlation between predicted GEBVs and the true genomic value as well as the root mean squared error. The performance of RR-BLUP was also somewhat better than that of ridge regression. This pattern was replicated by the Pearson correlation between predicted GEBVs and the true breeding values (TBV) and the root mean squared error calculated with respect to TBV, except that accuracy was lower for all models, most especially for the adaptive elastic net. The correlation between the predicted GEBV and simulated phenotypic values based on the fivefold CV also revealed a similar pattern except that the adaptive elastic net had lower accuracy than both the ridge regression methods. All the six models had relatively high prediction accuracies for the simulated data set. Accuracy was higher for the lasso type methods than for ridge regression and ridge regression BLUP.
Multiple Regression and Its Discontents
Snell, Joel C.; Marsh, Mitchell
2012-01-01
Multiple regression is part of a larger statistical strategy originated by Gauss. The authors raise questions about the theory and suggest some changes that would make room for Mandelbrot and Serendipity.
Regression methods for medical research
Tai, Bee Choo
2013-01-01
Regression Methods for Medical Research provides medical researchers with the skills they need to critically read and interpret research using more advanced statistical methods. The statistical requirements of interpreting and publishing in medical journals, together with rapid changes in science and technology, increasingly demand an understanding of more complex and sophisticated analytic procedures. The text explains the application of statistical models to a wide variety of practical medical investigative studies and clinical trials. Regression methods are used to appropriately answer the
Wheeler, David; Tiefelsdorf, Michael
2005-06-01
Present methodological research on geographically weighted regression (GWR) focuses primarily on extensions of the basic GWR model, while ignoring well-established diagnostics tests commonly used in standard global regression analysis. This paper investigates multicollinearity issues surrounding the local GWR coefficients at a single location and the overall correlation between GWR coefficients associated with two different exogenous variables. Results indicate that the local regression coefficients are potentially collinear even if the underlying exogenous variables in the data generating process are uncorrelated. Based on these findings, applied GWR research should practice caution in substantively interpreting the spatial patterns of local GWR coefficients. An empirical disease-mapping example is used to motivate the GWR multicollinearity problem. Controlled experiments are performed to systematically explore coefficient dependency issues in GWR. These experiments specify global models that use eigenvectors from a spatial link matrix as exogenous variables.
Directory of Open Access Journals (Sweden)
Narissara Eiamkanitchat
2015-01-01
Full Text Available Neurofuzzy methods capable of selecting a handful of informative features are very useful in the analysis of high-dimensional datasets. A neurofuzzy classification scheme that can create proper linguistic features and simultaneously select informative features for a high-dimensional dataset is presented and applied to the diffuse large B-cell lymphoma (DLBCL) microarray classification problem. The classification scheme combines embedded linguistic feature creation and tuning, feature selection, and rule-based classification in one neural network framework. The adjustable linguistic features are embedded in the network structure via fuzzy membership functions. The network performs the classification task on the high-dimensional DLBCL microarray dataset either by direct calculation or by the rule-based approach. Ten-fold cross-validation is applied to ensure the validity of the results. Very good results are achieved from both direct calculation and logical rules. The results show that the network can select a small set of informative features in this high-dimensional dataset. Compared with other previously proposed methods, our method yields better classification performance.
Spectral-Spatial Shared Linear Regression for Hyperspectral Image Classification.
Haoliang Yuan; Yuan Yan Tang
2017-04-01
Classification of the pixels in a hyperspectral image (HSI) is an important task and has been popularly applied in many practical applications. Its major challenge is the high-dimensional, small-sample-size problem. To deal with this problem, many subspace learning (SL) methods have been developed to reduce the dimension of the pixels while preserving the important discriminant information. Motivated by the ridge linear regression (RLR) framework for SL, we propose a spectral-spatial shared linear regression method (SSSLR) for extracting the feature representation. Compared with RLR, our proposed SSSLR has the following two advantages. First, we utilize a convex set to explore the spatial structure for computing the linear projection matrix. Second, we utilize a shared structure learning model, formed by the original data space and a hidden feature space, to learn a more discriminant linear projection matrix for classification. To optimize our proposed method, an efficient iterative algorithm is proposed. Experimental results on two popular HSI data sets, i.e., Indian Pines and Salinas, demonstrate that our proposed methods outperform many SL methods.
Directory of Open Access Journals (Sweden)
Guo Wenge
2012-07-01
Full Text Available Abstract Background Based on available biological information, genomic data can often be partitioned into pre-defined sets (e.g. pathways) and subsets within sets. Biologists are often interested in determining whether some pre-defined sets of variables (e.g. genes) are differentially expressed under varying experimental conditions. Several procedures are available in the literature for making such determinations; however, they do not take into account information regarding the subsets within each set. Secondly, variables (e.g. genes) belonging to a set or a subset are potentially correlated, yet such information is often ignored and univariate methods are used. This may result in loss of power and/or an inflated false positive rate. Results We introduce a multiple testing-based methodology which makes use of available information regarding biologically relevant subsets within each pre-defined set of variables while exploiting the underlying dependence structure among the variables. Using this methodology, a biologist may not only determine whether a set of variables is differentially expressed between two experimental conditions, but may also test whether specific subsets within a significant set are also significant. Conclusions The proposed methodology (a) is easy to implement, (b) does not require inverting potentially singular covariance matrices, (c) controls the family-wise error rate (FWER) at the desired nominal level, and (d) is robust to the underlying distribution and covariance structures. Although for simplicity of exposition the methodology is described for microarray gene expression data, it is also applicable to any high-dimensional data, such as mRNA-seq data, CpG methylation data, etc.
Wismüller, Axel
2009-02-01
Purpose: To develop, test, and evaluate a novel unsupervised machine learning method for computer-aided diagnosis and analysis of multidimensional data, such as biomedical imaging data. Methods: We introduce the Exploration Machine (XOM) as a method for computing low-dimensional representations of high-dimensional observations. XOM systematically inverts functional and structural components of topology-preserving mappings. By this trick, it can contribute to both structure-preserving visualization and data clustering. We applied XOM to the analysis of whole-genome microarray imaging data, comprising 2467 79-dimensional gene expression profiles of Saccharomyces cerevisiae, and to model-free analysis of functional brain MRI data by unsupervised clustering. For both applications, we performed quantitative comparisons to results obtained by established algorithms. Results: Genome data: Absolute (relative) Sammon error values were 5.91×10^5 (1.00) for XOM, 6.50×10^5 (1.10) for Sammon's mapping, 6.56×10^5 (1.11) for PCA, and 7.24×10^5 (1.22) for Self-Organizing Map (SOM). Computation times were 72, 216, 2, and 881 seconds for XOM, Sammon, PCA, and SOM, respectively. Functional MRI data: Areas under ROC curves for detection of task-related brain activation were 0.984 ± 0.03 for XOM, 0.983 ± 0.02 for Minimal-Free-Energy VQ, and 0.979 ± 0.02 for SOM. Conclusion: For both multidimensional imaging applications, i.e. gene expression visualization and functional MRI clustering, XOM yields competitive results when compared to established algorithms. Its surprising versatility in simultaneously contributing to dimensionality reduction and data clustering qualifies XOM to serve as a useful novel method for the analysis of multidimensional data, such as biomedical image data in computer-aided diagnosis.
Gui, J; Li, H
2005-01-01
An important area of research in pharmacogenomics is to relate high-dimensional genetic or genomic data to various clinical phenotypes of patients. Due to large variability in time to certain clinical event among patients, studying possibly censored survival phenotypes can be more informative than treating the phenotypes as categorical variables. In this paper, we develop a threshold gradient descent (TGD) method for the Cox model to select genes that are relevant to patients' survival and to build a predictive model for the risk of a future patient. The computational difficulty associated with the estimation in the high-dimensional and low-sample size settings can be efficiently solved by the gradient descent iterations. Results from application to real data set on predicting survival after chemotherapy for patients with diffuse large B-cell lymphoma demonstrate that the proposed method can be used for identifying important genes that are related to time to death due to cancer and for building a parsimonious model for predicting the survival of future patients. The TGD based Cox regression gives better predictive performance than the L2 penalized regression and can select more relevant genes than the L1 penalized regression.
Inferential Models for Linear Regression
Directory of Open Access Journals (Sweden)
Zuoyi Zhang
2011-09-01
Full Text Available Linear regression is arguably one of the most widely used statistical methods in applications. However, important problems, especially variable selection, remain a challenge for classical modes of inference. This paper develops a recently proposed framework of inferential models (IMs) in the linear regression context. In general, an IM is able to produce meaningful probabilistic summaries of the statistical evidence for and against assertions about the unknown parameter of interest and, moreover, these summaries are shown to be properly calibrated in a frequentist sense. Here we demonstrate, using simple examples, that the IM framework is promising for linear regression analysis --- including model checking, variable selection, and prediction --- and for uncertain inference in general.
General regression and representation model for classification.
Directory of Open Access Journals (Sweden)
Jianjun Qian
Full Text Available Recently, the regularized coding-based classification methods (e.g. SRC and CRC) have shown great potential for pattern classification. However, most existing coding methods assume that the representation residuals are uncorrelated. In real-world applications, this assumption does not hold. In this paper, we take account of the correlations of the representation residuals and develop a general regression and representation model (GRR) for classification. GRR not only has the advantages of CRC, but also makes full use of the prior information (e.g. the correlations between representation residuals and representation coefficients) and the specific information (weight matrix of image pixels) to enhance the classification performance. GRR uses the generalized Tikhonov regularization and K Nearest Neighbors to learn the prior information from the training data. Meanwhile, the specific information is obtained by using an iterative algorithm to update the feature (or image pixel) weights of the test sample. With the proposed model as a platform, we design two classifiers: the basic general regression and representation classifier (B-GRR) and the robust general regression and representation classifier (R-GRR). The experimental results demonstrate the performance advantages of the proposed methods over state-of-the-art algorithms.
[Caudal regression syndrome--two case reports].
Kokrdová, Z; Pavlíková, J
2008-01-01
The authors demonstrate two cases of caudal regression syndrome (CRS), a rare malformative syndrome seen mainly in cases of maternal diabetes with poor metabolic control. Case report. Department of Obstetrics and Gynecology, Department of Medicine, Regional Hospital Pardubice. The caudal regression syndrome (CRS) was revealed in two women with pregestational diabetes. The diagnosis was made at 18 and 20 weeks. The characteristic ultrasound findings include abrupt interruption of the spine and abnormal position of the lower limbs. The femur bones are fixed in a "V" pattern, giving a typical "Buddha's pose". A complete examination must be conducted for possible urinary and intestinal malformations. The mechanism leading to malformation is discussed in the article. Avoiding pregnancy while diabetes is poorly controlled is the only way to minimize the risk of a congenitally malformed baby, including caudal regression syndrome, in the population of diabetic mothers. Family planning and supervision by specialists is always advisable. Early diagnosis of CRS is possible using vaginal ultrasound. Emphasis is placed on the association of abrupt disruption of the dorsal or lumbar spine with abnormal images of the lower limbs fixed in a "V" formation, which is a characteristic sign of CRS.
A Matlab program for stepwise regression
Directory of Open Access Journals (Sweden)
Yanhong Qi
2016-03-01
Full Text Available Stepwise linear regression is a multi-variable regression technique for identifying statistically significant variables in the linear regression equation. In the present study, we present a Matlab program for stepwise regression.
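The program itself is not reproduced in the record; as a sketch of what stepwise selection does, here is a minimal forward-selection loop using the partial F-test (a NumPy/SciPy illustration with hypothetical names, not the paper's Matlab code):

```python
import numpy as np
from scipy import stats

def forward_stepwise(X, y, alpha=0.05):
    """Forward stepwise linear regression: repeatedly add the predictor
    with the smallest partial-F p-value until none passes alpha."""
    n, p = X.shape
    selected, remaining = [], list(range(p))

    def rss(cols):
        # Residual sum of squares of OLS with intercept + given columns
        A = np.column_stack([np.ones(n)] + [X[:, c] for c in cols])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        r = y - A @ beta
        return r @ r

    while remaining:
        rss0 = rss(selected)
        best_col, best_p = None, 1.0
        for c in remaining:
            rss1 = rss(selected + [c])
            df2 = n - len(selected) - 2          # residual df after adding c
            F = (rss0 - rss1) / (rss1 / df2)     # partial F for one added term
            pval = stats.f.sf(F, 1, df2)
            if pval < best_p:
                best_col, best_p = c, pval
        if best_p < alpha:
            selected.append(best_col)
            remaining.remove(best_col)
        else:
            break
    return selected
```

On synthetic data with two truly active predictors, the loop picks those two first and then stops once no candidate passes the threshold.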
Logistic regression for circular data
Al-Daffaie, Kadhem; Khan, Shahjahan
2017-05-01
This paper considers the relationship between a binary response and a circular predictor. It develops the logistic regression model by employing the linear-circular regression approach. The maximum likelihood method is used to estimate the parameters. The Newton-Raphson numerical method is used to find the estimated values of the parameters. A data set from weather records of Toowoomba city is analysed by the proposed methods. Moreover, a simulation study is considered. The R software is used for all computations and simulations.
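As a sketch of the linear-circular approach described above, the angle can enter a logistic model through cos θ and sin θ, with coefficients found by Newton-Raphson as in the abstract (an illustrative reconstruction, not the authors' R code):

```python
import numpy as np

def fit_circular_logistic(theta, y, n_iter=25):
    """Logistic regression with a circular predictor via the
    linear-circular embedding (1, cos(theta), sin(theta))."""
    X = np.column_stack([np.ones_like(theta), np.cos(theta), np.sin(theta)])
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):                       # Newton-Raphson iterations
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        grad = X.T @ (y - p)                      # score vector
        H = X.T @ (X * (p * (1.0 - p))[:, None])  # Fisher information
        beta += np.linalg.solve(H, grad)
    return beta
```

The cos/sin embedding makes the fitted log-odds a smooth periodic function of the angle, which is the essential requirement for a circular predictor.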
Quasi-least squares regression
Shults, Justine
2014-01-01
Drawing on the authors' substantial expertise in modeling longitudinal and clustered data, Quasi-Least Squares Regression provides a thorough treatment of quasi-least squares (QLS) regression-a computational approach for the estimation of correlation parameters within the framework of generalized estimating equations (GEEs). The authors present a detailed evaluation of QLS methodology, demonstrating the advantages of QLS in comparison with alternative methods. They describe how QLS can be used to extend the application of the traditional GEE approach to the analysis of unequally spaced longitu
Biplots in Reduced-Rank Regression
Braak, ter C.J.F.; Looman, C.W.N.
1994-01-01
Regression problems with a number of related response variables are typically analyzed by separate multiple regressions. This paper shows how these regressions can be visualized jointly in a biplot based on reduced-rank regression. Reduced-rank regression combines multiple regression and principal
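Although the abstract is cut off, the core construction is compact: a reduced-rank coefficient matrix can be obtained by truncating an SVD of the OLS fitted values (a minimal sketch assuming centered X and Y; names are illustrative):

```python
import numpy as np

def reduced_rank_regression(X, Y, rank):
    """Coefficient matrix of rank <= `rank`: project the ordinary
    least-squares fit onto its leading response-space directions."""
    B_ols, *_ = np.linalg.lstsq(X, Y, rcond=None)   # p x q unconstrained fit
    _, _, Vt = np.linalg.svd(X @ B_ols, full_matrices=False)
    V_r = Vt[:rank].T                               # q x rank directions
    return B_ols @ V_r @ V_r.T                      # rank-constrained solution
```

With rank equal to the number of responses this reduces to the separate multiple regressions mentioned above; smaller ranks give the joint low-dimensional fit that the biplot visualizes.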
Growth Regression and Economic Theory
Elbers, Chris; Gunning, Jan Willem
2002-01-01
In this note we show that the standard, loglinear growth regression specification is consistent with one and only one model in the class of stochastic Ramsey models. This model is highly restrictive: it requires a Cobb-Douglas technology and a 100% depreciation rate, and it implies that risk does not
Regression of lumbar disk herniation
Directory of Open Access Journals (Sweden)
G. Yu Evzikov
2015-01-01
Full Text Available Compression of the spinal nerve root, giving rise to pain and to sensory and motor disorders in the area of its innervation, is the most vivid manifestation of a herniated intervertebral disk. Different treatment modalities, including neurosurgery, for these conditions are discussed. There has been recent evidence that herniated disks can regress spontaneously. The paper describes a female patient with a large lateralized disc extrusion that caused compression of the S1 nerve root, leading to obvious myotonic and radicular syndromes. Magnetic resonance imaging showed that the clinical manifestations of discogenic radiculopathy, as well as the myotonic syndrome and morphological changes, completely regressed 8 months later. The likely mechanism is inflammation-induced resorption of a large herniated disk fragment, which agrees with the data available in the literature. A decision to perform neurosurgery, for which the patient had indications, was made during her first consultation. After regression of the discogenic radiculopathy, there was only moderate pain caused by musculoskeletal diseases (facet syndrome, piriformis syndrome) that were successfully eliminated by minimally invasive techniques.
Claim reserving with fuzzy regression
Bahrami, Tahereh; BAHRAMI, Masuod
2015-01-01
Abstract. Claims reserving plays a key role in insurance. Various statistical methods are therefore used to provide an adequate amount of claim reserves. Since claim reserves are inherently variable, fuzzy set theory is used to handle this variability. In this paper, non-symmetric fuzzy regression is integrated into Taylor's method to develop a new method for claim reserving.
Multimodality in GARCH regression models
Ooms, M.; Doornik, J.A.
2008-01-01
It is shown empirically that mixed autoregressive moving average regression models with generalized autoregressive conditional heteroskedasticity (Reg-ARMA-GARCH models) can have multimodality in the likelihood that is caused by a dummy variable in the conditional mean. Maximum likelihood estimates
Fungible Weights in Multiple Regression
Waller, Niels G.
2008-01-01
Every set of alternate weights (i.e., non-least-squares weights) in a multiple regression analysis with three or more predictors is associated with an infinite class of weights. All members of a given class can be deemed "fungible" because they yield identical SSE (sum of squared errors) and R² values. Equations for generating…
On Weighted Support Vector Regression
DEFF Research Database (Denmark)
Han, Xixuan; Clemmensen, Line Katrine Harder
2014-01-01
We propose a new type of weighted support vector regression (SVR), motivated by modeling local dependencies in time and space in prediction of house prices. The classic weights of the weighted SVR are added to the slack variables in the objective function (OF‐weights). This procedure directly...
PROBIT REGRESSION IN PREDICTION ANALYSIS
African Journals Online (AJOL)
Admin
2008-12-12
Dec 12, 2008 ... GLOBAL JOURNAL OF MATHEMATICAL SCIENCES VOL. ... INTRODUCTION. For some dichotomous variables, the response y is actually a proxy for a variable that is continuous (Newsom, 2005). A regression ... M. E. Nja, Dept. of Mathematics / Statistics Cross River University of Technology, Calabar ...
Ridge Regression for Interactive Models.
Tate, Richard L.
1988-01-01
An exploratory study of the value of ridge regression for interactive models is reported. Assuming that the linear terms in a simple interactive model are centered to eliminate non-essential multicollinearity, a variety of common models, representing both ordinal and disordinal interactions, are shown to have "orientations" that are…
Directory of Open Access Journals (Sweden)
Raftery Adrian E
2009-02-01
Full Text Available Abstract Background Microarray technology is increasingly used to identify potential biomarkers for cancer prognostics and diagnostics. Previously, we have developed the iterative Bayesian Model Averaging (BMA) algorithm for use in classification. Here, we extend the iterative BMA algorithm for application to survival analysis on high-dimensional microarray data. The main goal in applying survival analysis to microarray data is to determine a highly predictive model of patients' time to event (such as death, relapse, or metastasis) using a small number of selected genes. Our multivariate procedure combines the effectiveness of multiple contending models by calculating the weighted average of their posterior probability distributions. Our results demonstrate that our iterative BMA algorithm for survival analysis achieves high prediction accuracy while consistently selecting a small and cost-effective number of predictor genes. Results We applied the iterative BMA algorithm to two cancer datasets: breast cancer and diffuse large B-cell lymphoma (DLBCL) data. On the breast cancer data, the algorithm selected a total of 15 predictor genes across 84 contending models from the training data. The maximum likelihood estimates of the selected genes and the posterior probabilities of the selected models from the training data were used to divide patients in the test (or validation) dataset into high- and low-risk categories. Using the genes and models determined from the training data, we assigned patients from the test data into highly distinct risk groups (as indicated by a p-value of 7.26e-05 from the log-rank test). Moreover, we achieved comparable results using only the 5 top selected genes with 100% posterior probabilities. On the DLBCL data, our iterative BMA procedure selected a total of 25 genes across 3 contending models from the training data. Once again, we assigned the patients in the validation set to significantly distinct risk groups (p
Logistic regression: a brief primer.
Stoltzfus, Jill C
2011-10-01
Regression techniques are versatile in their application to medical research because they can measure associations, predict outcomes, and control for confounding variable effects. As one such technique, logistic regression is an efficient and powerful way to analyze the effect of a group of independent variables on a binary outcome by quantifying each independent variable's unique contribution. Using components of linear regression reflected in the logit scale, logistic regression iteratively identifies the strongest linear combination of variables with the greatest probability of detecting the observed outcome. Important considerations when conducting logistic regression include selecting independent variables, ensuring that relevant assumptions are met, and choosing an appropriate model building strategy. For independent variable selection, one should be guided by such factors as accepted theory, previous empirical investigations, clinical considerations, and univariate statistical analyses, with acknowledgement of potential confounding variables that should be accounted for. Basic assumptions that must be met for logistic regression include independence of errors, linearity in the logit for continuous variables, absence of multicollinearity, and lack of strongly influential outliers. Additionally, there should be an adequate number of events per independent variable to avoid an overfit model, with commonly recommended minimum "rules of thumb" ranging from 10 to 20 events per covariate. Regarding model building strategies, the three general types are direct/standard, sequential/hierarchical, and stepwise/statistical, with each having a different emphasis and purpose. Before reaching definitive conclusions from the results of any of these methods, one should formally quantify the model's internal validity (i.e., replicability within the same data set) and external validity (i.e., generalizability beyond the current sample). The resulting logistic regression model
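The events-per-variable rule of thumb quoted above is easy to check before fitting; a small sketch (using the common convention that "events" means the rarer of the two outcome classes):

```python
def events_per_variable(y, n_predictors):
    """Events per variable (EPV): count of the rarer binary outcome
    class divided by the number of candidate covariates."""
    events = min(sum(y), len(y) - sum(y))
    return events / n_predictors
```

With, say, 40 events among 200 observations and 4 candidate covariates, EPV is 10, at the low end of the 10-20 guideline mentioned in the abstract.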
Hierarchical Adaptive Regression Kernels for Regression with Functional Predictors.
Woodard, Dawn B; Crainiceanu, Ciprian; Ruppert, David
2013-01-01
We propose a new method for regression using a parsimonious and scientifically interpretable representation of functional predictors. Our approach is designed for data that exhibit features such as spikes, dips, and plateaus whose frequency, location, size, and shape varies stochastically across subjects. We propose Bayesian inference of the joint functional and exposure models, and give a method for efficient computation. We contrast our approach with existing state-of-the-art methods for regression with functional predictors, and show that our method is more effective and efficient for data that include features occurring at varying locations. We apply our methodology to a large and complex dataset from the Sleep Heart Health Study, to quantify the association between sleep characteristics and health outcomes. Software and technical appendices are provided in online supplemental materials.
Boosting feature selection for Neural Network based regression.
Bailly, Kevin; Milgram, Maurice
2009-01-01
The head pose estimation problem is well known to be a challenging task in computer vision and is a useful tool for several applications involving human-computer interaction. This problem can be stated as a regression problem whose input is an image and whose output is the pan and tilt angles. Finding the optimal regression is hard because of the high dimensionality of the input (number of image pixels) and the large variety of morphologies and illumination conditions. We propose a new method combining a boosting strategy for feature selection and a neural network for the regression. Potential features are a very large set of Haar-like wavelets, which are well known to be adapted to face image processing. To achieve the feature selection, a new Fuzzy Functional Criterion (FFC) is introduced which is able to evaluate the link between a feature and the output without any estimation of the joint probability density function, as in Mutual Information. The boosting strategy uses this criterion at each step: features are evaluated by the FFC using weights on examples computed from the error produced by the neural network trained at the previous step. Tests are carried out on the commonly used Pointing 04 database and compared with three state-of-the-art methods. We also evaluate the accuracy of the estimation on FacePix, a database with a high angular resolution. Our method compares favorably with a Convolutional Neural Network, which is well known to incorporate feature extraction in its first layers.
Active set support vector regression.
Musicant, David R; Feinberg, Alexander
2004-03-01
This paper presents active set support vector regression (ASVR), a new active set strategy to solve a straightforward reformulation of the standard support vector regression problem. This new algorithm is based on the successful ASVM algorithm for classification problems, and consists of solving a finite number of linear equations with a typically large dimensionality equal to the number of points to be approximated. However, by making use of the Sherman-Morrison-Woodbury formula, a much smaller matrix of the order of the original input space is inverted at each step. The algorithm requires no specialized quadratic or linear programming code, but merely a linear equation solver which is publicly available. ASVR is extremely fast, produces comparable generalization error to other popular algorithms, and is available on the web for download.
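The Sherman-Morrison-Woodbury step is the heart of the speed-up described above: inverting a matrix of the form (I/ν + H Hᵀ) for an m×k matrix H with m ≫ k needs only a k×k solve. A generic numerical sketch of that identity (not the ASVR implementation; ν is an assumed positive scalar):

```python
import numpy as np

def smw_inverse(nu, H):
    """Inverse of (I_m / nu + H H^T) via Sherman-Morrison-Woodbury;
    the only dense solve is k x k instead of m x m."""
    m, k = H.shape
    small = np.eye(k) + nu * (H.T @ H)                       # k x k system
    return nu * np.eye(m) - (nu ** 2) * (H @ np.linalg.solve(small, H.T))
```

When m is the number of training points and k the original input dimension, this is exactly the kind of saving the abstract refers to: the cost is driven by k, not by m.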
Binary data regression: Weibull distribution
Caron, Renault; Polpo, Adriano
2009-12-01
The problem of estimation in binary response data has received a great number of alternative statistical solutions. Generalized linear models allow for a wide range of statistical models for regression data. The most widely used model is logistic regression; see Hosmer et al. [6]. However, as Chen et al. [5] mention, when the probability of a given binary response approaches 0 at a different rate than it approaches 1, symmetric links are inappropriate. A class of models based on the Weibull distribution, indexed by three parameters, is introduced here. Maximum likelihood methods are employed to estimate the parameters. The objective of the present paper is to show a solution for the estimation problem under the Weibull model. An example showing the quality of the model is illustrated by comparing it with the alternative probit and logit models.
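As a sketch of an asymmetric alternative to logit and probit, here is direct maximum likelihood for the complementary log-log link, the simplest extreme-value (Weibull-type) link; the paper's three-parameter family generalizes this, and the fitting code is illustrative, not the authors':

```python
import numpy as np
from scipy.optimize import minimize

def cloglog_nll(beta, X, y):
    # P(y=1|x) = 1 - exp(-exp(X beta)): approaches 0 and 1 at different rates
    p = 1.0 - np.exp(-np.exp(X @ beta))
    p = np.clip(p, 1e-12, 1.0 - 1e-12)       # guard the log-likelihood
    return -np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

def fit_cloglog(X, y):
    """Maximum likelihood fit of the cloglog binary regression."""
    res = minimize(cloglog_nll, np.zeros(X.shape[1]),
                   args=(X, y), method="Nelder-Mead")
    return res.x
```

Simulating from the cloglog model and refitting recovers the coefficients, which a symmetric link would distort.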
Spontaneous regression of colon cancer.
Kihara, Kyoichi; Fujita, Shin; Ohshiro, Taihei; Yamamoto, Seiichiro; Sekine, Shigeki
2015-01-01
A case of spontaneous regression of transverse colon cancer is reported. A 64-year-old man was diagnosed as having cancer of the transverse colon at a local hospital. Initial and second colonoscopy examinations revealed a typical cancer of the transverse colon, which was diagnosed as moderately differentiated adenocarcinoma. The patient underwent right hemicolectomy 6 weeks after the initial colonoscopy. The resected specimen showed only a scar at the tumor site, and no cancerous tissue was proven histologically. The patient is alive with no evidence of recurrence 1 year after surgery. Although an antitumor immune response is the most likely explanation, the exact nature of the phenomenon was unclear. We describe this rare case and review the literature pertaining to spontaneous regression of colorectal cancer. © The Author 2014. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com.
Polynomial Regressions and Nonsense Inference
Directory of Open Access Journals (Sweden)
Daniel Ventosa-Santaulària
2013-11-01
Full Text Available Polynomial specifications are widely used, not only in applied economics, but also in epidemiology, physics, political analysis and psychology, just to mention a few examples. In many cases, the data employed to estimate such specifications are time series that may exhibit stochastic nonstationary behavior. We extend Phillips' results (Phillips, P. Understanding spurious regressions in econometrics. J. Econom. 1986, 33, 311–340) by proving that an inference drawn from polynomial specifications, under stochastic nonstationarity, is misleading unless the variables cointegrate. We use a generalized polynomial specification as a vehicle to study its asymptotic and finite-sample properties. Our results, therefore, lead to a call to be cautious whenever practitioners estimate polynomial regressions.
Quantile Regression With Measurement Error
Wei, Ying
2009-08-27
Regression quantiles can be substantially biased when the covariates are measured with error. In this paper we propose a new method that produces consistent linear quantile estimation in the presence of covariate measurement error. The method corrects the measurement error induced bias by constructing joint estimating equations that simultaneously hold for all the quantile levels. An iterative EM-type estimation algorithm to obtain the solutions to such joint estimation equations is provided. The finite sample performance of the proposed method is investigated in a simulation study, and compared to the standard regression calibration approach. Finally, we apply our methodology to part of the National Collaborative Perinatal Project growth data, a longitudinal study with an unusual measurement error structure. © 2009 American Statistical Association.
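The bias being corrected is easy to reproduce: fit median (τ = 0.5) regression by minimizing the check loss, once on the true covariate and once on an error-contaminated copy, and the second slope is attenuated toward zero (a demonstration of the problem, not of the paper's joint-estimating-equation correction):

```python
import numpy as np
from scipy.optimize import minimize

def median_regression(x, y):
    """Median (tau = 0.5) regression: minimize the check loss,
    which at tau = 0.5 is half the absolute residual."""
    def check_loss(b):
        r = y - b[0] - b[1] * x
        return 0.5 * np.sum(np.abs(r))
    return minimize(check_loss, np.zeros(2), method="Nelder-Mead").x
```

With classical measurement error whose variance equals that of the covariate, the naive slope shrinks toward roughly half its true value, which is the bias the proposed estimating equations undo.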
Directional quantile regression in R
Czech Academy of Sciences Publication Activity Database
Boček, Pavel; Šiman, Miroslav
2017-01-01
Roč. 53, č. 3 (2017), s. 480-492 ISSN 0023-5954 R&D Projects: GA ČR GA14-07234S Institutional support: RVO:67985556 Keywords: multivariate quantile * regression quantile * halfspace depth * depth contour Subject RIV: BD - Theory of Information Impact factor: 0.379, year: 2016 http://library.utia.cas.cz/separaty/2017/SI/bocek-0476587.pdf
QUANTILE CALCULUS AND CENSORED REGRESSION.
Huang, Yijian
2010-06-01
Quantile regression has been advocated in survival analysis to assess evolving covariate effects. However, challenges arise when the censoring time is not always observed and may be covariate-dependent, particularly in the presence of continuously-distributed covariates. In spite of several recent advances, existing methods either involve algorithmic complications or impose a probability grid. The former leads to difficulties in the implementation and asymptotics, whereas the latter introduces undesirable grid dependence. To resolve these issues, we develop fundamental and general quantile calculus on cumulative probability scale in this article, upon recognizing that probability and time scales do not always have a one-to-one mapping given a survival distribution. These results give rise to a novel estimation procedure for censored quantile regression, based on estimating integral equations. A numerically reliable and efficient Progressive Localized Minimization (PLMIN) algorithm is proposed for the computation. This procedure reduces exactly to the Kaplan-Meier method in the k-sample problem, and to standard uncensored quantile regression in the absence of censoring. Under regularity conditions, the proposed quantile coefficient estimator is uniformly consistent and converges weakly to a Gaussian process. Simulations show good statistical and algorithmic performance. The proposal is illustrated in the application to a clinical study.
Gaussian Process Regression Model in Spatial Logistic Regression
Sofro, A.; Oktaviarina, A.
2018-01-01
Spatial analysis has developed very quickly in the last decade. One of the favorite approaches is based on the neighbourhood of the region. Unfortunately, there are some limitations such as difficulty in prediction. Therefore, we offer Gaussian process regression (GPR) to accommodate the issue. In this paper, we will focus on spatial modeling with GPR for binomial data with logit link function. The performance of the model will be investigated. We will discuss the inference of how to estimate the parameters and hyper-parameters and to predict as well. Furthermore, simulation studies will be explained in the last section.
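The GPR predictive mean that makes prediction straightforward is a one-liner once the kernel matrices are built; a minimal RBF-kernel regression sketch (the paper's binomial-logit extension replaces the Gaussian likelihood, so this only illustrates the underlying machinery):

```python
import numpy as np

def rbf_kernel(A, B, length_scale=0.2):
    """Squared-exponential kernel between two point sets."""
    d2 = (np.sum(A**2, axis=1)[:, None]
          + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T)
    return np.exp(-0.5 * d2 / length_scale**2)

def gp_predict_mean(X, y, X_new, noise=1e-4):
    """GP posterior mean: K(X*, X) [K(X, X) + noise * I]^{-1} y."""
    K = rbf_kernel(X, X) + noise * np.eye(len(X))
    return rbf_kernel(X_new, X) @ np.linalg.solve(K, y)
```

With a small noise term the posterior mean nearly interpolates the training responses, which is why GPR sidesteps the prediction difficulty of neighbourhood-based spatial models.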
Motivation: Molecular pathways and networks play a key role in basic and disease biology. An emerging notion is that networks encoding patterns of molecular interplay may themselves differ between contexts, such as cell type, tissue or disease (sub)type. However, while statistical testing of differences in mean expression levels has been extensively studied, testing of network differences remains challenging.
Application of Bayesian logistic regression to mining biomedical data.
Avali, Viji R; Cooper, Gregory F; Gopalakrishnan, Vanathi
2014-01-01
Mining high dimensional biomedical data with existing classifiers is challenging and the predictions are often inaccurate. We investigated the use of Bayesian Logistic Regression (B-LR) for mining such data to predict and classify various disease conditions. The analysis was done on twelve biomedical datasets with binary class variables and the performance of B-LR was compared to those from other popular classifiers on these datasets with 10-fold cross validation using the WEKA data mining toolkit. The statistical significance of the results was analyzed by paired two tailed t-tests and non-parametric Wilcoxon signed-rank tests. We observed overall that B-LR with non-informative Gaussian priors performed on par with other classifiers in terms of accuracy, balanced accuracy and AUC. These results suggest that it is worthwhile to explore the application of B-LR to predictive modeling tasks in bioinformatics using informative biological prior probabilities. With informative prior probabilities, we conjecture that the performance of B-LR will improve.
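B-LR with a zero-mean Gaussian prior on the weights is equivalent to L2-regularized logistic regression, and the MAP estimate can be found by gradient descent on the negative log-posterior. A minimal sketch on synthetic data (the prior variance, learning rate, and data are illustrative, not values from the paper):

```python
import numpy as np

def blr_map(X, y, prior_var=1.0, lr=0.1, iters=2000):
    """MAP weights for logistic regression under a zero-mean Gaussian prior
    (equivalent to L2-regularized logistic regression)."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))
        grad = X.T @ (p - y) + w / prior_var  # gradient of neg. log-posterior
        w -= lr * grad / len(y)
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] - 0.5 * X[:, 1] > 0).astype(float)
w = blr_map(X, y, prior_var=10.0)
acc = ((X @ w > 0) == (y == 1)).mean()
```

An informative prior would simply replace the zero-mean shrinkage term with one centred on biologically motivated weights.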
Mixed kernel function support vector regression for global sensitivity analysis
Cheng, Kai; Lu, Zhenzhou; Wei, Yuhao; Shi, Yan; Zhou, Yicheng
2017-11-01
Global sensitivity analysis (GSA) plays an important role in exploring the respective effects of input variables on an assigned output response. Amongst the many sensitivity analyses in the literature, the Sobol indices have attracted much attention since they can provide accurate information for most models. In this paper, a mixed kernel function (MKF) based support vector regression (SVR) model is employed to evaluate the Sobol indices at low computational cost. By the proposed derivation, the estimates of the Sobol indices can be obtained by post-processing the coefficients of the SVR meta-model. The MKF combines the orthogonal polynomial kernel function and the Gaussian radial basis kernel function, so it possesses both the global characteristic advantage of the polynomial kernel and the local characteristic advantage of the Gaussian radial basis kernel. The proposed approach is suitable for high-dimensional and non-linear problems. Performance of the proposed approach is validated on various analytical functions and compared with the popular polynomial chaos expansion (PCE). Results demonstrate that the proposed approach is an efficient method for global sensitivity analysis.
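The mixed kernel idea, a convex combination of a polynomial kernel (global trend) and a Gaussian RBF kernel (local detail), can be illustrated with kernel ridge regression as a simple stand-in for SVR; the paper's post-processing of coefficients into Sobol indices is omitted here, and all settings are illustrative:

```python
import numpy as np

def mixed_kernel(A, B, degree=3, ls=0.5, lam=0.5):
    """Convex combination of a polynomial kernel (global trend) and a
    Gaussian RBF kernel (local detail)."""
    poly = (A @ B.T + 1.0) ** degree
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    rbf = np.exp(-0.5 * d2 / ls**2)
    return lam * poly + (1.0 - lam) * rbf

def krr_fit_predict(X, y, Xs, reg=1e-3):
    # kernel ridge regression as a hedged stand-in for the SVR meta-model
    K = mixed_kernel(X, X) + reg * np.eye(len(X))
    return mixed_kernel(Xs, X) @ np.linalg.solve(K, y)

X = np.linspace(-1, 1, 40)[:, None]
y = np.sin(np.pi * X[:, 0])        # a smooth test response
err = np.abs(krr_fit_predict(X, y, X) - y).max()
```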
Producing The New Regressive Left
DEFF Research Database (Denmark)
Crone, Christine
to be a committed artist, and how that translates into supporting al-Assad's rule in Syria; the Ramadan programme Harrir Aqlak's attempt to relaunch an intellectual renaissance and to promote religious pluralism; and finally, al-Mayadeen's cooperation with the pan-Latin American TV station TeleSur and its ambitions... What becomes clear from the analytical chapters is the emergence of the new cross-ideological alliance of The New Regressive Left. This emerging coalition between Shia Muslims, religious minorities, parts of the Arab Left, secular cultural producers, and the remnants of the political, strategic resistance...
An Evaluation of Ridge Regression.
1981-12-01
of the parameter estimates, is a decreasing function of k. The idea of ridge regression, as suggested by Hoerl and Kennard (Ref 12:58-63), is to pick...
2012-01-01
adaptive elastic net all had similar accuracies but outperformed ridge regression and ridge regression BLUP in terms of the Pearson correlation between predicted GEBVs and the true genomic value as well as the root mean squared error. The performance of RR-BLUP was also somewhat better than that of ridge regression. This pattern was replicated by the Pearson correlation between predicted GEBVs and the true breeding values (TBV) and the root mean squared error calculated with respect to TBV, except that accuracy was lower for all models, especially for the adaptive elastic net. The correlation between the predicted GEBVs and simulated phenotypic values based on the fivefold CV also revealed a similar pattern, except that the adaptive elastic net had lower accuracy than both ridge regression methods. Conclusions: All six models had relatively high prediction accuracies for the simulated data set. Accuracy was higher for the lasso-type methods than for ridge regression and ridge regression BLUP. PMID:22640436
Varying-coefficient functional linear regression
Wu, Yichao; Fan, Jianqing; Müller, Hans-Georg
2010-01-01
Functional linear regression analysis aims to model regression relations which include a functional predictor. The analog of the regression parameter vector or matrix in conventional multivariate or multiple-response linear regression models is a regression parameter function in one or two arguments. If, in addition, one has scalar predictors, as is often the case in applications to longitudinal studies, the question arises how to incorporate these into a functional regression model. We study...
Semisupervised feature selection via spline regression for video semantic recognition.
Han, Yahong; Yang, Yi; Yan, Yan; Ma, Zhigang; Sebe, Nicu; Zhou, Xiaofang
2015-02-01
To improve both the efficiency and accuracy of video semantic recognition, we can perform feature selection on the extracted video features to select a subset of features from the high-dimensional feature set for a compact and accurate video data representation. Provided the number of labeled videos is small, supervised feature selection could fail to identify the relevant features that are discriminative to target classes. In many applications, abundant unlabeled videos are easily accessible. This motivates us to develop semisupervised feature selection algorithms to better identify the relevant video features, which are discriminative to target classes, by effectively exploiting the information underlying the huge amount of unlabeled video data. In this paper, we propose a framework of video semantic recognition by semisupervised feature selection via spline regression (S²FS²R). Two scatter matrices are combined to capture both the discriminative information and the local geometry structure of labeled and unlabeled training videos: a within-class scatter matrix encoding discriminative information of labeled training videos and a spline scatter output from a local spline regression encoding data distribution. An l2,1-norm is imposed as a regularization term on the transformation matrix to ensure it is sparse in rows, making it particularly suitable for feature selection. To efficiently solve S²FS²R, we develop an iterative algorithm and prove its convergence. In the experiments, three typical tasks of video semantic recognition, namely video concept detection, video classification, and human action recognition, are used to demonstrate that the proposed S²FS²R achieves better performance compared with the state-of-the-art methods.
Hubbard, Alan; Munoz, Ivan Diaz; Decker, Anna; Holcomb, John B; Schreiber, Martin A; Bulger, Eileen M; Brasel, Karen J; Fox, Erin E; del Junco, Deborah J; Wade, Charles E; Rahbar, Mohammad H; Cotton, Bryan A; Phelan, Herb A; Myers, John G; Alarcon, Louis H; Muskat, Peter; Cohen, Mitchell J
2013-07-01
Prediction of outcome after injury is fraught with uncertainty and statistically beset by misspecified models. Single-time-point regression only gives prediction and inference at one time, of dubious value for continuous prediction of ongoing bleeding. New statistical machine learning techniques such as SuperLearner (SL) exist to make superior predictions at iterative time points while evaluating the changing relative importance of each measured variable on an outcome. This can then provide continuously changing prediction of outcome and evaluation of which clinical variables likely drive a particular outcome. PROMMTT data were evaluated using both naive (standard stepwise logistic regression) and SL techniques to develop a time-dependent prediction of future mortality within discrete time intervals. We avoided both underfitting and overfitting using cross-validation to select an optimal combination of predictors among candidate predictors/machine learning algorithms. SL was also used to produce interval-specific robust variable importance measures (VIM), resulting in an ordered list of variables, by time point, that have the strongest impact on future mortality. Nine hundred eighty patients had complete clinical and outcome data and were included in the analysis. The prediction of ongoing transfusion with SL was superior to the naive approach for all time intervals (correlations of cross-validated predictions with the outcome were 0.819, 0.789, 0.792 for time intervals 30-90, 90-180, 180-360, >360 minutes). The estimated VIM of mortality also changed significantly at each time point. The SL technique for prediction of outcome from a complex dynamic multivariate data set is superior at each time interval to standard models. In addition, the SL VIM at each time point provides insight into the time-specific drivers of future outcome, patient trajectory, and targets for clinical intervention. Thus, this automated approach mimics clinical practice, changing...
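The core SuperLearner idea, scoring candidate learners by cross-validated out-of-fold predictions and combining them, can be sketched as follows. The two learners and the grid search over a single convex weight are illustrative simplifications, not the PROMMTT analysis's actual library of algorithms:

```python
import numpy as np

def cv_stack(X, y, learners, k=5, seed=0):
    """Minimal SuperLearner-style stack for two candidate learners: collect
    K-fold out-of-fold predictions, then grid-search the convex weight that
    minimizes cross-validated squared error."""
    n = len(y)
    folds = np.array_split(np.random.default_rng(seed).permutation(n), k)
    Z = np.zeros((n, len(learners)))
    for j, fit in enumerate(learners):
        for idx in folds:
            tr = np.setdiff1d(np.arange(n), idx)
            Z[idx, j] = fit(X[tr], y[tr], X[idx])
    best = min(np.linspace(0, 1, 21),
               key=lambda w: ((w * Z[:, 0] + (1 - w) * Z[:, 1] - y) ** 2).mean())
    cv_err = ((best * Z[:, 0] + (1 - best) * Z[:, 1] - y) ** 2).mean()
    return best, cv_err

# two hypothetical candidate learners: a global mean and a 1-nearest-neighbour
def mean_learner(Xtr, ytr, Xte):
    return np.full(len(Xte), ytr.mean())

def nn_learner(Xtr, ytr, Xte):
    return ytr[np.abs(Xte[:, None] - Xtr[None, :]).argmin(axis=1)]

rng = np.random.default_rng(1)
X = rng.uniform(0, 1, 100)
y = np.sin(2 * np.pi * X) + 0.1 * rng.normal(size=100)
w, err = cv_stack(X, y, [mean_learner, nn_learner])
```

The out-of-fold matrix Z is what lets the stack assign weight honestly: each learner is always scored on data it was not fitted to.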
Nonparametric Regression with Common Shocks
Directory of Open Access Journals (Sweden)
Eduardo A. Souza-Rodrigues
2016-09-01
Full Text Available This paper considers a nonparametric regression model for cross-sectional data in the presence of common shocks. Common shocks are allowed to be very general in nature; they do not need to be finite dimensional with a known (small number of factors. I investigate the properties of the Nadaraya-Watson kernel estimator and determine how general the common shocks can be while still obtaining meaningful kernel estimates. Restrictions on the common shocks are necessary because kernel estimators typically manipulate conditional densities, and conditional densities do not necessarily exist in the present case. By appealing to disintegration theory, I provide sufficient conditions for the existence of such conditional densities and show that the estimator converges in probability to the Kolmogorov conditional expectation given the sigma-field generated by the common shocks. I also establish the rate of convergence and the asymptotic distribution of the kernel estimator.
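The Nadaraya-Watson estimator studied above is a kernel-weighted local average. A minimal one-dimensional sketch without common shocks (the paper's setting is considerably more general):

```python
import numpy as np

def nadaraya_watson(x0, X, y, h=0.1):
    """Nadaraya-Watson estimate: kernel-weighted average of y near each x0."""
    w = np.exp(-0.5 * ((x0[:, None] - X[None, :]) / h) ** 2)
    return (w * y).sum(axis=1) / w.sum(axis=1)

rng = np.random.default_rng(2)
X = rng.uniform(0, 1, 500)
y = X**2 + 0.05 * rng.normal(size=500)
est = nadaraya_watson(np.array([0.5]), X, y, h=0.05)
```

With 500 points and a bandwidth of 0.05, the estimate at x = 0.5 should land close to the true regression value 0.25.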
Practical Session: Multiple Linear Regression
Clausel, M.; Grégoire, G.
2014-12-01
Three exercises are proposed to illustrate multiple linear regression. The first investigates the influence of several factors on atmospheric pollution. It has been proposed by D. Chessel and A.B. Dufour in Lyon 1 (see Sect. 6 of http://pbil.univ-lyon1.fr/R/pdf/tdr33.pdf) and is based on data from 20 U.S. cities. Exercise 2 is an introduction to model selection, whereas Exercise 3 provides a first example of analysis of variance. Exercises 2 and 3 have been proposed by A. Dalalyan at ENPC (see Exercises 2 and 3 of http://certis.enpc.fr/~dalalyan/Download/TP_ENPC_5.pdf).
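In the spirit of the first exercise, here is a multiple regression on a synthetic stand-in for the pollution data (the real exercise uses measurements from 20 U.S. cities; the variable names and coefficients below are invented for illustration):

```python
import numpy as np

# synthetic stand-in: regress an SO2-like pollution index on temperature
# and population for 20 "cities" (all values invented for illustration)
rng = np.random.default_rng(3)
n = 20
temp = rng.uniform(40, 80, n)
pop = rng.uniform(100, 3000, n)
so2 = 50 - 0.4 * temp + 0.01 * pop + rng.normal(0, 2, n)

X = np.column_stack([np.ones(n), temp, pop])   # design matrix with intercept
beta, *_ = np.linalg.lstsq(X, so2, rcond=None)
r2 = 1 - ((so2 - X @ beta) ** 2).sum() / ((so2 - so2.mean()) ** 2).sum()
```

The fitted coefficients recover the generating slopes up to sampling noise, and r2 summarizes the fit exactly as R's `summary(lm(...))` would.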
Kernel Multitask Regression for Toxicogenetics.
Bernard, Elsa; Jiao, Yunlong; Scornet, Erwan; Stoven, Veronique; Walter, Thomas; Vert, Jean-Philippe
2017-10-01
The development of high-throughput in vitro assays to study quantitatively the toxicity of chemical compounds on genetically characterized human-derived cell lines paves the way to predictive toxicogenetics, where one would be able to predict the toxicity of any particular compound on any particular individual. In this paper we present a machine learning-based approach for that purpose, kernel multitask regression (KMR), which combines chemical characterizations of molecular compounds with genetic and transcriptomic characterizations of cell lines to predict the toxicity of a given compound on a given cell line. We demonstrate the relevance of the method on the recent DREAM8 Toxicogenetics challenge, where it ranked among the best state-of-the-art models, and discuss the importance of choosing good descriptors for cell lines and chemicals. © 2017 Wiley-VCH Verlag GmbH & Co. KGaA, Weinheim.
Lumbar herniated disc: spontaneous regression.
Altun, Idiris; Yüksel, Kasım Zafer
2017-01-01
Low back pain is a frequent condition that results in substantial disability and causes admission of patients to neurosurgery clinics. To evaluate and present the therapeutic outcomes in lumbar disc hernia (LDH) patients treated by means of a conservative approach, consisting of bed rest and medical therapy. This retrospective cohort was carried out in the neurosurgery departments of hospitals in Kahramanmaraş city and 23 patients diagnosed with LDH at the levels of L3-L4, L4-L5 or L5-S1 were enrolled. The average age was 38.4 ± 8.0 years and the chief complaint was low back pain and sciatica radiating to one or both lower extremities. Conservative treatment was administered. Neurological examination findings, durations of treatment and intervals until symptomatic recovery were recorded. Lasègue tests and neurosensory examination revealed that mild neurological deficits existed in 16 of our patients. Previously, 5 patients had received physiotherapy and 7 patients had been on medical treatment. The numbers of patients with LDH at the levels of L3-L4, L4-L5, and L5-S1 were 1, 13, and 9, respectively. All patients reported that they benefited from medical treatment and bed rest, and radiologic improvement was observed simultaneously on MRI scans. The average duration until symptomatic recovery and/or regression of LDH symptoms was 13.6 ± 5.4 months (range: 5-22). It should be kept in mind that lumbar disc hernias can regress with medical treatment and rest without surgery, and there should be an awareness that these patients can recover radiologically. This condition must be taken into account during decision making for surgical intervention in LDH patients devoid of indications for emergent surgery.
Inconsistency Between Univariate and Multiple Logistic Regressions
WANG, HONGYUE; Peng, Jing; Wang, Bokai; Lu, Xiang; ZHENG, Julia Z.; Wang, Kejia; Tu, Xin M.; Feng, Changyong
2017-01-01
Summary Logistic regression is a popular statistical method in studying the effects of covariates on binary outcomes. It has been widely used in both clinical trials and observational studies. However, the results from the univariate regression and from the multiple logistic regression tend to be conflicting. A covariate may show very strong effect on the outcome in the multiple regression but not in the univariate regression, and vice versa. These facts have not been well appreciated in biom...
Insulin resistance: regression and clustering.
Directory of Open Access Journals (Sweden)
Sangho Yoon
Full Text Available In this paper we try to define insulin resistance (IR precisely for a group of Chinese women. Our definition deliberately does not depend upon body mass index (BMI or age, although in other studies, with particular random effects models quite different from models used here, BMI accounts for a large part of the variability in IR. We accomplish our goal through application of Gauss mixture vector quantization (GMVQ, a technique for clustering that was developed for application to lossy data compression. Defining data come from measurements that play major roles in medical practice. A precise statement of what the data are is in Section 1. Their family structures are described in detail. They concern levels of lipids and the results of an oral glucose tolerance test (OGTT. We apply GMVQ to residuals obtained from regressions of outcomes of an OGTT and lipids on functions of age and BMI that are inferred from the data. A bootstrap procedure developed for our family data supplemented by insights from other approaches leads us to believe that two clusters are appropriate for defining IR precisely. One cluster consists of women who are IR, and the other of women who seem not to be. Genes and other features are used to predict cluster membership. We argue that prediction with "main effects" is not satisfactory, but prediction that includes interactions may be.
Knowledge and Awareness: Linear Regression
Directory of Open Access Journals (Sweden)
Monika Raghuvanshi
2016-12-01
Full Text Available Knowledge and awareness are factors guiding the development of an individual. These may seem simple and practicable, but in reality a proper combination of the two is a complex task. An economically driven state of development in younger generations is an impediment to the correct manner of development. As youths are at the learning phase, they can be molded to follow a correct lifestyle. Awareness and knowledge are important components of any formal or informal environmental education. The purpose of this study is to evaluate the relationship of these components among students of secondary/senior secondary schools who have undergone a formal study of the environment in their curricula. A suitable instrument was developed in order to measure the elements of awareness and knowledge among the participants of the study. Data were collected from various secondary and senior secondary school students in the age group 14 to 20 years using a cluster sampling technique in the city of Bikaner, India. Linear regression analysis was performed using the IBM SPSS 23 statistical tool. There exists a weak relation between knowledge and awareness about environmental issues, caused by mishandling of routine practices; hence one component can be complemented by the other for improvement in both. Knowledge and awareness are crucial factors and can provide huge opportunities in any field. Resource utilization for economic solutions may pave the way for eco-friendly products and practices. If green practices are inculcated at the learning phase, they may become normal routine. This will also help in replenishment of the environment.
Estimating equivalence with quantile regression
Cade, B.S.
2011-01-01
Equivalence testing and corresponding confidence interval estimates are used to provide more enlightened statistical statements about parameter estimates by relating them to intervals of effect sizes deemed to be of scientific or practical importance rather than just to an effect size of zero. Equivalence tests and confidence interval estimates are based on a null hypothesis that a parameter estimate is either outside (inequivalence hypothesis) or inside (equivalence hypothesis) an equivalence region, depending on the question of interest and assignment of risk. The former approach, often referred to as bioequivalence testing, is often used in regulatory settings because it reverses the burden of proof compared to a standard test of significance, following a precautionary principle for environmental protection. Unfortunately, many applications of equivalence testing focus on establishing average equivalence by estimating differences in means of distributions that do not have homogeneous variances. I discuss how to compare equivalence across quantiles of distributions using confidence intervals on quantile regression estimates that detect differences in heterogeneous distributions missed by focusing on means. I used one-tailed confidence intervals based on inequivalence hypotheses in a two-group treatment-control design for estimating bioequivalence of arsenic concentrations in soils at an old ammunition testing site and bioequivalence of vegetation biomass at a reclaimed mining site. Two-tailed confidence intervals based both on inequivalence and equivalence hypotheses were used to examine quantile equivalence for negligible trends over time for a continuous exponential model of amphibian abundance. © 2011 by the Ecological Society of America.
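One hedged way to illustrate quantile equivalence testing is a bootstrap confidence interval for a difference in upper quantiles, checked against a pre-specified equivalence region. The region, sample sizes, and lognormal data below are hypothetical, and the paper's approach uses confidence intervals on quantile regression estimates rather than this plain two-sample bootstrap:

```python
import numpy as np

def quantile_diff_ci(a, b, q=0.9, n_boot=2000, alpha=0.1, seed=0):
    """Bootstrap percentile CI for the difference in the q-th quantile of two
    groups; equivalence is declared when the CI lies inside the region."""
    rng = np.random.default_rng(seed)
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        diffs[i] = (np.quantile(rng.choice(a, size=len(a)), q)
                    - np.quantile(rng.choice(b, size=len(b)), q))
    return np.quantile(diffs, [alpha / 2, 1 - alpha / 2])

rng = np.random.default_rng(4)
control = rng.lognormal(0.0, 0.5, 500)   # e.g. reference-area concentrations
site = rng.lognormal(0.0, 0.5, 500)      # remediated site, same law here
lo, hi = quantile_diff_ci(site, control, q=0.9)
equivalent = (-0.6 < lo) and (hi < 0.6)  # hypothetical equivalence region
```

Because the two groups are drawn from the same distribution, the interval for the 90th-percentile difference falls inside the region and equivalence is declared.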
Directory of Open Access Journals (Sweden)
Shouheng Tuo
Full Text Available Harmony Search (HS) and Teaching-Learning-Based Optimization (TLBO), as new swarm intelligent optimization algorithms, have received much attention in recent years. Both have shown outstanding performance in solving NP-hard optimization problems. However, they also suffer dramatic performance degradation on some complex high-dimensional optimization problems. Through extensive experiments, we find that HS and TLBO are strongly complementary to each other. HS has strong global exploration power but low convergence speed. Conversely, TLBO converges much faster but is easily trapped in local search. In this work, we propose a hybrid search algorithm named HSTLBO that merges the two algorithms for synergistically solving complex optimization problems using a self-adaptive selection strategy. In HSTLBO, both HS and TLBO are modified with the aim of balancing the global exploration and exploitation abilities, where HS aims mainly to explore the unknown regions and TLBO aims to rapidly exploit high-precision solutions in the known regions. Our experimental results demonstrate better performance and faster speed than five state-of-the-art HS variants, and show better exploration power than five good TLBO variants with similar run time, which illustrates that our method is promising for solving complex high-dimensional optimization problems. The experiment on portfolio optimization problems also demonstrates that HSTLBO is effective in solving complex real-world applications.
Travnik, Jaden B; Pilarski, Patrick M
2017-07-01
Prosthetic devices have advanced in their capabilities and in the number and type of sensors included in their design. As the space of sensorimotor data available to a conventional or machine learning prosthetic control system increases in dimensionality and complexity, it becomes increasingly important that this data be represented in a useful and computationally efficient way. Well structured sensory data allows prosthetic control systems to make informed, appropriate control decisions. In this study, we explore the impact that increased sensorimotor information has on current machine learning prosthetic control approaches. Specifically, we examine the effect that high-dimensional sensory data has on the computation time and prediction performance of a true-online temporal-difference learning prediction method as embedded within a resource-limited upper-limb prosthesis control system. We present results comparing tile coding, the dominant linear representation for real-time prosthetic machine learning, with a newly proposed modification to Kanerva coding that we call selective Kanerva coding. In addition to showing promising results for selective Kanerva coding, our results confirm potential limitations to tile coding as the number of sensory input dimensions increases. To our knowledge, this study is the first to explicitly examine representations for real-time machine learning prosthetic devices in general terms. This work therefore provides an important step towards forming an efficient prosthesis-eye view of the world, wherein prompt and accurate representations of high-dimensional data may be provided to machine learning control systems within artificial limbs and other assistive rehabilitation technologies.
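Tile coding, the baseline representation discussed above, maps a continuous input onto several offset binary grids, so that a linear learner can generalize locally. A minimal one-dimensional sketch (real prosthetic controllers tile many sensor dimensions jointly, which is exactly where the dimensionality problem the paper examines comes from):

```python
import numpy as np

def tile_code(x, n_tilings=8, n_tiles=10, lo=0.0, hi=1.0):
    """Binary features for a scalar input: each tiling is a grid offset by a
    fraction of a tile width; exactly one tile per tiling is active."""
    width = (hi - lo) / n_tiles
    phi = np.zeros(n_tilings * n_tiles)
    for t in range(n_tilings):
        offset = t * width / n_tilings
        idx = min(int((x - lo + offset) / width), n_tiles - 1)
        phi[t * n_tiles + idx] = 1.0
    return phi

phi = tile_code(0.37)   # a linear TD predictor is then simply weights @ phi
```

With d input dimensions the number of tiles grows exponentially, motivating the sparse Kanerva-style alternatives the paper proposes.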
Tuo, Shouheng; Yong, Longquan; Deng, Fang'an; Li, Yanhai; Lin, Yong; Lu, Qiuju
2017-01-01
Harmony Search (HS) and Teaching-Learning-Based Optimization (TLBO), as new swarm intelligent optimization algorithms, have received much attention in recent years. Both have shown outstanding performance in solving NP-hard optimization problems. However, they also suffer dramatic performance degradation on some complex high-dimensional optimization problems. Through extensive experiments, we find that HS and TLBO are strongly complementary to each other. HS has strong global exploration power but low convergence speed. Conversely, TLBO converges much faster but is easily trapped in local search. In this work, we propose a hybrid search algorithm named HSTLBO that merges the two algorithms for synergistically solving complex optimization problems using a self-adaptive selection strategy. In HSTLBO, both HS and TLBO are modified with the aim of balancing the global exploration and exploitation abilities, where HS aims mainly to explore the unknown regions and TLBO aims to rapidly exploit high-precision solutions in the known regions. Our experimental results demonstrate better performance and faster speed than five state-of-the-art HS variants, and show better exploration power than five good TLBO variants with similar run time, which illustrates that our method is promising for solving complex high-dimensional optimization problems. The experiment on portfolio optimization problems also demonstrates that HSTLBO is effective in solving complex real-world applications.
Ordinal regression by a generalized force-based model.
Fernandez-Navarro, Francisco; Riccardi, Annalisa; Carloni, Sante
2015-04-01
This paper introduces a new instance-based algorithm for multiclass classification problems where the classes have a natural order. The proposed algorithm extends the state-of-the-art gravitational models by generalizing the scaling behavior of the class-pattern interaction force. Like the other gravitational models, the proposed algorithm classifies new patterns by comparing the magnitude of the force that each class exerts on a given pattern. To address ordinal problems, the algorithm assumes that, given a pattern, the forces associated with each class follow a unimodal distribution. For this reason, a weight matrix that allows modification of the metric in the attribute space and a vector of parameters that allows modification of the force law for each class have been introduced in the model definition. Furthermore, a probabilistic formulation of the error function allows the estimation of the model parameters using global and local optimization procedures toward minimization of the errors and penalization of non-unimodal outputs. One of the strengths of the model is its competitive degree of interpretability, which is a requisite in most real applications. The proposed algorithm is compared to other well-known ordinal regression algorithms on discretized regression datasets and real ordinal regression datasets. Experimental results demonstrate that the proposed algorithm achieves competitive generalization performance, and it is validated using nonparametric statistical tests.
Hybrid foundry patterns of bevel gears
Directory of Open Access Journals (Sweden)
Budzik G.
2007-01-01
Full Text Available Possibilities of making hybrid foundry patterns of bevel gears for the investment casting process are presented. Rapid prototyping of gears with complex tooth forms is possible with the use of modern methods. One such method is stereolithography, where a pattern is obtained as a result of resin curing with a laser beam. Patterns of that type are applicable in precision casting. Removing a stereolithographic pattern from the foundry mould requires high temperatures. Resin burning would generate significant amounts of harmful gases. In the case of a solid stereolithographic pattern, the pressure created during gas burning may cause the mould to crack. A gas volume reduction may be achieved by using patterns with a honeycomb structure. However, this technique significantly worsens the dimensional and shape accuracy of stereolithographic patterns. In cooperation with WSK PZL Rzeszów, the Machine Design Department of Rzeszow University of Technology carried out research on the design of hybrid stereolithographic patterns. A hybrid pattern consists of a section made by the stereolithographic process and a section made of casting wax. The latter material is used for filling the stereolithographic pattern and for the mould gating system. The hybrid pattern process consists of two stages: wax melting and then the burn-out of the stereolithographic section. Use of hybrid patterns reduces the cost of producing stereolithographic patterns. High dimensional accuracy is preserved in this process.
Principal component regression analysis with SPSS.
Liu, R X; Kuang, J; Gong, Q; Hou, X L
2003-06-01
The paper introduces all indices of multicollinearity diagnosis, the basic principle of principal component regression, and the method for determining the 'best' equation. The paper uses an example to describe how to do principal component regression analysis with SPSS 10.0, including all calculating processes of the principal component regression and all operations of the linear regression, factor analysis, descriptives, compute variable, and bivariate correlations procedures in SPSS 10.0. Principal component regression analysis can be used to overcome the disturbance of multicollinearity. A simplified, faster, and accurate statistical analysis is achieved through principal component regression analysis with SPSS.
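Principal component regression can also be reproduced outside SPSS in a few lines: project the centred predictors onto the leading principal components, regress on the scores, and map the coefficients back. A sketch on deliberately collinear synthetic data (the data and number of retained components are illustrative):

```python
import numpy as np

def pcr_fit(X, y, k):
    """Principal component regression: regress y on the first k PCs of X and
    map the coefficients back to the original predictors."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = Xc @ Vt[:k].T
    gamma, *_ = np.linalg.lstsq(scores, y - y.mean(), rcond=None)
    beta = Vt[:k].T @ gamma
    return beta, y.mean() - X.mean(axis=0) @ beta

rng = np.random.default_rng(5)
z = rng.normal(size=100)
X = np.column_stack([z + 0.01 * rng.normal(size=100),  # two nearly collinear
                     z + 0.01 * rng.normal(size=100),  # columns
                     rng.normal(size=100)])
y = X @ np.array([1.0, 1.0, 0.5]) + 0.1 * rng.normal(size=100)
beta, b0 = pcr_fit(X, y, k=2)
r2 = 1 - ((y - X @ beta - b0) ** 2).sum() / ((y - y.mean()) ** 2).sum()
```

Dropping the smallest component discards the nearly degenerate direction that makes ordinary least squares unstable, while the fit remains essentially intact.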
A Spreadsheet Model for Teaching Regression Analysis.
Wood, William C.; O'Hare, Sharon L.
1992-01-01
Presents a spreadsheet model that is useful in introducing students to regression analysis and the computation of regression coefficients. Includes spreadsheet layouts and formulas so that the spreadsheet can be implemented. (Author)
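A spreadsheet's regression coefficients come from the classic column-sum formulas, which are easy to verify directly. The toy data below illustrate only the computation a spreadsheet layout would perform:

```python
# slope and intercept from column sums, exactly as a spreadsheet computes them:
# b1 = (n*Sxy - Sx*Sy) / (n*Sxx - Sx^2),  b0 = mean(y) - b1*mean(x)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 4.2, 5.9, 8.1, 9.9]
n = len(x)
sx, sy = sum(x), sum(y)
sxy = sum(a * b for a, b in zip(x, y))   # the spreadsheet's x*y column total
sxx = sum(a * a for a in x)              # the x^2 column total
b1 = (n * sxy - sx * sy) / (n * sxx - sx * sx)
b0 = sy / n - b1 * sx / n
```

Each intermediate quantity corresponds to one helper column in the spreadsheet layout, which is what makes the model useful for teaching.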
Complete regression of primary malignant melanoma.
Emanuel, Patrick O; Mannion, Meghan; Phelps, Robert G
2008-04-01
Over the years, histopathologic studies to determine the nature and significance of regression in malignant melanoma have yielded different results. At least in part, this most likely reflects differences in the definition of what constitutes regression. Although partial regression is relatively common, complete regression is rare. It has been said that complete regression of a primary lesion is associated with metastatic disease, but the evidence for this is largely anecdotal: the literature contains only case reports and small series. We found 2 cases of complete regression in our dermatopathology database. Metastatic disease was identified in both cases; in 1 case, the suspicion of melanoma was raised on the initial biopsy and subsequent workup revealed lymph node metastasis. These cases illustrate the histologic features of a completely regressed primary melanoma and add credence to the theory that completely regressed melanoma is associated with a poor outcome.
Unbalanced Regressions and the Predictive Equation
DEFF Research Database (Denmark)
Osterrieder, Daniela; Ventosa-Santaulària, Daniel; Vera-Valdés, J. Eduardo
Predictive return regressions with persistent regressors are typically plagued by (asymptotically) biased/inconsistent estimates of the slope, non-standard or potentially even spurious statistical inference, and regression unbalancedness. We alleviate the problem of unbalancedness in the theoreti...
Frequent Pattern Mining Algorithms for Data Clustering
DEFF Research Database (Denmark)
Zimek, Arthur; Assent, Ira; Vreeken, Jilles
2014-01-01
Discovering clusters in subspaces, or subspace clustering and related clustering paradigms, is a research field where we find many frequent pattern mining related influences. In fact, as the first algorithms for subspace clustering were based on frequent pattern mining algorithms, it is fair to say that frequent pattern mining was at the cradle of subspace clustering, yet it quickly developed into an independent research field. In this chapter, we discuss how frequent pattern mining algorithms have been extended and generalized towards the discovery of local clusters in high-dimensional data. In particular, we discuss several example algorithms for subspace clustering or projected clustering as well as point out recent research questions and open topics in this area relevant to researchers in either clustering or pattern mining.
Semiparametric regression during 2003–2007
Ruppert, David
2009-01-01
Semiparametric regression is a fusion between parametric regression and nonparametric regression that integrates low-rank penalized splines, mixed model and hierarchical Bayesian methodology – thus allowing more streamlined handling of longitudinal and spatial correlation. We review progress in the field over the five-year period between 2003 and 2007. We find semiparametric regression to be a vibrant field with substantial involvement and activity, continual enhancement and widespread application.
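The low-rank penalized-spline idea described above can be sketched in a few lines. The truncated-line basis, knot placement, and penalty value below are illustrative assumptions, not a reproduction of any specific method from the review:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 200))
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, 200)

# Low-rank spline: intercept + linear term + K truncated-line basis functions.
K = 20
knots = np.quantile(x, np.linspace(0, 1, K + 2)[1:-1])
Z = np.maximum(x[:, None] - knots[None, :], 0.0)     # (x - kappa_k)_+
X = np.column_stack([np.ones_like(x), x, Z])

# Ridge penalty on the spline coefficients only: the mixed-model view treats
# them as random effects, which is what links splines to longitudinal models.
lam = 1.0
D = np.diag([0.0, 0.0] + [1.0] * K)
beta = np.linalg.solve(X.T @ X + lam * D, X.T @ y)
fit = X @ beta
print(round(float(np.mean((fit - y) ** 2)), 3))
```

The penalty matrix leaves the intercept and slope unpenalized, shrinking only the knot coefficients, which is what keeps the fit low-rank and smooth.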
Deep ensemble learning of sparse regression models for brain disease diagnosis.
Suk, Heung-Il; Lee, Seong-Whan; Shen, Dinggang
2017-04-01
Recent studies on brain imaging analysis have highlighted the core role of machine learning techniques in computer-assisted intervention for brain disease diagnosis. Among these techniques, sparse regression models have proved effective in handling high-dimensional data with a small number of training samples, a situation common in medical problems. Meanwhile, deep learning methods have achieved great success by outperforming state-of-the-art methods in various applications. In this paper, we propose a novel framework that combines the two conceptually different methods of sparse regression and deep learning for Alzheimer's disease/mild cognitive impairment diagnosis and prognosis. Specifically, we first train multiple sparse regression models, each with a different value of the regularization control parameter. These models potentially select different feature subsets from the original feature set and thus have different powers to predict the response values, i.e., the clinical label and clinical scores in our work. By regarding the response values from the sparse regression models as target-level representations, we then build a deep convolutional neural network for clinical decision making, which we call the 'Deep Ensemble Sparse Regression Network.' To the best of our knowledge, this is the first work to combine sparse regression models with a deep neural network. In experiments on the ADNI cohort, we validated the effectiveness of the proposed method by achieving the highest diagnostic accuracies in three classification tasks. We also rigorously analyzed our results and compared them with previous studies on the ADNI cohort in the literature. Copyright © 2017 Elsevier B.V. All rights reserved.
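A minimal, shallow stand-in for the ensemble idea: several sparse (lasso) regressors trained at different regularization values, whose predictions form a target-level representation for a second-stage combiner. The ISTA solver, the synthetic data, and the linear combiner (replacing the paper's deep convolutional network) are all simplifying assumptions:

```python
import numpy as np

def lasso_ista(X, y, lam, n_iter=500):
    """Minimize 0.5*||y - Xw||^2 + lam*||w||_1 by proximal gradient (ISTA)."""
    w = np.zeros(X.shape[1])
    step = 1.0 / np.linalg.norm(X, 2) ** 2            # 1 / Lipschitz constant
    for _ in range(n_iter):
        w = w - step * (X.T @ (X @ w - y))            # gradient step
        w = np.sign(w) * np.maximum(np.abs(w) - step * lam, 0.0)  # soft-threshold
    return w

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 30))
true_w = np.zeros(30)
true_w[:3] = [2.0, -1.5, 1.0]                         # sparse ground truth
y = X @ true_w + rng.normal(0, 0.1, 100)

# One sparse model per regularization value; their predictions are the
# "target-level representation" fed to a second-stage combiner.
preds = np.column_stack([X @ lasso_ista(X, y, lam) for lam in (0.1, 1.0, 10.0)])
P = np.column_stack([np.ones(len(y)), preds])
combiner = np.linalg.lstsq(P, y, rcond=None)[0]       # stand-in for the deep net
ensemble_fit = P @ combiner
print(round(float(np.mean((ensemble_fit - y) ** 2)), 4))
```

Each regularization level yields a different selected feature subset, so the stacked predictions carry complementary information, which is the premise the deep combiner exploits in the paper.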
Regression with Sparse Approximations of Data
DEFF Research Database (Denmark)
Noorzad, Pardis; Sturm, Bob L.
2012-01-01
We propose sparse approximation weighted regression (SPARROW), a method for local estimation of the regression function that uses sparse approximation with a dictionary of measurements. SPARROW estimates the regression function at a point with a linear combination of a few regressands, selected and weighted based on the sparse approximation process. Our experimental results show the locally constant form of SPARROW performs competitively.
Regression Analysis by Example. 5th Edition
Chatterjee, Samprit; Hadi, Ali S.
2012-01-01
Regression analysis is a conceptually simple method for investigating relationships among variables. Carrying out a successful application of regression analysis, however, requires a balance of theoretical results, empirical rules, and subjective judgment. "Regression Analysis by Example, Fifth Edition" has been expanded and thoroughly…
Standards for Standardized Logistic Regression Coefficients
Menard, Scott
2011-01-01
Standardized coefficients in logistic regression analysis have the same utility as standardized coefficients in linear regression analysis. Although there has been no consensus on the best way to construct standardized logistic regression coefficients, there is now sufficient evidence to suggest a single best approach to the construction of a…
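One common partial standardization, multiplying each raw logistic slope by its predictor's standard deviation, can be illustrated as follows. The gradient-descent fitter and simulated data are assumptions for the sketch, and Menard's article discusses fuller standardizations that also rescale by the outcome:

```python
import numpy as np

def fit_logistic(X, y, lr=0.1, n_iter=3000):
    """Plain gradient-ascent logistic regression; column 0 is the intercept."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        w += lr * X.T @ (y - p) / len(y)
    return w

rng = np.random.default_rng(2)
x1 = rng.normal(0, 1, 500)            # predictor measured on a unit scale
x2 = rng.normal(0, 5, 500)            # predictor on a much larger scale
logit = 0.8 * x1 + 0.16 * x2          # equal effects per standard deviation
y = (rng.uniform(size=500) < 1.0 / (1.0 + np.exp(-logit))).astype(float)

w = fit_logistic(np.column_stack([np.ones(500), x1, x2]), y)

# Partially standardized coefficients b * sd(x): the raw slopes differ by
# scale, but the standardized ones land near a common value.
b_std = w[1:] * np.array([x1.std(), x2.std()])
print(np.round(b_std, 2))
```

The raw slopes differ fivefold purely because of measurement scale; the per-standard-deviation versions are directly comparable, which is the utility the abstract refers to.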
Dynamic Endothelial Cell Rearrangements Drive Developmental Vessel Regression
Franco, Claudio A.; Jones, Martin L.; Bernabeu, Miguel O.; Geudens, Ilse; Mathivet, Thomas; Rosa, Andre; Lopes, Felicia M.; Lima, Aida P.; Ragab, Anan; Collins, Russell T.; Phng, Li-Kun; Coveney, Peter V.; Gerhardt, Holger
2015-01-01
Patterning of functional blood vessel networks is achieved by pruning of superfluous connections. The cellular and molecular principles of vessel regression are poorly understood. Here we show that regression is mediated by dynamic and polarized migration of endothelial cells, representing anastomosis in reverse. Establishing and analyzing the first axial polarity map of all endothelial cells in a remodeling vascular network, we propose that balanced movement of cells maintains the primitive plexus under low shear conditions in a metastable dynamic state. We predict that flow-induced polarized migration of endothelial cells breaks symmetry and leads to stabilization of high flow/shear segments and regression of adjacent low flow/shear segments. PMID:25884288
Fully Regressive Melanoma: A Case Without Metastasis.
Ehrsam, Eric; Kallini, Joseph R; Lebas, Damien; Khachemoune, Amor; Modiano, Philippe; Cotten, Hervé
2016-08-01
Fully regressive melanoma is a phenomenon in which the primary cutaneous melanoma becomes completely replaced by fibrotic components as a result of host immune response. Although 10 to 35 percent of cases of cutaneous melanomas may partially regress, fully regressive melanoma is very rare; only 47 cases have been reported in the literature to date. All of the cases of fully regressive melanoma reported in the literature were diagnosed in conjunction with metastasis in the patient. The authors describe a case of fully regressive melanoma without any metastases at the time of its diagnosis. Characteristic findings on dermoscopy, as well as the absence of melanoma on final biopsy, confirmed the diagnosis.
Lestari, A. W.; Rustam, Z.
2017-07-01
In the last decade, breast cancer has become the focus of world attention, as this disease is one of the leading causes of death for women. Therefore, it is necessary to have the correct precautions and treatment. In previous studies, the Fuzzy Kernel K-Medoid algorithm has been used for multi-class data. This paper proposes an algorithm to classify the high-dimensional data of breast cancer using Fuzzy Possibilistic C-Means (FPCM) and a new method based on clustering analysis using Normed Kernel Function-Based Fuzzy Possibilistic C-Means (NKFPCM). The objective of this paper is to obtain the best accuracy in classification of breast cancer data. In order to improve the accuracy of the two methods, candidate features are evaluated using feature selection, where the Laplacian Score is used. The results show the comparison of accuracy and running time of FPCM and NKFPCM with and without feature selection.
Lie, Octavian V; van Mierlo, Pieter
2017-01-01
The visual interpretation of intracranial EEG (iEEG) is the standard method used in complex epilepsy surgery cases to map the regions of seizure onset targeted for resection. Still, visual iEEG analysis is labor-intensive and biased due to interpreter dependency. Multivariate parametric functional connectivity measures using adaptive autoregressive (AR) modeling of the iEEG signals based on the Kalman filter algorithm have been used successfully to localize electrographic seizure onsets. Due to their high computational cost, these methods have been applied to only a limited number of iEEG time-series. We compared two Kalman filter implementations, a well-known multivariate adaptive AR model (Arnold et al. 1998) and a simplified, computationally efficient derivation of it, for their potential application to connectivity analysis of high-dimensional (up to 192 channels) iEEG data. When used on simulated seizures together with a multivariate connectivity estimator, the partial directed coherence, the two AR models were compared for their ability to reconstitute the designed seizure signal connections from noisy data. Next, focal seizures from iEEG recordings (73-113 channels) in three patients rendered seizure-free after surgery were mapped with the outdegree, a graph-theory index of outward directed connectivity. Simulation results indicated high levels of mapping accuracy for the two models in the presence of low-to-moderate noise cross-correlation. Accordingly, both AR models correctly mapped the real seizure onset to the resection volume. This study supports the possibility of conducting fully data-driven multivariate connectivity estimations on high-dimensional iEEG datasets using the Kalman filter approach.
Simple and multiple linear regression: sample size considerations.
Hanley, James A
2016-11-01
The suggested "two subjects per variable" (2SPV) rule of thumb in the Austin and Steyerberg article is a chance to bring out some long-established and quite intuitive sample size considerations for both simple and multiple linear regression. This article distinguishes two of the major uses of regression models that imply very different sample size considerations, neither served well by the 2SPV rule. The first is etiological research, which contrasts mean Y levels at differing "exposure" (X) values and thus tends to focus on a single regression coefficient, possibly adjusted for confounders. The second research genre guides clinical practice. It addresses Y levels for individuals with different covariate patterns or "profiles." It focuses on the profile-specific (mean) Y levels themselves, estimating them via linear compounds of regression coefficients and covariates. By drawing on long-established closed-form variance formulae that lie beneath the standard errors in multiple regression, and by rearranging them for heuristic purposes, one arrives at quite intuitive sample size considerations for both research genres. Copyright © 2016 Elsevier Inc. All rights reserved.
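The closed-form variance reasoning can be made concrete for the simple-regression case: the slope's standard error is approximately sigma / (sd_x * sqrt(n)), so halving the standard error requires quadrupling the sample size. The numbers below are purely illustrative:

```python
import numpy as np

# Closed-form: se(b1) = sigma / (sd_x * sqrt(n)) (up to n vs. n-1 details).
sigma, sd_x = 2.0, 1.5
for m in (25, 100, 400):
    print(m, round(sigma / (sd_x * np.sqrt(m)), 4))

# Empirical check at n = 100: the spread of fitted slopes over many
# simulated studies should match the formula (about 0.13 here).
rng = np.random.default_rng(3)
n = 100
slopes = []
for _ in range(2000):
    x = rng.normal(0, sd_x, n)
    y = 1.0 + 0.5 * x + rng.normal(0, sigma, n)
    slopes.append(np.polyfit(x, y, 1)[0])
print(round(float(np.std(slopes)), 2))
```

This is the kind of rearranged variance formula the article uses: precision depends on the residual noise, the spread of the exposure, and the square root of n, not on a subjects-per-variable ratio.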
A Multiple Regression Approach to Normalization of Spatiotemporal Gait Features.
Wahid, Ferdous; Begg, Rezaul; Lythgo, Noel; Hass, Chris J; Halgamuge, Saman; Ackland, David C
2016-04-01
Normalization of gait data is performed to reduce the effects of intersubject variations due to physical characteristics. This study reports a multiple regression normalization approach for spatiotemporal gait data that takes into account intersubject variations in self-selected walking speed and physical properties including age, height, body mass, and sex. Spatiotemporal gait data including stride length, cadence, stance time, double support time, and stride time were obtained from healthy subjects including 782 children, 71 adults, 29 elderly subjects, and 28 elderly Parkinson's disease (PD) patients. Data were normalized using standard dimensionless equations, a detrending method, and a multiple regression approach. After normalization using the dimensionless equations and the detrending method, weak to moderate correlations between walking speed, physical properties, and spatiotemporal gait features were still observed, whereas the multiple regression method reduced these correlations to weak values. The multiple regression approach also revealed significant differences in these gait features, including cadence, stance time, and stride time. The proposed multiple regression normalization may be useful in machine learning, gait classification, and clinical evaluation of pathological gait patterns.
Evaluation of Linear Regression Simultaneous Myoelectric Control Using Intramuscular EMG.
Smith, Lauren H; Kuiken, Todd A; Hargrove, Levi J
2016-04-01
The objective of this study was to evaluate the ability of linear regression models to decode patterns of muscle coactivation from intramuscular electromyogram (EMG) and provide simultaneous myoelectric control of a virtual 3-DOF wrist/hand system. Performance was compared to the simultaneous control of conventional myoelectric prosthesis methods using intramuscular EMG (parallel dual-site control)-an approach that requires users to independently modulate individual muscles in the residual limb, which can be challenging for amputees. Linear regression control was evaluated in eight able-bodied subjects during a virtual Fitts' law task and was compared to performance of eight subjects using parallel dual-site control. An offline analysis also evaluated how different types of training data affected prediction accuracy of linear regression control. The two control systems demonstrated similar overall performance; however, the linear regression method demonstrated improved performance for targets requiring use of all three DOFs, whereas parallel dual-site control demonstrated improved performance for targets that required use of only one DOF. Subjects using linear regression control could more easily activate multiple DOFs simultaneously, but often experienced unintended movements when trying to isolate individual DOFs. Offline analyses also suggested that the method used to train linear regression systems may influence controllability. Linear regression myoelectric control using intramuscular EMG provided an alternative to parallel dual-site control for 3-DOF simultaneous control at the wrist and hand. The two methods demonstrated different strengths in controllability, highlighting the tradeoff between providing simultaneous control and the ability to isolate individual DOFs when desired.
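The decoding step can be pictured as one multi-output least-squares fit from EMG features to DOF commands; the channel counts, hidden synergy matrix, and noise level below are hypothetical, not the study's setup:

```python
import numpy as np

rng = np.random.default_rng(9)
emg = np.abs(rng.normal(size=(1000, 6)))   # rectified EMG amplitudes (toy)
W_true = rng.normal(size=(6, 3))           # hidden muscle-to-DOF mapping
dof = emg @ W_true + rng.normal(0, 0.1, size=(1000, 3))  # 3 DOF commands

# One least-squares fit decodes all three DOFs simultaneously, which is
# what lets users coactivate DOFs instead of isolating single muscles.
X = np.column_stack([np.ones(1000), emg])
W_hat, *_ = np.linalg.lstsq(X, dof, rcond=None)
pred = X @ W_hat
print(round(float(np.mean((pred - dof) ** 2)), 4))
```

Because every DOF prediction draws on every channel, simultaneous activation comes naturally, at the cost of the crosstalk when isolating a single DOF that the study reports.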
Spontaneous Regression of Lumbar Herniated Disc
Directory of Open Access Journals (Sweden)
Chun-Wei Chang
2009-12-01
Full Text Available Intervertebral disc herniation of the lumbar spine is a common disease presenting with low back pain and nerve root radiculopathy. In the majority of patients, neurological symptoms improve after a period of conservative treatment. This has been regarded as the result of a decrease in the pressure exerted by the herniated disc on neighboring neurostructures and a gradual regression of inflammation. Recently, with advances in magnetic resonance imaging, many reports have demonstrated that the herniated disc has the potential for spontaneous regression. Regression coincided with the improvement of associated symptoms. However, the exact regression mechanism remains unclear. Here, we present 2 cases of lumbar intervertebral disc herniation with spontaneous regression. We review the literature and discuss the possible mechanisms, the precipitating factors of spontaneous disc regression, and the proper timing of surgical intervention.
Applied regression analysis a research tool
Pantula, Sastry; Dickey, David
1998-01-01
Least squares estimation, when used appropriately, is a powerful research tool. A deeper understanding of the regression concepts is essential for achieving optimal benefits from a least squares analysis. This book builds on the fundamentals of statistical methods and provides appropriate concepts that will allow a scientist to use least squares as an effective research tool. Applied Regression Analysis is aimed at the scientist who wishes to gain a working knowledge of regression analysis. The basic purpose of this book is to develop an understanding of least squares and related statistical methods without becoming excessively mathematical. It is the outgrowth of more than 30 years of consulting experience with scientists and many years of teaching an applied regression course to graduate students. Applied Regression Analysis serves as an excellent text for a service course on regression for non-statisticians and as a reference for researchers. It also provides a bridge between a two-semester introduction to...
Three contributions to robust regression diagnostics
Directory of Open Access Journals (Sweden)
Kalina J.
2015-12-01
Full Text Available Robust regression methods have been developed not only as a diagnostic tool for standard least squares estimation in statistical and econometric applications, but can also be used as self-standing regression estimation procedures. Therefore, they need to be equipped with their own diagnostic tools. This paper is devoted to robust regression and presents three contributions to its diagnostic tools and to estimating regression parameters under non-standard conditions. Firstly, we derive the Durbin-Watson test of independence of random regression errors for the regression median. The approach is based on an approximation to the exact null distribution of the test statistic. Secondly, we accompany the least trimmed squares estimator with a subjective criterion for selecting a suitable value of the trimming constant. Thirdly, we propose a robust version of the instrumental variables estimator. The new methods are illustrated on examples with real data, and their advantages and limitations are discussed.
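The Durbin-Watson statistic that the paper adapts for the regression median has a simple classical form; the sketch below computes it for ordinary least-squares residuals (not the regression median) on synthetic data:

```python
import numpy as np

def durbin_watson(resid):
    """DW = sum((e_t - e_{t-1})^2) / sum(e_t^2); values near 2 suggest
    independent errors, values near 0 strong positive autocorrelation."""
    d = np.diff(resid)
    return np.sum(d * d) / np.sum(resid * resid)

rng = np.random.default_rng(5)
n = 500
x = rng.normal(size=n)
e_ind = rng.normal(size=n)                 # independent errors
e_ar = np.empty(n)                         # AR(1)-correlated errors
e_ar[0] = rng.normal()
for t in range(1, n):
    e_ar[t] = 0.8 * e_ar[t - 1] + rng.normal()

dws = []
for e in (e_ind, e_ar):
    y = 1.0 + 2.0 * x + e
    X = np.column_stack([np.ones(n), x])
    resid = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    dws.append(durbin_watson(resid))
print([round(d, 2) for d in dws])  # near 2 when independent; well below 2 for AR(1)
```

The paper's contribution is the null distribution of this statistic when the residuals come from the regression median rather than least squares; the statistic itself is unchanged.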
Regression techniques for Portfolio Optimisation using MOSEK
Schmelzer, Thomas; Hauser, Raphael; Andersen, Erling; Dahl, Joachim
2013-01-01
Regression is widely used by practitioners across many disciplines. We reformulate the underlying optimisation problem as a second-order conic program, providing the flexibility often needed in applications. Using examples from portfolio management and quantitative trading, we solve regression problems with and without constraints. Several Python code fragments are given. The code and data are available online at http://www.github.com/tschm/MosekRegression.
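The cone-program idea can be sketched without MOSEK: a constrained index-tracking regression minimizes ||Rw - b|| over long-only, fully-invested weights. A projected-gradient loop stands in for the conic solver here, and the return data are synthetic:

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection onto {w : w >= 0, sum(w) = 1}."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    rho = np.nonzero(u - css / (np.arange(len(v)) + 1.0) > 0)[0][-1]
    return np.maximum(v - css[rho] / (rho + 1.0), 0.0)

rng = np.random.default_rng(6)
R = rng.normal(0.001, 0.01, size=(250, 5))   # asset returns (synthetic)
b = rng.normal(0.001, 0.01, 250)             # benchmark returns (synthetic)

# Index tracking: min ||R w - b||_2  s.t.  w >= 0, 1'w = 1. A conic solver
# rewrites this as: min t  s.t.  ||R w - b|| <= t  plus linear constraints;
# here a projected-gradient loop stands in for the solver.
w = np.full(5, 0.2)
step = 1.0 / np.linalg.norm(R, 2) ** 2
for _ in range(2000):
    w = project_simplex(w - step * R.T @ (R @ w - b))
print(np.round(w, 3), round(float(w.sum()), 6))
```

The second-order-cone rewrite (minimize the bound t on the residual norm) is what lets one solver handle the unconstrained and constrained variants uniformly, which is the flexibility the abstract refers to.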
Use of Longitudinal Regression in Quality Control. Research Report. ETS RR-14-31
Lu, Ying; Yen, Wendy M.
2014-01-01
This article explores the use of longitudinal regression as a tool for identifying scoring inaccuracies. Student progression patterns, as evaluated through longitudinal regressions, typically are more stable from year to year than are scale score distributions and statistics, which require representative samples to conduct credibility checks.…
Bulcock, J. W.
The problem of model estimation when the data are collinear was examined. Though ridge regression (RR) outperforms ordinary least squares (OLS) regression in the presence of acute multicollinearity, it is not a problem-free technique for reducing the variance of the estimates. It is a stochastic procedure when it should be nonstochastic and it…
New ridge parameters for ridge regression
Directory of Open Access Journals (Sweden)
A.V. Dorugade
2014-04-01
Full Text Available Hoerl and Kennard (1970a) introduced the ridge regression estimator as an alternative to the ordinary least squares (OLS) estimator in the presence of multicollinearity. In ridge regression, the ridge parameter plays an important role in parameter estimation. In this article, a new method for estimating ridge parameters in both situations of ordinary ridge regression (ORR) and generalized ridge regression (GRR) is proposed. The simulation study evaluates the performance of the proposed estimator based on the mean squared error (MSE) criterion and indicates that under certain conditions the proposed estimators perform well compared to OLS and other well-known estimators reviewed in this article.
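The ORR estimator and the role of the ridge parameter k can be illustrated on a deliberately collinear design; the parameter grid and the data below are illustrative, not the estimator proposed in the article:

```python
import numpy as np

def ridge(X, y, k):
    """Ordinary ridge estimator (X'X + kI)^{-1} X'y; k = 0 recovers OLS."""
    return np.linalg.solve(X.T @ X + k * np.eye(X.shape[1]), X.T @ y)

rng = np.random.default_rng(7)
n, p = 60, 5
z = rng.normal(size=n)
X = z[:, None] + rng.normal(0, 0.05, size=(n, p))  # near-duplicate columns
true_w = np.array([1.0, 0.5, -0.5, 0.8, -0.3])
y = X @ true_w + rng.normal(0, 1.0, n)

errs = {}
for k in (0.0, 0.1, 1.0):
    w = ridge(X, y, k)
    errs[k] = np.mean((w - true_w) ** 2)  # estimation error vs. ground truth
print({k: round(v, 3) for k, v in errs.items()})
```

Under this severe multicollinearity, OLS (k = 0) produces wildly variable coefficients while a moderate k trades a little bias for a large variance reduction; choosing k well is exactly the problem the proposed estimators address.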
Cao, Faxian; Yang, Zhijing; Ren, Jinchang; Ling, Wing-Kuen; Zhao, Huimin; Marshall, Stephen
2017-12-01
Although sparse multinomial logistic regression (SMLR) has provided a useful tool for sparse classification, it suffers from inefficacy in dealing with high-dimensional features and manually set initial regressor values. This has significantly constrained its applications for hyperspectral image (HSI) classification. In order to tackle these two drawbacks, an extreme sparse multinomial logistic regression (ESMLR) is proposed for effective classification of HSI. First, the HSI dataset is projected to a new feature space with randomly generated weight and bias. Second, an optimization model is established by the Lagrange multiplier method and the dual principle to automatically determine a good initial regressor for SMLR via minimizing the training error and the regressor value. Furthermore, the extended multi-attribute profiles (EMAPs) are utilized for extracting both the spectral and spatial features. A combinational linear multiple features learning (MFL) method is proposed to further enhance the features extracted by ESMLR and EMAPs. Finally, the logistic regression via variable splitting and augmented Lagrangian (LORSAL) is adopted in the proposed framework to reduce the computational time. Experiments conducted on two well-known HSI datasets, namely the Indian Pines dataset and the Pavia University dataset, demonstrate the fast and robust performance of the proposed ESMLR framework.
Time series regression model for infectious disease and weather.
Imai, Chisato; Armstrong, Ben; Chalabi, Zaid; Mangtani, Punam; Hashizume, Masahiro
2015-10-01
Time series regression has been developed and long used to evaluate the short-term associations of air pollution and weather with mortality or morbidity of non-infectious diseases. The application of the regression approaches from this tradition to infectious diseases, however, is less well explored and raises some new issues. We discuss and present potential solutions for five issues often arising in such analyses: changes in immune population, strong autocorrelations, a wide range of plausible lag structures and association patterns, seasonality adjustments, and large overdispersion. The potential approaches are illustrated with datasets of cholera cases and rainfall from Bangladesh and influenza and temperature in Tokyo. Though this article focuses on the application of the traditional time series regression to infectious diseases and weather factors, we also briefly introduce alternative approaches, including mathematical modeling, wavelet analysis, and autoregressive integrated moving average (ARIMA) models. Modifications proposed to standard time series regression practice include using sums of past cases as proxies for the immune population, and using the logarithm of lagged disease counts to control autocorrelation due to true contagion, both of which are motivated from "susceptible-infectious-recovered" (SIR) models. The complexity of lag structures and association patterns can often be informed by biological mechanisms and explored by using distributed lag non-linear models. For overdispersed models, alternative distribution models such as quasi-Poisson and negative binomial should be considered. Time series regression can be used to investigate dependence of infectious diseases on weather, but may need modifying to allow for features specific to this context. Copyright © 2015 The Authors. Published by Elsevier Inc. All rights reserved.
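The proposed modification of including the logarithm of lagged counts as a proxy for true contagion can be sketched on synthetic data; all coefficients and the data-generating process below are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(8)
T = 300
# Weekly temperature with an annual cycle (hypothetical).
temp = 20 + 5 * np.sin(2 * np.pi * np.arange(T) / 52) + rng.normal(0, 1, T)

# Case counts with true contagion: expected cases depend on lagged cases
# (elasticity 0.6) and on temperature (made-up coefficients).
cases = np.empty(T)
cases[0] = 50.0
for t in range(1, T):
    lam = np.exp(2.5 + 0.6 * np.log(cases[t - 1]) - 0.05 * temp[t])
    cases[t] = rng.poisson(lam) + 1.0  # +1 guards against log(0)

# Log-linear regression of counts on weather plus the lagged log count,
# the contagion proxy suggested in the article.
y = np.log(cases[1:])
X = np.column_stack([np.ones(T - 1), temp[1:], np.log(cases[:-1])])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.round(beta, 2))  # [intercept, temperature effect, contagion term]
```

Including the lagged-log-count term absorbs the autocorrelation that true contagion induces, so the weather coefficient is not contaminated by it; a quasi-Poisson or negative binomial fit would be the more faithful choice for overdispersed counts.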
The regress problem : Metatheory, development, and criticism
Peijnenburg, Jeanne; Aikin, Scott
This introduction presents selected proceedings of a two-day meeting on the regress problem, sponsored by the Netherlands Organization for Scientific Research (NWO) and hosted by Vanderbilt University in October 2013, along with other submitted essays. Three forms of research on the regress problem
A Simulation Investigation of Principal Component Regression.
Allen, David E.
Regression analysis is one of the more common analytic tools used by researchers. However, multicollinearity between the predictor variables can cause problems in using the results of regression analyses. Problems associated with multicollinearity include entanglement of relative influences of variables due to reduced precision of estimation,…
Regression Analysis and the Sociological Imagination
De Maio, Fernando
2014-01-01
Regression analysis is an important aspect of most introductory statistics courses in sociology but is often presented in contexts divorced from the central concerns that bring students into the discipline. Consequently, we present five lesson ideas that emerge from a regression analysis of income inequality and mortality in the USA and Canada.
Regression Analysis: Legal Applications in Institutional Research
Frizell, Julie A.; Shippen, Benjamin S., Jr.; Luna, Andrew L.
2008-01-01
This article reviews multiple regression analysis, describes how its results should be interpreted, and instructs institutional researchers on how to conduct such analyses using an example focused on faculty pay equity between men and women. The use of multiple regression analysis will be presented as a method with which to compare salaries of…
Variable importance in latent variable regression models
Kvalheim, O.M.; Arneberg, R.; Bleie, O.; Rajalahti, T.; Smilde, A.K.; Westerhuis, J.A.
2014-01-01
The quality and practical usefulness of a regression model are a function of both interpretability and prediction performance. This work presents some new graphical tools for improved interpretation of latent variable regression models that can also assist in improved algorithms for variable