Efficient Smoothed Concomitant Lasso Estimation for High Dimensional Regression
Ndiaye, Eugene; Fercoq, Olivier; Gramfort, Alexandre; Leclère, Vincent; Salmon, Joseph
2017-10-01
In high dimensional settings, sparse structures are crucial for efficiency, both in term of memory, computation and performance. It is customary to consider ℓ 1 penalty to enforce sparsity in such scenarios. Sparsity enforcing methods, the Lasso being a canonical example, are popular candidates to address high dimension. For efficiency, they rely on tuning a parameter trading data fitting versus sparsity. For the Lasso theory to hold this tuning parameter should be proportional to the noise level, yet the latter is often unknown in practice. A possible remedy is to jointly optimize over the regression parameter as well as over the noise level. This has been considered under several names in the literature: Scaled-Lasso, Square-root Lasso, Concomitant Lasso estimation for instance, and could be of interest for uncertainty quantification. In this work, after illustrating numerical difficulties for the Concomitant Lasso formulation, we propose a modification we coined Smoothed Concomitant Lasso, aimed at increasing numerical stability. We propose an efficient and accurate solver leading to a computational cost no more expensive than the one for the Lasso. We leverage on standard ingredients behind the success of fast Lasso solvers: a coordinate descent algorithm, combined with safe screening rules to achieve speed efficiency, by eliminating early irrelevant features.
Multivariate linear regression of high-dimensional fMRI data with multiple target variables.
Valente, Giancarlo; Castellanos, Agustin Lage; Vanacore, Gianluca; Formisano, Elia
2014-05-01
Multivariate regression is increasingly used to study the relation between fMRI spatial activation patterns and experimental stimuli or behavioral ratings. With linear models, informative brain locations are identified by mapping the model coefficients. This is a central aspect in neuroimaging, as it provides the sought-after link between the activity of neuronal populations and subject's perception, cognition or behavior. Here, we show that mapping of informative brain locations using multivariate linear regression (MLR) may lead to incorrect conclusions and interpretations. MLR algorithms for high dimensional data are designed to deal with targets (stimuli or behavioral ratings, in fMRI) separately, and the predictive map of a model integrates information deriving from both neural activity patterns and experimental design. Not accounting explicitly for the presence of other targets whose associated activity spatially overlaps with the one of interest may lead to predictive maps of troublesome interpretation. We propose a new model that can correctly identify the spatial patterns associated with a target while achieving good generalization. For each target, the training is based on an augmented dataset, which includes all remaining targets. The estimation on such datasets produces both maps and interaction coefficients, which are then used to generalize. The proposed formulation is independent of the regression algorithm employed. We validate this model on simulated fMRI data and on a publicly available dataset. Results indicate that our method achieves high spatial sensitivity and good generalization and that it helps disentangle specific neural effects from interaction with predictive maps associated with other targets. Copyright © 2013 Wiley Periodicals, Inc.
High-Dimensional Intrinsic Interpolation Using Gaussian Process Regression and Diffusion Maps
International Nuclear Information System (INIS)
Thimmisetty, Charanraj A.; Ghanem, Roger G.; White, Joshua A.; Chen, Xiao
2017-01-01
This article considers the challenging task of estimating geologic properties of interest using a suite of proxy measurements. The current work recast this task as a manifold learning problem. In this process, this article introduces a novel regression procedure for intrinsic variables constrained onto a manifold embedded in an ambient space. The procedure is meant to sharpen high-dimensional interpolation by inferring non-linear correlations from the data being interpolated. The proposed approach augments manifold learning procedures with a Gaussian process regression. It first identifies, using diffusion maps, a low-dimensional manifold embedded in an ambient high-dimensional space associated with the data. It relies on the diffusion distance associated with this construction to define a distance function with which the data model is equipped. This distance metric function is then used to compute the correlation structure of a Gaussian process that describes the statistical dependence of quantities of interest in the high-dimensional ambient space. The proposed method is applicable to arbitrarily high-dimensional data sets. Here, it is applied to subsurface characterization using a suite of well log measurements. The predictions obtained in original, principal component, and diffusion space are compared using both qualitative and quantitative metrics. Considerable improvement in the prediction of the geological structural properties is observed with the proposed method.
Penalized estimation for competing risks regression with applications to high-dimensional covariates
DEFF Research Database (Denmark)
Ambrogi, Federico; Scheike, Thomas H.
2016-01-01
of competing events. The direct binomial regression model of Scheike and others (2008. Predicting cumulative incidence probability by direct binomial regression. Biometrika 95: (1), 205-220) is reformulated in a penalized framework to possibly fit a sparse regression model. The developed approach is easily...... Research 19: (1), 29-51), the research regarding competing risks is less developed (Binder and others, 2009. Boosting for high-dimensional time-to-event data with competing risks. Bioinformatics 25: (7), 890-896). The aim of this work is to consider how to do penalized regression in the presence...... implementable using existing high-performance software to do penalized regression. Results from simulation studies are presented together with an application to genomic data when the endpoint is progression-free survival. An R function is provided to perform regularized competing risks regression according...
Non-Asymptotic Oracle Inequalities for the High-Dimensional Cox Regression via Lasso.
Kong, Shengchun; Nan, Bin
2014-01-01
We consider finite sample properties of the regularized high-dimensional Cox regression via lasso. Existing literature focuses on linear models or generalized linear models with Lipschitz loss functions, where the empirical risk functions are the summations of independent and identically distributed (iid) losses. The summands in the negative log partial likelihood function for censored survival data, however, are neither iid nor Lipschitz.We first approximate the negative log partial likelihood function by a sum of iid non-Lipschitz terms, then derive the non-asymptotic oracle inequalities for the lasso penalized Cox regression using pointwise arguments to tackle the difficulties caused by lacking iid Lipschitz losses.
The cross-validated AUC for MCP-logistic regression with high-dimensional data.
Jiang, Dingfeng; Huang, Jian; Zhang, Ying
2013-10-01
We propose a cross-validated area under the receiving operator characteristic (ROC) curve (CV-AUC) criterion for tuning parameter selection for penalized methods in sparse, high-dimensional logistic regression models. We use this criterion in combination with the minimax concave penalty (MCP) method for variable selection. The CV-AUC criterion is specifically designed for optimizing the classification performance for binary outcome data. To implement the proposed approach, we derive an efficient coordinate descent algorithm to compute the MCP-logistic regression solution surface. Simulation studies are conducted to evaluate the finite sample performance of the proposed method and its comparison with the existing methods including the Akaike information criterion (AIC), Bayesian information criterion (BIC) or Extended BIC (EBIC). The model selected based on the CV-AUC criterion tends to have a larger predictive AUC and smaller classification error than those with tuning parameters selected using the AIC, BIC or EBIC. We illustrate the application of the MCP-logistic regression with the CV-AUC criterion on three microarray datasets from the studies that attempt to identify genes related to cancers. Our simulation studies and data examples demonstrate that the CV-AUC is an attractive method for tuning parameter selection for penalized methods in high-dimensional logistic regression models.
Xu, Chao; Fang, Jian; Shen, Hui; Wang, Yu-Ping; Deng, Hong-Wen
2018-01-25
Extreme phenotype sampling (EPS) is a broadly-used design to identify candidate genetic factors contributing to the variation of quantitative traits. By enriching the signals in extreme phenotypic samples, EPS can boost the association power compared to random sampling. Most existing statistical methods for EPS examine the genetic factors individually, despite many quantitative traits have multiple genetic factors underlying their variation. It is desirable to model the joint effects of genetic factors, which may increase the power and identify novel quantitative trait loci under EPS. The joint analysis of genetic data in high-dimensional situations requires specialized techniques, e.g., the least absolute shrinkage and selection operator (LASSO). Although there are extensive research and application related to LASSO, the statistical inference and testing for the sparse model under EPS remain unknown. We propose a novel sparse model (EPS-LASSO) with hypothesis test for high-dimensional regression under EPS based on a decorrelated score function. The comprehensive simulation shows EPS-LASSO outperforms existing methods with stable type I error and FDR control. EPS-LASSO can provide a consistent power for both low- and high-dimensional situations compared with the other methods dealing with high-dimensional situations. The power of EPS-LASSO is close to other low-dimensional methods when the causal effect sizes are small and is superior when the effects are large. Applying EPS-LASSO to a transcriptome-wide gene expression study for obesity reveals 10 significant body mass index associated genes. Our results indicate that EPS-LASSO is an effective method for EPS data analysis, which can account for correlated predictors. The source code is available at https://github.com/xu1912/EPSLASSO. hdeng2@tulane.edu. Supplementary data are available at Bioinformatics online. © The Author (2018). Published by Oxford University Press. All rights reserved. For Permissions, please
Tao, Chenyang; Nichols, Thomas E; Hua, Xue; Ching, Christopher R K; Rolls, Edmund T; Thompson, Paul M; Feng, Jianfeng
2017-01-01
We propose a generalized reduced rank latent factor regression model (GRRLF) for the analysis of tensor field responses and high dimensional covariates. The model is motivated by the need from imaging-genetic studies to identify genetic variants that are associated with brain imaging phenotypes, often in the form of high dimensional tensor fields. GRRLF identifies from the structure in the data the effective dimensionality of the data, and then jointly performs dimension reduction of the covariates, dynamic identification of latent factors, and nonparametric estimation of both covariate and latent response fields. After accounting for the latent and covariate effects, GRLLF performs a nonparametric test on the remaining factor of interest. GRRLF provides a better factorization of the signals compared with common solutions, and is less susceptible to overfitting because it exploits the effective dimensionality. The generality and the flexibility of GRRLF also allow various statistical models to be handled in a unified framework and solutions can be efficiently computed. Within the field of neuroimaging, it improves the sensitivity for weak signals and is a promising alternative to existing approaches. The operation of the framework is demonstrated with both synthetic datasets and a real-world neuroimaging example in which the effects of a set of genes on the structure of the brain at the voxel level were measured, and the results compared favorably with those from existing approaches. Copyright © 2016. Published by Elsevier Inc.
Xia, Yin; Cai, Tianxi; Cai, T Tony
2018-01-01
Motivated by applications in genomics, we consider in this paper global and multiple testing for the comparisons of two high-dimensional linear regression models. A procedure for testing the equality of the two regression vectors globally is proposed and shown to be particularly powerful against sparse alternatives. We then introduce a multiple testing procedure for identifying unequal coordinates while controlling the false discovery rate and false discovery proportion. Theoretical justifications are provided to guarantee the validity of the proposed tests and optimality results are established under sparsity assumptions on the regression coefficients. The proposed testing procedures are easy to implement. Numerical properties of the procedures are investigated through simulation and data analysis. The results show that the proposed tests maintain the desired error rates under the null and have good power under the alternative at moderate sample sizes. The procedures are applied to the Framingham Offspring study to investigate the interactions between smoking and cardiovascular related genetic mutations important for an inflammation marker.
Prediction-Oriented Marker Selection (PROMISE): With Application to High-Dimensional Regression.
Kim, Soyeon; Baladandayuthapani, Veerabhadran; Lee, J Jack
2017-06-01
In personalized medicine, biomarkers are used to select therapies with the highest likelihood of success based on an individual patient's biomarker/genomic profile. Two goals are to choose important biomarkers that accurately predict treatment outcomes and to cull unimportant biomarkers to reduce the cost of biological and clinical verifications. These goals are challenging due to the high dimensionality of genomic data. Variable selection methods based on penalized regression (e.g., the lasso and elastic net) have yielded promising results. However, selecting the right amount of penalization is critical to simultaneously achieving these two goals. Standard approaches based on cross-validation (CV) typically provide high prediction accuracy with high true positive rates but at the cost of too many false positives. Alternatively, stability selection (SS) controls the number of false positives, but at the cost of yielding too few true positives. To circumvent these issues, we propose prediction-oriented marker selection (PROMISE), which combines SS with CV to conflate the advantages of both methods. Our application of PROMISE with the lasso and elastic net in data analysis shows that, compared to CV, PROMISE produces sparse solutions, few false positives, and small type I + type II error, and maintains good prediction accuracy, with a marginal decrease in the true positive rates. Compared to SS, PROMISE offers better prediction accuracy and true positive rates. In summary, PROMISE can be applied in many fields to select regularization parameters when the goals are to minimize false positives and maximize prediction accuracy.
Estimating the effect of a variable in a high-dimensional regression model
DEFF Research Database (Denmark)
Jensen, Peter Sandholt; Wurtz, Allan
assume that the effect is identified in a high-dimensional linear model specified by unconditional moment restrictions. We consider properties of the following methods, which rely on lowdimensional models to infer the effect: Extreme bounds analysis, the minimum t-statistic over models, Sala...
Yu, Wenbao; Park, Taesung
2014-01-01
It is common to get an optimal combination of markers for disease classification and prediction when multiple markers are available. Many approaches based on the area under the receiver operating characteristic curve (AUC) have been proposed. Existing works based on AUC in a high-dimensional context depend mainly on a non-parametric, smooth approximation of AUC, with no work using a parametric AUC-based approach, for high-dimensional data. We propose an AUC-based approach using penalized regression (AucPR), which is a parametric method used for obtaining a linear combination for maximizing the AUC. To obtain the AUC maximizer in a high-dimensional context, we transform a classical parametric AUC maximizer, which is used in a low-dimensional context, into a regression framework and thus, apply the penalization regression approach directly. Two kinds of penalization, lasso and elastic net, are considered. The parametric approach can avoid some of the difficulties of a conventional non-parametric AUC-based approach, such as the lack of an appropriate concave objective function and a prudent choice of the smoothing parameter. We apply the proposed AucPR for gene selection and classification using four real microarray and synthetic data. Through numerical studies, AucPR is shown to perform better than the penalized logistic regression and the nonparametric AUC-based method, in the sense of AUC and sensitivity for a given specificity, particularly when there are many correlated genes. We propose a powerful parametric and easily-implementable linear classifier AucPR, for gene selection and disease prediction for high-dimensional data. AucPR is recommended for its good prediction performance. Beside gene expression microarray data, AucPR can be applied to other types of high-dimensional omics data, such as miRNA and protein data.
Tumor regression patterns in retinoblastoma
International Nuclear Information System (INIS)
Zafar, S.N.; Siddique, S.N.; Zaheer, N.
2016-01-01
To observe the types of tumor regression after treatment, and identify the common pattern of regression in our patients. Study Design: Descriptive study. Place and Duration of Study: Department of Pediatric Ophthalmology and Strabismus, Al-Shifa Trust Eye Hospital, Rawalpindi, Pakistan, from October 2011 to October 2014. Methodology: Children with unilateral and bilateral retinoblastoma were included in the study. Patients were referred to Pakistan Institute of Medical Sciences, Islamabad, for chemotherapy. After every cycle of chemotherapy, dilated funds examination under anesthesia was performed to record response of the treatment. Regression patterns were recorded on RetCam II. Results: Seventy-four tumors were included in the study. Out of 74 tumors, 3 were ICRB group A tumors, 43 were ICRB group B tumors, 14 tumors belonged to ICRB group C, and remaining 14 were ICRB group D tumors. Type IV regression was seen in 39.1% (n=29) tumors, type II in 29.7% (n=22), type III in 25.6% (n=19), and type I in 5.4% (n=4). All group A tumors (100%) showed type IV regression. Seventeen (39.5%) group B tumors showed type IV regression. In group C, 5 tumors (35.7%) showed type II regression and 5 tumors (35.7%) showed type IV regression. In group D, 6 tumors (42.9%) regressed to type II non-calcified remnants. Conclusion: The response and success of the focal and systemic treatment, as judged by the appearance of different patterns of tumor regression, varies with the ICRB grouping of the tumor. (author)
Privacy-Preserving Distributed Linear Regression on High-Dimensional Data
Directory of Open Access Journals (Sweden)
Gascón Adrià
2017-10-01
Full Text Available We propose privacy-preserving protocols for computing linear regression models, in the setting where the training dataset is vertically distributed among several parties. Our main contribution is a hybrid multi-party computation protocol that combines Yao’s garbled circuits with tailored protocols for computing inner products. Like many machine learning tasks, building a linear regression model involves solving a system of linear equations. We conduct a comprehensive evaluation and comparison of different techniques for securely performing this task, including a new Conjugate Gradient Descent (CGD algorithm. This algorithm is suitable for secure computation because it uses an efficient fixed-point representation of real numbers while maintaining accuracy and convergence rates comparable to what can be obtained with a classical solution using floating point numbers. Our technique improves on Nikolaenko et al.’s method for privacy-preserving ridge regression (S&P 2013, and can be used as a building block in other analyses. We implement a complete system and demonstrate that our approach is highly scalable, solving data analysis problems with one million records and one hundred features in less than one hour of total running time.
Raz, Gal; Svanera, Michele; Singer, Neomi; Gilam, Gadi; Cohen, Maya Bleich; Lin, Tamar; Admon, Roee; Gonen, Tal; Thaler, Avner; Granot, Roni Y; Goebel, Rainer; Benini, Sergio; Valente, Giancarlo
2017-12-01
Major methodological advancements have been recently made in the field of neural decoding, which is concerned with the reconstruction of mental content from neuroimaging measures. However, in the absence of a large-scale examination of the validity of the decoding models across subjects and content, the extent to which these models can be generalized is not clear. This study addresses the challenge of producing generalizable decoding models, which allow the reconstruction of perceived audiovisual features from human magnetic resonance imaging (fMRI) data without prior training of the algorithm on the decoded content. We applied an adapted version of kernel ridge regression combined with temporal optimization on data acquired during film viewing (234 runs) to generate standardized brain models for sound loudness, speech presence, perceived motion, face-to-frame ratio, lightness, and color brightness. The prediction accuracies were tested on data collected from different subjects watching other movies mainly in another scanner. Substantial and significant (Q FDR movies (R¯=0.62, R¯ = 0.60, R¯ = 0.60, respectively) with high reproducibility of the predictors across subjects. The face ratio model produced significant correlations in 7 out of 8 movies (R¯=0.56). The lightness and brightness models did not show robustness (R¯=0.23, R¯ = 0). Further analysis of additional data (95 runs) indicated that loudness reconstruction veridicality can consistently reveal relevant group differences in musical experience. The findings point to the validity and generalizability of our loudness, speech, motion, and face ratio models for complex cinematic stimuli (as well as for music in the case of loudness). While future research should further validate these models using controlled stimuli and explore the feasibility of extracting more complex models via this method, the reliability of our results indicates the potential usefulness of the approach and the resulting models
Arif, Muhammad
2012-06-01
In pattern classification problems, feature extraction is an important step. Quality of features in discriminating different classes plays an important role in pattern classification problems. In real life, pattern classification may require high dimensional feature space and it is impossible to visualize the feature space if the dimension of feature space is greater than four. In this paper, we have proposed a Similarity-Dissimilarity plot which can project high dimensional space to a two dimensional space while retaining important characteristics required to assess the discrimination quality of the features. Similarity-dissimilarity plot can reveal information about the amount of overlap of features of different classes. Separable data points of different classes will also be visible on the plot which can be classified correctly using appropriate classifier. Hence, approximate classification accuracy can be predicted. Moreover, it is possible to know about whom class the misclassified data points will be confused by the classifier. Outlier data points can also be located on the similarity-dissimilarity plot. Various examples of synthetic data are used to highlight important characteristics of the proposed plot. Some real life examples from biomedical data are also used for the analysis. The proposed plot is independent of number of dimensions of the feature space.
Taşkin Kaya, Gülşen
2013-10-01
Recently, earthquake damage assessment using satellite images has been a very popular ongoing research direction. Especially with the availability of very high resolution (VHR) satellite images, a quite detailed damage map based on building scale has been produced, and various studies have also been conducted in the literature. As the spatial resolution of satellite images increases, distinguishability of damage patterns becomes more cruel especially in case of using only the spectral information during classification. In order to overcome this difficulty, textural information needs to be involved to the classification to improve the visual quality and reliability of damage map. There are many kinds of textural information which can be derived from VHR satellite images depending on the algorithm used. However, extraction of textural information and evaluation of them have been generally a time consuming process especially for the large areas affected from the earthquake due to the size of VHR image. Therefore, in order to provide a quick damage map, the most useful features describing damage patterns needs to be known in advance as well as the redundant features. In this study, a very high resolution satellite image after Iran, Bam earthquake was used to identify the earthquake damage. Not only the spectral information, textural information was also used during the classification. For textural information, second order Haralick features were extracted from the panchromatic image for the area of interest using gray level co-occurrence matrix with different size of windows and directions. In addition to using spatial features in classification, the most useful features representing the damage characteristic were selected with a novel feature selection method based on high dimensional model representation (HDMR) giving sensitivity of each feature during classification. The method called HDMR was recently proposed as an efficient tool to capture the input
Yu, Wenbao; Park, Taesung
2014-01-01
Motivation It is common to get an optimal combination of markers for disease classification and prediction when multiple markers are available. Many approaches based on the area under the receiver operating characteristic curve (AUC) have been proposed. Existing works based on AUC in a high-dimensional context depend mainly on a non-parametric, smooth approximation of AUC, with no work using a parametric AUC-based approach, for high-dimensional data. Results We propose an AUC-based approach u...
Directory of Open Access Journals (Sweden)
Omid Hamidi
2014-01-01
Full Text Available Microarray technology results in high-dimensional and low-sample size data sets. Therefore, fitting sparse models is substantial because only a small number of influential genes can reliably be identified. A number of variable selection approaches have been proposed for high-dimensional time-to-event data based on Cox proportional hazards where censoring is present. The present study applied three sparse variable selection techniques of Lasso, smoothly clipped absolute deviation and the smooth integration of counting, and absolute deviation for gene expression survival time data using the additive risk model which is adopted when the absolute effects of multiple predictors on the hazard function are of interest. The performances of used techniques were evaluated by time dependent ROC curve and bootstrap .632+ prediction error curves. The selected genes by all methods were highly significant (P<0.001. The Lasso showed maximum median of area under ROC curve over time (0.95 and smoothly clipped absolute deviation showed the lowest prediction error (0.105. It was observed that the selected genes by all methods improved the prediction of purely clinical model indicating the valuable information containing in the microarray features. So it was concluded that used approaches can satisfactorily predict survival based on selected gene expression measurements.
Chernozhukov, Victor; Hansen, Christian; Spindler, Martin
2016-01-01
In this article the package High-dimensional Metrics (\\texttt{hdm}) is introduced. It is a collection of statistical methods for estimation and quantification of uncertainty in high-dimensional approximately sparse models. It focuses on providing confidence intervals and significance testing for (possibly many) low-dimensional subcomponents of the high-dimensional parameter vector. Efficient estimators and uniformly valid confidence intervals for regression coefficients on target variables (e...
Wang, Wei; Yang, Jiong
With the rapid growth of computational biology and e-commerce applications, high-dimensional data becomes very common. Thus, mining high-dimensional data is an urgent problem of great practical importance. However, there are some unique challenges for mining data of high dimensions, including (1) the curse of dimensionality and more crucial (2) the meaningfulness of the similarity measure in the high dimension space. In this chapter, we present several state-of-art techniques for analyzing high-dimensional data, e.g., frequent pattern mining, clustering, and classification. We will discuss how these methods deal with the challenges of high dimensionality.
Radiation regression patterns after cobalt plaque insertion for retinoblastoma
International Nuclear Information System (INIS)
Buys, R.J.; Abramson, D.H.; Ellsworth, R.M.; Haik, B.
1983-01-01
An analysis of 31 eyes of 30 patients who had been treated with cobalt plaques for retinoblastoma disclosed that a type I radiation regression pattern developed in 15 patients; type II, in one patient, and type III, in five patients. Nine patients had a regression pattern characterized by complete destruction of the tumor, the surrounding choroid, and all of the vessels in the area into which the plaque was inserted. This resulting white scar, corresponding to the sclerae only, was classified as a type IV radiation regression pattern. There was no evidence of tumor recurrence in patients with type IV regression patterns, with an average follow-up of 6.5 years, after receiving cobalt plaque therapy. Twenty-nine of these 30 patients had been unsuccessfully treated with at least one other modality (ie, light coagulation, cryotherapy, external beam radiation, or chemotherapy)
Radiation regression patterns after cobalt plaque insertion for retinoblastoma
Energy Technology Data Exchange (ETDEWEB)
Buys, R.J.; Abramson, D.H.; Ellsworth, R.M.; Haik, B.
1983-08-01
An analysis of 31 eyes of 30 patients who had been treated with cobalt plaques for retinoblastoma disclosed that a type I radiation regression pattern developed in 15 patients; type II, in one patient, and type III, in five patients. Nine patients had a regression pattern characterized by complete destruction of the tumor, the surrounding choroid, and all of the vessels in the area into which the plaque was inserted. This resulting white scar, corresponding to the sclerae only, was classified as a type IV radiation regression pattern. There was no evidence of tumor recurrence in patients with type IV regression patterns, with an average follow-up of 6.5 years, after receiving cobalt plaque therapy. Twenty-nine of these 30 patients had been unsuccessfully treated with at least one other modality (ie, light coagulation, cryotherapy, external beam radiation, or chemotherapy).
Chernozhukov, Victor; Hansen, Chris; Spindler, Martin
2016-01-01
The package High-dimensional Metrics (\\Rpackage{hdm}) is an evolving collection of statistical methods for estimation and quantification of uncertainty in high-dimensional approximately sparse models. It focuses on providing confidence intervals and significance testing for (possibly many) low-dimensional subcomponents of the high-dimensional parameter vector. Efficient estimators and uniformly valid confidence intervals for regression coefficients on target variables (e.g., treatment or poli...
Clustering high dimensional data
DEFF Research Database (Denmark)
Assent, Ira
2012-01-01
High-dimensional data, i.e., data described by a large number of attributes, pose specific challenges to clustering. The so-called ‘curse of dimensionality’, coined originally to describe the general increase in complexity of various computational problems as dimensionality increases, is known...... to render traditional clustering algorithms ineffective. The curse of dimensionality, among other effects, means that with increasing number of dimensions, a loss of meaningful differentiation between similar and dissimilar objects is observed. As high-dimensional objects appear almost alike, new approaches...... for clustering are required. Consequently, recent research has focused on developing techniques and clustering algorithms specifically for high-dimensional data. Still, open research issues remain. Clustering is a data mining task devoted to the automatic grouping of data based on mutual similarity. Each cluster...
Drought Patterns Forecasting using an Auto-Regressive Logistic Model
del Jesus, M.; Sheffield, J.; Méndez Incera, F. J.; Losada, I. J.; Espejo, A.
2014-12-01
Drought is characterized by a water deficit that may manifest across a large range of spatial and temporal scales. Drought may create important socio-economic consequences, many times of catastrophic dimensions. A quantifiable definition of drought is elusive because depending on its impacts, consequences and generation mechanism, different water deficit periods may be identified as a drought by virtue of some definitions but not by others. Droughts are linked to the water cycle and, although a climate change signal may not have emerged yet, they are also intimately linked to climate.In this work we develop an auto-regressive logistic model for drought prediction at different temporal scales that makes use of a spatially explicit framework. Our model allows to include covariates, continuous or categorical, to improve the performance of the auto-regressive component.Our approach makes use of dimensionality reduction (principal component analysis) and classification techniques (K-Means and maximum dissimilarity) to simplify the representation of complex climatic patterns, such as sea surface temperature (SST) and sea level pressure (SLP), while including information on their spatial structure, i.e. considering their spatial patterns. This procedure allows us to include in the analysis multivariate representation of complex climatic phenomena, as the El Niño-Southern Oscillation. We also explore the impact of other climate-related variables such as sun spots. The model allows to quantify the uncertainty of the forecasts and can be easily adapted to make predictions under future climatic scenarios. The framework herein presented may be extended to other applications such as flash flood analysis, or risk assessment of natural hazards.
CSIR Research Space (South Africa)
Mc
2012-07-01
Full Text Available stream_source_info McLaren_2012.pdf.txt stream_content_type text/plain stream_size 2190 Content-Encoding ISO-8859-1 stream_name McLaren_2012.pdf.txt Content-Type text/plain; charset=ISO-8859-1 High dimensional... entanglement M. McLAREN1,2, F.S. ROUX1 & A. FORBES1,2,3 1. CSIR National Laser Centre, PO Box 395, Pretoria 0001 2. School of Physics, University of the Stellenbosch, Private Bag X1, 7602, Matieland 3. School of Physics, University of Kwazulu...
Electricity demand forecasting using regression, scenarios and pattern analysis
CSIR Research Space (South Africa)
Khuluse, S
2009-02-01
Full Text Available The objective of the study is to forecast national electricity demand patterns for a period of twenty years: total annual consumption and understanding seasonal effects. No constraint on the supply of electricity was assumed...
RETROSPECTIVE RESEARCH ON REGRESSION PATTERN IN MALE SKELETAL EPIPHYSEAL FUSION
Directory of Open Access Journals (Sweden)
Ranga Rao
2016-02-01
Full Text Available In developing countries like India, because of illiteracy, ignorance regarding the importance of official records like birth and death, vast majority of population fail to give information of such vital events to the concerning authorities. This causes paucity in such information when needed in a medico-legal case or for research purpose. There is a wide confusion and controversy regarding the standard method to be used for estimating age in the Indian Sub-continent. The aim of the current study is to find the correlation among the various parameters of commonly examined ossification centres, through which a regression formula with positive correlation and maximum coefficient of determination can be, derived which can be attempted to be standardised for the estimation of age.
Gao, Xiangyun; An, Haizhong; Fang, Wei; Huang, Xuan; Li, Huajiao; Zhong, Weiqiong; Ding, Yinghui
2014-07-01
The linear regression parameters between two time series can be different under different lengths of observation period. If we study the whole period by the sliding window of a short period, the change of the linear regression parameters is a process of dynamic transmission over time. We tackle fundamental research that presents a simple and efficient computational scheme: a linear regression patterns transmission algorithm, which transforms linear regression patterns into directed and weighted networks. The linear regression patterns (nodes) are defined by the combination of intervals of the linear regression parameters and the results of the significance testing under different sizes of the sliding window. The transmissions between adjacent patterns are defined as edges, and the weights of the edges are the frequency of the transmissions. The major patterns, the distance, and the medium in the process of the transmission can be captured. The statistical results of weighted out-degree and betweenness centrality are mapped on timelines, which shows the features of the distribution of the results. Many measurements in different areas that involve two related time series variables could take advantage of this algorithm to characterize the dynamic relationships between the time series from a new perspective.
High-dimensional covariance estimation with high-dimensional data
Pourahmadi, Mohsen
2013-01-01
Methods for estimating sparse and large covariance matrices Covariance and correlation matrices play fundamental roles in every aspect of the analysis of multivariate data collected from a variety of fields including business and economics, health care, engineering, and environmental and physical sciences. High-Dimensional Covariance Estimation provides accessible and comprehensive coverage of the classical and modern approaches for estimating covariance matrices as well as their applications to the rapidly developing areas lying at the intersection of statistics and mac
Fischer, M. J.
2014-02-01
There are many different methods for investigating the coupling between two climate fields, which are all based on the multivariate regression model. Each different method of solving the multivariate model has its own attractive characteristics, but often the suitability of a particular method for a particular problem is not clear. Continuum regression methods search the solution space between the conventional methods and thus can find regression model subspaces that mix the attractive characteristics of the end-member subspaces. Principal covariates regression is a continuum regression method that is easily applied to climate fields and makes use of two end-members: principal components regression and redundancy analysis. In this study, principal covariates regression is extended to additionally span a third end-member (partial least squares or maximum covariance analysis). The new method, regularized principal covariates regression, has several attractive features including the following: it easily applies to problems in which the response field has missing values or is temporally sparse, it explores a wide range of model spaces, and it seeks a model subspace that will, for a set number of components, have a predictive skill that is the same or better than conventional regression methods. The new method is illustrated by applying it to the problem of predicting the southern Australian winter rainfall anomaly field using the regional atmospheric pressure anomaly field. Regularized principal covariates regression identifies four major coupled patterns in these two fields. The two leading patterns, which explain over half the variance in the rainfall field, are related to the subtropical ridge and features of the zonally asymmetric circulation.
Schmidtmann, I; Elsäßer, A; Weinmann, A; Binder, H
2014-12-30
For determining a manageable set of covariates potentially influential with respect to a time-to-event endpoint, Cox proportional hazards models can be combined with variable selection techniques, such as stepwise forward selection or backward elimination based on p-values, or regularized regression techniques such as component-wise boosting. Cox regression models have also been adapted for dealing with more complex event patterns, for example, for competing risks settings with separate, cause-specific hazard models for each event type, or for determining the prognostic effect pattern of a variable over different landmark times, with one conditional survival model for each landmark. Motivated by a clinical cancer registry application, where complex event patterns have to be dealt with and variable selection is needed at the same time, we propose a general approach for linking variable selection between several Cox models. Specifically, we combine score statistics for each covariate across models by Fisher's method as a basis for variable selection. This principle is implemented for a stepwise forward selection approach as well as for a regularized regression technique. In an application to data from hepatocellular carcinoma patients, the coupled stepwise approach is seen to facilitate joint interpretation of the different cause-specific Cox models. In conditional survival models at landmark times, which address updates of prediction as time progresses and both treatment and other potential explanatory variables may change, the coupled regularized regression approach identifies potentially important, stably selected covariates together with their effect time pattern, despite having only a small number of events. These results highlight the promise of the proposed approach for coupling variable selection between Cox models, which is particularly relevant for modeling for clinical cancer registries with their complex event patterns. Copyright © 2014 John Wiley & Sons
Directory of Open Access Journals (Sweden)
Pełka Paweł
2017-01-01
Full Text Available Electricity demand forecasting is of important role in power system planning and operation. In this work, fuzzy nearest neighbour regression has been utilised to estimate monthly electricity demands. The forecasting model was based on the pre-processed energy consumption time series, where input and output variables were defined as patterns representing unified fragments of the time series. Relationships between inputs and outputs, which were simplified due to patterns, were modelled using nonparametric regression with weighting function defined as a fuzzy membership of learning points to the neighbourhood of a query point. In an experimental part of the work the model was evaluated using real-world data. The results are encouraging and show high performances of the model and its competitiveness compared to other forecasting models.
Directory of Open Access Journals (Sweden)
Dayeon Shin
2018-01-01
Full Text Available Diet plays a crucial role in cognitive function. Few studies have examined the relationship between dietary patterns and cognitive functions of older adults in the Korean population. This study aimed to identify the effect of dietary patterns on the risk of mild cognitive impairment. A total of 239 participants, including 88 men and 151 women, aged 65 years and older were selected from health centers in the district of Seoul, Gyeonggi province, and Incheon, in Korea. Dietary patterns were determined using Reduced Rank Regression (RRR methods with responses regarding vitamin B6, vitamin C, and iron intakes, based on both a one-day 24-h recall and a food frequency questionnaire. Cognitive function was assessed using the Korean-Mini Mental State Examination (K-MMSE. Multivariable logistic regression models were used to estimate the association between dietary pattern score and the risk of mild cognitive impairment. A total of 20 (8% out of the 239 participants had mild cognitive impairment. Three dietary patterns were identified: seafood and vegetables, high meat, and bread, ham, and alcohol. Among the three dietary patterns, the older adult population who adhered to the seafood and vegetables pattern, characterized by high intake of seafood, vegetables, fruits, bread, snacks, soy products, beans, chicken, pork, ham, egg, and milk had a decreased risk of mild cognitive impairment compared to those who did not (adjusted odds ratios 0.06, 95% confidence interval 0.01–0.72 after controlling for gender, supplementation, education, history of dementia, physical activity, body mass index (BMI, and duration of sleep. The other two dietary patterns were not significantly associated with the risk of mild cognitive impairment. In conclusion, high consumption of fruits, vegetables, seafood, and protein foods was significantly associated with reduced mild cognitive impairment in older Korean adults. These results can contribute to the establishment of
Clustering high dimensional data using RIA
Energy Technology Data Exchange (ETDEWEB)
Aziz, Nazrina [School of Quantitative Sciences, College of Arts and Sciences, Universiti Utara Malaysia, 06010 Sintok, Kedah (Malaysia)
2015-05-15
Clustering may simply represent a convenient method for organizing a large data set so that it can easily be understood and information can efficiently be retrieved. However, identifying cluster in high dimensionality data sets is a difficult task because of the curse of dimensionality. Another challenge in clustering is some traditional functions cannot capture the pattern dissimilarity among objects. In this article, we used an alternative dissimilarity measurement called Robust Influence Angle (RIA) in the partitioning method. RIA is developed using eigenstructure of the covariance matrix and robust principal component score. We notice that, it can obtain cluster easily and hence avoid the curse of dimensionality. It is also manage to cluster large data sets with mixed numeric and categorical value.
High-dimensional model estimation and model selection
CERN. Geneva
2015-01-01
I will review concepts and algorithms from high-dimensional statistics for linear model estimation and model selection. I will particularly focus on the so-called p>>n setting where the number of variables p is much larger than the number of samples n. I will focus mostly on regularized statistical estimators that produce sparse models. Important examples include the LASSO and its matrix extension, the Graphical LASSO, and more recent non-convex methods such as the TREX. I will show the applicability of these estimators in a diverse range of scientific applications, such as sparse interaction graph recovery and high-dimensional classification and regression problems in genomics.
Directory of Open Access Journals (Sweden)
Andrew K. Wills
2016-04-01
Full Text Available Abstract Background Regression models are widely used to link serial measures of anthropometric size or changes in size to a later outcome. Different parameterisations of these models enable one to target different questions about the effect of growth, however, their interpretation can be challenging. Our objective was to formulate and classify several sets of parameterisations by their underlying growth pattern contrast, and to discuss their utility using an expository example. Methods We describe and classify five sets of model parameterisations in accordance with their underlying growth pattern contrast (conditional growth; being bigger v being smaller; becoming bigger and staying bigger; growing faster v being bigger; becoming and staying bigger versus being bigger. The contrasts are estimated by including different sets of repeated measures of size and changes in size in a regression model. We illustrate these models in the setting of linking infant growth (measured on 6 occasions: birth, 6 weeks, 3, 6, 12 and 24 months in weight-for-height-for-age z-scores to later childhood overweight at 8y using complete cases from the Norwegian Childhood Growth study (n = 900. Results In our expository example, conditional growth during all periods, becoming bigger in any interval and staying bigger through infancy, and being bigger from birth were all associated with higher odds of later overweight. The highest odds of later overweight occurred for individuals who experienced high conditional growth or became bigger in the 3 to 6 month period and stayed bigger, and those who were bigger from birth to 24 months. Comparisons between periods and between growth patterns require large sample sizes and need to consider how to scale associations to make comparisons fair; with respect to the latter, we show one approach. Conclusion Studies interested in detrimental growth patterns may gain extra insight from reporting several sets of growth pattern
Modeling High-Dimensional Multichannel Brain Signals
Hu, Lechuan
2017-12-12
Our goal is to model and measure functional and effective (directional) connectivity in multichannel brain physiological signals (e.g., electroencephalograms, local field potentials). The difficulties from analyzing these data mainly come from two aspects: first, there are major statistical and computational challenges for modeling and analyzing high-dimensional multichannel brain signals; second, there is no set of universally agreed measures for characterizing connectivity. To model multichannel brain signals, our approach is to fit a vector autoregressive (VAR) model with potentially high lag order so that complex lead-lag temporal dynamics between the channels can be captured. Estimates of the VAR model will be obtained by our proposed hybrid LASSLE (LASSO + LSE) method which combines regularization (to control for sparsity) and least squares estimation (to improve bias and mean-squared error). Then we employ some measures of connectivity but put an emphasis on partial directed coherence (PDC) which can capture the directional connectivity between channels. PDC is a frequency-specific measure that explains the extent to which the present oscillatory activity in a sender channel influences the future oscillatory activity in a specific receiver channel relative to all possible receivers in the network. The proposed modeling approach provided key insights into potential functional relationships among simultaneously recorded sites during performance of a complex memory task. Specifically, this novel method was successful in quantifying patterns of effective connectivity across electrode locations, and in capturing how these patterns varied across trial epochs and trial types.
Cuffney, T.F.; Kashuba, R.; Qian, S.S.; Alameddine, I.; Cha, Y.K.; Lee, B.; Coles, J.F.; McMahon, G.
2011-01-01
Multilevel hierarchical regression was used to examine regional patterns in the responses of benthic macroinvertebrates and algae to urbanization across 9 metropolitan areas of the conterminous USA. Linear regressions established that responses (intercepts and slopes) to urbanization of invertebrates and algae varied among metropolitan areas. Multilevel hierarchical regression models were able to explain these differences on the basis of region-scale predictors. Regional differences in the type of land cover (agriculture or forest) being converted to urban and climatic factors (precipitation and air temperature) accounted for the differences in the response of macroinvertebrates to urbanization based on ordination scores, total richness, Ephemeroptera, Plecoptera, Trichoptera richness, and average tolerance. Regional differences in climate and antecedent agriculture also accounted for differences in the responses of salt-tolerant diatoms, but differences in the responses of other diatom metrics (% eutraphenic, % sensitive, and % silt tolerant) were best explained by regional differences in soils (mean % clay soils). The effects of urbanization were most readily detected in regions where forest lands were being converted to urban land because agricultural development significantly degraded assemblages before urbanization and made detection of urban effects difficult. The effects of climatic factors (temperature, precipitation) on background conditions (biogeographic differences) and rates of response to urbanization were most apparent after accounting for the effects of agricultural development. The effects of climate and land cover on responses to urbanization provide strong evidence that monitoring, mitigation, and restoration efforts must be tailored for specific regions and that attainment goals (background conditions) may not be possible in regions with high levels of prior disturbance (e.g., agricultural development). ?? 2011 by The North American
Spontaneous Regression of Choroidal Neovascularization in a Patient with Pattern Dystrophy
Directory of Open Access Journals (Sweden)
Anastasios Anastasakis
2016-01-01
Full Text Available Purpose. To present a case of a patient with pattern dystrophy (PD associated choroidal neovascularization (CNV that resolved spontaneously without treatment. Methods. A 69-year-old male patient was referred to our unit, for evaluation of a recent visual loss (metamorphopsias in his left eye. Fundus examination, fundus autofluorescence imaging, and fluorescein angiography showed a choroidal neovascular membrane in his left eye. Since visual acuity was satisfactory the patient elected observation. Clinical examination and OCT testing were repeated at 6 and 12 months after presentation. Results. Visual acuity remained stable at the level of 0.9 (baseline BCVA during the follow-up period (12 months. Repeat OCT testing showed complete spontaneous regression of the choroidal neovascular membrane without evidence of intra- or subretinal fluid in both follow-up visits. Conclusions. Spontaneous regression of choroidal neovascularization can occur in patients with retinal dystrophies and associated choroidal neovascular membranes. The decision to treat or observe these patients relies strongly on the presenting visual acuity, since, in isolated instances, spontaneous resolution of choroidal neovascularization may occur.
Describing Growth Pattern of Bali Cows Using Non-linear Regression Models
Directory of Open Access Journals (Sweden)
Mohd. Hafiz A.W
2016-12-01
Full Text Available The objective of this study was to evaluate the best fit non-linear regression model to describe the growth pattern of Bali cows. Estimates of asymptotic mature weight, rate of maturing and constant of integration were derived from Brody, von Bertalanffy, Gompertz and Logistic models which were fitted to cross-sectional data of body weight taken from 74 Bali cows raised in MARDI Research Station Muadzam Shah Pahang. Coefficient of determination (R2 and residual mean squares (MSE were used to determine the best fit model in describing the growth pattern of Bali cows. Von Bertalanffy model was the best model among the four growth functions evaluated to determine the mature weight of Bali cattle as shown by the highest R2 and lowest MSE values (0.973 and 601.9, respectively, followed by Gompertz (0.972 and 621.2, respectively, Logistic (0.971 and 648.4, respectively and Brody (0.932 and 660.5, respectively models. The correlation between rate of maturing and mature weight was found to be negative in the range of -0.170 to -0.929 for all models, indicating that animals of heavier mature weight had lower rate of maturing. The use of non-linear model could summarize the weight-age relationship into several biologically interpreted parameters compared to the entire lifespan weight-age data points that are difficult and time consuming to interpret.
High-dimensional data in economics and their (robust) analysis
Czech Academy of Sciences Publication Activity Database
Kalina, Jan
2017-01-01
Roč. 12, č. 1 (2017), s. 171-183 ISSN 1452-4864 R&D Projects: GA ČR GA17-07384S Institutional support: RVO:67985556 Keywords : econometrics * high-dimensional data * dimensionality reduction * linear regression * classification analysis * robustness Subject RIV: BA - General Mathematics OBOR OECD: Business and management http://library.utia.cas.cz/separaty/2017/SI/kalina-0474076.pdf
High-dimensional Data in Economics and their (Robust) Analysis
Czech Academy of Sciences Publication Activity Database
Kalina, Jan
2017-01-01
Roč. 12, č. 1 (2017), s. 171-183 ISSN 1452-4864 R&D Projects: GA ČR GA17-07384S Grant - others:GA ČR(CZ) GA13-01930S Institutional support: RVO:67985807 Keywords : econometrics * high-dimensional data * dimensionality reduction * linear regression * classification analysis * robustness Subject RIV: BB - Applied Statistics, Operational Research OBOR OECD: Statistics and probability
HSM: Heterogeneous Subspace Mining in High Dimensional Data
DEFF Research Database (Denmark)
Müller, Emmanuel; Assent, Ira; Seidl, Thomas
2009-01-01
Heterogeneous data, i.e. data with both categorical and continuous values, is common in many databases. However, most data mining algorithms assume either continuous or categorical attributes, but not both. In high dimensional data, phenomena due to the "curse of dimensionality" pose additional...... challenges. Usually, due to locally varying relevance of attributes, patterns do not show across the full set of attributes. In this paper we propose HSM, which defines a new pattern model for heterogeneous high dimensional data. It allows data mining in arbitrary subsets of the attributes that are relevant...... for the respective patterns. Based on this model we propose an efficient algorithm, which is aware of the heterogeneity of the attributes. We extend an indexing structure for continuous attributes such that HSM indexing adapts to different attribute types. In our experiments we show that HSM efficiently mines...
Livingstone, Katherine M; McNaughton, Sarah A
2017-01-01
Evidence linking dietary patterns (DP) and obesity and hypertension prevalence is inconsistent. We aimed to identify DP derived from energy density, fibre and sugar intakes, as well as Na, K, fibre, SFA and PUFA, and investigate associations with obesity and hypertension. Adults (n 4908) were included from the cross-sectional Australian Health Survey 2011-2013. Two 24-h dietary recalls estimated food and nutrient intakes. Reduced rank regression derived DP with dietary energy density (DED), fibre density and total sugar intake as response variables for obesity and Na:K, SFA:PUFA and fibre density as variables for hypertension. Poisson regression investigated relationships between DP and prevalence ratios (PR) of overweight/obesity (BMI≥25 kg/m2) and hypertension (blood pressure≥140/90 mmHg). Obesity-DP1 was positively correlated with fibre density and sugars and inversely with DED. Obesity-DP2 was positively correlated with sugars and inversely with fibre density. Individuals in the highest tertile of Obesity-DP1 and Obesity-DP2, compared with the lowest, had lower (PR 0·88; 95 % CI 0·81, 0·95) and higher (PR 1·09; 95 % CI 1·01, 1·18) prevalence of obesity, respectively. Na:K and SFA:PUFA were positively correlated with Hypertension-DP1 and inversely correlated with Hypertension-DP2, respectively. There was a trend towards higher hypertension prevalence in the highest tertile of Hypertension-DP1 compared with the lowest (PR 1·18; 95 % CI 0·99, 1·41). Hypertension-DP2 was not associated with hypertension. Obesity prevalence was inversely associated with low-DED, high-fibre and high-sugar (natural sugars) diets and positively associated with low-fibre and high-sugar (added sugars) diets. Hypertension prevalence was higher on low-fibre and high-Na and SFA diets.
Modeling High-Dimensional Multichannel Brain Signals
Hu, Lechuan; Fortin, Norbert J.; Ombao, Hernando
2017-01-01
aspects: first, there are major statistical and computational challenges for modeling and analyzing high-dimensional multichannel brain signals; second, there is no set of universally agreed measures for characterizing connectivity. To model multichannel
Regression Patterns of Iris Melanoma after Palladium-103 (103Pd) Plaque Brachytherapy.
Chaugule, Sonal S; Finger, Paul T
2017-07-01
To evaluate the patterns of regression of iris melanoma after treatment with palladium-103 ( 103 Pd) plaque brachytherapy. Retrospective, nonrandomized, interventional case series. Fifty patients with primary malignant melanoma of the iris. Palladium-103 plaque brachytherapy. Changes in tumor size, pigmentation, and vascularity; incidence of iris neovascularization; and radiation-related complications. The mean age in the case series was 61.2±14.9 years. The mean tumor thickness was 1.4±0.6 mm. According to the American Joint Committee on Cancer, eighth edition, staging criteria for iris melanoma, 21 tumors (42%) were T1a, 5 tumors (10%) were T1b, and 24 tumors (48%) were T2a. The tumor was melanotic in 37 cases (74%) and amelanotic in 13 cases (26%); of these, 13 tumors (26%) showed variable pigmentation. After brachytherapy, mean tumor thickness decreased to 0.9±0.2 mm. Pigmentation increased in 32 tumors (64%), decreased in 11 tumors (22%), and was unchanged in 6 tumors (12%). For intrinsic vascularity (n = 19), 12 tumors (63%) showed decrease and 7 tumors (37%) showed complete resolution. Appearance of ectropion uveae showed diminution in 15 tumors (43%); newly present corectopia was observed in 6 patients (12%). On high-frequency ultrasound imaging, of the 42 tumors (84%) with low to moderate internal reflectivity, 30 tumors (60%) showed an increase in internal reflectivity on regression. Iris stromal atrophy was noted in 26 patients (52%), progression or new-onset cataract was noted in 22 patients (44%), neovascular glaucoma was noted in 1 patient (2%), and there were no cases of corneal opacity. There was no clinical evidence (0%) of radiation-induced retinopathy, maculopathy, or optic neuropathy. Mean follow-up in this series was 5.2 years (range, 0.5-17 years). The most common findings related to iris melanoma regression after 103 Pd plaque brachytherapy included decreased intrinsic tumor vascularity, increased tumor pigmentation, and decreased tumor
REGRESSION ANALYSIS OF SEA-SURFACE-TEMPERATURE PATTERNS FOR THE NORTH PACIFIC OCEAN.
SEA WATER, *SURFACE TEMPERATURE, *OCEANOGRAPHIC DATA, PACIFIC OCEAN, REGRESSION ANALYSIS , STATISTICAL ANALYSIS, UNDERWATER EQUIPMENT, DETECTION, UNDERWATER COMMUNICATIONS, DISTRIBUTION, THERMAL PROPERTIES, COMPUTERS.
High dimensional neurocomputing growth, appraisal and applications
Tripathi, Bipin Kumar
2015-01-01
The book presents a coherent understanding of computational intelligence from the perspective of what is known as "intelligent computing" with high-dimensional parameters. It critically discusses the central issue of high-dimensional neurocomputing, such as quantitative representation of signals, extending the dimensionality of neuron, supervised and unsupervised learning and design of higher order neurons. The strong point of the book is its clarity and ability of the underlying theory to unify our understanding of high-dimensional computing where conventional methods fail. The plenty of application oriented problems are presented for evaluating, monitoring and maintaining the stability of adaptive learning machine. Author has taken care to cover the breadth and depth of the subject, both in the qualitative as well as quantitative way. The book is intended to enlighten the scientific community, ranging from advanced undergraduates to engineers, scientists and seasoned researchers in computational intelligenc...
Asymptotically Honest Confidence Regions for High Dimensional
DEFF Research Database (Denmark)
Caner, Mehmet; Kock, Anders Bredahl
While variable selection and oracle inequalities for the estimation and prediction error have received considerable attention in the literature on high-dimensional models, very little work has been done in the area of testing and construction of confidence bands in high-dimensional models. However...... develop an oracle inequality for the conservative Lasso only assuming the existence of a certain number of moments. This is done by means of the Marcinkiewicz-Zygmund inequality which in our context provides sharper bounds than Nemirovski's inequality. As opposed to van de Geer et al. (2014) we allow...
Vermeulen, E.; Stronks, K.; Visser, M.; Brouwer, I. A.; Snijder, M. B.; Mocking, R. J. T.; Derks, E. M.; Schene, A. H.; Nicolaou, M.
2017-01-01
BACKGROUND/OBJECTIVES: To investigate the association of dietary patterns derived by reduced rank regression (RRR) with depressive symptoms in a multi-ethnic population. SUBJECTS/METHODS: Cross-sectional data from the HELIUS study were used. In total, 4967 men and women (18-70 years) of Dutch,
Vermeulen, E.; Stronks, K.; Visser, M.; Brouwer, I. A.; Snijder, M. B.; Mocking, R. J.T.; Derks, E. M.; Schene, A. H.; Nicolaou, M.
BACKGROUND/OBJECTIVES: To investigate the association of dietary patterns derived by reduced rank regression (RRR) with depressive symptoms in a multi-ethnic population. SUBJECTS/METHODS: Cross-sectional data from the HELIUS study were used. In total, 4967 men and women (18-70 years) of Dutch,
Data analysis in high-dimensional sparse spaces
DEFF Research Database (Denmark)
Clemmensen, Line Katrine Harder
classification techniques for high-dimensional problems are presented: Sparse discriminant analysis, sparse mixture discriminant analysis and orthogonality constrained support vector machines. The first two introduces sparseness to the well known linear and mixture discriminant analysis and thereby provide low...... are applied to classifications of fish species, ear canal impressions used in the hearing aid industry, microbiological fungi species, and various cancerous tissues and healthy tissues. In addition, novel applications of sparse regressions (also called the elastic net) to the medical, concrete, and food...
Batis, Carolina; Mendez, Michelle A; Gordon-Larsen, Penny; Sotres-Alvarez, Daniela; Adair, Linda; Popkin, Barry
2016-02-01
We examined the association between dietary patterns and diabetes using the strengths of two methods: principal component analysis (PCA) to identify the eating patterns of the population and reduced rank regression (RRR) to derive a pattern that explains the variation in glycated Hb (HbA1c), homeostasis model assessment of insulin resistance (HOMA-IR) and fasting glucose. We measured diet over a 3 d period with 24 h recalls and a household food inventory in 2006 and used it to derive PCA and RRR dietary patterns. The outcomes were measured in 2009. Adults (n 4316) from the China Health and Nutrition Survey. The adjusted odds ratio for diabetes prevalence (HbA1c≥6·5 %), comparing the highest dietary pattern score quartile with the lowest, was 1·26 (95 % CI 0·76, 2·08) for a modern high-wheat pattern (PCA; wheat products, fruits, eggs, milk, instant noodles and frozen dumplings), 0·76 (95 % CI 0·49, 1·17) for a traditional southern pattern (PCA; rice, meat, poultry and fish) and 2·37 (95 % CI 1·56, 3·60) for the pattern derived with RRR. By comparing the dietary pattern structures of RRR and PCA, we found that the RRR pattern was also behaviourally meaningful. It combined the deleterious effects of the modern high-wheat pattern (high intakes of wheat buns and breads, deep-fried wheat and soya milk) with the deleterious effects of consuming the opposite of the traditional southern pattern (low intakes of rice, poultry and game, fish and seafood). Our findings suggest that using both PCA and RRR provided useful insights when studying the association of dietary patterns with diabetes.
Cardoso, Walcir
2001-01-01
Offers an optimality theoretic account for the phonological process of across-word regressive assimilation (AWRA) in Picard, a Gallo-Romance dialect spoken in the Picardie region in Northern France and Southern Belgium. Focuses on the varieties spoken in the Vimeu region of France. Examines one particular topic in the analysis of AWRA: the…
Directory of Open Access Journals (Sweden)
Laura K. Frank
2015-07-01
Full Text Available Reduced rank regression (RRR is an innovative technique to establish dietary patterns related to biochemical risk factors for type 2 diabetes, but has not been applied in sub-Saharan Africa. In a hospital-based case-control study for type 2 diabetes in Kumasi (diabetes cases, 538; controls, 668 dietary intake was assessed by a specific food frequency questionnaire. After random split of our study population, we derived a dietary pattern in the training set using RRR with adiponectin, HDL-cholesterol and triglycerides as responses and 35 food items as predictors. This pattern score was applied to the validation set, and its association with type 2 diabetes was examined by logistic regression. The dietary pattern was characterized by a high consumption of plantain, cassava, and garden egg, and a low intake of rice, juice, vegetable oil, eggs, chocolate drink, sweets, and red meat; the score correlated positively with serum triglycerides and negatively with adiponectin. The multivariate-adjusted odds ratio of type 2 diabetes for the highest quintile compared to the lowest was 4.43 (95% confidence interval: 1.87–10.50, p for trend < 0.001. The identified dietary pattern increases the odds of type 2 diabetes in urban Ghanaians, which is mainly attributed to increased serum triglycerides.
International Nuclear Information System (INIS)
Hans, F.J.; Reinges, M.H.T.; Krings, T.; Mull, M.
2004-01-01
We report on a patient with fibromuscular dysplasia who presented with a right-sided giant calcified cavernous internal carotid artery (ICA) aneurysm and two additional supraophthalmic ICA aneurysms. Endovascular closure of the right ICA using detachable balloons was performed with collateralisation of the right hemisphere via the right-sided posterior communicating and the anterior communicating arteries. Repeat angiography after 6 months demonstrated spontaneous complete regression of the two supraophthalmic aneurysms, although the parent vessel was still perfused. In comparison to the former angiography, the flow within the parent vessel was reversed due to the proximal ICA balloon occlusion. MRI demonstrated that the aneurysms were not obliterated by thrombosis alone, but showed a real regression in size. This case report demonstrates that changes in cerebral hemodynamics potentially lead to plastic changes in the vessel architecture in adults and that aneurysms can be flow-related, even if not associated with high flow fistulas or arteriovenous malformations, especially in cases with an arterial wall disease. (orig.)
Introduction to high-dimensional statistics
Giraud, Christophe
2015-01-01
Ever-greater computing technologies have given rise to an exponentially growing volume of data. Today massive data sets (with potentially thousands of variables) play an important role in almost every branch of modern human activity, including networks, finance, and genetics. However, analyzing such data has presented a challenge for statisticians and data analysts and has required the development of new statistical methods capable of separating the signal from the noise.Introduction to High-Dimensional Statistics is a concise guide to state-of-the-art models, techniques, and approaches for ha
Estimating High-Dimensional Time Series Models
DEFF Research Database (Denmark)
Medeiros, Marcelo C.; Mendes, Eduardo F.
We study the asymptotic properties of the Adaptive LASSO (adaLASSO) in sparse, high-dimensional, linear time-series models. We assume both the number of covariates in the model and candidate variables can increase with the number of observations and the number of candidate variables is, possibly......, larger than the number of observations. We show the adaLASSO consistently chooses the relevant variables as the number of observations increases (model selection consistency), and has the oracle property, even when the errors are non-Gaussian and conditionally heteroskedastic. A simulation study shows...
High dimensional classifiers in the imbalanced case
DEFF Research Database (Denmark)
Bak, Britta Anker; Jensen, Jens Ledet
We consider the binary classification problem in the imbalanced case where the number of samples from the two groups differ. The classification problem is considered in the high dimensional case where the number of variables is much larger than the number of samples, and where the imbalance leads...... to a bias in the classification. A theoretical analysis of the independence classifier reveals the origin of the bias and based on this we suggest two new classifiers that can handle any imbalance ratio. The analytical results are supplemented by a simulation study, where the suggested classifiers in some...
Topology of high-dimensional manifolds
Energy Technology Data Exchange (ETDEWEB)
Farrell, F T [State University of New York, Binghamton (United States); Goettshe, L [Abdus Salam ICTP, Trieste (Italy); Lueck, W [Westfaelische Wilhelms-Universitaet Muenster, Muenster (Germany)
2002-08-15
The School on High-Dimensional Manifold Topology took place at the Abdus Salam ICTP, Trieste from 21 May 2001 to 8 June 2001. The focus of the school was on the classification of manifolds and related aspects of K-theory, geometry, and operator theory. The topics covered included: surgery theory, algebraic K- and L-theory, controlled topology, homology manifolds, exotic aspherical manifolds, homeomorphism and diffeomorphism groups, and scalar curvature. The school consisted of 2 weeks of lecture courses and one week of conference. Thwo-part lecture notes volume contains the notes of most of the lecture courses.
Modeling high dimensional multichannel brain signals
Hu, Lechuan
2017-03-27
In this paper, our goal is to model functional and effective (directional) connectivity in network of multichannel brain physiological signals (e.g., electroencephalograms, local field potentials). The primary challenges here are twofold: first, there are major statistical and computational difficulties for modeling and analyzing high dimensional multichannel brain signals; second, there is no set of universally-agreed measures for characterizing connectivity. To model multichannel brain signals, our approach is to fit a vector autoregressive (VAR) model with sufficiently high order so that complex lead-lag temporal dynamics between the channels can be accurately characterized. However, such a model contains a large number of parameters. Thus, we will estimate the high dimensional VAR parameter space by our proposed hybrid LASSLE method (LASSO+LSE) which is imposes regularization on the first step (to control for sparsity) and constrained least squares estimation on the second step (to improve bias and mean-squared error of the estimator). Then to characterize connectivity between channels in a brain network, we will use various measures but put an emphasis on partial directed coherence (PDC) in order to capture directional connectivity between channels. PDC is a directed frequency-specific measure that explains the extent to which the present oscillatory activity in a sender channel influences the future oscillatory activity in a specific receiver channel relative all possible receivers in the network. Using the proposed modeling approach, we have achieved some insights on learning in a rat engaged in a non-spatial memory task.
Modeling high dimensional multichannel brain signals
Hu, Lechuan; Fortin, Norbert; Ombao, Hernando
2017-01-01
In this paper, our goal is to model functional and effective (directional) connectivity in network of multichannel brain physiological signals (e.g., electroencephalograms, local field potentials). The primary challenges here are twofold: first, there are major statistical and computational difficulties for modeling and analyzing high dimensional multichannel brain signals; second, there is no set of universally-agreed measures for characterizing connectivity. To model multichannel brain signals, our approach is to fit a vector autoregressive (VAR) model with sufficiently high order so that complex lead-lag temporal dynamics between the channels can be accurately characterized. However, such a model contains a large number of parameters. Thus, we will estimate the high dimensional VAR parameter space by our proposed hybrid LASSLE method (LASSO+LSE) which is imposes regularization on the first step (to control for sparsity) and constrained least squares estimation on the second step (to improve bias and mean-squared error of the estimator). Then to characterize connectivity between channels in a brain network, we will use various measures but put an emphasis on partial directed coherence (PDC) in order to capture directional connectivity between channels. PDC is a directed frequency-specific measure that explains the extent to which the present oscillatory activity in a sender channel influences the future oscillatory activity in a specific receiver channel relative all possible receivers in the network. Using the proposed modeling approach, we have achieved some insights on learning in a rat engaged in a non-spatial memory task.
Directory of Open Access Journals (Sweden)
Giuseppe Perinetti
2016-01-01
Full Text Available The knowledge of the associations between the timing of skeletal maturation and craniofacial growth is of primary importance when planning a functional treatment for most of the skeletal malocclusions. This cross-sectional study was thus aimed at evaluating whether sagittal and vertical craniofacial growth has an association with the timing of circumpubertal skeletal maturation. A total of 320 subjects (160 females and 160 males were included in the study (mean age, 12.3±1.7 years; range, 7.6–16.7 years. These subjects were equally distributed in the circumpubertal cervical vertebral maturation (CVM stages 2 to 5. Each CVM stage group also had equal number of females and males. Multiple regression models were run for each CVM stage group to assess the significance of the association of cephalometric parameters (ANB, SN/MP, and NSBa angles with age of attainment of the corresponding CVM stage (in months. Significant associations were seen only for stage 3, where the SN/MP angle was negatively associated with age (β coefficient, −0.7. These results show that hyperdivergent and hypodivergent subjects may have an anticipated and delayed attainment of the pubertal CVM stage 3, respectively. However, such association remains of little entity and it would become clinically relevant only in extreme cases.
Rosso, Luigi; Riatti, Riccardo
2016-01-01
The knowledge of the associations between the timing of skeletal maturation and craniofacial growth is of primary importance when planning a functional treatment for most of the skeletal malocclusions. This cross-sectional study was thus aimed at evaluating whether sagittal and vertical craniofacial growth has an association with the timing of circumpubertal skeletal maturation. A total of 320 subjects (160 females and 160 males) were included in the study (mean age, 12.3 ± 1.7 years; range, 7.6–16.7 years). These subjects were equally distributed in the circumpubertal cervical vertebral maturation (CVM) stages 2 to 5. Each CVM stage group also had equal number of females and males. Multiple regression models were run for each CVM stage group to assess the significance of the association of cephalometric parameters (ANB, SN/MP, and NSBa angles) with age of attainment of the corresponding CVM stage (in months). Significant associations were seen only for stage 3, where the SN/MP angle was negatively associated with age (β coefficient, −0.7). These results show that hyperdivergent and hypodivergent subjects may have an anticipated and delayed attainment of the pubertal CVM stage 3, respectively. However, such association remains of little entity and it would become clinically relevant only in extreme cases. PMID:27995136
Vermeulen, E; Stronks, K; Visser, M; Brouwer, I A; Snijder, M B; Mocking, R J T; Derks, E M; Schene, A H; Nicolaou, M
2017-08-01
To investigate the association of dietary patterns derived by reduced rank regression (RRR) with depressive symptoms in a multi-ethnic population. Cross-sectional data from the HELIUS study were used. In total, 4967 men and women (18-70 years) of Dutch, South-Asian Surinamese, African Surinamese, Turkish and Moroccan origin living in the Netherlands were included. Diet was measured using ethnic-specific food frequency questionnaires. Depressive symptoms were measured with the nine-item patient health questionnaire. By performing RRR in the whole population and per ethnic group, comparable dietary patterns were identified and therefore the dietary pattern for the whole population was used for subsequent analyses. We identified a dietary pattern that was strongly related to eicosapentaenoic acid+docosahexaenoic acid, folate, magnesium and zinc (response variables) and which was characterized by milk products, cheese, whole grains, vegetables, legumes, nuts, potatoes and red meat. After adjustment for confounders, a statistically significant inverse association was observed in the whole population (B: -0.03, 95% CI: -0.06, -0.00, P=0.046) and among Moroccan (B: -0.09, 95% CI: -0.13, -0.04, P=0.027) and South-Asian Surinamese participants (B: -0.05, 95% CI: -0.09, -0.01, P=dietary pattern and significant depressed mood in any of the ethnic groups. No consistent evidence was found that consumption of a dietary pattern, high in nutrients that are hypothesized to protect against depression, was associated with lower depressive symptoms across different ethnic groups.
Verachtert, E.; Den Eeckhaut, M. Van; Poesen, J.; Govers, G.; Deckers, J.
2011-07-01
Soil piping (tunnel erosion) has been recognised as an important erosion process in collapsible loess-derived soils of temperate humid climates, which can cause collapse of the topsoil and formation of discontinuous gullies. Information about the spatial patterns of collapsed pipes and regional models describing these patterns is still limited. Therefore, this study aims at better understanding the factors controlling the spatial distribution and predicting pipe collapse. A dataset with parcels suffering from collapsed pipes (n = 560) and parcels without collapsed pipes was obtained through a regional survey in a 236 km² study area in the Flemish Ardennes (Belgium). Logistic regression was applied to find the best model describing the relationship between the presence/absence of a collapsed pipe and a set of independent explanatory variables (i.e. slope gradient, drainage area, distance-to-thalweg, curvature, aspect, soil type and lithology). Special attention was paid to the selection procedure of the grid cells without collapsed pipes. Apart from the first piping susceptibility map created by logistic regression modelling, a second map was made based on topographical thresholds of slope gradient and upslope drainage area. The logistic regression model allowed identification of the most important factors controlling pipe collapse. Pipes are much more likely to occur when a topographical threshold depending on both slope gradient and upslope area is exceeded in zones with a sufficient water supply (due to topographical convergence and/or the presence of a clay-rich lithology). On the other hand, the use of slope-area thresholds only results in reasonable predictions of piping susceptibility, with minimum information.
Bayesian Subset Modeling for High-Dimensional Generalized Linear Models
Liang, Faming
2013-06-01
This article presents a new prior setting for high-dimensional generalized linear models, which leads to a Bayesian subset regression (BSR) with the maximum a posteriori model approximately equivalent to the minimum extended Bayesian information criterion model. The consistency of the resulting posterior is established under mild conditions. Further, a variable screening procedure is proposed based on the marginal inclusion probability, which shares the same properties of sure screening and consistency with the existing sure independence screening (SIS) and iterative sure independence screening (ISIS) procedures. However, since the proposed procedure makes use of joint information from all predictors, it generally outperforms SIS and ISIS in real applications. This article also makes extensive comparisons of BSR with the popular penalized likelihood methods, including Lasso, elastic net, SIS, and ISIS. The numerical results indicate that BSR can generally outperform the penalized likelihood methods. The models selected by BSR tend to be sparser and, more importantly, of higher prediction ability. In addition, the performance of the penalized likelihood methods tends to deteriorate as the number of predictors increases, while this is not significant for BSR. Supplementary materials for this article are available online. © 2013 American Statistical Association.
GAMLSS for high-dimensional data – a flexible approach based on boosting
Mayr, Andreas; Fenske, Nora; Hofner, Benjamin; Kneib, Thomas; Schmid, Matthias
2010-01-01
Generalized additive models for location, scale and shape (GAMLSS) are a popular semi-parametric modelling approach that, in contrast to conventional GAMs, regress not only the expected mean but every distribution parameter (e.g. location, scale and shape) to a set of covariates. Current fitting procedures for GAMLSS are infeasible for high-dimensional data setups and require variable selection based on (potentially problematic) information criteria. The present work describes a boosting algo...
Multivariate statistics high-dimensional and large-sample approximations
Fujikoshi, Yasunori; Shimizu, Ryoichi
2010-01-01
A comprehensive examination of high-dimensional analysis of multivariate methods and their real-world applications Multivariate Statistics: High-Dimensional and Large-Sample Approximations is the first book of its kind to explore how classical multivariate methods can be revised and used in place of conventional statistical tools. Written by prominent researchers in the field, the book focuses on high-dimensional and large-scale approximations and details the many basic multivariate methods used to achieve high levels of accuracy. The authors begin with a fundamental presentation of the basic
Hierarchical low-rank approximation for high dimensional approximation
Nouy, Anthony
2016-01-01
Tensor methods are among the most prominent tools for the numerical solution of high-dimensional problems where functions of multiple variables have to be approximated. Such high-dimensional approximation problems naturally arise in stochastic analysis and uncertainty quantification. In many practical situations, the approximation of high-dimensional functions is made computationally tractable by using rank-structured approximations. In this talk, we present algorithms for the approximation in hierarchical tensor format using statistical methods. Sparse representations in a given tensor format are obtained with adaptive or convex relaxation methods, with a selection of parameters using crossvalidation methods.
Hierarchical low-rank approximation for high dimensional approximation
Nouy, Anthony
2016-01-07
Tensor methods are among the most prominent tools for the numerical solution of high-dimensional problems where functions of multiple variables have to be approximated. Such high-dimensional approximation problems naturally arise in stochastic analysis and uncertainty quantification. In many practical situations, the approximation of high-dimensional functions is made computationally tractable by using rank-structured approximations. In this talk, we present algorithms for the approximation in hierarchical tensor format using statistical methods. Sparse representations in a given tensor format are obtained with adaptive or convex relaxation methods, with a selection of parameters using crossvalidation methods.
Energy Technology Data Exchange (ETDEWEB)
Schmid, Maximilian P.; Fidarova, Elena [Dept. of Radiotherapy, Comprehensive Cancer Center, Medical Univ. of Vienna, Vienna (Austria)], e-mail: maximilian.schmid@akhwien.at; Poetter, Richard [Dept. of Radiotherapy, Comprehensive Cancer Center, Medical Univ. of Vienna, Vienna (Austria); Christian Doppler Lab. for Medical Radiation Research for Radiation Oncology, Medical Univ. of Vienna (Austria)] [and others
2013-10-15
Purpose: To investigate the impact of magnetic resonance imaging (MRI)-morphologic differences in parametrial infiltration on tumour response during primary radio chemotherapy in cervical cancer. Material and methods: Eighty-five consecutive cervical cancer patients with FIGO stages IIB (n = 59) and IIIB (n = 26), treated by external beam radiotherapy ({+-}chemotherapy) and image-guided adaptive brachytherapy, underwent T2-weighted MRI at the time of diagnosis and at the time of brachytherapy. MRI patterns of parametrial tumour infiltration at the time of diagnosis were assessed with regard to predominant morphology and maximum extent of parametrial tumour infiltration and were stratified into five tumour groups (TG): 1) expansive with spiculae; 2) expansive with spiculae and infiltrating parts; 3) infiltrative into the inner third of the parametrial space (PM); 4) infiltrative into the middle third of the PM; and 5) infiltrative into the outer third of the PM. MRI at the time of brachytherapy was used for identifying presence (residual vs. no residual disease) and signal intensity (high vs. intermediate) of residual disease within the PM. Left and right PM of each patient were evaluated separately at both time points. The impact of the TG on tumour remission status within the PM was analysed using {chi}2-test and logistic regression analysis. Results: In total, 170 PM were analysed. The TG 1, 2, 3, 4, 5 were present in 12%, 11%, 35%, 25% and 12% of the cases, respectively. Five percent of the PM were tumour-free. Residual tumour in the PM was identified in 19%, 68%, 88%, 90% and 85% of the PM for the TG 1, 2, 3, 4, and 5, respectively. The TG 3 - 5 had significantly higher rates of residual tumour in the PM in comparison to TG 1 + 2 (88% vs. 43%, p < 0.01). Conclusion: MRI-morphologic features of PM infiltration appear to allow for prediction of tumour response during external beam radiotherapy and chemotherapy. A predominantly infiltrative tumour spread at the
Vermeulen, E.; Stronks, K.; Visser, M de; Brouwer, I.A.; Schene, A.H.; Mocking, R.J.T.; Colpo, M.; Bandinelli, S.; Ferrucci, L.; Nicolaou, M.
2016-01-01
This study aimed to identify dietary patterns using reduced rank regression (RRR) and to explore their associations with depressive symptoms over 9 years in the Invecchiare in Chianti study. At baseline, 1362 participants (55.4 % women) aged 18-102 years (mean age 68 (sd 15.5) years) were included
Vermeulen, Esther; Stronks, Karien; Visser, Marjolein; Brouwer, Ingeborg A; Schene, Aart H; Mocking, Roel J T; Colpo, Marco; Bandinelli, Stefania; Ferrucci, Luigi; Nicolaou, Mary
This study aimed to identify dietary patterns using reduced rank regression (RRR) and to explore their associations with depressive symptoms over 9 years in the Invecchiare in Chianti study. At baseline, 1362 participants (55·4 % women) aged 18-102 years (mean age 68 (sd 15·5) years) were included
Sparse Regression by Projection and Sparse Discriminant Analysis
Qi, Xin; Luo, Ruiyan; Carroll, Raymond J.; Zhao, Hongyu
2015-01-01
predictions. We introduce a new framework, regression by projection, and its sparse version to analyze high-dimensional data. The unique nature of this framework is that the directions of the regression coefficients are inferred first, and the lengths
Non-intrusive low-rank separated approximation of high-dimensional stochastic models
Doostan, Alireza; Validi, AbdoulAhad; Iaccarino, Gianluca
2013-01-01
This work proposes a sampling-based (non-intrusive) approach within the context of low-. rank separated representations to tackle the issue of curse-of-dimensionality associated with the solution of models, e.g., PDEs/ODEs, with high-dimensional random inputs. Under some conditions discussed in details, the number of random realizations of the solution, required for a successful approximation, grows linearly with respect to the number of random inputs. The construction of the separated representation is achieved via a regularized alternating least-squares regression, together with an error indicator to estimate model parameters. The computational complexity of such a construction is quadratic in the number of random inputs. The performance of the method is investigated through its application to three numerical examples including two ODE problems with high-dimensional random inputs. © 2013 Elsevier B.V.
Non-intrusive low-rank separated approximation of high-dimensional stochastic models
Doostan, Alireza
2013-08-01
This work proposes a sampling-based (non-intrusive) approach within the context of low-. rank separated representations to tackle the issue of curse-of-dimensionality associated with the solution of models, e.g., PDEs/ODEs, with high-dimensional random inputs. Under some conditions discussed in details, the number of random realizations of the solution, required for a successful approximation, grows linearly with respect to the number of random inputs. The construction of the separated representation is achieved via a regularized alternating least-squares regression, together with an error indicator to estimate model parameters. The computational complexity of such a construction is quadratic in the number of random inputs. The performance of the method is investigated through its application to three numerical examples including two ODE problems with high-dimensional random inputs. © 2013 Elsevier B.V.
Statistical Analysis for High-Dimensional Data : The Abel Symposium 2014
Bühlmann, Peter; Glad, Ingrid; Langaas, Mette; Richardson, Sylvia; Vannucci, Marina
2016-01-01
This book features research contributions from The Abel Symposium on Statistical Analysis for High Dimensional Data, held in Nyvågar, Lofoten, Norway, in May 2014. The focus of the symposium was on statistical and machine learning methodologies specifically developed for inference in “big data” situations, with particular reference to genomic applications. The contributors, who are among the most prominent researchers on the theory of statistics for high dimensional inference, present new theories and methods, as well as challenging applications and computational solutions. Specific themes include, among others, variable selection and screening, penalised regression, sparsity, thresholding, low dimensional structures, computational challenges, non-convex situations, learning graphical models, sparse covariance and precision matrices, semi- and non-parametric formulations, multiple testing, classification, factor models, clustering, and preselection. Highlighting cutting-edge research and casting light on...
Obtaining insights from high-dimensional data : Sparse principal covariates regression
Van Deun, K.; Crompvoets, E.A.V.; Ceulemans, Eva
2018-01-01
Background Data analysis methods are usually subdivided in two distinct classes: There are methods for prediction and there are methods for exploration. In practice, however, there often is a need to learn from the data in both ways. For example, when predicting the antibody titers a few weeks after
Harnessing high-dimensional hyperentanglement through a biphoton frequency comb
Xie, Zhenda; Zhong, Tian; Shrestha, Sajan; Xu, Xinan; Liang, Junlin; Gong, Yan-Xiao; Bienfang, Joshua C.; Restelli, Alessandro; Shapiro, Jeffrey H.; Wong, Franco N. C.; Wei Wong, Chee
2015-08-01
Quantum entanglement is a fundamental resource for secure information processing and communications, and hyperentanglement or high-dimensional entanglement has been separately proposed for its high data capacity and error resilience. The continuous-variable nature of the energy-time entanglement makes it an ideal candidate for efficient high-dimensional coding with minimal limitations. Here, we demonstrate the first simultaneous high-dimensional hyperentanglement using a biphoton frequency comb to harness the full potential in both the energy and time domain. Long-postulated Hong-Ou-Mandel quantum revival is exhibited, with up to 19 time-bins and 96.5% visibilities. We further witness the high-dimensional energy-time entanglement through Franson revivals, observed periodically at integer time-bins, with 97.8% visibility. This qudit state is observed to simultaneously violate the generalized Bell inequality by up to 10.95 standard deviations while observing recurrent Clauser-Horne-Shimony-Holt S-parameters up to 2.76. Our biphoton frequency comb provides a platform for photon-efficient quantum communications towards the ultimate channel capacity through energy-time-polarization high-dimensional encoding.
Analysing spatially extended high-dimensional dynamics by recurrence plots
Energy Technology Data Exchange (ETDEWEB)
Marwan, Norbert, E-mail: marwan@pik-potsdam.de [Potsdam Institute for Climate Impact Research, 14412 Potsdam (Germany); Kurths, Jürgen [Potsdam Institute for Climate Impact Research, 14412 Potsdam (Germany); Humboldt Universität zu Berlin, Institut für Physik (Germany); Nizhny Novgorod State University, Department of Control Theory, Nizhny Novgorod (Russian Federation); Foerster, Saskia [GFZ German Research Centre for Geosciences, Section 1.4 Remote Sensing, Telegrafenberg, 14473 Potsdam (Germany)
2015-05-08
Recurrence plot based measures of complexity are capable tools for characterizing complex dynamics. In this letter we show the potential of selected recurrence plot measures for the investigation of even high-dimensional dynamics. We apply this method on spatially extended chaos, such as derived from the Lorenz96 model and show that the recurrence plot based measures can qualitatively characterize typical dynamical properties such as chaotic or periodic dynamics. Moreover, we demonstrate its power by analysing satellite image time series of vegetation cover with contrasting dynamics as a spatially extended and potentially high-dimensional example from the real world. - Highlights: • We use recurrence plots for analysing partially extended dynamics. • We investigate the high-dimensional chaos of the Lorenz96 model. • The approach distinguishes different spatio-temporal dynamics. • We use the method for studying vegetation cover time series.
Bit-Table Based Biclustering and Frequent Closed Itemset Mining in High-Dimensional Binary Data
Directory of Open Access Journals (Sweden)
András Király
2014-01-01
Full Text Available During the last decade various algorithms have been developed and proposed for discovering overlapping clusters in high-dimensional data. The two most prominent application fields in this research, proposed independently, are frequent itemset mining (developed for market basket data and biclustering (applied to gene expression data analysis. The common limitation of both methodologies is the limited applicability for very large binary data sets. In this paper we propose a novel and efficient method to find both frequent closed itemsets and biclusters in high-dimensional binary data. The method is based on simple but very powerful matrix and vector multiplication approaches that ensure that all patterns can be discovered in a fast manner. The proposed algorithm has been implemented in the commonly used MATLAB environment and freely available for researchers.
Spady, Richard; Stouli, Sami
2012-01-01
We propose dual regression as an alternative to the quantile regression process for the global estimation of conditional distribution functions under minimal assumptions. Dual regression provides all the interpretational power of the quantile regression process while avoiding the need for repairing the intersecting conditional quantile surfaces that quantile regression often produces in practice. Our approach introduces a mathematical programming characterization of conditional distribution f...
Supporting Dynamic Quantization for High-Dimensional Data Analytics.
Guzun, Gheorghi; Canahuate, Guadalupe
2017-05-01
Similarity searches are at the heart of exploratory data analysis tasks. Distance metrics are typically used to characterize the similarity between data objects represented as feature vectors. However, when the dimensionality of the data increases and the number of features is large, traditional distance metrics fail to distinguish between the closest and furthest data points. Localized distance functions have been proposed as an alternative to traditional distance metrics. These functions only consider dimensions close to query to compute the distance/similarity. Furthermore, in order to enable interactive explorations of high-dimensional data, indexing support for ad-hoc queries is needed. In this work we set up to investigate whether bit-sliced indices can be used for exploratory analytics such as similarity searches and data clustering for high-dimensional big-data. We also propose a novel dynamic quantization called Query dependent Equi-Depth (QED) quantization and show its effectiveness on characterizing high-dimensional similarity. When applying QED we observe improvements in kNN classification accuracy over traditional distance functions. Gheorghi Guzun and Guadalupe Canahuate. 2017. Supporting Dynamic Quantization for High-Dimensional Data Analytics. In Proceedings of Ex-ploreDB'17, Chicago, IL, USA, May 14-19, 2017, 6 pages. https://doi.org/http://dx.doi.org/10.1145/3077331.3077336.
A hybridized K-means clustering approach for high dimensional ...
African Journals Online (AJOL)
International Journal of Engineering, Science and Technology ... Due to incredible growth of high dimensional dataset, conventional data base querying methods are inadequate to extract useful information, so researchers nowadays ... Recently cluster analysis is a popularly used data analysis method in number of areas.
On Robust Information Extraction from High-Dimensional Data
Czech Academy of Sciences Publication Activity Database
Kalina, Jan
2014-01-01
Roč. 9, č. 1 (2014), s. 131-144 ISSN 1452-4864 Grant - others:GA ČR(CZ) GA13-01930S Institutional support: RVO:67985807 Keywords : data mining * high-dimensional data * robust econometrics * outliers * machine learning Subject RIV: IN - Informatics, Computer Science
Inference in High-dimensional Dynamic Panel Data Models
DEFF Research Database (Denmark)
Kock, Anders Bredahl; Tang, Haihan
We establish oracle inequalities for a version of the Lasso in high-dimensional fixed effects dynamic panel data models. The inequalities are valid for the coefficients of the dynamic and exogenous regressors. Separate oracle inequalities are derived for the fixed effects. Next, we show how one can...
Pricing High-Dimensional American Options Using Local Consistency Conditions
Berridge, S.J.; Schumacher, J.M.
2004-01-01
We investigate a new method for pricing high-dimensional American options. The method is of finite difference type but is also related to Monte Carlo techniques in that it involves a representative sampling of the underlying variables.An approximating Markov chain is built using this sampling and
Irregular grid methods for pricing high-dimensional American options
Berridge, S.J.
2004-01-01
This thesis proposes and studies numerical methods for pricing high-dimensional American options; important examples being basket options, Bermudan swaptions and real options. Four new methods are presented and analysed, both in terms of their application to various test problems, and in terms of
Directory of Open Access Journals (Sweden)
Fassnacht, S. R.
2012-05-01
Full Text Available The relation between snow water equivalent (SWE and 28 variables (27 topographically-based topographic variables and canopy density for the Colorado River Basin, USA was explored through a multi-variate regression. These variables include location, slope and aspect at different scales, derived variables to indicate the distance to sources of moisture and proximity to and characteristics of obstacles between these moisture sources and areas of snow accumulation, and canopy density. A weekly time step of snow telemetry (SNOTEL SWE data from 1990 through 1999 was used. The most important variables were elevation and regional scale (81 km² slope. Since the seasonal and inter-annual variability is high, a regression relationship should be formulated for each time step. The inter-annual variation in the relation between SWE and topographic variables partially corresponded with the amount of snow accumulated over the season and the El Niño Southern Oscillation cycle.Se analiza la relación entre el equivalente de agua en la nieve (SWE y 28 variables (27 variables topográficas y otra basada en la densidad del dosel para la Cuenca del Río Colorado, EE.UU. mediante regresión multivariante. Estas variables incluyen la localización, pendiente y orientación a diferentes escalas, además de variables derivadas para indicar la distancia a las fuentes de humedad y la proximidad a las barreras topográficas, además de las características de las barreras topográficas entre las fuentes de humedad, las áreas de acumulación de nieve y la densidad del dosel. Se utilizaron telemetrías semanales de nieve (SNOTEL desde 1990 hasta 1999. Las variables más importantes fueron la elevación y la pendiente a escala regional (81 km². Dada la alta variabilidad estacional e interanual, fue necesario establecer regresiones específicas para cada intervalo disponible de datos. La variación interanual en la relación entre variables topográficas y el SWE se
High Dimensional Modulation and MIMO Techniques for Access Networks
DEFF Research Database (Denmark)
Binti Othman, Maisara
Exploration of advanced modulation formats and multiplexing techniques for next generation optical access networks are of interest as promising solutions for delivering multiple services to end-users. This thesis addresses this from two different angles: high dimensionality carrierless...... the capacity per wavelength of the femto-cell network. Bit rate up to 1.59 Gbps with fiber-wireless transmission over 1 m air distance is demonstrated. The results presented in this thesis demonstrate the feasibility of high dimensionality CAP in increasing the number of dimensions and their potentially......) optical access network. 2 X 2 MIMO RoF employing orthogonal frequency division multiplexing (OFDM) with 5.6 GHz RoF signaling over all-vertical cavity surface emitting lasers (VCSEL) WDM passive optical networks (PONs). We have employed polarization division multiplexing (PDM) to further increase...
Analysis of chaos in high-dimensional wind power system.
Wang, Cong; Zhang, Hongli; Fan, Wenhui; Ma, Ping
2018-01-01
A comprehensive analysis on the chaos of a high-dimensional wind power system is performed in this study. A high-dimensional wind power system is more complex than most power systems. An 11-dimensional wind power system proposed by Huang, which has not been analyzed in previous studies, is investigated. When the systems are affected by external disturbances including single parameter and periodic disturbance, or its parameters changed, chaotic dynamics of the wind power system is analyzed and chaotic parameters ranges are obtained. Chaos existence is confirmed by calculation and analysis of all state variables' Lyapunov exponents and the state variable sequence diagram. Theoretical analysis and numerical simulations show that the wind power system chaos will occur when parameter variations and external disturbances change to a certain degree.
HIGH DIMENSIONAL COVARIANCE MATRIX ESTIMATION IN APPROXIMATE FACTOR MODELS.
Fan, Jianqing; Liao, Yuan; Mincheva, Martina
2011-01-01
The variance covariance matrix plays a central role in the inferential theories of high dimensional factor models in finance and economics. Popular regularization methods of directly exploiting sparsity are not directly applicable to many financial problems. Classical methods of estimating the covariance matrices are based on the strict factor models, assuming independent idiosyncratic components. This assumption, however, is restrictive in practical applications. By assuming sparse error covariance matrix, we allow the presence of the cross-sectional correlation even after taking out common factors, and it enables us to combine the merits of both methods. We estimate the sparse covariance using the adaptive thresholding technique as in Cai and Liu (2011), taking into account the fact that direct observations of the idiosyncratic components are unavailable. The impact of high dimensionality on the covariance matrix estimation based on the factor structure is then studied.
Quantifying high dimensional entanglement with two mutually unbiased bases
Directory of Open Access Journals (Sweden)
Paul Erker
2017-07-01
Full Text Available We derive a framework for quantifying entanglement in multipartite and high dimensional systems using only correlations in two unbiased bases. We furthermore develop such bounds in cases where the second basis is not characterized beyond being unbiased, thus enabling entanglement quantification with minimal assumptions. Furthermore, we show that it is feasible to experimentally implement our method with readily available equipment and even conservative estimates of physical parameters.
High dimensional model representation method for fuzzy structural dynamics
Adhikari, S.; Chowdhury, R.; Friswell, M. I.
2011-03-01
Uncertainty propagation in multi-parameter complex structures possess significant computational challenges. This paper investigates the possibility of using the High Dimensional Model Representation (HDMR) approach when uncertain system parameters are modeled using fuzzy variables. In particular, the application of HDMR is proposed for fuzzy finite element analysis of linear dynamical systems. The HDMR expansion is an efficient formulation for high-dimensional mapping in complex systems if the higher order variable correlations are weak, thereby permitting the input-output relationship behavior to be captured by the terms of low-order. The computational effort to determine the expansion functions using the α-cut method scales polynomically with the number of variables rather than exponentially. This logic is based on the fundamental assumption underlying the HDMR representation that only low-order correlations among the input variables are likely to have significant impacts upon the outputs for most high-dimensional complex systems. The proposed method is first illustrated for multi-parameter nonlinear mathematical test functions with fuzzy variables. The method is then integrated with a commercial finite element software (ADINA). Modal analysis of a simplified aircraft wing with fuzzy parameters has been used to illustrate the generality of the proposed approach. In the numerical examples, triangular membership functions have been used and the results have been validated against direct Monte Carlo simulations. It is shown that using the proposed HDMR approach, the number of finite element function calls can be reduced without significantly compromising the accuracy.
High-dimensional quantum cloning and applications to quantum hacking.
Bouchard, Frédéric; Fickler, Robert; Boyd, Robert W; Karimi, Ebrahim
2017-02-01
Attempts at cloning a quantum system result in the introduction of imperfections in the state of the copies. This is a consequence of the no-cloning theorem, which is a fundamental law of quantum physics and the backbone of security for quantum communications. Although perfect copies are prohibited, a quantum state may be copied with maximal accuracy via various optimal cloning schemes. Optimal quantum cloning, which lies at the border of the physical limit imposed by the no-signaling theorem and the Heisenberg uncertainty principle, has been experimentally realized for low-dimensional photonic states. However, an increase in the dimensionality of quantum systems is greatly beneficial to quantum computation and communication protocols. Nonetheless, no experimental demonstration of optimal cloning machines has hitherto been shown for high-dimensional quantum systems. We perform optimal cloning of high-dimensional photonic states by means of the symmetrization method. We show the universality of our technique by conducting cloning of numerous arbitrary input states and fully characterize our cloning machine by performing quantum state tomography on cloned photons. In addition, a cloning attack on a Bennett and Brassard (BB84) quantum key distribution protocol is experimentally demonstrated to reveal the robustness of high-dimensional states in quantum cryptography.
High Dimensional Classification Using Features Annealed Independence Rules.
Fan, Jianqing; Fan, Yingying
2008-01-01
Classification using high-dimensional features arises frequently in many contemporary statistical studies such as tumor classification using microarray or other high-throughput data. The impact of dimensionality on classifications is largely poorly understood. In a seminal paper, Bickel and Levina (2004) show that the Fisher discriminant performs poorly due to diverging spectra and they propose to use the independence rule to overcome the problem. We first demonstrate that even for the independence classification rule, classification using all the features can be as bad as the random guessing due to noise accumulation in estimating population centroids in high-dimensional feature space. In fact, we demonstrate further that almost all linear discriminants can perform as bad as the random guessing. Thus, it is paramountly important to select a subset of important features for high-dimensional classification, resulting in Features Annealed Independence Rules (FAIR). The conditions under which all the important features can be selected by the two-sample t-statistic are established. The choice of the optimal number of features, or equivalently, the threshold value of the test statistics are proposed based on an upper bound of the classification error. Simulation studies and real data analysis support our theoretical results and demonstrate convincingly the advantage of our new classification procedure.
Directory of Open Access Journals (Sweden)
Fabian Horst
Full Text Available Traditionally, gait analysis has been centered on the idea of average behavior and normality. On one hand, clinical diagnoses and therapeutic interventions typically assume that average gait patterns remain constant over time. On the other hand, it is well known that all our movements are accompanied by a certain amount of variability, which does not allow us to make two identical steps. The purpose of this study was to examine changes in the intra-individual gait patterns across different time-scales (i.e., tens-of-mins, tens-of-hours.Nine healthy subjects performed 15 gait trials at a self-selected speed on 6 sessions within one day (duration between two subsequent sessions from 10 to 90 mins. For each trial, time-continuous ground reaction forces and lower body joint angles were measured. A supervised learning model using a kernel-based discriminant regression was applied for classifying sessions within individual gait patterns.Discernable characteristics of intra-individual gait patterns could be distinguished between repeated sessions by classification rates of 67.8 ± 8.8% and 86.3 ± 7.9% for the six-session-classification of ground reaction forces and lower body joint angles, respectively. Furthermore, the one-on-one-classification showed that increasing classification rates go along with increasing time durations between two sessions and indicate that changes of gait patterns appear at different time-scales.Discernable characteristics between repeated sessions indicate continuous intrinsic changes in intra-individual gait patterns and suggest a predominant role of deterministic processes in human motor control and learning. Natural changes of gait patterns without any externally induced injury or intervention may reflect continuous adaptations of the motor system over several time-scales. Accordingly, the modelling of walking by means of average gait patterns that are assumed to be near constant over time needs to be reconsidered in the
Horst, Fabian; Eekhoff, Alexander; Newell, Karl M; Schöllhorn, Wolfgang I
2017-01-01
Traditionally, gait analysis has been centered on the idea of average behavior and normality. On one hand, clinical diagnoses and therapeutic interventions typically assume that average gait patterns remain constant over time. On the other hand, it is well known that all our movements are accompanied by a certain amount of variability, which does not allow us to make two identical steps. The purpose of this study was to examine changes in the intra-individual gait patterns across different time-scales (i.e., tens-of-mins, tens-of-hours). Nine healthy subjects performed 15 gait trials at a self-selected speed on 6 sessions within one day (duration between two subsequent sessions from 10 to 90 mins). For each trial, time-continuous ground reaction forces and lower body joint angles were measured. A supervised learning model using a kernel-based discriminant regression was applied for classifying sessions within individual gait patterns. Discernable characteristics of intra-individual gait patterns could be distinguished between repeated sessions by classification rates of 67.8 ± 8.8% and 86.3 ± 7.9% for the six-session-classification of ground reaction forces and lower body joint angles, respectively. Furthermore, the one-on-one-classification showed that increasing classification rates go along with increasing time durations between two sessions and indicate that changes of gait patterns appear at different time-scales. Discernable characteristics between repeated sessions indicate continuous intrinsic changes in intra-individual gait patterns and suggest a predominant role of deterministic processes in human motor control and learning. Natural changes of gait patterns without any externally induced injury or intervention may reflect continuous adaptations of the motor system over several time-scales. Accordingly, the modelling of walking by means of average gait patterns that are assumed to be near constant over time needs to be reconsidered in the context of
Hawking radiation of a high-dimensional rotating black hole
Energy Technology Data Exchange (ETDEWEB)
Zhao, Ren; Zhang, Lichun; Li, Huaifan; Wu, Yueqin [Shanxi Datong University, Institute of Theoretical Physics, Department of Physics, Datong (China)
2010-01-15
We extend the classical Damour-Ruffini method and discuss Hawking radiation spectrum of high-dimensional rotating black hole using Tortoise coordinate transformation defined by taking the reaction of the radiation to the spacetime into consideration. Under the condition that the energy and angular momentum are conservative, taking self-gravitation action into account, we derive Hawking radiation spectrums which satisfy unitary principle in quantum mechanics. It is shown that the process that the black hole radiates particles with energy {omega} is a continuous tunneling process. We provide a theoretical basis for further studying the physical mechanism of black-hole radiation. (orig.)
On spectral distribution of high dimensional covariation matrices
DEFF Research Database (Denmark)
Heinrich, Claudio; Podolskij, Mark
In this paper we present the asymptotic theory for spectral distributions of high dimensional covariation matrices of Brownian diffusions. More specifically, we consider N-dimensional Itô integrals with time varying matrix-valued integrands. We observe n equidistant high frequency data points...... of the underlying Brownian diffusion and we assume that N/n -> c in (0,oo). We show that under a certain mixed spectral moment condition the spectral distribution of the empirical covariation matrix converges in distribution almost surely. Our proof relies on method of moments and applications of graph theory....
The additive hazards model with high-dimensional regressors
DEFF Research Database (Denmark)
Martinussen, Torben; Scheike, Thomas
2009-01-01
This paper considers estimation and prediction in the Aalen additive hazards model in the case where the covariate vector is high-dimensional such as gene expression measurements. Some form of dimension reduction of the covariate space is needed to obtain useful statistical analyses. We study...... model. A standard PLS algorithm can also be constructed, but it turns out that the resulting predictor can only be related to the original covariates via time-dependent coefficients. The methods are applied to a breast cancer data set with gene expression recordings and to the well known primary biliary...
High-dimensional quantum channel estimation using classical light
CSIR Research Space (South Africa)
Mabena, Chemist M
2017-11-01
Full Text Available stream_source_info Mabena_20007_2017.pdf.txt stream_content_type text/plain stream_size 960 Content-Encoding UTF-8 stream_name Mabena_20007_2017.pdf.txt Content-Type text/plain; charset=UTF-8 PHYSICAL REVIEW A 96, 053860... (2017) High-dimensional quantum channel estimation using classical light Chemist M. Mabena CSIR National Laser Centre, P.O. Box 395, Pretoria 0001, South Africa and School of Physics, University of the Witwatersrand, Johannesburg 2000, South...
Zhang, Hongyang; Welch, William J.; Zamar, Ruben H.
2017-01-01
Tomal et al. (2015) introduced the notion of "phalanxes" in the context of rare-class detection in two-class classification problems. A phalanx is a subset of features that work well for classification tasks. In this paper, we propose a different class of phalanxes for application in regression settings. We define a "Regression Phalanx" - a subset of features that work well together for prediction. We propose a novel algorithm which automatically chooses Regression Phalanxes from high-dimensi...
Scalable Nearest Neighbor Algorithms for High Dimensional Data.
Muja, Marius; Lowe, David G
2014-11-01
For many computer vision and machine learning problems, large training sets are key for good performance. However, the most computationally expensive part of many computer vision and machine learning algorithms consists of finding nearest neighbor matches to high dimensional vectors that represent the training data. We propose new algorithms for approximate nearest neighbor matching and evaluate and compare them with previous algorithms. For matching high dimensional features, we find two algorithms to be the most efficient: the randomized k-d forest and a new algorithm proposed in this paper, the priority search k-means tree. We also propose a new algorithm for matching binary features by searching multiple hierarchical clustering trees and show it outperforms methods typically used in the literature. We show that the optimal nearest neighbor algorithm and its parameters depend on the data set characteristics and describe an automated configuration procedure for finding the best algorithm to search a particular data set. In order to scale to very large data sets that would otherwise not fit in the memory of a single machine, we propose a distributed nearest neighbor matching framework that can be used with any of the algorithms described in the paper. All this research has been released as an open source library called fast library for approximate nearest neighbors (FLANN), which has been incorporated into OpenCV and is now one of the most popular libraries for nearest neighbor matching.
Manifold learning to interpret JET high-dimensional operational space
International Nuclear Information System (INIS)
Cannas, B; Fanni, A; Pau, A; Sias, G; Murari, A
2013-01-01
In this paper, the problem of visualization and exploration of JET high-dimensional operational space is considered. The data come from plasma discharges selected from JET campaigns from C15 (year 2005) up to C27 (year 2009). The aim is to learn the possible manifold structure embedded in the data and to create some representations of the plasma parameters on low-dimensional maps, which are understandable and which preserve the essential properties owned by the original data. A crucial issue for the design of such mappings is the quality of the dataset. This paper reports the details of the criteria used to properly select suitable signals downloaded from JET databases in order to obtain a dataset of reliable observations. Moreover, a statistical analysis is performed to recognize the presence of outliers. Finally data reduction, based on clustering methods, is performed to select a limited and representative number of samples for the operational space mapping. The high-dimensional operational space of JET is mapped using a widely used manifold learning method, the self-organizing maps. The results are compared with other data visualization methods. The obtained maps can be used to identify characteristic regions of the plasma scenario, allowing to discriminate between regions with high risk of disruption and those with low risk of disruption. (paper)
Elucidating high-dimensional cancer hallmark annotation via enriched ontology.
Yan, Shankai; Wong, Ka-Chun
2017-09-01
Cancer hallmark annotation is a promising technique that could discover novel knowledge about cancer from the biomedical literature. The automated annotation of cancer hallmarks could reveal relevant cancer transformation processes in the literature or extract the articles that correspond to the cancer hallmark of interest. It acts as a complementary approach that can retrieve knowledge from massive text information, advancing numerous focused studies in cancer research. Nonetheless, the high-dimensional nature of cancer hallmark annotation imposes a unique challenge. To address the curse of dimensionality, we compared multiple cancer hallmark annotation methods on 1580 PubMed abstracts. Based on the insights, a novel approach, UDT-RF, which makes use of ontological features is proposed. It expands the feature space via the Medical Subject Headings (MeSH) ontology graph and utilizes novel feature selections for elucidating the high-dimensional cancer hallmark annotation space. To demonstrate its effectiveness, state-of-the-art methods are compared and evaluated by a multitude of performance metrics, revealing the full performance spectrum on the full set of cancer hallmarks. Several case studies are conducted, demonstrating how the proposed approach could reveal novel insights into cancers. https://github.com/cskyan/chmannot. Copyright © 2017 Elsevier Inc. All rights reserved.
High-Dimensional Adaptive Particle Swarm Optimization on Heterogeneous Systems
International Nuclear Information System (INIS)
Wachowiak, M P; Sarlo, B B; Foster, A E Lambe
2014-01-01
Much work has recently been reported in parallel GPU-based particle swarm optimization (PSO). Motivated by the encouraging results of these investigations, while also recognizing the limitations of GPU-based methods for big problems using a large amount of data, this paper explores the efficacy of employing other types of parallel hardware for PSO. Most commodity systems feature a variety of architectures whose high-performance capabilities can be exploited. In this paper, high-dimensional problems and those that employ a large amount of external data are explored within the context of heterogeneous systems. Large problems are decomposed into constituent components, and analyses are undertaken of which components would benefit from multi-core or GPU parallelism. The current study therefore provides another demonstration that ''supercomputing on a budget'' is possible when subtasks of large problems are run on hardware most suited to these tasks. Experimental results show that large speedups can be achieved on high dimensional, data-intensive problems. Cost functions must first be analysed for parallelization opportunities, and assigned hardware based on the particular task
High-dimensional single-cell cancer biology.
Irish, Jonathan M; Doxie, Deon B
2014-01-01
Cancer cells are distinguished from each other and from healthy cells by features that drive clonal evolution and therapy resistance. New advances in high-dimensional flow cytometry make it possible to systematically measure mechanisms of tumor initiation, progression, and therapy resistance on millions of cells from human tumors. Here we describe flow cytometry techniques that enable a "single-cell " view of cancer. High-dimensional techniques like mass cytometry enable multiplexed single-cell analysis of cell identity, clinical biomarkers, signaling network phospho-proteins, transcription factors, and functional readouts of proliferation, cell cycle status, and apoptosis. This capability pairs well with a signaling profiles approach that dissects mechanism by systematically perturbing and measuring many nodes in a signaling network. Single-cell approaches enable study of cellular heterogeneity of primary tissues and turn cell subsets into experimental controls or opportunities for new discovery. Rare populations of stem cells or therapy-resistant cancer cells can be identified and compared to other types of cells within the same sample. In the long term, these techniques will enable tracking of minimal residual disease (MRD) and disease progression. By better understanding biological systems that control development and cell-cell interactions in healthy and diseased contexts, we can learn to program cells to become therapeutic agents or target malignant signaling events to specifically kill cancer cells. Single-cell approaches that provide deep insight into cell signaling and fate decisions will be critical to optimizing the next generation of cancer treatments combining targeted approaches and immunotherapy.
Bayesian Multiresolution Variable Selection for Ultra-High Dimensional Neuroimaging Data.
Zhao, Yize; Kang, Jian; Long, Qi
2018-01-01
Ultra-high dimensional variable selection has become increasingly important in analysis of neuroimaging data. For example, in the Autism Brain Imaging Data Exchange (ABIDE) study, neuroscientists are interested in identifying important biomarkers for early detection of the autism spectrum disorder (ASD) using high resolution brain images that include hundreds of thousands voxels. However, most existing methods are not feasible for solving this problem due to their extensive computational costs. In this work, we propose a novel multiresolution variable selection procedure under a Bayesian probit regression framework. It recursively uses posterior samples for coarser-scale variable selection to guide the posterior inference on finer-scale variable selection, leading to very efficient Markov chain Monte Carlo (MCMC) algorithms. The proposed algorithms are computationally feasible for ultra-high dimensional data. Also, our model incorporates two levels of structural information into variable selection using Ising priors: the spatial dependence between voxels and the functional connectivity between anatomical brain regions. Applied to the resting state functional magnetic resonance imaging (R-fMRI) data in the ABIDE study, our methods identify voxel-level imaging biomarkers highly predictive of the ASD, which are biologically meaningful and interpretable. Extensive simulations also show that our methods achieve better performance in variable selection compared to existing methods.
Luo, Jieqiong; Du, Peijun; Samat, Alim; Xia, Junshi; Che, Meiqin; Xue, Zhaohui
2017-01-01
Based on annual average PM2.5 gridded dataset, this study first analyzed the spatiotemporal pattern of PM2.5 across Mainland China during 1998-2012. Then facilitated with meteorological site data, land cover data, population and Gross Domestic Product (GDP) data, etc., the contributions of latent geographic factors, including socioeconomic factors (e.g., road, agriculture, population, industry) and natural geographical factors (e.g., topography, climate, vegetation) to PM2.5 were explored through Geographically Weighted Regression (GWR) model. The results revealed that PM2.5 concentrations increased while the spatial pattern remained stable, and the proportion of areas with PM2.5 concentrations greater than 35 μg/m3 significantly increased from 23.08% to 29.89%. Moreover, road, agriculture, population and vegetation showed the most significant impacts on PM2.5. Additionally, the Moran’s I for the residuals of GWR was 0.025 (not significant at a 0.01 level), indicating that the GWR model was properly specified. The local coefficient estimates of GDP in some cities were negative, suggesting the existence of the inverted-U shaped Environmental Kuznets Curve (EKC) for PM2.5 in Mainland China. The effects of each latent factor on PM2.5 in various regions were different. Therefore, regional measures and strategies for controlling PM2.5 should be formulated in terms of the local impacts of specific factors.
Class prediction for high-dimensional class-imbalanced data
Directory of Open Access Journals (Sweden)
Lusa Lara
2010-10-01
Full Text Available Abstract Background The goal of class prediction studies is to develop rules to accurately predict the class membership of new samples. The rules are derived using the values of the variables available for each subject: the main characteristic of high-dimensional data is that the number of variables greatly exceeds the number of samples. Frequently the classifiers are developed using class-imbalanced data, i.e., data sets where the number of samples in each class is not equal. Standard classification methods used on class-imbalanced data often produce classifiers that do not accurately predict the minority class; the prediction is biased towards the majority class. In this paper we investigate if the high-dimensionality poses additional challenges when dealing with class-imbalanced prediction. We evaluate the performance of six types of classifiers on class-imbalanced data, using simulated data and a publicly available data set from a breast cancer gene-expression microarray study. We also investigate the effectiveness of some strategies that are available to overcome the effect of class imbalance. Results Our results show that the evaluated classifiers are highly sensitive to class imbalance and that variable selection introduces an additional bias towards classification into the majority class. Most new samples are assigned to the majority class from the training set, unless the difference between the classes is very large. As a consequence, the class-specific predictive accuracies differ considerably. When the class imbalance is not too severe, down-sizing and asymmetric bagging embedding variable selection work well, while over-sampling does not. Variable normalization can further worsen the performance of the classifiers. Conclusions Our results show that matching the prevalence of the classes in training and test set does not guarantee good performance of classifiers and that the problems related to classification with class
High-Dimensional Quantum Information Processing with Linear Optics
Fitzpatrick, Casey A.
Quantum information processing (QIP) is an interdisciplinary field concerned with the development of computers and information processing systems that utilize quantum mechanical properties of nature to carry out their function. QIP systems have become vastly more practical since the turn of the century. Today, QIP applications span imaging, cryptographic security, computation, and simulation (quantum systems that mimic other quantum systems). Many important strategies improve quantum versions of classical information system hardware, such as single photon detectors and quantum repeaters. Another more abstract strategy engineers high-dimensional quantum state spaces, so that each successful event carries more information than traditional two-level systems allow. Photonic states in particular bring the added advantages of weak environmental coupling and data transmission near the speed of light, allowing for simpler control and lower system design complexity. In this dissertation, numerous novel, scalable designs for practical high-dimensional linear-optical QIP systems are presented. First, a correlated photon imaging scheme using orbital angular momentum (OAM) states to detect rotational symmetries in objects using measurements, as well as building images out of those interactions is reported. Then, a statistical detection method using chains of OAM superpositions distributed according to the Fibonacci sequence is established and expanded upon. It is shown that the approach gives rise to schemes for sorting, detecting, and generating the recursively defined high-dimensional states on which some quantum cryptographic protocols depend. Finally, an ongoing study based on a generalization of the standard optical multiport for applications in quantum computation and simulation is reported upon. The architecture allows photons to reverse momentum inside the device. This in turn enables realistic implementation of controllable linear-optical scattering vertices for
High-dimensional change-point estimation: Combining filtering with convex optimization
Soh, Yong Sheng; Chandrasekaran, Venkat
2017-01-01
We consider change-point estimation in a sequence of high-dimensional signals given noisy observations. Classical approaches to this problem such as the filtered derivative method are useful for sequences of scalar-valued signals, but they have undesirable scaling behavior in the high-dimensional setting. However, many high-dimensional signals encountered in practice frequently possess latent low-dimensional structure. Motivated by this observation, we propose a technique for high-dimensional...
Inference for feature selection using the Lasso with high-dimensional data
DEFF Research Database (Denmark)
Brink-Jensen, Kasper; Ekstrøm, Claus Thorn
2014-01-01
Penalized regression models such as the Lasso have proved useful for variable selection in many fields - especially for situations with high-dimensional data where the numbers of predictors far exceeds the number of observations. These methods identify and rank variables of importance but do...... not generally provide any inference of the selected variables. Thus, the variables selected might be the "most important" but need not be significant. We propose a significance test for the selection found by the Lasso. We introduce a procedure that computes inference and p-values for features chosen...... by the Lasso. This method rephrases the null hypothesis and uses a randomization approach which ensures that the error rate is controlled even for small samples. We demonstrate the ability of the algorithm to compute $p$-values of the expected magnitude with simulated data using a multitude of scenarios...
Variance inflation in high dimensional Support Vector Machines
DEFF Research Database (Denmark)
Abrahamsen, Trine Julie; Hansen, Lars Kai
2013-01-01
Many important machine learning models, supervised and unsupervised, are based on simple Euclidean distance or orthogonal projection in a high dimensional feature space. When estimating such models from small training sets we face the problem that the span of the training data set input vectors...... the case of Support Vector Machines (SVMS) and we propose a non-parametric scheme to restore proper generalizability. We illustrate the algorithm and its ability to restore performance on a wide range of benchmark data sets....... follow a different probability law with less variance. While the problem and basic means to reconstruct and deflate are well understood in unsupervised learning, the case of supervised learning is less well understood. We here investigate the effect of variance inflation in supervised learning including...
Applying recursive numerical integration techniques for solving high dimensional integrals
International Nuclear Information System (INIS)
Ammon, Andreas; Genz, Alan; Hartung, Tobias; Jansen, Karl; Volmer, Julia; Leoevey, Hernan
2016-11-01
The error scaling for Markov-Chain Monte Carlo techniques (MCMC) with N samples behaves like 1/√(N). This scaling makes it often very time intensive to reduce the error of computed observables, in particular for applications in lattice QCD. It is therefore highly desirable to have alternative methods at hand which show an improved error scaling. One candidate for such an alternative integration technique is the method of recursive numerical integration (RNI). The basic idea of this method is to use an efficient low-dimensional quadrature rule (usually of Gaussian type) and apply it iteratively to integrate over high-dimensional observables and Boltzmann weights. We present the application of such an algorithm to the topological rotor and the anharmonic oscillator and compare the error scaling to MCMC results. In particular, we demonstrate that the RNI technique shows an error scaling in the number of integration points m that is at least exponential.
High-dimensional cluster analysis with the Masked EM Algorithm
Kadir, Shabnam N.; Goodman, Dan F. M.; Harris, Kenneth D.
2014-01-01
Cluster analysis faces two problems in high dimensions: first, the “curse of dimensionality” that can lead to overfitting and poor generalization performance; and second, the sheer time taken for conventional algorithms to process large amounts of high-dimensional data. We describe a solution to these problems, designed for the application of “spike sorting” for next-generation high channel-count neural probes. In this problem, only a small subset of features provide information about the cluster member-ship of any one data vector, but this informative feature subset is not the same for all data points, rendering classical feature selection ineffective. We introduce a “Masked EM” algorithm that allows accurate and time-efficient clustering of up to millions of points in thousands of dimensions. We demonstrate its applicability to synthetic data, and to real-world high-channel-count spike sorting data. PMID:25149694
Network Reconstruction From High-Dimensional Ordinary Differential Equations.
Chen, Shizhe; Shojaie, Ali; Witten, Daniela M
2017-01-01
We consider the task of learning a dynamical system from high-dimensional time-course data. For instance, we might wish to estimate a gene regulatory network from gene expression data measured at discrete time points. We model the dynamical system nonparametrically as a system of additive ordinary differential equations. Most existing methods for parameter estimation in ordinary differential equations estimate the derivatives from noisy observations. This is known to be challenging and inefficient. We propose a novel approach that does not involve derivative estimation. We show that the proposed method can consistently recover the true network structure even in high dimensions, and we demonstrate empirical improvement over competing approaches. Supplementary materials for this article are available online.
Quantum correlation of high dimensional system in a dephasing environment
Ji, Yinghua; Ke, Qiang; Hu, Juju
2018-05-01
For a high dimensional spin-S system embedded in a dephasing environment, we theoretically analyze the time evolutions of quantum correlation and entanglement via Frobenius norm and negativity. The quantum correlation dynamics can be considered as a function of the decoherence parameters, including the ratio between the system oscillator frequency ω0 and the reservoir cutoff frequency ωc , and the different environment temperature. It is shown that the quantum correlation can not only measure nonclassical correlation of the considered system, but also perform a better robustness against the dissipation. In addition, the decoherence presents the non-Markovian features and the quantum correlation freeze phenomenon. The former is much weaker than that in the sub-Ohmic or Ohmic thermal reservoir environment.
Evaluating Clustering in Subspace Projections of High Dimensional Data
DEFF Research Database (Denmark)
Müller, Emmanuel; Günnemann, Stephan; Assent, Ira
2009-01-01
Clustering high dimensional data is an emerging research field. Subspace clustering or projected clustering group similar objects in subspaces, i.e. projections, of the full space. In the past decade, several clustering paradigms have been developed in parallel, without thorough evaluation...... and comparison between these paradigms on a common basis. Conclusive evaluation and comparison is challenged by three major issues. First, there is no ground truth that describes the "true" clusters in real world data. Second, a large variety of evaluation measures have been used that reflect different aspects...... of the clustering result. Finally, in typical publications authors have limited their analysis to their favored paradigm only, while paying other paradigms little or no attention. In this paper, we take a systematic approach to evaluate the major paradigms in a common framework. We study representative clustering...
Applying recursive numerical integration techniques for solving high dimensional integrals
Energy Technology Data Exchange (ETDEWEB)
Ammon, Andreas [IVU Traffic Technologies AG, Berlin (Germany); Genz, Alan [Washington State Univ., Pullman, WA (United States). Dept. of Mathematics; Hartung, Tobias [King' s College, London (United Kingdom). Dept. of Mathematics; Jansen, Karl; Volmer, Julia [Deutsches Elektronen-Synchrotron (DESY), Zeuthen (Germany). John von Neumann-Inst. fuer Computing NIC; Leoevey, Hernan [Humboldt Univ. Berlin (Germany). Inst. fuer Mathematik
2016-11-15
The error scaling for Markov-Chain Monte Carlo techniques (MCMC) with N samples behaves like 1/√(N). This scaling makes it often very time intensive to reduce the error of computed observables, in particular for applications in lattice QCD. It is therefore highly desirable to have alternative methods at hand which show an improved error scaling. One candidate for such an alternative integration technique is the method of recursive numerical integration (RNI). The basic idea of this method is to use an efficient low-dimensional quadrature rule (usually of Gaussian type) and apply it iteratively to integrate over high-dimensional observables and Boltzmann weights. We present the application of such an algorithm to the topological rotor and the anharmonic oscillator and compare the error scaling to MCMC results. In particular, we demonstrate that the RNI technique shows an error scaling in the number of integration points m that is at least exponential.
Reduced order surrogate modelling (ROSM) of high dimensional deterministic simulations
Mitry, Mina
Often, computationally expensive engineering simulations can prohibit the engineering design process. As a result, designers may turn to a less computationally demanding approximate, or surrogate, model to facilitate their design process. However, owing to the the curse of dimensionality, classical surrogate models become too computationally expensive for high dimensional data. To address this limitation of classical methods, we develop linear and non-linear Reduced Order Surrogate Modelling (ROSM) techniques. Two algorithms are presented, which are based on a combination of linear/kernel principal component analysis and radial basis functions. These algorithms are applied to subsonic and transonic aerodynamic data, as well as a model for a chemical spill in a channel. The results of this thesis show that ROSM can provide a significant computational benefit over classical surrogate modelling, sometimes at the expense of a minor loss in accuracy.
Asymptotics of empirical eigenstructure for high dimensional spiked covariance.
Wang, Weichen; Fan, Jianqing
2017-06-01
We derive the asymptotic distributions of the spiked eigenvalues and eigenvectors under a generalized and unified asymptotic regime, which takes into account the magnitude of spiked eigenvalues, sample size, and dimensionality. This regime allows high dimensionality and diverging eigenvalues and provides new insights into the roles that the leading eigenvalues, sample size, and dimensionality play in principal component analysis. Our results are a natural extension of those in Paul (2007) to a more general setting and solve the rates of convergence problems in Shen et al. (2013). They also reveal the biases of estimating leading eigenvalues and eigenvectors by using principal component analysis, and lead to a new covariance estimator for the approximate factor model, called shrinkage principal orthogonal complement thresholding (S-POET), that corrects the biases. Our results are successfully applied to outstanding problems in estimation of risks of large portfolios and false discovery proportions for dependent test statistics and are illustrated by simulation studies.
Kernel based methods for accelerated failure time model with ultra-high dimensional data
Directory of Open Access Journals (Sweden)
Jiang Feng
2010-12-01
Full Text Available Abstract Background Most genomic data have ultra-high dimensions with more than 10,000 genes (probes. Regularization methods with L1 and Lp penalty have been extensively studied in survival analysis with high-dimensional genomic data. However, when the sample size n ≪ m (the number of genes, directly identifying a small subset of genes from ultra-high (m > 10, 000 dimensional data is time-consuming and not computationally efficient. In current microarray analysis, what people really do is select a couple of thousands (or hundreds of genes using univariate analysis or statistical tests, and then apply the LASSO-type penalty to further reduce the number of disease associated genes. This two-step procedure may introduce bias and inaccuracy and lead us to miss biologically important genes. Results The accelerated failure time (AFT model is a linear regression model and a useful alternative to the Cox model for survival analysis. In this paper, we propose a nonlinear kernel based AFT model and an efficient variable selection method with adaptive kernel ridge regression. Our proposed variable selection method is based on the kernel matrix and dual problem with a much smaller n × n matrix. It is very efficient when the number of unknown variables (genes is much larger than the number of samples. Moreover, the primal variables are explicitly updated and the sparsity in the solution is exploited. Conclusions Our proposed methods can simultaneously identify survival associated prognostic factors and predict survival outcomes with ultra-high dimensional genomic data. We have demonstrated the performance of our methods with both simulation and real data. The proposed method performs superbly with limited computational studies.
Zhu, Lingxue; Lei, Jing; Devlin, Bernie; Roeder, Kathryn
2017-09-01
Scientists routinely compare gene expression levels in cases versus controls in part to determine genes associated with a disease. Similarly, detecting case-control differences in co-expression among genes can be critical to understanding complex human diseases; however statistical methods have been limited by the high dimensional nature of this problem. In this paper, we construct a sparse-Leading-Eigenvalue-Driven (sLED) test for comparing two high-dimensional covariance matrices. By focusing on the spectrum of the differential matrix, sLED provides a novel perspective that accommodates what we assume to be common, namely sparse and weak signals in gene expression data, and it is closely related with Sparse Principal Component Analysis. We prove that sLED achieves full power asymptotically under mild assumptions, and simulation studies verify that it outperforms other existing procedures under many biologically plausible scenarios. Applying sLED to the largest gene-expression dataset obtained from post-mortem brain tissue from Schizophrenia patients and controls, we provide a novel list of genes implicated in Schizophrenia and reveal intriguing patterns in gene co-expression change for Schizophrenia subjects. We also illustrate that sLED can be generalized to compare other gene-gene "relationship" matrices that are of practical interest, such as the weighted adjacency matrices.
Zhang, Bo; Chen, Zhen; Albert, Paul S
2012-01-01
High-dimensional biomarker data are often collected in epidemiological studies when assessing the association between biomarkers and human disease is of interest. We develop a latent class modeling approach for joint analysis of high-dimensional semicontinuous biomarker data and a binary disease outcome. To model the relationship between complex biomarker expression patterns and disease risk, we use latent risk classes to link the 2 modeling components. We characterize complex biomarker-specific differences through biomarker-specific random effects, so that different biomarkers can have different baseline (low-risk) values as well as different between-class differences. The proposed approach also accommodates data features that are common in environmental toxicology and other biomarker exposure data, including a large number of biomarkers, numerous zero values, and complex mean-variance relationship in the biomarkers levels. A Monte Carlo EM (MCEM) algorithm is proposed for parameter estimation. Both the MCEM algorithm and model selection procedures are shown to work well in simulations and applications. In applying the proposed approach to an epidemiological study that examined the relationship between environmental polychlorinated biphenyl (PCB) exposure and the risk of endometriosis, we identified a highly significant overall effect of PCB concentrations on the risk of endometriosis.
Huybrechts, Inge; Lioret, Sandrine; Mouratidou, Theodora; Gunter, Marc J; Manios, Yannis; Kersting, Mathilde; Gottrand, Frederic; Kafatos, Anthony; De Henauw, Stefaan; Cuenca-García, Magdalena; Widhalm, Kurt; Gonzales-Gross, Marcela; Molnar, Denes; Moreno, Luis A; McNaughton, Sarah A
2017-01-01
This study aims to examine repeatability of reduced rank regression (RRR) methods in calculating dietary patterns (DP) and cross-sectional associations with overweight (OW)/obesity across European and Australian samples of adolescents. Data from two cross-sectional surveys in Europe (2006/2007 Healthy Lifestyle in Europe by Nutrition in Adolescence study, including 1954 adolescents, 12-17 years) and Australia (2007 National Children's Nutrition and Physical Activity Survey, including 1498 adolescents, 12-16 years) were used. Dietary intake was measured using two non-consecutive, 24-h recalls. RRR was used to identify DP using dietary energy density, fibre density and percentage of energy intake from fat as the intermediate variables. Associations between DP scores and body mass/fat were examined using multivariable linear and logistic regression as appropriate, stratified by sex. The first DP extracted (labelled 'energy dense, high fat, low fibre') explained 47 and 31 % of the response variation in Australian and European adolescents, respectively. It was similar for European and Australian adolescents and characterised by higher consumption of biscuits/cakes, chocolate/confectionery, crisps/savoury snacks, sugar-sweetened beverages, and lower consumption of yogurt, high-fibre bread, vegetables and fresh fruit. DP scores were inversely associated with BMI z-scores in Australian adolescent boys and borderline inverse in European adolescent boys (so as with %BF). Similarly, a lower likelihood for OW in boys was observed with higher DP scores in both surveys. No such relationships were observed in adolescent girls. In conclusion, the DP identified in this cross-country study was comparable for European and Australian adolescents, demonstrating robustness of the RRR method in calculating DP among populations. However, longitudinal designs are more relevant when studying diet-obesity associations, to prevent reverse causality.
Biomarker identification and effect estimation on schizophrenia –a high dimensional data analysis
Directory of Open Access Journals (Sweden)
Yuanzhang eLi
2015-05-01
Full Text Available Biomarkers have been examined in schizophrenia research for decades. Medical morbidity and mortality rates, as well as personal and societal costs, are associated with schizophrenia patients. The identification of biomarkers and alleles, which often have a small effect individually, may help to develop new diagnostic tests for early identification and treatment. Currently, there is not a commonly accepted statistical approach to identify predictive biomarkers from high dimensional data. We used space Decomposition-Gradient-Regression method (DGR to select biomarkers, which are associated with the risk of schizophrenia. Then, we used the gradient scores, generated from the selected biomarkers, as the prediction factor in regression to estimate their effects. We also used an alternative approach, classification and regression tree (CART, to compare the biomarker selected by DGR and found about 70% of the selected biomarkers were the same. However, the advantage of DGR is that it can evaluate individual effects for each biomarker from their combined effect. In DGR analysis of serum specimens of US military service members with a diagnosis of schizophrenia from 1992 to 2005 and their controls, Alpha-1-Antitrypsin (AAT, Interleukin-6 receptor (IL-6r and Connective Tissue Growth Factor (CTGF were selected to identify schizophrenia for males; and Alpha-1-Antitrypsin (AAT, Apolipoprotein B (Apo B and Sortilin were selected for females. If these findings from military subjects are replicated by other studies, they suggest the possibility of a novel biomarker panel as an adjunct to earlier diagnosis and initiation of treatment.
Explorations on High Dimensional Landscapes: Spin Glasses and Deep Learning
Sagun, Levent
This thesis deals with understanding the structure of high-dimensional and non-convex energy landscapes. In particular, its focus is on the optimization of two classes of functions: homogeneous polynomials and loss functions that arise in machine learning. In the first part, the notion of complexity of a smooth, real-valued function is studied through its critical points. Existing theoretical results predict that certain random functions that are defined on high dimensional domains have a narrow band of values whose pre-image contains the bulk of its critical points. This section provides empirical evidence for convergence of gradient descent to local minima whose energies are near the predicted threshold justifying the existing asymptotic theory. Moreover, it is empirically shown that a similar phenomenon may hold for deep learning loss functions. Furthermore, there is a comparative analysis of gradient descent and its stochastic version showing that in high dimensional regimes the latter is a mere speedup. The next study focuses on the halting time of an algorithm at a given stopping condition. Given an algorithm, the normalized fluctuations of the halting time follow a distribution that remains unchanged even when the input data is sampled from a new distribution. Two qualitative classes are observed: a Gumbel-like distribution that appears in Google searches, human decision times, and spin glasses and a Gaussian-like distribution that appears in conjugate gradient method, deep learning with MNIST and random input data. Following the universality phenomenon, the Hessian of the loss functions of deep learning is studied. The spectrum is seen to be composed of two parts, the bulk which is concentrated around zero, and the edges which are scattered away from zero. Empirical evidence is presented for the bulk indicating how over-parametrized the system is, and for the edges that depend on the input data. Furthermore, an algorithm is proposed such that it would
A qualitative numerical study of high dimensional dynamical systems
Albers, David James
Since Poincare, the father of modern mathematical dynamical systems, much effort has been exerted to achieve a qualitative understanding of the physical world via a qualitative understanding of the functions we use to model the physical world. In this thesis, we construct a numerical framework suitable for a qualitative, statistical study of dynamical systems using the space of artificial neural networks. We analyze the dynamics along intervals in parameter space, separating the set of neural networks into roughly four regions: the fixed point to the first bifurcation; the route to chaos; the chaotic region; and a transition region between chaos and finite-state neural networks. The study is primarily with respect to high-dimensional dynamical systems. We make the following general conclusions as the dimension of the dynamical system is increased: the probability of the first bifurcation being of type Neimark-Sacker is greater than ninety-percent; the most probable route to chaos is via a cascade of bifurcations of high-period periodic orbits, quasi-periodic orbits, and 2-tori; there exists an interval of parameter space such that hyperbolicity is violated on a countable, Lebesgue measure 0, "increasingly dense" subset; chaos is much more likely to persist with respect to parameter perturbation in the chaotic region of parameter space as the dimension is increased; moreover, as the number of positive Lyapunov exponents is increased, the likelihood that any significant portion of these positive exponents can be perturbed away decreases with increasing dimension. The maximum Kaplan-Yorke dimension and the maximum number of positive Lyapunov exponents increases linearly with dimension. The probability of a dynamical system being chaotic increases exponentially with dimension. The results with respect to the first bifurcation and the route to chaos comment on previous results of Newhouse, Ruelle, Takens, Broer, Chenciner, and Iooss. Moreover, results regarding the high-dimensional
Matson, Johnny L.; Kozlowski, Alison M.
2010-01-01
Autistic regression is one of the many mysteries in the developmental course of autism and pervasive developmental disorders not otherwise specified (PDD-NOS). Various definitions of this phenomenon have been used, further clouding the study of the topic. Despite this problem, some efforts at establishing prevalence have been made. The purpose of…
Progress in high-dimensional percolation and random graphs
Heydenreich, Markus
2017-01-01
This text presents an engaging exposition of the active field of high-dimensional percolation that will likely provide an impetus for future work. With over 90 exercises designed to enhance the reader’s understanding of the material, as well as many open problems, the book is aimed at graduate students and researchers who wish to enter the world of this rich topic. The text may also be useful in advanced courses and seminars, as well as for reference and individual study. Part I, consisting of 3 chapters, presents a general introduction to percolation, stating the main results, defining the central objects, and proving its main properties. No prior knowledge of percolation is assumed. Part II, consisting of Chapters 4–9, discusses mean-field critical behavior by describing the two main techniques used, namely, differential inequalities and the lace expansion. In Parts I and II, all results are proved, making this the first self-contained text discussing high-dimensiona l percolation. Part III, consist...
Effects of dependence in high-dimensional multiple testing problems
Directory of Open Access Journals (Sweden)
van de Wiel Mark A
2008-02-01
Full Text Available Abstract Background We consider effects of dependence among variables of high-dimensional data in multiple hypothesis testing problems, in particular the False Discovery Rate (FDR control procedures. Recent simulation studies consider only simple correlation structures among variables, which is hardly inspired by real data features. Our aim is to systematically study effects of several network features like sparsity and correlation strength by imposing dependence structures among variables using random correlation matrices. Results We study the robustness against dependence of several FDR procedures that are popular in microarray studies, such as Benjamin-Hochberg FDR, Storey's q-value, SAM and resampling based FDR procedures. False Non-discovery Rates and estimates of the number of null hypotheses are computed from those methods and compared. Our simulation study shows that methods such as SAM and the q-value do not adequately control the FDR to the level claimed under dependence conditions. On the other hand, the adaptive Benjamini-Hochberg procedure seems to be most robust while remaining conservative. Finally, the estimates of the number of true null hypotheses under various dependence conditions are variable. Conclusion We discuss a new method for efficient guided simulation of dependent data, which satisfy imposed network constraints as conditional independence structures. Our simulation set-up allows for a structural study of the effect of dependencies on multiple testing criterions and is useful for testing a potentially new method on π0 or FDR estimation in a dependency context.
High-dimensional quantum cryptography with twisted light
International Nuclear Information System (INIS)
Mirhosseini, Mohammad; Magaña-Loaiza, Omar S; O’Sullivan, Malcolm N; Rodenburg, Brandon; Malik, Mehul; Boyd, Robert W; Lavery, Martin P J; Padgett, Miles J; Gauthier, Daniel J
2015-01-01
Quantum key distribution (QKD) systems often rely on polarization of light for encoding, thus limiting the amount of information that can be sent per photon and placing tight bounds on the error rates that such a system can tolerate. Here we describe a proof-of-principle experiment that indicates the feasibility of high-dimensional QKD based on the transverse structure of the light field allowing for the transfer of more than 1 bit per photon. Our implementation uses the orbital angular momentum (OAM) of photons and the corresponding mutually unbiased basis of angular position (ANG). Our experiment uses a digital micro-mirror device for the rapid generation of OAM and ANG modes at 4 kHz, and a mode sorter capable of sorting single photons based on their OAM and ANG content with a separation efficiency of 93%. Through the use of a seven-dimensional alphabet encoded in the OAM and ANG bases, we achieve a channel capacity of 2.05 bits per sifted photon. Our experiment demonstrates that, in addition to having an increased information capacity, multilevel QKD systems based on spatial-mode encoding can be more resilient against intercept-resend eavesdropping attacks. (paper)
Inference for High-dimensional Differential Correlation Matrices.
Cai, T Tony; Zhang, Anru
2016-01-01
Motivated by differential co-expression analysis in genomics, we consider in this paper estimation and testing of high-dimensional differential correlation matrices. An adaptive thresholding procedure is introduced and theoretical guarantees are given. Minimax rate of convergence is established and the proposed estimator is shown to be adaptively rate-optimal over collections of paired correlation matrices with approximately sparse differences. Simulation results show that the procedure significantly outperforms two other natural methods that are based on separate estimation of the individual correlation matrices. The procedure is also illustrated through an analysis of a breast cancer dataset, which provides evidence at the gene co-expression level that several genes, of which a subset has been previously verified, are associated with the breast cancer. Hypothesis testing on the differential correlation matrices is also considered. A test, which is particularly well suited for testing against sparse alternatives, is introduced. In addition, other related problems, including estimation of a single sparse correlation matrix, estimation of the differential covariance matrices, and estimation of the differential cross-correlation matrices, are also discussed.
The literary uses of high-dimensional space
Directory of Open Access Journals (Sweden)
Ted Underwood
2015-12-01
Full Text Available Debates over “Big Data” shed more heat than light in the humanities, because the term ascribes new importance to statistical methods without explaining how those methods have changed. What we badly need instead is a conversation about the substantive innovations that have made statistical modeling useful for disciplines where, in the past, it truly wasn’t. These innovations are partly technical, but more fundamentally expressed in what Leo Breiman calls a new “culture” of statistical modeling. Where 20th-century methods often required humanists to squeeze our unstructured texts, sounds, or images into some special-purpose data model, new methods can handle unstructured evidence more directly by modeling it in a high-dimensional space. This opens a range of research opportunities that humanists have barely begun to discuss. To date, topic modeling has received most attention, but in the long run, supervised predictive models may be even more important. I sketch their potential by describing how Jordan Sellers and I have begun to model poetic distinction in the long 19th century—revealing an arc of gradual change much longer than received literary histories would lead us to expect.
Olive, David J
2017-01-01
This text covers both multiple linear regression and some experimental design models. The text uses the response plot to visualize the model and to detect outliers, does not assume that the error distribution has a known parametric distribution, develops prediction intervals that work when the error distribution is unknown, suggests bootstrap hypothesis tests that may be useful for inference after variable selection, and develops prediction regions and large sample theory for the multivariate linear regression model that has m response variables. A relationship between multivariate prediction regions and confidence regions provides a simple way to bootstrap confidence regions. These confidence regions often provide a practical method for testing hypotheses. There is also a chapter on generalized linear models and generalized additive models. There are many R functions to produce response and residual plots, to simulate prediction intervals and hypothesis tests, to detect outliers, and to choose response trans...
Bhadra, Anindya
2013-04-22
We describe a Bayesian technique to (a) perform a sparse joint selection of significant predictor variables and significant inverse covariance matrix elements of the response variables in a high-dimensional linear Gaussian sparse seemingly unrelated regression (SSUR) setting and (b) perform an association analysis between the high-dimensional sets of predictors and responses in such a setting. To search the high-dimensional model space, where both the number of predictors and the number of possibly correlated responses can be larger than the sample size, we demonstrate that a marginalization-based collapsed Gibbs sampler, in combination with spike and slab type of priors, offers a computationally feasible and efficient solution. As an example, we apply our method to an expression quantitative trait loci (eQTL) analysis on publicly available single nucleotide polymorphism (SNP) and gene expression data for humans where the primary interest lies in finding the significant associations between the sets of SNPs and possibly correlated genetic transcripts. Our method also allows for inference on the sparse interaction network of the transcripts (response variables) after accounting for the effect of the SNPs (predictor variables). We exploit properties of Gaussian graphical models to make statements concerning conditional independence of the responses. Our method compares favorably to existing Bayesian approaches developed for this purpose. © 2013, The International Biometric Society.
Dixit, A.; Singh, V. K.
2017-12-01
Recent studies conducted by World Health Organisation (WHO) estimated that 92 % of the total world population are living in places where the air quality level has exceeded the WHO standard limit for air quality. This is due to the change in Land Use Land Cover (LULC) pattern, socio economic drivers and anthropogenic heat emission caused by manmade activity. Thereby, many prevalent human respiratory diseases such as lung cancer, chronic obstructive pulmonary disease and emphysema have increased in recent times. In this study, a quantitative relationship is developed between land use (built-up land, water bodies, and vegetation), socio economic drivers and air quality parameters using logistic based regression model over 7 different cities of India for the winter season of 2012 to 2016. Different LULC, socio economic, industrial emission sources, meteorological condition and air quality level from the monitoring stations are taken to estimate the influence on morbidity of each city. Results of correlation are analyzed between land use variables and monthly concentration of pollutants. These values range from 0.63 to 0.76. Similarly, the correlation value between land use variable with socio economic and morbidity ranges from 0.57 to 0.73. The performance of model is improved from 67 % to 79 % in estimating morbidity for the year 2015 and 2016 due to the better availability of observed data.The study highlights the growing importance of incorporating socio-economic drivers with air quality data for evaluating morbidity rate for each city in comparison to just change in quantitative analysis of air quality.
High-dimensional statistical inference: From vector to matrix
Zhang, Anru
Statistical inference for sparse signals or low-rank matrices in high-dimensional settings is of significant interest in a range of contemporary applications. It has attracted significant recent attention in many fields including statistics, applied mathematics and electrical engineering. In this thesis, we consider several problems in including sparse signal recovery (compressed sensing under restricted isometry) and low-rank matrix recovery (matrix recovery via rank-one projections and structured matrix completion). The first part of the thesis discusses compressed sensing and affine rank minimization in both noiseless and noisy cases and establishes sharp restricted isometry conditions for sparse signal and low-rank matrix recovery. The analysis relies on a key technical tool which represents points in a polytope by convex combinations of sparse vectors. The technique is elementary while leads to sharp results. It is shown that, in compressed sensing, delta kA 0, delta kA < 1/3 + epsilon, deltak A + thetak,kA < 1 + epsilon, or deltatkA< √(t - 1) / t + epsilon are not sufficient to guarantee the exact recovery of all k-sparse signals for large k. Similar result also holds for matrix recovery. In addition, the conditions delta kA<1/3, deltak A+ thetak,kA<1, delta tkA < √(t - 1)/t and deltarM<1/3, delta rM+ thetar,rM<1, delta trM< √(t - 1)/ t are also shown to be sufficient respectively for stable recovery of approximately sparse signals and low-rank matrices in the noisy case. For the second part of the thesis, we introduce a rank-one projection model for low-rank matrix recovery and propose a constrained nuclear norm minimization method for stable recovery of low-rank matrices in the noisy case. The procedure is adaptive to the rank and robust against small perturbations. Both upper and lower bounds for the estimation accuracy under the Frobenius norm loss are obtained. The proposed estimator is shown to be rate-optimal under certain conditions. The
Genuinely high-dimensional nonlocality optimized by complementary measurements
International Nuclear Information System (INIS)
Lim, James; Ryu, Junghee; Yoo, Seokwon; Lee, Changhyoup; Bang, Jeongho; Lee, Jinhyoung
2010-01-01
Qubits exhibit extreme nonlocality when their state is maximally entangled and this is observed by mutually unbiased local measurements. This criterion does not hold for the Bell inequalities of high-dimensional systems (qudits), recently proposed by Collins-Gisin-Linden-Massar-Popescu and Son-Lee-Kim. Taking an alternative approach, called the quantum-to-classical approach, we derive a series of Bell inequalities for qudits that satisfy the criterion as for the qubits. In the derivation each d-dimensional subsystem is assumed to be measured by one of d possible measurements with d being a prime integer. By applying to two qubits (d=2), we find that a derived inequality is reduced to the Clauser-Horne-Shimony-Holt inequality when the degree of nonlocality is optimized over all the possible states and local observables. Further applying to two and three qutrits (d=3), we find Bell inequalities that are violated for the three-dimensionally entangled states but are not violated by any two-dimensionally entangled states. In other words, the inequalities discriminate three-dimensional (3D) entanglement from two-dimensional (2D) entanglement and in this sense they are genuinely 3D. In addition, for the two qutrits we give a quantitative description of the relations among the three degrees of complementarity, entanglement and nonlocality. It is shown that the degree of complementarity jumps abruptly to very close to its maximum as nonlocality starts appearing. These characteristics imply that complementarity plays a more significant role in the present inequality compared with the previously proposed inequality.
Approximation of High-Dimensional Rank One Tensors
Bachmayr, Markus
2013-11-12
Many real world problems are high-dimensional in that their solution is a function which depends on many variables or parameters. This presents a computational challenge since traditional numerical techniques are built on model classes for functions based solely on smoothness. It is known that the approximation of smoothness classes of functions suffers from the so-called \\'curse of dimensionality\\'. Avoiding this curse requires new model classes for real world functions that match applications. This has led to the introduction of notions such as sparsity, variable reduction, and reduced modeling. One theme that is particularly common is to assume a tensor structure for the target function. This paper investigates how well a rank one function f(x 1,...,x d)=f 1(x 1)⋯f d(x d), defined on Ω=[0,1]d can be captured through point queries. It is shown that such a rank one function with component functions f j in W∞ r([0,1]) can be captured (in L ∞) to accuracy O(C(d,r)N -r) from N well-chosen point evaluations. The constant C(d,r) scales like d dr. The queries in our algorithms have two ingredients, a set of points built on the results from discrepancy theory and a second adaptive set of queries dependent on the information drawn from the first set. Under the assumption that a point z∈Ω with nonvanishing f(z) is known, the accuracy improves to O(dN -r). © 2013 Springer Science+Business Media New York.
Statistical mechanics of complex neural systems and high dimensional data
International Nuclear Information System (INIS)
Advani, Madhu; Lahiri, Subhaneil; Ganguli, Surya
2013-01-01
Recent experimental advances in neuroscience have opened new vistas into the immense complexity of neuronal networks. This proliferation of data challenges us on two parallel fronts. First, how can we form adequate theoretical frameworks for understanding how dynamical network processes cooperate across widely disparate spatiotemporal scales to solve important computational problems? Second, how can we extract meaningful models of neuronal systems from high dimensional datasets? To aid in these challenges, we give a pedagogical review of a collection of ideas and theoretical methods arising at the intersection of statistical physics, computer science and neurobiology. We introduce the interrelated replica and cavity methods, which originated in statistical physics as powerful ways to quantitatively analyze large highly heterogeneous systems of many interacting degrees of freedom. We also introduce the closely related notion of message passing in graphical models, which originated in computer science as a distributed algorithm capable of solving large inference and optimization problems involving many coupled variables. We then show how both the statistical physics and computer science perspectives can be applied in a wide diversity of contexts to problems arising in theoretical neuroscience and data analysis. Along the way we discuss spin glasses, learning theory, illusions of structure in noise, random matrices, dimensionality reduction and compressed sensing, all within the unified formalism of the replica method. Moreover, we review recent conceptual connections between message passing in graphical models, and neural computation and learning. Overall, these ideas illustrate how statistical physics and computer science might provide a lens through which we can uncover emergent computational functions buried deep within the dynamical complexities of neuronal networks. (paper)
Quality and efficiency in high dimensional Nearest neighbor search
Tao, Yufei; Yi, Ke; Sheng, Cheng; Kalnis, Panos
2009-01-01
Nearest neighbor (NN) search in high dimensional space is an important problem in many applications. Ideally, a practical solution (i) should be implementable in a relational database, and (ii) its query cost should grow sub-linearly with the dataset size, regardless of the data and query distributions. Despite the bulk of NN literature, no solution fulfills both requirements, except locality sensitive hashing (LSH). The existing LSH implementations are either rigorous or adhoc. Rigorous-LSH ensures good quality of query results, but requires expensive space and query cost. Although adhoc-LSH is more efficient, it abandons quality control, i.e., the neighbor it outputs can be arbitrarily bad. As a result, currently no method is able to ensure both quality and efficiency simultaneously in practice. Motivated by this, we propose a new access method called the locality sensitive B-tree (LSB-tree) that enables fast highdimensional NN search with excellent quality. The combination of several LSB-trees leads to a structure called the LSB-forest that ensures the same result quality as rigorous-LSH, but reduces its space and query cost dramatically. The LSB-forest also outperforms adhoc-LSH, even though the latter has no quality guarantee. Besides its appealing theoretical properties, the LSB-tree itself also serves as an effective index that consumes linear space, and supports efficient updates. Our extensive experiments confirm that the LSB-tree is faster than (i) the state of the art of exact NN search by two orders of magnitude, and (ii) the best (linear-space) method of approximate retrieval by an order of magnitude, and at the same time, returns neighbors with much better quality. © 2009 ACM.
Approximation of High-Dimensional Rank One Tensors
Bachmayr, Markus; Dahmen, Wolfgang; DeVore, Ronald; Grasedyck, Lars
2013-01-01
Many real world problems are high-dimensional in that their solution is a function which depends on many variables or parameters. This presents a computational challenge since traditional numerical techniques are built on model classes for functions based solely on smoothness. It is known that the approximation of smoothness classes of functions suffers from the so-called 'curse of dimensionality'. Avoiding this curse requires new model classes for real world functions that match applications. This has led to the introduction of notions such as sparsity, variable reduction, and reduced modeling. One theme that is particularly common is to assume a tensor structure for the target function. This paper investigates how well a rank one function f(x 1,...,x d)=f 1(x 1)⋯f d(x d), defined on Ω=[0,1]d can be captured through point queries. It is shown that such a rank one function with component functions f j in W∞ r([0,1]) can be captured (in L ∞) to accuracy O(C(d,r)N -r) from N well-chosen point evaluations. The constant C(d,r) scales like d dr. The queries in our algorithms have two ingredients, a set of points built on the results from discrepancy theory and a second adaptive set of queries dependent on the information drawn from the first set. Under the assumption that a point z∈Ω with nonvanishing f(z) is known, the accuracy improves to O(dN -r). © 2013 Springer Science+Business Media New York.
Nonlinear Forecasting With Many Predictors Using Kernel Ridge Regression
DEFF Research Database (Denmark)
Exterkate, Peter; Groenen, Patrick J.F.; Heij, Christiaan
This paper puts forward kernel ridge regression as an approach for forecasting with many predictors that are related nonlinearly to the target variable. In kernel ridge regression, the observed predictor variables are mapped nonlinearly into a high-dimensional space, where estimation of the predi...
Nunes, N; Ambler, G; Foo, X; Widschwendter, M; Jurkovic, D
2018-06-01
To determine whether International Ovarian Tumor Analysis (IOTA) logistic regression models LR1 and LR2 developed for the preoperative diagnosis of ovarian cancer could also be used to differentiate between benign and malignant adnexal tumors in the population of women attending gynecology outpatient clinics. This was a single-center prospective observational study of consecutive women attending our gynecological diagnostic outpatient unit, recruited between May 2009 and January 2012. All the women were first examined by a Level-II ultrasound operator. In those diagnosed with adnexal tumors, the IOTA-LR1/2 protocol was used to evaluate the masses. The LR1 and LR2 models were then used to assess the risk of malignancy. Subsequently, the women were also examined by a Level-III examiner, who used pattern recognition to differentiate between benign and malignant tumors. Women with an ultrasound diagnosis of malignancy were offered surgery, while asymptomatic women with presumed benign lesions were offered conservative management with a minimum follow-up of 12 months. The initial diagnosis was compared with two reference standards: histological findings and/or a comparative assessment of tumor morphology on follow-up ultrasound scans. All women for whom the tumor classification on follow-up changed from benign to malignant were offered surgery. In the final analysis, 489 women who had either or both of the reference standards were included. Their mean age was 50 years (range, 16-91 years) and 45% were postmenopausal. Of the included women, 342/489 (69.9%) had surgery and 147/489 (30.1%) were managed conservatively. The malignancy rate was 137/489 (28.0%). Overall, sensitivities of LR1 and LR2 for the diagnosis of malignancy were 97.1% (95% CI, 92.7-99.2%) and 94.9% (95% CI, 89.8-97.9%) and specificities were 77.3% (95% CI, 72.5-81.5%) and 76.7% (95% CI, 71.9-81.0%), respectively (P > 0.05). In comparison with pattern recognition (sensitivity 94.2% (95% CI, 88
Identifying surfaces of low dimensions in high dimensional data
DEFF Research Database (Denmark)
Høskuldsson, Agnar
1996-01-01
Methods are presented that find a nonlinear subspace in low dimensions that describe data given by many variables. The methods include nonlinear extensions of Principal Component Analysis and extensions of linear regression analysis. It is shown by examples that these methods give more reliable...
Complex Environmental Data Modelling Using Adaptive General Regression Neural Networks
Kanevski, Mikhail
2015-04-01
The research deals with an adaptation and application of Adaptive General Regression Neural Networks (GRNN) to high dimensional environmental data. GRNN [1,2,3] are efficient modelling tools both for spatial and temporal data and are based on nonparametric kernel methods closely related to classical Nadaraya-Watson estimator. Adaptive GRNN, using anisotropic kernels, can be also applied for features selection tasks when working with high dimensional data [1,3]. In the present research Adaptive GRNN are used to study geospatial data predictability and relevant feature selection using both simulated and real data case studies. The original raw data were either three dimensional monthly precipitation data or monthly wind speeds embedded into 13 dimensional space constructed by geographical coordinates and geo-features calculated from digital elevation model. GRNN were applied in two different ways: 1) adaptive GRNN with the resulting list of features ordered according to their relevancy; and 2) adaptive GRNN applied to evaluate all possible models N [in case of wind fields N=(2^13 -1)=8191] and rank them according to the cross-validation error. In both cases training were carried out applying leave-one-out procedure. An important result of the study is that the set of the most relevant features depends on the month (strong seasonal effect) and year. The predictabilities of precipitation and wind field patterns, estimated using the cross-validation and testing errors of raw and shuffled data, were studied in detail. The results of both approaches were qualitatively and quantitatively compared. In conclusion, Adaptive GRNN with their ability to select features and efficient modelling of complex high dimensional data can be widely used in automatic/on-line mapping and as an integrated part of environmental decision support systems. 1. Kanevski M., Pozdnoukhov A., Timonin V. Machine Learning for Spatial Environmental Data. Theory, applications and software. EPFL Press
Nam, Julia EunJu; Mueller, Klaus
2013-02-01
Gaining a true appreciation of high-dimensional space remains difficult since all of the existing high-dimensional space exploration techniques serialize the space travel in some way. This is not so foreign to us since we, when traveling, also experience the world in a serial fashion. But we typically have access to a map to help with positioning, orientation, navigation, and trip planning. Here, we propose a multivariate data exploration tool that compares high-dimensional space navigation with a sightseeing trip. It decomposes this activity into five major tasks: 1) Identify the sights: use a map to identify the sights of interest and their location; 2) Plan the trip: connect the sights of interest along a specifyable path; 3) Go on the trip: travel along the route; 4) Hop off the bus: experience the location, look around, zoom into detail; and 5) Orient and localize: regain bearings in the map. We describe intuitive and interactive tools for all of these tasks, both global navigation within the map and local exploration of the data distributions. For the latter, we describe a polygonal touchpad interface which enables users to smoothly tilt the projection plane in high-dimensional space to produce multivariate scatterplots that best convey the data relationships under investigation. Motion parallax and illustrative motion trails aid in the perception of these transient patterns. We describe the use of our system within two applications: 1) the exploratory discovery of data configurations that best fit a personal preference in the presence of tradeoffs and 2) interactive cluster analysis via cluster sculpting in N-D.
Nunes, Natalie; Ambler, Gareth; Hoo, Wee-Liak; Naftalin, Joel; Foo, Xulin; Widschwendter, Martin; Jurkovic, Davor
2013-11-01
This study aimed to assess the accuracy of the International Ovarian Tumour Analysis (IOTA) logistic regression models (LR1 and LR2) and that of subjective pattern recognition (PR) for the diagnosis of ovarian cancer. This was a prospective single-center study in a general gynecology unit of a tertiary hospital during 33 months. There were 292 consecutive women who underwent surgery after an ultrasound diagnosis of an adnexal tumor. All examinations were by a single level 2 ultrasound operator, according to the IOTA guidelines. The malignancy likelihood was calculated using the IOTA LR1 and LR2. The women were then examined separately by an expert operator using subjective PR. These were compared to operative findings and histology. The sensitivity, specificity, area under the curve (AUC), and accuracy of the 3 methods were calculated and compared. The AUCs for LR1 and LR2 were 0.94 [95% confidence interval (CI), 0.92-0.97] and 0.93 (95% CI, 0.90-0.96), respectively. Subjective PR gave a positive likelihood ratio (LR+ve) of 13.9 (95% CI, 7.84-24.6) and a LR-ve of 0.049 (95% CI, 0.022-0.107). The corresponding LR+ve and LR-ve for LR1 were 3.33 (95% CI, 2.85-3.55) and 0.03 (95% CI, 0.01-0.10), and for LR2 were 3.58 (95% CI, 2.77-4.63) and 0.052 (95% CI, 0.022-0.123). The accuracy of PR was 0.942 (95% CI, 0.908-0.966), which was significantly higher when compared with 0.829 (95% CI, 0.781-0.870) for LR1 and 0.836 (95% CI, 0.788-0.872) for LR2 (P IOTA LR1 and LR2 were similar in nonexpert's hands when compared to the original and validation IOTA studies. The PR method was the more accurate test to diagnose ovarian cancer than either of the IOTA models.
Nakamura, Ryota; Suhrcke, Marc; Jebb, Susan A; Pechey, Rachel; Almiron-Roig, Eva; Marteau, Theresa M
2015-04-01
There is a growing concern, but limited evidence, that price promotions contribute to a poor diet and the social patterning of diet-related disease. We examined the following questions: 1) Are less-healthy foods more likely to be promoted than healthier foods? 2) Are consumers more responsive to promotions on less-healthy products? 3) Are there socioeconomic differences in food purchases in response to price promotions? With the use of hierarchical regression, we analyzed data on purchases of 11,323 products within 135 food and beverage categories from 26,986 households in Great Britain during 2010. Major supermarkets operated the same price promotions in all branches. The number of stores that offered price promotions on each product for each week was used to measure the frequency of price promotions. We assessed the healthiness of each product by using a nutrient profiling (NP) model. A total of 6788 products (60%) were in healthier categories and 4535 products (40%) were in less-healthy categories. There was no significant gap in the frequency of promotion by the healthiness of products neither within nor between categories. However, after we controlled for the reference price, price discount rate, and brand-specific effects, the sales uplift arising from price promotions was larger in less-healthy than in healthier categories; a 1-SD point increase in the category mean NP score, implying the category becomes less healthy, was associated with an additional 7.7-percentage point increase in sales (from 27.3% to 35.0%; P sales uplift from promotions was larger for higher-socioeconomic status (SES) groups than for lower ones (34.6% for the high-SES group, 28.1% for the middle-SES group, and 23.1% for the low-SES group). Finally, there was no significant SES gap in the absolute volume of purchases of less-healthy foods made on promotion. Attempts to limit promotions on less-healthy foods could improve the population diet but would be unlikely to reduce health
Nagarajan, Mahesh B; Coan, Paola; Huber, Markus B; Diemoz, Paul C; Glaser, Christian; Wismüller, Axel
2014-02-01
Phase-contrast computed tomography (PCI-CT) has shown tremendous potential as an imaging modality for visualizing human cartilage with high spatial resolution. Previous studies have demonstrated the ability of PCI-CT to visualize (1) structural details of the human patellar cartilage matrix and (2) changes to chondrocyte organization induced by osteoarthritis. This study investigates the use of high-dimensional geometric features in characterizing such chondrocyte patterns in the presence or absence of osteoarthritic damage. Geometrical features derived from the scaling index method (SIM) and statistical features derived from gray-level co-occurrence matrices were extracted from 842 regions of interest (ROI) annotated on PCI-CT images of ex vivo human patellar cartilage specimens. These features were subsequently used in a machine learning task with support vector regression to classify ROIs as healthy or osteoarthritic; classification performance was evaluated using the area under the receiver-operating characteristic curve (AUC). SIM-derived geometrical features exhibited the best classification performance (AUC, 0.95 ± 0.06) and were most robust to changes in ROI size. These results suggest that such geometrical features can provide a detailed characterization of the chondrocyte organization in the cartilage matrix in an automated and non-subjective manner, while also enabling classification of cartilage as healthy or osteoarthritic with high accuracy. Such features could potentially serve as imaging markers for evaluating osteoarthritis progression and its response to different therapeutic intervention strategies.
Matrix correlations for high-dimensional data: The modified RV-coefficient
Smilde, A.K.; Kiers, H.A.L.; Bijlsma, S.; Rubingh, C.M.; Erk, M.J. van
2009-01-01
Motivation: Modern functional genomics generates high-dimensional datasets. It is often convenient to have a single simple number characterizing the relationship between pairs of such high-dimensional datasets in a comprehensive way. Matrix correlations are such numbers and are appealing since they
Feature Augmentation via Nonparametrics and Selection (FANS) in High-Dimensional Classification.
Fan, Jianqing; Feng, Yang; Jiang, Jiancheng; Tong, Xin
We propose a high dimensional classification method that involves nonparametric feature augmentation. Knowing that marginal density ratios are the most powerful univariate classifiers, we use the ratio estimates to transform the original feature measurements. Subsequently, penalized logistic regression is invoked, taking as input the newly transformed or augmented features. This procedure trains models equipped with local complexity and global simplicity, thereby avoiding the curse of dimensionality while creating a flexible nonlinear decision boundary. The resulting method is called Feature Augmentation via Nonparametrics and Selection (FANS). We motivate FANS by generalizing the Naive Bayes model, writing the log ratio of joint densities as a linear combination of those of marginal densities. It is related to generalized additive models, but has better interpretability and computability. Risk bounds are developed for FANS. In numerical analysis, FANS is compared with competing methods, so as to provide a guideline on its best application domain. Real data analysis demonstrates that FANS performs very competitively on benchmark email spam and gene expression data sets. Moreover, FANS is implemented by an extremely fast algorithm through parallel computing.
Multitask Quantile Regression under the Transnormal Model.
Fan, Jianqing; Xue, Lingzhou; Zou, Hui
2016-01-01
We consider estimating multi-task quantile regression under the transnormal model, with focus on high-dimensional setting. We derive a surprisingly simple closed-form solution through rank-based covariance regularization. In particular, we propose the rank-based ℓ 1 penalization with positive definite constraints for estimating sparse covariance matrices, and the rank-based banded Cholesky decomposition regularization for estimating banded precision matrices. By taking advantage of alternating direction method of multipliers, nearest correlation matrix projection is introduced that inherits sampling properties of the unprojected one. Our work combines strengths of quantile regression and rank-based covariance regularization to simultaneously deal with nonlinearity and nonnormality for high-dimensional regression. Furthermore, the proposed method strikes a good balance between robustness and efficiency, achieves the "oracle"-like convergence rate, and provides the provable prediction interval under the high-dimensional setting. The finite-sample performance of the proposed method is also examined. The performance of our proposed rank-based method is demonstrated in a real application to analyze the protein mass spectroscopy data.
Binder, Harald; Porzelius, Christine; Schumacher, Martin
2011-03-01
Analysis of molecular data promises identification of biomarkers for improving prognostic models, thus potentially enabling better patient management. For identifying such biomarkers, risk prediction models can be employed that link high-dimensional molecular covariate data to a clinical endpoint. In low-dimensional settings, a multitude of statistical techniques already exists for building such models, e.g. allowing for variable selection or for quantifying the added value of a new biomarker. We provide an overview of techniques for regularized estimation that transfer this toward high-dimensional settings, with a focus on models for time-to-event endpoints. Techniques for incorporating specific covariate structure are discussed, as well as techniques for dealing with more complex endpoints. Employing gene expression data from patients with diffuse large B-cell lymphoma, some typical modeling issues from low-dimensional settings are illustrated in a high-dimensional application. First, the performance of classical stepwise regression is compared to stage-wise regression, as implemented by a component-wise likelihood-based boosting approach. A second issues arises, when artificially transforming the response into a binary variable. The effects of the resulting loss of efficiency and potential bias in a high-dimensional setting are illustrated, and a link to competing risks models is provided. Finally, we discuss conditions for adequately quantifying the added value of high-dimensional gene expression measurements, both at the stage of model fitting and when performing evaluation. Copyright © 2011 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.
Self-dissimilarity as a High Dimensional Complexity Measure
Wolpert, David H.; Macready, William
2005-01-01
For many systems characterized as "complex" the patterns exhibited on different scales differ markedly from one another. For example the biomass distribution in a human body "looks very different" depending on the scale at which one examines it. Conversely, the patterns at different scales in "simple" systems (e.g., gases, mountains, crystals) vary little from one scale to another. Accordingly, the degrees of self-dissimilarity between the patterns of a system at various scales constitute a complexity "signature" of that system. Here we present a novel quantification of self-dissimilarity. This signature can, if desired, incorporate a novel information-theoretic measure of the distance between probability distributions that we derive here. Whatever distance measure is chosen, our quantification of self-dissimilarity can be measured for many kinds of real-world data. This allows comparisons of the complexity signatures of wholly different kinds of systems (e.g., systems involving information density in a digital computer vs. species densities in a rain-forest vs. capital density in an economy, etc.). Moreover, in contrast to many other suggested complexity measures, evaluating the self-dissimilarity of a system does not require one to already have a model of the system. These facts may allow self-dissimilarity signatures to be used a s the underlying observational variables of an eventual overarching theory relating all complex systems. To illustrate self-dissimilarity we present several numerical experiments. In particular, we show that underlying structure of the logistic map is picked out by the self-dissimilarity signature of time series produced by that map
Meng, Xi; Nguyen, Bao D; Ridge, Clark; Shaka, A J
2009-01-01
High-dimensional (HD) NMR spectra have poorer digital resolution than low-dimensional (LD) spectra, for a fixed amount of experiment time. This has led to "reduced-dimensionality" strategies, in which several LD projections of the HD NMR spectrum are acquired, each with higher digital resolution; an approximate HD spectrum is then inferred by some means. We propose a strategy that moves in the opposite direction, by adding more time dimensions to increase the information content of the data set, even if only a very sparse time grid is used in each dimension. The full HD time-domain data can be analyzed by the filter diagonalization method (FDM), yielding very narrow resonances along all of the frequency axes, even those with sparse sampling. Integrating over the added dimensions of HD FDM NMR spectra reconstitutes LD spectra with enhanced resolution, often more quickly than direct acquisition of the LD spectrum with a larger number of grid points in each of the fewer dimensions. If the extra-dimensions do not appear in the final spectrum, and are used solely to boost information content, we propose the moniker hidden-dimension NMR. This work shows that HD peaks have unmistakable frequency signatures that can be detected as single HD objects by an appropriate algorithm, even though their patterns would be tricky for a human operator to visualize or recognize, and even if digital resolution in an HD FT spectrum is very coarse compared with natural line widths.
Patterns of Stochastic Behavior in Dynamically Unstable High-Dimensional Biochemical Networks
Directory of Open Access Journals (Sweden)
Simon Rosenfeld
2009-01-01
Full Text Available The question of dynamical stability and stochastic behavior of large biochemical networks is discussed. It is argued that stringent conditions of asymptotic stability have very little chance to materialize in a multidimensional system described by the differential equations of chemical kinetics. The reason is that the criteria of asymptotic stability (Routh- Hurwitz, Lyapunov criteria, Feinberg’s Deficiency Zero theorem would impose the limitations of very high algebraic order on the kinetic rates and stoichiometric coefficients, and there are no natural laws that would guarantee their unconditional validity. Highly nonlinear, dynamically unstable systems, however, are not necessarily doomed to collapse, as a simple Jacobian analysis would suggest. It is possible that their dynamics may assume the form of pseudo-random fluctuations quite similar to a shot noise, and, therefore, their behavior may be described in terms of Langevin and Fokker-Plank equations. We have shown by simulation that the resulting pseudo-stochastic processes obey the heavy-tailed Generalized Pareto Distribution with temporal sequence of pulses forming the set of constituent-specific Poisson processes. Being applied to intracellular dynamics, these properties are naturally associated with burstiness, a well documented phenomenon in the biology of gene expression.
Software Tools for Robust Analysis of High-Dimensional Data
Directory of Open Access Journals (Sweden)
Valentin Todorov
2014-06-01
Full Text Available The present work discusses robust multivariate methods specifically designed for highdimensions. Their implementation in R is presented and their application is illustratedon examples. The first group are algorithms for outlier detection, already introducedelsewhere and implemented in other packages. The value added of the new package isthat all methods follow the same design pattern and thus can use the same graphicaland diagnostic tools. The next topic covered is sparse principal components including anobject oriented interface to the standard method proposed by Zou, Hastie, and Tibshirani(2006 and the robust one proposed by Croux, Filzmoser, and Fritz (2013. Robust partialleast squares (see Hubert and Vanden Branden 2003 as well as partial least squares fordiscriminant analysis conclude the scope of the new package.
Robust and sparse correlation matrix estimation for the analysis of high-dimensional genomics data.
Serra, Angela; Coretto, Pietro; Fratello, Michele; Tagliaferri, Roberto; Stegle, Oliver
2018-02-15
Microarray technology can be used to study the expression of thousands of genes across a number of different experimental conditions, usually hundreds. The underlying principle is that genes sharing similar expression patterns, across different samples, can be part of the same co-expression system, or they may share the same biological functions. Groups of genes are usually identified based on cluster analysis. Clustering methods rely on the similarity matrix between genes. A common choice to measure similarity is to compute the sample correlation matrix. Dimensionality reduction is another popular data analysis task which is also based on covariance/correlation matrix estimates. Unfortunately, covariance/correlation matrix estimation suffers from the intrinsic noise present in high-dimensional data. Sources of noise are: sampling variations, presents of outlying sample units, and the fact that in most cases the number of units is much larger than the number of genes. In this paper, we propose a robust correlation matrix estimator that is regularized based on adaptive thresholding. The resulting method jointly tames the effects of the high-dimensionality, and data contamination. Computations are easy to implement and do not require hand tunings. Both simulated and real data are analyzed. A Monte Carlo experiment shows that the proposed method is capable of remarkable performances. Our correlation metric is more robust to outliers compared with the existing alternatives in two gene expression datasets. It is also shown how the regularization allows to automatically detect and filter spurious correlations. The same regularization is also extended to other less robust correlation measures. Finally, we apply the ARACNE algorithm on the SyNTreN gene expression data. Sensitivity and specificity of the reconstructed network is compared with the gold standard. We show that ARACNE performs better when it takes the proposed correlation matrix estimator as input. The R
Forecasting with Dynamic Regression Models
Pankratz, Alan
2012-01-01
One of the most widely used tools in statistical forecasting, single equation regression models is examined here. A companion to the author's earlier work, Forecasting with Univariate Box-Jenkins Models: Concepts and Cases, the present text pulls together recent time series ideas and gives special attention to possible intertemporal patterns, distributed lag responses of output to input series and the auto correlation patterns of regression disturbance. It also includes six case studies.
Inferring gene expression dynamics via functional regression analysis
Directory of Open Access Journals (Sweden)
Leng Xiaoyan
2008-01-01
Full Text Available Abstract Background Temporal gene expression profiles characterize the time-dynamics of expression of specific genes and are increasingly collected in current gene expression experiments. In the analysis of experiments where gene expression is obtained over the life cycle, it is of interest to relate temporal patterns of gene expression associated with different developmental stages to each other to study patterns of long-term developmental gene regulation. We use tools from functional data analysis to study dynamic changes by relating temporal gene expression profiles of different developmental stages to each other. Results We demonstrate that functional regression methodology can pinpoint relationships that exist between temporary gene expression profiles for different life cycle phases and incorporates dimension reduction as needed for these high-dimensional data. By applying these tools, gene expression profiles for pupa and adult phases are found to be strongly related to the profiles of the same genes obtained during the embryo phase. Moreover, one can distinguish between gene groups that exhibit relationships with positive and others with negative associations between later life and embryonal expression profiles. Specifically, we find a positive relationship in expression for muscle development related genes, and a negative relationship for strictly maternal genes for Drosophila, using temporal gene expression profiles. Conclusion Our findings point to specific reactivation patterns of gene expression during the Drosophila life cycle which differ in characteristic ways between various gene groups. Functional regression emerges as a useful tool for relating gene expression patterns from different developmental stages, and avoids the problems with large numbers of parameters and multiple testing that affect alternative approaches.
Mitigating the Insider Threat Using High-Dimensional Search and Modeling
National Research Council Canada - National Science Library
Van Den Berg, Eric; Uphadyaya, Shambhu; Ngo, Phi H; Muthukrishnan, Muthu; Palan, Rajago
2006-01-01
In this project a system was built aimed at mitigating insider attacks centered around a high-dimensional search engine for correlating the large number of monitoring streams necessary for detecting insider attacks...
Approximating high-dimensional dynamics by barycentric coordinates with linear programming
Energy Technology Data Exchange (ETDEWEB)
Hirata, Yoshito, E-mail: yoshito@sat.t.u-tokyo.ac.jp; Aihara, Kazuyuki; Suzuki, Hideyuki [Institute of Industrial Science, The University of Tokyo, 4-6-1 Komaba, Meguro-ku, Tokyo 153-8505 (Japan); Department of Mathematical Informatics, The University of Tokyo, Bunkyo-ku, Tokyo 113-8656 (Japan); CREST, JST, 4-1-8 Honcho, Kawaguchi, Saitama 332-0012 (Japan); Shiro, Masanori [Department of Mathematical Informatics, The University of Tokyo, Bunkyo-ku, Tokyo 113-8656 (Japan); Mathematical Neuroinformatics Group, Advanced Industrial Science and Technology, Tsukuba, Ibaraki 305-8568 (Japan); Takahashi, Nozomu; Mas, Paloma [Center for Research in Agricultural Genomics (CRAG), Consorci CSIC-IRTA-UAB-UB, Barcelona 08193 (Spain)
2015-01-15
The increasing development of novel methods and techniques facilitates the measurement of high-dimensional time series but challenges our ability for accurate modeling and predictions. The use of a general mathematical model requires the inclusion of many parameters, which are difficult to be fitted for relatively short high-dimensional time series observed. Here, we propose a novel method to accurately model a high-dimensional time series. Our method extends the barycentric coordinates to high-dimensional phase space by employing linear programming, and allowing the approximation errors explicitly. The extension helps to produce free-running time-series predictions that preserve typical topological, dynamical, and/or geometric characteristics of the underlying attractors more accurately than the radial basis function model that is widely used. The method can be broadly applied, from helping to improve weather forecasting, to creating electronic instruments that sound more natural, and to comprehensively understanding complex biological data.
Approximating high-dimensional dynamics by barycentric coordinates with linear programming
International Nuclear Information System (INIS)
Hirata, Yoshito; Aihara, Kazuyuki; Suzuki, Hideyuki; Shiro, Masanori; Takahashi, Nozomu; Mas, Paloma
2015-01-01
The increasing development of novel methods and techniques facilitates the measurement of high-dimensional time series but challenges our ability for accurate modeling and predictions. The use of a general mathematical model requires the inclusion of many parameters, which are difficult to be fitted for relatively short high-dimensional time series observed. Here, we propose a novel method to accurately model a high-dimensional time series. Our method extends the barycentric coordinates to high-dimensional phase space by employing linear programming, and allowing the approximation errors explicitly. The extension helps to produce free-running time-series predictions that preserve typical topological, dynamical, and/or geometric characteristics of the underlying attractors more accurately than the radial basis function model that is widely used. The method can be broadly applied, from helping to improve weather forecasting, to creating electronic instruments that sound more natural, and to comprehensively understanding complex biological data
Approximating high-dimensional dynamics by barycentric coordinates with linear programming.
Hirata, Yoshito; Shiro, Masanori; Takahashi, Nozomu; Aihara, Kazuyuki; Suzuki, Hideyuki; Mas, Paloma
2015-01-01
The increasing development of novel methods and techniques facilitates the measurement of high-dimensional time series but challenges our ability for accurate modeling and predictions. The use of a general mathematical model requires the inclusion of many parameters, which are difficult to be fitted for relatively short high-dimensional time series observed. Here, we propose a novel method to accurately model a high-dimensional time series. Our method extends the barycentric coordinates to high-dimensional phase space by employing linear programming, and allowing the approximation errors explicitly. The extension helps to produce free-running time-series predictions that preserve typical topological, dynamical, and/or geometric characteristics of the underlying attractors more accurately than the radial basis function model that is widely used. The method can be broadly applied, from helping to improve weather forecasting, to creating electronic instruments that sound more natural, and to comprehensively understanding complex biological data.
Efficient and accurate nearest neighbor and closest pair search in high-dimensional space
Tao, Yufei; Yi, Ke; Sheng, Cheng; Kalnis, Panos
2010-01-01
Nearest Neighbor (NN) search in high-dimensional space is an important problem in many applications. From the database perspective, a good solution needs to have two properties: (i) it can be easily incorporated in a relational database, and (ii
International Nuclear Information System (INIS)
Langrene, Nicolas
2014-01-01
This thesis deals with the numerical solution of general stochastic control problems, with notable applications for electricity markets. We first propose a structural model for the price of electricity, allowing for price spikes well above the marginal fuel price under strained market conditions. This model allows to price and partially hedge electricity derivatives, using fuel forwards as hedging instruments. Then, we propose an algorithm, which combines Monte-Carlo simulations with local basis regressions, to solve general optimal switching problems. A comprehensive rate of convergence of the method is provided. Moreover, we manage to make the algorithm parsimonious in memory (and hence suitable for high dimensional problems) by generalizing to this framework a memory reduction method that avoids the storage of the sample paths. We illustrate this on the problem of investments in new power plants (our structural power price model allowing the new plants to impact the price of electricity). Finally, we study more general stochastic control problems (the control can be continuous and impact the drift and volatility of the state process), the solutions of which belong to the class of fully nonlinear Hamilton-Jacobi-Bellman equations, and can be handled via constrained Backward Stochastic Differential Equations, for which we develop a backward algorithm based on control randomization and parametric optimizations. A rate of convergence between the constraPned BSDE and its discrete version is provided, as well as an estimate of the optimal control. This algorithm is then applied to the problem of super replication of options under uncertain volatilities (and correlations). (author)
Bayesian ARTMAP for regression.
Sasu, L M; Andonie, R
2013-10-01
Bayesian ARTMAP (BA) is a recently introduced neural architecture which uses a combination of Fuzzy ARTMAP competitive learning and Bayesian learning. Training is generally performed online, in a single-epoch. During training, BA creates input data clusters as Gaussian categories, and also infers the conditional probabilities between input patterns and categories, and between categories and classes. During prediction, BA uses Bayesian posterior probability estimation. So far, BA was used only for classification. The goal of this paper is to analyze the efficiency of BA for regression problems. Our contributions are: (i) we generalize the BA algorithm using the clustering functionality of both ART modules, and name it BA for Regression (BAR); (ii) we prove that BAR is a universal approximator with the best approximation property. In other words, BAR approximates arbitrarily well any continuous function (universal approximation) and, for every given continuous function, there is one in the set of BAR approximators situated at minimum distance (best approximation); (iii) we experimentally compare the online trained BAR with several neural models, on the following standard regression benchmarks: CPU Computer Hardware, Boston Housing, Wisconsin Breast Cancer, and Communities and Crime. Our results show that BAR is an appropriate tool for regression tasks, both for theoretical and practical reasons. Copyright © 2013 Elsevier Ltd. All rights reserved.
Directory of Open Access Journals (Sweden)
Regis Wendpouire Oubida
2015-03-01
Full Text Available Local adaptation to climate in temperate forest trees involves the integration of multiple physiological, morphological, and phenological traits. Latitudinal clines are frequently observed for these traits, but environmental constraints also track longitude and altitude. We combined extensive phenotyping of 12 candidate adaptive traits, multivariate regression trees, quantitative genetics, and a genome-wide panel of SNP markers to better understand the interplay among geography, climate, and adaptation to abiotic factors in Populus trichocarpa. Heritabilities were low to moderate (0.13 to 0.32 and population differentiation for many traits exceeded the 99th percentile of the genome-wide distribution of FST, suggesting local adaptation. When climate variables were taken as predictors and the 12 traits as response variables in a multivariate regression tree analysis, evapotranspiration (Eref explained the most variation, with subsequent splits related to mean temperature of the warmest month, frost-free period (FFP, and mean annual precipitation (MAP. These grouping matched relatively well the splits using geographic variables as predictors: the northernmost groups (short FFP and low Eref had the lowest growth, and lowest cold injury index; the southern British Columbia group (low Eref and intermediate temperatures had average growth and cold injury index; the group from the coast of California and Oregon (high Eref and FFP had the highest growth performance and the highest cold injury index; and the southernmost, high-altitude group (with high Eref and low FFP performed poorly, had high cold injury index, and lower water use efficiency. Taken together, these results suggest variation in both temperature and water availability across the range shape multivariate adaptive traits in poplar.
Adaptive metric kernel regression
DEFF Research Database (Denmark)
Goutte, Cyril; Larsen, Jan
2000-01-01
Kernel smoothing is a widely used non-parametric pattern recognition technique. By nature, it suffers from the curse of dimensionality and is usually difficult to apply to high input dimensions. In this contribution, we propose an algorithm that adapts the input metric used in multivariate...... regression by minimising a cross-validation estimate of the generalisation error. This allows to automatically adjust the importance of different dimensions. The improvement in terms of modelling performance is illustrated on a variable selection task where the adaptive metric kernel clearly outperforms...
Adaptive Metric Kernel Regression
DEFF Research Database (Denmark)
Goutte, Cyril; Larsen, Jan
1998-01-01
Kernel smoothing is a widely used nonparametric pattern recognition technique. By nature, it suffers from the curse of dimensionality and is usually difficult to apply to high input dimensions. In this paper, we propose an algorithm that adapts the input metric used in multivariate regression...... by minimising a cross-validation estimate of the generalisation error. This allows one to automatically adjust the importance of different dimensions. The improvement in terms of modelling performance is illustrated on a variable selection task where the adaptive metric kernel clearly outperforms the standard...
Landfors, Mattias; Philip, Philge; Rydén, Patrik; Stenberg, Per
2011-01-01
Genome-wide analysis of gene expression or protein binding patterns using different array or sequencing based technologies is now routinely performed to compare different populations, such as treatment and reference groups. It is often necessary to normalize the data obtained to remove technical variation introduced in the course of conducting experimental work, but standard normalization techniques are not capable of eliminating technical bias in cases where the distribution of the truly altered variables is skewed, i.e. when a large fraction of the variables are either positively or negatively affected by the treatment. However, several experiments are likely to generate such skewed distributions, including ChIP-chip experiments for the study of chromatin, gene expression experiments for the study of apoptosis, and SNP-studies of copy number variation in normal and tumour tissues. A preliminary study using spike-in array data established that the capacity of an experiment to identify altered variables and generate unbiased estimates of the fold change decreases as the fraction of altered variables and the skewness increases. We propose the following work-flow for analyzing high-dimensional experiments with regions of altered variables: (1) Pre-process raw data using one of the standard normalization techniques. (2) Investigate if the distribution of the altered variables is skewed. (3) If the distribution is not believed to be skewed, no additional normalization is needed. Otherwise, re-normalize the data using a novel HMM-assisted normalization procedure. (4) Perform downstream analysis. Here, ChIP-chip data and simulated data were used to evaluate the performance of the work-flow. It was found that skewed distributions can be detected by using the novel DSE-test (Detection of Skewed Experiments). Furthermore, applying the HMM-assisted normalization to experiments where the distribution of the truly altered variables is skewed results in considerably higher
International Nuclear Information System (INIS)
Tripathy, Rohit; Bilionis, Ilias; Gonzalez, Marcial
2016-01-01
Uncertainty quantification (UQ) tasks, such as model calibration, uncertainty propagation, and optimization under uncertainty, typically require several thousand evaluations of the underlying computer codes. To cope with the cost of simulations, one replaces the real response surface with a cheap surrogate based, e.g., on polynomial chaos expansions, neural networks, support vector machines, or Gaussian processes (GP). However, the number of simulations required to learn a generic multivariate response grows exponentially as the input dimension increases. This curse of dimensionality can only be addressed, if the response exhibits some special structure that can be discovered and exploited. A wide range of physical responses exhibit a special structure known as an active subspace (AS). An AS is a linear manifold of the stochastic space characterized by maximal response variation. The idea is that one should first identify this low dimensional manifold, project the high-dimensional input onto it, and then link the projection to the output. If the dimensionality of the AS is low enough, then learning the link function is a much easier problem than the original problem of learning a high-dimensional function. The classic approach to discovering the AS requires gradient information, a fact that severely limits its applicability. Furthermore, and partly because of its reliance to gradients, it is not able to handle noisy observations. The latter is an essential trait if one wants to be able to propagate uncertainty through stochastic simulators, e.g., through molecular dynamics codes. In this work, we develop a probabilistic version of AS which is gradient-free and robust to observational noise. Our approach relies on a novel Gaussian process regression with built-in dimensionality reduction. In particular, the AS is represented as an orthogonal projection matrix that serves as yet another covariance function hyper-parameter to be estimated from the data. To train the
Tripathy, Rohit; Bilionis, Ilias; Gonzalez, Marcial
2016-09-01
Uncertainty quantification (UQ) tasks, such as model calibration, uncertainty propagation, and optimization under uncertainty, typically require several thousand evaluations of the underlying computer codes. To cope with the cost of simulations, one replaces the real response surface with a cheap surrogate based, e.g., on polynomial chaos expansions, neural networks, support vector machines, or Gaussian processes (GP). However, the number of simulations required to learn a generic multivariate response grows exponentially as the input dimension increases. This curse of dimensionality can only be addressed, if the response exhibits some special structure that can be discovered and exploited. A wide range of physical responses exhibit a special structure known as an active subspace (AS). An AS is a linear manifold of the stochastic space characterized by maximal response variation. The idea is that one should first identify this low dimensional manifold, project the high-dimensional input onto it, and then link the projection to the output. If the dimensionality of the AS is low enough, then learning the link function is a much easier problem than the original problem of learning a high-dimensional function. The classic approach to discovering the AS requires gradient information, a fact that severely limits its applicability. Furthermore, and partly because of its reliance to gradients, it is not able to handle noisy observations. The latter is an essential trait if one wants to be able to propagate uncertainty through stochastic simulators, e.g., through molecular dynamics codes. In this work, we develop a probabilistic version of AS which is gradient-free and robust to observational noise. Our approach relies on a novel Gaussian process regression with built-in dimensionality reduction. In particular, the AS is represented as an orthogonal projection matrix that serves as yet another covariance function hyper-parameter to be estimated from the data. To train the
Energy Technology Data Exchange (ETDEWEB)
Tripathy, Rohit, E-mail: rtripath@purdue.edu; Bilionis, Ilias, E-mail: ibilion@purdue.edu; Gonzalez, Marcial, E-mail: marcial-gonzalez@purdue.edu
2016-09-15
Uncertainty quantification (UQ) tasks, such as model calibration, uncertainty propagation, and optimization under uncertainty, typically require several thousand evaluations of the underlying computer codes. To cope with the cost of simulations, one replaces the real response surface with a cheap surrogate based, e.g., on polynomial chaos expansions, neural networks, support vector machines, or Gaussian processes (GP). However, the number of simulations required to learn a generic multivariate response grows exponentially as the input dimension increases. This curse of dimensionality can only be addressed, if the response exhibits some special structure that can be discovered and exploited. A wide range of physical responses exhibit a special structure known as an active subspace (AS). An AS is a linear manifold of the stochastic space characterized by maximal response variation. The idea is that one should first identify this low dimensional manifold, project the high-dimensional input onto it, and then link the projection to the output. If the dimensionality of the AS is low enough, then learning the link function is a much easier problem than the original problem of learning a high-dimensional function. The classic approach to discovering the AS requires gradient information, a fact that severely limits its applicability. Furthermore, and partly because of its reliance to gradients, it is not able to handle noisy observations. The latter is an essential trait if one wants to be able to propagate uncertainty through stochastic simulators, e.g., through molecular dynamics codes. In this work, we develop a probabilistic version of AS which is gradient-free and robust to observational noise. Our approach relies on a novel Gaussian process regression with built-in dimensionality reduction. In particular, the AS is represented as an orthogonal projection matrix that serves as yet another covariance function hyper-parameter to be estimated from the data. To train the
Julien, Clavel; Leandro, Aristide; Hélène, Morlon
2018-06-19
Working with high-dimensional phylogenetic comparative datasets is challenging because likelihood-based multivariate methods suffer from low statistical performances as the number of traits p approaches the number of species n and because some computational complications occur when p exceeds n. Alternative phylogenetic comparative methods have recently been proposed to deal with the large p small n scenario but their use and performances are limited. Here we develop a penalized likelihood framework to deal with high-dimensional comparative datasets. We propose various penalizations and methods for selecting the intensity of the penalties. We apply this general framework to the estimation of parameters (the evolutionary trait covariance matrix and parameters of the evolutionary model) and model comparison for the high-dimensional multivariate Brownian (BM), Early-burst (EB), Ornstein-Uhlenbeck (OU) and Pagel's lambda models. We show using simulations that our penalized likelihood approach dramatically improves the estimation of evolutionary trait covariance matrices and model parameters when p approaches n, and allows for their accurate estimation when p equals or exceeds n. In addition, we show that penalized likelihood models can be efficiently compared using Generalized Information Criterion (GIC). We implement these methods, as well as the related estimation of ancestral states and the computation of phylogenetic PCA in the R package RPANDA and mvMORPH. Finally, we illustrate the utility of the new proposed framework by evaluating evolutionary models fit, analyzing integration patterns, and reconstructing evolutionary trajectories for a high-dimensional 3-D dataset of brain shape in the New World monkeys. We find a clear support for an Early-burst model suggesting an early diversification of brain morphology during the ecological radiation of the clade. Penalized likelihood offers an efficient way to deal with high-dimensional multivariate comparative data.
Engineering two-photon high-dimensional states through quantum interference
Zhang, Yingwen; Roux, Filippus S.; Konrad, Thomas; Agnew, Megan; Leach, Jonathan; Forbes, Andrew
2016-01-01
Many protocols in quantum science, for example, linear optical quantum computing, require access to large-scale entangled quantum states. Such systems can be realized through many-particle qubits, but this approach often suffers from scalability problems. An alternative strategy is to consider a lesser number of particles that exist in high-dimensional states. The spatial modes of light are one such candidate that provides access to high-dimensional quantum states, and thus they increase the storage and processing potential of quantum information systems. We demonstrate the controlled engineering of two-photon high-dimensional states entangled in their orbital angular momentum through Hong-Ou-Mandel interference. We prepare a large range of high-dimensional entangled states and implement precise quantum state filtering. We characterize the full quantum state before and after the filter, and are thus able to determine that only the antisymmetric component of the initial state remains. This work paves the way for high-dimensional processing and communication of multiphoton quantum states, for example, in teleportation beyond qubits. PMID:26933685
A Comparison of Methods for Estimating the Determinant of High-Dimensional Covariance Matrix
Hu, Zongliang
2017-09-27
The determinant of the covariance matrix for high-dimensional data plays an important role in statistical inference and decision. It has many real applications including statistical tests and information theory. Due to the statistical and computational challenges with high dimensionality, little work has been proposed in the literature for estimating the determinant of high-dimensional covariance matrix. In this paper, we estimate the determinant of the covariance matrix using some recent proposals for estimating high-dimensional covariance matrix. Specifically, we consider a total of eight covariance matrix estimation methods for comparison. Through extensive simulation studies, we explore and summarize some interesting comparison results among all compared methods. We also provide practical guidelines based on the sample size, the dimension, and the correlation of the data set for estimating the determinant of high-dimensional covariance matrix. Finally, from a perspective of the loss function, the comparison study in this paper may also serve as a proxy to assess the performance of the covariance matrix estimation.
A Comparison of Methods for Estimating the Determinant of High-Dimensional Covariance Matrix.
Hu, Zongliang; Dong, Kai; Dai, Wenlin; Tong, Tiejun
2017-09-21
The determinant of the covariance matrix for high-dimensional data plays an important role in statistical inference and decision. It has many real applications including statistical tests and information theory. Due to the statistical and computational challenges with high dimensionality, little work has been proposed in the literature for estimating the determinant of high-dimensional covariance matrix. In this paper, we estimate the determinant of the covariance matrix using some recent proposals for estimating high-dimensional covariance matrix. Specifically, we consider a total of eight covariance matrix estimation methods for comparison. Through extensive simulation studies, we explore and summarize some interesting comparison results among all compared methods. We also provide practical guidelines based on the sample size, the dimension, and the correlation of the data set for estimating the determinant of high-dimensional covariance matrix. Finally, from a perspective of the loss function, the comparison study in this paper may also serve as a proxy to assess the performance of the covariance matrix estimation.
A Comparison of Methods for Estimating the Determinant of High-Dimensional Covariance Matrix
Hu, Zongliang; Dong, Kai; Dai, Wenlin; Tong, Tiejun
2017-01-01
The determinant of the covariance matrix for high-dimensional data plays an important role in statistical inference and decision. It has many real applications including statistical tests and information theory. Due to the statistical and computational challenges with high dimensionality, little work has been proposed in the literature for estimating the determinant of high-dimensional covariance matrix. In this paper, we estimate the determinant of the covariance matrix using some recent proposals for estimating high-dimensional covariance matrix. Specifically, we consider a total of eight covariance matrix estimation methods for comparison. Through extensive simulation studies, we explore and summarize some interesting comparison results among all compared methods. We also provide practical guidelines based on the sample size, the dimension, and the correlation of the data set for estimating the determinant of high-dimensional covariance matrix. Finally, from a perspective of the loss function, the comparison study in this paper may also serve as a proxy to assess the performance of the covariance matrix estimation.
A Hybrid Semi-Supervised Anomaly Detection Model for High-Dimensional Data
Directory of Open Access Journals (Sweden)
Hongchao Song
2017-01-01
Full Text Available Anomaly detection, which aims to identify observations that deviate from a nominal sample, is a challenging task for high-dimensional data. Traditional distance-based anomaly detection methods compute the neighborhood distance between each observation and suffer from the curse of dimensionality in high-dimensional space; for example, the distances between any pair of samples are similar and each sample may perform like an outlier. In this paper, we propose a hybrid semi-supervised anomaly detection model for high-dimensional data that consists of two parts: a deep autoencoder (DAE and an ensemble k-nearest neighbor graphs- (K-NNG- based anomaly detector. Benefiting from the ability of nonlinear mapping, the DAE is first trained to learn the intrinsic features of a high-dimensional dataset to represent the high-dimensional data in a more compact subspace. Several nonparametric KNN-based anomaly detectors are then built from different subsets that are randomly sampled from the whole dataset. The final prediction is made by all the anomaly detectors. The performance of the proposed method is evaluated on several real-life datasets, and the results confirm that the proposed hybrid model improves the detection accuracy and reduces the computational complexity.
Differentiating regressed melanoma from regressed lichenoid keratosis.
Chan, Aegean H; Shulman, Kenneth J; Lee, Bonnie A
2017-04-01
Distinguishing regressed lichen planus-like keratosis (LPLK) from regressed melanoma can be difficult on histopathologic examination, potentially resulting in mismanagement of patients. We aimed to identify histopathologic features by which regressed melanoma can be differentiated from regressed LPLK. Twenty actively inflamed LPLK, 12 LPLK with regression and 15 melanomas with regression were compared and evaluated by hematoxylin and eosin staining as well as Melan-A, microphthalmia transcription factor (MiTF) and cytokeratin (AE1/AE3) immunostaining. (1) A total of 40% of regressed melanomas showed complete or near complete loss of melanocytes within the epidermis with Melan-A and MiTF immunostaining, while 8% of regressed LPLK exhibited this finding. (2) Necrotic keratinocytes were seen in the epidermis in 33% regressed melanomas as opposed to all of the regressed LPLK. (3) A dense infiltrate of melanophages in the papillary dermis was seen in 40% of regressed melanomas, a feature not seen in regressed LPLK. In summary, our findings suggest that a complete or near complete loss of melanocytes within the epidermis strongly favors a regressed melanoma over a regressed LPLK. In addition, necrotic epidermal keratinocytes and the presence of a dense band-like distribution of dermal melanophages can be helpful in differentiating these lesions. © 2016 John Wiley & Sons A/S. Published by John Wiley & Sons Ltd.
Directory of Open Access Journals (Sweden)
A. M. Bernales
2016-06-01
Full Text Available The threat of the ailments related to urbanization like heat stress is very prevalent. There are a lot of things that can be done to lessen the effect of urbanization to the surface temperature of the area like using green roofs or planting trees in the area. So land use really matters in both increasing and decreasing surface temperature. It is known that there is a relationship between land use land cover (LULC and land surface temperature (LST. Quantifying this relationship in terms of a mathematical model is very important so as to provide a way to predict LST based on the LULC alone. This study aims to examine the relationship between LST and LULC as well as to create a model that can predict LST using class-level spatial metrics from LULC. LST was derived from a Landsat 8 image and LULC classification was derived from LiDAR and Orthophoto datasets. Class-level spatial metrics were created in FRAGSTATS with the LULC and LST as inputs and these metrics were analysed using a statistical framework. Multi linear regression was done to create models that would predict LST for each class and it was found that the spatial metric “Effective mesh size” was a top predictor for LST in 6 out of 7 classes. The model created can still be refined by adding a temporal aspect by analysing the LST of another farming period (for rural areas and looking for common predictors between LSTs of these two different farming periods.
Model-based Clustering of High-Dimensional Data in Astrophysics
Bouveyron, C.
2016-05-01
The nature of data in Astrophysics has changed, as in other scientific fields, in the past decades due to the increase of the measurement capabilities. As a consequence, data are nowadays frequently of high dimensionality and available in mass or stream. Model-based techniques for clustering are popular tools which are renowned for their probabilistic foundations and their flexibility. However, classical model-based techniques show a disappointing behavior in high-dimensional spaces which is mainly due to their dramatical over-parametrization. The recent developments in model-based classification overcome these drawbacks and allow to efficiently classify high-dimensional data, even in the "small n / large p" situation. This work presents a comprehensive review of these recent approaches, including regularization-based techniques, parsimonious modeling, subspace classification methods and classification methods based on variable selection. The use of these model-based methods is also illustrated on real-world classification problems in Astrophysics using R packages.
International Nuclear Information System (INIS)
Zhang, Wuhong; Su, Ming; Wu, Ziwen; Lu, Meng; Huang, Bingwei; Chen, Lixiang
2013-01-01
Twisted photons enable the definition of a Hilbert space beyond two dimensions by orbital angular momentum (OAM) eigenstates. Here we propose a feasible entanglement concentration experiment, to enhance the quality of high-dimensional entanglement shared by twisted photon pairs. Our approach is started from the full characterization of entangled spiral bandwidth, and is then based on the careful selection of the Laguerre–Gaussian (LG) modes with specific radial and azimuthal indices p and ℓ. In particular, we demonstrate the possibility of high-dimensional entanglement concentration residing in the OAM subspace of up to 21 dimensions. By means of LabVIEW simulations with spatial light modulators, we show that the Shannon dimensionality could be employed to quantify the quality of the present concentration. Our scheme holds promise in quantum information applications defined in high-dimensional Hilbert space. (letter)
Detection of Subtle Context-Dependent Model Inaccuracies in High-Dimensional Robot Domains.
Mendoza, Juan Pablo; Simmons, Reid; Veloso, Manuela
2016-12-01
Autonomous robots often rely on models of their sensing and actions for intelligent decision making. However, when operating in unconstrained environments, the complexity of the world makes it infeasible to create models that are accurate in every situation. This article addresses the problem of using potentially large and high-dimensional sets of robot execution data to detect situations in which a robot model is inaccurate-that is, detecting context-dependent model inaccuracies in a high-dimensional context space. To find inaccuracies tractably, the robot conducts an informed search through low-dimensional projections of execution data to find parametric Regions of Inaccurate Modeling (RIMs). Empirical evidence from two robot domains shows that this approach significantly enhances the detection power of existing RIM-detection algorithms in high-dimensional spaces.
Linear stability theory as an early warning sign for transitions in high dimensional complex systems
International Nuclear Information System (INIS)
Piovani, Duccio; Grujić, Jelena; Jensen, Henrik Jeldtoft
2016-01-01
We analyse in detail a new approach to the monitoring and forecasting of the onset of transitions in high dimensional complex systems by application to the Tangled Nature model of evolutionary ecology and high dimensional replicator systems with a stochastic element. A high dimensional stability matrix is derived in the mean field approximation to the stochastic dynamics. This allows us to determine the stability spectrum about the observed quasi-stable configurations. From overlap of the instantaneous configuration vector of the full stochastic system with the eigenvectors of the unstable directions of the deterministic mean field approximation, we are able to construct a good early-warning indicator of the transitions occurring intermittently. (paper)
Fickler, Robert; Lapkiewicz, Radek; Huber, Marcus; Lavery, Martin P J; Padgett, Miles J; Zeilinger, Anton
2014-07-30
Photonics has become a mature field of quantum information science, where integrated optical circuits offer a way to scale the complexity of the set-up as well as the dimensionality of the quantum state. On photonic chips, paths are the natural way to encode information. To distribute those high-dimensional quantum states over large distances, transverse spatial modes, like orbital angular momentum possessing Laguerre Gauss modes, are favourable as flying information carriers. Here we demonstrate a quantum interface between these two vibrant photonic fields. We create three-dimensional path entanglement between two photons in a nonlinear crystal and use a mode sorter as the quantum interface to transfer the entanglement to the orbital angular momentum degree of freedom. Thus our results show a flexible way to create high-dimensional spatial mode entanglement. Moreover, they pave the way to implement broad complex quantum networks where high-dimensionally entangled states could be distributed over distant photonic chips.
Pedrini, D. T.; Pedrini, Bonnie C.
Regression, another mechanism studied by Sigmund Freud, has had much research, e.g., hypnotic regression, frustration regression, schizophrenic regression, and infra-human-animal regression (often directly related to fixation). Many investigators worked with hypnotic age regression, which has a long history, going back to Russian reflexologists.…
Directory of Open Access Journals (Sweden)
Thenmozhi Srinivasan
2015-01-01
Full Text Available Clusters of high-dimensional data techniques are emerging, according to data noisy and poor quality challenges. This paper has been developed to cluster data using high-dimensional similarity based PCM (SPCM, with ant colony optimization intelligence which is effective in clustering nonspatial data without getting knowledge about cluster number from the user. The PCM becomes similarity based by using mountain method with it. Though this is efficient clustering, it is checked for optimization using ant colony algorithm with swarm intelligence. Thus the scalable clustering technique is obtained and the evaluation results are checked with synthetic datasets.
Reduced basis ANOVA methods for partial differential equations with high-dimensional random inputs
Energy Technology Data Exchange (ETDEWEB)
Liao, Qifeng, E-mail: liaoqf@shanghaitech.edu.cn [School of Information Science and Technology, ShanghaiTech University, Shanghai 200031 (China); Lin, Guang, E-mail: guanglin@purdue.edu [Department of Mathematics & School of Mechanical Engineering, Purdue University, West Lafayette, IN 47907 (United States)
2016-07-15
In this paper we present a reduced basis ANOVA approach for partial deferential equations (PDEs) with random inputs. The ANOVA method combined with stochastic collocation methods provides model reduction in high-dimensional parameter space through decomposing high-dimensional inputs into unions of low-dimensional inputs. In this work, to further reduce the computational cost, we investigate spatial low-rank structures in the ANOVA-collocation method, and develop efficient spatial model reduction techniques using hierarchically generated reduced bases. We present a general mathematical framework of the methodology, validate its accuracy and demonstrate its efficiency with numerical experiments.
The validation and assessment of machine learning: a game of prediction from high-dimensional data
DEFF Research Database (Denmark)
Pers, Tune Hannes; Albrechtsen, A; Holst, C
2009-01-01
In applied statistics, tools from machine learning are popular for analyzing complex and high-dimensional data. However, few theoretical results are available that could guide to the appropriate machine learning tool in a new application. Initial development of an overall strategy thus often...... the ideas, the game is applied to data from the Nugenob Study where the aim is to predict the fat oxidation capacity based on conventional factors and high-dimensional metabolomics data. Three players have chosen to use support vector machines, LASSO, and random forests, respectively....
Energy Technology Data Exchange (ETDEWEB)
Storm, Emma; Weniger, Christoph [GRAPPA, Institute of Physics, University of Amsterdam, Science Park 904, 1090 GL Amsterdam (Netherlands); Calore, Francesca, E-mail: e.m.storm@uva.nl, E-mail: c.weniger@uva.nl, E-mail: francesca.calore@lapth.cnrs.fr [LAPTh, CNRS, 9 Chemin de Bellevue, BP-110, Annecy-le-Vieux, 74941, Annecy Cedex (France)
2017-08-01
We present SkyFACT (Sky Factorization with Adaptive Constrained Templates), a new approach for studying, modeling and decomposing diffuse gamma-ray emission. Like most previous analyses, the approach relies on predictions from cosmic-ray propagation codes like GALPROP and DRAGON. However, in contrast to previous approaches, we account for the fact that models are not perfect and allow for a very large number (∼> 10{sup 5}) of nuisance parameters to parameterize these imperfections. We combine methods of image reconstruction and adaptive spatio-spectral template regression in one coherent hybrid approach. To this end, we use penalized Poisson likelihood regression, with regularization functions that are motivated by the maximum entropy method. We introduce methods to efficiently handle the high dimensionality of the convex optimization problem as well as the associated semi-sparse covariance matrix, using the L-BFGS-B algorithm and Cholesky factorization. We test the method both on synthetic data as well as on gamma-ray emission from the inner Galaxy, |ℓ|<90{sup o} and | b |<20{sup o}, as observed by the Fermi Large Area Telescope. We finally define a simple reference model that removes most of the residual emission from the inner Galaxy, based on conventional diffuse emission components as well as components for the Fermi bubbles, the Fermi Galactic center excess, and extended sources along the Galactic disk. Variants of this reference model can serve as basis for future studies of diffuse emission in and outside the Galactic disk.
An irregular grid approach for pricing high-dimensional American options
Berridge, S.J.; Schumacher, J.M.
2008-01-01
We propose and test a new method for pricing American options in a high-dimensional setting. The method is centered around the approximation of the associated complementarity problem on an irregular grid. We approximate the partial differential operator on this grid by appealing to the SDE
Can We Train Machine Learning Methods to Outperform the High-dimensional Propensity Score Algorithm?
Karim, Mohammad Ehsanul; Pang, Menglan; Platt, Robert W
2018-03-01
The use of retrospective health care claims datasets is frequently criticized for the lack of complete information on potential confounders. Utilizing patient's health status-related information from claims datasets as surrogates or proxies for mismeasured and unobserved confounders, the high-dimensional propensity score algorithm enables us to reduce bias. Using a previously published cohort study of postmyocardial infarction statin use (1998-2012), we compare the performance of the algorithm with a number of popular machine learning approaches for confounder selection in high-dimensional covariate spaces: random forest, least absolute shrinkage and selection operator, and elastic net. Our results suggest that, when the data analysis is done with epidemiologic principles in mind, machine learning methods perform as well as the high-dimensional propensity score algorithm. Using a plasmode framework that mimicked the empirical data, we also showed that a hybrid of machine learning and high-dimensional propensity score algorithms generally perform slightly better than both in terms of mean squared error, when a bias-based analysis is used.
CSIR Research Space (South Africa)
Giovannini, D
2013-06-01
Full Text Available : QELS_Fundamental Science, San Jose, California United States, 9-14 June 2013 Reconstruction of High-Dimensional States Entangled in Orbital Angular Momentum Using Mutually Unbiased Measurements D. Giovannini1, ⇤, J. Romero1, 2, J. Leach3, A...
Global communication schemes for the numerical solution of high-dimensional PDEs
DEFF Research Database (Denmark)
Hupp, Philipp; Heene, Mario; Jacob, Riko
2016-01-01
The numerical treatment of high-dimensional partial differential equations is among the most compute-hungry problems and in urgent need for current and future high-performance computing (HPC) systems. It is thus also facing the grand challenges of exascale computing such as the requirement...
Ferdosi, Bilkis J.; Buddelmeijer, Hugo; Trager, Scott; Wilkinson, Michael H.F.; Roerdink, Jos B.T.M.
2010-01-01
Data sets in astronomy are growing to enormous sizes. Modern astronomical surveys provide not only image data but also catalogues of millions of objects (stars, galaxies), each object with hundreds of associated parameters. Exploration of this very high-dimensional data space poses a huge challenge.
High-Dimensional Exploratory Item Factor Analysis by a Metropolis-Hastings Robbins-Monro Algorithm
Cai, Li
2010-01-01
A Metropolis-Hastings Robbins-Monro (MH-RM) algorithm for high-dimensional maximum marginal likelihood exploratory item factor analysis is proposed. The sequence of estimates from the MH-RM algorithm converges with probability one to the maximum likelihood solution. Details on the computer implementation of this algorithm are provided. The…
Multi-Scale Factor Analysis of High-Dimensional Brain Signals
Ting, Chee-Ming; Ombao, Hernando; Salleh, Sh-Hussain
2017-01-01
In this paper, we develop an approach to modeling high-dimensional networks with a large number of nodes arranged in a hierarchical and modular structure. We propose a novel multi-scale factor analysis (MSFA) model which partitions the massive
Spectrally-Corrected Estimation for High-Dimensional Markowitz Mean-Variance Optimization
Z. Bai (Zhidong); H. Li (Hua); M.J. McAleer (Michael); W.-K. Wong (Wing-Keung)
2016-01-01
textabstractThis paper considers the portfolio problem for high dimensional data when the dimension and size are both large. We analyze the traditional Markowitz mean-variance (MV) portfolio by large dimension matrix theory, and find the spectral distribution of the sample covariance is the main
Berridge, S.J.; Schumacher, J.M.
2004-01-01
We propose a method for pricing high-dimensional American options on an irregular grid; the method involves using quadratic functions to approximate the local effect of the Black-Scholes operator.Once such an approximation is known, one can solve the pricing problem by time stepping in an explicit
Multigrid for high dimensional elliptic partial differential equations on non-equidistant grids
bin Zubair, H.; Oosterlee, C.E.; Wienands, R.
2006-01-01
This work presents techniques, theory and numbers for multigrid in a general d-dimensional setting. The main focus is the multigrid convergence for high-dimensional partial differential equations (PDEs). As a model problem we have chosen the anisotropic diffusion equation, on a unit hypercube. We
An Irregular Grid Approach for Pricing High-Dimensional American Options
Berridge, S.J.; Schumacher, J.M.
2004-01-01
We propose and test a new method for pricing American options in a high-dimensional setting.The method is centred around the approximation of the associated complementarity problem on an irregular grid.We approximate the partial differential operator on this grid by appealing to the SDE
Pricing and hedging high-dimensional American options : an irregular grid approach
Berridge, S.; Schumacher, H.
2002-01-01
We propose and test a new method for pricing American options in a high dimensional setting. The method is centred around the approximation of the associated variational inequality on an irregular grid. We approximate the partial differential operator on this grid by appealing to the SDE
DEFF Research Database (Denmark)
Johansen, Søren
2008-01-01
The reduced rank regression model is a multivariate regression model with a coefficient matrix with reduced rank. The reduced rank regression algorithm is an estimation procedure, which estimates the reduced rank regression model. It is related to canonical correlations and involves calculating...
Characterization of discontinuities in high-dimensional stochastic problems on adaptive sparse grids
International Nuclear Information System (INIS)
Jakeman, John D.; Archibald, Richard; Xiu Dongbin
2011-01-01
In this paper we present a set of efficient algorithms for detection and identification of discontinuities in high dimensional space. The method is based on extension of polynomial annihilation for discontinuity detection in low dimensions. Compared to the earlier work, the present method poses significant improvements for high dimensional problems. The core of the algorithms relies on adaptive refinement of sparse grids. It is demonstrated that in the commonly encountered cases where a discontinuity resides on a small subset of the dimensions, the present method becomes 'optimal', in the sense that the total number of points required for function evaluations depends linearly on the dimensionality of the space. The details of the algorithms will be presented and various numerical examples are utilized to demonstrate the efficacy of the method.
Su, Yapeng; Shi, Qihui; Wei, Wei
2017-02-01
New insights on cellular heterogeneity in the last decade provoke the development of a variety of single cell omics tools at a lightning pace. The resultant high-dimensional single cell data generated by these tools require new theoretical approaches and analytical algorithms for effective visualization and interpretation. In this review, we briefly survey the state-of-the-art single cell proteomic tools with a particular focus on data acquisition and quantification, followed by an elaboration of a number of statistical and computational approaches developed to date for dissecting the high-dimensional single cell data. The underlying assumptions, unique features, and limitations of the analytical methods with the designated biological questions they seek to answer will be discussed. Particular attention will be given to those information theoretical approaches that are anchored in a set of first principles of physics and can yield detailed (and often surprising) predictions. © 2017 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.
A Shell Multi-dimensional Hierarchical Cubing Approach for High-Dimensional Cube
Zou, Shuzhi; Zhao, Li; Hu, Kongfa
The pre-computation of data cubes is critical for improving the response time of OLAP systems and accelerating data mining tasks in large data warehouses. However, as the sizes of data warehouses grow, the time it takes to perform this pre-computation becomes a significant performance bottleneck. In a high dimensional data warehouse, it might not be practical to build all these cuboids and their indices. In this paper, we propose a shell multi-dimensional hierarchical cubing algorithm, based on an extension of the previous minimal cubing approach. This method partitions the high dimensional data cube into low multi-dimensional hierarchical cube. Experimental results show that the proposed method is significantly more efficient than other existing cubing methods.
Minimax Rate-optimal Estimation of High-dimensional Covariance Matrices with Incomplete Data.
Cai, T Tony; Zhang, Anru
2016-09-01
Missing data occur frequently in a wide range of applications. In this paper, we consider estimation of high-dimensional covariance matrices in the presence of missing observations under a general missing completely at random model in the sense that the missingness is not dependent on the values of the data. Based on incomplete data, estimators for bandable and sparse covariance matrices are proposed and their theoretical and numerical properties are investigated. Minimax rates of convergence are established under the spectral norm loss and the proposed estimators are shown to be rate-optimal under mild regularity conditions. Simulation studies demonstrate that the estimators perform well numerically. The methods are also illustrated through an application to data from four ovarian cancer studies. The key technical tools developed in this paper are of independent interest and potentially useful for a range of related problems in high-dimensional statistical inference with missing data.
Minimax Rate-optimal Estimation of High-dimensional Covariance Matrices with Incomplete Data*
Cai, T. Tony; Zhang, Anru
2016-01-01
Missing data occur frequently in a wide range of applications. In this paper, we consider estimation of high-dimensional covariance matrices in the presence of missing observations under a general missing completely at random model in the sense that the missingness is not dependent on the values of the data. Based on incomplete data, estimators for bandable and sparse covariance matrices are proposed and their theoretical and numerical properties are investigated. Minimax rates of convergence are established under the spectral norm loss and the proposed estimators are shown to be rate-optimal under mild regularity conditions. Simulation studies demonstrate that the estimators perform well numerically. The methods are also illustrated through an application to data from four ovarian cancer studies. The key technical tools developed in this paper are of independent interest and potentially useful for a range of related problems in high-dimensional statistical inference with missing data. PMID:27777471
Distribution of high-dimensional entanglement via an intra-city free-space link.
Steinlechner, Fabian; Ecker, Sebastian; Fink, Matthias; Liu, Bo; Bavaresco, Jessica; Huber, Marcus; Scheidl, Thomas; Ursin, Rupert
2017-07-24
Quantum entanglement is a fundamental resource in quantum information processing and its distribution between distant parties is a key challenge in quantum communications. Increasing the dimensionality of entanglement has been shown to improve robustness and channel capacities in secure quantum communications. Here we report on the distribution of genuine high-dimensional entanglement via a 1.2-km-long free-space link across Vienna. We exploit hyperentanglement, that is, simultaneous entanglement in polarization and energy-time bases, to encode quantum information, and observe high-visibility interference for successive correlation measurements in each degree of freedom. These visibilities impose lower bounds on entanglement in each subspace individually and certify four-dimensional entanglement for the hyperentangled system. The high-fidelity transmission of high-dimensional entanglement under real-world atmospheric link conditions represents an important step towards long-distance quantum communications with more complex quantum systems and the implementation of advanced quantum experiments with satellite links.
An Unbiased Distance-based Outlier Detection Approach for High-dimensional Data
DEFF Research Database (Denmark)
Nguyen, Hoang Vu; Gopalkrishnan, Vivekanand; Assent, Ira
2011-01-01
than a global property. Different from existing approaches, it is not grid-based and dimensionality unbiased. Thus, its performance is impervious to grid resolution as well as the curse of dimensionality. In addition, our approach ranks the outliers, allowing users to select the number of desired...... outliers, thus mitigating the issue of high false alarm rate. Extensive empirical studies on real datasets show that our approach efficiently and effectively detects outliers, even in high-dimensional spaces....
Controlling chaos in low and high dimensional systems with periodic parametric perturbations
International Nuclear Information System (INIS)
Mirus, K.A.; Sprott, J.C.
1998-06-01
The effect of applying a periodic perturbation to an accessible parameter of various chaotic systems is examined. Numerical results indicate that perturbation frequencies near the natural frequencies of the unstable periodic orbits of the chaotic systems can result in limit cycles for relatively small perturbations. Such perturbations can also control or significantly reduce the dimension of high-dimensional systems. Initial application to the control of fluctuations in a prototypical magnetic fusion plasma device will be reviewed
A Comparison of Machine Learning Methods in a High-Dimensional Classification Problem
Zekić-Sušac, Marijana; Pfeifer, Sanja; Šarlija, Nataša
2014-01-01
Background: Large-dimensional data modelling often relies on variable reduction methods in the pre-processing and in the post-processing stage. However, such a reduction usually provides less information and yields a lower accuracy of the model. Objectives: The aim of this paper is to assess the high-dimensional classification problem of recognizing entrepreneurial intentions of students by machine learning methods. Methods/Approach: Four methods were tested: artificial neural networks, CART ...
Preface [HD3-2015: International meeting on high-dimensional data-driven science
International Nuclear Information System (INIS)
2016-01-01
A never-ending series of innovations in measurement technology and evolutions in information and communication technologies have led to the ongoing generation and accumulation of large quantities of high-dimensional data every day. While detailed data-centric approaches have been pursued in respective research fields, situations have been encountered where the same mathematical framework of high-dimensional data analysis can be found in a wide variety of seemingly unrelated research fields, such as estimation on the basis of undersampled Fourier transform in nuclear magnetic resonance spectroscopy in chemistry, in magnetic resonance imaging in medicine, and in astronomical interferometry in astronomy. In such situations, bringing diverse viewpoints together therefore becomes a driving force for the creation of innovative developments in various different research fields. This meeting focuses on “Sparse Modeling” (SpM) as a methodology for creation of innovative developments through the incorporation of a wide variety of viewpoints in various research fields. The objective of this meeting is to offer a forum where researchers with interest in SpM can assemble and exchange information on the latest results and newly established methodologies, and discuss future directions of the interdisciplinary studies for High-Dimensional Data-Driven science (HD 3 ). The meeting was held in Kyoto from 14-17 December 2015. We are pleased to publish 22 papers contributed by invited speakers in this volume of Journal of Physics: Conference Series. We hope that this volume will promote further development of High-Dimensional Data-Driven science. (paper)
Reinforcement learning on slow features of high-dimensional input streams.
Directory of Open Access Journals (Sweden)
Robert Legenstein
Full Text Available Humans and animals are able to learn complex behaviors based on a massive stream of sensory information from different modalities. Early animal studies have identified learning mechanisms that are based on reward and punishment such that animals tend to avoid actions that lead to punishment whereas rewarded actions are reinforced. However, most algorithms for reward-based learning are only applicable if the dimensionality of the state-space is sufficiently small or its structure is sufficiently simple. Therefore, the question arises how the problem of learning on high-dimensional data is solved in the brain. In this article, we propose a biologically plausible generic two-stage learning system that can directly be applied to raw high-dimensional input streams. The system is composed of a hierarchical slow feature analysis (SFA network for preprocessing and a simple neural network on top that is trained based on rewards. We demonstrate by computer simulations that this generic architecture is able to learn quite demanding reinforcement learning tasks on high-dimensional visual input streams in a time that is comparable to the time needed when an explicit highly informative low-dimensional state-space representation is given instead of the high-dimensional visual input. The learning speed of the proposed architecture in a task similar to the Morris water maze task is comparable to that found in experimental studies with rats. This study thus supports the hypothesis that slowness learning is one important unsupervised learning principle utilized in the brain to form efficient state representations for behavioral learning.
Runcie, Daniel E; Mukherjee, Sayan
2013-07-01
Quantitative genetic studies that model complex, multivariate phenotypes are important for both evolutionary prediction and artificial selection. For example, changes in gene expression can provide insight into developmental and physiological mechanisms that link genotype and phenotype. However, classical analytical techniques are poorly suited to quantitative genetic studies of gene expression where the number of traits assayed per individual can reach many thousand. Here, we derive a Bayesian genetic sparse factor model for estimating the genetic covariance matrix (G-matrix) of high-dimensional traits, such as gene expression, in a mixed-effects model. The key idea of our model is that we need consider only G-matrices that are biologically plausible. An organism's entire phenotype is the result of processes that are modular and have limited complexity. This implies that the G-matrix will be highly structured. In particular, we assume that a limited number of intermediate traits (or factors, e.g., variations in development or physiology) control the variation in the high-dimensional phenotype, and that each of these intermediate traits is sparse - affecting only a few observed traits. The advantages of this approach are twofold. First, sparse factors are interpretable and provide biological insight into mechanisms underlying the genetic architecture. Second, enforcing sparsity helps prevent sampling errors from swamping out the true signal in high-dimensional data. We demonstrate the advantages of our model on simulated data and in an analysis of a published Drosophila melanogaster gene expression data set.
Hypergraph-based anomaly detection of high-dimensional co-occurrences.
Silva, Jorge; Willett, Rebecca
2009-03-01
This paper addresses the problem of detecting anomalous multivariate co-occurrences using a limited number of unlabeled training observations. A novel method based on using a hypergraph representation of the data is proposed to deal with this very high-dimensional problem. Hypergraphs constitute an important extension of graphs which allow edges to connect more than two vertices simultaneously. A variational Expectation-Maximization algorithm for detecting anomalies directly on the hypergraph domain without any feature selection or dimensionality reduction is presented. The resulting estimate can be used to calculate a measure of anomalousness based on the False Discovery Rate. The algorithm has O(np) computational complexity, where n is the number of training observations and p is the number of potential participants in each co-occurrence event. This efficiency makes the method ideally suited for very high-dimensional settings, and requires no tuning, bandwidth or regularization parameters. The proposed approach is validated on both high-dimensional synthetic data and the Enron email database, where p > 75,000, and it is shown that it can outperform other state-of-the-art methods.
High-Dimensional Function Approximation With Neural Networks for Large Volumes of Data.
Andras, Peter
2018-02-01
Approximation of high-dimensional functions is a challenge for neural networks due to the curse of dimensionality. Often the data for which the approximated function is defined resides on a low-dimensional manifold and in principle the approximation of the function over this manifold should improve the approximation performance. It has been show that projecting the data manifold into a lower dimensional space, followed by the neural network approximation of the function over this space, provides a more precise approximation of the function than the approximation of the function with neural networks in the original data space. However, if the data volume is very large, the projection into the low-dimensional space has to be based on a limited sample of the data. Here, we investigate the nature of the approximation error of neural networks trained over the projection space. We show that such neural networks should have better approximation performance than neural networks trained on high-dimensional data even if the projection is based on a relatively sparse sample of the data manifold. We also find that it is preferable to use a uniformly distributed sparse sample of the data for the purpose of the generation of the low-dimensional projection. We illustrate these results considering the practical neural network approximation of a set of functions defined on high-dimensional data including real world data as well.
Directory of Open Access Journals (Sweden)
Malgorzata Nowicka
2017-05-01
Full Text Available High dimensional mass and flow cytometry (HDCyto experiments have become a method of choice for high throughput interrogation and characterization of cell populations.Here, we present an R-based pipeline for differential analyses of HDCyto data, largely based on Bioconductor packages. We computationally define cell populations using FlowSOM clustering, and facilitate an optional but reproducible strategy for manual merging of algorithm-generated clusters. Our workflow offers different analysis paths, including association of cell type abundance with a phenotype or changes in signaling markers within specific subpopulations, or differential analyses of aggregated signals. Importantly, the differential analyses we show are based on regression frameworks where the HDCyto data is the response; thus, we are able to model arbitrary experimental designs, such as those with batch effects, paired designs and so on. In particular, we apply generalized linear mixed models to analyses of cell population abundance or cell-population-specific analyses of signaling markers, allowing overdispersion in cell count or aggregated signals across samples to be appropriately modeled. To support the formal statistical analyses, we encourage exploratory data analysis at every step, including quality control (e.g. multi-dimensional scaling plots, reporting of clustering results (dimensionality reduction, heatmaps with dendrograms and differential analyses (e.g. plots of aggregated signals.
Regression analysis by example
Chatterjee, Samprit
2012-01-01
Praise for the Fourth Edition: ""This book is . . . an excellent source of examples for regression analysis. It has been and still is readily readable and understandable."" -Journal of the American Statistical Association Regression analysis is a conceptually simple method for investigating relationships among variables. Carrying out a successful application of regression analysis, however, requires a balance of theoretical results, empirical rules, and subjective judgment. Regression Analysis by Example, Fifth Edition has been expanded
Significance testing in ridge regression for genetic data
Directory of Open Access Journals (Sweden)
De Iorio Maria
2011-09-01
Full Text Available Abstract Background Technological developments have increased the feasibility of large scale genetic association studies. Densely typed genetic markers are obtained using SNP arrays, next-generation sequencing technologies and imputation. However, SNPs typed using these methods can be highly correlated due to linkage disequilibrium among them, and standard multiple regression techniques fail with these data sets due to their high dimensionality and correlation structure. There has been increasing interest in using penalised regression in the analysis of high dimensional data. Ridge regression is one such penalised regression technique which does not perform variable selection, instead estimating a regression coefficient for each predictor variable. It is therefore desirable to obtain an estimate of the significance of each ridge regression coefficient. Results We develop and evaluate a test of significance for ridge regression coefficients. Using simulation studies, we demonstrate that the performance of the test is comparable to that of a permutation test, with the advantage of a much-reduced computational cost. We introduce the p-value trace, a plot of the negative logarithm of the p-values of ridge regression coefficients with increasing shrinkage parameter, which enables the visualisation of the change in p-value of the regression coefficients with increasing penalisation. We apply the proposed method to a lung cancer case-control data set from EPIC, the European Prospective Investigation into Cancer and Nutrition. Conclusions The proposed test is a useful alternative to a permutation test for the estimation of the significance of ridge regression coefficients, at a much-reduced computational cost. The p-value trace is an informative graphical tool for evaluating the results of a test of significance of ridge regression coefficients as the shrinkage parameter increases, and the proposed test makes its production computationally feasible.
Semiparametric Mixtures of Regressions with Single-index for Model Based Clustering
Xiang, Sijia; Yao, Weixin
2017-01-01
In this article, we propose two classes of semiparametric mixture regression models with single-index for model based clustering. Unlike many semiparametric/nonparametric mixture regression models that can only be applied to low dimensional predictors, the new semiparametric models can easily incorporate high dimensional predictors into the nonparametric components. The proposed models are very general, and many of the recently proposed semiparametric/nonparametric mixture regression models a...
On-chip generation of high-dimensional entangled quantum states and their coherent control.
Kues, Michael; Reimer, Christian; Roztocki, Piotr; Cortés, Luis Romero; Sciara, Stefania; Wetzel, Benjamin; Zhang, Yanbing; Cino, Alfonso; Chu, Sai T; Little, Brent E; Moss, David J; Caspani, Lucia; Azaña, José; Morandotti, Roberto
2017-06-28
Optical quantum states based on entangled photons are essential for solving questions in fundamental physics and are at the heart of quantum information science. Specifically, the realization of high-dimensional states (D-level quantum systems, that is, qudits, with D > 2) and their control are necessary for fundamental investigations of quantum mechanics, for increasing the sensitivity of quantum imaging schemes, for improving the robustness and key rate of quantum communication protocols, for enabling a richer variety of quantum simulations, and for achieving more efficient and error-tolerant quantum computation. Integrated photonics has recently become a leading platform for the compact, cost-efficient, and stable generation and processing of non-classical optical states. However, so far, integrated entangled quantum sources have been limited to qubits (D = 2). Here we demonstrate on-chip generation of entangled qudit states, where the photons are created in a coherent superposition of multiple high-purity frequency modes. In particular, we confirm the realization of a quantum system with at least one hundred dimensions, formed by two entangled qudits with D = 10. Furthermore, using state-of-the-art, yet off-the-shelf telecommunications components, we introduce a coherent manipulation platform with which to control frequency-entangled states, capable of performing deterministic high-dimensional gate operations. We validate this platform by measuring Bell inequality violations and performing quantum state tomography. Our work enables the generation and processing of high-dimensional quantum states in a single spatial mode.
Wu, Shuang; Liu, Zhi-Ping; Qiu, Xing; Wu, Hulin
2014-01-01
The immune response to viral infection is regulated by an intricate network of many genes and their products. The reverse engineering of gene regulatory networks (GRNs) using mathematical models from time course gene expression data collected after influenza infection is key to our understanding of the mechanisms involved in controlling influenza infection within a host. A five-step pipeline: detection of temporally differentially expressed genes, clustering genes into co-expressed modules, identification of network structure, parameter estimate refinement, and functional enrichment analysis, is developed for reconstructing high-dimensional dynamic GRNs from genome-wide time course gene expression data. Applying the pipeline to the time course gene expression data from influenza-infected mouse lungs, we have identified 20 distinct temporal expression patterns in the differentially expressed genes and constructed a module-based dynamic network using a linear ODE model. Both intra-module and inter-module annotations and regulatory relationships of our inferred network show some interesting findings and are highly consistent with existing knowledge about the immune response in mice after influenza infection. The proposed method is a computationally efficient, data-driven pipeline bridging experimental data, mathematical modeling, and statistical analysis. The application to the influenza infection data elucidates the potentials of our pipeline in providing valuable insights into systematic modeling of complicated biological processes.
Meng, Xi; Nguyen, Bao D.; Ridge, Clark; Shaka, A. J.
2009-01-01
High-dimensional (HD) NMR spectra have poorer digital resolution than low-dimensional (LD) spectra, for a fixed amount of experiment time. This has led to “reduced-dimensionality” strategies, in which several LD projections of the HD NMR spectrum are acquired, each with higher digital resolution; an approximate HD spectrum is then inferred by some means. We propose a strategy that moves in the opposite direction, by adding more time dimensions to increase the information content of the data set, even if only a very sparse time grid is used in each dimension. The full HD time-domain data can be analyzed by the Filter Diagonalization Method (FDM), yielding very narrow resonances along all of the frequency axes, even those with sparse sampling. Integrating over the added dimensions of HD FDM NMR spectra reconstitutes LD spectra with enhanced resolution, often more quickly than direct acquisition of the LD spectrum with a larger number of grid points in each of the fewer dimensions. If the extra dimensions do not appear in the final spectrum, and are used solely to boost information content, we propose the moniker hidden-dimension NMR. This work shows that HD peaks have unmistakable frequency signatures that can be detected as single HD objects by an appropriate algorithm, even though their patterns would be tricky for a human operator to visualize or recognize, and even if digital resolution in an HD FT spectrum is very coarse compared with natural line widths. PMID:18926747
DEFF Research Database (Denmark)
Fitzenberger, Bernd; Wilke, Ralf Andreas
2015-01-01
if the mean regression model does not. We provide a short informal introduction into the principle of quantile regression which includes an illustrative application from empirical labor market research. This is followed by briefly sketching the underlying statistical model for linear quantile regression based......Quantile regression is emerging as a popular statistical approach, which complements the estimation of conditional mean models. While the latter only focuses on one aspect of the conditional distribution of the dependent variable, the mean, quantile regression provides more detailed insights...... by modeling conditional quantiles. Quantile regression can therefore detect whether the partial effect of a regressor on the conditional quantiles is the same for all quantiles or differs across quantiles. Quantile regression can provide evidence for a statistical relationship between two variables even...
Covariance Method of the Tunneling Radiation from High Dimensional Rotating Black Holes
Li, Hui-Ling; Han, Yi-Wen; Chen, Shuai-Ru; Ding, Cong
2018-04-01
In this paper, Angheben-Nadalini-Vanzo-Zerbini (ANVZ) covariance method is used to study the tunneling radiation from the Kerr-Gödel black hole and Myers-Perry black hole with two independent angular momentum. By solving the Hamilton-Jacobi equation and separating the variables, the radial motion equation of a tunneling particle is obtained. Using near horizon approximation and the distance of the proper pure space, we calculate the tunneling rate and the temperature of Hawking radiation. Thus, the method of ANVZ covariance is extended to the research of high dimensional black hole tunneling radiation.
Efficient and accurate nearest neighbor and closest pair search in high-dimensional space
Tao, Yufei
2010-07-01
Nearest Neighbor (NN) search in high-dimensional space is an important problem in many applications. From the database perspective, a good solution needs to have two properties: (i) it can be easily incorporated in a relational database, and (ii) its query cost should increase sublinearly with the dataset size, regardless of the data and query distributions. Locality-Sensitive Hashing (LSH) is a well-known methodology fulfilling both requirements, but its current implementations either incur expensive space and query cost, or abandon its theoretical guarantee on the quality of query results. Motivated by this, we improve LSH by proposing an access method called the Locality-Sensitive B-tree (LSB-tree) to enable fast, accurate, high-dimensional NN search in relational databases. The combination of several LSB-trees forms a LSB-forest that has strong quality guarantees, but improves dramatically the efficiency of the previous LSH implementation having the same guarantees. In practice, the LSB-tree itself is also an effective index which consumes linear space, supports efficient updates, and provides accurate query results. In our experiments, the LSB-tree was faster than: (i) iDistance (a famous technique for exact NN search) by two orders ofmagnitude, and (ii) MedRank (a recent approximate method with nontrivial quality guarantees) by one order of magnitude, and meanwhile returned much better results. As a second step, we extend our LSB technique to solve another classic problem, called Closest Pair (CP) search, in high-dimensional space. The long-term challenge for this problem has been to achieve subquadratic running time at very high dimensionalities, which fails most of the existing solutions. We show that, using a LSB-forest, CP search can be accomplished in (worst-case) time significantly lower than the quadratic complexity, yet still ensuring very good quality. In practice, accurate answers can be found using just two LSB-trees, thus giving a substantial
DEFF Research Database (Denmark)
Ding, Yunhong; Bacco, Davide; Dalgaard, Kjeld
2017-01-01
is intrinsically limited to 1 bit/photon. Here we propose and experimentally demonstrate, for the first time, a high-dimensional quantum key distribution protocol based on space division multiplexing in multicore fiber using silicon photonic integrated lightwave circuits. We successfully realized three mutually......-dimensional quantum states, and enables breaking the information efficiency limit of traditional quantum key distribution protocols. In addition, the silicon photonic circuits used in our work integrate variable optical attenuators, highly efficient multicore fiber couplers, and Mach-Zehnder interferometers, enabling...
High-dimensional chaos from self-sustained collisions of solitons
Energy Technology Data Exchange (ETDEWEB)
Yildirim, O. Ozgur, E-mail: donhee@seas.harvard.edu, E-mail: oozgury@gmail.com [Cavium, Inc., 600 Nickerson Rd., Marlborough, Massachusetts 01752 (United States); Ham, Donhee, E-mail: donhee@seas.harvard.edu, E-mail: oozgury@gmail.com [Harvard University, 33 Oxford St., Cambridge, Massachusetts 02138 (United States)
2014-06-16
We experimentally demonstrate chaos generation based on collisions of electrical solitons on a nonlinear transmission line. The nonlinear line creates solitons, and an amplifier connected to it provides gain to these solitons for their self-excitation and self-sustenance. Critically, the amplifier also provides a mechanism to enable and intensify collisions among solitons. These collisional interactions are of intrinsically nonlinear nature, modulating the phase and amplitude of solitons, thus causing chaos. This chaos generated by the exploitation of the nonlinear wave phenomena is inherently high-dimensional, which we also demonstrate.
Inferring biological tasks using Pareto analysis of high-dimensional data.
Hart, Yuval; Sheftel, Hila; Hausser, Jean; Szekely, Pablo; Ben-Moshe, Noa Bossel; Korem, Yael; Tendler, Avichai; Mayo, Avraham E; Alon, Uri
2015-03-01
We present the Pareto task inference method (ParTI; http://www.weizmann.ac.il/mcb/UriAlon/download/ParTI) for inferring biological tasks from high-dimensional biological data. Data are described as a polytope, and features maximally enriched closest to the vertices (or archetypes) allow identification of the tasks the vertices represent. We demonstrate that human breast tumors and mouse tissues are well described by tetrahedrons in gene expression space, with specific tumor types and biological functions enriched at each of the vertices, suggesting four key tasks.
A novel algorithm of artificial immune system for high-dimensional function numerical optimization
Institute of Scientific and Technical Information of China (English)
DU Haifeng; GONG Maoguo; JIAO Licheng; LIU Ruochen
2005-01-01
Based on the clonal selection theory and immune memory theory, a novel artificial immune system algorithm, immune memory clonal programming algorithm (IMCPA), is put forward. Using the theorem of Markov chain, it is proved that IMCPA is convergent. Compared with some other evolutionary programming algorithms (like Breeder genetic algorithm), IMCPA is shown to be an evolutionary strategy capable of solving complex machine learning tasks, like high-dimensional function optimization, which maintains the diversity of the population and avoids prematurity to some extent, and has a higher convergence speed.
Computing and visualizing time-varying merge trees for high-dimensional data
Energy Technology Data Exchange (ETDEWEB)
Oesterling, Patrick [Univ. of Leipzig (Germany); Heine, Christian [Univ. of Kaiserslautern (Germany); Weber, Gunther H. [Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States); Morozov, Dmitry [Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States); Scheuermann, Gerik [Univ. of Leipzig (Germany)
2017-06-03
We introduce a new method that identifies and tracks features in arbitrary dimensions using the merge tree -- a structure for identifying topological features based on thresholding in scalar fields. This method analyzes the evolution of features of the function by tracking changes in the merge tree and relates features by matching subtrees between consecutive time steps. Using the time-varying merge tree, we present a structural visualization of the changing function that illustrates both features and their temporal evolution. We demonstrate the utility of our approach by applying it to temporal cluster analysis of high-dimensional point clouds.
High-dimensional data: p >> n in mathematical statistics and bio-medical applications
Van De Geer, Sara A.; Van Houwelingen, Hans C.
2004-01-01
The workshop 'High-dimensional data: p >> n in mathematical statistics and bio-medical applications' was held at the Lorentz Center in Leiden from 9 to 20 September 2002. This special issue of Bernoulli contains a selection of papers presented at that workshop. ¶ The introduction of high-throughput micro-array technology to measure gene-expression levels and the publication of the pioneering paper by Golub et al. (1999) has brought to life a whole new branch of data analysis under the name of...
Ghosts in high dimensional non-linear dynamical systems: The example of the hypercycle
International Nuclear Information System (INIS)
Sardanyes, Josep
2009-01-01
Ghost-induced delayed transitions are analyzed in high dimensional non-linear dynamical systems by means of the hypercycle model. The hypercycle is a network of catalytically-coupled self-replicating RNA-like macromolecules, and has been suggested to be involved in the transition from non-living to living matter in the context of earlier prebiotic evolution. It is demonstrated that, in the vicinity of the saddle-node bifurcation for symmetric hypercycles, the persistence time before extinction, T ε , tends to infinity as n→∞ (being n the number of units of the hypercycle), thus suggesting that the increase in the number of hypercycle units involves a longer resilient time before extinction because of the ghost. Furthermore, by means of numerical analysis the dynamics of three large hypercycle networks is also studied, focusing in their extinction dynamics associated to the ghosts. Such networks allow to explore the properties of the ghosts living in high dimensional phase space with n = 5, n = 10 and n = 15 dimensions. These hypercyclic networks, in agreement with other works, are shown to exhibit self-maintained oscillations governed by stable limit cycles. The bifurcation scenarios for these hypercycles are analyzed, as well as the effect of the phase space dimensionality in the delayed transition phenomena and in the scaling properties of the ghosts near bifurcation threshold
High-dimensional free-space optical communications based on orbital angular momentum coding
Zou, Li; Gu, Xiaofan; Wang, Le
2018-03-01
In this paper, we propose a high-dimensional free-space optical communication scheme using orbital angular momentum (OAM) coding. In the scheme, the transmitter encodes N-bits information by using a spatial light modulator to convert a Gaussian beam to a superposition mode of N OAM modes and a Gaussian mode; The receiver decodes the information through an OAM mode analyser which consists of a MZ interferometer with a rotating Dove prism, a photoelectric detector and a computer carrying out the fast Fourier transform. The scheme could realize a high-dimensional free-space optical communication, and decodes the information much fast and accurately. We have verified the feasibility of the scheme by exploiting 8 (4) OAM modes and a Gaussian mode to implement a 256-ary (16-ary) coding free-space optical communication to transmit a 256-gray-scale (16-gray-scale) picture. The results show that a zero bit error rate performance has been achieved.
Energy Efficient MAC Scheme for Wireless Sensor Networks with High-Dimensional Data Aggregate
Directory of Open Access Journals (Sweden)
Seokhoon Kim
2015-01-01
Full Text Available This paper presents a novel and sustainable medium access control (MAC scheme for wireless sensor network (WSN systems that process high-dimensional aggregated data. Based on a preamble signal and buffer threshold analysis, it maximizes the energy efficiency of the wireless sensor devices which have limited energy resources. The proposed group management MAC (GM-MAC approach not only sets the buffer threshold value of a sensor device to be reciprocal to the preamble signal but also sets a transmittable group value to each sensor device by using the preamble signal of the sink node. The primary difference between the previous and the proposed approach is that existing state-of-the-art schemes use duty cycle and sleep mode to save energy consumption of individual sensor devices, whereas the proposed scheme employs the group management MAC scheme for sensor devices to maximize the overall energy efficiency of the whole WSN systems by minimizing the energy consumption of sensor devices located near the sink node. Performance evaluations show that the proposed scheme outperforms the previous schemes in terms of active time of sensor devices, transmission delay, control overhead, and energy consumption. Therefore, the proposed scheme is suitable for sensor devices in a variety of wireless sensor networking environments with high-dimensional data aggregate.
Selecting Optimal Feature Set in High-Dimensional Data by Swarm Search
Directory of Open Access Journals (Sweden)
Simon Fong
2013-01-01
Full Text Available Selecting the right set of features from data of high dimensionality for inducing an accurate classification model is a tough computational challenge. It is almost a NP-hard problem as the combinations of features escalate exponentially as the number of features increases. Unfortunately in data mining, as well as other engineering applications and bioinformatics, some data are described by a long array of features. Many feature subset selection algorithms have been proposed in the past, but not all of them are effective. Since it takes seemingly forever to use brute force in exhaustively trying every possible combination of features, stochastic optimization may be a solution. In this paper, we propose a new feature selection scheme called Swarm Search to find an optimal feature set by using metaheuristics. The advantage of Swarm Search is its flexibility in integrating any classifier into its fitness function and plugging in any metaheuristic algorithm to facilitate heuristic search. Simulation experiments are carried out by testing the Swarm Search over some high-dimensional datasets, with different classification algorithms and various metaheuristic algorithms. The comparative experiment results show that Swarm Search is able to attain relatively low error rates in classification without shrinking the size of the feature subset to its minimum.
The validation and assessment of machine learning: a game of prediction from high-dimensional data.
Directory of Open Access Journals (Sweden)
Tune H Pers
Full Text Available In applied statistics, tools from machine learning are popular for analyzing complex and high-dimensional data. However, few theoretical results are available that could guide to the appropriate machine learning tool in a new application. Initial development of an overall strategy thus often implies that multiple methods are tested and compared on the same set of data. This is particularly difficult in situations that are prone to over-fitting where the number of subjects is low compared to the number of potential predictors. The article presents a game which provides some grounds for conducting a fair model comparison. Each player selects a modeling strategy for predicting individual response from potential predictors. A strictly proper scoring rule, bootstrap cross-validation, and a set of rules are used to make the results obtained with different strategies comparable. To illustrate the ideas, the game is applied to data from the Nugenob Study where the aim is to predict the fat oxidation capacity based on conventional factors and high-dimensional metabolomics data. Three players have chosen to use support vector machines, LASSO, and random forests, respectively.
High-dimensional quantum key distribution with the entangled single-photon-added coherent state
Energy Technology Data Exchange (ETDEWEB)
Wang, Yang [Zhengzhou Information Science and Technology Institute, Zhengzhou, 450001 (China); Synergetic Innovation Center of Quantum Information and Quantum Physics, University of Science and Technology of China, Hefei, Anhui 230026 (China); Bao, Wan-Su, E-mail: 2010thzz@sina.com [Zhengzhou Information Science and Technology Institute, Zhengzhou, 450001 (China); Synergetic Innovation Center of Quantum Information and Quantum Physics, University of Science and Technology of China, Hefei, Anhui 230026 (China); Bao, Hai-Ze; Zhou, Chun; Jiang, Mu-Sheng; Li, Hong-Wei [Zhengzhou Information Science and Technology Institute, Zhengzhou, 450001 (China); Synergetic Innovation Center of Quantum Information and Quantum Physics, University of Science and Technology of China, Hefei, Anhui 230026 (China)
2017-04-25
High-dimensional quantum key distribution (HD-QKD) can generate more secure bits for one detection event so that it can achieve long distance key distribution with a high secret key capacity. In this Letter, we present a decoy state HD-QKD scheme with the entangled single-photon-added coherent state (ESPACS) source. We present two tight formulas to estimate the single-photon fraction of postselected events and Eve's Holevo information and derive lower bounds on the secret key capacity and the secret key rate of our protocol. We also present finite-key analysis for our protocol by using the Chernoff bound. Our numerical results show that our protocol using one decoy state can perform better than that of previous HD-QKD protocol with the spontaneous parametric down conversion (SPDC) using two decoy states. Moreover, when considering finite resources, the advantage is more obvious. - Highlights: • Implement the single-photon-added coherent state source into the high-dimensional quantum key distribution. • Enhance both the secret key capacity and the secret key rate compared with previous schemes. • Show an excellent performance in view of statistical fluctuations.
A Feature Subset Selection Method Based On High-Dimensional Mutual Information
Directory of Open Access Journals (Sweden)
Chee Keong Kwoh
2011-04-01
Full Text Available Feature selection is an important step in building accurate classifiers and provides better understanding of the data sets. In this paper, we propose a feature subset selection method based on high-dimensional mutual information. We also propose to use the entropy of the class attribute as a criterion to determine the appropriate subset of features when building classifiers. We prove that if the mutual information between a feature set X and the class attribute Y equals to the entropy of Y , then X is a Markov Blanket of Y . We show that in some cases, it is infeasible to approximate the high-dimensional mutual information with algebraic combinations of pairwise mutual information in any forms. In addition, the exhaustive searches of all combinations of features are prerequisite for finding the optimal feature subsets for classifying these kinds of data sets. We show that our approach outperforms existing filter feature subset selection methods for most of the 24 selected benchmark data sets.
Using High-Dimensional Image Models to Perform Highly Undetectable Steganography
Pevný, Tomáš; Filler, Tomáš; Bas, Patrick
This paper presents a complete methodology for designing practical and highly-undetectable stegosystems for real digital media. The main design principle is to minimize a suitably-defined distortion by means of efficient coding algorithm. The distortion is defined as a weighted difference of extended state-of-the-art feature vectors already used in steganalysis. This allows us to "preserve" the model used by steganalyst and thus be undetectable even for large payloads. This framework can be efficiently implemented even when the dimensionality of the feature set used by the embedder is larger than 107. The high dimensional model is necessary to avoid known security weaknesses. Although high-dimensional models might be problem in steganalysis, we explain, why they are acceptable in steganography. As an example, we introduce HUGO, a new embedding algorithm for spatial-domain digital images and we contrast its performance with LSB matching. On the BOWS2 image database and in contrast with LSB matching, HUGO allows the embedder to hide 7× longer message with the same level of security level.
Reducing the Complexity of Genetic Fuzzy Classifiers in Highly-Dimensional Classification Problems
Directory of Open Access Journals (Sweden)
DimitrisG. Stavrakoudis
2012-04-01
Full Text Available This paper introduces the Fast Iterative Rule-based Linguistic Classifier (FaIRLiC, a Genetic Fuzzy Rule-Based Classification System (GFRBCS which targets at reducing the structural complexity of the resulting rule base, as well as its learning algorithm's computational requirements, especially when dealing with high-dimensional feature spaces. The proposed methodology follows the principles of the iterative rule learning (IRL approach, whereby a rule extraction algorithm (REA is invoked in an iterative fashion, producing one fuzzy rule at a time. The REA is performed in two successive steps: the first one selects the relevant features of the currently extracted rule, whereas the second one decides the antecedent part of the fuzzy rule, using the previously selected subset of features. The performance of the classifier is finally optimized through a genetic tuning post-processing stage. Comparative results in a hyperspectral remote sensing classification as well as in 12 real-world classification datasets indicate the effectiveness of the proposed methodology in generating high-performing and compact fuzzy rule-based classifiers, even for very high-dimensional feature spaces.
Compound Structure-Independent Activity Prediction in High-Dimensional Target Space.
Balfer, Jenny; Hu, Ye; Bajorath, Jürgen
2014-08-01
Profiling of compound libraries against arrays of targets has become an important approach in pharmaceutical research. The prediction of multi-target compound activities also represents an attractive task for machine learning with potential for drug discovery applications. Herein, we have explored activity prediction in high-dimensional target space. Different types of models were derived to predict multi-target activities. The models included naïve Bayesian (NB) and support vector machine (SVM) classifiers based upon compound structure information and NB models derived on the basis of activity profiles, without considering compound structure. Because the latter approach can be applied to incomplete training data and principally depends on the feature independence assumption, SVM modeling was not applicable in this case. Furthermore, iterative hybrid NB models making use of both activity profiles and compound structure information were built. In high-dimensional target space, NB models utilizing activity profile data were found to yield more accurate activity predictions than structure-based NB and SVM models or hybrid models. An in-depth analysis of activity profile-based models revealed the presence of correlation effects across different targets and rationalized prediction accuracy. Taken together, the results indicate that activity profile information can be effectively used to predict the activity of test compounds against novel targets. © 2014 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.
Quantum secret sharing based on modulated high-dimensional time-bin entanglement
International Nuclear Information System (INIS)
Takesue, Hiroki; Inoue, Kyo
2006-01-01
We propose a scheme for quantum secret sharing (QSS) that uses a modulated high-dimensional time-bin entanglement. By modulating the relative phase randomly by {0,π}, a sender with the entanglement source can randomly change the sign of the correlation of the measurement outcomes obtained by two distant recipients. The two recipients must cooperate if they are to obtain the sign of the correlation, which is used as a secret key. We show that our scheme is secure against intercept-and-resend (IR) and beam splitting attacks by an outside eavesdropper thanks to the nonorthogonality of high-dimensional time-bin entangled states. We also show that a cheating attempt based on an IR attack by one of the recipients can be detected by changing the dimension of the time-bin entanglement randomly and inserting two 'vacant' slots between the packets. Then, cheating attempts can be detected by monitoring the count rate in the vacant slots. The proposed scheme has better experimental feasibility than previously proposed entanglement-based QSS schemes
Similarity measurement method of high-dimensional data based on normalized net lattice subspace
Institute of Scientific and Technical Information of China (English)
Li Wenfa; Wang Gongming; Li Ke; Huang Su
2017-01-01
The performance of conventional similarity measurement methods is affected seriously by the curse of dimensionality of high-dimensional data.The reason is that data difference between sparse and noisy dimensionalities occupies a large proportion of the similarity, leading to the dissimilarities between any results.A similarity measurement method of high-dimensional data based on normalized net lattice subspace is proposed.The data range of each dimension is divided into several intervals, and the components in different dimensions are mapped onto the corresponding interval.Only the component in the same or adjacent interval is used to calculate the similarity.To validate this meth-od, three data types are used, and seven common similarity measurement methods are compared. The experimental result indicates that the relative difference of the method is increasing with the di-mensionality and is approximately two or three orders of magnitude higher than the conventional method.In addition, the similarity range of this method in different dimensions is [0, 1], which is fit for similarity analysis after dimensionality reduction.
Yu, Hualong; Ni, Jun
2014-01-01
Training classifiers on skewed data can be technically challenging tasks, especially if the data is high-dimensional simultaneously, the tasks can become more difficult. In biomedicine field, skewed data type often appears. In this study, we try to deal with this problem by combining asymmetric bagging ensemble classifier (asBagging) that has been presented in previous work and an improved random subspace (RS) generation strategy that is called feature subspace (FSS). Specifically, FSS is a novel method to promote the balance level between accuracy and diversity of base classifiers in asBagging. In view of the strong generalization capability of support vector machine (SVM), we adopt it to be base classifier. Extensive experiments on four benchmark biomedicine data sets indicate that the proposed ensemble learning method outperforms many baseline approaches in terms of Accuracy, F-measure, G-mean and AUC evaluation criterions, thus it can be regarded as an effective and efficient tool to deal with high-dimensional and imbalanced biomedical data.
Zhang, Yu; Wu, Jianxin; Cai, Jianfei
2016-05-01
In large-scale visual recognition and image retrieval tasks, feature vectors, such as Fisher vector (FV) or the vector of locally aggregated descriptors (VLAD), have achieved state-of-the-art results. However, the combination of the large numbers of examples and high-dimensional vectors necessitates dimensionality reduction, in order to reduce its storage and CPU costs to a reasonable range. In spite of the popularity of various feature compression methods, this paper shows that the feature (dimension) selection is a better choice for high-dimensional FV/VLAD than the feature (dimension) compression methods, e.g., product quantization. We show that strong correlation among the feature dimensions in the FV and the VLAD may not exist, which renders feature selection a natural choice. We also show that, many dimensions in FV/VLAD are noise. Throwing them away using feature selection is better than compressing them and useful dimensions altogether using feature compression methods. To choose features, we propose an efficient importance sorting algorithm considering both the supervised and unsupervised cases, for visual recognition and image retrieval, respectively. Combining with the 1-bit quantization, feature selection has achieved both higher accuracy and less computational cost than feature compression methods, such as product quantization, on the FV and the VLAD image representations.
High-dimensional quantum key distribution with the entangled single-photon-added coherent state
International Nuclear Information System (INIS)
Wang, Yang; Bao, Wan-Su; Bao, Hai-Ze; Zhou, Chun; Jiang, Mu-Sheng; Li, Hong-Wei
2017-01-01
High-dimensional quantum key distribution (HD-QKD) can generate more secure bits for one detection event so that it can achieve long distance key distribution with a high secret key capacity. In this Letter, we present a decoy state HD-QKD scheme with the entangled single-photon-added coherent state (ESPACS) source. We present two tight formulas to estimate the single-photon fraction of postselected events and Eve's Holevo information and derive lower bounds on the secret key capacity and the secret key rate of our protocol. We also present finite-key analysis for our protocol by using the Chernoff bound. Our numerical results show that our protocol using one decoy state can perform better than that of previous HD-QKD protocol with the spontaneous parametric down conversion (SPDC) using two decoy states. Moreover, when considering finite resources, the advantage is more obvious. - Highlights: • Implement the single-photon-added coherent state source into the high-dimensional quantum key distribution. • Enhance both the secret key capacity and the secret key rate compared with previous schemes. • Show an excellent performance in view of statistical fluctuations.
High-Dimensional Single-Photon Quantum Gates: Concepts and Experiments.
Babazadeh, Amin; Erhard, Manuel; Wang, Feiran; Malik, Mehul; Nouroozi, Rahman; Krenn, Mario; Zeilinger, Anton
2017-11-03
Transformations on quantum states form a basic building block of every quantum information system. From photonic polarization to two-level atoms, complete sets of quantum gates for a variety of qubit systems are well known. For multilevel quantum systems beyond qubits, the situation is more challenging. The orbital angular momentum modes of photons comprise one such high-dimensional system for which generation and measurement techniques are well studied. However, arbitrary transformations for such quantum states are not known. Here we experimentally demonstrate a four-dimensional generalization of the Pauli X gate and all of its integer powers on single photons carrying orbital angular momentum. Together with the well-known Z gate, this forms the first complete set of high-dimensional quantum gates implemented experimentally. The concept of the X gate is based on independent access to quantum states with different parities and can thus be generalized to other photonic degrees of freedom and potentially also to other quantum systems.
Challenges and Approaches to Statistical Design and Inference in High Dimensional Investigations
Garrett, Karen A.; Allison, David B.
2015-01-01
Summary Advances in modern technologies have facilitated high-dimensional experiments (HDEs) that generate tremendous amounts of genomic, proteomic, and other “omic” data. HDEs involving whole-genome sequences and polymorphisms, expression levels of genes, protein abundance measurements, and combinations thereof have become a vanguard for new analytic approaches to the analysis of HDE data. Such situations demand creative approaches to the processes of statistical inference, estimation, prediction, classification, and study design. The novel and challenging biological questions asked from HDE data have resulted in many specialized analytic techniques being developed. This chapter discusses some of the unique statistical challenges facing investigators studying high-dimensional biology, and describes some approaches being developed by statistical scientists. We have included some focus on the increasing interest in questions involving testing multiple propositions simultaneously, appropriate inferential indicators for the types of questions biologists are interested in, and the need for replication of results across independent studies, investigators, and settings. A key consideration inherent throughout is the challenge in providing methods that a statistician judges to be sound and a biologist finds informative. PMID:19588106
Challenges and approaches to statistical design and inference in high-dimensional investigations.
Gadbury, Gary L; Garrett, Karen A; Allison, David B
2009-01-01
Advances in modern technologies have facilitated high-dimensional experiments (HDEs) that generate tremendous amounts of genomic, proteomic, and other "omic" data. HDEs involving whole-genome sequences and polymorphisms, expression levels of genes, protein abundance measurements, and combinations thereof have become a vanguard for new analytic approaches to the analysis of HDE data. Such situations demand creative approaches to the processes of statistical inference, estimation, prediction, classification, and study design. The novel and challenging biological questions asked from HDE data have resulted in many specialized analytic techniques being developed. This chapter discusses some of the unique statistical challenges facing investigators studying high-dimensional biology and describes some approaches being developed by statistical scientists. We have included some focus on the increasing interest in questions involving testing multiple propositions simultaneously, appropriate inferential indicators for the types of questions biologists are interested in, and the need for replication of results across independent studies, investigators, and settings. A key consideration inherent throughout is the challenge in providing methods that a statistician judges to be sound and a biologist finds informative.
Tikhonov, Mikhail; Monasson, Remi
2018-01-01
Much of our understanding of ecological and evolutionary mechanisms derives from analysis of low-dimensional models: with few interacting species, or few axes defining "fitness". It is not always clear to what extent the intuition derived from low-dimensional models applies to the complex, high-dimensional reality. For instance, most naturally occurring microbial communities are strikingly diverse, harboring a large number of coexisting species, each of which contributes to shaping the environment of others. Understanding the eco-evolutionary interplay in these systems is an important challenge, and an exciting new domain for statistical physics. Recent work identified a promising new platform for investigating highly diverse ecosystems, based on the classic resource competition model of MacArthur. Here, we describe how the same analytical framework can be used to study evolutionary questions. Our analysis illustrates how, at high dimension, the intuition promoted by a one-dimensional (scalar) notion of fitness can become misleading. Specifically, while the low-dimensional picture emphasizes organism cost or efficiency, we exhibit a regime where cost becomes irrelevant for survival, and link this observation to generic properties of high-dimensional geometry.
A New Ensemble Method with Feature Space Partitioning for High-Dimensional Data Classification
Directory of Open Access Journals (Sweden)
Yongjun Piao
2015-01-01
Full Text Available Ensemble data mining methods, also known as classifier combination, are often used to improve the performance of classification. Various classifier combination methods such as bagging, boosting, and random forest have been devised and have received considerable attention in the past. However, data dimensionality increases rapidly day by day. Such a trend poses various challenges as these methods are not suitable to directly apply to high-dimensional datasets. In this paper, we propose an ensemble method for classification of high-dimensional data, with each classifier constructed from a different set of features determined by partitioning of redundant features. In our method, the redundancy of features is considered to divide the original feature space. Then, each generated feature subset is trained by a support vector machine, and the results of each classifier are combined by majority voting. The efficiency and effectiveness of our method are demonstrated through comparisons with other ensemble techniques, and the results show that our method outperforms other methods.
Understanding logistic regression analysis
Sperandei, Sandro
2014-01-01
Logistic regression is used to obtain odds ratio in the presence of more than one explanatory variable. The procedure is quite similar to multiple linear regression, with the exception that the response variable is binomial. The result is the impact of each variable on the odds ratio of the observed event of interest. The main advantage is to avoid confounding effects by analyzing the association of all variables together. In this article, we explain the logistic regression procedure using ex...
Introduction to regression graphics
Cook, R Dennis
2009-01-01
Covers the use of dynamic and interactive computer graphics in linear regression analysis, focusing on analytical graphics. Features new techniques like plot rotation. The authors have composed their own regression code, using Xlisp-Stat language called R-code, which is a nearly complete system for linear regression analysis and can be utilized as the main computer program in a linear regression course. The accompanying disks, for both Macintosh and Windows computers, contain the R-code and Xlisp-Stat. An Instructor's Manual presenting detailed solutions to all the problems in the book is ava
Alternative Methods of Regression
Birkes, David
2011-01-01
Of related interest. Nonlinear Regression Analysis and its Applications Douglas M. Bates and Donald G. Watts ".an extraordinary presentation of concepts and methods concerning the use and analysis of nonlinear regression models.highly recommend[ed].for anyone needing to use and/or understand issues concerning the analysis of nonlinear regression models." --Technometrics This book provides a balance between theory and practice supported by extensive displays of instructive geometrical constructs. Numerous in-depth case studies illustrate the use of nonlinear regression analysis--with all data s
Two-step variable selection in quantile regression models
Directory of Open Access Journals (Sweden)
FAN Yali
2015-06-01
Full Text Available We propose a two-step variable selection procedure for high dimensional quantile regressions, in which the dimension of the covariates, pn is much larger than the sample size n. In the first step, we perform ℓ1 penalty, and we demonstrate that the first step penalized estimator with the LASSO penalty can reduce the model from an ultra-high dimensional to a model whose size has the same order as that of the true model, and the selected model can cover the true model. The second step excludes the remained irrelevant covariates by applying the adaptive LASSO penalty to the reduced model obtained from the first step. Under some regularity conditions, we show that our procedure enjoys the model selection consistency. We conduct a simulation study and a real data analysis to evaluate the finite sample performance of the proposed approach.
High dimensional biological data retrieval optimization with NoSQL technology
2014-01-01
Background High-throughput transcriptomic data generated by microarray experiments is the most abundant and frequently stored kind of data currently used in translational medicine studies. Although microarray data is supported in data warehouses such as tranSMART, when querying relational databases for hundreds of different patient gene expression records queries are slow due to poor performance. Non-relational data models, such as the key-value model implemented in NoSQL databases, hold promise to be more performant solutions. Our motivation is to improve the performance of the tranSMART data warehouse with a view to supporting Next Generation Sequencing data. Results In this paper we introduce a new data model better suited for high-dimensional data storage and querying, optimized for database scalability and performance. We have designed a key-value pair data model to support faster queries over large-scale microarray data and implemented the model using HBase, an implementation of Google's BigTable storage system. An experimental performance comparison was carried out against the traditional relational data model implemented in both MySQL Cluster and MongoDB, using a large publicly available transcriptomic data set taken from NCBI GEO concerning Multiple Myeloma. Our new key-value data model implemented on HBase exhibits an average 5.24-fold increase in high-dimensional biological data query performance compared to the relational model implemented on MySQL Cluster, and an average 6.47-fold increase on query performance on MongoDB. Conclusions The performance evaluation found that the new key-value data model, in particular its implementation in HBase, outperforms the relational model currently implemented in tranSMART. We propose that NoSQL technology holds great promise for large-scale data management, in particular for high-dimensional biological data such as that demonstrated in the performance evaluation described in this paper. We aim to use this new data
Geraci, Joseph; Dharsee, Moyez; Nuin, Paulo; Haslehurst, Alexandria; Koti, Madhuri; Feilotter, Harriet E; Evans, Ken
2014-03-01
We introduce a novel method for visualizing high dimensional data via a discrete dynamical system. This method provides a 2D representation of the relationship between subjects according to a set of variables without geometric projections, transformed axes or principal components. The algorithm exploits a memory-type mechanism inherent in a certain class of discrete dynamical systems collectively referred to as the chaos game that are closely related to iterative function systems. The goal of the algorithm was to create a human readable representation of high dimensional patient data that was capable of detecting unrevealed subclusters of patients from within anticipated classifications. This provides a mechanism to further pursue a more personalized exploration of pathology when used with medical data. For clustering and classification protocols, the dynamical system portion of the algorithm is designed to come after some feature selection filter and before some model evaluation (e.g. clustering accuracy) protocol. In the version given here, a univariate features selection step is performed (in practice more complex feature selection methods are used), a discrete dynamical system is driven by this reduced set of variables (which results in a set of 2D cluster models), these models are evaluated for their accuracy (according to a user-defined binary classification) and finally a visual representation of the top classification models are returned. Thus, in addition to the visualization component, this methodology can be used for both supervised and unsupervised machine learning as the top performing models are returned in the protocol we describe here. Butterfly, the algorithm we introduce and provide working code for, uses a discrete dynamical system to classify high dimensional data and provide a 2D representation of the relationship between subjects. We report results on three datasets (two in the article; one in the appendix) including a public lung cancer
High dimensional biological data retrieval optimization with NoSQL technology.
Wang, Shicai; Pandis, Ioannis; Wu, Chao; He, Sijin; Johnson, David; Emam, Ibrahim; Guitton, Florian; Guo, Yike
2014-01-01
High-throughput transcriptomic data generated by microarray experiments is the most abundant and frequently stored kind of data currently used in translational medicine studies. Although microarray data is supported in data warehouses such as tranSMART, when querying relational databases for hundreds of different patient gene expression records queries are slow due to poor performance. Non-relational data models, such as the key-value model implemented in NoSQL databases, hold promise to be more performant solutions. Our motivation is to improve the performance of the tranSMART data warehouse with a view to supporting Next Generation Sequencing data. In this paper we introduce a new data model better suited for high-dimensional data storage and querying, optimized for database scalability and performance. We have designed a key-value pair data model to support faster queries over large-scale microarray data and implemented the model using HBase, an implementation of Google's BigTable storage system. An experimental performance comparison was carried out against the traditional relational data model implemented in both MySQL Cluster and MongoDB, using a large publicly available transcriptomic data set taken from NCBI GEO concerning Multiple Myeloma. Our new key-value data model implemented on HBase exhibits an average 5.24-fold increase in high-dimensional biological data query performance compared to the relational model implemented on MySQL Cluster, and an average 6.47-fold increase on query performance on MongoDB. The performance evaluation found that the new key-value data model, in particular its implementation in HBase, outperforms the relational model currently implemented in tranSMART. We propose that NoSQL technology holds great promise for large-scale data management, in particular for high-dimensional biological data such as that demonstrated in the performance evaluation described in this paper. We aim to use this new data model as a basis for migrating
Saini, Harsh; Lal, Sunil Pranit; Naidu, Vimal Vikash; Pickering, Vincel Wince; Singh, Gurmeet; Tsunoda, Tatsuhiko; Sharma, Alok
2016-12-05
High dimensional feature space generally degrades classification in several applications. In this paper, we propose a strategy called gene masking, in which non-contributing dimensions are heuristically removed from the data to improve classification accuracy. Gene masking is implemented via a binary encoded genetic algorithm that can be integrated seamlessly with classifiers during the training phase of classification to perform feature selection. It can also be used to discriminate between features that contribute most to the classification, thereby, allowing researchers to isolate features that may have special significance. This technique was applied on publicly available datasets whereby it substantially reduced the number of features used for classification while maintaining high accuracies. The proposed technique can be extremely useful in feature selection as it heuristically removes non-contributing features to improve the performance of classifiers.
Energy Technology Data Exchange (ETDEWEB)
Tahira, Rabia; Ikram, Manzoor; Zubairy, M Suhail [Centre for Quantum Physics, COMSATS Institute of Information Technology, Islamabad (Pakistan); Bougouffa, Smail [Department of Physics, Faculty of Science, Taibah University, PO Box 30002, Madinah (Saudi Arabia)
2010-02-14
We investigate the phenomenon of sudden death of entanglement in a high-dimensional bipartite system subjected to dissipative environments with an arbitrary initial pure entangled state between two fields in the cavities. We find that in a vacuum reservoir, the presence of the state where one or more than one (two) photons in each cavity are present is a necessary condition for the sudden death of entanglement. Otherwise entanglement remains for infinite time and decays asymptotically with the decay of individual qubits. For pure two-qubit entangled states in a thermal environment, we observe that sudden death of entanglement always occurs. The sudden death time of the entangled states is related to the number of photons in the cavities, the temperature of the reservoir and the initial preparation of the entangled states.
International Nuclear Information System (INIS)
Tahira, Rabia; Ikram, Manzoor; Zubairy, M Suhail; Bougouffa, Smail
2010-01-01
We investigate the phenomenon of sudden death of entanglement in a high-dimensional bipartite system subjected to dissipative environments with an arbitrary initial pure entangled state between two fields in the cavities. We find that in a vacuum reservoir, the presence of the state where one or more than one (two) photons in each cavity are present is a necessary condition for the sudden death of entanglement. Otherwise entanglement remains for infinite time and decays asymptotically with the decay of individual qubits. For pure two-qubit entangled states in a thermal environment, we observe that sudden death of entanglement always occurs. The sudden death time of the entangled states is related to the number of photons in the cavities, the temperature of the reservoir and the initial preparation of the entangled states.
Time–energy high-dimensional one-side device-independent quantum key distribution
International Nuclear Information System (INIS)
Bao Hai-Ze; Bao Wan-Su; Wang Yang; Chen Rui-Ke; Ma Hong-Xin; Zhou Chun; Li Hong-Wei
2017-01-01
Compared with full device-independent quantum key distribution (DI-QKD), one-side device-independent QKD (1sDI-QKD) needs fewer requirements, which is much easier to meet. In this paper, by applying recently developed novel time–energy entropic uncertainty relations, we present a time–energy high-dimensional one-side device-independent quantum key distribution (HD-QKD) and provide the security proof against coherent attacks. Besides, we connect the security with the quantum steering. By numerical simulation, we obtain the secret key rate for Alice’s different detection efficiencies. The results show that our protocol can performance much better than the original 1sDI-QKD. Furthermore, we clarify the relation among the secret key rate, Alice’s detection efficiency, and the dispersion coefficient. Finally, we simply analyze its performance in the optical fiber channel. (paper)
A Cure for Variance Inflation in High Dimensional Kernel Principal Component Analysis
DEFF Research Database (Denmark)
Abrahamsen, Trine Julie; Hansen, Lars Kai
2011-01-01
Small sample high-dimensional principal component analysis (PCA) suffers from variance inflation and lack of generalizability. It has earlier been pointed out that a simple leave-one-out variance renormalization scheme can cure the problem. In this paper we generalize the cure in two directions......: First, we propose a computationally less intensive approximate leave-one-out estimator, secondly, we show that variance inflation is also present in kernel principal component analysis (kPCA) and we provide a non-parametric renormalization scheme which can quite efficiently restore generalizability in kPCA....... As for PCA our analysis also suggests a simplified approximate expression. © 2011 Trine J. Abrahamsen and Lars K. Hansen....
Diagonal Likelihood Ratio Test for Equality of Mean Vectors in High-Dimensional Data
Hu, Zongliang; Tong, Tiejun; Genton, Marc G.
2017-01-01
We propose a likelihood ratio test framework for testing normal mean vectors in high-dimensional data under two common scenarios: the one-sample test and the two-sample test with equal covariance matrices. We derive the test statistics under the assumption that the covariance matrices follow a diagonal matrix structure. In comparison with the diagonal Hotelling's tests, our proposed test statistics display some interesting characteristics. In particular, they are a summation of the log-transformed squared t-statistics rather than a direct summation of those components. More importantly, to derive the asymptotic normality of our test statistics under the null and local alternative hypotheses, we do not require the assumption that the covariance matrix follows a diagonal matrix structure. As a consequence, our proposed test methods are very flexible and can be widely applied in practice. Finally, simulation studies and a real data analysis are also conducted to demonstrate the advantages of our likelihood ratio test method.
Wang, Zhiping; Chen, Jinyu; Yu, Benli
2017-02-20
We investigate the two-dimensional (2D) and three-dimensional (3D) atom localization behaviors via spontaneously generated coherence in a microwave-driven four-level atomic system. Owing to the space-dependent atom-field interaction, it is found that the detecting probability and precision of 2D and 3D atom localization behaviors can be significantly improved via adjusting the system parameters, the phase, amplitude, and initial population distribution. Interestingly, the atom can be localized in volumes that are substantially smaller than a cubic optical wavelength. Our scheme opens a promising way to achieve high-precision and high-efficiency atom localization, which provides some potential applications in high-dimensional atom nanolithography.
Characterization of differentially expressed genes using high-dimensional co-expression networks
DEFF Research Database (Denmark)
Coelho Goncalves de Abreu, Gabriel; Labouriau, Rodrigo S.
2010-01-01
We present a technique to characterize differentially expressed genes in terms of their position in a high-dimensional co-expression network. The set-up of Gaussian graphical models is used to construct representations of the co-expression network in such a way that redundancy and the propagation...... that allow to make effective inference in problems with high degree of complexity (e.g. several thousands of genes) and small number of observations (e.g. 10-100) as typically occurs in high throughput gene expression studies. Taking advantage of the internal structure of decomposable graphical models, we...... construct a compact representation of the co-expression network that allows to identify the regions with high concentration of differentially expressed genes. It is argued that differentially expressed genes located in highly interconnected regions of the co-expression network are less informative than...
Directory of Open Access Journals (Sweden)
Matthias Schmid
Full Text Available Regression analysis with a bounded outcome is a common problem in applied statistics. Typical examples include regression models for percentage outcomes and the analysis of ratings that are measured on a bounded scale. In this paper, we consider beta regression, which is a generalization of logit models to situations where the response is continuous on the interval (0,1. Consequently, beta regression is a convenient tool for analyzing percentage responses. The classical approach to fit a beta regression model is to use maximum likelihood estimation with subsequent AIC-based variable selection. As an alternative to this established - yet unstable - approach, we propose a new estimation technique called boosted beta regression. With boosted beta regression estimation and variable selection can be carried out simultaneously in a highly efficient way. Additionally, both the mean and the variance of a percentage response can be modeled using flexible nonlinear covariate effects. As a consequence, the new method accounts for common problems such as overdispersion and non-binomial variance structures.
An adaptive ANOVA-based PCKF for high-dimensional nonlinear inverse modeling
Li, Weixuan; Lin, Guang; Zhang, Dongxiao
2014-02-01
The probabilistic collocation-based Kalman filter (PCKF) is a recently developed approach for solving inverse problems. It resembles the ensemble Kalman filter (EnKF) in every aspect-except that it represents and propagates model uncertainty by polynomial chaos expansion (PCE) instead of an ensemble of model realizations. Previous studies have shown PCKF is a more efficient alternative to EnKF for many data assimilation problems. However, the accuracy and efficiency of PCKF depends on an appropriate truncation of the PCE series. Having more polynomial chaos basis functions in the expansion helps to capture uncertainty more accurately but increases computational cost. Selection of basis functions is particularly important for high-dimensional stochastic problems because the number of polynomial chaos basis functions required to represent model uncertainty grows dramatically as the number of input parameters (random dimensions) increases. In classic PCKF algorithms, the PCE basis functions are pre-set based on users' experience. Also, for sequential data assimilation problems, the basis functions kept in PCE expression remain unchanged in different Kalman filter loops, which could limit the accuracy and computational efficiency of classic PCKF algorithms. To address this issue, we present a new algorithm that adaptively selects PCE basis functions for different problems and automatically adjusts the number of basis functions in different Kalman filter loops. The algorithm is based on adaptive functional ANOVA (analysis of variance) decomposition, which approximates a high-dimensional function with the summation of a set of low-dimensional functions. Thus, instead of expanding the original model into PCE, we implement the PCE expansion on these low-dimensional functions, which is much less costly. We also propose a new adaptive criterion for ANOVA that is more suited for solving inverse problems. The new algorithm was tested with different examples and demonstrated
Joint Adaptive Mean-Variance Regularization and Variance Stabilization of High Dimensional Data.
Dazard, Jean-Eudes; Rao, J Sunil
2012-07-01
The paper addresses a common problem in the analysis of high-dimensional high-throughput "omics" data, which is parameter estimation across multiple variables in a set of data where the number of variables is much larger than the sample size. Among the problems posed by this type of data are that variable-specific estimators of variances are not reliable and variable-wise tests statistics have low power, both due to a lack of degrees of freedom. In addition, it has been observed in this type of data that the variance increases as a function of the mean. We introduce a non-parametric adaptive regularization procedure that is innovative in that : (i) it employs a novel "similarity statistic"-based clustering technique to generate local-pooled or regularized shrinkage estimators of population parameters, (ii) the regularization is done jointly on population moments, benefiting from C. Stein's result on inadmissibility, which implies that usual sample variance estimator is improved by a shrinkage estimator using information contained in the sample mean. From these joint regularized shrinkage estimators, we derived regularized t-like statistics and show in simulation studies that they offer more statistical power in hypothesis testing than their standard sample counterparts, or regular common value-shrinkage estimators, or when the information contained in the sample mean is simply ignored. Finally, we show that these estimators feature interesting properties of variance stabilization and normalization that can be used for preprocessing high-dimensional multivariate data. The method is available as an R package, called 'MVR' ('Mean-Variance Regularization'), downloadable from the CRAN website.
Understanding logistic regression analysis.
Sperandei, Sandro
2014-01-01
Logistic regression is used to obtain odds ratio in the presence of more than one explanatory variable. The procedure is quite similar to multiple linear regression, with the exception that the response variable is binomial. The result is the impact of each variable on the odds ratio of the observed event of interest. The main advantage is to avoid confounding effects by analyzing the association of all variables together. In this article, we explain the logistic regression procedure using examples to make it as simple as possible. After definition of the technique, the basic interpretation of the results is highlighted and then some special issues are discussed.
Weisberg, Sanford
2013-01-01
Praise for the Third Edition ""...this is an excellent book which could easily be used as a course text...""-International Statistical Institute The Fourth Edition of Applied Linear Regression provides a thorough update of the basic theory and methodology of linear regression modeling. Demonstrating the practical applications of linear regression analysis techniques, the Fourth Edition uses interesting, real-world exercises and examples. Stressing central concepts such as model building, understanding parameters, assessing fit and reliability, and drawing conclusions, the new edition illus
Hosmer, David W; Sturdivant, Rodney X
2013-01-01
A new edition of the definitive guide to logistic regression modeling for health science and other applications This thoroughly expanded Third Edition provides an easily accessible introduction to the logistic regression (LR) model and highlights the power of this model by examining the relationship between a dichotomous outcome and a set of covariables. Applied Logistic Regression, Third Edition emphasizes applications in the health sciences and handpicks topics that best suit the use of modern statistical software. The book provides readers with state-of-
Understanding poisson regression.
Hayat, Matthew J; Higgins, Melinda
2014-04-01
Nurse investigators often collect study data in the form of counts. Traditional methods of data analysis have historically approached analysis of count data either as if the count data were continuous and normally distributed or with dichotomization of the counts into the categories of occurred or did not occur. These outdated methods for analyzing count data have been replaced with more appropriate statistical methods that make use of the Poisson probability distribution, which is useful for analyzing count data. The purpose of this article is to provide an overview of the Poisson distribution and its use in Poisson regression. Assumption violations for the standard Poisson regression model are addressed with alternative approaches, including addition of an overdispersion parameter or negative binomial regression. An illustrative example is presented with an application from the ENSPIRE study, and regression modeling of comorbidity data is included for illustrative purposes. Copyright 2014, SLACK Incorporated.
Optimizing human activity patterns using global sensitivity analysis.
Fairchild, Geoffrey; Hickmann, Kyle S; Mniszewski, Susan M; Del Valle, Sara Y; Hyman, James M
2014-12-01
Implementing realistic activity patterns for a population is crucial for modeling, for example, disease spread, supply and demand, and disaster response. Using the dynamic activity simulation engine, DASim, we generate schedules for a population that capture regular (e.g., working, eating, and sleeping) and irregular activities (e.g., shopping or going to the doctor). We use the sample entropy (SampEn) statistic to quantify a schedule's regularity for a population. We show how to tune an activity's regularity by adjusting SampEn, thereby making it possible to realistically design activities when creating a schedule. The tuning process sets up a computationally intractable high-dimensional optimization problem. To reduce the computational demand, we use Bayesian Gaussian process regression to compute global sensitivity indices and identify the parameters that have the greatest effect on the variance of SampEn. We use the harmony search (HS) global optimization algorithm to locate global optima. Our results show that HS combined with global sensitivity analysis can efficiently tune the SampEn statistic with few search iterations. We demonstrate how global sensitivity analysis can guide statistical emulation and global optimization algorithms to efficiently tune activities and generate realistic activity patterns. Though our tuning methods are applied to dynamic activity schedule generation, they are general and represent a significant step in the direction of automated tuning and optimization of high-dimensional computer simulations.
Nakamura, Ryota; Suhrcke, Marc; Jebb, Susan A; Pechey, Rachel; Almiron-Roig, Eva; Marteau, Theresa M
2015-01-01
Background: There is a growing concern, but limited evidence, that price promotions contribute to a poor diet and the social patterning of diet-related disease. Objective: We examined the following questions: 1) Are less-healthy foods more likely to be promoted than healthier foods? 2) Are consumers more responsive to promotions on less-healthy products? 3) Are there socioeconomic differences in food purchases in response to price promotions? Design: With the use of hierarchical regression, we analyzed data on purchases of 11,323 products within 135 food and beverage categories from 26,986 households in Great Britain during 2010. Major supermarkets operated the same price promotions in all branches. The number of stores that offered price promotions on each product for each week was used to measure the frequency of price promotions. We assessed the healthiness of each product by using a nutrient profiling (NP) model. Results: A total of 6788 products (60%) were in healthier categories and 4535 products (40%) were in less-healthy categories. There was no significant gap in the frequency of promotion by the healthiness of products neither within nor between categories. However, after we controlled for the reference price, price discount rate, and brand-specific effects, the sales uplift arising from price promotions was larger in less-healthy than in healthier categories; a 1-SD point increase in the category mean NP score, implying the category becomes less healthy, was associated with an additional 7.7–percentage point increase in sales (from 27.3% to 35.0%; P sales uplift from promotions was larger for higher–socioeconomic status (SES) groups than for lower ones (34.6% for the high-SES group, 28.1% for the middle-SES group, and 23.1% for the low-SES group). Finally, there was no significant SES gap in the absolute volume of purchases of less-healthy foods made on promotion. Conclusion: Attempts to limit promotions on less-healthy foods could improve the
Directory of Open Access Journals (Sweden)
Mok Tik
2014-06-01
Full Text Available This study formulates regression of vector data that will enable statistical analysis of various geodetic phenomena such as, polar motion, ocean currents, typhoon/hurricane tracking, crustal deformations, and precursory earthquake signals. The observed vector variable of an event (dependent vector variable is expressed as a function of a number of hypothesized phenomena realized also as vector variables (independent vector variables and/or scalar variables that are likely to impact the dependent vector variable. The proposed representation has the unique property of solving the coefficients of independent vector variables (explanatory variables also as vectors, hence it supersedes multivariate multiple regression models, in which the unknown coefficients are scalar quantities. For the solution, complex numbers are used to rep- resent vector information, and the method of least squares is deployed to estimate the vector model parameters after transforming the complex vector regression model into a real vector regression model through isomorphism. Various operational statistics for testing the predictive significance of the estimated vector parameter coefficients are also derived. A simple numerical example demonstrates the use of the proposed vector regression analysis in modeling typhoon paths.
DEFF Research Database (Denmark)
Pham, Ninh Dang; Pagh, Rasmus
2012-01-01
projection-based technique that is able to estimate the angle-based outlier factor for all data points in time near-linear in the size of the data. Also, our approach is suitable to be performed in parallel environment to achieve a parallel speedup. We introduce a theoretical analysis of the quality...... neighbor are deteriorated in high-dimensional data. Following up on the work of Kriegel et al. (KDD '08), we investigate the use of angle-based outlier factor in mining high-dimensional outliers. While their algorithm runs in cubic time (with a quadratic time heuristic), we propose a novel random......Outlier mining in d-dimensional point sets is a fundamental and well studied data mining task due to its variety of applications. Most such applications arise in high-dimensional domains. A bottleneck of existing approaches is that implicit or explicit assessments on concepts of distance or nearest...
Elastic SCAD as a novel penalization method for SVM classification tasks in high-dimensional data.
Becker, Natalia; Toedt, Grischa; Lichter, Peter; Benner, Axel
2011-05-09
Classification and variable selection play an important role in knowledge discovery in high-dimensional data. Although Support Vector Machine (SVM) algorithms are among the most powerful classification and prediction methods with a wide range of scientific applications, the SVM does not include automatic feature selection and therefore a number of feature selection procedures have been developed. Regularisation approaches extend SVM to a feature selection method in a flexible way using penalty functions like LASSO, SCAD and Elastic Net.We propose a novel penalty function for SVM classification tasks, Elastic SCAD, a combination of SCAD and ridge penalties which overcomes the limitations of each penalty alone.Since SVM models are extremely sensitive to the choice of tuning parameters, we adopted an interval search algorithm, which in comparison to a fixed grid search finds rapidly and more precisely a global optimal solution. Feature selection methods with combined penalties (Elastic Net and Elastic SCAD SVMs) are more robust to a change of the model complexity than methods using single penalties. Our simulation study showed that Elastic SCAD SVM outperformed LASSO (L1) and SCAD SVMs. Moreover, Elastic SCAD SVM provided sparser classifiers in terms of median number of features selected than Elastic Net SVM and often better predicted than Elastic Net in terms of misclassification error.Finally, we applied the penalization methods described above on four publicly available breast cancer data sets. Elastic SCAD SVM was the only method providing robust classifiers in sparse and non-sparse situations. The proposed Elastic SCAD SVM algorithm provides the advantages of the SCAD penalty and at the same time avoids sparsity limitations for non-sparse data. We were first to demonstrate that the integration of the interval search algorithm and penalized SVM classification techniques provides fast solutions on the optimization of tuning parameters.The penalized SVM
Multicollinearity and Regression Analysis
Daoud, Jamal I.
2017-12-01
In regression analysis it is obvious to have a correlation between the response and predictor(s), but having correlation among predictors is something undesired. The number of predictors included in the regression model depends on many factors among which, historical data, experience, etc. At the end selection of most important predictors is something objective due to the researcher. Multicollinearity is a phenomena when two or more predictors are correlated, if this happens, the standard error of the coefficients will increase [8]. Increased standard errors means that the coefficients for some or all independent variables may be found to be significantly different from In other words, by overinflating the standard errors, multicollinearity makes some variables statistically insignificant when they should be significant. In this paper we focus on the multicollinearity, reasons and consequences on the reliability of the regression model.
Polylinear regression analysis in radiochemistry
International Nuclear Information System (INIS)
Kopyrin, A.A.; Terent'eva, T.N.; Khramov, N.N.
1995-01-01
A number of radiochemical problems have been formulated in the framework of polylinear regression analysis, which permits the use of conventional mathematical methods for their solution. The authors have considered features of the use of polylinear regression analysis for estimating the contributions of various sources to the atmospheric pollution, for studying irradiated nuclear fuel, for estimating concentrations from spectral data, for measuring neutron fields of a nuclear reactor, for estimating crystal lattice parameters from X-ray diffraction patterns, for interpreting data of X-ray fluorescence analysis, for estimating complex formation constants, and for analyzing results of radiometric measurements. The problem of estimating the target parameters can be incorrect at certain properties of the system under study. The authors showed the possibility of regularization by adding a fictitious set of data open-quotes obtainedclose quotes from the orthogonal design. To estimate only a part of the parameters under consideration, the authors used incomplete rank models. In this case, it is necessary to take into account the possibility of confounding estimates. An algorithm for evaluating the degree of confounding is presented which is realized using standard software or regression analysis
DEFF Research Database (Denmark)
Bache, Stefan Holst
A new and alternative quantile regression estimator is developed and it is shown that the estimator is root n-consistent and asymptotically normal. The estimator is based on a minimax ‘deviance function’ and has asymptotically equivalent properties to the usual quantile regression estimator. It is......, however, a different and therefore new estimator. It allows for both linear- and nonlinear model specifications. A simple algorithm for computing the estimates is proposed. It seems to work quite well in practice but whether it has theoretical justification is still an open question....
DEFF Research Database (Denmark)
Ozenne, Brice; Sørensen, Anne Lyngholm; Scheike, Thomas
2017-01-01
In the presence of competing risks a prediction of the time-dynamic absolute risk of an event can be based on cause-specific Cox regression models for the event and the competing risks (Benichou and Gail, 1990). We present computationally fast and memory optimized C++ functions with an R interface...... for predicting the covariate specific absolute risks, their confidence intervals, and their confidence bands based on right censored time to event data. We provide explicit formulas for our implementation of the estimator of the (stratified) baseline hazard function in the presence of tied event times. As a by...... functionals. The software presented here is implemented in the riskRegression package....
A logistic regression estimating function for spatial Gibbs point processes
DEFF Research Database (Denmark)
Baddeley, Adrian; Coeurjolly, Jean-François; Rubak, Ege
We propose a computationally efficient logistic regression estimating function for spatial Gibbs point processes. The sample points for the logistic regression consist of the observed point pattern together with a random pattern of dummy points. The estimating function is closely related to the p......We propose a computationally efficient logistic regression estimating function for spatial Gibbs point processes. The sample points for the logistic regression consist of the observed point pattern together with a random pattern of dummy points. The estimating function is closely related...
Diagonal Likelihood Ratio Test for Equality of Mean Vectors in High-Dimensional Data
Hu, Zongliang
2017-10-27
We propose a likelihood ratio test framework for testing normal mean vectors in high-dimensional data under two common scenarios: the one-sample test and the two-sample test with equal covariance matrices. We derive the test statistics under the assumption that the covariance matrices follow a diagonal matrix structure. In comparison with the diagonal Hotelling\\'s tests, our proposed test statistics display some interesting characteristics. In particular, they are a summation of the log-transformed squared t-statistics rather than a direct summation of those components. More importantly, to derive the asymptotic normality of our test statistics under the null and local alternative hypotheses, we do not require the assumption that the covariance matrix follows a diagonal matrix structure. As a consequence, our proposed test methods are very flexible and can be widely applied in practice. Finally, simulation studies and a real data analysis are also conducted to demonstrate the advantages of our likelihood ratio test method.
A sparse grid based method for generative dimensionality reduction of high-dimensional data
Bohn, Bastian; Garcke, Jochen; Griebel, Michael
2016-03-01
Generative dimensionality reduction methods play an important role in machine learning applications because they construct an explicit mapping from a low-dimensional space to the high-dimensional data space. We discuss a general framework to describe generative dimensionality reduction methods, where the main focus lies on a regularized principal manifold learning variant. Since most generative dimensionality reduction algorithms exploit the representer theorem for reproducing kernel Hilbert spaces, their computational costs grow at least quadratically in the number n of data. Instead, we introduce a grid-based discretization approach which automatically scales just linearly in n. To circumvent the curse of dimensionality of full tensor product grids, we use the concept of sparse grids. Furthermore, in real-world applications, some embedding directions are usually more important than others and it is reasonable to refine the underlying discretization space only in these directions. To this end, we employ a dimension-adaptive algorithm which is based on the ANOVA (analysis of variance) decomposition of a function. In particular, the reconstruction error is used to measure the quality of an embedding. As an application, the study of large simulation data from an engineering application in the automotive industry (car crash simulation) is performed.
International Nuclear Information System (INIS)
Snyder, Abigail C.; Jiao, Yu
2010-01-01
Neutron experiments at the Spallation Neutron Source (SNS) at Oak Ridge National Laboratory (ORNL) frequently generate large amounts of data (on the order of 106-1012 data points). Hence, traditional data analysis tools run on a single CPU take too long to be practical and scientists are unable to efficiently analyze all data generated by experiments. Our goal is to develop a scalable algorithm to efficiently compute high-dimensional integrals of arbitrary functions. This algorithm can then be used to integrate the four-dimensional integrals that arise as part of modeling intensity from the experiments at the SNS. Here, three different one-dimensional numerical integration solvers from the GNU Scientific Library were modified and implemented to solve four-dimensional integrals. The results of these solvers on a final integrand provided by scientists at the SNS can be compared to the results of other methods, such as quasi-Monte Carlo methods, computing the same integral. A parallelized version of the most efficient method can allow scientists the opportunity to more effectively analyze all experimental data.
Directory of Open Access Journals (Sweden)
Enkelejda Miho
2018-02-01
Full Text Available The adaptive immune system recognizes antigens via an immense array of antigen-binding antibodies and T-cell receptors, the immune repertoire. The interrogation of immune repertoires is of high relevance for understanding the adaptive immune response in disease and infection (e.g., autoimmunity, cancer, HIV. Adaptive immune receptor repertoire sequencing (AIRR-seq has driven the quantitative and molecular-level profiling of immune repertoires, thereby revealing the high-dimensional complexity of the immune receptor sequence landscape. Several methods for the computational and statistical analysis of large-scale AIRR-seq data have been developed to resolve immune repertoire complexity and to understand the dynamics of adaptive immunity. Here, we review the current research on (i diversity, (ii clustering and network, (iii phylogenetic, and (iv machine learning methods applied to dissect, quantify, and compare the architecture, evolution, and specificity of immune repertoires. We summarize outstanding questions in computational immunology and propose future directions for systems immunology toward coupling AIRR-seq with the computational discovery of immunotherapeutics, vaccines, and immunodiagnostics.
Construction of high-dimensional neural network potentials using environment-dependent atom pairs.
Jose, K V Jovan; Artrith, Nongnuch; Behler, Jörg
2012-05-21
An accurate determination of the potential energy is the crucial step in computer simulations of chemical processes, but using electronic structure methods on-the-fly in molecular dynamics (MD) is computationally too demanding for many systems. Constructing more efficient interatomic potentials becomes intricate with increasing dimensionality of the potential-energy surface (PES), and for numerous systems the accuracy that can be achieved is still not satisfying and far from the reliability of first-principles calculations. Feed-forward neural networks (NNs) have a very flexible functional form, and in recent years they have been shown to be an accurate tool to construct efficient PESs. High-dimensional NN potentials based on environment-dependent atomic energy contributions have been presented for a number of materials. Still, these potentials may be improved by a more detailed structural description, e.g., in form of atom pairs, which directly reflect the atomic interactions and take the chemical environment into account. We present an implementation of an NN method based on atom pairs, and its accuracy and performance are compared to the atom-based NN approach using two very different systems, the methanol molecule and metallic copper. We find that both types of NN potentials provide an excellent description of both PESs, with the pair-based method yielding a slightly higher accuracy making it a competitive alternative for addressing complex systems in MD simulations.
Individual-based models for adaptive diversification in high-dimensional phenotype spaces.
Ispolatov, Iaroslav; Madhok, Vaibhav; Doebeli, Michael
2016-02-07
Most theories of evolutionary diversification are based on equilibrium assumptions: they are either based on optimality arguments involving static fitness landscapes, or they assume that populations first evolve to an equilibrium state before diversification occurs, as exemplified by the concept of evolutionary branching points in adaptive dynamics theory. Recent results indicate that adaptive dynamics may often not converge to equilibrium points and instead generate complicated trajectories if evolution takes place in high-dimensional phenotype spaces. Even though some analytical results on diversification in complex phenotype spaces are available, to study this problem in general we need to reconstruct individual-based models from the adaptive dynamics generating the non-equilibrium dynamics. Here we first provide a method to construct individual-based models such that they faithfully reproduce the given adaptive dynamics attractor without diversification. We then show that a propensity to diversify can be introduced by adding Gaussian competition terms that generate frequency dependence while still preserving the same adaptive dynamics. For sufficiently strong competition, the disruptive selection generated by frequency-dependence overcomes the directional evolution along the selection gradient and leads to diversification in phenotypic directions that are orthogonal to the selection gradient. Copyright © 2015 Elsevier Ltd. All rights reserved.
A Comparison of Machine Learning Methods in a High-Dimensional Classification Problem
Directory of Open Access Journals (Sweden)
Zekić-Sušac Marijana
2014-09-01
Full Text Available Background: Large-dimensional data modelling often relies on variable reduction methods in the pre-processing and in the post-processing stage. However, such a reduction usually provides less information and yields a lower accuracy of the model. Objectives: The aim of this paper is to assess the high-dimensional classification problem of recognizing entrepreneurial intentions of students by machine learning methods. Methods/Approach: Four methods were tested: artificial neural networks, CART classification trees, support vector machines, and k-nearest neighbour on the same dataset in order to compare their efficiency in the sense of classification accuracy. The performance of each method was compared on ten subsamples in a 10-fold cross-validation procedure in order to assess computing sensitivity and specificity of each model. Results: The artificial neural network model based on multilayer perceptron yielded a higher classification rate than the models produced by other methods. The pairwise t-test showed a statistical significance between the artificial neural network and the k-nearest neighbour model, while the difference among other methods was not statistically significant. Conclusions: Tested machine learning methods are able to learn fast and achieve high classification accuracy. However, further advancement can be assured by testing a few additional methodological refinements in machine learning methods.
Energy Technology Data Exchange (ETDEWEB)
Dan Maljovec; Bei Wang; Valerio Pascucci; Peer-Timo Bremer; Michael Pernice; Robert Nourgaliev
2013-05-01
The next generation of methodologies for nuclear reactor Probabilistic Risk Assessment (PRA) explicitly accounts for the time element in modeling the probabilistic system evolution and uses numerical simulation tools to account for possible dependencies between failure events. The Monte-Carlo (MC) and the Dynamic Event Tree (DET) approaches belong to this new class of dynamic PRA methodologies. A challenge of dynamic PRA algorithms is the large amount of data they produce which may be difficult to visualize and analyze in order to extract useful information. We present a software tool that is designed to address these goals. We model a large-scale nuclear simulation dataset as a high-dimensional scalar function defined over a discrete sample of the domain. First, we provide structural analysis of such a function at multiple scales and provide insight into the relationship between the input parameters and the output. Second, we enable exploratory analysis for users, where we help the users to differentiate features from noise through multi-scale analysis on an interactive platform, based on domain knowledge and data characterization. Our analysis is performed by exploiting the topological and geometric properties of the domain, building statistical models based on its topological segmentations and providing interactive visual interfaces to facilitate such explorations. We provide a user’s guide to our software tool by highlighting its analysis and visualization capabilities, along with a use case involving dataset from a nuclear reactor safety simulation.
Schran, Christoph; Uhl, Felix; Behler, Jörg; Marx, Dominik
2018-03-01
The design of accurate helium-solute interaction potentials for the simulation of chemically complex molecules solvated in superfluid helium has long been a cumbersome task due to the rather weak but strongly anisotropic nature of the interactions. We show that this challenge can be met by using a combination of an effective pair potential for the He-He interactions and a flexible high-dimensional neural network potential (NNP) for describing the complex interaction between helium and the solute in a pairwise additive manner. This approach yields an excellent agreement with a mean absolute deviation as small as 0.04 kJ mol-1 for the interaction energy between helium and both hydronium and Zundel cations compared with coupled cluster reference calculations with an energetically converged basis set. The construction and improvement of the potential can be performed in a highly automated way, which opens the door for applications to a variety of reactive molecules to study the effect of solvation on the solute as well as the solute-induced structuring of the solvent. Furthermore, we show that this NNP approach yields very convincing agreement with the coupled cluster reference for properties like many-body spatial and radial distribution functions. This holds for the microsolvation of the protonated water monomer and dimer by a few helium atoms up to their solvation in bulk helium as obtained from path integral simulations at about 1 K.
Multi-Scale Factor Analysis of High-Dimensional Brain Signals
Ting, Chee-Ming
2017-05-18
In this paper, we develop an approach to modeling high-dimensional networks with a large number of nodes arranged in a hierarchical and modular structure. We propose a novel multi-scale factor analysis (MSFA) model which partitions the massive spatio-temporal data defined over the complex networks into a finite set of regional clusters. To achieve further dimension reduction, we represent the signals in each cluster by a small number of latent factors. The correlation matrix for all nodes in the network are approximated by lower-dimensional sub-structures derived from the cluster-specific factors. To estimate regional connectivity between numerous nodes (within each cluster), we apply principal components analysis (PCA) to produce factors which are derived as the optimal reconstruction of the observed signals under the squared loss. Then, we estimate global connectivity (between clusters or sub-networks) based on the factors across regions using the RV-coefficient as the cross-dependence measure. This gives a reliable and computationally efficient multi-scale analysis of both regional and global dependencies of the large networks. The proposed novel approach is applied to estimate brain connectivity networks using functional magnetic resonance imaging (fMRI) data. Results on resting-state fMRI reveal interesting modular and hierarchical organization of human brain networks during rest.
Simulation-based hypothesis testing of high dimensional means under covariance heterogeneity.
Chang, Jinyuan; Zheng, Chao; Zhou, Wen-Xin; Zhou, Wen
2017-12-01
In this article, we study the problem of testing the mean vectors of high dimensional data in both one-sample and two-sample cases. The proposed testing procedures employ maximum-type statistics and the parametric bootstrap techniques to compute the critical values. Different from the existing tests that heavily rely on the structural conditions on the unknown covariance matrices, the proposed tests allow general covariance structures of the data and therefore enjoy wide scope of applicability in practice. To enhance powers of the tests against sparse alternatives, we further propose two-step procedures with a preliminary feature screening step. Theoretical properties of the proposed tests are investigated. Through extensive numerical experiments on synthetic data sets and an human acute lymphoblastic leukemia gene expression data set, we illustrate the performance of the new tests and how they may provide assistance on detecting disease-associated gene-sets. The proposed methods have been implemented in an R-package HDtest and are available on CRAN. © 2017, The International Biometric Society.
Multi-SOM: an Algorithm for High-Dimensional, Small Size Datasets
Directory of Open Access Journals (Sweden)
Shen Lu
2013-04-01
Full Text Available Since it takes time to do experiments in bioinformatics, biological datasets are sometimes small but with high dimensionality. From probability theory, in order to discover knowledge from a set of data, we have to have a sufficient number of samples. Otherwise, the error bounds can become too large to be useful. For the SOM (Self- Organizing Map algorithm, the initial map is based on the training data. In order to avoid the bias caused by the insufficient training data, in this paper we present an algorithm, called Multi-SOM. Multi-SOM builds a number of small self-organizing maps, instead of just one big map. Bayesian decision theory is used to make the final decision among similar neurons on different maps. In this way, we can better ensure that we can get a real random initial weight vector set, the map size is less of consideration and errors tend to average out. In our experiments as applied to microarray datasets which are highly intense data composed of genetic related information, the precision of Multi-SOMs is 10.58% greater than SOMs, and its recall is 11.07% greater than SOMs. Thus, the Multi-SOMs algorithm is practical.
Multiple linear regression analysis
Edwards, T. R.
1980-01-01
Program rapidly selects best-suited set of coefficients. User supplies only vectors of independent and dependent data and specifies confidence level required. Program uses stepwise statistical procedure for relating minimal set of variables to set of observations; final regression contains only most statistically significant coefficients. Program is written in FORTRAN IV for batch execution and has been implemented on NOVA 1200.
Bayesian logistic regression analysis
Van Erp, H.R.N.; Van Gelder, P.H.A.J.M.
2012-01-01
In this paper we present a Bayesian logistic regression analysis. It is found that if one wishes to derive the posterior distribution of the probability of some event, then, together with the traditional Bayes Theorem and the integrating out of nuissance parameters, the Jacobian transformation is an
Seber, George A F
2012-01-01
Concise, mathematically clear, and comprehensive treatment of the subject.* Expanded coverage of diagnostics and methods of model fitting.* Requires no specialized knowledge beyond a good grasp of matrix algebra and some acquaintance with straight-line regression and simple analysis of variance models.* More than 200 problems throughout the book plus outline solutions for the exercises.* This revision has been extensively class-tested.
Ritz, Christian; Parmigiani, Giovanni
2009-01-01
R is a rapidly evolving lingua franca of graphical display and statistical analysis of experiments from the applied sciences. This book provides a coherent treatment of nonlinear regression with R by means of examples from a diversity of applied sciences such as biology, chemistry, engineering, medicine and toxicology.
Bounded Gaussian process regression
DEFF Research Database (Denmark)
Jensen, Bjørn Sand; Nielsen, Jens Brehm; Larsen, Jan
2013-01-01
We extend the Gaussian process (GP) framework for bounded regression by introducing two bounded likelihood functions that model the noise on the dependent variable explicitly. This is fundamentally different from the implicit noise assumption in the previously suggested warped GP framework. We...... with the proposed explicit noise-model extension....
and Multinomial Logistic Regression
African Journals Online (AJOL)
This work presented the results of an experimental comparison of two models: Multinomial Logistic Regression (MLR) and Artificial Neural Network (ANN) for classifying students based on their academic performance. The predictive accuracy for each model was measured by their average Classification Correct Rate (CCR).
Mechanisms of neuroblastoma regression
Brodeur, Garrett M.; Bagatell, Rochelle
2014-01-01
Recent genomic and biological studies of neuroblastoma have shed light on the dramatic heterogeneity in the clinical behaviour of this disease, which spans from spontaneous regression or differentiation in some patients, to relentless disease progression in others, despite intensive multimodality therapy. This evidence also suggests several possible mechanisms to explain the phenomena of spontaneous regression in neuroblastomas, including neurotrophin deprivation, humoral or cellular immunity, loss of telomerase activity and alterations in epigenetic regulation. A better understanding of the mechanisms of spontaneous regression might help to identify optimal therapeutic approaches for patients with these tumours. Currently, the most druggable mechanism is the delayed activation of developmentally programmed cell death regulated by the tropomyosin receptor kinase A pathway. Indeed, targeted therapy aimed at inhibiting neurotrophin receptors might be used in lieu of conventional chemotherapy or radiation in infants with biologically favourable tumours that require treatment. Alternative approaches consist of breaking immune tolerance to tumour antigens or activating neurotrophin receptor pathways to induce neuronal differentiation. These approaches are likely to be most effective against biologically favourable tumours, but they might also provide insights into treatment of biologically unfavourable tumours. We describe the different mechanisms of spontaneous neuroblastoma regression and the consequent therapeutic approaches. PMID:25331179
Directory of Open Access Journals (Sweden)
Nils Ternès
2017-05-01
Full Text Available Abstract Background Thanks to the advances in genomics and targeted treatments, more and more prediction models based on biomarkers are being developed to predict potential benefit from treatments in a randomized clinical trial. Despite the methodological framework for the development and validation of prediction models in a high-dimensional setting is getting more and more established, no clear guidance exists yet on how to estimate expected survival probabilities in a penalized model with biomarker-by-treatment interactions. Methods Based on a parsimonious biomarker selection in a penalized high-dimensional Cox model (lasso or adaptive lasso, we propose a unified framework to: estimate internally the predictive accuracy metrics of the developed model (using double cross-validation; estimate the individual survival probabilities at a given timepoint; construct confidence intervals thereof (analytical or bootstrap; and visualize them graphically (pointwise or smoothed with spline. We compared these strategies through a simulation study covering scenarios with or without biomarker effects. We applied the strategies to a large randomized phase III clinical trial that evaluated the effect of adding trastuzumab to chemotherapy in 1574 early breast cancer patients, for which the expression of 462 genes was measured. Results In our simulations, penalized regression models using the adaptive lasso estimated the survival probability of new patients with low bias and standard error; bootstrapped confidence intervals had empirical coverage probability close to the nominal level across very different scenarios. The double cross-validation performed on the training data set closely mimicked the predictive accuracy of the selected models in external validation data. We also propose a useful visual representation of the expected survival probabilities using splines. In the breast cancer trial, the adaptive lasso penalty selected a prediction model with 4
Sparse Regression by Projection and Sparse Discriminant Analysis
Qi, Xin
2015-04-03
© 2015, © American Statistical Association, Institute of Mathematical Statistics, and Interface Foundation of North America. Recent years have seen active developments of various penalized regression methods, such as LASSO and elastic net, to analyze high-dimensional data. In these approaches, the direction and length of the regression coefficients are determined simultaneously. Due to the introduction of penalties, the length of the estimates can be far from being optimal for accurate predictions. We introduce a new framework, regression by projection, and its sparse version to analyze high-dimensional data. The unique nature of this framework is that the directions of the regression coefficients are inferred first, and the lengths and the tuning parameters are determined by a cross-validation procedure to achieve the largest prediction accuracy. We provide a theoretical result for simultaneous model selection consistency and parameter estimation consistency of our method in high dimension. This new framework is then generalized such that it can be applied to principal components analysis, partial least squares, and canonical correlation analysis. We also adapt this framework for discriminant analysis. Compared with the existing methods, where there is relatively little control of the dependency among the sparse components, our method can control the relationships among the components. We present efficient algorithms and related theory for solving the sparse regression by projection problem. Based on extensive simulations and real data analysis, we demonstrate that our method achieves good predictive performance and variable selection in the regression setting, and the ability to control relationships between the sparse components leads to more accurate classification. In supplementary materials available online, the details of the algorithms and theoretical proofs, and R codes for all simulation studies are provided.
On estimation of the noise variance in high-dimensional linear models
Golubev, Yuri; Krymova, Ekaterina
2017-01-01
We consider the problem of recovering the unknown noise variance in the linear regression model. To estimate the nuisance (a vector of regression coefficients) we use a family of spectral regularisers of the maximum likelihood estimator. The noise estimation is based on the adaptive normalisation of the squared error. We derive the upper bound for the concentration of the proposed method around the ideal estimator (the case of zero nuisance).
Spontaneous regression of metastatic Merkel cell carcinoma.
LENUS (Irish Health Repository)
Hassan, S J
2010-01-01
Merkel cell carcinoma is a rare aggressive neuroendocrine carcinoma of the skin predominantly affecting elderly Caucasians. It has a high rate of local recurrence and regional lymph node metastases. It is associated with a poor prognosis. Complete spontaneous regression of Merkel cell carcinoma has been reported but is a poorly understood phenomenon. Here we present a case of complete spontaneous regression of metastatic Merkel cell carcinoma demonstrating a markedly different pattern of events from those previously published.
Directory of Open Access Journals (Sweden)
Laurent Berge
2012-01-01
Full Text Available This paper presents the R package HDclassif which is devoted to the clustering and the discriminant analysis of high-dimensional data. The classification methods proposed in the package result from a new parametrization of the Gaussian mixture model which combines the idea of dimension reduction and model constraints on the covariance matrices. The supervised classification method using this parametrization is called high dimensional discriminant analysis (HDDA. In a similar manner, the associated clustering method iscalled high dimensional data clustering (HDDC and uses the expectation-maximization algorithm for inference. In order to correctly t the data, both methods estimate the specific subspace and the intrinsic dimension of the groups. Due to the constraints on the covariance matrices, the number of parameters to estimate is significantly lower than other model-based methods and this allows the methods to be stable and efficient in high dimensions. Two introductory examples illustrated with R codes allow the user to discover the hdda and hddc functions. Experiments on simulated and real datasets also compare HDDC and HDDA with existing classification methods on high-dimensional datasets. HDclassif is a free software and distributed under the general public license, as part of the R software project.
International Nuclear Information System (INIS)
Guerrieri, A.
2009-01-01
In this report the largest Lyapunov characteristic exponent of a high dimensional atmospheric global circulation model of intermediate complexity has been estimated numerically. A sensitivity analysis has been carried out by varying the equator-to-pole temperature difference, the space resolution and the value of some parameters employed by the model. Chaotic and non-chaotic regimes of circulation have been found. [it
Ridge Regression Signal Processing
Kuhl, Mark R.
1990-01-01
The introduction of the Global Positioning System (GPS) into the National Airspace System (NAS) necessitates the development of Receiver Autonomous Integrity Monitoring (RAIM) techniques. In order to guarantee a certain level of integrity, a thorough understanding of modern estimation techniques applied to navigational problems is required. The extended Kalman filter (EKF) is derived and analyzed under poor geometry conditions. It was found that the performance of the EKF is difficult to predict, since the EKF is designed for a Gaussian environment. A novel approach is implemented which incorporates ridge regression to explain the behavior of an EKF in the presence of dynamics under poor geometry conditions. The basic principles of ridge regression theory are presented, followed by the derivation of a linearized recursive ridge estimator. Computer simulations are performed to confirm the underlying theory and to provide a comparative analysis of the EKF and the recursive ridge estimator.
Subset selection in regression
Miller, Alan
2002-01-01
Originally published in 1990, the first edition of Subset Selection in Regression filled a significant gap in the literature, and its critical and popular success has continued for more than a decade. Thoroughly revised to reflect progress in theory, methods, and computing power, the second edition promises to continue that tradition. The author has thoroughly updated each chapter, incorporated new material on recent developments, and included more examples and references. New in the Second Edition:A separate chapter on Bayesian methodsComplete revision of the chapter on estimationA major example from the field of near infrared spectroscopyMore emphasis on cross-validationGreater focus on bootstrappingStochastic algorithms for finding good subsets from large numbers of predictors when an exhaustive search is not feasible Software available on the Internet for implementing many of the algorithms presentedMore examplesSubset Selection in Regression, Second Edition remains dedicated to the techniques for fitting...
Better Autologistic Regression
Directory of Open Access Journals (Sweden)
Mark A. Wolters
2017-11-01
Full Text Available Autologistic regression is an important probability model for dichotomous random variables observed along with covariate information. It has been used in various fields for analyzing binary data possessing spatial or network structure. The model can be viewed as an extension of the autologistic model (also known as the Ising model, quadratic exponential binary distribution, or Boltzmann machine to include covariates. It can also be viewed as an extension of logistic regression to handle responses that are not independent. Not all authors use exactly the same form of the autologistic regression model. Variations of the model differ in two respects. First, the variable coding—the two numbers used to represent the two possible states of the variables—might differ. Common coding choices are (zero, one and (minus one, plus one. Second, the model might appear in either of two algebraic forms: a standard form, or a recently proposed centered form. Little attention has been paid to the effect of these differences, and the literature shows ambiguity about their importance. It is shown here that changes to either coding or centering in fact produce distinct, non-nested probability models. Theoretical results, numerical studies, and analysis of an ecological data set all show that the differences among the models can be large and practically significant. Understanding the nature of the differences and making appropriate modeling choices can lead to significantly improved autologistic regression analyses. The results strongly suggest that the standard model with plus/minus coding, which we call the symmetric autologistic model, is the most natural choice among the autologistic variants.
Regression in organizational leadership.
Kernberg, O F
1979-02-01
The choice of good leaders is a major task for all organizations. Inforamtion regarding the prospective administrator's personality should complement questions regarding his previous experience, his general conceptual skills, his technical knowledge, and the specific skills in the area for which he is being selected. The growing psychoanalytic knowledge about the crucial importance of internal, in contrast to external, object relations, and about the mutual relationships of regression in individuals and in groups, constitutes an important practical tool for the selection of leaders.
Classification and regression trees
Breiman, Leo; Olshen, Richard A; Stone, Charles J
1984-01-01
The methodology used to construct tree structured rules is the focus of this monograph. Unlike many other statistical procedures, which moved from pencil and paper to calculators, this text's use of trees was unthinkable before computers. Both the practical and theoretical sides have been developed in the authors' study of tree methods. Classification and Regression Trees reflects these two sides, covering the use of trees as a data analysis method, and in a more mathematical framework, proving some of their fundamental properties.
Hilbe, Joseph M
2009-01-01
This book really does cover everything you ever wanted to know about logistic regression … with updates available on the author's website. Hilbe, a former national athletics champion, philosopher, and expert in astronomy, is a master at explaining statistical concepts and methods. Readers familiar with his other expository work will know what to expect-great clarity.The book provides considerable detail about all facets of logistic regression. No step of an argument is omitted so that the book will meet the needs of the reader who likes to see everything spelt out, while a person familiar with some of the topics has the option to skip "obvious" sections. The material has been thoroughly road-tested through classroom and web-based teaching. … The focus is on helping the reader to learn and understand logistic regression. The audience is not just students meeting the topic for the first time, but also experienced users. I believe the book really does meet the author's goal … .-Annette J. Dobson, Biometric...
Evaluation of a new high-dimensional miRNA profiling platform
Directory of Open Access Journals (Sweden)
Lamblin Anne-Francoise
2009-08-01
Full Text Available Abstract Background MicroRNAs (miRNAs are a class of approximately 22 nucleotide long, widely expressed RNA molecules that play important regulatory roles in eukaryotes. To investigate miRNA function, it is essential that methods to quantify their expression levels be available. Methods We evaluated a new miRNA profiling platform that utilizes Illumina's existing robust DASL chemistry as the basis for the assay. Using total RNA from five colon cancer patients and four cell lines, we evaluated the reproducibility of miRNA expression levels across replicates and with varying amounts of input RNA. The beta test version was comprised of 735 miRNA targets of Illumina's miRNA profiling application. Results Reproducibility between sample replicates within a plate was good (Spearman's correlation 0.91 to 0.98 as was the plate-to-plate reproducibility replicates run on different days (Spearman's correlation 0.84 to 0.98. To determine whether quality data could be obtained from a broad range of input RNA, data obtained from amounts ranging from 25 ng to 800 ng were compared to those obtained at 200 ng. No effect across the range of RNA input was observed. Conclusion These results indicate that very small amounts of starting material are sufficient to allow sensitive miRNA profiling using the Illumina miRNA high-dimensional platform. Nonlinear biases were observed between replicates, indicating the need for abundance-dependent normalization. Overall, the performance characteristics of the Illumina miRNA profiling system were excellent.
Gomez, Luis J; Yücel, Abdulkadir C; Hernandez-Garcia, Luis; Taylor, Stephan F; Michielssen, Eric
2015-01-01
A computational framework for uncertainty quantification in transcranial magnetic stimulation (TMS) is presented. The framework leverages high-dimensional model representations (HDMRs), which approximate observables (i.e., quantities of interest such as electric (E) fields induced inside targeted cortical regions) via series of iteratively constructed component functions involving only the most significant random variables (i.e., parameters that characterize the uncertainty in a TMS setup such as the position and orientation of TMS coils, as well as the size, shape, and conductivity of the head tissue). The component functions of HDMR expansions are approximated via a multielement probabilistic collocation (ME-PC) method. While approximating each component function, a quasi-static finite-difference simulator is used to compute observables at integration/collocation points dictated by the ME-PC method. The proposed framework requires far fewer simulations than traditional Monte Carlo methods for providing highly accurate statistical information (e.g., the mean and standard deviation) about the observables. The efficiency and accuracy of the proposed framework are demonstrated via its application to the statistical characterization of E-fields generated by TMS inside cortical regions of an MRI-derived realistic head model. Numerical results show that while uncertainties in tissue conductivities have negligible effects on TMS operation, variations in coil position/orientation and brain size significantly affect the induced E-fields. Our numerical results have several implications for the use of TMS during depression therapy: 1) uncertainty in the coil position and orientation may reduce the response rates of patients; 2) practitioners should favor targets on the crest of a gyrus to obtain maximal stimulation; and 3) an increasing scalp-to-cortex distance reduces the magnitude of E-fields on the surface and inside the cortex.
Directory of Open Access Journals (Sweden)
Datta Susmita
2010-08-01
Full Text Available Abstract Background Generally speaking, different classifiers tend to work well for certain types of data and conversely, it is usually not known a priori which algorithm will be optimal in any given classification application. In addition, for most classification problems, selecting the best performing classification algorithm amongst a number of competing algorithms is a difficult task for various reasons. As for example, the order of performance may depend on the performance measure employed for such a comparison. In this work, we present a novel adaptive ensemble classifier constructed by combining bagging and rank aggregation that is capable of adaptively changing its performance depending on the type of data that is being classified. The attractive feature of the proposed classifier is its multi-objective nature where the classification results can be simultaneously optimized with respect to several performance measures, for example, accuracy, sensitivity and specificity. We also show that our somewhat complex strategy has better predictive performance as judged on test samples than a more naive approach that attempts to directly identify the optimal classifier based on the training data performances of the individual classifiers. Results We illustrate the proposed method with two simulated and two real-data examples. In all cases, the ensemble classifier performs at the level of the best individual classifier comprising the ensemble or better. Conclusions For complex high-dimensional datasets resulting from present day high-throughput experiments, it may be wise to consider a number of classification algorithms combined with dimension reduction techniques rather than a fixed standard algorithm set a priori.
From Ambiguities to Insights: Query-based Comparisons of High-Dimensional Data
Kowalski, Jeanne; Talbot, Conover; Tsai, Hua L.; Prasad, Nijaguna; Umbricht, Christopher; Zeiger, Martha A.
2007-11-01
Genomic technologies will revolutionize drag discovery and development; that much is universally agreed upon. The high dimension of data from such technologies has challenged available data analytic methods; that much is apparent. To date, large-scale data repositories have not been utilized in ways that permit their wealth of information to be efficiently processed for knowledge, presumably due in large part to inadequate analytical tools to address numerous comparisons of high-dimensional data. In candidate gene discovery, expression comparisons are often made between two features (e.g., cancerous versus normal), such that the enumeration of outcomes is manageable. With multiple features, the setting becomes more complex, in terms of comparing expression levels of tens of thousands transcripts across hundreds of features. In this case, the number of outcomes, while enumerable, become rapidly large and unmanageable, and scientific inquiries become more abstract, such as "which one of these (compounds, stimuli, etc.) is not like the others?" We develop analytical tools that promote more extensive, efficient, and rigorous utilization of the public data resources generated by the massive support of genomic studies. Our work innovates by enabling access to such metadata with logically formulated scientific inquires that define, compare and integrate query-comparison pair relations for analysis. We demonstrate our computational tool's potential to address an outstanding biomedical informatics issue of identifying reliable molecular markers in thyroid cancer. Our proposed query-based comparison (QBC) facilitates access to and efficient utilization of metadata through logically formed inquires expressed as query-based comparisons by organizing and comparing results from biotechnologies to address applications in biomedicine.
International Nuclear Information System (INIS)
Yun Liang; Messer, Karen; Rose, Brent S.; Lewis, John H.; Jiang, Steve B.; Yashar, Catheryn M.; Mundt, Arno J.; Mell, Loren K.
2010-01-01
Purpose: To study the effects of increasing pelvic bone marrow (BM) radiation dose on acute hematologic toxicity in patients undergoing chemoradiotherapy, using a novel modeling approach to preserve the local spatial dose information. Methods and Materials: The study included 37 cervical cancer patients treated with concurrent weekly cisplatin and pelvic radiation therapy. The white blood cell count nadir during treatment was used as the indicator for acute hematologic toxicity. Pelvic BM radiation dose distributions were standardized across patients by registering the pelvic BM volumes to a common template, followed by dose remapping using deformable image registration, resulting in a dose array. Principal component (PC) analysis was applied to the dose array, and the significant eigenvectors were identified by linear regression on the PCs. The coefficients for PC regression and significant eigenvectors were represented in three dimensions to identify critical BM subregions where dose accumulation is associated with hematologic toxicity. Results: We identified five PCs associated with acute hematologic toxicity. PC analysis regression modeling explained a high proportion of the variation in acute hematologicity (adjusted R 2 , 0.49). Three-dimensional rendering of a linear combination of the significant eigenvectors revealed patterns consistent with anatomical distributions of hematopoietically active BM. Conclusions: We have developed a novel approach that preserves spatial dose information to model effects of radiation dose on toxicity, which may be useful in optimizing radiation techniques to avoid critical subregions of normal tissues. Further validation of this approach in a large cohort is ongoing.
Steganalysis using logistic regression
Lubenko, Ivans; Ker, Andrew D.
2011-02-01
We advocate Logistic Regression (LR) as an alternative to the Support Vector Machine (SVM) classifiers commonly used in steganalysis. LR offers more information than traditional SVM methods - it estimates class probabilities as well as providing a simple classification - and can be adapted more easily and efficiently for multiclass problems. Like SVM, LR can be kernelised for nonlinear classification, and it shows comparable classification accuracy to SVM methods. This work is a case study, comparing accuracy and speed of SVM and LR classifiers in detection of LSB Matching and other related spatial-domain image steganography, through the state-of-art 686-dimensional SPAM feature set, in three image sets.
SEPARATION PHENOMENA LOGISTIC REGRESSION
Directory of Open Access Journals (Sweden)
Ikaro Daniel de Carvalho Barreto
2014-03-01
Full Text Available This paper proposes an application of concepts about the maximum likelihood estimation of the binomial logistic regression model to the separation phenomena. It generates bias in the estimation and provides different interpretations of the estimates on the different statistical tests (Wald, Likelihood Ratio and Score and provides different estimates on the different iterative methods (Newton-Raphson and Fisher Score. It also presents an example that demonstrates the direct implications for the validation of the model and validation of variables, the implications for estimates of odds ratios and confidence intervals, generated from the Wald statistics. Furthermore, we present, briefly, the Firth correction to circumvent the phenomena of separation.
DEFF Research Database (Denmark)
Ozenne, Brice; Sørensen, Anne Lyngholm; Scheike, Thomas
2017-01-01
In the presence of competing risks a prediction of the time-dynamic absolute risk of an event can be based on cause-specific Cox regression models for the event and the competing risks (Benichou and Gail, 1990). We present computationally fast and memory optimized C++ functions with an R interface......-product we obtain fast access to the baseline hazards (compared to survival::basehaz()) and predictions of survival probabilities, their confidence intervals and confidence bands. Confidence intervals and confidence bands are based on point-wise asymptotic expansions of the corresponding statistical...
Xuan Chi; Barry Goodwin
2012-01-01
Spatial and temporal relationships among agricultural prices have been an important topic of applied research for many years. Such research is used to investigate the performance of markets and to examine linkages up and down the marketing chain. This research has empirically evaluated price linkages by using correlation and regression models and, later, linear and...
Integrating high dimensional bi-directional parsing models for gene mention tagging.
Hsu, Chun-Nan; Chang, Yu-Ming; Kuo, Cheng-Ju; Lin, Yu-Shi; Huang, Han-Shen; Chung, I-Fang
2008-07-01
Tagging gene and gene product mentions in scientific text is an important initial step of literature mining. In this article, we describe in detail our gene mention tagger participated in BioCreative 2 challenge and analyze what contributes to its good performance. Our tagger is based on the conditional random fields model (CRF), the most prevailing method for the gene mention tagging task in BioCreative 2. Our tagger is interesting because it accomplished the highest F-scores among CRF-based methods and second over all. Moreover, we obtained our results by mostly applying open source packages, making it easy to duplicate our results. We first describe in detail how we developed our CRF-based tagger. We designed a very high dimensional feature set that includes most of information that may be relevant. We trained bi-directional CRF models with the same set of features, one applies forward parsing and the other backward, and integrated two models based on the output scores and dictionary filtering. One of the most prominent factors that contributes to the good performance of our tagger is the integration of an additional backward parsing model. However, from the definition of CRF, it appears that a CRF model is symmetric and bi-directional parsing models will produce the same results. We show that due to different feature settings, a CRF model can be asymmetric and the feature setting for our tagger in BioCreative 2 not only produces different results but also gives backward parsing models slight but constant advantage over forward parsing model. To fully explore the potential of integrating bi-directional parsing models, we applied different asymmetric feature settings to generate many bi-directional parsing models and integrate them based on the output scores. Experimental results show that this integrated model can achieve even higher F-score solely based on the training corpus for gene mention tagging. Data sets, programs and an on-line service of our gene
Greedy algorithms for high-dimensional non-symmetric linear problems***
Directory of Open Access Journals (Sweden)
Cancès E.
2013-12-01
Full Text Available In this article, we present a family of numerical approaches to solve high-dimensional linear non-symmetric problems. The principle of these methods is to approximate a function which depends on a large number of variates by a sum of tensor product functions, each term of which is iteratively computed via a greedy algorithm ? . There exists a good theoretical framework for these methods in the case of (linear and nonlinear symmetric elliptic problems. However, the convergence results are not valid any more as soon as the problems under consideration are not symmetric. We present here a review of the main algorithms proposed in the literature to circumvent this difficulty, together with some new approaches. The theoretical convergence results and the practical implementation of these algorithms are discussed. Their behaviors are illustrated through some numerical examples. Dans cet article, nous présentons une famille de méthodes numériques pour résoudre des problèmes linéaires non symétriques en grande dimension. Le principe de ces approches est de représenter une fonction dépendant d’un grand nombre de variables sous la forme d’une somme de fonctions produit tensoriel, dont chaque terme est calculé itérativement via un algorithme glouton ? . Ces méthodes possèdent de bonnes propriétés théoriques dans le cas de problèmes elliptiques symétriques (linéaires ou non linéaires, mais celles-ci ne sont plus valables dès lors que les problèmes considérés ne sont plus symétriques. Nous présentons une revue des principaux algorithmes proposés dans la littérature pour contourner cette difficulté ainsi que de nouvelles approches que nous proposons. Les résultats de convergence théoriques et la mise en oeuvre pratique de ces algorithmes sont détaillés et leur comportement est illustré au travers d’exemples numériques.
Descriptor Learning via Supervised Manifold Regularization for Multioutput Regression.
Zhen, Xiantong; Yu, Mengyang; Islam, Ali; Bhaduri, Mousumi; Chan, Ian; Li, Shuo
2017-09-01
Multioutput regression has recently shown great ability to solve challenging problems in both computer vision and medical image analysis. However, due to the huge image variability and ambiguity, it is fundamentally challenging to handle the highly complex input-target relationship of multioutput regression, especially with indiscriminate high-dimensional representations. In this paper, we propose a novel supervised descriptor learning (SDL) algorithm for multioutput regression, which can establish discriminative and compact feature representations to improve the multivariate estimation performance. The SDL is formulated as generalized low-rank approximations of matrices with a supervised manifold regularization. The SDL is able to simultaneously extract discriminative features closely related to multivariate targets and remove irrelevant and redundant information by transforming raw features into a new low-dimensional space aligned to targets. The achieved discriminative while compact descriptor largely reduces the variability and ambiguity for multioutput regression, which enables more accurate and efficient multivariate estimation. We conduct extensive evaluation of the proposed SDL on both synthetic data and real-world multioutput regression tasks for both computer vision and medical image analysis. Experimental results have shown that the proposed SDL can achieve high multivariate estimation accuracy on all tasks and largely outperforms the algorithms in the state of the arts. Our method establishes a novel SDL framework for multioutput regression, which can be widely used to boost the performance in different applications.
DEFF Research Database (Denmark)
Hansen, Henrik; Tarp, Finn
2001-01-01
This paper examines the relationship between foreign aid and growth in real GDP per capita as it emerges from simple augmentations of popular cross country growth specifications. It is shown that aid in all likelihood increases the growth rate, and this result is not conditional on ‘good’ policy....... investment. We conclude by stressing the need for more theoretical work before this kind of cross-country regressions are used for policy purposes.......This paper examines the relationship between foreign aid and growth in real GDP per capita as it emerges from simple augmentations of popular cross country growth specifications. It is shown that aid in all likelihood increases the growth rate, and this result is not conditional on ‘good’ policy...
Shaffer, Patrick; Valsson, Omar; Parrinello, Michele
2016-02-02
The capabilities of molecular simulations have been greatly extended by a number of widely used enhanced sampling methods that facilitate escaping from metastable states and crossing large barriers. Despite these developments there are still many problems which remain out of reach for these methods which has led to a vigorous effort in this area. One of the most important problems that remains unsolved is sampling high-dimensional free-energy landscapes and systems that are not easily described by a small number of collective variables. In this work we demonstrate a new way to compute free-energy landscapes of high dimensionality based on the previously introduced variationally enhanced sampling, and we apply it to the miniprotein chignolin.
Shaffer, Patrick; Valsson, Omar; Parrinello, Michele
2016-01-01
The capabilities of molecular simulations have been greatly extended by a number of widely used enhanced sampling methods that facilitate escaping from metastable states and crossing large barriers. Despite these developments there are still many problems which remain out of reach for these methods which has led to a vigorous effort in this area. One of the most important problems that remains unsolved is sampling high-dimensional free-energy landscapes and systems that are not easily described by a small number of collective variables. In this work we demonstrate a new way to compute free-energy landscapes of high dimensionality based on the previously introduced variationally enhanced sampling, and we apply it to the miniprotein chignolin. PMID:26787868
Data-driven forecasting of high-dimensional chaotic systems with long short-term memory networks.
Vlachas, Pantelis R; Byeon, Wonmin; Wan, Zhong Y; Sapsis, Themistoklis P; Koumoutsakos, Petros
2018-05-01
We introduce a data-driven forecasting method for high-dimensional chaotic systems using long short-term memory (LSTM) recurrent neural networks. The proposed LSTM neural networks perform inference of high-dimensional dynamical systems in their reduced order space and are shown to be an effective set of nonlinear approximators of their attractor. We demonstrate the forecasting performance of the LSTM and compare it with Gaussian processes (GPs) in time series obtained from the Lorenz 96 system, the Kuramoto-Sivashinsky equation and a prototype climate model. The LSTM networks outperform the GPs in short-term forecasting accuracy in all applications considered. A hybrid architecture, extending the LSTM with a mean stochastic model (MSM-LSTM), is proposed to ensure convergence to the invariant measure. This novel hybrid method is fully data-driven and extends the forecasting capabilities of LSTM networks.
International Nuclear Information System (INIS)
Oganesian, A.G.
1998-01-01
A method is proposed for estimating unknown vacuum expectation values of high-dimensional operators. The method is based on the idea that the factorization hypothesis is self-consistent. Results are obtained for all vacuum expectation values of dimension-7 operators, and some estimates for dimension-10 operators are presented as well. The resulting values are used to compute corrections of higher dimensions to the Bjorken and Ellis-Jaffe sum rules
Modified Regression Correlation Coefficient for Poisson Regression Model
Kaengthong, Nattacha; Domthong, Uthumporn
2017-09-01
This study gives attention to indicators in predictive power of the Generalized Linear Model (GLM) which are widely used; however, often having some restrictions. We are interested in regression correlation coefficient for a Poisson regression model. This is a measure of predictive power, and defined by the relationship between the dependent variable (Y) and the expected value of the dependent variable given the independent variables [E(Y|X)] for the Poisson regression model. The dependent variable is distributed as Poisson. The purpose of this research was modifying regression correlation coefficient for Poisson regression model. We also compare the proposed modified regression correlation coefficient with the traditional regression correlation coefficient in the case of two or more independent variables, and having multicollinearity in independent variables. The result shows that the proposed regression correlation coefficient is better than the traditional regression correlation coefficient based on Bias and the Root Mean Square Error (RMSE).
Li, Yanming; Nan, Bin; Zhu, Ji
2015-06-01
We propose a multivariate sparse group lasso variable selection and estimation method for data with high-dimensional predictors as well as high-dimensional response variables. The method is carried out through a penalized multivariate multiple linear regression model with an arbitrary group structure for the regression coefficient matrix. It suits many biology studies well in detecting associations between multiple traits and multiple predictors, with each trait and each predictor embedded in some biological functional groups such as genes, pathways or brain regions. The method is able to effectively remove unimportant groups as well as unimportant individual coefficients within important groups, particularly for large p small n problems, and is flexible in handling various complex group structures such as overlapping or nested or multilevel hierarchical structures. The method is evaluated through extensive simulations with comparisons to the conventional lasso and group lasso methods, and is applied to an eQTL association study. © 2015, The International Biometric Society.
Luo, Chongliang; Liu, Jin; Dey, Dipak K; Chen, Kun
2016-07-01
In many fields, multi-view datasets, measuring multiple distinct but interrelated sets of characteristics on the same set of subjects, together with data on certain outcomes or phenotypes, are routinely collected. The objective in such a problem is often two-fold: both to explore the association structures of multiple sets of measurements and to develop a parsimonious model for predicting the future outcomes. We study a unified canonical variate regression framework to tackle the two problems simultaneously. The proposed criterion integrates multiple canonical correlation analysis with predictive modeling, balancing between the association strength of the canonical variates and their joint predictive power on the outcomes. Moreover, the proposed criterion seeks multiple sets of canonical variates simultaneously to enable the examination of their joint effects on the outcomes, and is able to handle multivariate and non-Gaussian outcomes. An efficient algorithm based on variable splitting and Lagrangian multipliers is proposed. Simulation studies show the superior performance of the proposed approach. We demonstrate the effectiveness of the proposed approach in an [Formula: see text] intercross mice study and an alcohol dependence study. © The Author 2016. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com.
Predicting Future High-Cost Schizophrenia Patients Using High-Dimensional Administrative Data
Directory of Open Access Journals (Sweden)
Yajuan Wang
2017-06-01
Full Text Available BackgroundThe burden of serious and persistent mental illness such as schizophrenia is substantial and requires health-care organizations to have adequate risk adjustment models to effectively allocate their resources to managing patients who are at the greatest risk. Currently available models underestimate health-care costs for those with mental or behavioral health conditions.ObjectivesThe study aimed to develop and evaluate predictive models for identification of future high-cost schizophrenia patients using advanced supervised machine learning methods.MethodsThis was a retrospective study using a payer administrative database. The study cohort consisted of 97,862 patients diagnosed with schizophrenia (ICD9 code 295.* from January 2009 to June 2014. Training (n = 34,510 and study evaluation (n = 30,077 cohorts were derived based on 12-month observation and prediction windows (PWs. The target was average total cost/patient/month in the PW. Three models (baseline, intermediate, final were developed to assess the value of different variable categories for cost prediction (demographics, coverage, cost, health-care utilization, antipsychotic medication usage, and clinical conditions. Scalable orthogonal regression, significant attribute selection in high dimensions method, and random forests regression were used to develop the models. The trained models were assessed in the evaluation cohort using the regression R2, patient classification accuracy (PCA, and cost accuracy (CA. The model performance was compared to the Centers for Medicare & Medicaid Services Hierarchical Condition Categories (CMS-HCC model.ResultsAt top 10% cost cutoff, the final model achieved 0.23 R2, 43% PCA, and 63% CA; in contrast, the CMS-HCC model achieved 0.09 R2, 27% PCA with 45% CA. The final model and the CMS-HCC model identified 33 and 22%, respectively, of total cost at the top 10% cost cutoff.ConclusionUsing advanced feature selection leveraging detailed
Directory of Open Access Journals (Sweden)
Ziping WANG
2011-09-01
Full Text Available Background and objective It has been proven that gefitinib produces only 10%-20% tumor regression in heavily pretreated, unselected non-small cell lung cancer (NSCLC patients as the second- and third-line setting. Asian, female, nonsmokers and adenocarcinoma are favorable factors; however, it is difficult to find a patient satisfying all the above clinical characteristics. The aim of this study is to identify novel predicting factors, and to explore the interactions between clinical variables and their impact on the survival of Chinese patients with advanced NSCLC who were heavily treated with gefitinib in the second- or third-line setting. Methods The clinical and follow-up data of 127 advanced NSCLC patients referred to the Cancer Hospital & Institute, Chinese Academy of Medical Sciences from March 2005 to March 2010 were analyzed. Multivariate analysis of progression-free survival (PFS was performed using recursive partitioning, which is referred to as the classification and regression tree (CART analysis. Results The median PFS of 127 eligible consecutive advanced NSCLC patients was 8.0 months (95%CI: 5.8-10.2. CART was performed with an initial split on first-line chemotherapy outcomes and a second split on patients’ age. Three terminal subgroups were formed. The median PFS of the three subsets ranged from 1.0 month (95%CI: 0.8-1.2 for those with progressive disease outcome after the first-line chemotherapy subgroup, 10 months (95%CI: 7.0-13.0 in patients with a partial response or stable disease in first-line chemotherapy and age <70, and 22.0 months for patients obtaining a partial response or stable disease in first-line chemotherapy at age 70-81 (95%CI: 3.8-40.1. Conclusion Partial response, stable disease in first-line chemotherapy and age ≥ 70 are closely correlated with long-term survival treated by gefitinib as a second- or third-line setting in advanced NSCLC. CART can be used to identify previously unappreciated patient
Performance of the High-dimensional Propensity Score in a Nordic Healthcare Model
DEFF Research Database (Denmark)
Hallas, Jesper; Pottegård, Anton
2017-01-01
regression, estimating the coxib/tNSAID hazard ratio (HR). Values below 1.00 indicate a lower estimated hazard with coxibs. We build hdPS models with inclusion of up to 500 diagnosis and 500 prescription drug covariates. The crude HR was 1.76 (95% confidence interval: 1.57-1.97), decreasing to 1.12 (1...... was restricted to non-users of low-dose aspirin. The estimate based on 500 diagnoses alone was higher than an estimate based on 500 prescription drugs alone (0.99 versus 0.91). We conclude that hdPS does work within a Nordic setting that prescription data are more effective than diagnosis data in achieving...
Polynomial regression analysis and significance test of the regression function
International Nuclear Information System (INIS)
Gao Zhengming; Zhao Juan; He Shengping
2012-01-01
In order to analyze the decay heating power of a certain radioactive isotope per kilogram with polynomial regression method, the paper firstly demonstrated the broad usage of polynomial function and deduced its parameters with ordinary least squares estimate. Then significance test method of polynomial regression function is derived considering the similarity between the polynomial regression model and the multivariable linear regression model. Finally, polynomial regression analysis and significance test of the polynomial function are done to the decay heating power of the iso tope per kilogram in accord with the authors' real work. (authors)
Recursive Algorithm For Linear Regression
Varanasi, S. V.
1988-01-01
Order of model determined easily. Linear-regression algorithhm includes recursive equations for coefficients of model of increased order. Algorithm eliminates duplicative calculations, facilitates search for minimum order of linear-regression model fitting set of data satisfactory.
International Nuclear Information System (INIS)
Zhang, Liangwei; Lin, Jing; Karim, Ramin
2015-01-01
The accuracy of traditional anomaly detection techniques implemented on full-dimensional spaces degrades significantly as dimensionality increases, thereby hampering many real-world applications. This work proposes an approach to selecting meaningful feature subspace and conducting anomaly detection in the corresponding subspace projection. The aim is to maintain the detection accuracy in high-dimensional circumstances. The suggested approach assesses the angle between all pairs of two lines for one specific anomaly candidate: the first line is connected by the relevant data point and the center of its adjacent points; the other line is one of the axis-parallel lines. Those dimensions which have a relatively small angle with the first line are then chosen to constitute the axis-parallel subspace for the candidate. Next, a normalized Mahalanobis distance is introduced to measure the local outlier-ness of an object in the subspace projection. To comprehensively compare the proposed algorithm with several existing anomaly detection techniques, we constructed artificial datasets with various high-dimensional settings and found the algorithm displayed superior accuracy. A further experiment on an industrial dataset demonstrated the applicability of the proposed algorithm in fault detection tasks and highlighted another of its merits, namely, to provide preliminary interpretation of abnormality through feature ordering in relevant subspaces. - Highlights: • An anomaly detection approach for high-dimensional reliability data is proposed. • The approach selects relevant subspaces by assessing vectorial angles. • The novel ABSAD approach displays superior accuracy over other alternatives. • Numerical illustration approves its efficacy in fault detection applications
Directory of Open Access Journals (Sweden)
L.V. Arun Shalin
2016-01-01
Full Text Available Clustering is a process of grouping elements together, designed in such a way that the elements assigned to similar data points in a cluster are more comparable to each other than the remaining data points in a cluster. During clustering certain difficulties related when dealing with high dimensional data are ubiquitous and abundant. Works concentrated using anonymization method for high dimensional data spaces failed to address the problem related to dimensionality reduction during the inclusion of non-binary databases. In this work we study methods for dimensionality reduction for non-binary database. By analyzing the behavior of dimensionality reduction for non-binary database, results in performance improvement with the help of tag based feature. An effective multi-clustering anonymization approach called Discrete Component Task Specific Multi-Clustering (DCTSM is presented for dimensionality reduction on non-binary database. To start with we present the analysis of attribute in the non-binary database and cluster projection identifies the sparseness degree of dimensions. Additionally with the quantum distribution on multi-cluster dimension, the solution for relevancy of attribute and redundancy on non-binary data spaces is provided resulting in performance improvement on the basis of tag based feature. Multi-clustering tag based feature reduction extracts individual features and are correspondingly replaced by the equivalent feature clusters (i.e. tag clusters. During training, the DCTSM approach uses multi-clusters instead of individual tag features and then during decoding individual features is replaced by corresponding multi-clusters. To measure the effectiveness of the method, experiments are conducted on existing anonymization method for high dimensional data spaces and compared with the DCTSM approach using Statlog German Credit Data Set. Improved tag feature extraction and minimum error rate compared to conventional anonymization
Mah, Yee-Haur; Jager, Rolf; Kennard, Christopher; Husain, Masud; Nachev, Parashkev
2014-07-01
Making robust inferences about the functional neuroanatomy of the brain is critically dependent on experimental techniques that examine the consequences of focal loss of brain function. Unfortunately, the use of the most comprehensive such technique-lesion-function mapping-is complicated by the need for time-consuming and subjective manual delineation of the lesions, greatly limiting the practicability of the approach. Here we exploit a recently-described general measure of statistical anomaly, zeta, to devise a fully-automated, high-dimensional algorithm for identifying the parameters of lesions within a brain image given a reference set of normal brain images. We proceed to evaluate such an algorithm in the context of diffusion-weighted imaging of the commonest type of lesion used in neuroanatomical research: ischaemic damage. Summary performance metrics exceed those previously published for diffusion-weighted imaging and approach the current gold standard-manual segmentation-sufficiently closely for fully-automated lesion-mapping studies to become a possibility. We apply the new method to 435 unselected images of patients with ischaemic stroke to derive a probabilistic map of the pattern of damage in lesions involving the occipital lobe, demonstrating the variation of anatomical resolvability of occipital areas so as to guide future lesion-function studies of the region. Copyright © 2012 Elsevier Ltd. All rights reserved.
Quaranta, Vanessa; Hellström, Matti; Behler, Jörg; Kullgren, Jolla; Mitev, Pavlin D.; Hermansson, Kersti
2018-06-01
Unraveling the atomistic details of solid/liquid interfaces, e.g., by means of vibrational spectroscopy, is of vital importance in numerous applications, from electrochemistry to heterogeneous catalysis. Water-oxide interfaces represent a formidable challenge because a large variety of molecular and dissociated water species are present at the surface. Here, we present a comprehensive theoretical analysis of the anharmonic OH stretching vibrations at the water/ZnO(101 ¯ 0) interface as a prototypical case. Molecular dynamics simulations employing a reactive high-dimensional neural network potential based on density functional theory calculations have been used to sample the interfacial structures. In the second step, one-dimensional potential energy curves have been generated for a large number of configurations to solve the nuclear Schrödinger equation. We find that (i) the ZnO surface gives rise to OH frequency shifts up to a distance of about 4 Å from the surface; (ii) the spectrum contains a number of overlapping signals arising from different chemical species, with the frequencies decreasing in the order ν(adsorbed hydroxide) > ν(non-adsorbed water) > ν(surface hydroxide) > ν(adsorbed water); (iii) stretching frequencies are strongly influenced by the hydrogen bond pattern of these interfacial species. Finally, we have been able to identify substantial correlations between the stretching frequencies and hydrogen bond lengths for all species.
Garashchuk, Sophya; Rassolov, Vitaly A
2008-07-14
Semiclassical implementation of the quantum trajectory formalism [J. Chem. Phys. 120, 1181 (2004)] is further developed to give a stable long-time description of zero-point energy in anharmonic systems of high dimensionality. The method is based on a numerically cheap linearized quantum force approach; stabilizing terms compensating for the linearization errors are added into the time-evolution equations for the classical and nonclassical components of the momentum operator. The wave function normalization and energy are rigorously conserved. Numerical tests are performed for model systems of up to 40 degrees of freedom.
Benediktsson, J. A.; Swain, P. H.; Ersoy, O. K.
1993-01-01
Application of neural networks to classification of remote sensing data is discussed. Conventional two-layer backpropagation is found to give good results in classification of remote sensing data but is not efficient in training. A more efficient variant, based on conjugate-gradient optimization, is used for classification of multisource remote sensing and geographic data and very-high-dimensional data. The conjugate-gradient neural networks give excellent performance in classification of multisource data, but do not compare as well with statistical methods in classification of very-high-dimentional data.
Combining Alphas via Bounded Regression
Directory of Open Access Journals (Sweden)
Zura Kakushadze
2015-11-01
Full Text Available We give an explicit algorithm and source code for combining alpha streams via bounded regression. In practical applications, typically, there is insufficient history to compute a sample covariance matrix (SCM for a large number of alphas. To compute alpha allocation weights, one then resorts to (weighted regression over SCM principal components. Regression often produces alpha weights with insufficient diversification and/or skewed distribution against, e.g., turnover. This can be rectified by imposing bounds on alpha weights within the regression procedure. Bounded regression can also be applied to stock and other asset portfolio construction. We discuss illustrative examples.
Regression in autistic spectrum disorders.
Stefanatos, Gerry A
2008-12-01
A significant proportion of children diagnosed with Autistic Spectrum Disorder experience a developmental regression characterized by a loss of previously-acquired skills. This may involve a loss of speech or social responsitivity, but often entails both. This paper critically reviews the phenomena of regression in autistic spectrum disorders, highlighting the characteristics of regression, age of onset, temporal course, and long-term outcome. Important considerations for diagnosis are discussed and multiple etiological factors currently hypothesized to underlie the phenomenon are reviewed. It is argued that regressive autistic spectrum disorders can be conceptualized on a spectrum with other regressive disorders that may share common pathophysiological features. The implications of this viewpoint are discussed.
Shahiri, Amirah Mohamed; Husain, Wahidah; Rashid, Nur'Aini Abd
2017-10-01
Huge amounts of data in educational datasets may cause the problem in producing quality data. Recently, data mining approach are increasingly used by educational data mining researchers for analyzing the data patterns. However, many research studies have concentrated on selecting suitable learning algorithms instead of performing feature selection process. As a result, these data has problem with computational complexity and spend longer computational time for classification. The main objective of this research is to provide an overview of feature selection techniques that have been used to analyze the most significant features. Then, this research will propose a framework to improve the quality of students' dataset. The proposed framework uses filter and wrapper based technique to support prediction process in future study.
Diaz-Ruelas, Alvaro; Jeldtoft Jensen, Henrik; Piovani, Duccio; Robledo, Alberto
2016-12-01
It is well known that low-dimensional nonlinear deterministic maps close to a tangent bifurcation exhibit intermittency and this circumstance has been exploited, e.g., by Procaccia and Schuster [Phys. Rev. A 28, 1210 (1983)], to develop a general theory of 1/f spectra. This suggests it is interesting to study the extent to which the behavior of a high-dimensional stochastic system can be described by such tangent maps. The Tangled Nature (TaNa) Model of evolutionary ecology is an ideal candidate for such a study, a significant model as it is capable of reproducing a broad range of the phenomenology of macroevolution and ecosystems. The TaNa model exhibits strong intermittency reminiscent of punctuated equilibrium and, like the fossil record of mass extinction, the intermittency in the model is found to be non-stationary, a feature typical of many complex systems. We derive a mean-field version for the evolution of the likelihood function controlling the reproduction of species and find a local map close to tangency. This mean-field map, by our own local approximation, is able to describe qualitatively only one episode of the intermittent dynamics of the full TaNa model. To complement this result, we construct a complete nonlinear dynamical system model consisting of successive tangent bifurcations that generates time evolution patterns resembling those of the full TaNa model in macroscopic scales. The switch from one tangent bifurcation to the next in the sequences produced in this model is stochastic in nature, based on criteria obtained from the local mean-field approximation, and capable of imitating the changing set of types of species and total population in the TaNa model. The model combines full deterministic dynamics with instantaneous parameter random jumps at stochastically drawn times. In spite of the limitations of our approach, which entails a drastic collapse of degrees of freedom, the description of a high-dimensional model system in terms of a low
Linear regression in astronomy. I
Isobe, Takashi; Feigelson, Eric D.; Akritas, Michael G.; Babu, Gutti Jogesh
1990-01-01
Five methods for obtaining linear regression fits to bivariate data with unknown or insignificant measurement errors are discussed: ordinary least-squares (OLS) regression of Y on X, OLS regression of X on Y, the bisector of the two OLS lines, orthogonal regression, and 'reduced major-axis' regression. These methods have been used by various researchers in observational astronomy, most importantly in cosmic distance scale applications. Formulas for calculating the slope and intercept coefficients and their uncertainties are given for all the methods, including a new general form of the OLS variance estimates. The accuracy of the formulas was confirmed using numerical simulations. The applicability of the procedures is discussed with respect to their mathematical properties, the nature of the astronomical data under consideration, and the scientific purpose of the regression. It is found that, for problems needing symmetrical treatment of the variables, the OLS bisector performs significantly better than orthogonal or reduced major-axis regression.
Advanced statistics: linear regression, part I: simple linear regression.
Marill, Keith A
2004-01-01
Simple linear regression is a mathematical technique used to model the relationship between a single independent predictor variable and a single dependent outcome variable. In this, the first of a two-part series exploring concepts in linear regression analysis, the four fundamental assumptions and the mechanics of simple linear regression are reviewed. The most common technique used to derive the regression line, the method of least squares, is described. The reader will be acquainted with other important concepts in simple linear regression, including: variable transformations, dummy variables, relationship to inference testing, and leverage. Simplified clinical examples with small datasets and graphic models are used to illustrate the points. This will provide a foundation for the second article in this series: a discussion of multiple linear regression, in which there are multiple predictor variables.
Prospective Validation of a High Dimensional Shape Model for Organ Motion in Intact Cervical Cancer
Energy Technology Data Exchange (ETDEWEB)
Williamson, Casey W.; Green, Garrett; Noticewala, Sonal S.; Li, Nan; Shen, Hanjie [Department of Radiation Medicine and Applied Sciences, University of California, San Diego, La Jolla, California (United States); Vaida, Florin [Division of Biostatistics and Bioinformatics, Department of Family Medicine and Public Health, University of California, San Diego, La Jolla, California (United States); Mell, Loren K., E-mail: lmell@ucsd.edu [Department of Radiation Medicine and Applied Sciences, University of California, San Diego, La Jolla, California (United States)
2016-11-15
Purpose: Validated models are needed to justify strategies to define planning target volumes (PTVs) for intact cervical cancer used in clinical practice. Our objective was to independently validate a previously published shape model, using data collected prospectively from clinical trials. Methods and Materials: We analyzed 42 patients with intact cervical cancer treated with daily fractionated pelvic intensity modulated radiation therapy and concurrent chemotherapy in one of 2 prospective clinical trials. We collected online cone beam computed tomography (CBCT) scans before each fraction. Clinical target volume (CTV) structures from the planning computed tomography scan were cast onto each CBCT scan after rigid registration and manually redrawn to account for organ motion and deformation. We applied the 95% isodose cloud from the planning computed tomography scan to each CBCT scan and computed any CTV outside the 95% isodose cloud. The primary aim was to determine the proportion of CTVs that were encompassed within the 95% isodose volume. A 1-sample t test was used to test the hypothesis that the probability of complete coverage was different from 95%. We used mixed-effects logistic regression to assess effects of time and patient variability. Results: The 95% isodose line completely encompassed 92.3% of all CTVs (95% confidence interval, 88.3%-96.4%), not significantly different from the 95% probability anticipated a priori (P=.19). The overall proportion of missed CTVs was small: the grand mean of covered CTVs was 99.9%, and 95.2% of misses were located in the anterior body of the uterus. Time did not affect coverage probability (P=.71). Conclusions: With the clinical implementation of a previously proposed PTV definition strategy based on a shape model for intact cervical cancer, the probability of CTV coverage was high and the volume of CTV missed was low. This PTV expansion strategy is acceptable for clinical trials and practice; however, we recommend daily
Energy Technology Data Exchange (ETDEWEB)
Zawadzka-Kazimierczuk, Anna; Kozminski, Wiktor [University of Warsaw, Faculty of Chemistry (Poland); Billeter, Martin, E-mail: martin.billeter@chem.gu.se [University of Gothenburg, Biophysics Group, Department of Chemistry and Molecular Biology (Sweden)
2012-09-15
While NMR studies of proteins typically aim at structure, dynamics or interactions, resonance assignments represent in almost all cases the initial step of the analysis. With increasing complexity of the NMR spectra, for example due to decreasing extent of ordered structure, this task often becomes both difficult and time-consuming, and the recording of high-dimensional data with high-resolution may be essential. Random sampling of the evolution time space, combined with sparse multidimensional Fourier transform (SMFT), allows for efficient recording of very high dimensional spectra ({>=}4 dimensions) while maintaining high resolution. However, the nature of this data demands for automation of the assignment process. Here we present the program TSAR (Tool for SMFT-based Assignment of Resonances), which exploits all advantages of SMFT input. Moreover, its flexibility allows to process data from any type of experiments that provide sequential connectivities. The algorithm was tested on several protein samples, including a disordered 81-residue fragment of the {delta} subunit of RNA polymerase from Bacillus subtilis containing various repetitive sequences. For our test examples, TSAR achieves a high percentage of assigned residues without any erroneous assignments.
Ren, Jie; He, Tao; Li, Ye; Liu, Sai; Du, Yinhao; Jiang, Yu; Wu, Cen
2017-05-16
Over the past decades, the prevalence of type 2 diabetes mellitus (T2D) has been steadily increasing around the world. Despite large efforts devoted to better understand the genetic basis of the disease, the identified susceptibility loci can only account for a small portion of the T2D heritability. Some of the existing approaches proposed for the high dimensional genetic data from the T2D case-control study are limited by analyzing a few number of SNPs at a time from a large pool of SNPs, by ignoring the correlations among SNPs and by adopting inefficient selection techniques. We propose a network constrained regularization method to select important SNPs by taking the linkage disequilibrium into account. To accomodate the case control study, an iteratively reweighted least square algorithm has been developed within the coordinate descent framework where optimization of the regularized logistic loss function is performed with respect to one parameter at a time and iteratively cycle through all the parameters until convergence. In this article, a novel approach is developed to identify important SNPs more effectively through incorporating the interconnections among them in the regularized selection. A coordinate descent based iteratively reweighed least squares (IRLS) algorithm has been proposed. Both the simulation study and the analysis of the Nurses's Health Study, a case-control study of type 2 diabetes data with high dimensional SNP measurements, demonstrate the advantage of the network based approach over the competing alternatives.
Linear regression in astronomy. II
Feigelson, Eric D.; Babu, Gutti J.
1992-01-01
A wide variety of least-squares linear regression procedures used in observational astronomy, particularly investigations of the cosmic distance scale, are presented and discussed. The classes of linear models considered are (1) unweighted regression lines, with bootstrap and jackknife resampling; (2) regression solutions when measurement error, in one or both variables, dominates the scatter; (3) methods to apply a calibration line to new data; (4) truncated regression models, which apply to flux-limited data sets; and (5) censored regression models, which apply when nondetections are present. For the calibration problem we develop two new procedures: a formula for the intercept offset between two parallel data sets, which propagates slope errors from one regression to the other; and a generalization of the Working-Hotelling confidence bands to nonstandard least-squares lines. They can provide improved error analysis for Faber-Jackson, Tully-Fisher, and similar cosmic distance scale relations.
Time-adaptive quantile regression
DEFF Research Database (Denmark)
Møller, Jan Kloppenborg; Nielsen, Henrik Aalborg; Madsen, Henrik
2008-01-01
and an updating procedure are combined into a new algorithm for time-adaptive quantile regression, which generates new solutions on the basis of the old solution, leading to savings in computation time. The suggested algorithm is tested against a static quantile regression model on a data set with wind power......An algorithm for time-adaptive quantile regression is presented. The algorithm is based on the simplex algorithm, and the linear optimization formulation of the quantile regression problem is given. The observations have been split to allow a direct use of the simplex algorithm. The simplex method...... production, where the models combine splines and quantile regression. The comparison indicates superior performance for the time-adaptive quantile regression in all the performance parameters considered....
Cowley, Benjamin R.; Kaufman, Matthew T.; Butler, Zachary S.; Churchland, Mark M.; Ryu, Stephen I.; Shenoy, Krishna V.; Yu, Byron M.
2013-12-01
Objective. Analyzing and interpreting the activity of a heterogeneous population of neurons can be challenging, especially as the number of neurons, experimental trials, and experimental conditions increases. One approach is to extract a set of latent variables that succinctly captures the prominent co-fluctuation patterns across the neural population. A key problem is that the number of latent variables needed to adequately describe the population activity is often greater than 3, thereby preventing direct visualization of the latent space. By visualizing a small number of 2-d projections of the latent space or each latent variable individually, it is easy to miss salient features of the population activity. Approach. To address this limitation, we developed a Matlab graphical user interface (called DataHigh) that allows the user to quickly and smoothly navigate through a continuum of different 2-d projections of the latent space. We also implemented a suite of additional visualization tools (including playing out population activity timecourses as a movie and displaying summary statistics, such as covariance ellipses and average timecourses) and an optional tool for performing dimensionality reduction. Main results. To demonstrate the utility and versatility of DataHigh, we used it to analyze single-trial spike count and single-trial timecourse population activity recorded using a multi-electrode array, as well as trial-averaged population activity recorded using single electrodes. Significance. DataHigh was developed to fulfil a need for visualization in exploratory neural data analysis, which can provide intuition that is critical for building scientific hypotheses and models of population activity.
Cowley, Benjamin R; Kaufman, Matthew T; Butler, Zachary S; Churchland, Mark M; Ryu, Stephen I; Shenoy, Krishna V; Yu, Byron M
2013-12-01
Analyzing and interpreting the activity of a heterogeneous population of neurons can be challenging, especially as the number of neurons, experimental trials, and experimental conditions increases. One approach is to extract a set of latent variables that succinctly captures the prominent co-fluctuation patterns across the neural population. A key problem is that the number of latent variables needed to adequately describe the population activity is often greater than 3, thereby preventing direct visualization of the latent space. By visualizing a small number of 2-d projections of the latent space or each latent variable individually, it is easy to miss salient features of the population activity. To address this limitation, we developed a Matlab graphical user interface (called DataHigh) that allows the user to quickly and smoothly navigate through a continuum of different 2-d projections of the latent space. We also implemented a suite of additional visualization tools (including playing out population activity timecourses as a movie and displaying summary statistics, such as covariance ellipses and average timecourses) and an optional tool for performing dimensionality reduction. To demonstrate the utility and versatility of DataHigh, we used it to analyze single-trial spike count and single-trial timecourse population activity recorded using a multi-electrode array, as well as trial-averaged population activity recorded using single electrodes. DataHigh was developed to fulfil a need for visualization in exploratory neural data analysis, which can provide intuition that is critical for building scientific hypotheses and models of population activity.
Clark, James S.; Soltoff, Benjamin D.; Powell, Amanda S.; Read, Quentin D.
2012-01-01
Background For competing species to coexist, individuals must compete more with others of the same species than with those of other species. Ecologists search for tradeoffs in how species might partition the environment. The negative correlations among competing species that would be indicative of tradeoffs are rarely observed. A recent analysis showed that evidence for partitioning the environment is available when responses are disaggregated to the individual scale, in terms of the covariance structure of responses to environmental variation. That study did not relate that variation to the variables to which individuals were responding. To understand how this pattern of variation is related to niche variables, we analyzed responses to canopy gaps, long viewed as a key variable responsible for species coexistence. Methodology/Principal Findings A longitudinal intervention analysis of individual responses to experimental canopy gaps with 12 yr of pre-treatment and 8 yr post-treatment responses showed that species-level responses are positively correlated – species that grow fast on average in the understory also grow fast on average in response to gap formation. In other words, there is no tradeoff. However, the joint distribution of individual responses to understory and gap showed a negative correlation – species having individuals that respond most to gaps when previously growing slowly also have individuals that respond least to gaps when previously growing rapidly (e.g., Morus rubra), and vice versa (e.g., Quercus prinus). Conclusions/Significance Because competition occurs at the individual scale, not the species scale, aggregated species-level parameters and correlations hide the species-level differences needed for coexistence. By disaggregating models to the scale at which the interaction occurs we show that individual variation provides insight for species differences. PMID:22393349
Cowley, Benjamin R.; Kaufman, Matthew T.; Butler, Zachary S.; Churchland, Mark M.; Ryu, Stephen I.; Shenoy, Krishna V.; Yu, Byron M.
2014-01-01
Objective Analyzing and interpreting the activity of a heterogeneous population of neurons can be challenging, especially as the number of neurons, experimental trials, and experimental conditions increases. One approach is to extract a set of latent variables that succinctly captures the prominent co-fluctuation patterns across the neural population. A key problem is that the number of latent variables needed to adequately describe the population activity is often greater than three, thereby preventing direct visualization of the latent space. By visualizing a small number of 2-d projections of the latent space or each latent variable individually, it is easy to miss salient features of the population activity. Approach To address this limitation, we developed a Matlab graphical user interface (called DataHigh) that allows the user to quickly and smoothly navigate through a continuum of different 2-d projections of the latent space. We also implemented a suite of additional visualization tools (including playing out population activity timecourses as a movie and displaying summary statistics, such as covariance ellipses and average timecourses) and an optional tool for performing dimensionality reduction. Main results To demonstrate the utility and versatility of DataHigh, we used it to analyze single-trial spike count and single-trial timecourse population activity recorded using a multi-electrode array, as well as trial-averaged population activity recorded using single electrodes. Significance DataHigh was developed to fulfill a need for visualization in exploratory neural data analysis, which can provide intuition that is critical for building scientific hypotheses and models of population activity. PMID:24216250
Directory of Open Access Journals (Sweden)
James S Clark
Full Text Available BACKGROUND: For competing species to coexist, individuals must compete more with others of the same species than with those of other species. Ecologists search for tradeoffs in how species might partition the environment. The negative correlations among competing species that would be indicative of tradeoffs are rarely observed. A recent analysis showed that evidence for partitioning the environment is available when responses are disaggregated to the individual scale, in terms of the covariance structure of responses to environmental variation. That study did not relate that variation to the variables to which individuals were responding. To understand how this pattern of variation is related to niche variables, we analyzed responses to canopy gaps, long viewed as a key variable responsible for species coexistence. METHODOLOGY/PRINCIPAL FINDINGS: A longitudinal intervention analysis of individual responses to experimental canopy gaps with 12 yr of pre-treatment and 8 yr post-treatment responses showed that species-level responses are positively correlated--species that grow fast on average in the understory also grow fast on average in response to gap formation. In other words, there is no tradeoff. However, the joint distribution of individual responses to understory and gap showed a negative correlation--species having individuals that respond most to gaps when previously growing slowly also have individuals that respond least to gaps when previously growing rapidly (e.g., Morus rubra, and vice versa (e.g., Quercus prinus. CONCLUSIONS/SIGNIFICANCE: Because competition occurs at the individual scale, not the species scale, aggregated species-level parameters and correlations hide the species-level differences needed for coexistence. By disaggregating models to the scale at which the interaction occurs we show that individual variation provides insight for species differences.
Retro-regression--another important multivariate regression improvement.
Randić, M
2001-01-01
We review the serious problem associated with instabilities of the coefficients of regression equations, referred to as the MRA (multivariate regression analysis) "nightmare of the first kind". This is manifested when in a stepwise regression a descriptor is included or excluded from a regression. The consequence is an unpredictable change of the coefficients of the descriptors that remain in the regression equation. We follow with consideration of an even more serious problem, referred to as the MRA "nightmare of the second kind", arising when optimal descriptors are selected from a large pool of descriptors. This process typically causes at different steps of the stepwise regression a replacement of several previously used descriptors by new ones. We describe a procedure that resolves these difficulties. The approach is illustrated on boiling points of nonanes which are considered (1) by using an ordered connectivity basis; (2) by using an ordering resulting from application of greedy algorithm; and (3) by using an ordering derived from an exhaustive search for optimal descriptors. A novel variant of multiple regression analysis, called retro-regression (RR), is outlined showing how it resolves the ambiguities associated with both "nightmares" of the first and the second kind of MRA.
Quantile regression theory and applications
Davino, Cristina; Vistocco, Domenico
2013-01-01
A guide to the implementation and interpretation of Quantile Regression models This book explores the theory and numerous applications of quantile regression, offering empirical data analysis as well as the software tools to implement the methods. The main focus of this book is to provide the reader with a comprehensivedescription of the main issues concerning quantile regression; these include basic modeling, geometrical interpretation, estimation and inference for quantile regression, as well as issues on validity of the model, diagnostic tools. Each methodological aspect is explored and
He, Ling Yan; Wang, Tie-Jun; Wang, Chuan
2016-07-11
High-dimensional quantum system provides a higher capacity of quantum channel, which exhibits potential applications in quantum information processing. However, high-dimensional universal quantum logic gates is difficult to achieve directly with only high-dimensional interaction between two quantum systems and requires a large number of two-dimensional gates to build even a small high-dimensional quantum circuits. In this paper, we propose a scheme to implement a general controlled-flip (CF) gate where the high-dimensional single photon serve as the target qudit and stationary qubits work as the control logic qudit, by employing a three-level Λ-type system coupled with a whispering-gallery-mode microresonator. In our scheme, the required number of interaction times between the photon and solid state system reduce greatly compared with the traditional method which decomposes the high-dimensional Hilbert space into 2-dimensional quantum space, and it is on a shorter temporal scale for the experimental realization. Moreover, we discuss the performance and feasibility of our hybrid CF gate, concluding that it can be easily extended to a 2n-dimensional case and it is feasible with current technology.
Directory of Open Access Journals (Sweden)
Ottavia eDipasquale
2015-02-01
Full Text Available High dimensional independent component analysis (ICA, compared to low dimensional ICA, allows performing a detailed parcellation of the resting state networks. The purpose of this study was to give further insight into functional connectivity (FC in Alzheimer’s disease (AD using high dimensional ICA. For this reason, we performed both low and high dimensional ICA analyses of resting state fMRI (rfMRI data of 20 healthy controls and 21 AD patients, focusing on the primarily altered default mode network (DMN and exploring the sensory motor network (SMN. As expected, results obtained at low dimensionality were in line with previous literature. Moreover, high dimensional results allowed us to observe either the presence of within-network disconnections and FC damage confined to some of the resting state sub-networks. Due to the higher sensitivity of the high dimensional ICA analysis, our results suggest that high-dimensional decomposition in sub-networks is very promising to better localize FC alterations in AD and that FC damage is not confined to the default mode network.
A Meta-Heuristic Regression-Based Feature Selection for Predictive Analytics
Directory of Open Access Journals (Sweden)
Bharat Singh
2014-11-01
Full Text Available A high-dimensional feature selection having a very large number of features with an optimal feature subset is an NP-complete problem. Because conventional optimization techniques are unable to tackle large-scale feature selection problems, meta-heuristic algorithms are widely used. In this paper, we propose a particle swarm optimization technique while utilizing regression techniques for feature selection. We then use the selected features to classify the data. Classification accuracy is used as a criterion to evaluate classifier performance, and classification is accomplished through the use of k-nearest neighbour (KNN and Bayesian techniques. Various high dimensional data sets are used to evaluate the usefulness of the proposed approach. Results show that our approach gives better results when compared with other conventional feature selection algorithms.
International Nuclear Information System (INIS)
Liu, W; Sawant, A; Ruan, D
2016-01-01
Purpose: The development of high dimensional imaging systems (e.g. volumetric MRI, CBCT, photogrammetry systems) in image-guided radiotherapy provides important pathways to the ultimate goal of real-time volumetric/surface motion monitoring. This study aims to develop a prediction method for the high dimensional state subject to respiratory motion. Compared to conventional linear dimension reduction based approaches, our method utilizes manifold learning to construct a descriptive feature submanifold, where more efficient and accurate prediction can be performed. Methods: We developed a prediction framework for high-dimensional state subject to respiratory motion. The proposed method performs dimension reduction in a nonlinear setting to permit more descriptive features compared to its linear counterparts (e.g., classic PCA). Specifically, a kernel PCA is used to construct a proper low-dimensional feature manifold, where low-dimensional prediction is performed. A fixed-point iterative pre-image estimation method is applied subsequently to recover the predicted value in the original state space. We evaluated and compared the proposed method with PCA-based method on 200 level-set surfaces reconstructed from surface point clouds captured by the VisionRT system. The prediction accuracy was evaluated with respect to root-mean-squared-error (RMSE) for both 200ms and 600ms lookahead lengths. Results: The proposed method outperformed PCA-based approach with statistically higher prediction accuracy. In one-dimensional feature subspace, our method achieved mean prediction accuracy of 0.86mm and 0.89mm for 200ms and 600ms lookahead lengths respectively, compared to 0.95mm and 1.04mm from PCA-based method. The paired t-tests further demonstrated the statistical significance of the superiority of our method, with p-values of 6.33e-3 and 5.78e-5, respectively. Conclusion: The proposed approach benefits from the descriptiveness of a nonlinear manifold and the prediction
Panel Smooth Transition Regression Models
DEFF Research Database (Denmark)
González, Andrés; Terasvirta, Timo; Dijk, Dick van
We introduce the panel smooth transition regression model. This new model is intended for characterizing heterogeneous panels, allowing the regression coefficients to vary both across individuals and over time. Specifically, heterogeneity is allowed for by assuming that these coefficients are bou...
Testing discontinuities in nonparametric regression
Dai, Wenlin
2017-01-19
In nonparametric regression, it is often needed to detect whether there are jump discontinuities in the mean function. In this paper, we revisit the difference-based method in [13 H.-G. Müller and U. Stadtmüller, Discontinuous versus smooth regression, Ann. Stat. 27 (1999), pp. 299–337. doi: 10.1214/aos/1018031100
Testing discontinuities in nonparametric regression
Dai, Wenlin; Zhou, Yuejin; Tong, Tiejun
2017-01-01
In nonparametric regression, it is often needed to detect whether there are jump discontinuities in the mean function. In this paper, we revisit the difference-based method in [13 H.-G. Müller and U. Stadtmüller, Discontinuous versus smooth regression, Ann. Stat. 27 (1999), pp. 299–337. doi: 10.1214/aos/1018031100
Logistic Regression: Concept and Application
Cokluk, Omay
2010-01-01
The main focus of logistic regression analysis is classification of individuals in different groups. The aim of the present study is to explain basic concepts and processes of binary logistic regression analysis intended to determine the combination of independent variables which best explain the membership in certain groups called dichotomous…
Fungible weights in logistic regression.
Jones, Jeff A; Waller, Niels G
2016-06-01
In this article we develop methods for assessing parameter sensitivity in logistic regression models. To set the stage for this work, we first review Waller's (2008) equations for computing fungible weights in linear regression. Next, we describe 2 methods for computing fungible weights in logistic regression. To demonstrate the utility of these methods, we compute fungible logistic regression weights using data from the Centers for Disease Control and Prevention's (2010) Youth Risk Behavior Surveillance Survey, and we illustrate how these alternate weights can be used to evaluate parameter sensitivity. To make our work accessible to the research community, we provide R code (R Core Team, 2015) that will generate both kinds of fungible logistic regression weights. (PsycINFO Database Record (c) 2016 APA, all rights reserved).
International Nuclear Information System (INIS)
Leng Ling; Zhang Tianyi; Kleinman, Lawrence; Zhu Wei
2007-01-01
Regression analysis, especially the ordinary least squares method which assumes that errors are confined to the dependent variable, has seen a fair share of its applications in aerosol science. The ordinary least squares approach, however, could be problematic due to the fact that atmospheric data often does not lend itself to calling one variable independent and the other dependent. Errors often exist for both measurements. In this work, we examine two regression approaches available to accommodate this situation. They are orthogonal regression and geometric mean regression. Comparisons are made theoretically as well as numerically through an aerosol study examining whether the ratio of organic aerosol to CO would change with age
Energy Technology Data Exchange (ETDEWEB)
Miao, Yan-Gang [Nankai University, School of Physics, Tianjin (China); Chinese Academy of Sciences, State Key Laboratory of Theoretical Physics, Institute of Theoretical Physics, P.O. Box 2735, Beijing (China); CERN, PH-TH Division, Geneva 23 (Switzerland); Xu, Zhen-Ming [Nankai University, School of Physics, Tianjin (China)
2016-04-15
Considering non-Gaussian smeared matter distributions, we investigate the thermodynamic behaviors of the noncommutative high-dimensional Schwarzschild-Tangherlini anti-de Sitter black hole, and we obtain the condition for the existence of extreme black holes. We indicate that the Gaussian smeared matter distribution, which is a special case of non-Gaussian smeared matter distributions, is not applicable for the six- and higher-dimensional black holes due to the hoop conjecture. In particular, the phase transition is analyzed in detail. Moreover, we point out that the Maxwell equal area law holds for the noncommutative black hole whose Hawking temperature is within a specific range, but fails for one whose the Hawking temperature is beyond this range. (orig.)
Miao, Yan-Gang
2016-01-01
Considering non-Gaussian smeared matter distributions, we investigate thermodynamic behaviors of the noncommutative high-dimensional Schwarzschild-Tangherlini anti-de Sitter black hole, and obtain the condition for the existence of extreme black holes. We indicate that the Gaussian smeared matter distribution, which is a special case of non-Gaussian smeared matter distributions, is not applicable for the 6- and higher-dimensional black holes due to the hoop conjecture. In particular, the phase transition is analyzed in detail. Moreover, we point out that the Maxwell equal area law maintains for the noncommutative black hole with the Hawking temperature within a specific range, but fails with the Hawking temperature beyond this range.
Directory of Open Access Journals (Sweden)
F. C. Cooper
2013-04-01
Full Text Available The fluctuation-dissipation theorem (FDT has been proposed as a method of calculating the response of the earth's atmosphere to a forcing. For this problem the high dimensionality of the relevant data sets makes truncation necessary. Here we propose a method of truncation based upon the assumption that the response to a localised forcing is spatially localised, as an alternative to the standard method of choosing a number of the leading empirical orthogonal functions. For systems where this assumption holds, the response to any sufficiently small non-localised forcing may be estimated using a set of truncations that are chosen algorithmically. We test our algorithm using 36 and 72 variable versions of a stochastic Lorenz 95 system of ordinary differential equations. We find that, for long integrations, the bias in the response estimated by the FDT is reduced from ~75% of the true response to ~30%.
Directory of Open Access Journals (Sweden)
Ali Dashti
Full Text Available This paper presents an implementation of the brute-force exact k-Nearest Neighbor Graph (k-NNG construction for ultra-large high-dimensional data cloud. The proposed method uses Graphics Processing Units (GPUs and is scalable with multi-levels of parallelism (between nodes of a cluster, between different GPUs on a single node, and within a GPU. The method is applicable to homogeneous computing clusters with a varying number of nodes and GPUs per node. We achieve a 6-fold speedup in data processing as compared with an optimized method running on a cluster of CPUs and bring a hitherto impossible [Formula: see text]-NNG generation for a dataset of twenty million images with 15 k dimensionality into the realm of practical possibility.
Regression to Causality : Regression-style presentation influences causal attribution
DEFF Research Database (Denmark)
Bordacconi, Mats Joe; Larsen, Martin Vinæs
2014-01-01
of equivalent results presented as either regression models or as a test of two sample means. Our experiment shows that the subjects who were presented with results as estimates from a regression model were more inclined to interpret these results causally. Our experiment implies that scholars using regression...... models – one of the primary vehicles for analyzing statistical results in political science – encourage causal interpretation. Specifically, we demonstrate that presenting observational results in a regression model, rather than as a simple comparison of means, makes causal interpretation of the results...... more likely. Our experiment drew on a sample of 235 university students from three different social science degree programs (political science, sociology and economics), all of whom had received substantial training in statistics. The subjects were asked to compare and evaluate the validity...
McParland, D; Phillips, C M; Brennan, L; Roche, H M; Gormley, I C
2017-12-10
The LIPGENE-SU.VI.MAX study, like many others, recorded high-dimensional continuous phenotypic data and categorical genotypic data. LIPGENE-SU.VI.MAX focuses on the need to account for both phenotypic and genetic factors when studying the metabolic syndrome (MetS), a complex disorder that can lead to higher risk of type 2 diabetes and cardiovascular disease. Interest lies in clustering the LIPGENE-SU.VI.MAX participants into homogeneous groups or sub-phenotypes, by jointly considering their phenotypic and genotypic data, and in determining which variables are discriminatory. A novel latent variable model that elegantly accommodates high dimensional, mixed data is developed to cluster LIPGENE-SU.VI.MAX participants using a Bayesian finite mixture model. A computationally efficient variable selection algorithm is incorporated, estimation is via a Gibbs sampling algorithm and an approximate BIC-MCMC criterion is developed to select the optimal model. Two clusters or sub-phenotypes ('healthy' and 'at risk') are uncovered. A small subset of variables is deemed discriminatory, which notably includes phenotypic and genotypic variables, highlighting the need to jointly consider both factors. Further, 7 years after the LIPGENE-SU.VI.MAX data were collected, participants underwent further analysis to diagnose presence or absence of the MetS. The two uncovered sub-phenotypes strongly correspond to the 7-year follow-up disease classification, highlighting the role of phenotypic and genotypic factors in the MetS and emphasising the potential utility of the clustering approach in early screening. Additionally, the ability of the proposed approach to define the uncertainty in sub-phenotype membership at the participant level is synonymous with the concepts of precision medicine and nutrition. Copyright © 2017 John Wiley & Sons, Ltd. Copyright © 2017 John Wiley & Sons, Ltd.
Wang, Xueyi
2012-02-08
The k-nearest neighbors (k-NN) algorithm is a widely used machine learning method that finds nearest neighbors of a test object in a feature space. We present a new exact k-NN algorithm called kMkNN (k-Means for k-Nearest Neighbors) that uses the k-means clustering and the triangle inequality to accelerate the searching for nearest neighbors in a high dimensional space. The kMkNN algorithm has two stages. In the buildup stage, instead of using complex tree structures such as metric trees, kd-trees, or ball-tree, kMkNN uses a simple k-means clustering method to preprocess the training dataset. In the searching stage, given a query object, kMkNN finds nearest training objects starting from the nearest cluster to the query object and uses the triangle inequality to reduce the distance calculations. Experiments show that the performance of kMkNN is surprisingly good compared to the traditional k-NN algorithm and tree-based k-NN algorithms such as kd-trees and ball-trees. On a collection of 20 datasets with up to 10(6) records and 10(4) dimensions, kMkNN shows a 2-to 80-fold reduction of distance calculations and a 2- to 60-fold speedup over the traditional k-NN algorithm for 16 datasets. Furthermore, kMkNN performs significant better than a kd-tree based k-NN algorithm for all datasets and performs better than a ball-tree based k-NN algorithm for most datasets. The results show that kMkNN is effective for searching nearest neighbors in high dimensional spaces.
Regression analysis with categorized regression calibrated exposure: some interesting findings
Directory of Open Access Journals (Sweden)
Hjartåker Anette
2006-07-01
Full Text Available Abstract Background Regression calibration as a method for handling measurement error is becoming increasingly well-known and used in epidemiologic research. However, the standard version of the method is not appropriate for exposure analyzed on a categorical (e.g. quintile scale, an approach commonly used in epidemiologic studies. A tempting solution could then be to use the predicted continuous exposure obtained through the regression calibration method and treat it as an approximation to the true exposure, that is, include the categorized calibrated exposure in the main regression analysis. Methods We use semi-analytical calculations and simulations to evaluate the performance of the proposed approach compared to the naive approach of not correcting for measurement error, in situations where analyses are performed on quintile scale and when incorporating the original scale into the categorical variables, respectively. We also present analyses of real data, containing measures of folate intake and depression, from the Norwegian Women and Cancer study (NOWAC. Results In cases where extra information is available through replicated measurements and not validation data, regression calibration does not maintain important qualities of the true exposure distribution, thus estimates of variance and percentiles can be severely biased. We show that the outlined approach maintains much, in some cases all, of the misclassification found in the observed exposure. For that reason, regression analysis with the corrected variable included on a categorical scale is still biased. In some cases the corrected estimates are analytically equal to those obtained by the naive approach. Regression calibration is however vastly superior to the naive method when applying the medians of each category in the analysis. Conclusion Regression calibration in its most well-known form is not appropriate for measurement error correction when the exposure is analyzed on a
Bias in regression coefficient estimates upon different treatments of ...
African Journals Online (AJOL)
MS and PW consistently overestimated the population parameter. EM and RI, on the other hand, tended to consistently underestimate the population parameter under non-monotonic pattern. Keywords: Missing data, bias, regression, percent missing, non-normality, missing pattern > East African Journal of Statistics Vol.
Advanced statistics: linear regression, part II: multiple linear regression.
Marill, Keith A
2004-01-01
The applications of simple linear regression in medical research are limited, because in most situations, there are multiple relevant predictor variables. Univariate statistical techniques such as simple linear regression use a single predictor variable, and they often may be mathematically correct but clinically misleading. Multiple linear regression is a mathematical technique used to model the relationship between multiple independent predictor variables and a single dependent outcome variable. It is used in medical research to model observational data, as well as in diagnostic and therapeutic studies in which the outcome is dependent on more than one factor. Although the technique generally is limited to data that can be expressed with a linear function, it benefits from a well-developed mathematical framework that yields unique solutions and exact confidence intervals for regression coefficients. Building on Part I of this series, this article acquaints the reader with some of the important concepts in multiple regression analysis. These include multicollinearity, interaction effects, and an expansion of the discussion of inference testing, leverage, and variable transformations to multivariate models. Examples from the first article in this series are expanded on using a primarily graphic, rather than mathematical, approach. The importance of the relationships among the predictor variables and the dependence of the multivariate model coefficients on the choice of these variables are stressed. Finally, concepts in regression model building are discussed.
Logic regression and its extensions.
Schwender, Holger; Ruczinski, Ingo
2010-01-01
Logic regression is an adaptive classification and regression procedure, initially developed to reveal interacting single nucleotide polymorphisms (SNPs) in genetic association studies. In general, this approach can be used in any setting with binary predictors, when the interaction of these covariates is of primary interest. Logic regression searches for Boolean (logic) combinations of binary variables that best explain the variability in the outcome variable, and thus, reveals variables and interactions that are associated with the response and/or have predictive capabilities. The logic expressions are embedded in a generalized linear regression framework, and thus, logic regression can handle a variety of outcome types, such as binary responses in case-control studies, numeric responses, and time-to-event data. In this chapter, we provide an introduction to the logic regression methodology, list some applications in public health and medicine, and summarize some of the direct extensions and modifications of logic regression that have been proposed in the literature. Copyright © 2010 Elsevier Inc. All rights reserved.
Are increases in cigarette taxation regressive?
Borren, P; Sutton, M
1992-12-01
Using the latest published data from Tobacco Advisory Council surveys, this paper re-evaluates the question of whether or not increases in cigarette taxation are regressive in the United Kingdom. The extended data set shows no evidence of increasing price-elasticity by social class as found in a major previous study. To the contrary, there appears to be no clear pattern in the price responsiveness of smoking behaviour across different social classes. Increases in cigarette taxation, while reducing smoking levels in all groups, fall most heavily on men and women in the lowest social class. Men and women in social class five can expect to pay eight and eleven times more of a tax increase respectively, than their social class one counterparts. Taken as a proportion of relative incomes, the regressive nature of increases in cigarette taxation is even more pronounced.
Abstract Expression Grammar Symbolic Regression
Korns, Michael F.
This chapter examines the use of Abstract Expression Grammars to perform the entire Symbolic Regression process without the use of Genetic Programming per se. The techniques explored produce a symbolic regression engine which has absolutely no bloat, which allows total user control of the search space and output formulas, which is faster, and more accurate than the engines produced in our previous papers using Genetic Programming. The genome is an all vector structure with four chromosomes plus additional epigenetic and constraint vectors, allowing total user control of the search space and the final output formulas. A combination of specialized compiler techniques, genetic algorithms, particle swarm, aged layered populations, plus discrete and continuous differential evolution are used to produce an improved symbolic regression sytem. Nine base test cases, from the literature, are used to test the improvement in speed and accuracy. The improved results indicate that these techniques move us a big step closer toward future industrial strength symbolic regression systems.
Quantile Regression With Measurement Error
Wei, Ying; Carroll, Raymond J.
2009-01-01
. The finite sample performance of the proposed method is investigated in a simulation study, and compared to the standard regression calibration approach. Finally, we apply our methodology to part of the National Collaborative Perinatal Project growth data, a
From Rasch scores to regression
DEFF Research Database (Denmark)
Christensen, Karl Bang
2006-01-01
Rasch models provide a framework for measurement and modelling latent variables. Having measured a latent variable in a population a comparison of groups will often be of interest. For this purpose the use of observed raw scores will often be inadequate because these lack interval scale propertie....... This paper compares two approaches to group comparison: linear regression models using estimated person locations as outcome variables and latent regression models based on the distribution of the score....
Testing Heteroscedasticity in Robust Regression
Czech Academy of Sciences Publication Activity Database
Kalina, Jan
2011-01-01
Roč. 1, č. 4 (2011), s. 25-28 ISSN 2045-3345 Grant - others:GA ČR(CZ) GA402/09/0557 Institutional research plan: CEZ:AV0Z10300504 Keywords : robust regression * heteroscedasticity * regression quantiles * diagnostics Subject RIV: BB - Applied Statistics , Operational Research http://www.researchjournals.co.uk/documents/Vol4/06%20Kalina.pdf
Regression methods for medical research
Tai, Bee Choo
2013-01-01
Regression Methods for Medical Research provides medical researchers with the skills they need to critically read and interpret research using more advanced statistical methods. The statistical requirements of interpreting and publishing in medical journals, together with rapid changes in science and technology, increasingly demands an understanding of more complex and sophisticated analytic procedures.The text explains the application of statistical models to a wide variety of practical medical investigative studies and clinical trials. Regression methods are used to appropriately answer the
DEFF Research Database (Denmark)
Bini, L. M.; Diniz-Filho, J. A. F.; Rangel, T. F. L. V. B.
2009-01-01
A major focus of geographical ecology and macroecology is to understand the causes of spatially structured ecological patterns. However, achieving this understanding can be complicated when using multiple regression, because the relative importance of explanatory variables, as measured by regress...
Logistic regression for dichotomized counts.
Preisser, John S; Das, Kalyan; Benecha, Habtamu; Stamm, John W
2016-12-01
Sometimes there is interest in a dichotomized outcome indicating whether a count variable is positive or zero. Under this scenario, the application of ordinary logistic regression may result in efficiency loss, which is quantifiable under an assumed model for the counts. In such situations, a shared-parameter hurdle model is investigated for more efficient estimation of regression parameters relating to overall effects of covariates on the dichotomous outcome, while handling count data with many zeroes. One model part provides a logistic regression containing marginal log odds ratio effects of primary interest, while an ancillary model part describes the mean count of a Poisson or negative binomial process in terms of nuisance regression parameters. Asymptotic efficiency of the logistic model parameter estimators of the two-part models is evaluated with respect to ordinary logistic regression. Simulations are used to assess the properties of the models with respect to power and Type I error, the latter investigated under both misspecified and correctly specified models. The methods are applied to data from a randomized clinical trial of three toothpaste formulations to prevent incident dental caries in a large population of Scottish schoolchildren. © The Author(s) 2014.
Producing The New Regressive Left
DEFF Research Database (Denmark)
Crone, Christine
members, this thesis investigates a growing political trend and ideological discourse in the Arab world that I have called The New Regressive Left. On the premise that a media outlet can function as a forum for ideology production, the thesis argues that an analysis of this material can help to trace...... the contexture of The New Regressive Left. If the first part of the thesis lays out the theoretical approach and draws the contextual framework, through an exploration of the surrounding Arab media-and ideoscapes, the second part is an analytical investigation of the discourse that permeates the programmes aired...... becomes clear from the analytical chapters is the emergence of the new cross-ideological alliance of The New Regressive Left. This emerging coalition between Shia Muslims, religious minorities, parts of the Arab Left, secular cultural producers, and the remnants of the political,strategic resistance...
General regression and representation model for classification.
Directory of Open Access Journals (Sweden)
Jianjun Qian
Full Text Available Recently, the regularized coding-based classification methods (e.g. SRC and CRC show a great potential for pattern classification. However, most existing coding methods assume that the representation residuals are uncorrelated. In real-world applications, this assumption does not hold. In this paper, we take account of the correlations of the representation residuals and develop a general regression and representation model (GRR for classification. GRR not only has advantages of CRC, but also takes full use of the prior information (e.g. the correlations between representation residuals and representation coefficients and the specific information (weight matrix of image pixels to enhance the classification performance. GRR uses the generalized Tikhonov regularization and K Nearest Neighbors to learn the prior information from the training data. Meanwhile, the specific information is obtained by using an iterative algorithm to update the feature (or image pixel weights of the test sample. With the proposed model as a platform, we design two classifiers: basic general regression and representation classifier (B-GRR and robust general regression and representation classifier (R-GRR. The experimental results demonstrate the performance advantages of proposed methods over state-of-the-art algorithms.
A Matlab program for stepwise regression
Directory of Open Access Journals (Sweden)
Yanhong Qi
2016-03-01
Full Text Available The stepwise linear regression is a multi-variable regression for identifying statistically significant variables in the linear regression equation. In present study, we presented the Matlab program of stepwise regression.
Correlation and simple linear regression.
Zou, Kelly H; Tuncali, Kemal; Silverman, Stuart G
2003-06-01
In this tutorial article, the concepts of correlation and regression are reviewed and demonstrated. The authors review and compare two correlation coefficients, the Pearson correlation coefficient and the Spearman rho, for measuring linear and nonlinear relationships between two continuous variables. In the case of measuring the linear relationship between a predictor and an outcome variable, simple linear regression analysis is conducted. These statistical concepts are illustrated by using a data set from published literature to assess a computed tomography-guided interventional technique. These statistical methods are important for exploring the relationships between variables and can be applied to many radiologic studies.
Regression filter for signal resolution
International Nuclear Information System (INIS)
Matthes, W.
1975-01-01
The problem considered is that of resolving a measured pulse height spectrum of a material mixture, e.g. gamma ray spectrum, Raman spectrum, into a weighed sum of the spectra of the individual constituents. The model on which the analytical formulation is based is described. The problem reduces to that of a multiple linear regression. A stepwise linear regression procedure was constructed. The efficiency of this method was then tested by transforming the procedure in a computer programme which was used to unfold test spectra obtained by mixing some spectra, from a library of arbitrary chosen spectra, and adding a noise component. (U.K.)
Nonparametric Mixture of Regression Models.
Huang, Mian; Li, Runze; Wang, Shaoli
2013-07-01
Motivated by an analysis of US house price index data, we propose nonparametric finite mixture of regression models. We study the identifiability issue of the proposed models, and develop an estimation procedure by employing kernel regression. We further systematically study the sampling properties of the proposed estimators, and establish their asymptotic normality. A modified EM algorithm is proposed to carry out the estimation procedure. We show that our algorithm preserves the ascent property of the EM algorithm in an asymptotic sense. Monte Carlo simulations are conducted to examine the finite sample performance of the proposed estimation procedure. An empirical analysis of the US house price index data is illustrated for the proposed methodology.
Du, Jing; Wang, Jian
2015-11-01
Bessel beams carrying orbital angular momentum (OAM) with helical phase fronts exp(ilφ)(l=0;±1;±2;…), where φ is the azimuthal angle and l corresponds to the topological number, are orthogonal with each other. This feature of Bessel beams provides a new dimension to code/decode data information on the OAM state of light, and the theoretical infinity of topological number enables possible high-dimensional structured light coding/decoding for free-space optical communications. Moreover, Bessel beams are nondiffracting beams having the ability to recover by themselves in the face of obstructions, which is important for free-space optical communications relying on line-of-sight operation. By utilizing the OAM and nondiffracting characteristics of Bessel beams, we experimentally demonstrate 12 m distance obstruction-free optical m-ary coding/decoding using visible Bessel beams in a free-space optical communication system. We also study the bit error rate (BER) performance of hexadecimal and 32-ary coding/decoding based on Bessel beams with different topological numbers. After receiving 500 symbols at the receiver side, a zero BER of hexadecimal coding/decoding is observed when the obstruction is placed along the propagation path of light.
Regis, Rommel G.
2014-02-01
This article develops two new algorithms for constrained expensive black-box optimization that use radial basis function surrogates for the objective and constraint functions. These algorithms are called COBRA and Extended ConstrLMSRBF and, unlike previous surrogate-based approaches, they can be used for high-dimensional problems where all initial points are infeasible. They both follow a two-phase approach where the first phase finds a feasible point while the second phase improves this feasible point. COBRA and Extended ConstrLMSRBF are compared with alternative methods on 20 test problems and on the MOPTA08 benchmark automotive problem (D.R. Jones, Presented at MOPTA 2008), which has 124 decision variables and 68 black-box inequality constraints. The alternatives include a sequential penalty derivative-free algorithm, a direct search method with kriging surrogates, and two multistart methods. Numerical results show that COBRA algorithms are competitive with Extended ConstrLMSRBF and they generally outperform the alternatives on the MOPTA08 problem and most of the test problems.
Schröder, Markus; Meyer, Hans-Dieter
2017-08-01
We propose a Monte Carlo method, "Monte Carlo Potfit," for transforming high-dimensional potential energy surfaces evaluated on discrete grid points into a sum-of-products form, more precisely into a Tucker form. To this end we use a variational ansatz in which we replace numerically exact integrals with Monte Carlo integrals. This largely reduces the numerical cost by avoiding the evaluation of the potential on all grid points and allows a treatment of surfaces up to 15-18 degrees of freedom. We furthermore show that the error made with this ansatz can be controlled and vanishes in certain limits. We present calculations on the potential of HFCO to demonstrate the features of the algorithm. To demonstrate the power of the method, we transformed a 15D potential of the protonated water dimer (Zundel cation) in a sum-of-products form and calculated the ground and lowest 26 vibrationally excited states of the Zundel cation with the multi-configuration time-dependent Hartree method.
Chiu, Mei Choi; Pun, Chi Seng; Wong, Hoi Ying
2017-08-01
Investors interested in the global financial market must analyze financial securities internationally. Making an optimal global investment decision involves processing a huge amount of data for a high-dimensional portfolio. This article investigates the big data challenges of two mean-variance optimal portfolios: continuous-time precommitment and constant-rebalancing strategies. We show that both optimized portfolios implemented with the traditional sample estimates converge to the worst performing portfolio when the portfolio size becomes large. The crux of the problem is the estimation error accumulated from the huge dimension of stock data. We then propose a linear programming optimal (LPO) portfolio framework, which applies a constrained ℓ 1 minimization to the theoretical optimal control to mitigate the risk associated with the dimensionality issue. The resulting portfolio becomes a sparse portfolio that selects stocks with a data-driven procedure and hence offers a stable mean-variance portfolio in practice. When the number of observations becomes large, the LPO portfolio converges to the oracle optimal portfolio, which is free of estimation error, even though the number of stocks grows faster than the number of observations. Our numerical and empirical studies demonstrate the superiority of the proposed approach. © 2017 Society for Risk Analysis.
Cavaglieri, Daniele; Bewley, Thomas
2015-04-01
Implicit/explicit (IMEX) Runge-Kutta (RK) schemes are effective for time-marching ODE systems with both stiff and nonstiff terms on the RHS; such schemes implement an (often A-stable or better) implicit RK scheme for the stiff part of the ODE, which is often linear, and, simultaneously, a (more convenient) explicit RK scheme for the nonstiff part of the ODE, which is often nonlinear. Low-storage RK schemes are especially effective for time-marching high-dimensional ODE discretizations of PDE systems on modern (cache-based) computational hardware, in which memory management is often the most significant computational bottleneck. In this paper, we develop and characterize eight new low-storage implicit/explicit RK schemes which have higher accuracy and better stability properties than the only low-storage implicit/explicit RK scheme available previously, the venerable second-order Crank-Nicolson/Runge-Kutta-Wray (CN/RKW3) algorithm that has dominated the DNS/LES literature for the last 25 years, while requiring similar storage (two, three, or four registers of length N) and comparable floating-point operations per timestep.
Regression Methods for Virtual Metrology of Layer Thickness in Chemical Vapor Deposition
DEFF Research Database (Denmark)
Purwins, Hendrik; Barak, Bernd; Nagi, Ahmed
2014-01-01
The quality of wafer production in semiconductor manufacturing cannot always be monitored by a costly physical measurement. Instead of measuring a quantity directly, it can be predicted by a regression method (Virtual Metrology). In this paper, a survey on regression methods is given to predict...... average Silicon Nitride cap layer thickness for the Plasma Enhanced Chemical Vapor Deposition (PECVD) dual-layer metal passivation stack process. Process and production equipment Fault Detection and Classification (FDC) data are used as predictor variables. Various variable sets are compared: one most...... algorithm, and Support Vector Regression (SVR). On a test set, SVR outperforms the other methods by a large margin, being more robust towards changes in the production conditions. The method performs better on high-dimensional multivariate input data than on the most predictive variables alone. Process...
Cactus: An Introduction to Regression
Hyde, Hartley
2008-01-01
When the author first used "VisiCalc," the author thought it a very useful tool when he had the formulas. But how could he design a spreadsheet if there was no known formula for the quantities he was trying to predict? A few months later, the author relates he learned to use multiple linear regression software and suddenly it all clicked into…
Regression Models for Repairable Systems
Czech Academy of Sciences Publication Activity Database
Novák, Petr
2015-01-01
Roč. 17, č. 4 (2015), s. 963-972 ISSN 1387-5841 Institutional support: RVO:67985556 Keywords : Reliability analysis * Repair models * Regression Subject RIV: BB - Applied Statistics, Operational Research Impact factor: 0.782, year: 2015 http://library.utia.cas.cz/separaty/2015/SI/novak-0450902.pdf
Survival analysis II: Cox regression
Stel, Vianda S.; Dekker, Friedo W.; Tripepi, Giovanni; Zoccali, Carmine; Jager, Kitty J.
2011-01-01
In contrast to the Kaplan-Meier method, Cox proportional hazards regression can provide an effect estimate by quantifying the difference in survival between patient groups and can adjust for confounding effects of other variables. The purpose of this article is to explain the basic concepts of the
Kernel regression with functional response
Ferraty, Frédéric; Laksaci, Ali; Tadj, Amel; Vieu, Philippe
2011-01-01
We consider kernel regression estimate when both the response variable and the explanatory one are functional. The rates of uniform almost complete convergence are stated as function of the small ball probability of the predictor and as function of the entropy of the set on which uniformity is obtained.
Directory of Open Access Journals (Sweden)
Boulesteix Anne-Laure
2009-12-01
Full Text Available Abstract Background In biometric practice, researchers often apply a large number of different methods in a "trial-and-error" strategy to get as much as possible out of their data and, due to publication pressure or pressure from the consulting customer, present only the most favorable results. This strategy may induce a substantial optimistic bias in prediction error estimation, which is quantitatively assessed in the present manuscript. The focus of our work is on class prediction based on high-dimensional data (e.g. microarray data, since such analyses are particularly exposed to this kind of bias. Methods In our study we consider a total of 124 variants of classifiers (possibly including variable selection or tuning steps within a cross-validation evaluation scheme. The classifiers are applied to original and modified real microarray data sets, some of which are obtained by randomly permuting the class labels to mimic non-informative predictors while preserving their correlation structure. Results We assess the minimal misclassification rate over the different variants of classifiers in order to quantify the bias arising when the optimal classifier is selected a posteriori in a data-driven manner. The bias resulting from the parameter tuning (including gene selection parameters as a special case and the bias resulting from the choice of the classification method are examined both separately and jointly. Conclusions The median minimal error rate over the investigated classifiers was as low as 31% and 41% based on permuted uninformative predictors from studies on colon cancer and prostate cancer, respectively. We conclude that the strategy to present only the optimal result is not acceptable because it yields a substantial bias in error rate estimation, and suggest alternative approaches for properly reporting classification accuracy.
Quantile Regression With Measurement Error
Wei, Ying
2009-08-27
Regression quantiles can be substantially biased when the covariates are measured with error. In this paper we propose a new method that produces consistent linear quantile estimation in the presence of covariate measurement error. The method corrects the measurement error induced bias by constructing joint estimating equations that simultaneously hold for all the quantile levels. An iterative EM-type estimation algorithm to obtain the solutions to such joint estimation equations is provided. The finite sample performance of the proposed method is investigated in a simulation study, and compared to the standard regression calibration approach. Finally, we apply our methodology to part of the National Collaborative Perinatal Project growth data, a longitudinal study with an unusual measurement error structure. © 2009 American Statistical Association.
Multivariate and semiparametric kernel regression
Härdle, Wolfgang; Müller, Marlene
1997-01-01
The paper gives an introduction to theory and application of multivariate and semiparametric kernel smoothing. Multivariate nonparametric density estimation is an often used pilot tool for examining the structure of data. Regression smoothing helps in investigating the association between covariates and responses. We concentrate on kernel smoothing using local polynomial fitting which includes the Nadaraya-Watson estimator. Some theory on the asymptotic behavior and bandwidth selection is pro...
Regression algorithm for emotion detection
Berthelon , Franck; Sander , Peter
2013-01-01
International audience; We present here two components of a computational system for emotion detection. PEMs (Personalized Emotion Maps) store links between bodily expressions and emotion values, and are individually calibrated to capture each person's emotion profile. They are an implementation based on aspects of Scherer's theoretical complex system model of emotion~\\cite{scherer00, scherer09}. We also present a regression algorithm that determines a person's emotional feeling from sensor m...
Directional quantile regression in R
Czech Academy of Sciences Publication Activity Database
Boček, Pavel; Šiman, Miroslav
2017-01-01
Roč. 53, č. 3 (2017), s. 480-492 ISSN 0023-5954 R&D Projects: GA ČR GA14-07234S Institutional support: RVO:67985556 Keywords : multivariate quantile * regression quantile * halfspace depth * depth contour Subject RIV: BD - Theory of Information OBOR OECD: Applied mathematics Impact factor: 0.379, year: 2016 http://library.utia.cas.cz/separaty/2017/SI/bocek-0476587.pdf
Gaussian Process Regression Model in Spatial Logistic Regression
Sofro, A.; Oktaviarina, A.
2018-01-01
Spatial analysis has developed very quickly in the last decade. One of the favorite approaches is based on the neighbourhood of the region. Unfortunately, there are some limitations such as difficulty in prediction. Therefore, we offer Gaussian process regression (GPR) to accommodate the issue. In this paper, we will focus on spatial modeling with GPR for binomial data with logit link function. The performance of the model will be investigated. We will discuss the inference of how to estimate the parameters and hyper-parameters and to predict as well. Furthermore, simulation studies will be explained in the last section.
Directory of Open Access Journals (Sweden)
Raftery Adrian E
2009-02-01
Full Text Available Abstract Background Microarray technology is increasingly used to identify potential biomarkers for cancer prognostics and diagnostics. Previously, we have developed the iterative Bayesian Model Averaging (BMA algorithm for use in classification. Here, we extend the iterative BMA algorithm for application to survival analysis on high-dimensional microarray data. The main goal in applying survival analysis to microarray data is to determine a highly predictive model of patients' time to event (such as death, relapse, or metastasis using a small number of selected genes. Our multivariate procedure combines the effectiveness of multiple contending models by calculating the weighted average of their posterior probability distributions. Our results demonstrate that our iterative BMA algorithm for survival analysis achieves high prediction accuracy while consistently selecting a small and cost-effective number of predictor genes. Results We applied the iterative BMA algorithm to two cancer datasets: breast cancer and diffuse large B-cell lymphoma (DLBCL data. On the breast cancer data, the algorithm selected a total of 15 predictor genes across 84 contending models from the training data. The maximum likelihood estimates of the selected genes and the posterior probabilities of the selected models from the training data were used to divide patients in the test (or validation dataset into high- and low-risk categories. Using the genes and models determined from the training data, we assigned patients from the test data into highly distinct risk groups (as indicated by a p-value of 7.26e-05 from the log-rank test. Moreover, we achieved comparable results using only the 5 top selected genes with 100% posterior probabilities. On the DLBCL data, our iterative BMA procedure selected a total of 25 genes across 3 contending models from the training data. Once again, we assigned the patients in the validation set to significantly distinct risk groups (p
Spontaneous regression of pulmonary bullae
International Nuclear Information System (INIS)
Satoh, H.; Ishikawa, H.; Ohtsuka, M.; Sekizawa, K.
2002-01-01
The natural history of pulmonary bullae is often characterized by gradual, progressive enlargement. Spontaneous regression of bullae is, however, very rare. We report a case in which complete resolution of pulmonary bullae in the left upper lung occurred spontaneously. The management of pulmonary bullae is occasionally made difficult because of gradual progressive enlargement associated with abnormal pulmonary function. Some patients have multiple bulla in both lungs and/or have a history of pulmonary emphysema. Others have a giant bulla without emphysematous change in the lungs. Our present case had treated lung cancer with no evidence of local recurrence. He had no emphysematous change in lung function test and had no complaints, although the high resolution CT scan shows evidence of underlying minimal changes of emphysema. Ortin and Gurney presented three cases of spontaneous reduction in size of bulla. Interestingly, one of them had a marked decrease in the size of a bulla in association with thickening of the wall of the bulla, which was observed in our patient. This case we describe is of interest, not only because of the rarity with which regression of pulmonary bulla has been reported in the literature, but also because of the spontaneous improvements in the radiological picture in the absence of overt infection or tumor. Copyright (2002) Blackwell Science Pty Ltd
Quantum algorithm for linear regression
Wang, Guoming
2017-07-01
We present a quantum algorithm for fitting a linear regression model to a given data set using the least-squares approach. Differently from previous algorithms which yield a quantum state encoding the optimal parameters, our algorithm outputs these numbers in the classical form. So by running it once, one completely determines the fitted model and then can use it to make predictions on new data at little cost. Moreover, our algorithm works in the standard oracle model, and can handle data sets with nonsparse design matrices. It runs in time poly( log2(N ) ,d ,κ ,1 /ɛ ) , where N is the size of the data set, d is the number of adjustable parameters, κ is the condition number of the design matrix, and ɛ is the desired precision in the output. We also show that the polynomial dependence on d and κ is necessary. Thus, our algorithm cannot be significantly improved. Furthermore, we also give a quantum algorithm that estimates the quality of the least-squares fit (without computing its parameters explicitly). This algorithm runs faster than the one for finding this fit, and can be used to check whether the given data set qualifies for linear regression in the first place.
Interpretation of commonly used statistical regression models.
Kasza, Jessica; Wolfe, Rory
2014-01-01
A review of some regression models commonly used in respiratory health applications is provided in this article. Simple linear regression, multiple linear regression, logistic regression and ordinal logistic regression are considered. The focus of this article is on the interpretation of the regression coefficients of each model, which are illustrated through the application of these models to a respiratory health research study. © 2013 The Authors. Respirology © 2013 Asian Pacific Society of Respirology.
Mixed kernel function support vector regression for global sensitivity analysis
Cheng, Kai; Lu, Zhenzhou; Wei, Yuhao; Shi, Yan; Zhou, Yicheng
2017-11-01
Global sensitivity analysis (GSA) plays an important role in exploring the respective effects of input variables on an assigned output response. Amongst the wide sensitivity analyses in literature, the Sobol indices have attracted much attention since they can provide accurate information for most models. In this paper, a mixed kernel function (MKF) based support vector regression (SVR) model is employed to evaluate the Sobol indices at low computational cost. By the proposed derivation, the estimation of the Sobol indices can be obtained by post-processing the coefficients of the SVR meta-model. The MKF is constituted by the orthogonal polynomials kernel function and Gaussian radial basis kernel function, thus the MKF possesses both the global characteristic advantage of the polynomials kernel function and the local characteristic advantage of the Gaussian radial basis kernel function. The proposed approach is suitable for high-dimensional and non-linear problems. Performance of the proposed approach is validated by various analytical functions and compared with the popular polynomial chaos expansion (PCE). Results demonstrate that the proposed approach is an efficient method for global sensitivity analysis.
Van Belle, Vanya; Pelckmans, Kristiaan; Van Huffel, Sabine; Suykens, Johan A K
2011-10-01
To compare and evaluate ranking, regression and combined machine learning approaches for the analysis of survival data. The literature describes two approaches based on support vector machines to deal with censored observations. In the first approach the key idea is to rephrase the task as a ranking problem via the concordance index, a problem which can be solved efficiently in a context of structural risk minimization and convex optimization techniques. In a second approach, one uses a regression approach, dealing with censoring by means of inequality constraints. The goal of this paper is then twofold: (i) introducing a new model combining the ranking and regression strategy, which retains the link with existing survival models such as the proportional hazards model via transformation models; and (ii) comparison of the three techniques on 6 clinical and 3 high-dimensional datasets and discussing the relevance of these techniques over classical approaches fur survival data. We compare svm-based survival models based on ranking constraints, based on regression constraints and models based on both ranking and regression constraints. The performance of the models is compared by means of three different measures: (i) the concordance index, measuring the model's discriminating ability; (ii) the logrank test statistic, indicating whether patients with a prognostic index lower than the median prognostic index have a significant different survival than patients with a prognostic index higher than the median; and (iii) the hazard ratio after normalization to restrict the prognostic index between 0 and 1. Our results indicate a significantly better performance for models including regression constraints above models only based on ranking constraints. This work gives empirical evidence that svm-based models using regression constraints perform significantly better than svm-based models based on ranking constraints. Our experiments show a comparable performance for methods
Prediction, Regression and Critical Realism
DEFF Research Database (Denmark)
Næss, Petter
2004-01-01
This paper considers the possibility of prediction in land use planning, and the use of statistical research methods in analyses of relationships between urban form and travel behaviour. Influential writers within the tradition of critical realism reject the possibility of predicting social...... phenomena. This position is fundamentally problematic to public planning. Without at least some ability to predict the likely consequences of different proposals, the justification for public sector intervention into market mechanisms will be frail. Statistical methods like regression analyses are commonly...... seen as necessary in order to identify aggregate level effects of policy measures, but are questioned by many advocates of critical realist ontology. Using research into the relationship between urban structure and travel as an example, the paper discusses relevant research methods and the kinds...
On Weighted Support Vector Regression
DEFF Research Database (Denmark)
Han, Xixuan; Clemmensen, Line Katrine Harder
2014-01-01
We propose a new type of weighted support vector regression (SVR), motivated by modeling local dependencies in time and space in prediction of house prices. The classic weights of the weighted SVR are added to the slack variables in the objective function (OF‐weights). This procedure directly...... shrinks the coefficient of each observation in the estimated functions; thus, it is widely used for minimizing influence of outliers. We propose to additionally add weights to the slack variables in the constraints (CF‐weights) and call the combination of weights the doubly weighted SVR. We illustrate...... the differences and similarities of the two types of weights by demonstrating the connection between the Least Absolute Shrinkage and Selection Operator (LASSO) and the SVR. We show that an SVR problem can be transformed to a LASSO problem plus a linear constraint and a box constraint. We demonstrate...
Bruce Bagwell, C
2018-01-01
This chapter outlines how to approach the complex tasks associated with designing models for high-dimensional cytometry data. Unlike gating approaches, modeling lends itself to automation and accounts for measurement overlap among cellular populations. Designing these models is now easier because of a new technique called high-definition t-SNE mapping. Nontrivial examples are provided that serve as a guide to create models that are consistent with data.
Credit Scoring Problem Based on Regression Analysis
Khassawneh, Bashar Suhil Jad Allah
2014-01-01
ABSTRACT: This thesis provides an explanatory introduction to the regression models of data mining and contains basic definitions of key terms in the linear, multiple and logistic regression models. Meanwhile, the aim of this study is to illustrate fitting models for the credit scoring problem using simple linear, multiple linear and logistic regression models and also to analyze the found model functions by statistical tools. Keywords: Data mining, linear regression, logistic regression....
Regularized Label Relaxation Linear Regression.
Fang, Xiaozhao; Xu, Yong; Li, Xuelong; Lai, Zhihui; Wong, Wai Keung; Fang, Bingwu
2018-04-01
Linear regression (LR) and some of its variants have been widely used for classification problems. Most of these methods assume that during the learning phase, the training samples can be exactly transformed into a strict binary label matrix, which has too little freedom to fit the labels adequately. To address this problem, in this paper, we propose a novel regularized label relaxation LR method, which has the following notable characteristics. First, the proposed method relaxes the strict binary label matrix into a slack variable matrix by introducing a nonnegative label relaxation matrix into LR, which provides more freedom to fit the labels and simultaneously enlarges the margins between different classes as much as possible. Second, the proposed method constructs the class compactness graph based on manifold learning and uses it as the regularization item to avoid the problem of overfitting. The class compactness graph is used to ensure that the samples sharing the same labels can be kept close after they are transformed. Two different algorithms, which are, respectively, based on -norm and -norm loss functions are devised. These two algorithms have compact closed-form solutions in each iteration so that they are easily implemented. Extensive experiments show that these two algorithms outperform the state-of-the-art algorithms in terms of the classification accuracy and running time.
Hybrid foundry patterns of bevel gears
Directory of Open Access Journals (Sweden)
Budzik G.
2007-01-01
Full Text Available Possibilities of making hybrid foundry patterns of bevel gears for investment casting process are presented. Rapid prototyping of gears with complex tooth forms is possible with the use of modern methods. One of such methods is the stereo-lithography, where a pattern is obtained as a result of resin curing with laser beam. Patterns of that type are applicable in precision casting. Removing of stereo-lithographic pattern from foundry mould requires use of high temperatures. Resin burning would generate significant amounts of harmful gases. In case of a solid stereo-lithographic pattern, the pressure created during gas burning may cause the mould to crack. A gas volume reduction may be achieved by using patterns of honeycomb structure. However, this technique causes a significant worsening of accuracy of stereo-lithographic patterns in respect of their dimensions and shape. In cooperation with WSK PZL Rzeszów, the Machine Design Department of Rzeszow University of Technology carried out research on the design of hybrid stereo-lithographic patterns. Hybrid pattern consists of a section made by stereo-lithographic process and a section made of casting wax. The latter material is used for stereo-lithographic pattern filling and for mould gating system. The hybrid pattern process consists of two stages: wax melting and then the burn-out of stereolithographic pattern. Use of hybrid patterns reduces the costs of production of stereolithographic patterns. High dimensional accuracy remains preserved in this process.
Principal component regression analysis with SPSS.
Liu, R X; Kuang, J; Gong, Q; Hou, X L
2003-06-01
The paper introduces all indices of multicollinearity diagnoses, the basic principle of principal component regression and determination of 'best' equation method. The paper uses an example to describe how to do principal component regression analysis with SPSS 10.0: including all calculating processes of the principal component regression and all operations of linear regression, factor analysis, descriptives, compute variable and bivariate correlations procedures in SPSS 10.0. The principal component regression analysis can be used to overcome disturbance of the multicollinearity. The simplified, speeded up and accurate statistical effect is reached through the principal component regression analysis with SPSS.
Unbalanced Regressions and the Predictive Equation
DEFF Research Database (Denmark)
Osterrieder, Daniela; Ventosa-Santaulària, Daniel; Vera-Valdés, J. Eduardo
Predictive return regressions with persistent regressors are typically plagued by (asymptotically) biased/inconsistent estimates of the slope, non-standard or potentially even spurious statistical inference, and regression unbalancedness. We alleviate the problem of unbalancedness in the theoreti......Predictive return regressions with persistent regressors are typically plagued by (asymptotically) biased/inconsistent estimates of the slope, non-standard or potentially even spurious statistical inference, and regression unbalancedness. We alleviate the problem of unbalancedness...
Semiparametric regression during 2003–2007
Ruppert, David; Wand, M.P.; Carroll, Raymond J.
2009-01-01
Semiparametric regression is a fusion between parametric regression and nonparametric regression that integrates low-rank penalized splines, mixed model and hierarchical Bayesian methodology – thus allowing more streamlined handling of longitudinal and spatial correlation. We review progress in the field over the five-year period between 2003 and 2007. We find semiparametric regression to be a vibrant field with substantial involvement and activity, continual enhancement and widespread application.
Gaussian process regression analysis for functional data
Shi, Jian Qing
2011-01-01
Gaussian Process Regression Analysis for Functional Data presents nonparametric statistical methods for functional regression analysis, specifically the methods based on a Gaussian process prior in a functional space. The authors focus on problems involving functional response variables and mixed covariates of functional and scalar variables.Covering the basics of Gaussian process regression, the first several chapters discuss functional data analysis, theoretical aspects based on the asymptotic properties of Gaussian process regression models, and new methodological developments for high dime
Frequent Pattern Mining Algorithms for Data Clustering
DEFF Research Database (Denmark)
Zimek, Arthur; Assent, Ira; Vreeken, Jilles
2014-01-01
that frequent pattern mining was at the cradle of subspace clustering—yet, it quickly developed into an independent research field. In this chapter, we discuss how frequent pattern mining algorithms have been extended and generalized towards the discovery of local clusters in high-dimensional data......Discovering clusters in subspaces, or subspace clustering and related clustering paradigms, is a research field where we find many frequent pattern mining related influences. In fact, as the first algorithms for subspace clustering were based on frequent pattern mining algorithms, it is fair to say....... In particular, we discuss several example algorithms for subspace clustering or projected clustering as well as point out recent research questions and open topics in this area relevant to researchers in either clustering or pattern mining...
Regression Analysis by Example. 5th Edition
Chatterjee, Samprit; Hadi, Ali S.
2012-01-01
Regression analysis is a conceptually simple method for investigating relationships among variables. Carrying out a successful application of regression analysis, however, requires a balance of theoretical results, empirical rules, and subjective judgment. "Regression Analysis by Example, Fifth Edition" has been expanded and thoroughly…
Standards for Standardized Logistic Regression Coefficients
Menard, Scott
2011-01-01
Standardized coefficients in logistic regression analysis have the same utility as standardized coefficients in linear regression analysis. Although there has been no consensus on the best way to construct standardized logistic regression coefficients, there is now sufficient evidence to suggest a single best approach to the construction of a…
A Seemingly Unrelated Poisson Regression Model
King, Gary
1989-01-01
This article introduces a new estimator for the analysis of two contemporaneously correlated endogenous event count variables. This seemingly unrelated Poisson regression model (SUPREME) estimator combines the efficiencies created by single equation Poisson regression model estimators and insights from "seemingly unrelated" linear regression models.
Regression with Sparse Approximations of Data
DEFF Research Database (Denmark)
Noorzad, Pardis; Sturm, Bob L.
2012-01-01
We propose sparse approximation weighted regression (SPARROW), a method for local estimation of the regression function that uses sparse approximation with a dictionary of measurements. SPARROW estimates the regression function at a point with a linear combination of a few regressands selected...... by a sparse approximation of the point in terms of the regressors. We show SPARROW can be considered a variant of \\(k\\)-nearest neighbors regression (\\(k\\)-NNR), and more generally, local polynomial kernel regression. Unlike \\(k\\)-NNR, however, SPARROW can adapt the number of regressors to use based...
Spontaneous regression of a congenital melanocytic nevus
Directory of Open Access Journals (Sweden)
Amiya Kumar Nath
2011-01-01
Full Text Available Congenital melanocytic nevus (CMN may rarely regress which may also be associated with a halo or vitiligo. We describe a 10-year-old girl who presented with CMN on the left leg since birth, which recently started to regress spontaneously with associated depigmentation in the lesion and at a distant site. Dermoscopy performed at different sites of the regressing lesion demonstrated loss of epidermal pigments first followed by loss of dermal pigments. Histopathology and Masson-Fontana stain demonstrated lymphocytic infiltration and loss of pigment production in the regressing area. Immunohistochemistry staining (S100 and HMB-45, however, showed that nevus cells were present in the regressing areas.
Deep ensemble learning of sparse regression models for brain disease diagnosis.
Suk, Heung-Il; Lee, Seong-Whan; Shen, Dinggang
2017-04-01
Recent studies on brain imaging analysis witnessed the core roles of machine learning techniques in computer-assisted intervention for brain disease diagnosis. Of various machine-learning techniques, sparse regression models have proved their effectiveness in handling high-dimensional data but with a small number of training samples, especially in medical problems. In the meantime, deep learning methods have been making great successes by outperforming the state-of-the-art performances in various applications. In this paper, we propose a novel framework that combines the two conceptually different methods of sparse regression and deep learning for Alzheimer's disease/mild cognitive impairment diagnosis and prognosis. Specifically, we first train multiple sparse regression models, each of which is trained with different values of a regularization control parameter. Thus, our multiple sparse regression models potentially select different feature subsets from the original feature set; thereby they have different powers to predict the response values, i.e., clinical label and clinical scores in our work. By regarding the response values from our sparse regression models as target-level representations, we then build a deep convolutional neural network for clinical decision making, which thus we call 'Deep Ensemble Sparse Regression Network.' To our best knowledge, this is the first work that combines sparse regression models with deep neural network. In our experiments with the ADNI cohort, we validated the effectiveness of the proposed method by achieving the highest diagnostic accuracies in three classification tasks. We also rigorously analyzed our results and compared with the previous studies on the ADNI cohort in the literature. Copyright © 2017 Elsevier B.V. All rights reserved.
Sun, Hokeun; Wang, Ya; Chen, Yong; Li, Yun; Wang, Shuang
2017-06-15
DNA methylation plays an important role in many biological processes and cancer progression. Recent studies have found that there are also differences in methylation variations in different groups other than differences in methylation means. Several methods have been developed that consider both mean and variance signals in order to improve statistical power of detecting differentially methylated loci. Moreover, as methylation levels of neighboring CpG sites are known to be strongly correlated, methods that incorporate correlations have also been developed. We previously developed a network-based penalized logistic regression for correlated methylation data, but only focusing on mean signals. We have also developed a generalized exponential tilt model that captures both mean and variance signals but only examining one CpG site at a time. In this article, we proposed a penalized Exponential Tilt Model (pETM) using network-based regularization that captures both mean and variance signals in DNA methylation data and takes into account the correlations among nearby CpG sites. By combining the strength of the two models we previously developed, we demonstrated the superior power and better performance of the pETM method through simulations and the applications to the 450K DNA methylation array data of the four breast invasive carcinoma cancer subtypes from The Cancer Genome Atlas (TCGA) project. The developed pETM method identifies many cancer-related methylation loci that were missed by our previously developed method that considers correlations among nearby methylation loci but not variance signals. The R package 'pETM' is publicly available through CRAN: http://cran.r-project.org . sw2206@columbia.edu. Supplementary data are available at Bioinformatics online. © The Author 2017. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com
Wolters, Mark A; Dean, C B
2017-01-01
Remote sensing images from Earth-orbiting satellites are a potentially rich data source for monitoring and cataloguing atmospheric health hazards that cover large geographic regions. A method is proposed for classifying such images into hazard and nonhazard regions using the autologistic regression model, which may be viewed as a spatial extension of logistic regression. The method includes a novel and simple approach to parameter estimation that makes it well suited to handling the large and high-dimensional datasets arising from satellite-borne instruments. The methodology is demonstrated on both simulated images and a real application to the identification of forest fire smoke.
Directory of Open Access Journals (Sweden)
Shouheng Tuo
Full Text Available Harmony Search (HS and Teaching-Learning-Based Optimization (TLBO as new swarm intelligent optimization algorithms have received much attention in recent years. Both of them have shown outstanding performance for solving NP-Hard optimization problems. However, they also suffer dramatic performance degradation for some complex high-dimensional optimization problems. Through a lot of experiments, we find that the HS and TLBO have strong complementarity each other. The HS has strong global exploration power but low convergence speed. Reversely, the TLBO has much fast convergence speed but it is easily trapped into local search. In this work, we propose a hybrid search algorithm named HSTLBO that merges the two algorithms together for synergistically solving complex optimization problems using a self-adaptive selection strategy. In the HSTLBO, both HS and TLBO are modified with the aim of balancing the global exploration and exploitation abilities, where the HS aims mainly to explore the unknown regions and the TLBO aims to rapidly exploit high-precision solutions in the known regions. Our experimental results demonstrate better performance and faster speed than five state-of-the-art HS variants and show better exploration power than five good TLBO variants with similar run time, which illustrates that our method is promising in solving complex high-dimensional optimization problems. The experiment on portfolio optimization problems also demonstrate that the HSTLBO is effective in solving complex read-world application.
Travnik, Jaden B; Pilarski, Patrick M
2017-07-01
Prosthetic devices have advanced in their capabilities and in the number and type of sensors included in their design. As the space of sensorimotor data available to a conventional or machine learning prosthetic control system increases in dimensionality and complexity, it becomes increasingly important that this data be represented in a useful and computationally efficient way. Well structured sensory data allows prosthetic control systems to make informed, appropriate control decisions. In this study, we explore the impact that increased sensorimotor information has on current machine learning prosthetic control approaches. Specifically, we examine the effect that high-dimensional sensory data has on the computation time and prediction performance of a true-online temporal-difference learning prediction method as embedded within a resource-limited upper-limb prosthesis control system. We present results comparing tile coding, the dominant linear representation for real-time prosthetic machine learning, with a newly proposed modification to Kanerva coding that we call selective Kanerva coding. In addition to showing promising results for selective Kanerva coding, our results confirm potential limitations to tile coding as the number of sensory input dimensions increases. To our knowledge, this study is the first to explicitly examine representations for realtime machine learning prosthetic devices in general terms. This work therefore provides an important step towards forming an efficient prosthesis-eye view of the world, wherein prompt and accurate representations of high-dimensional data may be provided to machine learning control systems within artificial limbs and other assistive rehabilitation technologies.
Simple and multiple linear regression: sample size considerations.
Hanley, James A
2016-11-01
The suggested "two subjects per variable" (2SPV) rule of thumb in the Austin and Steyerberg article is a chance to bring out some long-established and quite intuitive sample size considerations for both simple and multiple linear regression. This article distinguishes two of the major uses of regression models that imply very different sample size considerations, neither served well by the 2SPV rule. The first is etiological research, which contrasts mean Y levels at differing "exposure" (X) values and thus tends to focus on a single regression coefficient, possibly adjusted for confounders. The second research genre guides clinical practice. It addresses Y levels for individuals with different covariate patterns or "profiles." It focuses on the profile-specific (mean) Y levels themselves, estimating them via linear compounds of regression coefficients and covariates. By drawing on long-established closed-form variance formulae that lie beneath the standard errors in multiple regression, and by rearranging them for heuristic purposes, one arrives at quite intuitive sample size considerations for both research genres. Copyright Â© 2016 Elsevier Inc. All rights reserved.
On the degrees of freedom of reduced-rank estimators in multivariate regression.
Mukherjee, A; Chen, K; Wang, N; Zhu, J
We study the effective degrees of freedom of a general class of reduced-rank estimators for multivariate regression in the framework of Stein's unbiased risk estimation. A finite-sample exact unbiased estimator is derived that admits a closed-form expression in terms of the thresholded singular values of the least-squares solution and hence is readily computable. The results continue to hold in the high-dimensional setting where both the predictor and the response dimensions may be larger than the sample size. The derived analytical form facilitates the investigation of theoretical properties and provides new insights into the empirical behaviour of the degrees of freedom. In particular, we examine the differences and connections between the proposed estimator and a commonly-used naive estimator. The use of the proposed estimator leads to efficient and accurate prediction risk estimation and model selection, as demonstrated by simulation studies and a data example.
Evaluation of Linear Regression Simultaneous Myoelectric Control Using Intramuscular EMG.
Smith, Lauren H; Kuiken, Todd A; Hargrove, Levi J
2016-04-01
The objective of this study was to evaluate the ability of linear regression models to decode patterns of muscle coactivation from intramuscular electromyogram (EMG) and provide simultaneous myoelectric control of a virtual 3-DOF wrist/hand system. Performance was compared to the simultaneous control of conventional myoelectric prosthesis methods using intramuscular EMG (parallel dual-site control)-an approach that requires users to independently modulate individual muscles in the residual limb, which can be challenging for amputees. Linear regression control was evaluated in eight able-bodied subjects during a virtual Fitts' law task and was compared to performance of eight subjects using parallel dual-site control. An offline analysis also evaluated how different types of training data affected prediction accuracy of linear regression control. The two control systems demonstrated similar overall performance; however, the linear regression method demonstrated improved performance for targets requiring use of all three DOFs, whereas parallel dual-site control demonstrated improved performance for targets that required use of only one DOF. Subjects using linear regression control could more easily activate multiple DOFs simultaneously, but often experienced unintended movements when trying to isolate individual DOFs. Offline analyses also suggested that the method used to train linear regression systems may influence controllability. Linear regression myoelectric control using intramuscular EMG provided an alternative to parallel dual-site control for 3-DOF simultaneous control at the wrist and hand. The two methods demonstrated different strengths in controllability, highlighting the tradeoff between providing simultaneous control and the ability to isolate individual DOFs when desired.
Applied regression analysis a research tool
Pantula, Sastry; Dickey, David
1998-01-01
Least squares estimation, when used appropriately, is a powerful research tool. A deeper understanding of the regression concepts is essential for achieving optimal benefits from a least squares analysis. This book builds on the fundamentals of statistical methods and provides appropriate concepts that will allow a scientist to use least squares as an effective research tool. Applied Regression Analysis is aimed at the scientist who wishes to gain a working knowledge of regression analysis. The basic purpose of this book is to develop an understanding of least squares and related statistical methods without becoming excessively mathematical. It is the outgrowth of more than 30 years of consulting experience with scientists and many years of teaching an applied regression course to graduate students. Applied Regression Analysis serves as an excellent text for a service course on regression for non-statisticians and as a reference for researchers. It also provides a bridge between a two-semester introduction to...
Regression models of reactor diagnostic signals
International Nuclear Information System (INIS)
Vavrin, J.
1989-01-01
The application is described of an autoregression model as the simplest regression model of diagnostic signals in experimental analysis of diagnostic systems, in in-service monitoring of normal and anomalous conditions and their diagnostics. The method of diagnostics is described using a regression type diagnostic data base and regression spectral diagnostics. The diagnostics is described of neutron noise signals from anomalous modes in the experimental fuel assembly of a reactor. (author)
Bulcock, J. W.
The problem of model estimation when the data are collinear was examined. Though the ridge regression (RR) outperforms ordinary least squares (OLS) regression in the presence of acute multicollinearity, it is not a problem free technique for reducing the variance of the estimates. It is a stochastic procedure when it should be nonstochastic and it…
Multivariate Regression Analysis and Slaughter Livestock,
AGRICULTURE, *ECONOMICS), (*MEAT, PRODUCTION), MULTIVARIATE ANALYSIS, REGRESSION ANALYSIS , ANIMALS, WEIGHT, COSTS, PREDICTIONS, STABILITY, MATHEMATICAL MODELS, STORAGE, BEEF, PORK, FOOD, STATISTICAL DATA, ACCURACY
[From clinical judgment to linear regression model.
Palacios-Cruz, Lino; Pérez, Marcela; Rivas-Ruiz, Rodolfo; Talavera, Juan O
2013-01-01
When we think about mathematical models, such as linear regression model, we think that these terms are only used by those engaged in research, a notion that is far from the truth. Legendre described the first mathematical model in 1805, and Galton introduced the formal term in 1886. Linear regression is one of the most commonly used regression models in clinical practice. It is useful to predict or show the relationship between two or more variables as long as the dependent variable is quantitative and has normal distribution. Stated in another way, the regression is used to predict a measure based on the knowledge of at least one other variable. Linear regression has as it's first objective to determine the slope or inclination of the regression line: Y = a + bx, where "a" is the intercept or regression constant and it is equivalent to "Y" value when "X" equals 0 and "b" (also called slope) indicates the increase or decrease that occurs when the variable "x" increases or decreases in one unit. In the regression line, "b" is called regression coefficient. The coefficient of determination (R 2 ) indicates the importance of independent variables in the outcome.
Time series regression model for infectious disease and weather.
Imai, Chisato; Armstrong, Ben; Chalabi, Zaid; Mangtani, Punam; Hashizume, Masahiro
2015-10-01
Time series regression has been developed and long used to evaluate the short-term associations of air pollution and weather with mortality or morbidity of non-infectious diseases. The application of the regression approaches from this tradition to infectious diseases, however, is less well explored and raises some new issues. We discuss and present potential solutions for five issues often arising in such analyses: changes in immune population, strong autocorrelations, a wide range of plausible lag structures and association patterns, seasonality adjustments, and large overdispersion. The potential approaches are illustrated with datasets of cholera cases and rainfall from Bangladesh and influenza and temperature in Tokyo. Though this article focuses on the application of the traditional time series regression to infectious diseases and weather factors, we also briefly introduce alternative approaches, including mathematical modeling, wavelet analysis, and autoregressive integrated moving average (ARIMA) models. Modifications proposed to standard time series regression practice include using sums of past cases as proxies for the immune population, and using the logarithm of lagged disease counts to control autocorrelation due to true contagion, both of which are motivated from "susceptible-infectious-recovered" (SIR) models. The complexity of lag structures and association patterns can often be informed by biological mechanisms and explored by using distributed lag non-linear models. For overdispersed models, alternative distribution models such as quasi-Poisson and negative binomial should be considered. Time series regression can be used to investigate dependence of infectious diseases on weather, but may need modifying to allow for features specific to this context. Copyright © 2015 The Authors. Published by Elsevier Inc. All rights reserved.
Regression modeling methods, theory, and computation with SAS
Panik, Michael
2009-01-01
Regression Modeling: Methods, Theory, and Computation with SAS provides an introduction to a diverse assortment of regression techniques using SAS to solve a wide variety of regression problems. The author fully documents the SAS programs and thoroughly explains the output produced by the programs.The text presents the popular ordinary least squares (OLS) approach before introducing many alternative regression methods. It covers nonparametric regression, logistic regression (including Poisson regression), Bayesian regression, robust regression, fuzzy regression, random coefficients regression,
RAWS II: A MULTIPLE REGRESSION ANALYSIS PROGRAM,
This memorandum gives instructions for the use and operation of a revised version of RAWS, a multiple regression analysis program. The program...of preprocessed data, the directed retention of variable, listing of the matrix of the normal equations and its inverse, and the bypassing of the regression analysis to provide the input variable statistics only. (Author)
A Simulation Investigation of Principal Component Regression.
Allen, David E.
Regression analysis is one of the more common analytic tools used by researchers. However, multicollinearity between the predictor variables can cause problems in using the results of regression analyses. Problems associated with multicollinearity include entanglement of relative influences of variables due to reduced precision of estimation,…
Hierarchical regression analysis in structural Equation Modeling
de Jong, P.F.
1999-01-01
In a hierarchical or fixed-order regression analysis, the independent variables are entered into the regression equation in a prespecified order. Such an analysis is often performed when the extra amount of variance accounted for in a dependent variable by a specific independent variable is the main
Categorical regression dose-response modeling
The goal of this training is to provide participants with training on the use of the U.S. EPA’s Categorical Regression soft¬ware (CatReg) and its application to risk assessment. Categorical regression fits mathematical models to toxicity data that have been assigned ord...
Variable importance in latent variable regression models
Kvalheim, O.M.; Arneberg, R.; Bleie, O.; Rajalahti, T.; Smilde, A.K.; Westerhuis, J.A.
2014-01-01
The quality and practical usefulness of a regression model are a function of both interpretability and prediction performance. This work presents some new graphical tools for improved interpretation of latent variable regression models that can also assist in improved algorithms for variable
Stepwise versus Hierarchical Regression: Pros and Cons
Lewis, Mitzi
2007-01-01
Multiple regression is commonly used in social and behavioral data analysis. In multiple regression contexts, researchers are very often interested in determining the "best" predictors in the analysis. This focus may stem from a need to identify those predictors that are supportive of theory. Alternatively, the researcher may simply be interested…
Suppression Situations in Multiple Linear Regression
Shieh, Gwowen
2006-01-01
This article proposes alternative expressions for the two most prevailing definitions of suppression without resorting to the standardized regression modeling. The formulation provides a simple basis for the examination of their relationship. For the two-predictor regression, the author demonstrates that the previous results in the literature are…
Gibrat’s law and quantile regressions
DEFF Research Database (Denmark)
Distante, Roberta; Petrella, Ivan; Santoro, Emiliano
2017-01-01
The nexus between firm growth, size and age in U.S. manufacturing is examined through the lens of quantile regression models. This methodology allows us to overcome serious shortcomings entailed by linear regression models employed by much of the existing literature, unveiling a number of important...
Regression Analysis and the Sociological Imagination
De Maio, Fernando
2014-01-01
Regression analysis is an important aspect of most introductory statistics courses in sociology but is often presented in contexts divorced from the central concerns that bring students into the discipline. Consequently, we present five lesson ideas that emerge from a regression analysis of income inequality and mortality in the USA and Canada.
Repeated Results Analysis for Middleware Regression Benchmarking
Czech Academy of Sciences Publication Activity Database
Bulej, Lubomír; Kalibera, T.; Tůma, P.
2005-01-01
Roč. 60, - (2005), s. 345-358 ISSN 0166-5316 R&D Projects: GA ČR GA102/03/0672 Institutional research plan: CEZ:AV0Z10300504 Keywords : middleware benchmarking * regression benchmarking * regression testing Subject RIV: JD - Computer Applications, Robotics Impact factor: 0.756, year: 2005
Principles of Quantile Regression and an Application
Chen, Fang; Chalhoub-Deville, Micheline
2014-01-01
Newer statistical procedures are typically introduced to help address the limitations of those already in practice or to deal with emerging research needs. Quantile regression (QR) is introduced in this paper as a relatively new methodology, which is intended to overcome some of the limitations of least squares mean regression (LMR). QR is more…
ON REGRESSION REPRESENTATIONS OF STOCHASTIC-PROCESSES
RUSCHENDORF, L; DEVALK, [No Value
We construct a.s. nonlinear regression representations of general stochastic processes (X(n))n is-an-element-of N. As a consequence we obtain in particular special regression representations of Markov chains and of certain m-dependent sequences. For m-dependent sequences we obtain a constructive
Cao, Faxian; Yang, Zhijing; Ren, Jinchang; Ling, Wing-Kuen; Zhao, Huimin; Marshall, Stephen
2017-12-01
Although the sparse multinomial logistic regression (SMLR) has provided a useful tool for sparse classification, it suffers from inefficacy in dealing with high dimensional features and manually set initial regressor values. This has significantly constrained its applications for hyperspectral image (HSI) classification. In order to tackle these two drawbacks, an extreme sparse multinomial logistic regression (ESMLR) is proposed for effective classification of HSI. First, the HSI dataset is projected to a new feature space with randomly generated weight and bias. Second, an optimization model is established by the Lagrange multiplier method and the dual principle to automatically determine a good initial regressor for SMLR via minimizing the training error and the regressor value. Furthermore, the extended multi-attribute profiles (EMAPs) are utilized for extracting both the spectral and spatial features. A combinational linear multiple features learning (MFL) method is proposed to further enhance the features extracted by ESMLR and EMAPs. Finally, the logistic regression via the variable splitting and the augmented Lagrangian (LORSAL) is adopted in the proposed framework for reducing the computational time. Experiments are conducted on two well-known HSI datasets, namely the Indian Pines dataset and the Pavia University dataset, which have shown the fast and robust performance of the proposed ESMLR framework.
Lestari, A. W.; Rustam, Z.
2017-07-01
In the last decade, breast cancer has become the focus of world attention as this disease is one of the primary leading cause of death for women. Therefore, it is necessary to have the correct precautions and treatment. In previous studies, Fuzzy Kennel K-Medoid algorithm has been used for multi-class data. This paper proposes an algorithm to classify the high dimensional data of breast cancer using Fuzzy Possibilistic C-means (FPCM) and a new method based on clustering analysis using Normed Kernel Function-Based Fuzzy Possibilistic C-Means (NKFPCM). The objective of this paper is to obtain the best accuracy in classification of breast cancer data. In order to improve the accuracy of the two methods, the features candidates are evaluated using feature selection, where Laplacian Score is used. The results show the comparison accuracy and running time of FPCM and NKFPCM with and without feature selection.
Regression of environmental noise in LIGO data
International Nuclear Information System (INIS)
Tiwari, V; Klimenko, S; Mitselmakher, G; Necula, V; Drago, M; Prodi, G; Frolov, V; Yakushin, I; Re, V; Salemi, F; Vedovato, G
2015-01-01
We address the problem of noise regression in the output of gravitational-wave (GW) interferometers, using data from the physical environmental monitors (PEM). The objective of the regression analysis is to predict environmental noise in the GW channel from the PEM measurements. One of the most promising regression methods is based on the construction of Wiener–Kolmogorov (WK) filters. Using this method, the seismic noise cancellation from the LIGO GW channel has already been performed. In the presented approach the WK method has been extended, incorporating banks of Wiener filters in the time–frequency domain, multi-channel analysis and regulation schemes, which greatly enhance the versatility of the regression analysis. Also we present the first results on regression of the bi-coherent noise in the LIGO data. (paper)
Pathological assessment of liver fibrosis regression
Directory of Open Access Journals (Sweden)
WANG Bingqiong
2017-03-01
Full Text Available Hepatic fibrosis is the common pathological outcome of chronic hepatic diseases. An accurate assessment of fibrosis degree provides an important reference for a definite diagnosis of diseases, treatment decision-making, treatment outcome monitoring, and prognostic evaluation. At present, many clinical studies have proven that regression of hepatic fibrosis and early-stage liver cirrhosis can be achieved by effective treatment, and a correct evaluation of fibrosis regression has become a hot topic in clinical research. Liver biopsy has long been regarded as the gold standard for the assessment of hepatic fibrosis, and thus it plays an important role in the evaluation of fibrosis regression. This article reviews the clinical application of current pathological staging systems in the evaluation of fibrosis regression from the perspectives of semi-quantitative scoring system, quantitative approach, and qualitative approach, in order to propose a better pathological evaluation system for the assessment of fibrosis regression.
Should metacognition be measured by logistic regression?
Rausch, Manuel; Zehetleitner, Michael
2017-03-01
Are logistic regression slopes suitable to quantify metacognitive sensitivity, i.e. the efficiency with which subjective reports differentiate between correct and incorrect task responses? We analytically show that logistic regression slopes are independent from rating criteria in one specific model of metacognition, which assumes (i) that rating decisions are based on sensory evidence generated independently of the sensory evidence used for primary task responses and (ii) that the distributions of evidence are logistic. Given a hierarchical model of metacognition, logistic regression slopes depend on rating criteria. According to all considered models, regression slopes depend on the primary task criterion. A reanalysis of previous data revealed that massive numbers of trials are required to distinguish between hierarchical and independent models with tolerable accuracy. It is argued that researchers who wish to use logistic regression as measure of metacognitive sensitivity need to control the primary task criterion and rating criteria. Copyright © 2017 Elsevier Inc. All rights reserved.
Jiang, Caigui; Tang, Chengcheng; Vaxman, Amir; Wonka, Peter; Pottmann, Helmut
2015-01-01
We study the design and optimization of polyhedral patterns, which are patterns of planar polygonal faces on freeform surfaces. Working with polyhedral patterns is desirable in architectural geometry and industrial design. However, the classical
International Nuclear Information System (INIS)
Lucka, Felix
2012-01-01
Sparsity has become a key concept for solving of high-dimensional inverse problems using variational regularization techniques. Recently, using similar sparsity-constraints in the Bayesian framework for inverse problems by encoding them in the prior distribution has attracted attention. Important questions about the relation between regularization theory and Bayesian inference still need to be addressed when using sparsity promoting inversion. A practical obstacle for these examinations is the lack of fast posterior sampling algorithms for sparse, high-dimensional Bayesian inversion. Accessing the full range of Bayesian inference methods requires being able to draw samples from the posterior probability distribution in a fast and efficient way. This is usually done using Markov chain Monte Carlo (MCMC) sampling algorithms. In this paper, we develop and examine a new implementation of a single component Gibbs MCMC sampler for sparse priors relying on L1-norms. We demonstrate that the efficiency of our Gibbs sampler increases when the level of sparsity or the dimension of the unknowns is increased. This property is contrary to the properties of the most commonly applied Metropolis–Hastings (MH) sampling schemes. We demonstrate that the efficiency of MH schemes for L1-type priors dramatically decreases when the level of sparsity or the dimension of the unknowns is increased. Practically, Bayesian inversion for L1-type priors using MH samplers is not feasible at all. As this is commonly believed to be an intrinsic feature of MCMC sampling, the performance of our Gibbs sampler also challenges common beliefs about the applicability of sample based Bayesian inference. (paper)
Lie, Octavian V; van Mierlo, Pieter
2017-01-01
The visual interpretation of intracranial EEG (iEEG) is the standard method used in complex epilepsy surgery cases to map the regions of seizure onset targeted for resection. Still, visual iEEG analysis is labor-intensive and biased due to interpreter dependency. Multivariate parametric functional connectivity measures using adaptive autoregressive (AR) modeling of the iEEG signals based on the Kalman filter algorithm have been used successfully to localize the electrographic seizure onsets. Due to their high computational cost, these methods have been applied to a limited number of iEEG time-series (Kalman filter implementations, a well-known multivariate adaptive AR model (Arnold et al. 1998) and a simplified, computationally efficient derivation of it, for their potential application to connectivity analysis of high-dimensional (up to 192 channels) iEEG data. When used on simulated seizures together with a multivariate connectivity estimator, the partial directed coherence, the two AR models were compared for their ability to reconstitute the designed seizure signal connections from noisy data. Next, focal seizures from iEEG recordings (73-113 channels) in three patients rendered seizure-free after surgery were mapped with the outdegree, a graph-theory index of outward directed connectivity. Simulation results indicated high levels of mapping accuracy for the two models in the presence of low-to-moderate noise cross-correlation. Accordingly, both AR models correctly mapped the real seizure onset to the resection volume. This study supports the possibility of conducting fully data-driven multivariate connectivity estimations on high-dimensional iEEG datasets using the Kalman filter approach.
Structure of multidimensional patterns
International Nuclear Information System (INIS)
Smith, S.P.
1982-01-01
The problem of describing the structure of multidimensional data is important in exploratory data analysis, statistical pattern recognition, and image processing. A data set is viewed as a collection of points embedded in a high dimensional space. The primary goal of this research is to determine if the data have any clustering structure; such a structure implies the presence of class information (categories) in the data. A statistical hypothesis is used in the decision making. To this end, data with no structure are defined as data following the uniform distribution over some compact convex set in K-dimensional space, called the sampling window. This thesis defines two new tests for uniformity along with various sampling window estimators. The first test is a volume-based test which captures density changes in the data. The second test compares a uniformly distributed sample to the data by using the minimal spanning tree (MST) of the polled samples. Sampling window estimators are provided for simple sampling windows and use the convex hull of the data as a general sampling window estimator. For both of the tests for uniformity, theoretical results are provided on their size, and study their size and power against clustered alternatives is studied. Simulation is also used to study the efficacy of the sampling window estimators
Regression modeling of ground-water flow
Cooley, R.L.; Naff, R.L.
1985-01-01
Nonlinear multiple regression methods are developed to model and analyze groundwater flow systems. Complete descriptions of regression methodology as applied to groundwater flow models allow scientists and engineers engaged in flow modeling to apply the methods to a wide range of problems. Organization of the text proceeds from an introduction that discusses the general topic of groundwater flow modeling, to a review of basic statistics necessary to properly apply regression techniques, and then to the main topic: exposition and use of linear and nonlinear regression to model groundwater flow. Statistical procedures are given to analyze and use the regression models. A number of exercises and answers are included to exercise the student on nearly all the methods that are presented for modeling and statistical analysis. Three computer programs implement the more complex methods. These three are a general two-dimensional, steady-state regression model for flow in an anisotropic, heterogeneous porous medium, a program to calculate a measure of model nonlinearity with respect to the regression parameters, and a program to analyze model errors in computed dependent variables such as hydraulic head. (USGS)
Forecasting urban water demand: A meta-regression analysis.
Sebri, Maamar
2016-12-01
Water managers and planners require accurate water demand forecasts over the short-, medium- and long-term for many purposes. These range from assessing water supply needs over spatial and temporal patterns to optimizing future investments and planning future allocations across competing sectors. This study surveys the empirical literature on the urban water demand forecasting using the meta-analytical approach. Specifically, using more than 600 estimates, a meta-regression analysis is conducted to identify explanations of cross-studies variation in accuracy of urban water demand forecasting. Our study finds that accuracy depends significantly on study characteristics, including demand periodicity, modeling method, forecasting horizon, model specification and sample size. The meta-regression results remain robust to different estimators employed as well as to a series of sensitivity checks performed. The importance of these findings lies in the conclusions and implications drawn out for regulators and policymakers and for academics alike. Copyright © 2016. Published by Elsevier Ltd.
Variable and subset selection in PLS regression
DEFF Research Database (Denmark)
Høskuldsson, Agnar
2001-01-01
The purpose of this paper is to present some useful methods for introductory analysis of variables and subsets in relation to PLS regression. We present here methods that are efficient in finding the appropriate variables or subset to use in the PLS regression. The general conclusion...... is that variable selection is important for successful analysis of chemometric data. An important aspect of the results presented is that lack of variable selection can spoil the PLS regression, and that cross-validation measures using a test set can show larger variation, when we use different subsets of X, than...
Applied Regression Modeling A Business Approach
Pardoe, Iain
2012-01-01
An applied and concise treatment of statistical regression techniques for business students and professionals who have little or no background in calculusRegression analysis is an invaluable statistical methodology in business settings and is vital to model the relationship between a response variable and one or more predictor variables, as well as the prediction of a response value given values of the predictors. In view of the inherent uncertainty of business processes, such as the volatility of consumer spending and the presence of market uncertainty, business professionals use regression a
Vectors, a tool in statistical regression theory
Corsten, L.C.A.
1958-01-01
Using linear algebra this thesis developed linear regression analysis including analysis of variance, covariance analysis, special experimental designs, linear and fertility adjustments, analysis of experiments at different places and times. The determination of the orthogonal projection, yielding
Genetics Home Reference: caudal regression syndrome
... umbilical artery: Further support for a caudal regression-sirenomelia spectrum. Am J Med Genet A. 2007 Dec ... AK, Dickinson JE, Bower C. Caudal dysgenesis and sirenomelia-single centre experience suggests common pathogenic basis. Am ...
Dynamic travel time estimation using regression trees.
2008-10-01
This report presents a methodology for travel time estimation by using regression trees. The dissemination of travel time information has become crucial for effective traffic management, especially under congested road conditions. In the absence of c...
Two Paradoxes in Linear Regression Analysis
FENG, Ge; PENG, Jing; TU, Dongke; ZHENG, Julia Z.; FENG, Changyong
2016-01-01
Summary Regression is one of the favorite tools in applied statistics. However, misuse and misinterpretation of results from regression analysis are common in biomedical research. In this paper we use statistical theory and simulation studies to clarify some paradoxes around this popular statistical method. In particular, we show that a widely used model selection procedure employed in many publications in top medical journals is wrong. Formal procedures based on solid statistical theory should be used in model selection. PMID:28638214
Discriminative Elastic-Net Regularized Linear Regression.
Zhang, Zheng; Lai, Zhihui; Xu, Yong; Shao, Ling; Wu, Jian; Xie, Guo-Sen
2017-03-01
In this paper, we aim at learning compact and discriminative linear regression models. Linear regression has been widely used in different problems. However, most of the existing linear regression methods exploit the conventional zero-one matrix as the regression targets, which greatly narrows the flexibility of the regression model. Another major limitation of these methods is that the learned projection matrix fails to precisely project the image features to the target space due to their weak discriminative capability. To this end, we present an elastic-net regularized linear regression (ENLR) framework, and develop two robust linear regression models which possess the following special characteristics. First, our methods exploit two particular strategies to enlarge the margins of different classes by relaxing the strict binary targets into a more feasible variable matrix. Second, a robust elastic-net regularization of singular values is introduced to enhance the compactness and effectiveness of the learned projection matrix. Third, the resulting optimization problem of ENLR has a closed-form solution in each iteration, which can be solved efficiently. Finally, rather than directly exploiting the projection matrix for recognition, our methods employ the transformed features as the new discriminate representations to make final image classification. Compared with the traditional linear regression model and some of its variants, our method is much more accurate in image classification. Extensive experiments conducted on publicly available data sets well demonstrate that the proposed framework can outperform the state-of-the-art methods. The MATLAB codes of our methods can be available at http://www.yongxu.org/lunwen.html.
Fuzzy multiple linear regression: A computational approach
Juang, C. H.; Huang, X. H.; Fleming, J. W.
1992-01-01
This paper presents a new computational approach for performing fuzzy regression. In contrast to Bardossy's approach, the new approach, while dealing with fuzzy variables, closely follows the conventional regression technique. In this approach, treatment of fuzzy input is more 'computational' than 'symbolic.' The following sections first outline the formulation of the new approach, then deal with the implementation and computational scheme, and this is followed by examples to illustrate the new procedure.
Computing multiple-output regression quantile regions
Czech Academy of Sciences Publication Activity Database
Paindaveine, D.; Šiman, Miroslav
2012-01-01
Roč. 56, č. 4 (2012), s. 840-853 ISSN 0167-9473 R&D Projects: GA MŠk(CZ) 1M06047 Institutional research plan: CEZ:AV0Z10750506 Keywords : halfspace depth * multiple-output regression * parametric linear programming * quantile regression Subject RIV: BA - General Mathematics Impact factor: 1.304, year: 2012 http://library.utia.cas.cz/separaty/2012/SI/siman-0376413.pdf
There is No Quantum Regression Theorem
International Nuclear Information System (INIS)
Ford, G.W.; OConnell, R.F.
1996-01-01
The Onsager regression hypothesis states that the regression of fluctuations is governed by macroscopic equations describing the approach to equilibrium. It is here asserted that this hypothesis fails in the quantum case. This is shown first by explicit calculation for the example of quantum Brownian motion of an oscillator and then in general from the fluctuation-dissipation theorem. It is asserted that the correct generalization of the Onsager hypothesis is the fluctuation-dissipation theorem. copyright 1996 The American Physical Society
Caudal regression syndrome : a case report
International Nuclear Information System (INIS)
Lee, Eun Joo; Kim, Hi Hye; Kim, Hyung Sik; Park, So Young; Han, Hye Young; Lee, Kwang Hun
1998-01-01
Caudal regression syndrome is a rare congenital anomaly, which results from a developmental failure of the caudal mesoderm during the fetal period. We present a case of caudal regression syndrome composed of a spectrum of anomalies including sirenomelia, dysplasia of the lower lumbar vertebrae, sacrum, coccyx and pelvic bones,genitourinary and anorectal anomalies, and dysplasia of the lung, as seen during infantography and MR imaging
Caudal regression syndrome : a case report
Energy Technology Data Exchange (ETDEWEB)
Lee, Eun Joo; Kim, Hi Hye; Kim, Hyung Sik; Park, So Young; Han, Hye Young; Lee, Kwang Hun [Chungang Gil Hospital, Incheon (Korea, Republic of)
1998-07-01
Caudal regression syndrome is a rare congenital anomaly, which results from a developmental failure of the caudal mesoderm during the fetal period. We present a case of caudal regression syndrome composed of a spectrum of anomalies including sirenomelia, dysplasia of the lower lumbar vertebrae, sacrum, coccyx and pelvic bones,genitourinary and anorectal anomalies, and dysplasia of the lung, as seen during infantography and MR imaging.
Forecasting exchange rates: a robust regression approach
Preminger, Arie; Franck, Raphael
2005-01-01
The least squares estimation method as well as other ordinary estimation method for regression models can be severely affected by a small number of outliers, thus providing poor out-of-sample forecasts. This paper suggests a robust regression approach, based on the S-estimation method, to construct forecasting models that are less sensitive to data contamination by outliers. A robust linear autoregressive (RAR) and a robust neural network (RNN) models are estimated to study the predictabil...
Marginal longitudinal semiparametric regression via penalized splines
Al Kadiri, M.
2010-08-01
We study the marginal longitudinal nonparametric regression problem and some of its semiparametric extensions. We point out that, while several elaborate proposals for efficient estimation have been proposed, a relative simple and straightforward one, based on penalized splines, has not. After describing our approach, we then explain how Gibbs sampling and the BUGS software can be used to achieve quick and effective implementation. Illustrations are provided for nonparametric regression and additive models.
Marginal longitudinal semiparametric regression via penalized splines
Al Kadiri, M.; Carroll, R.J.; Wand, M.P.
2010-01-01
We study the marginal longitudinal nonparametric regression problem and some of its semiparametric extensions. We point out that, while several elaborate proposals for efficient estimation have been proposed, a relative simple and straightforward one, based on penalized splines, has not. After describing our approach, we then explain how Gibbs sampling and the BUGS software can be used to achieve quick and effective implementation. Illustrations are provided for nonparametric regression and additive models.