Classification and regression trees
Breiman, Leo; Olshen, Richard A; Stone, Charles J
1984-01-01
The methodology used to construct tree structured rules is the focus of this monograph. Unlike many other statistical procedures, which moved from pencil and paper to calculators, this text's use of trees was unthinkable before computers. Both the practical and theoretical sides have been developed in the authors' study of tree methods. Classification and Regression Trees reflects these two sides, covering the use of trees as a data analysis method, and in a more mathematical framework, proving some of their fundamental properties.
Shin, Yoonseok
2015-01-01
Among the recent data mining techniques available, the boosting approach has attracted a great deal of attention because of its effective learning algorithm and strong boundaries in terms of its generalization performance. However, the boosting approach has yet to be used in regression problems within the construction domain, including cost estimations, but has been actively utilized in other domains. Therefore, a boosting regression tree (BRT) is applied to cost estimations at the early stage of a construction project to examine the applicability of the boosting approach to a regression problem within the construction domain. To evaluate the performance of the BRT model, its performance was compared with that of a neural network (NN) model, which has been proven to have a high performance in cost estimation domains. The BRT model has shown results similar to those of NN model using 234 actual cost datasets of a building construction project. In addition, the BRT model can provide additional information such as the importance plot and structure model, which can support estimators in comprehending the decision making process. Consequently, the boosting approach has potential applicability in preliminary cost estimations in a building construction project.
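The boosting idea referred to above can be illustrated with a minimal sketch. This is not the BRT software or data used in the study; it is a generic least-squares boosting loop over one-split regression trees ("stumps") on a single hypothetical predictor, written with the standard library only. The function names, the learning rate, and the toy numbers are all assumptions made for illustration.

```python
from statistics import mean

def fit_stump(xs, residuals):
    """Return (threshold, left_value, right_value) minimising squared error."""
    best = None
    # Candidate thresholds: every distinct x except the largest,
    # so both sides of the split are always non-empty.
    for t in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lv, rv = mean(left), mean(right)
        sse = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lv, rv)
    return best[1:]

def boost(xs, ys, rounds=50, learning_rate=0.1):
    """Fit stumps to the current residuals, one round at a time."""
    base = mean(ys)                      # start from the mean response
    preds = [base] * len(ys)
    stumps = []
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        t, lv, rv = fit_stump(xs, residuals)
        stumps.append((t, lv, rv))
        preds = [p + learning_rate * (lv if x <= t else rv)
                 for x, p in zip(xs, preds)]
    return base, learning_rate, stumps

def predict(model, x):
    """Sum the shrunken stump outputs on top of the base prediction."""
    base, learning_rate, stumps = model
    return base + sum(learning_rate * (lv if x <= t else rv)
                      for t, lv, rv in stumps)

# Hypothetical toy data: one predictor (e.g. floor area) and a cost.
xs = [1, 2, 3, 4, 5, 6]
ys = [10, 11, 10, 20, 21, 20]
model = boost(xs, ys, rounds=100)
```

Each round can only lower the training squared error, because the stump is a least-squares fit to the residuals and the step is shrunk by the learning rate. A real application would use several predictors and held-out data to choose the number of rounds.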
Classification and Regression Trees (CART) Theory and Applications
Timofeev, Roman
2004-01-01
This master's thesis is devoted to Classification and Regression Trees (CART). CART is a classification method that uses historical data to construct decision trees. Depending on the information available about the dataset, either a classification tree or a regression tree can be constructed. The constructed tree can then be used to classify new observations. The first part of the thesis describes the fundamental principles of tree construction, different splitting algorithms, and pruning procedures. Seco...
Boosted regression tree, table, and figure data
Spreadsheets are included here to support the manuscript Boosted Regression Tree Models to Explain Watershed Nutrient Concentrations and Biological Condition. This dataset is associated with the following publication: Golden, H., C. Lane, A. Prues, and E. D'Amico. Boosted Regression Tree Models to Explain Watershed Nutrient Concentrations and Biological Condition. JAWRA. American Water Resources Association, Middleburg, VA, USA, 52(5): 1251-1274, (2016).
SMOOTH TRANSITION LOGISTIC REGRESSION MODEL TREE
RODRIGO PINTO MOREIRA
2008-01-01
The main objective of this work is to adapt the STR-Tree model, which combines a Smooth Transition Regression model with Classification and Regression Trees (CART), so that it can be used for classification. To this end, some changes were made to its structural form and to its estimation. Because the dependent variables being classified are binary, the techniques used in logistic regression are required; accordingly, the estimation of the pa...
Regression analysis using dependent Polya trees.
Schörgendorfer, Angela; Branscum, Adam J
2013-11-30
Many commonly used models for linear regression analysis force overly simplistic shape and scale constraints on the residual structure of data. We propose a semiparametric Bayesian model for regression analysis that produces data-driven inference by using a new type of dependent Polya tree prior to model arbitrary residual distributions that are allowed to evolve across increasing levels of an ordinal covariate (e.g., time, in repeated measurement studies). By modeling residual distributions at consecutive covariate levels or time points using separate, but dependent Polya tree priors, distributional information is pooled while allowing for broad pliability to accommodate many types of changing residual distributions. We can use the proposed dependent residual structure in a wide range of regression settings, including fixed-effects and mixed-effects linear and nonlinear models for cross-sectional, prospective, and repeated measurement data. A simulation study illustrates the flexibility of our novel semiparametric regression model to accurately capture evolving residual distributions. In an application to immune development data on immunoglobulin G antibodies in children, our new model outperforms several contemporary semiparametric regression models based on a predictive model selection criterion. Copyright © 2013 John Wiley & Sons, Ltd.
Inferring gene regression networks with model trees
Aguilar-Ruiz Jesus S
2010-10-01
Background: Novel strategies are required in order to handle the huge amount of data produced by microarray technologies. To infer gene regulatory networks, the first step is to find direct regulatory relationships between genes, building the so-called gene co-expression networks. They are typically generated using correlation statistics as pairwise similarity measures. Correlation-based methods are very useful for determining whether two genes have a strong global similarity, but they do not detect local similarities. Results: We propose model trees as a method to identify gene interaction networks. While correlation-based methods analyze each pair of genes, in our approach we generate a single regression tree for each gene from the remaining genes. Finally, a graph of all the relationships among output and input genes is built, taking into account whether each pair of genes is statistically significant. For this reason we apply a statistical procedure to control the false discovery rate. The performance of our approach, named REGNET, is experimentally tested on two well-known data sets: the Saccharomyces cerevisiae and E. coli data sets. First, the biological coherence of the results is tested. Second, the E. coli transcriptional network (in the Regulon database) is used as a control to compare the results to those of a correlation-based method. This experiment shows that REGNET performs more accurately at detecting true gene associations than the Pearson and Spearman zeroth and first-order correlation-based methods. Conclusions: REGNET generates gene association networks from gene expression data, and differs from correlation-based methods in that the relationship between one gene and others is calculated simultaneously. Model trees are very useful techniques for estimating the numerical values of the target genes by linear regression functions. They are very often more precise than linear regression models because they can add just different linear
Sub-pixel estimation of tree cover and bare surface densities using regression tree analysis
Carlos Augusto Zangrando Toneli
2011-09-01
Sub-pixel analysis is capable of generating continuous fields, which represent the spatial variability of certain thematic classes. The aim of this work was to develop numerical models to represent the variability of tree cover and bare surfaces within the study area. This research was conducted in the riparian buffer within a watershed of the São Francisco River in the North of Minas Gerais, Brazil. IKONOS and Landsat TM imagery were used with the GUIDE algorithm to construct the models. The results were two index images derived with regression trees for the entire study area, one representing tree cover and the other representing bare surface. The use of non-parametric and non-linear regression tree models presented satisfactory results to characterize wetland, deciduous and savanna patterns of forest formation.
Algorithms for Decision Tree Construction
Chikalov, Igor
2011-01-01
The study of algorithms for decision tree construction was initiated in the 1960s. The first algorithms were based on the separation heuristic [13, 31], which at each step tries to divide the set of objects as evenly as possible. Later, Garey and Graham [28] showed that such an algorithm may construct decision trees whose average depth is arbitrarily far from the minimum. Hyafil and Rivest [35] proved NP-hardness of the DT problem, that is, constructing a tree with the minimum average depth for a diagnostic problem over a 2-valued information system and a uniform probability distribution. Cox et al. [22] showed that for a two-class problem over an information system, even finding the root node attribute for an optimal tree is an NP-hard problem. © Springer-Verlag Berlin Heidelberg 2011.
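The separation heuristic mentioned above can be sketched as a greedy procedure over binary attributes. The sketch below is an assumption-laden illustration, not code from the chapter: objects are hypothetical dictionaries with a "name" key and 0/1 attribute values, and ties between equally balanced attributes are broken by list order.

```python
def most_even_attribute(objects, attributes):
    """Pick the attribute whose 0/1 split of `objects` is closest to half and half."""
    def imbalance(a):
        ones = sum(obj[a] for obj in objects)
        return abs(2 * ones - len(objects))
    return min(attributes, key=imbalance)

def build_tree(objects, attributes):
    """Split recursively until one object remains.

    Returns a leaf (object name) or a tuple (attribute, subtree_for_0, subtree_for_1).
    """
    if len(objects) == 1:
        return objects[0]["name"]
    # Only attributes that actually separate the current objects are useful.
    usable = [a for a in attributes
              if 0 < sum(o[a] for o in objects) < len(objects)]
    if not usable:
        # The remaining objects cannot be told apart by any attribute.
        return sorted(o["name"] for o in objects)
    a = most_even_attribute(objects, usable)
    zeros = [o for o in objects if o[a] == 0]
    ones = [o for o in objects if o[a] == 1]
    return (a, build_tree(zeros, attributes), build_tree(ones, attributes))

# Hypothetical diagnostic problem with four objects and three binary tests.
objects = [
    {"name": "A", "x": 1, "y": 0, "z": 0},
    {"name": "B", "x": 0, "y": 0, "z": 1},
    {"name": "C", "x": 0, "y": 1, "z": 0},
    {"name": "D", "x": 0, "y": 1, "z": 1},
]
tree = build_tree(objects, ["x", "y", "z"])
```

As the abstract notes, this greedy choice is only a heuristic: an even split at each step does not guarantee a tree of minimum average depth.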
Subgroup finding via Bayesian additive regression trees.
Sivaganesan, Siva; Müller, Peter; Huang, Bin
2017-03-09
We provide a Bayesian decision theoretic approach to finding subgroups that have elevated treatment effects. Our approach separates the modeling of the response variable from the task of subgroup finding and allows a flexible modeling of the response variable irrespective of potential subgroups of interest. We use Bayesian additive regression trees to model the response variable and use a utility function defined in terms of a candidate subgroup and the predicted response for that subgroup. Subgroups are identified by maximizing the expected utility, where the expectation is taken with respect to the posterior predictive distribution of the response, and the maximization is carried out over an a priori specified set of candidate subgroups. Our approach allows subgroups based on both quantitative and categorical covariates. We illustrate the approach using a simulation study and a real data set. Copyright © 2017 John Wiley & Sons, Ltd.
Constructing Student Problems in Phylogenetic Tree Construction.
Brewer, Steven D.
Evolution is often equated with natural selection and is taught from a primarily functional perspective while comparative and historical approaches, which are critical for developing an appreciation of the power of evolutionary theory, are often neglected. This report describes a study of expert problem-solving in phylogenetic tree construction.…
Boosted Regression Tree Models to Explain Watershed ...
Boosted regression tree (BRT) models were developed to quantify the nonlinear relationships between landscape variables and nutrient concentrations in a mesoscale mixed land cover watershed during base-flow conditions. Factors that affect instream biological components, based on the Index of Biotic Integrity (IBI), were also analyzed. Seasonal BRT models at two spatial scales (watershed and riparian buffered area [RBA]) for nitrite-nitrate (NO2-NO3), total Kjeldahl nitrogen, and total phosphorus (TP) and annual models for the IBI score were developed. Two primary factors — location within the watershed (i.e., geographic position, stream order, and distance to a downstream confluence) and percentage of urban land cover (both scales) — emerged as important predictor variables. Latitude and longitude interacted with other factors to explain the variability in summer NO2-NO3 concentrations and IBI scores. BRT results also suggested that location might be associated with indicators of sources (e.g., land cover), runoff potential (e.g., soil and topographic factors), and processes not easily represented by spatial data indicators. Runoff indicators (e.g., Hydrological Soil Group D and Topographic Wetness Indices) explained a substantial portion of the variability in nutrient concentrations as did point sources for TP in the summer months. The results from our BRT approach can help prioritize areas for nutrient management in mixed-use and heavily impacted watershed
Efficient Frequent Pattern Tree Construction
D.Bujji Babu
2014-03-01
Association rule learning is a popular and well-researched data mining technique for discovering interesting relations between variables in large databases. Association rules are a part of intelligent systems. They are usually required to satisfy a user-specified minimum support and a user-specified minimum confidence at the same time. Apriori and FP-Growth are well-known algorithms for association rule mining. In this paper we concentrate on the construction of efficient frequent pattern trees. We present novel frequent pattern trees and discuss their performance issues. The proposed trees are fast and efficient, and they help to extract frequent patterns. The paper describes the major advantages that the newly proposed approach brings to the FP-Growth algorithm for association rule mining.
Combining regression trees and radial basis function networks.
Orr, M; Hallam, J; Takezawa, K; Murra, A; Ninomiya, S; Oide, M; Leonard, T
2000-12-01
We describe a method for non-parametric regression which combines regression trees with radial basis function networks. The method is similar to that of Kubat, who was first to suggest such a combination, but has some significant improvements. We demonstrate the features of the new method, compare its performance with other methods on DELVE data sets and apply it to a real world problem involving the classification of soybean plants from digital images.
Stefanie M. Herrmann
2013-10-01
Field trees are an integral part of the farmed parkland landscape in West Africa and provide multiple benefits to the local environment and livelihoods. While field trees have received increasing interest in the context of strengthening resilience to climate variability and change, the actual extent of farmed parkland and spatial patterns of tree cover are largely unknown. We used the rule-based predictive modeling tool Cubist® to estimate field tree cover in the west-central agricultural region of Senegal. A collection of rules and associated multiple linear regression models was constructed from (1) a reference dataset of percent tree cover derived from very high spatial resolution data (2 m Orbview) as the dependent variable, and (2) ten years of 10-day 250 m Moderate Resolution Imaging Spectroradiometer (MODIS) Normalized Difference Vegetation Index (NDVI) composites and derived phenological metrics as independent variables. Correlation coefficients between modeled and reference percent tree cover of 0.88 and 0.77 were achieved for training and validation data respectively, with mean absolute errors of 1.07 and 1.03 percent tree cover. The resulting map shows a west-east gradient from high tree cover in the peri-urban areas of horticulture and arboriculture to low tree cover in the more sparsely populated eastern part of the study area. A comparison of current (2000s) tree cover along this gradient with historic cover as seen on Corona images reveals dynamics of change but also areas of remarkable stability of field tree cover since 1968. The proposed modeling approach can help to identify locations of high and low tree cover in dryland environments and guide ground studies and management interventions aimed at promoting the integration of field trees in agricultural systems.
Suduan Chen
2014-01-01
As fraudulent financial statements by enterprises become an increasingly serious problem, establishing a valid model for forecasting fraudulent financial statements has become an important question for academic research and financial practice. After screening the important variables using stepwise regression, the study applies logistic regression, support vector machine, and decision tree methods to construct classification models and compares them. The study adopts financial and nonfinancial variables to assist in establishing the forecasting model. The research objects are companies that issued fraudulent and nonfraudulent financial statements between 1998 and 2012. The findings are that financial and nonfinancial information can be used effectively to distinguish fraudulent financial statements, and that the C5.0 decision tree has the best classification performance, at 85.71%.
Data mining in psychological treatment research: a primer on classification and regression trees.
King, Matthew W; Resick, Patricia A
2014-10-01
Data mining of treatment study results can reveal unforeseen but critical insights, such as who receives the most benefit from treatment and under what circumstances. The usefulness and legitimacy of exploratory data analysis have received relatively little recognition, however, and analytic methods well suited to the task are not widely known in psychology. With roots in computer science and statistics, statistical learning approaches offer a credible option: these methods take a more inductive approach to building a model than is done in traditional regression, allowing the data a greater role in suggesting the correct relationships between variables rather than imposing them a priori. Classification and regression trees are presented as a powerful, flexible exemplar of statistical learning methods. Trees allow researchers to efficiently identify useful predictors of an outcome and discover interactions between predictors without the need to anticipate and specify these in advance, making them ideal for revealing patterns that inform hypotheses about treatment effects. Trees can also provide a predictive model for forecasting outcomes as an aid to clinical decision making. This primer describes how tree models are constructed, how the results are interpreted and evaluated, and how trees overcome some of the complexities of traditional regression. Examples are drawn from randomized clinical trial data and highlight some interpretations of particular interest to treatment researchers. The limitations of tree models are discussed, and suggestions for further reading and choices in software are offered.
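How a classification tree chooses a single split can be sketched in a few lines. The example below is not taken from the primer; it assumes the commonly used Gini impurity criterion and one numeric predictor, and the variable names and toy values are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure node)."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(values, labels):
    """Return (threshold, weighted_gini) for the best split `value <= threshold`.

    If no split improves on the unsplit node, the threshold is None.
    """
    n = len(values)
    best = (None, gini(labels))
    # Skip the largest value so that the right-hand side is never empty.
    for t in sorted(set(values))[:-1]:
        left = [l for v, l in zip(values, labels) if v <= t]
        right = [l for v, l in zip(values, labels) if v > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (t, score)
    return best

# Hypothetical predictor (e.g. a baseline severity score) and outcome labels.
values = [1, 2, 3, 10, 11, 12]
labels = ["no", "no", "no", "yes", "yes", "yes"]
threshold, impurity = best_split(values, labels)
```

A full tree repeats this search over all predictors within each resulting subset and is usually pruned afterwards; the sketch covers only the single-split step.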
An Algorithm for Fault-Tree Construction
Taylor, J. R.
1982-01-01
An algorithm for performing certain parts of the fault tree construction process is described. Its input is a flow sheet of the plant, a piping and instrumentation diagram, or a wiring diagram of the circuits, to be analysed, together with a standard library of component functional and failure...... models. A systematic approach to component model construction is also presented....
C. Quantin
2011-01-01
Cardiologists are interested in determining whether the type of hospital pathway followed by a patient is predictive of survival. The study objective was to determine whether accounting for hospital pathways in the selection of prognostic factors of one-year survival after acute myocardial infarction (AMI) provided a more informative analysis than that obtained by the use of a standard regression tree analysis (CART method). Information on AMI was collected for 1095 hospitalized patients over an 18-month period. The construction of pathways followed by patients produced symbolic-valued observations requiring a symbolic regression tree analysis. This analysis was compared with the standard CART analysis, which used patients as statistical units described by standard data and selected TIMI score as the primary predictor variable. For the 1011 (84, resp.) patients with a lower (higher) TIMI score, the pathway variable did not appear as a diagnostic variable until the third (second) stage of the tree construction. For an ecological analysis, again TIMI score was the first predictor variable. However, in a symbolic regression tree analysis using hospital pathways as statistical units, the type of pathway followed was the key predictor variable, showing in particular that pathways involving early admission to cardiology units produced high one-year survival rates.
Efficient Representation for Online Suffix Tree Construction
Larsson, N. Jesper; Fuglsang, Kasper; Karlsson, Kenneth
2014-01-01
Suffix tree construction algorithms based on suffix links are popular because they are simple to implement, can operate online in linear time, and because the suffix links are often convenient for pattern matching. We present an approach using edge-oriented suffix links, which reduces the number...
Speeding Up Neighbour-Joining Tree Construction
Brodal, Gerth Stølting; Fagerberg, Rolf; Mailund, Thomas
A widely used method for constructing phylogenetic trees is the neighbour-joining method of Saitou and Nei. We develop heuristics for speeding up the neighbour-joining method which generate the same phylogenetic trees as the original method. All heuristics are based on using a quad-tree to guide...... the search for the next pair of nodes to join, but differ in the information stored in quad-tree nodes, the way the search is performed, and in the way the quad-tree is updated after a join. We empirically evaluate the performance of the heuristics on distance matrices obtained from the Pfam collection...... of alignments, and compare the running time with that of the QuickTree tool, a well-known and widely used implementation of the standard neighbour-joining method. The results show that the presented heuristics can give a significant speed-up over the standard neighbour-joining method, already for medium sized...
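The "search for the next pair of nodes to join" can be made concrete with the exhaustive search used by the standard neighbour-joining method. The sketch below is not the quad-tree heuristic described in the abstract; it is the plain quadratic scan that such heuristics aim to speed up, using the usual criterion Q(i, j) = (n - 2) d(i, j) - sum_k d(i, k) - sum_k d(j, k). The function name and the small distance matrix are illustrative assumptions.

```python
def nj_pair(d):
    """Return ((i, j), q) for the pair minimising the neighbour-joining criterion.

    `d` is a symmetric distance matrix given as a list of lists.
    """
    n = len(d)
    row_sums = [sum(row) for row in d]
    best_pair, best_q = None, float("inf")
    for i in range(n):
        for j in range(i + 1, n):
            q = (n - 2) * d[i][j] - row_sums[i] - row_sums[j]
            if q < best_q:          # strict '<' keeps the first pair on ties
                best_pair, best_q = (i, j), q
    return best_pair, best_q

# Small illustrative distance matrix over five taxa.
d = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
pair, q = nj_pair(d)
```

This scan examines every pair at every join step, which is the cost that motivates guiding the search with an auxiliary data structure.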
A Construction Algorithm for Tree Structure
DONG Chengliang; GUO Shunsheng
2006-01-01
This paper studies and analyses the characteristics of the tree structure, and then presents a concise and convenient algorithm for constructing tree structures. It is especially suited to application systems that use a database. With this algorithm, no special storage organization needs to be established in the database. Using SQL statements, data dispersed across different storage organizations can be dynamically combined into a data set with two fields: a father node field and a child node field. The algorithm can then process those data and display the tree structure rapidly. Finally, we design a control called TFDTreeView, which inherits from the TTreeView control, using this algorithm. The TFDTreeView control provides an interface function through which the tree structure can be constructed conveniently. On some occasions this method will benefit the application system, and, once built, the control can be reused in the development of many systems.
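The idea of building a tree from a two-field data set (father node, child node) can be sketched as follows. This is not the TFDTreeView control or the paper's algorithm; it is a minimal illustration in which the rows stand in for the result of an SQL query, and the function names and sample rows are hypothetical.

```python
from collections import defaultdict

def build_children(rows):
    """Group (father, child) rows into a father -> [children] mapping."""
    children = defaultdict(list)
    for father, child in rows:
        children[father].append(child)
    return children

def render(children, node, depth=0):
    """Return the subtree under `node` as indented text lines (depth-first)."""
    lines = ["  " * depth + str(node)]
    for child in children.get(node, []):
        lines.extend(render(children, child, depth + 1))
    return lines

# Hypothetical rows, as they might come back from a two-column query.
rows = [("root", "a"), ("root", "b"), ("a", "a1")]
children = build_children(rows)
lines = render(children, "root")
```

Because the mapping is built in one pass over the rows, no dedicated tree table is needed in the database; the sketch assumes the rows contain no cycles.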
Constructal tree networks for heat transfer
Ledezma, G.A.; Bejan, A.; Errera, M.R. [Department of Mechanical Engineering and Materials Science, Box 90300, Duke University, Durham, North Carolina 27708-0300 (United States)
1997-07-01
This paper addresses the fundamental problem of how to connect a heat generating volume to a point heat sink by using a finite amount of high-conductivity material that can be distributed through the volume. The problem is one of optimizing the access (or minimizing the thermal resistance) between a finite-size volume and one point. The solution is constructed by covering the volume with a sequence of building blocks, which proceeds toward larger sizes (assemblies), hence, the “constructal” name for this approach. Optimized numerically at each stage are geometric features such as the overall shape of the building block, its number of constituents, and the internal distribution of high-conductivity inserts. It is shown that in the optimal design, the high-conductivity material has a distribution with the shape of a tree. Every aspect of the tree architecture is deterministic: the shapes of the largest assembly and all its constituents, the number of branches at each level of assembly, the relative position of building blocks in each assembly, and the relative thicknesses of successive branches. The finer, innermost details of the tree architecture (e.g., the branching angle) have a negligible effect on the overall thermal resistance. The main conclusion is that the structure, working mechanism, and minimal resistance of the tree network can be obtained deterministically, and that the constrained optimization of access routes accounts for the macroscopic structure in nature. © 1997 American Institute of Physics.
Iron Supplementation and Altitude: Decision Making Using a Regression Tree
Laura A. Garvican-Lewis, Andrew D. Govus, Peter Peeling, Chris R. Abbiss, Christopher J. Gore
2016-03-01
Altitude exposure increases the body’s need for iron (Gassmann and Muckenthaler, 2015), primarily to support accelerated erythropoiesis, yet clear supplementation guidelines do not exist. Athletes are typically recommended to ingest a daily oral iron supplement to facilitate altitude adaptations, and to help maintain iron balance. However, there is some debate as to whether athletes with otherwise healthy iron stores should be supplemented, due in part to concerns of iron overload. Excess iron in vital organs is associated with an increased risk of a number of conditions including cancer, liver disease and heart failure. Therefore clear guidelines are warranted and athletes should be discouraged from “self-prescribing” supplementation without medical advice. In the absence of prospective-controlled studies, decision tree analysis can be used to describe a data set, with the resultant regression tree serving as a guide for clinical decision making. Here, we present a regression tree in the context of iron supplementation during altitude exposure, to examine the association between pre-altitude ferritin (Ferritin-Pre) and the haemoglobin mass (Hbmass) response, based on daily iron supplement dose. De-identified ferritin and Hbmass data from 178 athletes engaged in altitude training were extracted from the Australian Institute of Sport (AIS) database. Altitude exposure was predominantly achieved via normobaric Live high: Train low (n = 147) at a simulated altitude of 3000 m for 2 to 4 weeks. The remaining athletes engaged in natural altitude training at venues ranging from 1350 to 2800 m for 3-4 weeks. Thus, the “hypoxic dose” ranged from ~890 km.h to ~1400 km.h. Ethical approval was granted by the AIS Human Ethics Committee, and athletes provided written informed consent. An in-depth description and traditional analysis of the complete data set is presented elsewhere (Govus et al., 2015). Iron supplementation was prescribed by a sports physician
U.S. Environmental Protection Agency — Spreadsheets are included here to support the manuscript "Boosted Regression Tree Models to Explain Watershed Nutrient Concentrations and Biological Condition". This...
Koon, Sharon; Petscher, Yaacov
2015-01-01
The purpose of this report was to explicate the use of logistic regression and classification and regression tree (CART) analysis in the development of early warning systems. It was motivated by state education leaders' interest in maintaining high classification accuracy while simultaneously improving practitioner understanding of the rules by…
Sparse suffix tree construction in small space
Bille, Philip; Fischer, Johannes; Gørtz, Inge Li
2013-01-01
the correct tree with high probability. We then give a Las-Vegas algorithm which also uses O(b) space and runs in the same time bounds with high probability when b = O(√n). Furthermore, additional tradeoffs between the space usage and the construction time for the Monte-Carlo algorithm are given......., which may be of independent interest, that allows to efficiently answer b longest common prefix queries on suffixes of T, using only O(b) space. We expect that this technique will prove useful in many other applications in which space usage is a concern. Our first solution is Monte-Carlo and outputs...
Susan L. King
2003-01-01
The performance of two classifiers, logistic regression and neural networks, is compared for modeling noncatastrophic individual tree mortality for 21 species of trees in West Virginia. The output of the classifier is usually a continuous number between 0 and 1. A threshold is selected between 0 and 1 and all of the trees below the threshold are classified as...
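The thresholding step described above can be sketched generically. The abstract is cut off before it says which class the trees below the threshold receive, so the sketch deliberately returns a neutral flag rather than a class name; the function name, the 0.5 default, and the scores are assumptions.

```python
def below_threshold(scores, threshold=0.5):
    """Return True for each score strictly below the threshold, else False."""
    return [s < threshold for s in scores]

# Hypothetical classifier outputs for five trees.
scores = [0.05, 0.40, 0.50, 0.75, 0.99]
flags = below_threshold(scores, threshold=0.5)
```

Moving the threshold trades one kind of misclassification for the other, which is why it is treated as a parameter to be selected rather than fixed.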
Hemmateenejad, Bahram, E-mail: hemmatb@sums.ac.ir [Department of Chemistry, Shiraz University, Shiraz (Iran, Islamic Republic of); Medicinal and Natural Products Chemistry Research Center, Shiraz University of Medical Sciences, Shiraz (Iran, Islamic Republic of); Shamsipur, Mojtaba [Department of Chemistry, Razi University, Kermanshah (Iran, Islamic Republic of); Zare-Shahabadi, Vali [Young Researchers Club, Mahshahr Branch, Islamic Azad University, Mahshahr (Iran, Islamic Republic of); Akhond, Morteza [Department of Chemistry, Shiraz University, Shiraz (Iran, Islamic Republic of)
2011-10-17
Highlights: Ant colony systems help to build optimum classification and regression trees. Using genetic algorithm operators in ant colony systems resulted in more appropriate models. Variable selection in each terminal node of the tree gives promising results. CART-ACS-GA could model the melting point of organic materials with prediction errors lower than previous models. Abstract: Classification and regression trees (CART) have the advantage of being able to handle large data sets and yield readily interpretable models. A conventional method of building a regression tree is recursive partitioning, which results in a good but not optimal tree. Ant colony system (ACS), a meta-heuristic algorithm derived from the observation of real ants, can be used to overcome this problem. The purpose of this study was to explore the use of CART and its combination with ACS for modeling the melting points of a large variety of chemical compounds. Genetic algorithm (GA) operators (e.g., crossover and mutation operators) were combined with the ACS algorithm to select the best solution model. In addition, at each terminal node of the resulting tree, variable selection was done by the ACS-GA algorithm to build an appropriate partial least squares (PLS) model. To test the ability of the resulting tree, a set of approximately 4173 structures and their melting points was used (3000 compounds as the training set and 1173 as the validation set). Further, an external test set containing 277 drugs was used to validate the prediction ability of the tree. Comparison of the results obtained from both trees showed that the tree constructed by the ACS-GA algorithm performs better than that produced by the recursive partitioning procedure.
Undergraduate Students’ Difficulties in Reading and Constructing Phylogenetic Tree
Sa'adah, S.; Tapilouw, F. S.; Hidayat, T.
2017-02-01
Representation is an important tool for communicating scientific concepts. Biologists produce phylogenetic representations to express their understanding of evolutionary relationships. A phylogenetic tree is a visual representation that depicts a hypothesis about evolutionary relationships and is widely used in the biological sciences. Its use is growing across many disciplines in biology. Consequently, learning about phylogenetic trees has become an important part of biological education and an interesting area for biology education research. However, research has shown that many students struggle to interpret the information that phylogenetic trees depict. The purpose of this study was to investigate undergraduate students’ difficulties in reading and constructing a phylogenetic tree. The study used a descriptive method, with questionnaires, interviews, multiple-choice and open-ended questions, reflective journals and observations. The findings showed that students experienced difficulties, especially in constructing a phylogenetic tree. The students’ responses indicated that the main difficulty in constructing a phylogenetic tree is placing taxa in the tree based on the data provided, so that the constructed tree does not describe the actual evolutionary relationship (incorrect relatedness). Students also had difficulty determining the sister group, synapomorphic and autapomorphic characters from the data provided (character table), and comparing phylogenetic trees. According to them, building a phylogenetic tree is more difficult than reading one. The findings of this study provide information to undergraduate instructors and students for overcoming learning difficulties in reading and constructing phylogenetic trees.
Shi, K-Q; Zhou, Y-Y; Yan, H-D; Li, H; Wu, F-L; Xie, Y-Y; Braddock, M; Lin, X-Y; Zheng, M-H
2017-02-01
At present, there is no ideal model for predicting the short-term outcome of patients with acute-on-chronic hepatitis B liver failure (ACHBLF). This study aimed to establish and validate a prognostic model by using the classification and regression tree (CART) analysis. A total of 1047 patients from two separate medical centres with suspected ACHBLF were screened in the study, which were recognized as derivation cohort and validation cohort, respectively. CART analysis was applied to predict the 3-month mortality of patients with ACHBLF. The accuracy of the CART model was tested using the area under the receiver operating characteristic curve, which was compared with the model for end-stage liver disease (MELD) score and a new logistic regression model. CART analysis identified four variables as prognostic factors of ACHBLF: total bilirubin, age, serum sodium and INR, and three distinct risk groups: low risk (4.2%), intermediate risk (30.2%-53.2%) and high risk (81.4%-96.9%). The new logistic regression model was constructed with four independent factors, including age, total bilirubin, serum sodium and prothrombin activity by multivariate logistic regression analysis. The performances of the CART model (0.896), similar to the logistic regression model (0.914, P=.382), exceeded that of MELD score (0.667, P<.001). The results were confirmed in the validation cohort. We have developed and validated a novel CART model superior to MELD for predicting three-month mortality of patients with ACHBLF. Thus, the CART model could facilitate medical decision-making and provide clinicians with a validated practical bedside tool for ACHBLF risk stratification.
Novel algorithm for constructing support vector machine regression ensemble
Li Bo; Li Xinjun; Zhao Zhiyan
2006-01-01
A novel algorithm for constructing a support vector machine regression ensemble is proposed. For regression prediction, a support vector machine regression (SVMR) ensemble is built by repeatedly resampling from the given training data set and aggregating several independent SVMRs, each of which is trained on a replicated training set. After training, the independently trained SVMRs need to be aggregated in an appropriate manner. Linear weighting is commonly used, like the expert weighting score in boosting regression, but it has no optimization capacity. Three combination techniques are proposed: simple arithmetic mean, linear least-squares error weighting, and nonlinear hierarchical combining, which uses another upper-layer SVMR to combine several lower-layer SVMRs. Finally, simulation experiments demonstrate the accuracy and validity of the presented algorithm.
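The first two combination techniques named in the abstract above can be sketched independently of the base learner. The example assumes each ensemble member has already produced a list of predictions on a common set of points; the function names `mean_combine` and `ls_weights` are hypothetical, and the least-squares weights are obtained here from the normal equations, which is one plausible reading of "linear least-squares error weighting" rather than the authors' exact formulation.

```python
from typing import List


def mean_combine(preds: List[List[float]]) -> List[float]:
    """Simple arithmetic mean of the members' predictions, point by point."""
    n_points = len(preds[0])
    return [sum(member[i] for member in preds) / len(preds) for i in range(n_points)]


def ls_weights(preds: List[List[float]], y: List[float]) -> List[float]:
    """Weights w minimizing sum_i (y_i - sum_k w_k * preds[k][i])**2.

    Solves the normal equations (P^T P) w = P^T y by Gauss-Jordan
    elimination with partial pivoting on the augmented matrix.
    """
    k = len(preds)
    aug = [
        [sum(a * b for a, b in zip(preds[i], preds[j])) for j in range(k)]
        + [sum(a * t for a, t in zip(preds[i], y))]
        for i in range(k)
    ]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("member predictions are linearly dependent")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(k):
            if row != col:
                factor = aug[row][col] / aug[col][col]
                aug[row] = [v - factor * c for v, c in zip(aug[row], aug[col])]
    return [aug[i][k] / aug[i][i] for i in range(k)]
```

The weights should be fitted on data the members were not trained on; otherwise the combination tends to favor whichever member overfits most. The third technique, hierarchical combining, would replace `ls_weights` with a second-level regressor trained on the members' predictions.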
Henrard, S; Speybroeck, N; Hermans, C
2015-11-01
Haemophilia is a rare genetic haemorrhagic disease characterized by partial or complete deficiency of coagulation factor VIII, for haemophilia A, or IX, for haemophilia B. As in any other medical research domain, the field of haemophilia research is increasingly concerned with finding factors associated with binary or continuous outcomes through multivariable models. Traditional models include multiple logistic regressions, for binary outcomes, and multiple linear regressions for continuous outcomes. Yet these regression models are at times difficult to implement, especially for non-statisticians, and can be difficult to interpret. The present paper sought to didactically explain how, why, and when to use classification and regression tree (CART) analysis for haemophilia research. The CART method is non-parametric and non-linear, based on the repeated partitioning of a sample into subgroups according to a certain criterion. Breiman developed this method in 1984. Classification trees (CTs) are used to analyse categorical outcomes and regression trees (RTs) to analyse continuous ones. The CART methodology has become increasingly popular in the medical field, yet only a few examples of studies using this methodology specifically in haemophilia have been published to date. Two previously published examples using CART analysis in this field are didactically explained in detail. There is increasing interest in using CART analysis in the health domain, primarily due to its ease of implementation, use, and interpretation, thus facilitating medical decision-making. This method should be promoted for analysing continuous or categorical outcomes in haemophilia, when applicable. © 2015 John Wiley & Sons Ltd.
Boosted Regression Tree Models to Explain Watershed Nutrient Concentrations and Biological Condition
Boosted regression tree (BRT) models were developed to quantify the nonlinear relationships between landscape variables and nutrient concentrations in a mesoscale mixed land cover watershed during base-flow conditions. Factors that affect instream biological components, based on ...
Combining an additive and tree-based regression model simultaneously: STIMA
Dusseldorp, E.; Conversano, C.; Os, B.J. van
2010-01-01
Additive models and tree-based regression models are two main classes of statistical models used to predict the scores on a continuous response variable. It is known that additive models become very complex in the presence of higher order interaction effects, whereas some tree-based models, such as
Kritski Afrânio
2006-02-01
Background: Smear negative pulmonary tuberculosis (SNPT) accounts for 30% of pulmonary tuberculosis cases reported yearly in Brazil. This study aimed to develop a prediction model for SNPT for outpatients in areas with scarce resources. Methods: The study enrolled 551 patients with clinical-radiological suspicion of SNPT, in Rio de Janeiro, Brazil. The original data were divided into two equivalent samples for generation and validation of the prediction models. Symptoms, physical signs and chest X-rays were used for constructing logistic regression and classification and regression tree models. From the logistic regression, we generated a clinical and radiological prediction score. The area under the receiver operator characteristic curve, sensitivity, and specificity were used to evaluate the model's performance in both generation and validation samples. Results: It was possible to generate predictive models for SNPT with sensitivity ranging from 64% to 71% and specificity ranging from 58% to 76%. Conclusion: The results suggest that those models might be useful as screening tools for estimating the risk of SNPT, optimizing the utilization of more expensive tests, and avoiding costs of unnecessary anti-tuberculosis treatment. Those models might be cost-effective tools in a health care network with hierarchical distribution of scarce resources.
Noor Zaitun Yahaya
2017-01-01
This paper investigated the use of boosted regression trees (BRTs) to draw inferences about daytime and nighttime ozone formation in a coastal environment. Hourly ground-level ozone data for the full calendar year 2010 were obtained from the Kemaman (CA 002) air quality monitoring station. A BRT model was developed using hourly ozone data as the response variable and nitric oxide (NO), nitrogen dioxide (NO2), nitrogen oxides (NOx) and meteorological parameters as explanatory variables. The ozone BRT algorithm model was constructed from multiple regression models, and the 'best iteration' of the BRT model was selected by optimizing prediction performance. Sensitivity testing of the BRT model was conducted to determine the best parameters and good explanatory variables. A number of trees between 2,500 and 3,500, a learning rate of 0.01, and an interaction depth of 5 were found to be the best settings for developing the ozone boosting model. The performance of the O3 boosting models was assessed: the fraction of predictions within a factor of two (FAC2), the coefficient of determination (R2) and the index of agreement (IOA) were 0.93, 0.69 and 0.73 for daytime and 0.79, 0.55 and 0.69 for nighttime, respectively. Results showed that the model developed was within the acceptable range and could be used to understand ozone formation and identify potential sources of ozone for estimating O3 concentrations during daytime and nighttime. Results indicated that wind speed, wind direction, relative humidity, and temperature were the most dominant variables influencing ozone formation. Finally, empirical evidence was obtained that high ozone levels are produced by wind blowing from coastal areas towards the interior region, especially from industrial areas.
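The three evaluation statistics quoted in the abstract above can be sketched as follows. The abstract does not state which definitions were used, so the choices here are assumptions: FAC2 as the fraction of predictions with a predicted-to-observed ratio between 0.5 and 2, R2 as the squared Pearson correlation, and IOA in Willmott's original form (a refined variant exists and gives different values). All function names are hypothetical.

```python
from typing import Sequence


def fac2(obs: Sequence[float], pred: Sequence[float]) -> float:
    """Fraction of pairs with 0.5 <= pred/obs <= 2.0.

    Pairs with a non-positive observation are counted as misses.
    """
    hits = sum(1 for o, p in zip(obs, pred) if o > 0 and 0.5 <= p / o <= 2.0)
    return hits / len(obs)


def r_squared(obs: Sequence[float], pred: Sequence[float]) -> float:
    """Squared Pearson correlation between observations and predictions."""
    n = len(obs)
    mean_o = sum(obs) / n
    mean_p = sum(pred) / n
    sxy = sum((o - mean_o) * (p - mean_p) for o, p in zip(obs, pred))
    sxx = sum((o - mean_o) ** 2 for o in obs)
    syy = sum((p - mean_p) ** 2 for p in pred)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy * sxy / (sxx * syy)


def index_of_agreement(obs: Sequence[float], pred: Sequence[float]) -> float:
    """Willmott's original index: 1 - sum((p-o)^2) / sum((|p-ō| + |o-ō|)^2)."""
    mean_o = sum(obs) / len(obs)
    num = sum((p - o) ** 2 for o, p in zip(obs, pred))
    den = sum((abs(p - mean_o) + abs(o - mean_o)) ** 2 for o, p in zip(obs, pred))
    if den == 0:
        raise ValueError("index is undefined when all values equal the mean")
    return 1.0 - num / den
```

Computing the three statistics separately for daytime and nighttime hours, as the abstract reports them, only requires splitting the paired observation and prediction lists by hour before calling these functions.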
Workload-Aware Tree Construction Algorithm for Wireless Sensor Networks
Kayiram Kavitha
2012-04-01
Wireless Sensor Networks (WSNs) play a vital role in applications such as disaster management and human relief, habitat monitoring, and the study of weather and ecosystems. Since these WSNs are usually deployed in remote locations, their source of energy is restricted to batteries. A significant amount of work has been done by researchers in the past to achieve energy efficiency in WSNs. In this paper we propose a scheme to optimize power utilization in a WSN. Many WSN applications form a tree topology for communication. In a WSN that adopts a tree topology, it is observed that the nodes at higher levels of the tree tend to consume more power than those at lower levels. In our proposed workload-aware query/result routing tree construction scheme, we construct the final routing tree by taking into account the workload of nodes at various levels of the tree. In this way, the tree is constructed so that the workload at each level is evenly distributed among the nodes at that level. The proposed approach not only increases the lifetime of the network, but also utilizes the battery power optimally. Simulation results show a considerable increase in the lifetime and effectiveness of the wireless sensor network as a result of applying our proposed tree construction technique.
The Complexity of Constructing Evolutionary Trees Using Experiments
Brodal, Gerth Stølting; Fagerberg, Rolf; Pedersen, Christian Nørgaard Storm
2001-01-01
We present tight upper and lower bounds for the problem of constructing evolutionary trees in the experiment model. We describe an algorithm which constructs an evolutionary tree of $n$ species in time $O(nd \log_d n)$ using at most $n\lceil d/2\rceil(\log_{2\lceil d/2\rceil-1} n + O(1))$ experiments for $d > 2$, and at most n(log n+...
Unifying constructal theory of tree roots, canopies and forests.
Bejan, A; Lorente, S; Lee, J
2008-10-07
Here, we show that the most basic features of tree and forest architecture can be put on a unifying theoretical basis, which is provided by the constructal law. Key is the integrative approach to understanding the emergence of "designedness" in nature. Trees and forests are viewed as integral components (along with dendritic river basins, aerodynamic raindrops, and atmospheric and oceanic circulation) of the much greater global architecture that facilitates the cyclical flow of water in nature (Fig. 1) and the flow of stresses between wind and ground. Theoretical features derived in this paper are: the tapered shape of the root and longitudinally uniform diameter and density of internal flow tubes, the near-conical shape of tree trunks and branches, the proportionality between tree length and wood mass raised to 1/3, the proportionality between total water mass flow rate and tree length, the proportionality between the tree flow conductance and the tree length scale raised to a power between 1 and 2, the existence of forest floor plans that maximize ground-air flow access, the proportionality between the length scale of the tree and its rank raised to a power between -1 and -1/2, and the inverse proportionality between the tree size and number of trees of the same size. This paper further shows that there exists an optimal ratio of leaf volume divided by total tree volume, trees of the same size must have a larger wood volume fraction in windy climates, and larger trees must pack more wood per unit of tree volume than smaller trees. Comparisons with empirical correlations and formulas based on ad hoc models are provided. This theory predicts classical notions such as Leonardo's rule, Huber's rule, Zipf's distribution, and the Fibonacci sequence. The difference between modeling (description) and theory (prediction) is brought into evidence.
Comparison of greedy algorithms for α-decision tree construction
Alkhalid, Abdulaziz
2011-01-01
A comparison among different heuristics used by greedy algorithms that construct approximate decision trees (α-decision trees) is presented. The comparison is conducted using decision tables based on 24 data sets from the UCI Machine Learning Repository [2]. The complexity of decision trees is estimated relative to several cost functions: depth, average depth, number of nodes, number of nonterminal nodes, and number of terminal nodes. Costs of trees built by greedy algorithms are compared with minimum costs calculated by an algorithm based on dynamic programming. The results of experiments assign to each cost function a set of potentially good heuristics that minimize it. © 2011 Springer-Verlag.
VanEngelsdorp, Dennis; Speybroeck, Niko; Evans, Jay D; Nguyen, Bach Kim; Mullin, Chris; Frazier, Maryann; Frazier, Jim; Cox-Foster, Diana; Chen, Yanping; Tarpy, David R; Haubruge, Eric; Pettis, Jeffrey S; Saegerman, Claude
2010-10-01
Colony collapse disorder (CCD), a syndrome whose defining trait is the rapid loss of adult worker honey bees, Apis mellifera L., is thought to be responsible for a minority of the large overwintering losses experienced by U.S. beekeepers since the winter 2006-2007. Using the same data set developed to perform a monofactorial analysis (PloS ONE 4: e6481, 2009), we conducted a classification and regression tree (CART) analysis in an attempt to better understand the relative importance and interrelations among different risk variables in explaining CCD. Fifty-five exploratory variables were used to construct two CART models: one model with and one model without a cost of misclassifying a CCD-diagnosed colony as a non-CCD colony. The resulting model tree that permitted for misclassification had a sensitivity and specificity of 85 and 74%, respectively. Although factors measuring colony stress (e.g., adult bee physiological measures, such as fluctuating asymmetry or mass of head) were important discriminating values, six of the 19 variables having the greatest discriminatory value were pesticide levels in different hive matrices. Notably, coumaphos levels in brood (a miticide commonly used by beekeepers) had the highest discriminatory value and were highest in control (healthy) colonies. Our CART analysis provides evidence that CCD is probably the result of several factors acting in concert, making afflicted colonies more susceptible to disease. This analysis highlights several areas that warrant further attention, including the effect of sublethal pesticide exposure on pathogen prevalence and the role of variability in bee tolerance to pesticides on colony survivorship.
A fast construction method for spatial index GBD-tree
Yukio Negishi; Yutaka Ohsawa; Satoshi Takazawa
2007-01-01
This paper proposes a fast initial construction method for the GBD-tree. The GBD-tree has suitable characteristics for managing large amounts of two- or three-dimensional data. However, the GBD-tree needs a long initial construction time with the originally proposed one-by-one insertion method. A fast insertion method has been proposed, but it needs a buffer large enough to hold the index information of all entries. This paper proposes another fast initial construction method that requires only a limited amount of work space (buffer). The experimental results show that the initial construction time is reduced to a third or a quarter of that of the one-by-one insertion method. Memory efficiency and retrieval efficiency are also improved over the one-by-one insertion method.
Prediction of cadmium enrichment in reclaimed coastal soils by classification and regression tree
Ru, Feng; Yin, Aijing; Jin, Jiaxin; Zhang, Xiuying; Yang, Xiaohui; Zhang, Ming; Gao, Chao
2016-08-01
Reclamation of coastal land is one of the most common ways to obtain land resources in China. However, it has long been acknowledged that the artificial interference with coastal land has disadvantageous effects, such as heavy metal contamination. This study aimed to develop a prediction model for cadmium enrichment levels and assess the importance of affecting factors in typical reclaimed land in Eastern China (DFCL: Dafeng Coastal Land). Two hundred and twenty seven surficial soil/sediment samples were collected and analyzed to identify the enrichment levels of cadmium and the possible affecting factors in soils and sediments. The classification and regression tree (CART) model was applied in this study to predict cadmium enrichment levels. The prediction results showed that cadmium enrichment levels assessed by the CART model had an accuracy of 78.0%. The CART model could extract more information on factors affecting the environmental behavior of cadmium than correlation analysis. The integration of correlation analysis and the CART model showed that fertilizer application and organic carbon accumulation were the most important factors affecting soil/sediment cadmium enrichment levels, followed by particle size effects (Al2O3, TFe2O3 and SiO2), contents of Cl and S, surrounding construction areas and reclamation history.
Incomplete meteorological data has been a problem in environmental modeling studies. The objective of this work was to develop a technique to reconstruct missing daily precipitation data in the central part of Chesapeake Bay Watershed using regression trees (RT) and artificial neural networks (ANN)....
Multi-site solar power forecasting using gradient boosted regression trees
Persson, Caroline Stougård; Bacher, Peder; Shiga, Takahiro
2017-01-01
generation and relevant meteorological variables related to 42 individual PV rooftop installations are used to train a gradient boosted regression tree (GBRT) model. When compared to single-site linear autoregressive and variations of GBRT models the multi-site model shows competitive results in terms...
Risk Factors of Falls in Community-Dwelling Older Adults: Logistic Regression Tree Analysis
Yamashita, Takashi; Noe, Douglas A.; Bailer, A. John
2012-01-01
Purpose of the Study: A novel logistic regression tree-based method was applied to identify fall risk factors and possible interaction effects of those risk factors. Design and Methods: A nationally representative sample of American older adults aged 65 years and older (N = 9,592) in the Health and Retirement Study 2004 and 2006 modules was used.…
Gmur, Stephan; Vogt, Daniel; Zabowski, Darlene; Moskal, L Monika
2012-01-01
The characterization of soil attributes using hyperspectral sensors has revealed patterns in soil spectra that are known to respond to mineral composition, organic matter, soil moisture and particle size distribution. Soil samples from different soil horizons of replicated soil series from sites located within Washington and Oregon were analyzed with the FieldSpec Spectroradiometer to measure their spectral signatures across the electromagnetic range of 400 to 1,000 nm. Similarity rankings of individual soil samples reveal differences between replicate series as well as samples within the same replicate series. Using classification and regression tree statistical methods, regression trees were fitted to each spectral response using concentrations of nitrogen, carbon, carbonate and organic matter as the response variables. Statistics resulting from fitted trees were: nitrogen R(2) 0.91 (p organic matter R(2) 0.98 (p organic matter for upper soil horizons in a nondestructive method.
Constructing Projection Frequent Pattern Tree for Efficient Mining
Xiang Jian-wen; He Yan-xiang; Kokichi Futatsugi; Kong Wei-qiang
2003-01-01
Frequent pattern mining plays an essential role in data mining. Most previous studies adopt an Apriori-like candidate set generation-and-test approach. However, candidate set generation is still costly, especially when there are prolific patterns and/or long patterns. In this study, we introduce a novel frequent pattern growth (FP-growth) method, which is efficient and scalable for mining both long and short frequent patterns without candidate generation. Building on it, we propose a new projection frequent pattern tree (PFP-tree) algorithm, which not only inherits all the advantages of the FP-growth method but also avoids its bottleneck of database-size dependence when constructing the frequent pattern tree (FP-tree). Mining efficiency is achieved by introducing the projection technique, which avoids serially scanning each frequent item in the database; the cost is mainly related to the depth of the tree, namely the number of frequent items in the longest transaction in the database, rather than the sum of all the frequent items in the database, which greatly shortens the tree-construction time. Our performance study shows that the PFP-tree method is efficient and scalable for mining large databases or data warehouses, and is even about an order of magnitude faster than the FP-growth method.
Allore, Heather; Tinetti, Mary E; Araujo, Katy L B; Hardy, Susan; Peduzzi, Peter
2005-02-01
Many important physiologic and clinical predictors are continuous. Clinical investigators and epidemiologists' interest in these predictors lies, in part, in the risk they pose for adverse outcomes, which may be continuous as well. The relationship between continuous predictors and a continuous outcome may be complex and difficult to interpret. Therefore, methods to detect levels of a predictor variable that predict the outcome and determine the threshold for clinical intervention would provide a beneficial tool for clinical investigators and epidemiologists. We present a case study using regression tree methodology to predict Social and Productive Activities score at 3 years using five modifiable impairments. The predictive ability of regression tree methodology was compared with multiple linear regression using two independent data sets, one for development and one for validation. The regression tree approach and the multiple linear regression model provided similar fit (model deviances) on the development cohort. In the validation cohort, the deviance of the multiple linear regression model was 31% greater than the regression tree approach. Regression tree analysis developed a better model of impairments predicting Social and Productive Activities score that may be more easily applied in research settings than multiple linear regression alone.
Naghibi, Seyed Amir; Pourghasemi, Hamid Reza; Dixon, Barnali
2016-01-01
Groundwater is considered one of the most valuable fresh water resources. The main objective of this study was to produce groundwater spring potential maps in the Koohrang Watershed, Chaharmahal-e-Bakhtiari Province, Iran, using three machine learning models: boosted regression tree (BRT), classification and regression tree (CART), and random forest (RF). Thirteen hydrological-geological-physiographical (HGP) factors that influence locations of springs were considered in this research. These factors include slope degree, slope aspect, altitude, topographic wetness index (TWI), slope length (LS), plan curvature, profile curvature, distance to rivers, distance to faults, lithology, land use, drainage density, and fault density. Subsequently, groundwater spring potential was modeled and mapped using CART, RF, and BRT algorithms. The predicted results from the three models were validated using the receiver operating characteristics curve (ROC). From 864 springs identified, 605 (≈70 %) locations were used for the spring potential mapping, while the remaining 259 (≈30 %) springs were used for the model validation. The area under the curve (AUC) for the BRT model was calculated as 0.8103 and for CART and RF the AUC were 0.7870 and 0.7119, respectively. Therefore, it was concluded that the BRT model produced the best prediction results while predicting locations of springs followed by CART and RF models, respectively. Geospatially integrated BRT, CART, and RF methods proved to be useful in generating the spring potential map (SPM) with reasonable accuracy.
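The AUC values used above to rank the three models can be computed without plotting the ROC curve. The sketch below uses the rank interpretation: AUC is the probability that a randomly chosen positive location (a spring) receives a higher score than a randomly chosen negative one, with ties counted as one half. The function name `auc` and the 0/1 label encoding are assumptions for illustration, not details taken from the study.

```python
from typing import Sequence


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison of scores.

    labels: 1 for positive cases, 0 for negative cases.
    scores: model output, higher meaning more likely positive.
    """
    pos = [s for lab, s in zip(labels, scores) if lab == 1]
    neg = [s for lab, s in zip(labels, scores) if lab == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))
```

The pairwise loop is quadratic in the number of cases, which is acceptable for a validation set of a few hundred points; a sort-based rank sum gives the same value for larger sets.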
Buchner, Florian; Wasem, Jürgen; Schillo, Sonja
2017-01-01
Risk equalization formulas have been refined since their introduction about two decades ago. Because of the complexity and the abundance of possible interactions between the variables used, hardly any interactions are considered. A regression tree is used to systematically search for interactions, a methodologically new approach in risk equalization. Analyses are based on a data set of nearly 2.9 million individuals from a major German social health insurer. A two-step approach is applied: In the first step a regression tree is built on the basis of the learning data set. Terminal nodes characterized by more than one morbidity-group-split represent interaction effects of different morbidity groups. In the second step the 'traditional' weighted least squares regression equation is expanded by adding interaction terms for all interactions detected by the tree, and regression coefficients are recalculated. The resulting risk adjustment formula shows an improvement in the adjusted R(2) from 25.43% to 25.81% on the evaluation data set. Predictive ratios are calculated for subgroups affected by the interactions. The R(2) improvement detected is only marginal. According to the sample level performance measures used, not involving a considerable number of morbidity interactions forms no relevant loss in accuracy. Copyright © 2015 John Wiley & Sons, Ltd.
Prediction of tissue-specific cis-regulatory modules using Bayesian networks and regression trees
Chen Xiaoyu
2007-12-01
Background: In vertebrates, a large part of gene transcriptional regulation is operated by cis-regulatory modules. These modules are believed to regulate much of the tissue-specificity of gene expression. Results: We develop a Bayesian network approach for identifying cis-regulatory modules likely to regulate tissue-specific expression. The network integrates predicted transcription factor binding site information, transcription factor expression data, and target gene expression data. At its core is a regression tree modeling the effect of combinations of transcription factors bound to a module. A new unsupervised EM-like algorithm is developed to learn the parameters of the network, including the regression tree structure. Conclusion: Our approach is shown to accurately identify known human liver and erythroid-specific modules. When applied to the prediction of tissue-specific modules in 10 different tissues, the network predicts a number of important transcription factor combinations whose concerted binding is associated with specific expression.
Tso, Geoffrey K.F.; Yau, Kelvin K.W. [City University of Hong Kong, Kowloon, Hong Kong (China). Department of Management Sciences]
2007-09-15
This study presents three modeling techniques for the prediction of electricity energy consumption. In addition to the traditional regression analysis, decision tree and neural networks are considered. Model selection is based on the square root of average squared error. In an empirical application to an electricity energy consumption study, the decision tree and neural network models appear to be viable alternatives to the stepwise regression model in understanding energy consumption patterns and predicting energy consumption levels. With the emergence of the data mining approach for predictive modeling, different types of models can be built in a unified platform: to implement various modeling techniques, assess the performance of different models and select the most appropriate model for future prediction.
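The selection criterion named above, the square root of the average squared error, can be sketched as follows. The observed values and the three prediction vectors are invented numbers used only to show the mechanics; they are not results from the study.

```python
import math


def rmse(actual, predicted):
    """Square root of the average squared error between two equal-length lists."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical hold-out observations and model predictions (illustrative only).
actual = [10.0, 12.0, 15.0, 11.0]
predictions = {
    "stepwise_regression": [11.0, 11.0, 14.0, 12.0],
    "decision_tree": [10.0, 12.5, 14.5, 11.0],
    "neural_network": [9.0, 13.0, 15.0, 10.0],
}

scores = {name: rmse(actual, pred) for name, pred in predictions.items()}
best = min(scores, key=scores.get)  # the model with the smallest error is selected
print(scores, best)
```

Computing the criterion on data held out from fitting is the usual design choice, since in-sample error tends to favour the more flexible models.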
Paap, M.C.S.; Eggen, T.J.H.M.; Veldkamp, B.P.
2012-01-01
Standardized tests often group items around a common stimulus. Such groupings of items are called testlets. The potential dependency among items within a testlet is generally ignored in practice, even though a basic assumption of item response theory (IRT) is that individual items are independent of one another. A technique called tree-based regression (TBR) was applied to identify key features of stimuli that could properly predict the dependence structure of testlet data. Knowledge about th...
Development of prognostic indicators using Classification And Regression Trees (CART) for survival
Nunn, Martha E.; Fan, Juanjuan; Su, Xiaogang; McGuire, Michael K.
2012-01-01
The development of an accurate prognosis is an integral component of treatment planning in the practice of periodontics. Prior work has evaluated the validity of using various clinical measured parameters for assigning periodontal prognosis as well as for predicting tooth survival and change in clinical conditions over time. We critically review the application of multivariate Classification And Regression Trees (CART) for survival in developing evidence-based periodontal prognostic indicator...
Nishanee Rampersad
2017-01-01
Background: Assessment of intraocular pressure (IOP) is an important test in glaucoma. In addition, anterior segment variables may be useful in screening for glaucoma risk. Studies have investigated the associations between IOP and anterior segment variables using traditional statistical methods. The classification and regression tree (CART) method provides another dimension to detect important variables in a relationship automatically. Aim: To identify the critical factors that influence IOP using a regression tree. Methods: A quantitative cross-sectional research design was used. Anterior segment variables were measured in 700 participants using the iVue100 optical coherence tomographer, Oculus Keratograph and Nidek US-500 ultrasonographer. A Goldmann applanation tonometer was used to measure IOP. Data from only the right eyes were analysed because of high levels of interocular symmetry. A regression tree model was generated with the CART method and Pearson’s correlation coefficients were used to assess the relationships between the ocular variables. Results: The mean IOP for the entire sample was 14.63 mmHg ± 2.40 mmHg. The CART method selected three anterior segment variables in the regression tree model. Central corneal thickness was the most important variable with a cut-off value of 527 µm. The other important variables included average paracentral corneal thickness and axial anterior chamber depth. Corneal thickness measurements increased towards the periphery and were significantly correlated with IOP (r ≥ 0.50, p ≤ 0.001). Conclusion: The CART method identified the anterior segment variables that influenced IOP. Understanding the relationship between IOP and anterior segment variables may help to clinically identify patients with ocular risk factors associated with elevated IOPs.
Lampa, Erik; Lind, Lars; Lind, Monica P.; Bornefalk-Hermansson, Anna
2014-01-01
Background: There is a need to evaluate complex interaction effects on human health, such as those induced by mixtures of environmental contaminants. The usual approach is to formulate an additive statistical model and check for departures using product terms between the variables of interest. In this paper, we present an approach to search for interaction effects among several variables using boosted regression trees. Methods: We simulate a continuous outcome from real data on 27 environment...
[Hyperspectral Estimation of Apple Tree Canopy LAI Based on SVM and RF Regression].
Han, Zhao-ying; Zhu, Xi-cun; Fang, Xian-yi; Wang, Zhuo-yuan; Wang, Ling; Zhao, Geng-Xing; Jiang, Yuan-mao
2016-03-01
Leaf area index (LAI) is a dynamic index of crop population size. Hyperspectral technology can be used to estimate apple canopy LAI rapidly and nondestructively, and can provide a reference for monitoring tree growth and estimating yield. Red Fuji apple trees in the full fruit-bearing period were the research objects. Canopy spectral reflectance and LAI values of ninety apple trees were measured with the ASD Fieldspec3 spectrometer and the LAI-2200 in thirty orchards over two consecutive years in the Qixia research area of Shandong Province. The optimal vegetation indices were selected by correlation analysis of the original spectral reflectance and vegetation indices. Models for predicting LAI were built with the multivariate regression methods of support vector machine (SVM) and random forest (RF). The new vegetation indices, GNDVI527, NDVI676, RVI682, FD-NVI656 and GRVI517, and the two main previous vegetation indices, NDVI670 and NDVI705, are in accordance with LAI. In the RF regression model, the calibration-set coefficient of determination (C-R²) of 0.920 and the validation-set coefficient of determination (V-R²) of 0.889 are higher than those of the SVM regression model by 0.045 and 0.033, respectively. The calibration-set root mean square error (C-RMSE) of 0.249 and the validation-set root mean square error (V-RMSE) of 0.236 are lower than those of the SVM regression model by 0.054 and 0.058, respectively. The relative analysis errors of the calibration set (C-RPD) and validation set (V-RPD) reached 3.363 and 2.520, which were higher than those of the SVM regression model by 0.598 and 0.262, respectively. The slopes of the trend lines in the measured-versus-predicted scatterplots of the calibration and validation sets (C-S and V-S) are close to 1. The estimation result of the RF regression model is better than that of the SVM model. The RF regression model can be used to estimate the LAI of Red Fuji apple trees in the full fruit period.
The Complexity of Constructing Evolutionary Trees Using Experiments
Brodal, Gerth Stølting; Fagerberg, Rolf; Pedersen, Christian N.S.;
2001-01-01
We present tight upper and lower bounds for the problem of constructing evolutionary trees in the experiment model. We describe an algorithm which constructs an evolutionary tree of n species in time O(n d log_d n) using at most n⌈d/2⌉(log_{2⌈d/2⌉-1} n + O(1)) experiments for d > 2, and at most n(log n + O(1)) experiments for d = 2, where d is the degree of the tree. This improves the previous best upper bound by a factor Θ(log d). For d = 2 the previously best algorithm with running time O(n log n) had a bound of 4n log n on the number of experiments. By an explicit adversary argument, we show an Ω(n d log_d n) lower bound, matching our upper bounds and improving the previous best lower bound by a factor Θ(log_d n). Central to our algorithm is the construction and maintenance of separator trees of small height, which may be of independent interest.
Gang WU
2016-01-01
Objective: To analyze the risk factors for prognosis in intracerebral hemorrhage using a decision tree (classification and regression tree, CART) model and a logistic regression model. Methods: A CART model and a logistic regression model were established according to the risk factors for prognosis of patients with cerebral hemorrhage. The differences in the results were compared between the two methods. Results: Logistic regression analyses showed that hematoma volume (OR-value 0.953), initial Glasgow Coma Scale (GCS) score (OR-value 1.210), pulmonary infection (OR-value 0.295), and basal ganglia hemorrhage (OR-value 0.336) were the risk factors for the prognosis of cerebral hemorrhage. The results of CART analysis showed that volume of hematoma and initial GCS score were the main factors affecting the prognosis of cerebral hemorrhage. The effects of the two models on the prognosis of cerebral hemorrhage were similar (Z-value 0.402, P=0.688). Conclusions: The CART model has a similar value to that of the logistic model in judging the prognosis of cerebral hemorrhage; it is characterized by using transactional analysis between the risk factors, and it is more intuitive. DOI: 10.11855/j.issn.0577-7402.2015.12.13
L. Monika Moskal
2012-08-01
The characterization of soil attributes using hyperspectral sensors has revealed patterns in soil spectra that are known to respond to mineral composition, organic matter, soil moisture and particle size distribution. Soil samples from different soil horizons of replicated soil series from sites located within Washington and Oregon were analyzed with the FieldSpec Spectroradiometer to measure their spectral signatures across the electromagnetic range of 400 to 1,000 nm. Similarity rankings of individual soil samples reveal differences between replicate series as well as samples within the same replicate series. Using classification and regression tree statistical methods, regression trees were fitted to each spectral response using concentrations of nitrogen, carbon, carbonate and organic matter as the response variables. Statistics resulting from fitted trees were: nitrogen R² 0.91 (p < 0.01) at 403, 470, 687, and 846 nm spectral band widths, carbonate R² 0.95 (p < 0.01) at 531 and 898 nm band widths, total carbon R² 0.93 (p < 0.01) at 400, 409, 441 and 907 nm band widths, and organic matter R² 0.98 (p < 0.01) at 300, 400, 441, 832 and 907 nm band widths. Use of the 400 to 1,000 nm electromagnetic range utilizing regression trees provided a powerful, rapid and inexpensive method for assessing nitrogen, carbon, carbonate and organic matter for upper soil horizons in a nondestructive method.
ENHANCED BIO-INSPIRED ALGORITHM FOR CONSTRUCTING PHYLOGENETIC TREE
J. Jayapriya
2015-10-01
This paper illustrates an enhanced algorithm based on one of the swarm intelligence techniques for constructing the Phylogenetic tree (PT), which is used to study the relationship between species. The main scheme is to formulate a PT, an NP-complete problem, through an evolutionary algorithm called Artificial Bee Colony (ABC). The tradeoff between the accuracy and the computational time taken for constructing the tree makes way for new variants of algorithms. A new variant of the ABC algorithm is proposed to promote the convergence rate of the general ABC algorithm through recommending a new formula for searching solutions. In addition, a searching step has been included so that it constructs the tree faster with a nearly optimal solution. Experimental results are compared with the ABC algorithm, Genetic Algorithm and state-of-the-art techniques like unweighted pair group method using arithmetic mean, Neighbour-joining and Relaxed Neighbor Joining. For the discussion of results, we used one of the standard datasets, Treesilla. The results show that the Enhanced ABC (EABC) algorithm converges faster than the others. The claim is supported by a distance metric called the Robinson-Foulds distance, which measures the dissimilarity of the PTs constructed by different algorithms.
E. Brito-Rocha
Individual leaf area (LA) is a key variable in studies of tree ecophysiology because it directly influences light interception, photosynthesis and evapotranspiration of adult trees and seedlings. We analyzed the leaf dimensions (length, L, and width, W) of seedlings and adults of seven Neotropical rainforest tree species (Brosimum rubescens, Manilkara maxima, Pouteria caimito, Pouteria torta, Psidium cattleyanum, Symphonia globulifera and Tabebuia stenocalyx) with the objective of testing the feasibility of single regression models to estimate LA of both adults and seedlings. In southern Bahia, Brazil, a first set of data was collected between March and October 2012. Of the seven species analyzed, only two (P. cattleyanum and T. stenocalyx) had very similar relationships between LW and LA in both ontogenetic stages. For these two species, a second set of data was collected in August 2014, in order to validate the single models encompassing adults and seedlings. Our results show the possibility of developing models for predicting individual leaf area encompassing different ontogenetic stages for tropical tree species. The development of these models was more dependent on the species than on the differences in leaf size between seedlings and adults.
Shokouh Taghipour Zahir
2013-01-01
Purpose. We sought to investigate the utility of a classification and regression trees (CART) classifier to differentiate benign from malignant nodules in patients referred for thyroid surgery. Methods. Clinical and demographic data of 271 patients referred to the Sadoughi Hospital during 2006–2011 were collected. In a two-step approach, a CART classifier was employed to differentiate patients with a high versus low risk of thyroid malignancy. The first step served as the screening procedure and was tailored to produce as few false negatives as possible. The second step identified those with the lowest risk of malignancy, chosen from a high risk population. Sensitivity, specificity, and positive and negative predictive values (PPV and NPV) of the optimal tree were calculated. Results. In the first step, age, sex, and nodule size contributed to the optimal tree. Ultrasonographic features were employed in the second step with hypoechogenicity and/or microcalcifications yielding the highest discriminatory ability. The combined tree produced a sensitivity and specificity of 80.0% (95% CI: 29.9–98.9) and 94.1% (95% CI: 78.9–99.0), respectively. NPV and PPV were 66.7% (41.1–85.6) and 97.0% (82.5–99.8), respectively. Conclusion. The CART classifier reliably identifies patients with a low risk of malignancy who can avoid unnecessary surgery.
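The four performance measures reported above follow directly from a 2×2 confusion table. The sketch below uses the standard definitions; the counts passed in are hypothetical and are not the study's data.

```python
def diagnostic_metrics(tp, fp, tn, fn):
    """Standard 2x2 diagnostic measures from confusion-table counts.

    tp/fn: positives classified correctly/incorrectly;
    tn/fp: negatives classified correctly/incorrectly.
    """
    return {
        "sensitivity": tp / (tp + fn),  # share of true positives detected
        "specificity": tn / (tn + fp),  # share of true negatives detected
        "ppv": tp / (tp + fp),          # how often a positive call is right
        "npv": tn / (tn + fn),          # how often a negative call is right
    }


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
m = diagnostic_metrics(tp=8, fp=2, tn=18, fn=2)
for name, value in m.items():
    print(f"{name}: {value:.1%}")
```

Sensitivity and specificity describe the classifier itself, whereas PPV and NPV also depend on how common the condition is in the sample, which is why all four are usually reported together.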
Automated construction of arterial and venous trees in retinal images.
Hu, Qiao; Abràmoff, Michael D; Garvin, Mona K
2015-10-01
While many approaches exist to segment retinal vessels in fundus photographs, only a limited number focus on the construction and disambiguation of arterial and venous trees. Previous approaches are local and/or greedy in nature, making them susceptible to errors or limiting their applicability to large vessels. We propose a more global framework to generate arteriovenous trees in retinal images, given a vessel segmentation. In particular, our approach consists of three stages. The first stage is to generate an overconnected vessel network, named the vessel potential connectivity map (VPCM), consisting of vessel segments and the potential connectivity between them. The second stage is to disambiguate the VPCM into multiple anatomical trees, using a graph-based metaheuristic algorithm. The third stage is to classify these trees into arterial or venous (A/V) trees. We evaluated our approach with a ground truth built based on a public database, showing a pixel-wise classification accuracy of 88.15% using a manual vessel segmentation as input, and 86.11% using an automatic vessel segmentation as input.
REGRESSION DEPENDENCE CONSTRUCTION METHODOLOGY FOR TRACTION CURVES USING LEAST SQUARE METHOD
V. Ravino
2013-01-01
The paper presents a methodology that makes it possible to construct regression dependences for traction curves of various tractors under different operational backgrounds. The dependence construction process is carried out with the help of Microsoft Excel.
M. Saki
2013-03-01
The relationship between plant species and environmental factors has always been a central issue in plant ecology. With the rising power of statistical techniques, geo-statistics and geographic information systems (GIS), the development of predictive habitat distribution models of organisms has rapidly increased in ecology. This study aimed to evaluate the ability of the Logistic Regression Tree (LRT) model to create a potential habitat map of Astragalus verus. This species produces Tragacanth and has economic value. Stratified random sampling was applied to 100 sites (50 presence, 50 absence) of the given species, and maps of environmental and edaphic factors were produced using Kriging and Inverse Distance Weighting methods in the ArcGIS software for the whole study area. Relationships between species occurrence and environmental factors were determined by the Logistic Regression Tree model and extended to the whole study area. The results indicated that species occurrence has a strong correlation with environmental factors such as mean daily temperature and the clay, EC and organic carbon content of the soil. Species occurrence showed a direct relationship with mean daily temperature, clay and organic carbon, and an inverse relationship with EC. Model accuracy was evaluated both by Cohen’s kappa statistic (κ) and by the area under the Receiver Operating Characteristics curve based on an independent test data set. Their values (kappa=0.9, AUC of ROC=0.96) indicated the high power of LRT to create potential habitat maps on local scales. This model, therefore, can be applied to recognize potential sites for rangeland reclamation projects.
Construction of the flow rate nomogram using polynomial regression.
Hosmane, B; Maurath, C; McConnell, M
1993-04-01
The urinary flow rates of normal individuals depend on the initial bladder volume in a non-linear fashion (J. Urol. 109 (1973) 874). A flow rate nomogram was developed by Siroky, Olsson and Krane (J. Urol. 122 (1979) 665), taking the non-linear relationship into account, as an aid in the interpretation of urinary flow rate data. The flow rate nomogram is used to differentiate normal from obstructed individuals and is useful in the post-operative follow-up of urinary outflow obstruction. It has been shown (J. Urol. 123 (1980) 123) that the flow rate nomogram is an objective measure of the efficacy of medical or surgical therapy. Instead of manually reading nomogram values from the flow rate nomogram, an algorithm is developed using polynomial regression to fit the flow rate nomograms and hence compute nomogram values directly from the fitted nomogram equations.
López-Serrano PM
2016-04-01
The Sierra Madre Occidental mountain range (Durango, Mexico) is of great ecological interest because of the high degree of environmental heterogeneity in the area. The objective of the present study was to estimate the biomass of mixed and uneven-aged forests in the Sierra Madre Occidental by using Landsat-5 TM spectral data and forest inventory data. We used the ATCOR3® atmospheric and topographic correction module to convert remotely sensed imagery digital signals to surface reflectance values. The usual approach of modeling stand variables by using multiple linear regression was compared with a hybrid model developed in two steps: in the first step a regression tree was used to obtain an initial classification of homogeneous biomass groups, and multiple linear regression models were then fitted to each node of the pruned regression tree. Cross-validation of the hybrid model explained 72.96% of the observed stand biomass variation, with a reduction in the RMSE of 25.47% with respect to the estimates yielded by the linear model fitted to the complete database. The most important variables for the binary classification process in the regression tree were the albedo, the corrected readings of the short-wave infrared band of the satellite (2.08-2.35 µm) and the topographic moisture index. We used the model output to construct a map for estimating biomass in the study area, which yielded values of between 51 and 235 Mg ha⁻¹. The use of regression trees in combination with stepwise regression of corrected satellite imagery proved a reliable method for estimating forest biomass.
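The two-step hybrid idea (partition with a tree, then fit a linear model inside each node) can be sketched in a heavily simplified form. The sketch assumes a single predictor, a single split (a "stump") and simple rather than multiple linear regression; the data are synthetic, and none of this reproduces the study's variables or pruning procedure.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:  # constant predictor: fall back to the mean
        return my, 0.0
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - b * mx, b


def sse(ys):
    """Sum of squared deviations from the mean (node impurity)."""
    m = mean(ys)
    return sum((y - m) ** 2 for y in ys)


def best_split(xs, ys, min_leaf=2):
    """Threshold on x minimising the summed impurity of the two child nodes."""
    best = None
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        if len(left) < min_leaf or len(right) < min_leaf:
            continue
        score = sse(left) + sse(right)
        if best is None or score < best[0]:
            best = (score, t)
    if best is None:
        raise ValueError("no admissible split")
    return best[1]


def fit_hybrid(xs, ys):
    """Step 1: one tree split. Step 2: a separate line in each child node."""
    t = best_split(xs, ys)
    left = [(x, y) for x, y in zip(xs, ys) if x <= t]
    right = [(x, y) for x, y in zip(xs, ys) if x > t]
    return t, fit_line(*zip(*left)), fit_line(*zip(*right))


def predict(model, x):
    t, (a_l, b_l), (a_r, b_r) = model
    return a_l + b_l * x if x <= t else a_r + b_r * x


# Synthetic data with two regimes: y = 2x for x <= 4, y = 20 - x above.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2, 4, 6, 8, 15, 14, 13, 12]
model = fit_hybrid(xs, ys)
print(model, predict(model, 3), predict(model, 7))
```

A single global line would fit these two regimes poorly, which is the motivation for letting the tree define homogeneous groups before any regression is fitted.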
Tejera-Vaquerizo, A; Martín-Cuevas, P; Gallego, E; Herrera-Acosta, E; Traves, V; Herrera-Ceballos, E; Nagore, E
2015-04-01
The main aim of this study was to identify predictors of sentinel lymph node (SN) metastasis in cutaneous melanoma. This was a retrospective cohort study of 818 patients in 2 tertiary-level hospitals. The primary outcome variable was SN involvement. Independent predictors were identified using multiple logistic regression and a classification and regression tree (CART) analysis. Ulceration, tumor thickness, and a high mitotic rate (≥6 mitoses/mm²) were independently associated with SN metastasis in the multiple regression analysis. The most important predictor in the CART analysis was Breslow thickness. Absence of an inflammatory infiltrate, patient age, and tumor location were predictive of SN metastasis in patients with tumors thicker than 2 mm. In the case of thinner melanomas, the predictors were mitotic rate (>6 mitoses/mm²), presence of ulceration, and tumor thickness. Patient age, mitotic rate, and tumor thickness and location were predictive of survival. A high mitotic rate predicts a higher risk of SN involvement and worse survival. CART analysis improves the prediction of regional metastasis, resulting in better clinical management of melanoma patients. It may also help select suitable candidates for inclusion in clinical trials. Copyright © 2014 Elsevier España, S.L.U. and AEDV. All rights reserved.
Kwon, Y.
2013-12-01
As evidence of global warming continues to increase, being able to predict forest response to climate changes, such as the expected rise in temperature and precipitation, will be vital for maintaining the sustainability and productivity of forests. Mapping forest species redistribution under climate change scenarios has been successful; however, most species redistribution maps lack the mechanistic understanding needed to explain why trees grow under the novel conditions of a changing climate. A distributional map can only predict under the equilibrium assumption that the communities would exist following a prolonged period under the new climate. In this context, forest NPP as a surrogate for growth rate, the most important facet that determines stand dynamics, can lead to valid predictions of the transition stage to a new vegetation-climate equilibrium, as it represents changes in forest structure reflecting site conditions and climate factors. The objective of this study is to develop a forest growth map using regression tree analysis by extracting large-scale non-linear structures from both field-based FIA and remotely sensed MODIS data sets. The major issue addressed in this approach is the non-linear spatial patterns of forest attributes. Forest inventory data showed complex spatial patterns that reflect environmental states and processes that originate at different spatial scales. At broad scales, non-linear spatial trends in forest attributes and a mixture of continuous and discrete types of environmental variables make traditional statistical (multivariate regression) and geostatistical (kriging) models inefficient. This calls into question some traditional underlying assumptions about spatial trends that are uncritically accepted for forest data. To resolve the controversy surrounding the suitability of forest data, regression tree analysis is performed using the software See5 and Cubist. Four publicly available data sets were obtained: First, field-based Forest Inventory and Analysis (USDA
Lazaridis, D.C.; Verbesselt, J.; Robinson, A.P.
2011-01-01
Constructing models can be complicated when the available fitting data are highly correlated and of high dimension. However, the complications depend on whether the goal is prediction instead of estimation. We focus on predicting tree mortality (measured as the number of dead trees) from change metr
Constructal Law of Vascular Trees for Facilitation of Flow
Razavi, Mohammad S.; Shirani, Ebrahim; Salimpour, Mohammad Reza; Kassab, Ghassan S.
2014-01-01
Diverse tree structures such as blood vessels, branches of a tree and river basins exist in nature. The constructal law states that the evolution of flow structures in nature has a tendency to facilitate flow. This study suggests a theoretical basis for evaluation of flow facilitation within vascular structure from the perspective of evolution. A novel evolution parameter (Ev) is proposed to quantify the flow capacity of vascular structures. Ev is defined as the ratio of the flow conductance of an evolving structure (configuration with imperfection) to the flow conductance of structure with least imperfection. Attaining higher Ev enables the structure to expedite flow circulation with less energy dissipation. For both Newtonian and non-Newtonian fluids, the evolution parameter was developed as a function of geometrical shape factors in laminar and turbulent fully developed flows. It was found that the non-Newtonian or Newtonian behavior of fluid as well as flow behavior such as laminar or turbulent behavior affects the evolution parameter. Using measured vascular morphometric data of various organs and species, the evolution parameter was calculated. The evolution parameter of the tree structures in biological systems was found to be in the range of 0.95 to 1. The conclusion is that various organs in various species have high capacity to facilitate flow within their respective vascular structures. PMID:25551617
Hermanek, P; Guggenmoos-Holzmann, I
1994-01-01
A total of 961 patients who had received resective surgery for gastric carcinoma were grouped according to prognosis by classification and regression trees (CART). This grouping was compared to the present UICC stage grouping. For patients resected for cure (R0) the CART approach allows a better discrimination of patients with poor prognosis (5-year survival rates 15%-30%) from patients with a 5-year survival of 50%, on the one hand, and from patients with extremely poor prognosis (5-year survival rates below 5%) on the other. In the present investigation CART grouping was not influenced by the differentiation between pT1 and pT2 or between pT3 and pT4.
Bou Kheir, Rania; Greve, Mogens Humlekrog; Deroin, Jean-Paul
2013-01-01
Soil contamination by heavy metals has become a widespread dangerous problem in many parts of the world, including the Mediterranean environments. This is closely related to the increased irrigation by waste waters, to the uncontrolled application of sewage sludge, industrial effluents, pesticides...... coastal area situated in northern Lebanon using a geographic information system (GIS) and regression-tree analysis. The chosen area represents a typical case study of Mediterranean coastal landscape with deteriorating environment. Fifteen environmental parameters (parent material, soil type, pH, hydraulic conductivity, organic matter, stoniness ratio, soil depth, slope gradient, slope aspect, slope curvature, land cover/use, distance to drainage line, proximity to roads, nearness to cities, and surroundings to waste areas) were generated from satellite imageries, Digital Elevation Models (DEMs...
Rothwell, James J; Futter, Martyn N; Dise, Nancy B
2008-11-01
Often, there is a non-linear relationship between atmospheric dissolved inorganic nitrogen (DIN) input and DIN leaching that is poorly captured by existing models. We present the first application of the non-parametric classification and regression tree approach to evaluate the key environmental drivers controlling DIN leaching from European forests. DIN leaching was classified as low (15 kg N ha⁻¹ year⁻¹) at 215 sites across Europe. The analysis identified throughfall NO₃⁻ deposition, acid deposition, hydrology, soil type, the carbon content of the soil, and the legacy of historic N deposition as the dominant drivers of DIN leaching for these forests. Ninety-four percent of sites were successfully classified into the appropriate leaching category. This approach shows promise for understanding complex ecosystem responses to a wide range of anthropogenic stressors as well as an improved method for identifying risk and targeting pollution mitigation strategies in forest ecosystems.
Prediction of cannabis and cocaine use in adolescence using decision trees and logistic regression
Alfonso L. Palmer
2010-01-01
Spain is one of the European countries with the highest prevalence of cannabis and cocaine use among young people. The aim of this study was to investigate the factors related to the consumption of cocaine and cannabis among adolescents. A questionnaire was administered to 9,284 students between 14 and 18 years of age in Palma de Mallorca (47.1% boys and 52.9% girls) whose mean age was 15.59 years. Logistic regression and decision trees were used to model the consumption of cannabis and cocaine. The results show that the use of legal substances and committing fraud or theft are the main variables that raise the odds of consuming cannabis. In boys, cannabis consumption and a family history of drug use increase the odds of consuming cocaine, whereas in girls the use of alcohol, fraud or theft behaviours and difficulty with some personal skills influence the odds of consuming cocaine. Finally, ease of access to the substance greatly raises the odds of consuming cocaine and cannabis in both genders. Decision trees highlight the role of consuming other substances and committing fraud or theft. The results of this study gain importance when it comes to putting effective prevention programmes into practice.
Donmez, Cenk; Berberoglu, Suha; Erdogan, Mehmet Akif; Tanriover, Anil Akin; Cilek, Ahmet
2015-02-01
Percent tree cover is the percentage of the ground surface area covered by a vertical projection of the outermost perimeter of the plants. It is an important indicator of the condition of forest systems and a main input for ecosystem models. The aim of this study is to estimate the percent tree cover of various forest stands in a Mediterranean environment based on an empirical relationship between tree coverage and remotely sensed data in Goksu Watershed, located at the Eastern Mediterranean coast of Turkey. A regression tree (RT) algorithm was used to simulate spatial fractions of Pinus nigra, Cedrus libani, Pinus brutia, Juniperus excelsa and Quercus cerris using multi-temporal LANDSAT TM/ETM data as predictor variables and land cover information. Two scenes of high resolution GeoEye-1 images were employed for training and testing the model. The predictor variables were incorporated in addition to biophysical variables estimated from the LANDSAT TM/ETM data. Additionally, the normalised difference vegetation index (NDVI) was added to the LANDSAT TM/ETM band settings as a biophysical variable. Stepwise linear regression (SLR) was applied for selecting the relevant bands to employ in the regression tree process. SLR-selected variables produced accurate results in the model with a high correlation coefficient of 0.80. The output values ranged from 0 to 100%. The different tree species were mapped at 30 m resolution with respect to elevation. A percent tree cover map was derived as the final output using the LANDSAT TM/ETM image over Goksu Watershed and the biophysical variables. The results were tested using high spatial resolution GeoEye-1 images. Thus, the combination of the RT algorithm and higher resolution data for percent tree cover mapping was tested and examined in a complex Mediterranean environment.
Souadka Amine
2010-04-01
Background: The incidence of liver hydatid cyst (LHC) rupture ranges from 15% to 40% of all cases, and most ruptures involve the bile duct tree. Patients with biliocystic communication (BCC) have specific clinical and therapeutic aspects. The purpose of this study was to determine which patients with LHC may develop BCC using classification and regression tree (CART) analysis. Methods: A retrospective study of 672 patients with liver hydatid cyst treated at the surgery department "A" at Ibn Sina University Hospital, Rabat, Morocco. Fourteen risk factors for BCC occurrence were entered into CART analysis to build an algorithm that best predicts the occurrence of BCC. Results: The incidence of BCC was 24.5%. The high-risk subgroups were patients with jaundice and thick pericyst (risk of 73.2%) and patients with thick pericyst, no jaundice, aged 36.5 years or younger and with no past history of LHC (risk of 40.5%). The developed CART model had a sensitivity of 39.6%, a specificity of 93.3%, a positive predictive value of 65.6%, a negative predictive value of 82.6% and a classification accuracy of 80.1%. The discriminating ability of the model was good (82%). Conclusion: We developed a simple classification tool to identify LHC patients at high risk of BCC during a routine clinic visit (based only on clinical history and examination followed by ultrasonography). Predictive factors were based on pericyst aspect, jaundice, age, past history of liver hydatidosis and morphological Gharbi cyst aspect. We think that this classification can be useful for directing patients to appropriate medical structures.
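The performance figures quoted for the CART model (sensitivity, specificity, predictive values, accuracy) all derive from the four cells of a confusion matrix. The following is a minimal sketch of those definitions; the function name and the counts are hypothetical and chosen only for illustration, they are not the study's data.

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Return standard screening metrics from confusion-matrix counts.

    tp/fn: patients with the condition predicted positive/negative.
    fp/tn: patients without the condition predicted positive/negative.
    """
    return {
        "sensitivity": tp / (tp + fn),   # share of true cases that are detected
        "specificity": tn / (tn + fp),   # share of non-cases correctly cleared
        "ppv": tp / (tp + fp),           # positive predictive value
        "npv": tn / (tn + fn),           # negative predictive value
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }


# Hypothetical counts for illustration only (not taken from the abstract).
metrics = diagnostic_metrics(tp=40, fp=10, fn=60, tn=190)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

A pattern like the one reported (low sensitivity with high specificity) means the rule rarely raises false alarms but misses a large share of true cases.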
Koestel, John; Bechtold, Michel; Jorda, Helena; Jarvis, Nicholas
2015-04-01
The saturated and near-saturated hydraulic conductivity of soil is of key importance for modelling water and solute fluxes in the vadose zone. Hydraulic conductivity measurements are cumbersome at the Darcy scale and practically impossible at larger scales where water and solute transport models are mostly applied. Hydraulic conductivity must therefore be estimated from proxy variables. Such pedotransfer functions are known to work decently well for e.g. water retention curves but rather poorly for near-saturated and saturated hydraulic conductivities. Recently, Weynants et al. (2009, Revisiting Vereecken pedotransfer functions: Introducing a closed-form hydraulic model. Vadose Zone Journal, 8, 86-95) reported a coefficient of determination of 0.25 (validation with an independent data set) for the saturated hydraulic conductivity from lab-measurements of Belgian soil samples. In our study, we trained boosted regression trees on a global meta-database containing tension-disk infiltrometer data (see Jarvis et al. 2013. Influence of soil, land use and climatic factors on the hydraulic conductivity of soil. Hydrology & Earth System Sciences, 17, 5185-5195) to predict the saturated hydraulic conductivity (Ks) and the conductivity at a tension of 10 cm (K10). We found coefficients of determination of 0.39 and 0.62 under a simple 10-fold cross-validation for Ks and K10. When carrying out the validation folded over the data-sources, i.e. the source publications, we found that the corresponding coefficients of determination reduced to 0.15 and 0.36, respectively. We conclude that the stricter source-wise cross-validation should be applied in future pedotransfer studies to prevent overly optimistic validation results. The boosted regression trees also allowed for an investigation of relevant predictors for estimating the near-saturated hydraulic conductivity. We found that land use and bulk density were most important to predict Ks. We also observed that Ks is large in fine
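The difference between simple k-fold and source-wise cross-validation described in this abstract can be sketched as follows. This is an illustrative, standard-library-only splitter, not the authors' code; the function name `source_wise_folds` and the example source labels are hypothetical. The key property is that all samples from one source publication land in the same fold, so a model is never tested on a source it was trained on.

```python
from collections import defaultdict


def source_wise_folds(sources, n_folds):
    """Yield (train_idx, test_idx) pairs with no source split across both sets."""
    by_source = defaultdict(list)
    for idx, src in enumerate(sources):
        by_source[src].append(idx)
    # Assign whole sources to folds, largest first, always to the smallest fold.
    folds = [[] for _ in range(n_folds)]
    for src in sorted(by_source, key=lambda s: (-len(by_source[s]), str(s))):
        min(folds, key=len).extend(by_source[src])
    for k in range(n_folds):
        test = sorted(folds[k])
        train = sorted(i for j, fold in enumerate(folds) if j != k for i in fold)
        yield train, test


# Hypothetical source labels: one entry per sample.
sources = ["paperA", "paperA", "paperB", "paperC", "paperC", "paperC", "paperD"]
splits = list(source_wise_folds(sources, n_folds=2))
for train, test in splits:
    print("train:", train, "test:", test)
```

With a plain k-fold split, samples from the same publication would typically appear on both sides, which is one plausible reason the simple cross-validation scores in the abstract are higher than the source-wise ones.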
Risk assessment of dental caries by using Classification and Regression Trees.
Ito, Ataru; Hayashi, Mikako; Hamasaki, Toshimitsu; Ebisu, Shigeyuki
2011-06-01
Being able to predict an individual's risks of dental caries would offer a potentially huge natural step forward toward better oral health. As things stand, preventive treatment against caries is mostly carried out without risk assessment because there is no proven way to analyse an individual's risk factors. The purpose of this study was to try to identify those patients with high and low risk of caries by using Classification and Regression Trees (CART). In this historical cohort study, data from 442 patients in a general practice who met the inclusion criteria were analysed. CART was applied to the data to seek a model for predicting caries by using the following parameters for each patient: age, number of carious teeth, numbers of cariogenic bacteria, the secretion rate and buffer capacity of saliva, and compliance with a prevention programme. The risks of caries were presented as odds ratios. Multiple logistic regression analysis was performed to confirm the results obtained by CART. CART identified high and low risk patients for primary caries with relative odds ratios of 0.41 (95%CI: 0.22-0.77, p = 0.0055) and 2.88 (95%CI: 1.49-5.59, p = 0.0018) according to the numbers of cariogenic bacteria. High and low risk patients for secondary caries were also identified with the odds ratios of 0.07 (95%CI: 0.01-0.55, p = 0.00109) and 7.00 (95%CI: 3.50-13.98, p caries. Cariogenic bacteria play a leading role in the incidence of caries. CART proved effective in identifying an individual patient's risk of caries. Copyright © 2011 Elsevier Ltd. All rights reserved.
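Odds ratios with 95% confidence intervals, as reported in this abstract, are commonly computed from a 2x2 table with the Wald (log-scale) interval. The sketch below assumes that standard formula; whether the authors used exactly this method is not stated, and the counts are hypothetical.

```python
import math


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio and Wald confidence interval from a 2x2 table.

    a, b: group of interest with / without the outcome.
    c, d: comparison group with / without the outcome.
    z: normal quantile (1.96 for an approximate 95% interval).
    """
    odds_ratio = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # SE of ln(OR)
    lower = math.exp(math.log(odds_ratio) - z * se_log)
    upper = math.exp(math.log(odds_ratio) + z * se_log)
    return odds_ratio, lower, upper


# Hypothetical counts for illustration only (not the study's data).
or_value, ci_low, ci_high = odds_ratio_ci(a=20, b=10, c=10, d=20)
print(f"OR = {or_value:.2f}, 95% CI: {ci_low:.2f}-{ci_high:.2f}")
```

An interval that excludes 1, as in the ranges quoted above, indicates that the difference in odds between the subgroups is unlikely to be due to chance alone.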
Differential Diagnosis of Erythmato-Squamous Diseases Using Classification and Regression Tree
Maghooli, Keivan; Langarizadeh, Mostafa; Shahmoradi, Leila; Habibi-koolaee, Mahdi; Jebraeily, Mohamad; Bouraghi, Hamid
2016-01-01
Introduction: Differential diagnosis of Erythmato-Squamous Diseases (ESD) is a major challenge in the field of dermatology. The ESD diseases are placed into six different classes. Data mining is the process of detecting hidden patterns. In the case of ESD, data mining helps us to predict the diseases. Different algorithms have been developed for this purpose. Objective: We aimed to use the Classification and Regression Tree (CART) to predict the differential diagnosis of ESD. Methods: We used the Cross Industry Standard Process for Data Mining (CRISP-DM) methodology. For this purpose, the dermatology data set was obtained from the UCI machine learning repository. The Clementine 12.0 software from IBM was used for modelling. To evaluate the model, we calculated its accuracy, sensitivity and specificity. Results: The proposed model had an accuracy of 94.84% (Standard Deviation: 24.42) in correctly predicting the ESD disease. Conclusions: The results indicated that this classifier could be useful. However, combining machine learning methods is strongly recommended, as it could be more useful for the prediction of ESD. PMID:28077889
Josef Smolle
2001-01-01
Objective: To evaluate the feasibility of the CART (Classification and Regression Tree) procedure for the recognition of microscopic structures in tissue counter analysis. Methods: Digital microscopic images of H&E stained slides of normal human skin and of primary malignant melanoma were overlaid with regularly distributed square measuring masks (elements), and grey value, texture and colour features within each mask were recorded. In the learning set, elements were interactively labeled as representing either connective tissue of the reticular dermis, other tissue components or background. Subsequently, CART models were built on these data sets. Results: Implementation of the CART classification rules into the image analysis program showed that in an independent test set 94.1% of elements classified as connective tissue of the reticular dermis were correctly labeled. Automated measurements of the total amount of tissue and of the amount of connective tissue within a slide showed high reproducibility (r=0.97 and r=0.94, respectively; p < 0.001). Conclusions: The CART procedure in tissue counter analysis yields simple and reproducible classification rules for tissue elements.
Development of prognostic indicators using Classification And Regression Trees (CART) for survival
Nunn, Martha E.; Fan, Juanjuan; Su, Xiaogang; McGuire, Michael K.
2014-01-01
The development of an accurate prognosis is an integral component of treatment planning in the practice of periodontics. Prior work has evaluated the validity of using various clinically measured parameters for assigning periodontal prognosis as well as for predicting tooth survival and change in clinical conditions over time. We critically review the application of multivariate Classification And Regression Trees (CART) for survival in developing evidence-based periodontal prognostic indicators. We focus attention on two distinct methods of multivariate CART for survival: the marginal goodness-of-fit approach, and the multivariate exponential approach. A number of common clinical measures have been found to be significantly associated with tooth loss from periodontal disease, including furcation involvement, probing depth, mobility, crown-to-root ratio, and oral hygiene. However, the inter-relationships among these measures, as well as the relevance of other clinical measures to tooth loss from periodontal disease (such as bruxism, family history of periodontal disease, and overall bone loss), remain less clear. While inferences drawn from any single current study are necessarily limited, the application of new approaches in epidemiologic analyses to periodontal prognosis, such as CART for survival, should yield important insights into our understanding, and treatment, of periodontal diseases. PMID:22133372
Podio, Natalia S; López-Froilán, Rebeca; Ramirez-Moreno, Esther; Bertrand, Lidwina; Baroni, María V; Pérez-Rodríguez, María L; Sánchez-Mata, María-Cortes; Wunderlin, Daniel A
2015-11-01
The aim of this study was to evaluate changes in polyphenol profile and antioxidant capacity of five soluble coffees throughout a simulated gastro-intestinal digestion, including absorption through a dialysis membrane. Our results demonstrate that both polyphenol content and antioxidant capacity were characteristic for each type of studied coffee, showing a drop after dialysis. Twenty-seven compounds were identified in coffee by HPLC-MS, while only 14 of them were found after dialysis. Green+roasted coffee blend and chicory+coffee blend showed the highest and lowest content of polyphenols and antioxidant capacity before in vitro digestion and after dialysis, respectively. Canonical correlation analysis showed significant correlation between the antioxidant capacity and the polyphenol profile before digestion and after dialysis. Furthermore, boosted regression trees analysis (BRT) showed that only four polyphenol compounds (5-p-coumaroylquinic acid, quinic acid, coumaroyl tryptophan conjugated, and 5-O-caffeoylquinic acid) appear to be the most relevant to explain the antioxidant capacity after dialysis, these compounds being the most bioaccessible after dialysis. To our knowledge, this is the first report matching the antioxidant capacity of foods with the polyphenol profile by BRT, which opens an interesting method of analysis for future reports on the antioxidant capacity of foods.
Estimating carbon and showing impacts of drought using satellite data in regression-tree models
Boyte, Stephen; Wylie, Bruce K.; Howard, Danny; Dahal, Devendra; Gilmanov, Tagir G.
2018-01-01
Integrating spatially explicit biogeophysical and remotely sensed data into regression-tree models enables the spatial extrapolation of training data over large geographic spaces, allowing a better understanding of broad-scale ecosystem processes. The current study presents annual gross primary production (GPP) and annual ecosystem respiration (RE) for 2000–2013 in several short-statured vegetation types using carbon flux data from towers that are located strategically across the conterminous United States (CONUS). We calculate carbon fluxes (annual net ecosystem production [NEP]) for each year in our study period, which includes 2012 when drought and higher-than-normal temperatures influence vegetation productivity in large parts of the study area. We present and analyse carbon flux dynamics in the CONUS to better understand how drought affects GPP, RE, and NEP. Model accuracy metrics show strong correlation coefficients (r) (r ≥ 94%) between training and estimated data for both GPP and RE. Overall, average annual GPP, RE, and NEP are relatively constant throughout the study period except during 2012 when almost 60% less carbon is sequestered than normal. These results allow us to conclude that this modelling method effectively estimates carbon dynamics through time and allows the exploration of impacts of meteorological anomalies and vegetation types on carbon dynamics.
Prioritizing Highway Safety Manual's crash prediction variables using boosted regression trees.
Saha, Dibakar; Alluri, Priyanka; Gan, Albert
2015-06-01
The Highway Safety Manual (HSM) recommends using the empirical Bayes (EB) method with locally derived calibration factors to predict an agency's safety performance. However, the data needs for deriving these local calibration factors are significant, requiring very detailed roadway characteristics information. Many of the data variables identified in the HSM are currently unavailable in the states' databases. Moreover, the process of collecting and maintaining all the HSM data variables is cost-prohibitive. Prioritization of the variables based on their impact on crash predictions would, therefore, help to identify influential variables for which data could be collected and maintained for continued updates. This study aims to determine the impact of each independent variable identified in the HSM on crash predictions. A relatively recent data mining approach called boosted regression trees (BRT) is used to investigate the association between the variables and crash predictions. The BRT method can effectively handle different types of predictor variables, identify very complex and non-linear association among variables, and compute variable importance. Five years of crash data from 2008 to 2012 on two urban and suburban facility types, two-lane undivided arterials and four-lane divided arterials, were analyzed for estimating the influence of variables on crash predictions. Variables were found to exhibit non-linear and sometimes complex relationship to predicted crash counts. In addition, only a few variables were found to explain most of the variation in the crash data. Published by Elsevier Ltd.
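To make the idea of boosted regression trees and their variable importance concrete, here is a deliberately simplified sketch: gradient boosting for squared error using one-split trees (stumps), with importance measured as each variable's share of the total error reduction. Practical BRT implementations use deeper trees, random subsampling and other loss functions, so this is an assumption-laden toy, not the method used in the study; all names and the data are hypothetical.

```python
def fit_stump(X, residuals):
    """Find the single split (feature j, threshold t) that minimises squared error."""
    n = len(X)
    mean_all = sum(residuals) / n
    sse_before = sum((r - mean_all) ** 2 for r in residuals)
    best = None
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2  # candidate threshold between two observed values
            left = [residuals[i] for i in range(n) if X[i][j] <= t]
            right = [residuals[i] for i in range(n) if X[i][j] > t]
            ml, mr = sum(left) / len(left), sum(right) / len(right)
            sse = sum((r - ml) ** 2 for r in left) + sum((r - mr) ** 2 for r in right)
            if best is None or sse < best[0]:
                best = (sse, j, t, ml, mr)
    if best is None:
        return None  # no feature has two distinct values
    sse, j, t, ml, mr = best
    return j, t, ml, mr, sse_before - sse


def fit_brt(X, y, n_trees=50, learning_rate=0.1):
    """Boost stumps on residuals; track error reduction per feature."""
    base = sum(y) / len(y)
    pred = [base] * len(y)
    stumps, gains = [], [0.0] * len(X[0])
    for _ in range(n_trees):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(X, residuals)
        if stump is None:
            break
        j, t, ml, mr, gain = stump
        stumps.append((j, t, ml, mr))
        gains[j] += gain
        pred = [p + learning_rate * (ml if row[j] <= t else mr)
                for p, row in zip(pred, X)]
    total = sum(gains) or 1.0
    importance = [100.0 * g / total for g in gains]  # relative influence in percent
    return {"base": base, "rate": learning_rate, "stumps": stumps,
            "importance": importance}


def predict(model, row):
    return model["base"] + model["rate"] * sum(
        ml if row[j] <= t else mr for j, t, ml, mr in model["stumps"])


# Hypothetical data: the response jumps when feature 0 crosses 9.5; feature 1 is irrelevant.
X = [[i, (i * 7) % 5] for i in range(20)]
y = [0.0 if i < 10 else 10.0 for i in range(20)]
model = fit_brt(X, y)
print("relative importance (%):", model["importance"])
print("prediction for [15, 0]:", predict(model, [15, 0]))
```

In this toy case nearly all of the importance falls on the first variable, which mirrors the abstract's observation that a few variables can explain most of the variation.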
Prognostic transcriptional association networks: a new supervised approach based on regression trees
Nepomuceno-Chamorro, Isabel; Azuaje, Francisco; Devaux, Yvan; Nazarov, Petr V.; Muller, Arnaud; Aguilar-Ruiz, Jesús S.; Wagner, Daniel R.
2011-01-01
Motivation: The application of information encoded in molecular networks for prognostic purposes is a crucial objective of systems biomedicine. This approach has not been widely investigated in the cardiovascular research area. Within this area, the prediction of clinical outcomes after suffering a heart attack would represent a significant step forward. We developed a new quantitative prediction-based method for this prognostic problem based on the discovery of clinically relevant transcriptional association networks. This method integrates regression trees and clinical class-specific networks, and can be applied to other clinical domains. Results: Before analyzing our cardiovascular disease dataset, we tested the usefulness of our approach on a benchmark dataset with control and disease patients. We also compared it to several algorithms to infer transcriptional association networks and classification models. Comparative results provided evidence of the prediction power of our approach. Next, we discovered new models for predicting good and bad outcomes after myocardial infarction. Using blood-derived gene expression data, our models reported areas under the receiver operating characteristic curve above 0.70. Our model could also outperform different techniques based on co-expressed gene modules. We also predicted processes that may represent novel therapeutic targets for heart disease, such as the synthesis of leucine and isoleucine. Availability: The SATuRNo software is freely available at http://www.lsi.us.es/isanepo/toolsSaturno/. Contact: inepomuceno@us.es Supplementary information: Supplementary data are available at Bioinformatics online. PMID:21098433
I GEDE AGUS JIWADIANA
2015-11-01
The aim of this research is to determine the classification characteristics of traffic accidents in Denpasar city in January-July 2014 using Classification And Regression Trees (CART), and then to determine which explanatory variable is the main classifier in CART. The results showed that the optimal CART generates three terminal nodes. In the first terminal node, 12 people were classified as having heavy traffic accident characteristics, in single accidents. In the second terminal node, 68 people were classified as having minor traffic accident characteristics, with accident types front-rear, front-front, front-side, pedestrian and side-side, and accident locations on district and sub-district roads. In the third terminal node, 291 people were classified as having medium traffic accident characteristics, with accident types front-rear, front-front, front-side, pedestrian and side-side, and accident locations on municipality roads. The explanatory variable that acts as the main splitter of the CART is the type of traffic accident, with a maximum homogeneity measure of 0.03252.
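CART chooses its main splitter by comparing how much each candidate split improves node homogeneity. The abstract does not say which measure was used; the sketch below assumes the Gini index, the usual CART default, and uses hypothetical class labels.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 0 for a pure node, larger for more mixed nodes."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_decrease(parent, left, right):
    """Impurity reduction achieved by splitting `parent` into `left` and `right`."""
    n = len(parent)
    weighted_children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted_children


# Hypothetical node with two accident classes, split two different ways.
parent = ["minor"] * 4 + ["heavy"] * 4
perfect = gini_decrease(parent, ["minor"] * 4, ["heavy"] * 4)
partial = gini_decrease(parent,
                        ["minor"] * 3 + ["heavy"],
                        ["minor"] + ["heavy"] * 3)
print(perfect, partial)
```

The variable whose best split yields the largest decrease becomes the splitter at that node; the value quoted in the abstract plays that role for the accident-type variable.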
Neela Deshpande
2014-12-01
In the recent past, Artificial Neural Networks (ANN) have emerged as a promising technique for predicting the compressive strength of concrete. In the present study, back propagation was used to predict the 28 day compressive strength of recycled aggregate concrete (RAC), along with two other data driven techniques, namely Model Tree (MT) and Non-linear Regression (NLR). Recycled aggregate is the need of the hour owing to the environmentally friendly re-use of construction waste. The study observed that the 28 day compressive strength of RAC was predicted better by ANN than by NLR and MT. The input parameters were cubic meter proportions of cement, natural fine aggregate, natural coarse aggregate, recycled aggregate, admixture and water (also called raw data). The study also concluded that ANN performs better when non-dimensional parameters such as Sand–Aggregate ratio, Water–total materials ratio, Aggregate–Cement ratio, Water–Cement ratio and Replacement ratio of natural aggregates by recycled aggregates were used as additional input parameters. Studying each network developed using raw data and each non-dimensional parameter made it possible to examine the impact of each parameter on the performance of the models developed using ANN, MT and NLR, as well as the performance of the ANN models developed with a limited number of inputs. The results indicate that ANN learn from the examples and grasp the fundamental domain rules governing the strength of concrete.
Brabant, Marie-Eve; Hebert, Martine; Chagnon, Francois
2013-01-01
This study explored the clinical profiles of 77 female teenager survivors of sexual abuse and examined the association of abuse-related and personal variables with suicidal ideations. Analyses revealed that 64% of participants experienced suicidal ideations. Findings from classification and regression tree analysis indicated that depression,…
Standardization guide for construction and use of MORT-type analytic trees
Buys, J.R.
1992-02-01
Since the introduction of MORT (Management Oversight and Risk Tree) technology as a tool for evaluating the success or failure of safety management systems, there has been a proliferation of analytic trees throughout US Department of Energy (DOE) and its contractor organizations. Standard "fault tree" symbols have generally been used in logic diagram or tree construction, but new or revised symbols have also been adopted by various analysts. Additionally, a variety of numbering systems have been used for event identification. The consequent lack of standardization has caused some difficulties in interpreting the trees and following their logic. This guide seeks to correct this problem by providing a standardized system for construction and use of analytic trees. Future publications of the DOE System Safety Development Center (SSDC) will adhere to this guide. It is recommended that other DOE organizations and contractors also adopt this system to achieve intra-DOE uniformity in analytic tree construction.
LI Chang-ping; ZHI Xin-yue; MA Jun; CUI Zhuang; ZHU Zi-long; ZHANG Cui; HU Liang-ping
2012-01-01
Background: Various methods can be applied to build predictive models for clinical data with a binary outcome variable. This research aims to explore the process of constructing common predictive models, logistic regression (LR), decision tree (DT) and multilayer perceptron (MLP), and to focus on specific details when applying these methods: what preconditions should be satisfied, how to set the parameters of the model, how to screen variables and build accurate models quickly and efficiently, and how to assess the generalization ability (that is, prediction performance) reliably by the Monte Carlo method in the case of small sample size. Methods: All 274 patients (137 with type 2 diabetes mellitus and diabetic peripheral neuropathy and 137 with type 2 diabetes mellitus without diabetic peripheral neuropathy) from the Metabolic Disease Hospital in Tianjin participated in the study. There were 30 variables such as sex, age, glycosylated hemoglobin, etc. On account of the small sample size, the classification and regression tree (CART) and the chi-squared automatic interaction detector tree (CHAID) were combined by means of 100 times 5-7 fold stratified cross-validation to build the DT. The MLP was constructed using the Schwarz Bayes Criterion to choose the number of hidden layers and hidden layer units, along with the Levenberg-Marquardt (L-M) optimization algorithm, weight decay and a preliminary training method. Subsequently, LR was applied by the best subset method with the Akaike Information Criterion (AIC) to make the best use of information and avoid overfitting. Eventually, a 10 to 100 times 3-10 fold stratified cross-validation method was used to compare the generalization ability of DT, MLP and LR in view of the areas under the receiver operating characteristic (ROC) curves (AUC). Results: The AUC of DT, MLP and LR were 0.8863, 0.8536 and 0.8802, respectively. Since a larger AUC of a specific prediction model indicates a higher diagnostic ability, MLP performed optimally, and then
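The AUC used above to compare the models can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case (ties counting one half). The sketch below computes it directly from that definition; the function name, labels and scores are hypothetical and not taken from the study.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives.

    labels: 1 for cases with the outcome, 0 for cases without it.
    scores: model outputs, where higher means 'more likely positive'.
    """
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Hypothetical labels and predicted scores for four patients.
labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.5, 0.1]
print(auc(labels, scores))
```

In a repeated stratified cross-validation such as the one described, this quantity would be computed on each held-out fold and then averaged over folds and repetitions.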
Satoh, Soichirou; Mimuro, Mamoru; Tanaka, Ayumi
2013-01-01
Phylogenetic trees have been constructed for a wide range of organisms using gene sequence information, especially through the identification of orthologous genes that have been vertically inherited. The number of available complete genome sequences is rapidly increasing, and many tools for construction of genome trees based on whole genome sequences have been proposed. However, a reasonable method of using complete genome sequences for construction of phylogenetic trees has not been established. We have developed a method for construction of phylogenetic trees based on the average sequence similarities of whole genome sequences. We used this method to examine the phylogeny of 115 photosynthetic prokaryotes, i.e., cyanobacteria, Chlorobi, proteobacteria, Chloroflexi, Firmicutes and nonphotosynthetic organisms including Archaea. Although the bootstrap values for the branching order of phyla were low, probably due to lateral gene transfer and saturated mutation, the obtained tree was largely consistent with the previously reported phylogenetic trees, indicating that this method is a robust alternative to traditional phylogenetic methods.
A Model of Desired Performance in Phylogenetic Tree Construction for Teaching Evolution.
Brewer, Steven D.
This research paper examines phylogenetic tree construction-a form of problem solving in biology-by studying the strategies and heuristics used by experts. One result of the research is the development of a model of desired performance for phylogenetic tree construction. A detailed description of the model and the sample problems which illustrate…
Risk assessment of dengue fever in Zhongshan, China: a time-series regression tree analysis.
Liu, K-K; Wang, T; Huang, X-D; Wang, G-L; Xia, Y; Zhang, Y-T; Jing, Q-L; Huang, J-W; Liu, X-X; Lu, J-H; Hu, W-B
2017-02-01
Dengue fever (DF) is the most prevalent and rapidly spreading mosquito-borne disease globally. Control of DF is limited by barriers to vector control and integrated management approaches. This study aimed to explore the potential risk factors for autochthonous DF transmission and to estimate the threshold effects of high-order interactions among risk factors. A time-series regression tree model was applied to estimate the hierarchical relationship between reported autochthonous DF cases and the potential risk factors including the timeliness of DF surveillance systems (median time interval between symptom onset date and diagnosis date, MTIOD), mosquito density, imported cases and meteorological factors in Zhongshan, China from 2001 to 2013. We found that MTIOD was the most influential factor in autochthonous DF transmission. Monthly autochthonous DF incidence rate increased by 36·02-fold [relative risk (RR) 36·02, 95% confidence interval (CI) 25·26-46·78, compared to the average DF incidence rate during the study period] when the 2-month lagged moving average of MTIOD was >4·15 days and the 3-month lagged moving average of the mean Breteau Index (BI) was ⩾16·57. If the 2-month lagged moving average MTIOD was between 1·11 and 4·15 days and the monthly maximum diurnal temperature range at a lag of 1 month was <9·6 °C, the monthly mean autochthonous DF incidence rate increased by 14·67-fold (RR 14·67, 95% CI 8·84-20·51, compared to the average DF incidence rate during the study period). This study demonstrates that the timeliness of DF surveillance systems, mosquito density and diurnal temperature range play critical roles in the autochthonous DF transmission in Zhongshan. Better assessment and prediction of the risk of DF transmission is beneficial for establishing scientific strategies for DF early warning surveillance and control.
snpTree--a web-server to identify and construct SNP trees from whole genome sequence data.
Leekitcharoenphon, Pimlapas; Kaas, Rolf S; Thomsen, Martin Christen Frølund; Friis, Carsten; Rasmussen, Simon; Aarestrup, Frank M
2012-01-01
The advances and decreasing economic cost of whole genome sequencing (WGS) will soon make this technology available for routine infectious disease epidemiology. In epidemiological studies, outbreak isolates have very little diversity and require extensive genomic analysis to differentiate and classify isolates. One of the successful and broadly used methods is analysis of single nucleotide polymorphisms (SNPs). Currently, there are different tools and methods to identify SNPs, including various options and cut-off values. Furthermore, all current methods require bioinformatic skills. Thus, we lack a standard and simple automatic tool to determine SNPs and construct phylogenetic trees from WGS data. Here we introduce snpTree, a server for online automatic SNP analysis. This tool is composed of different SNP analysis suites and perl and python scripts. snpTree can identify SNPs and construct phylogenetic trees from WGS as well as from assembled genomes or contigs. WGS data in fastq format are aligned to reference genomes by BWA, while contigs in fasta format are processed by Nucmer. SNPs are concatenated based on position on the reference genome, and a tree is constructed from the concatenated SNPs using FastTree and a perl script. The online server was implemented with HTML, Java and python scripts. The server was evaluated using four published bacterial WGS data sets (V. cholerae, S. aureus CC398, S. Typhimurium and M. tuberculosis). The evaluation results for the first three cases were consistent and concordant for both raw reads and assembled genomes. In the latter case, the original publication involved extensive filtering of SNPs, which could not be repeated using snpTree. The snpTree server is an easy-to-use option for rapid, standardised and automatic SNP analysis in epidemiological studies, also for users with limited bioinformatic experience. The web server is freely accessible at http://www.cbs.dtu.dk/services/snpTree-1.0/.
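The step "SNPs are concatenated based on position on the reference genome" can be illustrated with a small sketch. This is not snpTree's code; the function name, the input layout (a dictionary of position-to-base calls per isolate, with 0-based positions) and the sequences are hypothetical. Isolates without a call at a variable position are assumed to carry the reference base.

```python
def concatenate_snps(reference, isolate_snps):
    """Build one SNP string per isolate over all variable positions.

    reference: reference genome sequence as a string.
    isolate_snps: {isolate name: {0-based position: observed base}}.
    Returns the sorted variable positions and {isolate name: concatenated bases}.
    """
    positions = sorted({pos for snps in isolate_snps.values() for pos in snps})
    alignment = {
        name: "".join(snps.get(pos, reference[pos]) for pos in positions)
        for name, snps in isolate_snps.items()
    }
    return positions, alignment


# Hypothetical reference and SNP calls for three isolates.
reference = "ACGTACGT"
calls = {"iso1": {1: "T", 5: "A"}, "iso2": {5: "A"}, "iso3": {}}
positions, alignment = concatenate_snps(reference, calls)
for name, seq in alignment.items():
    print(f">{name}\n{seq}")  # FASTA-style output for a tree-building program
```

The resulting equal-length strings form a multiple alignment restricted to variable sites, which is the kind of input a tree builder such as FastTree expects.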
Comprehensive database of diameter-based biomass regressions for North American tree species
Jennifer C. Jenkins; David C. Chojnacky; Linda S. Heath; Richard A. Birdsey
2004-01-01
A database of 2,640 equations was compiled from the literature for predicting the biomass of trees and tree components from diameter measurements of species found in North America. Bibliographic information, geographic locations, diameter limits, diameter and biomass units, equation forms, statistical errors, and coefficients are provided for each equation,...
A New Heuristic Constructing Minimal Steiner Trees inside Simple Polygons
Alireza Khosravinejad
2013-07-01
The Steiner tree problem has numerous applications in urban transportation networks, the design of electronic integrated circuits, and computer network routing. The problem aims at finding a minimum Steiner tree in Euclidean space, that is, a tree connecting a given set of points at the least total cost. The problem is NP-hard. Given a simple polygon P with m vertices and n terminals, the goal of the minimum Steiner tree problem is to connect the n terminals located in P. The proposed algorithm seeks optimal solutions by turning this problem into the Steiner tree problem on a graph.
Self-Construction of Aggregation Tree for Gathering Mobile Data in Wireless Sensor Network
Lee, Sangbin; Kim, Songmin; Kim, Sungjun; Ko, Doohyun; Kim, Bumjin; An, Sunshin
A network of sensors can be used to obtain state-based data from the area in which they are deployed. To reduce costs, the data sent via intermediate sensors to a sink are often aggregated. In this letter, we introduce the Self-Construction of Aggregation Tree (SCAT) scheme, which uses a novel data aggregation scheme utilizing the knowledge of the mobile node and the infrastructure (static node tree) in gathering the data from the mobile node. The static nodes can construct a near-optimal aggregation tree by themselves, using the knowledge of the mobile node, which is a process similar to forming the centralized aggregation tree.
Greedy algorithm with weights for decision tree construction
Moshkov, Mikhail
2010-12-01
An approximate algorithm for minimization of the weighted depth of decision trees is considered. A bound on the accuracy of this algorithm is obtained which is unimprovable in the general case. Under some natural assumptions on the class NP, the considered algorithm is close (from the point of view of accuracy) to the best polynomial approximate algorithms for minimization of the weighted depth of decision trees.
Aguiar Fabio S
2012-08-01
Background: Tuberculosis (TB) remains a public health issue worldwide. The lack of specific clinical symptoms to diagnose TB makes the correct decision to admit patients to respiratory isolation a difficult task for the clinician. Isolation of patients without the disease is common and increases health costs. Decision models for the diagnosis of TB in patients attending hospitals can increase the quality of care and decrease costs, without the risk of hospital transmission. We present a model for predicting pulmonary TB in hospitalized patients in a high prevalence area in order to contribute to a more rational use of isolation rooms without increasing the risk of transmission. Methods: Cross-sectional study of patients admitted to CFFH from March 2003 to December 2004. A classification and regression tree (CART) model was generated and validated. The area under the ROC curve (AUC), sensitivity, specificity, and positive and negative predictive values were used to evaluate the performance of the model. Validation of the model was performed with a different sample of patients admitted to the same hospital from January to December 2005. Results: We studied 290 patients admitted with clinical suspicion of TB. Diagnosis was confirmed in 26.5% of them. Pulmonary TB was present in 83.7% of the patients with TB (62.3% with positive sputum smear) and HIV/AIDS was present in 56.9% of patients. The validated CART model showed sensitivity, specificity, positive predictive value and negative predictive value of 60.00%, 76.16%, 33.33%, and 90.55%, respectively. The AUC was 79.70%. Conclusions: The CART model developed for these hospitalized patients with clinical suspicion of TB had fair to good predictive performance for pulmonary TB. The most important variable for prediction of TB diagnosis was chest radiograph results. Prospective validation is still necessary, but our model offers an alternative for decision making in whether to isolate patients with
Shabani, Farzin; Kumar, Lalit; Solhjouy-fard, Samaneh
2016-05-01
The aim of this study was to compare and evaluate the capabilities of correlative and mechanistic modeling processes, applied to the projection of future distributions of date palm in novel environments, and to establish a method of minimizing uncertainty in the projections of differing techniques. The study area comprises Middle Eastern countries. We compared the mechanistic model CLIMEX (CL) with the correlative models MaxEnt (MX), Boosted Regression Trees (BRT), and Random Forests (RF) to project current and future distributions of date palm (Phoenix dactylifera L.). The Global Climate Model (GCM), the CSIRO-Mk3.0 (CS) using the A2 emissions scenario, was selected for making projections. Both indigenous and alien distribution data of the species were utilized in the modeling process. The common areas predicted by MX, BRT, RF, and CL from the CS GCM were extracted and compared to ascertain projection uncertainty levels of each individual technique. The common areas identified by all four modeling techniques were used to produce a map indicating suitable and unsuitable areas for date palm cultivation for Middle Eastern countries, for the present and the year 2100. The four different modeling approaches predict fairly different distributions. Projections from CL were more conservative than from MX. The BRT and RF were the most conservative methods in terms of projections for the current time. The combination of the final CL and MX projections for the present and 2100 provides higher certainty concerning those areas that will become highly suitable for future date palm cultivation. According to the four models, cold, hot, and wet stress, with differences on a regional basis, appear to be the major restrictions on future date palm distribution. The results demonstrate variances in the projections, resulting from different techniques. The assessment and interpretation of model projections requires reservations
Tomczyk, Aleksandra; Ewertowski, Marek; White, Piran; Kasprzak, Leszek
2016-04-01
The dual role of many Protected Natural Areas in providing benefits for both conservation and recreation poses challenges for management. Although recreation-based damage to ecosystems can occur very quickly, restoration can take many years. The protection of conservation interests at the same time as providing for recreation requires decisions to be made about how to prioritise and direct management actions. Trails are commonly used to divert visitors from the most important areas of a site, but high visitor pressure can lead to increases in trail width and a concomitant increase in soil erosion. Here we use detailed field data on the condition of recreational trails in Gorce National Park, Poland, as the basis for a regression tree analysis to determine the factors influencing trail deterioration, and link specific trail impacts with environmental, use-related and managerial factors. We distinguished 12 types of trails, characterised by four levels of degradation: (1) trails with an acceptable level of degradation; (2) threatened trails; (3) damaged trails; and (4) heavily damaged trails. Damaged trails were the most vulnerable of all trails and should be prioritised for appropriate conservation and restoration. We also proposed five types of monitoring of recreational trail conditions: (1) rapid inventory of negative impacts; (2) monitoring visitor numbers and variation in type of use; (3) change-oriented monitoring focusing on sections of trail which were subjected to changes in type or level of use or subjected to extreme weather events; (4) monitoring of dynamics of trail conditions; and (5) full assessment of trail conditions, to be carried out every 10-15 years. The application of the proposed framework can enhance the ability of Park managers to prioritise their trail management activities, enhancing trail conditions and visitor safety, while minimising adverse impacts on the conservation value of the ecosystem. A.M.T. was supported by the Polish Ministry of
Liu, Yang; Lü, Yi-he; Zheng, Hai-feng; Chen, Li-ding
2010-05-01
Based on the 10-day SPOT VEGETATION NDVI data and the daily meteorological data from 1998 to 2007 in Yan'an City, the main meteorological variables affecting the annual and interannual variations of NDVI were determined using a regression tree. The effects of the tested meteorological variables on the variability of NDVI differed with seasons and time lags. Temperature and precipitation were the most important meteorological variables affecting the annual variation of NDVI, and the average highest temperature was the most important meteorological variable affecting the interannual variation of NDVI. The regression tree was very powerful in determining the key meteorological variables affecting NDVI variation, but could not build quantitative relations between NDVI and meteorological variables, which limited its further and wider application.
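How a regression tree ranks a candidate variable can be sketched with the basic CART-style split search: for one predictor, pick the threshold that minimises the summed squared error of the two resulting groups. This is a generic illustration, not the procedure used in the study; the function name and the temperature and NDVI values are made up.

```python
def best_split(x, y):
    """Return (threshold, sse) for the split on x that minimises squared error in y."""
    def sse(values):
        # Sum of squared deviations from the group mean.
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values)

    pairs = sorted(zip(x, y))
    best_threshold, best_sse = None, float("inf")
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold can separate identical predictor values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [v for _, v in pairs[:i]]
        right = [v for _, v in pairs[i:]]
        total = sse(left) + sse(right)
        if total < best_sse:
            best_threshold, best_sse = threshold, total
    return best_threshold, best_sse

# Hypothetical data (assumption): NDVI jumps once temperature exceeds about 16.
temperature = [5, 8, 12, 20, 24, 28]
ndvi = [0.2, 0.2, 0.2, 0.6, 0.6, 0.6]
threshold, error = best_split(temperature, ndvi)
print(threshold, error)
```

Repeating this search over every predictor and keeping the one with the largest error reduction is what lets a tree identify key variables. Each leaf predicts only a constant, which is consistent with the limitation the abstract notes about quantitative relations.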
Early cost estimating for road construction projects using multiple regression techniques
Ibrahim Mahamid
2011-12-01
The objective of this study is to develop early cost estimating models for road construction projects using multiple regression techniques, based on 131 sets of data collected in the West Bank in Palestine. As the cost estimates are required at early stages of a project, consideration was given to the fact that the input data for the required regression model should be easily extracted from sketches or the scope definition of the project. Eleven regression models are developed to estimate the total cost of a road construction project in US dollars; 5 of them include bid quantities as input variables and 6 include road length and road width. The coefficient of determination r2 for the developed models ranges from 0.92 to 0.98, which indicates that the values predicted by the models fit the real-life data well. The mean absolute percentage error (MAPE) of the developed regression models ranges from 13% to 31%; these results compare favorably with past research, which has shown that the estimate accuracy in the early stages of a project is between ±25% and ±50%.
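The two accuracy measures quoted in this abstract can be computed as in the following minimal sketch. It uses the common definitions of MAPE and of the coefficient of determination (1 minus residual over total sum of squares); the abstract does not state which exact formulas were used, and the cost figures below are hypothetical.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)

def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

# Hypothetical project costs in thousand US dollars (assumption, not study data).
actual_cost = [100.0, 200.0, 400.0]
estimated_cost = [110.0, 180.0, 400.0]
print(f"MAPE = {mape(actual_cost, estimated_cost):.2f}%")
print(f"r2   = {r_squared(actual_cost, estimated_cost):.4f}")
```

The two measures answer different questions: r2 describes how much of the variation in cost the model captures, while MAPE describes the typical relative size of an individual estimate's error, so a model can score highly on the first and still show a double-digit MAPE.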
Construction of risk prediction model of type 2 diabetes mellitus based on logistic regression
Li Jian
2017-01-01
Objective: to construct a multi-factor prediction model for the individual risk of T2DM, and to explore new ideas for early warning, prevention and personalized health services for T2DM. Methods: logistic regression techniques were used to screen the risk factors for T2DM and construct the risk prediction model. Results: The logistic regression equation of the risk prediction model for males was: logit(P) = BMI × 0.735 + vegetables × (−0.671) + age × 0.838 + diastolic pressure × 0.296 + physical activity × (−2.287) + sleep × (−0.009) + smoking × 0.214. For females it was: logit(P) = BMI × 1.979 + vegetables × (−0.292) + age × 1.355 + diastolic pressure × 0.522 + physical activity × (−2.287) + sleep × (−0.010). For males, the area under the ROC curve was 0.83, the sensitivity was 0.72 and the specificity was 0.86; for females, the area under the ROC curve was 0.84, the sensitivity was 0.75 and the specificity was 0.90. Conclusion: The data for this model come from a nested case-control study; the risk prediction model was established using the mature logistic regression technique, and the model has high predictive sensitivity, specificity and stability.
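A logit equation of this kind is turned into a predicted probability with the inverse-logit (sigmoid) function, as in the sketch below. The coefficients are the male-model values quoted in the abstract. The zero intercept and the 0/1 coding of each predictor are assumptions made only for illustration, since the abstract reports neither; the function and dictionary names are hypothetical.

```python
import math

# Male-model coefficients as quoted in the abstract.
MALE_COEFFS = {
    "bmi": 0.735,
    "vegetables": -0.671,
    "age": 0.838,
    "diastolic_pressure": 0.296,
    "physical_activity": -2.287,
    "sleep": -0.009,
    "smoking": 0.214,
}

def predicted_risk(features, coeffs, intercept=0.0):
    """Convert a linear logit score into a probability with the sigmoid function.

    intercept=0.0 is an assumption: the abstract does not report an intercept.
    """
    logit = intercept + sum(coeffs[name] * features[name] for name in coeffs)
    return 1.0 / (1.0 + math.exp(-logit))

# Hypothetical individuals with predictors coded 0/1 (the coding is an assumption).
baseline = {name: 0 for name in MALE_COEFFS}
high_bmi = dict(baseline, bmi=1)
high_bmi_active = dict(high_bmi, physical_activity=1)

for label, person in [("baseline", baseline), ("high BMI", high_bmi),
                      ("high BMI + active", high_bmi_active)]:
    print(label, round(predicted_risk(person, MALE_COEFFS), 3))
```

Positive coefficients raise the predicted risk and negative ones lower it, so under this coding physical activity has the largest protective effect of the listed factors.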
Benjamin W. Y. Lo
2016-01-01
Conclusions: A clinically useful classification tree was generated, which serves as a prediction tool to guide bedside prognostication and clinical treatment decision making. This prognostic decision-making algorithm also shed light on the complex interactions between a number of risk factors in determining outcome after aneurysmal SAH.
Family-Joining: A Fast Distance-Based Method for Constructing Generally Labeled Trees.
Kalaghatgi, Prabhav; Pfeifer, Nico; Lengauer, Thomas
2016-10-01
The widely used model for evolutionary relationships is a bifurcating tree with all taxa/observations placed at the leaves. This is not appropriate if the taxa have been densely sampled across evolutionary time and may be in a direct ancestral relationship, or if there is not enough information to fully resolve all the branching points in the evolutionary tree. In this article, we present a fast distance-based agglomeration method called family-joining (FJ) for constructing so-called generally labeled trees in which taxa may be placed at internal vertices and the tree may contain polytomies. FJ constructs such trees on the basis of pairwise distances and a distance threshold. We tested three methods for threshold selection, FJ-AIC, FJ-BIC, and FJ-CV, which minimize Akaike information criterion, Bayesian information criterion, and cross-validation error, respectively. When compared with related methods on simulated data, FJ-BIC was among the best at reconstructing the correct tree across a wide range of simulation scenarios. FJ-BIC was applied to HIV sequences sampled from individuals involved in a known transmission chain. The FJ-BIC tree was found to be compatible with almost all transmission events. On average, internal branches in the FJ-BIC tree have higher bootstrap support than branches in the leaf-labeled bifurcating tree constructed using RAxML. 36% and 25% of the internal branches in the FJ-BIC tree and RAxML tree, respectively, have bootstrap support greater than 70%. To the best of our knowledge the method presented here is the first attempt at modeling evolutionary relationships using generally labeled trees.
A. Snow
2008-01-01
A surface flow wetland was constructed in the Burnside Industrial Park, Dartmouth, Nova Scotia, to treat stormwater runoff from the surrounding watersheds which are comprised primarily of commercial properties and two former landfills. The objectives of this study were: (a) to compare the uptake of iron by red maple, white birch and red spruce trees growing under flooded soil conditions in the constructed wetland and well drained soil conditions in a nearby reference site, (b) to evaluate the seasonal variability of iron in these trees and (c) to determine the distribution of iron in different compartments of these trees (leaves, twigs, branches, trunk wood, trunk bark and roots). The average iron concentrations in the aboveground compartments of red maple, white birch and red spruce trees were within the range of iron concentrations reported in the literature for these trees. Red maple, white birch and red spruce trees in the constructed wetland had significantly greater iron concentrations in their roots than the same species in the reference site. The average iron concentrations in the leaves of red maple trees in the constructed wetland and the reference site displayed an increasing trend towards the end of the growing season while the average iron concentrations in the twigs of red maple and white birch trees in the constructed wetland and the reference site displayed maximum concentrations at the beginning of the growing season. Red maple, white birch and red spruce trees in the constructed wetland retained a major portion of their overall iron concentration in their root systems.
Bou Kheir, Rania; Shomar, B.; Greve, Mogens Humlekrog
2014-01-01
Soil heavy metal pollution has been and continues to be a worldwide phenomenon that has attracted a great deal of attention from governments and regulatory bodies. In this context, our study used Geographic Information Systems (GIS) and regression-tree modeling (196 trees) to precisely quantify the relationships between four toxic heavy metals (Ni, Cr, Cd and As) and sixteen environmental parameters (e.g., parent material, slope gradient, proximity to roads, etc.) in the soils of northern Lebanon (as a case study of Mediterranean landscapes), and to detect the most important parameters that can be used as weighted input data in soil pollution prediction models. The strongest relationships developed were associated with Cd and As, variance being equal to 82%, followed by Ni (75%) and Cr (73%) as the weakest relationship. This study also showed that nearness to cities (with a relative importance varying...
A Construction of String 2-Group Models using a Transgression-Regression Technique
Waldorf, Konrad
2012-01-01
In this note we present a new construction of the string group that ends optionally in two different contexts: strict diffeological 2-groups or finite-dimensional Lie 2-groups. It is canonical in the sense that no choices are involved; all the data is written down and can be looked up (at least somewhere). The basis of our construction is the basic gerbe of Gawedzki-Reis and Meinrenken. The main new insight is that under a transgression-regression procedure, the basic gerbe picks up a multiplicative structure coming from the Mickelsson product over the loop group. The conclusion of the construction is a relation between multiplicative gerbes and 2-group extensions for which we use recent work of Schommer-Pries.
Performance-driven Routing Tree Construction with Buffer Insertion and Wire Sizing
QI Chang; WANG Gao-feng
2008-01-01
A new approach was proposed to construct a performance-driven rectilinear Steiner tree with simultaneous buffer insertion and wire sizing optimization (PDRST/BW) under a higher order resistance-inductance-capacitance (RLC) delay model. This approach is based on the concept of sharing-buffer insertion and a dynamic programming approach combined with a bottom-up rectilinear Steiner tree construction. The performance measures include the timing delay and the quality of the signal waveform. The experimental results show that the proposed approach is scalable and obtains better performance than the SP-tree and graph-RTBW approaches for the test signal nets.
Gu, Yingxin; Wylie, Bruce K.; Boyte, Stephen
2016-01-01
In this study, we developed a method that identifies an optimal sample data usage strategy and rule numbers that minimize over- and underfitting effects in regression tree mapping models. A LANDFIRE tile (r04c03, located mainly in northeastern Nevada), which is a composite of multiple Landsat 8 scenes for a target date, was selected for the study. To minimize any cloud and bad detection effects in the original Landsat 8 data, the compositing approach used cosine-similarity-combined pixels from multiple observations based on data quality and temporal proximity to a target date. Julian date 212, which yielded relatively few "no data and/or cloudy" pixels, was used as the target date with Landsat 8 observations from days 140–240 in 2013. The 30-m Landsat 8 composited data were then upscaled to 250 m using a spatial averaging method. Six Landsat 8 spectral bands (bands 1–6) at 250-m resolution were used as independent variables for developing the piecewise regression-tree models to predict the 250-m eMODIS NDVI (dependent variable). Furthermore, to ensure the high quality of the derived 250-m Landsat 8 data, and avoid any additional cloud and atmospheric effects, the percentage of 30-m pixels with "0" values within a 250-m pixel was calculated. Only those 250-m pixels with 0% of "0" values (i.e., none of the 30-m pixels within the 250-m pixel has a zero value) were selected to develop the regression-tree model. The 7-day maximum value composites of 250-m MODIS NDVI for the year 2013 were obtained from the USGS expedited MODIS (eMODIS) data archive (https://lta.cr.usgs.gov/emodis). Pixels with bad quality, negative values, clouds, snow cover, and low view angles were filtered out based on the MODIS quality assurance data to ensure high quality eMODIS NDVI data. The 2013 weekly NDVI data were then stacked and temporally smoothed using a weighted least-squares approach to reduce additional atmospheric noise. Temporal smoothing helps to ensure reliable
U.S. Geological Survey, Department of the Interior — Integrating spatially explicit biogeophysical and remotely sensed data into regression-tree models enables the spatial extrapolation of training data over large...
Schumacher, Phyllis; Olinsky, Alan; Quinn, John; Smith, Richard
2010-01-01
The authors extended previous research by 2 of the authors who conducted a study designed to predict the successful completion of students enrolled in an actuarial program. They used logistic regression to determine the probability of an actuarial student graduating in the major or dropping out. They compared the results of this study with those…
Soichirou Satoh
Phylogenetic trees have been constructed for a wide range of organisms using gene sequence information, especially through the identification of orthologous genes that have been vertically inherited. The number of available complete genome sequences is rapidly increasing, and many tools for construction of genome trees based on whole genome sequences have been proposed. However, a reasonable method of using complete genome sequences for construction of phylogenetic trees has not been established. We have developed a method for construction of phylogenetic trees based on the average sequence similarities of whole genome sequences. We used this method to examine the phylogeny of 115 photosynthetic prokaryotes, i.e., cyanobacteria, Chlorobi, proteobacteria, Chloroflexi, Firmicutes and nonphotosynthetic organisms including Archaea. Although the bootstrap values for the branching order of phyla were low, probably due to lateral gene transfer and saturated mutation, the obtained tree was largely consistent with the previously reported phylogenetic trees, indicating that this method is a robust alternative to traditional phylogenetic methods.
Wylie, Bruce K.; Howard, Daniel; Dahal, Devendra; Gilmanov, Tagir; Ji, Lei; Zhang, Li; Smith, Kelcy
2016-01-01
This paper presents the methodology and results of two ecological-based net ecosystem production (NEP) regression tree models capable of upscaling measurements made at various flux tower sites throughout the U.S. Great Plains. Separate grassland and cropland NEP regression tree models were trained using various remote sensing data and other biogeophysical data, along with 15 flux towers contributing to the grassland model and 15 flux towers for the cropland model. The models yielded weekly mean daily grassland and cropland NEP maps of the U.S. Great Plains at 250 m resolution for 2000–2008. The grassland and cropland NEP maps were spatially summarized and statistically compared. The results of this study indicate that grassland and cropland ecosystems generally performed as weak net carbon (C) sinks, absorbing more C from the atmosphere than they released from 2000 to 2008. Grasslands demonstrated higher carbon sink potential (139 g C·m−2·year−1) than non-irrigated croplands. A closer look into the weekly time series reveals the C fluctuation through time and space for each land cover type.
Constructing an optimal decision tree for FAST corner point detection
Alkhalid, Abdulaziz
2011-01-01
In this paper, we consider a problem that originates in computer vision: determining an optimal testing strategy for the corner point detection problem that is part of the FAST algorithm [11,12]. The problem can be formulated as building a decision tree with the minimum average depth for a decision table with all discrete attributes. We experimentally compare the performance of an exact algorithm based on dynamic programming and several greedy algorithms that differ in the attribute selection criterion. © 2011 Springer-Verlag.
On a constructive characterization of a class of trees related to pairs of disjoint matchings
Kamalian, R R
2007-01-01
For a graph, consider the pairs of disjoint matchings whose union contains as many edges as possible, and define a parameter $\alpha$ which equals the cardinality of the largest matching in those pairs. Also, define $\beta$ to be the cardinality of a maximum matching of the graph. We give a constructive characterization of trees which satisfy the equality $\alpha=\beta$. The proof of our main theorem is based on a new decomposition algorithm obtained for trees.
Construction of α-decision trees for tables with many-valued decisions
Moshkov, Mikhail
2011-01-01
The paper is devoted to the study of a greedy algorithm for construction of approximate decision trees (α-decision trees). This algorithm is applicable to decision tables with many-valued decisions where each row is labeled with a set of decisions. For a given row, we should find a decision from the set attached to this row. We consider a bound on the number of algorithm steps, and a bound on the algorithm accuracy relative to the depth of decision trees. © 2011 Springer-Verlag.
Parallel continuous flow: a parallel suffix tree construction tool for whole genomes.
Comin, Matteo; Farreras, Montse
2014-04-01
The construction of suffix trees for very long sequences is essential for many applications, and it plays a central role in the bioinformatic domain. With the advent of modern sequencing technologies, biological sequence databases have grown dramatically. The methodologies required to analyze these data have also become more complex every day, requiring fast queries to multiple genomes. In this article, we present parallel continuous flow (PCF), a parallel suffix tree construction method that is suitable for very long genomes. We tested our method for the suffix tree construction of the entire human genome, about 3 GB. We showed that PCF can scale gracefully as the size of the input genome grows. Our method can work with an efficiency of 90% with 36 processors and 55% with 172 processors. We can index the human genome in 7 minutes using 172 processes.
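The efficiency figures quoted in this abstract can be related to speedup with the standard definition, efficiency = speedup / number of processors. The sketch below assumes that the authors use this definition, which the abstract does not confirm; the function names and the timing values in the last call are hypothetical.

```python
def parallel_efficiency(t_serial, t_parallel, processors):
    """Efficiency = speedup / processors, with speedup = t_serial / t_parallel."""
    return (t_serial / t_parallel) / processors

def implied_speedup(efficiency, processors):
    """Speedup implied by a reported efficiency under the same definition."""
    return efficiency * processors

# Efficiencies and processor counts as reported in the abstract.
print(implied_speedup(0.90, 36))   # roughly a 32-fold speedup
print(implied_speedup(0.55, 172))  # roughly a 95-fold speedup

# Hypothetical timings (assumption): 100 time units serially, 5 on 40 processors.
print(parallel_efficiency(t_serial=100.0, t_parallel=5.0, processors=40))
```

Under this reading, the drop from 90% to 55% efficiency still corresponds to a roughly threefold gain in absolute speedup when moving from 36 to 172 processors.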
Andreica, Mugurel Ionut; Sambotin, Ana-Delia; Tapus, Nicolae; 10.1145/1835698.1835766
2010-01-01
In this paper we consider the problem of efficiently constructing in a fully distributed manner multicast trees which are embedded into P2P overlays using virtual geometric node coordinates. We consider two objectives: to minimize the number of messages required for constructing a multicast tree by using the geometric properties of the P2P overlay, and to construct stable multicast trees when the lifetime durations of the peers are known.
Deo, Ravinesh C.; Kisi, Ozgur; Singh, Vijay P.
2017-02-01
Drought forecasting using standardized metrics of rainfall is a core task in hydrology and water resources management. The Standardized Precipitation Index (SPI) is a rainfall-based metric that caters for the different time-scales at which drought occurs and, due to its standardization, is well-suited for forecasting drought at different periods in climatically diverse regions. This study advances drought modelling using multivariate adaptive regression splines (MARS), least squares support vector machine (LSSVM), and M5Tree models by forecasting SPI in eastern Australia. The MARS model incorporated rainfall as a mandatory predictor, with month (periodicity), Southern Oscillation Index, Pacific Decadal Oscillation Index and Indian Ocean Dipole, ENSO Modoki and Nino 3.0, 3.4 and 4.0 data added gradually. The performance was evaluated with root mean square error (RMSE), mean absolute error (MAE), and coefficient of determination (r2). The best MARS model required different input combinations: rainfall, sea surface temperature and periodicity were used for all stations, but ENSO Modoki and Pacific Decadal Oscillation indices were not required for Bathurst, Collarenebri and Yamba, and the Southern Oscillation Index was not required for Collarenebri. Inclusion of periodicity increased the r2 value by 0.5-8.1% and reduced RMSE by 3.0-178.5%. Comparisons showed that MARS outperformed its counterparts at three out of five stations, with lower MAE by 15.0-73.9% and 7.3-42.2%, respectively. For the other stations, M5Tree was better than MARS/LSSVM with lower MAE by 13.8-13.4% and 25.7-52.2%, respectively, and for Bathurst, LSSVM yielded a more accurate result. For droughts identified by SPI ≤ −0.5, accurate forecasts were attained by MARS/M5Tree for Bathurst, Yamba and Peak Hill, whereas for Collarenebri and Barraba, M5Tree was better than LSSVM/MARS. Seasonal analysis revealed disparate results where MARS/M5Tree was better than LSSVM. The results highlight the
Genome trees constructed using five different approaches suggest new major bacterial clades
Tatusov Roman L
2001-10-01
Background: The availability of multiple complete genome sequences from diverse taxa prompts the development of new phylogenetic approaches, which attempt to incorporate information derived from comparative analysis of complete gene sets or large subsets thereof. Such attempts are particularly relevant because of the major role of horizontal gene transfer and lineage-specific gene loss, at least in the evolution of prokaryotes. Results: Five largely independent approaches were employed to construct trees for completely sequenced bacterial and archaeal genomes: (i) presence-absence of genomes in clusters of orthologous genes; (ii) conservation of local gene order (gene pairs) among prokaryotic genomes; (iii) parameters of identity distribution for probable orthologs; (iv) analysis of concatenated alignments of ribosomal proteins; (v) comparison of trees constructed for multiple protein families. All constructed trees support the separation of the two primary prokaryotic domains, bacteria and archaea, as well as some terminal bifurcations within the bacterial and archaeal domains. Beyond these obvious groupings, the trees made with different methods appeared to differ substantially in terms of the relative contributions of phylogenetic relationships and similarities in gene repertoires caused by similar life styles and horizontal gene transfer to the tree topology. The trees based on presence-absence of genomes in orthologous clusters and the trees based on conserved gene pairs appear to be strongly affected by gene loss and horizontal gene transfer. The trees based on identity distributions for orthologs and particularly the tree made of concatenated ribosomal protein sequences seemed to carry a stronger phylogenetic signal. The latter tree supported three potential high-level bacterial clades: (i) Chlamydia-Spirochetes, (ii) Thermotogales-Aquificales (bacterial hyperthermophiles), and (iii) Actinomycetes-Deinococcales-Cyanobacteria. The latter group also
Deterministic Assessment of Continuous Flight Auger Construction Durations Using Regression Analysis
Hossam E. Hosny
2015-07-01
One of the primary functions of construction equipment management is to calculate the production rate of equipment, which is a major input to time estimates, cost estimates and overall project planning. Accordingly, it is crucial for stakeholders to be able to compute equipment production rates using an accurate, reliable and easy tool. The objective of this research is to provide a simple model that can be used by specialists to predict the duration of a proposed Continuous Flight Auger job. The model was obtained using a prioritizing technique based on expert judgment, followed by multiple regression analysis based on a representative sample. The model was then validated on a selected sample of projects. The average error of the model was calculated to be about 3%-6%.
Bennema, S C; Molento, M B; Scholte, R G; Carvalho, O S; Pritsch, I
2017-11-01
Fascioliasis is a condition caused by the trematode Fasciola hepatica. In this paper, the spatial distribution of F. hepatica in bovines in Brazil was modelled using a decision tree approach and a logistic regression, combined with a geographic information system (GIS) query. In the decision tree and the logistic model, isothermality had the strongest influence on disease prevalence. Also, the 50-year average precipitation in the warmest quarter of the year was included as a risk factor, having a negative influence on the parasite prevalence. The risk maps developed using both techniques showed a predicted higher prevalence mainly in the South of Brazil. The prediction performance seemed to be high, but both techniques failed to reach a high accuracy in predicting the medium and high prevalence classes for the entire country. The GIS query map, based on the range of isothermality, minimum temperature of the coldest month, precipitation of the warmest quarter of the year, altitude and the average daily land surface temperature, showed a possibility of presence of F. hepatica in a very large area. The risk maps produced using these methods can be used to focus activities of animal and public health programmes, even in areas not yet evaluated for F. hepatica.
Kropat, Georg; Bochud, Francois; Jaboyedoff, Michel; Laedermann, Jean-Pascal; Murith, Christophe; Palacios Gruson, Martha; Baechler, Sébastien
2015-09-01
According to estimations around 230 people die as a result of radon exposure in Switzerland. This public health concern makes reliable indoor radon prediction and mapping methods necessary in order to improve risk communication to the public. The aim of this study was to develop an automated method to classify lithological units according to their radon characteristics and to develop mapping and predictive tools in order to improve local radon prediction. About 240 000 indoor radon concentration (IRC) measurements in about 150 000 buildings were available for our analysis. The automated classification of lithological units was based on k-medoids clustering via pair-wise Kolmogorov distances between IRC distributions of lithological units. For IRC mapping and prediction we used random forests and Bayesian additive regression trees (BART). The automated classification groups lithological units well in terms of their IRC characteristics. Especially the IRC differences in metamorphic rocks like gneiss are well revealed by this method. The maps produced by random forests soundly represent the regional difference of IRCs in Switzerland and improve the spatial detail compared to existing approaches. We could explain 33% of the variations in IRC data with random forests. Additionally, the influence of a variable evaluated by random forests shows that building characteristics are less important predictors for IRCs than spatial/geological influences. BART could explain 29% of IRC variability and produced maps that indicate the prediction uncertainty. Ensemble regression trees are a powerful tool to model and understand the multidimensional influences on IRCs. Automatic clustering of lithological units complements this method by facilitating the interpretation of radon properties of rock types. This study provides an important element for radon risk communication. Future approaches should consider taking into account further variables like soil gas radon measurements as
ERA: Efficient serial and parallel suffix tree construction for very long strings
Mansour, Essam
2011-09-01
The suffix tree is a data structure for indexing strings. It is used in a variety of applications such as bioinformatics, time series analysis, clustering, text editing and data compression. However, when the string and the resulting suffix tree are too large to fit into the main memory, most existing construction algorithms become very inefficient. This paper presents a disk-based suffix tree construction method, called Elastic Range (ERa), which works efficiently with very long strings that are much larger than the available memory. ERa partitions the tree construction process horizontally and vertically and minimizes I/Os by dynamically adjusting the horizontal partitions independently for each vertical partition, based on the evolving shape of the tree and the available memory. Where appropriate, ERa also groups vertical partitions together to amortize the I/O cost. We developed a serial version; a parallel version for shared-memory and shared-disk multi-core systems; and a parallel version for shared-nothing architectures. ERa indexes the entire human genome in 19 minutes on an ordinary desktop computer. For comparison, the fastest existing method needs 15 minutes using 1024 CPUs on an IBM BlueGene supercomputer.
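To make the indexing idea concrete, the following is a minimal in-memory sketch of a naive suffix trie, not the ERa method described above. The names `Node`, `build_suffix_trie` and `find` are hypothetical. The sketch uses quadratic memory in the length of the string, which illustrates why disk-based construction matters for very long inputs.

```python
class Node:
    """One trie node: outgoing edges plus the start positions of all
    suffixes whose path passes through this node."""
    def __init__(self):
        self.children = {}
        self.starts = []

def build_suffix_trie(text):
    # Insert every suffix character by character (naive, O(n^2) space).
    root = Node()
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.children.setdefault(ch, Node())
            node.starts.append(start)
    return root

def find(root, pattern):
    # Walk the pattern from the root; the node reached lists every
    # position at which the pattern occurs in the indexed text.
    node = root
    for ch in pattern:
        node = node.children.get(ch)
        if node is None:
            return []
    return node.starts

trie = build_suffix_trie("banana")
occurrences = find(trie, "ana")   # start positions of "ana" in "banana"
```

A real suffix tree compresses chains of single-child nodes into one edge, which brings the space down to linear; the lookup logic stays the same in spirit.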
ERA: Efficient Serial and Parallel Suffix Tree Construction for Very Long Strings
Mansour, Essam; Skiadopoulos, Spiros; Kalnis, Panos
2011-01-01
The suffix tree is a data structure for indexing strings. It is used in a variety of applications such as bioinformatics, time series analysis, clustering, text editing and data compression. However, when the string and the resulting suffix tree are too large to fit into the main memory, most existing construction algorithms become very inefficient. This paper presents a disk-based suffix tree construction method, called Elastic Range (ERa), which works efficiently with very long strings that are much larger than the available memory. ERa partitions the tree construction process horizontally and vertically and minimizes I/Os by dynamically adjusting the horizontal partitions independently for each vertical partition, based on the evolving shape of the tree and the available memory. Where appropriate, ERa also groups vertical partitions together to amortize the I/O cost. We developed a serial version; a parallel version for shared-memory and shared-disk multi-core systems; and a parallel version for shared-not...
Ahmed Y. Hamed
2015-01-01
We refer to the problem of constructing broadcast trees with cost and delay constraints in networks as a delay-constrained minimum spanning tree problem in directed networks. It is therefore necessary to determine a spanning tree of minimal cost that connects the source node to all other nodes subject to delay constraints on broadcast routing. In this paper, we propose a genetic algorithm that solves broadcast routing by finding a low-cost broadcast tree that satisfies the delay constraints. The algorithm represents the broadcast routing tree of a given network in terms of its links: it uses the connection matrix of the network to find spanning trees and considers the weights of the links to obtain the minimum spanning tree. The proposed algorithm finds better solutions with fast convergence and high reliability. The scalability and performance of the algorithm as the number of network nodes increases are also encouraging.
Shuqiong Wu
2015-01-01
As a machine learning method, AdaBoost is widely applied to data classification and object detection because of its robustness and efficiency. AdaBoost constructs a global and optimal combination of weak classifiers based on sample reweighting. It is known that this kind of combination improves the classification performance tremendously. As the popularity of AdaBoost has increased, many variants have been proposed to improve its performance, and many comparison and review studies of these variants have been published. Some researchers compared different AdaBoost variants by experiments in their own fields, and others reviewed various AdaBoost variants by introducing the basic algorithms. However, there is a lack of mathematical analysis of the generalization abilities of different AdaBoost variants. In this paper, we analyze the generalization abilities of six AdaBoost variants in terms of classification margins. The six compared variants are Real AdaBoost, Gentle AdaBoost, Modest AdaBoost, Parameterized AdaBoost, Margin-pruning Boost, and Penalized AdaBoost. Finally, we use experiments to verify our analyses.
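The sample reweighting mentioned above can be sketched with the classic discrete AdaBoost on one-dimensional data. This is an illustrative sketch only and is not one of the six variants analyzed in the paper. The weak learner (a threshold stump), the function names and the toy data are all assumptions made for the example.

```python
import math

def stump(x, threshold, polarity):
    # Weak classifier: predicts `polarity` above the threshold, else its opposite.
    return polarity if x > threshold else -polarity

def train_adaboost(xs, ys, rounds):
    """Discrete AdaBoost with threshold stumps; labels must be +1 or -1."""
    n = len(xs)
    weights = [1.0 / n] * n
    values = sorted(set(xs))
    # Candidate thresholds: one below all points, then midpoints between neighbours.
    thresholds = [values[0] - 1.0] + [(a + b) / 2.0 for a, b in zip(values, values[1:])]
    ensemble = []  # list of (alpha, threshold, polarity)
    for _ in range(rounds):
        best = None
        for t in thresholds:
            for polarity in (1, -1):
                err = sum(w for x, y, w in zip(xs, ys, weights)
                          if stump(x, t, polarity) != y)
                if best is None or err < best[0]:
                    best = (err, t, polarity)
        err, t, polarity = best
        err = min(max(err, 1e-12), 1 - 1e-12)    # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)  # vote of this weak classifier
        ensemble.append((alpha, t, polarity))
        # Reweight: misclassified samples gain weight, correct ones lose it.
        weights = [w * math.exp(-alpha * y * stump(x, t, polarity))
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, x):
    score = sum(alpha * stump(x, t, p) for alpha, t, p in ensemble)
    return 1 if score >= 0 else -1

# Toy data that no single stump can separate (hypothetical).
xs = [1, 2, 3, 4, 5, 6]
ys = [1, 1, -1, -1, 1, 1]
model = train_adaboost(xs, ys, rounds=3)
predictions = [predict(model, x) for x in xs]
```

On this toy set a single stump misclassifies at least two points, while the weighted vote of three stumps reproduces all six labels, which is the effect of reweighting the sketch is meant to show.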
Graphical fault tree analysis for fatal falls in the construction industry.
Chi, Chia-Fen; Lin, Syuan-Zih; Dewi, Ratna Sari
2014-11-01
The current study applied a fault tree analysis to represent the causal relationships among events and causes that contributed to fatal falls in the construction industry. Four hundred and eleven work-related fatalities in the Taiwanese construction industry were analyzed in terms of age, gender, experience, falling site, falling height, company size, and the causes for each fatality. Given that most fatal accidents involve multiple events, the current study coded up to a maximum of three causes for each fall fatality. After the Boolean algebra and minimal cut set analyses, accident causes associated with each falling site can be presented as a fault tree to provide an overview of the basic causes, which could trigger fall fatalities in the construction industry. Graphical icons were designed for each falling site along with the associated accident causes to illustrate the fault tree in a graphical manner. A graphical fault tree can improve inter-disciplinary discussion of risk management and the communication of accident causation to first line supervisors. Copyright © 2014 Elsevier Ltd. All rights reserved.
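The Boolean algebra and minimal cut set step can be illustrated with a small sketch. It assumes a fault tree given as nested `("AND" | "OR", [children])` tuples with basic events as strings; the event names below are hypothetical and are not taken from the study's data.

```python
from itertools import product

def minimal_cut_sets(node):
    """Return the minimal cut sets of a fault tree as a set of frozensets."""
    if isinstance(node, str):                 # basic event
        return {frozenset([node])}
    gate, children = node
    child_sets = [minimal_cut_sets(child) for child in children]
    if gate == "OR":
        # Any child's cut set triggers the gate.
        combined = set().union(*child_sets)
    elif gate == "AND":
        # One cut set from every child must occur together.
        combined = {frozenset().union(*combo) for combo in product(*child_sets)}
    else:
        raise ValueError(f"unknown gate: {gate}")
    # Absorption law: drop any cut set that strictly contains another.
    return {s for s in combined if not any(other < s for other in combined)}

# Hypothetical fault tree for a fall from a floor opening.
fault_tree = ("OR", [
    ("AND", ["unguarded opening", "no harness"]),
    ("AND", ["no harness", ("OR", ["unguarded opening", "scaffold collapse"])]),
    ("AND", ["ladder slip", "no harness", "unguarded opening"]),
])
cut_sets = minimal_cut_sets(fault_tree)
```

Each resulting set is a smallest combination of basic causes that is sufficient for the top event; the three-event branch is absorbed because two of its events already form a cut set.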
An Efficient Distributed Algorithm for Constructing Spanning Trees in Wireless Sensor Networks
Rosana Lachowski
2015-01-01
Monitoring and data collection are the two main functions in wireless sensor networks (WSNs). Collected data are generally transmitted via multihop communication to a special node, called the sink. While in a typical WSN, nodes have a sink node as the final destination for the data traffic, in an ad hoc network, nodes need to communicate with each other. For this reason, routing protocols for ad hoc networks are inefficient for WSNs. Trees, on the other hand, are classic routing structures explicitly or implicitly used in WSNs. In this work, we implement and evaluate distributed algorithms for constructing routing trees in WSNs described in the literature. After identifying the drawbacks and advantages of these algorithms, we propose a new algorithm for constructing spanning trees in WSNs. The performance of the proposed algorithm and the quality of the constructed tree were evaluated in different network scenarios. The results showed that the proposed algorithm is a more efficient solution. Furthermore, the algorithm provides multiple routes to the sensor nodes to be used as mechanisms for fault tolerance and load balancing.
An efficient distributed algorithm for constructing spanning trees in wireless sensor networks.
Lachowski, Rosana; Pellenz, Marcelo E; Penna, Manoel C; Jamhour, Edgard; Souza, Richard D
2015-01-14
Monitoring and data collection are the two main functions in wireless sensor networks (WSNs). Collected data are generally transmitted via multihop communication to a special node, called the sink. While in a typical WSN, nodes have a sink node as the final destination for the data traffic, in an ad hoc network, nodes need to communicate with each other. For this reason, routing protocols for ad hoc networks are inefficient for WSNs. Trees, on the other hand, are classic routing structures explicitly or implicitly used in WSNs. In this work, we implement and evaluate distributed algorithms for constructing routing trees in WSNs described in the literature. After identifying the drawbacks and advantages of these algorithms, we propose a new algorithm for constructing spanning trees in WSNs. The performance of the proposed algorithm and the quality of the constructed tree were evaluated in different network scenarios. The results showed that the proposed algorithm is a more efficient solution. Furthermore, the algorithm provides multiple routes to the sensor nodes to be used as mechanisms for fault tolerance and load balancing.
曾伟生
2012-01-01
Based on measured data for Chinese fir (Cunninghamia lanceolata) in southern China, three models, a DBH (diameter at breast height)-based volume model, a DRC (diameter at root collar)-based volume model, and a DBH-DRC regression model, were constructed simultaneously using the error-in-variable simultaneous equations approach. The results showed that DRC is closely related to DBH, with a coefficient of determination for the regression above 0.96, and that the prediction precision of the DRC-based volume model is clearly lower than that of the DBH-based volume model.
Kanungo, D. P.; Sharma, Shaifaly; Pain, Anindya
2014-09-01
The shear strength parameters of soil (cohesion and angle of internal friction) are quite essential in solving many civil engineering problems. In order to determine these parameters, laboratory tests are used. The main objective of this work is to evaluate the potential of Artificial Neural Network (ANN) and Regression Tree (CART) techniques for the indirect estimation of these parameters. Four different models, considering different combinations of 6 inputs, such as gravel %, sand %, silt %, clay %, dry density, and plasticity index, were investigated to evaluate the degree of their effects on the prediction of shear parameters. A performance evaluation was carried out using Correlation Coefficient and Root Mean Squared Error measures. It was observed that for the prediction of friction angle, the performance of both the techniques is about the same. However, for the prediction of cohesion, the ANN technique performs better than the CART technique. It was further observed that the model considering all of the 6 input soil parameters is the most appropriate model for the prediction of shear parameters. Also, connection weight and bias analyses of the best neural network (i.e., 6/2/2) were attempted using Connection Weight, Garson, and proposed Weight-bias approaches to characterize the influence of input variables on shear strength parameters. It was observed that the Connection Weight Approach provides the best overall methodology for accurately quantifying variable importance, and should be favored over the other approaches examined in this study.
Grinn-Gofroń, Agnieszka; Strzelczak, Agnieszka
2009-11-01
A study was made of the link between time of day, weather variables and the hourly content of certain fungal spores in the atmosphere of the city of Szczecin, Poland, in 2004-2007. Sampling was carried out with a Lanzoni 7-day-recording spore trap. The spores analysed belonged to the taxa Alternaria and Cladosporium. These spores were selected both for their allergenic capacity and for their high level presence in the atmosphere, particularly during summer. Spearman correlation coefficients between spore concentrations, meteorological parameters and time of day showed different indices depending on the taxon being analysed. Relative humidity (RH), air temperature, air pressure and clouds most strongly and significantly influenced the concentration of Alternaria spores. Cladosporium spores correlated less strongly and significantly than Alternaria. Multivariate regression tree analysis revealed that, at air pressures lower than 1,011 hPa the concentration of Alternaria spores was low. Under higher air pressure spore concentrations were higher, particularly when RH was lower than 36.5%. In the case of Cladosporium, under higher air pressure (>1,008 hPa), the spores analysed were more abundant, particularly after 0330 hours. In artificial neural networks, RH, air pressure and air temperature were the most important variables in the model for Alternaria spore concentration. For Cladosporium, clouds, time of day, air pressure, wind speed and dew point temperature were highly significant factors influencing spore concentration. The maximum abundance of Cladosporium spores in air fell between 1200 and 1700 hours.
Yingxin Gu
2016-11-01
Regression tree models have been widely used for remote sensing-based ecosystem mapping. Improper use of the sample data (model training and testing data) may cause overfitting and underfitting effects in the model. The goal of this study is to develop an optimal sampling data usage strategy for any dataset and identify an appropriate number of rules in the regression tree model that will improve its accuracy and robustness. Landsat 8 data and Moderate-Resolution Imaging Spectroradiometer-scaled Normalized Difference Vegetation Index (NDVI) were used to develop regression tree models. A Python procedure was designed to generate random replications of model parameter options across a range of model development data sizes and rule number constraints. The mean absolute difference (MAD) between the predicted and actual NDVI (scaled NDVI, value from 0–200) and its variability across the different randomized replications were calculated to assess the accuracy and stability of the models. In our case study, a six-rule regression tree model developed from 80% of the sample data had the lowest MAD (MADtraining = 2.5 and MADtesting = 2.4), which was suggested as the optimal model. This study demonstrates how the training data and rule number selections impact model accuracy and provides important guidance for future remote-sensing-based ecosystem modeling.
Gu, Yingxin; Wylie, Bruce K.; Boyte, Stephen; Picotte, Joshua J.; Howard, Danny; Smith, Kelcy; Nelson, Kurtis
2016-01-01
Regression tree models have been widely used for remote sensing-based ecosystem mapping. Improper use of the sample data (model training and testing data) may cause overfitting and underfitting effects in the model. The goal of this study is to develop an optimal sampling data usage strategy for any dataset and identify an appropriate number of rules in the regression tree model that will improve its accuracy and robustness. Landsat 8 data and Moderate-Resolution Imaging Spectroradiometer-scaled Normalized Difference Vegetation Index (NDVI) were used to develop regression tree models. A Python procedure was designed to generate random replications of model parameter options across a range of model development data sizes and rule number constraints. The mean absolute difference (MAD) between the predicted and actual NDVI (scaled NDVI, value from 0–200) and its variability across the different randomized replications were calculated to assess the accuracy and stability of the models. In our case study, a six-rule regression tree model developed from 80% of the sample data had the lowest MAD (MADtraining = 2.5 and MADtesting = 2.4), which was suggested as the optimal model. This study demonstrates how the training data and rule number selections impact model accuracy and provides important guidance for future remote-sensing-based ecosystem modeling.
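The evaluation loop described above (random replications, then MAD and its variability) might look roughly like the sketch below. It is not the authors' Python procedure: the one-split regression tree stands in for their rule-based models, and every function name, the toy data and the seed are assumptions for illustration.

```python
import random
import statistics

def mean_absolute_difference(predicted, actual):
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)

def fit_stump(xs, ys):
    """One-split regression tree (a stand-in for a multi-rule model).
    Assumes at least two distinct x values.
    Returns (threshold, left_mean, right_mean)."""
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between equal x values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        lm, rm = statistics.fmean(left), statistics.fmean(right)
        sse = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, (pairs[i - 1][0] + pairs[i][0]) / 2, lm, rm)
    return best[1:]

def replicate(xs, ys, train_fraction, replications, seed=0):
    """Mean and spread of the testing MAD over random train/test splits."""
    rng = random.Random(seed)
    indices = list(range(len(xs)))
    test_mads = []
    for _ in range(replications):
        rng.shuffle(indices)
        cut = int(train_fraction * len(indices))
        train, test = indices[:cut], indices[cut:]
        t, lm, rm = fit_stump([xs[i] for i in train], [ys[i] for i in train])
        preds = [lm if xs[i] <= t else rm for i in test]
        test_mads.append(mean_absolute_difference(preds, [ys[i] for i in test]))
    return statistics.fmean(test_mads), statistics.pstdev(test_mads)

# Toy step-shaped data (hypothetical), 80% of samples used for training.
xs = list(range(20))
ys = [10.0 if x < 10 else 50.0 for x in xs]
mean_mad, spread = replicate(xs, ys, train_fraction=0.8, replications=25)
```

Repeating the call over several values of `train_fraction` gives the kind of accuracy-versus-stability comparison the abstract describes.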
S. Kamakshi
2014-01-01
A vehicular ad hoc network is an ad hoc network formed among moving vehicles that have wireless dedicated short range communication (DSRC) devices, in order to provide ubiquitous connectivity even if the road-side infrastructure is unavailable. Message dissemination in vehicular ad hoc networks is necessary for exchanging information about prevailing traffic conditions, so that vehicles can take alternate routes to avoid traffic jams. A major challenge in broadcast protocols is that they result in flooding of messages, which reduces the speed of dissemination due to collisions. Dissemination of messages using a tree topology reduces the number of rebroadcasts. The Dynamicity Aware Graph Relabeling System model provides a framework to construct a spanning tree in a mobile wireless network. In this paper, we propose a new distributed algorithm for constructing an arbitrary spanning tree based on the Dynamicity Aware Graph Relabeling System model, which develops a maximum leaf spanning tree in order to reduce the number of rebroadcasts and the dissemination time. Our simulation results show that the number of vehicles rebroadcasting the message is curtailed to 15% and the dissemination time required to achieve 100% reachability is curtailed by 10% under average vehicle density.
Lopes, J.A Pecas; Vasconcelos, Maria Helena O.P. de [Instituto de Engenharia de Sistemas e Computadores (INESC), Porto (Portugal). E-mail: jpl@riff.fe.up.pt; hvasconcelos@inescn.pt
1999-07-01
This paper describes, in summary form, the methodology adopted to define structures for the fast evaluation of the dynamic safety of isolated networks with a high level of wind power production. The methodology uses hybrid regression trees, which allow quantification of the robustness associated with the dynamic behavior of these networks by emulating the minimum frequency deviation that the system will experience when subjected to a pre-defined perturbation. New procedures for automatic data generation are also presented; the generated data are used to construct the evaluation structures and to measure their performance. The paper describes a case study of the network of Terceira island in the Azores archipelago.
Azad, Mohammad
2014-01-01
This paper presents a greedy algorithm for constructing decision trees under three different approaches (many-valued decision, most common decision, and generalized decision) in order to handle the inconsistency of multiple decisions in a decision table. The algorithm uses a greedy heuristic, ‘misclassification error’, which runs faster than the ‘number of boundary subtables’ heuristic from the literature and, for some cost functions, gives better results. It can therefore be used on larger data sets and does not require a huge amount of memory. Experimental results for the depth, average depth and number of nodes of the decision trees constructed by this algorithm are compared in the framework of each of the three approaches.
ON MULTICAST TREE CONSTRUCTION IN IPV4-IPV6 HYBRID NETWORK
Zhang Chao; Zhang Yuan; Huang Yongfeng; Li Xing
2010-01-01
With IPv4 addresses being exhausted and IPv6 emerging, the Peer-to-Peer (P2P) overlay is becoming increasingly heterogeneous and complex: pure IPv4, dual stack and pure IPv6 hosts coexist, and the connectivity limitation between IPv4 and IPv6 hosts requires the overlay protocols to be fit for this hybrid situation. This paper sets out to answer the question of how to construct a multicast tree on top of an IPv4-IPv6 hybrid network. Our solution is a New Greedy Algorithm (NGA), which eliminates the problem of joining failure in the hybrid network and keeps the efficiency of the greedy algorithm in tree construction. Simulation results show that our algorithm has excellent performance, which is very close to the optimal in many cases.
Al-Khaja, Nawal
2007-01-01
This is a thematic lesson plan for young learners about palm trees and the importance of taking care of them. The two part lesson teaches listening, reading and speaking skills. The lesson includes parts of a tree; the modal auxiliary, can; dialogues and a role play activity.
Ermarth, Anna; Bryce, Matthew; Woodward, Stephanie; Stoddard, Gregory; Book, Linda; Jensen, M Kyle
2017-03-01
Celiac disease is detected using serology and endoscopy analyses. We used multiple statistical analyses of a geographically isolated population in the United States to determine whether a single serum screening can identify individuals with celiac disease. We performed a retrospective study of 3555 pediatric patients (18 years old or younger) in the intermountain West region of the United States from January 1, 2008, through September 30, 2013. All patients had undergone serologic analyses for celiac disease, including measurement of antibodies to tissue transglutaminase (TTG) and/or deamidated gliadin peptide (DGP), and had duodenal biopsies collected within the following year. Modified Marsh criteria were used to identify patients with celiac disease. We developed models to identify patients with celiac disease using logistic regression and classification and regression tree (CART) analysis. Single use of a test for serum level of IgA against TTG identified patients with celiac disease with 90% sensitivity, 90% specificity, a 61% positive predictive value (PPV), a 90% negative predictive value, and an area under the receiver operating characteristic curve value of 0.91; these values were higher than those obtained from assays for IgA against DGP or IgG against TTG plus DGP. Not including the test for DGP antibody caused only 0.18% of celiac disease cases to be missed. Level of TTG IgA 7-fold the upper limit of normal (ULN) identified patients with celiac disease with a 96% PPV and 100% specificity. Using CART analysis, we found a level of TTG IgA 3.2-fold the ULN and higher to most accurately identify patients with celiac disease (PPV, 89%). Multivariable CART analysis showed that a level of TTG IgA 2.5-fold the ULN and higher was sufficient to identify celiac disease in patients with type 1 diabetes (PPV, 88%). Serum level of IgA against TTG in patients with versus those without trisomy 21 did not affect diagnosis predictability in CART analysis. In a population
Hui-Yong Jiang; Zhong-Xi Huang; Xue-Feng Zhang; Richard Desper; Tong Zhao
2007-01-01
AIM: To construct tree models for classification of diffuse large B-cell lymphomas (DLBCL) by chromosome copy numbers, to compare them with cDNA microarray classification, and to explore models of multi-gene, multi-step and multi-pathway processes of DLBCL tumorigenesis. METHODS: Maximum-weight branching and distance based models were constructed based on the comparative genomic hybridization (CGH) data of 123 DLBCL samples using the established methods and software of Desper et al. A maximum likelihood tree model was also used to analyze the data. By comparing with the results reported in literature, values of tree models in the classification of DLBCL were elucidated. RESULTS: Both the branching and the distance-based trees classified DLBCL into three groups. We combined the classification methods of the two models and classified DLBCL into three categories according to their characteristics. The first group was marked by +Xq, +Xp, -17p and +13q; the second group by +3q, +18q and +18p; and the third group was marked by -6q and +6p. This chromosomal classification was consistent with cDNA classification. It indicated that -6q and +3q were two main events in the tumorigenesis of lymphoma. CONCLUSION: Tree models of lymphoma established from CGH data can be used in the classification of DLBCL. These models can suggest multi-gene, multi-step and multi-pathway processes of tumorigenesis. Two pathways, -6q preceding +6q and +3q preceding +18q, may be important in understanding tumorigenesis of DLBCL. The pathway, -6q preceding +6q, may have a close relationship with the tumorigenesis of non-GCB DLBCL.
Homer, Collin G.; Aldridge, Cameron L.; Meyer, Debra K.; Schell, Spencer J.
2012-02-01
Sagebrush ecosystems in North America have experienced extensive degradation since European settlement. Further degradation continues from exotic invasive plants, altered fire frequency, intensive grazing practices, oil and gas development, and climate change - adding urgency to the need for ecosystem-wide understanding. Remote sensing is often identified as a key information source to facilitate ecosystem-wide characterization, monitoring, and analysis; however, approaches that characterize sagebrush with sufficient and accurate local detail across large enough areas to support this paradigm are unavailable. We describe the development of a new remote sensing sagebrush characterization approach for the state of Wyoming, U.S.A. This approach integrates 2.4 m QuickBird, 30 m Landsat TM, and 56 m AWiFS imagery into the characterization of four primary continuous field components including percent bare ground, percent herbaceous cover, percent litter, and percent shrub, and four secondary components including percent sagebrush ( Artemisia spp.), percent big sagebrush ( Artemisia tridentata), percent Wyoming sagebrush ( Artemisia tridentata Wyomingensis), and shrub height using a regression tree. According to an independent accuracy assessment, primary component root mean square error (RMSE) values ranged from 4.90 to 10.16 for 2.4 m QuickBird, 6.01 to 15.54 for 30 m Landsat, and 6.97 to 16.14 for 56 m AWiFS. Shrub and herbaceous components outperformed the current data standard called LANDFIRE, with a shrub RMSE value of 6.04 versus 12.64 and a herbaceous component RMSE value of 12.89 versus 14.63. This approach offers new advancements in sagebrush characterization from remote sensing and provides a foundation to quantitatively monitor these components into the future.
Homer, Collin G.; Aldridge, Cameron L.; Meyer, Debra K.; Schell, Spencer J.
2012-01-01
Sagebrush ecosystems in North America have experienced extensive degradation since European settlement. Further degradation continues from exotic invasive plants, altered fire frequency, intensive grazing practices, oil and gas development, and climate change – adding urgency to the need for ecosystem-wide understanding. Remote sensing is often identified as a key information source to facilitate ecosystem-wide characterization, monitoring, and analysis; however, approaches that characterize sagebrush with sufficient and accurate local detail across large enough areas to support this paradigm are unavailable. We describe the development of a new remote sensing sagebrush characterization approach for the state of Wyoming, U.S.A. This approach integrates 2.4 m QuickBird, 30 m Landsat TM, and 56 m AWiFS imagery into the characterization of four primary continuous field components including percent bare ground, percent herbaceous cover, percent litter, and percent shrub, and four secondary components including percent sagebrush (Artemisia spp.), percent big sagebrush (Artemisia tridentata), percent Wyoming sagebrush (Artemisia tridentata Wyomingensis), and shrub height using a regression tree. According to an independent accuracy assessment, primary component root mean square error (RMSE) values ranged from 4.90 to 10.16 for 2.4 m QuickBird, 6.01 to 15.54 for 30 m Landsat, and 6.97 to 16.14 for 56 m AWiFS. Shrub and herbaceous components outperformed the current data standard called LANDFIRE, with a shrub RMSE value of 6.04 versus 12.64 and a herbaceous component RMSE value of 12.89 versus 14.63. This approach offers new advancements in sagebrush characterization from remote sensing and provides a foundation to quantitatively monitor these components into the future.
Artur Wnorowski
2017-06-01
Tree saps are nourishing biological media commonly used for beverage and syrup production. Although the nutritional aspect of tree saps is widely acknowledged, the exact relationship between the sap composition, origin, and effect on the metabolic rate of human cells is still elusive. Thus, we collected saps from seven different tree species and conducted composition-activity analysis. Saps from trees of the Betulaceae family, but not from the Salicaceae, Sapindaceae, or Juglandaceae families, increased the metabolic rate of HepG2 cells, as measured using a tetrazolium-based assay. The content of glucose, fructose, sucrose, chlorides, nitrates, sulphates, fumarates, malates, and succinates in sap samples varied across different tree species. Grade correspondence analysis clustered trees based on the saps’ chemical footprint, indicating its usability in chemotaxonomy. Multiple regression modeling showed that glucose and fumarate present in saps from silver birch (Betula pendula Roth.), black alder (Alnus glutinosa Gaertn.), and European hornbeam (Carpinus betulus L.) positively affect the metabolic activity of HepG2 cells.
Karpinski, M. [Univ. of Bonn (Germany); Larmore, L.L. [Univ. of Nevada, Las Vegas, NV (United States); Rytter, W. [Warsaw Univ. (Poland)
1996-12-31
A sublinear time, subquadratic work parallel algorithm is given for construction of an optimal binary search tree in a special case of practical interest, namely where the frequencies of items to be stored are not too small. A sublinear time, subquadratic work parallel algorithm for construction of an approximately optimal binary search tree in the general case is also given. Subquadratic work and sublinear time are achieved using a fast parallel algorithm for the column minima problem for Monge matrices developed by Atallah and Kosaraju. The algorithms given in this paper take O(n^0.6) time with n processors in the CREW PRAM model. Our algorithms work well if every subtree of the optimal binary search tree of depth Ω(log n) has o(n) leaves. We prove that there is a sequential algorithm with subquadratic average-case complexity, by demonstrating that the "small subtree" condition holds with very high probability for a randomly permuted weight sequence. This solves a conjecture posed in the literature and breaks the quadratic time "barrier" of Knuth's algorithm. This algorithm can also be parallelized to run in average sublinear time with n processors.
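For reference, the quadratic-time sequential baseline attributed to Knuth can be sketched as a dynamic program that restricts the candidate roots of each interval using root monotonicity. This is the classic textbook formulation, not the parallel algorithm of the paper, and it assumes only access frequencies of stored keys (no weights for unsuccessful searches). The function name is hypothetical.

```python
def optimal_bst_cost(freq):
    """Minimum weighted search cost of a BST over keys 0..n-1 in sorted order,
    where freq[k] is the access frequency of key k and the root has depth 1."""
    n = len(freq)
    prefix = [0]
    for f in freq:
        prefix.append(prefix[-1] + f)
    # cost[i][j]: optimal cost for keys i..j-1 (empty interval when i == j).
    cost = [[0] * (n + 1) for _ in range(n + 1)]
    root = [[0] * (n + 1) for _ in range(n + 1)]
    for i in range(n):
        cost[i][i + 1] = freq[i]
        root[i][i + 1] = i
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length
            weight = prefix[j] - prefix[i]
            best = None
            # Knuth's observation: root[i][j-1] <= root[i][j] <= root[i+1][j],
            # so only this window of roots needs to be tried.
            for r in range(root[i][j - 1], root[i + 1][j] + 1):
                c = cost[i][r] + cost[r + 1][j] + weight
                if best is None or c < best:
                    best, root[i][j] = c, r
            cost[i][j] = best
    return cost[0][n]

total = optimal_bst_cost([34, 8, 50])
```

Because the root windows telescope along each diagonal of the table, the total work is O(n^2) rather than the O(n^3) of the unrestricted recurrence.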
Constructing a Gene Team Tree in Almost O (n lg n) Time.
Wang, Biing-Feng; Lin, Chien-Hsin; Yang, I-Tse
2014-01-01
An important model of a conserved gene cluster is called the gene team model, in which a chromosome is defined to be a permutation of distinct genes and a gene team is defined to be a set of genes that appear in two or more species, with the distance between adjacent genes in the team for each chromosome always no more than a certain threshold δ. A gene team tree is a succinct way to represent all gene teams for every possible value of δ. The previous fastest algorithm for constructing a gene team tree of two chromosomes requires O(n lg n lglg n) time, which was given by Wang and Lin. Its bottleneck is a problem called the maximum-gap problem. In this paper, by presenting an improved algorithm for the maximum-gap problem, we reduce the upper bound of the gene team tree problem to O(n lg n α(n)). Since α grows extremely slowly, this result is almost as efficient as the current best upper bound, O(n lg n), for finding the gene teams of a fixed δ value. Our new algorithm is very efficient from both the theoretical and practical points of view. Wang and Lin's gene-team-tree algorithm can be extended to k chromosomes with complexity O(kn lg n lglg n). Similarly, our improved algorithm for the maximum-gap problem reduces this running time to O(kn lg n α(n)). In addition, it also provides new upper bounds for the gene team tree problem on general sequences, in which multiple copies of the same gene are allowed.
This paper presents a novel approach for diffeomorphic image regression and atlas estimation that results in improved convergence and numerical stability. We use a vector momenta representation of a diffeomorphism's initial conditions instead of the standard scalar momentum that is typically used. The corresponding variational problem results in a closed-form update for template estimation in both the geodesic regression and atlas estimation problems. While we show that the theoretical optimal solution is equivalent to the scalar momenta case, the simplification of the optimization problem leads to more stable and efficient estimation in practice. We demonstrate the effectiveness of our method for atlas estimation and geodesic regression using synthetically generated shapes and 3D MRI brain scans.
NetRaVE: constructing dependency networks using sparse linear regression
NetRaVE is a small suite of R functions for generating dependency networks using sparse regression methods. Such networks provide an alternative to interpreting 'top n lists' of genes arising out of an analysis of microarray data, and they provide a means of organizing and visualizing the resulting...
The study presents a new serial pooling method of shifted tree ring blocks for the building of isotope chronologies. This method combines the advantages of traditional 'serial' and 'intertree' pooling, and can be recommended for the construction of sub-regional long isotope chronologies with sufficient replication and at annual resolution, especially in the case of extremely narrow tree rings. For Scots pines (Pinus sylvestris L., Khibiny Low Mountains, NW Russia) and Silver firs (Abies alba Mill., Franconia, Southern Germany), serial pooling of five consecutive tree rings seems appropriate because species- and site-specific particularities blur the climate linkages in their tree rings for a period of up to ca. five years back. An equivalent to a five-year running mean curve obtained from the annual data sets of single trees can be derived from the analysis of yearly shifted five-year blocks of consecutive tree rings, and therefore at approximately 20% of the expense. Good coherence of delta(13)C and delta(18)O values between calculated means of annual total rings or late wood data and means of five-year blocks of consecutive total tree rings analysed experimentally on most similar material confirms this assumption.
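The comparison target mentioned above, a five-year running mean of annual values, can be sketched as follows. This is only an illustration of the arithmetic; the function name `running_mean` and the sample values are hypothetical, and the sketch assumes each ring contributes equally to a pooled block, which the abstract does not state.

```python
def running_mean(values, window=5):
    """Return the running mean over consecutive windows of the given size."""
    if window <= 0:
        raise ValueError("window must be positive")
    means = []
    for start in range(len(values) - window + 1):
        block = values[start:start + window]
        means.append(sum(block) / window)
    return means


# Hypothetical annual isotope values for six consecutive rings.
annual = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
print(running_mean(annual))  # one mean per yearly shifted five-ring block
```

Under the equal-contribution assumption, one measurement of a pooled five-ring block stands in for five annual measurements, which is consistent with the roughly 20% expense quoted in the abstract.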
Towards Automating the Construction & Maintenance of Attack Trees: a Feasibility Study
Security risk management can be applied on well-defined or existing systems; in this case, the objective is to identify existing vulnerabilities, assess the risks and provide for the adequate countermeasures. Security risk management can also be applied very early in the system's development life-cycle, when its architecture is still poorly defined; in this case, the objective is to positively influence the design work so as to produce a secure architecture from the start. The latter work is made difficult by the uncertainties on the architecture and the multiple round-trips required to keep the risk assessment study and the system architecture aligned. This is particularly true for very large projects running over many years. This paper addresses the issues raised by those risk assessment studies performed early in the system's development life-cycle. Based on industrial experience, it asserts that attack trees can help solve the human cognitive scalability issue related to securing those large, continuously-changing system-designs. However, big attack trees are difficult to build, and even more difficult to maintain. This paper therefore proposes a systematic approach to automate the construction and maintenance of such big attack trees, based on the system's operational and logical architectures, the system's traditional risk assessment study and a security knowledge database.
Regularized Boolean operations have been widely used in 3D modeling systems. However, evaluating Boolean operations may be quite numerically unstable and time consuming, especially for iterated set operations. A novel and unified technique is proposed in this paper for computing single and iterated set operations efficiently, robustly and exactly. An adaptive octree is combined with a nested constructive solid geometry (CSG) tree by this technique. The intersection handling is restricted to the cells in the octree where intersection actually occurs. Within those cells, a CSG tree template is instanced by the surfaces and the tree is converted to plane-based binary space partitioning (BSP) for set evaluation. Moreover, the surface classification is restricted to the cells in the octree where the surfaces only come from a model and are within the bounding boxes of other polyhedrons. These two ways bring about the efficiency and scalability of the operations, in terms of runtime and memory. As all surfaces in such a cell have the same classification relation, they are classified as a whole. Robustness and exactness are achieved by integrating plane-based geometry representation with adaptive geometry predicate technique in intersection handling, and by applying divide-and-conquer arithmetic on surface classification. Experimental results demonstrate that the proposed approach can guarantee the robustness of Boolean computations and runs faster than other existing approaches.
Constructing Minimal Spanning Tree Based on Rough Set Theory for Gene Selection
Soumen Kumar Pati
2013-02-01
Microarray gene datasets often have high dimensionality, which causes difficulty in clustering and classification. Datasets containing a huge number of genes lead to increased complexity and, therefore, to degraded dataset handling performance. Often, not all the measured features of these high-dimensional datasets are relevant for understanding the underlying phenomena of interest. Dimensionality reduction by reduct generation is hence performed as an important step before clustering and classification. The reduced attribute set has the same characteristics as the entire set of attributes in the information system. In this paper, a new attribute reduction technique, based on directed minimal spanning trees and rough set theory, is proposed for unsupervised learning. The method first computes a similarity factor between each pair of attributes using the indiscernibility relation, a concept of rough set theory. Based on the similarity factors, an attribute similarity set is formed, from which a directed weighted graph is constructed with attributes as vertices and the inverse of the similarity factor as edge weights. Then, all possible minimal spanning trees of the graph are generated. From each tree, iteratively, the most important vertex is included in the reduct set and all its outgoing edges are removed. The process stops when the edge set is empty, thus producing multiple reducts. The proposed method and some well-known attribute reduction techniques have been applied to several microarray gene datasets for gene selection. The results obtained show the effectiveness of the method.
In image-based three-dimensional (3-D) reconstruction, one topic of growing importance is how to quickly obtain a 3-D model from a large number of images. The retrieval of the correct and relevant images for the model poses a considerable technological challenge. The "image vocabulary tree" has been proposed as a method to search for similar images. However, a significant drawback of this approach is its low time efficiency and barely satisfactory classification result. The method proposed is inspired by, and improves upon, some recent methods. Specifically, vocabulary quality is considered and multivocabulary trees are designed to improve the classification result. A marked improvement was, indeed, observed in our evaluation of the proposed method. To improve time efficiency, graphics processing unit (GPU) parallel computation based on the compute unified device architecture (CUDA) is applied in the multivocabulary trees. The results of the experiments showed that the GPU was three to four times more efficient than the enumeration matching and CPU methods when the number of images is large. This paper presents a reliable reference method for the rapid construction of a free network to be used for the computing of 3-D information.
Change with age in regression construction of fat percentage for BMI in school-age children.
In this study, curvilinear regression was applied to the relationship between BMI and body fat percentage, and an analysis was done to see whether there are characteristic changes in that curvilinear regression from elementary to middle school. Then, by simultaneously investigating the changes with age in BMI and body fat percentage, the essential differences in BMI and body fat percentage were demonstrated. The subjects were 789 boys and girls (469 boys, 320 girls) aged 7.5 to 14.5 years from all parts of Japan who participated in regular sports activities. Body weight, total body water (TBW), soft lean mass (SLM), body fat percentage, and fat mass were measured with a body composition analyzer (Tanita BC-521 Inner Scan), using segmental bioelectrical impedance analysis and multi-frequency bioelectrical impedance analysis. Height was measured with a digital height measurer. Body mass index (BMI) was calculated as body weight (kg) divided by the square of height (m). The results for the validity of regression polynomials of body fat percentage against BMI showed that, for both boys and girls, first-order polynomials were valid in all school years. With regard to changes with age in BMI and body fat percentage, the results showed a temporary drop at 9 years in the aging distance curve in boys, followed by an increasing trend. Peaks were seen in the velocity curve at 9.7 and 11.9 years, but the MPV was presumed to be at 11.9 years. Among girls, a decreasing trend was seen in the aging distance curve, which was opposite to the changes in the aging distance curve for body fat percentage.
Effective usable area is a key parameter in land acquisition and afforestation planning. The purpose of this research was to generate predictive maps of areas suitable for planting eucalyptus trees using binary logistic regressions and geomorphometric variables. The relationships between the predicting variables and suitable areas for planting eucalyptus trees were modeled, and the variable that best explained the occurrence of suitable lands was distance from rivers. The generated map showing areas suitable for planting had a high ability to reproduce the original planting map. Logistic regressions demonstrated the feasibility of using this approach to map suitability for eucalyptus forestation.
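The BMI definition and the first-order polynomial regression of body fat percentage on BMI can be sketched as below. This is a minimal illustration using ordinary least squares; the function names and the sample numbers are hypothetical and are not data from the study.

```python
def bmi(weight_kg, height_m):
    """Body mass index: weight in kilograms divided by height in metres squared."""
    return weight_kg / height_m ** 2


def fit_first_order(x, y):
    """Least-squares fit of y = slope * x + intercept; returns (slope, intercept)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical BMI values and body fat percentages, for illustration only.
bmi_values = [15.0, 20.0, 25.0]
fat_percent = [10.0, 20.0, 30.0]
slope, intercept = fit_first_order(bmi_values, fat_percent)
```

A higher-order fit would add powers of BMI as further predictors; the abstract reports that the first-order form was adequate in every school year.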
Henri Epstein
In a competitive and globalized economic environment, organizations need to evolve to keep up with the changes that the environment imposes on them, seeking sustainability and perpetuity. As the pace of change increases, the durability of business strategies decreases, creating a need for continuous transformation and permanent restructuring. The objective of this study is to analyze the correlations and regression models derived from economic and financial ratios for profitability, liquidity and debt, based on the corporations that held the investment grade certification in 2008, issued by the international rating agencies Standard & Poor's, Moody's and Fitch Ratings. The methodology of this study is quantitative, based on statistical analysis of correlation and regression. The study found that the variables examined could be the basis for the construction of an economic and financial indicator of investment grade. Keywords: Investment Grade. Indicator. Corporations.
The polynomial matrix using the block coefficient matrix representation auto-regressive moving average (referred to as the PM-ARMA) model is constructed in this paper for actively controlled multi-degree-of-freedom (MDOF) structures with time-delay by equivalently transforming the preliminary state space realization into a new state space realization. The PM-ARMA model is a more general formulation than the polynomial using the coefficient representation auto-regressive moving average (ARMA) model because it can cope with actively controlled structures with any given number of structural degrees of freedom and any chosen number of sensors and actuators (the numbers of sensors and actuators are required to be identical), under stationary stochastic excitation of any dimension.
This research aims to develop a mathematical model for assessing the expected net profit of any construction company. To achieve the research objective, four steps were performed. First, the main factors affecting firms' net profit were identified. Second, pertinent data regarding the net profit factors were collected. Third, two different net profit models were developed using the Multiple Regression (MR) and the Neural Network (NN) techniques. The validity of the proposed models was also investigated. Finally, the results of both MR and NN models were compared to investigate the predictive capabilities of the two models.
Cystic echinococcosis, caused by the metacestode larvae of Echinococcus granulosus (Eg), is a chronic, worldwide, and severe zoonotic parasitosis. The treatment of cystic echinococcosis is still difficult since surgery cannot fit the needs of all patients, and drugs can lead to serious adverse events as well as resistance. Screening for target proteins that interact with new anti-hydatidosis drugs is urgently needed to meet the prevailing challenges. Here, we analyzed the sequence and structural properties of E. granulosus aquaporins (EgAQPs) and constructed a phylogenetic tree by bioinformatics methods. The MIP family signature and protein kinase C phosphorylation sites were predicted in all nine EgAQPs. α-helix and random coil were the main secondary structures of EgAQPs. The numbers of transmembrane regions were three to six, which indicated that EgAQPs contained multiple hydrophobic regions. A neighbor-joining tree indicated that EgAQPs were divided into two branches: seven EgAQPs formed a clade with human AQP1, a "strict" aquaporin, and the other two EgAQPs formed a clade with human AQP9, an aquaglyceroporin. Unfortunately, homology modeling of EgAQPs was aborted. These results provide a foundation for understanding and researching the biological function of E. granulosus.
Classification and regression tree (CART) analysis was applied to genome-wide tetranucleotide frequencies (genomic signatures) of 195 archaea and bacteria. Although genomic signatures have typically been used to classify evolutionary divergence, in this study, convergent evolution was the focus. Temperature optima for most of the organisms examined could be distinguished by CART analyses of tetranucleotide frequencies. This suggests that pervasive (nonlinear) qualities of genomes may reflect certain environmental conditions (such as temperature) in which those genomes evolved. The predominant use of GAGA and AGGA as the discriminating tetramers in CART models suggests that purine-loading and codon biases of thermophiles may explain some of the results.
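The feature vector that such a CART analysis takes as input, the relative frequency of each of the 256 tetranucleotides in a genome, can be sketched as below. This is an illustrative assumption about the preprocessing, not the study's pipeline: it scans a single strand with overlapping windows and ignores windows containing ambiguous bases, and the function name is hypothetical.

```python
from collections import Counter
from itertools import product


def tetranucleotide_frequencies(sequence):
    """Relative frequency of each of the 256 tetramers over A, C, G, T.

    Assumptions: single strand, overlapping windows, and windows that
    contain any other symbol (such as N) are skipped.
    """
    sequence = sequence.upper()
    tetramers = ["".join(p) for p in product("ACGT", repeat=4)]
    counts = Counter(sequence[i:i + 4] for i in range(len(sequence) - 3))
    total = sum(counts[t] for t in tetramers)
    return {t: (counts[t] / total if total else 0.0) for t in tetramers}


signature = tetranucleotide_frequencies("GAGAGA")
print(signature["GAGA"], signature["AGAG"])
```

Each genome then becomes one row of 256 numbers, and a classification tree can split on individual tetramers such as GAGA or AGGA.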
D-Tree Approach to Constructing Overlapping Location-Dependent Data Regions
Over the past few years, there has been an explosion in the number of individuals carrying wireless devices that can use location-dependent information services (LDISs), which provide information based on the locations specified in queries. This paper examines users' demand for vague queries and then proposes a system architecture for LDISs. In particular, based on the D-tree index structure, it uses membership functions from fuzzy set theory to provide a more user-oriented approach to constructing overlapping location-dependent data regions, called complementary void overlapping data regions. This structure, carrying more information with little additional storage, balances perceived usefulness against responsiveness, energy consumption, and bandwidth contention in wireless communications. The simulations show that this method is more flexible and practical.
Recursive partitioning methods have become popular and widely used tools for nonparametric regression and classification in many scientific fields. Especially random forests, which can deal with large numbers of predictor variables even in the presence of complex interactions, have been applied successfully in genetics, clinical medicine, and…
Construction and Study of Tests in Regression. 1. Correction of the Likelihood Ratio by Laplace Approximation in Nonlinear Regression. 2. Goodness-of-Fit Test in Isotonic Regression Based on the Asymptotics of the Fluctuations of the Distance
Local adaptation to climate in temperate forest trees involves the integration of multiple physiological, morphological, and phenological traits. Latitudinal clines are frequently observed for these traits, but environmental constraints also track longitude and altitude. We combined extensive phenotyping of 12 candidate adaptive traits, multivariate regression trees, quantitative genetics, and a genome-wide panel of SNP markers to better understand the interplay among geography, climate, and adaptation to abiotic factors in Populus trichocarpa. Heritabilities were low to moderate (0.13 to 0.32) and population differentiation for many traits exceeded the 99th percentile of the genome-wide distribution of FST, suggesting local adaptation. When climate variables were taken as predictors and the 12 traits as response variables in a multivariate regression tree analysis, evapotranspiration (Eref) explained the most variation, with subsequent splits related to mean temperature of the warmest month, frost-free period (FFP), and mean annual precipitation (MAP). These groupings matched relatively well the splits using geographic variables as predictors: the northernmost groups (short FFP and low Eref) had the lowest growth and the lowest cold injury index; the southern British Columbia group (low Eref and intermediate temperatures) had average growth and cold injury index; the group from the coast of California and Oregon (high Eref and FFP) had the highest growth performance and the highest cold injury index; and the southernmost, high-altitude group (with high Eref and low FFP) performed poorly, had a high cold injury index, and had lower water use efficiency. Taken together, these results suggest that variation in both temperature and water availability across the range shapes multivariate adaptive traits in poplar.
The article presents an innovative use of an inductive algorithm for generating a decision tree to analyze the rank validity of the construction and maintenance parameters of a gear pump with undercut teeth. It presents an alternative to the existing methods of discrete optimization for generating sets of decisions and determining the hierarchy of decision variables.
This study compares the performance of Binary Logistic Regression (BLR) and Boosted Regression Trees (BRT) methods in assessing landslide susceptibility for multiple-occurrence regional landslide events within the Mediterranean region. A test area was selected in the north-eastern sector of Sicily (southern Italy), corresponding to the catchments of the Briga and the Giampilieri streams, both stretching for a few kilometres from the Peloritan ridge (eastern Sicily, Italy) to the Ionian sea. This area was struck on the 1st October 2009 by an extreme climatic event resulting in thousands of rapid shallow landslides, mainly of the debris-flow and debris-avalanche types, involving the weathered layer of a low to high grade metamorphic bedrock. Exploiting the same set of predictors and the 2009 landslide archive, BLR- and BRT-based susceptibility models were obtained for the two catchments separately, adopting a random partition (RP) technique for validation; besides, the models trained in one of the two catchments (Briga) were tested in predicting the landslide distribution in the other (Giampilieri), adopting a spatial partition (SP) based validation procedure. All the validation procedures were based on multi-fold tests so as to evaluate and compare the reliability of the fitting, the prediction skill, the coherence in the predictor selection and the precision of the susceptibility estimates. All the obtained models for the two methods produced very high predictive performances, with a general congruence between BLR and BRT in the predictor importance. In particular, the research highlighted that BRT models reached a higher prediction performance than BLR models for RP-based modelling, whilst for the SP-based models the difference in predictive skill between the two methods dropped drastically, converging to an analogous excellent performance. However, when looking at the precision of the probability estimates, BLR proved to produce more robust
Yield curve event tree construction for multi stage stochastic programming models
by the quality and size of the event trees representing the underlying uncertainty. Most often the DSP literature assumes the existence of "appropriate" event trees without defining and examining the qualities that must be met (ex ante) in such an event tree in order for the results of the DSP model to be reliable....... Indeed, defining a universal and tractable framework for fully "appropriate" event trees is in our opinion an impossible task. A problem-specific approach to designing such event trees is the way ahead. In this paper we propose a number of desirable properties which should be present in an event tree...... of yield curves. Such trees may then be used to represent the underlying uncertainty in DSP models of fixed income risk and portfolio management....
In the paper, we study a greedy algorithm for the construction of decision trees. This algorithm is applicable to decision tables with many-valued decisions, where each row is labeled with a set of decisions. For a given row, we should find a decision from the set attached to this row. Experimental results for data sets from the UCI Machine Learning Repository and for randomly generated tables are presented. We make a comparative study of the depth and average depth of the decision trees constructed by the proposed approach and by the approach based on the generalized decision. The obtained results show that the proposed approach can be useful from the point of view of knowledge representation and algorithm construction.
Patch-based image segmentation of satellite imagery using minimum spanning tree construction
We present a method for hierarchical image segmentation and feature extraction. This method builds upon the combination of the detection of image spectral discontinuities using Canny edge detection and the image Laplacian, followed by the construction of a hierarchy of segmented images of successively reduced levels of detail. These images are represented as sets of polygonized pixel patches (polygons) attributed with spectral and structural characteristics. This hierarchy forms the basis for object-oriented image analysis. To build a fine level-of-detail representation of the original image, seed partitions (polygons) are built upon a triangular mesh composed of irregularly sized triangles, whose spatial arrangement is adapted to the image content. This is achieved by building the triangular mesh on top of the detected spectral discontinuities, which form a network of constraints for the Delaunay triangulation. A polygonized image is represented as a spatial network in the form of a graph whose vertices correspond to the polygonal partitions and whose edges reflect pairwise relations between partitions. Image graph partitioning is based on iterative graph contraction using Boruvka's Minimum Spanning Tree algorithm. An important characteristic of the approach is that the agglomeration of partitions is constrained by the detected spectral discontinuities; thus the shapes of agglomerated partitions are more likely to correspond to the outlines of real-world objects.
Genetic linkage maps are cornerstones of a wide spectrum of biotechnology applications, including map-assisted breeding, association genetics, and map-assisted gene cloning. During the past several years, the adoption of high-throughput genotyping technologies has been paralleled by a substantial increase in the density and diversity of genetic markers. New genetic mapping algorithms are needed in order to efficiently process these large datasets and accurately construct high-density genetic maps. In this paper, we introduce a novel algorithm to order markers on a genetic linkage map. Our method is based on a simple yet fundamental mathematical property that we prove under rather general assumptions. The validity of this property allows one to determine efficiently the correct order of markers by computing the minimum spanning tree of an associated graph. Our empirical studies obtained on genotyping data for three mapping populations of barley (Hordeum vulgare), as well as extensive simulations on synthetic data, show that our algorithm consistently outperforms the best available methods in the literature, particularly when the input data are noisy or incomplete. The software implementing our algorithm is available in the public domain as a web tool under the name MSTmap.
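The contraction step named above, Boruvka's minimum spanning tree algorithm, can be sketched on a plain weighted graph as follows. This is a generic textbook version under stated assumptions: vertices stand for polygonal partitions, edge weights stand for some dissimilarity between neighbouring partitions, and the constraint from detected spectral discontinuities described in the abstract is not modelled. The function name and the sample graph are hypothetical.

```python
def boruvka_mst(num_vertices, edges):
    """Minimum spanning forest by Boruvka's algorithm.

    edges: list of (weight, u, v) tuples with 0 <= u, v < num_vertices.
    Returns the list of selected (weight, u, v) edges.
    """
    parent = list(range(num_vertices))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    selected = []
    components = num_vertices
    while components > 1:
        # For every component, record its cheapest outgoing edge.
        cheapest = {}
        for index, (weight, u, v) in enumerate(edges):
            ru, rv = find(u), find(v)
            if ru == rv:
                continue
            key = (weight, index)  # index breaks ties consistently
            for r in (ru, rv):
                if r not in cheapest or key < cheapest[r][0]:
                    cheapest[r] = (key, u, v)
        if not cheapest:
            break  # graph is disconnected; a spanning forest remains
        # Contract: merge components along the recorded edges.
        for (weight, _), u, v in cheapest.values():
            ru, rv = find(u), find(v)
            if ru != rv:
                parent[ru] = rv
                selected.append((weight, u, v))
                components -= 1
    return selected


# Hypothetical region adjacency graph with four partitions.
graph = [(1, 0, 1), (2, 1, 2), (3, 2, 3), (4, 0, 3), (5, 0, 2)]
tree = boruvka_mst(4, graph)
```

Each pass of the outer loop merges every component with its nearest neighbour, so stopping after a chosen number of passes yields one level of a segmentation hierarchy rather than a full spanning tree.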
Peer nominations and demographic information were collected from a diverse sample of 1493 elementary school participants to examine behavior (overt and relational aggression, impulsivity, and prosociality), context (peer status), and demographic characteristics (race and gender) as predictors of teacher and administrator decisions about discipline. Exploratory results using classification tree analyses indicated students nominated as average or highly overtly aggressive were more likely to be disciplined than others. Among these students, race was the most significant predictor, with African American students more likely to be disciplined than Caucasians, Hispanics, or Others. Among the students nominated as low in overt aggression, a lack of prosocial behavior was the most significant predictor. Confirmatory analysis using hierarchical logistic regression supported the exploratory results. Similarities with other biased referral patterns, proactive classroom management strategies, and culturally sensitive recommendations are discussed.
This study determines the influence of the different soil components and of the cation-exchange capacity on the adsorption and retention of different heavy metals: cadmium, chromium, copper, nickel, lead and zinc. To do so, regression models were created using decision trees and the importance of the soil components was assessed. The variables used were: humified organic matter; specific cation-exchange capacity; percentages of sand and silt; proportions of Mn, Fe and Al oxides and hematite; proportions of quartz, plagioclase and mica; and the proportions of the different clays: kaolinite, vermiculite, gibbsite and chlorite. The most important components in the obtained models were vermiculite and gibbsite, especially for the adsorption of cadmium and zinc, while clays were less relevant. Oxides are less important than clays, especially for the adsorption of chromium and lead and the retention of chromium, copper and lead. PMID:28072849
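The core step of a regression tree, choosing the threshold on one predictor that minimizes the summed squared error of the two resulting groups, can be sketched as below. This is a generic CART-style illustration and not the study's model; the function name is hypothetical and the numbers are made up, with the predictor standing for something like a clay proportion and the response for an adsorbed metal amount.

```python
def best_split(x, y):
    """Best threshold for the rule x <= threshold on a single predictor.

    Returns (threshold, sse), where sse is the summed squared error around
    the two group means. threshold is None when no split improves on the
    unsplit data. Requires a non-empty input.
    """
    pairs = sorted(zip(x, y))
    n = len(pairs)
    total_sum = sum(value for _, value in pairs)
    total_sq = sum(value * value for _, value in pairs)
    best = (None, total_sq - total_sum ** 2 / n)
    left_sum = 0.0
    left_sq = 0.0
    for i in range(n - 1):
        xi, yi = pairs[i]
        left_sum += yi
        left_sq += yi * yi
        if xi == pairs[i + 1][0]:
            continue  # cannot split between identical predictor values
        n_left = i + 1
        n_right = n - n_left
        right_sum = total_sum - left_sum
        right_sq = total_sq - left_sq
        sse = (left_sq - left_sum ** 2 / n_left) + (right_sq - right_sum ** 2 / n_right)
        if sse < best[1]:
            best = ((xi + pairs[i + 1][0]) / 2, sse)
    return best


# Made-up values for illustration only.
predictor = [1, 2, 3, 10, 11, 12]
response = [5, 5, 5, 20, 20, 20]
threshold, error = best_split(predictor, response)
```

Repeating this search over every predictor and recursing into each group grows the tree, and the total error reduction credited to a predictor is one common basis for the importance rankings that the abstract reports.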
sand, silt, and clay in soil determines its textural classification. This study used Geographic Information Systems (GIS) and regression-tree modeling to precisely quantify the relationships between the soil texture fractions and different environmental parameters on a national scale, and to detect...... precipitation, seasonal precipitation to statistically explain soil texture fractions field/laboratory measurements (45,224 sampling sites) in the area of interest (Denmark). The developed strongest relationships were associated with clay and silt, variance being equal to 60%, followed by coarse sand (54.......5%) and fine sand (52%) as the weakest relationship. This study also showed that parent materials (with a relative importance varying between 47% and 100%), geographic regions (31–100%) and landscape types (68–100%) considerably influenced all soil texture fractions, which is not the case for climate and DEM...
Various approaches are used to subdivide large areas into regions containing streams that have similar reference or background water quality and that respond similarly to different factors. For many applications, such as establishing reference conditions, it is preferable to use physical characteristics that are not affected by human activities to delineate these regions. However, most approaches, such as ecoregion classifications, rely on land use to delineate regions or have difficulties compensating for the effects of land use. Land use not only directly affects water quality, but it is often correlated with the factors used to define the regions. In this article, we describe modifications to SPARTA (spatial regression-tree analysis), a relatively new approach applied to water-quality and environmental characteristic data to delineate zones with similar factors affecting water quality. In this modified approach, land-use-adjusted (residualized) water quality and environmental characteristics are computed for each site. Regression-tree analysis is applied to the residualized data to determine the most statistically important environmental characteristics describing the distribution of a specific water-quality constituent. Geographic information for small basins throughout the study area is then used to subdivide the area into relatively homogeneous environmental water-quality zones. For each zone, commonly used approaches are subsequently used to define its reference water quality and how its water quality responds to changes in land use. SPARTA is used to delineate zones of similar reference concentrations of total phosphorus and suspended sediment throughout the upper Midwestern part of the United States. © 2006 Springer Science+Business Media, Inc.
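The two steps described in this abstract (removing the land-use signal, then letting a regression tree pick the most informative environmental characteristic) can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the SPARTA implementation: the function names, the variable names (`land_use`, `soil`, `tp`) and the linear form of the land-use adjustment are all assumptions, and only the response is residualized here to keep the sketch short.

```python
import numpy as np

def residualize(values, land_use):
    """Remove the part of `values` explained by a linear fit on land use (assumed form)."""
    design = np.column_stack([np.ones(len(values)), land_use])
    coef, *_ = np.linalg.lstsq(design, values, rcond=None)
    return values - design @ coef

def best_split(x, y):
    """Return the threshold on x that minimizes the summed squared error of the two child means."""
    order = np.argsort(x)
    xs, ys = x[order], y[order]
    best_threshold, best_sse = None, np.inf
    for i in range(1, len(xs)):
        if xs[i] == xs[i - 1]:
            continue  # no valid threshold between identical values
        left, right = ys[:i], ys[i:]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best_sse:
            best_threshold, best_sse = (xs[i - 1] + xs[i]) / 2, sse
    return best_threshold, best_sse

# Synthetic sites: water quality depends on land use plus a step in a soil property.
rng = np.random.default_rng(0)
land_use = rng.uniform(0, 1, 300)
soil = rng.uniform(0, 1, 300)
tp = 0.5 * land_use + np.where(soil > 0.6, 1.0, 0.0) + rng.normal(scale=0.05, size=300)

tp_adjusted = residualize(tp, land_use)
threshold, _ = best_split(soil, tp_adjusted)
```

A full regression tree would apply `best_split` to every candidate characteristic and recurse on the two child groups; the single split shown here is only the root of such a tree.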
snpTree - a web-server to identify and construct SNP trees from whole genome sequence data
Background The advances and decreasing cost of whole genome sequencing (WGS) will soon make this technology available for routine infectious disease epidemiology. In epidemiological studies, outbreak isolates have very little diversity and require extensive genomic analysis...... from concatenated SNPs using FastTree and a Perl script. The online server was implemented with HTML, Java and Python scripts. The server was evaluated using four published bacterial WGS data sets (V. cholerae, S. aureus CC398, S. Typhimurium and M. tuberculosis). The evaluation results for the first three...
Forest structural parameters such as quadratic mean diameter, basal area, and number of trees per unit area are important for the assessment of wood volume and biomass and represent key forest inventory attributes. Forest inventory information is required to support sustainable management, carbon accounting, and policy development activities. Digital image processing of remotely sensed imagery is increasingly utilized to assist traditional, more manual, methods in the estimation of forest structural attributes over extensive areas, also enabling evaluation of change over time. Empirical attribute estimation with remotely sensed data is frequently employed, yet with known limitations, especially over complex environments such as Mediterranean forests. In this study, the capacity of high spatial resolution (HSR) imagery and related techniques to model structural parameters at the stand level (n = 490) in Mediterranean pines in Central Spain is tested using data from the commercial satellite QuickBird-2. Spectral and spatial information derived from multispectral and panchromatic imagery (2.4 m and 0.68 m sided pixels, respectively) served to model structural parameters. Classification and Regression Tree Analysis (CART) was selected for the modeling of attributes. Accurate models were produced of quadratic mean diameter (QMD) (R² = 0.8; RMSE = 0.13 m) with an average error of 17%, while basal area (BA) models produced an average error of 22% (RMSE = 5.79 m²/ha). When the measured number of trees per unit area (N) was categorized, as per frequent forest management practices, CART models correctly classified 70% of the stands, with all other stands classified in an adjacent class. The accuracy of the attributes estimated here is expected to be better when canopy cover is more open and attribute values are at the lower end of the range present, as related in the pattern of the residuals found in this study. Our findings indicate that attributes derived from
In this note we discuss trees similar to the Calkin-Wilf tree, a binary tree that enumerates all positive rational numbers in a simple way. The original construction of Calkin and Wilf is reformulated in a more algebraic language, and an elementary application of methods from analytic number theory gives restrictions on possible analogues.
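For readers unfamiliar with the Calkin-Wilf tree itself, its standard construction is short enough to sketch: the root is 1/1, and each node a/b has the left child a/(a+b) and the right child (a+b)/b. Reading the tree level by level lists every positive rational exactly once. The generator below is an illustrative sketch of that original tree only; it does not implement the analogues studied in the note, and the function name is an assumption.

```python
from collections import deque
from fractions import Fraction
from itertools import islice

def calkin_wilf():
    """Yield the nodes of the Calkin-Wilf tree in breadth-first order."""
    queue = deque([(1, 1)])
    while queue:
        a, b = queue.popleft()
        yield Fraction(a, b)
        queue.append((a, a + b))  # left child a/(a+b)
        queue.append((a + b, b))  # right child (a+b)/b

first_seven = list(islice(calkin_wilf(), 7))
# first_seven is 1, 1/2, 2, 1/3, 3/2, 2/3, 3
```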
Background Most phylogeny analysis methods based on molecular sequences use multiple alignment where the quality of the alignment, which is dependent on the alignment parameters, determines the accuracy of the resulting trees. Different parameter combinations chosen for the multiple alignment may result in different phylogenies. A new non-alignment based approach, Relative Complexity Measure (RCM), has been introduced to tackle this problem and proven to work in fungi and mitochondrial DNA. Result In this work, we present an application of the RCM method to reconstruct robust phylogenetic trees using sequence data for genus Galanthus obtained from different regions in Turkey. Phylogenies have been analyzed using nuclear and chloroplast DNA sequences. Results showed that, the tree obtained from nuclear ribosomal RNA gene sequences was more robust, while the tree obtained from the chloroplast DNA showed a higher degree of variation. Conclusions Phylogenies generated by Relative Complexity Measure were found to be robust and results of RCM were more reliable than the compared techniques. Particularly, to overcome MSA-based problems, RCM seems to be a reasonable way and a good alternative to MSA-based phylogenetic analysis. We believe our method will become a mainstream phylogeny construction method especially for the highly variable sequence families where the accuracy of the MSA heavily depends on the alignment parameters. PMID:23323678
Oxides of Nitrogen (NOx) is a major component of photochemical smog and its constituents are considered principal traffic-related pollutants affecting human health. This study investigates the influence of background concentrations of NOx, traffic density, and prevailing meteorological conditions on roadside concentrations of NOx at UK urban, open motorway, and motorway tunnel sites using the statistical approach Boosted Regression Trees (BRT). BRT models have been fitted using hourly concentration, traffic, and meteorological data for each site. The models predict, rank, and visualise the relationship between model variables and roadside NOx concentrations. A strong relationship between roadside NOx and monitored local background concentrations is demonstrated. Relationships between roadside NOx and other model variables have been shown to be strongly influenced by the quality and resolution of background concentrations of NOx, i.e. if it were based on monitored data or modelled prediction. The paper proposes a direct method of using site-specific fundamental diagrams for splitting traffic data into four traffic states: free-flow, busy-flow, congested, and severely congested. Using BRT models, the density of traffic (vehicles per kilometre) was observed to have a proportional influence on the concentrations of roadside NOx, with different fitted regression line slopes for the different traffic states. When other influences are conditioned out, the relationship between roadside concentrations and ambient air temperature suggests NOx concentrations reach a minimum at around 22 °C with high concentrations at low ambient air temperatures which could be associated to restricted atmospheric dispersion and/or to changes in road traffic exhaust emission characteristics at low ambient air temperatures. This paper uses BRT models to study how different critical factors, and their relative importance, influence the variation of roadside NOx concentrations. The paper
Diameter-at-breast-height (DBH) estimation is a prerequisite in various allometric equations estimating important forestry indices like stem volume, basal area, biomass and carbon stock. LiDAR technology provides a means of directly obtaining different forest parameters, except DBH, from the behavior and characteristics of the point cloud unique to different forest classes. An extensive tree inventory was done on a two-hectare established sample plot in Mt. Makiling, Laguna for a natural growth forest. Coordinates, height, and canopy cover were measured and species were identified for comparison with LiDAR derivatives. Multiple linear regression was used to obtain LiDAR-derived DBH by integrating field-derived DBH and 27 LiDAR-derived parameters at 20 m, 10 m, and 5 m grid resolutions. To find the best combination of parameters for DBH estimation, all possible combinations of parameters were generated and automated using Python scripts, with additional regression-related libraries such as NumPy, SciPy, and scikit-learn. The combination that yielded the highest r-squared (coefficient of determination) and the lowest AIC (Akaike's Information Criterion) and BIC (Bayesian Information Criterion) was determined to be the best equation. The best equation uses 11 parameters at a 10 m grid size, with an r-squared of 0.604, an AIC of 154.04 and a BIC of 175.08. The combination of parameters may differ among forest classes, which is left for further studies. Additional statistical tests, such as the Kaiser-Meyer-Olkin (KMO) coefficient and Bartlett's test of sphericity (BTS), can be used to help determine the correlation among parameters.
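The exhaustive search over parameter combinations described above can be sketched with NumPy alone. This is a hedged illustration, not the authors' script: the function names, the synthetic data and the column names are assumptions, and AIC and BIC are computed with the usual Gaussian least-squares forms n·ln(RSS/n) + 2k and n·ln(RSS/n) + k·ln(n), which may differ by a constant from the values reported in the abstract.

```python
from itertools import combinations
import numpy as np

def fit_ols(X, y):
    """Fit y on X with an intercept; return r-squared, AIC and BIC."""
    design = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    rss = float(resid @ resid)
    tss = float(((y - y.mean()) ** 2).sum())
    n, k = design.shape  # k counts the intercept as a parameter
    return {
        "r2": 1 - rss / tss,
        "aic": n * np.log(rss / n) + 2 * k,
        "bic": n * np.log(rss / n) + k * np.log(n),
    }

def best_subset(X, y, names, criterion="bic"):
    """Try every non-empty subset of columns and keep the one with the lowest criterion."""
    best = None
    for size in range(1, X.shape[1] + 1):
        for cols in combinations(range(X.shape[1]), size):
            stats = fit_ols(X[:, list(cols)], y)
            if best is None or stats[criterion] < best[1][criterion]:
                best = ([names[c] for c in cols], stats)
    return best

# Synthetic stand-in for the LiDAR-derived parameters (hypothetical names).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 2 * X[:, 0] - 3 * X[:, 2] + rng.normal(scale=0.1, size=200)
selected, stats = best_subset(X, y, ["x0", "x1", "x2", "x3"])
```

With 27 candidate parameters the number of subsets is 2^27 - 1 (about 134 million), so a search of this kind is expensive; the sketch makes no attempt at pruning or parallelism.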
A 3.5-year field experiment was conducted in a subtropical degraded shrubland to assess how a nurse plant, the native shrub Rhodomyrtus tomentosa, affects the growth of the target trees Pinus elliottii, Schima superba, Castanopsis fissa, and Michelia macclurei, and to probe the intrinsic mechanisms from leaf chemical composition, construction cost (CC), and payback time aspects. We compared tree seedlings grown nearby shrub canopy (canopy subplots, CS) and in open space (open subplots, OS). S. superba in CS showed greater growth, while P. elliottii and M. macclurei were lower when compared to the plants grown in the OS. The reduced levels of high-cost compounds (proteins) and increased levels of low-cost compounds (organic acids) caused reduced CC values for P. elliottii growing in CS. While, the levels of both low-cost minerals and high-cost proteins increased in CS such that CC values of S. superba were similar in OS and CS. Based on maximum photosynthetic rates, P. elliottii required a longer payback time to construct required carbon in canopy than in OS, but the opposite was true for S. superba. The information from this study can be used to evaluate the potential of different tree species in the reforestation of subtropical degraded shrublands.
This study was undertaken to determine the existence and magnitude of simple linear correlations and regressions in young rubber tree (Hevea spp.) seedlings, in order to better guide selection in future breeding work. The characters studied were: mean yield of dry rubber per seedling per tapping by the Hamaker-Morris-Mann test (P); stem girth (CC); bark thickness (EC); number of latex vessel rings (NA); diameter of latex vessels (DV); density of latex vessels (D); and mean distance between consecutive latex vessel rings (DMEAVC), measured in a three-and-a-half-year-old crossing nursery. The results showed, among other things, that the simple linear correlations of P with CC, EC, NA, D, DV and DMEAVC were, respectively, r = 0.61, 0.34, 0.28, 0.29, 0.43 and -0.13. The correlations of CC with EC, NA, D, DV and DMEAVC were 0.65, 0.22, 0.37, 0.33 and 0.096, respectively. Simple linear regression of P on CC, EC, NA, DV, D and DMEAVC suggests that CC was the most significant independent character, accounting for 36% of the variation in P. Regarding vigor, regression of CC on the respective characters suggests that EC was the only character contributing significantly to the variation in CC, with 42%. The high correlations of yield with stem girth and bark thickness indicate that young genotypes with good yield capacity and high vigor can be obtained through early selection on these variables.
In a retrospective study on the microbiology of minced meat from small food businesses supplying directly to the consumer, the relative contribution of meat supplier, meat species and outlet where meat was minced was assessed by "Classification and Regression Tree" (CART) analysis. Samples (n=888) originated from 129 outlets of a single supermarket chain. Sampling units were 4-5 packs (pork, beef, and mixed pork-beef). Total aerobic counts (TACs) were 5.3±1.0 log CFU/g. In 75.6% of samples, E. coli were <1 log CFU/g. The proportion of "unsatisfactory" sample sets [as defined in Reg. (EC) 2073/2005] were 31.3 and 4.5% for TAC and E. coli, respectively. For classification according to TACs, the outlet where meat was minced and the "meat supplier" were the most important predictors. For E. coli, "outlet" was the most important predictor, but the limit of detection of 1 log CFU/g was not discriminative enough to allow further conclusions.
A global climate classification is defined using a multivariate regression tree (MRT). The MRT algorithm is automated, which removes the need for a practitioner to manually define the classes; it is hierarchical, which allows a series of nested classes to be defined; and it is rule-based, which allows climate classes to be unambiguously defined and easily interpreted. Climate variables used in the MRT are restricted to those from the Köppen-Geiger climate classification. The result is a hierarchical, rule-based climate classification that can be directly compared against the traditional system. An objective comparison between the two climate classifications at their 5, 13, and 30 class hierarchical levels indicates that both perform well in terms of identifying regions of homogeneous temperature variability, although the MRT still generally outperforms the Köppen-Geiger system. In terms of precipitation discrimination, the Köppen-Geiger classification performs poorly relative to the MRT. The data and algorithm implementation used in this study are freely available. Thus, the MRT climate classification offers instructors and students in the geosciences a simple instrument for exploring modern, computer-based climatological methods.
A global climate classification is defined using a multivariate regression tree (MRT). The MRT algorithm is automated, hierarchical, and rule-based, thus allowing a system of climate classes to be quickly defined and easily interpreted. Climate variables used in the MRT are restricted to those from the Köppen-Geiger classification system. The result is a set of classes that can be directly compared against those from the traditional system. The two climate classifications are compared at their 5, 13, and 30 class hierarchical levels in terms of climate homogeneity. Results indicate that both perform well in terms of identifying regions of homogeneous temperature variability, although the MRT still generally outperforms the Köppen-Geiger system. In terms of precipitation discrimination, the Köppen-Geiger classification performs poorly relative to the MRT. The data and algorithm implementation used in this study are freely available. Thus, the MRT climate classification offers instructors and students in the geosciences a simple instrument for exploring modern, computer-based climatological methods.
The probability of a prostate cancer-positive biopsy result varies with PSA concentration. Thus, we applied information theory on classification and regression tree (CART) analysis for decision making predicting the probability of a biopsy result at various PSA concentrations. From 2007 to 2009, prostate biopsies were performed in 664 referred patients in a tertiary hospital. We created 2 CART models based on the information theory: one for moderate uncertainty (PSA concentration: 2.5-10 ng/ml) and the other for high uncertainty (PSA concentration: 10-25 ng/ml). The CART model for moderate uncertainty (n=321) had 3 splits based on PSA density (PSAD), hypoechoic nodules, and age and the other CART for high uncertainty (n=160) had 2 splits based on prostate volume and percent-free PSA. In this validation set, the patients (14.3% and 14.0% for moderate and high uncertainty groups, respectively) could avoid unnecessary biopsies without false-negative results. Using these CART models based on uncertainty information of PSA, the overall reduction in unnecessary prostate biopsies was 14.0-14.3% and CART models were simplified. Using uncertainty of laboratory results from information theoretic approach can provide additional information for decision analysis such as CART. Copyright © 2012 Elsevier B.V. All rights reserved.
Low-density parity-check (LDPC) codes are capacity-approaching codes, which means that practical constructions exist that allow the noise threshold to be set very close to the theoretical Shannon limit for a memoryless channel. LDPC codes are finding increasing use in applications such as LTE networks, digital television, high-density data storage systems, and deep-space communication systems. Several algebraic and combinatorial methods are available for constructing LDPC codes. In this paper we discuss a novel low-complexity algebraic method for constructing regular LDPC-like codes derived from full-rank codes. We demonstrate that by employing these codes over AWGN channels, coding gains in excess of 2 dB over uncoded systems can be realized when soft iterative decoding using a parity-check tree is employed.
Background. Microarray technology shows great potential, but previous studies in colorectal cancer (CRC) research were limited by small numbers of samples. The aim of this study is to investigate the gene expression profile of CRCs by pooling cDNA microarrays using PAM, ANN, and decision trees (CART and C5.0). Methods. The 16 pooled datasets contained 88 normal mucosal tissues and 1186 CRCs. PAM was performed to identify significantly expressed genes in CRCs, and PAM, ANN, CART, and C5.0 models were constructed to screen candidate genes by ranking them in order of significance. Results. The first screening identified 55 genes. The test accuracy of each model averaged over 0.97. Fewer than eight genes achieved excellent classification accuracy. Combining the results of the four models, we found the top eight differential genes in CRCs: the suppressor genes CA7, SPIB, GUCA2B, AQP8, IL6R and CWH43, and the oncogenes SPP1 and TCN1. Genes of higher significance showed lower variation in rank ordering across methods. Conclusion. We adopted a two-tier genetic screen, which not only reduced the number of candidate genes but also yielded good accuracy (nearly 100%). This method can be applied to future studies. Among the top eight genes, CA7, TCN1, and CWH43 have not been reported to be related to CRC.
We are given a set $\mathcal{T} = \{T_1, T_2, \ldots, T_k\}$ of rooted binary trees, each $T_i$ leaf-labeled by a subset $\mathcal{L}(T_i) \subseteq \{1, 2, \ldots, n\}$. If $T$ is a tree on $\{1, 2, \ldots, n\}$, we let $T|C$ denote the subtree of $T$ induced by the nodes of $C$ and all their ancestors. The consensus tree problem asks whether there exists a tree $T^*$ such that for every $i$, $T^*|\mathcal{L}(T_i)$ is homeomorphic to $T_i$. We present algorithms which test if a given set of trees has a consensus tree and, if so, construct one. The deterministic algorithm takes time $\min\{O(mn^{1/2}), O(m + n^2 \log n)\}$, where $m = \sum_i |T_i|$, and uses linear space. The randomized algorithm takes time $O(m \log^3 n)$ and uses linear space. The previous best for this problem was a 1981 $O(mn)$ algorithm by Aho et al. Our faster deterministic algorithm uses a new efficient algorithm for the following interesting dynamic graph problem: given a graph $G$ with $n$ nodes and $m$ edges and a sequence of $b$ batches of one or more edge deletions, after each batch either find a new component that has just been created or determine that there is no such component. For this problem, we have a simple algorithm with running time $O(n^2 \log n + b_0 \min(n^2, m \log n))$, where $b_0$ is the number of batches which do not result in a new component. For our particular application, $b_0 \le 1$. If all edges are deleted, then the best previously known deterministic algorithm requires time $O(m\sqrt{n})$ to solve this problem.
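As background for the problem statement above, the classic recursive approach in the style of Aho et al. can be sketched on rooted triples. This is an illustrative sketch under stated assumptions, not the faster algorithms of this abstract: the input is taken to be a list of triples `(a, b, c)` meaning "a and b share an ancestor that is not an ancestor of c" (input trees would first have to be decomposed into such triples), the function name `build` and the nested-frozenset output format are choices made here, and the returned tree need not be binary.

```python
def build(leaves, triples):
    """Return a tree consistent with all triples, or None if none exists.

    A leaf is its label; an internal node is a frozenset of its children.
    """
    leaves = set(leaves)
    if len(leaves) == 1:
        return next(iter(leaves))
    # Only triples whose three labels are all present constrain this subproblem.
    active = [(a, b, c) for a, b, c in triples if {a, b, c} <= leaves]
    parent = {x: x for x in leaves}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # a and b must end up below the same child of the root.
    for a, b, _ in active:
        parent[find(a)] = find(b)
    groups = {}
    for x in leaves:
        groups.setdefault(find(x), set()).add(x)
    if len(groups) == 1:
        return None  # every leaf is forced under one child: inconsistent
    children = []
    for group in groups.values():
        subtree = build(group, active)
        if subtree is None:
            return None
        children.append(subtree)
    return frozenset(children)

consistent = build({1, 2, 3, 4}, [(1, 2, 3), (3, 4, 1)])
conflicting = build({1, 2, 3}, [(1, 2, 3), (1, 3, 2)])
```

Each recursion level recomputes connected components from scratch, which is the cost the dynamic graph algorithm in the abstract is designed to avoid.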
The Integrated Relaxation Pressure (IRP) is the esophageal pressure topography (EPT) metric used for assessing the adequacy of esophagogastric junction (EGJ) relaxation in the Chicago Classification of motility disorders. However, because the IRP value is also influenced by distal esophageal contractility, we hypothesized that its normal limits should vary with different patterns of contractility. Five hundred and twenty two selected EPT studies were used to compare the accuracy of alternative analysis paradigms to that of a motility expert (the 'gold standard'). Chicago Classification metrics were scored manually and used as inputs for MATLAB™ programs that utilized either strict algorithm-based interpretation (fixed abnormal IRP threshold of 15 mmHg) or a classification and regression tree (CART) model that selected variable IRP thresholds depending on the associated esophageal contractility. The sensitivity of the CART model for achalasia (93%) was better than that of the algorithm-based approach (85%) on account of using variable IRP thresholds that ranged from a low value of >10 mmHg to distinguish type I achalasia from absent peristalsis to a high value of >17 mmHg to distinguish type III achalasia from distal esophageal spasm. Additionally, type II achalasia was diagnosed solely by panesophageal pressurization without the IRP entering the algorithm. Automated interpretation of EPT studies more closely mimics that of a motility expert when IRP thresholds for impaired EGJ relaxation are adjusted depending on the pattern of associated esophageal contractility. The range of IRP cutoffs suggested by the CART model ranged from 10 to 17 mmHg. © 2012 Blackwell Publishing Ltd.
Arsenic contamination in groundwater is a public health and environmental concern in the United States (U.S.), particularly where monitoring is not required under the Safe Drinking Water Act. Previous studies suggest the influence of regional mechanisms for arsenic mobilization into groundwater; however, no study has examined how influencing parameters change at a continental scale spanning multiple regions. We herein examine covariates for groundwater in the western, central and eastern U.S. regions representing mechanisms associated with arsenic concentrations exceeding the U.S. Environmental Protection Agency maximum contaminant level (MCL) of 10 parts per billion (ppb). Statistically significant covariates were identified via classification and regression tree (CART) analysis, and included hydrometeorological and groundwater chemical parameters. The CART analyses were performed at two scales, national and regional; at the regional scale, three physiographic regions located in the western (Payette Section and the Snake River Plain), central (Osage Plains of the Central Lowlands), and eastern (Embayed Section of the Coastal Plains) U.S. were examined. Validity of each of the three regional CART models was indicated by values >85% for the area under the receiver-operating characteristic curve. Aridity (precipitation minus potential evapotranspiration) was identified as the primary covariate associated with elevated arsenic at the national scale. At the regional scale, aridity and pH were the major covariates in the arid to semi-arid (western) region, whereas dissolved iron (taken to represent chemically reducing conditions) and pH were major covariates in the temperate (eastern) region, although additional important covariates emerged, including elevated phosphate. Analysis in the central U.S. region indicated that elevated arsenic concentrations were driven by a mixture of those observed in the western and eastern regions.
Background and objective Erlotinib is a targeted therapy drug for non-small cell lung cancer (NSCLC). There is evidence that the survival benefit derived from erlotinib varies among patients with different clinical features, but the results are conflicting. The aim of this study is to identify novel predictive factors and explore the interactions between clinical variables as well as their impact on the survival of Chinese patients with advanced NSCLC heavily treated with erlotinib. Methods The clinical and follow-up data of 105 Chinese NSCLC patients referred to the Cancer Hospital and Institute, Chinese Academy of Medical Sciences from September 2006 to September 2009 were analyzed. Multivariate analysis of progression-free survival (PFS) was performed using recursive partitioning, referred to as classification and regression tree (CART) analysis. Results The median PFS of 105 eligible consecutive Chinese NSCLC patients was 5.0 months (95%CI: 2.9-7.1). The CART analysis split first, second, and third on lymph node involvement, the time of erlotinib administration, and smoking history, respectively. Four terminal subgroups were formed. The longer values for the median PFS were 11.0 months (95%CI: 8.9-13.1) for the subgroup with no lymph node metastasis and 10.0 months (95%CI: 7.9-12.1) for the subgroup with lymph node involvement, but not over the second-line erlotinib treatment with a smoking history ≤35 packs per year. The shorter values for the median PFS were 2.3 months (95%CI: 1.6-3.0) for the subgroup with lymph node metastasis and over the second-line erlotinib treatment, and 1.3 months (95%CI: 0.5-2.1) for the subgroup with lymph node metastasis, but not over the second-line erlotinib treatment with a smoking history >35 packs per year. Conclusion Lymph node metastasis, the time of erlotinib administration, and smoking history are closely correlated with the survival of advanced NSCLC patients with first- to
Background/Purpose: Over 1/3 of adults over age 65 experience at least one fall each year. This pilot report uses a classification and regression tree (CART) analysis to model the outcomes for balance/risk of falls from the Gentiva® Safe Strides® Program (SSP). Methods/Outcomes: SSP is a home-based balance/fall prevention program designed to treat root causes of a patient
Growing a Tree in the Forest: Constructing Folksonomies by Integrating Structured Metadata
Plangprasopchok, Anon; Getoor, Lise
2010-01-01
Many social Web sites allow users to annotate the content with descriptive metadata, such as tags, and more recently to organize content hierarchically. These types of structured metadata provide valuable evidence for learning how a community organizes knowledge. For instance, we can aggregate many personal hierarchies into a common taxonomy, also known as a folksonomy, that will aid users in visualizing and browsing social content, and also to help them in organizing their own content. However, learning from social metadata presents several challenges, since it is sparse, shallow, ambiguous, noisy, and inconsistent. We describe an approach to folksonomy learning based on relational clustering, which exploits structured metadata contained in personal hierarchies. Our approach clusters similar hierarchies using their structure and tag statistics, then incrementally weaves them into a deeper, bushier tree. We study folksonomy learning using social metadata extracted from the photo-sharing site Flickr, and demon...
Neutral Object Tree Support For Inter-Discipline Communication In Large-Scale Construction
Communication between disciplines in building and construction can be improved significantly by the proper use of Information and Communication Technology. For that reason, many research groups have been trying to achieve such improvement, especially by using Product Data Technology and STEP, and mo
As the automated scoring of constructed responses reaches operational status, the issue of monitoring the scoring process becomes a primary concern, particularly when the goal is to have automated scoring operate completely unassisted by humans. Using a vignette from the Architectural Registration Examination and data for 326 cases with both human…
Four algorithms are available for performing classification and segmentation analysis. These algorithms all perform basically the same thing: they... H. (2006), “Stock market trading rule discovery using two-layer bias decision tree”, Expert Systems
Currently, there is a demand in many fields of research for software with an easily accessible interface to analyze polymorphism data such as microsatellite DNA and single nucleotide polymorphisms. In this article, we announce POPTREE2, a computer program package that can perform evolutionary analyses of allele frequency data. The original version (POPTREE) was a command-line program that runs on the Command Prompt of Windows and Unix. In POPTREE2, genetic distances (measures of the extent of genetic differentiation between populations) for constructing phylogenetic trees, average heterozygosities (H) (a measure of genetic variation within populations) and G(ST) (a measure of genetic differentiation of subdivided populations) are computed through a simple and intuitive Windows interface. It will facilitate statistical analyses of polymorphism data for researchers in many different fields. POPTREE2 is available at http://www.med.kagawa-u.ac.jp/~genomelb/takezaki/poptree2/index.html.
Background: The aim of the present prospective study was to investigate whether a decision tree based on basic clinical signs could be used to determine the treatment of metabolic acidosis in calves successfully without expensive laboratory equipment. A total of 121 calves with a diagnosis of neonatal diarrhea admitted to a veterinary teaching hospital were included in the study. The dosages of sodium bicarbonate administered followed simple guidelines based on the results of a previous retrospective analysis. Calves that were neither dehydrated nor assumed to be acidemic received an oral electrolyte solution. In cases in which intravenous correction of acidosis and/or dehydration was deemed necessary, the provided amount of sodium bicarbonate ranged from 250 to 750 mmol (depending on alterations in posture) and infusion volumes from 1 to 6.25 liters (depending on the degree of dehydration). Individual body weights of calves were disregarded. During the 24 hour study period the investigator was blinded to all laboratory findings. Results: After being lifted, many calves were able to stand despite base excess levels below −20 mmol/l. Especially in those calves, metabolic acidosis was undercorrected with the provided amount of 500 mmol sodium bicarbonate, which was intended for calves standing insecurely. In 13 calves metabolic acidosis was not treated successfully as defined by an expected treatment failure or a measured base excess value below −5 mmol/l. By contrast, 24 hours after the initiation of therapy, a metabolic alkalosis was present in 55 calves (base excess levels above +5 mmol/l). However, the clinical status was not affected significantly by the metabolic alkalosis. Conclusions: Assuming re-evaluation of the calf after 24 hours, the tested decision tree can be recommended for use in field practice with minor modifications. Calves that stand insecurely and are not able to correct their position if pushed require higher doses of
Sequence embedding for fast construction of guide trees for multiple sequence alignment
Background: The most widely used multiple sequence alignment methods require sequences to be clustered as an initial step. Most sequence clustering methods require a full distance matrix to be computed between all pairs of sequences. This requires memory and time proportional to N² for N sequences. When N grows larger than 10,000 or so, this becomes increasingly prohibitive and can form a significant barrier to carrying out very large multiple alignments. Results: In this paper, we have tested variations on a class of embedding methods that have been designed for clustering large numbers of complex objects where the individual distance calculations are expensive. These methods involve embedding the sequences in a space where the similarities within a set of sequences can be closely approximated without having to compute all pair-wise distances. Conclusions: We show how this approach greatly reduces computation time and memory requirements for clustering large numbers of sequences and demonstrate the quality of the clusterings by benchmarking them as guide trees for multiple alignment. Source code is available for download from http://www.clustal.org/mbed.tgz.
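The embedding idea described in this abstract can be illustrated with a minimal sketch. It is not the published implementation: the toy k-mer distance, the choice of two seed sequences, and all function names are assumptions made for illustration. Each sequence is represented by its distances to a few seeds, so only N × (number of seeds) distance calculations are needed instead of N², and the resulting vectors are then compared cheaply.

```python
import math

def kmer_distance(a, b, k=2):
    """Toy distance: 1 - Jaccard similarity of k-mer sets (an assumed measure)."""
    kmers_a = {a[i:i + k] for i in range(len(a) - k + 1)}
    kmers_b = {b[i:i + k] for i in range(len(b) - k + 1)}
    union = kmers_a | kmers_b
    if not union:
        return 0.0
    return 1.0 - len(kmers_a & kmers_b) / len(union)

def embed(sequences, seeds, dist=kmer_distance):
    """Represent each sequence by its vector of distances to the seed sequences."""
    return [[dist(seq, seed) for seed in seeds] for seq in sequences]

def euclidean(u, v):
    """Cheap comparison of two embedded vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

# Hypothetical toy sequences; two of them are picked as seeds.
sequences = ["ACGTACGT", "ACGTACGA", "TTGGCCAA", "TTGGCCAT"]
seeds = [sequences[0], sequences[2]]
vectors = embed(sequences, seeds)

# Similar sequences end up close together in the embedded space.
near = euclidean(vectors[0], vectors[1])
far = euclidean(vectors[0], vectors[2])
```

The embedded vectors could then be passed to any standard clustering routine to build a guide tree; how seeds are chosen and how many are used are design choices this sketch does not address.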
several sections, and the section volumes were summed to give the total tree volume. Based on the analytic data, one-variable models relating diameter at breast height to volume were established; with diameter at breast height and tree height as independent variables and tree volume as the dependent variable, two-variable models were established, as well as a three-variable model describing the relationship between volume and three independent variables: diameter at breast height, tree height, and tree stem form. Nevertheless, the models mentioned above are simple linear or nonlinear models. To estimate forest stocks in a forest survey, former researchers usually cut down target trees, extracted samples based on the principle of sampling, and then made a corresponding volume table. This destructive and time-consuming felling method damaged many dominant trees. Tree volume modeling is the key step in establishing a volume table, and volume was usually predicted by an empirically derived volume equation. However, because of the uncertainty of tree growth, it is difficult to capture the complexity and diversity of the volume model with conventional volume equations, and the volume prediction accuracy is therefore unsatisfactory. To improve the prediction accuracy, the particle swarm optimization (PSO) algorithm was introduced into the standing tree volume prediction model and used to optimize the parameters of the support vector regression (SVM) model. Diameters at breast height and tree heights of standing trees were input into the SVM for learning, the SVM parameters were used as the PSO particles, the standing tree volumes measured by the authors served as the PSO objective, the standing tree volumes were then predicted with the optimized parameters obtained through the mutual coordination of the particles, and the prediction values of
Objective: To study and evaluate the feasibility and accuracy of decision tree methods and logistic regression for predicting the health literacy of hypertension patients. Method: Two health literacy prediction models were built, one with logistic regression analysis and one with the Answer Tree software, and receiver operating characteristic (ROC) curves were used to compare the two models. Result: The sensitivity (82.5%) and Youden index (50.9%) of the logistic regression model were higher than those of the decision tree model (77.9%, 48.0%); the specificity of the decision tree model (70.1%) was higher than that of the logistic regression model (68.4%), and its error rate (29.9%) was lower than that of the logistic regression model (31.6%). The areas under the ROC curve of the decision tree and logistic regression models were comparable (0.813 vs 0.847). Conclusion: The decision tree model predicted the health literacy of hypertension patients about as well as the logistic regression model. A health literacy screening strategy can be derived from the decision tree model, implying that data mining methods are feasible for predicting health literacy in patients with chronic disease.
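The figures in this abstract can be cross-checked with a short calculation. The sketch below assumes the standard definition of the Youden index (sensitivity + specificity − 1) and notes that the reported error rates coincide with 1 − specificity; the function and variable names are illustrative only.

```python
def youden_index(sensitivity, specificity):
    """Youden index J = sensitivity + specificity - 1."""
    return sensitivity + specificity - 1.0

# (sensitivity, specificity) as reported in the abstract
reported = {
    "logistic regression": (0.825, 0.684),
    "decision tree": (0.779, 0.701),
}

summary = {}
for name, (sens, spec) in reported.items():
    summary[name] = {
        "youden": round(youden_index(sens, spec), 3),
        # 1 - specificity matches the error rate quoted for each model
        "one_minus_specificity": round(1.0 - spec, 3),
    }
```

Both recomputed Youden indices (0.509 and 0.480) agree with the values quoted in the abstract.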
In many countries, assessment programmes are carried out to identify areas where people may be exposed to high radon levels. These programmes often involve detailed mapping, followed by spatial interpolation and extrapolation of the results based on the correlation of indoor radon values with other parameters (e.g., lithology, permeability and airborne total gamma radiation) to optimise the radon hazard maps at the municipal and/or regional scale. In the present work, Geographically Weighted Regression and geostatistics are used to estimate the Geogenic Radon Potential (GRP) of the Lazio Region, assuming that the radon risk only depends on the geological and environmental characteristics of the study area. A wide geodatabase has been organised including about 8000 samples of soil-gas radon, as well as other proxy variables, such as radium and uranium content of homogeneous geological units, rock permeability, and faults and topography often associated with radon production/migration in the shallow environment. All these data have been processed in a Geographic Information System (GIS) using geospatial analysis and geostatistics to produce base thematic maps in a 1000 m × 1000 m grid format. Global Ordinary Least Squares (OLS) regression and local Geographically Weighted Regression (GWR) have been applied and compared assuming that the relationships between radon activities and the environmental variables are not spatially stationary, but vary locally according to the GRP. The spatial regression model has been elaborated considering soil-gas radon concentrations as the response variable and developing proxy variables as predictors through the use of a training dataset. Then a validation procedure was used to predict soil-gas radon values using a test dataset. Finally, the predicted values were interpolated using the kriging algorithm to obtain the GRP map of the Lazio region. The map shows some high GRP areas corresponding to the volcanic terrains (central
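The contrast between a global regression and a local, geographically weighted one can be sketched in a few lines. This is a minimal illustration, not the study's workflow: it assumes a single predictor, a Gaussian distance kernel with a hand-picked bandwidth, and made-up coordinates and values in which the predictor-response slope differs between two areas.

```python
import math

def gwr_fit_at(point, coords, xs, ys, bandwidth):
    """Weighted least-squares line fitted at `point`; nearby samples get more weight."""
    weights = [math.exp(-0.5 * (math.dist(point, c) / bandwidth) ** 2) for c in coords]
    total = sum(weights)
    x_mean = sum(w * x for w, x in zip(weights, xs)) / total
    y_mean = sum(w * y for w, y in zip(weights, ys)) / total
    sxx = sum(w * (x - x_mean) ** 2 for w, x in zip(weights, xs))
    sxy = sum(w * (x - x_mean) * (y - y_mean) for w, x, y in zip(weights, xs, ys))
    slope = sxy / sxx
    return y_mean - slope * x_mean, slope  # (intercept, slope)

# Hypothetical samples: the slope is 1 in the western cluster and 3 in the eastern one.
coords = [(0, 0), (1, 0), (2, 0), (10, 0), (11, 0), (12, 0)]
xs = [1.0, 2.0, 3.0, 1.0, 2.0, 3.0]
ys = [1.0, 2.0, 3.0, 3.0, 6.0, 9.0]

_, slope_west = gwr_fit_at((1, 0), coords, xs, ys, bandwidth=2.0)
_, slope_east = gwr_fit_at((11, 0), coords, xs, ys, bandwidth=2.0)
# A very large bandwidth weights all samples almost equally and approaches a global OLS fit.
_, slope_global = gwr_fit_at((6, 0), coords, xs, ys, bandwidth=1e6)
```

With these toy data the local slopes recover the two regimes, while the near-global fit averages them, which is the kind of spatial non-stationarity the abstract refers to.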
Jeremy R Shearman
Hevea brasiliensis, or rubber tree, is an important crop species that accounts for the majority of natural latex production. The rubber tree nuclear genome consists of 18 chromosomes and is roughly 2.15 Gb. The current rubber tree reference genome assembly consists of 1,150,326 scaffolds ranging from 200 to 531,465 bp and totalling 1.1 Gb. Only 143 scaffolds, totalling 7.6 Mb, have been placed into linkage groups. We have performed RNA-seq on 6 varieties of rubber tree to identify SNPs and InDels and used this information to perform target sequence enrichment and high throughput sequencing to genotype a set of SNPs in 149 rubber tree offspring from a cross between RRIM 600 and RRII 105 rubber tree varieties. We used this information to generate a linkage map allowing for the anchoring of 24,424 contigs from 3,009 scaffolds, totalling 115 Mb or 10.4% of the published sequence, into 18 linkage groups. Each linkage group contains between 319 and 1367 SNPs, or 60 to 194 non-redundant marker positions, and ranges from 156 to 336 cM in length. This linkage map includes 20,143 of the 69,300 predicted genes from rubber tree and will be useful for mapping studies and improving the reference genome assembly.
A theory of game trees, based on solution trees
In this paper a complete theory of game tree algorithms is presented, entirely based upon the notion of a solution tree. Two types of solution trees are distinguished: max and min solution trees respectively. We show that most game tree algorithms construct a superposition of a max and a
Hao, Lingxin
Quantile Regression, the first book of Hao and Naiman's two-book series, establishes the seldom recognized link between inequality studies and quantile regression models. Though separate methodological literature exists for each subject, the authors seek to explore the natural connections between this increasingly sought-after tool and research topics in the social sciences. Quantile regression as a method does not rely on assumptions as restrictive as those for the classical linear regression; though more traditional models such as least squares linear regression are more widely utilized, Hao
Boyte, Stephen P.; Wylie, Bruce K.; Major, Donald J.; Brown, Jesslyn F.
2015-01-01
Cheatgrass exhibits spatial and temporal phenological variability across the Great Basin as described by ecological models formed using remote sensing and other spatial data-sets. We developed a rule-based, piecewise regression-tree model trained on 99 points that used three data-sets – latitude, elevation, and start of season time based on remote sensing input data – to estimate cheatgrass beginning of spring growth (BOSG) in the northern Great Basin. The model was then applied to map the location and timing of cheatgrass spring growth for the entire area. The model was strong (R² = 0.85) and predicted an average cheatgrass BOSG across the study area of 29 March–4 April. Of early cheatgrass BOSG areas, 65% occurred at elevations below 1452 m. The highest proportion of cheatgrass BOSG occurred between mid-April and late May. Predicted cheatgrass BOSG in this study matched well with previous Great Basin cheatgrass green-up studies.
Fringe trees, Crump-Mode-Jagers branching processes and $m$-ary search trees
This survey studies asymptotics of random fringe trees and extended fringe trees in random trees that can be constructed as family trees of a Crump-Mode-Jagers branching process, stopped at a suitable time. This includes random recursive trees, preferential attachment trees, fragmentation trees, binary search trees and (more generally) $m$-ary search trees, as well as some other classes of random trees. We begin with general results, mainly due to Aldous (1991) and Jagers and Nerman (1984). T...
cardinality then G is a total well dominated graph. In this paper we study composition and decomposition of total well dominated trees. By a reversible process we prove that any total well dominated tree can both be reduced to and constructed from a family of three small trees....
Using a friendly, nontechnical approach, the Second Edition of Regression Basics introduces readers to the fundamentals of regression. Accessible to anyone with an introductory statistics background, this book builds from a simple two-variable model to a model of greater complexity. Author Leo H. Kahane weaves four engaging examples throughout the text to illustrate not only the techniques of regression but also how this empirical tool can be applied in creative ways to consider a broad array of topics. New to the Second Edition Offers greater coverage of simple panel-data estimation:
In 2005-2006, bureaucrats at the New York City Department of Parks and Recreation (DPR) began to marshal quantitative evidence to argue for investment in tree planting as part of Mayor Bloomberg's long-term sustainability plan, PlaNYC 2030, launched in 2007. Concurrently, Bette Midler, the celebrity founder of the non-profit New York Restoration Project (...
Aim: To clone cDNAs of thrombin-like enzymes (TLEs) from the venom gland of Deinagkistrodon acutus and analyze the mechanisms by which their structural diversity arose. Methods: Reverse transcription-polymerase chain reaction and gene cloning techniques were used, and the cloned sequences were analyzed using bioinformatics tools. Results: Novel cDNAs of snake venom TLEs were cloned. The possibilities of post-transcriptional recombination and horizontal gene transfer are discussed. A phylogenetic tree was constructed. Conclusion: The cDNAs of snake venom TLEs exhibit great diversification. There are several types of structural variations. These variations may be attributable to certain mechanisms including recombination.
Robin Roj
This paper presents three different search engines for the detection of CAD-parts in large databases. The analysis of the contained information is performed by the export of the data that is stored in the structure trees of the CAD-models. A preparation program generates one XML-file for every model, which, in addition to the data of the structure tree, also contains certain physical properties of each part. The first search engine specializes in the discovery of standard parts, like screws or washers. The second program uses certain user input as search parameters, and therefore has the ability to perform personalized queries. The third one compares one given reference part with all parts in the database, and locates files that are identical, or similar to, the reference part. All approaches run automatically, and have the analysis of the structure tree in common. Files constructed with CATIA V5, and search engines written with Python have been used for the implementation. The paper also includes a short comparison of the advantages and disadvantages of each program, as well as a performance test.
To address the common problem of tag collision in Radio Frequency Identification (RFID) systems, an improved multi-branch tree anti-collision algorithm was proposed based on the regressive-style search algorithm. According to the characteristics of tag collisions, the algorithm adopts a dormancy count and switches to a quad-tree split when consecutive collision bits appear, so that the number of branches can be chosen dynamically during the search, which reduces the search range, shortens tag identification time, and improves identification efficiency. The performance analysis shows that the system efficiency of the proposed algorithm is about 76.5%; moreover, as the number of tags increases, its advantage becomes more obvious.
Objective: To provide an overview of decision trees based on CART (Classification and Regression Trees) methodology by developing a model to calculate the probability of in-hospital death from acute myocardial infarction (AMI). Method: We used the minimum basic data set at hospital discharge (MBDS) of Andalusia, Catalonia, Madrid and the Basque Country for 2001 and 2002, which includes the cases with AMI as the principal diagnosis. The 33,203 patients were randomly divided (70% and 30%) into a development set (DS = 23,277) and a validation set (VS = 9,926). The CART was an inductive model based on Breiman's algorithm, with sensitivity analysis using the Gini index and a cross-validation scheme. It was compared with a logistic regression (LR) model and an artificial neural network (ANN) (multilayer perceptron). The models developed were tested on the VS, and their properties were compared using the area under the ROC curve (AUC) (95% confidence interval). Results: In the DS, the AUC was 0.85 (0.86-0.88) for CART, 0.87 (0.86-0.88) for LR and 0.85 (0.85-0.86) for the ANN. In the VS, the AUC was 0.85 (0.85-0.88) for CART, 0.86 (0.85-0.88) for LR and 0.84 (0.83-0.86) for the ANN. Conclusions: The three models obtained similar results in their discriminative capacity. The advantage of the CART model is its simplicity of use and interpretation, since the decision rules it generates can be applied without mathematical calculations.
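The Gini index mentioned in this abstract is the impurity measure CART uses to choose splits. The following minimal sketch shows how one binary split on one numeric predictor could be selected; the (age, died) pairs are invented for illustration and are not taken from the study, and the function names are assumptions.

```python
def gini(labels):
    """Gini impurity of binary labels (0/1): 1 - p^2 - (1 - p)^2 = 2p(1 - p)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)

def split_impurity(rows, threshold):
    """Size-weighted Gini impurity of the two child nodes for the rule x < threshold."""
    left = [y for x, y in rows if x < threshold]
    right = [y for x, y in rows if x >= threshold]
    n = len(rows)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

def best_split(rows):
    """Try midpoints between consecutive distinct values and keep the purest split."""
    values = sorted({x for x, _ in rows})
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    return min(candidates, key=lambda t: split_impurity(rows, t))

# Hypothetical (age, died) pairs, made up for this example.
rows = [(45, 0), (52, 0), (60, 0), (68, 1), (75, 1), (81, 1)]
threshold = best_split(rows)
```

A full CART implementation would apply this search recursively over all predictors and then prune the tree, typically with cross-validation as the abstract notes.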
Matson, Johnny L.; Kozlowski, Alison M.
2010-01-01
Autistic regression is one of the many mysteries in the developmental course of autism and pervasive developmental disorders not otherwise specified (PDD-NOS). Various definitions of this phenomenon have been used, further clouding the study of the topic. Despite this problem, some efforts at establishing prevalence have been made. The purpose of…
Nick, Todd G; Campbell, Kathleen M
2007-01-01
The Medical Subject Headings (MeSH) thesaurus used by the National Library of Medicine defines logistic regression models as "statistical models which describe the relationship between a qualitative dependent variable (that is, one which can take only certain discrete values, such as the presence or absence of a disease) and an independent variable." Logistic regression models are used to study effects of predictor variables on categorical outcomes and normally the outcome is binary, such as presence or absence of disease (e.g., non-Hodgkin's lymphoma), in which case the model is called a binary logistic model. When there are multiple predictors (e.g., risk factors and treatments) the model is referred to as a multiple or multivariable logistic regression model and is one of the most frequently used statistical models in medical journals. In this chapter, we examine both simple and multiple binary logistic regression models and present related issues, including interaction, categorical predictor variables, continuous predictor variables, and goodness of fit.
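As a small illustration of the binary logistic model described here, the sketch below evaluates the predicted probability of an outcome from an intercept and predictor coefficients, and converts a coefficient to an odds ratio. The coefficient values and predictor meanings are hypothetical and are not taken from the chapter.

```python
import math

def logistic(z):
    """Inverse logit: maps a linear predictor to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predicted_probability(intercept, coefficients, predictors):
    """P(outcome = 1) for a multiple binary logistic regression model."""
    z = intercept + sum(b * x for b, x in zip(coefficients, predictors))
    return logistic(z)

def odds_ratio(coefficient):
    """Odds ratio associated with a one-unit increase in a predictor."""
    return math.exp(coefficient)

# Hypothetical fitted model with two binary predictors (e.g. a risk factor and a treatment).
intercept, coefficients = -2.0, [0.7, 1.1]
p_first_only = predicted_probability(intercept, coefficients, [1, 0])
p_both = predicted_probability(intercept, coefficients, [1, 1])
```

With a positive coefficient, adding the second predictor raises the predicted probability, and its odds ratio exp(1.1) is greater than 1.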
Abraham Pouliakis
2015-01-01
Objective. Nowadays numerous ancillary techniques detecting HPV DNA and mRNA compete with cytology; however, no perfect test exists. In this study we evaluated classification and regression trees (CARTs) for the production of triage rules and for estimating the risk of cervical intraepithelial neoplasia (CIN) in cases with ASCUS+ in cytology. Study Design. We used 1625 cases. In contrast to other approaches, we retained cases with missing data to increase the data volume, obtain more accurate results, and simulate real conditions in the everyday practice of gynecologic clinics and laboratories. The proposed CART was based on the cytological result, HPV DNA typing, HPV mRNA detection based on NASBA and flow cytometry, p16 immunocytochemical expression, and finally age and parous status. Results. Algorithms useful for the triage of women were produced; gynecologists could apply these in conjunction with available examination results and arrive at an estimate of the risk for a woman to harbor CIN, expressed as a probability. Conclusions. The most important test was the cytological examination; however, the CART handled cases with inadequate cytological outcome and increased the diagnostic accuracy by exploiting the results of ancillary techniques even when some results were inadequate or missing. The CART performance was better than that of any single test involved in this study.
Background: Dementia and cognitive impairment associated with aging are a major medical and social concern. Neuropsychological testing is a key element in the diagnostic procedures of Mild Cognitive Impairment (MCI), but has presently a limited value in the prediction of progression to dementia. We advance the hypothesis that newer statistical classification methods derived from data mining and machine learning methods like Neural Networks, Support Vector Machines and Random Forests can improve accuracy, sensitivity and specificity of predictions obtained from neuropsychological testing. Seven non parametric classifiers derived from data mining methods (Multilayer Perceptrons Neural Networks, Radial Basis Function Neural Networks, Support Vector Machines, CART, CHAID and QUEST Classification Trees and Random Forests) were compared to three traditional classifiers (Linear Discriminant Analysis, Quadratic Discriminant Analysis and Logistic Regression) in terms of overall classification accuracy, specificity, sensitivity, Area under the ROC curve and Press' Q. Model predictors were 10 neuropsychological tests currently used in the diagnosis of dementia. Statistical distributions of classification parameters obtained from a 5-fold cross-validation were compared using the Friedman's nonparametric test. Results: Press' Q test showed that all classifiers performed better than chance alone (p ... Conclusions: When taking into account sensitivity, specificity and overall classification accuracy, Random Forests and Linear Discriminant Analysis rank first among all the classifiers tested in prediction of dementia using several neuropsychological tests. These methods may be used to improve accuracy, sensitivity and specificity of dementia predictions from neuropsychological testing.
Harbor Deepening Project, Jacksonville, FL Palm Valley Bridge Project, Jacksonville, FL Rotary Club of San Juan, San Juan, PR Tren Urbano Subway...David. What is nanotechnology? What are its implications for construction?, Foresight/CRISP Workshop on Nanotechnology, Royal Society of Arts
Tree based machine learning framework for predicting ground state energies of molecules
We present an application of the boosted regression tree algorithm for predicting ground state energies of molecules made up of C, H, N, O, P, and S (CHNOPS). The PubChem chemical compound database has been incorporated to construct a dataset of 16 242 molecules, whose electronic ground state energies have been computed using density functional theory. This dataset is used to train the boosted regression tree algorithm, which allows a computationally efficient and accurate prediction of molecular ground state energies. Predictions from boosted regression trees are compared with neural network regression, a widely used method in the literature, and shown to be more accurate with significantly reduced computational cost. The performance of the regression model trained using the CHNOPS set is also tested on a set of distinct molecules that contain additional Cl and Si atoms. It is shown that the learning algorithms lead to a rich and diverse possibility of applications in molecular discovery and materials informatics.
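The boosted regression tree idea referred to in this abstract can be sketched with depth-one trees (stumps) and squared-error loss. This is a generic gradient-boosting illustration under stated assumptions, not the authors' model: the one-dimensional toy data, the learning rate, and the number of rounds are all made up.

```python
def mean(values):
    return sum(values) / len(values)

def fit_stump(xs, residuals):
    """Find the single threshold split that best fits the current residuals."""
    best = None
    cuts = sorted(set(xs))
    for lo, hi in zip(cuts, cuts[1:]):
        threshold = (lo + hi) / 2
        left = [r for x, r in zip(xs, residuals) if x < threshold]
        right = [r for x, r in zip(xs, residuals) if x >= threshold]
        left_mean, right_mean = mean(left), mean(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]

def fit_boosted(xs, ys, n_rounds=100, learning_rate=0.1):
    """Start from the mean, then repeatedly fit a stump to the residuals."""
    base = mean(ys)
    predictions = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        threshold, left_mean, right_mean = fit_stump(xs, residuals)
        stumps.append((threshold, left_mean, right_mean))
        predictions = [p + learning_rate * (left_mean if x < threshold else right_mean)
                       for x, p in zip(xs, predictions)]
    return base, learning_rate, stumps

def predict(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(l if x < t else r for t, l, r in stumps)

def mse(model, xs, ys):
    return mean([(y - predict(model, x)) ** 2 for x, y in zip(xs, ys)])

# Hypothetical one-dimensional training data.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 3.1, 3.0, 5.2, 5.1, 4.9]
model = fit_boosted(xs, ys)
baseline = (mean(ys), 0.1, [])  # predicts the mean everywhere
```

Each round shrinks the training error a little; in practice deeper trees, many descriptors per molecule, and held-out validation would be needed to judge accuracy.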
This text covers both multiple linear regression and some experimental design models. The text uses the response plot to visualize the model and to detect outliers, does not assume that the error distribution has a known parametric distribution, develops prediction intervals that work when the error distribution is unknown, suggests bootstrap hypothesis tests that may be useful for inference after variable selection, and develops prediction regions and large sample theory for the multivariate linear regression model that has m response variables. A relationship between multivariate prediction regions and confidence regions provides a simple way to bootstrap confidence regions. These confidence regions often provide a practical method for testing hypotheses. There is also a chapter on generalized linear models and generalized additive models. There are many R functions to produce response and residual plots, to simulate prediction intervals and hypothesis tests, to detect outliers, and to choose response trans...
The Deinococcus-Thermus phylum and the effect of rRNA composition on phylogenetic tree construction
Through comparative analysis of 16S ribosomal RNA sequences, it can be shown that two seemingly dissimilar types of eubacteria, Deinococcus and the ubiquitous hot spring organism Thermus, are distantly but specifically related to one another. This confirms an earlier report based upon 16S rRNA oligonucleotide cataloging studies (Hensel et al., 1986). Their two lineages form a distinctive grouping within the eubacteria that deserves the taxonomic status of a phylum. The (partial) sequence of T. aquaticus rRNA appears relatively close to those of other thermophilic eubacteria, e.g. Thermotoga maritima and Thermomicrobium roseum. However, this closeness does not reflect a true evolutionary closeness; rather it is due to a "thermophilic convergence", the result of unusually high G+C composition in the rRNAs of thermophilic bacteria. Unless such compositional biases are taken into account, the branching order and root of phylogenetic trees can be incorrectly inferred.
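The compositional bias discussed in this abstract is usually summarized as the G+C fraction of a sequence. A minimal sketch of that calculation follows; the two short sequences are invented placeholders, not real rRNA, and the labels attached to them are purely illustrative.

```python
def gc_fraction(sequence):
    """Fraction of G and C among the recognized bases (A, C, G, U/T)."""
    bases = [b for b in sequence.upper() if b in "ACGUT"]
    if not bases:
        return 0.0
    return sum(b in "GC" for b in bases) / len(bases)

# Hypothetical fragments, made up for illustration only.
fragments = {
    "high_gc_example": "GGCGCCGGAUCGCGGC",
    "low_gc_example": "AUUAGCAUUAAUGCAU",
}
composition = {name: gc_fraction(seq) for name, seq in fragments.items()}
```

Comparing such fractions across taxa before tree building is one simple way to spot sequences whose apparent similarity may reflect shared base composition rather than shared ancestry.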
Background and objective: It has been proven that gefitinib produces only 10%-20% tumor regression in heavily pretreated, unselected non-small cell lung cancer (NSCLC) patients in the second- and third-line setting. Asian, female, nonsmokers and adenocarcinoma are favorable factors; however, it is difficult to find a patient satisfying all the above clinical characteristics. The aim of this study is to identify novel predicting factors, and to explore the interactions between clinical variables and their impact on the survival of Chinese patients with advanced NSCLC who were heavily treated with gefitinib in the second- or third-line setting. Methods: The clinical and follow-up data of 127 advanced NSCLC patients referred to the Cancer Hospital & Institute, Chinese Academy of Medical Sciences from March 2005 to March 2010 were analyzed. Multivariate analysis of progression-free survival (PFS) was performed using recursive partitioning, which is referred to as the classification and regression tree (CART) analysis. Results: The median PFS of 127 eligible consecutive advanced NSCLC patients was 8.0 months (95%CI: 5.8-10.2). CART was performed with an initial split on first-line chemotherapy outcomes and a second split on patients' age. Three terminal subgroups were formed. The median PFS of the three subsets ranged from 1.0 month (95%CI: 0.8-1.2) for those with progressive disease after first-line chemotherapy, to 10 months (95%CI: 7.0-13.0) in patients with a partial response or stable disease in first-line chemotherapy and age <70, and 22.0 months (95%CI: 3.8-40.1) for patients obtaining a partial response or stable disease in first-line chemotherapy at age 70-81. Conclusion: Partial response or stable disease in first-line chemotherapy and age ≥ 70 are closely correlated with long-term survival in advanced NSCLC treated with gefitinib in the second- or third-line setting. CART can be used to identify previously unappreciated patient
Huang, Wenxuan; Dacek, Stephen; Rong, Ziqin; Ding, Zhiwei; Ceder, Gerbrand
Generalized Ising models, also known as cluster expansions, are an important tool in many areas of condensed-matter physics and materials science, as they are often used in the study of lattice thermodynamics, solid-solid phase transitions, magnetic and thermal properties of solids, and fluid mechanics. However, the problem of finding the global ground state of generalized Ising model has remained unresolved, with only a limited number of results for simple systems known. We propose a method to efficiently find the periodic ground state of a generalized Ising model of arbitrary complexity by a new algorithm which we term cluster tree optimization. Importantly, we are able to show that even in the case of an aperiodic ground state, our algorithm produces a sequence of states with energy converging to the true ground state energy, with a provable bound on error. Compared to the current state-of-the-art polytope method, this algorithm eliminates the necessity of introducing an exponential number of variables to ...
Breast density (the percentage of fibroglandular tissue in the breast) has been suggested to be a useful surrogate marker for breast cancer risk. It is conventionally measured using screen-film mammographic images by a labor-intensive histogram segmentation method (HSM). We have adapted and modified the HSM for measuring breast density from raw digital mammograms acquired by full-field digital mammography. Multiple regression model analyses showed that many of the instrument parameters for acquiring the screening mammograms (e.g. breast compression thickness, radiological thickness, radiation dose, compression force, etc.) and image pixel intensity statistics of the imaged breasts were strong predictors of the observed threshold values (model R² = 0.93) and %-density (R² = 0.84). The intra-class correlation coefficient of the %-density for duplicate images was estimated to be 0.80, using the regression model-derived threshold values, and 0.94 if estimated directly from the parameter estimates of the %-density prediction regression model. Therefore, with additional research, these mathematical models could be used to compute breast density objectively, automatically bypassing the HSM step, and could greatly facilitate breast cancer research studies.
This volume is devoted to a beautiful object, called the valuative tree and designed as a powerful tool for the study of singularities in two complex dimensions. Its intricate yet manageable structure can be analyzed by both algebraic and geometric means. Many types of singularities, including those of curves, ideals, and plurisubharmonic functions, can be encoded in terms of positive measures on the valuative tree. The construction of these measures uses a natural tree Laplace operator of independent interest.
Context: Unspecified ulcerative rectocolitis is a chronic disease that affects between 0.5 and 24.5 per 100,000 inhabitants worldwide. National and international clinical guidelines recommend aminosalicylates (including mesalazine) as first-line therapy for induction of remission of unspecified ulcerative rectocolitis, and recommend continuing these agents after remission is achieved. However, the multiple daily doses required to maintain remission compromise compliance with treatment, which is very low (between 45% and 65%). Use of mesalazine in granules (2 g sachet once daily, Pentasa® sachets 2 g) can enhance treatment adherence, which is reflected in improved patient outcomes. Objective: To evaluate the evidence on the use of mesalazine for the maintenance of remission in patients with unspecified ulcerative rectocolitis and its effectiveness when taken once versus more than once a day and, from an economic standpoint, to analyze the impact of adopting this dosage in Brazil's public health system, considering patients' adherence to treatment. Methods: A decision tree was developed based on the Clinical Protocol and Therapeutic Guidelines for Ulcerative Colitis, published by the Ministry of Health in Ordinance SAS/MS No. 861 of November 4, 2002, and on the algorithms published by the Associação Brasileira de Colite Ulcerativa e Doença de Crohn, in order to estimate the cost-effectiveness of mesalazine once daily in granules compared with mesalazine twice daily in tablets. Results: The use of mesalazine increases the chances of remission induction and maintenance when compared to placebo, and higher doses are associated with a greater chance of success without increasing the risk of adverse events. Conclusion: A single daily dose for the maintenance of remission is effective and is associated with higher patient compliance than multiple daily dose regimens, at lower cost.
Combining Alphas via Bounded Regression
Zura Kakushadze
2015-11-01
We give an explicit algorithm and source code for combining alpha streams via bounded regression. In practical applications, typically, there is insufficient history to compute a sample covariance matrix (SCM) for a large number of alphas. To compute alpha allocation weights, one then resorts to (weighted) regression over SCM principal components. Regression often produces alpha weights with insufficient diversification and/or skewed distribution against, e.g., turnover. This can be rectified by imposing bounds on alpha weights within the regression procedure. Bounded regression can also be applied to stock and other asset portfolio construction. We discuss illustrative examples.
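As a rough illustration of what imposing bounds inside a regression can look like, the sketch below minimizes a least-squares objective by projected gradient descent, clipping every weight to a box after each step. It is not the algorithm or source code of the paper above; the helper name `bounded_least_squares`, the step size, the bounds, and the toy data are all assumptions made for illustration.

```python
def bounded_least_squares(X, y, lower, upper, step=0.01, iters=5000):
    """Minimize ||X w - y||^2 subject to lower <= w_j <= upper (hypothetical helper)."""
    n, k = len(X), len(X[0])
    w = [0.0] * k
    for _ in range(iters):
        # Residuals r = X w - y.
        r = [sum(X[i][j] * w[j] for j in range(k)) - y[i] for i in range(n)]
        # Gradient of the squared error: 2 X^T r.
        g = [2.0 * sum(X[i][j] * r[i] for i in range(n)) for j in range(k)]
        # Gradient step, then projection onto the box [lower, upper].
        w = [min(upper, max(lower, w[j] - step * g[j])) for j in range(k)]
    return w


# Toy data (made up): the unconstrained fit would put a weight of 2.0 on the first column.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [2.0, 0.1, 2.1]
weights = bounded_least_squares(X, y, lower=0.0, upper=1.0)
```

With the upper bound active on the first weight, the second weight absorbs part of the fit, which is the kind of redistribution that bounds are meant to force.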
张希翔; 李陶深
2012-01-01
Regression analysis is often used to fill in and predict incomplete data, but it has a shortcoming when constructing the regression equation: the form of the independent variables is fixed and simple. To solve this problem, the paper proposes an improved multivariate regression method based on heuristically constructed variables. First, optimized combinations of the existing variables are found by a greedy algorithm; then several newly constructed variables are selected for multivariate regression analysis to obtain a better goodness of fit. Calculating and estimating incomplete data on the mechanical strength of wheat stalks shows that the proposed method is feasible and effective, and that it achieves a better goodness of fit when predicting incomplete data.
Constrained Sparse Galerkin Regression
Loiseau, Jean-Christophe
2016-01-01
In this work, we demonstrate the use of sparse regression techniques from machine learning to identify nonlinear low-order models of a fluid system purely from measurement data. In particular, we extend the sparse identification of nonlinear dynamics (SINDy) algorithm to enforce physical constraints in the regression, leading to energy conservation. The resulting models are closely related to Galerkin projection models, but the present method does not require the use of a full-order or high-fidelity Navier-Stokes solver to project onto basis modes. Instead, the most parsimonious nonlinear model is determined that is consistent with observed measurement data and satisfies necessary constraints. The constrained Galerkin regression algorithm is implemented on the fluid flow past a circular cylinder, demonstrating the ability to accurately construct models from data.
Vladimir N. Shchennikov
2017-03-01
Introduction: Models with neural networks and OLS regressions are used in the stock market and include variables that describe the state of the market. One possible way to determine these dependencies is clustering through principal component analysis. The main aim of the research is to reveal the essence of two promising heuristic approaches to assessing the dynamics of functional relationships between income in the stock market and variables that describe the state of the market. Materials and Methods: The source data are models with a continuous network and OLS regression in the area of management strategies. Mathematical statistics revenue management strategies. Results: It is well known that the specifics of establishing functional relationships between income in the stock market lie in their clustering through linear (nonlinear) analysis of the principal components of the market condition. We analyzed two promising heuristic approaches to assessing the dynamics of functional relationships between income in the stock market and variables describing the state of the market. Discussion and Conclusions: The dynamics of the functional links between revenues on the stock market were analyzed.
Standards for Standardized Logistic Regression Coefficients
Standardized coefficients in logistic regression analysis have the same utility as standardized coefficients in linear regression analysis. Although there has been no consensus on the best way to construct standardized logistic regression coefficients, there is now sufficient evidence to suggest a single best approach to the construction of a…
2011-01-01
The New Austrian Tunneling Method (NATM) is currently a widely used tunnel construction technique. Taking measured values of the horizontal convergence deformation between the surrounding rock walls during NATM construction as experimental data, regression analysis is applied to establish the best mathematical model. The resulting regression function has a high correlation coefficient with the measured data and low residuals, fits well, and effectively reflects the law of tunnel convergence deformation. Finally, the final value of the deformation is predicted, providing a monitoring basis for tunnel construction safety management and early warning.
The present paper shows the coordinates of a tree and its vertices, defines a kind of Trees with Odd-Number Radiant Type (TONRT), deals with the gracefulness of TONRT by using the edge-moving theorem, and uses graceful TONRT to construct another class of graceful trees.
1999-01-01
The fastest exact algorithm (in practice) for the rectilinear Steiner tree problem in the plane uses a two-phase scheme: First, a small but sufficient set of full Steiner trees (FSTs) is generated and then a Steiner minimum tree is constructed from this set by using simple backtrack search, dynamic...
This study aimed to contribute to the management of a healthcare organization by providing management information using time-series analysis of business data accumulated in the hospital information system, which has not been utilized thus far. In this study, we examined the performance of the prediction method using the auto-regressive integrated moving-average (ARIMA) model, using the business data obtained at the Radiology Department. We made the model using the data used for analysis, which was the number of radiological examinations in the past 9 years, and we predicted the number of radiological examinations in the last 1 year. Then, we compared the actual value with the forecast value. We were able to establish that the performance prediction method was simple and cost-effective by using free software. In addition, we were able to build the simple model by pre-processing the removal of trend components using the data. The difference between predicted values and actual values was 10%; however, it was more important to understand the chronological change rather than the individual time-series values. Furthermore, our method was highly versatile and adaptable compared to the general time-series data. Therefore, different healthcare organizations can use our method for the analysis and forecasting of their business data.
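The trend-removal step mentioned above can be sketched with first-order differencing, the "integrated" part of an ARIMA model. The example below is a minimal sketch, not the study's model: it forecasts with the simplest member of the family, ARIMA(0,1,0) with drift (a random walk with drift), and the helper names and yearly counts are made up for illustration.

```python
def difference(series):
    """First-order differencing: the 'I' step of ARIMA with d = 1."""
    return [b - a for a, b in zip(series, series[1:])]


def drift_forecast(series, horizon):
    """Forecast ARIMA(0,1,0) with drift: last value plus h times the mean increment."""
    diffs = difference(series)
    drift = sum(diffs) / len(diffs)
    return [series[-1] + drift * h for h in range(1, horizon + 1)]


# Made-up yearly examination counts, in arbitrary units.
counts = [100, 104, 109, 112, 118, 121]
forecast = drift_forecast(counts, horizon=2)
```

A full ARIMA fit would add autoregressive and moving-average terms on the differenced series; this sketch only shows how differencing turns a trending series into increments that can be averaged.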
Using selected financial indicators and the financial data of listed companies in China, the paper builds a financial-distress early-warning model for listed companies based on logistic regression. Testing shows that the model has practical application value.
2014-01-01
To improve the operating efficiency of cache-sensitive B+-tree (CSB+-tree) indexing, this paper studies the parallel construction and query performance of the CSB+-tree on the graphics processing unit (GPU). First, the mapping between each key in the internal nodes and the corresponding leaf node of the index tree is analyzed, and a lock-free parallel algorithm that builds all internal-node keys of the CSB+-tree in a single pass is proposed, so that the index tree is constructed with maximum parallelism. Moreover, dynamic arrays supporting arbitrary expansion of CSB+-tree index data are designed to implement dynamic allocation of memory space on the GPU, and padding bits are added at the boundaries of internal nodes to reduce the number of thread branches, thus improving the query efficiency of the CSB+-tree. Experimental results indicate that the proposed algorithm is 31.0 and 1.4 times faster, respectively, than the parallel algorithms based on a single node and on a tree layer.
2010-01-01
A nonparametric tree classification procedure is used to detect differential item functioning for items that are dichotomously scored. Classification trees are shown to be an alternative procedure to detect differential item functioning other than the use of traditional Mantel-Haenszel and logistic regression analysis. A nonparametric…
A method for constructing webpage semantic concept trees based on a covariance-feature crawler is proposed. A semantic concept decision tree algorithm is introduced for main-feature modeling. Following the probabilistic regularized training transfer rule of the semantic ternary-feature decision tree, the effective feature probability of the data set most recently obtained by the decision tree network nodes is derived, and the covariance-feature web crawler is used to improve the concept tree construction algorithm. The crawler rapidly and independently separates autocorrelated components to obtain a semantic autocorrelation retrieval code, so that the constructed concept tree can guide information retrieval. Simulation results show that the algorithm effectively performs data mining and constructs webpage semantic concept trees, and provides an optimal branching path for locating information, thereby enabling accurate retrieval and location of hot-topic information. The algorithm shows good webpage recall and location-retrieval performance, improves the recall rate markedly, and has good application value.
We introduce type extension trees as a formal representation language for complex combinatorial features of relational data. Based on a very simple syntax this language provides a unified framework for expressing features as diverse as embedded subgraphs on the one hand, and marginal counts of attribute values on the other. We show by various examples how many existing relational data mining techniques can be expressed as the problem of constructing a type extension tree and a discriminant function.
Jaeger, Manfred
Pedrini, D. T.; Pedrini, Bonnie C.
2005-01-01
We construct a tree wavelet approximation by using a constructive greedy scheme (CGS). We define a function class which contains the functions whose piecewise polynomial approximations generated by the CGS have a prescribed global convergence rate and establish embedding properties of this class. We provide sufficient conditions on a tree index set and on bi-orthogonal wavelet bases which ensure optimal order of convergence for the wavelet approximations encoded on the tree index set using the bi-orthogonal wavelet bases. We then show that if we use the tree index set associated with the partition generated by the CGS to encode a wavelet approximation, it gives optimal order of convergence.
The objective of this thesis is the development of a method for identifying the factors that lead to various levels of radioactive contamination in plants. The suggested methodology is based on a radioecological model of radionuclide transfer through the environment (the A.S.T.R.A.L. computer code) and a classification-tree method. In particular, to avoid the instability problems of classification trees while preserving the tree structure, a node-level stabilizing technique is used. Empirical comparisons are carried out between classification trees built by this method (called the R.E.N. method) and those obtained by the C.A.R.T. method. A similarity measure is defined to compare the structure of two classification trees. This measure is used to study the stabilizing performance of the R.E.N. method. The methodology is applied to a simplified contamination scenario. From the results obtained, we can identify the main variables responsible for the various radioactive contamination levels of four leafy vegetables (lettuce, cabbage, spinach and leek). Some rules extracted from these classification trees could be used in a post-accident context.
Zhu, Ruoqing; Zeng, Donglin; Kosorok, Michael R
Application of Fruit Trees in Construction of Urban Garden Landscapes
Starting from the history and current status of fruit trees in gardens, and considering the components of fruit-tree landscapes (flower, fruit, foliage, tree form, and cultural connotations), principles for applying fruit trees in urban garden landscapes are proposed. The principles include "the right tree for the right land", "making full use of the ecological benefits of fruit trees", "improving the rhythm and artistic quality of plant arrangements", and "using characteristic fruit tree varieties". The application of fruit trees in different garden green spaces, such as roadsides, residential districts, parks, and sightseeing orchards, is analyzed, and examples are given to show the range of use of different types of species. Finally, points requiring attention in application are listed to provide a reference for the further use of fruit trees in urban garden landscapes.
In 2001, six permanent sample plots (PSPs) were established in forest stands differing in the degree of damage by pollution from the Norilsk industrial region. In 2004, a second forest inventory was carried out at these PSPs to evaluate the impact of pollutants on changes in stand condition. During both inventories, the vigor state of every tree was visually categorized according to the 6-point scale of the «Forest health regulations in Russian Federation». The transition of trees to fallen wood was also taken into account. Two types of Markov models simulating the thinning process in tree stands under different ecological conditions were developed: (1) based on the assessed probability of tree survival over three years; (2) based on an estimated matrix of probabilities of change in vigor state category over the same period. Tree mortality since 1979, when the industrial complex «Nadezda» was put into operation, was reconstructed on the basis of the estimated probability that dead standing trees remain standing over the three years observed. A forecast of the situation was made up to 2030. Using logistic regression, the probability of tree survival was established as a function of four factors: degree of tree damage by pollutants, tree species, stand location in the relief, and tree age. The results make it possible to separate the impact of pollutants on stand resistance from other factors. The percentage of tree fall caused by pollution was determined. A scale of SO2 gas resistance of tree species was constructed: birch, spruce, larch. Larch showed the highest percentage of fall due to pollution.
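The second kind of Markov model described above, a matrix of probabilities of moving between vigor categories over a three-year period, can be sketched as follows. This is only an illustration under stated assumptions: it uses three states instead of the six-point scale, and every probability in the matrix is made up rather than taken from the study.

```python
def step(dist, P):
    """One three-year transition: new_j = sum_i dist_i * P[i][j]."""
    k = len(P)
    return [sum(dist[i] * P[i][j] for i in range(k)) for j in range(k)]


# Rows: current state; columns: state three years later (made-up probabilities).
P = [
    [0.8, 0.15, 0.05],  # healthy
    [0.0, 0.7, 0.3],    # weakened
    [0.0, 0.0, 1.0],    # dead (absorbing state)
]

dist = [1.0, 0.0, 0.0]   # start with all trees healthy
for _ in range(3):       # three periods, i.e. nine years
    dist = step(dist, P)
```

Repeating the step projects the stand composition forward in time, which is how a forecast over several decades can be built from a single estimated transition matrix.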
With the continuing development of information technology, applying business intelligence techniques to data mining and analysis is increasingly important for merchants. Classification and regression trees and neural networks are classical data mining algorithms, widely used in data analysis, prediction, and evaluation. This paper applies classification and regression trees and a neural network algorithm to data on the change in income after promotion schemes were adopted for retail commodities, and builds corresponding models to predict the effect of the promotion schemes.
2012-01-01
Tree compression with top trees
Skeletal Rigidity of Phylogenetic Trees
Motivated by geometric origami and the straight skeleton construction, we outline a map between spaces of phylogenetic trees and spaces of planar polygons. The limitations of this map are studied through explicit examples, culminating in a proof of a structural rigidity result.
To identify the physiological function of the gene encoding Diacylglycerol Acyltransferase 2 (DGAT2) in tung oil biosynthesis, DGAT2 was cloned from cDNA of tung tree kernel and then ligated into the pMD18-T vector for sequencing. A 969 bp fragment containing the open reading frame was obtained. Subsequently, the RNAi binary expression vector pD35-DGAT2 was constructed, which expresses DGAT2 in two opposite orientations. These studies make it possible to further identify the function of DGAT2 in tung oil biosynthesis by RNAi technology and hold promise for genetic engineering of Vernicia fordii.
Johansen, Søren
The reduced rank regression model is a multivariate regression model with a coefficient matrix with reduced rank. The reduced rank regression algorithm is an estimation procedure, which estimates the reduced rank regression model. It is related to canonical correlations and involves calculating e...
2001-01-01
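This paper introduces the history, structure, components, and characteristics of classification and regression trees (CART). CART comprises classification trees, whose outcome variable is categorical, and regression trees, whose outcome variable is continuous. A CART is a tree structure made up of nodes and connecting branches; the nodes at the ends are called terminal nodes. CART can analyze data with poor homogeneity, handles missing data by means of surrogate variables, makes no assumptions about the distribution of the data, and can use variables of different types at the same time. Its tree structure closely resembles clinical reasoning, which favors its use in clinical research. CART can be applied to the analysis of clinical research data, and its range of applications remains to be extended.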
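To make the node-splitting idea behind a classification tree concrete, the sketch below picks a threshold on one numeric predictor by minimizing the weighted Gini impurity of the two child nodes, a criterion commonly used for CART classification trees. It is a minimal sketch, not a full CART implementation: the helper names and the toy data are assumptions, and pruning, surrogate splits, and regression trees are left out.

```python
def gini(labels):
    """Gini impurity of a node: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_threshold(xs, labels):
    """Try each observed value t as a split 'x <= t' and keep the purest one."""
    best_t, best_score = None, float("inf")
    for t in sorted(set(xs))[:-1]:  # the largest value would leave the right node empty
        left = [lab for x, lab in zip(xs, labels) if x <= t]
        right = [lab for x, lab in zip(xs, labels) if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(xs)
        if score < best_score:
            best_t, best_score = t, score
    return best_t, best_score


# Made-up data: one predictor, two classes that separate cleanly.
xs = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
labels = ["a", "a", "a", "b", "b", "b"]
threshold, impurity = best_threshold(xs, labels)
```

Growing a tree repeats this search recursively inside each child node until a stopping rule is met, at which point the node becomes a terminal node.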
Indexing data structures are well-known to be crucial for the efficiency of the current state-of-the-art theorem provers. Examples are discrimination trees, which are like tries where terms are seen as strings and common prefixes are shared, and substitution trees, where terms keep their tree structure and all common contexts can be shared. Here we describe a new indexing data structure, context trees, where, by means of a limited kind of conte...
Cochrane, John. H.; Longstaff, Francis A.; Santa-Clara, Pedro
We solve a model with two "Lucas trees." Each tree has i.i.d. dividend growth. The investor has log utility and consumes the sum of the two trees' dividends. This model produces interesting asset-pricing dynamics, despite its simple ingredients. Investors want to rebalance their portfolios after any change in value. Since the size of the trees is fixed, however, prices must adjust to offset this desire. As a result, expected returns, excess returns, and return volatility all vary throug...
2012-01-01
With the development of computer technology and the practical need for precise management of orchard fruit trees, three-dimensional reconstruction of the tree canopy has become a research focus. In this paper, a laser sensor was used to scan the target fruit tree canopy horizontally at different heights and measure distances; the coordinates of the measuring points obtained from different directions around the tree were transformed into a single coordinate system, yielding a three-dimensional point cloud of the canopy; and the three-dimensional canopy outline of the model tree was reconstructed by interpolation. The experimental results show that the reconstructed canopy outline reflects the contour of the fruit tree fairly accurately. This study provides a preliminary basis for three-dimensional reconstruction and volume measurement of fruit tree canopies using laser ranging.
2013-01-01
Based on a large-scale investigation of Cunninghamia lanceolata stands in the Lechang area of Guangdong Province, a total of 104 plus trees of C. lanceolata were screened out, and traits of the plus trees including height, DBH, form, standing volume, wood quality, and growth properties were measured and analyzed. The results showed that the average tree height, DBH, and standing volume of the plus-tree population were all significantly (P < 0.01) higher than those of the sample-plot population, with performance levels (PL) all above 77.0. The realized gains in tree height, DBH, and standing volume of the selected plus trees were 0.4% to 97.1%, 32.3% to 131.8%, and 82.6% to 712.5%, respectively. The height-DBH ratio and the crown-DBH ratio ranged from 38.9 to 76.5 and from 8.7 to 21.9, respectively. The basic wood density, red-heartwood ratio, wood shrinkage, and wood water absorption of the plus trees ranged from 0.2511 to 0.3931 g/cm³, 29.2% to 72.3%, 0.8% to 32.4%, and 189.0% to 332.9%, respectively. Wood shrinkage varied the most, with a CV of 100%, while basic wood density varied little, with a CV of only 11%. When the plus trees were classified using the mean wood-property measurements as thresholds, a number of individuals superior in both growth and wood quality could be selected from each plus-tree group. The plus-tree resources were also conserved effectively, both in situ and ex situ, with an overall preservation rate of 98.1% in the collection area.
Students love outdoor activities and will love them even more when they build confidence in their tree identification and measurement skills. Through these activities, students will learn to identify the major characteristics of trees and discover how the pace--a nonstandard measuring unit--can be used to estimate not only distances but also the…
David L. R. Affleck; Timothy G. Gregoire
2015-01-01
In felled-tree studies, ratio and regression estimators are commonly used to convert more readily measured branch characteristics to dry crown mass estimates. In some cases, data from multiple trees are pooled to form these estimates. This research evaluates the utility of both tactics in the estimation of crown biomass following randomized branch sampling (...
2011-04-01
Multiple regression models, clustering tree diagrams, regression trees (CHAID) and redundancy analysis (RDA) were applied to the study of the removal of organic matter and pharmaceuticals and personal care products (PPCPs) from urban wastewater by means of constructed wetlands (CWs). These four statistical analyses pointed out the importance of physico-chemical parameters, plant presence and chemical structure in the elimination of most pollutants. Temperature, pH values, dissolved oxygen concentration, redox potential and conductivity were related to the removal of the studied substances. Plant presence (Typha angustifolia and Phragmites australis) enhanced the removal of organic matter and some PPCPs. Multiple regression equations and CHAID trees provided numerical estimations of pollutant removal efficiencies in CWs. These models were validated and they could be a useful and interesting tool for the quick estimation of removal efficiencies in already working CWs and for the design of new systems which must fulfil certain quality requirements.
Praise for the Fourth Edition: "This book is . . . an excellent source of examples for regression analysis. It has been and still is readily readable and understandable." -Journal of the American Statistical Association Regression analysis is a conceptually simple method for investigating relationships among variables. Carrying out a successful application of regression analysis, however, requires a balance of theoretical results, empirical rules, and subjective judgment. Regression Analysis by Example, Fifth Edition has been expanded
2017-01-01
I will discuss commonly used techniques for nonparametric regression in astronomy. We find that several of them, particularly running averages and running medians, are generically biased, asymmetric between dependent and independent variables, and perform poorly in recovering the underlying function, even when errors are present only in one variable. We then examine less-commonly used techniques such as Multivariate Adaptive Regressive Splines and Boosted Trees and find them superior in bias, asymmetry, and variance both theoretically and in practice under a wide range of numerical benchmarks. In this context the chief advantage of the common techniques is runtime, which even for large datasets is now measured in microseconds compared with milliseconds for the more statistically robust techniques. This points to a tradeoff between bias, variance, and computational resources which in recent years has shifted heavily in favor of the more advanced methods, primarily driven by Moore's Law. Along these lines, we also propose a new algorithm which has better overall statistical properties than all techniques examined thus far, at the cost of significantly worse runtime, in addition to providing guidance on choosing the nonparametric regression technique most suitable to any specific problem. We then examine the more general problem of errors in both variables and provide a new algorithm which performs well in most cases and lacks the clear asymmetry of existing non-parametric methods, which fail to account for errors in both variables.
2007-01-01
The dependent variable in a regular linear regression is a numerical variable, and in a logistic regression it is a binary or categorical variable. In these models the dependent variable has varying values. However, there are problems yielding an identity output of a constant value which can also be modelled in a linear or logistic regression with…
2009-01-01
Regression analysis of survival data, and more generally event history data, is typically based on Cox's regression model. We here review some recent methodology, focusing on the limitations of Cox's regression model. The key limitation is that the model is not well suited to represent time-varyi...
Quantile regression is emerging as a popular statistical approach, which complements the estimation of conditional mean models. While the latter only focuses on one aspect of the conditional distribution of the dependent variable, the mean, quantile regression provides more detailed insights by m...... treatment of the topic is based on the perspective of applied researchers using quantile regression in their empirical work....
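The contrast with conditional-mean models comes from the loss function: quantile regression minimizes the asymmetric "check" (pinball) loss instead of squared error. The sketch below illustrates this in the simplest case, fitting a single constant, where the minimizer is a sample quantile. The function names and data are made up for illustration, and a real quantile regression would replace the constant with a linear function of covariates.

```python
def pinball_loss(q, ys, tau):
    """Average check (pinball) loss of predicting the constant q at level tau."""
    return sum(
        tau * (y - q) if y >= q else (1 - tau) * (q - y) for y in ys
    ) / len(ys)


def best_constant(ys, tau):
    """Search the observed values for the one minimizing the pinball loss."""
    return min(ys, key=lambda q: pinball_loss(q, ys, tau))


# Made-up sample with one large outlier.
ys = [1.0, 2.0, 3.0, 4.0, 100.0]
median_fit = best_constant(ys, tau=0.5)   # the median, unaffected by the outlier
upper_fit = best_constant(ys, tau=0.9)    # an upper quantile
```

Changing `tau` moves the fit through the distribution, which is why a set of quantile fits describes more of the conditional distribution than the mean alone.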
QuickJoin—Fast Neighbour-Joining Tree Reconstruction
We have built a tool for fast construction of very large phylogenetic trees. The tool uses heuristics for speeding up the neighbour-joining algorithm, while still constructing the same tree as the original neighbour-joining algorithm, making it possible to construct trees for ~8000 species in less than ten minutes on a single desktop PC. In comparison, the same task takes more than 30 minutes using the QuickTree neighbour-joining implementation.
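For context, the step that dominates the running time of plain neighbour-joining is the search for the pair of taxa that minimizes the criterion Q(i, j) = (n - 2) d(i, j) - sum_k d(i, k) - sum_k d(j, k). The sketch below shows that exhaustive search only; it does not reproduce QuickJoin's heuristics, and the function name and the small distance matrix are assumptions chosen for illustration.

```python
def pick_pair(D):
    """Return the pair (i, j) minimizing the neighbour-joining criterion
    Q(i, j) = (n - 2) * D[i][j] - sum(D[i]) - sum(D[j]), and its Q value."""
    n = len(D)
    totals = [sum(row) for row in D]
    best, best_q = None, float("inf")
    for i in range(n):
        for j in range(i + 1, n):
            q = (n - 2) * D[i][j] - totals[i] - totals[j]
            if q < best_q:
                best, best_q = (i, j), q
    return best, best_q


# Symmetric distance matrix for five hypothetical taxa.
D = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
pair, q_value = pick_pair(D)
```

The full algorithm joins the selected pair into a new node, updates the distance matrix, and repeats until the tree is complete; scanning every pair at every iteration is the cost that a faster implementation has to avoid.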
Regression analysis is the most commonly used statistical method in the world. Although few would characterize this technique as simple, regression is in fact both simple and elegant. The complexity that many attribute to regression analysis is often a reflection of their lack of familiarity with the language of mathematics. But regression analysis can be understood even without a mastery of sophisticated mathematical concepts. This book provides the foundation and will help demystify regression analysis using examples from economics and with real data to show the applications of the method. T
We introduce the package PhylogeneticTrees for Macaulay2 which allows users to compute phylogenetic invariants for group-based tree models. We provide some background information on phylogenetic algebraic geometry and show how the package PhylogeneticTrees can be used to calculate a generating set for a phylogenetic ideal as well as a lower bound for its dimension. Finally, we show how methods within the package can be used to compute a generating set for the join of any two ideals.
The importance of this work lies in the selection of objects not yet treated specifically as a family and in the use of a robust statistical procedure that requires no assumptions or boundary conditions. It thus contributes to a better understanding of the place of ringed galaxies in the Hubble diagram through classification and the study of subclasses. One hundred galaxies with two rings were selected from the Catalog of Southern Ringed Galaxies compiled by Ronald Buta, so as to build a sample that is complete in terms of knowledge of the semi-axes of the inner and outer rings projected on the plane of the sky. Aiming at a possible classification of these normal ringed galaxies into families according to the geometric characteristics of the rings, Cluster Analysis (a classification tool: similarity measurements in a two-dimensional space) was first used to explore the possible existence of families. The variables analyzed were: the minor d(I) and major D(I) inner diameters, the minor d(E) and major D(E) outer diameters, and the inclination angles of the inner q(I) and outer q(E) major semi-axes of the rings. Classification trees were constructed as the discrimination methodology. Classification trees are a discrimination method alternative to classical models such as Discriminant Analysis and Logistic Regression, in which a database is divided into partitions (subgroups) of the tree by the action of a predictor (a specific variable). The statistical packages used to process the information were SAS version 8.0 (Statistical Analysis System) and CART version 3.6.3. This statistical analysis suggests the existence of three possible families of double-ringed galaxies, based only on the geometry of the rings. As an initial exploration of this result, the construction of a diagram of BT (total magnitude) versus the
1998-01-01
In this paper, a theory of game tree algorithms is presented, entirely based upon the concept of solution tree. Two types of solution trees are distinguished: max and min trees. Every game tree algorithm tries to prune as many nodes as possible from the game tree. A cut-off criterion in
The photo shows a close-up of a Lichtenberg figure – popularly called an “electron tree” – produced in a cylinder of polymethyl methacrylate (PMMA). Electron trees are created by irradiating a suitable insulating material, in this case PMMA, with an intense high energy electron beam. Upon discharge, during dielectric breakdown in the material, the electrons generate branching chains of fractures on leaving the PMMA, producing the tree pattern seen. To be able to create electron trees with a clinical linear accelerator, one needs to access the primary electron beam used for photon treatments. We appropriated a linac that was being decommissioned in our department and dismantled the head to circumvent the target and ion chambers. This is one of 24 electron trees produced before we had to stop the fun and allow the rest of the accelerator to be disassembled.
We show that every inner metric space X is the metric quotient of a complete R-tree via a free isometric action, which we call the covering R-tree of X. The quotient mapping is a weak submetry (hence, open) and light. In the case of compact 1-dimensional geodesic space X, the free isometric action is via a subgroup of the fundamental group of X. In particular, the Sierpiński gasket and carpet, and the Menger sponge all have the same covering R-tree, which is complete and has at each point valency equal to the continuum. This latter R-tree is of particular interest because it is "universal" in at least two senses: First, every R-tree of valency at most the continuum can be isometrically embedded in it. Second, every Peano continuum is the image of it via an open light mapping. We provide a sketch of our previous construction of the uniform universal cover in the special case of inner metric spaces, the properties of which are used in the proof.
Box-trees and R-trees with near-optimal query time
A box-tree is a bounding-volume hierarchy that uses axis-aligned boxes as bounding volumes. The query complexity of a box-tree with respect to a given type of query is the maximum number of nodes visited when answering such a query. We describe several new algorithms for constructing box-trees
Autistic epileptiform regression.
Autistic regression is a well known condition that occurs in one third of children with pervasive developmental disorders, who, after normal development in the first year of life, undergo a global regression during the second year that encompasses language, social skills and play. In a portion of these subjects, epileptiform abnormalities are present with or without seizures, resembling, in some respects, other epileptiform regressions of language and behaviour such as Landau-Kleffner syndrome. In these cases, for a more accurate definition of the clinical entity, the term autistic epileptiform regression has been suggested. As in other epileptic syndromes with regression, the relationships between EEG abnormalities, language and behaviour, in autism, are still unclear. We describe two cases of autistic epileptiform regression selected from a larger group of children with autistic spectrum disorders, with the aim of discussing the clinical features of the condition, the therapeutic approach and the outcome.
Scaled Sparse Linear Regression
2011-01-01
Scaled sparse linear regression jointly estimates the regression coefficients and noise level in a linear model. It chooses an equilibrium with a sparse regression method by iteratively estimating the noise level via the mean residual squares and scaling the penalty in proportion to the estimated noise level. The iterative algorithm costs nearly nothing beyond the computation of a path of the sparse regression estimator for penalty levels above a threshold. For the scaled Lasso, the algorithm is a gradient descent in a convex minimization of a penalized joint loss function for the regression coefficients and noise level. Under mild regularity conditions, we prove that the method yields simultaneously an estimator for the noise level and an estimated coefficient vector in the Lasso path satisfying certain oracle inequalities for the estimation of the noise level, prediction, and the estimation of regression coefficients. These oracle inequalities provide sufficient conditions for the consistency and asymptotic...
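The alternating scheme described in this abstract (estimate the noise level from the mean squared residual, rescale the penalty in proportion to it, refit the sparse regression) can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names, the plain coordinate-descent Lasso solver, the objective scaling (1/(2n))·RSS + λ·‖β‖₁, the starting value for the noise level and the stopping rule are all assumptions made for the example.

```python
import math


def soft_threshold(z, t):
    """Soft-thresholding operator used by the Lasso coordinate update."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_cd(X, y, lam, beta=None, sweeps=200):
    """Coordinate descent for (1/(2n)) * RSS + lam * ||beta||_1 (assumed scaling)."""
    n, p = len(X), len(X[0])
    beta = list(beta) if beta is not None else [0.0] * p
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    resid = [y[i] - sum(X[i][k] * beta[k] for k in range(p)) for i in range(n)]
    for _ in range(sweeps):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # an all-zero column cannot be updated
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n + col_sq[j] * beta[j]
            new = soft_threshold(rho, lam) / col_sq[j]
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new
    return beta, resid


def scaled_lasso(X, y, lam0, iters=50, tol=1e-8):
    """Alternate between a noise-level estimate and a Lasso fit with penalty sigma * lam0."""
    n = len(y)
    sigma = math.sqrt(sum(v * v for v in y) / n)  # assumed start: residuals of beta = 0
    beta = None
    for _ in range(iters):
        beta, resid = lasso_cd(X, y, sigma * lam0, beta)
        new_sigma = math.sqrt(sum(r * r for r in resid) / n)  # root mean squared residual
        converged = abs(new_sigma - sigma) < tol
        sigma = new_sigma
        if converged:
            break
    return beta, sigma


# Made-up toy data with an orthogonal design, for illustration only.
X_demo = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
y_demo = [3.1, -2.9, 0.1, -0.1]
beta_hat, sigma_hat = scaled_lasso(X_demo, y_demo, lam0=0.1)
```

Each outer iteration reuses the previous coefficients as a warm start, which is one simple way to keep the extra cost over a single Lasso path small.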
In this work nine non-linear regression models were compared for sub-pixel impervious surface area mapping from Landsat images. The comparison was done in three study areas both for accuracy of imperviousness coverage evaluation in individual points in time and accuracy of imperviousness change assessment. The performance of individual machine learning algorithms (Cubist, Random Forest, stochastic gradient boosting of regression trees, k-nearest neighbors regression, random k-nearest neighbors regression, Multivariate Adaptive Regression Splines, averaged neural networks, and support vector machines with polynomial and radial kernels) was also compared with the performance of heterogeneous model ensembles constructed from the best models trained using particular techniques. The results proved that in case of sub-pixel evaluation the most accurate prediction of change may not necessarily be based on the most accurate individual assessments. When single methods are considered, based on obtained results Cubist algorithm may be advised for Landsat based mapping of imperviousness for single dates. However, Random Forest may be endorsed when the most reliable evaluation of imperviousness change is the primary goal. It gave lower accuracies for individual assessments, but better prediction of change due to more correlated errors of individual predictions. Heterogeneous model ensembles performed for individual time points assessments at least as well as the best individual models. In case of imperviousness change assessment the ensembles always outperformed single model approaches. It means that it is possible to improve the accuracy of sub-pixel imperviousness change assessment using ensembles of heterogeneous non-linear regression models.
Rolling Regressions with Stata
This talk will describe some work underway to add a "rolling regression" capability to Stata's suite of time series features. Although commands such as "statsby" permit analysis of non-overlapping subsamples in the time domain, they are not suited to the analysis of overlapping (e.g. "moving window") samples. Both moving-window and widening-window techniques are often used to judge the stability of time series regression relationships. We will present an implementation of a rolling regression...
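The difference between the moving-window and widening-window samples mentioned above can be illustrated outside Stata with a short sketch. The code below is plain Python rather than the Stata command the talk describes; the function names and the single-regressor setup are assumptions chosen to keep the example small.

```python
def ols_fit(xs, ys):
    """Ordinary least squares for y = intercept + slope * x; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def moving_window_regression(xs, ys, window):
    """Refit on every overlapping block of `window` consecutive observations."""
    return [ols_fit(xs[i:i + window], ys[i:i + window])
            for i in range(len(xs) - window + 1)]


def widening_window_regression(xs, ys, min_obs):
    """Refit on samples that always start at the first observation and grow by one."""
    return [ols_fit(xs[:end], ys[:end])
            for end in range(min_obs, len(xs) + 1)]


# Made-up series that follows y = 2x + 1 exactly, for illustration only.
xs_demo = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys_demo = [2.0 * x + 1.0 for x in xs_demo]
moving = moving_window_regression(xs_demo, ys_demo, window=3)
widening = widening_window_regression(xs_demo, ys_demo, min_obs=3)
```

Plotting the sequence of slopes from either function against time is a common way to judge whether a regression relationship is stable.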
Logistic Regression for Evolving Data Streams Classification
Logistic regression is a fast classifier and can achieve higher accuracy on small training data. Moreover, it can work on both discrete and continuous attributes with nonlinear patterns. Based on these properties of logistic regression, this paper proposes an algorithm, called evolutionary logistical regression classifier (ELRClass), to classify evolving data streams. The algorithm applies logistic regression repeatedly to a sliding window of samples in order to update the existing classifier, to keep the classifier if its performance deteriorates because of a burst of noise, or to construct a new classifier if a major concept drift is detected. Extensive experimental results demonstrate the effectiveness of this algorithm.
Quasi-regression, motivated by problems arising in computer experiments, focuses mainly on speeding up evaluation. However, its theoretical properties have not been explored systematically. This paper shows that quasi-regression is unbiased, strongly convergent and asymptotically normal for parameter estimation, but biased for curve fitting. Furthermore, a new method called unbiased quasi-regression is proposed. In addition to retaining the above asymptotic behavior of the parameter estimates, unbiased quasi-regression is unbiased for curve fitting.
2009-01-01
Covers the use of dynamic and interactive computer graphics in linear regression analysis, focusing on analytical graphics. Features new techniques like plot rotation. The authors have composed their own regression code, using Xlisp-Stat language called R-code, which is a nearly complete system for linear regression analysis and can be utilized as the main computer program in a linear regression course. The accompanying disks, for both Macintosh and Windows computers, contain the R-code and Xlisp-Stat. An Instructor's Manual presenting detailed solutions to all the problems in the book is ava
Master linear regression techniques with a new edition of a classic text. Reviews of the Second Edition: "I found it enjoyable reading and so full of interesting material that even the well-informed reader will probably find something new . . . a necessity for all of those who do linear regression." -Technometrics, February 1987 "Overall, I feel that the book is a valuable addition to the now considerable list of texts on applied linear regression. It should be a strong contender as the leading text for a first serious course in regression analysis." -American Scientist, May-June 1987
The concept of the "main direction" of a multi-dimensional wavelet is established in this paper, and the capabilities of several classical complex wavelets for representing directional singularities are investigated based on their main directions. It is proved to be impossible to represent directional singularities optimally by a multi-resolution analysis (MRA) of L2(R2). Based on the above results, a new algorithm to construct a Q-shift dual tree complex wavelet is proposed. By optimizing the main direction of parameterized wavelet filters, the difficulty in choosing the stop-band frequency is overcome and the performance of the designed wavelet is improved. Furthermore, results of image enhancement by various multi-scale methods are given, which show that the newly designed Q-shift complex wavelet does offer significant improvement over the conventionally used wavelets. Direction sensitivity is an important index of the performance of 2D wavelets.
Aboveground carbon (C) sequestration in trees is important in global C dynamics, but reliable techniques for its modeling in highly productive and heterogeneous ecosystems are limited. We applied an extended dendrochronological approach to disentangle the functioning of drivers from the atmosphere (temperature, precipitation), the lithosphere (sedimentation rate), the hydrosphere (groundwater table, river water level fluctuation), the biosphere (tree characteristics), and the anthroposphere (dike construction). Carbon sequestration in aboveground biomass of riparian Quercus robur L. and Fraxinus excelsior L. was modeled (1) over time using boosted regression tree analysis (BRT) on cross-datable trees characterized by equal annual growth ring patterns and (2) across space using a subsequent classification and regression tree analysis (CART) on cross-datable and not cross-datable trees. While C sequestration of cross-datable Q. robur responded to precipitation and temperature, cross-datable F. excelsior also responded to a low Danube river water level. However, CART revealed that C sequestration over time is governed by tree height and parameters that vary over space (magnitude of fluctuation in the groundwater table, vertical distance to mean river water level, and longitudinal distance to upstream end of the study area). Thus, a uniform response to climatic drivers of aboveground C sequestration in Q. robur was only detectable in trees of an intermediate height class and in taller trees (>21.8m) on sites where the groundwater table fluctuated little (≤0.9m). The detection of climatic drivers and the river water level in F. excelsior depended on sites at lower altitudes above the mean river water level (≤2.7m) and along a less dynamic downstream section of the study area. Our approach indicates unexploited opportunities of understanding the interplay of different environmental drivers in aboveground C sequestration. Results may support species-specific and
2011-05-01
The compilation of stand growth models usually relies on regression analysis. Homoscedasticity, i.e. homogeneity of the residual variance, is one of the assumptions underlying regression analysis. Violating this assumption lowers model accuracy, which shows up as a low coefficient of determination and a high standard error. The problem of heteroscedasticity can be addressed with weighted regression analysis. The selected raiser growth model in this research was transformed into the model equation ln P = a + b/A, for which there was a significant correlation between growth and age (R2 = 55.04%, sb0 = 0.041, and sb1 = 0.171). With weighted regression analysis using the weight wi = 1/”Xi, there was no real correlation between growth and age (R2 = 0.55%, sb0 = 0.572, and sb1 = 2.560); this weight gives much lower accuracy than the unweighted fit. However, weighted regression analysis with the weight wi = 1/si2, where si2 is the residual variance in the i-th group of the independent variable (X), shows a significant correlation between growth and age (R2 = 45.46%, sb0 = 0.084, and sb1 = 0.205). Therefore it can be said that the accuracy was much better than regression without weights. Furthermore, weighted regression analysis with the weight wi = 1/si2, where si2 is the residual variance at the i-th value of the independent variable (X) estimated through a second-order polynomial regression model, shows a very significant correlation between growth and age (R2 = 87.22%, sb0 = 0.029, and sb1 = 0.072). This last result shows better accuracy than the preceding treatments. From this research, it can be concluded that, with a suitable weight, weighted regression analysis can improve the accuracy of a raiser growth model. Keywords: growth model, weighted regression, acacia mangium, regression analysis
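The weighted fit of ln P = a + b/A with weights wi = 1/si2 can be sketched as below. This is a minimal illustration under stated assumptions, not the study's procedure: the function names are hypothetical, the group variances are taken as given rather than estimated, and the data are made up for the example.

```python
import math


def weighted_least_squares(xs, ys, ws):
    """Minimise sum(w * (y - a - b * x) ** 2); returns (a, b)."""
    sw = sum(ws)
    mean_x = sum(w * x for w, x in zip(ws, xs)) / sw
    mean_y = sum(w * y for w, y in zip(ws, ys)) / sw
    sxx = sum(w * (x - mean_x) ** 2 for w, x in zip(ws, xs))
    sxy = sum(w * (x - mean_x) * (y - mean_y) for w, x, y in zip(ws, xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b


def fit_growth_model(ages, growth, residual_variances):
    """Fit ln P = a + b / A using weights w_i = 1 / s_i^2 (variances assumed known)."""
    xs = [1.0 / age for age in ages]       # regressor is 1/A
    ys = [math.log(p) for p in growth]     # response is ln P
    ws = [1.0 / s2 for s2 in residual_variances]
    return weighted_least_squares(xs, ys, ws)


# Made-up ages, growth values and residual variances, for illustration only.
ages_demo = [2.0, 4.0, 5.0, 10.0]
growth_demo = [math.exp(3.0 - 2.0 / age) for age in ages_demo]  # exact model, a=3, b=-2
a_hat, b_hat = fit_growth_model(ages_demo, growth_demo, [0.5, 0.2, 0.2, 0.1])
```

Observations with a larger residual variance receive a smaller weight, so the noisier age groups pull the fitted curve less than the more precise ones.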
Heteroscedasticity checks for regression models
For checking on heteroscedasticity in regression models, a unified approach is proposed to constructing test statistics in parametric and nonparametric regression models. For nonparametric regression, the test is not affected sensitively by the choice of smoothing parameters which are involved in estimation of the nonparametric regression function. The limiting null distribution of the test statistic remains the same in a wide range of the smoothing parameters. When the covariate is one-dimensional, the tests are, under some conditions, asymptotically distribution-free. In the high-dimensional cases, the validity of bootstrap approximations is investigated. It is shown that a variant of the wild bootstrap is consistent while the classical bootstrap is not in the general case, but is applicable if some extra assumption on conditional variance of the squared error is imposed. A simulation study is performed to provide evidence of how the tests work and compare with tests that have appeared in the literature. The approach may readily be extended to handle partial linear, and linear autoregressive models.
2011-01-01
Quantum simulations constructing probability tensors of biological multi-taxa in phylogenetic trees are proposed, in terms of positive trace preserving maps, describing evolving systems of quantum walks with multiple walkers. Basic phylogenetic models applying on trees of various topologies are simulated following appropriate decoherent quantum circuits. Quantum simulations of statistical inference for aligned sequences of biological characters are provided in terms of a quantum pruning map operating on likelihood operator observables, utilizing state-observable duality and measurement theory.
2009-08-27
Regression quantiles can be substantially biased when the covariates are measured with error. In this paper we propose a new method that produces consistent linear quantile estimation in the presence of covariate measurement error. The method corrects the measurement error induced bias by constructing joint estimating equations that simultaneously hold for all the quantile levels. An iterative EM-type estimation algorithm to obtain the solutions to such joint estimation equations is provided. The finite sample performance of the proposed method is investigated in a simulation study, and compared to the standard regression calibration approach. Finally, we apply our methodology to part of the National Collaborative Perinatal Project growth data, a longitudinal study with an unusual measurement error structure. © 2009 American Statistical Association.
2012-01-19
This paper introduces a novel partition-based regression approach that incorporates topological information. Partition-based regression typically introduces a quality-of-fit-driven decomposition of the domain. The emphasis in this work is on a topologically meaningful segmentation. Thus, the proposed regression approach is based on a segmentation induced by a discrete approximation of the Morse–Smale complex. This yields a segmentation with partitions corresponding to regions of the function with a single minimum and maximum that are often well approximated by a linear model. This approach yields regression models that are amenable to interpretation and have good predictive capacity. Typically, regression estimates are quantified by their geometrical accuracy. For the proposed regression, an important aspect is the quality of the segmentation itself. Thus, this article introduces a new criterion that measures the topological accuracy of the estimate. The topological accuracy provides a complementary measure to the classical geometrical error measures and is very sensitive to overfitting. The Morse–Smale regression is compared to state-of-the-art approaches in terms of geometry and topology and yields comparable or improved fits in many cases. Finally, a detailed study on climate-simulation data demonstrates the application of the Morse–Smale regression. Supplementary Materials are available online and contain an implementation of the proposed approach in the R package msr, an analysis and simulations on the stability of the Morse–Smale complex approximation, and additional tables for the climate-simulation study.
Interpreting Tree Ensembles with inTrees
Deng, Houtao
2014-01-01
Tree ensembles such as random forests and boosted trees are accurate but difficult to understand, debug and deploy. In this work, we provide the inTrees (interpretable trees) framework that extracts, measures, prunes and selects rules from a tree ensemble, and calculates frequent variable interactions. A rule-based learner, referred to as the simplified tree ensemble learner (STEL), can also be formed and used for future prediction. The inTrees framework can be applied to both classification and regression problems.
Bordacconi, Mats Joe; Larsen, Martin Vinæs
2014-01-01
Humans are fundamentally primed for making causal attributions based on correlations. This implies that researchers must be careful to present their results in a manner that inhibits unwarranted causal attribution. In this paper, we present the results of an experiment that suggests regression models – one of the primary vehicles for analyzing statistical results in political science – encourage causal interpretation. Specifically, we demonstrate that presenting observational results in a regression model, rather than as a simple comparison of means, makes causal interpretation of the results more likely. Our experiment shows that the subjects who were presented with results as estimates from a regression model were more inclined to interpret these results causally.
Matthias Schmid
Regression analysis with a bounded outcome is a common problem in applied statistics. Typical examples include regression models for percentage outcomes and the analysis of ratings that are measured on a bounded scale. In this paper, we consider beta regression, which is a generalization of logit models to situations where the response is continuous on the interval (0,1). Consequently, beta regression is a convenient tool for analyzing percentage responses. The classical approach to fit a beta regression model is to use maximum likelihood estimation with subsequent AIC-based variable selection. As an alternative to this established - yet unstable - approach, we propose a new estimation technique called boosted beta regression. With boosted beta regression, estimation and variable selection can be carried out simultaneously in a highly efficient way. Additionally, both the mean and the variance of a percentage response can be modeled using flexible nonlinear covariate effects. As a consequence, the new method accounts for common problems such as overdispersion and non-binomial variance structures.
Hosmer, David W; Sturdivant, Rodney X
2013-01-01
A new edition of the definitive guide to logistic regression modeling for health science and other applications. This thoroughly expanded Third Edition provides an easily accessible introduction to the logistic regression (LR) model and highlights the power of this model by examining the relationship between a dichotomous outcome and a set of covariables. Applied Logistic Regression, Third Edition emphasizes applications in the health sciences and handpicks topics that best suit the use of modern statistical software.
Weisberg, Sanford
2013-01-01
Praise for the Third Edition: "...this is an excellent book which could easily be used as a course text..." -International Statistical Institute. The Fourth Edition of Applied Linear Regression provides a thorough update of the basic theory and methodology of linear regression modeling. Demonstrating the practical applications of linear regression analysis techniques, the Fourth Edition uses interesting, real-world exercises and examples. Stressing central concepts such as model building, understanding parameters, assessing fit and reliability, and drawing conclusions, the new edition illustrates these concepts through real-world applications.
A hierarchical linear model for tree height prediction.
Vicente J. Monleon
2003-01-01
Measuring tree height is a time-consuming process. Often, tree diameter is measured and height is estimated from a published regression model. Trees used to develop these models are clustered into stands, but this structure is ignored and independence is assumed. In this study, hierarchical linear models that account explicitly for the clustered structure of the data are developed for tree height prediction.
A Distributed Spanning Tree Algorithm
Johansen, Karl Erik; Jørgensen, Ulla Lundin; Nielsen, Sven Hauge
We present a distributed algorithm for constructing a spanning tree for connected undirected graphs. Nodes correspond to processors and edges correspond to two-way channels. Each processor has initially a distinct identity and all processors perform the same algorithm. Computation as well as communication is asynchronous. The total number of messages sent during a construction of a spanning tree is at most 2E+3NlogN. The maximal message size is loglogN+log(maxid)+3, where maxid is the maximal processor identity.
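To get a feel for the two bounds quoted in this abstract, they can be evaluated for a concrete network size. The sketch below only does the arithmetic; the function names are hypothetical, and the base-2 logarithm is an assumption, since the abstract does not state the base.

```python
import math


def max_messages(num_nodes, num_edges):
    """Upper bound 2E + 3N log N on messages sent (log base 2 is an assumption)."""
    if num_nodes < 2 or num_edges < num_nodes - 1:
        raise ValueError("need a connected graph with at least two nodes")
    return 2 * num_edges + 3 * num_nodes * math.log2(num_nodes)


def max_message_size(num_nodes, max_id):
    """Upper bound loglog N + log(maxid) + 3 on message size (log base 2 is an assumption)."""
    if num_nodes < 2 or max_id < 1:
        raise ValueError("need at least two nodes and a positive maximal identity")
    return math.log2(math.log2(num_nodes)) + math.log2(max_id) + 3


# A path on 4 processors (3 channels) and a 16-processor network with identities up to 16.
bound_small = max_messages(num_nodes=4, num_edges=3)
size_16 = max_message_size(num_nodes=16, max_id=16)
```

For sparse graphs the 3NlogN term dominates the message bound, while for dense graphs the 2E term does.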
A distributed spanning tree algorithm
Johansen, Karl Erik; Jørgensen, Ulla Lundin; Nielsen, Svend Hauge
1988-01-01
We present a distributed algorithm for constructing a spanning tree for connected undirected graphs. Nodes correspond to processors and edges correspond to two-way channels. Each processor has initially a distinct identity and all processors perform the same algorithm. Computation as well as communication is asynchronous. The total number of messages sent during a construction of a spanning tree is at most 2E+3NlogN. The maximal message size is loglogN+log(maxid)+3, where maxid is the maximal processor identity.
Automated Generation of Attack Trees
Vigo, Roberto; Nielson, Flemming; Nielson, Hanne Riis
2014-01-01
Attack trees are widely used to represent threat scenarios in a succinct and intuitive manner, suitable for conveying security information to non-experts. The manual construction of such objects relies on the creativity and experience of specialists, and therefore it is error-prone and impractical for large systems.
Transductive Ordinal Regression
Seah, Chun-Wei; Ong, Yew-Soon
2011-01-01
Ordinal regression is commonly formulated as a multi-class problem with ordinal constraints. The challenge of designing accurate classifiers for ordinal regression generally increases with the number of classes involved, due to the large number of labeled patterns that are needed. Ordinal class labels, however, are often costly to calibrate or difficult to obtain. Unlabeled patterns, on the other hand, often exist in much greater abundance and are freely available. To benefit from the abundance of unlabeled patterns, we present a novel transductive learning paradigm for ordinal regression in this paper, namely Transductive Ordinal Regression (TOR). The key challenge of the present study lies in the precise estimation of both the ordinal class label of the unlabeled data and the decision functions of the ordinal classes, simultaneously.
Canfield, Elaine
2002-01-01
Describes a fifth-grade art activity that offers a new approach to creating pictures of Aspen trees. Explains that the students learned about art concepts, such as line and balance, in this lesson. Discusses the process in detail for creating the pictures. (CMK)
Nonparametric Predictive Regression
Ioannis Kasparis; Elena Andreou; Phillips, Peter C.B.
2012-01-01
A unifying framework for inference is developed in predictive regressions where the predictor has unknown integration properties and may be stationary or nonstationary. Two easily implemented nonparametric F-tests are proposed. The test statistics are related to those of Kasparis and Phillips (2012) and are obtained by kernel regression. The limit distribution of these predictive tests holds for a wide range of predictors including stationary as well as non-stationary fractional and near unit root processes.
Paulo Eduardo Telles dos Santos
2010-06-01
Based on the assessment of ten technological traits of eucalypt wood for sawn timber and energy purposes, a multivariate statistical procedure was developed to determine the sequence of logs to be sampled, so as to represent all the statistical variation contained within the tree and, accordingly, to establish the appropriate sampling intensity. The present work used a total of 40 logs from four 18-year-old trees of Eucalyptus grandis, provenance Concórdia-SC. Using principal components regression analysis and stepwise selection techniques, it was shown that only two logs, corresponding to the first (0.05 m to 2.60 m) and fourth (8.85 m to 11.40 m) positions in the tree, contained 99.2% of the total variation originally detected. If a single log is adopted, the recommendation is the fourth log, which represented 97.5% of the total original variation. For this population, the statistical procedure contributed substantially to reducing the time and financial costs normally associated with studies of this kind, without affecting the statistical information exhibited by the whole group of logs that would usually be sampled.
Extensions and applications of ensemble-of-trees methods in machine learning
Bleich, Justin
Ensemble-of-trees algorithms have emerged to the forefront of machine learning due to their ability to generate high forecasting accuracy for a wide array of regression and classification problems. Classic ensemble methodologies such as random forests (RF) and stochastic gradient boosting (SGB) rely on algorithmic procedures to generate fits to data. In contrast, more recent ensemble techniques such as Bayesian Additive Regression Trees (BART) and Dynamic Trees (DT) focus on an underlying Bayesian probability model to generate the fits. These new probability model-based approaches show much promise versus their algorithmic counterparts, but also offer substantial room for improvement. The first part of this thesis focuses on methodological advances for ensemble-of-trees techniques with an emphasis on the more recent Bayesian approaches. In particular, we focus on extensions of BART in four distinct ways. First, we develop a more robust implementation of BART for both research and application. We then develop a principled approach to variable selection for BART as well as the ability to naturally incorporate prior information on important covariates into the algorithm. Next, we propose a method for handling missing data that relies on the recursive structure of decision trees and does not require imputation. Last, we relax the assumption of homoskedasticity in the BART model to allow for parametric modeling of heteroskedasticity. The second part of this thesis returns to the classic algorithmic approaches in the context of classification problems with asymmetric costs of forecasting errors. First we consider the performance of RF and SGB more broadly and demonstrate its superiority to logistic regression for applications in criminology with asymmetric costs. Next, we use RF to forecast unplanned hospital readmissions upon patient discharge with asymmetric costs taken into account. Finally, we explore the construction of stable decision trees for forecasts of
孙亚琳; 赵林林; 杨小平
2014-01-01
To help users navigate websites effectively, improve website quality, and construct a Web semantic model, this paper presents a new approach and framework for learning from Web pages, using formal concept analysis (FCA) to build a semantic concept tree. First, information extraction and natural language processing tools extract and segment the text, and statistical methods identify the feature words. Second, the feature words are mapped to thesaurus terms using a search-engine-based similarity calculation. Third, a formal context is formed and then reduced using rules, clustering, and other techniques. Finally, a lattice-construction algorithm builds the concept lattice, whose hierarchy is then transformed into the concept tree. Experimental results show that the concept tree can serve as the basis of a Web ontology model and is useful for semantic assessment, and that the approach has practical and reference value.
郭昌辉; 刘贵全; 张磊
2012-01-01
Storage device performance prediction is a key element of self-managed storage systems and of application planning tasks such as data assignment. Traditional prediction methods, such as accurate simulations and analytic models, require substantial expertise about the storage devices. As storage devices become more high-end and complex, accurate simulations and analytic models are often unavailable. Machine learning methods, by contrast, treat the storage device as a black box and need no information about its internal components or algorithms, so they are better suited to current trends in storage device development; their drawback is that prediction accuracy is not high enough. The classification and regression tree (CART) method for modelling storage devices is simple. This work explores an interactive model based on regression trees and the K-nearest neighbor (KNN) algorithm to improve on the machine learning approach. Experiments show that the proposed model has higher prediction precision and better stability than either the regression tree or KNN alone. The experiments also showed that the caching effect is very important: an improved workload characterization method that accounts for caching makes a substantial difference to prediction accuracy.
Tree enterprises and bankruptcy ventures: a game theoretic similarity due to a graph theoretic proof
In a tree enterprise, users reside at the nodes of the tree and their aim is to connect themselves, directly or indirectly, to the root of the tree. The construction costs of arcs of the tree are given by means of the arc-cost-function associated with the tree. Further the bankruptcy venture is
Unimodular trees versus Einstein trees
The maximally helicity violating tree-level scattering amplitudes involving three, four or five gravitons are worked out in Unimodular Gravity. They are found to coincide with the corresponding amplitudes in General Relativity. This is a remarkable result, insofar as both the propagators and the vertices are quite different in the two theories.
Modelling tree biomasses in Finland
Biomass equations for above- and below-ground tree components of Scots pine (Pinus sylvestris L), Norway spruce (Picea abies [L.] Karst) and birch (Betula pendula Roth and Betula pubescens Ehrh.) were compiled using empirical material from a total of 102 stands. These stands (44 Scots pine, 34 Norway spruce and 24 birch stands) were located mainly on mineral soil sites representing a large part of Finland. The biomass models were based on data measured from 1648 sample trees, comprising 908 pine, 613 spruce and 127 birch trees. Biomass equations were derived for the total above-ground biomass and for the individual tree components: stem wood, stem bark, living and dead branches, needles, stump, and roots, as dependent variables. Three multivariate models with different numbers of independent variables for above-ground biomass and one for below-ground biomass were constructed. Variables that are normally measured in forest inventories were used as independent variables. The simplest model formulations, multivariate models (1) were mainly based on tree diameter and height as independent variables. In more elaborated multivariate models, (2) and (3), additional commonly measured tree variables such as age, crown length, bark thickness and radial growth rate were added. Tree biomass modelling includes consecutive phases, which cause unreliability in the prediction of biomass. First, biomasses of sample trees should be determined reliably to decrease the statistical errors caused by sub-sampling. In this study, methods to improve the accuracy of stem biomass estimates of the sample trees were developed. In addition, the reliability of the method applied to estimate sample-tree crown biomass was tested, and no systematic error was detected. Second, the whole information content of data should be utilized in order to achieve reliable parameter estimates and applicable and flexible model structure. In the modelling approach, the basic assumption was that the biomasses of
[Understanding logistic regression].
Logistic regression is one of the most common multivariate analysis models utilized in epidemiology. It allows the measurement of the association between the occurrence of an event (qualitative dependent variable) and factors susceptible to influence it (explicative variables). The choice of explicative variables that should be included in the logistic regression model is based on prior knowledge of the disease physiopathology and the statistical association between the variable and the event, as measured by the odds ratio. The main steps for the procedure, the conditions of application, and the essential tools for its interpretation are discussed concisely. We also discuss the importance of the choice of variables that must be included and retained in the regression model in order to avoid the omission of important confounding factors. Finally, by way of illustration, we provide an example from the literature, which should help the reader test his or her knowledge.
Practical Session: Logistic Regression
An exercise is proposed to illustrate logistic regression. It investigates the different risk factors for the onset of coronary heart disease. The exercise was proposed in Chapter 5 of the book by D.G. Kleinbaum and M. Klein, "Logistic Regression", Statistics for Biology and Health, Springer Science Business Media, LLC (2010), and also by D. Chessel and A.B. Dufour in Lyon 1 (see Sect. 6 of http://pbil.univ-lyon1.fr/R/pdf/tdr341.pdf). The example is based on the data in the file evans.txt from http://www.sph.emory.edu/dkleinb/logreg3.htm#data.
A new and alternative quantile regression estimator is developed and it is shown that the estimator is root n-consistent and asymptotically normal. The estimator is based on a minimax ‘deviance function’ and has asymptotically equivalent properties to the usual quantile regression estimator. It is......, however, a different and therefore new estimator. It allows for both linear- and nonlinear model specifications. A simple algorithm for computing the estimates is proposed. It seems to work quite well in practice but whether it has theoretical justification is still an open question....
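For orientation, the "usual quantile regression estimator" mentioned above is commonly defined as the minimizer of the check (pinball) loss. The sketch below illustrates only that standard criterion in the intercept-only case, where the minimizer is a sample quantile; it is not the minimax deviance estimator of the abstract. The function names and the data are hypothetical and chosen for illustration.

```python
def check_loss(residual, tau):
    """Check function rho_tau(u) = u * (tau - 1[u < 0])."""
    indicator = 1.0 if residual < 0 else 0.0
    return residual * (tau - indicator)


def total_check_loss(ys, q, tau):
    """Sum of check losses when every observation is predicted by the constant q."""
    return sum(check_loss(y - q, tau) for y in ys)


def quantile_by_check_loss(ys, tau):
    """Intercept-only quantile 'regression': a minimizer of the summed check
    loss always exists among the observed values, so search over those."""
    return min(sorted(ys), key=lambda q: total_check_loss(ys, q, tau))


# Illustrative data (not from the paper).
data = [2.0, 7.0, 1.0, 9.0, 4.0]
median = quantile_by_check_loss(data, 0.5)
upper = quantile_by_check_loss(data, 0.9)
print(median, upper)
```

With covariates, the constant q is replaced by a linear (or nonlinear) predictor and the same loss is minimized over its parameters.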
Tools for valuing tree and park services
Arborists and urban foresters plan, design, construct, and manage trees and parks in cities throughout the world. These civic improvements create walkable, cool environments, save energy, reduce stormwater runoff, sequester carbon dioxide, and absorb air pollutants. The presence of trees and green spaces in cities is associated with increases in property values,...
Which Spatial Partition Trees are Adaptive to Intrinsic Dimension?
Recent theory work has found that a special type of spatial partition tree - called a random projection tree - is adaptive to the intrinsic dimension of the data from which it is built. Here we examine this same question, with a combination of theory and experiments, for a broader class of trees that includes k-d trees, dyadic trees, and PCA trees. Our motivation is to get a feel for (i) the kind of intrinsic low dimensional structure that can be empirically verified, (ii) the extent to which a spatial partition can exploit such structure, and (iii) the implications for standard statistical tasks such as regression, vector quantization, and nearest neighbor search.
R is a rapidly evolving lingua franca of graphical display and statistical analysis of experiments from the applied sciences. This book provides a coherent treatment of nonlinear regression with R by means of examples from a diversity of applied sciences such as biology, chemistry, engineering, medicine and toxicology.
Multiple linear regression analysis
Program rapidly selects best-suited set of coefficients. User supplies only vectors of independent and dependent data and specifies confidence level required. Program uses stepwise statistical procedure for relating minimal set of variables to set of observations; final regression contains only most statistically significant coefficients. Program is written in FORTRAN IV for batch execution and has been implemented on NOVA 1200.
Adaptive metric kernel regression
regression by minimising a cross-validation estimate of the generalisation error. This allows to automatically adjust the importance of different dimensions. The improvement in terms of modelling performance is illustrated on a variable selection task where the adaptive metric kernel clearly outperforms...
Software Regression Verification
2013-12-11
of recursive procedures. Acta Informatica, 45(6):403–439, 2008. [GS11] Benny Godlin and Ofer Strichman. Regression verification. Technical Report...functions. Therefore, we need to redefine m-term. – Mutual termination. If either function f or function f′ (or both) is nondeterministic, then their
Concise, mathematically clear, and comprehensive treatment of the subject. Expanded coverage of diagnostics and methods of model fitting. Requires no specialized knowledge beyond a good grasp of matrix algebra and some acquaintance with straight-line regression and simple analysis of variance models. More than 200 problems throughout the book plus outline solutions for the exercises. This revision has been extensively class-tested.
The spatial distribution of construction land is closely related to regional economic and social development, so timely monitoring and delivery of data on construction land dynamics matter greatly for policy and decision making. Classifying land use/land cover and analyzing changes are among the most common applications of remote sensing, and one of the most basic and difficult classification tasks is distinguishing construction land from other land surfaces. Landsat imagery is one of the most widely used data sources for remote sensing of construction land. Several techniques for extracting construction land from Landsat data are described in the literature, but they usually rely on a single index or on multiple indexes, and their application is constrained by low accuracy in various situations. The purpose of this study was to devise a method that improves the accuracy of construction land extraction in the presence of various kinds of environmental noise. We therefore introduce a multi-feature decision tree (DT) classification model to improve classification accuracy in areas that include bare land, shadow and some streams, where other classification methods often fail. The model integrates four spectral indexes, a pattern recognition technique and a spatial algorithm. The four spectral indexes are the normalized difference three bands index (NDTBI), the normalized difference building index (NDBI), the modified normalized difference water index (MNDWI) and the normalized difference vegetation index (NDVI). The pattern recognition technique is the support vector machine (SVM), and the spatial algorithm creates buffer zones. The test site was deliberately selected to contain complex surface features, such as bare land, hill shade, and some small streams, that are liable to be confused with construction land on Landsat imagery. For that reason, Landsat-8
Finite Sholander Trees, Trees, and their Betweenness
We provide a proof of Sholander's claim (Trees, lattices, order, and betweenness, Proc. Amer. Math. Soc. 3, 369-381 (1952)) concerning the representability of collections of so-called segments by trees, which yields a characterization of the interval function of a tree. Furthermore, we streamline Burigana's characterization (Tree representations of betweenness relations defined by intersection and inclusion, Mathematics and Social Sciences 185, 5-36 (2009)) of tree betweenness and provide a relatively short proof.
We consider the problem of stabilizing an unstable plant driven by bounded noise over a digital noisy communication link, a scenario at the heart of networked control. To stabilize such a plant, one needs real-time encoding and decoding with an error probability profile that decays exponentially with the decoding delay. The works of Schulman and Sahai over the past two decades have developed the notions of tree codes and anytime capacity, and provided the theoretical framework for studying such problems. Nonetheless, there has been little practical progress in this area due to the absence of explicit constructions of tree codes with efficient encoding and decoding algorithms. Recently, linear time-invariant tree codes were proposed to achieve the desired result under maximum-likelihood decoding. In this work, we take one more step towards practicality, by showing that these codes can be efficiently decoded using sequential decoding algorithms, up to some loss in performance (and with some practical complexity caveats). We supplement our theoretical results with numerical simulations that demonstrate the effectiveness of the decoder in a control system setting.
Low rank Multivariate regression
We consider in this paper the multivariate regression problem when the target regression matrix $A$ is close to a low rank matrix. Our primary interest is in the practical case where the variance of the noise is unknown. Our main contribution is to propose in this setting a criterion to select among a family of low rank estimators and to prove a non-asymptotic oracle inequality for the resulting estimator. We also investigate the easier case where the variance of the noise is known and show that the penalties appearing in our criteria are minimal (in some sense). These penalties involve the expected value of the Ky-Fan quasi-norm of some random matrices. These quantities can be evaluated easily in practice, and upper bounds can be derived from recent results in random matrix theory.
Subset selection in regression
Originally published in 1990, the first edition of Subset Selection in Regression filled a significant gap in the literature, and its critical and popular success has continued for more than a decade. Thoroughly revised to reflect progress in theory, methods, and computing power, the second edition promises to continue that tradition. The author has thoroughly updated each chapter, incorporated new material on recent developments, and included more examples and references. New in the Second Edition: a separate chapter on Bayesian methods; complete revision of the chapter on estimation; a major example from the field of near infrared spectroscopy; more emphasis on cross-validation; greater focus on bootstrapping; stochastic algorithms for finding good subsets from large numbers of predictors when an exhaustive search is not feasible; software available on the Internet for implementing many of the algorithms presented; more examples. Subset Selection in Regression, Second Edition remains dedicated to the techniques for fitting...
Because of global climate change, it is necessary to add forest biomass estimation to national forest resource monitoring. The biomass equations developed for forest biomass estimation should be compatible with volume equations. Based on the tree volume and aboveground biomass data of Masson pine (Pinus massoniana Lamb.) in southern China, we constructed one-, two- and three-variable aboveground biomass equations and biomass conversion functions compatible with tree volume equations by using error-in-variable simultaneous equations. The prediction precision of aboveground biomass estimates from the one-variable equation exceeded 95%. The regressions of the aboveground biomass equations improved slightly when tree height and crown width were used together with diameter at breast height, although the contributions to the regressions were statistically insignificant. For the one-variable biomass conversion function, the conversion factor decreased with increasing diameter; for the two-variable conversion function, the conversion factor increased with increasing diameter but decreased with increasing tree height.
. There are, however, decreasing returns to aid, and the estimated effectiveness of aid is highly sensitive to the choice of estimator and the set of control variables. When investment and human capital are controlled for, no positive effect of aid is found. Yet, aid continues to impact on growth via...... investment. We conclude by stressing the need for more theoretical work before this kind of cross-country regressions are used for policy purposes....
Robust Nonstationary Regression
1993-01-01
This paper provides a robust statistical approach to nonstationary time series regression and inference. Fully modified extensions of traditional robust statistical procedures are developed which allow for endogeneities in the nonstationary regressors and serial dependence in the shocks that drive the regressors and the errors that appear in the equation being estimated. The suggested estimators involve semiparametric corrections to accommodate these possibilities and they belong to the same ...
TWO REGRESSION CREDIBILITY MODELS
In this communication we discuss two regression credibility models from non-life insurance mathematics that can be solved by means of matrix theory. In the first regression credibility model, starting from a well-known representation formula for the inverse of a special class of matrices, a risk premium is calculated for a contract with risk parameter θ. In the second regression credibility model, we obtain a credibility solution in the form of a linear combination of the individual estimate (based on the data of a particular state) and the collective estimate (based on aggregate USA data). To illustrate the solution with the properties mentioned above, we need the well-known representation theorem for a special class of matrices, the properties of the trace of a square matrix, the scalar product of two vectors, the norm with respect to a given positive definite matrix, and the properties of conditional expectations and conditional covariances.
Evaluation of acoustic tomography for tree decay detection
In this study, the acoustic tomography technique was used to detect internal decay in high-value black cherry (Prunus serotina) trees. Two-dimensional images of the cross sections of the tree samples were constructed using PiCUS Q70 software. The trees were felled following the field test, and a disc from each testing elevation was subsequently cut...
Category of trees in representation theory of quantum algebras
New applications of categorical methods are connected with new additional structures on categories. One of such structures in representation theory of quantum algebras, the category of Kuznetsov-Smorodinsky-Vilenkin-Smirnov (KSVS) trees, is constructed, whose objects are finite rooted KSVS trees and morphisms generated by the transition from a KSVS tree to another one.
The geometry of inner spanning trees for planar polygons
We study the geometry of minimal inner spanning trees for planar polygons (that is, spanning trees whose edge-intervals lie in these polygons). We construct analogues of Voronoi diagrams and Delaunay triangulations, prove that every minimal inner spanning tree is a subgraph of an appropriate Delaunay triangulation, and describe the possible structure of the cells of such triangulations.
A Vertex Oriented Approach to Minimum Cost Spanning Tree Problems
In this paper we consider spanning tree problems in which n players want to be connected to a source as cheaply as possible. We introduce and analyze (n!) vertex oriented construct and charge procedures for such spanning tree situations leading in n steps to a minimum cost spanning tree and a cost shari
Objective: To screen serum tumor markers associated with cervical cancer using logistic regression, and to further establish an auxiliary diagnosis model for squamous cell carcinoma antigen (SCC) in cervical cancer using the chi-squared automatic interaction detector (CHAID) classification tree method. Methods: A total of 581 newly diagnosed cervical cancer patients, 342 patients with benign cervical disease and 341 healthy controls who were tested for tumor markers at Taizhou Hospital of Zhejiang Province during 2010-2013 were studied retrospectively. Their levels of carbohydrate antigen 199 (CA199), carbohydrate antigen 125 (CA125), carcinoembryonic antigen (CEA), SCC, and alpha fetoprotein (AFP) were reviewed. Logistic regression was first used to select the statistically significant tumor markers, and the CHAID decision tree method was then used to determine the value of those markers in the auxiliary diagnosis of cervical cancer. Finally, 284 patients with uterus-related disease whose SCC exceeded the diagnostic value obtained in this study were collected from January to December 2014, and the proportion of cervical cancer patients among them was calculated to validate the CHAID result. Results: Logistic regression showed that, of the five tumor markers possibly associated with cervical cancer, only SCC was statistically significant (Wald χ2 = 22.120, P = 0.000), with an OR of 1.900 (95% CI 1.454-2.483). The proportion of cervical cancer patients rose as the SCC value increased; when SCC > 2.20 μg/L, the positive predictive value reached 94.7%. Among the 284 patients with suspected uterus-related disease and SCC above 2.20 μg/L, 95.1% (270 cases) were finally confirmed as cervical cancer. Conclusion: SCC has good auxiliary diagnostic value for patients with cervical cancer.
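The validation figures reported in this abstract can be checked with a few lines of arithmetic. The sketch below recomputes the validation proportion (270 of 284) and, under the assumption that the reported interval is a Wald interval of the form exp(beta ± 1.96·se), which the abstract does not state, backs out the log-odds coefficient and an approximate Wald statistic from the reported OR and 95% CI. The variable names are hypothetical.

```python
import math

# Figures taken from the abstract.
confirmed_cancer = 270      # validation patients confirmed as cervical cancer
scc_above_cutoff = 284      # validation patients with SCC > 2.20 ug/L
odds_ratio = 1.900
ci_low, ci_high = 1.454, 2.483

# Proportion of confirmed cancers among patients above the cutoff.
validation_ppv = confirmed_cancer / scc_above_cutoff

# Logistic regression coefficient on the log-odds scale.
beta = math.log(odds_ratio)

# Assumption: the CI is exp(beta +/- 1.96 * se), so its log-width is 2 * 1.96 * se.
se = (math.log(ci_high) - math.log(ci_low)) / (2 * 1.96)

# Wald chi-square implied by the rounded OR and CI.
wald_chi2 = (beta / se) ** 2

print(f"validation proportion = {validation_ppv:.1%}")
print(f"beta = {beta:.3f}, se = {se:.3f}, Wald chi2 = {wald_chi2:.1f}")
```

The implied Wald statistic is close to the reported 22.120; the small gap is what one would expect from working with the rounded OR and interval limits.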
Generalised k-Steiner Tree Problems in Normed Planes
2011-01-01
The 1-Steiner tree problem, the problem of constructing a Steiner minimum tree containing at most one Steiner point, has been solved in the Euclidean plane by Georgakopoulos and Papadimitriou using plane subdivisions called oriented Dirichlet cell partitions. Their algorithm produces an optimal solution within $O(n^2)$ time. In this paper we generalise their approach in order to solve the $k$-Steiner tree problem, in which the Steiner minimum tree may contain up to $k$ Steiner points for a gi...
Modified Regression Correlation Coefficient for Poisson Regression Model
Kaengthong, Nattacha; Domthong, Uthumporn
2017-09-01
This study examines indicators of the predictive power of the Generalized Linear Model (GLM), which are widely used but often subject to restrictions. We are interested in the regression correlation coefficient for a Poisson regression model. This measure of predictive power is defined by the relationship between the dependent variable (Y) and its expected value given the independent variables [E(Y|X)]; the dependent variable follows a Poisson distribution. The purpose of this research was to modify the regression correlation coefficient for the Poisson regression model. We also compare the proposed modified coefficient with the traditional regression correlation coefficient in the case of two or more independent variables and in the presence of multicollinearity among them. The results show that the proposed coefficient outperforms the traditional one in terms of bias and root mean square error (RMSE).
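As a rough illustration of the traditional measure described above, the regression correlation coefficient can be computed as the Pearson correlation between the observed counts Y and the fitted means E(Y|X). The sketch below assumes a single covariate with a log link and takes the coefficients b0 and b1 as given rather than estimating them; the data, the coefficient values and the function names are hypothetical, and the modified coefficient proposed in the paper is not reproduced here.

```python
import math


def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def poisson_means(x, b0, b1):
    """Fitted means E(Y|X) = exp(b0 + b1 * x) under a log link."""
    return [math.exp(b0 + b1 * xi) for xi in x]


def regression_correlation(y, x, b0, b1):
    """Traditional regression correlation coefficient: corr(Y, E(Y|X))."""
    return pearson(y, poisson_means(x, b0, b1))


# Illustrative counts and assumed coefficients (not from the paper).
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1, 2, 2, 5, 9]
r = regression_correlation(y, x, b0=0.0, b1=0.5)
print(round(r, 3))
```

In practice b0 and b1 would come from a fitted Poisson GLM; they are fixed here only to keep the example self-contained.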
A 10-month-old baby presented with developmental delay and had flaccid paralysis on physical examination. An MRI of the spine revealed malformation of the ninth and tenth thoracic vertebral bodies with complete agenesis of the rest of the spine below that level. The thoracic spinal cord ended at the level of the fifth thoracic vertebra, with agenesis of the posterior arches of the eighth, ninth and tenth thoracic vertebral bodies. The roots of the cauda equina appeared tightened downward and backward and ended in subdermal fibrous fatty tissue at the level of the ninth and tenth thoracic vertebral bodies (closed meningocele). These findings are consistent with caudal regression syndrome.
Efficient FPT Algorithms for (Strict) Compatibility of Unrooted Phylogenetic Trees.
In phylogenetics, a central problem is to infer the evolutionary relationships between a set of species X; these relationships are often depicted via a phylogenetic tree-a tree having its leaves labeled bijectively by elements of X and without degree-2 nodes-called the "species tree." One common approach for reconstructing a species tree consists in first constructing several phylogenetic trees from primary data (e.g., DNA sequences originating from some species in X), and then constructing a single phylogenetic tree maximizing the "concordance" with the input trees. The obtained tree is our estimation of the species tree and, when the input trees are defined on overlapping-but not identical-sets of labels, is called "supertree." In this paper, we focus on two problems that are central when combining phylogenetic trees into a supertree: the compatibility and the strict compatibility problems for unrooted phylogenetic trees. These problems are strongly related, respectively, to the notions of "containing as a minor" and "containing as a topological minor" in the graph community. Both problems are known to be fixed parameter tractable in the number of input trees k, by using their expressibility in monadic second-order logic and a reduction to graphs of bounded treewidth. Motivated by the fact that the dependency on k of these algorithms is prohibitively large, we give the first explicit dynamic programming algorithms for solving these problems, both running in time [Formula: see text], where n is the total size of the input.
FastTree: Computing Large Minimum-Evolution Trees with Profiles instead of a Distance Matrix
Gene families are growing rapidly, but standard methods for inferring phylogenies do not scale to alignments with over 10,000 sequences. We present FastTree, a method for constructing large phylogenies and for estimating their reliability. Instead of storing a distance matrix, FastTree stores sequence profiles of internal nodes in the tree. FastTree uses these profiles to implement neighbor-joining and uses heuristics to quickly identify candidate joins. FastTree then uses nearest-neighbor interchanges to reduce the length of the tree. For an alignment with N sequences, L sites, and a different characters, a distance matrix requires O(N^2) space and O(N^2 L) time, but FastTree requires just O( NLa + N sqrt(N) ) memory and O( N sqrt(N) log(N) L a ) time. To estimate the tree's reliability, FastTree uses local bootstrapping, which gives another 100-fold speedup over a distance matrix. For example, FastTree computed a tree and support values for 158,022 distinct 16S ribosomal RNAs in 17 hours and 2.4 gigabytes of memory. Just computing pairwise Jukes-Cantor distances and storing them, without inferring a tree or bootstrapping, would require 17 hours and 50 gigabytes of memory. In simulations, FastTree was slightly more accurate than neighbor joining, BIONJ, or FastME; on genuine alignments, FastTree's topologies had higher likelihoods. FastTree is available at http://microbesonline.org/fasttree.
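The 50-gigabyte figure quoted for storing pairwise distances can be sanity-checked with simple arithmetic. The sketch below counts the unordered sequence pairs for N = 158,022 and multiplies by an assumed 4 bytes per stored distance; the abstract does not state the storage format, so the byte size and the choice to store only one triangle of the matrix are assumptions. It also prints the N·sqrt(N) term from FastTree's memory bound for comparison.

```python
import math

n_sequences = 158_022          # number of 16S rRNA sequences, from the abstract

# Unordered pairs: one triangle of the N x N distance matrix.
n_pairs = n_sequences * (n_sequences - 1) // 2

BYTES_PER_DISTANCE = 4         # assumption: single-precision float per distance
matrix_gb = n_pairs * BYTES_PER_DISTANCE / 1e9

# The N * sqrt(N) term in FastTree's O(NLa + N sqrt(N)) memory bound.
n_sqrt_n = n_sequences * math.sqrt(n_sequences)

print(f"pairs              = {n_pairs:,}")
print(f"distance matrix    = {matrix_gb:.1f} GB (assumed 4 bytes per entry)")
print(f"N*sqrt(N) / N^2    = {n_sqrt_n / n_sequences**2:.5f}")
```

Under these assumptions the matrix comes to roughly 50 GB, consistent with the abstract, while the N·sqrt(N) term is smaller than N^2 by a factor of sqrt(N), about 400 here.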
Analyzing and synthesizing phylogenies using tree alignment graphs.
View Dependent Sequential Point Trees
Sequential point trees are the state-of-the-art technique for rendering point models: hierarchical points are rearranged sequentially according to geometric error so that they can be processed on the GPU for fast rendering. This paper presents a view-dependent method that augments sequential point trees by embedding the hierarchical tree structure in the sequential list of hierarchical points. Two kinds of indices are constructed so that points are rendered in an order running mostly from near to far and from coarse to fine. As a result, invisible points can be culled view-dependently and efficiently for hardware acceleration, while the advantages of sequential point trees are fully retained. The new method therefore runs much faster than conventional sequential point trees, and the speedup is greatest when objects have complex occlusion relationships and are viewed closely, because invisible points then make up a high percentage of the points at finer levels.
The combinatorics of tandem duplication trees.
We developed a recurrence relation that counts the number of tandem duplication trees (either rooted or unrooted) that are consistent with a set of n tandemly repeated sequences generated under the standard unequal recombination (or crossover) model of tandem duplications. The number of rooted duplication trees is exactly twice the number of unrooted trees, which means that on average only two positions for a root on a duplication tree are possible. Using the recurrence, we tabulated these numbers for small values of n. We also developed an asymptotic formula that for large n provides estimates for these numbers. These numbers give a priori probabilities for phylogenies of the repeated sequences to be duplication trees. This work extends earlier studies where exhaustive counts of the numbers for small n were obtained. One application showed the significance of finding that most maximum-parsimony trees constructed from repeat sequences from human immunoglobins and T-cell receptors were tandem duplication trees. Those findings provided strong support to the proposed mechanisms of tandem gene duplication. The recurrence relation also suggests efficient algorithms to recognize duplication trees and to generate random duplication trees for simulation. We present a linear-time recognition algorithm.
Combinatorics of distance-based tree inference.
Several popular methods for phylogenetic inference (or hierarchical clustering) are based on a matrix of pairwise distances between taxa (or any kind of objects): The objective is to construct a tree with branch lengths so that the distances between the leaves in that tree are as close as possible to the input distances. If we hold the structure (topology) of the tree fixed, in some relevant cases (e.g., ordinary least squares) the optimal values for the branch lengths can be expressed using simple combinatorial formulae. Here we define a general form for these formulae and show that they all have two desirable properties: First, the common tree reconstruction approaches (least squares, minimum evolution), when used in combination with these formulae, are guaranteed to infer the correct tree when given enough data (consistency); second, the branch lengths of all the simple (nearest neighbor interchange) rearrangements of a tree can be calculated, optimally, in quadratic time in the size of the tree, thus allowing the efficient application of hill climbing heuristics. The study presented here is a continuation of that by Mihaescu and Pachter on branch length estimation [Mihaescu R, Pachter L (2008) Proc Natl Acad Sci USA 105:13206-13211]. The focus here is on the inference of the tree itself and on providing a basis for novel algorithms to reconstruct trees from distances.
Healthy trees are important to us all. Trees provide shade, beauty, and homes for wildlife. Trees give us products like paper and wood. Trees can give us all this only if they are healthy. They must be well cared for to remain healthy.
minimum variance estimation of yield parameters of rubber tree with ...
2013-03-01
Mar 1, 2013 ... STAMP, an OxMetric modular software system for time series analysis, was used to estimate the yield ... derlying regression techniques. .... Kalman Filter Minimum Variance Estimation of Rubber Tree Yield Parameters. 83.
Realization of Ridge Regression in MATLAB
The least squares estimator (LSE) of the coefficients in the classical linear regression model is unbiased. When the columns of the design matrix are multicollinear, the LSE has very large variance, i.e., the estimator is unstable. A more stable (but biased) estimator can be constructed using the ridge estimator (RE). This paper considers the basic methods of obtaining ridge estimators and numerical procedures for implementing them in MATLAB. An application to a pharmacokinetics problem is considered.
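As a rough illustration of the ridge estimator described above, the sketch below solves the penalized normal equations (X'X + lam*I) beta = X'y. It is written in Python rather than MATLAB and is not the paper's procedure; the function name `ridge_fit`, the penalty symbol `lam`, and the toy data are assumptions made for the example, and the predictors are assumed to be centered so that no intercept is needed.

```python
def ridge_fit(X, y, lam):
    """Hypothetical helper: solve (X^T X + lam*I) beta = X^T y.

    X is a list of rows, y a list of responses, lam >= 0 the ridge penalty.
    lam = 0 gives the ordinary least squares solution (if X^T X is invertible).
    """
    n, p = len(X), len(X[0])
    # Augmented system [X^T X + lam*I | X^T y].
    A = [
        [sum(X[i][r] * X[i][c] for i in range(n)) + (lam if r == c else 0.0)
         for c in range(p)]
        + [sum(X[i][r] * y[i] for i in range(n))]
        for r in range(p)
    ]
    # Forward elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(col + 1, p):
            f = A[r][col] / A[col][col]
            for c in range(col, p + 1):
                A[r][c] -= f * A[col][c]
    # Back substitution.
    beta = [0.0] * p
    for r in reversed(range(p)):
        s = A[r][p] - sum(A[r][c] * beta[c] for c in range(r + 1, p))
        beta[r] = s / A[r][r]
    return beta


# Toy data (assumed): OLS recovers [1, 2]; the ridge fit shrinks toward zero.
X_demo = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y_demo = [1.0, 2.0, 3.0]
beta_ols = ridge_fit(X_demo, y_demo, 0.0)
beta_ridge = ridge_fit(X_demo, y_demo, 1.0)
```

Increasing `lam` trades a little bias for lower variance, which is the stabilizing effect the abstract refers to.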
We propose a new formalism for quantum field theory which is neither based on functional integrals, nor on Feynman graphs, but on marked trees. This formalism is constructive, i.e. it computes correlation functions through convergent rather than divergent expansions. It applies both to Fermionic and Bosonic theories. It is compatible with the renormalization group, and it allows one to define, non-perturbatively, differential renormalization group equations. It accommodates any general stable polynomial Lagrangian. It can equally well treat noncommutative models or matrix models such as the Grosse-Wulkenhaar model. Perhaps most importantly it removes the space-time background from its central place in QFT, paving the way for a nonperturbative definition of field theory in noninteger dimension.
Recursive Algorithm For Linear Regression
Order of model determined easily. Linear-regression algorithm includes recursive equations for coefficients of model of increased order. Algorithm eliminates duplicative calculations, facilitates search for minimum order of linear-regression model fitting set of data satisfactorily.
Improving Cluster Analysis with Automatic Variable Selection Based on Trees
2014-12-01
Master's thesis by Anton D. Orr, December 2014. Thesis Advisor: Samuel E. Buttrey. ...2006 based on classification and regression trees to address problems with determining dissimilarity. Current algorithms do not simultaneously address
Node degree distribution in spanning trees
A method is presented for computing the number of spanning trees involving one link or a specified group of links, and excluding another link or a specified group of links, in a network described by a simple graph in terms of derivatives of the spanning-tree generating function defined with respect to the eigenvalues of the Kirchhoff (weighted Laplacian) matrix. The method is applied to deduce the node degree distribution in a complete or randomized set of spanning trees of an arbitrary network. An important feature of the proposed method is that the explicit construction of spanning trees is not required. It is shown that the node degree distribution in the spanning trees of the complete network is described by the binomial distribution. Numerical results are presented for the node degree distribution in square, triangular, and honeycomb lattices.
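The abstract works with derivatives of a spanning-tree generating function. As a simpler, related illustration (not the paper's method), the sketch below counts spanning trees with the classical matrix-tree theorem, i.e. as a cofactor of the unweighted Laplacian, and obtains the number of spanning trees containing a given link by subtracting the count for the graph with that link removed. The function name `count_spanning_trees` and the example graph are assumptions.

```python
from fractions import Fraction


def count_spanning_trees(n, edges):
    """Matrix-tree theorem: determinant of the Laplacian with row/column 0 removed.

    n is the number of nodes (labelled 0..n-1); edges is a list of (u, v) pairs.
    Exact rational arithmetic avoids floating-point rounding in the determinant.
    """
    lap = [[Fraction(0)] * n for _ in range(n)]
    for u, v in edges:
        lap[u][u] += 1
        lap[v][v] += 1
        lap[u][v] -= 1
        lap[v][u] -= 1
    m = n - 1
    minor = [row[1:] for row in lap[1:]]
    det = Fraction(1)
    for col in range(m):
        pivot = next((r for r in range(col, m) if minor[r][col] != 0), None)
        if pivot is None:
            return 0  # singular minor: the graph is disconnected
        if pivot != col:
            minor[col], minor[pivot] = minor[pivot], minor[col]
            det = -det
        det *= minor[col][col]
        for r in range(col + 1, m):
            f = minor[r][col] / minor[col][col]
            for c in range(col, m):
                minor[r][c] -= f * minor[col][c]
    return int(det)


# Complete graph on 4 nodes (assumed example).
k4 = [(i, j) for i in range(4) for j in range(i + 1, 4)]
total = count_spanning_trees(4, k4)
# Trees that include link (0, 1) = all trees minus trees of the graph without it.
without_01 = count_spanning_trees(4, [e for e in k4 if e != (0, 1)])
with_01 = total - without_01
```

Like the method in the abstract, this never constructs the spanning trees explicitly; only determinants are evaluated.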
PREDICTION OF EUCALYPTUS WOOD BY COKRIGING, KRIGING AND REGRESSION
In the Gypsum Pole of Araripe, in the semiarid zone of Pernambuco, where 97% of the plaster consumed in Brazil is produced, a forest experiment with 1875 eucalyptus trees was felled and all the trees were rigorously cubed by the Smalian method. The location of each tree was marked on a Cartesian plane, and a sample of 200 trees was drawn by an entirely random process. For the 200 sample trees, three methods of estimating timber volume were used: regression analysis, kriging and cokriging. For the cokriging method, the secondary variable was the DBH (diameter at breast height); the regression model of Spurr, or combined-variable model, uses two explanatory variables: total tree height (H) and DBH. The variables volume and DBH showed spatial dependency. To compare the methods, the coefficient of determination (R2) and the distribution of the residuals (real versus estimated data) were used. The best results were achieved with the Spurr equation, with R2 = 0.82 and an estimated total volume of 166.25 m3. Cokriging gave R2 = 0.72 with an estimated total volume of 164.14 m3, and kriging gave R2 = 0.32 with an estimated total volume of 163.21 m3. The real volume of the experiment was 166.14 m3. Key words: Forest inventory, Volume of timber, Geostatistics.
Dendroclimatic reconstruction with time varying predictor subsets of tree indices
Tree-ring site chronologies, the predictors for most dendroclimatic reconstructions, are essentially mean-value functions with a time varying sample size (number of trees) and sample composition. Because reconstruction models are calibrated and verified on the most recent, best-replicated part of the chronologies, regression and verification statistics can be misleading as indicators of long-term reconstruction accuracy. A new reconstruction method is described that circumvents the use of site chronologies and instead derives predictor variables from indices of individual trees. Separate regression models are estimated and cross validated for various time segments of the tree-ring record, depending on the trees available at the time. This approach allows the reconstruction to extend to the first year covered by any tree in the network and yields direct evaluation of the change in reconstruction accuracy with tree-ring sample composition. The method includes two regression stages. The first is to separately deconvolve the local climate signal for individual trees, and the second is to weight the deconvolved signals into estimates of the climatic variable to be reconstructed. The method is illustrated in an application of precipitation and tree-ring data for the San Pedro River Basin in southeastern Arizona. Extensions to larger-scale problems and spatial reconstruction are suggested. 17 refs., 4 figs., 4 tabs.
Algorithms for optimal dyadic decision trees
A new algorithm for constructing optimal dyadic decision trees was recently introduced, analyzed, and shown to be very effective for low dimensional data sets. This paper enhances and extends this algorithm by: introducing an adaptive grid search for the regularization parameter that guarantees optimal solutions for all relevant tree sizes, revising the core tree-building algorithm so that its run time is substantially smaller for most regularization parameter values on the grid, and incorporating new data structures and data pre-processing steps that provide significant run time enhancement in practice.
Reconciliation of Gene and Species Trees
The first part of the paper briefly overviews the problem of gene and species tree reconciliation, with a focus on the definition and algorithmic construction of the evolutionary scenario. Basic ideas are discussed for the aspects of mapping definitions, costs of the mapping and evolutionary scenario, imposing time scales on a scenario, incorporating horizontal gene transfers, binarization and reconciliation of polytomous trees, and construction of species trees and scenarios. The review does not intend to cover the vast diversity of literature published on these subjects. Instead, the authors strove to overview the problem of the evolutionary scenario as a central concept in many areas of evolutionary research. The second part provides detailed mathematical proofs for the solutions of two problems: (i) inferring gene evolution along a species tree, accounting for various types of evolutionary events, and (ii) reconciliation of trees into a single species tree when only gene duplications and losses are allowed. All proposed algorithms have cubic time complexity and are mathematically proved to find exact solutions. The algorithms solving problem (ii) can be naturally extended to incorporate horizontal transfers, other evolutionary events, and time scales on the species tree.
2008-12-01
A significant proportion of children diagnosed with Autistic Spectrum Disorder experience a developmental regression characterized by a loss of previously-acquired skills. This may involve a loss of speech or social responsiveness, but often entails both. This paper critically reviews the phenomena of regression in autistic spectrum disorders, highlighting the characteristics of regression, age of onset, temporal course, and long-term outcome. Important considerations for diagnosis are discussed and multiple etiological factors currently hypothesized to underlie the phenomenon are reviewed. It is argued that regressive autistic spectrum disorders can be conceptualized on a spectrum with other regressive disorders that may share common pathophysiological features. The implications of this viewpoint are discussed.
Extended Beta Regression in R: Shaken, Stirred, Mixed, and Partitioned
Beta regression, an increasingly popular approach for modeling rates and proportions, is extended in various directions: (a) bias correction/reduction of the maximum likelihood estimator, (b) beta regression tree models by means of recursive partitioning, (c) latent class beta regression by means of finite mixture models. All three extensions may be of importance for enhancing the beta regression toolbox in practice to provide more reliable inference and capture both observed and unobserved/latent heterogeneity in the data. Using the analogy of Smithson and Verkuilen (2006), these extensions make beta regression not only “a better lemon squeezer” (compared to classical least squares regression) but a full-fledged modern juicer offering lemon-based drinks: shaken and stirred (bias correction and reduction), mixed (finite mixture model), or partitioned (tree model). All three extensions are provided in the R package betareg (at least 2.4-0), building on generic algorithms and implementations for bias correction/reduction, model-based recursive partitioning, and finite mixture models, respectively. Specifically, the new functions betatree() and betamix() reuse the object-oriented flexible implementation from the R packages party and flexmix, respectively.
On algorithm for building of optimal α-decision trees
The paper describes an algorithm that constructs approximate decision trees (α-decision trees), which are optimal relative to one of the following complexity measures: depth, total path length or number of nodes. The algorithm uses dynamic programming and extends methods described in [4] to constructing approximate decision trees. An adjustable approximation rate allows the complexity of the algorithm to be controlled. The algorithm is applied to build optimal α-decision trees for two data sets from UCI Machine Learning Repository [1]. © 2010 Springer-Verlag Berlin Heidelberg.
Linear regression in astronomy. I
Five methods for obtaining linear regression fits to bivariate data with unknown or insignificant measurement errors are discussed: ordinary least-squares (OLS) regression of Y on X, OLS regression of X on Y, the bisector of the two OLS lines, orthogonal regression, and 'reduced major-axis' regression. These methods have been used by various researchers in observational astronomy, most importantly in cosmic distance scale applications. Formulas for calculating the slope and intercept coefficients and their uncertainties are given for all the methods, including a new general form of the OLS variance estimates. The accuracy of the formulas was confirmed using numerical simulations. The applicability of the procedures is discussed with respect to their mathematical properties, the nature of the astronomical data under consideration, and the scientific purpose of the regression. It is found that, for problems needing symmetrical treatment of the variables, the OLS bisector performs significantly better than orthogonal or reduced major-axis regression.
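The five slope estimators listed above can all be written in terms of the sample sums of squares and cross-products. The sketch below uses the standard textbook expressions for these slopes; the function name `five_slopes` and the sample data are assumptions, the uncertainty formulas mentioned in the abstract are not reproduced, and the code assumes a nonzero covariance between the two variables.

```python
import math


def five_slopes(x, y):
    """Return slopes of five symmetric/asymmetric line fits to (x, y).

    Assumes the sample covariance of x and y is nonzero.
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sign = math.copysign(1.0, sxy)

    b1 = sxy / sxx  # OLS regression of Y on X
    b2 = syy / sxy  # OLS regression of X on Y, expressed as dy/dx
    # Line bisecting the angle between the two OLS lines.
    b3 = (b1 * b2 - 1.0 + math.sqrt((1.0 + b1 ** 2) * (1.0 + b2 ** 2))) / (b1 + b2)
    # Orthogonal regression (minimizes perpendicular distances).
    d = b2 - 1.0 / b1
    b4 = 0.5 * (d + sign * math.sqrt(4.0 + d * d))
    # Reduced major axis: geometric mean of the two OLS slopes.
    b5 = sign * math.sqrt(b1 * b2)
    return {"ols_yx": b1, "ols_xy": b2, "bisector": b3,
            "orthogonal": b4, "rma": b5}


# Noise-free line y = 2x + 1: every method should return slope 2.
exact = five_slopes([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
# Scattered data (assumed values): the slopes now differ by method.
noisy = five_slopes([1.0, 2.0, 3.0, 4.0, 5.0], [2.3, 3.1, 6.8, 7.2, 10.9])
```

For each slope b, the matching intercept is the mean of y minus b times the mean of x, since every one of these lines passes through the centroid of the data.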
On the book thickness of $k$-trees
Dujmović, Vida
2009-01-01
Every $k$-tree has book thickness at most $k+1$, and this bound is best possible for all $k\geq 3$. Vandenbussche et al. (2009) proved that every $k$-tree that has a smooth degree-3 tree decomposition with width $k$ has book thickness at most $k$. We prove this result is best possible for $k\geq 4$, by constructing a $k$-tree with book thickness $k+1$ that has a smooth degree-4 tree decomposition with width $k$. This solves an open problem of Vandenbussche et al. (2009).
2007-02-01
In the last few decades several techniques for image content extraction, often based on segmentation, have been proposed. It has been suggested that under the assumption of very general image content, segmentation becomes unstable and classification becomes unreliable. According to recent psychological theories, certain image regions attract the attention of human observers more than others and, generally, the main meaning of the image appears concentrated in those regions. Initially, regions attracting our attention are perceived as a whole and hypotheses on their content are formulated; successively the components of those regions are carefully analyzed and a more precise interpretation is reached. It is interesting to observe that an image decomposition process performed according to these psychological visual attention theories might present advantages with respect to a traditional segmentation approach. In this paper we propose an automatic procedure generating image decomposition based on the detection of visual attention regions. A new clustering algorithm taking advantage of the Delaunay-Voronoi diagrams for achieving the decomposition target is proposed. By applying that algorithm recursively, starting from the whole image, a transformation of the image into a tree of related meaningful regions is obtained (Attention Tree). Successively, a semantic interpretation of the leaf nodes is carried out by using a structure of Neural Networks (Neural Tree) assisted by a knowledge base (Ontology Net). Starting from leaf nodes, paths toward the root node across the Attention Tree are attempted. The task of the path consists in relating the semantics of each child-parent node pair and, consequently, in merging the corresponding image regions. The relationship detected in this way between two tree nodes generates, as a result, the extension of the interpreted image area through each step of the path. The construction of several Attention Trees has been performed and partial
One of the problems encountered in Multiple Linear Regression (MLR) is multicollinearity, which causes overestimation of the regression parameters and inflates their variance. Hence, when multicollinearity is present, biased estimation procedures such as classical Principal Component Regression (CPCR) and Partial Least Squares Regression (PLSR) are used. The SIMPLS algorithm is the leading PLSR algorithm because of its speed and efficiency and because its results are easier to interpret. However, both CPCR and SIMPLS yield very unreliable results when the data set contains outlying observations. Therefore, Hubert and Vanden Branden (2003) presented a robust PCR (RPCR) method and a robust PLSR (RPLSR) method called RSIMPLS. In RPCR, a robust Principal Component Analysis (PCA) method for high-dimensional data is first applied to the independent variables; the dependent variables are then regressed on the scores using a robust regression method. RSIMPLS is constructed from a robust covariance matrix for high-dimensional data and robust linear regression. The purpose of this study is to demonstrate the use of the RPCR and RSIMPLS methods on an econometric data set by comparing the two methods on an inflation model of Turkey. The methods are compared in terms of predictive ability and goodness of fit using a robust Root Mean Squared Error of Cross-validation (R-RMSECV), a robust R2 value and the Robust Component Selection (RCS) statistic.
Dynamics in small worlds of tree topologies of wireless sensor networks
Tree topologies, which construct spatial graphs with large characteristic path lengths and small clustering coefficients, are ubiquitous in deployments of wireless sensor networks. Small worlds are investigated in tree-based networks. Due to link additions, characteristic path lengths reduce...
Nonlinear wavelet estimation of regression function with random design
The nonlinear wavelet estimator of a regression function with random design is constructed. The optimal uniform convergence rate of the estimator in a ball of Besov space Bp,q? is proved under quite general assumptions. The adaptive nonlinear wavelet estimator with near-optimal convergence rate in a wide range of smoothness function classes is also constructed. The properties of the nonlinear wavelet estimator given for random design regression, requiring only a bounded third-order moment of the error, can be compared with those of the nonlinear wavelet estimator given in the literature for equally spaced fixed design regression with i.i.d. Gaussian errors.
Fault-Tree Compiler (FTC) program is software tool used to calculate probability of top event in fault tree. Gates of five different types allowed in fault tree: AND, OR, EXCLUSIVE OR, INVERT, and M OF N. High-level input language easy to understand and use. In addition, program supports hierarchical fault-tree definition feature, which simplifies tree-description process and reduces execution time. Set of programs created forming basis for reliability-analysis workstation: SURE, ASSIST, PAWS/STEM, and FTC fault-tree tool (LAR-14586). Written in PASCAL, ANSI-compliant C language, and FORTRAN 77. Other versions available upon request.
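To make the five gate types concrete, the sketch below evaluates the probability of a top event for a small fault tree. It is not the FTC program or its input language: the function names, the example tree, and the event probabilities are hypothetical, all inputs to a gate are assumed statistically independent (so no basic event is shared between subtrees), and M OF N is assumed to mean "at least M of the N inputs occur".

```python
import math


def p_and(ps):
    """All inputs occur."""
    return math.prod(ps)


def p_or(ps):
    """At least one input occurs."""
    return 1.0 - math.prod(1.0 - p for p in ps)


def p_xor(p, q):
    """Exactly one of two inputs occurs."""
    return p * (1.0 - q) + q * (1.0 - p)


def p_invert(p):
    """The input does not occur."""
    return 1.0 - p


def p_m_of_n(m, ps):
    """At least m of the inputs occur (inputs may have different probabilities)."""
    dist = [1.0]  # dist[k] = P(exactly k of the inputs seen so far occur)
    for p in ps:
        new = [0.0] * (len(dist) + 1)
        for k, w in enumerate(dist):
            new[k] += w * (1.0 - p)
            new[k + 1] += w * p
        dist = new
    return sum(dist[m:])


# Hypothetical tree: TOP = OR( AND(a, b), 2-OF-3(c, d, e) ).
a, b = 0.1, 0.2
c = d = e = 0.1
subsystem_1 = p_and([a, b])            # 0.02
subsystem_2 = p_m_of_n(2, [c, d, e])   # 3 * 0.1^2 * 0.9 + 0.1^3 = 0.028
top = p_or([subsystem_1, subsystem_2])
```

Evaluating each subtree once and reusing its probability mirrors, in a very small way, the benefit the abstract attributes to hierarchical fault-tree definitions.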
Categorizing Ideas about Trees: A Tree of Trees
Fisler, Marie; Lecointre, Guillaume
2013-01-01
The aim of this study is to explore whether matrices and MP trees used to produce systematic categories of organisms could be useful to produce categories of ideas in history of science. We study the history of the use of trees in systematics to represent the diversity of life from 1766 to 1991. We apply to those ideas a method inspired from coding homologous parts of organisms. We discretize conceptual parts of ideas, writings and drawings about trees contained in 41 main writings; we detect shared parts among authors and code them into a 91-characters matrix and use a tree representation to show who shares what with whom. In other words, we propose a hierarchical representation of the shared ideas about trees among authors: this produces a “tree of trees.” Then, we categorize schools of tree-representations. Classical schools like “cladists” and “pheneticists” are recovered but others are not: “gradists” are separated into two blocks, one of them being called here “grade theoreticians.” We propose new interesting categories like the “buffonian school,” the “metaphoricians,” and those using “strictly genealogical classifications.” We consider that networks are not useful to represent shared ideas at the present step of the study. A cladogram is made for showing who is sharing what with whom, but also heterobathmy and homoplasy of characters. The present cladogram is not modelling processes of transmission of ideas about trees, and here it is mostly used to test for proximity of ideas of the same age and for categorization. PMID:23950877
2008-01-01
An algorithm for time-adaptive quantile regression is presented. The algorithm is based on the simplex algorithm, and the linear optimization formulation of the quantile regression problem is given. The observations have been split to allow a direct use of the simplex algorithm. The simplex method...... and an updating procedure are combined into a new algorithm for time-adaptive quantile regression, which generates new solutions on the basis of the old solution, leading to savings in computation time. The suggested algorithm is tested against a static quantile regression model on a data set with wind power...... production, where the models combine splines and quantile regression. The comparison indicates superior performance for the time-adaptive quantile regression in all the performance parameters considered....
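The abstract mentions a linear optimization formulation of the quantile regression problem without reproducing it. For orientation, the standard textbook form is sketched below; the symbols ($X$ for the design matrix, $y$ for the observations, $\tau$ for the quantile level, $u^{+}$ and $u^{-}$ for the positive and negative parts of the residuals) are assumptions and may differ from the notation used in the paper.

```latex
\begin{aligned}
\min_{\beta,\; u^{+},\; u^{-}} \quad
  & \tau \, \mathbf{1}^{\top} u^{+} + (1 - \tau) \, \mathbf{1}^{\top} u^{-} \\
\text{subject to} \quad
  & X\beta + u^{+} - u^{-} = y, \\
  & u^{+} \ge 0, \quad u^{-} \ge 0 .
\end{aligned}
```

Splitting each residual into two nonnegative parts is what makes the piecewise-linear quantile loss a linear objective, so that the simplex method can be applied directly.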
Linear regression in astronomy. II
A wide variety of least-squares linear regression procedures used in observational astronomy, particularly investigations of the cosmic distance scale, are presented and discussed. The classes of linear models considered are (1) unweighted regression lines, with bootstrap and jackknife resampling; (2) regression solutions when measurement error, in one or both variables, dominates the scatter; (3) methods to apply a calibration line to new data; (4) truncated regression models, which apply to flux-limited data sets; and (5) censored regression models, which apply when nondetections are present. For the calibration problem we develop two new procedures: a formula for the intercept offset between two parallel data sets, which propagates slope errors from one regression to the other; and a generalization of the Working-Hotelling confidence bands to nonstandard least-squares lines. They can provide improved error analysis for Faber-Jackson, Tully-Fisher, and similar cosmic distance scale relations.
This study aimed to construct a prediction algorithm, which is readily applicable in the clinical setting, to determine the mortality rate for patients with P. aeruginosa bacteremia. A multicenter observational cohort study was performed retrospectively in seven university-affiliated hospitals in Korea from March 2012 to February 2015. In total, 264 adult patients with monomicrobial P. aeruginosa bacteremia were included in the analyses. Among the predictors independently associated with 30-day mortality in the Cox regression model, Pitt bacteremia score >2 and high-risk source of bacteremia were identified as critical nodes in the tree-structured survival analysis. Particularly, the empirical combination therapy was not associated with any survival benefit in the Cox regression model compared to the empirical monotherapy. This study suggests that determining the infection source and evaluating the clinical severity are critical to predict the clinical outcome in patients with P. aeruginosa bacteremia.
Regression is a common statistical tool for prediction in neuroscience. However, linear regression is by far the most common form of regression used, with regression trees receiving comparatively little attention. In this study, the results of conventional multiple linear regression (MLR) were compared with those of random forest regression (RFR), in the prediction of the concentrations of 9 neurochemicals in the vestibular nucleus complex and cerebellum that are part of the l-arginine biochemical pathway (agmatine, putrescine, spermidine, spermine, l-arginine, l-ornithine, l-citrulline, glutamate and γ-aminobutyric acid (GABA)). The R² values for the MLRs were higher than the proportion of variance explained values for the RFRs: 6/9 of them were ≥ 0.70 compared to 4/9 for RFRs. Even the variables that had the lowest R² values for the MLRs, e.g. ornithine (0.50) and glutamate (0.61), had much lower proportion of variance explained values for the RFRs (0.27 and 0.49, respectively). The RSE values for the MLRs were lower than those for the RFRs in all but two cases. In general, MLRs seemed to be superior to the RFRs in terms of predictive value and error. In the case of this data set, MLR appeared to be superior to RFR in terms of its explanatory value and error. This result suggests that MLR may have advantages over RFR for prediction in neuroscience with this kind of data set, but that RFR can still have good predictive value in some cases. Copyright © 2013 Elsevier B.V. All rights reserved.
The allometry of coarse root biomass: log-transformed linear regression or nonlinear regression?
Jiangshan Lai
Hinkle, Jacob; Fletcher, P Thomas; Joshi, Sarang
2012-01-01
In this paper we develop the theory of parametric polynomial regression in Riemannian manifolds and Lie groups. We show application of Riemannian polynomial regression to shape analysis in Kendall shape space. Results are presented, showing the power of polynomial regression on the classic rat skull growth data of Bookstein as well as the analysis of the shape changes associated with aging of the corpus callosum from the OASIS Alzheimer's study.
Evaluating Differential Effects Using Regression Interactions and Regression Mixture Models
Research increasingly emphasizes understanding differential effects. This article focuses on understanding regression mixture models, which are relatively new statistical methods for assessing differential effects by comparing results to using an interactive term in linear regression. The research questions which each model answers, their…
Large unbalanced credit scoring using Lasso-logistic regression ensemble.
Wang, Hong; Xu, Qingsong; Zhou, Lifeng
2015-01-01
Recently, various ensemble learning methods with different base classifiers have been proposed for credit scoring problems. However, for various reasons, there has been little research using logistic regression as the base classifier. In this paper, given large unbalanced data, we consider the plausibility of ensemble learning using regularized logistic regression as the base classifier to deal with credit scoring problems. In this research, the data is first balanced and diversified by clustering and bagging algorithms. Then we apply a Lasso-logistic regression learning ensemble to evaluate the credit risks. We show that the proposed algorithm outperforms popular credit scoring models such as decision tree, Lasso-logistic regression and random forests in terms of AUC and F-measure. We also provide two importance measures for the proposed model to identify important variables in the data.
technique at a given site has always been a major issue in all soil mapping applications. We studied the prediction performance of ordinary kriging (OK), stratified OK (OKst), regression trees (RT), and rule-based regression kriging (RKrr) for digital mapping of soil clay content at 30.4-m grid size using 6...
Quantile regression theory and applications
A guide to the implementation and interpretation of Quantile Regression models. This book explores the theory and numerous applications of quantile regression, offering empirical data analysis as well as the software tools to implement the methods. The main focus of this book is to provide the reader with a comprehensive description of the main issues concerning quantile regression; these include basic modeling, geometrical interpretation, estimation and inference for quantile regression, as well as issues on validity of the model, diagnostic tools. Each methodological aspect is explored and
Business applications of multiple regression
This second edition of Business Applications of Multiple Regression describes the use of the statistical procedure called multiple regression in business situations, including forecasting and understanding the relationships between variables. The book assumes a basic understanding of statistics but reviews correlation analysis and simple regression to prepare the reader to understand and use multiple regression. The techniques described in the book are illustrated using both Microsoft Excel and a professional statistical program. Along the way, several real-world data sets are analyzed in deta
minimum cost spanning tree T in G such that the total weight in T is at most a given bound B. In this paper, we present two polynomial time approximation schemes (PTASs) for the constrained minimum spanning tree problem.
Making Tree Ensembles Interpretable
Tree ensembles, such as random forests and boosted trees, are renowned for their high prediction performance, whereas their interpretability is critically limited. In this paper, we propose a post-processing method that improves the model interpretability of tree ensembles. After learning a complex tree ensemble in a standard way, we approximate it by a simpler model that is interpretable for humans. To obtain the simpler model, we derive the EM algorithm minimizing the KL divergence from the ...
This paper, dating from May 1991, contains preliminary (and unpublishable) notes on investigations about iteration trees. They will be of interest only to the specialist. In the first two sections I define notions of support and embeddings for tree iterations, proving for example that every tree iteration is a direct limit of finite tree iterations. This is a generalization to models with extenders of basic ideas of iterated ultrapowers using only ultrapowers. In the final section (which is m...
In recent years, under the influence of the US subprime mortgage crisis and the European debt crisis, many enterprises in Taiwan have faced bankruptcy or debt risk, and settlement defaults have sometimes occurred on the stock market. The management of an enterprise therefore needs to inspect its financial situation closely and guard early against operating risks. In this article, financial ratio data are collected from Taiwanese enterprises according to the financial five forces, and grey relational analysis is performed on activity, stability and profitability. The results are then ranked by grey relational grade to obtain the operating performance ranking of each enterprise. Next, a general regression neural network optimized by the fruit fly optimization algorithm, a plain general regression neural network, and multiple regression are each used to construct a model of enterprise operating performance, for the reference of researchers and company management. The analytical results show that the general regression neural network optimized by the fruit fly optimization algorithm achieves good convergence in RMSE and good classification and forecasting capability.
We study the tree edit distance problem with edge deletions and edge insertions as edit operations. We reformulate a special case of this problem as Covering Tree with Stars (CTS): given a tree T and a set of stars, can we connect the stars in by adding edges between them such that the resulting ...
Trees are great inspiration for artists. Many art teachers find themselves inspired and maybe somewhat obsessed with the natural beauty and elegance of the lofty tree, and how it changes through the seasons. One such tree that grows in several regions and always looks magnificent, regardless of the time of year, is the birch. In this article, the…
Brooks, Sarah DeWitt
2010-01-01
This article describes the author's experience in implementing a Wish Tree project in her school in an effort to bring the school community together with a positive art-making experience during a potentially stressful time. The concept of a wish tree is simple: plant a tree; provide tags and pencils for writing wishes; and encourage everyone to…
Validating MEDIQUAL Constructs
In this paper, we validate MEDIQUAL constructs across different media users of help desk services. In previous research, only two end-users' constructs were used: assurance and responsiveness. In this paper, we extend MEDIQUAL constructs to include reliability, empathy, assurance, tangibles, and responsiveness, which are based on the SERVQUAL theory. The results suggest that: 1) the five MEDIQUAL constructs are validated through factor analysis, that is, the constructs show relatively high correlations between measures of the same construct using different methods and low correlations between measures of constructs that are expected to differ; and 2) the five MEDIQUAL constructs are statistically significant for media users' satisfaction with help desk service, as shown by regression analysis.
Doubling bialgebras of rooted trees
The vector space spanned by rooted forests admits two graded bialgebra structures. The first is defined by Connes and Kreimer using admissible cuts, and the second is defined by Calaque, Ebrahimi-Fard and the second author using contraction of trees. In this article, we define the doubling of these two spaces. We construct two bialgebra structures on these spaces which are in interaction, as well as two related associative products obtained by dualization. We also show that these two bialgebras verify a commutative diagram similar to the diagram verified by Calaque, Ebrahimi-Fard and the second author in the case of the Hopf algebra of rooted trees, and by the second author in the case of cycle-free oriented graphs.
Constrained regression models for optimization and forecasting
Linear regression models and the interpretation of such models are investigated. In practice problems often arise with the interpretation and use of a given regression model in spite of the fact that researchers may be quite "satisfied" with the model. In this article methods are proposed which overcome these problems. This is achieved by constructing a model where the "area of experience" of the researcher is taken into account. This area of experience is represented as a convex hull of available data points. With the aid of a linear programming model it is shown how conclusions can be formed in a practical way regarding aspects such as optimal levels of decision variables and forecasting.
2016-12-01
Due to the large scales and limitations in accessing most online social networks, it is hard or infeasible to directly access them in a reasonable amount of time for studying and analysis. Hence, network sampling has emerged as a suitable technique to study and analyze real networks. The main goal of sampling online social networks is constructing a small scale sampled network which preserves the most important properties of the original network. In this paper, we propose two sampling algorithms for sampling online social networks using spanning trees. The first proposed sampling algorithm finds several spanning trees from randomly chosen starting nodes; then the edges in these spanning trees are ranked according to the number of times that each edge has appeared in the set of found spanning trees in the given network. The sampled network is then constructed as a sub-graph of the original network which contains a fraction of nodes that are incident on highly ranked edges. In order to avoid traversing the entire network, the second sampling algorithm is proposed using partial spanning trees. The second sampling algorithm is similar to the first algorithm except that it uses partial spanning trees. Several experiments are conducted to examine the performance of the proposed sampling algorithms on well-known real networks. The obtained results in comparison with other popular sampling methods demonstrate the efficiency of the proposed sampling algorithms in terms of Kolmogorov-Smirnov distance (KSD), skew divergence distance (SDD) and normalized distance (ND).
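The first sampling idea described above (rank edges by how often they appear in spanning trees grown from random start nodes, then keep nodes incident on highly ranked edges) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the use of breadth-first search to grow each tree, the tie-breaking order, and the function names `bfs_spanning_tree` and `sample_by_spanning_trees` are all assumptions.

```python
import random
from collections import Counter, deque


def bfs_spanning_tree(adj, start):
    """Return the edges of a BFS spanning tree of the component containing start."""
    seen = {start}
    queue = deque([start])
    edges = []
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                edges.append(frozenset((u, v)))
                queue.append(v)
    return edges


def sample_by_spanning_trees(adj, n_trees, fraction, seed=0):
    """Rank edges by spanning-tree frequency and keep nodes on top-ranked edges."""
    rng = random.Random(seed)
    nodes = sorted(adj)
    counts = Counter()
    for _ in range(n_trees):
        start = rng.choice(nodes)  # randomly chosen starting node
        counts.update(bfs_spanning_tree(adj, start))
    target = max(1, int(fraction * len(nodes)))
    kept = set()
    # Walk edges from most to least frequent, collecting their endpoints.
    for edge, _ in counts.most_common():
        for node in sorted(edge):
            if len(kept) < target:
                kept.add(node)
        if len(kept) >= target:
            break
    # The sample is the subgraph of the original network induced by the kept nodes.
    return {u: [v for v in adj[u] if v in kept] for u in sorted(kept)}


# Hypothetical toy network given as an adjacency list.
adj = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0, 4], 4: [3]}
sample = sample_by_spanning_trees(adj, n_trees=5, fraction=0.5)
```

The partial-spanning-tree variant mentioned in the abstract could be approximated by stopping each search after a fixed number of visited nodes; that cutoff would be a further assumption.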
Testing discontinuities in nonparametric regression
In nonparametric regression, it is often necessary to detect whether there are jump discontinuities in the mean function. In this paper, we revisit the difference-based method in [13] (H.-G. Müller and U. Stadtmüller, Discontinuous versus smooth regression, Ann. Stat. 27 (1999), pp. 299–337, doi: 10.1214/aos/1018031100)...
Logistic Regression: Concept and Application
The main focus of logistic regression analysis is the classification of individuals into different groups. The aim of the present study is to explain the basic concepts and processes of binary logistic regression analysis, which is intended to determine the combination of independent variables that best explains membership in certain groups called dichotomous…
Fractional Path Coloring in Bounded Degree Trees with Applications
This paper studies the natural linear programming relaxation of the path coloring problem. We prove constructively that finding an optimal fractional path coloring is Fixed Parameter Tractable (FPT), with the degree of the tree as parameter: the fractional coloring of paths in bounded degree trees can be done in a time which is linear in the size of the tree, quadratic in the load of the...
Fungible weights in logistic regression.
In this article we develop methods for assessing parameter sensitivity in logistic regression models. To set the stage for this work, we first review Waller's (2008) equations for computing fungible weights in linear regression. Next, we describe 2 methods for computing fungible weights in logistic regression. To demonstrate the utility of these methods, we compute fungible logistic regression weights using data from the Centers for Disease Control and Prevention's (2010) Youth Risk Behavior Surveillance Survey, and we illustrate how these alternate weights can be used to evaluate parameter sensitivity. To make our work accessible to the research community, we provide R code (R Core Team, 2015) that will generate both kinds of fungible logistic regression weights.
Tree rings and radiocarbon calibration
Only a few kinds of trees in Australia and Southeast Asia are known to have growth rings that are both distinct and annual. Those that do are therefore extremely important to climatic and isotope studies. In western Tasmania, extensive work with Huon pine (Lagarostrobos franklinii) has shown that many living trees are more than 1,000 years old, and that their ring widths are sensitive to temperature, rainfall and cloud cover (Buckley et al. 1997). At the Stanley River there is a forest of living (and recently felled) trees which we have sampled and measured. There are also thousands of subfossil Huon pine logs, buried at depths of less than 5 metres in an area of floodplain extending over a distance of more than a kilometre with a width of tens of metres. Some of these logs have been buried for 50,000 years or more, but most of them belong to the period between 15,000 years ago and the present. In previous expeditions in the 1980s and 1990s, we excavated and sampled about 350 logs (Barbetti et al. 1995; Nanson et al. 1995). By measuring the ring-width patterns, and matching them between logs and living trees, we have constructed a tree-ring dated chronology from 571 BC to AD 1992. We have also built a 4254-ring floating chronology (placed by radiocarbon at ca. 3,580 to 7,830 years ago), and an earlier 1268-ring chronology (ca. 7,580 to 8,850 years ago). There are many individual logs, or pairs of logs which match and together span several centuries, at 9,000 years ago and beyond. 15 refs., 1 tab., 1 fig.
Regression Testing Cost Reduction Suite
The estimated cost of software maintenance exceeds 70 percent of total software costs [1], and a large portion of this maintenance expense is devoted to regression testing. Regression testing is an expensive and frequently executed maintenance activity used to revalidate the modified software. Any reduction in the cost of regression testing would help to reduce the software maintenance cost. Test suites, once developed, are reused and updated frequently as the software evolves. As a result, some test cases in the test suite may become redundant when the software is modified over time, since the requirements covered by them are also covered by other test cases. Due to the resource and time constraints for re-executing large test suites, it is important to develop techniques to minimize available test suites by removing redundant test cases. In general, the test suite minimization problem is NP-complete. This paper proposes an effective approach for reducing the cost of the regression testing process. The proposed approach is applied to a real-time case study. It was found that the reduction in cost for each regression testing cycle is considerably greater for programs containing a high number of selected statements, which in turn maximizes the benefits of using the approach in regression testing of complex software systems. The reduction in the regression test suite size will reduce the effort and time required by the testing teams to execute the regression test suite. Since regression testing is done more frequently in the software maintenance phase, the overall software maintenance cost can be reduced considerably by applying the proposed approach.
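Because test suite minimization is NP-complete in general, heuristics are commonly used. The sketch below shows one standard heuristic, greedy set cover, as an illustration of the problem statement above; it is not the approach proposed in the paper, and the function name `minimize_suite` and the requirement labels are hypothetical.

```python
def minimize_suite(coverage):
    """Greedy set cover: coverage maps test name -> set of covered requirements."""
    uncovered = set().union(*coverage.values())
    selected = []
    while uncovered:
        # Pick the test covering the most still-uncovered requirements;
        # iterating over sorted names breaks ties deterministically.
        best = max(sorted(coverage), key=lambda t: len(coverage[t] & uncovered))
        selected.append(best)
        uncovered -= coverage[best]
    return selected


# Hypothetical suite: t3 is redundant because r3 is already covered by t2.
coverage = {
    "t1": {"r1", "r2"},
    "t2": {"r2", "r3"},
    "t3": {"r3"},
    "t4": {"r4"},
}
reduced = minimize_suite(coverage)
```

The greedy choice does not guarantee a minimum-size suite, but every requirement covered by the original suite remains covered by the reduced one.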
A Novel Approach for Core Selection in Shared Tree Multicasting
Multicasting is preferred over multiple unicasts from the viewpoint of better utilization of network bandwidth. Multicasting can be done in two different ways: the source-based tree approach and the shared tree approach. Protocols such as Core Based Tree (CBT) and Protocol Independent Multicasting Sparse Mode (PIM-SM) use the shared tree approach. The shared tree approach is preferred over the source-based tree approach because the latter requires the construction of a minimum cost tree per source, whereas the former uses a single shared tree. The work presented in this paper provides an efficient core selection method for shared tree multicasting. In this work, we have used a new concept known as pseudo diameter for core selection. The presented method selects more than one core to achieve fault tolerance.
Merkle Tree Digital Signature and Trusted Computing Platform
Lack of efficiency in the initial key generation process is a serious shortcoming of the Merkle tree signature scheme with a large number of possible signatures. Based on two kinds of Merkle trees, a new tree type signature scheme is constructed, and it is provably existentially unforgeable under adaptive chosen message attack. By decentralizing the initial key generation process of the original scheme within the signature process, a large Merkle tree with 6.87×10^10 possible signatures can be initialized in 590 milliseconds. Storing some small Merkle trees on hard disk and in memory can speed up the Merkle tree signature scheme. Merkle tree signature schemes are suitable for trusted computing platforms in most scenarios.
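For readers unfamiliar with the underlying data structure, the following is a minimal sketch of a plain Merkle hash tree with authentication paths, the building block that Merkle signature schemes rely on. It does not reproduce the scheme constructed in the paper. The choice of SHA-256, the power-of-two leaf count, and the placeholder leaf values standing in for one-time public keys are all assumptions.

```python
import hashlib


def h(data: bytes) -> bytes:
    """Hash function used for every node (SHA-256 is an assumption)."""
    return hashlib.sha256(data).digest()


def build_levels(leaves):
    """Return all tree levels, leaf hashes first; len(leaves) must be a power of two."""
    level = [h(leaf) for leaf in leaves]
    levels = [level]
    while len(level) > 1:
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def auth_path(levels, index):
    """Sibling hashes from the leaf level up to just below the root."""
    path = []
    for level in levels[:-1]:
        path.append(level[index ^ 1])  # sibling differs in the lowest bit
        index //= 2
    return path


def verify(leaf, index, path, root):
    """Recompute the root from a leaf and its authentication path."""
    node = h(leaf)
    for sibling in path:
        node = h(node + sibling) if index % 2 == 0 else h(sibling + node)
        index //= 2
    return node == root


# Placeholder leaves standing in for one-time public keys.
leaves = [b"pk0", b"pk1", b"pk2", b"pk3"]
levels = build_levels(leaves)
root = levels[-1][0]
path = auth_path(levels, 2)
```

Only the root has to be published as the long-term public key; each signature then carries the leaf and its authentication path, whose length grows with the height of the tree.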
Trees are flourishing in Lhasa wherever the history exists. There is such a man. He has already been through customs after his annual trek to Lhasa, which he has been doing for over twenty years in succession to visit his tree. Although he has been making this journey for so long, it is neither to visit friends or family, nor is it his hometown. It is a tree that is tied so profoundly to his heart. When the wind blows fiercely on the bare tree and winter snow falls, he stands before the tree with tears of jo...
Topological techniques provide robust tools for data analysis. They are used, for example, for feature extraction, for data de-noising, and for comparison of data sets. This chapter concerns contour trees, a topological descriptor that records the connectivity of the isosurfaces of scalar functions. These trees are fundamental to analysis and visualization of physical phenomena modeled by real-valued measurements. We study the parallel analysis of contour trees. After describing a particular representation of a contour tree, called local-global representation, we illustrate how different problems that rely on contour trees can be solved in parallel with minimal communication.
Construction accident research involves the systematic sorting, classification, and encoding of comprehensive databases of injuries and fatalities. The present study explores the causes and distribution of occupational accidents in the Taiwan construction industry by analyzing such a database using the data mining method known as classification and regression tree (CART). Utilizing a database of 1542 accident cases during the period 2000-2009, the study seeks to establish potential cause-and-effect relationships regarding serious occupational accidents in the industry. The results of this study show that the occurrence rules for falls and collapses in both public and private project construction industries serve as key factors to predict the occurrence of occupational injuries. The results of the study provide a framework for improving the safety practices and training programs that are essential to protecting construction workers from occasional or unexpected accidents.
Asymptotic theory of nonparametric regression estimates with censored data
For regression analysis, some useful information may have been lost when the responses are right censored. To estimate nonparametric functions, several estimates based on censored data have been proposed and their consistency and convergence rates have been studied in the literature, but the optimal rates of global convergence have not been obtained yet. Because of the possible information loss, one may think that it is impossible for an estimate based on censored data to achieve the optimal rates of global convergence for nonparametric regression, which were established by Stone based on complete data. This paper constructs a regression spline estimate of a general nonparametric regression function based on right-censored response data, and proves, under some regularity conditions, that this estimate achieves the optimal rates of global convergence for nonparametric regression. Since the parameters for the nonparametric regression estimate have to be chosen based on a data driven criterion, we also obtai
Rollinson, Susan Wells
Steiner trees for fixed orientation metrics
We consider the problem of constructing Steiner minimum trees for a metric defined by a polygonal unit circle (corresponding to s = 2 weighted legal orientations in the plane). A linear-time algorithm to enumerate all angle configurations for degree three Steiner points is given. We provide...... a simple proof that the angle configuration for a Steiner point extends to all Steiner points in a full Steiner minimum tree, such that at most six orientations suffice for edges in a full Steiner minimum tree. We show that the concept of canonical forms originally introduced for the uniform orientation...... metric generalises to the fixed orientation metric. Finally, we give an O(s n) time algorithm to compute a Steiner minimum tree for a given full Steiner topology with n terminal leaves....
Decision tree approach for soil liquefaction assessment.
In the current study, the performances of some decision tree (DT) techniques are evaluated for postearthquake soil liquefaction assessment. A database containing 620 records of seismic parameters and soil properties is used in this study. Three decision tree techniques are used here in two different ways, considering statistical and engineering points of view, to develop decision rules. The DT results are compared to the logistic regression (LR) model. The results of this study indicate that the DTs not only successfully predict liquefaction but they can also outperform the LR model. The best DT models are interpreted and evaluated based on an engineering point of view.
Parametric Regression Models Using Reversed Hazard Rates
Proportional hazard regression models are widely used in survival analysis to understand and exploit the relationship between survival time and covariates. For left censored survival times, reversed hazard rate functions are more appropriate. In this paper, we develop a parametric proportional hazard rates model using an inverted Weibull distribution. The estimation and construction of confidence intervals for the parameters are discussed. We assess the performance of the proposed procedure based on a large number of Monte Carlo simulations. We illustrate the proposed method using a real case example.
Programming macro tree transducers
Predicting tree pollen season start dates using thermal conditions.
Thermal conditions at the beginning of the year determine the timing of pollen seasons of early flowering trees. The aims of this study were to quantify the relationship between the tree pollen season start dates and the thermal conditions just before the beginning of the season and to construct models predicting the start of the pollen season in a given year. The study was performed in Krakow (Southern Poland); the pollen data of Alnus, Corylus and Betula were obtained in 1991-2012 using a volumetric method. The relationship between the tree pollen season start, calculated by the cumulated pollen grain sum method, and 5-day running means of maximum (for Alnus and Corylus) and mean (for Betula) daily temperature was found and used in the logistic regression models. The estimated model parameters were statistically significant for all studied taxa; the odds ratio was higher in models for Betula than in those for Alnus and Corylus. The proposed model predicted correctly in 83.58% of cases for Alnus, in 84.29% of cases for Corylus and in 90.41% of cases for Betula. In the years of model verification (2011 and 2012), the season start of Alnus and Corylus was predicted more precisely in 2011, while in the case of Betula, the model predictions achieved 100% accuracy in both years. The correctness of prediction indicated that the data used to build the models fitted them well and stressed the high efficacy of model prediction estimated using the pollen data from 1991-2010.
Stress Wave Propagation in Larch Plantation Trees-Numerical Simulation
In this paper, we attempted to simulate stress wave propagation in virtual tree trunks and construct two dimensional (2D) wave-front maps in the longitudinal-radial section of the trunk. A tree trunk was modeled as an orthotropic cylinder in which wood properties along the fiber and in each of the two perpendicular directions were different. We used the COMSOL...
Efficiently extract recurring tree fragments from large treebanks
2014-10-01
Linear regression models are widely used in mental health and related health services research. However, the classic linear regression analysis assumes that the data are normally distributed, an assumption that is not met by the data obtained in many studies. One method of dealing with this problem is to use semi-parametric models, which do not require that the data be normally distributed. But semi-parametric models are quite sensitive to outlying observations, so the generated estimates are unreliable when study data includes outliers. In this situation, some researchers trim the extreme values prior to conducting the analysis, but the ad-hoc rules used for data trimming are based on subjective criteria so different methods of adjustment can yield different results. Rank regression provides a more objective approach to dealing with non-normal data that includes outliers. This paper uses simulated and real data to illustrate this useful regression approach for dealing with outliers and compares it to the results generated using classical regression models and semi-parametric regression models.
Fast Automatic Precision Tree Models from Terrestrial Laser Scanner Data
This paper presents a new method for quickly and automatically constructing precision tree models from point clouds of the trunk and branches obtained by terrestrial laser scanning. The input of the method is a point cloud of a single tree scanned from multiple positions. The surface of the visible parts of the tree is robustly reconstructed by making a flexible cylinder model of the tree. The thorough quantitative model also records the topological branching structure. In this paper, every major step of the whole model reconstruction process, from the input to the finished model, is presented in detail. The model is constructed by a local approach in which the point cloud is covered with small sets corresponding to connected surface patches in the tree surface. The neighbor-relations and geometrical properties of these cover sets are used to reconstruct the details of the tree and, step by step, the whole tree. The point cloud and the sets are segmented into branches, after which the branches are modeled as collections of cylinders. From the model, the branching structure and size properties, such as volume and branch size distributions, for the whole tree or some of its parts, can be approximated. The approach is validated using both measured and modeled terrestrial laser scanner data from real trees and detailed 3D models. The results show that the method allows an easy extraction of various tree attributes from terrestrial or mobile laser scanning point clouds.
Birth-death processes on trees
In this paper, we consider birth-death processes on a tree T and are interested in when such a process is regular, recurrent and ergodic (strongly, exponentially). By constructing two corresponding birth-death processes on Z+, we obtain computable conditions that are sufficient or necessary for these properties (in many cases, these two conditions coincide). With the help of these constructions, we give explicit upper and lower bounds for the Dirichlet eigenvalue λ0. Finally, some examples are investigated to justify our results.
Investigating Tree Thinking & Ancestry with Cladograms
Interpreting cladograms is a key skill for biological literacy. In this lesson, students interpret cladograms based on familial relationships and language relationships to build their understanding of tree thinking and to construct a definition of "common ancestor." These skills can then be applied to a true biological cladogram.
Pattern Avoidance in Ternary Trees
This paper considers the enumeration of ternary trees (i.e. rooted ordered trees in which each vertex has 0 or 3 children) avoiding a contiguous ternary tree pattern. We begin by finding recurrence relations for several simple tree patterns; then, for more complex trees, we compute generating functions by extending a known algorithm for pattern-avoiding binary trees. Next, we present an alternate one-dimensional notation for trees which we use to find bijections that explain why certain pairs of tree patterns yield the same avoidance generating function. Finally, we compare our bijections to known "replacement rules" for binary trees and generalize these bijections to a larger class of trees.
ORDINAL REGRESSION FOR INFORMATION RETRIEVAL
This letter presents a new discriminative model for Information Retrieval (IR), referred to as the Ordinal Regression Model (ORM). ORM is different from most existing models in that it views IR as an ordinal regression problem (i.e. a ranking problem) instead of binary classification. It is noted that the task of IR is to rank documents according to the user's information need, so IR can be viewed as an ordinal regression problem. Two parameter learning algorithms for ORM are presented. One is a perceptron-based algorithm. The other is the ranking Support Vector Machine (SVM). The effectiveness of the proposed approach has been evaluated on the task of ad hoc retrieval using three English Text REtrieval Conference (TREC) sets and two Chinese TREC sets. Results show that ORM significantly outperforms the state-of-the-art language model approaches and the OKAPI system on all test sets, and that it is more appropriate to view IR as ordinal regression rather than binary classification.
Multiple Regression and Its Discontents
Regression methods for medical research
Regression Methods for Medical Research provides medical researchers with the skills they need to critically read and interpret research using more advanced statistical methods. The statistical requirements of interpreting and publishing in medical journals, together with rapid changes in science and technology, increasingly demand an understanding of more complex and sophisticated analytic procedures. The text explains the application of statistical models to a wide variety of practical medical investigative studies and clinical trials. Regression methods are used to appropriately answer the
Forecasting with Dynamic Regression Models
One of the most widely used tools in statistical forecasting, the single equation regression model, is examined here. A companion to the author's earlier work, Forecasting with Univariate Box-Jenkins Models: Concepts and Cases, the present text pulls together recent time series ideas and gives special attention to possible intertemporal patterns, distributed lag responses of output to input series and the autocorrelation patterns of regression disturbances. It also includes six case studies.
Wrong Signs in Regression Coefficients
When using parametric cost estimation, it is important to note the possibility of the regression coefficients having the wrong sign. A wrong sign is defined as a sign on the regression coefficient opposite to the researcher's intuition and experience. Some possible causes for the wrong sign discussed in this paper are a small range of x's, leverage points, missing variables, multicollinearity, and computational error. Additionally, techniques for determining the cause of the wrong sign are given.
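One of the causes listed above, a missing variable, can be illustrated with a small worked example. The data below are invented purely for illustration and are not taken from the paper: the response is generated with a negative coefficient on x1, yet a simple regression that omits the correlated variable x2 yields a positive slope on x1.

```python
def ols_slope(x, y):
    """Slope of the simple least-squares regression of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


# Hypothetical cost drivers: x2 moves almost in lockstep with x1.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [1.1, 1.9, 3.1, 3.9, 5.1]

# Assumed "true" relationship: x1 has a negative effect, x2 a positive one.
y = [-1.0 * a + 2.0 * b for a, b in zip(x1, x2)]

# Regressing y on x1 alone absorbs the effect of the omitted x2,
# so the estimated slope is positive even though the true coefficient is -1.
slope_without_x2 = ols_slope(x1, y)
```

In this toy case the omitted-variable bias equals the coefficient on x2 times the slope of x2 on x1, which is large enough to flip the sign.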
From Rasch scores to regression
Rasch models provide a framework for measurement and modelling latent variables. Having measured a latent variable in a population a comparison of groups will often be of interest. For this purpose the use of observed raw scores will often be inadequate because these lack interval scale propertie....... This paper compares two approaches to group comparison: linear regression models using estimated person locations as outcome variables and latent regression models based on the distribution of the score....
Detecting Coppice Legacies from Tree Growth.
Wood density variation and tree ring distinctness in Gmelina arborea trees by x-ray densitometry
Due to its relationship with other properties, wood density is the main wood quality parameter. Modern, accurate methods, such as X-ray densitometry, are applied to determine the spatial distribution of density in wood sections and to evaluate wood quality. The objectives of this study were to determine the influence of growing conditions on wood density variation and tree ring demarcation of gmelina trees from fast growing plantations in Costa Rica. The wood density was determined by the X-ray densitometry method. Wood samples were cut from gmelina trees and were exposed to low X-rays. The radiographic films were developed and scanned using a 256 gray scale with 1000 dpi resolution, and the wood density was determined by CRAD and CERD software. The results showed that tree-ring boundaries were distinctly delimited in trees growing in sites with rainfall lower than 2510 mm/year. It was demonstrated that tree age, climatic conditions and plantation management affect wood density and its variability. The specific effect of these variables on wood density was quantified by the multiple regression method. It was determined that tree year explained 25.8% of the total variation of density and that 19.9% was caused by the climatic conditions where the tree was growing. Wood density was less affected by the intensity of forest management, which accounted for 5.9% of the total variation.
Bayesian Inference of a Multivariate Regression Model
We explore Bayesian inference of a multivariate linear regression model with use of a flexible prior for the covariance structure. The commonly adopted Bayesian setup involves the conjugate prior, multivariate normal distribution for the regression coefficients and inverse Wishart specification for the covariance matrix. Here we depart from this approach and propose a novel Bayesian estimator for the covariance. A multivariate normal prior for the unique elements of the matrix logarithm of the covariance matrix is considered. Such structure allows for a richer class of prior distributions for the covariance, with respect to strength of beliefs in prior location hyperparameters, as well as the added ability to model potential correlation amongst the covariance structure. The posterior moments of all relevant parameters of interest are calculated based upon numerical results via a Markov chain Monte Carlo procedure. The Metropolis-Hastings-within-Gibbs algorithm is invoked to account for the construction of a proposal density that closely matches the shape of the target posterior distribution. As an application of the proposed technique, we investigate a multiple regression based upon the 1980 High School and Beyond Survey.
Leukemia prediction using sparse logistic regression.
Hierarchical structure is ubiquitous in data across many domains. There are many hierarchical clustering methods, frequently used by domain experts, which strive to discover this structure. However, most of these methods limit discoverable hierarchies to those with binary branching structure. This limitation, while computationally convenient, is often undesirable. In this paper we explore a Bayesian hierarchical clustering algorithm that can produce trees with arbitrary branching structure at each node, known as rose trees. We interpret these trees as mixtures over partitions of a data set, and use a computationally efficient, greedy agglomerative algorithm to find the rose trees which have high marginal likelihood given the data. Lastly, we perform experiments which demonstrate that rose trees are better models of data than the typical binary trees returned by other hierarchical clustering algorithms.
Adaptive Regression and Classification Models with Applications in Insurance
Nowadays, in the insurance industry the use of predictive modeling by means of regression and classification techniques is becoming increasingly important and popular. The success of an insurance company largely depends on the ability to perform such tasks as credibility estimation, determination of insurance premiums, estimation of probability of claim, detecting insurance fraud, managing insurance risk. This paper discusses regression and classification modeling for such types of prediction problems using the method of Adaptive Basis Function Construction
Segmented Regression Based on B-Splines with Solved Examples
The subject of the paper is segmented linear, quadratic, and cubic regression based on B-spline basis functions. In this article we present the formulas for the computation of B-splines of order one, two, and three that are needed to construct linear, quadratic, and cubic regression. We list some interesting properties of these functions. For a clearer understanding we give the solutions of a couple of elementary exercises regarding these functions.
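B-spline basis functions of the kind discussed above are commonly evaluated with the Cox-de Boor recursion. The sketch below is a generic illustration rather than the paper's own formulas; it uses "degree" (1, 2, 3 for linear, quadratic, cubic) where the abstract says "order", and the uniform knot vector is an assumption.

```python
def bspline_basis(i, p, knots, x):
    """Value at x of the i-th B-spline basis function of degree p (Cox-de Boor)."""
    if p == 0:
        # Degree 0: indicator of the half-open knot interval.
        return 1.0 if knots[i] <= x < knots[i + 1] else 0.0
    left = right = 0.0
    d1 = knots[i + p] - knots[i]
    if d1 > 0:  # skip terms with zero-width support (repeated knots)
        left = (x - knots[i]) / d1 * bspline_basis(i, p - 1, knots, x)
    d2 = knots[i + p + 1] - knots[i + 1]
    if d2 > 0:
        right = (knots[i + p + 1] - x) / d2 * bspline_basis(i + 1, p - 1, knots, x)
    return left + right


# Hypothetical uniform knot vector; it supports four cubic basis functions.
knots = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]

hat_peak = bspline_basis(0, 1, knots, 1.0)        # linear "hat" function at its peak
cubic_centre = bspline_basis(0, 3, knots, 2.0)    # cubic basis at the centre of its support
unity = sum(bspline_basis(i, 3, knots, 3.5) for i in range(4))
```

A segmented regression would then use these basis values as columns of the design matrix and fit their coefficients by least squares. On the interval where all four cubic functions overlap, they sum to one, which is one of the properties such functions are known for.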
Galled trees, directed acyclic graphs that model evolutionary histories with isolated hybridization events, have become very popular due to both their biological significance and the existence of polynomial-time algorithms for their reconstruction. In this paper, we establish to which extent several distance measures for the comparison of evolutionary networks are metrics for galled trees, and hence, when they can be safely used to evaluate galled tree reconstruction methods.
A Matlab program for stepwise regression
Stepwise linear regression is a multivariable regression method for identifying statistically significant variables in a linear regression equation. In the present study, we present a Matlab program for stepwise regression.
We present a new overlay, called the Deterministic Decentralized tree (D2-tree). The D2-tree compares favorably to other overlays for the following reasons: (a) it provides matching and better complexities, which are deterministic for the supported operations; (b) the management of nodes (peers......-balancing scheme of elements into nodes is deterministic and general enough to be applied to other hierarchical tree-based overlays. This load-balancing mechanism is based on an innovative lazy weight-balancing mechanism, which is interesting in its own right....
The M-tree is a paged, dynamically balanced metric access method that responds gracefully to the insertion of new objects. To date, no algorithm has been published for the corresponding Delete operation. We believe this to be non-trivial because of the design of the M-tree's Insert algorithm. We propose a modification to Insert that overcomes this problem and give the corresponding Delete algorithm. The performance of the tree is comparable to the M-tree and offers additional benefits in terms of supported operations, which we briefly discuss.
We present the parallel buffer tree, a parallel external memory (PEM) data structure for batched search problems. This data structure is a non-trivial extension of Arge's sequential buffer tree to a private-cache multiprocessor environment and reduces the number of I/O operations by the number...... of available processor cores compared to its sequential counterpart, thereby taking full advantage of multicore parallelism. The parallel buffer tree is a search tree data structure that supports the batched parallel processing of a sequence of N insertions, deletions, membership queries, and range queries...
A Study on the Forms and Characteristics of Landscape Trees in Landscaping
The landscape tree is an important part of the landscape, with a strong visual effect. This paper elaborates on and discusses the concept of landscape trees as well as the forms and characteristics of landscaping, and summarizes the rules and features of landscape trees in plant landscaping, in order to provide a reference for landscape construction practice.
A Class of Graceful Trees
The present paper gives the coordinates of a tree and its vertices, defines a class of trees called Trees with Odd-Number Radiant Type (TONRT), establishes the gracefulness of TONRT by using the edge-moving theorem, and uses graceful TONRT to construct another class of graceful trees.
XRA image segmentation using regression
Segmentation is an important step in image analysis, and thresholding is one of the most important approaches to it. Segmentation poses several difficulties, such as automatic threshold selection, intensity distortion, and noise removal. We have developed an adaptive segmentation scheme by applying the Central Limit Theorem in regression. A Gaussian regression is used to separate the distribution of the background from that of the foreground in a single-peak histogram. The separation helps to determine the threshold automatically. A small 3 by 3 window is applied, and the mode of the local histogram is used to overcome noise. Thresholding is based on local weighting, where regression is used again for parameter estimation. A connectivity test is applied to the final results to remove impulse noise. We have applied the algorithm to x-ray angiogram images to extract brain arteries. The algorithm works well for single-peak distributions where there is no valley in the histogram. The regression provides a method to apply knowledge in clustering. Extending regression to multiple-level segmentation needs further investigation.
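The noise-suppression step mentioned in this abstract (taking the mode of the local histogram in a 3 by 3 window) can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the function names, the tie-breaking rule, the border handling, and the fixed threshold value are all hypothetical, and the Gaussian-regression step that the paper uses to choose the threshold is not reproduced.

```python
from collections import Counter


def local_mode_filter(image):
    """Replace each pixel by the most frequent value in its 3x3 neighborhood.

    `image` is a list of equal-length rows of integer intensities.
    Border pixels use the part of the window that lies inside the image.
    Ties are broken by taking the smallest value (an arbitrary assumption).
    """
    rows, cols = len(image), len(image[0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            window = [
                image[rr][cc]
                for rr in range(max(0, r - 1), min(rows, r + 2))
                for cc in range(max(0, c - 1), min(cols, c + 2))
            ]
            counts = Counter(window)
            best = max(counts.values())
            out[r][c] = min(v for v, n in counts.items() if n == best)
    return out


def apply_threshold(image, t):
    """Binary segmentation: 1 where the intensity exceeds t, else 0."""
    return [[1 if v > t else 0 for v in row] for row in image]


# Hypothetical 3x3 image: uniform background with one impulse-noise pixel.
noisy = [[10, 10, 10],
         [10, 200, 10],
         [10, 10, 10]]
smoothed = local_mode_filter(noisy)
mask = apply_threshold(smoothed, 100)  # threshold 100 chosen by hand here
```

In this toy case the isolated bright pixel is removed before thresholding, so the resulting mask contains no foreground.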
Biplots in Reduced-Rank Regression
Regression problems with a number of related response variables are typically analyzed by separate multiple regressions. This paper shows how these regressions can be visualized jointly in a biplot based on reduced-rank regression. Reduced-rank regression combines multiple regression and principal c
Interpretation of Standardized Regression Coefficients in Multiple Regression.
The extent to which standardized regression coefficients (beta values) can be used to determine the importance of a variable in an equation was explored. The beta value and the part correlation coefficient--also called the semi-partial correlation coefficient and reported in squared form as the incremental "r squared"--were compared for…
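For readers unfamiliar with the two quantities being compared, the sketch below computes both for the two-predictor case using the standard textbook formulas in terms of pairwise correlations. The data and function names are hypothetical and are not drawn from the study.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def beta_and_part(y, x1, x2):
    """Standardized coefficient and part (semi-partial) correlation of x1.

    Two-predictor formulas:
        beta_1 = (r_y1 - r_y2 * r_12) / (1 - r_12^2)
        part_1 = (r_y1 - r_y2 * r_12) / sqrt(1 - r_12^2)
    """
    r_y1, r_y2, r_12 = pearson(y, x1), pearson(y, x2), pearson(x1, x2)
    numerator = r_y1 - r_y2 * r_12
    beta1 = numerator / (1 - r_12 ** 2)
    part1 = numerator / sqrt(1 - r_12 ** 2)
    return beta1, part1, r_12


# Hypothetical data with correlated predictors.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0]
y = [3.0, 4.0, 8.0, 7.0, 12.0]
beta1, part1, r_12 = beta_and_part(y, x1, x2)
```

The two formulas differ only by a factor of sqrt(1 - r_12^2), so the beta value and the part correlation coincide when the predictors are uncorrelated and diverge as the predictors become more strongly correlated.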
Bypassing BDD Construction for Reliability Analysis
In this note, we propose a Boolean Expression Diagram (BED)-based algorithm to compute the minimal p-cuts of boolean reliability models such as fault trees. BEDs make it possible to bypass the Binary Decision Diagram (BDD) construction, which is the main cost of fault tree assessment....
Quantification of potential lignocellulosic biomass in fruit trees grown in Mediterranean regions
This research was based on three species: Citrus sinensis (orange), Olea europaea (olive), and Prunus amygdalus (almond). The biomass was determined for a complete tree without roots, but including stem, branches, and canopy or crown. The results show that the stem volume is slightly higher for almond trees (0.035 m³/tree) than for olive trees (0.027 m³/tree). In comparison, the average stem volume of orange trees is lower (0.006 m³/tree). On the other hand, the total biomass volume including canopy branches is similar in all three species: 0.043 m³/tree for orange trees, 0.066 m³/tree for olive trees, and 0.040 m³/tree for almond trees. The new practical quantification model for these Mediterranean agricultural crops is based on total biomass calculations normally used in forestry stands. The obtained values were therefore used to develop models for the biomass of the stem, branches, and canopy, relating them to the diameter and volume of the stem. The regression analysis shows a significant correlation with minimized estimation errors. This allows the model to be used in practice for biomass calculation in standing trees, both for total tree biomass and for pruning material.
Inferential Models for Linear Regression
Linear regression is arguably one of the most widely used statistical methods in applications. However, important problems, especially variable selection, remain a challenge for classical modes of inference. This paper develops a recently proposed framework of inferential models (IMs) in the linear regression context. In general, an IM is able to produce meaningful probabilistic summaries of the statistical evidence for and against assertions about the unknown parameter of interest and, moreover, these summaries are shown to be properly calibrated in a frequentist sense. Here we demonstrate, using simple examples, that the IM framework is promising for linear regression analysis, including model checking, variable selection, and prediction, and for uncertain inference in general.
[Is regression of atherosclerosis possible?].
Experimental studies have shown the regression of atherosclerosis in animals given a cholesterol-rich diet and then given a normal diet or hypolipidemic therapy. Despite favourable results of clinical trials of primary prevention modifying the lipid profile, the concept of atherosclerosis regression in man remains very controversial. The methodological approach is difficult: it is based on angiographic data and requires strict standardisation of angiographic views and reliable quantitative techniques of analysis, which are available with image processing. Several methodologically acceptable clinical coronary studies have shown not only stabilisation but also regression of atherosclerotic lesions with reductions of about 25% in total cholesterol levels and of about 40% in LDL cholesterol levels. These reductions were obtained either by drugs as in CLAS (Cholesterol Lowering Atherosclerosis Study), FATS (Familial Atherosclerosis Treatment Study) and SCOR (Specialized Center of Research Intervention Trial), by profound modifications in dietary habits as in the Lifestyle Heart Trial, or by surgery (ileo-caecal bypass) as in POSCH (Program On the Surgical Control of the Hyperlipidemias). On the other hand, trials with non-lipid lowering drugs such as the calcium antagonists (INTACT, MHIS) have not shown significant regression of existing atherosclerotic lesions but only a decrease in the number of new lesions. The clinical benefits of these regression studies are difficult to demonstrate given the limited period of observation, relatively small population numbers and the fact that in some cases the subjects were asymptomatic. The decrease in the number of cardiovascular events therefore seems relatively modest and concerns essentially subjects who were symptomatic initially. The clinical repercussion of studies of prevention involving a single lipid factor is probably partially due to the reduction in progression and anatomical regression of the atherosclerotic plaque.
The Linguistic Relevance of Quasi-Trees
We discuss two constructions (long scrambling and ECM verbs) which challenge most syntactic theories (including traditional TAG approaches) since they seem to require exceptional mechanisms and postulates. We argue that these constructions should in fact be analyzed in a similar manner, namely as involving a verb which selects for a "defective" complement. These complements are defective in that they lack certain Case-assigning abilities (represented as functional heads). The constructions differ in how many such abilities are lacking. Following the previous analysis of scrambling of Rambow (1994), we propose a TAG analysis based on quasi-trees.